
Separate size and capacity of LGDO ArrayLike objects #109

Open · iguinn wants to merge 16 commits into main

Conversation

@iguinn (Contributor) commented Oct 13, 2024

See: #107

Added an abstract base class for LGDOContainers, which are ArrayLike objects that can be placed inside tables. LGDOContainers are expected to implement methods for manipulating their capacity and size. Capacity controls the number of rows allocated in memory, while size controls the length of the container; capacity may be larger than size, leaving "padding" at the end. The advantage of this approach is that it enables changes in size without reallocating memory (always when shrinking, and most of the time when growing). The cost is the extra memory used by the padding, although this cost is modest: even without padding, a resize must allocate a new array and copy the data over, producing a comparable spike in memory usage. Methods and options have been added for "trimming" the capacity to match the size of the array, in case the user prefers not to keep this padding.
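A rough sketch of what such a base class might look like (the method names reserve_capacity, get_capacity, and trim_capacity come from this PR; everything else here is illustrative, not the real lgdo code):

```python
from abc import ABC, abstractmethod

class LGDOContainer(ABC):
    """Sketch of an ArrayLike container with decoupled size and capacity."""

    @abstractmethod
    def reserve_capacity(self, capacity: int) -> None:
        """Ensure the internal buffer can hold at least `capacity` rows."""

    @abstractmethod
    def get_capacity(self) -> int:
        """Number of rows allocated in memory; may exceed len(self)."""

    @abstractmethod
    def trim_capacity(self) -> None:
        """Shrink the buffer so that capacity matches the current size."""

    @abstractmethod
    def resize(self, new_size: int) -> None:
        """Change the logical length, reallocating only when necessary."""
```

Concrete containers (Array, VectorOfVectors, Table, ...) would then implement these methods on top of their own storage.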

As a result of this change, the return values of core.read and store.read have changed in some cases. Previously, when reading an LGDO object in place, they returned the number of rows read. Now, they resize the array to obj_buf_start + n_rows_read. The reason for the old behavior was that we did not want to shrink arrays when a file contained too few rows to fill a Table in place, since the reallocation could cause problems. Because shrinking no longer requires reallocation, this is no longer an issue. A major benefit is that the read functions are simpler to use, since they have consistent return values. Note that this change will break code in other LEGEND packages.
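A toy illustration of the new return contract, using a plain Python list in place of a real LGDO buffer (the names obj_buf and obj_buf_start follow the text above; the real lgdo signatures may differ):

```python
def toy_read(rows_in_file, obj_buf, obj_buf_start=0):
    # The buffer is resized to obj_buf_start + n_rows_read, so callers can
    # use len(obj_buf) directly instead of a separately returned row count.
    n_rows_read = len(rows_in_file)
    del obj_buf[obj_buf_start + n_rows_read:]   # shrink if the file was short
    obj_buf[obj_buf_start:] = rows_in_file      # fill the valid region
    return obj_buf                              # one consistent return value

buf = [0] * 10                  # pre-allocated buffer, larger than needed
out = toy_read([1, 2, 3], buf)
assert len(out) == 3            # len(obj) now plays the role of n_rows_read
```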

When resizing, if the capacity is smaller than the new size, the behavior is to reserve capacity up to the next power of 2 larger than the new size. When reading a huge number of files, this reduces the number of reallocations from O(n) to O(log n); it is not clear to me how important this is in practice, but someone could run tests reading a large number of files to check.
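The growth policy amounts to rounding the required size up to a power of 2; whether "larger" is strict or at-least is an implementation detail, and this helper (illustrative, not the actual lgdo code) uses at-least:

```python
def next_pow2_capacity(required: int) -> int:
    # Smallest power of 2 that is >= required.
    return 1 if required <= 1 else 1 << (required - 1).bit_length()

# Growing row by row triggers a reallocation only at powers of 2,
# i.e. O(log n) reallocations for n appends:
assert [next_pow2_capacity(n) for n in range(1, 10)] == [1, 2, 4, 4, 8, 8, 8, 8, 16]
```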
Detailed changes:

  • Array, VectorOfVectors, Table, ArrayOfEqualSizeArrays, and the encoded Array classes all implement the new LGDOContainer base class. In particular they have reserve_capacity, get_capacity, and trim_capacity defined. In some cases, this involved implementing other methods like append and insert.
  • Array and AoESA both default to setting the capacity to the next power of 2 larger than what a resize requires, if the current capacity is not big enough. This behavior propagates to all the other container classes.
  • nda is now a property that returns a slice (of length len(obj)) of the internal data buffer (whose length equals the capacity). This preserves the old behavior as far as the user is concerned.
  • dtype was turned into a property in Array, AoESA, and VoV; n_dims was likewise turned into a property for VoV.
  • core.read resizes the returned array to size obj_buf_start + n_rows_read. Note that this resize could probably be done in the _serializer.read functions in the future (which still return n_rows_read)
  • store.read has been simplified, and now just calls core.read after gimme_files. This reduces redundant code
  • lh5.iterator no longer returns n_rows_read
  • Changed all tests to no longer expect n_rows_read. Most of them discarded it into _; those that actually asserted on n_rows_read now assert on len(obj) instead.
  • Table no longer tracks loc, a member variable that was previously used to let the size differ from the capacity; since that is now a core feature, loc is unnecessary. It does not appear to play any role in our other packages.
  • Table had a few methods changed to LGDOContainer methods for consistency
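The nda-as-property change above can be sketched roughly as follows (a toy model, not the real lgdo implementation; attribute names and resize semantics are assumed):

```python
import numpy as np

class SketchArray:
    """Toy Array with a capacity-sized buffer and an nda property view."""

    def __init__(self, nda):
        self._nda = np.asarray(nda)       # full buffer; len == capacity
        self._size = len(self._nda)

    def __len__(self):
        return self._size

    def resize(self, new_size):
        if new_size > len(self._nda):     # grow: reallocate the buffer
            pad = np.zeros(new_size - len(self._nda), dtype=self._nda.dtype)
            self._nda = np.concatenate([self._nda, pad])
        self._size = new_size             # shrink: no reallocation at all

    @property
    def nda(self):
        return self._nda[: self._size]    # a view of length len(self)

a = SketchArray([1, 2, 3, 4])
a.resize(2)                    # capacity stays 4; only the size changes
assert a.nda.tolist() == [1, 2]
a.nda[:] = [9, 9]              # writes through the view reach the buffer
assert a._nda[0] == 9
```

Because basic NumPy slicing returns a view rather than a copy, user code that reads or writes obj.nda behaves exactly as it did before the change.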

codecov bot commented Oct 13, 2024

Codecov Report

Attention: Patch coverage is 92.18750% with 15 lines in your changes missing coverage. Please review.

Project coverage is 77.12%. Comparing base (da6747d) to head (0a0bffb).
Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
src/lgdo/types/encoded.py 57.89% 8 Missing ⚠️
src/lgdo/types/table.py 89.65% 3 Missing ⚠️
src/lgdo/types/array.py 96.00% 2 Missing ⚠️
src/lgdo/lh5/core.py 92.85% 1 Missing ⚠️
src/lgdo/types/vectorofvectors.py 97.29% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #109      +/-   ##
==========================================
+ Coverage   76.70%   77.12%   +0.42%     
==========================================
  Files          46       46              
  Lines        3134     3196      +62     
==========================================
+ Hits         2404     2465      +61     
- Misses        730      731       +1     


@gipert gipert added enhancement New feature or request performance Code performance types LGDO types labels Oct 14, 2024
@iguinn (Contributor, Author) commented Oct 19, 2024

I had a chance to test the performance change on Perlmutter. I tested by reading data from 856 files, totaling 5.3 GB read into memory. I did a single timed read followed by a %timeit read, to separate out the first read, which is dominated by file caching.

Here are my findings:
with changes: 282 s first read, 77 s subsequent reads
without changes: 96 s first read (files were still cached, so don't read too much into this), 90 s subsequent reads

In terms of memory usage, the total memory used while reading was quite similar (during copies, additional memory must be allocated); when all was said and done, the new version used ~8 GB while the old used 5.3 GB. This could be remedied by trimming the arrays.

So in the end there is a performance improvement, but it is small relative to the time spent reading files. This is especially true on a first read, where we would probably see only a ~5% improvement in speed (and in most cases that is all that matters). Note that the total time spent allocating/copying memory scales as O(n) (amortized) with these changes versus O(n^2) before, so if we continued reading in more data, the discrepancy could grow until the time spent copying becomes comparable to the time spent reading.
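A quick back-of-envelope check of that scaling, counting the elements copied while appending n single-row chunks under exact-fit reallocation versus power-of-2 doubling (a toy simulation, not the real lgdo code):

```python
def copied_elements(n_chunks, doubling=False):
    # Count old elements copied into new buffers while appending n_chunks
    # rows one at a time, reallocating either to the exact new size or to
    # the next power of 2 at or above it.
    size, cap, copied = 0, 0, 0
    for _ in range(n_chunks):
        need = size + 1
        if need > cap:
            copied += size                       # old rows moved to new buffer
            cap = need if not doubling else 1 << (need - 1).bit_length()
        size = need
    return copied

assert copied_elements(1024) == sum(range(1024))        # quadratic total
assert copied_elements(1024, doubling=True) < 2 * 1024  # amortized linear
```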
