
Separate size and capacity of LGDO ArrayLike objects #109

Open · iguinn wants to merge 16 commits into main

Conversation

@iguinn (Contributor) commented Oct 13, 2024

See: #107

Added an abstract base class for LGDOContainers, which are ArrayLike objects that can be placed inside tables. LGDOContainers are expected to implement methods for manipulating their capacity and size. Capacity controls the number of rows allocated in memory, while size controls the length of the container; capacity may be larger than size, leaving "padding" at the end. The advantage of this approach is that it enables changes in size without reallocating memory (always when shrinking, and most of the time when growing). The cost is the extra memory used by the padding, although this cost is modest: even without padding, a resize must allocate a new array and copy the data over, producing a comparable spike in memory usage. Methods and options have been added for "trimming" the capacity to match the size of the array, in case the user prefers not to keep this padding.
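A rough sketch of what such a base class might look like (the method names reserve_capacity, get_capacity, and trim_capacity come from this PR; everything else here is illustrative, not the real lgdo code):

```python
from abc import ABC, abstractmethod

class LGDOContainer(ABC):
    """Sketch of an ArrayLike container with decoupled size and capacity."""

    @abstractmethod
    def reserve_capacity(self, capacity: int) -> None:
        """Ensure the internal buffer can hold at least `capacity` rows."""

    @abstractmethod
    def get_capacity(self) -> int:
        """Number of rows allocated in memory; may exceed len(self)."""

    @abstractmethod
    def trim_capacity(self) -> None:
        """Shrink the buffer so that capacity matches the current size."""

    @abstractmethod
    def resize(self, new_size: int) -> None:
        """Change the logical length, reallocating only when necessary."""
```

Concrete containers (Array, VectorOfVectors, Table, ...) would then implement these methods on top of their own storage.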

As a result of this change, the return values of core.read and store.read have changed in some cases. Previously, when reading an LGDO object in place, they returned the number of rows read. Now, they resize the array to obj_buf_start + n_rows_read. The reason for the old behavior was that we did not want to shrink arrays when a file contained too few rows to fill a Table in place, since the reallocation could cause problems. Because shrinking no longer requires reallocation, this is no longer an issue. A major benefit is that the read functions are simpler to use, since they have consistent return values. Note that this change will break code in other LEGEND packages.
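A toy illustration of the new return contract, using a plain Python list in place of a real LGDO buffer (the names obj_buf and obj_buf_start follow the text above; the real lgdo signatures may differ):

```python
def toy_read(rows_in_file, obj_buf, obj_buf_start=0):
    # The buffer is resized to obj_buf_start + n_rows_read, so callers can
    # use len(obj_buf) directly instead of a separately returned row count.
    n_rows_read = len(rows_in_file)
    del obj_buf[obj_buf_start + n_rows_read:]   # shrink if the file was short
    obj_buf[obj_buf_start:] = rows_in_file      # fill the valid region
    return obj_buf                              # one consistent return value

buf = [0] * 10                  # pre-allocated buffer, larger than needed
out = toy_read([1, 2, 3], buf)
assert len(out) == 3            # len(obj) now plays the role of n_rows_read
```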

When resizing, if the capacity is smaller than the new size, the behavior is to reserve capacity up to the next power of 2 larger than the new size. When reading a huge number of files, this reduces the number of reallocations from O(n) to O(log n); it is not clear to me how important this is in practice, but someone could run tests reading a large number of files to check.
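The growth policy amounts to rounding the required size up to a power of 2; whether "larger" is strict or at-least is an implementation detail, and this helper (illustrative, not the actual lgdo code) uses at-least:

```python
def next_pow2_capacity(required: int) -> int:
    # Smallest power of 2 that is >= required.
    return 1 if required <= 1 else 1 << (required - 1).bit_length()

# Growing row by row triggers a reallocation only at powers of 2,
# i.e. O(log n) reallocations for n appends:
assert [next_pow2_capacity(n) for n in range(1, 10)] == [1, 2, 4, 4, 8, 8, 8, 8, 16]
```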
Detailed changes:

  • Array, VectorOfVectors, Table, ArrayOfEqualSizeArrays, and the encoded Array classes all implement the new LGDOContainer base class. In particular they have reserve_capacity, get_capacity, and trim_capacity defined. In some cases, this involved implementing other methods like append and insert.
  • Array and AoESA both default to setting the capacity to the next power of 2 larger than what a resize requires, if the current capacity is not big enough. This behavior propagates to all the other container classes.
  • nda is now a property that returns a slice (of length len(obj)) of the internal data buffer (whose length equals the capacity). This preserves the old behavior as far as the user is concerned.
  • dtype was turned into a property in Array, AoESA, and VoV; n_dims was likewise turned into a property for VoV.
  • core.read resizes the returned array to size obj_buf_start + n_rows_read. Note that this resize could probably be done in the _serializer.read functions in the future (which still return n_rows_read)
  • store.read has been simplified, and now just calls core.read after gimme_files. This reduces redundant code
  • lh5.iterator no longer returns n_rows_read
  • Changed all tests to no longer expect n_rows_read. Most of them discarded it into _; those that actually asserted on n_rows_read now assert on len(obj) instead.
  • Table no longer tracks loc, a member variable that was previously used to let the size differ from the capacity; since that is now a core feature, loc is unnecessary. It does not appear to play any role in our other packages.
  • Table had a few methods changed to LGDOContainer methods for consistency
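The nda-as-property change above can be sketched roughly as follows (a toy model, not the real lgdo implementation; attribute names and resize semantics are assumed):

```python
import numpy as np

class SketchArray:
    """Toy Array with a capacity-sized buffer and an nda property view."""

    def __init__(self, nda):
        self._nda = np.asarray(nda)       # full buffer; len == capacity
        self._size = len(self._nda)

    def __len__(self):
        return self._size

    def resize(self, new_size):
        if new_size > len(self._nda):     # grow: reallocate the buffer
            pad = np.zeros(new_size - len(self._nda), dtype=self._nda.dtype)
            self._nda = np.concatenate([self._nda, pad])
        self._size = new_size             # shrink: no reallocation at all

    @property
    def nda(self):
        return self._nda[: self._size]    # a view of length len(self)

a = SketchArray([1, 2, 3, 4])
a.resize(2)                    # capacity stays 4; only the size changes
assert a.nda.tolist() == [1, 2]
a.nda[:] = [9, 9]              # writes through the view reach the buffer
assert a._nda[0] == 9
```

Because basic NumPy slicing returns a view rather than a copy, user code that reads or writes obj.nda behaves exactly as it did before the change.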

codecov bot commented Oct 13, 2024

Codecov Report

Attention: Patch coverage is 92.18750% with 15 lines in your changes missing coverage. Please review.

Project coverage is 77.12%. Comparing base (da6747d) to head (0a0bffb).
Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
src/lgdo/types/encoded.py 57.89% 8 Missing ⚠️
src/lgdo/types/table.py 89.65% 3 Missing ⚠️
src/lgdo/types/array.py 96.00% 2 Missing ⚠️
src/lgdo/lh5/core.py 92.85% 1 Missing ⚠️
src/lgdo/types/vectorofvectors.py 97.29% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #109      +/-   ##
==========================================
+ Coverage   76.70%   77.12%   +0.42%     
==========================================
  Files          46       46              
  Lines        3134     3196      +62     
==========================================
+ Hits         2404     2465      +61     
- Misses        730      731       +1     


@gipert gipert added enhancement New feature or request performance Code performance types LGDO types labels Oct 14, 2024
@iguinn (Contributor, Author) commented Oct 19, 2024

I had a chance to test the performance change on Perlmutter. I tested by reading data from 856 files, totaling 5.3 GB read into memory. I did a single timed read followed by a %timeit read, to separate out the first read, which is dominated by file caching.

Here are my findings:
with changes: 282 s first read, 77 s subsequent reads
without changes: 96 s first read (files were still cached, so don't read too much into this), 90 s subsequent reads

In terms of memory usage, the total memory used while reading was quite similar (during copies, additional memory must be allocated); when all was said and done, the new version used ~8 GB while the old used 5.3 GB. This could be remedied by trimming the arrays.

So in the end there is a performance improvement, but it is small relative to the time spent reading files. This is especially true on a first read, where we would probably see only a ~5% improvement in speed (and in most cases that is all that matters). Note that the total time spent allocating/copying memory scales as O(n) (amortized) with these changes versus O(n^2) before, so if we continued reading in more data, the discrepancy could grow until the time spent copying becomes comparable to the time spent reading.
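A quick back-of-envelope check of that scaling, counting the elements copied while appending n single-row chunks under exact-fit reallocation versus power-of-2 doubling (a toy simulation, not the real lgdo code):

```python
def copied_elements(n_chunks, doubling=False):
    # Count old elements copied into new buffers while appending n_chunks
    # rows one at a time, reallocating either to the exact new size or to
    # the next power of 2 at or above it.
    size, cap, copied = 0, 0, 0
    for _ in range(n_chunks):
        need = size + 1
        if need > cap:
            copied += size                       # old rows moved to new buffer
            cap = need if not doubling else 1 << (need - 1).bit_length()
        size = need
    return copied

assert copied_elements(1024) == sum(range(1024))        # quadratic total
assert copied_elements(1024, doubling=True) < 2 * 1024  # amortized linear
```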
