WIP: Dynamic rechunking option for StoreToZarr
#546
Conversation
for more information, see https://pre-commit.ci
Thanks, Julius!
Oh, my bad for the confusion. My name buffer is super small and always overflows at conferences 😂
@jbusecke This looks great! It is exactly what I was thinking of as the next version of the simple algorithm we originally developed. Thanks for taking this to the next level from what we developed during the sprint.
Nice work, @jbusecke, and everyone else named here who contributed.
That seems reasonable to insert right above `pangeo_forge_recipes/transforms.py`, lines 396 to 398 (at f0c7dac), as an alternate path for deriving the target chunks.
At the risk of stating the obvious, it would be great to reference whatever established parsing strategies / conventions exist in other packages for this, rather than inventing our own convention.
```python
def dynamic_target_chunks_from_schema(
    schema: XarraySchema,
    target_chunk_nbytes: int,  # TODO: Accept a str like `100MB`
    target_chunk_ratio: Dict[str, int],
    nbytes_tolerance: float = 0.2,
) -> dict[str, int]:
```
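For intuition about the parameters, here is a hedged, illustrative sketch (not the PR's implementation), which reads `target_chunk_ratio` as the relative number of chunks along each dimension and solves for a common scale factor so that one chunk lands near `target_chunk_nbytes`:

```python
import math

# Illustrative sketch only -- NOT the PR's implementation. It reads
# `target_chunk_ratio` as the relative *number of chunks* along each
# dimension and solves for a common scale factor so that a single
# chunk is close to `target_chunk_nbytes`.
def sketch_chunks(dim_sizes, itemsize, target_chunk_nbytes, target_chunk_ratio):
    dims = list(dim_sizes)
    ratios = [float(target_chunk_ratio[d]) for d in dims]
    total_nbytes = itemsize * math.prod(dim_sizes[d] for d in dims)
    # counts_d = ratio_d * x  =>  nbytes per chunk = total / (x**n * prod(ratios))
    x = (total_nbytes / (math.prod(ratios) * target_chunk_nbytes)) ** (1 / len(dims))
    return {
        d: max(1, min(dim_sizes[d], round(dim_sizes[d] / (r * x))))
        for d, r in zip(dims, ratios)
    }
```

For a `(time=1000, x=360, y=180)` float64 array with a 1 MB target and ratio `{"time": 10, "x": 1, "y": 1}`, this yields roughly ten times as many chunks along `time` as along `x` or `y`, with each chunk close to the byte target.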
I attempted to review this PR but realized I was missing some key context. Could we provide a docstring for this function which explains what it does and what these parameters are? In particular, I don't understand `target_chunk_ratio`.
FWIW, I was also confused by that term. So +1 on additional context. 🙏
Would it be slightly clearer to call it `target_chunk_aspect_ratio`?
I added a docstring explaining the inputs.
tests/test_dynamic_target_chunks.py (outdated)

```python
from pangeo_forge_recipes.dynamic_target_chunks import dynamic_target_chunks_from_schema


class TestDynamicTargetChunks:
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is potentially a good use case for hypothesis: you could parameterize a unit test with a hypothesis strategy that generates arbitrary (regular) chunks, then assert the property that the returned target chunk size is within the specified tolerance.
Yeah that is a good idea. I will have to dig into hypothesis a bit to understand how to encode the logic.
This PR to dask might give you what you need (I can also finish it if that would help)
A function for parsing sizes expressed as strings with SI units into integers exists in both cubed and dask, and I think dask also has a strategy that will try to generate chunks of a given size.
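For illustration, a minimal stdlib sketch of that kind of size-string parsing; in practice one would reuse `dask.utils.parse_bytes` rather than reimplementing it:

```python
import re

# Minimal sketch of SI/binary byte-string parsing, in the spirit of
# dask.utils.parse_bytes (reuse that in practice; this is illustrative).
_SI = {"": 1, "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}

def parse_nbytes(s: str) -> int:
    m = re.fullmatch(r"\s*([\d.]+)\s*([kMGT]?)(i?)B?\s*", s)
    if m is None:
        raise ValueError(f"could not parse {s!r}")
    value, prefix, binary = m.groups()
    if binary and prefix:
        base = 1024 ** ("kMGT".index(prefix) + 1)  # KiB, MiB, GiB, TiB
    else:
        base = _SI[prefix]
    return int(float(value) * base)
```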
Hey everyone, thanks for the great comments. Sorry for the patchwork PR (working in between flights); I should have marked this as a draft in the meantime. To clarify the purpose (and naming) of How about something like
I just implemented
I think this conveys the concept nicely! And thanks for working on this -- we intend to use this for the HyTEST project as well!
I favor In either case, I like the consistency with the Thanks for all the work on this, Julius!
I think this is preferable, because the aspect ratio is between plural chunks. (As opposed to
@jbusecke and @cisaacstern, where do we stand with this?
I have just added some logic to implement a fallback algorithm. This one is a lot more naive: it basically just determines the biggest chunk possible by dividing the length of each dimension by the corresponding value of We still need to think about a way to check this functionality with an end-to-end test. @cisaacstern, do you have any suggestions for this in general?
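The fallback described above could look roughly like this (a hedged sketch; the names are illustrative, not the PR's exact code):

```python
import math

# Hedged sketch of the naive fallback: the biggest possible chunk along
# each dimension is its length divided by the corresponding ratio value.
def fallback_chunks(dim_sizes, ratio):
    return {d: max(1, math.ceil(size / ratio[d])) for d, size in dim_sizes.items()}
```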
@jbusecke can you remind me what about this is non-unit-testable? (Or rather, what the unit tests don't/can't capture?)
Basically all the tests I have currently written are for the logic in
@rsignell-usgs, this PR is usable (I am running CMIP6 rechunking jobs with it right now), but I believe it needs some more testing and docs before it can get merged.
Per conversation with @jbusecke, we believe the best path to releasing this work is:
This design allows users to bring their own dynamic chunking algorithm, and also keeps the rather large amount of code in this implementation out of the core. Julius notes that those wanting to use this work before the above action items are complete can install
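A hedged sketch of what "bring your own dynamic chunking algorithm" could look like from the user's side; the keyword name `dynamic_chunking_fn` is illustrative, not a confirmed API:

```python
# Hedged sketch: the user supplies a callable mapping a dataset schema
# to a chunks dict, and StoreToZarr (hypothetically, via a keyword such
# as `dynamic_chunking_fn`) calls it to derive the target chunks.
def my_dynamic_chunking_fn(schema):
    # e.g. chunk `time` in blocks of 10, keep the spatial dims whole
    sizes = schema["dims"]
    return {"time": min(10, sizes["time"]), "x": sizes["x"], "y": sizes["y"]}

# StoreToZarr(..., dynamic_chunking_fn=my_dynamic_chunking_fn)  # hypothetical wiring
```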
Just wanted to confirm that this does not depend on my dynamic injections work :) |
Closing this in favor of #595. I have refactored the logic implemented here to dynamic-chunks and am working on an implementation of the CMIP workflow here |
This PR came out of the Saturday Sprint at Scipy (together with @amsnyder, @thodson-usgs, @alaws-USGS, @kjdoore).
The proposed mechanism here is a generalization of my prototype CMIP6 pangeo-forge feedstock. Over there I implemented dynamic rechunking along a single dimension according to a desired chunk size in memory (EDIT: I now realize that I could probably have achieved this with dask auto-chunking for this specific case).
The generalization was prompted by @rsignell-usgs. Rich wanted an option to specify the size of chunks and then have a fixed ratio of total chunks between different dimensions.
I think I have put together a solution that should work well in this case. The outline of the algorithm: choose chunk sizes so that each chunk stays within the tolerance of the requested size in bytes, while the number of chunks along each dimension follows the requested chunking ratio as closely as possible.
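A hedged sketch of that outline (illustrative only, not the PR's code): enumerate even divisors of each dimension as candidate chunk lengths, keep the combinations within the byte tolerance, then pick the one whose chunk counts best match the requested ratio.

```python
import itertools

def even_divisors(n):
    # chunk lengths that divide the dimension evenly
    return [d for d in range(1, n + 1) if n % d == 0]

# Illustrative sketch only -- NOT the PR's code.
def pick_chunks(dim_sizes, itemsize, target_nbytes, chunk_count_ratio, tol=0.2):
    dims = list(dim_sizes)
    candidates = []
    for combo in itertools.product(*(even_divisors(dim_sizes[d]) for d in dims)):
        nbytes = itemsize
        for c in combo:
            nbytes *= c
        if abs(nbytes - target_nbytes) / target_nbytes <= tol:
            candidates.append(combo)

    def ratio_error(combo):
        # compare the resulting chunk counts to the scaled target ratio
        counts = [dim_sizes[d] // c for d, c in zip(dims, combo)]
        scale = sum(counts) / sum(chunk_count_ratio[d] for d in dims)
        return sum(abs(n - chunk_count_ratio[d] * scale) for d, n in zip(dims, counts))

    best = min(candidates, key=ratio_error)  # raises if no candidate fits
    return dict(zip(dims, best))
```

Brute-force enumeration like this only scales to a handful of dimensions, which is one reason a cheaper fallback path is also useful.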
I put some usage examples together here.
TODO:
I propose to integrate this functionality into StoreToZarr. The user could do something like specifying a target chunk size of 100MiB, etc.

cc @cisaacstern @rabernat