WIP: Dynamic rechunking option for StoreToZarr
#546
Conversation
for more information, see https://pre-commit.ci
Thanks, Julius!
Oh, my bad for the confusion. My name buffer is super small and always overflows at conferences 😂
@jbusecke This looks great! It is exactly what I was thinking of as the next version of the simple algorithm we originally developed. Thanks for taking this to the next level from what we developed during the sprint.
Nice work, @jbusecke, and everyone else named here who contributed.
That seems reasonable to insert right above `pangeo_forge_recipes/transforms.py`, lines 396 to 398 (at f0c7dac), as an alternate path for deriving the target chunks.
At the risk of stating the obvious, it would be great to reference whatever established parsing strategies / conventions exist in other packages for this, rather than inventing our own convention.
```python
def dynamic_target_chunks_from_schema(
    schema: XarraySchema,
    target_chunk_nbytes: int,  # TODO: Accept a str like `100MB`
    target_chunk_ratio: Dict[str, int],
    nbytes_tolerance: float = 0.2,
) -> dict[str, int]:
```
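For intuition about the parameters, here is a hedged, illustrative sketch (not the PR's implementation), which reads `target_chunk_ratio` as the relative number of chunks along each dimension and solves for a common scale factor so that one chunk lands near `target_chunk_nbytes`:

```python
import math

# Illustrative sketch only -- NOT the PR's implementation. It reads
# `target_chunk_ratio` as the relative *number of chunks* along each
# dimension and solves for a common scale factor so that a single
# chunk is close to `target_chunk_nbytes`.
def sketch_chunks(dim_sizes, itemsize, target_chunk_nbytes, target_chunk_ratio):
    dims = list(dim_sizes)
    ratios = [float(target_chunk_ratio[d]) for d in dims]
    total_nbytes = itemsize * math.prod(dim_sizes[d] for d in dims)
    # counts_d = ratio_d * x  =>  nbytes per chunk = total / (x**n * prod(ratios))
    x = (total_nbytes / (math.prod(ratios) * target_chunk_nbytes)) ** (1 / len(dims))
    return {
        d: max(1, min(dim_sizes[d], round(dim_sizes[d] / (r * x))))
        for d, r in zip(dims, ratios)
    }
```

For a `(time=1000, x=360, y=180)` float64 array with a 1 MB target and ratio `{"time": 10, "x": 1, "y": 1}`, this yields roughly ten times as many chunks along `time` as along `x` or `y`, with each chunk close to the byte target.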
I attempted to review this PR but realized I was missing some key context. Could we provide a docstring for this function which explains what it does and what these parameters are? In particular, I don't understand `target_chunk_ratio`.
FWIW, I was also confused by that term. So +1 on additional context. 🙏
Would it be slightly clearer to call it `target_chunk_aspect_ratio`?
I added a docstring explaining the inputs.
tests/test_dynamic_target_chunks.py (outdated)

```python
from pangeo_forge_recipes.dynamic_target_chunks import dynamic_target_chunks_from_schema


class TestDynamicTargetChunks:
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is potentially a good use case for hypothesis: you could parameterize a unit test with a hypothesis strategy that generates arbitrary (regular) chunks, then assert the property that the returned target chunk size is within the specified tolerance.
Yeah that is a good idea. I will have to dig into hypothesis a bit to understand how to encode the logic.
This PR to dask might give you what you need (I can also finish it if that would help)
A function for parsing sizes expressed as strings with SI units into integers exists in both cubed and dask, and I think dask also has a strategy that will try to generate chunks of a given size.
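For illustration, a minimal stdlib sketch of that kind of size-string parsing; in practice one would reuse `dask.utils.parse_bytes` rather than reimplementing it:

```python
import re

# Minimal sketch of SI/binary byte-string parsing, in the spirit of
# dask.utils.parse_bytes (reuse that in practice; this is illustrative).
_SI = {"": 1, "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}

def parse_nbytes(s: str) -> int:
    m = re.fullmatch(r"\s*([\d.]+)\s*([kMGT]?)(i?)B?\s*", s)
    if m is None:
        raise ValueError(f"could not parse {s!r}")
    value, prefix, binary = m.groups()
    if binary and prefix:
        base = 1024 ** ("kMGT".index(prefix) + 1)  # KiB, MiB, GiB, TiB
    else:
        base = _SI[prefix]
    return int(float(value) * base)
```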
Hey everyone, thanks for the great comments. Sorry for the patchwork PR (working in between flights); I should have marked this as a draft in the meantime. To clarify the purpose (and naming) of How about something like
I just implemented
I think this conveys the concept nicely! And thanks for working on this -- we intend to use this for the HyTEST project as well!
I favor In either case, I like the consistency with the Thanks for all the work on this, Julius!
I think this is preferable, because the aspect ratio is between plural chunks. (As opposed to
@jbusecke and @cisaacstern, where do we stand with this?
I have just added some logic to implement a fallback algorithm. This one is a lot more naive: it basically just determines the biggest chunk possible by dividing the length of each dimension by the corresponding value of We still need to think about a way to check this functionality with an end-to-end test. @cisaacstern, do you have any suggestions for this in general?
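The fallback described above could look roughly like this (a hedged sketch; the names are illustrative, not the PR's exact code):

```python
import math

# Hedged sketch of the naive fallback: the biggest possible chunk along
# each dimension is its length divided by the corresponding ratio value.
def fallback_chunks(dim_sizes, ratio):
    return {d: max(1, math.ceil(size / ratio[d])) for d, size in dim_sizes.items()}
```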
@jbusecke can you remind me what about this is non-unit-testable? (Or rather, what the unit tests don't/can't capture?)
Basically all the tests I have currently written are for the logic in
@rsignell-usgs, this PR is usable (I am running CMIP6 rechunking jobs with it right now), but I believe it needs some more testing and docs before it can get merged.
Per conversation with @jbusecke, we believe the best path to releasing this work is:
This design allows users to bring their own dynamic chunking algorithm, and also keeps the rather large amount of code in this implementation out of the core. Julius notes that those wanting to use this work before the above action items are complete can install
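A hedged sketch of what "bring your own dynamic chunking algorithm" could look like from the user's side; the keyword name `dynamic_chunking_fn` is illustrative, not a confirmed API:

```python
# Hedged sketch: the user supplies a callable mapping a dataset schema
# to a chunks dict, and StoreToZarr (hypothetically, via a keyword such
# as `dynamic_chunking_fn`) calls it to derive the target chunks.
def my_dynamic_chunking_fn(schema):
    # e.g. chunk `time` in blocks of 10, keep the spatial dims whole
    sizes = schema["dims"]
    return {"time": min(10, sizes["time"]), "x": sizes["x"], "y": sizes["y"]}

# StoreToZarr(..., dynamic_chunking_fn=my_dynamic_chunking_fn)  # hypothetical wiring
```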
Just wanted to confirm that this does not depend on my dynamic injections work :) |
Closing this in favor of #595. I have refactored the logic implemented here to dynamic-chunks and am working on an implementation of the CMIP workflow here |
This PR came out of the Saturday Sprint at Scipy (together with @amsnyder, @thodson-usgs, @alaws-USGS, @kjdoore).
The proposed mechanism here is a generalization of my prototype CMIP6 pangeo-forge feedstock. Over there I implemented dynamic rechunking along a single dimension according to a desired chunk size in memory (EDIT: I now realize that I could probably have achieved this with dask auto-chunking for this specific case).
The generalization was prompted by @rsignell-usgs. Rich wanted an option to specify the size of chunks and then have a fixed ratio of total chunks between different dimensions.
I think I have put together a solution that should work well in this case. The outline of the algorithm: choose chunk sizes so that each chunk stays within the tolerance of the requested size in bytes, while the number of chunks along each dimension follows the requested chunking ratio as closely as possible.
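A hedged sketch of that outline (illustrative only, not the PR's code): enumerate even divisors of each dimension as candidate chunk lengths, keep the combinations within the byte tolerance, then pick the one whose chunk counts best match the requested ratio.

```python
import itertools

def even_divisors(n):
    # chunk lengths that divide the dimension evenly
    return [d for d in range(1, n + 1) if n % d == 0]

# Illustrative sketch only -- NOT the PR's code.
def pick_chunks(dim_sizes, itemsize, target_nbytes, chunk_count_ratio, tol=0.2):
    dims = list(dim_sizes)
    candidates = []
    for combo in itertools.product(*(even_divisors(dim_sizes[d]) for d in dims)):
        nbytes = itemsize
        for c in combo:
            nbytes *= c
        if abs(nbytes - target_nbytes) / target_nbytes <= tol:
            candidates.append(combo)

    def ratio_error(combo):
        # compare the resulting chunk counts to the scaled target ratio
        counts = [dim_sizes[d] // c for d, c in zip(dims, combo)]
        scale = sum(counts) / sum(chunk_count_ratio[d] for d in dims)
        return sum(abs(n - chunk_count_ratio[d] * scale) for d, n in zip(dims, counts))

    best = min(candidates, key=ratio_error)  # raises if no candidate fits
    return dict(zip(dims, best))
```

Brute-force enumeration like this only scales to a handful of dimensions, which is one reason a cheaper fallback path is also useful.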
I put some usage examples together here.
TODO:
I propose to integrate this functionality into StoreToZarr. The user could do something like specifying a target chunk size of 100MiB, etc.

cc @cisaacstern @rabernat