[WIP] Add ConsolidateDimensionCoordinates + ConsolidateZarrMetadata transforms #556
Conversation
Thanks for getting this going, @norlandrhagen! Looks like a great start. Please lmk if I can help.
I might be off here, but would it be possible that this is related to the issues I am having over at leap-stc/cmip6-leap-feedstock#4 (comment)? I am currently digging into what is causing those failures and will report in a separate issue. Just wanted to flag this.
Thank you for looking into this @norlandrhagen !
@jbusecke what issues are you referring to? And are you still thinking this may be related? |
Co-authored-by: Charles Stern <[email protected]>
pangeo_forge_recipes/transforms.py (outdated)
@@ -400,4 +421,7 @@ def expand(self, datasets: beam.PCollection) -> beam.PCollection:
            target=self.get_full_target(), target_chunks=self.target_chunks
        )
        rechunked_datasets | StoreDatasetFragments(target_store=target_store)
        if self.consolidate_coords:
            ConsolidateDimensionCoordinates(target_store=target_store)
Suggested change:
-    ConsolidateDimensionCoordinates(target_store=target_store)
+    target_store | ConsolidateDimensionCoordinates()
@norlandrhagen hmm, maybe as a PTransform, ConsolidateDimensionCoordinates needs to take a PCollection as input (via |)? If so, we can probably do it like this.
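For reference, a minimal sketch of the pattern being suggested here (hypothetical class body, not the PR's actual implementation): a PTransform receives whatever is on the left of the | as a PCollection, and its expand() must return a new PCollection.

import apache_beam as beam


class ConsolidateDimensionCoordinates(beam.PTransform):
    """Sketch only: receives the (singleton) target store as a PCollection."""

    def expand(self, pcoll: beam.PCollection) -> beam.PCollection:
        # `target_store | ConsolidateDimensionCoordinates()` delivers the
        # upstream PCollection here as `pcoll`.
        return pcoll | beam.Map(self._consolidate)

    @staticmethod
    def _consolidate(target_store):
        # Placeholder: rewrite each dimension coordinate as a single chunk here.
        return target_store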
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah good catch! Yeah that makes more sense.
Maybe I was wrong. I'll investigate my issues with the dynamic chunking next week and report back there.
Update from the work @cisaacstern and I did this morning.
I'm sure there is a lot I missed, so feel free to add, @cisaacstern.
If I had to guess, the slow loading speed noted in leap-stc/ClimSim#38 (comment) could be partially alleviated by this PR. This is a case where we probably wouldn't want to re-run the entire pipeline, so we might run a one-off job applying the new consolidation transform(s).

Edit: On further reflection, I'm actually not 100% sure if initial xarray dataset loading time should be expected to speed up with consolidated coordinate dims, or if that is purely addressable with #568. 🤔
Also @norlandrhagen I will try to fix the target_store sampling issue we ran into together in a separate PR, as described in #564, as this is now also a blocker for other work.
@norlandrhagen merge of #574 unblocks us here. I think it will be easiest if we finish this PR after #575 goes in. In anticipation of that being soon, could you clean up the Files Changed here? It looks like for some reason a bunch of unrelated files have ended up in the diff. Thanks so much! 🙏
Great to hear! I'll get this PR cleaned up.
Updates:
FAILED tests/test_end_to_end.py::test_xarray_zarr_consolidate_coords[netcdf3_local_paths_sequential_1d-True] - ValueError: Failed to decode variable 'time': destination buffer too small; expected at least 40, got 4
FAILED tests/test_end_to_end.py::test_xarray_zarr_consolidate_coords[netcdf_local_paths_sequential_2d-True] - ValueError: Failed to decode variable 'time': cannot reshape array of size 10 into shape (2,)
FAILED tests/test_end_to_end.py::test_xarray_zarr_consolidate_coords[netcdf_local_paths_sequential_1d-True] - ValueError: Failed to decode variable 'time': destination buffer too small; expected at least 80, got 8
@norlandrhagen thanks for all your work on this - I tried this branch out as I'm trying to benchmark different chunk configurations of NEX GDDP CMIP6 data on AWS and got some errors. Could you take a look and tell me if it's user error? https://nbviewer.org/gist/abarciauskas-bgse/e3b3b8b7dd2d8abb6be1e36a5c3b7678
Thanks for following the progress here, @abarciauskas-bgse. In its current form, I would not expect this PR to work. Raphael and I are planning to devote some further effort to this this week; we'll ping you here when it looks functional!
Update from today after working with @cisaacstern
To Do:
singleton_target_store = (
    n_target_stores | SampleSingleton()
    if not self.consolidate_coords
    else n_target_stores | SampleSingleton() | ConsolidateDimensionCoordinates()
)
singleton_target_store = (
    n_target_stores | SampleSingleton()
    if not self.consolidate_metadata
    else n_target_stores | SampleSingleton() | ConsolidateZarrMetadata()
)
@abarciauskas-bgse we made some more progress, but not quite done yet!
Co-authored-by: Charles Stern <[email protected]>
store = zarr.open(os.path.join(tmp_target_url, "subpath"))

# fails
assert netcdf_local_file_pattern_sequential.dims["time"] == store.time.chunks[0]
@norlandrhagen what is the rationale behind this assert? It is not intuitive to me why these two values:
- would be == to one another; or
- how this would be evidence of coordinate consolidation.

Additionally, in your view, is the fact that this is currently failing evidence that the current implementation is wrong? Or that this assert is not checking the right thing?
@cisaacstern I think it's that the assert isn't checking the right thing.
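For what it's worth, a hypothetical assertion that would test consolidation more directly, assuming "consolidating dimension coordinates" means each dimension coordinate is rewritten as a single chunk (store and tmp_target_url as in the test snippet above):

import os

import zarr

store = zarr.open(os.path.join(tmp_target_url, "subpath"))
# After consolidation, the entire "time" coordinate should live in one chunk,
# so its chunk length should equal its total length.
assert store["time"].chunks[0] == store["time"].shape[0]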
@norlandrhagen: I'm thinking about giving this branch a run. Is there anything I can help with on this ticket to get it merged besides testing it?
@ranchodeluxe please do! It needs testing.
consolidate_coords: bool = True
consolidate_metadata: bool = True
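For context, flag-based usage would look roughly like this, assuming these flags land on StoreToZarr as the earlier diff suggests (the other kwargs and names here are illustrative only):

import apache_beam as beam
from pangeo_forge_recipes.transforms import OpenURLWithFSSpec, OpenWithXarray, StoreToZarr

transforms = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec()
    | OpenWithXarray(file_type=pattern.file_type)
    | StoreToZarr(
        store_name="output.zarr",
        combine_dims=pattern.combine_dim_keys,
        consolidate_coords=True,      # rewrite dimension coordinates as single chunks
        consolidate_metadata=True,    # write consolidated zarr metadata
    )
)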
Personally, I'd prefer composition over configuration here. Is there a chance we can just expose the ConsolidateMetadata transform from the module?
Interesting @sbquinlan. Is there an example that could illustrate the advantage of exposing these stages? Just looking at this, it seems good to me, since in almost all my cases I would want those two steps to happen by default, but I might be missing something here.
@jbusecke My point is basically that I prefer this:
recipe = (
    beam.Create(pattern.items())
    | OpenWithKerchunk(
        remote_protocol='s3' if selected_rel == S3_REL else 'https',
        file_type=pattern.file_type,
        storage_options=pattern.fsspec_open_kwargs,
    )
    | CombineReferences(
        concat_dims=pattern.concat_dims,
        identical_dims=IDENTICAL_DIMS,
        mzz_kwargs={'coo_map': {'time': get_time_from_attr}},
    )
    | ConsolidateMetadata(storage_options=pattern.fsspec_open_kwargs)
    | WriteReferences(
        store_name=SHORT_NAME,
        target_root=hacky_way_to_pull.target_root,
        storage_options=pattern.fsspec_open_kwargs,
    )
    | ValidateDatasetDimensions(expected_dims={'time': None, 'lat': (-50, 50), 'lon': (-180, 180)})
)
To:
recipe = (
    beam.Create(pattern.items())
    | CreateZarrFromPattern(
        combine_references=True,
        consolidate_metadata=True,
        consolidate_coords=True,
    )
)
Hopefully that makes sense. @norlandrhagen already landed changes to make the former possible, so that was very much appreciated and unblocks me. So I don't really have much of a strong opinion here anymore. If we want to provide "one-size-fits-all" transforms with StoreToZarr/WriteCombinedReferences that are configurable with boolean flags for people less familiar with Beam or Zarr internals, that's fine, as long as power users still have access to the individual transforms to customize as they see fit. Like if you need to tweak a step for CMIP6.
Again, no blocking feedback from me on this PR. Just need to figure out the tests and we're good to go. Might be worth breaking up metadata and coords just to cut down on PR complexity.
I think putting it that way makes a lot of sense @sbquinlan.
Hey folks. Thank you very much for working on this. Is there any way I can help here? I think this feature is quite essential for my CMIP6 data ingestion, so just wanted to offer my help if needed.
Hey @jbusecke. This PR has been my pangeo-forge albatross. Any work on it would be appreciated! I just updated it with main and am getting some test failures, so it probably needs a bit of updating. It's also lacking any useful tests to check whether the consolidation actually works.
Would be happy to pair on this early next week or after Jan 6 if you like.
After Jan 6th works for me.
Picking this back up after crawling through my gh inbox (yay winter break). Since I am coming closer to releasing CMIP6 data on the public bucket, this is getting more important. I'll reach out on Slack, maybe?
Replaced by this PR: #671
ConsolidatedDimensionCoords Transform -- cleaned up version of PR #556
This PR is intended to kickstart issue #554. It lays down some of the groundwork outlined in @cisaacstern's recommendation in that issue.
cc @abarciauskas-bgse @cisaacstern
Currently fails when running end_to_end tests.
pytest tests/test_end_to_end.py -k 'test_xarray_zarr[netcdf_local_file_pattern_sequential-target_chunks0-netcdf_local_paths_sequential_1d]'
Traceback: