ConsolidatedDimensionCoords Transform -- cleaned up version of PR #556 #671
Conversation
Unit and end-to-end tests added! Big thanks to @jbusecke for help here. Both of us went back to the Zarr101 class to figure out how to create a dataset without Xarray.
Big thanks back! This was really fun, and I am excited to see this moving.
Makes sense to me (but I don't know much 😆). Like the tests. Left a couple of non-blocking comments.
def _gather_coordinate_dimensions(group: zarr.Group) -> List[str]:
    return list(
        set(itertools.chain(*(group[var].attrs.get("_ARRAY_DIMENSIONS", []) for var in group)))
    )
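For context, a minimal sketch of the data this function walks over (store and variable names are hypothetical, assuming the zarr v2 API): every array written by xarray carries an _ARRAY_DIMENSIONS attribute, and the function simply unions those lists across all variables in the group.

import itertools
from typing import List

import zarr

# build a tiny group without xarray, as in the tests mentioned above
group = zarr.group(store=zarr.MemoryStore())
tos = group.create_dataset("tos", shape=(12, 180, 360), chunks=(1, 180, 360), dtype="f4")
tos.attrs["_ARRAY_DIMENSIONS"] = ["time", "lat", "lon"]
time = group.create_dataset("time", shape=(12,), chunks=(1,), dtype="i8")
time.attrs["_ARRAY_DIMENSIONS"] = ["time"]

def _gather_coordinate_dimensions(group: zarr.Group) -> List[str]:
    # union of all _ARRAY_DIMENSIONS lists across the group's variables
    return list(
        set(itertools.chain(*(group[var].attrs.get("_ARRAY_DIMENSIONS", []) for var in group)))
    )

print(sorted(_gather_coordinate_dimensions(group)))  # ['lat', 'lon', 'time']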
I imagine since this is a zarr.Group, this iteration isn't very expensive because it's on summary data and never really that large?
Ehh, I think you're right? Since it's just scanning across vars in groups and looking at the .attrs.
I don't fully understand this function but the fact that it's the same as we had in the pre-Beam release checks out to me!
def _gather_coordinate_dimensions(group: zarr.Group) -> List[str]:
Thanks @norlandrhagen. I'll try this in the wild with CMIP6 in a few.
You can probably delete the TODO comments here too.
Seems like we need some more test cases with time arrays encoded as datetime and cftime?
Thanks for giving it a test drive @jbusecke! Yeah, seems like it. Maybe we can add another test in our Zarr102 class next.
Would be happy to pair on this, maybe tomorrow (Wednesday)? In case that works for you.
That would be great! I'm a bit busy with ESIP today through Thursday, but I'm free all of Friday or next week.
Looks like we are waiting on resolution of the error @jbusecke hit here, correct?
@cisaacstern yup! @jbusecke and I are going to work on this tomorrow.
Ok, with the current state of this PR I am getting successful writes of CMIP6 stores like this one:

import gcsfs

fs = gcsfs.GCSFileSystem()
fs.ls('gs://cmip6/CMIP6/CMIP/CNRM-CERFACS/CNRM-CM6-1/historical/r29i1p1f2/Omon/tos/gr1/v20200529/time')

The listing shows that there is only one time chunk.
@cisaacstern this PR seems to be working now!
Thanks so much @norlandrhagen and @jbusecke for your work on this!
I think this is very close. As I went through this, I ended up feeling rather strongly that @sbquinlan's earlier suggestion to favor modularity/composability over boolean-flag configuration is much more in the design spirit of Beam, and potentially a lower maintenance burden for us as developers as well. All of the "big" changes I suggest here relate to that.
Curious to know your thoughts.
pangeo_forge_recipes/rechunking.py
# Open question: Should this always be called
# or should we have the option of ConsolidateMetadata
zarr.consolidate_metadata(singleton_target_store)
Suggested change: remove the zarr.consolidate_metadata(singleton_target_store) call.
IMO this is too opinionated, and it should be up to the user if they want to consolidate coordinates and metadata, or just coordinates.
I'm using as a guide here the fact that in the pre-Beam releases, these options were not coupled either, see:
pangeo-forge-recipes/pangeo_forge_recipes/recipes/xarray_zarr.py, lines 665 to 697 in 3b3c13c:
if config.consolidate_dimension_coordinates:
    logger.info("Consolidating dimension coordinate arrays")
    target_mapper = config.storage_config.target.get_mapper()
    group = zarr.open(target_mapper, mode="a")
    # https://github.com/pangeo-forge/pangeo-forge-recipes/issues/214
    # filter out the dims from the array metadata not in the Zarr group
    # to handle coordinateless dimensions.
    dims = (dim for dim in _gather_coordinate_dimensions(group) if dim in group)
    for dim in dims:
        arr = group[dim]
        attrs = dict(arr.attrs)
        data = arr[:]
        # This will generally use bulk-delete API calls
        config.storage_config.target.rm(dim, recursive=True)
        new = group.array(
            dim,
            data,
            chunks=arr.shape,
            dtype=arr.dtype,
            compressor=arr.compressor,
            fill_value=arr.fill_value,
            order=arr.order,
            filters=arr.filters,
            overwrite=True,
        )
        new.attrs.update(attrs)

if config.consolidate_zarr:
    logger.info("Consolidating Zarr metadata")
    target_mapper = config.storage_config.target.get_mapper()
    zarr.consolidate_metadata(target_mapper)
This was the fix @jbusecke and I came up with yesterday to get around the errors reported in the cmip6 pipeline. I think when the coordinate dims are consolidated in this PR, the metadata has to be updated afterwards. @jbusecke might have a better way to explain this though!
Before the consolidated_metadata option was introduced in this PR, all of the StoreToZarr recipes were consolidating metadata by default. From the docs: "The default (consolidated=None) means write consolidated metadata and attempt to read consolidated metadata for existing stores (falling back to non-consolidated)."
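A hedged sketch of that failure mode with an in-memory store (shapes and names are illustrative, assuming the zarr v2 API): once .zmetadata is written, rewriting a coordinate as a single chunk leaves the consolidated metadata stale, so consolidation has to be re-run afterwards.

import zarr

store = zarr.MemoryStore()
group = zarr.open_group(store, mode="a")
group.array("time", list(range(12)), chunks=(1,))
zarr.consolidate_metadata(store)  # .zmetadata now records chunks=(1,)

# rewrite the coordinate as one chunk, as ConsolidateDimensionCoordinates does
data = group["time"][:]
group.array("time", data, chunks=data.shape, overwrite=True)

# without this second call, readers that trust .zmetadata still see chunks=(1,)
zarr.consolidate_metadata(store)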
Yep, that makes total sense @norlandrhagen!
Apologies the review here is a little disjointed. Down in #671 (comment) I mention my proposed fix for this, which is that we should also remove the consolidate_metadata flag from StoreToZarr, and always default to consolidated=False in StoreToZarr.
Then it is up to the user if they want to do:

StoreToZarr() | ConsolidateCoordinateDimensions()

or

StoreToZarr() | ConsolidateCoordinateDimensions() | ConsolidateMetadata()

(☝️ seems like this is what @jbusecke would want to do)

or

StoreToZarr() | ConsolidateMetadata()

But in order to make these options possible, we need to start from an unconsolidated store emitted by StoreToZarr(), because as you observe here, if StoreToZarr emits a consolidated store, and coordinate dims are subsequently consolidated without then re-consolidating metadata, we end up in a broken state. Always emitting an unconsolidated store from StoreToZarr() should resolve this.
Does that make sense?
No worries!
It does fix it! However, then we are defaulting to non-consolidated coords, which seems like the wrong approach for most recipes. I think this is fine if everyone knows that they probably should add the ConsolidateDimensionCoordinates transform into their recipe, but I could see it commonly slipping through the cracks. Maybe this is a question of what kind of guard rails / default assumptions we want to have 🤷.
I share the concern about this requirement for explicitly having to chain consolidation stages (in a specific order) leading to errors on the user side.
Maybe we can consider renaming StoreToZarr -> StoreToZarrUnconsolidated (or something else) and subclassing that as the new StoreToZarr (this would change the default behavior). Alternatively, make a new class StoreToZarrConsolidated that subclasses the current StoreToZarr; this would not change the default behavior, but probably would need a thorough rewrite of the docs to make sure users are using consolidated storage as a default?
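A rough sketch of that subclassing idea (naming is hypothetical, not code from this PR; it assumes the ConsolidateDimensionCoordinates and ConsolidateMetadata transforms discussed in this thread are importable): the consolidated variant is just a composite PTransform that chains the consolidation steps onto the base transform's output.

import apache_beam as beam

from pangeo_forge_recipes.transforms import (
    ConsolidateDimensionCoordinates,
    ConsolidateMetadata,
    StoreToZarr,
)

class StoreToZarrConsolidated(StoreToZarr):
    # hypothetical "batteries included" variant: store, then consolidate
    def expand(self, pcoll: beam.PCollection) -> beam.PCollection:
        return (
            super().expand(pcoll)
            | ConsolidateDimensionCoordinates()
            | ConsolidateMetadata()
        )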
@@ -430,6 +430,11 @@ def expand(self, pcoll: beam.PCollection) -> beam.PCollection:
         return new_fragments


 class ConsolidateDimensionCoordinates(beam.PTransform):
Can we add a docstring here?
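Something along these lines could work (a suggested sketch only, not what the PR ultimately merged):

import apache_beam as beam

class ConsolidateDimensionCoordinates(beam.PTransform):
    """Rewrite the dimension coordinates of a Zarr store as single-chunk arrays.

    Dimension coordinates are small but read on every dataset open; storing
    each one as a single chunk avoids many small read requests when xarray
    opens the store. Expects the singleton target store emitted by StoreToZarr.
    """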
:param consolidate_dimension_coordinates: Whether to rewrite coordinate variables as a
    single chunk. We recommend consolidating coordinate variables to avoid
    many small read requests to get the coordinates in xarray. Defaults to ``True``.
Point of clarification for my own understanding: do we know for sure that this makes a meaningful difference for kerchunk datasets?
Good question! I don't actually know. Maybe @abarciauskas-bgse has some insight into this?
I don't know if there is a significant difference if we consolidate dimension coordinates in kerchunk datasets. I experienced slow performance when working with zarr stores and saw a relationship between performance (specifically for creating image tiles using rio_tiler's XarrayReader) and the number of coordinate chunks. The notebook which demonstrates this is unfortunately not a brief one (https://github.com/developmentseed/tile-benchmarking/blob/9758c732597c154d1cf7cd796b21c858c3130046/profiling/profile.ipynb), but I could streamline it if we need to do more investigation.
For kerchunk, I would guess there would be a performance impact depending on whether a coordinate's chunks are all read at once or in multiple reads. If it's multiple reads, it still might impact performance. Do we know which it is? Or should I investigate/run some more tests?
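One quick way to check (a sketch mirroring the gcsfs listing earlier in this thread; the store path is illustrative): count the chunk keys under the coordinate's prefix, since that is the number of GET requests a cold open issues for that coordinate.

import gcsfs

fs = gcsfs.GCSFileSystem()
keys = fs.ls("gs://some-bucket/store.zarr/time")  # illustrative path
# .zarray/.zattrs are metadata; everything else is a chunk
chunk_keys = [k for k in keys if not k.rsplit("/", 1)[-1].startswith(".")]
print(len(chunk_keys))  # 1 => the coordinate is a single chunk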
pyproject.toml
@@ -42,7 +42,7 @@ dependencies = [
 [project.optional-dependencies]
 test = [
     "click",
-    "pytest",
+    "pytest<8.0.0",
Looks like this needs to be rebased to main, since didn't this fix already go in?
pangeo_forge_recipes/transforms.py
# TODO: optionally use `singleton_target_store` to
# consolidate metadata and/or coordinate dims here
if self.consolidate_dimension_coordinates:
    singleton_target_store = singleton_target_store | ConsolidateDimensionCoordinates()
Suggested change: remove this TODO and the consolidate_dimension_coordinates branch entirely.
I went back and forth on this, but IMHO the right path here is to not do any consolidation by default, and leave it to users to compose modularly, i.e.:
recipe = (
    ...
    | StoreToZarr()
    | ConsolidateCoordinateDimensions()
    | ConsolidateMetadata()
)
This is more verbose for users, but it is also more explicit about what they can expect, it adheres to the composable ethos of Beam, and it's less API creep for us to maintain in StoreToZarr.
Happy to discuss dissenting views here, but as of now this feels like the right path to me.
Would love to chat more about this!
I also like the composability, but wonder about the trade-off for transformations that should be done on almost every recipe, yet might get overlooked by a newer recipe developer.
If you have a chat, I would love to attend if it fits into my schedule.
Thanks for weighing in! I am not 100% sure which design is best here, and appreciate the lively discussion so we can land on the best solution. (Here and in the comment thread above.)

Re: "done on almost every recipe, but might get overlooked by a newer recipe developer"

This is a small point, but one that just came to mind: in my experience, for smaller zarr stores, consolidation doesn't really matter / have a performance impact. Therefore, in some sense, do we care if newer recipe developers (which presumably implies smaller-scale as well?) overlook this?
We absolutely should have a design discussion on this, possibly before the next coordination meeting (to expedite the process). I will try to first outline what I see as the various options in a comment on this thread to solicit async feedback.
Awesome! Happy to chat before. I won't be able to make it to the next coordination meeting, so this week or next before Friday would be ideal for me.
pangeo_forge_recipes/transforms.py
consolidate_dimension_coordinates: bool = False
consolidated_metadata: Optional[bool] = True
Suggested change: remove both of these flags.
See comment below re: boolean flags vs. composition. I feel like @sbquinlan's design suggestion is the right way to go here: modularity/composability over branching/boolean-flag logic.
Note if we go this route we want to also make the following change in the body of StoreToZarr below (but GitHub won't let me make a comment on these lines because the PR does not already change them):
  target_store = schema | PrepareZarrTarget(
      target=self.get_full_target(),
      target_chunks=target_chunks,
      attrs=self.attrs,
-     consolidated_metadata=self.consolidated_metadata,
+     # In StoreToZarr, always default to unconsolidated. This leaves it up to the
+     # user whether or not they want to consolidate with ConsolidateMetadata(). Also,
+     # it prevents a broken/inconsistent state that could arise from metadata being
+     # consolidated here, and then falling out of sync with coordinates if
+     # ConsolidateDimensionCoordinates() is applied to the output of StoreToZarr().
+     consolidated_metadata=False,
      encoding=self.encoding,
  )
tests/test_end_to_end.py
(
    p
    | beam.Create(pattern.items())
    | OpenWithXarray(file_type=pattern.file_type)
    | StoreToZarr(
        target_root=tmp_target,
        store_name="subpath",
        combine_dims=pattern.combine_dim_keys,
        consolidate_dimension_coordinates=consolidate_dimension_coordinates,
    )
)
Suggested change:

unconsolidated_zarr = (
    p
    | beam.Create(pattern.items())
    | OpenWithXarray(file_type=pattern.file_type)
    | StoreToZarr(
        target_root=tmp_target,
        store_name="subpath",
        combine_dims=pattern.combine_dim_keys,
    )
)
if consolidate_dimension_coordinates:
    unconsolidated_zarr | ConsolidateCoordinateDimensions()
Assuming we go with the composable/modular design rather than the boolean-flag design, then this test would need to be updated to look something like the above.
I'm not sure if ConsolidateMetadata() would also need to be piped on the end here to get the asserts below to pass.
tests/test_transforms.py
consolidate_dimension_coordinates=False,
consolidated_metadata=True,
Suggested change: remove these two arguments.
Assuming we remove these flags as proposed above, they would need to be removed from this test as well.
Going to hold off doing any more work on this PR until we all chat about the design.
Tests are passing! @cisaacstern & @jbusecke
Update on all the design discussion above: @cisaacstern, @jbusecke, and I met and decided to go the route of composability instead of adding extra kwargs. This adds a bit of work onto the recipe writer, but gives them more flexibility and should reduce the maintenance burden on the already very complex StoreToZarr.
Reviving from a clean slate https://github.com/norlandrhagen/pangeo-forge-recipes/pull/new/consolidate_dimension_coords. This PR only intends to work on this issue: #556.