ConsolidatedDimensionCoords Transform -- cleaned up version of PR #556 #671
```diff
@@ -21,7 +21,7 @@
 from .combiners import CombineMultiZarrToZarr, CombineXarraySchemas
 from .openers import open_url, open_with_kerchunk, open_with_xarray
 from .patterns import CombineOp, Dimension, FileType, Index, augment_index_with_start_stop
-from .rechunking import combine_fragments, split_fragment
+from .rechunking import combine_fragments, consolidate_dimension_coordinates, split_fragment
 from .storage import CacheFSSpecTarget, FSSpecTarget
 from .writers import ZarrWriterMixin, store_dataset_fragment, write_combined_reference
```
```diff
@@ -412,6 +412,11 @@ def expand(self, pcoll: beam.PCollection) -> beam.PCollection:
         return new_fragments


+class ConsolidateDimensionCoordinates(beam.PTransform):
+    def expand(self, pcoll: beam.PCollection[zarr.storage.FSStore]) -> beam.PCollection:
+        return pcoll | beam.Map(consolidate_dimension_coordinates)
+
+
 @dataclass
 class CombineReferences(beam.PTransform):
     """Combines Kerchunk references into a single reference dataset.
```

**Review comment** (on `ConsolidateDimensionCoordinates`): Can we add a docstring here? *(norlandrhagen marked this conversation as resolved.)*
```diff
@@ -572,6 +577,9 @@ class StoreToZarr(beam.PTransform, ZarrWriterMixin):
         `store_name` will be appended to this prefix to create a full path.
     :param target_chunks: Dictionary mapping dimension names to chunks sizes.
         If a dimension is not named, the chunks will be inferred from the data.
+    :param consolidate_dimension_coordinates: Whether to rewrite coordinate variables as a
+        single chunk. We recommend consolidating coordinate variables to avoid
+        many small read requests to get the coordinates in xarray. Defaults to ``True``.
     :param dynamic_chunking_fn: Optionally provide a function that takes an ``xarray.Dataset``
         template dataset as its first argument and returns a dynamically generated chunking dict.
         If provided, ``target_chunks`` cannot also be passed. You can use this to determine chunking
```

**Review comment** (on lines +588 to +590): Point of clarification for my own understanding: do we know for sure that this makes a meaningful difference for kerchunk datasets?

**Reply:** Good question! I don't actually know. Maybe @abarciauskas-bgse has some insight into this?

**Reply:** I don't know if there is a significant difference if we consolidate dimension coordinates in kerchunk datasets. I experienced slow performance when working with Zarr stores and saw a relationship between performance (specifically for creating image tiles using rio_tiler's `XarrayReader`) and the number of coordinate chunks. The notebook which demonstrates this is unfortunately not a brief one (https://github.com/developmentseed/tile-benchmarking/blob/9758c732597c154d1cf7cd796b21c858c3130046/profiling/profile.ipynb), but I could streamline it if we need to do more investigation. For kerchunk, I would guess there would be a performance impact depending on whether a coordinate's chunks are all read at once or in multiple reads. If it's in multiple reads, it still might impact performance. Do we know which it is? Or should I investigate and run some more tests?
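The docstring's claim about "many small read requests" can be made concrete with back-of-the-envelope arithmetic. The sizes below are hypothetical, chosen purely for illustration and not taken from the PR:

```python
import math

# Hypothetical sizes, for illustration only (not from the PR):
# a 1-D "time" coordinate with 146_000 steps, stored in chunks of 100.
n_steps = 146_000
chunk_len = 100

# Opening the dataset in xarray eagerly loads dimension coordinates,
# issuing roughly one read request per coordinate chunk.
reads_unconsolidated = math.ceil(n_steps / chunk_len)

# After rewriting the coordinate as a single chunk, one request suffices.
reads_consolidated = 1

print(reads_unconsolidated, reads_consolidated)  # 1460 1
```

For object storage, where each request carries tens of milliseconds of latency, collapsing ~1500 requests into one is the difference the docstring is pointing at.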
```diff
@@ -593,6 +601,7 @@ class StoreToZarr(beam.PTransform, ZarrWriterMixin):
     dynamic_chunking_fn: Optional[Callable[[xr.Dataset], dict]] = None
     dynamic_chunking_fn_kwargs: Optional[dict] = field(default_factory=dict)
     attrs: Dict[str, str] = field(default_factory=dict)
+    consolidate_dimension_coordinates: bool = False

     def __post_init__(self):
         if self.target_chunks and self.dynamic_chunking_fn:
```
```diff
@@ -625,7 +634,7 @@ def expand(
             | beam.combiners.Sample.FixedSizeGlobally(1)
             | beam.FlatMap(lambda x: x)  # https://stackoverflow.com/a/47146582
         )
-        # TODO: optionally use `singleton_target_store` to
-        # consolidate metadata and/or coordinate dims here
+        if self.consolidate_dimension_coordinates:
+            singleton_target_store = singleton_target_store | ConsolidateDimensionCoordinates()

         return singleton_target_store
```

**Review comment** (with a suggested change removing the `consolidate_dimension_coordinates` branch): I went back and forth on this, but IMHO I think the right path here is to not do any consolidation by default, and leave it to users to compose modularly, i.e.:

```python
recipe = (
    ...
    | StoreToZarr()
    | ConsolidateCoordinateDimensions()
    | ConsolidateMetadata()
)
```

This is more verbose for users, but it is also more explicit as to what they can expect, it adheres to the composable ethos of Beam, and it is less API creep for us to maintain in `StoreToZarr`. Happy to discuss dissenting views here, but as of now this feels like the right path to me.

**Reply:** Would love to chat more about this!

**Reply:** If you have a chat, I would love to attend if it fits my schedule.

**Reply:** Thanks for weighing in! I am not 100% sure which design is best here, and I appreciate the lively discussion so we can land on the best solution (here and in the comment thread above). A small point that just came to mind: in my experience, for smaller Zarr stores, consolidation doesn't really have a performance impact. So in some sense, do we care if newer recipe developers (which presumably implies smaller scale as well?) overlook this? We absolutely should have a design discussion on this, possibly before the next coordination meeting (to expedite the process). I will first try to outline what I see as the various options in a comment on this thread to solicit async feedback.

**Reply:** Awesome! Happy to chat before. I won't be able to make it to the next coordination meeting, so this week or next before Friday would be ideal for me.
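The modular composition sketched in the review discussion leans on Beam's `|` operator, which chains `PTransform`s left to right. A toy stand-in (plain Python, no `apache_beam` dependency; every name below is made up for illustration) shows the shape of that chaining:

```python
# Toy stand-in for Beam-style piping -- illustrative only, NOT apache_beam.
class Step:
    """A function wrapper whose | operator composes steps left to right."""

    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        # a | b builds a new Step that applies a's function, then b's
        return Step(lambda x: other.fn(self.fn(x)))

    def __call__(self, x):
        return self.fn(x)

# Hypothetical stages, named after the transforms discussed above:
store_to_zarr = Step(lambda path: f"zarr://{path}")
consolidate_coords = Step(lambda s: s + "?coords=consolidated")
consolidate_meta = Step(lambda s: s + "&meta=consolidated")

recipe = store_to_zarr | consolidate_coords | consolidate_meta
print(recipe("bucket/store"))  # zarr://bucket/store?coords=consolidated&meta=consolidated
```

This is why the explicit-composition design costs users only one extra line per stage: each consolidation step is just another pipe segment.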
**Review comment:** I imagine since this is a `zarr.Group`, this iteration isn't very expensive, because it's on summary data and never really that large?

**Reply:** Ehh, I think you're right? Since it's just scanning across vars in groups and looking at the `.attrs`.

**Reply:** I don't fully understand this function, but the fact that it's the same as we had in the pre-Beam release checks out to me! (See `pangeo-forge-recipes/pangeo_forge_recipes/recipes/xarray_zarr.py`, line 655 at `3b3c13c`.)
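For readers skimming the thread: the discussion above concerns scanning a group's variables to find dimension coordinates. The sketch below is an assumption about the general shape of such a scan, not the PR's actual `consolidate_dimension_coordinates` implementation; it models a Zarr group as a plain dict (with xarray's `_ARRAY_DIMENSIONS` attribute convention) so it runs without any dependencies:

```python
# Hypothetical sketch -- NOT the PR's actual implementation.
# Models a zarr group as {name: {"attrs": ..., "data": ..., "chunk_len": ...}}.
def consolidate_dimension_coordinates_sketch(group):
    for name, var in group.items():
        dims = var["attrs"].get("_ARRAY_DIMENSIONS", [])
        # A 1-D variable whose only dimension is its own name is a
        # dimension coordinate (xarray's convention for Zarr stores).
        if dims == [name]:
            var["chunk_len"] = len(var["data"])  # rewrite as a single chunk
    return group

store = {
    "time": {"attrs": {"_ARRAY_DIMENSIONS": ["time"]}, "data": list(range(10)), "chunk_len": 2},
    "temp": {"attrs": {"_ARRAY_DIMENSIONS": ["time"]}, "data": list(range(10)), "chunk_len": 2},
}
out = consolidate_dimension_coordinates_sketch(store)
print(out["time"]["chunk_len"], out["temp"]["chunk_len"])  # 10 2
```

Note that only `time` is rewritten: `temp` is a data variable *on* the `time` dimension, so it keeps its original chunking, which matches the reviewers' reading that the pass is a cheap metadata scan rather than a data rewrite.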