
Why are coordinates being chunked by default with StoreToZarr? #554

Open

abarciauskas-bgse opened this issue Jul 29, 2023 · 4 comments

@abarciauskas-bgse (Contributor)

👋🏽 I'm using pangeo-forge to create some Zarr stores to test with rio_tiler's XarrayReader. I found that when generating image tiles, performance was significantly impacted by the number of S3 calls made when opening the dataset, and this was because the coordinates were chunked. The code to reproduce this issue is here: https://nbviewer.org/gist/abarciauskas-bgse/c826a49a966f3c157626f45ea816330b
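To make the cost concrete, here is a small back-of-the-envelope sketch (the numbers are illustrative, not taken from the notebook): reading a 1-D coordinate that spans many chunks costs one object-store GET per chunk, whereas a coordinate consolidated into a single chunk costs one GET total.

```python
import math

def coord_read_requests(length: int, chunk_size: int) -> int:
    """GET requests needed to read one full 1-D coordinate from object storage."""
    return math.ceil(length / chunk_size)

# Hypothetical example: a 7200-step time coordinate chunked in blocks of 100
# costs 72 GETs just to open; consolidated into one chunk, it costs 1.
print(coord_read_requests(7200, 100))   # 72
print(coord_read_requests(7200, 7200))  # 1
```

Opening a dataset touches every dimension coordinate, so these per-coordinate costs multiply across dimensions and add up quickly in a tiling workload.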

@cisaacstern (Member)

Thanks for raising this, @abarciauskas-bgse.

Based on the notebook it looks like you are using a development version of pangeo-forge-recipes that predates the 0.10.0 release?

I don't have any specific reason to think this is fixed in 0.10.0, but could you confirm that the issue is present there as well?

@abarciauskas-bgse (Contributor, Author)

Thanks for looking @cisaacstern - I just installed pangeo-forge-recipes 0.10.0 and verified the coordinates are also chunked when using that version of the library (code linked in notebook has been updated to reflect this library version).

@cisaacstern (Member) commented Aug 4, 2023

Thanks for checking that, @abarciauskas-bgse. AFAICT, this feature did exist in 0.9.4 here:

```python
if config.consolidate_dimension_coordinates:
    logger.info("Consolidating dimension coordinate arrays")
    target_mapper = config.storage_config.target.get_mapper()
    group = zarr.open(target_mapper, mode="a")
    # https://github.com/pangeo-forge/pangeo-forge-recipes/issues/214
    # filter out the dims from the array metadata not in the Zarr group
    # to handle coordinateless dimensions.
    dims = (dim for dim in _gather_coordinate_dimensions(group) if dim in group)
    for dim in dims:
        arr = group[dim]
        attrs = dict(arr.attrs)
        data = arr[:]
        # This will generally use bulk-delete API calls
        config.storage_config.target.rm(dim, recursive=True)
        new = group.array(
            dim,
            data,
            chunks=arr.shape,
            dtype=arr.dtype,
            compressor=arr.compressor,
            fill_value=arr.fill_value,
            order=arr.order,
            filters=arr.filters,
            overwrite=True,
        )
        new.attrs.update(attrs)
```

And re-implementing this as part of the Beam rewrite was simply overlooked. Adding this would probably mean defining a new transform

```python
class ConsolidateDimensionCoordinates(beam.PTransform):
    ...
```

and slotting it in (optionally, perhaps, but as the default?) at the end of StoreToZarr, something like

```diff
  target_store = schema | PrepareZarrTarget(
      target=self.get_full_target(), target_chunks=self.target_chunks
  )
- return rechunked_datasets | StoreDatasetFragments(target_store=target_store)
+ stored_fragments = rechunked_datasets | StoreDatasetFragments(target_store=target_store)
+ return (
+     stored_fragments
+     if not self.consolidate_coords
+     else stored_fragments | ConsolidateDimensionCoordinates(target_store=target_store)
+ )
```

I could potentially take a look at this in a few weeks, but I would be thrilled if you or someone else wanted to take it on, and would be happy to do an async and/or video walkthrough to get you (or whoever wants to work on this) up to speed.

Thanks for catching this!

@abarciauskas-bgse (Contributor, Author)

Thank you @cisaacstern for looking into this!
