Support for more than one ConcatDim #140
Comments
I'd like to take the chance to once more gently remind everyone that ReferenceFileSystem really wants this logic: figuring out, for every chunk in the input, what chunk index it should have in a theoretical output zarr dataset. This process doesn't need to live inside a recipe class, but I suppose that's a good place to start.
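As a rough illustration of the mapping described above (input chunk to output chunk index), here is a minimal sketch in plain Python. The function names `chunk_offsets` and `output_chunk_index`, and the example layout, are hypothetical and are not part of pangeo-forge-recipes or fsspec. It assumes the number of chunks each input contributes along every concat dimension is already known.

```python
from itertools import accumulate


def chunk_offsets(chunks_per_input):
    """Exclusive prefix sum: the first output chunk index of each input."""
    return [0, *accumulate(chunks_per_input)][:-1]


def output_chunk_index(input_key, local_chunk, offsets_by_dim):
    """Map a chunk inside one input to its index in the combined output.

    input_key   -- position of the input along each concat dim, e.g. (2, 1)
    local_chunk -- chunk index inside that input, one entry per concat dim
    """
    return tuple(
        offsets[i] + c
        for offsets, i, c in zip(offsets_by_dim, input_key, local_chunk)
    )


# Hypothetical layout: 3 inputs along "time" holding 2, 2 and 1 chunks,
# and 2 inputs along "ensemble_id" holding 1 chunk each.
offsets_by_dim = [chunk_offsets([2, 2, 1]), chunk_offsets([1, 1])]

# Second chunk of the input at time position 1, ensemble position 0.
example = output_chunk_index((1, 0), (1, 0), offsets_by_dim)
```

Because the offsets are computed per dimension, the same function works for one, two, or more concat dimensions.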
Hi @nbarlowATI and welcome! You are correct that this is not yet supported. You are welcome to try to implement it, and we would love to have your contribution. However, I am concerned that the current code related to this is not very clear or accessible to outside contributors. Consequently, you may become frustrated if you try to work on this. As Martin suggested, abstracting this logic into a standalone module would probably be wise; however that is also a difficult task. I will try to give you some hints about how to proceed: The key method to look at is pangeo-forge-recipes/pangeo_forge_recipes/recipes/xarray_zarr.py Lines 412 to 415 in 32e9201
This method translates a
(This is the part of the code that is hardest to understand.) Currently chunk_key is either a one- or two-item tuple (e.g. (2,) [only a ConcatDim] or (2, 3) [ConcatDim and MergeDim]). To implement multiple ConcatDims, we would probably need to generalize chunk_key to an arbitrary length.
Another key method is pangeo-forge-recipes/pangeo_forge_recipes/recipes/xarray_zarr.py Lines 455 to 458 in 32e9201
This would also need to be generalized to handle multiple ConcatDims. I hope these tips are helpful. I don't want to suggest it will be easy, but we will be happy to support you, answer questions, review PRs, etc.
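To make the idea of an arbitrary-length chunk_key more concrete, here is a small sketch. It is not the code in xarray_zarr.py; `all_chunk_keys` and `target_region` are hypothetical helpers, and the sketch assumes every key covers the same number of items along a given dimension, which real datasets may not satisfy.

```python
from itertools import product


def all_chunk_keys(n_keys_per_dim):
    """Enumerate chunk keys of any length, one entry per combine dimension."""
    return list(product(*(range(n) for n in n_keys_per_dim)))


def target_region(chunk_key, dim_names, items_per_key):
    """Translate a chunk key into the slice of the target it would fill.

    Assumes a fixed number of items per key along each dimension.
    """
    return {
        dim: slice(k * n, (k + 1) * n)
        for dim, k, n in zip(dim_names, chunk_key, items_per_key)
    }


# Two concat dims: 2 keys along "time", 3 keys along "ensemble_id".
keys = all_chunk_keys([2, 3])

# Key (2, 3) with 10 time steps per key and 1 ensemble member per key.
region = target_region((2, 3), ["time", "ensemble_id"], [10, 1])
```

Nothing here depends on the key having length one or two, which is the property the generalization would need.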
Having just written that up, I realize that I may be able to make some progress on this fairly quickly and also refactor the code along the lines Martin suggested. So if you can wait a few weeks, I may be able to implement the feature myself. @nbarlowATI - I'm curious what your application is for this...
Hurray @rabernat! I feel like it might be useful for the two of us (or more) to have a brainstorm on how to bring this about.
Would love to listen in on this brainstorm.
Since we all seem to be online, could we drop in https://whereby.com/pangeo right now? I can work for another 2 hours today, then I am checking out for a long weekend.
Sorry, Dask is taking all of my time right now; I would prefer after my ESIP talk on Monday.
Also noting that this is basically the same as #98.
We're working on providing the SubX data in Zarr format, and this issue kept us from using pangeo-forge to do the conversion.
It should be possible to use the reference-maker idea to merge subsets of the total data on one dimension, and then merge these intermediate virtual datasets on a second dimension, and so on. The original data would only need scanning once. Having done all this, you can then rechunk as required. Of course, it would be worth keeping the reference-maker output too, which would view the original data with whatever chunks it had.
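The staged merge described above can be sketched with a toy reference structure. This is a simplified stand-in, not the real reference-maker or ReferenceFileSystem format: here a reference set is assumed to be a dict mapping a chunk index tuple to a `(path, byte_offset, length)` triple, and the file names, sizes, and the helpers `shift_refs` and `concat_refs` are made up for illustration.

```python
def shift_refs(refs, axis, offset):
    """Copy a reference set, shifting chunk indices along one axis."""
    return {
        tuple(i + offset if a == axis else i for a, i in enumerate(key)): ref
        for key, ref in refs.items()
    }


def concat_refs(ref_sets, axis):
    """Concatenate reference sets along one axis without touching the data."""
    combined, offset = {}, 0
    for refs in ref_sets:
        combined.update(shift_refs(refs, axis, offset))
        # The next set starts after the last chunk index of this one.
        offset += 1 + max(key[axis] for key in refs)
    return combined


# Each original file is scanned once; here every file holds a single chunk.
# Axis 0 is "time", axis 1 is "ensemble_id" (hypothetical layout).
files = {
    (e, t): {(0, 0): (f"member{e}_t{t}.nc", 0, 100)}
    for e in range(2)
    for t in range(2)
}

# Stage 1: merge each ensemble member along time.
per_member = [
    concat_refs([files[(e, t)] for t in range(2)], axis=0) for e in range(2)
]

# Stage 2: merge the intermediate virtual datasets along ensemble_id.
combined = concat_refs(per_member, axis=1)
```

Only the index bookkeeping changes between stages, so adding a third dimension would just be one more `concat_refs` call over the stage 2 results.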
Hi all, I'm trying to convert a dataset from NetCDF to Zarr, and would really like to concatenate over more than one dimension (in my case "time" and "ensemble_id").
I see that XarrayZarrRecipe/FilePattern doesn't currently support this.
I'd be happy to work on implementing this if others think it's worthwhile, and if there isn't already someone working on it?