Support for more than one ConcatDim #140

nbarlowATI · 2021-05-20T14:16:41Z

Hi all, I'm trying to convert a dataset from NetCDF to Zarr, and would really like to concatenate over more than one dimension (in my case "time" and "ensemble_id").
I see that XarrayZarrRecipe/FilePattern doesn't currently support this.
I'd be happy to work on implementing this if others think it's worthwhile, and if there isn't already someone working on it?

martindurant · 2021-05-20T14:37:45Z

I'd like to take the chance to once more gently remind that RefernceFileSystem really wants this logic: of figuring out for every chunk in the input, what chunk index it should have in a theoretical output zarr dataset. This process doesn't need to live inside a recipe class, but I suppose that's a good place to start.

rabernat · 2021-05-20T15:48:25Z

Hi @nbarlowATI and welcome!

You are correct that this is not yet supported. You are welcome to try to implement it, and we would love to have your contribution. However, I am concerned that the current code related to this is not very clear or accessible to outside contributors. Consequently, you may become frustrated if you try to work on this. As Martin suggested, abstracting this logic into a standalone module would probably be wise; however that is also a difficult task.

I will try to give you some hints about how to proceed:

The key method to look at is region_and_conflicts_for_chunk:

pangeo-forge-recipes/pangeo_forge_recipes/recipes/xarray_zarr.py

Lines 412 to 415 in 32e9201

    
           def region_and_conflicts_for_chunk(self, chunk_key: ChunkKey): 
        
               # return a dict suitable to pass to xr.to_zarr(region=...) 
        
               # specifies where in the overall array to put this chunk's data 
        
               # also return the conflicts with other chunks

This method translates a chunk_key to a specific slice within a multidimensional dataset (e.g. {'time': slice(10, 20)}). chunk_key is a tuple of ints. It is a key internal data structure which encodes the chunk's place within the broader dataset. It is set here:

pangeo-forge-recipes/pangeo_forge_recipes/recipes/xarray_zarr.py

Line 151 in 32e9201

chunk_key = tuple([v[0] if hasattr(v, "__len__") else v for v in k])

(This is the part of the code that is hardest to understand.) Currently chunk_key is either a one or two-item tuple (e.g. (2,) [only a ConcatDim] or (2, 3) [ConcatDim and MergeDim]). I To implement multiple ConcatDims, we would probably need to generalize chunk_key to an arbitrary length.

Another key method is input_position:

pangeo-forge-recipes/pangeo_forge_recipes/recipes/xarray_zarr.py

Lines 455 to 458 in 32e9201

    
           def input_position(self, input_key): 
        
               # returns the index position of an input key wrt the concat_dim 
        
               concat_dim_axis = list(self.file_pattern.dims).index(self._concat_dim) 
        
               return input_key[concat_dim_axis]

This would also need to be generalized to handle multiple ConcatDims.

I hope these tips are helpful. I don't want to suggest it will be easy, but we will be happy to support you, answer questions, review PRs etc.

rabernat · 2021-05-20T15:57:41Z

Having just written that up, I realize that I may be able to make some progress on this fairly quickly and also refactor the code along the lines Martin suggested. So if you can wait a few weeks, I may be able to implement the feature myself.

@nbarlowATI - I'm curious what your application is for this...

martindurant · 2021-05-20T16:00:05Z

Hurray @rabernat ! I feel like it might be useful for us two (or more) to have a brainstorm on how to bring this about.

cisaacstern · 2021-05-20T16:01:26Z

Would love to listen in on this brainstorm.

rabernat · 2021-05-20T16:02:34Z

Since we all seem to be online, could we drop in https://whereby.com/pangeo right now?

I can work for another 2 hours today, then I am checking out for a long weekend.

martindurant · 2021-05-20T16:04:05Z

Sorry, Dask is taking all of my time right now, I would prefer after my ESIP talk on Monday.

rabernat · 2021-05-20T17:29:10Z

Also noting that this is basically the same as #98.

aaron-kaplan · 2021-07-14T20:11:56Z

We're working on providing the SubX data in Zarr format, and this issue kept us from using pangeo-forge to do the conversion.

martindurant · 2021-08-02T18:56:17Z

It should be possible to use the reference-maker idea to merge subsets of the total data on one dimension, and then merge these intermediate virtual datasets on a second dimension, and so on. The original data would only need scanning once. Having done all this, then you can rechunk as required. Of course, it would be worth keeping the reference-maker output too, which would view the original data with whatever chunks it had.

rabernat mentioned this issue May 20, 2021

Add chunk grid logic #141

Merged

rabernat mentioned this issue Jul 28, 2021

Example pipeline for GFS Archive pangeo-forge/staged-recipes#50

Open

rabernat mentioned this issue Dec 18, 2021

Should we just adopt xarray-beam as our internal data model? #256

Open

rabernat pinned this issue Feb 1, 2022

rabernat mentioned this issue Feb 1, 2022

Proposed Recipes for Global Drifter Program pangeo-forge/staged-recipes#101

Open

rabernat mentioned this issue Apr 26, 2022

Using pangeo-forge-recipes with two concat dims #348

Closed

cisaacstern mentioned this issue Apr 26, 2022

More than one ConcatDim pangeo-forge/user-stories#4

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support for more than one ConcatDim #140

Support for more than one ConcatDim #140

nbarlowATI commented May 20, 2021

martindurant commented May 20, 2021

rabernat commented May 20, 2021

rabernat commented May 20, 2021

martindurant commented May 20, 2021

cisaacstern commented May 20, 2021

rabernat commented May 20, 2021

martindurant commented May 20, 2021

rabernat commented May 20, 2021

aaron-kaplan commented Jul 14, 2021

martindurant commented Aug 2, 2021

Support for more than one ConcatDim #140

Support for more than one ConcatDim #140

Comments

nbarlowATI commented May 20, 2021

martindurant commented May 20, 2021

rabernat commented May 20, 2021

rabernat commented May 20, 2021

martindurant commented May 20, 2021

cisaacstern commented May 20, 2021

rabernat commented May 20, 2021

martindurant commented May 20, 2021

rabernat commented May 20, 2021

aaron-kaplan commented Jul 14, 2021

martindurant commented Aug 2, 2021