Dynamically determine subset_inputs and target_chunks from cached files. #355
Comments
This is a very compelling argument. Also a small point that the pathlib Let's coordinate on this perhaps next week once some of the other features we've been discussing have moved forward.
This is a good idea. Making it work would require significant refactoring of the "pipelines" execution model. We would need the ability for the sequence for a map stage to be generated dynamically by an earlier stage. This should be possible, but someone will need to refactor the executors, so it touches #256.
Superseded by #546, so closing. We've made so much headway on this since this issue was first opened!
Over in pangeo-forge/cmip6-feedstock#2 we are planning to convert CMIP6 data (which is very heterogeneous in chunking, file size, etc.) to ARCO data using Pangeo Forge. We currently have to specify the input parameters subset_inputs and target_chunks for each converted dataset. If we scale up our efforts there, we may have to do this for hundreds of thousands of datasets, which is obviously not sustainable.
While working on that effort I was wondering if there wouldn't be a way to determine these parameters dynamically once the files are cached (this would not work in a recipe where caching is not enabled).
In particular, the subset_inputs parameter does not seem quite aligned with the proposed separation of recipe and execution logic, since that is ultimately dictated by the size of the workers of the execution environment?
I was able to draft up some pretty light code that successfully infers these from a feedstock recipe.
Assume that the files have already been cached.
Dynamically determine subset_inputs
subset_inputs could then be inferred dynamically, which gives reasonable values that can then be set internally on the recipe object.
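As a rough illustration of the idea (not the code from the notebook), the inference could look something like the sketch below. It assumes the in-memory size of each cached input is already known, for example from opening the cached file, and that subset_inputs is a mapping from a dimension name to the number of pieces each input is split into. The helper name infer_subset_inputs and the max_mem_bytes parameter are hypothetical and not part of pangeo-forge-recipes.

```python
def infer_subset_inputs(input_nbytes, dim_length, max_mem_bytes, dim="time"):
    """Hypothetical helper: choose how many pieces to split each input into.

    input_nbytes  : iterable of in-memory sizes (bytes), one per cached input
    dim_length    : number of elements along `dim` in a single input
    max_mem_bytes : assumed memory budget per piece on a worker
    """
    largest = max(input_nbytes)
    # Integer ceiling division: smallest piece count that keeps each piece
    # of the largest input at or below the memory budget.
    n_subsets = (largest + max_mem_bytes - 1) // max_mem_bytes
    # An input cannot be split into more pieces than it has elements along dim.
    n_subsets = max(1, min(n_subsets, dim_length))
    # An empty mapping means "do not subset".
    return {dim: n_subsets} if n_subsets > 1 else {}


# Two cached inputs of 1.2 GB and 2 GB, 120 time steps each, 500 MB budget.
subset_inputs = infer_subset_inputs(
    [1_200_000_000, 2_000_000_000], dim_length=120, max_mem_bytes=500_000_000
)
print(subset_inputs)
```

Sizing by the largest input is a conservative choice: smaller inputs end up with pieces well below the budget, but no piece should exceed it.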
Dynamically determine target_chunks
A similar logic block could enable the user to specify a size range for the target chunks (which would nicely accommodate the different dimensionality of our input, for instance):
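A minimal sketch of such a size-range approach follows, under stated assumptions: chunking happens only along one dimension, the number of bytes per step along that dimension is known from the cached files, and target_chunks is a mapping from a dimension name to a chunk length. The helper name infer_target_chunks and its parameters are hypothetical; this is not the notebook's code.

```python
def infer_target_chunks(nbytes_per_step, dim_length, min_bytes, max_bytes, dim="time"):
    """Hypothetical helper: pick a chunk length along `dim` within a size range.

    nbytes_per_step : bytes occupied by one step along `dim` (all variables)
    dim_length      : total length of `dim` in the target dataset
    min_bytes, max_bytes : desired size range for a single target chunk
    """
    # Largest chunk length that stays at or below the upper size bound.
    steps = max(1, min(dim_length, max_bytes // nbytes_per_step))
    # Prefer a length that divides the dimension evenly, as long as the
    # resulting chunk does not drop below the lower size bound.
    for candidate in range(steps, 0, -1):
        if candidate * nbytes_per_step < min_bytes:
            break
        if dim_length % candidate == 0:
            return {dim: candidate}
    # No even divisor inside the range: fall back to the largest allowed length.
    return {dim: steps}


# 1 MB per time step, 1980 steps, chunks between 50 MB and 100 MB.
target_chunks = infer_target_chunks(
    1_000_000, 1980, min_bytes=50_000_000, max_bytes=100_000_000
)
print(target_chunks)
```

Because the range is given in bytes rather than in elements, the same setting can be reused for datasets with different dimensionality: inputs with more data per step simply get a shorter chunk length.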
I was discussing this with @cisaacstern earlier and we were thinking that this could be integrated as part of a more granular stage structure, as mentioned in #224.
See here for the full notebook.
Happy to help wherever I can.