Add "datastores" to represent input data from zarr, npy, etc #66
Conversation
Rename base class for datastores representing data on a regular grid. Also introduce DummyDatastore in tests that represents data on an irregular grid.
Make sure that the tensors, datasets and arrays always follow these conventions (consistency with mllam-data-prep and weather-model-graphs). This is relevant whenever lat-lon or x-y data is stored in that object.
After adjusting the order of all tensors, datasets and arrays in mllam to
If we are getting rid of
For this PR I think option 3 is the best, and then we can think about the dependency structure later. We don't have any proper documentation system yet (#61), but for now these instructions could just sit in the readme. Or potentially it could be an example in the WMG documentation, and we could just link there? That might make more sense since it is documenting how to create a graph with WMG.
This sounds like the best option for now. We should only couple the repos with good reason in the future.
Good choice, I'll add this to the list of outstanding tasks above, as it is required for this PR. For the tests I assume that all relevant graphs are present in the test folder. If not, the assertion will fail. OK?
As the TODO list kept getting washed away by newer comments (again 😆), here it is as an issue. Please only use this issue to track progress from now on! #80
@joeloskarsson @khintz
Great work @sadamov! I can't add reviewers on Leif's branch, but he is making the mistake of showing up at DMI physically tomorrow, so if not before, I will track him down then 😄
Great stuff @sadamov! I hope to give this another read-through with all the new changes before we merge, and then I can take a special look at the points 1-5 that you mention (but we have already discussed a few of them, so should be all good 😄). It is probably easiest for me to do that though when all the changes are in this PR.
Describe your changes
This PR builds on #54 (which introduces zarr-based training data) by splitting the `Config` class introduced in #54 so that the configuration for what data to load is represented separately from the functions that load the data (the latter is what I call a "datastore"). In doing this I have also introduced a general interface through an abstract base class `BaseDatastore` with a set of functions that are called in the rest of `neural-lam` and which provide the data for training/validation/test as well as information about this data (see #58 for my overview of the methods that #54 uses to load data).

The motivation for this work is to allow for a clear separation between how data is loaded into neural-lam and how training/validation/test samples are created from that data. Creating the interface between these two steps makes it clear what is expected to be provided when people want to add new data-sources to neural-lam.
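To make the shape of that interface concrete, here is a minimal sketch of what such an abstract base class could look like. The method names below are illustrative assumptions for this write-up, not the exact `BaseDatastore` API added in the PR.

```python
# Minimal sketch of a datastore interface; the method names are illustrative
# assumptions, not the exact BaseDatastore API introduced in this PR.
import abc

import xarray as xr


class BaseDatastore(abc.ABC):
    """What neural-lam would call to get data and metadata about the data."""

    @abc.abstractmethod
    def get_dataarray(self, category: str, split: str) -> xr.DataArray:
        """Return data for one category ("state", "forcing" or "static")
        and one split ("train", "val" or "test")."""

    @abc.abstractmethod
    def get_vars_names(self, category: str) -> list[str]:
        """Return the feature (variable) names for a category."""

    @property
    @abc.abstractmethod
    def step_length(self) -> int:
        """Length of a single timestep, e.g. in hours."""
```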
In the text below I am trying to use the same nomenclature that @sadamov introduced, namely:

- the "category" of data, i.e. whether it is `state`, `forcing` or `static` data
- the stacking of the spatial coordinates into a `grid_index` coordinate, and of levels and variables into a `{category}_feature` coordinate (i.e. these are operations applied to the `np.ndarray` and `xr.Dataset`/`xr.DataArray` objects before they become `torch.Tensor` objects); a small sketch of this stacking is given below
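A small, self-contained sketch of what this stacking looks like with xarray; the dimension and variable names here are illustrative examples only, not taken from the codebase:

```python
# Illustrative example of the stacking convention: spatial dims -> grid_index,
# variables -> a single {category}_feature coordinate. Names are examples only.
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        "temperature": (("time", "y", "x"), np.zeros((4, 10, 8))),
        "pressure": (("time", "y", "x"), np.zeros((4, 10, 8))),
    }
)

# stack the two spatial dimensions into a single grid_index dimension
ds_stacked = ds.stack(grid_index=("y", "x"))

# stack all variables into a single state_feature dimension
da_state = ds_stacked.to_stacked_array(
    "state_feature", sample_dims=("time", "grid_index")
)
print(da_state.sizes)  # time: 4, grid_index: 80, state_feature: 2
```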
To support both the multizarr config format that @sadamov introduced in #54, the old npyfiles format and also data transformed with mllam-data-prep, I have currently implemented the following three datastore classes:

- `neural_lam.datastore.NpyDataStore`: reads data from .npy-files in the format introduced in neural-lam v0.1.0 - this uses `dask.delayed` so that no array content is read until it is used
- `neural_lam.datastore.MultizarrDatastore`: combines multiple zarr files during train/val/test sampling, with the transformations to facilitate this implemented within `neural_lam.datastore.MultizarrDatastore` - removed, as we decided `MDPDatastore` was enough
- `neural_lam.datastore.MDPDatastore`: can combine multiple zarr datasets either as a preprocessing step or during sampling, but offloads the implementation of the transformations to the mllam-data-prep package

Each of these inherits from `BaseCartesianDatastore`, which itself inherits from `BaseDatastore`. I have added this last layer of indirection to make it easier for non-gridded data to be used in `neural-lam` in the future.
Testing:

- The `create_graph`, `create_normalization` commands etc. have been adapted so that they can be called not just from the command line (a generic sketch of this pattern is given below).
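For reference, a generic sketch of that pattern: keep the logic in an importable function and make the CLI a thin wrapper around it. The function name and arguments are illustrative, not neural-lam's actual signatures.

```python
# Generic pattern: an importable function plus a thin CLI wrapper. The
# function name and arguments are illustrative, not neural-lam's actual API.
import argparse


def create_graph(config_path: str, name: str = "multiscale") -> None:
    """Build a graph for the given datastore config (placeholder body)."""
    print(f"building graph '{name}' from {config_path}")


def cli() -> None:
    parser = argparse.ArgumentParser(description="Create a graph")
    parser.add_argument("config_path")
    parser.add_argument("--name", default="multiscale")
    args = parser.parse_args()
    create_graph(config_path=args.config_path, name=args.name)


if __name__ == "__main__":
    cli()
```

Tests (and other Python code) can then call the function directly instead of shelling out to the CLI.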
Caveats:

- I have renamed `grid` to `grid_index`. I think it is ambiguous what "grid" refers to, since that could be the grid itself as well as the grid index as it was used.
- I have avoided using `variable` as a variable name for an `xr.DataArray`, because `xr.DataArray.variable` is a reserved attribute for data-arrays.
- I think the comment `# target_states: (ar_steps-2, N_grid, d_features)` in `WeatherDataset.__getitem__` is incorrect @sadamov, or at least my understanding of what `ar_steps` represents is different. I expect the target states to contain exactly `ar_steps` steps, rather than `ar_steps-2`. Or said another way, what would otherwise happen if `ar_steps == 0`? (A small sketch of the slicing I would expect is given below.)
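To make the shape question concrete, a small sketch of the slicing I would expect under that interpretation (two initial states followed by exactly `ar_steps` target states). This is an assumption about the intended semantics, not a description of the current implementation:

```python
# Assumed semantics: 2 initial states, then exactly ar_steps target states.
import numpy as np

ar_steps = 3
n_grid, d_features = 10, 4

# a time series long enough for 2 initial states + ar_steps targets
series = np.zeros((2 + ar_steps, n_grid, d_features))

init_states = series[:2]    # shape (2, n_grid, d_features)
target_states = series[2:]  # shape (ar_steps, n_grid, d_features)

assert target_states.shape[0] == ar_steps  # not ar_steps - 2
```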
Things I am unsure about:

- Whether to use `DataLoader(…, multiprocessing_context="spawn")`, as suggested in fsspec/filesystem_spec#755 ("RuntimeError: This class is not fork-safe") and https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader - but I am not sure if we should do this or always use local zarr datasets rather than opening them from s3 (see the snippet below).
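For reference, passing the context is a standard `torch.utils.data.DataLoader` argument (the dataset below is just a placeholder); whether we actually want this is the open question:

```python
# Use a "spawn" multiprocessing context so worker processes are not forked
# from the parent, which fsspec's filesystems are reported not to be safe under.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.zeros(16, 3))  # placeholder dataset

loader = DataLoader(
    dataset,
    batch_size=4,
    num_workers=2,
    multiprocessing_context="spawn",  # only relevant when num_workers > 0
)
```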
- On whether something should be in `BaseDatastore` vs `WeatherDataset`, e.g. something that currently sits in `WeatherDataset` because it doesn't apply to the "state" category, for example.

Type of change
Checklist before requesting a review
- the branch is up to date with the target branch (use `pull` with the `--rebase` option if possible)

Checklist for reviewers
Each PR comes with its own improvements and flaws. The reviewer should check the following:
Author checklist after completed review
- a line describing this change has been added to the changelog, in the section reflecting the type of change (add a section where missing)
Checklist for assignee