Add "datastores" to represent input data from zarr, npy, etc #66

Open · wants to merge 268 commits into base: main
Conversation

@leifdenby (Member) commented Jul 17, 2024

Describe your changes

This PR builds on #54 (which introduces zarr-based training data) by splitting the Config class introduced in #54, so that the configuration describing what data to load is kept separate from the functions that actually load the data (the latter is what I call a "datastore"). In doing this I have also introduced a general interface through an abstract base class BaseDatastore, with a set of functions that are called in the rest of neural-lam and provide data for training/validation/test along with information about that data (see #58 for my overview of the methods that #54 uses to load data).

The motivation for this work is to allow a clear separation between how data is loaded into neural-lam and how training/validation/test samples are created from that data. Creating an interface between these two steps makes it clear what is expected to be provided when people want to add new data sources to neural-lam.

In the text below I am trying to use the same nomenclature that @sadamov introduced, namely:

  • data "category": relates to whether a multidimensional array represents state, forcing or static data.
  • data "transformation": this refers to the operations of extracting specific variables from source datasets (e.g. zarr datasets), flattening spatial coordinates into a grid_index coordinate, and stacking levels and variables into a {category}_feature coordinate (i.e. these are the operations that BaseDatastore-derived classes carry out).

The split of responsibilities is then:

|  | BaseDatastore-derived classes | WeatherDataset |
| --- | --- | --- |
| returns | only Python primitive types, np.ndarray and xr.Dataset/xr.DataArray objects | torch.Tensor objects |
| provides | transformed train/test/val datasets that cover the entire time and space range for a given category of data | individual time samples (including windowing and handling both analysis and forecasts) for train/test/val, optionally sampled from ensemble members |
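To make this split concrete, here is a minimal sketch of what a BaseDatastore interface along these lines could look like. The method names and signatures below are illustrative assumptions, not the actual API introduced in this PR:

```python
# Minimal sketch of a datastore interface along the lines described above.
# Method names and signatures are illustrative assumptions, not the PR's API.
from abc import ABC, abstractmethod

import xarray as xr


class BaseDatastore(ABC):
    """Provide transformed data and metadata to the rest of neural-lam.

    Implementations return only Python primitives, np.ndarray and xarray
    objects; converting to torch.Tensor and drawing individual time samples
    is left to WeatherDataset.
    """

    @abstractmethod
    def get_dataarray(self, category: str, split: str) -> xr.DataArray:
        """Full (time, grid_index, {category}_feature) array for a category
        ("state", "forcing" or "static") and split ("train", "val", "test")."""

    @abstractmethod
    def get_vars_names(self, category: str) -> list[str]:
        """Names of the features in the given category."""

    @abstractmethod
    def get_normalization_stats(self, category: str) -> xr.Dataset:
        """Statistics (e.g. mean/std) used to normalize data in this category."""
```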

To support the multizarr config format that @sadamov introduced in #54, the old npyfiles format, and data transformed with mllam-data-prep, I have currently implemented the following three datastore classes:

  • neural_lam.datastore.NpyDataStore: reads data from .npy-files in the format introduced in neural-lam v0.1.0 - this uses dask.delayed so no array content is read until it is used
  • neural_lam.datastore.MultizarrDatastore: combines multiple zarr files during train/val/test sampling, with the transformations needed to facilitate this implemented within neural_lam.datastore.MultizarrDatastore itself. - removed, as we decided MDPDatastore was enough
  • neural_lam.datastore.MDPDatastore: can combine multiple zarr datasets either as a preprocessing step or during sampling, but offloads the implementation of the transformations to the mllam-data-prep package.

Each of these inherits from BaseCartesianDatastore, which itself inherits from BaseDatastore. I have added this last layer of indirection to make it easier for non-gridded data to be used in neural-lam in the future.
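As a rough, purely indicative picture of that hierarchy (class and property names continue the sketch above and are not necessarily the exact ones in the code):

```python
# Indicative sketch of the inheritance structure described above; the
# BaseDatastore stub stands in for the ABC sketched earlier.
class BaseDatastore:
    ...


class BaseCartesianDatastore(BaseDatastore):
    """Adds what is specific to data on a regular Cartesian grid, e.g. the
    projection and the (x, y) extent/shape of the grid."""

    @property
    def grid_shape_state(self):  # illustrative property name
        raise NotImplementedError


class NpyDataStore(BaseCartesianDatastore):
    """Reads the .npy-file layout from neural-lam v0.1.0 (lazily, via dask)."""


class MDPDatastore(BaseCartesianDatastore):
    """Reads zarr datasets produced/described by mllam-data-prep."""
```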

Testing:

Caveats:

  • storage of graphs and other auxiliary information: reading @sadamov's "Multiple Zarr to Rule them All" #54, I got the feeling that the intention was for the path of the config file describing where data comes from to act, in effect, as the directory for a dataset. It makes sense to me to put everything relative to the parent directory of this config file, at least as an easy convention to use. With the configuration file living outside the neural-lam repository (by making neural-lam a package, "Refactor codebase into a python package" #32) I think this is necessary, and less arbitrary than saying everything has to be in a "data" directory. For this reason I have assumed that any paths in the mllam and multizarr configs that do not start with a protocol and are not absolute are relative to the parent path of the config (see the sketch after this list). For example, multizarr's "create_forcing" CLI defines a path, but so does the config, which I think was inconsistent and error-prone.
  • I have renamed the coordinate you introduced, @sadamov, from grid to grid_index. I think it is ambiguous what "grid" refers to, since that could mean the grid itself as well as the grid index, which is how it was used.
  • We shouldn't use .variable as a variable name for an xr.DataArray, because xr.DataArray.variable is a reserved attribute on data-arrays.
  • I think the comment # target_states: (ar_steps-2, N_grid, d_features) in WeatherDataset.__getitem__ is incorrect, @sadamov, or at least my understanding of what ar_steps represents is different. I expect the target states to contain exactly ar_steps steps, rather than ar_steps-2. Said another way, what would otherwise happen if ar_steps == 0?
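A small sketch of the path-resolution rule described in the first caveat above (the function name is hypothetical):

```python
# Sketch of the rule: paths in a config that are not absolute and do not
# carry a protocol prefix (e.g. "s3://") are resolved relative to the
# directory containing the config file. Function name is hypothetical.
from pathlib import Path
from urllib.parse import urlparse


def resolve_path_relative_to_config(config_path: str, data_path: str) -> str:
    if urlparse(data_path).scheme not in ("", "file"):
        return data_path  # e.g. "s3://bucket/data.zarr" is left untouched
    path = Path(data_path)
    if path.is_absolute():
        return str(path)
    return str(Path(config_path).parent / path)


# e.g. resolve_path_relative_to_config("/data/danra/config.yaml", "danra.zarr")
#      -> "/data/danra/danra.zarr"
```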

Things I am unsure about:

On whether something should be in BaseDatastore vs WeatherDataset:

  • I have moved apply_windowing to WeatherDataset because, for example, it doesn't apply to the "state" category (see the sketch below).
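For reference, a minimal sketch of what time-windowing inside WeatherDataset could look like, assuming an xr.DataArray with the dimensions used above; the function and default window size are illustrative, not the PR's implementation:

```python
# Minimal sketch of windowing e.g. forcing along time: stack a few
# consecutive time steps into the feature dimension so that each sample
# carries past/current/future forcing. Names are illustrative.
import xarray as xr


def window_forcing(forcing: xr.DataArray, idx: int, window: int = 3) -> xr.DataArray:
    """forcing has dims (time, grid_index, forcing_feature); returns a slice
    centred on time index idx with the window stacked into forcing_feature."""
    half = window // 2
    steps = [forcing.isel(time=idx + offset) for offset in range(-half, half + 1)]
    return xr.concat(steps, dim="forcing_feature")
```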

Type of change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📖 Documentation (Addition or improvements to documentation)

Checklist before requesting a review

  • My branch is up-to-date with the target branch - if not update your fork with the changes from the target branch (use pull with --rebase option if possible).
  • I have performed a self-review of my code
  • For any new/modified functions/classes I have added docstrings that clearly describe its purpose, expected inputs and returned values
  • I have placed in-line comments to clarify the intent of any hard-to-understand passages of my code
  • I have updated the README to cover introduced code changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have given the PR a name that clearly describes the change, written in imperative form (context).
  • I have requested a reviewer and an assignee (assignee is responsible for merging)

Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should check the following:

  • the code is readable
  • the code is well tested
  • the code is documented (including return types and parameters)
  • the code is easy to maintain

Author checklist after completed review

  • I have added a line to the CHANGELOG describing this change, in a section
    reflecting type of change (add section where missing):
    • added: when you have added new functionality
    • changed: when default behaviour of the code has been changed
    • fixes: when your contribution fixes a bug

Checklist for assignee

  • PR is up to date with the base branch
  • the tests pass
  • author has added an entry to the changelog (and designated the change as added, changed or fixed)
  • Once the PR is ready to be merged, squash commits and merge the PR.

Rename base class for datastores representing data on a regular grid. Also introduce DummyDatastore in tests that represents data on an irregular grid
@sadamov (Collaborator) commented Oct 23, 2024

Make sure that tensors, datasets and arrays always follow these conventions (for consistency with mllam-data-prep and weather-model-graphs). This is relevant whenever lat-lon or x-y data is stored in such an object (a small illustration follows the list):

  • [x, y] <-> [lon, lat]
  • [numpoints, 2]
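A small numpy illustration of these conventions (the coordinate values are arbitrary):

```python
# Illustration of the conventions above: the last axis orders x before y
# (equivalently lon before lat), and flattened point sets are (num_points, 2).
import numpy as np

lon = np.linspace(5.0, 15.0, 4)   # x-like coordinate, Nx = 4
lat = np.linspace(50.0, 60.0, 3)  # y-like coordinate, Ny = 3

# gridded coordinates with shape (Nx, Ny, 2), last axis = (x, y) / (lon, lat)
xx, yy = np.meshgrid(lon, lat, indexing="ij")
grid_xy = np.stack([xx, yy], axis=-1)
assert grid_xy.shape == (4, 3, 2)

# flattened to (num_points, 2), e.g. for graph creation
points = grid_xy.reshape(-1, 2)
assert points.shape == (12, 2)
```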

@sadamov (Collaborator) commented Oct 23, 2024

After adjusting the order of all tensors, datasets and arrays in mllam to [x, y] <-> [lon, lat] and [numpoints, 2], the script create_mesh.py no longer works. Instead of updating the script, @joeloskarsson and I discussed that we should from now on rely directly on WGM for graph creation. For me this means we remove create_mesh.py from neural-lam. I have started removing the parts of the codebase relying on it. Multiple tests must be reimplemented. Before I do that, I quickly wanted to check that we are all on the same page here @TomasLandelius @khintz

@joeloskarsson (Collaborator) commented

If we are getting rid of create_mesh.py (now create_graph.py) from here, it would probably be good to have some clear instructions in neural_lam on how to use WMG to construct and save the graph, especially since the coordinates to use with WMG have to be extracted from the datastore. There are really three options I'd say:

  1. Introduce a dependency in neural_lam to WMG and include a script here that extracts coordinates from the datastore and calls the WMG methods with them.
  2. Introduce a dependency in WMG to neural_lam and let WMG use a neural_lam datastore to get the coordinates.
  3. Don't create any dependencies, but rather just document exactly the python code needed to extract the coordinates from the datastore and use them to call WMG (this is really very little code; see the sketch at the end of this comment).

For this PR I think option 3 is the best, and then we can think about the dependency structure later. We don't have any proper documentation system yet (#61), but for now these instructions could just sit in the readme. Or potentially it could be an example in the WMG documentation, and we could just link there? That might make more sense since it is documenting how to create a graph with WMG.
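To illustrate what option 3 could amount to, a rough sketch is below. The datastore accessor (get_xy), the config path and the exact WMG function/argument names are assumptions and would need to be checked against the actual datastore interface and the WMG documentation:

```python
# Rough sketch of option 3: pull grid coordinates out of a datastore and hand
# them to weather-model-graphs (WMG). The accessor name, config path and WMG
# call below are assumptions, not confirmed APIs.
import weather_model_graphs as wmg

from neural_lam.datastore import MDPDatastore  # assumed import path

datastore = MDPDatastore(config_path="danra.datastore.yaml")  # hypothetical config
coords = datastore.get_xy(category="state")  # assumed to return (num_points, 2)

# create an archetype graph from the grid coordinates and save it for neural-lam
graph = wmg.create.archetype.create_keisler_graph(coords=coords)
wmg.save.to_pyg(graph=graph, name="m2m", output_directory="graphs/keisler/")
```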

@sadamov (Collaborator) commented Oct 24, 2024

  1. Don't create any dependencies, but rather just document exactly the python code needed to extract the coordinates from the datastore and use them to call WMG (this is really very little code).

This sounds like the best option for now. We should only couple the repos with good reason in the future.

Or potentially it could be an example in the WMG documentation, and we could just link there? That might make more sense since it is documenting how to create a graph with WMG.

Good choice. I will add this to the list of outstanding tasks above, as it is required for this PR.

For the tests I assume that all relevant graphs are present in the test folder. If not, an assertion will fail. OK?

@sadamov mentioned this pull request Oct 24, 2024
@sadamov (Collaborator) commented Oct 24, 2024

As the TODO list kept getting washed away by newer comments (again 😆) here it is as an issue. Please only use this issue to track progress from now on! #80

@sadamov (Collaborator) commented Oct 28, 2024

@joeloskarsson @khintz
I am done with all TODOs from Leif 🚀 except for the README, which someone can rewrite based on this latest PR: leifdenby#2
I cannot add reviewers to this PR, however; maybe Kasper has more access? I think I could push to Leif's branch directly. Not sure if he would approve of such brute-force methods. Thoughts?
I even managed to pass all tests 💚! On some systems I had issues with gloo and test_training.py though.
In general I simply implemented the conclusion of each specific TODO comment. If Leif and Joel had no common solution, I made an executive decision 😈. Here are the most noteworthy things I changed.

  1. Changed the dimensions of everything related to an xy-grid to (Nx, Ny, 2). You might want to check:
     • create_graph.py
     • base.py
     • mdp.py
     • store.py
     • dummy_datastore.py
     • test_datastores.py
  2. Implemented feature weighting (see the sketch after this list). You might want to check:
     • base.py
     • mdp.py
     • store.py
     • dummy_datastore.py
  3. Implemented flexible past-future window forcing. You might want to check:
     • ar_model.py
     • weather_dataset.py
  4. Simplified requirements in pyproject.toml.
  5. Extended dummy_datastore.py a bit to handle more realistic coordinates.
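As a generic illustration of what per-feature weighting in the loss could look like (this is not the implementation referred to above; names and shapes are made up):

```python
# Generic sketch of per-feature loss weighting: each predicted feature gets
# its own weight when averaging the error. Names and shapes are illustrative.
import torch


def weighted_mse(
    pred: torch.Tensor, target: torch.Tensor, feature_weights: torch.Tensor
) -> torch.Tensor:
    """pred/target: (batch, grid, n_features); feature_weights: (n_features,)."""
    per_feature_error = (pred - target) ** 2  # (batch, grid, n_features)
    return (per_feature_error * feature_weights).mean()
```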

@khintz (Contributor) commented Oct 29, 2024

Great work @sadamov! I can't add reviewers on Leif's branch, but he is making the mistake of showing up at DMI physically tomorrow, so if not before, I will track him down then 😄

@leifdenby (Member, Author) commented

Not sure if he would approve of such brute-force methods. Thoughts?

Very happy with that @sadamov ! :D And thanks for taking up the mantle here. I will catch up with @khintz tomorrow and he can update me on where things need to be reviewed/adjusted.

@joeloskarsson (Collaborator) commented

Great stuff @sadamov ! I hope to give this another read through with all the new changes before we merge, and then I can take a special look at the points 1-5 that you mention (but we have already discussed a few of them, so should be all good 😄). It is probably easiest for me to do that though when all the changes are in this PR.
