Add "datastores" to represent input data from zarr, npy, etc #66

Open — wants to merge 268 commits into base: main (changes shown from 250 commits)

Commits (268)
e80aa58
ar_steps for training and eval
May 9, 2024
a86fc07
smaller ammendments
May 9, 2024
7ae9c87
Dummy mask was inverted - fixed
May 9, 2024
93674a2
replace hardcoded normalization path
May 9, 2024
0afdfee
wip on simplifying pre-commit setup
leifdenby May 13, 2024
28118a6
setup pylint version
leifdenby May 13, 2024
3da3108
remove external deps install in cicd linting
leifdenby May 13, 2024
ea64309
create project
leifdenby May 13, 2024
0c68537
replace absolute imports with relative
leifdenby May 13, 2024
4b77be6
simplify black config
leifdenby May 13, 2024
f2bae03
headers for import sections no longer needed
leifdenby May 13, 2024
1d12b0d
minor fixes
leifdenby May 13, 2024
ad0accc
run on all branch pushes
leifdenby May 13, 2024
3e69502
rename action to "lint"
leifdenby May 13, 2024
681c7b1
add ci/cd test for imports
leifdenby May 13, 2024
5ad0230
py version must be quoted
leifdenby May 13, 2024
35987e5
fix torch install url
leifdenby May 13, 2024
148d7f6
use pdm in ci/cd
leifdenby May 13, 2024
b912d1a
disable cache for now
leifdenby May 13, 2024
b656445
check in lock file
leifdenby May 13, 2024
248196f
add pytest
leifdenby May 13, 2024
1af1576
cache in cicd
leifdenby May 13, 2024
7ed7c97
add torch-geometric to deps
leifdenby May 13, 2024
2869952
fix import and more tests
leifdenby May 13, 2024
9aaaecd
Merge branch 'maint/simplify-precommit-setup' into maint/refactor-as-…
leifdenby May 13, 2024
358c8d6
pdm to sync to requirements.txt
leifdenby May 13, 2024
6c3bdce
update requirements.txt
leifdenby May 13, 2024
fbd6a2b
more import tests
leifdenby May 13, 2024
93190de
move deps to projects and add import tests
leifdenby May 22, 2024
afd6012
add cicd testing workflow
leifdenby May 22, 2024
4013796
test both with pdm and pip install
leifdenby May 22, 2024
de72b95
clean up test cicd
leifdenby May 22, 2024
4d78c68
remove requirements.txt
leifdenby May 23, 2024
f2cbc44
create 3D mesh objects for schol AR
May 23, 2024
c0e7529
fixed math writing
sadamov May 28, 2024
af7751a
Merge branch 'feature_dataset_yaml' of https://github.com/joeloskarss…
sadamov May 28, 2024
5f538f9
cherry-pick with main
sadamov May 28, 2024
6685e94
bugfixes
sadamov May 28, 2024
6423fdf
pre_commits
sadamov May 28, 2024
59c4947
Merge remote-tracking branch 'origin/main' into feature_dataset_yaml
sadamov May 31, 2024
4e457ed
config.py is ready for danra
sadamov May 31, 2024
adc592f
streamlined multi-zarr workflow
sadamov Jun 1, 2024
a7bea6b
xarray zarr based data normalization
sadamov Jun 1, 2024
1f7cbe8
adjusted pre-processing scripts to new data config workflow
sadamov Jun 2, 2024
e328152
plotting update with latest get_xy() function
sadamov Jun 2, 2024
cb85cda
making data config more modular
sadamov Jun 2, 2024
eb8c6fb
removing boundaries for now
sadamov Jun 2, 2024
0cfbb33
small updates
sadamov Jun 2, 2024
59d0c8a
improved stats and units retrieval
sadamov Jun 2, 2024
2f6a87a
add GPU-based runner on cirun.io
leifdenby Jun 3, 2024
668dd81
improved zarr-based normalization
sadamov Jun 3, 2024
143cf2a
pdm install with cpu torch
leifdenby Jun 3, 2024
b760915
ensure exec in pdm venv
leifdenby Jun 3, 2024
7797cef
ensure exec in pdm venv
leifdenby Jun 3, 2024
e689650
check version #2
leifdenby Jun 3, 2024
fb8ef23
check version no 3
leifdenby Jun 3, 2024
51b0a0b
check versions
leifdenby Jun 3, 2024
374d032
merge main
sadamov Jun 3, 2024
8fa3ca7
Introduced datetime forcing calculation as seperate script
sadamov Jun 3, 2024
a748903
Fixed order of y and x dims to adhere to #52
sadamov Jun 3, 2024
70425ee
fix for pip install
leifdenby Jun 3, 2024
60110f6
switch cirun instance type
leifdenby Jun 3, 2024
6fff3fc
install py39 on cirun runner
leifdenby Jun 3, 2024
74b4a10
cleanup: boundary_mask, zarr-opening, utils
sadamov Jun 4, 2024
0a041d1
Merge remote-tracking branch 'origin/main' into feature_dataset_yaml
sadamov Jun 4, 2024
8054e9e
change ami image to gpu
leifdenby Jun 4, 2024
39fbf3a
Merge remote-tracking branch 'upstream/main' into maint/deps-in-pypro…
leifdenby Jun 4, 2024
97aeb2e
use cheaper gpu instance
leifdenby Jun 4, 2024
425123c
adapted tests for zarr-analysis data
sadamov Jun 4, 2024
4dcf671
Readme adapted for yaml zarr analysis workflow
sadamov Jun 4, 2024
6d384f0
samller bugfixes and improvements
sadamov Jun 4, 2024
12ff4f2
Added fixed data config file for testing on Danra
sadamov Jun 4, 2024
03f7769
reducing runtime of tests with smaller sample
sadamov Jun 4, 2024
26f069c
download danra data for test and example (streaming not possible)
sadamov Jun 6, 2024
1f1cbcc
bugfixes after real-life testcase
sadamov Jun 6, 2024
b369306
Merge remote-tracking branch 'origin/main' into feature_dataset_yaml
sadamov Jun 6, 2024
0cdc361
organize .zarr in /data
sadamov Jun 6, 2024
23ca7b3
cleanup
sadamov Jun 6, 2024
81422f1
linter
sadamov Jun 6, 2024
124541b
static dataset doesn't have time dim
sadamov Jun 7, 2024
6140fdb
making two complex functions more modular
sadamov Jun 7, 2024
db6a912
chunk dataset by time
sadamov Jun 8, 2024
1aaa8dc
create list first for performance
sadamov Jun 8, 2024
81856b2
converting to_array is very slow
sadamov Jun 8, 2024
b3da818
allow for forcings to not be normalized
sadamov Jun 8, 2024
7ee5398
allow non_normalized_vars to be null
sadamov Jun 8, 2024
4782103
fixed coastlines using new xy_extent function
sadamov Jun 8, 2024
e0ffc5b
Some projections return inverted axes (rotatedPole)
sadamov Jun 9, 2024
c1f43b7
Docstrings added
sadamov Jun 13, 2024
21fd929
wip
leifdenby Jun 26, 2024
c52f98e
npy mllam nearly done
leifdenby Jul 6, 2024
80f3639
minor adjustment
leifdenby Jul 7, 2024
048f8c6
Merge branch 'main' of https://github.com/mllam/neural-lam into maint…
leifdenby Jul 11, 2024
5aaa239
add pooch and tweak pip cicd testing
leifdenby Jul 11, 2024
66c3b03
combine cicd tests with caching
leifdenby Jul 11, 2024
8566b8f
linting
leifdenby Jul 11, 2024
29bd9e5
add pyg dep
leifdenby Jul 11, 2024
bc7f028
set cirun aws region to frankfurt
leifdenby Jul 11, 2024
2070166
adapt image
leifdenby Jul 11, 2024
e4e86e5
set image
leifdenby Jul 11, 2024
1fba8fe
try different image
leifdenby Jul 11, 2024
02b77cf
add pooch to cicd
leifdenby Jul 11, 2024
b481929
add pdm gpu test
leifdenby Jul 16, 2024
bcec472
start work on readme
leifdenby Jul 16, 2024
c5beec9
Merge branch 'maint/deps-in-pyproject-toml' into datastore
leifdenby Jul 16, 2024
e89facc
Merge branch 'main' into maint/refactor-as-package
leifdenby Jul 16, 2024
0b5687a
Merge branch 'main' of https://github.com/mllam/neural-lam into maint…
leifdenby Jul 16, 2024
095fdbc
turn meps testdata download into pytest fixture
leifdenby Jul 16, 2024
49e9bfe
adapt README for package
leifdenby Jul 16, 2024
12cc02b
remove pdm cicd test (will be in separate PR)
leifdenby Jul 16, 2024
b47f50b
remove pdm in gitignore
leifdenby Jul 16, 2024
90d99ca
remove pdm and pyproject files (will be sep PR)
leifdenby Jul 16, 2024
a91eaaa
add pyproject.toml from main
leifdenby Jul 16, 2024
5508cea
clean out tests
leifdenby Jul 16, 2024
5c623c3
fix linting
leifdenby Jul 16, 2024
08ec168
add cli entrypoints import test
leifdenby Jul 16, 2024
d9cf7ba
Merge branch 'maint/refactor-as-package' into datastore
leifdenby Jul 16, 2024
3954f04
tweak cicd pytest execution
leifdenby Jul 16, 2024
f99fdce
Merge branch 'maint/refactor-as-package' into datastore
leifdenby Jul 16, 2024
db9d96f
Update tests/test_mllam_dataset.py
leifdenby Jul 17, 2024
3c864b2
grid-shape ok
leifdenby Jul 17, 2024
1f54b0e
get_vars_names and units
leifdenby Jul 17, 2024
9b88160
get_vars_names and units 2
leifdenby Jul 17, 2024
a9fdad5
test for stats
leifdenby Jul 23, 2024
555154f
get_dataarray test
leifdenby Jul 24, 2024
8b8a77e
get_dataarray test
leifdenby Jul 24, 2024
41f11cd
boundary_mask
leifdenby Jul 24, 2024
a17de0f
get_xy
leifdenby Jul 24, 2024
0a38a7d
remove TrainingSample dataclass
leifdenby Jul 24, 2024
f65f6b5
test for WeatherDataset.__getitem__
leifdenby Jul 24, 2024
a35100e
test for graph creation
leifdenby Jul 24, 2024
cfb0618
more graph creation tests
leifdenby Jul 24, 2024
8698719
check for consistency of num features across splits
leifdenby Jul 24, 2024
3381404
test for single batch from mllam through model
leifdenby Jul 24, 2024
2a6796c
Add init files to expose classes in editable package
joeloskarsson Jul 24, 2024
8f4e0e0
Linting
joeloskarsson Jul 24, 2024
e657abb
working training_step with datastores!
Jul 25, 2024
effc99b
remove superfluous tests
Jul 25, 2024
a047026
fix for dataset length
Jul 25, 2024
d2c62ed
step length should be int
Jul 25, 2024
58f5d99
step length should be int
Jul 25, 2024
64d43a6
training working with mllam datastore!
Jul 25, 2024
07444f8
adapt neural_lam.train_model for datastores
Jul 25, 2024
d1b6fc1
fixes for npy
Jul 25, 2024
6fe19ac
npyfiles datastore complete
leifdenby Jul 26, 2024
fe65a4d
cleanup for datastore examples
leifdenby Jul 26, 2024
e533794
training on ohm with danra!
Jul 26, 2024
640ac05
use mllam-data-prep v0.2.0
Aug 5, 2024
0f16f13
remove py3.12 from pre-commit
Aug 5, 2024
724548e
cleanup
Aug 8, 2024
a1b2037
all tests passing!
Aug 12, 2024
e35958f
use mllam-data-prep v0.3.0
Aug 12, 2024
8b92318
delete requirements.txt
Aug 13, 2024
658836a
remove .DS_Store
Aug 13, 2024
421efed
use tmate in gpu pdm cicd
Aug 13, 2024
05f1e9f
remove requirements
Aug 13, 2024
3afe0e4
update pdm gpu cicd setup to pdm venv on nvme drive
Aug 13, 2024
f3d028b
don't try to use pdm venv in-project
Aug 13, 2024
2c35662
remove tmate
Aug 13, 2024
5f30255
update README with install instructions
Aug 14, 2024
b2b5631
changelog
Aug 14, 2024
c8ae829
update ci/cd badges to include gpu + gpu
Aug 14, 2024
e7cf2c0
Merge pull request #1 from mllam/package_inits
leifdenby Aug 14, 2024
0b72e9d
add pyproject-flake8 to precommit config
Aug 14, 2024
190d1de
use Flake8-pyproject instead
Aug 14, 2024
791af0a
update README
Aug 14, 2024
58fab84
Merge branch 'maint/deps-in-pyproject-toml' into feat/datastores
Aug 14, 2024
dbe2e6d
Merge branch 'maint/refactor-as-package' into maint/deps-in-pyproject…
Aug 14, 2024
eac6e35
Merge branch 'maint/deps-in-pyproject-toml' into feat/datastores
Aug 14, 2024
799d55e
linting fixes
Aug 14, 2024
57bbb81
train only 1 epoch in cicd and print to stdout
Aug 14, 2024
a955cee
log datastore config
Aug 14, 2024
0a79c74
cleanup doctrings
Aug 15, 2024
9f3c014
Merge branch 'maint/refactor-as-package' into datastore
leifdenby Aug 19, 2024
41364a8
Merge branch 'main' of https://github.com/mllam/neural-lam into maint…
leifdenby Aug 19, 2024
3422298
update changelog
leifdenby Aug 19, 2024
689ef69
move dev deps optional dependencies group
leifdenby Aug 20, 2024
9a0d538
update cicd tests to install dev deps
leifdenby Aug 20, 2024
bddfcaf
update readme with new dev deps group
leifdenby Aug 20, 2024
b96cfdc
quote the skip step the install readme
leifdenby Aug 20, 2024
2600dee
remove unused files
leifdenby Aug 20, 2024
65a8074
Merge branch 'feat/datastores' of https://github.com/leifdenby/neural…
leifdenby Aug 20, 2024
6adf6cc
revert to line length of 80
leifdenby Aug 20, 2024
46b37f8
revert docstring formatting changes
leifdenby Aug 20, 2024
3cd0f8b
pin numpy to <2.0.0
leifdenby Aug 20, 2024
826270a
Merge branch 'maint/deps-in-pyproject-toml' into feat/datastores
leifdenby Aug 20, 2024
4ba22ea
Merge branch 'main' into feat/datastores
leifdenby Aug 20, 2024
1f661c6
fix flake8 linting errors
leifdenby Aug 20, 2024
4838872
Update neural_lam/weather_dataset.py
leifdenby Sep 8, 2024
b59e7e5
Update neural_lam/datastore/multizarr/create_normalization_stats.py
leifdenby Sep 8, 2024
75b1fe7
Update neural_lam/datastore/npyfiles/store.py
leifdenby Sep 8, 2024
7e736cb
Update neural_lam/datastore/npyfiles/store.py
leifdenby Sep 8, 2024
613a7e2
Update neural_lam/datastore/npyfiles/store.py
leifdenby Sep 8, 2024
65e199b
Update tests/test_training.py
leifdenby Sep 8, 2024
4435e26
Update tests/test_datasets.py
leifdenby Sep 8, 2024
4693408
Update README.md
leifdenby Sep 8, 2024
2dfed2c
update README
leifdenby Sep 10, 2024
c3d033d
Merge branch 'main' of https://github.com/mllam/neural-lam into feat/…
leifdenby Sep 10, 2024
4a70268
Merge branch 'feat/datastores' of https://github.com/leifdenby/neural…
leifdenby Sep 10, 2024
66c663f
column_water -> open_water_fraction
leifdenby Sep 10, 2024
11a7978
fix linting
leifdenby Sep 10, 2024
a41c314
static data same for all splits
leifdenby Sep 10, 2024
6f1efd6
forcing_window_size from args
leifdenby Sep 10, 2024
bacb9ec
Update neural_lam/datastore/base.py
leifdenby Sep 10, 2024
4a9db4e
only use first ensemble member in datastores
leifdenby Sep 10, 2024
4fc2448
Merge branch 'feat/datastores' of https://github.com/leifdenby/neural…
leifdenby Sep 10, 2024
bcaa919
Update neural_lam/datastore/base.py
leifdenby Sep 10, 2024
90bc594
Update neural_lam/datastore/base.py
leifdenby Sep 10, 2024
5bda935
Update neural_lam/datastore/base.py
leifdenby Sep 10, 2024
8e7931d
remove all multizarr functionality
leifdenby Sep 10, 2024
6998683
cleanup and test fixes for recent changes
leifdenby Sep 10, 2024
c415008
Merge branch 'feat/datastores' of https://github.com/leifdenby/neural…
leifdenby Sep 10, 2024
735d324
fix linting
leifdenby Sep 10, 2024
5f2d919
remove multizar example files
leifdenby Sep 10, 2024
5263d2c
normalization -> standardization
leifdenby Sep 10, 2024
ba1bec3
fix import for tests
leifdenby Sep 10, 2024
d04d15e
Update neural_lam/datastore/base.py
leifdenby Sep 10, 2024
743d7a1
fix coord issues and add datastore example plotting cli
leifdenby Sep 12, 2024
ac10d7d
add lru_cache to get_xy_extent
leifdenby Sep 12, 2024
bf8172a
MLLAMDatastore -> MDPDatastore
leifdenby Sep 12, 2024
90ca400
missed renames for MDPDatastore
leifdenby Sep 12, 2024
154139d
update graph plot for datastores
leifdenby Sep 12, 2024
50ee0b0
use relative import
leifdenby Sep 12, 2024
7dfd570
add long_names and refactor npyfiles create weights
leifdenby Sep 12, 2024
2b45b5a
Update neural_lam/weather_dataset.py
leifdenby Sep 23, 2024
aee0b1c
Update neural_lam/weather_dataset.py
leifdenby Sep 23, 2024
8453c2b
Update neural_lam/models/ar_model.py
leifdenby Sep 27, 2024
7f32557
Update neural_lam/weather_dataset.py
leifdenby Sep 27, 2024
67998b8
read projection from datastore config extra section
leifdenby Sep 27, 2024
ac7e46a
NpyFilesDatastore -> NpyFilesDatastoreMEPS
leifdenby Sep 27, 2024
b7bf506
revert tp training with 1 AR step by default
leifdenby Sep 27, 2024
5df2ecf
add missing kwarg to BaseHiGraphModel.__init__
leifdenby Sep 27, 2024
d4d438f
add missing kwarg to HiLAM.__init__
leifdenby Sep 27, 2024
1889771
add missing kwarg to HiLAMParallel
leifdenby Sep 27, 2024
2c3bbde
check that for enough forecast steps given ar_steps
leifdenby Sep 27, 2024
f0a151b
remove numpy<2.0.0 version cap
leifdenby Sep 27, 2024
f3566b0
tweak print statement working in mdp
Oct 1, 2024
dba94b3
fix missed removed argument from cli
Oct 1, 2024
bca1482
remove wandb config log comment, we log now
Oct 1, 2024
fc973c4
ensure loading from checkpoint during train possible
Oct 1, 2024
9fcf06e
get step_length from datastore in plot_error_map
leifdenby Oct 1, 2024
2bbe666
remove step_legnth attr in ARModel
leifdenby Oct 1, 2024
b41ed2f
remove unused obs_mask arg for vis.plot_prediction
leifdenby Oct 1, 2024
7e46194
ensure no reference to multizarr "data_config"
leifdenby Oct 1, 2024
b57bc7a
introduce neural-lam config
leifdenby Oct 2, 2024
2b30715
include meps neural-lam config example
leifdenby Oct 2, 2024
8e7b2e6
fix extra space typo in BaseDatastore
leifdenby Oct 2, 2024
e0300fb
add check and print of train/test/val split in MDPDatastore
leifdenby Oct 2, 2024
d1b4ca7
BaseCartesianDatastore -> BaseRegularGridDatastore
leifdenby Oct 3, 2024
de46fb4
removed `control_only' arg
sadamov Oct 23, 2024
2 changes: 1 addition & 1 deletion .github/workflows/pre-commit.yml
@@ -13,7 +13,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.9", "3.10", "3.11", "3.12"]
python-version: ["3.9", "3.10", "3.11"]
steps:
- uses: actions/checkout@v2
- name: Set up Python
10 changes: 8 additions & 2 deletions .gitignore
@@ -1,14 +1,14 @@
### Project Specific ###
wandb
slurm_log*
saved_models
lightning_logs
data
graphs
*.sif
sweeps
test_*.sh
.vscode
*.html
*.zarr
*slurm*

### Python ###
@@ -75,8 +75,14 @@ tags

# Coc configuration directory
.vim
.vscode

# macos
.DS_Store

# pdm (https://pdm-project.org/en/stable/)
.pdm-python
.venv

# exclude pdm.lock file so that both cpu and gpu versions of torch will be accepted by pdm
pdm.lock
124 changes: 68 additions & 56 deletions README.md
@@ -46,16 +46,54 @@ Still, some restrictions are inevitable:
</p>


## A note on the limited area setting
Currently we are using these models on a limited area covering the Nordic region, the so-called MEPS area (see [paper](https://arxiv.org/abs/2309.17370)).
There are still some parts of the code that are quite specific to the MEPS area use case.
This is in particular true for the mesh graph creation (`python -m neural_lam.create_mesh`) and some of the constants set in a `data_config.yaml` file (path specified in `python -m neural_lam.train_model --data_config <data-config-filepath>`).
If there is interest in using Neural-LAM for other areas, it is not a substantial undertaking to refactor the code to be fully area-agnostic.
We would be happy to support such enhancements.
See the issues https://github.com/joeloskarsson/neural-lam/issues/2, https://github.com/joeloskarsson/neural-lam/issues/3 and https://github.com/joeloskarsson/neural-lam/issues/4 for some initial ideas on how this could be done.

# Using Neural-LAM
Below follows instructions on how to use Neural-LAM to train and evaluate models.
Below are instructions on how to use Neural-LAM to train and evaluate models. Once `neural-lam` has been installed, the general process is:

1. Run any pre-processing scripts to generate the necessary derived data that your chosen datastore requires
2. Run the graph-creation step
3. Train the model

## Data

To enable flexibility in what input-data sources can be used with neural-lam,
the input-data representation is split into two parts:

1. a "datastore" (represented by instances of
[neural_lam.datastore.BaseDataStore](neural_lam/datastore/base.py)) which
takes care of loading a given category (state, forcing or static) and split
(train/val/test) of data from disk and returning it as an `xarray.DataArray`.
The returned data-array is expected to have the spatial coordinates
flattened into a single `grid_index` dimension and all variables and vertical
levels stacked into a feature dimension (named `{category}_feature`). The
datastore also provides information about the number, names and units of
variables in the data, the boundary mask, normalisation values and grid
information.

2. a `pytorch.Dataset`-derived class (called
`neural_lam.weather_dataset.WeatherDataset`) which takes care of sampling in
time to create individual samples for training, validation and testing. The
`WeatherDataset` class is also responsible for normalising the values and
returning `torch.Tensor`-objects.
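As a rough illustration of the time-sampling responsibility described above, the windowing might look something like the sketch below. The helper names are hypothetical and numpy stands in for the actual `torch.Tensor` objects; the real `WeatherDataset` implementation differs in its details.

```python
import numpy as np

def n_samples(n_timesteps: int, ar_steps: int, n_history: int = 2) -> int:
    # number of complete windows (n_history init states + ar_steps targets)
    # that fit in a time series of length n_timesteps
    return n_timesteps - n_history - ar_steps + 1

def sample_window(data: np.ndarray, idx: int, ar_steps: int, n_history: int = 2):
    # slice one training sample from data shaped (time, grid_index, state_feature)
    init_states = data[idx : idx + n_history]  # states fed to the model
    target_states = data[idx + n_history : idx + n_history + ar_steps]  # rollout targets
    return init_states, target_states

# toy dataset: 10 timesteps, 6 grid points, 3 state features
data = np.zeros((10, 6, 3))
init, target = sample_window(data, idx=0, ar_steps=4)
print(n_samples(10, ar_steps=4))  # 5
print(init.shape, target.shape)   # (2, 6, 3) (4, 6, 3)
```

This makes the "fix for dataset length" commits above easy to reason about: the sample count shrinks as `ar_steps` grows, since each sample needs that many future timesteps available.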

There are currently three different datastores implemented in the codebase:

1. `neural_lam.datastore.NpyDataStore` which reads MEPS data from `.npy`-files in
the format introduced in neural-lam `v0.1.0`. Note that this datastore is specific to the format of the MEPS dataset, but can act as an example for how to create similar numpy-based datastores.

2. `neural_lam.datastore.MultizarrDatastore` which can combine multiple zarr
files during train/val/test sampling, with the transformations to facilitate
this implemented within `neural_lam.datastore.MultizarrDatastore` itself.

3. `neural_lam.datastore.MDPDatastore` which can combine multiple zarr
datasets either as a preprocessing step or during sampling, but
offloads the implementation of the transformations to the
[mllam-data-prep](https://github.com/mllam/mllam-data-prep) package.

If none of these options fits your needs, you can create your own datastore by
subclassing the `neural_lam.datastore.BaseDataStore` class (or the
`neural_lam.datastore.BaseCartesianDatastore` class if your data is stored on
a Cartesian grid) and implementing the abstract methods.
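To give a feel for what implementing such a class involves, here is a minimal toy sketch. The method names echo ones mentioned in this PR (`get_vars_names`, `get_dataarray`, `boundary_mask`), but the signatures and behaviour are simplified assumptions for illustration, not the actual `BaseDataStore` API.

```python
import numpy as np

class ToyDatastore:
    """Hypothetical, simplified datastore: serves an array shaped
    (grid_index, {category}_feature) per data category and split."""

    def __init__(self, n_grid: int = 16, state_vars=("u", "v", "t")):
        self._vars = {"state": list(state_vars)}
        self._n_grid = n_grid

    def get_vars_names(self, category: str):
        # names of the variables making up the feature dimension
        return self._vars[category]

    def get_dataarray(self, category: str, split: str):
        # a real implementation would load from disk (zarr, npy, ...) and
        # flatten the y/x coordinates into a single grid_index dimension
        n_features = len(self._vars[category])
        return np.zeros((self._n_grid, n_features))

    @property
    def boundary_mask(self):
        # True for grid points on the (lateral) boundary of the domain
        mask = np.zeros(self._n_grid, dtype=bool)
        mask[:4] = True  # e.g. the first few points form the boundary
        return mask

store = ToyDatastore()
print(store.get_dataarray("state", "train").shape)  # (16, 3)
```

The point of the interface is that `WeatherDataset` only ever sees this flattened `(grid_index, feature)` view, so it never needs to know whether the underlying files were zarr, npy, or something else.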


## Installation

Expand Down Expand Up @@ -103,16 +141,29 @@ Note that this is far too little data to train any useful models, but all pre-pr
It should thus be useful to make sure that your python environment is set up correctly and that all the code can be ran without any issues.

## Pre-processing
Review comment (Collaborator):
This entire pre-processing section, including the figure, requires updating.


There are two main steps in the pre-processing pipeline: creating the graph and creating additional features/normalisation/boundary-masks.

The amount of pre-processing required will depend on what kind of datastore you will be using for training.

### Additional inputs

#### MultiZarr Datastore

* `python -m neural_lam.create_boundary_mask`
* `python -m neural_lam.create_datetime_forcings`
* `python -m neural_lam.create_norm`

#### NpyFiles Datastore

#### MDP (mllam-data-prep) Datastore

An overview of how the different pre-processing steps, training and files depend on each other is given in this figure:
<p align="middle">
<img src="figures/component_dependencies.png"/>
</p>
In order to start training models, at least three pre-processing steps have to be run:

* `python -m neural_lam.create_mesh`
* `python -m neural_lam.create_grid_features`
* `python -m neural_lam.create_parameter_weights`

### Create graph
Run `python -m neural_lam.create_mesh` with suitable options to generate the graph you want to use (see `python -m neural_lam.create_mesh --help` for a list of options).
Review comment (Contributor):
With the rename of create_mesh.py to create_graph.py this should be updated.
Also, python neural_lam.create_mesh --help is missing an -m

Review comment (Member, author):
Yes the whole README needs a big reworking!

The graphs used for the different models in the [paper](https://arxiv.org/abs/2309.17370) can be created as:
@@ -143,11 +194,12 @@ wandb off
```

## Train Models
Review comment (Collaborator):
The arguments for train_model have changed now, should be updated

Models can be trained using `python -m neural_lam.train_model`.
Models can be trained using `python -m neural_lam.train_model <datastore_type> <datastore_config_path>`.
Run `python -m neural_lam.train_model --help` for a full list of training options.
A few of the key ones are outlined below:

* `--dataset`: Which data to train on
* `<datastore_type>`: The kind of datastore that you are using (should be one of `npyfiles`, `multizarr` or `mllam`)
* `<datastore_config_path>`: Path to the data store configuration file
* `--model`: Which model to train
* `--graph`: Which graph to use with the model
* `--processor_layers`: Number of GNN layers to use in the processing part of the model
@@ -204,47 +256,7 @@ Some options specifically important for evaluation are:
# Repository Structure
Except for training and pre-processing scripts all the source code can be found in the `neural_lam` directory.
Model classes, including abstract base classes, are located in `neural_lam/models`.

## Format of data directory
It is possible to store multiple datasets in the `data` directory.
Each dataset contains a set of files with static features and a set of samples.
The samples are split into different sub-directories for training, validation and testing.
The directory structure is shown with examples below.
Script names within parenthesis denote the script used to generate the file.
```
data
├── dataset1
│ ├── samples - Directory with data samples
│ │ ├── train - Training data
│ │ │ ├── nwp_2022040100_mbr000.npy - A time series sample
│ │ │ ├── nwp_2022040100_mbr001.npy
│ │ │ ├── ...
│ │ │ ├── nwp_2022043012_mbr001.npy
│ │ │ ├── nwp_toa_downwelling_shortwave_flux_2022040100.npy - Solar flux forcing
│ │ │ ├── nwp_toa_downwelling_shortwave_flux_2022040112.npy
│ │ │ ├── ...
│ │ │ ├── nwp_toa_downwelling_shortwave_flux_2022043012.npy
│ │ │ ├── wtr_2022040100.npy - Open water features for one sample
│ │ │ ├── wtr_2022040112.npy
│ │ │ ├── ...
│ │ │ └── wtr_202204012.npy
│ │ ├── val - Validation data
│ │ └── test - Test data
│ └── static - Directory with graph information and static features
│ ├── nwp_xy.npy - Coordinates of grid nodes (part of dataset)
│ ├── surface_geopotential.npy - Geopotential at surface of grid nodes (part of dataset)
│ ├── border_mask.npy - Mask with True for grid nodes that are part of border (part of dataset)
│ ├── grid_features.pt - Static features of grid nodes (neural_lam.create_grid_features)
│ ├── parameter_mean.pt - Means of state parameters (neural_lam.create_parameter_weights)
│ ├── parameter_std.pt - Std.-dev. of state parameters (neural_lam.create_parameter_weights)
│ ├── diff_mean.pt - Means of one-step differences (neural_lam.create_parameter_weights)
│ ├── diff_std.pt - Std.-dev. of one-step differences (neural_lam.create_parameter_weights)
│ ├── flux_stats.pt - Mean and std.-dev. of solar flux forcing (neural_lam.create_parameter_weights)
│ └── parameter_weights.npy - Loss weights for different state parameters (neural_lam.create_parameter_weights)
├── dataset2
├── ...
└── datasetN
```
Notebooks for visualization and analysis are located in `docs`.

## Format of graph directory
The `graphs` directory contains generated graph structures that can be used by different graph-based models.
1 change: 0 additions & 1 deletion neural_lam/__init__.py
@@ -1,5 +1,4 @@
# First-party
import neural_lam.config
import neural_lam.interaction_net
import neural_lam.metrics
import neural_lam.models
62 changes: 0 additions & 62 deletions neural_lam/config.py

This file was deleted.
