Add "datastores" to represent input data from zarr, npy, etc #66

Open — wants to merge 268 commits into base: main (changes shown from 250 commits)

Commits (268)
e80aa58
ar_steps for training and eval
May 9, 2024
a86fc07
smaller ammendments
May 9, 2024
7ae9c87
Dummy mask was inverted - fixed
May 9, 2024
93674a2
replace hardcoded normalization path
May 9, 2024
0afdfee
wip on simplifying pre-commit setup
leifdenby May 13, 2024
28118a6
setup pylint version
leifdenby May 13, 2024
3da3108
remove external deps install in cicd linting
leifdenby May 13, 2024
ea64309
create project
leifdenby May 13, 2024
0c68537
replace absolute imports with relative
leifdenby May 13, 2024
4b77be6
simplify black config
leifdenby May 13, 2024
f2bae03
headers for import sections no longer needed
leifdenby May 13, 2024
1d12b0d
minor fixes
leifdenby May 13, 2024
ad0accc
run on all branch pushes
leifdenby May 13, 2024
3e69502
rename action to "lint"
leifdenby May 13, 2024
681c7b1
add ci/cd test for imports
leifdenby May 13, 2024
5ad0230
py version must be quoted
leifdenby May 13, 2024
35987e5
fix torch install url
leifdenby May 13, 2024
148d7f6
use pdm in ci/cd
leifdenby May 13, 2024
b912d1a
disable cache for now
leifdenby May 13, 2024
b656445
check in lock file
leifdenby May 13, 2024
248196f
add pytest
leifdenby May 13, 2024
1af1576
cache in cicd
leifdenby May 13, 2024
7ed7c97
add torch-geometric to deps
leifdenby May 13, 2024
2869952
fix import and more tests
leifdenby May 13, 2024
9aaaecd
Merge branch 'maint/simplify-precommit-setup' into maint/refactor-as-…
leifdenby May 13, 2024
358c8d6
pdm to sync to requirements.txt
leifdenby May 13, 2024
6c3bdce
update requirements.txt
leifdenby May 13, 2024
fbd6a2b
more import tests
leifdenby May 13, 2024
93190de
move deps to projects and add import tests
leifdenby May 22, 2024
afd6012
add cicd testing workflow
leifdenby May 22, 2024
4013796
test both with pdm and pip install
leifdenby May 22, 2024
de72b95
clean up test cicd
leifdenby May 22, 2024
4d78c68
remove requirements.txt
leifdenby May 23, 2024
f2cbc44
create 3D mesh objects for schol AR
May 23, 2024
c0e7529
fixed math writing
sadamov May 28, 2024
af7751a
Merge branch 'feature_dataset_yaml' of https://github.com/joeloskarss…
sadamov May 28, 2024
5f538f9
cherry-pick with main
sadamov May 28, 2024
6685e94
bugfixes
sadamov May 28, 2024
6423fdf
pre_commits
sadamov May 28, 2024
59c4947
Merge remote-tracking branch 'origin/main' into feature_dataset_yaml
sadamov May 31, 2024
4e457ed
config.py is ready for danra
sadamov May 31, 2024
adc592f
streamlined multi-zarr workflow
sadamov Jun 1, 2024
a7bea6b
xarray zarr based data normalization
sadamov Jun 1, 2024
1f7cbe8
adjusted pre-processing scripts to new data config workflow
sadamov Jun 2, 2024
e328152
plotting update with latest get_xy() function
sadamov Jun 2, 2024
cb85cda
making data config more modular
sadamov Jun 2, 2024
eb8c6fb
removing boundaries for now
sadamov Jun 2, 2024
0cfbb33
small updates
sadamov Jun 2, 2024
59d0c8a
improved stats and units retrieval
sadamov Jun 2, 2024
2f6a87a
add GPU-based runner on cirun.io
leifdenby Jun 3, 2024
668dd81
improved zarr-based normalization
sadamov Jun 3, 2024
143cf2a
pdm install with cpu torch
leifdenby Jun 3, 2024
b760915
ensure exec in pdm venv
leifdenby Jun 3, 2024
7797cef
ensure exec in pdm venv
leifdenby Jun 3, 2024
e689650
check version #2
leifdenby Jun 3, 2024
fb8ef23
check version no 3
leifdenby Jun 3, 2024
51b0a0b
check versions
leifdenby Jun 3, 2024
374d032
merge main
sadamov Jun 3, 2024
8fa3ca7
Introduced datetime forcing calculation as seperate script
sadamov Jun 3, 2024
a748903
Fixed order of y and x dims to adhere to #52
sadamov Jun 3, 2024
70425ee
fix for pip install
leifdenby Jun 3, 2024
60110f6
switch cirun instance type
leifdenby Jun 3, 2024
6fff3fc
install py39 on cirun runner
leifdenby Jun 3, 2024
74b4a10
cleanup: boundary_mask, zarr-opening, utils
sadamov Jun 4, 2024
0a041d1
Merge remote-tracking branch 'origin/main' into feature_dataset_yaml
sadamov Jun 4, 2024
8054e9e
change ami image to gpu
leifdenby Jun 4, 2024
39fbf3a
Merge remote-tracking branch 'upstream/main' into maint/deps-in-pypro…
leifdenby Jun 4, 2024
97aeb2e
use cheaper gpu instance
leifdenby Jun 4, 2024
425123c
adapted tests for zarr-analysis data
sadamov Jun 4, 2024
4dcf671
Readme adapted for yaml zarr analysis workflow
sadamov Jun 4, 2024
6d384f0
samller bugfixes and improvements
sadamov Jun 4, 2024
12ff4f2
Added fixed data config file for testing on Danra
sadamov Jun 4, 2024
03f7769
reducing runtime of tests with smaller sample
sadamov Jun 4, 2024
26f069c
download danra data for test and example (streaming not possible)
sadamov Jun 6, 2024
1f1cbcc
bugfixes after real-life testcase
sadamov Jun 6, 2024
b369306
Merge remote-tracking branch 'origin/main' into feature_dataset_yaml
sadamov Jun 6, 2024
0cdc361
organize .zarr in /data
sadamov Jun 6, 2024
23ca7b3
cleanup
sadamov Jun 6, 2024
81422f1
linter
sadamov Jun 6, 2024
124541b
static dataset doesn't have time dim
sadamov Jun 7, 2024
6140fdb
making two complex functions more modular
sadamov Jun 7, 2024
db6a912
chunk dataset by time
sadamov Jun 8, 2024
1aaa8dc
create list first for performance
sadamov Jun 8, 2024
81856b2
converting to_array is very slow
sadamov Jun 8, 2024
b3da818
allow for forcings to not be normalized
sadamov Jun 8, 2024
7ee5398
allow non_normalized_vars to be null
sadamov Jun 8, 2024
4782103
fixed coastlines using new xy_extent function
sadamov Jun 8, 2024
e0ffc5b
Some projections return inverted axes (rotatedPole)
sadamov Jun 9, 2024
c1f43b7
Docstrings added
sadamov Jun 13, 2024
21fd929
wip
leifdenby Jun 26, 2024
c52f98e
npy mllam nearly done
leifdenby Jul 6, 2024
80f3639
minor adjustment
leifdenby Jul 7, 2024
048f8c6
Merge branch 'main' of https://github.com/mllam/neural-lam into maint…
leifdenby Jul 11, 2024
5aaa239
add pooch and tweak pip cicd testing
leifdenby Jul 11, 2024
66c3b03
combine cicd tests with caching
leifdenby Jul 11, 2024
8566b8f
linting
leifdenby Jul 11, 2024
29bd9e5
add pyg dep
leifdenby Jul 11, 2024
bc7f028
set cirun aws region to frankfurt
leifdenby Jul 11, 2024
2070166
adapt image
leifdenby Jul 11, 2024
e4e86e5
set image
leifdenby Jul 11, 2024
1fba8fe
try different image
leifdenby Jul 11, 2024
02b77cf
add pooch to cicd
leifdenby Jul 11, 2024
b481929
add pdm gpu test
leifdenby Jul 16, 2024
bcec472
start work on readme
leifdenby Jul 16, 2024
c5beec9
Merge branch 'maint/deps-in-pyproject-toml' into datastore
leifdenby Jul 16, 2024
e89facc
Merge branch 'main' into maint/refactor-as-package
leifdenby Jul 16, 2024
0b5687a
Merge branch 'main' of https://github.com/mllam/neural-lam into maint…
leifdenby Jul 16, 2024
095fdbc
turn meps testdata download into pytest fixture
leifdenby Jul 16, 2024
49e9bfe
adapt README for package
leifdenby Jul 16, 2024
12cc02b
remove pdm cicd test (will be in separate PR)
leifdenby Jul 16, 2024
b47f50b
remove pdm in gitignore
leifdenby Jul 16, 2024
90d99ca
remove pdm and pyproject files (will be sep PR)
leifdenby Jul 16, 2024
a91eaaa
add pyproject.toml from main
leifdenby Jul 16, 2024
5508cea
clean out tests
leifdenby Jul 16, 2024
5c623c3
fix linting
leifdenby Jul 16, 2024
08ec168
add cli entrypoints import test
leifdenby Jul 16, 2024
d9cf7ba
Merge branch 'maint/refactor-as-package' into datastore
leifdenby Jul 16, 2024
3954f04
tweak cicd pytest execution
leifdenby Jul 16, 2024
f99fdce
Merge branch 'maint/refactor-as-package' into datastore
leifdenby Jul 16, 2024
db9d96f
Update tests/test_mllam_dataset.py
leifdenby Jul 17, 2024
3c864b2
grid-shape ok
leifdenby Jul 17, 2024
1f54b0e
get_vars_names and units
leifdenby Jul 17, 2024
9b88160
get_vars_names and units 2
leifdenby Jul 17, 2024
a9fdad5
test for stats
leifdenby Jul 23, 2024
555154f
get_dataarray test
leifdenby Jul 24, 2024
8b8a77e
get_dataarray test
leifdenby Jul 24, 2024
41f11cd
boundary_mask
leifdenby Jul 24, 2024
a17de0f
get_xy
leifdenby Jul 24, 2024
0a38a7d
remove TrainingSample dataclass
leifdenby Jul 24, 2024
f65f6b5
test for WeatherDataset.__getitem__
leifdenby Jul 24, 2024
a35100e
test for graph creation
leifdenby Jul 24, 2024
cfb0618
more graph creation tests
leifdenby Jul 24, 2024
8698719
check for consistency of num features across splits
leifdenby Jul 24, 2024
3381404
test for single batch from mllam through model
leifdenby Jul 24, 2024
2a6796c
Add init files to expose classes in editable package
joeloskarsson Jul 24, 2024
8f4e0e0
Linting
joeloskarsson Jul 24, 2024
e657abb
working training_step with datastores!
Jul 25, 2024
effc99b
remove superfluous tests
Jul 25, 2024
a047026
fix for dataset length
Jul 25, 2024
d2c62ed
step length should be int
Jul 25, 2024
58f5d99
step length should be int
Jul 25, 2024
64d43a6
training working with mllam datastore!
Jul 25, 2024
07444f8
adapt neural_lam.train_model for datastores
Jul 25, 2024
d1b6fc1
fixes for npy
Jul 25, 2024
6fe19ac
npyfiles datastore complete
leifdenby Jul 26, 2024
fe65a4d
cleanup for datastore examples
leifdenby Jul 26, 2024
e533794
training on ohm with danra!
Jul 26, 2024
640ac05
use mllam-data-prep v0.2.0
Aug 5, 2024
0f16f13
remove py3.12 from pre-commit
Aug 5, 2024
724548e
cleanup
Aug 8, 2024
a1b2037
all tests passing!
Aug 12, 2024
e35958f
use mllam-data-prep v0.3.0
Aug 12, 2024
8b92318
delete requirements.txt
Aug 13, 2024
658836a
remove .DS_Store
Aug 13, 2024
421efed
use tmate in gpu pdm cicd
Aug 13, 2024
05f1e9f
remove requirements
Aug 13, 2024
3afe0e4
update pdm gpu cicd setup to pdm venv on nvme drive
Aug 13, 2024
f3d028b
don't try to use pdm venv in-project
Aug 13, 2024
2c35662
remove tmate
Aug 13, 2024
5f30255
update README with install instructions
Aug 14, 2024
b2b5631
changelog
Aug 14, 2024
c8ae829
update ci/cd badges to include gpu + gpu
Aug 14, 2024
e7cf2c0
Merge pull request #1 from mllam/package_inits
leifdenby Aug 14, 2024
0b72e9d
add pyproject-flake8 to precommit config
Aug 14, 2024
190d1de
use Flake8-pyproject instead
Aug 14, 2024
791af0a
update README
Aug 14, 2024
58fab84
Merge branch 'maint/deps-in-pyproject-toml' into feat/datastores
Aug 14, 2024
dbe2e6d
Merge branch 'maint/refactor-as-package' into maint/deps-in-pyproject…
Aug 14, 2024
eac6e35
Merge branch 'maint/deps-in-pyproject-toml' into feat/datastores
Aug 14, 2024
799d55e
linting fixes
Aug 14, 2024
57bbb81
train only 1 epoch in cicd and print to stdout
Aug 14, 2024
a955cee
log datastore config
Aug 14, 2024
0a79c74
cleanup doctrings
Aug 15, 2024
9f3c014
Merge branch 'maint/refactor-as-package' into datastore
leifdenby Aug 19, 2024
41364a8
Merge branch 'main' of https://github.com/mllam/neural-lam into maint…
leifdenby Aug 19, 2024
3422298
update changelog
leifdenby Aug 19, 2024
689ef69
move dev deps optional dependencies group
leifdenby Aug 20, 2024
9a0d538
update cicd tests to install dev deps
leifdenby Aug 20, 2024
bddfcaf
update readme with new dev deps group
leifdenby Aug 20, 2024
b96cfdc
quote the skip step the install readme
leifdenby Aug 20, 2024
2600dee
remove unused files
leifdenby Aug 20, 2024
65a8074
Merge branch 'feat/datastores' of https://github.com/leifdenby/neural…
leifdenby Aug 20, 2024
6adf6cc
revert to line length of 80
leifdenby Aug 20, 2024
46b37f8
revert docstring formatting changes
leifdenby Aug 20, 2024
3cd0f8b
pin numpy to <2.0.0
leifdenby Aug 20, 2024
826270a
Merge branch 'maint/deps-in-pyproject-toml' into feat/datastores
leifdenby Aug 20, 2024
4ba22ea
Merge branch 'main' into feat/datastores
leifdenby Aug 20, 2024
1f661c6
fix flake8 linting errors
leifdenby Aug 20, 2024
4838872
Update neural_lam/weather_dataset.py
leifdenby Sep 8, 2024
b59e7e5
Update neural_lam/datastore/multizarr/create_normalization_stats.py
leifdenby Sep 8, 2024
75b1fe7
Update neural_lam/datastore/npyfiles/store.py
leifdenby Sep 8, 2024
7e736cb
Update neural_lam/datastore/npyfiles/store.py
leifdenby Sep 8, 2024
613a7e2
Update neural_lam/datastore/npyfiles/store.py
leifdenby Sep 8, 2024
65e199b
Update tests/test_training.py
leifdenby Sep 8, 2024
4435e26
Update tests/test_datasets.py
leifdenby Sep 8, 2024
4693408
Update README.md
leifdenby Sep 8, 2024
2dfed2c
update README
leifdenby Sep 10, 2024
c3d033d
Merge branch 'main' of https://github.com/mllam/neural-lam into feat/…
leifdenby Sep 10, 2024
4a70268
Merge branch 'feat/datastores' of https://github.com/leifdenby/neural…
leifdenby Sep 10, 2024
66c663f
column_water -> open_water_fraction
leifdenby Sep 10, 2024
11a7978
fix linting
leifdenby Sep 10, 2024
a41c314
static data same for all splits
leifdenby Sep 10, 2024
6f1efd6
forcing_window_size from args
leifdenby Sep 10, 2024
bacb9ec
Update neural_lam/datastore/base.py
leifdenby Sep 10, 2024
4a9db4e
only use first ensemble member in datastores
leifdenby Sep 10, 2024
4fc2448
Merge branch 'feat/datastores' of https://github.com/leifdenby/neural…
leifdenby Sep 10, 2024
bcaa919
Update neural_lam/datastore/base.py
leifdenby Sep 10, 2024
90bc594
Update neural_lam/datastore/base.py
leifdenby Sep 10, 2024
5bda935
Update neural_lam/datastore/base.py
leifdenby Sep 10, 2024
8e7931d
remove all multizarr functionality
leifdenby Sep 10, 2024
6998683
cleanup and test fixes for recent changes
leifdenby Sep 10, 2024
c415008
Merge branch 'feat/datastores' of https://github.com/leifdenby/neural…
leifdenby Sep 10, 2024
735d324
fix linting
leifdenby Sep 10, 2024
5f2d919
remove multizar example files
leifdenby Sep 10, 2024
5263d2c
normalization -> standardization
leifdenby Sep 10, 2024
ba1bec3
fix import for tests
leifdenby Sep 10, 2024
d04d15e
Update neural_lam/datastore/base.py
leifdenby Sep 10, 2024
743d7a1
fix coord issues and add datastore example plotting cli
leifdenby Sep 12, 2024
ac10d7d
add lru_cache to get_xy_extent
leifdenby Sep 12, 2024
bf8172a
MLLAMDatastore -> MDPDatastore
leifdenby Sep 12, 2024
90ca400
missed renames for MDPDatastore
leifdenby Sep 12, 2024
154139d
update graph plot for datastores
leifdenby Sep 12, 2024
50ee0b0
use relative import
leifdenby Sep 12, 2024
7dfd570
add long_names and refactor npyfiles create weights
leifdenby Sep 12, 2024
2b45b5a
Update neural_lam/weather_dataset.py
leifdenby Sep 23, 2024
aee0b1c
Update neural_lam/weather_dataset.py
leifdenby Sep 23, 2024
8453c2b
Update neural_lam/models/ar_model.py
leifdenby Sep 27, 2024
7f32557
Update neural_lam/weather_dataset.py
leifdenby Sep 27, 2024
67998b8
read projection from datastore config extra section
leifdenby Sep 27, 2024
ac7e46a
NpyFilesDatastore -> NpyFilesDatastoreMEPS
leifdenby Sep 27, 2024
b7bf506
revert tp training with 1 AR step by default
leifdenby Sep 27, 2024
5df2ecf
add missing kwarg to BaseHiGraphModel.__init__
leifdenby Sep 27, 2024
d4d438f
add missing kwarg to HiLAM.__init__
leifdenby Sep 27, 2024
1889771
add missing kwarg to HiLAMParallel
leifdenby Sep 27, 2024
2c3bbde
check that for enough forecast steps given ar_steps
leifdenby Sep 27, 2024
f0a151b
remove numpy<2.0.0 version cap
leifdenby Sep 27, 2024
f3566b0
tweak print statement working in mdp
Oct 1, 2024
dba94b3
fix missed removed argument from cli
Oct 1, 2024
bca1482
remove wandb config log comment, we log now
Oct 1, 2024
fc973c4
ensure loading from checkpoint during train possible
Oct 1, 2024
9fcf06e
get step_length from datastore in plot_error_map
leifdenby Oct 1, 2024
2bbe666
remove step_legnth attr in ARModel
leifdenby Oct 1, 2024
b41ed2f
remove unused obs_mask arg for vis.plot_prediction
leifdenby Oct 1, 2024
7e46194
ensure no reference to multizarr "data_config"
leifdenby Oct 1, 2024
b57bc7a
introduce neural-lam config
leifdenby Oct 2, 2024
2b30715
include meps neural-lam config example
leifdenby Oct 2, 2024
8e7b2e6
fix extra space typo in BaseDatastore
leifdenby Oct 2, 2024
e0300fb
add check and print of train/test/val split in MDPDatastore
leifdenby Oct 2, 2024
d1b4ca7
BaseCartesianDatastore -> BaseRegularGridDatastore
leifdenby Oct 3, 2024
de46fb4
removed `control_only' arg
sadamov Oct 23, 2024
2 changes: 1 addition & 1 deletion .github/workflows/pre-commit.yml
@@ -13,7 +13,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.9", "3.10", "3.11", "3.12"]
python-version: ["3.9", "3.10", "3.11"]
steps:
- uses: actions/checkout@v2
- name: Set up Python
10 changes: 8 additions & 2 deletions .gitignore
@@ -1,14 +1,14 @@
### Project Specific ###
wandb
slurm_log*
saved_models
lightning_logs
data
graphs
*.sif
sweeps
test_*.sh
.vscode
*.html
*.zarr
*slurm*

### Python ###
@@ -75,8 +75,14 @@ tags

# Coc configuration directory
.vim
.vscode

# macos
.DS_Store

# pdm (https://pdm-project.org/en/stable/)
.pdm-python
.venv

# exclude pdm.lock file so that both cpu and gpu versions of torch will be accepted by pdm
pdm.lock
124 changes: 68 additions & 56 deletions README.md
@@ -46,16 +46,54 @@ Still, some restrictions are inevitable:
</p>


## A note on the limited area setting
Currently we are using these models on a limited area covering the Nordic region, the so-called MEPS area (see [paper](https://arxiv.org/abs/2309.17370)).
There are still some parts of the code that are quite specific to the MEPS area use case.
This is in particular true for the mesh graph creation (`python -m neural_lam.create_mesh`) and some of the constants set in a `data_config.yaml` file (path specified in `python -m neural_lam.train_model --data_config <data-config-filepath>`).
If there is interest in using Neural-LAM for other areas, it is not a substantial undertaking to refactor the code to be fully area-agnostic.
We would be happy to support such enhancements.
See the issues https://github.com/joeloskarsson/neural-lam/issues/2, https://github.com/joeloskarsson/neural-lam/issues/3 and https://github.com/joeloskarsson/neural-lam/issues/4 for some initial ideas on how this could be done.

# Using Neural-LAM
Below follows instructions on how to use Neural-LAM to train and evaluate models.
Below are instructions on how to use Neural-LAM to train and evaluate models. Once `neural-lam` has been installed, the general process is:

1. Run any pre-processing scripts to generate the necessary derived data that your chosen datastore requires
2. Run the graph-creation step
3. Train the model

## Data

To enable flexibility in what input-data sources can be used with neural-lam,
the input-data representation is split into two parts:

1. a "datastore" (represented by instances of
[neural_lam.datastore.BaseDataStore](neural_lam/datastore/base.py)) which
takes care of loading a given category (state, forcing or static) and split
(train/val/test) of data from disk and returning it as an `xarray.DataArray`.
The returned data-array is expected to have the spatial coordinates
flattened into a single `grid_index` dimension and all variables and vertical
levels stacked into a feature dimension (named `{category}_feature`). The
datastore also provides information about the number, names and units of
variables in the data, the boundary mask, normalisation values and grid
information.

2. a `pytorch.Dataset`-derived class (called
`neural_lam.weather_dataset.WeatherDataset`) which takes care of sampling in
time to create individual samples for training, validation and testing. The
`WeatherDataset` class is also responsible for normalising the values and
returning `torch.Tensor`-objects.
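As a rough illustration of the time-sampling responsibility described above, the windowing might look something like the sketch below. The helper names are hypothetical and numpy stands in for the actual `torch.Tensor` objects; the real `WeatherDataset` implementation differs in its details.

```python
import numpy as np

def n_samples(n_timesteps: int, ar_steps: int, n_history: int = 2) -> int:
    # number of complete windows (n_history init states + ar_steps targets)
    # that fit in a time series of length n_timesteps
    return n_timesteps - n_history - ar_steps + 1

def sample_window(data: np.ndarray, idx: int, ar_steps: int, n_history: int = 2):
    # slice one training sample from data shaped (time, grid_index, state_feature)
    init_states = data[idx : idx + n_history]  # states fed to the model
    target_states = data[idx + n_history : idx + n_history + ar_steps]  # rollout targets
    return init_states, target_states

# toy dataset: 10 timesteps, 6 grid points, 3 state features
data = np.zeros((10, 6, 3))
init, target = sample_window(data, idx=0, ar_steps=4)
print(n_samples(10, ar_steps=4))  # 5
print(init.shape, target.shape)   # (2, 6, 3) (4, 6, 3)
```

This makes the "fix for dataset length" commits above easy to reason about: the sample count shrinks as `ar_steps` grows, since each sample needs that many future timesteps available.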

There are currently three different datastores implemented in the codebase:

1. `neural_lam.datastore.NpyDataStore` which reads MEPS data from `.npy`-files in
the format introduced in neural-lam `v0.1.0`. Note that this datastore is specific to the format of the MEPS dataset, but can act as an example for how to create similar numpy-based datastores.

2. `neural_lam.datastore.MultizarrDatastore` which can combine multiple zarr
files during train/val/test sampling, with the transformations to facilitate
this implemented within `neural_lam.datastore.MultizarrDatastore` itself.

3. `neural_lam.datastore.MDPDatastore` which can combine multiple zarr
datasets either as a preprocessing step or during sampling, but
offloads the implementation of the transformations to the
[mllam-data-prep](https://github.com/mllam/mllam-data-prep) package.

If none of these options fits your needs, you can create your own datastore by
subclassing the `neural_lam.datastore.BaseDataStore` class (or the
`neural_lam.datastore.BaseCartesianDatastore` class if your data is stored on
a Cartesian grid) and implementing the abstract methods.
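To give a feel for what implementing such a class involves, here is a minimal toy sketch. The method names echo ones mentioned in this PR (`get_vars_names`, `get_dataarray`, `boundary_mask`), but the signatures and behaviour are simplified assumptions for illustration, not the actual `BaseDataStore` API.

```python
import numpy as np

class ToyDatastore:
    """Hypothetical, simplified datastore: serves an array shaped
    (grid_index, {category}_feature) per data category and split."""

    def __init__(self, n_grid: int = 16, state_vars=("u", "v", "t")):
        self._vars = {"state": list(state_vars)}
        self._n_grid = n_grid

    def get_vars_names(self, category: str):
        # names of the variables making up the feature dimension
        return self._vars[category]

    def get_dataarray(self, category: str, split: str):
        # a real implementation would load from disk (zarr, npy, ...) and
        # flatten the y/x coordinates into a single grid_index dimension
        n_features = len(self._vars[category])
        return np.zeros((self._n_grid, n_features))

    @property
    def boundary_mask(self):
        # True for grid points on the (lateral) boundary of the domain
        mask = np.zeros(self._n_grid, dtype=bool)
        mask[:4] = True  # e.g. the first few points form the boundary
        return mask

store = ToyDatastore()
print(store.get_dataarray("state", "train").shape)  # (16, 3)
```

The point of the interface is that `WeatherDataset` only ever sees this flattened `(grid_index, feature)` view, so it never needs to know whether the underlying files were zarr, npy, or something else.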


## Installation

Expand Down Expand Up @@ -103,16 +141,29 @@ Note that this is far too little data to train any useful models, but all pre-pr
It should thus be useful to make sure that your python environment is set up correctly and that all the code can be ran without any issues.

## Pre-processing
Review comment (Collaborator):
This entire pre-processing section, including the figure, requires updating.


There are two main steps in the pre-processing pipeline: creating the graph and creating additional features/normalisation/boundary-masks.

The amount of pre-processing required will depend on what kind of datastore you will be using for training.

### Additional inputs

#### MultiZarr Datastore

* `python -m neural_lam.create_boundary_mask`
* `python -m neural_lam.create_datetime_forcings`
* `python -m neural_lam.create_norm`

#### NpyFiles Datastore

#### MDP (mllam-data-prep) Datastore

An overview of how the different pre-processing steps, training and files depend on each other is given in this figure:
<p align="middle">
<img src="figures/component_dependencies.png"/>
</p>
In order to start training models, at least three pre-processing steps have to be run:

* `python -m neural_lam.create_mesh`
* `python -m neural_lam.create_grid_features`
* `python -m neural_lam.create_parameter_weights`

### Create graph
Run `python -m neural_lam.create_mesh` with suitable options to generate the graph you want to use (see `python -m neural_lam.create_mesh --help` for a list of options).
Review comment (Contributor):
With the rename of create_mesh.py to create_graph.py this should be updated.
Also, python neural_lam.create_mesh --help is missing an -m

Review comment (Member, author):
Yes the whole README needs a big reworking!

The graphs used for the different models in the [paper](https://arxiv.org/abs/2309.17370) can be created as:
@@ -143,11 +194,12 @@ wandb off
```

## Train Models
Review comment (Collaborator):
The arguments for train_model have changed now, should be updated

Models can be trained using `python -m neural_lam.train_model`.
Models can be trained using `python -m neural_lam.train_model <datastore_type> <datastore_config_path>`.
Run `python -m neural_lam.train_model --help` for a full list of training options.
A few of the key ones are outlined below:

* `--dataset`: Which data to train on
* `<datastore_type>`: The kind of datastore that you are using (should be one of `npyfiles`, `multizarr` or `mllam`)
* `<datastore_config_path>`: Path to the data store configuration file
* `--model`: Which model to train
* `--graph`: Which graph to use with the model
* `--processor_layers`: Number of GNN layers to use in the processing part of the model
@@ -204,47 +256,7 @@ Some options specifically important for evaluation are:
# Repository Structure
Except for training and pre-processing scripts all the source code can be found in the `neural_lam` directory.
Model classes, including abstract base classes, are located in `neural_lam/models`.

## Format of data directory
It is possible to store multiple datasets in the `data` directory.
Each dataset contains a set of files with static features and a set of samples.
The samples are split into different sub-directories for training, validation and testing.
The directory structure is shown with examples below.
Script names within parenthesis denote the script used to generate the file.
```
data
├── dataset1
│ ├── samples - Directory with data samples
│ │ ├── train - Training data
│ │ │ ├── nwp_2022040100_mbr000.npy - A time series sample
│ │ │ ├── nwp_2022040100_mbr001.npy
│ │ │ ├── ...
│ │ │ ├── nwp_2022043012_mbr001.npy
│ │ │ ├── nwp_toa_downwelling_shortwave_flux_2022040100.npy - Solar flux forcing
│ │ │ ├── nwp_toa_downwelling_shortwave_flux_2022040112.npy
│ │ │ ├── ...
│ │ │ ├── nwp_toa_downwelling_shortwave_flux_2022043012.npy
│ │ │ ├── wtr_2022040100.npy - Open water features for one sample
│ │ │ ├── wtr_2022040112.npy
│ │ │ ├── ...
│ │ │ └── wtr_202204012.npy
│ │ ├── val - Validation data
│ │ └── test - Test data
│ └── static - Directory with graph information and static features
│ ├── nwp_xy.npy - Coordinates of grid nodes (part of dataset)
│ ├── surface_geopotential.npy - Geopotential at surface of grid nodes (part of dataset)
│ ├── border_mask.npy - Mask with True for grid nodes that are part of border (part of dataset)
│ ├── grid_features.pt - Static features of grid nodes (neural_lam.create_grid_features)
│ ├── parameter_mean.pt - Means of state parameters (neural_lam.create_parameter_weights)
│ ├── parameter_std.pt - Std.-dev. of state parameters (neural_lam.create_parameter_weights)
│ ├── diff_mean.pt - Means of one-step differences (neural_lam.create_parameter_weights)
│ ├── diff_std.pt - Std.-dev. of one-step differences (neural_lam.create_parameter_weights)
│ ├── flux_stats.pt - Mean and std.-dev. of solar flux forcing (neural_lam.create_parameter_weights)
│ └── parameter_weights.npy - Loss weights for different state parameters (neural_lam.create_parameter_weights)
├── dataset2
├── ...
└── datasetN
```
Notebooks for visualization and analysis are located in `docs`.

## Format of graph directory
The `graphs` directory contains generated graph structures that can be used by different graph-based models.
1 change: 0 additions & 1 deletion neural_lam/__init__.py
@@ -1,5 +1,4 @@
# First-party
import neural_lam.config
import neural_lam.interaction_net
import neural_lam.metrics
import neural_lam.models
62 changes: 0 additions & 62 deletions neural_lam/config.py

This file was deleted.
