
WIP: Add support for mlflow #77

Open · wants to merge 275 commits into base: main
Conversation

khintz
Contributor

@khintz khintz commented Oct 3, 2024

Describe your changes

Add support for the mlflow logger by utilising pytorch_lightning.loggers.
The native wandb module is replaced with the pytorch_lightning wandb logger, and the pytorch_lightning mlflow logger is introduced.
https://github.com/Lightning-AI/pytorch-lightning/blob/master/src/lightning/pytorch/loggers/logger.py

This will allow people to choose between wandb and mlflow.

Builds upon #66, although that is not strictly necessary for this change; I am developing this feature to work with our dataset.

Issue Link

Closes #76

Type of change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📖 Documentation (Addition or improvements to documentation)

Checklist before requesting a review

  • My branch is up-to-date with the target branch - if not update your fork with the changes from the target branch (use pull with --rebase option if possible).
  • I have performed a self-review of my code
  • For any new/modified functions/classes I have added docstrings that clearly describe their purpose, expected inputs and returned values
  • I have placed in-line comments to clarify the intent of any hard-to-understand passages of my code
  • I have updated the README to cover introduced code changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have given the PR a name that clearly describes the change, written in imperative form (context).
  • I have requested a reviewer and an assignee (assignee is responsible for merging). This applies only if you have write access to the repo, otherwise feel free to tag a maintainer to add a reviewer and assignee.

Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should check the following:

  • the code is readable
  • the code is well tested
  • the code is documented (including return types and parameters)
  • the code is easy to maintain

Author checklist after completed review

  • I have added a line to the CHANGELOG describing this change, in a section
    reflecting type of change (add section where missing):
    • added: when you have added new functionality
    • changed: when default behaviour of the code has been changed
    • fixes: when your contribution fixes a bug

Checklist for assignee

  • PR is up to date with the base branch
  • the tests pass
  • author has added an entry to the changelog (and designated the change as added, changed or fixed)
  • Once the PR is ready to be merged, squash commits and merge the PR.

@khintz khintz self-assigned this Oct 3, 2024
@khintz
Contributor Author

khintz commented Oct 3, 2024

WIP: the mlflow logger is still not working, but wandb now works via the pytorch_lightning wandb logger.
This depends on #66 being merged first.


@khintz
Contributor Author

khintz commented Oct 7, 2024

I now have model metrics, system metrics and artifact logging (including model logging) working for mlflow. See e.g.:
https://mlflow.dmidev.org/#/experiments/2/runs/aceb8c6c94844736844dc7d1c12aa57f

However I get this warning:

2024/10/07 12:04:38 WARNING mlflow.models.model: Model logged without a signature and input example. Please set `input_example` parameter when logging the model to auto infer the model signature.

I am calling a log_model function after trainer.fit:

training_logger.log_model(model)

which is:

def log_model(self, model):
    mlflow.pytorch.log_model(model, "model")

But I need to set the signature.
From https://mlflow.org/docs/latest/model/signatures.html, it states:

In MLflow, a model signature precisely defines the schema for model inputs, outputs, and any additional parameters required for effective model operation.

It should be possible to use infer_signature() from mlflow (https://mlflow.org/docs/latest/python_api/mlflow.models.html#mlflow.models.infer_signature), but it needs example data, e.g. signature = infer_signature(example_input, example_output).
The full training dataset is probably too big to pass, and I am not sure I can get it via train_model.py. Could we manually define a signature, or should we not provide one at all?

Any thoughts @joeloskarsson, @sadamov, @TomasLandelius ?

@sadamov
Collaborator

sadamov commented Oct 10, 2024

@khintz Thanks for adding mlflow to the list of loggers; it's nice to give the user more choice, and clearly you have already got most of the work done 🚀 . About the warning you are seeing: I don't think manually specifying the signatures is a good idea, as it is too error-prone. How hard would it be to pass a single example to mlflow for signature inference, with something like this:

Modify CustomMLFlowLogger:

# Imports needed by this snippet
import mlflow
import torch
import pytorch_lightning as pl
from mlflow.models import infer_signature

class CustomMLFlowLogger(pl.loggers.MLFlowLogger):
    def __init__(self, experiment_name, tracking_uri, data_module):
        super().__init__(experiment_name=experiment_name, tracking_uri=tracking_uri)
        mlflow.start_run(run_id=self.run_id, log_system_metrics=True)
        mlflow.log_param("run_id", self.run_id)
        self.data_module = data_module

    def log_image(self, key, images):
        from PIL import Image
        temporary_image = f"{key}.png"
        images[0].savefig(temporary_image)
        mlflow.log_image(Image.open(temporary_image), f"{key}.png")

    def log_model(self, model):
        input_example = self.create_input_example()
        with torch.no_grad():
            model_output = model(*input_example)

        #TODO: Are we sure we can hardcode the input names?
        signature = infer_signature(
            {name: tensor.cpu().numpy() for name, tensor in zip(['init_states', 'target_states', 'forcing', 'target_times'], input_example)},
            model_output.cpu().numpy()
        )

        mlflow.pytorch.log_model(
            model,
            "model",
            input_example=input_example,
            signature=signature
        )

    def create_input_example(self):
        if self.data_module.val_dataset is None:
            self.data_module.setup(stage="fit")
        return self.data_module.val_dataset[0]

@joeloskarsson
Copy link
Collaborator

But the training dataset is probably too big to parse

From my understanding you don't need to feed the whole dataset to the model to infer this signature, only one example batch. Going by this, something like what @sadamov proposed should work. However:

I don't think manually specifying the signatures is a good idea, as it is too error prone.

I agree. Optimally we would even get rid of the hard-coded argument names in the zip from @sadamov 's code (but I don't have an immediate idea how to do that).
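One possibility (a sketch, untested against the actual model classes): derive the input names from the model's forward signature with the stdlib inspect module, so nothing is hard-coded:

```python
import inspect

def forward_arg_names(forward_fn):
    """Return the positional parameter names of a forward method,
    minus self, so signature inference need not hard-code input names."""
    return [
        name
        for name in inspect.signature(forward_fn).parameters
        if name != "self"
    ]

# Hypothetical forward matching the names hard-coded in the zip above
def forward(self, init_states, target_states, forcing, target_times):
    pass

print(forward_arg_names(forward))
# ['init_states', 'target_states', 'forcing', 'target_times']
```

In the logger one would call forward_arg_names(type(model).forward) and zip the result with the input example instead of the literal list.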

Something else to consider here is that there are additional important inputs that are necessary to make a forecast with the model (that do not enter as arguments when calling the model() function). These include in particular:

  1. Static inputs (grid static features)

     arr_static = da_static_features.transpose(
         "grid_index", "static_feature"
     ).values
     self.register_buffer(
         "grid_static_features",
         torch.tensor(arr_static, dtype=torch.float32),
         persistent=False,
     )

  2. The graph parts (edge_index + static graph features)

     self.hierarchical, graph_ldict = utils.load_graph(
         graph_dir_path=graph_dir_path
     )
     for name, attr_value in graph_ldict.items():
         # Make BufferLists module members and register tensors as buffers
         if isinstance(attr_value, torch.Tensor):
             self.register_buffer(name, attr_value, persistent=False)
         else:
             setattr(self, name, attr_value)

I don't know if these (or rather their shape) should be considered for the third part of the model signature ("Parameters (params)"), or somehow also viewed as part of the input. But I also fear that including these might just make this complex enough that this signature is no longer particularly useful. I think we should be motivated by how useful we actually find this signature to be. If we just want to get rid of the warning maybe we don't have to worry about these.

Successfully merging this pull request may close these issues.

Support mlflow logger (and other loggers from pytorch-lightning)