Addition of loggers and dist strategy #180

Open
wants to merge 20 commits into base: main
Conversation

@r-sarma (Collaborator) commented on Jul 2, 2024

Summary

Distributed strategy and loggers have been added.

Related issue: #160

r-sarma requested a review from matbun on Jul 2, 2024 at 10:36
r-sarma linked an issue on Jul 2, 2024 that may be closed by this pull request
@matbun (Collaborator) left a comment

Nice integration, just a few comments in addition to the ones left inline:

train.py is not needed and should be replaced with a command like:

itwinai exec-pipeline --config pipeline.yaml --pipe-key pipeline

There are some new changes in main; in particular, the TrainingConfiguration has been improved to provide more values by default (although it can still support any new user-defined fields).

BCE = bce_loss
KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
return BCE + beta*KLD


@monitor_exec
def execute(self):
matbun (Collaborator):

It would be nice to separate the boilerplate code (e.g., distributed strategy setup, logger setup) from the actual training code, if possible. Ideally, the use cases should not worry about the setup and should only override the train method from the TorchTrainer class. In other words, use cases get the most out of the TorchTrainer when they reuse its execute method and override only the train method.

r-sarma (Collaborator Author):

As we had discussed, in this case the preprocessing part first writes the files to disk and then the trainer reads the data from disk. Hence, execute is not used; rather, @monitor_exec is employed.
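
For reference, the pattern suggested above (reusing TorchTrainer's execute for the strategy/logger setup and overriding only train) could look roughly like the sketch below. This is only an illustration: the import path and the attribute names (train_dataloader, config.epochs, loss) are assumptions, not the exact itwinai API.

from itwinai.torch.trainer import TorchTrainer  # import path is an assumption


class XTClimTrainer(TorchTrainer):
    """Sketch: the parent execute() is assumed to set up the distributed
    strategy, loggers, model, optimizer and dataloaders before calling train()."""

    def train(self):
        for epoch in range(self.config.epochs):
            self.model.train()
            for batch, _ in self.train_dataloader:
                self.optimizer.zero_grad()
                # placeholder loss; the use case would compute its VAE loss here
                loss = self.loss(self.model(batch), batch)
                loss.backward()
                self.optimizer.step()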

Comment on lines 91 to 100
seasons = ["winter_", "spring_", "summer_", "autumn_"]

# number of members used for the training of the network
n_memb = 1

# initialize learning parameters
#lr0 = 0.001
#batch_size = 64
#epochs = 100
#early stopping parameters
stop_delta = 0.01 # under 1% improvement consider the model starts converging
patience = 15 # wait for a few epochs to be sure before actually stopping
early_count = 0 # count when validation loss < stop_delta
old_valid_loss = 0 # keep track of validation loss at t-1
matbun (Collaborator):

These could be better organized in the TrainingConfiguration object, which is meant to store all the hyperparameters in a single place.

I mean these:

seasons = ["winter_", "spring_", "summer_", "autumn_"]

# number of members used for the training of the network
n_memb = 1

# initialize learning parameters
#lr0 = 0.001
#batch_size = 64
#epochs = 100
#early stopping parameters
stop_delta = 0.01  # under 1% improvement consider the model starts converging
patience = 15  # wait for a few epochs to be sure before actually stopping
early_count = 0  # count when validation loss < stop_delta
old_valid_loss = 0  # keep track of validation loss at t-1

r-sarma (Collaborator Author):

This has been updated now.
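
For reference, a rough sketch of how these values could be grouped in a TrainingConfiguration; the import path and the default field names (batch_size, optim_lr) are assumptions, and the extra keyword arguments become user-defined fields:

from itwinai.torch.config import TrainingConfiguration  # import path may differ

config = TrainingConfiguration(
    # fields with library defaults (names assumed)
    batch_size=64,
    optim_lr=0.001,
    # use-case specific, user-defined fields
    seasons=["winter_", "spring_", "summer_", "autumn_"],
    n_memb=1,         # number of members used for training
    stop_delta=0.01,  # under 1% improvement, consider the model converging
    patience=15,      # epochs to wait before actually stopping
)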

Comment on lines 75 to 79
Parameters:
bce_loss: recontruction loss
mu: the mean from the latent vector
logvar: log variance from the latent vector
beta: weight over the KL-Divergence
matbun (Collaborator):

This part of the docstring is not compliant with the Google docstring format.

r-sarma (Collaborator Author):

Modified
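
For reference, a Google-style version of that section could look roughly as follows (the function name and the beta default are illustrative, not the use case's exact code):

import torch


def final_loss(bce_loss, mu, logvar, beta=0.1):
    """Compute the total beta-VAE loss.

    Args:
        bce_loss (torch.Tensor): reconstruction loss.
        mu (torch.Tensor): mean of the latent vector.
        logvar (torch.Tensor): log variance of the latent vector.
        beta (float): weight of the KL-divergence term.

    Returns:
        torch.Tensor: reconstruction loss plus the beta-weighted KL divergence.
    """
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return bce_loss + beta * kld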

Comment on lines 49 to +50
batch_size: int,
lr: float
lr: float,
matbun (Collaborator):

lr and batch_size should be embedded in config (they need to be added to the constructor). The training config contains all the hyperparameters and user-defined parameters.

r-sarma (Collaborator Author):

Done

use-cases/xtclim/src/trainer.py (outdated; comment resolved)
# initialize the model
cvae_model = model.ConvVAE().to(device)
cvae_model = model.ConvVAE()
optimizer = optim.Adam(cvae_model.parameters(), lr=self.lr)
matbun (Collaborator):

lr should be retrieved from self.config

r-sarma (Collaborator Author):

Done
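
i.e., something along these lines (optim_lr as the config field name is an assumption):

optimizer = optim.Adam(cvae_model.parameters(), lr=self.config.optim_lr)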

@matbun (Collaborator) commented on Jul 2, 2024

I've noticed an interesting pattern: Inside the trainer there is a loop, training a new model for each season.
I wonder how we could represent it in the pipeline... Maybe it is not possible to use the pipeline in this case.

Perhaps it would be easier to create a new XTClimTrainer and train it for each season. It could receive the season as a constructor argument. This would simplify the training code inside the trainer. Although this is not mandatory, it would make the trainer more modular, allowing the instantiation of the model, optimizer, and dataloaders to be separated into the dedicated methods create_model_loss_optimizer and create_dataloaders, respectively.
I mean something like this, in the train.py file:

# Do some pre-processing
...

# Training
for season in seasons:
  net = MyModel()
  config = TrainingConfiguration(season=season, batch_size=32, optimizer='sgd', optim_lr=0.001, ...)
  trainer = XTClimTrainer(config=config, model=net)
  trainer.execute()

  # Do some prediction
  ...

In this case, we would not use the itwinai Pipeline.

Each trainer instance would create a new logger context (thus a new logging run), and a new distributed ML context (if distributed ML is needed)...

@r-sarma (Collaborator Author) commented on Jul 29, 2024

> Nice integration, just a few comments in addition to the ones left inline:
>
> train.py is not needed and should be replaced with a command like:
>
> itwinai exec-pipeline --config pipeline.yaml --pipe-key pipeline
>
> There are some new changes in main; in particular, the TrainingConfiguration has been improved to provide more values by default (although it can still support any new user-defined fields).

r-sarma closed this on Jul 29, 2024
r-sarma reopened this on Jul 29, 2024
@r-sarma (Collaborator Author) commented on Jul 29, 2024

> Nice integration, just a few comments in addition to the ones left inline:
> train.py is not needed and should be replaced with a command like:
>
> itwinai exec-pipeline --config pipeline.yaml --pipe-key pipeline
>
> There are some new changes in main; in particular, the TrainingConfiguration has been improved to provide more values by default (although it can still support any new user-defined fields).

train.py now takes care of launching each season separately. In the current implementation, we can still keep the pipelines while running each season separately.

@r-sarma (Collaborator Author) commented on Jul 29, 2024

> I've noticed an interesting pattern: Inside the trainer there is a loop, training a new model for each season. I wonder how we could represent it in the pipeline... Maybe it is not possible to use the pipeline in this case.
>
> Perhaps it would be easier to create a new XTClimTrainer and train it for each season. It could receive the season as a constructor argument. This would simplify the training code inside the trainer. Although this is not mandatory, it would make the trainer more modular, allowing the instantiation of the model, optimizer, and dataloaders to be separated into the dedicated methods create_model_loss_optimizer and create_dataloaders, respectively. I mean something like this, in the train.py file:
>
> # Do some pre-processing
> ...
>
> # Training
> for season in seasons:
>   net = MyModel()
>   config = TrainingConfiguration(season=season, batch_size=32, optimizer='sgd', optim_lr=0.001, ...)
>   trainer = XTClimTrainer(config=config, model=net)
>   trainer.execute()
>
>   # Do some prediction
>   ...
>
> In this case, we would not use the itwinai Pipeline.
>
> Each trainer instance would create a new logger context (thus a new logging run), and a new distributed ML context (if distributed ML is needed)...

This has now been implemented in a slightly different manner. train.py now reads the configuration file, which contains the list of seasons; it then loops over the seasons, dynamically adjusting the season value to launch a pipeline for each one.
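
A rough sketch of that approach, reusing only the exec-pipeline command shown earlier; the file names and configuration layout are assumptions:

import subprocess
import yaml

# Read the list of seasons from the use-case configuration (layout assumed).
with open("pipeline.yaml") as f:
    base_config = yaml.safe_load(f)

for season in base_config["seasons"]:
    # Write a per-season copy of the configuration and launch the pipeline on it.
    season_config = {**base_config, "season": season}
    with open("pipeline_season.yaml", "w") as f:
        yaml.safe_dump(season_config, f)
    subprocess.run(
        ["itwinai", "exec-pipeline",
         "--config", "pipeline_season.yaml", "--pipe-key", "pipeline"],
        check=True,
    )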

@matbun (Collaborator) commented on Oct 16, 2024

I would suggest rebasing onto main before proceeding, given the latest changes.

@r-sarma (Collaborator Author) commented on Oct 16, 2024

Branch has been rebased onto main.

Successfully merging this pull request may close this issue: Distributed ML for CERFACS.