Addition of loggers and dist strategy #180

Open
wants to merge 20 commits into base: main
Conversation

@r-sarma (Collaborator) commented on Jul 2, 2024

Summary

Distributed strategy and loggers have been added.

Related issue: #160

r-sarma requested a review from matbun on Jul 2, 2024 at 10:36
r-sarma linked an issue on Jul 2, 2024 that may be closed by this pull request
@matbun (Collaborator) left a comment

Nice integration, just a few comments in addition to the ones left inline:

train.py is not needed and should be replaced with a command like:

itwinai exec-pipeline --config pipeline.yaml --pipe-key pipeline

There are some new changes in main; in particular, the TrainingConfiguration has been improved to provide more values by default (although it can still support any new user-defined fields).

BCE = bce_loss
KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
return BCE + beta*KLD


@monitor_exec
def execute(self):
matbun (Collaborator):

It would be nice to separate the boilerplate code (e.g., distributed strategy setup, logger setup) from the actual training code, if possible. Ideally, the use cases should not worry about the setup and should only override the train method from the TorchTrainer class. In other words, use cases get the most out of the TorchTrainer when they reuse its execute method and override only the train method.

r-sarma (Collaborator Author):

As we had discussed, in this case the preprocessing part first writes the files to disk and then the trainer reads the data from disk. Hence, execute is not used; rather, @monitor_exec is employed.
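
For reference, the pattern suggested above (reusing TorchTrainer's execute for the strategy/logger setup and overriding only train) could look roughly like the sketch below. This is only an illustration: the import path and the attribute names (train_dataloader, config.epochs, loss) are assumptions, not the exact itwinai API.

from itwinai.torch.trainer import TorchTrainer  # import path is an assumption


class XTClimTrainer(TorchTrainer):
    """Sketch: the parent execute() is assumed to set up the distributed
    strategy, loggers, model, optimizer and dataloaders before calling train()."""

    def train(self):
        for epoch in range(self.config.epochs):
            self.model.train()
            for batch, _ in self.train_dataloader:
                self.optimizer.zero_grad()
                # placeholder loss; the use case would compute its VAE loss here
                loss = self.loss(self.model(batch), batch)
                loss.backward()
                self.optimizer.step()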

Comment on lines 91 to 100
seasons = ["winter_", "spring_", "summer_", "autumn_"]

# number of members used for the training of the network
n_memb = 1

# initialize learning parameters
#lr0 = 0.001
#batch_size = 64
#epochs = 100
#early stopping parameters
stop_delta = 0.01 # under 1% improvement consider the model starts converging
patience = 15 # wait for a few epochs to be sure before actually stopping
early_count = 0 # count when validation loss < stop_delta
old_valid_loss = 0 # keep track of validation loss at t-1
matbun (Collaborator):

These could be better organized in the TrainingConfiguration object, which is meant to store all the hyperparameters in a single place.

I mean these:

seasons = ["winter_", "spring_", "summer_", "autumn_"]

# number of members used for the training of the network
n_memb = 1

# initialize learning parameters
#lr0 = 0.001
#batch_size = 64
#epochs = 100
#early stopping parameters
stop_delta = 0.01  # under 1% improvement consider the model starts converging
patience = 15  # wait for a few epochs to be sure before actually stopping
early_count = 0  # count when validation loss < stop_delta
old_valid_loss = 0  # keep track of validation loss at t-1

r-sarma (Collaborator Author):

This has been updated now.
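
For reference, a rough sketch of how these values could be grouped in a TrainingConfiguration; the import path and the default field names (batch_size, optim_lr) are assumptions, and the extra keyword arguments become user-defined fields:

from itwinai.torch.config import TrainingConfiguration  # import path may differ

config = TrainingConfiguration(
    # fields with library defaults (names assumed)
    batch_size=64,
    optim_lr=0.001,
    # use-case specific, user-defined fields
    seasons=["winter_", "spring_", "summer_", "autumn_"],
    n_memb=1,         # number of members used for training
    stop_delta=0.01,  # under 1% improvement, consider the model converging
    patience=15,      # epochs to wait before actually stopping
)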

Comment on lines 75 to 79
Parameters:
bce_loss: recontruction loss
mu: the mean from the latent vector
logvar: log variance from the latent vector
beta: weight over the KL-Divergence
matbun (Collaborator):

This part of the docstring is not compliant with the Google docstring format.

r-sarma (Collaborator Author):

Modified
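
For reference, a Google-style version of that section could look roughly as follows (the function name and the beta default are illustrative, not the use case's exact code):

import torch


def final_loss(bce_loss, mu, logvar, beta=0.1):
    """Compute the total beta-VAE loss.

    Args:
        bce_loss (torch.Tensor): reconstruction loss.
        mu (torch.Tensor): mean of the latent vector.
        logvar (torch.Tensor): log variance of the latent vector.
        beta (float): weight of the KL-divergence term.

    Returns:
        torch.Tensor: reconstruction loss plus the beta-weighted KL divergence.
    """
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return bce_loss + beta * kld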

Comment on lines 49 to +50
batch_size: int,
lr: float
lr: float,
matbun (Collaborator):

lr and batch_size should be embedded in config (they need to be added to the constructor). The training config contains all the hyperparameters and user-defined parameters.

r-sarma (Collaborator Author):

Done

use-cases/xtclim/src/trainer.py (outdated; comment resolved)
# initialize the model
cvae_model = model.ConvVAE().to(device)
cvae_model = model.ConvVAE()
optimizer = optim.Adam(cvae_model.parameters(), lr=self.lr)
matbun (Collaborator):

lr should be retrieved from self.config

r-sarma (Collaborator Author):

Done
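
i.e., something along these lines (optim_lr as the config field name is an assumption):

optimizer = optim.Adam(cvae_model.parameters(), lr=self.config.optim_lr)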

@matbun (Collaborator) commented on Jul 2, 2024

I've noticed an interesting pattern: Inside the trainer there is a loop, training a new model for each season.
I wonder how we could represent it in the pipeline... Maybe it is not possible to use the pipeline in this case.

Perhaps it would be easier to create a new XTClimTrainer and train it for each season. It could receive the season as a constructor argument. This would simplify the training code inside the trainer. Although this is not mandatory, it would make the trainer more modular, allowing the instantiation of the model, optimizer, and dataloaders to be separated into the dedicated methods create_model_loss_optimizer and create_dataloaders, respectively.
I mean something like this, in the train.py file:

# Do some pre-processing
...

# Training
for season in seasons:
  net = MyModel()
  config = TrainingConfiguration(season=season, batch_size=32, optimizer='sgd', optim_lr=0.001, ...)
  trainer = XTClimTrainer(config=config, model=net)
  trainer.execute()

  # Do some prediction
  ...

In this case, we would not use the itwinai Pipeline.

Each trainer instance would create a new logger context (thus a new logging run), and a new distributed ML context (if distributed ML is needed)...

@r-sarma (Collaborator Author) commented on Jul 29, 2024

> Nice integration, just a few comments in addition to the ones left inline:
>
> train.py is not needed and should be replaced with a command like:
>
> itwinai exec-pipeline --config pipeline.yaml --pipe-key pipeline
>
> There are some new changes in main; in particular, the TrainingConfiguration has been improved to provide more values by default (although it can still support any new user-defined fields).

r-sarma closed this on Jul 29, 2024
r-sarma reopened this on Jul 29, 2024
@r-sarma (Collaborator Author) commented on Jul 29, 2024

> Nice integration, just a few comments in addition to the ones left inline:
> train.py is not needed and should be replaced with a command like:
>
> itwinai exec-pipeline --config pipeline.yaml --pipe-key pipeline
>
> There are some new changes in main; in particular, the TrainingConfiguration has been improved to provide more values by default (although it can still support any new user-defined fields).

train.py now takes care of launching each season separately. In the current implementation, we can still keep the pipelines while running each season separately.

@r-sarma (Collaborator Author) commented on Jul 29, 2024

> I've noticed an interesting pattern: Inside the trainer there is a loop, training a new model for each season. I wonder how we could represent it in the pipeline... Maybe it is not possible to use the pipeline in this case.
>
> Perhaps it would be easier to create a new XTClimTrainer and train it for each season. It could receive the season as a constructor argument. This would simplify the training code inside the trainer. Although this is not mandatory, it would make the trainer more modular, allowing the instantiation of the model, optimizer, and dataloaders to be separated into the dedicated methods create_model_loss_optimizer and create_dataloaders, respectively. I mean something like this, in the train.py file:
>
> # Do some pre-processing
> ...
>
> # Training
> for season in seasons:
>   net = MyModel()
>   config = TrainingConfiguration(season=season, batch_size=32, optimizer='sgd', optim_lr=0.001, ...)
>   trainer = XTClimTrainer(config=config, model=net)
>   trainer.execute()
>
>   # Do some prediction
>   ...
>
> In this case, we would not use the itwinai Pipeline.
>
> Each trainer instance would create a new logger context (thus a new logging run), and a new distributed ML context (if distributed ML is needed)...

This has now been implemented in a slightly different manner. train.py now reads the configuration file, which contains the list of seasons; it then loops over the seasons, dynamically adjusting the season value to launch a pipeline for each one.
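
A rough sketch of that approach, reusing only the exec-pipeline command shown earlier; the file names and configuration layout are assumptions:

import subprocess
import yaml

# Read the list of seasons from the use-case configuration (layout assumed).
with open("pipeline.yaml") as f:
    base_config = yaml.safe_load(f)

for season in base_config["seasons"]:
    # Write a per-season copy of the configuration and launch the pipeline on it.
    season_config = {**base_config, "season": season}
    with open("pipeline_season.yaml", "w") as f:
        yaml.safe_dump(season_config, f)
    subprocess.run(
        ["itwinai", "exec-pipeline",
         "--config", "pipeline_season.yaml", "--pipe-key", "pipeline"],
        check=True,
    )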

@matbun (Collaborator) commented on Oct 16, 2024

I would suggest rebasing onto main before proceeding, given the latest changes.

@r-sarma (Collaborator Author) commented on Oct 16, 2024

Branch has been rebased onto main.

Successfully merging this pull request may close this issue: Distributed ML for CERFACS.