Addition of loggers and dist strategy #180
base: main
Conversation
Nice integration, just a few comments in addition to the ones left inline:

train.py is not needed and should be replaced with a command like:

itwinai exec-pipeline --config pipeline.yaml --pipe-key pipeline

There are some new changes in main; in particular, the TrainingConfiguration has been improved to provide more values by default (although it can still support any new user-defined fields).
BCE = bce_loss
KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
return BCE + beta*KLD

@monitor_exec
def execute(self):
It would be nice to separate the boilerplate code (e.g., distributed strategy setup, loggers setup) from the actual training code, if possible. Ideally, the use cases should not worry about the setup and should only override the train method from the TorchTrainer class. In other words, use cases get the best out of the TorchTrainer when reusing its execute method and overriding only the train method.
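A minimal sketch of that pattern, assuming the itwinai TorchTrainer API; the class name XTClimTrainer, the import path, and attributes such as self.train_dataloader and self.config.epochs are illustrative assumptions, not code from this PR:

from itwinai.torch.trainer import TorchTrainer  # import path is an assumption

class XTClimTrainer(TorchTrainer):
    # Only the training logic is overridden here; distributed-strategy setup,
    # logger setup, and model distribution are assumed to be handled by the
    # parent class when its execute() method is reused.
    def train(self):
        for epoch in range(self.config.epochs):
            for batch in self.train_dataloader:
                # forward pass, loss computation, backward pass, optimizer step
                ...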
As we had discussed, in this case the preprocessing part first writes the files to disk and then the trainer reads the data from disk. Hence execute is not used; rather, @monitor_exec is employed.
use-cases/xtclim/src/trainer.py (Outdated)
seasons = ["winter_", "spring_", "summer_", "autumn_"]

# number of members used for the training of the network
n_memb = 1

# initialize learning parameters
#lr0 = 0.001
#batch_size = 64
#epochs = 100

# early stopping parameters
stop_delta = 0.01  # under 1% improvement consider the model starts converging
patience = 15  # wait for a few epochs to be sure before actually stopping
early_count = 0  # count when validation loss < stop_delta
old_valid_loss = 0  # keep track of validation loss at t-1
These could be better organized in the TrainingConfiguration object, which is meant to store all the hyperparameters in a single place. I mean these:

seasons = ["winter_", "spring_", "summer_", "autumn_"]

# number of members used for the training of the network
n_memb = 1

# initialize learning parameters
#lr0 = 0.001
#batch_size = 64
#epochs = 100

# early stopping parameters
stop_delta = 0.01  # under 1% improvement consider the model starts converging
patience = 15  # wait for a few epochs to be sure before actually stopping
early_count = 0  # count when validation loss < stop_delta
old_valid_loss = 0  # keep track of validation loss at t-1
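A minimal sketch of how these values could be grouped, assuming (as noted above) that TrainingConfiguration accepts arbitrary user-defined fields; the import path and the extra field names are illustrative, not taken from this PR:

from itwinai.torch.config import TrainingConfiguration  # import path is an assumption

config = TrainingConfiguration(
    # standard hyperparameters
    batch_size=64,
    optim_lr=0.001,
    epochs=100,
    # user-defined fields for the xtclim use case
    seasons=["winter_", "spring_", "summer_", "autumn_"],
    n_memb=1,         # number of members used for training
    stop_delta=0.01,  # under 1% improvement, consider the model converging
    patience=15,      # epochs to wait before actually stopping
)

Runtime state such as early_count and old_valid_loss would stay inside the training loop, while the trainer reads its hyperparameters from one place, e.g. config.stop_delta.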
This has been updated now.
use-cases/xtclim/src/trainer.py (Outdated)
Parameters:
    bce_loss: recontruction loss
    mu: the mean from the latent vector
    logvar: log variance from the latent vector
    beta: weight over the KL-Divergence
This part of the docstring is not compliant with the Google docstring format.
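For reference, a Google-style version of that section might look roughly like this; the function name final_loss and the exact signature are assumptions reconstructed from the surrounding diff:

import torch

def final_loss(bce_loss: torch.Tensor, mu: torch.Tensor,
               logvar: torch.Tensor, beta: float) -> torch.Tensor:
    """Combine the reconstruction loss with the KL-Divergence term.

    Args:
        bce_loss: Reconstruction loss.
        mu: Mean from the latent vector.
        logvar: Log variance from the latent vector.
        beta: Weight over the KL-Divergence.

    Returns:
        The total beta-VAE loss.
    """
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return bce_loss + beta * kld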
Modified
  batch_size: int,
- lr: float
+ lr: float,
lr and batch_size should be embedded in config (which needs to be added to the constructor). The training config contains all the hyperparameters and user-defined params.
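Roughly along these lines (a usage sketch; XTClimTrainer stands for the use-case trainer, and the TrainingConfiguration import path and the optim_lr field name are assumptions):

from itwinai.torch.config import TrainingConfiguration  # import path is an assumption

# Instead of XTClimTrainer(batch_size=64, lr=0.001, ...), the hyperparameters
# travel inside a single configuration object:
config = TrainingConfiguration(batch_size=64, optim_lr=0.001)
trainer = XTClimTrainer(config=config)

# Inside the trainer, the values are read back from self.config,
# e.g. self.config.batch_size and self.config.optim_lr.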
Done
use-cases/xtclim/src/trainer.py (Outdated)
  # initialize the model
- cvae_model = model.ConvVAE().to(device)
+ cvae_model = model.ConvVAE()
  optimizer = optim.Adam(cvae_model.parameters(), lr=self.lr)
lr should be retrieved from self.config
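For example, assuming the learning-rate field in the training configuration is called optim_lr (the field name is an assumption), the quoted optimizer line would become something like:

# optim is torch.optim, as in the snippet above
optimizer = optim.Adam(cvae_model.parameters(), lr=self.config.optim_lr)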
Done
I've noticed an interesting pattern: inside the trainer there is a loop, training a new model for each season. Perhaps it would be easier to create a new XTClimTrainer and train it for each season. It could receive the season as a constructor argument. This would simplify the training code inside the trainer. Although this is not mandatory, it would make the trainer more modular, allowing the instantiation of model, optimizer, and dataloader to be separated into the dedicated methods:

# Do some pre-processing
...

# Training
for season in seasons:
    net = MyModel()
    config = TrainingConfiguration(season=season, batch_size=32, optimizer='sgd', optim_lr=0.001, ...)
    trainer = XTClimTrainer(config=config, model=net)
    trainer.execute()

# Do some prediction
...

In this case, we would not use the itwinai Pipeline. Each trainer instance would create a new logger context (thus a new logging run) and a new distributed ML context (if distributed ML is needed)...
train.py now takes care of launching each season separately. In the current implementation, we can still maintain the pipelines while running each season separately.
This has now been implemented in a slightly different manner.
I would suggest rebasing onto main before proceeding, given the latest changes.
Force-pushed from 2c26e1c to afa2765.
Branch has been rebased onto main.
Summary
Distributed strategy and loggers are added.
Related issue: #160