
Custom callbacks for metrics, saving checkpoints #2575

Open
Garfounkel opened this issue Mar 22, 2024 · 4 comments

@Garfounkel commented Mar 22, 2024

I need to compute custom metrics during training. I first thought it would be as easy as adding my own metric function to some callback, but I couldn't find anything like this in the docs or in existing issues. I would also be fine with just having a callback when a new checkpoint is saved, or when the validation step runs.

Workaround

My current workaround is to run a second process that constantly watches the checkpoint directory for new checkpoints. Whenever a new one appears, it runs my metrics calculation, roughly as sketched below.
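For reference, the watcher process looks more or less like this (a simplified sketch; compute_metrics stands in for my actual evaluation code, and the directory path and polling interval are placeholders):

import time
from pathlib import Path

CHECKPOINT_DIR = Path("models/")   # placeholder: directory where checkpoints are saved
POLL_INTERVAL = 60                 # seconds between scans

def compute_metrics(checkpoint: Path) -> None:
    # Stand-in for the real evaluation: load the checkpoint, run inference
    # on my test set, and push the scores to the web service.
    print(f"Evaluating {checkpoint} ...")

def watch() -> None:
    seen = set()
    while True:
        for ckpt in sorted(CHECKPOINT_DIR.glob("*.pt")):
            if ckpt not in seen:
                seen.add(ckpt)
                compute_metrics(ckpt)
        time.sleep(POLL_INTERVAL)

if __name__ == "__main__":
    watch()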

It works, but it's really not ideal for multiple reasons:

  • The evaluation process is not synced with the training process, so if the training process is killed, the evaluation process may keep running. That means extra work is needed to manage the watcher's lifetime as well.
  • Because the evaluation process runs concurrently with the training steps, it cannot use the same GPU, so it either needs its own dedicated GPU or has to run on CPU. It would be much better to run it on the same GPU, in between training steps. That would also be faster, because the model would not have to be reloaded into memory every time.

Question

How can I add a callback during the validation step or after a new checkpoint is saved? If there is no out-of-the-box solution at the moment, I would be happy to open a PR if you can give me some pointers on what would need to change.

@vince62s (Member)

Look at this: https://github.com/OpenNMT/OpenNMT-py/blob/master/onmt/utils/scoring_utils.py
and the way it is instantiated and called from Trainer.py.
This is what we use to compute the BLEU score at validation time.

@Garfounkel (Author)

Got it, thanks. This gives me a good idea of how I could hack my own custom evaluation mechanism directly into the ScoringPreparator. But just to be clear: you're suggesting that because there's no callback system at the moment?

Thinking out loud, ScoringPreparator does seem to be the right place to inject my logic, but it feels very hacky. I don't want to compute my metric on the valid set; I have my own test set for it. I also need to talk to a web service to push scores and predictions for further analysis.

Basically my point is: sure, I could inject all of that into ScoringPreparator, but what makes more sense to me is a custom callback that I could register in the config file, similar to how custom transforms are implemented.

Does that make sense? Do you think there's a need for such a feature?

@vince62s (Member)

I am not sure I understand exactly what you are looking for. You mentioned earlier that you want to compute a metric based on a "saved checkpoint", meaning it has to happen whenever you save a checkpoint, but compute your metric on what data?
If it is not based on a checkpoint but must happen during training, then you can look at where we report the stats (ppl, accuracy, ...).
Again, I am not clear about what you are asking.

@Garfounkel (Author) commented Jun 14, 2024

Forget about the example I gave about needing to compute custom metrics during training. What I'm trying to get at is a more general solution for when you need custom behavior injected at different steps of the training loop.

I feel that a custom callback system is missing from OpenNMT at the moment. Here's a rough API of what I expect; let me know if that makes sense:

# As a user, I expect to be able to implement some kind of callback like so:

from onmt.callbacks import OnCheckpointSavedCallback

class MyCheckpointCallback(OnCheckpointSavedCallback):
    def on_checkpoint(self, checkpoint_path: str, step: int) -> None:
        print("A checkpoint was saved at", checkpoint_path)

Then, in the training.yaml config, something like:

callbacks: [MyCheckpointCallback]

The custom on_checkpoint method would then be called by OpenNMT after each new checkpoint is saved.

Here I gave an example for OnCheckpointSavedCallback, but you could imagine similar callbacks for other steps of the training loop.
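To make it concrete, here is roughly what I imagine the trainer-side hook could look like. To be clear, everything below is hypothetical: neither onmt.callbacks nor this hook exists in OpenNMT-py today, and the names are only illustrative.

# Hypothetical sketch: none of these names exist in OpenNMT-py today.
from abc import ABC, abstractmethod

class OnCheckpointSavedCallback(ABC):
    @abstractmethod
    def on_checkpoint(self, checkpoint_path: str, step: int) -> None:
        ...

class Trainer:
    def __init__(self, callbacks=None):
        # Callbacks would be built from the `callbacks:` entry in the YAML config.
        self.callbacks = callbacks or []

    def _maybe_save(self, step: int) -> None:
        checkpoint_path = f"model_step_{step}.pt"  # placeholder for the real save logic
        # ... the actual checkpoint saving would happen here ...
        for cb in self.callbacks:
            cb.on_checkpoint(checkpoint_path, step)

The callbacks entry in the YAML would just name the callback classes to build, and the trainer would invoke each one at the matching hook point.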

Currently, the only workaround I've found for running my custom logic when a new checkpoint is saved is to run a separate process that watches the checkpoint folder for new files.
