Mark training complete after last checkpoint saving is completed.
PiperOrigin-RevId: 689337972
The kauldron Authors committed Oct 24, 2024
1 parent 87f7474 commit 5f36848
Showing 1 changed file with 7 additions and 3 deletions.
10 changes: 7 additions & 3 deletions kauldron/train/train_lib.py
@@ -141,15 +141,19 @@ def train_impl(
log_summaries=log_summaries,
)

-  # Ensure all hosts exit together. See section in dm/jax-faqs.
-  _sync()
+  # Checkpoint saving must be finalized before notifying eval jobs that training
+  # is complete. Otherwise, eval jobs may stop before the last checkpoint
+  # becomes available.
+  ckpt.wait_until_finished()
+
   # Notify the eval job training is complete
   if trainer.workdir.exists():  # `TrainEvaluator` do not have a workdir
     epath.Path(trainer.workdir).joinpath(
         eval_impl.TRAIN_COMPLETE_FILENAME
     ).touch()

+  # Ensure all hosts exit together. See section in dm/jax-faqs.
+  _sync()
-  ckpt.wait_until_finished()
   # Returning the final state is convenient for interactive training in colab
   return state, aux
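
The ordering this change enforces (finish the checkpoint write first, then publish the completion marker) can be illustrated with a minimal standalone sketch. It does not use kauldron: `FakeAsyncCheckpointer`, `finish_training`, and the `MARKER` filename are hypothetical stand-ins, and the real value of `eval_impl.TRAIN_COMPLETE_FILENAME` is not shown in this diff.

```python
import pathlib
import tempfile
import threading
import time

# Hypothetical marker name, used only for this illustration.
MARKER = "train_complete.txt"


class FakeAsyncCheckpointer:
  """Hypothetical stand-in for a checkpointer that saves in the background."""

  def __init__(self, workdir: pathlib.Path):
    self._workdir = workdir
    self._thread = None

  def save(self, step: int) -> None:
    def _write():
      time.sleep(0.05)  # Simulate a slow write to storage.
      (self._workdir / f"ckpt_{step}").write_text("state")

    self._thread = threading.Thread(target=_write)
    self._thread.start()

  def wait_until_finished(self) -> None:
    # Block until the pending background save (if any) has completed.
    if self._thread is not None:
      self._thread.join()


def finish_training(workdir: pathlib.Path, ckpt: FakeAsyncCheckpointer):
  """Finalizes the checkpoint, then publishes the completion marker."""
  ckpt.wait_until_finished()  # 1. The last checkpoint is fully written.
  (workdir / MARKER).touch()  # 2. Only now signal that training is done.
  # Return what a reader of the workdir would see once the marker exists.
  return sorted(p.name for p in workdir.iterdir())


with tempfile.TemporaryDirectory() as tmp:
  workdir = pathlib.Path(tmp)
  ckpt = FakeAsyncCheckpointer(workdir)
  ckpt.save(step=100)
  visible = finish_training(workdir, ckpt)
```

If the two steps in `finish_training` were swapped, a reader polling for the marker could observe it while `ckpt_100` was still being written, which is the race the commit message describes.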
