Mark training complete after last checkpoint saving is completed.
PiperOrigin-RevId: 689279915
The kauldron Authors committed Oct 24, 2024
1 parent 87f7474 commit 99deeca
Showing 1 changed file with 4 additions and 3 deletions.
7 changes: 4 additions & 3 deletions kauldron/train/train_lib.py
@@ -141,15 +141,16 @@ def train_impl(
       log_summaries=log_summaries,
   )

+  # Ensure all hosts exit together. See section in dm/jax-faqs.
+  _sync()
+  ckpt.wait_until_finished()
+
   # Notify the eval job training is complete
   if trainer.workdir.exists(): # `TrainEvaluator` do not have a workdir
     epath.Path(trainer.workdir).joinpath(
         eval_impl.TRAIN_COMPLETE_FILENAME
     ).touch()

-  # Ensure all hosts exit together. See section in dm/jax-faqs.
-  _sync()
-  ckpt.wait_until_finished()
   # Returning the final state is convenient for interactive training in colab
   return state, aux
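
The ordering this commit enforces (finish the last checkpoint write first, then create the completion marker) can be illustrated with a minimal standalone sketch. It is not the kauldron implementation: `AsyncCheckpointer`, `finish_training`, and the marker file name are hypothetical stand-ins, and a background thread simulates an asynchronous checkpoint write.

```python
import pathlib
import tempfile
import threading
import time

# Hypothetical marker name; the real value of
# `eval_impl.TRAIN_COMPLETE_FILENAME` is not shown in the diff.
TRAIN_COMPLETE_FILENAME = "train_complete.txt"


class AsyncCheckpointer:
  """Toy stand-in for an asynchronous checkpointer (not the kauldron API)."""

  def __init__(self, workdir: pathlib.Path):
    self._workdir = workdir
    self._thread = None

  def save(self, step: int, payload: str) -> None:
    """Starts writing the checkpoint in a background thread."""

    def _write():
      time.sleep(0.05)  # Simulate a slow write.
      (self._workdir / f"ckpt_{step}").write_text(payload)

    self._thread = threading.Thread(target=_write)
    self._thread.start()

  def wait_until_finished(self) -> None:
    """Blocks until the pending write, if any, has completed."""
    if self._thread is not None:
      self._thread.join()


def finish_training(workdir: pathlib.Path, last_step: int):
  """Saves the last checkpoint, waits for it, then writes the marker."""
  ckpt = AsyncCheckpointer(workdir)
  ckpt.save(last_step, "final state")

  # Wait first, so the marker is never visible before the checkpoint.
  ckpt.wait_until_finished()
  ckpt_visible_before_marker = (workdir / f"ckpt_{last_step}").exists()

  (workdir / TRAIN_COMPLETE_FILENAME).touch()
  files = sorted(p.name for p in workdir.iterdir())
  return ckpt_visible_before_marker, files


with tempfile.TemporaryDirectory() as tmp:
  ckpt_visible_before_marker, files = finish_training(
      pathlib.Path(tmp), last_step=100
  )
```

If the `touch()` came before `wait_until_finished()`, a reader polling for the marker could see it while the final checkpoint was still being written, which appears to be the race this commit removes.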
