Rewrite of the load_checkpoint function #650
base: main
Conversation
Will test this on Azure soon.
Heads up: #646 will likely go in first since tests are passing there (after a loss parity check is added). There will probably be merge conflicts afterwards, but hopefully nothing too bad.
Nit: get the checkpoint saving/uploading tests to pass. They may need modification to match the new logic; just be clear about why each test change is needed.
This is a rewrite of how we determine which checkpoint to load when starting or restarting a training run. (Originally this also included a refactor of how our different checkpoint paths are processed, but I separated that out for now.)
Previously the logic for this was quite brittle, with edge cases where metaseq would not load the correct checkpoint; see #544 for an example.
The new logic first gathers checkpoints from all possible sources (restore-file, finetune-from, local checkpoints, NFS / Azure checkpoints), assigns each candidate a priority based on its training progress, and prefers local caches. It then takes the most recent checkpoint and copies it to local disk, as sketched below.
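For illustration, here is a minimal Python sketch of that selection logic. Every name in it (`CheckpointCandidate`, `parse_num_updates`, `select_and_localize`, the `checkpoint_<N>.pt` naming scheme) is a hypothetical stand-in, not metaseq's actual API:

```python
# Hypothetical sketch of the "gather, rank, localize" flow described above;
# these names are illustrative only and do not come from metaseq.
import os
import re
import shutil
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class CheckpointCandidate:
    path: str          # local path, NFS path, or Azure blob path
    num_updates: int   # training progress parsed from the checkpoint name
    is_local: bool     # local caches win ties against remote copies


def parse_num_updates(path: str) -> int:
    # Assume checkpoints are named like "checkpoint_1000.pt"; names that
    # don't encode progress (e.g. "checkpoint_last.pt") sort lowest.
    match = re.search(r"checkpoint_(\d+)", os.path.basename(path))
    return int(match.group(1)) if match else -1


def gather_candidates(sources: List[Tuple[Optional[str], bool]]) -> List[CheckpointCandidate]:
    # sources: (path, is_local) pairs covering restore-file, finetune-from,
    # the local save dir, and NFS / Azure checkpoint locations.
    return [
        CheckpointCandidate(path, parse_num_updates(path), is_local)
        for path, is_local in sources
        if path is not None
    ]


def select_and_localize(candidates: List[CheckpointCandidate],
                        local_cache_dir: str) -> Optional[str]:
    # Pick the checkpoint furthest along in training, preferring a local
    # copy on ties, then make sure the winner ends up on local disk.
    if not candidates:
        return None
    best = max(candidates, key=lambda c: (c.num_updates, c.is_local))
    if best.is_local:
        return best.path
    local_path = os.path.join(local_cache_dir, os.path.basename(best.path))
    # Stand-in for the real transfer step (NFS copy or Azure blob download).
    shutil.copyfile(best.path, local_path)
    return local_path
```

Ranking by `(num_updates, is_local)` means a remote checkpoint that is further along still beats a stale local cache, while an equal-progress tie resolves to the local copy and skips the transfer step entirely.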
To test this you need both the metaseq and metaseq-internal PRs; the internal one is here: https://github.com/fairinternal/metaseq-internal/pull/842
I tested:
What I haven't tested yet is whether starting from an Azure blob path works.