When model training was forced to stop due to an accident, I used the "train_from" option to continue training from the checkpoint. But the result is different from training from start to finish without stopping:
1. The patience counter used for early stopping is not saved in the checkpoint.
2. The order of data batches provided by train_iter is different when training from a checkpoint: it starts over from the beginning of the dataset, so the data is very different from where it stood at the step of the saved checkpoint.
Note that I fixed all random seeds.
So it would be very convenient if a reproduction mechanism could be added to the code base (a sketch of the state this would involve is below).
Any help will be greatly appreciated.
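For context, here is a minimal sketch of the extra state a checkpoint would need to carry, beyond the model and optimizer, for a restart to be reproducible. This is not OpenNMT-py's actual checkpoint format; the function names and field names are assumptions for illustration only.

```python
# Hypothetical sketch: extra state a checkpoint would need so that resuming
# with "train_from" can continue exactly where training stopped.
# Field names are illustrative, not OpenNMT-py's actual checkpoint keys.
import random
import numpy as np
import torch

def save_resumable_checkpoint(path, model, optimizer, step, patience_counter, examples_seen):
    torch.save({
        "model": model.state_dict(),
        "optim": optimizer.state_dict(),
        "step": step,
        # early-stopping state that is currently lost on restart
        "patience_counter": patience_counter,
        # rough position in the training data (see the discussion below)
        "examples_seen": examples_seen,
        # RNG states so shuffling/dropout follow the same sequence after resume
        "rng": {
            "python": random.getstate(),
            "numpy": np.random.get_state(),
            "torch": torch.get_rng_state(),
            "cuda": torch.cuda.get_rng_state_all() if torch.cuda.is_available() else None,
        },
    }, path)

def load_resumable_checkpoint(path, model, optimizer):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optim"])
    random.setstate(ckpt["rng"]["python"])
    np.random.set_state(ckpt["rng"]["numpy"])
    torch.set_rng_state(ckpt["rng"]["torch"])
    if ckpt["rng"]["cuda"] is not None:
        torch.cuda.set_rng_state_all(ckpt["rng"]["cuda"])
    return ckpt["step"], ckpt["patience_counter"], ckpt["examples_seen"]
```

Fixing the random seeds alone is not enough, because on restart the RNGs and the data iterators are re-initialized from the beginning rather than restored to where they were at the saved step.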
This is roughly what I intended with #1826, but it's not compatible with all the changes we introduced in 2.0.
It should be possible to introduce such a mechanism though, one that would store a counter to keep track of where we are in each dataset. It would never be perfect, however, as there is quite a gap between when the data is read and when it is actually seen in a training batch, because of the pooling mechanism.
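A rough sketch of that counter idea, under the assumption of a per-corpus wrapper (the names here are hypothetical, this is not the actual OpenNMT-py 2.0 implementation):

```python
# Sketch: wrap each corpus, count what it has yielded, store the counts in the
# checkpoint, and skip that many examples when the iterators are rebuilt on resume.
class ResumableCorpus:
    def __init__(self, examples, name, consumed=0):
        self.examples = examples   # underlying iterable of raw examples
        self.name = name           # corpus id used as the key in the checkpoint
        self.consumed = consumed   # examples already yielded before the crash

    def __iter__(self):
        for i, example in enumerate(self.examples):
            if i < self.consumed:
                continue           # fast-forward: still read, but not yielded again
            self.consumed = i + 1  # keep the counter current for checkpointing
            yield example

def corpora_state(corpora):
    """Collect the per-corpus counters to store inside the checkpoint."""
    return {corpus.name: corpus.consumed for corpus in corpora}
```

The counters would be saved with the checkpoint and passed back as `consumed` when the corpora are rebuilt. As noted above, this can only be approximate: examples already read into the pooling buffers but never trained on would still be counted as consumed.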
Regarding this issue I implemented the following:
new option -dryrun_steps xxxxx
which batches for xxxxx steps without actually training, then starts training at step xxxxx+1 (a rough sketch is below).
That restarts the training at the exact point in the data where it stopped.
The only issue is that it is very, very slow to reach step xxxxx+1.
I'm open to suggestions if there is a better idea other than storing the index in each dataset.
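A sketch of what that dry-run loop amounts to (hypothetical names, not the actual option's implementation):

```python
# Consume batches without any forward/backward pass so the iterator reaches the
# same point in the data as before the crash, then resume real training.
def train_loop(train_iter, train_step_fn, train_steps, dryrun_steps=0):
    for step, batch in enumerate(train_iter, start=1):
        if step <= dryrun_steps:
            continue               # data is still read and batched, hence the slowness
        train_step_fn(batch)       # real training resumes at step dryrun_steps + 1
        if step >= train_steps:
            break
```

Since the dry-run still reads, bucket-sorts, and batches everything it skips, its cost grows with how far into training the checkpoint was, which is why reaching step xxxxx+1 is slow.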