Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can't restore model from checkpoint #139

Open
seventran opened this issue Apr 21, 2023 · 1 comment
Open

Can't restore model from checkpoint #139

seventran opened this issue Apr 21, 2023 · 1 comment

Comments

@seventran
Copy link

I cannot restore training model from checkpoint. The error log:
File "/data/zmining/jupyter-notebook/antnh/embeddings/tml/common/checkpointing/snapshot.py", line 67, in restore
snapshot.restore(self.state) #check
File "/home/zdeploy/anaconda3/envs/quanhm_torchrec/lib/python3.8/site-packages/torchsnapshot/snapshot.py", line 406, in restore
self._load_stateful(
File "/home/zdeploy/anaconda3/envs/quanhm_torchrec/lib/python3.8/site-packages/torchsnapshot/snapshot.py", line 671, in _load_stateful
stateful.load_state_dict(state_dict)
File "/home/zdeploy/anaconda3/envs/quanhm_torchrec/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2027, in load_state_dict
load(self, state_dict)
File "/home/zdeploy/anaconda3/envs/quanhm_torchrec/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2015, in load
load(child, child_state_dict, child_prefix)
File "/home/zdeploy/anaconda3/envs/quanhm_torchrec/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2015, in load
load(child, child_state_dict, child_prefix)
File "/home/zdeploy/anaconda3/envs/quanhm_torchrec/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2015, in load
load(child, child_state_dict, child_prefix)
[Previous line repeated 2 more times]
File "/home/zdeploy/anaconda3/envs/quanhm_torchrec/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2009, in load
module._load_from_state_dict(
File "/home/zdeploy/anaconda3/envs/quanhm_torchrec/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1909, in _load_from_state_dict
hook(state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs)
File "/home/zdeploy/anaconda3/envs/quanhm_torchrec/lib/python3.8/site-packages/torch/nn/modules/module.py", line 69, in call
return self.hook(module, *args, **kwargs)
File "/home/zdeploy/anaconda3/envs/quanhm_torchrec/lib/python3.8/site-packages/torchrec/distributed/embeddingbag.py", line 439, in _pre_load_state_dict_hook
local_shards = state_dict[key].local_shards()
KeyError: 'model._dmp_wrapped_module.module.large_embeddings.ebc.embedding_bags.user.weight'

Thank you very much for helping me.

@chenhaobupt
Copy link

do you solve this problelm? i get the same question,thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants