[DO NOT REVIEW] gaps to enable FSDP2 cpu offloading #622

Open
wants to merge 1 commit into base: main

Conversation

weifengpy
Contributor

@weifengpy commented Oct 16, 2024

command: CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh
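For context, FSDP2 CPU offload is enabled by passing an offload policy to fully_shard. The snippet below is a minimal, illustrative sketch of that wiring using the composable FSDP2 API; the exact torchtitan config/flag plumbing is not shown in this description, so helper and attribute names are assumptions, not the change in this PR:

```python
# Illustrative sketch only; the actual torchtitan wiring in this PR may differ.
# Uses the composable FSDP2 API from torch.distributed._composable.fsdp.
import torch.nn as nn
from torch.distributed._composable.fsdp import fully_shard, CPUOffloadPolicy

def apply_fsdp2_with_cpu_offload(model: nn.Module) -> nn.Module:
    # CPUOffloadPolicy keeps sharded parameters, gradients, and optimizer
    # state on CPU; pinned memory speeds up the H2D/D2H copies issued around
    # all-gather / reduce-scatter.
    policy = CPUOffloadPolicy(pin_memory=True)
    # Shard each transformer block, then the root module (the usual FSDP2
    # recipe); `model.layers` holding the blocks is an assumption about the
    # model structure.
    layers = model.layers.values() if isinstance(model.layers, nn.ModuleDict) else model.layers
    for block in layers:
        fully_shard(block, offload_policy=policy)
    fully_shard(model, offload_policy=policy)
    return model
```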

resolve #620

This PR runs FSDP2 CPU offload. There are two hacks / gaps we should resolve:

  • freqs_cis does not belong to model.parameters(). This PR moves it to the CUDA device since FSDP2 cannot manage it. Another option is to manage it as part of model.parameters() (sketched below).
  • Gradient clipping does not work for CPU tensors (DTensor dispatch + all_reduce on CPU; also sketched below).
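
Rough sketches of the two gaps above, assuming the torchtitan Llama model where freqs_cis is a non-parameter buffer on the root module; the helper names are illustrative, not the actual diff:

```python
import torch
import torch.nn as nn

def move_freqs_cis_to_gpu(model: nn.Module) -> None:
    # Gap 1: freqs_cis is a buffer, not a parameter, so FSDP2's CPU offload
    # does not manage it. The hack is to move it to the GPU explicitly; the
    # alternative noted above is to register it as a parameter so FSDP2
    # manages (and offloads) it like the rest of the model.
    device = torch.device("cuda", torch.cuda.current_device())
    model.freqs_cis = model.freqs_cis.to(device)  # assumes a plain tensor buffer on the root module

def clip_grads(model: nn.Module, max_norm: float = 1.0) -> torch.Tensor:
    # Gap 2: with CPU offload the sharded gradients are CPU DTensors, and
    # clip_grad_norm_ goes through DTensor dispatch plus an all_reduce on
    # CPU tensors, which is where it currently breaks.
    return torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
```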

@facebook-github-bot added the CLA Signed label Oct 16, 2024
@weifengpy
Contributor Author

This task can be helpful for ramp-up @mori360

Labels
CLA Signed
Development

Successfully merging this pull request may close these issues.

Is there a way to offload training memory to DRAM (using FSDP2?) for training Llama3-8B with torchtitan?
2 participants