Add 4GPU unit test #82

Merged
merged 5 commits into from
Feb 24, 2024

Contributor

@wconstab wconstab commented Feb 24, 2024

For now this just runs `NGPU=4 ./run_llama_train.sh`, but I verified that it at least catches problems.
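
A minimal sketch of what a smoke check like this amounts to, assuming the script reads `NGPU` from the environment and exits non-zero when training fails. The helper name `run_smoke_test` and the `timeout_s` parameter are hypothetical and not part of this PR; the actual job is defined in the repository's CI configuration.

```python
import os
import subprocess


def run_smoke_test(script="./run_llama_train.sh", ngpu=4, timeout_s=1800):
    """Run the training script with NGPU set and return its exit code.

    Hypothetical helper: assumes the script reads NGPU from the environment
    and exits non-zero when training fails.
    """
    # Copy the current environment and override only NGPU.
    env = dict(os.environ, NGPU=str(ngpu))
    # Invoke through bash so the sketch does not depend on the executable bit.
    result = subprocess.run(["bash", script], env=env, timeout=timeout_s)
    return result.returncode
```

A CI step could call `run_smoke_test()` and fail the job whenever the returned code is not 0.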

As a follow-up, we should integrate the multi-GPU (mgpu) test infra from PyTorch and set up actual unit tests to run in this job.

We should probably also keep testing the `run_llama_train.sh` script and add other combinations of 2D parallelism to ensure they all keep working.
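
To make "other combinations" concrete, here is a rough sketch that enumerates the two-way splits of a 4-GPU job. It assumes the two dimensions are a data-parallel degree (`dp`) and a tensor-parallel degree (`tp`) whose product equals the GPU count; those names and the function `two_d_combos` are illustrative assumptions, and how the degrees would be passed to `run_llama_train.sh` is not specified in this PR.

```python
def two_d_combos(ngpu=4):
    """Return every (dp, tp) pair of positive integers with dp * tp == ngpu."""
    return [(dp, ngpu // dp) for dp in range(1, ngpu + 1) if ngpu % dp == 0]


# For 4 GPUs this lists the splits (1, 4), (2, 2) and (4, 1).
for dp, tp in two_d_combos(4):
    print(f"dp={dp} tp={tp}")
```

Each pair would correspond to one run of the training script in the job, so a regression in any single layout shows up as a failed run.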

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Feb 24, 2024
Contributor

@gnadathur gnadathur left a comment

Great!

@wconstab wconstab merged commit 6e17001 into main Feb 24, 2024
4 checks passed
@wconstab wconstab deleted the whc/4gpu branch February 24, 2024 01:10
lessw2020 pushed a commit that referenced this pull request Apr 18, 2024
For now this literally just runs `NGPU=4 ./run_llama_train.sh` but I
verified at least it catches problems.

As a follow up, we should integrate mgpu test infra from pytorch and set
up actual unit tests to run in this job.

We should probably also keep testing the run_llama_train.sh script, and
add other combinations of 2D parallelism to ensure they all keep
working.

<img width="2120" alt="image"
src="https://github.com/pytorch/torchtrain/assets/4984825/2c235e9a-04ed-4f2d-9915-67de39d78e1c">
philippguevorguian pushed a commit to YerevaNN/YNNtitan that referenced this pull request Aug 17, 2024