[BE] replace the extra DeviceMesh _flatten with mesh access #666

Merged: 2 commits into gh/XilunWu/9/base on Oct 31, 2024

Conversation

XilunWu (Contributor) commented on Oct 30, 2024:

Stack from ghstack (oldest at bottom):

Summary
pytorch/pytorch#138945 fixes DeviceMesh access on flattened meshes that are constructed from more than two meshes. Refer to the fix PR for details if interested.

In #592 we avoided this issue by calling `_flatten` instead of directly accessing the flattened mesh. Now that the fix has been merged in PyTorch, we want to switch back to mesh access, which is more straightforward.
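For illustration, a minimal sketch of the two access patterns (the mesh shape and dim names here are made up, and it assumes a PyTorch build that already includes the fix):

```python
# Sketch only: mesh shape and names are illustrative, and this assumes a
# PyTorch build that includes pytorch/pytorch#138945, run under torchrun
# with 4 ranks (e.g. torchrun --nproc-per-node 4 this_script.py).
from torch.distributed.device_mesh import init_device_mesh

world_mesh = init_device_mesh("cuda", (2, 2), mesh_dim_names=("dp", "cp"))

# Before (#592): flatten at the point of use.
dp_cp_mesh = world_mesh["dp", "cp"]._flatten(mesh_dim_name="dp_cp")

# After (this PR): the flattened dim is registered on the root mesh, so
# later uses are a plain named access.
dp_cp_mesh = world_mesh["dp_cp"]
```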

XilunWu added a commit that referenced this pull request Oct 30, 2024
ghstack-source-id: 6afa471f6e5320e998a99422e26b2f7f09dd1c6f
Pull Request resolved: #666
```diff
-    if parallel_dims.cp_enabled
-    else world_mesh[dp_mesh_dim_names]
-)
+dp_mesh = world_mesh["dp_cp"] if parallel_dims.cp_enabled else world_mesh["dp"]
```
Contributor: Is this a new DeviceMesh functionality that reacts specifically to `<name1>_<name2>`?

XilunWu (Author): This is not new. DeviceMesh has supported `world_mesh["<name1>_<name2>"]` since the `_flatten` behavior was implemented. However, it had a bug: if the flattened mesh is constructed from 3+ mesh dimensions (e.g. `dp_cp` flattened from `dp_shard`, `dp_replicate`, and `cp`), accessing `world_mesh["dp_cp"]` throws an error, which breaks 3D/4D/5D composability.
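A rough repro of that failure mode (the shape and dim order are illustrative):

```python
# Sketch of the 3-source-dim case described above; assumes 8 ranks
# (e.g. torchrun --nproc-per-node 8) and is illustrative only.
from torch.distributed.device_mesh import init_device_mesh

world_mesh = init_device_mesh(
    "cuda", (2, 2, 2), mesh_dim_names=("dp_replicate", "dp_shard", "cp")
)
# Flattening three dims into one succeeds (this is the #592 workaround)...
world_mesh["dp_replicate", "dp_shard", "cp"]._flatten(mesh_dim_name="dp_cp")
# ...but on PyTorch builds without pytorch/pytorch#138945 this lookup
# raised, breaking 3D/4D/5D composability; with the fix it returns the
# 1-D flattened mesh.
dp_cp_mesh = world_mesh["dp_cp"]
```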

Contributor: Can we catch the error and ask users to update to some version?

wz337 (Contributor): For my understanding: for dp, if HSDP is enabled, `"dp"` is the flattened mesh of `"dp_replicate"` and `"dp_shard"`, right? Otherwise, `"dp"` is just `"dp_shard"`.

XilunWu (Author): @wz337, that's right. To summarize:

  1. FSDP: the only dp dimension in the mesh is `"dp"`.
  2. DDP: the only dp dimension in the mesh is `"dp"`.
  3. HSDP: the basic dp dimensions in the mesh are `"dp_shard"` and `"dp_replicate"`, which are later flattened into `"dp"` (see the sketch below).
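A sketch of the HSDP case (hypothetical shapes; assumes 4 ranks under torchrun):

```python
# Sketch of case 3 (HSDP); shapes are illustrative and assume 4 ranks.
from torch.distributed.device_mesh import init_device_mesh

world_mesh = init_device_mesh(
    "cuda", (2, 2), mesh_dim_names=("dp_replicate", "dp_shard")
)
# Flatten the two dp dims into a single "dp" dim on the root mesh...
world_mesh["dp_replicate", "dp_shard"]._flatten(mesh_dim_name="dp")
# ...so downstream code can use the same `world_mesh["dp"]` access it
# would use in the FSDP-only or DDP-only cases.
dp_mesh = world_mesh["dp"]
```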

tianyu-l (Contributor): lgtm

fegin (Contributor): It would be better to have a try-except to indicate that users are not using the latest PyTorch.


XilunWu (Author) commented on Oct 30, 2024:

> It would be better to have a try-except to indicate that users are not using the latest PyTorch.

Oh yeah that's right...
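Something like the following hypothetical guard could serve, with `world_mesh` and `parallel_dims` as in the diff above (the exception types and message are assumptions, not necessarily what later commits ship):

```python
# Hypothetical guard along the lines fegin suggests: catch the lookup
# failure on older PyTorch builds and point users at the required fix.
try:
    dp_mesh = (
        world_mesh["dp_cp"] if parallel_dims.cp_enabled else world_mesh["dp"]
    )
except (KeyError, RuntimeError) as e:
    raise RuntimeError(
        "Accessing a flattened DeviceMesh by name requires a PyTorch build "
        "that includes pytorch/pytorch#138945; please update PyTorch."
    ) from e
```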

XilunWu added a commit that referenced this pull request Oct 31, 2024
ghstack-source-id: a0689ec03803419d67a4a79ec325dfed15113cdf
Pull Request resolved: #666
XilunWu merged commit 3653bf2 into gh/XilunWu/9/base on Oct 31, 2024. 5 checks passed.
XilunWu added a commit that referenced this pull request on Oct 31, 2024:
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* __->__ #667

Note: This PR is a reland of #666, which was mistakenly merged into the wrong branch.

**Summary**
pytorch/pytorch#138945 fixes DeviceMesh access on flattened meshes that are constructed from more than two meshes. Refer to the fix PR for details if interested.

In #592 we avoided this issue by calling `_flatten` instead of directly accessing the flattened mesh. Now that the fix has been merged in PyTorch, we want to switch back to mesh access, which is more straightforward.