Collective Permute Long Tail on trn1.32xlarge #998
Comments
We are looking at this issue and will update soon. Thanks.
Can I get the URL for the profile result? If you can attach the NEFF as well, that would be very helpful.
The profile result is hosted on my instances, but here is my NEFF file (I had to zip it because the NEFF extension is not supported here): MODULE_SyncTensorsGraph.40_10114637376880686083.zip. Here is the script I use to profile (%1 is the Python script to execute, %2 is the number of workers to profile):
Here is my NTFF file:
I am launching nccl.collective_permute on a trn1.32xlarge. Within the workload, each NeuronCore sends data to a neighboring worker following a pre-specified topology. However, some of the workers see an extremely long duration (0.2 ms), whereas most of the workers have a duration of 0.014 ms.
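For readers unfamiliar with the operation, here is a minimal pure-Python sketch of what a collective permute does. It is an illustration only, not the code from this issue: the ring topology, the worker count of 32, and the helper names `ring_pairs` and `simulate_collective_permute` are all assumptions, and the actual topology used in the workload is not shown here.

```python
def ring_pairs(num_workers, shift=1):
    """Build (source, target) pairs for a ring: worker i sends to (i + shift) % num_workers.

    Hypothetical helper; the issue's real topology may differ.
    """
    return [(src, (src + shift) % num_workers) for src in range(num_workers)]


def simulate_collective_permute(values, pairs):
    """Reference semantics on a plain list: the target receives the source's value.

    Workers that are not a target of any pair receive 0 here, mirroring the
    zero-fill behaviour of XLA's CollectivePermute.
    """
    received = [0] * len(values)
    for src, dst in pairs:
        received[dst] = values[src]
    return received


# Assumed setup: 32 workers, one per NeuronCore on a trn1.32xlarge.
pairs = ring_pairs(32)
received = simulate_collective_permute(list(range(32)), pairs)
```

In this sketch every worker is both a source and a target exactly once, so the per-worker work is identical, which is why a 0.2 ms outlier against a 0.014 ms baseline stands out.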
Below is a screenshot of the profiling result of worker 1 (0.014 ms duration).
Below is a screenshot of the profiling result of worker 0 (abnormal 0.2 ms duration):
The source code is:
My pip freeze is:
My neuron-profile version is:
When profiling, I output the profile result of the second iteration: