Collective Permute Long Tail on trn1.32xlarge #998

zhdllwyc · 2024-10-04T17:51:18Z

I am launching nccl.collective_permute on a trn1.32xlarge. Within the workload, each NeuronCore sends data to a neighboring worker following a pre-specified topology. However, some of the workers take far longer (0.2 ms), whereas most of the workers have a duration of 0.014 ms.

Below is a screenshot of the profiling result of worker 1 (0.014 ms duration).

Below is a screenshot of the profiling result of worker 0 (abnormal 0.2 ms duration):

The source code is:

import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm


from neuronx_distributed.parallel_layers.parallel_state import (
    get_tensor_model_parallel_size,
    get_tensor_model_parallel_group,
    get_tensor_model_parallel_rank,
)

from neuronx_distributed.trace import parallel_model_trace, parallel_model_save

import torch.nn.functional as F

import neuronxcc.nki.language as nl
import neuronxcc.nki.nccl as nccl
import neuronxcc.nki.isa as nisa
from neuronxcc.nki.language import par_dim
import numpy as np

from torch_neuronx import nki_jit

@nki_jit
def basic_collective_permute_1(in_ref, out_ref):
    h, w = in_ref.shape
    send_buf = nl.ndarray((h, w), dtype=in_ref.dtype, buffer=nl.private_hbm, name="send_buf")
    recv_buf = nl.ndarray((h, w), dtype=in_ref.dtype, buffer=nl.private_hbm, name="recv_buf")
    
    ip_send_buf, if_send_buf = nl.mgrid[0:h, 0:w]

    nisa._tiled_offloaded_memcpy(src=in_ref[ip_send_buf, if_send_buf], dst=send_buf[ip_send_buf, if_send_buf])

    nccl.collective_permute(src=send_buf[:, :], dst=recv_buf[:, :],
                            replica_groups=[[0, 1], [1, 2], [2, 3], [3, 10], [10, 11], [11, 18], [18, 19], [19, 26], [26, 27], [27, 28], [28, 29], [29, 4], [4, 5], [5, 12], [12, 13], [13, 20], [20, 21], [21, 22], [22, 23], [23, 30], [30, 31], [31, 6], [6, 7], [7, 14], [14, 15], [15, 8], [8, 9], [9, 16], [16, 17], [17, 24], [24, 25], [25, 0]])

    nisa._tiled_offloaded_memcpy(src=recv_buf[ip_send_buf, if_send_buf], dst=out_ref[ip_send_buf, if_send_buf])

class TestModule(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, q):
        h = q.shape[0]
        w = q.shape[1]

        output = torch.zeros((h, w), dtype=q.dtype, device=q.device)
        basic_collective_permute_1(q, output)
        return output


def get_model():
    # parallel_model_trace needs a function that returns a Model and a dictionary of states.
    # See details at:
    # https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/api_guide.html#model-trace
    return TestModule(), {} # Dictionary of states

def run_test_module(q):
    
    input = (q,)

    traced_model = parallel_model_trace(
        get_model,
        input,
        tp_degree=32,
        compiler_workdir='./compile_cache',
        compiler_args=None,
        max_parallel_compilations=8,
    )
    parallel_model_save(traced_model, "./saved_model/")

    result = traced_model(q)
    return result

def main():

    h, w = 128, 4096 

    q = torch.ones([h, w], dtype=torch.bfloat16)
    result = run_test_module(q)

    print(result)

    golden = torch.ones((h, w), dtype=torch.bfloat16)
    assert(torch.allclose(result, golden))


if __name__ == "__main__":
    main()
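As a sanity check on the topology, here is a minimal sketch that verifies the pair list passed as replica_groups to nccl.collective_permute in the kernel above. It assumes each [a, b] entry means "rank a sends to rank b"; that reading is an assumption about the argument's semantics, and the helper names check_permutation and ring_length are hypothetical.

```python
from collections import Counter

# Pairs copied from the replica_groups argument in the kernel above.
# Assumption: each [a, b] entry means "rank a sends to rank b".
PAIRS = [[0, 1], [1, 2], [2, 3], [3, 10], [10, 11], [11, 18], [18, 19], [19, 26],
         [26, 27], [27, 28], [28, 29], [29, 4], [4, 5], [5, 12], [12, 13], [13, 20],
         [20, 21], [21, 22], [22, 23], [23, 30], [30, 31], [31, 6], [6, 7], [7, 14],
         [14, 15], [15, 8], [8, 9], [9, 16], [16, 17], [17, 24], [24, 25], [25, 0]]


def check_permutation(pairs, world_size):
    """Return a list of problems; an empty list means every rank sends once and receives once."""
    problems = []
    sources = Counter(src for src, _ in pairs)
    targets = Counter(dst for _, dst in pairs)
    for rank in range(world_size):
        if sources[rank] != 1:
            problems.append(f"rank {rank} appears as a source {sources[rank]} times")
        if targets[rank] != 1:
            problems.append(f"rank {rank} appears as a target {targets[rank]} times")
    return problems


def ring_length(pairs, start=0):
    """Follow source -> target links from `start`; return the hop count back to `start`, or None."""
    next_rank = dict(pairs)
    rank, hops = start, 0
    while hops < len(pairs):
        if rank not in next_rank:
            return None
        rank = next_rank[rank]
        hops += 1
        if rank == start:
            return hops
    return None


if __name__ == "__main__":
    print("problems:", check_permutation(PAIRS, world_size=32))
    print("ring length from rank 0:", ring_length(PAIRS))
```

Under that assumption, the list describes a single ring over all 32 ranks, so the pair list itself does not single out worker 0.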

My pip freeze is:

absl-py==2.1.0
accelerate==0.34.2
anyio==4.6.0
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
asttokens==2.4.1
async-lru==2.0.4
attrs==24.2.0
aws-neuronx-runtime-discovery==2.9
awscli==1.34.25
babel==2.16.0
beautifulsoup4==4.12.3
bleach==6.1.0
boto3==1.35.25
botocore==1.35.25
cachetools==5.5.0
certifi==2024.8.30
cffi==1.17.1
charset-normalizer==3.3.2
cloud-tpu-client==0.10
colorama==0.4.6
comm==0.2.2
debugpy==1.8.5
decorator==5.1.1
defusedxml==0.7.1
docutils==0.16
ec2-metadata==2.13.0
environment-kernels==1.2.0
exceptiongroup==1.2.2
executing==2.1.0
fastjsonschema==2.20.0
filelock==3.16.1
fqdn==1.5.1
fsspec==2024.9.0
google-api-core==1.34.1
google-api-python-client==1.8.0
google-auth==2.35.0
google-auth-httplib2==0.2.0
googleapis-common-protos==1.65.0
h11==0.14.0
httpcore==1.0.5
httplib2==0.22.0
httpx==0.27.2
huggingface-hub==0.25.1
idna==3.10
ipykernel==6.29.5
ipython==8.27.0
ipywidgets==8.1.5
islpy==2023.2.5
isoduration==20.11.0
jedi==0.19.1
Jinja2==3.1.4
jmespath==1.0.1
json5==0.9.25
jsonpointer==3.0.0
jsonschema==4.23.0
jsonschema-specifications==2023.12.1
jupyter==1.1.1
jupyter-console==6.6.3
jupyter-events==0.10.0
jupyter-lsp==2.2.5
jupyter_client==8.6.3
jupyter_core==5.7.2
jupyter_server==2.14.2
jupyter_server_terminals==0.5.3
jupyterlab==4.2.5
jupyterlab_pygments==0.3.0
jupyterlab_server==2.27.3
jupyterlab_widgets==3.0.13
libneuronxla==2.0.4115.0
lockfile==0.12.2
MarkupSafe==2.1.5
matplotlib-inline==0.1.7
mistune==3.0.2
ml-dtypes==0.2.0
mpmath==1.3.0
nbclient==0.10.0
nbconvert==7.16.4
nbformat==5.10.4
nest-asyncio==1.6.0
networkx==2.8.8
neuronx-cc==2.15.128.0+56dc5a86
neuronx-distributed==0.9.0
notebook==7.2.2
notebook_shim==0.2.4
numpy==1.25.2
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.6.68
nvidia-nvtx-cu12==12.1.105
oauth2client==4.1.3
overrides==7.7.0
packaging==24.1
pandocfilters==1.5.1
parso==0.8.4
pexpect==4.9.0
pgzip==0.3.5
pillow==10.4.0
platformdirs==4.3.6
prometheus_client==0.21.0
prompt_toolkit==3.0.47
protobuf==3.20.3
psutil==6.0.0
ptyprocess==0.7.0
pure_eval==0.2.3
pyasn1==0.6.1
pyasn1_modules==0.4.1
pycparser==2.22
Pygments==2.18.0
pyparsing==3.1.4
python-daemon==3.0.1
python-dateutil==2.9.0.post0
python-json-logger==2.0.7
PyYAML==6.0.2
pyzmq==26.2.0
referencing==0.35.1
regex==2024.9.11
requests==2.31.0
requests-unixsocket==0.3.0
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rpds-py==0.20.0
rsa==4.7.2
s3transfer==0.10.2
safetensors==0.4.5
scipy==1.11.2
Send2Trash==1.8.3
six==1.16.0
sniffio==1.3.1
soupsieve==2.6
stack-data==0.6.3
sympy==1.13.3
terminado==0.18.1
tinycss2==1.3.0
tokenizers==0.19.1
tomli==2.0.1
torch==2.1.2
torch-neuronx==2.1.2.2.3.0
torch-xla==2.1.4
torchvision==0.16.2
tornado==6.4.1
tqdm==4.66.5
traitlets==5.14.3
transformers==4.44.2
transformers-neuronx==0.12.313
triton==2.1.0
types-python-dateutil==2.9.0.20240906
typing_extensions==4.12.2
uri-template==1.3.0
uritemplate==3.0.1
urllib3==2.2.3
wcwidth==0.2.13
webcolors==24.8.0
webencodings==0.5.1
websocket-client==1.8.0
wget==3.2
widgetsnbextension==4.0.13

My neuron-profile version is:

neuron-profile 2.19.0.0%kaena-tools/2.19@c48a122 built on 2024-08-02T17:21:14Z

When profiling, I output the profile result of the second iteration:

neuron-profile capture -n "$file" -s profile.ntff --profile-nth-exec=2
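To quantify the long tail once a duration has been read off each worker's profile, a small helper like the following could flag the slow ranks. This is only a sketch: the function name slow_ranks and the 5x threshold are arbitrary choices for illustration, and the only durations used are the two reported above (worker 0 and worker 1).

```python
def slow_ranks(durations_ms, factor=5.0):
    """Return the ranks whose duration exceeds `factor` times the fastest rank."""
    fastest = min(durations_ms.values())
    return sorted(rank for rank, d in durations_ms.items() if d > factor * fastest)


# Durations reported in this issue: worker 0 took 0.2 ms, worker 1 took 0.014 ms.
reported = {0: 0.2, 1: 0.014}

if __name__ == "__main__":
    print("slow ranks:", slow_ranks(reported))
    ratio = max(reported.values()) / min(reported.values())
    print(f"slowest / fastest = {ratio:.1f}x")
```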


koyongse · 2024-10-04T18:03:22Z

We are looking at this issue. We will update soon. Thanks.

koyongse · 2024-10-04T18:21:52Z

Can I get the URL for the profile result? If you can attach the NEFF as well, that would be very helpful.

zhdllwyc · 2024-10-04T19:15:44Z

The profile result is hosted on my instances, but here goes my NEFF file (I have to zip it because NEFF extension is not supported here).

MODULE_SyncTensorsGraph.40_10114637376880686083.zip

Here goes the script I use to profile ($1 is the Python script to execute, $2 is the number of workers to profile):

#!/bin/bash

# Check if file is provided as an argument
if [ -z "$1" ]; then
  echo "Please provide a file."
  exit 1
fi

# Check if the provided argument is a file
if [ ! -f "$1" ]; then
  echo "The provided argument is not a file."
  exit 1
fi

current_datetime=$(TZ="America/Los_Angeles" date +"%Y-%m-%d-%H:%M:%S")

filename="${1%.py}"

DIR="${filename}_${current_datetime}"

rm -rf /tmp/ubuntu/neuroncc_compile_workdir/*
rm -rf /var/tmp/neuron-compile-cache/neuronxcc-*/*

rm -rf "$DIR"

mkdir "$DIR"

python "$1"

mv MODULE_* "$DIR"

cd "$DIR"

# Find the first file with the .neff extension in the current directory
file=$(find . -maxdepth 1 -type f -name "*.neff" | head -n 1)

neuron-profile capture -n "$file" -s profile.ntff --collectives-workers-per-node $2 --profile-nth-exec=2

mkdir profile_result
mv profile_*exec* profile_result/

mkdir profile_result_json

for ntff_file in profile_result/*; do
    echo "$ntff_file"
    rank_integer=$(echo "$ntff_file" | grep -oP '(?<=_rank_)[0-9]+')
    echo "$rank_integer"
    neuron-profile view --output-format json --output-file "./profile_result_json/profile_${rank_integer}_${current_datetime}.json" -n "$file" -s "${ntff_file}"
done

neuron-profile view -n "$file" -d profile_result --db-bucket="${current_datetime}"

cd ..

zhdllwyc · 2024-10-04T20:19:01Z

Here goes my NTFF file:
profile_result.zip

koyongse · 2024-10-04T23:03:40Z

On the profile, if you hover over the CC box, you can see that the operation is an AllReduce. I could also confirm this by disassembling the binary in the NEFF. This might be because your Neuron version does not support CollectivePermute. The Neuron compiler and runtime should support it.
Please use the latest Neuron version.
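Before and after upgrading, it can help to record which Neuron packages are actually installed in the active environment. A minimal sketch using only the standard library follows; the package names are taken from the pip freeze above, and the helper name installed_versions is hypothetical. It does not say which release supports CollectivePermute, it only reports what is installed.

```python
from importlib.metadata import PackageNotFoundError, version

# Package names as they appear in the pip freeze posted in this issue.
NEURON_PACKAGES = ("neuronx-cc", "torch-neuronx", "neuronx-distributed", "libneuronxla")


def installed_versions(packages=NEURON_PACKAGES):
    """Map each package name to its installed version, or None if it is not installed."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver if ver is not None else 'not installed'}")
```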

JonathanHenson assigned aws-qieqingy Oct 8, 2024

JonathanHenson added the NKI label Oct 8, 2024
