Collective Permute Long Tail on trn1.32xlarge #998

Open
zhdllwyc opened this issue Oct 4, 2024 · 5 comments

zhdllwyc commented Oct 4, 2024

I am launching nccl.collective_permute on a trn1.32xlarge. Within the workload, each NeuronCore sends data to a neighboring worker following a pre-specified topology. However, some of the workers show an extremely long duration (0.2 ms), whereas most of the workers have a duration of 0.014 ms.

Below is a screenshot of the profiling result for worker 1 (0.014 ms duration):
[Screenshot from 2024-10-04 10-46-41]

Below is a screenshot of the profiling result for worker 0 (abnormal 0.2 ms duration):
[Screenshot from 2024-10-04 10-46-51]

The source code is:

import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm


from neuronx_distributed.parallel_layers.parallel_state import (
    get_tensor_model_parallel_size,
    get_tensor_model_parallel_group,
    get_tensor_model_parallel_rank,
)

from neuronx_distributed.trace import parallel_model_trace, parallel_model_save

import torch.nn.functional as F

import neuronxcc.nki.language as nl
import neuronxcc.nki.nccl as nccl
import neuronxcc.nki.isa as nisa
from neuronxcc.nki.language import par_dim
import numpy as np

from torch_neuronx import nki_jit

@nki_jit
def basic_collective_permute_1(in_ref, out_ref):
    h, w = in_ref.shape
    send_buf = nl.ndarray((h, w), dtype=in_ref.dtype, buffer=nl.private_hbm, name="send_buf")
    recv_buf = nl.ndarray((h, w), dtype=in_ref.dtype, buffer=nl.private_hbm, name="recv_buf")
    
    ip_send_buf, if_send_buf = nl.mgrid[0:h, 0:w]

    nisa._tiled_offloaded_memcpy(src=in_ref[ip_send_buf, if_send_buf], dst=send_buf[ip_send_buf, if_send_buf])

    nccl.collective_permute(src=send_buf[:, :], dst=recv_buf[:, :],
                            replica_groups=[[0, 1], [1, 2], [2, 3], [3, 10], [10, 11], [11, 18], [18, 19], [19, 26], [26, 27], [27, 28], [28, 29], [29, 4], [4, 5], [5, 12], [12, 13], [13, 20], [20, 21], [21, 22], [22, 23], [23, 30], [30, 31], [31, 6], [6, 7], [7, 14], [14, 15], [15, 8], [8, 9], [9, 16], [16, 17], [17, 24], [24, 25], [25, 0]])

    nisa._tiled_offloaded_memcpy(src=recv_buf[ip_send_buf, if_send_buf], dst=out_ref[ip_send_buf, if_send_buf])

class TestModule(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, q):
        h = q.shape[0]
        w = q.shape[1]

        output = torch.zeros((h, w), dtype=q.dtype, device=q.device)
        basic_collective_permute_1(q, output)
        return output


def get_model():
    # parallel_model_trace needs a function that returns a Model and a dictionary of states.
    # See details at:
    # https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/api_guide.html#model-trace
    return TestModule(), {} # Dictionary of states

def run_test_module(q):
    
    input = (q,)

    traced_model = parallel_model_trace(
        get_model,
        input,
        tp_degree=32,
        compiler_workdir='./compile_cache',
        compiler_args=None,
        max_parallel_compilations=8,
    )
    parallel_model_save(traced_model, "./saved_model/")

    result = traced_model(q)
    return result

def main():

    h, w = 128, 4096 

    q = torch.ones([h, w], dtype=torch.bfloat16)
    result = run_test_module(q)

    print(result)

    golden = torch.ones((h, w), dtype=torch.bfloat16)
    assert(torch.allclose(result, golden))


if __name__ == "__main__":
    main()
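
As a sanity check on the topology, the pairs passed to replica_groups can be verified offline. The sketch below is plain Python and does not use the Neuron SDK. It assumes each pair is read as [source, target], which is an assumption about the semantics rather than something confirmed in this thread, and the helper names check_single_ring and simulate_permute are hypothetical.

```python
from typing import List, Sequence

# Pairs copied from the replica_groups argument in the kernel above.
# Assumption: each pair means [source worker, target worker].
PAIRS = [[0, 1], [1, 2], [2, 3], [3, 10], [10, 11], [11, 18], [18, 19],
         [19, 26], [26, 27], [27, 28], [28, 29], [29, 4], [4, 5], [5, 12],
         [12, 13], [13, 20], [20, 21], [21, 22], [22, 23], [23, 30],
         [30, 31], [31, 6], [6, 7], [7, 14], [14, 15], [15, 8], [8, 9],
         [9, 16], [16, 17], [17, 24], [24, 25], [25, 0]]


def check_single_ring(pairs: Sequence[Sequence[int]], world_size: int) -> List[int]:
    """Return the ring order starting at worker 0, or raise ValueError."""
    send_to = {}
    for src, dst in pairs:
        if src in send_to:
            raise ValueError(f"worker {src} sends more than once")
        send_to[src] = dst
    everyone = list(range(world_size))
    # Every worker must send exactly once and receive exactly once.
    if sorted(send_to) != everyone or sorted(send_to.values()) != everyone:
        raise ValueError("pairs are not a permutation of all workers")
    # Follow the links from worker 0 until the walk returns to worker 0.
    order = [0]
    nxt = send_to[0]
    while nxt != 0:
        order.append(nxt)
        nxt = send_to[nxt]
    if len(order) != world_size:
        raise ValueError("pairs form more than one cycle")
    return order


def simulate_permute(pairs: Sequence[Sequence[int]], values: List[int]) -> List[int]:
    """Host-side model of a permute: the target receives the source's value."""
    received = list(values)
    for src, dst in pairs:
        received[dst] = values[src]
    return received


ring = check_single_ring(PAIRS, 32)
received = simulate_permute(PAIRS, list(range(32)))
print(ring)
print(received)
```

Note that the all-ones input used in main() is identical on every worker, so the allclose check by itself would not reveal which neighbor the data came from. A value that differs per worker, as in the simulation above, would.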

My pip freeze is:

absl-py==2.1.0
accelerate==0.34.2
anyio==4.6.0
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
asttokens==2.4.1
async-lru==2.0.4
attrs==24.2.0
aws-neuronx-runtime-discovery==2.9
awscli==1.34.25
babel==2.16.0
beautifulsoup4==4.12.3
bleach==6.1.0
boto3==1.35.25
botocore==1.35.25
cachetools==5.5.0
certifi==2024.8.30
cffi==1.17.1
charset-normalizer==3.3.2
cloud-tpu-client==0.10
colorama==0.4.6
comm==0.2.2
debugpy==1.8.5
decorator==5.1.1
defusedxml==0.7.1
docutils==0.16
ec2-metadata==2.13.0
environment-kernels==1.2.0
exceptiongroup==1.2.2
executing==2.1.0
fastjsonschema==2.20.0
filelock==3.16.1
fqdn==1.5.1
fsspec==2024.9.0
google-api-core==1.34.1
google-api-python-client==1.8.0
google-auth==2.35.0
google-auth-httplib2==0.2.0
googleapis-common-protos==1.65.0
h11==0.14.0
httpcore==1.0.5
httplib2==0.22.0
httpx==0.27.2
huggingface-hub==0.25.1
idna==3.10
ipykernel==6.29.5
ipython==8.27.0
ipywidgets==8.1.5
islpy==2023.2.5
isoduration==20.11.0
jedi==0.19.1
Jinja2==3.1.4
jmespath==1.0.1
json5==0.9.25
jsonpointer==3.0.0
jsonschema==4.23.0
jsonschema-specifications==2023.12.1
jupyter==1.1.1
jupyter-console==6.6.3
jupyter-events==0.10.0
jupyter-lsp==2.2.5
jupyter_client==8.6.3
jupyter_core==5.7.2
jupyter_server==2.14.2
jupyter_server_terminals==0.5.3
jupyterlab==4.2.5
jupyterlab_pygments==0.3.0
jupyterlab_server==2.27.3
jupyterlab_widgets==3.0.13
libneuronxla==2.0.4115.0
lockfile==0.12.2
MarkupSafe==2.1.5
matplotlib-inline==0.1.7
mistune==3.0.2
ml-dtypes==0.2.0
mpmath==1.3.0
nbclient==0.10.0
nbconvert==7.16.4
nbformat==5.10.4
nest-asyncio==1.6.0
networkx==2.8.8
neuronx-cc==2.15.128.0+56dc5a86
neuronx-distributed==0.9.0
notebook==7.2.2
notebook_shim==0.2.4
numpy==1.25.2
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.6.68
nvidia-nvtx-cu12==12.1.105
oauth2client==4.1.3
overrides==7.7.0
packaging==24.1
pandocfilters==1.5.1
parso==0.8.4
pexpect==4.9.0
pgzip==0.3.5
pillow==10.4.0
platformdirs==4.3.6
prometheus_client==0.21.0
prompt_toolkit==3.0.47
protobuf==3.20.3
psutil==6.0.0
ptyprocess==0.7.0
pure_eval==0.2.3
pyasn1==0.6.1
pyasn1_modules==0.4.1
pycparser==2.22
Pygments==2.18.0
pyparsing==3.1.4
python-daemon==3.0.1
python-dateutil==2.9.0.post0
python-json-logger==2.0.7
PyYAML==6.0.2
pyzmq==26.2.0
referencing==0.35.1
regex==2024.9.11
requests==2.31.0
requests-unixsocket==0.3.0
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rpds-py==0.20.0
rsa==4.7.2
s3transfer==0.10.2
safetensors==0.4.5
scipy==1.11.2
Send2Trash==1.8.3
six==1.16.0
sniffio==1.3.1
soupsieve==2.6
stack-data==0.6.3
sympy==1.13.3
terminado==0.18.1
tinycss2==1.3.0
tokenizers==0.19.1
tomli==2.0.1
torch==2.1.2
torch-neuronx==2.1.2.2.3.0
torch-xla==2.1.4
torchvision==0.16.2
tornado==6.4.1
tqdm==4.66.5
traitlets==5.14.3
transformers==4.44.2
transformers-neuronx==0.12.313
triton==2.1.0
types-python-dateutil==2.9.0.20240906
typing_extensions==4.12.2
uri-template==1.3.0
uritemplate==3.0.1
urllib3==2.2.3
wcwidth==0.2.13
webcolors==24.8.0
webencodings==0.5.1
websocket-client==1.8.0
wget==3.2
widgetsnbextension==4.0.13

My neuron-profile version is:

neuron-profile 2.19.0.0%kaena-tools/2.19@c48a122 built on 2024-08-02T17:21:14Z

When profiling, I output the profile result of the second iteration:

neuron-profile capture -n "$file" -s profile.ntff --profile-nth-exec=2

koyongse commented Oct 4, 2024

We are looking at this issue. We will update soon. Thanks.

koyongse commented Oct 4, 2024

Can I get the URL for the profile result? If you can attach the NEFF as well, that would be very helpful.

zhdllwyc commented Oct 4, 2024

The profile result is hosted on my instance, but here is my NEFF file (I had to zip it because the NEFF extension is not supported here).

MODULE_SyncTensorsGraph.40_10114637376880686083.zip

Here is the script I use to profile ($1 is the Python script to execute, $2 is the number of workers to profile):

#!/bin/bash

# Check if file is provided as an argument
if [ -z "$1" ]; then
  echo "Please provide a file."
  exit 1
fi

# Check if the provided argument is a file
if [ ! -f "$1" ]; then
  echo "The provided argument is not a file."
  exit 1
fi

current_datetime=$(TZ="America/Los_Angeles" date +"%Y-%m-%d-%H:%M:%S")

filename="${1%.py}"

DIR="${filename}_${current_datetime}"

rm -rf /tmp/ubuntu/neuroncc_compile_workdir/*
rm -rf /var/tmp/neuron-compile-cache/neuronxcc-*/*

rm -rf "$DIR"

mkdir "$DIR"

python "$1"

mv MODULE_* "$DIR"

cd "$DIR"

# Find the first file with the .neff extension in the current directory
file=$(find . -maxdepth 1 -type f -name "*.neff" | head -n 1)

neuron-profile capture -n "$file" -s profile.ntff --collectives-workers-per-node "$2" --profile-nth-exec=2

mkdir profile_result
mv profile_*exec* profile_result/

mkdir profile_result_json

for ntff_file in profile_result/*; do
    echo "$ntff_file"
    rank_integer=$(echo "$ntff_file" | grep -oP '(?<=_rank_)[0-9]+')
    echo "$rank_integer"
    neuron-profile view --output-format json --output-file "./profile_result_json/profile_${rank_integer}_${current_datetime}.json" -n "$file" -s "${ntff_file}"
done

neuron-profile view -n "$file" -d profile_result --db-bucket="${current_datetime}"

cd ..
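
For comparing workers after the script has run, the rank-extraction step and a simple long-tail check could be sketched in Python as below. This is only an illustration: the helper names are hypothetical, the regex mirrors the grep -oP pattern in the script, and the per-rank durations are not measured data. They are reconstructed from the two numbers reported in this issue (0.2 ms on worker 0, 0.014 ms on most workers).

```python
import re
import statistics
from typing import Dict, List, Optional

# Same lookbehind pattern as the grep -oP call in the shell script.
_RANK_RE = re.compile(r"(?<=_rank_)[0-9]+")


def rank_from_name(name: str) -> Optional[int]:
    """Extract the rank from a file name containing '_rank_<n>', if any."""
    match = _RANK_RE.search(name)
    return int(match.group()) if match else None


def long_tail_ranks(durations_ms: Dict[int, float], factor: float = 5.0) -> List[int]:
    """Return ranks whose duration exceeds factor times the median duration."""
    baseline = statistics.median(durations_ms.values())
    return sorted(r for r, d in durations_ms.items() if d > factor * baseline)


# Illustrative input only, built from the two durations quoted in the issue.
durations = {rank: 0.014 for rank in range(32)}
durations[0] = 0.2
print(long_tail_ranks(durations))
```

The median is used as the baseline so that a few slow workers do not shift the reference point. The factor of 5 is an arbitrary choice for this sketch.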

zhdllwyc commented Oct 4, 2024

Here is my NTFF file:
profile_result.zip

koyongse commented Oct 4, 2024

  1. On the profile, if you hover over the CC box, you can see that the operation is AllReduce. I also confirmed this by disassembling the binary in the NEFF. This might be because your Neuron version does not support CollectivePermute. Both the Neuron compiler and the runtime should support it.
  2. Please use the latest Neuron version.
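
To report which Neuron packages are installed before and after upgrading, a small check like the following may help. It is a sketch using only the standard library. The package names are taken from the pip freeze output earlier in this thread, and the function name installed_versions is hypothetical. It does not determine which release first supports CollectivePermute.

```python
from importlib import metadata
from typing import Dict, Optional, Sequence

# Package names as they appear in the pip freeze output in this thread.
NEURON_PACKAGES = ("neuronx-cc", "torch-neuronx", "neuronx-distributed", "libneuronxla")


def installed_versions(packages: Sequence[str] = NEURON_PACKAGES) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for package, version in installed_versions().items():
    print(f"{package}: {version or 'not installed'}")
```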
