torchrun error when generating training split #24

Open
OswaldHe opened this issue Aug 1, 2024 · 3 comments

OswaldHe commented Aug 1, 2024

When I try to run run/train.sh for OPT-2.7b, it generates the training split for the first 5813 samples, then exits immediately without any error log.

Generating train split:   7%|▋         | 5813/81380 [00:35<03:31, 357.02 examples/s]
E0731 23:14:13.108000 140299780256832 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -9) local_rank: 0 (pid: 488431) of binary: /home/oswaldhe/miniconda3/envs/autocompressor/bin/python
Traceback (most recent call last):
  File "/home/oswaldhe/miniconda3/envs/autocompressor/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

I'm running on an NVIDIA A100 40GB PCIe. What could be the issue? Thank you.
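
For anyone reading the log above: the `exitcode: -9` is worth decoding. Assuming torchrun follows the usual Python subprocess convention, where a negative exit code -N means the child was terminated by signal N, -9 corresponds to SIGKILL. On Linux this is commonly (though not always) the kernel's out-of-memory killer. A minimal sketch of that decoding, where `describe_exit_code` is a hypothetical helper and not part of torch:

```python
import signal


def describe_exit_code(exitcode: int) -> str:
    """Translate a subprocess-style exit code into a readable description."""
    if exitcode < 0:
        # Negative codes follow the subprocess convention: -N means "killed by signal N".
        try:
            name = signal.Signals(-exitcode).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {-exitcode} ({name})"
    if exitcode == 0:
        return "exited normally"
    return f"exited with status {exitcode}"


print(describe_exit_code(-9))
```

A SIGKILL cannot be caught by the process, which would explain why no Python traceback appears for the worker itself. If the OOM killer was responsible, the kernel log (for example via `dmesg`) may contain a matching entry.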

@CodeCreator
Member

Sorry, I'm not sure what the issue is; it might be related to your setup (e.g., disk space, RAM). Are there any additional error messages?
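
A quick way to check the two resources mentioned above (disk space and RAM) is sketched below. It uses only the standard library. `host_resources` is a hypothetical helper name, and the RAM figure relies on `os.sysconf`, which is only available on POSIX systems, so treat it as a rough check rather than a full diagnostic:

```python
import os
import shutil


def host_resources(path: str = ".") -> dict:
    """Return free disk space for `path` and total physical RAM, both in GiB."""
    gib = 1024 ** 3
    disk = shutil.disk_usage(path)
    report = {"disk_free_gib": disk.free / gib}
    try:
        # POSIX only: page size multiplied by the number of physical pages.
        ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
        report["ram_total_gib"] = ram_bytes / gib
    except (AttributeError, ValueError, OSError):
        # os.sysconf or these names are not available on this platform.
        report["ram_total_gib"] = None
    return report


print(host_resources())
```

Pointing `path` at the directory that holds the dataset cache gives the most relevant disk figure, since that is where the generated split is written.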

@OswaldHe
Author

OswaldHe commented Aug 1, 2024

Thank you for your response. I increased the RAM to 50GB, and it can now generate the training split. However, when training starts, it raises a wandb-related error:

[WARNING|integration_utils.py:81] 2024-08-01 00:17:04,944 >> Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/oswaldhe/AutoCompressors/train.py", line 286, in <module>
[rank0]:     main()
[rank0]:   File "/home/oswaldhe/AutoCompressors/train.py", line 226, in main
[rank0]:     trainer = SubstepTrainer(
[rank0]:   File "/home/oswaldhe/AutoCompressors/substep_trainer.py", line 69, in __init__
[rank0]:     super().__init__(model,
[rank0]:   File "/home/oswaldhe/AutoCompressors/base_trainer.py", line 138, in __init__
[rank0]:     super().__init__(model, args, *more_args, **kwargs)
[rank0]:   File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/transformers/trainer.py", line 557, in __init__
[rank0]:     self.callback_handler = CallbackHandler(
[rank0]:   File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/transformers/trainer_callback.py", line 305, in __init__
[rank0]:     self.add_callback(cb)
[rank0]:   File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/transformers/trainer_callback.py", line 322, in add_callback
[rank0]:     cb = callback() if isinstance(callback, type) else callback
[rank0]:   File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/transformers/integrations/integration_utils.py", line 673, in __init__
[rank0]:     raise RuntimeError("WandbCallback requires wandb to be installed. Run `pip install wandb`.")
[rank0]: RuntimeError: WandbCallback requires wandb to be installed. Run `pip install wandb`.

I have already installed wandb. Here are all the packages I have installed, with their versions:

absl-py==1.4.0
accelerate==0.24.1
aiohttp==3.8.5
aiosignal==1.3.1
altair==5.0.1
array-record==0.4.1
async-timeout==4.0.3
attributedict==0.3.0
attrs==23.2.0
audioread==3.0.0
autobridge==0.0.20220512.dev1
blessings==1.7
cached-property==1.5.2
cachetools==5.3.1
certifi==2024.7.4
cffi==1.15.1
chardet==5.2.0
charset-normalizer==3.3.2
click==8.1.4
cmake==3.27.2
codecov==2.1.13
colorama==0.4.6
coloredlogs==15.0.1
colour-runner==0.1.1
conllu==4.5.3
contourpy==1.1.0
coverage==7.3.0
cycler==0.11.0
DataProperty==1.0.1
datasets==2.14.0
decorator==5.1.1
deepdiff==6.3.1
dill==0.3.7
distlib==0.3.7
dm-tree==0.1.8
docker-pycreds==0.4.0
einops==0.8.0
elastic-transport==8.4.0
elasticsearch==8.9.0
etils==1.4.1
evaluate==0.4.0
exceptiongroup==1.1.3
fairscale==0.4.13
filelock==3.12.2
fire==0.5.0
flash-attn==2.6.2
fonttools==4.42.1
frozenlist==1.4.0
fsspec==2023.6.0
gensim==4.3.2
git-python==1.0.3
gitdb==4.0.10
GitPython==3.1.32
google-auth==2.22.0
google-auth-oauthlib==1.0.0
googleapis-common-protos==1.60.0
grpcio==1.57.0
haoda==0.0.20240228.dev1
huggingface-hub==0.17.3
humanfriendly==10.0
idna==3.7
importlib-resources==6.0.1
iniconfig==2.0.0
inspecta==0.1.3
Jinja2==3.1.2
jiwer==3.0.2
joblib==1.3.2
jsonlines==3.1.0
kiwisolver==1.4.5
lazy_loader==0.3
librosa==0.10.1
lit==16.0.6
llvmlite==0.40.1
lm-eval==0.3.0
Markdown==3.4.4
markdown-it-py==3.0.0
MarkupSafe==2.1.3
matplotlib==3.7.2
mbstrdecoder==1.1.3
mdurl==0.1.2
mip==1.15.0
mpmath==1.3.0
msgpack==1.0.5
multidict==6.0.4
multiprocess==0.70.15
networkx==3.1
nltk==3.8.1
numba==0.57.1
numexpr==2.8.5
numpy==1.24.4
nvidia-cublas-cu11==11.10.3.66
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu11==8.5.0.96
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu11==10.9.0.58
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu11==10.2.10.91
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu11==11.7.4.91
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu11==2.14.3
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.5.82
nvidia-nvtx-cu11==11.7.91
nvidia-nvtx-cu12==12.1.105
openai==0.27.9
ordered-set==4.1.0
packaging==23.1
pandas==2.0.3
pathvalidate==3.1.0
peft==0.12.0
Pillow==10.0.0
platformdirs==3.10.0
pluggy==1.2.0
ply==3.11
pooch==1.7.0
portalocker==2.7.0
prettytable==3.8.0
promise==2.3
protobuf==5.27.3
psutil==5.9.5
pyarrow==12.0.1
pybind11==2.11.1
pycountry==22.3.5
pycparser==2.21
pydeck==0.8.0
Pympler==1.0.1
pyproject-api==1.5.4
pytablewriter==1.0.0
pytest==7.4.0
python-dateutil==2.8.2
pytz==2024.1
pytz-deprecation-shim==0.1.0.post0
pyverilog==1.3.0
PyYAML==6.0
rapidfuzz==2.13.7
regex==2023.8.8
requests==2.32.3
requests-oauthlib==1.3.1
responses==0.18.0
rich==13.5.2
rootpath==0.1.1
rouge-score==0.1.2
rsa==4.9
sacrebleu==1.5.0
safetensors==0.4.3
scikit-learn==1.3.0
scipy==1.11.2
sentencepiece==0.1.99
sentry-sdk==2.12.0
seqeval==1.2.2
setproctitle==1.3.3
six==1.16.0
smart-open==6.3.0
smmap==5.0.0
soundfile==0.12.1
soxr==0.3.6
sqlitedict==2.1.0
streamlit==1.26.0
sympy==1.12
tabledata==1.3.1
tapa-fast-cosim==0.0.20220816.dev1
tcolorpy==0.1.3
tenacity==8.2.3
tensorboard==2.14.0
tensorboard-data-server==0.7.1
tensorflow-datasets==4.9.2
tensorflow-metadata==1.14.0
termcolor==2.3.0
texttable==1.6.7
threadpoolctl==3.2.0
tokenizers==0.14.1
toml==0.10.2
tomli==2.0.1
toolz==0.12.0
toposort==1.10
torch==2.4.0
torchvision==0.19.0
tox==4.10.0
tqdm==4.66.1
tqdm-multiprocess==0.0.11
transformers==4.34.0
triton==3.0.0
typepy==1.3.1
typing_extensions==4.12.2
tzdata==2023.3
tzlocal==4.3.1
urllib3==2.2.2
validators==0.21.2
virtualenv==20.24.3
wandb==0.17.5
watchdog==3.0.0
wcwidth==0.2.6
Werkzeug==2.3.7
wrapt==1.15.0
xxhash==3.3.0
yarl==1.9.2
zstandard==0.21.0
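
One thing that may explain the error even though wandb is installed: the warning printed just before the traceback indicates that `WANDB_DISABLED` is set in the environment. As far as I can tell, transformers releases around 4.34 treat wandb as unavailable whenever that variable holds a truthy value, so `WandbCallback` raises the "requires wandb to be installed" message regardless of what pip reports. The sketch below approximates that check with the standard library only. It is not the library's own code: `wandb_status` is a hypothetical helper, and the set of truthy strings is an assumption.

```python
import importlib.util
import os

# Assumption: these are the strings treated as "true" for WANDB_DISABLED.
TRUE_VALUES = {"1", "ON", "YES", "TRUE"}


def wandb_status(env=None) -> str:
    """Report whether wandb looks usable from the current interpreter."""
    env = os.environ if env is None else env
    if env.get("WANDB_DISABLED", "").upper() in TRUE_VALUES:
        return "disabled via WANDB_DISABLED"
    if importlib.util.find_spec("wandb") is None:
        # The package is not visible to *this* Python, whatever pip lists elsewhere.
        return "not importable from this interpreter"
    return "available"


print(wandb_status())
```

If this prints "disabled via WANDB_DISABLED", a likely fix is to unset the variable in the launch script and, if logging to wandb is not wanted, pass `--report_to none` as the warning itself suggests.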

@super-wuliao

I've encountered this issue as well. It seems to be caused by insufficient system memory (RAM) on your end rather than by the GPU.
