torch.distributed.elastic.multiprocessing.errors.ChildFailedError #113

sido420 opened this issue Jan 1, 2024 · 1 comment

sido420 commented Jan 1, 2024

I am new to AI and am trying to run the Llama 2 model locally using pyllama.

I have tried different options, but nothing seems to work. I downloaded Llama using https://github.com/facebookresearch/llama.

Here is what I tried (see below for installed packages):

$ torchrun --nproc_per_node 1 example.py --ckpt_dir ../codellama/CodeLlama-7b/ --tokenizer_path ../codellama/CodeLlama-7b/tokenizer.model

Traceback (most recent call last):
  File "/home/xxxxx/pyllama/example.py", line 80, in <module>
    fire.Fire(main)
  File "/home/xxxxx/miniconda3/envs/llama2/lib/python3.11/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
..
  File "/home/xxxxx/miniconda3/envs/llama2/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1268, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL built in")
RuntimeError: Distributed package doesn't have NCCL built in
[2024-01-01 20:58:30,998] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1814953) of binary: /home/xxxxx/miniconda3/envs/llama2/bin/python
Traceback (most recent call last):
..
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
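
For context, the RuntimeError is raised when the script asks for the NCCL backend, which this PyTorch build (or this machine) apparently doesn't support. Below is a minimal sketch of the kind of backend selection I assume would avoid it; this is my guess at what the init step looks like, not the actual pyllama/example.py code:

import torch
import torch.distributed as dist

# NCCL only works with a CUDA-enabled build and a usable GPU;
# otherwise fall back to the CPU-only gloo backend.
backend = "nccl" if (dist.is_nccl_available() and torch.cuda.is_available()) else "gloo"

# torchrun already sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE,
# so no extra arguments are needed here.
dist.init_process_group(backend=backend)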

The command below seems to run, but I don't get any response whatsoever:

KV_CACHE_IN_GPU=0 python inference.py --ckpt_dir ../codellama/CodeLlama-7b/ --tokenizer_path ../codellama/CodeLlama-7b/tokenizer.model
.. <after waiting for several seconds, I typed the following prompt and pressed Enter> ..
Prompt:['I believe in ']
<no response whatsoever>

I tried both the CUDA and non-CUDA PyTorch packages from https://pytorch.org/get-started/locally/, for example: conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia. Either way I get the same NCCL error from torchrun and no output from inference.py.
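
To check what each installed build actually supports, something like this should work (plain PyTorch calls, nothing specific to pyllama):

import torch
import torch.distributed as dist

print("torch version:  ", torch.__version__)
print("CUDA available: ", torch.cuda.is_available())
print("NCCL built in:  ", dist.is_nccl_available())
print("Gloo built in:  ", dist.is_gloo_available())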

I am on an HP workstation running Ubuntu 23.04 (Lunar Lobster).

CPU(s):                  4
  On-line CPU(s) list:   0-3
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) CPU           W3565  @ 3.20GHz
    CPU family:          6
    Model:               26
    Thread(s) per core:  1
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            5
$ sudo lshw -numeric -C display
..
  *-display                 
       description: VGA compatible controller
       product: G94GL [Quadro FX 1800] [10DE:638]
       vendor: NVIDIA Corporation [10DE]
...
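
Since the only display adapter is a fairly old Quadro FX 1800, I am not sure PyTorch can use it at all. A quick way to see what torch actually detects (standard torch.cuda calls only):

import torch

if torch.cuda.is_available():
    # Report the name and compute capability of the first GPU.
    major, minor = torch.cuda.get_device_capability(0)
    print(torch.cuda.get_device_name(0), f"(compute capability {major}.{minor})")
else:
    print("No usable CUDA device detected; inference will run on the CPU")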

sido420 commented Jan 1, 2024

Here is the error I am getting after a few minutes:

$ KV_CACHE_IN_GPU=1 python inference.py --ckpt_dir ../codellama/CodeLlama-7b/ --tokenizer_path ../codellama/CodeLlama-7b/tokenizer.model 
Prompt:['I believe in ']

Killed
(llama2) xxx@localhost:~/pyllama$ Prompt:['I believe in ']
Prompt:[I believe in ]: command not found
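
The last line is just the shell picking up the prompt I typed after the process had already died. I am guessing that "Killed" with no Python traceback means the kernel's OOM killer stopped the process (a 7B model in fp16 needs roughly 14 GB for the weights alone). A quick way to check how much RAM this machine has, using only the standard library:

import os

# Total physical memory on Linux via sysconf; no extra packages needed.
total_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
print(f"Total RAM: {total_bytes / 1024**3:.1f} GiB")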
