
[Bug]: Engine loop has died #419

Open · 1 task done
warlock135 opened this issue Oct 23, 2024 · 7 comments
Labels: bug (Something isn't working)

Comments

@warlock135

Your current environment

The output of `python collect_env.py`
NUMA node0 CPU(s):                    0-39,80-119
NUMA node1 CPU(s):                    40-79,120-159
Vulnerability Gather data sampling:   Mitigation; Microcode
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] habana-torch-dataloader==1.18.0.524
[pip3] habana-torch-plugin==1.18.0.524
[pip3] numpy==1.26.4
[pip3] pynvml==8.0.4
[pip3] pytorch-lightning==2.4.0
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0a0+git74cd574
[pip3] torch_tb_profiler==0.4.0
[pip3] torchaudio==2.4.0a0+69d4077
[pip3] torchdata==0.7.1+5e6f7b7
[pip3] torchmetrics==1.4.2
[pip3] torchtext==0.18.0a0+9bed85d
[pip3] torchvision==0.19.0a0+48b1edf
[pip3] transformers==4.45.2
[pip3] triton==3.1.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.3.dev554+g07c98a52
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect

Model Input Dumps

No response

🐛 Describe the bug

When running inference with vLLM (using the OpenAI API server), I got the error below:

ERROR 10-23 09:12:38 client.py:250] RuntimeError('Engine loop has died')
ERROR 10-23 09:12:38 client.py:250] Traceback (most recent call last):
ERROR 10-23 09:12:38 client.py:250]   File "/vllm-fork/vllm/engine/multiprocessing/client.py", line 150, in run_heartbeat_loop
ERROR 10-23 09:12:38 client.py:250]     await self._check_success(
ERROR 10-23 09:12:38 client.py:250]   File "/vllm-fork/vllm/engine/multiprocessing/client.py", line 314, in _check_success
ERROR 10-23 09:12:38 client.py:250]     raise response
ERROR 10-23 09:12:38 client.py:250] RuntimeError: Engine loop has died

After this, all in-flight requests were dropped, and vLLM crashed when the next request arrived:

CRITICAL 10-23 09:53:06 launcher.py:99] MQLLMEngine is already dead, terminating server process
INFO:     127.0.0.1:44188 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error

I started vLLM with the following command (inside a Docker container based on the image vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest):

PT_HPU_LAZY_MODE=1 PT_HPU_ENABLE_LAZY_COLLECTIVES=true python3 -m vllm.entrypoints.openai.api_server --model Meta-Llama-3-70B-Instruct --port 9002 --gpu-memory-utilization 0.94 --tensor-parallel-size 8 --disable-log-requests --block-size 128

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
warlock135 added the bug (Something isn't working) label on Oct 23, 2024
@michalkuligowski

Hi @warlock135, which version of HabanaAI vllm-fork are you using, 1.18.0 or habana_main?

@warlock135
Author

> Hi @warlock135, which version of HabanaAI vllm-fork are you using, 1.18.0 or habana_main?

I'm using the habana_main version

@michalkuligowski

@warlock135 we will try to find out what the issue is. In the meantime, please try using the v1.18.0 branch (tag v0.5.3.post1+Gaudi-1.18.0) and see if the issue is still present.

@warlock135
Author

> @warlock135 we will try to find out what the issue is. In the meantime, please try using the v1.18.0 branch (tag v0.5.3.post1+Gaudi-1.18.0) and see if the issue is still present.

After switching to v0.5.3.post1+Gaudi-1.18.0 and applying the patch from this pull request, the error no longer occurred, even after several hours of inference.
Another thing: I successfully completed a benchmark test with the habana_main version (about 1,000 prompts with a batch size of 20), but after raising the batch size to 30 the errors started to occur.

@michalkuligowski

@warlock135 glad to hear that the original issue goes away. Can you please file another ticket with repro steps for the BS 30 scenario? Is it the same command as here? We will look into it.

@warlock135
Author

> @warlock135 glad to hear that the original issue goes away. Can you please file another ticket with repro steps for the BS 30 scenario? Is it the same command as here? We will look into it.

Setup: I followed the instructions provided here. The only difference is that I used the pytorch-installer-2.4.0 Docker image instead of pytorch-installer-2.4.1 (which resulted in an "unknown manifest" error during the pull). The command to run vLLM is the same as here.

Client: I’m using a Python aiohttp client to send requests to vLLM; a minimal sketch of the load pattern is shown after this list.

API: /v1/completions

Prompts: Each prompt contains approximately 3,500 words (with a chat template applied, without any system prompt).
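
Roughly, the client looks like this (a minimal sketch, not the exact script; the model name, max_tokens value, and the wave-based batching loop are assumptions based on the command and numbers above):

```python
# Minimal sketch of the aiohttp load client (reconstruction, not the exact script).
import asyncio
import aiohttp

VLLM_URL = "http://localhost:9002/v1/completions"  # port from the launch command above
BATCH_SIZE = 30  # 20 completed fine, 30 started to fail

async def send_one(session: aiohttp.ClientSession, prompt: str) -> str:
    payload = {
        "model": "Meta-Llama-3-70B-Instruct",
        "prompt": prompt,          # ~3,500 words, chat template already applied
        "max_tokens": 512,         # assumption
    }
    async with session.post(VLLM_URL, json=payload) as resp:
        resp.raise_for_status()
        data = await resp.json()
        return data["choices"][0]["text"]

async def main(prompts: list[str]) -> None:
    async with aiohttp.ClientSession() as session:
        # Send requests in waves of BATCH_SIZE, mimicking the benchmark batches.
        for i in range(0, len(prompts), BATCH_SIZE):
            batch = prompts[i : i + BATCH_SIZE]
            await asyncio.gather(*(send_one(session, p) for p in batch))

if __name__ == "__main__":
    asyncio.run(main(["<long prompt>"] * 1000))  # ~1,000 prompts in the benchmark
```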

@warlock135
Author

I also encountered an engine crash today when attempting to cancel client requests (70 concurrent users / batch size). The error log is below, followed by a sketch of how the cancellation is done on the client side.

INFO 10-24 07:57:02 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 68 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 26.5%, CPU KV cache usage: 0.1%.
ERROR 10-24 07:57:03 worker_base.py:382] Error executing method execute_model. This might cause deadlock in distributed execution.
ERROR 10-24 07:57:03 worker_base.py:382] Traceback (most recent call last):
ERROR 10-24 07:57:03 worker_base.py:382]   File "/vllm-fork/vllm/worker/worker_base.py", line 374, in execute_method
ERROR 10-24 07:57:03 worker_base.py:382]     return executor(*args, **kwargs)
ERROR 10-24 07:57:03 worker_base.py:382]   File "/vllm-fork/vllm/worker/worker_base.py", line 236, in execute_model
ERROR 10-24 07:57:03 worker_base.py:382]     self.model_runner.prepare_model_input(
ERROR 10-24 07:57:03 worker_base.py:382]   File "/vllm-fork/vllm/worker/habana_model_runner.py", line 1792, in prepare_model_input
ERROR 10-24 07:57:03 worker_base.py:382]     model_input, sampling_metadata = self.prepare_input_tensors(
ERROR 10-24 07:57:03 worker_base.py:382]   File "/vllm-fork/vllm/worker/habana_model_runner.py", line 1129, in prepare_input_tensors
ERROR 10-24 07:57:03 worker_base.py:382]     ) = self._prepare_decode(decode_reqs)
ERROR 10-24 07:57:03 worker_base.py:382]   File "/vllm-fork/vllm/worker/habana_model_runner.py", line 1036, in _prepare_decode
ERROR 10-24 07:57:03 worker_base.py:382]     slot_mapping = torch.tensor(slot_mapping,
ERROR 10-24 07:57:03 worker_base.py:382] RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_SYNHELPER synStatus 31 [Synapse is being terminated due to fatality]. DMA to HPU start failed

ERROR 10-24 07:57:03 async_llm_engine.py:56] Engine background task failed
ERROR 10-24 07:57:03 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 10-24 07:57:03 async_llm_engine.py:56]   File "/vllm-fork/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
ERROR 10-24 07:57:03 async_llm_engine.py:56]     return_value = task.result()
ERROR 10-24 07:57:03 async_llm_engine.py:56]   File "/vllm-fork/vllm/engine/async_llm_engine.py", line 650, in run_engine_loop
ERROR 10-24 07:57:03 async_llm_engine.py:56]     result = task.result()
ERROR 10-24 07:57:03 async_llm_engine.py:56]   File "/vllm-fork/vllm/engine/async_llm_engine.py", line 593, in engine_step
ERROR 10-24 07:57:03 async_llm_engine.py:56]     request_outputs = await self.engine.step_async(virtual_engine)
ERROR 10-24 07:57:03 async_llm_engine.py:56]   File "/vllm-fork/vllm/engine/async_llm_engine.py", line 253, in step_async
ERROR 10-24 07:57:03 async_llm_engine.py:56]     output = await self.model_executor.execute_model_async(
ERROR 10-24 07:57:03 async_llm_engine.py:56]   File "/vllm-fork/vllm/executor/ray_habana_executor.py", line 391, in execute_model_async
ERROR 10-24 07:57:03 async_llm_engine.py:56]     return await super().execute_model_async(execute_model_req)
ERROR 10-24 07:57:03 async_llm_engine.py:56]   File "/vllm-fork/vllm/executor/distributed_gpu_executor.py", line 175, in execute_model_async
ERROR 10-24 07:57:03 async_llm_engine.py:56]     return await self._driver_execute_model_async(execute_model_req)
ERROR 10-24 07:57:03 async_llm_engine.py:56]   File "/vllm-fork/vllm/executor/ray_habana_executor.py", line 407, in _driver_execute_model_async
ERROR 10-24 07:57:03 async_llm_engine.py:56]     return await self.driver_exec_method("execute_model",
ERROR 10-24 07:57:03 async_llm_engine.py:56]   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 10-24 07:57:03 async_llm_engine.py:56]     result = self.fn(*self.args, **self.kwargs)
ERROR 10-24 07:57:03 async_llm_engine.py:56]   File "/vllm-fork/vllm/worker/worker_base.py", line 383, in execute_method
ERROR 10-24 07:57:03 async_llm_engine.py:56]     raise e
ERROR 10-24 07:57:03 async_llm_engine.py:56]   File "/vllm-fork/vllm/worker/worker_base.py", line 374, in execute_method
ERROR 10-24 07:57:03 async_llm_engine.py:56]     return executor(*args, **kwargs)
ERROR 10-24 07:57:03 async_llm_engine.py:56]   File "/vllm-fork/vllm/worker/worker_base.py", line 236, in execute_model
ERROR 10-24 07:57:03 async_llm_engine.py:56]     self.model_runner.prepare_model_input(
ERROR 10-24 07:57:03 async_llm_engine.py:56]   File "/vllm-fork/vllm/worker/habana_model_runner.py", line 1792, in prepare_model_input
ERROR 10-24 07:57:03 async_llm_engine.py:56]     model_input, sampling_metadata = self.prepare_input_tensors(
ERROR 10-24 07:57:03 async_llm_engine.py:56]   File "/vllm-fork/vllm/worker/habana_model_runner.py", line 1129, in prepare_input_tensors
ERROR 10-24 07:57:03 async_llm_engine.py:56]     ) = self._prepare_decode(decode_reqs)
ERROR 10-24 07:57:03 async_llm_engine.py:56]   File "/vllm-fork/vllm/worker/habana_model_runner.py", line 1036, in _prepare_decode
ERROR 10-24 07:57:03 async_llm_engine.py:56]     slot_mapping = torch.tensor(slot_mapping,
ERROR 10-24 07:57:03 async_llm_engine.py:56] RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_SYNHELPER synStatus 31 [Synapse is being terminated due to fatality]. DMA to HPU start failed
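
The cancellation on the client side looks roughly like this (a sketch, not the exact script; cancelling the asyncio tasks, which closes the HTTP connections and causes the server to abort the in-flight requests, is an assumption about the mechanism). It reuses send_one() from the client sketch above:

```python
# Sketch of the cancellation scenario (assumption: in-flight requests are
# aborted by cancelling the client-side asyncio tasks, which closes the
# underlying HTTP connections). Reuses send_one() from the sketch above.
import asyncio
import aiohttp

async def cancel_after(tasks: list[asyncio.Task], delay: float) -> None:
    # Let the batch run for a while, then cancel everything still pending.
    await asyncio.sleep(delay)
    for t in tasks:
        if not t.done():
            t.cancel()

async def run_and_cancel(prompts: list[str]) -> None:
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.create_task(send_one(session, p)) for p in prompts]
        canceller = asyncio.create_task(cancel_after(tasks, delay=5.0))
        # Cancelled tasks surface asyncio.CancelledError; return_exceptions
        # keeps gather() from propagating it so all results can be inspected.
        await asyncio.gather(*tasks, return_exceptions=True)
        await canceller
```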
