Replies: 3 comments 2 replies
-
I'm having the same issue.
1 reply
-
Same error here.
1 reply
-
Same for me.
0 replies
-
Hi everyone,
I am trying to perform inference with TheBloke/Mistral-7B-Instruct-v0.2-AWQ using the vLLM CPU installation via Docker, and I keep receiving the error described below.
I am able to build the CPU Docker image, and with the default facebook/opt-125m model I can also run the server and get a completions response from inference.
As for the Mistral model, AWQ is listed among the supported quantization kernels for CPU, and I am able to start the server with my Docker run command as follows:
docker run -it --rm -v Mistral:/mnt/models/Mistral --network=host --ipc=host -e VLLM_CPU_KVCACHE_SPACE=40 vllm-cpu-env --model="/mnt/models/Mistral/Mistral-7B-Instruct-v0.2-AWQ" --dtype="half" --quantization awq --device "cpu" --max-model-len 2048
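For context, the inference query I send against this server looks roughly like the following. This is a minimal sketch, assuming the server is listening on localhost:8000 (vLLM's default port) and using the OpenAI-compatible /v1/completions endpoint; the prompt text and max_tokens value are illustrative.

```python
# Minimal sketch of a completions request to the vLLM server started above.
# Assumptions: server on localhost:8000, OpenAI-compatible /v1/completions
# endpoint, and the model path matching the volume mount in the run command.
import json
import urllib.request


def build_completion_request(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Build the JSON body for the /v1/completions endpoint."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}


def send_completion(payload: dict,
                    url: str = "http://localhost:8000/v1/completions") -> dict:
    """POST the payload and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


if __name__ == "__main__":
    body = build_completion_request(
        "/mnt/models/Mistral/Mistral-7B-Instruct-v0.2-AWQ",
        "Hello, how are you?",
    )
    print(send_completion(body))
```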
When I send an inference query, I am also able to see the following log:
It is only after a few seconds that I receive
RuntimeError('Engine loop has died')
which kills the server and shuts down the Docker container. I have tried various values for VLLM_CPU_KVCACHE_SPACE, increased VLLM_ENGINE_ITERATION_TIMEOUT_S, and set VLLM_CPU_OMP_THREADS_BIND to my physical cores, but to no avail. I'm reaching out in the hopes that this error can be rectified. Thank you for your attention thus far. Cheers
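For anyone trying to reproduce this, the tuning attempts described above amount to a docker run variant along these lines. This is a sketch only: the timeout value and the core list passed to VLLM_CPU_OMP_THREADS_BIND are assumed for illustration and should be adapted to your machine.

```shell
# Sketch of the tuning attempts: same run command as above, with the
# engine iteration timeout raised and OpenMP threads bound to physical
# cores. The values 120 and 0-7 are illustrative assumptions.
docker run -it --rm -v Mistral:/mnt/models/Mistral --network=host --ipc=host \
  -e VLLM_CPU_KVCACHE_SPACE=40 \
  -e VLLM_ENGINE_ITERATION_TIMEOUT_S=120 \
  -e VLLM_CPU_OMP_THREADS_BIND=0-7 \
  vllm-cpu-env \
  --model="/mnt/models/Mistral/Mistral-7B-Instruct-v0.2-AWQ" \
  --dtype="half" --quantization awq --device "cpu" --max-model-len 2048
```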