Managing Performance (token/s) #9665
Unanswered
gotDaijobu asked this question in Q&A

Hi,
I'm using vLLM for inference. I've given it several tries and I'm currently running a Llama 3.1 70B model quantized with bitsandbytes (4-bit) on a single A100 (80 GB).
I've run quite a lot of tests and realized that decreasing (or increasing) gpu_memory_utilization has no effect on throughput: I'm stuck at 13 tokens/s.
So the bottleneck is elsewhere, but I can't figure out where.
Any ideas?
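For context, the setup described above corresponds to an engine configuration roughly like the sketch below. This is an assumption-laden reconstruction, not the poster's actual code: the model id, `max_model_len`, and sampling values are illustrative.

```python
# Rough reconstruction of the setup in the question (the model id and all
# numeric values below are assumptions, not taken from the post).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # assumed checkpoint
    quantization="bitsandbytes",                # 4-bit BnB quantization
    load_format="bitsandbytes",
    gpu_memory_utilization=0.90,                # the knob being varied
    max_model_len=4096,                         # assumed context limit
)

params = SamplingParams(temperature=0.7, max_tokens=256)
out = llm.generate(["Summarize the attention mechanism."], params)
print(out[0].outputs[0].text)
```

Note that `gpu_memory_utilization` mostly governs how much memory is reserved for the KV cache (i.e., how many requests can be batched at once), so with a single request stream it would not be expected to change tokens/s.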
Replies: 2 comments · 1 reply

- It's probably memory-bandwidth bound. Have you tried increasing the request concurrency?
  - So I've done further tests: when I run several requests in parallel, performance drops. I'm struggling to find the cause.
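To test the concurrency suggestion above, a benchmark along the following lines could be pointed at a vLLM OpenAI-compatible server (started with `vllm serve <model>`). The base URL, model name, and prompt are assumptions; the script reports aggregate tokens/s across all in-flight requests, which is the figure that should rise with concurrency even as per-request latency grows.

```python
# Sketch: aggregate generation throughput at varying concurrency, against
# a vLLM OpenAI-compatible server. Base URL and model id are assumptions.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "meta-llama/Llama-3.1-70B-Instruct"  # must match the served model

async def one_request() -> int:
    resp = await client.completions.create(
        model=MODEL,
        prompt="Write a short story about a GPU.",
        max_tokens=256,
    )
    return resp.usage.completion_tokens  # tokens this request generated

async def bench(concurrency: int) -> None:
    start = time.perf_counter()
    counts = await asyncio.gather(*(one_request() for _ in range(concurrency)))
    elapsed = time.perf_counter() - start
    total = sum(counts)
    print(f"concurrency={concurrency}: {total} tokens in {elapsed:.1f} s "
          f"-> {total / elapsed:.1f} tok/s aggregate")

if __name__ == "__main__":
    for c in (1, 4, 16):
        asyncio.run(bench(c))
```

If aggregate throughput does not scale, the server's periodic log line showing the number of running requests would reveal whether vLLM is actually batching them, which could help localize the drop reported in the reply above.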