Managing Performance (token/s) #9665
Unanswered
gotDaijobu asked this question in Q&A

Hi,
I'm using vLLM for inference. I've given it several tries and I'm currently running a Llama 3.1 70B model quantized with bitsandbytes (4-bit) on a single A100 (80 GB).
I've run quite a lot of tests and realized that decreasing (or increasing) gpu_memory_utilization has no effect on throughput: I'm stuck at 13 tokens/s.
So the bottleneck is elsewhere, but I can't figure out where.
Any ideas?
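For context, the setup described above corresponds to an engine configuration roughly like the sketch below. This is an assumption-laden reconstruction, not the poster's actual code: the model id, `max_model_len`, and sampling values are illustrative.

```python
# Rough reconstruction of the setup in the question (the model id and all
# numeric values below are assumptions, not taken from the post).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # assumed checkpoint
    quantization="bitsandbytes",                # 4-bit BnB quantization
    load_format="bitsandbytes",
    gpu_memory_utilization=0.90,                # the knob being varied
    max_model_len=4096,                         # assumed context limit
)

params = SamplingParams(temperature=0.7, max_tokens=256)
out = llm.generate(["Summarize the attention mechanism."], params)
print(out[0].outputs[0].text)
```

Note that `gpu_memory_utilization` mostly governs how much memory is reserved for the KV cache (i.e., how many requests can be batched at once), so with a single request stream it would not be expected to change tokens/s.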
Replies: 2 comments · 1 reply

- It's probably memory-bandwidth bound. Have you tried increasing the request concurrency?
  - So I've done further tests: when I run several requests in parallel, performance drops. I'm struggling to find the cause.
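To test the concurrency suggestion above, a benchmark along the following lines could be pointed at a vLLM OpenAI-compatible server (started with `vllm serve <model>`). The base URL, model name, and prompt are assumptions; the script reports aggregate tokens/s across all in-flight requests, which is the figure that should rise with concurrency even as per-request latency grows.

```python
# Sketch: aggregate generation throughput at varying concurrency, against
# a vLLM OpenAI-compatible server. Base URL and model id are assumptions.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "meta-llama/Llama-3.1-70B-Instruct"  # must match the served model

async def one_request() -> int:
    resp = await client.completions.create(
        model=MODEL,
        prompt="Write a short story about a GPU.",
        max_tokens=256,
    )
    return resp.usage.completion_tokens  # tokens this request generated

async def bench(concurrency: int) -> None:
    start = time.perf_counter()
    counts = await asyncio.gather(*(one_request() for _ in range(concurrency)))
    elapsed = time.perf_counter() - start
    total = sum(counts)
    print(f"concurrency={concurrency}: {total} tokens in {elapsed:.1f} s "
          f"-> {total / elapsed:.1f} tok/s aggregate")

if __name__ == "__main__":
    for c in (1, 4, 16):
        asyncio.run(bench(c))
```

If aggregate throughput does not scale, the server's periodic log line showing the number of running requests would reveal whether vLLM is actually batching them, which could help localize the drop reported in the reply above.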