Why don't you calculate the real output tokens for throughput? #493
-
vllm/benchmarks/benchmark_throughput.py Lines 177 to 180 in c894836 — you can see that you just use the token lengths of the requests in the dataset, not the real output length that the model actually generates during evaluation.
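For illustration, here is a minimal sketch of what counting the real output tokens could look like, assuming the vLLM offline `LLM.generate` API where each `RequestOutput` carries `CompletionOutput.token_ids` (the model name and prompts below are placeholders, not from the benchmark script):

```python
# Sketch: compute throughput from the tokens the model actually generated,
# instead of the requested output lengths taken from the dataset.
import time

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # small model, for illustration only
prompts = ["Hello, my name is", "The capital of France is"]
params = SamplingParams(max_tokens=128)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# token_ids holds the tokens really produced; a sequence may stop early,
# so this can be shorter than the requested max_tokens.
real_output_tokens = sum(
    len(completion.token_ids)
    for request_output in outputs
    for completion in request_output.outputs
)
print(f"{real_output_tokens / elapsed:.2f} output tokens/s")
```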
-
The returned text also contains the input prompt, so I don't think counting the output tokens alone will solve your problem. vllm/vllm/entrypoints/api_server.py Lines 64 to 67 in c6dfc3c
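As a sketch of what would be needed, assuming the demo API server response format referenced above (each returned string is the prompt concatenated with the completion), you would have to strip the echoed prompt before counting; the tokenizer name here is a placeholder:

```python
# Sketch: strip the echoed prompt so only newly generated tokens are counted.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")  # assumed model

def count_generated_tokens(prompt: str, returned_text: str) -> int:
    # The demo server returns prompt + completion in one string;
    # drop the prompt prefix and tokenize only the generated part.
    generated = returned_text[len(prompt):]
    return len(tokenizer.encode(generated, add_special_tokens=False))
```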
-
I believe the main reason is that prompt tokens also take compute. There are different ways to measure throughput for LLMs, but I believe the trends between systems should be similar under different metrics. Moving this issue to discussions for future questions.
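To make the two metrics concrete, here is a small sketch (function name and numbers are illustrative) contrasting total-token throughput, which credits the prompt tokens since they also take compute, with output-only throughput:

```python
# Sketch: the two common throughput metrics for LLM serving.
def throughput_metrics(prompt_tokens: int, output_tokens: int, elapsed_s: float) -> dict:
    return {
        # Counts prompt + generated tokens: prompt tokens also take compute.
        "total_tokens_per_s": (prompt_tokens + output_tokens) / elapsed_s,
        # Counts only what the model generated.
        "output_tokens_per_s": output_tokens / elapsed_s,
    }

# Example: 2,000 prompt tokens and 1,000 generated tokens in 4 seconds.
print(throughput_metrics(2000, 1000, 4.0))
# {'total_tokens_per_s': 750.0, 'output_tokens_per_s': 250.0}
```

Either metric can be reported; the ranking between systems should generally stay similar, which is the point made above.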