Service can concurrently process multiple requests #82

Open
TomasTomecek opened this issue Oct 22, 2024 · 1 comment

Comments

@TomasTomecek
Collaborator

This is a tracking issue for figuring out how the service can process multiple requests in parallel "so users wouldn't notice", without us having to invest heavily in multiple GPUs.

@TomasTomecek
Collaborator Author

I started researching vLLM since it's one of the inference engines used in InstructLab.

In their README they state (points related to this issue):

  • State-of-the-art serving throughput
  • Continuous batching of incoming requests
  • Quantizations: GPTQ, AWQ, INT4, INT8, and FP8
  • Tensor parallelism and pipeline parallelism support for distributed inference
  • OpenAI-compatible API server

There is also a relevant performance benchmark.

We can easily test out their OpenAI-compatible web server in a container:
https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html
This will be one of my next steps.
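
Once the container is up, a quick experiment could fire a handful of concurrent requests at the OpenAI-compatible endpoint and compare wall time against per-request latency, to see whether continuous batching really overlaps the work. A minimal sketch below; the port, the `api_key="EMPTY"` placeholder, and the model name are assumptions about how the container would be started:

```python
# Sketch: send several concurrent requests to a locally running vLLM
# OpenAI-compatible server and time them. Assumes the server listens on
# localhost:8000 and serves the model named below (adjust as needed).
import asyncio
import time

from openai import AsyncOpenAI  # pip install openai

MODEL = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder, use whatever the server loads

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


async def one_request(i: int) -> float:
    start = time.perf_counter()
    await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": f"Write a haiku about request {i}."}],
        max_tokens=64,
    )
    return time.perf_counter() - start


async def main() -> None:
    wall_start = time.perf_counter()
    durations = await asyncio.gather(*(one_request(i) for i in range(8)))
    wall = time.perf_counter() - wall_start
    # With continuous batching, wall time should be much closer to the
    # slowest single request than to the sum of all latencies.
    print(f"wall time: {wall:.2f}s")
    print(f"sum of per-request latencies: {sum(durations):.2f}s")


asyncio.run(main())
```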

I love that they have metrics baked in, so it will be easy for us to measure how many people use our service: https://docs.vllm.ai/en/v0.6.0/serving/metrics.html
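
Since the metrics endpoint is Prometheus-style text, scraping it should be trivial even before we wire up a real Prometheus instance. A rough sketch (the `vllm:` metric prefix is taken from the linked docs and may vary between versions):

```python
# Sketch: dump vLLM's Prometheus-style metrics from the OpenAI-compatible
# server. Assumes the same localhost:8000 deployment as above.
import requests

resp = requests.get("http://localhost:8000/metrics", timeout=5)
resp.raise_for_status()

for line in resp.text.splitlines():
    # e.g. vllm:num_requests_running, vllm:num_requests_waiting, ...
    if line.startswith("vllm:"):
        print(line)
```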

One thing to check is the streaming response; there is an open PR here: vllm-project/vllm#7648
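
For reference, a streaming consumer against the OpenAI-compatible endpoint would presumably look like the sketch below (same host/model assumptions as above); whether vLLM's streaming behaves the way we need is exactly what the linked PR should clarify:

```python
# Sketch: consume a streamed chat completion from the OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # placeholder model name
    messages=[{"role": "user", "content": "Stream me a short answer."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```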
