Service can concurrently process multiple requests #82

Open
TomasTomecek opened this issue Oct 22, 2024 · 1 comment

Comments

@TomasTomecek
Collaborator

This is a tracking issue for figuring out how the service can process multiple requests in parallel "so users wouldn't notice", without us having to invest heavily in multiple GPUs.

@TomasTomecek
Collaborator Author

I started researching vLLM since it's one of the inference engines used in InstructLab.

In their README they state (points related to this issue):

  • State-of-the-art serving throughput
  • Continuous batching of incoming requests
  • Quantizations: GPTQ, AWQ, INT4, INT8, and FP8
  • Tensor parallelism and pipeline parallelism support for distributed inference
  • OpenAI-compatible API server

There is also a relevant performance benchmark.

We can easily test out their OpenAI-compatible web server in a container:
https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html
This will be one of my next steps.
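
Once the container is up, a quick experiment could fire a handful of concurrent requests at the OpenAI-compatible endpoint and compare wall time against per-request latency, to see whether continuous batching really overlaps the work. A minimal sketch below; the port, the `api_key="EMPTY"` placeholder, and the model name are assumptions about how the container would be started:

```python
# Sketch: send several concurrent requests to a locally running vLLM
# OpenAI-compatible server and time them. Assumes the server listens on
# localhost:8000 and serves the model named below (adjust as needed).
import asyncio
import time

from openai import AsyncOpenAI  # pip install openai

MODEL = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder, use whatever the server loads

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


async def one_request(i: int) -> float:
    start = time.perf_counter()
    await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": f"Write a haiku about request {i}."}],
        max_tokens=64,
    )
    return time.perf_counter() - start


async def main() -> None:
    wall_start = time.perf_counter()
    durations = await asyncio.gather(*(one_request(i) for i in range(8)))
    wall = time.perf_counter() - wall_start
    # With continuous batching, wall time should be much closer to the
    # slowest single request than to the sum of all latencies.
    print(f"wall time: {wall:.2f}s")
    print(f"sum of per-request latencies: {sum(durations):.2f}s")


asyncio.run(main())
```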

I love that they have metrics baked in, so it will be easy for us to measure how many people use our service: https://docs.vllm.ai/en/v0.6.0/serving/metrics.html
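
Since the metrics endpoint is Prometheus-style text, scraping it should be trivial even before we wire up a real Prometheus instance. A rough sketch (the `vllm:` metric prefix is taken from the linked docs and may vary between versions):

```python
# Sketch: dump vLLM's Prometheus-style metrics from the OpenAI-compatible
# server. Assumes the same localhost:8000 deployment as above.
import requests

resp = requests.get("http://localhost:8000/metrics", timeout=5)
resp.raise_for_status()

for line in resp.text.splitlines():
    # e.g. vllm:num_requests_running, vllm:num_requests_waiting, ...
    if line.startswith("vllm:"):
        print(line)
```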

One thing to check is the streaming response; there is an open PR here: vllm-project/vllm#7648
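
For reference, a streaming consumer against the OpenAI-compatible endpoint would presumably look like the sketch below (same host/model assumptions as above); whether vLLM's streaming behaves the way we need is exactly what the linked PR should clarify:

```python
# Sketch: consume a streamed chat completion from the OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # placeholder model name
    messages=[{"role": "user", "content": "Stream me a short answer."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```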
