Using vLLM to deploy LLM as an API to accelerate inference #100
Comments
Hi, can I know if it is possible to run it with Ollama and host the LLM locally?
I found that comfyui_omost shows a way to accelerate inference with TGI (text generation inference).
Good idea! Could you kindly share the code?
Based on practical tests, deploying omost-llama-3-8b on an A100 using torch==2.3.0+cu118, vllm==0.5.0.post1+cu118, and xformers==0.0.26.post1+cu118 works well. If you want to speed up the process, you can refer to this setup.
vllm: https://docs.vllm.ai/en/stable/getting_started/quickstart.html
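For reference, here is a minimal sketch of what such a deployment could look like, following the vLLM quickstart linked above: the model is served through vLLM's OpenAI-compatible server and then queried over HTTP. The model id `lllyasviel/omost-llama-3-8b`, the port, and the sampling parameters are assumptions for illustration; adjust them to your checkpoint path and environment.

```python
# Sketch only (assumptions: model id "lllyasviel/omost-llama-3-8b", server on localhost:8000).
#
# 1) Start vLLM's OpenAI-compatible API server, for example:
#    python -m vllm.entrypoints.openai.api_server \
#        --model lllyasviel/omost-llama-3-8b --dtype bfloat16 --port 8000
#
# 2) Query it from any HTTP client:
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "lllyasviel/omost-llama-3-8b",
        "messages": [
            {"role": "user", "content": "generate an image of a cat on a windowsill at sunset"},
        ],
        "max_tokens": 1024,   # assumed budget for the generated canvas code
        "temperature": 0.6,   # assumed sampling temperature
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```

Because the server speaks the OpenAI API format, the same endpoint can also be queried with the `openai` Python client by pointing its base URL at `http://localhost:8000/v1`.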