Replies: 1 comment
-
You can use |
-
Hi team, thanks for making this product. I am glad to see that Hugging Face text-generation-inference (TGI) has support for it. As I understand it, TGI only uses vLLM's vllm_cache_ops and vllm_attention_ops. However, vLLM reserves a lot of GPU memory up front, which pushes the model out of memory. Is there any config I can set after the build, other than through the model's LLM interface, to specify the memory usage or GPU utilization rate? Thanks
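For reference, a minimal sketch of where that reservation is usually configured, assuming a vLLM version that exposes gpu_memory_utilization on its engine arguments (the fraction of GPU memory vLLM claims for weights plus the pre-allocated KV-cache block pool, commonly defaulting to 0.9). The model name and request below are placeholders:

```python
# Sketch only: assumes a vLLM build whose EngineArgs / LLMEngine expose
# gpu_memory_utilization; this is the engine-level path rather than the
# high-level LLM wrapper.
from vllm import EngineArgs, LLMEngine, SamplingParams

engine_args = EngineArgs(
    model="facebook/opt-125m",      # placeholder model
    gpu_memory_utilization=0.5,     # claim ~50% of the GPU instead of the usual 0.9 default
)
engine = LLMEngine.from_engine_args(engine_args)

# Drive the engine directly: queue one request and step until it finishes.
engine.add_request("req-0", "Hello, my name is", SamplingParams(max_tokens=16))
while engine.has_unfinished_requests():
    for request_output in engine.step():
        if request_output.finished:
            print(request_output.outputs[0].text)
```

Lowering the fraction shrinks the KV-cache block pool that vLLM pre-allocates at startup; if TGI only links the standalone CUDA kernels as described above, this engine-level setting would not come into play, so the sketch is only meant to show where vLLM itself exposes the knob.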