I thought vLLM's dynamic memory management would let the GPU hold as much KV cache as it could fit, i.e., growing the KV cache one page at a time as requests run, and continually freeing and reallocating the memory of finished queries.
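My mental model was roughly the sketch below; this is a hypothetical block allocator I wrote to illustrate what I expected, not vLLM's actual internals:

```python
BLOCK_SIZE = 16  # tokens per KV-cache page (illustrative value)


class BlockPool:
    """A shared pool of fixed-size KV-cache pages."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("KV cache exhausted; request must wait or be preempted")
        return self.free_blocks.pop()

    def release(self, blocks: list[int]) -> None:
        self.free_blocks.extend(blocks)


class Sequence:
    """A running request that grows its KV cache one page at a time."""

    def __init__(self, pool: BlockPool):
        self.pool = pool
        self.blocks: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # Only grab a new page when the current one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.blocks.append(self.pool.allocate())
        self.num_tokens += 1

    def finish(self) -> None:
        # Finished queries return their pages to the pool immediately.
        self.pool.release(self.blocks)
        self.blocks = []
```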
However, when putting Llama-3.1-8B on a 24 GB 4090, vLLM errors out at startup saying there is not enough space for the FULL 128K context length.
I am effectively running the vLLM offline inference example, just with Llama-3.1-8B (https://docs.vllm.ai/en/v0.5.5/getting_started/examples/offline_inference.html); a rough sketch of my script is below.
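This is roughly what I am running. The prompts and sampling values are placeholders from my script; only the model name differs from the linked example:

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The capital of France is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Engine created with defaults, so max_model_len comes from the model
# config (128K for Llama 3.1). This is where the not-enough-KV-cache
# error appears on my 24 GB 4090.
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, output.outputs[0].text)
```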
What am I misunderstanding here?