[Feature request] Adding optional `cpu_indptr`/`cpu_qo_indptr` parameter to `plan` method to avoid synchronized device to host copy. #565

reyoung · 2024-10-28T06:33:22Z

The current plan methods appear to include unnecessary device-to-CPU copies, as seen in the following links:

Line 722 in d30667b

indptr_host = indptr.to("cpu")

Lines 1006 to 1008 in d30667b

    
    # NOTE(Zihao): only required if qo_indptr/paged_kv_indptr are device tensors
    qo_indptr_host = qo_indptr.to("cpu")
    paged_kv_indptr_host = paged_kv_indptr.to("cpu")

These copies are unnecessary because qo_indptr and append_indptr are normally built on the CPU during data preparation, so a host copy already exists. The plan method could therefore accept an optional host-side indptr parameter and call tensor.to("cpu") only when that parameter is None.
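A minimal sketch of that fallback, not flashinfer's actual code: the helper name `resolve_host_indptr` and the parameter name `indptr_host` are hypothetical, and the function only assumes its arguments expose a torch-style `.to()` method.

```python
from __future__ import annotations

from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:  # torch is only needed for type checking in this sketch
    import torch


def resolve_host_indptr(
    indptr: "torch.Tensor",
    indptr_host: Optional["torch.Tensor"] = None,
) -> "torch.Tensor":
    """Return a CPU-resident indptr (hypothetical helper, not a flashinfer API).

    If the caller already holds a host copy, use it as-is and skip the
    device-to-host transfer; otherwise fall back to indptr.to("cpu").
    """
    if indptr_host is not None:
        return indptr_host
    return indptr.to("cpu")
```

A caller that builds the indptr on the CPU would pass that tensor as `indptr_host`; callers that only have a device tensor would see the current behavior.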

yzh119 · 2024-10-28T06:40:10Z

Hi @reyoung, you are right: if the indptr arrays are on the CPU, these operations are not necessary. However, we never documented that these arrays should be on the CPU, and some frameworks that integrate flashinfer still pass GPU tensors for legacy reasons, so the conversion is kept mainly for backward compatibility.

If the tensors are already on the CPU, then indptr_host = indptr.to("cpu") is a no-op and does not result in a copy:

>>> import torch
>>> x = torch.rand(12)
>>> x.data_ptr()
94804334735872
>>> y = x.to("cpu")
>>> y.data_ptr()
94804334735872

reyoung · 2024-10-28T06:50:26Z

However, when indptr is a CPU tensor, an unnecessary CPU -> device copy is invoked here:

flashinfer/python/flashinfer/prefill.py

Line 990 in d30667b

self._qo_indptr_buf = qo_indptr.to(self.device, non_blocking=True)

It seems that the current API cannot avoid a cross-device copy: whichever device the caller's indptr is on, one of the two transfers still happens.

reyoung · 2024-10-28T09:39:33Z

I found that kv_indptr should always be on the CPU, while qo_indptr is used on both the CPU and CUDA.

Maybe it is better to add only a cpu_qo_indptr parameter.
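A rough sketch of what that could look like for qo_indptr, under stated assumptions: `split_qo_indptr` is a hypothetical helper, `cpu_qo_indptr` is the proposed (not existing) parameter, and the arguments are assumed to behave like torch tensors, whose `.to()` returns the same tensor when it is already on the target device.

```python
from __future__ import annotations

from typing import TYPE_CHECKING, Optional, Tuple

if TYPE_CHECKING:  # torch is only needed for type checking in this sketch
    import torch


def split_qo_indptr(
    qo_indptr: "torch.Tensor",
    device: "torch.device | str",
    cpu_qo_indptr: Optional["torch.Tensor"] = None,
) -> Tuple["torch.Tensor", "torch.Tensor"]:
    """Return (device_copy, host_copy) of qo_indptr (hypothetical helper).

    The device side mirrors the existing qo_indptr.to(self.device, ...) call.
    The host side uses the caller-supplied cpu_qo_indptr when present and
    only falls back to a device-to-host copy when it is None.
    """
    # No data movement if qo_indptr already lives on `device`.
    device_copy = qo_indptr.to(device, non_blocking=True)
    if cpu_qo_indptr is not None:
        host_copy = cpu_qo_indptr
    else:
        host_copy = qo_indptr.to("cpu")  # existing behavior
    return device_copy, host_copy
```

With a CUDA qo_indptr plus its CPU counterpart passed as `cpu_qo_indptr`, neither direction needs a transfer; omitting `cpu_qo_indptr` keeps today's behavior.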
