The current plan methods appear to include unnecessary device-to-CPU copies, as seen in the following links:

flashinfer/python/flashinfer/decode.py, line 722 in d30667b
flashinfer/python/flashinfer/prefill.py, lines 1006 to 1008 in d30667b:

# NOTE(Zihao): only required if qo_indptr/paged_kv_indptr are device tensors
qo_indptr_host = qo_indptr.to("cpu")
paged_kv_indptr_host = paged_kv_indptr.to("cpu")

These copies are unnecessary because qo_indptr and append_indptr should be initialized during data preparation and therefore already live on the CPU. The plan method could accept an optional host-side indptr parameter, and tensor.to("cpu") should only be invoked when that parameter is None.
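For concreteness, a minimal sketch of this proposal, using a hypothetical helper name and hypothetical qo_indptr_host/paged_kv_indptr_host keyword arguments (not part of the current flashinfer API):

from typing import Optional, Tuple
import torch

def plan_indptr_to_host(
    qo_indptr: torch.Tensor,
    paged_kv_indptr: torch.Tensor,
    qo_indptr_host: Optional[torch.Tensor] = None,        # hypothetical optional argument
    paged_kv_indptr_host: Optional[torch.Tensor] = None,  # hypothetical optional argument
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Only fall back to a device-to-host transfer when the caller did not
    # already supply the host-side tensors built during data preparation.
    if qo_indptr_host is None:
        qo_indptr_host = qo_indptr.to("cpu")
    if paged_kv_indptr_host is None:
        paged_kv_indptr_host = paged_kv_indptr.to("cpu")
    # ... the rest of plan() would consume the host-side tensors as before ...
    return qo_indptr_host, paged_kv_indptr_host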
Hi @reyoung, you are right: if the indptr arrays are on the CPU, these operations are not necessary. However, we never documented that these arrays should be on the CPU, and some of the frameworks that integrate flashinfer have legacy code that passes GPU tensors, so the conversion is kept mainly for backward compatibility.
If the tensors are already on the CPU, then indptr_host = indptr.to("cpu") is a no-op and will not result in a copy.
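For reference, a standalone way to confirm that no-op behavior in PyTorch (not flashinfer code; torch.Tensor.to returns self when the device and dtype already match):

import torch

indptr = torch.arange(5, dtype=torch.int32)  # already a CPU tensor
indptr_host = indptr.to("cpu")               # device and dtype already match -> no copy
print(indptr_host is indptr)                 # True: the original tensor object is returned
print(indptr_host.data_ptr() == indptr.data_ptr())  # True: same underlying storage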