-
Notifications
You must be signed in to change notification settings - Fork 126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support vLLM-style rope #530
Comments
Hi @ByronHsu , thanks for your suggestions, I think 1 & 3 are easy to support. For 1, we can adding For 3, yes we can add another rope_dim field for partial rope. Can you give a concrete example of 2? |
Okay I think I understand 2 now, for example, if Is that true? |
Yes exactly! Thank you for the prompt response! All sounds good to me. One comment: Can we separate the new API from the current 4 flashinfer's rope functions and provide the exact same interface with vLLM? Several reasons:
Maybe we can call this def apply_rope_inplace_with_cache(
positions: torch.Tensor,
query: torch.Tensor,
key: torch.Tensor,
head_size: int,
cos_sin_cache: torch.Tensor,
is_neox: bool,
) -> None:
... |
I did a global search and found |
Previously our rope apis assume the position indices of each request is contiguous, which is not appropriate for applications such as speculative decoding, this PR fixes the issue by supporting the huggingface transformer-style API which use `pos_ids` argument to specify positions. This PR implements parts of the feature of #530 , other requests are coming in later PRs. cc @dreaming-panda @abcdabcd987 @ByronHsu
As part of SGLang Issue #1487, SGLang plans to move vLLM to optional dependencies and use flashinfer as the main dependency.
I am working on moving rope to flashinfer. My plan is to reuse most of the existing vllm rope but replace
ops.rotary_embedding
andops.batch_rotary_embedding
with flashinfer's kernel, which can be found here.However, I've noticed some gaps between the vLLM and flashinfer implementations:
positions
back tooffsets
+indptr
. Can we supportpositions
directly?In general, we can prioritize the first three issues and consider the fourth as a stretch goal.
The text was updated successfully, but these errors were encountered: