[LayerSkip] Self-Speculative Decoding #642

mostafaelhoushi · 2024-07-08T17:00:18Z

Describe the solution you would like:
Implement self-speculative decoding as described in this paper, where the earlier layers of the model act as the draft stage and the remaining layers act as the verification stage.
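
To make the request concrete, here is a minimal sketch of the draft/verify loop with greedy decoding. It uses a toy stand-in for a model, not fairseq2 code: the "layers" are random lookup tables, and every name (`make_layers`, `embed`, `run_layers`, `lm_head`, `self_speculative_generate`, `exit_layer`, `num_draft`) is hypothetical. The point it illustrates is that the verification stage can resume from the hidden state the draft stage already produced, so the early layers are not run twice.

```python
import random
from typing import List, Sequence

HIDDEN = 16  # toy hidden-state and vocabulary size (assumption)


def make_layers(num_layers: int, seed: int = 0) -> List[List[int]]:
    """Toy 'layers': each one is a random lookup table over hidden states."""
    rng = random.Random(seed)
    return [[rng.randrange(HIDDEN) for _ in range(HIDDEN)] for _ in range(num_layers)]


def embed(tokens: Sequence[int]) -> int:
    """Toy context encoder: hashes the whole prefix into one hidden state."""
    return (7 * sum(tokens) + len(tokens)) % HIDDEN


def run_layers(layers: List[List[int]], h: int, start: int, end: int) -> int:
    """Apply layers[start:end] to hidden state h."""
    for table in layers[start:end]:
        h = table[h]
    return h


def lm_head(h: int) -> int:
    """Toy LM head shared by the early exit and the final layer."""
    return h


def greedy_generate(layers, prompt, max_new_tokens):
    """Baseline: run every layer for every token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(lm_head(run_layers(layers, embed(tokens), 0, len(layers))))
    return tokens


def self_speculative_generate(layers, prompt, max_new_tokens, exit_layer, num_draft):
    tokens = list(prompt)
    target_len = len(tokens) + max_new_tokens
    while len(tokens) < target_len:
        # Draft stage: only layers [0, exit_layer) run; keep their hidden states.
        ctx, draft, early_states = list(tokens), [], []
        for _ in range(num_draft):
            h = run_layers(layers, embed(ctx), 0, exit_layer)
            early_states.append(h)
            draft.append(lm_head(h))
            ctx.append(draft[-1])
        # Verification stage: only layers [exit_layer, L) run, starting from the
        # cached early states. A real model would verify all positions in one
        # batched pass; this loop does it position by position for clarity.
        for h_early, drafted in zip(early_states, draft):
            verified = lm_head(run_layers(layers, h_early, exit_layer, len(layers)))
            tokens.append(verified)  # always the full model's token
            if verified != drafted or len(tokens) == target_len:
                break  # first mismatch invalidates the later draft tokens
    return tokens
```

Because every emitted token comes from the full layer stack, the output matches plain greedy decoding for any `exit_layer` and `num_draft`; only the amount of work changes. How often draft tokens are accepted in this toy is arbitrary and says nothing about real models.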

Describe the alternatives you have considered:
There are two ways to implement this:

- Implement regular Speculative Decoding, where the draft stage is a separate model. Self-Speculative Decoding could then be implemented by providing a subset of the layers as the draft model (e.g., this implementation).
  - With this setup, we can add flags that tell the earlier layers whether they are running the draft stage or the verification stage.
- Directly implement Self-Speculative Decoding, as done here.
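
A rough sketch of the first option, under toy assumptions: a generic greedy speculative decoder that takes any target and draft model, with the self-speculative case obtained by handing it a model built from the first few layers of the same stack. `ToyLayer`, `ToyModel`, `speculative_generate`, and the `stage` flag are all hypothetical names for illustration, not an existing API.

```python
from typing import Callable, List, Sequence

HIDDEN = 16  # toy hidden-state and vocabulary size (assumption)


class ToyLayer:
    """Hypothetical layer that is told which stage it is serving."""

    def __init__(self, shift: int):
        self.shift = shift
        self.calls = {"draft": 0, "verify": 0}

    def __call__(self, h: int, stage: str) -> int:
        self.calls[stage] += 1  # the flag lets a layer adapt or track per-stage state
        return (3 * h + self.shift) % HIDDEN


class ToyModel:
    """Wraps a list of (possibly shared) layers and tags each call with a stage."""

    def __init__(self, layers: List[ToyLayer], stage: str):
        self.layers = layers
        self.stage = stage

    def __call__(self, tokens: Sequence[int]) -> int:
        h = (7 * sum(tokens) + len(tokens)) % HIDDEN  # toy context encoding
        for layer in self.layers:
            h = layer(h, self.stage)
        return h  # greedy next token


Model = Callable[[Sequence[int]], int]


def speculative_generate(target: Model, draft: Model, prompt: Sequence[int],
                         max_new_tokens: int, num_draft: int = 4) -> List[int]:
    """Generic greedy speculative decoding; the draft can be any model."""
    tokens = list(prompt)
    target_len = len(tokens) + max_new_tokens
    while len(tokens) < target_len:
        proposal: List[int] = []
        for _ in range(num_draft):
            proposal.append(draft(tokens + proposal))
        for drafted in proposal:
            verified = target(tokens)  # context holds only accepted tokens
            tokens.append(verified)
            if verified != drafted or len(tokens) == target_len:
                break
    return tokens


# Self-speculative use: the draft model is a subset of the target's own layers.
layers = [ToyLayer(s) for s in (1, 5, 2, 7)]
target_model = ToyModel(layers, stage="verify")
draft_model = ToyModel(layers[:2], stage="draft")
output = speculative_generate(target_model, draft_model, [1, 2, 3], max_new_tokens=10)
```

One trade-off this sketch makes visible: the generic decoder treats the two models as black boxes, so the target re-runs the early layers during verification. The direct implementation can avoid that by resuming from the draft stage's hidden states.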

Additional Context:

- Speculative Decoding was first proposed in Fast Inference from Transformers via Speculative Decoding.
- Another variant of self-speculative decoding, where the draft stage is a subset of the layers of the main model, is presented in Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding.

mostafaelhoushi added the enhancement (New feature or request) label on Jul 8, 2024

mostafaelhoushi mentioned this issue Sep 4, 2024

[WIP] Implement LayerSkip YJYJLee/fairseq2#2

Draft

7 tasks
