Will it support CPU offloading? #578
Hi, thanks for the great library! I have heard people say EXL2 is very fast, and I would like to try a 70B Llama model on a 24GB 4090 card, where it cannot fit into the GPU even with e.g. 4-bit quantization. So I wonder: is there some theoretical limitation in EXL2/GPTQ that prevents CPU offloading, or has it just not been implemented yet? Thanks!
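For a rough sense of scale, here is some back-of-envelope arithmetic, assuming ~4 bits per weight as a round illustrative figure (not a claim about any specific quant):

```python
# Back-of-envelope VRAM check for a 70B model at ~4 bits per weight.
# All values are rough illustrations, not measurements.
params = 70e9                      # parameter count of a 70B model
bits_per_weight = 4.0              # assumed quantization rate
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"weights alone: ~{weights_gb:.0f} GB")  # ~35 GB, already more than 24 GB of VRAM
# ...and that is before the KV cache and activation buffers are counted
```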
You can look at the streaming example in … But these are for perplexity eval or for offline forward passes. Even if you had a single layer stuck on the CPU, it would still be pretty bad, I think, for token-by-token inference. You'd basically have to load/unload entire layer(s) for every token generated. At that point, why not just use CPU inference?
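To put a rough number on that penalty, here is a hedged estimate; the layer size, layer count, and PCIe bandwidth below are assumed round figures, not measurements:

```python
# Why streaming offloaded layers per token is slow: each generated token
# needs the offloaded weights copied over PCIe again.
# All numbers below are assumed, illustrative values.
layer_size_gb = 70e9 * 4 / 8 / 80 / 1e9   # one of ~80 layers of a 70B model at 4 bpw: ~0.44 GB
pcie_gb_per_s = 25.0                       # rough effective PCIe 4.0 x16 bandwidth
offloaded_layers = 20                      # suppose 20 layers don't fit in VRAM

transfer_s_per_token = offloaded_layers * layer_size_gb / pcie_gb_per_s
print(f"~{transfer_s_per_token * 1000:.0f} ms per token just moving weights")
# ~350 ms/token, i.e. under 3 tokens/s before any compute happens,
# which lands in the same ballpark as plain CPU inference
```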
Maybe this would be good to have for scenarios where the model almost fits in VRAM? I don't know how much of a performance penalty we would take in that case, though. I'm willing to do a PR, but to be honest, I'm not sure where to begin. I'd appreciate any directions.
Technically, it is possible to do, I think. You just need to call …
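The specific call being referred to is cut off in the comment above. Purely as a generic illustration of the idea, in plain PyTorch rather than exllamav2's actual API (the function and names here are hypothetical), naive layer-wise offloading looks something like this:

```python
import torch

# Generic sketch of layer-wise CPU offloading in plain PyTorch.
# This is NOT exllamav2's API; forward_with_offload is a hypothetical helper.
def forward_with_offload(layers, hidden, device="cuda"):
    for layer in layers:        # layers: list of torch.nn.Module kept on CPU
        layer.to(device)        # copy this layer's weights to the GPU
        hidden = layer(hidden)  # run the layer on the GPU
        layer.to("cpu")         # evict it again to free VRAM for the next layer
    return hidden

# Usage sketch (stand-ins for transformer blocks):
# layers = [torch.nn.Linear(4096, 4096) for _ in range(80)]
# out = forward_with_offload(layers, torch.randn(1, 4096, device="cuda"))
```

Doing this loop for every generated token repeats the full weight transfer each time, which is exactly the load/unload overhead estimated above.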
I see. I may not have enough time to work on this soon (I have my own open-source library to maintain, e.g. https://github.com/fzyzcjy/flutter_rust_bridge, as well as research and other projects), but I'm looking forward to the feature, and again, thank you for the great work!
PRs are welcome. 🤷 I just have too many feature requests and too little time.