Quantization AWQ GEMM + GEMV #1727
Conversation
Wow, that is spooky... I just started benchmarking AWQ last night for the first time. Do you think that eventually you'd want to incorporate the "exllama" option as well? For more info see here: Also, would you mind sharing the script you used to benchmark, or perhaps just some snippets? I wouldn't mind downloading a development branch and trying my hand at it.
Currently, I only support GEMM and GEMV, which are the most used versions. It could be nice to support all of them in the future. I did some benchmarks in C++ only; I think you have to build this project in C++ first. BTW, the code I used to benchmark is quite dirty, but I will try to improve it and add it to the repo.
Thanks, I'm still learning to "build" anything (unsuccessfully as of yet...), believe it or not, but if you upload it I'll take a look.
Force-pushed from 317f14a to 0745fec (compare)
Can you share the code you used to benchmark?
I used this. You can tweak it a bit to create the correct prompt for the model used.
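
The linked script isn't shown above; as a rough stand-in, here is a minimal sketch of a generation benchmark using the public CTranslate2 Python API, assuming an already converted Mistral model directory (the model path, tokenizer, and prompt format are illustrative):

```python
import time

import ctranslate2
import transformers

# Illustrative paths: point these at your converted model and its tokenizer.
model_dir = "mistral-7b-awq-ct2"
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2"
)

generator = ctranslate2.Generator(model_dir, device="cuda")

# Tweak the prompt format to match the model used.
prompt = "[INST] Explain AWQ quantization in one paragraph. [/INST]"
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

start = time.time()
results = generator.generate_batch(
    [tokens],
    max_length=256,
    sampling_topk=1,
    include_prompt_in_result=False,
)
elapsed = time.time() - start

output_ids = results[0].sequences_ids[0]
print(tokenizer.decode(output_ids))
print(f"{len(output_ids) / elapsed:.1f} generated tokens/s")
```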
Support 4-bit quantization with AWQ. There are 2 stable versions available: `gemm` and `gemv`. Currently, I only add AWQ for the Llama and Mistral converters. Other models could be added easily if they need AWQ quantization.
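
For reference, converting a pre-quantized AWQ checkpoint could look roughly like the sketch below, using the CTranslate2 Python converter API. The Hugging Face model name is illustrative, and I'm assuming the converter reads the AWQ settings from the checkpoint's quantization config:

```python
import ctranslate2

# Illustrative: any Hugging Face checkpoint already quantized with AWQ.
converter = ctranslate2.converters.TransformersConverter(
    "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"
)

# Assumption: the AWQ variant (gemm or gemv) comes from the checkpoint's
# quantization config rather than from a converter flag.
converter.convert("mistral-7b-awq-ct2", force=True)
```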
I did some benchmarks with it:
With batch_size = 1, model Mistral 7B: