Quantization AWQ GEMM + GEMV #1727
Conversation
Wow, that is spooky... I just started benchmarking AWQ last night for the first time. Do you think that eventually you'd want to incorporate the "exllama" option as well? For more info see here: Also, would you mind sharing the script you used to benchmark, or perhaps just some snippets? I wouldn't mind downloading a development branch and trying my hand at it.
Currently, I only support GEMM and GEMV, which are the most used versions. It could be nice to support all of them in the future. I did some benchmarks in C++ only; I think you have to build this project in C++ first. BTW, the code I used to benchmark is quite dirty, but I will try to improve it and add it to the repo.
Thanks, I'm still learning to "build" anything (unsuccessfully as of yet...), believe it or not, but if you upload it I'll take a look.
Force-pushed from 317f14a to 0745fec (compare)
Can you share the code you used to benchmark?
I used this. You can tweak it a bit to create the correct prompt for the model used.
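
The linked script isn't shown above; as a rough stand-in, here is a minimal sketch of a generation benchmark using the public CTranslate2 Python API, assuming an already converted Mistral model directory (the model path, tokenizer, and prompt format are illustrative):

```python
import time

import ctranslate2
import transformers

# Illustrative paths: point these at your converted model and its tokenizer.
model_dir = "mistral-7b-awq-ct2"
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2"
)

generator = ctranslate2.Generator(model_dir, device="cuda")

# Tweak the prompt format to match the model used.
prompt = "[INST] Explain AWQ quantization in one paragraph. [/INST]"
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

start = time.time()
results = generator.generate_batch(
    [tokens],
    max_length=256,
    sampling_topk=1,
    include_prompt_in_result=False,
)
elapsed = time.time() - start

output_ids = results[0].sequences_ids[0]
print(tokenizer.decode(output_ids))
print(f"{len(output_ids) / elapsed:.1f} generated tokens/s")
```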
Support 4-bit quantization with AWQ. There are 2 stable versions available: `gemm` and `gemv`. Currently, I only add AWQ for the Llama and Mistral converters. Other models could be added easily if they need AWQ quantization.
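
For reference, converting a pre-quantized AWQ checkpoint could look roughly like the sketch below, using the CTranslate2 Python converter API. The Hugging Face model name is illustrative, and I'm assuming the converter reads the AWQ settings from the checkpoint's quantization config:

```python
import ctranslate2

# Illustrative: any Hugging Face checkpoint already quantized with AWQ.
converter = ctranslate2.converters.TransformersConverter(
    "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"
)

# Assumption: the AWQ variant (gemm or gemv) comes from the checkpoint's
# quantization config rather than from a converter flag.
converter.convert("mistral-7b-awq-ct2", force=True)
```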
I did some benchmarks with it:
With batch_size = 1, model Mistral 7B: