[BUG] Quantization of Qwen returns garbage #621

Open · 3 tasks done
fahadh4ilyas opened this issue Sep 10, 2024 · 9 comments
Labels: bug (Something isn't working)

Comments

@fahadh4ilyas
Contributor

OS: Linux
GPU Library: CUDA 12.x
Python version: 3.10
PyTorch version: 2.4.0
Model: No response

Describe the bug

I quantized my own Qwen 7B model, and the generated token is always 60021. Here is the config file of my Qwen model:

```json
{
  "_name_or_path": "models/qwen2-7B",
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "max_position_embeddings": 131072,
  "max_window_layers": 28,
  "model_type": "qwen2",
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "rms_norm_eps": 1e-06,
  "rope_theta": 1000000.0,
  "sliding_window": 131072,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.41.2",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 152064
}
```

Reproduction steps

Here are the parameters I used for quantization:

```
python convert.py -i models/myQwen-7B-HF -o models/myQwen-7B-EXL2/ -b 8 -hb 8 -l 16384 -ml 16384
```
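
A minimal sketch of how the quantized output can be sanity-checked afterwards (based on exllamav2's example scripts; the model path and prompt are placeholders, and the exact API may differ slightly between versions):

```python
# Sanity-check sketch following exllamav2's example scripts (~v0.2.x); the model
# path and prompt are placeholders and the API may differ slightly by version.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "models/myQwen-7B-EXL2"  # placeholder path to the quantized model
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy = True)  # FP16 cache, so Q4-cache issues are ruled out
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7

# A healthy quant should produce coherent text here rather than repeating one token ID
print(generator.generate_simple("The capital of France is", settings, 32))
```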

Expected behavior

Generation after quantization works correctly.

Logs: No response

Additional context: No response

Acknowledgements

  • I have looked for similar issues before submitting this one.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will ask my questions politely.
@fahadh4ilyas fahadh4ilyas added the bug Something isn't working label Sep 10, 2024
@fahadh4ilyas
Contributor Author

Additional note: after quantizing the model released by the developers, Qwen/Qwen2-7B-Instruct, the result is also garbage. So the problem is not specific to my model.

@turboderp
Owner

I've been unable to reproduce this with v0.2.1 and the original Qwen2 model. It quantizes and inferences correctly here.

Are you using Q4 cache by any chance? Qwen2-7B specifically fails with Q4 cache, but Q6 and Q8 should be fine:

| Model | Quant | Cache | pass@1 | pass@10 | Wikitext 5x1k |
|---|---|---|---|---|---|
| Qwen2-7B | FP16 | Q4 | 19.74% | 46.34% | 40.72 |
| Qwen2-7B | FP16 | Q6 | 61.65% | 81.70% | 15.20 |
| Qwen2-7B | FP16 | Q8 | 62.37% | 81.09% | 15.18 |
| Qwen2-7B | FP16 | FP16 | 61.16% | 82.31% | 15.16 |

@fahadh4ilyas
Contributor Author

fahadh4ilyas commented Sep 12, 2024

Could you test it using my exllamav2_hf from #606? I keep getting garbage output when running inference with it, but inference with other models works just fine.

@turboderp
Owner

I looked into it and managed to reproduce the problem with the HF wrapper code.

It seems the issue is with attention. Since you're supplying a mask, both the Flash Attention and SDPA code paths are disabled, as they only support causal attention. ExLlama then falls back to matmul attention, which runs in half precision. That isn't an issue for most models, but for Qwen2-7B specifically you get occasional overflows on some layers, and then inference breaks. I think it's down to the unusual normalization of the keys/queries, which is also related to why the model doesn't like the Q4 cache mode.

Regardless, SDPA can take an arbitrary mask; Torch just won't use its efficient kernels internally, but it should still avoid the overflows. I've enabled that in the latest commit on the dev branch, and it seems to be working with your wrapper.
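
A generic PyTorch sketch of the difference (not the actual exllamav2 code): SDPA accepts an arbitrary boolean mask, which the causal-only flash-attn kernel can't express:

```python
# Generic sketch of the fallback described above, not the exllamav2 implementation.
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.half if device == "cuda" else torch.float

bsz, heads, q_len, kv_len, head_dim = 2, 4, 8, 8, 64
q = torch.randn(bsz, heads, q_len, head_dim, dtype = dtype, device = device)
k = torch.randn(bsz, heads, kv_len, head_dim, dtype = dtype, device = device)
v = torch.randn(bsz, heads, kv_len, head_dim, dtype = dtype, device = device)

# Boolean mask, True = attend. Here: hide two left-padding positions in row 0.
# (In practice this gets combined with a causal mask, taking care that no query
# row ends up fully masked, which would produce NaNs.)
mask = torch.ones(bsz, 1, q_len, kv_len, dtype = torch.bool, device = device)
mask[0, :, :, :2] = False

out = F.scaled_dot_product_attention(q, k, v, attn_mask = mask)
print(out.shape)  # torch.Size([2, 4, 8, 64])
```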

@fahadh4ilyas
Contributor Author

Wait, do you mean that if I set the input_mask parameter, flash attention won't be used? Then how do I generate a batch of texts without input_mask?

@turboderp
Owner

Yes, flash-attn doesn't support input masks. And if you want to supply a rectangular input IDs tensor where the sequences aren't all the same length, the only way to do that is with padding and a mask. Otherwise you'd have to start with the shortest input, generate at a batch size of 1 until it reaches the length of the 2nd shortest input, then continue at bsz 2, and so on. That way the input stays rectangular and you never have to mask out any padding, but it's very inefficient.
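
A plain-PyTorch sketch of what that padding-plus-mask setup looks like (not exllamav2's actual batching code; the pad token ID is a placeholder):

```python
import torch

pad_id = 0  # hypothetical pad token ID, not Qwen2's actual pad token
seqs = [
    [11, 12, 13, 14, 15],   # 5 tokens
    [21, 22],               # 2 tokens
]
max_len = max(len(s) for s in seqs)

# Left-pad so generation continues from the last real token of each row
input_ids = torch.full((len(seqs), max_len), pad_id, dtype = torch.long)
input_mask = torch.zeros((len(seqs), max_len), dtype = torch.bool)
for i, s in enumerate(seqs):
    input_ids[i, max_len - len(s):] = torch.tensor(s)
    input_mask[i, max_len - len(s):] = True

# input_ids is now rectangular:
#   [[11, 12, 13, 14, 15],
#    [ 0,  0,  0, 21, 22]]
# Without input_mask, attention in row 1 would treat the pad positions as real
# tokens -- and masking them out is exactly what flash-attn can't do.
```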

flash-attn does have a "varlen" mode, but it's not efficient either since it requires the cache to be contiguous, so you have to constantly rebuild it (copy the whole thing in VRAM) to make space for new keys/values for every token generated.

The alternative is to use paged attention with a flat cache. This however is only compatible with flash-attn.

The SDPA approach at least leaves room for Torch to switch to a more efficient backend later on, if flash-attn ever supports masking. There has been some work on that in Dao-AILab/flash-attention#617, but apparently it's not finished yet.

@Thireus

Thireus commented Sep 14, 2024

> I've been unable to reproduce this with v0.2.1 and the original Qwen2 model. It quantizes and inferences correctly here.
>
> Are you using Q4 cache by any chance? Qwen2-7B specifically fails with Q4 cache, but Q6 and Q8 should be fine:
>
> | Model | Quant | Cache | pass@1 | pass@10 | Wikitext 5x1k |
> |---|---|---|---|---|---|
> | Qwen2-7B | FP16 | Q4 | 19.74% | 46.34% | 40.72 |
> | Qwen2-7B | FP16 | Q6 | 61.65% | 81.70% | 15.20 |
> | Qwen2-7B | FP16 | Q8 | 62.37% | 81.09% | 15.18 |
> | Qwen2-7B | FP16 | FP16 | 61.16% | 82.31% | 15.16 |

@turboderp, would you have similar metrics for other models?

@DocShotgun


> @turboderp, would you have similar metrics for other models?

https://github.com/turboderp/exllamav2/blob/master/doc/qcache_eval.md

@Downtown-Case
Contributor

Downtown-Case commented Sep 19, 2024

@Thireus I tested the 2024 Command-R here:

https://old.reddit.com/r/LocalLLaMA/comments/1f6ijye/commandr_35b_q4q6q8_cache_perplexity_mmlu/

And that's a model with extremely "compressed" attention, where 110K tokens of Q4 context only take around 4 GB. I think Qwen2 was just an extreme outlier, and the new Qwen 2.5 doesn't behave like that.
