[BUG] Quantization of Qwen returns garbage #621

Open · 3 tasks done
fahadh4ilyas opened this issue Sep 10, 2024 · 9 comments
Labels: bug (Something isn't working)

Comments

@fahadh4ilyas
Contributor

OS: Linux
GPU Library: CUDA 12.x
Python version: 3.10
PyTorch version: 2.4.0
Model: No response

Describe the bug

I quantized my own Qwen 7B model, and the generated token is always 60021. Here is the config file of my Qwen model:

```json
{
  "_name_or_path": "models/qwen2-7B",
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "max_position_embeddings": 131072,
  "max_window_layers": 28,
  "model_type": "qwen2",
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "rms_norm_eps": 1e-06,
  "rope_theta": 1000000.0,
  "sliding_window": 131072,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.41.2",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 152064
}
```

Reproduction steps

Here are the parameters I used for quantization:

```
python convert.py -i models/myQwen-7B-HF -o models/myQwen-7B-EXL2/ -b 8 -hb 8 -l 16384 -ml 16384
```
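
A minimal sketch of how the quantized output can be sanity-checked afterwards (based on exllamav2's example scripts; the model path and prompt are placeholders, and the exact API may differ slightly between versions):

```python
# Sanity-check sketch following exllamav2's example scripts (~v0.2.x); the model
# path and prompt are placeholders and the API may differ slightly by version.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "models/myQwen-7B-EXL2"  # placeholder path to the quantized model
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy = True)  # FP16 cache, so Q4-cache issues are ruled out
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7

# A healthy quant should produce coherent text here rather than repeating one token ID
print(generator.generate_simple("The capital of France is", settings, 32))
```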

Expected behavior

Generation after quantization works correctly.

Logs: No response

Additional context: No response

Acknowledgements

  • I have looked for similar issues before submitting this one.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will ask my questions politely.
@fahadh4ilyas fahadh4ilyas added the bug Something isn't working label Sep 10, 2024
@fahadh4ilyas
Contributor Author

Additional note: after quantizing the model released by the developers, Qwen/Qwen2-7B-Instruct, the result is also garbage. So the problem is not specific to my model.

@turboderp
Owner

I've been unable to reproduce this with v0.2.1 and the original Qwen2 model. It quantizes and inferences correctly here.

Are you using Q4 cache by any chance? Qwen2-7B specifically fails with Q4 cache, but Q6 and Q8 should be fine:

| Model | Quant | Cache | pass@1 | pass@10 | Wikitext 5x1k |
|---|---|---|---|---|---|
| Qwen2-7B | FP16 | Q4 | 19.74% | 46.34% | 40.72 |
| Qwen2-7B | FP16 | Q6 | 61.65% | 81.70% | 15.20 |
| Qwen2-7B | FP16 | Q8 | 62.37% | 81.09% | 15.18 |
| Qwen2-7B | FP16 | FP16 | 61.16% | 82.31% | 15.16 |

@fahadh4ilyas
Contributor Author

fahadh4ilyas commented Sep 12, 2024

Could you test it using my exllamav2_hf from #606? I keep getting garbage output when running inference with it, but inference with other models works just fine.

@turboderp
Owner

I looked into it and managed to reproduce the problem with the HF wrapper code.

It seems the issue is with attention. Since you're supplying a mask, both the Flash Attention and SDPA code paths are disabled, as they only support causal attention. ExLlama then falls back to matmul attention, which runs in half precision. That isn't an issue for most models, but for Qwen2-7B specifically you get occasional overflows on some layers, and then inference breaks. I think it's down to the unusual normalization of the keys/queries, which is also related to why the model doesn't like the Q4 cache mode.

Regardless, SDPA can take an arbitrary mask; Torch just won't use its efficient kernels internally, but it should still avoid the overflows. I've enabled that in the latest commit on the dev branch, and it seems to be working with your wrapper.
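
A generic PyTorch sketch of the difference (not the actual exllamav2 code): SDPA accepts an arbitrary boolean mask, which the causal-only flash-attn kernel can't express:

```python
# Generic sketch of the fallback described above, not the exllamav2 implementation.
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.half if device == "cuda" else torch.float

bsz, heads, q_len, kv_len, head_dim = 2, 4, 8, 8, 64
q = torch.randn(bsz, heads, q_len, head_dim, dtype = dtype, device = device)
k = torch.randn(bsz, heads, kv_len, head_dim, dtype = dtype, device = device)
v = torch.randn(bsz, heads, kv_len, head_dim, dtype = dtype, device = device)

# Boolean mask, True = attend. Here: hide two left-padding positions in row 0.
# (In practice this gets combined with a causal mask, taking care that no query
# row ends up fully masked, which would produce NaNs.)
mask = torch.ones(bsz, 1, q_len, kv_len, dtype = torch.bool, device = device)
mask[0, :, :, :2] = False

out = F.scaled_dot_product_attention(q, k, v, attn_mask = mask)
print(out.shape)  # torch.Size([2, 4, 8, 64])
```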

@fahadh4ilyas
Contributor Author

Wait, do you mean that if I set the input_mask parameter, flash attention won't be used? Then how do I generate a batch of texts without input_mask?

@turboderp
Owner

Yes, flash-attn doesn't support input masks. And if you want to supply a rectangular input IDs tensor where the sequences aren't all the same length, the only way to do that is with padding and a mask. Otherwise you'd have to start with the shortest input, generate at a batch size of 1 until it reaches the length of the 2nd shortest input, then continue at bsz 2, and so on. That way the input stays rectangular and you never have to mask out any padding, but it's very inefficient.
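
A plain-PyTorch sketch of what that padding-plus-mask setup looks like (not exllamav2's actual batching code; the pad token ID is a placeholder):

```python
import torch

pad_id = 0  # hypothetical pad token ID, not Qwen2's actual pad token
seqs = [
    [11, 12, 13, 14, 15],   # 5 tokens
    [21, 22],               # 2 tokens
]
max_len = max(len(s) for s in seqs)

# Left-pad so generation continues from the last real token of each row
input_ids = torch.full((len(seqs), max_len), pad_id, dtype = torch.long)
input_mask = torch.zeros((len(seqs), max_len), dtype = torch.bool)
for i, s in enumerate(seqs):
    input_ids[i, max_len - len(s):] = torch.tensor(s)
    input_mask[i, max_len - len(s):] = True

# input_ids is now rectangular:
#   [[11, 12, 13, 14, 15],
#    [ 0,  0,  0, 21, 22]]
# Without input_mask, attention in row 1 would treat the pad positions as real
# tokens -- and masking them out is exactly what flash-attn can't do.
```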

flash-attn does have a "varlen" mode, but it's not efficient either since it requires the cache to be contiguous, so you have to constantly rebuild it (copy the whole thing in VRAM) to make space for new keys/values for every token generated.

The alternative is to use paged attention with a flat cache. This however is only compatible with flash-attn.

The SDPA approach at least leaves room for Torch to switch to a more efficient backend later on, if flash-attn ever supports masking. There has been some work on that in Dao-AILab/flash-attention#617, but apparently it's not finished yet.

@Thireus

Thireus commented Sep 14, 2024

> I've been unable to reproduce this with v0.2.1 and the original Qwen2 model. It quantizes and inferences correctly here.
>
> Are you using Q4 cache by any chance? Qwen2-7B specifically fails with Q4 cache, but Q6 and Q8 should be fine:
>
> | Model | Quant | Cache | pass@1 | pass@10 | Wikitext 5x1k |
> |---|---|---|---|---|---|
> | Qwen2-7B | FP16 | Q4 | 19.74% | 46.34% | 40.72 |
> | Qwen2-7B | FP16 | Q6 | 61.65% | 81.70% | 15.20 |
> | Qwen2-7B | FP16 | Q8 | 62.37% | 81.09% | 15.18 |
> | Qwen2-7B | FP16 | FP16 | 61.16% | 82.31% | 15.16 |

@turboderp, would you have similar metrics for other models?

@DocShotgun


> @turboderp, would you have similar metrics for other models?

https://github.com/turboderp/exllamav2/blob/master/doc/qcache_eval.md

@Downtown-Case
Contributor

Downtown-Case commented Sep 19, 2024

@Thireus I tested the 2024 Command-R here:

https://old.reddit.com/r/LocalLLaMA/comments/1f6ijye/commandr_35b_q4q6q8_cache_perplexity_mmlu/

And that's a model with extremely "compressed" attention, where 110K tokens of Q4 context only take around 4 GB. I think Qwen2 was just an extreme outlier, and the new Qwen 2.5 doesn't behave like that.
