[Bug]: block_softmax accuracy issue in flat_pa kernel, qwen2-7B model #275

Comments
Hi @jikunshang, thanks for the analysis. We had a look at the issue and a clean fix is not trivial. Could you check if our new WA ( #280 ) fixes the issue? Thanks. PS I'll close #268 since it's anyway marked as "DNM". Let's move the discussion to this issue. |
Hi @szutenberg, thanks for the quick fix in #280. I tested with #280; however, the generated output is not very good, with even incorrect characters on some queries. Query:
The output is not finished and does not make sense.
|
Thanks @xuechendi for verifying this. |
Hi @jikunshang and @xuechendi . Thanks for the detailed analysis! Just to let you know I've started working on a proper fix. The tricky part is to implement it in a way that won't dramatically hurt performance in other models. Once I have a working prototype I'll send it to you so that you can verify on your end. |
I've looked at attention values between heads and there's no other option here than calculating the local maximum per bs/qheads/kvheads. Even for a single attention head the values can span from -2k to 2k, which is in line with your observations. The good news is that once I changed the code to properly calculate the local maximum, the outputs started to look ok:
The next step is to remove the hardcoded values and figure out how to calculate the local maxima efficiently on Gaudi. |
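For illustration, a minimal PyTorch sketch of the idea above (this is not the flat_pa kernel; the dense attention layout, shapes, and magnitudes are assumptions made for the example):

```python
import torch

bs, q_heads, kv_len = 2, 4, 128
attn = torch.randn(bs, q_heads, kv_len) * 1000           # scores may span roughly -2k..2k

local_max = attn.amax(dim=-1, keepdim=True)               # one maximum per (batch, q_head)
exp_scores = torch.exp(attn - local_max)                  # every exponent is now <= 0
probs = exp_scores / exp_scores.sum(dim=-1, keepdim=True) # numerically safe softmax
```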
Thanks for your investigation! |
@jikunshang @madamczykhabana, does this issue only impact Qwen? We tried meta-llama/Llama-2-7b-chat-hf, meta-llama/Llama-2-70b-chat-hf (tp=4), and sbintuitions/sarashina2-70b (tp=4) based on the latest habana_main commit b62fba8, and the model output quality isn't good:
Prompt: 'Hello, my name is', Generated text: ' Ethan,'
meta-llama/Llama-2-70b-chat-hf: Prompt: 'Hello, my name is', Generated text: ' Margo'
sbintuitions/sarashina2-70b: Prompt: 'Hello, my name is', Generated text: ' Noemi. I have been a Spanish teacher for 14 years. I' |
you are running |
Yes, I used offline_inference.py. I don't modify any sampling parameters; I only change the LLM model to the one I used and set the TP parameters. I'm not sure whether it's the same issue or we need another issue to track it? |
I feel it's the same issue. |
Thanks a lot, then we will have a check when there is a fix for this issue |
I've pushed a fix candidate to dev/madamczyk/flat_pa_acc branch both to vllm-fork and vllm-hpu-extension repositories. Unfortunately using scatter_reduce introduces a big performance hit so it's disabled by default. You can try it out by running with VLLM_PA_SOFTMAX_IMPL=scatter_reduce env flag. Please check if it solves the accuracy issue you're seeing and at the same time we'll continue exploring internally how to speed it up. |
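For readers following along, a rough sketch of how a scatter_reduce-based per-sequence maximum could look for flattened KV blocks (this is not the branch's code; block_scores, block_groups, and the shapes are invented for the illustration):

```python
import torch

num_blocks, block_size, num_seqs = 6, 4, 3
block_scores = torch.randn(num_blocks, block_size) * 100   # attention scores per flattened block
block_groups = torch.tensor([0, 0, 0, 1, 1, 2])            # which sequence each block belongs to

# Per-block maxima, reduced into per-sequence maxima via scatter_reduce.
block_max = block_scores.amax(dim=-1)                      # [num_blocks]
seq_max = torch.full((num_seqs,), float("-inf"))
seq_max = seq_max.scatter_reduce(0, block_groups, block_max, reduce="amax")

# Gather each sequence's maximum back to its blocks and exponentiate safely.
safe_scores = block_scores - seq_max[block_groups].unsqueeze(-1)
exp_scores = safe_scores.exp()
```

Per the comment above, this path is opt-in via the VLLM_PA_SOFTMAX_IMPL=scatter_reduce environment variable.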
Thanks for your fix! I did a quick test and this works on the Qwen2-7B model. How much performance regression did you observe? I will also do a benchmark later. |
On llama3.1-8b it's around a 30% perf drop |
Thanks, I also observed about a 30% perf drop. |
Hi @jikunshang. During brainstorming in my team we've found another possible solution. Please take another look at the dev/madamczyk/flat_pa_acc branch. I've added another option to normalize by the average of maximums (it's enabled by default, so no need to set anything; the current behavior can be re-enabled by running with VLLM_PA_SOFTMAX_IMPL=amax). |
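Again a hedged sketch with invented names rather than the branch's code: the "average of maximums" idea replaces the exact per-sequence maximum with the mean of the per-block maxima, which (as assumed here) can be computed with dense matmuls instead of a scatter reduction:

```python
import torch

num_blocks, block_size, num_seqs = 6, 4, 3
block_scores = torch.randn(num_blocks, block_size) * 100
block_groups = torch.tensor([0, 0, 0, 1, 1, 2])

block_max = block_scores.amax(dim=-1)                                     # [num_blocks]
block2seq = torch.nn.functional.one_hot(block_groups, num_seqs).float()   # [num_blocks, num_seqs]

# Mean of the per-block maxima for each sequence, expressed as matmul + count.
seq_sum = block2seq.t() @ block_max                                       # [num_seqs]
seq_cnt = block2seq.sum(dim=0)                                            # [num_seqs]
seq_avg_max = seq_sum / seq_cnt

# The subtracted value is no longer the exact maximum, so some exponents can be
# positive; they stay bounded by the spread of block maxima within a sequence,
# which is consistent with the accuracy issues reported later in this thread.
exp_scores = torch.exp(block_scores - seq_avg_max[block_groups].unsqueeze(-1))
```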
I took a quick try with this fix and the result looks quite promising; I will also test performance. Thanks a lot! |
|
As far as I understand, missing dims shouldn't be dynamic. |
@madamczykhabana, we found some accuracy issues with your latest wsum solution. It happens when the input length is big, e.g. the outputs of the 4 prompts below were not correct (the first one has a long input). |
@czhu15 thanks for the failing test cases! I'll take a closer look what's going on there. |
@madamczykhabana will this fix be merged to both habana_main and the 1.18 release? |
@czhu15 there was in fact a bug with handling padding blocks. I've pushed the fix to dev/madamczyk/flat_pa_acc. Here are my results with the latest code:
Please double check on your end. |
@Yanli2190 It's hard to tell. I'll discuss it with the rest of the team. |
@Yanli2190 we're not planning to merge this fix to 1.18.1 (1.18.0 has already been released). It will be merged to habana_main (and thus automatically included in 1.19.0) |
@madamczykhabana , verified that the latest change has fixed the long input sequence issue :) |
@madamczykhabana, I back-ported your changes into the v1.18.0 branch. It works correctly under the Synapse 1.17.0 docker, but not under the Synapse 1.18.0 docker. Below is the output from the 1.18.0 docker: |
Unfortunately, as it turns out, there are still some issues with accuracy. I was able to reproduce it on 1.19, 1.18 and 1.17. In my case the issue starts reproducing when the difference between the average and the maximum values becomes large enough. I might have an idea on how to work around it... |
@madamczykhabana Did your latest commit (HabanaAI/vllm-hpu-extension@fd7f2e6#diff-c45331496d9b806755711b375270bd629261a029e414691d69401e5c297d0fd7R53) eventually fix this accuracy issue, or are you still working on it? |
Unfortunately, I'm still working on it. I've decided to push it and use it as the default as it helps in the case of llama. There are still numerical issues in Qwen when we're crossing the block boundary. |
Got it. Please kindly let me know when you make progress. Good luck! |
I have a new idea that I'm testing. @czhu15 please check the dev/madamczyk/softmax_options branch. By default it normalizes by the weighted sum, followed by amax. It's not perfect, but it should be better. There might be a slight perf drop (around 5%) that can likely be improved. |
@madamczykhabana
{
    "Prompt": "Do you know the book Traction by Gino Wickman",
    "Generated text": "?! I!LO!!VE!! IT!! I! Absolutely!!! T! r!! a! c!! o!! n!! ! b! y!! G!!! i!! n!!! o!! W!!!!!!!! Wick!!!man!!!!!!!!! is!!!!!!! one!!!!! !!!of!!!!!!!the!!!!!!!!!!!!!best!"
}
BTW: The solution with |
@yangulei thanks for checking! |
diff --git a/benchmarks/benchmark_throughput.py b/benchmarks/benchmark_throughput.py
index b7bc2a64..8d65af40 100644
--- a/benchmarks/benchmark_throughput.py
+++ b/benchmarks/benchmark_throughput.py
@@ -135,8 +135,20 @@ def run_vllm(
     if not use_beam_search:
         start = time.perf_counter()
-        llm.generate(prompts, sampling_params, use_tqdm=True)
+        outputs = llm.generate(prompts, sampling_params, use_tqdm=True)
         end = time.perf_counter()
+        print("saving results ...")
+        records = []
+        for output in outputs:
+            record = {
+                "Prompt": output.prompt,
+                "Generated text": output.outputs[0].text
+            }
+            records.append(record)
+        with open('benchmark_throughput_results.json', 'w', encoding='utf-8') as file:
+            for record in records:
+                json_record = json.dumps(record)
+                file.write(json_record + '\n')
     else:
         prompts = [prompt for prompt, _, _ in requests]
         # output_len should be the same for all requests.
#! /bin/bash
model_path=/models/Qwen2-7B-Instruct
dataset=/models/ShareGPT_V3_unfiltered_cleaned_split.json
model_name=$( echo $model_path | awk -F/ '{print $NF}' )
export VLLM_PA_SOFTMAX_IMPL=wsum,amax
export VLLM_SKIP_WARMUP=True
export TOKENIZERS_PARALLELISM=true
export VLLM_GRAPH_RESERVED_MEM=0.2
export VLLM_GRAPH_PROMPT_RATIO=0.8
python benchmark_throughput.py \
--backend vllm \
--model $model_path \
--trust-remote-code \
--tensor-parallel-size 1 \
--dataset ${dataset} \
--device hpu \
--dtype bfloat16 \
--seed 2024 \
--max-num-batched-tokens 8192 \
--max-model-len 4096 \
--num-prompts 1000 \
> benchmark_throughput_${model_name}_sharegpt.log 2>&1 |
@yangulei please check if the problem persists on current habana_main. I'm still not 100% convinced that accuracy is as good as it should be. I'm currently running multiple benchmarks from lm_eval to compare accuracy results. |
Hi @madamczykhabana, I verified the wsum_head_amax solution on ShareGPT and compared it to scatter_reduce. Basically the results are very close (but diverge at later sentences in most cases). |
Below are two files that store the ShareGPT outputs with the different algorithms. |
@madamczykhabana, can you kindly share the lm_eval scores once you get them? |
Hi, on llama3-8b I found that the quality of the generated text with float32 is much better than with bfloat16. I have attached my code at the end.
And with PT_HPU_MAX_COMPOUND_OP_SIZE=1 the generated result is trash:
|
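For context on why float32 and bfloat16 can diverge so much at these magnitudes, a small illustration of bfloat16's coarse spacing for large values (general background, not an analysis of this specific report):

```python
import torch

# bfloat16 keeps only 8 bits of significand, so values around 2000 are spaced 8 apart;
# float32 keeps 24 bits and resolves them exactly.
x = torch.tensor([2000.0, 2001.0, 2002.0, 2003.0, 2004.0])
print(x.to(torch.bfloat16))  # several entries collapse to the same representable value
print(x)                     # float32 keeps them all distinct
```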
Hi @ccrhx4 . Thanks for the info! Could you please retry your experiment with VLLM_PA_SOFTMAX_IMPL=scatter_reduce? I'd like to confirm whether it's a different issue or it's related to softmax normalization. |
@madamczykhabana Hi Michal, the result is the same.
|
Then most likely it's a different root cause. Please create a new issue for it. We'll take a look at it. |
Hi Michal, I have created the new issue. #443 |
Your current environment
🐛 Describe the bug
This issue is introduced by the block_softmax kernel (part of flat_pa, see #169). For some models (like Qwen2-7B), the value of Q * K.t() * scale may be much greater than expected (like 2000+), while block_softmax just subtracts a dummy max_value (hardcoded 10), so the exp kernel may overflow and the following calculations are not correct. You can use examples/offline_inference.py with the Qwen/Qwen2-7B model to reproduce; the result looks like below. If you print attn in block_softmax, you will see the values are abnormal compared to the llama model.
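A tiny, self-contained illustration of the overflow described above (plain PyTorch arithmetic, not the kernel code):

```python
import torch

scores = torch.tensor([1990.0, 2000.0, 1995.0], dtype=torch.bfloat16)

bad = torch.exp(scores - 10.0)            # hardcoded offset: exponents near 2000 overflow to inf
good = torch.exp(scores - scores.amax())  # true maximum: exponents <= 0 stay finite

print(bad)   # inf values propagate into the rest of the attention computation
print(good)  # finite, well-behaved values
```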