Hi, I appreciate your work! I have a question regarding the zero-shot common-sense reasoning task on Llama2-7B. I tested Llama2-7B 4-4-4 using lm-eval and observed a significant discrepancy between my results and those reported in the paper. Could you please confirm which version of lm-eval was used for the paper? Thank you!
| Tasks         | Version | Filter | n-shot | Metric   |  Value | Stderr  |
|---------------|--------:|--------|-------:|----------|-------:|---------|
| arc_easy      |       1 | none   |      0 | acc      | 0.7054 | ±0.0094 |
|               |         | none   |      0 | acc_norm | 0.6759 | ±0.0096 |
| arc_challenge |       1 | none   |      0 | acc      | 0.3771 | ±0.0142 |
|               |         | none   |      0 | acc_norm | 0.4010 | ±0.0143 |
| piqa          |       1 | none   |      0 | acc      | 0.7432 | ±0.0102 |
|               |         | none   |      0 | acc_norm | 0.7568 | ±0.0100 |
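For context, this is roughly the invocation I used — a minimal sketch with the lm-eval 0.4.x CLI, where the checkpoint path is a hypothetical placeholder for the quantized model, not a path from the repo:

```bash
# Sketch of the zero-shot run above (lm-eval 0.4.x CLI).
# ./llama2-7b-w4a4kv4 is a hypothetical local HF-format checkpoint; substitute your own.
lm_eval --model hf \
    --model_args pretrained=./llama2-7b-w4a4kv4 \
    --tasks arc_easy,arc_challenge,piqa \
    --num_fewshot 0 \
    --batch_size 16
```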
I tested Llama2-7B 4-16-16 (RTN) with 10_optimize_rotation.sh and got a wikitext-2 ppl of 5.5, which exactly matches the paper.
BUT, when I tested the zero-shot tasks with lm_eval==0.4.3, the results were as follows:
|       | ARC-e | ARC-c | PIQA | HellaSwag | Winogrande |
|-------|------:|------:|-----:|----------:|-----------:|
| Paper |  72.2 |  48.6 | 78.2 |      74.2 |       67.9 |
| Mine  |  70.6 |  42.7 | 77.7 |      74.6 |       67.3 |
I think there is a discrepancy in the zero-shot results. The paper's evaluation ran on 8 x A100 GPUs, while mine ran on 8 x A6000 GPUs.
If you have time, could you re-test your checkpoint and share the results?
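Since zero-shot accuracy can shift between lm-eval releases (prompt formats and normalization have changed across versions), a reproduction sketch like the one below pins the harness version first; the checkpoint path is again a hypothetical placeholder:

```bash
# Pin the harness version before evaluating; 0.4.3 is the version from my run above.
# Which version the paper used is exactly what this issue is asking.
pip install lm_eval==0.4.3

# ./llama2-7b-w4a16kv16-rtn is a hypothetical local checkpoint path; substitute your own.
lm_eval --model hf \
    --model_args pretrained=./llama2-7b-w4a16kv16-rtn \
    --tasks arc_easy,arc_challenge,piqa,hellaswag,winogrande \
    --num_fewshot 0 \
    --batch_size 16
```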