
version of lm-eval #6

Open
JingyangXiang opened this issue Jul 26, 2024 · 1 comment
@JingyangXiang

Hi, I appreciate your work! I have a question regarding the zero-shot common sense reasoning tasks on Llama2-7b. I tested Llama2-7b 4-4-4 using lm-eval and observed a significant discrepancy in the results compared to those reported in the paper. Could you please confirm which version of lm-eval was used in the paper? Thank you!

| Tasks         | Version | Filter | n-shot | Metric   |  Value |   | Stderr |
|---------------|---------|--------|--------|----------|--------|---|--------|
| arc_easy      |       1 | none   |      0 | acc      | 0.7054 | ± | 0.0094 |
|               |         | none   |      0 | acc_norm | 0.6759 | ± | 0.0096 |
| arc_challenge |       1 | none   |      0 | acc      | 0.3771 | ± | 0.0142 |
|               |         | none   |      0 | acc_norm | 0.4010 | ± | 0.0143 |
| piqa          |       1 | none   |      0 | acc      | 0.7432 | ± | 0.0102 |
|               |         | none   |      0 | acc_norm | 0.7568 | ± | 0.0100 |
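
For reference, a minimal sketch of the kind of zero-shot run that produces a table like the one above, using the lm-eval 0.4.x Python API. The pretrained path, dtype, and batch size here are placeholders, and the quantized 4-4-4 checkpoint would actually be loaded through this repo's own pipeline rather than as a plain HF model:

```python
import lm_eval
from lm_eval.utils import make_table

# Zero-shot common sense reasoning tasks.
# Placeholders: pretrained path, dtype, and batch size are assumptions for illustration.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-2-7b-hf,dtype=float16",
    tasks=["arc_easy", "arc_challenge", "piqa", "hellaswag", "winogrande"],
    num_fewshot=0,
    batch_size=8,
)

# Print the same Tasks/Version/Filter/n-shot/Metric/Value/Stderr table as the CLI.
print(make_table(results))
```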

kiucho commented Aug 31, 2024

I tested Llama2-7b 4-16-16 (RTN) with 10_optimize_rotation.sh and got a wikitext-2 ppl of 5.5, which is exactly the same as the paper reported.

But when I tested the zero-shot tasks with lm_eval==0.4.3, the results were as follows.

|       | ARC-e | ARC-c | PIQA | HellaSwag | Winogrande |
|-------|-------|-------|------|-----------|------------|
| Paper | 72.2  | 48.6  | 78.2 | 74.2      | 67.9       |
| Mine  | 70.6  | 42.7  | 77.7 | 74.6      | 67.3       |

I think there's a discrepancy in the zero-shot results.
The paper tested on 8 x A100 and I tested on 8 x A6000.
If you have time, can you test your checkpoint and let me know the results?
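
(If it helps to rule out version drift, the installed harness version can be checked directly; a small sketch, assuming the package is installed under its PyPI name lm_eval:)

```python
# Print the installed lm-eval version (assumes the PyPI distribution name "lm_eval").
from importlib.metadata import version

print(version("lm_eval"))
```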
