
version of lm-eval #6

Open
JingyangXiang opened this issue Jul 26, 2024 · 1 comment
@JingyangXiang

Hi, I appreciate your work! I have a question regarding the zero-shot common sense reasoning tasks on Llama2-7b. I tested Llama2-7b 4-4-4 using lm-eval and observed a significant discrepancy in the results compared to those reported in the paper. Could you please confirm which version of lm-eval was used in the paper? Thank you!

| Tasks         | Version | Filter | n-shot | Metric   |  Value |   | Stderr |
|---------------|---------|--------|--------|----------|--------|---|--------|
| arc_easy      |       1 | none   |      0 | acc      | 0.7054 | ± | 0.0094 |
|               |         | none   |      0 | acc_norm | 0.6759 | ± | 0.0096 |
| arc_challenge |       1 | none   |      0 | acc      | 0.3771 | ± | 0.0142 |
|               |         | none   |      0 | acc_norm | 0.4010 | ± | 0.0143 |
| piqa          |       1 | none   |      0 | acc      | 0.7432 | ± | 0.0102 |
|               |         | none   |      0 | acc_norm | 0.7568 | ± | 0.0100 |
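
For reference, a minimal sketch of the kind of zero-shot run that produces a table like the one above, using the lm-eval 0.4.x Python API. The pretrained path, dtype, and batch size here are placeholders, and the quantized 4-4-4 checkpoint would actually be loaded through this repo's own pipeline rather than as a plain HF model:

```python
import lm_eval
from lm_eval.utils import make_table

# Zero-shot common sense reasoning tasks.
# Placeholders: pretrained path, dtype, and batch size are assumptions for illustration.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-2-7b-hf,dtype=float16",
    tasks=["arc_easy", "arc_challenge", "piqa", "hellaswag", "winogrande"],
    num_fewshot=0,
    batch_size=8,
)

# Print the same Tasks/Version/Filter/n-shot/Metric/Value/Stderr table as the CLI.
print(make_table(results))
```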

kiucho commented Aug 31, 2024

I tested Llama2-7b 4-16-16 (RTN) with 10_optimize_rotation.sh and got a wikitext-2 ppl of 5.5, which is exactly the same as the paper reported.

But when I tested the zero-shot tasks with lm_eval==0.4.3, the results were as follows.

|       | ARC-e | ARC-c | PIQA | HellaSwag | Winogrande |
|-------|-------|-------|------|-----------|------------|
| Paper | 72.2  | 48.6  | 78.2 | 74.2      | 67.9       |
| Mine  | 70.6  | 42.7  | 77.7 | 74.6      | 67.3       |

I think there's a discrepancy in the zero-shot results.
The paper tested on 8 x A100 and I tested on 8 x A6000.
If you have time, can you test your checkpoint and let me know the results?
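
(If it helps to rule out version drift, the installed harness version can be checked directly; a small sketch, assuming the package is installed under its PyPI name lm_eval:)

```python
# Print the installed lm-eval version (assumes the PyPI distribution name "lm_eval").
from importlib.metadata import version

print(version("lm_eval"))
```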
