Reproducing results on Llama-3.1-8B-Inst #8

Open
chtmp223 opened this issue Oct 22, 2024 · 3 comments

@chtmp223

Thank you for your great work!

I tried to reproduce the results for meta-llama/Llama-3.1-8B-Instruct, but I noticed several discrepancies between my numbers and those reported in the public result file under Llama-3.1-8B-Inst.

You can view my results here: Results. I obtained them by running scripts/run_eval_slurm.sh with both the short and long configs for several benchmarks, and I compiled everything using scripts/collect_results.py.

Do you have any suggestions on specific things I could adjust to align my results with yours?

@howard-yen
Collaborator

Hi, thank you for your interest in our work!

Taking a closer look at the differences in the linked spreadsheet, it seems that most numbers at 128k are within 1-2 absolute points of each other, which, unfortunately, is expected given the nondeterministic nature of flash attention and bf16.
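For intuition, here is a minimal standalone illustration (not code from this repo) of why bf16 results shift when a kernel changes its reduction order, which is what tiled kernels like flash attention can do between runs:

```python
# Toy example: bf16 partial sums get rounded to bf16 between reduction
# stages, so summing the same numbers in a different order typically
# gives a slightly different result.
import torch

torch.manual_seed(0)
x = torch.randn(2**20)

full = x.to(torch.bfloat16).sum()                                 # one reduction order
tiled = x.to(torch.bfloat16).view(1024, 1024).sum(dim=-1).sum()   # tiled reduction order
exact = x.double().sum()

print(f"single-pass bf16: {full.item():.4f}")
print(f"tiled bf16:       {tiled.item():.4f}")
print(f"fp64 reference:   {exact.item():.4f}")
```

Here full and tiled can disagree slightly even though they sum the same numbers; in a real kernel, the reduction order can vary from run to run, which shows up as the small noise above.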

The largest deviations appear to be on the ICL datasets; I will double-check that the random seeds are set correctly so that the same demos and shuffled labels are used across different runs.
In future work, we will also report results across multiple runs with expected error margins. Thanks for bringing this up!

@howard-yen
Collaborator

Quick update: I checked the ICL code, and the problem is that the random seed is not set correctly, which results in different demos and label mappings, and thus high variance between runs. Thanks for prompting me to look into this issue; I will update the code soon, along with the results in the spreadsheet and in the paper's next iteration!
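For reference while I prepare the fix, here is a rough sketch of the intended behavior, with placeholder names rather than the repo's actual functions:

```python
# Rough sketch with placeholder names (not the repo's actual code):
# a dedicated, seeded RNG makes demo selection and the label shuffle
# deterministic across runs.
import random

def build_icl_context(train_examples, labels, n_demos, seed=42):
    rng = random.Random(seed)                    # local RNG, fixed seed
    demos = rng.sample(train_examples, n_demos)  # same demos every run
    shuffled = list(labels)
    rng.shuffle(shuffled)                        # same permutation every run
    label_map = dict(zip(labels, shuffled))      # original -> shuffled label
    return demos, label_map
```

The key point is to use a dedicated random.Random(seed) instance, so the sampled demos and label permutation do not depend on global RNG state consumed elsewhere in the pipeline.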

@chtmp223
Author

This is good to know! I will pause running the ICL benchmarks for now and wait for updates from your end. Thank you for getting back to me.
