Taking a closer look at the differences between the results in the linked spreadsheet, most numbers at 128k are within 1-2 absolute points of each other, which, unfortunately, is expected given the nondeterministic nature of flash attention and bf16.
The largest deviation appears to be in the ICL datasets. I will double-check whether the random seeds are set correctly so that different runs get the same demos and shuffled labels.
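As a rough illustration of why small run-to-run differences are expected: floating-point addition is not associative, so a kernel that accumulates the same values in a different order can return a slightly different result. The sketch below uses ordinary Python floats (64-bit) only to show the effect; it is not the flash attention code path. With bf16, which keeps far fewer mantissa bits, the same effect is proportionally larger.

```python
# Floating-point addition is not associative: grouping changes the result.
a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c   # one accumulation order
right_to_left = a + (b + c)   # same values, different order

print(left_to_right == right_to_left)       # the two sums are not equal
print(abs(left_to_right - right_to_left))   # tiny, but nonzero
```

A difference like this in the logits can occasionally flip a generated token, which is one plausible way a score moves by a point or two between otherwise identical runs.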
For future work, we will also add the results across multiple runs with expected error margins. Thanks for bringing this up!
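A minimal sketch of how results across multiple runs could be summarized with an error margin. The function name `summarize_runs` and the scores are made up for illustration; they are not taken from the repository or the spreadsheet.

```python
import statistics

def summarize_runs(scores):
    """Return (mean, sample standard deviation) for one benchmark across runs."""
    mean = statistics.mean(scores)
    # stdev needs at least two data points; report 0.0 for a single run.
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std

# Hypothetical scores from three runs of the same config (not real results).
scores = [61.0, 62.0, 63.0]
mean, std = summarize_runs(scores)
print(f"{mean:.1f} +/- {std:.1f}")
```

Reporting the spread this way makes it easier to tell whether a 1-2 point gap in a reproduction falls within normal run-to-run variation.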
Quick update: I checked the ICL code, and the problem is that the random seed is not set correctly, which results in different demos and label mappings, and thus high variance between runs. Thanks for prompting me to look into this issue. I will update the code soon, as well as the results in the spreadsheet and in the next iteration of the paper!
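For anyone hitting the same variance before the fix lands, here is a minimal sketch of the general idea, assuming demos are sampled from a pool and labels are remapped by shuffling. This is not the repository's code: `build_icl_inputs`, `pool`, and the label names are hypothetical. The point is that both the demo selection and the label mapping draw from one generator created from an explicit seed, so they do not depend on global random state.

```python
import random

def build_icl_inputs(pool, num_demos, seed):
    """Pick demos and a shuffled label mapping, reproducibly for a given seed."""
    rng = random.Random(seed)  # local generator, independent of global state

    # Sample the in-context demonstrations.
    demos = rng.sample(pool, num_demos)

    # Map each original label to a shuffled replacement label.
    labels = sorted({label for _, label in pool})
    shuffled = labels[:]
    rng.shuffle(shuffled)
    label_map = dict(zip(labels, shuffled))

    return demos, label_map

# Hypothetical (text, label) pool for illustration only.
pool = [("t1", "A"), ("t2", "B"), ("t3", "C"), ("t4", "A"), ("t5", "B")]

first = build_icl_inputs(pool, num_demos=3, seed=42)
second = build_icl_inputs(pool, num_demos=3, seed=42)
print(first == second)  # same seed, same demos and label mapping
```

Using a dedicated `random.Random(seed)` instance rather than the module-level functions avoids the case where other code consumes random numbers in between and silently changes which demos are drawn.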
Thank you for your great work!
I tried to reproduce the results for meta-llama/Llama-3.1-8B-Instruct, but I noticed several discrepancies between my outcomes and those reported in the public result file under Llama-3.1-8B-Inst.
You can view my results here: Results. I obtained these results by running scripts/run_eval_slurm.sh on both short and long configs for some benchmarks, and I compiled all the results using scripts/collect_results.py.
Do you have any suggestions on specific things I could adjust to align my results with yours?