This is the official codebase for our paper "Preference Fine-Tuning of LLMs Should Leverage Suboptimal On-Policy Data" by Fahim Tajwar*, Anikait Singh*, Archit Sharma, Rafael Rafailov, Jeff Schneider, Tengyang Xie, Stefano Ermon, Chelsea Finn, and Aviral Kumar. For any questions or concerns related to this codebase, please reach out to Anikait Singh.
For bandit experiments, make sure you are in the bandit_experiment directory. The bandit_experiment/scripts directory provides example commands to run our experiments.
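For instance, a minimal invocation from the repository root might look like the sketch below. The script name is a placeholder, not an actual file in this repo; list the scripts directory to see what is available:

```bash
# Run from the repository root. The script name below is an illustrative
# placeholder; check bandit_experiment/scripts/ for the actual files.
cd bandit_experiment
ls scripts/                       # see the available example commands
bash scripts/<example_script>.sh  # substitute one of the listed scripts
```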
For UltraFeedback DPO/Pref-FT experiments, HALOs/project_scripts/run_halos_multi.sh has the example commands to reproduce the experiments in our paper.
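A minimal sketch of launching it, assuming it is run from the HALOs directory (the script itself documents the exact arguments and any required environment setup):

```bash
# Launch the UltraFeedback DPO/Pref-FT example commands. Any required
# environment variables or dependencies are described in the script itself.
cd HALOs
bash project_scripts/run_halos_multi.sh
```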
We note the following additional datasets used in our LLM experiments:
We acknowledge the following codebases for our paper:
- TRL - Adapted for our synthetic LLM experiments.
- HALOs - Used for our DPO/Pref-FT experiments on UltraFeedback.
- DrQ-v2 - Used for our bandit experiments.
- minGPT - Used for our bandit experiments.
We thank the authors for providing us with easy-to-work-with codebases.