This is the official codebase for our paper "Preference Fine-Tuning of LLMs Should Leverage Suboptimal On-Policy Data" by Fahim Tajwar*, Anikait Singh*, Archit Sharma, Rafael Rafailov, Jeff Schneider, Tengyang Xie, Stefano Ermon, Chelsea Finn, and Aviral Kumar. For any questions or concerns related to this codebase, please reach out to Anikait Singh.
For bandit experiments, make sure you are in the bandit_experiment directory. The bandit_experiment/scripts directory provides example commands to run our experiments.
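For instance, a minimal invocation from the repository root might look like the sketch below. The script name is a placeholder, not an actual file in this repo; list the scripts directory to see what is available:

```bash
# Run from the repository root. The script name below is an illustrative
# placeholder; check bandit_experiment/scripts/ for the actual files.
cd bandit_experiment
ls scripts/                       # see the available example commands
bash scripts/<example_script>.sh  # substitute one of the listed scripts
```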
For UltraFeedback DPO/Pref-FT experiments, HALOs/project_scripts/run_halos_multi.sh has the example commands to reproduce the experiments in our paper.
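A minimal sketch of launching it, assuming it is run from the HALOs directory (the script itself documents the exact arguments and any required environment setup):

```bash
# Launch the UltraFeedback DPO/Pref-FT example commands. Any required
# environment variables or dependencies are described in the script itself.
cd HALOs
bash project_scripts/run_halos_multi.sh
```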
We note the following additional datasets used in our LLM experiments:
We acknowledge the following codebases for our paper:
- TRL - Adapted for our synthetic LLM experiments.
- HALOs - Used for our DPO/Pref-FT experiments on UltraFeedback.
- DrQ-v2 - Used for our bandit experiments.
- minGPT - Used for our bandit experiments.
We thank the authors for providing us with easy-to-work-with codebases.