A Comprehensive Survey of Direct Preference Optimization:
Datasets, Theories, Variants, and Applications
- 2024.10: We released our survey paper, "A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications". Feel free to cite it or open pull requests.
Welcome to the repository for our survey paper, "A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications". It collects resources and updates related to our research; for a detailed introduction, please refer to the paper itself.
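For readers new to the topic, here is a minimal, dependency-free sketch of the DPO objective introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (listed below). The function name and the scalar log-probability interface are illustrative, not taken from any particular library:

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single preference pair: -log sigmoid(beta * margin)."""
    # Implicit rewards: how much the policy's log-probability of each
    # response has moved relative to the frozen reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log sigmoid(margin): small when the chosen response is favored.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy equals the reference, the margin is zero and the loss is log 2; raising the chosen response's log-probability (or lowering the rejected one's) relative to the reference drives the loss down, which is the sense in which the policy acts as its own reward model.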
- Training language models to follow instructions with human feedback [Paper] (2022.03)
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model [Paper] (2023.05)
- Learning Dynamics of LLM Finetuning [Paper] (2024.07)
- On the Limited Generalization Capability of the Implicit Reward Model Induced by Direct Preference Optimization [Paper] (2024.09)
- Policy Optimization in RLHF: The Impact of Out-of-preference Data [Paper] (2023.12)
- A General Theoretical Paradigm to Understand Learning from Human Preferences [Paper] (2023.10)
- Generalizing Reward Modeling for Out-of-Distribution Preference Learning [Paper] (2024.11)
- Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs [Paper] (2024.06)
- 3D-Properties: Identifying Challenges in DPO and Charting a Path Forward [Paper] (2024.05)
- Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF [Paper] (2024.09)
- Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections [Paper] (2025.07)
- RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment [Paper] (2023.04)
- RRHF: Rank Responses to Align Language Models with Human Feedback without tears [Paper] (2023.04)
- Advancing LLM Reasoning Generalists with Preference Trees [Paper] (2024.04)
- Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs [Paper] (2024.06)
- KTO: Model Alignment as Prospect Theoretic Optimization [Paper] (2024.02)
- Token-level Direct Preference Optimization [Paper] (2024.04)
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [Paper] (2024.01)
- Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation [Paper] (2024.01)
- Iterative Reasoning Preference Optimization [Paper] (2024.04)
- Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability [Paper] (2024.11)
- Building Math Agents with Multi-Turn Iterative Preference Learning [Paper] (2024.09)
- β-DPO: Direct Preference Optimization with Dynamic β [Paper] (2024.07)
- Understanding Reference Policies in Direct Preference Optimization [Paper] (2024.07)
- From r to Q*: Your Language Model is Secretly a Q-Function [Paper] (2024.04)
- ORPO: Monolithic Preference Optimization without Reference Model [Paper] (2024.03)
- SimPO: Simple Preference Optimization with a Reference-Free Reward [Paper] (2024.05)
- Towards Analyzing and Understanding the Limitations of DPO: A Theoretical Perspective [Paper] (2024.04)
- Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study [Paper] (2024.04)
- Learn Your Reference Model for Real Good Alignment [Paper] (2024.05)
- KL Penalty Control via Perturbation for Direct Preference Optimization [Paper] (2025.02)
- Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF [Paper] (2024.09)
- Self-Rewarding Language Models [Paper] (2024.01)
- Some things are more CRINGE than others: Iterative Preference Optimization with the Pairwise Cringe Loss [Paper] (2023.12)
- Direct Language Model Alignment from Online AI Feedback [Paper] (2024.02)
- sDPO: Don't Use Your Data All at Once [Paper] (2024.03)
- OPTune: Efficient Online Preference Tuning [Paper] (2024.06)
- Online DPO: Online Direct Preference Optimization with Fast-Slow Chasing [Paper] (2024.06)
- Understanding the performance gap between online and offline alignment algorithms [Paper] (2024.05)
- Offline RLHF Methods Need More Accurate Supervision Signals [Paper] (2024.08)
- Hybrid Preference Optimization for Alignment: Provably Faster Convergence Rates by Combining Offline Preferences with Online Exploration [Paper] (2024.09)
- AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback [Paper] (2023.05)
- A Long Way to Go: Investigating Length Correlations in RLHF [Paper] (2023.10)
- Is Stack Overflow Obsolete? An Empirical Study of the Characteristics of ChatGPT Answers to Stack Overflow Questions [Paper] (2023.10)
- How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources [Paper] (2023.06)
- Disentangling Length from Quality in Direct Preference Optimization [Paper] (2024.08)
- Following Length Constraints in Instructions [Paper] (2024.06)
- Length Desensitization in Direct Preference Optimization [Paper] (2024.09)
- Sail into the Headwind: Alignment via Robust Rewards and Dynamic Labels against Reward Hacking [Paper] (2024.09)
- Rethinking the Role of Proxy Rewards in Language Model Alignment [Paper] (2024.02)
- Online Merging Optimizers for Boosting Rewards and Mitigating Tax in Alignment [Paper] (2024.05)
- SPO: Multi-Dimensional Preference Sequential Alignment With Implicit Reward Modeling [Paper] (2024.05)
- PAFT: A Parallel Training Paradigm for Effective LLM Fine-Tuning [Paper] (2024.06)
- Mitigating the Alignment Tax of RLHF [Paper] (2023.09)
- Disperse-Then-Merge: Pushing the Limits of Instruction Tuning via Alignment Tax Reduction [Paper] (2024.05)
- Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment [Paper] (2024.02)
- Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization [Paper] (2024.08)
- Restoring Calibration for Aligned Large Language Models: A Calibration-Aware Fine-Tuning Approach [Paper] (2024.05)
- Learning to summarize from human feedback [Paper] (2020.09)
- WebGPT: Browser-assisted question-answering with human feedback [Paper] (2021.12)
- HelpSteer: Multi-attribute Helpfulness Dataset for SteerLM [Paper] (2023.11)
- HelpSteer2: Open-source dataset for training top-performing reward models [Paper] (2024.06)
- HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages [Paper] (2025.05)
- OpenAssistant Conversations -- Democratizing Large Language Model Alignment [Paper] (2023.04)
- Understanding Dataset Difficulty with V-Usable Information (Introduces SHP) [Paper] (2022.07)
- tasksource: A Large Collection of NLP tasks (Introduces DPO-pairs) [Paper] (2024.05)
- RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-Grained Correctional Human Feedback [Paper] (2024.06)
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (Introduces HH-RLHF) [Paper] (2022.04)
- MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [Paper] (2025.02)
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena [Paper] (2023.06)
- Towards Harmless Multimodal Assistants with Blind Preference Optimization (Introduces MMSafe-PO) [Paper] (2025.03)
- RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness [Paper] (2024.05)
- UltraFeedback: Boosting Language Models with High-quality Feedback [Paper] (2023.10)
- Distilabel Capybara DPO Dataset [Dataset]
- ChatML DPO Pairs (from Orca) [Dataset]
- Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing [Paper] (2024.06)
- TLDR Preference DPO [Dataset]
- Zerolink ZSQL-Postgres DPO [Dataset]
- RewardBench: Evaluating Reward Models for Language Modeling [Paper] (2024.03)
- RewardBench 2: Advancing Reward Model Evaluation [Paper] (2025.06)
- VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models [Paper] (2024.11)
- Silkie: Preference Distillation for Large Visual Language Models (Introduces VLFeedback) [Paper] (2023.12)
- SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model [Paper] (2024.06)
- Stack Exchange Preferences Dataset [Dataset]
- Starling-7B: Improving LLM Helpfulness & Harmlessness with RLAIF (Introduces Nectar) [Blog Post] (2023.11)
- SlimOrca: An Open Dataset of GPT-4 Augmented FLAN Reasoning Traces [Dataset]
- CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance [Paper] (2025.02)
- Mask-DPO: Generalizable Fine-grained Factuality Alignment of LLMs (Introduces ANAH) [Paper] (2025.03)
- Unified Reward Model for Multimodal Understanding and Generation [Paper] (2025.03)
- MM-IFEngine: Towards Multimodal Instruction Following [Paper] (2025.04)
- Qwen2 Technical Report [Paper] (2024.07)
- The Llama 3 Herd of Models [Paper] (2024.07)
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism [Paper] (2024.01)
- Mixtral of Experts [Paper] (2024.01)
- Baichuan 2: Open Large-scale Language Models [Paper] (2023.09)
- Yi: Open Foundation Models by 01.AI [Paper] (2024.03)
- Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning [Paper] (2024.07)
- Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning [Paper] (2024.07)
- MAPO: Advancing Multilingual Reasoning through Multilingual-Alignment-as-Preference Optimization [Paper] (2024.08)
- Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models [Paper] (2024.06)
- Enhancing and Assessing Instruction-Following with Fine-Grained Instruction Variants [Paper] (2024.06)
- Conifer: Improving Complex Constrained Instruction-Following Ability of Large Language Models [Paper] (2024.04)
- FIPO: Free-form Instruction-oriented Prompt Optimization with Preference Dataset and Modular Fine-tuning Schema [Paper] (2024.02)
- RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models [Paper] (2024.02)
- FLAME: Factuality-Aware Alignment for Large Language Models [Paper] (2024.05)
- Halu-J: Critique-Based Hallucination Judge [Paper] (2024.07)
- Fine-grained Hallucination Detection and Mitigation in Long-form Question Answering [Paper] (2024.07)
- Fine-tuning Language Models for Factuality [Paper] (2023.11)
- Code-Optimise: Self-Generated Preference Data for Correctness and Efficiency [Paper] (2024.06)
- Stable Code Technical Report [Paper] (2024.04)
- Performance-Aligned LLMs for Generating Fast Code [Paper] (2024.04)
- Instruct-Code-Llama: Improving Capabilities of Language Model in Competition Level Code Generation by Online Judge Feedback [Paper] (2024.07)
- Silkie: Preference Distillation for Large Visual Language Models [Paper] (2023.12)
- Aligning Modalities in Vision Large Language Models via Preference Fine-tuning [Paper] (2024.02)
- Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback [Paper] (2024.04)
- Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward [Paper] (2024.04)
- LLaVA-NeXT: A Strong Zero-shot Video Understanding Model [Blog Post] (2024.04)
- Qwen2-Audio Technical Report [Paper] (2024.07)
- CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios [Paper] (2024.03)
- Diffusion Model Alignment Using Direct Preference Optimization [Paper] (2023.11)
- Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model [Paper] (2023.11)
- A Dense Reward View on Aligning Text-to-Image Diffusion with Preference [Paper] (2024.02)
- Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization [Paper] (2024.04)
- Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization [Paper] (2024.09)
- SpeechAlign: Aligning Speech Generation to Human Preferences [Paper] (2024.04)
- Speechworthy Instruction-tuned Language Models [Paper] (2024.09)
- Aligning protein generative models with experimental fitness via Direct Preference Optimization [Paper] (2024.05)
- A Human Feedback Strategy for Photoresponsive Molecules in Drug Delivery [Paper] (2024.07)
- ConPro: Learning Severity Representation for Medical Images using Contrastive Learning and Preference Optimization [Paper] (2024.04)
- Exploring Text-to-Motion Generation with Human Preference [Paper] (2024.04)
- On Softmax Direct Preference Optimization for Recommendation [Paper] (2024.06)
- Finetuning Large Language Model for Personalized Ranking [Paper] (2024.05)
If you find this work useful, please consider citing us:
```bibtex
@misc{xiao2024comprehensivesurveydirectpreference,
  title={A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications},
  author={Wenyi Xiao and Zechuan Wang and Leilei Gan and Shuai Zhao and Wanggui He and Luu Anh Tuan and Long Chen and Hao Jiang and Zhou Zhao and Fei Wu},
  year={2024},
  eprint={2410.15595},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2410.15595},
}
```