A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications

[arXiv](https://arxiv.org/abs/2410.15595)

📒 Updates

👀 Introduction

Welcome to the repository for our survey paper, "A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications". This repository provides resources and updates related to our research. For a detailed introduction, please refer to our survey paper.

📒 Table of Contents

📄 Research Questions and Variants

RQ0: Why DPO?


  • Training language models to follow instructions with human feedback [Paper] (2022.03)
  • Direct Preference Optimization: Your Language Model is Secretly a Reward Model [Paper] (2023.05)
  • Learning Dynamics of LLM Finetuning [Paper] (2024.07)
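
The papers above motivate DPO by the observation that the KL-regularized RLHF objective has a closed-form optimal policy, so the Bradley-Terry preference loss can be written directly over the policy and a frozen reference model, with no explicit reward model or PPO loop. A minimal PyTorch sketch of that pairwise loss (following the DPO paper listed above; the arguments are assumed to be per-response summed log-probabilities):

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Pairwise DPO loss over summed per-response log-probabilities."""
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response in the pair.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # Maximize the Bradley-Terry likelihood of the observed preference.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```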

RQ1: Effect of Implicit Reward Modeling


  • On the Limited Generalization Capability of the Implicit Reward Model Induced by Direct Preference Optimization [Paper] (2024.09)
  • Policy Optimization in RLHF: The Impact of Out-of-preference Data [Paper] (2023.12)
  • A General Theoretical Paradigm to Understand Learning from Human Preferences [Paper] (2023.10)
  • Generalizing Reward Modeling for Out-of-Distribution Preference Learning [Paper] (2024.11)
  • Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs [Paper] (2024.06)
  • 3D-Properties: Identifying Challenges in DPO and Charting a Path Forward [Paper] (2024.05)
  • Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF [Paper] (2024.09)
  • Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections [Paper] (2025.07)
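
Several of the RQ1 papers analyze the reward that DPO defines implicitly through the trained policy. From the closed-form optimum of the KL-regularized objective, the DPO paper recovers the reward up to a prompt-dependent term that cancels in pairwise comparisons:

```math
r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)
```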

RQ2: Effect of Different Feedback


  • RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment [Paper] (2023.04)
  • RRHF: Rank Responses to Align Language Models with Human Feedback without tears [Paper] (2023.04)
  • Advancing LLM Reasoning Generalists with Preference Trees [Paper] (2024.04)
  • Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs [Paper] (2024.06)
  • KTO: Model Alignment as Prospect Theoretic Optimization [Paper] (2024.02)
  • Token-level Direct Preference Optimization [Paper] (2024.04)
  • Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [Paper] (2024.01)
  • Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation [Paper] (2024.01)
  • Iterative Reasoning Preference Optimization [Paper] (2024.04)
  • Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability [Paper] (2024.11)
  • Building Math Agents with Multi-Turn Iterative Preference Learning [Paper] (2024.09)

RQ3: Effect of KL Penalty Coefficient and Reference Model


  • β-DPO: Direct Preference Optimization with Dynamic β [Paper] (2024.07)
  • Understanding Reference Policies in Direct Preference Optimization [Paper] (2024.07)
  • From r to Q*: Your Language Model is Secretly a Q-Function [Paper] (2024.04)
  • ORPO: Monolithic Preference Optimization without Reference Model [Paper] (2024.03)
  • SimPO: Simple Preference Optimization with a Reference-Free Reward [Paper] (2024.05)
  • Towards Analyzing and Understanding the Limitations of DPO: A Theoretical Perspective [Paper] (2024.04)
  • Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study [Paper] (2024.04)
  • Learn Your Reference Model for Real Good Alignment [Paper] (2024.05)
  • KL Penalty Control via Perturbation for Direct Preference Optimization [Paper] (2025.02)
  • Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF [Paper] (2024.09)

RQ4: Online DPO


  • Self-Rewarding Language Models [Paper] (2024.01)
  • Some things are more CRINGE than others: Iterative Preference Optimization with the Pairwise Cringe Loss [Paper] (2023.12)
  • Direct Language Model Alignment from Online AI Feedback [Paper] (2024.02)
  • sDPO: Don't Use Your Data All at Once [Paper] (2024.03)
  • OPTune: Efficient Online Preference Tuning [Paper] (2024.06)
  • Online DPO: Online Direct Preference Optimization with Fast-Slow Chasing [Paper] (2024.06)
  • Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [Paper] (2024.01)
  • Understanding the performance gap between online and offline alignment algorithms [Paper] (2024.05)
  • Offline RLHF Methods Need More Accurate Supervision Signals [Paper] (2024.08)
  • Hybrid Preference Optimization for Alignment: Provably Faster Convergence Rates by Combining Offline Preferences with Online Exploration [Paper] (2024.09)
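
The common pattern across the online and iterative methods above is a loop that samples fresh responses from the current policy, labels them with some preference source, and runs DPO on the resulting pairs. A hypothetical skeleton (the `generate`, `judge`, and `dpo_update` callables are placeholders, not any specific paper's components):

```python
def iterative_dpo(policy, prompts, generate, judge, dpo_update, rounds=3, num_samples=4):
    """Skeleton of online/iterative DPO with a pluggable sampler, annotator, and trainer."""
    for _ in range(rounds):
        pairs = []
        for prompt in prompts:
            candidates = generate(policy, prompt, num_samples)  # sample from the current policy
            chosen, rejected = judge(prompt, candidates)        # human, reward model, or LLM judge
            pairs.append((prompt, chosen, rejected))
        policy = dpo_update(policy, pairs)                      # one round of DPO on the fresh pairs
    return policy
```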

RQ5: Reward Hacking


  • AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback [Paper] (2023.05)
  • A Long Way to Go: Investigating Length Correlations in RLHF [Paper] (2023.10)
  • Is Stack Overflow Obsolete? An Empirical Study of the Characteristics of ChatGPT Answers to Stack Overflow Questions [Paper] (2023.10)
  • How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources [Paper] (2023.06)
  • Disentangling Length from Quality in Direct Preference Optimization [Paper] (2024.08)
  • Following Length Constraints in Instructions [Paper] (2024.06)
  • Length Desensitization in Direct Preference Optimization [Paper] (2024.09)
  • Sail into the Headwind: Alignment via Robust Rewards and Dynamic Labels against Reward Hacking [Paper] (2024.09)
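
A recurring failure mode in the papers above is length bias: the policy learns to win preferences simply by answering longer. One simple mitigation they discuss is regularizing the DPO logit by the length difference of the pair; a minimal illustration (the exact regularizer and where it is applied differ across papers):

```python
import torch.nn.functional as F

def length_regularized_dpo_loss(dpo_logit, len_chosen, len_rejected, alpha=0.01):
    """dpo_logit is the usual beta-scaled difference of log-ratios; the alpha term
    discourages preferring a response merely because it is longer."""
    return -F.logsigmoid(dpo_logit - alpha * (len_chosen - len_rejected)).mean()
```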

RQ6: Alignment Tax


  • Rethinking the Role of Proxy Rewards in Language Model Alignment [Paper] (2024.02)
  • Online merging optimizers for boosting rewards and mitigating tax in alignment [Paper] (2024.05)
  • SPO: Multi-Dimensional Preference Sequential Alignment With Implicit Reward Modeling [Paper] (2024.05)
  • PAFT: A Parallel Training Paradigm for Effective LLM Fine-Tuning [Paper] (2024.06)
  • Mitigating the Alignment Tax of RLHF [Paper] (2023.09)
  • Disperse-Then-Merge: Pushing the Limits of Instruction Tuning via Alignment Tax Reduction [Paper] (2024.05)
  • Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment [Paper] (2024.02)
  • Beyond one-preference-fits-all alignment: Multi-objective direct preference optimization [Paper] (2024.08)
  • Restoring Calibration for Aligned Large Language Models: A Calibration-Aware Fine-Tuning Approach [Paper] (2024.05)
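
Several of the alignment-tax papers above (the model-averaging and online-merging lines of work in particular) interpolate or merge the pre-alignment and post-alignment weights to trade alignment gains against forgetting. A minimal sketch of plain checkpoint interpolation, assuming two compatible `state_dict`s:

```python
import torch

def interpolate_checkpoints(sft_state, aligned_state, lam=0.5):
    """Linear weight interpolation between SFT and aligned checkpoints (lam=0 keeps SFT, lam=1 keeps aligned)."""
    return {name: torch.lerp(sft_state[name].float(), aligned_state[name].float(), lam)
            for name in aligned_state}
```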

📊 Datasets

Human-Labeled Datasets

(Figure: example of a human-labeled preference dataset)

  • Learning to summarize from human feedback [Paper] (2020.09)
  • WebGPT: Browser-assisted question-answering with human feedback [Paper] (2021.12)
  • HelpSteer: Multi-attribute Helpfulness Dataset for SteerLM [Paper] (2023.11)
  • HelpSteer2: Open-source dataset for training top-performing reward models [Paper] (2024.06)
  • HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages [Paper] (2025.05)
  • OpenAssistant Conversations -- Democratizing Large Language Model Alignment [Paper] (2023.04)
  • Understanding Dataset Difficulty with V-Usable Information (Introduces SHP) [Paper] (2022.07)
  • tasksource: A Large Collection of NLP tasks (Introduces DPO-pairs) [Paper] (2024.05)
  • RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-Grained Correctional Human Feedback [Paper] (2024.06)
  • Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (Introduces HH-RLHF) [Paper] (2022.04)
  • MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [Paper] (2025.02)
  • Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena [Paper] (2023.06)
  • AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback [Paper] (2023.05)
  • Towards Harmless Multimodal Assistants with Blind Preference Optimization (Introduces MMSafe-PO) [Paper] (2025.03)

AI-Labeled Datasets

(Figure: example of an AI-labeled preference dataset)

  • RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness [Paper] (2024.05)
  • Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs [Paper] (2024.06)
  • UltraFeedback: Boosting Language Models with High-quality Feedback [Paper] (2023.10)
  • Distilabel Capybara DPO Dataset [Dataset]
  • ChatML DPO Pairs (from Orca) [Dataset]
  • Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing [Paper] (2024.06)
  • TLDR Preference DPO [Dataset]
  • Zerolink ZSQL-Postgres DPO [Dataset]
  • RewardBench: Evaluating Reward Models for Language Modeling [Paper] (2024.03)
  • RewardBench 2: Advancing Reward Model Evaluation [Paper] (2025.06)
  • VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models [Paper] (2024.11)
  • Silkie: Preference Distillation for Large Visual Language Models (Introduces VLFeedback) [Paper] (2023.12)
  • SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model [Paper] (2024.06)
  • Stack Exchange Preferences Dataset [Dataset]
  • Starling-7B: Improving LLM Helpfulness & Harmlessness with RLAIF (Introduces Nectar) [Blog Post] (2023.11)
  • SlimOrca: An Open Dataset of GPT-4 Augmented FLAN Reasoning Traces [Dataset]
  • CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance [Paper] (2025.02)
  • Mask-DPO: Generalizable Fine-grained Factuality Alignment of LLMs (Introduces ANAH) [Paper] (2025.03)
  • Unified Reward Model for Multimodal Understanding and Generation [Paper] (2025.03)
  • MM-IFEngine: Towards Multimodal Instruction Following [Paper] (2025.04)

🌐 Applications

Application on Large Language Models

  • Qwen2 Technical Report [Paper] (2024.07)
  • The Llama 3 Herd of Models [Paper] (2024.07)
  • DeepSeek LLM: Scaling Open-Source Language Models with Longtermism [Paper] (2024.01)
  • Mixtral of Experts [Paper] (2024.01)
  • Baichuan 2: Open Large-scale Language Models [Paper] (2023.09)
  • Yi: Open Foundation Models by 01.AI [Paper] (2024.03)

Application on Reasoning

  • Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs [Paper] (2024.06)
  • Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning [Paper] (2024.07)
  • Iterative Reasoning Preference Optimization [Paper] (2024.04)
  • Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning [Paper] (2024.07)
  • MAPO: Advancing Multilingual Reasoning through Multilingual-Alignment-as-Preference Optimization [Paper] (2024.08)

Application on Instruction-Following

  • Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models [Paper] (2024.06)
  • Enhancing and Assessing Instruction-Following with Fine-Grained Instruction Variants [Paper] (2024.06)
  • Conifer: Improving Complex Constrained Instruction-Following Ability of Large Language Models [Paper] (2024.04)
  • FIPO: Free-form Instruction-oriented Prompt Optimization with Preference Dataset and Modular Fine-tuning Schema [Paper] (2024.02)
  • RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models [Paper] (2024.02)

Application on Anti-Hallucination

  • FLAME: Factuality-Aware Alignment for Large Language Models [Paper] (2024.05)
  • Halu-J: Critique-Based Hallucination Judge [Paper] (2024.07)
  • Fine-grained Hallucination Detection and Mitigation in Long-form Question Answering [Paper] (2024.07)
  • Fine-tuning Language Models for Factuality [Paper] (2023.11)

Application on Code-Generation

  • Code-Optimise: Self-Generated Preference Data for Correctness and Efficiency [Paper] (2024.06)
  • Stable Code Technical Report [Paper] (2024.04)
  • Performance-Aligned LLMs for Generating Fast Code [Paper] (2024.04)
  • Instruct-Code-Llama: Improving Capabilities of Language Model in Competition Level Code Generation by Online Judge Feedback [Paper] (2024.07)

Application on Multi-modal Understanding and Generation

Multi-modal Understanding

  • Silkie: Preference Distillation for Large Visual Language Models [Paper] (2023.12)
  • Aligning Modalities in Vision Large Language Models via Preference Fine-tuning [Paper] (2024.02)
  • RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-Grained Correctional Human Feedback [Paper] (2024.06)
  • RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness [Paper] (2024.05)
  • Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback [Paper] (2024.04)
  • Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward [Paper] (2024.04)
  • LLaVA-NeXT: A Strong Zero-shot Video Understanding Model [Blog Post] (2024.04)
  • Qwen2-Audio Technical Report [Paper] (2024.07)
  • CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios [Paper] (2024.03)

Multi-modal Generation

  • Diffusion Model Alignment Using Direct Preference Optimization [Paper] (2023.11)
  • Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model [Paper] (2023.11)
  • A Dense Reward View on Aligning Text-to-Image Diffusion with Preference [Paper] (2024.02)
  • Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization [Paper] (2024.04)
  • Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization [Paper] (2024.09)
  • SpeechAlign: Aligning Speech Generation to Human Preferences [Paper] (2024.04)
  • Speechworthy Instruction-tuned Language Models [Paper] (2024.09)

More Applications

  • Aligning protein generative models with experimental fitness via Direct Preference Optimization [Paper] (2024.05)
  • A Human Feedback Strategy for Photoresponsive Molecules in Drug Delivery [Paper] (2024.07)
  • ConPro: Learning Severity Representation for Medical Images using Contrastive Learning and Preference Optimization [Paper] (2024.04)
  • Exploring Text-to-Motion Generation with Human Preference [Paper] (2024.04)
  • On Softmax Direct Preference Optimization for Recommendation [Paper] (2024.06)
  • Finetuning Large Language Model for Personalized Ranking [Paper] (2024.05)

📑 Citation

If you find this work useful, please consider citing us.

```bibtex
@misc{xiao2024comprehensivesurveydirectpreference,
      title={A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications},
      author={Wenyi Xiao and Zechuan Wang and Leilei Gan and Shuai Zhao and Wanggui He and Luu Anh Tuan and Long Chen and Hao Jiang and Zhou Zhao and Fei Wu},
      year={2024},
      eprint={2410.15595},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2410.15595},
}
```
