A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications

[arXiv](https://arxiv.org/abs/2410.15595)

📒 Updates

👀 Introduction

Welcome to the repository for our survey paper, "A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications". This repository provides resources and updates related to our research. For a detailed introduction, please refer to our survey paper.

📒 Table of Contents

📄 Research Questions and Variants

RQ0: Why DPO?


  • Training language models to follow instructions with human feedback [Paper] (2022.03)
  • Direct Preference Optimization: Your Language Model is Secretly a Reward Model [Paper] (2023.05)
  • Learning Dynamics of LLM Finetuning [Paper] (2024.07)
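
The papers above motivate DPO by the observation that the KL-regularized RLHF objective has a closed-form optimal policy, so the Bradley-Terry preference loss can be written directly over the policy and a frozen reference model, with no explicit reward model or PPO loop. A minimal PyTorch sketch of that pairwise loss (following the DPO paper listed above; the arguments are assumed to be per-response summed log-probabilities):

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Pairwise DPO loss over summed per-response log-probabilities."""
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response in the pair.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # Maximize the Bradley-Terry likelihood of the observed preference.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```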

RQ1: Effect of Implicit Reward Modeling


  • On the Limited Generalization Capability of the Implicit Reward Model Induced by Direct Preference Optimization [Paper] (2024.09)
  • Policy Optimization in RLHF: The Impact of Out-of-preference Data [Paper] (2023.12)
  • A General Theoretical Paradigm to Understand Learning from Human Preferences [Paper] (2023.10)
  • Generalizing Reward Modeling for Out-of-Distribution Preference Learning [Paper] (2024.11)
  • Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs [Paper] (2024.06)
  • 3D-Properties: Identifying Challenges in DPO and Charting a Path Forward [Paper] (2024.05)
  • Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF [Paper] (2024.09)
  • Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections [Paper] (2025.07)
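
Several of the RQ1 papers analyze the reward that DPO defines implicitly through the trained policy. From the closed-form optimum of the KL-regularized objective, the DPO paper recovers the reward up to a prompt-dependent term that cancels in pairwise comparisons:

```math
r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)
```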

RQ2: Effect of Different Feedback


  • RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment [Paper] (2023.04)
  • RRHF: Rank Responses to Align Language Models with Human Feedback without tears [Paper] (2023.04)
  • Advancing LLM Reasoning Generalists with Preference Trees [Paper] (2024.04)
  • Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs [Paper] (2024.06)
  • KTO: Model Alignment as Prospect Theoretic Optimization [Paper] (2024.02)
  • Token-level Direct Preference Optimization [Paper] (2024.04)
  • Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [Paper] (2024.01)
  • Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation [Paper] (2024.01)
  • Iterative Reasoning Preference Optimization [Paper] (2024.04)
  • Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability [Paper] (2024.11)
  • Building Math Agents with Multi-Turn Iterative Preference Learning [Paper] (2024.09)

RQ3: Effect of KL Penalty Coefficient and Reference Model


  • β-DPO: Direct Preference Optimization with Dynamic β [Paper] (2024.07)
  • Understanding Reference Policies in Direct Preference Optimization [Paper] (2024.07)
  • From r to Q*: Your Language Model is Secretly a Q-Function [Paper] (2024.04)
  • ORPO: Monolithic Preference Optimization without Reference Model [Paper] (2024.03)
  • SimPO: Simple Preference Optimization with a Reference-Free Reward [Paper] (2024.05)
  • Towards Analyzing and Understanding the Limitations of DPO: A Theoretical Perspective [Paper] (2024.04)
  • Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study [Paper] (2024.04)
  • Learn Your Reference Model for Real Good Alignment [Paper] (2024.05)
  • KL Penalty Control via Perturbation for Direct Preference Optimization [Paper] (2025.02)
  • Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF [Paper] (2024.09)

RQ4: Online DPO


  • Self-Rewarding Language Models [Paper] (2024.01)
  • Some things are more CRINGE than others: Iterative Preference Optimization with the Pairwise Cringe Loss [Paper] (2023.12)
  • Direct Language Model Alignment from Online AI Feedback [Paper] (2024.02)
  • sDPO: Don't Use Your Data All at Once [Paper] (2024.03)
  • OPTune: Efficient Online Preference Tuning [Paper] (2024.06)
  • Online DPO: Online Direct Preference Optimization with Fast-Slow Chasing [Paper] (2024.06)
  • Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [Paper] (2024.01)
  • Understanding the performance gap between online and offline alignment algorithms [Paper] (2024.05)
  • Offline RLHF Methods Need More Accurate Supervision Signals [Paper] (2024.08)
  • Hybrid Preference Optimization for Alignment: Provably Faster Convergence Rates by Combining Offline Preferences with Online Exploration [Paper] (2024.09)
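
The common pattern across the online and iterative methods above is a loop that samples fresh responses from the current policy, labels them with some preference source, and runs DPO on the resulting pairs. A hypothetical skeleton (the `generate`, `judge`, and `dpo_update` callables are placeholders, not any specific paper's components):

```python
def iterative_dpo(policy, prompts, generate, judge, dpo_update, rounds=3, num_samples=4):
    """Skeleton of online/iterative DPO with a pluggable sampler, annotator, and trainer."""
    for _ in range(rounds):
        pairs = []
        for prompt in prompts:
            candidates = generate(policy, prompt, num_samples)  # sample from the current policy
            chosen, rejected = judge(prompt, candidates)        # human, reward model, or LLM judge
            pairs.append((prompt, chosen, rejected))
        policy = dpo_update(policy, pairs)                      # one round of DPO on the fresh pairs
    return policy
```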

RQ5: Reward Hacking


  • AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback [Paper] (2023.05)
  • A Long Way to Go: Investigating Length Correlations in RLHF [Paper] (2023.10)
  • Is Stack Overflow Obsolete? An Empirical Study of the Characteristics of ChatGPT Answers to Stack Overflow Questions [Paper] (2023.10)
  • How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources [Paper] (2023.06)
  • Disentangling Length from Quality in Direct Preference Optimization [Paper] (2024.08)
  • Following Length Constraints in Instructions [Paper] (2024.06)
  • Length Desensitization in Direct Preference Optimization [Paper] (2024.09)
  • Sail into the Headwind: Alignment via Robust Rewards and Dynamic Labels against Reward Hacking [Paper] (2024.09)
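
A recurring failure mode in the papers above is length bias: the policy learns to win preferences simply by answering longer. One simple mitigation they discuss is regularizing the DPO logit by the length difference of the pair; a minimal illustration (the exact regularizer and where it is applied differ across papers):

```python
import torch.nn.functional as F

def length_regularized_dpo_loss(dpo_logit, len_chosen, len_rejected, alpha=0.01):
    """dpo_logit is the usual beta-scaled difference of log-ratios; the alpha term
    discourages preferring a response merely because it is longer."""
    return -F.logsigmoid(dpo_logit - alpha * (len_chosen - len_rejected)).mean()
```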

RQ6: Alignment Tax


  • Rethinking the Role of Proxy Rewards in Language Model Alignment [Paper] (2024.02)
  • Online merging optimizers for boosting rewards and mitigating tax in alignment [Paper] (2024.05)
  • SPO: Multi-Dimensional Preference Sequential Alignment With Implicit Reward Modeling [Paper] (2024.05)
  • PAFT: A Parallel Training Paradigm for Effective LLM Fine-Tuning [Paper] (2024.06)
  • Mitigating the Alignment Tax of RLHF [Paper] (2023.09)
  • Disperse-Then-Merge: Pushing the Limits of Instruction Tuning via Alignment Tax Reduction [Paper] (2024.05)
  • Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment [Paper] (2024.02)
  • Beyond one-preference-fits-all alignment: Multi-objective direct preference optimization [Paper] (2024.08)
  • Restoring Calibration for Aligned Large Language Models: A Calibration-Aware Fine-Tuning Approach [Paper] (2024.05)
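
Several of the alignment-tax papers above (the model-averaging and online-merging lines of work in particular) interpolate or merge the pre-alignment and post-alignment weights to trade alignment gains against forgetting. A minimal sketch of plain checkpoint interpolation, assuming two compatible `state_dict`s:

```python
import torch

def interpolate_checkpoints(sft_state, aligned_state, lam=0.5):
    """Linear weight interpolation between SFT and aligned checkpoints (lam=0 keeps SFT, lam=1 keeps aligned)."""
    return {name: torch.lerp(sft_state[name].float(), aligned_state[name].float(), lam)
            for name in aligned_state}
```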

📊 Datasets

Human-Labeled Datasets

(Figure: example of a human-labeled preference dataset)

  • Learning to summarize from human feedback [Paper] (2020.09)
  • WebGPT: Browser-assisted question-answering with human feedback [Paper] (2021.12)
  • HelpSteer: Multi-attribute Helpfulness Dataset for SteerLM [Paper] (2023.11)
  • HelpSteer2: Open-source dataset for training top-performing reward models [Paper] (2024.06)
  • HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages [Paper] (2025.05)
  • OpenAssistant Conversations -- Democratizing Large Language Model Alignment [Paper] (2023.04)
  • Understanding Dataset Difficulty with V-Usable Information (Introduces SHP) [Paper] (2022.07)
  • tasksource: A Large Collection of NLP tasks (Introduces DPO-pairs) [Paper] (2024.05)
  • RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-Grained Correctional Human Feedback [Paper] (2024.06)
  • Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (Introduces HH-RLHF) [Paper] (2022.04)
  • MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [Paper] (2025.02)
  • Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena [Paper] (2023.06)
  • AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback [Paper] (2023.05)
  • Towards Harmless Multimodal Assistants with Blind Preference Optimization (Introduces MMSafe-PO) [Paper] (2025.03)

AI-Labeled Datasets

(Figure: example of an AI-labeled preference dataset)

  • RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness [Paper] (2024.05)
  • Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs [Paper] (2024.06)
  • UltraFeedback: Boosting Language Models with High-quality Feedback [Paper] (2023.10)
  • Distilabel Capybara DPO Dataset [Dataset]
  • ChatML DPO Pairs (from Orca) [Dataset]
  • Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing [Paper] (2024.06)
  • TLDR Preference DPO [Dataset]
  • Zerolink ZSQL-Postgres DPO [Dataset]
  • RewardBench: Evaluating Reward Models for Language Modeling [Paper] (2024.03)
  • RewardBench 2: Advancing Reward Model Evaluation [Paper] (2025.06)
  • VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models [Paper] (2024.11)
  • Silkie: Preference Distillation for Large Visual Language Models (Introduces VLFeedback) [Paper] (2023.12)
  • SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model [Paper] (2024.06)
  • Stack Exchange Preferences Dataset [Dataset]
  • Starling-7B: Improving LLM Helpfulness & Harmlessness with RLAIF (Introduces Nectar) [Blog Post] (2023.11)
  • SlimOrca: An Open Dataset of GPT-4 Augmented FLAN Reasoning Traces [Dataset]
  • CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance [Paper] (2025.02)
  • Mask-DPO: Generalizable Fine-grained Factuality Alignment of LLMs (Introduces ANAH) [Paper] (2025.03)
  • Unified Reward Model for Multimodal Understanding and Generation [Paper] (2025.03)
  • MM-IFEngine: Towards Multimodal Instruction Following [Paper] (2025.04)

🌐 Applications

Application on Large Language Models

  • Qwen2 Technical Report [Paper] (2024.07)
  • The Llama 3 Herd of Models [Paper] (2024.07)
  • DeepSeek LLM: Scaling Open-Source Language Models with Longtermism [Paper] (2024.01)
  • Mixtral of Experts [Paper] (2024.01)
  • Baichuan 2: Open Large-scale Language Models [Paper] (2023.09)
  • Yi: Open Foundation Models by 01.AI [Paper] (2024.03)

Application on Reasoning

  • Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs [Paper] (2024.06)
  • Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning [Paper] (2024.07)
  • Iterative Reasoning Preference Optimization [Paper] (2024.04)
  • Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning [Paper] (2024.07)
  • MAPO: Advancing Multilingual Reasoning through Multilingual-Alignment-as-Preference Optimization [Paper] (2024.08)

Application on Instruction-Following

  • Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models [Paper] (2024.06)
  • Enhancing and Assessing Instruction-Following with Fine-Grained Instruction Variants [Paper] (2024.06)
  • Conifer: Improving Complex Constrained Instruction-Following Ability of Large Language Models [Paper] (2024.04)
  • FIPO: Free-form Instruction-oriented Prompt Optimization with Preference Dataset and Modular Fine-tuning Schema [Paper] (2024.02)
  • RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models [Paper] (2024.02)

Application on Anti-Hallucination

  • FLAME: Factuality-Aware Alignment for Large Language Models [Paper] (2024.05)
  • Halu-J: Critique-Based Hallucination Judge [Paper] (2024.07)
  • Fine-grained Hallucination Detection and Mitigation in Long-form Question Answering [Paper] (2024.07)
  • Fine-tuning Language Models for Factuality [Paper] (2023.11)

Application on Code-Generation

  • Code-Optimise: Self-Generated Preference Data for Correctness and Efficiency [Paper] (2024.06)
  • Stable Code Technical Report [Paper] (2024.04)
  • Performance-Aligned LLMs for Generating Fast Code [Paper] (2024.04)
  • Instruct-Code-Llama: Improving Capabilities of Language Model in Competition Level Code Generation by Online Judge Feedback [Paper] (2024.07)

Application on Multi-modal Understanding and Generation

Multi-modal Understanding

  • Silkie: Preference Distillation for Large Visual Language Models [Paper] (2023.12)
  • Aligning Modalities in Vision Large Language Models via Preference Fine-tuning [Paper] (2024.02)
  • RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-Grained Correctional Human Feedback [Paper] (2024.06)
  • RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness [Paper] (2024.05)
  • Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback [Paper] (2024.04)
  • Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward [Paper] (2024.04)
  • LLaVA-NeXT: A Strong Zero-shot Video Understanding Model [Blog Post] (2024.04)
  • Qwen2-Audio Technical Report [Paper] (2024.07)
  • CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios [Paper] (2024.03)

Multi-modal Generation

  • Diffusion Model Alignment Using Direct Preference Optimization [Paper] (2023.11)
  • Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model [Paper] (2023.11)
  • A Dense Reward View on Aligning Text-to-Image Diffusion with Preference [Paper] (2024.02)
  • Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization [Paper] (2024.04)
  • Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization [Paper] (2024.09)
  • SpeechAlign: Aligning Speech Generation to Human Preferences [Paper] (2024.04)
  • Speechworthy Instruction-tuned Language Models [Paper] (2024.09)

More Applications

  • Aligning protein generative models with experimental fitness via Direct Preference Optimization [Paper] (2024.05)
  • A Human Feedback Strategy for Photoresponsive Molecules in Drug Delivery [Paper] (2024.07)
  • ConPro: Learning Severity Representation for Medical Images using Contrastive Learning and Preference Optimization [Paper] (2024.04)
  • Exploring Text-to-Motion Generation with Human Preference [Paper] (2024.04)
  • On Softmax Direct Preference Optimization for Recommendation [Paper] (2024.06)
  • Finetuning Large Language Model for Personalized Ranking [Paper] (2024.05)

📑 Citation

If you find this work useful, please consider citing us.

```bibtex
@misc{xiao2024comprehensivesurveydirectpreference,
      title={A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications},
      author={Wenyi Xiao and Zechuan Wang and Leilei Gan and Shuai Zhao and Wanggui He and Luu Anh Tuan and Long Chen and Hao Jiang and Zhou Zhao and Fei Wu},
      year={2024},
      eprint={2410.15595},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2410.15595},
}
```
