
v1.0.0


@qgallouedec released this 31 Mar 14:15
· 142 commits to main since this release
f3e9ac1

Read our blog post for an overview of TRL v1.

Features

Asynchronous GRPO

Asynchronous GRPO decouples generation from the gradient update loop by offloading rollouts to an external vLLM server. Generation runs in parallel while training continues, eliminating idle GPU time and improving hardware utilization.

from trl.experimental.async_grpo import AsyncGRPOTrainer
from trl.rewards import accuracy_reward
from datasets import load_dataset

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")
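
# Generation runs on a separate vLLM server while the trainer keeps optimizing;
# see the TRL docs for how to launch one (e.g. via the `trl vllm-serve` CLI).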

trainer = AsyncGRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=accuracy_reward,
    train_dataset=dataset,
)
trainer.train()

by @qgallouedec in #5293

Variational Sequence-Level Soft Policy Optimization (VESPO)


VESPO addresses training instability in off-policy RL caused by policy staleness, asynchronous updates, and train-inference mismatches. Rather than relying on heuristic token-level clipping (GRPO) or sequence-length normalization (GSPO), VESPO derives a principled reshaping kernel from a variational framework. In practice, this yields a smooth, asymmetric Gamma weighting function that gracefully suppresses extreme sequence-level importance weights without introducing length bias. It can be enabled via the loss_type parameter of GRPOConfig:

from trl import GRPOConfig, GRPOTrainer

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=GRPOConfig(loss_type="vespo"),
    ...
)
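
To build intuition only (this is not the actual VESPO kernel, which is derived in #5199), the sketch below contrasts hard ratio clipping with an arbitrary Gamma-shaped reweighting of sequence-level importance ratios; the shape parameters are illustrative assumptions.

import numpy as np

def clipped_weight(rho, eps=0.2):
    # GRPO/PPO-style hard clipping of the importance ratio.
    return np.clip(rho, 1.0 - eps, 1.0 + eps)

def gamma_shaped_weight(rho, alpha=2.0, beta=1.0):
    # Smooth, asymmetric Gamma-shaped kernel: it peaks around rho = 1 and decays
    # for large rho, so extreme sequence-level ratios are suppressed rather than
    # hard-capped. alpha and beta are arbitrary here.
    w = rho ** (alpha - 1.0) * np.exp(-rho / beta)
    return w / np.exp(-1.0 / beta)  # normalized so the weight at rho = 1 is 1

rho = np.array([0.25, 0.5, 1.0, 2.0, 4.0, 8.0])
print(clipped_weight(rho))       # saturates at 1.2 for large ratios
print(gamma_shaped_weight(rho))  # peaks at rho = 1, decays toward 0 for large ratios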

by @casinca in #5199

Divergence Proximal Policy Optimization (DPPO)


DPPO is a new experimental trainer that replaces the standard PPO clipping mechanism with divergence constraints, providing more principled trust-region updates.
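
As a rough, illustrative contrast only (the actual DPPO objective is defined in #5117), the snippet below compares a standard clipped PPO surrogate with a surrogate whose trust region is enforced by a KL penalty rather than clipping; the function names and the beta coefficient are assumptions.

import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    # Clipped surrogate: the trust region comes from flattening the importance
    # ratio outside [1 - eps, 1 + eps].
    ratio = (logp_new - logp_old).exp()
    return -torch.min(ratio * advantages, ratio.clamp(1 - eps, 1 + eps) * advantages).mean()

def divergence_penalized_loss(logp_new, logp_old, advantages, beta=0.05):
    # Divergence-based alternative: no clipping; a KL penalty (k3 estimator,
    # ratio - 1 - log ratio) discourages large deviations from the old policy.
    ratio = (logp_new - logp_old).exp()
    kl = ratio - 1.0 - (logp_new - logp_old)
    return -(ratio * advantages - beta * kl).mean()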

by @LeonEricsson in #5117

Self-Distillation Policy Optimization (SDPO)

SDPO is a new experimental trainer that augments on-policy RL with self-distillation from the model's own high-reward trajectories. Instead of using an external teacher, SDPO treats the current model conditioned on feedback as a self-teacher, distilling its feedback-informed predictions back into the policy.

from datasets import load_dataset
from trl.experimental import SDPOTrainer, SDPOConfig
from trl.rewards import accuracy_reward

# Any dataset with verifiable answers works here; DeepMath-103K is reused from
# the Async GRPO example above.
dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

config = SDPOConfig(
    output_dir="./results",
    num_generations=8,
    success_reward_threshold=1.0,
    use_successful_as_teacher=True,
)

trainer = SDPOTrainer(
    model="Qwen/Qwen2.5-Math-1.5B-Instruct",
    reward_funcs=[accuracy_reward],
    args=config,
    train_dataset=dataset,
)
trainer.train()

by @MengAiDev in #4935

Reward functions can now log extra columns and scalar metrics

Reward functions can now receive optional log_extra and log_metric callbacks, which they can use to log per-sample columns and scalar metrics alongside the reward. This makes it easier to track intermediate signals without writing custom callbacks.

def my_reward_fn(completions, answer, log_extra=None, log_metric=None, **kwargs):
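    # extract_answer is a user-supplied helper (not shown) that parses the final
    # answer out of a completion.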
    extracted = [extract_answer(c) for c in completions]
    rewards = [1.0 if e == a else 0.0 for e, a in zip(extracted, answer)]

    if log_extra:
        log_extra("golden_answer", list(answer))
        log_extra("extracted_answer", extracted)

    if log_metric:
        log_metric("accuracy", sum(rewards) / len(rewards))

    return rewards
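
The function is passed through reward_funcs like any other reward function; the extra columns and scalar metrics are then logged alongside the reward during training. Continuing the snippet above (the model and dataset are placeholders):

from datasets import load_dataset
from trl import GRPOTrainer

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=[my_reward_fn],
    train_dataset=dataset,
)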

by @manueldeprada in #5233

Tool calling support in VLLMClient.chat()

VLLMClient.chat() now supports tool calling, enabling agentic workflows directly through the vLLM client interface.
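
A rough usage sketch follows; the exact argument names accepted by VLLMClient.chat() for tool calling are assumptions here, so check #4889 and the API docs for the real interface.

from trl.extras.vllm_client import VLLMClient

# Assumes a vLLM server is already running and reachable at the client's
# default address (e.g. one launched with `trl vllm-serve`).
client = VLLMClient()

# OpenAI-style tool schema; the `tools` keyword below is an assumption.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Return the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

messages = [[{"role": "user", "content": "What's the weather in Paris?"}]]
outputs = client.chat(messages, tools=tools)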

by @kansalaman in #4889

35% faster packing

BFD (best-fit decreasing) packing is now 35% faster. The "bfd-requeue" packing strategy has also been renamed to "bfd_split"; see MIGRATION.md for details.
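
Assuming the strategy is still selected through SFTConfig (the packing_strategy parameter name below is an assumption; MIGRATION.md is authoritative), usage looks roughly like:

from trl import SFTConfig

config = SFTConfig(
    output_dir="./results",
    packing=True,
    packing_strategy="bfd_split",  # previously "bfd-requeue"
)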


by @mariosasko in #5189

[GKD] Buffer implementation and vLLM inference for distillation trainer

The GKD/GOLD trainer now supports buffered rollout generation, decoupling generation from gradient updates for more efficient distillation. vLLM inference support has also been added to the base self-distillation trainer.

by @cmpatino in #5137 and #5388

v0 → v1 migration guide

A MIGRATION.md guide has been added covering all breaking changes when upgrading from TRL v0 to v1. If you're already on v0.29, the changes are minimal.

by @qgallouedec in #5255

Other

Fixes

  • Fix DPOTrainer collators to truncate sequences before padding by @albertvillanova in #5305
  • Prevent corruption of DPO VLM training when truncation_mode="keep_end" is used by @albertvillanova in #5286
  • Fix mm_token_type_ids silently dropped in DPO VLM training by @albertvillanova in #5279
  • Fix UNEXPECTED lm_head.weight warning when loading a CausalLM as a reward model by @albertvillanova in #5295
  • Fix accuracy_reward crash when called from non-main thread by @qgallouedec in #5281
  • Fix GRPOTrainer attribute access for vLLM model config by @falcondai in #5302
  • [GRPO] Fix re-tokenization bug in tool-calling loop by @qgallouedec in #5242
  • [CPO/ORPO] Fix handling of different length chosen/rejected prompts by @davmels in #4639
  • Fix RewardFunc type alias to reflect actual calling convention by @s-zx in #5246
  • fix(ppo): add gradient_checkpointing_enable/disable to PolicyAndValueWrapper by @s-zx in #5245
  • Fix prepare_multimodal_messages to support tool_calls and tool role by @alvarobartt in #5212
  • Fix support for model_init_kwargs when passed as CLI JSON string by @albertvillanova in #5230
  • Fix support for model_init_kwargs in MiniLLM when passed as CLI JSON string by @albertvillanova in #5274
  • Fix support for model_init_kwargs in GKD/GOLD when passed as CLI JSON string by @albertvillanova in #5266
  • Sync entire prompt/completion token tensors before indexing by @shawnghu in #5218
  • Clean up model update group on worker exit by @AmineDiro in #5325
  • Fix prefix EOS slicing for tool suffix (with Qwen3/3.5 chat templates) by @casinca in #5330
  • Fix: apply reward_weights to logged reward/reward_std in GRPOTrainer by @lailanelkoussy in #5353
  • Fix IDs shape mismatch in SFT for VLMs with text-only samples by @albertvillanova in #5354

Documentation and Examples

What's Changed

New Contributors

Full Changelog: v0.29.0...v1.0.0