v1.0.0
Read our blog post for an overview of TRL v1.
## Features

### Asynchronous GRPO
Asynchronous GRPO decouples generation from the gradient update loop by offloading rollouts to an external vLLM server. Generation runs in parallel while training continues, eliminating idle GPU time and improving hardware utilization.
```python
from trl.experimental.async_grpo import AsyncGRPOTrainer
from trl.rewards import accuracy_reward
from datasets import load_dataset

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = AsyncGRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=accuracy_reward,
    train_dataset=dataset,
)
trainer.train()
```

by @qgallouedec in #5293
### Variational Sequence-Level Soft Policy Optimization (VESPO)

VESPO addresses training instability in off-policy RL caused by policy staleness, asynchronous updates, and train-inference mismatches. Rather than relying on heuristic token-level clipping (GRPO) or sequence-length normalization (GSPO), VESPO derives a principled reshaping kernel from a variational framework. In practice, this yields a smooth, asymmetric Gamma weighting function that gracefully suppresses extreme sequence-level importance weights without introducing length bias. It can be enabled via the `loss_type` parameter of `GRPOConfig`:
```python
from trl import GRPOConfig, GRPOTrainer

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=GRPOConfig(loss_type="vespo"),
    ...
)
```

by @casinca in #5199

### Divergence Proximal Policy Optimization (DPPO)
DPPO is a new experimental trainer that replaces the standard PPO clipping mechanism with divergence constraints, providing more principled trust-region updates.
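A minimal usage sketch, assuming `DPPOTrainer` follows the constructor conventions of the other experimental trainers; the import path, config fields, and dataset below are illustrative assumptions, so check the PR for the actual interface:

```python
# Hypothetical sketch: the import path and constructor arguments are assumptions
# mirroring the other experimental trainers, not the confirmed DPPO API.
from datasets import load_dataset
from trl.experimental.dppo import DPPOTrainer, DPPOConfig
from trl.rewards import accuracy_reward

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")  # illustrative dataset

trainer = DPPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=accuracy_reward,  # assumed to take reward functions like GRPO/RLOO
    args=DPPOConfig(output_dir="./results"),
    train_dataset=dataset,
)
trainer.train()
```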
by @LeonEricsson in #5117
### Self-Distillation Policy Optimization (SDPO)
SDPO is a new experimental trainer that augments on-policy RL with self-distillation from the model's own high-reward trajectories. Instead of using an external teacher, SDPO treats the current model conditioned on feedback as a self-teacher, distilling its feedback-informed predictions back into the policy.
```python
from trl.experimental import SDPOTrainer, SDPOConfig
from trl.rewards import accuracy_reward

config = SDPOConfig(
    output_dir="./results",
    num_generations=8,
    success_reward_threshold=1.0,
    use_successful_as_teacher=True,
)
trainer = SDPOTrainer(
    model="Qwen/Qwen2.5-Math-1.5B-Instruct",
    reward_funcs=[accuracy_reward],
    args=config,
    train_dataset=dataset,  # any prompt dataset, e.g. loaded as in the Async GRPO example above
)
trainer.train()
```

by @MengAiDev in #4935
### Reward functions can now log extra columns and scalar metrics
Reward functions can return a dictionary of extra values (scalars or per-sample columns) that will be logged alongside the reward. This makes it easier to track intermediate signals without writing custom callbacks.
```python
def my_reward_fn(completions, answer, log_extra=None, log_metric=None, **kwargs):
    # extract_answer is a user-defined parser for the model completions
    extracted = [extract_answer(c) for c in completions]
    rewards = [1.0 if e == a else 0.0 for e, a in zip(extracted, answer)]
    if log_extra:
        # per-sample columns logged alongside the reward
        log_extra("golden_answer", list(answer))
        log_extra("extracted_answer", extracted)
    if log_metric:
        # scalar metric aggregated over the batch
        log_metric("accuracy", sum(rewards) / len(rewards))
    return rewards
```
by @manueldeprada in #5233
### Tool calling support in `VLLMClient.chat()`

`VLLMClient.chat()` now supports tool calling, enabling agentic workflows directly through the vLLM client interface.
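A rough sketch of the idea, assuming `chat()` accepts OpenAI-style messages plus a `tools` list of function schemas; the call shape and the `get_weather` tool are assumptions, and the PR documents the actual interface:

```python
# Sketch only: assumes an OpenAI-style `tools` argument; see the PR for the real API.
from trl.extras.vllm_client import VLLMClient

client = VLLMClient()  # connects to a running `trl vllm-serve` instance

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Paris?"}]
result = client.chat(messages, tools=tools)  # the model may respond with a tool call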
by @kansalaman in #4889
### 35% faster packing

BFD packing is 35% faster. The `bfd-requeue` packing strategy has also been renamed to `bfd_split`. See MIGRATION.md for details.
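For illustration, opting into the renamed strategy might look like the following, assuming packing strategies are selected through a `packing_strategy` field on `SFTConfig` (the exact field name and accepted values are covered in MIGRATION.md):

```python
from trl import SFTConfig

# Assumed sketch: "bfd-requeue" was renamed to "bfd_split", so configs that
# referenced the old name need updating.
config = SFTConfig(packing=True, packing_strategy="bfd_split")
```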
by @mariosasko in #5189
### [GKD] Buffer implementation and vLLM inference for distillation trainer
The GKD/GOLD trainer now supports buffered rollout generation, decoupling generation from gradient updates for more efficient distillation. vLLM inference support has also been added to the base self-distillation trainer.
by @cmpatino in #5137 and #5388
### v0 → v1 migration guide
A MIGRATION.md guide has been added covering all breaking changes when upgrading from TRL v0 to v1. If you're already on v0.29, the changes are minimal.
by @qgallouedec in #5255
## Other
- Change default `vllm_mode` to `"colocate"` by @qgallouedec in #5255
- Support `truncation_mode` in SFT by @albertvillanova in #5306
- Support `max_length` in DPO VLM training by @albertvillanova in #5284
- Add `pad_to_multiple_of` to GRPOTrainer and RLOOTrainer by @czkkkkkk in #5180
- Support sequence sampling in Liger Kernel by @michaelroyzen in #5190
- Add tool calling support to `VLLMClient.chat()` by @kansalaman in #4889
- Add support for raw token IDs in vLLM client prompts by @qgallouedec in #5225
- Add VLM support when passing raw token IDs to vLLM client by @qgallouedec in #5227
- Enhance `print_prompt_completions_sample` to include reasoning content by @qgallouedec in #5327
- Add support for `pixel_position_ids` vision key by @qgallouedec in #5374
- Add second version of Qwen 3.5 chat template by @apardyl in #5405
- Pass tools as `None` to `apply_chat_template` when it is an empty list by @rabinadk1 in #5380
## Fixes
- Fix DPOTrainer collators to truncate sequences before padding by @albertvillanova in #5305
- Prevent corruption of DPO VLM training if "keep_end" truncation_mode by @albertvillanova in #5286
- Fix mm_token_type_ids silently dropped in DPO VLM training by @albertvillanova in #5279
- Fix UNEXPECTED lm_head.weight warning when loading a CausalLM as a reward model by @albertvillanova in #5295
- Fix `accuracy_reward` crash when called from non-main thread by @qgallouedec in #5281
- Fix GRPOTrainer attribute access for vLLM model config by @falcondai in #5302
- [GRPO] Fix re-tokenization bug in tool-calling loop by @qgallouedec in #5242
- [CPO/ORPO] Fix handling of different length chosen/rejected prompts by @davmels in #4639
- Fix `RewardFunc` type alias to reflect actual calling convention by @s-zx in #5246
- fix(ppo): add gradient_checkpointing_enable/disable to PolicyAndValueWrapper by @s-zx in #5245
- Fix `prepare_multimodal_messages` to support `tool_calls` and `tool` role by @alvarobartt in #5212
- Fix support for model_init_kwargs when passed as CLI JSON string by @albertvillanova in #5230
- Fix support for model_init_kwargs in MiniLLM when passed as CLI JSON string by @albertvillanova in #5274
- Fix support for model_init_kwargs in GKD/GOLD when passed as CLI JSON string by @albertvillanova in #5266
- Sync entire prompt/completion token tensors before indexing by @shawnghu in #5218
- Clean up model update group on worker exit by @AmineDiro in #5325
- Fix prefix EOS slicing for tool suffix (with Qwen3/3.5 chat templates) by @casinca in #5330
- Fix: apply reward_weights to logged reward/reward_std in GRPOTrainer by @lailanelkoussy in #5353
- Fix IDs shape mismatch in SFT for VLMs with text-only by @albertvillanova in #5354
## Documentation and Examples
- Add minimal CARLA example script by @sergiopaniego in #5161
- Nemotron 3 examples added by @sergiopaniego in #5272
- Align docs about tool calling in trainers with dataset format by @albertvillanova in #5311
- Add repository-specific guidance for agents (`AGENTS.md`) by @qgallouedec in #5236
- Align documentation with the intended public API by @qgallouedec in #5162
- Update openenv examples to use `environment_factory` by @sergiopaniego in #5235
- Add "It Takes Two: Your GRPO Is Secretly DPO" paper to GRPOTrainer by @DhruvvArora in #5347
- Centralize AI agent templates in `.ai` by @qgallouedec in #5268
## What's Changed
- ⬆️ Bump dev version by @qgallouedec in #5182
- Handle mm_token_type_ids in SFT/GRPO/RLOO to fix IndexError by @albertvillanova in #5178
- Document parameters with differing default values in core configs by @albertvillanova in #5168
- Make _BaseConfig and _BaseTrainer explicitly private by @albertvillanova in #5169
- Refactor CLI [4/N]: Replace top-level TrlParser with ArgumentParser by @albertvillanova in #5170
- Add minimal CARLA example script by @sergiopaniego in #5161
- Align documentation with the intended public API by @qgallouedec in #5162
- Fix deprecation warning of create_reference_model by @albertvillanova in #5184
- Fix deprecation warning of fork in multi-threaded process by @albertvillanova in #5185
- Refactor CLI [5/N]: Refactor TrainingCommand with delayed imports by @albertvillanova in #5186
- Refactor CLI [6/N]: Refactor env/vllm-serve commands with delayed imports by @albertvillanova in #5187
- Fix CI tests patching BaseTrainer by @albertvillanova in #5192
- Add `pad_to_multiple_of` to GRPOTrainer and RLOOTrainer by @czkkkkkk in #5180
- Re-add liger-kernel to dev deps by @qgallouedec in #5164
- Set CI PYTORCH_ALLOC_CONF env variable to avoid OOM by @albertvillanova in #5197
- Support sequence sampling in Liger Kernel and pass importance_samplin… by @michaelroyzen in #5190
- Mark CI test_training_vlm_and_liger as xfail by @albertvillanova in #5202
- Decouple rollout dispatch from vLLM backend in GRPO _generate_single_turn by @albertvillanova in #5122
- CI: Add Qwen 3.5 tiny model to tests by @qgallouedec in #5204
- Add support for Qwen3.5 for agent training by @qgallouedec in #5205
- Update vLLM version support to include 0.13.0 by @qgallouedec in #5206
- feat: Add tool calling support to VLLMClient.chat() by @kansalaman in #4889
- Refactor CLI [7/N]: Move patching to compat and import transformers conditionally by @albertvillanova in #5208
- Update vLLM version support to include 0.14.0 and 0.14.1 by @qgallouedec in #5214
- Refactor CLI [8/N]: Refactor scripts/utils with delayed imports by @albertvillanova in #5209
- Simplify logic for structured outputs across vLLM versions by @albertvillanova in #5215
- Refactor CLI [9/N]: Replace HfArgumentParser from transformers with local by @albertvillanova in #5210
- Refactor CLI [10/N]: Refactor scripts with delayed imports by @albertvillanova in #5219
- Refactor CLI [11/N]: Refactor scripts/vllm_serve with delayed imports by @albertvillanova in #5220
- Refactor CLI [12/N]: Fix command name in scripts help usage by @albertvillanova in #5221
- Refactor CLI [13/N]: Pass clean training args to scripts by @albertvillanova in #5223
- Fix `prepare_multimodal_messages` to support `tool_calls` and `tool` role by @alvarobartt in #5212
- Fix link to Hugging Face Hub in OpenEnv documentation by @thesteve0 in #5229
- Fix type for model_init_kwargs when passed as CLI JSON string by @albertvillanova in #5230
- Add repository-specific guidance for agents (`AGENTS.md`) by @qgallouedec in #5236
- Add support for raw ids in `prompts` in vLLM client and server by @qgallouedec in #5225
- Deprecate `truncate_prompt_tokens` for vLLM 0.17.0 by @winglian in #5248
- Add VLM support when passing raw token IDs to vLLM client by @qgallouedec in #5227
- Move `rollout_func` from `_generate_single_turn` to `_generate` by @qgallouedec in #5232
- Fix `RewardFunc` type alias to reflect actual calling convention by @s-zx in #5246
- [GRPO] In-place temperature scaling operation by @winglian in #5254
- Update vLLM version support to 0.15.0 by @qgallouedec in #5251
- Sync entire prompt/completion token tensors before indexing by @shawnghu in #5218
- Update vLLM version support to 0.16.0 by @qgallouedec in #5252
- Update vLLM version support to 0.17.0 by @qgallouedec in #5253
- [GRPO/RLOO] Tokenize before vLLM generation call by @qgallouedec in #5238
- Refactor CLI [14/N] : Remove TrainingArguments import from core trainers by @albertvillanova in #5257
- Support JSON string parsing of teacher_model_init_kwargs in MiniLLMConfig by @albertvillanova in #5259
- Fix typo in docstring for teacher_model_init_kwargs by @albertvillanova in #5260
- Remove extra_fields dead code [1/N]: Remove extra_fields handling from VLLMGeneration.generate by @albertvillanova in #5262
- [GRPO/RLOO] Unify tokenization across all generation backends in `_generate_single_turn` by @qgallouedec in #5239
- Remove extra_fields dead code [2/N]: Remove extra_fields from VLLMGeneration.generate return value by @albertvillanova in #5263
- Remove extra_fields dead code [3/N]: Remove extra_fields from GRPOTrainer._generate_single_turn return value by @albertvillanova in #5264
- fix(ppo): add gradient_checkpointing_enable/disable to PolicyAndValueWrapper by @s-zx in #5245
- [GRPO/RLOO] Extract tokenize prompts from `_generate_single_turn` by @qgallouedec in #5240
- [CPO/ORPO] Fix handling of different length chosen/rejected prompts by @davmels in #4639
- Fix type for teacher_model_init_kwargs when passed as CLI JSON string by @albertvillanova in #5258
- Align GOLDConfig docstrings for optional params with None default by @albertvillanova in #5261
- Fix support for model_init_kwargs in GKD/GOLD when passed as CLI JSON string by @albertvillanova in #5266
- Update TRL banner to support light/dark mode by @qgallouedec in #5270
- Fix error message in OnlineDPO by @qgallouedec in #5237
- Fix title consistency from "Transformer Reinforcement Learning" to "Transformers Reinforcement Learning" by @qgallouedec in #5183
- Nemotron 3 examples added by @sergiopaniego in #5272
- Fix mm_token_type_ids silently dropped in DPO VLM training by @albertvillanova in #5279
- Simplify get_train_dataloader in GRPO and RLOO by @albertvillanova in #5276
- Raise ValueError for None train_dataset in experimental trainers by @albertvillanova in #5275
- 35% faster packing + rename `bfd-requeue` to `bfd_split` by @mariosasko in #5189
- Change default `vllm_mode` to `"colocate"` and add v0→v1 migration guide by @qgallouedec in #5255
- Allow nullable logprobs in vLLM serve responses by @LeonEricsson in #5203
- feat(`grpo_trainer.py`): Variational Sequence-Level Soft Policy Optimization (VESPO) by @casinca in #5199
- Simplify structured outputs logic across vLLM versions in scripts/vllm_serve by @albertvillanova in #5273
- Fix support for model_init_kwargs in MiniLLM when passed as CLI JSON string by @albertvillanova in #5274
- Fix `accuracy_reward` crash when called from non-main thread by @qgallouedec in #5281
- Remove TrainingArguments import from experimental trainers by @albertvillanova in #5290
- Remove custom get_train/eval_dataloader from OnlineDPO by @albertvillanova in #5291
- [GKD] Buffer Implementation for Distillation Trainer by @cmpatino in #5137
- Support max_length in DPO VLM training by @albertvillanova in #5284
- Prevent corruption of DPO VLM training if "keep_end" truncation_mode by @albertvillanova in #5286
- Fix UNEXPECTED lm_head.weight warning when loading a CausalLM as a reward model by @albertvillanova in #5295
- Apply docstyle by @qgallouedec in #5296
- Add guidance to avoid `hasattr` and `getattr` with defaults in `AGENTS.md` by @qgallouedec in #5294
- Fix DPOTrainer collators to truncate sequences before padding by @albertvillanova in #5305
- Update `RewardFunc` type annotation to allow `None` values in reward list by @qgallouedec in #5297
- Suggest the `Json()` type for tool calling dataset format by @lhoestq in #5307
- Allow reward functions to log extra columns and scalar metrics by @manueldeprada in #5233
- Fix GRPOTrainer attribute access for vLLM model config by @falcondai in #5302
- Support truncation_mode in SFT by @albertvillanova in #5306
- 🔌 Asynchronous GRPO by @qgallouedec in #5293
- Fix datasets version supporting Json dtype in docs about tool calling dataset format by @albertvillanova in #5310
- Align docs about tool calling in trainers with dataset format by @albertvillanova in #5311
- [GRPO] Fix re-tokenization bug in tool-calling loop by concatenating token IDs by @qgallouedec in #5242
- feat(experimental): Divergence Proximal Policy Optimization by @LeonEricsson in #5117
- Clean up model update group on worker exit by @AmineDiro in #5325
- Fix style in DPPO docstrings by @albertvillanova in #5326
- GRPOTrainer/async: fix prefix EOS slicing for tool suffix (with Qwen3/3.5 type of chat templates) by @casinca in #5330
- refactor(async_rollout_worker): renamed tool variables to mirror `grpo_trainer.py` by @casinca in #5332
- Add truncation to SFT DataCollatorForLanguageModeling by @albertvillanova in #5315
- Add SDPO (Self-Distillation Policy Optimization) trainer by @MengAiDev in #4935
- Update openenv examples to use `environment_factory` by @sergiopaniego in #5235
- Enhance `print_prompt_completions_sample` to include reasoning content by @qgallouedec in #5327
- Add Cursor Bugbot rules from `AGENTS.md` by @qgallouedec in #5280
- Change model dtype from bfloat16 to float32 in AsyncGRPOTrainer by @qgallouedec in #5333
- docs: Add "It Takes Two: Your GRPO Is Secretly DPO" paper to GRPOTrainer by @DhruvvArora in #5347
- fix: apply reward_weights to logged reward/reward_std in GRPOTrainer by @lailanelkoussy in #5353
- Remove post-collation truncation from DPO by @albertvillanova in #5350
- Remove unused flush_right by @albertvillanova in #5358
- Fix IDs shape mismatch in SFT for VLMs with text-only by @albertvillanova in #5354
- Remove post-collation truncation from SFT by @albertvillanova in #5359
- Simplify DPO DataCollatorForPreference by @albertvillanova in #5362
- Simplify SFT tokenization by @albertvillanova in #5363
- Simplify SFT DataCollatorForLanguageModeling by @albertvillanova in #5360
- Use BaseConfig post_init in experimental KTO and MiniLLM configs by @albertvillanova in #5371
- Move truncate_dataset to experimental by @albertvillanova in #5370
- Simplify DPO tokenization by @albertvillanova in #5369
- Kd vllm generation by @cmpatino in #5351
- Adds support for the `pixel_position_ids` vision key by @qgallouedec in #5374
- Minor diff reduction between RLOO and GRPO by @qgallouedec in #5368
- Remove requirements.txt by @albertvillanova in #5377
- Remove dead truncation_mode from experimental BCO, CPO and ORPO by @albertvillanova in #5378
- Centralize AI agent templates in `.ai` by @qgallouedec in #5268
- Pass tools as None to `apply_chat_template` when it is an empty list by @rabinadk1 in #5380
- Require datasets>=4.7.0 for Json dtype to prevent insertion of None values by @albertvillanova in #5376
- Remove deprecated `TRACKIO_SPACE_ID` env var from all scripts by @sergiopaniego in #5365
- Mark test_rloo[fsdp2] as xfail for transformers 5.4.0 by @albertvillanova in #5387
- Enforce PR template for first-time contributors and document AI usage policy by @qgallouedec in #5356
- Enhance PR template check to exclude reopened PRs from first-time contributor validation by @qgallouedec in #5392
- chore: update `pr_template_check.yml` by @qgallouedec in #5393
- Move `disable_config=True` from `generate` to `GenerationConfig` by @qgallouedec in #5384
- Add vLLM inference to the Base Self-Distillation Trainer by @cmpatino in #5388
- Add HF_TOKEN environment variable to workflow files by @qgallouedec in #5397
- Add second version of Qwen 3.5 chat template to chat_template_utils by @apardyl in #5405
- Release: v1.0 by @qgallouedec in #5409
## New Contributors
- @czkkkkkk made their first contribution in #5180
- @michaelroyzen made their first contribution in #5190
- @thesteve0 made their first contribution in #5229
- @s-zx made their first contribution in #5246
- @shawnghu made their first contribution in #5218
- @davmels made their first contribution in #4639
- @manueldeprada made their first contribution in #5233
- @falcondai made their first contribution in #5302
- @AmineDiro made their first contribution in #5325
- @DhruvvArora made their first contribution in #5347
- @lailanelkoussy made their first contribution in #5353
- @rabinadk1 made their first contribution in #5380
- @apardyl made their first contribution in #5405
Full Changelog: v0.29.0...v1.0.0