
v1.0.0


@qgallouedec released this 31 Mar 14:15
· 142 commits to main since this release
f3e9ac1

Read our blog post for an overview of TRL v1.

Features

Asynchronous GRPO

Asynchronous GRPO decouples generation from the gradient update loop by offloading rollouts to an external vLLM server. Generation runs in parallel while training continues, eliminating idle GPU time and improving hardware utilization.

from trl.experimental.async_grpo import AsyncGRPOTrainer
from trl.rewards import accuracy_reward
from datasets import load_dataset

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")
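
# Generation runs on a separate vLLM server while the trainer keeps optimizing;
# see the TRL docs for how to launch one (e.g. via the `trl vllm-serve` CLI).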

trainer = AsyncGRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=accuracy_reward,
    train_dataset=dataset,
)
trainer.train()

by @qgallouedec in #5293

Variational Sequence-Level Soft Policy Optimization (VESPO)


VESPO addresses training instability in off-policy RL caused by policy staleness, asynchronous updates, and train-inference mismatches. Rather than relying on heuristic token-level clipping (GRPO) or sequence-length normalization (GSPO), VESPO derives a principled reshaping kernel from a variational framework. In practice, this yields a smooth, asymmetric Gamma weighting function that gracefully suppresses extreme sequence-level importance weights without introducing length bias. It can be enabled via the loss_type parameter of GRPOConfig:

from trl import GRPOConfig, GRPOTrainer

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=GRPOConfig(loss_type="vespo"),
    ...
)
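
To build intuition only (this is not the actual VESPO kernel, which is derived in #5199), the sketch below contrasts hard ratio clipping with an arbitrary Gamma-shaped reweighting of sequence-level importance ratios; the shape parameters are illustrative assumptions.

import numpy as np

def clipped_weight(rho, eps=0.2):
    # GRPO/PPO-style hard clipping of the importance ratio.
    return np.clip(rho, 1.0 - eps, 1.0 + eps)

def gamma_shaped_weight(rho, alpha=2.0, beta=1.0):
    # Smooth, asymmetric Gamma-shaped kernel: it peaks around rho = 1 and decays
    # for large rho, so extreme sequence-level ratios are suppressed rather than
    # hard-capped. alpha and beta are arbitrary here.
    w = rho ** (alpha - 1.0) * np.exp(-rho / beta)
    return w / np.exp(-1.0 / beta)  # normalized so the weight at rho = 1 is 1

rho = np.array([0.25, 0.5, 1.0, 2.0, 4.0, 8.0])
print(clipped_weight(rho))       # saturates at 1.2 for large ratios
print(gamma_shaped_weight(rho))  # peaks at rho = 1, decays toward 0 for large ratios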

by @casinca in #5199

Divergence Proximal Policy Optimization (DPPO)


DPPO is a new experimental trainer that replaces the standard PPO clipping mechanism with divergence constraints, providing more principled trust-region updates.
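
As a rough, illustrative contrast only (the actual DPPO objective is defined in #5117), the snippet below compares a standard clipped PPO surrogate with a surrogate whose trust region is enforced by a KL penalty rather than clipping; the function names and the beta coefficient are assumptions.

import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    # Clipped surrogate: the trust region comes from flattening the importance
    # ratio outside [1 - eps, 1 + eps].
    ratio = (logp_new - logp_old).exp()
    return -torch.min(ratio * advantages, ratio.clamp(1 - eps, 1 + eps) * advantages).mean()

def divergence_penalized_loss(logp_new, logp_old, advantages, beta=0.05):
    # Divergence-based alternative: no clipping; a KL penalty (k3 estimator,
    # ratio - 1 - log ratio) discourages large deviations from the old policy.
    ratio = (logp_new - logp_old).exp()
    kl = ratio - 1.0 - (logp_new - logp_old)
    return -(ratio * advantages - beta * kl).mean()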

by @LeonEricsson in #5117

Self-Distillation Policy Optimization (SDPO)

SDPO is a new experimental trainer that augments on-policy RL with self-distillation from the model's own high-reward trajectories. Instead of using an external teacher, SDPO treats the current model conditioned on feedback as a self-teacher, distilling its feedback-informed predictions back into the policy.

from datasets import load_dataset
from trl.experimental import SDPOTrainer, SDPOConfig
from trl.rewards import accuracy_reward

# Any dataset with verifiable answers works here; DeepMath-103K is reused from
# the Async GRPO example above.
dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

config = SDPOConfig(
    output_dir="./results",
    num_generations=8,
    success_reward_threshold=1.0,
    use_successful_as_teacher=True,
)

trainer = SDPOTrainer(
    model="Qwen/Qwen2.5-Math-1.5B-Instruct",
    reward_funcs=[accuracy_reward],
    args=config,
    train_dataset=dataset,
)
trainer.train()

by @MengAiDev in #4935

Reward functions can now log extra columns and scalar metrics

Reward functions can now receive optional log_extra and log_metric callbacks, which they can use to log per-sample columns and scalar metrics alongside the reward. This makes it easier to track intermediate signals without writing custom callbacks.

def my_reward_fn(completions, answer, log_extra=None, log_metric=None, **kwargs):
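    # extract_answer is a user-supplied helper (not shown) that parses the final
    # answer out of a completion.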
    extracted = [extract_answer(c) for c in completions]
    rewards = [1.0 if e == a else 0.0 for e, a in zip(extracted, answer)]

    if log_extra:
        log_extra("golden_answer", list(answer))
        log_extra("extracted_answer", extracted)

    if log_metric:
        log_metric("accuracy", sum(rewards) / len(rewards))

    return rewards
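
The function is passed through reward_funcs like any other reward function; the extra columns and scalar metrics are then logged alongside the reward during training. Continuing the snippet above (the model and dataset are placeholders):

from datasets import load_dataset
from trl import GRPOTrainer

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=[my_reward_fn],
    train_dataset=dataset,
)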

by @manueldeprada in #5233

Tool calling support in VLLMClient.chat()

VLLMClient.chat() now supports tool calling, enabling agentic workflows directly through the vLLM client interface.
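
A rough usage sketch follows; the exact argument names accepted by VLLMClient.chat() for tool calling are assumptions here, so check #4889 and the API docs for the real interface.

from trl.extras.vllm_client import VLLMClient

# Assumes a vLLM server is already running and reachable at the client's
# default address (e.g. one launched with `trl vllm-serve`).
client = VLLMClient()

# OpenAI-style tool schema; the `tools` keyword below is an assumption.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Return the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

messages = [[{"role": "user", "content": "What's the weather in Paris?"}]]
outputs = client.chat(messages, tools=tools)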

by @kansalaman in #4889

35% faster packing

BFD (best-fit decreasing) packing is now 35% faster. The "bfd-requeue" packing strategy has also been renamed to "bfd_split"; see MIGRATION.md for details.
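
Assuming the strategy is still selected through SFTConfig (the packing_strategy parameter name below is an assumption; MIGRATION.md is authoritative), usage looks roughly like:

from trl import SFTConfig

config = SFTConfig(
    output_dir="./results",
    packing=True,
    packing_strategy="bfd_split",  # previously "bfd-requeue"
)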


by @mariosasko in #5189

[GKD] Buffer implementation and vLLM inference for distillation trainer

The GKD/GOLD trainer now supports buffered rollout generation, decoupling generation from gradient updates for more efficient distillation. vLLM inference support has also been added to the base self-distillation trainer.

by @cmpatino in #5137 and #5388

v0 → v1 migration guide

A MIGRATION.md guide has been added covering all breaking changes when upgrading from TRL v0 to v1. If you're already on v0.29, the changes are minimal.

by @qgallouedec in #5255

Other

Fixes

  • Fix DPOTrainer collators to truncate sequences before padding by @albertvillanova in #5305
  • Prevent corruption of DPO VLM training when truncation_mode="keep_end" is used by @albertvillanova in #5286
  • Fix mm_token_type_ids silently dropped in DPO VLM training by @albertvillanova in #5279
  • Fix UNEXPECTED lm_head.weight warning when loading a CausalLM as a reward model by @albertvillanova in #5295
  • Fix accuracy_reward crash when called from non-main thread by @qgallouedec in #5281
  • Fix GRPOTrainer attribute access for vLLM model config by @falcondai in #5302
  • [GRPO] Fix re-tokenization bug in tool-calling loop by @qgallouedec in #5242
  • [CPO/ORPO] Fix handling of different length chosen/rejected prompts by @davmels in #4639
  • Fix RewardFunc type alias to reflect actual calling convention by @s-zx in #5246
  • fix(ppo): add gradient_checkpointing_enable/disable to PolicyAndValueWrapper by @s-zx in #5245
  • Fix prepare_multimodal_messages to support tool_calls and tool role by @alvarobartt in #5212
  • Fix support for model_init_kwargs when passed as CLI JSON string by @albertvillanova in #5230
  • Fix support for model_init_kwargs in MiniLLM when passed as CLI JSON string by @albertvillanova in #5274
  • Fix support for model_init_kwargs in GKD/GOLD when passed as CLI JSON string by @albertvillanova in #5266
  • Sync entire prompt/completion token tensors before indexing by @shawnghu in #5218
  • Clean up model update group on worker exit by @AmineDiro in #5325
  • Fix prefix EOS slicing for tool suffix (with Qwen3/3.5 chat templates) by @casinca in #5330
  • Fix: apply reward_weights to logged reward/reward_std in GRPOTrainer by @lailanelkoussy in #5353
  • Fix IDs shape mismatch in SFT for VLMs with text-only samples by @albertvillanova in #5354

Documentation and Examples

What's Changed

New Contributors

Full Changelog: v0.29.0...v1.0.0