[Feature] Feature Proposal: Multi-Agent Training Framework (Dr. MAS Integration) #1114

@luzai

Description

Checklist

  • This feature will maintain backward compatibility with the current APIs in areal/api/
  • Feature Scope: Addresses the Multi-LLM training TODO tracked in Issue #907.

Background

While AReaL supports agentic RL training with multi-turn interactions, it currently lacks a native training primitive for Multi-Agent Systems (MAS), i.e., multi-agent reinforcement learning (MARL).

As identified in prior literature (e.g., Dr. MAS: Stable Reinforcement Learning for Multi-Agent LLM Systems), two common MARL paradigms emerge depending on whether multiple agents share the same backbone LLM, namely homogeneous and heterogeneous settings.

This proposal aims to upgrade AReaL into a multi-agent–ready framework that ensures both training stability and architectural flexibility for MARL. To achieve this, we propose a two-phase development plan:

Phase I: Supporting homogeneous MARL, where all agents in the MAS share a common LLM backbone.
This phase requires relatively minimal code modifications. However, the current AReaL framework does not support the agent-wise group reward normalization required by algorithms such as GRPO, and it lacks a multi-agent system example at a meaningful scale.
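To make the agent-wise normalization requirement concrete, here is a minimal, hypothetical sketch: rewards are normalized within each agent's own group rather than across all turns of a prompt, so each agent's advantages are zero-mean relative to its peers. The function name and plain-list representation are illustrative, not AReaL's actual API.

```python
from collections import defaultdict

def agentwise_normalize(rewards, agent_ids, eps=1e-6):
    """Normalize rewards within each agent group (GRPO-style).

    rewards: list[float]; agent_ids: list[str] of the same length,
    tagging which agent produced each reward. Both names are
    illustrative placeholders for this sketch.
    """
    groups = defaultdict(list)
    for i, aid in enumerate(agent_ids):
        groups[aid].append(i)

    advantages = [0.0] * len(rewards)
    for idxs in groups.values():
        vals = [rewards[i] for i in idxs]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        std = var ** 0.5
        for i in idxs:
            # Each agent's advantage is centered and scaled
            # only against rewards from the same agent.
            advantages[i] = (rewards[i] - mean) / (std + eps)
    return advantages
```

Normalizing per agent avoids mixing reward scales across roles (e.g., a verifier's sparse 0/1 rewards versus a generator's shaped rewards), which is the stability issue Dr. MAS identifies.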

Phase II: Supporting heterogeneous MARL, where agents in the MAS are powered by different LLM backbones.
This phase is expected to require more substantial code modifications. The current AReaL framework does not support collaborative rollouts involving multiple rollout engines orchestrated under a predefined multi-agent workflow. Moreover, it lacks the flexibility to serve and train multiple LLMs within a single RL training process.

Potential Solution

We are implementing this in a two-phase approach:

Phase I: For MARL with GRPO, agent-wise advantage computation (i.e., reward normalization in AReaL) has been shown to significantly improve training stability, as demonstrated in prior work (e.g., Dr. MAS: Stable Reinforcement Learning for Multi-Agent LLM Systems). To support this, we introduce the following:

  • (1) We introduce a norm_group parameter in the InteractionWithTokenLogpReward class. During workflow execution, each turn is tagged with a normalization group based on the agent that generates it. When the turn dictionary of an episode is returned to the GroupedRolloutWorkflow, turns are regrouped according to norm_group, enabling the existing normalization implementation in AReaL to be applied in an agent-wise manner.
  • (2) We introduce a new multi-agent system example: a math-oriented agent pipeline consisting of three agents (generator → verifier → refiner) organized in a chain-of-agents topology. All agents share the same LLM backbone. Experiments on this setup demonstrate that our implementation improves both training stability and performance.
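The regrouping step in (1) can be sketched as follows. The `norm_group` field is the parameter proposed above; the `Turn` dataclass and the helper function are illustrative stand-ins for AReaL's actual turn dictionary, not its real API.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Turn:
    """Illustrative stand-in for a single turn in an episode."""
    agent: str        # which agent generated this turn
    reward: float
    norm_group: str   # proposed normalization-group tag

def regroup_by_norm_group(episode_turns):
    """Map norm_group -> turns, so the existing normalization in
    AReaL can be applied once per agent group rather than once
    over the whole episode."""
    grouped = defaultdict(list)
    for turn in episode_turns:
        grouped[turn.norm_group].append(turn)
    return dict(grouped)
```

In the chain-of-agents example, the generator, verifier, and refiner would each be tagged with their own `norm_group`, so their turns are normalized independently even though they share one backbone.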

Phase II: Heterogeneous Agent Support

  • Role-Based Model Routing: Support for routing specific sub-turns to different physical Worker Groups. This enables MAS configurations where a "Code Agent" sub-turn is processed by a specialized smaller model, while the "Main Controller" remains on a larger model.
    • Resource Mapping: Implementation of a wg_to_agents_mapping to track and maintain the relationship between logical agent roles and their unique physical model instances.
    • Data Routing (Split & Combine): Mandatory implementation of split_batch_by_wg_ids to divide global trajectory batches into sub-batches for specific physical models, and combine_batches to merge them back for global reward processing.
    • Isolated Weight Management: Independent checkpointing paths (e.g., actor/{wg_id}) ensure that heterogeneous model parameters are saved and loaded without conflict.
    • Role-Specific Config Validation: Utilizing _validate_multi_agent_config to individually verify architectural parameters—such as micro-batch size and sequence parallelism—for every agent role.
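The split-and-combine routing above can be sketched as a round trip. `split_batch_by_wg_ids` and `combine_batches` are the proposed names; the batch representation (a list of dicts carrying a `wg_id` key) is an assumption for illustration only.

```python
from collections import defaultdict

def split_batch_by_wg_ids(batch):
    """Divide a global trajectory batch into per-worker-group
    sub-batches, keyed by wg_id."""
    sub_batches = defaultdict(list)
    for item in batch:
        sub_batches[item["wg_id"]].append(item)
    return dict(sub_batches)

def combine_batches(sub_batches):
    """Merge per-worker-group sub-batches back into one global
    batch for global reward processing."""
    merged = []
    for wg_id in sorted(sub_batches):
        merged.extend(sub_batches[wg_id])
    return merged
```

Each physical model trains only on its own sub-batch, while reward computation (which may depend on the full multi-agent trajectory) runs on the recombined global batch.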

Additional Information

  • Contributors: This feature is a collaborative effort by @lxing532 and @luzai.
  • Reference: Dr. MAS: Stable Reinforcement Learning for Multi-Agent LLM Systems. https://arxiv.org/abs/2602.08847 
  • Current Status: Phase I has been implemented, with early promising results in multi-agent scenarios using a shared backbone. The first PR will include the norm_group logic and a chain-of-agents (CoA) training example.
