Checklist
Background
While AReaL supports agentic RL training with multi-turn interactions, it currently lacks a native training primitive for Multi-Agent Systems (MAS), i.e., multi-agent reinforcement learning (MARL).
As identified in prior literature (e.g., Dr. MAS: Stable Reinforcement Learning for Multi-Agent LLM Systems), two common MARL paradigms emerge depending on whether multiple agents share the same backbone LLM, namely homogeneous and heterogeneous settings.
This proposal aims to upgrade AReaL into a multi-agent–ready framework that ensures both training stability and architectural flexibility for MARL. To achieve this, we propose a two-phase development plan:
Phase I: Supporting homogeneous MARL, where all agents in the MAS share a common LLM backbone.
This phase requires relatively minimal code modifications. However, the current AReaL framework does not support agent-wise group reward normalization required by algorithms such as GRPO. Moreover, AReaL currently lacks a multi-agent system example at a meaningful scale.
Phase II: Supporting heterogeneous MARL, where agents in the MAS are powered by different LLM backends.
This phase is expected to require more substantial code modifications. The current AReaL framework does not support collaborative rollouts involving multiple rollout engines orchestrated under a predefined multi-agent workflow. Moreover, it lacks the flexibility to serve and train multiple LLMs within a single RL training process.
Potential Solution
We are implementing this in a two-phase approach:
Phase I: For MARL with GRPO, agent-wise advantage computation (i.e., reward normalization in AReaL) has been shown to significantly improve training stability, as demonstrated in prior work (e.g., Dr. MAS: Stable Reinforcement Learning for Multi-Agent LLM Systems). To support this, we introduce the following:
- (1) We add a `norm_group` parameter to the `InteractionWithTokenLogpReward` class. During workflow execution, each turn is tagged with a normalization group based on the agent that generated it. When the turn dictionary of an episode is returned to the `GroupedRolloutWorkflow`, turns are regrouped according to `norm_group`, enabling AReaL's existing normalization implementation to be applied agent-wise.
- (2) We introduce a new multi-agent system example: a math-oriented agent pipeline consisting of three agents (generator → verifier → refiner) organized in a chain-of-agents topology. All agents share the same LLM backbone. Experiments on this setup demonstrate that our implementation improves both training stability and performance.
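The agent-wise regrouping and normalization above can be sketched as follows. This is a minimal illustration, not AReaL's actual implementation: it assumes a simplified turn schema (plain dicts carrying a `norm_group` tag and a scalar `reward`), whereas the real `InteractionWithTokenLogpReward` objects are richer.

```python
from collections import defaultdict

def normalize_rewards_by_agent(turns):
    """Group turns by their `norm_group` tag and normalize rewards
    within each group (GRPO-style mean/std normalization, applied
    agent-wise rather than over the whole batch)."""
    groups = defaultdict(list)
    for turn in turns:
        groups[turn["norm_group"]].append(turn)

    for group_turns in groups.values():
        rewards = [t["reward"] for t in group_turns]
        mean = sum(rewards) / len(rewards)
        var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
        std = var ** 0.5
        for t in group_turns:
            # Normalize within the agent's own group; the epsilon
            # guards against zero variance when all rewards match.
            t["advantage"] = (t["reward"] - mean) / (std + 1e-6)
    return turns
```

Because each agent's rewards are normalized only against that agent's own group, a verifier with a systematically different reward scale no longer distorts the generator's advantages.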
Phase II: Heterogeneous Agent Support
- Role-Based Model Routing: Support for routing specific sub-turns to different physical Worker Groups. This enables MAS configurations where a "Code Agent" sub-turn is processed by a specialized smaller model, while the "Main Controller" remains on a larger model.
- Resource Mapping: Implementation of a `wg_to_agents_mapping` to maintain the relationship between logical agent roles and their physical model instances.
- Data Routing (Split & Combine): Implementation of `split_batch_by_wg_ids` to divide global trajectory batches into sub-batches for specific physical models, and `combine_batches` to merge them back for global reward processing.
- Isolated Weight Management: Independent checkpointing paths (e.g., `actor/{wg_id}`) ensure that heterogeneous model parameters are saved and loaded without conflict.
- Role-Specific Config Validation: Utilizing `_validate_multi_agent_config` to individually verify architectural parameters, such as micro-batch size and sequence parallelism, for every agent role.
Additional Information
- Contributors: This feature is a collaborative effort by @lxing532 and @luzai.
- Reference: Dr. MAS: Stable Reinforcement Learning for Multi-Agent LLM Systems. https://arxiv.org/abs/2602.08847
- Current Status: Phase I has been implemented, with early promising results in multi-agent scenarios with a shared backbone. The first PR will include the `norm_group` logic and a chain-of-agents (COA) training example.