Checklist
Background
While AReaL supports agentic RL training with multi-turn interactions, it currently lacks a native training primitive for Multi-Agent Systems (MAS), i.e., multi-agent reinforcement learning (MARL).
As identified in prior literature (e.g., Dr. MAS: Stable Reinforcement Learning for Multi-Agent LLM Systems), two common MARL paradigms emerge depending on whether multiple agents share the same backbone LLM, namely homogeneous and heterogeneous settings.
This proposal aims to upgrade AReaL into a multi-agent–ready framework that ensures both training stability and architectural flexibility for MARL. To achieve this, we propose a two-phase development plan:
Phase I: Supporting homogeneous MARL, where all agents in the MAS share a common LLM backbone.
This phase requires relatively minimal code modifications. However, the current AReaL framework does not support agent-wise group reward normalization required by algorithms such as GRPO. Moreover, AReaL currently lacks a multi-agent system example at a meaningful scale.
Phase II: Supporting heterogeneous MARL, where agents in the MAS are powered by different LLM backends.
This phase is expected to require more substantial code modifications. The current AReaL framework does not support collaborative rollouts involving multiple rollout engines orchestrated under a predefined multi-agent workflow. Moreover, it lacks the flexibility to serve and train multiple LLMs within a single RL training process.
Potential Solution
We are implementing this in a two-phase approach:
Phase I: For MARL with GRPO, agent-wise advantage computation (i.e., reward normalization in AReaL) has been shown to significantly improve training stability, as demonstrated in prior work (e.g., Dr. MAS: Stable Reinforcement Learning for Multi-Agent LLM Systems). To support this, we introduce the following:
- (1) We add a `norm_group` parameter to the `InteractionWithTokenLogpReward` class. During workflow execution, each turn is tagged with a normalization group based on the agent that generated it. When the turn dictionary of an episode is returned to the `GroupedRolloutWorkflow`, turns are regrouped according to `norm_group`, enabling AReaL's existing normalization implementation to be applied agent-wise.
- (2) We introduce a new multi-agent system example: a math-oriented agent pipeline consisting of three agents (generator → verifier → refiner) organized in a chain-of-agents topology. All agents share the same LLM backbone. Experiments on this setup demonstrate that our implementation improves both training stability and performance.
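The agent-wise regrouping and normalization above can be sketched as follows. This is a minimal illustration, not AReaL's actual implementation: it assumes a simplified turn schema (plain dicts carrying a `norm_group` tag and a scalar `reward`), whereas the real `InteractionWithTokenLogpReward` objects are richer.

```python
from collections import defaultdict

def normalize_rewards_by_agent(turns):
    """Group turns by their `norm_group` tag and normalize rewards
    within each group (GRPO-style mean/std normalization, applied
    agent-wise rather than over the whole batch)."""
    groups = defaultdict(list)
    for turn in turns:
        groups[turn["norm_group"]].append(turn)

    for group_turns in groups.values():
        rewards = [t["reward"] for t in group_turns]
        mean = sum(rewards) / len(rewards)
        var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
        std = var ** 0.5
        for t in group_turns:
            # Normalize within the agent's own group; the epsilon
            # guards against zero variance when all rewards match.
            t["advantage"] = (t["reward"] - mean) / (std + 1e-6)
    return turns
```

Because each agent's rewards are normalized only against that agent's own group, a verifier with a systematically different reward scale no longer distorts the generator's advantages.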
Phase II: Heterogeneous Agent Support
- Role-Based Model Routing: Support for routing specific sub-turns to different physical Worker Groups. This enables MAS configurations where a "Code Agent" sub-turn is processed by a specialized smaller model, while the "Main Controller" remains on a larger model.
- Resource Mapping: Implementation of a `wg_to_agents_mapping` to maintain the relationship between logical agent roles and their physical model instances.
- Data Routing (Split & Combine): Implementation of `split_batch_by_wg_ids` to divide global trajectory batches into sub-batches for specific physical models, and `combine_batches` to merge them back for global reward processing.
- Isolated Weight Management: Independent checkpointing paths (e.g., `actor/{wg_id}`) ensure that heterogeneous model parameters are saved and loaded without conflict.
- Role-Specific Config Validation: Utilizing `_validate_multi_agent_config` to individually verify architectural parameters, such as micro-batch size and sequence parallelism, for every agent role.
Additional Information
- Contributors: This feature is a collaborative effort by @lxing532 and @luzai.
- Reference: Dr. MAS: Stable Reinforcement Learning for Multi-Agent LLM Systems. https://arxiv.org/abs/2602.08847
- Current Status: Phase I has been implemented, with early promising results in multi-agent scenarios with a shared backbone. The first PR will include the `norm_group` logic and a chain-of-agents (COA) training example.