Conversation
Code Review
This pull request introduces support for pipeline parallelism (PP) in the SGLang inference backend. Key changes include updating the Megatron engine to handle per-PP-rank NCCL groups for weight synchronization, implementing monkey-patches for SGLang to support pp_rank in weight update requests, and adding documentation and tests. Review feedback focuses on two robustness issues in weight update initialization: releasing the distributed lock in a finally block, and avoiding redundant port allocation on non-head ranks in the PP==1 case.
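The "monkey-patch SGLang to support pp_rank" change mentioned above follows a common pattern: replace a method on a third-party class at import time so requests carry an extra field. The sketch below is illustrative only; `WeightUpdateRequest` and its fields are hypothetical stand-ins, not SGLang's actual classes.

```python
class WeightUpdateRequest:
    # Hypothetical stand-in for the third-party request class being
    # patched; the real SGLang class and fields differ.
    def __init__(self, name):
        self.name = name


def patched_init(self, name, pp_rank=0):
    # Patched constructor that additionally records which pipeline
    # stage the weight update targets. Defaulting pp_rank=0 keeps
    # existing call sites working unchanged.
    self.name = name
    self.pp_rank = pp_rank


# Monkey-patch: swap in the new constructor before any requests are built.
WeightUpdateRequest.__init__ = patched_init

req = WeightUpdateRequest("layers.0.weight", pp_rank=2)
assert req.pp_rank == 2
```

Because the patch only adds an optional keyword argument, callers that predate the PP feature continue to construct requests exactly as before.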
```python
self.engine_lock.acquire()
# ...
gen_world_size = meta.gen_allocation.parallel.world_size
init_method = f"tcp://{format_host_for_url(meta.nccl_master_address)}:{meta.nccl_master_port}"
self.logger.info(
    f"Initializing weight update group: type={meta.type} "
    f"init_method={init_method} "
    f"group={self.weight_update_group_name}"
)
self.weight_update_group = init_custom_process_group(
    backend=current_platform.communication_backend,
    world_size=gen_world_size + 1,
    init_method=init_method,
    rank=0,
    group_name=self.weight_update_group_name,
    timeout=DIST_GROUP_DEFAULT_TIMEOUT,
)
# Assign address/port *after* acquiring the lock so that two
# PP-head ranks on the same node cannot race on port selection
meta.nccl_master_address = self.weight_update_master_addr = gethostip()
meta.nccl_master_port = self.weight_update_master_port = find_free_ports(1)[0]
meta.nccl_group_name = self.weight_update_group_name
# ...
fut.result()
fut = self.rollout_engine.init_weights_update_group(meta)
# ...
per_pp_world_size = (
    meta.gen_allocation.parallel.world_size // gen_pp_size
)
init_method = (
    f"tcp://{format_host_for_url(meta.nccl_master_address)}"
    f":{meta.nccl_master_port}"
)
self.logger.info(
    f"Initializing per-PP-rank weight update group: "
    f"type={meta.type} init_method={init_method} "
    f"group={self.weight_update_group_name} "
    f"per_pp_world_size={per_pp_world_size}"
)
self.weight_update_group = init_custom_process_group(
    backend=current_platform.communication_backend,
    world_size=per_pp_world_size + 1,
    init_method=init_method,
    rank=0,
    group_name=self.weight_update_group_name,
    timeout=DIST_GROUP_DEFAULT_TIMEOUT,
)
# ...
fut.result()
# ...
self.engine_lock.release()
```
The DistributedLock is acquired but not released in a finally block. If an exception occurs during group initialization (e.g., network timeout or port allocation failure), the lock will be leaked, potentially hanging the entire experiment. Please wrap the logic in a try...finally block or use a context manager if supported.
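A minimal runnable sketch of the suggested fix. `threading.Lock` stands in for the review's `DistributedLock`, and `_init_group` is a placeholder for the real `init_custom_process_group` call, which may raise on a network timeout or port allocation failure:

```python
import threading


class WeightUpdateInitializer:
    def __init__(self):
        # Stand-in for the DistributedLock discussed in the review.
        self.engine_lock = threading.Lock()
        self.weight_update_group = None

    def _init_group(self):
        # Placeholder for init_custom_process_group(...); here it always
        # fails to demonstrate the leak the reviewer is worried about.
        raise TimeoutError("simulated NCCL rendezvous timeout")

    def init_weight_update_group(self):
        self.engine_lock.acquire()
        try:
            self.weight_update_group = self._init_group()
        finally:
            # Released even when _init_group raises, so other PP-head
            # ranks on the same node are never blocked forever.
            self.engine_lock.release()


w = WeightUpdateInitializer()
try:
    w.init_weight_update_group()
except TimeoutError:
    pass
# The lock is free again despite the failure:
assert w.engine_lock.acquire(blocking=False)
w.engine_lock.release()
```

If the real `DistributedLock` supports the context-manager protocol, `with self.engine_lock:` expresses the same guarantee more concisely.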
```python
self.engine_lock.acquire()
# ...
fut = self.rollout_engine.init_weights_update_group(meta)
# ...
gen_world_size = meta.gen_allocation.parallel.world_size
init_method = (
    f"tcp://{format_host_for_url(meta.nccl_master_address)}"
    f":{meta.nccl_master_port}"
)
self.logger.info(
    f"Initializing weight update group: type={meta.type} "
    f"init_method={init_method} "
    f"group={self.weight_update_group_name}"
)
self.weight_update_group = init_custom_process_group(
    backend=current_platform.communication_backend,
    world_size=gen_world_size + 1,
    init_method=init_method,
    rank=0,
    group_name=self.weight_update_group_name,
    timeout=DIST_GROUP_DEFAULT_TIMEOUT,
)
# ...
fut.result()
# ...
self.engine_lock.release()
```
```python
meta.nccl_master_address = self.weight_update_master_addr = gethostip()
meta.nccl_master_port = self.weight_update_master_port = find_free_ports(1)[0]
meta.nccl_group_name = self.weight_update_group_name
```
In the PP==1 case, nccl_master_address and nccl_master_port are assigned by all ranks, causing every rank to call find_free_ports(1). This is wasteful as only the head rank actually uses these values to initialize the group. These assignments should be moved inside the if self.is_pipeline_parallel_head(): block, similar to the PP>1 implementation.
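A minimal sketch of the suggested restructuring, assuming hypothetical stand-ins (`find_free_port`, `Meta`, the `is_pp_head` flag) for `find_free_ports`, the real metadata object, and `self.is_pipeline_parallel_head()`:

```python
import socket


def find_free_port():
    # Stand-in for find_free_ports(1)[0]: ask the OS for an ephemeral port.
    with socket.socket() as s:
        s.bind(("", 0))
        return s.getsockname()[1]


class Meta:
    # Hypothetical stand-in for the weight-update metadata object.
    nccl_master_address = None
    nccl_master_port = None


def assign_rendezvous_info(meta, is_pp_head):
    # Only the PP-head rank allocates the rendezvous address/port;
    # non-head ranks leave meta untouched instead of wastefully
    # grabbing a port they never use.
    if is_pp_head:
        meta.nccl_master_address = socket.gethostname()
        meta.nccl_master_port = find_free_port()
    return meta


head = assign_rendezvous_info(Meta(), is_pp_head=True)
other = assign_rendezvous_info(Meta(), is_pp_head=False)
assert head.nccl_master_port is not None
assert other.nccl_master_port is None
```

This mirrors the PP>1 path, where the assignments already live inside the `is_pipeline_parallel_head()` guard.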
Description
Related Issue
Fixes #(issue)
Type of Change
Checklist
- Ran `pre-commit run --all-files`
- Built the docs (`./docs/build_all.sh`)
- Branch is up to date with `main`
- Used the `/review-pr` / `/create-pr` commands

Breaking Change Details (if applicable):
Additional Context
Need help? Check the Contributing Guide or ask in
GitHub Discussions!