
[BUG] vLLM generation permanently paused after disk weight update, causing training to hang on Ascend NPU #1160

@chefwang-cloid

Description


Checklist

  • The error occurs when using our provided Docker image.
  • I can consistently reproduce the bug across multiple trials or random seeds.
  • If the error causes experiment abortion, I've verified that this error is the root
    cause, not a secondary error caused by peer workers.

Detailed Information

When using weight_update_mode=disk with co-located vLLM + FSDP, training hangs
permanently at step 3's rollout phase.

_update_weights_from_disk in fsdp_engine.py triggers pause_generation() on the
vLLM server via the /areal_update_weights endpoint, but continue_generation() is
never called afterward. After the first weight update completes, vLLM remains permanently
paused, so all subsequent rollout requests wait forever.

Both the xccl path (_update_weights_from_distributed, line 1135) and the Megatron disk path (_update_weights_from_awex, line 1403) correctly call continue_generation() after the weight update; the disk path in FSDPEngine is the only one missing this call.
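A minimal sketch of the suspected fix pattern, assuming the pause_generation()/continue_generation() method names described above; the engine class here is an illustrative stand-in, not the real vLLM server or FSDPEngine code:

```python
class MockInferenceEngine:
    """Stand-in for the remote inference server's pause/resume interface."""

    def __init__(self):
        self.paused = False

    def pause_generation(self):
        self.paused = True

    def continue_generation(self):
        self.paused = False

    def load_weights_from_disk(self, path):
        # Placeholder for the actual weight reload.
        pass


def update_weights_from_disk(engine, path):
    """Pause generation, reload weights, then always resume.

    The try/finally guarantees continue_generation() runs even if the
    reload raises, so the server is never left permanently paused.
    """
    engine.pause_generation()
    try:
        engine.load_weights_from_disk(path)
    finally:
        engine.continue_generation()


engine = MockInferenceEngine()
update_weights_from_disk(engine, "/tmp/ckpt")
print(engine.paused)  # → False: generation resumed after the update
```

This mirrors the pause/resume pairing the xccl and Megatron paths already perform; the try/finally additionally protects against a failed reload leaving the server paused.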

Expected behavior

vLLM resumes generation after each weight update and training continues normally beyond step 2.

Full logs

(Worker pid=14023) /usr/local/python3.11.14/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py:4876: UserWarning: barrier(): using the device under current context. You can specify device_id in init_process_group to mute this warning.
(Worker pid=14023)   warnings.warn(  # warn only once
(AReaL) 20260409-03:03:27.519 [RemoteInfEngine Rank 0] INFO: Loading weights from disk done in 4.78s. Respond time: 0.37s.
(AReaL) 20260409-03:03:27.676 [RemoteInfEngine Rank 1] INFO: Loading weights from disk done in 4.94s. Respond time: 0.42s.
(AReaL) 20260409-03:03:29.243 SyncRPCServer INFO: Cleared 174 batch shards. Stats: {'cleared_count': 174, 'num_tensors': 366, 'total_bytes': 1666500}
(AReaL) 20260409-03:03:29.244 SyncRPCServer INFO: Cleared 384 batch shards. Stats: {'cleared_count': 384, 'num_tensors': 0, 'total_bytes': 0}
(AReaL) 20260409-03:03:29.244 SyncRPCServer INFO: Cleared 384 batch shards. Stats: {'cleared_count': 384, 'num_tensors': 0, 'total_bytes': 0}
(AReaL) 20260409-03:03:29.244 SyncRPCServer INFO: Cleared 210 batch shards. Stats: {'cleared_count': 210, 'num_tensors': 336, 'total_bytes': 1548304}
(AReaL) 20260409-03:03:29.509 StatsLogger INFO: Epoch 1/10 Step 1/116 Train step 1/1160 done.
(AReaL) 20260409-03:03:29.510 StatsLogger INFO: Stats (1/1):
(AReaL) 20260409-03:03:29.521 StatsLogger INFO: ...
(AReaL) 20260409-03:03:30.390 [FSDPEngine Rank 1] INFO: Microbatch #tokens (rank 1): [10194, 10184, 10234, 10025, 7135], padded to: [10240, 10240, 10240, 10240, 7168], padding lengths: [46, 56, 6, 215, 33]
(AReaL) 20260409-03:03:30.393 [FSDPEngine Rank 0] INFO: Microbatch #tokens (rank 0): [10125, 10124, 10175, 10085, 7335], padded to: [10240, 10240, 10240, 10240, 7424], padding lengths: [115, 116, 65, 155, 89]
(AReaL) 20260409-03:03:31.485 IOStruct INFO: Memory-Usage recompute logp: memory allocated (GB): 2.23, memory reserved (GB): 21.55, device memory used/total (GB): 43.59/60.96
(AReaL) 20260409-03:03:32.644 IOStruct INFO: Memory-Usage compute advantages: memory allocated (GB): 2.23, memory reserved (GB): 21.55, device memory used/total (GB): 43.59/60.96
(APIServer pid=13860) INFO 04-09 03:03:32 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 99.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(AReaL) 20260409-03:03:33.697 [FSDPEngine Rank 1] INFO: Microbatch #tokens (rank 1): [10194, 10184, 10234, 10025, 7135], padded to: [10240, 10240, 10240, 10240, 7168], padding lengths: [46, 56, 6, 215, 33]
(AReaL) 20260409-03:03:33.716 [FSDPEngine Rank 0] INFO: Microbatch #tokens (rank 0): [10125, 10124, 10175, 10085, 7335], padded to: [10240, 10240, 10240, 10240, 7424], padding lengths: [115, 116, 65, 155, 89]
(APIServer pid=13859) INFO 04-09 03:03:34 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 33.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(AReaL) 20260409-03:03:38.837 IOStruct INFO: Memory-Usage ppo update: memory allocated (GB): 2.23, memory reserved (GB): 22.95, device memory used/total (GB): 44.98/60.96
(Worker pid=14023) Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
(APIServer pid=13860) INFO 04-09 03:03:42 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(Worker pid=14026) Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
(Worker pid=14023) Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 3.31it/s]
(Worker pid=14023) Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 3.31it/s]
(Worker pid=14023) INFO 04-09 03:03:43 [default_loader.py:291] Loading weights took 0.40 seconds
(AReaL) 20260409-03:03:43.367 [RemoteInfEngine Rank 1] INFO: Loading weights from disk done in 4.50s. Respond time: 0.12s.
(Worker pid=14026) Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.54it/s]
(Worker pid=14026) Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.53it/s]
(Worker pid=14026) INFO 04-09 03:03:43 [default_loader.py:291] Loading weights took 0.51 seconds
(AReaL) 20260409-03:03:43.521 [RemoteInfEngine Rank 0] INFO: Loading weights from disk done in 4.65s. Respond time: 0.16s.
(AReaL) 20260409-03:03:43.990 SyncRPCServer INFO: Cleared 198 batch shards. Stats: {'cleared_count': 198, 'num_tensors': 174, 'total_bytes': 728200}
(AReaL) 20260409-03:03:43.991 SyncRPCServer INFO: Cleared 186 batch shards. Stats: {'cleared_count': 186, 'num_tensors': 156, 'total_bytes': 617516}
(AReaL) 20260409-03:03:43.991 SyncRPCServer INFO: Cleared 384 batch shards. Stats: {'cleared_count': 384, 'num_tensors': 0, 'total_bytes': 0}
(AReaL) 20260409-03:03:43.991 SyncRPCServer INFO: Cleared 384 batch shards. Stats: {'cleared_count': 384, 'num_tensors': 0, 'total_bytes': 0}
(AReaL) 20260409-03:03:44.108 StatsLogger INFO: Epoch 1/10 Step 2/116 Train step 2/1160 done.
(AReaL) 20260409-03:03:44.109 StatsLogger INFO: Stats (1/1):
(AReaL) 20260409-03:03:44.115 StatsLogger INFO: ...
(APIServer pid=13859) INFO 04-09 03:03:44 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

To Reproduce

Commit ID

4f291dd

Environment

Hardware: Ascend NPU 910B4-1 x8
Docker image: swr.cn-north-9.myhuaweicloud.com/areal/areal_npu:v1.0.1-a2

Script

ASCEND_RT_VISIBLE_DEVICES="0,1" python3 /examples/math/gsm8k_rl.py \
  --config examples/math/gsm8k_grpo_npu.yaml \
  scheduler.type=local \
  experiment_name=test_0409_1 \
  trial_name=gsm8k \
  actor.path=/model/qwen-0.6b \
  actor.weight_update_mode=disk \
  allocation_mode=vllm:d2p1t1+fsdp:d2p1t1 \
  cluster.n_nodes=1 \
  cluster.n_gpus_per_node=2 \
  gconfig.max_new_tokens=2048 \
  vllm.gpu_memory_utilization=0.3 \
  vllm.max_model_len=4096


Labels

bug (Something isn't working)
