
[BUG] vLLM generation permanently paused after disk weight update, causing training to hang on Ascend NPU #1160

@chefwang-cloid

Description


Checklist

  • The error occurs when using our provided Docker image.
  • I can consistently reproduce the bug across multiple trials or random seeds.
  • If the error causes experiment abortion, I've verified that this error is the root
    cause, not a secondary error caused by peer workers.

Detailed Information

When using weight_update_mode=disk with co-located vLLM + FSDP, training hangs
permanently at step 3's rollout phase.

_update_weights_from_disk in fsdp_engine.py triggers pause_generation() on the
vLLM server via the /areal_update_weights endpoint, but continue_generation() is
never called afterward. After the first weight update completes, vLLM remains permanently
paused, so all subsequent rollout requests wait forever.

Both the xccl path (_update_weights_from_distributed, line 1135) and the Megatron disk path (_update_weights_from_awex, line 1403) correctly call continue_generation() after the weight update; the disk path in FSDPEngine is the only one missing this call.
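A minimal sketch of the suspected fix pattern, assuming the pause_generation()/continue_generation() method names described above; the engine class here is an illustrative stand-in, not the real vLLM server or FSDPEngine code:

```python
class MockInferenceEngine:
    """Stand-in for the remote inference server's pause/resume interface."""

    def __init__(self):
        self.paused = False

    def pause_generation(self):
        self.paused = True

    def continue_generation(self):
        self.paused = False

    def load_weights_from_disk(self, path):
        # Placeholder for the actual weight reload.
        pass


def update_weights_from_disk(engine, path):
    """Pause generation, reload weights, then always resume.

    The try/finally guarantees continue_generation() runs even if the
    reload raises, so the server is never left permanently paused.
    """
    engine.pause_generation()
    try:
        engine.load_weights_from_disk(path)
    finally:
        engine.continue_generation()


engine = MockInferenceEngine()
update_weights_from_disk(engine, "/tmp/ckpt")
print(engine.paused)  # → False: generation resumed after the update
```

This mirrors the pause/resume pairing the xccl and Megatron paths already perform; the try/finally additionally protects against a failed reload leaving the server paused.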

Expected behavior

vLLM resumes generation after each weight update and training continues normally beyond step 2.

Full logs

(Worker pid=14023) /usr/local/python3.11.14/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py:4876: UserWarning: barrier(): using the device under current context. You can specify device_id in init_process_group to mute this warning.
(Worker pid=14023)   warnings.warn(  # warn only once
(AReaL) 20260409-03:03:27.519 [RemoteInfEngine Rank 0] INFO: Loading weights from disk done in 4.78s. Respond time: 0.37s.
(AReaL) 20260409-03:03:27.676 [RemoteInfEngine Rank 1] INFO: Loading weights from disk done in 4.94s. Respond time: 0.42s.
(AReaL) 20260409-03:03:29.243 SyncRPCServer INFO: Cleared 174 batch shards. Stats: {'cleared_count': 174, 'num_tensors': 366, 'total_bytes': 1666500}
(AReaL) 20260409-03:03:29.244 SyncRPCServer INFO: Cleared 384 batch shards. Stats: {'cleared_count': 384, 'num_tensors': 0, 'total_bytes': 0}
(AReaL) 20260409-03:03:29.244 SyncRPCServer INFO: Cleared 384 batch shards. Stats: {'cleared_count': 384, 'num_tensors': 0, 'total_bytes': 0}
(AReaL) 20260409-03:03:29.244 SyncRPCServer INFO: Cleared 210 batch shards. Stats: {'cleared_count': 210, 'num_tensors': 336, 'total_bytes': 1548304}
(AReaL) 20260409-03:03:29.509 StatsLogger INFO: Epoch 1/10 Step 1/116 Train step 1/1160 done.
(AReaL) 20260409-03:03:29.510 StatsLogger INFO: Stats (1/1):
(AReaL) 20260409-03:03:29.521 StatsLogger INFO: ...
(AReaL) 20260409-03:03:30.390 [FSDPEngine Rank 1] INFO: Microbatch #tokens (rank 1): [10194, 10184, 10234, 10025, 7135], padded to: [10240, 10240, 10240, 10240, 7168], padding lengths: [46, 56, 6, 215, 33]
(AReaL) 20260409-03:03:30.393 [FSDPEngine Rank 0] INFO: Microbatch #tokens (rank 0): [10125, 10124, 10175, 10085, 7335], padded to: [10240, 10240, 10240, 10240, 7424], padding lengths: [115, 116, 65, 155, 89]
(AReaL) 20260409-03:03:31.485 IOStruct INFO: Memory-Usage recompute logp: memory allocated (GB): 2.23, memory reserved (GB): 21.55, device memory used/total (GB): 43.59/60.96
(AReaL) 20260409-03:03:32.644 IOStruct INFO: Memory-Usage compute advantages: memory allocated (GB): 2.23, memory reserved (GB): 21.55, device memory used/total (GB): 43.59/60.96
(APIServer pid=13860) INFO 04-09 03:03:32 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 99.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(AReaL) 20260409-03:03:33.697 [FSDPEngine Rank 1] INFO: Microbatch #tokens (rank 1): [10194, 10184, 10234, 10025, 7135], padded to: [10240, 10240, 10240, 10240, 7168], padding lengths: [46, 56, 6, 215, 33]
(AReaL) 20260409-03:03:33.716 [FSDPEngine Rank 0] INFO: Microbatch #tokens (rank 0): [10125, 10124, 10175, 10085, 7335], padded to: [10240, 10240, 10240, 10240, 7424], padding lengths: [115, 116, 65, 155, 89]
(APIServer pid=13859) INFO 04-09 03:03:34 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 33.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(AReaL) 20260409-03:03:38.837 IOStruct INFO: Memory-Usage ppo update: memory allocated (GB): 2.23, memory reserved (GB): 22.95, device memory used/total (GB): 44.98/60.96
(Worker pid=14023) Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
(APIServer pid=13860) INFO 04-09 03:03:42 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(Worker pid=14026) Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
(Worker pid=14023) Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 3.31it/s]
(Worker pid=14023) Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 3.31it/s]
(Worker pid=14023) INFO 04-09 03:03:43 [default_loader.py:291] Loading weights took 0.40 seconds
(AReaL) 20260409-03:03:43.367 [RemoteInfEngine Rank 1] INFO: Loading weights from disk done in 4.50s. Respond time: 0.12s.
(Worker pid=14026) Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.54it/s]
(Worker pid=14026) Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.53it/s]
(Worker pid=14026) INFO 04-09 03:03:43 [default_loader.py:291] Loading weights took 0.51 seconds
(AReaL) 20260409-03:03:43.521 [RemoteInfEngine Rank 0] INFO: Loading weights from disk done in 4.65s. Respond time: 0.16s.
(AReaL) 20260409-03:03:43.990 SyncRPCServer INFO: Cleared 198 batch shards. Stats: {'cleared_count': 198, 'num_tensors': 174, 'total_bytes': 728200}
(AReaL) 20260409-03:03:43.991 SyncRPCServer INFO: Cleared 186 batch shards. Stats: {'cleared_count': 186, 'num_tensors': 156, 'total_bytes': 617516}
(AReaL) 20260409-03:03:43.991 SyncRPCServer INFO: Cleared 384 batch shards. Stats: {'cleared_count': 384, 'num_tensors': 0, 'total_bytes': 0}
(AReaL) 20260409-03:03:43.991 SyncRPCServer INFO: Cleared 384 batch shards. Stats: {'cleared_count': 384, 'num_tensors': 0, 'total_bytes': 0}
(AReaL) 20260409-03:03:44.108 StatsLogger INFO: Epoch 1/10 Step 2/116 Train step 2/1160 done.
(AReaL) 20260409-03:03:44.109 StatsLogger INFO: Stats (1/1):
(AReaL) 20260409-03:03:44.115 StatsLogger INFO: ...
(APIServer pid=13859) INFO 04-09 03:03:44 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

To Reproduce

Commit ID

4f291dd

Environment

Hardware: Ascend NPU 910B4-1 x8
Docker image: swr.cn-north-9.myhuaweicloud.com/areal/areal_npu:v1.0.1-a2

Script

ASCEND_RT_VISIBLE_DEVICES="0,1" python3 /examples/math/gsm8k_rl.py \
  --config examples/math/gsm8k_grpo_npu.yaml \
  scheduler.type=local \
  experiment_name=test_0409_1 \
  trial_name=gsm8k \
  actor.path=/model/qwen-0.6b \
  actor.weight_update_mode=disk \
  allocation_mode=vllm:d2p1t1+fsdp:d2p1t1 \
  cluster.n_nodes=1 \
  cluster.n_gpus_per_node=2 \
  gconfig.max_new_tokens=2048 \
  vllm.gpu_memory_utilization=0.3 \
  vllm.max_model_len=4096


Labels

bug (Something isn't working)
