examples/test_dream_dvllm_human_eval.py hangs when run #6

@luozixin2

Description

  • Running examples/test_dream_dvllm_human_eval.py directly on remote origin/main (commit e9e9bb08ad8396646c8c1378d252c0facdfabeb9) repeatedly hangs in the inner while loop of model_runner.prepare_decode during the decode phase.
  • The stack captured at the hang points to the decode path in diffulex/legacy/engine/model_runner.py: the block selected by cur_map[local_start_idx()] is neither is_in_cache, nor is_to_cache, nor is_active, so start_idx never advances and the loop never exits.

Reproduction environment

  • Code: worktree /home/lzx/Diffulex-remote-main, checked out from origin/main (the commit above), with no local modifications (only .venv is untracked).
  • CUDA: CUDA_HOME=$HOME/cuda-12.2, PATH="$CUDA_HOME/bin:$PATH", LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
  • GPU: CUDA_VISIBLE_DEVICES=0,1,2,3
  • Proxy: http_proxy/https_proxy/all_proxy=http://127.0.0.1:17780 (local proxy).
  • Runner: uv run (the Python venv lives in the repository's .venv/).
  • Also set: PYTHONFAULTHANDLER=1, UV_HTTP_TIMEOUT=180
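
PYTHONFAULTHANDLER=1 only dumps a traceback on fatal signals, so a pure-Python busy loop like this one still needs a manual interrupt to reveal where it is stuck. As a minimal sketch, the standard-library faulthandler module can also dump all thread stacks after a timeout. The helper names arm_hang_watchdog and disarm_hang_watchdog and the 600-second default are assumptions for illustration, not part of Diffulex.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=600.0, file=None):
    """Dump the stacks of all threads every `timeout_s` seconds until disarmed.

    `file` must expose fileno(); it defaults to sys.stderr.
    """
    target = sys.stderr if file is None else file
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)


def disarm_hang_watchdog():
    """Cancel the pending dump once the guarded call has returned."""
    faulthandler.cancel_dump_traceback_later()
```

Arming the watchdog just before the generate call and disarming it afterwards would write the hanging frame to the log without anyone having to interrupt the process by hand.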

Steps to reproduce

cd /home/lzx/Diffulex-remote-main
# Export CUDA_HOME on its own first: the arguments of a single `export` are all expanded
# before any of them is assigned, so $CUDA_HOME below would otherwise be stale or empty.
export CUDA_HOME=$HOME/cuda-12.2
export PYTHONFAULTHANDLER=1 \
  http_proxy=http://127.0.0.1:17780 https_proxy=http://127.0.0.1:17780 \
  HTTP_PROXY=http://127.0.0.1:17780 HTTPS_PROXY=http://127.0.0.1:17780 \
  all_proxy=http://127.0.0.1:17780 ALL_PROXY=http://127.0.0.1:17780 \
  no_proxy=localhost,127.0.0.1,::1 NO_PROXY=localhost,127.0.0.1,::1 \
  PATH="$CUDA_HOME/bin:$PATH" \
  LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" \
  CUDA_VISIBLE_DEVICES=0,1,2,3 UV_HTTP_TIMEOUT=180
uv run python examples/test_dream_dvllm_human_eval.py > log/test_dvllm_dream_human_eval.remote_main.log 2>&1

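Observed behavior

  • Progress stops at around Generating: 79%|█████ | 130/164 ... with no further output; GPU utilization drops to 0% while the process keeps consuming CPU.
  • Python stack printed on manual interrupt (excerpt):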
[rank0]:   File "/home/lzx/Diffulex-remote-main/examples/test_dream_dvllm_human_eval.py", line 74, in <module>
[rank0]:     outputs = LLM.generate(prompts[:], sampling_params)
[rank0]:   File "/home/lzx/Diffulex/diffulex/legacy/engine/llm_engine.py", line 118, in generate
[rank0]:     output, num_tokens, is_prefill, cur_n_diff_steps, _ = self.step()
[rank0]:   File "/home/lzx/Diffulex/diffulex/legacy/engine/llm_engine.py", line 77, in step
[rank0]:     sample_output = self.model_runner.call("run", seqs, is_prefill)
[rank0]:   File "/home/lzx/Diffulex/diffulex/legacy/engine/model_runner.py", line 678, in run
[rank0]:     input_ids, positions = self.prepare_prefill(seqs) if is_prefill else self.prepare_decode(seqs)
[rank0]:   File "/home/lzx/Diffulex/diffulex/legacy/engine/model_runner.py", line 586, in prepare_decode
[rank0]:     if cur_map[local_start_idx()] == seq.num_diffusion_blocks - 1:
  • The while loop in prepare_decode:
while start_idx < end_idx and not is_last_block and not meet_active_block:
    local_start_idx = lambda: start_idx % seq.block_size
    diffusion_block = seq.diffusion_blocks[cur_map[local_start_idx()]]
    ...
    if diffusion_block.is_in_cache:
        ...
        start_idx += step
    elif diffusion_block.is_to_cache:
        ...
        start_idx += step
    elif diffusion_block.is_active:
        meet_active_block = True
    # any other state is unhandled → start_idx stays unchanged and the loop never exits
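
The stall can be illustrated outside the engine with a small model of the loop. This is only a sketch under stated assumptions: Block, walk, and max_iters are hypothetical names, blocks are indexed by start_idx // block_size rather than through the real cur_map, and the iteration cap exists solely so the demonstration terminates.

```python
from dataclasses import dataclass


@dataclass
class Block:
    # Hypothetical stand-in for a diffusion block; only the three flags
    # named in this issue are modelled.
    is_in_cache: bool = False
    is_to_cache: bool = False
    is_active: bool = False


def walk(blocks, block_size, end_idx, max_iters=1000):
    """Mimic the loop shape from the issue; return (start_idx, iterations, stalled)."""
    start_idx = 0
    meet_active_block = False
    iters = 0
    while start_idx < end_idx and not meet_active_block:
        if iters >= max_iters:
            # The real loop has no such cap and would spin forever here.
            return start_idx, iters, True
        iters += 1
        block = blocks[start_idx // block_size]
        if block.is_in_cache or block.is_to_cache:
            start_idx += block_size
        elif block.is_active:
            meet_active_block = True
        # No else branch: a block with all flags False changes nothing.
    return start_idx, iters, False
```

With blocks in the states (in cache, to cache, active) the walk finishes after three iterations. If the second block has all three flags False, start_idx stays at the start of that block and only the cap ends the loop, which matches the symptom of a busy CPU and an idle GPU.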

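Expected behavior

  • The decode phase should never enter an infinite loop. When it meets an unexpected block state it should at least advance the pointer or raise an error, so that the run either continues or fails fast.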
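
One way to get the fail-fast half of this behavior is an explicit else branch that raises. The sketch below is not the Diffulex implementation: scan_blocks and the SimpleNamespace blocks are hypothetical, and the start_idx // block_size indexing is a simplifying assumption in place of cur_map.

```python
from types import SimpleNamespace


def scan_blocks(blocks, block_size, end_idx):
    """Advance past cached blocks; stop at the first active one; raise otherwise."""
    start_idx = 0
    while start_idx < end_idx:
        block_id = start_idx // block_size
        block = blocks[block_id]
        if block.is_in_cache or block.is_to_cache:
            start_idx += block_size
        elif block.is_active:
            break
        else:
            # Fail fast instead of spinning on a state the loop does not handle.
            raise RuntimeError(
                f"unhandled state for block {block_id} at start_idx={start_idx}: "
                f"is_in_cache={block.is_in_cache}, "
                f"is_to_cache={block.is_to_cache}, is_active={block.is_active}"
            )
    return start_idx


def make_block(**flags):
    """Build a hypothetical block with all three flags defaulting to False."""
    base = {"is_in_cache": False, "is_to_cache": False, "is_active": False}
    base.update(flags)
    return SimpleNamespace(**base)
```

Raising turns the silent hang into an error that names the offending block, which seems preferable to silently advancing until the intended semantics of the extra state are confirmed.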

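Preliminary hypothesis

  • Some diffusion blocks are in a state that is none of cache, to_cache, or active, so the pointer never advances. I suggest adding a guard for that branch (for example else: break, or logging the anomaly and advancing start_idx) and printing the state of the offending block, to help confirm the intended semantics.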
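
To make the "print the block state" part concrete, a small helper could collect every boolean is_* attribute of the offending block, so that states other than the three known flags also show up. This is a hedged sketch: describe_block, log_unhandled_block, and the logger name decode_debug are hypothetical, and it assumes the block exposes its state as boolean is_* attributes or properties.

```python
import logging

logger = logging.getLogger("decode_debug")  # hypothetical logger name


def describe_block(block):
    """Return 'is_x=..., is_y=...' for every boolean is_* attribute of `block`."""
    flags = {}
    for name in dir(block):
        if not name.startswith("is_"):
            continue
        value = getattr(block, name)
        if isinstance(value, bool):  # skips methods and non-boolean attributes
            flags[name] = value
    return ", ".join(f"{name}={value}" for name, value in sorted(flags.items()))


def log_unhandled_block(start_idx, block):
    """Emit one warning line describing the block the loop could not handle."""
    logger.warning("unhandled block at start_idx=%d: %s", start_idx, describe_block(block))
```

Calling log_unhandled_block from the new else branch, before breaking or raising, would record which flag combination triggers the hang.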
