
fix(scheduler): reject step ranges that produce empty timestep slices #2012

Open: laurigates wants to merge 1 commit into kijai:main from laurigates:fix/scheduler-step-validation


Problem

A WanVideoSchedulerv2 (or v1) configured with a step range that resolves to an empty timesteps slice — e.g. steps=4, start_step=8, end_step=-1, or steps=8, end_step=12 — currently builds a scheduler with zero timesteps. The downstream sampler's main loop (for idx, t in enumerate(timesteps[ttm_start_step:])) never executes, and the post-loop return references variables that are only assigned inside the loop:

UnboundLocalError: cannot access local variable 'callback_latent' where it is not associated with a value

(callback_latent is the most common offender on the v2 sampler return path; v1 hits the same class of error on noise_pred.)
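The failure mode can be reproduced in isolation. The following is a minimal sketch, not the real sampler: `sample` is a hypothetical stand-in that keeps only the loop shape quoted above, with a variable that is assigned solely inside the loop body.

```python
def sample(timesteps, ttm_start_step=0):
    """Toy stand-in for the sampler loop; the real sampler does far more."""
    for idx, t in enumerate(timesteps[ttm_start_step:]):
        # Assigned only if the loop body runs at least once.
        callback_latent = f"latent after step {idx} (t={t})"
    # With an empty slice the name was never bound, so this line raises.
    return callback_latent


error = None
try:
    sample([])  # zero timesteps, as produced by the broken scheduler
except UnboundLocalError as exc:
    error = exc

print(type(error).__name__)
```

The exact wording of the exception message depends on the Python version; the exception type is the same.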

This is especially painful for HIGH/LOW two-sampler chains: the HIGH sampler runs to completion — full model load, JIT compile, full step count — before the LOW sampler's broken scheduler builds and crashes. On a 14B Wan 2.2 model that's a couple of minutes of GPU time burned on work that was always going to be discarded.

A real-world misconfiguration that triggered this:

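| node | scheduler       | steps | shift | start_step | end_step |
|------|-----------------|-------|-------|------------|----------|
| HIGH | dpm++_sde/beta  | 4     | 8     | 0          | 4        |
| LOW  | dpm++_sde/beta  | 4     | 8     | 8          | -1       |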

LOW's timesteps[8:] on a length-4 sequence is []. The HIGH branch went through fp8 compilation and ran all 4 steps before the graph died on the LOW sampler's return.
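The slice arithmetic for the two rows can be checked directly. This is a sketch under assumptions: `step_slice` is a hypothetical helper, the timestep values are made up, and mapping end_step=-1 to "end of the sequence" follows the description in this PR rather than the actual scheduler code.

```python
# Illustrative timestep values for steps=4; the real values depend on the scheduler.
timesteps = [1000, 750, 500, 250]


def step_slice(timesteps, start_step, end_step):
    """Return the sub-range a sampler would iterate over (assumed semantics)."""
    end = len(timesteps) if end_step == -1 else end_step
    return timesteps[start_step:end]


high = step_slice(timesteps, 0, 4)   # all four timesteps
low = step_slice(timesteps, 8, -1)   # start is past the end, so nothing is left

print(len(high), len(low))
```

Python slicing never raises for out-of-range bounds, which is why the misconfiguration passes silently until the sampler returns.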

Why the existing check misses it

get_scheduler() already rejects start_step >= end_step, but only when end_step != -1. The end_step=-1 ("run to the end") path places no upper bound on start_step, so any value >= len(timesteps) slips through.

# wanvideo/schedulers/__init__.py (before)
if (isinstance(start_step, int) and end_step != -1 and start_step >= end_step) or ...:
    raise ValueError("start_step must be less than end_step")
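To make the gap concrete, here is a sketch that evaluates only the integer branch of the condition above. `old_check_rejects` is a hypothetical name, and the elided second half of the condition is deliberately left out rather than guessed.

```python
def old_check_rejects(start_step, end_step):
    """Integer branch of the pre-fix condition; the elided branch is omitted."""
    return isinstance(start_step, int) and end_step != -1 and start_step >= end_step


# An explicit end_step is compared against start_step, so this is caught.
caught = old_check_rejects(4, 4)

# end_step=-1 short-circuits the comparison, so start_step=8 is never examined,
# whatever the value of steps is.
missed = old_check_rejects(8, -1)

print(caught, missed)
```

Note that `steps` does not appear in the condition at all, so no value of start_step can be rejected for exceeding it.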

Fix

Two layers, so the validation fires regardless of how the scheduler was constructed:

1. get_scheduler() — defensive check for any caller

Adds:

  • start_step >= steps → reject
  • end_step > steps (when end_step != -1) → reject
  • Existing start_step >= end_step check kept and given a more informative message.

This protects callers that build a scheduler outside the public WanVideoScheduler[v2] node, and it covers the float-sigma-threshold path that the original check already handled.
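The three integer checks listed above could look roughly like this. It is a sketch, not the code in this branch: `step_range_error` and `check_step_range` are hypothetical names, the message wording is illustrative, and float thresholds are simply skipped here, whereas get_scheduler() keeps its own float handling.

```python
from typing import Optional, Union

Step = Union[int, float]


def step_range_error(steps: int, start_step: Step, end_step: Step) -> Optional[str]:
    """Return an error message for an empty int step range, or None if it is fine."""
    if isinstance(start_step, float) or isinstance(end_step, float):
        return None  # sigma thresholds are out of scope for this sketch
    values = f"steps={steps}, start_step={start_step}, end_step={end_step}"
    if start_step >= steps:
        return f"start_step must be < steps ({values})"
    if end_step != -1 and end_step > steps:
        return f"end_step must be <= steps ({values})"
    if end_step != -1 and start_step >= end_step:
        return f"start_step must be < end_step ({values})"
    return None


def check_step_range(steps: int, start_step: Step, end_step: Step) -> None:
    """Raise ValueError, as a get_scheduler()-style caller would."""
    message = step_range_error(steps, start_step, end_step)
    if message is not None:
        raise ValueError(message)
```

Returning the message separately from raising keeps the rule reusable by a caller that reports errors as strings instead of exceptions.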

2. VALIDATE_INPUTS on WanVideoScheduler — rejects at prompt-queue time

VALIDATE_INPUTS is ComfyUI's pre-execution hook: it runs before any node in the graph is invoked. So a misconfigured scheduler stops the whole prompt before HIGH loads the model, with a red error in the UI pointing directly at the offending node. No GPU time is consumed at all. WanVideoSchedulerv2 inherits from WanVideoScheduler and picks up the check automatically.

@classmethod
def VALIDATE_INPUTS(cls, steps, start_step, end_step, **kwargs):
    if start_step >= steps:
        return f"start_step ({start_step}) must be < steps ({steps}); ..."
    ...
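For readers unfamiliar with the hook, the contract is that VALIDATE_INPUTS returns True when the inputs are acceptable and a string describing the problem otherwise. The sketch below shows that contract and the inheritance behaviour on toy classes; `ToyScheduler` and `ToySchedulerV2` are hypothetical stand-ins, and only the first message is taken from this PR's test plan, the other two are illustrative.

```python
class ToyScheduler:
    """Stand-in for a ComfyUI node; only the validation hook is modelled."""

    @classmethod
    def VALIDATE_INPUTS(cls, steps, start_step, end_step, **kwargs):
        if start_step >= steps:
            return (
                f"start_step ({start_step}) must be < steps ({steps}); "
                "the requested range would produce an empty timestep slice "
                "and the sampler would crash mid-graph."
            )
        if end_step != -1 and end_step > steps:
            return f"end_step ({end_step}) must be <= steps ({steps})"
        if end_step != -1 and start_step >= end_step:
            return f"start_step ({start_step}) must be < end_step ({end_step})"
        return True  # inputs accepted


class ToySchedulerV2(ToyScheduler):
    """No override: the classmethod is inherited unchanged."""


result = ToySchedulerV2.VALIDATE_INPUTS(steps=4, start_step=8, end_step=-1, shift=8)
print(result)
```

Extra widget values such as `shift` land in `**kwargs` and are ignored by the check.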

Error-message philosophy

Each failure message names all three values (start_step, end_step, steps) and explains the consequence. The original "start_step must be less than end_step" wasn't enough to map back to "I typo'd 8 for 4 in the LOW branch's start_step." The new messages let the user fix the widget without first having to reverse-engineer the symptom.

What about auto-fixing?

Considered and rejected. start=8, end=-1 on steps=4 is ambiguous: the user might have meant steps=8 total or start=2 for a 4-step split — no way to tell. Refusing with a precise message is the right move; silent clamping would hide misconfigurations behind plausibly-shaped but unintended output.

Test plan

  • python3 -c "import ast; ast.parse(...)" clean on both edited files.
  • Reproduce the original callback_latent UnboundLocalError on main with WanVideoSchedulerv2(steps=4, start_step=8, end_step=-1).
  • Apply this branch — prompt is now rejected at queue time with "start_step (8) must be < steps (4); the requested range would produce an empty timestep slice and the sampler would crash mid-graph."
  • Valid configurations (steps=8, start_step=4, end_step=-1; steps=12, start_step=7, end_step=10; etc.) still validate and run.
  • Float-sigma threshold path (start/end as float) is unchanged — VALIDATE_INPUTS only sees the int widget values; get_scheduler() keeps the existing float branch.

Related

This addresses the same class of "fail loudly, fail fast" gap that #2011 (use_zero_init early-exit returning a 2-tuple where the caller unpacks 3) targets — both bugs would have produced clearer symptoms with a guard one layer up. Happy to fold both into a single PR if that's preferred.

Previously a WanVideoSchedulerv2 / WanVideoScheduler configured with a
step range that resolves to an empty timestep slice (e.g. steps=4,
start_step=8, end_step=-1) would silently build a scheduler with zero
timesteps. The downstream sampler's main loop then never executes, and
the post-loop return statement references variables (callback_latent,
noise_pred, ...) that are only assigned inside the loop:

    UnboundLocalError: cannot access local variable 'callback_latent'
    where it is not associated with a value

This is especially confusing in HIGH/LOW two-sampler chains because the
HIGH sampler runs to completion (model load, JIT compile, full step
count) before the LOW sampler's broken scheduler crashes — wasting
significant GPU time on work that was always going to be discarded.

Two layers of validation now catch this:

1. wanvideo/schedulers/__init__.py: get_scheduler() now also rejects
   start_step >= steps (the existing check only fired when end_step
   was not -1) and end_step > steps. These run for any caller, not
   just the public scheduler nodes.

2. nodes_sampler.py: WanVideoScheduler now defines VALIDATE_INPUTS,
   which ComfyUI invokes at prompt-queue time before any node in the
   graph executes. This rejects the prompt with a precise error before
   any model loads or any sampling begins, eliminating the wasted GPU
   work entirely. WanVideoSchedulerv2 inherits the check.

Each error message names the offending values (start_step, end_step,
steps) and explains why the range is invalid, so the user can fix the
widget directly without having to reconstruct what the symptom maps to.