F14 Phase 2: resumable --wait via the job ledger #194

@AlephNotation

PR #191 shipped F14 Phase 1: every --wait invocation writes start + completion entries to ~/.vers/jobs.jsonl. Phase 2 closes the loop — when an agent's --wait invocation gets killed mid-poll, the next invocation should re-attach to the in-flight job rather than submit a new one.

Background

Trevin's principle 4 calls this out specifically:

Retries on a long-running operation aren't just about idempotency at submission; they're about idempotency across the whole submit-poll-collect arc. If the agent's first invocation submits a job and then loses connection mid-poll, the second invocation needs to find the in-flight job, not start a new one.

Today, if an agent runs vers run --wait --vm-alias my-vm and the process is killed mid-poll, retrying that exact command submits a brand-new VM rather than re-attaching to the existing job. The first VM is orphaned: it will eventually complete or fail without the agent observing it.

Target

$ vers run --wait --vm-alias my-vm
^C  # killed mid-poll

$ vers run --wait --vm-alias my-vm
note: resuming in-flight job job_8f2a3b4c5d (status: running, age 12s)
... continues polling existing job ...
{"vm_id":"vm-abc","status":"complete"}

Design questions

  1. Resumption key. What identifies an in-flight job uniquely?

    • Option A: command + args fingerprint (e.g., vm.run + ["--vm-alias","my-vm"])
    • Option B: explicit --resume-job <id> flag (agent must record the job id from the previous run)
    • Option C: presence of a stable user-supplied key (--vm-alias for run; what about commands without one?)

    Recommend: A as default with B as escape hatch. Commands without a stable key (e.g., vers run without --vm-alias) cannot resume — document the limitation.

  2. Staleness window. How long after started_at is a job considered "in-flight" vs "abandoned"? Recommend: 24h, configurable via --max-resume-age.

  3. Server-side state authority. The local ledger says "running"; what does the server say? On resume, the CLI should re-poll the server's actual job/VM state and reconcile. If the server says "completed" already, treat as complete.

  4. Concurrent invocations. Two vers run --wait calls with the same args from two terminals — do both attach to the same job, or does the second error? Recommend: attach (both observe the same job).

Scope

  • Add Find(kind, command, args) to internal/jobs/ledger.go returning the latest non-terminal entry matching the fingerprint.
  • Modify the --wait plumbing in cmd/run.go, cmd/branch.go, cmd/deploy.go, cmd/resume.go, cmd/run_commit.go:
    • Before submitting, check the ledger for an in-flight match.
    • If found: emit a stderr note and re-attach to the existing job (re-poll).
    • If not: submit normally.
  • Add --no-resume opt-out flag for users who explicitly want a fresh submission.
  • Add --resume-job <id> for explicit resumption.
  • Update agent-context to declare async_resume_supported: true per command.
  • Tests: simulate killed-mid-poll by using a fake handler that exits after Submit but before Complete, so the ledger holds a start entry with no matching completion entry.

Out of scope for this issue

  • Changing existing handler polling logic (resume should reuse the same poll loop).
  • Cross-machine resumption (the ledger is local).

Labels: enhancement
