PR #191 shipped F14 Phase 1: every --wait invocation writes start + completion entries to ~/.vers/jobs.jsonl. Phase 2 closes the loop — when an agent's --wait invocation gets killed mid-poll, the next invocation should re-attach to the in-flight job rather than submit a new one.
Background
Trevin's principle 4 calls this out specifically:
Retries on a long-running operation aren't just about idempotency at submission; they're about idempotency across the whole submit-poll-collect arc. If the agent's first invocation submits a job and then loses connection mid-poll, the second invocation needs to find the in-flight job, not start a new one.
Today, if an agent runs vers run --wait --vm-alias my-vm and the process is killed mid-poll, retrying that exact command creates a brand-new VM submission rather than re-attaching to the existing job. The first VM is orphaned (it will eventually complete or fail without the agent observing it).
Target
$ vers run --wait --vm-alias my-vm
^C # killed mid-poll
$ vers run --wait --vm-alias my-vm
note: resuming in-flight job job_8f2a3b4c5d (status: running, age 12s)
... continues polling existing job ...
{"vm_id":"vm-abc","status":"complete"}
Design questions
- Resumption key. What identifies an in-flight job uniquely?
  - Option A: command + args fingerprint (e.g., vm.run + ["--vm-alias","my-vm"])
  - Option B: explicit --resume-job <id> flag (agent must record the job id from the previous run)
  - Option C: presence of a stable user-supplied key (--vm-alias for run; what about commands without one?)
  Recommend: A as default with B as escape hatch. Commands without a stable key (e.g., vers run without --vm-alias) cannot resume; document the limitation.
- Staleness window. How long after started_at is a job considered "in-flight" vs "abandoned"? Recommend: 24h, configurable via --max-resume-age.
- Server-side state authority. The local ledger says "running"; what does the server say? On resume, the CLI should re-poll the server's actual job/VM state and reconcile. If the server says "completed" already, treat as complete.
- Concurrent invocations. Two vers run --wait calls with the same args from two terminals: do both attach to the same job, or does the second error? Recommend: attach (both observe the same job).
Scope
- Add Find(kind, command, args) to internal/jobs/ledger.go returning the latest non-terminal entry matching the fingerprint.
- Modify the --wait plumbing in cmd/run.go, cmd/branch.go, cmd/deploy.go, cmd/resume.go, cmd/run_commit.go:
  - Before submitting, check the ledger for an in-flight match.
  - If found: emit a stderr note and re-attach to the existing job (re-poll).
  - If not: submit normally.
- Add --no-resume opt-out flag for users who explicitly want a fresh submission.
- Add --resume-job <id> for explicit resumption.
- Update agent-context to declare async_resume_supported: true per command.
- Tests: simulate killed-mid-poll by using a fake handler that exits after Submit but before Complete.
Out of scope for this issue
- Changing existing handler polling logic (resume should reuse the same poll loop).
- Cross-machine resumption (the ledger is local).