F14 Phase 2: resumable `--wait` via the job ledger #194

PR #191 shipped F14 Phase 1: every `--wait` invocation writes start and completion entries to `~/.vers/jobs.jsonl`. Phase 2 closes the loop: when an agent's `--wait` invocation is killed mid-poll, the next invocation should re-attach to the in-flight job rather than submit a new one.

## Background

Trevin's principle 4 calls this out specifically:

> Retries on a long-running operation aren't just about idempotency at submission; they're about idempotency across the whole submit-poll-collect arc. If the agent's first invocation submits a job and then loses connection mid-poll, the second invocation needs to find the in-flight job, not start a new one.

Today, if an agent runs `vers run --wait --vm-alias my-vm` and the process is killed mid-poll, retrying that exact command submits a brand-new job instead of re-attaching to the existing one. The first VM is orphaned: it will eventually complete or fail without the agent observing it.

## Target

```
$ vers run --wait --vm-alias my-vm
^C  # killed mid-poll

$ vers run --wait --vm-alias my-vm
note: resuming in-flight job job_8f2a3b4c5d (status: running, age 12s)
... continues polling existing job ...
{"vm_id":"vm-abc","status":"complete"}
```

## Design questions

1. **Resumption key.** What identifies an in-flight job uniquely?
   - Option A: command + args fingerprint (e.g., `vm.run + ["--vm-alias","my-vm"]`)
   - Option B: explicit `--resume-job <id>` flag (agent must record the job id from the previous run)
   - Option C: presence of a stable user-supplied key (`--vm-alias` for run; what about commands without one?)

   Recommend: A as default with B as escape hatch. Commands without a stable key (e.g., `vers run` without `--vm-alias`) cannot resume — document the limitation.

2. **Staleness window.** How long after `started_at` is a job considered "in-flight" vs "abandoned"? Recommend: 24h, configurable via `--max-resume-age`.

3. **Server-side state authority.** The local ledger says "running"; what does the server say? On resume, the CLI should re-poll the server's actual job/VM state and reconcile. If the server already says "completed", treat the job as complete rather than continuing to poll.

4. **Concurrent invocations.** Two `vers run --wait` calls with the same args from two terminals — do both attach to the same job, or does the second error? Recommend: attach (both observe the same job).

## Scope

- Add `Find(kind, command, args)` to `internal/jobs/ledger.go` returning the latest non-terminal entry matching the fingerprint.
- Modify the `--wait` plumbing in `cmd/run.go`, `cmd/branch.go`, `cmd/deploy.go`, `cmd/resume.go`, `cmd/run_commit.go`:
  - Before submitting, check the ledger for an in-flight match.
  - If found: emit a stderr note and re-attach to the existing job (re-poll).
  - If not: submit normally.
- Add `--no-resume` opt-out flag for users who explicitly want a fresh submission.
- Add `--resume-job <id>` for explicit resumption.
- Update `agent-context` to declare `async_resume_supported: true` per command.
- Tests: simulate killed-mid-poll by using a fake handler that stops after `Submit` but before `Complete` is recorded.

## Out of scope for this issue

- Changing existing handler polling logic (resume should reuse the same poll loop).
- Cross-machine resumption (the ledger is local).
