Design durable goals as the substrate for auto execution #121

@cbusillo

Intent

Codex Lab CLI should treat long-running coding objectives as durable runtime state, not as prompt prose that the model has to remember across compaction, tool waits, freezes, or resumed sessions.

The clean-sheet direction should be:

  • Use a /goal-style persisted objective as the substrate.
  • Use an /auto-style coordinator as one execution policy for that goal.
  • Make completion, blocked state, validation evidence, and resume behavior explicit runtime concepts.

In short: /goal answers what is still owed; /auto answers how aggressively the agent should pursue it.

Why this matters

Recent Every Code / Launchplane v2 work exposed the failure mode clearly. A huge Launchplane rollout recorded in ~/.code/sessions/2026/06/16/rollout-2026-06-16T17-02-47-6154bdb8-fbd2-4461-a01e-c3e0313c1bec.jsonl carried a broad user instruction to keep going until v2 was done. The agent completed multiple useful v2 slices and merged PRs, but the last active slice ended or froze while a full test suite was still running.

Important details from that artifact:

  • CWD was /Users/cbusillo/Developer/launchplane.
  • Model telemetry showed gpt-5.5 with a 272k context window.
  • The session had about 11M cumulative tokens across many turns.
  • The last completed task was a specific slice, the PR "chore: install just in the devcontainer for Linux development" (openai/codex#1375), not all of v2.
  • The final visible messages said the full test suite was still running and JetBrains/final review/commit/PR/merge were still pending.
  • There was no final task_complete for the last in-progress slice.

The issue was not that the model explicitly decided “v2 is infeasible.” The issue was that “v2 is still active” lived mostly in transcript prose and conversational continuity. Once the session froze or was interrupted, there was no first-class durable objective that the runtime could resume and continue against.

Design Direction

Prefer this layering:

Durable Goal Runtime
- objective
- success criteria
- status
- token/time budget
- completed slices
- evidence ledger
- open blockers
        |
        v
Execution Policy
- manual
- assisted
- auto
        |
        v
Auto Coordinator
- plan next slice
- delegate agents
- run tools
- validate
- open/update PR
- babysit CI
- merge if allowed
        |
        v
Completion Audit
- map objective to artifacts
- verify evidence
- mark complete, continue, or blocked
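The top two layers could be modeled roughly as follows. This is a minimal sketch, not a proposed schema: the class names, field names, and the `record_slice` helper are hypothetical, and treating a missing budget as "no limit" is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class GoalStatus(Enum):
    ACTIVE = "active"
    PAUSED = "paused"
    BLOCKED = "blocked"
    COMPLETE = "complete"
    BUDGET_LIMITED = "budget_limited"


class ExecutionPolicy(Enum):
    MANUAL = "manual"
    ASSISTED = "assisted"
    AUTO = "auto"


@dataclass
class Evidence:
    # Hypothetical shape: which requirement was checked and where the proof lives.
    requirement: str
    artifact: str  # e.g. a PR reference, a test command, or a file path


@dataclass
class Goal:
    objective: str
    success_criteria: list[str] = field(default_factory=list)
    status: GoalStatus = GoalStatus.ACTIVE
    policy: ExecutionPolicy = ExecutionPolicy.MANUAL
    token_budget: Optional[int] = None  # assumption: None means no limit
    tokens_used: int = 0
    completed_slices: list[str] = field(default_factory=list)
    evidence: list[Evidence] = field(default_factory=list)
    open_blockers: list[str] = field(default_factory=list)

    def record_slice(self, name: str, tokens: int) -> None:
        """Record a finished slice; this never completes the parent goal."""
        self.completed_slices.append(name)
        self.tokens_used += tokens
        if self.token_budget is not None and self.tokens_used >= self.token_budget:
            self.status = GoalStatus.BUDGET_LIMITED
```

Keeping `completed_slices` separate from `status` is the point of the sketch: a slice can finish many times while the goal stays active.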

A user command such as:

/auto finish Launchplane v2

should effectively mean:

create or attach active goal: "finish Launchplane v2"
set execution policy: auto
run until goal is complete, blocked, paused, cleared, or budget-limited
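One way the "create or attach" step could look is sketched below. `Session` and `handle_auto` are hypothetical names, and rejecting a second, different objective while a goal is already attached is an assumption; the issue only says there is one active goal per thread/session.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Goal:
    objective: str
    status: str = "active"
    policy: str = "manual"


class Session:
    """Hypothetical per-thread state holding at most one goal."""

    def __init__(self) -> None:
        self.goal: Optional[Goal] = None

    def handle_auto(self, line: str) -> Goal:
        command, _, objective = line.strip().partition(" ")
        if command != "/auto":
            raise ValueError(f"unsupported command: {command}")
        objective = objective.strip()
        if self.goal is None:
            # Create: there is nothing to attach to, so an objective is required.
            if not objective:
                raise ValueError("no goal to attach to; an objective is required")
            self.goal = Goal(objective=objective)
        elif objective and objective != self.goal.objective:
            # Assumption: never silently replace an existing goal.
            raise ValueError("a different goal is already attached; clear it first")
        # Attach: the policy changes, the goal itself is untouched.
        self.goal.policy = "auto"
        return self.goal
```

For example, `Session().handle_auto("/auto finish Launchplane v2")` returns a goal with policy `auto`, and a later bare `/auto` attaches to that same goal instead of creating a new one.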

Features to borrow from upstream Codex /goal

  • Persist a thread/session goal independently of the visible prompt transcript.
  • Track explicit statuses: active, paused, complete, and budget-limited. Codex Lab should add blocked.
  • Automatically continue when idle if the goal is still active and there is no pending user/tool interaction.
  • Keep token/time accounting attached to the goal, not only individual turns.
  • Give the model narrow goal tools: inspect current goal, create only when explicitly requested, mark complete only through a constrained completion tool.
  • Require a completion audit before complete: map objective requirements to concrete files, commands, tests, PRs, CI, screenshots, or other evidence.
  • Treat uncertainty as not complete.
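The completion audit in the last two bullets can be sketched as a pure function. The names `audit_completion` and `AuditResult` are hypothetical, and treating an empty requirement list as "not complete" is an assumption that follows from the uncertainty rule.

```python
from dataclasses import dataclass, field


@dataclass
class AuditResult:
    complete: bool
    missing: list[str] = field(default_factory=list)


def audit_completion(
    requirements: list[str], evidence: dict[str, list[str]]
) -> AuditResult:
    """Report complete only when every requirement has at least one artifact."""
    missing = [req for req in requirements if not evidence.get(req)]
    # No stated requirements means nothing can be verified, so the
    # audit stays conservative and reports not complete.
    return AuditResult(complete=bool(requirements) and not missing, missing=missing)
```

A requirement with an empty evidence list counts as missing, so a plausible-sounding final message with no artifacts cannot pass the audit.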

Features to borrow from Every Code /auto

  • Coordinator loop that decides next action separately from the worker turn.
  • Structured decisions such as continue, finish_success, finish_failed/blocked, with evidence.
  • Agent delegation for scouting, implementation, and review.
  • Review and validation gates before completion.
  • PR/CI babysitting as a first-class part of long-running coding work.
  • User-visible progress cards/state so the user can tell whether the harness is implementing, validating, waiting, blocked, reviewing, or merging.
  • Stop/pause/escape semantics that are reliable and visible.
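A structured coordinator decision might be shaped like the sketch below. The field names and the `validate_decision` helper are hypothetical; `FINISH_BLOCKED` stands in for the "finish_failed/blocked" decision, and the validation rules are assumptions about what "with evidence" would mean in practice.

```python
from dataclasses import dataclass
from enum import Enum


class DecisionKind(Enum):
    CONTINUE = "continue"
    FINISH_SUCCESS = "finish_success"
    FINISH_BLOCKED = "finish_blocked"


@dataclass(frozen=True)
class Decision:
    kind: DecisionKind
    next_action: str = ""
    evidence: tuple[str, ...] = ()
    unblock_request: str = ""


def validate_decision(decision: Decision) -> list[str]:
    """Return the reasons a decision must be rejected; empty means acceptable."""
    problems: list[str] = []
    if decision.kind is DecisionKind.CONTINUE and not decision.next_action.strip():
        problems.append("continue requires a concrete next action")
    if decision.kind is DecisionKind.FINISH_SUCCESS and not decision.evidence:
        problems.append("finish_success requires evidence")
    if decision.kind is DecisionKind.FINISH_BLOCKED and not decision.unblock_request.strip():
        problems.append("finish_blocked requires an unblock request for the user")
    return problems
```

Validating the decision outside the worker turn keeps the coordinator's output machine-checkable, so the harness does not have to infer intent from prose.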

Proposed MVP

Build the smallest useful goal substrate first:

  • Store one active goal per thread/session.
  • Support /goal <objective>, /goal status, /goal pause, /goal resume, /goal clear.
  • Add statuses: active, paused, blocked, complete, budget_limited.
  • Add a continuation prompt that tells the model to choose the next concrete action and audit before complete.
  • Add model-visible tools to read the goal and mark complete or blocked with evidence.
  • Resume active goals after CLI/app restart from the persisted session state.
  • Add tests proving that a normal assistant final answer does not end the long-running goal if it is still active.
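The restart/resume item could start as small as the sketch below. It assumes the goal is stored as a JSON file, which is only one of the options raised in the open questions; `save_goal` and `load_resumable_goal` are hypothetical names.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_goal(path: Path, goal: dict) -> None:
    """Write to a temp file, then rename, so a crash cannot leave a partial file."""
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(goal, handle)
    os.replace(tmp_name, path)


def load_resumable_goal(path: Path) -> Optional[dict]:
    """Return the stored goal if it should resume after a restart, else None."""
    if not path.exists():
        return None
    goal = json.loads(path.read_text(encoding="utf-8"))
    # Only active goals resume on their own; paused or blocked goals
    # wait for the user (an assumption, not stated in the issue).
    return goal if goal.get("status") == "active" else None
```

On startup the CLI would call `load_resumable_goal` before accepting input, so an active goal is restored from disk instead of being reconstructed from the transcript.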

Then layer auto policy on top:

  • /auto <objective> creates/attaches a goal and sets execution policy to auto.
  • Coordinator chooses next slice and can call worker/agent/review paths.
  • Coordinator cannot mark the goal complete without a completion audit and evidence ledger update.
  • A blocked decision must preserve a concrete unblock request for the user.

Acceptance Criteria

  • A long-running objective survives compaction, app restart, and resumed rollouts.
  • If the model finishes a turn without completing the goal, the harness can continue automatically or show the user the active-goal state.
  • Completion requires explicit evidence, not just a plausible final message.
  • Blocked is first-class and does not look like success or silent abandonment.
  • Auto mode can complete small slices while keeping the parent goal active.
  • The UI clearly distinguishes slice completion from parent-goal completion.
  • Tests cover continuation after no-tool/final-answer turns, blocked state, pause/resume, budget limiting, and restart/resume behavior.
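The continuation criteria can be expressed as a decision the harness makes after every turn. This is a sketch under assumptions: `after_turn` and `NextStep` are hypothetical names, and statuses and policies are passed as plain strings for brevity.

```python
from enum import Enum


class NextStep(Enum):
    CONTINUE = "continue"      # start another turn automatically
    WAIT = "wait"              # a user or tool interaction is pending
    SHOW_STATE = "show_state"  # surface the active-goal state to the user
    STOP = "stop"              # the goal is no longer active


def after_turn(status: str, policy: str, pending_interaction: bool) -> NextStep:
    """Decide what the harness does once the model finishes a turn."""
    if status != "active":
        return NextStep.STOP
    if pending_interaction:
        return NextStep.WAIT
    if policy == "auto":
        return NextStep.CONTINUE
    return NextStep.SHOW_STATE
```

The function deliberately has no parameter for "the model produced a final answer": only a recorded status change can stop the goal, which is the behavior the continuation tests would need to pin down.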

Open Questions

  • Should goals live in the existing rollout/session store, a new state DB table, or both?
  • Should goal status be model-visible as tools only, UI events only, or both?
  • Should auto policy be stored on the goal or as a separate run/controller state?
  • How much of Every Code Auto Drive should be ported directly versus redesigned around the goal substrate?
  • What is the minimum desktop-app UI for active goals, blocked requests, and evidence history?

Current Recommendation

Implement /goal as the durable substrate first. Then implement /auto as an execution policy that operates against an active goal.

Do not rely on stronger prose alone. The harness should own the fact that a broad goal remains active until the runtime records completion, blocked state, pause, clear, or budget exhaustion.
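Runtime ownership of goal state could be enforced with an explicit transition table, sketched below. The issue names the statuses but not the edges between them, so every edge here is an assumption, as are the names `ALLOWED_TRANSITIONS` and `transition`. Clearing a goal is treated as deleting it, not as a status.

```python
# Hypothetical transition table: the statuses come from the proposal,
# the allowed edges are assumptions made for illustration.
ALLOWED_TRANSITIONS: dict[str, set[str]] = {
    "active": {"paused", "blocked", "complete", "budget_limited"},
    "paused": {"active"},
    "blocked": {"active"},         # resumes once the user unblocks it
    "budget_limited": {"active"},  # resumes once the budget is raised
    "complete": set(),             # terminal
}


class InvalidTransition(Exception):
    """Raised when a status change is not allowed or has no recorded reason."""


def transition(current: str, target: str, reason: str) -> str:
    """Return the new status, or raise if the runtime must refuse the change."""
    if not reason.strip():
        raise InvalidTransition("every status change needs a recorded reason")
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise InvalidTransition(f"{current} -> {target} is not allowed")
    return target
```

Because an assistant message never calls `transition`, no amount of prose can move a goal out of `active`.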

Labels: plan (Durable planning issue), plan:active (Plan is actionable now)
