Codex Lab CLI should treat long-running coding objectives as durable runtime state, not as prompt prose that the model has to remember across compaction, tool waits, freezes, or resumed sessions.
The clean-sheet direction should be:
Use a /goal-style persisted objective as the substrate.
Use an /auto-style coordinator as one execution policy for that goal.
Make completion, blocked state, validation evidence, and resume behavior explicit runtime concepts.
In short: /goal answers what is still owed; /auto answers how aggressively the agent should pursue it.
Why this matters
Recent Every Code / Launchplane v2 work exposed the failure mode clearly. A huge Launchplane rollout in ~/.code/sessions/2026/06/16/rollout-2026-06-16T17-02-47-6154bdb8-fbd2-4461-a01e-c3e0313c1bec.jsonl carried a broad user instruction to keep going until v2 was done. The agent completed multiple useful v2 slices and merged PRs, but the last active slice ended/froze while a full test suite was still running.
Important details from that artifact:
CWD was /Users/cbusillo/Developer/launchplane.
Model telemetry showed gpt-5.5 with a 272k context window.
The session had about 11M cumulative tokens across many turns.
The final visible messages said the full test suite was still running and JetBrains/final review/commit/PR/merge were still pending.
There was no final task_complete for the last in-progress slice.
The issue was not that the model explicitly decided “v2 is infeasible.” The issue was that “v2 is still active” lived mostly in transcript prose and conversational continuity. Once the session froze or was interrupted, there was no first-class durable objective that the runtime could resume and continue against.
Design Direction
Prefer this layering:
Durable Goal Runtime
- objective
- success criteria
- status
- token/time budget
- completed slices
- evidence ledger
- open blockers
|
v
Execution Policy
- manual
- assisted
- auto
|
v
Auto Coordinator
- plan next slice
- delegate agents
- run tools
- validate
- open/update PR
- babysit CI
- merge if allowed
|
v
Completion Audit
- map objective to artifacts
- verify evidence
- mark complete, continue, or blocked
A user command such as:
/auto finish Launchplane v2
should effectively mean:
create or attach active goal: "finish Launchplane v2"
set execution policy: auto
run until goal is complete, blocked, paused, cleared, or budget-limited
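A minimal sketch of that mapping, assuming the goal is a plain record with the fields listed in the layering above. The names Goal, GoalStatus, ExecutionPolicy, handle_auto_command, and should_keep_running are hypothetical and not taken from Codex or Every Code. Attaching only when the objective text matches is also an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class GoalStatus(Enum):
    ACTIVE = "active"
    PAUSED = "paused"
    BLOCKED = "blocked"
    COMPLETE = "complete"
    BUDGET_LIMITED = "budget_limited"


class ExecutionPolicy(Enum):
    MANUAL = "manual"
    ASSISTED = "assisted"
    AUTO = "auto"


@dataclass
class Goal:
    objective: str
    status: GoalStatus = GoalStatus.ACTIVE
    policy: ExecutionPolicy = ExecutionPolicy.MANUAL
    success_criteria: list = field(default_factory=list)
    completed_slices: list = field(default_factory=list)
    evidence: list = field(default_factory=list)
    open_blockers: list = field(default_factory=list)
    tokens_used: int = 0
    token_budget: Optional[int] = None  # None means "no budget set"


def handle_auto_command(active: Optional[Goal], objective: str) -> Goal:
    """Hypothetical handler for `/auto <objective>`: attach or create, then set policy."""
    # Assumption: attach only when an unfinished goal has the same objective text.
    attachable = (
        active is not None
        and active.objective == objective
        and active.status is not GoalStatus.COMPLETE
    )
    goal = active if attachable else Goal(objective=objective)
    goal.status = GoalStatus.ACTIVE
    goal.policy = ExecutionPolicy.AUTO
    return goal


def should_keep_running(goal: Goal) -> bool:
    """The run loop continues only while the goal is active and within budget."""
    if goal.token_budget is not None and goal.tokens_used >= goal.token_budget:
        goal.status = GoalStatus.BUDGET_LIMITED
    return goal.status is GoalStatus.ACTIVE
```

Clearing a goal is modelled here as dropping the record, so it has no status value of its own.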
Features to borrow from upstream Codex /goal
Persist a thread/session goal independently of the visible prompt transcript.
Track explicit statuses: active, paused, complete, and budget-limited. Codex Lab should add blocked.
Automatically continue when idle if the goal is still active and there is no pending user/tool interaction.
Keep token/time accounting attached to the goal, not only individual turns.
Give the model narrow goal tools: inspect current goal, create only when explicitly requested, mark complete only through a constrained completion tool.
Require a completion audit before complete: map objective requirements to concrete files, commands, tests, PRs, CI, screenshots, or other evidence.
Treat uncertainty as not complete.
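The completion audit described above could look roughly like the following. This is a sketch under assumptions: the Evidence record and audit_completion function are hypothetical names, and "verified" stands in for whatever check the harness actually performs on a file, command, test run, PR, or CI result.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Evidence:
    requirement: str                  # which requirement this entry supports
    kind: str                         # e.g. "test", "pr", "ci", "file", "screenshot"
    reference: str                    # command, path, or PR identifier
    verified: Optional[bool] = None   # None means "not checked yet"


def audit_completion(requirements: List[str],
                     ledger: List[Evidence]) -> Tuple[bool, List[str]]:
    """Return (ok, gaps). ok is True only if every requirement has verified evidence."""
    if not requirements:
        # No recorded criteria means we cannot know; uncertainty is not complete.
        return False, ["no success criteria recorded"]
    gaps = []
    for req in requirements:
        matching = [e for e in ledger if e.requirement == req]
        if not matching:
            gaps.append(f"{req}: no evidence")
        elif not any(e.verified is True for e in matching):
            # Unchecked (None) and failed (False) evidence both count as gaps.
            gaps.append(f"{req}: evidence not verified")
    return len(gaps) == 0, gaps
```

Returning the gaps alongside the verdict gives the continuation turn a concrete list of what is still owed.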
Features to borrow from Every Code /auto
Coordinator loop that decides next action separately from the worker turn.
Structured decisions such as continue, finish_success, finish_failed/blocked, with evidence.
Agent delegation for scouting, implementation, and review.
Review and validation gates before completion.
PR/CI babysitting as a first-class part of long-running coding work.
User-visible progress cards/state so the user can tell whether the harness is implementing, validating, waiting, blocked, reviewing, or merging.
Stop/pause/escape semantics that are reliable and visible.
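To illustrate the separation between coordinator decisions and worker turns, here is a small sketch. It is not Every Code's implementation: Decision, run_coordinator, and the max_steps guard are hypothetical, and the action strings only mirror the continue / finish_success / blocked decisions named above.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Decision:
    action: str                       # "continue", "finish_success", or "finish_blocked"
    detail: str                       # next task when continuing, otherwise the reason
    evidence: List[str] = field(default_factory=list)


def run_coordinator(decide: Callable[[List[str]], Decision],
                    work: Callable[[str], str],
                    max_steps: int = 10) -> Decision:
    """Alternate coordinator decisions with worker turns until a terminal decision."""
    history: List[str] = []
    for _ in range(max_steps):
        decision = decide(history)    # the coordinator sees results so far
        if decision.action != "continue":
            return decision           # terminal decisions carry their own evidence
        history.append(work(decision.detail))  # the worker turn runs separately
    # Running out of steps is reported as blocked, never as success.
    return Decision("finish_blocked", "step budget exhausted", history)
```

A scripted `decide` callback is enough to exercise the loop in tests without a model in the loop.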
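Proposed MVP
Build the smallest useful goal substrate first:
Commands: /goal <objective>, /goal status, /goal pause, /goal resume, /goal clear.
Add a continuation prompt that tells the model to choose the next concrete action and audit before complete.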
Add model-visible tools to read the goal and mark complete or blocked with evidence.
Resume active goals after CLI/app restart from the persisted session state.
Add tests proving that a normal assistant final answer does not end the long-running goal if it is still active.
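The restart and final-answer behavior above can be sketched as follows. This assumes, purely for illustration, that the goal is stored as one JSON file; the open question of where goals should actually live is unchanged. save_goal, load_goal, and on_turn_finished are hypothetical names.

```python
import json
import os
import tempfile


def save_goal(path, goal):
    """Write to a temp file, then rename, so a freeze mid-write cannot corrupt the goal."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(goal, f)
    os.replace(tmp, path)


def load_goal(path):
    """Return the persisted goal after a restart, or None if none was stored."""
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def on_turn_finished(goal, completion_recorded):
    """Return True if the harness should continue.

    A plain assistant final answer leaves the goal active; only a recorded
    completion (via the constrained completion tool) ends it.
    """
    if completion_recorded:
        goal["status"] = "complete"
    return goal["status"] == "active"


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "goal.json")
    save_goal(path, {"objective": "finish Launchplane v2", "status": "active"})
    resumed = load_goal(path)              # simulates a CLI/app restart
    print(on_turn_finished(resumed, False))
```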
Then layer auto policy on top:
/auto <objective> creates/attaches a goal and sets execution policy to auto.
Coordinator chooses next slice and can call worker/agent/review paths.
Coordinator cannot mark the goal complete without a completion audit and evidence ledger update.
A blocked decision must preserve a concrete unblock request for the user.
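The two guard rules above (no completion without an audit and ledger update, no blocked state without a concrete unblock request) might be enforced at the transition level. A sketch, with hypothetical names GoalTransitionError, mark_complete, and mark_blocked, and a dict standing in for the goal record:

```python
class GoalTransitionError(Exception):
    """Raised when a status change is requested without what it requires."""


def mark_complete(goal, audit_passed, new_evidence):
    """Allow completion only with a passing audit and a non-empty ledger update."""
    if not audit_passed or not new_evidence:
        raise GoalTransitionError(
            "completion needs a passing audit and an evidence ledger update")
    goal["evidence"] = goal.get("evidence", []) + list(new_evidence)
    goal["status"] = "complete"
    return goal


def mark_blocked(goal, unblock_request):
    """Record blocked state together with what the user must do to unblock it."""
    request = unblock_request.strip()
    if not request:
        raise GoalTransitionError("blocked needs a concrete unblock request")
    goal["status"] = "blocked"
    goal["unblock_request"] = request
    return goal
```

Raising on an invalid transition keeps the rule in the runtime rather than in prompt wording.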
Acceptance Criteria
A long-running objective survives compaction, app restart, and resumed rollouts.
If the model finishes a turn without completing the goal, the harness can continue automatically or show the user the active-goal state.
Completion requires explicit evidence, not just a plausible final message.
Blocked is first-class and does not look like success or silent abandonment.
Auto mode can complete small slices while keeping the parent goal active.
The UI clearly distinguishes slice completion from parent-goal completion.
Tests cover continuation after no-tool/final-answer turns, blocked state, pause/resume, budget limiting, and restart/resume behavior.
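The distinction between slice completion and parent-goal completion can be made concrete with a small sketch. The functions complete_slice and status_line, and the dict layout, are hypothetical; the status text is only an example of what a UI might show.

```python
def complete_slice(goal, name, evidence):
    """Record a finished slice; the parent goal's status is deliberately untouched."""
    goal["completed_slices"].append({"name": name, "evidence": evidence})
    return goal


def status_line(goal):
    """One-line summary that reports slices and parent-goal status separately."""
    count = len(goal["completed_slices"])
    return "goal [{}]: {} ({} slice(s) done)".format(
        goal["status"], goal["objective"], count)


if __name__ == "__main__":
    goal = {"objective": "finish Launchplane v2",
            "status": "active",
            "completed_slices": []}
    complete_slice(goal, "example slice", ["tests passed"])
    print(status_line(goal))
```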
Open Questions
Should goals live in the existing rollout/session store, a new state DB table, or both?
Should goal status be model-visible as tools only, UI events only, or both?
Should auto policy be stored on the goal or as a separate run/controller state?
How much of Every Code Auto Drive should be ported directly versus redesigned around the goal substrate?
What is the minimum desktop-app UI for active goals, blocked requests, and evidence history?
Current Recommendation
Implement /goal as the durable substrate first. Then implement /auto as an execution policy that operates against an active goal.
Do not rely on stronger prose alone. The harness should own the fact that a broad goal remains active until the runtime records completion, blocked state, pause, clear, or budget exhaustion.