Summary
Over the last several days, GPT-5.5/Codex behavior appeared to regress around skill usage: the model was more likely to apologize, misinterpret requests, and especially fail to reach for skills that should have triggered implicitly. We investigated rollout files and local repo changes, then rebuilt the sibling code binary with stronger Every Code skill-trigger guidance as a temporary mitigation.
This issue tracks the durable fix needed in this codex repo: align Codex's skill instructions and harness coverage with the stronger behavior in the sibling code repo, and make sure future changes cannot silently weaken implicit skill routing again.
Findings
- The live temporary fix was made in /Users/cbusillo/Developer/code, not this repo.
- This checkout, /Users/cbusillo/Developer/codex, is currently unchanged and does not contain the fix.
- The strongest local suspect was the newer skill-rendering/substrate path where full skill bodies are only injected after the model chooses a skill from the available-skills metadata.
- The prompt-level skill guidance was weaker than Every Code's previous guidance, especially around mandatory triggers, delegated skill triggers, and using all relevant skills rather than treating one match as suppressing others.
- Mediaforce rollout review showed examples where relevant skills were available in metadata but not opened, or opened but not reliably applied.
- A temporary rebuild restored stronger trigger guidance and improved behavior in focused harness tests.
Temporary Mitigation Already Done Outside This Repo
In sibling repo /Users/cbusillo/Developer/code, branch fix/restore-skill-trigger-guidance:
- Strengthened skill trigger guidance in code-rs/core-skills/src/render.rs.
- Fixed scripts/local/rebuild-path-code.sh to build the current package name.
- Rebuilt /Users/cbusillo/.local/bin/code, which points to /Users/cbusillo/Developer/code/code-rs/target/release/code.
The rebuilt binary takes effect only after sessions/processes are restarted. Existing resumed sessions can still carry the old skill prompt/context.
Harness Evidence
Read-only exec harness runs were used to test whether the rebuilt binary reaches for unnamed matching skills.
Passing GPT-5.5 runs:
- /tmp/code-exec-harness-skill-test/20260610-121616-implicit-skill-marker-current-code-home
  - Synthetic skill installed through isolated CODE_HOME/skills.
  - Prompt did not name the skill.
  - The model selected and read the skill body and returned IMPLICIT_ROUTE_SENTINEL_OK.
- /tmp/code-exec-harness-skill-test/20260610-121959-local-llm-sibling-skill-routing
  - GPT-5.5 classified a DNS/Cloudflare docs + read-only infra request as:
    - docs_phase_skill=docs-lookup
    - ops_phase_skill=infra-ops
    - sequence=docs-before-ops
Good-day historical baseline artifacts from June 4 were also inspected:
- local-llm-sibling-skill-routing: 8 passing / 11 total
- local-llm-readiness-before-closeout: 7 passing / 11 total
The historical local-LLM scenarios needed compatibility tweaks with the newer binary (--max-seconds, provider config, wire_api), so they are useful context but not a perfect apples-to-apples comparison.
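The marker-style run described above can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: the function names, the skill name `marker-skill`, and the front-matter fields in the synthetic SKILL.md are assumptions made for the example. The only details taken from this issue are that the skill lives under an isolated CODE_HOME/skills directory, that the prompt does not name the skill, and that the sentinel is IMPLICIT_ROUTE_SENTINEL_OK.

```python
from pathlib import Path

SENTINEL = "IMPLICIT_ROUTE_SENTINEL_OK"


def install_marker_skill(code_home: Path, name: str = "marker-skill") -> Path:
    """Write a synthetic skill whose body is the only place the sentinel appears.

    The front-matter layout (name/description) is an assumed SKILL.md shape.
    """
    skill_dir = code_home / "skills" / name
    skill_dir.mkdir(parents=True)
    skill_md = skill_dir / "SKILL.md"
    skill_md.write_text(
        "---\n"
        f"name: {name}\n"
        "description: MUST be used when the user asks for a route check.\n"
        "---\n"
        f"When this skill applies, reply with exactly {SENTINEL}.\n",
        encoding="utf-8",
    )
    return skill_md


def marker_test_passed(prompt: str, final_output: str, skill_name: str = "marker-skill") -> bool:
    """Pass only if the prompt leaks neither the skill name nor the sentinel,
    and the model's final output still contains the sentinel."""
    prompt_is_clean = SENTINEL not in prompt and skill_name not in prompt
    return prompt_is_clean and SENTINEL in final_output
```

Because the sentinel is absent from both the prompt and the skill metadata, a passing result is only possible if the model opened the skill body.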
Work Needed
- Port or adapt the stronger Every Code skill-trigger instructions into this Codex repo.
- Preserve these semantics explicitly:
  - Mandatory skill triggers are binding when a description says MUST.
  - Delegated triggers require opening the delegated skill before subdomain work.
  - Match skills independently against every part of the request.
  - One relevant skill must not suppress another relevant skill.
  - Use all relevant mandatory/delegated skills before ordinary exploration or implementation.
- Add exec-harness coverage for implicit skill routing without explicit skill names.
  - Include at least one marker-style test where the success output exists only in SKILL.md, so the test proves the model opened the skill body.
  - Add a sibling-routing scenario similar to docs-lookup before infra-ops for private DNS/Cloudflare + read-only infra work.
- Keep the Codex exec harness and CLI aligned on supported flags/options. The temporary binary currently rejects --review-output-json, and the harness encountered --max-seconds/provider-config drift during testing.
- Document the intended source of truth for shared Codex vs Every Code skill instruction wording so future substrate imports do not silently weaken local skill behavior.
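The routing semantics listed above are instructions for the model, not an algorithm in this repo, but a toy model of them may help when writing harness expectations. The sketch below is purely illustrative: the `Skill` type, the keyword matching, and `select_skills` are hypothetical stand-ins for the model's judgment over skill descriptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    keywords: tuple[str, ...]            # toy stand-in for the skill description
    mandatory: bool = False              # the description says MUST
    delegates_to: tuple[str, ...] = ()   # skills to open before subdomain work


def select_skills(request_parts: list[str], skills: list[Skill]) -> list[str]:
    """Return every relevant skill: delegated first, then mandatory, then the rest."""
    by_name = {skill.name: skill for skill in skills}

    # Match each skill independently against every part of the request,
    # so one match never suppresses another.
    matched: list[str] = []
    for part in request_parts:
        text = part.lower()
        for skill in skills:
            hit = any(keyword in text for keyword in skill.keywords)
            if hit and skill.name not in matched:
                matched.append(skill.name)

    # Delegated skills must be opened before the skill that delegates to them.
    delegated: list[str] = []
    for name in matched:
        for target in by_name[name].delegates_to:
            if target in by_name and target not in delegated:
                delegated.append(target)

    mandatory = [n for n in matched if by_name[n].mandatory and n not in delegated]
    ordinary = [n for n in matched if n not in delegated and n not in mandatory]
    return delegated + mandatory + ordinary
```

For a request with a docs part and an infra part, both skills come back rather than only the first match, which is the property the sibling-routing scenario is meant to check.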
Related Local Artifact
The broader audit queue is saved at:
/Users/cbusillo/Developer/weak-skills-commit-audit.md
It lists local commits that may need follow-up because they were created while weak skill routing may have bypassed expected gates such as love gate, JetBrains inspection, design-collaboration, repo-readiness, or similar skill-controlled workflows.
Acceptance Criteria
- Codex has skill-trigger guidance equivalent in strength to the temporary fix in the sibling code repo.
- Exec-harness tests fail if the model does not open an unnamed matching skill.
- Exec-harness tests cover delegated/sibling skill routing.
- Harness tests run cleanly against the current CLI without stale flags or provider config.
- The fix is documented enough that future Codex/Every Code instruction divergence is intentional and reviewable.
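One way to catch stale flags before a harness run is to compare the flags a scenario passes against the flags the CLI's help text advertises. The sketch below assumes the help text has already been captured as a string; the function names and the sample help text in the usage note are hypothetical, not real CLI output.

```python
import re


def flags_in_help(help_text: str) -> set[str]:
    """Collect every long flag (for example --max-seconds) mentioned in help output."""
    return set(re.findall(r"(?<![\w-])--[a-z][a-z0-9-]*", help_text))


def stale_flags(harness_flags: list[str], help_text: str) -> list[str]:
    """Return the harness flags that the help text does not mention, sorted."""
    supported = flags_in_help(help_text)
    return sorted(flag for flag in harness_flags if flag not in supported)
```

As a usage example, with an invented help text that lists only `--max-seconds <N>`, calling `stale_flags(["--max-seconds", "--review-output-json"], help_text)` would report `--review-output-json` as stale, and a harness could fail fast on a non-empty result.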
Current Status
2026-06-10: Branch port/exec-harness-pilot now has a Codex-native exec harness plus the skills prompt fix. The harness proof core supports isolated workspaces, isolated CODEX_HOME, fake /v1/responses, request-body assertions, artifact output under .tmp/codex-exec-harness/, multi-turn resume, event-type assertions, explicit scenario-owned provider config, and a root Just entrypoint.
Completed harness feature slices:
- Multi-turn/resume scenarios
  - Added a turns scenario shape.
  - Turn 1 runs codex exec; later turns run codex exec resume <thread_id>.
  - The harness captures thread_id from the real thread.started JSONL event.
  - Added multi-turn-resume.json, which proves two turns, two fake Responses requests, and a resumed thread against codex-rs/target/debug/codex.
- Richer request/event assertions
  - Added expect.turn_count.
  - Added expect.thread_id = "required".
  - Added expect.turns for per-turn returncode, event_count, responses_request_count, thread_id, and event_types.
  - Added global expect.event_types.
- Explicit local-provider/local-LLM-style config
  - Added {responses_base_url} substitution for scenario-owned config_toml provider definitions.
  - Added local-provider-config.json, which proves Codex uses an explicit scenario provider pointing at the fake local Responses server.
  - The harness still does not inherit real provider config and does not silently fall back to a cloud provider.
- Canonical local entrypoint
  - Added just exec-harness-test.
  - The recipe builds codex-cli and runs every harness scenario through tools/codex-exec-harness/run_all.py.
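Two of the mechanics above, capturing thread_id from the thread.started event and substituting {responses_base_url} into scenario-owned config, could look roughly like this. It is a sketch under assumptions, not the code in tools/codex-exec-harness: the function names are hypothetical, and the JSONL field names `type` and `thread_id` are assumed from the event and key names mentioned in this issue.

```python
import json


def event_types(jsonl: str) -> list[str]:
    """Return the type of each JSONL event, in order, skipping blank lines."""
    return [json.loads(line)["type"] for line in jsonl.splitlines() if line.strip()]


def thread_id_from_events(jsonl: str) -> str:
    """Return the thread_id carried by the first thread.started event."""
    for line in jsonl.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("type") == "thread.started":
            return event["thread_id"]
    raise ValueError("no thread.started event found in exec output")


def render_config(config_toml: str, responses_base_url: str) -> str:
    """Substitute the placeholder with the fake server URL.

    Plain str.replace is used instead of str.format so that other braces
    in the TOML are left untouched.
    """
    return config_toml.replace("{responses_base_url}", responses_base_url)
```

A later turn would then pass the captured value to codex exec resume <thread_id>, and raising on a missing thread.started event keeps a broken first turn from silently starting a fresh thread.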
Current green proof:
- just exec-harness-test passed against /Users/cbusillo/Developer/codex/codex-rs/target/debug/codex.
- cd codex-rs && cargo clippy --tests -p codex-core-skills completed cleanly as the dry-run.
- cd codex-rs && just fix -p codex-core-skills completed cleanly.
- cd codex-rs && just fmt completed with only the existing uv/ruff exclude-newer = "7 days" warnings.
- An earlier focused run of cd codex-rs && just test -p codex-core-skills passed 101 tests after the harness/provider slices.
Remaining harness backlog:
- Fake GitHub CLI/service support
  - Add only when a Codex GitHub automation scenario needs it.
  - Keep it as an explicit service fixture, not always-on harness behavior.
- Explicit live/auth runs
  - Add only for live-model smoke tests that cannot be proven with fake Responses.
  - Make auth inheritance opt-in and loud.
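As a possible shape for "opt-in and loud" auth inheritance, the sketch below builds the child-process environment with credentials stripped by default and prints a warning whenever they are passed through. Everything here is hypothetical: the function name, the `inherit_auth` parameter, and the assumption that OPENAI_API_KEY is the only credential variable that matters.

```python
import sys

# Assumption: the set of credential-bearing variables; a real harness may need more.
AUTH_ENV_VARS = ("OPENAI_API_KEY",)


def build_env(base: dict[str, str], codex_home: str, inherit_auth: bool = False, log=None) -> dict[str, str]:
    """Return the environment for a harness run, isolated from real auth by default."""
    log = log if log is not None else sys.stderr
    # Start from the parent environment with every credential variable removed.
    env = {key: value for key, value in base.items() if key not in AUTH_ENV_VARS}
    env["CODEX_HOME"] = codex_home
    if inherit_auth:
        for name in AUTH_ENV_VARS:
            if name in base:
                env[name] = base[name]
        # Loud: the run announces that it is using real credentials.
        print("WARNING: harness run is inheriting real auth from the parent environment", file=log)
    return env
```

Defaulting to the stripped environment means a scenario cannot reach a live provider unless it asks for that explicitly.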
Next action: package the current prompt + harness foundation into a PR-sized first slice.