Align Codex skill routing with Every Code guidance and harness tests #23

## Summary

Over the last several days, GPT-5.5/Codex behavior appeared to regress around skill usage: the model was more likely to apologize, to misinterpret requests, and above all to skip skills that should have triggered implicitly. We investigated rollout files and local repo changes, then rebuilt the sibling `code` binary with stronger Every Code skill-trigger guidance as a temporary mitigation.

This issue tracks the durable fix needed in this `codex` repo: align Codex's skill instructions and harness coverage with the stronger behavior in `code`, and make sure future changes cannot silently weaken implicit skill routing again.

## Findings

- The live temporary fix was made in `/Users/cbusillo/Developer/code`, not this repo.
- This checkout, `/Users/cbusillo/Developer/codex`, is currently unchanged and does not contain the fix.
- The strongest local suspect was the newer skill-rendering/substrate path, in which full skill bodies are injected only after the model chooses a skill from the available-skills metadata.
- The prompt-level skill guidance was weaker than Every Code's previous guidance, especially around mandatory triggers, delegated skill triggers, and using all relevant skills rather than treating one match as suppressing others.
- Mediaforce rollout review showed examples where relevant skills were available in metadata but not opened, or opened but not reliably applied.
- A temporary rebuild restored stronger trigger guidance and improved behavior in focused harness tests.

## Temporary Mitigation Already Done Outside This Repo

In sibling repo `/Users/cbusillo/Developer/code`, branch `fix/restore-skill-trigger-guidance`:

- Strengthened skill trigger guidance in `code-rs/core-skills/src/render.rs`.
- Fixed `scripts/local/rebuild-path-code.sh` to build the current package name.
- Rebuilt `/Users/cbusillo/.local/bin/code`, which points to `/Users/cbusillo/Developer/code/code-rs/target/release/code`.

The rebuilt binary takes effect only after sessions and processes are restarted. Resumed sessions can still carry the old skill prompt and context.

## Harness Evidence

Read-only exec harness runs were used to test whether the rebuilt binary reaches for unnamed matching skills.

Passing GPT-5.5 runs:

- `/tmp/code-exec-harness-skill-test/20260610-121616-implicit-skill-marker-current-code-home`
  - Synthetic skill installed through isolated `CODE_HOME/skills`.
  - Prompt did not name the skill.
  - The model selected and read the skill body and returned `IMPLICIT_ROUTE_SENTINEL_OK`.
- `/tmp/code-exec-harness-skill-test/20260610-121959-local-llm-sibling-skill-routing`
  - GPT-5.5 classified a DNS/Cloudflare docs + read-only infra request as:
    - `docs_phase_skill=docs-lookup`
    - `ops_phase_skill=infra-ops`
    - `sequence=docs-before-ops`
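
The sibling-routing result above is a set of `key=value` fields, so a scenario could assert on it mechanically. The following is a minimal sketch, not the harness's actual code: it assumes the model's final output contains one `key=value` pair per line, and `parse_classification` and `routing_mismatches` are hypothetical helper names.

```python
# Expected classification, copied from the passing run listed above.
EXPECTED = {
    "docs_phase_skill": "docs-lookup",
    "ops_phase_skill": "infra-ops",
    "sequence": "docs-before-ops",
}


def parse_classification(output: str) -> dict[str, str]:
    """Collect key=value pairs, one per line; lines without '=' are ignored."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key:
            fields[key.strip()] = value.strip()
    return fields


def routing_mismatches(output: str, expected: dict[str, str] = EXPECTED) -> dict:
    """Return {field: actual_value} for every expected field that differs.

    A missing field is reported with the value None. An empty dict means
    the routing matched.
    """
    actual = parse_classification(output)
    return {
        key: actual.get(key)
        for key, want in expected.items()
        if actual.get(key) != want
    }
```

Returning the mismatching fields rather than a bare boolean keeps the failure message specific, for example when only `sequence` is wrong.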

Good-day historical baseline artifacts from June 4 were also inspected:

- `local-llm-sibling-skill-routing`: 8 passing / 11 total
- `local-llm-readiness-before-closeout`: 7 passing / 11 total

The historical local-LLM scenarios needed compatibility tweaks with the newer binary (`--max-seconds`, provider config, `wire_api`), so they are useful context but not a perfect apples-to-apples comparison.

## Work Needed

- Port or adapt the stronger Every Code skill-trigger instructions into this Codex repo.
- Preserve these semantics explicitly:
  - Mandatory skill triggers are binding when a description says MUST.
  - Delegated triggers require opening the delegated skill before subdomain work.
  - Match skills independently against every part of the request.
  - One relevant skill must not suppress another relevant skill.
  - Use all relevant mandatory/delegated skills before ordinary exploration or implementation.
- Add exec-harness coverage for implicit skill routing without explicit skill names.
- Include at least one marker-style test where the success output exists only in `SKILL.md`, so the test proves the model opened the skill body.
- Add a sibling-routing scenario similar to `docs-lookup` before `infra-ops` for private DNS/Cloudflare + read-only infra work.
- Keep the Codex exec harness and CLI aligned on supported flags/options. The temporary binary currently rejects `--review-output-json`, and the harness encountered `--max-seconds`/provider-config drift during testing.
- Document the intended source of truth for shared Codex vs Every Code skill instruction wording so future substrate imports do not silently weaken local skill behavior.
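
The marker-style test in the list above could be sketched roughly as follows. This is an illustration under stated assumptions, not the harness's implementation: `install_marker_skill`, `check_marker_run`, the `widget-calibration` skill, and the front-matter layout written into `SKILL.md` are all hypothetical. Only the sentinel string comes from the run recorded under Harness Evidence.

```python
from pathlib import Path

# Sentinel taken from the passing marker run recorded in Harness Evidence.
SENTINEL = "IMPLICIT_ROUTE_SENTINEL_OK"


def install_marker_skill(home: Path, name: str, description: str, sentinel: str) -> Path:
    """Write a synthetic skill whose body is the only place the sentinel appears.

    Assumption: skills live under <home>/skills/<name>/SKILL.md and start
    with a simple name/description front matter.
    """
    skill_dir = home / "skills" / name
    skill_dir.mkdir(parents=True, exist_ok=True)
    skill_md = skill_dir / "SKILL.md"
    skill_md.write_text(
        "---\n"
        f"name: {name}\n"
        f"description: {description}\n"
        "---\n"
        f"When this skill applies, reply with exactly: {sentinel}\n",
        encoding="utf-8",
    )
    return skill_md


def check_marker_run(prompt: str, skill_name: str, description: str,
                     sentinel: str, final_output: str) -> list[str]:
    """Return a list of problems; an empty list means the run proves implicit routing."""
    problems = []
    if skill_name.lower() in prompt.lower():
        problems.append("prompt names the skill, so routing was not implicit")
    if sentinel in prompt or sentinel in description:
        problems.append("sentinel leaks outside the skill body")
    if sentinel not in final_output:
        problems.append("output lacks the sentinel; the skill body was probably not opened")
    return problems
```

The leak checks matter as much as the output check: if the sentinel appeared in the prompt or in the skill metadata, a passing run would no longer show that the model read `SKILL.md`.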

## Related Local Artifact

The broader audit queue is saved at:

`/Users/cbusillo/Developer/weak-skills-commit-audit.md`

It lists local commits that may need follow-up because they were created while weak skill routing may have bypassed expected gates such as love gate, JetBrains inspection, design-collaboration, repo-readiness, or similar skill-controlled workflows.

## Acceptance Criteria

- Codex has skill-trigger guidance equivalent in strength to the temporary `code` fix.
- Exec-harness tests fail if the model does not open an unnamed matching skill.
- Exec-harness tests cover delegated/sibling skill routing.
- Harness tests run cleanly against the current CLI without stale flags or provider config.
- The fix is documented enough that future Codex/Every Code instruction divergence is intentional and reviewable.

## Current Status

2026-06-10: Branch `port/exec-harness-pilot` now has a Codex-native exec harness plus the skills prompt fix. The harness proof core supports isolated workspaces, isolated `CODEX_HOME`, fake `/v1/responses`, request-body assertions, artifact output under `.tmp/codex-exec-harness/`, multi-turn resume, event-type assertions, explicit scenario-owned provider config, and a root Just entrypoint.
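
For readers unfamiliar with the fake `/v1/responses` and request-body assertion pieces, the idea can be sketched with the Python standard library alone. This is a hypothetical illustration, not the code on the branch: `RecordingHandler`, `start_fake_responses`, and `post_json` are made-up names, and the JSON reply is a placeholder rather than the payload the real CLI expects from a Responses endpoint.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class RecordingHandler(BaseHTTPRequestHandler):
    """Records every JSON body posted to /v1/responses on the owning server."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", "0"))
        raw = self.rfile.read(length)
        if self.path != "/v1/responses":
            self.send_error(404)
            return
        self.server.recorded.append(json.loads(raw))
        # Placeholder reply; a real fake must emit what the client expects.
        payload = json.dumps({"status": "placeholder"}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        # Keep test output quiet.
        pass


def start_fake_responses():
    """Start the fake on an ephemeral loopback port; return (server, base_url)."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), RecordingHandler)
    server.recorded = []
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    return server, f"http://{host}:{port}/v1"


def post_json(url: str, body: dict) -> dict:
    """Small client used only to exercise the fake; it bypasses any proxy settings."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with opener.open(request, timeout=5) as response:
        return json.loads(response.read())
```

A scenario would point the provider's base URL at the returned `base_url`, run the CLI, and then assert on `server.recorded`, for instance on the number of requests or on fields inside each body.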

Completed harness feature slices:

1. Multi-turn/resume scenarios
   - Added a `turns` scenario shape.
   - Turn 1 runs `codex exec`; later turns run `codex exec resume <thread_id>`.
   - The harness captures `thread_id` from the real `thread.started` JSONL event.
   - Added `multi-turn-resume.json`, which proves two turns, two fake Responses requests, and a resumed thread against `codex-rs/target/debug/codex`.

2. Richer request/event assertions
   - Added `expect.turn_count`.
   - Added `expect.thread_id = "required"`.
   - Added `expect.turns` for per-turn `returncode`, `event_count`, `responses_request_count`, `thread_id`, and `event_types`.
   - Added global `expect.event_types`.

3. Explicit local-provider/local-LLM-style config
   - Added `{responses_base_url}` substitution for scenario-owned `config_toml` provider definitions.
   - Added `local-provider-config.json`, which proves Codex uses an explicit scenario provider pointing at the fake local Responses server.
   - The harness still does not inherit real provider config and does not silently fall back to a cloud provider.

4. Canonical local entrypoint
   - Added `just exec-harness-test`.
   - The recipe builds `codex-cli` and runs every harness scenario through `tools/codex-exec-harness/run_all.py`.
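
Slices 1 and 3 above rest on two small mechanisms: capturing `thread_id` from the `thread.started` JSONL event so later turns can resume, and substituting `{responses_base_url}` into scenario-owned `config_toml`. A minimal sketch of both follows. The function names are hypothetical, the event is assumed to carry `type` and `thread_id` as top-level fields, the position of the prompt argument is an assumption, and any other flags the real harness passes are omitted.

```python
import json
from typing import Optional


def extract_thread_id(jsonl_text: str) -> str:
    """Return thread_id from the first thread.started event in a JSONL stream."""
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("type") == "thread.started":
            return event["thread_id"]
    raise ValueError("no thread.started event found")


def command_for_turn(binary: str, index: int, prompt: str,
                     thread_id: Optional[str]) -> list[str]:
    """Turn 0 starts a thread; later turns resume the captured thread."""
    if index == 0:
        return [binary, "exec", prompt]
    if thread_id is None:
        raise ValueError("cannot resume without a captured thread_id")
    return [binary, "exec", "resume", thread_id, prompt]


def render_config(config_toml: str, responses_base_url: str) -> str:
    """Replace the {responses_base_url} placeholder with the fake server URL."""
    return config_toml.replace("{responses_base_url}", responses_base_url)
```

Using plain string replacement instead of `str.format` means other braces in the TOML are left untouched.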

Current green proof:

- `just exec-harness-test` passed against `/Users/cbusillo/Developer/codex/codex-rs/target/debug/codex`.
- `cd codex-rs && cargo clippy --tests -p codex-core-skills` completed cleanly as the dry run.
- `cd codex-rs && just fix -p codex-core-skills` completed cleanly.
- `cd codex-rs && just fmt` completed with only the existing uv/ruff `exclude-newer = "7 days"` warnings.
- An earlier focused run of `cd codex-rs && just test -p codex-core-skills` passed 101 tests after the harness/provider slices.

Remaining harness backlog:

5. Fake GitHub CLI/service support
   - Add only when a Codex GitHub automation scenario needs it.
   - Keep it as an explicit service fixture, not always-on harness behavior.

6. Explicit live/auth runs
   - Add only for live-model smoke tests that cannot be proven with fake Responses.
   - Make auth inheritance opt-in and loud.

Next action: package the current prompt + harness foundation into a PR-sized first slice.

