Align Codex skill routing with Every Code guidance and harness tests #23

@shiny-code-bot

Summary

Over the last several days, GPT-5.5/Codex behavior appeared to regress around skill usage: the model was more likely to apologize, misinterpret requests, and especially fail to reach for skills that should have triggered implicitly. We investigated rollout files and local repo changes, then rebuilt the sibling code binary with stronger Every Code skill-trigger guidance as a temporary mitigation.

This issue tracks the durable fix needed in this codex repo: align Codex's skill instructions and harness coverage with the stronger behavior now in the sibling code repo, and make sure future changes cannot silently weaken implicit skill routing again.

Findings

  • The live temporary fix was made in /Users/cbusillo/Developer/code, not this repo.
  • This checkout, /Users/cbusillo/Developer/codex, is currently unchanged and does not contain the fix.
  • The strongest local suspect was the newer skill-rendering/substrate path, in which full skill bodies are injected only after the model chooses a skill from the available-skills metadata.
  • The prompt-level skill guidance was weaker than Every Code's previous guidance, especially around mandatory triggers, delegated skill triggers, and using all relevant skills rather than treating one match as suppressing others.
  • Mediaforce rollout review showed examples where relevant skills were available in metadata but not opened, or opened but not reliably applied.
  • A temporary rebuild restored stronger trigger guidance and improved behavior in focused harness tests.

Temporary Mitigation Already Done Outside This Repo

In sibling repo /Users/cbusillo/Developer/code, branch fix/restore-skill-trigger-guidance:

  • Strengthened skill trigger guidance in code-rs/core-skills/src/render.rs.
  • Fixed scripts/local/rebuild-path-code.sh to build the current package name.
  • Rebuilt /Users/cbusillo/.local/bin/code, which points to /Users/cbusillo/Developer/code/code-rs/target/release/code.

The rebuilt binary takes effect only after sessions and processes are restarted. Existing resumed sessions can still carry the old skill prompt and context.

Harness Evidence

Read-only exec harness runs were used to test whether the rebuilt binary reaches for unnamed matching skills.

Passing GPT-5.5 runs:

  • /tmp/code-exec-harness-skill-test/20260610-121616-implicit-skill-marker-current-code-home
    • Synthetic skill installed through isolated CODE_HOME/skills.
    • Prompt did not name the skill.
    • The model selected and read the skill body and returned IMPLICIT_ROUTE_SENTINEL_OK.
  • /tmp/code-exec-harness-skill-test/20260610-121959-local-llm-sibling-skill-routing
    • GPT-5.5 classified a DNS/Cloudflare docs + read-only infra request as:
      • docs_phase_skill=docs-lookup
      • ops_phase_skill=infra-ops
      • sequence=docs-before-ops
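
The sibling-routing result above can be asserted mechanically. The following is a minimal sketch, assuming the model's final message carries the three fields as one key=value pair per line; the function names parse_routing and routing_ok are hypothetical and not part of the existing harness.

```python
def parse_routing(output: str) -> dict[str, str]:
    """Collect key=value pairs from the final message, one pair per line."""
    fields: dict[str, str] = {}
    for line in output.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key:
            fields[key.strip()] = value.strip()
    return fields


def routing_ok(output: str) -> bool:
    """True only if both skills were chosen and docs work is sequenced first."""
    fields = parse_routing(output)
    return (
        fields.get("docs_phase_skill") == "docs-lookup"
        and fields.get("ops_phase_skill") == "infra-ops"
        and fields.get("sequence") == "docs-before-ops"
    )
```

Requiring all three fields means a run that picks only one of the two skills fails, which matches the "one relevant skill must not suppress another" requirement.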

Good-day historical baseline artifacts from June 4 were also inspected:

  • local-llm-sibling-skill-routing: 8 passing / 11 total
  • local-llm-readiness-before-closeout: 7 passing / 11 total

The historical local-LLM scenarios needed compatibility tweaks with the newer binary (--max-seconds, provider config, wire_api), so they are useful context but not a perfect apples-to-apples comparison.

Work Needed

  • Port or adapt the stronger Every Code skill-trigger instructions into this Codex repo.
  • Preserve these semantics explicitly:
    • Mandatory skill triggers are binding when a description says MUST.
    • Delegated triggers require opening the delegated skill before subdomain work.
    • Match skills independently against every part of the request.
    • One relevant skill must not suppress another relevant skill.
    • Use all relevant mandatory/delegated skills before ordinary exploration or implementation.
  • Add exec-harness coverage for implicit skill routing without explicit skill names.
  • Include at least one marker-style test where the success output exists only in SKILL.md, so the test proves the model opened the skill body.
  • Add a sibling-routing scenario similar to docs-lookup before infra-ops for private DNS/Cloudflare + read-only infra work.
  • Keep the Codex exec harness and CLI aligned on supported flags/options. The temporary binary currently rejects --review-output-json, and the harness encountered --max-seconds/provider-config drift during testing.
  • Document the intended source of truth for shared Codex vs Every Code skill instruction wording so future substrate imports do not silently weaken local skill behavior.
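
A marker-style test along these lines could look like the sketch below. It assumes skills live at <home>/skills/<name>/SKILL.md with a name/description front matter block; that layout, the skill name route-marker, and both function names are illustrative assumptions, not the harness's actual API. Only the sentinel string comes from the passing run recorded above.

```python
from pathlib import Path

SENTINEL = "IMPLICIT_ROUTE_SENTINEL_OK"


def install_synthetic_skill(home: Path, name: str = "route-marker") -> Path:
    """Write a synthetic skill whose body, not its metadata, holds the sentinel."""
    skill_dir = home / "skills" / name  # assumed layout under the isolated home
    skill_dir.mkdir(parents=True, exist_ok=True)
    skill_file = skill_dir / "SKILL.md"
    skill_file.write_text(
        "---\n"
        f"name: {name}\n"
        "description: MUST be used when asked to verify widget routing.\n"
        "---\n"
        f"When this skill applies, reply with exactly: {SENTINEL}\n",
        encoding="utf-8",
    )
    return skill_file


def marker_test_passes(prompt: str, final_output: str, name: str = "route-marker") -> bool:
    """Pass only if the sentinel appears in the output of an unnamed-skill prompt."""
    if SENTINEL in prompt or name in prompt:
        # A leaking prompt would prove nothing about implicit routing.
        raise ValueError("prompt must not contain the sentinel or the skill name")
    return SENTINEL in final_output
```

Because the sentinel is absent from both the prompt and the skill description, the model can only produce it after opening the skill body.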

Related Local Artifact

The broader audit queue is saved at:

/Users/cbusillo/Developer/weak-skills-commit-audit.md

It lists local commits that may need follow-up because they were created while weak skill routing may have bypassed expected gates such as love gate, JetBrains inspection, design-collaboration, repo-readiness, or similar skill-controlled workflows.

Acceptance Criteria

  • Codex has skill-trigger guidance equivalent in strength to the temporary fix in the sibling code repo.
  • Exec-harness tests fail if the model does not open an unnamed matching skill.
  • Exec-harness tests cover delegated/sibling skill routing.
  • Harness tests run cleanly against the current CLI without stale flags or provider config.
  • The fix is documented enough that future Codex/Every Code instruction divergence is intentional and reviewable.

Current Status

2026-06-10: Branch port/exec-harness-pilot now has a Codex-native exec harness plus the skills prompt fix. The harness proof core supports isolated workspaces, isolated CODEX_HOME, fake /v1/responses, request-body assertions, artifact output under .tmp/codex-exec-harness/, multi-turn resume, event-type assertions, explicit scenario-owned provider config, and a root Just entrypoint.
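
For orientation, the fake /v1/responses piece with request-body capture can be sketched with the Python standard library alone. This is a simplified illustration, not the harness's implementation: the class name RecordingResponsesServer is hypothetical, and the JSON reply is a placeholder rather than the real Responses wire format that Codex consumes.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class RecordingResponsesServer:
    """Local HTTP server that records every JSON body POSTed to /v1/responses."""

    def __init__(self) -> None:
        self.requests: list[dict] = []
        recorder = self

        class Handler(BaseHTTPRequestHandler):
            def do_POST(self) -> None:
                length = int(self.headers.get("Content-Length", "0"))
                raw = self.rfile.read(length)
                if self.path != "/v1/responses":
                    self.send_error(404)
                    return
                recorder.requests.append(json.loads(raw or b"{}"))
                # Placeholder reply; a real fake must mimic the actual wire format.
                body = json.dumps({"placeholder": True}).encode()
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)

            def log_message(self, *args) -> None:
                pass  # keep test output quiet

        # Port 0 asks the OS for a free port, so parallel scenarios do not collide.
        self._server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
        self._thread = threading.Thread(target=self._server.serve_forever, daemon=True)

    @property
    def base_url(self) -> str:
        host, port = self._server.server_address[:2]
        return f"http://{host}:{port}/v1"

    def __enter__(self) -> "RecordingResponsesServer":
        self._thread.start()
        return self

    def __exit__(self, *exc) -> None:
        self._server.shutdown()
        self._server.server_close()
```

A scenario can then assert on server.requests after the run, for example that exactly one request was sent and that its body contains the expected skill instructions.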

Completed harness feature slices:

  1. Multi-turn/resume scenarios

    • Added a turns scenario shape.
    • Turn 1 runs codex exec; later turns run codex exec resume <thread_id>.
    • The harness captures thread_id from the real thread.started JSONL event.
    • Added multi-turn-resume.json, which proves two turns, two fake Responses requests, and a resumed thread against codex-rs/target/debug/codex.
  2. Richer request/event assertions

    • Added expect.turn_count.
    • Added expect.thread_id = "required".
    • Added expect.turns for per-turn returncode, event_count, responses_request_count, thread_id, and event_types.
    • Added global expect.event_types.
  3. Explicit local-provider/local-LLM-style config

    • Added {responses_base_url} substitution for scenario-owned config_toml provider definitions.
    • Added local-provider-config.json, which proves Codex uses an explicit scenario provider pointing at the fake local Responses server.
    • The harness still does not inherit real provider config and does not silently fall back to a cloud provider.
  4. Canonical local entrypoint

    • Added just exec-harness-test.
    • The recipe builds codex-cli and runs every harness scenario through tools/codex-exec-harness/run_all.py.
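
The resume and provider-config slices above reduce to a few small operations, sketched here under stated assumptions: each JSONL event is taken to be an object with a "type" field and, for thread.started, a "thread_id" field; the prompt is assumed to be passed as a trailing positional argument; and any flag needed to enable JSONL output is omitted. The helper names are hypothetical.

```python
import json
from typing import Optional


def extract_thread_id(jsonl_output: str) -> str:
    """Return thread_id from the first thread.started event in JSONL output."""
    for line in jsonl_output.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON noise between events
        if isinstance(event, dict) and event.get("type") == "thread.started":
            return event["thread_id"]
    raise ValueError("no thread.started event found")


def command_for_turn(index: int, prompt: str, thread_id: Optional[str]) -> list[str]:
    """Turn 1 starts a thread; later turns resume the captured thread."""
    if index == 0:
        return ["codex", "exec", prompt]
    if thread_id is None:
        raise ValueError("later turns need the thread_id captured from turn 1")
    return ["codex", "exec", "resume", thread_id, prompt]


def render_config(config_toml: str, responses_base_url: str) -> str:
    """Fill the {responses_base_url} placeholder in scenario-owned config."""
    return config_toml.replace("{responses_base_url}", responses_base_url)
```

Using plain string replacement rather than str.format keeps other braces in the TOML untouched.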

Current green proof:

  • just exec-harness-test passed against /Users/cbusillo/Developer/codex/codex-rs/target/debug/codex.
  • cd codex-rs && cargo clippy --tests -p codex-core-skills completed cleanly as the dry run.
  • cd codex-rs && just fix -p codex-core-skills completed cleanly.
  • cd codex-rs && just fmt completed with only the existing uv/ruff exclude-newer = "7 days" warnings.
  • An earlier focused run of cd codex-rs && just test -p codex-core-skills passed 101 tests after the harness/provider slices landed.

Remaining harness backlog:

  1. Fake GitHub CLI/service support

    • Add only when a Codex GitHub automation scenario needs it.
    • Keep it as an explicit service fixture, not always-on harness behavior.
  2. Explicit live/auth runs

    • Add only for live-model smoke tests that cannot be proven with fake Responses.
    • Make auth inheritance opt-in and loud.

Next action: package the current prompt + harness foundation into a PR-sized first slice.
