Skip to content

Phase F.1: skill-eval engine (MVP) + framework fixes#158

Merged
azalio merged 12 commits into
mainfrom
cinder-copper
Jun 3, 2026
Merged

Phase F.1: skill-eval engine (MVP) + framework fixes#158
azalio merged 12 commits into
mainfrom
cinder-copper

Conversation

@azalio
Copy link
Copy Markdown
Owner

@azalio azalio commented Jun 3, 2026

Phase F.1 — Skill-Eval Engine (MVP)

Builds the MAP Framework skill-evaluation engine: run an eval set against a skill under the full installed skill set, observe which skill actually fired, check deterministic assertions, and report pass-rate + token/duration stats with durable, resumable run artifacts. Optimizer + HTML viewer are intentionally deferred to Phase F.2.

What ships

  • src/mapify_cli/skills_eval/:
    • eval_schema.py — single-source Contract-First types (EvalSetEntry, DispatchResult, EvalResultRecord, make_cell_id); reuses token_budget.TokenUsage.
    • dispatcher.pyVariantDispatcher ABC + MockDispatcher (zero subprocess) + ClaudeSubprocessDispatcher (claude -p --output-format json in an isolated temp cwd seeded with .claude/, transcript-parse trigger detection, bounded jittered backoff).
    • assertions.py — deterministic runner (contains/not_contains/regex/valid_json/trigger/not_trigger).
    • runner.py — prompts×runs matrix (variants=1), durable per-cell append-only .jsonl, --resume by cell_id (no dupes).
    • aggregator.py — null-tolerant pass-rate + token/duration mean±stddev, bounded --max-concurrency (lock-serialized writes).
  • CLI: mapify skill-eval run <skill> [--eval-set] [--dry-run] [--resume] [--max-concurrency]; --dry-run spends zero quota; gates on shutil.which("claude").
  • Shipped map-skill-eval skill (authored in templates_src/, rendered to all trees; make check-render green).
  • Full MockDispatcher test suite + fixtures + an AST guard that fails on any anthropic import / ANTHROPIC_API_KEY read.

Invariants

INV-1 render parity · INV-2 zero real claude -p in tests · INV-3 no anthropic/API-key (AST-enforced) · INV-4 durable resume · INV-5 temp-cwd isolation · INV-6 single-source schema · INV-7 detection authored only after the ST-001 spike, single-source in the dispatcher.

Also in this PR (found while building F.1)

  • docs(map-skill-eval): corrected the shipped SKILL.md eval-set format (was a fictional YAML schema the JSON-only loader can't parse) and the --max-concurrency default (4 → 1).
  • docs(map-efficient): note that record_test_baseline returns compact JSON — don't pipe through head/tail.
  • fix(framework) — two MAP self-bugs:
    1. safety-guardrails rm -rf / pattern was over-broad — blocked legitimate temp-scratch cleanup and any command merely mentioning such a path; narrowed via negative lookahead (still blocks bare root, system dirs, root-glob, and the temp root itself).
    2. record_subtask_result raised a false "Possible Actor truncation" warning for gitignored-but-present deliverables (.map/ artifacts); now filters declared paths through git check-ignore. Negative-proved.
  • docs(learning): 8 new /map-learn rules + the "fix MAP framework self-errors immediately" CLAUDE.md rule.

Verification

make check green: 2129 passed, ruff/mypy/pyright 0/0/0, check-render ✅. Every subtask passed Monitor; final-verifier PASS.

🤖 Generated with Claude Code

azalio and others added 12 commits June 3, 2026 23:24
Single-source Contract-First types for the Phase F.1 skill-eval engine:
EvalSetEntry, DispatchResult, EvalResultRecord (+ to_dict/from_dict round-trip),
and make_cell_id(). Reuses token_budget.TokenUsage (INV-6); no anthropic
import / no ANTHROPIC_API_KEY (INV-3). Pure data layer — no dispatch,
transcript parsing, or assertion logic.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…atchers (ST-003)

VariantDispatcher ABC with two implementations:
- MockDispatcher: returns caller-set DispatchResult, zero subprocess (INV-2) —
  the dispatcher all CI tests use.
- ClaudeSubprocessDispatcher: shells `claude -p --output-format json` in a
  throwaway temp cwd seeded with .claude/ + fresh empty .map/ (INV-5), applies
  the spike's seed-time temp-flip of disable-model-invocation to temp copies
  only (production templates untouched), parses .result/.usage/.session_id from
  the envelope, derives triggered_skill from the transcript Skill tool_use, and
  uses bounded jittered backoff (persistent failure -> DispatchResult.error,
  never raises). Temp dir always cleaned up (try/finally).

No anthropic import / no ANTHROPIC_API_KEY (INV-3); reuses DispatchResult +
TokenUsage; stdlib only. Mirrors the memory/finalize.py claude -p precedent.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pure, no-LLM/no-subprocess assertion engine over DispatchResult:
contains, not_contains, regex (invalid pattern -> FAIL, never raises),
valid_json, trigger, and not_trigger. not_trigger correctly PASSES when
triggered_skill is None (SC-3 should_not_trigger enforcement). Unknown type
and missing keys return a FAIL AssertionResult with a debuggable detail
rather than raising, keeping the eval matrix robust. run_assertions() splits
into passed/failed detail lists for EvalResultRecord.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… (ST-005)

runner.py executes the prompts × runs matrix (variants fixed=1, D10) through an
injected VariantDispatcher, synthesizes trigger/not_trigger assertions from
should_trigger/should_not_trigger, and appends each completed cell immediately
to .map/eval-runs/<skill>/<timestamp>.jsonl as a parseable EvalResultRecord
(INV-4). --resume reads present cell_ids (tolerating blank/malformed lines) and
skips them, so a killed run resumes to a complete set with no duplicates. A
per-cell dispatch error is recorded (dispatch_error assertion) without aborting
the matrix (VC4). Trigger detection is NOT re-implemented here — the runner
consumes DispatchResult.triggered_skill (single-source detection in the
dispatcher, INV-7). run_eval is clock-free (caller passes out_path).

Adds tests/test_skills_eval_runner.py: one MockDispatcher-driven test per VC
(VC1 matrix/no-variants, VC2 durable per-cell, VC3 resume-no-dupes + malformed
-line tolerance, VC4 error-not-fatal) plus load_eval_set validation. Zero real
claude -p (INV-2).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…(ST-006)

aggregator.py:
- aggregate(records) -> AggregateSummary: pass_rate (zero-failed-assertion
  cells / total), mean±stddev for token totals and duration. Null token_usage
  is excluded from token stats; all-null -> token stats None (not error);
  never raises on n<2 (stddev 0.0) or empty input (VC1/VC2/VC4).
- bounded_run(..., max_concurrency=1): runs the missing-cell set through a
  bounded ThreadPoolExecutor; durable .jsonl appends serialized under a lock
  (no interleaving/corruption); resume keys on cell_id so order-independent
  and dup-free at any concurrency (SC-1). INV-5 isolation is automatic — each
  ClaudeSubprocessDispatcher dispatch owns its temp cwd.

runner.py: extracted the per-cell body into a shared evaluate_cell() helper
(behavior-identical; run_eval and bounded_run both use it — DRY). ST-005 tests
stay green.

Adds tests/test_skills_eval_aggregator.py (15 tests: VC1-VC4, SC-1, concurrent
write integrity, resume-no-dupes). stdlib only; no anthropic (INV-3).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Wires skill_eval_app via add_typer (mirrors validate_app). Command:
`mapify skill-eval run <skill> [--eval-set PATH] [--dry-run] [--resume]
[--max-concurrency N]` (no --variants/--runs; runs fixed=1 per D10).

- --dry-run validates the eval-set schema and prints the planned invocation
  count (prompts × 1 × 1) without constructing a dispatcher or spending quota.
- Malformed/empty eval-set → Exit 2 with no invocation (SC-2).
- Real run gates on shutil.which("claude"); absent → "requires-cmd: claude"
  and non-zero exit BEFORE any invocation (HC-6). Then runs the matrix via
  aggregator.bounded_run and prints the AggregateSummary + artifact path.

Command body uses lazy skills_eval imports (mirrors validate_graph); no
anthropic import. Adds 4 CliRunner tests (registration, dry-run-no-dispatch,
missing-claude-nonzero, malformed→Exit2).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…008)

Adds the map-skill-eval skill authored in templates_src/ and propagated to all
generated trees via `make render-templates`:
- templates_src/skills/map-skill-eval/SKILL.md.jinja (task skill; effort medium;
  disable-model-invocation; description with Use-when / Do-NOT-use).
- skill-rules.json.jinja entry: skillClass "task", enforcement manual,
  requires-cmd ["claude"], prompt triggers. Renders into .claude/ + templates/.
- No codex variant (Claude-specific, gated on the claude CLI).
- Bumps the skill-count guard in test_skills_consistency.py 15 → 16.

make check-render green (generated trees match source); skill/consistency/
template-render tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…xture (ST-009)

Completes the skills_eval test suite (all MockDispatcher/monkeypatched — zero
real claude -p, INV-2):
- Dispatcher: ABC returns DispatchResult, MockDispatcher zero-subprocess,
  bounded backoff (exactly max_retries+1 calls), and subprocess cwd is a seeded
  temp dir (.claude + .map) not the repo .map (INV-5).
- Assertions: contains/regex match+nonmatch, valid_json pass+fail, trigger/
  not_trigger incl. None (SC-3).
- AC-11/INV-3: test_vc2_no_anthropic_import_in_skills_eval ast-scans every .py
  under skills_eval/ and fails on any anthropic import / ANTHROPIC_API_KEY read
  (negative-proved: RED when an import is injected, GREEN when restored).
- End-to-end: load_eval_set(fixture) -> run via MockDispatcher -> aggregate.
- Adds tests/skills_eval/fixtures/map_debug_eval_set.json.

make check green: 2118 passed, ruff clean, check-render ✅.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ault

Final-verifier surfaced doc/impl drift in the shipped SKILL.md:
- --max-concurrency default documented as 4; actual default is 1.
- Eval-Set Format showed a YAML `skill:/cases:/id:` schema that load_eval_set
  does NOT accept (the loader is JSON `{"entries":[{prompt, should_trigger,
  should_not_trigger, assertions}]}`). Rewrote the example to the real JSON
  schema and switched the --eval-set examples from .yaml to .json.

Edited templates_src source + re-rendered; make check-render green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… no head/tail

Agents were defensively piping `record_test_baseline` through `| tail -N`
(seen in the wild). The command captures the test run internally and prints a
single compact JSON object at the end — truncating it only hides fields and
contradicts the repo bash guidelines. Add an explicit "read the JSON directly,
do not pipe through head/tail" note next to the pre-flight baseline command in
both the SKILL body and efficient-reference. Edited templates_src; re-rendered.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1. safety-guardrails: the destructive-rm pattern was over-broad. It blocked
   legitimate temp-scratch cleanup under /tmp, /private/tmp and /var/folders,
   and even read-only commands that merely mentioned such a path. Narrow it
   with a negative lookahead that still blocks bare root, system dirs
   (/etc, /home/...), root-glob, and the temp ROOT itself, but allows deletion
   of subpaths under a temp root. (Command-position parsing for the
   mentioned-inside-grep/echo case is intentionally left as a separate,
   reviewed change: narrowing a security blocklist via shell parsing risks a
   bypass.)

2. record_subtask_result: files_not_in_diff raised a false "Possible Actor
   truncation" warning for declared deliverables that are gitignored-but-
   present on disk (e.g. .map/ spike docs, eval-run .jsonl). Filter declared
   paths through `git check-ignore` before warning -- a gitignored file that
   exists is intentionally out of git, not truncation. A gitignored file that
   is also missing is still flagged via missing_files; a tracked-unchanged
   file still surfaces (negative-control test).

Both edited at the templates_src source and re-rendered. Adds regression tests
(temp-allow + roots-blocked; gitignored-suppressed + tracked-unchanged-flagged,
negative-proved). make check green (2129 passed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- .claude/rules/learned/: 8 new lessons from /map-learn (spike-first gating,
  producer-owns-parse, claude -p two-channel output, scoped config-flag temp
  mutation, clock-free durable writers, concurrent durable append, blueprint-
  named-tests-are-a-contract, final-verify shipped docs).
- CLAUDE.md: the "fix MAP framework self-errors immediately" rule from planning.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@azalio azalio merged commit cbe3a04 into main Jun 3, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant