Phase F.1: skill-eval engine (MVP) + framework fixes by azalio · Pull Request #158 · azalio/map-framework

azalio · 2026-06-03T23:03:34Z

Phase F.1 — Skill-Eval Engine (MVP)

Builds the MAP Framework skill-evaluation engine: run an eval set against a skill under the full installed skill set, observe which skill actually fired, check deterministic assertions, and report pass-rate + token/duration stats with durable, resumable run artifacts. Optimizer + HTML viewer are intentionally deferred to Phase F.2.

What ships

src/mapify_cli/skills_eval/:
- eval_schema.py — single-source Contract-First types (EvalSetEntry, DispatchResult, EvalResultRecord, make_cell_id); reuses token_budget.TokenUsage.
- dispatcher.py — VariantDispatcher ABC + MockDispatcher (zero subprocess) + ClaudeSubprocessDispatcher (claude -p --output-format json in an isolated temp cwd seeded with .claude/, transcript-parse trigger detection, bounded jittered backoff).
- assertions.py — deterministic runner (contains/not_contains/regex/valid_json/trigger/not_trigger).
- runner.py — prompts×runs matrix (variants=1), durable per-cell append-only .jsonl, --resume by cell_id (no dupes).
- aggregator.py — null-tolerant pass-rate + token/duration mean±stddev, bounded --max-concurrency (lock-serialized writes).
CLI: mapify skill-eval run <skill> [--eval-set] [--dry-run] [--resume] [--max-concurrency]; --dry-run spends zero quota; gates on shutil.which("claude").
Shipped map-skill-eval skill (authored in templates_src/, rendered to all trees; make check-render green).
Full MockDispatcher test suite + fixtures + an AST guard that fails on any anthropic import / ANTHROPIC_API_KEY read.

Invariants

INV-1 render parity · INV-2 zero real claude -p in tests · INV-3 no anthropic/API-key (AST-enforced) · INV-4 durable resume · INV-5 temp-cwd isolation · INV-6 single-source schema · INV-7 detection authored only after the ST-001 spike, single-source in the dispatcher.

Also in this PR (found while building F.1)

docs(map-skill-eval): corrected the shipped SKILL.md eval-set format (was a fictional YAML schema the JSON-only loader can't parse) and the --max-concurrency default (4 → 1).
docs(map-efficient): note that record_test_baseline returns compact JSON — don't pipe through head/tail.
fix(framework) — two MAP self-bugs:
1. safety-guardrails rm -rf / pattern was over-broad — blocked legitimate temp-scratch cleanup and any command merely mentioning such a path; narrowed via negative lookahead (still blocks bare root, system dirs, root-glob, and the temp root itself).
2. record_subtask_result raised a false "Possible Actor truncation" warning for gitignored-but-present deliverables (.map/ artifacts); now filters declared paths through git check-ignore. Negative-proved.
docs(learning): 8 new /map-learn rules + the "fix MAP framework self-errors immediately" CLAUDE.md rule.

Verification

make check green: 2129 passed, ruff/mypy/pyright 0/0/0, check-render ✅. Every subtask passed Monitor; final-verifier PASS.

🤖 Generated with Claude Code

Single-source Contract-First types for the Phase F.1 skill-eval engine: EvalSetEntry, DispatchResult, EvalResultRecord (+ to_dict/from_dict round-trip), and make_cell_id(). Reuses token_budget.TokenUsage (INV-6); no anthropic import / no ANTHROPIC_API_KEY (INV-3). Pure data layer — no dispatch, transcript parsing, or assertion logic. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…atchers (ST-003) VariantDispatcher ABC with two implementations: - MockDispatcher: returns caller-set DispatchResult, zero subprocess (INV-2) — the dispatcher all CI tests use. - ClaudeSubprocessDispatcher: shells `claude -p --output-format json` in a throwaway temp cwd seeded with .claude/ + fresh empty .map/ (INV-5), applies the spike's seed-time temp-flip of disable-model-invocation to temp copies only (production templates untouched), parses .result/.usage/.session_id from the envelope, derives triggered_skill from the transcript Skill tool_use, and uses bounded jittered backoff (persistent failure -> DispatchResult.error, never raises). Temp dir always cleaned up (try/finally). No anthropic import / no ANTHROPIC_API_KEY (INV-3); reuses DispatchResult + TokenUsage; stdlib only. Mirrors the memory/finalize.py claude -p precedent. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Pure, no-LLM/no-subprocess assertion engine over DispatchResult: contains, not_contains, regex (invalid pattern -> FAIL, never raises), valid_json, trigger, and not_trigger. not_trigger correctly PASSES when triggered_skill is None (SC-3 should_not_trigger enforcement). Unknown type and missing keys return a FAIL AssertionResult with a debuggable detail rather than raising, keeping the eval matrix robust. run_assertions() splits into passed/failed detail lists for EvalResultRecord. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… (ST-005) runner.py executes the prompts × runs matrix (variants fixed=1, D10) through an injected VariantDispatcher, synthesizes trigger/not_trigger assertions from should_trigger/should_not_trigger, and appends each completed cell immediately to .map/eval-runs/<skill>/<timestamp>.jsonl as a parseable EvalResultRecord (INV-4). --resume reads present cell_ids (tolerating blank/malformed lines) and skips them, so a killed run resumes to a complete set with no duplicates. A per-cell dispatch error is recorded (dispatch_error assertion) without aborting the matrix (VC4). Trigger detection is NOT re-implemented here — the runner consumes DispatchResult.triggered_skill (single-source detection in the dispatcher, INV-7). run_eval is clock-free (caller passes out_path). Adds tests/test_skills_eval_runner.py: one MockDispatcher-driven test per VC (VC1 matrix/no-variants, VC2 durable per-cell, VC3 resume-no-dupes + malformed -line tolerance, VC4 error-not-fatal) plus load_eval_set validation. Zero real claude -p (INV-2). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…(ST-006) aggregator.py: - aggregate(records) -> AggregateSummary: pass_rate (zero-failed-assertion cells / total), mean±stddev for token totals and duration. Null token_usage is excluded from token stats; all-null -> token stats None (not error); never raises on n<2 (stddev 0.0) or empty input (VC1/VC2/VC4). - bounded_run(..., max_concurrency=1): runs the missing-cell set through a bounded ThreadPoolExecutor; durable .jsonl appends serialized under a lock (no interleaving/corruption); resume keys on cell_id so order-independent and dup-free at any concurrency (SC-1). INV-5 isolation is automatic — each ClaudeSubprocessDispatcher dispatch owns its temp cwd. runner.py: extracted the per-cell body into a shared evaluate_cell() helper (behavior-identical; run_eval and bounded_run both use it — DRY). ST-005 tests stay green. Adds tests/test_skills_eval_aggregator.py (15 tests: VC1-VC4, SC-1, concurrent write integrity, resume-no-dupes). stdlib only; no anthropic (INV-3). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Wires skill_eval_app via add_typer (mirrors validate_app). Command: `mapify skill-eval run <skill> [--eval-set PATH] [--dry-run] [--resume] [--max-concurrency N]` (no --variants/--runs; runs fixed=1 per D10). - --dry-run validates the eval-set schema and prints the planned invocation count (prompts × 1 × 1) without constructing a dispatcher or spending quota. - Malformed/empty eval-set → Exit 2 with no invocation (SC-2). - Real run gates on shutil.which("claude"); absent → "requires-cmd: claude" and non-zero exit BEFORE any invocation (HC-6). Then runs the matrix via aggregator.bounded_run and prints the AggregateSummary + artifact path. Command body uses lazy skills_eval imports (mirrors validate_graph); no anthropic import. Adds 4 CliRunner tests (registration, dry-run-no-dispatch, missing-claude-nonzero, malformed→Exit2). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…008) Adds the map-skill-eval skill authored in templates_src/ and propagated to all generated trees via `make render-templates`: - templates_src/skills/map-skill-eval/SKILL.md.jinja (task skill; effort medium; disable-model-invocation; description with Use-when / Do-NOT-use). - skill-rules.json.jinja entry: skillClass "task", enforcement manual, requires-cmd ["claude"], prompt triggers. Renders into .claude/ + templates/. - No codex variant (Claude-specific, gated on the claude CLI). - Bumps the skill-count guard in test_skills_consistency.py 15 → 16. make check-render green (generated trees match source); skill/consistency/ template-render tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…xture (ST-009) Completes the skills_eval test suite (all MockDispatcher/monkeypatched — zero real claude -p, INV-2): - Dispatcher: ABC returns DispatchResult, MockDispatcher zero-subprocess, bounded backoff (exactly max_retries+1 calls), and subprocess cwd is a seeded temp dir (.claude + .map) not the repo .map (INV-5). - Assertions: contains/regex match+nonmatch, valid_json pass+fail, trigger/ not_trigger incl. None (SC-3). - AC-11/INV-3: test_vc2_no_anthropic_import_in_skills_eval ast-scans every .py under skills_eval/ and fails on any anthropic import / ANTHROPIC_API_KEY read (negative-proved: RED when an import is injected, GREEN when restored). - End-to-end: load_eval_set(fixture) -> run via MockDispatcher -> aggregate. - Adds tests/skills_eval/fixtures/map_debug_eval_set.json. make check green: 2118 passed, ruff clean, check-render ✅. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ault Final-verifier surfaced doc/impl drift in the shipped SKILL.md: - --max-concurrency default documented as 4; actual default is 1. - Eval-Set Format showed a YAML `skill:/cases:/id:` schema that load_eval_set does NOT accept (the loader is JSON `{"entries":[{prompt, should_trigger, should_not_trigger, assertions}]}`). Rewrote the example to the real JSON schema and switched the --eval-set examples from .yaml to .json. Edited templates_src source + re-rendered; make check-render green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… no head/tail Agents were defensively piping `record_test_baseline` through `| tail -N` (seen in the wild). The command captures the test run internally and prints a single compact JSON object at the end — truncating it only hides fields and contradicts the repo bash guidelines. Add an explicit "read the JSON directly, do not pipe through head/tail" note next to the pre-flight baseline command in both the SKILL body and efficient-reference. Edited templates_src; re-rendered. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

1. safety-guardrails: the destructive-rm pattern was over-broad. It blocked legitimate temp-scratch cleanup under /tmp, /private/tmp and /var/folders, and even read-only commands that merely mentioned such a path. Narrow it with a negative lookahead that still blocks bare root, system dirs (/etc, /home/...), root-glob, and the temp ROOT itself, but allows deletion of subpaths under a temp root. (Command-position parsing for the mentioned-inside-grep/echo case is intentionally left as a separate, reviewed change: narrowing a security blocklist via shell parsing risks a bypass.) 2. record_subtask_result: files_not_in_diff raised a false "Possible Actor truncation" warning for declared deliverables that are gitignored-but- present on disk (e.g. .map/ spike docs, eval-run .jsonl). Filter declared paths through `git check-ignore` before warning -- a gitignored file that exists is intentionally out of git, not truncation. A gitignored file that is also missing is still flagged via missing_files; a tracked-unchanged file still surfaces (negative-control test). Both edited at the templates_src source and re-rendered. Adds regression tests (temp-allow + roots-blocked; gitignored-suppressed + tracked-unchanged-flagged, negative-proved). make check green (2129 passed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- .claude/rules/learned/: 8 new lessons from /map-learn (spike-first gating, producer-owns-parse, claude -p two-channel output, scoped config-flag temp mutation, clock-free durable writers, concurrent durable append, blueprint- named-tests-are-a-contract, final-verify shipped docs). - CLAUDE.md: the "fix MAP framework self-errors immediately" rule from planning. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

azalio and others added 12 commits June 3, 2026 23:24

azalio merged commit cbe3a04 into main Jun 3, 2026
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Phase F.1: skill-eval engine (MVP) + framework fixes#158

Phase F.1: skill-eval engine (MVP) + framework fixes#158
azalio merged 12 commits into
mainfrom
cinder-copper

azalio commented Jun 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

azalio commented Jun 3, 2026