Phase F.1: skill-eval engine (MVP) + framework fixes#158
Merged
Conversation
Single-source Contract-First types for the Phase F.1 skill-eval engine: EvalSetEntry, DispatchResult, EvalResultRecord (+ to_dict/from_dict round-trip), and make_cell_id(). Reuses token_budget.TokenUsage (INV-6); no anthropic import / no ANTHROPIC_API_KEY (INV-3). Pure data layer — no dispatch, transcript parsing, or assertion logic. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…atchers (ST-003) VariantDispatcher ABC with two implementations: - MockDispatcher: returns caller-set DispatchResult, zero subprocess (INV-2) — the dispatcher all CI tests use. - ClaudeSubprocessDispatcher: shells `claude -p --output-format json` in a throwaway temp cwd seeded with .claude/ + fresh empty .map/ (INV-5), applies the spike's seed-time temp-flip of disable-model-invocation to temp copies only (production templates untouched), parses .result/.usage/.session_id from the envelope, derives triggered_skill from the transcript Skill tool_use, and uses bounded jittered backoff (persistent failure -> DispatchResult.error, never raises). Temp dir always cleaned up (try/finally). No anthropic import / no ANTHROPIC_API_KEY (INV-3); reuses DispatchResult + TokenUsage; stdlib only. Mirrors the memory/finalize.py claude -p precedent. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pure, no-LLM/no-subprocess assertion engine over DispatchResult: contains, not_contains, regex (invalid pattern -> FAIL, never raises), valid_json, trigger, and not_trigger. not_trigger correctly PASSES when triggered_skill is None (SC-3 should_not_trigger enforcement). Unknown type and missing keys return a FAIL AssertionResult with a debuggable detail rather than raising, keeping the eval matrix robust. run_assertions() splits into passed/failed detail lists for EvalResultRecord. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… (ST-005) runner.py executes the prompts × runs matrix (variants fixed=1, D10) through an injected VariantDispatcher, synthesizes trigger/not_trigger assertions from should_trigger/should_not_trigger, and appends each completed cell immediately to .map/eval-runs/<skill>/<timestamp>.jsonl as a parseable EvalResultRecord (INV-4). --resume reads present cell_ids (tolerating blank/malformed lines) and skips them, so a killed run resumes to a complete set with no duplicates. A per-cell dispatch error is recorded (dispatch_error assertion) without aborting the matrix (VC4). Trigger detection is NOT re-implemented here — the runner consumes DispatchResult.triggered_skill (single-source detection in the dispatcher, INV-7). run_eval is clock-free (caller passes out_path). Adds tests/test_skills_eval_runner.py: one MockDispatcher-driven test per VC (VC1 matrix/no-variants, VC2 durable per-cell, VC3 resume-no-dupes + malformed -line tolerance, VC4 error-not-fatal) plus load_eval_set validation. Zero real claude -p (INV-2). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…(ST-006) aggregator.py: - aggregate(records) -> AggregateSummary: pass_rate (zero-failed-assertion cells / total), mean±stddev for token totals and duration. Null token_usage is excluded from token stats; all-null -> token stats None (not error); never raises on n<2 (stddev 0.0) or empty input (VC1/VC2/VC4). - bounded_run(..., max_concurrency=1): runs the missing-cell set through a bounded ThreadPoolExecutor; durable .jsonl appends serialized under a lock (no interleaving/corruption); resume keys on cell_id so order-independent and dup-free at any concurrency (SC-1). INV-5 isolation is automatic — each ClaudeSubprocessDispatcher dispatch owns its temp cwd. runner.py: extracted the per-cell body into a shared evaluate_cell() helper (behavior-identical; run_eval and bounded_run both use it — DRY). ST-005 tests stay green. Adds tests/test_skills_eval_aggregator.py (15 tests: VC1-VC4, SC-1, concurrent write integrity, resume-no-dupes). stdlib only; no anthropic (INV-3). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Wires skill_eval_app via add_typer (mirrors validate_app). Command:
`mapify skill-eval run <skill> [--eval-set PATH] [--dry-run] [--resume]
[--max-concurrency N]` (no --variants/--runs; runs fixed=1 per D10).
- --dry-run validates the eval-set schema and prints the planned invocation
count (prompts × 1 × 1) without constructing a dispatcher or spending quota.
- Malformed/empty eval-set → Exit 2 with no invocation (SC-2).
- Real run gates on shutil.which("claude"); absent → "requires-cmd: claude"
and non-zero exit BEFORE any invocation (HC-6). Then runs the matrix via
aggregator.bounded_run and prints the AggregateSummary + artifact path.
Command body uses lazy skills_eval imports (mirrors validate_graph); no
anthropic import. Adds 4 CliRunner tests (registration, dry-run-no-dispatch,
missing-claude-nonzero, malformed→Exit2).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…008) Adds the map-skill-eval skill authored in templates_src/ and propagated to all generated trees via `make render-templates`: - templates_src/skills/map-skill-eval/SKILL.md.jinja (task skill; effort medium; disable-model-invocation; description with Use-when / Do-NOT-use). - skill-rules.json.jinja entry: skillClass "task", enforcement manual, requires-cmd ["claude"], prompt triggers. Renders into .claude/ + templates/. - No codex variant (Claude-specific, gated on the claude CLI). - Bumps the skill-count guard in test_skills_consistency.py 15 → 16. make check-render green (generated trees match source); skill/consistency/ template-render tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…xture (ST-009) Completes the skills_eval test suite (all MockDispatcher/monkeypatched — zero real claude -p, INV-2): - Dispatcher: ABC returns DispatchResult, MockDispatcher zero-subprocess, bounded backoff (exactly max_retries+1 calls), and subprocess cwd is a seeded temp dir (.claude + .map) not the repo .map (INV-5). - Assertions: contains/regex match+nonmatch, valid_json pass+fail, trigger/ not_trigger incl. None (SC-3). - AC-11/INV-3: test_vc2_no_anthropic_import_in_skills_eval ast-scans every .py under skills_eval/ and fails on any anthropic import / ANTHROPIC_API_KEY read (negative-proved: RED when an import is injected, GREEN when restored). - End-to-end: load_eval_set(fixture) -> run via MockDispatcher -> aggregate. - Adds tests/skills_eval/fixtures/map_debug_eval_set.json. make check green: 2118 passed, ruff clean, check-render ✅. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ault
Final-verifier surfaced doc/impl drift in the shipped SKILL.md:
- --max-concurrency default documented as 4; actual default is 1.
- Eval-Set Format showed a YAML `skill:/cases:/id:` schema that load_eval_set
does NOT accept (the loader is JSON `{"entries":[{prompt, should_trigger,
should_not_trigger, assertions}]}`). Rewrote the example to the real JSON
schema and switched the --eval-set examples from .yaml to .json.
Edited templates_src source + re-rendered; make check-render green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… no head/tail Agents were defensively piping `record_test_baseline` through `| tail -N` (seen in the wild). The command captures the test run internally and prints a single compact JSON object at the end — truncating it only hides fields and contradicts the repo bash guidelines. Add an explicit "read the JSON directly, do not pipe through head/tail" note next to the pre-flight baseline command in both the SKILL body and efficient-reference. Edited templates_src; re-rendered. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1. safety-guardrails: the destructive-rm pattern was over-broad. It blocked legitimate temp-scratch cleanup under /tmp, /private/tmp and /var/folders, and even read-only commands that merely mentioned such a path. Narrow it with a negative lookahead that still blocks bare root, system dirs (/etc, /home/...), root-glob, and the temp ROOT itself, but allows deletion of subpaths under a temp root. (Command-position parsing for the mentioned-inside-grep/echo case is intentionally left as a separate, reviewed change: narrowing a security blocklist via shell parsing risks a bypass.) 2. record_subtask_result: files_not_in_diff raised a false "Possible Actor truncation" warning for declared deliverables that are gitignored-but- present on disk (e.g. .map/ spike docs, eval-run .jsonl). Filter declared paths through `git check-ignore` before warning -- a gitignored file that exists is intentionally out of git, not truncation. A gitignored file that is also missing is still flagged via missing_files; a tracked-unchanged file still surfaces (negative-control test). Both edited at the templates_src source and re-rendered. Adds regression tests (temp-allow + roots-blocked; gitignored-suppressed + tracked-unchanged-flagged, negative-proved). make check green (2129 passed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- .claude/rules/learned/: 8 new lessons from /map-learn (spike-first gating, producer-owns-parse, claude -p two-channel output, scoped config-flag temp mutation, clock-free durable writers, concurrent durable append, blueprint- named-tests-are-a-contract, final-verify shipped docs). - CLAUDE.md: the "fix MAP framework self-errors immediately" rule from planning. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Phase F.1 — Skill-Eval Engine (MVP)
Builds the MAP Framework skill-evaluation engine: run an eval set against a skill under the full installed skill set, observe which skill actually fired, check deterministic assertions, and report pass-rate + token/duration stats with durable, resumable run artifacts. Optimizer + HTML viewer are intentionally deferred to Phase F.2.
What ships
src/mapify_cli/skills_eval/:eval_schema.py— single-source Contract-First types (EvalSetEntry,DispatchResult,EvalResultRecord,make_cell_id); reusestoken_budget.TokenUsage.dispatcher.py—VariantDispatcherABC +MockDispatcher(zero subprocess) +ClaudeSubprocessDispatcher(claude -p --output-format jsonin an isolated temp cwd seeded with.claude/, transcript-parse trigger detection, bounded jittered backoff).assertions.py— deterministic runner (contains/not_contains/regex/valid_json/trigger/not_trigger).runner.py— prompts×runs matrix (variants=1), durable per-cell append-only.jsonl,--resumebycell_id(no dupes).aggregator.py— null-tolerant pass-rate + token/duration mean±stddev, bounded--max-concurrency(lock-serialized writes).mapify skill-eval run <skill> [--eval-set] [--dry-run] [--resume] [--max-concurrency];--dry-runspends zero quota; gates onshutil.which("claude").map-skill-evalskill (authored intemplates_src/, rendered to all trees;make check-rendergreen).anthropicimport /ANTHROPIC_API_KEYread.Invariants
INV-1 render parity · INV-2 zero real
claude -pin tests · INV-3 no anthropic/API-key (AST-enforced) · INV-4 durable resume · INV-5 temp-cwd isolation · INV-6 single-source schema · INV-7 detection authored only after the ST-001 spike, single-source in the dispatcher.Also in this PR (found while building F.1)
--max-concurrencydefault (4 → 1).record_test_baselinereturns compact JSON — don't pipe throughhead/tail.safety-guardrailsrm -rf /pattern was over-broad — blocked legitimate temp-scratch cleanup and any command merely mentioning such a path; narrowed via negative lookahead (still blocks bare root, system dirs, root-glob, and the temp root itself).record_subtask_resultraised a false "Possible Actor truncation" warning for gitignored-but-present deliverables (.map/artifacts); now filters declared paths throughgit check-ignore. Negative-proved./map-learnrules + the "fix MAP framework self-errors immediately" CLAUDE.md rule.Verification
make checkgreen: 2129 passed, ruff/mypy/pyright 0/0/0,check-render✅. Every subtask passed Monitor; final-verifier PASS.🤖 Generated with Claude Code