An Agent Skill that measures whether coding agents actually follow skills, rules, and agent definitions. Auto-generates test scenarios at 3 prompt strictness levels, runs agents, classifies tool call sequences via LLM, and reports compliance rates with full timelines.
# Copy skill into your global skills directory
cp -r skills/skill-comply ~/.claude/skills/skill-comply
cd ~/.claude/skills/skill-comply && uv sync/skills add shimo4228/claude-skill-comply- Spec Generation — LLM extracts expected behavioral steps from any
.mdfile - Scenario Generation — Creates 3 scenarios with decreasing prompt support (supportive -> neutral -> competing)
- Execution — Runs
claude -pin sandbox, captures tool call traces via stream-json - Classification — LLM classifies tool calls against spec steps (semantic, not regex)
- Grading — Deterministic temporal ordering validation
- Report — Self-contained Markdown with compliance rates and full tool call timelines
Tests whether a skill/rule is followed even when the prompt doesn't explicitly support it. The 3-level scenario structure covers the full spectrum:
| Level | Name | What it tests |
|---|---|---|
| 1 | Supportive | Prompt explicitly mentions the skill |
| 2 | Neutral | Same task, skill not mentioned |
| 3 | Competing | Task instructions contradict the skill |
cd ~/.claude/skills/skill-comply
# Full run
uv run python -m scripts.run ~/.claude/rules/common/testing.md
# Dry run (no cost, spec + scenarios only)
uv run python -m scripts.run --dry-run ~/.claude/skills/search-first/SKILL.md
# Custom models
uv run python -m scripts.run --gen-model haiku --model sonnet <path>| Target | Overall | Supportive | Insight |
|---|---|---|---|
| testing.md | 73% | 100% | Observable 6-step TDD spec fully matches sonnet when explicitly instructed |
| search-first | 56% | 67% | Text-based verdicts (Adopt/Extend/Build) now captured via Text pseudo-events |
| security.md | dry-run OK | — | Spec + scenarios generated successfully |
| git-workflow.md | dry-run OK | — | Spec + scenarios generated successfully |
| Target | v0.1.0 overall | v0.2.0 overall | Δ |
|---|---|---|---|
| search-first | 8% | 56% | +48 |
| testing.md | 33% | 73% | +40 |
v0.1.0 systematically under-scored thinking-centric skills because the runner
discarded assistant text blocks and the spec generator was free to emit
cognitive-only steps (evaluate_findings, state_verdict) that no tool call
could satisfy. After a downstream after_step dependency, cascading failures
nullified observable steps as well. v0.2.0 fixes both layers — see
CHANGELOG.md.
- Python >= 3.11
uv(orpip install pyyaml)- Claude Code CLI (
claude)
cd skills/skill-comply && uv run pytest -v # 32 testsMIT
スキル/ルール/エージェント定義が実際にエージェントに遵守されているかを自動計測する Agent Skill です。3段階のプロンプト厳格度でシナリオを生成し、ツールコールを LLM で意味的に分類、コンプライアンスレポートを出力します。
詳細は skills/skill-comply/SKILL.md を参照してください。