Ad-Astra-Computing · jasonodoom · Jun 15, 2026 · Jun 15, 2026
diff --git a/docs/evals/weekly/2026-06-15.md b/docs/evals/weekly/2026-06-15.md
@@ -0,0 +1,94 @@
+# ahd eval · swiss-editorial · 2026-06-15T13:03:21.834Z
+
+```yaml ahd-replay
+schema_version: 1
+kind: eval-live
+ahd_version: 0.11.0
+ahd_commit: a772e367f2a1d021433d2ab181d1cd53845a8485
+git_dirty: true
+node_version: v20.20.2
+platform: linux-x64
+invoked_at: 2026-06-15T12:32:05.978Z
+token:
+  path: /home/runner/work/ahd/ahd/tokens/swiss-editorial.yml
+  hash: sha256:380a3d833d94
+brief:
+  path: briefs/landing.yml
+  hash: sha256:8b7d42759643
+sampling:
+  n: 30
+  temperature: null
+  seed: null
+models:
+  - id: "@cf/google/gemma-4-26b-a4b-it"
+    provider: cloudflare-workers-ai
+    provider_request_ids: 53 captured
+  - id: "@cf/meta/llama-4-scout-17b-16e-instruct"
+    provider: cloudflare-workers-ai
+    provider_request_ids: 60 captured
+  - id: "@cf/mistralai/mistral-small-3.1-24b-instruct"
+    provider: cloudflare-workers-ai
+    provider_request_ids: 60 captured
+  - id: "@cf/openai/gpt-oss-120b"
+    provider: cloudflare-workers-ai
+    provider_request_ids: 60 captured
+  - id: "@cf/qwen/qwen3-30b-a3b-fp8"
+    provider: cloudflare-workers-ai
+    provider_request_ids: 60 captured
+conditions:
+  requested: [raw, compiled]
+  effective: [raw, compiled]
+```
+
+Replay this run:
+
+```sh
+git checkout a772e367f2a1
+npm ci && npm run build
+/opt/hostedtoolcache/node/20.20.2/x64/bin/node /home/runner/work/ahd/ahd/bin/ahd.js eval-live swiss-editorial --brief briefs/landing.yml --models cf:@cf/google/gemma-4-26b-a4b-it,cf:@cf/meta/llama-4-scout-17b-16e-instruct,cf:@cf/mistralai/mistral-small-3.1-24b-instruct,cf:@cf/openai/gpt-oss-120b,cf:@cf/qwen/qwen3-30b-a3b-fp8 --n 30 --sample-concurrency 6 --out evals --report docs/evals/weekly/2026-06-15.md
+```
+
+## Run
+
+- Brief: `briefs/landing.yml`
+- Samples per cell: **30**
+- Max tokens: 12000
+- Models:
+  - `@cf/google/gemma-4-26b-a4b-it` (cloudflare-workers-ai) · spec `cf:@cf/google/gemma-4-26b-a4b-it`
+  - `@cf/meta/llama-4-scout-17b-16e-instruct` (cloudflare-workers-ai) · spec `cf:@cf/meta/llama-4-scout-17b-16e-instruct`
+  - `@cf/mistralai/mistral-small-3.1-24b-instruct` (cloudflare-workers-ai) · spec `cf:@cf/mistralai/mistral-small-3.1-24b-instruct`
+  - `@cf/openai/gpt-oss-120b` (cloudflare-workers-ai) · spec `cf:@cf/openai/gpt-oss-120b`
+  - `@cf/qwen/qwen3-30b-a3b-fp8` (cloudflare-workers-ai) · spec `cf:@cf/qwen/qwen3-30b-a3b-fp8`
+
+## Per-model slop reduction
+
+| model | raw attempted → scored | compiled attempted → scored | raw mean tells | compiled mean tells | Δ | reduction |
+|---|---:|---:|---:|---:|---:|---:|
+| `@cf/google/gemma-4-26b-a4b-it` | 30 → 28 | 30 → 25 | 2.57 | 1.20 | 1.37 | 53.3% |
+| `@cf/meta/llama-4-scout-17b-16e-instruct` | 30 → 30 | 30 → 30 | 2.00 | 2.00 | 0.00 | 0.0% |
+| `@cf/mistralai/mistral-small-3.1-24b-instruct` | 30 → 30 | 30 → 30 | 3.33 | 1.13 | 2.20 | 66.0% |
+| `@cf/openai/gpt-oss-120b` | 30 → 30 | 30 → 30 | 3.33 | 0.90 | 2.43 | 73.0% |
+| `@cf/qwen/qwen3-30b-a3b-fp8` | 30 → 30 | 30 → 30 | 1.87 | 1.73 | 0.13 | 7.1% |
+
+## Per-tell frequency (scored samples only)
+
+| tell | @cf/google/gemma-4-26b-a4b-it/raw | @cf/google/gemma-4-26b-a4b-it/compiled | @cf/meta/llama-4-scout-17b-16e-instruct/raw | @cf/meta/llama-4-scout-17b-16e-instruct/compiled | @cf/mistralai/mistral-small-3.1-24b-instruct/raw | @cf/mistralai/mistral-small-3.1-24b-instruct/compiled | @cf/openai/gpt-oss-120b/raw | @cf/openai/gpt-oss-120b/compiled | @cf/qwen/qwen3-30b-a3b-fp8/raw | @cf/qwen/qwen3-30b-a3b-fp8/compiled |
+|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
+| ahd/a11y/heading-skip | 0% | 8% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
+| ahd/body-measure | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 3% |
+| ahd/line-height-per-size | 82% | 0% | 0% | 100% | 53% | 20% | 100% | 0% | 60% | 37% |
+| ahd/no-em-dashes-in-prose | 0% | 0% | 0% | 0% | 0% | 3% | 7% | 0% | 0% | 0% |
+| ahd/no-flat-dark-mode | 4% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 3% | 0% |
+| ahd/radius-hierarchy | 50% | 4% | 0% | 100% | 100% | 0% | 90% | 3% | 27% | 47% |
+| ahd/require-named-grid | 0% | 0% | 100% | 0% | 100% | 50% | 33% | 0% | 13% | 3% |
+| ahd/require-type-pairing | 21% | 0% | 100% | 0% | 80% | 0% | 53% | 0% | 70% | 0% |
+| ahd/tracking-per-size | 0% | 32% | 0% | 0% | 0% | 30% | 0% | 3% | 0% | 0% |
+| ahd/weight-variety | 100% | 76% | 0% | 0% | 0% | 10% | 50% | 83% | 13% | 83% |
+
+## Caveats
+- Scoring runs the deterministic AHD linter (38 source-level rules) over every sample that passes a basic HTML sanity check.
+- Each cell tracks four counts: attempted (runs initiated), errored (API or runtime errors), extractionFailed (response contained no usable HTML) and scored (linted). The per-model table shows only attempted → scored. A large gap between the two means the model struggled to produce usable HTML; it does not mean the unscored samples passed the taxonomy.
+- Raw condition: the brief is expanded as plain prose (intent + audience + surfaces + mustInclude + mustAvoid) with no AHD system prompt, no style token, no forbidden list. Compiled condition: same brief plus the AHD-compiled system prompt. The only thing that differs between conditions is the AHD intervention.
+- Vision-only tells (14 rules in the critic) are not scored in this pipeline; run the critic on rendered screenshots for full taxonomy coverage.
+- Tells-per-page is a proxy metric: a thin page has little surface for rules to fire against. Read the Δ alongside the actual rendered HTML, not in isolation.
+- Model versions change. See the run manifest for exact canonical model ids.
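
The Δ and reduction columns in the per-model table appear to follow Δ = raw mean − compiled mean and reduction = Δ / raw mean. That formula is inferred from the reported numbers, not taken from the AHD source. A minimal Python sketch, with hypothetical function names, that recomputes both columns from the rounded means and cross-checks a mean against the per-tell frequencies (assuming each rule counts at most once per sample):

```python
def slop_reduction(raw_mean, compiled_mean):
    """Return (delta, reduction) for one model row.

    Assumption: reduction is delta divided by the raw mean, as the
    reported table values suggest. A raw mean of 0 yields 0.0.
    """
    delta = raw_mean - compiled_mean
    reduction = delta / raw_mean if raw_mean else 0.0
    return delta, reduction


def mean_tells_from_frequencies(percentages):
    """Sum per-tell frequencies (in percent) into mean tells per sample.

    Assumption: each rule contributes at most one tell per sample.
    """
    return sum(percentages) / 100


if __name__ == "__main__":
    # Rounded means copied from the per-model table above.
    rows = {
        "gemma-4-26b-a4b-it": (2.57, 1.20),
        "llama-4-scout-17b-16e-instruct": (2.00, 2.00),
        "gpt-oss-120b": (3.33, 0.90),
    }
    for model, (raw, compiled) in rows.items():
        delta, reduction = slop_reduction(raw, compiled)
        print(f"{model}: delta={delta:.2f} reduction={reduction:.1%}")

    # llama raw cell: require-named-grid 100% + require-type-pairing 100%.
    print(mean_tells_from_frequencies([100, 100]))
```

Because the inputs here are already rounded to two decimals, a recomputed percentage can differ from the reported one in the last digit.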