94 changes: 94 additions & 0 deletions docs/evals/weekly/2026-06-15.md
@@ -0,0 +1,94 @@
# ahd eval · swiss-editorial · 2026-06-15T13:03:21.834Z

```yaml ahd-replay
schema_version: 1
kind: eval-live
ahd_version: 0.11.0
ahd_commit: a772e367f2a1d021433d2ab181d1cd53845a8485
git_dirty: true
node_version: v20.20.2
platform: linux-x64
invoked_at: 2026-06-15T12:32:05.978Z
token:
  path: /home/runner/work/ahd/ahd/tokens/swiss-editorial.yml
  hash: sha256:380a3d833d94
brief:
  path: briefs/landing.yml
  hash: sha256:8b7d42759643
sampling:
  n: 30
  temperature: null
  seed: null
models:
  - id: @cf/google/gemma-4-26b-a4b-it
    provider: cloudflare-workers-ai
    provider_request_ids: 53 captured
  - id: @cf/meta/llama-4-scout-17b-16e-instruct
    provider: cloudflare-workers-ai
    provider_request_ids: 60 captured
  - id: @cf/mistralai/mistral-small-3.1-24b-instruct
    provider: cloudflare-workers-ai
    provider_request_ids: 60 captured
  - id: @cf/openai/gpt-oss-120b
    provider: cloudflare-workers-ai
    provider_request_ids: 60 captured
  - id: @cf/qwen/qwen3-30b-a3b-fp8
    provider: cloudflare-workers-ai
    provider_request_ids: 60 captured
conditions:
  requested: [raw, compiled]
  effective: [raw, compiled]
```

Replay this run:

```sh
git checkout a772e367f2a1
npm ci && npm run build
/opt/hostedtoolcache/node/20.20.2/x64/bin/node /home/runner/work/ahd/ahd/bin/ahd.js eval-live swiss-editorial --brief briefs/landing.yml --models cf:@cf/google/gemma-4-26b-a4b-it,cf:@cf/meta/llama-4-scout-17b-16e-instruct,cf:@cf/mistralai/mistral-small-3.1-24b-instruct,cf:@cf/openai/gpt-oss-120b,cf:@cf/qwen/qwen3-30b-a3b-fp8 --n 30 --sample-concurrency 6 --out evals --report docs/evals/weekly/2026-06-15.md
```

## Run

- Brief: `briefs/landing.yml`
- Samples per cell: **30**
- Max tokens: 12000
- Models:
  - `@cf/google/gemma-4-26b-a4b-it` (cloudflare-workers-ai) · spec `cf:@cf/google/gemma-4-26b-a4b-it`
  - `@cf/meta/llama-4-scout-17b-16e-instruct` (cloudflare-workers-ai) · spec `cf:@cf/meta/llama-4-scout-17b-16e-instruct`
  - `@cf/mistralai/mistral-small-3.1-24b-instruct` (cloudflare-workers-ai) · spec `cf:@cf/mistralai/mistral-small-3.1-24b-instruct`
  - `@cf/openai/gpt-oss-120b` (cloudflare-workers-ai) · spec `cf:@cf/openai/gpt-oss-120b`
  - `@cf/qwen/qwen3-30b-a3b-fp8` (cloudflare-workers-ai) · spec `cf:@cf/qwen/qwen3-30b-a3b-fp8`

## Per-model slop reduction

| model | raw attempted → scored | compiled attempted → scored | raw mean tells | compiled mean tells | Δ | reduction |
|---|---:|---:|---:|---:|---:|---:|
| `@cf/google/gemma-4-26b-a4b-it` | 30 → 28 | 30 → 25 | 2.57 | 1.20 | 1.37 | 53.3% |
| `@cf/meta/llama-4-scout-17b-16e-instruct` | 30 → 30 | 30 → 30 | 2.00 | 2.00 | 0.00 | 0.0% |
| `@cf/mistralai/mistral-small-3.1-24b-instruct` | 30 → 30 | 30 → 30 | 3.33 | 1.13 | 2.20 | 66.0% |
| `@cf/openai/gpt-oss-120b` | 30 → 30 | 30 → 30 | 3.33 | 0.90 | 2.43 | 73.0% |
| `@cf/qwen/qwen3-30b-a3b-fp8` | 30 → 30 | 30 → 30 | 1.87 | 1.73 | 0.13 | 7.1% |
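
The Δ and reduction columns appear to follow Δ = raw mean − compiled mean and reduction = Δ / raw mean. The report does not state the formula, so the sketch below is an assumption that happens to match the rows above; `slop_reduction` is a hypothetical helper, not part of the `ahd` CLI.

```python
def slop_reduction(raw_mean: float, compiled_mean: float) -> tuple[float, float]:
    """Return (delta, reduction_percent) for one model.

    Assumed formula (not confirmed by the report):
    delta = raw - compiled, reduction = delta / raw.
    """
    delta = raw_mean - compiled_mean
    # Guard: a raw mean of zero leaves nothing to reduce.
    reduction = 100.0 * delta / raw_mean if raw_mean > 0 else 0.0
    return round(delta, 2), round(reduction, 1)


# Inputs are the rounded means from the table above.
print(slop_reduction(3.33, 0.90))  # gpt-oss-120b
print(slop_reduction(2.00, 2.00))  # llama-4-scout
```

Because the table publishes means rounded to two decimals, recomputing from them can differ from the published reduction in the last digit; the report presumably works from unrounded values.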

## Per-tell frequency (scored samples only)

| tell | @cf/google/gemma-4-26b-a4b-it/raw | @cf/google/gemma-4-26b-a4b-it/compiled | @cf/meta/llama-4-scout-17b-16e-instruct/raw | @cf/meta/llama-4-scout-17b-16e-instruct/compiled | @cf/mistralai/mistral-small-3.1-24b-instruct/raw | @cf/mistralai/mistral-small-3.1-24b-instruct/compiled | @cf/openai/gpt-oss-120b/raw | @cf/openai/gpt-oss-120b/compiled | @cf/qwen/qwen3-30b-a3b-fp8/raw | @cf/qwen/qwen3-30b-a3b-fp8/compiled |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| ahd/a11y/heading-skip | 0% | 8% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
| ahd/body-measure | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 3% |
| ahd/line-height-per-size | 82% | 0% | 0% | 100% | 53% | 20% | 100% | 0% | 60% | 37% |
| ahd/no-em-dashes-in-prose | 0% | 0% | 0% | 0% | 0% | 3% | 7% | 0% | 0% | 0% |
| ahd/no-flat-dark-mode | 4% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 3% | 0% |
| ahd/radius-hierarchy | 50% | 4% | 0% | 100% | 100% | 0% | 90% | 3% | 27% | 47% |
| ahd/require-named-grid | 0% | 0% | 100% | 0% | 100% | 50% | 33% | 0% | 13% | 3% |
| ahd/require-type-pairing | 21% | 0% | 100% | 0% | 80% | 0% | 53% | 0% | 70% | 0% |
| ahd/tracking-per-size | 0% | 32% | 0% | 0% | 0% | 30% | 0% | 3% | 0% | 0% |
| ahd/weight-variety | 100% | 76% | 0% | 0% | 0% | 10% | 50% | 83% | 13% | 83% |
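
As a rough cross-check, the per-tell frequencies in one column can be summed to approximate that cell's mean tells per page. This assumes each rule counts at most once per sample, which the report does not state; `mean_tells_from_frequencies` is a hypothetical helper for illustration only.

```python
def mean_tells_from_frequencies(percentages: list[float]) -> float:
    """Approximate mean tells per page from per-tell frequencies (in percent).

    Assumption: a rule contributes at most one tell per scored sample.
    """
    return round(sum(percentages) / 100.0, 2)


# Non-zero entries copied from the table above.
llama_raw = [100, 100]                  # require-named-grid, require-type-pairing
gpt_oss_raw = [100, 7, 90, 33, 53, 50]  # gpt-oss-120b/raw column

print(mean_tells_from_frequencies(llama_raw))
print(mean_tells_from_frequencies(gpt_oss_raw))
```

Under that assumption the sums agree with the 2.00 and 3.33 raw means reported in the per-model table; small mismatches elsewhere are expected because the percentages are rounded to whole numbers.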

## Caveats

- Scoring runs the deterministic AHD linter (38 source-level rules) over every sample that passes a basic HTML sanity check.
- Counts reported per cell: attempted (runs initiated) / errored (API / runtime errors) / extractionFailed (response contained no usable HTML) / scored (linted). A large gap between attempted and scored is a signal that the model is struggling with the instruction, not that it passed the taxonomy.
- Raw condition: the brief is expanded as plain prose (intent + audience + surfaces + mustInclude + mustAvoid) with no AHD system prompt, no style token, no forbidden list. Compiled condition: same brief plus the AHD-compiled system prompt. The only thing that differs between conditions is the AHD intervention.
- Vision-only tells (14 rules in the critic) are not scored in this pipeline; run the critic on rendered screenshots for full taxonomy coverage.
- Tells-per-page is a proxy metric: a thin page has little surface for rules to fire against. Read the Δ alongside the actual rendered HTML, not in isolation.
- Model versions change. See the run manifest for exact canonical model ids.
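
The attempted-versus-scored caveat can be checked mechanically. The sketch below flags cells whose scored share falls under a cutoff; the 0.9 threshold and the `low_yield_cells` helper are illustrative assumptions, not something the AHD pipeline defines.

```python
def low_yield_cells(cells: dict[str, tuple[int, int]], threshold: float = 0.9) -> list[str]:
    """Return the names of cells whose scored/attempted ratio is below threshold.

    `cells` maps a "model/condition" label to (attempted, scored).
    The default threshold is an arbitrary illustrative cutoff.
    """
    flagged = []
    for name, (attempted, scored) in cells.items():
        if attempted > 0 and scored / attempted < threshold:
            flagged.append(name)
    return flagged


# Counts copied from the per-model table in this report.
cells = {
    "@cf/google/gemma-4-26b-a4b-it/raw": (30, 28),
    "@cf/google/gemma-4-26b-a4b-it/compiled": (30, 25),
    "@cf/openai/gpt-oss-120b/raw": (30, 30),
    "@cf/openai/gpt-oss-120b/compiled": (30, 30),
}
print(low_yield_cells(cells))
```

With this cutoff only the gemma compiled cell (25 of 30 scored) is flagged, so its mean is computed over fewer samples than the other cells.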