Derived from benchmark run 20260428_200424_uuxo on 2026-04-28.
- Fixtures: 49
- Overall score: 7.71 / 10
- Prior reference run: `20260428_102409_kd0s` at 7.74 / 10
The headline score is effectively flat, but the distribution got slightly worse: perfect 10/10 fixtures dropped from 13 to 11, and fixtures at <= 6/10 rose from 12 to 13.
| Dimension | Current | Prior | Delta |
|---|---|---|---|
| component_fit | 2.69 | 2.70 | -0.01 |
| data_completeness | 2.49 | 2.62 | -0.13 |
| format_quality | 2.49 | 2.53 | -0.04 |
| layout_coherence | 2.65 | 2.57 | +0.08 |
| overall | 7.71 | 7.74 | -0.03 |
Score distribution:
- 11 fixtures scored 10/10
- 9 fixtures scored 9/10
- 12 fixtures scored 8/10
- 13 fixtures scored <= 6/10
Benchmark gate failures in this run:
- `items-with-number-arrays` (7/10)
- `items-with-tag-arrays` (10/10)
- `polymorphic-records` (8/10)
- `schema-inconsistent` (10/10)
- `tree-embedded-children` (8/10)
This means judge quality and parse/render correctness are not fully aligned. Two fixtures are judged 10/10 but still fail the benchmark gate, so quality work and runtime/test stability work should be tracked separately.
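One way to keep the two tracks apart is to classify each fixture by judge score and gate outcome independently. The sketch below is illustrative only: `FixtureResult`, `split_tracks`, and the quality threshold of 8 are assumptions, not part of the benchmark tooling. The sample data is the list of gate failures reported above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FixtureResult:
    name: str
    judge_score: int   # 0-10 score assigned by the judge
    gate_passed: bool  # outcome of the benchmark gate


def split_tracks(results, quality_threshold=8):
    """Return (stability_only, quality_only, both) lists of fixture names.

    quality_threshold is an assumed cutoff; scores below it count as quality work.
    """
    stability_only, quality_only, both = [], [], []
    for result in results:
        low_quality = result.judge_score < quality_threshold
        if not result.gate_passed and not low_quality:
            stability_only.append(result.name)
        elif result.gate_passed and low_quality:
            quality_only.append(result.name)
        elif not result.gate_passed and low_quality:
            both.append(result.name)
    return stability_only, quality_only, both


# Gate failures reported for this run.
results = [
    FixtureResult("items-with-number-arrays", 7, False),
    FixtureResult("items-with-tag-arrays", 10, False),
    FixtureResult("polymorphic-records", 8, False),
    FixtureResult("schema-inconsistent", 10, False),
    FixtureResult("tree-embedded-children", 8, False),
]

stability_only, quality_only, both = split_tracks(results)
```

Under this assumed threshold, four of the five gate failures land in the stability-only track, and only `items-with-number-arrays` needs both kinds of work.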
The current prompt still falls back to a raw table for bare number arrays, even when the task clearly asks for a distribution-style view.
Affected fixtures:
- `primitive-number-array` (1/10, cf=0, dc=0)
- `items-with-number-arrays` (7/10, benchmark gate failure)
Evidence:
- `primitive-number-array`: "uses a table for raw samples and omits the endpoint, sample count, and p95"
- `items-with-number-arrays`: latency arrays are not shown as per-row mini-distributions
Likely fix:
- Add a first-class histogram or mini-distribution rendering path for primitive number arrays
- Add prompt guidance that bare numeric samples should surface summary stats first, then a chart, not a table of raw values
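As a rough illustration of "summary stats first, then a chart", the sketch below derives the sample count, a nearest-rank p95, and equal-width histogram counts from a bare numeric array. The function name `summarize_samples`, the bin count, and the nearest-rank percentile definition are assumptions made for this example; the real component may compute these differently.

```python
def summarize_samples(samples, bins=5):
    """Summary stats plus equal-width histogram counts for raw numeric samples."""
    if not samples:
        return {"count": 0, "min": None, "max": None, "p95": None, "histogram": []}
    ordered = sorted(samples)
    count = len(ordered)
    # Nearest-rank p95: rank = ceil(0.95 * count), computed with integer arithmetic.
    rank = max(1, (95 * count + 99) // 100)
    p95 = ordered[rank - 1]
    lo, hi = ordered[0], ordered[-1]
    width = (hi - lo) / bins
    histogram = [0] * bins
    for value in ordered:
        # If every sample is equal, width is 0 and all samples go to bin 0.
        index = 0 if width == 0 else min(int((value - lo) / width), bins - 1)
        histogram[index] += 1
    return {"count": count, "min": lo, "max": hi, "p95": p95, "histogram": histogram}
```

A renderer could then show `count` and `p95` as headline stats and feed `histogram` to a bar-style chart instead of listing the raw values in a table.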
The language is no longer blocked on dynamic-key data, but the generated DSL still fails to reliably materialize rows from maps and nested child collections.
Affected fixtures:
- `object-map-by-id` (4/10, dc=1, fq=0)
- `array-with-nested-arrays` (4/10, nested rows disappear)
- `timeseries-multi-entity-unaligned` (5/10, detail table renders no rows)
- `flat-parentid-reference` (7/10, hierarchy flattened)
What changed vs the previous map:
- `@ObjectEntries` and `@ObjectKeys` removed the original "cannot iterate object maps" blocker
- The remaining problem is adoption and rendering fidelity, not parser capability
Likely fix:
- Add stronger prompt examples for `value.*` access after `@ObjectEntries`
- Add one canonical nested-array example for "card per parent, table/list per child array"
- Add a regression fixture or prompt rule that forbids empty nested tables when the child array is non-empty
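The row-materialization and regression ideas above can be sketched outside the DSL. The Python below is not DSL syntax: `rows_from_object_map` only mirrors the key/value shape that iterating an object map would produce, and `empty_nested_tables` assumes a hypothetical mapping from parent id to the number of child rows found in the rendered output.

```python
def rows_from_object_map(object_map):
    """Turn {"id": {...}} into one row per entry, keeping the key next to the value."""
    return [{"key": key, "value": value} for key, value in object_map.items()]


def empty_nested_tables(parents, rendered_child_rows, child_field, id_field="id"):
    """Return ids of parents whose non-empty child array rendered zero rows.

    rendered_child_rows is a hypothetical {parent id: rendered row count} mapping.
    """
    violations = []
    for parent in parents:
        children = parent.get(child_field) or []
        if children and rendered_child_rows.get(parent[id_field], 0) == 0:
            violations.append(parent[id_field])
    return violations
```

A regression fixture could fail whenever `empty_nested_tables` returns a non-empty list, which targets the "nested rows disappear" symptom directly.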
The model still invents labels or records instead of showing null/unknown states directly.
Affected fixtures:
- `unlabeled-ratio-array` (4/10)
- `nearly-all-null` (5/10)
Evidence:
- `unlabeled-ratio-array`: donut chart choice is fine, but labels are fabricated and values are not shown as percentages
- `nearly-all-null`: dashboard layout is clear, but it fabricates detection rows instead of showing unavailable metrics
Likely fix:
- Strengthen the anti-fabrication prompt rule
- Add an explicit unlabeled-ratio example: if labels do not exist, do not invent them
- Prefer `null`, `unknown`, or omission over synthetic category names or fake detail rows
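The intent of these rules can be shown with a small sketch: ratios become percentages, labels are carried through only when the source data has them, and missing metrics are reported as missing. The helper names `ratio_slices` and `metric_text` and the placeholder text `"unavailable"` are assumptions for illustration.

```python
def ratio_slices(ratios, labels=None):
    """Pair each ratio with its source label, or None when the data has no labels."""
    if labels is not None and len(labels) != len(ratios):
        raise ValueError("labels must match ratios one-to-one")
    slices = []
    for index, ratio in enumerate(ratios):
        label = labels[index] if labels is not None else None
        # Ratios are shown as percentages, never as raw decimals.
        slices.append({"label": label, "text": f"{ratio:.1%}"})
    return slices


def metric_text(value, unit=""):
    """Show a real value when present; otherwise say so instead of inventing one."""
    if value is None:
        return "unavailable"
    return f"{value}{unit}"
```

For example, `ratio_slices([0.5, 0.25, 0.25])` yields three slices with `label` set to `None`, leaving the decision about how to present unlabeled slices to the renderer rather than to fabricated names.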
Several fixtures want a more specific visualization than the current general-purpose chart/list fallback.
Affected fixtures:
- `timeseries-min-max-band` (5/10): still rendered as three lines instead of a range band
- `nodes-edges-graph` (5/10): nodes are shown, but edges are reduced to text
- `adjacency-list-graph` (6/10): dependency relationships are surfaced, but not as a real topology
- `record-with-sparkline` (6/10): oversized charts instead of compact inline sparklines
Likely fix:
- Complete the band/range chart path for min/max envelopes
- Decide whether graph/topology is a real product goal or an acceptable fallback class
- Tighten prompt rules for "compact inline trend" so sparkline-capable layouts are preferred inside record lists and tables
Formatting problems are no longer the dominant blocker, but they still suppress scores on otherwise-correct layouts.
Affected fixtures:
- `timeseries-tuple-pairs` (6/10, fq=0): timestamps stay raw
- `object-map-by-id` (4/10, fq=0): broken status/CPU/heartbeat display
- `cross-magnitude-values` (6/10): inconsistent byte units / broken top-card binding
- `byte-large-values` (7/10)
- `multi-top-arrays` (8/10): last online time rendered in the wrong form
Likely fix:
- Make formatting rules more shape-aware:
  - tuple timestamps must pass through `@FormatDate`
  - large byte values must use unit-aware formatting consistently
  - percent-like ratios should not appear as raw decimals
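To make the timestamp and byte rules concrete, here is a minimal sketch of the formatting they imply. It assumes timestamps are Unix epoch seconds and that byte values should use binary units (KiB, MiB, ...); neither is confirmed by the fixtures, and the function names are hypothetical.

```python
from datetime import datetime, timezone


def format_timestamp(epoch_seconds):
    """Render epoch seconds (assumed input form) as a readable UTC timestamp."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M UTC")


def format_bytes(num_bytes):
    """Scale a byte count to the largest binary unit that keeps the value below 1024."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(num_bytes)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            # Plain bytes are shown without decimals; scaled units keep one.
            return f"{value:.0f} {unit}" if unit == "B" else f"{value:.1f} {unit}"
        value /= 1024
```

Applying one helper per shape, rather than formatting ad hoc per card, is what keeps units consistent across values of very different magnitude.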
Previous concern:
- `L1-C` tracked missing null-coalescing as a high-severity parse failure source
Current evidence:
- `schema-inconsistent` is now 10/10
- `polymorphic-records` is now 8/10
The remaining issue in polymorphic-records is presentation quality, not parser failure.
Previous concern:
- `L1-B` tracked tuple arrays as fundamentally unrenderable
Current evidence:
- `timeseries-tuple-pairs` now chooses the right chart type
- Remaining issue is timestamp formatting, not inability to project the tuple columns
Previous concern:
- `L1-D` framed flat-to-tree conversion as a core blocker
Current evidence:
- `flat-parentid-reference` improved to 7/10
- `tree-embedded-children` is judged 8/10 even though it still fails the benchmark gate
The remaining issue is that hierarchy metadata is incomplete or flattened, not that tree-like rendering is impossible.
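For reference, the flat-to-tree conversion itself is a small transformation. The sketch below assumes records carry `id` and `parentId` fields (hypothetical names; the fixture's actual keys may differ) and treats records whose parent is missing as roots so that nothing is silently dropped. It does not detect cycles.

```python
def build_tree(records, id_key="id", parent_key="parentId"):
    """Nest flat records under their parents and return the list of root nodes."""
    # Copy each record so the input is not mutated, and give it a children list.
    nodes = {record[id_key]: {**record, "children": []} for record in records}
    roots = []
    for node in nodes.values():
        parent = nodes.get(node.get(parent_key))
        if parent is None:
            # No parent id, or the parent is absent from the data: keep as a root.
            roots.append(node)
        else:
            parent["children"].append(node)
    return roots
```

Because each node keeps its original fields, hierarchy metadata such as depth or child count can be derived from the result instead of being flattened away.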
Previous concern:
- `L1-A` tracked "no object iteration primitive"
Current evidence:
- The parser/runtime gap is addressed
- `object-map-by-id` is still only 4/10, so prompt usage and rendering robustness are now the bottleneck
This issue should stay on the roadmap, but it has moved from "missing language feature" to "insufficient prompt/runtime follow-through".
- Fix primitive numeric array handling. This is the worst current failure and likely needs both prompt and component support.
- Tighten anti-fabrication rules for null-heavy and unlabeled data.
- Stabilize nested-array and object-map rendering so non-empty structures cannot silently render as empty.
- Decide whether band charts, sparklines, and topology are first-class features or explicitly unsupported fallback cases.
- Clean up formatting on tuple timestamps, large byte values, and ratio-like fields.
- Investigate benchmark gate mismatch separately from judge quality scoring.
- Should primitive number arrays map to a dedicated `HistogramChart`, a summary-plus-sparkline layout, or a more generic distribution component?
- For unlabeled ratio arrays, should the DSL prefer a chart without labels, synthetic ordinal labels (`Bucket 1`, `Bucket 2`, ...), or a non-chart representation?
- Is graph/topology rendering a real scope item, or should benchmark expectations be lowered to a structured fallback layout?
- Why do `items-with-tag-arrays` and `schema-inconsistent` still fail the benchmark gate while the judge scores them 10/10?
Generated: 2026-04-28 | Benchmark: 20260428_200424_uuxo | Suite: benchmark