# Results: Renderer-first vs prompt-guided spatial code

## What was compared

One slide type (`financial_summary`), one fixed input (ADUS 3-year historical),
two code paths:

- **Track A** — `IBRenderer().render_financial_summary(spec)` against an 18-line JSON spec.
- **Track B** — a hand-written `python-pptx` build script representing what an
  LLM plausibly writes on a first pass given the prompt and no reference to an
  existing renderer. Not engineered to fail.

Both tracks were run three times. Screenshots of `run1.pptx` from each track
are in `screenshots/`.
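
For concreteness, Track A's entire LLM-produced artifact is a small declarative spec. A minimal sketch follows; the field names and numbers are hypothetical placeholders, not the actual `financial_summary_spec.json` schema or real ADUS figures:

```python
import json

# Hypothetical spec sketch -- field names and values are placeholders,
# not the real financial_summary_spec.json schema or actual ADUS data.
spec = {
    "slide_type": "financial_summary",
    "title": "Action title goes here",
    "section_header": "Historical Financial Summary",
    "columns": ["($mm)", "FY1", "FY2", "FY3"],
    "rows": [
        {"label": "Revenue", "values": [100.0, 110.0, 121.0], "style": "subtotal"},
        {"label": "% growth", "values": [None, 10.0, 10.0], "style": "pct"},
        {"label": "EBITDA", "values": [10.0, 12.0, 15.0], "style": "subtotal",
         "highlight": True},
    ],
    "source": "Company filings",
}

# All Inches/Pt/RGBColor decisions stay inside the library; rendering is then:
#   IBRenderer().render_financial_summary(spec)
print(json.dumps(spec, indent=2).count("\n") + 1)  # spec size in pretty-printed lines
```

The spec carries only content and semantic row styles; geometry never appears in it.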
| 15 | + |
## Visual comparison (run 1)

| | Track A (renderer) | Track B (prompt-guided) |
|---|---|---|
| Screenshot | `screenshots/track_a_run1.png` | `screenshots/track_b_run1.png` |
| Action title | Rendered in navy bold; wraps to two lines at the library's default title font size | Rendered in navy bold; fits on one line (LLM chose a slightly smaller title area) |
| Section header bar | Navy rectangle, white bold text | Navy rectangle, white bold text |
| Table | Real PPTX table object | Real PPTX table object |
| Right-aligned numerics | Yes | Yes |
| Bold subtotal rows | Yes (Revenue, GP, EBITDA, NI, Diluted EPS) | Yes (Revenue, GP, EBITDA, NI — Diluted EPS is *not* bolded in Track B because the prompt listed it under subtotals but the build.py mapped it to `"normal"` style) |
| Italic gray `%` rows | Yes | Yes |
| EBITDA row highlight | Light blue background | Light blue background |
| Source citation | Present, bottom-left | Present, bottom-left |
| Slide number | Present (in a navy footer bar with "Investment Banking" label — library default) | Present (bare "6" in the bottom-right corner; no footer bar) |
| "Confidential" marker | Present, top-right, red italic | Present, top-right, red italic |
| Library branding | "Advisory Group" header, "Investment Banking" footer (library defaults; not in the prompt) | Absent (the prompt didn't mention them, so the LLM didn't add them) |
| Overflow / clipping | None observed | None observed |
| 33 | + |
**Honest read of the pictures:** on this single simple slide, both tracks
produce something a reviewer would accept. The prompt-guided version is not
broken. The differences that show up visually are cosmetic (title line-wrap,
presence/absence of library footer branding, one subtotal row that Track B's
build.py mis-classified).
| 39 | + |
## Code-size comparison

| | Track A | Track B |
|---|---|---|
| Lines of spatial code the LLM had to write | **0** | **166** non-blank code lines |
| `Inches(...)` calls | 0 | 32 |
| `Pt(...)` calls | 0 | 12 |
| `RGBColor(...)` calls | 0 | 10 |
| `PP_ALIGN.*` references | 0 | 6 |
| `MSO_SHAPE.*` references | 0 | 1 |
| Total artifact the LLM produces | `financial_summary_spec.json` — 18 lines, 1,262 chars | `build.py` — 213 lines including docstring |

Track A's `build.py` (45 lines) only loads the JSON and calls the renderer. The
45 lines are infrastructure, not LLM output. The actual LLM-produced artifact
is the JSON spec.
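
The spatial-code counts above can be reproduced mechanically. A sketch, assuming simple textual matching on the build script's source is sufficient (the report does not specify the actual audit method):

```python
import re

# Regexes for the spatial-code categories counted in the table above.
PATTERNS = {
    "Inches": r"\bInches\(",
    "Pt": r"\bPt\(",
    "RGBColor": r"\bRGBColor\(",
    "PP_ALIGN": r"\bPP_ALIGN\.",
    "MSO_SHAPE": r"\bMSO_SHAPE\.",
}

def spatial_counts(source: str) -> dict:
    """Count spatial-code references in a python-pptx build script's source."""
    return {name: len(re.findall(pattern, source))
            for name, pattern in PATTERNS.items()}

# A representative python-pptx line of the kind Track B's build.py is full of:
line = "box = slide.shapes.add_textbox(Inches(0.5), Inches(0.4), Inches(9.0), Inches(0.8))"
print(spatial_counts(line))
```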
| 55 | + |
## Determinism (3 runs each)

Both tracks were run three times. PPTX files were hashed after stripping
volatile ZIP metadata (`dcterms:created`, `dcterms:modified`,
`cp:lastModifiedBy`, `cp:revision`) so that only the stable slide content
contributes to the hash. See `../tests/normalize.py`.
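
`../tests/normalize.py` is not reproduced in this report. The sketch below shows the same idea, under the assumption that the volatile fields all live in `docProps/core.xml` (where OOXML stores core properties); the real script may normalize more:

```python
import hashlib
import re
import zipfile

# Volatile core-properties elements stripped before hashing.
VOLATILE = (b"dcterms:created", b"dcterms:modified",
            b"cp:lastModifiedBy", b"cp:revision")

def normalized_hash(path: str) -> str:
    """SHA-256 of a PPTX's content with volatile docProps metadata removed.

    Sketch only -- the actual ../tests/normalize.py may differ in detail.
    """
    digest = hashlib.sha256()
    with zipfile.ZipFile(path) as z:
        for name in sorted(z.namelist()):  # fixed order; zip entry order can vary
            data = z.read(name)
            if name == "docProps/core.xml":
                for tag in VOLATILE:
                    # drop both <tag ...>...</tag> and self-closing <tag .../>
                    data = re.sub(b"<" + tag + b"[^>]*>.*?</" + tag + b">",
                                  b"", data, flags=re.S)
                    data = re.sub(b"<" + tag + b"[^>]*/>", b"", data)
            digest.update(name.encode() + b"\x00" + data + b"\x00")
    return digest.hexdigest()
```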

| | Track A | Track B |
|---|---|---|
| Unique hashes across 3 runs | 1 | 1 |
| Hash | `b36cff98f93cf143746bfb3726240a0fe83a10bf83cb6b2735b1303dcd6dd35e` | `410ae4ba5387034d4e3da8b68a17ca765041f5b4517984360e68359fa082234e` |
| 67 | + |
### What this determinism test does and does not prove

**What it proves:** running the same Python code twice produces the same PPTX
bytes, on both tracks. Track A's renderer is deterministic. Track B's
hand-written build script is also deterministic. Good, but unsurprising —
both are just Python files.

**What it does not prove:** whether an LLM, asked to build this slide *from
scratch* three separate times, would produce the same output three times. For
Track A, the LLM only needs to produce the same 18-line JSON spec, which is
much easier to keep stable across generations. For Track B, the LLM needs to
re-derive 32 `Inches()` positions, 12 `Pt()` sizes, and 10 `RGBColor()` values
from scratch each time. The variance that matters lives in the code the LLM
writes, not in running that code multiple times.

This test case does not attempt to measure LLM generation variance. It isolates
only the *architectural* variable: how much spatial code the LLM is asked to
produce in the first place.
| 86 | + |
## Where the difference actually lives

On a single slide in isolation, the two approaches are close. The renderer
approach starts to pull away when:

1. **The same layout appears on multiple slides.** A 20-slide deck with five
   `financial_summary` slides in Track A shares one renderer. In Track B, every
   slide is its own 150–200 line file, and drift between them is a natural
   outcome.
2. **Formatting conventions change.** Changing the navy color, the title font
   size, or the source-line position is a one-line edit in the renderer. In
   Track B it requires editing every build script.
3. **The data changes.** Track A: edit the JSON, rerun. Track B: re-run the
   same build script (fine) — or, if the LLM is regenerating each slide, write
   a fresh 150–200 line file each time.
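
Point 2 above is the structural argument. A toy sketch of why it holds; the names and values are illustrative, not the actual `IBRenderer` internals:

```python
# Shared theme: every renderer call reads one set of convention constants.
# Names and RGB values are illustrative, not the library's actual defaults.
THEME = {
    "navy": (0x1F, 0x33, 0x5C),
    "title_pt": 20,
}

def render_header_bar(theme=None):
    """Stand-in for the renderer's header-bar step; reads the shared theme."""
    theme = THEME if theme is None else theme
    return {"fill": theme["navy"], "title_size": theme["title_pt"]}

# One edit here re-themes every slide rendered afterwards; in Track B the
# same change means hunting down RGBColor calls in every build script.
THEME["navy"] = (0x10, 0x2A, 0x43)
print(render_header_bar())
```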

None of the above is tested by this one-slide comparison. This comparison only
establishes that the *code-size difference* and the *single-pass determinism*
are real. The scaling properties are a separate claim.
| 106 | + |
## Criteria summary

| Criterion | Track A | Track B |
|---|---|---|
| Renders without errors | Pass | Pass |
| Honors IB formatting conventions listed in the prompt | Pass | Pass (with one subtotal misclassification) |
| No overflow / clipping | Pass | Pass |
| Uses a real PPTX table object | Pass | Pass |
| Lines of spatial code the LLM wrote | 0 | 166 |
| Normalized content hash stable across 3 runs of the same code | Yes (1 unique hash) | Yes (1 unique hash) |
| Measures LLM-generation variance | No | No |

## Conclusion (narrow)

On this single test case, the renderer-first architecture produces the same
slide using zero lines of spatial code from the LLM, versus 166 lines for a
good-faith prompt-guided implementation. Both implementations render
deterministically on re-execution. Neither implementation's determinism result
speaks to LLM generation variance, which is the scaling question this
comparison does not attempt to answer.