# Results: Renderer-first vs prompt-guided spatial code

## What was compared

One slide type (`financial_summary`), one fixed input (ADUS 3-year historical),
two code paths:

- **Track A** — `IBRenderer().render_financial_summary(spec)` against an 18-line JSON spec.
- **Track B** — a hand-written `python-pptx` build script representing what an
  LLM plausibly writes on a first pass given the prompt and no reference to an
  existing renderer. Not engineered to fail.

Both tracks were run three times. Screenshots of `run1.pptx` from each track
are in `screenshots/`.
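
For concreteness, Track A's entire LLM-produced artifact is a small declarative spec. A minimal sketch follows; the field names and numbers are hypothetical placeholders, not the actual `financial_summary_spec.json` schema or real ADUS figures:

```python
import json

# Hypothetical spec sketch -- field names and values are placeholders,
# not the real financial_summary_spec.json schema or actual ADUS data.
spec = {
    "slide_type": "financial_summary",
    "title": "Action title goes here",
    "section_header": "Historical Financial Summary",
    "columns": ["($mm)", "FY1", "FY2", "FY3"],
    "rows": [
        {"label": "Revenue", "values": [100.0, 110.0, 121.0], "style": "subtotal"},
        {"label": "% growth", "values": [None, 10.0, 10.0], "style": "pct"},
        {"label": "EBITDA", "values": [10.0, 12.0, 15.0], "style": "subtotal",
         "highlight": True},
    ],
    "source": "Company filings",
}

# All Inches/Pt/RGBColor decisions stay inside the library; rendering is then:
#   IBRenderer().render_financial_summary(spec)
print(json.dumps(spec, indent=2).count("\n") + 1)  # spec size in pretty-printed lines
```

The spec carries only content and semantic row styles; geometry never appears in it.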
| 15 | + |
## Visual comparison (run 1)

| | Track A (renderer) | Track B (prompt-guided) |
|---|---|---|
| Screenshot | `screenshots/track_a_run1.png` | `screenshots/track_b_run1.png` |
| Action title | Rendered in navy bold; wraps to two lines at the library's default title font size | Rendered in navy bold; fits on one line (LLM chose a slightly smaller title area) |
| Section header bar | Navy rectangle, white bold text | Navy rectangle, white bold text |
| Table | Real PPTX table object | Real PPTX table object |
| Right-aligned numerics | Yes | Yes |
| Bold subtotal rows | Yes (Revenue, GP, EBITDA, NI, Diluted EPS) | Yes (Revenue, GP, EBITDA, NI — Diluted EPS is *not* bolded in Track B because the prompt listed it under subtotals but the build.py mapped it to `"normal"` style) |
| Italic gray `%` rows | Yes | Yes |
| EBITDA row highlight | Light blue background | Light blue background |
| Source citation | Present, bottom-left | Present, bottom-left |
| Slide number | Present (in a navy footer bar with "Investment Banking" label — library default) | Present (bare "6" in the bottom-right corner; no footer bar) |
| "Confidential" marker | Present, top-right, red italic | Present, top-right, red italic |
| Library branding | "Advisory Group" header, "Investment Banking" footer (library defaults; not in the prompt) | Absent (the prompt didn't mention them, so the LLM didn't add them) |
| Overflow / clipping | None observed | None observed |
| 33 | + |
**Honest read of the pictures:** on this single simple slide, both tracks
produce something a reviewer would accept. The prompt-guided version is not
broken. The differences that show up visually are cosmetic (title line-wrap,
presence/absence of library footer branding, one subtotal row that Track B's
build.py mis-classified).
| 39 | + |
## Code-size comparison

| | Track A | Track B |
|---|---|---|
| Lines of spatial code the LLM had to write | **0** | **166** non-blank code lines |
| `Inches(...)` calls | 0 | 32 |
| `Pt(...)` calls | 0 | 12 |
| `RGBColor(...)` calls | 0 | 10 |
| `PP_ALIGN.*` references | 0 | 6 |
| `MSO_SHAPE.*` references | 0 | 1 |
| Total artifact the LLM produces | `financial_summary_spec.json` — 18 lines, 1,262 chars | `build.py` — 213 lines including docstring |

Track A's `build.py` (45 lines) only loads the JSON and calls the renderer. The
45 lines are infrastructure, not LLM output. The actual LLM-produced artifact
is the JSON spec.
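
The spatial-code counts above can be reproduced mechanically. A sketch, assuming simple textual matching on the build script's source is sufficient (the report does not specify the actual audit method):

```python
import re

# Regexes for the spatial-code categories counted in the table above.
PATTERNS = {
    "Inches": r"\bInches\(",
    "Pt": r"\bPt\(",
    "RGBColor": r"\bRGBColor\(",
    "PP_ALIGN": r"\bPP_ALIGN\.",
    "MSO_SHAPE": r"\bMSO_SHAPE\.",
}

def spatial_counts(source: str) -> dict:
    """Count spatial-code references in a python-pptx build script's source."""
    return {name: len(re.findall(pattern, source))
            for name, pattern in PATTERNS.items()}

# A representative python-pptx line of the kind Track B's build.py is full of:
line = "box = slide.shapes.add_textbox(Inches(0.5), Inches(0.4), Inches(9.0), Inches(0.8))"
print(spatial_counts(line))
```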
| 55 | + |
## Determinism (3 runs each)

Both tracks were run three times. PPTX files were hashed after stripping
volatile ZIP metadata (`dcterms:created`, `dcterms:modified`,
`cp:lastModifiedBy`, `cp:revision`) so that only the stable slide content
contributes to the hash. See `../tests/normalize.py`.
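
`../tests/normalize.py` is not reproduced in this report. The sketch below shows the same idea, under the assumption that the volatile fields all live in `docProps/core.xml` (where OOXML stores core properties); the real script may normalize more:

```python
import hashlib
import re
import zipfile

# Volatile core-properties elements stripped before hashing.
VOLATILE = (b"dcterms:created", b"dcterms:modified",
            b"cp:lastModifiedBy", b"cp:revision")

def normalized_hash(path: str) -> str:
    """SHA-256 of a PPTX's content with volatile docProps metadata removed.

    Sketch only -- the actual ../tests/normalize.py may differ in detail.
    """
    digest = hashlib.sha256()
    with zipfile.ZipFile(path) as z:
        for name in sorted(z.namelist()):  # fixed order; zip entry order can vary
            data = z.read(name)
            if name == "docProps/core.xml":
                for tag in VOLATILE:
                    # drop both <tag ...>...</tag> and self-closing <tag .../>
                    data = re.sub(b"<" + tag + b"[^>]*>.*?</" + tag + b">",
                                  b"", data, flags=re.S)
                    data = re.sub(b"<" + tag + b"[^>]*/>", b"", data)
            digest.update(name.encode() + b"\x00" + data + b"\x00")
    return digest.hexdigest()
```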

| | Track A | Track B |
|---|---|---|
| Unique hashes across 3 runs | 1 | 1 |
| Hash | `b36cff98f93cf143746bfb3726240a0fe83a10bf83cb6b2735b1303dcd6dd35e` | `410ae4ba5387034d4e3da8b68a17ca765041f5b4517984360e68359fa082234e` |
| 67 | + |
### What this determinism test does and does not prove

**What it proves:** running the same Python code twice produces the same PPTX
bytes, on both tracks. Track A's renderer is deterministic. Track B's
hand-written build script is also deterministic. Good, but unsurprising —
both are just Python files.

**What it does not prove:** whether an LLM, asked to build this slide *from
scratch* three separate times, would produce the same output three times. For
Track A, the LLM only needs to produce the same 18-line JSON spec, which is
much easier to keep stable across generations. For Track B, the LLM needs to
re-derive 32 `Inches()` positions, 12 `Pt()` sizes, and 10 `RGBColor()` values
from scratch each time. The variance that matters lives in the code the LLM
writes, not in running that code multiple times.

This test case does not attempt to measure LLM generation variance. It isolates
only the *architectural* variable: how much spatial code the LLM is asked to
produce in the first place.
| 86 | + |
## Where the difference actually lives

On a single slide in isolation, the two approaches are close. The renderer
approach starts to pull away when:

1. **The same layout appears on multiple slides.** A 20-slide deck with five
   `financial_summary` slides in Track A shares one renderer. In Track B, every
   slide is its own 150–200 line file, and drift between them is a natural
   outcome.
2. **Formatting conventions change.** Changing the navy color, the title font
   size, or the source-line position is a one-line edit in the renderer. In
   Track B it requires editing every build script.
3. **The data changes.** Track A: edit the JSON, rerun. Track B: re-run the
   same build script (fine) — or, if the LLM is regenerating each slide, write
   a fresh 150–200 line file each time.
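
Point 2 above is the structural argument. A toy sketch of why it holds; the names and values are illustrative, not the actual `IBRenderer` internals:

```python
# Shared theme: every renderer call reads one set of convention constants.
# Names and RGB values are illustrative, not the library's actual defaults.
THEME = {
    "navy": (0x1F, 0x33, 0x5C),
    "title_pt": 20,
}

def render_header_bar(theme=None):
    """Stand-in for the renderer's header-bar step; reads the shared theme."""
    theme = THEME if theme is None else theme
    return {"fill": theme["navy"], "title_size": theme["title_pt"]}

# One edit here re-themes every slide rendered afterwards; in Track B the
# same change means hunting down RGBColor calls in every build script.
THEME["navy"] = (0x10, 0x2A, 0x43)
print(render_header_bar())
```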

None of the above is tested by this one-slide comparison. This comparison only
establishes that the *code-size difference* and the *single-pass determinism*
are real. The scaling properties are a separate claim.
| 106 | + |
## Criteria summary

| Criterion | Track A | Track B |
|---|---|---|
| Renders without errors | Pass | Pass |
| Honors IB formatting conventions listed in the prompt | Pass | Pass (with one subtotal misclassification) |
| No overflow / clipping | Pass | Pass |
| Uses a real PPTX table object | Pass | Pass |
| Lines of spatial code the LLM wrote | 0 | 166 |
| Normalized content hash stable across 3 runs of the same code | Yes (1 unique hash) | Yes (1 unique hash) |
| Measures LLM-generation variance | No | No |

## Conclusion (narrow)

On this single test case, the renderer-first architecture produces the same
slide using zero lines of spatial code from the LLM, versus 166 lines for a
good-faith prompt-guided implementation. Both implementations render
deterministically on re-execution. Neither implementation's determinism result
speaks to LLM generation variance, which is the scaling question this
comparison does not attempt to answer.