OCR pipeline that turns 1990–2001 WXYC handwritten flowsheet PDFs into structured JSON using Gemini 3 vision.
For each page of every PDF under scans/:
- Extracts the embedded page image directly with pdfimages -png. The WXYC PDFs each wrap a single CCITT Group 4 (lossless 1-bit bilevel) image at 300 PPI native, so the extracted PNG is pixel-for-pixel identical to the source bitmap: no rasterization, no anti-aliasing, no DPI choice.
- Sends the image to Gemini 3 with a Pydantic response_schema that defines the four-quadrant flowsheet layout.
- Stores a JSON result file with the per-row raw_text, artist_guess, track_guess, confidence, and any phase-2 notes (continuation, double-height, crossed-out, illegible).
- Tracks every page in a SQLite job table so reruns are idempotent and partial failures resume.
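The idempotent job tracking in the last bullet can be sketched with the standard-library sqlite3 module. This is a minimal illustration and not the project's actual schema: the table name, the column names, and the register_pages helper are all hypothetical.

```python
import sqlite3

# Hypothetical schema: one row per (PDF, page). The composite primary key is
# what makes re-registration a no-op.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    pdf_path TEXT    NOT NULL,
    page     INTEGER NOT NULL,
    status   TEXT    NOT NULL DEFAULT 'pending',
    attempts INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (pdf_path, page)
)
"""


def count_jobs(conn: sqlite3.Connection) -> int:
    """Return the number of rows currently in the jobs table."""
    return conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]


def register_pages(conn: sqlite3.Connection, pdf_path: str, page_count: int) -> int:
    """Register one job per page and return how many rows were newly inserted."""
    before = count_jobs(conn)
    # INSERT OR IGNORE skips rows whose (pdf_path, page) key already exists,
    # so existing statuses are never overwritten by a second discover pass.
    conn.executemany(
        "INSERT OR IGNORE INTO jobs (pdf_path, page) VALUES (?, ?)",
        [(pdf_path, n) for n in range(1, page_count + 1)],
    )
    conn.commit()
    return count_jobs(conn) - before


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
first = register_pages(conn, "1990/January 1990/1990-01jan0106.pdf", 6)
second = register_pages(conn, "1990/January 1990/1990-01jan0106.pdf", 6)
print(first, second)
```

The second call inserts nothing because every key already exists, which is the property that lets a rerun pick up where the previous one stopped.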
Phase 1 captures the per-row "Artist – Track" text and the four-quadrant frame. Phase 2 adds the left-margin H/M/L/Std/O/R type column (Entry.type_raw), the bottom-of-page comments field (GeminiPageResult.comments_raw), and continues to roll out continuation/double-height handling and reconciliation against the WXYC library DB — see PLAN.md.
# 1. Install dependencies (Python 3.12+, poppler for pdfimages and pdftoppm)
brew install poppler # macOS; on Ubuntu: sudo apt-get install poppler-utils
uv venv --python 3.12 .venv
uv pip install -e ".[dev]"
# 2. Configure
cp .env.example .env
# Edit .env: at minimum, set GEMINI_API_KEY (https://aistudio.google.com/apikey)
# Default SCANS_ROOT is ./scans. Point it elsewhere if your scans live on
# an external drive, a synced Dropbox/iCloud folder, or a future cloud mount.
# 3. Run the pipeline
.venv/bin/flowsheets discover # register one job per PDF page
.venv/bin/flowsheets render --limit 200 # extract embedded PNGs, parallel by default (RENDER_CONCURRENCY)
.venv/bin/flowsheets render --concurrency 8 --limit 1000 # override per-run
.venv/bin/flowsheets process --limit 50 # PNG -> Gemini -> JSON
.venv/bin/flowsheets status # show counts by status

Each step is resumable: rerun any subcommand and it picks up where it left off. Successful work is never repeated unless you explicitly call retry-page.
All settings live in .env (see .env.example for the canonical list):
| Variable | Default | Purpose |
|---|---|---|
| GEMINI_API_KEY | (required) | API key from https://aistudio.google.com/apikey. |
| SCANS_ROOT | ./scans | Where the PDFs live. Point at an external drive or synced folder if needed. |
| DATA_ROOT | ./data | Where rendered PNGs, JSON results, and jobs.db are written. Gitignored. |
| GEMINI_MODEL | gemini-3.1-pro-preview | The original gemini-3-pro-preview was shut down March 2026; we use 3.1 Pro Preview. Bump as Google releases stable Gemini 3.x. |
| GEMINI_MEDIA_RESOLUTION | high | Vision token allocation: low / medium / high (1120 tokens) / ultra_high. high is the default for fine handwriting. |
| RENDER_CONCURRENCY | 4 | Parallel pdfimages workers for flowsheets render. Override per-run with --concurrency. |
| PROCESS_CONCURRENCY | 4 | Parallel Gemini calls for flowsheets process. Tune to fit your per-minute rate limit. Override per-run with --concurrency. |
| MAX_ATTEMPTS | 3 | Failed pages retry up to this many times across runs before sticking in failed. |
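
As a rough illustration of how the defaults in the table could be applied, the sketch below reads the same variable names from a mapping and falls back to the documented defaults. The Settings dataclass and load_settings function are hypothetical names, not the project's actual loader, and the sketch does not parse the .env file itself.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container mirroring the variables in the table above."""

    gemini_api_key: str
    scans_root: str
    data_root: str
    gemini_model: str
    gemini_media_resolution: str
    render_concurrency: int
    process_concurrency: int
    max_attempts: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from env (defaults to os.environ), using the documented defaults."""
    source = os.environ if env is None else env
    api_key = source.get("GEMINI_API_KEY", "")
    if not api_key:
        # The only variable without a default.
        raise ValueError("GEMINI_API_KEY is required")
    return Settings(
        gemini_api_key=api_key,
        scans_root=source.get("SCANS_ROOT", "./scans"),
        data_root=source.get("DATA_ROOT", "./data"),
        gemini_model=source.get("GEMINI_MODEL", "gemini-3.1-pro-preview"),
        gemini_media_resolution=source.get("GEMINI_MEDIA_RESOLUTION", "high"),
        render_concurrency=int(source.get("RENDER_CONCURRENCY", "4")),
        process_concurrency=int(source.get("PROCESS_CONCURRENCY", "4")),
        max_attempts=int(source.get("MAX_ATTEMPTS", "3")),
    )


settings = load_settings({"GEMINI_API_KEY": "example-key", "PROCESS_CONCURRENCY": "2"})
print(settings.process_concurrency, settings.max_attempts)
```

Passing the mapping explicitly keeps the function easy to test without touching the real environment.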
pending ──render──▶ rendered ──process──▶ completed
   │                    │                      ▲
   │                    │                      │ (no auto retry; only via retry-page)
   │                    │                      │
   └────────────────────┴───── failed ─────────┘
                                 ▲
                                 │ retried up to MAX_ATTEMPTS
completed is terminal and never reprocessed automatically. Successful Gemini extractions are not free — protecting them from accidental overwrite is a deliberate design choice. To re-extract a single page that you know was wrong:
flowsheets retry-page "1990/January 1990/1990-01jan0106.pdf" 5

Never use retry-page for bulk resets. If you need to redo many pages, edit the job rows in data/jobs.db directly with a targeted WHERE clause and confirm with a SELECT first.
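
The retry rules in the diagram above can be summarized as a small predicate that decides whether a run will pick a job up again. This is a sketch under assumptions: the Status enum and is_eligible function are hypothetical names, and it assumes attempts counts failed tries so far.

```python
from enum import Enum


class Status(str, Enum):
    """Job states named in the diagram."""

    PENDING = "pending"
    RENDERED = "rendered"
    COMPLETED = "completed"
    FAILED = "failed"


def is_eligible(status: Status, attempts: int, max_attempts: int = 3) -> bool:
    """Return True if an ordinary run should work on this job again."""
    if status is Status.COMPLETED:
        # Terminal: only an explicit retry-page may reopen a completed page.
        return False
    if status is Status.FAILED:
        # Failed pages are retried until the attempt budget is used up.
        return attempts < max_attempts
    # Pending and rendered jobs are simply waiting for their next step.
    return True


for status, attempts in [
    (Status.PENDING, 0),
    (Status.FAILED, 2),
    (Status.FAILED, 3),
    (Status.COMPLETED, 1),
]:
    print(status.value, attempts, is_eligible(status, attempts))
```

Keeping the rule in one pure function makes the "completed is never reprocessed" guarantee easy to unit-test.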
.venv/bin/ruff check .
.venv/bin/ruff format --check .
.venv/bin/mypy core cli.py
.venv/bin/pytest

Tests are split into:
- tests/unit/: schema, render, jobs, prompts, gemini client (mocked SDK), CLI (mocked pipeline), golden comparison.
- tests/integration/: end-to-end pipeline orchestration with mocked Gemini and real pdftoppm.
- tests/golden/: rendered page images plus hand-transcribed truth JSON, used to spot-check extraction quality with the real API. See tests/golden/README.md.
The default test run excludes the external_api and slow markers; CI runs the same default. The golden-page external-API runner is a follow-up.
After the pipeline produces data/results/<rel>/page-NN.json, you can hand-verify and correct entries via the static SPA in verifier/. Each row's cropped image strip sits next to its detected text in an editable field. Export emits a <stem>.verified.json (PageResult-shaped, plugs back into the pipeline as ground truth) and derive_truth produces a matching tests/golden/<stem>.truth.json.
# Generate a bundle
python -m scripts.make_verifier_bundle \
data/results/<rel>/page-NN.json \
data/pages/<rel>/page-NN.png \
--out data/verifier/<stem>.bundle.json
# Open the verifier
python -m http.server 8765
# then visit:
# http://localhost:8765/verifier/?bundle=/data/verifier/<stem>.bundle.json
# Derive a truth file from the exported verified.json
python -m scripts.derive_truth \
data/verifier/<stem>.verified.json \
--out tests/golden/<stem>.truth.json

See verifier/README.md for the bundle schema, expected file layout, and the substring-derivation rules.
Gemini 3.1 Pro charges per input token; one 300-DPI flowsheet page at media_resolution=high is ~1120 image tokens plus ~600 prompt tokens. Across the full corpus (~16K pages) input cost lands in the low tens of dollars; output adds modestly. Run the pipeline against a 10–20 page sample first and inspect both quality and usage_metadata before scheduling a full run.
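
The arithmetic behind that estimate can be reproduced with a few lines. The per-page token counts and page count come from the paragraph above; the price per million tokens is a placeholder assumption for illustration only, so substitute the rate currently published for your model before trusting the dollar figure.

```python
PAGES = 16_000                  # approximate corpus size from the text
IMAGE_TOKENS_PER_PAGE = 1_120   # media_resolution=high
PROMPT_TOKENS_PER_PAGE = 600    # approximate prompt size from the text

# ASSUMPTION: placeholder price in USD per million input tokens, not a quoted rate.
ASSUMED_USD_PER_MILLION_INPUT_TOKENS = 2.00


def estimate_input_cost(
    pages: int, image_tokens: int, prompt_tokens: int, usd_per_million: float
) -> tuple[int, float]:
    """Return (total input tokens, estimated input cost in USD)."""
    total_tokens = pages * (image_tokens + prompt_tokens)
    return total_tokens, total_tokens / 1_000_000 * usd_per_million


total_tokens, usd = estimate_input_cost(
    PAGES,
    IMAGE_TOKENS_PER_PAGE,
    PROMPT_TOKENS_PER_PAGE,
    ASSUMED_USD_PER_MILLION_INPUT_TOKENS,
)
print(f"{total_tokens:,} input tokens, about ${usd:.2f} at the assumed rate")
```

Output tokens are deliberately left out; the usage_metadata from a sample run is the better source for those.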
flowsheets process registers the page prompt as a Gemini cachedContent resource on the first call of each run so the ~2-3K-token prompt isn't re-billed on every page. Caching is best-effort: if caches.create fails (prompt below the SDK's min-token threshold, model doesn't support caching, transient API error), the run continues on the un-cached path. The response schema lives in the per-call config (SDK limitation — CreateCachedContentConfig has no response_schema field), so the savings are on the prompt portion only.
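
The best-effort fallback described above might look roughly like the following. It is a sketch, not the project's code: the cache-creation call is injected as a plain callable instead of using the real SDK, and the function names and config dictionary keys are hypothetical.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("flowsheets.cache")


def create_cache_best_effort(create: Callable[[], str]) -> Optional[str]:
    """Return the cache name from create(), or None if creation fails for any reason."""
    try:
        return create()
    except Exception as exc:  # broad on purpose: every failure means "run un-cached"
        logger.warning("prompt cache unavailable, continuing un-cached: %s", exc)
        return None


def build_call_config(cache_name: Optional[str], prompt: str) -> dict:
    """Build a per-call config; the dictionary keys are illustrative only."""
    # The response schema always travels with each call, cached prompt or not.
    config = {"response_schema": "PageResult"}
    if cache_name is not None:
        config["cached_content"] = cache_name
    else:
        config["prompt"] = prompt
    return config


def failing_create() -> str:
    raise RuntimeError("prompt below minimum token threshold")


cached = build_call_config(create_cache_best_effort(lambda: "cache-123"), "PAGE PROMPT")
uncached = build_call_config(create_cache_best_effort(failing_create), "PAGE PROMPT")
print(cached)
print(uncached)
```

Because the failure is absorbed in one place, the rest of the run does not need to know whether caching succeeded.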
- Python 3.12, src-flat layout (core/ + top-level cli.py), Pydantic v2.
- ruff (pinned) + mypy (pinned) so a tooling release with new rules cannot silently fail CI on an unrelated PR.
- pytest markers: external_api (real Gemini) and slow, both excluded from default runs. See pyproject.toml.
- See CLAUDE.md for in-repo architecture notes and the phase 1 / phase 2 split.