flowsheet-digitization

OCR pipeline that turns 1990–2001 WXYC handwritten flowsheet PDFs into structured JSON using Gemini 3 vision.

What it does

For each page of every PDF under scans/:

  1. Extracts the embedded page image directly with pdfimages -png. Each WXYC PDF wraps a single CCITT Group 4 (lossless 1-bit bilevel) image at a native 300 PPI, so the extracted PNG is bit-for-bit identical to the source bitmap: no rasterization, no anti-aliasing, no DPI choice.
  2. Sends the image to Gemini 3 with a Pydantic response_schema that defines the four-quadrant flowsheet layout.
  3. Stores a JSON result file with the per-row raw_text, artist_guess, track_guess, confidence, and any phase-2 notes (continuation, double-height, crossed-out, illegible).
  4. Tracks every page in a SQLite job table so reruns are idempotent and partial failures resume.
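
Step 1 can be sketched as a thin wrapper around poppler's pdfimages. This is an illustrative sketch, not the repo's actual module layout; the function names and the page-NN output prefix are assumptions.

```python
import subprocess
from pathlib import Path

def pdfimages_cmd(pdf: Path, out_dir: Path) -> list[str]:
    # -png re-wraps the embedded CCITT G4 bitmap as PNG without re-rasterizing,
    # so no DPI or anti-aliasing decisions are involved.
    return ["pdfimages", "-png", str(pdf), str(out_dir / "page")]

def extract_page_images(pdf: Path, out_dir: Path) -> list[Path]:
    """Extract every embedded image of a PDF losslessly (sketch)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdfimages_cmd(pdf, out_dir), check=True)
    return sorted(out_dir.glob("page-*.png"))
```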

Phase 1 captures the per-row "Artist – Track" text and the four-quadrant frame. Phase 2 adds the left-margin H/M/L/Std/O/R type column (Entry.type_raw) and the bottom-of-page comments field (GeminiPageResult.comments_raw); continuation/double-height handling and reconciliation against the WXYC library DB are still rolling out. See PLAN.md.

Quickstart

# 1. Install dependencies (Python 3.12+, poppler for pdfimages and pdftoppm)
brew install poppler   # macOS; on Ubuntu: sudo apt-get install poppler-utils
uv venv --python 3.12 .venv
uv pip install -e ".[dev]"

# 2. Configure
cp .env.example .env
# Edit .env: at minimum, set GEMINI_API_KEY (https://aistudio.google.com/apikey)
# Default SCANS_ROOT is ./scans. Point it elsewhere if your scans live on
# an external drive, a synced Dropbox/iCloud folder, or a future cloud mount.

# 3. Run the pipeline
.venv/bin/flowsheets discover                       # register one job per PDF page
.venv/bin/flowsheets render --limit 200             # extract embedded PNGs, parallel by default (RENDER_CONCURRENCY)
.venv/bin/flowsheets render --concurrency 8 --limit 1000   # override per-run
.venv/bin/flowsheets process --limit 50             # PNG -> Gemini -> JSON
.venv/bin/flowsheets status                         # show counts by status

Each step is resumable: rerun any subcommand and it picks up where it left off. Successful work is never repeated unless you explicitly call retry-page.
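
The resumability rests on the SQLite job table: each page row carries a status, and each subcommand only selects rows in the status it consumes, so completed work never matches again. A stdlib-only sketch of that pattern (the schema and column names here are illustrative, not jobs.db's real schema):

```python
import sqlite3

def open_jobs(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               pdf TEXT, page INTEGER,
               status TEXT DEFAULT 'pending',
               attempts INTEGER DEFAULT 0,
               PRIMARY KEY (pdf, page))"""
    )
    return db

def claim(db: sqlite3.Connection, want: str, limit: int) -> list[tuple[str, int]]:
    # Reruns are idempotent: a subcommand only ever sees rows still in the
    # status it consumes, so successful work is never repeated.
    return db.execute(
        "SELECT pdf, page FROM jobs WHERE status = ? LIMIT ?", (want, limit)
    ).fetchall()
```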

Configuration

All settings live in .env (see .env.example for the canonical list):

| Variable | Default | Purpose |
| --- | --- | --- |
| GEMINI_API_KEY | (required) | API key from https://aistudio.google.com/apikey. |
| SCANS_ROOT | ./scans | Where the PDFs live. Point at an external drive or synced folder if needed. |
| DATA_ROOT | ./data | Where rendered PNGs, JSON results, and jobs.db are written. Gitignored. |
| GEMINI_MODEL | gemini-3.1-pro-preview | The original gemini-3-pro-preview was shut down in March 2026; we use 3.1 Pro Preview. Bump as Google releases stable Gemini 3.x models. |
| GEMINI_MEDIA_RESOLUTION | high | Vision token allocation: low / medium / high (1120 tokens) / ultra_high. high is the default for fine handwriting. |
| RENDER_CONCURRENCY | 4 | Parallel pdfimages workers for flowsheets render. Override per run with --concurrency. |
| PROCESS_CONCURRENCY | 4 | Parallel Gemini calls for flowsheets process. Tune to fit your per-minute rate limit. Override per run with --concurrency. |
| MAX_ATTEMPTS | 3 | Failed pages retry up to this many times across runs before sticking in failed. |

Job lifecycle and data safety

pending  ──render──▶ rendered ──process──▶ completed
   │                    │                      ▲
   │                    │                      │  (no auto retry; only via retry-page)
   │                    │                      │
   └────────────────────┴───── failed ─────────┘
                                ▲
                                │ retried up to MAX_ATTEMPTS
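
The failed-state bookkeeping in the diagram amounts to a small transition rule; a sketch under the assumption that attempts are counted per page (function name is illustrative):

```python
def on_failure(attempts: int, max_attempts: int = 3) -> str:
    """Decide a page's next status after a failed attempt.

    Pages retry automatically on later runs until MAX_ATTEMPTS is
    exhausted, then stick in 'failed' until an explicit retry-page.
    """
    attempts += 1
    return "failed" if attempts >= max_attempts else "pending"
```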

completed is terminal and never reprocessed automatically. Successful Gemini extractions are not free — protecting them from accidental overwrite is a deliberate design choice. To re-extract a single page that you know was wrong:

flowsheets retry-page "1990/January 1990/1990-01jan0106.pdf" 5

Never use retry-page for bulk resets. If you need to redo many pages, edit the job rows in data/jobs.db directly with a targeted WHERE clause, and confirm with a SELECT first.
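
For a bulk redo, the SELECT-then-UPDATE pattern might look like this. The jobs table and its pdf/page/status columns are assumptions about jobs.db, and 'rendered' as the reset target is a guess; check the real schema before running anything like it.

```python
import sqlite3

def preview_reset(db: sqlite3.Connection, pdf_prefix: str) -> int:
    """Step 1: confirm with a SELECT how many rows the WHERE clause hits."""
    rows = db.execute(
        "SELECT pdf, page FROM jobs WHERE pdf LIKE ? AND status = 'completed'",
        (pdf_prefix + "%",),
    ).fetchall()
    return len(rows)

def apply_reset(db: sqlite3.Connection, pdf_prefix: str) -> None:
    """Step 2: only after the preview looks right, flip the rows back
    so the next `flowsheets process` run picks them up again."""
    db.execute(
        "UPDATE jobs SET status = 'rendered' "
        "WHERE pdf LIKE ? AND status = 'completed'",
        (pdf_prefix + "%",),
    )
    db.commit()
```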

Development

.venv/bin/ruff check .
.venv/bin/ruff format --check .
.venv/bin/mypy core cli.py
.venv/bin/pytest

Tests are split into:

  • tests/unit/ — schema, render, jobs, prompts, gemini client (mocked SDK), CLI (mocked pipeline), golden comparison.
  • tests/integration/ — end-to-end pipeline orchestration with mocked Gemini and real pdftoppm.
  • tests/golden/ — rendered page images plus hand-transcribed truth JSON, used to spot-check extraction quality with the real API. See tests/golden/README.md.

The default test run excludes the external_api and slow markers; CI runs the same default. The golden-page external-API runner is a follow-up.

Manual verifier

After the pipeline produces data/results/<rel>/page-NN.json, you can hand-verify and correct entries via the static SPA in verifier/. Each row's cropped image strip sits next to its detected text in an editable field. Export emits a <stem>.verified.json (PageResult-shaped, plugs back into the pipeline as ground truth) and derive_truth produces a matching tests/golden/<stem>.truth.json.

# Generate a bundle
python -m scripts.make_verifier_bundle \
    data/results/<rel>/page-NN.json \
    data/pages/<rel>/page-NN.png \
    --out data/verifier/<stem>.bundle.json

# Open the verifier
python -m http.server 8765
# then visit:
# http://localhost:8765/verifier/?bundle=/data/verifier/<stem>.bundle.json

# Derive a truth file from the exported verified.json
python -m scripts.derive_truth \
    data/verifier/<stem>.verified.json \
    --out tests/golden/<stem>.truth.json

See verifier/README.md for the bundle schema, expected file layout, and the substring-derivation rules.

Cost calibration

Gemini 3.1 Pro charges per input token; one 300-DPI flowsheet page at media_resolution=high is ~1120 image tokens plus ~600 prompt tokens. Across the full corpus (~16K pages) input cost lands in the low tens of dollars; output adds modestly. Run the pipeline against a 10–20 page sample first and inspect both quality and usage_metadata before scheduling a full run.
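
The back-of-envelope arithmetic is worth making explicit. A sketch using the per-page token figures above; the $/1M-input-token rate is a placeholder, not real Gemini pricing, so check the current price sheet before trusting the total.

```python
def estimate_input_cost(
    pages: int = 16_000,           # approximate corpus size
    image_tokens: int = 1_120,     # media_resolution=high
    prompt_tokens: int = 600,      # per-page prompt text
    usd_per_million_input: float = 1.25,  # placeholder rate, not real pricing
) -> float:
    """Estimate total input-token cost for a full-corpus run."""
    total_tokens = pages * (image_tokens + prompt_tokens)
    return total_tokens / 1_000_000 * usd_per_million_input

# 16K pages * 1,720 tokens/page ≈ 27.5M input tokens.
```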

flowsheets process registers the page prompt as a Gemini cachedContent resource on the first call of each run so the ~2-3K-token prompt isn't re-billed on every page. Caching is best-effort: if caches.create fails (prompt below the SDK's min-token threshold, model doesn't support caching, transient API error), the run continues on the un-cached path. The response schema lives in the per-call config (SDK limitation — CreateCachedContentConfig has no response_schema field), so the savings are on the prompt portion only.
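
The best-effort caching described above is just a guarded create with a fallback. A generic sketch of the pattern: the create_cache callable stands in for the SDK's caches.create call, and this is not the google-genai API itself.

```python
from typing import Callable, Optional

def make_prompt_ref(
    create_cache: Callable[[str], str],
    prompt: str,
) -> Optional[str]:
    """Try once per run to register the prompt as a cached resource.

    Returns the cache name on success, or None so callers fall back to
    sending the full prompt with every request.
    """
    try:
        return create_cache(prompt)
    except Exception:
        # Min-token threshold, unsupported model, transient API error:
        # all degrade to the un-cached path rather than failing the run.
        return None
```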

Repo conventions

  • Python 3.12, src-flat layout (core/ + top-level cli.py), Pydantic v2.
  • ruff (pinned) + mypy (pinned) so a tooling release with new rules cannot silently fail CI on an unrelated PR.
  • pytest markers: external_api (real Gemini) and slow — both excluded from default runs. See pyproject.toml.
  • See CLAUDE.md for in-repo architecture notes and the phase 1 / phase 2 split.
