feat(schema): drop derivable Entry.artist_guess/track_guess; add core/parse.py by jakebromberg · Pull Request #49 · WXYC/flowsheet-digitization

jakebromberg · 2026-05-11T02:09:23Z

Summary

Drops Entry.artist_guess and Entry.track_guess from core/schema.py. The model used to fill them on every row at the cost of ~20-30% of output tokens for a deterministic computation.
New core/parse.py exposes parse_artist_track(raw_text) -> tuple[str | None, str | None] — separates on the first whitespace-bracketed ASCII hyphen, en-dash, or em-dash. Compound band names like "X-Ray Spex" are NOT split (no flanking whitespace).
Removes the two bullets from PAGE_EXTRACTION_PROMPT and QUADRANT_EXTRACTION_PROMPT_TEMPLATE. Prompt-contract tests updated.
core/spot_check.collect_entries and scripts/spot_check_discogs.py updated to fall back to the parser for post-#41 extractions; the legacy shape of the 34 existing JSONs, with an explicit-null artist_guess, still skips the row (continuation marker).
14 new parametrized tests for parse_artist_track (ASCII hyphen, en-dash, em-dash, no dash, multiple dashes, whitespace, empty/None, edge-of-line separators, compound names).
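
For readers without the diff open, the split rule described above can be sketched roughly as follows. This is not the code from core/parse.py: the regex, the `_SEPARATOR` name, and in particular the return value for "no separator found" and for an empty side of the split (both mapped to None here) are assumptions made for illustration.

```python
from __future__ import annotations

import re

# Assumed pattern: whitespace, then one ASCII hyphen, en-dash (U+2013),
# or em-dash (U+2014), then whitespace. "\s" also matches a newline.
_SEPARATOR = re.compile(r"\s+[-\u2013\u2014]\s+")


def parse_artist_track(raw_text: str | None) -> tuple[str | None, str | None]:
    """Split raw_text on the first whitespace-bracketed dash (sketch)."""
    if not raw_text:
        return (None, None)
    match = _SEPARATOR.search(raw_text)
    if match is None:
        # Assumption: no separator means no split at all.
        return (None, None)
    # Assumption: an empty side after stripping is reported as None.
    artist = raw_text[: match.start()].strip() or None
    track = raw_text[match.end():].strip() or None
    return (artist, track)


print(parse_artist_track("Wire - Ex Lion Tamer"))
print(parse_artist_track("X-Ray Spex"))
print(parse_artist_track("X-Ray Spex \u2013 Oh Bondage Up Yours!"))
```

Because the pattern requires whitespace on both sides of the dash, the hyphen inside "X-Ray" never matches, and only the first qualifying dash is used, so any later dashes stay in the track half.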

Sprint 1 / Cost wins. Pattern siblings: core/continuations.py, core/comments.py.

Closes #41

Test plan

All 380 default tests pass (1 deselected as expected).
ruff check clean.
ruff format --check clean.
mypy core cli.py clean.
Backward-compat regression test: an old-shape Entry JSON with artist_guess/track_guess keys validates and dumps without them reappearing.
spot_check.collect_entries legacy-shape test confirms null artist_guess still skips the row.
spot_check.collect_entries post-audit test confirms parse_artist_track(raw_text) fallback kicks in when keys are absent.
(Out of scope locally) external_api test that Gemini's output still validates against the trimmed schema — the schema simply has fewer optional fields, so this should be a no-op.
(Out of scope locally) 5-golden calibration score unchanged — golden scorer doesn't look at the dropped fields.
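
The two collect_entries cases in the test plan come down to branching on whether the legacy keys are present. The sketch below illustrates that idea only: `resolve_artist_track` and `_split` are hypothetical names, entries are assumed to be plain dicts loaded from the corpus JSON, and `_split` is a stand-in for parse_artist_track with assumed edge-case behavior.

```python
from __future__ import annotations

import re
from typing import Any

# Stand-in for core.parse.parse_artist_track (assumed regex and behavior).
_SEPARATOR = re.compile(r"\s+[-\u2013\u2014]\s+")


def _split(raw_text: str | None) -> tuple[str | None, str | None]:
    if not raw_text:
        return (None, None)
    match = _SEPARATOR.search(raw_text)
    if match is None:
        return (None, None)
    artist = raw_text[: match.start()].strip() or None
    track = raw_text[match.end():].strip() or None
    return (artist, track)


def resolve_artist_track(
    entry: dict[str, Any],
) -> tuple[str | None, str | None] | None:
    """Return (artist, track), or None when the row should be skipped."""
    if "artist_guess" in entry:
        # Legacy shape: the key is present in the JSON.
        if entry["artist_guess"] is None:
            # An explicit null marks a continuation row, so skip it.
            return None
        return (entry["artist_guess"], entry.get("track_guess"))
    # Post-trim shape: keys are absent, so derive the split at read time.
    return _split(entry.get("raw_text"))


legacy_null = {"raw_text": "(cont.)", "artist_guess": None, "track_guess": None}
trimmed = {"raw_text": "Wire - Ex Lion Tamer"}
print(resolve_artist_track(legacy_null))
print(resolve_artist_track(trimmed))
```

Testing for the key rather than for a None value is what keeps the legacy "skip this row" meaning distinct from a new-shape entry that never had the field.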

…/parse.py

The artist/track split is deterministic: separate on the first whitespace-bracketed dash. Producing it on the model side spent output tokens for no analytic benefit (~20-30% of per-page output spend). The two fields are removed from the Entry response schema; downstream consumers call parse_artist_track(raw_text) at read time.

Sibling pattern: core.continuations.merge_continuations, core.comments.normalize_comments. Pure function, no I/O; the on-disk shape stays verbatim.

Backward compat: the 34 pre-audit corpus JSONs still carry artist_guess/track_guess keys. Pydantic v2's default extra='ignore' silently drops them on validation, and core/spot_check.collect_entries branches on key PRESENCE: legacy null still means "skip this row" (continuation marker), while new entries derive from raw_text.

Closes #41

Adds a test that documents `parse_artist_track`'s behavior when the separator straddles a newline (matches across, splits) so a future regex tightening is a deliberate decision. Adds a docstring pointer from `core.continuations.merge_continuations` to `core.parse.parse_artist_track` for callers wanting an artist/track split off the merged `raw_text`. Reframes `spot_check_discogs.py`'s "post-#41" reference as "post-schema-trim" so the wording ages better than a closed-issue number.

jakebromberg added 2 commits May 10, 2026 19:08

jakebromberg merged commit 961b165 into main May 11, 2026
3 checks passed

jakebromberg deleted the worktree-sprint1-schema-audit-parse branch May 11, 2026 02:40
