
feat(schema): drop derivable Entry.artist_guess/track_guess; add core/parse.py #49

Merged
jakebromberg merged 2 commits into main from worktree-sprint1-schema-audit-parse
May 11, 2026

@jakebromberg
Member

Summary

  • Drops Entry.artist_guess and Entry.track_guess from core/schema.py. The model used to fill them on every row at the cost of ~20-30% of output tokens for a deterministic computation.
  • New core/parse.py exposes parse_artist_track(raw_text) -> tuple[str | None, str | None] — separates on the first whitespace-bracketed ASCII hyphen, en-dash, or em-dash. Compound band names like "X-Ray Spex" are NOT split (no flanking whitespace).
  • Removes the two bullets from PAGE_EXTRACTION_PROMPT and QUADRANT_EXTRACTION_PROMPT_TEMPLATE. Prompt-contract tests updated.
  • core/spot_check.collect_entries and scripts/spot_check_discogs.py now fall back to the parser for post-#41 extractions; entries in the legacy shape (the 34 pre-audit JSONs) with an explicit-null artist_guess are still skipped (continuation marker).
  • 14 new parametrized tests for parse_artist_track (ASCII hyphen, en-dash, em-dash, no dash, multiple dashes, whitespace, empty/None, edge-of-line separators, compound names).
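
For readers without the diff open, the split rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in core/parse.py: the regex, the `(None, None)` return for text with no separator, and the handling of an empty side are all assumptions made here.

```python
from __future__ import annotations

import re

# Assumed pattern: an ASCII hyphen, en-dash (U+2013), or em-dash (U+2014)
# with at least one whitespace character on each side.
_SEPARATOR = re.compile(r"\s+[-\u2013\u2014]\s+")


def parse_artist_track(raw_text: str | None) -> tuple[str | None, str | None]:
    """Split raw_text on the first whitespace-bracketed dash."""
    if not raw_text:
        return None, None
    match = _SEPARATOR.search(raw_text)
    if match is None:
        # Assumption: no separator means neither field can be derived.
        return None, None
    artist = raw_text[: match.start()].strip()
    track = raw_text[match.end():].strip()
    # Assumption: an empty side is reported as None rather than "".
    return artist or None, track or None


print(parse_artist_track("X-Ray Spex - Oh Bondage Up Yours!"))
# ('X-Ray Spex', 'Oh Bondage Up Yours!')
print(parse_artist_track("X-Ray Spex"))
# (None, None)
print(parse_artist_track("A - B - C"))
# ('A', 'B - C')
```

Because the hyphen in "X-Ray" has no flanking whitespace, the pattern skips it and the first match is the separator after "Spex". Only the first separator is consumed, so any later dashes stay in the track half.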

Sprint 1 / Cost wins. Pattern siblings: core/continuations.py, core/comments.py.

Closes #41

Test plan

  • All 380 default tests pass (1 deselected as expected).
  • ruff check clean.
  • ruff format --check clean.
  • mypy core cli.py clean.
  • Backward-compat regression test: an old-shape Entry JSON with artist_guess/track_guess keys validates and dumps without them reappearing.
  • spot_check.collect_entries legacy-shape test confirms null artist_guess still skips the row.
  • spot_check.collect_entries post-audit test confirms parse_artist_track(raw_text) fallback kicks in when keys are absent.
  • (Out of scope locally) external_api test that Gemini's output still validates against the trimmed schema — the schema simply has fewer optional fields, so this should be a no-op.
  • (Out of scope locally) 5-golden calibration score unchanged — golden scorer doesn't look at the dropped fields.

…/parse.py

The artist/track split is deterministic — separate on the first whitespace-bracketed dash. Producing it on the model side spent output tokens for no analytic benefit (~20-30% of per-page output spend). The two fields are removed from the Entry response schema; downstream consumers call parse_artist_track(raw_text) at read time.

Sibling pattern: core.continuations.merge_continuations, core.comments.normalize_comments. Pure function, no I/O; the on-disk shape stays verbatim.

Backward compat: the 34 pre-audit corpus JSONs still carry artist_guess/track_guess keys. Pydantic v2's default extra='ignore' silently drops them on validation, and core/spot_check.collect_entries branches on key PRESENCE — legacy null still means "skip this row" (continuation marker), while new entries derive from raw_text.
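
The key-presence branch can be illustrated with plain dicts. The sketch below is hypothetical: `artist_track_for` and `_split` are names invented for this example, `_split` is only a stand-in for core.parse.parse_artist_track, and the real collect_entries may be structured differently.

```python
from __future__ import annotations

import re
from typing import Any

# Assumed separator pattern; the actual regex in core/parse.py is not shown here.
_SEP = re.compile(r"\s+[-\u2013\u2014]\s+")


def _split(raw_text: str) -> tuple[str | None, str | None]:
    """Stand-in for core.parse.parse_artist_track (illustration only)."""
    match = _SEP.search(raw_text)
    if match is None:
        return None, None
    artist = raw_text[: match.start()].strip()
    track = raw_text[match.end():].strip()
    return artist or None, track or None


def artist_track_for(entry: dict[str, Any]) -> tuple[str | None, str | None] | None:
    """Return (artist, track), or None when the row should be skipped."""
    if "artist_guess" in entry:
        # Legacy shape: the key is present. An explicit null marks a
        # continuation row, so the row is skipped.
        if entry["artist_guess"] is None:
            return None
        return entry["artist_guess"], entry.get("track_guess")
    # Post-trim shape: the keys are absent, so derive from raw_text at read time.
    return _split(entry.get("raw_text") or "")


legacy_skip = {"raw_text": "(cont.)", "artist_guess": None, "track_guess": None}
legacy_row = {"raw_text": "A - B", "artist_guess": "A", "track_guess": "B"}
new_row = {"raw_text": "A - B"}

print(artist_track_for(legacy_skip))  # None
print(artist_track_for(legacy_row))   # ('A', 'B')
print(artist_track_for(new_row))      # ('A', 'B')
```

Testing `"artist_guess" in entry` rather than `entry.get("artist_guess")` is what keeps the two cases apart: a legacy null and a missing key both read as `None` through `.get`, but only the former means "skip".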

Closes #41
Adds a test that documents `parse_artist_track`'s behavior when the separator straddles a newline (matches across, splits) so a future regex tightening is a deliberate decision. Adds a docstring pointer from `core.continuations.merge_continuations` to `core.parse.parse_artist_track` for callers wanting an artist/track split off the merged `raw_text`. Reframes `spot_check_discogs.py`'s "post-#41" reference as "post-schema-trim" so the wording ages better than a closed-issue number.
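
The newline behavior follows from how `\s` works in Python's `re` module: it matches `"\n"` as well as spaces and tabs. The sketch below contrasts an assumed `\s`-based separator with a hypothetical tightened variant that only accepts spaces and tabs. Neither pattern is taken from core/parse.py, and `split_once` is a helper invented for this example.

```python
import re

# Assumed shape of the current separator: \s matches newlines too.
WHITESPACE_ANY = re.compile(r"\s+[-\u2013\u2014]\s+")
# Hypothetical tightening: only spaces and tabs may flank the dash.
SAME_LINE_ONLY = re.compile(r"[ \t]+[-\u2013\u2014][ \t]+")


def split_once(pattern, text):
    """Split text at the first match of pattern, or return None if no match."""
    match = pattern.search(text)
    if match is None:
        return None
    return text[: match.start()], text[match.end():]


straddling = "Artist\n- Track"
print(split_once(WHITESPACE_ANY, straddling))  # ('Artist', 'Track')
print(split_once(SAME_LINE_ONLY, straddling))  # None
```

A test pinned to the first result would fail if the pattern were later narrowed to the second form, which is the sense in which the tightening becomes a deliberate decision.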
@jakebromberg jakebromberg merged commit 961b165 into main May 11, 2026
3 checks passed
@jakebromberg jakebromberg deleted the worktree-sprint1-schema-audit-parse branch May 11, 2026 02:40

Development

Successfully merging this pull request may close these issues.

PR B: Drop derivable Entry.artist_guess / track_guess; add core/parse.py
