feat(schema): drop derivable Entry.artist_guess/track_guess; add core/parse.py #49
Merged
…/parse.py

The artist/track split is deterministic: separate on the first whitespace-bracketed dash. Producing it on the model side spent output tokens for no analytic benefit (~20-30% of per-page output spend). The two fields are removed from the Entry response schema; downstream consumers call parse_artist_track(raw_text) at read time.

Sibling pattern: core.continuations.merge_continuations, core.comments.normalize_comments. Pure function, no I/O; the on-disk shape stays verbatim.

Backward compat: the 34 pre-audit corpus JSONs still carry artist_guess/track_guess keys. Pydantic v2's default extra='ignore' silently drops them on validation, and core/spot_check.collect_entries branches on key PRESENCE: legacy null still means "skip this row" (continuation marker), while new entries derive from raw_text.

Closes #41
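For readers who want a concrete picture of the split rule described in this commit, here is a minimal sketch. It is not the code from core/parse.py: the regex, the `_SEPARATOR` name, and the choice to return `(None, None)` when no separator is found are assumptions made for illustration; only the function name, the signature, and the "first whitespace-bracketed dash" rule come from the PR text.

```python
import re
from typing import Optional, Tuple

# Assumed pattern: whitespace, one ASCII hyphen / en-dash / em-dash, whitespace.
_SEPARATOR = re.compile(r"\s+[-\u2013\u2014]\s+")


def parse_artist_track(raw_text: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Split raw_text on the first whitespace-bracketed dash (illustrative sketch)."""
    if not raw_text:
        # Empty or None input: nothing to derive (assumed behavior).
        return None, None
    match = _SEPARATOR.search(raw_text)
    if match is None:
        # No flanked dash, e.g. a compound name such as "X-Ray Spex".
        return None, None
    artist = raw_text[:match.start()].strip() or None
    track = raw_text[match.end():].strip() or None
    return artist, track
```

Because the function is pure and only reads `raw_text`, it can run at read time without touching the stored JSON, which matches the "on-disk shape stays verbatim" note above.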
Adds a test that documents `parse_artist_track`'s behavior when the separator straddles a newline (matches across, splits) so a future regex tightening is a deliberate decision. Adds a docstring pointer from `core.continuations.merge_continuations` to `core.parse.parse_artist_track` for callers wanting an artist/track split off the merged `raw_text`. Reframes `spot_check_discogs.py`'s "post-#41" reference as "post-schema-trim" so the wording ages better than a closed-issue number.
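The newline-straddling case can be illustrated with a short sketch. It assumes the separator pattern uses `\s`, which also matches a newline; the `LOOSE` and `TIGHT` patterns and the `split_first` helper are hypothetical names, not taken from the repository. The second test shows what a future tightening to spaces and tabs only would change.

```python
import re

# Assumed current behavior: \s matches "\n", so a separator may straddle lines.
LOOSE = re.compile(r"\s+[-\u2013\u2014]\s+")
# Hypothetical tightening: only spaces and tabs may flank the dash.
TIGHT = re.compile(r"[ \t]+[-\u2013\u2014][ \t]+")


def split_first(pattern, raw_text):
    """Return (left, right) split on the first match, or None if no match."""
    parts = pattern.split(raw_text, maxsplit=1)
    return (parts[0], parts[1]) if len(parts) == 2 else None


def test_separator_straddling_newline_splits():
    # Documents the loose behavior: the match crosses the newline and splits.
    assert split_first(LOOSE, "Artist -\nTrack") == ("Artist", "Track")


def test_tightened_pattern_would_not_split():
    # A stricter regex would leave the text unsplit.
    assert split_first(TIGHT, "Artist -\nTrack") is None
```

Pinning the loose behavior in a test means a later regex change fails loudly instead of silently altering how multi-line rows are split.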
Summary
- Drop `Entry.artist_guess` and `Entry.track_guess` from `core/schema.py`. The model used to fill them on every row at the cost of ~20-30% of output tokens for a deterministic computation.
- `core/parse.py` exposes `parse_artist_track(raw_text) -> tuple[str | None, str | None]`, which separates on the first whitespace-bracketed ASCII hyphen, en-dash, or em-dash. Compound band names like "X-Ray Spex" are NOT split (no flanking whitespace).
- `PAGE_EXTRACTION_PROMPT` and `QUADRANT_EXTRACTION_PROMPT_TEMPLATE` updated to match the trimmed schema. Prompt-contract tests updated.
- `core/spot_check.collect_entries` and `scripts/spot_check_discogs.py` updated to fall back to the parser for post-#41 extractions; the legacy 34-JSON shape with explicit-null `artist_guess` still skips (continuation marker).
- Tests for `parse_artist_track` (ASCII hyphen, en-dash, em-dash, no dash, multiple dashes, whitespace, empty/None, edge-of-line separators, compound names).

Sprint 1 / Cost wins. Pattern siblings: `core/continuations.py`, `core/comments.py`.

Closes #41
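The key-presence fallback described above can be sketched as follows. This is an illustration, not the body of `core/spot_check.collect_entries`: the `resolve_artist_track` helper, its return convention (`None` meaning "skip this row"), and the inline parser are all hypothetical, and entries are modeled as plain dicts.

```python
import re
from typing import Optional, Tuple

_SEPARATOR = re.compile(r"\s+[-\u2013\u2014]\s+")  # assumed separator pattern


def parse_artist_track(raw_text: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Illustrative stand-in for core.parse.parse_artist_track."""
    if not raw_text:
        return None, None
    parts = _SEPARATOR.split(raw_text, maxsplit=1)
    if len(parts) != 2:
        return None, None
    return parts[0].strip() or None, parts[1].strip() or None


def resolve_artist_track(entry: dict) -> Optional[Tuple[Optional[str], Optional[str]]]:
    """Branch on key PRESENCE, not on value (hypothetical helper)."""
    if "artist_guess" in entry:
        if entry["artist_guess"] is None:
            # Legacy shape: explicit null marks a continuation row, so skip it.
            return None
        # Legacy shape with model-filled values: use them as stored.
        return entry["artist_guess"], entry.get("track_guess")
    # Post-trim shape: keys are absent, so derive from raw_text.
    return parse_artist_track(entry.get("raw_text"))
```

Testing for the key rather than for a truthy value is what keeps a legacy null distinguishable from a new entry that never had the field.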
Test plan
- `ruff check` clean.
- `ruff format --check` clean.
- `mypy core cli.py` clean.
- `Entry` JSON with `artist_guess`/`track_guess` keys validates and dumps without them reappearing.
- `spot_check.collect_entries` legacy-shape test confirms null `artist_guess` still skips the row.
- `spot_check.collect_entries` post-audit test confirms `parse_artist_track(raw_text)` fallback kicks in when keys are absent.
- `external_api` test that Gemini's output still validates against the trimmed schema; the schema simply has fewer optional fields, so this should be a no-op.