feat(comments): read-time normalize + entry-overlap diagnostics by jakebromberg · Pull Request #35 · WXYC/flowsheet-digitization

jakebromberg · 2026-05-10T23:26:03Z

Summary

core/comments.py adds two pure read-time functions: normalize_comments(raw) (collapses whitespace per line, trims, drops blank lines, is idempotent, and maps an empty result to None) and find_comment_entry_overlaps(page), which returns a list of CommentEntryOverlap records for cases where a normalized comments line is a case- and whitespace-insensitive substring of an entry's raw_text on the same page.
Mirrors the core.continuations pattern: no I/O, no on-disk mutation. Verbatim comments_raw on disk stays verbatim per the org data-safety rule.
Categorization, CLI wiring, and pipeline-run-path integration are intentionally deferred — out of scope on the ticket.
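
A minimal sketch of the normalization contract described above, not the code from core/comments.py. It assumes "whitespace-collapse" means replacing each run of whitespace within a line with a single space; the regex and the `_WS_RUN` name are illustrative.

```python
import re
from typing import Optional

# Assumption: any run of whitespace inside a line collapses to one space.
_WS_RUN = re.compile(r"\s+")


def normalize_comments(raw: Optional[str]) -> Optional[str]:
    """Return a cleaned copy of the comments text; the input is never mutated."""
    if raw is None:
        return None  # None passes through unchanged
    kept = []
    for line in raw.splitlines():
        collapsed = _WS_RUN.sub(" ", line).strip()
        if collapsed:  # drop lines that are blank after trimming
            kept.append(collapsed)
    # An input with no surviving lines normalizes to None, not "".
    return "\n".join(kept) or None
```

Because the output contains no blank lines and no repeated or edge whitespace, running the function on its own output changes nothing, which is the idempotency property the tests cover.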

Inventory

The ticket's step-1 inventory of data/results/ against the existing 34-page corpus shows the feature is brand-new in main and the existing extractions predate it. All 34 JSONs are missing the comments_raw key entirely (they were extracted before the schema split landed). No non-null examples yet — empirical validation against real Comments-band text will come from the next extraction run. Tests here use synthetic PageResult fixtures, which is sufficient to lock in the contract.

| state | count |
| --- | --- |
| total JSONs scanned | 34 |
| missing `comments_raw` key | 34 |
| `comments_raw: null` | 0 |
| `comments_raw` non-null | 0 |
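
A count like the one above could be reproduced with a short script along these lines. This is a sketch, not the script used for the ticket: it assumes each result is a `*.json` file directly under the results directory and that `comments_raw` is a top-level key, neither of which the PR states. The function name is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def inventory_comments_raw(results_dir: Path) -> Counter:
    """Tally the state of the comments_raw key across extraction JSONs."""
    counts: Counter = Counter()
    for path in sorted(results_dir.glob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        counts["total JSONs scanned"] += 1
        # Assumption: comments_raw lives at the top level of each JSON object.
        if "comments_raw" not in data:
            counts["missing comments_raw key"] += 1
        elif data["comments_raw"] is None:
            counts["comments_raw: null"] += 1
        else:
            counts["comments_raw non-null"] += 1
    return counts
```

Calling `inventory_comments_raw(Path("data/results"))` would return a `Counter` keyed by the same states as the table; states that never occur read as 0.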

Test plan

ruff check . clean
ruff format --check . clean
mypy core cli.py clean
pytest — 357 passed (26 new test_comments.py cases covering None passthrough, whitespace collapse, multi-line preservation, idempotency, trim, empty-to-None, and overlap diagnostics: empty/no-match/single/multi/case-insensitive/whitespace-insensitive/multi-line-one-matches/canonical-ordering)

Closes #34

… comments_raw

Mirrors the shape of core.continuations: pure read-time functions over PageResult, no I/O, no on-disk mutation. normalize_comments collapses intra-line whitespace, trims per line, drops empty interior lines, and returns None when the result is empty; verbatim on disk stays verbatim per the org data-safety rule.

find_comment_entry_overlaps returns one record per (normalized comments line, matching entry) pair where the line is a case- and whitespace-insensitive substring of an entry's raw_text on the same page; results come back in canonical quadrant order with ascending row_index.

Categorization, the audit CLI, and any wiring into the pipeline run path are intentionally deferred; this lands the primitives so the calibration scorer (and a future audit tool) can call them.

Closes #34
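
The matching and ordering rules in this commit message can be sketched as follows. Everything structural here is an assumption: `Entry`, `Page`, `find_overlaps`, and the quadrant names in `QUADRANT_ORDER` are hypothetical stand-ins, since the PR does not show the real PageResult shape or the canonical quadrant order. "Whitespace-insensitive" is read as collapsing whitespace runs on both sides before comparing.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical canonical order; the real quadrant names are not given in the PR.
QUADRANT_ORDER = ("top_left", "top_right", "bottom_left", "bottom_right")


@dataclass
class Entry:  # simplified stand-in for a flowsheet entry
    quadrant: str
    row_index: int
    raw_text: str


@dataclass
class Page:  # simplified stand-in for PageResult
    comments_raw: Optional[str]
    entries: List[Entry] = field(default_factory=list)


def _fold(text: str) -> str:
    """Collapse whitespace runs, trim, and casefold for comparison only."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def find_overlaps(page: Page) -> List[Tuple[str, str, int]]:
    """Return (folded line, quadrant, row_index) for each line/entry match."""
    if not page.comments_raw:
        return []
    needles = [n for n in (_fold(l) for l in page.comments_raw.splitlines()) if n]
    # Sort entries by quadrant position, then ascending row_index.
    ordered = sorted(
        page.entries,
        key=lambda e: (QUADRANT_ORDER.index(e.quadrant), e.row_index),
    )
    hits = []
    for entry in ordered:
        haystack = _fold(entry.raw_text)
        for needle in needles:
            if needle in haystack:  # substring test on the folded forms
                hits.append((needle, entry.quadrant, entry.row_index))
    return hits
```

Sorting the entries before matching keeps the result order independent of the order in which entries were stored, which is one straightforward way to get deterministic output.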

Addresses code-review feedback on the original PR.

(1) `CommentEntryOverlap` now carries `comment_line_raw` (the verbatim comments-band line that produced the match, with whitespace and case preserved), parallel to `entry_raw_text` on the entry side. A future audit CLI wants to display both sides exactly as the DJ wrote them; the casefolded form (`matched_text`) stays for matcher introspection.

(2) Extracted `_normalize_line` as the shared per-line primitive used by both `normalize_comments` and the matcher; the prior `_normalize_for_match` helper duplicated the same regex+strip composition.

(3) Docstring on `find_comment_entry_overlaps` now states the scope explicitly: only `entry.raw_text` is the haystack; `oddities` lists at both quadrant and page levels are intentionally not searched.

(4) Dropped the redundant empty-line filter that the refactored line walk no longer needs.
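
A reduced sketch of the record shape after this change, assuming only the three fields named above; the real `CommentEntryOverlap` presumably carries more (for example, where the entry sits on the page). `match_line` is a hypothetical helper written for this illustration, and the regex inside `_normalize_line` is an assumption about the "regex+strip composition".

```python
import re
from dataclasses import dataclass
from typing import Optional

_WS_RUN = re.compile(r"\s+")


def _normalize_line(line: str) -> str:
    """Shared per-line primitive: collapse whitespace runs, then trim."""
    return _WS_RUN.sub(" ", line).strip()


@dataclass(frozen=True)
class CommentEntryOverlap:
    comment_line_raw: str  # comments-band line exactly as written
    entry_raw_text: str    # entry text exactly as written
    matched_text: str      # normalized, casefolded form used for matching


def match_line(comment_line_raw: str, entry_raw_text: str) -> Optional[CommentEntryOverlap]:
    """Hypothetical helper: build a record if the line occurs in the entry."""
    needle = _normalize_line(comment_line_raw).casefold()
    if not needle:
        return None  # a blank line never matches
    haystack = _normalize_line(entry_raw_text).casefold()
    if needle not in haystack:
        return None
    # Both raw sides are kept verbatim; only matched_text is normalized.
    return CommentEntryOverlap(comment_line_raw, entry_raw_text, needle)
```

Keeping the raw strings alongside the folded one lets a display layer show the original text while tests and debugging can still inspect exactly what the matcher compared.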

jakebromberg added 2 commits May 10, 2026 16:25

jakebromberg merged commit 6a5dc12 into main May 11, 2026
3 checks passed

jakebromberg merged 2 commits into main from phase2-comments-normalize