feat(comments): read-time normalize + entry-overlap diagnostics #35

Merged: jakebromberg merged 2 commits into main from phase2-comments-normalize on May 11, 2026

@jakebromberg
Member

Summary

  • core/comments.py: pure read-time normalize_comments(raw) (whitespace-collapse per line, trim, drop blank lines, idempotent, empty→None) and find_comment_entry_overlaps(page) returning a list of CommentEntryOverlap records for cases where a normalized comments line is a case- and whitespace-insensitive substring of an entry's raw_text on the same page.
  • Mirrors the core.continuations pattern: no I/O, no on-disk mutation. Verbatim comments_raw on disk stays verbatim per the org data-safety rule.
  • Categorization, CLI wiring, and pipeline-run-path integration are intentionally deferred — out of scope on the ticket.
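A minimal sketch of the normalization contract listed above (whitespace collapse per line, trim, drop blank lines, idempotent, empty→None). The function bodies are assumptions written from this description, not the code in core/comments.py; only the names normalize_comments and _normalize_line come from the PR.

```python
import re
from typing import Optional

# Any run of whitespace inside a line collapses to a single space.
_WS_RUN = re.compile(r"\s+")


def _normalize_line(line: str) -> str:
    """Collapse intra-line whitespace and trim both ends (assumed behavior)."""
    return _WS_RUN.sub(" ", line).strip()


def normalize_comments(raw: Optional[str]) -> Optional[str]:
    """Pure read-time normalization: no I/O, and the input is never mutated."""
    if raw is None:
        return None  # None passes through unchanged
    lines = [_normalize_line(line) for line in raw.splitlines()]
    kept = [line for line in lines if line]  # drop blank lines
    # An input that normalizes to nothing is reported as None, not "".
    return "\n".join(kept) if kept else None
```

Because the output contains no blank lines, no repeated spaces, and no leading or trailing whitespace, feeding it back through normalize_comments returns it unchanged, which is the idempotency property the tests check.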

Inventory

The ticket's step-1 inventory of data/results/ against the existing 34-page corpus shows the feature is brand-new in main and the existing extractions predate it. All 34 JSONs are missing the comments_raw key entirely (they were extracted before the schema split landed). No non-null examples yet — empirical validation against real Comments-band text will come from the next extraction run. Tests here use synthetic PageResult fixtures, which is sufficient to lock in the contract.

| state | count |
| --- | --- |
| total JSONs scanned | 34 |
| missing comments_raw key | 34 |
| comments_raw: null | 0 |
| comments_raw non-null | 0 |
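A tally like this one could be reproduced with a short script. The sketch below is hypothetical: it assumes comments_raw is a top-level key in each JSON file and that the files sit directly in the scanned directory, neither of which this PR confirms. The helper names classify_comments_state and tally_comments_states are illustrative and do not come from the repository.

```python
import json
from collections import Counter
from pathlib import Path


def classify_comments_state(payload: dict) -> str:
    """Bucket one parsed result by its comments_raw state.

    Assumes comments_raw is a top-level key (an assumption, not confirmed).
    """
    if "comments_raw" not in payload:
        return "missing comments_raw key"
    if payload["comments_raw"] is None:
        return "comments_raw: null"
    return "comments_raw non-null"


def tally_comments_states(results_dir: Path) -> Counter:
    """Count states across every *.json file directly under results_dir."""
    counts: Counter = Counter()
    for path in sorted(results_dir.glob("*.json")):
        payload = json.loads(path.read_text(encoding="utf-8"))
        counts[classify_comments_state(payload)] += 1
    return counts
```

Keeping "missing key" separate from "null" matters here: a missing key means the extraction predates the schema split, while null would mean the page was extracted under the new schema and had no comments.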

Test plan

  • ruff check . clean
  • ruff format --check . clean
  • mypy core cli.py clean
  • pytest — 357 passed (26 new test_comments.py cases covering None passthrough, whitespace collapse, multi-line preservation, idempotency, trim, empty-to-None, and overlap diagnostics: empty/no-match/single/multi/case-insensitive/whitespace-insensitive/multi-line-one-matches/canonical-ordering)

Closes #34

… comments_raw

Mirrors the shape of core.continuations: pure read-time functions over PageResult, no I/O, no on-disk mutation. normalize_comments collapses intra-line whitespace, trims per line, drops empty interior lines, and returns None when the result is empty — verbatim on disk stays verbatim per the org data-safety rule. find_comment_entry_overlaps returns one record per (normalized comments line, matching entry) pair where the line is a case- and whitespace-insensitive substring of an entry's raw_text on the same page; results come back in canonical quadrant order with ascending row_index. Categorization, the audit CLI, and any wiring into the pipeline run path are intentionally deferred — this lands the primitives so the calibration scorer (and a future audit tool) can call them.

Closes #34
Addresses code-review feedback on the original PR. (1) `CommentEntryOverlap` now carries `comment_line_raw` — the verbatim comments-band line that produced the match, with whitespace and case preserved — parallel to `entry_raw_text` on the entry side. A future audit CLI wants to display both sides exactly as the DJ wrote them; the casefolded form (`matched_text`) stays for matcher introspection. (2) Extracted `_normalize_line` as the shared per-line primitive used by both `normalize_comments` and the matcher; the prior `_normalize_for_match` helper duplicated the same regex+strip composition. (3) Docstring on `find_comment_entry_overlaps` now states the scope explicitly — only `entry.raw_text` is the haystack; `oddities` lists at both quadrant and page levels are intentionally not searched. (4) Dropped the redundant empty-line filter that the refactored line walk no longer needs.
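The matcher described in these commit messages can be sketched as follows. This is an illustration under stated assumptions, not the merged implementation: Entry, Quadrant, and Page are hypothetical stand-ins for the real PageResult model, the quadrant and row_index fields on the record are assumed, and "canonical quadrant order" is taken to be the order of page.quadrants. Only CommentEntryOverlap, comment_line_raw, entry_raw_text, matched_text, _normalize_line, and find_comment_entry_overlaps are names from the PR.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

_WS_RUN = re.compile(r"\s+")


def _normalize_line(line: str) -> str:
    """Shared per-line primitive: collapse whitespace runs, then trim."""
    return _WS_RUN.sub(" ", line).strip()


@dataclass(frozen=True)
class Entry:  # hypothetical stand-in for a real entry
    row_index: int
    raw_text: str


@dataclass(frozen=True)
class Quadrant:  # hypothetical stand-in
    name: str
    entries: List[Entry]


@dataclass(frozen=True)
class Page:  # hypothetical stand-in for PageResult
    comments_raw: Optional[str]
    quadrants: List[Quadrant]  # assumed to already be in canonical order


@dataclass(frozen=True)
class CommentEntryOverlap:
    quadrant: str          # assumed field
    row_index: int         # assumed field
    comment_line_raw: str  # verbatim comments line, case and spacing kept
    matched_text: str      # normalized, casefolded needle
    entry_raw_text: str    # verbatim entry text


def find_comment_entry_overlaps(page: Page) -> List[CommentEntryOverlap]:
    """Only entry.raw_text is searched; oddities lists are not."""
    if page.comments_raw is None:
        return []
    needles = []
    for raw_line in page.comments_raw.splitlines():
        key = _normalize_line(raw_line).casefold()
        if key:  # an empty needle would be a substring of every entry
            needles.append((raw_line, key))
    overlaps = []
    for quadrant in page.quadrants:
        for entry in sorted(quadrant.entries, key=lambda e: e.row_index):
            haystack = _normalize_line(entry.raw_text).casefold()
            for raw_line, key in needles:
                if key in haystack:
                    overlaps.append(CommentEntryOverlap(
                        quadrant.name, entry.row_index,
                        raw_line, key, entry.raw_text))
    return overlaps
```

Carrying both comment_line_raw and matched_text on the record lets a caller show the line exactly as written while still being able to inspect the form the matcher actually compared.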
@jakebromberg merged commit 6a5dc12 into main on May 11, 2026
3 checks passed

Development

Successfully merging this pull request may close these issues.

Phase 2: read-time normalization and entry-overlap diagnostics for comments_raw