feat(comments): read-time normalize + entry-overlap diagnostics #35
Merged
… comments_raw

Mirrors the shape of `core.continuations`: pure read-time functions over `PageResult`, no I/O, no on-disk mutation. `normalize_comments` collapses intra-line whitespace, trims per line, drops empty interior lines, and returns `None` when the result is empty; verbatim on disk stays verbatim per the org data-safety rule.

`find_comment_entry_overlaps` returns one record per (normalized comments line, matching entry) pair where the line is a case- and whitespace-insensitive substring of an entry's `raw_text` on the same page; results come back in canonical quadrant order with ascending `row_index`.

Categorization, the audit CLI, and any wiring into the pipeline run path are intentionally deferred; this lands the primitives so the calibration scorer (and a future audit tool) can call them.

Closes #34
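The normalization contract described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in `core/comments.py`: the regex, the `_normalize_line` body, and the use of `str.splitlines` are assumptions chosen to match the stated behavior (collapse, trim, drop blank lines, empty to `None`, idempotent).

```python
import re
from typing import Optional

# Assumption: any run of whitespace inside a line collapses to one space.
_WS_RUN = re.compile(r"\s+")


def _normalize_line(line: str) -> str:
    """Collapse intra-line whitespace runs and trim both ends."""
    return _WS_RUN.sub(" ", line).strip()


def normalize_comments(raw: Optional[str]) -> Optional[str]:
    """Read-time view of a comments band; the input is never mutated."""
    if raw is None:
        return None
    lines = [_normalize_line(line) for line in raw.splitlines()]
    kept = [line for line in lines if line]  # drop blank lines
    return "\n".join(kept) if kept else None
```

Because the output contains only single spaces and no blank lines, feeding it back through `normalize_comments` returns the same string, which is the idempotency property the tests lock in.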
Addresses code-review feedback on the original PR. (1) `CommentEntryOverlap` now carries `comment_line_raw` — the verbatim comments-band line that produced the match, with whitespace and case preserved — parallel to `entry_raw_text` on the entry side. A future audit CLI wants to display both sides exactly as the DJ wrote them; the casefolded form (`matched_text`) stays for matcher introspection. (2) Extracted `_normalize_line` as the shared per-line primitive used by both `normalize_comments` and the matcher; the prior `_normalize_for_match` helper duplicated the same regex+strip composition. (3) Docstring on `find_comment_entry_overlaps` now states the scope explicitly — only `entry.raw_text` is the haystack; `oddities` lists at both quadrant and page levels are intentionally not searched. (4) Dropped the redundant empty-line filter that the refactored line walk no longer needs.
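A hedged sketch of how the matcher and the extended record might fit together. Only `comment_line_raw`, `matched_text`, `entry_raw_text`, `raw_text`, `row_index`, and `_normalize_line` are named in this PR; the `Entry` and `Page` stand-ins, the `quadrant` field, the `QUADRANT_ORDER` values, and the tie-breaking between comment lines are hypothetical and exist only to make the example self-contained.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

_WS_RUN = re.compile(r"\s+")
QUADRANT_ORDER = ("TL", "TR", "BL", "BR")  # assumed canonical order


@dataclass(frozen=True)
class Entry:  # hypothetical stand-in for a page entry
    quadrant: str
    row_index: int
    raw_text: str


@dataclass(frozen=True)
class Page:  # hypothetical stand-in for PageResult
    comments_raw: Optional[str]
    entries: List[Entry]


@dataclass(frozen=True)
class CommentEntryOverlap:
    comment_line_raw: str  # verbatim comments line, case and spacing kept
    matched_text: str      # casefolded, whitespace-collapsed needle
    quadrant: str
    row_index: int
    entry_raw_text: str    # verbatim entry text


def _normalize_line(line: str) -> str:
    """Shared per-line primitive: collapse whitespace runs, then trim."""
    return _WS_RUN.sub(" ", line).strip()


def find_comment_entry_overlaps(page: Page) -> List[CommentEntryOverlap]:
    """Search only entry.raw_text; oddities lists are not consulted."""
    if not page.comments_raw:
        return []
    overlaps: List[CommentEntryOverlap] = []
    for raw_line in page.comments_raw.splitlines():
        needle = _normalize_line(raw_line).casefold()
        if not needle:
            continue  # blank lines can never match
        for entry in page.entries:
            haystack = _normalize_line(entry.raw_text).casefold()
            if needle in haystack:
                overlaps.append(CommentEntryOverlap(
                    raw_line, needle, entry.quadrant,
                    entry.row_index, entry.raw_text))
    # Stable sort: quadrant order first, then ascending row_index.
    overlaps.sort(key=lambda o: (QUADRANT_ORDER.index(o.quadrant), o.row_index))
    return overlaps
```

Keeping both the verbatim line and the casefolded needle on the record lets a display layer show the text exactly as written while the matcher's view stays inspectable.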
Summary
- `core/comments.py`: pure read-time `normalize_comments(raw)` (whitespace-collapse per line, trim, drop blank lines, idempotent, empty → `None`) and `find_comment_entry_overlaps(page)`, which returns a list of `CommentEntryOverlap` records for cases where a normalized comments line is a case- and whitespace-insensitive substring of an entry's `raw_text` on the same page.
- Mirrors the `core.continuations` pattern: no I/O, no on-disk mutation. Verbatim `comments_raw` on disk stays verbatim per the org data-safety rule.

Inventory
The ticket's step-1 inventory of `data/results/` against the existing 34-page corpus shows the feature is brand-new in main and the existing extractions predate it. All 34 JSONs are missing the `comments_raw` key entirely (they were extracted before the schema split landed). No non-null examples yet; empirical validation against real Comments-band text will come from the next extraction run. Tests here use synthetic `PageResult` fixtures, which is sufficient to lock in the contract.

Inventory buckets: `comments_raw` key, `comments_raw: null`, `comments_raw` non-null.

Test plan
- `ruff check .` clean
- `ruff format --check .` clean
- `mypy core cli.py` clean
- `pytest`: 357 passed (26 new `test_comments.py` cases covering None passthrough, whitespace collapse, multi-line preservation, idempotency, trim, empty-to-None, and overlap diagnostics: empty / no-match / single / multi / case-insensitive / whitespace-insensitive / multi-line-one-matches / canonical-ordering)

Closes #34