Skip to content

chore: release 0.4.3#13

Closed
maish wants to merge 2 commits into
mainfrom
release-0.4.3
Closed

chore: release 0.4.3#13
maish wants to merge 2 commits into
mainfrom
release-0.4.3

Conversation

@maish

@maish maish commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator

Patch release: fix reprinted continuation-page headers appended as data rows on multi-page merge.

Fix

For a table whose column header is a multi-row (hierarchical) header reprinted at the top of each page, _grid_to_dataframe emitted every grid row — including rows Docling flagged column_header=True — as DataFrame body rows. Each page's reprinted header (plus the anchor's own header) then survived the merge as bogus data rows, misaligning the table for downstream consumers.

Leading column_header=True grid rows are now excluded from the DataFrame body (they're already reconstructed as the header block on injection). Because the fix keys on Docling's structural header flags, it is immune to per-cell OCR drift (e.g. (S$) vs ($$)) and needs no header_sim_* tuning. Single-row headers and tables without header flags are unaffected. A debug-level log reports how many header rows were excluded.

Note: header_sim_strict/header_sim_loose gate the merge decision, not row-dropping — there was no continuation-header-stripping step before this.

Tests

  • 108 passing (2 new regression tests: multi-row header excluded from body; single-row header unaffected).
  • Verified against a real 3-page insurance-benefit table: body header pollution went from 9 rows (anchor + 2 continuations) to 0; header block, multi-level spans, and data integrity preserved.

🤖 Generated with Claude Code

maish and others added 2 commits June 11, 2026 11:37
…i-page merge

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@maish

maish commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator Author

Closing: the extraction-side approach (trusting Docling's column_header flags) drops real data — Docling over-flags continuation/rowspan rows as headers (see repeated-header/rowspan-insurance-payout: 5 leading rows flagged column_header, but 3 are data). Reworking with a content-based approach.

@maish maish closed this Jun 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant