57 changes: 0 additions & 57 deletions .claude/settings.json

This file was deleted.

3 changes: 3 additions & 0 deletions .gitignore
@@ -35,3 +35,6 @@ build/

# Local benchmark outputs (gold should be versioned)
benchmarks/current/

# Claude Code local config (settings, skills) — never commit
/.claude/
34 changes: 34 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,40 @@ All notable changes to `table2rules` are documented here. Dates are in

## [Unreleased]

## [0.6.3] — 2026-06-14

### Fixed

- **Label-only row-group headers now thread through real docling schedule
shapes.** Three additional shapes from real `TableItem.export_to_html` output
were dropping the line-item title from grouped values' `row_headers`:
- *Narrow title → full-width description band → values.* The title's row-group
extent now extends *through* an immediately-following full-width description
band (a nested member of the header block, not a boundary), so the title
threads as the outer ancestor instead of being dropped:
`9. Trip Cancellation > If your trip is cancelled… > 1. Adult insured person | …`.
- *Multi-cell title rows* (a leading item number/key plus a textual title, e.g.
`10 | Travel delay`) are now recognized as group headers. A row is a title
when at most one of its label cells is numeric-only — this admits the
number+title shape while still rejecting a data row whose value columns merely
happen to be empty (`Average: | 80.2 | 10.7 | 3.3`, ≥2 numeric). A repeating
key column is excluded from the promoted title so it is not duplicated.
- *Two-column `Label | Value` schedules.* The left column is now promoted to the
row-label/stub even under a single-row thead header (`Benefit | Maximum limit`)
— Signal D, scoped to exactly two columns so multi-column property tables are
untouched. This also produces proper one-record-per-line output for plain
two-column relational tables (`North | Sales: 100`) instead of splitting each
row into two disconnected `Header: value` lines.
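
The title test in the second bullet can be sketched in isolation. This is a minimal illustration, assuming a row is handed over as a plain list of its label-cell strings; the names `is_numeric_only` and `looks_like_title` are hypothetical stand-ins, not the library's internal helpers, which work on the parsed grid.

```python
from typing import List


def is_numeric_only(text: str) -> bool:
    """True for a bare number such as "10" or "3.": at least one digit, no letter."""
    text = text.strip()
    has_digit = any(ch.isdigit() for ch in text)
    has_alpha = any(ch.isalpha() for ch in text)
    return bool(text) and has_digit and not has_alpha


def looks_like_title(label_cells: List[str]) -> bool:
    """A label-only row reads as a group title when at most one of its
    non-empty label cells is numeric-only (hypothetical stand-alone version)."""
    cells = [c.strip() for c in label_cells if c.strip()]
    if not cells:
        return False
    return sum(1 for c in cells if is_numeric_only(c)) <= 1


# Number + title: one numeric cell, so it is admitted as a group header.
number_plus_title = looks_like_title(["10", "Travel delay"])
# Data row with three numeric label cells: rejected.
averages_row = looks_like_title(["Average:", "80.2", "10.7", "3.3"])
print(number_plus_title, averages_row)
```

The threshold of one numeric cell is what lets a leading item number through while still treating a row of measurements as data.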

New fixtures `matrix/label-only-title-then-description-band` and
`matrix/label-only-title-number-key-matrix`.
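
The effect of the two-column change on plain relational tables can be illustrated with a small formatter. This is only a sketch of the two output shapes, assuming a header pair and a list of `(label, value)` rows; `flat_lines` and `stub_lines` are hypothetical names and do not reflect how `table2rules` builds its rules internally.

```python
from typing import List, Sequence, Tuple

Header = Tuple[str, str]
Row = Tuple[str, str]


def flat_lines(header: Header, rows: Sequence[Row]) -> List[str]:
    """Previous shape: each cell becomes its own 'Header: value' line."""
    return [f"{h}: {v}" for row in rows for h, v in zip(header, row)]


def stub_lines(header: Header, rows: Sequence[Row]) -> List[str]:
    """New shape: the left cell is the row label, the right cell its value."""
    _, value_header = header
    return [f"{label} | {value_header}: {value}" for label, value in rows]


header = ("Region", "Sales")
rows = [("North", "100"), ("South", "200")]

for line in stub_lines(header, rows):
    print(line)
```

With the stub shape each record stays on one line, so the label and its value cannot be separated downstream.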

*Known limitation:* a sub-grouped header of the form `<n>. | Group: | (empty)`
with a `colspan` title over a promoted descriptor column still falls back to
flat (the spanned cell trips the gate's "rules originate from `<td>`"
invariant). This is a pre-existing, separate gate interaction — not the
label-only-threading path — and is tracked for a follow-up.

## [0.6.2] — 2026-06-14

### Fixed
@@ -0,0 +1,16 @@
Trip Cancellation > If your trip is cancelled due to specified events. > 9 > 1. Adult insured person | Value Plan > Individual: 5,000
Trip Cancellation > If your trip is cancelled due to specified events. > 9 > 1. Adult insured person | Value Plan > Family: 10,000
Trip Cancellation > If your trip is cancelled due to specified events. > 9 > 1. Adult insured person | Economy Plan > Individual: 3,000
Trip Cancellation > If your trip is cancelled due to specified events. > 9 > 1. Adult insured person | Economy Plan > Family: 6,000
Trip Cancellation > If your trip is cancelled due to specified events. > 9 > 2. Child insured person | Value Plan > Individual: 2,500
Trip Cancellation > If your trip is cancelled due to specified events. > 9 > 2. Child insured person | Value Plan > Family: 5,000
Trip Cancellation > If your trip is cancelled due to specified events. > 9 > 2. Child insured person | Economy Plan > Individual: 1,500
Trip Cancellation > If your trip is cancelled due to specified events. > 9 > 2. Child insured person | Economy Plan > Family: 3,000
Travel delay > If the departure of your public transport is delayed by six hours. > 10 > 1. Adult insured person | Value Plan > Individual: 100
Travel delay > If the departure of your public transport is delayed by six hours. > 10 > 1. Adult insured person | Value Plan > Family: 200
Travel delay > If the departure of your public transport is delayed by six hours. > 10 > 1. Adult insured person | Economy Plan > Individual: 150
Travel delay > If the departure of your public transport is delayed by six hours. > 10 > 1. Adult insured person | Economy Plan > Family: 300
Travel delay > If the departure of your public transport is delayed by six hours. > 10 > 2. Child insured person | Value Plan > Individual: 50
Travel delay > If the departure of your public transport is delayed by six hours. > 10 > 2. Child insured person | Value Plan > Family: 100
Travel delay > If the departure of your public transport is delayed by six hours. > 10 > 2. Child insured person | Economy Plan > Individual: 75
Travel delay > If the departure of your public transport is delayed by six hours. > 10 > 2. Child insured person | Economy Plan > Family: 150
@@ -0,0 +1,5 @@
9. Trip Cancellation > If your trip is cancelled due to specified events before departure. > 1. Adult insured person | Maximum limit (S$): 5,000
9. Trip Cancellation > If your trip is cancelled due to specified events before departure. > 2. Child insured person | Maximum limit (S$): 2,500
10. Travel Delay > If the departure of your public transport is delayed by at least six hours. > 1. Adult insured person | Maximum limit (S$): 100 per six hours up to 1,500
10. Travel Delay > If the departure of your public transport is delayed by at least six hours. > 2. Child insured person | Maximum limit (S$): 50 per six hours up to 1,500
11. Trip Postponement
@@ -2,7 +2,5 @@ A-001 | Status: Open
 A-001 | Comment: Parent row has nested summary: k, v; x, 1
 A-002 | Status: Closed
 A-002 | Comment: Normal row
-Metric: Total Open
-Value: 1
-Metric: Total Closed
-Value: 1
+Total Open | Value: 1
+Total Closed | Value: 1
@@ -1,4 +1,2 @@
-Item: Widget
-Qty: 10
-Item: Gadget
-Qty: 20
+Widget | Qty: 10
+Gadget | Qty: 20
12 changes: 4 additions & 8 deletions benchmarks/gold/rules/fixtures/relational/multiple-tbody.out.txt
@@ -1,8 +1,4 @@
-Region: North
-Sales: 100
-Region: South
-Sales: 200
-Region: East
-Sales: 150
-Region: West
-Sales: 180
+North | Sales: 100
+South | Sales: 200
+East | Sales: 150
+West | Sales: 180
@@ -1,5 +1,3 @@
 Total | Amount: 300
-Item: Widget
-Amount: 100
-Item: Gadget
-Amount: 200
+Widget | Amount: 100
+Gadget | Amount: 200
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "table2rules"
-version = "0.6.2"
+version = "0.6.3"
 description = "Convert HTML tables to flat, LLM-friendly rules using spatial pathfinding."
 readme = "README.md"
 license = "MIT"
83 changes: 67 additions & 16 deletions src/table2rules/_core.py
@@ -411,15 +411,33 @@ def _label_cols(r: int) -> List[int]:
                 cols.append(c)
         return cols
 
-    def _single_label_origin(r: int) -> bool:
-        # A group header is exactly one label source cell (a title, possibly
-        # colspan'd). More than one distinct non-empty label cell means a data
-        # row, not a divider — do not thread it.
-        origins = set()
-        for c in _label_cols(r):
-            cell = grid[r][c]
-            origins.add(cell.get("origin", (r, c)) if cell.get("is_span_copy") else (r, c))
-        return len(origins) == 1
+    def _cell_text(r: int, c: int) -> str:
+        cell = grid[r][c]
+        if cell.get("is_span_copy"):
+            o = cell.get("origin", (r, c))
+            return (grid[o[0]][o[1]].get("text") or "").strip()
+        return (cell.get("text") or "").strip()
+
+    def _is_numeric_only(text: str) -> bool:
+        # No alphabetic character but at least one digit — a bare item number
+        # ("3.", "10"), reusing the parser's universal "letters label, digits
+        # measure" signal. A group title carries text; a mis-promoted value cell
+        # is a number.
+        return (
+            bool(text) and not any(ch.isalpha() for ch in text) and any(ch.isdigit() for ch in text)
+        )
+
+    def _title_like(r: int) -> bool:
+        # A group-header title carries at most ONE numeric-only label cell (a
+        # leading item number, e.g. "10 | Travel delay" or "3. | Permanent loss
+        # of:"). Two or more numeric label cells means a data row whose value
+        # columns merely happen to be empty (e.g. a header that over-promoted
+        # numeric columns to row labels, "Average: | 80.2 | 10.7 | 3.3") —
+        # threading it would invent a breadcrumb, so it stays an is_label.
+        cols = _label_cols(r)
+        if not cols:
+            return False
+        return sum(1 for c in cols if _is_numeric_only(_cell_text(r, c))) <= 1
 
     # A row already carrying a rowgroup cell (a full-width band promoted above,
     # or a source scope="rowgroup") is a boundary, not a label-only candidate.
@@ -432,7 +450,7 @@ def _single_label_origin(r: int) -> bool:
         and r not in band_rows
         and not _has_value(r)
         and bool(_label_cols(r))
-        and _single_label_origin(r)
+        and _title_like(r)
         for r in range(n_rows)
     ]
 
@@ -448,23 +466,56 @@ def _single_label_origin(r: int) -> bool:
             s_end = r
         r += 1  # advance past the stack for the outer loop
 
-        # Extent: down to the row before the next boundary (next label stack or
-        # full-width band). Bounded by a value row's presence.
+        # Absorb a run of full-width band rows immediately following the title
+        # stack (a description band under the title) into this header block —
+        # they are nested members, not a boundary. Without this the title's
+        # extent would terminate at the band and the title would be dropped (the
+        # narrow-title-then-full-width-description shape).
+        header_end = s_end
+        while header_end + 1 < n_rows and (header_end + 1) in band_rows:
+            header_end += 1
+
+        # Extent: from the first row after the header block to the row before the
+        # next group start — the next title, or a full-width band that begins a
+        # new section (one appearing AFTER a value row). A band absorbed above is
+        # part of this header, not a boundary.
         extent_end = n_rows - 1
-        for rr in range(s_end + 1, n_rows):
-            if is_label_row[rr] or rr in band_rows:
+        saw_value = False
+        for rr in range(header_end + 1, n_rows):
+            if is_label_row[rr]:
                 extent_end = rr - 1
                 break
-        has_data_row = any(_has_value(rr) for rr in range(s_end + 1, extent_end + 1))
-        if not has_data_row:
+            if rr in band_rows and saw_value:
+                extent_end = rr - 1
+                break
+            if _has_value(rr):
+                saw_value = True
+        value_rows = [rr for rr in range(header_end + 1, extent_end + 1) if _has_value(rr)]
+        if not value_rows:
             continue
 
+        # Promote each title cell, EXCLUDING a key column whose (column, text)
+        # repeats on a value row of the group — a repeating item-number/key
+        # already threads via the value rows' own labels; promoting it again
+        # would duplicate it in the path. The remaining cells are the title.
         for rr in range(s_start, s_end + 1):
             for c in _label_cols(rr):
+                text = _cell_text(rr, c)
+                if any(_cell_text(vr, c) == text for vr in value_rows):
+                    continue
                 grid[rr][c]["type"] = "th"
                 grid[rr][c]["scope"] = "rowgroup"
                 grid[rr][c]["rowgroup_extent_end"] = extent_end
 
+        # Bound the absorbed description band(s) by the same extent so a
+        # full-width description does not leak past the next narrow title (its
+        # colspan is wider, so the maze's colspan rule would not close it).
+        for rr in range(s_end + 1, header_end + 1):
+            for c in range(n_cols):
+                cell = grid[rr][c]
+                if cell.get("scope") == "rowgroup":
+                    cell["rowgroup_extent_end"] = extent_end
+
 
 def _process_table_with_gate(table_html: str) -> Tuple[List[LogicRule], GateResult]:
     """Runs the full pipeline and returns rules plus the gate verdict.
18 changes: 18 additions & 0 deletions src/table2rules/grid_parser.py
@@ -487,6 +487,24 @@ def _descriptor_like(col: int) -> bool:
            if _descriptor_like(c):
                promote_cols.add(c)

        # --- Signal D: stub column in a 2-column Label|Value schedule ---
        # In a two-column table the left column is the row-label/stub and the
        # right column is its value — even when col 0 carries a thead header,
        # which Signals A/C/B all skip (they need a multi-row/rowspan header or
        # an unlabeled column). This is the single-row-thead
        # "Benefit | Maximum limit (S$)" schedule shape. Scoped to exactly two
        # columns so multi-column property tables (where col 0 is one data field
        # among several) are untouched; col 0 must be descriptor-like and col 1
        # must carry values, so a two-column all-text table is left alone.
        if (
            max_cols == 2
            and 0 not in promote_cols
            and _descriptor_like(0)
            and body_nonempty[1] >= 1
            and not _descriptor_like(1)
        ):
            promote_cols.add(0)

        if promote_cols:
            for c in sorted(promote_cols):
                for r in range(data_start_row_idx, len(grid)):
27 changes: 27 additions & 0 deletions tests/fixtures/matrix/label-only-title-number-key-matrix.md
@@ -0,0 +1,27 @@
<!-- Real docling matrix shape: a repeating number-KEY column (col 0, "10"
repeated on every row of the group, from an expanded rowspan), a benefit
descriptor column (col 1), then plan x cover value columns. Each group is a
label-only TITLE row ("10 | Travel delay | <empty values>") followed by a
full-width DESCRIPTION band, then the value sub-rows.

The title "Travel delay" is a multi-cell label-only row (number key + title)
— at most one numeric label cell, so it is a group header, not a data row.
The repeating key "10" is excluded from the promoted title (it already
threads via the value rows' own key cell); only "Travel delay" is threaded,
so the path carries the line-item identity without duplicating the number. -->
<table>
<thead>
<tr><th rowspan="2"></th><th rowspan="2"></th><th colspan="2">Value Plan</th><th colspan="2">Economy Plan</th></tr>
<tr><th>Individual</th><th>Family</th><th>Individual</th><th>Family</th></tr>
</thead>
<tbody>
<tr><td>9</td><td>Trip Cancellation</td><td colspan="4"></td></tr>
<tr><td>9</td><td colspan="5">If your trip is cancelled due to specified events.</td></tr>
<tr><td>9</td><td>1. Adult insured person</td><td>5,000</td><td>10,000</td><td>3,000</td><td>6,000</td></tr>
<tr><td>9</td><td>2. Child insured person</td><td>2,500</td><td>5,000</td><td>1,500</td><td>3,000</td></tr>
<tr><td>10</td><td>Travel delay</td><td colspan="4"></td></tr>
<tr><td>10</td><td colspan="5">If the departure of your public transport is delayed by six hours.</td></tr>
<tr><td>10</td><td>1. Adult insured person</td><td>100</td><td>200</td><td>150</td><td>300</td></tr>
<tr><td>10</td><td>2. Child insured person</td><td>50</td><td>100</td><td>75</td><td>150</td></tr>
</tbody>
</table>
30 changes: 30 additions & 0 deletions tests/fixtures/matrix/label-only-title-then-description-band.md
@@ -0,0 +1,30 @@
<!-- Real docling schedule shape: a narrow label-only TITLE row immediately
followed by a full-width <td colspan=N> DESCRIPTION band, then the value
sub-rows. Both must thread, title as the outer ancestor:

9. Trip Cancellation > If your trip is cancelled... > 1. Adult ... | value

Two bugs combined here, both fixed:
- The title's row-group extent must extend THROUGH the adjacent description
band (which is itself a full-width band) to reach the value rows; before,
the band terminated the extent and the title was dropped.
- In a two-column Label|Value schedule the left column is the row-label/stub
even though it carries a thead header ("Benefit"), which the multi-row /
rowspan header signals miss (Signal D).

The trailing "11. Trip Postponement" has no value rows under it, so it stays
an is_label note rather than creating an empty group. -->
<table>
<thead><tr><th>Benefit</th><th>Maximum limit (S$)</th></tr></thead>
<tbody>
<tr><td>9. Trip Cancellation</td><td></td></tr>
<tr><td colspan="2">If your trip is cancelled due to specified events before departure.</td></tr>
<tr><td>1. Adult insured person</td><td>5,000</td></tr>
<tr><td>2. Child insured person</td><td>2,500</td></tr>
<tr><td>10. Travel Delay</td><td></td></tr>
<tr><td colspan="2">If the departure of your public transport is delayed by at least six hours.</td></tr>
<tr><td>1. Adult insured person</td><td>100 per six hours up to 1,500</td></tr>
<tr><td>2. Child insured person</td><td>50 per six hours up to 1,500</td></tr>
<tr><td>11. Trip Postponement</td><td></td></tr>
</tbody>
</table>