feat: revize TripletTableSerializer #574

Open
wanadzhar913 wants to merge 8 commits into docling-project:main from wanadzhar913:feature/revize-triplettableserializer

Conversation

wanadzhar913 (Contributor) commented Mar 31, 2026

Why

TripletTableSerializer relied on first-row/first-column header assumptions. This produced triplets with vague semantic meaning when converted to text, and some cells were skipped (especially the cell at 0,0).

The solution I came up with is verbose (e.g., row_1, col_1 = Value), but I think it is better to be precise about positioning than to assume it.

Since this PR edits the tests and test data, please let me know whether that is acceptable, as it will affect the main docling package downstream.

Issues

Changes

  • Updated TripletTableSerializer in docling_core/transforms/chunker/hierarchical_chunker.py:
    • metadata-first header detection derived from the row_header / col_header attributes of the TableItem's cells, e.g., has_row_headers = any(getattr(cell, "row_header", False) for row in TableItem.data.grid for cell in row)
    • safe fallback when metadata is missing
    • consistent row_i/col_j labeling when headers are absent
    • empty/NaN filtering
  • Improved nested table detection/serialization:
    • handles nested DataFrame cells
    • detects serialized inner-table strings and emits a -> relation (instead of =)
    • recursion depth cap (max_depth=3)
  • Added focused tests in test/test_triplet_serializer.py (8 scenarios, including nested tables and merged cells)
  • Refreshed the affected chunker/hybrid golden fixture datasets (e.g., uv run env DOCLING_GEN_TEST_DATA=1 pytest -q test/test_hierarchical_chunker.py)
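
To make the header-detection and labeling rules above concrete, here is a minimal sketch. It is not the code in this PR: `Cell`, `is_empty`, and `serialize_triplets` are hypothetical names, and the `row_header` / `col_header` flags simply mirror the attribute names used in the bullet above. It assumes header cells supply the labels and are not emitted as values themselves.

```python
import math
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical stand-in for a table cell (not the docling-core class)."""
    text: object
    row_header: bool = False
    col_header: bool = False


def is_empty(value):
    """Treat None, NaN, and blank strings as empty."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return str(value).strip() == ""


def serialize_triplets(grid):
    """Emit 'row label, column label = value' triplets for a grid of Cells."""
    cells = [c for row in grid for c in row]
    # Metadata-first detection: trust the flags, not the cell position.
    has_col_headers = any(getattr(c, "col_header", False) for c in cells)
    has_row_headers = any(getattr(c, "row_header", False) for c in cells)

    # Fallback labels are positional: col_0, col_1, ...
    n_cols = max((len(row) for row in grid), default=0)
    col_labels = [f"col_{j}" for j in range(n_cols)]
    if has_col_headers:
        for row in grid:
            for j, c in enumerate(row):
                if getattr(c, "col_header", False) and not is_empty(c.text):
                    col_labels[j] = str(c.text).strip()

    parts = []
    for i, row in enumerate(grid):
        row_label = f"row_{i}"  # positional fallback
        if has_row_headers:
            for c in row:
                if getattr(c, "row_header", False) and not is_empty(c.text):
                    row_label = str(c.text).strip()
                    break
        for j, c in enumerate(row):
            is_header = getattr(c, "col_header", False) or getattr(c, "row_header", False)
            if is_header or is_empty(c.text):
                continue  # headers become labels; empty/NaN cells are dropped
            parts.append(f"{row_label}, {col_labels[j]} = {str(c.text).strip()}")
    return ". ".join(parts)


plain = [[Cell("a"), Cell("b")], [Cell("c"), Cell(float("nan"))]]
print(serialize_triplets(plain))
# row_0, col_0 = a. row_0, col_1 = b. row_1, col_0 = c

labeled = [
    [Cell(""), Cell("Price", col_header=True)],
    [Cell("Apple", row_header=True), Cell("3")],
]
print(serialize_triplets(labeled))
# Apple, Price = 3
```

In this sketch the positional labels use the cell's index in the full grid, so the cell at 0,0 is always addressable when no header metadata is present.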

Sample Output

I'm comparing it to 0c_out_chunks.json.

Old:

cell 0,0, 1 = cell 0,1. cell 1,0, 1 = <em><p>text in italic</p></em>. <ul>\n<li>list item 1</li>\n<li>list item 2</li>\n</ul>, 1 = cell 2,1. cell 3,0, 1 = inner cell 0,0, 1 = inner cell 0,1. inner cell 0,0, 2 = inner cell 0,2. inner cell 1,0, 1 = inner cell 1,1. inner cell 1,0, 2 = inner cell 1,2. <p>Some text in a generic group.</p>\n<p>More text in the group.</p>, 1 = cell 4,1

New:

row_0, col_0 = cell 0,0. row_0, col_1 = cell 0,1. row_1, col_0 = cell 1,0. row_1, col_1 = <em><p>text in italic</p></em>. row_2, col_0 = <ul>\n<li>list item 1</li>\n<li>list item 2</li>\n</ul>. row_2, col_1 = cell 2,1. row_3, col_0 = cell 3,0. row_3, col_1 -> row_0, col_0 = inner cell 0,0. row_0, col_1 = inner cell 0,1. row_0, col_2 = inner cell 0,2. row_1, col_0 = inner cell 1,0. row_1, col_1 = inner cell 1,1. row_1, col_2 = inner cell 1,2. row_4, col_0 = <p>Some text in a generic group.</p>\n<p>More text in the group.</p>. row_4, col_1 = cell 4,1
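
The nested part of the new output (the -> relation at row_3, col_1) can be illustrated with a small recursive sketch. This is not the PR's implementation: `serialize_nested` is a hypothetical helper that represents a nested table as a plain list of rows inside a cell, and it assumes that a nested table beyond the depth cap is simply skipped.

```python
def serialize_nested(grid, depth=0, max_depth=3):
    """Serialize a grid whose cells are either values or nested grids (lists of rows)."""
    parts = []
    for i, row in enumerate(grid):
        for j, cell in enumerate(row):
            label = f"row_{i}, col_{j}"
            if isinstance(cell, list):
                # Nested table: recurse unless the depth cap is reached.
                if depth >= max_depth:
                    continue  # assumption: tables beyond the cap are skipped
                inner = serialize_nested(cell, depth + 1, max_depth)
                if inner:
                    parts.append(f"{label} -> {inner}")
            else:
                text = str(cell).strip()
                if text:
                    parts.append(f"{label} = {text}")
    return ". ".join(parts)


table = [["cell 3,0", [["inner cell 0,0", "inner cell 0,1"]]]]
print(serialize_nested(table))
# row_0, col_0 = cell 3,0. row_0, col_1 -> row_0, col_0 = inner cell 0,0. row_0, col_1 = inner cell 0,1
```

The inner table restarts its own row/column numbering, which matches the shape of the sample above, where the triplets after -> begin again at row_0, col_0.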

github-actions Bot (Contributor) commented Mar 31, 2026

DCO Check Passed

Thanks @wanadzhar913, all your commits are properly signed off. 🎉

mergify Bot (Contributor) commented Mar 31, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

Waiting for

  • #approved-reviews-by >= 2
This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

@wanadzhar913 wanadzhar913 marked this pull request as ready for review April 4, 2026 18:55
I, wanadzhar913 <adzhar.faiq@gmail.com>, hereby add my Signed-off-by to this commit: 73155f5

Signed-off-by: wanadzhar913 <adzhar.faiq@gmail.com>
… comp. for single column with header

Signed-off-by: wanadzhar913 <adzhar.faiq@gmail.com>
Signed-off-by: wanadzhar913 <adzhar.faiq@gmail.com>
Signed-off-by: wanadzhar913 <adzhar.faiq@gmail.com>
…ps_first_data_row

Signed-off-by: wanadzhar913 <adzhar.faiq@gmail.com>
@wanadzhar913 wanadzhar913 force-pushed the feature/revize-triplettableserializer branch from 16e2541 to 7476951 Compare April 12, 2026 14:27
wanadzhar913 (Contributor, Author) commented

Hi @ceberam! Hope the general approach I'm taking for this PR is correct. Tysm!

codecov Bot commented Apr 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Signed-off-by: wanadzhar913 <adzhar.faiq@gmail.com>
wanadzhar913 (Contributor, Author) commented

Hi @ceberam, I've added wider test coverage! Apols and tysm.

ceberam (Member) commented Apr 13, 2026

> Hi @ceberam, I've added wider test coverage! Apols and tysm.

Thanks @wanadzhar913, I'll review in the next couple of days. In the meantime, can you please remediate the Developer Certificate of Origin (DCO) check failure?

I, wanadzhar913 <adzhar.faiq@gmail.com>, hereby add my Signed-off-by to this commit: 676d24d

Signed-off-by: wanadzhar913 <adzhar.faiq@gmail.com>
wanadzhar913 (Contributor, Author) commented Apr 27, 2026

Hi @ceberam, I've signed the DCO commit checks.

I'm also aware of an open issue in docling-project/docling#3335. I'm happy to rebase and reimplement if that fix is more pertinent.

Development

Successfully merging this pull request may close these issues.

Review the serialization of tables in triplet format

2 participants