feat: revise TripletTableSerializer #574
✅ DCO Check Passed. Thanks @wanadzhar913, all your commits are properly signed off. 🎉
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🔴 Require two reviewer for test updates
Waiting for
This rule is failing. When test data is updated, we require two reviewers.
🟢 Enforce conventional commit
Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
I, wanadzhar913 <adzhar.faiq@gmail.com>, hereby add my Signed-off-by to this commit: 73155f5 Signed-off-by: wanadzhar913 <adzhar.faiq@gmail.com>
… comp. for single column with header Signed-off-by: wanadzhar913 <adzhar.faiq@gmail.com>
Signed-off-by: wanadzhar913 <adzhar.faiq@gmail.com>
Signed-off-by: wanadzhar913 <adzhar.faiq@gmail.com>
…ps_first_data_row Signed-off-by: wanadzhar913 <adzhar.faiq@gmail.com>
16e2541 to 7476951
Hi @ceberam! Hope the general approach I'm taking for this PR is correct. Tysm!
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Signed-off-by: wanadzhar913 <adzhar.faiq@gmail.com>
Hi @ceberam, I've added wider test coverage! Apols and tysm.
Thanks @wanadzhar913, I'll review in the next couple of days. In the meantime, can you please remediate the Developer Certificate of Origin (DCO) check failure?
I, wanadzhar913 <adzhar.faiq@gmail.com>, hereby add my Signed-off-by to this commit: 676d24d Signed-off-by: wanadzhar913 <adzhar.faiq@gmail.com>
Hi @ceberam, I've signed the DCO commit checks. Am also aware of an open issue in docling-project/docling#3335. Am happy to rebase and reimplement if that fix is more pertinent.
Why
TripletTableSerializer relied on first-row/first-column assumptions, which produced triplets with vague semantic meaning when converted to text and sometimes skipped cells (especially the one at 0,0).
The solution I came up with is very verbose, e.g., row_1, col_1 = Value, etc., but I was of the opinion that it's better to be precise with positioning (instead of assuming).
As this PR edits the tests and test data, please let me know if that's acceptable, since it will affect the main docling package downstream.
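To make the idea concrete, here is a rough, simplified sketch of positional labeling with a header fallback. It is not the code in this PR: the `Cell` class and the `table_to_triplets` function are hypothetical stand-ins, the 0-based indexing is an assumption, and nested tables and merged cells are ignored.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical stand-in for a table cell (not the docling-core class)."""

    text: str
    row_header: bool = False
    col_header: bool = False


def table_to_triplets(grid: list[list[Cell]]) -> list[str]:
    """Return one 'row label, column label = value' string per data cell."""
    # Only trust header text if some cell is explicitly flagged as a header.
    has_row_headers = any(getattr(c, "row_header", False) for row in grid for c in row)
    has_col_headers = any(getattr(c, "col_header", False) for row in grid for c in row)

    triplets = []
    for i, row in enumerate(grid):
        for j, cell in enumerate(row):
            if cell.row_header or cell.col_header:
                continue  # header cells label other cells; they are not values

            # Default to positional labels (0-based here, an assumption).
            row_label = f"row_{i}"
            col_label = f"col_{j}"
            if has_row_headers:
                # First flagged row header in this row, else keep the position.
                row_label = next((c.text for c in row if c.row_header), row_label)
            if has_col_headers:
                # First flagged column header in this column, else keep the position.
                col_label = next(
                    (r[j].text for r in grid if j < len(r) and r[j].col_header),
                    col_label,
                )
            triplets.append(f"{row_label}, {col_label} = {cell.text}")
    return triplets
```

With no flagged headers every cell, including the one at 0,0, gets a `row_i, col_j` label, so nothing is skipped or guessed.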
Issues
Changes
- TripletTableSerializer in docling_core/transforms/chunker/hierarchical_chunker.py:
  - … TableItem's row_header/col_header attribute, e.g., has_row_headers = any(getattr(cell, "row_header", False) for row in TableItem.data.grid for cell in row)
  - … row_i/col_j labeling when headers are absent
  - … DataFrame cells
  - … max_depth=3)
- … test/test_triplet_serializer.py (8 scenarios, including nested tables and merged cells)
- … uv run env DOCLING_GEN_TEST_DATA=1 pytest -q test/test_hierarchical_chunker.py)
Sample Output
I'm comparing it to 0c_out_chunks.json.
Old:
New: