fix: (cont.) Remove Extra Space Before and After Group Items using Inline Boundaries#605

Open
wanadzhar913 wants to merge 1 commit into
docling-project:mainfrom
wanadzhar913:bugfix/v2-2745_ExtraSpaceBeforeAndAfterGroupItems

Conversation

@wanadzhar913
Contributor

Details

This is a continuation of the work in pull request #458, which removes extra space before and after group items to resolve the issue raised in #2745.

Resolves #371
Resolves docling-project/docling#2745

Approach

Refactors inline spacing in docling_core/transforms/serializer/common.py into a clearer decision flow centered on _classify_inline_boundary(), instead of the old approach in #458, which simply removed the space (" ") when joining all parts without separators.

Control Flow

_join_inline_parts() is the entry point. It walks adjacent inline chunks, calls _classify_inline_boundary() for each boundary condition, and inserts a space only when that classifier returns InlineBoundary.SPACE.
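The loop described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: chunks are plain strings here, `toy_classifier` is a hypothetical stand-in for _classify_inline_boundary(), and only the `SPACE` and `JOIN` members named in this description are modeled.

```python
from enum import Enum
from typing import Callable, List, Sequence


class InlineBoundary(Enum):
    # Only these two outcomes are named in the PR description.
    JOIN = "join"
    SPACE = "space"


def toy_classifier(left: str, right: str) -> InlineBoundary:
    """Hypothetical stand-in for _classify_inline_boundary() on plain strings."""
    if not left or not right:
        return InlineBoundary.JOIN
    # If either side already carries whitespace, do not add more.
    if left[-1].isspace() or right[0].isspace():
        return InlineBoundary.JOIN
    return InlineBoundary.SPACE


def join_inline_parts(
    parts: Sequence[str],
    classify: Callable[[str, str], InlineBoundary] = toy_classifier,
) -> str:
    """Walk adjacent chunks and insert a space only when the classifier says SPACE."""
    out: List[str] = []
    for i, part in enumerate(parts):
        if i > 0 and classify(parts[i - 1], part) is InlineBoundary.SPACE:
            out.append(" ")
        out.append(part)
    return "".join(out)
```

Keeping the classifier as a separate callable means the join loop never needs to know why a boundary joins or spaces, which is the separation this refactor aims for.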

_classify_inline_boundary() now handles boundaries in a fixed order:

  • Existing whitespace: if either side already carries whitespace, it returns JOIN and avoids adding another space.
  • Provenance-based signals: _classify_provenance_boundary() checks original text and provenance/source positions to decide whether the boundary should join or include a space.
  • Text-to-text rules: if both sides are TextItems, _classify_text_boundary() applies rules for styled/plain text transitions, punctuation, and short word continuations.
  • Code/formula/link separation: _is_semantic_inline_atom() identifies code, formulas, and links so they can be visually separated from regular text when needed.
  • Final fallback: _classify_ambiguous_word_boundary() handles uncertain styled/plain word splits by joining likely continuations like Pars + ing, otherwise preferring readability with a space.
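The fixed ordering above can be illustrated with a simplified classifier. Everything in this sketch is an assumption made for illustration: the `Chunk` type, its `kind`, `styled`, and `span` fields, the punctuation set, and the "short lowercase suffix" heuristic are hypothetical and are not taken from the PR's code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class InlineBoundary(Enum):
    JOIN = "join"
    SPACE = "space"


@dataclass
class Chunk:
    """Hypothetical stand-in for one serialized inline item."""
    text: str
    kind: str = "text"  # "text", "code", "formula", or "link"
    styled: bool = False
    span: Optional[Tuple[int, int]] = None  # (start, end) in the source text


CLOSING_PUNCT = (".", ",", ";", ":", "!", "?", ")")
ATOM_KINDS = {"code", "formula", "link"}


def classify_boundary(left: Chunk, right: Chunk) -> InlineBoundary:
    # 1. Existing whitespace: never add a second space.
    if left.text[-1:].isspace() or right.text[:1].isspace():
        return InlineBoundary.JOIN

    # 2. Provenance: touching source spans join, a gap means a space.
    if left.span is not None and right.span is not None:
        gap = right.span[0] - left.span[1]
        return InlineBoundary.JOIN if gap <= 0 else InlineBoundary.SPACE

    # 3. Text-to-text rules (assumed): punctuation attaches to the left,
    #    two chunks with the same styling are separate words.
    if left.kind == "text" and right.kind == "text":
        if right.text.startswith(CLOSING_PUNCT):
            return InlineBoundary.JOIN
        if left.styled == right.styled:
            return InlineBoundary.SPACE

    # 4. Code, formulas, and links stand apart from regular text.
    if left.kind in ATOM_KINDS or right.kind in ATOM_KINDS:
        return InlineBoundary.SPACE

    # 5. Fallback for styled/plain word splits: join a short lowercase
    #    suffix such as "Pars" + "ing", otherwise prefer a space.
    looks_like_suffix = (
        right.text.isalpha() and right.text.islower() and len(right.text) <= 3
    )
    if left.text[-1:].isalpha() and looks_like_suffix:
        return InlineBoundary.JOIN
    return InlineBoundary.SPACE
```

Because each step returns as soon as it has a decision, earlier signals such as existing whitespace or provenance always take precedence over the later heuristics.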

Helper Roles

_is_styled_text() detects whether a text item has visible formatting or a hyperlink, which feeds the text-boundary logic. _is_semantic_inline_atom() marks inline items that should usually stand apart from regular text, especially code, formulas, and links.
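As a rough illustration of the two helper roles, the predicates might look like the sketch below. The `Formatting` and `InlineItem` classes are hypothetical stand-ins defined only for this example; they are not the docling_core item types, and the real helpers may check different attributes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Formatting:
    """Hypothetical formatting flags for an inline text item."""
    bold: bool = False
    italic: bool = False
    underline: bool = False
    strikethrough: bool = False


@dataclass
class InlineItem:
    """Hypothetical inline item; `kind` is "text", "code", or "formula"."""
    text: str
    kind: str = "text"
    formatting: Optional[Formatting] = None
    hyperlink: Optional[str] = None


def is_styled_text(item: InlineItem) -> bool:
    """True if the item has visible formatting or a hyperlink."""
    fmt = item.formatting
    has_visible_format = fmt is not None and any(
        (fmt.bold, fmt.italic, fmt.underline, fmt.strikethrough)
    )
    return has_visible_format or item.hyperlink is not None


def is_semantic_inline_atom(item: InlineItem) -> bool:
    """True for code, formulas, and links, which usually stand apart from text."""
    return item.kind in {"code", "formula"} or item.hyperlink is not None
```

Note that in this sketch a linked item counts both as styled text and as a semantic atom, matching the description that hyperlinks feed both checks.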

…CO error

Signed-off-by: wanadzhar913 <adzhar.faiq@gmail.com>
@mergify
Contributor

mergify Bot commented May 7, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

Waiting for

  • #approved-reviews-by >= 2
This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
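The title pattern above can be tried locally with Python's `re` module. This is only a sketch: it assumes Mergify's `~=` operator behaves like a regular-expression search, and the helper name `is_conventional_title` is made up for the example.

```python
import re

# Pattern copied from the merge protection rule above.
TITLE_PATTERN = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional_title(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return TITLE_PATTERN.search(title) is not None


pr_title = (
    "fix: (cont.) Remove Extra Space Before and After Group Items "
    "using Inline Boundaries"
)
print(is_conventional_title(pr_title))
```

The pattern is case-sensitive and anchored at the start, so a title such as "Fix: something" or one without a type prefix would not satisfy the rule.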

@wanadzhar913
Contributor Author

wanadzhar913 commented May 7, 2026

Hi @ceberam, do review when you can. Thanks so much! Happy to just reuse the old approach in #458 where we just remove the space (" ") when joining all parts without separators. Let me know what you prefer.

@github-actions
Contributor

github-actions Bot commented May 7, 2026

DCO Check Passed

Thanks @wanadzhar913, all your commits are properly signed off. 🎉

@PeterStaar-IBM PeterStaar-IBM requested a review from vagenas May 18, 2026 05:32
Member

@PeterStaar-IBM PeterStaar-IBM left a comment

lgtm!

@codecov

codecov Bot commented May 18, 2026

Codecov Report

❌ Patch coverage is 96.93878% with 3 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
docling_core/transforms/serializer/common.py | 96.87% | 3 Missing ⚠️
