fix: prevent O(N²) XObject reloading in do_form() causing OOM on complex PDFs #279

Open
seongmincho315 wants to merge 1 commit into docling-project:main from seongmincho315:main

@seongmincho315

Summary

PDFs containing many Form XObjects that share a large resource dictionary (e.g. a scatter plot with 2,296 data points, each referencing 2,301 sibling XObjects) caused an OOM kill during parsing.

Root cause: In do_form() (src/parse/pdf_decoders/stream.h), set() was called unconditionally for fonts, graphics states, and XObjects on every invocation. When do_form() is called N times and each call reloads all N resources, the result is O(N²) memory allocations: 2,296 × 2,301 ≈ 5.3 million redundant reloads in the reported case.

Fix: Before calling set(), check whether all keys in the child resource dictionary are already present in the parent chain. If so, skip the set() call. Resources inherited from a parent Form XObject are already available, so re-registering them on every do_form() call is unnecessary.
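
To make the effect of the guard concrete, here is a minimal, self-contained sketch of the idea. It is not the code in stream.h: `ResourceScope`, `all_keys_inherited`, and `simulate` are hypothetical names invented for this illustration, and the "load" is modeled as a simple counter rather than real font or XObject parsing.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a resource table (fonts, graphics states, or
// XObjects) with a link to the enclosing scope. Not the docling-parse type.
struct ResourceScope {
    const ResourceScope* parent = nullptr;
    std::map<std::string, int> entries;  // key -> dummy payload
    std::size_t loads = 0;               // number of entries loaded by set()

    // True if `key` is defined in this scope or in any ancestor scope.
    bool has(const std::string& key) const {
        for (const ResourceScope* s = this; s != nullptr; s = s->parent) {
            if (s->entries.count(key) != 0) return true;
        }
        return false;
    }

    // Models the expensive reload: every key is loaded into this scope.
    void set(const std::vector<std::string>& keys) {
        for (const auto& k : keys) {
            entries[k] = 1;
            ++loads;
        }
    }
};

// True when every key of the child dictionary is already reachable
// through the parent chain of `scope`.
bool all_keys_inherited(const ResourceScope& scope,
                        const std::vector<std::string>& keys) {
    for (const auto& k : keys) {
        if (!scope.has(k)) return false;
    }
    return true;
}

// Simulates n_forms do_form() calls that all carry the same n_keys-entry
// resource dictionary, and returns the total number of loads performed.
std::size_t simulate(std::size_t n_forms, std::size_t n_keys, bool guarded) {
    std::vector<std::string> keys;
    for (std::size_t i = 0; i < n_keys; ++i) {
        keys.push_back("X" + std::to_string(i));
    }

    ResourceScope page;
    page.set(keys);  // page-level resources are loaded once
    std::size_t total = page.loads;

    for (std::size_t i = 0; i < n_forms; ++i) {
        ResourceScope form;
        form.parent = &page;
        // Unguarded: always reload. Guarded: reload only if a key is missing.
        if (!guarded || !all_keys_inherited(form, keys)) {
            form.set(keys);
        }
        total += form.loads;
    }
    return total;
}
```

In this toy model, `simulate(n_forms, n_keys, false)` performs `n_keys * (n_forms + 1)` loads, while `simulate(n_forms, n_keys, true)` performs only the initial `n_keys`. The guard still walks the child's keys on each call, but it performs lookups rather than allocations. The sketch compares key names only, as the description above states; it makes no claim about how the real implementation handles a child dictionary that reuses a name for a different object.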

Test case

  • PDF: arXiv:2508.13113 (page 2 — scatter plot with 2,296 data points)
  • Before fix: OOM kill at ~200 s
  • After fix: page parsed successfully in ~181 s

Related issue: docling-project/docling#2109

Checklist

  • Signed-off-by included (DCO)
  • Single file changed (src/parse/pdf_decoders/stream.h)
  • No behavior change for PDFs without repeated Form XObject resource dictionaries

@github-actions

github-actions Bot commented Jun 9, 2026

DCO Check Passed

Thanks @seongmincho315, all your commits are properly signed off. 🎉

@mergify

mergify Bot commented Jun 9, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

…lex PDFs

When a PDF page contains many Form XObjects sharing the same resource
dictionary (e.g. 2296 scatter plot data points each referencing 2301
XObjects), do_form() was calling set() on fonts, graphics states, and
XObjects on every invocation — reloading all N resources N times,
resulting in O(N²) memory and CPU usage that caused OOM kills.

Fix: before calling set(), check whether all keys in the resource
dictionary are already present in the parent chain. If so, skip the
set() call entirely. This reduces the redundant reloading to O(1) per
do_form() call when resources are already inherited.

Fixes: docling-project/docling#2109

Signed-off-by: Seongmin Cho <seongmin.cho@genon.ai>