
feat: add _to_referenced_images and _to_embedded_images to DoclingDocument#589

Open
OfekDanny wants to merge 4 commits into docling-project:main from OfekDanny:feat/image-mode-transform-methods

Conversation

@OfekDanny

Summary

Closes #3094

Adds two named transformation methods on DoclingDocument that follow the same pattern as _hierarchize() and _flatten() — they operate on a deep copy and leave the source document unchanged, as suggested by @PeterStaar-IBM in docling-project/docling#3288.

_to_referenced_images(image_dir, image_path_prefix="")

Saves each PictureItem image as image_001.png, image_002.png, … into image_dir and updates the stored URI to image_path_prefix + filename. This makes it easy to call export_to_markdown(image_mode=ImageRefMode.REFERENCED) with full control over where images land and what prefix appears in the output — without writing the whole document to disk first.
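
The naming and URI scheme described above can be sketched without docling at all. This is a minimal stand-alone illustration, not the PR's implementation: `save_referenced` is a hypothetical helper, and the images are plain byte strings standing in for PictureItem images.

```python
from pathlib import Path
import tempfile


def save_referenced(
    images: list[bytes], image_dir: Path, image_path_prefix: str = ""
) -> list[str]:
    """Write each blob as image_NNN.png and return the URI to store for it.

    Hypothetical helper: it only mirrors the naming scheme from the PR text.
    """
    image_dir.mkdir(parents=True, exist_ok=True)
    uris = []
    for count, blob in enumerate(images, start=1):
        filename = f"image_{count:03d}.png"  # image_001.png, image_002.png, ...
        (image_dir / filename).write_bytes(blob)
        # The stored URI is the prefix plus the filename, not the on-disk path.
        uris.append(image_path_prefix + filename)
    return uris


with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "output" / "images"
    uris = save_referenced([b"first", b"second"], out_dir, image_path_prefix="images/")
    written = sorted(p.name for p in out_dir.iterdir())
```

Keeping the prefix separate from `image_dir` is what lets the files land in one place while the exported markdown points at a different relative location.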

_to_embedded_images()

Returns a deep copy where every image referenced through a file URI is converted to inline base64 form (delegates to _with_embedded_pictures()).
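
As a rough illustration of what "inline base64 form" can look like, the sketch below encodes PNG bytes as a data URI and decodes them again. It assumes the inline form is an RFC 2397 `data:` URI, which the PR text does not spell out; `to_data_uri` and `from_data_uri` are hypothetical names, not docling-core functions.

```python
import base64


def to_data_uri(png_bytes: bytes) -> str:
    """Encode raw PNG bytes as a data URI (assumed inline form)."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return f"data:image/png;base64,{encoded}"


def from_data_uri(uri: str) -> bytes:
    """Recover the raw bytes from a base64 data URI."""
    _header, _, payload = uri.partition(",")
    return base64.b64decode(payload)


# The 8-byte PNG signature stands in for a real image file's contents.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
uri = to_data_uri(PNG_SIGNATURE)
roundtrip = from_data_uri(uri)
```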

Usage

from pathlib import Path
from docling_core.types.doc.base import ImageRefMode

# `doc` is assumed to be an existing DoclingDocument (e.g. from a prior conversion)

# Save images to disk, get markdown with relative references
doc_ref = doc._to_referenced_images(
    image_dir=Path("output/images"),
    image_path_prefix="images/",
)
md = doc_ref.export_to_markdown(image_mode=ImageRefMode.REFERENCED)

# Get a fully self-contained document with embedded images
doc_emb = doc._to_embedded_images()
md = doc_emb.export_to_markdown(image_mode=ImageRefMode.EMBEDDED)

Tests

Three new tests added to test/test_docling_doc.py:

  • test_to_referenced_images — verifies images are written with sequential filenames (image_001.png) and URIs are set to prefix + filename; checks source document is not mutated
  • test_to_referenced_images_markdown_output — verifies the full pipeline: transform → export_to_markdown emits the correct image reference
  • test_to_embedded_images — verifies the method returns an independent deep copy

Design notes

  • Both methods follow the _hierarchize / _flatten pattern (private helper, returns a new DoclingDocument)
  • _to_embedded_images is a thin, named wrapper over _with_embedded_pictures() to make the intent explicit at call sites
  • No existing behaviour is changed; the two lower-level helpers (_with_pictures_refs, _with_embedded_pictures) remain in place
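
The copy-then-transform pattern named in these notes can be shown on a toy class. This is only a sketch under stated assumptions: `ToyDoc` and `_to_prefixed` are hypothetical stand-ins, not part of DoclingDocument.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a document that stores image URIs."""

    image_uris: list[str] = field(default_factory=list)

    def _to_prefixed(self, prefix: str) -> "ToyDoc":
        # Work on a deep copy so the caller's document is never mutated.
        result = copy.deepcopy(self)
        result.image_uris = [prefix + uri for uri in result.image_uris]
        return result


source = ToyDoc(image_uris=["image_001.png"])
transformed = source._to_prefixed("images/")
```

A test in the same spirit as the ones listed above would assert both that `transformed` carries the new URIs and that `source` still carries the old ones.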

Signed-off-by: OfekDanny <ofekdanny@gmail.com>

…ument

Adds two named transformation methods on DoclingDocument that make it
easy to obtain a copy of the document with images in a desired storage
mode, without having to write the document to disk first:

- _to_referenced_images(image_dir, image_path_prefix): saves each
  PictureItem image as image_001.png … into image_dir and sets the
  stored URI to image_path_prefix + filename, so that a subsequent
  export_to_markdown(image_mode=ImageRefMode.REFERENCED) emits the
  correct relative references.

- _to_embedded_images(): returns a deep copy where every image
  referenced through a file URI is converted to inline base64 form,
  delegating to the existing _with_embedded_pictures() helper.

Both methods follow the same pattern as _hierarchize() and _flatten():
they operate on a deep copy and leave the source document unchanged.

Three pytest tests are added to test/test_docling_doc.py covering:
- images are written with sequential filenames (image_001.png)
- markdown output contains the expected prefix/filename reference
- _to_embedded_images returns an independent copy

Closes #3094

Signed-off-by: OfekDanny <ofekdanny@gmail.com>
@github-actions
Contributor

github-actions Bot commented Apr 15, 2026

DCO Check Failed

Hi @OfekDanny, your pull request has failed the Developer Certificate of Origin (DCO) check.

This repository supports remediation commits, so you can fix this without rewriting history — but you must follow the required message format.


🛠 Quick Fix: Add a remediation commit

Run this command:

git commit --allow-empty -s -m "DCO Remediation Commit for Ofek Danny <63648262+OfekDanny@users.noreply.github.com>

I, Ofek Danny <63648262+OfekDanny@users.noreply.github.com>, hereby add my Signed-off-by to this commit: b3d3609e6abadf5a26b2bda2d2c563b070a9be13
I, Ofek Danny <63648262+OfekDanny@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 7f3211cbf8526601180f2c12e73c32fb7496d5ff"
git push

🔧 Advanced: Sign off each commit directly

For the latest commit:

git commit --amend --signoff
git push --force-with-lease

For multiple commits:

git rebase --signoff origin/main
git push --force-with-lease

More info: DCO check report

@mergify
Contributor

mergify Bot commented Apr 15, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

Waiting for

  • #approved-reviews-by >= 2
This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
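
To see why this PR's title passes the rule, the pattern can be tried locally. The sketch assumes Mergify's `~=` operator behaves like a regular-expression search; `is_conventional` is a hypothetical helper, and the second and third titles are made-up examples.

```python
import re

# Pattern copied from the merge-protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True when the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


this_pr = is_conventional(
    "feat: add _to_referenced_images and _to_embedded_images to DoclingDocument"
)
scoped_breaking = is_conventional("fix(markdown)!: handle empty captions")
plain = is_conventional("Add image transform methods")
```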

item.image = ImageRef.from_pil(image=img, dpi=round(72 * scale))
item.image.uri = uri_path
img_count += 1

Member


@OfekDanny I think it might make sense here to also add the saving of the page-images.

Author


Done! I've updated _to_referenced_images to also iterate over result.pages and save each PageItem.image as page_NNN.png, updating its URI to image_path_prefix + filename. Added a dedicated test (test_to_referenced_images_saves_page_images) that builds a document with a page image, verifies the file is saved, and checks that the source document is not mutated.

OfekDanny and others added 3 commits April 19, 2026 11:49
Per reviewer feedback from @PeterStaar-IBM: in addition to PictureItem
images, iterate result.pages and save each PageItem.image as
page_NNN.png, updating its URI to image_path_prefix + filename.

Verifies that _to_referenced_images saves page images to disk and
updates their URIs, while not mutating the source document.

…ly.github.com>

I, Ofek Danny <63648262+OfekDanny@users.noreply.github.com>, hereby add my Signed-off-by to this commit: b3d3609
I, Ofek Danny <63648262+OfekDanny@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 7f3211c

Signed-off-by: Ofek Danny <ofek@kaps.co.il>


3 participants