feat: add _to_referenced_images and _to_embedded_images to DoclingDocument #589
OfekDanny wants to merge 4 commits
…ument

Adds two named transformation methods on DoclingDocument that make it easy to obtain a copy of the document with images in a desired storage mode, without having to write the document to disk first:

- _to_referenced_images(image_dir, image_path_prefix): saves each PictureItem image as image_001.png … into image_dir and sets the stored URI to image_path_prefix + filename, so that a subsequent export_to_markdown(image_mode=ImageRefMode.REFERENCED) emits the correct relative references.
- _to_embedded_images(): returns a deep copy where every image referenced through a file URI is converted to inline base64 form, delegating to the existing _with_embedded_pictures() helper.

Both methods follow the same pattern as _hierarchize() and _flatten(): they operate on a deep copy and leave the source document unchanged.

Three pytest tests are added to test/test_docling_doc.py covering:

- images are written with sequential filenames (image_001.png)
- markdown output contains the expected prefix/filename reference
- _to_embedded_images returns an independent copy

Closes #3094

Signed-off-by: OfekDanny <ofekdanny@gmail.com>
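The copy-then-save behaviour described for `_to_referenced_images` can be sketched with toy stand-ins. This is not the PR's implementation: `ToyImage`, `ToyDocument`, and `to_referenced_images` are hypothetical names, and raw bytes stand in for PNG data so the sketch needs only the standard library.

```python
import copy
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import List, Optional


@dataclass
class ToyImage:
    """Hypothetical stand-in for an image reference: raw bytes plus a URI."""

    data: bytes
    uri: Optional[str] = None


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a document that holds picture images."""

    pictures: List[ToyImage] = field(default_factory=list)

    def to_referenced_images(
        self, image_dir: Path, image_path_prefix: str = ""
    ) -> "ToyDocument":
        # Work on a deep copy so the source document is left unchanged.
        result = copy.deepcopy(self)
        image_dir.mkdir(parents=True, exist_ok=True)
        for index, picture in enumerate(result.pictures, start=1):
            # Sequential, zero-padded names: image_001.png, image_002.png, ...
            filename = f"image_{index:03d}.png"
            (image_dir / filename).write_bytes(picture.data)
            # The stored URI is the caller's prefix plus the filename.
            picture.uri = image_path_prefix + filename
        return result


with tempfile.TemporaryDirectory() as tmp:
    source = ToyDocument(pictures=[ToyImage(b"first"), ToyImage(b"second")])
    referenced = source.to_referenced_images(
        Path(tmp) / "images", image_path_prefix="images/"
    )
    written = sorted(p.name for p in (Path(tmp) / "images").iterdir())

print(written)
print([p.uri for p in referenced.pictures])
print([p.uri for p in source.pictures])
```

Keeping the prefix separate from the output directory lets the files be written anywhere while the URIs stay relative to wherever the exported markdown will live.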
❌ DCO Check Failed

Hi @OfekDanny, your pull request has failed the Developer Certificate of Origin (DCO) check. This repository supports remediation commits, so you can fix this without rewriting history, but you must follow the required message format.

🛠 Quick Fix: Add a remediation commit

Run this command:

    git commit --allow-empty -s -m "DCO Remediation Commit for Ofek Danny <63648262+OfekDanny@users.noreply.github.com>
    I, Ofek Danny <63648262+OfekDanny@users.noreply.github.com>, hereby add my Signed-off-by to this commit: b3d3609e6abadf5a26b2bda2d2c563b070a9be13
    I, Ofek Danny <63648262+OfekDanny@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 7f3211cbf8526601180f2c12e73c32fb7496d5ff"
    git push

🔧 Advanced: Sign off each commit directly

For the latest commit:

    git commit --amend --signoff
    git push --force-with-lease

For multiple commits:

    git rebase --signoff origin/main
    git push --force-with-lease

More info: DCO check report
Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing. When test data is updated, we require two reviewers.

🟢 Enforce conventional commit

Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
    item.image = ImageRef.from_pil(image=img, dpi=round(72 * scale))
    item.image.uri = uri_path
    img_count += 1
@OfekDanny I think it might make sense here to also add the saving of the page-images.
Done! I've updated _to_referenced_images to also iterate over result.pages and save each PageItem.image as page_NNN.png, updating its URI to image_path_prefix + filename. Added a dedicated test (test_to_referenced_images_saves_page_images) that builds a document with a page image, verifies the file is saved, and checks that the source document is not mutated.
Per reviewer feedback from @PeterStaar-IBM: in addition to PictureItem images, iterate result.pages and save each PageItem.image as page_NNN.png, updating its URI to image_path_prefix + filename.
Verifies that _to_referenced_images saves page images to disk and updates their URIs, while not mutating the source document.
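The page-image step discussed in this thread can be sketched in the same spirit. The function `save_page_images` is hypothetical, pages are modelled as a plain mapping from page number to optional image bytes, and using the page number for the `NNN` part of `page_NNN.png` is an assumption (the PR may number pages sequentially instead).

```python
import tempfile
from pathlib import Path
from typing import Dict, Optional


def save_page_images(
    pages: Dict[int, Optional[bytes]],
    image_dir: Path,
    image_path_prefix: str = "",
) -> Dict[int, str]:
    """Write each available page image and return page number -> URI."""
    image_dir.mkdir(parents=True, exist_ok=True)
    uris: Dict[int, str] = {}
    for page_no, data in sorted(pages.items()):
        if data is None:
            continue  # pages without an image are skipped
        # Assumption: the zero-padded number is the page number itself.
        filename = f"page_{page_no:03d}.png"
        (image_dir / filename).write_bytes(data)
        uris[page_no] = image_path_prefix + filename
    return uris


with tempfile.TemporaryDirectory() as tmp:
    page_uris = save_page_images(
        {1: b"page one", 2: None, 3: b"page three"}, Path(tmp), "assets/"
    )
    saved_files = sorted(p.name for p in Path(tmp).iterdir())

print(page_uris)
print(saved_files)
```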
Summary
Closes #3094
Adds two named transformation methods on `DoclingDocument` that follow the same pattern as `_hierarchize()` and `_flatten()`: they operate on a deep copy and leave the source document unchanged, as suggested by @PeterStaar-IBM in docling-project/docling#3288.

`_to_referenced_images(image_dir, image_path_prefix="")`

Saves each `PictureItem` image as `image_001.png`, `image_002.png`, … into `image_dir` and updates the stored URI to `image_path_prefix + filename`. This makes it easy to call `export_to_markdown(image_mode=ImageRefMode.REFERENCED)` with full control over where images land and what prefix appears in the output, without writing the whole document to disk first.

`_to_embedded_images()`

Returns a deep copy where every image referenced through a file URI is converted to inline base64 form (delegates to `_with_embedded_pictures()`).

Usage
Tests
Three new tests added to `test/test_docling_doc.py`:

- `test_to_referenced_images`: verifies images are written with sequential filenames (`image_001.png`) and URIs are set to `prefix + filename`; checks source document is not mutated
- `test_to_referenced_images_markdown_output`: verifies the full pipeline: transform → `export_to_markdown` emits the correct image reference
- `test_to_embedded_images`: verifies the method returns an independent deep copy

Design notes

- Follows the `_hierarchize`/`_flatten` pattern (private helper, returns a new `DoclingDocument`)
- `_to_embedded_images` is a thin, named wrapper over `_with_embedded_pictures()` to make the intent explicit at call sites
- The existing helpers (`_with_pictures_refs`, `_with_embedded_pictures`) remain in place

Signed-off-by: OfekDanny <ofekdanny@gmail.com>
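For readers unfamiliar with the "inline base64 form" that `_to_embedded_images()` produces, the general idea of turning a file on disk into a data URI can be sketched as follows. This shows only the generic encoding step, not what `_with_embedded_pictures()` does internally; `file_to_data_uri` is a hypothetical helper, and the file holds just the 8-byte PNG signature rather than a full image.

```python
import base64
import tempfile
from pathlib import Path


def file_to_data_uri(path: Path, mime_type: str = "image/png") -> str:
    """Read a file and return its contents as a base64 data URI."""
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # signature bytes only, not a valid image

with tempfile.TemporaryDirectory() as tmp:
    image_path = Path(tmp) / "image_001.png"
    image_path.write_bytes(PNG_SIGNATURE)
    data_uri = file_to_data_uri(image_path)

# The data URI is self-contained: it still decodes after the file is gone.
decoded = base64.b64decode(data_uri.split(",", 1)[1])
print(data_uri)
print(decoded == PNG_SIGNATURE)
```

Because the payload travels inside the URI, an embedded document no longer depends on the image directory, which is the trade-off against the referenced mode.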