[MS] Add OCR layer service for embedded images and PDF scans#1541
Merged
Conversation
- Created HTML file with multiple images for testing OCR extraction. - Added several PDF files with different layouts and image placements to validate OCR functionality. - Introduced PPTX files with complex layouts and images at various positions for comprehensive testing. - Included XLSX files with multiple images and complex layouts to ensure accurate OCR extraction. - Implemented a new test suite in `test_ocr.py` to validate OCR functionality across all document types, ensuring context preservation and accuracy.
- Refactor image extraction and processing in PDF, PPTX, and XLSX converters for improved readability and consistency. - Implement detailed validation for OCR text positioning relative to surrounding text in test cases. - Introduce comprehensive tests for expected OCR results across various document types, ensuring no base64 images are present. - Improve error handling and logging for better debugging during OCR extraction.
zashed
approved these changes
Jan 27, 2026
zashed
approved these changes
Jan 27, 2026
…nctionality across DOCX, PDF, PPTX, and XLSX converters
- Introduced multiple test documents for PDF, DOCX, XLSX, and PPTX formats, covering scenarios with images at the start, middle, and end. - Implemented tests for complex layouts, multi-page documents, and documents with multiple images. - Created a new test script `test_ocr.py` to validate OCR functionality, ensuring context preservation and accurate text extraction. - Added expected OCR results for validation against ground truth. - Included tests for scanned documents to verify OCR fallback mechanisms.
…kitdown into u/vilesyk/inline_image
… and OCR format consistency - Deleted `html_image_start.html` and `html_multiple_images.html` as they are no longer needed. - Updated `test_file_uris` in `test_module_misc.py` to simplify assertions by removing unnecessary `url2pathname` usage. - Removed `test_ocr_format_consistency.py` as it is no longer relevant to the current testing framework.
… for multipage PDFs
…converter and test files
…kitdown into u/vilesyk/inline_image
|
I would love to see this included in a new release. Could 0.16 be tagged? |
Ahoo-Wang
added a commit
to Ahoo-Wang/markitdown
that referenced
this pull request
Jun 1, 2026
* Resolved an issue with linked images in docx [mammoth] (microsoft#1405) * Fixed documentation typos in _base_converter.py (microsoft#1393) * Ensure safe ExifTool usage: require >= 12.24 (microsoft#1399) * feat: add version verification for ExifTool to ensure security compliance * fix: improve ExifTool version verification --------- * Bump actions/checkout from 4 to 5 (microsoft#1394) Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 5. * Add HTML support to DocumentIntelligenceConverter (microsoft#1352) * fix: correctly pass custom llm prompt parameter (microsoft#1319) * fix: correctly pass custom llm prompt parameter * Update README.md (microsoft#1335) Fix typo in README.md * Update README.md (microsoft#1350) ISSUE microsoft#1339 * Update README.md (microsoft#1191) Fix: Subtle spelling mistake fixed. * Adding support for data-src Attribute (microsoft#1226) * supportfordata-src * docs: correct minor typos (microsoft#1173) * fix docx parse error(\n in alt) (microsoft#1163) * Handle PPTX shapes where position is None (microsoft#1161) * Handle shapes where position is None * Fixed recursion error, and place no-coord shapes at front * feat: add checkbox support to Markdown converter (microsoft#1208) This change introduces functionality to convert HTML checkbox input elements (<input type=checkbox>) into Markdown checkbox syntax ([ ] or [x]). Co-authored-by: Meirna Kamal <meirna.kamal@vodafone.com> * Test if mammoth resolves rlinks. (microsoft#1451) * Upgrade mammoth to 1.11.0 (microsoft#1452) * Bump versions of mammoth and pdfminer.six (microsoft#1492) * Updated pyproject to require a minimum version of pdfminer.six to ensure CVE-2025-64512 is patched. * [MS] Update PDF table extraction to support aligned Markdown (microsoft#1499) * Added PDF table extraction feature with aligned Markdown (microsoft#1419) * Add PDF test files and enhance extraction tests - Added a medical report scan PDF for testing scanned PDF handling. - Included a retail purchase receipt PDF to validate receipt extraction functionality. - Introduced a multipage invoice PDF to test extraction of complex invoice structures. - Added a borderless table PDF for testing inventory reconciliation report extraction. - Implemented comprehensive tests for PDF table extraction, ensuring proper structure and data integrity. - Enhanced existing tests to validate the order and presence of extracted content across various PDF types. * fix: update dependencies for PDF processing and improve table extraction logic * Bumped version of pdfminer.six --------- Authored-by: Ashok <ashh010101@gmail.com> * Fix: PDF parsing doesn't support partially numbered lists (microsoft#1525) * Fix: PDF parsing doesn't support partially numbered lists * Refactor: Move import of PARTIAL_NUMBERING_PATTERN to the top of the test file * Refactor: Improve assertion formatting in partial numbering tests * [MS] Extend table support for wide tables (microsoft#1552) * feat: enhance PDF table extraction to support complex forms and add new test cases * feat: enhance PDF table extraction with adaptive column clustering and add comprehensive test cases * fix: correct formatting and improve assertions in PDF table tests * Add text/markdown to Accept header (microsoft#1554) * Remove onnxruntime<=1.20.1 Windows pin (microsoft#1551) * Bump version for release. (microsoft#1564) * [MS] Add OCR layer service for embedded images and PDF scans (microsoft#1541) * Add OCR test data and implement tests for various document formats - Created HTML file with multiple images for testing OCR extraction. - Added several PDF files with different layouts and image placements to validate OCR functionality. - Introduced PPTX files with complex layouts and images at various positions for comprehensive testing. - Included XLSX files with multiple images and complex layouts to ensure accurate OCR extraction. - Implemented a new test suite in `test_ocr.py` to validate OCR functionality across all document types, ensuring context preservation and accuracy. * Enhance OCR functionality and validation in document converters - Refactor image extraction and processing in PDF, PPTX, and XLSX converters for improved readability and consistency. - Implement detailed validation for OCR text positioning relative to surrounding text in test cases. - Introduce comprehensive tests for expected OCR results across various document types, ensuring no base64 images are present. - Improve error handling and logging for better debugging during OCR extraction. * Add support for scanned PDFs with full-page OCR fallback and implement tests * Bump version to 0.1.6b1 in __about__.py * Refactor OCR services to support LLM Vision, update README and tests accordingly * Add OCR-enabled converters and ensure consistent OCR format across document types * Refactor converters to improve import organization and enhance OCR functionality across DOCX, PDF, PPTX, and XLSX converters * Refactor exception imports for consistency across converters and tests * Fix OCR tests to match MockOCRService output and fix cross-platform file URI handling * Bump version to 0.1.6b1 in __about__.py * Skip DOCX/XLSX/PPTX OCR tests when optional dependencies are missing * Add comprehensive OCR test suite for various document formats - Introduced multiple test documents for PDF, DOCX, XLSX, and PPTX formats, covering scenarios with images at the start, middle, and end. - Implemented tests for complex layouts, multi-page documents, and documents with multiple images. - Created a new test script `test_ocr.py` to validate OCR functionality, ensuring context preservation and accurate text extraction. - Added expected OCR results for validation against ground truth. - Included tests for scanned documents to verify OCR fallback mechanisms. * Remove obsolete HTML test files and refactor test cases for file URIs and OCR format consistency - Deleted `html_image_start.html` and `html_multiple_images.html` as they are no longer needed. - Updated `test_file_uris` in `test_module_misc.py` to simplify assertions by removing unnecessary `url2pathname` usage. - Removed `test_ocr_format_consistency.py` as it is no longer relevant to the current testing framework. * Refactor OCR processing in PdfConverterWithOCR and enhance unit tests for multipage PDFs * Revert * Revert * Update REDMEs * Refactor import statements for consistency and improve formatting in converter and test files * Fix O(n) memory growth in PDF conversion by calling page.close() afte… (microsoft#1612) * Fix O(n) memory growth in PDF conversion by calling page.close() after each page * Refactor PDF memory optimization tests for improved readability and consistency * Add memory benchmarking tests for PDF conversion with page.close() fix * Remove unnecessary blank lines in PDF memory optimization tests for cleaner code * Bump version to 0.1.6b2 in __about__.py * Update PDF conversion tests to include mimetype in StreamInfo * Updated warning about binding to non-local interfaces. (microsoft#1653) * fix: handle deeply nested HTML that triggers RecursionError (microsoft#1644) * fix: handle deeply nested HTML that triggers RecursionError (microsoft#1636) Large HTML files with deep DOM nesting (e.g., SEC EDGAR filings) cause markdownify's recursive DOM traversal to exceed Python's default recursion limit (1000). Previously this RecursionError was caught by the top-level _convert() dispatcher, which then fell through to PlainTextConverter — silently returning the raw HTML as 'markdown' with no warning. This fix catches RecursionError in HtmlConverter.convert() and falls back to BeautifulSoup's iterative get_text() method, which handles arbitrary nesting depths. A warning is emitted so callers know the output is plain text rather than full markdown. Root cause chain: 1. HtmlConverter.convert() calls markdownify.convert_soup() (recursive) 2. Deeply nested HTML (>~400 levels) triggers RecursionError 3. _convert() catches all Exceptions, stores in failed_attempts 4. PlainTextConverter.accepts() matches text/html via 'text/' prefix 5. PlainTextConverter.convert() returns raw HTML bytes as text 6. Caller receives 'markdown' that is actually unconverted HTML * refactor: address review feedback on RecursionError fallback - Move 'import warnings' to module top level (was inside except block) - Make test environment-independent by temporarily lowering sys.setrecursionlimit(200) instead of relying on depth=500 being sufficient on all platforms; original limit restored in finally block - Add strict=True keyword argument to opt out of the plain-text fallback and let RecursionError propagate to the caller * test: use result.markdown instead of deprecated result.text_content --------- Co-authored-by: jigangz <jigangz@github.com> * Clarify security posture in READMEs (microsoft#1807) * feat: Add Azure Content Understanding converter (microsoft#1865) * inital version * improve mime type detection * prebuilt-image custom analzyer route to image * enhance cu priority over di * fix: apply black formatting * update cache of known prebuilt name and README improvement * add test cases, run black * update readme and deriving content_type from the resolved file_type * update readme * Bump version to 0.1.6 (microsoft#1914) --------- Co-authored-by: afourney <adamfo@microsoft.com> Co-authored-by: JonahDelman <jonah.delman@gmail.com> Co-authored-by: t3tra <admin@t3tra.net> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: safen0s <99965118+safen0s@users.noreply.github.com> Co-authored-by: Stefan Rink <stefan-rink@users.noreply.github.com> Co-authored-by: [W]DOS_ <40659600+W-DOS0@users.noreply.github.com> Co-authored-by: Utkarsh kumar <m83610278@gmail.com> Co-authored-by: Ebrahim Tayabali <47640402+ebrahimHakimuddin@users.noreply.github.com> Co-authored-by: Noah Zhu <118643158+Noah-Zhuhaotian@users.noreply.github.com> Co-authored-by: Dmitry <98899785+mdqst@users.noreply.github.com> Co-authored-by: Yuzhong Zhang <141388234+BetterAndBetterII@users.noreply.github.com> Co-authored-by: Richard Ye <33409792+richardye101@users.noreply.github.com> Co-authored-by: Meirna <61427701+Meirna-kamal@users.noreply.github.com> Co-authored-by: Meirna Kamal <meirna.kamal@vodafone.com> Co-authored-by: lesyk <lesyk@users.noreply.github.com> Co-authored-by: Bas Nijholt <basnijholt@gmail.com> Co-authored-by: jigangz <115519042+jigangz@users.noreply.github.com> Co-authored-by: jigangz <jigangz@github.com> Co-authored-by: Chien Yuan Chang <ds.chienyuanchang@gmail.com>
Ahoo-Wang
added a commit
to Ahoo-Wang/markitdown
that referenced
this pull request
Jun 1, 2026
* Resolved an issue with linked images in docx [mammoth] (microsoft#1405) * Fixed documentation typos in _base_converter.py (microsoft#1393) * Ensure safe ExifTool usage: require >= 12.24 (microsoft#1399) * feat: add version verification for ExifTool to ensure security compliance * fix: improve ExifTool version verification --------- * Bump actions/checkout from 4 to 5 (microsoft#1394) Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 5. * Add HTML support to DocumentIntelligenceConverter (microsoft#1352) * fix: correctly pass custom llm prompt parameter (microsoft#1319) * fix: correctly pass custom llm prompt parameter * Update README.md (microsoft#1335) Fix typo in README.md * Update README.md (microsoft#1350) ISSUE microsoft#1339 * Update README.md (microsoft#1191) Fix: Subtle spelling mistake fixed. * Adding support for data-src Attribute (microsoft#1226) * supportfordata-src * docs: correct minor typos (microsoft#1173) * fix docx parse error(\n in alt) (microsoft#1163) * Handle PPTX shapes where position is None (microsoft#1161) * Handle shapes where position is None * Fixed recursion error, and place no-coord shapes at front * feat: add checkbox support to Markdown converter (microsoft#1208) This change introduces functionality to convert HTML checkbox input elements (<input type=checkbox>) into Markdown checkbox syntax ([ ] or [x]). Co-authored-by: Meirna Kamal <meirna.kamal@vodafone.com> * Test if mammoth resolves rlinks. (microsoft#1451) * Upgrade mammoth to 1.11.0 (microsoft#1452) * Bump versions of mammoth and pdfminer.six (microsoft#1492) * Updated pyproject to require a minimum version of pdfminer.six to ensure CVE-2025-64512 is patched. * [MS] Update PDF table extraction to support aligned Markdown (microsoft#1499) * Added PDF table extraction feature with aligned Markdown (microsoft#1419) * Add PDF test files and enhance extraction tests - Added a medical report scan PDF for testing scanned PDF handling. - Included a retail purchase receipt PDF to validate receipt extraction functionality. - Introduced a multipage invoice PDF to test extraction of complex invoice structures. - Added a borderless table PDF for testing inventory reconciliation report extraction. - Implemented comprehensive tests for PDF table extraction, ensuring proper structure and data integrity. - Enhanced existing tests to validate the order and presence of extracted content across various PDF types. * fix: update dependencies for PDF processing and improve table extraction logic * Bumped version of pdfminer.six --------- Authored-by: Ashok <ashh010101@gmail.com> * Fix: PDF parsing doesn't support partially numbered lists (microsoft#1525) * Fix: PDF parsing doesn't support partially numbered lists * Refactor: Move import of PARTIAL_NUMBERING_PATTERN to the top of the test file * Refactor: Improve assertion formatting in partial numbering tests * [MS] Extend table support for wide tables (microsoft#1552) * feat: enhance PDF table extraction to support complex forms and add new test cases * feat: enhance PDF table extraction with adaptive column clustering and add comprehensive test cases * fix: correct formatting and improve assertions in PDF table tests * Add text/markdown to Accept header (microsoft#1554) * Remove onnxruntime<=1.20.1 Windows pin (microsoft#1551) * Bump version for release. (microsoft#1564) * [MS] Add OCR layer service for embedded images and PDF scans (microsoft#1541) * Add OCR test data and implement tests for various document formats - Created HTML file with multiple images for testing OCR extraction. - Added several PDF files with different layouts and image placements to validate OCR functionality. - Introduced PPTX files with complex layouts and images at various positions for comprehensive testing. - Included XLSX files with multiple images and complex layouts to ensure accurate OCR extraction. - Implemented a new test suite in `test_ocr.py` to validate OCR functionality across all document types, ensuring context preservation and accuracy. * Enhance OCR functionality and validation in document converters - Refactor image extraction and processing in PDF, PPTX, and XLSX converters for improved readability and consistency. - Implement detailed validation for OCR text positioning relative to surrounding text in test cases. - Introduce comprehensive tests for expected OCR results across various document types, ensuring no base64 images are present. - Improve error handling and logging for better debugging during OCR extraction. * Add support for scanned PDFs with full-page OCR fallback and implement tests * Bump version to 0.1.6b1 in __about__.py * Refactor OCR services to support LLM Vision, update README and tests accordingly * Add OCR-enabled converters and ensure consistent OCR format across document types * Refactor converters to improve import organization and enhance OCR functionality across DOCX, PDF, PPTX, and XLSX converters * Refactor exception imports for consistency across converters and tests * Fix OCR tests to match MockOCRService output and fix cross-platform file URI handling * Bump version to 0.1.6b1 in __about__.py * Skip DOCX/XLSX/PPTX OCR tests when optional dependencies are missing * Add comprehensive OCR test suite for various document formats - Introduced multiple test documents for PDF, DOCX, XLSX, and PPTX formats, covering scenarios with images at the start, middle, and end. - Implemented tests for complex layouts, multi-page documents, and documents with multiple images. - Created a new test script `test_ocr.py` to validate OCR functionality, ensuring context preservation and accurate text extraction. - Added expected OCR results for validation against ground truth. - Included tests for scanned documents to verify OCR fallback mechanisms. * Remove obsolete HTML test files and refactor test cases for file URIs and OCR format consistency - Deleted `html_image_start.html` and `html_multiple_images.html` as they are no longer needed. - Updated `test_file_uris` in `test_module_misc.py` to simplify assertions by removing unnecessary `url2pathname` usage. - Removed `test_ocr_format_consistency.py` as it is no longer relevant to the current testing framework. * Refactor OCR processing in PdfConverterWithOCR and enhance unit tests for multipage PDFs * Revert * Revert * Update REDMEs * Refactor import statements for consistency and improve formatting in converter and test files * Fix O(n) memory growth in PDF conversion by calling page.close() afte… (microsoft#1612) * Fix O(n) memory growth in PDF conversion by calling page.close() after each page * Refactor PDF memory optimization tests for improved readability and consistency * Add memory benchmarking tests for PDF conversion with page.close() fix * Remove unnecessary blank lines in PDF memory optimization tests for cleaner code * Bump version to 0.1.6b2 in __about__.py * Update PDF conversion tests to include mimetype in StreamInfo * Updated warning about binding to non-local interfaces. (microsoft#1653) * fix: handle deeply nested HTML that triggers RecursionError (microsoft#1644) * fix: handle deeply nested HTML that triggers RecursionError (microsoft#1636) Large HTML files with deep DOM nesting (e.g., SEC EDGAR filings) cause markdownify's recursive DOM traversal to exceed Python's default recursion limit (1000). Previously this RecursionError was caught by the top-level _convert() dispatcher, which then fell through to PlainTextConverter — silently returning the raw HTML as 'markdown' with no warning. This fix catches RecursionError in HtmlConverter.convert() and falls back to BeautifulSoup's iterative get_text() method, which handles arbitrary nesting depths. A warning is emitted so callers know the output is plain text rather than full markdown. Root cause chain: 1. HtmlConverter.convert() calls markdownify.convert_soup() (recursive) 2. Deeply nested HTML (>~400 levels) triggers RecursionError 3. _convert() catches all Exceptions, stores in failed_attempts 4. PlainTextConverter.accepts() matches text/html via 'text/' prefix 5. PlainTextConverter.convert() returns raw HTML bytes as text 6. Caller receives 'markdown' that is actually unconverted HTML * refactor: address review feedback on RecursionError fallback - Move 'import warnings' to module top level (was inside except block) - Make test environment-independent by temporarily lowering sys.setrecursionlimit(200) instead of relying on depth=500 being sufficient on all platforms; original limit restored in finally block - Add strict=True keyword argument to opt out of the plain-text fallback and let RecursionError propagate to the caller * test: use result.markdown instead of deprecated result.text_content --------- Co-authored-by: jigangz <jigangz@github.com> * Clarify security posture in READMEs (microsoft#1807) * feat: Add Azure Content Understanding converter (microsoft#1865) * inital version * improve mime type detection * prebuilt-image custom analzyer route to image * enhance cu priority over di * fix: apply black formatting * update cache of known prebuilt name and README improvement * add test cases, run black * update readme and deriving content_type from the resolved file_type * update readme * Bump version to 0.1.6 (microsoft#1914) * chore(api): remove package metadata --------- Co-authored-by: afourney <adamfo@microsoft.com> Co-authored-by: JonahDelman <jonah.delman@gmail.com> Co-authored-by: t3tra <admin@t3tra.net> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: safen0s <99965118+safen0s@users.noreply.github.com> Co-authored-by: Stefan Rink <stefan-rink@users.noreply.github.com> Co-authored-by: [W]DOS_ <40659600+W-DOS0@users.noreply.github.com> Co-authored-by: Utkarsh kumar <m83610278@gmail.com> Co-authored-by: Ebrahim Tayabali <47640402+ebrahimHakimuddin@users.noreply.github.com> Co-authored-by: Noah Zhu <118643158+Noah-Zhuhaotian@users.noreply.github.com> Co-authored-by: Dmitry <98899785+mdqst@users.noreply.github.com> Co-authored-by: Yuzhong Zhang <141388234+BetterAndBetterII@users.noreply.github.com> Co-authored-by: Richard Ye <33409792+richardye101@users.noreply.github.com> Co-authored-by: Meirna <61427701+Meirna-kamal@users.noreply.github.com> Co-authored-by: Meirna Kamal <meirna.kamal@vodafone.com> Co-authored-by: lesyk <lesyk@users.noreply.github.com> Co-authored-by: Bas Nijholt <basnijholt@gmail.com> Co-authored-by: jigangz <115519042+jigangz@users.noreply.github.com> Co-authored-by: jigangz <jigangz@github.com> Co-authored-by: Chien Yuan Chang <ds.chienyuanchang@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This pull request introduces the new
markitdown-ocrplugin, which adds LLM Vision-based OCR capabilities to MarkItDown. The plugin enables extraction of text from images embedded in PDF, DOCX, PPTX, and XLSX files using any OpenAI-compatible client, without requiring additional ML libraries or binaries.#1344
Output for testing files:
docx_image_end.docx
docx_image_middle.docx
docx_image_start.docx
docx_multipage.docx
docx_multiple_images.docx
pdf_complex_layout.pdf
pdf_image_end.pdf
pdf_image_middle.pdf
pdf_image_start.pdf
pdf_multiple_images.pdf
pdf_scanned_invoice.pdf
pdf_scanned_meeting_minutes.pdf
pdf_scanned_minimal.pdf
pdf_scanned_report.pdf
pdf_scanned_sales_report.pdf
pptx_complex_layout.pptx
pptx_image_end.pptx
pptx_image_middle.pptx
pptx_image_start.pptx
pptx_multiple_images.pptx
xlsx_complex_layout.xlsx
xlsx_image_end.xlsx
xlsx_image_middle.xlsx
xlsx_image_start.xlsx
xlsx_multiple_images.xlsx