Skip to content

[MS] Add OCR layer service for embedded images and PDF scans#1541

Merged
afourney merged 23 commits into
microsoft:mainfrom
lesyk:u/vilesyk/inline_image
Mar 10, 2026
Merged

[MS] Add OCR layer service for embedded images and PDF scans#1541
afourney merged 23 commits into
microsoft:mainfrom
lesyk:u/vilesyk/inline_image

Conversation

@lesyk
Copy link
Copy Markdown
Contributor

@lesyk lesyk commented Jan 26, 2026

This pull request introduces the new markitdown-ocr plugin, which adds LLM Vision-based OCR capabilities to MarkItDown. The plugin enables extraction of text from images embedded in PDF, DOCX, PPTX, and XLSX files using any OpenAI-compatible client, without requiring additional ML libraries or binaries.

#1344

Output for testing files:
---

## docx_complex_layout.docx

```markdown
Complex Document

|  |  |
| --- | --- |
| Feature | Status |
| Authentication | Active |
| Encryption | Enabled |

Security notice:

*[Image OCR]
NOTICE: SSL Certificate Expires 2025-12-31
[End OCR]*

docx_image_end.docx

Report

Main findings of the report.

Details and analysis.

Recommendations.

*[Image OCR]
FOOTER: Document ID: DOC-2024-001
[End OCR]*

docx_image_middle.docx

# Introduction

This is the introduction section.

We will see an image below.

*[Image OCR]
FIGURE 1: System Architecture
[End OCR]*

# Analysis

This section comes after the image.

docx_image_start.docx

Document with Image at Start

*[Image OCR]
HEADER: Company Logo - ACME Corp
[End OCR]*

This is the main content after the header image.

More text content here.

docx_multipage.docx

# Page 1 - Mixed Content

This is the first paragraph on page 1.

BEFORE IMAGE: Important content appears here.

*[Image OCR]
DOCX PAGE 1: Section Title
[End OCR]*

AFTER IMAGE: This content follows the image.

More text on page 1.

# Page 2 - Image at End

Content on page 2.

Multiple paragraphs of text.

Building up to the image...

Final paragraph before image.

*[Image OCR]
DOCX PAGE 2: Footer Note
[End OCR]*

# Page 3 - Image at Start

*[Image OCR]
DOCX PAGE 3: Header Image
[End OCR]*

Content that follows the header image.

AFTER IMAGE: This text is after the image.

docx_multiple_images.docx

Multi-Image Document

First section

*[Image OCR]
Chart 1: Revenue Growth
[End OCR]*

Second section with another image

*[Image OCR]
Chart 2: Customer Satisfaction
[End OCR]*

Conclusion

pdf_complex_layout.pdf

## Page 1

Complex Layout Document

Table:

ItemQuantity

*[Image OCR]
WARNING: Handle with care
[End OCR]*

Widget A5

pdf_image_end.pdf

## Page 1

Main Content

This is the main text content.

The image will appear at the end.

Keep reading...

*[Image OCR]
END: Contact: support@example.com
[End OCR]*

pdf_image_middle.pdf

## Page 1

Section 1: Introduction

This document contains an image in the middle.

Here is some introductory text.

*[Image OCR]
MIDDLE: Product Code: ABC-12345
[End OCR]*

Section 2: Details

This text appears AFTER the image.

pdf_image_start.pdf

## Page 1

*[Image OCR]
START: This is the first image in PDF
[End OCR]*

This is text BEFORE the image.

The image should appear above this text.

This is more content after the image.

pdf_multiple_images.pdf

## Page 1

Document with Multiple Images

*[Image OCR]
Image 1: Serial Number SN-001
[End OCR]*

Text between first and second image.

*[Image OCR]
Image 2: Model Number M-2024
[End OCR]*

Final text after all images.

pdf_scanned_invoice.pdf

## Page 1

*[Image OCR]
# INVOICE

Company: TechCorp Industries

Invoice Number: INV-2024-001

Date: January 15, 2024

BILL TO:

Acme Corporation

123 Main Street

New York, NY 10001

DESCRIPTION:

Software Development Services

Professional Consulting

Technical Support

TOTAL AMOUNT DUE: $5,000.00
[End OCR]*

pdf_scanned_meeting_minutes.pdf

## Page 1

*[Image OCR]
# MEETING MINUTES

Date: March 10, 2024

Attendees: John Smith, Jane Doe, Bob Johnson

AGENDA ITEMS

1. Project Status Update

- Phase 1 completed successfully

- Phase 2 on track for Q2 delivery

2. Budget Review

- Current spend: 75% of allocated budget

- Forecast: Within budget

3. Action Items

- John: Finalize requirements document
[End OCR]*

pdf_scanned_minimal.pdf

## Page 1

*[Image OCR]
# NOTICE

This is a minimal test document

with just a few lines of text.

It should still be processed correctly.
[End OCR]*

pdf_scanned_report.pdf

## Page 1

*[Image OCR]
# TECHNICAL REPORT

# Page 1

EXECUTIVE SUMMARY

This document presents the findings of our

technical analysis conducted in Q1 2024.

Key highlights include:

- System performance improvements

- Security enhancements

- User experience updates

The following pages detail our methodology

and recommendations.
[End OCR]*

## Page 2

*[Image OCR]
# TECHNICAL REPORT

# Page 2

METHODOLOGY

Our analysis involved three phases:

1. Data Collection

    Gathered metrics from production systems

    over a 90-day period.

2. Performance Analysis

    Identified bottlenecks and optimization

    opportunities.

3. Security Review

    Conducted vulnerability assessment and
[End OCR]*

## Page 3

*[Image OCR]
# TECHNICAL REPORT

# Page 3

RECOMMENDATIONS

Based on our findings, we recommend:

1. Implement caching layer to improve

    response times by 40%.

2. Upgrade authentication system to

    support multi-factor authentication.

3. Optimize database queries to reduce

    server load by 30%.

CONCLUSION
[End OCR]*

pdf_scanned_sales_report.pdf

## Page 1

*[Image OCR]
# QUARTERLY SALES REPORT

Q1 2024 Performance Summary

REGIONAL BREAKDOWN

Region        Revenue        Growth
North America  $2.5M         +15%
Europe        $1.8M         +22%
Asia Pacific  $3.2M         +35%
Latin America $0.9M         +12%

TOTAL         $8.4M         +23%

Top performing products:

- Product A: $3.1M

- Product B: $2.7M
[End OCR]*

pptx_complex_layout.pptx

\n\n<!-- Slide number: 1 -->\n# Product Comparison\n\nOur products lead the market\n
*[Image OCR]
Market Share: 35%
[End OCR]*

pptx_image_end.pptx

\n\n<!-- Slide number: 1 -->\n# Presentation\n\n\n\n<!-- Slide number: 2 -->\n# Thank You\n\n
*[Image OCR]
Contact: info@techcorp.com
[End OCR]*

pptx_image_middle.pptx

\n\n<!-- Slide number: 1 -->\n# Introduction\n\n\n\n<!-- Slide number: 2 -->\n# Architecture\n\n
*[Image OCR]
Diagram: System Components
[End OCR]*\n\n<!-- Slide number: 3 -->\n# Conclusion\n\n

pptx_image_start.pptx

\n\n<!-- Slide number: 1 -->\n# Welcome\n\n
*[Image OCR]
Company: TechCorp Inc.
[End OCR]*

pptx_multiple_images.pptx

\n\n<!-- Slide number: 1 -->\n# \n
*[Image OCR]
Before: 50% Efficiency
[End OCR]*

*[Image OCR]
After: 95% Efficiency
[End OCR]*

xlsx_complex_layout.xlsx

## Complex Report

| Annual Report 2024 | Unnamed: 1 |
| --- | --- |
| NaN | NaN |
| Month | Sales |
| Jan | 1000 |
| Feb | 1200 |
| NaN | NaN |
| Total | 2200 |

### Images in this sheet:

*[Image OCR]
Figure 1: Monthly Trend
[End OCR]*

*[Image OCR]
Figure 2: Year Overview
[End OCR]*

## Customers

| Customer Metrics | Unnamed: 1 |
| --- | --- |
| NaN | NaN |
| New Customers | 250 |
| Retention Rate | 92% |

### Images in this sheet:

*[Image OCR]
Customer Growth: +25% Year-over-Year
[End OCR]*

## Regions

| Regional Breakdown | Unnamed: 1 |
| --- | --- |
| NaN | NaN |
| Region | Revenue |
| North | $800K |
| South | $600K |

### Images in this sheet:

*[Image OCR]
Regional Map: Top Perform
[End OCR]*

xlsx_image_end.xlsx

## Sheet

| Financial Summary | Unnamed: 1 |
| --- | --- |
| Total Revenue | $500,000 |
| Total Expenses | $300,000 |
| Net Profit | $200,000 |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| Signature: | NaN |

### Images in this sheet:

*[Image OCR]
Approved by: John Doe, CFO
[End OCR]*

## Budget

| Budget Allocation | Unnamed: 1 |
| --- | --- |
| Marketing | $100,000 |
| R&D | $150,000 |
| Operations | $50,000 |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| Approved: | NaN |

### Images in this sheet:

*[Image OCR]
viewed by: Jane Smith, CTO
[End OCR]*

xlsx_image_middle.xlsx

## Revenue

| Q1 Report | Unnamed: 1 |
| --- | --- |
| NaN | NaN |
| Revenue | $50,000 |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| Profit Margin | 40% |

### Images in this sheet:

*[Image OCR]
Growth Trend: +15%
[End OCR]*

## Expenses

| Expense Breakdown | Unnamed: 1 |
| --- | --- |
| NaN | NaN |
| Expenses | $30,000 |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| NaN | NaN |
| Savings | $5,000 |

### Images in this sheet:

*[Image OCR]
Cost Analysis: Optimized
[End OCR]*

xlsx_image_start.xlsx

## Sales Q1

| Product | Sales |
| --- | --- |
| Widget A | 100 |
| Widget B | 150 |

### Images in this sheet:

*[Image OCR]
Q1 Sales Chart
[End OCR]*

## Forecast Q2

| Projected Sales | Unnamed: 1 |
| --- | --- |
| Widget A | 120 |
| Widget B | 180 |

### Images in this sheet:

*[Image OCR]
Q2 Forecast: +20% Growth
[End OCR]*

xlsx_multiple_images.xlsx

## Overview

| Dashboard |
| --- |
| Status: Active |
| NaN |
| NaN |
| NaN |
| NaN |
| Performance Summary |

### Images in this sheet:

*[Image OCR]
KPI: 95% Success Rate
[End OCR]*

*[Image OCR]
Uptime: 99.9%
[End OCR]*

## Details

| Detailed Metrics |
| --- |
| System Health |

### Images in this sheet:

*[Image OCR]
Metric: Response Time 50ms
[End OCR]*

## Summary

| Quarter Summary |
| --- |
| Overall Performance |

### Images in this sheet:

*[Image OCR]
Q1 Results: Exceeded Goals
[End OCR]*

</details>

lesyk and others added 4 commits January 26, 2026 19:44
- Created HTML file with multiple images for testing OCR extraction.
- Added several PDF files with different layouts and image placements to validate OCR functionality.
- Introduced PPTX files with complex layouts and images at various positions for comprehensive testing.
- Included XLSX files with multiple images and complex layouts to ensure accurate OCR extraction.
- Implemented a new test suite in `test_ocr.py` to validate OCR functionality across all document types, ensuring context preservation and accuracy.
- Refactor image extraction and processing in PDF, PPTX, and XLSX converters for improved readability and consistency.
- Implement detailed validation for OCR text positioning relative to surrounding text in test cases.
- Introduce comprehensive tests for expected OCR results across various document types, ensuring no base64 images are present.
- Improve error handling and logging for better debugging during OCR extraction.
@lesyk lesyk marked this pull request as ready for review January 27, 2026 10:21
@lesyk lesyk changed the title Add OCR test data and implement tests for various document formats Add OCR service for embedded images and PDF scans Jan 27, 2026
@lesyk lesyk changed the title Add OCR service for embedded images and PDF scans Add OCR layer service for embedded images and PDF scans Jan 27, 2026
@lesyk lesyk changed the title Add OCR layer service for embedded images and PDF scans [MS] Add OCR layer service for embedded images and PDF scans Jan 27, 2026
lesyk and others added 19 commits February 12, 2026 09:55
…nctionality across DOCX, PDF, PPTX, and XLSX converters
- Introduced multiple test documents for PDF, DOCX, XLSX, and PPTX formats, covering scenarios with images at the start, middle, and end.
- Implemented tests for complex layouts, multi-page documents, and documents with multiple images.
- Created a new test script `test_ocr.py` to validate OCR functionality, ensuring context preservation and accurate text extraction.
- Added expected OCR results for validation against ground truth.
- Included tests for scanned documents to verify OCR fallback mechanisms.
… and OCR format consistency

- Deleted `html_image_start.html` and `html_multiple_images.html` as they are no longer needed.
- Updated `test_file_uris` in `test_module_misc.py` to simplify assertions by removing unnecessary `url2pathname` usage.
- Removed `test_ocr_format_consistency.py` as it is no longer relevant to the current testing framework.
@afourney afourney merged commit c6308dc into microsoft:main Mar 10, 2026
3 checks passed
@EwoutH
Copy link
Copy Markdown

EwoutH commented Apr 24, 2026

I would love to see this included in a new release. Could 0.16 be tagged?

Copy link
Copy Markdown

@ahearnaustin40701-sudo ahearnaustin40701-sudo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[

devcontainer-feature.json.txt

](

  • `
  • [ ]

)

Ahoo-Wang added a commit to Ahoo-Wang/markitdown that referenced this pull request Jun 1, 2026
* Resolved an issue with linked images in docx [mammoth] (microsoft#1405)

* Fixed documentation typos in _base_converter.py (microsoft#1393)

* Ensure safe ExifTool usage: require >= 12.24 (microsoft#1399)

* feat: add version verification for ExifTool to ensure security compliance
* fix: improve ExifTool version verification

---------

* Bump actions/checkout from 4 to 5 (microsoft#1394)

Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 5.

* Add HTML support to DocumentIntelligenceConverter (microsoft#1352)

* fix: correctly pass custom llm prompt parameter (microsoft#1319)

* fix: correctly pass custom llm prompt parameter

* Update README.md (microsoft#1335)

Fix typo in README.md

* Update README.md (microsoft#1350)

ISSUE microsoft#1339

* Update README.md (microsoft#1191)

Fix: Subtle spelling mistake fixed.

* Adding support for data-src Attribute (microsoft#1226)

* supportfordata-src

* docs: correct minor typos (microsoft#1173)

* fix docx parse error(\n in alt) (microsoft#1163)

* Handle PPTX shapes where position is None (microsoft#1161)

* Handle shapes where position is None
* Fixed recursion error, and place no-coord shapes at front

* feat: add checkbox support to Markdown converter (microsoft#1208)

This change introduces functionality to convert HTML checkbox input elements
(<input type=checkbox>) into Markdown checkbox syntax ([ ] or [x]).
Co-authored-by: Meirna Kamal <meirna.kamal@vodafone.com>

* Test if mammoth resolves rlinks. (microsoft#1451)

* Upgrade mammoth to 1.11.0 (microsoft#1452)

* Bump versions of mammoth and pdfminer.six (microsoft#1492)

* Updated pyproject to require a minimum version of pdfminer.six to ensure CVE-2025-64512 is patched.

* [MS] Update PDF table extraction to support aligned Markdown (microsoft#1499)

* Added PDF table extraction feature with aligned Markdown (microsoft#1419)

* Add PDF test files and enhance extraction tests

- Added a medical report scan PDF for testing scanned PDF handling.
- Included a retail purchase receipt PDF to validate receipt extraction functionality.
- Introduced a multipage invoice PDF to test extraction of complex invoice structures.
- Added a borderless table PDF for testing inventory reconciliation report extraction.
- Implemented comprehensive tests for PDF table extraction, ensuring proper structure and data integrity.
- Enhanced existing tests to validate the order and presence of extracted content across various PDF types.

* fix: update dependencies for PDF processing and improve table extraction logic

* Bumped version of pdfminer.six
---------

Authored-by: Ashok <ashh010101@gmail.com>

* Fix: PDF parsing doesn't support partially numbered lists (microsoft#1525)

* Fix: PDF parsing doesn't support partially numbered lists

* Refactor: Move import of PARTIAL_NUMBERING_PATTERN to the top of the test file

* Refactor: Improve assertion formatting in partial numbering tests

* [MS] Extend table support for wide tables (microsoft#1552)

* feat: enhance PDF table extraction to support complex forms and add new test cases
* feat: enhance PDF table extraction with adaptive column clustering and add comprehensive test cases
* fix: correct formatting and improve assertions in PDF table tests

* Add text/markdown to Accept header (microsoft#1554)

* Remove onnxruntime<=1.20.1 Windows pin (microsoft#1551)

* Bump version for release. (microsoft#1564)

* [MS] Add OCR layer service for embedded images and PDF scans (microsoft#1541)

* Add OCR test data and implement tests for various document formats

- Created HTML file with multiple images for testing OCR extraction.
- Added several PDF files with different layouts and image placements to validate OCR functionality.
- Introduced PPTX files with complex layouts and images at various positions for comprehensive testing.
- Included XLSX files with multiple images and complex layouts to ensure accurate OCR extraction.
- Implemented a new test suite in `test_ocr.py` to validate OCR functionality across all document types, ensuring context preservation and accuracy.

* Enhance OCR functionality and validation in document converters

- Refactor image extraction and processing in PDF, PPTX, and XLSX converters for improved readability and consistency.
- Implement detailed validation for OCR text positioning relative to surrounding text in test cases.
- Introduce comprehensive tests for expected OCR results across various document types, ensuring no base64 images are present.
- Improve error handling and logging for better debugging during OCR extraction.

* Add support for scanned PDFs with full-page OCR fallback and implement tests

* Bump version to 0.1.6b1 in __about__.py

* Refactor OCR services to support LLM Vision, update README and tests accordingly

* Add OCR-enabled converters and ensure consistent OCR format across document types

* Refactor converters to improve import organization and enhance OCR functionality across DOCX, PDF, PPTX, and XLSX converters

* Refactor exception imports for consistency across converters and tests

* Fix OCR tests to match MockOCRService output and fix cross-platform file URI handling

* Bump version to 0.1.6b1 in __about__.py

* Skip DOCX/XLSX/PPTX OCR tests when optional dependencies are missing

* Add comprehensive OCR test suite for various document formats

- Introduced multiple test documents for PDF, DOCX, XLSX, and PPTX formats, covering scenarios with images at the start, middle, and end.
- Implemented tests for complex layouts, multi-page documents, and documents with multiple images.
- Created a new test script `test_ocr.py` to validate OCR functionality, ensuring context preservation and accurate text extraction.
- Added expected OCR results for validation against ground truth.
- Included tests for scanned documents to verify OCR fallback mechanisms.

* Remove obsolete HTML test files and refactor test cases for file URIs and OCR format consistency

- Deleted `html_image_start.html` and `html_multiple_images.html` as they are no longer needed.
- Updated `test_file_uris` in `test_module_misc.py` to simplify assertions by removing unnecessary `url2pathname` usage.
- Removed `test_ocr_format_consistency.py` as it is no longer relevant to the current testing framework.

* Refactor OCR processing in PdfConverterWithOCR and enhance unit tests for multipage PDFs

* Revert

* Revert

* Update REDMEs

* Refactor import statements for consistency and improve formatting in converter and test files

* Fix O(n) memory growth in PDF conversion by calling page.close() afte… (microsoft#1612)

* Fix O(n) memory growth in PDF conversion by calling page.close() after each page

* Refactor PDF memory optimization tests for improved readability and consistency

* Add memory benchmarking tests for PDF conversion with page.close() fix

* Remove unnecessary blank lines in PDF memory optimization tests for cleaner code

* Bump version to 0.1.6b2 in __about__.py

* Update PDF conversion tests to include mimetype in StreamInfo

* Updated warning about binding to non-local interfaces. (microsoft#1653)

* fix: handle deeply nested HTML that triggers RecursionError (microsoft#1644)

* fix: handle deeply nested HTML that triggers RecursionError (microsoft#1636)

Large HTML files with deep DOM nesting (e.g., SEC EDGAR filings) cause
markdownify's recursive DOM traversal to exceed Python's default
recursion limit (1000). Previously this RecursionError was caught by
the top-level _convert() dispatcher, which then fell through to
PlainTextConverter — silently returning the raw HTML as 'markdown'
with no warning.

This fix catches RecursionError in HtmlConverter.convert() and falls
back to BeautifulSoup's iterative get_text() method, which handles
arbitrary nesting depths. A warning is emitted so callers know the
output is plain text rather than full markdown.

Root cause chain:
1. HtmlConverter.convert() calls markdownify.convert_soup() (recursive)
2. Deeply nested HTML (>~400 levels) triggers RecursionError
3. _convert() catches all Exceptions, stores in failed_attempts
4. PlainTextConverter.accepts() matches text/html via 'text/' prefix
5. PlainTextConverter.convert() returns raw HTML bytes as text
6. Caller receives 'markdown' that is actually unconverted HTML

* refactor: address review feedback on RecursionError fallback

- Move 'import warnings' to module top level (was inside except block)
- Make test environment-independent by temporarily lowering
  sys.setrecursionlimit(200) instead of relying on depth=500 being
  sufficient on all platforms; original limit restored in finally block
- Add strict=True keyword argument to opt out of the plain-text
  fallback and let RecursionError propagate to the caller

* test: use result.markdown instead of deprecated result.text_content

---------

Co-authored-by: jigangz <jigangz@github.com>

* Clarify security posture in READMEs (microsoft#1807)

* feat: Add Azure Content Understanding converter (microsoft#1865)

* inital version

* improve mime type detection

* prebuilt-image custom analzyer route to image

* enhance cu priority over di

* fix: apply black formatting

* update cache of known prebuilt name and README improvement

* add test cases, run black

* update readme and deriving content_type from the resolved file_type

* update readme

* Bump version to 0.1.6 (microsoft#1914)

---------

Co-authored-by: afourney <adamfo@microsoft.com>
Co-authored-by: JonahDelman <jonah.delman@gmail.com>
Co-authored-by: t3tra <admin@t3tra.net>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: safen0s <99965118+safen0s@users.noreply.github.com>
Co-authored-by: Stefan Rink <stefan-rink@users.noreply.github.com>
Co-authored-by: [W]DOS_ <40659600+W-DOS0@users.noreply.github.com>
Co-authored-by: Utkarsh kumar <m83610278@gmail.com>
Co-authored-by: Ebrahim Tayabali <47640402+ebrahimHakimuddin@users.noreply.github.com>
Co-authored-by: Noah Zhu <118643158+Noah-Zhuhaotian@users.noreply.github.com>
Co-authored-by: Dmitry <98899785+mdqst@users.noreply.github.com>
Co-authored-by: Yuzhong Zhang <141388234+BetterAndBetterII@users.noreply.github.com>
Co-authored-by: Richard Ye <33409792+richardye101@users.noreply.github.com>
Co-authored-by: Meirna <61427701+Meirna-kamal@users.noreply.github.com>
Co-authored-by: Meirna Kamal <meirna.kamal@vodafone.com>
Co-authored-by: lesyk <lesyk@users.noreply.github.com>
Co-authored-by: Bas Nijholt <basnijholt@gmail.com>
Co-authored-by: jigangz <115519042+jigangz@users.noreply.github.com>
Co-authored-by: jigangz <jigangz@github.com>
Co-authored-by: Chien Yuan Chang <ds.chienyuanchang@gmail.com>
Ahoo-Wang added a commit to Ahoo-Wang/markitdown that referenced this pull request Jun 1, 2026
* Resolved an issue with linked images in docx [mammoth] (microsoft#1405)

* Fixed documentation typos in _base_converter.py (microsoft#1393)

* Ensure safe ExifTool usage: require >= 12.24 (microsoft#1399)

* feat: add version verification for ExifTool to ensure security compliance
* fix: improve ExifTool version verification

---------

* Bump actions/checkout from 4 to 5 (microsoft#1394)

Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 5.

* Add HTML support to DocumentIntelligenceConverter (microsoft#1352)

* fix: correctly pass custom llm prompt parameter (microsoft#1319)

* fix: correctly pass custom llm prompt parameter

* Update README.md (microsoft#1335)

Fix typo in README.md

* Update README.md (microsoft#1350)

ISSUE microsoft#1339

* Update README.md (microsoft#1191)

Fix: Subtle spelling mistake fixed.

* Adding support for data-src Attribute (microsoft#1226)

* supportfordata-src

* docs: correct minor typos (microsoft#1173)

* fix docx parse error(\n in alt) (microsoft#1163)

* Handle PPTX shapes where position is None (microsoft#1161)

* Handle shapes where position is None
* Fixed recursion error, and place no-coord shapes at front

* feat: add checkbox support to Markdown converter (microsoft#1208)

This change introduces functionality to convert HTML checkbox input elements
(<input type=checkbox>) into Markdown checkbox syntax ([ ] or [x]).
Co-authored-by: Meirna Kamal <meirna.kamal@vodafone.com>

* Test if mammoth resolves rlinks. (microsoft#1451)

* Upgrade mammoth to 1.11.0 (microsoft#1452)

* Bump versions of mammoth and pdfminer.six (microsoft#1492)

* Updated pyproject to require a minimum version of pdfminer.six to ensure CVE-2025-64512 is patched.

* [MS] Update PDF table extraction to support aligned Markdown (microsoft#1499)

* Added PDF table extraction feature with aligned Markdown (microsoft#1419)

* Add PDF test files and enhance extraction tests

- Added a medical report scan PDF for testing scanned PDF handling.
- Included a retail purchase receipt PDF to validate receipt extraction functionality.
- Introduced a multipage invoice PDF to test extraction of complex invoice structures.
- Added a borderless table PDF for testing inventory reconciliation report extraction.
- Implemented comprehensive tests for PDF table extraction, ensuring proper structure and data integrity.
- Enhanced existing tests to validate the order and presence of extracted content across various PDF types.

* fix: update dependencies for PDF processing and improve table extraction logic

* Bumped version of pdfminer.six
---------

Authored-by: Ashok <ashh010101@gmail.com>

* Fix: PDF parsing doesn't support partially numbered lists (microsoft#1525)

* Fix: PDF parsing doesn't support partially numbered lists

* Refactor: Move import of PARTIAL_NUMBERING_PATTERN to the top of the test file

* Refactor: Improve assertion formatting in partial numbering tests

* [MS] Extend table support for wide tables (microsoft#1552)

* feat: enhance PDF table extraction to support complex forms and add new test cases
* feat: enhance PDF table extraction with adaptive column clustering and add comprehensive test cases
* fix: correct formatting and improve assertions in PDF table tests

* Add text/markdown to Accept header (microsoft#1554)

* Remove onnxruntime<=1.20.1 Windows pin (microsoft#1551)

* Bump version for release. (microsoft#1564)

* [MS] Add OCR layer service for embedded images and PDF scans (microsoft#1541)

* Add OCR test data and implement tests for various document formats

- Created HTML file with multiple images for testing OCR extraction.
- Added several PDF files with different layouts and image placements to validate OCR functionality.
- Introduced PPTX files with complex layouts and images at various positions for comprehensive testing.
- Included XLSX files with multiple images and complex layouts to ensure accurate OCR extraction.
- Implemented a new test suite in `test_ocr.py` to validate OCR functionality across all document types, ensuring context preservation and accuracy.

* Enhance OCR functionality and validation in document converters

- Refactor image extraction and processing in PDF, PPTX, and XLSX converters for improved readability and consistency.
- Implement detailed validation for OCR text positioning relative to surrounding text in test cases.
- Introduce comprehensive tests for expected OCR results across various document types, ensuring no base64 images are present.
- Improve error handling and logging for better debugging during OCR extraction.

* Add support for scanned PDFs with full-page OCR fallback and implement tests

* Bump version to 0.1.6b1 in __about__.py

* Refactor OCR services to support LLM Vision, update README and tests accordingly

* Add OCR-enabled converters and ensure consistent OCR format across document types

* Refactor converters to improve import organization and enhance OCR functionality across DOCX, PDF, PPTX, and XLSX converters

* Refactor exception imports for consistency across converters and tests

* Fix OCR tests to match MockOCRService output and fix cross-platform file URI handling

* Bump version to 0.1.6b1 in __about__.py

* Skip DOCX/XLSX/PPTX OCR tests when optional dependencies are missing

* Add comprehensive OCR test suite for various document formats

- Introduced multiple test documents for PDF, DOCX, XLSX, and PPTX formats, covering scenarios with images at the start, middle, and end.
- Implemented tests for complex layouts, multi-page documents, and documents with multiple images.
- Created a new test script `test_ocr.py` to validate OCR functionality, ensuring context preservation and accurate text extraction.
- Added expected OCR results for validation against ground truth.
- Included tests for scanned documents to verify OCR fallback mechanisms.

* Remove obsolete HTML test files and refactor test cases for file URIs and OCR format consistency

- Deleted `html_image_start.html` and `html_multiple_images.html` as they are no longer needed.
- Updated `test_file_uris` in `test_module_misc.py` to simplify assertions by removing unnecessary `url2pathname` usage.
- Removed `test_ocr_format_consistency.py` as it is no longer relevant to the current testing framework.

* Refactor OCR processing in PdfConverterWithOCR and enhance unit tests for multipage PDFs

* Revert

* Revert

* Update REDMEs

* Refactor import statements for consistency and improve formatting in converter and test files

* Fix O(n) memory growth in PDF conversion by calling page.close() afte… (microsoft#1612)

* Fix O(n) memory growth in PDF conversion by calling page.close() after each page

* Refactor PDF memory optimization tests for improved readability and consistency

* Add memory benchmarking tests for PDF conversion with page.close() fix

* Remove unnecessary blank lines in PDF memory optimization tests for cleaner code

* Bump version to 0.1.6b2 in __about__.py

* Update PDF conversion tests to include mimetype in StreamInfo

* Updated warning about binding to non-local interfaces. (microsoft#1653)

* fix: handle deeply nested HTML that triggers RecursionError (microsoft#1644)

* fix: handle deeply nested HTML that triggers RecursionError (microsoft#1636)

Large HTML files with deep DOM nesting (e.g., SEC EDGAR filings) cause
markdownify's recursive DOM traversal to exceed Python's default
recursion limit (1000). Previously this RecursionError was caught by
the top-level _convert() dispatcher, which then fell through to
PlainTextConverter — silently returning the raw HTML as 'markdown'
with no warning.

This fix catches RecursionError in HtmlConverter.convert() and falls
back to BeautifulSoup's iterative get_text() method, which handles
arbitrary nesting depths. A warning is emitted so callers know the
output is plain text rather than full markdown.

Root cause chain:
1. HtmlConverter.convert() calls markdownify.convert_soup() (recursive)
2. Deeply nested HTML (>~400 levels) triggers RecursionError
3. _convert() catches all Exceptions, stores in failed_attempts
4. PlainTextConverter.accepts() matches text/html via 'text/' prefix
5. PlainTextConverter.convert() returns raw HTML bytes as text
6. Caller receives 'markdown' that is actually unconverted HTML

* refactor: address review feedback on RecursionError fallback

- Move 'import warnings' to module top level (was inside except block)
- Make test environment-independent by temporarily lowering
  sys.setrecursionlimit(200) instead of relying on depth=500 being
  sufficient on all platforms; original limit restored in finally block
- Add strict=True keyword argument to opt out of the plain-text
  fallback and let RecursionError propagate to the caller

* test: use result.markdown instead of deprecated result.text_content

---------

Co-authored-by: jigangz <jigangz@github.com>

* Clarify security posture in READMEs (microsoft#1807)

* feat: Add Azure Content Understanding converter (microsoft#1865)

* inital version

* improve mime type detection

* prebuilt-image custom analzyer route to image

* enhance cu priority over di

* fix: apply black formatting

* update cache of known prebuilt name and README improvement

* add test cases, run black

* update readme and deriving content_type from the resolved file_type

* update readme

* Bump version to 0.1.6 (microsoft#1914)

* chore(api): remove package metadata

---------

Co-authored-by: afourney <adamfo@microsoft.com>
Co-authored-by: JonahDelman <jonah.delman@gmail.com>
Co-authored-by: t3tra <admin@t3tra.net>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: safen0s <99965118+safen0s@users.noreply.github.com>
Co-authored-by: Stefan Rink <stefan-rink@users.noreply.github.com>
Co-authored-by: [W]DOS_ <40659600+W-DOS0@users.noreply.github.com>
Co-authored-by: Utkarsh kumar <m83610278@gmail.com>
Co-authored-by: Ebrahim Tayabali <47640402+ebrahimHakimuddin@users.noreply.github.com>
Co-authored-by: Noah Zhu <118643158+Noah-Zhuhaotian@users.noreply.github.com>
Co-authored-by: Dmitry <98899785+mdqst@users.noreply.github.com>
Co-authored-by: Yuzhong Zhang <141388234+BetterAndBetterII@users.noreply.github.com>
Co-authored-by: Richard Ye <33409792+richardye101@users.noreply.github.com>
Co-authored-by: Meirna <61427701+Meirna-kamal@users.noreply.github.com>
Co-authored-by: Meirna Kamal <meirna.kamal@vodafone.com>
Co-authored-by: lesyk <lesyk@users.noreply.github.com>
Co-authored-by: Bas Nijholt <basnijholt@gmail.com>
Co-authored-by: jigangz <115519042+jigangz@users.noreply.github.com>
Co-authored-by: jigangz <jigangz@github.com>
Co-authored-by: Chien Yuan Chang <ds.chienyuanchang@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants