Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
23 commits
Select commit Hold shift + click to select a range
2613ebb
Add OCR test data and implement tests for various document formats
lesyk Jan 26, 2026
2e83594
Merge branch 'main' into u/vilesyk/inline_image
lesyk Jan 26, 2026
2a3a3ef
Enhance OCR functionality and validation in document converters
lesyk Jan 27, 2026
f4fab9b
Add support for scanned PDFs with full-page OCR fallback and implemen…
lesyk Jan 27, 2026
40e0be5
Bump version to 0.1.6b1 in __about__.py
lesyk Feb 12, 2026
9daaeff
Refactor OCR services to support LLM Vision, update README and tests …
lesyk Feb 13, 2026
bd9c98d
Add OCR-enabled converters and ensure consistent OCR format across do…
lesyk Feb 13, 2026
6732692
Refactor converters to improve import organization and enhance OCR fu…
lesyk Feb 13, 2026
678ea75
Refactor exception imports for consistency across converters and tests
lesyk Feb 13, 2026
dfd57e0
Fix OCR tests to match MockOCRService output and fix cross-platform f…
lesyk Feb 16, 2026
550243a
Merge origin/main into u/vilesyk/inline_image
lesyk Feb 16, 2026
222ec95
Bump version to 0.1.6b1 in __about__.py
lesyk Feb 16, 2026
ce21005
Skip DOCX/XLSX/PPTX OCR tests when optional dependencies are missing
lesyk Feb 16, 2026
6e7cf50
Add comprehensive OCR test suite for various document formats
lesyk Feb 24, 2026
0816de8
Merge branch 'u/vilesyk/inline_image' of https://github.com/lesyk/mar…
lesyk Feb 24, 2026
a23087a
Remove obsolete HTML test files and refactor test cases for file URIs…
lesyk Feb 24, 2026
f7ee5ef
Refactor OCR processing in PdfConverterWithOCR and enhance unit tests…
lesyk Feb 24, 2026
fefc3b6
Revert
lesyk Feb 24, 2026
1ef0d50
Revert
lesyk Feb 24, 2026
9d485bd
Update REDMEs
lesyk Feb 24, 2026
207e58c
Merge branch 'main' into u/vilesyk/inline_image
lesyk Feb 25, 2026
b8e28c0
Refactor import statements for consistency and improve formatting in …
lesyk Feb 25, 2026
aff82a3
Merge branch 'u/vilesyk/inline_image' of https://github.com/lesyk/mar…
lesyk Feb 25, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
34 changes: 33 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@

> [!IMPORTANT]
> Breaking changes between 0.0.1 to 0.1.0:
> * Dependencies are now organized into optional feature-groups (further details below). Use `pip install 'markitdown[all]'` to have backward-compatible behavior.
> * Dependencies are now organized into optional feature-groups (further details below). Use `pip install 'markitdown[all]'` to have backward-compatible behavior.
> * convert\_stream() now requires a binary file-like object (e.g., a file opened in binary mode, or an io.BytesIO object). This is a breaking change from the previous version, where it previously also accepted text file-like objects, like io.StringIO.
> * The DocumentConverter class interface has changed to read from file-like streams rather than file paths. *No temporary files are created anymore*. If you are the maintainer of a plugin, or custom DocumentConverter, you likely need to update your code. Otherwise, if only using the MarkItDown class or CLI (as in these examples), you should not need to change anything.

Expand Down Expand Up @@ -132,6 +132,38 @@ markitdown --use-plugins path-to-file.pdf

To find available plugins, search GitHub for the hashtag `#markitdown-plugin`. To develop a plugin, see `packages/markitdown-sample-plugin`.

#### markitdown-ocr Plugin

The `markitdown-ocr` plugin adds OCR support to PDF, DOCX, PPTX, and XLSX converters, extracting text from embedded images using LLM Vision — the same `llm_client` / `llm_model` pattern that MarkItDown already uses for image descriptions. No new ML libraries or binary dependencies required.

**Installation:**

```bash
pip install markitdown-ocr
pip install openai # or any OpenAI-compatible client
```

**Usage:**

Pass the same `llm_client` and `llm_model` you would use for image descriptions:

```python
from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(
enable_plugins=True,
llm_client=OpenAI(),
llm_model="gpt-4o",
)
result = md.convert("document_with_images.pdf")
print(result.text_content)
```

If no `llm_client` is provided the plugin still loads, but OCR is silently skipped and the standard built-in converter is used instead.

See [`packages/markitdown-ocr/README.md`](packages/markitdown-ocr/README.md) for detailed documentation.

### Azure Document Intelligence

To use Microsoft Document Intelligence for conversion:
Expand Down
21 changes: 21 additions & 0 deletions packages/markitdown-ocr/LICENSE
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
MIT License

Copyright (c) Microsoft Corporation.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE
200 changes: 200 additions & 0 deletions packages/markitdown-ocr/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,200 @@
# MarkItDown OCR Plugin

LLM Vision plugin for MarkItDown that extracts text from images embedded in PDF, DOCX, PPTX, and XLSX files.

Uses the same `llm_client` / `llm_model` pattern that MarkItDown already supports for image descriptions — no new ML libraries or binary dependencies required.

## Features

- **Enhanced PDF Converter**: Extracts text from images within PDFs, with full-page OCR fallback for scanned documents
- **Enhanced DOCX Converter**: OCR for images in Word documents
- **Enhanced PPTX Converter**: OCR for images in PowerPoint presentations
- **Enhanced XLSX Converter**: OCR for images in Excel spreadsheets
- **Context Preservation**: Maintains document structure and flow when inserting extracted text

## Installation

```bash
pip install markitdown-ocr
```

The plugin uses whatever OpenAI-compatible client you already have. Install one if you don't have it yet:

```bash
pip install openai
```

## Usage

### Command Line

```bash
markitdown document.pdf --use-plugins --llm-client openai --llm-model gpt-4o
```

### Python API

Pass `llm_client` and `llm_model` to `MarkItDown()` exactly as you would for image descriptions:

```python
from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(
enable_plugins=True,
llm_client=OpenAI(),
llm_model="gpt-4o",
)

result = md.convert("document_with_images.pdf")
print(result.text_content)
```

If no `llm_client` is provided the plugin still loads, but OCR is silently skipped — falling back to the standard built-in converter.

### Custom Prompt

Override the default extraction prompt for specialized documents:

```python
md = MarkItDown(
enable_plugins=True,
llm_client=OpenAI(),
llm_model="gpt-4o",
llm_prompt="Extract all text from this image, preserving table structure.",
)
```

### Any OpenAI-Compatible Client

Works with any client that follows the OpenAI API:

```python
from openai import AzureOpenAI

md = MarkItDown(
enable_plugins=True,
llm_client=AzureOpenAI(
api_key="...",
azure_endpoint="https://your-resource.openai.azure.com/",
api_version="2024-02-01",
),
llm_model="gpt-4o",
)
```

## How It Works

When `MarkItDown(enable_plugins=True, llm_client=..., llm_model=...)` is called:

1. MarkItDown discovers the plugin via the `markitdown.plugin` entry point group
2. It calls `register_converters()`, forwarding all kwargs including `llm_client` and `llm_model`
3. The plugin creates an `LLMVisionOCRService` from those kwargs
4. Four OCR-enhanced converters are registered at **priority -1.0** — before the built-in converters at priority 0.0

When a file is converted:

1. The OCR converter accepts the file
2. It extracts embedded images from the document
3. Each image is sent to the LLM with an extraction prompt
4. The returned text is inserted inline, preserving document structure
5. If the LLM call fails, conversion continues without that image's text

## Supported File Formats

### PDF

- Embedded images are extracted by position (via `page.images` / page XObjects) and OCR'd inline, interleaved with the surrounding text in vertical reading order.
- **Scanned PDFs** (pages with no extractable text) are detected automatically: each page is rendered at 300 DPI and sent to the LLM as a full-page image.
- **Malformed PDFs** that pdfplumber/pdfminer cannot open (e.g. truncated EOF) are retried with PyMuPDF page rendering, so content is still recovered.

### DOCX

- Images are extracted via document part relationships (`doc.part.rels`).
- OCR is run before the DOCX→HTML→Markdown pipeline executes: placeholder tokens are injected into the HTML so that the markdown converter does not escape the OCR markers, and the final placeholders are replaced with the formatted `*[Image OCR]...[End OCR]*` blocks after conversion.
- Document flow (headings, paragraphs, tables) is fully preserved around the OCR blocks.

### PPTX

- Picture shapes, placeholder shapes with images, and images inside groups are all supported.
- Shapes are processed in top-to-left reading order per slide.
- If an `llm_client` is configured, the LLM is asked for a description first; OCR is used as the fallback when no description is returned.

### XLSX

- Images embedded in worksheets (`sheet._images`) are extracted per sheet.
- Cell position is calculated from the image anchor coordinates (column/row → Excel letter notation).
- Images are listed under a `### Images in this sheet:` section after the sheet's data table — they are not interleaved into the table rows.

### Output format

Every extracted OCR block is wrapped as:

```text
*[Image OCR]
<extracted text>
[End OCR]*
```

## Troubleshooting

### OCR text missing from output

The most likely cause is a missing `llm_client` or `llm_model`. Verify:

```python
from openai import OpenAI
from markitdown import MarkItDown

md = MarkItDown(
enable_plugins=True,
llm_client=OpenAI(), # required
llm_model="gpt-4o", # required
)
```

### Plugin not loading

Confirm the plugin is installed and discovered:

```bash
markitdown --list-plugins # should show: ocr
```

### API errors

The plugin propagates LLM API errors as warnings and continues conversion. Check your API key, quota, and that the chosen model supports vision inputs.

## Development

### Running Tests

```bash
cd packages/markitdown-ocr
pytest tests/ -v
```

### Building from Source

```bash
git clone https://github.com/microsoft/markitdown.git
cd markitdown/packages/markitdown-ocr
pip install -e .
```

## Contributing

Contributions are welcome! See the [MarkItDown repository](https://github.com/microsoft/markitdown) for guidelines.

## License

MIT — see [LICENSE](LICENSE).

## Changelog

### 0.1.0 (Initial Release)

- LLM Vision OCR for PDF, DOCX, PPTX, XLSX
- Full-page OCR fallback for scanned PDFs
- Context-aware inline text insertion
- Priority-based converter replacement (no code changes required)
57 changes: 57 additions & 0 deletions packages/markitdown-ocr/pyproject.toml
Original file line number Diff line number Diff line change
@@ -0,0 +1,57 @@
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "markitdown-ocr"
dynamic = ["version"]
description = 'OCR plugin for MarkItDown - Extracts text from images in PDF, DOCX, PPTX, and XLSX via LLM Vision'
readme = "README.md"
requires-python = ">=3.10"
license = "MIT"
keywords = ["markitdown", "ocr", "pdf", "docx", "xlsx", "pptx", "llm", "vision"]
authors = [
{ name = "Contributors", email = "noreply@github.com" },
]
classifiers = [
"Development Status :: 4 - Beta",
"Programming Language :: Python",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Programming Language :: Python :: 3.13",
"Programming Language :: Python :: Implementation :: CPython",
]

# Core dependencies — matches the file-format libraries markitdown already uses
dependencies = [
"markitdown>=0.1.0",
"pdfminer.six>=20251230",
"pdfplumber>=0.11.9",
"PyMuPDF>=1.24.0",
"mammoth~=1.11.0",
"python-docx",
"python-pptx",
"pandas",
"openpyxl",
"Pillow>=9.0.0",
]

# llm_client is passed in by the user (same as for markitdown image descriptions);
# install openai or any OpenAI-compatible SDK separately.
[project.optional-dependencies]
llm = [
"openai>=1.0.0",
]

[project.urls]
Documentation = "https://github.com/microsoft/markitdown#readme"
Issues = "https://github.com/microsoft/markitdown/issues"
Source = "https://github.com/microsoft/markitdown"

[tool.hatch.version]
path = "src/markitdown_ocr/__about__.py"

# CRITICAL: Plugin entry point - MarkItDown will discover this plugin through this entry point
[project.entry-points."markitdown.plugin"]
ocr = "markitdown_ocr"
4 changes: 4 additions & 0 deletions packages/markitdown-ocr/src/markitdown_ocr/__about__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
# SPDX-FileCopyrightText: 2025-present Contributors
# SPDX-License-Identifier: MIT

__version__ = "0.1.0"
31 changes: 31 additions & 0 deletions packages/markitdown-ocr/src/markitdown_ocr/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
# SPDX-FileCopyrightText: 2025-present Contributors
# SPDX-License-Identifier: MIT

"""
markitdown-ocr: OCR plugin for MarkItDown

Adds LLM Vision-based text extraction from images embedded in PDF, DOCX, PPTX, and XLSX files.
"""

from ._plugin import __plugin_interface_version__, register_converters
from .__about__ import __version__
from ._ocr_service import (
OCRResult,
LLMVisionOCRService,
)
from ._pdf_converter_with_ocr import PdfConverterWithOCR
from ._docx_converter_with_ocr import DocxConverterWithOCR
from ._pptx_converter_with_ocr import PptxConverterWithOCR
from ._xlsx_converter_with_ocr import XlsxConverterWithOCR

__all__ = [
"__version__",
"__plugin_interface_version__",
"register_converters",
"OCRResult",
"LLMVisionOCRService",
"PdfConverterWithOCR",
"DocxConverterWithOCR",
"PptxConverterWithOCR",
"XlsxConverterWithOCR",
]
Loading