feat: add LiteParseLoader (LiteParse optional extra) #534

Open
jexp wants to merge 3 commits into neo4j:main from jexp:feature/liteparse-loader

jexp (Member) commented May 28, 2026

Summary

  • Adds LiteParseLoader in a new neo4j_graphrag.experimental.components.liteparse_loader module as an alternative PDF/document loader backed by LiteParse — a local, zero-cloud Rust parser with optional Tesseract OCR support
  • data_loader.py is unchanged — LiteParseLoader lives in its own module alongside the optional-dependency import guard
  • New liteparse optional extra: pip install "neo4j-graphrag[liteparse]"
  • Drop-in replacement for PdfLoader; pass as file_loader=LiteParseLoader() to SimpleKGPipeline
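
The optional-dependency import guard mentioned above could look roughly like the following. This is a minimal sketch rather than the code in this PR: the helper name `require_optional` is hypothetical, and only the install hint string is taken from the description.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency, or fail with an install hint.

    Hypothetical helper for illustration; the PR may structure its guard
    differently (for example, a try/except at module import time).
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with a message that tells the user which extra to install.
        raise ImportError(
            f"'{module_name}' is not installed. "
            f'Install it with: pip install "neo4j-graphrag[{extra}]"'
        ) from exc
```

Deferring the import to first use keeps the module importable when the extra is not installed, which is presumably what lets the mocked unit tests run without it.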

Changes

  • src/neo4j_graphrag/experimental/components/liteparse_loader.py — new module with LiteParseLoader
  • pyproject.toml: liteparse = ["liteparse>=2.0.0,<3.0.0"] optional extra + mypy override
  • CHANGELOG.md — entry under ## Next
  • examples/customize/build_graph/components/loaders/liteparse_loader.py — usage example
  • tests/unit/experimental/components/test_liteparse_loader.py — 9 unit tests (fully mocked)
  • tests/unit/experimental/components/test_liteparse_loader_integration.py — 5 integration tests (real liteparse; auto-skipped when absent)

Usage

from neo4j_graphrag.experimental.components.liteparse_loader import LiteParseLoader
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline

pipeline = SimpleKGPipeline(
    llm=...,
    driver=...,
    embedder=...,
    file_loader=LiteParseLoader(ocr_enabled=True),  # or LiteParseLoader() for text PDFs
    from_file=True,
)
# Call this from inside an async function (or wrap the call in asyncio.run).
await pipeline.run_async(file_path="document.pdf")

Test plan

  • uv run pytest tests/unit/experimental/components/test_liteparse_loader.py — 9 unit tests pass without liteparse installed
  • pip install "neo4j-graphrag[liteparse]" && uv run pytest tests/unit/experimental/components/test_liteparse_loader_integration.py — 5 integration tests pass with liteparse installed
  • Existing test_data_loader.py tests unaffected
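
One way to get the "auto-skipped when absent" behaviour for the integration tests is to check whether the package can be found before running them. The sketch below uses only the standard library and is an assumption about the mechanism, not a copy of the PR's test file; the class and test names are hypothetical. In a pytest-only suite, `pytest.importorskip("liteparse")` at module level is a common alternative.

```python
import importlib.util
import unittest

# True only when the optional dependency can be located on this interpreter.
LITEPARSE_AVAILABLE = importlib.util.find_spec("liteparse") is not None


@unittest.skipUnless(LITEPARSE_AVAILABLE, "liteparse is not installed")
class LiteParseIntegrationSketch(unittest.TestCase):
    """Hypothetical integration test class; skipped as a whole when liteparse is absent."""

    def test_liteparse_is_importable(self) -> None:
        import liteparse  # only reached when the extra is installed

        self.assertIsNotNone(liteparse)
```

pytest collects `unittest.TestCase` classes and reports the skip reason, so a class like this shows up as skipped rather than failed on machines without the extra.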

jexp requested a review from a team as a code owner May 28, 2026 20:31
Adds LiteParseLoader to neo4j_graphrag.experimental.components.data_loader
as an alternative PDF/document loader using LiteParse — a local, zero-cloud
Rust-based parser with optional Tesseract OCR support.

- New liteparse optional extra: pip install "neo4j-graphrag[liteparse]"
- Lazy-imports liteparse; raises a clear ImportError with install hint when absent
- Parser instance cached per loader to avoid repeated Rust init overhead
- Supports local FS fast-path (file path) and non-local FS (bytes) via fsspec
- 9 unit tests (fully mocked, no liteparse install required)
- 5 integration tests (real liteparse; auto-skipped when not installed)
- Usage example in examples/customize/build_graph/components/loaders/
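
The per-loader parser cache and the two read paths listed above can be illustrated with a small stand-alone sketch. Everything here is an assumption made for illustration: `FakeParser`, its `parse_path` and `parse_bytes` methods, the `is_local` flag, and the `read_bytes` callback are hypothetical stand-ins and not the LiteParse or fsspec APIs.

```python
from typing import Any, Callable, Optional


class FakeParser:
    """Stand-in for the real parser; the method names are hypothetical."""

    def parse_path(self, path: str) -> str:
        return f"text from path {path}"

    def parse_bytes(self, data: bytes) -> str:
        return f"text from {len(data)} bytes"


class CachingLoaderSketch:
    """Builds the parser on first use and reuses it for later calls."""

    def __init__(self, parser_factory: Callable[[], Any]) -> None:
        self._parser_factory = parser_factory
        self._parser: Optional[Any] = None  # created lazily, then cached

    def _get_parser(self) -> Any:
        if self._parser is None:
            self._parser = self._parser_factory()
        return self._parser

    def load(
        self,
        path: str,
        is_local: bool,
        read_bytes: Optional[Callable[[str], bytes]] = None,
    ) -> str:
        parser = self._get_parser()
        if is_local:
            # Fast path: hand the file path straight to the parser.
            return parser.parse_path(path)
        if read_bytes is None:
            raise ValueError("read_bytes is required for non-local paths")
        # Non-local filesystem: fetch the content first, then parse the bytes.
        return parser.parse_bytes(read_bytes(path))
```

With this shape, constructing one loader and calling `load` repeatedly pays the parser initialisation cost once, which matches the stated motivation for caching.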
jexp force-pushed the feature/liteparse-loader branch from 82300a5 to 0b4241c May 28, 2026 20:32
jexp added 2 commits May 28, 2026 22:37
Extracts LiteParseLoader from data_loader.py into a dedicated
liteparse_loader.py, keeping the optional-dependency import guard
and all liteparse-specific logic isolated from the core loaders.

data_loader.py is restored to only DataLoader, PdfLoader, MarkdownLoader.