feat: add LiteParseLoader (LiteParse optional extra) #534
Open
jexp wants to merge 3 commits into
Adds LiteParseLoader to neo4j_graphrag.experimental.components.data_loader as an alternative PDF/document loader using LiteParse, a local, zero-cloud Rust-based parser with optional Tesseract OCR support.
- New liteparse optional extra: pip install "neo4j-graphrag[liteparse]"
- Lazy-imports liteparse; raises a clear ImportError with an install hint when absent
- Parser instance cached per loader to avoid repeated Rust init overhead
- Supports local FS fast-path (file path) and non-local FS (bytes) via fsspec
- 9 unit tests (fully mocked, no liteparse install required)
- 5 integration tests (real liteparse; auto-skipped when not installed)
- Usage example in examples/customize/build_graph/components/loaders/
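The lazy-import and per-loader caching points above can be illustrated with a minimal sketch. This is not the code from this PR: lazy_import, CachedParserLoader, and _build_parser are hypothetical names, and the sketch deliberately does not assume anything about the liteparse API (the imported module itself stands in for the parser object).

```python
import importlib
from types import ModuleType
from typing import Any, Optional


def lazy_import(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency on first use; add an install hint on failure."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is not installed. "
            f'Install it with: pip install "neo4j-graphrag[{extra}]"'
        ) from exc


class CachedParserLoader:
    """Hypothetical loader that builds its parser once and then reuses it."""

    def __init__(self, module_name: str = "liteparse") -> None:
        self._module_name = module_name
        self._parser: Optional[Any] = None  # filled in on first get_parser() call

    def _build_parser(self, module: ModuleType) -> Any:
        # Placeholder: the real constructor call depends on the liteparse API,
        # which this sketch does not assume. The module stands in for the parser.
        return module

    def get_parser(self) -> Any:
        # Import and construct only when first needed, then reuse the instance.
        if self._parser is None:
            module = lazy_import(self._module_name, extra="liteparse")
            self._parser = self._build_parser(module)
        return self._parser
```

With this shape, importing the loader module never fails when the extra is missing; the ImportError only surfaces when a parser is first requested.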
Extracts LiteParseLoader from data_loader.py into a dedicated liteparse_loader.py, keeping the optional-dependency import guard and all liteparse-specific logic isolated from the core loaders. data_loader.py is restored to only DataLoader, PdfLoader, MarkdownLoader.
Summary
- Adds LiteParseLoader in a new neo4j_graphrag.experimental.components.liteparse_loader module as an alternative PDF/document loader backed by LiteParse, a local, zero-cloud Rust parser with optional Tesseract OCR support
- data_loader.py is unchanged: LiteParseLoader lives in its own module alongside the optional-dependency import guard
- New liteparse optional extra: pip install "neo4j-graphrag[liteparse]"
- Alternative to PdfLoader; pass as file_loader=LiteParseLoader() to SimpleKGPipeline
Changes
- src/neo4j_graphrag/experimental/components/liteparse_loader.py: new module with LiteParseLoader
- pyproject.toml: liteparse = ["liteparse>=2.0.0,<3.0.0"] optional extra + mypy override
- CHANGELOG.md: entry under ## Next
- examples/customize/build_graph/components/loaders/liteparse_loader.py: usage example
- tests/unit/experimental/components/test_liteparse_loader.py: 9 unit tests (fully mocked)
- tests/unit/experimental/components/test_liteparse_loader_integration.py: 5 integration tests (real liteparse; auto-skipped when absent)
Usage
Test plan
- uv run pytest tests/unit/experimental/components/test_liteparse_loader.py: 9 unit tests pass without liteparse installed
- pip install "neo4j-graphrag[liteparse]" && uv run pytest tests/unit/experimental/components/test_liteparse_loader_integration.py: 5 integration tests pass with liteparse installed
- test_data_loader.py tests unaffected
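The "auto-skipped when absent" behaviour of the integration tests can be sketched with the standard library alone. The PR does not state which skip mechanism it uses (pytest.importorskip would be another common choice), so the class and helper names below are hypothetical and the test body is only a stand-in.

```python
import importlib.util
import unittest

# True only when the optional dependency can be found in this environment.
HAS_LITEPARSE = importlib.util.find_spec("liteparse") is not None


@unittest.skipUnless(HAS_LITEPARSE, "liteparse not installed")
class LiteParseIntegrationTest(unittest.TestCase):
    """Hypothetical integration test that needs the real package."""

    def test_module_imports(self) -> None:
        # Imported inside the test so collecting this file never fails
        # when the extra is missing.
        import liteparse  # noqa: F401

        self.assertIsNotNone(liteparse)


def run_suite() -> unittest.TestResult:
    """Run the test case and return the collected result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(LiteParseIntegrationTest)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Without the extra installed, the test is reported as skipped instead of failing, which is what lets the mocked unit tests and the integration tests share one test run.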