Skip to content

feat(loaders): unify ingestion via Docling (DOCX/XLSX/PPTX/HTML/CSV + URL)#262

Open
drr00t wants to merge 12 commits into
FalkorDB:mainfrom
drr00t:241-issue-expand-document-loader-coverage
Open

feat(loaders): unify ingestion via Docling (DOCX/XLSX/PPTX/HTML/CSV + URL)#262
drr00t wants to merge 12 commits into
FalkorDB:mainfrom
drr00t:241-issue-expand-document-loader-coverage

Conversation

@drr00t
Copy link
Copy Markdown
Contributor

@drr00t drr00t commented May 19, 2026

Summary

This feature extends the loader interface using Docling to add support for DOCX, XLSX, PPTX, CSV, HTML, and XHTML.

Changes

  • DoclingBaseLoader: Base class wrapping the Docling document model.

Test Plan

  • All existing tests pass (pytest tests/ -q)
  • New tests added for new functionality (if applicable)
  • Lint passes (ruff check src/)

Notes

Currently, the GraphRAG_SDK loader interface maps one loader to one extension. Since Docling supports multiple extensions, this PR updates the current loader implementation paradigm to support multi-format loaders.

Summary by CodeRabbit

  • New Features

    • Add optional Docling-based universal loader for DOCX/XLSX/PPTX/HTML/CSV and URLs (installable via an optional "docling" extra). Automatic fallback to this loader for those formats when installed.
  • Documentation

    • Updated docs and README with installation note, supported formats, strategy reference, and usage examples.
  • Examples

    • Added an example showing advanced Docling loader usage.
  • Tests

    • Added comprehensive tests covering Docling ingestion behaviors and error handling.

Review Change Stack

drr00t and others added 3 commits May 17, 2026 17:10
…te()

Changes the default chunker that ``GraphRAG.ingest()`` and
``GraphRAG.update()`` fall back to when the caller doesn't pass an
explicit ``chunker=``. Was ``FixedSizeChunking()``; now
``SentenceTokenCapChunking()`` (sentence-aware, max_tokens=512,
overlap_sentences=2 — the strategy's own defaults).

Why
---
``FixedSizeChunking`` splits on a hard character window with no awareness
of sentence, word, or paragraph boundaries. When the window cuts through
an entity name, the per-chunk LLM extractor produces a stub entity for
the fragment (``"Wayne Enterprises"`` → ``"Wayne En"`` in chunk N plus
unparsable text in chunk N+1). These stubs never merge with their full
forms during resolution because their embeddings differ enough that
LLMVerifiedResolution scores them below the soft threshold.

This silently inflates cypher counts and pollutes "which X" lists. The
strategy that surfaced this — ``CypherFirstAggregationStrategy`` — was
hitting a 6/7 ceiling on the internal aggregation benchmark with one
question failing because of these stubs. Switching to
``SentenceTokenCapChunking`` cleared the benchmark to 7/7 stable across
three runs, and the post-ingest graph state went from 11-14 organization
nodes (including ``Glo`` / ``Initech System`` / ``Wayne En``) to exactly
10 clean orgs, and from 66-80 ``Person`` nodes (with ``Carla`` / ``Carla
Okafor`` duplicates) to exactly 56 distinct persons — matching the
corpus.

A side benefit: sentence-aware chunks with 2-sentence overlap almost
always keep a person's first mention in the same chunk as their later
short-form references, so per-chunk FastCoref now binds ``Carla → Carla
Okafor`` reliably. That eliminates the short-form-duplicate class too,
not just the truncation stubs.

Compatibility
-------------
``FixedSizeChunking`` remains exported and fully supported — callers who
explicitly pass ``chunker=FixedSizeChunking()`` get unchanged behavior.
Existing tests (748 passed, 24 skipped) pass without modification: no
test in the suite asserts on chunk count or content shape from the
default chunker, so switching defaults doesn't break the suite.

Callers who relied on the previous default and want to keep it should
pass ``chunker=FixedSizeChunking()`` explicitly. The docstrings call out
the new default and reference ``FixedSizeChunking`` as the opt-in
character-window alternative.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@drr00t drr00t marked this pull request as draft May 19, 2026 00:27
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 19, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: aaaa8c1f-0552-47b6-bc19-a05aa8977b20

📥 Commits

Reviewing files that changed from the base of the PR and between a451138 and 15407de.

📒 Files selected for processing (2)
  • graphrag_sdk/examples/09_docling_advanced_loader.py
  • graphrag_sdk/tests/test_docling_loaders.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • graphrag_sdk/examples/09_docling_advanced_loader.py

📝 Walkthrough

Walkthrough

This PR introduces DoclingLoader, a universal document ingestion strategy for GraphRAG SDK that leverages the docling library to handle DOCX, XLSX, PPTX, HTML, CSV, and URLs. The loader is packaged as an optional extra, integrated into the main API's auto-detection logic, thoroughly tested, and documented with user guides and examples.

Changes

DoclingLoader Implementation and Integration

Layer / File(s) Summary
DoclingLoader implementation and dependency packaging
graphrag_sdk/pyproject.toml, graphrag_sdk/src/graphrag_sdk/ingestion/loaders/docling_loader.py
DoclingLoader uses async load() with background thread sync conversion, validates file paths, wraps import/conversion errors as LoaderError, maps docling labels to DocumentElement types, tracks header breadcrumbs, and falls back to Markdown export for empty text. Optional docling>=2.91.0 extra is declared; pytest warning filters suppress known transitive deprecations.
Package wiring and main API routing
graphrag_sdk/src/graphrag_sdk/ingestion/loaders/__init__.py, graphrag_sdk/src/graphrag_sdk/api/main.py
DoclingLoader is re-exported at package level; _default_loader_for() auto-detection routes unrecognized file extensions to DoclingLoader instead of TextLoader fallback.
Loader behavior tests
graphrag_sdk/tests/test_docling_loaders.py
Tests import error wrapping, label-to-type mapping with metadata preservation, breadcrumb construction from header nesting, Markdown export fallback, and format-specific scenarios: CSV headers, HTML pass-through, XLSX multi-sheet, Markdown title→section hierarchy.
User documentation and example
docs/ingestion.md, docs/strategies.md, graphrag_sdk/README.md, graphrag_sdk/examples/09_docling_advanced_loader.py
Step 1 "Load" section documents format routing (.docx/.xlsx/.pptx/.html/.csv/URLs → DoclingLoader when installed); "Built-in: DoclingLoader" section describes usage and breadcrumb design; README expanded with install extra and multi-format ingestion examples; runnable example shows provider setup, format-restricted ingestion, and completion flow.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Docling's magic wand arrives at last,
Converting formats, both new and past,
With breadcrumbs trailing through headers deep,
Documents parsed while schemas keep,
A rabbit's gift—more formats to feast!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: adding Docling-based unified ingestion support for multiple document formats (DOCX/XLSX/PPTX/HTML/CSV) and URLs.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@graphrag_sdk/pyproject.toml`:
- Around line 59-67: The extras specification is inconsistent: the standalone
"docling" extra pins docling>=2.91.0 while the "all" extras list contains
docling>=2.0.0; update the "all" extras entry to require the same minimum
(docling>=2.91.0) so installing graphrag-sdk[all] cannot pull an older
incompatible docling version—edit the "docling" entry inside the all array in
pyproject.toml to match the docling extra's minimum version.

In `@graphrag_sdk/src/graphrag_sdk/api/main.py`:
- Around line 329-333: Update the stale loader-default docstring that currently
reads "Loader: auto-detected from file extension (PDF or text)" so it reflects
the new extension routing; locate the docstring containing that exact phrase in
graphrag_sdk/api/main.py (the help/usage text shown in the diff) and replace it
with a concise description like "Loader: auto-detected from file extension (PDF,
DOCX, XLSX, PPTX, HTML/XHTML, CSV, or plain text)" so the user-facing docs list
the supported formats.

In `@graphrag_sdk/tests/test_docling_loaders.py`:
- Around line 24-27: The test is breaking the import system by having the
patched builtins.__import__ return None for non-target imports; change the patch
side_effect to delegate to the real importer for all names except the one you
want to simulate failing (i.e., capture the original_import =
builtins.__import__ and in the side_effect for the patch used around loader.load
call, raise ImportError("module not found") when name ==
"docling.document_converter" and otherwise return original_import(name, *args,
**kwargs)); keep the pytest.raises assertion around await loader.load(str(file),
ctx) unchanged and reference the patched builtins.__import__ side_effect that
delegates for everything except "docling.document_converter".
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6566ed76-11b6-4ce9-8ef8-6fb548181321

📥 Commits

Reviewing files that changed from the base of the PR and between a174629 and 9bfbab4.

📒 Files selected for processing (10)
  • graphrag_sdk/pyproject.toml
  • graphrag_sdk/src/graphrag_sdk/api/main.py
  • graphrag_sdk/src/graphrag_sdk/ingestion/loaders/__init__.py
  • graphrag_sdk/src/graphrag_sdk/ingestion/loaders/csv_loader.py
  • graphrag_sdk/src/graphrag_sdk/ingestion/loaders/docling_base.py
  • graphrag_sdk/src/graphrag_sdk/ingestion/loaders/docx_loader.py
  • graphrag_sdk/src/graphrag_sdk/ingestion/loaders/html_loader.py
  • graphrag_sdk/src/graphrag_sdk/ingestion/loaders/pptx_loader.py
  • graphrag_sdk/src/graphrag_sdk/ingestion/loaders/xlsx_loader.py
  • graphrag_sdk/tests/test_docling_loaders.py

Comment thread graphrag_sdk/pyproject.toml Outdated
Comment thread graphrag_sdk/src/graphrag_sdk/api/main.py
Comment thread graphrag_sdk/tests/test_docling_loaders.py Outdated
drr00t and others added 4 commits May 18, 2026 21:48
Path C in retrieve_chunks used `COLLECT(c)[..3]` with no ORDER BY, so
hub entities (which can be MENTIONED_IN hundreds of chunks) returned
an arbitrary 3 — almost never including the chunks most relevant to
the current query.

Add an ORDER BY on `vec.cosineDistance(c.embedding, query_vector)`
before the COLLECT so per-entity chunk selection is query-aware.

Refs FalkorDB#258

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@drr00t drr00t marked this pull request as ready for review May 21, 2026 20:02
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (3)
test_mock_import.py (1)

9-15: ⚡ Quick win

Avoid global sys.modules mutation at import time.

Lines 9-15 mutate global module state and execute prints on import, which can leak across tests/tools.

Proposed fix
-sys.modules["docling"] = MagicMock()
-sys.modules["docling.datamodel"] = MagicMock()
-sys.modules["docling.datamodel.document"] = mock_datamodel
-
-from docling.datamodel.document import DocItemLabel
-print("DocItemLabel is:", DocItemLabel)
-print("DocItemLabel.LIST_ITEM is:", getattr(DocItemLabel, "LIST_ITEM", None))
+def main():
+    modules = {
+        "docling": MagicMock(),
+        "docling.datamodel": MagicMock(),
+        "docling.datamodel.document": mock_datamodel,
+    }
+    with patch.dict("sys.modules", modules):
+        from docling.datamodel.document import DocItemLabel
+        print("DocItemLabel is:", DocItemLabel)
+        print("DocItemLabel.LIST_ITEM is:", getattr(DocItemLabel, "LIST_ITEM", None))
+
+if __name__ == "__main__":
+    main()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test_mock_import.py` around lines 9 - 15, The test mutates global sys.modules
and does prints at import time (sys.modules["docling"],
sys.modules["docling.datamodel"], sys.modules["docling.datamodel.document"],
mock_datamodel) and then imports DocItemLabel, which can leak state; change this
to perform the module patching inside the test or a fixture using a temporary
monkeypatch (e.g., monkeypatch.setitem(sys.modules, "docling", MagicMock()) and
monkeypatch.setitem(... "docling.datamodel.document", mock_datamodel)) or use
importlib and context-local injection before importing, remove the top-level
print statements, and reference DocItemLabel only within the test body so
mock_datamodel and DocItemLabel are created and torn down per-test instead of at
import time.
graphrag_sdk/test_monkeypatch.py (1)

19-19: ⚡ Quick win

Guard script execution behind if __name__ == "__main__":.

Line 19 causes side effects on import. In a test_*.py file this is easy to trigger unintentionally.

Proposed fix
-asyncio.run(main())
+if __name__ == "__main__":
+    asyncio.run(main())
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@graphrag_sdk/test_monkeypatch.py` at line 19, Current top-level call
asyncio.run(main()) in test_monkeypatch.py causes the module to execute on
import; wrap that invocation in a standard entry-point guard by adding if
__name__ == "__main__": and moving asyncio.run(main()) inside it so async def
main() only runs when the file is executed directly, not when imported by pytest
or other modules.
graphrag_sdk/test_monkeypatch3.py (1)

16-19: ⚡ Quick win

Don’t swallow all exceptions in the failure-path check.

Catching broad Exception and pass hides regressions unrelated to the mocked import failure.

Proposed fix
-        try:
-            await asyncio.to_thread(load_sync)
-        except Exception as e:
-            pass # catch the mocked exception
+        with pytest.raises(ImportError, match="module not found"):
+            await asyncio.to_thread(load_sync)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@graphrag_sdk/test_monkeypatch3.py` around lines 16 - 19, The test currently
swallows all exceptions around the await asyncio.to_thread(load_sync) call;
change this to only accept the specific expected failure (e.g.,
ModuleNotFoundError or the mocked exception class) or use pytest.raises to
assert the failure. Replace the broad "except Exception as e: pass" with either
"except ModuleNotFoundError: pass" (or the exact mocked exception type used in
the test) or wrap the await in "with pytest.raises(ExpectedException): await
asyncio.to_thread(load_sync)" so unexpected exceptions are not hidden; ensure
you reference the load_sync call and asyncio.to_thread in the updated test.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@graphrag_sdk/test_debug.py`:
- Around line 27-28: The debug harness currently calls
loader._load_sync("dummy_path") which fails file-existence checks; replace the
dummy string with a real temporary file path (create a temp file via
tempfile.NamedTemporaryFile or tempfile.mkstemp, write minimal content that will
exercise the converter mapping logic, flush/close it) and pass that file's path
to loader._load_sync; ensure the temp file is cleaned up after the test and keep
the existing print(result.elements) to observe outputs.

In `@graphrag_sdk/test_monkeypatch2.py`:
- Around line 4-6: The function load_sync is declared async but is passed
directly to asyncio.to_thread at its call sites (asyncio.to_thread(load_sync)),
which causes to_thread to execute a callable that returns a coroutine instead of
running the import logic; change load_sync from async def load_sync() to a
synchronous def load_sync() that performs the import and returns "success" (so
the body runs inside asyncio.to_thread), or alternatively update the call sites
to await load_sync() (or wrap with asyncio.to_thread(lambda:
asyncio.run(load_sync())) if you must keep it async).

In `@test_import_mock.py`:
- Around line 6-10: The custom import hook _import currently calls __import__
directly for non-"fake" modules which will re-enter the patched
builtins.__import__ and cause infinite recursion; fix by saving the original
import function before patching (e.g., orig_import = builtins.__import__) and
have _import delegate to orig_import(name, *args, **kwargs) for non-"fake"
cases, then install _import via patch("builtins.__import__",
side_effect=_import) so normal imports use the saved original implementation.

In `@test_patch_dict.py`:
- Around line 3-5: Convert the top-level patch.dict snippet into a real pytest
test function (e.g., def test_patch_dict_imports_mocked_module():) that uses
unittest.mock.patch.dict to temporarily inject {"docling": MagicMock(),
"docling.document_converter": MagicMock()}, then imports (or
importlib.import_module) "docling.document_converter" and asserts that the
imported object is the MagicMock from sys.modules and that inside the context
sys.modules["docling.document_converter"] is the mock; after exiting the
patch.dict context assert sys.modules has been restored to its pre-test state
(original module present or key removed) to ensure cleanup.

In `@test_sys_modules.py`:
- Around line 3-6: The module-level mutations of sys.modules
(sys.modules['docling'] and sys.modules['docling.document_converter']) should be
moved into a test-scoped patch.dict to avoid cross-test pollution; wrap the
import of docling.document_converter inside a with patch.dict("sys.modules",
{"docling": MagicMock(), "docling.document_converter": MagicMock()}): block (or
use the patch.dict decorator) so the mocked entries are only present for the
duration of the test, and ensure the import statement (import
docling.document_converter) happens inside that patched context.

---

Nitpick comments:
In `@graphrag_sdk/test_monkeypatch.py`:
- Line 19: Current top-level call asyncio.run(main()) in test_monkeypatch.py
causes the module to execute on import; wrap that invocation in a standard
entry-point guard by adding if __name__ == "__main__": and moving
asyncio.run(main()) inside it so async def main() only runs when the file is
executed directly, not when imported by pytest or other modules.

In `@graphrag_sdk/test_monkeypatch3.py`:
- Around line 16-19: The test currently swallows all exceptions around the await
asyncio.to_thread(load_sync) call; change this to only accept the specific
expected failure (e.g., ModuleNotFoundError or the mocked exception class) or
use pytest.raises to assert the failure. Replace the broad "except Exception as
e: pass" with either "except ModuleNotFoundError: pass" (or the exact mocked
exception type used in the test) or wrap the await in "with
pytest.raises(ExpectedException): await asyncio.to_thread(load_sync)" so
unexpected exceptions are not hidden; ensure you reference the load_sync call
and asyncio.to_thread in the updated test.

In `@test_mock_import.py`:
- Around line 9-15: The test mutates global sys.modules and does prints at
import time (sys.modules["docling"], sys.modules["docling.datamodel"],
sys.modules["docling.datamodel.document"], mock_datamodel) and then imports
DocItemLabel, which can leak state; change this to perform the module patching
inside the test or a fixture using a temporary monkeypatch (e.g.,
monkeypatch.setitem(sys.modules, "docling", MagicMock()) and
monkeypatch.setitem(... "docling.datamodel.document", mock_datamodel)) or use
importlib and context-local injection before importing, remove the top-level
print statements, and reference DocItemLabel only within the test body so
mock_datamodel and DocItemLabel are created and torn down per-test instead of at
import time.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: af7af7df-af2f-4640-a6c3-c5b56215e9ad

📥 Commits

Reviewing files that changed from the base of the PR and between 531d29c and a5ecc4e.

📒 Files selected for processing (11)
  • graphrag_sdk/dummy_path
  • graphrag_sdk/pyproject.toml
  • graphrag_sdk/test_debug.py
  • graphrag_sdk/test_monkeypatch.py
  • graphrag_sdk/test_monkeypatch2.py
  • graphrag_sdk/test_monkeypatch3.py
  • graphrag_sdk/tests/test_docling_loaders.py
  • test_import_mock.py
  • test_mock_import.py
  • test_patch_dict.py
  • test_sys_modules.py

Comment thread graphrag_sdk/test_debug.py Outdated
Comment thread graphrag_sdk/test_monkeypatch2.py Outdated
Comment thread test_import_mock.py Outdated
Comment thread test_patch_dict.py Outdated
Comment thread test_sys_modules.py Outdated
Copy link
Copy Markdown
Collaborator

@galshubeli galshubeli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Withdrawn — will repost after internal review.)

Copy link
Copy Markdown
Collaborator

@galshubeli galshubeli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @drr00t — first, thank you for tackling this. Issue #241 has been open for a while and you've done the hard part: introducing docling, wiring up the dispatcher, adding tests, and getting the integration working end-to-end. The direction is right and the foundation is solid.

After looking at this more carefully, the main thing I'd like to discuss is the architecture. I think there's a meaningfully simpler shape here.

🎯 Headline change: one universal loader, not five

Docling's DocumentConverter already auto-detects format from the source — one converter object handles PDF, DOCX, XLSX, PPTX, HTML, CSV, Markdown, URLs, and more. The PR currently wraps that single object in five empty subclasses that differ only by an extension_name string, and adds five new branches to _default_loader_for. Leaning into docling's unified API collapses both.

Before (this PR):

# 5 loader files, ~60 LOC of boilerplate
class CsvLoader(DoclingBaseLoader):  extension_name = "csv"
class DocxLoader(DoclingBaseLoader): extension_name = "docx"
class HtmlLoader(DoclingBaseLoader): extension_name = "html"
class PptxLoader(DoclingBaseLoader): extension_name = "pptx"
class XlsxLoader(DoclingBaseLoader): extension_name = "xlsx"

# _default_loader_for, 5 new branches
if lower.endswith(".docx"): return DocxLoader()
if lower.endswith(".xlsx"): return XlsxLoader()
if lower.endswith(".pptx"): return PptxLoader()
if lower.endswith(".html") or lower.endswith(".xhtml"): return HtmlLoader()
if lower.endswith(".csv"):  return CsvLoader()

After:

class DoclingLoader(LoaderStrategy):
    """Universal loader. Delegates to docling for any supported format."""
    def __init__(self, **docling_kwargs): ...
    async def load(self, source: str, ctx: Context) -> DocumentOutput:
        return await asyncio.to_thread(self._load_sync, source)

# Dispatcher collapses to a single line — docling handles everything.
def _default_loader_for(source: str) -> LoaderStrategy:
    return DoclingLoader()

Why this is better:

  • 5 classes → 1, 5 dispatch branches → 1 line.
  • All formats through one path — PDF, DOCX, XLSX, PPTX, HTML, CSV, MD, URL — no extension sniffing, no branching.
  • URLs work for free — docling accepts URLs as source, partially closing the URL line item in #241.
  • Future formats are free — when docling adds AsciiDoc, JATS, USPTO XML, etc., GraphRAG-SDK picks them up automatically.
  • Format-specific kwargs still possible via docling's format_options dict if needed later.
  • Existing loaders stay availableTextLoader, PdfLoader, MarkdownLoader aren't removed; users who want them can pass loader=PdfLoader() explicitly. The change is only in what the default is.

Must-fix

Nine stray scratch files in the diff — looks like leftovers from working out the mock-import setup:

test_import_mock.py             (repo root)
test_mock_import.py             (repo root)
test_patch_dict.py              (repo root)
test_sys_modules.py             (repo root)
graphrag_sdk/test_debug.py
graphrag_sdk/test_monkeypatch.py
graphrag_sdk/test_monkeypatch2.py
graphrag_sdk/test_monkeypatch3.py
graphrag_sdk/dummy_path          (0-byte empty file)

The working version made it into tests/test_docling_loaders.py; these can all be deleted.

Smaller polish

  • PR title: consider feat(loaders): unify ingestion via Docling (DOCX/XLSX/PPTX/HTML/CSV + URL) (the current title is just the branch name).
  • In docling_base.py, the metadata["label"] fallback label.value if hasattr(label, "value") else label can store a non-serializable enum object — coerce with str(label) so downstream storage doesn't break.
  • With the unified DoclingLoader, one smoke test per format (CSV header row, HTML script/style stripping, XLSX multi-sheet, MD breadcrumbs) would catch most regressions.
  • README/CONTRIBUTING update listing the now-supported formats would help discoverability.

Scope

Issue #241 asks for HTML, Markdown, DOCX, CSV, JSON, URL, S3, Image. With the unified DoclingLoader this PR covers HTML, DOCX, CSV, XLSX, PPTX, Markdown, and URL in one shot — and PDF goes through the same path too. JSON, S3, and Image can land in follow-ups (S3 in particular is its own thing — a source adapter rather than a format).

Happy to chat in the thread if anything here doesn't land. Thanks again for the contribution!

@drr00t
Copy link
Copy Markdown
Contributor Author

drr00t commented May 26, 2026

Hi @galshubeli, thank you very much for the review. I agree with all your comments. When looking into Docling, I noticed that graphrag_sdk didn't have the concept of multi-loader documents. Although we hadn't discussed it, I decided to implement it to see if it makes sense. I tried to minimize the impact by creating a "fake" DoclingBaseLoader to avoid duplicating too much code.

I will make the requested changes.

@drr00t drr00t changed the title 241 issue expand document loader coverage feat(loaders): unify ingestion via Docling (DOCX/XLSX/PPTX/HTML/CSV + URL) May 28, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
graphrag_sdk/src/graphrag_sdk/ingestion/loaders/docling_loader.py (1)

42-45: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix URL ingestion: remove filesystem existence/stat gating in DoclingLoader.

  • _load_sync() blocks HTTP/HTTPS sources via Path(source).exists() and then always builds DocumentInfo using path.stat().st_size / path.suffix; this contradicts Docling’s ability to convert URL strings directly.
💡 Proposed fix
+from urllib.parse import urlparse
+
     def _load_sync(self, source: str) -> DocumentOutput:
-        path = Path(source)
-        if not path.exists():
-            raise LoaderError(f"File not found: {source}")
+        parsed = urlparse(source)
+        is_url = parsed.scheme in {"http", "https"}
+        path = Path(source)
+        if not is_url and not path.exists():
+            raise LoaderError(f"File not found: {source}")
@@
-        return DocumentOutput(
+        metadata = {
+            "loader": "docling",
+            "suffix": Path(parsed.path).suffix if is_url else path.suffix,
+        }
+        if not is_url:
+            metadata["size_bytes"] = path.stat().st_size
+
+        return DocumentOutput(
             text=full_text,
             document_info=DocumentInfo(
-                path=str(path),
-                metadata={
-                    "size_bytes": path.stat().st_size,
-                    "loader": "docling",
-                    "suffix": path.suffix,
-                },
+                path=source,
+                metadata=metadata,
             ),
             elements=elements,
         )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@graphrag_sdk/src/graphrag_sdk/ingestion/loaders/docling_loader.py` around
lines 42 - 45, DoclingLoader._load_sync currently forces a filesystem check
using Path(source).exists() and then uses path.stat().st_size and path.suffix,
which blocks HTTP/HTTPS URL inputs; change _load_sync in class DoclingLoader to
detect URL sources (e.g. via urllib.parse.urlparse or startswith http/https),
skip the Path.exists() gating for those, and avoid calling path.stat() for URLs
— instead construct DocumentInfo for URL inputs using size=None (or an
appropriate header-derived content-length if you choose to fetch headers) and
derive suffix from the URL path portion (or leave empty) rather than
path.suffix; keep the filesystem checks and stat usage only for local file
paths.
🧹 Nitpick comments (3)
graphrag_sdk/examples/09_docling_advanced_loader.py (1)

21-21: ⚡ Quick win

Use the public loader import path in the example.

On Line 21, importing from the concrete module couples the example to internal layout. Prefer the package export for a more stable public API example.

Proposed import change
-from graphrag_sdk.ingestion.loaders.docling_loader import DoclingLoader
+from graphrag_sdk.ingestion.loaders import DoclingLoader
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@graphrag_sdk/examples/09_docling_advanced_loader.py` at line 21, Replace the
concrete internal import of DoclingLoader with the package's public loader
export: instead of importing from graphrag_sdk.ingestion.loaders.docling_loader,
import DoclingLoader from the public loaders export (e.g., from
graphrag_sdk.loaders import DoclingLoader) so the example uses the stable public
API and avoids coupling to internal module layout.
graphrag_sdk/tests/test_docling_loaders.py (2)

37-54: ⚡ Quick win

Use stricter module stubs instead of permissive MagicMock modules.

The module mocks here can accept unexpected attribute access, which can let import/API drift pass silently. Using explicit types.ModuleType stubs (with only required attributes) makes these tests fail fast on contract breaks.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@graphrag_sdk/tests/test_docling_loaders.py` around lines 37 - 54, Replace
permissive MagicMock module stubs with strict types.ModuleType instances so
unexpected attribute access fails fast: create ModuleType objects for
mock_docling, mock_datamodel_pkg, mock_datamodel and mock_converter_mod, set
only the required attributes (e.g., mock_datamodel.DocItemLabel = LabelEnum,
mock_converter_mod.DocumentConverter = <callable/class stub>, and __path__ lists
on packages), then populate the modules dict with these ModuleType stubs instead
of MagicMocks; adjust any tests that rely on MagicMock features to use explicit
minimal stubs for DocumentConverter and other required attributes.

14-20: ⚡ Quick win

Make MockLabel string output deterministic for metadata assertions.

str(LabelEnum.FOOTNOTE) currently depends on default object repr (includes memory address), so the assertion is less meaningful and not representative of real label serialization.

Suggested change
 class MockLabel:
     def __init__(self, val): self.value = val
+    def __str__(self):
+        return self.value
     def __eq__(self, other):
         if isinstance(other, type) and hasattr(other, "SECTION_HEADER"):
             return False # Not a direct match
         return isinstance(other, MockLabel) and self.value == other.value
@@
-        assert elements[2].metadata["label"] == str(LabelEnum.FOOTNOTE)
+        assert elements[2].metadata["label"] == LabelEnum.FOOTNOTE.value

Also applies to: 120-120

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@graphrag_sdk/tests/test_docling_loaders.py` around lines 14 - 20, The
MockLabel class produces non-deterministic string output; add a deterministic
string representation by implementing __str__ (and optionally __repr__) on
MockLabel to return the underlying label text (e.g., return self.value) so
metadata assertions comparing str(MockLabel(...)) are stable; update the
MockLabel definition (class MockLabel, its __init__ and __eq__) to include a
__str__ method that returns self.value (and __repr__ mirroring it) and apply the
same change for the second occurrence around the other test at the indicated
spot.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@graphrag_sdk/examples/09_docling_advanced_loader.py`:
- Line 14: The example's run instruction is not repo-root-safe; replace the bare
filename invocation with a module-style command so it works from the repo
root—update the example line that currently says "python
09_docling_advanced_loader.py" to use "python -m
graphrag_sdk.examples.09_docling_advanced_loader" (referencing the example
script 09_docling_advanced_loader.py) so users can run it regardless of current
working directory.

In `@graphrag_sdk/tests/test_docling_loaders.py`:
- Around line 257-283: The test test_html_script_style_stripping is asserting
stripping behavior while using mock_items that already contains clean text, so
the script/style assertions are tautological; fix by either (A) converting this
into a pass-through unit test: rename to test_html_pass_through, remove the
"script"/"style" negative assertions, and assert that mock_converter.convert was
called and result.elements contain "Hello" (references: mock_items,
mock_converter.convert, MockDoclingLoader, loader.load), or (B) make it a real
stripping integration test by removing the mocks and invoking the real
DocumentConverter on the file content that includes <script> and <style>, then
assert elements[0].content == "Hello". Ensure the test reflects which mode you
choose rather than asserting stripping on mocked clean output.

---

Outside diff comments:
In `@graphrag_sdk/src/graphrag_sdk/ingestion/loaders/docling_loader.py`:
- Around line 42-45: DoclingLoader._load_sync currently forces a filesystem
check using Path(source).exists() and then uses path.stat().st_size and
path.suffix, which blocks HTTP/HTTPS URL inputs; change _load_sync in class
DoclingLoader to detect URL sources (e.g. via urllib.parse.urlparse or
startswith http/https), skip the Path.exists() gating for those, and avoid
calling path.stat() for URLs — instead construct DocumentInfo for URL inputs
using size=None (or an appropriate header-derived content-length if you choose
to fetch headers) and derive suffix from the URL path portion (or leave empty)
rather than path.suffix; keep the filesystem checks and stat usage only for
local file paths.

---

Nitpick comments:
In `@graphrag_sdk/examples/09_docling_advanced_loader.py`:
- Line 21: Replace the concrete internal import of DoclingLoader with the
package's public loader export: instead of importing from
graphrag_sdk.ingestion.loaders.docling_loader, import DoclingLoader from the
public loaders export (e.g., from graphrag_sdk.loaders import DoclingLoader) so
the example uses the stable public API and avoids coupling to internal module
layout.

In `@graphrag_sdk/tests/test_docling_loaders.py`:
- Around line 37-54: Replace permissive MagicMock module stubs with strict
types.ModuleType instances so unexpected attribute access fails fast: create
ModuleType objects for mock_docling, mock_datamodel_pkg, mock_datamodel and
mock_converter_mod, set only the required attributes (e.g.,
mock_datamodel.DocItemLabel = LabelEnum, mock_converter_mod.DocumentConverter =
<callable/class stub>, and __path__ lists on packages), then populate the
modules dict with these ModuleType stubs instead of MagicMocks; adjust any tests
that rely on MagicMock features to use explicit minimal stubs for
DocumentConverter and other required attributes.
- Around line 14-20: The MockLabel class produces non-deterministic string
output; add a deterministic string representation by implementing __str__ (and
optionally __repr__) on MockLabel to return the underlying label text (e.g.,
return self.value) so metadata assertions comparing str(MockLabel(...)) are
stable; update the MockLabel definition (class MockLabel, its __init__ and
__eq__) to include a __str__ method that returns self.value (and __repr__
mirroring it) and apply the same change for the second occurrence around the
other test at the indicated spot.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 333da765-d0bf-49d8-9d1d-95fbf6651374

📥 Commits

Reviewing files that changed from the base of the PR and between a5ecc4e and a451138.

⛔ Files ignored due to path filters (1)
  • graphrag_sdk/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (9)
  • docs/ingestion.md
  • docs/strategies.md
  • graphrag_sdk/README.md
  • graphrag_sdk/examples/09_docling_advanced_loader.py
  • graphrag_sdk/pyproject.toml
  • graphrag_sdk/src/graphrag_sdk/api/main.py
  • graphrag_sdk/src/graphrag_sdk/ingestion/loaders/__init__.py
  • graphrag_sdk/src/graphrag_sdk/ingestion/loaders/docling_loader.py
  • graphrag_sdk/tests/test_docling_loaders.py
✅ Files skipped from review due to trivial changes (3)
  • graphrag_sdk/README.md
  • docs/strategies.md
  • docs/ingestion.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • graphrag_sdk/pyproject.toml

Comment thread graphrag_sdk/examples/09_docling_advanced_loader.py Outdated
Comment thread graphrag_sdk/tests/test_docling_loaders.py Outdated
@drr00t
Copy link
Copy Markdown
Contributor Author

drr00t commented May 28, 2026

@galshubeli, it's good to go.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Expand document loader coverage (HTML, Markdown, DOCX, CSV, JSON, URL, S3, Image)

2 participants