FalkorDB · drr00t · May 17, 2026 · May 14, 2026 · May 19, 2026 · May 19, 2026
diff --git a/docs/ingestion.md b/docs/ingestion.md
@@ -55,6 +55,8 @@ Steps 1-7 run **sequentially** (each depends on the previous). Steps 8-9 run **i
 
 **How:** The `LoaderStrategy` ABC handles this. The SDK auto-detects the loader based on file extension:
 - `.pdf` files use `PdfLoader`
+- `.docx`, `.xlsx`, `.pptx`, `.html`, `.csv`, and URLs use `DoclingLoader` (if `graphrag-sdk[docling]` is installed)
+- `.md` files use `MarkdownLoader`
 - Everything else uses `TextLoader`
 - If you pass `text=` directly, the loader step is skipped entirely
 

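Reviewer sketch: the dispatch rules listed in `docs/ingestion.md` can be expressed as a small pure function. This is a minimal illustration, not the SDK's `_default_loader_for`. The helper name `pick_loader_name`, the two constants, and the choice to recognise URLs by an `http://` or `https://` prefix are assumptions made for the example.

```python
# Hypothetical helper: mirrors the documented rules, not the SDK's own code.
DOCLING_SUFFIXES = (".docx", ".xlsx", ".pptx", ".html", ".csv")
URL_PREFIXES = ("http://", "https://")  # assumption: how a URL is recognised


def pick_loader_name(source: str) -> str:
    """Return the name of the loader the documented rules would select."""
    lower = source.lower()  # extension matching is case-insensitive
    if lower.endswith(".pdf"):
        return "PdfLoader"
    if lower.startswith(URL_PREFIXES) or lower.endswith(DOCLING_SUFFIXES):
        return "DoclingLoader"
    if lower.endswith(".md"):
        return "MarkdownLoader"
    # Anything unrecognised falls back to plain-text loading
    return "TextLoader"


for src in ("report.PDF", "slides.pptx", "https://example.com/page", "notes.md", "log.txt"):
    print(f"{src} -> {pick_loader_name(src)}")
```

Returning names instead of loader instances keeps the sketch free of SDK imports, so the rule order can be checked in isolation.
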
diff --git a/docs/strategies.md b/docs/strategies.md
@@ -64,12 +64,30 @@ loader = MarkdownLoader()
 **Design Note: Markup Preservation**
 For complex elements like tables, lists, and code blocks, `MarkdownLoader` intentionally outputs the raw markdown source (including pipes `|`, list dashes `-`, and code fences) rather than stripping the syntax. While this introduces minor syntax "noise", it preserves critical structural cues (such as spatial column alignment and nested indentation) that the LLM requires during the Extraction phase to accurately parse relational data.
 
+### Built-in: DoclingLoader
+
+A universal loader that uses the `docling` library to parse rich document formats (PDF, DOCX, XLSX, PPTX, HTML, CSV, Markdown, URLs). It requires the `docling` extra: `pip install graphrag-sdk[docling]`.
+
+```python
+from graphrag_sdk.ingestion.loaders.docling_loader import DoclingLoader
+
+# You can pass arbitrary docling DocumentConverter arguments
+loader = DoclingLoader(
+    allowed_formats=["docx", "pptx"],
+    format_options={...}
+)
+```
+
+**Design Note: Hierarchical Breadcrumbs**
+`DoclingLoader` records the heading path above each element as "breadcrumbs" (e.g., a paragraph under `Advanced Analysis -> Section 1`). Only titles and section headers extend the path; paragraphs, list items, tables, and code blocks inherit the path of the headings that enclose them. The breadcrumbs are stored on each `DocumentElement`, so the location of the text within the original document remains available during chunking and extraction.
+
 ### Default Behavior
 
 If no loader is specified in `ingest()`:
 - `.pdf` files use `PdfLoader`
+- `.docx`, `.xlsx`, `.pptx`, `.html`, `.csv` files and URLs use `DoclingLoader` (if installed)
 - `.md` files use `MarkdownLoader`
--   Everything else uses `TextLoader`
+- Everything else uses `TextLoader`
 - If `text=` is passed directly, the loader is skipped
 
 ### Writing Your Own

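Reviewer sketch: the "Hierarchical Breadcrumbs" design note in `docs/strategies.md` describes a heading stack. The snippet below is a minimal, dependency-free illustration of that idea. The `(kind, level, text)` tuples and the function name `attach_breadcrumbs` are assumptions for the example; in the loader itself the level comes from docling's item iteration.

```python
from typing import Iterable, List, Tuple

Item = Tuple[str, int, str]  # (kind, level, text); hypothetical input shape


def attach_breadcrumbs(items: Iterable[Item]) -> List[Tuple[str, List[str]]]:
    """Pair each item's text with the heading path that encloses it."""
    stack: List[Tuple[int, str]] = []  # open headings as (level, title)
    out: List[Tuple[str, List[str]]] = []
    for kind, level, text in items:
        if kind == "header":
            # A new heading closes every open heading at the same or a deeper level
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, text))
        # Headers include themselves in the path; other items inherit it
        out.append((text, [title for _, title in stack]))
    return out


doc = [
    ("header", 1, "Advanced Analysis"),
    ("header", 2, "Section 1"),
    ("paragraph", 3, "This is a paragraph inside section 1."),
    ("header", 2, "Section 2"),
    ("paragraph", 3, "This is another paragraph."),
]
for text, crumbs in attach_breadcrumbs(doc):
    print(" > ".join(crumbs), "|", text)
```

Popping on `>=` is what makes `Section 2` replace `Section 1` in the path instead of nesting beneath it.
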
diff --git a/graphrag_sdk/README.md b/graphrag_sdk/README.md
@@ -36,6 +36,7 @@ asyncio.run(main())
 pip install graphrag-sdk[litellm]       # OpenAI, Azure, Anthropic, 100+ models
 pip install graphrag-sdk[openrouter]    # OpenRouter models
 pip install graphrag-sdk[pdf]           # PDF ingestion
+pip install graphrag-sdk[docling]       # DOCX, XLSX, PPTX, HTML, CSV, URLs
 pip install graphrag-sdk[all]           # Everything
 ```
 
@@ -59,7 +60,14 @@ async def main():
         llm=LiteLLM(model="openai/gpt-4o"),
         embedder=LiteLLMEmbedder(model="openai/text-embedding-3-small"),
     ) as rag:
+        # Supported formats: PDF, DOCX, XLSX, PPTX, HTML, CSV, Markdown, URLs, TXT
         await rag.ingest("report.pdf")                              # PDF
+        await rag.ingest("document.docx")                           # Word
+        await rag.ingest("spreadsheet.xlsx")                        # Excel
+        await rag.ingest("presentation.pptx")                       # PowerPoint
+        await rag.ingest("page.html")                               # HTML
+        await rag.ingest("data.csv")                                # CSV
+        await rag.ingest("notes.md")                                # Markdown
         await rag.ingest("source_id", text="Alice works at Acme.")  # Raw text
         await rag.finalize()                                         # Dedup + index
 
@@ -135,7 +143,7 @@ Every algorithmic concern is a swappable strategy behind an abstract base class:
 
 | Concern | ABC | Built-in Options | Default |
 |---------|-----|-----------------|---------|
-| **Loading** | `LoaderStrategy` | `TextLoader`, `PdfLoader` | Auto-detect by extension |
+| **Loading** | `LoaderStrategy` | `TextLoader`, `PdfLoader`, `MarkdownLoader`, `DoclingLoader` (universal: DOCX/XLSX/PPTX/HTML/CSV/URL) | Auto-detect by extension |
 | **Chunking** | `ChunkingStrategy` | `FixedSizeChunking`, `SentenceTokenCapChunking`, `ContextualChunking`, `CallableChunking` | `FixedSizeChunking` |
 | **Extraction** | `ExtractionStrategy` | `GraphExtraction` (GLiNER2 + LLM) | `GraphExtraction` |
 | **Resolution** | `ResolutionStrategy` | `ExactMatchResolution`, `DescriptionMergeResolution`, `SemanticResolution`, `LLMVerifiedResolution` | `ExactMatch` |

diff --git a/graphrag_sdk/examples/09_docling_advanced_loader.py b/graphrag_sdk/examples/09_docling_advanced_loader.py
@@ -0,0 +1,80 @@
+"""
+GraphRAG SDK -- Advanced Docling Loader
+=======================================
+This example demonstrates how to explicitly instantiate and configure the
+DoclingLoader to parse rich document formats, passing advanced options to
+the underlying docling DocumentConverter.
+
+Prerequisites:
+    docker run -p 6379:6379 falkordb/falkordb
+    pip install graphrag-sdk[litellm,docling]
+
+Usage:
+    export OPENAI_API_KEY="sk-..."
+    python graphrag_sdk/examples/09_docling_advanced_loader.py
+"""
+
+import asyncio
+from pathlib import Path
+
+from graphrag_sdk import ConnectionConfig, GraphRAG, LiteLLM, LiteLLMEmbedder
+from graphrag_sdk.ingestion.loaders.docling_loader import DoclingLoader
+
+
+def get_providers():
+    llm = LiteLLM(model="openai/gpt-4.1")
+    embedder = LiteLLMEmbedder(model="openai/text-embedding-3-small")
+    return llm, embedder
+
+
+async def main():
+    llm, embedder = get_providers()
+
+    # Create a dummy markdown file for demonstration purposes
+    # Note: In a real scenario, this would be a PDF, DOCX, XLSX, etc.
+    dummy_file = Path("sample_docling_input.md")
+    dummy_file.write_text(
+        "# Advanced Analysis\n\n"
+        "## Section 1\n"
+        "This is a paragraph inside section 1.\n\n"
+        "## Section 2\n"
+        "This is another paragraph."
+    )
+
+    try:
+        # Instantiate DoclingLoader with custom configuration
+        # Any **kwargs are passed directly to docling's DocumentConverter
+        advanced_loader = DoclingLoader(
+            allowed_formats=["md", "docx", "pdf"],
+            # Example docling kwargs (pipeline_options, etc. could be added here)
+        )
+
+        async with GraphRAG(
+            connection=ConnectionConfig(host="localhost", graph_name="docling_demo"),
+            llm=llm,
+            embedder=embedder,
+        ) as rag:
+            print("Ingesting document with advanced DoclingLoader...")
+            # Explicitly pass the loader to override auto-dispatch
+            result = await rag.ingest(
+                str(dummy_file),
+                loader=advanced_loader,
+            )
+            print(f"Ingested: {result.nodes_created} nodes, {result.relationships_created} edges")
+
+            print("\nFinalizing graph...")
+            await rag.finalize()
+
+            question = "What sections are in the advanced analysis?"
+            print(f"\nQ: {question}")
+            answer = await rag.completion(question)
+            print(f"A: {answer.answer}")
+
+    finally:
+        # Cleanup dummy file
+        if dummy_file.exists():
+            dummy_file.unlink()
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
diff --git a/graphrag_sdk/pyproject.toml b/graphrag_sdk/pyproject.toml
@@ -56,13 +56,15 @@ litellm = ["litellm>=1.83.0,<2.0"]
 openrouter = ["openai>=1.0,<3.0"]
 fastcoref = ["fastcoref>=2.0"]
 spacy = ["spacy>=3.0"]
+docling = ["docling>=2.91.0"]
 all = [
     "openai>=1.0,<3.0",
     "anthropic>=0.20,<1.0",
     "cohere>=5.0",
     "sentence-transformers>=2.0",
     "pypdf>=6.9.2",
     "litellm>=1.83.0,<2.0",
+    "docling>=2.91.0",
 ]
 dev = [
     "pytest>=8.0",
@@ -97,6 +99,12 @@ plugins = ["pydantic.mypy"]
 [tool.pytest.ini_options]
 asyncio_mode = "auto"
 testpaths = ["tests"]
+filterwarnings = [
+    "ignore:builtin type SwigPyObject has no __module__ attribute:DeprecationWarning",
+    "ignore:.*hf_xet.download_files\\(\\) is deprecated.*:DeprecationWarning",
+    "ignore:.*`torch.jit.script` is deprecated.*:DeprecationWarning",
+    "ignore:.*The `resume_download` argument is deprecated.*:UserWarning",
+]
 markers = [
     "integration: tests that require a live FalkorDB instance",
 ]
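
Reviewer sketch: each `filterwarnings` entry above has the form `action:message:category`, where the message part is treated as a regular expression. The snippet below reproduces one entry with the standard `warnings` module to show which warnings it suppresses. The two warning texts are made up for the demonstration, and the claim that pytest applies the entry the same way as `warnings.filterwarnings` is an assumption worth checking against the pytest documentation.

```python
import warnings

with warnings.catch_warnings(record=True) as caught:
    # Record every warning unless a more specific filter says otherwise
    warnings.simplefilter("always")
    # Same action, message pattern, and category as one pyproject entry;
    # filters added later take precedence over earlier ones
    warnings.filterwarnings(
        "ignore",
        message=r".*`torch.jit.script` is deprecated.*",
        category=DeprecationWarning,
    )
    warnings.warn("`torch.jit.script` is deprecated, use something else", DeprecationWarning)
    warnings.warn("some other deprecation", DeprecationWarning)

messages = [str(w.message) for w in caught]
print(messages)
```

Only the warning that does not match the pattern is recorded, so unrelated deprecations still surface in test output.
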
diff --git a/graphrag_sdk/src/graphrag_sdk/api/main.py b/graphrag_sdk/src/graphrag_sdk/api/main.py
@@ -46,6 +46,7 @@
 )
 from graphrag_sdk.ingestion.extraction_strategies.graph_extraction import GraphExtraction
 from graphrag_sdk.ingestion.loaders.base import LoaderStrategy
+from graphrag_sdk.ingestion.loaders.docling_loader import DoclingLoader
 from graphrag_sdk.ingestion.loaders.markdown_loader import MarkdownLoader
 from graphrag_sdk.ingestion.loaders.pdf_loader import PdfLoader
 from graphrag_sdk.ingestion.loaders.text_loader import TextLoader
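@@ -829,7 +830,13 @@ def _default_loader_for(source: str) -> LoaderStrategy:
             return PdfLoader()
         if lower.endswith(".md"):
             return MarkdownLoader()
+        # Route only docling-specific inputs to DoclingLoader; unknown
+        # extensions keep falling back to TextLoader, as documented.
+        if lower.startswith(("http://", "https://")) or lower.endswith(
+            (".docx", ".xlsx", ".pptx", ".html", ".csv")
+        ):
+            return DoclingLoader()
         return TextLoader()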
 
     # ── Incremental Updates ─────────────────────────────────────
     #

diff --git a/graphrag_sdk/src/graphrag_sdk/ingestion/loaders/__init__.py b/graphrag_sdk/src/graphrag_sdk/ingestion/loaders/__init__.py
@@ -1,8 +1,15 @@
 # GraphRAG SDK — Ingestion: Loaders
 
 from graphrag_sdk.ingestion.loaders.base import LoaderStrategy
+from graphrag_sdk.ingestion.loaders.docling_loader import DoclingLoader
 from graphrag_sdk.ingestion.loaders.markdown_loader import MarkdownLoader
 from graphrag_sdk.ingestion.loaders.pdf_loader import PdfLoader
 from graphrag_sdk.ingestion.loaders.text_loader import TextLoader
 
-__all__ = ["LoaderStrategy", "MarkdownLoader", "PdfLoader", "TextLoader"]
+__all__ = [
+    "LoaderStrategy",
+    "DoclingLoader",
+    "MarkdownLoader",
+    "PdfLoader",
+    "TextLoader",
+]
diff --git a/graphrag_sdk/src/graphrag_sdk/ingestion/loaders/docling_loader.py b/graphrag_sdk/src/graphrag_sdk/ingestion/loaders/docling_loader.py
@@ -0,0 +1,150 @@
+# GraphRAG SDK — Ingestion: Docling Universal Loader
+# Pattern: Universal Strategy — one loader for all docling-supported formats
+
+from __future__ import annotations
+
+import asyncio
+import logging
+from pathlib import Path
+from typing import Any
+
+from graphrag_sdk.core.context import Context
+from graphrag_sdk.core.exceptions import LoaderError
+from graphrag_sdk.core.models import DocumentElement, DocumentInfo, DocumentOutput
+from graphrag_sdk.ingestion.loaders.base import LoaderStrategy
+
+logger = logging.getLogger(__name__)
+
+
+class DoclingLoader(LoaderStrategy):
+    """Universal loader using docling for advanced document parsing.
+
+    Handles PDF, DOCX, XLSX, PPTX, HTML, CSV, Markdown, URLs, and more.
+    Format auto-detection is handled by docling's DocumentConverter.
+    """
+
+    def __init__(self, **docling_kwargs: Any) -> None:
+        """Initialize the loader.
+
+        Args:
+            **docling_kwargs: Arbitrary keyword arguments passed to
+                `docling.document_converter.DocumentConverter` (e.g.,
+                pipeline_options).
+        """
+        self.docling_kwargs = docling_kwargs
+
+    async def load(self, source: str, ctx: Context) -> DocumentOutput:
+        ctx.log(f"Loading file via docling: {source}")
+        # Run the synchronous docling conversion in a worker thread so the event loop is not blocked
+        return await asyncio.to_thread(self._load_sync, source)
+
+    def _load_sync(self, source: str) -> DocumentOutput:
+        path = Path(source)
+        if not source.lower().startswith(("http://", "https://")) and not path.exists():
+            raise LoaderError(f"File not found: {source}")
+
+        try:
+            from docling.datamodel.document import DocItemLabel
+            from docling.document_converter import DocumentConverter
+        except ImportError as exc:
+            raise LoaderError(
+                "This format requires 'docling'. Install with:\n  pip install graphrag-sdk[docling]"
+            ) from exc
+
+        try:
+            converter = DocumentConverter(**self.docling_kwargs)
+            result = converter.convert(source)
+            doc = result.document
+        except Exception as exc:
+            raise LoaderError(f"Docling failed to process {source}: {exc}") from exc
+
+        elements: list[DocumentElement] = []
+        current_breadcrumbs: list[tuple[int, str]] = []
+        full_text_blocks: list[str] = []
+
+        # Map docling hierarchy to GraphRAG DocumentElements
+        for item, level in doc.iterate_items():
+            content = getattr(item, "text", "")
+            if not content and hasattr(item, "export_to_markdown"):
+                try:
+                    content = item.export_to_markdown()
+                except Exception:
+                    logger.debug("Markdown export failed for %r", item, exc_info=True)
+
+            if not content:
+                continue
+
+            full_text_blocks.append(content)
+            label = getattr(item, "label", None)
+
+            if label in (DocItemLabel.TITLE, DocItemLabel.SECTION_HEADER):
+                # Update breadcrumbs
+                while current_breadcrumbs and current_breadcrumbs[-1][0] >= level:
+                    current_breadcrumbs.pop()
+                current_breadcrumbs.append((level, content))
+
+                elements.append(
+                    DocumentElement(
+                        type="header",
+                        level=level,
+                        content=content,
+                        breadcrumbs=[b[1] for b in current_breadcrumbs],
+                    )
+                )
+            elif label in (DocItemLabel.PARAGRAPH, DocItemLabel.TEXT):
+                elements.append(
+                    DocumentElement(
+                        type="paragraph",
+                        content=content,
+                        breadcrumbs=[b[1] for b in current_breadcrumbs],
+                    )
+                )
+            elif label == DocItemLabel.LIST_ITEM:
+                elements.append(
+                    DocumentElement(
+                        type="list",
+                        content=content,
+                        breadcrumbs=[b[1] for b in current_breadcrumbs],
+                    )
+                )
+            elif label == DocItemLabel.TABLE:
+                elements.append(
+                    DocumentElement(
+                        type="table",
+                        content=content,
+                        breadcrumbs=[b[1] for b in current_breadcrumbs],
+                    )
+                )
+            elif label == DocItemLabel.CODE:
+                elements.append(
+                    DocumentElement(
+                        type="code",
+                        content=content,
+                        breadcrumbs=[b[1] for b in current_breadcrumbs],
+                    )
+                )
+            else:
+                # Default for CAPTION, FOOTNOTE, etc.
+                elements.append(
+                    DocumentElement(
+                        type="paragraph",
+                        content=content,
+                        breadcrumbs=[b[1] for b in current_breadcrumbs],
+                        metadata={"label": str(label)},
+                    )
+                )
+
+        full_text = "\n\n".join(full_text_blocks)
+
+        return DocumentOutput(
+            text=full_text,
+            document_info=DocumentInfo(
+                path=str(path) if path.exists() else source,
+                metadata={
+                    "size_bytes": path.stat().st_size if path.exists() else None,
+                    "loader": "docling",
+                    "suffix": path.suffix,
+                },
+            ),
+            elements=elements,
+        )
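
Reviewer sketch: `DoclingLoader.load` hands the synchronous conversion to `asyncio.to_thread`. The snippet below shows that pattern in isolation, with a hypothetical `parse_blocking` function standing in for the docling call, so the effect on concurrency is easy to see.

```python
import asyncio
import time


def parse_blocking(source: str) -> str:
    """Stand-in for a slow, synchronous parser (hypothetical)."""
    time.sleep(0.05)
    return f"parsed:{source}"


async def load(source: str) -> str:
    # to_thread runs the blocking call in the default thread pool executor
    return await asyncio.to_thread(parse_blocking, source)


async def main() -> list:
    # The two loads overlap because neither one blocks the event loop
    return await asyncio.gather(load("a.docx"), load("b.docx"))


results = asyncio.run(main())
print(results)
```

`asyncio.gather` returns results in the order the awaitables were passed, regardless of which thread finishes first.
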