2 changes: 2 additions & 0 deletions docs/ingestion.md
@@ -55,6 +55,8 @@ Steps 1-7 run **sequentially** (each depends on the previous). Steps 8-9 run **i

**How:** The `LoaderStrategy` ABC handles this. The SDK auto-detects the loader based on file extension:
- `.pdf` files use `PdfLoader`
- `.docx`, `.xlsx`, `.pptx`, `.html`, `.csv`, and URLs use `DoclingLoader` (if `graphrag-sdk[docling]` is installed)
- `.md` files use `MarkdownLoader`
- Everything else uses `TextLoader`
- If you pass `text=` directly, the loader step is skipped entirely
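
The rules above can be sketched as a small pure function. This is a minimal illustration, not the SDK's own dispatch code: `pick_loader_name`, its `docling_installed` flag, the URL prefix test, and the fallback to `TextLoader` when docling is missing are all assumptions made for the example.

```python
from typing import Optional

# Extensions that the list above routes to DoclingLoader.
DOCLING_EXTENSIONS = (".docx", ".xlsx", ".pptx", ".html", ".csv")


def pick_loader_name(
    source: str,
    text: Optional[str] = None,
    docling_installed: bool = True,
) -> Optional[str]:
    """Return the loader name the documented rules would select.

    Hypothetical helper: returns None when `text=` is supplied, because
    the loader step is skipped in that case.
    """
    if text is not None:
        return None
    lower = source.lower()
    # Assumed heuristic: anything starting with http(s):// counts as a URL.
    is_url = lower.startswith(("http://", "https://"))
    if docling_installed and (is_url or lower.endswith(DOCLING_EXTENSIONS)):
        return "DoclingLoader"
    if lower.endswith(".pdf"):
        return "PdfLoader"
    if lower.endswith(".md"):
        return "MarkdownLoader"
    # Assumed fallback, also used when docling is not installed.
    return "TextLoader"


for name in ("report.pdf", "notes.md", "deck.pptx", "log.txt"):
    print(name, "->", pick_loader_name(name))
```

Keeping the decision in a function that returns a name (rather than an instance) makes the routing easy to unit-test without installing any optional extras.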

20 changes: 19 additions & 1 deletion docs/strategies.md
@@ -64,12 +64,30 @@ loader = MarkdownLoader()
**Design Note: Markup Preservation**
For complex elements like tables, lists, and code blocks, `MarkdownLoader` intentionally outputs the raw markdown source (including pipes `|`, list dashes `-`, and code fences) rather than stripping the syntax. While this introduces minor syntax "noise", it preserves critical structural cues (such as spatial column alignment and nested indentation) that the LLM requires during the Extraction phase to accurately parse relational data.

### Built-in: DoclingLoader

A universal loader utilizing the `docling` library to parse rich document formats (PDF, DOCX, XLSX, PPTX, HTML, CSV, Markdown, URLs). Requires `pip install graphrag-sdk[docling]`.

```python
from graphrag_sdk.ingestion.loaders.docling_loader import DoclingLoader

# You can pass arbitrary docling DocumentConverter arguments
loader = DoclingLoader(
    allowed_formats=["docx", "pptx"],
    format_options={...}
)
```

**Design Note: Hierarchical Breadcrumbs**
`DoclingLoader` preserves structural context by generating "breadcrumbs": for every element it records the chain of headings that enclose it (e.g., a paragraph that sits under `H1 -> H2`). These breadcrumbs are attached to each extracted `DocumentElement`, so the semantic location of the text within the original document is preserved during chunking and extraction.
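
The breadcrumb bookkeeping can be sketched with a stack of currently open headings. This is a simplified illustration under stated assumptions, not the loader's actual code: `breadcrumbs_for` and the `(kind, level, text)` tuple layout are hypothetical stand-ins for docling's item stream.

```python
from typing import List, Tuple


def breadcrumbs_for(
    items: List[Tuple[str, int, str]],
) -> List[Tuple[str, List[str]]]:
    """Pair each item's text with the headings that enclose it.

    `items` is a flat, document-ordered list of (kind, level, text);
    `kind` is "header" for headings and anything else for body content.
    """
    stack: List[Tuple[int, str]] = []  # open headings, outermost first
    out: List[Tuple[str, List[str]]] = []
    for kind, level, text in items:
        if kind == "header":
            # A new heading closes every open heading at the same or a deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, text))
        out.append((text, [title for _, title in stack]))
    return out


doc = [
    ("header", 1, "Advanced Analysis"),
    ("header", 2, "Section 1"),
    ("paragraph", 3, "This is a paragraph inside section 1."),
    ("header", 2, "Section 2"),
    ("paragraph", 3, "This is another paragraph."),
]
for text, crumbs in breadcrumbs_for(doc):
    print(" -> ".join(crumbs), "|", text)
```

In this sketch a heading's own title is part of its breadcrumb trail, so a chunk built from a heading alone still carries its position in the document.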

### Default Behavior

If no loader is specified in `ingest()`:
- `.pdf` files use `PdfLoader`
- `.docx`, `.xlsx`, `.pptx`, `.html`, `.csv` files and URLs use `DoclingLoader` (if installed)
- `.md` files use `MarkdownLoader`
- Everything else uses `TextLoader`
- If `text=` is passed directly, the loader is skipped

### Writing Your Own
10 changes: 9 additions & 1 deletion graphrag_sdk/README.md
@@ -36,6 +36,7 @@ asyncio.run(main())
pip install graphrag-sdk[litellm] # OpenAI, Azure, Anthropic, 100+ models
pip install graphrag-sdk[openrouter] # OpenRouter models
pip install graphrag-sdk[pdf] # PDF ingestion
pip install graphrag-sdk[docling] # DOCX, XLSX, PPTX, HTML, CSV, URLs
pip install graphrag-sdk[all] # Everything
```

@@ -59,7 +60,14 @@ async def main():
        llm=LiteLLM(model="openai/gpt-4o"),
        embedder=LiteLLMEmbedder(model="openai/text-embedding-3-small"),
    ) as rag:
        # Supported formats: PDF, DOCX, XLSX, PPTX, HTML, CSV, Markdown, URLs, TXT
        await rag.ingest("report.pdf")  # PDF
        await rag.ingest("document.docx")  # Word
        await rag.ingest("spreadsheet.xlsx")  # Excel
        await rag.ingest("presentation.pptx")  # PowerPoint
        await rag.ingest("page.html")  # HTML
        await rag.ingest("data.csv")  # CSV
        await rag.ingest("notes.md")  # Markdown
        await rag.ingest("source_id", text="Alice works at Acme.")  # Raw text
        await rag.finalize()  # Dedup + index

@@ -135,7 +143,7 @@ Every algorithmic concern is a swappable strategy behind an abstract base class:

| Concern | ABC | Built-in Options | Default |
|---------|-----|-----------------|---------|
| **Loading** | `LoaderStrategy` | `TextLoader`, `PdfLoader` | Auto-detect by extension |
| **Loading** | `LoaderStrategy` | `TextLoader`, `PdfLoader`, `DoclingLoader` (universal: DOCX/XLSX/PPTX/HTML/CSV/URL) | Auto-detect by extension |
| **Chunking** | `ChunkingStrategy` | `FixedSizeChunking`, `SentenceTokenCapChunking`, `ContextualChunking`, `CallableChunking` | `FixedSizeChunking` |
| **Extraction** | `ExtractionStrategy` | `GraphExtraction` (GLiNER2 + LLM) | `GraphExtraction` |
| **Resolution** | `ResolutionStrategy` | `ExactMatchResolution`, `DescriptionMergeResolution`, `SemanticResolution`, `LLMVerifiedResolution` | `ExactMatch` |
80 changes: 80 additions & 0 deletions graphrag_sdk/examples/09_docling_advanced_loader.py
@@ -0,0 +1,80 @@
"""
GraphRAG SDK -- Advanced Docling Loader
=======================================
This example demonstrates how to explicitly instantiate and configure the
DoclingLoader to parse rich document formats, passing advanced options to
the underlying docling DocumentConverter.

Prerequisites:
    docker run -p 6379:6379 falkordb/falkordb
    pip install graphrag-sdk[litellm,docling]

Usage:
    export OPENAI_API_KEY="sk-..."
    python graphrag_sdk/examples/09_docling_advanced_loader.py
"""

import asyncio
from pathlib import Path

from graphrag_sdk import ConnectionConfig, GraphRAG, LiteLLM, LiteLLMEmbedder
from graphrag_sdk.ingestion.loaders.docling_loader import DoclingLoader


def get_providers():
    llm = LiteLLM(model="openai/gpt-4.1")
    embedder = LiteLLMEmbedder(model="openai/text-embedding-3-small")
    return llm, embedder


async def main():
    llm, embedder = get_providers()

    # Create a dummy markdown file for demonstration purposes
    # Note: In a real scenario, this would be a PDF, DOCX, XLSX, etc.
    dummy_file = Path("sample_docling_input.md")
    dummy_file.write_text(
        "# Advanced Analysis\n\n"
        "## Section 1\n"
        "This is a paragraph inside section 1.\n\n"
        "## Section 2\n"
        "This is another paragraph."
    )

    try:
        # Instantiate DoclingLoader with custom configuration
        # Any **kwargs are passed directly to docling's DocumentConverter
        advanced_loader = DoclingLoader(
            allowed_formats=["md", "docx", "pdf"],
            # Example docling kwargs (pipeline_options, etc. could be added here)
        )

        async with GraphRAG(
            connection=ConnectionConfig(host="localhost", graph_name="docling_demo"),
            llm=llm,
            embedder=embedder,
        ) as rag:
            print("Ingesting document with advanced DoclingLoader...")
            # Explicitly pass the loader to override auto-dispatch
            result = await rag.ingest(
                str(dummy_file),
                loader=advanced_loader,
            )
            print(f"Ingested: {result.nodes_created} nodes, {result.relationships_created} edges")

            print("\nFinalizing graph...")
            await rag.finalize()

            question = "What sections are in the advanced analysis?"
            print(f"\nQ: {question}")
            answer = await rag.completion(question)
            print(f"A: {answer.answer}")

    finally:
        # Cleanup dummy file
        if dummy_file.exists():
            dummy_file.unlink()


if __name__ == "__main__":
    asyncio.run(main())
8 changes: 8 additions & 0 deletions graphrag_sdk/pyproject.toml
@@ -56,13 +56,15 @@ litellm = ["litellm>=1.83.0,<2.0"]
openrouter = ["openai>=1.0,<3.0"]
fastcoref = ["fastcoref>=2.0"]
spacy = ["spacy>=3.0"]
docling = ["docling>=2.91.0"]
all = [
    "openai>=1.0,<3.0",
    "anthropic>=0.20,<1.0",
    "cohere>=5.0",
    "sentence-transformers>=2.0",
    "pypdf>=6.9.2",
    "litellm>=1.83.0,<2.0",
    "docling>=2.91.0",
]
dev = [
    "pytest>=8.0",
@@ -97,6 +99,12 @@ plugins = ["pydantic.mypy"]
[tool.pytest.ini_options]
asyncio_mode = "auto"
testpaths = ["tests"]
filterwarnings = [
    "ignore:builtin type SwigPyObject has no __module__ attribute:DeprecationWarning",
    "ignore:.*hf_xet.download_files\\(\\) is deprecated.*:DeprecationWarning",
    "ignore:.*`torch.jit.script` is deprecated.*:DeprecationWarning",
    "ignore:.*The `resume_download` argument is deprecated.*:UserWarning",
]
markers = [
    "integration: tests that require a live FalkorDB instance",
]
5 changes: 4 additions & 1 deletion graphrag_sdk/src/graphrag_sdk/api/main.py
@@ -46,6 +46,7 @@
)
from graphrag_sdk.ingestion.extraction_strategies.graph_extraction import GraphExtraction
from graphrag_sdk.ingestion.loaders.base import LoaderStrategy
from graphrag_sdk.ingestion.loaders.docling_loader import DoclingLoader
from graphrag_sdk.ingestion.loaders.markdown_loader import MarkdownLoader
from graphrag_sdk.ingestion.loaders.pdf_loader import PdfLoader
from graphrag_sdk.ingestion.loaders.text_loader import TextLoader
@@ -829,7 +830,9 @@ def _default_loader_for(source: str) -> LoaderStrategy:
             return PdfLoader()
         if lower.endswith(".md"):
             return MarkdownLoader()
-        return TextLoader()
+        if lower.endswith(".txt"):
+            return TextLoader()
+        return DoclingLoader()
 
     # ── Incremental Updates ─────────────────────────────────────
     #
9 changes: 8 additions & 1 deletion graphrag_sdk/src/graphrag_sdk/ingestion/loaders/__init__.py
@@ -1,8 +1,15 @@
# GraphRAG SDK — Ingestion: Loaders

from graphrag_sdk.ingestion.loaders.base import LoaderStrategy
from graphrag_sdk.ingestion.loaders.docling_loader import DoclingLoader
from graphrag_sdk.ingestion.loaders.markdown_loader import MarkdownLoader
from graphrag_sdk.ingestion.loaders.pdf_loader import PdfLoader
from graphrag_sdk.ingestion.loaders.text_loader import TextLoader

-__all__ = ["LoaderStrategy", "MarkdownLoader", "PdfLoader", "TextLoader"]
+__all__ = [
+    "LoaderStrategy",
+    "DoclingLoader",
+    "MarkdownLoader",
+    "PdfLoader",
+    "TextLoader",
+]
150 changes: 150 additions & 0 deletions graphrag_sdk/src/graphrag_sdk/ingestion/loaders/docling_loader.py
@@ -0,0 +1,150 @@
# GraphRAG SDK — Ingestion: Docling Universal Loader
# Pattern: Universal Strategy — one loader for all docling-supported formats

from __future__ import annotations

import asyncio
import logging
from pathlib import Path
from typing import Any

from graphrag_sdk.core.context import Context
from graphrag_sdk.core.exceptions import LoaderError
from graphrag_sdk.core.models import DocumentElement, DocumentInfo, DocumentOutput
from graphrag_sdk.ingestion.loaders.base import LoaderStrategy

logger = logging.getLogger(__name__)


class DoclingLoader(LoaderStrategy):
    """Universal loader using docling for advanced document parsing.

    Handles PDF, DOCX, XLSX, PPTX, HTML, CSV, Markdown, URLs, and more.
    Format auto-detection is handled by docling's DocumentConverter.
    """

    def __init__(self, **docling_kwargs: Any) -> None:
        """Initialize the loader.

        Args:
            **docling_kwargs: Arbitrary keyword arguments passed to
                `docling.document_converter.DocumentConverter` (e.g.,
                pipeline_options).
        """
        self.docling_kwargs = docling_kwargs

    async def load(self, source: str, ctx: Context) -> DocumentOutput:
        ctx.log(f"Loading file via docling: {source}")
        # Run the synchronous docling extraction in a worker thread so the
        # event loop is not blocked
        return await asyncio.to_thread(self._load_sync, source)

    def _load_sync(self, source: str) -> DocumentOutput:
        # URLs are handed to docling as-is; only local paths are checked on disk
        is_url = source.startswith(("http://", "https://"))
        path = Path(source)
        if not is_url and not path.exists():
            raise LoaderError(f"File not found: {source}")

        try:
            from docling.datamodel.document import DocItemLabel
            from docling.document_converter import DocumentConverter
        except ImportError as exc:
            raise LoaderError(
                "This format requires 'docling'. Install with:\n  pip install graphrag-sdk[docling]"
            ) from exc

        try:
            converter = DocumentConverter(**self.docling_kwargs)
            result = converter.convert(source)
            doc = result.document
        except Exception as exc:
            raise LoaderError(f"Docling failed to process {source}: {exc}") from exc

        elements: list[DocumentElement] = []
        # Stack of (level, heading text) for the headings currently in scope
        current_breadcrumbs: list[tuple[int, str]] = []
        full_text_blocks: list[str] = []

        # Map docling hierarchy to GraphRAG DocumentElements
        for item, level in doc.iterate_items():
            content = getattr(item, "text", "")
            if not content and hasattr(item, "export_to_markdown"):
                # Items without plain text (e.g. tables) fall back to markdown
                try:
                    content = item.export_to_markdown()
                except Exception:
                    logger.debug(
                        "export_to_markdown failed for %s", type(item).__name__, exc_info=True
                    )

            if not content:
                continue

            full_text_blocks.append(content)
            label = getattr(item, "label", None)

            if label in (DocItemLabel.TITLE, DocItemLabel.SECTION_HEADER):
                # Update breadcrumbs: drop headings at the same or a deeper level
                while current_breadcrumbs and current_breadcrumbs[-1][0] >= level:
                    current_breadcrumbs.pop()
                current_breadcrumbs.append((level, content))

                elements.append(
                    DocumentElement(
                        type="header",
                        level=level,
                        content=content,
                        breadcrumbs=[b[1] for b in current_breadcrumbs],
                    )
                )
            elif label in (DocItemLabel.PARAGRAPH, DocItemLabel.TEXT):
                elements.append(
                    DocumentElement(
                        type="paragraph",
                        content=content,
                        breadcrumbs=[b[1] for b in current_breadcrumbs],
                    )
                )
            elif label == DocItemLabel.LIST_ITEM:
                elements.append(
                    DocumentElement(
                        type="list",
                        content=content,
                        breadcrumbs=[b[1] for b in current_breadcrumbs],
                    )
                )
            elif label == DocItemLabel.TABLE:
                elements.append(
                    DocumentElement(
                        type="table",
                        content=content,
                        breadcrumbs=[b[1] for b in current_breadcrumbs],
                    )
                )
            elif label == DocItemLabel.CODE:
                elements.append(
                    DocumentElement(
                        type="code",
                        content=content,
                        breadcrumbs=[b[1] for b in current_breadcrumbs],
                    )
                )
            else:
                # Default for CAPTION, FOOTNOTE, etc.
                elements.append(
                    DocumentElement(
                        type="paragraph",
                        content=content,
                        breadcrumbs=[b[1] for b in current_breadcrumbs],
                        metadata={"label": str(label)},
                    )
                )

        full_text = "\n\n".join(full_text_blocks)

        metadata: dict[str, Any] = {"loader": "docling", "suffix": path.suffix}
        if not is_url:
            # File size is only available for sources that exist on local disk
            metadata["size_bytes"] = path.stat().st_size

        return DocumentOutput(
            text=full_text,
            document_info=DocumentInfo(
                # Path() would collapse the "//" in a URL, so keep URLs verbatim
                path=source if is_url else str(path),
                metadata=metadata,
            ),
            elements=elements,
        )