FalkorDB · drr00t · May 10, 2026 · May 11, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -539,4 +539,4 @@ to this version by default. Legacy v0.x users can pin `graphrag-sdk==0.8.2`.
 ### Fixed
 
 - `hnswlib` import guard in SemanticResolution and LLMVerifiedResolution — raises clear `ImportError` instead of `AttributeError` when hnswlib is not installed.
-- 14 ruff lint errors (import sorting, line length) resolved; CI no longer ignores lint rules.
+- 14 ruff lint errors (import sorting, line length) resolved; CI no longer ignores lint rules.
diff --git a/docs/api-reference.md b/docs/api-reference.md
@@ -8,6 +8,7 @@ Complete reference for all public classes and methods exported by `graphrag_sdk`
 - [Connection](#connection)
 - [Providers](#providers)
 - [Data Models](#data-models)
+- [TokenUsage](#tokenusage)
 - [Schema](#schema)
 - [Ingestion Strategies](#ingestion-strategies)
 - [Ingestion Pipeline](#ingestion-pipeline)
@@ -269,11 +270,11 @@ LLMInterface(model_name: str, model_params: dict | None = None, max_concurrency:
 | Method | Signature | Description |
 |--------|-----------|-------------|
 | `invoke` | `(prompt: str, **kwargs) -> LLMResponse` | Sync text generation (abstract) |
-| `ainvoke` | `(prompt: str, *, max_retries=3, **kwargs) -> LLMResponse` | Async with retry + backoff |
-| `ainvoke_messages` | `(messages: list[ChatMessage], *, max_retries=3, **kwargs) -> LLMResponse` | Multi-turn native messages (see below) |
+| `ainvoke` | `(prompt: str, *, ctx=None, max_retries=3, **kwargs) -> LLMResponse` | Async with retry + backoff; records usage into `ctx` |
+| `ainvoke_messages` | `(messages: list[ChatMessage], *, ctx=None, max_retries=3, **kwargs) -> LLMResponse` | Multi-turn native messages; records usage into `ctx` |
 | `invoke_with_model` | `(prompt: str, response_model: Type[BaseModel], **kwargs) -> BaseModel` | Structured output |
 | `ainvoke_with_model` | `(prompt: str, response_model: Type[BaseModel], *, max_retries=3) -> BaseModel` | Async structured output |
-| `abatch_invoke` | `(prompts: list[str], *, max_concurrency=None, max_retries=3) -> list[LLMBatchItem]` | Concurrent batch |
+| `abatch_invoke` | `(prompts: list[str], *, ctx=None, max_concurrency=None, max_retries=3) -> list[LLMBatchItem]` | Concurrent batch; threads `ctx` to each `ainvoke` |
 
 `ainvoke_messages()` is used by `completion()` when conversation history is provided. The default implementation concatenates messages into a single prompt string and calls `ainvoke()`, so custom providers work without changes. `LiteLLM` and `OpenRouterLLM` override this with native multi-turn implementations.
 
@@ -283,9 +284,9 @@ LLMInterface(model_name: str, model_params: dict | None = None, max_concurrency:
 |--------|-----------|-------------|
 | `model_name` | `@property -> str` | Embedding model identifier (abstract) |
 | `embed_query` | `(text: str, **kwargs) -> list[float]` | Single text embedding (abstract) |
-| `aembed_query` | `(text: str, **kwargs) -> list[float]` | Async single (default: thread pool) |
+| `aembed_query` | `(text: str, *, ctx=None, **kwargs) -> list[float]` | Async single; records `embedding_tokens` into `ctx` |
 | `embed_documents` | `(texts: list[str], **kwargs) -> list[list[float]]` | Batch (default: sequential) |
-| `aembed_documents` | `(texts: list[str], **kwargs) -> list[list[float]]` | Async batch (default: thread pool) |
+| `aembed_documents` | `(texts: list[str], *, ctx=None, **kwargs) -> list[list[float]]` | Async batch; records `embedding_tokens` into `ctx` |
 
 ### LLMBatchItem
 
@@ -411,25 +412,34 @@ class IngestionResult(DataModel):
     relationships_created: int = 0
     chunks_indexed: int = 0
     metadata: dict[str, Any] = {}
+    usage: TokenUsage = TokenUsage()     # Accumulated token counts for this ingest
 ```
 
+See [Token Usage](token-usage.md) for what each counter covers.
+
 ### RagResult
 
 ```python
 class RagResult(DataModel):
     answer: str
     retriever_result: RetrieverResult | None = None      # Populated when return_context=True
     metadata: dict[str, Any] = {}                        # Contains model, num_context_items, strategy
+    usage: TokenUsage = TokenUsage()                     # Accumulated token counts for this completion
 ```
 
+See [Token Usage](token-usage.md) for what each counter covers.
+
 ### RetrieverResult
 
 ```python
 class RetrieverResult(DataModel):
     items: list[RetrieverResultItem] = []
     metadata: dict[str, Any] = {}
+    usage: TokenUsage = TokenUsage()     # Accumulated token counts for this retrieval
 ```
 
+See [Token Usage](token-usage.md) for what each counter covers.
+
 ### RetrieverResultItem
 
 ```python
@@ -519,6 +529,32 @@ Deterministic entity ID from normalized name and optional type. When `entity_typ
 
 ---
 
+## TokenUsage
+
+```python
+from graphrag_sdk import TokenUsage
+
+class TokenUsage(DataModel):
+    prompt_tokens: int = 0       # Total tokens sent to LLM (input)
+    completion_tokens: int = 0   # Total tokens generated by LLM (output)
+    embedding_tokens: int = 0    # Total tokens sent to the embedder
+```
+
+All fields default to `0`. Supports `+` (returns new instance) and `+=` (in-place accumulation) for aggregation across batch results.
+
+```python
+# Aggregate batch ingest usage
+results = await rag.ingest(["a.pdf", "b.pdf"])
+total = sum(
+    (r.usage for r in results if isinstance(r, IngestionResult)),
+    start=TokenUsage(),
+)
+```
+
+See the full guide in [Token Usage](token-usage.md).
+
+---
+
 ## Schema
 
 ```python
@@ -745,16 +781,25 @@ VectorStore(connection, embedder=None, index_name="chunk_embeddings", embedding_
 from graphrag_sdk import Context
 ```
 
-Execution context for logging and budget tracking.
+Execution context threaded through every strategy call for logging, budget tracking, and token usage accumulation.
 
 ```python
-Context(tenant_id: str = "default", latency_budget_ms: float = 60000.0)
+Context(
+    tenant_id: str = "default",
+    latency_budget_ms: float | None = None,
+    metadata: dict[str, Any] = {},
+)
 ```
 
 | Method/Property | Description |
 |----------------|-------------|
-| `ctx.log(message, log_level=logging.INFO)` | Log a message |
-| `ctx.budget_exceeded` | True if elapsed time > latency_budget_ms |
+| `ctx.log(message, log_level=logging.INFO)` | Log a message with tenant/trace prefix |
+| `ctx.budget_exceeded` | True if elapsed time > `latency_budget_ms` |
+| `ctx.remaining_budget_ms` | Remaining budget in ms, or `None` when no `latency_budget_ms` is set |
+| `ctx.elapsed_ms` | Milliseconds since context creation |
+| `ctx.usage` | `TokenUsage` accumulator for this operation |
+| `ctx.record_usage(*, prompt_tokens=0, completion_tokens=0, embedding_tokens=0)` | Add token counts to the accumulator |
+| `ctx.child(**overrides)` | Create a child context with inherited tenant/trace |
 
 ---
 

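The `Context` table in the hunk above lists budget and usage members without showing how they interact. The following is a minimal stand-in sketch, not the SDK implementation: `ContextSketch` and `UsageSketch` are hypothetical names, and two behaviors are assumptions rather than documented facts (a `None` budget is treated as never exceeded, and the remaining budget is clamped at zero).

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field


@dataclass
class UsageSketch:
    """Hypothetical stand-in for TokenUsage: three integer counters."""
    prompt_tokens: int = 0
    completion_tokens: int = 0
    embedding_tokens: int = 0


@dataclass
class ContextSketch:
    """Hypothetical stand-in for Context; mirrors the documented member names only."""
    tenant_id: str = "default"
    latency_budget_ms: float | None = None
    usage: UsageSketch = field(default_factory=UsageSketch)
    _started: float = field(default_factory=time.monotonic, repr=False)

    @property
    def elapsed_ms(self) -> float:
        # Milliseconds since this context was created.
        return (time.monotonic() - self._started) * 1000.0

    @property
    def remaining_budget_ms(self) -> float | None:
        # Assumption: no budget configured means there is nothing to report.
        if self.latency_budget_ms is None:
            return None
        # Assumption: clamp at zero instead of going negative.
        return max(0.0, self.latency_budget_ms - self.elapsed_ms)

    @property
    def budget_exceeded(self) -> bool:
        # Assumption: a context without a budget can never exceed it.
        if self.latency_budget_ms is None:
            return False
        return self.elapsed_ms > self.latency_budget_ms

    def record_usage(
        self,
        *,
        prompt_tokens: int = 0,
        completion_tokens: int = 0,
        embedding_tokens: int = 0,
    ) -> None:
        # Add the given counts to the running accumulator.
        self.usage.prompt_tokens += prompt_tokens
        self.usage.completion_tokens += completion_tokens
        self.usage.embedding_tokens += embedding_tokens


demo = ContextSketch(latency_budget_ms=5_000)
demo.record_usage(prompt_tokens=120, completion_tokens=30)
print(demo.usage, demo.budget_exceeded)
```

Keyword-only arguments on `record_usage` match the documented signature and keep call sites readable when only one counter changes.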
diff --git a/docs/getting-started.md b/docs/getting-started.md
@@ -118,12 +118,17 @@ The default is `256` (matched-Matryoshka dimensions of `text-embedding-3-large`)
 result = await rag.ingest("path/to/document.txt")
 print(f"Created {result.nodes_created} nodes, {result.relationships_created} relationships")
 print(f"Indexed {result.chunks_indexed} chunks")
+# Token costs for this ingest (extraction LLM + chunk embeddings)
+print(f"Prompt tokens: {result.usage.prompt_tokens}")
+print(f"Completion tokens: {result.usage.completion_tokens}")
+print(f"Embedding tokens: {result.usage.embedding_tokens}")
 ```
 
 ### From raw text
 
 ```python
-result = await rag.ingest("acme_doc", text="Acme Corp was founded in 1985 by Jane Doe in Austin, Texas.")
+result = await rag.ingest(text="Acme Corp was founded in 1985 by Jane Doe in Austin, Texas.",
+                          document_id="acme_doc")
 print(f"Created {result.nodes_created} nodes, {result.relationships_created} relationships")
 print(f"Indexed {result.chunks_indexed} chunks")
 ```
@@ -151,6 +156,10 @@ Use `completion()` for the full RAG pipeline — retrieval + answer generation:
 ```python
 result = await rag.completion("Who works at Acme Corp?")
 print(result.answer)
+# See what it cost
+print(f"Tokens used — prompt: {result.usage.prompt_tokens}, "
+      f"completion: {result.usage.completion_tokens}, "
+      f"embedding: {result.usage.embedding_tokens}")
 ```
 
 ### With context inspection
@@ -201,14 +210,16 @@ Supported roles: `"system"`, `"user"`, `"assistant"`. Invalid roles raise `Value
 Use `get_statistics()` to see a summary of what the graph contains:
 
 ```python
-stats = await rag.graph_store.get_statistics()
+stats = await rag.get_statistics()
 print(f"Nodes: {stats['node_count']}, Edges: {stats['edge_count']}")
 ```
 
 You can also run raw Cypher queries against the graph:
 
 ```python
-results = await rag.graph_store.query_raw("MATCH (p:Person)-[:WORKS_AT]->(o:Organization) RETURN p.name, o.name LIMIT 10")
+results = await rag.graph_store.query_raw(
+    "MATCH (p:Person)-[:WORKS_AT]->(o:Organization) RETURN p.name, o.name LIMIT 10"
+)
 for row in results.result_set:
     print(row)
 ```
@@ -221,8 +232,8 @@ After all documents have been ingested, run `finalize()` to deduplicate entities
 
 ```python
 results = await rag.finalize()
-print(f"Deduplicated: {results['entities_deduplicated']}")
-print(f"Embedded: {results['entities_embedded']} entities, {results['relationships_embedded']} relationships")
+print(f"Deduplicated: {results.entities_deduplicated}")
+print(f"Embedded: {results.entities_embedded} entities, {results.relationships_embedded} relationships")
 ```
 
 This step is important for query accuracy. It merges duplicate entities (e.g., "J. Doe" and "Jane Doe") and ensures all entities have vector embeddings for semantic search.
@@ -233,6 +244,7 @@ This step is important for query accuracy. It merges duplicate entities (e.g., "
 
 - [docs/configuration.md](configuration.md) -- Tuning connection settings, chunking parameters, and retrieval options.
 - [docs/strategies.md](strategies.md) -- Custom extraction and resolution strategies.
+- [docs/token-usage.md](token-usage.md) -- Cost tracking, billing dashboards, and observability patterns.
 - [docs/benchmark.md](benchmark.md) -- Reproducing benchmark results on the GraphRAG-Bench Novel corpus (20 novels, 2,010 questions).
 
 ---

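The getting-started snippets above print raw token counts under the comment "See what it cost". One way to turn those counters into a currency figure is sketched below. `PricePerMillion` and `estimate_cost_usd` are hypothetical helpers, not part of the SDK, and the prices in the example are placeholders rather than any provider's real rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PricePerMillion:
    """Hypothetical price table: USD per one million tokens."""
    prompt: float
    completion: float
    embedding: float


def estimate_cost_usd(
    prompt_tokens: int,
    completion_tokens: int,
    embedding_tokens: int,
    prices: PricePerMillion,
) -> float:
    """Weight each counter by its per-million price and sum the result."""
    weighted = (
        prompt_tokens * prices.prompt
        + completion_tokens * prices.completion
        + embedding_tokens * prices.embedding
    )
    return weighted / 1_000_000


# Placeholder prices; substitute the rates from your own provider contract.
placeholder_prices = PricePerMillion(prompt=2.0, completion=8.0, embedding=0.5)
print(estimate_cost_usd(1_200, 300, 4_000, placeholder_prices))
```

With a real result, the three integer arguments would come from `result.usage.prompt_tokens`, `result.usage.completion_tokens`, and `result.usage.embedding_tokens`.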
diff --git a/docs/index.md b/docs/index.md
@@ -12,6 +12,7 @@ GraphRAG SDK builds knowledge graphs from documents and answers questions over t
 - **Fully modular** -- swap chunking, extraction, resolution, retrieval, and reranking strategies
 - **Production-ready** -- async-first, connection pooling, circuit breaker, batched writes
 - **Full provenance** -- every answer traces back to its source document and chunk
+- **Built-in cost tracking** -- every response carries `result.usage` with `prompt_tokens`, `completion_tokens`, and `embedding_tokens` counters
 
 ## Quick Start
 
@@ -43,5 +44,6 @@ asyncio.run(main())
 - [Getting Started](getting-started.md) -- Full tutorial from install to first query
 - [Architecture](architecture.md) -- How the 9-step pipeline works
 - [Strategies](strategies.md) -- All swappable strategy ABCs and built-in options
+- [Token Usage](token-usage.md) -- Cost tracking and observability
 - [Benchmark](benchmark.md) -- Methodology and reproduction instructions
 - [API Reference](api-reference.md) -- Full API documentation
diff --git a/docs/ingestion.md b/docs/ingestion.md
@@ -277,6 +277,7 @@ stats = await rag.finalize()
 - **Fastest step:** Quality filter, prune, and resolve — all in-memory, sub-second.
 - **Parallelism:** Steps 8-9 run in parallel. Step 1 NER uses a semaphore (default 12 concurrent calls).
 - **Batch size:** The benchmark uses 1500-character chunks. 20 documents (~4.7 MB total) take ~47 minutes to ingest.
+- **Cost tracking:** Check `result.usage.prompt_tokens`, `result.usage.completion_tokens`, and `result.usage.embedding_tokens` after each `ingest()` call. See [Token Usage](token-usage.md) for aggregation patterns across batch ingestion.
 
 ---
 

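The cost-tracking bullet above points to aggregation across batch ingestion. The sketch below illustrates the `+` (new instance) and `+=` (in place) semantics described in the API reference, using stand-in classes so it runs without the SDK. `UsageSketch` and `IngestResultSketch` are hypothetical names, and the idea that a failed document shows up in the batch as an exception object is an assumption inferred from the `isinstance` filter in the reference example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class UsageSketch:
    """Hypothetical stand-in for TokenUsage."""
    prompt_tokens: int = 0
    completion_tokens: int = 0
    embedding_tokens: int = 0

    def __add__(self, other: UsageSketch) -> UsageSketch:
        # "+" returns a new instance and leaves both operands untouched.
        return UsageSketch(
            self.prompt_tokens + other.prompt_tokens,
            self.completion_tokens + other.completion_tokens,
            self.embedding_tokens + other.embedding_tokens,
        )

    def __iadd__(self, other: UsageSketch) -> UsageSketch:
        # "+=" mutates the left operand in place.
        self.prompt_tokens += other.prompt_tokens
        self.completion_tokens += other.completion_tokens
        self.embedding_tokens += other.embedding_tokens
        return self


@dataclass
class IngestResultSketch:
    """Hypothetical stand-in for IngestionResult, reduced to what this sketch needs."""
    document_id: str
    usage: UsageSketch = field(default_factory=UsageSketch)


def total_usage(results: list[object]) -> UsageSketch:
    """Sum usage over successful results, skipping anything else in the batch."""
    return sum(
        (r.usage for r in results if isinstance(r, IngestResultSketch)),
        start=UsageSketch(),
    )


batch = [
    IngestResultSketch("a.pdf", UsageSketch(100, 20, 300)),
    RuntimeError("b.pdf failed"),  # assumption: failures appear as exceptions
    IngestResultSketch("c.pdf", UsageSketch(50, 10, 200)),
]
print(total_usage(batch))
```

Passing `start=UsageSketch()` matters because `sum()` otherwise starts from the integer `0`, which the stand-in cannot be added to.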
diff --git a/docs/providers.md b/docs/providers.md
@@ -137,11 +137,13 @@ class MyLLM(LLMInterface):
 
 | Method | Default Behavior | Override When |
 |--------|-----------------|--------------|
-| `ainvoke(prompt, max_retries=3)` | Runs `invoke()` in a thread pool with retry | You have a native async client |
-| `ainvoke_messages(messages, max_retries=3)` | Concatenates messages into a single prompt and calls `ainvoke()` | You have a native multi-turn chat API |
+| `ainvoke(prompt, *, ctx=None, max_retries=3)` | Runs `invoke()` in a thread pool with retry; records usage if `ctx` is provided | You have a native async client |
+| `ainvoke_messages(messages, *, ctx=None, max_retries=3)` | Concatenates messages into a single prompt and calls `ainvoke()` | You have a native multi-turn chat API |
 | `invoke_with_model(prompt, response_model)` | Calls `invoke()` and parses JSON into Pydantic model | Your provider has native structured output |
 | `ainvoke_with_model(prompt, response_model)` | Calls `ainvoke()` and parses JSON | Same, async version |
-| `abatch_invoke(prompts, max_concurrency)` | Concurrent `ainvoke()` with semaphore | You have a native batch API |
+| `abatch_invoke(prompts, *, ctx=None, max_concurrency=None)` | Concurrent `ainvoke()` with semaphore; threads `ctx` to each call | You have a native batch API |
+
+> **Token usage:** pass the current `ctx` to record prompt/completion tokens automatically. See [Token Usage](token-usage.md).
 
 `ainvoke_messages()` is called by `completion()` when conversation history is provided. Override it to pass messages natively to your LLM's chat API for proper multi-turn handling:
 
@@ -195,9 +197,11 @@ The `model_name` property is used by the graph config node to validate that the
 
 | Method | Default Behavior | Override When |
 |--------|-----------------|--------------|
-| `aembed_query(text)` | Runs `embed_query()` in thread pool | You have async embedding |
+| `aembed_query(text, *, ctx=None)` | Runs `embed_query()` in thread pool; records embedding tokens if `ctx` is provided | You have async embedding |
 | `embed_documents(texts)` | Sequential `embed_query()` per text | You can batch embeddings |
-| `aembed_documents(texts)` | Runs `embed_documents()` in thread pool | You have async batch |
+| `aembed_documents(texts, *, ctx=None)` | Runs `embed_documents()` in thread pool; records embedding tokens if `ctx` is provided | You have async batch |
+
+> **Token usage:** pass the current `ctx` to record embedding tokens automatically. See [Token Usage](token-usage.md).
 
 ### Batch Embedding
 

diff --git a/docs/retrieval.md b/docs/retrieval.md
@@ -312,11 +312,32 @@ reranker = CosineReranker(embedder=embedder, top_k=10)
 result = await rag.completion("Your question", reranker=reranker)
 ```
 
+### Token Usage
+
+Both `retrieve()` and `completion()` attach token counters to the result:
+
+```python
+# Retrieval only
+result = await rag.retrieve("What did Professor Harmon discover?")
+print(result.usage.embedding_tokens)   # query embedding tokens
+print(result.usage.prompt_tokens)      # keyword-extraction LLM tokens
+
+# Full completion
+result = await rag.completion("What did Professor Harmon discover?")
+print(result.usage.prompt_tokens)      # retrieval + answer generation LLM input
+print(result.usage.completion_tokens)  # answer tokens
+print(result.usage.embedding_tokens)   # query embedding tokens
+```
+
+See [Token Usage](token-usage.md) for cost estimation helpers and observability patterns.
+
 ---
 
+
 ## File Reference
 
 | File | What it contains |
+
 |------|-----------------|
 | [`multi_path.py`](https://github.com/FalkorDB/GraphRAG-SDK/blob/main/graphrag_sdk/src/graphrag_sdk/retrieval/strategies/multi_path.py) | Main orchestrator — coordinates all 9 steps |
 | [`entity_discovery.py`](https://github.com/FalkorDB/GraphRAG-SDK/blob/main/graphrag_sdk/src/graphrag_sdk/retrieval/strategies/entity_discovery.py) | RELATES vector search + 2-path entity discovery |