Skip to content

feat(ontology): mutating evolution API β€” rename / drop / retype / backfill#268

Open
galshubeli wants to merge 7 commits into
mainfrom
feat/ontology-management
Open

feat(ontology): mutating evolution API β€” rename / drop / retype / backfill#268
galshubeli wants to merge 7 commits into
mainfrom
feat/ontology-management

Conversation

@galshubeli
Copy link
Copy Markdown
Collaborator

@galshubeli galshubeli commented May 27, 2026

Summary

Follow-on to #256. PR #256 introduced a persistent ontology graph and a strict-by-default OntologyStore.register() that admits new labels but refuses any modification to an existing one. This PR adds the public seam to actually evolve the ontology after first ingest β€” adding attributes, renaming labels, dropping properties, retyping fields, and re-scanning stored chunks to fill new structure into existing data.

Every change lands as one of three groups:

  • Group 1 β€” pure ontology, no data, no LLM. set_entity_description, set_relation_description, set_attribute_description, add_entity, add_attribute, add_relation_pattern.
  • Group 2 β€” mechanical data migration via Cypher. rename_entity, rename_attribute, rename_relation, drop_attribute, drop_entity, drop_relation, drop_relation_pattern, retype_attribute.
  • Group 3 β€” LLM-driven backfill over stored chunks. backfill_attribute, backfill_entity, backfill_relation_pattern, backfill_attribute_semantic.

18 new methods on GraphRAG, all async. Each is a thin orchestrator: validate input β†’ run data migration first (so a crash leaves the data graph ahead of the ontology and re-running is idempotent) β†’ update the ontology graph β†’ refresh _global_ontology and propagate to retrieval.

register() stays strict and protective for initial setup. The new methods deliberately bypass it by using lower-level OntologyStore primitives (added in this PR) plus direct data-graph Cypher.

Backfill idempotency

Each :Chunk gains an extracted_ops: list[string] property. op_id is deterministic from the operation signature (e.g. backfill_attribute:Person:age), so:

  • Re-running the same backfill is a no-op once complete.
  • Different backfills don't collide on the same chunk.
  • A crash mid-run leaves partial markers; the resume skips chunks already done.

Cost surface

BackfillResult.estimated_cost_usd is Optional[float] and currently None β€” LLMInterface has no native usage hook. A future provider-specific wrapper can populate it without breaking the API.

What's in this PR

Source (~1,500 LOC):

  • storage/ontology_store.py β€” 11 evolution primitives bypassing register() strictness
  • storage/graph_store.py β€” 8 data-migration primitives + 5 backfill scope/marker helpers
  • ingestion/backfill.py (new) β€” BackfillExecutor, BackfillResult, ChunkContext, BackfillMergeStats
  • ingestion/extraction_strategies/graph_extraction.py β€” 4 focused backfill prompts
  • api/main.py β€” 18 new public methods + private _refresh_global_ontology / _resolve_owner_kind helpers
  • __init__.py β€” export BackfillResult, BackfillExecutor, ChunkContext, BackfillMergeStats

Tests (~870 LOC):

  • tests/test_ontology_evolve.py (new) β€” 33 unit tests (Groups 1+2) using _FakeGraph + mock_graph_store
  • tests/test_backfill.py (new) β€” 14 unit tests (Group 3) using MockLLM + mock_graph_store
  • tests/test_ontology_evolve_integration.py (new, gated by RUN_INTEGRATION=1) β€” Groups 1+2 against real FalkorDB
  • tests/test_backfill_integration.py (new, gated by RUN_INTEGRATION=1 + OPENAI_API_KEY) β€” Group 3 end-to-end with a live LLM

Example: examples/09_ontology_evolution.py β€” companion to 08_ontology_lifecycle.py (strict additive evolution); demonstrates the mutating tier end-to-end including idempotency.

Known scope cuts (v1)

  • Relation-attribute data-graph migrations (rename_attribute / drop_attribute / retype_attribute on relation owners) update the ontology graph only; the data-graph edge property is left untouched and a warning is logged. Edge-property mutations are a follow-up.
  • Mechanical retype handles INTEGER / FLOAT / STRING / BOOLEAN. Use backfill_attribute_semantic for DATE and LIST (LLM-assisted coercion).
  • register() semantics from PR feat(schema): schema-driven attributes end-to-end with persistent ontologyΒ #256 are preserved β€” OntologyModificationNotAllowedError still fires from the ingest path. Only the new methods bypass it.

Test plan

  • pytest graphrag_sdk/tests/test_ontology_evolve.py graphrag_sdk/tests/test_backfill.py β€” 47 tests pass
  • Full unit suite β€” 949 passed, 24 skipped (no regressions); PR feat(schema): schema-driven attributes end-to-end with persistent ontologyΒ #256 tests still green
  • Package-level imports for BackfillResult / BackfillExecutor / ChunkContext resolve cleanly
  • examples/09_ontology_evolution.py compiles
  • RUN_INTEGRATION=1 pytest tests/test_ontology_evolve_integration.py β€” needs reviewer with FalkorDB running locally
  • RUN_INTEGRATION=1 OPENAI_API_KEY=... pytest tests/test_backfill_integration.py β€” needs reviewer with both
  • Run examples/09_ontology_evolution.py against a real FalkorDB + OpenAI to validate the focused prompts on a non-trivial corpus

πŸ€– Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Comprehensive ontology evolution: add/rename/drop entities, relations, and attributes; atomic attribute backfills and schema migrations; semantic retype/coercion.
    • LLM-driven backfills: attribute filling, entity discovery, relation-pattern inference with progress, failure tracking and cost/elapsed reporting.
  • API
    • Backfill primitives and execution results exposed for direct use.
  • Examples
    • New async end-to-end ontology evolution example.
  • Tests
    • New unit and gated integration tests covering backfill and evolution flows.

Review Change Stack

…kfill

PR #256 introduced a persistent ontology graph and a strict-by-default
register() that rejects any modification to an existing label. This PR
adds the public seam to actually *evolve* the ontology after first
ingest, in three tiers:

- Group 1 (pure ontology): set_*_description, add_entity, add_attribute,
  add_relation_pattern
- Group 2 (mechanical data migration via Cypher): rename_entity,
  rename_attribute, rename_relation, drop_attribute, drop_entity,
  drop_relation, drop_relation_pattern, retype_attribute
- Group 3 (LLM-driven backfill over stored chunks): backfill_attribute,
  backfill_entity, backfill_relation_pattern, backfill_attribute_semantic

Backfill is concurrency-bounded and idempotent β€” each :Chunk carries an
extracted_ops list, so re-running the same backfill skips already-done
chunks. op_id is deterministic from the operation signature.

register() stays strict and protective for initial setup; the new
evolution methods deliberately bypass it via lower-level OntologyStore
primitives plus direct data-graph Cypher.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▢️ Resume reviews
  • πŸ” Trigger review
πŸ“ Walkthrough

Walkthrough

Adds ontology-evolution features: storage- and ontology-store mutation primitives, a concurrency-bounded BackfillExecutor with prompt/parse/merge callbacks, GraphRAG facade methods (ontology-only, mechanical migrations, LLM-driven backfills), package exports, unit/integration tests, and an example walkthrough.

Changes

Ontology Evolution

Layer / File(s) Summary
Example walkthrough
graphrag_sdk/examples/09_ontology_evolution.py
Async example demonstrating staged ontology evolution, LLM backfills, idempotency check, final ontology inspection, and cleanup.
Package-level public exports
graphrag_sdk/src/graphrag_sdk/__init__.py
Expose BackfillExecutor, BackfillMergeStats, BackfillResult, ChunkContext at package level.
Backfill LLM prompt templates
graphrag_sdk/src/graphrag_sdk/ingestion/extraction_strategies/graph_extraction.py
New JSON-instruction prompt constants for attribute, entity, relation-pattern, and semantic coercion backfills.
OntologyStore typing & allowlist
graphrag_sdk/src/graphrag_sdk/storage/ontology_store.py
Adds Literal aliases OwnerKind/DescriptionKind and _VALID_PROPERTY_TYPES allowlist.
OntologyStore evolution primitives
graphrag_sdk/src/graphrag_sdk/storage/ontology_store.py
Async APIs to add/rename/retype entity and relation properties, add relation-pattern nodes (with property copying), rename labels/properties, set descriptions with kind/owner validation, drop properties/labels/patterns, and retype properties with allowlist validation.
GraphRAG facade infra & helpers
graphrag_sdk/src/graphrag_sdk/api/main.py
Add JSON parsing helper (_strip_and_load_json), imports for backfill types/prompts, _refresh_global_ontology, and _resolve_owner_kind to manage cached ontology and owner-kind resolution.
BackfillExecutor and supporting types
graphrag_sdk/src/graphrag_sdk/ingestion/backfill.py
New dataclasses (ChunkContext, BackfillMergeStats, BackfillResult, EvolutionResult), OntologyEvolutionError, callback type aliases, and BackfillExecutor (concurrency-bounded per-chunk LLM run, parse, merge, mark, and stats aggregation).
GraphStore evolution primitives
graphrag_sdk/src/graphrag_sdk/storage/graph_store.py
Idempotent Cypher helpers: rename labels, rename/drop node properties, delete nodes/relations, coerce node property types (two-pass), mark/list chunks for backfill, list node values for coercion, set node property by id.
GraphRAG facade Group 1 (ontology-only)
graphrag_sdk/src/graphrag_sdk/api/main.py
APIs to set/clear descriptions and declare entities/attributes/relation patterns with validation and atomic attribute declaration/backfill orchestration.
GraphRAG facade Group 2 (mechanical data migration)
graphrag_sdk/src/graphrag_sdk/api/main.py
APIs to rename/drop entities/relations/attributes and coordinate GraphStore updates before OntologyStore writes and refresh cached ontology.
GraphRAG facade Group 3 (LLM-driven backfills)
graphrag_sdk/src/graphrag_sdk/api/main.py
APIs to backfill attributes/entities/relation patterns and perform semantic coercion; build prompt/parse/merge callbacks and run BackfillExecutor with concurrency, returning detailed BackfillResult/EvolutionResult.
Unit tests: BackfillExecutor and Group 3 flows
graphrag_sdk/tests/test_backfill.py
Unit tests covering BackfillExecutor behavior, add_attribute atomic backfill/commit flows, backfill_entity/backfill_relation_pattern, idempotency, and failure isolation with mocked LLM and stores.
Integration tests: backfill end-to-end
graphrag_sdk/tests/test_backfill_integration.py
Gated integration test verifying backfill fills missing values and is idempotent against real FalkorDB + LLM.
Unit tests: OntologyStore and Groups 1+2 orchestration
graphrag_sdk/tests/test_ontology_evolve.py
Unit tests for OntologyStore primitives and GraphRAG orchestration ordering, validation, and conditional behaviors for Groups 1 and 2.
Integration tests: ontology evolution end-to-end
graphrag_sdk/tests/test_ontology_evolve_integration.py
Gated integration tests validating rename_entity, drop_attribute, and drop_entity flows against real FalkorDB.

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly Related PRs

  • FalkorDB/GraphRAG-SDK#256: Adds persisted OntologyStore/ontology wiring that this PR extends with evolution/backfill primitives.

Suggested Reviewers

  • Naseem77

Poem

🐰 I hopped through graphs at dawn,

schemas shifted, edges drawn.
LLMs hummed a careful song,
backfills stitched where data's wrong,
hop on β€” the ontology grows strong!

πŸš₯ Pre-merge checks | βœ… 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 46.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
βœ… Passed checks (4 passed)
Check name Status Explanation
Description Check βœ… Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check βœ… Passed The title clearly and specifically describes the main changes: adding a mutating ontology-evolution API with rename, drop, retype, and backfill capabilities.
Linked Issues check βœ… Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check βœ… Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
πŸ§ͺ Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/ontology-management

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❀️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Comment thread graphrag_sdk/tests/test_backfill.py Fixed
Comment thread graphrag_sdk/tests/test_backfill_integration.py Fixed
Comment thread graphrag_sdk/tests/test_ontology_evolve.py Fixed
- Drop unused Relation import in api/main.py (F401).
- Re-order ingestion imports so I001 is satisfied; merge the two
  entity_extractors import lines into one.
- Wrap two over-long lines in api/main.py (E501).
- Switch a %-format ValueError to an f-string in ontology_store.py (UP031).
- Re-run ruff format across the four changed files.

Functional behavior unchanged; full unit suite still green (949 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a mutating ontology-evolution API on GraphRAG (rename/drop/retype/backfill) to evolve a persisted ontology after initial ingest, including mechanical data-graph migrations and LLM-driven backfills over stored chunks.

Changes:

  • Add OntologyStore evolution primitives that intentionally bypass register() strictness for controlled mutations.
  • Add GraphStore Cypher migration helpers (rename label/property/type, drop property/nodes/edges, mechanical coercion) plus chunk backfill marker/scope queries.
  • Add BackfillExecutor + focused backfill prompts and expose new backfill/evolution methods through the public GraphRAG API (plus tests + an example script).

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
graphrag_sdk/src/graphrag_sdk/api/main.py Adds the public async ontology evolution + backfill orchestrators and JSON parsing helper.
graphrag_sdk/src/graphrag_sdk/storage/ontology_store.py Adds low-level ontology mutation primitives (rename/drop/retype/add property/pattern/description).
graphrag_sdk/src/graphrag_sdk/storage/graph_store.py Adds data-graph migration primitives and backfill marker/scope helpers.
graphrag_sdk/src/graphrag_sdk/ingestion/backfill.py New backfill executor, result/stats/context DTOs.
graphrag_sdk/src/graphrag_sdk/ingestion/extraction_strategies/graph_extraction.py Adds focused prompts for attribute/entity/relation-pattern backfill and semantic coercion.
graphrag_sdk/src/graphrag_sdk/init.py Re-exports backfill types in the package API.
graphrag_sdk/tests/test_ontology_evolve.py Unit tests for Groups 1+2 evolution primitives + GraphRAG orchestration ordering/validation.
graphrag_sdk/tests/test_ontology_evolve_integration.py Integration tests (gated) for Groups 1+2 against real FalkorDB.
graphrag_sdk/tests/test_backfill.py Unit tests for Group 3 backfill executor + GraphRAG backfill methods.
graphrag_sdk/tests/test_backfill_integration.py Integration test (gated) for attribute backfill idempotency end-to-end.
graphrag_sdk/examples/09_ontology_evolution.py End-to-end example demonstrating all three evolution groups and idempotency.

πŸ’‘ Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread graphrag_sdk/src/graphrag_sdk/api/main.py Outdated
Comment thread graphrag_sdk/src/graphrag_sdk/api/main.py
Comment thread graphrag_sdk/src/graphrag_sdk/api/main.py
Comment thread graphrag_sdk/src/graphrag_sdk/api/main.py
Comment thread graphrag_sdk/src/graphrag_sdk/ingestion/backfill.py Outdated
Comment thread graphrag_sdk/src/graphrag_sdk/ingestion/backfill.py Outdated
Comment thread graphrag_sdk/src/graphrag_sdk/ingestion/backfill.py
Comment thread graphrag_sdk/src/graphrag_sdk/storage/graph_store.py Outdated
Comment thread graphrag_sdk/src/graphrag_sdk/api/main.py Outdated
Comment thread graphrag_sdk/src/graphrag_sdk/api/main.py
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🧹 Nitpick comments (2)
graphrag_sdk/tests/test_backfill_integration.py (1)

22-22: ⚑ Quick win

Remove redundant @pytest.mark.asyncio under auto mode.

With repository pytest-asyncio auto mode, this marker is unnecessary noise.

Proposed fix
-@pytest.mark.asyncio
 async def test_backfill_attribute_fills_missing_values(
     real_falkordb_rag_factory,
 ):

Based on learnings: Because pytest-asyncio in graphrag_sdk/pyproject.toml uses asyncio_mode = "auto", async test functions generally do not need pytest.mark.asyncio.

πŸ€– Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@graphrag_sdk/tests/test_backfill_integration.py` at line 22, Remove the
redundant pytest marker by deleting the `@pytest.mark.asyncio` decorator from the
test in tests/test_backfill_integration.py (the async test function that
currently has `@pytest.mark.asyncio`) since pytest-asyncio is configured with
asyncio_mode="auto" in pyproject.toml; simply remove the decorator so the async
test runs under auto mode without the marker.
graphrag_sdk/tests/test_ontology_evolve_integration.py (1)

40-40: ⚑ Quick win

Drop redundant async markers in this module.

These decorators can be removed when async auto-detection is enabled.

Proposed fix
-@pytest.mark.asyncio
 async def test_rename_entity_relabels_data_nodes(
     real_falkordb_rag_factory, starter_ontology
 ):
@@
-@pytest.mark.asyncio
 async def test_drop_attribute_removes_property_and_declaration(
     real_falkordb_rag_factory, starter_ontology
 ):
@@
-@pytest.mark.asyncio
 async def test_drop_entity_cascades_relations(
     real_falkordb_rag_factory, starter_ontology
 ):

Based on learnings: Because pytest-asyncio in graphrag_sdk/pyproject.toml uses asyncio_mode = "auto", async test functions generally do not need pytest.mark.asyncio.

Also applies to: 65-65, 84-84

πŸ€– Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@graphrag_sdk/tests/test_ontology_evolve_integration.py` at line 40, Remove
the redundant `@pytest.mark.asyncio` decorators applied to the async test
functions in this module: locate the uses of `@pytest.mark.asyncio` (the
decorators shown in the diff) and delete them so the async tests rely on
pytest-asyncio's asyncio_mode="auto"; keep the async def test_* functions
unchanged and run the test suite to verify no behavior changes.
πŸ€– Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@graphrag_sdk/src/graphrag_sdk/api/main.py`:
- Around line 542-551: add_attribute currently always calls
_ontology_store.add_entity_property, making relation properties unreachable;
update add_attribute (and use _ensure_ontology_initialized,
_global_ontology.entities/_global_ontology.relations, add_entity_property,
add_relation_property, _refresh_global_ontology, backfill_attribute) to detect
whether the provided label is an entity or a relation and call the corresponding
store method (add_entity_property for entity labels, add_relation_property for
relation labels), preserving the existing validation and return value paths so
relation attributes become reachable via the same facade.
- Around line 597-626: Before doing any rename in rename_attribute, validate
that the source property exists and the target name does not to avoid silent
no-ops or accidental overwrites: after resolving kind =
self._resolve_owner_kind(owner_label) but before touching the graph or ontology,
query the ontology for the owner's property labels (via the appropriate
ontology-store read APIβ€”e.g. a method on self._ontology_store or by inspecting
self._global_ontology) to assert old_name is present and new_name is absent; if
old_name is missing or new_name already exists, raise a clear exception
(ValueError) and do not call _graph_store.rename_node_property or
_ontology_store.rename_property_label so no partial migration occurs. Ensure the
checks reference rename_attribute, _graph_store.rename_node_property and
_ontology_store.rename_property_label so reviewers can find the guarded
operations.
- Around line 463-472: The _refresh_global_ontology function updates only
self._global_ontology and the retrieval strategy but leaves self.ontology stale,
which breaks _default_extractor (it reads self.ontology.entities) used by later
ingest calls; modify _refresh_global_ontology to also update self.ontology
(e.g., assign or deep-copy the reloaded Ontology into self.ontology) after
setting self._global_ontology and before returning, ensuring consistency with
the retrieval strategy (_retrieval_strategy._ontology) and preventing extraction
against pre-mutation labels.

In `@graphrag_sdk/src/graphrag_sdk/ingestion/backfill.py`:
- Around line 155-159: BackfillExecutor.run currently creates and retains a Task
per chunk in the tasks list, causing O(total_chunks) memory pressure; change it
to maintain a bounded set of pending tasks instead (e.g., pending:
set[asyncio.Task] with a configurable concurrency limit like max_concurrency).
For each ctx from chunks, create a task for _process_one(ctx), add it to
pending, and when len(pending) >= max_concurrency await asyncio.wait(pending,
return_when=asyncio.FIRST_COMPLETED) and remove finished tasks from pending;
after the loop await asyncio.gather(*pending) for any remaining tasks. Update
any config/constructor to expose max_concurrency and ensure references to tasks
are dropped when tasks complete so memory is bounded.

In `@graphrag_sdk/src/graphrag_sdk/storage/ontology_store.py`:
- Around line 492-521: The helpers currently overwrite an existing
Property.type; update add_relation_property and the called helper
_upsert_entity_property so MERGE uses ON CREATE for type (or SET p.type =
coalesce(p.type, $type)) instead of unconditionally SET p.type = $type, and keep
the description logic as coalesce($description, p.description) (or put
description in ON CREATE and still allow caller-supplied non-null to override).
Concretely, replace the MERGE ... SET p.type = $type line in
add_relation_property and the MERGE in _upsert_entity_property with MERGE
(r)-[:HAS_PROPERTY]->(p:Property {label: $name}) ON CREATE SET p.type = $type,
p.description = coalesce($description, p.description) (or equivalent coalesce
for type), so existing Property.type is not mutated on re-add.

In `@graphrag_sdk/tests/test_backfill_integration.py`:
- Around line 16-19: The current pytestmark skip condition treats any non-empty
RUN_INTEGRATION value as enabled; update the skip predicate used in the
pytestmark declaration so it only enables integration when
os.getenv("RUN_INTEGRATION") == "1" and OPENAI_API_KEY is present (e.g., change
the condition to not (os.getenv("RUN_INTEGRATION") == "1" and
os.getenv("OPENAI_API_KEY"))), leaving the pytestmark variable name as-is so the
integration test runs only when RUN_INTEGRATION is exactly "1".

In `@graphrag_sdk/tests/test_ontology_evolve_integration.py`:
- Around line 18-21: The skip gate uses a truthy check that treats
RUN_INTEGRATION="0" as true; change the condition in the pytestmark skipif to
explicitly check for the string "1" (e.g., replace not
os.getenv("RUN_INTEGRATION") with os.getenv("RUN_INTEGRATION") != "1") so only
RUN_INTEGRATION=="1" will run integration tests; update the reason string if
desired and ensure the change is applied to the pytestmark assignment that
references os.getenv("RUN_INTEGRATION").

---

Nitpick comments:
In `@graphrag_sdk/tests/test_backfill_integration.py`:
- Line 22: Remove the redundant pytest marker by deleting the
`@pytest.mark.asyncio` decorator from the test in
tests/test_backfill_integration.py (the async test function that currently has
`@pytest.mark.asyncio`) since pytest-asyncio is configured with
asyncio_mode="auto" in pyproject.toml; simply remove the decorator so the async
test runs under auto mode without the marker.

In `@graphrag_sdk/tests/test_ontology_evolve_integration.py`:
- Line 40: Remove the redundant `@pytest.mark.asyncio` decorators applied to the
async test functions in this module: locate the uses of `@pytest.mark.asyncio`
(the decorators shown in the diff) and delete them so the async tests rely on
pytest-asyncio's asyncio_mode="auto"; keep the async def test_* functions
unchanged and run the test suite to verify no behavior changes.
πŸͺ„ Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
βš™οΈ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 32e7bd8d-1d0d-4052-9d38-f0b203ce382a

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between 212d47b and 8ab9394.

πŸ“’ Files selected for processing (11)
  • graphrag_sdk/examples/09_ontology_evolution.py
  • graphrag_sdk/src/graphrag_sdk/__init__.py
  • graphrag_sdk/src/graphrag_sdk/api/main.py
  • graphrag_sdk/src/graphrag_sdk/ingestion/backfill.py
  • graphrag_sdk/src/graphrag_sdk/ingestion/extraction_strategies/graph_extraction.py
  • graphrag_sdk/src/graphrag_sdk/storage/graph_store.py
  • graphrag_sdk/src/graphrag_sdk/storage/ontology_store.py
  • graphrag_sdk/tests/test_backfill.py
  • graphrag_sdk/tests/test_backfill_integration.py
  • graphrag_sdk/tests/test_ontology_evolve.py
  • graphrag_sdk/tests/test_ontology_evolve_integration.py

Comment thread graphrag_sdk/src/graphrag_sdk/api/main.py
Comment thread graphrag_sdk/src/graphrag_sdk/api/main.py Outdated
Comment thread graphrag_sdk/src/graphrag_sdk/api/main.py Outdated
Comment thread graphrag_sdk/src/graphrag_sdk/ingestion/backfill.py Outdated
Comment thread graphrag_sdk/src/graphrag_sdk/storage/ontology_store.py
Comment thread graphrag_sdk/tests/test_backfill_integration.py
Comment thread graphrag_sdk/tests/test_ontology_evolve_integration.py
Items numbered per the triage in the PR thread:

1. _strip_and_load_json now raises on malformed JSON instead of
   returning {}. BackfillExecutor catches the ValueError, lands the
   chunk in failed_chunks, and skips the marker β€” so a corrupted
   chunk is retry-eligible instead of permanently silently skipped.

2. _refresh_global_ontology now updates self.ontology in lockstep with
   _global_ontology. _default_extractor reads self.ontology.entities,
   so without this a later default ingest() would extract against the
   pre-mutation label set.

3. add_attribute now resolves entity-vs-relation dynamically (via
   _resolve_owner_kind) and calls add_entity_property or
   add_relation_property accordingly. Previously relation owners were
   silently unreachable.

4. add_entity_property / add_relation_property refuse to silently
   retype an existing property (new _check_no_property_retype probe
   raises OntologyContradictionError). MERGE switched to ON CREATE
   SET p.type so re-adding identical declarations stays idempotent.
   Callers wanting to change types must go through retype_property /
   GraphRAG.retype_attribute / backfill_attribute_semantic.

5. backfill_relation_pattern merge keys candidate pairs by the full
   (src_name, tgt_name) tuple. The previous separate by_src / by_tgt
   maps could combine src_id from one pair with tgt_id from another
   when a chunk had multiple candidates sharing a name.

6. BackfillResult gets a new failed_node_ids field;
   backfill_attribute_semantic appends to it (not failed_chunks),
   since semantic coercion walks node values, not chunks. Callers
   now always know which surface to retry.

7. rename_attribute validates that old_name exists and new_name is
   free before touching data or ontology. A typoed old_name used to
   silently no-op; a collision on new_name could overwrite values.

8. backfill_attribute merge keys the entity lookup by BOTH name and
   id, matching what the prompt advertises (name OR id fallback).
   Previously, if the entity had no name (or the LLM echoed the id),
   the match dropped silently.

9. BackfillExecutor docstring no longer claims to do retries (it
   doesn't β€” retries live in LLMInterface.ainvoke). Failed chunks
   land in failed_chunks for caller-driven retry.

10. BackfillResult.chunks_skipped is now populated via a new
    GraphStore.count_chunks_marked_with_op helper, so on an
    idempotent rerun callers can see the work skipped (instead of
    always reading 0).

11. backfill_relation_pattern no longer writes source_chunk_ids on
    arbitrary-typed edges. GraphStore.upsert_relationships only
    unions provenance for RELATES; other edge types would have their
    list silently overwritten by future MERGE writes. Field removed
    rather than misleadingly carried.

Tests:
- 11 new unit tests covering items 1, 2, 4, 5, 7, 8, 10, 11 + the
  new failed_node_ids semantics.
- 58 ontology-evolution + backfill tests pass; full unit suite
  960 passed / 24 skipped (no regressions).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
graphrag_sdk/tests/test_backfill.py (1)

24-29: ⚠️ Potential issue | 🟑 Minor | ⚑ Quick win

Unused import: BackfillResult.

The import is not referenced in the test code.

πŸ€– Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@graphrag_sdk/tests/test_backfill.py` around lines 24 - 29, Remove the unused
import BackfillResult from the import list in the test module; update the import
statement that currently imports BackfillExecutor, BackfillMergeStats,
BackfillResult, and ChunkContext so it only imports the symbols actually used
(e.g., BackfillExecutor, BackfillMergeStats, ChunkContext) to eliminate the
unused BackfillResult reference.
🧹 Nitpick comments (1)
graphrag_sdk/tests/test_backfill.py (1)

214-215: ⚑ Quick win

Clarify comment to match mock value.

The comment says "Pretend 5 prior chunks are already marked" but the mock returns 6. While the test logic is correct (6 total - 1 scanned = 5 skipped), the comment is misleading.

πŸ“ Suggested clarification
-        # Pretend 5 prior chunks are already marked.
+        # Mock returns 6 total marked (5 prior + 1 from this run).
         rag._graph_store.count_chunks_marked_with_op = AsyncMock(return_value=6)
πŸ€– Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@graphrag_sdk/tests/test_backfill.py` around lines 214 - 215, Update the
misleading comment to match the mocked return value: change the comment above
the AsyncMock setup for rag._graph_store.count_chunks_marked_with_op to reflect
that it returns 6 prior marked chunks (so 6 total - 1 scanned = 5 skipped)
instead of saying "5 prior chunks are already marked"; keep the mock call
AsyncMock(return_value=6) as-is and only adjust the comment text to correctly
describe the mocked value.
πŸ€– Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@graphrag_sdk/tests/test_backfill.py`:
- Around line 24-29: Remove the unused import BackfillResult from the import
list in the test module; update the import statement that currently imports
BackfillExecutor, BackfillMergeStats, BackfillResult, and ChunkContext so it
only imports the symbols actually used (e.g., BackfillExecutor,
BackfillMergeStats, ChunkContext) to eliminate the unused BackfillResult
reference.

---

Nitpick comments:
In `@graphrag_sdk/tests/test_backfill.py`:
- Around line 214-215: Update the misleading comment to match the mocked return
value: change the comment above the AsyncMock setup for
rag._graph_store.count_chunks_marked_with_op to reflect that it returns 6 prior
marked chunks (so 6 total - 1 scanned = 5 skipped) instead of saying "5 prior
chunks are already marked"; keep the mock call AsyncMock(return_value=6) as-is
and only adjust the comment text to correctly describe the mocked value.

ℹ️ Review info
βš™οΈ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 00cd2294-b3e7-40e6-835e-06e8b18c761e

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between 72913ea and be5f569.

πŸ“’ Files selected for processing (6)
  • graphrag_sdk/src/graphrag_sdk/api/main.py
  • graphrag_sdk/src/graphrag_sdk/ingestion/backfill.py
  • graphrag_sdk/src/graphrag_sdk/storage/graph_store.py
  • graphrag_sdk/src/graphrag_sdk/storage/ontology_store.py
  • graphrag_sdk/tests/test_backfill.py
  • graphrag_sdk/tests/test_ontology_evolve.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • graphrag_sdk/src/graphrag_sdk/ingestion/backfill.py
  • graphrag_sdk/src/graphrag_sdk/storage/graph_store.py
  • graphrag_sdk/src/graphrag_sdk/api/main.py

Items follow the same numbering as the PR triage.

12-13. Worker-pool refactor for BackfillExecutor.run and
       backfill_attribute_semantic. Live asyncio.Task count is now
       O(concurrency) regardless of corpus size (previously
       O(total_chunks) or O(node_count) β€” one task created per item
       up-front and held until gather drained them). Each method now
       spawns N long-lived workers that consume from a bounded
       asyncio.Queue, with sentinels for graceful shutdown.

       New regression test asserts the live-task peak stays bounded
       across a 200-chunk run with concurrency=4.

14. list_chunks_for_relation_pattern_backfill now caps the per-chunk
    candidate Cartesian via Cypher list slice `pairs[0..N]`. New
    pairs_per_chunk parameter defaults to 50; document that you can
    tune higher if you have evidence the clipped pairs are real edges.
    Prevents an M*N prompt explosion when a chunk mentions many src/
    tgt-typed entities.

15-17. Drop unused imports: BackfillResult in test_backfill.py,
       Relation in test_backfill_integration.py, call in
       test_ontology_evolve.py.

18. Tighten integration-test gating in both integration files:
    `os.getenv("RUN_INTEGRATION") == "1"` (was truthy check, which
    accepted `RUN_INTEGRATION=0`).

19. Drop redundant `@pytest.mark.asyncio` decorator in
    test_backfill_integration.py β€” pytest-asyncio is configured with
    `asyncio_mode = "auto"`.

961 passed, 24 skipped β€” no regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
graphrag_sdk/src/graphrag_sdk/ingestion/backfill.py (1)

105-112: πŸ’€ Low value

Docstring mentions Semaphore but implementation uses Queue-based worker pool.

The class docstring says "Concurrency-bounded (asyncio.Semaphore)" but the actual implementation uses an asyncio.Queue with long-lived worker tasks. Update the docstring to reflect the worker-pool pattern.

πŸ“ Suggested docstring update
 class BackfillExecutor:
     """Drive an LLM-backed re-scan over already-ingested chunks.

-    Concurrency-bounded (``asyncio.Semaphore``), per-chunk-idempotent
+    Concurrency-bounded (worker-pool with ``asyncio.Queue``), per-chunk-idempotent
     (``GraphStore.mark_chunk_extracted``), and resilient β€” a chunk that
     raises during parse/merge lands in ``BackfillResult.failed_chunks``
     and the run continues.
     """
πŸ€– Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@graphrag_sdk/src/graphrag_sdk/ingestion/backfill.py` around lines 105 - 112,
Update the BackfillExecutor class docstring to accurately describe its
concurrency model: replace the mention of "Concurrency-bounded
(``asyncio.Semaphore``)" with a description that it uses an asyncio.Queue-based
worker-pool with long-lived worker tasks (bounded concurrency via a fixed number
of workers consuming the queue), and keep notes about per-chunk idempotency
(GraphStore.mark_chunk_extracted) and resilience (failed_chunks) so the
docstring matches the actual implementation.
πŸ€– Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@graphrag_sdk/src/graphrag_sdk/ingestion/backfill.py`:
- Around line 105-112: Update the BackfillExecutor class docstring to accurately
describe its concurrency model: replace the mention of "Concurrency-bounded
(``asyncio.Semaphore``)" with a description that it uses an asyncio.Queue-based
worker-pool with long-lived worker tasks (bounded concurrency via a fixed number
of workers consuming the queue), and keep notes about per-chunk idempotency
(GraphStore.mark_chunk_extracted) and resilience (failed_chunks) so the
docstring matches the actual implementation.

ℹ️ Review info
βš™οΈ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 647f0c7d-4bde-4ebc-a8b1-20e2cca1369f

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between be5f569 and a24ea90.

πŸ“’ Files selected for processing (7)
  • graphrag_sdk/src/graphrag_sdk/api/main.py
  • graphrag_sdk/src/graphrag_sdk/ingestion/backfill.py
  • graphrag_sdk/src/graphrag_sdk/storage/graph_store.py
  • graphrag_sdk/tests/test_backfill.py
  • graphrag_sdk/tests/test_backfill_integration.py
  • graphrag_sdk/tests/test_ontology_evolve.py
  • graphrag_sdk/tests/test_ontology_evolve_integration.py
🚧 Files skipped from review as they are similar to previous changes (5)
  • graphrag_sdk/tests/test_backfill_integration.py
  • graphrag_sdk/tests/test_ontology_evolve_integration.py
  • graphrag_sdk/tests/test_ontology_evolve.py
  • graphrag_sdk/src/graphrag_sdk/api/main.py
  • graphrag_sdk/tests/test_backfill.py

Major API change. Per the design discussion: the ontology is a contract,
not a hint. A declared attribute MUST be aligned with the data graph.
This refactor enforces the invariant by making add_attribute atomic and
removing the ways to silently dis-align.

API changes (GraphRAG: 18 β†’ 15 attribute-touching methods):

ADDED
- EvolutionResult: return value of atomic-evolve calls. Carries the
  refreshed ontology + observability counters (chunks_scanned,
  values_filled, llm_calls, elapsed_s).
- OntologyEvolutionError: raised when atomic evolve cannot complete its
  data migration (hard-failed chunks). The ontology graph is NOT updated
  when raised β€” schema stays consistent. Retrying is idempotent.

CHANGED
- add_attribute(owner_label, attribute) is now ATOMIC:
    1. LLM-extract values from every chunk mentioning the owner type.
    2. SET values on matching entities (null where LLM doesn't know).
    3. Commit ontology graph as the FINAL step.
  Returns EvolutionResult. Hard-failure raises OntologyEvolutionError
  and the ontology stays at its pre-call state. Entity owners only.

- drop_attribute(owner_label, name) is now strict entity-only. Relation
  owners raise NotImplementedError (was: ontology-only with warning;
  that allowed schema/data drift on relation attributes).

REMOVED
- retype_attribute: use drop_attribute + add_attribute. LLM re-derives
  values from chunks with the new type. One less code path, no
  coercion-failure policy to argue about.
- backfill_attribute (public): folded into add_attribute as the
  internal _run_atomic_attribute_backfill helper.
- backfill_attribute_semantic: gone with retype_attribute.

UNCHANGED
- All Group 1 methods (set_*_description, add_entity, add_relation_pattern).
- All Group 2 methods (rename_*, drop_entity, drop_relation, drop_relation_pattern).
- backfill_entity and backfill_relation_pattern stay public as
  opportunistic discovery tools β€” entity types and relation patterns
  are declarations of what's allowed, not what must exist.

Concurrency rule (documented on add_attribute): do NOT run ingest()
concurrently with add_attribute() / drop_attribute(). The extractor
reads the persisted ontology; concurrent ingest during evolution
would create entities without the new attribute. Coordinate at the
application level.

Tests:
- TestAddAttributeAtomic replaces TestBackfillAttribute, covering:
  relation-owner NotImplementedError, already-declared ValueError,
  unknown-label ValueError, happy-path backfill+commit ordering,
  hard-failure rollback (ontology NOT committed), deterministic op_id,
  id-only lookup.
- TestRetypeRemoved asserts retype_attribute no longer exists.
- TestBackfillAttributeSemantic class removed.
- 957 passed, 24 skipped β€” no regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
graphrag_sdk/src/graphrag_sdk/api/main.py (1)

873-899: ⚠️ Potential issue | 🟠 Major | πŸ—οΈ Heavy lift

Concurrent chunk merges can make the final attribute value nondeterministic.

Each worker writes its chunk’s proposal directly to the entity node. When the same entity shows up in multiple chunks with different extracted values, the winner is just whichever task finishes last, so repeated runs can land on different final values.

Please stage per-entity proposals and resolve them with a stable rule before writing, or serialize the write phase.

Also applies to: 907-914

πŸ€– Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@graphrag_sdk/src/graphrag_sdk/api/main.py` around lines 873 - 899, The merge
currently writes each chunk proposal directly in _merge, causing
nondeterministic last-writer-wins results; instead, collect all proposals into a
per-entity staging map (keyed by ent["id"]) while iterating parsed in _merge,
using _coerce_attribute_value to validate/coerce each proposal, then
deterministically resolve a single value per entity (e.g., most-frequent,
highest-confidence if available, or tie-break by stable key like
sorted(raw_value) or ent id) and only after resolution call
self._graph_store.set_node_property_by_id once per entity to write the final
value; apply the same staging+resolve pattern for the other merge site that also
calls set_node_property_by_id so writes are serialized and reproducible.
🧹 Nitpick comments (1)
graphrag_sdk/tests/test_backfill.py (1)

148-148: πŸ’€ Low value

Consider extracting the magic number 8 to a named constant.

The ceiling calculation concurrency + 8 uses a somewhat arbitrary constant. While the comment explains the rationale, a named constant would improve clarity.

♻️ Suggested refactoring
+    # Overhead tasks: producer, test driver, pytest infrastructure
+    TASK_OVERHEAD_CEILING = 8
+
     `@pytest.mark.asyncio`
     async def test_live_task_count_stays_bounded(self, small_ontology, mock_graph_store):
         ...
-        assert peak <= concurrency + 8
+        assert peak <= concurrency + TASK_OVERHEAD_CEILING
πŸ€– Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@graphrag_sdk/tests/test_backfill.py` at line 148, Replace the magic number 8
in the assertion `assert peak <= concurrency + 8` with a clearly named constant
(e.g., CONCURRENCY_TOLERANCE or MAX_PEAK_OVERHEAD) declared near the test setup,
and update the assertion to use that constant (`assert peak <= concurrency +
CONCURRENCY_TOLERANCE`) so the rationale and intent are explicit and
maintainable; ensure the constant has a brief comment referencing the reason
described in the original comment.
πŸ€– Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@graphrag_sdk/examples/09_ontology_evolution.py`:
- Around line 128-133: The banner text is misleading: the call to
rag.add_attribute("Person", Attribute(name="join_year", type="INTEGER")) is a
different operation than the previous add_attribute for Person.role and, with
op-scoped markers, should be described as a fresh backfill rather than β€œre-uses
chunk markers”; update the banner() invocation string (and any adjacent
explanatory comment) to reflect that adding Person.join_year triggers a new
backfill and is not an idempotent re-run of the prior op (reference:
banner(...), rag.add_attribute, Attribute, Person.join_year, Person.role).

In `@graphrag_sdk/src/graphrag_sdk/api/main.py`:
- Around line 599-609: The op_id used to gate backfill only uses
"{owner_label}:{attribute.name}", so dropping and re-adding the same attribute
name is treated as already-processed; update the backfill gating key to include
the attribute's identity/version (e.g., include attribute.type,
attribute.schema_version, or a stable attribute.uid/signature) when calling
_run_atomic_attribute_backfill so a newly added attribute always yields a
different op_id; make the same change in both locations that call
_run_atomic_attribute_backfill (references: owner_label, attribute.name,
attribute.type/uid, and _run_atomic_attribute_backfill).

In `@graphrag_sdk/src/graphrag_sdk/ingestion/backfill.py`:
- Around line 102-112: Update the EvolutionResult docstring to remove the
incorrect claim that EvolutionResult contains a failed_chunks field and instead
point readers to OntologyEvolutionError.failed_chunks for hard-failure details;
edit the paragraph that begins "On a successful return:" (in the EvolutionResult
docstring) to state that failed chunks are surfaced via
OntologyEvolutionError.failed_chunks on exceptions and that a successful
EvolutionResult does not include a failed_chunks attribute.

---

Outside diff comments:
In `@graphrag_sdk/src/graphrag_sdk/api/main.py`:
- Around line 873-899: The merge currently writes each chunk proposal directly
in _merge, causing nondeterministic last-writer-wins results; instead, collect
all proposals into a per-entity staging map (keyed by ent["id"]) while iterating
parsed in _merge, using _coerce_attribute_value to validate/coerce each
proposal, then deterministically resolve a single value per entity (e.g.,
most-frequent, highest-confidence if available, or tie-break by stable key like
sorted(raw_value) or ent id) and only after resolution call
self._graph_store.set_node_property_by_id once per entity to write the final
value; apply the same staging+resolve pattern for the other merge site that also
calls set_node_property_by_id so writes are serialized and reproducible.

---

Nitpick comments:
In `@graphrag_sdk/tests/test_backfill.py`:
- Line 148: Replace the magic number 8 in the assertion `assert peak <=
concurrency + 8` with a clearly named constant (e.g., CONCURRENCY_TOLERANCE or
MAX_PEAK_OVERHEAD) declared near the test setup, and update the assertion to use
that constant (`assert peak <= concurrency + CONCURRENCY_TOLERANCE`) so the
rationale and intent are explicit and maintainable; ensure the constant has a
brief comment referencing the reason described in the original comment.
πŸͺ„ Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
βš™οΈ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 688bae4c-d140-4614-9632-10c54d91a1f4

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between a24ea90 and 29cfaf3.

πŸ“’ Files selected for processing (6)
  • graphrag_sdk/examples/09_ontology_evolution.py
  • graphrag_sdk/src/graphrag_sdk/__init__.py
  • graphrag_sdk/src/graphrag_sdk/api/main.py
  • graphrag_sdk/src/graphrag_sdk/ingestion/backfill.py
  • graphrag_sdk/tests/test_backfill.py
  • graphrag_sdk/tests/test_ontology_evolve.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • graphrag_sdk/src/graphrag_sdk/init.py

Comment thread graphrag_sdk/examples/09_ontology_evolution.py Outdated
Comment thread graphrag_sdk/src/graphrag_sdk/api/main.py
Comment thread graphrag_sdk/src/graphrag_sdk/ingestion/backfill.py Outdated
Covers the 15 attribute-touching methods + the alignment invariant
enforced by atomic add_attribute / drop_attribute. Documents:

- The four method groups (pure declarations, mechanical migration,
  atomic attribute evolution, opportunistic discovery).
- Why add_attribute is atomic β€” invariant + commit ordering.
- Failure & retry model (OntologyEvolutionError + idempotent retry
  via chunk markers).
- Type-change pattern (drop_attribute + add_attribute, LLM re-derives).
- Concurrency rule β€” don't run ingest() during attribute evolution.
- End-to-end example.
- EvolutionResult / OntologyEvolutionError reference tables.
- What's not in the API and why (retype, public backfill_attribute,
  backfill_attribute_semantic).

Wired into mkdocs nav between Graph Schema and Benchmark. Cross-linked
from index.md Next Steps and from the top of graph-schema.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.

Comment thread graphrag_sdk/tests/test_backfill_integration.py Outdated
Comment thread graphrag_sdk/src/graphrag_sdk/storage/graph_store.py Outdated
Comment thread graphrag_sdk/src/graphrag_sdk/storage/ontology_store.py
…code

Items follow the PR triage numbering.

#3319054109 (CRITICAL) β€” op_id now includes attribute.type. The
documented type-change pattern (drop_attribute + add_attribute with
new type) previously short-circuited: the new add reused the prior
op_id, so chunk markers filtered every chunk and the schema committed
without a fresh LLM rescan. With type in op_id, the second add gets
a different op_id and the rescan fires. Regression test added.

#3319843059 β€” test_backfill_integration.py rewritten against
add_attribute (the old version called removed backfill_attribute and
would AttributeError under RUN_INTEGRATION=1). New tests cover the
happy path, duplicate-add rejection, and the drop+add fresh-rescan
invariant.

#3319843117 β€” deleted dead code left behind by the prior refactor:
  - graph_store.py: coerce_node_property, _RETYPE_COERCERS map,
    list_node_values_for_semantic_coerce.
  - graph_extraction.py: BACKFILL_SEMANTIC_COERCION_PROMPT.
  - ontology_store.py: retype_property + _VALID_PROPERTY_TYPES.
None of these have callers under the new atomic-evolve design.
Test fixtures that mocked them are also pruned.

#3319843142 β€” error messages and docstrings in OntologyStore that
told users to call retype_property() / GraphRAG.retype_attribute() /
backfill_attribute_semantic() β€” none of which exist anymore β€” now
point at the supported workflow: drop_attribute() + add_attribute().

#3319054116 β€” EvolutionResult docstring no longer mentions a
failed_chunks field that doesn't exist on the type; it now correctly
points readers at OntologyEvolutionError.failed_chunks.

#3319054100 β€” example step 3 banner rewritten. It used to claim
"re-uses chunk markers" but that's a different op_id (Person.join_year
vs Person.role); it's actually an independent atomic backfill.

956 passed, 24 skipped β€” no regressions. Lint clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants