Add support for existing graph nodes and relationships to the entity and relationship extractor #532

Closed
ali-sedaghatbaf wants to merge 4 commits into main from support-existing-graph

ali-sedaghatbaf commented May 26, 2026

Contributor

Description

This pull request enhances the entity and relation extraction process by allowing the system to incorporate information about existing nodes and relationships in the knowledge graph. When provided, these existing entities are included in the prompt to the language model, instructing it to reuse IDs for matching entities instead of creating duplicates. The changes also ensure that sensitive or unnecessary properties (like embeddings) are excluded from the prompt and add comprehensive tests for the new functionality.

  • The LLMEntityRelationExtractor now accepts an optional existing_graph parameter in its main methods (run, run_for_chunk, and extract_for_chunk). When supplied, the extractor serializes existing nodes and relationships (excluding embedding_properties) and includes them in the prompt to guide the language model to reuse entity IDs.
  • The ERExtractionTemplate prompt now conditionally adds a section listing existing nodes and relationships, formatted as JSON, and provides clear instructions to the LLM about reusing IDs for existing entities. If no existing entities are provided, this section is omitted.
  • New unit tests verify that:
    • Existing entities are included in the prompt when provided.
    • embedding_properties are excluded from the prompt.
    • The "Existing graph entities" section is omitted when no existing entities are provided.
    • The prompt template formats the existing entity information correctly under various scenarios.
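
The behaviour described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code from this pull request: the `Node` and `Relationship` dataclasses, the `serialize_existing` and `existing_entities_section` helpers, and the exact wording of the section header are all assumptions made for the example.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Node:
    # Hypothetical stand-in for the library's node type.
    id: str
    label: str
    properties: dict[str, Any] = field(default_factory=dict)
    embedding_properties: Optional[dict[str, list[float]]] = None


@dataclass
class Relationship:
    # Hypothetical stand-in for the library's relationship type.
    start_node_id: str
    end_node_id: str
    type: str
    properties: dict[str, Any] = field(default_factory=dict)
    embedding_properties: Optional[dict[str, list[float]]] = None


def serialize_existing(
    nodes: list[Node], rels: list[Relationship]
) -> dict[str, list[dict[str, Any]]]:
    """Keep only the fields the LLM needs; embedding_properties is never copied."""
    return {
        "nodes": [
            {"id": n.id, "label": n.label, "properties": n.properties} for n in nodes
        ],
        "relationships": [
            {
                "start_node_id": r.start_node_id,
                "end_node_id": r.end_node_id,
                "type": r.type,
                "properties": r.properties,
            }
            for r in rels
        ],
    }


def existing_entities_section(nodes: list[Node], rels: list[Relationship]) -> str:
    """Return the prompt section, or an empty string when there is nothing to list."""
    if not nodes and not rels:
        return ""
    payload = json.dumps(serialize_existing(nodes, rels), indent=2)
    return "Existing graph entities (reuse these IDs for matching entities):\n" + payload
```

Building the section from an explicit allow-list of fields, rather than deleting `embedding_properties` from a full dump, means any property added to the node type later stays out of the prompt by default.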

Type of Change

  • New feature
  • Bug fix
  • Breaking change
  • Documentation update
  • Project configuration change

Complexity

Complexity: low

How Has This Been Tested?

  • Unit tests
  • E2E tests
  • Manual tests

Checklist

The following requirements should have been met (depending on the changes in the branch):

  • Documentation has been updated
  • Unit tests have been updated
  • E2E tests have been updated
  • Examples have been updated
  • New files have copyright header
  • CLA (https://neo4j.com/developer/cla/) has been signed
  • CHANGELOG.md updated if appropriate

ali-sedaghatbaf marked this pull request as ready for review May 28, 2026 14:19
ali-sedaghatbaf requested a review from a team as a code owner May 28, 2026 14:19

NathalieCharbel left a comment

Contributor

Great stuff!

I'd like to further discuss the following points that are still unclear to me:

  • How do we accumulate existing graphs from previous runs, and how is the caller expected to manage the existing graph?
  • How is the approach affected by node IDs being prefixed in post_process_chunk -> update_ids?
  • Does the approach hold end to end? Should we prove that it works as intended through a small test with "real" LLM calls?

schema,
examples,
lexical_graph_builder,
existing_graphs[i] if existing_graphs else None,

Contributor

So each chunk sees only its caller-supplied existing_graphs[i]. Are we assuming that graphs are accumulated across previous runs and that this is the caller's responsibility? Either way, if the same new entity appears in chunk 0 and chunk 5 of the same run, two distinct IDs will be created, so we only deduplicate against externally provided (prior/persisted) state, never against the current batch? I am just trying to understand the approach and its possible drawbacks. This could be a legitimate design choice, but I don't think it is clear from the docstring.
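
To make the pairing concrete, here is a minimal sketch of how each chunk could be matched with its own existing subgraph, following the `existing_graphs[i] if existing_graphs else None` expression in the diff and the "same length as chunks" rule from the docstring. The helper name `pair_chunks_with_existing` and the generic types are hypothetical and not part of the library.

```python
from typing import Optional, Sequence, TypeVar

C = TypeVar("C")  # chunk type
G = TypeVar("G")  # graph type


def pair_chunks_with_existing(
    chunks: Sequence[C], existing_graphs: Optional[Sequence[G]] = None
) -> list[tuple[C, Optional[G]]]:
    """Pair chunk i with existing_graphs[i], or with None when no graphs are given."""
    if existing_graphs is not None and len(existing_graphs) != len(chunks):
        raise ValueError(
            f"existing_graphs has {len(existing_graphs)} items, "
            f"expected one per chunk ({len(chunks)})"
        )
    return [
        (chunk, existing_graphs[i] if existing_graphs else None)
        for i, chunk in enumerate(chunks)
    ]
```

Nothing in this pairing carries entities extracted from chunk 0 forward to chunk 5, which is the in-batch gap the comment points out: only what the caller places in `existing_graphs` is visible to a given chunk.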

lexical_graph_config (Optional[LexicalGraphConfig], optional): Lexical graph configuration to customize node labels and relationship types in the lexical graph.
schema (GraphSchema | None): Definition of the schema to guide the LLM in its extraction.
examples (str): Examples for few-shot learning in the prompt.
existing_graphs (Optional[list[Neo4jGraph]]): One subgraph per chunk, each containing nodes and relationships already in the knowledge graph that are relevant to that chunk. When provided, the LLM is instructed to reuse their IDs for matching entities instead of creating new ones. Must have the same length as chunks.

Contributor

Would this be negatively impacted by node IDs being rewritten by update_ids (where node IDs are rewritten to ensure their uniqueness across chunks)?
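
The concern can be illustrated with a small sketch. It assumes that the rewrite prefixes every node ID returned for a chunk; the prefix format and both function names are hypothetical, and the second function is only one possible guard, not something this pull request implements.

```python
def prefix_ids(node_ids: list[str], chunk_prefix: str) -> list[str]:
    """Assumed behaviour: every ID is prefixed, including reused ones."""
    return [f"{chunk_prefix}:{node_id}" for node_id in node_ids]


def prefix_new_ids_only(
    node_ids: list[str], chunk_prefix: str, existing_ids: set[str]
) -> list[str]:
    """Hypothetical guard: leave IDs that already exist in the graph untouched."""
    return [
        node_id if node_id in existing_ids else f"{chunk_prefix}:{node_id}"
        for node_id in node_ids
    ]


existing_ids = {"person-1"}
llm_output_ids = ["person-1", "0"]  # the LLM reused "person-1" and created "0"

# With blanket prefixing the reused ID no longer matches the stored node.
print(prefix_ids(llm_output_ids, "chunk5"))
# With the guard, only the newly created ID is made chunk-unique.
print(prefix_new_ids_only(llm_output_ids, "chunk5", existing_ids))
```

If the actual rewrite already skips IDs that come from the existing graph, the reused IDs survive and the concern does not apply.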

Contributor Author

No, don't think so.

Comment thread src/neo4j_graphrag/generation/prompts.py
existing_nodes=[],
existing_rels=[],
)
assert "Existing graph entities" not in prompt

Contributor

While all of these unit tests are great to have, it is hard to prove that the approach works, and that the LLM picks the existing nodes/rels correctly, without testing it with real LLM calls on a very small dataset.
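
One possible shape for such a check is sketched below. The extraction step is abstracted behind a hypothetical `Extractor` callable so that a real LLM-backed extractor could be plugged in; `fake_extract` is only a stand-in that matches names by substring, and none of these names come from the library.

```python
from typing import Callable

# Hypothetical signature: (chunk text, existing id -> entity name) -> extracted node IDs
Extractor = Callable[[str, dict[str, str]], list[str]]


def missing_reused_ids(
    extract: Extractor,
    text: str,
    existing: dict[str, str],
    expected_reused: set[str],
) -> set[str]:
    """Return the existing IDs that should have been reused but were not."""
    returned = set(extract(text, existing))
    return expected_reused - returned


def fake_extract(text: str, existing: dict[str, str]) -> list[str]:
    """Stand-in for a real LLM call: reuse an ID when the entity name appears in the text."""
    reused = [node_id for node_id, name in existing.items() if name in text]
    return reused + ["new-0"]


existing = {"p1": "Alice", "c9": "Acme"}
missing = missing_reused_ids(fake_extract, "Alice works at Neo4j.", existing, {"p1"})
print(missing)  # an empty set means every expected ID was reused
```

Running the same function over a handful of hand-labelled chunks with the real extractor would give a simple pass/fail signal for ID reuse without asserting on the exact LLM output.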
