Add support for existing graph nodes and relationsships to the entity and relationship extractor#532
Add support for existing graph nodes and relationsships to the entity and relationship extractor#532ali-sedaghatbaf wants to merge 4 commits into
Conversation
NathalieCharbel
left a comment
There was a problem hiding this comment.
Great stuff!
I'd like to further discuss the following points that are still unclear to me:
- how do we accumulate existing graphs from previous runs or what is the intention regarding how the caller deals with the existing graph?
- how is the approach impacted by node IDs being prefixed in
post_process_chunk->update_ids? - how does the approach hold end to end, should we prove the actual approach works as intended through a small test with "real" LLM calls?
| schema, | ||
| examples, | ||
| lexical_graph_builder, | ||
| existing_graphs[i] if existing_graphs else None, |
There was a problem hiding this comment.
so each chunk is seeing only its caller-supplied existing_graphs[i]. Do we suppose we're accumulating the graphs across previous runs and that's the responsibility of the caller? But anyway, if the same new entity appears in chunk 0 and chunk 5 of the same run, two distinct IDs will be created - we are only trying to deduplicate against externally-provided (prior/persisted) state, never against the current batch? I am just trying to understand the approach and the possible drawbacks. This could be legitimate design choice but I don't think it is clear in the docstring.
| lexical_graph_config (Optional[LexicalGraphConfig], optional): Lexical graph configuration to customize node labels and relationship types in the lexical graph. | ||
| schema (GraphSchema | None): Definition of the schema to guide the LLM in its extraction. | ||
| examples (str): Examples for few-shot learning in the prompt. | ||
| existing_graphs (Optional[list[Neo4jGraph]]): One subgraph per chunk, each containing nodes and relationships already in the knowledge graph that are relevant to that chunk. When provided, the LLM is instructed to reuse their IDs for matching entities instead of creating new ones. Must have the same length as chunks. |
There was a problem hiding this comment.
would this be negatively impacted by nodes IDs being rewritten by update_id (where node ids are re-written to ensure their uniqueness across chunks)?
There was a problem hiding this comment.
No, don't think so.
| existing_nodes=[], | ||
| existing_rels=[], | ||
| ) | ||
| assert "Existing graph entities" not in prompt |
There was a problem hiding this comment.
while all of these unit tests are great to have, it feels a bit difficult to prove the approach and that the LLM is doing the right thing in picking the existing nodes/rels without testing it with real LLM calls on a very small dataset?
Description
This pull request enhances the entity and relation extraction process by allowing the system to incorporate information about existing nodes and relationships in the knowledge graph. When provided, these existing entities are included in the prompt to the language model, instructing it to reuse IDs for matching entities instead of creating duplicates. The changes also ensure that sensitive or unnecessary properties (like embeddings) are excluded from the prompt and add comprehensive tests for the new functionality.
LLMEntityRelationExtractornow accepts an optionalexisting_graphparameter in its main methods (run,run_for_chunk, andextract_for_chunk). When supplied, the extractor serializes existing nodes and relationships (excludingembedding_properties) and includes them in the prompt to guide the language model to reuse entity IDs.ERExtractionTemplateprompt now conditionally adds a section listing existing nodes and relationships, formatted as JSON, and provides clear instructions to the LLM about reusing IDs for existing entities. If no existing entities are provided, this section is omitted.embedding_propertiesare excluded from the prompt.Type of Change
Complexity
Complexity: low
How Has This Been Tested?
Checklist
The following requirements should have been met (depending on the changes in the branch):