Knowledge Graph Extractor

A Python tool for extracting knowledge graphs from unstructured text using LLM-based triple extraction. This tool processes text into Subject-Predicate-Object (SPO) triples that can be used to build knowledge graphs.

Features

Text chunking with configurable overlap
LLM-based triple extraction using OpenAI/DeepSeek API
JSON-formatted output
Automatic pronoun resolution
Error handling and failed chunk tracking
Results export to pandas DataFrame

Requirements

Python 3.8+
OpenAI API key or DeepSeek API credentials
Required Python packages (see requirements.txt)

Installation

Clone the repository:

git clone <repository-url>
cd knowledge-graph-extractor

Create and activate a virtual environment (optional but recommended):

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Configuration

Set your API credentials either through environment variables:

export OPENAI_API_KEY="your-api-key"
export OPENAI_API_BASE="your-api-base-url"  # Optional, for DeepSeek or other providers

Or provide them directly when initializing the extractor:

from extract import KnowledgeGraphExtractor

extractor = KnowledgeGraphExtractor(
    api_key="your-api-key",
    api_base="your-api-base-url",  # Optional
)
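
How the two configuration routes might combine can be sketched as follows. This is a minimal illustration, not the tool's actual code: the helper `resolve_credentials` is hypothetical, and the precedence shown (explicit argument first, environment variable as fallback) is an assumption.

```python
import os
from typing import Optional, Tuple


def resolve_credentials(
    api_key: Optional[str] = None,
    api_base: Optional[str] = None,
) -> Tuple[str, Optional[str]]:
    """Hypothetical helper: explicit arguments win, environment variables are the fallback."""
    key = api_key or os.environ.get("OPENAI_API_KEY")
    base = api_base or os.environ.get("OPENAI_API_BASE")  # optional, may stay None
    if not key:
        raise ValueError("No API key: pass api_key or set OPENAI_API_KEY")
    return key, base
```

Failing early when no key is found gives a clearer error than a rejected API request later on.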

Usage

Basic Usage

from extract import KnowledgeGraphExtractor

# Initialize the extractor
extractor = KnowledgeGraphExtractor()

# Process text
text = """
Your unstructured text here...
"""
results = extractor.process_text(text)

# Get results as DataFrame
df = extractor.get_results_dataframe()
print(df)

Running the Example

The repository includes an example script that runs the extractor on a sample text about Albert Einstein:

python example.py

Customizing Chunk Size and Overlap

extractor = KnowledgeGraphExtractor()
extractor.chunk_size = 200  # Adjust chunk size
extractor.overlap = 40     # Adjust overlap
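
To see how the two settings interact, here is a minimal sketch of overlapping chunking. It assumes chunks are measured in words, which this README does not state, and the function `chunk_words` is hypothetical rather than part of `extract.py`.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Under that assumption, a chunk size of 200 with an overlap of 40 means each chunk starts 160 words after the previous one. A larger overlap gives the model more shared context at chunk boundaries at the cost of more API calls.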

Handling Results

# `extractor` and `text` are defined as in Basic Usage
results = extractor.process_text(text)

# Access extracted triples
triples = results['triples']

# Check for failed chunks
failed_chunks = results['failed_chunks']
for chunk in failed_chunks:
    print(f"Chunk {chunk['chunk_number']} failed: {chunk['error']}")
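
Because chunks overlap, the same triple may be extracted from two neighbouring chunks. The sketch below shows one way to remove such duplicates after extraction. The helper `dedupe_triples` is hypothetical, and treating triples as equal when they match case-insensitively is an assumption that may not suit every dataset.

```python
from typing import Dict, List


def dedupe_triples(triples: List[Dict]) -> List[Dict]:
    """Keep the first occurrence of each (subject, predicate, object),
    ignoring case and surrounding whitespace."""
    seen = set()
    unique = []
    for t in triples:
        key = tuple(str(t[k]).strip().lower() for k in ("subject", "predicate", "object"))
        if key not in seen:
            seen.add(key)
            unique.append(t)
    return unique
```

It would be called as `dedupe_triples(results['triples'])`. The first occurrence is kept, so its `chunk` value points at the earliest chunk in which the triple appeared.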

Testing

The repository includes unit tests for the core functionality:

python -m unittest test_extract.py

The tests cover:

Text chunking functionality
Configuration validation
Empty text handling
Results DataFrame structure

Output Format

The extracted triples are returned in the following format:

{
    'triples': [
        {
            'subject': 'entity1',
            'predicate': 'relation',
            'object': 'entity2',
            'chunk': 1  # chunk number where this triple was found
        },
        # ... more triples
    ],
    'failed_chunks': [
        {
            'chunk_number': 2,
            'error': 'error message',
            'response': 'raw response'
        },
        # ... any failed chunks
    ]
}
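
As a rough illustration of turning this structure into a graph, the sketch below groups triples into an adjacency mapping using only the standard library. The function `build_adjacency` and the sample data are made up for illustration and are not part of the tool.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_adjacency(triples: List[Dict]) -> Dict[str, List[Tuple[str, str]]]:
    """Map each subject to a list of (predicate, object) edges."""
    graph = defaultdict(list)
    for t in triples:
        graph[t["subject"]].append((t["predicate"], t["object"]))
    return dict(graph)


# Illustrative data in the documented output shape (not real tool output)
sample = {
    "triples": [
        {"subject": "albert einstein", "predicate": "developed",
         "object": "theory of relativity", "chunk": 1},
        {"subject": "albert einstein", "predicate": "born in",
         "object": "ulm", "chunk": 1},
    ],
    "failed_chunks": [],
}
graph = build_adjacency(sample["triples"])
```

Each key is a node, and each `(predicate, object)` pair is a labelled outgoing edge, which is enough to load the result into most graph libraries or databases.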

Project Structure

.
├── extract.py          # Main implementation
├── example.py          # Example usage
├── test_extract.py     # Unit tests
├── requirements.txt    # Dependencies
├── README.md           # Documentation
├── LICENSE             # MIT License
└── .gitignore          # Git ignore file

Limitations

API rate limits depend on your OpenAI/DeepSeek account
Processing large texts can be slow, since extraction relies on API calls
Extraction quality depends on the LLM used
Text chunks are processed independently, so relationships that span chunks may be missed
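
One common way to cope with rate limits is to retry with exponential backoff. The sketch below is generic and is not taken from `extract.py`: the wrapper `call_with_backoff` is hypothetical, and it retries on any exception, whereas real code would catch only the client library's rate-limit error.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with exponentially growing, jittered delays."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # 1x, 2x, 4x ... the base delay, plus jitter to avoid synchronized retries
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
    raise ValueError("max_attempts must be at least 1")
```

Whether `process_text` raises on a rate-limit error or records it under `failed_chunks` is not stated here, so a wrapper like this would need to be placed around whichever call actually raises.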

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.
