jeffjjohnston/genomics-rare-disease-ai-workflow

Overview

This repo contains an experimental agent-based workflow that searches a patient's genomic sequencing results for disease-causing variants, guided by a description of the patient's symptoms.

See my blog posts for additional discussion:

Getting started

Here's what you'll need to run this workflow:

Set up the Python environment

git clone https://github.com/jeffjjohnston/genomics-rare-disease-ai-workflow.git
cd genomics-rare-disease-ai-workflow
uv venv -p python3.12
source .venv/bin/activate
uv pip install -r requirements.txt
echo OPENAI_API_KEY=YOUR_API_KEY > .env
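
Before running anything that calls the OpenAI API, it can help to confirm that the `.env` file written above holds a real key rather than the `YOUR_API_KEY` placeholder. The sketch below is a standalone check that uses only the standard library. It does not reflect how the workflow itself loads `.env` (that is not described here), and the helper names `read_env_file` and `api_key_is_set` are hypothetical.

```python
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def api_key_is_set(env):
    """True when OPENAI_API_KEY is present and is not the placeholder."""
    key = env.get("OPENAI_API_KEY", "")
    return bool(key) and key != "YOUR_API_KEY"


if __name__ == "__main__":
    env = read_env_file() if Path(".env").exists() else {}
    print("OPENAI_API_KEY configured:", api_key_is_set(env))
```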

Generate required resources

Build the HPO terms vector database from the downloaded hp.obo file:

mkdir -p resources/hpo_agent
python generate-hpo-index.py \
    --model cambridgeltl/SapBERT-from-PubMedBERT-fulltext \
    --obo_file hp.obo \
    --index_base resources/hpo_agent/SapBERT-PubMedBERT_hpo
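
To get a feel for the input to this step, here is a minimal sketch that lists the term IDs and names found in an OBO file. It is not the logic of `generate-hpo-index.py`, which is not shown here. It only assumes the standard OBO layout of `[Term]` stanzas with `id:`, `name:`, and optional `is_obsolete:` tags, and the function names are hypothetical.

```python
from pathlib import Path


def _keep(stanza, terms):
    """Append (id, name) when the stanza is a complete, non-obsolete term."""
    if (
        stanza
        and "id" in stanza
        and "name" in stanza
        and stanza.get("is_obsolete") != "true"
    ):
        terms.append((stanza["id"], stanza["name"]))


def parse_obo_terms(lines):
    """Return a list of (term_id, name) for non-obsolete [Term] stanzas."""
    terms = []
    current = None  # dict for the stanza being read, None outside [Term]
    for raw_line in lines:
        line = raw_line.strip()
        if line.startswith("["):
            # A new stanza begins, so store the previous [Term] if usable.
            _keep(current, terms)
            current = {} if line == "[Term]" else None
        elif current is not None and ": " in line:
            tag, _, value = line.partition(": ")
            if tag in ("id", "name", "is_obsolete"):
                current.setdefault(tag, value)
    _keep(current, terms)  # the last stanza has no following header
    return terms


if __name__ == "__main__":
    obo_path = Path("hp.obo")
    if obo_path.exists():
        with obo_path.open(encoding="utf-8") as handle:
            terms = parse_obo_terms(handle)
        print(f"{len(terms)} usable terms, e.g. {terms[:3]}")
```

Each name (and, in a fuller version, each synonym) is the kind of text an embedding model such as SapBERT would encode so that free-text symptoms can be matched to HPO terms.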

Create a new DuckDB database from the Nirvana JSON:

mkdir -p patients
python add-variants.py \
    --json /path/to/variants.json.gz \
    --db patients/variants.duckdb
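
A quick sanity check on the Nirvana JSON before loading it can catch a truncated or mismatched file early. The sketch below counts annotated positions per chromosome. It assumes the usual Nirvana layout (a top-level `positions` array whose entries carry a `chromosome` field) and reads the whole file into memory, which is fine for an exome but may be heavy for a genome. It is independent of `add-variants.py`, and the function name is hypothetical.

```python
import gzip
import json
from collections import Counter
from pathlib import Path


def count_positions_by_chromosome(json_gz_path):
    """Count entries of the top-level "positions" array per chromosome."""
    with gzip.open(json_gz_path, "rt", encoding="utf-8") as handle:
        document = json.load(handle)
    return Counter(
        position["chromosome"] for position in document.get("positions", [])
    )


if __name__ == "__main__":
    path = Path("/path/to/variants.json.gz")  # placeholder path from above
    if path.exists():
        for chromosome, count in sorted(count_positions_by_chromosome(path).items()):
            print(chromosome, count)
```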

For instructions on running Nirvana on a VCF, see this guide.

Run the workflow

First, describe your patient's symptoms in a plain text file (for example, patient_symptoms.txt).

Run the workflow:

python run-workflow.py \
    --symptoms patient_symptoms.txt \
    --hpo-db resources/hpo_agent/SapBERT-PubMedBERT_hpo.json.gz \
    --phenotypes-to-gene-file phenotype_to_genes.txt \
    --variant-db patients/variants.duckdb \
    --output results.txt
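
Because the run depends on several files produced in earlier steps, a pre-flight check can save a failed run. The sketch below verifies that each input named in the command above exists and then prints the command. The flags and default paths are copied from that command; the helper names `missing_inputs` and `build_command` are hypothetical, and the script only prints the command rather than executing it.

```python
import shlex
from pathlib import Path

# Flags and paths taken from the run-workflow.py invocation above.
DEFAULT_INPUTS = {
    "--symptoms": "patient_symptoms.txt",
    "--hpo-db": "resources/hpo_agent/SapBERT-PubMedBERT_hpo.json.gz",
    "--phenotypes-to-gene-file": "phenotype_to_genes.txt",
    "--variant-db": "patients/variants.duckdb",
}


def missing_inputs(inputs):
    """Return 'flag path' strings for every input file that does not exist."""
    return [
        f"{flag} {path}" for flag, path in inputs.items() if not Path(path).is_file()
    ]


def build_command(inputs, output="results.txt"):
    """Assemble the argument list for run-workflow.py."""
    command = ["python", "run-workflow.py"]
    for flag, path in inputs.items():
        command += [flag, str(path)]
    return command + ["--output", output]


if __name__ == "__main__":
    missing = missing_inputs(DEFAULT_INPUTS)
    if missing:
        print("Missing inputs:")
        for item in missing:
            print("  " + item)
    else:
        print(shlex.join(build_command(DEFAULT_INPUTS)))
```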

Example data

If you want to run the two example scenarios from the blog posts linked above, I've made the Colombian trio data available in two forms: the VCF from the International Genome Sample Resource site and the Nirvana-produced JSON. The pathogenic variants added to the variant database for the two scenarios are in the examples/ directory of this repo.

Colombian trio VCF: https://downloads.newmatter.net/genomics-rare-disease-ai-workflow/colombian_trio.exome.vcf.gz
Colombian trio Nirvana JSON: https://downloads.newmatter.net/genomics-rare-disease-ai-workflow/colombian_trio.exome.json.gz

To run the first scenario:

mkdir -p patients
wget https://downloads.newmatter.net/genomics-rare-disease-ai-workflow/colombian_trio.exome.json.gz

# Create a new database with the full exome variants
python add-variants.py \
    --json colombian_trio.exome.json.gz \
    --db patients/colombian_trio.duckdb

# Inject the single annotated pathogenic variant
python add-variants.py \
    --json examples/clinvar_143754.json.gz \
    --db patients/colombian_trio.duckdb

# Run the workflow
python run-workflow.py \
    --symptoms examples/example_case_1.md \
    --hpo-db resources/hpo_agent/SapBERT-PubMedBERT_hpo.json.gz \
    --phenotypes-to-gene-file phenotype_to_genes.txt \
    --variant-db patients/colombian_trio.duckdb \
    --output example_case_1_results.txt

For the second scenario, add the examples/BCKDHA_variant.json.gz variant and use the examples/example_case_2.md symptom description file.
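
The second scenario follows the same three steps as the first, so the commands can be generated from the two files that differ. The sketch below prints them for review. It assumes a fresh database per scenario so that only one injected variant is present at a time; that choice, the database path `patients/colombian_trio_case_2.duckdb`, the output name `example_case_2_results.txt`, and the function name `scenario_commands` are all illustrative assumptions.

```python
import shlex


def scenario_commands(variant_json, symptoms_file, db_path, output_path):
    """Return the three commands for one example scenario as argument lists."""
    exome_json = "colombian_trio.exome.json.gz"
    return [
        # 1. Load the full exome variants into the database.
        ["python", "add-variants.py", "--json", exome_json, "--db", db_path],
        # 2. Inject the scenario's pathogenic variant.
        ["python", "add-variants.py", "--json", variant_json, "--db", db_path],
        # 3. Run the workflow against that database.
        [
            "python", "run-workflow.py",
            "--symptoms", symptoms_file,
            "--hpo-db", "resources/hpo_agent/SapBERT-PubMedBERT_hpo.json.gz",
            "--phenotypes-to-gene-file", "phenotype_to_genes.txt",
            "--variant-db", db_path,
            "--output", output_path,
        ],
    ]


if __name__ == "__main__":
    commands = scenario_commands(
        variant_json="examples/BCKDHA_variant.json.gz",
        symptoms_file="examples/example_case_2.md",
        db_path="patients/colombian_trio_case_2.duckdb",  # hypothetical name
        output_path="example_case_2_results.txt",  # hypothetical name
    )
    for command in commands:
        print(shlex.join(command))
```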
