This repo is an experimental agent-based workflow designed to search for disease-causing variants in a patient's genomic sequencing results based on a description of the patient's symptoms.
See my blog posts for additional discussion:
- Can an AI agent help diagnose genetic diseases?
- Implementing the HPO Agent
- Implementing the Gene Agent
- Implementing the Variant Agent: Part 1
- Implementing the Variant Agent: Part 2
Here's what you'll need to run this workflow:
- An OpenAI API key
- Annotated variants from Illumina Nirvana
- An hp.obo file downloaded from the HPO website
- A phenotype_to_genes.txt file downloaded from the HPO website
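For orientation, hp.obo is a plain-text OBO file in which each phenotype is a [Term] stanza with id, name, and optional synonym lines. The sketch below uses only the standard library to pull those fields out. It is not the parser used by generate-hpo-index.py, and the function name iter_hpo_terms is made up for illustration.

```python
import re

# Captures the quoted text of an OBO synonym line, for example:
#   synonym: "Some synonym" EXACT []
SYNONYM_RE = re.compile(r'^synonym:\s*"((?:[^"\\]|\\.)*)"')


def _new_term():
    return {"id": None, "name": None, "synonyms": [], "obsolete": False}


def _is_usable(term):
    # Keep only complete, non-obsolete [Term] stanzas.
    return term is not None and term["id"] and not term["obsolete"]


def _public_view(term):
    return {"id": term["id"], "name": term["name"], "synonyms": term["synonyms"]}


def iter_hpo_terms(lines):
    """Yield a dict with id, name, and synonyms for each [Term] stanza.

    `lines` is any iterable of text lines, such as an open file handle.
    This is an illustrative sketch, not the repo's own implementation.
    """
    term = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            # A new stanza starts here, so emit the one just finished.
            if _is_usable(term):
                yield _public_view(term)
            # Only [Term] stanzas are collected; [Typedef] and others are skipped.
            term = _new_term() if line == "[Term]" else None
            continue
        if term is None:
            continue  # file header or a non-Term stanza
        if line.startswith("id: "):
            term["id"] = line[4:]
        elif line.startswith("name: "):
            term["name"] = line[6:]
        elif line.startswith("is_obsolete: true"):
            term["obsolete"] = True
        else:
            match = SYNONYM_RE.match(line)
            if match:
                term["synonyms"].append(match.group(1))
    # The last stanza has no following header line to trigger emission.
    if _is_usable(term):
        yield _public_view(term)
```

Passing an open handle for hp.obo (opened with encoding="utf-8") to iter_hpo_terms should yield one dictionary per active term, which is a quick way to check that the download is complete before building the index.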
git clone https://github.com/jeffjjohnston/genomics-rare-disease-ai-workflow.git
cd genomics-rare-disease-ai-workflow
uv venv -p python3.12
source .venv/bin/activate
uv pip install -r requirements.txt
echo OPENAI_API_KEY=YOUR_API_KEY > .env

Build the HPO terms vector database from the downloaded hp.obo file:
mkdir -p resources/hpo_agent
python generate-hpo-index.py \
--model cambridgeltl/SapBERT-from-PubMedBERT-fulltext \
--obo_file hp.obo \
--index_base resources/hpo_agent/SapBERT-PubMedBERT_hpo

Create a new DuckDB database from the Nirvana JSON:
mkdir patients
python add-variants.py \
--json /path/to/variants.json.gz \
--db patients/variants.duckdb

For instructions on running Nirvana on a VCF, see this guide.
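Before loading a Nirvana JSON file into DuckDB, it can help to confirm that the file looks as expected. The sketch below assumes the layout documented for Nirvana output: a single gzipped JSON object with top-level header, positions, and genes sections. It is independent of add-variants.py, and the function name summarize_nirvana_json is hypothetical. It loads the whole file into memory, so it suits exome-sized files better than whole genomes.

```python
import gzip
import json
from collections import Counter


def summarize_nirvana_json(path):
    """Return basic counts from a gzipped Nirvana-style JSON file.

    Assumes top-level "header", "positions", and "genes" keys; missing
    keys are treated as empty rather than raising an error.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        data = json.load(handle)

    positions = data.get("positions", [])
    # Count annotated positions per chromosome as a quick sanity check.
    per_chromosome = Counter(p.get("chromosome", "unknown") for p in positions)

    return {
        "samples": data.get("header", {}).get("samples", []),
        "position_count": len(positions),
        "gene_count": len(data.get("genes", [])),
        "positions_per_chromosome": dict(per_chromosome),
    }
```

For a trio, the samples list would be expected to contain three sample names; an empty list or a position count of zero suggests the annotation step did not complete.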
First, describe your patient's symptoms in a plain text file (for example, patient_symptoms.txt).
Run the workflow:
python run-workflow.py \
--symptoms patient_symptoms.txt \
--hpo-db resources/hpo_agent/SapBERT-PubMedBERT_hpo.json.gz \
--phenotypes-to-gene-file phenotype_to_genes.txt \
--variant-db patients/variants.duckdb \
--output results.txt

To run the two example scenarios from the blog posts linked above, you can use the Colombian trio data, which I've made available both as the VCF from the International Genome Sample Resource site and as the Nirvana-produced JSON. The pathogenic variants added to the variant database for the two scenarios are in the examples/ directory of this repo.
Colombian trio VCF: https://downloads.newmatter.net/genomics-rare-disease-ai-workflow/colombian_trio.exome.vcf.gz
Colombian trio Nirvana JSON: https://downloads.newmatter.net/genomics-rare-disease-ai-workflow/colombian_trio.exome.json.gz
To run the first scenario:
mkdir -p patients
wget https://downloads.newmatter.net/genomics-rare-disease-ai-workflow/colombian_trio.exome.json.gz
# Create a new database with the full exome variants
python add-variants.py \
--json colombian_trio.exome.json.gz \
--db patients/colombian_trio.duckdb
# Inject the single annotated pathogenic variant
python add-variants.py \
--json examples/clinvar_143754.json.gz \
--db patients/colombian_trio.duckdb
# Run the workflow
python run-workflow.py \
--symptoms examples/example_case_1.md \
--hpo-db resources/hpo_agent/SapBERT-PubMedBERT_hpo.json.gz \
--phenotypes-to-gene-file phenotype_to_genes.txt \
--variant-db patients/colombian_trio.duckdb \
--output example_case_1_results.txt

For the second scenario, add the examples/BCKDHA_variant.json.gz variant and use the examples/example_case_2.md symptom description file.
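Because the two scenarios differ only in the injected variant file and the symptom description, the steps can be scripted. The sketch below builds the same add-variants.py and run-workflow.py invocations shown above for either scenario. The helper names, the per-scenario database and output file names, and the choice to start each scenario from a fresh database are assumptions for illustration, not something the repo prescribes.

```python
import subprocess
import sys

# Paths taken from the commands in this README.
EXOME_JSON = "colombian_trio.exome.json.gz"
HPO_DB = "resources/hpo_agent/SapBERT-PubMedBERT_hpo.json.gz"
PHENOTYPE_FILE = "phenotype_to_genes.txt"


def scenario_commands(name, variant_json, symptoms):
    """Build the three commands for one scenario without running them.

    The database path and output file name are derived from `name`;
    that naming scheme is an assumption made for this sketch.
    """
    db = f"patients/{name}.duckdb"
    add_variants = [sys.executable, "add-variants.py"]
    return [
        # 1. Create a database with the full exome variants.
        add_variants + ["--json", EXOME_JSON, "--db", db],
        # 2. Inject the scenario's pathogenic variant.
        add_variants + ["--json", variant_json, "--db", db],
        # 3. Run the workflow against that database.
        [
            sys.executable, "run-workflow.py",
            "--symptoms", symptoms,
            "--hpo-db", HPO_DB,
            "--phenotypes-to-gene-file", PHENOTYPE_FILE,
            "--variant-db", db,
            "--output", f"{name}_results.txt",
        ],
    ]


def run_scenario(name, variant_json, symptoms):
    """Run the scenario's commands in order, stopping at the first failure."""
    for command in scenario_commands(name, variant_json, symptoms):
        subprocess.run(command, check=True)
```

From the repo root with the virtual environment active, calling run_scenario("example_case_2", "examples/BCKDHA_variant.json.gz", "examples/example_case_2.md") would run the second scenario end to end. Building the commands separately from running them makes it easy to print and review them first.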