This project implements a small, testable benchmarking framework for rare disease diagnosis tools, comparing a traditional algorithmic method (Exomiser-style rankings) with large language model (LLM)–based disease prediction on toy rare-disease cases.
There are two LLM modes:
- Mock LLM results – a simple CSV file with pre-defined rankings (for offline testing).
- Live LLM results – calls a real OpenAI model on the phenotype text, writes rankings to CSV, then benchmarks those results.
The focus is on:
- Structuring the benchmark in a reproducible way.
- Using shared metrics: Top-1 accuracy and Mean Reciprocal Rank (MRR).
- Providing a clean foundation that can later plug in real Exomiser output and real LLM predictions.
This is a learning and portfolio project, not a clinical decision-support tool.
All data live under the `data/` directory:

- `cases.csv` – toy rare-disease cases:
  - `case_id` – numeric identifier.
  - `phenotypes` – free-text phenotype description.
  - `true_disease` – the correct diagnosis for that case.
- `exomiser_mock_results.csv` – mock Exomiser-style rankings:
  - `case_id` – links to `cases.csv`.
  - `rank` – 1 = top-ranked (best); higher numbers = lower rank.
  - `disease` – predicted disease name.
- `llm_mock_results.csv` – mock LLM-style rankings with the same columns (`case_id`, `rank`, `disease`). Used for fully offline benchmarking without calling an API.
- `llm_results_live.csv` – live LLM rankings (`case_id`, `rank`, `disease`). This file is generated by the live LLM runner and used in the live benchmark.
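For illustration, a `cases.csv` row could look like the following (hypothetical data, not necessarily the shipped toy cases; the ranking files follow the same pattern with `case_id,rank,disease` columns):

```csv
case_id,phenotypes,true_disease
1,"short stature, intellectual disability, limb abnormalities, distinctive facial features",Cornelia de Lange syndrome
```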
Project layout (simplified):
    exomiser_llm_benchmark/
    ├── data/
    │   ├── cases.csv
    │   ├── exomiser_mock_results.csv
    │   ├── llm_mock_results.csv
    │   └── llm_results_live.csv
    ├── src/
    │   └── elbench/
    │       ├── __init__.py
    │       ├── data.py
    │       ├── metrics.py
    │       ├── benchmark.py
    │       ├── llm_client.py
    │       └── llm_runner.py
    ├── tests/
    │   └── test_metrics.py
    ├── run_benchmark.py
    ├── run_llm_runner.py
    ├── run_benchmark_live.py
    ├── requirements.txt
    └── README.md
The core logic lives in src/elbench/ and is organised into several small modules.
- `load_cases(path)`
  Loads `cases.csv` into a DataFrame. Expects columns: `case_id`, `phenotypes`, `true_disease`.
- `load_rankings(path)`
  Loads tool outputs (Exomiser, LLM) from CSV. Expects: `case_id`, `rank`, `disease`.

This separation keeps input/output logic simple and makes it easy to swap datasets.
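The loading contract can be sketched in a few lines. This is a dependency-free illustration using only the standard library; the real `data.py` returns pandas DataFrames, so treat it as a shape sketch, not the actual implementation:

```python
import csv

def load_cases(path):
    """Load toy cases; expects columns case_id, phenotypes, true_disease."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        row["case_id"] = int(row["case_id"])  # normalise the join key
    return rows

def load_rankings(path):
    """Load tool output; expects columns case_id, rank, disease."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        row["case_id"] = int(row["case_id"])
        row["rank"] = int(row["rank"])
    return rows
```

Converting `case_id` (and `rank`) to integers up front avoids silent string-vs-number mismatches when joining cases to rankings later.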
Two standard ranking metrics are implemented:

- `compute_top1_accuracy(cases, rankings)`
  For each case, checks whether the true disease appears at rank 1 in the rankings and returns the fraction of cases where the top prediction is correct.
- `compute_mrr(cases, rankings)`
  Computes Mean Reciprocal Rank (MRR):
  - For each case, find the rank r of the true disease.
  - Use 1/r as the score (or 0 if the disease is not found).
  - Average over all cases.
  MRR gives partial credit when the true disease is near the top even if not at rank 1.

Because the metrics only require (case_id, rank, disease), they can be applied to any ranking-based method: Exomiser, LLMs, or other phenotype–disease matching approaches.
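Both metrics fit in a few lines of plain Python. The sketch below assumes list-of-dicts inputs rather than the DataFrames used by the real `metrics.py`:

```python
def compute_top1_accuracy(cases, rankings):
    """Fraction of cases whose rank-1 prediction equals the true disease."""
    hits = 0
    for case in cases:
        top = [r["disease"] for r in rankings
               if r["case_id"] == case["case_id"] and r["rank"] == 1]
        if top and top[0] == case["true_disease"]:
            hits += 1
    return hits / len(cases)

def compute_mrr(cases, rankings):
    """Mean Reciprocal Rank: average of 1/rank of the true disease (0 if absent)."""
    total = 0.0
    for case in cases:
        ranks = [r["rank"] for r in rankings
                 if r["case_id"] == case["case_id"]
                 and r["disease"] == case["true_disease"]]
        if ranks:
            total += 1.0 / min(ranks)
    return total / len(cases)
```

For example, with two cases whose true diseases sit at ranks 1 and 2 respectively, MRR = (1 + 1/2) / 2 = 0.75, while Top-1 accuracy is 0.5.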
`run_benchmark(cases_path, exomiser_path, llm_path)`:

- Loads cases and tool outputs.
- Computes Top-1 accuracy and MRR for Exomiser and the LLM.
- Prints summary metrics to the console.

This function is used by:

- `run_benchmark.py` – benchmark using mock Exomiser and mock LLM results.
- `run_benchmark_live.py` – benchmark using mock Exomiser and live LLM results.
`suggest_diseases_from_phenotypes(phenotypes_text, k=5, model_name="gpt-4o-mini")`:

- Constructs a prompt for the LLM describing the patient’s phenotypes.
- Calls the OpenAI API via the official client.
- Parses a numbered list of disease names from the model’s response.
- Returns a list of the top k disease names in ranked order.

The LLM client expects the OpenAI API key to be set as an environment variable:

    export OPENAI_API_KEY="your_real_openai_key_here"
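The parsing step can be illustrated without any API call. A minimal sketch, assuming the model replies with a plain numbered list; the helper name `parse_numbered_list` is hypothetical and not necessarily how `llm_client.py` implements it:

```python
import re

def parse_numbered_list(text, k=5):
    """Extract up to k names from lines like '1. Cornelia de Lange syndrome'."""
    diseases = []
    for line in text.splitlines():
        match = re.match(r"\s*\d+[.)]\s*(.+)", line)  # accept '1.' or '1)'
        if match:
            diseases.append(match.group(1).strip())
    return diseases[:k]
```

Keeping the parsing separate from the API call makes it easy to unit-test against canned model outputs.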
`run_llm_over_cases(cases_path, out_path, top_k=5)`:

- Loads cases from `cases.csv`.
- For each case, calls `suggest_diseases_from_phenotypes`.
- Collects the ranked disease predictions and writes them to `llm_results_live.csv` with columns `case_id`, `rank`, `disease`.

`run_llm_runner.py` is a small entry script that calls `llm_runner.main()` so the whole process can be run with one command.
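The runner loop amounts to "predict, then flatten to CSV rows". A minimal sketch with the LLM call stubbed out via a `suggest_fn` argument (a hypothetical stand-in for `suggest_diseases_from_phenotypes`, so the sketch runs offline):

```python
import csv

def run_llm_over_cases(cases, out_path, suggest_fn, top_k=5):
    """Write one (case_id, rank, disease) row per prediction, rank starting at 1."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["case_id", "rank", "disease"])
        for case in cases:
            diseases = suggest_fn(case["phenotypes"], k=top_k)
            for rank, disease in enumerate(diseases, start=1):
                writer.writerow([case["case_id"], rank, disease])
```

Because the output uses the same `case_id,rank,disease` schema as the mock files, the resulting CSV drops straight into the existing benchmark.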
- Clone the repository:

      git clone <this-repo-url>
      cd exomiser_llm_benchmark

- Create and activate a virtual environment:

      python -m venv .venv
      source .venv/bin/activate   # Windows PowerShell: .venv\Scripts\Activate.ps1

- Install dependencies:

      pip install -r requirements.txt

  This will install:

  - pandas
  - numpy
  - pytest
  - openai

- Set your OpenAI API key (required only for live LLM mode):

  On macOS/Linux:

      export OPENAI_API_KEY="your_real_openai_key_here"

  On Windows PowerShell:

      $Env:OPENAI_API_KEY="your_real_openai_key_here"
To confirm that the metric functions behave correctly on the mock data:
    pytest
You should see something like:
    tests/test_metrics.py .                                              [100%]
indicating that the basic tests pass.
This mode does not require an API key and uses only the mock CSV files.
Run:
    python run_benchmark.py
This compares mock Exomiser and mock LLM results using Top-1 accuracy and MRR.
Example output with the current toy mock data:
    Number of cases: 5
    Exomiser (mock) performance:
      Top-1 accuracy: 0.60
      MRR: 0.77
    LLM (mock) performance:
      Top-1 accuracy: 0.80
      MRR: 0.90
This mode calls the OpenAI API for each case in cases.csv. Make sure OPENAI_API_KEY is set in your environment.
Run:
    python run_llm_runner.py
This will:
- Read each case from `data/cases.csv`.
- Print the phenotype description and the diseases suggested by the LLM.
- Write the results to `data/llm_results_live.csv`.
Example console output snippet:
    === Case 1 ===
    Phenotypes: short stature, intellectual disability, limb abnormalities, distinctive facial features
    LLM suggested diseases (in order):
    1. Cornelia de Lange syndrome
    2. ...
    ...
    Saved live LLM results to data/llm_results_live.csv
After generating llm_results_live.csv, run:
    python run_benchmark_live.py
This compares mock Exomiser output with the live LLM rankings:
    Number of cases: 5
    Exomiser (mock) performance:
      Top-1 accuracy: 0.60
      MRR: 0.77
    LLM (live) performance:
      Top-1 accuracy: <depends on the model output>
      MRR: <depends on the model output>
The exact values for the live LLM will depend on the model, prompt, and current API behaviour, but the framework and metrics remain the same.
With the current toy mock setup (5 cases):
- Exomiser (mock)
  - Top-1 accuracy: 0.60
  - MRR: 0.77
- LLM (mock)
  - Top-1 accuracy: 0.80
  - MRR: 0.90
These results are fully reproducible because they come from static CSV files and do not involve live API calls.
When using run_llm_runner.py followed by run_benchmark_live.py:
- Exomiser metrics remain fixed because they come from `exomiser_mock_results.csv`.
- LLM (live) metrics are computed from the real OpenAI model’s ranking behaviour on the phenotype text.
By adjusting the prompt, the number of candidates k, or the model name in llm_client.py, you can observe how LLM performance changes under the same evaluation metrics.
This project is intentionally small and self-contained, but it highlights several important ideas for real-world rare disease benchmarking:
- Reproducible data flow
  All inputs and outputs are explicit CSV files under `data/`, making it straightforward to version, share and swap datasets or tool outputs.
- Shared metrics across methods
  Exomiser-style tools, classical IR methods and LLM-based systems are all evaluated with the same metrics (Top-1 accuracy and MRR), allowing fair comparisons between very different approaches.
- Separation of concerns
  Data loading, metric computation, benchmarking logic and LLM integration are cleanly separated into different modules:
  - `data.py` – loading cases and rankings.
  - `metrics.py` – Top-1 and MRR computations.
  - `benchmark.py` – coordinates evaluations.
  - `llm_client.py` – all LLM-specific logic.
  - `llm_runner.py` – connects phenotypes to ranked LLM predictions.
- Mock vs live modes
  The mock CSVs make the project fully runnable without any API key, while the live mode demonstrates how to integrate a real model into the same benchmark in a controlled way.
Possible extensions include:
- Replacing `exomiser_mock_results.csv` with real Exomiser output for a curated case set.
- Adding additional baselines (e.g. TF-IDF or BM25 phenotype–disease matching) into the same framework.
- Logging all metrics to a consolidated results table and creating plots to compare methods or case subsets.
- Refining prompts and parsing rules in `llm_client.py` to make the LLM predictions more robust and clinically sensible.
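To illustrate how another baseline could plug into the same ranking contract, here is a toy token-overlap ranker (deliberately simpler than TF-IDF or BM25; `disease_profiles`, a mapping from disease name to a textual phenotype profile, is a hypothetical input not present in the project):

```python
def rank_by_token_overlap(phenotypes_text, disease_profiles, k=5):
    """Rank diseases by the number of lowercase tokens shared with the case text."""
    case_tokens = set(phenotypes_text.lower().replace(",", " ").split())
    scored = []
    for disease, profile in disease_profiles.items():
        overlap = len(case_tokens & set(profile.lower().replace(",", " ").split()))
        scored.append((overlap, disease))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, ties alphabetical
    return [disease for _, disease in scored[:k]]
```

Any function returning a ranked list like this can be flattened into `case_id,rank,disease` rows and scored with the same Top-1 and MRR metrics.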
Overall, this repository acts as a benchmarking scaffold for rare disease diagnosis tools, showing how to compare traditional pipelines like Exomiser with modern LLM-based approaches in a transparent, testable and extensible way.