
Exomiser vs LLM Benchmark

Overview

This project implements a small, testable benchmarking framework for rare disease diagnosis tools, comparing a traditional algorithmic method (Exomiser-style rankings) with large language model (LLM)–based disease prediction on toy rare-disease cases.

There are two LLM modes:

  • Mock LLM results – a simple CSV file with pre-defined rankings (for offline testing).
  • Live LLM results – calls a real OpenAI model on the phenotype text, writes rankings to CSV, then benchmarks those results.

The focus is on:

  • Structuring the benchmark in a reproducible way.
  • Using shared metrics: Top-1 accuracy and Mean Reciprocal Rank (MRR).
  • Providing a clean foundation that can later plug in real Exomiser output and real LLM predictions.

This is a learning and portfolio project, not a clinical decision-support tool.


Dataset

All data live under the data/ directory:

  • cases.csv – toy rare-disease cases:

    • case_id – numeric identifier.
    • phenotypes – free-text phenotype description.
    • true_disease – the correct diagnosis for that case.
  • exomiser_mock_results.csv – mock Exomiser-style rankings:

    • case_id – links to cases.csv.
    • rank – 1 is the top-ranked (best) prediction; larger values indicate lower-ranked predictions.
    • disease – predicted disease name.
  • llm_mock_results.csv – mock LLM-style rankings with the same columns:

    • case_id, rank, disease. Used for fully offline benchmarking without calling an API.
  • llm_results_live.csv – live LLM rankings:

    • case_id, rank, disease. This file is generated by the live LLM runner and used in the live benchmark.
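For concreteness, all three rankings files share a long format in which each case contributes one row per ranked candidate. Illustrative rows (hypothetical, not the actual file contents) might look like:

```
case_id,rank,disease
1,1,Cornelia de Lange syndrome
1,2,Rubinstein-Taybi syndrome
2,1,Marfan syndrome
2,2,Loeys-Dietz syndrome
```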

Project layout (simplified):

exomiser_llm_benchmark/
├── data/
│   ├── cases.csv
│   ├── exomiser_mock_results.csv
│   ├── llm_mock_results.csv
│   └── llm_results_live.csv
├── src/
│   └── elbench/
│       ├── __init__.py
│       ├── data.py
│       ├── metrics.py
│       ├── benchmark.py
│       ├── llm_client.py
│       └── llm_runner.py
├── tests/
│   └── test_metrics.py
├── run_benchmark.py
├── run_llm_runner.py
├── run_benchmark_live.py
├── requirements.txt
└── README.md

Methods

The core logic lives in src/elbench/ and is organised into several small modules.

Data loading (data.py)

  • load_cases(path)
    Loads cases.csv into a DataFrame. Expects columns:

    • case_id, phenotypes, true_disease.
  • load_rankings(path)
    Loads tool outputs (Exomiser, LLM) from CSV. Expects:

    • case_id, rank, disease.

This separation keeps input/output logic simple and makes it easy to swap datasets.
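A minimal sketch of what these two loaders might look like; the column validation shown here is an assumption added for illustration, and the real data.py may differ:

```python
import pandas as pd

# Column contracts described in the Dataset section.
CASE_COLUMNS = ["case_id", "phenotypes", "true_disease"]
RANKING_COLUMNS = ["case_id", "rank", "disease"]


def load_cases(path):
    """Load cases.csv into a DataFrame and check the expected columns exist."""
    df = pd.read_csv(path)
    missing = set(CASE_COLUMNS) - set(df.columns)
    if missing:
        raise ValueError(f"cases file {path} is missing columns: {sorted(missing)}")
    return df


def load_rankings(path):
    """Load a tool's ranked predictions (Exomiser-style or LLM) from CSV."""
    df = pd.read_csv(path)
    missing = set(RANKING_COLUMNS) - set(df.columns)
    if missing:
        raise ValueError(f"rankings file {path} is missing columns: {sorted(missing)}")
    return df
```

Failing fast on missing columns keeps downstream metric code free of schema checks.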

Metrics (metrics.py)

Two standard ranking metrics are implemented:

  • compute_top1_accuracy(cases, rankings)
    For each case, checks whether the true disease appears at rank 1 in the rankings and returns the fraction of cases where the top prediction is correct.

  • compute_mrr(cases, rankings)
    Computes Mean Reciprocal Rank (MRR):

    • For each case, find the rank r of the true disease.
    • Use 1/r as the score (or 0 if the disease is not found).
    • Average over all cases. MRR gives partial credit when the true disease is near the top even if not at rank 1.

Because the metrics only require (case_id, rank, disease), they can be applied to any ranking-based method: Exomiser, LLMs, or other phenotype–disease matching approaches.
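A possible implementation of the two metrics, under the assumption that both inputs are pandas DataFrames with the columns listed above (the real metrics.py may be structured differently):

```python
import pandas as pd


def compute_top1_accuracy(cases, rankings):
    """Fraction of cases whose rank-1 prediction equals true_disease."""
    top1 = rankings[rankings["rank"] == 1]
    merged = cases.merge(top1, on="case_id", how="left")
    correct = (merged["disease"] == merged["true_disease"]).sum()
    return correct / len(cases)


def compute_mrr(cases, rankings):
    """Mean Reciprocal Rank: average of 1/rank of the true disease, 0 if absent."""
    scores = []
    for _, case in cases.iterrows():
        hits = rankings[
            (rankings["case_id"] == case["case_id"])
            & (rankings["disease"] == case["true_disease"])
        ]
        scores.append(1.0 / hits["rank"].min() if not hits.empty else 0.0)
    return sum(scores) / len(scores)
```

For example, with two cases where the true disease sits at rank 1 and rank 2 respectively, Top-1 accuracy is 0.5 and MRR is (1/1 + 1/2) / 2 = 0.75.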

Benchmark runner (benchmark.py)

  • run_benchmark(cases_path, exomiser_path, llm_path)
    • Loads cases and tool outputs.
    • Computes Top-1 accuracy and MRR for Exomiser and LLM.
    • Prints summary metrics to the console.

This function is used by:

  • run_benchmark.py – benchmark using mock Exomiser and mock LLM results.
  • run_benchmark_live.py – benchmark using mock Exomiser and live LLM results.
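The orchestration can be sketched as follows. This self-contained version inlines a small scoring helper and returns the metrics as a dict in addition to printing them, which is an assumption added here for testability; the real benchmark.py reuses the functions from data.py and metrics.py:

```python
import pandas as pd


def _score(cases, rankings):
    """Top-1 accuracy and MRR for one tool's rankings."""
    mrr_scores, top1_hits = [], 0
    for _, case in cases.iterrows():
        hits = rankings[
            (rankings["case_id"] == case["case_id"])
            & (rankings["disease"] == case["true_disease"])
        ]
        best = hits["rank"].min() if not hits.empty else None
        top1_hits += int(best == 1)
        mrr_scores.append(1.0 / best if best else 0.0)
    n = len(cases)
    return top1_hits / n, sum(mrr_scores) / n


def run_benchmark(cases_path, exomiser_path, llm_path):
    """Load cases plus two rankings files and report both metrics per tool."""
    cases = pd.read_csv(cases_path)
    results = {}
    for name, path in [("exomiser", exomiser_path), ("llm", llm_path)]:
        top1, mrr = _score(cases, pd.read_csv(path))
        print(f"{name}: Top-1 accuracy {top1:.2f}, MRR {mrr:.2f}")
        results[name] = {"top1": top1, "mrr": mrr}
    return results
```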

LLM client (llm_client.py)

  • suggest_diseases_from_phenotypes(phenotypes_text, k=5, model_name="gpt-4o-mini")
    • Constructs a prompt for the LLM describing the patient’s phenotypes.
    • Calls the OpenAI API via the official client.
    • Parses a numbered list of disease names from the model’s response.
    • Returns a list of the top k disease names in ranked order.

The LLM client expects the OpenAI API key to be set as an environment variable:

export OPENAI_API_KEY="your_real_openai_key_here"
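A sketch of how the client could be structured; the prompt wording and parsing regex are illustrative assumptions, not the repository's exact text, and the API call follows the standard OpenAI Python client (openai>=1.0):

```python
import os
import re


def build_prompt(phenotypes_text, k=5):
    """Ask the model for a numbered list of k candidate diagnoses."""
    return (
        f"A patient presents with the following phenotypes: {phenotypes_text}.\n"
        f"List the {k} most likely rare-disease diagnoses as a numbered list, "
        "one disease name per line, with no extra commentary."
    )


def parse_numbered_list(text, k=5):
    """Extract disease names from lines like '1. Cornelia de Lange syndrome'."""
    diseases = []
    for line in text.splitlines():
        match = re.match(r"\s*\d+[.)]\s*(.+)", line)
        if match:
            diseases.append(match.group(1).strip())
    return diseases[:k]


def suggest_diseases_from_phenotypes(phenotypes_text, k=5, model_name="gpt-4o-mini"):
    # Imported lazily so offline code paths never require the openai package.
    from openai import OpenAI  # requires OPENAI_API_KEY in the environment

    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    response = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": build_prompt(phenotypes_text, k)}],
    )
    return parse_numbered_list(response.choices[0].message.content, k)
```

Keeping prompt construction and response parsing as separate functions makes them unit-testable without any network call.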

LLM runner (llm_runner.py)

  • run_llm_over_cases(cases_path, out_path, top_k=5)
    • Loads cases from cases.csv.
    • For each case, calls suggest_diseases_from_phenotypes.
    • Collects the ranked disease predictions and writes them to llm_results_live.csv with columns case_id, rank, disease.

run_llm_runner.py is a small entry script that calls llm_runner.main() so the whole process can be run with one command.
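The runner's loop can be sketched as below. The suggest_fn parameter is an assumption added here so the loop can be exercised with a stub instead of a live API call; in the real module the default would be llm_client.suggest_diseases_from_phenotypes:

```python
import csv

import pandas as pd


def run_llm_over_cases(cases_path, out_path, top_k=5, suggest_fn=None):
    """Rank diseases for every case and write long-format results to CSV.

    suggest_fn(phenotypes_text, k) must return a list of disease names
    in ranked order (best first).
    """
    cases = pd.read_csv(cases_path)
    rows = []
    for _, case in cases.iterrows():
        diseases = suggest_fn(case["phenotypes"], top_k)
        # One output row per (case, candidate), ranks starting at 1.
        for rank, disease in enumerate(diseases, start=1):
            rows.append({"case_id": case["case_id"], "rank": rank, "disease": disease})
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["case_id", "rank", "disease"])
        writer.writeheader()
        writer.writerows(rows)
```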


Setup

  1. Clone the repository:

    git clone <this-repo-url>
    cd exomiser_llm_benchmark
    
  2. Create and activate a virtual environment:

    python -m venv .venv
    source .venv/bin/activate          # Windows PowerShell: .venv\Scripts\Activate.ps1
    
  3. Install dependencies:

    pip install -r requirements.txt
    

    This will install:

    • pandas
    • numpy
    • pytest
    • openai
  4. Set your OpenAI API key (required only for live LLM mode):

    On macOS/Linux:

    export OPENAI_API_KEY="your_real_openai_key_here"
    

    On Windows PowerShell:

    $Env:OPENAI_API_KEY="your_real_openai_key_here"
    

How to Run

1. Run tests

To confirm that the metric functions behave correctly on the mock data:

pytest

You should see something like:

tests/test_metrics.py .    [100%]

indicating that the basic tests pass.

2. Benchmark with mock data (offline)

This mode does not require an API key and uses only the mock CSV files.

Run:

python run_benchmark.py

This compares mock Exomiser and mock LLM results using Top-1 accuracy and MRR.

Example output with the current toy mock data:

Number of cases: 5

Exomiser (mock) performance:
  Top-1 accuracy: 0.60
  MRR:            0.77

LLM (mock) performance:
  Top-1 accuracy: 0.80
  MRR:            0.90

3. Generate live LLM rankings

This mode calls the OpenAI API for each case in cases.csv. Make sure OPENAI_API_KEY is set in your environment.

Run:

python run_llm_runner.py

This will:

  • Read each case from data/cases.csv.
  • Print the phenotype description and the diseases suggested by the LLM.
  • Write the results to data/llm_results_live.csv.

Example console output snippet:

=== Case 1 ===
Phenotypes: short stature, intellectual disability, limb abnormalities, distinctive facial features
LLM suggested diseases (in order):
  1. Cornelia de Lange syndrome
  2. ...
  ...

Saved live LLM results to data/llm_results_live.csv

4. Benchmark using live LLM results

After generating llm_results_live.csv, run:

python run_benchmark_live.py

This compares mock Exomiser output with the live LLM rankings:

Number of cases: 5

Exomiser (mock) performance:
  Top-1 accuracy: 0.60
  MRR:            0.77

LLM (live) performance:
  Top-1 accuracy: <depends on the model output>
  MRR:            <depends on the model output>

The exact values for the live LLM will depend on the model, prompt, and current API behaviour, but the framework and metrics remain the same.


Results

Mock benchmark (offline)

With the current toy mock setup (5 cases):

  • Exomiser (mock)

    • Top-1 accuracy: 0.60
    • MRR: 0.77
  • LLM (mock)

    • Top-1 accuracy: 0.80
    • MRR: 0.90

These results are fully reproducible because they come from static CSV files and do not involve live API calls.

Live LLM benchmark

When using run_llm_runner.py followed by run_benchmark_live.py:

  • Exomiser metrics remain fixed because they come from exomiser_mock_results.csv.
  • LLM (live) metrics are computed from the real OpenAI model’s ranking behaviour on the phenotype text.

By adjusting the prompt, the number of candidates k, or the model name in llm_client.py, you can observe how LLM performance changes under the same evaluation metrics.


Discussion

This project is intentionally small and self-contained, but it highlights several important ideas for real-world rare disease benchmarking:

  • Reproducible data flow
    All inputs and outputs are explicit CSV files under data/, making it straightforward to version, share and swap datasets or tool outputs.

  • Shared metrics across methods
    Exomiser-style tools, classical IR methods and LLM-based systems are all evaluated with the same metrics (Top-1 accuracy and MRR), allowing fair comparisons between very different approaches.

  • Separation of concerns
    Data loading, metric computation, benchmarking logic and LLM integration are cleanly separated into different modules:

    • data.py – loading cases and rankings.
    • metrics.py – Top-1 and MRR computations.
    • benchmark.py – coordinates evaluations.
    • llm_client.py – all LLM-specific logic.
    • llm_runner.py – connects phenotypes to ranked LLM predictions.
  • Mock vs live modes
    The mock CSVs make the project fully runnable without any API key, while the live mode demonstrates how to integrate a real model into the same benchmark in a controlled way.

Possible extensions include:

  • Replacing exomiser_mock_results.csv with real Exomiser output for a curated case set.
  • Adding additional baselines (e.g. TF-IDF or BM25 from phenotype–disease matching) into the same framework.
  • Logging all metrics to a consolidated results table and creating plots to compare methods or case subsets.
  • Refining prompts and parsing rules in llm_client.py to make the LLM predictions more robust and clinically sensible.

Overall, this repository acts as a benchmarking scaffold for rare disease diagnosis tools, showing how to compare traditional pipelines like Exomiser with modern LLM-based approaches in a transparent, testable and extensible way.

About

Experimental benchmark for rare disease prioritisation tools, contrasting algorithmic (Exomiser-like) and LLM-based approaches.
