MOSAICX

DIGIT-X Lab · LMU Munich
Turn unstructured medical documents into validated, machine-readable JSON.
Runs locally — no PHI leaves your machine.


How It Works

flowchart LR
    A["PDF / Image / Text"] --> B["Chandra OCR"]
    B --> C["LLM Extraction"]
    C --> D["Structured JSON"]

    style A fill:#B5A89A,stroke:#8a7e72,color:#fff
    style B fill:#E87461,stroke:#c25a49,color:#fff
    style C fill:#E87461,stroke:#c25a49,color:#fff
    style D fill:#B5A89A,stroke:#8a7e72,color:#fff

MOSAICX converts medical documents (radiology reports, pathology summaries, clinical notes) into structured JSON. Define what to extract with a YAML template, point it at your documents, get clean data back. Every field comes with an excerpt citing the source text.

Why MOSAICX? Fully local (no PHI leaves your machine), schema-driven (you define exactly what to extract), VLM-powered OCR via Chandra (handles scans, handwriting, tables), and HIPAA-conformant de-identification built in.

Prerequisites

MOSAICX needs two servers running: an LLM for extraction and Chandra for OCR.

1. LLM Server

We recommend Gemma 4 31B via vLLM:

NVIDIA GPU:

pip install vllm
vllm serve google/gemma-4-31B-it --port 8000 --max-num-seqs 16

Adjust --max-num-seqs to your GPU memory: 16 for 96GB (RTX PRO 6000), 8 for 80GB (A100), 4 for 24GB (4090).

Apple Silicon (Mac M1/M2/M3/M4):

pip install vllm-mlx
vllm-mlx serve mlx-community/gemma-4-31b-it-bf16 --port 8000

2. OCR Server (for PDFs and images)

Chandra is a VLM-based OCR that handles handwriting, tables, and complex layouts. Run it as a vLLM server on a GPU:

Option A -- Docker (easiest):

pip install chandra-ocr
VLLM_API_BASE=http://localhost:8001/v1 chandra_vllm

Option B -- bare vLLM:

vllm serve datalab-to/chandra-ocr-2 --port 8001

Note

Chandra is only needed for PDF/image documents. If you're extracting from .txt or .md files, you can skip this. Without Chandra, MOSAICX falls back to PaddleOCR automatically.

Verify

curl -s http://localhost:8000/v1/models    # LLM server
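The same check can be scripted for both servers. The following is a minimal sketch, assuming each server exposes the OpenAI-compatible /v1/models endpoint (vLLM does) on the ports used above. The helper names model_ids and check_server are illustrative and not part of MOSAICX.

```python
import json
import urllib.request
from typing import List, Optional


def model_ids(payload: dict) -> List[str]:
    """Extract model names from an OpenAI-style /v1/models response."""
    return [entry["id"] for entry in payload.get("data", [])]


def check_server(base_url: str, timeout: float = 5.0) -> Optional[List[str]]:
    """Return the served model IDs, or None if the server cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
            return model_ids(json.load(resp))
    except (OSError, ValueError):
        # URLError and timeouts are OSError subclasses; bad JSON raises ValueError.
        return None


if __name__ == "__main__":
    # Ports follow the commands above: 8000 for the LLM, 8001 for Chandra.
    for name, url in [("LLM", "http://localhost:8000/v1"),
                      ("OCR", "http://localhost:8001/v1")]:
        ids = check_server(url)
        print(f"{name}: {'unreachable' if ids is None else ids}")
```

If the OCR server is intentionally not running (text-only inputs), an "unreachable" result for it is expected.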

Tip

Any OpenAI-compatible LLM server works (Ollama, llama.cpp, SGLang). vLLM + Gemma 4 31B is what we test against.

Install

python -m venv ~/.mosaicx-venv
source ~/.mosaicx-venv/bin/activate
pip install git+https://github.com/DIGIT-X-Lab/MOSAICX.git

With uv (faster):

uv venv ~/.mosaicx-venv
source ~/.mosaicx-venv/bin/activate
uv pip install git+https://github.com/DIGIT-X-Lab/MOSAICX.git

Then create a .env file in your working directory:

MOSAICX_LM=openai/google/gemma-4-31B-it
MOSAICX_API_BASE=http://localhost:8000/v1
MOSAICX_API_KEY=not-needed
MOSAICX_CHANDRA_SERVER_URL=http://localhost:8001/v1

MOSAICX reads this automatically. Check everything works:

mosaicx doctor
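If mosaicx doctor reports a configuration problem, a quick look at the .env file itself can help. The sketch below parses plain KEY=VALUE lines and reports which of the keys from the example above are missing or empty. It is not how MOSAICX loads its settings, and treating these three keys as required is an assumption made for illustration.

```python
from pathlib import Path
from typing import Dict, List

# Keys taken from the .env example above; treating them as required is an assumption.
EXPECTED_KEYS = ("MOSAICX_LM", "MOSAICX_API_BASE", "MOSAICX_API_KEY")


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip("\"'")
    return settings


def missing_keys(settings: Dict[str, str]) -> List[str]:
    """Return the expected keys that are absent or empty."""
    return [key for key in EXPECTED_KEYS if not settings.get(key)]


if __name__ == "__main__":
    env_file = Path(".env")
    settings = parse_env(env_file.read_text()) if env_file.exists() else {}
    print("missing:", missing_keys(settings) or "none")
```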

Three Things You Can Do

1. Create a Template

Tell MOSAICX what to extract using natural language:

mosaicx template create --describe "chest CT with nodules, lung-rads score, and impression"

This generates a YAML template with typed fields (strings, numbers, enums, nested objects, lists). MOSAICX also ships with built-in templates:

mosaicx template list

You can also create templates deterministically from CSV, TSV, or Excel data dictionaries:

mosaicx template create --from-table fields.csv --name OncologyFields

Recommended fillable CSV format:

field_name,type,description,required,values
diagnosis_reason,enum,Reason that led to diagnostic workup,false,E|F|Z
tumor_size_mm,float,Tumor size in millimeters,false,
impression,str,Clinical impression,true,
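A data dictionary in this layout can also be generated from code rather than typed by hand. The sketch below uses Python's standard csv module and reproduces the three rows shown above. The function name build_data_dictionary and the input dict shape are illustrative choices, not a MOSAICX API.

```python
import csv
import io
from typing import Dict, List

COLUMNS = ["field_name", "type", "description", "required", "values"]


def build_data_dictionary(fields: List[Dict]) -> str:
    """Render field definitions as CSV text in the column layout shown above."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for field in fields:
        writer.writerow({
            "field_name": field["field_name"],
            "type": field["type"],
            "description": field.get("description", ""),
            "required": "true" if field.get("required") else "false",
            # Enum codes are pipe-separated, as in the example row E|F|Z.
            "values": "|".join(field.get("values", [])),
        })
    return buffer.getvalue()


fields = [
    {"field_name": "diagnosis_reason", "type": "enum",
     "description": "Reason that led to diagnostic workup",
     "values": ["E", "F", "Z"]},
    {"field_name": "tumor_size_mm", "type": "float",
     "description": "Tumor size in millimeters"},
    {"field_name": "impression", "type": "str",
     "description": "Clinical impression", "required": True},
]
csv_text = build_data_dictionary(fields)
print(csv_text)
```

Writing csv_text to fields.csv gives a file that can be passed to --from-table as in the command above.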

For large catalog exports that contain several forms in one file, create one YAML template per form:

mosaicx template create \
  --from-table onkostar_catalog.csv \
  --split-by form_name \
  --output-dir ./templates/onkostar

For example, form_name=OS.Diagnose becomes OSDiagnose.yaml, and form_name=OS.TNM becomes OSTNM.yaml.
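To preview which files a catalog would be split into, the grouping can be approximated offline. The sketch below is an illustration only: the naming rule (drop every character that is not a letter or digit, then append .yaml) is inferred from the two examples above and may differ from what MOSAICX actually does, and the sample catalog columns are hypothetical.

```python
import csv
import io
import re
from collections import defaultdict
from typing import Dict, List


def template_filename(form_name: str) -> str:
    """Assumed rule, inferred from the examples: keep only letters and digits."""
    return re.sub(r"[^0-9A-Za-z]", "", form_name) + ".yaml"


def group_rows(csv_text: str, split_by: str = "form_name") -> Dict[str, List[dict]]:
    """Group catalog rows by the file name derived from the split column."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[template_filename(row[split_by])].append(row)
    return dict(groups)


# Hypothetical catalog export with several forms in one file.
SAMPLE = """form_name,field_name,type
OS.Diagnose,diagnosis_reason,enum
OS.Diagnose,impression,str
OS.TNM,t_stage,str
"""

for filename, rows in group_rows(SAMPLE).items():
    print(filename, [row["field_name"] for row in rows])
```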

Use the generated YAML exactly like any other MOSAICX template:

mosaicx extract --document report.pdf --template ./templates/onkostar/OSDiagnose.yaml

To inspect what the LLM will see before running extraction, render the DSPy/BAML prompt preview locally:

mosaicx template prompt ./templates/onkostar/OSDiagnose.yaml

This does not call an LLM server. It shows the schema text produced from the YAML, including enum codes and labels such as Z=Zufallsbefund.

2. Extract Structured Data

Single document:

mosaicx extract --document report.pdf --template chest_ct

Batch (parallel):

mosaicx extract --dir ./reports/ --template chest_ct --workers 8 --output-dir ./results/

Output is clean JSON with {value, excerpt} for every field:

{
  "indication": {
    "value": "Follow-up pulmonary nodule",
    "excerpt": "Indication: Follow-up of incidentally detected pulmonary nodule"
  },
  "impression": {
    "value": "Stable 6mm nodule, recommend 12-month follow-up",
    "excerpt": "Impression: Stable 6mm solid nodule in right lower lobe"
  }
}
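Because every field carries its source excerpt, the output can be checked against the original text after extraction. The sketch below flattens the result to plain values and lists fields whose excerpt does not appear verbatim (ignoring whitespace differences) in the source. It assumes a flat result where each top-level field is a {value, excerpt} object, as in the example above; nested objects and lists are not handled. The function names and the source text are illustrative.

```python
import json
from typing import Dict, List


def flatten(result: Dict[str, dict]) -> Dict[str, object]:
    """Map each field name to its bare value, dropping the excerpt."""
    return {name: field["value"] for name, field in result.items()}


def unsupported_fields(result: Dict[str, dict], source_text: str) -> List[str]:
    """List fields whose excerpt is not found in the source text."""
    normalized = " ".join(source_text.split())  # collapse whitespace and line breaks
    return [
        name for name, field in result.items()
        if " ".join(field["excerpt"].split()) not in normalized
    ]


result = json.loads("""
{
  "indication": {
    "value": "Follow-up pulmonary nodule",
    "excerpt": "Indication: Follow-up of incidentally detected pulmonary nodule"
  },
  "impression": {
    "value": "Stable 6mm nodule, recommend 12-month follow-up",
    "excerpt": "Impression: Stable 6mm solid nodule in right lower lobe"
  }
}
""")

# Hypothetical report text standing in for the OCR output of report.pdf.
source = (
    "Indication: Follow-up of incidentally detected pulmonary nodule.\n"
    "Impression: Stable 6mm solid nodule\nin right lower lobe."
)

print(flatten(result))
print("unsupported:", unsupported_fields(result, source))
```

A non-empty list from unsupported_fields is a cue to review those fields by hand rather than proof of an extraction error, since OCR output may differ slightly from the text available for comparison.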

3. De-identify Documents

Remove PHI with HIPAA conformance (LLM + regex safety net):

mosaicx deidentify --document note.pdf
mosaicx deidentify --document note.pdf -o redacted.json

Batch:

mosaicx deidentify --dir ./notes/ --workers 4 --output-dir ./cleaned/

Output:

{
  "conformance": "hipaa",
  "redacted_text": "Patient [REDACTED] presented with...",
  "phi": [
    {"value": "John Doe", "type": "NAME", "excerpt": "Patient John Doe presented"},
    {"value": "01/15/1990", "type": "DATE", "excerpt": "DOB: 01/15/1990"}
  ]
}
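The phi list makes simple post-checks possible. The sketch below counts detected items by type and flags any PHI value that still appears in redacted_text. It assumes the output shape shown above; the function names are illustrative and not part of MOSAICX.

```python
from collections import Counter
from typing import Dict, List


def phi_counts(report: dict) -> Dict[str, int]:
    """Count detected PHI items by type (NAME, DATE, ...)."""
    return dict(Counter(item["type"] for item in report["phi"]))


def leaked_values(report: dict) -> List[str]:
    """Return PHI values that still occur verbatim in the redacted text."""
    return [
        item["value"] for item in report["phi"]
        if item["value"] in report["redacted_text"]
    ]


report = {
    "conformance": "hipaa",
    "redacted_text": "Patient [REDACTED] presented with...",
    "phi": [
        {"value": "John Doe", "type": "NAME", "excerpt": "Patient John Doe presented"},
        {"value": "01/15/1990", "type": "DATE", "excerpt": "DOB: 01/15/1990"},
    ],
}

print(phi_counts(report))
print("leaked:", leaked_values(report))
```

This only catches exact repeats of values the tool already detected; it cannot find PHI that was missed entirely.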

Privacy

Important

Data stays on your machine. MOSAICX runs against a local LLM server -- no external API calls, no cloud uploads. De-identification follows HIPAA Safe Harbor rules by default.

Configuration

All settings live in a .env file (recommended) or environment variables with the MOSAICX_ prefix:

MOSAICX_LM=openai/google/gemma-4-31B-it
MOSAICX_API_BASE=http://localhost:8000/v1
MOSAICX_API_KEY=not-needed
MOSAICX_OCR_ENGINE=chandra
MOSAICX_CHANDRA_SERVER_URL=http://localhost:8001/v1

View the active config:

mosaicx config show

See docs/configuration.md for the full reference.

Documentation

| Guide | Description |
|---|---|
| Quickstart | First successful run in ~10 minutes |
| Getting Started | Install, first extraction, basics |
| CLI Reference | Every command, every flag, examples |
| Schemas & Templates | Create and manage extraction templates |
| Configuration | Env vars, backends, OCR, export formats |
| Developer Guide | Custom pipelines, Python SDK, MCP server |

Development

git clone https://github.com/DIGIT-X-Lab/MOSAICX.git
cd MOSAICX
pip install -e ".[dev]"
pytest tests/ -q

Citation

@software{mosaicx2025,
  title   = {MOSAICX: Medical cOmputational Suite for Advanced Intelligent eXtraction},
  author  = {Sundar, Lalith Kumar Shiyam and DIGIT-X Lab},
  year    = {2025},
  url     = {https://github.com/DIGIT-X-Lab/MOSAICX},
  doi     = {10.5281/zenodo.17601890}
}

License

Apache 2.0 -- see LICENSE.

Contact

Research: lalith.shiyam@med.uni-muenchen.de | Commercial: lalith@zenta.solutions | Issues: github.com/DIGIT-X-Lab/MOSAICX/issues
