Official implementation and foundation model weights for brat, a vision-language pre-training framework that leverages a large-scale dataset of paired brain MRIs and clinical reports (~80,000 sessions). Our models provide a strong starting point for downstream clinical tasks such as report generation, classification, and segmentation.
We provide several foundation models trained on T1 post-contrast (T1c) scans, plus a high-performing multimodal variant. Each checkpoint consists of a vision backbone connected to a QFormer-like module that produces the multi-view embeddings. You can use either the full model (vision backbone + multi-view embeddings) or only the vision backbone for feature extraction.
| Model Name | Input Modalities | Vision Backbone | Weights |
|---|---|---|---|
| | T1c | DenseNet-121 | Download |
| | T1c | ViT-B/16 | Download |
| | T1c | ResNet-50 | Download |
| | T1c, T1, T2, FLAIR | DenseNet-169 | Download |
Note (a): The ViT model provided here is an updated version that outperforms the variant originally reported in the paper, matching the performance of our ResNet foundation model.
Note (b): The Multimodal model is not reported in the paper and is our strongest model yet.
Install the required packages using pip:

    pip install -r requirements.txt

The extract_features.py script contains the code needed to load the models and generate embeddings for a toy input volume.
Full Embedding (Vision backbone + multi-view embeddings):

    python extract_features.py \
        --weights /path/to/weights.bin \
        --vision-model-name densenet121 \
        --in-channels 1 \
        --mode full

Vision backbone only:

    python extract_features.py \
        --weights /path/to/weights.bin \
        --vision-model-name vit \
        --in-channels 1 \
        --mode vision

For the Multimodal model, the modalities must be stacked along the channel dimension.
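As a rough illustration of stacking modalities along the channel dimension, the sketch below builds a multi-channel input with NumPy. The helper `stack_modalities`, the `MODALITY_ORDER` constant, and the toy volume shape are hypothetical and not part of this repository. The channel order is an assumption taken from the order in the model table, so verify it against the checkpoint before relying on it.

```python
import numpy as np

# ASSUMPTION: channel order follows the order listed in the model table.
# The expected order is not confirmed here; check it before use.
MODALITY_ORDER = ("T1c", "T1", "T2", "FLAIR")


def stack_modalities(volumes, order=MODALITY_ORDER):
    """Stack per-modality 3D volumes into one (C, D, H, W) array.

    `volumes` maps a modality name to a preprocessed 3D array.
    This is a hypothetical helper, not a function from the repository.
    """
    missing = [name for name in order if name not in volumes]
    if missing:
        raise KeyError(f"missing modalities: {missing}")
    shapes = {volumes[name].shape for name in order}
    if len(shapes) != 1:
        raise ValueError(f"all modalities must share one shape, got {shapes}")
    # axis=0 places the modalities in the channel dimension.
    return np.stack([volumes[name] for name in order], axis=0)


# Toy volumes standing in for preprocessed, co-registered scans.
rng = np.random.default_rng(0)
toy = {name: rng.random((32, 256, 256), dtype=np.float32) for name in MODALITY_ORDER}

stacked = stack_modalities(toy)   # shape (4, 32, 256, 256)
batch = stacked[np.newaxis]       # add a batch dimension: (1, 4, 32, 256, 256)
print(batch.shape)
```

With four stacked modalities, `--in-channels` would presumably need to be 4 rather than 1, although that value is not stated here.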
Below is the MONAI-based preprocessing pipeline we used for our pre-training runs. Images are expected to be in NIfTI format (.nii or .nii.gz).
- LoadImaged: {}
- EnsureChannelFirstd: {channel_dim: 'no_channel'}
- Spacingd: {pixdim: [1, 1, 1], mode: 'bilinear'}
- Orientationd: {axcodes: 'SAR'}
- Resized: {spatial_size: [32, 256, 256]}
- NormalizeIntensityd: {channel_wise: true, nonzero: true}
- ScaleIntensityd: {channel_wise: true, maxv: 1.0}

If you make use of our models, please consider citing us:

    @inproceedings{kayser2026brat,
      title = {brat: Aligned Multi-View Embeddings for Brain MRI Analysis},
      author = {Kayser, Maxime and Gridnev, Maksim and Wang, Wanting and Bain, Max and Rangnekar, Aneesh and Chatterjee, Avijit and Petrov, Aleksandr and Veeraraghavan, Harini and Swinburne, Nathaniel C.},
      booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
      year = {2026},
    }

This project and the accompanying model weights are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License.
- Academic/Research Use: Encouraged and permitted.
- Commercial Use: Prohibited.
For commercial licensing inquiries, or if you are unsure whether your use case qualifies as non-commercial, please open an issue or contact the maintainers directly.

