This repository implements an incremental comparative atlas construction framework using:
- Bregman Information (BI) to build a replay buffer
- Fisher Information–based importance weighting
- Regularized incremental model updates
The workflow enables robust case–control integration without catastrophic forgetting.
The pipeline consists of four stages:
1. Inputs:
   - Integrated reference atlas (multi-study healthy reference)
   - Case–control query data (multi-study)
2. Replay buffer: select informative reference cells to preserve during continual training.
3. Parameter importance: estimate Fisher Information for encoder and decoder weights.
4. Incremental update: train with regularization from the replay buffer and the Fisher Information weights.
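The Fisher Information and regularized-update stages can be illustrated with a minimal sketch of elastic weight consolidation (EWC) style regularization, which the ewc_importance setting used later in this README suggests. It is not the repository's implementation: the toy logistic-regression model, the function names diagonal_fisher and ewc_penalty, and the simulated data are all assumptions made for illustration. The diagonal Fisher Information is estimated as the mean squared per-sample gradient of the log-likelihood, and the penalty pulls weights with high Fisher values back toward their reference values.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def diagonal_fisher(weights, features, labels):
    """Empirical diagonal Fisher: mean squared per-sample gradient of the log-likelihood."""
    probs = sigmoid(features @ weights)
    # Gradient of the Bernoulli log-likelihood w.r.t. the weights, one row per sample.
    per_sample_grads = (labels - probs)[:, None] * features
    return np.mean(per_sample_grads ** 2, axis=0)


def ewc_penalty(weights, ref_weights, fisher, ewc_importance):
    """Quadratic penalty anchoring important weights to their reference values."""
    return 0.5 * ewc_importance * np.sum(fisher * (weights - ref_weights) ** 2)


# Simulated data (assumption): 200 samples, 3 features.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 3))
features[:, 2] = 0.0  # third feature is constant zero, so its weight gets zero Fisher
ref_weights = np.array([1.5, -2.0, 0.3])
labels = (rng.random(200) < sigmoid(features @ ref_weights)).astype(float)

fisher = diagonal_fisher(ref_weights, features, labels)
shifted = ref_weights + 0.5  # hypothetical weights after an unregularized update

print("diagonal Fisher:", fisher)
print("EWC penalty:", ewc_penalty(shifted, ref_weights, fisher, ewc_importance=0.1))
```

In this sketch the penalty is added to the training loss, so moving a weight with zero Fisher Information costs nothing, while moving a high-Fisher weight is penalized in proportion to ewc_importance.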
git clone https://github.com/theislab/comparative_atlas
cd comparative_atlas
Use the environment.yml file provided in the repo to create and activate the conda environment:
conda env create -f environment.yml
conda activate cscanvi
Here we provide an example of training a scANVI model incrementally.
Tip
You can download a simulated case–control PBMC scRNA-seq dataset—featuring increased IFN signaling in monocytes from case samples—along with the corresponding reference model from this link.
Import the modified SCANVI class from source:
from cscanvi._scanvi import SCANVI
Construct a Replay Buffer by computing the Bregman Information metric for each cell.
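As a rough illustration of what a Bregman Information score measures, the sketch below uses the general definition BI = E[phi(z)] - phi(E[z]) for a convex generator phi, taken over repeated augmentations of each cell. It assumes phi is the log-sum-exp of the classifier logits, simulates the logits instead of running a model, and uses a plain top-k selection as a stand-in for the repository's 'step' ordering. None of these choices is confirmed to match SCANVI.get_uncertainty, and the helper names are hypothetical.

```python
import numpy as np


def logsumexp(logits, axis=-1):
    """Numerically stable log-sum-exp along one axis."""
    peak = np.max(logits, axis=axis, keepdims=True)
    return np.squeeze(peak, axis=axis) + np.log(np.sum(np.exp(logits - peak), axis=axis))


def bregman_information(logits):
    """BI per cell for logits shaped (n_cells, n_augmentations, n_classes).

    BI = E[phi(z)] - phi(E[z]) with phi = log-sum-exp (assumed generator),
    where the expectation runs over augmentations.
    """
    mean_of_phi = logsumexp(logits, axis=-1).mean(axis=1)
    phi_of_mean = logsumexp(logits.mean(axis=1), axis=-1)
    return mean_of_phi - phi_of_mean


# Simulated logits (assumption): 10 cells, 200 augmentations, 4 classes.
rng = np.random.default_rng(0)
n_cells, n_aug, n_classes = 10, 200, 4
base = rng.normal(size=(n_cells, 1, n_classes))
# Per-cell noise scale: larger means predictions vary more across augmentations.
noise_scale = np.linspace(0.0, 2.0, n_cells)[:, None, None]
logits = base + noise_scale * rng.normal(size=(n_cells, n_aug, n_classes))

scores = bregman_information(logits)
num_points = int(n_cells * 0.2)
# Top-k by BI: an assumed selection rule, not the repository's 'step' ordering.
replay_idx = np.argsort(scores)[::-1][:num_points]
print("BI scores:", scores.round(3))
print("selected cells:", replay_idx)
```

Because log-sum-exp is convex, the score is non-negative and is zero for a cell whose logits do not change across augmentations.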
In the example below, we select 20% of the reference cells for replay. The reference model is ref_model, and the gene expression counts from the atlas are stored in adata. We compute BI from 200 augmentations per cell (tta_rep), then choose cells following the step approach (order='step').
import scvi
# ref_model_path and adata (the reference atlas AnnData) are provided by the user
ref_model = scvi.model.SCANVI.load(ref_model_path, adata)
prop_cells_to_replay = 0.2
num_points_bi = int(adata.n_obs * prop_cells_to_replay)
N = 200  # number of augmentations per cell
unc_scores, score_idx = SCANVI.get_uncertainty(
    adata,
    ref_model,
    order='step',
    num_points=num_points_bi,
    tta_rep=N,
)
# score_idx is a torch tensor; convert it to a NumPy index before subsetting
adata_healthyRef_replay = adata[score_idx.detach().cpu().numpy()]
Next we compute Fisher Information to estimate parameter importance. To compute Fisher Information, we first need to select a subset of control cells from the query:
import scanpy as sc
# select a subset of query control cells (here 50%) for computing Fisher Information
healthy_controls = query_adata.obs.condition.isin(['control'])
adata_queryCtrl = sc.pp.subsample(query_adata[healthy_controls].copy(), fraction=0.5, copy=True)
# concatenate replay buffer with query data
query_adata = query_adata.concatenate(adata_healthyRef_replay)
# add the query-control subset and replay buffer to uns
query_adata.uns['ctrl_query'] = adata_queryCtrl
query_adata.uns['replay_adata'] = adata_healthyRef_replay
# compute importance weights
query_model = SCANVI.load_query_data_with_replay(
    query_adata,
    reference_model=ref_model_path,
    unfrozen=True,
    control_uns_key='ctrl_query',
    replay_uns_key='replay_adata',
)
Set the desired value for ewc_importance (regularization strength) and train:
contl_epochs = 150
train_kwargs_surgery = {
    "early_stopping": True,
    "early_stopping_monitor": "elbo_train",
    "early_stopping_patience": 10,
    "early_stopping_min_delta": 0.001,
    # ewc_importance sets the regularization strength
    "plan_kwargs": {"ewc_importance": 0.1, "weight_decay": 0.0},
}
query_model.train(
    max_epochs=contl_epochs,
    **train_kwargs_surgery,
)
Important
Very strong regularization can terminate training prematurely.
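One plausible mechanism, given the early-stopping settings above, is that a large ewc_importance keeps the weights close to the reference, so the monitored metric plateaus and the patience counter runs out. The sketch below shows generic patience-based early stopping on a minimized metric. It is an illustration under assumptions, with made-up loss curves and a hypothetical helper epochs_until_stop, and it does not reproduce the logic that scvi-tools runs internally.

```python
def epochs_until_stop(metric_per_epoch, patience, min_delta):
    """Return the 1-based epoch at which patience-based early stopping triggers.

    The monitored metric is minimized. An epoch counts as an improvement only
    if it beats the best value so far by more than min_delta. If stopping never
    triggers, the total number of epochs is returned.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, value in enumerate(metric_per_epoch, start=1):
        if value < best - min_delta:
            best = value
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            return epoch
    return len(metric_per_epoch)


# Hypothetical loss curves over 150 epochs (not measured results).
weak_reg = [100.0 - 0.5 * e for e in range(150)]  # keeps improving
strong_reg = [100.0 - 0.5 * e for e in range(5)] + [98.0] * 145  # plateaus early

weak_stop = epochs_until_stop(weak_reg, patience=10, min_delta=0.001)
strong_stop = epochs_until_stop(strong_reg, patience=10, min_delta=0.001)
print("weak regularization stops at epoch:", weak_stop)
print("strong regularization stops at epoch:", strong_stop)
```

If training ends after only a few epochs, lowering ewc_importance or raising early_stopping_patience are the two settings in train_kwargs_surgery most directly tied to this behavior.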
The scANVI models for the comparative CRC all-lineage, epithelial lineage, and NK-T cell lineage integrations, as well as the notebooks to reproduce the figures from the manuscript, will be released progressively.
Note
Integrated objects and the associated metadata are now available on Hugging Face.
If you use this project, please cite:
@article{hediyeh2026perturbation,
title={Perturbation-guided mapping of colorectal cancer cell states to causal mechanisms},
author={Hediyeh-zadeh, Soroor and Toh, Tzen S and Dufva, Olli and Serra, Giuseppe and Jackmola, Rashika and Fourneaux, Camille and Pinto, Goncalo and Fang, Zijian and Picco, Gabriele and Oliver, Amanda J and others},
journal={bioRxiv},
pages={2026--03},
year={2026},
publisher={Cold Spring Harbor Laboratory}
}