GPC (Genetic Probabilistic Circuit) is a tractable deep generative model for haplotype data. It supports exact likelihood evaluation, exact marginalization, and fast conditional queries. The same trained model can generate artificial genomes and impute missing SNPs with improved privacy properties.
Preprint: GPC: An expressive and tractable deep generative model for genetic variation data (bioRxiv, 2026).
pip install pyjuice
git clone https://github.com/sriramlab/GPC.git
cd GPC

Requires a CUDA-capable GPU, PyTorch, NumPy, pandas, scikit-learn, networkx, matplotlib, and tqdm. GPC is built on PyJuice.
A whitespace-separated 0/1 haplotype file: rows = haplotypes, columns = SNPs.
Optionally, a .legend file with header id position a0 a1 (space-separated) giving the bp position of each SNP. If supplied, LD plots use bp distance on the x-axis.
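As a sketch of the two input formats above, the loader below reads a 0/1 haplotype matrix and, optionally, a legend with the id position a0 a1 header. The function name and file handling are illustrative, not the demo scripts' own code.

```python
import numpy as np
import pandas as pd

def load_haplotypes(hap_path, legend_path=None):
    """Load a whitespace-separated 0/1 haplotype matrix
    (rows = haplotypes, columns = SNPs) and, optionally, a
    space-separated legend with header 'id position a0 a1'."""
    haps = np.loadtxt(hap_path, dtype=np.int8)
    if haps.ndim == 1:  # a single haplotype should still be 2-D
        haps = haps[None, :]
    positions = None
    if legend_path is not None:
        legend = pd.read_csv(legend_path, sep=" ")
        positions = legend["position"].to_numpy()
        assert len(positions) == haps.shape[1], "legend/SNP count mismatch"
    return haps, positions
```

When a legend is supplied, the returned positions give the bp coordinate of each SNP column, which is what lets LD plots use physical distance.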
pc/demo/ ships two datasets (each a haplotype file plus legend) from a contiguous chr15 region in the 1000 Genomes Project Phase 3 (5,008 haplotypes):
| File | SNPs | Use for |
|---|---|---|
| 1K_full.txt + 1K_full.legend | 1,000 | default (fast end-to-end run) |
| 10K_full.txt + 10K_full.legend | 10,000 | full run |
From pc/demo/, run the four steps in order. Every script shares one --run-dir (default out/1K).
cd pc/demo
# 1. Train a GPC with train/val/test split and early stopping.
python3 train_demo.py
# 2. Sample artificial genomes from the best checkpoint.
python3 generate_demo.py
# 3. Evaluate sample quality + privacy: PCA, LD decay, LD error, CLT tree, AATS.
python3 evaluate.py
# 4. Imputation benchmark: single-SNP + multi-SNP at 30/50/80% missingness.
python3 impute_demo.py
python3 plot_imputation.py

Afterwards out/1K/ contains:
out/1K/
├── config.json
├── gpc_best.jpc best-val GPC checkpoint
├── train.txt / val.txt / test.txt shuffled splits (reused downstream)
├── train.log per-epoch train/val LL
├── samples.txt generated haplotypes
├── quality/ pca, ld_decay, ld_error, clt_tree, clt_summary
├── imputation/ r2 CSVs + imputation_r2.pdf + imputation_summary.csv
└── privacy/ aats
All scripts take --help.
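To make the LD decay metric reported under quality/ concrete, here is a minimal sketch of pairwise r² and its binned decay curve. This illustrates the statistic itself, not evaluate.py's implementation; in practice monomorphic SNPs give undefined correlations and would need to be filtered first.

```python
import numpy as np

def ld_r2(haps):
    """Pairwise LD (r^2) between SNP columns of a 0/1 haplotype
    matrix: squared Pearson correlation of allele values."""
    corr = np.corrcoef(haps, rowvar=False)  # treat SNPs as variables
    return corr ** 2

def ld_decay(haps, positions, bins):
    """Mean r^2 per pairwise-distance bin (bp distance when
    positions come from a legend, SNP index otherwise)."""
    r2 = ld_r2(haps)
    i, j = np.triu_indices(haps.shape[1], k=1)  # upper-triangle SNP pairs
    dist = np.abs(positions[j] - positions[i])
    which = np.digitize(dist, bins)
    return np.array([r2[i, j][which == b].mean()
                     for b in range(1, len(bins))])
```

Comparing this curve between real and generated haplotypes is one way a sample-quality check like the one in quality/ld_decay can be framed.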
Same four steps, bigger model:
cd pc/demo
python3 train_demo.py --data 10K_full.txt --output-dir out/10K \
--latents 128 --epochs 2000 --patience 100 --seed 1
python3 generate_demo.py --run-dir out/10K --num-samples 5008 --seed 1
python3 evaluate.py --run-dir out/10K --legend 10K_full.legend --seed 1
python3 impute_demo.py --run-dir out/10K --mask-rates 0.3 0.5 0.8 --seed 1
python3 plot_imputation.py --run-dir out/10K

To run the same pipeline on your own haplotype file:

python3 train_demo.py --data path/to/haps.txt --output-dir out/my_run \
--latents 128 --epochs 2000 --patience 100 --seed 1
python3 generate_demo.py --run-dir out/my_run
python3 evaluate.py --run-dir out/my_run --legend path/to/haps.legend
python3 impute_demo.py --run-dir out/my_run
python3 plot_imputation.py --run-dir out/my_run

Pass --legend '' to evaluate.py to fall back to SNP-index distance when no legend is available.
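The imputation benchmark hides entries at the given --mask-rates and scores recovery. The helper below sketches that setup: random masking at a chosen rate plus an r²-style score over the hidden entries. Function names and the NaN convention are illustrative assumptions, not impute_demo.py's actual code.

```python
import numpy as np

def mask_haplotypes(haps, mask_rate, seed=0):
    """Randomly hide a fraction of SNP entries, mimicking the
    missingness levels used in the imputation benchmark."""
    rng = np.random.default_rng(seed)
    mask = rng.random(haps.shape) < mask_rate
    masked = haps.astype(float)
    masked[mask] = np.nan  # NaN marks the entries to impute
    return masked, mask

def imputation_r2(truth, imputed, mask):
    """Squared correlation between true and imputed alleles,
    computed only over the masked entries."""
    t = truth[mask].astype(float)
    p = imputed[mask].astype(float)
    if t.std() == 0 or p.std() == 0:
        return np.nan  # undefined when either side is constant
    return np.corrcoef(t, p)[0, 1] ** 2
```

Running this at rates 0.3, 0.5, and 0.8 mirrors the 30/50/80% missingness settings of the benchmark.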
pc/demo/ self-contained demo (start here)
pc/ training / sampling / imputation scripts used in the paper
plots/ analysis notebooks and figure-generating scripts for the paper
aux/ SNP legends, MAF tables, AATS utilities
results/ per-dataset outputs (checkpoints, samples, metrics)
plots/structure/ adopts and extends code from Yelmen et al., Deep convolutional and conditional neural networks for large-scale genomic data generation (PLOS Comput. Biol.).
@article{anand2026gpc,
title = {GPC: An expressive and tractable deep generative model for genetic variation data},
author = {Anand, Prateek and Liu, Anji and Dang, Meihua and Fu, Boyang and Wei, Xinzhu and Van den Broeck, Guy and Sankararaman, Sriram},
journal = {bioRxiv},
year = {2026},
doi = {10.1101/2023.05.16.541036},
url = {https://www.biorxiv.org/content/10.1101/2023.05.16.541036v3}
}
