Pathogen-agnostic infection detection, radiation biodosimetry, and dose-response modeling using distributional physics features computed from gene expression data.
Traditional gene-expression biomarkers (e.g., Sweeney-7, Herberg-2) require:
- Probe-to-symbol mapping specific to each microarray platform
- Gene-identity annotations that may not exist for novel platforms
- Platform-specific normalization pipelines
This limits their use for rapid field triage on unknown or unannotated platforms.
We take a different approach: instead of asking which genes change, we ask how the entire distribution changes. Four physics-inspired summary statistics capture the shape of any expression vector — no gene names needed.
From any expression vector (RNA-seq counts, microarray intensities, etc.), compute:
| Feature | What it measures | Why it matters |
|---|---|---|
| Gini coefficient | Expression inequality | Infected samples concentrate expression in fewer genes |
| Shannon entropy | Expression diversity | Infection reduces transcriptional diversity |
| Normalized entropy | Scale-free diversity | Comparable across platforms with different probe counts |
| Zipf exponent | Power-law decay rate | Steeper decay = more extreme expression hierarchy |
These features are platform-agnostic: they require no gene annotation, no probe mapping, and no platform-specific normalization pipeline. The same computation runs unchanged on RNA-seq, Affymetrix, Illumina, and Agilent data.
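The four per-sample features could be computed roughly as follows. This is a minimal sketch using textbook definitions (sorted-values Gini, natural-log entropy, OLS slope for the Zipf exponent). The function name `distribution_features` and the decision to drop non-positive values are illustrative assumptions, not the repository's API; the actual implementation in `src/physics_features/` may differ in detail.

```python
import numpy as np

def distribution_features(expr):
    """Four distribution-shape features for one sample (hypothetical helper)."""
    x = np.asarray(expr, dtype=float)
    x = np.sort(x[x > 0])  # assumption: keep positive values only, ascending
    n = x.size
    if n < 2:
        raise ValueError("need at least two positive expression values")

    # Gini coefficient via the sorted-values form of the Lorenz-curve area
    i = np.arange(1, n + 1)
    gini = 2.0 * np.sum(i * x) / (n * x.sum()) - (n + 1.0) / n

    # Shannon entropy of the normalized expression profile (natural log)
    p = x / x.sum()
    entropy = -np.sum(p * np.log(p))

    # Normalized entropy: divide by the maximum, log(number of expressed genes)
    normalized_entropy = entropy / np.log(n)

    # Zipf exponent: OLS slope of log(expression) on log(rank), rank 1 = highest
    ranks = np.arange(1, n + 1)
    zipf_exponent, _ = np.polyfit(np.log(ranks), np.log(x[::-1]), 1)

    return {
        "gini": float(gini),
        "shannon_entropy": float(entropy),
        "normalized_entropy": float(normalized_entropy),
        "zipf_exponent": float(zipf_exponent),
    }

# Synthetic log-normal vector standing in for one sample (not real data)
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=2.0, sigma=1.5, size=20_000)
print(distribution_features(sample))
```

Nothing in the function refers to gene names or probe IDs; the input is only a vector of numbers, which is what makes the features annotation-free.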
| Track | Datasets | Samples | Key Finding |
|---|---|---|---|
| Infection Detection | 10 GEO datasets, 5 platform families | 1,954 | Cross-platform AUROC 0.720 (median); independent validation 0.731 |
| Radiation Biodosimetry | 3 GeneLab tissues (OSD-202/211/237) | 82 | Taylor k separates irradiated vs control (AUROC 0.793) |
| Dose-Response | 3 tissues, 0-13 Gy | 82 | Monotonic dose-feature relationships with cross-tissue transferability |
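Models trained on physics features from one dataset transfer to the other platforms with AUROC between 0.483 and 0.849 (rows: training dataset; columns: test dataset):
| Train (row) / Test (column) | GSE161731 (RNA-seq) | GSE63990 (Affy) | GSE60244 (Illumina) | GSE236713 (Agilent) |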
|---|---|---|---|---|
| GSE161731 | -- | 0.847 | 0.718 | 0.652 |
| GSE63990 | 0.721 | -- | 0.635 | 0.571 |
| GSE60244 | 0.773 | 0.849 | -- | 0.483 |
| GSE236713 | 0.763 | 0.842 | 0.708 | -- |
Independent validation on a held-out platform (GSE100150, Illumina WG-6 v3): AUROC 0.731.
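The median cross-platform AUROC can be checked directly from the transfer table. The sketch below copies the twelve off-diagonal values from the table above, treats rows as training datasets and columns as test datasets, and uses `np.nan` for the diagonal; the variable names are illustrative.

```python
import numpy as np

datasets = ["GSE161731", "GSE63990", "GSE60244", "GSE236713"]

# Rows: training dataset, columns: test dataset (values from the table above)
auroc = np.array([
    [np.nan, 0.847, 0.718, 0.652],
    [0.721, np.nan, 0.635, 0.571],
    [0.773, 0.849, np.nan, 0.483],
    [0.763, 0.842, 0.708, np.nan],
])

# Median over the off-diagonal entries (NaNs are ignored)
median_auroc = float(np.nanmedian(auroc))

# Locate the weakest train -> test transfer
train_idx, test_idx = np.unravel_index(np.nanargmin(auroc), auroc.shape)
weakest_pair = (datasets[train_idx], datasets[test_idx])

print(f"median off-diagonal AUROC: {median_auroc:.4f}")
print("weakest transfer (train, test):", weakest_pair)
```

The median falls between 0.718 and 0.721, consistent with the 0.720 reported in the summary table. The weakest transfer is Illumina to Agilent, the only cell below 0.5.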
Physics features complement (not replace) gene-identity approaches:
| Task | Physics Features | Sweeney-7 | Herberg-2 |
|---|---|---|---|
| Infected vs Healthy (cross-platform) | 0.720 | N/A | N/A |
| Bacterial vs Viral (within-dataset) | 0.68-0.92 | 0.88-0.96 | 0.70-0.96 |
| Bacterial vs Viral (cross-platform) | 0.50 | 0.92 | N/A |
| Validation (GSE100150, pooled) | 0.492 | 0.640 | 0.745 |
Physics features excel at infected-vs-healthy detection across platforms (a task gene signatures cannot perform without probe mapping). Gene signatures are superior for bacterial-vs-viral discrimination when mappable.
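For readers unfamiliar with the metric in these tables, AUROC is the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. The sketch below is a generic NumPy implementation of that pairwise definition, not the repository's evaluation code; the function name `auroc` and the toy scores are assumptions for illustration.

```python
import numpy as np

def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Hypothetical helper for illustration.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos = scores[labels]
    neg = scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one positive and one negative sample")
    # Compare every positive score against every negative score
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Toy example: 3 of the 4 positive/negative pairs are ordered correctly
toy_auroc = auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
print(toy_auroc)
```

Under this definition a value of 0.5 is chance level, so entries such as 0.50 and 0.492 in the table indicate no usable bacterial-vs-viral signal from physics features in those settings.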

    git clone https://github.com/jang1563/cbrn-physics-features.git
    cd cbrn-physics-features
    pip install -r requirements.txt

Requirements: Python 3.8+, NumPy, Pandas, SciPy, scikit-learn, Matplotlib.

    # 1. Download GEO data (~1 GB)
    python scripts/download_infection_data.py
    python scripts/download_validation_data.py

    # 2. Run full analysis pipeline (stages 1-5)
    python scripts/run_infection_analysis.py

    # 3. Generate figures (18 PNGs)
    python scripts/make_infection_figures.py

Use `--stages 1 2` to run only specific stages. Results are saved to `results/infection/`.

    # Data included in data/radiation/ (37 MB)
    python scripts/run_radiation_analysis.py
    python scripts/make_radiation_figures.py

    # Uses same radiation data
    python scripts/run_dose_response.py
    python scripts/make_dose_response_figures.py

| Feature | Formula | Interpretation | Range | Replicates? |
|---|---|---|---|---|
| Gini coefficient | Lorenz curve area | Expression inequality | [0, 1] | No |
| Shannon entropy | H = -sum(p log p) | Expression diversity | [0, log n] | No |
| Normalized entropy | H / log(n_expressed) | Scale-free diversity | [0, 1] | No |
| Zipf exponent (gamma) | OLS: log(expr) ~ log(rank) | Power-law decay rate | [-3, 0] | No |
| Taylor k | OLS: log(Var) ~ log(Mean) | Noise scaling exponent | [1, 2] | Yes |
| Fano factor | Var / Mean | Overdispersion | [0, inf) | Yes |
The first 4 features are computed per sample (no replicates needed, enabling single-sample triage). Taylor k and Fano factor require biological replicates (computed per condition group).
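The two replicate-based features could be sketched as follows, assuming a replicates x genes matrix for a single condition group. The function name `group_noise_features` is hypothetical, and summarizing the per-gene Fano factors by their median is an assumption made here for illustration; the table only specifies Var / Mean.

```python
import numpy as np

def group_noise_features(expr_matrix):
    """Taylor k and a Fano summary for ONE condition group (hypothetical helper).

    expr_matrix: array of shape (n_replicates, n_genes).
    """
    m = np.asarray(expr_matrix, dtype=float)
    if m.shape[0] < 2:
        raise ValueError("need at least two replicates to estimate variance")

    mean = m.mean(axis=0)
    var = m.var(axis=0, ddof=1)       # unbiased per-gene variance
    keep = (mean > 0) & (var > 0)     # logs require positive values

    # Taylor k: OLS slope of log(variance) on log(mean) across genes
    taylor_k, _ = np.polyfit(np.log(mean[keep]), np.log(var[keep]), 1)

    # Fano factor per gene; the median is an assumed summary choice
    fano = var[keep] / mean[keep]
    return {"taylor_k": float(taylor_k), "median_fano": float(np.median(fano))}

# Synthetic Poisson counts (variance equals mean), not real data
rng = np.random.default_rng(0)
gene_means = rng.uniform(5, 500, size=2000)
counts = rng.poisson(lam=gene_means, size=(50, 2000))
poisson_features = group_noise_features(counts)
print(poisson_features)
```

For Poisson-like noise both Taylor k and the Fano factor sit near 1, the lower end of the [1, 2] range listed for Taylor k; noise whose standard deviation scales with the mean pushes k toward 2.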
Platforms differ widely in probe count (19K-51K). To make features comparable across platforms, we keep only the top K = 15,000 positive expression values per sample before computing features. A sweep over K shows the median cross-platform AUROC peaking at this value:
| K | Median AUROC | Notes |
|---|---|---|
| 5,000 | 0.575 | Too aggressive — loses signal |
| 10,000 | 0.700 | |
| 15,000 | 0.720 | Optimal |
| 20,000 | 0.657 | Includes noise probes |
| 25,000 | 0.570 | |
| Full | 0.555 | Platform confound dominates |
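The top-K step itself is simple. A minimal sketch, assuming it means "drop non-positive values, sort descending, keep the first K"; the helper name `top_k_positive` is illustrative and the repository's own function may be named and implemented differently.

```python
import numpy as np

def top_k_positive(expr, k=15_000):
    """Return the k largest positive values of one sample, in descending order.

    If the sample has fewer than k positive values, all of them are returned.
    Hypothetical helper for illustration.
    """
    x = np.asarray(expr, dtype=float)
    x = x[x > 0]                 # discard zeros and negative intensities
    return np.sort(x)[::-1][:k]  # descending sort, then truncate

# Toy vector with a negative and a zero entry
toy = top_k_positive([3.0, -1.0, 0.0, 5.0, 2.0], k=2)
print(toy)
```

Because every sample is truncated to the same length, features that depend on vector length (such as Shannon entropy, bounded by log n) are computed on the same footing for a 19K-probe array and a 51K-probe array.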
| Dataset | Platform | N | Groups | Role |
|---|---|---|---|---|
| GSE161731 | RNA-seq NovaSeq | 155 | Healthy, COVID-19, CoV, Flu, Bacterial | Development (primary) |
| GSE157103 | RNA-seq NovaSeq | 126 | COVID (100), non-COVID (26) | Negative control |
| GSE63990 | Affy U133A 2.0 | 280 | Bacterial (73), Viral (117), Non-infectious (90) | Development |
| GSE60244 | Illumina HT-12 V4.0 | 158 | Bacterial (22), Viral (71), Coinfection (25), Healthy (40) | Development |
| GSE72829 | Illumina HT-12 V4.0 | 196 | Def. Bacterial (52), Def. Viral (92), Control (52) | Excluded (artifact) |
| GSE236713 | Agilent SurePrint G3 | 155 | Sepsis (125), Healthy (30) | Development |
| GSE100150 | Illumina WG-6 v3.0 | 248 | Healthy (95), Viral (95), Bacterial (58) | Validation |
| GSE111368 | Illumina HT-12 V4.0 | 239 | Healthy (130), Influenza (109) | Excluded (degenerate) |
| GSE40012 | Illumina HT-12 V3.0 | 58 | Healthy (18), Bacterial (16), SIRS (13), Viral (8) | Excluded (degenerate) |
| GSE73072 | Affy U133A 2.0 | 296 | Healthy (148), Viral challenge (148) | Challenge study |
| Dataset | Tissue | N | Groups | Source |
|---|---|---|---|---|
| OSD-202 | Brain | 41 | NL_Ctrl, NL_Rad, HLU_Ctrl, HLU_Rad | NASA GeneLab |
| OSD-211 | Spleen | 16 | Control, Irradiated | NASA GeneLab |
| OSD-237 | Skin | 25 | 0 Gy, 0.04-13 Gy | NASA GeneLab |

    cbrn-physics-features/
    ├── README.md
    ├── LICENSE                              # MIT
    ├── requirements.txt
    ├── .gitignore
    ├── src/physics_features/                # Feature computation library (~900 lines)
    │   ├── __init__.py
    │   ├── bulk.py                          # Per-sample and group features
    │   ├── information.py                   # Entropy, Gini, Fano, MI
    │   ├── scaling_laws.py                  # Zipf exponent
    │   └── conservation.py                  # Conservation law candidates
    ├── scripts/
    │   ├── download_infection_data.py       # Download GEO data
    │   ├── download_validation_data.py      # Download validation datasets
    │   ├── run_infection_analysis.py        # Full infection pipeline
    │   ├── run_radiation_analysis.py        # Radiation biodosimetry
    │   ├── run_dose_response.py             # Dose-response modeling
    │   ├── run_topk_diagnostic.py           # Top-K normalization sweep
    │   ├── make_infection_figures.py        # 18 publication-quality figures
    │   ├── make_radiation_figures.py        # 7 radiation figures
    │   └── make_dose_response_figures.py    # 6 dose-response figures
    ├── data/
    │   ├── geo/                             # GEO data (downloaded by scripts)
    │   └── radiation/                       # GeneLab data (included, 37 MB)
    │       ├── brain/
    │       ├── spleen/
    │       └── skin_rad/
    └── results/
        ├── infection/                       # 19 CSVs + summary + 18 figures
        ├── radiation/                       # 5 CSVs + 7 figures
        └── dose_response/                   # 5 CSVs + 6 figures

- GEO data not included: run `download_infection_data.py` and `download_validation_data.py` first (~1 GB total).
- Gene-level comparison (Stage 4) requires `ensembl_to_symbol.csv` in `data/geo/`. Without it, this comparison is gracefully skipped.
- GPL probe mapping CSVs are auto-generated on first run from GPL annotation files.
- 2/4 validation datasets excluded: GSE111368 and GSE40012 produce nearly constant distributional features due to platform-specific preprocessing. See `results/infection/RESULTS_SUMMARY.md`.
- Physics features are not a replacement for gene-identity biomarkers. They are strongest for infected-vs-healthy triage when gene mapping is unavailable.
If you use this work, please cite:

    @software{kim2026physics,
      author = {Kim, Jangwoo},
      title = {Physics-Informed Distributional Features for CBRN Threat Detection},
      year = {2026},
      url = {https://github.com/jang1563/cbrn-physics-features},
      license = {MIT}
    }

MIT License. See LICENSE.