Physics-Informed Distributional Features for CBRN Threat Detection

License: MIT · Python 3.8+ · Dataset on HF

Pathogen-agnostic infection detection, radiation biodosimetry, and dose-response modeling using distributional physics features computed from gene expression data.


Motivation

Traditional gene-expression biomarkers (e.g., Sweeney-7, Herberg-2) require:

  • Probe-to-symbol mapping specific to each microarray platform
  • Gene-identity annotations that may not exist for novel platforms
  • Platform-specific normalization pipelines

This limits their use for rapid field triage on unknown or unannotated platforms.

We take a different approach: instead of asking which genes change, we ask how the entire distribution changes. Four physics-inspired summary statistics capture the shape of any expression vector — no gene names needed.

Key Idea

From any expression vector (RNA-seq counts, microarray intensities, etc.), compute:

| Feature | What it measures | Why it matters |
| --- | --- | --- |
| Gini coefficient | Expression inequality | Infected samples concentrate expression in fewer genes |
| Shannon entropy | Expression diversity | Infection reduces transcriptional diversity |
| Normalized entropy | Scale-free diversity | Comparable across platforms with different probe counts |
| Zipf exponent | Power-law decay rate | Steeper decay = more extreme expression hierarchy |

These features are platform-agnostic: they require no gene annotation, no probe mapping, and no platform-specific normalization. The same computation applies unchanged to RNA-seq, Affymetrix, Illumina, and Agilent data.
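
As a minimal sketch of the four per-sample features, the function below follows the formulas listed in the Physics Features Reference. The function name `per_sample_features`, the choice to drop non-positive values first, and the use of the natural logarithm are assumptions made for illustration; the library in src/physics_features/ may differ in these details.

```python
import numpy as np

def per_sample_features(expr):
    """Gini, Shannon entropy, normalized entropy, and Zipf exponent of one sample.

    Hypothetical helper for illustration; not the repository's API.
    """
    x = np.asarray(expr, dtype=float)
    x = np.sort(x[x > 0])              # assumption: keep positive values only, ascending
    n = x.size
    if n < 2:
        raise ValueError("need at least two positive expression values")

    # Gini coefficient via the discrete Lorenz-curve formula on sorted values.
    idx = np.arange(1, n + 1)
    gini = 2.0 * np.sum(idx * x) / (n * np.sum(x)) - (n + 1.0) / n

    # Shannon entropy H = -sum(p log p) of the normalized expression vector.
    p = x / x.sum()
    entropy = -np.sum(p * np.log(p))

    # Normalized entropy H / log(n_expressed), which lies in [0, 1].
    norm_entropy = entropy / np.log(n)

    # Zipf exponent: OLS slope of log(expression) on log(rank), rank 1 = highest.
    ranks = np.arange(1, n + 1)
    zipf, _intercept = np.polyfit(np.log(ranks), np.log(x[::-1]), 1)

    return {
        "gini": float(gini),
        "entropy": float(entropy),
        "norm_entropy": float(norm_entropy),
        "zipf": float(zipf),
    }
```

No gene identifiers enter the computation: the input is only a vector of numbers, which is why the same call works for any platform.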

Results at a Glance

| Track | Datasets | Samples | Key Finding |
| --- | --- | --- | --- |
| Infection Detection | 10 GEO datasets, 5 platform families | 1,954 | Cross-platform AUROC 0.720 (median); independent validation 0.731 |
| Radiation Biodosimetry | 3 GeneLab tissues (OSD-202/211/237) | 82 | Taylor k separates irradiated vs control (AUROC 0.793) |
| Dose-Response | 3 tissues, 0-13 Gy | 82 | Monotonic dose-feature relationships with cross-tissue transferability |

Cross-Platform Transfer Matrix (Infection)

Physics features trained on one platform transfer to the others with AUROCs ranging from 0.483 to 0.849 (rows: training dataset; columns: test dataset):

| Train / Test | GSE161731 (RNA-seq) | GSE63990 (Affy) | GSE60244 (Illumina) | GSE236713 (Agilent) |
| --- | --- | --- | --- | --- |
| GSE161731 | -- | 0.847 | 0.718 | 0.652 |
| GSE63990 | 0.721 | -- | 0.635 | 0.571 |
| GSE60244 | 0.773 | 0.849 | -- | 0.483 |
| GSE236713 | 0.763 | 0.842 | 0.708 | -- |

Independent validation on a held-out platform (GSE100150, Illumina WG-6 v3): AUROC 0.731.
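
The following sketch summarizes the transfer matrix above by taking the median of its 12 off-diagonal entries. The values are copied from the matrix; that the reported median AUROC of 0.720 was derived in exactly this way is an assumption.

```python
import numpy as np

datasets = ["GSE161731", "GSE63990", "GSE60244", "GSE236713"]

# Rows: training dataset; columns: test dataset. The diagonal is undefined (NaN).
auroc = np.array([
    [np.nan, 0.847, 0.718, 0.652],
    [0.721, np.nan, 0.635, 0.571],
    [0.773, 0.849, np.nan, 0.483],
    [0.763, 0.842, 0.708, np.nan],
])

# Collect the off-diagonal train/test pairs and summarize them.
off_diag = auroc[~np.isnan(auroc)]
median_auroc = float(np.median(off_diag))   # 0.7195 for the values above

# Mean AUROC achieved when training on each dataset, ignoring the diagonal.
per_train_mean = np.nanmean(auroc, axis=1)

for name, value in zip(datasets, per_train_mean):
    print(f"train on {name}: mean test AUROC {value:.3f}")
print(f"median over all pairs: {median_auroc:.4f}")
```

The off-diagonal median of 0.7195 is consistent with the reported cross-platform figure of 0.720.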

Honest Comparison with Gene-Identity Signatures

Physics features complement (not replace) gene-identity approaches:

| Task | Physics Features | Sweeney-7 | Herberg-2 |
| --- | --- | --- | --- |
| Infected vs Healthy (cross-platform) | 0.720 | N/A | N/A |
| Bacterial vs Viral (within-dataset) | 0.68-0.92 | 0.88-0.96 | 0.70-0.96 |
| Bacterial vs Viral (cross-platform) | 0.50 | 0.92 | N/A |
| Validation (GSE100150, pooled) | 0.492 | 0.640 | 0.745 |

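Physics features are best suited to infected-vs-healthy detection across platforms (median AUROC 0.720), a task that gene signatures cannot perform without probe mapping. Gene signatures are superior for bacterial-vs-viral discrimination when their genes can be mapped: in the cross-platform setting, physics features fall to chance (AUROC 0.50) while Sweeney-7 reaches 0.92.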
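
All comparisons above are reported as AUROC. As a reminder of what that number measures, here is a small NumPy-only sketch that computes it from the Mann-Whitney rank statistic. The helper names `average_ranks` and `auroc` are hypothetical; the analysis scripts may well use scikit-learn instead.

```python
import numpy as np

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    v = np.asarray(values, dtype=float)
    order = np.argsort(v, kind="mergesort")
    sorted_v = v[order]
    ranks = np.empty(v.size, dtype=float)
    i = 0
    while i < v.size:
        j = i
        while j + 1 < v.size and sorted_v[j + 1] == sorted_v[i]:
            j += 1
        # Positions i..j (0-based) hold equal values; assign their mean rank.
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0
        i = j + 1
    return ranks

def auroc(labels, scores):
    """Probability that a random positive sample scores above a random negative one."""
    y = np.asarray(labels).astype(bool)
    n_pos = int(y.sum())
    n_neg = int((~y).sum())
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    r = average_ranks(scores)
    u = r[y].sum() - n_pos * (n_pos + 1) / 2.0   # Mann-Whitney U statistic
    return float(u / (n_pos * n_neg))
```

An AUROC of 0.5 corresponds to chance, which is why the cross-platform bacterial-vs-viral value of 0.50 indicates no usable signal.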

Installation

git clone https://github.com/jang1563/cbrn-physics-features.git
cd cbrn-physics-features
pip install -r requirements.txt

Requirements: Python 3.8+, NumPy, Pandas, SciPy, scikit-learn, Matplotlib.

Quick Start

Infection Detection

# 1. Download GEO data (~1 GB)
python scripts/download_infection_data.py
python scripts/download_validation_data.py

# 2. Run full analysis pipeline (stages 1-5)
python scripts/run_infection_analysis.py

# 3. Generate figures (18 PNGs)
python scripts/make_infection_figures.py

Use --stages 1 2 to run only specific stages. Results are saved to results/infection/.

Radiation Biodosimetry

# Data included in data/radiation/ (37 MB)
python scripts/run_radiation_analysis.py
python scripts/make_radiation_figures.py

Dose-Response Modeling

# Uses same radiation data
python scripts/run_dose_response.py
python scripts/make_dose_response_figures.py
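
The scripts above are not reproduced here. As a rough illustration of what a monotonic dose-feature relationship means, the sketch below averages one feature per dose level and checks whether the means move in a single direction. The helper names and the toy numbers are made up for illustration and are not results from the repository.

```python
import numpy as np

def dose_means(doses, feature):
    """Mean feature value at each distinct dose, ordered by increasing dose."""
    d = np.asarray(doses, dtype=float)
    f = np.asarray(feature, dtype=float)
    levels = np.unique(d)                      # sorted ascending
    means = np.array([f[d == lv].mean() for lv in levels])
    return levels, means

def monotonic_direction(means):
    """Classify a sequence of per-dose means as increasing, decreasing, or neither."""
    diffs = np.diff(means)
    if np.all(diffs >= 0) and np.any(diffs > 0):
        return "increasing"
    if np.all(diffs <= 0) and np.any(diffs < 0):
        return "decreasing"
    return "non-monotonic"

# Toy example (invented values): two samples at each of three doses in Gy.
toy_doses = [0, 0, 1, 1, 4, 4]
toy_gini = [0.50, 0.52, 0.55, 0.57, 0.60, 0.64]
levels, means = dose_means(toy_doses, toy_gini)
print(levels, means, monotonic_direction(means))
```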

Physics Features Reference

| Feature | Formula | Interpretation | Range | Needs replicates? |
| --- | --- | --- | --- | --- |
| Gini coefficient | Lorenz curve area | Expression inequality | [0, 1] | No |
| Shannon entropy | H = -sum(p log p) | Expression diversity | [0, log n] | No |
| Normalized entropy | H / log(n_expressed) | Scale-free diversity | [0, 1] | No |
| Zipf exponent (gamma) | OLS: log(expr) ~ log(rank) | Power-law decay rate | [-3, 0] | No |
| Taylor k | OLS: log(Var) ~ log(Mean) | Noise scaling exponent | [1, 2] | Yes |
| Fano factor | Var / Mean | Overdispersion | [0, inf) | Yes |

The first four features are computed per sample, so they need no replicates and enable single-sample triage. Taylor k and Fano factor require biological replicates and are computed per condition group.
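
A minimal sketch of the two replicate-based features, following the formulas in the table: per-gene mean and variance are taken across the replicates of one condition group, Taylor k is the OLS slope of log(Var) on log(Mean), and the Fano factor is Var / Mean per gene. The function name `group_features`, the genes-by-replicates layout, and summarizing the Fano factor by its median are assumptions for illustration.

```python
import numpy as np

def group_features(matrix):
    """Taylor k and median Fano factor for one condition group.

    matrix: 2-D array, genes in rows and biological replicates in columns
    (hypothetical layout; the repository's functions may expect another).
    """
    m = np.asarray(matrix, dtype=float)
    if m.ndim != 2 or m.shape[1] < 2:
        raise ValueError("need a genes x replicates matrix with >= 2 replicates")

    mean = m.mean(axis=1)
    var = m.var(axis=1, ddof=1)        # sample variance across replicates

    # Logs require strictly positive mean and variance.
    keep = (mean > 0) & (var > 0)
    if keep.sum() < 2:
        raise ValueError("need at least two genes with positive mean and variance")

    # Taylor k: slope of log(Var) ~ log(Mean).
    taylor_k, _intercept = np.polyfit(np.log(mean[keep]), np.log(var[keep]), 1)

    # Fano factor per gene, summarized by the median (an assumption).
    fano = var[keep] / mean[keep]
    return {"taylor_k": float(taylor_k), "median_fano": float(np.median(fano))}
```

Because variance is estimated across replicates, these two features cannot be computed for a single sample.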

Top-K Normalization

Platforms differ dramatically in probe count (19K-51K). To make features comparable across platforms, we extract the top K = 15,000 positive expression values per sample before computing features:

| K | Median AUROC | Notes |
| --- | --- | --- |
| 5,000 | 0.575 | Too aggressive; loses signal |
| 10,000 | 0.700 | |
| 15,000 | 0.720 | Optimal |
| 20,000 | 0.657 | Includes noise probes |
| 25,000 | 0.570 | |
| Full | 0.555 | Platform confound dominates |
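
The top-K step can be sketched as follows. The function name `top_k_positive` is hypothetical, and the behaviour for samples with fewer than K positive values (return all of them) is an assumption, since this case is not described above.

```python
import numpy as np

def top_k_positive(expr, k=15000):
    """Return the k largest positive, finite expression values, in descending order."""
    x = np.asarray(expr, dtype=float)
    x = x[np.isfinite(x) & (x > 0)]    # drop NaN, inf, zeros, and negatives

    if x.size <= k:
        # Assumption: with fewer than k positive values, keep them all.
        return np.sort(x)[::-1]

    # np.partition moves the k largest values into the last k positions
    # without fully sorting the array.
    top = np.partition(x, x.size - k)[-k:]
    return np.sort(top)[::-1]
```

Truncating every sample to the same K values gives each platform a vector of equal length, so entropy- and rank-based features are computed over the same number of entries regardless of probe count.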

Datasets

Infection Detection (10 GEO datasets)

| Dataset | Platform | N | Groups | Role |
| --- | --- | --- | --- | --- |
| GSE161731 | RNA-seq NovaSeq | 155 | Healthy, COVID-19, CoV, Flu, Bacterial | Development (primary) |
| GSE157103 | RNA-seq NovaSeq | 126 | COVID (100), non-COVID (26) | Negative control |
| GSE63990 | Affy U133A 2.0 | 280 | Bacterial (73), Viral (117), Non-infectious (90) | Development |
| GSE60244 | Illumina HT-12 V4.0 | 158 | Bacterial (22), Viral (71), Coinfection (25), Healthy (40) | Development |
| GSE72829 | Illumina HT-12 V4.0 | 196 | Def. Bacterial (52), Def. Viral (92), Control (52) | Excluded (artifact) |
| GSE236713 | Agilent SurePrint G3 | 155 | Sepsis (125), Healthy (30) | Development |
| GSE100150 | Illumina WG-6 v3.0 | 248 | Healthy (95), Viral (95), Bacterial (58) | Validation |
| GSE111368 | Illumina HT-12 V4.0 | 239 | Healthy (130), Influenza (109) | Excluded (degenerate) |
| GSE40012 | Illumina HT-12 V3.0 | 58 | Healthy (18), Bacterial (16), SIRS (13), Viral (8) | Excluded (degenerate) |
| GSE73072 | Affy U133A 2.0 | 296 | Healthy (148), Viral challenge (148) | Challenge study |

Radiation Biodosimetry (3 GeneLab datasets)

| Dataset | Tissue | N | Groups | Source |
| --- | --- | --- | --- | --- |
| OSD-202 | Brain | 41 | NL_Ctrl, NL_Rad, HLU_Ctrl, HLU_Rad | NASA GeneLab |
| OSD-211 | Spleen | 16 | Control, Irradiated | NASA GeneLab |
| OSD-237 | Skin | 25 | 0 Gy, 0.04-13 Gy | NASA GeneLab |

Repository Structure

cbrn-physics-features/
├── README.md
├── LICENSE                             # MIT
├── requirements.txt
├── .gitignore
├── src/physics_features/              # Feature computation library (~900 lines)
│   ├── __init__.py
│   ├── bulk.py                        # Per-sample and group features
│   ├── information.py                 # Entropy, Gini, Fano, MI
│   ├── scaling_laws.py                # Zipf exponent
│   └── conservation.py                # Conservation law candidates
├── scripts/
│   ├── download_infection_data.py     # Download GEO data
│   ├── download_validation_data.py    # Download validation datasets
│   ├── run_infection_analysis.py      # Full infection pipeline
│   ├── run_radiation_analysis.py      # Radiation biodosimetry
│   ├── run_dose_response.py           # Dose-response modeling
│   ├── run_topk_diagnostic.py         # Top-K normalization sweep
│   ├── make_infection_figures.py      # 18 publication-quality figures
│   ├── make_radiation_figures.py      # 7 radiation figures
│   └── make_dose_response_figures.py  # 6 dose-response figures
├── data/
│   ├── geo/                           # GEO data (downloaded by scripts)
│   └── radiation/                     # GeneLab data (included, 37 MB)
│       ├── brain/
│       ├── spleen/
│       └── skin_rad/
└── results/
    ├── infection/                     # 19 CSVs + summary + 18 figures
    ├── radiation/                     # 5 CSVs + 7 figures
    └── dose_response/                 # 5 CSVs + 6 figures

Known Limitations

  1. GEO data not included — Run download_infection_data.py and download_validation_data.py first (~1 GB total).
  2. Gene-level comparison (Stage 4) requires ensembl_to_symbol.csv in data/geo/. Without it, this comparison is gracefully skipped.
  3. GPL probe mapping CSVs are auto-generated on first run from GPL annotation files.
  4. 2/4 validation datasets excluded — GSE111368 and GSE40012 produce nearly constant distributional features due to platform-specific preprocessing. See results/infection/RESULTS_SUMMARY.md.
  5. Physics features are not a replacement for gene-identity biomarkers — they are strongest for infected-vs-healthy triage when gene mapping is unavailable.
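
Limitation 4 describes datasets whose distributional features are nearly constant across samples. One way such a dataset could be flagged before analysis is sketched below, using the coefficient of variation of a feature across samples. The function name `is_degenerate` and the threshold `min_cv=1e-3` are hypothetical; the criterion actually used for the exclusions is documented in results/infection/RESULTS_SUMMARY.md and may differ.

```python
import numpy as np

def is_degenerate(feature_values, min_cv=1e-3):
    """Flag a feature whose values barely vary across the samples of a dataset.

    min_cv is an illustrative threshold on the coefficient of variation,
    not a value taken from the repository.
    """
    v = np.asarray(feature_values, dtype=float)
    mean = float(np.mean(v))
    std = float(np.std(v))
    if mean == 0.0:
        # Coefficient of variation is undefined; fall back to the raw spread.
        return std == 0.0
    return (std / abs(mean)) < min_cv
```

A feature that is flagged this way carries almost no information for separating groups within that dataset, whatever the classifier.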

Citation

If you use this work, please cite:

@software{kim2026physics,
  author = {Kim, Jangwoo},
  title = {Physics-Informed Distributional Features for CBRN Threat Detection},
  year = {2026},
  url = {https://github.com/jang1563/cbrn-physics-features},
  license = {MIT}
}

License

MIT License. See LICENSE.
