This repository accompanies the paper:
Bayesian Optimization for Structure-Based Drug Design with High-Fidelity MM/GBSA Scoring https://doi.org/10.1101/2025.06.22.660936
We study active learning / Bayesian optimization (BO) for structure-based drug design, using MM/GBSA as a high-fidelity oracle in place of fast docking scores. The main finding is that BO with molecular language model embeddings recovers top-scoring compounds with far fewer oracle evaluations than random screening.
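To make the comparison concrete, the sketch below shows a generic pool-based active learning loop and a top-k recall metric. It is an illustration only and is not the code in bayesian_optimization/: the function names, the 1-D toy features, and the nearest-neighbour surrogate are all assumptions, and lower scores are assumed to be better. A real run would replace the toy surrogate with a probabilistic model over molecular embeddings.

```python
import random


def top_k_recall(evaluated, true_scores, k):
    """Fraction of the k best (lowest-scoring) compounds that were evaluated."""
    best = sorted(range(len(true_scores)), key=lambda i: true_scores[i])[:k]
    return len(set(best) & set(evaluated)) / k


def active_learning(features, oracle, n_init, batch_size, n_rounds, seed=0):
    """Greedy loop with a toy 1-nearest-neighbour surrogate (stand-in only)."""
    rng = random.Random(seed)
    pool = list(range(len(features)))
    # Random initial set, scored with the expensive oracle.
    labelled = {i: oracle(i) for i in rng.sample(pool, n_init)}
    for _ in range(n_rounds):
        unlabelled = [i for i in pool if i not in labelled]
        if not unlabelled:
            break

        def predict(i):
            # Predicted score = score of the closest labelled compound.
            nearest = min(labelled, key=lambda j: abs(features[i] - features[j]))
            return labelled[nearest]

        # Acquire the batch with the best (lowest) predicted scores.
        batch = sorted(unlabelled, key=predict)[:batch_size]
        for i in batch:
            labelled[i] = oracle(i)  # one oracle call per acquired compound
    return list(labelled)


# Toy pool: 200 compounds whose score varies smoothly with a 1-D feature.
features = [i / 199 for i in range(200)]
true_scores = [(x - 0.7) ** 2 for x in features]

found = active_learning(features, lambda i: true_scores[i],
                        n_init=10, batch_size=10, n_rounds=4)
print(len(found), top_k_recall(found, true_scores, k=10))
```

Here the oracle is called 50 times (10 initial compounds plus 4 batches of 10) on a pool of 200, and the recall value reports how many of the true top 10 were among them. Random screening with the same budget would be the natural baseline.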
For a guide on running the optimizer on your own data or with a custom scoring oracle, see USAGE.md.
All steps except docking use the base conda environment:
conda env create -f docker/environment.yml
conda activate base
The docking step requires a separate environment due to Python version incompatibilities with
the vina and easydock packages:
conda env create -f docker/environment-docking.yml
Docker images are also provided (docker/Dockerfile.base, docker/Dockerfile.docking) for
fully reproducible containerized execution.
Run the preprocessing notebook to produce the required CSV files:
data/process_data.ipynb
Expected output files in data/:
- MCL1-mmgbsa.csv: MCL1 MM/GBSA scores
- MCL1-vina.csv: MCL1 Vina docking scores
- Enamine10k_scores.csv
- Enamine50k_scores.csv
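Before launching the experiments, it can help to confirm that preprocessing produced every expected file. The following is a minimal sketch and not part of the repository: the helper name `missing_outputs` is hypothetical, and it only checks that the four files listed above exist, not that their contents are valid.

```python
from pathlib import Path

# File names taken from the list of expected outputs above.
EXPECTED_FILES = [
    "MCL1-mmgbsa.csv",
    "MCL1-vina.csv",
    "Enamine10k_scores.csv",
    "Enamine50k_scores.csv",
]


def missing_outputs(data_dir="data"):
    """Return the expected CSV files that are absent from data_dir (hypothetical helper)."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_outputs()
    if missing:
        print("Missing preprocessing outputs:", ", ".join(missing))
    else:
        print("All expected preprocessing outputs are present.")
```

Run it from the repository root so that the default `data` path resolves correctly.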
bash run_bayesian_optimization.sh
Results are written to bayesian_optimization_results/. To run ablations:
bash run_seeds.sh # initialization robustness
bash run_batch_size.sh # batch acquisition size
bash run_diverse_init.sh # diverse initialization
bash run_ucb_experiment.sh # UCB kappa sweep
bash run_ucb_kappa.sh # UCB acquisition function for a specific kappa
Open the notebooks in analysis/ corresponding to the experiment of interest (e.g.
analysis/mcl1_mmgbsa.ipynb for the main MM/GBSA results). Figures are saved to
analysis/figures/.
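The UCB ablations listed above vary kappa, the weight placed on the surrogate's predictive uncertainty. As a rough illustration only, the sketch below assumes that lower scores are better (the usual convention for binding energies), so the optimistic bound is the mean minus kappa times the standard deviation. The function names, the sign convention, and the toy numbers are assumptions and are not taken from the ligbo library.

```python
def ucb_scores(means, stds, kappa):
    """Optimistic bound for minimization: a lower value is more promising."""
    return [m - kappa * s for m, s in zip(means, stds)]


def select_batch(means, stds, kappa, batch_size):
    """Return indices of the batch_size candidates with the lowest optimistic bound."""
    scores = ucb_scores(means, stds, kappa)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return ranked[:batch_size]


# Toy surrogate predictions (made-up numbers).
means = [-40.0, -38.0, -30.0]
stds = [0.5, 4.0, 1.0]

print(select_batch(means, stds, kappa=0.0, batch_size=1))  # [0]: pure exploitation
print(select_batch(means, stds, kappa=2.0, batch_size=1))  # [1]: uncertainty rewarded
```

With kappa set to 0 the candidate with the best predicted mean is chosen; with kappa set to 2 the more uncertain second candidate overtakes it, which is the trade-off a kappa sweep explores.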
If you want to reproduce the full pipeline including data collection:
- Screening: ZINC22 similarity screening; see screening/README.md
- Docking: AutoDock Vina via easydock (docking environment); see docking/README.md
- MD simulation + MM/GBSA: GROMACS + gmx_MMPBSA; see simulation_configuration/README.md
- Preprocessing: data/process_data.ipynb
- BO experiments: run_bayesian_optimization.sh
- Analysis: notebooks in analysis/
bayesian-optimization-mmgbsa/
│
├── docker/ # Conda environment files and Docker images
│ ├── environment.yml # Base environment (BO + analysis)
│ ├── environment-docking.yml # Docking environment (separate due to version constraints)
│ ├── Dockerfile.base # Docker image for base environment
│ └── Dockerfile.docking # Docker image for docking environment
│
├── screening/ # Step 1: ZINC22 similarity screening
│ ├── similarity.py # MCS-based screening script
│ └── README.md
│
├── docking/ # Step 2: AutoDock Vina docking via easydock
│ ├── start_docking # Docking launcher script
│ ├── extract_mols.py # Post-processing: extract mol blocks from .db files
│ └── README.md
│
├── simulation_configuration/ # Step 3: GROMACS + gmx_MMPBSA setup
│ ├── *.mdp # GROMACS MD parameter files
│ ├── mmpbsa.in # gmx_MMPBSA input
│ └── README.md
│
├── data/ # Step 4: Data preprocessing
│ ├── process_data.ipynb # Deduplication and formatting
│ └── README.md
│
├── bayesian_optimization/ # Step 5: Active learning experiments
│ ├── run.py # Main entry point
│ └── ligbo/ # Core BO library (surrogate models, embeddings, samplers)
│
├── bayesian_optimization_results/ # Output: BO run results (populated at runtime)
│
├── analysis/ # Step 6: Analysis notebooks and figures
│ ├── figures/ # Publication figures
│ ├── mcl1_mmgbsa.ipynb # Main MCL1 MM/GBSA results
│ ├── mcl1_vina.ipynb # MCL1 Vina results
│ ├── mcl1_vina_docking_in_one.ipynb # Combined Vina analysis
│ ├── mcl1_batch_size.ipynb # Effect of batch acquisition size
│ ├── mcl1_medoid_effect.ipynb # Effect of initialization strategy (medoid)
│ ├── mcl1_mmgbsa_diverse_init.ipynb # Diverse initialization ablation
│ ├── mcl1_mmgbsa_ucb.ipynb # UCB acquisition function experiments
│ ├── enamine10k.ipynb # Enamine 10k benchmark
│ ├── enamine50k.ipynb # Enamine 50k benchmark
│ ├── seeds_error_bars.ipynb # Initialization robustness / error bars
│ ├── rogi_curves.ipynb # ROGI landscape analysis
│ ├── utils.py # Shared plotting utilities
│ └── exploratory/ # Exploratory dataset analysis (score distributions, embeddings)
│
├── run_bayesian_optimization.sh # Runs all main BO experiments
├── run_seeds.sh # Initialization robustness sweep
├── run_batch_size.sh # Batch size ablation
├── run_diverse_init.sh # Diverse initialization ablation
├── run_ucb_experiment.sh # UCB kappa hyperparameter sweep
└── run_ucb_kappa.sh # UCB acquisition ablation
