This repository contains the codebase and analytical workflows for the manuscript “The Cumulative Impact of Passenger Mutations on Cancer Development.” The project explores the functional effects of passenger mutations across diverse cancer types, quantifies their impact on patient survival, and evaluates their value as prognostic biomarkers.
The analyses include:
- Quantification of passenger mutation burden in known cancer genes
- Prediction of pathogenicity for passenger mutations using computational tools
- Assessment of functional impact on transcription factor binding, splicing, and gene expression
- Evaluation of survival outcomes and the prognostic potential of passenger mutations

The repository is organized as follows:
- datasets/: Raw data used for all analyses. Due to privacy and access restrictions, only a subset is shared via Zenodo (DOI: 10.5281/zenodo.19432064). Controlled TCGA data must be obtained independently under the relevant data use agreements.
- metadata/: Text files and tabular metadata collected from multiple sources and used throughout the analyses.
- notebooks/: Jupyter notebooks for analysis and visualization of the primary PCAWG cohort.
- scripts/: Scripts for downloading, parsing, and preprocessing PCAWG data, including wrapper scripts for pipeline execution.
- POG570_scripts/: Scripts and notebooks used to reproduce core analyses on the POG570 cohort.
- TCGA_scripts/: Scripts and notebooks used to reproduce core analyses on the TCGA WGS cohort.
- conda_env_cancer_passengers.yml: Conda environment specification used to reproduce the analysis environment.
To get started, clone the repository and create the conda environment from the provided specification file.
# clone the repository
git clone https://github.com/Georgakopoulos-Soares-lab/cancer_passengers.git
cd cancer_passengers
# create a new conda environment
conda env create -f conda_env_cancer_passengers.yml
conda activate cancer_passengers

Set up the following resources at the repository root:
- Download and set up Annovar at annovar/.
- Download the CADD whole-genome SNV file from CADD and place it at cadd/whole_genome_SNVs.tsv.gz (tabix-indexed).
- Set up Pangolin.
- Clone the Borzoi repository and follow its installation instructions to set up borzoi/ and baskerville/.
- Download UCSC liftOver binary at scripts/liftOver.
- Set up the GDC Client.
- Download the datasets/ directory from Zenodo and place it at the repository root. This directory contains workflow input files.
- The datasets/ directory includes open-access files only. Controlled TCGA files are excluded and require dbGaP authorization under accession phs000178.v11.p8.
- After obtaining TCGA access, download the following files from ICGC Bionimbus PDC (https://icgc.bionimbus.org), extract them, and place them at the indicated paths:
  - final_consensus_passonly.snv_mnv_indel.tcga.controlled.maf.gz -> datasets/PCAWG/snv_mnv_indel/final_consensus_passonly.snv_mnv_indel.tcga.controlled.maf
  - TableS3_panorama_driver_mutations_TCGA_samples.controlled.tsv.gz -> datasets/PCAWG/driver_mutations/TableS3_panorama_driver_mutations_TCGA_samples.controlled.tsv
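
Before starting the pipeline, it can help to confirm that the resources listed above are where the scripts expect them. The following is a minimal sketch, not part of the repository: the function name `check_resources` is hypothetical, and the `.tbi` suffix assumes the CADD index sits next to the data file under the usual tabix naming. Pangolin and the GDC client are not checked because no install path is given for them here.

```shell
# Sketch: report which expected resources are missing under a repository root.
# ASSUMPTIONS: check_resources is a hypothetical helper; the .tbi entry assumes
# the standard tabix index name next to the CADD file.

check_resources() {
    local root="${1:-.}"
    local missing=0
    local path
    # Paths named in the setup list, relative to the repository root.
    for path in \
        annovar \
        cadd/whole_genome_SNVs.tsv.gz \
        cadd/whole_genome_SNVs.tsv.gz.tbi \
        borzoi \
        baskerville \
        scripts/liftOver \
        datasets
    do
        if [ ! -e "$root/$path" ]; then
            echo "missing: $path"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when nothing is missing, so callers can stop early.
    [ "$missing" -eq 0 ]
}
```

Calling `check_resources .` from the repository root prints one `missing: ...` line per absent path and returns a non-zero status if any are absent.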
To run the full preprocessing pipeline, execute:
bash scripts/run_scripts.sh

Some analyses are run per cancer type. To run these notebooks across all cancer types using papermill, use:
bash scripts/run_notebooks.sh

All PCAWG analyses and figures generated in this study can be reproduced by running the notebooks in the notebooks/ directory.
Most notebooks can be executed independently after running scripts/run_scripts.sh (Step 1). Two notebooks require per-cancer outputs (from Step 2) before they can generate aggregate plots:
- notebooks/1.2_ mutation_density_genic_region_plots.ipynb depends on results from notebooks/1.1_mutation_density_genic_region_by_cancer.ipynb run for every cancer type.
- notebooks/2.2_ cadd_score_genic_region_plots.ipynb depends on results from notebooks/2.1_cadd_score_genic_region_by_cancer.ipynb run for every cancer type.
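
To illustrate what "run for every cancer type" can look like with papermill, here is a hedged sketch. It is not a copy of scripts/run_notebooks.sh: the function name `run_per_cancer`, the parameter name `cancer_type`, the output directory, and the cancer-type labels are all assumptions, so check the notebook's parameters cell and the wrapper script for the real values. The only papermill feature used is the documented `papermill INPUT OUTPUT -p NAME VALUE` form.

```shell
# Sketch: run one per-cancer notebook for each cancer type with papermill.
# ASSUMPTIONS: the parameter name "cancer_type", the output directory, and the
# cancer-type labels passed by the caller are hypothetical.

run_per_cancer() {
    local notebook="$1"
    local out_dir="$2"
    shift 2
    # Override PAPERMILL (for example PAPERMILL=echo) to get a dry run.
    local runner="${PAPERMILL:-papermill}"
    local cancer base
    base="$(basename "$notebook" .ipynb)"
    mkdir -p "$out_dir"
    for cancer in "$@"; do
        # One executed copy of the notebook per cancer type.
        "$runner" "$notebook" "$out_dir/${base}_${cancer}.ipynb" \
            -p cancer_type "$cancer" || return 1
    done
}
```

A dry run such as `PAPERMILL=echo run_per_cancer notebooks/1.1_mutation_density_genic_region_by_cancer.ipynb out TYPE_A TYPE_B` prints the commands that would be executed, which makes it easy to confirm every cancer type is covered before producing the aggregate plots.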
Before running the TCGA scripts, ensure you have obtained access to TCGA data via dbGaP authorization under accession phs000178.v11.p8.
Also ensure the GDC client is set up for downloading files from the GDC portal. Place the binary at the repository root, or set GDC_CLIENT=/full/path/to/gdc-client before running the scripts.
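
The lookup described above (use GDC_CLIENT if set, otherwise look at the repository root) can be sketched as follows. This is an illustration only: the function name `resolve_gdc_client` is hypothetical, the fallback file name `gdc-client` is assumed from the example path, and the repository's own scripts may resolve the binary differently.

```shell
# Sketch: resolve the gdc-client binary from GDC_CLIENT or the repository root.
# ASSUMPTIONS: resolve_gdc_client is a hypothetical helper; the fallback file
# name "gdc-client" at the repository root is assumed.

resolve_gdc_client() {
    local repo_root="${1:-.}"
    local client="${GDC_CLIENT:-$repo_root/gdc-client}"
    # Accept only an executable regular file, not a directory.
    if [ -f "$client" ] && [ -x "$client" ]; then
        echo "$client"
    else
        echo "gdc-client not found or not executable: $client" >&2
        return 1
    fi
}
```

Running `resolve_gdc_client .` before the download step fails fast with a clear message instead of partway through a long transfer.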
Option 1 (recommended): To run on a cluster, use the SLURM script below. Edit SLURM directives as needed before submission.
bash TCGA_scripts/vcf_annot_pipeline/0_download_tcga_vcfs.sh

Option 2: To run locally, use:
bash TCGA_scripts/vcf_annot_pipeline/run_download_local.sh \
TCGA_scripts/vcf_annot_pipeline/gdc_manifest.2025-10-30.203056.txt \
data/TCGA/datasets/vcfs \
/path/to/gdc-user-token.txt

python TCGA_scripts/vcf_annot_pipeline/0.1_get_top_driver_genes.py
python TCGA_scripts/vcf_annot_pipeline/0.2_get_hotspot_drivers.py
bash TCGA_scripts/vcf_annot_pipeline/0.3_liftover_hotspots_to_hg38.sh

These scripts generate TCGA-specific driver-gene and hotspot files in data/TCGA/.
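
For readers unfamiliar with the liftover step, the sketch below shows the general shape of a UCSC liftOver call, which takes the input file, a chain file, the output file, and a file for unmapped records, in that order. It is not the content of 0.3_liftover_hotspots_to_hg38.sh: the function name, the BED file names, and the assumption that the source assembly is hg19 (hence the `hg19ToHg38.over.chain.gz` chain file) are illustrative.

```shell
# Sketch: lift a BED file of hotspot positions over to hg38 with UCSC liftOver.
# ASSUMPTIONS: liftover_to_hg38 is a hypothetical helper; the source assembly is
# assumed to be hg19, and the input/output file names are up to the caller.

liftover_to_hg38() {
    local in_bed="$1"
    local out_bed="$2"
    # Binary location from the setup list; override LIFTOVER for a dry run.
    local tool="${LIFTOVER:-scripts/liftOver}"
    local chain="${CHAIN:-hg19ToHg38.over.chain.gz}"
    # liftOver argument order: oldFile map.chain newFile unMapped
    "$tool" "$in_bed" "$chain" "$out_bed" "${out_bed%.bed}.unmapped.bed"
}
```

Records that cannot be mapped land in the `.unmapped.bed` file, which is worth inspecting before relying on the lifted hotspot coordinates.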
Option 1 (recommended): To run on a cluster, use the SLURM script below. Edit SLURM directives as needed before submission.
sbatch TCGA_scripts/vcf_annot_pipeline/1_annotate_tcga_vcf.sh

Option 2: To run locally, use:
bash TCGA_scripts/vcf_annot_pipeline/run_annotate_local.sh \
data/TCGA/datasets/vcfs \
./nf_work

These wrappers run the Nextflow pipeline in TCGA_scripts/vcf_annot_pipeline/main.nf.
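
If the same setup is used on both a cluster and a workstation, the choice between the two options can be automated. The sketch below selects a command based on whether `sbatch` is on the PATH. This policy and the function name `annotation_command` are assumptions for illustration; the repository does not prescribe them.

```shell
# Sketch: print the annotation command suited to the current machine.
# ASSUMPTIONS: annotation_command is a hypothetical helper, and choosing by
# "is sbatch on PATH" is an illustrative policy only.

annotation_command() {
    local vcf_dir="${1:-data/TCGA/datasets/vcfs}"
    local work_dir="${2:-./nf_work}"
    if command -v sbatch >/dev/null 2>&1; then
        # SLURM is available: submit the cluster script.
        echo "sbatch TCGA_scripts/vcf_annot_pipeline/1_annotate_tcga_vcf.sh"
    else
        # No SLURM: fall back to the local wrapper with its two arguments.
        echo "bash TCGA_scripts/vcf_annot_pipeline/run_annotate_local.sh $vcf_dir $work_dir"
    fi
}
```

The function only prints the command, so it can be reviewed before being executed.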
python TCGA_scripts/vcf_annot_pipeline/2_combine_mut.py

This writes patient-level combined files to data/TCGA/combined_variants.
bash TCGA_scripts/cnv_data/run_cnv_pipeline.sh

All TCGA analyses and figures generated in this study can be reproduced by running the following notebook: TCGA_scripts/analysis.ipynb
python POG570_scripts/0_get_top_driver_genes.py
python POG570_scripts/1_annovar_annotation.py

All POG570 analyses and figures generated in this study can be reproduced by running the following notebook: POG570_scripts/2_analysis_pancancer_genes.ipynb
For questions, comments, or to report issues, contact:
an36943@my.utexas.edu (Akshatha Nayak)
ilias@austin.utexas.edu (Dr. Ilias Georgakopoulos-Soares)