Georgakopoulos-Soares-lab/cancer_passengers

Continuum model of Cancer development

This repository contains the codebase and analytical workflows for the manuscript “The Cumulative Impact of Passenger Mutations on Cancer Development.” The project explores the functional effects of passenger mutations across diverse cancer types, quantifies their impact on patient survival, and evaluates their value as prognostic biomarkers.

The analyses include:

  • Quantification of passenger mutation burden in known cancer genes
  • Prediction of pathogenicity for passenger mutations using computational tools
  • Assessment of functional impact on transcription factor binding, splicing, and gene expression
  • Evaluation of survival outcomes and the prognostic potential of passenger mutations

Repository Structure

  • datasets/: Raw data used for all analyses. Due to privacy and access restrictions, only a subset is shared via Zenodo (DOI: 10.5281/zenodo.19432064). Controlled TCGA data must be obtained independently under the relevant data use agreements.
  • metadata/: Text files and tabular metadata collected from multiple sources and used throughout the analyses.
  • notebooks/: Jupyter notebooks for analysis and visualization of the primary PCAWG cohort.
  • scripts/: Scripts for downloading, parsing, and preprocessing PCAWG data, including wrapper scripts for pipeline execution.
  • POG570_scripts/: Scripts and notebooks used to reproduce core analyses on the POG570 cohort.
  • TCGA_scripts/: Scripts and notebooks used to reproduce core analyses on the TCGA WGS cohort.
  • conda_env_cancer_passengers.yml: Conda environment specification used to reproduce the analysis environment.

Getting Started

To get started, clone the repository and create the conda environment from the provided specification file.

# clone the repository
git clone https://github.com/Georgakopoulos-Soares-lab/cancer_passengers.git
cd cancer_passengers

# create a new conda environment
conda env create -f conda_env_cancer_passengers.yml
conda activate cancer_passengers

Set up the following resources at the repository root:

  • Download and set up Annovar at annovar/.
  • Download the CADD whole-genome SNV file from CADD and place it at cadd/whole_genome_SNVs.tsv.gz (tabix-indexed).
  • Set up Pangolin.
  • Clone the Borzoi repository and follow its installation instructions to set up borzoi/ and baskerville/.
  • Download the UCSC liftOver binary and place it at scripts/liftOver.
  • Set up the GDC Client.
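
Before running any workflow, it can help to confirm that the resources listed above are in place. The following is a minimal sketch, not part of the repository: the helper name `missing_resources` is hypothetical, and the `.tbi` path assumes tabix's default index name next to the CADD file. Pangolin and the GDC client are not checked because no fixed location is given for them.

```shell
# Hypothetical helper (not part of the repository): print every expected
# resource path that does not exist under the given repository root.
missing_resources() {
    local root="${1:-.}"
    local path
    # Paths come from the setup list above. The .tbi entry is an assumption:
    # tabix writes its index as <file>.tbi by default.
    for path in \
        annovar \
        cadd/whole_genome_SNVs.tsv.gz \
        cadd/whole_genome_SNVs.tsv.gz.tbi \
        borzoi \
        baskerville \
        scripts/liftOver
    do
        [ -e "$root/$path" ] || printf '%s\n' "$path"
    done
}
```

Calling `missing_resources .` from the repository root prints one missing path per line and prints nothing when everything is present.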

Downloads

  1. Download the datasets/ directory from Zenodo and place it at the repository root. This directory contains workflow input files.
  2. The datasets/ directory includes open-access files only. Controlled TCGA files are excluded and require dbGaP authorization under accession phs000178.v11.p8.
  3. After obtaining TCGA access, download the following files from ICGC Bionimbus PDC (https://icgc.bionimbus.org), decompress them, and place them at the indicated paths:
  • final_consensus_passonly.snv_mnv_indel.tcga.controlled.maf.gz -> datasets/PCAWG/snv_mnv_indel/final_consensus_passonly.snv_mnv_indel.tcga.controlled.maf
  • TableS3_panorama_driver_mutations_TCGA_samples.controlled.tsv.gz -> datasets/PCAWG/driver_mutations/TableS3_panorama_driver_mutations_TCGA_samples.controlled.tsv
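
Each mapping above pairs a gzip-compressed download with an uncompressed target path. One way to do this is sketched below; the helper name `decompress_to` is hypothetical, and the sketch assumes the downloaded `.gz` files sit in the current directory.

```shell
# Hypothetical helper: decompress SRC (a .gz file) to DEST, creating the
# destination directory if needed. The original .gz file is left untouched.
decompress_to() {
    local src="$1" dest="$2"
    mkdir -p "$(dirname "$dest")"
    gunzip -c "$src" > "$dest"
}

# Example calls, run from the repository root:
# decompress_to final_consensus_passonly.snv_mnv_indel.tcga.controlled.maf.gz \
#     datasets/PCAWG/snv_mnv_indel/final_consensus_passonly.snv_mnv_indel.tcga.controlled.maf
# decompress_to TableS3_panorama_driver_mutations_TCGA_samples.controlled.tsv.gz \
#     datasets/PCAWG/driver_mutations/TableS3_panorama_driver_mutations_TCGA_samples.controlled.tsv
```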

PCAWG Workflow

1. Preprocessing

To run the full preprocessing pipeline, execute:

bash scripts/run_scripts.sh

2. Run parameterized notebooks for all cancer types

Some analyses are run per cancer type. To run these notebooks across all cancer types using papermill, use:

bash scripts/run_notebooks.sh
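
For readers unfamiliar with papermill, the sketch below shows the general shape of a per-cancer-type loop. It is not the content of scripts/run_notebooks.sh: the parameter name `cancer_type`, the output naming scheme, and the helper name are assumptions for illustration. The helper only prints the commands so they can be inspected first, and it assumes paths and cancer type labels contain no whitespace.

```shell
# Hypothetical helper: print one papermill command per cancer type.
# Usage: print_papermill_cmds NOTEBOOK OUT_DIR CANCER_TYPE...
print_papermill_cmds() {
    local notebook="$1" out_dir="$2"
    shift 2
    local stem cancer
    stem="$(basename "$notebook" .ipynb)"
    for cancer in "$@"; do
        # `-p NAME VALUE` is papermill's flag for injecting a parameter;
        # the name `cancer_type` is assumed, check the notebook's parameters cell.
        printf 'papermill %s %s/%s_%s.ipynb -p cancer_type %s\n' \
            "$notebook" "$out_dir" "$stem" "$cancer" "$cancer"
    done
}
```

To execute the printed commands, pipe them to a shell, for example `print_papermill_cmds notebooks/1.1_mutation_density_genic_region_by_cancer.ipynb out Liver-HCC | sh`, where `Liver-HCC` stands in for whichever cancer type labels the cohort uses.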

3. Analysis and Visualization

All PCAWG analyses and figures generated in this study can be reproduced by running the notebooks in the notebooks/ directory.

Most notebooks can be executed independently after running scripts/run_scripts.sh (Step 1). Two notebooks require per-cancer outputs (from Step 2) before they can generate aggregate plots:

  • notebooks/1.2_ mutation_density_genic_region_plots.ipynb — depends on results from notebooks/1.1_mutation_density_genic_region_by_cancer.ipynb run for every cancer type.
  • notebooks/2.2_ cadd_score_genic_region_plots.ipynb — depends on results from notebooks/2.1_cadd_score_genic_region_by_cancer.ipynb run for every cancer type.

TCGA Workflow

Before running these scripts, ensure you have obtained access to TCGA data via dbGaP authorization under accession phs000178.v11.p8.

Also ensure the GDC client is set up for downloading files from the GDC portal. Place the gdc-client binary at the repository root, or set GDC_CLIENT=/full/path/to/gdc-client before running the scripts.
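
The lookup described above (use GDC_CLIENT if set, otherwise fall back to the repository root) can be sketched as follows. The helper name `resolve_gdc_client` is hypothetical, and the repository's scripts may resolve the path differently.

```shell
# Hypothetical helper: print the path of the GDC client to use, or fail
# with a message on stderr if it is missing or not executable.
resolve_gdc_client() {
    local root="${1:-.}"
    # Prefer the GDC_CLIENT environment variable; otherwise assume a
    # binary named gdc-client at the repository root.
    local client="${GDC_CLIENT:-$root/gdc-client}"
    if [ -x "$client" ]; then
        printf '%s\n' "$client"
    else
        printf 'gdc-client not found or not executable: %s\n' "$client" >&2
        return 1
    fi
}
```

Running `resolve_gdc_client .` from the repository root before starting the TCGA workflow gives an early failure if the client is not where the scripts will look for it.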

1. Download TCGA VCF files

Option 1 (recommended): To run on a cluster, use the SLURM script below. Edit SLURM directives as needed before submission.

sbatch TCGA_scripts/vcf_annot_pipeline/0_download_tcga_vcfs.sh

Option 2: To run locally, use:

bash TCGA_scripts/vcf_annot_pipeline/run_download_local.sh \
	TCGA_scripts/vcf_annot_pipeline/gdc_manifest.2025-10-30.203056.txt \
	data/TCGA/datasets/vcfs \
	/path/to/gdc-user-token.txt
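
Large downloads can fail partway through, so it may be worth checking the output directory against the manifest afterwards. The sketch below assumes the standard tab-separated GDC manifest (header row, then `id` and `filename` as the first two columns) and the default gdc-client layout of one subdirectory per file ID. Whether the download wrapper keeps that layout is an assumption, and `missing_downloads` is a hypothetical helper name.

```shell
# Hypothetical helper: print the manifest IDs whose file is absent from VCF_DIR.
# Usage: missing_downloads MANIFEST VCF_DIR
missing_downloads() {
    local manifest="$1" vcf_dir="$2"
    local id filename rest
    # Skip the header row, then split each line on tabs. The `|| [ -n "$id" ]`
    # guard keeps a final line that lacks a trailing newline.
    tail -n +2 "$manifest" |
        while IFS="$(printf '\t')" read -r id filename rest || [ -n "$id" ]; do
            # Assumed layout: <VCF_DIR>/<id>/<filename>
            [ -f "$vcf_dir/$id/$filename" ] || printf '%s\n' "$id"
        done
}
```

For example, `missing_downloads TCGA_scripts/vcf_annot_pipeline/gdc_manifest.2025-10-30.203056.txt data/TCGA/datasets/vcfs` prints nothing when every manifest entry has been downloaded.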

2. Build TCGA-specific resources

python TCGA_scripts/vcf_annot_pipeline/0.1_get_top_driver_genes.py
python TCGA_scripts/vcf_annot_pipeline/0.2_get_hotspot_drivers.py
bash TCGA_scripts/vcf_annot_pipeline/0.3_liftover_hotspots_to_hg38.sh

These scripts generate TCGA-specific driver-gene and hotspot files in data/TCGA/.

3. Parse and annotate TCGA VCFs

Option 1 (recommended): To run on a cluster, use the SLURM script below. Edit SLURM directives as needed before submission.

sbatch TCGA_scripts/vcf_annot_pipeline/1_annotate_tcga_vcf.sh

Option 2: To run locally, use:

bash TCGA_scripts/vcf_annot_pipeline/run_annotate_local.sh \
	data/TCGA/datasets/vcfs \
	./nf_work

These wrappers run the Nextflow pipeline in TCGA_scripts/vcf_annot_pipeline/main.nf.

4. Combine variants per patient

python TCGA_scripts/vcf_annot_pipeline/2_combine_mut.py

This writes patient-level combined files to data/TCGA/combined_variants.

5. Download TCGA CNV data and calculate CNA burden

bash TCGA_scripts/cnv_data/run_cnv_pipeline.sh

Analysis and Visualization

All TCGA analyses and figures generated in this study can be reproduced by running the following notebook: TCGA_scripts/analysis.ipynb

POG570 Workflow

1. Preprocessing

python POG570_scripts/0_get_top_driver_genes.py
python POG570_scripts/1_annovar_annotation.py

Analysis and Visualization

All POG570 analyses and figures generated in this study can be reproduced by running the following notebook: POG570_scripts/2_analysis_pancancer_genes.ipynb

Contact

For questions, comments, or to report issues, contact:

an36943@my.utexas.edu (Akshatha Nayak)
ilias@austin.utexas.edu (Dr. Ilias Georgakopoulos-Soares)
