munaberhe/rnaseq_deseq2_pathway


Bulk RNA-seq Differential Expression and Pathway Analysis in R

Overview

This mini-project performs bulk RNA-seq differential expression analysis and pathway (GO) enrichment in R using:

  • DESeq2 for differential expression
  • clusterProfiler for GO enrichment
  • The public airway RNA-seq dataset as a reproducible example

It is designed as a small, self-contained R/Bioconductor project to demonstrate practical skills that are directly relevant to genomics and pharmaceutical research (e.g. target discovery, mechanism-of-action analysis, biomarker discovery).


Dataset

We use the airway RNA-seq dataset from Bioconductor:

  • Human airway smooth muscle cells
  • Treated with dexamethasone (dex = "trt") or untreated (dex = "untrt")
  • Raw counts and sample metadata are provided in a SummarizedExperiment object

The dataset is loaded directly from the airway package; no external files need to be downloaded manually. This makes the analysis fully reproducible on any machine with R and Bioconductor installed.
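
As a quick check before running the analysis, the object can be inspected from an R session. The helper below is only a sketch: the name inspect_airway is hypothetical, and it assumes the airway and SummarizedExperiment packages are installed.

```r
# Hypothetical helper (not part of the repository) for a first look at the data.
# Assumes the Bioconductor packages airway and SummarizedExperiment are installed.
inspect_airway <- function() {
  suppressPackageStartupMessages(library(SummarizedExperiment))

  # Load the dataset into a private environment instead of the global workspace
  env <- new.env()
  data("airway", package = "airway", envir = env)
  se <- env$airway

  list(
    dims           = dim(se),                # genes x samples
    dex            = table(se$dex),          # samples per treatment level
    assays         = assayNames(se),         # names of the stored count matrices
    sample_columns = colnames(colData(se))   # available sample metadata fields
  )
}
```

Calling inspect_airway() returns a list with the dimensions, the number of samples per dex level, and the names of the assays and metadata columns.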


Methods

Differential Expression (DESeq2)

Script: 01_deseq2_analysis.R

Pipeline steps:

  1. Load the airway dataset from the airway package (for example with data("airway")), which provides the counts and sample metadata as a SummarizedExperiment.

  2. Relevel the treatment factor so "untrt" is the reference (dex = relevel(dex, ref = "untrt")).

  3. Construct a DESeqDataSet with design ~ dex.

  4. Pre-filter the count matrix to remove genes with very low counts.

  5. Run DESeq() to estimate size factors, dispersions, and fit the negative binomial model.

  6. Extract results for trt vs untrt and apply log2 fold-change shrinkage using lfcShrink(..., type = "apeglm").

  7. Sort genes by adjusted p-value (padj) and save a full results table as:

    results/deseq2_results_airway.csv

  8. Create a volcano plot highlighting significantly differentially expressed genes and save it as:

    figures/volcano_airway_deseq2.png

The output includes per-gene statistics such as log2FoldChange, lfcSE, test statistic, raw p-value, and adjusted p-value (padj).
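
The steps above can be sketched roughly as follows. This is not the contents of 01_deseq2_analysis.R; the function name run_deseq2_airway, its arguments, and the pre-filter threshold of 10 total reads are assumptions made for illustration. The coefficient name "dex_trt_vs_untrt" follows DESeq2's usual naming for a design of ~ dex with "untrt" as the reference level.

```r
# Hypothetical helper sketching steps 1-7; not the repository's actual script.
# Assumes the Bioconductor packages DESeq2, airway and apeglm are installed.
run_deseq2_airway <- function(min_total_count = 10, out_csv = NULL) {
  suppressPackageStartupMessages(library(DESeq2))

  # Step 1: load the example SummarizedExperiment
  env <- new.env()
  data("airway", package = "airway", envir = env)
  se <- env$airway

  # Step 2: make "untrt" the reference so fold changes read as trt vs untrt
  se$dex <- relevel(se$dex, ref = "untrt")

  # Step 3: build the DESeqDataSet with a single-factor design
  dds <- DESeqDataSet(se, design = ~ dex)

  # Step 4: drop genes with very few reads in total (assumed threshold)
  keep <- rowSums(counts(dds)) >= min_total_count
  dds <- dds[keep, ]

  # Step 5: size factors, dispersions and negative binomial model fit
  dds <- DESeq(dds)

  # Step 6: shrunken log2 fold changes for trt vs untrt
  res <- lfcShrink(dds, coef = "dex_trt_vs_untrt", type = "apeglm")

  # Step 7: sort by adjusted p-value (NA values sort last) and optionally save
  res_df <- as.data.frame(res)
  res_df$gene_id <- rownames(res_df)
  res_df <- res_df[order(res_df$padj), ]
  if (!is.null(out_csv)) {
    write.csv(res_df, out_csv, row.names = FALSE)
  }
  res_df
}
```

For example, res_df <- run_deseq2_airway(out_csv = "results/deseq2_results_airway.csv") would write the table to the path used in this project, provided the results/ directory exists.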

Pathway / GO Enrichment (clusterProfiler)

Script: 02_pathway_analysis.R

Pipeline steps:

  1. Load the DESeq2 results from:

    results/deseq2_results_airway.csv

  2. Select significantly differentially expressed genes using thresholds such as:

    • adjusted p-value padj < 0.05
    • absolute log2 fold-change |log2FC| > 1

  3. Extract gene identifiers and map them to Entrez IDs using clusterProfiler::bitr() with org.Hs.eg.db.

  4. Perform Gene Ontology Biological Process (GO BP) enrichment using clusterProfiler::enrichGO() with the human annotation database (OrgDb = org.Hs.eg.db).

  5. Save the enrichment results as:

    results/go_enrichment_airway_BP.csv

  6. Generate a barplot of the top enriched GO BP terms and save it as:

    figures/go_bp_barplot_airway.png

The enrichment output includes, for each GO term, the ID and description, gene ratio, background ratio, p-values, adjusted p-values, and the list of contributing genes.
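
A minimal sketch of the mapping and enrichment steps is shown below. The function name run_go_bp and its arguments are hypothetical, and the default from_type = "ENSEMBL" assumes the results table carries Ensembl gene IDs, as the airway count matrix does. The actual script may use different settings.

```r
# Hypothetical helper sketching steps 3-4; not the repository's actual script.
# Assumes clusterProfiler and org.Hs.eg.db are installed.
run_go_bp <- function(gene_ids, from_type = "ENSEMBL", p_cutoff = 0.05) {
  suppressPackageStartupMessages({
    library(clusterProfiler)
    library(org.Hs.eg.db)
  })

  # Map the input identifiers to Entrez IDs; unmapped IDs are dropped by bitr()
  id_map <- bitr(gene_ids,
                 fromType = from_type,
                 toType   = "ENTREZID",
                 OrgDb    = org.Hs.eg.db)

  # GO Biological Process over-representation test with BH adjustment
  enrichGO(gene          = unique(id_map$ENTREZID),
           OrgDb         = org.Hs.eg.db,
           keyType       = "ENTREZID",
           ont           = "BP",
           pAdjustMethod = "BH",
           pvalueCutoff  = p_cutoff,
           readable      = TRUE)   # report gene symbols instead of Entrez IDs
}
```

The returned object can be converted with as.data.frame() and written with write.csv() to produce a table like results/go_enrichment_airway_BP.csv.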


Setup

  1. Install R (and optionally RStudio) if not already installed.

  2. From an R session, set a CRAN mirror and install required packages:

    options(repos = c(CRAN = "https://cloud.r-project.org"))

    install.packages(c("tidyverse", "ggplot2"))

    if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") }

    BiocManager::install(c("DESeq2", "apeglm", "airway", "clusterProfiler", "org.Hs.eg.db"), ask = FALSE)

  3. Clone the repository and move into it:

    git clone
    cd rnaseq_deseq2_pathway

  4. Ensure the results/ and figures/ folders exist (or let the scripts create them).
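
If the scripts do not create the output folders themselves, a short base R snippet such as the following can be run from the project root. It is a sketch and makes no assumption about what the scripts do.

```r
# Create the output folders if they are missing; existing folders are left untouched
for (d in c("results", "figures")) {
  dir.create(d, showWarnings = FALSE, recursive = TRUE)
}
```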


How to Run

From the project folder (rnaseq_deseq2_pathway):

  1. Differential expression with DESeq2

    Rscript 01_deseq2_analysis.R

    This will:

    • load the airway dataset
    • run DESeq2 for trt vs untrt
    • save a full DE results table to results/deseq2_results_airway.csv
    • create a volcano plot at figures/volcano_airway_deseq2.png

  2. GO enrichment analysis

    Rscript 02_pathway_analysis.R

    This will:

    • read results/deseq2_results_airway.csv
    • select significantly differentially expressed genes (padj < 0.05, |log2FC| > 1)
    • map them to Entrez IDs
    • perform GO BP enrichment using clusterProfiler
    • save enriched GO terms to results/go_enrichment_airway_BP.csv
    • create a barplot of the top GO terms at figures/go_bp_barplot_airway.png
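
The significance filter (padj < 0.05, |log2FC| > 1) can be illustrated on a toy table. The gene IDs and values below are invented for illustration, and select_significant is a hypothetical helper, not a function from the scripts. DESeq2 can report NA for padj (for example for genes removed by independent filtering), so the sketch excludes those rows explicitly.

```r
# Toy results table standing in for results/deseq2_results_airway.csv
# (gene IDs and values are invented for illustration)
toy_res <- data.frame(
  gene_id          = c("geneA", "geneB", "geneC", "geneD", "geneE"),
  log2FoldChange   = c(2.3, -1.8, 0.4, 1.5, -3.1),
  padj             = c(0.001, 0.03, 0.0002, 0.20, NA),
  stringsAsFactors = FALSE
)

# Keep rows that pass both the adjusted p-value and the effect-size cutoff
select_significant <- function(res, padj_cutoff = 0.05, lfc_cutoff = 1) {
  keep <- !is.na(res$padj) &
    res$padj < padj_cutoff &
    abs(res$log2FoldChange) > lfc_cutoff
  res[keep, , drop = FALSE]
}

sig <- select_significant(toy_res)
print(sig$gene_id)  # "geneA" "geneB"
```

Here geneC fails the fold-change cutoff, geneD fails the padj cutoff, and geneE is dropped because its padj is NA.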

After running both scripts, you should have:

  • the DESeq2 and GO enrichment tables (CSV) in the results/ directory
  • the volcano plot and GO barplot (PNG) in the figures/ directory

Results

On the airway dataset, DESeq2 typically identifies a substantial number of genes that are significantly differentially expressed between dexamethasone-treated and untreated samples. The output includes:

  • a ranked list of genes with log2 fold-change and adjusted p-values, allowing identification of strongly up- and down-regulated genes
  • a volcano plot (figures/volcano_airway_deseq2.png) where significantly differentially expressed genes (for example, padj < 0.05 and |log2FC| > 1) are highlighted relative to background genes
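
One way to draw such a volcano plot with ggplot2 is sketched below. The function name make_volcano, the colours, and the dashed threshold lines are illustrative choices rather than a description of the plot the script produces. The input is assumed to be a data frame with log2FoldChange and padj columns.

```r
# Hypothetical plotting helper; assumes ggplot2 is installed.
make_volcano <- function(res, padj_cutoff = 0.05, lfc_cutoff = 1) {
  library(ggplot2)

  # Genes without an adjusted p-value cannot be placed on the y axis
  res <- res[!is.na(res$padj), ]

  # Flag genes passing both the significance and effect-size cutoffs
  res$significant <- res$padj < padj_cutoff &
    abs(res$log2FoldChange) > lfc_cutoff

  ggplot(res, aes(x = log2FoldChange, y = -log10(padj), colour = significant)) +
    geom_point(alpha = 0.6, size = 1) +
    scale_colour_manual(values = c("FALSE" = "grey60", "TRUE" = "firebrick")) +
    geom_vline(xintercept = c(-lfc_cutoff, lfc_cutoff), linetype = "dashed") +
    geom_hline(yintercept = -log10(padj_cutoff), linetype = "dashed") +
    labs(x = "log2 fold change", y = "-log10 adjusted p-value") +
    theme_minimal()
}
```

The returned plot object can be saved with ggsave("figures/volcano_airway_deseq2.png", plot = p, width = 6, height = 5, dpi = 300), where p is the result of make_volcano() and the dimensions are arbitrary.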

The GO enrichment analysis generally recovers biological processes consistent with the known effects of glucocorticoids, such as:

  • response to steroid hormone
  • regulation of inflammatory and immune responses
  • transcriptional and signalling pathways affected by dexamethasone

The exact list of enriched terms and the number of significant genes will depend on the chosen thresholds, but the pipeline provides a reproducible and interpretable set of results that connect differential expression to biological pathways.


Discussion

This project implements a typical RNA-seq analysis workflow in R:

  1. Starting from raw count data using a well-curated public dataset (airway).
  2. Performing robust differential expression analysis with DESeq2.
  3. Visualising results using a volcano plot to summarise significance and effect size.
  4. Interpreting the biological signal through GO enrichment analysis with clusterProfiler.

These steps mirror common analyses in:

  • genomics and transcriptomics research
  • pharmacogenomics and drug response studies
  • biomarker discovery and mechanism-of-action investigations

The code is intentionally minimal and script-based to make it easy to:

  • adapt the pipeline to other RNA-seq datasets
  • adjust thresholds (for example, different padj cutoffs or log2FC cutoffs)
  • swap GO for other pathway databases such as KEGG or Reactome
  • wrap the workflow into an R Markdown report or Shiny app

This project complements Python-based work (such as Scanpy scRNA-seq analysis and LLM-based phenotype–disease benchmarks) by demonstrating solid competence in R, Bioconductor, DESeq2, and pathway analysis, which are highly valued skills in bioinformatics and the pharmaceutical industry.
