This mini-project performs bulk RNA-seq differential expression analysis and pathway (GO) enrichment in R using:
- DESeq2 for differential expression
- clusterProfiler for GO enrichment
- The public airway RNA-seq dataset as a reproducible example
It is a small, self-contained R/Bioconductor project that demonstrates practical skills relevant to genomics and pharmaceutical research (e.g. target discovery, mechanism-of-action analysis, biomarker discovery).
We use the airway RNA-seq dataset from Bioconductor:
- Human airway smooth muscle cells
- Treated with dexamethasone (dex = "trt") or untreated (dex = "untrt")
- Raw counts and sample metadata are provided in a SummarizedExperiment object
The dataset is loaded directly from the airway package; no external files need to be downloaded manually. This makes the analysis fully reproducible on any machine with R and Bioconductor installed.
Script: 01_deseq2_analysis.R
Pipeline steps:
- Load the airway dataset (airway::airway) and extract the SummarizedExperiment.
- Relevel the treatment factor so "untrt" is the reference (dex = relevel(dex, ref = "untrt")).
- Construct a DESeqDataSet with design ~ dex.
- Pre-filter to remove genes with very low counts.
- Run DESeq() to estimate size factors, dispersions, and fit the negative binomial model.
- Extract results for trt vs untrt and apply log2 fold-change shrinkage using lfcShrink(..., type = "apeglm").
- Sort genes by adjusted p-value (padj) and save the full results table as results/deseq2_results_airway.csv
- Create a volcano plot highlighting significantly differentially expressed genes and save it as figures/volcano_airway_deseq2.png
The output includes per-gene statistics such as log2FoldChange, lfcSE, test statistic, raw p-value, and adjusted p-value (padj).
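The steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the contents of 01_deseq2_analysis.R: the pre-filter cutoff (at least 10 reads in total per gene) and the gene_id column name are illustrative choices, and lfcShrink() with type = "apeglm" needs the apeglm package to be installed.

```r
suppressPackageStartupMessages({
  library(DESeq2)
  library(airway)
})

# Load the example data; this creates a SummarizedExperiment-type object named `airway`
data("airway")

# Make untreated samples the reference level of the treatment factor
airway$dex <- relevel(airway$dex, ref = "untrt")

dds <- DESeqDataSet(airway, design = ~ dex)

# Pre-filter: the cutoff of 10 total reads is an assumption, not taken from the script
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep, ]

# Size factors, dispersions, and negative binomial model fit
dds <- DESeq(dds)

# With "untrt" as reference, the treatment coefficient is named "dex_trt_vs_untrt"
res <- lfcShrink(dds, coef = "dex_trt_vs_untrt", type = "apeglm")

# Convert to a data frame, keep the gene IDs (hypothetical column name), sort by padj
res_df <- as.data.frame(res)
res_df$gene_id <- rownames(res_df)
res_df <- res_df[order(res_df$padj), ]

dir.create("results", showWarnings = FALSE)
write.csv(res_df, "results/deseq2_results_airway.csv", row.names = FALSE)
```

If the coefficient name is in doubt, resultsNames(dds) lists the available coefficients after DESeq() has run.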
Script: 02_pathway_analysis.R
Pipeline steps:
- Load the DESeq2 results from results/deseq2_results_airway.csv
- Select significantly differentially expressed genes using thresholds such as:
  - adjusted p-value padj < 0.05
  - absolute log2 fold-change |log2FC| > 1
- Extract gene identifiers and map them to Entrez IDs using clusterProfiler::bitr() with org.Hs.eg.db.
- Perform Gene Ontology Biological Process (GO BP) enrichment using clusterProfiler::enrichGO() with the human annotation database (OrgDb = org.Hs.eg.db).
- Save the enrichment results as results/go_enrichment_airway_BP.csv
- Generate a barplot of the top enriched GO BP terms and save it as figures/go_bp_barplot_airway.png
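The steps above can be sketched roughly as follows. This is an illustrative sketch rather than the contents of 02_pathway_analysis.R. It assumes the results CSV already exists and stores Ensembl gene IDs in a column named gene_id (a hypothetical name); the number of terms shown in the barplot and the image size are also arbitrary choices.

```r
suppressPackageStartupMessages({
  library(clusterProfiler)
  library(org.Hs.eg.db)
})

res <- read.csv("results/deseq2_results_airway.csv")

# Keep genes passing both thresholds; genes with missing padj are dropped
sig <- subset(res, !is.na(padj) & padj < 0.05 & abs(log2FoldChange) > 1)

# Map Ensembl IDs to Entrez IDs (assumes a `gene_id` column holding Ensembl IDs)
id_map <- bitr(sig$gene_id,
               fromType = "ENSEMBL",
               toType   = "ENTREZID",
               OrgDb    = org.Hs.eg.db)

# GO Biological Process over-representation analysis
ego <- enrichGO(gene          = unique(id_map$ENTREZID),
                OrgDb         = org.Hs.eg.db,
                keyType       = "ENTREZID",
                ont           = "BP",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.05,
                readable      = TRUE)

dir.create("results", showWarnings = FALSE)
dir.create("figures", showWarnings = FALSE)
write.csv(as.data.frame(ego), "results/go_enrichment_airway_BP.csv", row.names = FALSE)

# Barplot of the top terms; print() is needed when running non-interactively
png("figures/go_bp_barplot_airway.png", width = 900, height = 700)
print(barplot(ego, showCategory = 10))
dev.off()
```

Some Ensembl IDs have no Entrez mapping, so bitr() usually reports that a fraction of the input could not be mapped; those genes are simply left out of the enrichment.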
The enrichment output includes, for each GO term, the ID and description, gene ratio, background ratio, p-values, adjusted p-values, and the list of contributing genes.
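To make the gene ratio, background ratio, and p-value columns concrete: over-representation analysis of the kind enrichGO() performs is based on a hypergeometric test. The base R sketch below reproduces that calculation for a single term. All counts are toy numbers invented for illustration, not values from the airway results.

```r
# Toy numbers, invented purely for illustration
n_background <- 18000  # annotated background genes (denominator of the background ratio)
n_in_term    <- 300    # background genes annotated to the GO term (numerator of the background ratio)
n_selected   <- 500    # annotated significant genes (denominator of the gene ratio)
n_overlap    <- 25     # significant genes annotated to the term (numerator of the gene ratio)

gene_ratio <- n_overlap / n_selected     # 25/500
bg_ratio   <- n_in_term / n_background   # 300/18000

# Probability of seeing at least n_overlap term genes when drawing
# n_selected genes at random, without replacement, from the background
p_value <- phyper(n_overlap - 1,
                  n_in_term,
                  n_background - n_in_term,
                  n_selected,
                  lower.tail = FALSE)

# Cross-check: the same test written as a one-sided Fisher's exact test
tab <- matrix(c(n_overlap,
                n_selected - n_overlap,
                n_in_term - n_overlap,
                n_background - n_in_term - (n_selected - n_overlap)),
              nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value
```

Here the gene ratio is well above the background ratio, so the term is over-represented among the selected genes. The two p-values should agree up to floating-point error. In a real analysis this raw p-value is then adjusted for the number of terms tested.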
- Install R (and optionally RStudio) if not already installed.
- From an R session, set a CRAN mirror and install required packages:
options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages(c("tidyverse", "ggplot2"))
if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") }
BiocManager::install(c("DESeq2", "apeglm", "airway", "clusterProfiler", "org.Hs.eg.db"), ask = FALSE)
- Clone the repository and move into it:
git clone
cd rnaseq_deseq2_pathway
- Ensure the results/ and figures/ folders exist (or let the scripts create them).
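One way to create the output folders from R is sketched below. It uses only base R and assumes the working directory is the project folder.

```r
# Create the output folders if they are missing; existing folders are left untouched
for (d in c("results", "figures")) {
  if (!dir.exists(d)) {
    dir.create(d, recursive = TRUE)
  }
}
```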
From the project folder (rnaseq_deseq2_pathway):
- Differential expression with DESeq2
  Rscript 01_deseq2_analysis.R
  This will:
  - load the airway dataset
  - run DESeq2 for trt vs untrt
  - save a full DE results table to results/deseq2_results_airway.csv
  - create a volcano plot at figures/volcano_airway_deseq2.png
- GO enrichment analysis
  Rscript 02_pathway_analysis.R
  This will:
  - read results/deseq2_results_airway.csv
  - select significantly differentially expressed genes (padj < 0.05, |log2FC| > 1)
  - map them to Entrez IDs
  - perform GO BP enrichment using clusterProfiler
  - save enriched GO terms to results/go_enrichment_airway_BP.csv
  - create a barplot of the top GO terms at figures/go_bp_barplot_airway.png
After running both scripts, you should have:
- numeric results in the results/ directory
- publication-style plots in the figures/ directory
On the airway dataset, DESeq2 typically identifies a substantial number of genes that are significantly differentially expressed between dexamethasone-treated and untreated samples. The output includes:
- a ranked list of genes with log2 fold-change and adjusted p-values, allowing identification of strongly up- and down-regulated genes
- a volcano plot (figures/volcano_airway_deseq2.png) where significantly differentially expressed genes (for example, padj < 0.05 and |log2FC| > 1) are highlighted relative to background genes
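The highlighting logic behind such a volcano plot can be sketched as below. The helper classify_genes() and the toy table are hypothetical and only illustrate the thresholds mentioned above; the colours and the plot layout in the actual script may differ.

```r
library(ggplot2)

# Hypothetical helper: label each gene as up, down, or not significant
classify_genes <- function(res, padj_cutoff = 0.05, lfc_cutoff = 1) {
  # Genes with missing padj are treated as not significant
  sig <- !is.na(res$padj) &
    res$padj < padj_cutoff &
    abs(res$log2FoldChange) > lfc_cutoff
  status <- ifelse(sig & res$log2FoldChange > 0, "up",
                   ifelse(sig, "down", "not significant"))
  factor(status, levels = c("down", "not significant", "up"))
}

# Toy table, invented for illustration (not airway results)
toy <- data.frame(
  gene_id        = paste0("gene", 1:5),
  log2FoldChange = c(2.5, -1.8, 0.3, 1.2, -3.0),
  padj           = c(0.001, 0.01, 0.8, NA, 0.2)
)
toy$status <- classify_genes(toy)

# Volcano plot: effect size on x, significance on y, thresholds as dashed lines
p <- ggplot(toy, aes(x = log2FoldChange, y = -log10(padj), colour = status)) +
  geom_point(na.rm = TRUE) +
  geom_vline(xintercept = c(-1, 1), linetype = "dashed") +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
  scale_colour_manual(values = c("down" = "steelblue",
                                 "not significant" = "grey70",
                                 "up" = "firebrick")) +
  labs(x = "log2 fold change", y = "-log10(adjusted p-value)")
```

Applied to the real results table instead of the toy data, the plot could then be written out with ggsave("figures/volcano_airway_deseq2.png", p).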
The GO enrichment analysis generally recovers biological processes consistent with the known effects of glucocorticoids, such as:
- response to steroid hormone
- regulation of inflammatory and immune responses
- transcriptional and signalling pathways affected by dexamethasone
The exact list of enriched terms and the number of significant genes will depend on the chosen thresholds, but the pipeline provides a reproducible and interpretable set of results that connect differential expression to biological pathways.
This project implements a typical RNA-seq analysis workflow in R:
- Starting from raw count data using a well-curated public dataset (airway).
- Performing robust differential expression analysis with DESeq2.
- Visualising results using a volcano plot to summarise significance and effect size.
- Interpreting the biological signal through GO enrichment analysis with clusterProfiler.
These steps mirror common analyses in:
- genomics and transcriptomics research
- pharmacogenomics and drug response studies
- biomarker discovery and mechanism-of-action investigations
The code is intentionally minimal and script-based to make it easy to:
- adapt the pipeline to other RNA-seq datasets
- adjust thresholds (for example, different padj cutoffs or log2FC cutoffs)
- swap GO for other pathway databases such as KEGG or Reactome
- wrap the workflow in an R Markdown report or Shiny app
This project complements Python-based work (such as Scanpy scRNA-seq analysis and LLM-based phenotype–disease benchmarks) by demonstrating R, Bioconductor, DESeq2, and pathway analysis skills, which are widely used in bioinformatics and the pharmaceutical industry.