Bulk RNA-seq Differential Expression and Pathway Analysis in R

Overview

This mini-project performs bulk RNA-seq differential expression analysis and pathway (GO) enrichment in R using:

DESeq2 for differential expression
clusterProfiler for GO enrichment
The public airway RNA-seq dataset as a reproducible example

It is a small, self-contained R/Bioconductor project that demonstrates practical skills relevant to genomics and pharmaceutical research (e.g. target discovery, mechanism-of-action analysis, biomarker discovery).

Dataset

We use the airway RNA-seq dataset from Bioconductor:

Human airway smooth muscle cells
Treated with dexamethasone (dex = "trt") or untreated (dex = "untrt")
Raw counts and sample metadata are provided in a SummarizedExperiment object

The dataset is loaded directly from the airway package; no external files need to be downloaded manually. The analysis can therefore be reproduced on any machine with R and the required Bioconductor packages installed.
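
As a quick orientation, the sketch below loads the dataset and summarises its dimensions and treatment groups. The helper name inspect_airway is hypothetical and not part of the project scripts; it assumes the airway and SummarizedExperiment packages are installed.

```r
# Hypothetical helper: summarise the airway SummarizedExperiment.
inspect_airway <- function() {
  # Fail early if a required package is missing
  for (pkg in c("airway", "SummarizedExperiment")) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      stop("Package '", pkg, "' is required but not installed.")
    }
  }

  # Load the data object into a private environment
  env <- new.env()
  data("airway", package = "airway", envir = env)
  se <- env$airway

  # Genes x samples, and the number of samples per treatment level
  list(
    dimensions = dim(se),
    samples_per_dex = table(SummarizedExperiment::colData(se)$dex)
  )
}

# Example usage (requires the packages above):
# inspect_airway()
```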

Methods

Differential Expression (DESeq2)

Script: 01_deseq2_analysis.R

Pipeline steps:

Load the airway dataset (airway::airway) and extract the SummarizedExperiment.
Relevel the treatment factor so "untrt" is the reference (dex = relevel(dex, ref = "untrt")).
Construct a DESeqDataSet with design ~ dex.
Pre-filter the count matrix to remove genes with very low counts.
Run DESeq() to estimate size factors, dispersions, and fit the negative binomial model.
Extract results for trt vs untrt and apply log2 fold-change shrinkage using lfcShrink(..., type = "apeglm").
Sort genes by adjusted p-value (padj) and save the full results table as results/deseq2_results_airway.csv.
Create a volcano plot highlighting significantly differentially expressed genes and save it as figures/volcano_airway_deseq2.png.

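The output includes per-gene statistics such as log2FoldChange, lfcSE, raw p-value, and adjusted p-value (padj). The Wald test statistic (stat) is reported by the unshrunken results() table; lfcShrink() with type = "apeglm" omits that column.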
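
The pipeline steps above can be sketched roughly as follows. This is not the contents of 01_deseq2_analysis.R: the function name run_deseq2_airway, the total-count threshold of 10, and the gene_id column are illustrative assumptions. The code is wrapped in a function so that nothing heavy runs until it is called.

```r
# Sketch of the DESeq2 steps. The function name, the min_total_count
# default, and the gene_id column are assumptions made for illustration.
run_deseq2_airway <- function(min_total_count = 10) {
  suppressPackageStartupMessages({
    library(airway)
    library(DESeq2)
  })
  # lfcShrink(type = "apeglm") needs the apeglm package to be installed
  if (!requireNamespace("apeglm", quietly = TRUE)) {
    stop("Package 'apeglm' is required for type = \"apeglm\" shrinkage.")
  }

  # Load the RangedSummarizedExperiment shipped with the airway package
  env <- new.env()
  data("airway", package = "airway", envir = env)
  se <- env$airway

  # Make "untrt" the reference level so fold changes read as trt vs untrt
  se$dex <- relevel(se$dex, ref = "untrt")
  dds <- DESeqDataSet(se, design = ~ dex)

  # Pre-filter: keep genes whose summed count meets the assumed threshold
  keep <- rowSums(counts(dds)) >= min_total_count
  dds <- dds[keep, ]

  # Size factors, dispersions, and negative binomial model fit
  dds <- DESeq(dds)

  # Shrunken log2 fold changes for the treatment coefficient
  res <- lfcShrink(dds, coef = "dex_trt_vs_untrt", type = "apeglm")

  # Plain data frame sorted by adjusted p-value (NA values sort last)
  res_df <- as.data.frame(res)
  res_df$gene_id <- rownames(res_df)
  res_df[order(res_df$padj), ]
}

# Example usage (requires the packages above):
# res_df <- run_deseq2_airway()
# write.csv(res_df, "results/deseq2_results_airway.csv", row.names = FALSE)
```

The design here is ~ dex, as in the steps above. A total-count filter is only one common way to pre-filter; the actual script may use a different rule.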

Pathway / GO Enrichment (clusterProfiler)

Script: 02_pathway_analysis.R

Pipeline steps:

Load the DESeq2 results from results/deseq2_results_airway.csv.
Select significantly differentially expressed genes using thresholds such as:
- adjusted p-value padj < 0.05
- absolute log2 fold-change |log2FC| > 1
Extract gene identifiers (Ensembl gene IDs in the airway dataset) and map them to Entrez IDs using clusterProfiler::bitr() with org.Hs.eg.db.
Perform Gene Ontology Biological Process (GO BP) enrichment using clusterProfiler::enrichGO() with the human annotation database (OrgDb = org.Hs.eg.db).
Save the enrichment results as results/go_enrichment_airway_BP.csv.
Generate a barplot of the top enriched GO BP terms and save it as figures/go_bp_barplot_airway.png.

The enrichment output includes, for each GO term, the ID and description, gene ratio, background ratio, p-values, adjusted p-values, and the list of contributing genes.
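
A minimal sketch of the gene selection and enrichment steps is shown below. It is not the contents of 02_pathway_analysis.R: the function names select_sig_genes and run_go_bp are hypothetical, and the column names gene_id, log2FoldChange, and padj are assumptions about how the results CSV is laid out.

```r
# Pure helper: pick significant gene IDs from a DESeq2-style results table.
# Column names (gene_id, log2FoldChange, padj) are assumptions about the CSV.
select_sig_genes <- function(res_df, padj_cutoff = 0.05, lfc_cutoff = 1) {
  keep <- !is.na(res_df$padj) &
    !is.na(res_df$log2FoldChange) &
    res_df$padj < padj_cutoff &
    abs(res_df$log2FoldChange) > lfc_cutoff
  res_df$gene_id[keep]
}

# Map Ensembl IDs to Entrez IDs, then test GO Biological Process terms.
run_go_bp <- function(ensembl_ids) {
  suppressPackageStartupMessages({
    library(clusterProfiler)
    library(org.Hs.eg.db)
  })

  # bitr() drops IDs that have no Entrez mapping
  id_map <- bitr(ensembl_ids,
                 fromType = "ENSEMBL",
                 toType = "ENTREZID",
                 OrgDb = org.Hs.eg.db)

  enrichGO(gene = unique(id_map$ENTREZID),
           OrgDb = org.Hs.eg.db,
           keyType = "ENTREZID",
           ont = "BP",
           pAdjustMethod = "BH",
           readable = TRUE)
}

# Example usage (requires clusterProfiler and org.Hs.eg.db):
# res_df <- read.csv("results/deseq2_results_airway.csv")
# ego <- run_go_bp(select_sig_genes(res_df))
# write.csv(as.data.frame(ego), "results/go_enrichment_airway_BP.csv",
#           row.names = FALSE)
# barplot(ego, showCategory = 10)
```

Keeping the threshold logic in its own function makes it easy to try different padj or log2FC cutoffs without touching the enrichment call.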

Setup

Install R (and optionally RStudio) if not already installed.
From an R session, set a CRAN mirror and install required packages:

options(repos = c(CRAN = "https://cloud.r-project.org"))

install.packages(c("tidyverse", "ggplot2"))

if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") }

BiocManager::install(c("DESeq2", "apeglm", "airway", "clusterProfiler", "org.Hs.eg.db"), ask = FALSE)

Clone the repository and move into it:

git clone
cd rnaseq_deseq2_pathway

Ensure the results/ and figures/ folders exist (or let the scripts create them).
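
If the folders are missing, one way to create them from R is sketched below. Whether the project scripts already do this internally is not confirmed here.

```r
# Create the output folders if they do not exist yet.
# showWarnings = FALSE keeps this quiet when a folder is already present.
for (d in c("results", "figures")) {
  dir.create(d, showWarnings = FALSE, recursive = TRUE)
}
```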

How to Run

From the project folder (rnaseq_deseq2_pathway):

Differential expression with DESeq2

Rscript 01_deseq2_analysis.R

This will:
- load the airway dataset
- run DESeq2 for trt vs untrt
- save a full DE results table to results/deseq2_results_airway.csv
- create a volcano plot at figures/volcano_airway_deseq2.png

GO enrichment analysis

Rscript 02_pathway_analysis.R

This will:
- read results/deseq2_results_airway.csv
- select significantly differentially expressed genes (padj < 0.05, |log2FC| > 1)
- map them to Entrez IDs
- perform GO BP enrichment using clusterProfiler
- save enriched GO terms to results/go_enrichment_airway_BP.csv
- create a barplot of the top GO terms at figures/go_bp_barplot_airway.png

After running both scripts, you should have:

numeric results in the results/ directory
publication-style plots in the figures/ directory
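
A quick way to confirm that both scripts wrote their outputs is to check for the four files named above. This is a small convenience sketch, run from the project folder; the variable names are illustrative.

```r
# Paths are the ones listed in this README.
expected_outputs <- c(
  "results/deseq2_results_airway.csv",
  "results/go_enrichment_airway_BP.csv",
  "figures/volcano_airway_deseq2.png",
  "figures/go_bp_barplot_airway.png"
)

# One row per expected file, with TRUE where the file is present
output_status <- data.frame(
  file = expected_outputs,
  exists = file.exists(expected_outputs)
)
print(output_status)
```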

Results

On the airway dataset, DESeq2 typically identifies a substantial number of genes that are significantly differentially expressed between dexamethasone-treated and untreated samples. The output includes:

a ranked list of genes with log2 fold-change and adjusted p-values, allowing identification of strongly up- and down-regulated genes
a volcano plot (figures/volcano_airway_deseq2.png) where significantly differentially expressed genes (for example, padj < 0.05 and |log2FC| > 1) are highlighted relative to background genes
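
For reference, a volcano plot of this kind can be drawn with ggplot2 roughly as follows. The function name plot_volcano, the colours, and the column names log2FoldChange and padj are assumptions; the plot produced by the project script may be styled differently.

```r
# Sketch: volcano plot from a DESeq2-style results data frame.
plot_volcano <- function(res_df, padj_cutoff = 0.05, lfc_cutoff = 1) {
  # Genes without an adjusted p-value cannot be placed on the y axis
  df <- res_df[!is.na(res_df$padj) & !is.na(res_df$log2FoldChange), ]

  # Flag genes passing both the significance and effect-size thresholds
  df$significant <- df$padj < padj_cutoff &
    abs(df$log2FoldChange) > lfc_cutoff

  ggplot2::ggplot(
    df,
    ggplot2::aes(x = log2FoldChange, y = -log10(padj), colour = significant)
  ) +
    ggplot2::geom_point(alpha = 0.5, size = 1) +
    ggplot2::scale_colour_manual(
      values = c(`FALSE` = "grey70", `TRUE` = "firebrick")
    ) +
    # Dashed guides mark the chosen thresholds
    ggplot2::geom_vline(xintercept = c(-lfc_cutoff, lfc_cutoff),
                        linetype = "dashed") +
    ggplot2::geom_hline(yintercept = -log10(padj_cutoff),
                        linetype = "dashed") +
    ggplot2::labs(x = "log2 fold change (trt vs untrt)",
                  y = "-log10 adjusted p-value") +
    ggplot2::theme_minimal()
}

# Example usage (requires ggplot2):
# res_df <- read.csv("results/deseq2_results_airway.csv")
# p <- plot_volcano(res_df)
# ggplot2::ggsave("figures/volcano_airway_deseq2.png", p, width = 6, height = 5)
```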

The GO enrichment analysis generally recovers biological processes consistent with the known effects of glucocorticoids, such as:

response to steroid hormone
regulation of inflammatory and immune responses
transcriptional and signalling pathways affected by dexamethasone

The exact list of enriched terms and the number of significant genes will depend on the chosen thresholds, but the pipeline provides a reproducible and interpretable set of results that connect differential expression to biological pathways.

Discussion

This project implements a typical RNA-seq analysis workflow in R:

Starting from raw count data using a well-curated public dataset (airway).
Performing robust differential expression analysis with DESeq2.
Visualising results using a volcano plot to summarise significance and effect size.
Interpreting the biological signal through GO enrichment analysis with clusterProfiler.

These steps mirror common analyses in:

genomics and transcriptomics research
pharmacogenomics and drug response studies
biomarker discovery and mechanism-of-action investigations

The code is intentionally minimal and script-based to make it easy to:

adapt the pipeline to other RNA-seq datasets
adjust thresholds (for example, different padj cutoffs or log2FC cutoffs)
swap GO for other pathway databases such as KEGG or Reactome
or wrap the workflow into an R Markdown report or Shiny app
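
As an illustration of the third point, swapping GO for KEGG mostly means replacing the enrichGO() call with clusterProfiler::enrichKEGG(). The sketch below assumes a vector of human Entrez IDs (for example, the output of bitr()) and an internet connection, since KEGG annotations are fetched online. The function name run_kegg is hypothetical.

```r
# Sketch: KEGG pathway enrichment for a vector of human Entrez gene IDs.
run_kegg <- function(entrez_ids, p_adjust_method = "BH") {
  if (!requireNamespace("clusterProfiler", quietly = TRUE)) {
    stop("Package 'clusterProfiler' is required but not installed.")
  }

  # For human ("hsa"), KEGG gene IDs correspond to Entrez gene IDs,
  # so the default keyType = "kegg" accepts them directly.
  clusterProfiler::enrichKEGG(
    gene = as.character(entrez_ids),
    organism = "hsa",
    keyType = "kegg",
    pAdjustMethod = p_adjust_method
  )
}

# Example usage (requires clusterProfiler and network access):
# ekegg <- run_kegg(c("2308", "5166", "7832"))
# head(as.data.frame(ekegg))
```

For Reactome, the ReactomePA package provides enrichPathway(), which follows a similar calling pattern.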

This project complements Python-based work (such as Scanpy scRNA-seq analysis and LLM-based phenotype–disease benchmarks) by covering the R side of the toolkit: Bioconductor, DESeq2, and pathway analysis, all of which are widely used in bioinformatics and the pharmaceutical industry.
