Rare disease genomics with WES: from raw FASTQ to annotated variants
An end-to-end bioinformatics variant calling pipeline for Whole Exome Sequencing (WES) data in rare disease genomics, demonstrated using a MOPD-II neonate case study. The WES analysis uncovers pathogenic variants in targeted rare diseases. The workflow demonstrates how to process raw sequencing reads, perform quality control, align to a reference genome, call variants, annotate and visualise them and generate summary reports. Such workflows are critical in clinical genomics, where rapid, reproducible pipelines can directly impact diagnosis and treatment.
- SRA Toolkit (for FASTQ download)
- FastQC / MultiQC (for QC)
- Chrome
.debpackage (for viewing multiQC HTML reports) - Trimmomatic (for trimming)
- BWA (for alignment)
- Samtools and Picard (for BAM processing)
- GATK (for variant calling)
- VEP (for variant annotation)
This dataset (SRR18395025) from the Sri Lanka Rare Disease Project provides Whole Exome Sequencing of a neonate affected by MOPD-II, a rare genetic disorder with clinical relevance in growth and developmental pathways
- Source: SRA
- Accession ID:
SRR18395025 - Type: Whole Exome Sequencing (WES) FASTQ files
- Library: Illumina NovaSeq 6000
- Organism: Human (Homo sapiens)
- Reference Genome: GRCh38/hg38 (latest assembly)
- Clone this repo:
git clone https://github.com/Naila-Srivastava/Variant_Calling-WES-RareDisease.gitcdVariant_Calling-WES-RareDisease
fastq-dump --split-files --gzip -X 100000 SRRxxxxxxx # Adjust the subset read counts
fastqc SRRxxxxxxxx_*.fastq.gz
multiqc .
trimmomatic PE -threads 4 -phred33 \
SRRxxxxxxxx_1.fastq.gz SRRxxxxxxxx_2.fastq.gz \
SRRxxxxxxxx_1_paired.fq.gz SRRxxxxxxxx_1_unpaired.fq.gz \
SRRxxxxxxxx_2_paired.fq.gz SRRxxxxxxxx_2_unpaired.fq.gz \
ILLUMINACLIP:$CONDA_PREFIX/share/trimmomatic/adapters/TruSeq3-PE.fa:2:30:10 \
LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
bwa mem -t 4 hg38.fa SRRxxxxxxxx_1.fastq SRRxxxxxxxx_2.fastq > SRRxxxxxxxx_aligned.sam
samtools view -Sb SRRxxxxxxxx_aligned.sam > SRRxxxxxxxx_aligned.bam
samtools sort SRRxxxxxxxx_aligned.bam -o SRRxxxxxxxx_sorted.bam
picard MarkDuplicates \
I=SRRxxxxxxxx_sorted.bam \
O=SRRxxxxxxxx_dedup.bam \
M=SRRxxxxxxxx_dedup.metrics.txt
gatk CreateSequenceDictionary -R hg38.fa # Create a seq dictionary (1 per reference)
gatk HaplotypeCaller \ # Call variants
-R hg38.fa \
-I SRRxxxxxxxx_dedup.bam \
-O SRRxxxxxxxx_gatk_variants.vcf.gz
vep -i SRRxxxxxxxx_gatk_variants.vcf.gz -o SRRxxxxxxxx_vep_annotated_online.vcf \
--assembly GRCh38 --everything --database#!/bin/bash
#SBATCH --job-name=fastqc_job
#SBATCH --output=fastqc_out.txt
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
module load fastqc
fastqc SRRxxxxxxxx.fastq -o ./results/
- End-to-end workflow: From raw reads to annotated VCF.
- Automated QC reporting using FastQC + MultiQC.
- Rare disease relevance: Pipeline tailored for rare phenotype exploration.
- Reproducibility: Modular scripts for easy reuse across datasets.
- Organised repo: Clear separation of data, scripts, results, and reports.
- FastQC reports → per-sample quality check (per-base quality, GC content, adapter presence).
- MultiQC dashboard → aggregate sample QC metrics in one interactive report.
- Variant summary plots → variant counts, Ti/Tv ratios, functional impact categories.
- Annotated variants table → candidate mutations with gene names, predicted effects, and clinical significance.
- FastQC + MultiQC reports
- Sequencing quality, GC content, adapter content
-
Rare variants identified: From a total of ~16,250 variants annotated with Ensembl VEP, filtering based on predicted impact (missense, splice-site, nonsense) and computational pathogenicity predictions (SIFT and PolyPhen) resulted in a shortlist of candidate variants of interest.
- Most retained variants are missense mutations with predicted moderate impact.
- Several variants were consistently annotated as deleterious (SIFT) and probably damaging (PolyPhen), highlighting their potential functional significance.
- Key affected genes include MMP23B, CFAP74, PRDM16, CHD5, RNF207, PLEKHG5, DNAJC11, EXOSC10, C1orf167, CLCN6, NPPA, VPS13D, among others.
- No strong ClinVar pathogenic annotations were detected in this subset (which might indicate that variants may still be novel).
Summary: The integration of functional consequence, in silico prediction, and allele effect prioritization identified a set of candidate variants likely to have biological relevance and warrant further investigation.
- Rare disease-focused WES pipeline from raw FASTQ → annotated variants.
- Mastery of Linux command-line tools for real-world bioinformatics.
- Clinical insights through variant annotation & prioritization.
- Transferable skills for genomics research & clinical diagnostics.
- Expand pipeline to multiple rare diseases datasets
- Integrate Nextflow/Snakemake for workflow automation
- Add structural variant detection
- Perform functional annotation and pathway analysis
variant_calling-WES-RareDisease/
│── data/ # FASTQ and reference files (not uploaded due to size)
│── scripts/ # Scripts for each step of the pipeline
│── results/ # Output files
│── reports/ # MultiQC & FastQC HTML reports
│── README.md # Project documentation (this file)