Skip to content

Latest commit

 

History

History
232 lines (159 loc) · 7.74 KB

File metadata and controls

232 lines (159 loc) · 7.74 KB
id atacseq-pcalling-macs
title ATAC-seq Peak Calling (MACS)
sidebar_label ATAC-seq Peak Calling (MACS)
sidebar_position 3

import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; import Link from '@docusaurus/Link';

Performing ATAC-seq peak calling with MACS2

Goal: This tutorial provides a guide to peak calling using MACS2 and data generated by the Stanford Encode Project.


Background

What is MACS2?

MACS2 (Model-based Analysis of ChIP-Seq) is a widely used program to detect peaks — enriched regions in sequencing data where reads pile up.

  • For ChIP-seq, peaks correspond to transcription factor binding sites.
  • For ATAC-seq, peaks correspond to open chromatin regions cut by the Tn5 transposase.

That’s why we use MACS2 for ATAC-seq: it helps us identify the regions of the genome that are likely regulatory hotspots.


Download Data

You need one set of genomic coordinate regions to investigate (BED) and one file of sequencing data alignments (BAM) to complete this exercise. [Read more about the BED/BAM file formats here.][file-formats]

BED File

This is the set of transcription start site (TSS) annotations for the GRCh38 genome build that has been subsampled down to 2000 sites for quick tutorial purposes. A more formal analysis may use a more complete set of annotations (many more sites).

Download sample BED file

:::caution If your BED file downloads with a .txt extension, make sure to change the filename to a .bed extension. For this tutorial, the BED file is named UCSC_GRCh38_knownGene_GENCODEV3_TSS_2000bp_SUBSAMPLE-2000-Sites.bed. :::

BAM File

This is the set of read alignments from the ENCODE Project (ENCFF205TSU.bam).

Download sample BAM file

:::caution The BAM file is ~2.5GB large so make sure you have enough space on your machine before downloading. :::

Peak Calling ATAC-seq with MACS2

1. Install MACS2

There are two common ways to install bioinformatics software: pip and conda.

  • pip: the Python package manager. It installs Python-based tools (like MACS2) into your environment.
  • conda: an environment manager that can install packages (both Python and non-Python) with all dependencies. Widely used in bioinformatics. We’ll show both methods,

1.1 Install using pip

# Create a clean Python environment 
python3 -m venv macs2_env
source macs2_env/bin/activate

# Update pip and install MACS2
pip install --upgrade pip wheel
pip install numpy
pip install MACS2

1.2 Install using conda

conda create -n macs2_env python=3.8
conda activate macs2_env
conda install -c bioconda macs2

2. Generate BAI index file

A BAI index file is required for each BAM file of interest (i.e., the tag occupancy data you want to plot). This file allows for rapid access of the sorted and aligned sequence reads (BAM file).

:::tip

SAM/BAM standard is to keep BAI file in same directory as BAM file with the ScriptManager-generated filename.

samtools index ENCFF205TSU.bam 

3. Run MACS2 for ATAC-seq

ATAC-seq is slightly different from ChIP-seq because fragments are shifted by Tn5 transposase insertions and you usually call peaks with paired-end BAM.

macs2 callpeak -t ENCFF205TSU.bam \
 -f BAMPE \
 -g hs \
 -n sample_ATAC \
 --outdir macs2_output \
 --shift -100 --extsize 200 \
 -q 0.01

Key flags explained:

  • -f BAMPE: Paired-end BAM input
  • -g hs: Human genome
  • --shift -100 --extsize 200: Adjusts for Tn5 integration offset
  • -q 0.01: False Discovery Rate Threshold

The output files should be the follwoing:

  • sample_ATAC_peaks.narrowPeak: Peak regions (main file to use downstream)
  • sample_ATAC_peaks.xls: Detailed peak statistics
  • sample_ATAC_summits.bed: Summit positions

4. Convert 'sample_ATAC_peaks.narrowPeak' file to BED File

  cd macs2_output
  cut -f1-3 sample_ATAC_peaks.narrowPeak > sample_ATAC_peaks.bed
  awk '{print $1"\t"$2-1000"\t"$2+1000}' sample_ATAC_top1000_summits.bed > summits_1000bp.bed

:::note The file of this BED file is very large, so we use 1000bp to around each summit in order reduce processing time. :::

The output file should be the follwoing:

  • sample_ATAC_peaks.bed

5. ScriptManager for TagPileup

ScriptManager can plot composite pileups and heatmaps using BAM + BED.

5.1 Index Bam File: BAM Manipulation ➡️ [BAM-BAI Indexer][bam-indexer]
  ENCFF205TSU.bam
  ENCFF205TSU.bam.bai  # Need to generate this file to proceed.

:::

:::note The speed of this step scales with the size of the BAM file. Generally this step 30 sec for a 100 MB BAM file but may take 1-2 min for a multi-GB BAM file. :::

5.2 TagPileup: Sequence Read Analysis ➡️ [Tag Pileup][tag-pileup]

Use "Full Fragment" in the top right, then select "Combined" color , "Output Composite", "CDT", and "Output GZIP". Then, proceed to "Output Directory".

6. ScriptManager for Heatmap

6.1 Sort BED file by CDT file: Coordinate File Manipulation ➡️ [Sort BED][sort-bed]

6.2 Heatmap: Figure Generation ➡️ [Heat Map][heatmap] and Figure Generation ➡️ [Label Heatmaps][label-heatmap]

Command-Line shell script for ScriptManager

The following shell commands records the locations for a BED file, a BAM file, and the anticipated OUTPUT basename as environmental variables to derive the corresponding composite plot values and heatmaps. This can serve as a template for you to write out your own workflows as bash scripts that execute command-line style ScriptManager.

SCRIPTMANAGER=/path/to/ScriptManager.jar
BEDFILE=/path/to/sample_ATAC_top1000_summits.bed
BAMFILE=/path/to/ENCFF205TSU.bam
OUTPUT=/path/to/myoutput

samtools index $BAMFILE

java -jar $SCRIPTMANAGER read-analysis tag-pileup --combined --full-fragment $OUTPUT\_1000bp.bed $BAMFILE -o $OUTPUT\_composite.out -M $OUTPUT\_matrix
java -jar $SCRIPTMANAGER coordinate-manipulation sort-bed -c 1000 $OUTPUT\_1000bp.bed $OUTPUT\_matrix_combined.cdt -o $OUTPUT\_SORT
java -jar $SCRIPTMANAGER figure-generation heatmap -p .95 --black $OUTPUT\_SORT.cdt -o $OUTPUT\_heatmap.png

# Output files:
#  - /path/to/myoutput_composite.out
#  - /path/to/myoutput_matrix_combined.cdt
#  - /path/to/myoutput_SORT.bed
#  - /path/to/myoutput_SORT.cdt
#  - /path/to/myoutput_heatmap.png