This README outlines the steps to set up and run a basic protocol for MAG data processing using sra-tools, fastp, and FastQC. These tools are critical for fetching, cleaning, and validating the quality of sequencing data. Each tool/package plays a specific role in ensuring the integrity and usability of the sequence data for downstream analysis.
- macOS
- Conda (Miniconda or Anaconda)
- Command-line (Terminal)
Conda is a widely used environment management system. It manages dependencies, isolates project environments, and makes the setup reproducible.
conda create -n Basic_protocol_1
conda activate Basic_protocol_1
Adding the right Conda channels is important because some bioinformatics tools are hosted on specific repositories (like bioconda and conda-forge). These repositories are curated to ensure compatibility and updates for bioinformatics tools.
conda config --add channels bioconda
conda config --add channels conda-forge
- Purpose: sra-tools is essential for retrieving sequence data from the Sequence Read Archive (SRA), a large repository of publicly available next-generation sequencing data.
- Why it's important: It provides easy access to raw sequencing data (in .sra format). Its prefetch tool downloads .sra files, and its fasterq-dump tool converts them into usable FASTQ files.
conda install -c bioconda sra-tools==3.0.8
- Purpose: fastp is a highly efficient tool for quality control and preprocessing of FASTQ files. It performs functions like adapter trimming, filtering by quality, and basic data analysis.
- Why it's important: Ensuring high-quality sequence data is crucial before downstream analyses such as assembly or mapping. fastp automates the trimming and filtering process, which improves the reliability of the data.
conda install -c bioconda fastp==0.23.4
Create a directory to store sequence data and quality check results:
mkdir MAG
cd MAG
Use sra-tools to download data directly from the SRA repository:
- prefetch: Downloads the raw .sra files from the SRA repository.
- fasterq-dump: Converts .sra files into FASTQ format, which is the standard input format for most sequence processing tools. The --split-files flag ensures that paired-end reads are split into two separate files, and --skip-technical ignores technical reads that do not contribute to biological information.
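For more than a couple of runs, the same two steps can be wrapped in a loop. The following is a minimal sketch, not part of the original protocol: the fetch_runs and run helpers and the DRY_RUN switch are hypothetical names introduced here for illustration. With DRY_RUN=1 the commands are only printed, so the loop can be previewed before anything is downloaded.

```shell
# Sketch only: fetch_runs, run and DRY_RUN are illustrative assumptions.

# Print the command instead of running it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Download each accession with prefetch, then convert it to FASTQ.
fetch_runs() {
  for acc in "$@"; do
    run prefetch "$acc"
    run fasterq-dump "$acc" --split-files --skip-technical
  done
}

# Example: preview the commands for the two runs used in this protocol.
DRY_RUN=1 fetch_runs SRR23604271 SRR23604268
```

Without DRY_RUN, the loop runs the same prefetch and fasterq-dump commands as the explicit calls below.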
prefetch SRR23604271 SRR23604268
fasterq-dump SRR23604271 --split-files --skip-technical
fasterq-dump SRR23604268 --split-files --skip-technical
- Purpose: FastQC is a tool for quality control of raw sequence data. It generates comprehensive reports with metrics like sequence quality scores, GC content, overrepresented sequences, and adapter content.
- Why it's important: Assessing the quality of sequence data is critical before any further analysis. FastQC provides a quick overview to identify any issues such as low-quality reads or contamination, ensuring the reliability of the dataset for downstream processes.
FastQC is not available directly via Conda for macOS, so it needs to be downloaded manually:
- Visit the FastQC download page and download FastQC v0.12.1 (Mac DMG image).
- Mount the .dmg file and drag the FastQC application to the Applications folder.
- Unmount the .dmg after installation.
To run FastQC from the command line in your conda environment or system-wide, you need to add it to your PATH variable.
- Open Terminal and add FastQC to your PATH by adding this line to your ~/.bash_profile or ~/.zshrc file:
export PATH=$PATH:/Applications/FastQC.app/Contents/MacOS/
- Reload your shell configuration:
source ~/.bash_profile
Or, if you use Zsh:
source ~/.zshrc
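Editing the startup file by hand can leave duplicate export lines if the step is repeated. As a minimal sketch, the hypothetical helper below (add_path_line is a name introduced here, not part of FastQC or Conda) appends a line only when an identical line is not already present.

```shell
# Sketch only: add_path_line is a hypothetical helper for illustration.
# Append a line to a shell startup file unless an identical line is already there.
add_path_line() {
  rc_file="$1"
  line="$2"
  touch "$rc_file"
  if ! grep -qxF -- "$line" "$rc_file"; then
    printf '%s\n' "$line" >> "$rc_file"
  fi
}

# Example usage for Zsh (use ~/.bash_profile for Bash):
# add_path_line "$HOME/.zshrc" 'export PATH=$PATH:/Applications/FastQC.app/Contents/MacOS/'
```

The single quotes keep $PATH unexpanded, so the startup file receives the same literal export line shown above.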
Verify that FastQC has been added to your PATH:
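which fastqc
Expected output when fastqc is found inside the Conda environment (after the manual installation above, the path instead points into /Applications/FastQC.app/Contents/MacOS/):
/Applications/anaconda3/envs/Basic_protocol_1/bin/fastqc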
Check the version of FastQC:
fastqc --version
Expected output:
FastQC v0.12.1
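The same kind of check can cover every tool in this protocol at once. The sketch below is an illustration only: check_tools is a hypothetical helper name, and it relies on nothing more than the shell's built-in command -v lookup.

```shell
# Sketch only: check_tools is a hypothetical helper for illustration.
# Report whether each named command is found on PATH; return non-zero if any is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool -> $(command -v "$tool")"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Example (run inside the activated environment):
# check_tools prefetch fasterq-dump fastp fastqc
```

A "missing" line for a tool suggests that the environment is not activated or that the corresponding installation step was skipped.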
If you encounter issues running FastQC, you may need to make the application executable:
chmod +x /Applications/FastQC.app/Contents/MacOS/fastqc
Run FastQC from any directory by typing:
fastqc
- Prefetch output: .sra files downloaded from the SRA.
- Fasterq-dump output: Split FASTQ files (e.g., SRR23604271_1.fastq, SRR23604271_2.fastq).
- Fastp output: Cleaned FASTQ files (e.g., SRR23604271_1_clean.fastq, SRR23604271_2_clean.fastq).
- FastQC output: Quality control reports (.html and .zip files) summarizing sequence quality metrics.
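The cleaned FASTQ files and QC reports listed above could be produced for one run along the following lines. This is a minimal sketch under stated assumptions, not the protocol's own command set: clean_and_qc, run, the DRY_RUN switch, the qc_reports directory and the fastp report file names are all hypothetical choices made for illustration. The fastp options used are its paired-end input/output flags (-i, -I, -o, -O) and its report flags (--html, --json); FastQC's -o option writes reports to an existing directory.

```shell
# Sketch only: clean_and_qc, run, DRY_RUN, qc_reports and the fastp report
# names are illustrative assumptions, not part of the original protocol.

# Print the command instead of running it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Trim and filter one paired-end run with fastp, then run FastQC on the cleaned reads.
clean_and_qc() {
  acc="$1"
  run fastp \
    -i "${acc}_1.fastq" -I "${acc}_2.fastq" \
    -o "${acc}_1_clean.fastq" -O "${acc}_2_clean.fastq" \
    --html "${acc}_fastp.html" --json "${acc}_fastp.json"
  run mkdir -p qc_reports
  run fastqc "${acc}_1_clean.fastq" "${acc}_2_clean.fastq" -o qc_reports
}

# Example: preview the commands for one of the downloaded runs.
DRY_RUN=1 clean_and_qc SRR23604271
```

Dropping DRY_RUN=1 runs the commands for real, provided the split FASTQ files for that accession are in the current directory.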
- Ensure that Conda is correctly installed on your system before proceeding.
- Always make sure your Conda environment is activated (conda activate Basic_protocol_1) when running commands.
- If FastQC is not recognized in your PATH, revisit the steps for adding it to your PATH.
- Conda: Manages environments and dependencies to ensure tools don't conflict with each other.
- sra-tools: Essential for fetching publicly available sequence data from SRA.
- fastp: Critical for cleaning sequence data by trimming adapters and filtering low-quality reads, which improves the input for downstream analysis.
- FastQC: Reports on the quality of sequence data, allowing you to spot issues early on.