|
1 | | -# Fastq to Gene Counts |
| 1 | +# FastqToGeneCounts |
2 | 2 |
|
3 | 3 | [](https://github.com/snakemake/snakefmt) |
4 | 4 |
|
5 | | -Please [view the documentation](https://helikarlab.github.io/FastqToGeneCounts/) for instructions on how to download, set up, and execute this pipeline. |
| 5 | +A Snakemake workflow for processing bulk RNA-seq data from NCBI GEO accession numbers or FASTQ files. |
| 6 | +This pipeline performs read quality control, alignment to a reference genome, and generates gene count matrices suitable |
| 7 | +for downstream analysis such as differential expression or metabolic modeling. |
| 8 | + |
| 9 | +## Table of Contents |
| 10 | + |
| 11 | +- [Overview](#overview) |
| 12 | +- [Requirements](#requirements) |
| 13 | +- [Installation](#installation) |
| 14 | +- [Conda Environment Setup](#conda-environment-setup) |
| 15 | +- [Configuration](#configuration) |
| 16 | + - [Sample File (MASTER_CONTROL)](#sample-file-master_control) |
| 17 | + - [Output Settings](#output-settings) |
| 18 | + - [Processing Options](#processing-options) |
| 19 | + - [Genome Settings](#genome-settings) |
| 20 | +- [Profile Configuration](#profile-configuration) |
| 21 | + - [SLURM Cluster](#slurm-cluster) |
| 22 | + - [Local Execution](#local-execution) |
| 23 | +- [Running the Pipeline](#running-the-pipeline) |
| 24 | + - [Dry Run](#dry-run) |
| 25 | + - [Full Execution](#full-execution) |
| 26 | +- [Output Description](#output-description) |
| 27 | +- [Benchmarking](#benchmarking) |
| 28 | +- [Troubleshooting](#troubleshooting) |
| 29 | +- [Citation](#citation) |
| 30 | +- [Contributing](#contributing) |
| 31 | +- [License](#license) |
| 32 | + |
| 33 | +## Overview |
| 34 | + |
| 35 | +This workflow: |
| 36 | + |
| 37 | +1. Downloads SRA files from NCBI/GEO (if using prefetch) or reads local FASTQ files |
| 38 | +2. Converts SRA files to FASTQ format using parallel-fastq-dump |
| 39 | +3. Performs quality control with FastQC before and after trimming |
| 40 | +4. (Optionally) Trims adapters and low-quality bases with Trim Galore |
| 41 | +5. Aligns reads to a chosen reference genome using STAR |
| 42 | +6. Quantifies gene counts using Salmon |
| 43 | +7. Collects RNA-seq metrics with Picard |
| 44 | +8. Calculates insert sizes and fragment sizes |
| 45 | +9. Screens for common contaminants |
| 46 | +10. Generates a MultiQC report combining all quality metrics |
| 47 | + |
| 48 | +## Installation |
| 49 | + |
| 50 | +Clone the repository using Git: |
| 51 | + |
| 52 | +```bash |
| 53 | +git clone https://git.unl.edu/HelikarLab/FastqToGeneCounts.git |
| 54 | +cd FastqToGeneCounts |
| 55 | +``` |
| 56 | + |
| 57 | +## Conda Environment Setup |
| 58 | + |
| 59 | +Create and activate a new Conda environment with the required dependencies: |
| 60 | + |
| 61 | +```bash |
| 62 | +conda create -n fastq2genecounts python=3.10 snakemake |
| 63 | +conda activate fastq2genecounts |
| 64 | + |
| 65 | +# If you are running this pipeline and want to use SLURM, install the executor plugin: |
| 66 | +pip install snakemake-executor-plugin-slurm |
| 67 | +``` |
| 68 | + |
| 69 | +Verify the installation: |
| 70 | + |
| 71 | +```bash |
| 72 | +snakemake --version |
| 73 | +``` |
| 74 | + |
| 75 | +## Setup |
| 76 | + |
| 77 | +### Configuration File |
| 78 | + |
| 79 | +The `config.yaml` file contains all customization options available to the pipeline. The options in this file are |
| 80 | +explained below. |
| 81 | + |
| 82 | +### Sample File (`MASTER_CONTROL`) |
| 83 | + |
| 84 | +The `MASTER_CONTROL` setting specifies the CSV file containing sample information. This file must have the |
| 85 | +following four columns as a header row: |
| 86 | + |
| 87 | +| Column | Description | Valid Values | |
| 88 | +|---------------|---------------------|---------------------------------------------------------------------------------------------------------------------------------------------| |
| 89 | +| `srr` | SRA accession code | SRR###### (leave this empty if you are using local files) | |
| 90 | +| `sample` | Sample identifier | Prefix with any alphanumeric string (e.g., tissue name), plus the "Study", "Run", and "Replicate" for that sample, separated by underscores | |
| 91 | +| `endtype` | Sequencing type | `PE` (paired-end) or `SE` (single-end) | |
| 92 | +| `prep_method` | Library preparation | `total` (total RNA) or `mrna` (PolyA/mRNA) | |
| 93 | + |
| 94 | +Example CSV file: |
| 95 | + |
| 96 | +```csv |
| 97 | +srr,sample,endtype,prep_method |
| 98 | +SRR101,tissueA_S1R1,PE,mrna |
| 99 | +SRR102,tissueA_S1R2,SE,total |
| 100 | +SRR103,tissueB_S1R1r1,PE,total |
| 101 | +SRR104,tissueB_S1R1r2,PE,total |
| 102 | +``` |
| 103 | + |
| 104 | +For local FASTQ files (see [Processing Options](#processing-options) below), leave the srr column empty. If the |
| 105 | +sampe uses paired-end sequencing, the pipeline will automatically find the forward and reverse reads based on the |
| 106 | +sample name provded in the file (e.g., `tissueA_S1R1_1.fastq.gz` and `tissueA_S1R1_2.fastq.gz`) |
| 107 | + |
| 108 | +```csv |
| 109 | +srr,sample,endtype,prep_method |
| 110 | +,tissueA_S1R1,PE,total |
| 111 | +,tissueA_S1R2,SE,total |
| 112 | +``` |
| 113 | + |
| 114 | +### Output Settings |
| 115 | + |
| 116 | +| Setting | Description | |
| 117 | +|-------------------|------------------------------------------------------------------------------------------------| |
| 118 | +| `ROOTDIR` | Directory where results will be saved (default: `results`) | |
| 119 | +| `EXPERIMENT_NAME` | Optional subdirectory name for organizing experiment outputs | |
| 120 | +| `BENCHMARK_DIR` | Directory for benchmark output (default: `benchmarks`) | |
| 121 | +| `BENCHMARK_TIMES` | Number of times to run each rule for benchmarking (default: 1, minimal benchmarking performed) | |
| 122 | +| `LOG_DIR` | Directory for log files (default: `logs`) | |
| 123 | + |
| 124 | +### Processing Options |
| 125 | + |
| 126 | +| Setting | Description | Default | |
| 127 | +|-------------------------------|----------------------------------------------|---------| |
| 128 | +| `PERFORM_DUMP_FASTQ` | Download + convert SRA files to FASTQ format | `True` | |
| 129 | +| `PERFORM_TRIM` | Trim adapters and low-quality bases | `True` | |
| 130 | +| `PERFORM_SCREEN` | Screen for contamination | `True` | |
| 131 | +| `PERFORM_GET_RNASEQ_METRICS` | Collect RNA-seq metrics with Picard | `True` | |
| 132 | +| `PERFORM_GET_INSERT_SIZE` | Calculate insert sizes | `True` | |
| 133 | +| `PERFORM_GET_FRAGMENT_SIZE` | Calculate fragment sizes with RSeQC | `True` | |
| 134 | +| `BYPASS_REPLICATE_VALIDATION` | Skip replicate count validation (for COMO) | `False` | |
| 135 | +| `BYPASS_GENOME_VALIDATION` | Skip genome file validation before running | `False` | |
| 136 | + |
| 137 | +#### Local FASTQ Mode |
| 138 | + |
| 139 | +To process local FASTQ files instead of downloading from SRA: |
| 140 | + |
| 141 | +1. Set `PERFORM_DUMP_FASTQ: False` |
| 142 | +2. Set `LOCAL_FASTQ_FILES: "<dirname>"` (change `<dirname>` to the directory path) |
| 143 | +3. Place your FASTQ files in the specified directory |
| 144 | +4. Leave the `srr` column empty in your MASTER_CONTROL CSV |
| 145 | + |
| 146 | +### Genome Settings |
| 147 | + |
| 148 | +The pipeline automatically downloads and prepares reference genome files from Ensembl. Configure these settings under |
| 149 | +the `GENOME` section: |
| 150 | + |
| 151 | +| Setting | Description | Example | |
| 152 | +|-----------------|----------------------------|----------------------------------| |
| 153 | +| `SAVE_DIR` | Directory for genome files | `genome` | |
| 154 | +| `TAXONOMY_ID` | NCBI taxonomy ID | `9606` (human), `10090` (mouse) | |
| 155 | +| `VERSION` | Ensembl release version | `115`, `latest`, `release-112` | |
| 156 | +| `SHOW_PROGRESS` | Show download progress | `True` | |
| 157 | +| `FASTA_VERSION` | FASTA file type | `primary_assembly` or `toplevel` | |
| 158 | + |
| 159 | +Find your taxonomy ID at https://www.ncbi.nlm.nih.gov/Taxonomy |
| 160 | + |
| 161 | +Available Ensembl releases are listed at https://www.ftp.ensembl.org/pub/ |
| 162 | + |
| 163 | +## Profile Configuration |
| 164 | + |
| 165 | +Snakemake profiles configure how jobs are submitted to your compute environment. Two profiles are included with this |
| 166 | +pipeline. |
| 167 | + |
| 168 | +### SLURM Cluster |
| 169 | + |
| 170 | +Use the SLURM profile for high-performance computing clusters: |
| 171 | + |
| 172 | +```bash |
| 173 | +snakemake --profile profiles/cluster |
| 174 | +``` |
| 175 | + |
| 176 | +The SLURM profile configuration (`profiles/cluster/config.v8+.yaml`) includes: |
| 177 | + |
| 178 | +| Setting | Description | Default | |
| 179 | +|-----------------|--------------------------------------------|---------| |
| 180 | +| `cores` | Maximum cores used in workflow | 10 | |
| 181 | +| `jobs` | Maximum concurrent jobs to submit to SLURM | 250 | |
| 182 | +| `restart-times` | Retry attempts on failure | 0 | |
| 183 | +| `use-conda` | Enable Conda environments | `True` | |
| 184 | + |
| 185 | +> [!WARNING] Conda |
| 186 | +> The `use-conda` setting should always be kept to **True**, otherwise Snakemake assumes all required dependencies |
| 187 | +> are installed in the current environment. By keeping this value as True, Snakemake will create the required conda |
| 188 | +> environments and activate them for each required rule. |
| 189 | +
|
| 190 | +#### Customizing SLURM Account |
| 191 | + |
| 192 | +Edit `profiles/cluster/config.v8+.yaml` to change the SLURM account: |
| 193 | + |
| 194 | +```yaml |
| 195 | +sbatch |
| 196 | + --job-name=smk-{rule}-{wildcards} |
| 197 | + --account=YOUR_ACCOUNT_NAME # Change to your account name: `sacctmgr show user $USER accounts` |
| 198 | + --cpus-per-task={threads} |
| 199 | +... |
| 200 | +``` |
| 201 | + |
| 202 | +### Local Execution |
| 203 | + |
| 204 | +For execution on a single machine without a cluster scheduler: |
| 205 | + |
| 206 | +```bash |
| 207 | +snakemake --profile profiles/local |
| 208 | +``` |
| 209 | + |
| 210 | +> [!warning] |
| 211 | +> Local execution may be slow for large datasets and, unless you are running this pipeline on a workstation, is not |
| 212 | +> recommended for genome alignment due to memory constraints. |
| 213 | +
|
| 214 | +### Additional Parameters |
| 215 | +If you would like to change any of the default parameters in the profile, you can either: |
| 216 | +1. Modify the profile to your liking; this will become the new default for all future pipeline workflows, or |
| 217 | +2. Override the parameters on a per-run basis by including them on the command line. Execute `snakemake --help` to |
| 218 | + view all options |
| 219 | + |
| 220 | +## Running the Pipeline |
| 221 | + |
| 222 | +### Dry Run |
| 223 | + |
| 224 | +A dry run can be performed to verify configuration: |
| 225 | + |
| 226 | +```bash |
| 227 | +snakemake --profile profiles/cluster --dry-run # or `--profile profiles/local` if using a local workstation |
| 228 | +``` |
| 229 | + |
| 230 | +The dry run will: |
| 231 | + |
| 232 | +- Validate the Snakefile syntax |
| 233 | +- Print the defined output directories |
| 234 | +- Display all rules that will be executed |
| 235 | +- Check that configuration is properly set up |
| 236 | +- Verify input files are accessible |
| 237 | + |
| 238 | +### Full Execution |
| 239 | + |
| 240 | +After confirming the dry run succeeds, run the full pipeline: |
| 241 | + |
| 242 | +```bash |
| 243 | +snakemake --profile profiles/cluster # or `--profile profiles/local` if using a local workstation |
| 244 | +``` |
| 245 | + |
| 246 | +#### Keeping the Session Alive |
| 247 | + |
| 248 | +Since Snakemake does not daemonize by default, use `screen` or `tmux` before running Snakemake to keep the |
| 249 | +main process running: |
| 250 | + |
| 251 | +- [Screen](https://www.gnu.org/software/screen/manual/screen.html) |
| 252 | +- [Tmux](https://man7.org/linux/man-pages/man1/tmux.1.html) |
| 253 | + |
| 254 | +> [!NOTE] Screen vs Tmux |
| 255 | +> Screen is easer to set up and use, but Tmux offers more powerful features |
| 256 | +
|
| 257 | +```bash |
| 258 | +# Start a screen session |
| 259 | +screen -S snakemake |
| 260 | + |
| 261 | +# Detach from session: Ctrl+a, d |
| 262 | + |
| 263 | +# Reattach to session |
| 264 | +screen -r snakemake |
| 265 | +``` |
| 266 | + |
| 267 | +#### Log Files |
| 268 | + |
| 269 | +Log files are stored in the `logs` directory organized by tissue and rule: |
| 270 | +``` |
| 271 | +logs/ # or the defined `LOG_DIR` configuration option |
| 272 | + {tissue}/ |
| 273 | + {rule}/ |
| 274 | + {tissue}_{rule}.log |
| 275 | +``` |
| 276 | + |
| 277 | +## Benchmarking |
| 278 | + |
| 279 | +The pipeline supports benchmarking to measure execution time for each rule. To enable benchmarking: |
| 280 | + |
| 281 | +1. Set `BENCHMARK_TIMES` to your desired number of repetitions (e.g., `3`) |
| 282 | +2. Run the pipeline normally |
| 283 | + |
| 284 | +Generate benchmark plots using the provided script: |
| 285 | + |
| 286 | +```bash |
| 287 | +python utils/benchmark.py |
| 288 | +``` |
| 289 | + |
| 290 | +This script reads benchmark data from the `benchmarks/` directory and produces visualization plots. |
| 291 | + |
| 292 | +> [!WARNING] Bencharmking |
| 293 | +> The pipeline will run `BENCHMARK_TIMES` *for each input file*. If set to 2, the entire pipeline will run twice to |
| 294 | +> collect benchmark statistics. |
| 295 | +
|
| 296 | +## Getting Help |
| 297 | + |
| 298 | +Please [open an issue](https://git.unl.edu/helikarlab/FastqToGeneCounts/issues) with: |
| 299 | + |
| 300 | +- The exact error message |
| 301 | +- Your configuration file |
| 302 | +- The rule and sample that failed |
| 303 | +- Relevant log files from `logs/` |
| 304 | + |
| 305 | +## Citation |
| 306 | + |
| 307 | +If you use this pipeline in your research, please cite it as: |
| 308 | + |
| 309 | +```bibtex |
| 310 | +@software{FastqToGeneCounts, |
| 311 | + author = {Josh Loecker, Brandt Bessell, Bhanwar lal Puniya, Tomáš Helikar}, |
| 312 | + title = {FastqToGeneCounts: Automated Bulk RNA-seq Alignment}, |
| 313 | + url = {https://git.unl.edu/helikarlab/FastqToGeneCounts} |
| 314 | +} |
| 315 | +``` |
| 316 | + |
| 317 | +## Contributing |
| 318 | + |
| 319 | +Contributions are welcome! Please open an issue or submit a merge request for: |
| 320 | + |
| 321 | +- Bug reports |
| 322 | +- Feature requests |
| 323 | +- Documentation improvements |
| 324 | +- Code enhancements |
0 commit comments