
Commit 9ec3b84

Publication Finalizations (#111)

* refactor: convert all documentation into a single README.md file
* chore: bump bowtie
* chore: bump trim-galore
* refactor: include `retries` configuration
* chore: do not print shell commands to reduce terminal output
* feat: define slurm status command (lets Snakemake determine which jobs have failed as a result of TIMEOUT or CANCELLED errors)
* chore: remove deprecated `conda-frontend` configuration
* chore: add `network_slots` to local profile (prevents dozens of downloads from happening at the same time)
* feat: added documentation + code to reproduce publication results
* chore: do not commit cache-related files
* feat: added license
* feat: rewrote docs into a single-file README for easier deployment to new git backends (github, gitlab, gitea, etc.)
* refactor: removed riboInt.sh script as it is now included as a rule in the pipeline
* refactor: removed perform.py script as it is now included in the `parse.py` and `cfg.*` file/object
* refactor: compile `re` search for a (slight) performance improvement
* refactor: inline exon start/end calculation
* refactor: remove the list storing all sizes as it was unused and causing OOM errors
* refactor: skip BED lines that have fewer than 12 fields
* fix: early-continue if the current or next reference start is None
* fix: incorrect addition of query alignment length
* fix: more graceful handling if the log file is `None`
* feat: better handling of genome-related cache files
* feat: parse the configuration file for object-based usage, increased type safety, and improved error checking
* feat: only print the parsed directory table on startup
* chore: use the new `cfg` object for object-based configuration usage
* feat: copy the configuration file to the results directory for future reference
* feat: use the `attempt` variable to dynamically increase requested runtime and memory
* chore: remove unused imports
* chore: copy & move verbosely
* feat: set conda, benchmark, & log file locations (also organizes these components in the same order across all rules for consistency)
* refactor: use `data` instead of `samples`
* refactor: use genome type (`primary_assembly`, `toplevel`) for the output filename
* refactor: bundle prefetch with fastq_dump_{paired,single} (SRA files are always dumped after prefetch, so bundling the two reduces overall I/O on the filesystem)
* refactor: use the `cfg` object
* refactor: do not mark rules as `localrule`
* refactor: raise an error if no files can be found
* feat: allow trim_{paired,single} to read from the local file system
* feat: mark trimming reports as outputs for trim_galore
* refactor: use a function for `align` input for better readability
* refactor: return a dictionary to allow `.r1` and `.r2` dot-access
* feat: input function for multiqc contamination input
* chore: increase salmon_quantification runtime
* chore: include documentation about overriding default profile parameters
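One refactor above skips BED lines carrying fewer than 12 fields. As an illustration (a hypothetical helper, not the workflow's actual code), a full BED12 record has 12 tab-separated fields, and anything shorter cannot supply the block fields needed for exon start/end calculation:

```python
def has_full_bed12(line: str) -> bool:
    """Illustrative check: keep only BED lines carrying all 12 BED12 fields."""
    # blockSizes/blockStarts live in fields 11-12, so shorter lines are skipped.
    return len(line.rstrip("\n").split("\t")) >= 12
```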
1 parent 40a5f1e commit 9ec3b84

111 files changed

Lines changed: 1503 additions & 18011 deletions

.gitignore

Lines changed: 2 additions & 0 deletions

@@ -8,6 +8,8 @@ results
 controls
 config.yaml
 gurobi.log
+.taxon_to_species.json
+.valid_release_numbers.txt
 
 # General files
 **/__pycache__/

LICENSE.md

Lines changed: 14 additions & 0 deletions

@@ -0,0 +1,14 @@
+Copyright (c) 2026 Josh Loecker, Brandt Bessell, Bhanwar lal Puniya, Tomáš Helikar
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
+documentation files (the "Software"), to deal in the Software without restriction, including without limitation the
+rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit
+persons to whom the Software is furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all copies or substantial portions of the
+Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
+WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
+COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
+OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
README.md

Lines changed: 321 additions & 2 deletions

@@ -1,5 +1,324 @@
-# Fastq to Gene Counts
+# FastqToGeneCounts

[![Code style: snakefmt](https://img.shields.io/badge/code%20style-snakefmt-000000.svg)](https://github.com/snakemake/snakefmt)

-Please [view the documentation](https://helikarlab.github.io/FastqToGeneCounts/) for instructions on how to download, set up, and execute this pipeline.
+A Snakemake workflow for processing bulk RNA-seq data from NCBI GEO accession numbers or FASTQ files. This pipeline performs read quality control, alignment to a reference genome, and generates gene count matrices suitable for downstream analysis such as differential expression or metabolic modeling.
## Table of Contents

- [Overview](#overview)
- [Installation](#installation)
- [Conda Environment Setup](#conda-environment-setup)
- [Setup](#setup)
  - [Configuration File](#configuration-file)
  - [Sample File (MASTER_CONTROL)](#sample-file-master_control)
  - [Output Settings](#output-settings)
  - [Processing Options](#processing-options)
  - [Genome Settings](#genome-settings)
- [Profile Configuration](#profile-configuration)
  - [SLURM Cluster](#slurm-cluster)
  - [Local Execution](#local-execution)
  - [Additional Parameters](#additional-parameters)
- [Running the Pipeline](#running-the-pipeline)
  - [Dry Run](#dry-run)
  - [Full Execution](#full-execution)
- [Benchmarking](#benchmarking)
- [Getting Help](#getting-help)
- [Citation](#citation)
- [Contributing](#contributing)
## Overview

This workflow:

1. Downloads SRA files from NCBI/GEO (if using prefetch) or reads local FASTQ files
2. Converts SRA files to FASTQ format using parallel-fastq-dump
3. Performs quality control with FastQC before and after trimming
4. (Optionally) Trims adapters and low-quality bases with Trim Galore
5. Aligns reads to a chosen reference genome using STAR
6. Quantifies gene counts using Salmon
7. Collects RNA-seq metrics with Picard
8. Calculates insert sizes and fragment sizes
9. Screens for common contaminants
10. Generates a MultiQC report combining all quality metrics
## Installation

Clone the repository using Git:

```bash
git clone https://git.unl.edu/HelikarLab/FastqToGeneCounts.git
cd FastqToGeneCounts
```

## Conda Environment Setup

Create and activate a new Conda environment with the required dependencies:

```bash
conda create -n fastq2genecounts python=3.10 snakemake
conda activate fastq2genecounts

# If you are running this pipeline and want to use SLURM, install the executor plugin:
pip install snakemake-executor-plugin-slurm
```

Verify the installation:

```bash
snakemake --version
```
## Setup

### Configuration File

The `config.yaml` file contains all customization options available to the pipeline. The options in this file are explained below.

### Sample File (`MASTER_CONTROL`)

The `MASTER_CONTROL` setting specifies the CSV file containing sample information. This file must have the following four columns as a header row:

| Column        | Description         | Valid Values |
|---------------|---------------------|--------------|
| `srr`         | SRA accession code  | SRR###### (leave this empty if you are using local files) |
| `sample`      | Sample identifier   | Prefix with any alphanumeric string (e.g., tissue name), plus the "Study", "Run", and "Replicate" for that sample, separated by underscores |
| `endtype`     | Sequencing type     | `PE` (paired-end) or `SE` (single-end) |
| `prep_method` | Library preparation | `total` (total RNA) or `mrna` (PolyA/mRNA) |

Example CSV file:

```csv
srr,sample,endtype,prep_method
SRR101,tissueA_S1R1,PE,mrna
SRR102,tissueA_S1R2,SE,total
SRR103,tissueB_S1R1r1,PE,total
SRR104,tissueB_S1R1r2,PE,total
```
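The naming convention above (a prefix, then Study/Run and an optional replicate) can be sketched with a small parser. The regex and field names below are illustrative assumptions, not the pipeline's actual code:

```python
import re

# Hypothetical parser for names like `tissueB_S1R1r2`:
# <prefix>_S<study>R<run>[r<replicate>]
_SAMPLE = re.compile(r"^(?P<prefix>.+)_S(?P<study>\d+)R(?P<run>\d+)(?:r(?P<replicate>\d+))?$")

def parse_sample(name: str) -> dict:
    """Split a sample identifier into its prefix, study, run, and replicate parts."""
    match = _SAMPLE.match(name)
    if match is None:
        raise ValueError(f"sample name does not follow the convention: {name}")
    return match.groupdict()
```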
For local FASTQ files (see [Processing Options](#processing-options) below), leave the `srr` column empty. If the sample uses paired-end sequencing, the pipeline will automatically find the forward and reverse reads based on the sample name provided in the file (e.g., `tissueA_S1R1_1.fastq.gz` and `tissueA_S1R1_2.fastq.gz`):

```csv
srr,sample,endtype,prep_method
,tissueA_S1R1,PE,total
,tissueA_S1R2,SE,total
```
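The mate lookup described above can be sketched as a pure function. The single-end filename here is an assumption, since only the paired-end example is given:

```python
def expected_fastq_names(sample: str, endtype: str) -> list[str]:
    """Sketch of the naming scheme: PE samples resolve to `_1`/`_2` mates.

    The SE filename (`<sample>.fastq.gz`) is an assumption, not confirmed above.
    """
    if endtype == "PE":
        return [f"{sample}_1.fastq.gz", f"{sample}_2.fastq.gz"]
    return [f"{sample}.fastq.gz"]
```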
### Output Settings

| Setting           | Description |
|-------------------|-------------|
| `ROOTDIR`         | Directory where results will be saved (default: `results`) |
| `EXPERIMENT_NAME` | Optional subdirectory name for organizing experiment outputs |
| `BENCHMARK_DIR`   | Directory for benchmark output (default: `benchmarks`) |
| `BENCHMARK_TIMES` | Number of times to run each rule for benchmarking (default: 1, minimal benchmarking performed) |
| `LOG_DIR`         | Directory for log files (default: `logs`) |

### Processing Options

| Setting                       | Description                                   | Default |
|-------------------------------|-----------------------------------------------|---------|
| `PERFORM_DUMP_FASTQ`          | Download + convert SRA files to FASTQ format  | `True`  |
| `PERFORM_TRIM`                | Trim adapters and low-quality bases           | `True`  |
| `PERFORM_SCREEN`              | Screen for contamination                      | `True`  |
| `PERFORM_GET_RNASEQ_METRICS`  | Collect RNA-seq metrics with Picard           | `True`  |
| `PERFORM_GET_INSERT_SIZE`     | Calculate insert sizes                        | `True`  |
| `PERFORM_GET_FRAGMENT_SIZE`   | Calculate fragment sizes with RSeQC           | `True`  |
| `BYPASS_REPLICATE_VALIDATION` | Skip replicate count validation (for COMO)    | `False` |
| `BYPASS_GENOME_VALIDATION`    | Skip genome file validation before running    | `False` |

#### Local FASTQ Mode

To process local FASTQ files instead of downloading from SRA:

1. Set `PERFORM_DUMP_FASTQ: False`
2. Set `LOCAL_FASTQ_FILES: "<dirname>"` (change `<dirname>` to the directory path)
3. Place your FASTQ files in the specified directory
4. Leave the `srr` column empty in your `MASTER_CONTROL` CSV
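Putting the steps above together, a `config.yaml` excerpt might look like this (key names come from the tables above; the directory path is a placeholder):

```yaml
# Local FASTQ mode: skip the SRA download and read from a local directory instead.
PERFORM_DUMP_FASTQ: False
LOCAL_FASTQ_FILES: "/path/to/fastq_dir"   # contains e.g. tissueA_S1R1_1.fastq.gz
```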
### Genome Settings

The pipeline automatically downloads and prepares reference genome files from Ensembl. Configure these settings under the `GENOME` section:

| Setting         | Description                | Example                          |
|-----------------|----------------------------|----------------------------------|
| `SAVE_DIR`      | Directory for genome files | `genome`                         |
| `TAXONOMY_ID`   | NCBI taxonomy ID           | `9606` (human), `10090` (mouse)  |
| `VERSION`       | Ensembl release version    | `115`, `latest`, `release-112`   |
| `SHOW_PROGRESS` | Show download progress     | `True`                           |
| `FASTA_VERSION` | FASTA file type            | `primary_assembly` or `toplevel` |

Find your taxonomy ID at https://www.ncbi.nlm.nih.gov/Taxonomy

Available Ensembl releases are listed at https://www.ftp.ensembl.org/pub/
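A sketch of the corresponding `GENOME` block, with values taken from the examples above (exact formatting may differ from the shipped `config.yaml`):

```yaml
GENOME:
  SAVE_DIR: genome
  TAXONOMY_ID: 9606            # human (use 10090 for mouse)
  VERSION: latest              # or a pinned release such as 115
  SHOW_PROGRESS: True
  FASTA_VERSION: primary_assembly
```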
## Profile Configuration

Snakemake profiles configure how jobs are submitted to your compute environment. Two profiles are included with this pipeline.

### SLURM Cluster

Use the SLURM profile for high-performance computing clusters:

```bash
snakemake --profile profiles/cluster
```

The SLURM profile configuration (`profiles/cluster/config.v8+.yaml`) includes:

| Setting         | Description                                | Default |
|-----------------|--------------------------------------------|---------|
| `cores`         | Maximum cores used in the workflow         | 10      |
| `jobs`          | Maximum concurrent jobs to submit to SLURM | 250     |
| `restart-times` | Retry attempts on failure                  | 0       |
| `use-conda`     | Enable Conda environments                  | `True`  |

> [!WARNING]
> The `use-conda` setting should always be kept set to **True**; otherwise, Snakemake assumes all required dependencies
> are installed in the current environment. With this value set to True, Snakemake will create the required Conda
> environments and activate them for each rule that needs one.
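The `restart-times` setting pairs with Snakemake's `attempt` value: a rule's `resources` entries can be callables that receive `attempt` (1 on the first try, incremented on each retry), so failed jobs are resubmitted with more memory and time. A minimal sketch in plain Python follows; the function name and base values are hypothetical, not taken from this workflow:

```python
def scaled_resources(base_mem_mb: int = 8000, base_runtime_min: int = 120):
    """Return (mem, runtime) callables that grow linearly with the retry attempt."""
    def mem_mb(wildcards, attempt: int) -> int:
        # attempt == 1 on the first run; each retry adds another base allotment
        return base_mem_mb * attempt

    def runtime_min(wildcards, attempt: int) -> int:
        return base_runtime_min * attempt

    return mem_mb, runtime_min
```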
#### Customizing SLURM Account

Edit `profiles/cluster/config.v8+.yaml` to change the SLURM account:

```yaml
sbatch
--job-name=smk-{rule}-{wildcards}
--account=YOUR_ACCOUNT_NAME   # Change to your account name: `sacctmgr show user $USER accounts`
--cpus-per-task={threads}
...
```
### Local Execution

For execution on a single machine without a cluster scheduler:

```bash
snakemake --profile profiles/local
```

> [!WARNING]
> Local execution may be slow for large datasets and, unless you are running this pipeline on a workstation, is not
> recommended for genome alignment due to memory constraints.

### Additional Parameters

If you would like to change any of the default parameters in the profile, you can either:

1. Modify the profile itself; this becomes the new default for all future pipeline runs, or
2. Override the parameters on a per-run basis by including them on the command line. Execute `snakemake --help` to
   view all options.
## Running the Pipeline

### Dry Run

A dry run can be performed to verify the configuration:

```bash
snakemake --profile profiles/cluster --dry-run  # or `--profile profiles/local` if using a local workstation
```

The dry run will:

- Validate the Snakefile syntax
- Print the defined output directories
- Display all rules that will be executed
- Check that the configuration is properly set up
- Verify input files are accessible

### Full Execution

After confirming the dry run succeeds, run the full pipeline:

```bash
snakemake --profile profiles/cluster  # or `--profile profiles/local` if using a local workstation
```
#### Keeping the Session Alive

Since Snakemake does not daemonize by default, use `screen` or `tmux` before running Snakemake to keep the main
process running:

- [Screen](https://www.gnu.org/software/screen/manual/screen.html)
- [Tmux](https://man7.org/linux/man-pages/man1/tmux.1.html)

> [!NOTE]
> Screen is easier to set up and use, but Tmux offers more powerful features.

```bash
# Start a screen session
screen -S snakemake

# Detach from the session: Ctrl+a, d

# Reattach to the session
screen -r snakemake
```

#### Log Files

Log files are stored in the `logs` directory, organized by tissue and rule:

```
logs/                        # or the defined `LOG_DIR` configuration option
  {tissue}/
    {rule}/
      {tissue}_{rule}.log
```
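The layout above maps directly to a path template; a one-line sketch (the helper name is illustrative, not part of the pipeline):

```python
def log_path(log_dir: str, tissue: str, rule: str) -> str:
    """Build the per-sample log path following the layout shown above."""
    return f"{log_dir}/{tissue}/{rule}/{tissue}_{rule}.log"
```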
## Benchmarking

The pipeline supports benchmarking to measure execution time for each rule. To enable benchmarking:

1. Set `BENCHMARK_TIMES` to your desired number of repetitions (e.g., `3`)
2. Run the pipeline normally

Generate benchmark plots using the provided script:

```bash
python utils/benchmark.py
```

This script reads benchmark data from the `benchmarks/` directory and produces visualization plots.

> [!WARNING]
> The pipeline will run each rule `BENCHMARK_TIMES` times *for each input file*. If set to 2, the entire pipeline will
> run twice to collect benchmark statistics.
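Snakemake benchmark files are tab-separated tables whose header typically includes an `s` (wall-clock seconds) column. Averaging repeated runs can be sketched as below; this is a simplified stand-in for what `utils/benchmark.py` does, not its actual code:

```python
import csv
import io

def mean_runtime_seconds(tsv_text: str) -> float:
    """Average the `s` (seconds) column of a Snakemake-style benchmark TSV."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        raise ValueError("benchmark file has no data rows")
    return sum(float(row["s"]) for row in rows) / len(rows)
```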
## Getting Help

Please [open an issue](https://git.unl.edu/helikarlab/FastqToGeneCounts/issues) with:

- The exact error message
- Your configuration file
- The rule and sample that failed
- Relevant log files from `logs/`

## Citation

If you use this pipeline in your research, please cite it as:

```bibtex
@software{FastqToGeneCounts,
  author = {Josh Loecker and Brandt Bessell and Bhanwar lal Puniya and Tomáš Helikar},
  title  = {FastqToGeneCounts: Automated Bulk RNA-seq Alignment},
  url    = {https://git.unl.edu/helikarlab/FastqToGeneCounts}
}
```

## Contributing

Contributions are welcome! Please open an issue or submit a merge request for:

- Bug reports
- Feature requests
- Documentation improvements
- Code enhancements
