Commit 807df3a

Merge pull request #4 from MPUSP/dev
feat: add prokka and bakta for annotation
2 parents 6c9134c + dcbd2f4 commit 807df3a

19 files changed

Lines changed: 701 additions & 88 deletions

.github/workflows/main.yml

Lines changed: 3 additions & 3 deletions
@@ -2,9 +2,9 @@ name: CI

 on:
   push:
-    branches: [main, dev]
+    branches: [main]
   pull_request:
-    branches: [main, dev]
+    branches: [main]

 jobs:
   Formatting:
@@ -47,7 +47,7 @@ jobs:
         with:
           directory: .test
           snakefile: workflow/Snakefile
-          args: "--sdm conda --show-failed-logs --cores 1 --conda-cleanup-pkgs cache -n"
+          args: "--sdm conda --show-failed-logs --cores 3 --conda-cleanup-pkgs cache"

       - name: Test report
         uses: snakemake/snakemake-github-action@v2.0.0

.test/config/config.yml

Lines changed: 21 additions & 1 deletion
@@ -1,9 +1,29 @@
 samplesheet: "config/samples.csv"
-outdir: "results"
+tool: ["prokka"]

 pgap:
   bin: "path/to/pgap.py"
   use_yaml_config: True
   prepare_yaml_files:
     generic: "config/generic.yaml"
     submol: "config/submol.yaml"
+
+prokka:
+  center: ""
+  extra: "--addgenes"
+
+bakta:
+  download_db: "light"
+  existing_db: ""
+  extra: "--keep-contig-headers --compliant"
+
+quast:
+  reference_fasta: ""
+  reference_gff: ""
+  extra: ""
+
+panaroo:
+  skip: False
+  remove_source: "cmsearch"
+  remove_feature: "tRNA|rRNA|ncRNA|exon|sequence_feature"
+  extra: "--clean-mode strict --remove-invalid-genes"

.test/config/samples.csv

Lines changed: 1 addition & 1 deletion
@@ -1,2 +1,2 @@
 sample,species,strain,id_prefix,file
-EC2224,"Streptococcus pyogenes",SF370,SPY,"data/assembly.fasta"
+EC2224,"Streptococcus pyogenes",SF370,SPY,"data/assembly.fasta"
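
The samplesheet above has five columns, and `config/README.md` in this commit notes that the Panaroo pangenome step needs at least two samples. A minimal Python sketch of such a check follows. The function names `read_samplesheet` and `enough_samples_for_pangenome` are hypothetical and are not taken from the workflow's own code.

```python
import csv
import io

# Column names taken from the samplesheet header shown in the diff.
EXPECTED_COLUMNS = ["sample", "species", "strain", "id_prefix", "file"]


def read_samplesheet(text):
    """Parse samplesheet CSV text into a list of row dictionaries."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    return list(reader)


def enough_samples_for_pangenome(rows, minimum=2):
    """Return True if the sheet holds at least `minimum` samples."""
    return len(rows) >= minimum


# The single-sample test sheet from this commit.
example = (
    "sample,species,strain,id_prefix,file\n"
    'EC2224,"Streptococcus pyogenes",SF370,SPY,"data/assembly.fasta"\n'
)
rows = read_samplesheet(example)
print(rows[0]["sample"], enough_samples_for_pangenome(rows))
```

With the one-row test sheet the check reports that there are too few samples for a pangenome analysis, which is consistent with the `skip` option that the `panaroo` config section provides.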

README.md

Lines changed: 57 additions & 5 deletions
@@ -4,23 +4,39 @@
 [![GitHub actions status](https://github.com/MPUSP/snakemake-assembly-postprocessing/actions/workflows/main.yml/badge.svg)](https://github.com/MPUSP/snakemake-assembly-postprocessing/actions/workflows/main.yml)
 [![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/)
 [![run with apptainer](https://img.shields.io/badge/run%20with-apptainer-1D355C.svg?labelColor=000000)](https://apptainer.org/)
+[![workflow catalog](https://img.shields.io/badge/Snakemake%20workflow%20catalog-darkgreen)](https://snakemake.github.io/snakemake-workflow-catalog/docs/workflows/MPUSP/snakemake-assembly-postprocessing)

 A Snakemake workflow for the post-processing of microbial genome assemblies.

+- [snakemake-assembly-postprocessing](#snakemake-assembly-postprocessing)
+  - [Usage](#usage)
+  - [Workflow overview](#workflow-overview)
+  - [Installation](#installation)
+  - [Deployment options](#deployment-options)
+  - [Authors](#authors)
+  - [References](#references)
+
 ## Usage

 The usage of this workflow is described in the [Snakemake Workflow Catalog](https://snakemake.github.io/snakemake-workflow-catalog/docs/workflows/MPUSP/snakemake-assembly-postprocessing).

+Detailed information about input data and workflow configuration can also be found in the [`config/README.md`](config/README.md).
+
 If you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this repository.

-## Workflow overview
+_Workflow overview:_

-1. Parse `samples.csv` table containing the samples's meta data (`python`)
-2. Annotate assemblies using NCBI's Prokaryotic Genome Annotation Pipeline ([PGAP](https://github.com/ncbi/pgap))
+<img src="resources/images/dag.svg" align="center" />

-## Requirements
+## Workflow overview

-- [PGAP](https://github.com/ncbi/pgap)
+1. Parse the `samples.csv` table containing the samples' metadata (`python`)
+2. Annotate assemblies using one of the following tools:
+   1. NCBI's Prokaryotic Genome Annotation Pipeline ([PGAP](https://github.com/ncbi/pgap)). Note: needs to be installed manually
+   2. [prokka](https://github.com/tseemann/prokka), a fast and light-weight prokaryotic annotation tool
+   3. [bakta](https://github.com/oschwengers/bakta), a fast, alignment-free annotation tool. Note: Bakta will automatically download its companion database from Zenodo (light: 1.5 GB, full: 40 GB)
+3. Create a QC report for the assemblies using [Quast](https://github.com/ablab/quast)
+4. Create a pangenome analysis (orthologs/homologs) using [Panaroo](https://gthlab.au/panaroo/)

 ## Installation

@@ -46,9 +62,37 @@ conda activate snakemake-assembly-postprocessing

 **Step 4: Install PGAP**

+- If you want to use [PGAP](https://github.com/ncbi/pgap) for annotation, it needs to be installed separately.
 - PGAP can be downloaded from https://github.com/ncbi/pgap. Please follow the installation instructions there.
 - Define the path to the `pgap.py` script (located in the `scripts` folder) in the `config` file (recommended: `./resources`)

+## Deployment options
+
+To run the workflow from the command line, change to the working directory.
+
+```bash
+cd snakemake-assembly-postprocessing
+```
+
+Adjust options in the default config file `config/config.yml`.
+Before running the complete workflow, you can perform a dry run using:
+
+```bash
+snakemake --cores 1 --dry-run
+```
+
+To run the workflow with test files using **conda**:
+
+```bash
+snakemake --cores 2 --sdm conda --directory .test
+```
+
+To run the workflow with test files using **apptainer**:
+
+```bash
+snakemake --cores 2 --sdm conda apptainer --directory .test
+```
+
 ## Authors

 - Dr. Rina Ahmed-Begrich
@@ -61,6 +105,14 @@ conda activate snakemake-assembly-postprocessing

 ## References

+> Seemann T. _Prokka: rapid prokaryotic genome annotation_. Bioinformatics. **2014** Jul 15;30(14):2068-9. PMID: 24642063. https://doi.org/10.1093/bioinformatics/btu153.
+
+> Schwengers O, Jelonek L, Dieckmann MA, Beyvers S, Blom J, Goesmann A. _Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification_. Microb Genom, 7(11):000685 **2021**. PMID: 34739369. https://doi.org/10.1099/mgen.0.000685.
+
 > Li W, O'Neill KR, Haft DH, DiCuccio M, Chetvernin V, Badretdin A, Coulouris G, Chitsaz F, Derbyshire MK, Durkin AS, Gonzales NR, Gwadz M, Lanczycki CJ, Song JS, Thanki N, Wang J, Yamashita RA, Yang M, Zheng C, Marchler-Bauer A, Thibaud-Nissen F. _RefSeq: Expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation._ Nucleic Acids Res, **2021** Jan 8;49(D1):D1020-D1028. https://doi.org/10.1093/nar/gkaa1105

+> Gurevich A, Saveliev V, Vyahhi N, Tesler G. _QUAST: quality assessment tool for genome assemblies_. Bioinformatics. 29(8):1072-5, **2013**. PMID: 23422339. https://doi.org/10.1093/bioinformatics/btt086.
+
+> Tonkin-Hill G, MacAlasdair N, Ruis C, Weimann A, Horesh G, Lees JA, Gladstone RA, Lo S, Beaudoin C, Floto RA, Frost SDW, Corander J, Bentley SD, Parkhill J. _Producing polished prokaryotic pangenomes with the Panaroo pipeline_. Genome Biol. 21(1):180, **2020**. PMID: 32698896. https://doi.org/10.1186/s13059-020-02090-4.
+
 > Köster, J., Mölder, F., Jablonski, K. P., Letcher, B., Hall, M. B., Tomkins-Tinch, C. H., Sochat, V., Forster, J., Lee, S., Twardziok, S. O., Kanitz, A., Wilm, A., Holtgrewe, M., Rahmann, S., & Nahnsen, S. _Sustainable data analysis with Snakemake_. F1000Research, 10:33, **2021**. https://doi.org/10.12688/f1000research.29032.2.

config/README.md

Lines changed: 48 additions & 25 deletions
@@ -1,32 +1,55 @@
+## Workflow overview
+
+A Snakemake workflow for the post-processing of microbial genome assemblies.
+
+1. Parse the `samples.csv` table containing the samples' metadata (`python`)
+2. Annotate assemblies using one of the following tools:
+   1. NCBI's Prokaryotic Genome Annotation Pipeline ([PGAP](https://github.com/ncbi/pgap)). Note: needs to be installed manually
+   2. [prokka](https://github.com/tseemann/prokka), a fast and light-weight prokaryotic annotation tool
+   3. [bakta](https://github.com/oschwengers/bakta), a fast, alignment-free annotation tool. Note: Bakta will automatically download its companion database from Zenodo (light: 1.5 GB, full: 40 GB)
+3. Create a QC report for the assemblies using [Quast](https://github.com/ablab/quast)
+4. Create a pangenome analysis (orthologs/homologs) using [Panaroo](https://gthlab.au/panaroo/)
+
 ## Running the workflow

 ### Input data

 This workflow requires `fasta` input data.
 The samplesheet table has the following layout:

-| sample | species | strain | id_prefix | file |
-| ----------- | ------------ | ------------- | ------------- | ------------- |
-| EC2224 | "Streptococcus pyogenes" | SF370 | Spy | assembly.fasta |
-
-### Execution
-
-To run the workflow from command line, change to the working directory and activate the conda environment.
-
-```bash
-cd snakemake-assembly-postprocessing
-conda activate snakemake-assembly-postprocessing
-```
-
-Adjust options in the default config file `config/config.yml`.
-Before running the entire workflow, perform a dry run using:
-
-```bash
-snakemake --cores 1 --sdm conda --directory .test --dry-run
-```
-
-To run the workflow with test files using **conda**:
-
-```bash
-snakemake --cores 1 --sdm conda --directory .test
-```
+| sample | species | strain | id_prefix | file |
+| ------ | ------------------------ | ------ | --------- | -------------- |
+| EC2224 | "Streptococcus pyogenes" | SF370 | SPY | assembly.fasta |
+| ... | ... | ... | ... | ... |
+
+**Note:** Pangenome analysis with `Panaroo` requires at least two samples.
+
+### Parameters
+
+This table lists all parameters that can be used to run the workflow.
+
+| Parameter | Type | Details | Default |
+|:---|:---|:---|:---|
+| **samplesheet** | string | Path to the sample sheet file in csv format | |
+| **tool** | array[string] | Annotation tool to use (one of `prokka`, `pgap`, `bakta`) | |
+| **pgap** | | PGAP configuration object | |
+| bin | string | Path to the PGAP script | |
+| use_yaml_config | boolean | Whether to use YAML configuration for PGAP | `False` |
+| _prepare_yaml_files_ | | Paths to YAML templates for PGAP | |
+| generic | string | Path to the generic YAML configuration file | |
+| submol | string | Path to the submol YAML configuration file | |
+| **prokka** | | Prokka configuration object | |
+| center | string | Center name for Prokka annotation (used in sequence IDs) | |
+| extra | string | Extra command-line arguments for Prokka | `--addgenes` |
+| **bakta** | | Bakta configuration object | |
+| download_db | string | Bakta database type (`full`, `light`, or `none`) | `light` |
+| existing_db | string | Path to an existing Bakta database (optional). Needs to be combined with `download_db='none'` | |
+| extra | string | Extra command-line arguments for Bakta | `--keep-contig-headers --compliant` |
+| **quast** | | QUAST configuration object | |
+| reference_fasta | string | Path to the reference genome for QUAST | |
+| reference_gff | string | Path to the reference annotation for QUAST | |
+| extra | string | Extra command-line arguments for QUAST | |
+| **panaroo** | | Panaroo configuration object | |
+| remove_source | string | Source types to remove in Panaroo (regex supported) | `cmsearch` |
+| remove_feature | string | Feature types to remove in Panaroo (regex supported) | `tRNA\|rRNA\|ncRNA\|exon\|sequence_feature` |
+| extra | string | Extra command-line arguments for Panaroo | `--clean-mode strict --remove-invalid-genes` |
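
The parameter table states that `existing_db` needs to be combined with `download_db='none'`. The Python sketch below shows one way such a consistency check might look. The function `check_bakta_options` is hypothetical, and the third rule (rejecting `none` without a path) is an assumption that goes beyond what the table says.

```python
# Allowed values as listed in the parameter table.
VALID_DOWNLOAD_DB = {"full", "light", "none"}


def check_bakta_options(bakta):
    """Return a list of problems found in a `bakta` config mapping."""
    problems = []
    download_db = bakta.get("download_db", "")
    existing_db = bakta.get("existing_db", "")

    if download_db not in VALID_DOWNLOAD_DB:
        problems.append(
            f"download_db must be one of {sorted(VALID_DOWNLOAD_DB)}"
        )
    # Rule from the table: an existing database requires download_db='none'.
    if existing_db and download_db != "none":
        problems.append("existing_db is set, so download_db should be 'none'")
    # Assumption: 'none' without a database path leaves Bakta without a database.
    if download_db == "none" and not existing_db:
        problems.append("download_db is 'none' but existing_db is empty")
    return problems


# Default values from config/config.yml in this commit.
defaults = {"download_db": "light", "existing_db": "", "extra": ""}
print(check_bakta_options(defaults))
```

The default configuration passes without problems, while a configuration that sets a database path together with `download_db: "light"` would be flagged.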

config/config.yml

Lines changed: 21 additions & 1 deletion
@@ -1,9 +1,29 @@
 samplesheet: "config/samples.csv"
-outdir: "results"
+tool: ["prokka"]

 pgap:
   bin: "path/to/pgap.py"
   use_yaml_config: True
   prepare_yaml_files:
     generic: "config/generic.yaml"
     submol: "config/submol.yaml"
+
+prokka:
+  center: ""
+  extra: "--addgenes"
+
+bakta:
+  download_db: "light"
+  existing_db: ""
+  extra: "--keep-contig-headers --compliant"
+
+quast:
+  reference_fasta: ""
+  reference_gff: ""
+  extra: ""
+
+panaroo:
+  skip: False
+  remove_source: "cmsearch"
+  remove_feature: "tRNA|rRNA|ncRNA|exon|sequence_feature"
+  extra: "--clean-mode strict --remove-invalid-genes"
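
The `panaroo` options `remove_source` and `remove_feature` are regular expressions. The Python sketch below illustrates how such patterns could filter GFF3 records by their source (column 2) and feature type (column 3). How the workflow itself applies them is not shown in this diff, so the function `keep_gff_line`, the use of `re.search`, and the example records are all assumptions.

```python
import re

# Default patterns from config/config.yml in this commit.
REMOVE_SOURCE = "cmsearch"
REMOVE_FEATURE = "tRNA|rRNA|ncRNA|exon|sequence_feature"


def keep_gff_line(line, remove_source=REMOVE_SOURCE, remove_feature=REMOVE_FEATURE):
    """Return True if a GFF3 line should be kept after filtering."""
    if line.startswith("#"):
        return True  # keep header and comment lines
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return True  # not a feature record, leave untouched
    source, feature = fields[1], fields[2]
    if re.search(remove_source, source):
        return False
    if re.search(remove_feature, feature):
        return False
    return True


# Hypothetical records for illustration only.
records = [
    "##gff-version 3",
    "contig_1\tProdigal\tCDS\t1\t900\t.\t+\t0\tID=gene_1",
    "contig_1\tAragorn\ttRNA\t950\t1020\t.\t+\t.\tID=gene_2",
    "contig_1\tcmsearch\tregulatory_region\t1100\t1200\t.\t-\t.\tID=gene_3",
]
kept = [r for r in records if keep_gff_line(r)]
print(len(kept))
```

Only the header line and the `CDS` record survive the filter in this example, since the other two records match one of the two patterns.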

config/schemas/config.schema.yml

Lines changed: 68 additions & 3 deletions
@@ -6,9 +6,15 @@ properties:
   samplesheet:
     type: string
     description: Path to the sample sheet file
-  outdir:
-    type: string
-    description: Output directory for results
+  tool:
+    type: array
+    description: Annotation tool to use
+    items:
+      type: string
+      enum:
+        - prokka
+        - pgap
+        - bakta
   pgap:
     type: object
     properties:
@@ -34,7 +40,66 @@ properties:
       - bin
       - use_yaml_config
       - prepare_yaml_files
+  prokka:
+    type: object
+    properties:
+      center:
+        type: string
+        description: Center name for Prokka annotation (used in sequence IDs)
+      extra:
+        type: string
+        description: Extra command-line arguments for Prokka
+    required:
+      - center
+      - extra
+  bakta:
+    type: object
+    properties:
+      download_db:
+        type: string
+        description: Bakta database type, one of 'full', 'light', or 'none' if existing is used
+      existing_db:
+        type: string
+        description: Path to an existing Bakta database (optional)
+      extra:
+        type: string
+        description: Extra command-line arguments for Bakta
+    required:
+      - download_db
+      - existing_db
+      - extra
+  quast:
+    type: object
+    properties:
+      reference_fasta:
+        type: string
+        description: Path to the reference genome for QUAST
+      reference_gff:
+        type: string
+        description: Path to the reference annotation for QUAST
+      extra:
+        type: string
+        description: Extra command-line arguments for QUAST
+  panaroo:
+    type: object
+    properties:
+      skip:
+        type: boolean
+        description: Whether to skip Panaroo analysis
+      remove_source:
+        type: string
+        description: Source types to remove in Panaroo (regex supported)
+      remove_feature:
+        type: string
+        description: Feature types to remove in Panaroo (regex supported)
+      extra:
+        type: string
+        description: Extra command-line arguments for Panaroo

 required:
   - samplesheet
+  - tool
   - pgap
+  - prokka
+  - bakta
+  - quast
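
The schema above makes `tool` an array restricted to three values and lists six required top-level keys (`panaroo` is not among them). The Python sketch below hand-checks just those two constraints to illustrate what the schema enforces. It is not the workflow's validation code, and the function `check_config` is hypothetical. Snakemake workflows commonly validate against such a schema with `snakemake.utils.validate`, but whether this workflow does so is not visible in this diff.

```python
# Values copied from the schema's `enum` and top-level `required` lists.
ALLOWED_TOOLS = {"prokka", "pgap", "bakta"}
REQUIRED_KEYS = ["samplesheet", "tool", "pgap", "prokka", "bakta", "quast"]


def check_config(config):
    """Return a list of schema violations for a parsed config mapping."""
    errors = [
        f"missing required key: {key}"
        for key in REQUIRED_KEYS
        if key not in config
    ]
    tools = config.get("tool")
    if tools is not None:
        if not isinstance(tools, list):
            errors.append("tool must be an array of strings")
        else:
            for tool in tools:
                if tool not in ALLOWED_TOOLS:
                    errors.append(f"unknown annotation tool: {tool!r}")
    return errors


# A config shaped like config/config.yml in this commit (sections abbreviated).
config = {
    "samplesheet": "config/samples.csv",
    "tool": ["prokka"],
    "pgap": {},
    "prokka": {},
    "bakta": {},
    "quast": {},
}
print(check_config(config))
```

Writing `tool: "prokka"` as a plain string instead of a list, or naming a tool outside the enum, would be reported by this check.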

resources/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+.*
