diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
Lines changed: 3 additions & 3 deletions
diff --git a/.test/config/config.yml b/.test/config/config.yml
Lines changed: 21 additions & 1 deletion
diff --git a/.test/config/samples.csv b/.test/config/samples.csv
Lines changed: 1 addition & 1 deletion
diff --git a/README.md b/README.md
Lines changed: 57 additions & 5 deletions
diff --git a/config/README.md b/config/README.md
Lines changed: 48 additions & 25 deletions
diff --git a/config/config.yml b/config/config.yml
Lines changed: 21 additions & 1 deletion
diff --git a/config/schemas/config.schema.yml b/config/schemas/config.schema.yml
Lines changed: 68 additions & 3 deletions
diff --git a/resources/.gitignore b/resources/.gitignore
Lines changed: 1 addition & 0 deletions
@@ -2,9 +2,9 @@ name: CI
 
 on:
   push:
-    branches: [main, dev]
+    branches: [main]
   pull_request:
-    branches: [main, dev]
+    branches: [main]
 
 jobs:
   Formatting:
@@ -47,7 +47,7 @@ jobs:
         with:
           directory: .test
           snakefile: workflow/Snakefile
-          args: "--sdm conda --show-failed-logs --cores 1 --conda-cleanup-pkgs cache -n"
+          args: "--sdm conda --show-failed-logs --cores 3 --conda-cleanup-pkgs cache"
 
       - name: Test report
         uses: snakemake/snakemake-github-action@v2.0.0
 
@@ -1,9 +1,29 @@
 samplesheet: "config/samples.csv"
-outdir: "results"
+tool: ["prokka"]
 
 pgap:
   bin: "path/to/pgap.py"
   use_yaml_config: True
   prepare_yaml_files:
     generic: "config/generic.yaml"
     submol: "config/submol.yaml"
+
+prokka:
+  center: ""
+  extra: "--addgenes"
+
+bakta:
+  download_db: "light"
+  existing_db: ""
+  extra: "--keep-contig-headers --compliant"
+
+quast:
+  reference_fasta: ""
+  reference_gff: ""
+  extra: ""
+
+panaroo:
+  skip: False
+  remove_source: "cmsearch"
+  remove_feature: "tRNA|rRNA|ncRNA|exon|sequence_feature"
+  extra: "--clean-mode strict --remove-invalid-genes"
@@ -1,2 +1,2 @@
 sample,species,strain,id_prefix,file
-EC2224,"Streptococcus pyogenes",SF370,SPY,"data/assembly.fasta"
+EC2224,"Streptococcus pyogenes",SF370,SPY,"data/assembly.fasta"
@@ -4,23 +4,39 @@
 [![GitHub actions status](https://github.com/MPUSP/snakemake-assembly-postprocessing/actions/workflows/main.yml/badge.svg)](https://github.com/MPUSP/snakemake-assembly-postprocessing/actions/workflows/main.yml)
 [![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/)
 [![run with apptainer](https://img.shields.io/badge/run%20with-apptainer-1D355C.svg?labelColor=000000)](https://apptainer.org/)
+[![workflow catalog](https://img.shields.io/badge/Snakemake%20workflow%20catalog-darkgreen)](https://snakemake.github.io/snakemake-workflow-catalog/docs/workflows/MPUSP/snakemake-assembly-postprocessing)
 
 A Snakemake workflow for the post-processing of microbial genome assemblies.
 
+- [snakemake-assembly-postprocessing](#snakemake-assembly-postprocessing)
+  - [Usage](#usage)
+  - [Workflow overview](#workflow-overview)
+  - [Installation](#installation)
+  - [Deployment options](#deployment-options)
+  - [Authors](#authors)
+  - [References](#references)
+
 ## Usage
 
 The usage of this workflow is described in the [Snakemake Workflow Catalog](https://snakemake.github.io/snakemake-workflow-catalog/docs/workflows/MPUSP/snakemake-assembly-postprocessing).
 
+Detailed information about input data and workflow configuration can also be found in the [`config/README.md`](config/README.md).
+
 If you use this workflow in a paper, please give credit to the authors by citing the URL of this repository.
 
-## Workflow overview
+_Workflow overview:_
 
-1. Parse `samples.csv` table containing the samples's meta data (`python`)
-2. Annotate assemblies using NCBI's Prokaryotic Genome Annotation Pipeline ([PGAP](https://github.com/ncbi/pgap))
+<img src="resources/images/dag.svg" align="center" />
 
-## Requirements
+## Workflow overview
 
-- [PGAP](https://github.com/ncbi/pgap)
+1. Parse the `samples.csv` table containing the samples' metadata (`python`)
+2. Annotate assemblies using one of the following tools:
+   1. NCBI's Prokaryotic Genome Annotation Pipeline ([PGAP](https://github.com/ncbi/pgap)). Note: PGAP needs to be installed manually
+   2. [prokka](https://github.com/tseemann/prokka), a fast and lightweight prokaryotic annotation tool
+   3. [bakta](https://github.com/oschwengers/bakta), a fast, alignment-free annotation tool. Note: Bakta will automatically download its companion database from Zenodo (light: 1.5 GB, full: 40 GB)
+3. Create a QC report for the assemblies using [QUAST](https://github.com/ablab/quast)
+4. Run a pangenome analysis (orthologs/homologs) using [Panaroo](https://gthlab.au/panaroo/)
 
 ## Installation
 
@@ -46,9 +62,37 @@ conda activate snakemake-assembly-postprocessing
 
 **Step 4: Install PGAP**
 
+- If you want to use [PGAP](https://github.com/ncbi/pgap) for annotation, it needs to be installed separately.
 - PGAP can be downloaded from https://github.com/ncbi/pgap. Please follow the installation instructions there.
 - Define the path to the `pgap.py` script (located in the `scripts` folder) in the `config` file (recommended: `./resources`)
 
+## Deployment options
+
+To run the workflow from the command line, change to the working directory.
+
+```bash
+cd snakemake-assembly-postprocessing
+```
+
+Adjust options in the default config file `config/config.yml`.
+Before running the complete workflow, you can perform a dry run using:
+
+```bash
+snakemake --cores 1 --dry-run
+```
+
+To run the workflow with test files using **conda**:
+
+```bash
+snakemake --cores 2 --sdm conda --directory .test
+```
+
+To run the workflow with test files using **apptainer**:
+
+```bash
+snakemake --cores 2 --sdm conda apptainer --directory .test
+```
+
 ## Authors
 
 - Dr. Rina Ahmed-Begrich
@@ -61,6 +105,14 @@ conda activate snakemake-assembly-postprocessing
 
 ## References
 
+> Seemann T. _Prokka: rapid prokaryotic genome annotation_. Bioinformatics. **2014** Jul 15;30(14):2068-9. PMID: 24642063. https://doi.org/10.1093/bioinformatics/btu153.
+
+> Schwengers O, Jelonek L, Dieckmann MA, Beyvers S, Blom J, Goesmann A. _Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification_. Microb Genom, 7(11):000685 **2021**. PMID: 34739369. https://doi.org/10.1099/mgen.0.000685.
+
 > Li W, O'Neill KR, Haft DH, DiCuccio M, Chetvernin V, Badretdin A, Coulouris G, Chitsaz F, Derbyshire MK, Durkin AS, Gonzales NR, Gwadz M, Lanczycki CJ, Song JS, Thanki N, Wang J, Yamashita RA, Yang M, Zheng C, Marchler-Bauer A, Thibaud-Nissen F. _RefSeq: Expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation._ Nucleic Acids Res, **2021** Jan 8;49(D1):D1020-D1028. https://doi.org/10.1093/nar/gkaa1105
 
+> Gurevich A, Saveliev V, Vyahhi N, Tesler G. _QUAST: quality assessment tool for genome assemblies_. Bioinformatics. 29(8):1072-5, **2013**. PMID: 23422339. https://doi.org/10.1093/bioinformatics/btt086.
+
+> Tonkin-Hill G, MacAlasdair N, Ruis C, Weimann A, Horesh G, Lees JA, Gladstone RA, Lo S, Beaudoin C, Floto RA, Frost SDW, Corander J, Bentley SD, Parkhill J. _Producing polished prokaryotic pangenomes with the Panaroo pipeline_. Genome Biol. 21(1):180, **2020**. PMID: 32698896. https://doi.org/10.1186/s13059-020-02090-4.
+
 > Köster, J., Mölder, F., Jablonski, K. P., Letcher, B., Hall, M. B., Tomkins-Tinch, C. H., Sochat, V., Forster, J., Lee, S., Twardziok, S. O., Kanitz, A., Wilm, A., Holtgrewe, M., Rahmann, S., & Nahnsen, S. _Sustainable data analysis with Snakemake_. F1000Research, 10:33, **2021**. https://doi.org/10.12688/f1000research.29032.2.
@@ -1,32 +1,55 @@
+## Workflow overview
+
+A Snakemake workflow for the post-processing of microbial genome assemblies.
+
+1. Parse the `samples.csv` table containing the samples' metadata (`python`)
+2. Annotate assemblies using one of the following tools:
+   1. NCBI's Prokaryotic Genome Annotation Pipeline ([PGAP](https://github.com/ncbi/pgap)). Note: PGAP needs to be installed manually
+   2. [prokka](https://github.com/tseemann/prokka), a fast and lightweight prokaryotic annotation tool
+   3. [bakta](https://github.com/oschwengers/bakta), a fast, alignment-free annotation tool. Note: Bakta will automatically download its companion database from Zenodo (light: 1.5 GB, full: 40 GB)
+3. Create a QC report for the assemblies using [QUAST](https://github.com/ablab/quast)
+4. Run a pangenome analysis (orthologs/homologs) using [Panaroo](https://gthlab.au/panaroo/)
+
 ## Running the workflow
 
 ### Input data
 
 This workflow requires `fasta` input data.
 The samplesheet table has the following layout:
 
-| sample | species | strain | id_prefix | file |
-| ----------- | ------------ | ------------- | ------------- | ------------- |
-| EC2224 | "Streptococcus pyogenes" | SF370 | Spy | assembly.fasta |
-
-### Execution
-
-To run the workflow from command line, change to the working directory and activate the conda environment.
-
-```bash
-cd snakemake-assembly-postprocessing
-conda activate snakemake-assembly-postprocessing
-```
-
-Adjust options in the default config file `config/config.yml`.
-Before running the entire workflow, perform a dry run using:
-
-```bash
-snakemake --cores 1 --sdm conda --directory .test --dry-run
-```
-
-To run the workflow with test files using **conda**:
-
-```bash
-snakemake --cores 1 --sdm conda --directory .test
-```
+| sample | species                  | strain | id_prefix | file           |
+| ------ | ------------------------ | ------ | --------- | -------------- |
+| EC2224 | "Streptococcus pyogenes" | SF370  | SPY       | assembly.fasta |
+| ...    | ...                      | ...    | ...       | ...            |
+
+**Note:** Pangenome analysis with `Panaroo` requires at least two samples.
+
+### Parameters
+
+This table lists all parameters that can be used to run the workflow.
+
+| Parameter | Type | Details | Default |
+|:---|:---|:---|:---|
+| **samplesheet** | string | Path to the sample sheet file in csv format | |
+| **tool** | array[string] | Annotation tool to use (one of `prokka`, `pgap`, `bakta`) | |
+| **pgap** | | PGAP configuration object |  |
+| bin | string | Path to the PGAP script | |
+| use_yaml_config | boolean | Whether to use YAML configuration for PGAP | `False` |
+| _prepare_yaml_files_ | | Paths to YAML templates for PGAP | |
+| generic | string | Path to the generic YAML configuration file | |
+| submol | string | Path to the submol YAML configuration file | |
+| **prokka** | | Prokka configuration object | |
+| center | string | Center name for Prokka annotation (used in sequence IDs) | |
+| extra | string | Extra command-line arguments for Prokka | `--addgenes` |
+| **bakta** | | Bakta configuration object | |
+| download_db | string | Bakta database type (`full`, `light`, or `none`) | `light` |
+| existing_db | string | Path to an existing Bakta database (optional). Needs to be combined with `download_db='none'` | |
+| extra | string | Extra command-line arguments for Bakta | `--keep-contig-headers --compliant` |
+| **quast** | | QUAST configuration object | |
+| reference_fasta | string | Path to the reference genome for QUAST | |
+| reference_gff | string | Path to the reference annotation for QUAST | |
+| extra | string | Extra command-line arguments for QUAST | |
+| **panaroo** | | Panaroo configuration object | |
+| skip | boolean | Whether to skip Panaroo analysis | `False` |
+| remove_source | string | Source types to remove in Panaroo (regex supported) | `cmsearch` |
+| remove_feature | string | Feature types to remove in Panaroo (regex supported) | `tRNA\|rRNA\|ncRNA\|exon\|sequence_feature` |
+| extra | string | Extra command-line arguments for Panaroo | `--clean-mode strict --remove-invalid-genes` |
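Reviewer sketch (not part of the diff): the `config/README.md` note that Panaroo requires at least two samples could be checked before a run. The helper below is hypothetical and uses only the Python standard library. The column names come from the samplesheet header in this PR; the function name `check_samplesheet` and the returned keys are assumptions made for illustration, not something the workflow provides.

```python
import csv
import io

# Columns taken from the samplesheet header shown in this PR.
REQUIRED_COLUMNS = ["sample", "species", "strain", "id_prefix", "file"]


def check_samplesheet(handle, min_samples_for_panaroo=2):
    """Hypothetical helper: count samples and report whether Panaroo could run."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"samplesheet is missing columns: {missing}")
    samples = [row["sample"] for row in reader]
    return {
        "n_samples": len(samples),
        "panaroo_possible": len(samples) >= min_samples_for_panaroo,
    }


# The single-sample test sheet from `.test/config/samples.csv`.
example = io.StringIO(
    "sample,species,strain,id_prefix,file\n"
    'EC2224,"Streptococcus pyogenes",SF370,SPY,"data/assembly.fasta"\n'
)
result = check_samplesheet(example)
print(result)
```

With the one-row test sheet, the check reports a single sample, so a configuration like `panaroo: skip: True` might be the safer choice for such input.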
@@ -1,9 +1,29 @@
 samplesheet: "config/samples.csv"
-outdir: "results"
+tool: ["prokka"]
 
 pgap:
   bin: "path/to/pgap.py"
   use_yaml_config: True
   prepare_yaml_files:
     generic: "config/generic.yaml"
     submol: "config/submol.yaml"
+
+prokka:
+  center: ""
+  extra: "--addgenes"
+
+bakta:
+  download_db: "light"
+  existing_db: ""
+  extra: "--keep-contig-headers --compliant"
+
+quast:
+  reference_fasta: ""
+  reference_gff: ""
+  extra: ""
+
+panaroo:
+  skip: False
+  remove_source: "cmsearch"
+  remove_feature: "tRNA|rRNA|ncRNA|exon|sequence_feature"
+  extra: "--clean-mode strict --remove-invalid-genes"
@@ -6,9 +6,15 @@ properties:
   samplesheet:
     type: string
     description: Path to the sample sheet file
-  outdir:
-    type: string
-    description: Output directory for results
+  tool:
+    type: array
+    description: Annotation tool to use
+    items:
+      type: string
+      enum:
+        - prokka
+        - pgap
+        - bakta
   pgap:
     type: object
     properties:
@@ -34,7 +40,66 @@ properties:
       - bin
       - use_yaml_config
       - prepare_yaml_files
+  prokka:
+    type: object
+    properties:
+      center:
+        type: string
+        description: Center name for Prokka annotation (used in sequence IDs)
+      extra:
+        type: string
+        description: Extra command-line arguments for Prokka
+    required:
+      - center
+      - extra
+  bakta:
+    type: object
+    properties:
+      download_db:
+        type: string
+        description: Bakta database type, one of 'full', 'light', or 'none' if existing is used
+      existing_db:
+        type: string
+        description: Path to an existing Bakta database (optional)
+      extra:
+        type: string
+        description: Extra command-line arguments for Bakta
+    required:
+      - download_db
+      - existing_db
+      - extra
+  quast:
+    type: object
+    properties:
+      reference_fasta:
+        type: string
+        description: Path to the reference genome for QUAST
+      reference_gff:
+        type: string
+        description: Path to the reference annotation for QUAST
+      extra:
+        type: string
+        description: Extra command-line arguments for QUAST
+  panaroo:
+    type: object
+    properties:
+      skip:
+        type: boolean
+        description: Whether to skip Panaroo analysis
+      remove_source:
+        type: string
+        description: Source types to remove in Panaroo (regex supported)
+      remove_feature:
+        type: string
+        description: Feature types to remove in Panaroo (regex supported)
+      extra:
+        type: string
+        description: Extra command-line arguments for Panaroo
 
 required:
   - samplesheet
+  - tool
   - pgap
+  - prokka
+  - bakta
+  - quast
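Reviewer sketch (not part of the diff): the constraints this schema adds for `tool` and the top-level `required` list can be illustrated in plain Python. This is a simplified stand-in written for illustration only; the workflow itself presumably validates against `config.schema.yml`, and the function name `validate_config` is an assumption.

```python
# Values copied from the schema in this PR.
ALLOWED_TOOLS = {"prokka", "pgap", "bakta"}
REQUIRED_KEYS = ["samplesheet", "tool", "pgap", "prokka", "bakta", "quast"]


def validate_config(config):
    """Hypothetical check: return a list of error messages (empty if none found)."""
    errors = []
    for key in REQUIRED_KEYS:
        if key not in config:
            errors.append(f"missing required key: {key}")
    tool = config.get("tool", [])
    if not isinstance(tool, list):
        # The schema declares `tool` as an array, so a bare string is rejected.
        errors.append("tool must be a list of strings")
    else:
        for name in tool:
            if not isinstance(name, str) or name not in ALLOWED_TOOLS:
                errors.append(f"unknown annotation tool: {name!r}")
    return errors


good = {
    "samplesheet": "config/samples.csv",
    "tool": ["prokka"],
    "pgap": {},
    "prokka": {},
    "bakta": {},
    "quast": {},
}
bad = dict(good, tool=["prodigal"])

print(validate_config(good))
print(validate_config(bad))
```

Note that `panaroo` is not in the schema's `required` list, so the sketch does not flag a config without it.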
@@ -0,0 +1 @@
+.*