From fb4614312f99a8fe143fae7bec1e0c3dece289c5 Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Mon, 8 Sep 2025 09:43:32 -0700
Subject: [PATCH 01/47] Added first draft of low-biomass ppl

---
 .../Low_Biomass/Nanopore/GL-DPPD-XXXX.md      | 1876 +++++++++++++++++
 1 file changed, 1876 insertions(+)
 create mode 100644 Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md

diff --git a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
new file mode 100644
index 000000000..bba4e3c74
--- /dev/null
+++ b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
@@ -0,0 +1,1876 @@
+# Bioinformatics pipeline for low-biomass long-read metagenomics data
+
+> **This document holds an overview and some example commands of how GeneLab processes low-biomass, long-read metagenomics datasets. Exact processing commands for specific datasets that have been released are provided with their processed data in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).**  
+
+---
+
+**Date:** XXX NN, 2025  
+**Revision:** -  
+**Document Number:** GL-DPPD-XXXX  
+
+**Submitted by:**  
+Olabiyi A. Obayomi (GeneLab Analysis Team)  
+
+**Approved by:**  
+Samrawit Gebre (OSDR Project Manager)  
+Jonathan Galazka (OSDR Project Scientist)  
+Amanda Saravia-Butler (GeneLab Science Lead)  
+Barbara Novak (GeneLab Data Processing Lead)  
+
+
+---
+
+# Table of contents
+
+- [**Software used**](#software-used)
+- [**General processing overview with example commands**](#general-processing-overview-with-example-commands)
+  - [**Pre-processing**](#pre-processing)
+    - [1. Basecalling](#1-basecalling)
+    - [2. Demultiplexing](#2-demultiplexing)
+      - [2a. Demultiplex]()
+      - [2b. Concatenate files for each sample]()
+    - [3. Raw Data QC](#3-raw-data-qc)
+      - [3a. Raw Data QC](#3a-raw-data-qc)
+      - [3b. Compile Raw Data QC](#3b-compile-raw-data-qc)
+    - [4. Quality filtering](#4-quality-filtering)
+      - [4a. Filter Raw Data](#4a-filter-raw-data)
+      - [4b. Filtered Data QC](#4b-filtered-data-qc)
+      - [4c. Compile Filtered Data QC](#4c-compile-filtered-data-qc)
+    - [5. Trimming](#5-trimming)
+      - [5a. Trim Filtered Data](#5a-trim-filtered-data)
+      - [5b. Trimmed Data QC](#5b-trimmed-data-qc)
+      - [5c. Compile Trimmed Data QC](#5c-compile-filtered-data-qc)
+    - [6. Assemble Contaminants](#6-assemble-contaminants)
+    - [7. Contaminant Removal](#7-remove-contaminants)
+      - [7a. Build Contaminant Index and Map Reads](#7a-build-contaminant-index-and-map-reads)
+      - [7b. Sort and Index Contaminant Alignments](#7b-sort-and-index-contaminant-alignments)
+      - [7c. Gather Contaminant Mapping Metrics](#7c-gather-contaminant-mapping-metrics)
+      - [7d. Generate Decontaminated Read Files](#7d-generate-decontaminated-read-files)
+      - [7e. Contaminant Removal QC](#7e-contaminant-removal-qc)
+      - [7f. Compile Contaminant Removal QC](#7f-compile-contaminant-removal-qc)
+    - [8. Host Removal](#8-host-removal)
+      - [8a. Remove Host Reads](#8a)
+      - [8b. Compile Host Removal QC]()
+  - [**Read-based processing**](#read-based-processing)
+    - [9. Taxonomic and functional profiling using Kaiju](#9-taxonomic-and-functional-profiling-using-kaiju)
+      - [9a. Taxonomic Classification](#9a-taxonomic-classification)
+      - [9b. Convert Kaiju output to Krona format](#9b-convert-kaiju-output-to-krona-format)
+      - [9c. Generate per sample Krona charts](#9c-generate-per-sample-krona-charts)
+      - [9d. Generate combined Krona chart](#9d-generate-combined-krona-chart)
+      - [9e. Compute per-sample taxon level summaries](#9e-compute-per-sample-taxon-level-summaries)
+      - [9f. Compile taxon level summaries](#9f-compile-kaiju-taxonomy-results)
+    - [10. Taxonomic and functional profiling using Kraken2](#10-taxonomic-and-functional-profiling-using-kraken2)
+      - [10a. Taxonomic Classification](#10a-taxonomic-classification)
+      - [10b. Combine Kraken2 reports](#10b-combine-kraken2-reports)
+      - [10c. Convert Kraken2 output to krona format](#10c-convert-kraken2-output-to-krona-format)
+      - [10d. Generate per sample Krona charts](#10d-generate-per-sample-krona-charts)
+      - [10e. Generate combined Krona chart](#10e-generate-combined-krona-chart)
+      - [10f. Compile Kraken2 Summary QC](#10f-compile-kraken2-summary-qc)
+  - [**Assembly-based processing**](#assembly-based-processing)
+    - [11. Sample assembly](#11-sample-assembly)
+    - [12. Polish assembly](#12-polish-assembly)
+    - [13. Renaming contigs and summarizing assemblies](#13-renaming-contigs-and-summarizing-assemblies)
+    - [14. Gene prediction](#14-gene-prediction)
+    - [15. Functional annotation](#15-functional-annotation)
+    - [16. Taxonomic classification](#16-taxonomic-classification)
+    - [17. Read-mapping](#17-read-mapping)
+    - [18. Getting coverage information and filtering based on detection](#18-getting-coverage-information-and-filtering-based-on-detection)
+    - [19. Combining gene-level coverage, taxonomy, and functional annotations into one table for each sample](#19-combining-gene-level-coverage-taxonomy-and-functional-annotations-into-one-table-for-each-sample)
+    - [20. Combining contig-level coverage and taxonomy into one table for each sample](#20-combining-contig-level-coverage-and-taxonomy-into-one-table-for-each-sample)
+    - [21. Generating normalized, gene- and contig-level coverage summary tables of KO-annotations and taxonomy across samples](#21-generating-normalized-gene--and-contig-level-coverage-summary-tables-of-ko-annotations-and-taxonomy-across-samples)
+    - [22. **M**etagenome-**A**ssembled **G**enome (MAG) recovery](#22-metagenome-assembled-genome-mag-recovery)
+    - [23. Generating MAG-level functional summary overview](#23-generating-mag-level-functional-summary-overview)
+
+---
+
+# Software used
+
+|Program|Version|Relevant Links|
+|:------|:-----:|------:|
+|bbduk| 38.86 |[https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/)|
+|CAT| 5.2.3 |[https://github.com/dutilh/CAT#cat-and-bat](https://github.com/dutilh/CAT#cat-and-bat)|
+|CheckM| 1.1.3 |[https://github.com/Ecogenomics/CheckM](https://github.com/Ecogenomics/CheckM)|
+|Decontam| | |
+|Dorado| 1.1.1| [https://github.com/nanoporetech/dorado](https://github.com/nanoporetech/dorado)|
+|Flye| 2.9.5 | [https://github.com/mikolmogorov/Flye](https://github.com/mikolmogorov/Flye) |
+|GTDB-Tk| 2.4.0 |[https://github.com/Ecogenomics/GTDBTk](https://github.com/Ecogenomics/GTDBTk)|
+|Kaiju| 1.10.1 | [https://bioinformatics-centre.github.io/kaiju/](https://bioinformatics-centre.github.io/kaiju/) |
+|KOFamScan| 1.3.0 |[https://github.com/takaram/kofam_scan#kofamscan](https://github.com/takaram/kofam_scan#kofamscan)|
+|Kraken2| 2.1.6 | [https://github.com/DerrickWood/kraken2](https://github.com/DerrickWood/kraken2) |
+|KrakenTools | 1.2 | [https://ccb.jhu.edu/software/krakentools/](https://ccb.jhu.edu/software/krakentools/) |
+|Krona| 2.8.1 | [https://github.com/marbl/Krona/wiki](https://github.com/marbl/Krona/wiki)|
+|MetaBAT| 2.15 |[https://bitbucket.org/berkeleylab/metabat/src/master/](https://bitbucket.org/berkeleylab/metabat/src/master/)|
+|Minimap2| 2.2.8 | [https://github.com/lh3/minimap2](https://github.com/lh3/minimap2) |
+|MultiQC| 1.27.1 |[https://multiqc.info/](https://multiqc.info/)|
+|Medaka| 2.0.1 | [https://github.com/nanoporetech/medaka](https://github.com/nanoporetech/medaka) |
+|MEGAHIT| 1.2.9 |[https://github.com/voutcn/megahit#megahit](https://github.com/voutcn/megahit#megahit)|
+|NanoPlot| 1.44.1 | [https://github.com/wdecoster/NanoPlot](https://github.com/wdecoster/NanoPlot)|
+|Porechop| 0.2.4 | [https://github.com/rrwick/Porechop](https://github.com/rrwick/Porechop) |
+|Prodigal| 2.6.3 |[https://github.com/hyattpd/Prodigal#prodigal](https://github.com/hyattpd/Prodigal#prodigal)|
+|samtools| 1.20 |[https://github.com/samtools/samtools#samtools](https://github.com/samtools/samtools#samtools)|
+|KEGG-Decoder| 1.2.2 |[https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder](https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder)|
+|HUMAnN| 3.9 |[https://github.com/biobakery/humann](https://github.com/biobakery/humann)|
+|MetaPhlAn| 4.1.0 |[https://github.com/biobakery/MetaPhlAn](https://github.com/biobakery/MetaPhlAn)|
+
+---
+
+# General processing overview with example commands
+
+> Exact processing commands and output files listed in **bold** below are included with each Low Biomass Metagenomics Seq processed dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).  
+
+## Pre-processing
+
+### 1. Basecalling
+
+```bash
+model="fast@4.3.0"
+input_directory=/path/to/raw/data
+
+dorado basecaller ${model} ${input_directory} \
+  --no-trim \
+  --device auto \
+  --recursive \
+  --kit-name ${kit_name} \
+  --min-qscore 7 > basecalled.bam
+```
+
+**Parameter Definitions:**
+
+- `--no-trim` - Skips trimming of barcodes, adapters, and primers
+- `--device` - specifies CPU or GPU device; specifying 'auto' chooses either 'cpu' or 'gpu' depending on detected presence of a GPU device
+- `--recursive` - enables recursive scanning through input directory to load FAST5 and/or POD5 files
+- `--kit-name` - enables barcoding with the provided kit name; see [dorado documentation](https://software-docs.nanoporetech.com/dorado/1.1.1/barcoding/barcoding/) for a full list of accepted kit names
+- `--min-qscore` - discards reads with a mean Q-score below the specified threshold
+- `model` - positional argument specifying the basecalling model to use or a path to the model directory
+- `input_directory` - positional argument specifying the location of the raw data in POD5 or FAST5 format
+
+**Input Data:**
+
+- *pod5 and/or *fast5 (raw nanopore data)
+
+**Output Data:**
+
+- **basecalled.bam** (raw data in BAM format)
+
+### 2. Demultiplexing
+
+```bash
+dorado demux \
+  --output-dir /path/to/fastq/output \
+  --emit-fastq \
+  --emit-summary \
+  --kit-name ${kit_name} \
+  basecalled.bam
+```
+
+**Parameter Definitions:**
+
+- `--output-dir` - specifies the output folder that is the root of the nested output structure
+- `--emit-fastq` - specifies that output is fastq format
+- `--emit-summary` - creates a summary listing each read and its classified barcode.
+- `--kit-name` - enables barcoding with the provided kit name; see [dorado documentation](https://software-docs.nanoporetech.com/dorado/1.1.1/barcoding/barcoding/) for a full list of accepted kit names
+
+**Input Data:**
+
+- basecalled.bam (raw nanopore data in BAM format, output from [step 1](#1-basecalling))
+
+**Output Data:**
+
+- \*_barcode\*.fastq (demultiplexed reads in fastq format)
+- \*_unclassified.fastq (unclassified reads in fastq format)
+- barcoding_summary.txt (barcode summary file listing each read, the file it was assigned to, and its classified barcode)
+
+### 3. Raw Data QC
+
+#### 3a. Raw Data QC
+
+```bash 
+NanoPlot --only-report --prefix sample_ -o /path/to/raw_nanoplot_output -t NumberOfThreads --fastq sample_raw.fastq.gz
+```
+
+**Parameter Definitions:**
+
+- `-o` – specifies the output directory to store results
+- `--only-report` - output only the report files
+- `--prefix` - adds a sample specific prefix to the name of each output file
+- `-t` - number of processing threads
+- `--fastq` – specifies the input reads in fastq format
+
+**Input data:**
+
+- *raw.fastq.gz (raw reads, output from [Step 2](#2-demultiplexing))
+
+**Output data:**
+
+- **sample_NanoPlot-report.html** (NanoPlot html summary)
+- sample_NanoPlot_\<date\>_\<time\>.log (NanoPlot log file)
+- sample_NanoStats.txt (text file containing basic statistics)
+
+#### 3b. Compile Raw Data QC
+
+```bash 
+multiqc -o raw_multiqc_report -n raw_multiqc --interactive /path/to/raw_nanoplot_output/
+```
+
+**Parameter Definitions:**
+
+- `-o` – the output directory to store results
+- `-n` – the filename prefix of results
+- `--interactive` - force multiqc to always create interactive javascript plots
+- `/path/to/raw_nanoplot_output/` – the directory holding the output data from the NanoPlot run, provided as a positional argument
+
+**Input data:**
+
+- /path/to/raw_nanoplot_output/*NanoStats.txt (NanoPlot output data, from [Step 3a](#3a-raw-data-qc))
+
+**Output data:**
+
+- **raw_multiqc.html** (multiqc output html summary)
+- **raw_multiqc_data.zip** (zip archive containing multiqc output data)
+
+<br>  
+
+---
+
+### 4. Quality filtering
+
+#### 4a. Filter Raw Data
+
+```bash
+filtlong --min_length 200 --min_mean_q 8 /path/to/raw_fastq/sample.fastq > sample_filtered.fastq
+```
+
+**Parameter Definitions:**
+
+- `--min_length` – discards reads shorter than the specified length in bases (200 in this example)
+- `--min_mean_q` – discards reads with a mean quality below the specified threshold (8 in this example)
+- `/path/to/raw_fastq/sample.fastq` – the input reads, provided as a positional argument
+- `> sample_filtered.fastq` - redirects the filtered reads from stdout to a file
+
+**Input data:**
+
+- *_raw.fastq (raw reads, output from [Step 2](#2-demultiplexing))
+
+**Output data:**
+
+- *_filtered.fastq (quality filtered reads)
+
+
+#### 4b. Filtered Data QC
+
+```bash
+NanoPlot --only-report --prefix sample_ -o /path/to/filtered_nanoplot_output -t NumberOfThreads --fastq sample_filtered.fastq
+```
+
+**Parameter Definitions:**
+
+- `-o` – specifies the output directory to store results
+- `--only-report` - output only the report files
+- `--prefix` - adds a sample specific prefix to the name of each output file
+- `-t` - number of processing threads
+- `--fastq` – specifies the input reads in fastq format
+
+**Input data:**
+
+- *filtered.fastq (filtered reads, output from [Step 4a](#4a-filter-raw-data))
+
+**Output data:**
+
+- **sample_NanoPlot-report.html** (NanoPlot html summary)
+- sample_NanoPlot_\<date\>_\<time\>.log (NanoPlot log file)
+- sample_NanoStats.txt (text file containing basic statistics)
+
+#### 4c. Compile Filtered Data QC
+
+```bash
+multiqc -o filtered_multiqc_report -n filtered_multiqc --interactive /path/to/filtered_nanoplot_output/
+```
+
+**Parameter Definitions:**
+
+- `-o` – the output directory to store results
+- `-n` – the filename prefix of results
+- `--interactive` - force multiqc to always create interactive javascript plots
+- `/path/to/filtered_nanoplot_output/` – the directory holding the output data from the NanoPlot run, provided as a positional argument
+
+**Input data:**
+
+- /path/to/filtered_nanoplot_output/*NanoStats.txt (NanoPlot output data, from [Step 4b](#4b-filtered-data-qc))
+
+**Output data:**
+
+- **filtered_multiqc.html** (multiqc output html summary)
+- **filtered_multiqc_data.zip** (zip archive containing multiqc output data)
+
+### 5. Trimming
+
+
+#### 5a. Trim Filtered Data
+
+```bash
+porechop --input sample_filtered.fastq --threads NumberOfThreads \
+         --discard_middle --output sample_trimmed.fastq > sample_porechop.log
+```
+
+**Parameter Definitions:**
+
+- `--input` – the input read file in fastq format
+- `--threads` - number of processing threads
+- `--discard_middle` - discards reads with adapters found in the middle of the read instead of splitting them
+- `--output` - output filename
+- `> sample_porechop.log` - capture stdout in a log file
+
+**Input Data:**
+
+- sample_filtered.fastq (filtered reads output from [Step 4a](#4a-filter-raw-data))
+
+**Output Data:**
+
+- **sample_trimmed.fastq** (filtered and trimmed reads)
+
+#### 5b. Trimmed Data QC
+
+```bash
+NanoPlot --only-report --prefix sample_ -o /path/to/trimmed_nanoplot_output -t NumberOfThreads --fastq sample_trimmed.fastq
+```
+
+**Parameter Definitions:**
+
+- `-o` – specifies the output directory to store results
+- `--only-report` - output only the report files
+- `--prefix` - adds a sample specific prefix to the name of each output file
+- `-t` - number of processing threads
+- `--fastq` – specifies the input reads in fastq format
+
+**Input data:**
+
+- *trimmed.fastq (trimmed reads, output from [Step 5a](#5a-trim-filtered-data))
+
+**Output data:**
+
+- **sample_NanoPlot-report.html** (NanoPlot html summary)
+- sample_NanoPlot_\<date\>_\<time\>.log (NanoPlot log file)
+- sample_NanoStats.txt (text file containing basic statistics)
+
+#### 5c. Compile Filtered Data QC
+
+```bash
+multiqc -o trimmed_multiqc_report -n trimmed_multiqc --interactive /path/to/trimmed_nanoplot_output/
+```
+
+**Parameter Definitions:**
+
+- `-o` – the output directory to store results
+- `-n` – the filename prefix of results
+- `--interactive` - force multiqc to always create interactive javascript plots
+- `/path/to/trimmed_nanoplot_output/` – the directory holding the output data from the NanoPlot run, provided as a positional argument
+
+**Input data:**
+
+- /path/to/trimmed_nanoplot_output/*NanoStats.txt (NanoPlot output data, output from [Step 5b](#5b-trimmed-data-qc))
+
+**Output data:**
+
+- **trimmed_multiqc.html** (multiqc output html summary)
+- **trimmed_multiqc_data.zip** (zip archive containing multiqc output data)
+
+---
+
+### 6. Assemble Contaminants
+
+```bash
+flye --meta --threads NumberOfThreads --out-dir /path/to/contaminant_assembly --nano-raw /path/to/blank_samples/*_trimmed.fastq
+```
+
+**Parameter Definitions:**
+
+-	`--meta` – use metagenome/uneven coverage mode
+- `--threads` - Number of parallel processing threads
+- `--out-dir` - Output directory
+- `--nano-raw` - specifies that input is from Oxford Nanopore regular reads (pre-Guppy5, <20% error)
+
+**Input Data**
+
+- *_trimmed.fastq (filtered and trimmed reads from blank samples, output from [Step 5a](#5a-trim-filtered-data))
+
+**Output Data**
+
+- /path/to/contaminant_assembly/assembly.fasta (Assembly built from reads in blank samples in fasta format)
+
+
+### 7. Remove Contaminants
+
+#### 7a. Build Contaminant Index and Map Reads
+
+```bash
+# Build contaminant index
+minimap2 -t NumberOfThreads -a -x splice -d blanks.mmi /path/to/contaminant_assembly/assembly.fasta
+
+# Map reads to index
+minimap2 -t NumberOfThreads -a -x splice blanks.mmi /path/to/trimmed_reads/sample_trimmed.fastq  > sample.sam
+```
+
+**Parameter Definitions:**
+
+- `-t` - Number of parallel processing threads
+- `-a` – output in SAM format
+- `-x splice` - specifies preset for spliced alignment of long reads
+- `-d` - specifies the output file for the index
+
+**Input Data**
+
+- /path/to/contaminant_assembly/assembly.fasta (Contaminant assembly, output from [Step 6](#6-assemble-contaminants))
+- /path/to/trimmed_reads/sample_trimmed.fastq (Filtered and trimmed reads, output from [Step 5a](#5a-trim-filtered-data))
+
+**Output Data**
+
+- sample.sam (Reads aligned to contaminant assembly)
+
+#### 7b. Sort and Index Contaminant Alignments
+```bash
+# Sort Sam, convert to bam and create index
+samtools sort --threads NumberOfThreads -o sample_sorted.bam sample.sam > sample_sort.log 2>&1
+
+samtools index sample_sorted.bam sample_sorted.bam.bai
+```
+
+**Parameter Definitions:**
+
+**samtools sort**
+- `--threads` - Number of parallel processing threads
+- `-o` - specifies the output file for the sorted reads
+- `sample.sam` - positional argument specifying the input SAM file
+
+**samtools index**
+- `sample_sorted.bam` - positional argument specifying the input BAM file to be indexed
+- `sample_sorted.bam.bai` - positional argument specifying the name of the index file
+
+**Input Data:**
+
+- sample.sam (Reads aligned to contaminant assembly, output from [Step 7a](#7a-build-contaminant-index-and-map-reads))
+
+**Output Data:**
+
+- sample_sorted.bam (sorted mapping to contaminant assembly)
+- sample_sorted.bam.bai (index of sorted mapping to contaminant assembly)
+
+#### 7c. Gather Contaminant Mapping Metrics
+
+```bash
+
+samtools flagstat sample_sorted.bam > sample_flagstats.txt  2> sample_flagstats.log
+
+samtools stats --remove-dups sample_sorted.bam > sample_stats.txt   2> sample_stats.log
+samtools idxstats sample_sorted.bam  > sample_idxstats.txt 2> sample_idxstats.log
+```
+
+**Parameter Definitions:**
+
+- `flagstat` - positional argument specifying the program for counting the number of alignments for each SAM FLAG type
+- `stats` - positional argument specifying the program for producing comprehensive statistics from the alignment file
+- `idxstats` - positional argument specifying the program for producing contig alignment summary statistics
+- `--remove-dups` - excludes reads marked as duplicates from comprehensive statistics
+- `sample_sorted.bam` - positional argument specifying the input BAM file
+
+**Input Data:**
+
+- sample_sorted.bam (sorted mapping to contaminant assembly, output from [Step 7b](#7b-sort-and-index-contaminant-alignments))
+- sample_sorted.bam.bai (index of sorted mapping to contaminant assembly, output from [Step 7b](#7b-sort-and-index-contaminant-alignments))
+
+**Output Data:**
+
+- sample_flagstats.txt (SAM FLAG counts)
+- sample_stats.txt (comprehensive alignment statistics)
+- sample_idxstats.txt (contig alignment summary statistics)
+
+#### 7d. Generate Decontaminated Read Files
+```bash
+# Retain reads that do not match contaminants
+samtools fastq -t -f 4 sample_sorted.bam | gzip --to-stdout > sample_blank_removed.fastq.gz
+```
+
+**Parameter Definitions:**
+
+- `fastq` - positional argument specifying the program for generating fastq files from a SAM/BAM file
+- `-t` - copy RG, BC, and QT tags to the FASTQ header line
+- `-f 4` - only retain reads that have been marked with the SAM "segment unmapped" FLAG (4)
+- `sample_sorted.bam` - positional argument specifying the input BAM file
+- `| gzip --to-stdout` - sends output from `samtools fastq` to `gzip` to create compressed fastq.gz file
+- `> sample_blank_removed.fastq.gz` - specifies the name of the file used to store the fastq.gz output
+
+**Input Data:**
+
+- sample_sorted.bam (sorted mapping to contaminant assembly, output from [Step 7b](#7b-sort-and-index-contaminant-alignments))
+
+**Output Data:**
+
+- sample_blank_removed.fastq.gz (decontaminated reads in fastq format)
+
+#### 7e. Contaminant Removal QC
+
+```bash
+NanoPlot --only-report --prefix sample_ -o /path/to/noblank_nanoplot_output -t NumberOfThreads --fastq sample_blank_removed.fastq.gz
+```
+
+**Parameter Definitions:**
+
+- `-o` – specifies the output directory to store results
+- `--only-report` - output only the report files
+- `--prefix` - adds a sample specific prefix to the name of each output file
+- `-t` - number of processing threads
+- `--fastq` – specifies the input reads in fastq format
+
+**Input data:**
+
+- sample_blank_removed.fastq.gz (decontaminated reads, output from [Step 7d](#7d-generate-decontaminated-read-files))
+
+**Output data:**
+
+- **sample_NanoPlot-report.html** (NanoPlot html summary)
+- sample_NanoPlot_\<date\>_\<time\>.log (NanoPlot log file)
+- sample_NanoStats.txt (text file containing basic statistics)
+
+
+#### 7f. Compile Contaminant Removal QC
+
+```bash
+multiqc -o noblank_multiqc_report -n noblank_multiqc --interactive /path/to/noblank_nanoplot_output/
+```
+
+**Parameter Definitions:**
+
+- `-o` – the output directory to store results
+- `-n` – the filename prefix of results
+- `--interactive` - force multiqc to always create interactive javascript plots
+- `/path/to/noblank_nanoplot_output/` – the directory holding the output data from the NanoPlot run, provided as a positional argument
+
+**Input data:**
+
+- /path/to/noblank_nanoplot_output/*NanoStats.txt (NanoPlot output data, output from [Step 7e](#7e-contaminant-removal-qc))
+
+**Output data:**
+
+- **noblank_multiqc.html** (multiqc output html summary)
+- **noblank_multiqc_data.zip** (zip archive containing multiqc output data)
+
+---
+
+### 8. Host Removal
+
+```bash
+kraken2 --db kraken2_host_db --gzip-compressed --threads NumberOfThreads --use-names \
+        --output sample-kraken2-output.txt --report sample-kraken2-report.tsv \
+        --unclassified-out sample_host_removed.fastq sample_blank_removed.fastq.gz && \
+        gzip sample_host_removed.fastq
+```
+
+**Parameter Definitions:**
+
+- `--db` - specifies the directory holding the kraken2 host database files
+- `--gzip-compressed` - specifies the input fastq files are gzip-compressed
+- `--threads` - specifies the number of threads to use
+- `--use-names` - specifies adding taxa names in addition to taxids
+- `--output` - specifies the name of the kraken2 read-based output file (one line per read)
+- `--report` - specifies the name of the kraken2 report output file (one line per taxa, with number of reads assigned to it)
+- `--unclassified-out` - name of output file of reads that were not classified 
+- `sample_blank_removed.fastq.gz` - positional argument specifying the input read file
+
+**Input data:**
+
+- sample_blank_removed.fastq.gz (gzipped reads fastq file, output from [Step 7d](#7d-generate-decontaminated-read-files))
+
+**Output data:**
+
+- sample-kraken2-output.txt (kraken2 read-based output file (one line per read))
+- sample-kraken2-report.tsv (kraken2 report output file (one line per taxa, with number of reads assigned to it))
+- **sample_host_removed.fastq.gz** (host-read removed, gzipped reads fastq file)
+
+---
+
+### 9. Taxonomic and Functional Profiling using Kaiju
+
+#### 9a. Taxonomic Classification
+```bash
+kaiju -f kaiju_db.fmi -t nodes.dmp \
+    -z NumberOfThreads \
+    -E 1e-05 \
+    -i /path/to/decontaminated_reads/sample_host_removed.fastq.gz \
+    -o sample_kaiju.out
+```
+
+**Parameter Definitions:**
+
+- `-f` - specifies path to the Kaiju database (.fmi) file
+- `-t` - specifies path to the Kaiju nodes.dmp file
+- `-z` - specifies the number of threads to use
+- `-E` - specifies the minimum E-value in Greedy mode (default: 0.01)
+- `-i` - specifies path to the input file
+- `-o` - specifies the name of output file
+
+**Input data:**
+
+- sample_host_removed.fastq.gz (gzipped decontaminated reads fastq file, output from [Step 8](#8-host-removal))
+
+**Output data:**
+
+- sample_kaiju.out (kaiju output file)
+
+#### 9b. Convert Kaiju Output to Krona Format
+```bash
+kaiju2krona -u -n names.dmp -t nodes.dmp \
+    -i sample_kaiju.out \
+    -o sample.krona
+```
+
+**Parameter Definitions:**
+
+- `-u` - include count for unclassified reads in output
+- `-n` - specifies path to the Kaiju names.dmp file
+- `-t` - specifies path to the Kaiju nodes.dmp file
+- `-i` - specifies path to the Kaiju output file (output from [Step 9a](#9a-taxonomic-classification))
+- `-o` - specifies the name of krona formatted kaiju output file
+
+**Input data:**
+
+- sample_kaiju.out (kaiju output file, output from [Step 9a](#9a-taxonomic-classification))
+
+**Output data:**
+
+- sample.krona (krona formatted kaiju output)
+
+#### 9c. Generate per sample Krona charts
+
+```bash
+ktImportText -o sample_krona.html sample.krona
+```
+
+**Parameter Definitions:**
+
+- `-o` - specifies the name of the krona output html file
+- `sample.krona` - positional argument specifying the krona text file for each sample
+
+**Input Data:**
+
+- sample.krona (krona formatted kaiju output from [Step 9b](#9b-convert-kaiju-output-to-krona-format))
+
+**Output Data:**
+
+- **sample_krona.html** (per-sample Krona charts in html format)
+
+#### 9d. Generate combined Krona chart
+
+```bash
+ktImportText -o kaiju_report.html *.krona
+```
+
+**Parameter Definitions:**
+
+- `-o` - specifies the name of the krona output html file
+- `*.krona` - positional argument specifying krona formatted text files for all samples
+
+**Input Data:**
+
+- *.krona (krona formatted kaiju output files from [Step 9b](#9b-convert-kaiju-output-to-krona-format))
+
+**Output Data:**
+
+- **kaiju_report.html** (combined Krona chart for all samples in html format)
+
+#### 9e. Compute per-sample taxon level summaries
+
+```bash
+# Get taxon level information for each sample
+for TAXON_LEVEL in phylum class order family genus species; do
+  kaiju2table -t nodes.dmp -n names.dmp -p -r ${TAXON_LEVEL} \
+              -o sample_kaiju_summary_${TAXON_LEVEL}.tsv sample_kaiju.out
+done
+```
+
+**Parameter Definitions:**
+
+- `-n` - specifies path to the Kaiju names.dmp file
+- `-t` - specifies path to the Kaiju nodes.dmp file
+- `-p` - prints the full taxon path instead of just the taxon name
+- `-r` - specifies taxonomic rank, must be one of: phylum, class, order, family, genus, species
+- `-o` - specifies the name of the output summary table file
+- `sample_kaiju.out` - positional argument specifying the path to the Kaiju output file (output from [Step 9a](#9a-taxonomic-classification))
+
+**Input Data:**
+
+- sample_kaiju.out (kaiju output file, output from [Step 9a](#9a-taxonomic-classification))
+
+**Output Data:**
+
+- **sample_kaiju_summary_phylum.tsv** (Compiled kaiju outputs at the phylum taxon level)
+- **sample_kaiju_summary_class.tsv** (Compiled kaiju outputs at the class taxon level)
+- **sample_kaiju_summary_order.tsv** (Compiled kaiju outputs at the order taxon level)
+- **sample_kaiju_summary_family.tsv** (Compiled kaiju outputs at the family taxon level)
+- **sample_kaiju_summary_genus.tsv** (Compiled kaiju outputs at the genus taxon level)
+- **sample_kaiju_summary_species.tsv** (Compiled kaiju outputs at the species taxon level)
+
+#### 9f. Compile Kaiju taxonomy results
+
+```bash
+for TAXON_LEVEL in phylum class order family genus species; do
+  kaiju2table -t nodes.dmp -n names.dmp -p -r ${TAXON_LEVEL} \
+              -o merged_kaiju_summary_${TAXON_LEVEL}.tsv *_kaiju.out
+done
+```
+
+**Parameter Definitions:**
+
+- `-n` - specifies path to the Kaiju names.dmp file
+- `-t` - specifies path to the Kaiju nodes.dmp file
+- `-p` - prints the full taxon path instead of just the taxon name
+- `-r` - specifies taxonomic rank, must be one of: phylum, class, order, family, genus, species
+- `-o` - specifies the name of the merged summary table output file
+- `*_kaiju.out` - positional arguments specifying the paths to the Kaiju output files for all samples (output from [Step 9a](#9a-taxonomic-classification))
+
+**Input Data:**
+
+- *kaiju.out (kaiju output files, output from [Step 9a](#9a-taxonomic-classification))
+
+**Output Data:**
+
+- **merged_kaiju_summary_phylum.tsv** (Compiled kaiju outputs at the phylum taxon level)
+- **merged_kaiju_summary_class.tsv** (Compiled kaiju outputs at the class taxon level)
+- **merged_kaiju_summary_order.tsv** (Compiled kaiju outputs at the order taxon level)
+- **merged_kaiju_summary_family.tsv** (Compiled kaiju outputs at the family taxon level)
+- **merged_kaiju_summary_genus.tsv** (Compiled kaiju outputs at the genus taxon level)
+- **merged_kaiju_summary_species.tsv** (Compiled kaiju outputs at the species taxon level)
+
+---
+
+### 10. Taxonomic and Functional Profiling using Kraken2
+
+#### 10a. Taxonomic Classification
+
+```bash
+kraken2 --db ${DATABASE} --gzip-compressed --threads NumberOfThreads --use-names \
+        --output sample-kraken2-output.txt --report sample-kraken2-report.tsv \
+        /path/to/decontaminated_reads/sample_host_removed.fastq.gz
+```
+
+**Parameter Definitions:**
+
+- `--db` - specifies the directory holding the kraken2 database files
+- `--gzip-compressed` - specifies the input fastq files are gzip-compressed
+- `--threads` - specifies the number of threads to use
+- `--use-names` - specifies adding taxa names in addition to taxids
+- `--output` - specifies the name of the kraken2 read-based output file (one line per read)
+- `--report` - specifies the name of the kraken2 report output file (one line per taxa, with number of reads assigned to it)
+- `sample_host_removed.fastq.gz` - positional argument specifying the input read file
+
+**Input data:**
+
+- sample_host_removed.fastq.gz (gzipped reads fastq file, output from [Step 7d](#7d-generate-decontaminated-read-files))
+
+**Output data:**
+
+- sample-kraken2-output.txt (kraken2 read-based output file (one line per read))
+- sample-kraken2-report.tsv (kraken2 report output file (one line per taxon, with number of reads assigned to it))
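+
+As a quick sanity check on a report, the standard six-column Kraken2 report layout (percent of reads, clade reads, directly assigned reads, rank code, taxid, indented name) can be filtered with `awk`. The sketch below is illustrative only and is not part of the GeneLab workflow; the function name `top_taxa_at_rank` and the mock report are hypothetical.
+
+```bash
+# Hypothetical helper: list taxa of one rank code (e.g. G = genus, S = species)
+# from a Kraken2 report, sorted by clade-level read count (column 2).
+top_taxa_at_rank() {
+    local report="$1" rank="$2"
+    awk -F '\t' -v rank="$rank" '
+        $4 == rank {
+            name = $6
+            sub(/^ +/, "", name)      # strip the indentation Kraken2 adds to names
+            print $2 "\t" name
+        }' "$report" | sort -t "$(printf '\t')" -k1,1nr
+}
+
+# Mock three-line report standing in for sample-kraken2-report.tsv
+printf '60.00\t60\t0\tG\t1279\t    Staphylococcus\n' >  mock-kraken2-report.tsv
+printf '10.00\t10\t10\tS\t1282\t      Staphylococcus epidermidis\n' >> mock-kraken2-report.tsv
+printf '30.00\t30\t0\tG\t1350\t    Enterococcus\n' >> mock-kraken2-report.tsv
+
+top_taxa_at_rank mock-kraken2-report.tsv G
+```
+
+Sorting on the clade-level count (column 2) rather than the directly assigned count (column 3) keeps reads assigned to child taxa included in each total.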
+
+#### 10b. Combine Kraken2 Reports
+
+```bash
+combine_kreports.py --output merged-kraken-table.tsv \
+                    --report-files sample-1-kraken2-report.tsv sample-2-kraken2-report.tsv sample-3-kraken2-report.tsv \
+                    --sample-names sample-1 sample-2 sample-3
+```
+
+**Parameter Definition:**
+
+- `--output` - specifies the name of the combined kraken2 report output file
+- `--report-files` - a space-separated list of kraken2 report output files
+- `--sample-names` - a space-separated list of sample names to use as headers in the combined report (in the same order as the report files)
+
+**Input data:**
+
+- *kraken2-report.tsv (kraken reports, output from [Step 10a](#10a-taxonomic-classification))
+
+**Output data:**
+
+- **merged-kraken-table.tsv**  (merged Kraken2 output in tab-delimited format)
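+
+Because the sample names must be given in the same order as the report files, it can help to build both lists from a single sample array. This is a minimal sketch, assuming reports follow the `sample-kraken2-report.tsv` naming used above; the sample names are hypothetical, and the command is only printed so it can be inspected before running.
+
+```bash
+# Hypothetical sample list; in practice this might be read from a file
+samples=(sample-1 sample-2 sample-3)
+
+# Derive the report file names from the sample names so the order always matches
+reports=()
+for s in "${samples[@]}"; do
+    reports+=("${s}-kraken2-report.tsv")
+done
+
+# Print the command instead of running it
+echo combine_kreports.py --output merged-kraken-table.tsv \
+     --report-files "${reports[@]}" \
+     --sample-names "${samples[@]}"
+```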
+
+#### 10c. Convert Kraken2 output to Krona format
+
+```bash
+kreport2krona.py --report-file sample-kraken2-report.tsv  --output sample.krona
+```
+
+**Parameter Definition:**
+
+- `--output` - specifies the name of the krona output file
+- `--report-file` - specifies the name of the input kraken2 report file
+
+**Input data:**
+
+- sample-kraken2-report.tsv (kraken report, output from [Step 10a](#10a-taxonomic-classification))
+
+**Output data:**
+
+- sample.krona (krona formatted kraken2 output)
+
+#### 10d. Generate per sample Krona charts
+
+```bash
+ktImportText -o sample_krona.html sample.krona
+```
+
+**Parameter Definitions:**
+
+- `-o` - specifies the name of the krona output html file
+- `sample.krona` - positional argument specifying the krona text file for each sample
+
+**Input Data:**
+
+- sample.krona (krona formatted kraken2 output from [Step 10c](#10c-convert-kraken2-output-to-krona-format))
+
+**Output Data:**
+
+- **sample_krona.html** (per-sample Krona charts in html format)
+
+#### 10e. Generate combined Krona chart
+
+```bash
+ktImportText -o kraken_report.html *.krona
+```
+
+**Parameter Definitions:**
+
+- `-o` - specifies the name of the krona output html file
+- `*.krona` - positional argument specifying krona formatted text files for all samples
+
+**Input Data:**
+
+- *.krona (krona formatted kraken2 output from [Step 10c](#10c-convert-kraken2-output-to-krona-format))
+
+**Output Data:**
+
+- **kraken_report.html** (combined Krona chart for all samples in html format)
+
+#### 10f. Compile Kraken2 Summary QC
+
+```bash 
+multiqc -o kraken_multiqc_report -n kraken_multiqc --interactive /path/to/kraken2_output/
+```
+
+**Parameter Definitions:**
+
+- `-o` – the output directory to store results
+- `-n` – the filename prefix of results
+- `--interactive` – force multiqc to always create interactive javascript plots
+- `/path/to/kraken2_output/` – the directory holding the output data from the Kraken2 run, provided as a positional argument
+
+**Input data:**
+
+- /path/to/kraken2_output/*kraken2-report.tsv (Kraken2 output data, from [Step 10a](#10a-taxonomic-classification))
+
+**Output data:**
+
+- **kraken_multiqc.html** (multiqc output html summary)
+- **kraken_multiqc_data.zip** (zip archive containing multiqc output data)
+
+---
+
+## Assembly-based processing
+### 11. Sample assembly
+
+```bash
+flye --meta --threads NumberOfThreads --out-dir sample/ \
+     --nano-hq /path/to/decontaminated_raw_data/sample_host_removed.fastq.gz
+
+# rename output files	              
+mv sample/assembly.fasta sample_assembly.fasta
+mv sample/flye.log sample_flye.log
+```
+
+**Parameter Definitions:**
+
+-	`--meta` – use metagenome/uneven coverage mode
+- `--threads` - Number of parallel processing threads
+- `--out-dir` - Output directory
+- `--nano-hq` - specifies that input is from Oxford Nanopore high-quality reads (Guppy5+ SUP or Q20, <5% error)
+
+**Input Data:**
+
+- sample_host_removed.fastq.gz (decontaminated raw data in fastq format, output from [Step 8](#8-host-removal))
+
+**Output Data:**
+
+- sample_assembly.fasta (sample assembly)
+- sample_flye.log (log file)
+
+<br>
+
+---
+
+### 12. Polish assembly
+
+```bash
+medaka_consensus -t NumberOfThreads -i /path/to/decontaminated_raw_data/sample_host_removed.fastq.gz \
+  -d /path/to/assemblies/sample_assembly.fasta -o sample/
+  
+mv sample/consensus.fasta sample_polished.fasta
+```
+
+**Parameter Definition:**
+
+- `-t` - Number of parallel processing threads
+- `-i` - specifies path to input read files used in creating the assembly
+- `-d` - specifies path to the assembly fasta file
+- `-o` - specifies the output directory
+
+**Input Data:**
+
+- /path/to/decontaminated_raw_data/sample_host_removed.fastq.gz (decontaminated raw data in fastq format, output from [Step 8](#8-host-removal))
+- /path/to/assemblies/sample_assembly.fasta (sample assembly, output from [Step 11](#11-sample-assembly))
+
+**Output Data:**
+
+- sample_polished.fasta (polished sample assembly)
+
+---
+
+### 13.
+
+#### 13a. Renaming contig headers
+
+```bash
+bit-rename-fasta-headers -i sample-1_polished.fasta -w c_sample-1 -o sample-1-assembly.fasta
+```
+
+**Parameter Definitions:**  
+
+- `-i` – input fasta file
+
+- `-w` – wanted header prefix (a number will be appended for each contig), starts with a “c_” to ensure they won’t start with a number which can be problematic
+
+- `-o` – output fasta file
+
+
+**Input data:**
+
+- sample-1_polished.fasta (polished assembly file from [step 12](#12-polish-assembly))
+
+**Output files:**
+
+- **sample-1-assembly.fasta** (contig-renamed assembly file)
+
+
+#### 13b. Summarizing assemblies
+
+```bash
+bit-summarize-assembly -o assembly-summaries_GLmetagenomics.tsv *assembly.fasta
+```
+
+**Parameter Definitions:**  
+
+- `-o` – output summary table
+
+- `*assembly.fasta` – multiple input assemblies can be provided as positional arguments
+
+
+**Input data:**
+
+- *-assembly.fasta (contig-renamed assembly files from [step 13a](#13a-renaming-contig-headers))
+
+**Output files:**
+
+- **assembly-summaries_GLmetagenomics.tsv** (table of assembly summary statistics)
+
+<br>
+
+---
+
+### 6. Gene prediction
+```
+prodigal -a sample-1-genes.faa -d sample-1-genes.fasta -f gff -p meta -c -q \
+         -o sample-1-genes.gff -i sample-1-assembly.fasta
+```
+**Parameter Definitions:**
+
+- `-a` – specifies the output amino acid sequences file
+
+- `-d` – specifies the output nucleotide sequences file
+
+- `-f` – specifies the output format gene-calls file
+
+- `-p` – specifies which mode to run the gene-caller in 
+
+- `-c` – no incomplete genes reported 
+
+- `-q` – run in quiet mode (don’t output process on each contig) 
+
+- `-o` – specifies the name of the output gene-calls file 
+
+- `-i` – specifies the input assembly
+
+**Input data:**
+
+- sample-1-assembly.fasta (contig-renamed assembly file from [step 13a](#13a-renaming-contig-headers))
+
+**Output data:**
+
+- **sample-1-genes.faa** (gene-calls amino-acid fasta file)
+- **sample-1-genes.fasta** (gene-calls nucleotide fasta file)
+- **sample-1-genes.gff** (gene-calls in general feature format)
+
+<br>
+
+---
+
+### 7. Functional annotation
+> **Notes**  
+> The annotation process overwrites the same temporary directory by default, so when running multiple processes at a time, specify a dedicated temporary directory for each with the `--tmp-dir` argument as shown below.
+
+
+#### 7a. Downloading reference database of HMM models (only needs to be done once)
+
+```
+curl -LO ftp://ftp.genome.jp/pub/db/kofam/profiles.tar.gz
+curl -LO ftp://ftp.genome.jp/pub/db/kofam/ko_list.gz
+tar -xzvf profiles.tar.gz
+gunzip ko_list.gz 
+```
+
+#### 7b. Running KEGG annotation
+```
+exec_annotation -p profiles/ -k ko_list --cpu 15 -f detail-tsv -o sample-1-KO-tab.tmp \
+                --tmp-dir sample-1-tmp-KO --report-unannotated sample-1-genes.faa 
+```
+
+**Parameter Definitions:**
+- `-p` – specifies the directory holding the downloaded reference HMMs
+
+- `-k` – specifies the downloaded reference KO (KEGG Orthology) terms
+
+- `--cpu` – specifies the number of searches to run in parallel
+
+- `-f` – specifies the output format
+
+- `-o` – specifies the output file name
+
+- `--tmp-dir` – specifies the temporary directory to write to (needed if running more than one process concurrently, see Notes above)
+
+- `--report-unannotated` – specifies to generate an output for each entry
+
+- `sample-1-genes.faa` – the input file is specified as a positional argument 
+
+
+**Input data:**
+
+- sample-1-genes.faa (amino-acid fasta file, from [step 6](#6-gene-prediction))
+- profiles/ (reference directory holding the KO HMMs)
+- ko_list (reference list of KOs to scan for)
+
+**Output data:**
+
+- sample-1-KO-tab.tmp (table of KO annotations assigned to gene IDs)
+
+
+#### 7c. Filtering output to retain only those passing the KO-specific score and top hits
+```bash
+bit-filter-KOFamScan-results -i sample-1-KO-tab.tmp -o sample-1-annotations.tsv
+
+  # removing temporary files
+rm -rf sample-1-tmp-KO/ sample-1-KO-tab.tmp
+```
+
+**Parameter Definitions:**  
+
+- `-i` – specifies the input table
+
+- `-o` – specifies the output table
+
+
+**Input data:**
+
+- sample-1-KO-tab.tmp (table of KO annotations assigned to gene IDs from [step 7b](#7b-running-kegg-annotation))
+
+**Output data:**
+
+- sample-1-annotations.tsv (table of KO annotations assigned to gene IDs)
+
+<br>
+
+---
+
+### 8. Taxonomic classification
+
+#### 8a. Pulling and un-packing pre-built reference db (only needs to be done once)
+```
+wget tbb.bio.uu.nl/bastiaan/CAT_prepare/CAT_prepare_20200618.tar.gz
+tar -xvzf CAT_prepare_20200618.tar.gz
+```
+
+#### 8b. Running taxonomic classification
+```
+CAT contigs -c sample-1-assembly.fasta -d CAT_prepare_20200618/2020-06-18_database/ \
+            -t CAT_prepare_20200618/2020-06-18_taxonomy/ -p sample-1-genes.faa \
+            -o sample-1-tax-out.tmp -n NumberOfThreads -r 3 --top 4 --I_know_what_Im_doing --no-stars
+```
+
+**Parameter Definitions:**  
+
+- `-c` – specifies the input assembly fasta file
+
+- `-d` – specifies the CAT reference sequence database
+
+- `-t` – specifies the CAT reference taxonomy database
+
+- `-p` – specifies the input protein fasta file
+
+- `-o` – specifies the output prefix
+
+- `-n` – specifies the number of CPU cores to use
+
+- `-r` – specifies the number of top protein hits to consider in assigning tax
+
+- `--top` – specifies the number of protein alignments to store
+
+- `--I_know_what_Im_doing` – allows us to alter the `--top` parameter
+
+- `--no-stars` - suppress marking of suggestive taxonomic assignments
+
+
+**Input data:**
+
+- sample-1-assembly.fasta (assembly file from [step 13a](#13a-renaming-contig-headers))
+- sample-1-genes.faa (gene-calls amino-acid fasta file from [step 6](#6-gene-prediction))
+
+**Output data:**
+
+- sample-1-tax-out.tmp.ORF2LCA.txt (gene-calls taxonomy file)
+- sample-1-tax-out.tmp.contig2classification.txt (contig taxonomy file)
+
+#### 8c. Adding taxonomy info from taxids to genes
+```
+CAT add_names -i sample-1-tax-out.tmp.ORF2LCA.txt -o sample-1-gene-tax-out.tmp \
+              -t CAT_prepare_20200618/2020-06-18_taxonomy/ --only_official --exclude-scores
+```
+
+**Parameter Definitions:**  
+
+- `-i` – specifies the input taxonomy file
+
+- `-o` – specifies the output file 
+
+- `-t` – specifies the CAT reference taxonomy database
+
+- `--only_official` – specifies to add only standard taxonomic ranks
+
+- `--exclude-scores` - specifies to exclude bit-score support scores in the lineage
+
+**Input data:**
+
+- sample-1-tax-out.tmp.ORF2LCA.txt (gene-calls taxonomy file from [step 8b](#8b-running-taxonomic-classification))
+
+**Output data:**
+
+- sample-1-gene-tax-out.tmp (gene-calls taxonomy file with lineage info added)
+
+
+
+#### 8d. Adding taxonomy info from taxids to contigs
+```bash
+CAT add_names -i sample-1-tax-out.tmp.contig2classification.txt -o sample-1-contig-tax-out.tmp \
+              -t CAT_prepare_20200618/2020-06-18_taxonomy/ --only_official --exclude-scores
+```
+
+**Parameter Definitions:**  
+
+- `-i` – specifies the input taxonomy file
+
+- `-o` – specifies the output file 
+
+- `-t` – specifies the CAT reference taxonomy database
+
+- `--only_official` – specifies to add only standard taxonomic ranks
+
+- `--exclude-scores` - specifies to exclude bit-score support scores in the lineage
+
+
+**Input data:**
+
+- sample-1-tax-out.tmp.contig2classification.txt (contig taxonomy file from [step 8b](#8b-running-taxonomic-classification))
+
+**Output data:**
+
+- sample-1-contig-tax-out.tmp (contig taxonomy file with lineage info added)
+
+
+#### 8e. Formatting gene-level output with awk and sed
+```
+awk -F $'\t' ' BEGIN { OFS=FS } { if ( $3 == "lineage" ) { print $1,$3,$5,$6,$7,$8,$9,$10,$11 } \
+    else if ( $2 == "ORF has no hit to database" || $2 ~ /^no taxid found/ ) \
+    { print $1,"NA","NA","NA","NA","NA","NA","NA","NA" } else { n=split($3,lineage,";"); \
+    print $1,lineage[n],$5,$6,$7,$8,$9,$10,$11 } } ' sample-1-gene-tax-out.tmp | \
+    sed 's/no support/NA/g' | sed 's/superkingdom/domain/' | sed 's/# ORF/gene_ID/' | \
+    sed 's/lineage/taxid/'  > sample-1-gene-tax-out.tsv
+```
+
+#### 8f. Formatting contig-level output with awk and sed
+```
+awk -F $'\t' ' BEGIN { OFS=FS } { if ( $2 == "classification" ) { print $1,$4,$6,$7,$8,$9,$10,$11,$12 } \
+    else if ( $2 == "no taxid assigned" ) { print $1,"NA","NA","NA","NA","NA","NA","NA","NA" } \
+    else { n=split($4,lineage,";"); print $1,lineage[n],$6,$7,$8,$9,$10,$11,$12 } } ' sample-1-contig-tax-out.tmp | \
+    sed 's/no support/NA/g' | sed 's/superkingdom/domain/' | sed 's/^# contig/contig_ID/' | \
+    sed 's/lineage/taxid/' > sample-1-contig-tax-out.tsv
+
+  # clearing intermediate files
+rm sample-1*.tmp*
+```
+
+**Input data:**
+
+- sample-1-gene-tax-out.tmp (gene-calls taxonomy file with lineage info added from [step 8c](#8c-adding-taxonomy-info-from-taxids-to-genes))
+- sample-1-contig-tax-out.tmp (contig taxonomy file with lineage info added from [step 8d](#8d-adding-taxonomy-info-from-taxids-to-contigs))
+
+
+**Output data:**
+
+- sample-1-gene-tax-out.tsv (gene-calls taxonomy file with lineage info added reformatted)
+- sample-1-contig-tax-out.tsv (contig taxonomy file with lineage info added reformatted)
+
+<br>
+
+---
+
+
+#### 13c. Read-Mapping
+
+```bash
+minimap2 -a -x map-ont -t NumberOfThreads sample_assembly.fasta sample_host_removed.fastq.gz \
+  > sample.sam 2> sample-mapping-info.txt
+```
+
+**Parameter Definitions:**
+
+- `-t` - Number of parallel processing threads
+-	`-a` – output in SAM format
+- `-x map-ont` - specifies preset for mapping Nanopore reads to a reference
+
+**Input Data:**
+
+- /path/to/assemblies/sample_assembly.fasta (Sample assembly, output from [Step 13a](#13a-renaming-contig-headers))
+- /path/to/trimmed_reads/sample_host_removed.fastq.gz (Filtered and trimmed reads, output from [Step 8](#8-host-removal))
+
+**Output Data:**
+
+- sample.sam (Reads aligned to sample assembly)
+- sample-mapping-info.txt (minimap2 messages captured from standard error)
+
+#### 13d. Sort and Index Assembly Alignments
+```bash
+# Sort Sam, convert to bam and create index
+samtools sort --threads NumberOfThreads -o sample_sorted.bam sample.sam > sample_sort.log 2>&1
+
+samtools index sample_sorted.bam sample_sorted.bam.bai
+```
+
+**Parameter Definitions:**
+
+**samtools sort**
+- `--threads` - Number of parallel processing threads
+- `-o` - specifies the output file for the sorted reads
+- `sample.sam` - positional argument specifying the input SAM file
+
+**samtools index**
+- `sample_sorted.bam` - positional argument specifying the input BAM file to be indexed
+- `sample_sorted.bam.bai` - positional argument specifying the name of the index file
+
+**Input Data:**
+
+- sample.sam (Reads aligned to sample assembly, output from [Step 13c](#13c-read-mapping))
+
+**Output Data:**
+
+- sample_sorted.bam (sorted mapping to sample assembly)
+- sample_sorted.bam.bai (index of sorted mapping to sample assembly)
+
+<br>
+
+---
+
+### 10. Getting coverage information and filtering based on detection
+> **Notes**  
+> “Detection” is a metric of what proportion of a reference sequence recruited reads (see [here](http://merenlab.org/2017/05/08/anvio-views/#detection)). Filtering based on detection is one way of helping to mitigate non-specific read-recruitment.
+
+#### 10a. Filtering coverage levels based on detection
+
+```bash
+  # pileup.sh comes from the BBTools (bbmap) package
+pileup.sh -in sample-1.bam fastaorf=sample-1-genes.fasta outorf=sample-1-gene-cov-and-det.tmp \
+          out=sample-1-contig-cov-and-det.tmp
+```
+
+**Parameter Definitions:**  
+
+- `-in` – the input bam file
+
+- `fastaorf=` – input gene-calls nucleotide fasta file
+
+- `outorf=` – the output gene-coverage tsv file
+
+- `out=` – the output contig-coverage tsv file
+
+
+#### 10b. Filtering gene coverage based on requiring 50% detection and parsing down to just gene ID and coverage
+```
+grep -v "#" sample-1-gene-cov-and-det.tmp | awk -F $'\t' ' BEGIN { OFS=FS } { if ( $10 <= 0.5 ) $4 = 0 } \
+     { print $1,$4 } ' > sample-1-gene-cov.tmp
+
+cat <( printf "gene_ID\tcoverage\n" ) sample-1-gene-cov.tmp > sample-1-gene-coverages.tsv
+```
+
+Filtering contig coverage based on requiring 50% detection and parsing down to just contig ID and coverage:
+```
+grep -v "#" sample-1-contig-cov-and-det.tmp | awk -F $'\t' ' BEGIN { OFS=FS } { if ( $5 <= 50 ) $2 = 0 } \
+     { print $1,$2 } ' > sample-1-contig-cov.tmp
+
+cat <( printf "contig_ID\tcoverage\n" ) sample-1-contig-cov.tmp > sample-1-contig-coverages.tsv
+
+  # removing intermediate files
+
+rm sample-1-*.tmp
+```
+
+**Input data:**
+
+- sample-1.bam (mapping file from [step 13d](#13d-sort-and-index-assembly-alignments))
+- sample-1-genes.fasta (gene-calls nucleotide fasta file from [step 6](#6-gene-prediction))
+
+**Output data:**
+
+- sample-1-gene-coverages.tsv (table with gene-level coverages)
+- sample-1-contig-coverages.tsv (table with contig-level coverages)
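+
+To illustrate what the detection filter does, the contig-level command above can be run on a small mock table. This is a sketch only: the mock file keeps just the first five columns, assuming (as the filter above does) that column 2 holds average coverage and column 5 holds the percent of bases covered; the contig names and values are hypothetical.
+
+```bash
+# Mock contig table: ID, average coverage, length, GC, percent of bases covered
+printf '#ID\tAvg_fold\tLength\tRef_GC\tCovered_percent\n' >  mock-contig-cov-and-det.tmp
+printf 'c_sample-1_1\t12.5\t5000\t0.41\t98.0\n'  >> mock-contig-cov-and-det.tmp
+printf 'c_sample-1_2\t3.2\t2000\t0.55\t20.0\n'   >> mock-contig-cov-and-det.tmp
+
+# Same filter as above: contigs with <= 50% detection have their coverage set to 0
+grep -v "#" mock-contig-cov-and-det.tmp | \
+    awk -F '\t' 'BEGIN { OFS = FS } { if ( $5 <= 50 ) $2 = 0 } { print $1, $2 }' \
+    > mock-contig-cov.tmp
+
+cat mock-contig-cov.tmp
+```
+
+The second contig is kept in the table but reported with a coverage of 0, so downstream tables still list every contig.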
+
+<br>
+
+---
+
+### 11. Combining gene-level coverage, taxonomy, and functional annotations into one table for each sample
+> **Notes**  
+> Just uses `paste`, `sed`, and `awk`, all are standard in any Unix-like environment.  
+
+```
+paste <( tail -n +2 sample-1-gene-coverages.tsv | sort -V -k 1 ) <( tail -n +2 sample-1-annotations.tsv | sort -V -k 1 | cut -f 2- ) \
+      <( tail -n +2 sample-1-gene-tax-out.tsv | sort -V -k 1 | cut -f 2- ) > sample-1-gene-tab.tmp
+
+paste <( head -n 1 sample-1-gene-coverages.tsv ) <( head -n 1 sample-1-annotations.tsv | cut -f 2- ) \
+      <( head -n 1 sample-1-gene-tax-out.tsv | cut -f 2- ) > sample-1-header.tmp
+
+cat sample-1-header.tmp sample-1-gene-tab.tmp > sample-1-gene-coverage-annotation-and-tax.tsv
+
+  # removing intermediate files
+rm sample-1*tmp sample-1-gene-coverages.tsv sample-1-annotations.tsv sample-1-gene-tax-out.tsv
+```
+
+**Input data:**
+
+- sample-1-gene-coverages.tsv (table with gene-level coverages from [step 10b](#10b-filtering-gene-coverage-based-on-requiring-50-detection-and-parsing-down-to-just-gene-id-and-coverage))
+- sample-1-annotations.tsv (table of KO annotations assigned to gene IDs from [step 7c](#7c-filtering-output-to-retain-only-those-passing-the-ko-specific-score-and-top-hits))
+- sample-1-gene-tax-out.tsv (gene-level taxonomic classifications from [step 8e](#8e-formatting-gene-level-output-with-awk-and-sed))
+
+
+**Output data:**
+
+- **sample-1-gene-coverage-annotation-and-tax.tsv** (table with combined gene coverage, annotation, and taxonomy info)
+
+<br>
+
+---
+
+### 12. Combining contig-level coverage and taxonomy into one table for each sample
+> **Notes**  
+> Just uses `paste`, `sed`, and `awk`, all are standard in any Unix-like environment.  
+
+```
+paste <( tail -n +2 sample-1-contig-coverages.tsv | sort -V -k 1 ) \
+      <( tail -n +2 sample-1-contig-tax-out.tsv | sort -V -k 1 | cut -f 2- ) > sample-1-contig.tmp
+
+paste <( head -n 1 sample-1-contig-coverages.tsv ) <( head -n 1 sample-1-contig-tax-out.tsv | cut -f 2- ) \
+      > sample-1-contig-header.tmp
+      
+cat sample-1-contig-header.tmp sample-1-contig.tmp > sample-1-contig-coverage-and-tax.tsv
+
+  # removing intermediate files
+rm sample-1*tmp sample-1-contig-coverages.tsv sample-1-contig-tax-out.tsv
+```
+
+**Input data:**
+
+- sample-1-contig-coverages.tsv (table with contig-level coverages from [step 10b](#10b-filtering-gene-coverage-based-on-requiring-50-detection-and-parsing-down-to-just-gene-id-and-coverage))
+- sample-1-contig-tax-out.tsv (contig-level taxonomic classifications from [step 8f](#8f-formatting-contig-level-output-with-awk-and-sed))
+
+
+**Output data:**
+
+- **sample-1-contig-coverage-and-tax.tsv** (table with combined contig coverage and taxonomy info)
+
+<br>
+
+---
+
+### 13. Generating normalized, gene- and contig-level coverage summary tables of KO-annotations and taxonomy across samples
+
+> **Notes**  
+> * To combine across samples to generate these summary tables, we need the same "units". This is done for annotations based on the assigned KO terms, and all non-annotated functions are included together as "Not annotated". It is done for taxonomic classifications based on taxids (full lineages included in the table), and any not classified are included together as "Not classified". 
+> * The values are per-gene coverages (the number of bases recruited to a gene divided by the length of the gene). For the CPM tables, each sample is scaled so that its total coverage sums to 1,000,000, and each gene-level coverage is expressed as its share of that total. This is effectively a percentage, but out of 1,000,000 instead of 100 to keep the numbers readable.
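+
+The normalization described in the notes above amounts to multiplying each coverage value by 1,000,000 divided by the sample's total coverage. A minimal sketch with `awk` on a mock two-column table follows; the file names and values are hypothetical, and the actual normalization is performed by the `bit` programs in the steps below.
+
+```bash
+# Mock per-sample table: KO term (or "Not annotated") and summed coverage
+printf 'K00001\t30\nK00002\t10\nNot annotated\t60\n' > mock-coverages.tsv
+
+# Two passes over the same file: first sum column 2, then rescale each row
+awk -F '\t' 'NR == FNR { total += $2; next }
+     { printf "%s\t%.0f\n", $1, $2 / total * 1000000 }' \
+    mock-coverages.tsv mock-coverages.tsv > mock-coverages-CPM.tsv
+
+cat mock-coverages-CPM.tsv
+```
+
+After scaling, the values in each sample sum to 1,000,000, which is what makes samples with different sequencing depths comparable.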
+
+#### 13a. Generating gene-level coverage summary tables
+
+```
+bit-GL-combine-KO-and-tax-tables *-gene-coverage-annotation-and-tax.tsv -o Combined
+```
+
+**Parameter Definitions:**  
+
+- `*-gene-coverage-annotation-and-tax.tsv` – positional arguments specifying the input tsv files; can be provided as a space-delimited list of files or with wildcards as above
+
+- `-o` – specifies the output prefix
+
+
+**Input data:**
+
+- *-gene-coverage-annotation-and-tax.tsv (tables with combined gene coverage, annotation, and taxonomy info generated for individual samples from [step 11](#11-combining-gene-level-coverage-taxonomy-and-functional-annotations-into-one-table-for-each-sample))
+
+**Output data:**
+
+- **Combined-gene-level-KO-function-coverages-CPM_GLmetagenomics.tsv** (table with all samples combined based on KO annotations; normalized to coverage per million genes covered)
+- **Combined-gene-level-taxonomy-coverages-CPM_GLmetagenomics.tsv** (table with all samples combined based on gene-level taxonomic classifications; normalized to coverage per million genes covered)
+- **Combined-gene-level-KO-function-coverages_GLmetagenomics.tsv** (table with all samples combined based on KO annotations)
+- **Combined-gene-level-taxonomy-coverages_GLmetagenomics.tsv** (table with all samples combined based on gene-level taxonomic classifications)
+
+
+#### 13b. Generating contig-level coverage summary tables
+
+```
+bit-GL-combine-contig-tax-tables *-contig-coverage-and-tax.tsv -o Combined
+```
+**Parameter Definitions:**  
+
+- `*-contig-coverage-and-tax.tsv` – positional arguments specifying the input tsv files; can be provided as a space-delimited list of files or with wildcards as above
+
+- `-o` – specifies the output prefix
+
+
+**Input data:**
+
+- *-contig-coverage-and-tax.tsv (tables with combined contig coverage and taxonomy info generated for individual samples from [step 12](#12-combining-contig-level-coverage-and-taxonomy-into-one-table-for-each-sample))
+
+**Output data:**
+
+- **Combined-contig-level-taxonomy-coverages-CPM_GLmetagenomics.tsv** (table with all samples combined based on contig-level taxonomic classifications; normalized to coverage per million genes covered)
+- **Combined-contig-level-taxonomy-coverages_GLmetagenomics.tsv** (table with all samples combined based on contig-level taxonomic classifications)
+
+<br>
+
+---
+
+### 14. **M**etagenome-**A**ssembled **G**enome (MAG) recovery
+
+#### 14a. Binning contigs
+```
+jgi_summarize_bam_contig_depths --outputDepth sample-1-metabat-assembly-depth.tsv --percentIdentity 97 --minContigLength 1000 --minContigDepth 1.0  --referenceFasta sample-1-assembly.fasta sample-1.bam
+
+metabat2  --inFile sample-1-assembly.fasta --outFile sample-1 --abdFile sample-1-metabat-assembly-depth.tsv -t NumberOfThreads
+
+mkdir sample-1-bins
+mv sample-1*bin*.fasta sample-1-bins
+zip -r sample-1-bins.zip sample-1-bins
+```
+
+**Parameter Definitions:**  
+
+-  `--outputDepth` – specifies the output depth file
+-  `--percentIdentity` – minimum end-to-end percent identity of a mapped read to be included
+-  `--minContigLength` – minimum contig length to include
+-  `--minContigDepth` – minimum contig depth to include
+-  `--referenceFasta` – the assembly fasta file generated in step 13a
+-  `sample-1.bam` – final positional arguments are the bam files generated in step 13d
+-  `--inFile` - the assembly fasta file generated in step 13a
+-  `--outFile` - the prefix of the identified bins output files
+-  `--abdFile` - the depth file generated by the previous `jgi_summarize_bam_contig_depths` command
+-  `-t` - specifies number of threads to use
+
+
+**Input data:**
+
+- sample-1-assembly.fasta (assembly fasta file created in [step 13a](#13a-renaming-contig-headers))
+- sample-1.bam (bam file created in [step 13d](#13d-sort-and-index-assembly-alignments))
+
+**Output data:**
+
+- **sample-1-metabat-assembly-depth.tsv** (tab-delimited summary of coverages)
+- sample-1-bins/sample-1-bin\*.fasta (fasta files of recovered bins)
+- **sample-1-bins.zip** (zip file containing fasta files of recovered bins)
+
+#### 14b. Bin quality assessment
+Utilizes the default `checkm` database available [here](https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz), `checkm_data_2015_01_16.tar.gz`.
+
+```
+checkm lineage_wf -f bins-overview_GLmetagenomics.tsv --tab_table -x fa ./ checkm-output-dir
+```
+
+**Parameter Definitions:**  
+
+-  `lineage_wf` – specifies the workflow being utilized
+-  `-f` – specifies the output summary file
+-  `--tab_table` – specifies the output summary file should be a tab-delimited table
+-  `-x` – specifies the extension that is on the bin fasta files that are being assessed
+-  `./` – first positional argument at end specifies the directory holding the bins generated in step 14a
+-  `checkm-output-dir` – second positional argument at end specifies the primary checkm output directory with detailed information
+
+**Input data:**
+
+- sample-1-bins/sample-1-bin\*.fasta (bin fasta files generated in [step 14a](#14a-binning-contigs))
+
+**Output data:**
+
+- **bins-overview_GLmetagenomics.tsv** (tab-delimited file with quality estimates per bin)
+- checkm-output-dir (directory holding detailed checkm outputs)
+
+#### 14c. Filtering MAGs
+
+```
+cat <( head -n 1 bins-overview_GLmetagenomics.tsv ) \
+    <( awk -F $'\t' ' $12 >= 90 && $13 <= 10 && $14 == 0 ' bins-overview_GLmetagenomics.tsv | sed 's/bin./MAG-/' ) \
+    > checkm-MAGs-overview.tsv
+    
+# copying bins into a MAGs directory in order to run tax classification
+awk -F $'\t' ' $12 >= 90 && $13 <= 10 && $14 == 0 ' bins-overview_GLmetagenomics.tsv | cut -f 1 > MAG-bin-IDs.tmp
+
+mkdir MAGs
+for ID in $(cat MAG-bin-IDs.tmp)
+do
+    MAG_ID=$(echo $ID | sed 's/bin./MAG-/')
+    cp ${ID}.fasta MAGs/${MAG_ID}.fasta
+done
+
+for SAMPLE in $(cat MAG-bin-IDs.tmp | sed 's/-bin.*//' | sort -u);
+do
+  mkdir ${SAMPLE}-MAGs
+  mv ${SAMPLE}-*MAG*.fasta ${SAMPLE}-MAGs
+  zip -r ${SAMPLE}-MAGs.zip ${SAMPLE}-MAGs
+done
+```
+
+**Input data:**
+
+- bins-overview_GLmetagenomics.tsv (tab-delimited file with quality estimates per bin from [step 14b](#14b-bin-quality-assessment))
+
+**Output data:**
+
+- checkm-MAGs-overview.tsv (tab-delimited file with quality estimates per MAG)
+- MAGs/\*.fasta (directory holding high-quality MAGs)
+- **\*-MAGs.zip** (zip files containing directories of high-quality MAGs)
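+
+The quality thresholds used above (at least 90% estimated completeness, at most 10% estimated redundancy, and zero strain heterogeneity, in columns 12, 13, and 14 of the CheckM table) can be checked on a small mock table before running on real data. This is an illustrative sketch: the bin names and values are hypothetical, and the explicit `NR > 1` header skip is an addition for clarity rather than part of the command above.
+
+```bash
+# Mock CheckM tab table with the standard 14 columns
+printf 'Bin Id\tMarker lineage\t# genomes\t# markers\t# marker sets\t0\t1\t2\t3\t4\t5+\tCompleteness\tContamination\tStrain heterogeneity\n' > mock-bins-overview.tsv
+printf 'sample-1-bin.1\tk__Bacteria\t5449\t104\t58\t2\t102\t0\t0\t0\t0\t96.55\t0.00\t0.00\n' >> mock-bins-overview.tsv
+printf 'sample-1-bin.2\tk__Bacteria\t5449\t104\t58\t40\t60\t4\t0\t0\t0\t61.20\t3.45\t25.00\n' >> mock-bins-overview.tsv
+
+# Keep bins passing all three thresholds and rename them from "bin." to "MAG-"
+awk -F '\t' 'NR > 1 && $12 >= 90 && $13 <= 10 && $14 == 0 { print $1 }' mock-bins-overview.tsv \
+    | sed 's/bin\./MAG-/' > mock-MAG-IDs.txt
+
+cat mock-MAG-IDs.txt
+```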
+
+
+#### 14d. MAG taxonomic classification
+Uses default `gtdbtk` database setup with program's `download.sh` command.
+
+```
+gtdbtk classify_wf --genome_dir MAGs/ -x fa --out_dir gtdbtk-output-dir  --skip_ani_screen
+```
+
+**Parameter Definitions:**  
+
+-  `classify_wf` – specifies the workflow being utilized
+-  `--genome_dir` – specifies the directory holding the MAGs generated in step 14c
+-  `-x` – specifies the extension that is on the MAG fasta files that are being taxonomically classified
+-  `--out_dir` – specifies the output directory
+-  `--skip_ani_screen` – specifies to skip the ani_screen step, which classifies genomes using mash and skani
+
+**Input data:**
+
+- MAGs/\*.fasta (directory holding high-quality MAGs from [step 14c](#14c-filtering-mags))
+
+**Output data:**
+
+- gtdbtk-output-dir/gtdbtk.\*.summary.tsv (files with assigned taxonomy and info)
+
+#### 14e. Generating overview table of all MAGs
+
+```bash
+# combine summaries
+for MAG in $(cut -f 1 assembly-summaries_GLmetagenomics.tsv | tail -n +2); do
+
+    grep -w -m 1 "^${MAG}" checkm-MAGs-overview.tsv | cut -f 12,13,14 \
+        >> checkm-estimates.tmp
+
+    grep -w "^${MAG}" gtdbtk-output-dir/gtdbtk.*.summary.tsv | \
+    cut -f 2 | sed 's/^.__//' | \
+    sed 's/;.__/\t/g' | \
+    awk 'BEGIN{ OFS=FS="\t" } { for (i=1; i<=NF; i++) if ( $i ~ /^ *$/ ) $i = "NA" }; 1' \
+        >> gtdb-taxonomies.tmp
+
+done
+
+# Add headers
+cat <(printf "est. completeness\test. redundancy\test. strain heterogeneity\n") checkm-estimates.tmp \
+    > checkm-estimates-with-headers.tmp
+
+cat <(printf "domain\tphylum\tclass\torder\tfamily\tgenus\tspecies\n") gtdb-taxonomies.tmp \
+    > gtdb-taxonomies-with-headers.tmp
+
+paste assembly-summaries_GLmetagenomics.tsv \
+checkm-estimates-with-headers.tmp \
+gtdb-taxonomies-with-headers.tmp \
+    > MAGs-overview.tmp
+
+# Ordering by taxonomy
+head -n 1 MAGs-overview.tmp > MAGs-overview-header.tmp
+
+tail -n +2 MAGs-overview.tmp | sort -t $'\t' -k 14,20 > MAGs-overview-sorted.tmp
+
+cat MAGs-overview-header.tmp MAGs-overview-sorted.tmp \
+    > MAGs-overview_GLmetagenomics.tsv
+
+```
+
+**Input data:**
+
+- assembly-summaries_GLmetagenomics.tsv (table of assembly summary statistics from [step 13b](#13b-summarizing-assemblies))
+- MAGs/\*.fasta (directory holding high-quality MAGs from [step 14c](#14c-filtering-mags))
+- checkm-MAGs-overview.tsv (tab-delimited file with quality estimates per MAG from [step 14c](#14c-filtering-mags))
+- gtdbtk-output-dir/gtdbtk.\*.summary.tsv (directory of files with assigned taxonomy and info from [step 14d](#14d-mag-taxonomic-classification))
+
+**Output data:**
+
+- **MAGs-overview_GLmetagenomics.tsv** (a tab-delimited overview of all recovered MAGs)
+
+
+<br>
+
+---
+
+### 15. Generating MAG-level functional summary overview
+
+#### 15a. Getting KO annotations per MAG
+This utilizes the helper script [`parse-MAG-annots.py`](../Workflow_Documentation/NF_MGIllumina/workflow_code/bin/parse-MAG-annots.py) 
+
+```bash
+for file in MAGs/*.fasta
+do
+
+    MAG_ID=$( echo ${file} | cut -f 2 -d "/" | sed 's/\.fasta$//' )
+    sample_ID=$( echo ${MAG_ID} | sed 's/-MAG-[0-9]*$//' )
+
+    grep "^>" ${file} | tr -d ">" > ${MAG_ID}-contigs.tmp
+
+    python parse-MAG-annots.py -i annotations-and-taxonomy/${sample_ID}-gene-coverage-annotation-and-tax.tsv \
+                               -w ${MAG_ID}-contigs.tmp -M ${MAG_ID} \
+                               -o MAG-level-KO-annotations_GLmetagenomics.tsv
+
+    rm ${MAG_ID}-contigs.tmp
+
+done
+```
+
+**Parameter Definitions:**  
+
+- `-i` – specifies the input sample gene-coverage-annotation-and-tax.tsv file generated in step 11
+
+- `-w` – specifies the temporary file holding all the contigs in the current MAG
+
+- `-M` – specifies the current MAG unique identifier
+
+- `-o` – specifies the output file
+
+**Input data:**
+
+- \*-gene-coverage-annotation-and-tax.tsv (sample gene-coverage-annotation-and-tax.tsv file generated in [step 11](#11-combining-gene-level-coverage-taxonomy-and-functional-annotations-into-one-table-for-each-sample))
+- MAGs/\*.fasta (directory holding high-quality MAGs from [step 14c](#14c-filtering-mags))
+
+**Output data:**
+
+- **MAG-level-KO-annotations_GLmetagenomics.tsv** (tab-delimited table holding MAGs and their KO annotations)
+
+
+#### 15b. Summarizing KO annotations with KEGG-Decoder
+
+```bash
+KEGG-decoder -v interactive -i MAG-level-KO-annotations_GLmetagenomics.tsv -o MAG-KEGG-Decoder-out_GLmetagenomics.tsv
+```
+
+**Parameter Definitions:**  
+
+- `-v interactive` – specifies to create an interactive html output
+ 
+- `-i` – specifies the input MAG-level-KO-annotations_GLmetagenomics.tsv file generated in [step 15a](#15a-getting-ko-annotations-per-mag)
+
+- `-o` – specifies the output table
+
+**Input data:**
+
+- MAG-level-KO-annotations_GLmetagenomics.tsv (tab-delimited table holding MAGs and their KO annotations, generated in [step 15a](#15a-getting-ko-annotations-per-mag))
+
+**Output data:**
+
+- **MAG-KEGG-Decoder-out_GLmetagenomics.tsv** (tab-delimited table holding, for each MAG, the proportion of genes present that are known to be required for specific pathways/metabolisms)
+
+- **MAG-KEGG-Decoder-out_GLmetagenomics.html** (interactive heatmap html file of the above output table)
+
+<br>
+---
+
+## Read-based processing
+### 16. Taxonomic and functional profiling
+The following uses the `humann` and `metaphlan` reference databases downloaded on 13-Jun-2024 as follows:
+
+```bash
+humann_databases --download chocophlan full
+humann_databases --download uniref uniref90_diamond 
+humann_databases --download utility_mapping full 
+metaphlan --install
+```
+
+#### 16a. Running humann (which also runs metaphlan)
+```bash
+  # forward and reverse reads need to be provided combined if paired-end (if not paired-end, single-end reads are provided to the --input argument next)
+cat sample-1_R1_filtered.fastq.gz sample-1_R2_filtered.fastq.gz > sample-1-combined.fastq.gz
+
+humann --input sample-1-combined.fastq.gz --output sample-1-humann3-out-dir --threads NumberOfThreads \
+       --output-basename sample-1 --metaphlan-options "--unknown_estimation --add_viruses \
+       --sample_id sample-1"
+```
+
+**Parameter Definitions:**  
+
+- `--input` – specifies the input combined forward and reverse reads (if paired-end)
+
+- `--output` – specifies output directory
+
+- `--threads` – specifies the number of threads to use
+
+- `--output-basename` – specifies prefix of the output files
+
+- `--metaphlan-options` – options to be passed to metaphlan
+	- `--unknown_estimation` – include unclassified in estimated relative abundances
+	- `--add_viruses` – include viruses in the reference database
+	- `--sample_id` – specifies the sample identifier we want in the table (rather than full filename)
+
+
+#### 16b. Merging multiple sample functional profiles into one table
+```bash
+  # they need to be in their own directories
+mkdir genefamily-results/ pathabundance-results/ pathcoverage-results/
+
+  # copying results from previous running humann3 step (16a) to get them all together in their own directories (as is needed)
+cp *-humann3-out-dir/*genefamilies.tsv genefamily-results/
+cp *-humann3-out-dir/*abundance.tsv pathabundance-results/
+cp *-humann3-out-dir/*coverage.tsv pathcoverage-results/
+
+humann_join_tables -i genefamily-results/ -o gene-families.tsv
+humann_join_tables -i pathabundance-results/ -o path-abundances.tsv
+humann_join_tables -i pathcoverage-results/ -o path-coverages.tsv
+```
+
+**Parameter Definitions:**  
+
+- `-i` – the directory holding the input tables
+
+- `-o` – the name of the output combined table
+
+
+#### 16c. Splitting results tables
+The read-based functional annotation tables have taxonomic info and non-taxonomic info mixed together initially. `humann` comes with a helper script to split these. Here we are using that to generate both non-taxonomically grouped functional info files and taxonomically grouped ones.
+
+```bash
+humann_split_stratified_table -i gene-families.tsv -o ./
+mv gene-families_stratified.tsv Gene-families-grouped-by-taxa_GLmetagenomics.tsv
+mv gene-families_unstratified.tsv Gene-families_GLmetagenomics.tsv
+
+humann_split_stratified_table -i path-abundances.tsv -o ./
+mv path-abundances_stratified.tsv Path-abundances-grouped-by-taxa_GLmetagenomics.tsv
+mv path-abundances_unstratified.tsv Path-abundances_GLmetagenomics.tsv
+
+humann2_split_stratified_table -i path-coverages.tsv -o ./
+mv path-coverages_stratified.tsv Path-coverages-grouped-by-taxa_GLmetagenomics.tsv
+mv path-coverages_unstratified.tsv Path-coverages_GLmetagenomics.tsv
+```
+
+**Parameter Definitions:**  
+
+- `-i` – the input combined table
+
+- `-o` – output directory (here specifying current directory)
+
+
+#### 16d. Normalizing gene families and pathway abundance tables
+This generates some normalized tables of the read-based functional outputs from humann that are more readily suitable for across sample comparisons.
+
+```bash
+humann_renorm_table -i Gene-families_GLmetagenomics.tsv -o Gene-families-cpm_GLmetagenomics.tsv --update-snames
+humann_renorm_table -i Path-abundances_GLmetagenomics.tsv -o Path-abundances-cpm_GLmetagenomics.tsv --update-snames
+```
+
+**Parameter Definitions:**  
+
+- `-i` – the input combined table
+
+- `-o` – name of the output normalized table
+
+- `--update-snames` – change suffix of column names in tables to "-CPM"
+
+
+#### 16e. Generating a normalized gene-family table that is grouped by Kegg Orthologs (KOs)
+
+```bash
+humann_regroup_table -i Gene-families_GLmetagenomics.tsv -g uniref90_ko | humann_rename_table -n kegg-orthology | \
+                     humann_renorm_table -o Gene-families-KO-cpm_GLmetagenomics.tsv --update-snames
+```
+
+**Parameter Definitions:**  
+
+- `-i` – the input table
+
+- `-g` – the map to use to group uniref IDs into Kegg Orthologs
+
+- `|` – sending that output into the next humann command to add human-readable Kegg Orthology names
+
+- `-n` – specifying we are converting Kegg orthology IDs into Kegg orthology human-readable names
+
+- `|` – sending that output into the next humann command to normalize to copies-per-million
+
+- `-o` – specifying the final output file name
+
+-  `--update-snames` – change suffix of column names in tables to "-CPM"
+
+#### 16f. Combining taxonomy tables
+
+```bash
+merge_metaphlan_tables.py *-humann3-out-dir/*_humann_temp/*_metaphlan_bugs_list.tsv > Metaphlan-taxonomy_GLmetagenomics.tsv
+```
+
+**Parameter Definitions:**  
+
+*	input metaphlan tables are provided as position arguments (produced during humann3 run in [step 16a](#16a-running-humann-which-also-runs-metaphlan)
+
+-  `>` – output is redirected from stdout to a file
+
+
+**Input data:**
+
+- *fastq.gz (filtered/trimmed reads from [step 2](#2-quality-filteringtrimming), forward and reverse reads concatenated if paired-end)
+
+**Output data:**
+
+- **Gene-families_GLmetagenomics.tsv** (gene-family abundances) 
+- **Gene-families-grouped-by-taxa_GLmetagenomics.tsv** (gene-family abundances grouped by taxa)
+- **Gene-families-cpm_GLmetagenomics.tsv** (gene-family abundances normalized to copies-per-million)
+- **Gene-families-KO-cpm_GLmetagenomics.tsv** (KO term abundances normalized to copies-per-million)
+- **Pathway-abundances_GLmetagenomics.tsv** (pathway abundances)
+- **Pathway-abundances-grouped-by-taxa_GLmetagenomics.tsv** (pathway abundances grouped by taxa)
+- **Pathway-abundances-cpm_GLmetagenomics.tsv** (pathway abundances normalized to copies-per-million)
+- **Pathway-coverages_GLmetagenomics.tsv** (pathway coverages)
+- **Pathway-coverages-grouped-by-taxa_GLmetagenomics.tsv** (pathway coverages grouped by taxa)
+- **Metaphlan-taxonomy_GLmetagenomics.tsv** (metaphlan estimated taxonomic relative abundances)
+
+---

From 60fb1fcea021793b67266adf131c98a4abe8ae0b Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Mon, 8 Sep 2025 10:23:21 -0700
Subject: [PATCH 02/47] Fixed Assembly section numbering

- removed read-based processing from standard pipeline (no longer used,
  humann does not work with this data type)
---
 .../Low_Biomass/Nanopore/GL-DPPD-XXXX.md      | 252 ++++--------------
 1 file changed, 59 insertions(+), 193 deletions(-)

diff --git a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
index bba4e3c74..a14bdc2db 100644
--- a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
+++ b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
@@ -988,8 +988,8 @@ bit-summarize-assembly -o assembly-summaries_GLmetagenomics.tsv *assembly.fasta
 
 ---
 
-### 6. Gene prediction
-```
+### 14. Gene prediction
+```bash
 prodigal -a sample-1-genes.faa -d sample-1-genes.fasta -f gff -p meta -c -q \
          -o sample-1-genes.gff -i sample-1-assembly.fasta
 ```
@@ -1017,20 +1017,40 @@ prodigal -a sample-1-genes.faa -d sample-1-genes.fasta -f gff -p meta -c -q \
 
 **Output data:**
 
-- **sample-1-genes.faa** (gene-calls amino-acid fasta file)
-- **sample-1-genes.fasta** (gene-calls nucleotide fasta file)
+- sample-1-genes.faa (gene-calls amino-acid fasta file)
+- sample-1-genes.fasta (gene-calls nucleotide fasta file)
 - **sample-1-genes.gff** (gene-calls in general feature format)
 
 <br>
 
+#### 14a. Remove line wraps in gene prediction output
+```bash
+bit-remove-wraps sample-1-genes.faa > sample-1-genes.faa.tmp 2> /dev/null
+mv sample-1-genes.faa.tmp sample-1-genes.faa
+
+bit-remove-wraps sample-1-genes.fasta > sample-1-genes.fasta.tmp 2> /dev/null
+mv sample-1-genes.fasta.tmp sample-1-genes.fasta
+```
+
+**Input data:**
+
+- sample-1-genes.faa (gene-calls amino-acid fasta file, output from [Step 14](#14-gene-prediction))
+- sample-1-genes.fasta (gene-calls nucleotide fasta file, output from [Step 14](#14-gene-prediction))
+
+**Output data:**
+
+- **sample-1-genes.faa** (gene-calls amino-acid fasta file with line wraps removed)
+- **sample-1-genes.fasta** (gene-calls nucleotide fasta file with line wraps removed)
+
+
 ---
 
-### 7. Functional annotation
+### 15. Functional annotation
 > **Notes**  
 > The annotation process overwrites the same temporary directory by default. So if running multiple processses at a time, it is necessary to specify a specific temporary directory with the `--tmp-dir` argument as shown below.
 
 
-#### 7a. Downloading reference database of HMM models (only needs to be done once)
+#### 15a. Downloading reference database of HMM models (only needs to be done once)
 
 ```
 curl -LO ftp://ftp.genome.jp/pub/db/kofam/profiles.tar.gz
@@ -1039,9 +1059,9 @@ tar -xzvf profiles.tar.gz
 gunzip ko_list.gz 
 ```
 
-#### 7b. Running KEGG annotation
+#### 15b. Running KEGG annotation
 ```
-exec_annotation -p profiles/ -k ko_list --cpu 15 -f detail-tsv -o sample-1-KO-tab.tmp \
+exec_annotation -p profiles/ -k ko_list --cpu NumberOfThreads -f detail-tsv -o sample-1-KO-tab.tmp \
                 --tmp-dir sample-1-tmp-KO --report-unannotated sample-1-genes.faa 
 ```
 
@@ -1074,7 +1094,7 @@ exec_annotation -p profiles/ -k ko_list --cpu 15 -f detail-tsv -o sample-1-KO-ta
 - sample-1-KO-tab.tmp (table of KO annotations assigned to gene IDs)
 
 
-#### 7c. Filtering output to retain only those passing the KO-specific score and top hits
+#### 15c. Filtering output to retain only those passing the KO-specific score and top hits
 ```
 bit-filter-KOFamScan-results -i sample-1-KO-tab.tmp -o sample-1-annotations.tsv
 
@@ -1101,15 +1121,15 @@ rm -rf sample-1-tmp-KO/ sample-1-KO-annots.tmp
 
 ---
 
-### 8. Taxonomic classification
+### 16. Taxonomic classification
 
-#### 8a. Pulling and un-packing pre-built reference db (only needs to be done once)
+#### 16a. Pulling and un-packing pre-built reference db (only needs to be done once)
 ```
 wget tbb.bio.uu.nl/bastiaan/CAT_prepare/CAT_prepare_20200618.tar.gz
 tar -xvzf CAT_prepare_20200618.tar.gz
 ```
 
-#### 8b. Running taxonomic classification
+#### 16b. Running taxonomic classification
 ```
 CAT contigs -c sample-1-assembly.fasta -d CAT_prepare_20200618/2020-06-18_database/ \
             -t CAT_prepare_20200618/2020-06-18_taxonomy/ -p sample-1-genes.faa \
@@ -1149,7 +1169,7 @@ CAT contigs -c sample-1-assembly.fasta -d CAT_prepare_20200618/2020-06-18_databa
 - sample-1-tax-out.tmp.ORF2LCA.txt (gene-calls taxonomy file)
 - sample-1-tax-out.tmp.contig2classification.txt (contig taxonomy file)
 
-#### 8c. Adding taxonomy info from taxids to genes
+#### 16c. Adding taxonomy info from taxids to genes
 ```
 CAT add_names -i sample-1-tax-out.tmp.ORF2LCA.txt -o sample-1-gene-tax-out.tmp \
               -t CAT_prepare_20200618/2020-06-18_taxonomy/ --only_official --exclude-scores
@@ -1177,7 +1197,7 @@ CAT add_names -i sample-1-tax-out.tmp.ORF2LCA.txt -o sample-1-gene-tax-out.tmp \
 
 
 
-#### 8d. Adding taxonomy info from taxids to contigs
+#### 16d. Adding taxonomy info from taxids to contigs
 ```
 CAT add_names -i sample-1-tax-out.tmp.contig2classification.txt -o sample-1-contig-tax-out.tmp \
               -t CAT-ref/2020-06-18_taxonomy/ --only_official --exclude-scores
@@ -1205,7 +1225,7 @@ CAT add_names -i sample-1-tax-out.tmp.contig2classification.txt -o sample-1-cont
 - sample-1-contig-tax-out.tmp (contig taxonomy file with lineage info added)
 
 
-#### 8e. Formatting gene-level output with awk and sed
+#### 16e. Formatting gene-level output with awk and sed
 ```
 awk -F $'\t' ' BEGIN { OFS=FS } { if ( $3 == "lineage" ) { print $1,$3,$5,$6,$7,$8,$9,$10,$11 } \
     else if ( $2 == "ORF has no hit to database" || $2 ~ /^no taxid found/ ) \
@@ -1215,7 +1235,7 @@ awk -F $'\t' ' BEGIN { OFS=FS } { if ( $3 == "lineage" ) { print $1,$3,$5,$6,$7,
     sed 's/lineage/taxid/'  > sample-1-gene-tax-out.tsv
 ```
 
-#### 8f. Formatting contig-level output with awk and sed
+#### 16f. Formatting contig-level output with awk and sed
 ```
 awk -F $'\t' ' BEGIN { OFS=FS } { if ( $2 == "classification" ) { print $1,$4,$6,$7,$8,$9,$10,$11,$12 } \
     else if ( $2 == "no taxid assigned" ) { print $1,"NA","NA","NA","NA","NA","NA","NA","NA" } \
@@ -1242,8 +1262,9 @@ rm sample-1*.tmp*
 
 ---
 
+### 17. Read-Mapping
 
-#### 13c. Read-Mapping
+#### 17a. Align Reads to Sample Assembly
 
 ```bash
 minimap2 -a -x map-ont -t NumberOfThreads sample_assembly.fasta sample_host_removed.fastq.gz \
@@ -1265,7 +1286,7 @@ minimap2 -a -x map-ont -t NumberOfThreads sample_assembly.fasta sample_host_remo
 
 - sample.sam (Reads aligned to contaminant assembly)
 
-#### 13d. Sort and Index Assembly Alignments
+#### 17b. Sort and Index Assembly Alignments
 ```bash
 # Sort Sam, convert to bam and create index
 samtools sort --threads NumberOfThreads -o sample_sorted.bam sample.sam > sample_sort.log 2>&1
@@ -1297,13 +1318,13 @@ samtools index sample_sorted.bam sample_sorted.bam.bai
 
 ---
 
-### 10. Getting coverage information and filtering based on detection
+### 18. Getting coverage information and filtering based on detection
 > **Notes**  
 > “Detection” is a metric of what proportion of a reference sequence recruited reads (see [here](http://merenlab.org/2017/05/08/anvio-views/#detection)). Filtering based on detection is one way of helping to mitigate non-specific read-recruitment.
 
-#### 10a. Filtering coverage levels based on detection
+#### 18a. Filtering coverage levels based on detection
 
-```
+```bash
   # pileup.sh comes from the bbduk.sh package
 pileup.sh -in sample-1.bam fastaorf=sample-1-genes.fasta outorf=sample-1-gene-cov-and-det.tmp \
           out=sample-1-contig-cov-and-det.tmp
@@ -1320,8 +1341,8 @@ pileup.sh -in sample-1.bam fastaorf=sample-1-genes.fasta outorf=sample-1-gene-co
 - `out=` – the output contig-coverage tsv file
 
 
-#### 10b. Filtering gene coverage based on requiring 50% detection and parsing down to just gene ID and coverage
-```
+#### 18b. Filtering gene coverage based on requiring 50% detection and parsing down to just gene ID and coverage
+```bash
 grep -v "#" sample-1-gene-cov-and-det.tmp | awk -F $'\t' ' BEGIN { OFS=FS } { if ( $10 <= 0.5 ) $4 = 0 } \
      { print $1,$4 } ' > sample-1-gene-cov.tmp
 
@@ -1329,7 +1350,7 @@ cat <( printf "gene_ID\tcoverage\n" ) sample-1-gene-cov.tmp > sample-1-gene-cove
 ```
 
 Filtering contig coverage based on requiring 50% detection and parsing down to just contig ID and coverage:
-```
+```bash
 grep -v "#" sample-1-contig-cov-and-det.tmp | awk -F $'\t' ' BEGIN { OFS=FS } { if ( $5 <= 50 ) $2 = 0 } \
      { print $1,$2 } ' > sample-1-contig-cov.tmp
 
@@ -1354,7 +1375,7 @@ rm sample-1-*.tmp
 
 ---
 
-### 11. Combining gene-level coverage, taxonomy, and functional annotations into one table for each sample
+### 19. Combining gene-level coverage, taxonomy, and functional annotations into one table for each sample
 > **Notes**  
 > Just uses `paste`, `sed`, and `awk`, all are standard in any Unix-like environment.  
 
@@ -1386,7 +1407,7 @@ rm sample-1*tmp sample-1-gene-coverages.tsv sample-1-annotations.tsv sample-1-ge
 
 ---
 
-### 12. Combining contig-level coverage and taxonomy into one table for each sample
+### 20. Combining contig-level coverage and taxonomy into one table for each sample
 > **Notes**  
 > Just uses `paste`, `sed`, and `awk`, all are standard in any Unix-like environment.  
 
@@ -1417,13 +1438,13 @@ rm sample-1*tmp sample-1-contig-coverages.tsv sample-1-contig-tax-out.tsv
 
 ---
 
-### 13. Generating normalized, gene- and contig-level coverage summary tables of KO-annotations and taxonomy across samples
+### 21. Generating normalized, gene- and contig-level coverage summary tables of KO-annotations and taxonomy across samples
 
 > **Notes**  
 > * To combine across samples to generate these summary tables, we need the same "units". This is done for annotations based on the assigned KO terms, and all non-annotated functions are included together as "Not annotated". It is done for taxonomic classifications based on taxids (full lineages included in the table), and any not classified are included together as "Not classified". 
 > * The values we are working with are coverage per gene (so they are number of bases recruited to the gene normalized by the length of the gene). These have been normalized by making the total coverage of a sample 1,000,000 and setting each individual gene-level coverage its proportion of that 1,000,000 total. So basically percent, but out of 1,000,000 instead of 100 to make the numbers more friendly. 
 
-#### 13a. Generating gene-level coverage summary tables
+#### 21a. Generating gene-level coverage summary tables
 
 ```
 bit-GL-combine-KO-and-tax-tables *-gene-coverage-annotation-and-tax.tsv -o Combined
@@ -1448,7 +1469,7 @@ bit-GL-combine-KO-and-tax-tables *-gene-coverage-annotation-and-tax.tsv -o Combi
 - **Combined-gene-level-taxonomy-coverages_GLmetagenomics.tsv** (table with all samples combined based on gene-level taxonomic classifications)
 
 
-#### 13b. Generating contig-level coverage summary tables
+#### 21b. Generating contig-level coverage summary tables
 
 ```
 bit-GL-combine-contig-tax-tables *-contig-coverage-and-tax.tsv -o Combined
@@ -1473,9 +1494,9 @@ bit-GL-combine-contig-tax-tables *-contig-coverage-and-tax.tsv -o Combined
 
 ---
 
-### 14. **M**etagenome-**A**ssembled **G**enome (MAG) recovery
+### 22. **M**etagenome-**A**ssembled **G**enome (MAG) recovery
 
-#### 14a. Binning contigs
+#### 22a. Binning contigs
 ```
 jgi_summarize_bam_contig_depths --outputDepth sample-1-metabat-assembly-depth.tsv --percentIdentity 97 --minContigLength 1000 --minContigDepth 1.0  --referenceFasta sample-1-assembly.fasta sample-1.bam
 
@@ -1511,7 +1532,7 @@ zip -r sample-1-bins.zip sample-1-bins
 - sample-1-bins/sample-1-bin\*.fasta (fasta files of recovered bins)
 - **sample-1-bins.zip** (zip file containing fasta files of recovered bins)
 
-#### 14b. Bin quality assessment
+#### 22b. Bin quality assessment
 Utilizes the default `checkm` database available [here](https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz), `checkm_data_2015_01_16.tar.gz`.
 
 ```
@@ -1536,7 +1557,7 @@ checkm lineage_wf -f bins-overview_GLmetagenomics.tsv --tab_table -x fa ./ check
 - **bins-overview_GLmetagenomics.tsv** (tab-delimited file with quality estimates per bin)
 - checkm-output-dir (directory holding detailed checkm outputs)
 
-#### 14c. Filtering MAGs
+#### 22c. Filtering MAGs
 
 ```
 cat <( head -n 1 bins-overview_GLmetagenomics.tsv ) \
@@ -1572,7 +1593,7 @@ done
 - **\*-MAGs.zip** (zip files containing directories of high-quality MAGs)
 
 
-#### 14d. MAG taxonomic classification
+#### 22d. MAG taxonomic classification
 Uses default `gtdbtk` database setup with program's `download.sh` command.
 
 ```
@@ -1595,7 +1616,7 @@ gtdbtk classify_wf --genome_dir MAGs/ -x fa --out_dir gtdbtk-output-dir  --skip_
 
 - gtdbtk-output-dir/gtdbtk.\*.summary.tsv (files with assigned taxonomy and info)
 
-#### 14e. Generating overview table of all MAGs
+#### 22e. Generating overview table of all MAGs
 
 ```bash
 # combine summaries
@@ -1650,9 +1671,9 @@ cat MAGs-overview-header.tmp MAGs-overview-sorted.tmp \
 
 ---
 
-### 15. Generating MAG-level functional summary overview
+### 23. Generating MAG-level functional summary overview
 
-#### 15a. Getting KO annotations per MAG
+#### 23a. Getting KO annotations per MAG
 This utilizes the helper script [`parse-MAG-annots.py`](../Workflow_Documentation/NF_MGIllumina/workflow_code/bin/parse-MAG-annots.py) 
 
 ```bash
@@ -1693,7 +1714,7 @@ done
 - **MAG-level-KO-annotations_GLmetagenomics.tsv** (tab-delimited table holding MAGs and their KO annotations)
 
 
-#### 15b. Summarizing KO annotations with KEGG-Decoder
+#### 23b. Summarizing KO annotations with KEGG-Decoder
 
 ```bash
 KEGG-decoder -v interactive -i MAG-level-KO-annotations_GLmetagenomics.tsv -o MAG-KEGG-Decoder-out_GLmetagenomics.tsv
@@ -1719,158 +1740,3 @@ KEGG-decoder -v interactive -i MAG-level-KO-annotations_GLmetagenomics.tsv -o MA
 
 <br>
 ---
-
-## Read-based processing
-### 16. Taxonomic and functional profiling
-The following uses the `humann` and `metaphlan` reference databases downloaded on 13-Jun-2024 as follows:
-
-```bash
-humann_databases --download chocophlan full
-humann_databases --download uniref uniref90_diamond 
-humann_databases --download utility_mapping full 
-metaphlan --install
-```
-
-#### 16a. Running humann (which also runs metaphlan)
-```bash
-  # forward and reverse reads need to be provided combined if paired-end (if not paired-end, single-end reads are provided to the --input argument next)
-cat sample-1_R1_filtered.fastq.gz sample-1_R2_filtered.fastq.gz > sample-1-combined.fastq.gz
-
-humann --input sample-1-combined.fastq.gz --output sample-1-humann3-out-dir --threads NumberOfThreads \
-       --output-basename sample-1 --metaphlan-options "--unknown_estimation --add_viruses \
-       --sample_id sample-1"
-```
-
-**Parameter Definitions:**  
-
-- `--input` – specifies the input combined forward and reverse reads (if paired-end)
-
-- `--output` – specifies output directory
-
-- `--threads` – specifies the number of threads to use
-
-- `--output-basename` – specifies prefix of the output files
-
-- `--metaphlan-options` – options to be passed to metaphlan
-	- `--unknown_estimation` – include unclassified in estimated relative abundances
-	- `--add_viruses` – include viruses in the reference database
-	- `--sample_id` – specifies the sample identifier we want in the table (rather than full filename)
-
-
-#### 16b. Merging multiple sample functional profiles into one table
-```bash
-  # they need to be in their own directories
-mkdir genefamily-results/ pathabundance-results/ pathcoverage-results/
-
-  # copying results from previous running humann3 step (16a) to get them all together in their own directories (as is needed)
-cp *-humann3-out-dir/*genefamilies.tsv genefamily-results/
-cp *-humann3-out-dir/*abundance.tsv pathabundance-results/
-cp *-humann3-out-dir/*coverage.tsv pathcoverage-results/
-
-humann_join_tables -i genefamily-results/ -o gene-families.tsv
-humann_join_tables -i pathabundance-results/ -o path-abundances.tsv
-humann_join_tables -i pathcoverage-results/ -o path-coverages.tsv
-```
-
-**Parameter Definitions:**  
-
-- `-i` – the directory holding the input tables
-
-- `-o` – the name of the output combined table
-
-
-#### 16c. Splitting results tables
-The read-based functional annotation tables have taxonomic info and non-taxonomic info mixed together initially. `humann` comes with a helper script to split these. Here we are using that to generate both non-taxonomically grouped functional info files and taxonomically grouped ones.
-
-```bash
-humann_split_stratified_table -i gene-families.tsv -o ./
-mv gene-families_stratified.tsv Gene-families-grouped-by-taxa_GLmetagenomics.tsv
-mv gene-families_unstratified.tsv Gene-families_GLmetagenomics.tsv
-
-humann_split_stratified_table -i path-abundances.tsv -o ./
-mv path-abundances_stratified.tsv Path-abundances-grouped-by-taxa_GLmetagenomics.tsv
-mv path-abundances_unstratified.tsv Path-abundances_GLmetagenomics.tsv
-
-humann2_split_stratified_table -i path-coverages.tsv -o ./
-mv path-coverages_stratified.tsv Path-coverages-grouped-by-taxa_GLmetagenomics.tsv
-mv path-coverages_unstratified.tsv Path-coverages_GLmetagenomics.tsv
-```
-
-**Parameter Definitions:**  
-
-- `-i` – the input combined table
-
-- `-o` – output directory (here specifying current directory)
-
-
-#### 16d. Normalizing gene families and pathway abundance tables
-This generates some normalized tables of the read-based functional outputs from humann that are more readily suitable for across sample comparisons.
-
-```bash
-humann_renorm_table -i Gene-families_GLmetagenomics.tsv -o Gene-families-cpm_GLmetagenomics.tsv --update-snames
-humann_renorm_table -i Path-abundances_GLmetagenomics.tsv -o Path-abundances-cpm_GLmetagenomics.tsv --update-snames
-```
-
-**Parameter Definitions:**  
-
-- `-i` – the input combined table
-
-- `-o` – name of the output normalized table
-
-- `--update-snames` – change suffix of column names in tables to "-CPM"
-
-
-#### 16e. Generating a normalized gene-family table that is grouped by Kegg Orthologs (KOs)
-
-```bash
-humann_regroup_table -i Gene-families_GLmetagenomics.tsv -g uniref90_ko | humann_rename_table -n kegg-orthology | \
-                     humann_renorm_table -o Gene-families-KO-cpm_GLmetagenomics.tsv --update-snames
-```
-
-**Parameter Definitions:**  
-
-- `-i` – the input table
-
-- `-g` – the map to use to group uniref IDs into Kegg Orthologs
-
-- `|` – sending that output into the next humann command to add human-readable Kegg Orthology names
-
-- `-n` – specifying we are converting Kegg orthology IDs into Kegg orthology human-readable names
-
-- `|` – sending that output into the next humann command to normalize to copies-per-million
-
-- `-o` – specifying the final output file name
-
--  `--update-snames` – change suffix of column names in tables to "-CPM"
-
-#### 16f. Combining taxonomy tables
-
-```bash
-merge_metaphlan_tables.py *-humann3-out-dir/*_humann_temp/*_metaphlan_bugs_list.tsv > Metaphlan-taxonomy_GLmetagenomics.tsv
-```
-
-**Parameter Definitions:**  
-
-*	input metaphlan tables are provided as position arguments (produced during humann3 run in [step 16a](#16a-running-humann-which-also-runs-metaphlan)
-
--  `>` – output is redirected from stdout to a file
-
-
-**Input data:**
-
-- *fastq.gz (filtered/trimmed reads from [step 2](#2-quality-filteringtrimming), forward and reverse reads concatenated if paired-end)
-
-**Output data:**
-
-- **Gene-families_GLmetagenomics.tsv** (gene-family abundances) 
-- **Gene-families-grouped-by-taxa_GLmetagenomics.tsv** (gene-family abundances grouped by taxa)
-- **Gene-families-cpm_GLmetagenomics.tsv** (gene-family abundances normalized to copies-per-million)
-- **Gene-families-KO-cpm_GLmetagenomics.tsv** (KO term abundances normalized to copies-per-million)
-- **Pathway-abundances_GLmetagenomics.tsv** (pathway abundances)
-- **Pathway-abundances-grouped-by-taxa_GLmetagenomics.tsv** (pathway abundances grouped by taxa)
-- **Pathway-abundances-cpm_GLmetagenomics.tsv** (pathway abundances normalized to copies-per-million)
-- **Pathway-coverages_GLmetagenomics.tsv** (pathway coverages)
-- **Pathway-coverages-grouped-by-taxa_GLmetagenomics.tsv** (pathway coverages grouped by taxa)
-- **Metaphlan-taxonomy_GLmetagenomics.tsv** (metaphlan estimated taxonomic relative abundances)
-
----

From 5821e1d73e849252fdeebb680c8de9ba01841974 Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Thu, 18 Sep 2025 20:51:55 -0700
Subject: [PATCH 03/47] Added read-based decontamination

---
 .../Low_Biomass/Nanopore/GL-DPPD-XXXX.md      | 938 ++++++++++++++++--
 1 file changed, 843 insertions(+), 95 deletions(-)

diff --git a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
index a14bdc2db..b8d24b775 100644
--- a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
+++ b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
@@ -59,6 +59,7 @@ Barbara Novak (GeneLab Data Processing Lead)
       - [9d. Generate combined Krona chart](#9d-generate-combined-krona-chart)
       - [9e. Compute per-sample taxon level summaries](#9e-compute-taxon-level-summaries-for-each-sample)
       - [9f. Compile taxon level summaries](#9f-compile-kaiju-taxonomy-results)
+      - [9e. Process Kaiju output]()
     - [10. Taxonomic and functional profiling using Kraken2](#10-taxonomic-and-functional-profiling-using-kraken2)
       - [10a. Taxonomic Classification](#10a-taxonomic-classification)
       - [10b. Combine Kraken2 reports](#10b-combine-kraken2-reports)
@@ -66,6 +67,12 @@ Barbara Novak (GeneLab Data Processing Lead)
       - [10c. Generate per sample Krona charts](#10d-generate-per-sample-krona-charts)
       - [10d. Generate combined Krona chart](#10e-generate-combined-krona-chart)
       - [10e. Compile Kraken2 Summary QC](#10f-compile-kraken2-summary-qc)
+      - [10f. Process Kraken2 output]()
+    - [11. Taxonomy plots]()
+      - [11a. Per-sample]()
+      - [11b. Combined]()
+    - [12. Read-based Feature Table Decontamination]()
+      - [12a. Kaiju output conversion]()
   - [**Assembly-based processing**](#assembly-based-processing)
     - [11. Sample assembly](#11-sample-assembly)
     - [12. Polish assembly](#12-polish-assembly)
@@ -80,6 +87,22 @@ Barbara Novak (GeneLab Data Processing Lead)
     - [21. Generating normalized, gene- and contig-level coverage summary tables of KO-annotations and taxonomy across samples](#21-generating-normalized-gene--and-contig-level-coverage-summary-tables-of-ko-annotations-and-taxonomy-across-samples)
     - [22. **M**etagenome-**A**ssembled **G**enome (MAG) recovery](#22-metagenome-assembled-genome-mag-recovery)
     - [23. Generating MAG-level functional summary overview](#23-generating-mag-level-functional-summary-overview)
+  - [**Feature Table Decontamination**](#feature-table-decontamination)
+    - [24. R Environment Setup](#24-r-environment-setup)
+      - [24a. Load libraries](#24a-load-libraries)
+      - [24b. Define Custom Functions](#24b-define-custom-functions)
+      - [24c. Set Variables](#24c-set-variables)
+      - [24d. Import kaiju taxonomy data](#24d-import-kaiju-taxonomy-data)
+      - [24e. Import kraken2 taxonomy data](#24e-import-kraken2-taxonomy-data)
+      - [24f. Import sample metadata](#24f-import-sample-metadata)
+    - [25. Read-based processing feature-table decontamination](#25-read-based-processing-feature-table-decontamination)
+      - [25a. Taxonomy filtering](#25a-taxonomy-filtering)
+      - [25b. Decontamination](#25b-decontamination-with-decontam)
+        - [25b.i. Setup Variables](#25bi-setup-variables)
+        - [25b.ii. Identify prevalence of contaminant sequences](#25bii-identify-prevalence-of-contaminant-sequences)
+        - [25b.iii. Decontaminated taxonomy plots](#25biii-decontaminated-taxonomy-plots)
+    - [26. Assembly-based processing decontamination](#26-assembly-based-processing-decontamination)
+
 
 ---
 
@@ -90,12 +113,12 @@ Barbara Novak (GeneLab Data Processing Lead)
 |bbduk| 38.86 |[https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/)|
 |CAT| 5.2.3 |[https://github.com/dutilh/CAT#cat-and-bat](https://github.com/dutilh/CAT#cat-and-bat)|
 |CheckM| 1.1.3 |[https://github.com/Ecogenomics/CheckM](https://github.com/Ecogenomics/CheckM)|
-|Decontam| | |
 |Dorado| 1.1.1| [https://github.com/nanoporetech/dorado](https://github.com/nanoporetech/dorado)|
 |Flye| 2.9.5 | [https://github.com/mikolmogorov/Flye](https://github.com/mikolmogorov/Flye) |
 |GTDB-Tk| 2.4.0 |[https://github.com/Ecogenomics/GTDBTk](https://github.com/Ecogenomics/GTDBTk)|
 |Kaiju| 1.10.1 | [https://bioinformatics-centre.github.io/kaiju/](https://bioinformatics-centre.github.io/kaiju/) |
-|KOFamScan| 1.3.0 |[https://github.com/takaram/kofam_scan#kofamscan](https://github.com/takaram/kofam_scan#kofamscan)|
+|KEGG-Decoder| 1.2.2 |[https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder](https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder)|
+|KOFamScan| 1.3.0 |[https://github.com/takaram/kofam_scan](https://github.com/takaram/kofam_scan)|
 |Kraken2| 2.1.6 | [https://github.com/DerrickWood/kraken2](https://github.com/DerrickWood/kraken2) |
 |KrakenTools | 1.2 | [https://ccb.jhu.edu/software/krakentools/](https://ccb.jhu.edu/software/krakentools/) |
 |Krona| 2.8.1 | [https://github.com/marbl/Krona/wiki](https://github.com/marbl/Krona/wiki)|
@@ -103,14 +126,21 @@ Barbara Novak (GeneLab Data Processing Lead)
 |Minimap2| 2.2.8 | [https://github.com/lh3/minimap2](https://github.com/lh3/minimap2) |
 |MultiQC| 1.27.1 |[https://multiqc.info/](https://multiqc.info/)|
 |Medaka| 2.0.1 | [https://github.com/nanoporetech/medaka](https://github.com/nanoporetech/medaka) |
-|MEGAHIT| 1.2.9 |[https://github.com/voutcn/megahit#megahit](https://github.com/voutcn/megahit#megahit)|
 |NanoPlot| 1.44.1 | [https://github.com/wdecoster/NanoPlot](https://github.com/wdecoster/NanoPlot)|
 |Porechop| 0.2.4 | [https://github.com/rrwick/Porechop](https://github.com/rrwick/Porechop) |
 |Prodigal| 2.6.3 |[https://github.com/hyattpd/Prodigal#prodigal](https://github.com/hyattpd/Prodigal#prodigal)|
 |samtools| 1.20 |[https://github.com/samtools/samtools#samtools](https://github.com/samtools/samtools#samtools)|
-|KEGG-Decoder| 1.2.2 |[https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder](https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder)
-|HUMAnN| 3.9 |[https://github.com/biobakery/humann](https://github.com/biobakery/humann)|
-|MetaPhlAn| 4.1.0 |[https://github.com/biobakery/MetaPhlAn](https://github.com/biobakery/MetaPhlAn)|
+| R | 4.5.1 | [https://www.r-project.org](https://www.r-project.org) |
+|Bioconductor | 3.21 | [https://www.bioconductor.org](https://www.bioconductor.org) |
+|decontam| 1.28.0 | [https://www.bioconductor.org/packages/release/bioc/html/decontam.html](https://www.bioconductor.org/packages/release/bioc/html/decontam.html) |
+|DT| 0.34.0 | [https://cran.r-project.org/web/packages/DT/index.html](https://cran.r-project.org/web/packages/DT/index.html) |
+|glue| 1.8.0 | [https://cran.r-project.org/web/packages/glue/index.html](https://cran.r-project.org/web/packages/glue/index.html) |
+|optparse| 1.7.5 |[https://cran.r-project.org/web/packages/optparse/index.html](https://cran.r-project.org/web/packages/optparse/index.html) |
+|pavian| 1.2.1 | [https://github.com/fbreitwieser/pavian](https://github.com/fbreitwieser/pavian) |
+|pheatmap| 1.0.13 | [https://cran.r-project.org/package=pheatmap](https://cran.r-project.org/package=pheatmap) |
+|phyloseq| 1.52.0 | [https://www.bioconductor.org/packages/release/bioc/html/phyloseq.html](https://www.bioconductor.org/packages/release/bioc/html/phyloseq.html) |
+|plotly| 4.11.0 | [https://cran.r-project.org/web/packages/plotly/index.html](https://cran.r-project.org/web/packages/plotly/index.html) |
+|tidyverse| 2.0.0 | [https://www.tidyverse.org](https://www.tidyverse.org) |
 
 ---
 
@@ -589,7 +619,7 @@ kraken2 --db kraken2_host_db --gzip-compressed --threads NumberOfThreads --use-n
 
 ### 9. Taxonomic and Functional Profiling using Kaiju
 
-#### 9a. Taxonomic Classification
+#### 9a. Kaiju Taxonomic Classification
 ```
 kaiju -f kaiju_db.fmi -t nodes.dmp \
     -z NumberOfThreads \
@@ -615,68 +645,8 @@ kaiju -f kaiju_db.fmi -t nodes.dmp \
 
 - sample_kaiju.out (kaiju output file)
 
-#### 9b. Convert Kaiju Output to Krona Format
-```
-kaiju2krona -u -n ${NAMES} -t nodes.dmp \
-	-i sample_kaiju.out \
-	-o sample.krona
-```
-
-**Parameter Definitions:**
-
-- `-u` - include count for unclassified reads in output
-- `-n` - specifies path to the Kaiju names.dmp file
-- `-t` - specifies path to the Kaiju nodes.dmp file
-- `-i` - specifies path to the Kaiju output file (output from [Step 9ai](#9ai-read-taxonomic-classification-using-kaiju))
-- `-o` - specifies the name of krona formatted kaiju output file
-
-**Input data:**
-
-- sample_kaiju.out (kaiju output file, output from [Step 9ai](#9ai-read-taxonomic-classification-using-kaiju))
-
-**Output data:**
-
-- sample.krona (krona formatted kaiju output)
-
-#### 9c. Generate per sample Krona charts
-
-```bash
-ktImportText -o sample_krona.html sample.krona
-```
-
-**Parameter Definitions:**
-
-- `-o` - specifies the name of the krona output html file
-- `sample.krona` - positional argument specifying the krona text file for each sample
-
-**Input Data:**
-
-- sample.krona (krona formatted kaiju output from [Step 9aii](#9aii-convert-kaiju-output-to-krona-format))
-
-**Output Data:**
-
-- **sample_krona.html** (per-sample Krona charts in html format)
 
-#### 9d. Generate combined Krona chart
-
-```bash
-ktImportText -o kaiju_report.html *.krona
-```
-
-**Parameter Definitions:**
-
-- `-o` - specifies the name of the krona output html file
-- `*.krona` - positional argument specifying krona formatted text files for all samples
-
-**Input Data:**
-
-- *.krona (krona formatted kaiju output files from [Step 9aii](#9aii-convert-kaiju-output-to-krona-format))
-
-**Output Data:**
-
-- **kaiju_report.html** (per-sample Krona charts in html format)
-
-#### 9e. Compute per-sample taxon level summaries
+#### 9e. Kaiju per-sample taxon level summaries
 
 ```bash
 # Get taxon level information for each sample
@@ -736,6 +706,29 @@ for TAXON_LEVEL in (phylum class order family genus species); do
 - **merged_kaiju_summary_genus.tsv** (Compiled kaiju outputs at the genus taxon level)
 - **merged_kaiju_summary_species.tsv** (Compiled kaiju outputs at the species taxon level)
 
+#### 9b. Convert Kaiju Output to Krona Format
+```bash
+kaiju2krona -u -n ${NAMES} -t nodes.dmp \
+	-i sample_kaiju.out \
+	-o sample.krona
+```
+
+**Parameter Definitions:**
+
+- `-u` - include count for unclassified reads in output
+- `-n` - specifies path to the Kaiju names.dmp file
+- `-t` - specifies path to the Kaiju nodes.dmp file
+- `-i` - specifies path to the Kaiju output file (output from [Step 9a](#9a-kaiju-taxonomic-classification))
+- `-o` - specifies the name of krona formatted kaiju output file
+
+**Input data:**
+
+- sample_kaiju.out (kaiju output file, output from [Step 9a](#9a-kaiju-taxonomic-classification))
+
+**Output data:**
+
+- sample.krona (krona formatted kaiju output)
+
 ---
 
 ### 10. Taxonomic and Functional Profiling using Kraken2
@@ -794,6 +787,28 @@ combine_kreports.py --output merged-kraken-table.tsv \
 
 - **merged-kraken-table.tsv**  (merged Kraken2 output in tab-delimited format)
 
+#### 10f. Compile Kraken2 Summary QC
+
+```bash 
+multiqc -o kraken_multiqc_report -n kraken_multiqc --interactive /path/to/kraken2_output/
+```
+
+**Parameter Definitions:**
+
+- `-o` - the output directory to store results
+- `-n` - the filename prefix of results
+- `--interactive` - force multiqc to always create interactive javascript plots
+- `/path/to/kraken2_output/` - the directory holding the output data from the Kraken2 run, provided as a positional argument
+
+**Input data:**
+
+- /path/to/kraken2_output/*kraken2-report.tsv (Kraken2 output data, from [Step 10a](#10a-taxonomic-classification))
+
+**Output data:**
+
+- **kraken_multiqc.html** (multiqc output html summary)
+- **kraken_multiqc_data.zip** (zip archive containing multiqc output data)
+
 #### 10c. Convert Kraken2 output to Krona format
 
 ```bash
@@ -813,7 +828,12 @@ kreport2krona.py --report-file sample-kraken2-report.tsv  --output sample.krona
 
 - sample.krona (krona formatted kraken2 output)
 
-#### 10d. Generate per sample Krona charts
+
+---
+
+### 11. Taxonomy Plots
+
+#### 11a. Generate per sample Krona charts
 
 ```bash
 ktImportText -o sample_krona.html sample.krona
@@ -826,7 +846,7 @@ ktImportText -o sample_krona.html sample.krona
 
 **Input Data:**
 
-- sample.krona (krona formatted kraken2 output from [Step 10c](#10c-convert-kraken2-output-to-krona-format)
+- sample.krona (krona formatted kaiju or kraken2 output from [Step 9b](#9b-convert-kaiju-output-to-krona-format) or [Step 10c](#10c-convert-kraken2-output-to-krona-format))
 
 **Output Data:**
 
@@ -835,43 +855,23 @@ ktImportText -o sample_krona.html sample.krona
 #### 10e. Generate combined Krona chart
 
 ```bash
-ktImportText -o kraken_report.html *.krona
+ktImportText -o ${classification_type}_krona_report.html ${input_dir}/*.krona
 ```
 
 **Parameter Definitions:**
 
 - `-o` - specifies the name of the krona output html file
+- `${input_dir}` - shell variable holding the path to the directory containing the krona files
+- `${classification_type}` - shell variable naming the tool used to create the taxonomic classification (kaiju or kraken2)
 - `*.krona` - positional argument specifying krona formatted text files for all samples
 
 **Input Data:**
 
-- *.krona (krona formatted kaiju output from [Step 10c](#10c-convert-kraken2-output-to-krona-format)
+- *.krona (krona formatted kaiju or kraken2 output from [Step 9b](#9b-convert-kaiju-output-to-krona-format) or [Step 10c](#10c-convert-kraken2-output-to-krona-format))
 
 **Output Data:**
 
-- **kraken_report.html** (per-sample Krona charts in html format)
-
-#### 10f. Compile Kraken2 Summary QC
-
-```bash 
-multiqc -o kraken_multiqc_report -n kraken_multiqc --interactive /path/to/kraken2_output/
-```
-
-**Parameter Definitions:**
-
--	`-o` – the output directory to store results
--	`-n` – the filename prefix of results
-- `--interactive` - force multiqc to always create interactive javascript plots
--	`/path/to/kraken2_output/` – the directory holding the output data from the Kraken2 run, provided as a positional argument
-
-**Input data:**
-
-- /path/to/kraken2_output/*kraken2-report.tsv (Kraken2 output data, from [Step 10a](#10a-taxonomic-classification))
-
-**Output data:**
-
-- **kraken2_multiqc.html** (multiqc output html summary)
-- **kraken2_multiqc_data.zip** (zip archive containing multiqc output data)
+- **${classification_type}_krona_report.html** (combined Krona chart containing all samples, in html format)
 
 ---
 
@@ -1740,3 +1740,751 @@ KEGG-decoder -v interactive -i MAG-level-KO-annotations_GLmetagenomics.tsv -o MA
 
 <br>
 ---
+
+## Read-based Feature Table Decontamination
+> Feature table decontamination is performed in R.  
+
+### 24. R Environment Setup
+
+#### 24a. Load libraries
+
+```R
+library(decontam)
+library(phyloseq)
+library(tidyverse)
+library(DT)
+library(plotly)
+library(glue)
+library(pheatmap)
+library(pavian)
+```
+
+#### 24b. Define Custom Functions
+
+##### get_last_assignment()
+<details>
+  <summary>retrieves the last taxonomy assignment from a taxonomy string</summary>
+
+  ```R
+  get_last_assignment <- function(taxonomy_string, split_by=';', remove_prefix=NULL){
+    # A function to get the last taxonomy assignment from a taxonomy string 
+    split_names <- strsplit(x =  taxonomy_string , split = split_by) %>% 
+      unlist()
+    
+    level_name <- split_names[[length(split_names)]]
+    
+    if(level_name == "_"){
+      return(taxonomy_string)
+    }
+    
+    if(!is.null(remove_prefix)){
+      level_name <- gsub(pattern = remove_prefix, replacement = '', x = level_name)
+    }
+    
+    return(level_name)
+  }
+  ```
+  **Function Parameter Definitions:**
+  - `taxonomy_string` - a character string containing a list of taxonomy assignments
+  - `split_by=` - a character string containing a regular expression used to split the `taxonomy_string`
+  - `remove_prefix=` - a character string containing a regular expression to be matched and removed, default=`NULL`
+
+  **Returns:** the last taxonomy assignment listed in the `taxonomy_string`
+</details>
+
+##### mutate_taxonomy()
+<details>
+  <summary>ensure that the taxonomy column is named "taxonomy" and aggregate duplicates to ensure that taxonomy names are unique</summary>
+
+  ```R
+  mutate_taxonomy <- function(df, taxonomy_column="taxonomy"){
+    
+    # make sure that the taxonomy column is always named taxonomy
+    col_index <- which(colnames(df) == taxonomy_column)
+    colnames(df)[col_index] <- 'taxonomy'
+    df <- df %>% dplyr::mutate(across( where(is.numeric), \(x) tidyr::replace_na(x,0)  ) )%>% 
+      dplyr::mutate(taxonomy=map_chr(taxonomy,.f = function(taxon_name=.x){
+        last_assignment <- get_last_assignment(taxon_name) 
+        last_assignment  <- gsub(pattern = "\\[|\\]|'", replacement = '',x = last_assignment)
+        trimws(last_assignment, which = "both")
+      })) %>% 
+      as.data.frame(check.names = FALSE, stringsAsFactors = FALSE)
+    # Ensure the taxonomy names are unique by aggregating duplicates
+    df <- aggregate(.~taxonomy,data = df, FUN = sum)
+    return(df)
+  }
+  ```
+  **Function Parameter Definitions:**
+  - `df` - a dataframe containing the taxonomy assignments
+  - `taxonomy_column=` - name of the column in `df` containing the taxonomy assignments, default="taxonomy"
+
+  **Returns:** a dataframe with unique taxonomy names stored in a column named "taxonomy"
+
+</details>
+
+##### process_kaiju_table()
+<details>
+  <summary>reformat kaiju output table</summary>
+
+  ```R
+  process_kaiju_table <- function(file_path, taxon_col="taxon_path",
+                                  kingdom=NULL, remove_non_microbial = TRUE){
+    
+    kaija_table <- read_delim(file = file_path,
+                              delim = "\t",
+                              col_names = TRUE)
+    
+    if(remove_non_microbial){
+      # Remove unclassified/unassigned and non-microbial assignments (animal and plant lineages, e.g. Metazoa, Streptophyta)
+      non_microbial_indices <- grep(pattern = "unclassified|assigned|Metazoa|Chordata|Nematoda|Arthropoda|Annelida|Brachiopoda|Mollusca|Cnidaria|Streptophyta",
+                                    x = kaija_table[[taxon_col]])
+      
+      if(!is_empty(non_microbial_indices)){
+        kaija_table <- kaija_table[-non_microbial_indices,]
+      }
+      
+    }
+    
+    if(!is.null(kingdom)){
+      kingdom_indices <- grep(pattern = kingdom ,
+                              x = kaija_table[[taxon_col]])
+      if(!is_empty(kingdom_indices)){
+        kaija_table <- kaija_table[kingdom_indices,]
+      }
+    }
+    
+    
+    abs_abun_df <- pivot_wider(data = kaija_table %>% dplyr::select(sample,reads,taxonomy=!!sym(taxon_col)), 
+                              names_from = "sample", values_from = "reads",
+                              names_sort = TRUE) %>% mutate_taxonomy
+    
+    rel_abun_df <- pivot_wider(data = kaija_table %>% dplyr::select(sample,percent,taxonomy=!!sym(taxon_col)), 
+                              names_from = "sample", values_from = "percent",
+                              names_sort = TRUE) %>% mutate_taxonomy
+    
+    # Set the taxon names as row names, drop the taxonomy column and convert to a matrix
+    rownames(abs_abun_df) <- abs_abun_df[,"taxonomy"]
+    rownames(rel_abun_df) <- rel_abun_df[,"taxonomy"]
+    
+    abs_abun_df <- abs_abun_df[,-(which(colnames(abs_abun_df) == "taxonomy"))]
+    rel_abun_df <- rel_abun_df[,-(which(colnames(rel_abun_df) == "taxonomy"))]
+    
+    abs_abun_matrix <- as.matrix(abs_abun_df)
+    rel_abun_matrix <- as.matrix(rel_abun_df)
+    
+    final_tables <- list("relative_table"=rel_abun_matrix,
+                        "abundance_table"=abs_abun_matrix)
+    return(final_tables)
+    
+  }
+  ```
+  **Function Parameter Definitions:**
+  - `file_path` - file path to the tab-delimited kaiju output table file
+  - `taxon_col=`- name of the taxon column in the input data file, default="taxon_path"
+  - `kingdom=` - a character string containing a regular expression used to filter for specific kingdoms, default=`NULL`
+  - `remove_non_microbial=` - a boolean specifying whether or not to remove non-microbial and unclassified assignments, default=`TRUE`
+
+  **Returns:** a list of two matrices with taxa as rows and samples as columns: `relative_table` (percent abundance) and `abundance_table` (read counts)
+
+</details>
+
+##### create_dt()
+<details>
+  <summary>create an HTML widget to display rectangular data (`matrix` or `dataframe`) using the DataTables Javascript library</summary>
+
+```R
+create_dt <- function(table2show, caption=NULL) {
+  DT::datatable(table2show,
+                rownames = FALSE, # remove row numbers
+                filter = "top", # add filter on top of columns
+                extensions = "Buttons", # add download buttons
+                caption=caption,
+                options = list(
+                  autoWidth = TRUE,
+                  dom = "Blfrtip", # location of the download buttons
+                  buttons = c("copy", "csv", "excel", "pdf", "print"), # download buttons
+                  pageLength = 5, # show first 5 entries, default is 10
+                  order = list(0, "asc") # sort by the first column in ascending order
+                ),
+                escape = FALSE # render HTML in cells so that URLs are clickable
+  )
+}
+```
+**Function Parameter Definitions:**
+- `table2show` - a `matrix` or `dataframe` containing tabular data to display
+- `caption=` - a character vector to use as the caption for the table
+
+</details>
+
+##### filter_rare()
+<details>
+  <summary>filter out rare and non_microbial taxonomy assignments</summary>
+
+  ```R
+  filter_rare <- function(species_table, non_microbial, threshold=1){
+    
+    clean_tab_count  <-  species_table %>% 
+      filter(str_detect(Species, non_microbial, negate = TRUE))  
+    
+    clean_tab <- clean_tab_count %>% 
+      mutate( across( where(is.numeric)  , \(x) (x/sum(x, na.rm = TRUE))*100 ) )
+    
+    rownames(clean_tab) <- clean_tab$Species
+    clean_tab  <- clean_tab[,-1] 
+    
+    
+    # Get species with relative abundance below `threshold` percent in all samples
+    rare_species <- map(clean_tab, .f = \(col) rownames(clean_tab)[col < threshold])
+    rare <- Reduce(intersect, rare_species)
+    
+    rownames(clean_tab_count) <- clean_tab_count$Species
+    clean_tab_count  <- clean_tab_count[,-1] 
+    
+    abund_table <- clean_tab_count[!(rownames(clean_tab_count) %in% rare), ]
+    
+    return(abund_table)
+  }
+  ```
+  **Function Parameter Definitions:**
+  - `species_table` - the dataframe to filter
+  - `non_microbial` - a character vector denoting the string used to identify a species as non-microbial
+  - `threshold=` - relative abundance threshold, as a percentage between 0 and 100, below which a species is considered rare, default=1
+
+  **Returns:** a dataframe of read counts with non-microbial assignments and species that are rare in all samples removed
+</details>
+
+
+##### make_plot()
+<details>
+  <summary>create bar plot of relative abundance</summary>
+
+  ```R
+  # Make bar plot
+  make_plot <- function(abund_table, metadata, colors2use, publication_format){
+    
+    abund_table_wide <- abund_table %>% 
+        as.data.frame() %>% 
+        rownames_to_column("Sample_ID") %>% 
+        inner_join(metadata) %>% 
+        select(!!!colnames(metadata), everything()) %>% 
+        mutate(Sample_ID = Sample_ID %>% str_remove("barcode"))
+        
+    abund_table_long <- abund_table_wide  %>%
+        pivot_longer(-colnames(metadata), 
+                    names_to = "Species",
+                    values_to = "relative_abundance")
+      
+    p <- ggplot(abund_table_long, mapping = aes(x=Sample_ID, y=relative_abundance, fill=Species)) +
+         geom_col() +
+         scale_fill_manual(values = colors2use) + 
+         labs(x=NULL, y="Relative Abundance (%)") + 
+         publication_format
+
+    return(p)
+  }
+  ```
+  **Function Parameter Definitions:**
+  - `abund_table` - a matrix or dataframe of relative abundances to plot, with samples as rows and species as columns
+  - `metadata` - a dataframe of sample metadata that includes a `Sample_ID` column, joined to `abund_table` by sample
+  - `colors2use` - a vector of strings specifying a custom color palette for coloring plots
+  - `publication_format` - a ggplot::theme object specifying the custom theme for plotting
+
+  **Returns:** a ggplot bar plot
+
+</details>
+
+##### get_colors2use()
+<details>
+  <summary>get colors to use in plots</summary>
+
+  ```R
+  get_colors2use <- function(species, expected_microbes, microbe_colors, custom_palette){
+    
+    unexpected_microbes <- setdiff(species, expected_microbes)
+    
+    start <- length(species)+1
+    end <-  length(species) + length(unexpected_microbes)
+    unexpected_microbes_colors <-  custom_palette[start:end]
+    names(unexpected_microbes_colors) <- unexpected_microbes
+    colors2use <- append(microbe_colors,unexpected_microbes_colors)
+    return(colors2use)
+    
+  }
+  ```
+  **Function Parameter Definitions:**
+  - `species` - a vector specifying the list of species that will use this color palette, used to set the number of colors in the palette
+  - `expected_microbes` - the list of microbe species that were expected in the data
+  - `microbe_colors` - colors assigned to the expected microbes
+  - `custom_palette` - a vector of strings specifying a custom color palette
+
+  **Returns:** a vector of strings specifying the color palette to use for the input species list
+
+</details>
+
+#### 24c. Set global variables
+
+```R
+
+kraken_taxonomy_outdir <- "kraken2_taxonomy/"
+kaiju_taxonomy_outdir <- "kaiju_taxonomy/"
+
+# Define custom theme for plotting
+publication_format <- theme_bw() +
+  theme(panel.grid = element_blank()) +
+  theme(axis.ticks.length=unit(-0.15, "cm"),
+        axis.text.x=element_text(margin=ggplot2::margin(t=0.5,r=0,b=0,l=0,unit ="cm")),
+        axis.text.y=element_text(margin=ggplot2::margin(t=0,r=0.5,b=0,l=0,unit ="cm")), 
+        axis.title = element_text(size = 18,face ='bold.italic', color = 'black'), 
+        axis.text = element_text(size = 16,face ='bold', color = 'black'),
+        legend.position = 'right', legend.title = element_text(size = 15,face ='bold', color = 'black'),
+        legend.text = element_text(size = 14,face ='bold', color = 'black'),
+        strip.text =  element_text(size = 14,face ='bold', color = 'black'))
+
+# Define custom palette for plotting
+custom_palette <- c("#A6CEE3","#1F78B4","#B2DF8A","#33A02C","#FB9A99","#E31A1C","#FDBF6F", "#FF7F00",
+                    "#CAB2D6","#6A3D9A","#FF00FFFF","#B15928","#000000","#FFC0CBFF","#8B864EFF","#F0027F",
+                    "#666666","#1B9E77", "#E6AB02","#A6761D","#FFFF00FF","#FFFF99","#00FFFFFF",
+                    "#B2182B","#FDDBC7","#D1E5F0","#CC0033","#FF00CC","#330033",
+                    "#999933","#FF9933","#FFFAFAFF",colors()) 
+# remove palette entries 21-23 and all white or near-white colors (white, snow, azure, gray, aliceblue)
+custom_palette <- custom_palette[-c(21:23,
+                                    grep(pattern = "white|snow|azure|gray|#FFFAFAFF|aliceblue",
+                                         x = custom_palette, 
+                                         ignore.case = TRUE)
+                                   )
+                                ]
+# Define expected microbes to use for filtering
+expected_microbes <- c("Pseudomonas aeruginosa", "Salmonella enterica",
+                       "Limosilactobacillus fermentum", "Lactobacillus fermentum", "Staphylococcus aureus",
+                       "Enterococcus faecalis", "Escherichia coli",
+                       "Listeria monocytogenes", "Bacillus subtilis", "Bacillus spizizenii",
+                       "Saccharomyces cerevisiae", "Cryptococcus neoformans")
+orig_expected_microbes <- c("Pseudomonas aeruginosa", "Salmonella enterica",
+                       "Limosilactobacillus fermentum", "Staphylococcus aureus",
+                       "Enterococcus faecalis", "Escherichia coli",
+                       "Listeria monocytogenes", "Bacillus spizizenii",
+                       "Saccharomyces cerevisiae", "Cryptococcus neoformans")
+orig_expected_microbes <- c(sort(orig_expected_microbes), "Escherichia phage Lambda")
+
+# Define expected microbe color palette
+microbe_colors <- custom_palette[1:length(orig_expected_microbes)]
+names(microbe_colors) <- orig_expected_microbes
+
+# Define human associated microbes
+human_associated_microbes <- c("Staphylococcus epidermidis", "Staphylococcus hominis", "Cutibacterium acnes",
+                               "Staphylococcus haemolyticus", "Malassezia", "Corynebacterium", "Micrococcus",
+                               "Hoylesella shahii", "Streptococcus mitis",
+                               "Eubacterium saphenum", "Lawsonella clevelandensis")
+
+# subplots grouping variable
+facets_kaiju <- c("Sample_Type","input_conc_ng", "lambda_spike")
+facets_kraken2 <- c("Sample_Type","input_conc_ng")
+```
+**Input Data:** 
+
+*No input data required*
+
+**Output Data:**
+
+- `publication_format` (a ggplot::theme object specifying the custom theme for plotting)
+- `custom_palette` (a vector of strings specifying a custom color palette for coloring plots)
+- `expected_microbes` (a vector of strings listing microbes that may be found in the samples)
+- `orig_expected_microbes` (a vector of strings listing microbes that may be found in the samples plus "Escherichia phage Lambda")
+- `microbe_colors` (a vector of strings specifying the custom color palette to use for coloring the `orig_expected_microbes`)
+- `human_associated_microbes` (a vector of strings listing microbes that are known to be found in humans)
+- `facets_kaiju` (a vector of strings listing subplot grouping variables for kaiju data)
+- `facets_kraken2` (a vector of strings listing subplot grouping variables for kraken2 data)
+- `kraken_taxonomy_outdir` (a path to the output folder for read taxonomy output based on kraken2 processing)
+- `kaiju_taxonomy_outdir` (a path to the output folder for read taxonomy output based on kaiju processing)
+
+#### 24d. Import Kaiju Taxonomy Data
+
+```R
+kaiju_table <- "/path/to/kaiju_read_taxonomy/merged_kaiju_summary_species.tsv"
+feature_table <- process_kaiju_table(kaiju_table, taxon_col="taxon_name", remove_non_microbial = FALSE)$abundance_table
+
+# Create Species table (species raw read count by barcode)
+feature_table %>% as.data.frame %>% 
+  rownames_to_column("Species") %>%
+  pivot_longer(-Species, names_to = "Barcode", values_to = "Reads") %>% 
+  write_delim(species_csv, delim=',')  # species_csv (output file path) must be defined beforehand
+
+## The number of reads classified at the species level
+colSums(feature_table) %>%
+  enframe(name = "Barcode", value = "Number of reads") %>%
+  write_delim(glue("{kaiju_taxonomy_outdir}species_counts{assay_suffix}.csv"), delim=",")
+
+species_table <- feature_table
+
+```
+**Input Data:**
+- `kaiju_table` (the merged kaiju summary data at the species taxon level, output from [Step 9f](#9f-compile-kaiju-taxonomy-results))
+- `kaiju_taxonomy_outdir` (a path to the output folder for read taxonomy output based on kaiju processing)
+- `assay_suffix` (standard GeneLab assay suffix to use in output files)
+
+**Output Data:**
+- `species_table` (matrix of raw read counts per species, with species as rows and samples as columns)
+- **kaiju_taxonomy/species_counts_GLMetagenomics.csv** (number of reads classified at the species level for each sample)
+
+
+#### 24e. Import Kraken2 Taxonomy Data
+
+```R
+kraken_reports_dir <- "/path/to/read_taxonomy/kraken2_output/"
+
+# import kraken2 reports
+reports <- pavian::read_reports(kraken_reports_dir)
+
+# create taxonomy overview
+summary_table  <- pavian::summarize_reports(reports)
+rownames(summary_table) <- rownames(summary_table) %>% str_split("-") %>% map_chr(\(x) pluck(x, 1))
+summary_table %>% rownames_to_column("Sample_ID") %>% write_delim(glue('{kraken_taxonomy_outdir}kraken_taxonomy_overview.csv'), delim=',')
+
+samples <- names(reports) %>% str_split("-") %>% map_chr(\(x) pluck(x, 1))
+merged_reports  <- pavian::merge_reports2(reports, col_names = samples)
+taxonReads <- merged_reports$taxonReads
+cladeReads <- merged_reports$cladeReads
+tax_data <- merged_reports[["tax_data"]]
+
+#Create species table
+species_table <- tax_data %>% 
+  bind_cols(cladeReads) %>%
+  filter(taxRank %in% c("U","S")) %>% 
+  select(-contains("tax")) %>%
+  zero_if_na() %>% 
+  filter(name != 0) %>%  # drop unknown taxonomies
+  group_by(name) %>% 
+  summarise(across(everything(), sum)) %>% 
+  ungroup() %>% 
+  as.data.frame()
+
+species_names <- species_table[,"name"]
+rownames(species_table) <- species_names
+
+taxonomy_col <- match("name", colnames(species_table))
+species_table <- species_table[,-taxonomy_col]
+
+species_table <- apply(X = species_table, MARGIN = 2, FUN = as.numeric)
+rownames(species_table) <- species_names
+
+# calculate total number of reads for each sample
+colSums(species_table) %>%
+  enframe(name = "Sample", value = "Number of reads") %>%
+  write_delim(glue("{kraken_taxonomy_outdir}species_counts{assay_suffix}.csv"), delim=",")
+```
+
+**Input Data:**
+- `kraken_reports_dir` (path to the directory holding the per-sample kraken2 reports, output from [Step 10a](#10a-taxonomic-classification))
+- `kraken_taxonomy_outdir` (a path to the output folder for read taxonomy output based on kraken2 processing)
+- `assay_suffix` (standard GeneLab assay suffix to use in output files)
+
+
+**Output Data:**
+- `species_table` (a numeric matrix of species raw read counts, with species as rows and samples as columns)
+- **kraken2_taxonomy/species_counts_GLMetagenomics.csv** (per-sample read counts)
+- **kraken2_taxonomy/kraken_taxonomy_overview.csv** (comma-separated table containing a summary of Kraken2 taxonomy classification)
+
+
+#### 24f. Import Sample Metadata
+
+```R
+# define input files
+metadata_file <- "/path/to/metadata.txt"
+
+# Import metadata
+metadata <- read_delim(metadata_file, delim = "\t") %>% as.data.frame()
+row.names(metadata) <- metadata$Sample_ID
+```
+
+**Input Data:** 
+
+- `metadata_file` (a file containing sample metadata for the study, columns are: Sample_ID (string), Sample_Type (string), input_conc_ng (float), lambda_spike ('no'/'yes'), Sample_or_Control (string))
+
+**Output Data:**
+
+- `metadata` (a dataframe containing sample metadata for the study with the sampleIDs as the row names)
+
+--- 
+
+### 25. Read-based processing feature table decontamination
+
+The read-based feature table decontamination and taxonomy QC are performed using the same functions for both kraken2 and kaiju generated taxonomies.
+
+#### 25a. Taxonomy filtering
+
+```R
+# with unclassified data
+output_dir <- glue("{taxonomy_type}_taxonomy/")  # taxonomy_type ("kaiju" or "kraken2") must be defined beforehand
+abundance_threshold <- 0.5
+
+species <- species_table %>% as.data.frame %>% 
+  rownames_to_column("Species") %>% pull(Species) %>% unique()
+colors2use <- get_colors2use(species, orig_expected_microbes, microbe_colors, custom_palette)
+
+abund_table <- species_table %>% 
+               as.data.frame %>% 
+               mutate( across(everything(), \(x) (x/sum(x, na.rm = TRUE))*100 ) ) %>% 
+               rownames_to_column("Species") 
+  
+rownames(abund_table) <- abund_table$Species
+  
+abund_table <- abund_table[, -match(x = "Species", colnames(abund_table))] %>% t
+
+# excluding unclassified and host reads
+non_microbial <- "Unclassified|unclassified|Homo sapien"
+
+# Get species with relative abundance greater than 0.5 in all the samples
+clean_tab <- species_table %>% 
+  as.data.frame %>% 
+  rownames_to_column("Species") 
+
+abund_table <- filter_rare(clean_tab, non_microbial, threshold=abundance_threshold)
+species <- rownames(abund_table)
+colors2use <- get_colors2use(species, orig_expected_microbes, microbe_colors, custom_palette)
+
+species_abund_table <- abund_table %>% 
+                    as.data.frame %>% 
+                   mutate( across( where(is.numeric)  , \(x) (x/sum(x, na.rm = TRUE))*100 ) )
+
+abund_table <- species_abund_table %>% t
+
+# Without human-associated microbes
+unwanted <- str_c(c(non_microbial, human_associated_microbes), collapse = "|")
+clean_tab2 <- filter_rare(clean_tab, unwanted, threshold=abundance_threshold)
+clean_tab2 <- clean_tab2   %>% 
+  mutate( across( where(is.numeric)  , \(x) (x/sum(x, na.rm = TRUE))*100 ) )
+abund_table <- clean_tab2 %>% t
+species <- rownames(clean_tab2)
+colors2use <- get_colors2use(species, orig_expected_microbes, microbe_colors, custom_palette)
+
+p <- make_plot(abund_table, metadata, colors2use, publication_format) + 
+  facet_wrap(facets, scales = "free_x", nrow=1)
+
+p$data %>% mutate(run="Ultra Low", host_read_removal="kraken", taxonomy="{taxonomy_type}") %>% write_delim(file="{output_dir}{taxonomy_type}_no_unwanted{assay_suffix}.tsv", delim = "\t")
+
+# Expected microbes alone 
+non_microbial <- "Unclassified|unclassified|Homo sapien"
+
+clean_tab2 <- clean_tab %>% 
+  filter(str_detect(Species, non_microbial, negate = TRUE))  %>% 
+    filter(str_detect(Species, str_c(expected_microbes, collapse = "|"))) %>%  #select only the expected microbes
+  mutate( across( where(is.numeric)  , \(x) (x/sum(x, na.rm = TRUE))*100 ) )
+
+rownames(clean_tab2) <- clean_tab2$Species
+clean_tab2  <- clean_tab2[,-1] 
+abund_table <- clean_tab2 %>% t
+species <- rownames(clean_tab2)
+colors2use <- get_colors2use(species, orig_expected_microbes, microbe_colors, custom_palette)
+
+p <- make_plot(abund_table, metadata, colors2use, publication_format) + 
+  facet_wrap(facets, scales = "free_x", nrow=1)
+
+p$data %>% mutate(run="Ultra Low", host_read_removal="kraken", taxonomy="{taxonomy_type}") %>% write_delim(file="{output_dir}{taxonomy_type}_expected{assay_suffix}.tsv", delim = "\t")
+
+# Without unclassified and host reads only (no rare-species filtering)
+
+# Remove unclassified and host reads, then convert counts to relative abundance (%)
+clean_tab2 <- clean_tab %>% 
+  as.data.frame %>% 
+  filter(str_detect(Species, non_microbial, negate = TRUE))  %>% 
+  mutate( across( where(is.numeric)  , \(x) (x/sum(x, na.rm = TRUE))*100 ) )
+
+rownames(clean_tab2) <- clean_tab2$Species
+clean_tab2  <- clean_tab2[,-1] 
+abund_table <- clean_tab2 %>% t
+species <- rownames(clean_tab2)
+colors2use <- get_colors2use(species, orig_expected_microbes, microbe_colors, custom_palette)
+
+p <- make_plot(abund_table, metadata, colors2use, publication_format) + 
+  facet_wrap(facets, scales = "free_x", nrow=1)
+
+#Without removing taxonomies with relative abundance less than 0.5%
+p$data %>% mutate(run="Ultra Low", host_read_removal="kraken", taxonomy="{taxonomy_type}") %>% write_delim(file="{output_dir}{taxonomy_type}_no_filt{assay_suffix}.tsv", delim = "\t")
+
+# Filter out unclassified, human reads and rare species
+
+# Rare species here are classified as species with a relative abundance less than 0.5% across
+# all samples.
+
+# Get species with relative abundance greater than 0.5 in all the samples
+abund_table <- filter_rare(clean_tab, non_microbial, threshold=abundance_threshold)
+species <- rownames(abund_table)
+colors2use <- get_colors2use(species, orig_expected_microbes, microbe_colors, custom_palette)
+
+species_abund_table <- abund_table %>% 
+                    as.data.frame %>% 
+                   mutate( across( where(is.numeric)  , \(x) (x/sum(x, na.rm = TRUE))*100 ) )
+
+abund_table <- species_abund_table %>% t
+
+p <- make_plot(abund_table, metadata, colors2use, publication_format) + 
+  facet_wrap(facets, scales = "free_x", nrow=1)
+
+p$data %>% mutate(run="Ultra Low", host_read_removal="kraken", taxonomy="{taxonomy_type}") %>% write_delim(file="{output_dir}{taxonomy_type}_filtered{assay_suffix}.tsv", delim = "\t")
+```
+
+**Parameter Definitions:**
+- `abundance_threshold` - relative abundance threshold (in percent) below which a species is considered rare (default: 0.5)
+- `taxonomy_type` - string specifying which tool was used to create the input taxonomy, either `kaiju` or `kraken`
+- `assay_suffix` - string specifying an assay suffix to use for output file naming in this dataset (default: GLMetagenomics)
+
+
+**Input Data:**
+- `species_table` (dataframe of species read counts, from [Step 24d](#24d-import-kaiju-taxonomy-data) if using kaiju taxonomies or [Step 24e](#24e-import-kraken2-taxonomy-data) if using kraken taxonomies)
+- `metadata` (a dataframe containing sample metadata for the study, from [Step 24f](#24f-import-sample-metadata))
+- `facets` (a vector of strings listing subplot grouping variables for either kaiju or kraken data, from [Step 24c](#24c-set-global-variables))
+
+
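+**Output Data:**
+- `species_abund_table` (a dataframe containing filtered relative abundance values)
+- **<kraken|kaiju>_taxonomy/<kraken|kaiju>_no_unwanted_GLMetagenomics.tsv** (tab-delimited plot data table of relative abundances after removing unclassified reads, host reads, human-associated microbes, and rare species)
+- **<kraken|kaiju>_taxonomy/<kraken|kaiju>_expected_GLMetagenomics.tsv** (tab-delimited plot data table of relative abundances for the expected microbes only)
+- **<kraken|kaiju>_taxonomy/<kraken|kaiju>_no_filt_GLMetagenomics.tsv** (tab-delimited plot data table of relative abundances after removing unclassified and host reads, without rare-species filtering)
+- **<kraken|kaiju>_taxonomy/<kraken|kaiju>_filtered_GLMetagenomics.tsv** (tab-delimited plot data table of relative abundances after removing unclassified reads, host reads, and rare species)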
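+
+The rare-species filtering and re-normalization used in this step can be sketched on toy data as follows. This is a minimal base R illustration, not the pipeline's `filter_rare` function: `filter_rare_sketch` is a hypothetical name, and it assumes one reading of the rule above, namely that a species is dropped only when its relative abundance is below the threshold in every sample.
+
+```R
+# Hypothetical sketch (not the pipeline's filter_rare): keep species whose
+# relative abundance reaches the threshold (in percent) in at least one sample.
+filter_rare_sketch <- function(counts, threshold = 0.5) {
+  # counts: numeric matrix with species in rows and samples in columns
+  rel_abund <- sweep(counts, 2, colSums(counts), "/") * 100
+  keep <- apply(rel_abund, 1, function(x) any(x >= threshold, na.rm = TRUE))
+  counts[keep, , drop = FALSE]
+}
+
+# Toy read counts; each sample has 1000 reads in total
+toy_counts <- matrix(
+  c(900, 950,
+     98,  49,
+      2,   1),
+  ncol = 2, byrow = TRUE,
+  dimnames = list(c("Species A", "Species B", "Species C"),
+                  c("barcode01", "barcode02"))
+)
+
+# Species C is below 0.5% in both samples and is removed
+filtered_counts <- filter_rare_sketch(toy_counts, threshold = 0.5)
+
+# Re-normalize the remaining species so that each sample sums to 100%
+filtered_abund <- sweep(filtered_counts, 2, colSums(filtered_counts), "/") * 100
+print(round(filtered_abund, 2))
+```
+
+Re-normalizing after filtering mirrors the code above, where relative abundances are recomputed once the unwanted or rare species have been removed.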
+
+---
+
+#### 25b. Decontamination with Decontam
+
+##### 25b.i. Setup variables
+```R
+feature_table <- species_abund_table #species_table
+sub_metadata <- metadata[colnames(feature_table),]
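+# Replace zero input concentrations (e.g. no-template controls, NTCs) with a small
+# positive value, because decontam requires all concentration values to be greater than zero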
+sub_metadata <- sub_metadata %>% 
+  mutate(input_conc_ng=map2_dbl(Sample_Type, input_conc_ng,
+                                .f= function(type, conc) { 
+                                  if(conc == 0) return(0.0000001) else return(conc) 
+                                  } )
+         )
+sub_metadata$input_conc_ng <- as.numeric(sub_metadata$input_conc_ng)
+ps <- phyloseq(otu_table(feature_table, taxa_are_rows = TRUE),
+            sample_data(sub_metadata))
+```
+
+**Input Data:**
+- `species_abund_table` (a dataframe containing filtered relative abundance values, from [Step 25a](#25a-taxonomy-filtering))
+
+**Output Data:**
+- `ps` (phyloseq object of the relative abundance values with NTC metadata added)
+
+##### 25b.ii. Identify prevalence of contaminant sequences
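+The prevalence (presence/absence across samples) of each sequence feature in 
+true samples is compared to its prevalence in negative controls to 
+identify contaminants. Because the input DNA concentration (`conc`) is supplied in addition to the negative controls (`neg`), 
+decontam's default method selection combines this prevalence score with a frequency-based score.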
+
+```R
+contam_threshold <- 0.1
+output_dir <- "{taxonomy_type}_taxonomy_decontam/"
+# In our phyloseq object, "Sample_or_Control" is the sample variable that holds 
+# the negative control information. We’ll summarize that data as a logical 
+# variable, with TRUE for control samples, as that is the form required by isContaminant
+sample_data(ps)$is.neg <- sample_data(ps)$Sample_or_Control == "Control_Sample"
+contamdf <- isContaminant(ps, neg="is.neg", conc="input_conc_ng", threshold=contam_threshold) # threshold
+
+#### Create contaminant table
+contamdf %>%
+  mutate( across( where(is.numeric), \(x) round(x, digits = 2) ) ) %>%
+  rownames_to_column("Species") %>% 
+  write_delim(file="{output_dir}{taxonomy_type}_contaminant_table{assay_suffix}.tsv", delim = "\t")
+
+table(contamdf$contaminant)
+
+contamdf %>%
+  rownames_to_column("Species") %>% # keep species names; write_delim does not write row names
+  filter(contaminant == TRUE) %>% 
+  write_delim(file="{output_dir}{taxonomy_type}_filtered_contaminant_table{assay_suffix}.tsv", delim = "\t")
+
+
+isExpected <- str_detect(rownames(contamdf), pattern = str_c(expected_microbes, collapse = "|"))
+contamdf[isExpected,] %>%
+  rownames_to_column("Species") %>% # keep species names; write_delim does not write row names
+  select(-p.freq) %>%
+  mutate( across( where(is.numeric), \(x) round(x, digits = 3) ) ) %>% 
+  write_delim(file="{output_dir}{taxonomy_type}_contaminant_table_expected_microbes{assay_suffix}.tsv", delim = "\t")
+```
+
+**Parameter Definitions:**
+- `contam_threshold` - probability threshold below which the null hypothesis (not a contaminant) should be rejected in favor of the alternate hypothesis (contaminant) (default: 0.1)
+- `taxonomy_type` - string specifying which tool was used to create the input taxonomy, either `kaiju` or `kraken`
+- `assay_suffix` - string specifying an assay suffix to use for output file naming in this dataset (default: GLMetagenomics)
+
+
+**Input Data:**
+- `ps` (phyloseq object of the relative abundance values with NTC metadata added, from [Step 25b.i](#25bi-setup-variables))
+
+**Output Data:**
+- `contamdf` (dataframe of contaminant classification results for each species)
+- **<kaiju|kraken>_taxonomy_decontam/<kaiju|kraken>_contaminant_table_GLMetagenomics.tsv** (tab-delimited table of classification information for all input sequences)
+- **<kaiju|kraken>_taxonomy_decontam/<kaiju|kraken>_filtered_contaminant_table_GLMetagenomics.tsv** (tab-delimited table of classification information for all sequences identified as contaminants)
+- **<kaiju|kraken>_taxonomy_decontam/<kaiju|kraken>_contaminant_table_expected_microbes_GLMetagenomics.tsv** (tab-delimited table of classification information for expected microbes)
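+
+The prevalence comparison described in this step can be illustrated on toy presence/absence data. The sketch below uses a one-sided Fisher's exact test from base R as a simplified stand-in; it is not decontam's implementation and does not reproduce its scores. `prevalence_p` and the toy vectors are hypothetical names introduced only for this example.
+
+```R
+# Illustration only (NOT decontam's implementation): test whether a feature is
+# present more often in negative controls than in true samples.
+prevalence_p <- function(present, is_neg) {
+  tab <- table(
+    factor(is_neg, levels = c(TRUE, FALSE), labels = c("control", "sample")),
+    factor(present, levels = c(TRUE, FALSE), labels = c("present", "absent"))
+  )
+  # alternative = "greater" asks whether the odds of presence are higher in controls
+  fisher.test(tab, alternative = "greater")$p.value
+}
+
+# Toy design: 4 negative controls followed by 6 true samples
+is_neg <- c(rep(TRUE, 4), rep(FALSE, 6))
+
+# feature_x: present in every control but only one true sample (contaminant-like)
+feature_x <- c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)
+# feature_y: absent from every control and present in every true sample
+feature_y <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE)
+
+p_x <- prevalence_p(feature_x, is_neg)
+p_y <- prevalence_p(feature_y, is_neg)
+print(c(feature_x = p_x, feature_y = p_y))
+```
+
+A small p-value for a feature indicates that it is over-represented in the negative controls, which is the pattern expected of a contaminant; a feature found only in true samples gets a large p-value.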
+
+##### 25b.iii. Decontaminated taxonomy plots
+
+```R
+output_dir <- "{taxonomy_type}_taxonomy_decontam/"
+contaminants <- contamdf %>%
+  as.data.frame %>%
+  rownames_to_column("Species") %>%
+  filter(contaminant == TRUE) %>% pull(Species)
+species <- species_abund_table  %>% 
+  as.data.frame %>% 
+  rownames_to_column("Species") %>%
+  filter(str_detect(Species, pattern = str_c(contaminants, collapse = "|"), negate = TRUE)) %>%
+  pull(Species) %>%
+  unique()
+colors2use <- get_colors2use(species, orig_expected_microbes, microbe_colors, custom_palette)
+
+abund_table <- species_abund_table %>% 
+                    as.data.frame  %>% 
+                    rownames_to_column("Species") %>% 
+                    filter(str_detect(Species, 
+                                      pattern = str_c(contaminants,
+                                                      collapse = "|"),
+                                      negate = TRUE)) %>%
+                    mutate( across( where(is.numeric)   , \(x) (x/sum(x, na.rm = TRUE))*100 ) )
+  
+rownames(abund_table) <- abund_table$Species
+  
+abund_table <- abund_table[, -match(x = "Species", colnames(abund_table))] %>% t
+  
+abund_table_wide <- abund_table %>% 
+    as.data.frame() %>% 
+    rownames_to_column("Sample_ID") %>% 
+    inner_join(metadata) %>% 
+    select(!!!colnames(metadata), everything()) %>% 
+    mutate(Sample_ID = Sample_ID %>% str_remove("barcode"))
+    
+  
+abund_table_long <- abund_table_wide  %>%
+    pivot_longer(-colnames(metadata), 
+                 names_to = "Species",
+                 values_to = "relative_abundance")
+  
+p <- ggplot(abund_table_long, mapping = aes(x=Sample_ID, 
+                                              y=relative_abundance, fill=Species)) +
+    geom_col() +
+    scale_fill_manual(values = colors2use) + 
+    labs(x=NULL, y="Relative Abundance (%)") + 
+    publication_format + 
+  facet_wrap(facets, scales = "free_x", nrow=1)
+
+#### Taxonomy plot without contaminants
+
+# Taxonomy plot after contaminant removal at a set threshold of 0.1
+# ggsave(filename = "results/species_plot.png", plot = p,
+#          device = "png", width = 10, height = 6, units = "in", dpi = 300)
+ggplotly(p) %>% saveWidget(file = "{output_dir}{taxonomy_type}_taxonomy_plots_no_contam{assay_suffix}.html")
+```
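+
+**Parameter Definitions:**
+- `taxonomy_type` - string specifying which tool was used to create the input taxonomy, either `kaiju` or `kraken`
+- `assay_suffix` - string specifying an assay suffix to use for output file naming in this dataset (default: GLMetagenomics)
+
+**Input Data:**
+- `species_abund_table` (a dataframe containing filtered relative abundance values, from [Step 25a](#25a-taxonomy-filtering))
+- `contamdf` (dataframe of contaminant classification results for each species, from [Step 25b.ii](#25bii-identify-prevalence-of-contaminant-sequences))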
+
+**Output Data:**
+- **<kaiju|kraken>_taxonomy_decontam/<kaiju|kraken>_taxonomy_plots_no_contam_GLMetagenomics.html** (Plot of taxonomies for decontaminated data)
+
+---
+
+### 26. Assembly-based processing decontamination
+This step covers the annotation of medaka-polished assemblies from low-biomass samples whose host reads were removed with kraken2.
+
+Quality-filtered and trimmed reads were decontaminated by filtering out host (human) reads using kraken2. The clean reads were assembled using metaFlye, and the assembly was polished with medaka. The polished assembly was annotated using our standard assembly annotation pipeline: prodigal was used to predict genes, CAT to assign taxonomy to genes and contigs, and KOFamScan to functionally annotate genes.  
+

From a332edfdf52cb6ac5638f1f642531d5202c642e8 Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Thu, 18 Sep 2025 20:59:39 -0700
Subject: [PATCH 04/47] added missing parameter definition

---
 Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
index b8d24b775..4127f7a07 100644
--- a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
+++ b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
@@ -170,7 +170,7 @@ dorado basecaller ${model} ${input_directory} \
 - `--device` - specifies CPU or GPU device; specifying 'auto' chooses either 'cpu' or 'gpu' depending on detected presence of a GPU device
 - `--recursive` - enables recursive scanning through input directory to load FAST5 and/or POD5 files
 - `--kit-name` - enables barcoding with the provided kit name; see [dorado documentation](https://software-docs.nanoporetech.com/dorado/1.1.1/barcoding/barcoding/) for a full list of accepted kit names
-- `--min-qscore` - 
+- `--min-qscore` - specifies the minimum Q-score, reads with a mean Q-score below this threshold are discarded (default to `7` for this pipeline)
 - `model` - positional argument specifying the basecalling model to use or a path to the model directory
 - `input_directory` - positional argument specifying the location of the raw data in POD5 or FAST5 format
 

From 02f19f893730189e547dab0ed146ccc004716e3d Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Mon, 29 Sep 2025 21:36:24 -0700
Subject: [PATCH 05/47] Update Low Biomass pipeline draft (#176)

* Update to latest draft
* Add read-based feature table decontamination steps
* Add taxonomy plots
* Regularize formatting
---
 .../Low_Biomass/Nanopore/GL-DPPD-XXXX.md      | 3185 +++++++++--------
 1 file changed, 1695 insertions(+), 1490 deletions(-)

diff --git a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
index 4127f7a07..ac9c389af 100644
--- a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
+++ b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
@@ -27,8 +27,8 @@ Barbara Novak (GeneLab Data Processing Lead)
   - [**Pre-processing**](#pre-processing)
     - [1. Basecalling](#1-basecalling)
     - [2. Demultiplexing](#2-demultiplexing)
-      - [2a. Demultiplex]()
-      - [2b. Concatenate files for each sample]()
+      - [2a. Split fastq](#2a-split-fastq)
+      - [2b. Concatenate files for each sample](#2b-concatenate-files-for-each-sample)
     - [3. Raw Data QC](#3-raw-data-qc)
       - [3a. Raw Data QC](#3a-raw-data-qc)
       - [3b. Compile Raw Data QC](#3b-compile-raw-data-qc)
@@ -36,7 +36,7 @@ Barbara Novak (GeneLab Data Processing Lead)
       - [4a. Filter Raw Data](#4a-filter-raw-data)
       - [4a. Filtered Data QC](#4b-filtered-data-qc)
       - [4c. Compile Filtered Data QC](#4c-compile-filtered-data-qc)
-    - [5. Trimming](#3-filteredtrimmed-data-qc)
+    - [5. Trimming](#5-trimming)
       - [5a. Trim Filtered Data](#5a-trim-filtered-data)
       - [5b. Trimmed Data QC](#5b-trimmed-data-qc)
       - [5c. Compile Trimmed Data QC](#5c-compile-filtered-data-qc)
@@ -49,59 +49,49 @@ Barbara Novak (GeneLab Data Processing Lead)
       - [7e. Contaminant Removal QC](#7e-contaminant-removal-qc)
       - [7f. Compile Contaminant Removal QC](#7f-compile-contaminant-removal-qc)
     - [8. Host Removal](#8-host-removal)
-      - [8a. Remove Host Reads](#8a)
-      - [8b. Compile Host Removal QC]()
+      - [8a. Build or download host database](#8a-build-or-download-host-database)
+        - [8a.i. Download from URL](#8ai-download-from-url)
+        - [8a.ii. Build from custom reference](#8aii-build-from-custom-reference)
+        - [8a.iii. Build from host name](#8aiii-build-from-host-name)
+      - [8b. Remove Host Reads](#8b-remove-host-reads)
+    - [9. R Environment Setup](#9-r-environment-setup)
+      - [9a. Load libraries](#9a-load-libraries)
+      - [9b. Define Custom Functions](#9b-define-custom-functions)
+      - [9c. Set global variables](#9c-set-global-variables)
   - [**Read-based processing**](#read-based-processing)
-    - [9. Taxonomic and functional profiling using Kaiju](#8-taxonomic-and-functional-profiling)
-      - [9a. Taxonomic Classification](#9a-taxonomic-classification)
-      - [9b. Convert Kaiju output to Krona format](#9b-convert-kaiju-output-to-krona-format)
-      - [9c. Generate per sample Krona charts](#9c-generate-per-sample-krona-charts)
-      - [9d. Generate combined Krona chart](#9d-generate-combined-krona-chart)
-      - [9e. Compute per-sample taxon level summaries](#9e-compute-taxon-level-summaries-for-each-sample)
-      - [9f. Compile taxon level summaries](#9f-compile-kaiju-taxonomy-results)
-      - [9e. Process Kaiju output]()
-    - [10. Taxonomic and functional profiling using Kraken2](#10-taxonomic-and-functional-profiling-using-kraken2)
-      - [10a. Taxonomic Classification](#10a-taxonomic-classification)
-      - [10b. Combine Kraken2 reports](#10b-combine-kraken2-reports)
-      - [10c. Convert Kraken2 output to krona format](#10c-convert-kraken2-output-to-krona-format)
-      - [10c. Generate per sample Krona charts](#10d-generate-per-sample-krona-charts)
-      - [10d. Generate combined Krona chart](#10e-generate-combined-krona-chart)
-      - [10e. Compile Kraken2 Summary QC](#10f-compile-kraken2-summary-qc)
-      - [10f. Process Kraken2 output]()
-    - [11. Taxonomy plots]()
-        - [11a. Per-sample]()
-        - [11b. combined]()
-    - [12. Read-based Feature Table Decontamination]()
-      - [11a. Kaiju outp conversion]()
+    - [10. Taxonomic profiling using kaiju](#10-taxonomic-profiling-using-kaiju)
+      - [10a. Build kaiju database](#10a-build-kaiju-database)
+      - [10b. Kaiju Taxonomic Classification](#10b-kaiju-taxonomic-classification)
+      - [10c. Compile kaiju taxonomy results](#10c-compile-kaiju-taxonomy-results)
+      - [10d. Convert kaiju output to krona format](#10d-convert-kaiju-output-to-krona-format)
+      - [10e. Compile kaiju krona report](#10e-compile-kaiju-krona-report)
+      - [10f. Create kaiju species count table](#10f-create-kaiju-species-count-table)
+      - [10g. Read-in tables](#10g-read-in-tables)
+      - [10h. Taxonomy barplots](#10h-taxonomy-barplots)
+    - [11. Taxonomic Profiling using Kraken2](#11-taxonomic-profiling-using-kraken2)
+      - [11a. Download kraken2 database](#11a-download-kraken2-database)
+      - [11b. Taxonomic Classification](#11b-taxonomic-classification)
+      - [11c. Convert Kraken2 output to Krona format](#11c-convert-kraken2-output-to-krona-format)
+      - [11d. Compile kraken2 krona report](#11d-compile-kraken2-krona-report)
+      - [11e. Create kraken species count table](#11e-create-kraken-species-count-table)
+      - [11f. Read-in tables](#11f-read-in-tables)
+      - [11g. Taxonomy barplots](#11g-taxonomy-barplots)
+      - [11h. Feature decontamination](#11h-feature-decontamination)
   - [**Assembly-based processing**](#assembly-based-processing)
-    - [11. Sample assembly](#11-sample-assembly)
-    - [12. Polish assembly](#12-polish-assembly)
-    - [13. Renaming contigs and summarizing assemblies](#13-renaming-contigs-and-summarizing-assemblies)
-    - [14. Gene prediction](#14-gene-prediction)
-    - [15. Functional annotation](#15-functional-annotation)
-    - [16. Taxonomic classification](#16-taxonomic-classification)
-    - [17. Read-mapping](#17-read-mapping)
-    - [18. Getting coverage information and filtering based on detection](#18-getting-coverage-information-and-filtering-based-on-detection)
-    - [19. Combining gene-level coverage, taxonomy, and functional annotations into one table for each sample](#19-combining-gene-level-coverage-taxonomy-and-functional-annotations-into-one-table-for-each-sample)
-    - [20. Combining contig-level coverage and taxonomy into one table for each sample](#20-combining-contig-level-coverage-and-taxonomy-into-one-table-for-each-sample)
-    - [21. Generating normalized, gene- and contig-level coverage summary tables of KO-annotations and taxonomy across samples](#21-generating-normalized-gene--and-contig-level-coverage-summary-tables-of-ko-annotations-and-taxonomy-across-samples)
-    - [22. **M**etagenome-**A**ssembled **G**enome (MAG) recovery](#22-metagenome-assembled-genome-mag-recovery)
-    - [23. Generating MAG-level functional summary overview](#23-generating-mag-level-functional-summary-overview)
-  - [**Feature Table Decontamination**]
-    - [24. R Environment Setup](#24-r-environment-setup)
-      - [24a. Load libraries](#24a-load-libraries)
-      - [24b. Define Custom Functions](#24b-define-custom-functions)
-      - [24c. Set Variable](#24c-set-variables)
-      - [24d. Import kaiju taxonomy data](#24d-import-kaiju-taxonomy-data)
-      - [24e. Import kraken2 taxonomy data](#24e-import-kraken2-taxonomy-data)
-      - [24f. Import sample metadata](#24f-import-sample-metadata)
-    - [25. Read-based processing feature-table decontamination](#25-read-based-processing-feature-table-decontamination)
-      - [25a. Taxonomy filtering](#25a-taxonomy-filtering)
-      - [25b. Decontamination](#25b-decontamination-with-decontam)
-        - [25b.i. Setup Variables](#25bi-setup-variables)
-        - [25b.ii. Identify prevalence of contaminant sequences](#25bii-identify-prevalence-of-contaminant-sequences)
-        - [25b.iii. Decontaminated taxonomy plots](#25biii-decontaminated-taxonomy-plots)
-    - [26. Assembly-based processing decontamination](#26-assembly-based-processing-decontamination)
+    - [12. Sample assembly](#12-sample-assembly)
+    - [13. Polish assembly](#13-polish-assembly)
+    - [14. Renaming contigs and summarizing assemblies](#14-renaming-contigs-and-summarizing-assemblies)
+    - [15. Gene prediction](#15-gene-prediction)
+    - [16. Functional annotation](#16-functional-annotation)
+    - [17. Taxonomic classification](#17-taxonomic-classification)
+    - [18. Read-mapping](#18-read-mapping)
+    - [19. Getting coverage information and filtering based on detection](#19-getting-coverage-information-and-filtering-based-on-detection)
+    - [20. Combining gene-level coverage, taxonomy, and functional annotations into one table for each sample](#20-combining-gene-level-coverage-taxonomy-and-functional-annotations-into-one-table-for-each-sample)
+    - [21. Combining contig-level coverage and taxonomy into one table for each sample](#21-combining-contig-level-coverage-and-taxonomy-into-one-table-for-each-sample)
+    - [22. Generating normalized, gene- and contig-level coverage summary tables of KO-annotations and taxonomy across samples](#22-generating-normalized-gene--and-contig-level-coverage-summary-tables-of-ko-annotations-and-taxonomy-across-samples)
+    - [23. **M**etagenome-**A**ssembled **G**enome (MAG) recovery](#23-metagenome-assembled-genome-mag-recovery)
+    - [24. Generating MAG-level functional summary overview](#24-generating-mag-level-functional-summary-overview)
+
 
 
 ---
@@ -114,6 +104,7 @@ Barbara Novak (GeneLab Data Processing Lead)
 |CAT| 5.2.3 |[https://github.com/dutilh/CAT#cat-and-bat](https://github.com/dutilh/CAT#cat-and-bat)|
 |CheckM| 1.1.3 |[https://github.com/Ecogenomics/CheckM](https://github.com/Ecogenomics/CheckM)|
 |Dorado| 1.1.1| [https://github.com/nanoporetech/dorado](https://github.com/nanoporetech/dorado)|
+|filtlong| 0.2.1 |[https://github.com/rrwick/Filtlong](https://github.com/rrwick/Filtlong)|
 |Flye| 2.9.5 | [https://github.com/mikolmogorov/Flye](https://github.com/mikolmogorov/Flye) |
 |GTDB-Tk| 2.4.0 |[https://github.com/Ecogenomics/GTDBTk](https://github.com/Ecogenomics/GTDBTk)|
 |Kaiju| 1.10.1 | [https://bioinformatics-centre.github.io/kaiju/](https://bioinformatics-centre.github.io/kaiju/) |
@@ -133,13 +124,11 @@ Barbara Novak (GeneLab Data Processing Lead)
 | R | 4.5.1 | [https://www.r-project.org](https://www.r-project.org) |
 |Bioconductor | 3.21 | [https://www.bioconductor.org](https://www.bioconductor.org) |
 |decontam| 1.28.0 | [https://www.bioconductor.org/packages/release/bioc/html/decontam.html](https://www.bioconductor.org/packages/release/bioc/html/decontam.html) |
-|DT| 0.34.0 | [https://cran.r-project.org/web/packages/DT/index.html](https://cran.r-project.org/web/packages/DT/index.html) |
 |glue| 1.8.0 | [https://cran.r-project.org/web/packages/glue/index.html](https://cran.r-project.org/web/packages/glue/index.html) |
 |optparse| 1.7.5 |[https://cran.r-project.org/web/packages/optparse/index.html](https://cran.r-project.org/web/packages/optparse/index.html) |
 |pavian| 1.2.1 | [https://github.com/fbreitwieser/pavian](https://github.com/fbreitwieser/pavian) |
 |pheatmap| 1.0.13 | [https://cran.r-project.org/package=pheatmap](https://cran.r-project.org/package=pheatmap) |
 |phyloseq| 1.52.0 | [https://www.bioconductor.org/packages/release/bioc/html/phyloseq.html](https://www.bioconductor.org/packages/release/bioc/html/phyloseq.html) |
-|plotly| 4.11.0 | [https://cran.r-project.org/web/packages/plotly/index.html](https://cran.r-project.org/web/packages/plotly/index.html) |
 |tidyverse| 2.0.0 | [https://www.tidyverse.org](https://www.tidyverse.org) |
 
 ---
@@ -153,11 +142,12 @@ Barbara Novak (GeneLab Data Processing Lead)
 ### 1. Basecalling
 
 ```bash
-model="fast@4.3.0"
-input_dir=/path/to/raw/data
+model="hac"
+input_directory=/path/to/pod5/or/fast5/data
+kit_name=SQK-RPB004
 
 dorado basecaller ${model} ${input_directory} \
-	--no-trim \
+  --no-trim \
   --device auto \
   --recursive \
   --kit-name ${kit_name} \
@@ -166,12 +156,12 @@ dorado basecaller ${model} ${input_directory} \
 
 **Parameter Definitions:**
 
-- `--no-trim` - Skips trimming of barcodes, adapters, and primers
+- `--no-trim` - skips trimming of barcodes, adapters, and primers
 - `--device` - specifies CPU or GPU device; specifying 'auto' chooses either 'cpu' or 'gpu' depending on detected presence of a GPU device
 - `--recursive` - enables recursive scanning through input directory to load FAST5 and/or POD5 files
-- `--kit-name` - enables barcoding with the provided kit name; see [dorado documentation](https://software-docs.nanoporetech.com/dorado/1.1.1/barcoding/barcoding/) for a full list of accepted kit names
+- `--kit-name` - The nanopore barcoding kit used during sequencing preparation. Enables barcoding with the provided kit name; see [dorado documentation](https://software-docs.nanoporetech.com/dorado/1.1.1/barcoding/barcoding/) for a full list of accepted kit names
 - `--min-qscore` - specifies the minimum Q-score, reads with a mean Q-score below this threshold are discarded (default to `7` for this pipeline)
-- `model` - positional argument specifying the basecalling model to use or a path to the model directory
+- `model` - positional argument specifying the basecalling model to use or a path to the model directory. `hac` chooses the high accuracy model.
 - `input_directory` - positional argument specifying the location of the raw data in POD5 or FAST5 format
 
 **Input Data:**
@@ -180,10 +170,16 @@ dorado basecaller ${model} ${input_directory} \
 
 **Output Data:**
 
-- **basecalled.bam** (raw data in BAM format)
+- basecalled.bam (basecalled data in bam format)
+
+<br>
+
+---
 
 ### 2. Demultiplexing
 
+#### 2a. Split fastq
+
 ```bash
 dorado demux \
   --output-dir /path/to/fastq/output \
@@ -195,65 +191,111 @@ dorado demux \
 
 **Parameter Definitions:**
 
-- `--output-dir` - specifies the output folder that is the root of the nested output structure
+- `--output-dir` - specifies the output folder that is the root of the nested output structure. 
 - `--emit-fastq` - specifies that output is fastq format
 - `--emit-summary` - creates a summary listing each read and its classified barcode.
-- `--kit-name` - enables barcoding with the provided kit name; see [dorado documentation](https://software-docs.nanoporetech.com/dorado/1.1.1/barcoding/barcoding/) for a full list of accepted kit names
+- `--kit-name` - The nanopore barcoding kit used during sequencing preparation. Enables barcoding with the provided kit name; see [dorado documentation](https://software-docs.nanoporetech.com/dorado/1.1.1/barcoding/barcoding/) for a full list of accepted kit names
+
+**Input Data:**
+
+- basecalled.bam (basecalled nanopore data in bam format, output from [step 1](#1-basecalling))
+
+**Output Data:**
+
+- /path/to/fastq/output/\*_barcode\*.fastq (demultiplexed reads in fastq format)
+- /path/to/fastq/output/\*_unclassified.fastq (unclassified reads in fastq format)
+- /path/to/fastq/output/barcoding_summary.txt (barcode summary file listing each read, the file it was assigned to, and its classified barcode )
+
+
+#### 2b. Concatenate files for each sample
+
+```bash
+# Change to directory containing split fastq files generated from step 2a. split fastq above
+cd /path/to/fastq/output/ # output of step 2a
+# Get unique barcode names from demultiplexed file names
+BARCODES=($(ls -1 *fastq* |sed -E 's/.+_(barcode[0-9]+)_.+/\1/g' | sort -u))
+
+# Concatenate the separate fastq files for each barcode/sample into one gzipped fastq file per sample
+[ -d raw_data/ ] || mkdir raw_data/
+for sample in ${BARCODES[*]}; do
+
+  [ -d  ${sample}/ ] ||  mkdir ${sample}/  
+  mv *_${sample}_*  ${sample}/ 
+
+  cat ${sample}/* | gzip --to-stdout > raw_data/${sample}.fastq.gz
+
+done
+```
+
+**Parameter Definitions:**
+
+- `| gzip --to-stdout >` - sends output from `cat` to `gzip`, which writes the compressed data to standard output; the redirect saves it as the per-sample fastq.gz file
 
 **Input Data:**
 
-- basecalled.bam (raw nanopore data in BAM format, output from [step 1](#1-basecalling))
+- /path/to/fastq/output/ (directory containing split fastq files, output from [step 2a](#2a-split-fastq))
 
 **Output Data:**
 
-- \*_barcode\*.fastq (demultiplexed reads in fastq format)
-- \*_unclassified.fastq (unclassified reads in fastq format)
-- barcoding_summary.txt (barcode summary file listing each read, the file it was assigned to, and its classified barcode )
+-  raw_data/sample.fastq.gz (gzipped per sample/barcode fastq files)
+
+<br>
+
+---
 
-### 3. Raw Data QC
+### 3. Raw Data QC
 
 #### 3a. Raw Data QC
 
 ```bash 
-NanoPlot --only-report --prefix sample_ -o /path/to/raw_nanoplot_output -t NumberOfThreads --fastq sample_raw.fastq.gz
+NanoPlot --only-report \
+         --prefix sample_ \
+         --outdir /path/to/raw_nanoplot_output \
+         --threads NumberOfThreads \
+         --fastq /path/to/raw_data/sample.fastq.gz
 ```
 
 **Parameter Definitions:**
 
-- `-o` – specifies the output directory to store results
+- `--outdir` – specifies the output directory to store results
 - `--only-report` - output only the report files
 - `--prefix` - adds a sample specific prefix to the name of each output file
-- `-t` - number of processing threads
-- `sample_raw.fastq.gz` – the input reads are specified as a positional argument
+- `--threads` - number of parallel processing threads to use
+- `--fastq` - specifies that the input data is in a fastq format
+- `/path/to/raw_data/sample.fastq.gz` – the input reads, provided as the argument to `--fastq`
 
-**Input data:**
+**Input Data:**
 
-- *raw.fastq.gz (raw reads, output from [Step 2](#2-demultiplexing))
+- /path/to/raw_data/sample.fastq.gz (raw reads, output from [Step 2](#2-demultiplexing))
 
-**Output data:**
+**Output Data:**
 
-- **sample_NanoPlot-report.html** (NanoPlot html summary)
-- sample_NanoPlot_<date>_<time>.log (NanoPlot log file)
-- sample_NanoStats.txt (text file containing basic statistics)
+- **/path/to/raw_nanoplot_output/sample_NanoPlot-report.html** (NanoPlot html summary)
+- /path/to/raw_nanoplot_output/sample_NanoPlot_<date>_<time>.log (NanoPlot log file)
+- /path/to/raw_nanoplot_output/sample_NanoStats.txt (text file containing basic statistics)
 
 #### 3b. Compile Raw Data QC
 
 ```bash 
-multiqc -o raw_multiqc_report -n raw_multiqc --interactive /path/to/raw_nanoplot_output/
+multiqc --zip-data-dir \
+        --outdir raw_multiqc_report \
+        --filename raw_multiqc \
+        --interactive /path/to/raw_nanoplot_output/
 ```
 
 **Parameter Definitions:**
 
--	`-o` – the output directory to store results
--	`-n` – the filename prefix of results
+- `--zip-data-dir` - compress the data directory
+- `--outdir` – the output directory to store results
+- `--filename` – the filename prefix of results
 - `--interactive` - force multiqc to always create interactive javascript plots
--	`/path/to/raw_nanoplot_output/` – the directory holding the output data from the NanoPlot run, provided as a positional argument
+- `/path/to/raw_nanoplot_output/` – the directory holding the output data from the NanoPlot run, provided as a positional argument
 
-**Input data:**
+**Input Data:**
 
 - /path/to/raw_nanoplot_output/*NanoStats.txt (NanoPlot output data, from [Step 3a](#3a-raw-data-qc))
 
-**Output data:**
+**Output Data:**
 
 - **raw_multiqc.html** (multiqc output html summary)
 - **raw_multiqc_data.zip** (zip archive containing multiqc output data)
@@ -267,87 +309,96 @@ multiqc -o raw_multiqc_report -n raw_multiqc --interactive /path/to/raw_nanoplot
 #### 4a. Filter Raw Data
 
 ```bash
-filtlong --min_length 200 --min_mean_q 8 /path/to/raw_fastq/sample.fastq > sample_filtered.fastq
+filtlong --min_length 200 --min_mean_q 8 /path/to/raw_data/sample.fastq.gz > sample_filtered.fastq
 ```
 
 **Parameter Definitions:**
 
--	`-o` – the output directory to store results
--	`-n` – the filename prefix of results
-- `--interactive` - force multiqc to always create interactive javascript plots
--	`/path/to/raw_nanoplot_output/` – the directory holding the output data from the NanoPlot run, provided as a positional argument
+- `--min_length` – specifies the minimum read length to retain (set to `200` in this pipeline)
+- `--min_mean_q` – specifies the minimum mean read quality to retain (set to `8` in this pipeline)
 
-**Input data:**
+**Input Data:**
 
-- *_raw.fastq (raw reads, output from [Step 2](#2-demultiplexing))
+- /path/to/raw_data/sample.fastq.gz (raw reads, output from [Step 2b](#2b-concatenate-files-for-each-sample))
 
-**Output data:**
+**Output Data:**
 
-- *_filtered.fastq (quality filtered reads)
+- sample_filtered.fastq (quality filtered reads)
 
 
 #### 4b. Filtered Data QC
 
 ```bash
-NanoPlot --only-report --prefix sample_ -o /path/to/filtered_nanoplot_output -t NumberOfThreads --fastq sample_filtered.fastq
+NanoPlot --only-report \
+         --prefix sample_ \
+         --outdir /path/to/filtered_nanoplot_output \
+         --threads NumberOfThreads \
+         --fastq sample_filtered.fastq
 ```
 
 **Parameter Definitions:**
 
-- `-o` – specifies the output directory to store results
+- `--outdir` – specifies the output directory to store results
 - `--only-report` - output only the report files
 - `--prefix` - adds a sample specific prefix to the name of each output file
-- `-t` - number of processing threads
+- `--threads` - number of parallel processing threads to use
 - `sample_filtered.fastq` – the input reads are specified as a positional argument
 
-**Input data:**
+**Input Data:**
 
-- *filtered.fastq (raw reads, output from [Step 2](#2-demultiplexing))
+- sample_filtered.fastq (quality filtered reads, output from [Step 4a](#4a-filter-raw-data))
 
-**Output data:**
+**Output Data:**
 
-- **sample_NanoPlot-report.html** (NanoPlot html summary)
-- sample_NanoPlot_<date>_<time>.log (NanoPlot log file)
-- sample_NanoStats.txt (text file containing basic statistics)
+- **/path/to/filtered_nanoplot_output/sample_NanoPlot-report.html** (NanoPlot html summary)
+- /path/to/filtered_nanoplot_output/sample_NanoPlot_<date>_<time>.log (NanoPlot log file)
+- /path/to/filtered_nanoplot_output/sample_NanoStats.txt (text file containing basic statistics)
 
 #### 4c. Compile Filtered Data QC
 
 ```bash
-multiqc -o raw_multiqc_report -n raw_multiqc --interactive /path/to/filtered_nanoplot_output/
+multiqc --zip-data-dir \
+        --outdir filtered_multiqc_report \
+        --filename filtered_multiqc \
+        --interactive /path/to/filtered_nanoplot_output/
 ```
 
 **Parameter Definitions:**
 
-- `-o` – the output directory to store results
--	`-n` – the filename prefix of results
+- `--zip-data-dir` - compress the data directory
+- `--outdir` – the output directory to store results
+- `--filename` – the filename prefix of results
 - `--interactive` - force multiqc to always create interactive javascript plots
--	`/path/to/filtered_nanoplot_output/` – the directory holding the output data from the NanoPlot run, provided as a positional argument
+- `/path/to/filtered_nanoplot_output/` – the directory holding the output data from the NanoPlot run, provided as a positional argument
 
-**Input data:**
+**Input Data:**
 
 - /path/to/filtered_nanoplot_output/*NanoStats.txt (NanoPlot output data, from [Step 4b](#4b-filtered-data-qc))
 
-**Output data:**
+**Output Data:**
+
+- **filtered_multiqc_report/filtered_multiqc.html** (multiqc output html summary)
+- **filtered_multiqc_report/filtered_multiqc_data.zip** (zip archive containing multiqc output data)
 
-- **filtered_multiqc.html** (multiqc output html summary)
-- **filtered_multiqc_data.zip** (zip archive containing multiqc output data)
+<br>
 
-### 5. Trimming
+---
 
+### 5. Trimming
 
 #### 5a. Trim Filtered Data
 
 ```bash
 porechop --input sample_filtered.fastq --threads NumberOfThreads \
-		--discard_middle --output sample_trimmed.fastq  > sample_porechop.log
+         --discard_middle --output sample_trimmed.fastq  > sample_porechop.log
 ```
 
 **Parameter Definitions:**
 
--	`--input` – the input read file in fastq format
-- `--threads` - number of processing threads
-- `--discard_middle` - 
-- `--output` - output filename
+- `--input` – the input read file in fastq format
+- `--threads` - number of parallel processing threads to use
+- `--discard_middle` - discards reads that contain adapters in the middle of the read
+- `--output` - trimmed reads output fastq filename
 - `> sample_porechop.log` - capture stdout in a log file
 
 **Input Data:**
@@ -361,74 +412,89 @@ porechop --input sample_filtered.fastq --threads NumberOfThreads \
 #### 5b. Trimmed Data QC
 
 ```bash
-NanoPlot --only-report --prefix sample_ -o /path/to/trimmed_nanoplot_output -t NumberOfThreads --fastq sample_trimmed.fastq
+NanoPlot --only-report \
+         --prefix sample_ \
+         --outdir /path/to/trimmed_nanoplot_output \
+         --threads NumberOfThreads \
+         --fastq sample_trimmed.fastq
 ```
 
 **Parameter Definitions:**
 
-- `-o` – specifies the output directory to store results
+- `--outdir` – specifies the output directory to store results
 - `--only-report` - output only the report files
 - `--prefix` - adds a sample specific prefix to the name of each output file
-- `-t` - number of processing threads
-- `sample_trimmed.fastq.gz` – the input reads are specified as a positional argument
+- `--threads` - number of parallel processing threads to use
+- `sample_trimmed.fastq` – the input reads file, provided as the argument to `--fastq`
 
-**Input data:**
+**Input Data:**
 
-- *trimmed.fastq.gz (raw reads, output from [Step 2](#2-demultiplexing))
+- sample_trimmed.fastq (filtered and trimmed reads, output from [Step 5a](#5a-trim-filtered-data))
 
-**Output data:**
+**Output Data:**
 
-- **sample_NanoPlot-report.html** (NanoPlot html summary)
-- sample_NanoPlot_<date>_<time>.log (NanoPlot log file)
-- sample_NanoStats.txt (text file containing basic statistics)
+- **/path/to/trimmed_nanoplot_output/sample_NanoPlot-report.html** (NanoPlot html summary)
+- /path/to/trimmed_nanoplot_output/sample_NanoPlot_<date>_<time>.log (NanoPlot log file)
+- /path/to/trimmed_nanoplot_output/sample_NanoStats.txt (text file containing basic statistics)
 
-#### 5c. Compile Filtered Data QC
+#### 5c. Compile Trimmed Data QC
 
 ```bash
-multiqc -o raw_multiqc_report -n raw_multiqc --interactive /path/to/trimmed_nanoplot_output/
+multiqc --zip-data-dir \
+        --outdir trimmed_multiqc_report \
+        --filename trimmed_multiqc \
+        --interactive /path/to/trimmed_nanoplot_output/
 ```
 
 **Parameter Definitions:**
 
-- `-o` – the output directory to store results
--	`-n` – the filename prefix of results
+- `--zip-data-dir` - compress the data directory
+- `--outdir` – the output directory to store results
+- `--filename` – the filename prefix of results
 - `--interactive` - force multiqc to always create interactive javascript plots
--	`/path/to/trimmed_nanoplot_output/` – the directory holding the output data from the NanoPlot run, provided as a positional argument
+- `/path/to/trimmed_nanoplot_output/` – the directory holding the output data from the NanoPlot run, provided as a positional argument
 
-**Input data:**
+**Input Data:**
 
 - /path/to/trimmed_nanoplot_output/*NanoStats.txt (NanoPlot output data, output from [Step 5b](#5b-trimmed-data-qc))
 
-**Output data:**
+**Output Data:**
+
+- **trimmed_multiqc_report/trimmed_multiqc.html** (multiqc output html summary)
+- **trimmed_multiqc_report/trimmed_multiqc_data.zip** (zip archive containing multiqc output data)
 
-- **filtered_multiqc.html** (multiqc output html summary)
-- **filtered_multiqc_data.zip** (zip archive containing multiqc output data)
+<br>
 
 ---
 
 ### 6. Assemble Contaminants
 
 ```bash
-flye --meta --threads NumberOfThreads --out-dir /path/to/contaminant_assembly --nano-raw /path/to/blank_samples/\*_trimmed.fastq
+flye --meta --threads NumberOfThreads \
+     --out-dir /path/to/contaminant_assembly \
+     --nano-raw /path/to/blank_samples/*_trimmed.fastq
 ```
 
 **Parameter Definitions:**
 
--	`--meta` – use metagenome/uneven coverage mode
-- `--threads` - Number of parallel processing threads
-- `--out-dir` - Output directory
-- `--nano-raw` - specifies that input is from Oxford Nanopore regular reads (pre-Guppy5, <20% error)
+- `--meta` – use metagenome/uneven coverage mode
+- `--threads` - number of parallel processing threads to use
+- `--out-dir` - output directory
+- `--nano-raw` - specifies that input is from Oxford Nanopore regular raw reads. This adds a polishing step for error correction after the assembly is generated.
 
 **Input Data**
 
-- *_trimmed.fastq (filtered and trimmed reads from blank samples, output from [Step 5a](#5a-trim-filtered-data))
+- *_trimmed.fastq (one or more filtered and trimmed read files from blank samples, output from [Step 5a](#5a-trim-filtered-data))
 
 **Output Data**
 
 - /path/to/contaminant_assembly/assembly.fasta (Assembly built from reads in blank samples in fasta format)
 
+<br>
+
+---
 
-### 7. Remove Contaminants
+### 7. Contaminant Removal
 
 #### 7a. Build Contaminant Index and Map Reads
 
@@ -442,8 +508,8 @@ minimap2 -t NumberOfThreads -a -x splice blanks.mmi /path/to/trimmed_reads/sampl
 
 **Parameter Definitions:**
 
-- `-t` - Number of parallel processing threads
--	`-a` – output in SAM format
+- `-t` - number of parallel processing threads
+- `-a` – output in SAM format
 - `-x splice` - specifies preset for spliced alignment of long reads
 - `-d` - specifies the output file for the index
 
@@ -467,17 +533,17 @@ samtools index sample_sorted.bam sample_sorted.bam.bai
 **Parameter Definitions:**
 
 **samtools sort**
-- `--threads` - Number of parallel processing threads
+- `--threads` - number of parallel processing threads to use
 - `-o` - specifies the output file for the sorted reads
 - `sample.sam` - positional argument specifying the input SAM file
 
 **samtools index**
-- `sample_sorted.bam` - positional argument specifying the input BAM file to be sorted
+- `sample_sorted.bam` - positional argument specifying the input BAM file to be indexed
 - `sample_sorted.bam.bai` - positional argument specifying the name of the index file
 
 **Input Data:**
 
-- sample.sam (Reads aligned to contaminant assembly, output from [Step 7a](#7a-identify-contaminants))
+- sample.sam (Reads aligned to contaminant assembly, output from [Step 7a](#7a-build-contaminant-index-and-map-reads))
 
 **Output Data:**
 
@@ -489,7 +555,6 @@ samtools index sample_sorted.bam sample_sorted.bam.bai
 ```bash
 
 samtools flagstat sample_sorted.bam > sample_flagstats.txt  2> sample_flagstats.log
-
 samtools stats --remove-dups sample_sorted.bam > sample_stats.txt   2> sample_stats.log
 samtools idxstats sample_sorted.bam  > sample_idxstats.txt 2> sample_idxstats.log
 ```
@@ -504,8 +569,8 @@ samtools idxstats sample_sorted.bam  > sample_idxstats.txt 2> sample_idxstats.lo
 
 **Input Data:**
 
-- sample_sorted.bam (sorted mapping to contaminant assembly, output from [Step 7b](#7b-sort-mapped-reads-and-convert-to-bam))
-- sample_sorted.bam.bai (index of sorted mapping to contaminant assembly, output from [Step 7b](#7b-sort-mapped-reads-and-convert-to-bam))
+- sample_sorted.bam (sorted mapping to contaminant assembly, output from [Step 7b](#7b-sort-and-index-contaminant-alignments))
+- sample_sorted.bam.bai (index of sorted mapping to contaminant assembly, output from [Step 7b](#7b-sort-and-index-contaminant-alignments))
 
 **Output Data:**
 
@@ -515,7 +580,7 @@ samtools idxstats sample_sorted.bam  > sample_idxstats.txt 2> sample_idxstats.lo
 
 #### 7d. Generate Decontaminated Read Files
 ```bash
-# Retain reads that do not match contaminants
+# Retain reads that do not map to contaminants
 samtools fastq -t -f 4 sample_sorted.bam | gzip --to-stdout > sample_blank_removed.fastq.gz
 ```
 
@@ -523,1968 +588,2108 @@ samtools fastq -t -f 4 sample_sorted.bam | gzip --to-stdout > sample_blank_remov
 
 - `fastq` - positional argument specifying the program for generating fastq files from a SAM/BAM file
 - `-t` - copy RG, BC, and QT tags to the FASTQ header line
-- `-f 4` - only retain reads that have been marked with the SAM "segment unmapped" FLAG (4)
+- `-f 4` - only retain reads marked with the SAM "segment unmapped" FLAG (4), i.e. reads that did not map to the contaminant assembly
 - `sample_sorted.bam` - positional argument specifying the input BAM file
 - `| gzip --to-stdout` - sends output from `samtools fastq` to `gzip` to create compressed fastq.gz file
 - `> sample_blank_removed.fastq.gz` - specifies the name of the file used to store the fastq.gz output
 
 **Input Data:**
 
-- sample_sorted.bam (sorted mapping to contaminant assembly, output from [Step 7b](#7b-sort-mapped-reads-and-convert-to-bam))
+- sample_sorted.bam (sorted mapping to contaminant assembly, output from [Step 7b](#7b-sort-and-index-contaminant-alignments))
 
 **Output Data:**
 
-- sample_blank_removed.fastq.gz (decontaminated reads in fastq format)
+- sample_blank_removed.fastq.gz (blank-removed reads in gzipped fastq format)
 
 #### 7e. Contaminant Removal QC
 
 ```bash
-NanoPlot --only-report --prefix sample_ -o /path/to/noblank_nanoplot_output -t NumberOfThreads --fastq sample_blank_removed.fastq.gz
+NanoPlot --only-report \
+         --prefix sample_ \
+         --outdir /path/to/noblank_nanoplot_output \
+         --threads NumberOfThreads \
+         --fastq sample_blank_removed.fastq.gz
 ```
 
 **Parameter Definitions:**
 
-- `-o` – specifies the output directory to store results
+- `--outdir` – specifies the output directory to store results
 - `--only-report` - output only the report files
 - `--prefix` - adds a sample specific prefix to the name of each output file
-- `-t` - number of processing threads
+- `--threads` - number of parallel processing threads to use
+- `--fastq` - specifies that the input data is in a fastq format
 - `sample_blank_removed.fastq.gz` – the input reads are specified as a positional argument
 
-**Input data:**
+**Input Data:**
 
-- sample_blank_removed.fastq.gz (raw reads, output from [Step 7d](#7d-generate-non-contaminant-read-files))
+- sample_blank_removed.fastq.gz (blank removed reads, output from [Step 7d](#7d-generate-decontaminated-read-files))
 
-**Output data:**
+**Output Data:**
 
-- **sample_NanoPlot-report.html** (NanoPlot html summary)
-- sample_NanoPlot_<date>_<time>.log (NanoPlot log file)
-- sample_NanoStats.txt (text file containing basic statistics)
+- **/path/to/noblank_nanoplot_output/sample_NanoPlot-report.html** (NanoPlot html summary)
+- /path/to/noblank_nanoplot_output/sample_NanoPlot_<date>_<time>.log (NanoPlot log file)
+- /path/to/noblank_nanoplot_output/sample_NanoStats.txt (text file containing basic statistics)
 
 
 #### 7f. Compile Contaminant Removal QC
 
 ```bash
-multiqc -o noblank_multiqc_report -n noblank_multiqc --interactive /path/to/noblank_nanoplot_output/
+multiqc --zip-data-dir \
+        --outdir noblank_multiqc_report \
+        --filename noblank_multiqc \
+        --interactive /path/to/noblank_nanoplot_output/
 ```
 
 **Parameter Definitions:**
 
-- `-o` – the output directory to store results
--	`-n` – the filename prefix of results
+- `--zip-data-dir` - compress the data directory
+- `--outdir` – the output directory to store results
+- `--filename` – the filename prefix of results
 - `--interactive` - force multiqc to always create interactive javascript plots
--	`/path/to/noblank_nanoplot_output/` – the directory holding the output data from the NanoPlot run, provided as a positional argument
+- `/path/to/noblank_nanoplot_output/` – the directory holding the output data from the NanoPlot run, provided as a positional argument
+
+**Input Data:**
 
-**Input data:**
+- /path/to/noblank_nanoplot_output/*NanoStats.txt (NanoPlot output data, output from [Step 7e](#7e-contaminant-removal-qc))
 
-- /path/to/noblank_nanoplot_output/*NanoStats.txt (NanoPlot output data, output from [Step 7d](#7d-generate-non-contaminant-read-files))
+**Output Data:**
 
-**Output data:**
+- **noblank_multiqc_report/noblank_multiqc.html** (multiqc output html summary)
+- **noblank_multiqc_report/noblank_multiqc_data.zip** (zip archive containing multiqc output data)
 
-- **noblank_multiqc.html** (multiqc output html summary)
-- **noblank_multiqc_data.zip** (zip archive containing multiqc output data)
+<br>
 
 ---
 
 ### 8. Host Removal
 
-```bash
-kraken2 --db kraken2_host_db --gzip-compressed --threads NumberOfThreads --use-names \
-        --output sample-kraken2-output.txt --report sample-kraken2-report.tsv \
-        --unclassified-out sample_host_removed.fastq sample_blank_removed.fastq.gz && \
-		&& gzip sample_host_removed.fastq
-```
-
-**Parameter Definitions:**
-
-- `--db` - specifies the directory holding the kraken2 database files created in step 1
-- `--gzip-compressed` - specifies the input fastq files are gzip-compressed
-- `--threads` - specifies the number of threads to use
-- `--use-names` - specifies adding taxa names in addition to taxids
-- `--output` - specifies the name of the kraken2 read-based output file (one line per read)
-- `--report` - specifies the name of the kraken2 report output file (one line per taxa, with number of reads assigned to it)
-- `--unclassified-out` - name of output file of reads that were not classified 
-- `sample_blank_removed.fastq.gz` - positional argument specifying the input read file
-
-**Input data:**
-
-- sample_blank_removed.fastq.gz (gzipped reads fastq file, output from [Step 7d](#7d-generate-decontaminated-read-files))
-
-**Output data:**
-
-- sample-kraken2-output.txt (kraken2 read-based output file (one line per read))
-- sample-kraken2-report.tsv (kraken2 report output file (one line per taxa, with number of reads assigned to it))
-- **sample_HRremoved_raw.fastq.gz** (human-read removed, gzipped reads fastq file)
+#### 8a. Build or download host database
 
----
+##### 8a.i. Download from URL
 
-### 9. Taxonomic and Functional Profiling using Kaiju
+```bash
+  # Download the host database archive from host_url
+  wget -O host.tar.gz --timeout=3600 --tries=0 --continue host_url
 
-#### 9a. Kaiju Taxonomic Classification
-```
-kaiju -f kaiju_db.fmi -t nodes.dmp \
-    -z NumberOfThreads \
-    -E 1e-05 \
-    -i /path/to/decontaminated_reads/sample_host_removed.fastq.gz \
-    -o sample_kaiju.out
+  mkdir kraken2_host_db/ && \
+  tar -zxvf host.tar.gz -C kraken2_host_db/ && \
+  rm -rf host.tar.gz # Clean up the downloaded archive
 ```
 
 **Parameter Definitions:**
 
-- `-f` - specifies path to the Kaiju database (.fmi) file
-- `-t` - specifies path to the Kaiju nodes.dmp file
-- `-z` - specifies the number of threads to use
-- `-E` - specifies the minimum E-value in Greedy mode (default: 0.01)
-- `-i` - specifies path to the input file
-- `-o` - specifies the name of output file
-
-**Input data:**
-
-- sample_host_removed.fastq.gz (gzipped decontaminated reads fastq file, output from [Step 8](#8-host-removal))
+- `--timeout` - network timeout in seconds
+- `--tries` - number of times to retry the download (`0` retries indefinitely)
+- `--continue` - continue getting a partially downloaded file (if it exists)
+- `host_url` - positional argument specifying the URL of the host database archive
 
-**Output data:**
+**Output Data:**
 
-- sample_kaiju.out (kaiju output file)
+- kraken2_host_db/ (Kraken2 database directory)
 
 
-#### 9e. Kaiju per-sample taxon level summaries
+##### 8a.ii. Build from custom reference
 
 ```bash
-# Get taxon level information for each sample
-for TAXON_LEVEL in (phylum class order family genus species); do
-  kaiju2table -t nodes.dmp -n names.dmp -p  -r $TAXON_LEVEL \
-              -o sample_kaiju_summary_${TAXON_LEVEL}.tsv sample_kaiju.out
-done
+# Install taxonomy
+kraken2-build --download-taxonomy --db kraken2_host_db/
+# Add sequence to your database's genomic library
+kraken2-build --add-to-library host_assembly.fasta --db kraken2_host_db/ --no-masking
+# Once your library is finalized, build the database
+kraken2-build --build --db kraken2_host_db/
 ```
 
 **Parameter Definitions:**
 
-- `-n` - specifies path to the Kaiju names.dmp file
-- `-t` - specifies path to the Kaiju nodes.dmp file
-- `-r` - specifies taxonomic rank, must be one of: phylum, class, order, family, genus, species
-- `-o` - specifies the name of krona formatted kaiju output file
-- `sample_kaiju.out` - positional argument specifying the path to the Kaiju output file (output from [Step 9ai](#9ai-read-taxonomic-classification-using-kaiju))
+- `--download-taxonomy` - downloads taxonomic mapping information
+- `--add-to-library host_assembly.fasta` - adds the host assembly fasta file to the database's genomic library
+- `--db` - specifies the output directory for the kraken2 database
+- `--build` - builds the kraken2-formatted database from the library and taxonomy
 
 **Input Data:**
 
-- sample_kaiju.out (kaiju output file, output from [Step 9ai](#9ai-read-taxonomic-classification-using-kaiju))
+- host_assembly.fasta (host genome assembly in fasta format)
 
 **Output Data:**
 
-- **sample_kaiju_summary_phylum.tsv** (Compiled kaiju outputs at the phylum taxon level)
-- **sample_kaiju_summary_class.tsv** (Compiled kaiju outputs at the class taxon level)
-- **sample_kaiju_summary_order.tsv** (Compiled kaiju outputs at the order taxon level)
-- **sample_kaiju_summary_family.tsv** (Compiled kaiju outputs at the family taxon level)
-- **sample_kaiju_summary_genus.tsv** (Compiled kaiju outputs at the genus taxon level)
-- **sample_kaiju_summary_species.tsv** (Compiled kaiju outputs at the species taxon level)
+- kraken2_host_db/ (Kraken2 database directory)
+
 
-#### 9f. Compile Kaiju taxonomy results
+##### 8a.iii. Build from host name
 
 ```bash
-for TAXON_LEVEL in (phylum class order family genus species); do
-  kaiju2table -t nodes.dmp -n names.dmp -p -r $TAXON_LEVEL \
-              -o merged_kaiju_summary_${TAXON_LEVEL}.tsv *_kaiju.out
+# Build kraken reference from host_name
+kraken2-build --download-library host_name --db kraken2_host_db/ \
+              --threads NumberOfThreads --no-masking
+kraken2-build --download-taxonomy --db kraken2_host_db/
+kraken2-build --build --db kraken2_host_db/ --threads NumberOfThreads
+kraken2-build --clean --db kraken2_host_db/
 ```
 
 **Parameter Definitions:**
 
-- `-n` - specifies path to the Kaiju names.dmp file
-- `-t` - specifies path to the Kaiju nodes.dmp file
-- `-r` - specifies taxonomic rank, must be one of: phylum, class, order, family, genus, species
-- `-o` - specifies the name of krona formatted kaiju output file
-- `sample_kaiju.out` - positional argument specifying the path to the Kaiju output file (output from [Step 9ai](#9ai-read-taxonomic-classification-using-kaiju))
+- `--download-library` - specifies the reference name/type to download, host_name must 
+                         be one of: "archaea", "bacteria", "plasmid", "viral", "human", 
+                         "fungi", "plant", "protozoa", "nr", "nt", "UniVec", "UniVec_Core"
+- `--db` - specifies the directory in which the database is built
+- `--threads` - number of parallel processing threads to use
+- `--no-masking` - prevents masking of low-complexity sequences. For additional 
+                   information see the [kraken documentation for masking](https://github.com/DerrickWood/kraken2/wiki/Manual#masking-of-low-complexity-sequences)
+- `--download-taxonomy` - downloads taxonomic mapping information
+- `--build` - builds the kraken2-formatted database from the library and taxonomy
+- `--clean` - removes unnecessary intermediate files
 
 **Input Data:**
 
-- *kaiju.out (kaiju output files, output from [Step 9ai](#9ai-read-taxonomic-classification-using-kaiju))
+- `host_name` - host database name (one of the library names listed under `--download-library` above)
 
 **Output Data:**
 
-- **merged_kaiju_summary_phylum.tsv** (Compiled kaiju outputs at the phylum taxon level)
-- **merged_kaiju_summary_class.tsv** (Compiled kaiju outputs at the class taxon level)
-- **merged_kaiju_summary_order.tsv** (Compiled kaiju outputs at the order taxon level)
-- **merged_kaiju_summary_family.tsv** (Compiled kaiju outputs at the family taxon level)
-- **merged_kaiju_summary_genus.tsv** (Compiled kaiju outputs at the genus taxon level)
-- **merged_kaiju_summary_species.tsv** (Compiled kaiju outputs at the species taxon level)
-
-#### 9b. Convert Kaiju Output to Krona Format
-```
-kaiju2krona -u -n ${NAMES} -t nodes.dmp \
-	-i sample_kaiju.out \
-	-o sample.krona
-```
-
-**Parameter Definitions:**
-
-- `-u` - include count for unclassified reads in output
-- `-n` - specifies path to the Kaiju names.dmp file
-- `-t` - specifies path to the Kaiju nodes.dmp file
-- `-i` - specifies path to the Kaiju output file (output from [Step 9ai](#9ai-read-taxonomic-classification-using-kaiju))
-- `-o` - specifies the name of krona formatted kaiju output file
-
-**Input data:**
-
-- sample_kaiju.out (kaiju output file, output from [Step 9ai](#9ai-read-taxonomic-classification-using-kaiju))
-
-**Output data:**
-
-- sample.krona (krona formatted kaiju output)
-
----
-
-### 10. Taxonomic and Functional Profiling using Kraken2
-
-      - [9c. Generate per sample Krona charts](#9c-generate-per-sample-krona-charts)
-      - [9d. Generate combined Krona chart](#9d-generate-combined-krona-chart)
-      - [9e. Compute per-sample taxon level summaries](#9e-compute-taxon-level-summaries-for-each-sample)
-      - [9f. Compile taxon level summaries](#9f-compile-kaiju-taxonomy-results)
-
-#### 10a. Taxonomic Classification
+- kraken2_host_db/ (Kraken2 database directory)
 
+#### 8b. Remove host reads
 ```bash
-kraken2 --db ${DATABASE} --gzip-compressed --threads NumberOfThreads --use-names \
+kraken2 --db kraken2_host_db/ --gzip-compressed --threads NumberOfThreads --use-names \
         --output sample-kraken2-output.txt --report sample-kraken2-report.tsv \
-        /path/to/decontaminated_reads/sample_host_removed.fastq.gz
+        --unclassified-out sample_host_removed.fastq sample_blank_removed.fastq.gz
+gzip sample_host_removed.fastq
 ```
 
-**Parameter Definition:**
+**Parameter Definitions:**
 
-- `--db` - specifies the directory holding the kraken2 database files created in step 1
+- `--db` - specifies the directory holding the kraken2 database files created in [Step 8a](#8a-build-or-download-host-database)
 - `--gzip-compressed` - specifies the input fastq files are gzip-compressed
-- `--threads` - specifies the number of threads to use
-- `--use-names` - specifies adding taxa names in addition to taxids
+- `--threads` - number of parallel processing threads to use
+- `--use-names` - specifies adding taxa names in addition to taxon IDs
 - `--output` - specifies the name of the kraken2 read-based output file (one line per read)
 - `--report` - specifies the name of the kraken2 report output file (one line per taxa, with number of reads assigned to it)
-- `sample_host_removed.fastq.gz` - positional argument specifying the input read file
+- `--unclassified-out` - name of the output file for reads that were not classified, i.e., non-host reads
+- `sample_blank_removed.fastq.gz` - positional argument specifying the input read file
 
-**Input data:**
+**Input Data:**
 
-- sample_host_removed.fastq.gz (gzipped reads fastq file, output from [Step 7d](#7d-generate-decontaminated-read-files))
+- sample_blank_removed.fastq.gz (gzipped blank removed fastq file, output from [Step 7d](#7d-generate-decontaminated-read-files))
 
-**Output data:**
+**Output Data:**
 
 - sample-kraken2-output.txt (kraken2 read-based output file (one line per read))
 - sample-kraken2-report.tsv (kraken2 report output file (one line per taxa, with number of reads assigned to it))
+- **sample_host_removed.fastq.gz** (host-read removed, gzipped fastq file)
 
-#### 10b. Combine Kraken2 Reports
-
-```bash
-combine_kreports.py --output merged-kraken-table.tsv \
-                    --report-files sample-1-kraken2-report.tsv sample-2-kraken2-report.tsv sample-3-kraken2-report.tsv \
-                    --sample-names sample-1 sample-2 sample-3
-```
-
-**Parameter Definition:**
-
-- `--output` - specifies the name of the kraken2 read-based output file
-- `--report-files` - a space separated list of kraken2 report output file
-- `--sample-names` - a space separated list of sample name to use as headers in the report (in the same order as the report files)
+<br>
 
-**Input data:**
+---
 
-- *kraken2-report.tsv (kraken reports, output from [Step 10a](#10a-taxonomic-classification)
+## Read-based Processing
 
-**Output data:**
+### 9. R Environment Setup
 
-- **merged-kraken-table.tsv**  (merged Kraken2 output in tab-delimited format)
+> Taxonomy bar plots, heatmaps, and feature decontamination with decontam are performed in R.
 
-#### 10f. Compile Kraken2 Summary QC
+#### 9a. Load libraries
 
-```bash 
-multiqc -o kraken_multiqc_report -n kraken_multiqc --interactive /path/to/kraken2_output/
+```R
+library(decontam)
+library(phyloseq)
+library(tidyverse)
+library(glue)
+library(pheatmap)
+library(pavian)
 ```
 
-**Parameter Definitions:**
+#### 9b. Define Custom Functions
 
--	`-o` – the output directory to store results
--	`-n` – the filename prefix of results
-- `--interactive` - force multiqc to always create interactive javascript plots
--	`/path/to/kraken2_output/` – the directory holding the output data from the Kraken2 run, provided as a positional argument
+##### get_last_assignment()
+<details>
+  <summary>retrieves the last taxonomy assignment from a taxonomy string</summary>
 
-**Input data:**
+  ```R
+  get_last_assignment <- function(taxonomy_string, split_by=';', remove_prefix=NULL) {
+    # A function to get the last taxonomy assignment from a taxonomy string 
+    split_names <- strsplit(x =  taxonomy_string , split = split_by) %>% 
+      unlist()
+    
+    level_name <- split_names[[length(split_names)]]
+    
+    if(level_name == "_"){
+      return(taxonomy_string)
+    }
+    
+    if(!is.null(remove_prefix)){
+      level_name <- gsub(pattern = remove_prefix, replacement = '', x = level_name)
+    }
+    
+    return(level_name)
+  }
+  ```
 
-- /path/to/kraken2_output/*kraken2-report.tsv (Kraken2 output data, from [Step 10a](#10a-taxonomic-classification))
+  **Function Parameter Definitions:**
+  - `taxonomy_string` - a character string containing a list of taxonomy assignments
+  - `split_by=` - a character string containing a regular expression used to split the `taxonomy_string`, default=`';'`
+  - `remove_prefix=` - a character string containing a regular expression to be matched and removed, default=`NULL`
 
-**Output data:**
+  **Returns:** the last taxonomy assignment listed in the `taxonomy_string`
+</details>
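+
+As a rough illustration of the expected behaviour, the sketch below reproduces the core logic of `get_last_assignment()` using base R only. The `last_level()` helper and the example taxonomy string are hypothetical and are not part of the pipeline:
+
+```R
+# Hypothetical helper mirroring get_last_assignment() without the tidyverse pipe
+last_level <- function(taxonomy_string, split_by = ";", remove_prefix = NULL) {
+  split_names <- unlist(strsplit(taxonomy_string, split = split_by))
+  level_name <- split_names[[length(split_names)]]
+  # A bare "_" means no assignment at this level, so return the full string
+  if (level_name == "_") return(taxonomy_string)
+  if (!is.null(remove_prefix)) level_name <- gsub(remove_prefix, "", level_name)
+  level_name
+}
+
+# Made-up taxonomy string, for illustration only
+example_taxon <- "d__Bacteria;p__Firmicutes;g__Bacillus;s__Bacillus subtilis"
+last_level(example_taxon)                          # "s__Bacillus subtilis"
+last_level(example_taxon, remove_prefix = "^s__")  # "Bacillus subtilis"
+```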
 
-- **kraken2_multiqc.html** (multiqc output html summary)
-- **kraken2_multiqc_data.zip** (zip archive containing multiqc output data)
+##### mutate_taxonomy()
+<details>
+  <summary>ensure that the taxonomy column is named "taxonomy" and aggregate duplicates to ensure that taxonomy names are unique</summary>
 
-#### 10c. Convert Kraken2 output to Krona format
+  ```R
+  mutate_taxonomy <- function(df, taxonomy_column="taxonomy") {
+    
+    # make sure that the taxonomy column is always named taxonomy
+    col_index <- which(colnames(df) == taxonomy_column)
+    colnames(df)[col_index] <- 'taxonomy'
+    df <- df %>% dplyr::mutate(across( where(is.numeric), \(x) tidyr::replace_na(x,0)  ) )%>% 
+      dplyr::mutate(taxonomy=map_chr(taxonomy,.f = function(taxon_name=.x){
+        last_assignment <- get_last_assignment(taxon_name) 
+        last_assignment  <- gsub(pattern = "\\[|\\]|'", replacement = '',x = last_assignment)
+        trimws(last_assignment, which = "both")
+      })) %>% 
+      as.data.frame(check.names=FALSE, stringsAsFactors=FALSE)
+    # Ensure the taxonomy names are unique by aggregating duplicates
+    df <- aggregate(.~taxonomy,data = df, FUN = sum)
+    return(df)
+  }
+  ```
 
-```bash
-kreport2krona.py --report-file sample-kraken2-report.tsv  --output sample.krona
-```
+  **Function Parameter Definitions:**
+  - `df` - a dataframe containing the taxonomy assignments
+  - `taxonomy_column=` - name of the column in `df` containing the taxonomy assignments, default="taxonomy"
 
-**Parameter Definition:**
+  **Returns:** a dataframe with unique taxonomy names stored in a column named "taxonomy"
 
-- `--output` - specifies the name of the krona output file
-- `--report-file` - specifies the name of the input kraken2 report file
+</details>
 
-**Input data:**
+##### process_kaiju_table()
+<details>
+  <summary>reformat kaiju output table</summary>
 
-- sample-kraken2-report.tsv (kraken report, output from [Step 10a](#10a-taxonomic-classification)
+  ```R
+  process_kaiju_table <- function(file_path, taxon_col="taxon_name") {
+  
+    abs_abun_df <-  read_delim(file = file_path,
+                               delim = "\t",
+                               col_names = TRUE) %>% 
+             select(sample, reads, taxonomy=!!sym(taxon_col)) %>%
+             pivot_wider(names_from = "sample", values_from = "reads",
+                             names_sort = TRUE) %>%
+             mutate_taxonomy
+  
+    # Set the taxon names as row names, drop the taxonomy column and convert to a matrix
+    rownames(abs_abun_df) <- abs_abun_df[,"taxonomy"]
+    abs_abun_df <- abs_abun_df[,-(which(colnames(abs_abun_df) == "taxonomy"))]
+    abs_abun_matrix <- as.matrix(abs_abun_df)
+    
+    return(abs_abun_matrix)
+  }
+  ```
 
-**Output data:**
+  **Function Parameter Definitions:**
+  - `file_path` - file path to the tab-delimited kaiju output table file
+  - `taxon_col=` - name of the taxon column in the input data file, default="taxon_name"
 
-- sample.krona (krona formatted kraken2 output)
+  **Returns:** a kaiju count matrix with samples and taxa as columns and rows, respectively.
 
+</details>
 
----
 
-### 11. Taxonomy Plots
+##### process_kraken_table()
+<details>
+  <summary>merge and process multiple kraken outputs to one species table</summary>
 
-#### 11a. Generate per sample Krona charts
+  ```R
+  process_kraken_table <- function(reports_dir) {
+
+    reports <- read_reports(reports_dir)
+
+    samples <- names(reports) %>%
+                  str_split("-") %>%
+                  map_chr(function(x) pluck(x, 1))
+    merged_reports  <- merge_reports2(reports, col_names = samples)
+    taxonReads <- merged_reports$taxonReads
+    cladeReads <- merged_reports$cladeReads
+    tax_data <- merged_reports[["tax_data"]]
+
+    species_table <- tax_data %>% 
+      bind_cols(cladeReads) %>%
+      filter(taxRank %in% c("U","S")) %>% # select unclassified and species rows 
+      select(-contains("tax")) %>%
+      zero_if_na() %>% 
+      filter(name != 0) %>%  # drop unknown taxonomies
+      group_by(name) %>% 
+      summarise(across(everything(), sum)) %>% 
+      ungroup() %>% 
+      as.data.frame() %>% 
+      rename(species=name)
+
+    species_names <- species_table[,"species"]
+    rownames(species_table) <- species_names
+    species_table <- species_table[,-(which(colnames(species_table) == "species"))]
+    species_table <- as.matrix(species_table)
+    
+    return(species_table)
+  }
+  ```
 
-```bash
-ktImportText -o sample_krona.html sample.krona
-```
+  **Function Parameter Definitions:**
+  - `reports_dir` - path to a directory containing kraken2 reports 
 
-**Parameter Definitions:**
+  **Returns:** a kraken species count matrix with samples and species as columns and rows, respectively.
 
-- `-o` - specifies the name of the krona output html file
-- `sample.krona` - positional argument specifying the krona text file for each sample
+</details>
 
-**Input Data:**
 
-- sample.krona (krona formatted kaiju or kraken output from [Step 9b](#9b-convert-kaiju-output-to-krona-format) or [Step 10c](#10c-convert-kraken2-output-to-krona-format)
+##### count_to_rel_abundance()
+<details>
+  <summary>Convert species count matrix to relative abundance matrix</summary>
 
-**Output Data:**
+  ```R
+  count_to_rel_abundance <- function(species_table) {
+
+    abund_table <- species_table %>% 
+                        as.data.frame %>% 
+                        mutate( across(everything(), function(x) (x/sum(x, na.rm = TRUE))*100 ) )  %>% # calculate species relative abundance per sample
+        select(
+                where( ~all(!is.na(.)) )
+              )  %>% # drop columns where none of the reads were classified or were non-microbial
+              rownames_to_column("Species") 
+      
+    rownames(abund_table) <- abund_table$Species
+      
+    abund_table <- abund_table[, -match(x = "Species", colnames(abund_table))] %>% t
 
-- **sample_krona.html** (per-sample Krona charts in html format)
+    return(abund_table)
+  }
+  ```
 
-#### 10e. Generate combined Krona chart
+  **Function Parameter Definitions:**
+  - `species_table` - a species count matrix with samples and species as columns and rows, respectively.
 
-```bash
-ktImportText -o ${classification_type}_krona_report.html ${input_dir}/*.krona
-```
+  **Returns:** a species relative abundance matrix with samples and species as rows and columns, respectively.
 
-**Parameter Definitions:**
+</details>
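+
+The conversion amounts to a per-sample (column-wise) percentage followed by a transpose. A minimal sketch on a made-up count matrix, using base R `sweep()` rather than the `dplyr` code above (`toy_counts` and its values are assumptions for illustration only):
+
+```R
+# Made-up counts: species as rows, samples as columns
+toy_counts <- matrix(c(80, 20, 0,
+                       10, 30, 60),
+                     nrow = 3,
+                     dimnames = list(c("Species A", "Species B", "Species C"),
+                                     c("sample_1", "sample_2")))
+
+# Divide each column (sample) by its total and scale to percent
+rel_abund <- sweep(toy_counts, 2, colSums(toy_counts), "/") * 100
+
+# Transpose so that samples are rows and species are columns
+rel_abund <- t(rel_abund)
+rel_abund
+rowSums(rel_abund)  # each sample sums to 100
+```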
 
-- `-o` - specifies the name of the krona output html file
-- `input_dir` - positional argument specifying the location of the krona files
-- `classification_type` - positional argument specifying which tool was used to create the taxonomic classification (kaiju or kraken2)
-- `*.krona` - positional argument specifying krona formatted text files for all samples
 
-**Input Data:**
-
-- *.krona (krona formatted kaiju or kraken output in krona format from [Step 9b](#9b-convert-kaiju-output-to-krona-format) or [Step 10c](#10c-convert-kraken2-output-to-krona-format))
-
-**Output Data:**
+##### filter_rare()
+<details>
+  <summary>filter out rare and non_microbial taxonomy assignments</summary>
 
-- **${classification_type}_krona_report.html** (per-sample Krona charts in html format)
+  ```R
+  filter_rare <- function(species_table, non_microbial, threshold=1) {
+    
+    clean_tab_count  <-  species_table %>% 
+                         as.data.frame %>% 
+                         rownames_to_column("Species") %>% 
+                         filter(str_detect(Species, non_microbial, negate = TRUE))  
+    
+    clean_tab <- clean_tab_count %>% 
+      mutate( across( where(is.numeric), \(x) (x/sum(x, na.rm = TRUE))*100 ) )
+    
+    rownames(clean_tab) <- clean_tab$Species
+    clean_tab  <- clean_tab[,-1] 
+    
+    
+    # Get species with relative abundance less than `threshold` percent in all samples
+    rare_species <- map(clean_tab, .f = \(col) rownames(clean_tab)[col < threshold])
+    rare <- Reduce(intersect, rare_species)
+    
+    rownames(clean_tab_count) <- clean_tab_count$Species
+    clean_tab_count  <- clean_tab_count[,-1] 
+    
+    abund_table <- clean_tab_count[!(rownames(clean_tab_count) %in% rare), ]
+    
+    return(abund_table)
+  }
+  ```
 
----
+  **Function Parameter Definitions:**
+  - `species_table` - the species count matrix or dataframe to filter, with samples and species as columns and rows, respectively
+  - `non_microbial` - a character string containing a regular expression used to identify a species as non-microbial
+  - `threshold=` - abundance threshold used to determine if the relative abundance is rare, value denotes a percentage between 0 and 100, default=1
 
-## Assembly-based processing
-### 11. Sample assembly
+  **Returns:** a dataframe with rare and non_microbial assignments removed
+</details>
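+
+The rare-species rule keeps any species that reaches the threshold in at least one sample; only species below the threshold in every sample are dropped. A minimal base R sketch of that rule on made-up percentages (`toy_abund` and its values are hypothetical):
+
+```R
+threshold <- 1  # percent
+
+# Made-up relative abundances (%): species as rows, samples as columns
+toy_abund <- data.frame(sample_1 = c(98.5, 1.2, 0.3),
+                        sample_2 = c(99.2, 0.4, 0.4),
+                        row.names = c("Species A", "Species B", "Species C"))
+
+# For each sample, list the species below the threshold
+rare_per_sample <- lapply(toy_abund, function(col) rownames(toy_abund)[col < threshold])
+
+# A species is rare only if it is below the threshold in every sample
+rare <- Reduce(intersect, rare_per_sample)
+rare                                                # "Species C"
+
+kept <- toy_abund[!(rownames(toy_abund) %in% rare), ]
+rownames(kept)                                      # "Species A" "Species B"
+```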
 
-```bash
-flye --meta --threads NumberOfThreads --out-dir sample/ \
-     --nano-hq /path/to/decontaminated_raw_data/sample_host_removed.fastq.gz
 
-# rename output files	              
-mv sample/assembly.fasta sample_assembly.fasta
-mv sample/flye.log sample_flye.log
-```
+##### make_plot()
+<details>
+  <summary>create bar plot of relative abundance</summary>
 
-**Parameter Definitions:**
+  ```R
+  # Make bar plot
+  make_plot <- function(abund_table, metadata, colors2use, publication_format) {
+    
+    abund_table_wide <- abund_table %>% 
+        as.data.frame() %>% 
+        rownames_to_column("Sample_ID") %>% 
+        inner_join(metadata) %>% 
+        select(!!!colnames(metadata), everything()) %>% 
+        mutate(Sample_ID = Sample_ID %>% str_remove("barcode"))
+        
+    abund_table_long <- abund_table_wide  %>%
+        pivot_longer(-colnames(metadata), 
+                    names_to = "Species",
+                    values_to = "relative_abundance")
+      
+    p <- ggplot(abund_table_long, mapping = aes(x=Sample_ID, y=relative_abundance, fill=Species)) +
+         geom_col() +
+         scale_fill_manual(values = colors2use) + 
+         labs(x=NULL, y="Relative Abundance (%)") + 
+         publication_format
 
--	`--meta` – use metagenome/uneven coverage mode
-- `--threads` - Number of parallel processing threads
-- `--out-dir` - Output directory
-- `--nano-hq` - specifies that input is from Oxford Nanopore high-quality reads (Guppy5+ SUP or Q20, <5% error)
+    return(p)
+  }
+  ```
 
-**Input Data**
+  **Function Parameter Definitions:**
+  - `abund_table` - a relative abundance matrix or dataframe to plot, with samples and species as rows and columns, respectively
+  - `metadata` - a dataframe of sample-wise metadata containing a `Sample_ID` column
+  - `colors2use` - a vector of strings specifying a custom color palette for coloring plots
+  - `publication_format` - a ggplot::theme object specifying the custom theme for plotting
 
-- sample_host_removed.fastq.gz (decontaminated raw data in fastq format, output from [Step 8](#8-host-removal))
+  **Returns:** a ggplot bar plot
 
-**Output Data**
+</details>
 
-- sample_assembly.fasta (sample assembly)
-- sample_flye.log (log file)
 
-<br>
+##### run_decontam()
+<details>
+  <summary>Feature table decontamination with decontam</summary>
 
----
+  ```R
+  run_decontam <- function(feature_table, metadata, contam_threshold=0.1, prev_col=NULL, freq_col=NULL) {
+
+    sub_metadata <- metadata[colnames(feature_table),]
+    # Modify NTC concentration
+    # Often times the user may set the NTC concentration to zero because they think nothing 
+    # should be in the negative control but decontam fails if the value is zero.
+    # to prevent decontam from failing we use a very small concentration value
+    # 0.0000001
+    if (!is.null(freq_col)) {
+
+      sub_metadata <- sub_metadata %>% 
+        mutate(!!freq_col:=map_dbl(!!sym(freq_col), .f= function(conc) { 
+                                      if(conc == 0) return(0.0000001) else return(conc) 
+                                    } 
+                                  )
+              )
+      sub_metadata[, freq_col] <- as.numeric(sub_metadata[,freq_col])
 
-### 12. Polish assembly
+    }
 
-```bash
-medaka_consensus -t NumberOfThreads -i /path/to/decontaminated_raw_data/sample_host_removed.fastq.gz \
-  -d /path/to/assemblies/sample_assembly.fasta -o sample/
-  
-mv sample/consensus.fasta sample_polished.fasta
-```
+    # Create phyloseq object
+    ps <- phyloseq(otu_table(feature_table, taxa_are_rows = TRUE), sample_data(sub_metadata))
 
-**Parameter Definition:**
+    # isContaminant requires the negative control information as a logical variable, with TRUE
+    # for control samples. Derive it from prev_col, where controls are named "Control_Sample".
+    if (!is.null(prev_col)) {
+      sample_data(ps)$is.neg <- sample_data(ps)[[prev_col]] == "Control_Sample"
+    }
 
-- `-t` - Number of parallel processing threads
-- `-i` - specifies path to input read files used in creating the assembly
-- `-d` - specifies path to the assembly fasta file
-- `-o` - specifies the output directory
+    # Run Decontam 
+    if (!is.null(freq_col) && !is.null(prev_col)) {   
 
-**Input Data:**
+      # Run decontam in both prevalence and frequency modes
+      contamdf <- isContaminant(ps, neg="is.neg", conc=freq_col, threshold=contam_threshold)
 
-- /path/to/decontaminated_raw_data/sample_host_removed.fastq.gz (decontaminated raw data in fastq format, output from [Step 8](#8-host-removal))
-- /path/to/assemblies/sample_assembly.fasta (sample assembly, output from [Step 11](#11-sample-assembly))
+    } else if(!is.null(freq_col)) {
+      
+      # Run decontam in frequency mode
+      contamdf <- isContaminant(ps, conc=freq_col, threshold=contam_threshold) # threshold
 
-**Output Data:**
+    } else if(!is.null(prev_col)){
 
-- sample_polished.fasta (polished sample assembly)
+      # Run decontam in prevalence mode
+      contamdf <- isContaminant(ps, neg="is.neg", threshold=contam_threshold)
+    
+    } else {
 
----
+      cat("freq_col and prev_col cannot both be set to NULL\n")
+      cat("please supply either one or both column names from your metadata ")
+      cat("for frequency and prevalence based analysis, respectively\n")
+      stop()
 
-### 13.
+    }
+                    
+    return(contamdf)
+  }
+  ```
 
-#### 13a. Renaming contig headers
+  **Function Parameter Definitions:**
+  - `metadata` - a dataframe of sample-wise metadata with sample names as row names
+  - `feature_table` - feature matrix to decontaminate with sample names as columns and features as rows
+  - `prev_col` - a character column in metadata to be used for prevalence based analysis. Controls in this column should always be named "Control_Sample", default=`NULL`
+  - `freq_col` - a numeric column in metadata to be used for frequency based analysis, default=`NULL`
+  - `contam_threshold` -  the probability threshold below which (strictly less than) the null-hypothesis 
+                          (not a contaminant) should be rejected in favor of the alternate hypothesis (contaminant), default=0.1
 
-```bash
-bit-rename-fasta-headers -i sample-1_polished.fasta -w c_sample-1 -o sample-1_assembly.fasta
-```
+  **Returns:** a dataframe of detailed decontam results
+</details>
 
-**Parameter Definitions:**  
+#### 9c. Set global variables
 
-- `-i` – input fasta file
+```R
+# Define custom theme for plotting
+publication_format <- theme_bw() +
+  theme(panel.grid = element_blank()) +
+  theme(axis.ticks.length=unit(-0.15, "cm"),
+        axis.text.x=element_text(margin=ggplot2::margin(t=0.5,r=0,b=0,l=0,unit ="cm")),
+        axis.text.y=element_text(margin=ggplot2::margin(t=0,r=0.5,b=0,l=0,unit ="cm")), 
+        axis.title = element_text(size = 18,face ='bold.italic', color = 'black'), 
+        axis.text = element_text(size = 16,face ='bold', color = 'black'),
+        legend.position = 'right', legend.title = element_text(size = 15,face ='bold', color = 'black'),
+        legend.text = element_text(size = 14,face ='bold', color = 'black'),
+        strip.text =  element_text(size = 14,face ='bold', color = 'black'))
 
-- `-w` – wanted header prefix (a number will be appended for each contig), starts with a “c_” to ensure they won’t start with a number which can be problematic
+# Define custom palette for plotting
+custom_palette <- c("#A6CEE3","#1F78B4","#B2DF8A","#33A02C","#FB9A99","#E31A1C","#FDBF6F", "#FF7F00",
+                    "#CAB2D6","#6A3D9A","#FF00FFFF","#B15928","#000000","#FFC0CBFF","#8B864EFF","#F0027F",
+                    "#666666","#1B9E77", "#E6AB02","#A6761D","#FFFF00FF","#FFFF99","#00FFFFFF",
+                    "#B2182B","#FDDBC7","#D1E5F0","#CC0033","#FF00CC","#330033",
+                    "#999933","#FF9933","#FFFAFAFF",colors()) 
+# remove white colors
+custom_palette <- custom_palette[-c(21:23,
+                                    grep(pattern = "white|snow|azure|gray|#FFFAFAFF|aliceblue",
+                                         x = custom_palette, 
+                                         ignore.case = TRUE)
+                                   )
+                                ]
+```
 
-- `-o` – output fasta file
+**Input Data:** 
 
+*No input data required*
 
-**Input data:**
+**Output Data:**
 
-- sample-1_polished.fasta (polished assembly file from [step 12](#12-polish-assembly))
+- `publication_format` (a ggplot::theme object specifying the custom theme for plotting)
+- `custom_palette` (a vector of strings specifying a custom color palette for coloring plots)
 
-**Output files:**
+<br>
 
-- **sample-1-assembly.fasta** (contig-renamed assembly file)
+---
 
+### 10. Taxonomic profiling using kaiju
 
-#### 13b. Summarizing assemblies
+#### 10a. Build kaiju database
 
 ```bash
-bit-summarize-assembly -o assembly-summaries_GLmetagenomics.tsv *assembly.fasta
+# Make a directory to hold the downloaded kaiju database files
+mkdir kaiju-db/ && cd kaiju-db/
+# Download kaiju's reference database
+kaiju-makedb -s nr_euk -t NumberOfThreads
+# Cleaning up
+rm nr_euk/kaiju_db_nr_euk.bwt nr_euk/kaiju_db_nr_euk.sa
 ```
 
-**Parameter Definitions:**  
-
-- `-o` – output summary table
-
-*	– multiple input assemblies can be provided as positional arguments
-
-
-**Input data:**
+**Parameter Definitions:**
 
-- *-assembly.fasta (contig-renamed assembly files from [step 13a](#13a-renaming-contig-headers))
+- `-t` - number of parallel processing threads to use
+- `-s nr_euk` - specifies the source database to download and build; `nr_euk` is NCBI's nr protein database, additionally including fungi and microbial eukaryotes
 
-**Output files:**
+**Input Data:**
 
-- **assembly-summaries_GLmetagenomics.tsv** (table of assembly summary statistics)
+*No input data required*
 
-<br>
+**Output Data:**
 
----
+- kaiju-db/nr_euk/kaiju_db_nr_euk.fmi (fmi file)
+- kaiju-db/nodes.dmp (nodes file)
+- kaiju-db/names.dmp (names file)
 
 
----
+#### 10b. Kaiju Taxonomic Classification
 
-### 14. Gene prediction
 ```bash
-prodigal -a sample-1-genes.faa -d sample-1-genes.fasta -f gff -p meta -c -q \
-         -o sample-1-genes.gff -i sample-1-assembly.fasta
+kaiju -f kaiju-db/nr_euk/kaiju_db_nr_euk.fmi -t kaiju-db/nodes.dmp \
+    -z NumberOfThreads \
+    -E 1e-05 \
+    -i /path/to/decontaminated_reads/sample_host_removed.fastq.gz \
+    -o sample_kaiju.out
 ```
+
 **Parameter Definitions:**
 
-- `-a` – specifies the output amino acid sequences file
+- `-f` - specifies path to the Kaiju database (.fmi) file
+- `-t` - specifies path to the Kaiju nodes.dmp file
+- `-z` - number of parallel processing threads to use
+- `-E` - specifies the minimum E-value in Greedy mode (default: 0.01)
+- `-i` - specifies path to the input file
+- `-o` - specifies the name of output file
 
-- `-d` – specifies the output nucleotide sequences file
+**Input Data:**
 
-- `-f` – specifies the output format gene-calls file
+- kaiju-db/nr_euk/kaiju_db_nr_euk.fmi (fmi file, output from [Step 10a](#10a-build-kaiju-database))
+- kaiju-db/nodes.dmp (nodes file, output from [Step 10a](#10a-build-kaiju-database))
+- sample_host_removed.fastq.gz (gzipped decontaminated reads fastq file, output from [Step 8](#8-host-removal))
 
-- `-p` – specifies which mode to run the gene-caller in 
+**Output Data:**
 
-- `-c` – no incomplete genes reported 
+- sample_kaiju.out (kaiju output file)
 
-- `-q` – run in quiet mode (don’t output process on each contig) 
+#### 10c. Compile kaiju taxonomy results
 
-- `-o` – specifies the name of the output gene-calls file 
+```bash
+# Merge kaiju reports to one table at the species level
+kaiju2table -t kaiju-db/nodes.dmp -n kaiju-db/names.dmp -p -r species \
+            -o merged_kaiju_table.tsv *_kaiju.out
 
-- `-i` – specifies the input assembly
+# Convert the file names to sample names
+sed -i -E 's/.+\/(.+)_kaiju\.out/\1/g' merged_kaiju_table.tsv && \
+sed -i -E 's/file/sample/' merged_kaiju_table.tsv
+```
 
-**Input data:**
+**Parameter Definitions:**
 
-- sample-1-assembly.fasta (contig-renamed assembly file from [step 5a](#5a-renaming-contig-headers))
+- `-n` - specifies path to the Kaiju names.dmp file
+- `-t` - specifies path to the Kaiju nodes.dmp file
+- `-p` - print the full taxon path instead of just the taxon name
+- `-r` - specifies taxonomic rank, must be one of: phylum, class, order, family, genus, species
+- `-o` - specifies the name of the merged kaiju summary table output file
+- `*_kaiju.out` - positional arguments specifying the paths to the Kaiju output files (output from [Step 10b](#10b-kaiju-taxonomic-classification))
 
-**Output data:**
+**Input Data:**
 
-- sample-1-genes.faa (gene-calls amino-acid fasta file)
-- sample-1-genes.fasta (gene-calls nucleotide fasta file)
-- **sample-1-genes.gff** (gene-calls in general feature format)
+- kaiju-db/nodes.dmp (nodes file, output from [Step 10a](#10a-build-kaiju-database))
+- kaiju-db/names.dmp (names file, output from [Step 10a](#10a-build-kaiju-database))
+- *kaiju.out (kaiju output files, output from [Step 10b](#10b-kaiju-taxonomic-classification))
 
-<br>
+**Output Data:**
 
-#### 14a. Remove line wraps in gene prediction output
-```bash
-bit-remove-wraps sample-1-genes.faa > sample-1-genes.faa.tmp 2> /dev/null
-mv sample-1-genes.faa.tmp sample-1-genes.faa
+- **merged_kaiju_table.tsv** (Compiled kaiju table at the species taxon level)
 
-bit-remove-wraps sample-1-genes.fasta > sample-1-genes.fasta.tmp 2> /dev/null
-mv sample-1-genes.fasta.tmp sample-1-genes.fasta
-```
+#### 10d. Convert kaiju output to krona format
 
-**Input data:**
+```bash
+kaiju2krona -u -n kaiju-db/names.dmp -t kaiju-db/nodes.dmp \
+            -i sample_kaiju.out -o sample.krona
+```
 
-- sample-1-genes.faa (gene-calls amino-acid fasta file, output from [Step 14](#14-gene-prediction))
-- sample-1-genes.fasta (gene-calls nucleotide fasta file, output from [Step 14](#14-gene-prediction))
+**Parameter Definitions:**
 
-**Output data:**
+- `-u` - include count for unclassified reads in output
+- `-n` - specifies path to the Kaiju names.dmp file
+- `-t` - specifies path to the Kaiju nodes.dmp file
+- `-i` - specifies path to the Kaiju output file (output from [Step 10b](#10b-kaiju-taxonomic-classification))
+- `-o` - specifies the name of krona formatted kaiju output file
 
-- **sample-1-genes.faa** (gene-calls amino-acid fasta file with line wraps removed)
-- **sample-1-genes.fasta** (gene-calls nucleotide fasta file with line wraps removed)
+**Input Data:**
+- kaiju-db/nodes.dmp (nodes file, output from [Step 10a](#10a-build-kaiju-database))
+- kaiju-db/names.dmp (names file, output from [Step 10a](#10a-build-kaiju-database))
+- sample_kaiju.out (kaiju output file, output from [Step 10b](#10b-kaiju-taxonomic-classification))
 
+**Output Data:**
 
----
+- sample.krona (krona formatted kaiju output)
 
-### 15. Functional annotation
-> **Notes**  
-> The annotation process overwrites the same temporary directory by default. So if running multiple processses at a time, it is necessary to specify a specific temporary directory with the `--tmp-dir` argument as shown below.
+#### 10e. Compile kaiju krona report
 
+```bash
+# Find, list and write all .krona files to file 
+find . -type f -name "*.krona" |sort -uV > krona_files.txt
 
-#### 15a. Downloading reference database of HMM models (only needs to be done once)
+FILES=($(find . -type f -name "*.krona"))
+basename --multiple --suffix='.krona' ${FILES[*]} | sort -uV  > sample_names.txt
 
-```
-curl -LO ftp://ftp.genome.jp/pub/db/kofam/profiles.tar.gz
-curl -LO ftp://ftp.genome.jp/pub/db/kofam/ko_list.gz
-tar -xzvf profiles.tar.gz
-gunzip ko_list.gz 
-```
+# Create ktImportText input format files
+KTEXT_FILES=($(paste -d',' "krona_files.txt" "sample_names.txt"))
 
-#### 15b. Running KEGG annotation
-```
-exec_annotation -p profiles/ -k ko_list --cpu NumberOfThreads -f detail-tsv -o sample-1-KO-tab.tmp \
-                --tmp-dir sample-1-tmp-KO --report-unannotated sample-1-genes.faa 
+# Create html   
+ktImportText  -o kaiju-report.html ${KTEXT_FILES[*]}
 ```
 
 **Parameter Definitions:**
-- `-p` – specifies the directory holding the downloaded reference HMMs
 
-- `-k` – specifies the downloaded reference KO  (Kegg Orthology) terms 
+**find**
 
-- `--cpu` – specifies the number of searches to run in parallel
+- `-type f` -  specifies that the type of file to find is a regular file
+- `-name "*.krona"` - specifies to find files ending with the .krona suffix  
 
-- `-f` – specifies the output format
+**sort**
 
-- `-o` – specifies the output file name
+- `-u` - specifies to perform a unique sort
+- `-V` - specifies to perform a natural sort of (version) numbers within text
 
-- `--tmp-dir` – specifies the temporary directory to write to (needed if running more than one process concurrently, see Notes above)
+**basename**
 
-- `--report-unannotated` – specifies to generate an output for each entry
+- `--multiple` - support multiple arguments and treat each as a file name
+- `--suffix='.krona'` - remove a trailing '.krona' suffix
 
-- `sample-1-genes.faa` – the input file is specified as a positional argument 
+**paste**
 
+- `-d','` - paste both krona and sample files together line by line delimited by comma ','
 
-**Input data:**
+**ktImportText**
 
-- sample-1-genes.faa (amino-acid fasta file, from [step 6](#6-gene-prediction))
-- profiles/ (reference directory holding the KO HMMs)
-- ko_list (reference list of KOs to scan for)
+- `-o` - specifies the compiled output html file name
+- `${KTEXT_FILES[*]}` - an array of positional arguments with the following content: 
+                     sample_1.krona,sample_1 sample_2.krona,sample_2 .. sample_n.krona,sample_n.
 
-**Output data:**
+**Input Data:**
+- *.krona (all sample .krona formatted files, output from [Step 10d](#10d-convert-kaiju-output-to-krona-format))
 
-- sample-1-KO-tab.tmp (table of KO annotations assigned to gene IDs)
+
+**Output Data:**
 
+- kaiju-report.html (compiled krona html report output)
 
-#### 15c. Filtering output to retain only those passing the KO-specific score and top hits
-```
-bit-filter-KOFamScan-results -i sample-1-KO-tab.tmp -o sample-1-annotations.tsv
 
-  # removing temporary files
-rm -rf sample-1-tmp-KO/ sample-1-KO-annots.tmp
-```
+#### 10f. Create kaiju species count table
 
-**Parameter Definitions:**  
+```R
+library(tidyverse)
+feature_table <- process_kaiju_table(file_path="merged_kaiju_table.tsv") %>%
+  as.data.frame() %>% rownames_to_column("Species")  # write_csv() requires a dataframe; keep the taxon names as a column
+write_csv(x = feature_table, file = "kaiju_species_table.csv")
+```
 
-- `-i` – specifies the input table
+**Parameter Definitions:**
 
-- `-o` – specifies the output table
+- `file_path` - path to compiled kaiju table at the species taxon level
+- `x`  - feature table dataframe to write to file
+- `file` - path to the output kaiju species count table
 
+**Input Data:**
 
-**Input data:**
+- merged_kaiju_table.tsv (Compiled kaiju table at the species taxon level, from [Step 10c](#10c-compile-kaiju-taxonomy-results))
 
-- sample-1-KO-tab.tmp (table of KO annotations assigned to gene IDs from [step 7b](#7b-running-kegg-annotation))
+**Output Data:**
 
-**Output data:**
+- kaiju_species_table.csv (kaiju species count table in csv format)
 
-- sample-1-annotations.tsv (table of KO annotations assigned to gene IDs)
 
-<br>
+#### 10g. Read-in tables
 
----
+```R
+library(tidyverse)
 
-### 16. Taxonomic classification
+# Read-in metadata
+metadata_file <- "/path/to/sample/metadata"
+samples_column <- "Sample_ID"
+metadata <- read_delim(file=metadata_file, delim = "\t") %>% as.data.frame()
+row.names(metadata) <- metadata[,samples_column]
 
-#### 16a. Pulling and un-packing pre-built reference db (only needs to be done once)
-```
-wget tbb.bio.uu.nl/bastiaan/CAT_prepare/CAT_prepare_20200618.tar.gz
-tar -xvzf CAT_prepare_20200618.tar.gz
+# Read-in feature table and restore the species names as row names
+species_table <- read_csv(file="kaiju_species_table.csv") %>% as.data.frame() %>% column_to_rownames("Species")
 ```
 
-#### 16b. Running taxonomic classification
-```
-CAT contigs -c sample-1-assembly.fasta -d CAT_prepare_20200618/2020-06-18_database/ \
-            -t CAT_prepare_20200618/2020-06-18_taxonomy/ -p sample-1-genes.faa \
-            -o sample-1-tax-out.tmp -n NumberOfThreads -r 3 --top 4 --I_know_what_Im_doing --no-stars
-```
+**Parameter Definitions:**
 
-**Parameter Definitions:**  
+- `file` - path to input tables
+- `delim` - file delimiter 
 
-- `-c` – specifies the input assembly fasta file
+**Input Data:**
 
-- `-d` – specifies the CAT reference sequence database
+- metadata_file  (path to sample-wise metadata file)
+- kaiju_species_table.csv (path to kaiju species table from [Step 10f](#10f-create-kaiju-species-count-table))
 
-- `-t` – specifies the CAT reference taxonomy database
+**Output Data:**
 
-- `-p` – specifies the input protein fasta file
+- `metadata` - a dataframe of sample-wise metadata
+- `species_table` - a dataframe of species count per sample
+---
 
-- `-o` – specifies the output prefix
+#### 10h. Taxonomy barplots
 
-- `-n` – specifies the number of CPU cores to use
+```R
+library(tidyverse)
 
-- `-r` – specifies the number of top protein hits to consider in assigning tax
+filter_threshold=0.5
+# Filter out Rare and non-microbial assignment
+# You can add as many species that you'd like to filter out
+# using the following syntax "|species_name1|species_name2"
+non_microbial <- "Unclassified|unclassified|Homo sapien|cannot|uncultured|unidentified"
 
-- `--top` – specifies the number of protein alignments to store
+plot_width <- 18
+plot_height <- 8
 
-- `--I_know_what_Im_doing` – allows us to alter the `--top` parameter
+# Convert count matrix to relative abundance matrix
+abund_table <- count_to_rel_abundance(species_table)
 
-- `--no-stars` - suppress marking of suggestive taxonomic assignments
+# Make plot without filtering
+p <- make_plot(abund_table, metadata, custom_palette, publication_format)
 
+ggsave(filename =  "unfiltered-kaiju_species_plot.png", plot = p,
+       device = "png", width = plot_width, height = plot_height, units = "in", dpi = 300)
 
-**Input data:**
 
-- sample-1-assembly.fasta (assembly file from [step 5a](#5a-renaming-contig-headers))
-- sample-1-genes.faa (gene-calls amino-acid fasta file from [step 6](#6-gene-prediction))
+# Keep species with relative abundance of at least filter_threshold percent in at least one sample
+# i.e. drop rare (below filter_threshold in all samples) and non-microbial assignments
+filtered_species_table  <- filter_rare(species_table, non_microbial, threshold=filter_threshold)
 
-**Output data:**
 
-- sample-1-tax-out.tmp.ORF2LCA.txt (gene-calls taxonomy file)
-- sample-1-tax-out.tmp.contig2classification.txt (contig taxonomy file)
+# Convert count matrix to relative abundance matrix
+filtered_species_table <- count_to_rel_abundance(filtered_species_table)
 
-#### 16c. Adding taxonomy info from taxids to genes
-```
-CAT add_names -i sample-1-tax-out.tmp.ORF2LCA.txt -o sample-1-gene-tax-out.tmp \
-              -t CAT_prepare_20200618/2020-06-18_taxonomy/ --only_official --exclude-scores
+# Write filtered table to file
+write_csv(x = filtered_species_table, file = "filtered-kaiju_species_table.csv")
+
+# Make plot after filtering
+p <- make_plot(filtered_species_table , metadata, custom_palette, publication_format)
+
+ggsave(filename = "filtered-kaiju_species_plot.png", plot = p,
+         device = "png", width = plot_width, height = plot_height, units = "in", dpi = 300)
 ```
 
-**Parameter Definitions:**  
+**Parameter Definitions:**
 
-- `-i` – specifies the input taxonomy file
+- `filter_threshold` - a decimal threshold between 0 and 1 used to filter out rare species, i.e., potential false positives.
+- `non_microbial` - a regex string listing the assignments to drop before filtering based on the `filter_threshold` above.
 
-- `-o` – specifies the output file 
+**Input Data:**
 
-- `-t` – specifies the CAT reference taxonomy database
+- `species_table` (a dataframe of species count per sample, output from [Step 10g](#10g-read-in-tables))
+- `metadata` (a dataframe of sample-wise metadata, output from [Step 10g](#10g-read-in-tables))
 
-- `--only_official` – specifies to add only standard taxonomic ranks
+**Output Data:**
 
-- `--exclude-scores` - specifies to exclude bit-score support scores in the lineage
+- **unfiltered-kaiju_species_plot.png** (barplot without filtering)
+- **filtered-kaiju_species_table.csv** (filtered relative abundance table)
+- **filtered-kaiju_species_plot.png** (barplot after filtering rare and non-microbial taxa)
 
-**Input data:**
 
-- sample-1-tax-out.tmp.ORF2LCA.txt (gene-calls taxonomy file from [step 8b](#8b-running-taxonomic-classification))
+#### 10i. Feature decontamination
 
-**Output data:**
+Feature decontamination with decontam, an R package that statistically identifies contaminating features in a feature table.
 
-- sample-1-gene-tax-out.tmp (gene-calls taxonomy file with lineage info added)
+```R
+library(tidyverse)
+library(decontam)
+feature_table <- read_csv("filtered-kaiju_species_table.csv")
+contam_threshold <- 0.1
+# Control samples in this column should always be written as "Control_Sample" and true samples as "True_Sample"
+prev_col <- "Sample_or_Control"
+freq_col <- "input_conc_ng"
+plot_width <- 18
+plot_height <- 8
 
+contamdf <- run_decontam(feature_table, metadata, contam_threshold, prev_col, freq_col)
 
+# Write decontam results table to file
+write_csv(x = contamdf %>% rownames_to_column("Species"), file = "decontam-kaiju_results.csv")
 
-#### 16d. Adding taxonomy info from taxids to contigs
-```
-CAT add_names -i sample-1-tax-out.tmp.contig2classification.txt -o sample-1-contig-tax-out.tmp \
-              -t CAT-ref/2020-06-18_taxonomy/ --only_official --exclude-scores
+# Get the list of contaminants identified by decontam
+contaminants <- contamdf %>%
+                as.data.frame %>%
+                rownames_to_column("Species") %>%
+                filter(contaminant == TRUE) %>% pull(Species)
+
+# Drop contaminant features identified by decontam
+decontaminated_table <- feature_table %>% 
+                as.data.frame  %>% 
+                rownames_to_column("Species") %>% 
+                filter(str_detect(Species, 
+                                  pattern = str_c(contaminants,
+                                                  collapse = "|"),
+                                  negate = TRUE)) %>%
+                select(-Species) %>% as.matrix
+
+# Convert count matrix to relative abundance matrix
+decontaminated_species_table <- count_to_rel_abundance(decontaminated_table)
+
+# Write decontaminated species table to file
+write_csv(x = decontaminated_species_table, file = "decontaminated-kaiju_species_table.csv")
+
+# Make plot after filtering out contaminants
+p <- make_plot(decontaminated_species_table , metadata, custom_palette, publication_format)
+
+ggsave(filename = "decontaminated-kaiju-species_plot.png", plot = p,
+         device = "png", width = plot_width, height = plot_height, units = "in", dpi = 300)
 ```
 
-**Parameter Definitions:**  
+**Input Data:**
 
-- `-i` – specifies the input taxonomy file
+- `filtered-kaiju_species_table.csv` (a dataframe of species count per sample, output from [Step 10h](#10h-taxonomy-barplots))
+- `metadata` (a dataframe of sample-wise metadata, output from [Step 10g](#10g-read-in-tables))
 
-- `-o` – specifies the output file 
+**Output Data:**
 
-- `-t` – specifies the CAT reference taxonomy database
+- **decontam-kaiju_results.csv** (decontam's results table)
+- **decontaminated-kaiju_species_table.csv** (decontaminated species table)
+- **decontaminated-kaiju-species_plot.png** (barplot after filtering out contaminants)
 
-- `--only_official` – specifies to add only standard taxonomic ranks
+<br>
 
-- `--exclude-scores` - specifies to exclude bit-score support scores in the lineage
+---
 
+### 11. Taxonomic Profiling using Kraken2
 
-**Input data:**
+#### 11a. Download kraken2 database
 
-- sample-1-tax-out.tmp.contig2classification.txt (contig taxonomy file from [step 8b](#8b-running-taxonomic-classification))
+```bash 
+## Download all microbial (including eukaryotes) - https://benlangmead.github.io/aws-indexes/k2
 
-**Output data:**
+# Download kraken2's prebuilt pluspfp database, which contains the standard database + protozoa + fungi + plants.
 
-- sample-1-contig-tax-out.tmp (contig taxonomy file with lineage info added)
+mkdir kraken2-db/ && cd kraken2-db/
 
+# Inspect file
+INSPECT_URL=https://genome-idx.s3.amazonaws.com/kraken/pluspfp_20250714/inspect.txt
+wget ${INSPECT_URL}
 
-#### 16e. Formatting gene-level output with awk and sed
-```
-awk -F $'\t' ' BEGIN { OFS=FS } { if ( $3 == "lineage" ) { print $1,$3,$5,$6,$7,$8,$9,$10,$11 } \
-    else if ( $2 == "ORF has no hit to database" || $2 ~ /^no taxid found/ ) \
-    { print $1,"NA","NA","NA","NA","NA","NA","NA","NA" } else { n=split($3,lineage,";"); \
-    print $1,lineage[n],$5,$6,$7,$8,$9,$10,$11 } } ' sample-1-gene-tax-out.tmp | \
-    sed 's/no support/NA/g' | sed 's/superkingdom/domain/' | sed 's/# ORF/gene_ID/' | \
-    sed 's/lineage/taxid/'  > sample-1-gene-tax-out.tsv
-```
+# Library report
+LIRARY_REPORT_URL=https://genome-idx.s3.amazonaws.com/kraken/pluspfp_20250714/library_report.tsv
+wget ${LIRARY_REPORT_URL}
 
-#### 16f. Formatting contig-level output with awk and sed
-```
-awk -F $'\t' ' BEGIN { OFS=FS } { if ( $2 == "classification" ) { print $1,$4,$6,$7,$8,$9,$10,$11,$12 } \
-    else if ( $2 == "no taxid assigned" ) { print $1,"NA","NA","NA","NA","NA","NA","NA","NA" } \
-    else { n=split($4,lineage,";"); print $1,lineage[n],$6,$7,$8,$9,$10,$11,$12 } } ' sample-1-contig-tax-out.tmp | \
-    sed 's/no support/NA/g' | sed 's/superkingdom/domain/' | sed 's/^# contig/contig_ID/' | \
-    sed 's/lineage/taxid/' > sample-1-contig-tax-out.tsv
+# Md5sums
+MD5_URL=https://genome-idx.s3.amazonaws.com/kraken/pluspfp_20250714/pluspfp.md5 
+wget ${MD5_URL}
 
-  # clearing intermediate files
-rm sample-1*.tmp*
+# Download and unzip the main database files
+DB_URL=https://genome-idx.s3.amazonaws.com/kraken/k2_pluspfp_20250714.tar.gz 
+wget -O k2_pluspfp.tar.gz --timeout=3600 --tries=0 --continue ${DB_URL} && \
+tar -xvzf k2_pluspfp.tar.gz
 ```
 
-**Input data:**
+**Parameter Definitions:**
 
-- sample-1-gene-tax-out.tmp (gene-calls taxonomy file with lineage info added from [step 8c](#8c-adding-taxonomy-info-from-taxids-to-genes))
-- sample-1-contig-tax-out.tmp (contig taxonomy file with lineage info added from [step 8d](#8d-adding-taxonomy-info-from-taxids-to-contigs))
+**wget**
 
+- `-O` - name of the file to write the downloaded url content to.
+- `--timeout=3600` - sets the network timeout to 3600 seconds.
+- `--tries=0` - retry the download indefinitely.
+- `--continue` - continue getting a partially-downloaded file.
+- `*_URL` - positional argument specifying the url to download a particular resource from.
 
-**Output data:**
 
-- sample-1-gene-tax-out.tsv (gene-calls taxonomy file with lineage info added reformatted)
-- sample-1-contig-tax-out.tsv (contig taxonomy file with lineage info added reformatted)
+**Input Data:**
 
-<br>
+- `INSPECT_URL=` - url specifying the location of kraken2 inspect file
+- `LIRARY_REPORT_URL=` -  url specifying the location of kraken2 library report file
+- `MD5_URL=` -  url specifying the location of md5 file of kraken database
+- `DB_URL=` - url specifying the location of the main kraken database archive in .tar.gz format
 
----
+**Output Data:**
 
-### 17. Read-Mapping
+- kraken2-db/  (a directory containing kraken 2 database files)
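+
+The md5 file fetched above is not used by the download commands themselves. A minimal sketch of how the archive could be checked against it is shown below; `verify_md5` is a hypothetical helper (not part of kraken2), and the layout of `pluspfp.md5` is assumed to be the usual `md5sum` output (checksum followed by file name).
+
+```bash
+# Hypothetical helper: succeed only if the file's md5 checksum matches the expected value
+verify_md5() {
+    local file=$1 expected=$2 actual
+    # md5sum prints "<checksum>  <file name>"; keep only the checksum
+    actual=$(md5sum "$file" | awk '{print $1}')
+    [ "$actual" = "$expected" ]
+}
+
+# Example usage (assumes pluspfp.md5 lists the archive under its original name):
+# expected=$(grep 'k2_pluspfp_20250714.tar.gz' pluspfp.md5 | awk '{print $1}')
+# verify_md5 k2_pluspfp.tar.gz "$expected" && echo "checksum OK"
+```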
 
-#### 17a. Align Reads to Sample Assembly
+#### 11b. Taxonomic Classification
 
 ```bash
-minimap2 -a -x map-ont -t NumberOfThreads sample_assembly.fasta sample_host_removed.fastq.gz \
-  > sample.sam  2> sample-mapping-info.txt | 
+kraken2 --db kraken2-db/ --gzip-compressed --threads NumberOfThreads --use-names \
+        --output sample-kraken2-output.txt --report sample-kraken2-report.tsv \
+        /path/to/decontaminated_reads/sample_host_removed.fastq.gz
 ```
 
 **Parameter Definitions:**
 
-- `-t` - Number of parallel processing threads
--	`-a` – output in SAM format
-- `-x map-ont` - specifies preset for mapping Nanopore reads to a reference
+- `--db` - specifies the directory holding the kraken2 database files 
+- `--gzip-compressed` - specifies the input fastq files are gzip-compressed
+- `--threads` - number of parallel processing threads to use
+- `--use-names` - specifies adding taxa names in addition to taxids
+- `--output` - specifies the name of the kraken2 read-based output file (one line per read)
+- `--report` - specifies the name of the kraken2 report output file (one line per taxon, with number of reads assigned to it)
+- `sample_host_removed.fastq.gz` - positional argument specifying the input read file
 
-**Input Data**
+**Input Data:**
 
-- /path/to/assemblies/sample_assembly.fasta (Sample assembly, output from [Step 13a](#13a-renaming-contig-headers))
-- /path/to/trimmed_reads/sample_host_removed.fastq.gz (Filtered and trimmed reads, output from [Step 8](#8-host-removal))
+- kraken2-db/ (a directory containing kraken2 database files, output from [Step 11a](#11a-download-kraken2-database))
+- sample_host_removed.fastq.gz (gzipped reads fastq file, output from [Step 7d](#7d-generate-decontaminated-read-files))
 
-**Output Data**
+**Output Data:**
 
-- sample.sam (Reads aligned to contaminant assembly)
+- sample-kraken2-output.txt (kraken2 read-based output file (one line per read))
+- sample-kraken2-report.tsv (kraken2 report output file (one line per taxon, with number of reads assigned to it))
 
-#### 17b. Sort and Index Assembly Alignments
-```bash
-# Sort Sam, convert to bam and create index
-samtools sort --threads NumberOfThreads -o sample_sorted.bam sample.sam > sample_sort.log 2>&1
+#### 11c. Convert Kraken2 output to Krona format
 
-samtools index sample_sorted.bam sample_sorted.bam.bai
+```bash
+kreport2krona.py --report-file sample-kraken2-report.tsv  --output sample.krona
 ```
 
 **Parameter Definitions:**
 
-**samtools sort**
-- `--threads` - Number of parallel processing threads
-- `-o` - specifies the output file for the sorted reads
-- `sample.sam` - positional argument specifying the input SAM file
-
-**samtools index**
-- `sample_sorted.bam` - positional argument specifying the input BAM file to be sorted
-- `sample_sorted.bam.bai` - positional argument specifying the name of the index file
+- `--output` - specifies the name of the krona output file
+- `--report-file` - specifies the name of the input kraken2 report file
 
 **Input Data:**
 
-- sample.sam (Reads aligned to sample assembly, output from [Step 13c](#13c-read-mapping))
+- sample-kraken2-report.tsv (kraken2 report, output from [Step 11b](#11b-taxonomic-classification))
 
 **Output Data:**
 
-- sample_sorted.bam (sorted mapping to sample assembly)
-- sample_sorted.bam.bai (index of sorted mapping to sample assembly)
+- sample.krona (krona formatted kraken2 output)
 
-<br>
 
----
+#### 11d. Compile kraken2 krona report
 
-### 18. Getting coverage information and filtering based on detection
-> **Notes**  
-> “Detection” is a metric of what proportion of a reference sequence recruited reads (see [here](http://merenlab.org/2017/05/08/anvio-views/#detection)). Filtering based on detection is one way of helping to mitigate non-specific read-recruitment.
+```bash
+# Find, list and write all .krona files to file 
+find . -type f -name "*.krona" |sort -uV > krona_files.txt
 
-#### 18a. Filtering coverage levels based on detection
+FILES=($(find . -type f -name "*.krona"))
+basename --multiple --suffix='.krona' ${FILES[*]} | sort -uV  > sample_names.txt
 
-```bash
-  # pileup.sh comes from the bbduk.sh package
-pileup.sh -in sample-1.bam fastaorf=sample-1-genes.fasta outorf=sample-1-gene-cov-and-det.tmp \
-          out=sample-1-contig-cov-and-det.tmp
+# Create ktImportText input format files
+KTEXT_FILES=($(paste -d',' "krona_files.txt" "sample_names.txt"))
+
+# Create html   
+ktImportText  -o kraken-report.html ${KTEXT_FILES[*]}
 ```
 
-**Parameter Definitions:**  
+**Parameter Definitions:**
 
-- `-in` – the input bam file
+**find**
 
-- `fastaorf=` – input gene-calls nucleotide fasta file
+- `-type f` -  specifies that the type of file to find is a regular file
+- `-name "*.krona"` - specifies to find files ending with the .krona suffix  
 
-- `outorf=` – the output gene-coverage tsv file
+**sort**
 
-- `out=` – the output contig-coverage tsv file
+- `-u` - specifies to perform a unique sort
+- `-V` - specifies a natural (version) sort, so that numbers embedded in names are ordered numerically
 
+**basename**
 
-#### 18b. Filtering gene coverage based on requiring 50% detection and parsing down to just gene ID and coverage
-```bash
-grep -v "#" sample-1-gene-cov-and-det.tmp | awk -F $'\t' ' BEGIN { OFS=FS } { if ( $10 <= 0.5 ) $4 = 0 } \
-     { print $1,$4 } ' > sample-1-gene-cov.tmp
+- `--multiple` - support multiple arguments and treat each as a file name
+- `--suffix='.krona'` - remove a trailing '.krona' suffix
 
-cat <( printf "gene_ID\tcoverage\n" ) sample-1-gene-cov.tmp > sample-1-gene-coverages.tsv
-```
+**paste**
 
-Filtering contig coverage based on requiring 50% detection and parsing down to just contig ID and coverage:
-```bash
-grep -v "#" sample-1-contig-cov-and-det.tmp | awk -F $'\t' ' BEGIN { OFS=FS } { if ( $5 <= 50 ) $2 = 0 } \
-     { print $1,$2 } ' > sample-1-contig-cov.tmp
+- `-d','` - paste both krona and sample files together line by line delimited by comma ','
 
-cat <( printf "contig_ID\tcoverage\n" ) sample-1-contig-cov.tmp > sample-1-contig-coverages.tsv
+**ktImportText**
 
-  # removing intermediate files
+- `-o` - specifies the compiled output html file name
+- `${KTEXT_FILES[*]}` - an array of positional arguments with the following content:
+                     sample_1.krona,sample_1 sample_2.krona,sample_2 .. sample_n.krona,sample_n.
 
-rm sample-1-*.tmp
-```
+**Input Data:**
 
-**Input data:**
+- *.krona (all sample .krona formatted files, output from [Step 11c](#11c-convert-kraken2-output-to-krona-format))
 
-- sample-1.bam (mapping file from [step 9b](#9b-performing-mapping-conversion-to-bam-and-sorting))
-- sample-1-genes.fasta (gene-calls nucleotide fasta file from [step 6](#6-gene-prediction))
+                      
+**Output Data:**
 
-**Output data:**
+- kraken-report.html (compiled krona html report output)
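+
+The commands above pair each .krona file with its sample name by pasting two separately sorted lists together. As an alternative sketch, each `file,name` entry can be built from a single file path so the two halves cannot drift apart; `krona_args` is a hypothetical helper name, and GNU `sort -V` and bash 4+ (`mapfile`) are assumed.
+
+```bash
+# Hypothetical helper: print one "path,sample_name" entry per .krona file under a directory
+krona_args() {
+    local dir=$1 file
+    find "$dir" -type f -name "*.krona" | sort -V | while read -r file; do
+        # The sample name is the file name with the .krona suffix removed
+        printf '%s,%s\n' "$file" "$(basename "$file" .krona)"
+    done
+}
+
+# Example usage (assumes ktImportText is on the PATH):
+# mapfile -t KTEXT_FILES < <(krona_args .)
+# ktImportText -o kraken-report.html "${KTEXT_FILES[@]}"
+```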
 
-- sample-1-gene-coverages.tsv (table with gene-level coverages)
-- sample-1-contig-coverages.tsv (table with contig-level coverages)
+#### 11e. Create kraken species count table
 
-<br>
+```R
+library(tidyverse)
+library(pavian)
 
----
+reports_dir <- "/path/to/directory/with/*-kraken2-report.tsv"
+species_table <- process_kraken_table(reports_dir)
+write_csv(x = species_table, 
+          file = "kraken_species_table.csv")
+```
 
-### 19. Combining gene-level coverage, taxonomy, and functional annotations into one table for each sample
-> **Notes**  
-> Just uses `paste`, `sed`, and `awk`, all are standard in any Unix-like environment.  
+**Parameter Definitions:**
 
-```
-paste <( tail -n +2 sample-1-gene-coverages.tsv | sort -V -k 1 ) <( tail -n +2 sample-1-annotations.tsv | sort -V -k 1 | cut -f 2- ) \
-      <( tail -n +2 sample-1-gene-tax-out.tsv | sort -V -k 1 | cut -f 2- ) > sample-1-gene-tab.tmp
+- `reports_dir` - a directory containing kraken2 default reports
+- `x` - table to write
+- `file` - file name to write table to.
 
-paste <( head -n 1 sample-1-gene-coverages.tsv ) <( head -n 1 sample-1-annotations.tsv | cut -f 2- ) \
-      <( head -n 1 sample-1-gene-tax-out.tsv | cut -f 2- ) > sample-1-header.tmp
+**Input Data:**
 
-cat sample-1-header.tmp sample-1-gene-tab.tmp > sample-1-gene-coverage-annotation-and-tax.tsv
+- *-kraken2-report.tsv (kraken2 report output file, output from [Step 11b](#11b-taxonomic-classification))
 
-  # removing intermediate files
-rm sample-1*tmp sample-1-gene-coverages.tsv sample-1-annotations.tsv sample-1-gene-tax-out.tsv
-```
+**Output Data:**
 
-**Input data:**
+- **kraken_species_table.csv** (kraken species count table in csv format)
 
-- sample-1-gene-coverages.tsv (table with gene-level coverages from [step 10b](#10b-filtering-gene-coverage-based-on-requiring-50-detection-and-parsing-down-to-just-gene-id-and-coverage))
-- sample-1-annotations.tsv (table of KO annotations assigned to gene IDs from [step 7c](#7c-filtering-output-to-retain-only-those-passing-the-ko-specific-score-and-top-hits))
-- sample-1-gene-tax-out.tsv (gene-level taxonomic classifications from [step 8f](#8f-formatting-contig-level-output-with-awk-and-sed))
+#### 11f. Read-in tables
 
+```R
+library(tidyverse)
 
-**Output data:**
+# Read-in metadata
 
-- **sample-1-gene-coverage-annotation-and-tax.tsv** (table with combined gene coverage, annotation, and taxonomy info)
+metadata_file <- "/path/to/sample/metadata"
+samples_column <- "Sample_ID"
+metadata <- read_delim(file = metadata_file, delim = "\t") %>% as.data.frame()
+row.names(metadata) <- metadata[,samples_column]
+# Read-in feature table
+species_table <- read_csv(file="kraken_species_table.csv") %>%  as.data.frame()
+rownames(species_table) <- species_table$species
+# Drop the species column
+species_table <- species_table[,-match("species", colnames(species_table))]
+```
 
-<br>
+**Parameter Definitions:**
 
----
+- `file` - path to input tables
+- `delim` - file delimiter 
 
-### 20. Combining contig-level coverage and taxonomy into one table for each sample
-> **Notes**  
-> Just uses `paste`, `sed`, and `awk`, all are standard in any Unix-like environment.  
+**Input Data:**
 
-```
-paste <( tail -n +2 sample-1-contig-coverages.tsv | sort -V -k 1 ) \
-      <( tail -n +2 sample-1-contig-tax-out.tsv | sort -V -k 1 | cut -f 2- ) > sample-1-contig.tmp
+- metadata_file  (path to sample-wise metadata file)
+- kraken_species_table.csv (path to kraken species table)
 
-paste <( head -n 1 sample-1-contig-coverages.tsv ) <( head -n 1 sample-1-contig-tax-out.tsv | cut -f 2- ) \
-      > sample-1-contig-header.tmp
-      
-cat sample-1-contig-header.tmp sample-1-contig.tmp > sample-1-contig-coverage-and-tax.tsv
+**Output Data:**
 
-  # removing intermediate files
-rm sample-1*tmp sample-1-contig-coverages.tsv sample-1-contig-tax-out.tsv
-```
+- `metadata` - a dataframe of sample-wise metadata
+- `species_table` - a dataframe of species count per sample
 
-**Input data:**
 
-- sample-1-contig-coverages.tsv (table with contig-level coverages from [step 10b](#10b-filtering-gene-coverage-based-on-requiring-50-detection-and-parsing-down-to-just-gene-id-and-coverage))
-- sample-1-contig-tax-out.tsv (contig-level taxonomic classifications from [step 8f](#8f-formatting-contig-level-output-with-awk-and-sed))
+#### 11g. Taxonomy barplots
 
+```R
+library(tidyverse)
 
-**Output data:**
+filter_threshold <- 0.5
+# Filter out rare and non-microbial assignments.
+# Add as many species as you would like to filter out
+# using the following syntax: "|species_name1|species_name2"
+non_microbial <- "Unclassified|unclassified|Homo sapien"
 
-- **sample-1-contig-coverage-and-tax.tsv** (table with combined contig coverage and taxonomy info)
+plot_width <- 18
+plot_height <- 8
 
-<br>
+# Convert count matrix to relative abundance matrix
+abund_table <- count_to_rel_abundance(species_table)
 
----
+# Make plot without filtering
+p <- make_plot(abund_table, metadata, custom_palette, publication_format)
 
-### 21. Generating normalized, gene- and contig-level coverage summary tables of KO-annotations and taxonomy across samples
+ggsave(filename =  "unfiltered-kraken_species_plot.png", plot = p, device = "png", 
+       width = plot_width, height = plot_height, units = "in", dpi = 300)
 
-> **Notes**  
-> * To combine across samples to generate these summary tables, we need the same "units". This is done for annotations based on the assigned KO terms, and all non-annotated functions are included together as "Not annotated". It is done for taxonomic classifications based on taxids (full lineages included in the table), and any not classified are included together as "Not classified". 
-> * The values we are working with are coverage per gene (so they are number of bases recruited to the gene normalized by the length of the gene). These have been normalized by making the total coverage of a sample 1,000,000 and setting each individual gene-level coverage its proportion of that 1,000,000 total. So basically percent, but out of 1,000,000 instead of 100 to make the numbers more friendly. 
 
-#### 21a. Generating gene-level coverage summary tables
+# Get species with relative abundance greater than filter_threshold in all samples
+# Drop rare and non-microbial assignments
+filtered_species_table  <- filter_rare(species_table, non_microbial, threshold=filter_threshold)
 
-```
-bit-GL-combine-KO-and-tax-tables *-gene-coverage-annotation-and-tax.tsv -o Combined
-```
 
-**Parameter Definitions:**  
+# Convert count matrix to relative abundance matrix
+filtered_species_table <- count_to_rel_abundance(filtered_species_table)
 
-*	takes positional arguments specifying the input tsv files, can be provided as a space-delimited list of files, or with wildcards like above
+# Write filtered table to file
+write_csv(x = filtered_species_table, file = "filtered-kraken_species_table.csv")
 
--	`-o` – specifies the output prefix
+# Make plot after filtering
+p <- make_plot(filtered_species_table , metadata, custom_palette, publication_format)
 
+ggsave(filename = "filtered-kraken_species_plot.png", plot = p,
+         device = "png", width = plot_width, height = plot_height, units = "in", dpi = 300)
+```
 
-**Input data:**
+**Parameter Definitions:**
 
-- *-gene-coverage-annotation-and-tax.tsv (tables with combined gene coverage, annotation, and taxonomy info generated for individual samples from [step 11](#11-combining-gene-level-coverage-taxonomy-and-functional-annotations-into-one-table-for-each-sample))
+- `filter_threshold` - a decimal threshold between 0 and 1 used to filter out rare species, i.e., potential false positives.
+- `non_microbial` - a regex string listing the assignments to drop before filtering based on the `filter_threshold` above.
 
-**Output data:**
+**Input Data:**
 
-- **Combined-gene-level-KO-function-coverages-CPM_GLmetagenomics.tsv** (table with all samples combined based on KO annotations; normalized to coverage per million genes covered)
-- **Combined-gene-level-taxonomy-coverages-CPM_GLmetagenomics.tsv** (table with all samples combined based on gene-level taxonomic classifications; normalized to coverage per million genes covered)
-- **Combined-gene-level-KO-function-coverages_GLmetagenomics.tsv** (table with all samples combined based on KO annotations)
-- **Combined-gene-level-taxonomy-coverages_GLmetagenomics.tsv** (table with all samples combined based on gene-level taxonomic classifications)
+- `species_table` (a dataframe of species count per sample, output from [Step 11f](#11f-read-in-tables))
+- `metadata` (a dataframe of sample-wise metadata, output from [Step 11f](#11f-read-in-tables))
 
+**Output Data:**
 
-#### 21b. Generating contig-level coverage summary tables
+- **unfiltered-kraken_species_plot.png** (barplot without filtering)
+- **filtered-kraken_species_table.csv** (filtered relative abundance table)
+- **filtered-kraken_species_plot.png** (barplot after filtering rare and non-microbial taxa)
 
-```
-bit-GL-combine-contig-tax-tables *-contig-coverage-and-tax.tsv -o Combined
-```
-**Parameter Definitions:**  
+---
 
-*	takes positional arguments specifying the input tsv files, can be provided as a space-delimited list of files, or with wildcards like above
+#### 11h. Feature decontamination
 
--	`-o` – specifies the output prefix
+Feature decontamination with decontam, an R package that statistically identifies contaminating features in a feature table.
 
+```R
+library(tidyverse)
+library(decontam)
 
-**Input data:**
+feature_table <- read_csv("filtered-kraken_species_table.csv")
+contam_threshold <- 0.1
+# Control samples in this column should always be written as "Control_Sample" and true samples as "True_Sample"
+prev_col <- "Sample_or_Control"
+freq_col <- "input_conc_ng"
+plot_width <- 18
+plot_height <- 8
 
-- *-contig-coverage-annotation-and-tax.tsv (tables with combined contig coverage, annotation, and taxonomy info generated for individual samples from [step 12](#12-combining-contig-level-coverage-and-taxonomy-into-one-table-for-each-sample))
+contamdf <- run_decontam(feature_table, metadata, contam_threshold, prev_col, freq_col)
 
-**Output data:**
+# Write decontam results table to file
+write_csv(x = contamdf %>% rownames_to_column("Species"), file = "decontam-kraken_results.csv")
 
-- **Combined-contig-level-taxonomy-coverages-CPM_GLmetagenomics.tsv** (table with all samples combined based on contig-level taxonomic classifications; normalized to coverage per million genes covered)
-- **Combined-contig-level-taxonomy-coverages_GLmetagenomics.tsv** (table with all samples combined based on contig-level taxonomic classifications)
+# Get the list of contaminants identified by decontam
+contaminants <- contamdf %>%
+                as.data.frame %>%
+                rownames_to_column("Species") %>%
+                filter(contaminant == TRUE) %>% pull(Species)
 
-<br>
+# Drop contaminant features identified by decontam
+decontaminated_table <- feature_table %>% 
+                as.data.frame  %>% 
+                rownames_to_column("Species") %>% 
+                filter(str_detect(Species, 
+                                  pattern = str_c(contaminants,
+                                                  collapse = "|"),
+                                  negate = TRUE)) %>%
+                select(-Species) %>% as.matrix
 
----
+# Convert count matrix to relative abundance matrix
+decontaminated_species_table <- count_to_rel_abundance(decontaminated_table)
 
-### 22. **M**etagenome-**A**ssembled **G**enome (MAG) recovery
+# Write decontaminated species table to file
+write_csv(x = decontaminated_species_table, file = "decontaminated-kraken_species_table.csv")
 
-#### 22a. Binning contigs
-```
-jgi_summarize_bam_contig_depths --outputDepth sample-1-metabat-assembly-depth.tsv --percentIdentity 97 --minContigLength 1000 --minContigDepth 1.0  --referenceFasta sample-1-assembly.fasta sample-1.bam
-
-metabat2  --inFile sample-1-assembly.fasta --outFile sample-1 --abdFile sample-1-metabat-assembly-depth.tsv -t NumberOfThreads
+# Make plot after filtering out contaminants
+p <- make_plot(decontaminated_species_table , metadata, custom_palette, publication_format)
 
-mkdir sample-1-bins
-mv sample-1*bin*.fasta sample-1-bins
-zip -r sample-1-bins.zip sample-1-bins
+ggsave(filename = "decontaminated-kraken-species_plot.png", plot = p,
+         device = "png", width = plot_width, height = plot_height, units = "in", dpi = 300)
 ```
 
-**Parameter Definitions:**  
+**Input Data:**
 
--  `--outputDepth` – specifies the output depth file
--  `--percentIdentity` – minimum end-to-end percent identity of a mapped read to be included
--  `--minContigLength` – minimum contig length to include
--  `--minContigDepth` – minimum contig depth to include
--  `--referenceFasta` – the assembly fasta file generated in step 5a
--  `sample-1.bam` – final positional arguments are the bam files generated in step 9
--  `--inFile` - the assembly fasta file generated in step 5a
--  `--outFile` - the prefix of the identified bins output files
--  `--abdFile` - the depth file generated by the previous `jgi_summarize_bam_contig_depths` command
--  `-t` - specifies number of threads to use
+- `filtered-kraken_species_table.csv` (a dataframe of species count per sample, output from [Step 11g](#11g-taxonomy-barplots))
+- `metadata` (a dataframe of sample-wise metadata, output from [Step 11f](#11f-read-in-tables))
 
+**Output Data:**
 
-**Input data:**
+- **decontam-kraken_results.csv** (decontam's results table)
+- **decontaminated-kraken_species_table.csv** (decontaminated species table)
+- **decontaminated-kraken-species_plot.png** (barplot after filtering out contaminants)
 
-- sample-1-assembly.fasta (assembly fasta file created in [step 5a](#5a-renaming-contig-headers))
-- sample-1.bam (bam file created in [step 9b](#9b-performing-mapping-conversion-to-bam-and-sorting))
+<br>
 
-**Output data:**
+---
 
-- **sample-1-metabat-assembly-depth.tsv** (tab-delimited summary of coverages)
-- sample-1-bins/sample-1-bin\*.fasta (fasta files of recovered bins)
-- **sample-1-bins.zip** (zip file containing fasta files of recovered bins)
+## Assembly-based processing
 
-#### 22b. Bin quality assessment
-Utilizes the default `checkm` database available [here](https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz), `checkm_data_2015_01_16.tar.gz`.
+### 12. Sample assembly
 
-```
-checkm lineage_wf -f bins-overview_GLmetagenomics.tsv --tab_table -x fa ./ checkm-output-dir
+```bash
+flye --meta --threads NumberOfThreads --out-dir sample/ \
+     --nano-hq /path/to/decontaminated_raw_data/sample_host_removed.fastq.gz
+
+# rename output files            
+mv sample/assembly.fasta sample_assembly.fasta
+mv sample/flye.log sample_flye.log
 ```
 
-**Parameter Definitions:**  
+**Parameter Definitions:**
 
--  `lineage_wf` – specifies the workflow being utilized
--  `-f` – specifies the output summary file
--  `--tab_table` – specifies the output summary file should be a tab-delimited table
--  `-x` – specifies the extension that is on the bin fasta files that are being assessed
--  `./` – first positional argument at end specifies the directory holding the bins generated in step 14a
--  `checkm-output-dir` – second positional argument at end specifies the primary checkm output directory with detailed information
+- `--meta` – use metagenome/uneven coverage mode
+- `--threads` - number of parallel processing threads to use
+- `--out-dir` - Output directory
+- `--nano-hq` - specifies that the input is Oxford Nanopore high-quality reads (Guppy5+ SUP or Q20, <5% error). The assembly is polished further with medaka in the next step
 
-**Input data:**
+**Input Data:**
 
-- sample-1-bins/sample-1-bin\*.fasta (bin fasta files generated in [step 14a](#14a-binning-contigs))
+- sample_host_removed.fastq.gz (decontaminated raw data in fastq format, output from [Step 8](#8-host-removal))
 
-**Output data:**
+**Output Data:**
 
-- **bins-overview_GLmetagenomics.tsv** (tab-delimited file with quality estimates per bin)
-- checkm-output-dir (directory holding detailed checkm outputs)
+- sample_assembly.fasta (sample assembly)
+- sample_flye.log (log file)
 
-#### 22c. Filtering MAGs
+<br>
 
-```
-cat <( head -n 1 bins-overview_GLmetagenomics.tsv ) \
-    <( awk -F $'\t' ' $12 >= 90 && $13 <= 10 && $14 == 0 ' bins-overview_GLmetagenomics.tsv | sed 's/bin./MAG-/' ) \
-    > checkm-MAGs-overview.tsv
-    
-# copying bins into a MAGs directory in order to run tax classification
-awk -F $'\t' ' $12 >= 90 && $13 <= 10 && $14 == 0 ' bins-overview_GLmetagenomics.tsv | cut -f 1 > MAG-bin-IDs.tmp
+---
 
-mkdir MAGs
-for ID in MAG-bin-IDs.tmp
-do
-    MAG_ID=$(echo $ID | sed 's/bin./MAG-/')
-    cp ${ID}.fasta MAGs/${MAG_ID}.fasta
-done
+### 13. Polish assembly
 
-for SAMPLE in $(cat MAG-bin-IDs.tmp | sed 's/-bin.*//' | sort -u);
-do
-  mkdir ${SAMPLE}-MAGs
-  mv ${SAMPLE}-*MAG*.fasta ${SAMPLE}-MAGs
-  zip -r ${SAMPLE}-MAGs.zip ${SAMPLE}-MAGs
-done
+```bash
+medaka_consensus -t NumberOfThreads -i /path/to/decontaminated_raw_data/sample_host_removed.fastq.gz \
+  -d /path/to/assemblies/sample_assembly.fasta -o sample/
+  
+mv sample/consensus.fasta sample_polished.fasta
 ```
 
-**Input data:**
+**Parameter Definitions:**
 
-- bins-overview_GLmetagenomics.tsv (tab-delimited file with quality estimates per bin from [step 14b](#14b-bin-quality-assessment))
+- `-t` - number of parallel processing threads to use
+- `-i` - specifies path to input read files used in creating the assembly
+- `-d` - specifies path to the assembly fasta file
+- `-o` - specifies the output directory
 
-**Output data:**
+**Input Data:**
 
-- checkm-MAGs-overview.tsv (tab-delimited file with quality estimates per MAG)
-- MAGs/\*.fasta (directory holding high-quality MAGs)
-- **\*-MAGs.zip** (zip files containing directories of high-quality MAGs)
+- /path/to/decontaminated_raw_data/sample_host_removed.fastq.gz (decontaminated raw data in fastq format, output from [Step 8](#8-host-removal))
+- /path/to/assemblies/sample_assembly.fasta (sample assembly, output from [Step 12](#12-sample-assembly))
 
+**Output Data:**
 
-#### 22d. MAG taxonomic classification
-Uses default `gtdbtk` database setup with program's `download.sh` command.
+- sample_polished.fasta (polished sample assembly)
 
-```
-gtdbtk classify_wf --genome_dir MAGs/ -x fa --out_dir gtdbtk-output-dir  --skip_ani_screen
+---
+
+### 14. Renaming contigs and summarizing assemblies
+
+#### 14a. Renaming contig headers
+
+```bash
+bit-rename-fasta-headers -i sample-1_polished.fasta -w c_sample-1 -o sample-1-assembly.fasta
 ```
 
 **Parameter Definitions:**  
 
--  `classify_wf` – specifies the workflow being utilized
--  `--genome_dir` – specifies the directory holding the MAGs generated in step 14c
--  `-x` – specifies the extension that is on the MAG fasta files that are being taxonomically classified
--  `--out_dir` – specifies the output directory
--  `--skip_ani_screen`  - specifies to skip ani_screening step to classify genomes using mash and skani
+- `-i` – input fasta file
+- `-w` – desired header prefix (a number is appended for each contig); it starts with “c_” so that headers never begin with a digit, which can be problematic
+- `-o` – output fasta file
 
-**Input data:**
 
-- MAGs/\*.fasta (directory holding high-quality MAGs from [step 14c](#14c-filtering-mags))
+**Input Data:**
 
-**Output data:**
+- sample-1_polished.fasta (polished assembly file from [step 13](#13-polish-assembly))
 
-- gtdbtk-output-dir/gtdbtk.\*.summary.tsv (files with assigned taxonomy and info)
+**Output Data:**
 
-#### 22e. Generating overview table of all MAGs
+- **sample-1-assembly.fasta** (contig-renamed assembly file)
 
-```bash
-# combine summaries
-for MAG in $(cut -f 1 assembly-summaries_GLmetagenomics.tsv | tail -n +2); do
 
-    grep -w -m 1 "^${MAG}" checkm-MAGs-overview.tsv | cut -f 12,13,14 \
-        >> checkm-estimates.tmp
+#### 14b. Summarizing assemblies
 
-    grep -w "^${MAG}" gtdbtk-output-dir/gtdbtk.*.summary.tsv | \
-    cut -f 2 | sed 's/^.__//' | \
-    sed 's/;.__/\t/g' | \
-    awk 'BEGIN{ OFS=FS="\t" } { for (i=1; i<=NF; i++) if ( $i ~ /^ *$/ ) $i = "NA" }; 1' \
-        >> gtdb-taxonomies.tmp
+```bash
+bit-summarize-assembly -o assembly-summaries_GLmetagenomics.tsv *-assembly.fasta
+```
 
-done
+**Parameter Definitions:**  
 
-# Add headers
-cat <(printf "est. completeness\test. redundancy\test. strain heterogeneity\n") checkm-estimates.tmp \
-    > checkm-estimates-with-headers.tmp
+- `-o` – output summary table
+- `*-assembly.fasta` - multiple input assemblies provided as positional arguments
 
-cat <(printf "domain\tphylum\tclass\torder\tfamily\\tgenus\tspecies\n") gtdb-taxonomies.tmp \
-    > gtdb-taxonomies-with-headers.tmp
+**Input Data:**
 
-paste assembly-summaries_GLmetagenomics.tsv \
-checkm-estimates-with-headers.tmp \
-gtdb-taxonomies-with-headers.tmp \
-    > MAGs-overview.tmp
+- *-assembly.fasta (contig-renamed assembly files from [step 14a](#14a-renaming-contig-headers))
 
-# Ordering by taxonomy
-head -n 1 MAGs-overview.tmp > MAGs-overview-header.tmp
+**Output Data:**
 
-tail -n +2 MAGs-overview.tmp | sort -t \$'\t' -k 14,20 > MAGs-overview-sorted.tmp
+- **assembly-summaries_GLmetagenomics.tsv** (table of assembly summary statistics)
 
-cat MAGs-overview-header.tmp MAGs-overview-sorted.tmp \
-    > MAGs-overview_GLmetagenomics.tsv
+<br>
+
+---
 
+### 15. Gene prediction
+```bash
+prodigal -a sample-1-genes.faa -d sample-1-genes.fasta -f gff -p meta -c -q \
+         -o sample-1-genes.gff -i sample-1-assembly.fasta
 ```
 
-**Input data:**
+**Parameter Definitions:**
 
-- assembly-summaries_GLmetagenomics.tsv (table of assembly summary statistics from [step 5b](#5b-summarizing-assemblies))
-- MAGs/\*.fasta (directory holding high-quality MAGs from [step 14c](#14c-filtering-mags))
-- checkm-MAGs-overview.tsv (tab-delimited file with quality estimates per MAG from [step 14c](#14c-filtering-mags))
-- gtdbtk-output-dir/gtdbtk.\*.summary.tsv (directory of files with assigned taxonomy and info from [step 14d](#14d-mag-taxonomic-classification))
+- `-a` – specifies the output amino acid sequences file
+- `-d` – specifies the output nucleotide sequences file
+- `-f` – specifies the format of the output gene-calls file
+- `-p` – specifies which mode to run the gene-caller in (`meta` for metagenomes)
+- `-c` – closed ends; genes that run off contig edges are not reported
+- `-q` – run in quiet mode (suppresses per-contig progress output)
+- `-o` – specifies the name of the output gene-calls file
+- `-i` – specifies the input assembly
+
+**Input Data:**
 
-**Output data:**
+- sample-1-assembly.fasta (contig-renamed assembly file from [step 14a](#14a-renaming-contig-headers))
 
-- **MAGs-overview_GLmetagenomics.tsv** (a tab-delimited overview of all recovered MAGs)
+**Output Data:**
 
+- sample-1-genes.faa (gene-calls amino-acid fasta file)
+- sample-1-genes.fasta (gene-calls nucleotide fasta file)
+- **sample-1-genes.gff** (gene-calls in general feature format)
 
 <br>
 
----
+#### 15a. Remove line wraps in gene prediction output
+```bash
+bit-remove-wraps sample-1-genes.faa > sample-1-genes.faa.tmp 2> /dev/null
+mv sample-1-genes.faa.tmp sample-1-genes.faa
 
-### 23. Generating MAG-level functional summary overview
+bit-remove-wraps sample-1-genes.fasta > sample-1-genes.fasta.tmp 2> /dev/null
+mv sample-1-genes.fasta.tmp sample-1-genes.fasta
+```
 
-#### 23a. Getting KO annotations per MAG
-This utilizes the helper script [`parse-MAG-annots.py`](../Workflow_Documentation/NF_MGIllumina/workflow_code/bin/parse-MAG-annots.py) 
+**Input Data:**
 
-```bash
-for file in $( ls MAGs/*.fasta )
-do
+- sample-1-genes.faa (gene-calls amino-acid fasta file, output from [Step 15](#15-gene-prediction))
+- sample-1-genes.fasta (gene-calls nucleotide fasta file, output from [Step 15](#15-gene-prediction))
 
-    MAG_ID=$( echo ${file} | cut -f 2 -d "/" | sed 's/.fasta//' )
-    sample_ID=$( echo ${MAG_ID} | sed 's/-MAG-[0-9]*$//' )
+**Output Data:**
 
-    grep "^>" ${file} | tr -d ">" > ${MAG_ID}-contigs.tmp
+- **sample-1-genes.faa** (gene-calls amino-acid fasta file with line wraps removed)
+- **sample-1-genes.fasta** (gene-calls nucleotide fasta file with line wraps removed)
 
-    python parse-MAG-annots.py -i annotations-and-taxonomy/${sample_ID}-gene-coverage-annotation-and-tax.tsv \
-                               -w ${MAG_ID}-contigs.tmp -M ${MAG_ID} \
-                               -o MAG-level-KO-annotations_GLmetagenomics.tsv
+<br>
 
-    rm ${MAG_ID}-contigs.tmp
+---
 
-done
+### 16. Functional annotation
+> **Note:**  
+> The annotation process writes to the same temporary directory by default. When running multiple 
+> processes at a time, it is necessary to specify a separate temporary directory for each with the 
+> `--tmp-dir` argument as shown below.
+
+
+#### 16a. Downloading reference database of HMM models (only needs to be done once)
+
+```bash
+curl -LO ftp://ftp.genome.jp/pub/db/kofam/profiles.tar.gz
+curl -LO ftp://ftp.genome.jp/pub/db/kofam/ko_list.gz
+tar -xzvf profiles.tar.gz
+gunzip ko_list.gz 
 ```
 
-**Parameter Definitions:**  
+#### 16b. Running KEGG annotation
 
-- `-i` – specifies the input sample gene-coverage-annotation-and-tax.tsv file generated in step 11
+```bash
+exec_annotation -p profiles/ -k ko_list --cpu NumberOfThreads -f detail-tsv -o sample-1-KO-tab.tmp \
+                --tmp-dir sample-1-tmp-KO --report-unannotated sample-1-genes.faa 
+```
 
--  `-w` – specifies the appropriate temporary file holding all the contigs in the current MAG
+**Parameter Definitions:**
 
-- `-M` – specifies the current MAG unique identifier
+- `-p` – specifies the directory holding the downloaded reference HMMs
+- `-k` – specifies the downloaded reference list of KO (KEGG Orthology) terms
+- `--cpu` – specifies the number of searches to run in parallel
+- `-f` – specifies the output format
+- `-o` – specifies the output file name
+- `--tmp-dir` – specifies the temporary directory to write to (needed if running more than one process concurrently, see the note above)
+- `--report-unannotated` – specifies to generate an output line for every input sequence, including those with no KO assigned
+- `sample-1-genes.faa` – the input file is specified as a positional argument
 
-- `-o` – specifies the output file
 
-**Input data:**
+**Input Data:**
 
-- \*-gene-coverage-annotation-and-tax.tsv (sample gene-coverage-annotation-and-tax.tsv file generated in [step 11](#11-combining-gene-level-coverage-taxonomy-and-functional-annotations-into-one-table-for-each-sample))
-- MAGs/\*.fasta (directory holding high-quality MAGs from [step 14c](#14c-filtering-mags))
+- sample-1-genes.faa (amino-acid fasta file, from [step 15](#15-gene-prediction))
+- profiles/ (reference directory holding the KO HMMs)
+- ko_list (reference list of KOs to scan for)
 
-**Output data:**
+**Output Data:**
 
-- **MAG-level-KO-annotations_GLmetagenomics.tsv** (tab-delimited table holding MAGs and their KO annotations)
+- sample-1-KO-tab.tmp (table of KO annotations assigned to gene IDs)
 
 
-#### 23b. Summarizing KO annotations with KEGG-Decoder
+#### 16c. Filtering output to retain only those passing the KO-specific score and top hits
 
 ```bash
-KEGG-decoder -v interactive -i MAG-level-KO-annotations_GLmetagenomics.tsv -o MAG-KEGG-Decoder-out_GLmetagenomics.tsv
+bit-filter-KOFamScan-results -i sample-1-KO-tab.tmp -o sample-1-annotations.tsv
+
+# removing temporary files
+rm -rf sample-1-tmp-KO/ sample-1-KO-tab.tmp
 ```
 
 **Parameter Definitions:**  
 
-- `-v interactive` – specifies to create an interactive html output
- 
-- `-i` – specifies the input MAG-level-KO-annotations_GLmetagenomics.tsv file generated in [step 15a](#15a-getting-ko-annotations-per-mag)
-
+- `-i` – specifies the input table
 - `-o` – specifies the output table
 
-**Input data:**
-
-- MAG-level-KO-annotations_GLmetagenomics.tsv (tab-delimited table holding MAGs and their KO annotations, generated in [step 15a](#15a-getting-ko-annotations-per-mag))
+**Input Data:**
 
-**Output data:**
+- sample-1-KO-tab.tmp (table of KO annotations assigned to gene IDs from [step 16b](#16b-running-kegg-annotation))
 
-- **MAG-KEGG-Decoder-out_GLmetagenomics.tsv** (tab-delimited table holding MAGs and their proportions of genes held known to be required for specific pathways/metabolisms)
+**Output Data:**
 
-- **MAG-KEGG-Decoder-out_GLmetagenomics.html** (interactive heatmap html file of the above output table)
+- sample-1-annotations.tsv (table of KO annotations assigned to gene IDs)
 
 <br>
----
 
-## Read-based Feature Table Decontamination
-> Feature table decontamination is performed in R.  
+---
 
-### 24. R Environment Setup
+### 17. Taxonomic classification
 
-#### 24a. Load libraries
+#### 17a. Pulling and un-packing pre-built reference db (only needs to be done once)
 
-```R
-library(decontam)
-library(phyloseq)
-library(tidyverse)
-library(DT)
-library(plotly)
-library(glue)
-library(pheatmap)
-library(pavian)
+```bash
+wget tbb.bio.uu.nl/bastiaan/CAT_prepare/CAT_prepare_20200618.tar.gz
+tar -xvzf CAT_prepare_20200618.tar.gz
 ```
 
-#### 24b. Define Custom Functions
+#### 17b. Running taxonomic classification
 
-##### get_last_assignment()
-<details>
-  <summary>retrieves the last taxonomy assignment from a taxonomy string</summary>
+```bash
+CAT contigs -c sample-1-assembly.fasta -d CAT_prepare_20200618/2020-06-18_database/ \
+            -t CAT_prepare_20200618/2020-06-18_taxonomy/ -p sample-1-genes.faa \
+            -o sample-1-tax-out.tmp -n NumberOfThreads -r 3 --top 4 --I_know_what_Im_doing --no-stars
+```
 
-  ```R
-  get_last_assignment <- function(taxonomy_string, split_by=';', remove_prefix=NULL){
-    # A function to get the last taxonomy assignment from a taxonomy string 
-    split_names <- strsplit(x =  taxonomy_string , split = split_by) %>% 
-      unlist()
-    
-    level_name <- split_names[[length(split_names)]]
-    
-    if(level_name == "_"){
-      return(taxonomy_string)
-    }
-    
-    if(!is.null(remove_prefix)){
-      level_name <- gsub(pattern = remove_prefix, replacement = '', x = level_name)
-    }
-    
-    return(level_name)
-  }
-  ```
-  **Function Parameter Definitions:**
-  - `taxonomy_string` - a character string containing a list of taxonomy assignments
-  - `split_by=` - a character string containing a regular expression used to split the `taxonomy_string`
-  - `remove_prefix=` - a character string containing a regular expression to be matched and removed, default=`NULL`
+**Parameter Definitions:**  
 
-  **Returns:** the last taxonomy assignment listed in the `taxonomy_string`
-</details>
+- `-c` – specifies the input assembly fasta file
+- `-d` – specifies the CAT reference sequence database
+- `-t` – specifies the CAT reference taxonomy database
+- `-p` – specifies the input protein fasta file
+- `-o` – specifies the output prefix
+- `-n` – specifies the number of CPU cores to use
+- `-r` – specifies the number of top protein hits to consider in assigning tax
+- `--top` – specifies the number of protein alignments to store
+- `--I_know_what_Im_doing` – allows us to alter the `--top` parameter
+- `--no-stars` - suppress marking of suggestive taxonomic assignments
 
-##### mutate_taxonomy()
-<details>
-  <summary>ensure that the taxonomy column is named "taxonomy" and aggregate duplicates to ensure that taxonomy names are unique</summary>
+**Input Data:**
 
-  ```R
-  mutate_taxonomy <- function(df, taxonomy_column="taxonomy"){
-    
-    # make sure that the taxonomy column is always named taxonomy
-    col_index <- which(colnames(df) == taxonomy_column)
-    colnames(df)[col_index] <- 'taxonomy'
-    df <- df %>% dplyr::mutate(across( where(is.numeric), \(x) tidyr::replace_na(x,0)  ) )%>% 
-      dplyr::mutate(taxonomy=map_chr(taxonomy,.f = function(taxon_name=.x){
-        last_assignment <- get_last_assignment(taxon_name) 
-        last_assignment  <- gsub(pattern = "\\[|\\]|'", replacement = '',x = last_assignment)
-        trimws(last_assignment, which = "both")
-      })) %>% 
-      as.data.frame(check.names=FALSE, StringAsFactor=FASLE)
-    # Ensure the taxonomy names are unique by aggregating duplicates
-    df <- aggregate(.~taxonomy,data = df, FUN = sum)
-    return(df)
-  }
-  ```
-  **Function Parameter Definitions:**
-  - `df` - a dataframe containing the taxonomy assignments
-  - `taxonomy_column=` - name of the column in `df` containing the taxonomy assignments, default="taxonomy"
+- sample-1-assembly.fasta (assembly file from [step 14a](#14a-renaming-contig-headers))
+- sample-1-genes.faa (gene-calls amino-acid fasta file from [step 15](#15-gene-prediction))
 
-  **Returns:** a dataframe with unique taxonomy names stored in a column named "taxonomy"
+**Output Data:**
 
-</details>
+- sample-1-tax-out.tmp.ORF2LCA.txt (gene-calls taxonomy file)
+- sample-1-tax-out.tmp.contig2classification.txt (contig taxonomy file)
 
-##### process_kaiju_table()
-<details>
-  <summary>reformat kaiju output table</summary>
+#### 17c. Adding taxonomy info from taxids to genes
 
-  ```R
-  process_kaiju_table <- function(file_path, taxon_col="taxon_path",
-                                  kingdom=NULL, remove_non_microbial = TRUE){
-    
-    kaija_table <- read_delim(file = file_path,
-                              delim = "\t",
-                              col_names = TRUE)
-    
-    if(remove_non_microbial){
-      # Remove non-microbial and unclassified assignments in this case Metazoa for animal assignments
-      non_microbial_indices <- grep(pattern = "unclassified|assigned|Metazoa|Chordata|Nematoda|Arthropoda|Annelida|Brachiopoda|Mollusca|Cnidaria|Streptophyta",
-                                    x = kaija_table[[taxon_col]])
-      
-      if(!is_empty(non_microbial_indices)){
-        kaija_table <- kaija_table[-non_microbial_indices,]
-      }
-      
-    }
-    
-    if(!is.null(kingdom)){
-      kingdom_indices <- grep(pattern = kingdom ,
-                              x = kaija_table[[taxon_col]])
-      if(!is_empty(kingdom_indices)){
-        kaija_table <- kaija_table[kingdom_indices,]
-      }
-    }
-    
-    
-    abs_abun_df <- pivot_wider(data = kaija_table %>% dplyr::select(sample,reads,taxonomy=!!sym(taxon_col)), 
-                              names_from = "sample", values_from = "reads",
-                              names_sort = TRUE) %>% mutate_taxonomy
-    
-    rel_abun_df <- pivot_wider(data = kaija_table %>% dplyr::select(sample,percent,taxonomy=!!sym(taxon_col)), 
-                              names_from = "sample", values_from = "percent",
-                              names_sort = TRUE) %>% mutate_taxonomy
-    
-    # Set the taxon names as row names, drop the taxonomy column and convert to a matrix
-    rownames(abs_abun_df) <- abs_abun_df[,"taxonomy"]
-    rownames(rel_abun_df) <- rel_abun_df[,"taxonomy"]
-    
-    abs_abun_df <- abs_abun_df[,-(which(colnames(abs_abun_df) == "taxonomy"))]
-    rel_abun_df <- rel_abun_df[,-(which(colnames(rel_abun_df) == "taxonomy"))]
-    
-    abs_abun_matrix <- as.matrix(abs_abun_df)
-    rel_abun_matrix <- as.matrix(rel_abun_df)
-    
-    final_tables <- list("relative_table"=rel_abun_matrix,
-                        "abundance_table"=abs_abun_matrix)
-    return(final_tables)
-    
-  }
-  ```
-  **Function Parameter Definitions:**
-  - `file_path` - file path to the tab-delimited kaiju output table file
-  - `taxon_col=`- name of the taxon column in the input data file, default="taxon_path"
-  - `kingdom=` - a character string containing a regular expression used to filter for specific kingdoms, default=`NULL`
-  - `remove_non_microbial=` - a boolean specifying whether or not to remove non-microbial and unclassified assuments, default=`TRUE`
+```bash
+CAT add_names -i sample-1-tax-out.tmp.ORF2LCA.txt -o sample-1-gene-tax-out.tmp \
+              -t CAT_prepare_20200618/2020-06-18_taxonomy/ --only_official --exclude-scores
+```
 
-  **Returns:** a dataframe with reformated kaiju output
+**Parameter Definitions:**  
 
-</details>
+- `-i` – specifies the input taxonomy file
+- `-o` – specifies the output file 
+- `-t` – specifies the CAT reference taxonomy database
+- `--only_official` – specifies to add only standard taxonomic ranks
+- `--exclude-scores` - specifies to exclude bit-score support scores in the lineage
 
-##### create_dt()
-<details>
-  <summary>create an HTML widget to display rectangular data (`matrix` or `dataframe`) using the DataTables Javascript library</summary>
+**Input Data:**
 
-```R
-create_dt <- function(table2show, caption=NULL) {
-  DT::datatable(table2show,
-                rownames = FALSE, # remove row numbers
-                filter = "top", # add filter on top of columns
-                extensions = "Buttons", # add download buttons
-                caption=caption,
-                options = list(
-                  autoWidth = TRUE,
-                  dom = "Blfrtip", # location of the download buttons
-                  buttons = c("copy", "csv", "excel", "pdf", "print"), # download buttons
-                  pageLength = 5, # show first 5 entries, default is 10
-                  order = list(0, "asc") # order the title column by ascending order
-                ),
-                escape = FALSE # make URLs clickable) 
-  )
-}
-```
-**Function Parameter Definitions:**
-- `table2show` - a `matrix` or `dataframe` containing tabular data to display
-- `caption=` - a character vector to use as the caption for the table
+- sample-1-tax-out.tmp.ORF2LCA.txt (gene-calls taxonomy file from [step 17b](#17b-running-taxonomic-classification))
 
-</details>
+**Output Data:**
 
-##### filter_rare()
-<details>
-  <summary>filter out rare and non_microbial taxonomy assignments</summary>
+- sample-1-gene-tax-out.tmp (gene-calls taxonomy file with lineage info added)
 
-  ```R
-  filter_rare <- function(species_table, non_microbial, threshold=1){
-    
-    clean_tab_count  <-  species_table %>% 
-      filter(str_detect(Species, non_microbial, negate = TRUE))  
-    
-    clean_tab <- clean_tab_count %>% 
-      mutate( across( where(is.numeric)  , \(x) (x/sum(x, na.rm = TRUE))*100 ) )
-    
-    rownames(clean_tab) <- clean_tab$Species
-    clean_tab  <- clean_tab[,-1] 
-    
-    
-    # Get species with relative abundance less than 1% in all samples
-    rare_species <- map(clean_tab, .f = \(col) rownames(clean_tab)[col < threshold])
-    rare <- Reduce(intersect, rare_species)
-    
-    rownames(clean_tab_count) <- clean_tab_count$Species
-    clean_tab_count  <- clean_tab_count[,-1] 
-    
-    abund_table <- clean_tab_count[!(rownames(clean_tab_count) %in% rare), ]
-    
-    return(abund_table)
-  }
-  ```
-  **Function Parameter Definitions:**
-  - `species_table` - the dataframe to filter
-  - `non_microbial` - a character vector denoting the string used to identify a species as non-microbial
-  - `threshold=` - abundance threshold used to determine if the relative abundance is rare, value denotes a percentage between 0 and 100.
+#### 17d. Adding taxonomy info from taxids to contigs
 
-  **Returns:** a dataframe with rare and non_microbrial assignemnts removed
-</details>
+```bash
+CAT add_names -i sample-1-tax-out.tmp.contig2classification.txt -o sample-1-contig-tax-out.tmp \
+              -t CAT_prepare_20200618/2020-06-18_taxonomy/ --only_official --exclude-scores
+```
 
+**Parameter Definitions:**  
 
-##### make_plot()
-<details>
-  <summary>create bar plot of relative abundance</summary>
+- `-i` – specifies the input taxonomy file
+- `-o` – specifies the output file 
+- `-t` – specifies the CAT reference taxonomy database
+- `--only_official` – specifies to add only standard taxonomic ranks
+- `--exclude-scores` - specifies to exclude bit-score support scores in the lineage
 
-  ```R
-  # Make bar plot
-  make_plot <- function(abund_table, metadata, colors2use, publication_format){
-    
-    abund_table_wide <- abund_table %>% 
-        as.data.frame() %>% 
-        rownames_to_column("Sample_ID") %>% 
-        inner_join(metadata) %>% 
-        select(!!!colnames(metadata), everything()) %>% 
-        mutate(Sample_ID = Sample_ID %>% str_remove("barcode"))
-        
-    abund_table_long <- abund_table_wide  %>%
-        pivot_longer(-colnames(metadata), 
-                    names_to = "Species",
-                    values_to = "relative_abundance")
+**Input Data:**
+
+- sample-1-tax-out.tmp.contig2classification.txt (contig taxonomy file from [step 17b](#17b-running-taxonomic-classification))
+
+**Output Data:**
+
+- sample-1-contig-tax-out.tmp (contig taxonomy file with lineage info added)
+
+
+#### 17e. Formatting gene-level output with awk and sed
+
+```bash
+awk -F $'\t' ' BEGIN { OFS=FS } { if ( $3 == "lineage" ) { print $1,$3,$5,$6,$7,$8,$9,$10,$11 } \
+    else if ( $2 == "ORF has no hit to database" || $2 ~ /^no taxid found/ ) \
+    { print $1,"NA","NA","NA","NA","NA","NA","NA","NA" } else { n=split($3,lineage,";"); \
+    print $1,lineage[n],$5,$6,$7,$8,$9,$10,$11 } } ' sample-1-gene-tax-out.tmp | \
+    sed 's/no support/NA/g' | sed 's/superkingdom/domain/' | sed 's/# ORF/gene_ID/' | \
+    sed 's/lineage/taxid/'  > sample-1-gene-tax-out.tsv
+```
+
+**Input Data:**
+
+- sample-1-gene-tax-out.tmp (gene-calls taxonomy file with lineage info added from [step 17c](#17c-adding-taxonomy-info-from-taxids-to-genes))
+
+**Output Data:**
+
+- sample-1-gene-tax-out.tsv (reformatted gene-calls taxonomy file with lineage info added)
+
+#### 17f. Formatting contig-level output with awk and sed
+
+```bash
+awk -F $'\t' ' BEGIN { OFS=FS } { if ( $2 == "classification" ) { print $1,$4,$6,$7,$8,$9,$10,$11,$12 } \
+    else if ( $2 == "no taxid assigned" ) { print $1,"NA","NA","NA","NA","NA","NA","NA","NA" } \
+    else { n=split($4,lineage,";"); print $1,lineage[n],$6,$7,$8,$9,$10,$11,$12 } } ' sample-1-contig-tax-out.tmp | \
+    sed 's/no support/NA/g' | sed 's/superkingdom/domain/' | sed 's/^# contig/contig_ID/' | \
+    sed 's/lineage/taxid/' > sample-1-contig-tax-out.tsv
+
+  # clearing intermediate files
+rm sample-1*.tmp*
+```
+
+**Input Data:**
+
+- sample-1-contig-tax-out.tmp (contig taxonomy file with lineage info added from [step 17d](#17d-adding-taxonomy-info-from-taxids-to-contigs))
+
+**Output Data:**
+
+- sample-1-contig-tax-out.tsv (reformatted contig taxonomy file with lineage info added)
+
+<br>
+
+---
+
+### 18. Read-Mapping
+
+#### 18a. Align Reads to Sample Assembly
+
+```bash
+minimap2 -a -x map-ont \
+        -t NumberOfThreads \
+        sample_assembly.fasta sample_host_removed.fastq.gz \
+        > sample.sam 2> sample-mapping-info.txt
+```
+
+**Parameter Definitions:**
+
+- `-t` - number of parallel processing threads to use
+- `-a` – output in SAM format
+- `-x map-ont` - specifies preset for mapping Nanopore reads to a reference
+
+**Input Data:**
+
+- /path/to/assemblies/sample_assembly.fasta (Sample assembly, output from [Step 14a](#14a-renaming-contig-headers))
+- /path/to/trimmed_reads/sample_host_removed.fastq.gz (Filtered and trimmed reads, output from [Step 8](#8-host-removal))
+
+**Output Data:**
+
+- sample.sam (Reads aligned to sample assembly)
+
+#### 18b. Sort and Index Assembly Alignments
+
+```bash
+# Sort Sam, convert to bam and create index
+samtools sort --threads NumberOfThreads -o sample_sorted.bam sample.sam > sample_sort.log 2>&1
+
+samtools index sample_sorted.bam sample_sorted.bam.bai
+```
+
+**Parameter Definitions:**
+
+**samtools sort**
+- `--threads` - number of parallel processing threads to use
+- `-o` - specifies the output file for the sorted reads
+- `sample.sam` - positional argument specifying the input SAM file
+
+**samtools index**
+- `sample_sorted.bam` - positional argument specifying the input BAM file to be indexed
+- `sample_sorted.bam.bai` - positional argument specifying the name of the index file
+
+**Input Data:**
+
+- sample.sam (Reads aligned to sample assembly, output from [Step 18a](#18a-align-reads-to-sample-assembly))
+
+**Output Data:**
+
+- sample_sorted.bam (sorted mapping to sample assembly)
+- sample_sorted.bam.bai (index of sorted mapping to sample assembly)
+
+<br>
+
+---
+
+### 19. Getting coverage information and filtering based on detection
+> **Note:**  
+> “Detection” is a measure of what proportion of a reference sequence recruited reads 
+(see the discussion of detection [here](http://merenlab.org/2017/05/08/anvio-views/#detection)). 
+Filtering based on detection is one way of helping to mitigate non-specific read-recruitment.
+
+#### 19a. Filtering coverage levels based on detection
+
+```bash
+  # pileup.sh is part of the BBTools (BBMap) package, which also provides bbduk.sh
+pileup.sh -in sample-1.bam fastaorf=sample-1-genes.fasta outorf=sample-1-gene-cov-and-det.tmp \
+          out=sample-1-contig-cov-and-det.tmp
+```
+
+**Parameter Definitions:**  
+
+- `-in` – the input bam file
+- `fastaorf=` – input gene-calls nucleotide fasta file
+- `outorf=` – the output gene-coverage tsv file
+- `out=` – the output contig-coverage tsv file
+
+
+#### 19b. Filtering gene and contig coverage based on requiring 50% detection and parsing down to just gene ID and coverage
+
+```bash
+# Filtering gene coverage
+grep -v "#" sample-1-gene-cov-and-det.tmp | awk -F $'\t' ' BEGIN { OFS=FS } { if ( $10 <= 0.5 ) $4 = 0 } \
+     { print $1,$4 } ' > sample-1-gene-cov.tmp
+
+cat <( printf "gene_ID\tcoverage\n" ) sample-1-gene-cov.tmp > sample-1-gene-coverages.tsv
+
+# Filtering contig coverage
+grep -v "#" sample-1-contig-cov-and-det.tmp | awk -F $'\t' ' BEGIN { OFS=FS } { if ( $5 <= 50 ) $2 = 0 } \
+     { print $1,$2 } ' > sample-1-contig-cov.tmp
+
+cat <( printf "contig_ID\tcoverage\n" ) sample-1-contig-cov.tmp > sample-1-contig-coverages.tsv
+
+# removing intermediate files
+rm sample-1-*.tmp
+```
+
+**Input Data:**
+
+- sample-1.bam (mapping file from [step 18b](#18b-sort-and-index-assembly-alignments))
+- sample-1-genes.fasta (gene-calls nucleotide fasta file from [step 15](#15-gene-prediction))
+
+**Output Data:**
+
+- sample-1-gene-coverages.tsv (table with gene-level coverages)
+- sample-1-contig-coverages.tsv (table with contig-level coverages)
+
+<br>
+
+---
+
+### 20. Combining gene-level coverage, taxonomy, and functional annotations into one table for each sample
+> **Note:**  
+> This step uses only `paste`, `sort`, `cut`, `head`, `tail`, and `cat`, which are all standard in any Unix-like environment.  
+
+```bash
+paste <( tail -n +2 sample-1-gene-coverages.tsv | sort -V -k 1 ) <( tail -n +2 sample-1-annotations.tsv | sort -V -k 1 | cut -f 2- ) \
+      <( tail -n +2 sample-1-gene-tax-out.tsv | sort -V -k 1 | cut -f 2- ) > sample-1-gene-tab.tmp
+
+paste <( head -n 1 sample-1-gene-coverages.tsv ) <( head -n 1 sample-1-annotations.tsv | cut -f 2- ) \
+      <( head -n 1 sample-1-gene-tax-out.tsv | cut -f 2- ) > sample-1-header.tmp
+
+cat sample-1-header.tmp sample-1-gene-tab.tmp > sample-1-gene-coverage-annotation-and-tax.tsv
+
+  # removing intermediate files
+rm sample-1*tmp sample-1-gene-coverages.tsv sample-1-annotations.tsv sample-1-gene-tax-out.tsv
+```
+
+**Input Data:**
+
+- sample-1-gene-coverages.tsv (table with gene-level coverages from [step 19b](#19b-filtering-gene-and-contig-coverage-based-on-requiring-50-detection-and-parsing-down-to-just-gene-id-and-coverage))
+- sample-1-annotations.tsv (table of KO annotations assigned to gene IDs from [step 16c](#16c-filtering-output-to-retain-only-those-passing-the-ko-specific-score-and-top-hits))
+- sample-1-gene-tax-out.tsv (gene-level taxonomic classifications from [step 17e](#17e-formatting-gene-level-output-with-awk-and-sed))
+
+
+**Output Data:**
+
+- **sample-1-gene-coverage-annotation-and-tax.tsv** (table with combined gene coverage, annotation, and taxonomy info)
+
+<br>
+
+---
+
+### 21. Combining contig-level coverage and taxonomy into one table for each sample
+> **Note:**  
+> This step uses only `paste`, `sort`, `cut`, `head`, `tail`, and `cat`, which are all standard in any Unix-like environment.  
+
+```bash
+paste <( tail -n +2 sample-1-contig-coverages.tsv | sort -V -k 1 ) \
+      <( tail -n +2 sample-1-contig-tax-out.tsv | sort -V -k 1 | cut -f 2- ) > sample-1-contig.tmp
+
+paste <( head -n 1 sample-1-contig-coverages.tsv ) <( head -n 1 sample-1-contig-tax-out.tsv | cut -f 2- ) \
+      > sample-1-contig-header.tmp
       
-    p <- ggplot(abund_table_long, mapping = aes(x=Sample_ID, y=relative_abundance, fill=Species)) +
-         geom_col() +
-         scale_fill_manual(values = colors2use) + 
-         labs(x=NULL, y="Relative Abundance (%)") + 
-         publication_format
+cat sample-1-contig-header.tmp sample-1-contig.tmp > sample-1-contig-coverage-and-tax.tsv
 
-    return(p)
-  }
-  ```
-  **Function Parameter Definitions:**
-  - `abund_table` - a dataframe containing the data to plot
-  - `metadata` - a vector of strings specifying the data to include in the plot
-  - `colors2use` - a vector of strings specifying a custom color palette for coloring plots
-  - `publication_format` - a ggplot::theme object specifying the custom theme for plotting
+  # removing intermediate files
+rm sample-1*tmp sample-1-contig-coverages.tsv sample-1-contig-tax-out.tsv
+```
 
-  **Returns:** a ggplot bar plot
+**Input Data:**
 
-</details>
+- sample-1-contig-coverages.tsv (table with contig-level coverages from [step 10b](#10b-filtering-gene-coverage-based-on-requiring-50-detection-and-parsing-down-to-just-gene-id-and-coverage))
+- sample-1-contig-tax-out.tsv (contig-level taxonomic classifications from [step 8f](#8f-formatting-contig-level-output-with-awk-and-sed))
 
-##### get_colors2use()
-<details>
-  <summary>get colors to use in plots</summary>
 
-  ```R
-  get_colors2use <- function(species, expected_microbes, microbe_colors, custom_palette){
-    
-    unexpected_microbes <- setdiff(species, expected_microbes)
-    
-    start <- length(species)+1
-    end <-  length(species) + length(unexpected_microbes)
-    unexpected_microbes_colors <-  custom_palette[start:end]
-    names(unexpected_microbes_colors) <- unexpected_microbes
-    colors2use <- append(microbe_colors,unexpected_microbes_colors)
-    return(colors2use)
-    
-  }
-  ```
-  **Function Parameter Definitions:**
-  - `species` - a vector specifying the list of species that will use this color palette, used to set the number of colors in the palette
-  - `expected_microbes` - the list of microbe species that were expected in the data
-  - `microbe_colors` - colors assigned to the expected microbes
-  - `custom_palette` - a vector of strings specifying a custom color palette
+**Output Data:**
 
-  **Returns:** a vector of strings specifying the color palette to use for the input species list
+- **sample-1-contig-coverage-and-tax.tsv** (table with combined contig coverage and taxonomy info)
 
-</details>
+<br>
 
-#### 24c. Set global variables
+---
 
-```R
+### 22. Generating normalized, gene- and contig-level coverage summary tables of KO-annotations and taxonomy across samples
 
-kraken_taxonomy_outdir <- "kraken2_taxonomy/"
-kaiju_taxonomy_outdir <- "kaiju_taxonomy/"
+> **Note:**  
+> * To combine across samples to generate these summary tables, we need the same "units". This is done for annotations 
+based on the assigned KO terms, and all non-annotated functions are included together as "Not annotated". It is done for 
+taxonomic classifications based on taxids (full lineages included in the table), and any not classified are included 
+together as "Not classified". 
+> * The values we are working with are coverage per gene (the number of bases recruited to the gene, normalized 
+by the length of the gene). These have been normalized by setting the total coverage of a sample to 1,000,000 and 
+expressing each individual gene-level coverage as its proportion of that 1,000,000 total. This is effectively a percentage, 
+but out of 1,000,000 instead of 100, which makes the numbers friendlier. 
 
-# Define custom theme for plotting
-publication_format <- theme_bw() +
-  theme(panel.grid = element_blank()) +
-  theme(axis.ticks.length=unit(-0.15, "cm"),
-        axis.text.x=element_text(margin=ggplot2::margin(t=0.5,r=0,b=0,l=0,unit ="cm")),
-        axis.text.y=element_text(margin=ggplot2::margin(t=0,r=0.5,b=0,l=0,unit ="cm")), 
-        axis.title = element_text(size = 18,face ='bold.italic', color = 'black'), 
-        axis.text = element_text(size = 16,face ='bold', color = 'black'),
-        legend.position = 'right', legend.title = element_text(size = 15,face ='bold', color = 'black'),
-        legend.text = element_text(size = 14,face ='bold', color = 'black'),
-        strip.text =  element_text(size = 14,face ='bold', color = 'black'))
+#### 22a. Generating gene-level coverage summary tables
 
-# Define custom palette for plotting
-custom_palette <- c("#A6CEE3","#1F78B4","#B2DF8A","#33A02C","#FB9A99","#E31A1C","#FDBF6F", "#FF7F00",
-                    "#CAB2D6","#6A3D9A","#FF00FFFF","#B15928","#000000","#FFC0CBFF","#8B864EFF","#F0027F",
-                    "#666666","#1B9E77", "#E6AB02","#A6761D","#FFFF00FF","#FFFF99","#00FFFFFF",
-                    "#B2182B","#FDDBC7","#D1E5F0","#CC0033","#FF00CC","#330033",
-                    "#999933","#FF9933","#FFFAFAFF",colors()) 
-# remove white colors
-custom_palette <- custom_palette[-c(21:23,
-                                    grep(pattern = "white|snow|azure|gray|#FFFAFAFF|aliceblue",
-                                         x = custom_palette, 
-                                         ignore.case = TRUE)
-                                   )
-                                ]
-# Define expected microbes to use for filtering
-expected_microbes <- c("Pseudomonas aeruginosa", "Salmonella enterica",
-                       "Limosilactobacillus fermentum", "Lactobacillus fermentum", "Staphylococcus aureus",
-                       "Enterococcus faecalis", "Escherichia coli",
-                       "Listeria monocytogenes", "Bacillus subtilis", "Bacillus spizizenii",
-                       "Saccharomyces cerevisiae", "Cryptococcus neoformans")
-orig_expected_microbes <- c("Pseudomonas aeruginosa", "Salmonella enterica",
-                       "Limosilactobacillus fermentum", "Staphylococcus aureus",
-                       "Enterococcus faecalis", "Escherichia coli",
-                       "Listeria monocytogenes", "Bacillus spizizenii",
-                       "Saccharomyces cerevisiae", "Cryptococcus neoformans")
-orig_expected_microbes <- c(sort(orig_expected_microbes), "Escherichia phage Lambda")
-
-# Define expected microbe color palette
-microbe_colors <- custom_palette[1:length(orig_expected_microbes)]
-names(microbe_colors) <- orig_expected_microbes
-
-# Define human associated microbes
-human_associated_microbes <- c("Staphylococcus epidermedis", "Staphylococcus hominis", "Cutibacterium acnes",
-                               "Staphylococcus haemolyticus", "Malassezia", "Corynebacterium", "Micrococcus",
-                               "Hoylesella shahii", "Streptococcus mitis",
-                               "Eubacterium saphenum", "Lawsonella clevelandensis")
-
-# subplots grouping variable
-facets_kaiju <- c("Sample_Type","input_conc_ng", "lambda_spike")
-facets_kraken2 <- c("Sample_Type","input_conc_ng")
+```bash
+bit-GL-combine-KO-and-tax-tables *-gene-coverage-annotation-and-tax.tsv -o Combined
 ```
-**Input Data:** 
 
-*No input data required*
+**Parameter Definitions:**  
 
-**Output Data:**
+- `*-gene-coverage-annotation-and-tax.tsv` - positional arguments specifying the input tsv files, can be provided as a space-delimited list of files, or with wildcards like above
 
-- `publication_format` (a ggplot::theme object specifying the custom theme for plotting)
-- `custom_palette` (a vector of strings specifying a custom color palette for coloring plots)
-- `expected_microbes` (a vector of strings listing microbes that may be found in the samples)
-- `orig_expected_microbes` (a vector of strings listing microbes that may be found in the samples plus "Escherichia phage Lambda")
-- `microbe_colors` (a vector of strings specifying the custom color palette to use for coloring the `orig_expected_microbes`)
-- `human_associated_microbes` (a vector of strings listing microbes that are known to be found in humans)
-- `facets_kaiju` (a vector of strings listing subplot grouping variables for kaiju data)
-- `facets_kraken2` (a vector of strings listing subplot grouping variables for kraken2 data)
-- `kraken_taxonomy_outdir` (a path to the output folder for read taxonomy output based on kraken2 processing)
-- `kaiju_taxonomy_outdir` (a path to the output folder for read taxonomy output based on kaiju processing)
+- `-o` – specifies the output prefix
 
-#### 24d. Import Kaiju Taxonomy Data
 
-```R
-kaiju_table <- "/path/to/kaiju_read_taxonomy/merged_kaiju_summary_species.tsv"
-feature_table <- process_kaiju_table(kaiju_table, taxon_col="taxon_name", remove_non_microbial = FALSE)$abundance_table
+**Input Data:**
+
+- \*-gene-coverage-annotation-and-tax.tsv (tables with combined gene coverage, annotation, and taxonomy info generated for individual samples from [step 20](#20-combining-gene-level-coverage-taxonomy-and-functional-annotations-into-one-table-for-each-sample))
+
+**Output Data:**
 
-# Create Species table (species raw read count by barcode)
-feature_table %>% as.data.frame %>% 
-  rownames_to_column("Species") %>%
-  pivot_longer(-Species, names_to = "Barcode", values_to = "Reads") %>% 
-  write_delim(species_csv, delim=',')
+- **Combined-gene-level-KO-function-coverages-CPM_GLmetagenomics.tsv** (table with all samples combined based on KO annotations; normalized to coverage per million genes covered)
+- **Combined-gene-level-taxonomy-coverages-CPM_GLmetagenomics.tsv** (table with all samples combined based on gene-level taxonomic classifications; normalized to coverage per million genes covered)
+- **Combined-gene-level-KO-function-coverages_GLmetagenomics.tsv** (table with all samples combined based on KO annotations)
+- **Combined-gene-level-taxonomy-coverages_GLmetagenomics.tsv** (table with all samples combined based on gene-level taxonomic classifications)
 
-## The number of reads classified at the species level
-colSums(feature_table) %>%
-  enframe(name = "Barcode", value = "Number of reads") %>%
-  write_delim("{kaiju_taxonomy_outdir}species_counts{assay_suffix}.csv", delim=",")
 
-species_table <- feature_table
+#### 22b. Generating contig-level coverage summary tables
 
+```bash
+bit-GL-combine-contig-tax-tables *-contig-coverage-and-tax.tsv -o Combined
 ```
+
+**Parameter Definitions:**  
+
+- `*-contig-coverage-and-tax.tsv` - positional arguments specifying the input tsv files, can be provided as a space-delimited list of files, or with wildcards like above
+
+- `-o` – specifies the output prefix
+
+
 **Input Data:**
-- `kaiju_table` (the merged kaiju summary data at the species taxon level, output from [Step 9f](#9f-compile-kaiju-taxonomy-results))
-- `kaiju_taxonomy_outdir` (a path to the output folder for read taxonomy output based on kaiju processing)
-- `assay_suffix` (standard GeneLab assay suffix to use in output files)
+
+- \*-contig-coverage-and-tax.tsv (tables with combined contig coverage and taxonomy info generated for individual samples from [step 21](#21-combining-contig-level-coverage-and-taxonomy-into-one-table-for-each-sample))
 
 **Output Data:**
-- `species_table` (dataframe of relative abundance data)
-- **kaiju_taxonomy/species_counts_GLMetagenomics.csv** (Number of reads classified for each species)
 
+- **Combined-contig-level-taxonomy-coverages-CPM_GLmetagenomics.tsv** (table with all samples combined based on contig-level taxonomic classifications; normalized to coverage per million genes covered)
+- **Combined-contig-level-taxonomy-coverages_GLmetagenomics.tsv** (table with all samples combined based on contig-level taxonomic classifications)
 
-##### 24e. Import Kraken2 Taxonomy Data
+<br>
 
-```R
-kraken_reports_dir <- "/path/to/read_taxonomy/kraken2_output/"
+---
 
-# import kraken2 reports
-reports <- pavian::read_reports(kraken_reports_dir)
+### 23. **M**etagenome-**A**ssembled **G**enome (MAG) recovery
 
-# create taxonomy overview
-summary_table  <- pavian::summarize_reports(reports)
-rownames(summary_table) <- rownames(summary_table) %>% str_split("-") %>% map_chr(\(x) pluck(x, 1))
-summary_table %>% rownames_to_column("Sample_ID") %>% write_delim('{kraken_taxonomy_outdir}kraken_taxonomy_overview.csv', delim=',')
+#### 23a. Binning contigs
 
-samples <- names(reports) %>% str_split("-") %>% map_chr(\(x) pluck(x, 1))
-merged_reports  <- pavian::merge_reports2(reports, col_names = samples)
-taxonReads <- merged_reports$taxonReads
-cladeReads <- merged_reports$cladeReads
-tax_data <- merged_reports[["tax_data"]]
+```bash
+jgi_summarize_bam_contig_depths --outputDepth sample-1-metabat-assembly-depth.tsv --percentIdentity 97 --minContigLength 1000 --minContigDepth 1.0  --referenceFasta sample-1-assembly.fasta sample-1.bam
 
-#Create species table
-species_table <- tax_data %>% 
-  bind_cols(cladeReads) %>%
-  filter(taxRank %in% c("U","S")) %>% 
-  select(-contains("tax")) %>%
-  zero_if_na() %>% 
-  filter(name != 0) %>%  # drop unknown taxonomies
-  group_by(name) %>% 
-  summarise(across(everything(), sum)) %>% 
-  ungroup() %>% 
-  as.data.frame()
+metabat2  --inFile sample-1-assembly.fasta --outFile sample-1 --abdFile sample-1-metabat-assembly-depth.tsv -t NumberOfThreads
 
-species_names <- species_table[,"name"]
-rownames(species_table) <- species_names
+mkdir sample-1-bins
+mv sample-1*bin*.fasta sample-1-bins
+zip -r sample-1-bins.zip sample-1-bins
+```
 
-taxonomy_col <- match("name", colnames(species_table))
-species_table <- species_table[,-taxonomy_col]
+**Parameter Definitions:**  
 
-species_table <- apply(X = species_table, MARGIN = 2, FUN = as.numeric)
-rownames(species_table) <- species_names
+-  `--outputDepth` – specifies the output depth file
+-  `--percentIdentity` – minimum end-to-end percent identity of a mapped read to be included
+-  `--minContigLength` – minimum contig length to include
+-  `--minContigDepth` – minimum contig depth to include
+-  `--referenceFasta` – the assembly fasta file generated in step 5a
+-  `sample-1.bam` – final positional arguments are the bam files generated in step 9
+-  `--inFile` - the assembly fasta file generated in step 5a
+-  `--outFile` - the prefix of the identified bins output files
+-  `--abdFile` - the depth file generated by the previous `jgi_summarize_bam_contig_depths` command
+-  `-t` - number of parallel processing threads to use
 
-# calculate total number of reads for each sample
-colSums(species_table) %>%
-  enframe(name = "Sample", value = "Number of reads") %>%
-  write_delim("{kraken_taxonomy_outdir}species_counts{assay_suffix}.csv", delim=",")
-```
 
 **Input Data:**
-- `kraken_reports` (the per-sample kraken reports, output from [Step 10a](#10a-taxonomic-classification))
-- `kraken_taxonomy_outdir` (a path to the output folder for read taxonomy output based on kraken2 processing)
-- `assay_suffix` (standard GeneLab assay suffix to use in output files)
 
+- sample-1-assembly.fasta (assembly fasta file created in [step 5a](#5a-renaming-contig-headers))
+- sample-1.bam (bam file created in [step 9b](#9b-performing-mapping-conversion-to-bam-and-sorting))
 
 **Output Data:**
-- `species_table` (a dataframe of species raw read counts by barcode)
-- **kraken_taxonomy/species_counts_GLMetagenomics.csv** (a dataframe of per-sample read counts)
-- **kraken_taxonomy/kraken_taxonomy_overview.csv** (Comma-separated table containing a summary of Kraken2 taxonomy classification)
-
 
-#### 24f. Import Sample Metadata
+- **sample-1-metabat-assembly-depth.tsv** (tab-delimited summary of coverages)
+- sample-1-bins/sample-1-bin\*.fasta (fasta files of recovered bins)
+- **sample-1-bins.zip** (zip file containing fasta files of recovered bins)
 
-```R
-# define input files
-metadata_file <- "/path/to/metadata.txt"
+#### 23b. Bin quality assessment
+Utilizes the default `checkm` database available [here](https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz), `checkm_data_2015_01_16.tar.gz`.
 
-# Import metadata
-metadata <- read_delim(metdata_file , delim = "\t") %>% as.data.frame()
-row.names(metadata) <- metadata$Sample_ID
+```bash
+checkm lineage_wf -f bins-overview_GLmetagenomics.tsv --tab_table -x fa ./ checkm-output-dir
 ```
 
-**Input Data:** 
+**Parameter Definitions:**  
 
-- `metadata_file` (a file containing sample metadata for the study, columns are: Sample_ID (string), Sample_Type (string), input_conc_ng (float), lambda_spike ('no'/'yes'), Sample_or_Control (string))
+-  `lineage_wf` – specifies the workflow being utilized
+-  `-f` – specifies the output summary file
+-  `--tab_table` – specifies the output summary file should be a tab-delimited table
+-  `-x` – specifies the extension that is on the bin fasta files that are being assessed
+-  `./` – first positional argument at end specifies the directory holding the bins generated in step 23a
+-  `checkm-output-dir` – second positional argument at end specifies the primary checkm output directory with detailed information
 
-**Output Data:**
+**Input Data:**
 
-- `metadata` (a dataframe containing sample metadata for the study with the sampleIDs as the row names)
+- sample-1-bins/sample-1-bin\*.fasta (bin fasta files generated in [step 23a](#23a-binning-contigs))
 
---- 
+**Output Data:**
 
-### 25. Read-based processing feature table decontamination
+- **bins-overview_GLmetagenomics.tsv** (tab-delimited file with quality estimates per bin)
+- checkm-output-dir (directory holding detailed checkm outputs)
 
-The read-based feature table decontamination and taxonomy QC are performed using the same functions for both kraken2 and kaiju generated taxonomies.
+#### 23c. Filtering MAGs
 
-#### 25a. Taxonomy filtering
+```bash
+cat <( head -n 1 bins-overview_GLmetagenomics.tsv ) \
+    <( awk -F $'\t' ' $12 >= 90 && $13 <= 10 && $14 == 0 ' bins-overview_GLmetagenomics.tsv | sed 's/bin./MAG-/' ) \
+    > checkm-MAGs-overview.tsv
+    
+# copying bins into a MAGs directory in order to run tax classification
+awk -F $'\t' ' $12 >= 90 && $13 <= 10 && $14 == 0 ' bins-overview_GLmetagenomics.tsv | cut -f 1 > MAG-bin-IDs.tmp
 
-```R
-# with unclassified data
-output_dir <- "{taxonomy_type}_taxonomy/"
-abundance_threshold <- 0.5
-
-species <- species_table %>% as.data.frame %>% 
-  rownames_to_column("Species") %>% pull(Species) %>% unique()
-colors2use <- get_colors2use(species, orig_expected_microbes, microbe_colors, custom_palette)
-
-abund_table <- species_table %>% 
-               as.data.frame %>% 
-               mutate( across(everything(), \(x) (x/sum(x, na.rm = TRUE))*100 ) ) %>% 
-               rownames_to_column("Species") 
-  
-rownames(abund_table) <- abund_table$Species
-  
-abund_table <- abund_table[, -match(x = "Species", colnames(abund_table))] %>% t
+mkdir MAGs
+for ID in $(cat MAG-bin-IDs.tmp)
+do
+    MAG_ID=$(echo $ID | sed 's/bin./MAG-/')
+    cp ${ID}.fasta MAGs/${MAG_ID}.fasta
+done
 
-# excluding unclassified and host reads
-non_microbial <- "Unclassified|unclassified|Homo sapien"
+for SAMPLE in $(cat MAG-bin-IDs.tmp | sed 's/-bin.*//' | sort -u);
+do
+  mkdir ${SAMPLE}-MAGs
+  cp MAGs/${SAMPLE}-*MAG*.fasta ${SAMPLE}-MAGs
+  zip -r ${SAMPLE}-MAGs.zip ${SAMPLE}-MAGs
+done
+```
 
-# Get species with relative abundance greater than 0.5 in all the samples
-clean_tab <- species_table %>% 
-  as.data.frame %>% 
-  rownames_to_column("Species") 
+**Input Data:**
 
-abund_table <- filter_rare(clean_tab, non_microbial, threshold=abundance_threshold)
-species <- rownames(abund_table)
-colors2use <- get_colors2use(species, orig_expected_microbes, microbe_colors, custom_palette)
+- bins-overview_GLmetagenomics.tsv (tab-delimited file with quality estimates per bin from [step 23b](#23b-bin-quality-assessment))
 
-species_abund_table <- abund_table %>% 
-                    as.data.frame %>% 
-                   mutate( across( where(is.numeric)  , \(x) (x/sum(x, na.rm = TRUE))*100 ) )
+**Output Data:**
 
-abund_table <- species_abund_table %>% t
+- checkm-MAGs-overview.tsv (tab-delimited file with quality estimates per MAG)
+- MAGs/\*.fasta (directory holding high-quality MAGs)
+- **\*-MAGs.zip** (zip files containing directories of high-quality MAGs)
 
-# Without human-associated microbes
-unwanted <- str_c(c(non_microbial, human_associated_microbes), collapse = "|")
-clean_tab2 <- filter_rare(clean_tab, unwanted, threshold=abundance_threshold)
-clean_tab2 <- clean_tab2   %>% 
-  mutate( across( where(is.numeric)  , \(x) (x/sum(x, na.rm = TRUE))*100 ) )
-abund_table <- clean_tab2 %>% t
-species <- rownames(clean_tab2)
-colors2use <- get_colors2use(species, orig_expected_microbes, microbe_colors, custom_palette)
 
-p <- make_plot(abund_table, metadata, colors2use, publication_format) + 
-  facet_wrap(facets, scales = "free_x", nrow=1)
+#### 23d. MAG taxonomic classification
+Uses the default `gtdbtk` database, set up with the program's `download-db.sh` command.
 
-p$data %>% mutate(run="Ultra Low", host_read_removal="kraken", taxonomy="{taxonomy_type}") %>% write_delim(file="{output_dir}/{taxonomy_type}_no_unwanted{assay_suffix}.tsv", delim = "\t")
+```bash
+gtdbtk classify_wf --genome_dir MAGs/ -x fa --out_dir gtdbtk-output-dir  --skip_ani_screen
+```
 
-# Expected microbes alone 
-non_microbial <- "Unclassifed|unclassified|Homo sapien"
+**Parameter Definitions:**  
 
-clean_tab2 <- clean_tab %>% 
-  filter(str_detect(Species, non_microbial, negate = TRUE))  %>% 
-    filter(str_detect(Species, str_c(expected_microbes, collapse = "|"))) %>%  #select only the expected microbes
-  mutate( across( where(is.numeric)  , \(x) (x/sum(x, na.rm = TRUE))*100 ) )
+-  `classify_wf` – specifies the workflow being utilized
+-  `--genome_dir` – specifies the directory holding the MAGs generated in step 23c
+-  `-x` – specifies the extension that is on the MAG fasta files that are being taxonomically classified
+-  `--out_dir` – specifies the output directory
+-  `--skip_ani_screen` - skips the ANI screening step, in which genomes would otherwise be pre-classified using mash and skani
 
-rownames(clean_tab2) <- clean_tab2$Species
-clean_tab2  <- clean_tab2[,-1] 
-abund_table <- clean_tab2 %>% t
-species <- rownames(clean_tab2)
-colors2use <- get_colors2use(species, orig_expected_microbes, microbe_colors, custom_palette)
+**Input Data:**
 
-p <- make_plot(abund_table, metadata, colors2use, publication_format) + 
-  facet_wrap(facets, scales = "free_x", nrow=1)
+- MAGs/\*.fasta (directory holding high-quality MAGs from [step 23c](#23c-filtering-mags))
 
-p$data %>% mutate(run="Ultra Low", host_read_removal="kraken", taxonomy="{taxonomy_type}") %>% write_delim(file="{output_dir}{taxonomy_type}_expected{assay_suffix}.tsv", delim = "\t")
+**Output Data:**
 
-# Without Unclassified and host reads alone
+- gtdbtk-output-dir/gtdbtk.\*.summary.tsv (files with assigned taxonomy and info)
 
-# Get species with relative abundance greater than 1 in all the samples
-clean_tab2 <- clean_tab %>% 
-  as.data.frame %>% 
-  filter(str_detect(Species, non_microbial, negate = TRUE))  %>% 
-  mutate( across( where(is.numeric)  , \(x) (x/sum(x, na.rm = TRUE))*100 ) )
+#### 23e. Generating overview table of all MAGs
 
-rownames(clean_tab2) <- clean_tab2$Species
-clean_tab2  <- clean_tab2[,-1] 
-abund_table <- clean_tab2 %>% t
-species <- rownames(clean_tab2)
-colors2use <- get_colors2use(species, orig_expected_microbes, microbe_colors, custom_palette)
+```bash
+# combine summaries
+for MAG in $(cut -f 1 assembly-summaries_GLmetagenomics.tsv | tail -n +2); do
 
-p <- make_plot(abund_table, metadata, colors2use, publication_format) + 
-  facet_wrap(facets, scales = "free_x", nrow=1)
+    grep -w -m 1 "^${MAG}" checkm-MAGs-overview.tsv | cut -f 12,13,14 \
+        >> checkm-estimates.tmp
 
-#Without removing taxonomies with relative abundance less than 0.5%
-p$data %>% mutate(run="Ultra Low", host_read_removal="kraken", taxonomy="{taxonomy_type}") %>% write_delim(file="{output_dir}{taxonomy_type}_no_filt{assay_suffix}.tsv", delim = "\t")
+    grep -w "^${MAG}" gtdbtk-output-dir/gtdbtk.*.summary.tsv | \
+    cut -f 2 | sed 's/^.__//' | \
+    sed 's/;.__/\t/g' | \
+    awk 'BEGIN{ OFS=FS="\t" } { for (i=1; i<=NF; i++) if ( $i ~ /^ *$/ ) $i = "NA" }; 1' \
+        >> gtdb-taxonomies.tmp
 
-# Filter out unclassified, human reads and rare species
+done
 
-# Rare species here are classified as species with a relative abundance less than 0.5% across
-# all samples.
+# Add headers
+cat <(printf "est. completeness\test. redundancy\test. strain heterogeneity\n") checkm-estimates.tmp \
+    > checkm-estimates-with-headers.tmp
 
-# Get species with relative abundance greater than 0.5 in all the samples
-abund_table <- filter_rare(clean_tab, non_microbial, threshold=abundance_threshold)
-species <- rownames(abund_table)
-colors2use <- get_colors2use(species, orig_expected_microbes, microbe_colors, custom_palette)
+cat <(printf "domain\tphylum\tclass\torder\tfamily\tgenus\tspecies\n") gtdb-taxonomies.tmp \
+    > gtdb-taxonomies-with-headers.tmp
 
-species_abund_table <- abund_table %>% 
-                    as.data.frame %>% 
-                   mutate( across( where(is.numeric)  , \(x) (x/sum(x, na.rm = TRUE))*100 ) )
+paste assembly-summaries_GLmetagenomics.tsv \
+checkm-estimates-with-headers.tmp \
+gtdb-taxonomies-with-headers.tmp \
+    > MAGs-overview.tmp
 
-abund_table <- species_abund_table %>% t
+# Ordering by taxonomy
+head -n 1 MAGs-overview.tmp > MAGs-overview-header.tmp
 
-p <- make_plot(abund_table, metadata, colors2use, publication_format) + 
-  facet_wrap(facets, scales = "free_x", nrow=1)
+tail -n +2 MAGs-overview.tmp | sort -t $'\t' -k 14,20 > MAGs-overview-sorted.tmp
 
-p$data %>% mutate(run="Ultra Low", host_read_removal="kraken", taxonomy="{taxonomy_type}") %>% write_delim(file="{output_dir}{taxonomy_type}_filtered{assay_suffix}.tsv", delim = "\t")
+cat MAGs-overview-header.tmp MAGs-overview-sorted.tmp \
+    > MAGs-overview_GLmetagenomics.tsv
 ```
 
-**Parameter Definitions:**
-- `abundance_threshold` - threshold for defining rare species, default=0.5
-- `taxonomy_type` - string specify which tool was used to create the input taxonomy, either `kaiju` or `kraken`
-- `assay_suffix` - string specifying an assay suffix to use for output file nameing in this dataset (default: GLMetagenomics)
-
-
 **Input Data:**
-- `species_table` (dataframe of relative abundance data, from [Step 24d](#24d-import-kaiju-taxonomy-data) if using kaiju taxonomies or [Step 24e](#24e-import-kraken2-taxonomy-data) is using kraken taxonomies)
-- `facets` (a vector of strings listing subplot grouping variables for either kaiju or kraken data, from [Step 24c](#24c-set-global-variables))
 
+- assembly-summaries_GLmetagenomics.tsv (table of assembly summary statistics from [step 5b](#5b-summarizing-assemblies))
+- MAGs/\*.fasta (directory holding high-quality MAGs from [step 23c](#23c-filtering-mags))
+- checkm-MAGs-overview.tsv (tab-delimited file with quality estimates per MAG from [step 23c](#23c-filtering-mags))
+- gtdbtk-output-dir/gtdbtk.\*.summary.tsv (directory of files with assigned taxonomy and info from [step 23d](#23d-mag-taxonomic-classification))
 
-**Output Data**
-- `species_abund_table` (a dataframe containing filtered realtive abundance values)
-- **<kraken|kaiju>_taxonomy/<kraken|kaiju>_expected_GLMetagenomics.tsv** ()
-- **<kraken|kaiju>_taxonomy/<kraken|kaiju>_no_filt_GLMetagenomics.tsv** ()
-- **<kraken|kaiju>_taxonomy/<kraken|kaiju>_filtered_GLMetagenomics.tsv** ()
+**Output Data:**
 
----
+- **MAGs-overview_GLmetagenomics.tsv** (a tab-delimited overview of all recovered MAGs)
 
-#### 25b. Decontamination with Decontam
 
-##### 25b.i. Setup variables
-```R
-feature_table <- species_abund_table #species_table
-sub_metadata <- metadata[colnames(feature_table),]
-# Modify NTC concentration
-sub_metadata <- sub_metadata %>% 
-  mutate(input_conc_ng=map2_dbl(Sample_Type, input_conc_ng,
-                                .f= function(type, conc) { 
-                                  if(conc == 0) return(0.0000001) else return(conc) 
-                                  } )
-         )
-sub_metadata$input_conc_ng <- as.numeric(sub_metadata$input_conc_ng)
-ps <- phyloseq(otu_table(feature_table, taxa_are_rows = TRUE),
-            sample_data(sub_metadata))
-```
+<br>
 
-**Input Data:**
-- `species_abund_table` (a dataframe containing filtered relative abundance values, from [Step ](#25a-taxonomy-filtering))
+---
 
-**Output Data:**
-- `ps` (phyloseq object of the relative abundance values with NTC metadata added)
+### 24. Generating MAG-level functional summary overview
 
-##### 25b.ii. Identify prevalence of contaminant sequences
-The prevalence (presence/absence across samples) of each sequence feature in 
-true positive samples is compared to the prevalence in negative controls to 
-identify contaminants.
+#### 24a. Getting KO annotations per MAG
+This utilizes the helper script [`parse-MAG-annots.py`](../Workflow_Documentation/NF_MGIllumina/workflow_code/bin/parse-MAG-annots.py) 
 
-```R
-contam_threshold <- 0.1
-output_dir <- "{taxonomy_type}_taxonomy_decontam/"
-# In our phyloseq object, "Sample_or_Control" is the sample variable that holds 
-# the negative control information. We’ll summarize that data as a logical 
-# variable, with TRUE for control samples, as that is the form required by isContaminant
-sample_data(ps)$is.neg <- sample_data(ps)$Sample_or_Control == "Control_Sample"
-contamdf <- isContaminant(ps, neg="is.neg", conc="input_conc_ng", threshold=contam_threshold) # threshold
+```bash
+for file in MAGs/*.fasta
+do
 
-#### Create contaminant table
-contamdf %>%
-  mutate( across( where(is.numeric), \(x) round(x, digits = 2) ) ) %>%
-  rownames_to_column("Species") %>% 
-  write_delim(file="{output_dir}{taxonomy_type}_contaminant_table{assay_suffix}.tsv", delim = "\t")
+    MAG_ID=$( echo ${file} | cut -f 2 -d "/" | sed 's/.fasta//' )
+    sample_ID=$( echo ${MAG_ID} | sed 's/-MAG-[0-9]*$//' )
 
-table(contamdf$contaminant)
+    grep "^>" ${file} | tr -d ">" > ${MAG_ID}-contigs.tmp
 
-contamdf %>% filter(contaminant == TRUE) %>% 
-  write_delim(file="{output_dir}{taxonomy_type}_filtered_contaminant_table{assay_suffix}.tsv", delim = "\t")
+    python parse-MAG-annots.py -i annotations-and-taxonomy/${sample_ID}-gene-coverage-annotation-and-tax.tsv \
+                               -w ${MAG_ID}-contigs.tmp -M ${MAG_ID} \
+                               -o MAG-level-KO-annotations_GLmetagenomics.tsv
 
+    rm ${MAG_ID}-contigs.tmp
 
-isExpected <- str_detect(rownames(contamdf), pattern = str_c(expected_microbes, collapse = "|"))
-contamdf[isExpected,] %>%
-  select(-p.freq) %>%
-  mutate( across( where(is.numeric), \(x) round(x, digits = 3) ) ) %>% 
-  write_delim(file="{output_dir}{taxonomy_type}_contaminant_table_expected_microbes{assay_suffix}.tsv", delim = "\t")
+done
 ```
 
-**Parameter Defintitions:**
-- `contam_threshold` - probability threshold below which the null hypothesis (not a contaminant) should be rejected in favor of the alternate hypothesis (contaminant) (default: 0.1)
-- `taxonomy_type` - string specify which tool was used to create the input taxonomy, either `kaiju` or `kraken`
-- `assay_suffix` - string specifying an assay suffix to use for output file nameing in this dataset (default: GLMetagenomics)
+**Parameter Definitions:**  
 
+- `-i` – specifies the input sample gene-coverage-annotation-and-tax.tsv file generated in step 20
+- `-w` – specifies the appropriate temporary file holding all the contigs in the current MAG
+- `-M` – specifies the current MAG unique identifier
+- `-o` – specifies the output file
 
 **Input Data:**
-- `ps` (phyloseq object of the relative abundance values with NTC metadata added, from [Step ](#25bi-setup-variables))
+
+- \*-gene-coverage-annotation-and-tax.tsv (sample gene-coverage-annotation-and-tax.tsv file generated in [step 20](#20-combining-gene-level-coverage-taxonomy-and-functional-annotations-into-one-table-for-each-sample))
+- MAGs/\*.fasta (directory holding high-quality MAGs from [step 23c](#23c-filtering-mags))
 
 **Output Data:**
-- `contam_df` (dataframe of contaminant table)
-- **<kaiju|kraken>_taxonomy_decontam/<kaiju|kraken>_contaminant_table_GLMetagenomics.tsv** (tab-delimited table of classification information for all input sequences)
-- **<kaiju|kraken>_taxonomy_decontam/<kaiju|kraken>_filtered_contaminant_table_GLMetagenomics.tsv** (tab-delimited table of classification information for all sequences identified as contaminants)
-- **<kaiju|kraken>_taxonomy_decontam/<kaiju|kraken>_contaminant_table_expected_microbes_GLMetagenomics.tsv** (tab-delimited table of classification information for expected microbes)
 
-##### 25b.iii. Decontaminated taxonomy plots
+- **MAG-level-KO-annotations_GLmetagenomics.tsv** (tab-delimited table holding MAGs and their KO annotations)
 
-```R
-output_dir <- "{taxonomy_type}_taxonomy_decontam/"
-contaminants <- contamdf %>%
-  as.data.frame %>%
-  rownames_to_column("Species") %>%
-  filter(contaminant == TRUE) %>% pull(Species)
-species <- species_abund_table  %>% 
-  as.data.frame %>% 
-  rownames_to_column("Species") %>%
-  filter(str_detect(Species, pattern = str_c(contaminants, collapse = "|"), negate = TRUE)) %>%
-  pull(Species) %>%
-  unique()
-colors2use <- get_colors2use(species, orig_expected_microbes, microbe_colors, custom_palette)
-
-abund_table <- species_abund_table %>% 
-                    as.data.frame  %>% 
-                    rownames_to_column("Species") %>% 
-                    filter(str_detect(Species, 
-                                      pattern = str_c(contaminants,
-                                                      collapse = "|"),
-                                      negate = TRUE)) %>%
-                    mutate( across( where(is.numeric)   , \(x) (x/sum(x, na.rm = TRUE))*100 ) )
-  
-rownames(abund_table) <- abund_table$Species
-  
-abund_table <- abund_table[, -match(x = "Species", colnames(abund_table))] %>% t
-  
-abund_table_wide <- abund_table %>% 
-    as.data.frame() %>% 
-    rownames_to_column("Sample_ID") %>% 
-    inner_join(metadata) %>% 
-    select(!!!colnames(metadata), everything()) %>% 
-    mutate(Sample_ID = Sample_ID %>% str_remove("barcode"))
-    
-  
-abund_table_long <- abund_table_wide  %>%
-    pivot_longer(-colnames(metadata), 
-                 names_to = "Species",
-                 values_to = "relative_abundance")
-  
-p <- ggplot(abund_table_long, mapping = aes(x=Sample_ID, 
-                                              y=relative_abundance, fill=Species)) +
-    geom_col() +
-    scale_fill_manual(values = colors2use) + 
-    labs(x=NULL, y="Relative Abundance (%)") + 
-    publication_format + 
-  facet_wrap(facets, scales = "free_x", nrow=1)
 
-#### Taxonomy plot without contaminants
+#### 24b. Summarizing KO annotations with KEGG-Decoder
 
-# Taxonomy plot after contaminant removal at a set threshold of 0.1
-# ggsave(filename = "results/species_plot.png", plot = p,
-#          device = "png", width = 10, height = 6, units = "in", dpi = 300)
-ggplotly(p) %>% saveWidget(file = "{output_dir}{taxonomy_type}_taxonomy_plots_no_contam{assay_suffix}.html")
+```bash
+KEGG-decoder -v interactive -i MAG-level-KO-annotations_GLmetagenomics.tsv -o MAG-KEGG-Decoder-out_GLmetagenomics.tsv
 ```
-**Parameter Definitions:**
-- `taxonomy_type` - string specify which tool was used to create the input taxonomy, either `kaiju` or `kraken`
-- `assay_suffix` - string specifying an assay suffix to use for output file nameing in this dataset (default: GLMetagenomics)
+
+**Parameter Definitions:**  
+
+- `-v interactive` – specifies that an interactive HTML output should be created
+- `-i` – specifies the input MAG-level-KO-annotations_GLmetagenomics.tsv file generated in [step 24a](#24a-getting-ko-annotations-per-mag)
+- `-o` – specifies the output table
 
 **Input Data:**
-- `species_abund_table` ()
-- `contam_df` ()
+
+- MAG-level-KO-annotations_GLmetagenomics.tsv (tab-delimited table holding MAGs and their KO annotations, generated in [step 24a](#24a-getting-ko-annotations-per-mag))
 
 **Output Data:**
-- **<kaiju|kraken>_taxonomy_decontam/<kaiju|kraken>_taxonomy_plots_no_contam_GLMetagenomics.html** (Plot of taxonomies for decontaminated data)
 
----
+- **MAG-KEGG-Decoder-out_GLmetagenomics.tsv** (tab-delimited table holding MAGs and, for each, the proportion of genes it holds that are known to be required for specific pathways/metabolisms)
+
+- **MAG-KEGG-Decoder-out_GLmetagenomics.html** (interactive heatmap html file of the above output table)
+
+<br>
 
-### 26. Assembly-based processing decontamination
-Medaka assembly annotation of kraken decontaminated low biomass samples
-Quality filtered and trimmed reads were decontaminated (host (human) reads filtered out) using kraken2. Assembly of the clean reads was performed using metaflye followed by polishing with medaka. The polished assembly was annotated using our standard assembly annotation pipeline with prodigal used to predict genes, CAT used for taxonomy assignment of genes and contigs and KOFamScan for genes functional annotation.  
+---
 

From 6e6924c566b7057fc2726b65d9bc5c27b1216b00 Mon Sep 17 00:00:00 2001
From: olabiyi <obadbotanist@yahoo.com>
Date: Thu, 2 Oct 2025 13:56:53 -0700
Subject: [PATCH 06/47] added decontamination and visualization steps, and
 fixed broken links

---
 .../Low_Biomass/Nanopore/GL-DPPD-XXXX.md      | 1302 +++++++++++++----
 1 file changed, 1017 insertions(+), 285 deletions(-)

diff --git a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
index ac9c389af..3f4851636 100644
--- a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
+++ b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
@@ -40,57 +40,89 @@ Barbara Novak (GeneLab Data Processing Lead)
       - [5a. Trim Filtered Data](#5a-trim-filtered-data)
       - [5b. Trimmed Data QC](#5b-trimmed-data-qc)
       - [5c. Compile Trimmed Data QC](#5c-compile-filtered-data-qc)
-    - [6. Assemble Contaminants](#6-assemble-contaminants)
-    - [7. Contaminant Removal](#7-remove-contaminants)
-      - [7a. Build Contaminant Index and Map Reads](#7a-build-contaminant-index-and-map-reads)
-      - [7b. Sort and Index Contaminant Reads](#7b-sort-and-index-contaminant-alignments)
-      - [7c. Gather Contaminant Mapping Metrics](#7c-gather-contaminant-mapping-metrics)
-      - [7d. Generate Decontaminated Read Files](#7d-generate-decontaminated-read-files)
-      - [7e. Contaminant Removal QC](#7e-contaminant-removal-qc)
-      - [7f. Compile Contaminant Removal QC](#7f-compile-contaminant-removal-qc)
-    - [8. Host Removal](#8-host-removal)
-      - [8a. Build or download host database](#8a-build-or-download-host-database)
-        - [8a.i. Download from URL](#8ai-download-from-url)
-        - [8a.ii. Build from custom reference](#8aii-build-from-custom-reference)
-        - [8a.iii. Build from host name](#8aiii-build-from-host-name)
-      - [8b. Remove Host Reads](#8b-remove-host-reads)
-    - [9. R Environment Setup](#9-r-environment-setup)
-      - [9a. Load libraries](#9a-load-libraries)
-      - [9b. Define Custom Functions](#9b-define-custom-functions)
-      - [9c. Set global variables](#9c-set-global-variables)
+    - [6. Contaminant Removal](#6-contaminant-removal)
+      - [6a. Assemble Contaminants](#6a-assemble-contaminants)
+      - [6b. Build Contaminant Index and Map Reads](#6b-build-contaminant-index-and-map-reads)
+      - [6c. Sort and Index Contaminant Alignments](#6c-sort-and-index-contaminant-alignments)
+      - [6d. Gather Contaminant Mapping Metrics](#6d-gather-contaminant-mapping-metrics)
+      - [6e. Generate Decontaminated Read Files](#6e-generate-decontaminated-read-files)
+      - [6f. Contaminant Removal QC](#6f-contaminant-removal-qc)
+      - [6g. Compile Contaminant Removal QC](#6g-compile-contaminant-removal-qc)
+    - [7. Host Removal](#7-host-removal)
+      - [7a. Build or download host database](#7a-build-or-download-host-database)
+        - [7a.i. Download from URL](#7ai-download-from-url)
+        - [7a.ii. Build from custom reference](#7aii-build-from-custom-reference)
+        - [7a.iii. Build from host name](#7aiii-build-from-host-name)
+      - [7b. Remove Host Reads](#7b-remove-host-reads)
+    - [8. R Environment Setup](#8-r-environment-setup)
+      - [8a. Load libraries](#8a-load-libraries)
+      - [8b. Define Custom Functions](#8b-define-custom-functions)
+      - [8c. Set global variables](#8c-set-global-variables)
   - [**Read-based processing**](#read-based-processing)
-    - [10. Taxonomic profiling using kaiju](#10-taxonomic-profiling-using-kaiju)
-      - [10a. Build kaiju database](#10a-build-kaiju-database)
-      - [10b. Kaiju Taxonomic Classification](#10b-kaiju-taxonomic-classification)
-      - [10c. Compile kaiju taxonomy results](#10c-compile-kaiju-taxonomy-results)
-      - [10d. Convert kaiju output to krona format](#10d-convert-kaiju-output-to-krona-format)
-      - [10e. Compile kaiju krona report](#10e-compile-kaiju-krona-report)
-      - [10f. Create kaiju species count table](#10f-create-kaiju-species-count-table)
-      - [10g. Read-in tables](#10g-read-in-tables)
-      - [10h. Taxonomy barplots](#10h-taxonomy-barplots)
-    - [11. Taxonomic Profiling using Kraken2](#11-taxonomic-profiling-using-kraken2)
-      - [11a. Download kraken2 database](#11a-download-kraken2-database)
-      - [11b. Taxonomic Classification](#11b-taxonomic-classification)
-      - [11c. Convert Kraken2 output to Krona format](#11c-convert-kraken2-output-to-krona-format)
-      - [11d. Compile kraken2 krona report](#11d-compile-kraken2-krona-report)
-      - [11e. Create kraken species count table](#11e-create-kraken-species-count-table)
-      - [11f. Read-in tables](#11f-read-in-tables)
-      - [11g. Taxonomy barplots](#11g-taxonomy-barplots)
-      - [11h. Feature decontamination](#11h-feature-decontamination)
+    - [9. Taxonomic profiling using kaiju](#9-taxonomic-profiling-using-kaiju)
+      - [9a. Build kaiju database](#9a-build-kaiju-database)
+      - [9b. Kaiju Taxonomic Classification](#9b-kaiju-taxonomic-classification)
+      - [9c. Compile kaiju taxonomy results](#9c-compile-kaiju-taxonomy-results)
+      - [9d. Convert kaiju output to krona format](#9d-convert-kaiju-output-to-krona-format)
+      - [9e. Compile kaiju krona report](#9e-compile-kaiju-krona-report)
+      - [9f. Create kaiju species count table](#9f-create-kaiju-species-count-table)
+      - [9g. Read-in tables](#9g-read-in-tables)
+      - [9h. Taxonomy barplots](#9h-taxonomy-barplots)
+      - [9i. Feature decontamination](#9i-feature-decontamination)
+    - [10. Taxonomic Profiling using Kraken2](#10-taxonomic-profiling-using-kraken2)
+      - [10a. Download kraken2 database](#10a-download-kraken2-database)
+      - [10b. Taxonomic Classification](#10b-taxonomic-classification)
+      - [10c. Convert Kraken2 output to Krona format](#10c-convert-kraken2-output-to-krona-format)
+      - [10d. Compile kraken2 krona report](#10d-compile-kraken2-krona-report)
+      - [10e. Create kraken species count table](#10e-create-kraken-species-count-table)
+      - [10f. Read-in tables](#10f-read-in-tables)
+      - [10g. Taxonomy barplots](#10g-taxonomy-barplots)
+      - [10h. Feature decontamination](#10h-feature-decontamination)
   - [**Assembly-based processing**](#assembly-based-processing)
-    - [12. Sample assembly](#12-sample-assembly)
-    - [13. Polish assembly](#13-polish-assembly)
-    - [14. Renaming contigs and summarizing assemblies](#14-renaming-contigs-and-summarizing-assemblies)
-    - [15. Gene prediction](#15-gene-prediction)
-    - [16. Functional annotation](#16-functional-annotation)
-    - [17. Taxonomic classification](#17-taxonomic-classification)
-    - [18. Read-mapping](#18-read-mapping)
-    - [19. Getting coverage information and filtering based on detection](#19-getting-coverage-information-and-filtering-based-on-detection)
-    - [20. Combining gene-level coverage, taxonomy, and functional annotations into one table for each sample](#20-combining-gene-level-coverage-taxonomy-and-functional-annotations-into-one-table-for-each-sample)
-    - [21. Combining contig-level coverage and taxonomy into one table for each sample](#21-combining-contig-level-coverage-and-taxonomy-into-one-table-for-each-sample)
-    - [22. Generating normalized, gene- and contig-level coverage summary tables of KO-annotations and taxonomy across samples](#22-generating-normalized-gene--and-contig-level-coverage-summary-tables-of-ko-annotations-and-taxonomy-across-samples)
-    - [23. **M**etagenome-**A**ssembled **G**enome (MAG) recovery](#23-metagenome-assembled-genome-mag-recovery)
-    - [24. Generating MAG-level functional summary overview](#24-generating-mag-level-functional-summary-overview)
+    - [11. Sample assembly](#11-sample-assembly)
+    - [12. Polish assembly](#12-polish-assembly)
+    - [13. Renaming contigs and summarizing assemblies](#13-renaming-contigs-and-summarizing-assemblies)
+      - [13a. Renaming contig headers](#13a-renaming-contig-headers)
+      - [13b. Summarizing assemblies](#13b-summarizing-assemblies)
+    - [14. Gene prediction](#14-gene-prediction)
+      - [14a. Remove line wraps in gene prediction output](#14a-remove-line-wraps-in-gene-prediction-output)
+    - [15. Functional annotation](#15-functional-annotation)
+      - [15a. Downloading reference database of HMM models (only needs to be done once)](#15a-downloading-reference-database-of-hmm-models-only-needs-to-be-done-once)
+      - [15b. Running KEGG annotation](#15b-running-kegg-annotation)
+      - [15c. Filtering output to retain only those passing the KO-specific score and top hits](#15c-filtering-output-to-retain-only-those-passing-the-ko-specific-score-and-top-hits)
+    - [16. Taxonomic classification](#16-taxonomic-classification)
+      - [16a. Pulling and un-packing pre-built reference db (only needs to be done once)](#16a-pulling-and-un-packing-pre-built-reference-db-only-needs-to-be-done-once)
+      - [16b. Running taxonomic classification](#16b-running-taxonomic-classification)
+      - [16c. Adding taxonomy info from taxids to genes](#16c-adding-taxonomy-info-from-taxids-to-genes)
+      - [16d. Adding taxonomy info from taxids to contigs](#16d-adding-taxonomy-info-from-taxids-to-contigs)
+      - [16e. Formatting gene-level output with awk and sed](#16e-formatting-gene-level-output-with-awk-and-sed)
+      - [16f. Formatting contig-level output with awk and sed](#16f-formatting-contig-level-output-with-awk-and-sed)
+    - [17. Read-mapping](#17-read-mapping)
+      - [17a. Align Reads to Sample Assembly](#17a-align-reads-to-sample-assembly)
+      - [17b. Sort and Index Assembly Alignments](#17b-sort-and-index-assembly-alignments)
+    - [18. Getting coverage information and filtering based on detection](#18-getting-coverage-information-and-filtering-based-on-detection)
+      - [18a. Filtering coverage levels based on detection](#18a-filtering-coverage-levels-based-on-detection)
+      - [18b. Filtering gene and contig coverage based on requiring 50% detection and parsing down to just gene ID and coverage](#18b-filtering-gene-and-contig-coverage-based-on-requiring-50-detection-and-parsing-down-to-just-gene-id-and-coverage)
+    - [19. Combining gene-level coverage, taxonomy, and functional annotations into one table for each sample](#19-combining-gene-level-coverage-taxonomy-and-functional-annotations-into-one-table-for-each-sample)
+    - [20. Combining contig-level coverage and taxonomy into one table for each sample](#20-combining-contig-level-coverage-and-taxonomy-into-one-table-for-each-sample)
+    - [21. Generating normalized, gene- and contig-level coverage summary tables of KO-annotations and taxonomy across samples](#21-generating-normalized-gene--and-contig-level-coverage-summary-tables-of-ko-annotations-and-taxonomy-across-samples)
+      - [21a. Generating gene-level coverage summary tables](#21a-generating-gene-level-coverage-summary-tables)
+      - [21b. Gene-level taxonomy heatmaps](#21b-gene-level-taxonomy-heatmaps)
+      - [21c. Gene-level taxonomy decontamination](#21c-gene-level-taxonomy-decontamination)
+      - [21d. Gene-level KO functions heatmaps](#21d-gene-level-ko-functions-heatmaps)
+      - [21e. Gene-level KO functions decontamination](#21e-gene-level-ko-functions-decontamination)
+      - [21f. Generating contig-level coverage summary tables](#21f-generating-contig-level-coverage-summary-tables)
+      - [21g. Contig-level Heatmaps](#21g-contig-level-heatmaps)
+      - [21h. Contig-level decontamination](#21h-contig-level-decontamination)
+    - [22. **M**etagenome-**A**ssembled **G**enome (MAG) recovery](#22-metagenome-assembled-genome-mag-recovery)
+      - [22a. Binning contigs](#22a-binning-contigs)
+      - [22b. Bin quality assessment](#22b-bin-quality-assessment)
+      - [22c. Filtering MAGs](#22c-filtering-mags)
+      - [22d. MAG taxonomic classification](#22d-mag-taxonomic-classification)
+      - [22e. Generating overview table of all MAGs](#22e-generating-overview-table-of-all-mags)
+    - [23. Generating MAG-level functional summary overview](#23-generating-mag-level-functional-summary-overview)
+      - [23a. Getting KO annotations per MAG](#23a-getting-ko-annotations-per-mag)
+      - [23b. Summarizing KO annotations with KEGG-Decoder](#23b-summarizing-ko-annotations-with-kegg-decoder)
 
 
 
@@ -142,7 +174,7 @@ Barbara Novak (GeneLab Data Processing Lead)
 ### 1. Basecalling
 
 ```bash
-model="hac"
+model="hac" # high accuracy model
 input_directory=/path/to/pod5/or/fast5/data
 kit_name=SQK-RPB004
 
@@ -198,7 +230,7 @@ dorado demux \
 
 **Input Data:**
 
-- basecalled.bam (basecalled nanopore data in bam format, output from [step 1](#1-basecalling))
+- basecalled.bam (basecalled nanopore data in bam format, output from [Step 1](#1-basecalling))
 
 **Output Data:**
 
@@ -233,7 +265,7 @@ done
 
 **Input Data:**
 
-- /path/to/fastq/output/ (directory containing spilt fastq files)[step 2a](#2a-split-fastq))
+- /path/to/fastq/output/ (directory containing split fastq files from [Step 2a](#2a-demultiplex))
 
 **Output Data:**
 
@@ -266,7 +298,7 @@ NanoPlot --only-report \
 
 **Input Data:**
 
-- /path/to/raw_data/sample.fastq.gz (raw reads, output from [Step 2](#2-demultiplexing))
+- /path/to/raw_data/sample.fastq.gz (raw reads, output from [Step 2b](#2b-concatenate-files-for-each-sample))
 
 **Output Data:**
 
@@ -346,7 +378,7 @@ NanoPlot --only-report \
 
 **Input Data:**
 
-- sample_filtered.fastq (raw reads, output from [Step 2](#2-demultiplexing))
+- sample_filtered.fastq (filtered reads, output from [Step 4a](#4a-filter-raw-data))
 
 **Output Data:**
 
@@ -467,7 +499,11 @@ multiqc --zip-data-dir \
 
 ---
 
-### 6. Assemble Contaminants
+### 6. Contaminant Removal
+
+> A major issue with low biomass data is the high potential for contamination, because so little DNA is extracted from each sample for sequencing. Because negative control/blank samples should, in theory, contain no DNA, any sequence detected in a negative control is a potential contaminant. To filter out contaminants found in negative control samples, which may result from cross-contamination in the lab, we use a read-mapping approach. First, reads from the negative/blank control samples are assembled; then the filtered and trimmed reads from each sample are mapped to the assembled contigs. Reads that map to these contigs are classified as contaminants and excluded from further analyses.
+
+#### 6a. Assemble Contaminants
 
 ```bash
 flye --meta --threads NumberOfThreads \
@@ -494,9 +530,7 @@ flye --meta --threads NumberOfThreads \
 
 ---
 
-### 7. Contaminant Removal
-
-#### 7a. Build Contaminant Index and Map Reads
+#### 6b. Build Contaminant Index and Map Reads
 
 ```bash
 # Build contaminant index
@@ -515,14 +549,14 @@ minimap2 -t NumberOfThreads -a -x splice blanks.mmi /path/to/trimmed_reads/sampl
 
 **Input Data**
 
-- /path/to/contaminant_assembly/assembly.fasta (Contaminant assembly, output from [Step 6](#6-assemble-contaminants))
+- /path/to/contaminant_assembly/assembly.fasta (Contaminant assembly, output from [Step 6a](#6a-assemble-contaminants))
 - /path/to/trimmed_reads/sample_trimmed.fastq (Filtered and trimmed reads, output from [Step 5a](#5a-trim-filtered-data))
 
 **Output Data**
 
 - sample.sam (Reads aligned to contaminant assembly)
 
-#### 7b. Sort and Index Contaminant Alignments
+#### 6c. Sort and Index Contaminant Alignments
 ```bash
 # Sort Sam, convert to bam and create index
 samtools sort --threads NumberOfThreads -o sample_sorted.bam sample.sam > sample_sort.log 2>&1
@@ -543,14 +577,14 @@ samtools index sample_sorted.bam sample_sorted.bam.bai
 
 **Input Data:**
 
-- sample.sam (Reads aligned to contaminant assembly, output from [Step 7a](#7a-build-contaminant-index-and-map-reads))
+- sample.sam (Reads aligned to contaminant assembly, output from [Step 6b](#6b-build-contaminant-index-and-map-reads))
 
 **Output Data:**
 
 - sample_sorted.bam (sorted mapping to contaminant assembly)
 - sample_sorted.bam.bai (index of sorted mapping to contaminant assembly)
 
-#### 7c. Gather Contaminant Mapping Metrics
+#### 6d. Gather Contaminant Mapping Metrics
 
 ```bash
 
@@ -569,8 +603,8 @@ samtools idxstats sample_sorted.bam  > sample_idxstats.txt 2> sample_idxstats.lo
 
 **Input Data:**
 
-- sample_sorted.bam (sorted mapping to contaminant assembly, output from [Step 7b](#7b-sort-and-index-contaminant-alignments))
-- sample_sorted.bam.bai (index of sorted mapping to contaminant assembly, output from [Step 7b](#7b-sort-and-index-contaminant-alignments))
+- sample_sorted.bam (sorted mapping to contaminant assembly, output from [Step 6c](#6c-sort-and-index-contaminant-alignments))
+- sample_sorted.bam.bai (index of sorted mapping to contaminant assembly, output from [Step 6c](#6c-sort-and-index-contaminant-alignments))
 
 **Output Data:**
 
@@ -578,7 +612,7 @@ samtools idxstats sample_sorted.bam  > sample_idxstats.txt 2> sample_idxstats.lo
 - sample_stats.txt (comprehensive alignment statistics)
 - sample_idxstats.txt (contig alignment summary statistics)
 
-#### 7d. Generate Decontaminated Read Files
+#### 6e. Generate Decontaminated Read Files
 ```bash
 # Retain reads that do not map to contaminants
 samtools fastq -t -f 4 sample_sorted.bam | gzip --to-stdout > sample_blank_removed.fastq.gz
@@ -595,13 +629,13 @@ samtools fastq -t -f 4 sample_sorted.bam | gzip --to-stdout > sample_blank_remov
 
 **Input Data:**
 
-- sample_sorted.bam (sorted mapping to contaminant assembly, output from [Step 7b](#7b-sort-and-index-contaminant-alignments))
+- sample_sorted.bam (sorted mapping to contaminant assembly, output from [Step 6c](#6c-sort-and-index-contaminant-alignments))
 
 **Output Data:**
 
 - sample_blank_removed.fastq.gz (blank removed reads in fastq format)
 
-#### 7e. Contaminant Removal QC
+#### 6f. Contaminant Removal QC
 
 ```bash
 NanoPlot --only-report \
@@ -622,7 +656,7 @@ NanoPlot --only-report \
 
 **Input Data:**
 
-- sample_blank_removed.fastq.gz (blank removed reads, output from [Step 7d](#7d-generate-decontaminated-read-files))
+- sample_blank_removed.fastq.gz (blank removed reads, output from [Step 6e](#6e-generate-decontaminated-read-files))
 
 **Output Data:**
 
@@ -631,7 +665,7 @@ NanoPlot --only-report \
 - /path/to/noblank_nanoplot_output/sample_NanoStats.txt (text file containing basic statistics)
 
 
-#### 7f. Compile Contaminant Removal QC
+#### 6g. Compile Contaminant Removal QC
 
 ```bash
 multiqc --zip-data-dir \ 
@@ -650,7 +684,7 @@ multiqc --zip-data-dir \
 
 **Input Data:**
 
-- /path/to/noblank_nanoplot_output/*NanoStats.txt (NanoPlot output data, output from [Step 7d](#7d-generate-decontaminated-read-files))
+- /path/to/noblank_nanoplot_output/*NanoStats.txt (NanoPlot output data, output from [Step 6f](#6f-contaminant-removal-qc))
 
 **Output Data:**
 
@@ -661,11 +695,11 @@ multiqc --zip-data-dir \
 
 ---
 
-### 8. Host Removal
+### 7. Host Removal
 
-#### 8a. Build or download host database
+#### 7a. Build or download host database
 
-##### 8a.i. Download from URL
+##### 7a.i. Download from URL
 
 ```bash
   # Downloading and unpacking database from ${host_url}
@@ -688,7 +722,7 @@ multiqc --zip-data-dir \
 - kraken2_host_db/ - Kraken2 database directory
 
 
-##### 8a.ii. Build from custom reference
+##### 7a.ii. Build from custom reference
 
 ```bash
 # Install taxonomy       
@@ -715,7 +749,7 @@ kraken2-build --build --db kraken2_host_db/
 - kraken2_host_db/ - Kraken2 database directory
 
 
-##### 8a.iii. Build from host name
+##### 7a.iii. Build from host name
 
 ```bash
 # Build kraken reference from host_name
@@ -741,13 +775,13 @@ kraken2-build --clean --db kraken2_host_db/
 
 **Input Data:**
 
-- `host_name` - host database name (one of )
+- `host_name` - host database name (one of those specified in `--download-library` above)
 
 **Output Data:**
 
 - kraken2_host_db/ - Kraken2 database directory
 
-#### 8b. Remove host reads
+#### 7b. Remove host reads
 ```bash
 kraken2 --db kraken2_host_db/ --gzip-compressed --threads NumberOfThreads --use-names \
         --output sample-kraken2-output.txt --report sample-kraken2-report.tsv \
@@ -757,7 +791,7 @@ gzip sample_host_removed.fastq
 
 **Parameter Definitions:**
 
-- `--db` - specifies the directory holding the kraken2 database files created in [Step 8a](#8a-build-or-download-host-database)
+- `--db` - specifies the directory holding the kraken2 database files created in [Step 7a](#7a-build-or-download-host-database)
 - `--gzip-compressed` - specifies the input fastq files are gzip-compressed
 - `--threads` - number of parallel processing threads to use
 - `--use-names` - specifies adding taxa names in addition to taxon IDs
@@ -768,7 +802,7 @@ gzip sample_host_removed.fastq
 
 **Input Data:**
 
-- sample_blank_removed.fastq.gz (gzipped blank removed fastq file, output from [Step 7d](#7d-generate-decontaminated-read-files))
+- sample_blank_removed.fastq.gz (gzipped blank removed fastq file, output from [Step 6e](#6e-generate-decontaminated-read-files))
 
 **Output Data:**
 
@@ -780,13 +814,12 @@ gzip sample_host_removed.fastq
 
 ---
 
-## Read-based Processing
 
-### 9. R Environment Setup
+### 8. R Environment Setup
 
-> Taxonomy bar plot, heatmaps and feature decontamination with decontam are performed in R.
+> Taxonomy bar plots, heatmaps and feature decontamination with decontam are performed in R.
 
-#### 9a. Load libraries
+#### 8a. Load libraries
 
 ```R
 library(decontam)
@@ -797,7 +830,7 @@ library(pheatmap)
 library(pavian)
 ```
 
-#### 9b. Define Custom Functions
+#### 8b. Define Custom Functions
 
 ##### get_last_assignment()
 <details>
@@ -805,7 +838,7 @@ library(pavian)
 
   ```R
   get_last_assignment <- function(taxonomy_string, split_by=';', remove_prefix=NULL) {
-    # A function to get the last taxonomy assignment from a taxonomy string 
+
     split_names <- strsplit(x =  taxonomy_string , split = split_by) %>% 
       unlist()
     
@@ -950,7 +983,7 @@ library(pavian)
 
     abund_table <- species_table %>% 
                         as.data.frame %>% 
-                        mutate( across(everything(), function(x) (x/sum(x, na.rm = TRUE))*100 ) )  %>% # calculation species relative abundance per sample
+                        mutate( across(everything(), function(x) (x/sum(x, na.rm = TRUE))*100 ) )  %>% # calculate species relative abundance per sample
         select(
                 where( ~all(!is.na(.)) )
               )  %>% # drop columns where none of the reads were classified or were non-microbial
@@ -1092,12 +1125,12 @@ library(pavian)
     if (!is.null(freq_col) && !is.null(prev_col)) {   
 
       # Run decontam in both prevalence and frequency modes
-      contamdf <- isContaminant(ps, neg=prev_col, conc=freq_col, threshold=contam_threshold) # threshold
+      contamdf <- isContaminant(ps, neg=prev_col, conc=freq_col, threshold=contam_threshold) 
 
     } else if(!is.null(freq_col)) {
       
       # Run decontam in frequency mode
-      contamdf <- isContaminant(ps, conc=freq_col, threshold=contam_threshold) # threshold
+      contamdf <- isContaminant(ps, conc=freq_col, threshold=contam_threshold) 
 
     } else if(!is.null(prev_col)){
 
@@ -1107,7 +1140,7 @@ library(pavian)
     } else {
 
       cat("Both freq_col and prev_col cannot be set tdo NULL\n")
-      cat("please supply either one or both column names your metadata")
+      cat("please supply either one or both column names in your metadata ")
       cat("for frequency and prevalence based analysis, respectively\n")
       stop()
 
@@ -1128,7 +1161,191 @@ library(pavian)
   **Returns:** a dataframe of detailed decontam results
 </details>
 
-#### 9c. Set global variables
+
+##### process_taxonomy()
+<details>
+  <summary>process a taxonomy assignment table</summary>
+
+```R
+process_taxonomy <- function(taxonomy, prefix='\\w__') { 
+  
+  taxonomy <- apply(X = taxonomy, MARGIN = 2, FUN = as.character) 
+
+  # replace NAs and empty cells with "Other" and delete the rank prefix (e.g. "d__") from the taxonomy names
+  for (rank in colnames(taxonomy)) {
+    #delete the taxonomy prefix
+    taxonomy[,rank] <- gsub(pattern = prefix, x = taxonomy[, rank],
+                            replacement = '')
+    indices <- which(is.na(taxonomy[,rank]))
+    taxonomy[indices, rank] <- rep(x = "Other", times=length(indices)) 
+    #replace empty cell
+    indices <- which(taxonomy[,rank] == "")
+    taxonomy[indices,rank] <- rep(x = "Other", times=length(indices))
+  }
+  taxonomy <- apply(X = taxonomy,MARGIN = 2,
+                    FUN =  gsub,pattern = "_",replacement = " ") %>% 
+    as.data.frame(stringsAsFactors = FALSE)
+  return(taxonomy)
+}
+
+```
+**Function Parameter Definitions:**
+
+- `taxonomy` - a dataframe of taxonomy assignments with taxonomy ranks as column names
+- `prefix`  - is a regular expression specifying the characters to remove
+              from taxon names
+
+**Returns:** a dataframe of reformatted taxonomy names
+
+</details>
+
+
+##### format_taxonomy_table()
+<details>
+  <summary>format a taxonomy assignment table by appending a suffix to a known name</summary>
+
+```R
+format_taxonomy_table <- function(taxonomy,stringToReplace="Other",
+                                  suffix=";Other") {
+  
+  for (taxa_index in seq_along(taxonomy)) {
+    
+    indices <- grep(x = taxonomy[,taxa_index], pattern = stringToReplace)
+    
+    taxonomy[indices,taxa_index] <- 
+      paste0(taxonomy[indices,taxa_index-1],
+             rep(x = suffix, times=length(indices)))
+    
+  }
+  return(taxonomy)
+}
+
+```
+**Function Parameter Definitions:**
+- `taxonomy` -  taxonomy dataframe with taxonomy ranks as column names
+- `stringToReplace` - a regex string matching the taxon names to replace
+- `suffix` - string appended to the name from the preceding (higher) rank to form the replacement value
+
+**Returns:** a dataframe of reformatted taxonomy names
+
+</details>
+
+
+##### fix_names()
+<details>
+  <summary>clean taxonomy names</summary>
+
+```R
+fix_names <- function(taxonomy, stringToReplace, suffix) {
+  
+  for(index in seq_along(stringToReplace)){
+    taxonomy <- format_taxonomy_table(taxonomy = taxonomy,
+                                      stringToReplace=stringToReplace[index], 
+                                      suffix=suffix[index])
+  }
+  return(taxonomy)
+}
+
+```
+**Function Parameter Definitions:**
+- `taxonomy` -  taxonomy dataframe with taxonomy ranks as column names
+- `stringToReplace` - a vector of regex strings specifying what to replace
+- `suffix` - a vector of suffix strings, one per entry in `stringToReplace`, used to build the replacement values
+
+**Returns:** a dataframe of cleaned taxonomy names
+
+</details>
+
+
+##### read_input_table()
+<details>
+  <summary>read an input table into a dataframe</summary>
+
+```R
+read_input_table <- function(file_name){
+  
+   # Read a tab-delimited file, skipping comment lines that start with "#"
+   df <- read_delim(file = file_name, delim = "\t", comment = "#")
+   return(df)
+   
+}
+```
+**Function Parameter Definitions:**
+
+- `file_name` - path to file to be read
+
+**Returns:** a dataframe from input file
+
+</details>
+
+
+
+##### read_contig_table()
+<details>
+  <summary>Read Assembly-based contig annotation table</summary>
+
+  ```R
+read_contig_table <- function(file_name, sample_names){
+  
+  df <- read_input_table(file_name)
+
+  taxonomy_table <- df %>%
+    select(domain:species) %>%
+    mutate(domain=replace_na(domain, "Unclassified"))
+  
+  counts_table <- df %>% select(!!sample_names)
+
+  taxonomy_table  <- process_taxonomy(taxonomy_table)
+  taxonomy_table <- fix_names(taxonomy_table, "Other", ";_")
+
+  df <- bind_cols(taxonomy_table, counts_table)
+  
+  return(df)
+}
+
+```
+
+**Function Parameter Definitions:**
+
+- `file_name` - path to file to be read
+- `sample_names` - character vector of sample names to keep in the final dataframe
+
+**Returns:** a dataframe with cleaned taxonomy names
+
+</details>
+
+
+
+##### get_sample_names()
+<details>
+  <summary>retrieve the names of samples for which assemblies were generated</summary>
+
+```R
+get_sample_names <- function (assembly_summary) {
+  # assembly_summary - path to assembly summary file
+
+  overview_table <-  read_input_table(assembly_summary) %>%
+                       select(
+                         where( ~all(!is.na(.)) )
+                         ) # Keep only columns that contain no NA values
+
+  col_names <- names(overview_table) %>% str_remove_all("-assembly")
+  sample_order <- col_names[-1] %>% sort()
+
+  return(sample_order)
+
+}
+```
+**Function Parameter Definitions:**
+
+- `assembly_summary` - path to assembly summary file
+
+**Returns:** a character vector of sorted sample names
+
+</details>
+
+
+#### 8c. Set global variables
 
 ```R
 # Define custom theme for plotting
@@ -1149,13 +1366,15 @@ custom_palette <- c("#A6CEE3","#1F78B4","#B2DF8A","#33A02C","#FB9A99","#E31A1C",
                     "#666666","#1B9E77", "#E6AB02","#A6761D","#FFFF00FF","#FFFF99","#00FFFFFF",
                     "#B2182B","#FDDBC7","#D1E5F0","#CC0033","#FF00CC","#330033",
                     "#999933","#FF9933","#FFFAFAFF",colors()) 
-# remove white colors
+# Drop white colors
 custom_palette <- custom_palette[-c(21:23,
                                     grep(pattern = "white|snow|azure|gray|#FFFAFAFF|aliceblue",
                                          x = custom_palette, 
                                          ignore.case = TRUE)
                                    )
                                 ]
+# Heatmap color gradient - here from white to red
+colours <- colorRampPalette(c('white','red'))(255)
 ```
 
 **Input Data:** 
@@ -1171,9 +1390,11 @@ custom_palette <- custom_palette[-c(21:23,
 
 ---
 
-### 10. Taxonomic profiling using kaiju
+## Read-based Processing
+
+### 9. Taxonomic profiling using kaiju
 
-#### 10a. Build kaiju database
+#### 9a. Build kaiju database
 
 ```bash
 # Make directory that will hold all the download kaiju database
@@ -1200,7 +1421,7 @@ rm nr_euk/kaiju_db_nr_euk.bwt nr_euk/kaiju_db_nr_euk.sa
 - kaiju-db/names.dmp (names file)
 
 
-#### 10b. Kaiju Taxonomic Classification
+#### 9b. Kaiju Taxonomic Classification
 
 ```bash
 kaiju -f kaiju-db/nr_euk/kaiju_db_nr_euk.fmi -t kaiju-db/nodes.dmp \
@@ -1223,13 +1444,13 @@ kaiju -f kaiju-db/nr_euk/kaiju_db_nr_euk.fmi -t kaiju-db/nodes.dmp \
 
 - kaiju-db/nr_euk/kaiju_db_nr_euk.fmi (fmi file, output from [Step 9a](#9a-build-kaiju-database))
 - kaiju-db/nodes.dmp (nodes file, output from [Step 9a](#9a-build-kaiju-database))
-- sample_host_removed.fastq.gz (gzipped decontaminated reads fastq file, output from [Step 8](#8-host-removal))
+- sample_host_removed.fastq.gz (gzipped decontaminated reads fastq file, output from [Step 7b](#7b-remove-host-reads))
 
 **Output Data:**
 
 - sample_kaiju.out (kaiju output file)
 
-#### 10c. Compile kaiju taxonomy results
+#### 9c. Compile kaiju taxonomy results
 
 ```bash
 # Merge kaiju reports to one table at the species level
@@ -1259,7 +1480,7 @@ sed -i -E 's/file/sample/' merged_kaiju_table.tsv
 
 - **merged_kaiju_table.tsv** (Compiled kaiju table at the species taxon level)
 
-#### 10d. Convert kaiju output to krona format
+#### 9d. Convert kaiju output to krona format
 
 ```bash
 kaiju2krona -u -n kaiju-db/names.dmp -t kaiju-db/nodes.dmp \
@@ -1277,13 +1498,13 @@ kaiju2krona -u -n kaiju-db/names.dmp -t kaiju-db/nodes.dmp \
 **Input Data:**
 - kaiju-db/nodes.dmp (nodes file, output from [Step 9a](#9a-build-kaiju-database))
 - kaiju-db/names.dmp (names file, output from [Step 9a](#9a-build-kaiju-database))
-- sample_kaiju.out (kaiju output file, output from [Step 9ai](#9ai-read-taxonomic-classification-using-kaiju))
+- sample_kaiju.out (kaiju output file, output from [Step 9b](#9b-kaiju-taxonomic-classification))
 
 **Output Data:**
 
 - sample.krona (krona formatted kaiju output)
 
-#### 10e. Compile kaiju krona report
+#### 9e. Compile kaiju krona report
 
 ```bash
 # Find, list and write all .krona files to file 
@@ -1327,15 +1548,15 @@ ktImportText  -o kaiju-report.html ${KTEXT_FILES[*]}
                      sample_1.krona,sample_1 sample_2.krona,sample_2 .. sample_n.krona,sample_n.
 
 **Input Data:**
-*.krona (all sample .krona formatted files, output from [Step 9e](#9e-convert-kaiju-output-to-krona-format)) 
+*.krona (all sample .krona formatted files, output from [Step 9d](#9d-convert-kaiju-output-to-krona-format)) 
 
                       
 **Output Data:**
 
-- kaiju-report.html (compiled krona html report output)
+- **kaiju-report.html** (compiled krona html report output)
 
 
-#### 10f. Create kaiju species count table
+#### 9f. Create kaiju species count table
 
 ```R
 library(tidyverse)
@@ -1351,14 +1572,14 @@ write_csv(x = feature_table, file = "kaiju_species_table.csv")
 
 **Input Data:**
 
-- merged_kaiju_table.tsv (Compiled kaiju table at the species taxon level, from [Step 9c](#9c-compile-kaiju-taxonomy-results))
+- merged_kaiju_table.tsv (Compiled kaiju table at the species taxon level, from [Step 9c](#9c-compile-kaiju-taxonomy-results))
 
 **Output Data:**
 
 - kaiju_species_table.csv (kaiju species count table in csv format)
 
 
-#### 10g. Read-in tables
+#### 9g. Read-in tables
 
 ```R
 library(tidyverse)
@@ -1381,7 +1602,7 @@ species_table <- read_csv(file="kaiju_species_table.csv") %>%  as.data.frame()
 **Input Data:**
 
 - metadata_file  (path to sample-wise metadata file)
-- kaiju_species_table.csv (path to kaiju species taable from [step 10f](#10f-create-kaiju-species-count-table))
+- kaiju_species_table.csv (path to kaiju species table from [Step 9f](#9f-create-kaiju-species-count-table))
 
 **Output Data:**
 
@@ -1389,12 +1610,14 @@ species_table <- read_csv(file="kaiju_species_table.csv") %>%  as.data.frame()
 - `species_table` - a dataframe of species count per sample
 ---
 
-#### 10h. Taxonomy barplots
+#### 9h. Taxonomy barplots
 
 ```R
 library(tidyverse)
 
-filter_threshold=0.5
+# Threshold to filter out potential false positive
+# taxonomy assignments
+filter_threshold <- 0.5
 # Filter out Rare and non-microbial assignment
 # You can add as many species that you'd like to filter out
 # using the following syntax "|species_name1|species_name2"
@@ -1438,8 +1661,8 @@ ggsave(filename = "filtered-kaiju_species_plot.png", plot = p,
 
 **Input Data:**
 
-- `species_table` (a dataframe of species count per sample, output from [Step 10g](#10g-read-in-tables))
-- `metadata` - (a dataframe of sample-wise metadata, output from [Step 10g](#10g-read-in-tables))
+- `species_table` (a dataframe of species count per sample, output from [Step 9g](#9g-read-in-tables))
+- `metadata` - (a dataframe of sample-wise metadata, output from [Step 9g](#9g-read-in-tables))
 
 **Output Data:**
 
@@ -1448,16 +1671,19 @@ ggsave(filename = "filtered-kaiju_species_plot.png", plot = p,
 - **filtered-kaiju_species_plot.png** (barplot after filtering rare and non-microbial taxa)
 
 
-#### 10i. Feature decontamination
+#### 9i. Feature decontamination
 
-Feature decontamination with decontam. Decontam is an R package that statistically identifies contaminating features in a feature table
+> Feature (species) decontamination with decontam. Decontam is an R package that statistically identifies contaminating features in a feature table.
 
 ```R
 library(tidyverse)
 library(decontam)
 feature_table <- read_csv("filtered-kaiju_species_table.csv")
+# Set to 0.5 for a more aggressive approach where species more prevalent
+# in the negative controls are considered contaminants
 contam_threshold <- 0.1
-# Control samples in this column should always be written as "Control_Sample" and true samples as "True_Sample"
+# Control samples in this column should always be written as 
+# "Control_Sample" and true samples as "True_Sample"
 prev_col <- "Sample_or_Control"
 freq_col <- "input_conc_ng"
 plot_width <- 18
@@ -1499,8 +1725,8 @@ ggsave(filename = "decontaminated-kaiju-species_plot.png", plot = p,
 
 **Input Data:**
 
-- `filtered-kaiju_species_table.csv`(a dataframe of species count per sample, output from [Step 10h](#10h-taxonomy-barplots))
-- `metadata`(a dataframe of sample-wise metadata, output from [Step 10g](#10g-read-in-tables))
+- `filtered-kaiju_species_table.csv` (path to filtered species count per sample, output from [Step 9h](#9h-taxonomy-barplots))
+- `metadata` (a dataframe of sample-wise metadata, output from [Step 9g](#9g-read-in-tables))
 
 **Output Data:**
 
@@ -1512,9 +1738,9 @@ ggsave(filename = "decontaminated-kaiju-species_plot.png", plot = p,
 
 ---
 
-### 11. Taxonomic Profiling using Kraken2
+### 10. Taxonomic Profiling using Kraken2
 
-#### 11a. Download kraken2 database
+#### 10a. Download kraken2 database
 
 ```bash 
 ## Download all microbial (including eukaryotes) - https://benlangmead.github.io/aws-indexes/k2
@@ -1563,7 +1789,7 @@ tar -xvzf k2_pluspfp.tar.gz
 
 - kraken2-db/  (a directory containing kraken 2 database files)
 
-#### 11b. Taxonomic Classification
+#### 10b. Taxonomic Classification
 
 ```bash
 kraken2 --db kraken2-db/ --gzip-compressed --threads NumberOfThreads --use-names \
@@ -1584,14 +1810,14 @@ kraken2 --db kraken2-db/ --gzip-compressed --threads NumberOfThreads --use-names
 **Input Data:**
 
 - kraken2-db/ (a direcory containing kraken 2 database files, output from [Step 10a](#10a-download-kraken2-database))
-- sample_host_removed.fastq.gz (gzipped reads fastq file, output from [Step 7d](#7d-generate-decontaminated-read-files))
+- sample_host_removed.fastq.gz (gzipped reads fastq file, output from [Step 7b](#7b-remove-host-reads))
 
 **Output Data:**
 
 - sample-kraken2-output.txt (kraken2 read-based output file (one line per read))
 - sample-kraken2-report.tsv (kraken2 report output file (one line per taxa, with number of reads assigned to it))
 
-#### 11c. Convert Kraken2 output to Krona format
+#### 10c. Convert Kraken2 output to Krona format
 
 ```bash
 kreport2krona.py --report-file sample-kraken2-report.tsv  --output sample.krona
@@ -1604,14 +1830,14 @@ kreport2krona.py --report-file sample-kraken2-report.tsv  --output sample.krona
 
 **Input Data:**
 
-- sample-kraken2-report.tsv (kraken report, output from [Step 10b](#10b-taxonomic-classification)
+- sample-kraken2-report.tsv (kraken report, output from [Step 10b](#10b-taxonomic-classification))
 
 **Output Data:**
 
 - sample.krona (krona formatted kraken2 output)
 
 
-#### 11d. Compile kraken2 krona report
+#### 10d. Compile kraken2 krona report
 
 ```bash
 # Find, list and write all .krona files to file 
@@ -1656,14 +1882,14 @@ ktImportText  -o kraken-report.html ${KTEXT_FILES[*]}
 
 **Input Data:**
 
-- *.krona (all sample .krona formatted files, output from [Step 10d](#10d-convert-kraken2-output-to-krona-format)) 
+- *.krona (all sample .krona formatted files, output from [Step 10c](#10c-convert-kraken2-output-to-krona-format)) 
 
                       
 **Output Data:**
 
-- kraken-report.html (compiled krona html report output)
+- **kraken-report.html** (compiled krona html report output)
 
-#### 11e. Create kraken species count table
+#### 10e. Create kraken species count table
 
 ```R
 library(tidyverse)
@@ -1689,7 +1915,7 @@ write_csv(x = species_table,
 
 - **kraken_species_table.csv** (kraken species count table in csv format)
 
-#### 11f. Read-in tables
+#### 10f. Read-in tables
 
 ```R
 library(tidyverse)
@@ -1723,12 +1949,14 @@ species_table <- species_table[,-match("species", colnames(species_table))]
 - `species_table` - a dataframe
 
 
-#### 11g. Taxonomy barplots
+#### 10g. Taxonomy barplots
 
 ```R
 library(tidyverse)
 
-filter_threshold=0.5
+# Threshold to filter out potential false positive
+# taxonomy assignments
+filter_threshold <- 0.5
 # Filter out Rare and non-microbial assignment
 # You can add as many species that you'd like to filter out
 # using the following syntax "|species_name1|species_name2"
@@ -1772,8 +2000,8 @@ ggsave(filename = "filtered-kraken_species_plot.png", plot = p,
 
 **Input Data:**
 
-- `species_table` (a dataframe of species count per sample, output from [Step 10g](#10g-read-in-tables))
-- `metadata` - (a dataframe of sample-wise metadata, output from [Step 10g](#10g-read-in-tables))
+- `species_table` (a dataframe of species count per sample, output from [Step 10f](#10f-read-in-tables))
+- `metadata` - (a dataframe of sample-wise metadata, output from [Step 10f](#10f-read-in-tables))
 
 **Output Data:**
 
@@ -1781,19 +2009,21 @@ ggsave(filename = "filtered-kraken_species_plot.png", plot = p,
 - **filtered-kraken_species_table.csv** (filtered relative abundance table)
 - **filtered-kraken_species_plot.png** (barplot after filtering rare and non-microbial taxa)
 
----
 
-#### 11h. Feature decontamination
+#### 10h. Feature decontamination
 
-Feature decontamination with decontam. Decontam is an R package that statistically identifies contaminating features in a feature table
+Feature decontamination with decontam. Decontam is an R package that statistically identifies contaminating features in a feature table.
 
 ```R
 library(tidyverse)
 library(decontam)
 
 feature_table <- read_csv("filtered-kraken_species_table.csv")
+# Set to 0.5 for a more aggressive approach where species more prevalent
+# in the negative controls are considered contaminants
 contam_threshold <- 0.1
-# Control samples in this column should always be written as "Control_Sample" and true samples as "True_Sample"
+# Control samples in this column should always be written as
+# "Control_Sample" and true samples as "True_Sample"
 prev_col <- "Sample_or_Control"
 freq_col <- "input_conc_ng"
 plot_width <- 18
@@ -1835,8 +2065,8 @@ ggsave(filename = "decontaminated-kraken-species_plot.png", plot = p,
 
 **Input Data:**
 
-- `filtered-kraken_species_table.csv`(a dataframe of species count per sample, output from [Step 11g](#11g-taxonomy-barplots))
-- `metadata`(a dataframe of sample-wise metadata, output from step[Step 11f](#11f-read-in-tables))
+- `filtered-kraken_species_table.csv` (path to filtered species count per sample, output from [Step 10g](#10g-taxonomy-barplots))
+- `metadata` (a dataframe of sample-wise metadata, output from [Step 10f](#10f-read-in-tables))
 
 **Output Data:**
 
@@ -1850,7 +2080,7 @@ ggsave(filename = "decontaminated-kraken-species_plot.png", plot = p,
 
 ## Assembly-based processing
 
-### 12. Sample assembly
+### 11. Sample assembly
 
 ```bash
 flye --meta --threads NumberOfThreads --out-dir sample/ \
@@ -1870,7 +2100,7 @@ mv sample/flye.log sample_flye.log
 
 **Input Data**
 
-- sample_host_removed.fastq.gz (decontaminated raw data in fastq format, output from [Step 8](#8-host-removal))
+- sample_host_removed.fastq.gz (decontaminated raw data in fastq format, output from [Step 7b](#7b-remove-host-reads))
 
 **Output Data**
 
@@ -1881,7 +2111,7 @@ mv sample/flye.log sample_flye.log
 
 ---
 
-### 13. Polish assembly
+### 12. Polish assembly
 
 ```bash
 medaka_consensus -t NumberOfThreads -i /path/to/decontaminated_raw_data/sample_host_removed.fastq.gz \
@@ -1899,7 +2129,7 @@ mv sample/consensus.fasta sample_polished.fasta
 
 **Input Data:**
 
-- /path/to/decontaminated_raw_data/sample_host_removed.fastq.gz (decontaminated raw data in fastq format, output from [Step 8](#8-host-removal))
+- /path/to/decontaminated_raw_data/sample_host_removed.fastq.gz (decontaminated raw data in fastq format, output from [Step 7b](#7b-remove-host-reads))
 - /path/to/assemblies/sample_assembly.fasta (sample assembly, output from [Step 11](#11-sample-assembly))
 
 **Output Data:**
@@ -1908,31 +2138,31 @@ mv sample/consensus.fasta sample_polished.fasta
 
 ---
 
-### 14. Renaming contigs and summarizing assemblies
+### 13. Renaming contigs and summarizing assemblies
 
-#### 14a. Renaming contig headers
+#### 13a. Renaming contig headers
 
 ```bash
-bit-rename-fasta-headers -i sample-1_polished.fasta -w c_sample-1 -o sample-1_assembly.fasta
+bit-rename-fasta-headers -i sample_polished.fasta -w c_sample -o sample-assembly.fasta
 ```
 
 **Parameter Definitions:**  
 
 - `-i` – input fasta file
-- `-w` – wanted header prefix (a number will be appended for each contig), starts with a “c_” to ensure they won’t start with a number which can be problematic
+- `-w` – wanted header prefix (a number will be appended for each contig); it starts with "c_" so that headers never begin with a number, which can be problematic
 - `-o` – output fasta file
 
 
 **Input Data:**
 
-- sample-1_polished.fasta (polished assembly file from [step 12](#12-polish-assembly))
+- sample_polished.fasta (polished assembly file from [Step 12](#12-polish-assembly))
 
 **Output files:**
 
-- **sample-1-assembly.fasta** (contig-renamed assembly file)
+- **sample-assembly.fasta** (contig-renamed assembly file)
 
 
-#### 14b. Summarizing assemblies
+#### 13b. Summarizing assemblies
 
 ```bash
 bit-summarize-assembly -o assembly-summaries_GLmetagenomics.tsv *-assembly.fasta
@@ -1945,7 +2175,7 @@ bit-summarize-assembly -o assembly-summaries_GLmetagenomics.tsv *-assembly.fasta
 
 **Input Data:**
 
-- *-assembly.fasta (contig-renamed assembly files from [step 13a](#13a-renaming-contig-headers))
+- *-assembly.fasta (contig-renamed assembly files from [Step 13a](#13a-renaming-contig-headers))
 
 **Output files:**
 
@@ -1955,10 +2185,10 @@ bit-summarize-assembly -o assembly-summaries_GLmetagenomics.tsv *-assembly.fasta
 
 ---
 
-### 15. Gene prediction
+### 14. Gene prediction
 ```bash
-prodigal -a sample-1-genes.faa -d sample-1-genes.fasta -f gff -p meta -c -q \
-         -o sample-1-genes.gff -i sample-1-assembly.fasta
+prodigal -a sample-genes.faa -d sample-genes.fasta -f gff -p meta -c -q \
+         -o sample-genes.gff -i sample-assembly.fasta
 ```
 
 **Parameter Definitions:**
@@ -1974,47 +2204,47 @@ prodigal -a sample-1-genes.faa -d sample-1-genes.fasta -f gff -p meta -c -q \
 
 **Input Data:**
 
-- sample-1-assembly.fasta (contig-renamed assembly file from [step 5a](#5a-renaming-contig-headers))
+- sample-assembly.fasta (contig-renamed assembly file from [Step 13a](#13a-renaming-contig-headers))
 
 **Output Data:**
 
-- sample-1-genes.faa (gene-calls amino-acid fasta file)
-- sample-1-genes.fasta (gene-calls nucleotide fasta file)
-- **sample-1-genes.gff** (gene-calls in general feature format)
+- sample-genes.faa (gene-calls amino-acid fasta file)
+- sample-genes.fasta (gene-calls nucleotide fasta file)
+- **sample-genes.gff** (gene-calls in general feature format)
 
 <br>
 
-#### 15a. Remove line wraps in gene prediction output
+#### 14a. Remove line wraps in gene prediction output
 ```bash
-bit-remove-wraps sample-1-genes.faa > sample-1-genes.faa.tmp 2> /dev/null
-mv sample-1-genes.faa.tmp sample-1-genes.faa
+bit-remove-wraps sample-genes.faa > sample-genes.faa.tmp 2> /dev/null
+mv sample-genes.faa.tmp sample-genes.faa
 
-bit-remove-wraps sample-1-genes.fasta > sample-1-genes.fasta.tmp 2> /dev/null
-mv sample-1-genes.fasta.tmp sample-1-genes.fasta
+bit-remove-wraps sample-genes.fasta > sample-genes.fasta.tmp 2> /dev/null
+mv sample-genes.fasta.tmp sample-genes.fasta
 ```
 
 **Input Data:**
 
-- sample-1-genes.faa (gene-calls amino-acid fasta file, output from [Step 14](#14-gene-prediction))
-- sample-1-genes.fasta (gene-calls nucleotide fasta file, output from [Step 14](#14-gene-prediction))
+- sample-genes.faa (gene-calls amino-acid fasta file, output from [Step 14](#14-gene-prediction))
+- sample-genes.fasta (gene-calls nucleotide fasta file, output from [Step 14](#14-gene-prediction))
 
 **Output Data:**
 
-- **sample-1-genes.faa** (gene-calls amino-acid fasta file with line wraps removed)
-- **sample-1-genes.fasta** (gene-calls nucleotide fasta file with line wraps removed)
+- **sample-genes.faa** (gene-calls amino-acid fasta file with line wraps removed)
+- **sample-genes.fasta** (gene-calls nucleotide fasta file with line wraps removed)
 
 <br>
 
 ---
 
-### 16. Functional annotation
+### 15. Functional annotation
 > **Note:**  
 > The annotation process overwrites the same temporary directory by default. When running multiple 
 processses at a time, it is necessary to specify a specific temporary directory with the 
 `--tmp-dir` argument as shown below.
 
 
-#### 16a. Downloading reference database of HMM models (only needs to be done once)
+#### 15a. Downloading reference database of HMM models (only needs to be done once)
 
 ```bash
 curl -LO ftp://ftp.genome.jp/pub/db/kofam/profiles.tar.gz
@@ -2023,11 +2253,11 @@ tar -xzvf profiles.tar.gz
 gunzip ko_list.gz 
 ```
 
-#### 16b. Running KEGG annotation
+#### 15b. Running KEGG annotation
 
 ```bash
-exec_annotation -p profiles/ -k ko_list --cpu NumberOfThreads -f detail-tsv -o sample-1-KO-tab.tmp \
-                --tmp-dir sample-1-tmp-KO --report-unannotated sample-1-genes.faa 
+exec_annotation -p profiles/ -k ko_list --cpu NumberOfThreads -f detail-tsv -o sample-KO-tab.tmp \
+                --tmp-dir sample-tmp-KO --report-unannotated sample-genes.faa 
 ```
 
 **Parameter Definitions:**
@@ -2039,27 +2269,27 @@ exec_annotation -p profiles/ -k ko_list --cpu NumberOfThreads -f detail-tsv -o s
 - `-o` – specifies the output file name
 - `--tmp-dir` – specifies the temporary directory to write to (needed if running more than one process concurrently, see Notes above)
 - `--report-unannotated` – specifies to generate an output for each entry
-- `sample-1-genes.faa` – the input file is specified as a positional argument 
+- `sample-genes.faa` – the input file is specified as a positional argument 
 
 
 **Input Data:**
 
-- sample-1-genes.faa (amino-acid fasta file, from [step 6](#6-gene-prediction))
+- sample-genes.faa (amino-acid fasta file, from [Step 14](#14-gene-prediction))
 - profiles/ (reference directory holding the KO HMMs)
 - ko_list (reference list of KOs to scan for)
 
 **Output Data:**
 
-- sample-1-KO-tab.tmp (table of KO annotations assigned to gene IDs)
+- sample-KO-tab.tmp (table of KO annotations assigned to gene IDs)
 
 
-#### 16c. Filtering output to retain only those passing the KO-specific score and top hits
+#### 15c. Filtering output to retain only those passing the KO-specific score and top hits
 
 ```bash
-bit-filter-KOFamScan-results -i sample-1-KO-tab.tmp -o sample-1-annotations.tsv
+bit-filter-KOFamScan-results -i sample-KO-tab.tmp -o sample-annotations.tsv
 
 # removing temporary files
-rm -rf sample-1-tmp-KO/ sample-1-KO-annots.tmp
+rm -rf sample-tmp-KO/ sample-KO-tab.tmp
 ```
 
 **Parameter Definitions:**  
@@ -2069,31 +2299,31 @@ rm -rf sample-1-tmp-KO/ sample-1-KO-annots.tmp
 
 **Input Data:**
 
-- sample-1-KO-tab.tmp (table of KO annotations assigned to gene IDs from [step 7b](#7b-running-kegg-annotation))
+- sample-KO-tab.tmp (table of KO annotations assigned to gene IDs from [Step 15b](#15b-running-kegg-annotation))
 
 **Output Data:**
 
-- sample-1-annotations.tsv (table of KO annotations assigned to gene IDs)
+- sample-annotations.tsv (table of KO annotations assigned to gene IDs)
 
 <br>
 
 ---
 
-### 17. Taxonomic classification
+### 16. Taxonomic classification
 
-#### 17a. Pulling and un-packing pre-built reference db (only needs to be done once)
+#### 16a. Pulling and un-packing pre-built reference db (only needs to be done once)
 
 ```bash
 wget tbb.bio.uu.nl/bastiaan/CAT_prepare/CAT_prepare_20200618.tar.gz
 tar -xvzf CAT_prepare_20200618.tar.gz
 ```
 
-#### 17b. Running taxonomic classification
+#### 16b. Running taxonomic classification
 
 ```bash
-CAT contigs -c sample-1-assembly.fasta -d CAT_prepare_20200618/2020-06-18_database/ \
-            -t CAT_prepare_20200618/2020-06-18_taxonomy/ -p sample-1-genes.faa \
-            -o sample-1-tax-out.tmp -n NumberOfThreads -r 3 --top 4 --I_know_what_Im_doing --no-stars
+CAT contigs -c sample-assembly.fasta -d CAT_prepare_20200618/2020-06-18_database/ \
+            -t CAT_prepare_20200618/2020-06-18_taxonomy/ -p sample-genes.faa \
+            -o sample-tax-out.tmp -n NumberOfThreads -r 3 --top 4 --I_know_what_Im_doing --no-stars
 ```
 
 **Parameter Definitions:**  
@@ -2111,18 +2341,18 @@ CAT contigs -c sample-1-assembly.fasta -d CAT_prepare_20200618/2020-06-18_databa
 
 **Input Data:**
 
-- sample-1-assembly.fasta (assembly file from [step 5a](#5a-renaming-contig-headers))
-- sample-1-genes.faa (gene-calls amino-acid fasta file from [step 6](#6-gene-prediction))
+- sample-assembly.fasta (assembly file from [Step 13a](#13a-renaming-contig-headers))
+- sample-genes.faa (gene-calls amino-acid fasta file from [Step 14](#14-gene-prediction))
 
 **Output Data:**
 
-- sample-1-tax-out.tmp.ORF2LCA.txt (gene-calls taxonomy file)
-- sample-1-tax-out.tmp.contig2classification.txt (contig taxonomy file)
+- sample-tax-out.tmp.ORF2LCA.txt (gene-calls taxonomy file)
+- sample-tax-out.tmp.contig2classification.txt (contig taxonomy file)
 
-#### 17c. Adding taxonomy info from taxids to genes
+#### 16c. Adding taxonomy info from taxids to genes
 
 ```bash
-CAT add_names -i sample-1-tax-out.tmp.ORF2LCA.txt -o sample-1-gene-tax-out.tmp \
+CAT add_names -i sample-tax-out.tmp.ORF2LCA.txt -o sample-gene-tax-out.tmp \
               -t CAT_prepare_20200618/2020-06-18_taxonomy/ --only_official --exclude-scores
 ```
 
@@ -2136,16 +2366,16 @@ CAT add_names -i sample-1-tax-out.tmp.ORF2LCA.txt -o sample-1-gene-tax-out.tmp \
 
 **Input Data:**
 
-- sample-1-tax-out.tmp.ORF2LCA.txt (gene-calls taxonomy file from [step 8b](#8b-running-taxonomic-classification))
+- sample-tax-out.tmp.ORF2LCA.txt (gene-calls taxonomy file from [Step 16b](#16b-running-taxonomic-classification))
 
 **Output Data:**
 
-- sample-1-gene-tax-out.tmp (gene-calls taxonomy file with lineage info added)
+- sample-gene-tax-out.tmp (gene-calls taxonomy file with lineage info added)
 
-#### 17d. Adding taxonomy info from taxids to contigs
+#### 16d. Adding taxonomy info from taxids to contigs
 
 ```bash
-CAT add_names -i sample-1-tax-out.tmp.contig2classification.txt -o sample-1-contig-tax-out.tmp \
+CAT add_names -i sample-tax-out.tmp.contig2classification.txt -o sample-contig-tax-out.tmp \
               -t CAT-ref/2020-06-18_taxonomy/ --only_official --exclude-scores
 ```
 
@@ -2159,60 +2389,60 @@ CAT add_names -i sample-1-tax-out.tmp.contig2classification.txt -o sample-1-cont
 
 **Input Data:**
 
-- sample-1-tax-out.tmp.contig2classification.txt (contig taxonomy file from [step 8b](#8b-running-taxonomic-classification))
+- sample-tax-out.tmp.contig2classification.txt (contig taxonomy file from [Step 16b](#16b-running-taxonomic-classification))
 
 **Output Data:**
 
-- sample-1-contig-tax-out.tmp (contig taxonomy file with lineage info added)
+- sample-contig-tax-out.tmp (contig taxonomy file with lineage info added)
 
 
-#### 17e. Formatting gene-level output with awk and sed
+#### 16e. Formatting gene-level output with awk and sed
 
 ```bash
 awk -F $'\t' ' BEGIN { OFS=FS } { if ( $3 == "lineage" ) { print $1,$3,$5,$6,$7,$8,$9,$10,$11 } \
     else if ( $2 == "ORF has no hit to database" || $2 ~ /^no taxid found/ ) \
     { print $1,"NA","NA","NA","NA","NA","NA","NA","NA" } else { n=split($3,lineage,";"); \
-    print $1,lineage[n],$5,$6,$7,$8,$9,$10,$11 } } ' sample-1-gene-tax-out.tmp | \
+    print $1,lineage[n],$5,$6,$7,$8,$9,$10,$11 } } ' sample-gene-tax-out.tmp | \
     sed 's/no support/NA/g' | sed 's/superkingdom/domain/' | sed 's/# ORF/gene_ID/' | \
-    sed 's/lineage/taxid/'  > sample-1-gene-tax-out.tsv
+    sed 's/lineage/taxid/'  > sample-gene-tax-out.tsv
 ```
 
 **Input Data:**
 
-- sample-1-gene-tax-out.tmp (gene-calls taxonomy file with lineage info added from [step 8c](#8c-adding-taxonomy-info-from-taxids-to-genes))
+- sample-gene-tax-out.tmp (gene-calls taxonomy file with lineage info added from [Step 16c](#16c-adding-taxonomy-info-from-taxids-to-genes))
 
 **Output Data:**
 
-- sample-1-gene-tax-out.tsv (gene-calls taxonomy file with lineage info added reformatted)
+- sample-gene-tax-out.tsv (gene-calls taxonomy file with lineage info added reformatted)
 
-#### 17f. Formatting contig-level output with awk and sed
+#### 16f. Formatting contig-level output with awk and sed
 
 ```bash
 awk -F $'\t' ' BEGIN { OFS=FS } { if ( $2 == "classification" ) { print $1,$4,$6,$7,$8,$9,$10,$11,$12 } \
     else if ( $2 == "no taxid assigned" ) { print $1,"NA","NA","NA","NA","NA","NA","NA","NA" } \
-    else { n=split($4,lineage,";"); print $1,lineage[n],$6,$7,$8,$9,$10,$11,$12 } } ' sample-1-contig-tax-out.tmp | \
+    else { n=split($4,lineage,";"); print $1,lineage[n],$6,$7,$8,$9,$10,$11,$12 } } ' sample-contig-tax-out.tmp | \
     sed 's/no support/NA/g' | sed 's/superkingdom/domain/' | sed 's/^# contig/contig_ID/' | \
-    sed 's/lineage/taxid/' > sample-1-contig-tax-out.tsv
+    sed 's/lineage/taxid/' > sample-contig-tax-out.tsv
 
   # clearing intermediate files
-rm sample-1*.tmp*
+rm sample*.tmp*
 ```
 
 **Input Data:**
 
-- sample-1-contig-tax-out.tmp (contig taxonomy file with lineage info added from [step 8d](#8d-adding-taxonomy-info-from-taxids-to-contigs))
+- sample-contig-tax-out.tmp (contig taxonomy file with lineage info added from [Step 16d](#16d-adding-taxonomy-info-from-taxids-to-contigs))
 
 **Output Data:**
 
-- sample-1-contig-tax-out.tsv (contig taxonomy file with lineage info added reformatted)
+- sample-contig-tax-out.tsv (contig taxonomy file with lineage info added reformatted)
 
 <br>
 
 ---
 
-### 18. Read-Mapping
+### 17. Read-Mapping
 
-#### 18a. Align Reads to Sample Assembly
+#### 17a. Align Reads to Sample Assembly
 
 ```bash
 minimap2 -a -x map-ont \
@@ -2230,13 +2460,13 @@ minimap2 -a -x map-ont \
 **Input Data**
 
 - /path/to/assemblies/sample_assembly.fasta (Sample assembly, output from [Step 13a](#13a-renaming-contig-headers))
-- /path/to/trimmed_reads/sample_host_removed.fastq.gz (Filtered and trimmed reads, output from [Step 8](#8-host-removal))
+- /path/to/trimmed_reads/sample_host_removed.fastq.gz (Filtered and trimmed reads, output from [Step 7b](#7b-remove-host-reads))
 
 **Output Data**
 
-- sample.sam (Reads aligned to contaminant assembly)
+- sample.sam (Reads aligned to sample assembly)
 
-#### 18b. Sort and Index Assembly Alignments
+#### 17b. Sort and Index Assembly Alignments
 
 ```bash
 # Sort Sam, convert to bam and create index
@@ -2258,7 +2488,7 @@ samtools index sample_sorted.bam sample_sorted.bam.bai
 
 **Input Data:**
 
-- sample.sam (Reads aligned to sample assembly, output from [Step 13c](#13c-read-mapping))
+- sample.sam (Reads aligned to sample assembly, output from [Step 17a](#17a-align-reads-to-sample-assembly))
 
 **Output Data:**
 
@@ -2269,18 +2499,18 @@ samtools index sample_sorted.bam sample_sorted.bam.bai
 
 ---
 
-### 19. Getting coverage information and filtering based on detection
+### 18. Getting coverage information and filtering based on detection
 > **Note:**  
 > “Detection” is a measure of what proportion of a reference sequence recruited reads 
 (see the discussion of detection [here](http://merenlab.org/2017/05/08/anvio-views/#detection)). 
 Filtering based on detection is one way of helping to mitigate non-specific read-recruitment.
 
-#### 19a. Filtering coverage levels based on detection
+#### 18a. Filtering coverage levels based on detection
 
 ```bash
   # pileup.sh comes from the bbduk.sh package
-pileup.sh -in sample-1.bam fastaorf=sample-1-genes.fasta outorf=sample-1-gene-cov-and-det.tmp \
-          out=sample-1-contig-cov-and-det.tmp
+pileup.sh -in sample_sorted.bam fastaorf=sample-genes.fasta outorf=sample-gene-cov-and-det.tmp \
+          out=sample-contig-cov-and-det.tmp
 ```
 
 **Parameter Definitions:**  
@@ -2290,104 +2520,115 @@ pileup.sh -in sample-1.bam fastaorf=sample-1-genes.fasta outorf=sample-1-gene-co
 - `outorf=` – the output gene-coverage tsv file
 - `out=` – the output contig-coverage tsv file
 
+**Input Data:**
+
+- sample.bam (mapping file from [Step 17b](#17b-sort-and-index-assembly-alignments))
+- sample-genes.fasta (gene-calls nucleotide fasta file from [Step 14](#14-gene-prediction))
+
+
+**Output Data:**
+
+- sample-gene-cov-and-det.tmp (gene-coverage tsv file)
+- sample-contig-cov-and-det.tmp (contig-coverage tsv file)
 
-#### 19b. Filtering gene and contig coverage based on requiring 50% detection and parsing down to just gene ID and coverage
+
+#### 18b. Filtering gene and contig coverage based on requiring 50% detection and parsing down to just gene ID and coverage
 
 ```bash
 # Filtering gene coverage
-grep -v "#" sample-1-gene-cov-and-det.tmp | awk -F $'\t' ' BEGIN { OFS=FS } { if ( $10 <= 0.5 ) $4 = 0 } \
-     { print $1,$4 } ' > sample-1-gene-cov.tmp
+grep -v "#" sample-gene-cov-and-det.tmp | awk -F $'\t' ' BEGIN { OFS=FS } { if ( $10 <= 0.5 ) $4 = 0 } \
+     { print $1,$4 } ' > sample-gene-cov.tmp
 
-cat <( printf "gene_ID\tcoverage\n" ) sample-1-gene-cov.tmp > sample-1-gene-coverages.tsv
+cat <( printf "gene_ID\tcoverage\n" ) sample-gene-cov.tmp > sample-gene-coverages.tsv
 
 # Filtering contig coverage
-grep -v "#" sample-1-contig-cov-and-det.tmp | awk -F $'\t' ' BEGIN { OFS=FS } { if ( $5 <= 50 ) $2 = 0 } \
-     { print $1,$2 } ' > sample-1-contig-cov.tmp
+grep -v "#" sample-contig-cov-and-det.tmp | awk -F $'\t' ' BEGIN { OFS=FS } { if ( $5 <= 50 ) $2 = 0 } \
+     { print $1,$2 } ' > sample-contig-cov.tmp
 
-cat <( printf "contig_ID\tcoverage\n" ) sample-1-contig-cov.tmp > sample-1-contig-coverages.tsv
+cat <( printf "contig_ID\tcoverage\n" ) sample-contig-cov.tmp > sample-contig-coverages.tsv
 
 # removing intermediate files
-rm sample-1-*.tmp
+rm sample-*.tmp
 ```
 
 **Input Data:**
 
-- sample-1.bam (mapping file from [step 9b](#9b-performing-mapping-conversion-to-bam-and-sorting))
-- sample-1-genes.fasta (gene-calls nucleotide fasta file from [step 6](#6-gene-prediction))
+- sample-gene-cov-and-det.tmp (temporary gene-coverage tsv file from [Step 18a](#18a-filtering-coverage-levels-based-on-detection))
+- sample-contig-cov-and-det.tmp (temporary contig-coverage tsv file from [Step 18a](#18a-filtering-coverage-levels-based-on-detection))
 
 **Output Data:**
 
-- sample-1-gene-coverages.tsv (table with gene-level coverages)
-- sample-1-contig-coverages.tsv (table with contig-level coverages)
+- sample-gene-coverages.tsv (table with gene-level coverages)
+- sample-contig-coverages.tsv (table with contig-level coverages)
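+
+To make the detection rule concrete, the sketch below applies the same contig-level filter to a made-up table. The file names, contig IDs, and values are hypothetical, and the column layout (column 2 = average fold coverage, column 5 = percent of the contig covered) is assumed from the command above rather than taken from a real `pileup.sh` run. It uses `-F '\t'` instead of `$'\t'` only so that it also runs in plain `sh`.
+
+```bash
+# Build a toy contig-coverage table (all names and values are hypothetical).
+# Assumed layout, matching the Step 18b command: column 2 = average fold
+# coverage, column 5 = percent of the contig covered by reads (detection).
+workdir=$(mktemp -d)
+toy_in="${workdir}/toy-contig-cov-and-det.tmp"
+toy_out="${workdir}/toy-contig-cov.tmp"
+
+printf '#ID\tAvg_fold\tLength\tRef_GC\tCovered_percent\n' >  "${toy_in}"
+printf 'contig_1\t12.5\t5000\t0.45\t98.2\n'               >> "${toy_in}"
+printf 'contig_2\t3.1\t2000\t0.51\t35.0\n'                >> "${toy_in}"
+printf 'contig_3\t7.0\t3000\t0.40\t50.0\n'                >> "${toy_in}"
+
+# Same rule as Step 18b: coverage is set to 0 when detection is <= 50%
+grep -v "#" "${toy_in}" | awk -F '\t' 'BEGIN { OFS = FS }
+     { if ( $5 <= 50 ) $2 = 0; print $1, $2 }' > "${toy_out}"
+
+cat "${toy_out}"
+```
+
+In this toy case `contig_1` keeps its coverage of 12.5, while `contig_2` (35% detection) and `contig_3` (exactly 50% detection) are written with a coverage of 0, because the comparison is `<=`.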
 
 <br>
 
 ---
 
-### 20. Combining gene-level coverage, taxonomy, and functional annotations into one table for each sample
+### 19. Combining gene-level coverage, taxonomy, and functional annotations into one table for each sample
 > **Note:**  
 > Just uses `paste`, `sed`, and `awk`, which are all standard in any Unix-like environment.  
 
 ```bash
-paste <( tail -n +2 sample-1-gene-coverages.tsv | sort -V -k 1 ) <( tail -n +2 sample-1-annotations.tsv | sort -V -k 1 | cut -f 2- ) \
-      <( tail -n +2 sample-1-gene-tax-out.tsv | sort -V -k 1 | cut -f 2- ) > sample-1-gene-tab.tmp
+paste <( tail -n +2 sample-gene-coverages.tsv | sort -V -k 1 ) <( tail -n +2 sample-annotations.tsv | sort -V -k 1 | cut -f 2- ) \
+      <( tail -n +2 sample-gene-tax-out.tsv | sort -V -k 1 | cut -f 2- ) > sample-gene-tab.tmp
 
-paste <( head -n 1 sample-1-gene-coverages.tsv ) <( head -n 1 sample-1-annotations.tsv | cut -f 2- ) \
-      <( head -n 1 sample-1-gene-tax-out.tsv | cut -f 2- ) > sample-1-header.tmp
+paste <( head -n 1 sample-gene-coverages.tsv ) <( head -n 1 sample-annotations.tsv | cut -f 2- ) \
+      <( head -n 1 sample-gene-tax-out.tsv | cut -f 2- ) > sample-header.tmp
 
-cat sample-1-header.tmp sample-1-gene-tab.tmp > sample-1-gene-coverage-annotation-and-tax.tsv
+cat sample-header.tmp sample-gene-tab.tmp > sample-gene-coverage-annotation-and-tax.tsv
 
   # removing intermediate files
-rm sample-1*tmp sample-1-gene-coverages.tsv sample-1-annotations.tsv sample-1-gene-tax-out.tsv
+rm sample*tmp sample-gene-coverages.tsv sample-annotations.tsv sample-gene-tax-out.tsv
 ```
 
 **Input Data:**
 
-- sample-1-gene-coverages.tsv (table with gene-level coverages from [step 10b](#10b-filtering-gene-coverage-based-on-requiring-50-detection-and-parsing-down-to-just-gene-id-and-coverage))
-- sample-1-annotations.tsv (table of KO annotations assigned to gene IDs from [step 7c](#7c-filtering-output-to-retain-only-those-passing-the-ko-specific-score-and-top-hits))
-- sample-1-gene-tax-out.tsv (gene-level taxonomic classifications from [step 8f](#8f-formatting-contig-level-output-with-awk-and-sed))
+- sample-gene-coverages.tsv (table with gene-level coverages from [Step 18b](#18b-filtering-gene-and-contig-coverage-based-on-requiring-50-detection-and-parsing-down-to-just-gene-id-and-coverage))
+- sample-annotations.tsv (table of KO annotations assigned to gene IDs from [Step 15c](#15c-filtering-output-to-retain-only-those-passing-the-ko-specific-score-and-top-hits))
+- sample-gene-tax-out.tsv (gene-level taxonomic classifications from [Step 16f](#16f-formatting-contig-level-output-with-awk-and-sed))
 
 
 **Output Data:**
 
-- **sample-1-gene-coverage-annotation-and-tax.tsv** (table with combined gene coverage, annotation, and taxonomy info)
+- **sample-gene-coverage-annotation-and-tax.tsv** (table with combined gene coverage, annotation, and taxonomy info)
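+
+Because `paste` joins lines purely by position, the command above only produces a correct table when every input lists the same gene IDs in the same order after sorting. The following is a minimal sketch of an optional sanity check that could be run before the `paste` step. It is not part of the pipeline itself, and the toy file names, gene IDs, and column names are hypothetical.
+
+```bash
+# Toy inputs (hypothetical gene IDs, values, and column names)
+workdir=$(mktemp -d)
+printf 'gene_ID\tcoverage\ng_1\t4.2\ng_2\t0\ng_10\t1.3\n' > "${workdir}/toy-gene-coverages.tsv"
+printf 'gene_ID\tKO_ID\ng_10\tK00003\ng_2\tK00002\ng_1\tK00001\n' > "${workdir}/toy-annotations.tsv"
+
+# Extract and sort the gene IDs the same way the paste command does
+tail -n +2 "${workdir}/toy-gene-coverages.tsv" | sort -V -k 1 | cut -f 1 > "${workdir}/cov-ids.tmp"
+tail -n +2 "${workdir}/toy-annotations.tsv"    | sort -V -k 1 | cut -f 1 > "${workdir}/ann-ids.tmp"
+
+# cmp -s exits 0 only when the two ID lists are byte-for-byte identical
+if cmp -s "${workdir}/cov-ids.tmp" "${workdir}/ann-ids.tmp"; then
+    ids_status="match"
+else
+    ids_status="mismatch"
+fi
+echo "gene IDs: ${ids_status}"
+```
+
+If the check reports a mismatch, the tables should be reconciled before combining them, since `paste` would otherwise silently attach annotations to the wrong genes.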
 
 <br>
 
 ---
 
-### 21. Combining contig-level coverage and taxonomy into one table for each sample
+### 20. Combining contig-level coverage and taxonomy into one table for each sample
 > **Note:**  
 > Just uses `paste`, `sed`, and `awk`, which are all standard in any Unix-like environment.  
 
 ```bash
-paste <( tail -n +2 sample-1-contig-coverages.tsv | sort -V -k 1 ) \
-      <( tail -n +2 sample-1-contig-tax-out.tsv | sort -V -k 1 | cut -f 2- ) > sample-1-contig.tmp
+paste <( tail -n +2 sample-contig-coverages.tsv | sort -V -k 1 ) \
+      <( tail -n +2 sample-contig-tax-out.tsv | sort -V -k 1 | cut -f 2- ) > sample-contig.tmp
 
-paste <( head -n 1 sample-1-contig-coverages.tsv ) <( head -n 1 sample-1-contig-tax-out.tsv | cut -f 2- ) \
-      > sample-1-contig-header.tmp
+paste <( head -n 1 sample-contig-coverages.tsv ) <( head -n 1 sample-contig-tax-out.tsv | cut -f 2- ) \
+      > sample-contig-header.tmp
       
-cat sample-1-contig-header.tmp sample-1-contig.tmp > sample-1-contig-coverage-and-tax.tsv
+cat sample-contig-header.tmp sample-contig.tmp > sample-contig-coverage-and-tax.tsv
 
   # removing intermediate files
-rm sample-1*tmp sample-1-contig-coverages.tsv sample-1-contig-tax-out.tsv
+rm sample*tmp sample-contig-coverages.tsv sample-contig-tax-out.tsv
 ```
 
 **Input Data:**
 
-- sample-1-contig-coverages.tsv (table with contig-level coverages from [step 10b](#10b-filtering-gene-coverage-based-on-requiring-50-detection-and-parsing-down-to-just-gene-id-and-coverage))
-- sample-1-contig-tax-out.tsv (contig-level taxonomic classifications from [step 8f](#8f-formatting-contig-level-output-with-awk-and-sed))
+- sample-contig-coverages.tsv (table with contig-level coverages from [Step 18b](#18b-filtering-gene-and-contig-coverage-based-on-requiring-50-detection-and-parsing-down-to-just-gene-id-and-coverage))
+- sample-contig-tax-out.tsv (contig-level taxonomic classifications from [Step 16f](#16f-formatting-contig-level-output-with-awk-and-sed))
 
 
 **Output Data:**
 
-- **sample-1-contig-coverage-and-tax.tsv** (table with combined contig coverage and taxonomy info)
+- **sample-contig-coverage-and-tax.tsv** (table with combined contig coverage and taxonomy info)
 
 <br>
 
 ---
 
-### 22. Generating normalized, gene- and contig-level coverage summary tables of KO-annotations and taxonomy across samples
+### 21. Generating normalized, gene- and contig-level coverage summary tables of KO-annotations and taxonomy across samples
 
 > **Note:**  
 > * To combine across samples to generate these summary tables, we need the same "units". This is done for annotations 
@@ -2399,7 +2640,7 @@ by the length of the gene). These have been normalized by making the total cover
 each individual gene-level coverage its proportion of that 1,000,000 total. So basically percent, but out of 1,000,000 
 instead of 100 to make the numbers more friendly. 
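+
+As a worked example of the arithmetic described in the note above, the sketch below rescales a made-up coverage column so that it sums to 1,000,000. It only illustrates the calculation and is not meant to show how `bit-GL-combine-KO-and-tax-tables` is implemented; the file names, gene IDs, and values are hypothetical.
+
+```bash
+# Toy gene-coverage table (hypothetical IDs and values; coverages sum to 100)
+workdir=$(mktemp -d)
+toy_cov="${workdir}/toy-gene-coverages.tsv"
+toy_cpm="${workdir}/toy-gene-cpm.tsv"
+printf 'gene_ID\tcoverage\ng_1\t30\ng_2\t10\ng_3\t60\n' > "${toy_cov}"
+
+# Two passes over the same file: first sum the coverages, then rescale each
+# value so that the column total becomes 1,000,000
+awk -F '\t' 'BEGIN { OFS = FS }
+     NR == FNR { if ( FNR > 1 ) total += $2; next }
+     FNR == 1  { print $1, "CPM"; next }
+     { printf "%s\t%.0f\n", $1, ( total > 0 ? $2 * 1000000 / total : 0 ) }' \
+    "${toy_cov}" "${toy_cov}" > "${toy_cpm}"
+
+cat "${toy_cpm}"
+```
+
+Here the three genes hold 30%, 10%, and 60% of the sample's total coverage, so they are reported as 300000, 100000, and 600000, which makes them directly comparable with values from other samples normalized the same way.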
 
-#### 22a. Generating gene-level coverage summary tables
+#### 21a. Generating gene-level coverage summary tables
 
 ```bash
 bit-GL-combine-KO-and-tax-tables *-gene-coverage-annotation-and-tax.tsv -o Combined
@@ -2414,7 +2655,7 @@ bit-GL-combine-KO-and-tax-tables *-gene-coverage-annotation-and-tax.tsv -o Combi
 
 **Input Data:**
 
-- *-gene-coverage-annotation-and-tax.tsv (tables with combined gene coverage, annotation, and taxonomy info generated for individual samples from [step 11](#11-combining-gene-level-coverage-taxonomy-and-functional-annotations-into-one-table-for-each-sample))
+- *-gene-coverage-annotation-and-tax.tsv (tables with combined gene coverage, annotation, and taxonomy info generated for individual samples from [Step 19](#19-combining-gene-level-coverage-taxonomy-and-functional-annotations-into-one-table-for-each-sample))
 
 **Output Data:**
 
@@ -2424,7 +2665,335 @@ bit-GL-combine-KO-and-tax-tables *-gene-coverage-annotation-and-tax.tsv -o Combi
 - **Combined-gene-level-taxonomy-coverages_GLmetagenomics.tsv** (table with all samples combined based on gene-level taxonomic classifications)
 
 
-#### 22b. Generating contig-level coverage summary tables
+#### 21b. Gene-level taxonomy heatmaps
+
+```R
+library(tidyverse)
+library(pheatmap)
+
+# Abundant taxa with CPM > 1000
+abundance_threshold <- 1000
+
+sample_order <- get_sample_names("assembly-summaries_GLmetagenomics.tsv")
+# Read-in gene table
+gene_taxonomy_table <-  read_contig_table("Combined-gene-level-taxonomy-coverages-CPM_GLmetagenomics.tsv", sample_order)
+
+# Summarize gene table
+species_gene_table <- gene_taxonomy_table %>%
+  select(species, !!sample_order) %>% 
+  group_by(species) %>% 
+  summarise(across(everything(), sum)) 
+
+# Convert gene dataframe table to a matrix table
+gene.m <- species_gene_table %>% as.data.frame()
+# Write out gene taxonomy table
+write_csv(x = gene.m, file = "gene_taxonomy_table.csv")
+
+rownames(gene.m) <- gene.m[['species']]
+gene.m <- gene.m[,-match("species", colnames(gene.m))] %>% as.matrix()
+
+
+#------ All gene taxonomy assignments
+
+# Drop unclassified assignments
+mat2plot <- gene.m[rownames(gene.m) != "Unclassified;_;_;_;_;_;_", , drop = FALSE]
+
+png(filename = "All-genes-taxonomy-heatmap_GLmetagenomics.png", 
+    width = plot_width, height = plot_height, units = "in", res=300)
+pheatmap(mat = mat2plot,
+         cluster_cols = FALSE, 
+         cluster_rows = FALSE, 
+         col = colours, 
+         angle_col = 0, 
+         display_numbers = TRUE,
+         fontsize = 12,
+         number_format = "%.0f")
+dev.off()
+
+
+#------ Abundant gene taxonomy assignments
+
+taxa <- rowSums(gene.m) %>% sort()
+abund_taxa <- taxa[ taxa > abundance_threshold ] %>% names
+abund_gene.m <- gene.m[abund_taxa,]
+
+
+# Drop unclassified assignments
+mat2plot <- abund_gene.m[rownames(abund_gene.m) != "Unclassified;_;_;_;_;_;_", , drop = FALSE]
+
+png(filename = "Abundant-genes-taxonomy-heatmap_GLmetagenomics.png", 
+    width = plot_width, height = plot_height, units = "in", res=300)
+pheatmap(mat = mat2plot,
+         cluster_cols = FALSE, 
+         cluster_rows = FALSE, 
+         col = colours, 
+         angle_col = 0, 
+         display_numbers = TRUE,
+         fontsize = 12,
+         number_format = "%.0f")
+dev.off()
+```
+
+**Input data:**
+- assembly-summaries_GLmetagenomics.tsv (table of assembly summary statistics, output from [Step 13b](#13b-summarizing-assemblies))
+- Combined-gene-level-taxonomy-coverages-CPM_GLmetagenomics.tsv (table with all samples combined based on gene-level taxonomic classifications, output from [Step 21a](#21a-generating-gene-level-coverage-summary-tables)) 
+
+**Output data:**
+- gene_taxonomy_table.csv (aggregated gene taxonomy table)
+- **All-genes-taxonomy-heatmap_GLmetagenomics.png** (heatmap of all genes taxonomy assignments)
+- **Abundant-genes-taxonomy-heatmap_GLmetagenomics.png** (heatmap of abundant genes taxonomy assignments)
+
+#### 21c. Gene-level taxonomy decontamination
+
+```R
+library(tidyverse)
+library(decontam)
+library(pheatmap)
+# Set to 0.5 for a more aggressive approach where species more prevalent
+# in the negative controls are considered contaminants
+contam_threshold <- 0.1
+# Control samples in this column should always be written as 
+# "Control_Sample" and true samples as "True_Sample"
+prev_col <- "Sample_or_Control"
+freq_col <- "input_conc_ng"
+plot_width <- 18
+plot_height <- 8
+
+# Read-in metadata
+metadata_file <- "/path/to/sample/metadata"
+samples_column <- "Sample_ID"
+metadata <- read_delim(file = metadata_file, delim = "\t") %>% as.data.frame()
+row.names(metadata) <- metadata[,samples_column]
+
+# Read-in feature table (converted to a data frame so that row names are kept)
+gene.m <- read_csv("gene_taxonomy_table.csv") %>% as.data.frame()
+rownames(gene.m) <- gene.m[['species']]
+gene.m <- gene.m[,-match("species", colnames(gene.m))] %>% as.matrix()
+feature_table <- gene.m
+
+
+contamdf <- run_decontam(feature_table, metadata, contam_threshold, prev_col, freq_col)
+
+# Write decontam results table to file
+write_csv(x = contamdf %>% rownames_to_column("Species"), file = "decontam-gene-taxonomy_results.csv")
+
+# Get the list of contaminants identified by decontam
+contaminants <- contamdf %>%
+                as.data.frame %>%
+                rownames_to_column("Species") %>%
+                filter(contaminant == TRUE) %>% pull(Species)
+
+# Drop contaminant features identified by decontam (exact name match)
+decontaminated_table <- feature_table %>% 
+                as.data.frame  %>% 
+                rownames_to_column("Species") %>% 
+                filter(!(Species %in% contaminants))
+
+
+# Write decontaminated species table to file
+write_csv(x = decontaminated_table, file = "decontaminated-gene-taxonomy_table.csv")
+
+# Flag the species (contaminants and unclassified) to drop
+non_microbial <- "Unclassified;_;_;_;_;_;_"
+species_to_drop_index <- rownames(feature_table) %in% c(contaminants, non_microbial)
+
+mat2plot <- feature_table[!species_to_drop_index, , drop = FALSE]
+png(filename = "decontaminated-gene-taxonomy-heatmap_GLmetagenomics.png", 
+    width = plot_width, height = plot_height, units = "in", res=300)
+pheatmap(mat = mat2plot,
+         cluster_cols = FALSE, 
+         cluster_rows = FALSE, 
+         col = colours, 
+         angle_col = 0, 
+         display_numbers = TRUE,
+         fontsize = 14, 
+         number_format = "%.0f")
+dev.off()
+
+```
+
+**Input data:**
+
+- metadata_file  (path to sample-wise metadata file)
+- gene_taxonomy_table.csv (aggregated gene taxonomy table, output from [Step 21b](#21b-gene-level-taxonomy-heatmaps))
+
+**Output data:**
+
+- **decontam-gene-taxonomy_results.csv** (decontam's results table)
+- **decontaminated-gene-taxonomy_table.csv** (decontaminated species table)
+- **decontaminated-gene-taxonomy-heatmap_GLmetagenomics.png** (heatmap after filtering out contaminants)
+
+
+
+#### 21d. Gene-level KO functions heatmaps
+
+```R
+library(tidyverse)
+library(pheatmap)
+
+# Abundant functions with CPM > 2000
+abundance_threshold <- 2000
+
+sample_order <- get_sample_names("assembly-summaries_GLmetagenomics.tsv")
+# Read-in KO functions table
+functions_table <- read_input_table("Combined-gene-level-KO-function-coverages-CPM_GLmetagenomics.tsv") %>%
+                    select(KO_ID, KO_function, !!sample_order)
+
+# Subset table and then convert from data frame to matrix
+functions.m <- functions_table[,sample_order] %>% as.matrix()
+rownames(functions.m) <- functions_table$KO_ID
+table2write <-  functions.m %>% 
+                      as.data.frame() %>% rownames_to_column("KO_ID") %>%
+                      filter(KO_ID != "Not annotated") # Drop unannotated / unclassified
+# Write out KO functions table
+write_csv(x = table2write  , file = "genes-KO-functions_table.csv")
+
+
+#------ All KO functions assignments
+
+# Drop unannotated assignments
+mat2plot <- functions.m[rownames(functions.m) != "Not annotated", , drop = FALSE]
+
+png(filename = "All-genes-KO-functions-heatmap_GLmetagenomics.png", 
+    width = plot_width, height = plot_height, units = "in", res=300)
+pheatmap(mat = mat2plot,
+         cluster_cols = FALSE, 
+         cluster_rows = FALSE, 
+         col = colours, 
+         angle_col = 0, 
+         display_numbers = TRUE,
+         fontsize = 12,
+         number_format = "%.0f")
+dev.off()
+
+
+#------ Abundant KO functions assignments
+
+functions <- rowSums(functions.m) %>% sort()
+abund_functions <- functions[ functions > abundance_threshold ] %>% names
+abund_functions.m <- functions.m[abund_functions,]
+
+
+# Drop unannotated assignments
+mat2plot <- abund_functions.m[rownames(abund_functions.m) != "Not annotated", , drop = FALSE]
+
+png(filename = "Abundant-genes-KO-functions-heatmap_GLmetagenomics.png", 
+    width = plot_width, height = plot_height, units = "in", res=300)
+pheatmap(mat = mat2plot,
+         cluster_cols = FALSE, 
+         cluster_rows = FALSE, 
+         col = colours, 
+         angle_col = 0, 
+         display_numbers = TRUE,
+         fontsize = 12,
+         number_format = "%.0f")
+dev.off()
+```
+
+**Parameter Definitions:**  
+
+
+**Input data:**
+- assembly-summaries_GLmetagenomics.tsv (table of assembly summary statistics, output from [Step 13b](#13b-summarizing-assemblies))
+- Combined-gene-level-KO-function-coverages-CPM_GLmetagenomics.tsv (table with all samples combined based on KO annotations; normalized to coverage per million genes covered, output from [Step 21a](#21a-generating-gene-level-coverage-summary-tables))
+
+**Output data:**
+- genes-KO-functions_table.csv (aggregated and subsetted gene KO functions table)
+- **All-genes-KO-functions-heatmap_GLmetagenomics.png** (heatmap of gene-wise KO function assignments)
+- **Abundant-genes-KO-functions-heatmap_GLmetagenomics.png** (heatmap of gene-wise abundant KO function assignments)
+
+#### 21e. Gene-level KO functions decontamination
+
+```R
+library(tidyverse)
+library(decontam)
+library(pheatmap)
+# Set to 0.5 for a more aggressive approach where KO functions more prevalent
+# in the negative controls are considered contaminants
+contam_threshold <- 0.1 
+# Control samples in this column should always be written as "Control_Sample" and true samples as "True_Sample"
+prev_col <- "Sample_or_Control"
+freq_col <- "input_conc_ng"
+plot_width <- 18
+plot_height <- 8
+
+# Read-in metadata
+metadata_file <- "/path/to/sample/metadata"
+samples_column <- "Sample_ID"
+metadata <- read_delim(file = metadata_file, delim = "\t") %>% as.data.frame()
+row.names(metadata) <- metadata[,samples_column]
+
+# Read-in feature table (converted to a data frame so that row names are kept)
+functions.m <- read_csv("genes-KO-functions_table.csv") %>% as.data.frame()
+rownames(functions.m) <- functions.m[['KO_ID']]
+functions.m <- functions.m[,-match("KO_ID", colnames(functions.m))] %>% as.matrix()
+feature_table <- functions.m
+
+
+contamdf <- run_decontam(feature_table, metadata, contam_threshold, prev_col, freq_col)
+
+# Write decontam results table to file
+write_csv(x = contamdf %>% rownames_to_column("KO_ID"), file = "decontam-gene-KO-functions_results.csv")
+
+# Get the list of contaminants identified by decontam
+contaminants <- contamdf %>%
+                as.data.frame %>%
+                rownames_to_column("KO_ID") %>%
+                filter(contaminant == TRUE) %>% pull(KO_ID)
+
+# Drop contaminant features identified by decontam (exact name match)
+decontaminated_table <- feature_table %>% 
+                as.data.frame  %>% 
+                rownames_to_column("KO_ID") %>% 
+                filter(!(KO_ID %in% contaminants))
+
+
+# Write decontaminated KO functions table to file
+write_csv(x = decontaminated_table, file = "decontaminated-gene-KO-functions_table.csv")
+
+# Flag the KO functions (contaminants and unannotated) to drop
+unclassified <- "Not annotated"
+functions_to_drop_index <- rownames(feature_table) %in% c(contaminants, unclassified)
+
+mat2plot <- feature_table[!functions_to_drop_index, , drop = FALSE]
+png(filename = "decontaminated-gene-KO-functions-heatmap_GLmetagenomics.png", 
+    width = plot_width, height = plot_height, units = "in", res=300)
+pheatmap(mat = mat2plot,
+         cluster_cols = FALSE, 
+         cluster_rows = FALSE, 
+         col = colours, 
+         angle_col = 0, 
+         display_numbers = TRUE,
+         fontsize = 14, 
+         number_format = "%.0f")
+dev.off()
+
+```
+
+**Input data:**
+
+- metadata_file  (path to sample-wise metadata file)
+- genes-KO-functions_table.csv (aggregated and subsetted gene KO functions table, output from [Step 21d](#21d-gene-level-ko-functions-heatmaps))
+
+**Output data:**
+
+- **decontam-gene-KO-functions_results.csv** (decontam's results table)
+- **decontaminated-gene-KO-functions_table.csv** (decontaminated functions table)
+- **decontaminated-gene-KO-functions-heatmap_GLmetagenomics.png** (heatmap after filtering out contaminants)
+
+
+
+#### 21f. Generating contig-level coverage summary tables
 
 ```bash
 bit-GL-combine-contig-tax-tables *-contig-coverage-and-tax.tsv -o Combined
@@ -2433,13 +3002,12 @@ bit-GL-combine-contig-tax-tables *-contig-coverage-and-tax.tsv -o Combined
 **Parameter Definitions:**  
 
 - `*-contig-coverage-and-tax.tsv` - positional arguments specifying the input tsv files, can be provided as a space-delimited list of files, or with wildcards like above
-
 - `-o` – specifies the output prefix
 
 
 **Input Data:**
 
-- *-contig-coverage-annotation-and-tax.tsv (tables with combined contig coverage, annotation, and taxonomy info generated for individual samples from [step 12](#12-combining-contig-level-coverage-and-taxonomy-into-one-table-for-each-sample))
+- *-contig-coverage-and-tax.tsv (tables with combined contig coverage and taxonomy info generated for individual samples from [Step 20](#20-combining-contig-level-coverage-and-taxonomy-into-one-table-for-each-sample))
 
 **Output Data:**
 
@@ -2448,20 +3016,184 @@ bit-GL-combine-contig-tax-tables *-contig-coverage-and-tax.tsv -o Combined
 
 <br>
 
+
+#### 21g. Contig-level Heatmaps
+
+```R
+library(tidyverse)
+library(pheatmap)
+
+plot_width <- 20
+plot_height <- 30
+sample_order <- get_sample_names("assembly-summaries_GLmetagenomics.tsv")
+
+contig_table <-  read_contig_table("Combined-contig-level-taxonomy-coverages-CPM_GLmetagenomics.tsv", sample_order)
+species_contig_table <- contig_table %>% select(species, !!sample_order)
+
+contig.m <- species_contig_table %>%
+  group_by(species) %>%
+  summarise(across(everything(), sum)) %>%
+  filter(species != "Unclassified;_;_;_;_;_;_") %>% # Drop unclassified
+  as.data.frame()
+
+# Write out contig taxonomy table
+write_csv(x = contig.m, file = "contig_taxonomy_table.csv")
+
+rownames(contig.m) <- contig.m[['species']]
+contig.m <- contig.m[,-match("species", colnames(contig.m))] %>% as.matrix()
+
+#------ All contig taxonomy assignments
+
+# Drop unclassified assignments, if any remain
+mat2plot <- contig.m[rownames(contig.m) != "Unclassified;_;_;_;_;_;_", , drop = FALSE]
+
+png(filename = "All-contig-taxonomy-heatmap_GLmetagenomics.png", 
+    width = plot_width, height = plot_height, units = "in", res=300)
+pheatmap(mat = mat2plot,
+         cluster_cols = FALSE, 
+         cluster_rows = FALSE, 
+         col = colours, 
+         angle_col = 0, 
+         display_numbers = TRUE,
+         fontsize = 12,
+         number_format = "%.0f")
+dev.off()
+
+
+#------ Abundant contig taxonomy assignments
+
+taxa <- rowSums(contig.m) %>% sort()
+abund_taxa <- taxa[ taxa > abundance_threshold ] %>% names
+abund_contig.m <- contig.m[abund_taxa,]
+
+mat2plot <- abund_contig.m[rownames(abund_contig.m) != "Unclassified;_;_;_;_;_;_", , drop = FALSE]
+
+png(filename = "Abundant-contig-taxonomy-heatmap_GLmetagenomics.png", 
+    width = plot_width, height = plot_height, units = "in", res=300)
+pheatmap(mat = mat2plot,
+         cluster_cols = FALSE, 
+         cluster_rows = FALSE, 
+         col = colours, 
+         angle_col = 0, 
+         display_numbers = TRUE,
+         fontsize = 12,
+         number_format = "%.0f")
+dev.off()
+```
+
+
+**Parameter Definitions:**  
+
+
+**Input data:**
+
+- assembly-summaries_GLmetagenomics.tsv (table of assembly summary statistics, output from [Step 13b](#13b-summarizing-assemblies))
+- Combined-contig-level-taxonomy-coverages-CPM_GLmetagenomics.tsv (table with all samples combined based on contig-level taxonomic classifications; normalized to coverage per million genes covered from [Step 21f](#21f-generating-contig-level-coverage-summary-tables))
+
+**Output data:**
+
+- contig_taxonomy_table.csv (aggregated contig taxonomy)
+- **All-contig-taxonomy-heatmap_GLmetagenomics.png** (All contig level taxonomy heatmap)
+- **Abundant-contig-taxonomy-heatmap_GLmetagenomics.png** (Abundant contig level taxonomy heatmap)
+
+
+#### 21h. Contig-level decontamination
+
+```R
+library(tidyverse)
+library(decontam)
+library(pheatmap)
+# Set to 0.5 for a more aggressive approach where species more prevalent
+# in the negative controls are considered contaminants
+contam_threshold <- 0.1
+# Control samples in this column should always be written as
+# "Control_Sample" and true samples as "True_Sample"
+prev_col <- "Sample_or_Control"
+freq_col <- "input_conc_ng"
+plot_width <- 18
+plot_height <- 8
+
+# Read-in metadata
+metadata_file <- "/path/to/sample/metadata"
+samples_column <- "Sample_ID"
+metadata <- read_delim(file = metadata_file, delim = "\t") %>% as.data.frame()
+row.names(metadata) <- metadata[,samples_column]
+
+# Read-in feature table
+contig.m <- read_csv("contig_taxonomy_table.csv") %>% as.data.frame()
+rownames(contig.m) <- contig.m[['species']]
+contig.m <- contig.m[,-match("species", colnames(contig.m))] %>% as.matrix()
+feature_table <- contig.m
+
+
+contamdf <- run_decontam(feature_table, metadata, contam_threshold, prev_col, freq_col)
+
+# Write decontam results table to file
+write_csv(x = contamdf %>% rownames_to_column("Species"), file = "decontam-contig-taxonomy_results.csv")
+
+# Get a list of contaminants identified by decontam
+contaminants <- contamdf %>%
+                as.data.frame %>%
+                rownames_to_column("Species") %>%
+                filter(contaminant == TRUE) %>% pull(Species)
+
+# Drop contaminant features identified by decontam (exact name match)
+decontaminated_table <- feature_table %>% 
+                as.data.frame  %>% 
+                rownames_to_column("Species") %>% 
+                filter(!(Species %in% contaminants))
+
+
+# Write decontaminated species table to file
+write_csv(x = decontaminated_table, file = "decontaminated-contig-taxonomy_table.csv")
+
+# Flag the species (contaminants and unclassified) to drop
+non_microbial <- "Unclassified;_;_;_;_;_;_"
+species_to_drop_index <- rownames(feature_table) %in% c(contaminants, non_microbial)
+
+mat2plot <- feature_table[!species_to_drop_index, , drop = FALSE]
+png(filename = "decontaminated-contig-taxonomy-heatmap_GLmetagenomics.png", 
+    width = plot_width, height = plot_height, units = "in", res=300)
+pheatmap(mat = mat2plot,
+         cluster_cols = FALSE, 
+         cluster_rows = FALSE, 
+         col = colours, 
+         angle_col = 0, 
+         display_numbers = TRUE,
+         fontsize = 14, 
+         number_format = "%.0f")
+dev.off()
+
+```
+
+**Input data:**
+
+- metadata_file  (path to sample-wise metadata file)
+- contig_taxonomy_table.csv (aggregated contig taxonomy table, output from [Step 21g](#21g-contig-level-heatmaps))
+
+**Output data:**
+
+- **decontam-contig-taxonomy_results.csv** (decontam's results table)
+- **decontaminated-contig-taxonomy_table.csv** (decontaminated species table)
+- **decontaminated-contig-taxonomy-heatmap_GLmetagenomics.png** (heatmap after filtering out contaminants)
+
+
 ---
 
-### 23. **M**etagenome-**A**ssembled **G**enome (MAG) recovery
+### 22. **M**etagenome-**A**ssembled **G**enome (MAG) recovery
 
-#### 23a. Binning contigs
+#### 22a. Binning contigs
 
 ```bash
-jgi_summarize_bam_contig_depths --outputDepth sample-1-metabat-assembly-depth.tsv --percentIdentity 97 --minContigLength 1000 --minContigDepth 1.0  --referenceFasta sample-1-assembly.fasta sample-1.bam
+jgi_summarize_bam_contig_depths --outputDepth sample-metabat-assembly-depth.tsv --percentIdentity 97 --minContigLength 1000 --minContigDepth 1.0  --referenceFasta sample-assembly.fasta sample.bam
 
-metabat2  --inFile sample-1-assembly.fasta --outFile sample-1 --abdFile sample-1-metabat-assembly-depth.tsv -t NumberOfThreads
+metabat2  --inFile sample-assembly.fasta --outFile sample --abdFile sample-metabat-assembly-depth.tsv -t NumberOfThreads
 
-mkdir sample-1-bins
-mv sample-1*bin*.fasta sample-1-bins
-zip -r sample-1-bins.zip sample-1-bins
+mkdir sample-bins
+mv sample*bin*.fasta sample-bins
+zip -r sample-bins.zip sample-bins
 ```
 
 **Parameter Definitions:**  
@@ -2471,7 +3203,7 @@ zip -r sample-1-bins.zip sample-1-bins
 -  `--minContigLength` – minimum contig length to include
 -  `--minContigDepth` – minimum contig depth to include
 -  `--referenceFasta` – the assembly fasta file generated in step 5a
--  `sample-1.bam` – final positional arguments are the bam files generated in step 9
+-  `sample.bam` – final positional arguments are the bam files generated in Step 17b
 -  `--inFile` - the assembly fasta file generated in step 5a
 -  `--outFile` - the prefix of the identified bins output files
 -  `--abdFile` - the depth file generated by the previous `jgi_summarize_bam_contig_depths` command
@@ -2480,16 +3212,16 @@ zip -r sample-1-bins.zip sample-1-bins
 
 **Input Data:**
 
-- sample-1-assembly.fasta (assembly fasta file created in [step 5a](#5a-renaming-contig-headers))
-- sample-1.bam (bam file created in [step 9b](#9b-performing-mapping-conversion-to-bam-and-sorting))
+- sample-assembly.fasta (assembly fasta file created in [Step 13a](#13a-renaming-contig-headers))
+- sample.bam (bam file created in [Step 17b](#17b-sort-and-index-assembly-alignments))
 
 **Output Data:**
 
-- **sample-1-metabat-assembly-depth.tsv** (tab-delimited summary of coverages)
-- sample-1-bins/sample-1-bin\*.fasta (fasta files of recovered bins)
-- **sample-1-bins.zip** (zip file containing fasta files of recovered bins)
+- **sample-metabat-assembly-depth.tsv** (tab-delimited summary of coverages)
+- sample-bins/sample-bin\*.fasta (fasta files of recovered bins)
+- **sample-bins.zip** (zip file containing fasta files of recovered bins)
 
-#### 23b. Bin quality assessment
+#### 22b. Bin quality assessment
 Utilizes the default `checkm` database available [here](https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz), `checkm_data_2015_01_16.tar.gz`.
 
 ```bash
@@ -2507,14 +3239,14 @@ checkm lineage_wf -f bins-overview_GLmetagenomics.tsv --tab_table -x fa ./ check
 
 **Input Data:**
 
-- sample-1-bins/sample-1-bin\*.fasta (bin fasta files generated in [step 14a](#14a-binning-contigs))
+- sample-bins/sample-bin\*.fasta (bin fasta files generated in [Step 22a](#22a-binning-contigs))
 
 **Output Data:**
 
 - **bins-overview_GLmetagenomics.tsv** (tab-delimited file with quality estimates per bin)
 - checkm-output-dir (directory holding detailed checkm outputs)
 
-#### 23c. Filtering MAGs
+#### 22c. Filtering MAGs
 
 ```bash
 cat <( head -n 1 bins-overview_GLmetagenomics.tsv ) \
@@ -2541,7 +3273,7 @@ done
 
 **Input Data:**
 
-- bins-overview_GLmetagenomics.tsv (tab-delimited file with quality estimates per bin from [step 14b](#14b-bin-quality-assessment))
+- bins-overview_GLmetagenomics.tsv (tab-delimited file with quality estimates per bin from [Step 22b](#22b-bin-quality-assessment))
 
 **Output Data:**
 
@@ -2550,7 +3282,7 @@ done
 - **\*-MAGs.zip** (zip files containing directories of high-quality MAGs)
 
 
-#### 23d. MAG taxonomic classification
+#### 22d. MAG taxonomic classification
 Uses default `gtdbtk` database setup with program's `download.sh` command.
 
 ```bash
@@ -2567,13 +3299,13 @@ gtdbtk classify_wf --genome_dir MAGs/ -x fa --out_dir gtdbtk-output-dir  --skip_
 
 **Input Data:**
 
-- MAGs/\*.fasta (directory holding high-quality MAGs from [step 14c](#14c-filtering-mags))
+- MAGs/\*.fasta (directory holding high-quality MAGs from [Step 22c](#22c-filtering-mags))
 
 **Output Data:**
 
 - gtdbtk-output-dir/gtdbtk.\*.summary.tsv (files with assigned taxonomy and info)
 
-#### 23e. Generating overview table of all MAGs
+#### 22e. Generating overview table of all MAGs
 
 ```bash
 # combine summaries
@@ -2613,10 +3345,10 @@ cat MAGs-overview-header.tmp MAGs-overview-sorted.tmp \
 
 **Input Data:**
 
-- assembly-summaries_GLmetagenomics.tsv (table of assembly summary statistics from [step 5b](#5b-summarizing-assemblies))
-- MAGs/\*.fasta (directory holding high-quality MAGs from [step 14c](#14c-filtering-mags))
-- checkm-MAGs-overview.tsv (tab-delimited file with quality estimates per MAG from [step 14c](#14c-filtering-mags))
-- gtdbtk-output-dir/gtdbtk.\*.summary.tsv (directory of files with assigned taxonomy and info from [step 14d](#14d-mag-taxonomic-classification))
+- assembly-summaries_GLmetagenomics.tsv (table of assembly summary statistics from [Step 13b](#13b-summarizing-assemblies))
+- MAGs/\*.fasta (directory holding high-quality MAGs from [Step 22c](#22c-filtering-mags))
+- checkm-MAGs-overview.tsv (tab-delimited file with quality estimates per MAG from [Step 22c](#22c-filtering-mags))
+- gtdbtk-output-dir/gtdbtk.\*.summary.tsv (directory of files with assigned taxonomy and info from [Step 22d](#22d-mag-taxonomic-classification))
 
 **Output Data:**
 
@@ -2627,9 +3359,9 @@ cat MAGs-overview-header.tmp MAGs-overview-sorted.tmp \
 
 ---
 
-### 24. Generating MAG-level functional summary overview
+### 23. Generating MAG-level functional summary overview
 
-#### 24a. Getting KO annotations per MAG
+#### 23a. Getting KO annotations per MAG
 This utilizes the helper script [`parse-MAG-annots.py`](../Workflow_Documentation/NF_MGIllumina/workflow_code/bin/parse-MAG-annots.py) 
 
 ```bash
@@ -2659,15 +3391,15 @@ done
 
 **Input Data:**
 
-- \*-gene-coverage-annotation-and-tax.tsv (sample gene-coverage-annotation-and-tax.tsv file generated in [step 11](#11-combining-gene-level-coverage-taxonomy-and-functional-annotations-into-one-table-for-each-sample))
-- MAGs/\*.fasta (directory holding high-quality MAGs from [step 14c](#14c-filtering-mags))
+- \*-gene-coverage-annotation-and-tax.tsv (sample gene-coverage-annotation-and-tax.tsv file generated in [Step 19](#19-combining-gene-level-coverage-taxonomy-and-functional-annotations-into-one-table-for-each-sample))
+- MAGs/\*.fasta (directory holding high-quality MAGs from [Step 22c](#22c-filtering-mags))
 
 **Output Data:**
 
 - **MAG-level-KO-annotations_GLmetagenomics.tsv** (tab-delimited table holding MAGs and their KO annotations)
 
 
-#### 24b. Summarizing KO annotations with KEGG-Decoder
+#### 23b. Summarizing KO annotations with KEGG-Decoder
 
 ```bash
 KEGG-decoder -v interactive -i MAG-level-KO-annotations_GLmetagenomics.tsv -o MAG-KEGG-Decoder-out_GLmetagenomics.tsv
@@ -2676,12 +3408,12 @@ KEGG-decoder -v interactive -i MAG-level-KO-annotations_GLmetagenomics.tsv -o MA
 **Parameter Definitions:**  
 
 - `-v interactive` – specifies to create an interactive html output
-- `-i` – specifies the input MAG-level-KO-annotations_GLmetagenomics.tsv file generated in [step 15a](#15a-getting-ko-annotations-per-mag)
+- `-i` – specifies the input MAG-level-KO-annotations_GLmetagenomics.tsv file generated in [Step 23a](#23a-getting-ko-annotations-per-mag)
 - `-o` – specifies the output table
 
 **Input Data:**
 
-- MAG-level-KO-annotations_GLmetagenomics.tsv (tab-delimited table holding MAGs and their KO annotations, generated in [step 15a](#15a-getting-ko-annotations-per-mag))
+- MAG-level-KO-annotations_GLmetagenomics.tsv (tab-delimited table holding MAGs and their KO annotations, generated in [Step 23a](#23a-getting-ko-annotations-per-mag))
 
 **Output Data:**
 

From 0e0cf11532ccb10b938771deab46e67b33913398 Mon Sep 17 00:00:00 2001
From: olabiyi <obadbotanist@yahoo.com>
Date: Thu, 2 Oct 2025 14:08:27 -0700
Subject: [PATCH 07/47] fixed broken links

---
 .../Low_Biomass/Nanopore/GL-DPPD-XXXX.md      | 24 +++++++++----------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
index 3f4851636..b72538e36 100644
--- a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
+++ b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
@@ -40,7 +40,7 @@ Barbara Novak (GeneLab Data Processing Lead)
       - [5a. Trim Filtered Data](#5a-trim-filtered-data)
       - [5b. Trimmed Data QC](#5b-trimmed-data-qc)
       - [5c. Compile Trimmed Data QC](#5c-compile-filtered-data-qc)
-    - [6. Contaminant Removal](#7-remove-contaminants)
+    - [6. Contaminant Removal](#6-remove-contaminants)
       - [6a. Assemble Contaminants](#6a-assemble-contaminants)
       - [6b. Build Contaminant Index and Map Reads](#6b-build-contaminant-index-and-map-reads)
       - [6c. Sort and Index Contaminant Reads](#6c-sort-and-index-contaminant-alignments)
@@ -556,7 +556,7 @@ minimap2 -t NumberOfThreads -a -x splice blanks.mmi /path/to/trimmed_reads/sampl
 
 - sample.sam (Reads aligned to contaminant assembly)
 
-#### 6b. Sort and Index Contaminant Alignments
+#### 6c. Sort and Index Contaminant Alignments
 ```bash
 # Sort Sam, convert to bam and create index
 samtools sort --threads NumberOfThreads -o sample_sorted.bam sample.sam > sample_sort.log 2>&1
@@ -577,14 +577,14 @@ samtools index sample_sorted.bam sample_sorted.bam.bai
 
 **Input Data:**
 
-- sample.sam (Reads aligned to contaminant assembly, output from [Step 6a](#6a-build-contaminant-index-and-map-reads))
+- sample.sam (Reads aligned to contaminant assembly, output from [Step 6b](#6b-build-contaminant-index-and-map-reads))
 
 **Output Data:**
 
 - sample_sorted.bam (sorted mapping to contaminant assembly)
 - sample_sorted.bam.bai (index of sorted mapping to contaminant assembly)
 
-#### 6c. Gather Contaminant Mapping Metrics
+#### 6d. Gather Contaminant Mapping Metrics
 
 ```bash
 
@@ -603,8 +603,8 @@ samtools idxstats sample_sorted.bam  > sample_idxstats.txt 2> sample_idxstats.lo
 
 **Input Data:**
 
-- sample_sorted.bam (sorted mapping to contaminant assembly, output from [Step 6b](#6b-sort-and-index-contaminant-alignments))
-- sample_sorted.bam.bai (index of sorted mapping to contaminant assembly, output from [Step 6b](#6b-sort-and-index-contaminant-alignments))
+- sample_sorted.bam (sorted mapping to contaminant assembly, output from [Step 6c](#6c-sort-and-index-contaminant-alignments))
+- sample_sorted.bam.bai (index of sorted mapping to contaminant assembly, output from [Step 6c](#6c-sort-and-index-contaminant-alignments))
 
 **Output Data:**
 
@@ -612,7 +612,7 @@ samtools idxstats sample_sorted.bam  > sample_idxstats.txt 2> sample_idxstats.lo
 - sample_stats.txt (comprehensive alignment statistics)
 - sample_idxstats.txt (contig alignment summary statistics)
 
-#### 6d. Generate Decontaminated Read Files
+#### 6e. Generate Decontaminated Read Files
 ```bash
 # Retain reads that do not map to contaminants
 samtools fastq -t -f 4 sample_sorted.bam | gzip --to-stdout > sample_blank_removed.fastq.gz
@@ -629,13 +629,13 @@ samtools fastq -t -f 4 sample_sorted.bam | gzip --to-stdout > sample_blank_remov
 
 **Input Data:**
 
-- sample_sorted.bam (sorted mapping to contaminant assembly, output from [Step 6b](#6b-sort-and-index-contaminant-alignments))
+- sample_sorted.bam (sorted mapping to contaminant assembly, output from [Step 6c](#6c-sort-and-index-contaminant-alignments))
 
 **Output Data:**
 
 - sample_blank_removed.fastq.gz (blank removed reads in fastq format)
 
-#### 6e. Contaminant Removal QC
+#### 6f. Contaminant Removal QC
 
 ```bash
 NanoPlot --only-report \
@@ -656,7 +656,7 @@ NanoPlot --only-report \
 
 **Input Data:**
 
-- sample_blank_removed.fastq.gz (blank removed reads, output from [Step 6d](#6d-generate-decontaminated-read-files))
+- sample_blank_removed.fastq.gz (blank removed reads, output from [Step 6e](#6e-generate-decontaminated-read-files))
 
 **Output Data:**
 
@@ -665,7 +665,7 @@ NanoPlot --only-report \
 - /path/to/noblank_nanoplot_output/sample_NanoStats.txt (text file containing basic statistics)
 
 
-#### 6f. Compile Contaminant Removal QC
+#### 6g. Compile Contaminant Removal QC
 
 ```bash
 multiqc --zip-data-dir \ 
@@ -684,7 +684,7 @@ multiqc --zip-data-dir \
 
 **Input Data:**
 
-- /path/to/noblank_nanoplot_output/*NanoStats.txt (NanoPlot output data, output from [Step 6e](#6e-contaminant-removal-qc))
+- /path/to/noblank_nanoplot_output/*NanoStats.txt (NanoPlot output data, output from [Step 6f](#6f-contaminant-removal-qc))
 
 **Output Data:**
 

From 2e4f7f1eed151da0a1824874762410794980103e Mon Sep 17 00:00:00 2001
From: olabiyi <obadbotanist@yahoo.com>
Date: Thu, 2 Oct 2025 14:13:39 -0700
Subject: [PATCH 08/47] fixed broken links

---
 Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
index b72538e36..b11416d1f 100644
--- a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
+++ b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
@@ -40,7 +40,7 @@ Barbara Novak (GeneLab Data Processing Lead)
       - [5a. Trim Filtered Data](#5a-trim-filtered-data)
       - [5b. Trimmed Data QC](#5b-trimmed-data-qc)
       - [5c. Compile Trimmed Data QC](#5c-compile-filtered-data-qc)
-    - [6. Contaminant Removal](#6-remove-contaminants)
+    - [6. Contaminant Removal](#6-contaminant-removal)
       - [6a. Assemble Contaminants](#6a-assemble-contaminants)
       - [6b. Build Contaminant Index and Map Reads](#6b-build-contaminant-index-and-map-reads)
       - [6c. Sort and Index Contaminant Reads](#6c-sort-and-index-contaminant-alignments)

From 9c150ef870151234408b5b7ec8cfb0df4524b806 Mon Sep 17 00:00:00 2001
From: olabiyi <obadbotanist@yahoo.com>
Date: Mon, 6 Oct 2025 14:25:24 -0700
Subject: [PATCH 09/47] made edits to functions and the overall document

---
 .../Low_Biomass/Nanopore/GL-DPPD-XXXX.md      | 237 ++++++++++--------
 1 file changed, 130 insertions(+), 107 deletions(-)

diff --git a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
index b11416d1f..0b26010c5 100644
--- a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
+++ b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
@@ -156,7 +156,6 @@ Barbara Novak (GeneLab Data Processing Lead)
 | R | 4.5.1 | [https://www.r-project.org](https://www.r-project.org) |
 |Bioconductor | 3.21 | [https://www.bioconductor.org](https://www.bioconductor.org) |
 |decontam| 1.28.0 | [https://www.bioconductor.org/packages/release/bioc/html/decontam.html](https://www.bioconductor.org/packages/release/bioc/html/decontam.html) |
-|glue| 1.8.0 | [https://cran.r-project.org/web/packages/glue/index.html](https://cran.r-project.org/web/packages/glue/index.html) |
 |optparse| 1.7.5 |[https://cran.r-project.org/web/packages/optparse/index.html](https://cran.r-project.org/web/packages/optparse/index.html) |
 |pavian| 1.2.1 | [https://github.com/fbreitwieser/pavian](https://github.com/fbreitwieser/pavian) |
 |pheatmap| 1.0.13 | [https://cran.r-project.org/package=pheatmap](https://cran.r-project.org/package=pheatmap) |
@@ -782,6 +781,7 @@ kraken2-build --clean --db kraken2_host_db/
 - kraken2_host_db/ - Kraken2 database directory
 
 #### 7b. Remove host reads
+
 ```bash
 kraken2 --db kraken2_host_db/ --gzip-compressed --threads NumberOfThreads --use-names \
         --output sample-kraken2-output.txt --report sample-kraken2-report.tsv \
@@ -825,7 +825,6 @@ gzip sample_host_removed.fastq
 library(decontam)
 library(phyloseq)
 library(tidyverse)
-library(glue)
 library(pheatmap)
 library(pavian)
 ```
@@ -839,15 +838,17 @@ library(pavian)
   ```R
   get_last_assignment <- function(taxonomy_string, split_by=';', remove_prefix=NULL) {
 
+    # Split taxonomy string by the supplied delimiter 'split_by'
+    # then convert the list of parts to a vector of parts
     split_names <- strsplit(x =  taxonomy_string , split = split_by) %>% 
       unlist()
-    
+    # Get the last part of the split string
     level_name <- split_names[[length(split_names)]]
     
     if(level_name == "_"){
       return(taxonomy_string)
     }
-    
+    # remove an unwanted prefix if specified
     if(!is.null(remove_prefix)){
       level_name <- gsub(pattern = remove_prefix, replacement = '', x = level_name)
     }
@@ -857,7 +858,7 @@ library(pavian)
   ```
 
   **Function Parameter Definitions:**
-  - `taxonomy_string` - a character string containing a list of taxonomy assignments
+  - `taxonomy_string` - a character string containing a list of taxonomy assignments separated by `split_by`
   - `split_by=` - a character string containing a regular expression used to split the `taxonomy_string`
   - `remove_prefix=` - a character string containing a regular expression to be matched and removed, default=`NULL`
 
@@ -866,7 +867,7 @@ library(pavian)
 
 ##### mutate_taxonomy()
 <details>
-  <summary>ensure that the taxonomy column is named "taxonomy" and aggregate duplicates to ensure that taxonomy names are unique</summary>
+  <summary>mutate taxonomy column to contain the lowest taxonomy assignment</summary>
 
   ```R
   mutate_taxonomy <- function(df, taxonomy_column="taxonomy") {
@@ -874,7 +875,7 @@ library(pavian)
     # make sure that the taxonomy column is always named taxonomy
     col_index <- which(colnames(df) == taxonomy_column)
     colnames(df)[col_index] <- 'taxonomy'
-    df <- df %>% dplyr::mutate(across( where(is.numeric), \(x) tidyr::replace_na(x,0)  ) )%>% 
+    df <- df %>% dplyr::mutate(across( where(is.numeric), function(x) tidyr::replace_na(x,0)  ) )%>% 
       dplyr::mutate(taxonomy=map_chr(taxonomy,.f = function(taxon_name=.x){
         last_assignment <- get_last_assignment(taxon_name) 
         last_assignment  <- gsub(pattern = "\\[|\\]|'", replacement = '',x = last_assignment)
@@ -891,7 +892,7 @@ library(pavian)
   - `df` - a dataframe containing the taxonomy assignments
   - `taxonomy_column=` - name of the column in `df` containing the taxonomy assignments, default="taxonomy"
 
-  **Returns:** a dataframe with unique taxonomy names stored in a column named "taxonomy"
+  **Returns:** a dataframe with unique lowest-level taxonomy names stored in a column named "taxonomy"
 
 </details>
 
@@ -904,11 +905,11 @@ library(pavian)
   
     abs_abun_df <-  read_delim(file = file_path,
                                delim = "\t",
-                               col_names = TRUE) %>% 
+                               col_names = TRUE) %>% # read input table
              select(sample, reads, taxonomy=!!sym(taxon_col)) %>%
-             pivot_wider(names_from = "sample", values_from = "reads",
-                             names_sort = TRUE) %>%
-             mutate_taxonomy
+             pivot_wider(names_from = "sample", values_from = "reads", 
+                             names_sort = TRUE) %>% # convert long dataframe to wide dataframe
+             mutate_taxonomy # mutate the taxonomy column such that it contains only the lowest taxonomy assignment
   
     # Set the taxon names as row names, drop the taxonomy column and convert to a matrix
     rownames(abs_abun_df) <- abs_abun_df[,"taxonomy"]
@@ -936,7 +937,7 @@ library(pavian)
   process_kraken_table <- function(reports_dir) {
 
     reports <- read_reports(reports_dir)
-
+    # Retrieve sample names from file names
     samples <- names(reports) %>%
                   str_split("-") %>%
                   map_chr(function(x) pluck(x, 1))
@@ -957,12 +958,14 @@ library(pavian)
       as.data.frame() %>% 
       rename(species=name)
 
+    # Set rownames as species name, drop species column
+    # and convert table from dataframe to matrix
     species_names <- species_table[,"species"]
     rownames(species_table) <- species_names
     species_table <- species_table[,-(which(colnames(species_table) == "species"))]
     species_table <- as.matrix(species_table)
     
-    return(species_table)\
+    return(species_table)
   }
   ```
 
@@ -986,11 +989,11 @@ library(pavian)
                         mutate( across(everything(), function(x) (x/sum(x, na.rm = TRUE))*100 ) )  %>% # calculate species relative abundance per sample
         select(
                 where( ~all(!is.na(.)) )
-              )  %>% # drop columns where none of the reads were classified or were non-microbial
+              )  %>% # drop columns where none of the reads were classified or were non-microbial (NA)
               rownames_to_column("Species") 
-      
+
+    # Set rownames as species name and drop species column  
     rownames(abund_table) <- abund_table$Species
-      
     abund_table <- abund_table[, -match(x = "Species", colnames(abund_table))] %>% t
 
     return(abund_table)
@@ -1012,25 +1015,27 @@ library(pavian)
   ```R
   filter_rare <- function(species_table, non_microbial, threshold=1) {
     
+    # Drop species matching the 'non_microbial' regex
     clean_tab_count  <-  species_table %>% 
                          as.data.frame %>% 
                          rownames_to_column("Species") %>% 
                          filter(str_detect(Species, non_microbial, negate = TRUE))  
-    
+    # Calculate species relative abundance
     clean_tab <- clean_tab_count %>% 
-      mutate( across( where(is.numeric), \(x) (x/sum(x, na.rm = TRUE))*100 ) )
-    
+      mutate( across( where(is.numeric), function(x) (x/sum(x, na.rm = TRUE))*100 ) )
+    # Set rownames as species name and drop species column  
     rownames(clean_tab) <- clean_tab$Species
     clean_tab  <- clean_tab[,-1] 
     
     
-    # Get species with relative abundance less than 1% in all samples
-    rare_species <- map(clean_tab, .f = \(col) rownames(clean_tab)[col < threshold])
+    # Get species with relative abundance less than `threshold` in all samples
+    rare_species <- map(clean_tab, .f = function(col) rownames(clean_tab)[col < threshold])
     rare <- Reduce(intersect, rare_species)
     
+    # Set rownames as species name and drop species column  
     rownames(clean_tab_count) <- clean_tab_count$Species
     clean_tab_count  <- clean_tab_count[,-1] 
-    
+    # Drop rare species
     abund_table <- clean_tab_count[!(rownames(clean_tab_count) %in% rare), ]
     
     return(abund_table)
@@ -1038,11 +1043,11 @@ library(pavian)
   ```
 
   **Function Parameter Definitions:**
-  - `species_table` - the dataframe to filter
-  - `non_microbial` - a character vector denoting the string used to identify a species as non-microbial
+  - `species_table` - the species matrix to filter with species and samples as rows and columns, respectively.
+  - `non_microbial` - a regex denoting the string used to identify a species as non-microbial or unwanted
   - `threshold=` - abundance threshold used to determine if the relative abundance is rare, value denotes a percentage between 0 and 100.
 
-  **Returns:** a dataframe with rare and non_microbial assignments removed
+  **Returns:** a dataframe with rare and non_microbial/unwanted species removed
 </details>
 
 
@@ -1052,20 +1057,20 @@ library(pavian)
 
   ```R
   # Make bar plot
-  make_plot <- function(abund_table, metadata, colors2use, publication_format) {
-    
+  make_plot <- function(abund_table, metadata, colors2use, publication_format, samples_column="Sample_ID", prefix_to_remove="barcode") {
+    # Prepare table
     abund_table_wide <- abund_table %>% 
         as.data.frame() %>% 
-        rownames_to_column("Sample_ID") %>% 
-        inner_join(metadata) %>% 
+        rownames_to_column(samples_column) %>% 
+        inner_join(metadata) %>% # join abundance table and metadata by `samples_column`
         select(!!!colnames(metadata), everything()) %>% 
-        mutate(Sample_ID = Sample_ID %>% str_remove("barcode"))
-        
+        mutate(Sample_ID = Sample_ID %>% str_remove(prefix_to_remove))
+    # Convert table from wide to long format for plotting
     abund_table_long <- abund_table_wide  %>%
         pivot_longer(-colnames(metadata), 
                     names_to = "Species",
                     values_to = "relative_abundance")
-      
+    # Make relative abundance plot  
     p <- ggplot(abund_table_long, mapping = aes(x=Sample_ID, y=relative_abundance, fill=Species)) +
          geom_col() +
          scale_fill_manual(values = colors2use) + 
@@ -1077,19 +1082,21 @@ library(pavian)
   ```
 
   **Function Parameter Definitions:**
-  - `abund_table` - a dataframe containing the data to plot
-  - `metadata` - a vector of strings specifying the data to include in the plot
+  - `abund_table` - a relative abundance dataframe with rows summing to 100%
+  - `metadata` - a metadata dataframe with samples as rows and columns describing each sample
   - `colors2use` - a vector of strings specifying a custom color palette for coloring plots
-  - `publication_format` - a ggplot::theme object specifying the custom theme for plotting
+  - `publication_format` - a ggplot::theme object specifying a custom theme for plotting
+  - `samples_column` - a character string specifying the column in `metadata` holding sample names, default is "Sample_ID"
+  - `prefix_to_remove` - a string specifying a prefix or any character set to remove from sample names, default is "barcode"
 
-  **Returns:** a ggplot bar plot
+  **Returns:** a relative abundance stacked bar plot
 
 </details>
 
 
 ##### run_decontam()
 <details>
-  <summary>Feature table decoxntamination with decontam</summary>
+  <summary>Feature table decontamination with decontam</summary>
 
   ```R
   run_decontam <- function(feature_table, metadata, contam_threshold=0.1, prev_col=NULL, freq_col=NULL) {
@@ -1097,8 +1104,8 @@ library(pavian)
     sub_metadata <- metadata[colnames(feature_table),]
     # Modify NTC concentration
     # Often times the user may set the NTC concentration to zero because they think nothing 
-    # should be in the negative control but decontam fails if the value is zero.
-    # to prevent decontam from failing we use a very small concentration value
+    # should be in the negative control but decontam fails if the value is set to zero.
+    # To prevent decontam from failing, we replace zero with a very small concentration value
     # 0.0000001
     if (!is.null(freq_col)) {
 
@@ -1115,11 +1122,12 @@ library(pavian)
     # Create phyloseq object
     ps <- phyloseq(otu_table(feature_table, taxa_are_rows = TRUE), sample_data(sub_metadata))
 
-    # In our phyloseq object, "Sample_or_Control" is the sample variable that holds  the negative 
-    # control information. We’ll summarize that data as a logical variable, with TRUE for control 
-    # samples, as that is the form required by isContaminant
+    # In our phyloseq object, `prev_col` is the sample variable that holds the negative 
+    # control information. We'll summarize the data as a logical variable, with TRUE for control 
+    # samples, as that is the form required by isContaminant.
+    # The line below assumes that control samples will always be named "Control_Sample"
+    # in the `prev_col`.
     sample_data(ps)$is.neg <- sample_data(ps)[[prev_col]] == "Control_Sample"
-    contamdf <- isContaminant(ps, neg="is.neg", conc="input_conc_ng") # thresheld = 0.1 - default
 
     # Run Decontam 
     if (!is.null(freq_col) && !is.null(prev_col)) {   
@@ -1139,8 +1147,8 @@ library(pavian)
     
     } else {
 
-      cat("Both freq_col and prev_col cannot be set tdo NULL\n")
-      cat("please supply either one or both column names in your metadata")
+      cat("Both freq_col and prev_col cannot be set to NULL.\n")
+      cat("Please supply either one or both column names in your metadata ")
       cat("for frequency and prevalence based analysis, respectively\n")
       stop()
 
@@ -1151,10 +1159,10 @@ library(pavian)
   ```
 
   **Function Parameter Definitions:**
-  - `metadata` - a vector of strings specifying the data to include in the plot
-  - `feature_table` -  feature matrix to decontaminate with sample names as column and features as row
+  - `metadata` - a metadata dataframe with samples as rows and columns describing each sample
+  - `feature_table` -  feature [species, functions etc.] matrix to decontaminate with sample names as columns and features as rows
   - `prev_col` - a character column in metadata to be used for prevalence based analysis. Controls in this column should always be names "Control_Sample"
-  - `freq_col` - a numeric column in metadata to be use for frequency based analysis
+  - `freq_col` - a numeric column in metadata to be used for frequency based analysis
   - `contam_threshold` -  the probability threshold below which (strictly less than) the null-hypothesis 
                           (not a contaminant) should be rejected in favor of the alternate hypothesis (contaminant).
 
@@ -1171,28 +1179,28 @@ process_taxonomy <- function(taxonomy, prefix='\\w__') {
   
   taxonomy <- apply(X = taxonomy, MARGIN = 2, FUN = as.character) 
 
-  # replace NAa with Other and delete the D_num__ prefix from the taxonomy names
+  # replace NAs and empty cells with "Other" and delete the `prefix` from taxonomy names
   for (rank in colnames(taxonomy)) {
-    #delete the taxonomy prefix
+    # Delete the taxonomy prefix
     taxonomy[,rank] <- gsub(pattern = prefix, x = taxonomy[, rank],
                             replacement = '')
     indices <- which(is.na(taxonomy[,rank]))
     taxonomy[indices, rank] <- rep(x = "Other", times=length(indices)) 
-    #replace empty cell
+    # Replace empty cells with "Other"
     indices <- which(taxonomy[,rank] == "")
     taxonomy[indices,rank] <- rep(x = "Other", times=length(indices))
   }
+  # Replace underscore with space
   taxonomy <- apply(X = taxonomy,MARGIN = 2,
                     FUN =  gsub,pattern = "_",replacement = " ") %>% 
-    as.data.frame(stringAsfactor=F)
+    as.data.frame(stringsAsFactors=FALSE)
   return(taxonomy)
-}
 
 ```
 **Function Parameter Definitions:**
 
-- `taxonomy` - is a string specifying the taxonomic assignment file name
-- `prefix`  - is a regular expression specifying the characters to remove
+- `taxonomy` - is a taxonomy assignment dataframe with ranks [Phylum, Class .. Species] as columns and taxonomy assignments as rows
+- `prefix`  - is a regular expression specifying a character sequence to remove
               from taxon names
 
 **Returns:** a dataframe of reformated taxonomy names
@@ -1210,8 +1218,10 @@ format_taxonomy_table <- function(taxonomy,stringToReplace="Other",
   
   for (taxa_index in seq_along(taxonomy)) {
     
+    # Get the row indices of the current taxonomy column
+    # with values matching the string in `stringToReplace`
     indices <- grep(x = taxonomy[,taxa_index], pattern = stringToReplace)
-    
+    # Replace the value in those rows with the value in the preceding column's cell concatenated with `suffix`
     taxonomy[indices,taxa_index] <- 
       paste0(taxonomy[indices,taxa_index-1],
              rep(x = suffix, times=length(indices)))
@@ -1223,7 +1233,7 @@ format_taxonomy_table <- function(taxonomy,stringToReplace="Other",
 ```
 **Function Parameter Definitions:**
 - `taxonomy` -  taxonomy dataframe with taxonomy ranks as column names
-- `stringToReplace` - a vector of regex strings specifying what to replace
+- `stringToReplace` - a regex string specifying what to replace
 - `suffix` - string specifying the replacement value
 
 **Returns:** a dataframe of reformated taxonomy names
@@ -1249,22 +1259,21 @@ fix_names<- function(taxonomy,stringToReplace,suffix){
 ```
 **Function Parameter Definitions:**
 - `taxonomy` -  taxonomy dataframe with taxonomy ranks as column names
-- `stringToReplace` - a vector of regex strings specifying what to replace
+- `stringToReplace` - a regex string specifying what to replace
 - `suffix` - string specifying the replacement value
 
-**Returns:** a dataframe of detailed decontam results
+**Returns:** a dataframe of reformatted/cleaned taxonomy names
 
 </details>
 
 
 ##### read_input_table()
 <details>
-  <summary>read an input table into a dataframe</summary>
+  <summary>read an input table into a tibble</summary>
 
 ```R
 read_input_table <- function(file_name){
   
-   # Get depth from file name
    df <- read_delim(file = file_name, delim = "\t", comment = "#")
    return(df)
    
@@ -1273,8 +1282,7 @@ read_input_table <- function(file_name){
 **Function Parameter Definitions:**
 
 - `file_name` - path to file to be read
-
-**Returns:** a dataframe from input file
+**Returns:** a tibble generated from the input file
 
 </details>
 
@@ -1289,15 +1297,20 @@ read_contig_table <- function(file_name, sample_names){
   
   df <- read_input_table(file_name)
 
+  # Subset taxonomy portion (domain:species) of input table
+  # and replace empty/NA domain assignments with "Unclassified"
   taxonomy_table <- df %>%
     select(domain:species) %>%
     mutate(domain=replace_na(domain, "Unclassified"))
   
+  # Subset count table
   counts_table <- df %>% select(!!sample_names)
 
+  # Mutate taxonomy names
   taxonomy_table  <- process_taxonomy(taxonomy_table)
   taxonomy_table <- fix_names(taxonomy_table, "Other", ";_")
 
+  # Column bind taxonomy dataframe with species count dataframe
   df <- bind_cols(taxonomy_table, counts_table)
   
   return(df)
@@ -1307,10 +1320,10 @@ read_contig_table <- function(file_name, sample_names){
 
 **Function Parameter Definitions:**
 
-- `file_name` - path to file to be read
+- `file_name` - path to contig taxonomy assignment file to be read
 - `sample_names` - string of samples names to keep in the final dataframe
 
-**Returns:** a dataframe with cleaned taxonomy names
+**Returns:** a dataframe with cleaned taxonomy names and per-sample species counts
 
 </details>
 
@@ -1318,11 +1331,11 @@ read_contig_table <- function(file_name, sample_names){
 
 ##### get_sample_names()
 <details>
-  <summary>retrieve the name of samples for which assemblies were generated</summary>
+  <summary>retrieve sample names for which assemblies were generated</summary>
 
   ```R
 get_sample_names <- function (assembly_summary) {
-  # assembly_summary - path to assembly summary file
+
 
   overview_table <-  read_input_table(assembly_summary) %>%
                        select(
@@ -1383,7 +1396,7 @@ colours <- colorRampPalette(c('white','red'))(255)
 
 **Output Data:**
 
-- `publication_format` (a ggplot::theme object specifying the custom theme for plotting)
+- `publication_format` (a ggplot::theme object specifying a custom theme for plotting)
 - `custom_palette` (a vector of strings specifying a custom color palette for coloring plots)
 
 <br>
@@ -1397,7 +1410,7 @@ colours <- colorRampPalette(c('white','red'))(255)
 #### 9a. Build kaiju database
 
 ```bash
-# Make directory that will hold all the download kaiju database
+# Make a directory that will hold the downloaded kaiju database
 mkdir kaiju-db/ && cd kaiju-db/
 # Download kaiju's reference database
 kaiju-makedb -s nr_euk -t NumberOfThreads
@@ -1433,10 +1446,10 @@ kaiju -f kaiju-db/nr_euk/kaiju_db_nr_euk.fmi -t kaiju-db/nodes.dmp \
 
 **Parameter Definitions:**
 
-- `-f` - specifies path to the Kaiju database (.fmi) file
-- `-t` - specifies path to the Kaiju nodes.dmp file
+- `-f` - specifies path to the kaiju database (.fmi) file
+- `-t` - specifies path to the kaiju nodes.dmp file
 - `-z` - number of parallel processing threads to use
-- `-E` - specifies the minimum E-value in Greedy mode (default: 0.01)
+- `-E` - specifies the minimum E-value in Greedy mode (default: 1e-05)
 - `-i` - specifies path to the input file
 - `-o` - specifies the name of output file
 
@@ -1457,18 +1470,18 @@ kaiju -f kaiju-db/nr_euk/kaiju_db_nr_euk.fmi -t kaiju-db/nodes.dmp \
   kaiju2table -t nodes.dmp -n names.dmp -p -r species \
               -o merged_kaiju_table.tsv *_kaiju.out
 
-# Covert the file names to sample names
+# Convert file names to sample names
 sed -i -E 's/.+\/(.+)_kaiju\.out/\1/g' merged_kaiju_table.tsv && \
 sed -i -E 's/file/sample/' merged_kaiju_table.tsv
 ```
 
 **Parameter Definitions:**
 
-- `-n` - specifies path to the Kaiju names.dmp file
-- `-t` - specifies path to the Kaiju nodes.dmp file
+- `-n` - specifies path to the kaiju names.dmp file
+- `-t` - specifies path to the kaiju nodes.dmp file
 - `-r` - specifies taxonomic rank, must be one of: phylum, class, order, family, genus, species
 - `-o` - specifies the name of krona formatted kaiju output file
-- `*_kaiju.out` - positional argument specifying the path to the Kaiju output file (output from [Step 9ai](#9ai-read-taxonomic-classification-using-kaiju))
+- `*_kaiju.out` - positional argument specifying the path to the kaiju output file (output from [Step 9ai](#9ai-read-taxonomic-classification-using-kaiju))
 
 **Input Data:**
 
@@ -1490,9 +1503,9 @@ kaiju2krona -u -n kaiju-db/names.dmp -t kaiju-db/nodes.dmp \
 **Parameter Definitions:**
 
 - `-u` - include count for unclassified reads in output
-- `-n` - specifies path to the Kaiju names.dmp file
-- `-t` - specifies path to the Kaiju nodes.dmp file
-- `-i` - specifies path to the Kaiju output file (output from [Step 9b](#9b-kaiju-taxonomic-classification))
+- `-n` - specifies path to the kaiju names.dmp file
+- `-t` - specifies path to the kaiju nodes.dmp file
+- `-i` - specifies path to the kaiju output file (output from [Step 9b](#9b-kaiju-taxonomic-classification))
 - `-o` - specifies the name of krona formatted kaiju output file
 
 **Input Data:**
@@ -1544,7 +1557,7 @@ ktImportText  -o kaiju-report.html ${KTEXT_FILES[*]}
 **ktImportText**
 
 - `-o` - specifies the compiled output html file name
-- `${KTEXT_FILES[*]}` - a array positional arguement with the follow content: 
+- `${KTEXT_FILES[*]}` - an array positional argument with the following content: 
                      sample_1.krona,sample_1 sample_2.krona,sample_2 .. sample_n.krona,sample_n.
 
 **Input Data:**
@@ -1568,11 +1581,11 @@ write_csv(x = feature_table, file = "kaiju_species_table.csv")
 
 - `file_path` - path to compiled kaiju table at the species taxon level
 - `x`  - feature table dataframe to write to file
-- `file` - path to where to write kaiju count table per sample.
+- `file` - path to where to write kaiju count table per sample
 
 **Input Data:**
 
-- merged_kaiju_table.tsv (Compiled kaiju table at the species taxon level, from [Step 9c](#10c-compile-kaiju-taxonomy-results))
+- merged_kaiju_table.tsv (compiled kaiju table at the species taxon level, from [Step 9c](#9c-compile-kaiju-taxonomy-results))
 
 **Output Data:**
 
@@ -1618,7 +1631,7 @@ library(tidyverse)
 # Threshold to filter out potential false positive
 # taxonomy assignments
 filter_threshold <- 0.5
-# Filter out Rare and non-microbial assignment
+# Filter out Rare and non-microbial assignments.
 # You can add as many species that you'd like to filter out
 # using the following syntax "|species_name1|species_name2"
 non_microbial <- "Unclassifed|unclassified|Homo sapien|cannot|uncultured|unidentified"
@@ -1636,7 +1649,7 @@ ggsave(filename =  "unfiltered-kaiju_species_plot.png", plot = p,
        device = "png", width = plot_width, height = plot_height, units = "in", dpi = 300)
 
 
-# Get species with relative abundance greater than filter_threshold in all samples
+# Get species with relative abundance greater than `filter_threshold` in all samples
 # Drop rare and non-microbial assignments
 filtered_species_table  <- filter_rare(species_table, non_microbial, threshold=filter_threshold)
 
@@ -1656,7 +1669,7 @@ ggsave(filename = "filtered-kaiju_species_plot.png", plot = p,
 
 **Parameter Definitions:**
 
-- `filter_threshold` - a decimal threshold from 0-1 for filter out rare species i.e potential fals epositives.
+- `filter_threshold` - a decimal threshold from 0-1 for filtering out rare species, i.e., potential false positives.
 - `non_microbial` - a regex string  listing out assignmnets to drop before filtering based on the `filter_threshold` above. 
 
 **Input Data:**
@@ -1678,6 +1691,8 @@ ggsave(filename = "filtered-kaiju_species_plot.png", plot = p,
 ```R
 library(tidyverse)
 library(decontam)
+library(phyloseq)
+
 feature_table <- read_csv("filtered-kaiju_species_table.csv")
 # Set to 0.5 for a more aggressive approach where species more prevalent
 # in the negative controls are considered contaminants
@@ -1694,7 +1709,7 @@ contamdf <- run_decontam(feature_table, metadata, contam_threshold, prev_col, fr
 # Write decontam results table to file
 write_csv(x = contamdf %>% rownames_to_column("Species"), file = "decontam-kaiju_results.csv")
 
-# Get the list of contaminats identified by decontam
+# Get the list of contaminants identified by decontam
 contaminants <- contamdf %>%
                 as.data.frame %>%
                 rownames_to_column("Species") %>%
@@ -1730,7 +1745,7 @@ ggsave(filename = "decontaminated-kaiju-species_plot.png", plot = p,
 
 **Output Data:**
 
-- **decontam-kaiju_results.csv** (decontam's results table)
+- **decontam-kaiju_results.csv** (decontam's result table)
 - **decontaminated-kaiju_species_table.csv** (decontaminated species table)
 - **decontaminated-kaiju-species_plot.png** (barplot after filtering out contaminants)
 
@@ -1771,9 +1786,9 @@ tar -xvzf k2_pluspfp.tar.gz
 
 **wget**
 
-- `O` - name of file to download the url content to.
-- `--timeout=3600` - specifies the network timeout to seconds seconds
-- `--tries=0` - retry downdload infinitely.
+- `-O` - name of file to download the url content to
+- `--timeout=3600` - specifies the network timeout in seconds
+- `--tries=0` - retry download infinitely
 - `--continue` -  continue getting a partially-downloaded file
 - `*_URL` - position arguement specifying the url to download a particular resource from.
 
@@ -1782,7 +1797,7 @@ tar -xvzf k2_pluspfp.tar.gz
 
 - `INSPECT_URL=` - url specifying the location of kraken2 inspect file
 - `LIRARY_REPORT_URL=` -  url specifying the location of kraken2 library report file
-- `MD5_URL=` -  url specifying the location of md5 file of kraken database
+- `MD5_URL=` -  url specifying the location of the md5 file of the kraken database
 - `DB_URL=` - url specifying the location of the main kraken database archive in .tar.gz format
 
 **Output Data:**
@@ -1809,7 +1824,7 @@ kraken2 --db kraken2-db/ --gzip-compressed --threads NumberOfThreads --use-names
 
 **Input Data:**
 
-- kraken2-db/ (a direcory containing kraken 2 database files, output from [Step 10a](#10a-download-kraken2-database))
+- kraken2-db/ (a directory containing kraken 2 database files, output from [Step 10a](#10a-download-kraken2-database))
 - sample_host_removed.fastq.gz (gzipped reads fastq file, output from [Step 7b](#7b-remove-host-reads))
 
 **Output Data:**
@@ -1877,7 +1892,7 @@ ktImportText  -o kraken-report.html ${KTEXT_FILES[*]}
 **ktImportText**
 
 - `-o` - specifies the compiled output html file name
-- `${KTEXT_FILES[*]}` - a array positional arguement with the follow content: 
+- `${KTEXT_FILES[*]}` - an array positional argument with the following content: 
                      sample_1.krona,sample_1 sample_2.krona,sample_2 .. sample_n.krona,sample_n.
 
 **Input Data:**
@@ -1935,7 +1950,7 @@ species_table <- species_table[,-match("species", colnames(species_table))]
 
 **Parameter Definitions:**
 
-- `file` - path to input tables
+- `file` - path to input table
 - `delim` - file delimiter 
 
 **Input Data:**
@@ -1945,8 +1960,8 @@ species_table <- species_table[,-match("species", colnames(species_table))]
 
 **Output Data:**
 
-- `metadata` - a dataframe of sample-wise metadata
-- `species_table` - a dataframe
+- metadata (a dataframe of sample-wise metadata)
+- species_table (a dataframe of species count with rows and columns as species and sample names, respectively)
 
 
 #### 10g. Taxonomy barplots
@@ -1957,7 +1972,7 @@ library(tidyverse)
 # Threshold to filter out potential false positive
 # taxonomy assignments
 filter_threshold <- 0.5
-# Filter out Rare and non-microbial assignment
+# Filter out Rare and non-microbial assignments.
 # You can add as many species that you'd like to filter out
 # using the following syntax "|species_name1|species_name2"
 non_microbial <- "Unclassifed|unclassified|Homo sapien"
@@ -1975,7 +1990,7 @@ ggsave(filename =  "unfiltered-kraken_species_plot.png", plot = p, device = "png
        width = plot_width, height = plot_height, units = "in", dpi = 300)
 
 
-# Get species with relative abundance greater than filter_threshold in all samples
+# Get species with relative abundance greater than `filter_threshold` in all samples
 # Drop rare and non-microbial assignments
 filtered_species_table  <- filter_rare(species_table, non_microbial, threshold=filter_threshold)
 
@@ -1995,8 +2010,8 @@ ggsave(filename = "filtered-kraken_species_plot.png", plot = p,
 
 **Parameter Definitions:**
 
-- `filter_threshold` - a decimal threshold from 0-1 for filter out rare species i.e potential fals epositives.
-- `non_microbial` - a regex string  listing out assignmnets to drop before filtering based on the `filter_threshold` above. 
+- `filter_threshold` - a decimal threshold from 0-1 to filter out rare species i.e potential false positives
+- `non_microbial` - a regex string listing out assignments to drop before filtering based on the `filter_threshold` above 
 
 **Input Data:**
 
@@ -2017,13 +2032,15 @@ Feature decontamination with decontam. Decontam is an R package that statistical
 ```R
 library(tidyverse)
 library(decontam)
+library(phyloseq)
 
 feature_table <- read_csv("filtered-kraken_species_table.csv")
 # Set to 0.5 for a more aggressive approach where species more prevalent
 # in the negative controls are considered contaminants
 contam_threshold <- 0.1
 # Control samples in this column should always be written as
-# "Control_Sample" and true samples as "True_Sample"
+# "Control_Sample" and true samples as "True_Sample" for the function below to
+# work properly.
 prev_col <- "Sample_or_Control"
 freq_col <- "input_conc_ng"
 plot_width <- 18
@@ -2031,10 +2048,10 @@ plot_height <- 8
 
 contamdf <- run_decontam(feature_table, metadata, contam_threshold, prev_col, freq_col)
 
-# Write decontam results table to file
+# Write decontam result table to file
 write_csv(x = contamdf %>% rownames_to_column("Species"), file = "decontam-kraken_results.csv")
 
-# Get the list of contaminats identified by decontam
+# Get the list of contaminants identified by decontam
 contaminants <- contamdf %>%
                 as.data.frame %>%
                 rownames_to_column("Species") %>%
@@ -2070,7 +2087,7 @@ ggsave(filename = "decontaminated-kraken-species_plot.png", plot = p,
 
 **Output Data:**
 
-- **decontam-kraken_results.csv** (decontam's results table)
+- **decontam-kraken_results.csv** (decontam's result table)
 - **decontaminated-kraken_species_table.csv** (decontaminated species table)
 - **decontaminated-kraken-species_plot.png** (barplot after filtering out contaminants)
 
@@ -2448,7 +2465,7 @@ rm sample*.tmp*
 minimap2 -a -x map-ont \
         -t NumberOfThreads \
         sample_assembly.fasta sample_host_removed.fastq.gz \
-        > sample.sam  2> sample-mapping-info.txt | 
+        > sample.sam  2> sample-mapping-info.txt
 ```
 
 **Parameter Definitions:**
@@ -2748,6 +2765,8 @@ dev.off()
 ```R
 library(tidyverse)
 library(decontam)
+library(phyloseq)
+
 # Set to 0.5 for a more aggressive approach where species more prevalent
 # in the negative controls are considered contaminants
 contam_threshold <- 0.1
@@ -2912,8 +2931,10 @@ dev.off()
 ```R
 library(tidyverse)
 library(decontam)
+library(phyloseq)
+
 # Set to 0.5 for a more aggressive approach where species more prevalent
-# in the negative controls are considered contaminants
+# in negative controls are considered contaminants
 contam_threshold <- 0.1 
 # Control samples in this column should always be written as "Control_Sample" and true samples as "True_Sample"
 prev_col <- "Sample_or_Control"
@@ -3099,6 +3120,8 @@ dev.off()
 ```R
 library(tidyverse)
 library(decontam)
+library(phyloseq)
+
 # Set to 0.5 for a more aggressive approach where species more prevalent
 # in the negative controls are considered contaminants
 contam_threshold <- 0.1

From 519b9873c066536af86a5c296587fa267ccb4712 Mon Sep 17 00:00:00 2001
From: olabiyi <obadbotanist@yahoo.com>
Date: Thu, 9 Oct 2025 11:00:52 -0700
Subject: [PATCH 10/47] fixed run_decontam()

---
 .../Low_Biomass/Nanopore/GL-DPPD-XXXX.md      | 53 ++++++++++---------
 1 file changed, 28 insertions(+), 25 deletions(-)

diff --git a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
index 0b26010c5..fb16e66f9 100644
--- a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
+++ b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
@@ -705,7 +705,7 @@ multiqc --zip-data-dir \
   wget -O host.tar.gz --timeout=3600 --tries=0 --continue  host_url
 
   mkdir kraken2_host_db/ && \
-  tar -zxvf -C kraken2_host_db/ && \
+  tar -zxvf host.tar.gz -C kraken2_host_db/ && \
   rm -rf  host.tar.gz # Cleaning up
 ```
 
@@ -1057,28 +1057,31 @@ library(pavian)
 
   ```R
   # Make bar plot
-  make_plot <- function(abund_table, metadata, colors2use, publication_format, samples_column="Sample_ID", prefix_to_remove="barcode") {
-    # Prepare table
-    abund_table_wide <- abund_table %>% 
-        as.data.frame() %>% 
-        rownames_to_column(samples_column) %>% 
-        inner_join(metadata) %>% # join abundance table and metadata by `samples_column`
-        select(!!!colnames(metadata), everything()) %>% 
-        mutate(Sample_ID = Sample_ID %>% str_remove(prefix_to_remove))
-    # Convert table from wide to log format for plotting    
-    abund_table_long <- abund_table_wide  %>%
-        pivot_longer(-colnames(metadata), 
-                    names_to = "Species",
-                    values_to = "relative_abundance")
-    # Make relative abundance plot  
-    p <- ggplot(abund_table_long, mapping = aes(x=Sample_ID, y=relative_abundance, fill=Species)) +
-         geom_col() +
-         scale_fill_manual(values = colors2use) + 
-         labs(x=NULL, y="Relative Abundance (%)") + 
-         publication_format
-
-    return(p)
-  }
+make_plot <- function(abund_table, metadata, colors2use, publication_format,
+                      samples_column="Sample_ID", prefix_to_remove="barcode"){
+  
+abund_table_wide <- abund_table %>% 
+    as.data.frame() %>% 
+    rownames_to_column(samples_column) %>% 
+    inner_join(metadata) %>% 
+    select(!!!colnames(metadata), everything()) %>% 
+    mutate(!!samples_column := !!sym(samples_column) %>% str_remove(prefix_to_remove))
+    
+  
+abund_table_long <- abund_table_wide  %>%
+    pivot_longer(-colnames(metadata), 
+                 names_to = "Species",
+                 values_to = "relative_abundance")
+  
+p <- ggplot(abund_table_long, mapping = aes(x=!!sym(samples_column), 
+                                              y=relative_abundance, fill=Species)) +
+    geom_col() +
+    scale_fill_manual(values = colors2use) + 
+    labs(x=NULL, y="Relative Abundance (%)") + 
+    publication_format
+
+return(p)
+}
   ```
 
   **Function Parameter Definitions:**
@@ -1133,7 +1136,7 @@ library(pavian)
     if (!is.null(freq_col) && !is.null(prev_col)) {   
 
       # Run decontam in both prevalence and frequency modes
-      contamdf <- isContaminant(ps, neg=prev_col, conc=freq_col, threshold=contam_threshold) 
+      contamdf <- isContaminant(ps, neg="is.neg", conc=freq_col, threshold=contam_threshold) 
 
     } else if(!is.null(freq_col)) {
       
@@ -1143,7 +1146,7 @@ library(pavian)
     } else if(!is.null(prev_col)){
 
       # Run decontam in prevalence mode
-      contamdf <- isContaminant(ps, conc=freq_col, threshold=contam_threshold)
+      contamdf <- isContaminant(ps, neg="is.neg", threshold=contam_threshold)
     
     } else {
 

From ba251b7d470ff2f411fecac259d151a78862dec8 Mon Sep 17 00:00:00 2001
From: olabiyi <obadbotanist@yahoo.com>
Date: Thu, 9 Oct 2025 13:36:04 -0700
Subject: [PATCH 11/47] Changed how filtered table is generated for kraken and
 kaiju

---
 .../Low_Biomass/Nanopore/GL-DPPD-XXXX.md      | 25 ++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
index fb16e66f9..a5dc55483 100644
--- a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
+++ b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
@@ -1661,6 +1661,11 @@ filtered_species_table  <- filter_rare(species_table, non_microbial, threshold=f
 filtered_species_table <- count_to_rel_abundance(filtered_species_table)
 
 # Write filtered table to file
+table2write <- filtered_species_table %>%
+                 t %>%
+                as.data.frame() %>%
+                rownames_to_column("Species")
+
 write_csv(x = filtered_species_table, file = "filtered-kaiju_species_table.csv")
 
 # Make plot after filtering
@@ -1696,7 +1701,11 @@ library(tidyverse)
 library(decontam)
 library(phyloseq)
 
-feature_table <- read_csv("filtered-kaiju_species_table.csv")
+feature_table <- read_csv("filtered-kaiju_species_table.csv") %>%
+                  as.data.frame()
+
+ rownames(feature_table) <- feature_table$Species
+ feature_table <- feature_table[,-1]  %>% as.matrix()
 # Set to 0.5 for a more aggressive approach where species more prevalent
 # in the negative controls are considered contaminants
 contam_threshold <- 0.1
@@ -2002,7 +2011,12 @@ filtered_species_table  <- filter_rare(species_table, non_microbial, threshold=f
 filtered_species_table <- count_to_rel_abundance(filtered_species_table)
 
 # Write filtered table to file
-write_csv(x = filtered_species_table, file = "filtered-kraken_species_table.csv")
+table2write <- filtered_species_table %>%
+                 t %>%
+                 as.data.frame() %>%
+                rownames_to_column("Species")
+
+write_csv(x = table2write , file = "filtered-kraken_species_table.csv")
 
 # Make plot after filtering
 p <- make_plot(filtered_species_table , metadata, custom_palette, publication_format)
@@ -2037,7 +2051,12 @@ library(tidyverse)
 library(decontam)
 library(phyloseq)
 
-feature_table <- read_csv("filtered-kraken_species_table.csv")
+feature_table <- read_csv("filtered-kraken_species_table.csv") %>%
+                  as.data.frame()
+
+ rownames(feature_table) <- feature_table$Species
+ feature_table <- feature_table[,-1]  %>% as.matrix()
+ 
 # Set to 0.5 for a more aggressive approach where species more prevalent
 # in the negative controls are considered contaminants
 contam_threshold <- 0.1

From 069fb14df322ef4d9a9f1ba26e9580b6d0ab7dd6 Mon Sep 17 00:00:00 2001
From: olabiyi <obadbotanist@yahoo.com>
Date: Thu, 9 Oct 2025 13:40:15 -0700
Subject: [PATCH 12/47] Changed how filtered table is generated for kraken and
 kaiju

---
 Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
index a5dc55483..b2ac3fd26 100644
--- a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
+++ b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
@@ -1666,7 +1666,7 @@ table2write <- filtered_species_table %>%
                 as.data.frame() %>%
                 rownames_to_column("Species")
 
-write_csv(x = filtered_species_table, file = "filtered-kaiju_species_table.csv")
+write_csv(x = table2write, file = "filtered-kaiju_species_table.csv")
 
 # Make plot after filtering
 p <- make_plot(filtered_species_table , metadata, custom_palette, publication_format)
@@ -2056,7 +2056,7 @@ feature_table <- read_csv("filtered-kraken_species_table.csv") %>%
 
  rownames(feature_table) <- feature_table$Species
  feature_table <- feature_table[,-1]  %>% as.matrix()
- 
+
 # Set to 0.5 for a more aggressive approach where species more prevalent
 # in the negative controls are considered contaminants
 contam_threshold <- 0.1

From 285abb766db95e181242ae9d2a0c2af2e3530c8e Mon Sep 17 00:00:00 2001
From: olabiyi <obadbotanist@yahoo.com>
Date: Thu, 9 Oct 2025 14:16:31 -0700
Subject: [PATCH 13/47] Fixed decontamination_table bug

---
 .../Low_Biomass/Nanopore/GL-DPPD-XXXX.md      | 26 ++++++++++++++-----
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
index b2ac3fd26..4a0abb2bc 100644
--- a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
+++ b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
@@ -1734,14 +1734,21 @@ decontaminated_table <- feature_table %>%
                 filter(str_detect(Species, 
                                   pattern = str_c(contaminants,
                                                   collapse = "|"),
-                                  negate = TRUE)) %>%
-                select(-Species) %>% as.matrix
+                                  negate = TRUE))
+
+rownames(decontaminated_table) <- decontaminated_table$Species
+decontaminated_table <- decontaminated_table[,-1] %>% as.matrix
 
 # Convert count matrix to relative abundance matrix
 decontaminated_species_table <- count_to_rel_abundance(decontaminated_table)
 
 # Write decontaminated species table to file
-write_csv(x = decontaminated_species_table, file = "decontaminated-kaiju_species_table.csv")
+table2write <- decontaminated_species_table %>%
+                 t %>%
+                 as.data.frame() %>%
+                rownames_to_column("Species")
+
+write_csv(x = table2write, file = "decontaminated-kaiju_species_table.csv")
 
 # Make plot after filtering out contaminants
 p <- make_plot(decontaminated_species_table , metadata, custom_palette, publication_format)
@@ -2086,14 +2093,21 @@ decontaminated_table <- feature_table %>%
                 filter(str_detect(Species, 
                                   pattern = str_c(contaminants,
                                                   collapse = "|"),
-                                  negate = TRUE)) %>%
-                select(-Species) %>% as.matrix
+                                  negate = TRUE))
+
+rownames(decontaminated_table) <- decontaminated_table$Species
+decontaminated_table <- decontaminated_table[,-1] %>% as.matrix
 
 # Convert count matrix to relative abundance matrix
 decontaminated_species_table <- count_to_rel_abundance(decontaminated_table)
 
 # Write decontaminated species table to file
-write_csv(x = decontaminated_species_table, file = "decontaminated-kraken_species_table.csv")
+table2write <- decontaminated_species_table %>%
+                 t %>%
+                 as.data.frame() %>%
+                rownames_to_column("Species")
+
+write_csv(x = table2write, file = "decontaminated-kraken_species_table.csv")
 
 # Make plot after filtering out contaminants
 p <- make_plot(decontaminated_species_table , metadata, custom_palette, publication_format)

From f843916a837a45e05f4b7d20cddff111997ec733 Mon Sep 17 00:00:00 2001
From: olabiyi <obadbotanist@yahoo.com>
Date: Fri, 10 Oct 2025 09:30:06 -0700
Subject: [PATCH 14/47] Fixed bug with reading and writing species table

---
 .../Low_Biomass/Nanopore/GL-DPPD-XXXX.md        | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
index 4a0abb2bc..70722b426 100644
--- a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
+++ b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
@@ -1577,7 +1577,10 @@ ktImportText  -o kaiju-report.html ${KTEXT_FILES[*]}
 ```R
 library(tidyverse)
 feature_table <- process_kaiju_table (file_path="merged_kaiju_table.tsv")
-write_csv(x = feature_table, file = "kaiju_species_table.csv")
+table2write <- feature_table  %>%
+                as.data.frame() %>%
+                rownames_to_column("Species")
+write_csv(x = table2write, file = "kaiju_species_table.csv")
 ```
 
 **Parameter Definitions:**
@@ -1608,6 +1611,8 @@ row.names(metadata) <- metadata[,samples_column]
 
 # Read-in feature table
 species_table <- read_csv(file="kaiju_species_table.csv") %>%  as.data.frame()
+rownames(species_table) <- species_table$Species
+species_table <- species_table[,-1]  %>% as.matrix()
 ```
 
 **Parameter Definitions:**
@@ -1931,7 +1936,11 @@ library(pavian)
 
 reports_dir <- "/path/to/directory/with/*-kraken2-report.tsv"
 species_table <- process_kraken_table(reports_dir)
-write_csv(x = species_table, 
+table2write <- species_table  %>%
+                as.data.frame() %>%
+                rownames_to_column("Species")
+
+write_csv(x = table2write, 
           file = "kraken_species_table.csv")
 ```
 
@@ -1962,9 +1971,9 @@ metadata <- read_delim(file=metdata_file , delim = "\t") %>% as.data.frame()
 row.names(metadata) <- metadata[,samples_column]
 # Read-in feature table
 species_table <- read_csv(file="kraken_species_table.csv") %>%  as.data.frame()
-rownames(species_table) <- species_table$species
+rownames(species_table) <- species_table$Species
 # Drop the species column
-species_table <- species_table[,-match("species", colnames(species_table))]
+species_table <- species_table[,-match("Species", colnames(species_table))]
 ```
 
 **Parameter Definitions:**

From 96df97fab0c80cf45e2ba4e6fdccbcfdfa027e33 Mon Sep 17 00:00:00 2001
From: asaravia-butler <70983120+asaravia-butler@users.noreply.github.com>
Date: Wed, 12 Nov 2025 21:33:11 -0800
Subject: [PATCH 15/47] Updates to the dev low biomass pipeline document
 GL-DPPD-7116

* Formatting updates
* Update step names
* Add missing steps
* Fix broken links
---
 .../Low_Biomass/Nanopore/GL-DPPD-XXXX.md      | 1472 ++++++++++-------
 1 file changed, 845 insertions(+), 627 deletions(-)

diff --git a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
index 70722b426..80a2ed7f1 100644
--- a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
+++ b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
@@ -4,9 +4,9 @@
 
 ---
 
-**Date:** XXX NN, 2025  
+**Date:** November MM, 2025  
 **Revision:** -  
-**Document Number:** GL-DPPD-XXXX  
+**Document Number:** GL-DPPD-7116  
 
 **Submitted by:**  
 Olabiyi A. Obayomi (GeneLab Analysis Team)  
@@ -48,81 +48,83 @@ Barbara Novak (GeneLab Data Processing Lead)
       - [6e. Generate Decontaminated Read Files](#6e-generate-decontaminated-read-files)
       - [6f. Contaminant Removal QC](#6f-contaminant-removal-qc)
       - [6g. Compile Contaminant Removal QC](#6g-compile-contaminant-removal-qc)
-    - [7. Host Removal](#7-host-removal)
-      - [7a. Build or download host database](#7a-build-or-download-host-database)
-        - [7a.i. Download from URL](#7ai-download-from-url)
-        - [7a.ii. Build from custom reference](#7aii-build-from-custom-reference)
-        - [7a.iii. Build from host name](#7aiii-build-from-host-name)
-      - [7b. Remove Host Reads](#7b-remove-host-reads)
+    - [7. Human Read Removal](#7-human-read-removal)
+      - [7a. Build Kraken2 Database](#7a-build-kraken2-database)
+      - [7b. Remove Human Reads](#7b-remove-human-reads)
+      - [7c. Compile Human Read Removal QC](#7c-compile-human-read-removal-qc)
     - [8. R Environment Setup](#8-r-environment-setup)
-      - [8a. Load libraries](#8a-load-libraries)
+      - [8a. Load Libraries](#8a-load-libraries)
       - [8b. Define Custom Functions](#8b-define-custom-functions)
       - [8c. Set global variables](#8c-set-global-variables)
   - [**Read-based processing**](#read-based-processing)
     - [9. Taxonomic profiling using kaiju](#9-taxonomic-profiling-using-kaiju)
-      - [9a. Build kaiju database](#9a-build-kaiju-database)
+      - [9a. Build Kaiju Database](#9a-build-kaiju-database)
       - [9b. Kaiju Taxonomic Classification](#9b-kaiju-taxonomic-classification)
-      - [9c. Compile kaiju taxonomy results](#9c-compile-kaiju-taxonomy-results)
-      - [9d. Convert kaiju output to krona format](#9d-convert-kaiju-output-to-krona-format)
-      - [9e. Compile kaiju krona report](#9e-compile-kaiju-krona-report)
-      - [9f. Create kaiju species count table](#9f-create-kaiju-species-count-table)
-      - [9g. Read-in tables](#9g-read-in-tables)
-      - [9h. Taxonomy barplots](#9h-taxonomy-barplots)
-      - [9i. Feature decontamination](#9i-feature-decontamination)
-    - [10. Taxonomic Profiling using Kraken2](#10-taxonomic-profiling-using-kraken2)
-      - [10a. Download kraken2 database](#10a-download-kraken2-database)
-      - [10b. Taxonomic Classification](#10b-taxonomic-classification)
-      - [10c. Convert Kraken2 output to Krona format](#10c-convert-kraken2-output-to-krona-format)
-      - [10d. Compile kraken2 krona report](#10d-compile-kraken2-krona-report)
-      - [10e. Create kraken species count table](#10e-create-kraken-species-count-table)
-      - [10f. Read-in tables](#10f-read-in-tables)
-      - [10g. Taxonomy barplots](#10g-taxonomy-barplots)
-      - [10h. Feature decontamination](#10h-feature-decontamination)
+      - [9c. Compile Kaiju Taxonomy Results](#9c-compile-kaiju-taxonomy-results)
+      - [9d. Convert Kaiju Output To Krona Format](#9d-convert-kaiju-output-to-krona-format)
+      - [9e. Compile Kaiju Krona Reports](#9e-compile-kaiju-krona-reports)
+      - [9f. Create Kaiju Species Count Table](#9f-create-kaiju-species-count-table)
+      - [9g. Read-in Tables](#9g-read-in-tables)
+      - [9h. Taxonomy Barplots](#9h-taxonomy-barplots)
+      - [9i. Feature Decontamination](#9i-feature-decontamination)
+    - [10. Taxonomic Profiling Using Kraken2](#10-taxonomic-profiling-using-kraken2)
+      - [10a. Download Kraken2 Database](#10a-download-kraken2-database)
+      - [10b. Kraken2 Taxonomic Classification](#10b-kraken2-taxonomic-classification)
+      - [10c. Compile Kraken2 Taxonomy Results](#10c-compile-kraken2-taxonomy-results)
+        - [10ci. Create Merged Kraken2 Taxonomy Table](#10ci-create-merged-kraken2-taxonomy-table)
+        - [10cii. Compile Kraken2 Taxonomy Reports](#10cii-compile-kraken2-taxonomy-reports)
+      - [10d. Convert Kraken2 Output to Krona Format](#10d-convert-kraken2-output-to-krona-format)
+      - [10e. Compile Kraken2 Krona Reports](#10e-compile-kraken2-krona-reports)
+      - [10f. Create Kraken2 Species Count Table](#10f-create-kraken2-species-count-table)
+      - [10g. Read-in Tables](#10g-read-in-tables)
+      - [10h. Taxonomy Barplots](#10h-taxonomy-barplots)
+      - [10i. Feature Decontamination](#10i-feature-decontamination)
   - [**Assembly-based processing**](#assembly-based-processing)
-    - [11. Sample assembly](#11-sample-assembly)
-    - [12. Polish assembly](#12-polish-assembly)
-    - [13. Renaming contigs and summarizing assemblies](#13-renaming-contigs-and-summarizing-assemblies)
-      - [13a. Renaming contig headers](#13a-renaming-contig-headers)
-      - [13b. Summarizing assemblies](#13b-summarizing-assemblies)
-    - [14. Gene prediction](#14-gene-prediction)
-      - [14a. Remove line wraps in gene prediction output](#14a-remove-line-wraps-in-gene-prediction-output)
-    - [15. Functional annotation](#15-functional-annotation)
-      - [15a. Downloading reference database of HMM models (only needs to be done once)](#15a-downloading-reference-database-of-hmm-models-only-needs-to-be-done-once)
-      - [15b. Running KEGG annotation](#15b-running-kegg-annotation)
-      - [15c. Filtering output to retain only those passing the KO-specific score and top hits](#15c-filtering-output-to-retain-only-those-passing-the-ko-specific-score-and-top-hits)
-    - [16. Taxonomic classification](#16-taxonomic-classification)
-      - [16a. Pulling and un-packing pre-built reference db (only needs to be done once)](#16a-pulling-and-un-packing-pre-built-reference-db-only-needs-to-be-done-once)
-      - [16b. Running taxonomic classification](#16b-running-taxonomic-classification)
-      - [16c. Adding taxonomy info from taxids to genes](#16c-adding-taxonomy-info-from-taxids-to-genes)
-      - [16d. Adding taxonomy info from taxids to contigs](#16d-adding-taxonomy-info-from-taxids-to-contigs)
-      - [16e. Formatting gene-level output with awk and sed](#16e-formatting-gene-level-output-with-awk-and-sed)
-      - [16f. Formatting contig-level output with awk and sed](#16f-formatting-contig-level-output-with-awk-and-sed)
-    - [17. Read-mapping](#17-read-mapping)
+    - [11. Sample Assembly](#11-sample-assembly)
+    - [12. Polish Assembly](#12-polish-assembly)
+    - [13. Rename Contigs and Summarize Assemblies](#13-rename-contigs-and-summarize-assemblies)
+      - [13a. Rename Contig Headers](#13a-rename-contig-headers)
+      - [13b. Summarize Assemblies](#13b-summarize-assemblies)
+    - [14. Gene Prediction](#14-gene-prediction)
+      - [14a. Generate Gene Predictions](#14a-generate-gene-predictions)
+      - [14b. Remove Line Wraps In Gene Prediction Output](#14b-remove-line-wraps-in-gene-prediction-output)
+    - [15. Functional Annotation](#15-functional-annotation)
+      - [15a. Download Reference Database of HMM Models](#15a-download-reference-database-of-hmm-models)
+      - [15b. Run KEGG Annotation](#15b-run-kegg-annotation)
+      - [15c. Filter KO Outputs](#15c-filter-ko-outputs)
+    - [16. Taxonomic Classification](#16-taxonomic-classification)
+      - [16a. Pull and Unpack Pre-built Reference DB](#16a-pull-and-unpack-pre-built-reference-db)
+      - [16b. Run Taxonomic Classification](#16b-run-taxonomic-classification)
+      - [16c. Add Taxonomy Info From Taxids To Genes](#16c-add-taxonomy-info-from-taxids-to-genes)
+      - [16d. Add Taxonomy Info From Taxids To Contigs](#16d-add-taxonomy-info-from-taxids-to-contigs)
+      - [16e. Format Gene-level Output With awk and sed](#16e-format-gene-level-output-with-awk-and-sed)
+      - [16f. Format Contig-level Output With awk and sed](#16f-format-contig-level-output-with-awk-and-sed)
+    - [17. Read-Mapping](#17-read-mapping)
       - [17a. Align Reads to Sample Assembly](#17a-align-reads-to-sample-assembly)
       - [17b. Sort and Index Assembly Alignments](#17b-sort-and-index-assembly-alignments)
-    - [18. Getting coverage information and filtering based on detection](#18-getting-coverage-information-and-filtering-based-on-detection)
-      - [18a. Filtering coverage levels based on detection](#18a-filtering-coverage-levels-based-on-detection)
-      - [18b. Filtering gene and contig coverage based on requiring 50% detection and parsing down to just gene ID and coverage](#18b-filtering-gene-and-contig-coverage-based-on-requiring-50-detection-and-parsing-down-to-just-gene-id-and-coverage)
-    - [19. Combining gene-level coverage, taxonomy, and functional annotations into one table for each sample](#19-combining-gene-level-coverage-taxonomy-and-functional-annotations-into-one-table-for-each-sample)
-    - [20. Combining contig-level coverage and taxonomy into one table for each sample](#20-combining-contig-level-coverage-and-taxonomy-into-one-table-for-each-sample)
-    - [21. Generating normalized, gene- and contig-level coverage summary tables of KO-annotations and taxonomy across samples](#21-generating-normalized-gene--and-contig-level-coverage-summary-tables-of-ko-annotations-and-taxonomy-across-samples)
-      - [21a. Generating gene-level coverage summary tables](#21a-generating-gene-level-coverage-summary-tables)
+    - [18. Get Coverage Information and Filter Based On Detection](#18-get-coverage-information-and-filter-based-on-detection)
+      - [18a. Filter Coverage Levels Based On Detection](#18a-filter-coverage-levels-based-on-detection)
+      - [18b. Filter Gene and Contig Coverage Based On Detection](#18b-filter-gene-and-contig-coverage-based-on-detection)
+    - [19. Combine Gene-level Coverage, Taxonomy, and Functional Annotations For Each Sample](#19-combine-gene-level-coverage-taxonomy-and-functional-annotations-for-each-sample)
+    - [20. Combine Contig-level Coverage and Taxonomy For Each Sample](#20-combine-contig-level-coverage-and-taxonomy-for-each-sample)
+    - [21. Generate Normalized, Gene- and Contig-level Coverage Summary Tables of KO-annotations and Taxonomy Across Samples](#21-generate-normalized-gene--and-contig-level-coverage-summary-tables-of-ko-annotations-and-taxonomy-across-samples)
+      - [21a. Generate Gene-level Coverage Summary Tables](#21a-generate-gene-level-coverage-summary-tables)
       - [21b. Gene-level taxonomy heatmaps](#21b-gene-level-taxonomy-heatmaps)
       - [21c. Gene-level taxonomy decontamination](#21c-gene-level-taxonomy-decontamination)
       - [21d. Gene-level KO functions heatmaps](#21d-gene-level-ko-functions-heatmaps)
       - [21e. Gene-level KO functions decontamination](#21e-gene-level-ko-functions-decontamination)
-      - [21f. Generating contig-level coverage summary tables](#21f-generating-contig-level-coverage-summary-tables)
+      - [21f. Generate contig-level coverage summary tables](#21f-generate-contig-level-coverage-summary-tables)
       - [21g. Contig-level Heatmaps](#21g-contig-level-heatmaps)
       - [21h. Contig-level decontamination](#21h-contig-level-decontamination)
     - [22. **M**etagenome-**A**ssembled **G**enome (MAG) recovery](#22-metagenome-assembled-genome-mag-recovery)
-      - [22a. Binning contigs](#22a-binning-contigs)
-      - [22b. Bin quality assessment](#22b-bin-quality-assessment)
-      - [22c. Filtering MAGs](#22c-filtering-mags)
-      - [22d. MAG taxonomic classification](#22d-mag-taxonomic-classification)
-      - [22e. Generating overview table of all MAGs](#22e-generating-overview-table-of-all-mags)
-    - [23. Generating MAG-level functional summary overview](#23-generating-mag-level-functional-summary-overview)
-      - [23a. Getting KO annotations per MAG](#23a-getting-ko-annotations-per-mag)
-      - [23b. Summarizing KO annotations with KEGG-Decoder](#23b-summarizing-ko-annotations-with-kegg-decoder)
+      - [22a. Bin Contigs](#22a-bin-contigs)
+      - [22b. Bin Quality Assessment](#22b-bin-quality-assessment)
+      - [22c. Filter MAGs](#22c-filter-mags)
+      - [22d. MAG Taxonomic Classification](#22d-mag-taxonomic-classification)
+      - [22e. Generate Overview Table Of All MAGs](#22e-generate-overview-table-of-all-mags)
+    - [23. Generate MAG-level Functional Summary Overview](#23-generate-mag-level-functional-summary-overview)
+      - [23a. Get KO Annotations Per MAG](#23a-get-ko-annotations-per-mag)
+      - [23b. Summarize KO Annotations With KEGG-Decoder](#23b-summarize-ko-annotations-with-kegg-decoder)
 
 
 
@@ -187,13 +189,13 @@ dorado basecaller ${model} ${input_directory} \
 
 **Parameter Definitions:**
 
-- `--no-trim` - skips trimming of barcodes, adapters, and primers
-- `--device` - specifies CPU or GPU device; specifying 'auto' chooses either 'cpu' or 'gpu' depending on detected presence of a GPU device
-- `--recursive` - enables recursive scanning through input directory to load FAST5 and/or POD5 files
-- `--kit-name` - The nanopore barcoding kit used during sequencing preparation. Enables barcoding with the provided kit name; see [dorado documentation](https://software-docs.nanoporetech.com/dorado/1.1.1/barcoding/barcoding/) for a full list of accepted kit names
-- `--min-qscore` - specifies the minimum Q-score, reads with a mean Q-score below this threshold are discarded (default to `7` for this pipeline)
-- `model` - positional argument specifying the basecalling model to use or a path to the model directory. `hac` chooses the high accuracy model.
-- `input_directory` - positional argument specifying the location of the raw data in POD5 or FAST5 format
+- `model` - Positional argument specifying the basecalling model to use or a path to the model directory. `hac` chooses the high accuracy model.
+- `input_directory` - Positional argument specifying the location of the raw data in POD5 or FAST5 format.
+- `--no-trim` - Skips trimming of barcodes, adapters, and primers.
+- `--device` - Specifies CPU or GPU device; specifying 'auto' chooses either 'cpu' or 'gpu' depending on detected presence of a GPU device.
+- `--recursive` - Enables recursive scanning through input directory to load FAST5 and/or POD5 files.
+- `--kit-name` - The nanopore barcoding kit used during sequencing preparation. Enables barcoding with the provided kit name; see [dorado documentation](https://software-docs.nanoporetech.com/dorado/1.1.1/barcoding/barcoding/) for a full list of accepted kit names.
+- `--min-qscore` - Specifies the minimum Q-score; reads with a mean Q-score below this threshold are discarded (set to `7` in this pipeline).
 
 **Input Data:**
 
@@ -209,7 +211,7 @@ dorado basecaller ${model} ${input_directory} \
 
 ### 2. Demultiplexing
 
-#### 2a. Split fastq
+#### 2a. Split Fastq
 
 ```bash
 dorado demux \
@@ -222,10 +224,11 @@ dorado demux \
 
 **Parameter Definitions:**
 
-- `--output-dir` - specifies the output folder that is the root of the nested output structure. 
-- `--emit-fastq` - specifies that output is fastq format
-- `--emit-summary` - creates a summary listing each read and its classified barcode.
-- `--kit-name` - The nanopore barcoding kit used during sequencing preparation. Enables barcoding with the provided kit name; see [dorado documentation](https://software-docs.nanoporetech.com/dorado/1.1.1/barcoding/barcoding/) for a full list of accepted kit names
+- `--output-dir` - Specifies the output folder that is the root of the nested output structure.
+- `--emit-fastq` - Specifies that output is fastq format.
+- `--emit-summary` - Creates a summary listing each read and its classified barcode.
+- `--kit-name` - The nanopore barcoding kit used during sequencing preparation. Enables barcoding with the provided kit name; see [dorado documentation](https://software-docs.nanoporetech.com/dorado/1.1.1/barcoding/barcoding/) for a full list of accepted kit names.
+- `basecalled.bam` - Positional argument specifying the input bam file.
 
 **Input Data:**
 
@@ -235,18 +238,19 @@ dorado demux \
 
 - /path/to/fastq/output/\*_barcode\*.fastq (demultiplexed reads in fastq format)
 - /path/to/fastq/output/\*_unclassified.fastq (unclassified reads in fastq format)
-- /path/to/fastq/output/barcoding_summary.txt (barcode summary file listing each read, the file it was assigned to, and its classified barcode )
+- /path/to/fastq/output/barcoding_summary.txt (barcode summary file listing each read, the file it was assigned to, and its classified barcode)
 
 
-#### 2b. Concatenate files for each sample
+#### 2b. Concatenate Files For Each Sample
 
 ```bash
-# Change to directory containing split fastq files generated from step 2a. split fastq above
+# Change to the directory containing the split fastq files generated in step 2a
 cd /path/to/fastq/output/ # output of step 2a
+
 # Get unique barcode names from demultiplexed file names
-BARCODES=($(ls -1 *fastq* |sed -E 's/.+_(barcode[0-9]+)_.+/\1/g' | sort -u))
+BARCODES=($(ls -1 *fastq* | sed -E 's/.+_(barcode[0-9]+)_.+/\1/g' | sort -u))
 
-# Concat separate barcode/sample fastq files into per sample fastq gzippped files
+# Concatenate the separate fastq files for each barcode/sample into one gzipped fastq file per sample
 [ -d raw_data/ ] || mkdir raw_data/
 for sample in ${BARCODES[*]}; do
 
@@ -260,7 +264,8 @@ done
 
 **Parameter Definitions:**
 
-- `| gzip --to-stdout` - sends output from `cat` to `gzip` to create compressed fastq.gz file
+- `cat ${sample}/*` - Concatenates all fastq files with the same barcode into one fastq file.
+- `| gzip --to-stdout` - Sends the concatenated fastq file output from the `cat` command to the `gzip` command to create a compressed fastq.gz file for each barcode.
 
 **Input Data:**
 
@@ -280,20 +285,21 @@ done
 
 ```bash 
 NanoPlot --only-report \
-         --prefix sample_ \
+         --prefix sample_raw_ \
          --outdir /path/to/raw_nanoplot_output \
          --threads NumberOfThreads \
-         --fastq /path/to/raw_data/sample.fastq.gz
+         --fastq \
+         /path/to/raw_data/sample.fastq.gz
 ```
 
 **Parameter Definitions:**
 
-- `--outdir` – specifies the output directory to store results
-- `--only-report` - output only the report files
-- `--prefix` - adds a sample specific prefix to the name of each output file
-- `--threads` - number of parallel processing threads to use
-- `--fastq` - specifies that the input data is in a fastq format
-- `/path/to/raw_data/sample.fastq.gz` – the input reads are specified as a positional argument
+- `--only-report` - Outputs only the report files.
+- `--prefix` - Adds a sample-specific prefix to the name of each output file.
+- `--outdir` - Specifies the output directory to store results.
+- `--threads` - Number of parallel processing threads to use.
+- `--fastq` - Specifies that the input data is in fastq format.
+- `/path/to/raw_data/sample.fastq.gz` - The input reads file, provided as the argument to `--fastq`.
 
 **Input Data:**
 
@@ -301,9 +307,9 @@ NanoPlot --only-report \
 
 **Output Data:**
 
-- **/path/to/raw_nanoplot_output/sample_NanoPlot-report.html** (NanoPlot html summary)
-- /path/to/raw_nanoplot_output/sample_NanoPlot_<date>_<time>.log (NanoPlot log file)
-- /path/to/raw_nanoplot_output/sample_NanoStats.txt (text file containing basic statistics)
+- **/path/to/raw_nanoplot_output/sample_raw_NanoPlot-report.html** (NanoPlot html summary)
+- /path/to/raw_nanoplot_output/sample_raw_NanoPlot_\<date\>_\<time\>.log (NanoPlot log file)
+- /path/to/raw_nanoplot_output/sample_raw_NanoStats.txt (text file containing basic statistics)
 
 #### 3b. Compile Raw Data QC
 
@@ -311,20 +317,21 @@ NanoPlot --only-report \
 multiqc --zip-data-dir \
         --outdir raw_multiqc_report \
         --filename raw_multiqc \
-        --interactive /path/to/raw_nanoplot_output/
+        --interactive \
+        /path/to/raw_nanoplot_output/
 ```
 
 **Parameter Definitions:**
 
-- `--zip-data-dir` - compress the data directory
-- `--outdir` – the output directory to store results
-- `--filename` – the filename prefix of results
-- `--interactive` - force multiqc to always create interactive javascript plots
-- `/path/to/raw_nanoplot_output/` – the directory holding the output data from the NanoPlot run, provided as a positional argument
+- `--zip-data-dir` - Compresses the data directory.
+- `--outdir` - Specifies the output directory to store results.
+- `--filename` - Specifies the filename prefix of results.
+- `--interactive` - Forces multiqc to always create interactive javascript plots.
+- `/path/to/raw_nanoplot_output/` - The directory holding the output data from the NanoPlot run, provided as a positional argument.
 
 **Input Data:**
 
-- /path/to/raw_nanoplot_output/*NanoStats.txt (NanoPlot output data, from [Step 3a](#3a-raw-data-qc))
+- /path/to/raw_nanoplot_output/*raw_NanoStats.txt (NanoPlot output data, from [Step 3a](#3a-raw-data-qc))
 
 **Output Data:**
 
@@ -335,7 +342,7 @@ multiqc --zip-data-dir \
 
 ---
 
-### 4. Quality filtering
+### 4. Quality Filtering
 
 #### 4a. Filter Raw Data
 
@@ -345,8 +352,10 @@ filtlong --min_length 200 --min_mean_q 8 /path/to/raw_data/sample.fastq.gz > sam
 
 **Parameter Definitions:**
 
-- `--min_length` – specifies the minimum read length to retain (default to `200` for this pipeline)
-- `--min_mean_q` – specifies the minimum mean read quality (default to `8` for this pipeline)
+- `--min_length` - Specifies the minimum read length to retain (set to `200` in this pipeline).
+- `--min_mean_q` - Specifies the minimum mean read quality to retain (set to `8` in this pipeline).
+- `/path/to/raw_data/sample.fastq.gz` - The path to the input fastq file, provided as a positional argument.
+- `> sample_filtered.fastq` - Redirects the standard output (the filtered reads) to the sample_filtered.fastq file.
 
 **Input Data:**
 
@@ -361,19 +370,21 @@ filtlong --min_length 200 --min_mean_q 8 /path/to/raw_data/sample.fastq.gz > sam
 
 ```bash
 NanoPlot --only-report \
-         --prefix sample_ \
+         --prefix sample_filtered_ \
          --outdir /path/to/filtered_nanoplot_output \
          --threads NumberOfThreads \
-         --fastq sample_filtered.fastq
+         --fastq \
+         sample_filtered.fastq
 ```
 
 **Parameter Definitions:**
 
-- `--outdir` – specifies the output directory to store results
-- `--only-report` - output only the report files
-- `--prefix` - adds a sample specific prefix to the name of each output file
-- `--threads` - number of parallel processing threads to use
-- `sample_filtered.fastq` – the input reads are specified as a positional argument
+- `--only-report` - Outputs only the report files.
+- `--prefix` - Adds a sample-specific prefix to the name of each output file.
+- `--outdir` - Specifies the output directory to store results.
+- `--threads` - Number of parallel processing threads to use.
+- `--fastq` - Specifies that the input data is in fastq format.
+- `sample_filtered.fastq` - The input reads file, provided as the argument to `--fastq`.
 
 **Input Data:**
 
@@ -381,9 +392,9 @@ NanoPlot --only-report \
 
 **Output Data:**
 
-- **/path/to/filtered_nanoplot_output/sample_NanoPlot-report.html** (NanoPlot html summary)
-- /path/to/filtered_nanoplot_output/sample_NanoPlot_<date>_<time>.log (NanoPlot log file)
-- /path/to/filtered_nanoplot_output/sample_NanoStats.txt (text file containing basic statistics)
+- **/path/to/filtered_nanoplot_output/sample_filtered_NanoPlot-report.html** (NanoPlot html summary)
+- /path/to/filtered_nanoplot_output/sample_filtered_NanoPlot_\<date\>_\<time\>.log (NanoPlot log file)
+- /path/to/filtered_nanoplot_output/sample_filtered_NanoStats.txt (text file containing basic statistics)
 
 #### 4c. Compile Filtered Data QC
 
@@ -391,20 +402,21 @@ NanoPlot --only-report \
 multiqc  --zip-data-dir \ 
          --outdir filtered_multiqc_report \
          --filename filtered_multiqc \
-         --interactive /path/to/filtered_nanoplot_output/
+         --interactive \
+         /path/to/filtered_nanoplot_output/
 ```
 
 **Parameter Definitions:**
 
-- `--zip-data-dir` - compress the data directory
-- `--outdir` – the output directory to store results
-- `--filename` – the filename prefix of results
-- `--interactive` - force multiqc to always create interactive javascript plots
-- `/path/to/filtered_nanoplot_output/` – the directory holding the output data from the NanoPlot run, provided as a positional argument
+- `--zip-data-dir` - Compresses the data directory.
+- `--outdir` - Specifies the output directory to store results.
+- `--filename` - Specifies the filename prefix of results.
+- `--interactive` - Forces multiqc to always create interactive javascript plots.
+- `/path/to/filtered_nanoplot_output/` - The directory holding the output data from the NanoPlot run, provided as a positional argument.
 
 **Input Data:**
 
-- /path/to/filtered_nanoplot_output/*NanoStats.txt (NanoPlot output data, from [Step 4b](#4b-filtered-data-qc))
+- /path/to/filtered_nanoplot_output/*filtered_NanoStats.txt (NanoPlot output data, from [Step 4b](#4b-filtered-data-qc))
 
 **Output Data:**
 
@@ -420,17 +432,19 @@ multiqc  --zip-data-dir \
 #### 5a. Trim Filtered Data
 
 ```bash
-porechop --input sample_filtered.fastq --threads NumberOfThreads \
-         --discard_middle --output sample_trimmed.fastq  > sample_porechop.log
+porechop --input sample_filtered.fastq \
+         --threads NumberOfThreads \
+         --discard_middle \
+         --output sample_trimmed.fastq  > sample_porechop.log
 ```
 
 **Parameter Definitions:**
 
-- `--input` – the input read file in fastq format
-- `--threads` - number of parallel processing threads to use
-- `--discard_middle` -  reads with middle adapters will be discarded
-- `--output` - trimmed reads output fastq filename
-- `> sample_porechop.log` - capture stdout in a log file
+- `--input` - Specifies the input sequence file in fastq format.
+- `--threads` - Number of parallel processing threads to use.
+- `--discard_middle` - Discards reads that contain adapters in the middle.
+- `--output` - Specifies the trimmed reads output fastq filename.
+- `> sample_porechop.log` - Redirects the standard output to a log file.
 
 **Input Data:**
 
@@ -439,24 +453,27 @@ porechop --input sample_filtered.fastq --threads NumberOfThreads \
 **Output Data:**
 
 - **sample_trimmed.fastq** (filtered and trimmed reads)
+- sample_porechop.log (porechop standard output containing trimming info)
 
 #### 5b. Trimmed Data QC
 
 ```bash
 NanoPlot --only-report \
-         --prefix sample_ \
+         --prefix sample_trimmed_ \
          --outdir /path/to/trimmed_nanoplot_output \
          --threads NumberOfThreads \
-         --fastq sample_trimmed.fastq
+         --fastq \
+         sample_trimmed.fastq
 ```
 
 **Parameter Definitions:**
 
-- `--outdir` – specifies the output directory to store results
-- `--only-report` - output only the report files
-- `--prefix` - adds a sample specific prefix to the name of each output file
-- `--threads` - number of parallel processing threads to use
-- `sample_trimmed.fastq` – the input reads are specified as a positional argument
+- `--only-report` - Outputs only the report files.
+- `--prefix` - Adds a sample-specific prefix to the name of each output file.
+- `--outdir` - Specifies the output directory to store results.
+- `--threads` - Number of parallel processing threads to use.
+- `--fastq` - Specifies that the input data is in fastq format.
+- `sample_trimmed.fastq` - The input reads file, provided as the argument to `--fastq`.
 
 **Input Data:**
 
@@ -464,9 +481,9 @@ NanoPlot --only-report \
 
 **Output Data:**
 
-- **/path/to/trimmed_nanoplot_output/sample_NanoPlot-report.html** (NanoPlot html summary)
-- /path/to/trimmed_nanoplot_output/sample_NanoPlot_<date>_<time>.log (NanoPlot log file)
-- /path/to/trimmed_nanoplot_output/sample_NanoStats.txt (text file containing basic statistics)
+- **/path/to/trimmed_nanoplot_output/sample_trimmed_NanoPlot-report.html** (NanoPlot html summary)
+- /path/to/trimmed_nanoplot_output/sample_trimmed_NanoPlot_\<date\>_\<time\>.log (NanoPlot log file)
+- /path/to/trimmed_nanoplot_output/sample_trimmed_NanoStats.txt (text file containing basic statistics)
 
 #### 5c. Compile Trimmed Data QC
 
@@ -474,20 +491,21 @@ NanoPlot --only-report \
 multiqc --zip-data-dir \ 
         --outdir trimmed_multiqc_report \
         --filename trimmed_multiqc \
-        --interactive /path/to/trimmed_nanoplot_output/
+        --interactive \
+        /path/to/trimmed_nanoplot_output/
 ```
 
 **Parameter Definitions:**
 
-- `--zip-data-dir` - compress the data directory
-- `--outdir` – the output directory to store results
-- `--filename` – the filename prefix of results
-- `--interactive` - force multiqc to always create interactive javascript plots
-- `/path/to/trimmed_nanoplot_output/` – the directory holding the output data from the NanoPlot run, provided as a positional argument
+- `--zip-data-dir` - Compresses the data directory.
+- `--outdir` - Specifies the output directory to store results.
+- `--filename` - Specifies the filename prefix of results.
+- `--interactive` - Forces multiqc to always create interactive javascript plots.
+- `/path/to/trimmed_nanoplot_output/` - The directory holding the output data from the NanoPlot run, provided as a positional argument.
 
 **Input Data:**
 
-- /path/to/trimmed_nanoplot_output/*NanoStats.txt (NanoPlot output data, output from [Step 5b](#5b-trimmed-data-qc))
+- /path/to/trimmed_nanoplot_output/*trimmed_NanoStats.txt (NanoPlot output data, output from [Step 5b](#5b-trimmed-data-qc))
 
 **Output Data:**
 
@@ -500,30 +518,31 @@ multiqc --zip-data-dir \
 
 ### 6. Contaminant Removal
 
-> A major issue with low biomass data is the high potential for contamination due to the low amount of DNA extracted in the sample for sequencing.  Because negative control/blank samples should by theory be contaminant free, any sequence detected in the negative control is a potential contaminant. To filter out contaminants found in negative control samples that may have been due to cross contamination in the lab, we use a read mapping approach. First negative/blank control samples reads are assembled then filtered and trimmed reads mapped to the assembled contigs. Reads mapping to the assembled contigs are categorized as contaminants and are therefore filtered out and thus excluded from further analyses.
+> A major issue with low-biomass data is the high potential for contamination, because so little DNA is extracted from the samples. Because negative control/blank samples should in theory be contaminant-free, any sequence detected in a negative control is a potential contaminant. To filter out contaminants found in negative control samples, which may result from cross-contamination in the lab, we use a read-mapping approach. First, reads from the negative/blank control samples are assembled; then the filtered and trimmed reads from each low-biomass sample are mapped to the resulting contigs. Reads that map to these contigs are classified as contaminants and excluded from downstream analyses.
 
 ### 6a. Assemble Contaminants
 
 ```bash
-flye --meta --threads NumberOfThreads \
+flye --meta \
+     --threads NumberOfThreads \
      --out-dir /path/to/contaminant_assembly \
      --nano-raw /path/to/blank_samples/\*_trimmed.fastq
 ```
 
 **Parameter Definitions:**
 
-- `--meta` – use metagenome/uneven coverage mode
-- `--threads` - number of parallel processing threads to use
-- `--out-dir` - output directory
-- `--nano-raw` - specifies that input is from Oxford Nanopore regular raw reads. This adds a polishing step for error correction after the assembly is generated.
+- `--meta` - Uses metagenome/uneven coverage mode.
+- `--threads` - Number of parallel processing threads to use.
+- `--out-dir` - Specifies the output directory.
+- `--nano-raw` - Specifies that input is from Oxford Nanopore regular raw reads. This adds a polishing step for error correction after the assembly is generated.
 
 **Input Data**
 
-- *_trimmed.fastq (one or more trimmed reads from blank samples, output from [Step 5a](#5a-trim-filtered-data))
+- *_trimmed.fastq (trimmed reads from one or more blank (negative control) samples, output from [Step 5a](#5a-trim-filtered-data))
 
 **Output Data**
 
-- /path/to/contaminant_assembly/assembly.fasta (Assembly built from reads in blank samples in fasta format)
+- /path/to/contaminant_assembly/assembly.fasta (assembly built from reads in blank samples in fasta format)
 
 <br>
 
@@ -533,32 +552,47 @@ flye --meta --threads NumberOfThreads \
 
 ```bash
 # Build contaminant index
-minimap2 -t NumberOfThreads -a -x splice -d blanks.mmi /path/to/contaminant_assembly/assembly.fasta
+minimap2 -t NumberOfThreads \
+         -a \
+         -x splice \
+         -d blanks.mmi \
+         /path/to/contaminant_assembly/assembly.fasta
 
 # Map reads to index
-minimap2 -t NumberOfThreads -a -x splice blanks.mmi /path/to/trimmed_reads/sample_trimmed.fastq  > sample.sam
+minimap2 -t NumberOfThreads \
+         -a \
+         -x splice \
+         blanks.mmi \
+         /path/to/trimmed_reads/sample_trimmed.fastq  > sample.sam
 ```
 
 **Parameter Definitions:**
 
-- `-t` - number of parallel processing threads
-- `-a` – output in SAM format
-- `-x splice` - specifies preset for spliced alignment of long reads
-- `-d` - specifies the output file for the index
+- `-t` - Number of parallel processing threads.
+- `-a` - Outputs alignments in SAM format.
+- `-x splice` - Specifies the preset for spliced alignment of long reads.
+- `-d` - Specifies the output file for the index (specific to the build contaminant index command).
+- `/path/to/contaminant_assembly/assembly.fasta` - Specifies the input file in fasta format, provided as a positional argument (specific to the build contaminant index command).
+- `blanks.mmi` - Specifies the index file in mmi format, provided as a positional argument (specific to the map reads command).
+- `/path/to/trimmed_reads/sample_trimmed.fastq` - Specifies the input file in fastq format, provided as a positional argument (specific to the map reads command).
+- `> sample.sam` - Redirects the standard output (the alignments) to a SAM file (specific to the map reads command).
 
 **Input Data**
 
-- /path/to/contaminant_assembly/assembly.fasta (Contaminant assembly, output from [Step 6a](#6-assemble-contaminants))
-- /path/to/trimmed_reads/sample_trimmed.fastq (Filtered and trimmed reads, output from [Step 5a](#5a-trim-filtered-data))
+- /path/to/contaminant_assembly/assembly.fasta (contaminant assembly, output from [Step 6a](#6a-assemble-contaminants))
+- /path/to/trimmed_reads/sample_trimmed.fastq (filtered and trimmed reads, output from [Step 5a](#5a-trim-filtered-data))
 
 **Output Data**
 
-- sample.sam (Reads aligned to contaminant assembly)
+- blanks.mmi (contaminant index in MMI format)
+- sample.sam (reads aligned to contaminant assembly in SAM format)
 
 #### 6c. Sort and Index Contaminant Alignments
 ```bash
 # Sort Sam, convert to bam and create index
-samtools sort --threads NumberOfThreads -o sample_sorted.bam sample.sam > sample_sort.log 2>&1
+samtools sort --threads NumberOfThreads \
+              -o sample_sorted.bam \
+              sample.sam > sample_sort.log 2>&1
 
 samtools index sample_sorted.bam sample_sorted.bam.bai
 ```
@@ -566,22 +600,24 @@ samtools index sample_sorted.bam sample_sorted.bam.bai
 **Parameter Definitions:**
 
 **samtools sort**
-- `--threads` - number of parallel processing threads to use
-- `-o` - specifies the output file for the sorted reads
-- `sample.sam` - positional argument specifying the input SAM file
+- `--threads` - Number of parallel processing threads to use.
+- `-o` - Specifies the output file for the aligned and sorted reads.
+- `sample.sam` - Specifies the input SAM file, provided as a positional argument.
+- `> sample_sort.log 2>&1` - Redirects both the standard output and the standard error to a log file.
 
 **samtools index**
-- `sample_sorted.bam` - positional argument specifying the input BAM file to be indexed
-- `sample_sorted.bam.bai` - positional argument specifying the name of the index file
+- `sample_sorted.bam` - The input BAM file, provided as a positional argument.
+- `sample_sorted.bam.bai` - The output index file, provided as a positional argument.
 
 **Input Data:**
 
-- sample.sam (Reads aligned to contaminant assembly, output from [Step 6b](#6b-build-contaminant-index-and-map-reads))
+- sample.sam (reads aligned to contaminant assembly, output from [Step 6b](#6b-build-contaminant-index-and-map-reads))
 
 **Output Data:**
 
-- sample_sorted.bam (sorted mapping to contaminant assembly)
-- sample_sorted.bam.bai (index of sorted mapping to contaminant assembly)
+- sample_sorted.bam (sorted mapping to contaminant assembly file)
+- sample_sorted.bam.bai (index of sorted mapping to contaminant assembly file)
+- sample_sort.log (log file containing the samtools sort standard output and standard error)
 
 #### 6d. Gather Contaminant Mapping Metrics
 
@@ -594,221 +630,224 @@ samtools idxstats sample_sorted.bam  > sample_idxstats.txt 2> sample_idxstats.lo
 
 **Parameter Definitions:**
 
-- `flagstat` - positional argument specifying the program for counting the number of alignments for each SAM FLAG type
-- `stats` - positional argument specifying the program for producing comprehensive statistics from the alignment file
-- `idxstats` - positional argument specifying the program for producing contig alignment summary statistics
-- `--remove-dups` - excludes reads marked as duplicates from comprehensive statistics
-- `sample_sorted.bam` - positional argument specifying the input BAM file
+- `flagstat` - Positional argument specifying the program for counting the number of alignments for each SAM FLAG type.
+- `stats` - Positional argument specifying the program for producing comprehensive statistics from the alignment file.
+- `idxstats` - Positional argument specifying the program for producing contig alignment summary statistics.
+- `--remove-dups` - Excludes reads marked as duplicates from the comprehensive statistics.
+- `sample_sorted.bam` - Positional argument specifying the input BAM file.
+- `> sample_flagstats.txt` - Redirects the flagstat standard output to a text file.
+- `2> sample_flagstats.log` - Redirects the flagstat standard error to a log file.
+- `> sample_stats.txt` - Redirects the stats standard output to a text file.
+- `2> sample_stats.log` - Redirects the stats standard error to a log file.
+- `> sample_idxstats.txt` - Redirects the idxstats standard output to a text file.
+- `2> sample_idxstats.log` - Redirects the idxstats standard error to a log file.
 
 **Input Data:**
 
-- sample_sorted.bam (sorted mapping to contaminant assembly, output from [Step 6c](#6c-sort-and-index-contaminant-alignments))
-- sample_sorted.bam.bai (index of sorted mapping to contaminant assembly, output from [Step 6c](#6c-sort-and-index-contaminant-alignments))
+- sample_sorted.bam (sorted mapping to contaminant assembly file, output from [Step 6c](#6c-sort-and-index-contaminant-alignments))
+- sample_sorted.bam.bai (index of sorted mapping to contaminant assembly file, output from [Step 6c](#6c-sort-and-index-contaminant-alignments))
 
 **Output Data:**
 
 - sample_flagstats.txt (SAM FLAG counts)
+- sample_flagstats.log (log file containing the flagstat standard error)
 - sample_stats.txt (comprehensive alignment statistics)
+- sample_stats.log (log file containing the stats standard error)
 - sample_idxstats.txt (contig alignment summary statistics)
+- sample_idxstats.log (log file containing the idxstats standard error)
 
 #### 6e. Generate Decontaminated Read Files
 ```bash
 # Retain reads that do not map to contaminants
-samtools fastq -t -f 4 sample_sorted.bam | gzip --to-stdout > sample_blank_removed.fastq.gz
+samtools fastq -t -f 4 sample_sorted.bam | gzip --to-stdout > sample_decontam.fastq.gz
 ```
 
 **Parameter Definitions:**
 
-- `fastq` - positional argument specifying the program for generating fastq files from a SAM/BAM file
-- `-t` - copy RG, BC, and QT tags to the FASTQ header line
-- `-f 4` - only retain unmapped reads that have been marked with the SAM "segment unmapped" FLAG (4)
-- `sample_sorted.bam` - positional argument specifying the input BAM file
-- `| gzip --to-stdout` - sends output from `samtools fastq` to `gzip` to create compressed fastq.gz file
-- `> sample_blank_removed.fastq.gz` - specifies the name of the file used to store the fastq.gz output
+- `fastq` - Positional argument specifying the program for generating fastq files from a SAM/BAM file.
+- `-t` - Copy RG, BC, and QT tags to the FASTQ header line.
+- `-f 4` - Only retain unmapped reads that have been marked with the SAM "segment unmapped" FLAG (4).
+- `sample_sorted.bam` - Positional argument specifying the input BAM file.
+- `| gzip` - Pipes the output from `samtools fastq` to the `gzip` command for compression.
+- `--to-stdout` - Sends the compressed output from the `gzip` command to standard output.
+- `> sample_decontam.fastq.gz` - Redirects the `gzip` standard output to a fastq.gz file.
 
 **Input Data:**
 
-- sample_sorted.bam (sorted mapping to contaminant assembly, output from [Step 6c](#6c-sort-and-index-contaminant-alignments))
+- sample_sorted.bam (sorted mapping to contaminant assembly file, output from [Step 6c](#6c-sort-and-index-contaminant-alignments))
 
 **Output Data:**
 
-- sample_blank_removed.fastq.gz (blank removed reads in fastq format)
+- sample_decontam.fastq.gz (filtered and trimmed sample reads with contaminants removed in fastq format)
 
 #### 6f. Contaminant Removal QC
 
 ```bash
 NanoPlot --only-report \
-         --prefix sample_ \
-         --outdir /path/to/noblank_nanoplot_output \
+         --prefix sample_decontam_ \
+         --outdir /path/to/decontam_nanoplot_output \
          --threads NumberOfThreads \
-         --fastq sample_blank_removed.fastq.gz
+         --fastq \
+         sample_decontam.fastq.gz
 ```
 
 **Parameter Definitions:**
 
-- `--outdir` – specifies the output directory to store results
-- `--only-report` - output only the report files
-- `--prefix` - adds a sample specific prefix to the name of each output file
-- `--threads` - number of parallel processing threads to use
-- `--fastq` - specifies that the input data is in a fastq format
-- `sample_blank_removed.fastq.gz` – the input reads are specified as a positional argument
+- `--only-report` - Output only the report files.
+- `--prefix` - Adds a sample specific prefix to the name of each output file.
+- `--outdir` – Specifies the output directory to store results.
+- `--threads` - Number of parallel processing threads to use.
+- `--fastq` - Specifies that the input data is in fastq format.
+- `sample_decontam.fastq.gz` – The input reads, specified as a positional argument.
 
 **Input Data:**
 
-- sample_blank_removed.fastq.gz (blank removed reads, output from [Step 6e](#6e-generate-decontaminated-read-files))
+- sample_decontam.fastq.gz (filtered and trimmed sample reads with contaminants removed, output from [Step 6e](#6e-generate-decontaminated-read-files))
 
 **Output Data:**
 
-- **/path/to/noblank_nanoplot_output/sample_NanoPlot-report.html** (NanoPlot html summary)
-- /path/to/noblank_nanoplot_output/sample_NanoPlot_<date>_<time>.log (NanoPlot log file)
-- /path/to/noblank_nanoplot_output/sample_NanoStats.txt (text file containing basic statistics)
+- **/path/to/decontam_nanoplot_output/sample_decontam_NanoPlot-report.html** (NanoPlot html summary)
+- /path/to/decontam_nanoplot_output/sample_decontam_NanoPlot_\<date\>_\<time\>.log (NanoPlot log file)
+- /path/to/decontam_nanoplot_output/sample_decontam_NanoStats.txt (text file containing basic statistics)
 
 
 #### 6g. Compile Contaminant Removal QC
 
 ```bash
 multiqc --zip-data-dir \ 
-        --outdir noblank_multiqc_report \
-        --filename noblank_multiqc \
-        --interactive /path/to/noblank_nanoplot_output/
+        --outdir decontam_multiqc_report \
+        --filename decontam_multiqc \
+        --interactive \
+        /path/to/decontam_nanoplot_output/
 ```
 
 **Parameter Definitions:**
 
-- `--zip-data-dir` - compress the data directory
-- `--outdir` – the output directory to store results
-- `--filename` – the filename prefix of results
-- `--interactive` - force multiqc to always create interactive javascript plots
-- `/path/to/noblank_nanoplot_output/` – the directory holding the output data from the NanoPlot run, provided as a positional argument
+- `--zip-data-dir` - Compress the data directory.
+- `--outdir` – Specifies the output directory to store results.
+- `--filename` – Specifies the filename prefix of results.
+- `--interactive` - Force multiqc to always create interactive javascript plots.
+- `/path/to/decontam_nanoplot_output/` – The directory holding the output data from the NanoPlot run, provided as a positional argument.
 
 **Input Data:**
 
-- /path/to/noblank_nanoplot_output/*NanoStats.txt (NanoPlot output data, output from [Step 6f](#6f-contaminant-removal-qc))
+- /path/to/decontam_nanoplot_output/*decontam_NanoStats.txt (NanoPlot output data, output from [Step 6f](#6f-contaminant-removal-qc))
 
 **Output Data:**
 
-- **noblank_multiqc_report/noblank_multiqc.html** (multiqc output html summary)
-- **noblank_multiqc_report/noblank_multiqc_data.zip** (zip archive containing multiqc output data)
+- **decontam_multiqc.html** (multiqc output html summary)
+- **decontam_multiqc_data.zip** (zip archive containing multiqc output data)
 
 <br>
 
 ---
 
-### 7. Host Removal
-
-#### 7a. Build or download host database
+### 7. Human Read Removal
 
-##### 7a.i. Download from URL
+#### 7a. Build Kraken2 Database
 
 ```bash
-  # Downloading and unpacking database from ${host_url}
-  wget -O host.tar.gz --timeout=3600 --tries=0 --continue  host_url
-
-  mkdir kraken2_host_db/ && \
-  tar -zxvf host.tar.gz -C kraken2_host_db/ && \
-  rm -rf  host.tar.gz # Cleaning up
-```
-
-**Parameter Definitions:**
-
-- `--timeout` - network timeout in seconds
-- `--tries` - number of times to retry the download
-- `--continue` - continue getting a partially downloaded file (if it exists)
-- `host_url` - positional argument specifying the URl for the host database
-
-**Output Data:**
+kraken2-build --download-library human \
+              --db kraken2_human_db \
+              --threads NumberOfThreads \
+              --no-masking
 
-- kraken2_host_db/ - Kraken2 database directory
+kraken2-build --download-taxonomy \
+              --db kraken2_human_db/
 
-
-##### 7a.ii. Build from custom reference
-
-```bash
-# Install taxonomy       
-kraken2-build --download-taxonomy --db kraken2_host_db/
-# Add sequence to your database's genomic library
-kraken2-build --add-to-library host_assembly.fasta --db kraken2_host_db/ --no-masking
-# Once your library is finalized, build the database
-kraken2-build --build --db kraken2_host_db/
+kraken2-build --build \
+              --db kraken2_human_db/ \
+              --threads NumberOfThreads
+
+kraken2-build --clean \
+              --db kraken2_human_db/
 ```
 
 **Parameter Definitions:**
 
-- `--download-taxonomy` - downloads taxonomic mapping information
-- `--add-to-library host_assembly.fasta` - specifies to add assembly fasta to library
-- `--db` - specifies the output directory for the kraken database
-- `--build` - specifies to construct kraken2-formatted database
+- `--download-library` - Specifies the reference name/type to download.
+- `--db` - Specifies the directory to put the database in.
+- `--threads` - Number of parallel processing threads to use.
+- `--no-masking` - Prevents masking of low-complexity sequences. For additional 
+                   information see the [kraken documentation for masking](https://github.com/DerrickWood/kraken2/wiki/Manual#masking-of-low-complexity-sequences).
+- `--download-taxonomy` - Downloads taxonomic mapping information.
+- `--build` - Specifies to construct kraken2-formatted database.
+- `--clean` - Specifies to remove unnecessary intermediate files.
 
 **Input Data:**
 
-- `host_assembly.fasta` - host genome assembly in fasta format 
+- `human` - database name to download (specified with the `--download-library` parameter above)
 
 **Output Data:**
 
-- kraken2_host_db/ - Kraken2 database directory
+- kraken2_human_db/ - Kraken2 human database directory, containing hash.k2d, opts.k2d, and taxo.k2d files 
 
 
-##### 7a.iii. Build from host name
+#### 7b. Remove Human Reads
 
 ```bash
-# Build kraken reference from host_name
-kraken2-build --download-library host_name  -db kraken2_host_db/ \
-              --threads numberOfThreads  --no-masking
-kraken2-build --download-taxonomy --db kraken2_host_db/
-kraken2-build --build --db kraken2_host_db/ --threads numberOfThreads 
-kraken2-build --clean --db kraken2_host_db/
+kraken2 --db kraken2_human_db \
+        --gzip-compressed \
+        --threads NumberOfThreads \
+        --use-names \
+        --output sample-kraken2-output.txt \
+        --report sample-kraken2-report.tsv \
+        --unclassified-out sample_HRrm.fasta \
+        sample_decontam.fastq.gz
+
+# Add ">" before each sequence name, then gzip the fasta output file
+sed -i -E 's/^([a-z0-9])/>\1/g' sample_HRrm.fasta && gzip sample_HRrm.fasta
 ```
 
 **Parameter Definitions:**
 
-- `--download-library` - specifies the reference name/type to download, host_name must 
-                         be one of: "archaea", "bacteria", "plasmid", "viral", "human", 
-                         "fungi", "plant", "protozoa", "nr", "nt", "UniVec", "UniVec_Core"
-- `--db` - specifies the directory we are putting the database in
-- `--threads` - number of parallel processing threads to use
-- `--no-masking` - prevents masking of low-complexity sequences. For additional 
-                   information see the [kraken documentation for masking](https://github.com/DerrickWood/kraken2/wiki/Manual#masking-of-low-complexity-sequences)
-- `--download-taxonomy` - downloads taxonomic mapping information
-- `--build` - specifies to construct kraken2-formatted database
-- `--clean` - specifies to remove unnecessarily intermediate files
+- `--db` - Specifies the directory holding the kraken2 database.
+- `--gzip-compressed` - Specifies that the input fastq files are gzip-compressed.
+- `--threads` - Number of parallel processing threads to use.
+- `--use-names` - Specifies adding taxa names in addition to taxon IDs.
+- `--output` - Specifies the name of the kraken2 read-based output file (one line per read).
+- `--report` - Specifies the name of the kraken2 report output file (one line per taxon, with number of reads assigned to it).
+- `--unclassified-out` - Specifies the name of the output file containing reads that were not classified, i.e., non-human reads.
+- `sample_decontam.fastq.gz` - Positional argument specifying the input read file.
 
 **Input Data:**
 
-- `host_name` - host database name (one of those specified in `--download-library` above)
+- kraken2_human_db/ (kraken2 human database directory, output from [Step 7a](#7a-build-kraken2-database))
+- sample_decontam.fastq.gz (filtered and trimmed sample reads with contaminants removed, output from [Step 6e](#6e-generate-decontaminated-read-files))
 
 **Output Data:**
 
-- kraken2_host_db/ - Kraken2 database directory
+- sample-kraken2-output.txt (kraken2 read-based output file (one line per read))
+- sample-kraken2-report.tsv (kraken2 report output file (one line per taxon, with number of reads assigned to it))
+- **sample_HRrm.fasta.gz** (filtered and trimmed sample reads with both contaminants and human reads removed, gzipped fasta file)
+
 
-#### 7b. Remove host reads
+#### 7c. Compile Human Read Removal QC
 
 ```bash
-kraken2 --db kraken2_host_db/ --gzip-compressed --threads NumberOfThreads --use-names \
-        --output sample-kraken2-output.txt --report sample-kraken2-report.tsv \
-        --unclassified-out sample_host_removed.fastq sample_blank_removed.fastq.gz
-gzip sample_host_removed.fastq
+multiqc --zip-data-dir \
+        --outdir HRrm_multiqc_report \
+        --filename HRrm_multiqc \
+        --interactive \
+        /path/to/*kraken2-report.tsv
 ```
 
 **Parameter Definitions:**
 
-- `--db` - specifies the directory holding the kraken2 database files created in [Step 7a](#7a-build-or-download-host-database)
-- `--gzip-compressed` - specifies the input fastq files are gzip-compressed
-- `--threads` - number of parallel processing threads to use
-- `--use-names` - specifies adding taxa names in addition to taxon IDs
-- `--output` - specifies the name of the kraken2 read-based output file (one line per read)
-- `--report` - specifies the name of the kraken2 report output file (one line per taxa, with number of reads assigned to it)
-- `--unclassified-out` - name of output file of reads that were not classified i.e non-host reads.
-- `sample_blank_removed.fastq.gz` - positional argument specifying the input read file
+- `--zip-data-dir` - Compress the data directory.
+- `--outdir` – Specifies the output directory to store results.
+- `--filename` – Specifies the filename prefix of results.
+- `--interactive` - Force multiqc to always create interactive javascript plots.
+- `/path/to/*kraken2-report.tsv` – The kraken2 output report files, provided as a positional argument.
 
 **Input Data:**
 
-- sample_blank_removed.fastq.gz (gzipped blank removed fastq file, output from [Step 6d](#6d-generate-decontaminated-read-files))
+- /path/to/*kraken2-report.tsv (kraken2 report files, output from [Step 7b](#7b-remove-human-reads))
 
 **Output Data:**
 
-- sample-kraken2-output.txt (kraken2 read-based output file (one line per read))
-- sample-kraken2-report.tsv (kraken2 report output file (one line per taxa, with number of reads assigned to it))
-- **sample_host_removed.fastq.gz** (host-read removed, gzipped fastq file)
+- **HRrm_multiqc.html** (multiqc output html summary)
+- **HRrm_multiqc_data.zip** (zip archive containing multiqc output data)
 
 <br>
 
@@ -1408,23 +1447,25 @@ colours <- colorRampPalette(c('white','red'))(255)
 
 ## Read-based Processing
 
-### 9. Taxonomic profiling using kaiju
+### 9. Taxonomic Profiling Using Kaiju
 
-#### 9a. Build kaiju database
+#### 9a. Build Kaiju Database
 
 ```bash
 # Make a directory that will hold the downloaded kaiju database
 mkdir kaiju-db/ && cd kaiju-db/
+
 # Download kaiju's reference database
 kaiju-makedb -s nr_euk -t NumberOfThreads
-# Cleaning up
+
+# Clean up
 rm nr_euk/kaiju_db_nr_euk.bwt nr_euk/kaiju_db_nr_euk.sa
 ```
 
 **Parameter Definitions:**
 
-- `-t` - number of parallel processing threads to use
-- `-s nr_euk` - specifies to download NCBI's nr and additionally including fungi and microbial eukaryotes databases
+- `-s nr_euk` - Specifies to download the subset of the NCBI BLAST nr (non-redundant) database containing all proteins belonging to archaea, bacteria, and viruses, and to additionally include proteins from fungi and microbial eukaryotes.
+- `-t` - Number of parallel processing threads to use.
 
 **Input Data:**
 
@@ -1432,107 +1473,119 @@ rm nr_euk/kaiju_db_nr_euk.bwt nr_euk/kaiju_db_nr_euk.sa
 
 **Output Data:**
 
-- kaiju-db/nr_euk/kaiju_db_nr_euk.fmi (fmi file)
-- kaiju-db/nodes.dmp (nodes file)
-- kaiju-db/names.dmp (names file)
+- kaiju-db/nr_euk/kaiju_db_nr_euk.fmi (FM-index file containing the main Kaiju database index)
+- kaiju-db/nr_euk/kaiju_db_nr_euk.faa (FASTA amino acid file containing the protein sequences used to build the .fmi index file)
+- kaiju-db/nodes.dmp (taxonomy hierarchy file from the NCBI Taxonomy database defining the parent-child relationships in the taxonomic tree)
+- kaiju-db/names.dmp (taxonomy names file from the NCBI Taxonomy database that maps taxonomic IDs to their scientific names)
+- kaiju-db/merged.dmp (merged taxonomy IDs file from the NCBI Taxonomy database that maps deprecated taxonomic IDs to current ones)
 
 
 #### 9b. Kaiju Taxonomic Classification
 
 ```bash
-kaiju -f kaiju-db/nr_euk/kaiju_db_nr_euk.fmi -t kaiju-db/nodes.dmp \
-    -z NumberOfThreads \
-    -E 1e-05 \
-    -i /path/to/decontaminated_reads/sample_host_removed.fastq.gz \
-    -o sample_kaiju.out
+kaiju -f kaiju-db/nr_euk/kaiju_db_nr_euk.fmi \
+      -t kaiju-db/nodes.dmp \
+      -z NumberOfThreads \
+      -E 1e-05 \
+      -i /path/to/sample_HRrm.fasta.gz \
+      -o sample_kaiju.out
 ```
 
 **Parameter Definitions:**
 
-- `-f` - specifies path to the kaiju database (.fmi) file
-- `-t` - specifies path to the kaiju nodes.dmp file
-- `-z` - number of parallel processing threads to use
-- `-E` - specifies the minimum E-value in Greedy mode (default: 1e-05)
-- `-i` - specifies path to the input file
-- `-o` - specifies the name of output file
+- `-f` - Specifies the path to the kaiju database index file (.fmi).
+- `-t` - Specifies the path to the kaiju taxonomy hierarchy file (nodes.dmp).
+- `-z` - Number of parallel processing threads to use.
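+- `-E` - Specifies the E-value threshold used to filter matches; matches with an E-value above this threshold are discarded (an E-value of 1e-05 means that about 0.00001 matches of equal or better score are expected to occur by chance, i.e., roughly a 0.001% chance that a match occurred randomly).
+- `-i` - Specifies the path to the input file.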
+- `-o` - Specifies the name of the output file.
 
 **Input Data:**
 
-- kaiju-db/nr_euk/kaiju_db_nr_euk.fmi (fmi file, output from [Step 9a](#9a-build-kaiju-database))
-- kaiju-db/nodes.dmp (nodes file, output from [Step 9a](#9a-build-kaiju-database))
-- sample_host_removed.fastq.gz (gzipped decontaminated reads fastq file, output from [Step 7b](#7b-remove-host-reads))
+- kaiju-db/nr_euk/kaiju_db_nr_euk.fmi (FM-index file containing the main Kaiju database index, output from [Step 9a](#9a-build-kaiju-database))
+- kaiju-db/nodes.dmp (kaiju taxonomy hierarchy nodes file, output from [Step 9a](#9a-build-kaiju-database))
+- sample_HRrm.fasta.gz (filtered and trimmed sample reads with both contaminants and human reads removed, gzipped fasta file, output from [Step 7b](#7b-remove-human-reads))
 
 **Output Data:**
 
 - sample_kaiju.out (kaiju output file)
 
-#### 9c. Compile kaiju taxonomy results
+#### 9c. Compile Kaiju Taxonomy Results
 
 ```bash
-# Merge kaiju reports to one table at the species level
-  kaiju2table -t nodes.dmp -n names.dmp -p -r species \
-              -o merged_kaiju_table.tsv *_kaiju.out
+# Merge kaiju reports into one table per taxonomic level; run once with TAXON_LEVEL set to each of: phylum, class, order, family, genus, species
+kaiju2table -t nodes.dmp \
+            -n names.dmp \
+            -p \
+            -r ${TAXON_LEVEL} \
+            -o merged_kaiju_summary_${TAXON_LEVEL}.tsv \
+            *_kaiju.out
 
 # Convert file names to sample names
-sed -i -E 's/.+\/(.+)_kaiju\.out/\1/g' merged_kaiju_table.tsv && \
-sed -i -E 's/file/sample/' merged_kaiju_table.tsv
+sed -i -E 's/.+\/(.+)_kaiju\.out/\1/g' merged_kaiju_summary_${TAXON_LEVEL}.tsv && \
+sed -i -E 's/file/sample/' merged_kaiju_summary_${TAXON_LEVEL}.tsv
 ```
 
 **Parameter Definitions:**
 
-- `-n` - specifies path to the kaiju names.dmp file
-- `-t` - specifies path to the kaiju nodes.dmp file
-- `-r` - specifies taxonomic rank, must be one of: phylum, class, order, family, genus, species
-- `-o` - specifies the name of krona formatted kaiju output file
-- `*_kaiju.out` - positional argument specifying the path to the kaiju output file (output from [Step 9ai](#9ai-read-taxonomic-classification-using-kaiju))
+- `-t` - Specifies the path to the kaiju taxonomy hierarchy file (nodes.dmp).
+- `-n` - Specifies the path to the kaiju taxonomy names file (names.dmp).
+- `-p` - Print the full taxon path instead of only the taxon name.
+- `-r` - Specifies the taxonomic rank at which to summarize the results; must be one of: phylum, class, order, family, genus, species.
+- `-o` - Specifies the name of the kaiju taxon summary output file.
+- `*_kaiju.out` - Positional argument specifying the path to the kaiju output files for each sample. 
 
 **Input Data:**
 
-- kaiju-db/nodes.dmp (nodes file, output from [Step 9a](#9a-build-kaiju-database))
-- kaiju-db/names.dmp (names file, output from [Step 9a](#9a-build-kaiju-database))
-- *kaiju.out (kaiju report files, output from [Step 9b](#9b-kaiju-taxonomic-classification))
+- kaiju-db/nodes.dmp (kaiju taxonomy hierarchy nodes file, output from [Step 9a](#9a-build-kaiju-database))
+- kaiju-db/names.dmp (kaiju taxonomy names file, output from [Step 9a](#9a-build-kaiju-database))
+- *kaiju.out (kaiju output files, output from [Step 9b](#9b-kaiju-taxonomic-classification))
 
 **Output Data:**
 
-- **merged_kaiju_table.tsv** (Compiled kaiju table at the species taxon level)
+- **merged_kaiju_summary_${TAXON_LEVEL}.tsv** (compiled kaiju summary table for each taxon level)
 
-#### 9d. Convert kaiju output to krona format
+#### 9d. Convert Kaiju Output To Krona Format
 
 ```bash
-kaiju2krona -u -n kaiju-db/names.dmp -t kaiju-db/nodes.dmp \
-            -i sample_kaiju.out -o sample.krona
+kaiju2krona -u \
+            -n kaiju-db/names.dmp \
+            -t kaiju-db/nodes.dmp \
+            -i sample_kaiju.out \
+            -o sample.krona
 ```
 
 **Parameter Definitions:**
 
-- `-u` - include count for unclassified reads in output
-- `-n` - specifies path to the kaiju names.dmp file
-- `-t` - specifies path to the kaiju nodes.dmp file
-- `-i` - specifies path to the kaiju output file (output from [Step 9b](#9b-kaiju-taxonomic-classification))
-- `-o` - specifies the name of krona formatted kaiju output file
+- `-u` - Include count for unclassified reads in output.
+- `-n` - Specifies the path to the kaiju taxonomy names file (names.dmp).
+- `-t` - Specifies the path to the kaiju taxonomy hierarchy file (nodes.dmp).
+- `-i` - Specifies the path to the kaiju output file.
+- `-o` - Specifies the name of krona formatted kaiju output file.
 
 **Input Data:**
-- kaiju-db/nodes.dmp (nodes file, output from [Step 9a](#9a-build-kaiju-database))
-- kaiju-db/names.dmp (names file, output from [Step 9a](#9a-build-kaiju-database))
+- kaiju-db/names.dmp (kaiju taxonomy names file, output from [Step 9a](#9a-build-kaiju-database))
+- kaiju-db/nodes.dmp (kaiju taxonomy hierarchy nodes file, output from [Step 9a](#9a-build-kaiju-database))
 - sample_kaiju.out (kaiju output file, output from [Step 9b](#9b-kaiju-taxonomic-classification))
 
 **Output Data:**
 
 - sample.krona (krona formatted kaiju output)
 
-#### 9e. Compile kaiju krona report
+#### 9e. Compile Kaiju Krona Reports
 
 ```bash
-# Find, list and write all .krona files to file 
-find . -type f -name "*.krona" |sort -uV > krona_files.txt
+# Create a file containing a sorted list of all .krona files 
+find . -type f -name "*.krona" | sort -uV > krona_files.txt
 
+# Create a file containing a sorted list of all sample names
 FILES=($(find . -type f -name "*.krona"))
 basename --multiple --suffix='.krona' ${FILES[*]} | sort -uV  > sample_names.txt
 
 # Create ktImportText input format files
 KTEXT_FILES=($(paste -d',' "krona_files.txt" "sample_names.txt"))
 
-# Create html   
+# Create html containing krona plot  
 ktImportText  -o kaiju-report.html ${KTEXT_FILES[*]}
 ```
 
@@ -1540,39 +1593,42 @@ ktImportText  -o kaiju-report.html ${KTEXT_FILES[*]}
 
 **find**
 
-- `-type f` -  specifies that the type of file to find is a regular file
-- `-name "*.krona"` - specifies to find files ending with the .krona suffix  
+- `-type f` -  Specifies that the type of file to find is a regular file.
+- `-name "*.krona"` - Specifies to find files ending with the .krona suffix.  
 
 **sort**
 
-- `-u` - specifies to perform a unique sort
-- `-V` - specifies to perform a mixed type of sorting
+- `-u` - Specifies to perform a unique sort.
+- `-V` - Specifies to perform a natural (version) sort, so that numbers embedded within names are ordered numerically.
+- `> {}.txt` - Redirects the sorted list to a separate text file.
 
 **basename**
 
-- `--multiple` - support multiple arguments and treat each as a file name
-- `--suffix='.krona'` - remove a trailing '.krona' suffix
+- `--multiple` - Support multiple arguments and treat each as a file name.
+- `--suffix='.krona'` - Remove a trailing '.krona' suffix.
 
 **paste**
 
-- `-d','` - paste both krona and sample files together line by line delimited by comma ','
+- `-d','` - Paste both krona and sample files together line by line delimited by comma ','.
 
 **ktImportText**
 
-- `-o` - specifies the compiled output html file name
-- `${KTEXT_FILES[*]}` - an array positional arguement with the following content: 
-                     sample_1.krona,sample_1 sample_2.krona,sample_2 .. sample_n.krona,sample_n.
+- `-o` - Specifies the compiled output html file name.
+- `${KTEXT_FILES[*]}` - An array positional argument with the following content:
+                     sample_1.krona,sample_1 sample_2.krona,sample_2 ... sample_n.krona,sample_n.
 
 **Input Data:**
-*.krona (all sample .krona formatted files, output from [Step 9d](#9d-convert-kaiju-output-to-krona-format)) 
+- *.krona (all sample .krona formatted files, output from [Step 9d](#9d-convert-kaiju-output-to-krona-format)) 
 
                       
 **Output Data:**
 
-- **kaiju-report.html** (compiled krona html report output)
+- krona_files.txt (sorted list of all *.krona files)
+- sample_names.txt (sorted list of all sample names)
+- **kaiju-report.html** (compiled krona html report containing all samples)
 
 
-#### 9f. Create kaiju species count table
+#### 9f. Create Kaiju Species Count Table --- START NEEDS REVIEW ---
 
 ```R
 library(tidyverse)
@@ -1697,7 +1753,7 @@ ggsave(filename = "filtered-kaiju_species_plot.png", plot = p,
 - **filtered-kaiju_species_plot.png** (barplot after filtering rare and non-microbial taxa)
 
 
-#### 9i. Feature decontamination
+#### 9i. Feature decontamination --- END NEEDS REVIEW ---
 
 > Feature (species) decontamination with decontam. Decontam is an R package that statistically identifies contaminating features in a feature table
 
@@ -1777,14 +1833,14 @@ ggsave(filename = "decontaminated-kaiju-species_plot.png", plot = p,
 
 ---
 
-### 10. Taxonomic Profiling using Kraken2
+### 10. Taxonomic Profiling Using Kraken2
 
-#### 10a. Download kraken2 database
+#### 10a. Download Kraken2 Database
 
 ```bash 
 ## Download all microbial (including eukaryotes) - https://benlangmead.github.io/aws-indexes/k2
 
-# Downloading and building kraken2's pluspfp database which contains that standard database + plants + protists + fungi..
+# Downloading and building kraken2's pluspfp database which contains the standard database (RefSeq archaea, bacteria, viral, plasmid, human, UniVec_Core) + plants + protists + fungi
 
 mkdir kraken2-db/ && cd kraken2-db/
 
@@ -1810,62 +1866,121 @@ tar -xvzf k2_pluspfp.tar.gz
 
 **wget**
 
-- `O` - name of file to download the url content to
-- `--timeout=3600` - specifies the network timeout in seconds
-- `--tries=0` - retry downdload infinitely
-- `--continue` -  continue getting a partially-downloaded file
-- `*_URL` - position arguement specifying the url to download a particular resource from.
+- `-O` - Name of the file to write the downloaded url content to.
+- `--timeout=3600` - Specifies the network timeout in seconds.
+- `--tries=0` - Retry the download indefinitely.
+- `--continue` - Continue getting a partially-downloaded file.
+- `*_URL` - Positional argument specifying the url to download a particular resource from.
 
 
 **Input Data:**
 
 - `INSPECT_URL=` - url specifying the location of kraken2 inspect file
-- `LIRARY_REPORT_URL=` -  url specifying the location of kraken2 library report file
-- `MD5_URL=` -  url specifying the location of the md5 file of the kraken database
+- `LIRARY_REPORT_URL=` - url specifying the location of kraken2 library report file
+- `MD5_URL=` - url specifying the location of the md5 file of the kraken database
 - `DB_URL=` - url specifying the location of the main kraken database archive in .tar.gz format
 
 **Output Data:**
 
-- kraken2-db/  (a directory containing kraken 2 database files)
+- kraken2-db/  (a directory containing kraken2 database files)
 
-#### 10b. Taxonomic Classification
+#### 10b. Kraken2 Taxonomic Classification
 
 ```bash
-kraken2 --db kraken2-db/ --gzip-compressed --threads NumberOfThreads --use-names \
-        --output sample-kraken2-output.txt --report sample-kraken2-report.tsv \
-        /path/to/decontaminated_reads/sample_host_removed.fastq.gz
+kraken2 --db kraken2-db/ \
+        --gzip-compressed \
+        --threads NumberOfThreads \
+        --use-names \
+        --output sample-kraken2-output.txt \
+        --report sample-kraken2-report.tsv \
+        /path/to/sample_HRrm.fasta.gz
 ```
 
 **Parameter Definitions:**
 
-- `--db` - specifies the directory holding the kraken2 database files 
-- `--gzip-compressed` - specifies the input fastq files are gzip-compressed
-- `--threads` - number of parallel processing threads to use
-- `--use-names` - specifies adding taxa names in addition to taxids
-- `--output` - specifies the name of the kraken2 read-based output file (one line per read)
-- `--report` - specifies the name of the kraken2 report output file (one line per taxa, with number of reads assigned to it)
-- `sample_host_removed.fastq.gz` - positional argument specifying the input read file
+- `--db` - Specifies the directory holding the kraken2 database files. 
+- `--gzip-compressed` - Specifies the input files are gzip-compressed.
+- `--threads` - Number of parallel processing threads to use.
+- `--use-names` - Specifies to add taxa names in addition to taxids.
+- `--output` - Specifies the name of the kraken2 read-based output file.
+- `--report` - Specifies the name of the kraken2 report output file.
+- `sample_HRrm.fasta.gz` - Positional argument specifying the input file.
 
 **Input Data:**
 
-- kraken2-db/ (a directory containing kraken 2 database files, output from [Step 10a](#10a-download-kraken2-database))
-- sample_host_removed.fastq.gz (gzipped reads fastq file, output from [Step 7b](#7b-remove-host-reads))
+- kraken2-db/ (a directory containing kraken2 database files, output from [Step 10a](#10a-download-kraken2-database))
+- sample_HRrm.fasta.gz (filtered and trimmed sample reads with both contaminants and human reads removed, gzipped fasta file, output from [Step 7b](#7b-remove-host-reads))
 
 **Output Data:**
 
 - sample-kraken2-output.txt (kraken2 read-based output file (one line per read))
 - sample-kraken2-report.tsv (kraken2 report output file (one line per taxa, with number of reads assigned to it))
 
-#### 10c. Convert Kraken2 output to Krona format
+
+#### 10c. Compile Kraken2 Taxonomy Results
+
+##### 10ci. Create Merged Kraken2 Taxonomy Table
 
 ```bash
-kreport2krona.py --report-file sample-kraken2-report.tsv  --output sample.krona
+combine_kreports.py --output merged-kraken2-table.tsv \
+                    --report-files sample1-kraken2-report.tsv sample2-kraken2-report.tsv ... sampleN-kraken2-report.tsv \
+                    --sample-names sample1 sample2 ... sampleN
 ```
 
 **Parameter Definitions:**
 
-- `--output` - specifies the name of the krona output file
-- `--report-file` - specifies the name of the input kraken2 report file
+- `--output` - Specifies the name of the kraken2 compiled results output file.
+- `--report-files` - Specifies the name of each input kraken2 report file to compile.
+- `--sample-names` - Specifies the name of each sample. 
+
+**Input Data:**
+
+- \*-kraken2-report.tsv (kraken report from each sample to compile, outputs from [Step 10b](#10b-kraken2-taxonomic-classification))
+
+**Output Data:**
+
+- **merged-kraken2-table.tsv** (table containing compiled kraken2 reports)
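+
+When there are many samples, the report and sample-name lists can be built from the file names instead of being typed out. This is a sketch under the assumption that all reports sit in the current directory and follow the `sample-kraken2-report.tsv` naming used in Step 10b; the array names `REPORTS` and `SAMPLES` are illustrative.
+
+```bash
+# Collect the per-sample reports (assumed to be in the current directory).
+REPORTS=( *-kraken2-report.tsv )
+
+# Derive each sample name by stripping the report suffix from the file name.
+SAMPLES=( "${REPORTS[@]%-kraken2-report.tsv}" )
+
+combine_kreports.py --output merged-kraken2-table.tsv \
+                    --report-files "${REPORTS[@]}" \
+                    --sample-names "${SAMPLES[@]}"
+```
+
+Because both arrays come from the same glob expansion, the sample names stay in the same order as the report files.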
+
+
+##### 10cii. Compile Kraken2 Taxonomy Reports
+
+```bash
+multiqc --zip-data-dir \
+        --outdir kraken2_multiqc_report \
+        --filename kraken2_multiqc \
+        --interactive \
+        /path/to/*kraken2-report.tsv
+```
+
+**Parameter Definitions:**
+
+- `--zip-data-dir` - Compress the data directory.
+- `--outdir` - Specifies the output directory to store results.
+- `--filename` - Specifies the filename prefix of results.
+- `--interactive` - Force multiqc to always create interactive javascript plots.
+- `/path/to/*kraken2-report.tsv` - The kraken2 output report files, provided as a positional argument.
+
+**Input Data:**
+
+- \*-kraken2-report.tsv (kraken report from each sample to compile, outputs from [Step 10b](#10b-kraken2-taxonomic-classification))
+
+**Output Data:**
+
+- **kraken2_multiqc.html** (multiqc output html summary)
+- **kraken2_multiqc_data.zip** (zip archive containing multiqc output data)
+
+
+#### 10d. Convert Kraken2 Output to Krona Format
+
+```bash
+kreport2krona.py --report-file sample-kraken2-report.tsv  \
+                 --output sample.krona
+```
+
+**Parameter Definitions:**
+
+- `--report-file` - Specifies the name of the input kraken2 report file.
+- `--output` - Specifies the name of the krona output file.
 
 **Input Data:**
 
@@ -1876,11 +1991,11 @@ kreport2krona.py --report-file sample-kraken2-report.tsv  --output sample.krona
 - sample.krona (krona formatted kraken2 output)
 
 
-#### 10d. Compile kraken2 krona report
+#### 10e. Compile Kraken2 Krona Reports
 
 ```bash
 # Find, list and write all .krona files to file 
-find . -type f -name "*.krona" |sort -uV > krona_files.txt
+find . -type f -name "*.krona" | sort -uV > krona_files.txt
 
 FILES=($(find . -type f -name "*.krona"))
 basename --multiple --suffix='.krona' ${FILES[*]} | sort -uV  > sample_names.txt
@@ -1889,46 +2004,49 @@ basename --multiple --suffix='.krona' ${FILES[*]} | sort -uV  > sample_names.txt
 KTEXT_FILES=($(paste -d',' "krona_files.txt" "sample_names.txt"))
 
 # Create html   
-ktImportText  -o kraken-report.html ${KTEXT_FILES[*]}
+ktImportText -o kraken2-report.html ${KTEXT_FILES[*]}
 ```
 
 **Parameter Definitions:**
 
 **find**
 
-- `-type f` -  specifies that the type of file to find is a regular file
-- `-name "*.krona"` - specifies to find files ending with the .krona suffix  
+- `-type f` -  Specifies that the type of file to find is a regular file.
+- `-name "*.krona"` - Specifies to find files ending with the .krona suffix.  
 
 **sort**
 
-- `-u` - specifies to perform a unique sort
-- `-V` - specifies to perform a mixed type of sorting
+- `-u` - Specifies to perform a unique sort.
+- `-V` - Specifies a natural (version) sort, so that numbers embedded in file names are ordered numerically.
+- `> {}.txt` - Redirects the sorted list to a separate text file.
 
 **basename**
 
-- `--multiple` - support multiple arguments and treat each as a file name
-- `--suffix='.krona'` - remove a trailing '.krona' suffix
+- `--multiple` - Support multiple arguments and treat each as a file name.
+- `--suffix='.krona'` - Remove a trailing '.krona' suffix.
 
 **paste**
 
-- `-d','` - paste both krona and sample files together line by line delimited by comma ','
+- `-d','` - Paste both krona and sample files together line by line delimited by comma ','.
 
 **ktImportText**
 
-- `-o` - specifies the compiled output html file name
-- `${KTEXT_FILES[*]}` - an array positional arguement with the following content: 
-                     sample_1.krona,sample_1 sample_2.krona,sample_2 .. sample_n.krona,sample_n.
+- `-o` - Specifies the compiled output html file name.
+- `${KTEXT_FILES[*]}` - An array positional argument with the following content: sample_1.krona,sample_1 sample_2.krona,sample_2 .. sample_n.krona,sample_n.
 
 **Input Data:**
 
-- *.krona (all sample .krona formatted files, output from [Step 10c](#10c-convert-kraken2-output-to-krona-format)) 
+- *.krona (all sample .krona formatted files, output from [Step 10d](#10d-convert-kraken2-output-to-krona-format)) 
 
                       
 **Output Data:**
 
-- **kraken-report.html** (compiled krona html report output)
+- krona_files.txt (sorted list of all *.krona files)
+- sample_names.txt (sorted list of all sample names)
+- **kraken2-report.html** (compiled krona html report containing all samples)
 
-#### 10e. Create kraken species count table
+
+#### 10f. Create Kraken2 Species Count Table --- START NEEDS REVIEW ---
 
 ```R
 library(tidyverse)
@@ -1958,7 +2076,8 @@ write_csv(x = table2write,
 
 - **kraken_species_table.csv** (kraken species count table in csv format)
 
-#### 10f. Read-in tables
+
+#### 10g. Read-in tables
 
 ```R
 library(tidyverse)
@@ -1984,7 +2103,7 @@ species_table <- species_table[,-match("Species", colnames(species_table))]
 **Input Data:**
 
 - metadata_file  (path to sample-wise metadata file)
-- kraken_species_table.csv (path to kraken species taable)
+- kraken_species_table.csv (path to kraken species table)
 
 **Output Data:**
 
@@ -1992,7 +2111,7 @@ species_table <- species_table[,-match("Species", colnames(species_table))]
 - species_table (a dataframe of species count with rows and columns as species and sample names, respectively)
 
 
-#### 10g. Taxonomy barplots
+#### 10h. Taxonomy barplots
 
 ```R
 library(tidyverse)
@@ -2048,8 +2167,8 @@ ggsave(filename = "filtered-kraken_species_plot.png", plot = p,
 
 **Input Data:**
 
-- `species_table` (a dataframe of species count per sample, output from [Step 10f](#10f-read-in-tables))
-- `metadata` - (a dataframe of sample-wise metadata, output from [Step 10f](#10f-read-in-tables))
+- `species_table` (a dataframe of species count per sample, output from [Step 10g](#10g-read-in-tables))
+- `metadata` (a dataframe of sample-wise metadata, output from [Step 10g](#10g-read-in-tables))
 
 **Output Data:**
 
@@ -2058,7 +2177,7 @@ ggsave(filename = "filtered-kraken_species_plot.png", plot = p,
 - **filtered-kraken_species_plot.png** (barplot after filtering rare and non-microbial taxa)
 
 
-#### 10h. Feature decontamination
+#### 10i. Feature decontamination --- END NEEDS REVIEW ---
 
 Feature decontamination with decontam. Decontam is an R package that statistically identifies contaminating features in a feature table.
 
@@ -2127,8 +2246,8 @@ ggsave(filename = "decontaminated-kraken-species_plot.png", plot = p,
 
 **Input Data:**
 
-- `filtered-kraken_species_table.csv`(path to species count per sample, output from [Step 10g](#10g-taxonomy-barplots))
-- `metadata`(a dataframe of sample-wise metadata, output from step[Step 10f](#10f-read-in-tables))
+- `filtered-kraken_species_table.csv` (path to species count per sample, output from [Step 10h](#10h-taxonomy-barplots))
+- `metadata` (a dataframe of sample-wise metadata, output from [Step 10g](#10g-read-in-tables))
 
 **Output Data:**
 
@@ -2140,13 +2259,16 @@ ggsave(filename = "decontaminated-kraken-species_plot.png", plot = p,
 
 ---
 
-## Assembly-based processing
+## Assembly-based Processing
 
-### 11. Sample assembly
+### 11. Sample Assembly
 
 ```bash
-flye --meta --threads NumberOfThreads --out-dir sample/ \
-     --nano-hq /path/to/decontaminated_raw_data/sample_host_removed.fastq.gz
+flye --meta \
+     --threads NumberOfThreads \
+     --out-dir sample/ \
+     --nano-hq \
+     /path/to/sample_HRrm.fasta.gz
 
 # rename output files            
 mv sample/assembly.fasta sample_assembly.fasta
@@ -2155,14 +2277,15 @@ mv sample/flye.log sample_flye.log
 
 **Parameter Definitions:**
 
-- `--meta` – use metagenome/uneven coverage mode
-- `--threads` - number of parallel processing threads to use
-- `--out-dir` - Output directory
-- `--nano-hq` - specifies that input is from Oxford Nanopore high-quality reads (Guppy5+ SUP or Q20, <5% error). This skips a genome polishing step since the assembly will be polished with medaka in the next step
+- `--meta` – Use metagenome/uneven coverage mode.
+- `--threads` - Number of parallel processing threads to use.
+- `--out-dir` - Specifies the name of the output directory.
+- `--nano-hq` - Specifies that input is from Oxford Nanopore high-quality reads (Guppy5+ SUP or Q20, <5% error). This skips a genome polishing step since the assembly will be polished with medaka in the next step.
+- `/path/to/sample_HRrm.fasta.gz` - Path to the input file, specified as a positional argument.
 
 **Input Data**
 
-- sample_host_removed.fastq.gz (decontaminated raw data in fastq format, output from [Step 7b](#7b-remove-host-reads))
+- sample_HRrm.fasta.gz (filtered and trimmed sample reads with both contaminants and human reads removed, gzipped fasta file, output from [Step 7b](#7b-remove-host-reads))
 
 **Output Data**
 
@@ -2173,25 +2296,27 @@ mv sample/flye.log sample_flye.log
 
 ---
 
-### 12. Polish assembly
+### 12. Polish Assembly
 
 ```bash
-medaka_consensus -t NumberOfThreads -i /path/to/decontaminated_raw_data/sample_host_removed.fastq.gz \
-  -d /path/to/assemblies/sample_assembly.fasta -o sample/
+medaka_consensus -t NumberOfThreads \
+                 -i /path/to/sample_HRrm.fasta.gz \
+                 -d /path/to/assemblies/sample_assembly.fasta \
+                 -o sample/
   
 mv sample/consensus.fasta sample_polished.fasta
 ```
 
 **Parameter Definitions:**
 
-- `-t` - number of parallel processing threads to use
-- `-i` - specifies path to input read files used in creating the assembly
-- `-d` - specifies path to the assembly fasta file
-- `-o` - specifies the output directory
+- `-t` - Number of parallel processing threads to use.
+- `-i` - Specifies path to input read files used in creating the assembly.
+- `-d` - Specifies path to the assembly fasta file.
+- `-o` - Specifies the output directory.
 
 **Input Data:**
 
-- /path/to/decontaminated_raw_data/sample_host_removed.fastq.gz (decontaminated raw data in fastq format, output from [Step 7b](#8b-remove-host-reads))
+- sample_HRrm.fasta.gz (filtered and trimmed sample reads with both contaminants and human reads removed, gzipped fasta file, output from [Step 7b](#7b-remove-host-reads))
 - /path/to/assemblies/sample_assembly.fasta (sample assembly, output from [Step 11](#11-sample-assembly))
 
 **Output Data:**
@@ -2200,19 +2325,21 @@ mv sample/consensus.fasta sample_polished.fasta
 
 ---
 
-### 13. Renaming contigs and summarizing assemblies
+### 13. Rename Contigs and Summarize Assemblies
 
-#### 13a. Renaming contig headers
+#### 13a. Rename Contig Headers
 
 ```bash
-bit-rename-fasta-headers -i sample_polished.fasta -w c_sample -o sample_assembly.fasta
+bit-rename-fasta-headers -i sample_polished.fasta \
+                         -w c_sample \
+                         -o sample-assembly.fasta
 ```
 
 **Parameter Definitions:**  
 
-- `-i` – input fasta file
-- `-w` – wanted header prefix (a number will be appended for each contig), starts with a "c" to ensure they won't start with a number which can be problematic
-- `-o` – output fasta file
+- `-i` – Specifies the input fasta file.
+- `-w` – Specifies the wanted header prefix (a number will be appended for each contig), starts with a "c" to ensure they won't start with a number which can be problematic.
+- `-o` – Specifies the output fasta file.
 
 
 **Input Data:**
@@ -2224,16 +2351,17 @@ bit-rename-fasta-headers -i sample_polished.fasta -w c_sample -o sample_assembly
 - **sample-assembly.fasta** (contig-renamed assembly file)
 
 
-#### 13b. Summarizing assemblies
+#### 13b. Summarize Assemblies
 
 ```bash
-bit-summarize-assembly -o assembly-summaries_GLmetagenomics.tsv *-assembly.fasta
+bit-summarize-assembly -o assembly-summaries_GLmetagenomics.tsv \
+                       *-assembly.fasta
 ```
 
 **Parameter Definitions:**  
 
-- `-o` – output summary table
-- `*-assembly.fasta` - multiple input assemblies provided as positional arguments
+- `-o` – Specifies the output summary table.
+- `*-assembly.fasta` - Specifies the input assemblies to summarize, provided as positional arguments.
 
 **Input Data:**
 
@@ -2247,22 +2375,31 @@ bit-summarize-assembly -o assembly-summaries_GLmetagenomics.tsv *-assembly.fasta
 
 ---
 
-### 14. Gene prediction
+### 14. Gene Prediction
+
+#### 14a. Generate Gene Predictions
+
 ```bash
-prodigal -a sample-genes.faa -d sample-genes.fasta -f gff -p meta -c -q \
-         -o sample-genes.gff -i sample-assembly.fasta
+prodigal -a sample-genes.faa \
+         -d sample-genes.fasta \
+         -f gff \
+         -p meta \
+         -c \
+         -q \
+         -o sample-genes.gff \
+         -i sample-assembly.fasta
 ```
 
 **Parameter Definitions:**
 
-- `-a` – specifies the output amino acid sequences file
-- `-d` – specifies the output nucleotide sequences file
-- `-f` – specifies the output format gene-calls file
-- `-p` – specifies which mode to run the gene-caller in 
-- `-c` – no incomplete genes reported 
-- `-q` – run in quiet mode (don’t output process on each contig) 
-- `-o` – specifies the name of the output gene-calls file 
-- `-i` – specifies the input assembly
+- `-a` – Specifies the output amino acid sequences file.
+- `-d` – Specifies the output nucleotide sequences file.
+- `-f` – Specifies the gene-calls output format, gff = GFF format.
+- `-p` – Specifies which mode to run the gene-caller in. 
+- `-c` – No incomplete genes reported. 
+- `-q` – Run in quiet mode (don’t output process on each contig). 
+- `-o` – Specifies the name of the output gene-calls file. 
+- `-i` – Specifies the input assembly file.
 
 **Input Data:**
 
@@ -2276,7 +2413,8 @@ prodigal -a sample-genes.faa -d sample-genes.fasta -f gff -p meta -c -q \
 
 <br>
 
-#### 14a. Remove line wraps in gene prediction output
+#### 14b. Remove Line Wraps In Gene Prediction Output
+
 ```bash
 bit-remove-wraps sample-genes.faa > sample-genes.faa.tmp 2> /dev/null
 mv sample-genes.faa.tmp sample-genes.faa
@@ -2287,8 +2425,8 @@ mv sample-genes.fasta.tmp sample-genes.fasta
 
 **Input Data:**
 
-- sample-genes.faa (gene-calls amino-acid fasta file, output from [Step 14](#14-gene-prediction))
-- sample-genes.fasta (gene-calls nucleotide fasta file, output from [Step 14](#14-gene-prediction))
+- sample-genes.faa (gene-calls amino-acid fasta file, output from [Step 14a](#14a-generate-gene-predictions))
+- sample-genes.fasta (gene-calls nucleotide fasta file, output from [Step 14a](#14a-generate-gene-predictions))
 
 **Output Data:**
 
@@ -2299,14 +2437,17 @@ mv sample-genes.fasta.tmp sample-genes.fasta
 
 ---
 
-### 15. Functional annotation
+### 15. Functional Annotation
+
 > **Note:**  
 > The annotation process overwrites the same temporary directory by default. When running multiple 
 processses at a time, it is necessary to specify a specific temporary directory with the 
 `--tmp-dir` argument as shown below.
 
 
-#### 15a. Downloading reference database of HMM models (only needs to be done once)
+#### 15a. Download Reference Database of HMM Models
+
+> **Note:** This step only needs to be done once.
 
 ```bash
 curl -LO ftp://ftp.genome.jp/pub/db/kofam/profiles.tar.gz
@@ -2315,40 +2456,48 @@ tar -xzvf profiles.tar.gz
 gunzip ko_list.gz 
 ```
 
-#### 15b. Running KEGG annotation
+#### 15b. Run KEGG Annotation
 
 ```bash
-exec_annotation -p profiles/ -k ko_list --cpu NumberOfThreads -f detail-tsv -o sample-KO-tab.tmp \
-                --tmp-dir sample-tmp-KO --report-unannotated sample-genes.faa 
+exec_annotation -p profiles/ \
+                -k ko_list \
+                --cpu NumberOfThreads \
+                -f detail-tsv \
+                -o sample-KO-tab.tmp \
+                --tmp-dir sample-tmp-KO \
+                --report-unannotated \
+                sample-genes.faa 
 ```
 
 **Parameter Definitions:**
 
-- `-p` – specifies the directory holding the downloaded reference HMMs
-- `-k` – specifies the downloaded reference KO  (Kegg Orthology) terms 
-- `--cpu` – specifies the number of searches to run in parallel
-- `-f` – specifies the output format
-- `-o` – specifies the output file name
-- `--tmp-dir` – specifies the temporary directory to write to (needed if running more than one process concurrently, see Notes above)
-- `--report-unannotated` – specifies to generate an output for each entry
-- `sample-genes.faa` – the input file is specified as a positional argument 
+- `-p` – Specifies the directory holding the downloaded reference HMMs.
+- `-k` – Specifies the downloaded reference KO (KEGG Orthology) terms.
+- `--cpu` – Specifies the number of searches to run in parallel.
+- `-f` – Specifies the output format.
+- `-o` – Specifies the output file name.
+- `--tmp-dir` – Specifies the temporary directory to write to (needed if running more than one process concurrently, see Note above).
+- `--report-unannotated` – Specifies to generate an output for each entry, even when no KO is assigned.
+- `sample-genes.faa` – Specifies the input file, provided as a positional argument. 
 
 
 **Input Data:**
 
-- sample-genes.faa (amino-acid fasta file, from [Step 14](#14-gene-prediction))
-- profiles/ (reference directory holding the KO HMMs)
-- ko_list (reference list of KOs to scan for)
+- sample-genes.faa (amino-acid fasta file, output from [Step 14b](#14b-remove-line-wraps-in-gene-prediction-output))
+- profiles/ (reference directory holding the KO HMMs, downloaded in [Step 15a](#15a-download-reference-database-of-hmm-models))
+- ko_list (reference list of KOs to scan for, downloaded in [Step 15a](#15a-download-reference-database-of-hmm-models))
 
 **Output Data:**
 
 - sample-KO-tab.tmp (table of KO annotations assigned to gene IDs)
 
 
-#### 15c. Filtering output to retain only those passing the KO-specific score and top hits
+#### 15c. Filter KO Outputs
+*Filter KO outputs to retain only those passing the KO-specific score and top hits.*
 
 ```bash
-bit-filter-KOFamScan-results -i sample-KO-tab.tmp -o sample-annotations.tsv
+bit-filter-KOFamScan-results -i sample-KO-tab.tmp \
+                             -o sample-annotations.tsv
 
 # removing temporary files
 rm -rf sample-tmp-KO/ sample-KO-annots.tmp
@@ -2356,12 +2505,12 @@ rm -rf sample-tmp-KO/ sample-KO-annots.tmp
 
 **Parameter Definitions:**  
 
-- `-i` – specifies the input table
-- `-o` – specifies the output table
+- `-i` – Specifies the input table.
+- `-o` – Specifies the output table.
 
 **Input Data:**
 
-- sample-KO-tab.tmp (table of KO annotations assigned to gene IDs from [Step 15b](#15b-running-kegg-annotation))
+- sample-KO-tab.tmp (table of KO annotations assigned to gene IDs, output from [Step 15b](#15b-run-kegg-annotation))
 
 **Output Data:**
 
@@ -2371,94 +2520,115 @@ rm -rf sample-tmp-KO/ sample-KO-annots.tmp
 
 ---
 
-### 16. Taxonomic classification
+### 16. Taxonomic Classification 
 
-#### 16a. Pulling and un-packing pre-built reference db (only needs to be done once)
+#### 16a. Pull and Unpack Pre-built Reference DB 
+
+> **Note:** This step only needs to be done once.
 
 ```bash
 wget tbb.bio.uu.nl/bastiaan/CAT_prepare/CAT_prepare_20200618.tar.gz
 tar -xvzf CAT_prepare_20200618.tar.gz
 ```
 
-#### 16b. Running taxonomic classification
+#### 16b. Run Taxonomic Classification
 
 ```bash
-CAT contigs -c sample-assembly.fasta -d CAT_prepare_20200618/2020-06-18_database/ \
-            -t CAT_prepare_20200618/2020-06-18_taxonomy/ -p sample-genes.faa \
-            -o sample-tax-out.tmp -n NumberOfThreads -r 3 --top 4 --I_know_what_Im_doing --no-stars
+CAT contigs -c sample-assembly.fasta \
+            -d CAT_prepare_20200618/2020-06-18_database/ \
+            -t CAT_prepare_20200618/2020-06-18_taxonomy/ \
+            -p sample-genes.faa \
+            -o sample-tax-out.tmp \
+            -n NumberOfThreads \
+            -r 3 \
+            --top 4 \
+            --I_know_what_Im_doing \
+            --no-stars
 ```
 
 **Parameter Definitions:**  
 
-- `-c` – specifies the input assembly fasta file
-- `-d` – specifies the CAT reference sequence database
-- `-t` – specifies the CAT reference taxonomy database
-- `-p` – specifies the input protein fasta file
-- `-o` – specifies the output prefix
-- `-n` – specifies the number of CPU cores to use
-- `-r` – specifies the number of top protein hits to consider in assigning tax
-- `--top` – specifies the number of protein alignments to store
-- `--I_know_what_Im_doing` – allows us to alter the `--top` parameter
-- `--no-stars` - suppress marking of suggestive taxonomic assignments
+- `-c` – Specifies the input assembly fasta file.
+- `-d` – Specifies the CAT reference sequence database.
+- `-t` – Specifies the CAT reference taxonomy database.
+- `-p` – Specifies the input protein fasta file.
+- `-o` – Specifies the output file prefix.
+- `-n` – Specifies the number of CPU cores to use.
+- `-r` – Specifies the number of top protein hits to consider in assigning taxonomy.
+- `--top` – Specifies the number of protein alignments to store.
+- `--I_know_what_Im_doing` – Allows us to alter the `--top` parameter.
+- `--no-stars` - Suppress marking of suggestive taxonomic assignments.
 
 **Input Data:**
 
-- sample-assembly.fasta (assembly file from [Step 13a](#13a-renaming-contig-headers))
-- sample-genes.faa (gene-calls amino-acid fasta file from [Step 14](#14-gene-prediction))
+- CAT_prepare_20200618/2020-06-18_database/ (directory holding the CAT reference sequence database, output from [Step 16a](#16a-pull-and-unpack-pre-built-reference-db))
+- CAT_prepare_20200618/2020-06-18_taxonomy/ (directory holding the CAT reference taxonomy database, output from [Step 16a](#16a-pull-and-unpack-pre-built-reference-db))
+- sample-assembly.fasta (contig-renamed assembly file, output from [Step 13a](#13a-rename-contig-headers))
+- sample-genes.faa (amino-acid fasta file, output from [Step 14b](#14b-remove-line-wraps-in-gene-prediction-output))
 
 **Output Data:**
 
 - sample-tax-out.tmp.ORF2LCA.txt (gene-calls taxonomy file)
 - sample-tax-out.tmp.contig2classification.txt (contig taxonomy file)
 
-#### 16c. Adding taxonomy info from taxids to genes
+
+#### 16c. Add Taxonomy Info From Taxids To Genes
 
 ```bash
-CAT add_names -i sample-tax-out.tmp.ORF2LCA.txt -o sample-gene-tax-out.tmp \
-              -t CAT_prepare_20200618/2020-06-18_taxonomy/ --only_official --exclude-scores
+CAT add_names -i sample-tax-out.tmp.ORF2LCA.txt \
+              -o sample-gene-tax-out.tmp \
+              -t CAT_prepare_20200618/2020-06-18_taxonomy/ \
+              --only_official \
+              --exclude-scores
 ```
 
 **Parameter Definitions:**  
 
-- `-i` – specifies the input taxonomy file
-- `-o` – specifies the output file 
-- `-t` – specifies the CAT reference taxonomy database
-- `--only_official` – specifies to add only standard taxonomic ranks
-- `--exclude-scores` - specifies to exclude bit-score support scores in the lineage
+- `-i` – Specifies the input taxonomy file.
+- `-o` – Specifies the output file name.
+- `-t` – Specifies the CAT reference taxonomy database.
+- `--only_official` – Specifies to add only standard taxonomic ranks.
+- `--exclude-scores` - Specifies to exclude bit-score support scores in the lineage.
 
 **Input Data:**
 
-- sample-tax-out.tmp.ORF2LCA.txt (gene-calls taxonomy file from [Step 16b](#16b-running-taxonomic-classification))
+- sample-tax-out.tmp.ORF2LCA.txt (gene-calls taxonomy file, output from [Step 16b](#16b-run-taxonomic-classification))
+- CAT_prepare_20200618/2020-06-18_taxonomy/ (directory holding the CAT reference taxonomy database, output from [Step 16a](#16a-pull-and-unpack-pre-built-reference-db))
 
 **Output Data:**
 
 - sample-gene-tax-out.tmp (gene-calls taxonomy file with lineage info added)
 
-#### 16d. Adding taxonomy info from taxids to contigs
+
+#### 16d. Add Taxonomy Info From Taxids To Contigs
 
 ```bash
-CAT add_names -i sample-tax-out.tmp.contig2classification.txt -o sample-contig-tax-out.tmp \
-              -t CAT-ref/2020-06-18_taxonomy/ --only_official --exclude-scores
+CAT add_names -i sample-tax-out.tmp.contig2classification.txt \
+              -o sample-contig-tax-out.tmp \
+              -t CAT_prepare_20200618/2020-06-18_taxonomy/ \
+              --only_official \
+              --exclude-scores
 ```
 
 **Parameter Definitions:**  
 
-- `-i` – specifies the input taxonomy file
-- `-o` – specifies the output file 
-- `-t` – specifies the CAT reference taxonomy database
-- `--only_official` – specifies to add only standard taxonomic ranks
-- `--exclude-scores` - specifies to exclude bit-score support scores in the lineage
+- `-i` – Specifies the input taxonomy file.
+- `-o` – Specifies the output file name.
+- `-t` – Specifies the CAT reference taxonomy database.
+- `--only_official` – Specifies to add only standard taxonomic ranks.
+- `--exclude-scores` - Specifies to exclude bit-score support scores in the lineage.
 
 **Input Data:**
 
-- sample-tax-out.tmp.contig2classification.txt (contig taxonomy file from [Step 16b](#16b-running-taxonomic-classification))
+- sample-tax-out.tmp.contig2classification.txt (contig taxonomy file, output from [Step 16b](#16b-run-taxonomic-classification))
+- CAT_prepare_20200618/2020-06-18_taxonomy/ (directory holding the CAT reference taxonomy database, output from [Step 16a](#16a-pull-and-unpack-pre-built-reference-db))
 
 **Output Data:**
 
 - sample-contig-tax-out.tmp (contig taxonomy file with lineage info added)
 
 
-#### 16e. Formatting gene-level output with awk and sed
+#### 16e. Format Gene-level Output With awk and sed
 
 ```bash
 awk -F $'\t' ' BEGIN { OFS=FS } { if ( $3 == "lineage" ) { print $1,$3,$5,$6,$7,$8,$9,$10,$11 } \
@@ -2471,13 +2641,14 @@ awk -F $'\t' ' BEGIN { OFS=FS } { if ( $3 == "lineage" ) { print $1,$3,$5,$6,$7,
 
 **Input Data:**
 
-- sample-gene-tax-out.tmp (gene-calls taxonomy file with lineage info added from [Step 16c](#16c-adding-taxonomy-info-from-taxids-to-genes))
+- sample-gene-tax-out.tmp (gene-calls taxonomy file with lineage info added, output from [Step 16c](#16c-add-taxonomy-info-from-taxids-to-genes))
 
 **Output Data:**
 
-- sample-gene-tax-out.tsv (gene-calls taxonomy file with lineage info added reformatted)
+- sample-gene-tax-out.tsv (reformatted gene-calls taxonomy file with lineage info)
+
 
-#### 16f. Formatting contig-level output with awk and sed
+#### 16f. Format Contig-level Output With awk and sed
 
 ```bash
 awk -F $'\t' ' BEGIN { OFS=FS } { if ( $2 == "classification" ) { print $1,$4,$6,$7,$8,$9,$10,$11,$12 } \
@@ -2492,11 +2663,11 @@ rm sample*.tmp*
 
 **Input Data:**
 
-- sample-contig-tax-out.tmp (contig taxonomy file with lineage info added from [Step 16d](#16d-adding-taxonomy-info-from-taxids-to-contigs))
+- sample-contig-tax-out.tmp (contig taxonomy file with lineage info added, output from [Step 16d](#16d-add-taxonomy-info-from-taxids-to-contigs))
 
 **Output Data:**
 
-- sample-contig-tax-out.tsv (contig taxonomy file with lineage info added reformatted)
+- sample-contig-tax-out.tsv (reformatted contig taxonomy file with lineage info)
 
 <br>
 
@@ -2507,32 +2678,42 @@ rm sample*.tmp*
 #### 17a. Align Reads to Sample Assembly
 
 ```bash
-minimap2 -a -x map-ont \
-        -t NumberOfThreads \
-        sample_assembly.fasta sample_host_removed.fastq.gz \
-        > sample.sam  2> sample-mapping-info.txt
+minimap2 -a \
+         -x map-ont \
+         -t NumberOfThreads \
+         sample-assembly.fasta \
+         sample_HRrm.fasta.gz \
+         > sample.sam  2> sample-mapping-info.txt
 ```
 
 **Parameter Definitions:**
 
-- `-t` - number of parallel processing threads to use
-- `-a` – output in SAM format
-- `-x map-ont` - specifies preset for mapping Nanopore reads to a reference
+- `-a` – Output in SAM format.
+- `-x map-ont` - Specifies preset for mapping Nanopore reads to a reference.
+- `-t` - Number of parallel processing threads to use.
+- `sample-assembly.fasta` – Assembly fasta file, provided as a positional argument.
+- `sample_HRrm.fasta.gz` - Input sequence data file, provided as a positional argument.
+- `> sample.sam` - Redirects the output to a separate file.
+- `2> sample-mapping-info.txt` - Redirects the standard error to a separate file.
 
 **Input Data**
 
-- /path/to/assemblies/sample_assembly.fasta (Sample assembly, output from [Step 13a](#13a-renaming-contig-headers))
-- /path/to/trimmed_reads/sample_host_removed.fastq.gz (Filtered and trimmed reads, output from [Step 7b](#7b-remove-host-reads))
+- sample-assembly.fasta (contig-renamed assembly file, output from [Step 13a](#13a-rename-contig-headers))
+- sample_HRrm.fasta.gz (filtered and trimmed sample reads with both contaminants and human reads removed, gzipped fasta file, output from [Step 7b](#7b-remove-host-reads))
 
 **Output Data**
 
-- sample.sam (Reads aligned to sample assembly)
+- sample.sam (reads aligned to sample assembly in SAM format)
+- **sample-mapping-info.txt** (read mapping information)
+
 
 #### 17b. Sort and Index Assembly Alignments
 
 ```bash
 # Sort Sam, convert to bam and create index
-samtools sort --threads NumberOfThreads -o sample_sorted.bam sample.sam > sample_sort.log 2>&1
+samtools sort --threads NumberOfThreads \
+              -o sample_sorted.bam \
+              sample.sam > sample_sort.log 2>&1
 
 samtools index sample_sorted.bam sample_sorted.bam.bai
 ```
@@ -2540,52 +2721,55 @@ samtools index sample_sorted.bam sample_sorted.bam.bai
 **Parameter Definitions:**
 
 **samtools sort**
-- `--threads` - number of parallel processing threads to use
-- `-o` - specifies the output file for the sorted reads
-- `sample.sam` - positional argument specifying the input SAM file
+- `--threads` - Number of parallel processing threads to use.
+- `-o` - Specifies the output file for the sorted aligned reads.
+- `sample.sam` - Positional argument specifying the input SAM file.
+- `> sample_sort.log 2>&1` - Redirects the standard output and standard error to a separate file.
 
 **samtools index**
-- `sample_sorted.bam` - positional argument specifying the input BAM file to be sorted
-- `sample_sorted.bam.bai` - positional argument specifying the name of the index file
+- `sample_sorted.bam` - Positional argument specifying the input BAM file to be indexed.
+- `sample_sorted.bam.bai` - Positional argument specifying the name of the index file.
 
 **Input Data:**
 
-- sample.sam (Reads aligned to sample assembly, output from [Step 17a](#17a-align-reads-to-sample-assembly))
+- sample.sam (reads aligned to sample assembly, output from [Step 17a](#17a-align-reads-to-sample-assembly))
 
 **Output Data:**
 
-- sample_sorted.bam (sorted mapping to sample assembly)
-- sample_sorted.bam.bai (index of sorted mapping to sample assembly)
+- **sample_sorted.bam** (sorted mapping to sample assembly, in BAM format)
+- **sample_sorted.bam.bai** (index of sorted mapping to sample assembly)
 
 <br>
 
 ---
 
-### 18. Getting coverage information and filtering based on detection
+### 18. Get Coverage Information and Filter Based On Detection
 > **Note:**  
 > “Detection” is a measure of what proportion of a reference sequence recruited reads 
 (see the discussion of detection [here](http://merenlab.org/2017/05/08/anvio-views/#detection)). 
 Filtering based on detection is one way of helping to mitigate non-specific read-recruitment.
 
-#### 18a. Filtering coverage levels based on detection
+#### 18a. Filter Coverage Levels Based On Detection
 
 ```bash
-  # pileup.sh comes from the bbduk.sh package
-pileup.sh -in sample.bam fastaorf=sample-genes.fasta outorf=sample-gene-cov-and-det.tmp \
+# pileup.sh comes from the BBTools package (the same package that provides bbduk.sh)
+pileup.sh -in sample.bam \
+          fastaorf=sample-genes.fasta \
+          outorf=sample-gene-cov-and-det.tmp \
           out=sample-contig-cov-and-det.tmp
 ```
 
 **Parameter Definitions:**  
 
-- `-in` – the input bam file
-- `fastaorf=` – input gene-calls nucleotide fasta file
-- `outorf=` – the output gene-coverage tsv file
-- `out=` – the output contig-coverage tsv file
+- `-in` – Specifies the input BAM file.
+- `fastaorf=` – Specifies the input gene-calls nucleotide fasta file.
+- `outorf=` – Specifies the output gene-coverage tsv file name.
+- `out=` – Specifies the output contig-coverage tsv file name.
 
 **Input Data:**
 
-- sample.bam (mapping file from [Step 17b](#17b-sort-and-index-assembly-alignments))
-- sample-genes.fasta (gene-calls nucleotide fasta file from [Step 14](#14-gene-prediction))
+- sample.bam (sorted mapping to sample assembly BAM file, output from [Step 17b](#17b-sort-and-index-assembly-alignments))
+- sample-genes.fasta (gene-calls nucleotide fasta file, output from [Step 14a](#14a-gene-prediction))
 
 
 **Output Data:**
@@ -2594,17 +2778,21 @@ pileup.sh -in sample.bam fastaorf=sample-genes.fasta outorf=sample-gene-cov-and-
 - sample-contig-cov-and-det.tmp (contig-coverage tsv file)
 
 
-#### 18b. Filtering gene and contig coverage based on requiring 50% detection and parsing down to just gene ID and coverage
+#### 18b. Filter Gene and Contig Coverage Based On Detection
+
+> *The following commands set the coverage of genes and contigs with 50% detection or less (as defined above) to zero, then parse the tables to retain only the gene or contig IDs and their respective coverage.*
 
 ```bash
 # Filtering gene coverage
-grep -v "#" sample-gene-cov-and-det.tmp | awk -F $'\t' ' BEGIN { OFS=FS } { if ( $10 <= 0.5 ) $4 = 0 } \
+grep -v "#" sample-gene-cov-and-det.tmp | \
+awk -F $'\t' ' BEGIN { OFS=FS } { if ( $10 <= 0.5 ) $4 = 0 } \
      { print $1,$4 } ' > sample-gene-cov.tmp
 
 cat <( printf "gene_ID\tcoverage\n" ) sample-gene-cov.tmp > sample-gene-coverages.tsv
 
 # Filtering contig coverage
-grep -v "#" sample-contig-cov-and-det.tmp | awk -F $'\t' ' BEGIN { OFS=FS } { if ( $5 <= 50 ) $2 = 0 } \
+grep -v "#" sample-contig-cov-and-det.tmp | \
+awk -F $'\t' ' BEGIN { OFS=FS } { if ( $5 <= 50 ) $2 = 0 } \
      { print $1,$2 } ' > sample-contig-cov.tmp
 
 cat <( printf "contig_ID\tcoverage\n" ) sample-contig-cov.tmp > sample-contig-coverages.tsv
@@ -2615,8 +2803,8 @@ rm sample-*.tmp
 
 **Input Data:**
 
-- sample-gene-cov-and-det.tmp (temporary gene-coverage tsv file from [Step 18a](#18a-filtering-coverage-levels-based-on-detection))
-- sample-contig-cov-and-det.tmp (temporary contig-coverage tsv file from [Step 18a](#18a-filtering-coverage-levels-based-on-detection))
+- sample-gene-cov-and-det.tmp (temporary gene-coverage tsv file, output from [Step 18a](#18a-filter-coverage-levels-based-on-detection))
+- sample-contig-cov-and-det.tmp (temporary contig-coverage tsv file, output from [Step 18a](#18a-filter-coverage-levels-based-on-detection))
 
 **Output Data:**
 
@@ -2627,28 +2815,32 @@ rm sample-*.tmp
 
 ---
 
-### 19. Combining gene-level coverage, taxonomy, and functional annotations into one table for each sample
+### 19. Combine Gene-level Coverage, Taxonomy, and Functional Annotations For Each Sample
 > **Note:**  
-> Just uses `paste`, `sed`, and `awk`, which are all standard in any Unix-like environment.  
+> Uses only standard Unix commands (`paste`, `sort`, `cut`, `head`, `tail`, and `cat`) to combine gene-level coverage, taxonomy, and functional annotations into one table for each sample.  
 
 ```bash
-paste <( tail -n +2 sample-gene-coverages.tsv | sort -V -k 1 ) <( tail -n +2 sample-annotations.tsv | sort -V -k 1 | cut -f 2- ) \
-      <( tail -n +2 sample-gene-tax-out.tsv | sort -V -k 1 | cut -f 2- ) > sample-gene-tab.tmp
+paste <( tail -n +2 sample-gene-coverages.tsv | sort -V -k 1 ) \
+      <( tail -n +2 sample-annotations.tsv | sort -V -k 1 | cut -f 2- ) \
+      <( tail -n +2 sample-gene-tax-out.tsv | sort -V -k 1 | cut -f 2- ) \
+      > sample-gene-tab.tmp
 
-paste <( head -n 1 sample-gene-coverages.tsv ) <( head -n 1 sample-annotations.tsv | cut -f 2- ) \
-      <( head -n 1 sample-gene-tax-out.tsv | cut -f 2- ) > sample-header.tmp
+paste <( head -n 1 sample-gene-coverages.tsv ) \
+      <( head -n 1 sample-annotations.tsv | cut -f 2- ) \
+      <( head -n 1 sample-gene-tax-out.tsv | cut -f 2- ) \
+      > sample-header.tmp
 
 cat sample-header.tmp sample-gene-tab.tmp > sample-gene-coverage-annotation-and-tax.tsv
 
-  # removing intermediate files
+# removing intermediate files
 rm sample*tmp sample-gene-coverages.tsv sample-annotations.tsv sample-gene-tax-out.tsv
 ```
 
 **Input Data:**
 
-- sample-gene-coverages.tsv (table with gene-level coverages from [Step 18b](#18b-filtering-gene-and-contig-coverage-based-on-requiring-50-detection-and-parsing-down-to-just-gene-id-and-coverage))
-- sample-annotations.tsv (table of KO annotations assigned to gene IDs from [Step 15c](#15c-filtering-output-to-retain-only-those-passing-the-ko-specific-score-and-top-hits))
-- sample-gene-tax-out.tsv (gene-level taxonomic classifications from [Step 16f](#16f-formatting-contig-level-output-with-awk-and-sed))
+- sample-gene-coverages.tsv (table with gene-level coverages, output from [Step 18b](#18b-filter-gene-and-contig-coverage-based-on-detection))
+- sample-annotations.tsv (table of KO annotations assigned to gene IDs, output from [Step 15c](#15c-filter-ko-outputs))
+- sample-gene-tax-out.tsv (reformatted gene-calls taxonomy file with lineage info, output from [Step 16e](#16e-format-gene-level-output-with-awk-and-sed))
 
 
 **Output Data:**
@@ -2659,27 +2851,29 @@ rm sample*tmp sample-gene-coverages.tsv sample-annotations.tsv sample-gene-tax-o
 
 ---
 
-### 20. Combining contig-level coverage and taxonomy into one table for each sample
+### 20. Combine Contig-level Coverage and Taxonomy For Each Sample
 > **Note:**  
-> Just uses `paste`, `sed`, and `awk`, which are all standard in any Unix-like environment.  
+> Uses only standard Unix commands (`paste`, `sort`, `cut`, `head`, `tail`, and `cat`) to combine contig-level coverage and taxonomy into one table for each sample.
 
 ```bash
 paste <( tail -n +2 sample-contig-coverages.tsv | sort -V -k 1 ) \
-      <( tail -n +2 sample-contig-tax-out.tsv | sort -V -k 1 | cut -f 2- ) > sample-contig.tmp
+      <( tail -n +2 sample-contig-tax-out.tsv | sort -V -k 1 | cut -f 2- ) \
+      > sample-contig.tmp
 
-paste <( head -n 1 sample-contig-coverages.tsv ) <( head -n 1 sample-contig-tax-out.tsv | cut -f 2- ) \
+paste <( head -n 1 sample-contig-coverages.tsv ) \
+      <( head -n 1 sample-contig-tax-out.tsv | cut -f 2- ) \
       > sample-contig-header.tmp
       
 cat sample-contig-header.tmp sample-contig.tmp > sample-contig-coverage-and-tax.tsv
 
-  # removing intermediate files
+# removing intermediate files
 rm sample*tmp sample-contig-coverages.tsv sample-contig-tax-out.tsv
 ```
 
 **Input Data:**
 
-- sample-contig-coverages.tsv (table with contig-level coverages from [Step 18b](#18b-filtering-gene-and-contig-coverage-based-on-requiring-50-detection-and-parsing-down-to-just-gene-id-and-coverage))
-- sample-contig-tax-out.tsv (contig-level taxonomic classifications from [Step 16f](#16f-formatting-contig-level-output-with-awk-and-sed))
+- sample-contig-coverages.tsv (table with contig-level coverages, output from [Step 18b](#18b-filter-gene-and-contig-coverage-based-on-detection))
+- sample-contig-tax-out.tsv (reformatted contig taxonomy file with lineage info, output from [Step 16f](#16f-format-contig-level-output-with-awk-and-sed))
 
 
 **Output Data:**
@@ -2690,34 +2884,35 @@ rm sample*tmp sample-contig-coverages.tsv sample-contig-tax-out.tsv
 
 ---
 
-### 21. Generating normalized, gene- and contig-level coverage summary tables of KO-annotations and taxonomy across samples
+### 21. Generate Normalized, Gene- and Contig-level Coverage Summary Tables of KO-annotations and Taxonomy Across Samples
 
 > **Note:**  
 > * To combine across samples to generate these summary tables, we need the same "units". This is done for annotations 
 based on the assigned KO terms, and all non-annotated functions are included together as "Not annotated". It is done for 
-taxonomic classifications based on taxids (full lineages included in the table), and any not classified are included 
+taxonomic classifications based on taxids (full lineages included in the table), and any genes not classified are included 
 together as "Not classified". 
 > * The values we are working with are coverage per gene (so they are number of bases recruited to the gene normalized 
 by the length of the gene). These have been normalized by making the total coverage of a sample 1,000,000 and setting 
 each individual gene-level coverage its proportion of that 1,000,000 total. So basically percent, but out of 1,000,000 
 instead of 100 to make the numbers more friendly. 
 
-#### 21a. Generating gene-level coverage summary tables
+#### 21a. Generate Gene-level Coverage Summary Tables
 
 ```bash
-bit-GL-combine-KO-and-tax-tables *-gene-coverage-annotation-and-tax.tsv -o Combined
+bit-GL-combine-KO-and-tax-tables *-gene-coverage-annotation-and-tax.tsv \
+                                 -o Combined
 ```
 
 **Parameter Definitions:**  
 
-- `*-gene-coverage-annotation-and-tax.tsv` - positional arguments specifying the input tsv files, can be provided as a space-delimited list of files, or with wildcards like above
+- `*-gene-coverage-annotation-and-tax.tsv` - Positional arguments specifying the input tsv files; these can be provided as a space-delimited list of files or with wildcards, as above.
 
-- `-o` – specifies the output prefix
+- `-o` – Specifies the output file prefix.
 
 
 **Input Data:**
 
-- *-gene-coverage-annotation-and-tax.tsv (tables with combined gene coverage, annotation, and taxonomy info generated for individual samples from [Step 19](#19-combining-gene-level-coverage-taxonomy-and-functional-annotations-into-one-table-for-each-sample))
+- *-gene-coverage-annotation-and-tax.tsv (tables with combined gene coverage, annotation, and taxonomy info generated for individual samples, output from [Step 19](#19-combine-gene-level-coverage-taxonomy-and-functional-annotations-for-each-sample))
 
 **Output Data:**
 
@@ -2727,7 +2922,7 @@ bit-GL-combine-KO-and-tax-tables *-gene-coverage-annotation-and-tax.tsv -o Combi
 - **Combined-gene-level-taxonomy-coverages_GLmetagenomics.tsv** (table with all samples combined based on gene-level taxonomic classifications)
 
 
-#### 21b. Gene-level taxonomy heatmaps
+#### 21b. Gene-level taxonomy heatmaps --- START NEEDS REVIEW ---
 
 ```R
 library(tidyverse)
@@ -2971,7 +3166,7 @@ dev.off()
 - **All-genes-KO-functions-heatmap_GLmetagenomics.png** (heatmap of gene-wise KO function assignments)
 - **Abundant-genes-KO-functions-heatmap_GLmetagenomics.png** (heatmap of gene-wise abundant KO function assignments)
 
-#### 21e. Gene-level KO functions decontamination
+#### 21e. Gene-level KO functions decontamination --- END NEEDS REVIEW ---
 
 ```R
 library(tidyverse)
@@ -3059,7 +3254,7 @@ dev.off()
 
 
 
-#### 21f. Generating contig-level coverage summary tables
+#### 21f. Generate Contig-level Coverage Summary Tables
 
 ```bash
 bit-GL-combine-contig-tax-tables *-contig-coverage-and-tax.tsv -o Combined
@@ -3067,23 +3262,23 @@ bit-GL-combine-contig-tax-tables *-contig-coverage-and-tax.tsv -o Combined
 
 **Parameter Definitions:**  
 
-- `*-contig-coverage-and-tax.tsv` - positional arguments specifying the input tsv files, can be provided as a space-delimited list of files, or with wildcards like above
-- `-o` – specifies the output prefix
+- `*-contig-coverage-and-tax.tsv` - Positional arguments specifying the input tsv files; these can be provided as a space-delimited list of files or with wildcards, as above.
+- `-o` – Specifies the output file prefix.
 
 
 **Input Data:**
 
-- *-contig-coverage-annotation-and-tax.tsv (tables with combined contig coverage, annotation, and taxonomy info generated for individual samples from [Step 20](#20-combining-contig-level-coverage-and-taxonomy-into-one-table-for-each-sample))
+- *-contig-coverage-and-tax.tsv (tables with combined contig coverage and taxonomy info generated for individual samples, output from [Step 20](#20-combine-contig-level-coverage-and-taxonomy-for-each-sample))
 
 **Output Data:**
 
-- **Combined-contig-level-taxonomy-coverages-CPM_GLmetagenomics.tsv** (table with all samples combined based on contig-level taxonomic classifications; normalized to coverage per million genes covered)
+- **Combined-contig-level-taxonomy-coverages-CPM_GLmetagenomics.tsv** (table with all samples combined based on contig-level taxonomic classifications; normalized to coverage per million contigs covered)
 - **Combined-contig-level-taxonomy-coverages_GLmetagenomics.tsv** (table with all samples combined based on contig-level taxonomic classifications)
 
 <br>
 
 
-#### 21g. Contig-level Heatmaps
+#### 21g. Contig-level Heatmaps --- START NEEDS REVIEW ---
 
 ```R
 plot_width <- 20
@@ -3160,7 +3355,7 @@ dev.off()
 - **Abundant-contig-taxonomy-heatmap_GLmetagenomics.png** (Abundant contig level taxonomy heatmap)
 
 
-#### 21h. Contig-level decontamination
+#### 21h. Contig-level decontamination --- END NEEDS REVIEW ---
 
 ```R
 library(tidyverse)
@@ -3250,14 +3445,22 @@ dev.off()
 
 ---
 
-### 22. **M**etagenome-**A**ssembled **G**enome (MAG) recovery
+### 22. **M**etagenome-**A**ssembled **G**enome (MAG) Recovery
 
-#### 22a. Binning contigs
+#### 22a. Bin Contigs
 
 ```bash
-jgi_summarize_bam_contig_depths --outputDepth sample-metabat-assembly-depth.tsv --percentIdentity 97 --minContigLength 1000 --minContigDepth 1.0  --referenceFasta sample-assembly.fasta sample.bam
-
-metabat2  --inFile sample-assembly.fasta --outFile sample --abdFile sample-metabat-assembly-depth.tsv -t NumberOfThreads
+jgi_summarize_bam_contig_depths --outputDepth sample-metabat-assembly-depth.tsv \
+                                --percentIdentity 97 \
+                                --minContigLength 1000 \
+                                --minContigDepth 1.0  \
+                                --referenceFasta sample-assembly.fasta \
+                                sample.bam
+
+metabat2  --inFile sample-assembly.fasta \
+          --outFile sample \
+          --abdFile sample-metabat-assembly-depth.tsv \
+          -t NumberOfThreads
 
 mkdir sample-bins
 mv sample*bin*.fasta sample-bins
@@ -3266,22 +3469,27 @@ zip -r sample-bins.zip sample-bins
 
 **Parameter Definitions:**  
 
--  `--outputDepth` – specifies the output depth file
--  `--percentIdentity` – minimum end-to-end percent identity of a mapped read to be included
--  `--minContigLength` – minimum contig length to include
--  `--minContigDepth` – minimum contig depth to include
--  `--referenceFasta` – the assembly fasta file generated in step 5a
--  `sample.bam` – final positional arguments are the bam files generated in step 9
--  `--inFile` - the assembly fasta file generated in step 5a
--  `--outFile` - the prefix of the identified bins output files
--  `--abdFile` - the depth file generated by the previous `jgi_summarize_bam_contig_depths` command
--  `-t` - number of parallel processing threads to use
+**jgi_summarize_bam_contig_depths**
+
+-  `--outputDepth` – Specifies the output depth file name.
+-  `--percentIdentity` – Minimum end-to-end percent identity of a mapped read to be included.
+-  `--minContigLength` – Minimum contig length to include.
+-  `--minContigDepth` – Minimum contig depth to include.
+-  `--referenceFasta` – Specifies the input assembly fasta file.
+-  `sample.bam` – Input alignment BAM file, specified as a positional argument.
+
+**metabat2**
+
+-  `--inFile` - Specifies the input assembly fasta file.
+-  `--outFile` - Specifies the prefix of the identified bins output files.
+-  `--abdFile` - Specifies the depth file generated by the previous `jgi_summarize_bam_contig_depths` command.
+-  `-t` - Number of parallel processing threads to use.
 
 
 **Input Data:**
 
-- sample-assembly.fasta (assembly fasta file created in [Step 13a](#13a-renaming-contig-headers))
-- sample.bam (bam file created in [Step 17b](#17b-sort-and-index-assembly-alignments))
+- sample-assembly.fasta (contig-renamed assembly file, output from [Step 13a](#13a-rename-contig-headers))
+- sample.bam (sorted mapping to sample assembly BAM file, output from [Step 17b](#17b-sort-and-index-assembly-alignments))
 
 **Output Data:**
 
@@ -3289,32 +3497,36 @@ zip -r sample-bins.zip sample-bins
 - sample-bins/sample-bin\*.fasta (fasta files of recovered bins)
 - **sample-bins.zip** (zip file containing fasta files of recovered bins)
 
-#### 22b. Bin quality assessment
-Utilizes the default `checkm` database available [here](https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz), `checkm_data_2015_01_16.tar.gz`.
+#### 22b. Bin Quality Assessment
+> Utilizes the default `checkm` database available [here](https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz), `checkm_data_2015_01_16.tar.gz`.
 
 ```bash
-checkm lineage_wf -f bins-overview_GLmetagenomics.tsv --tab_table -x fa ./ checkm-output-dir
+checkm lineage_wf -f bins-overview_GLmetagenomics.tsv \
+                  --tab_table \
+                  -x fasta \
+                  ./ \
+                  checkm-output-dir
 ```
 
 **Parameter Definitions:**  
 
--  `lineage_wf` – specifies the workflow being utilized
--  `-f` – specifies the output summary file
--  `--tab_table` – specifies the output summary file should be a tab-delimited table
--  `-x` – specifies the extension that is on the bin fasta files that are being assessed
--  `./` – first positional argument at end specifies the directory holding the bins generated in step 14a
--  `checkm-output-dir` – second positional argument at end specifies the primary checkm output directory with detailed information
+-  `lineage_wf` – Specifies the workflow being utilized.
+-  `-f` – Specifies the output summary file name.
+-  `--tab_table` – Specifies the output summary file should be a tab-delimited table.
+-  `-x` – Specifies the extension that is on the bin fasta files that are being assessed.
+-  `./` – Specifies the directory holding the bins, provided as a positional argument.
+-  `checkm-output-dir` – Specifies the primary checkm output directory, provided as a positional argument.
 
 **Input Data:**
 
-- sample-bins/sample-bin\*.fasta (bin fasta files generated in [Step 22a](#22a-binning-contigs))
+- sample-bins/sample-bin\*.fasta (fasta files of recovered bins, output from [Step 22a](#22a-bin-contigs))
 
 **Output Data:**
 
 - **bins-overview_GLmetagenomics.tsv** (tab-delimited file with quality estimates per bin)
-- checkm-output-dir (directory holding detailed checkm outputs)
+- checkm-output-dir/ (directory holding detailed checkm outputs)
 
-#### 22c. Filtering MAGs
+#### 22c. Filter MAGs
 
 ```bash
 cat <( head -n 1 bins-overview_GLmetagenomics.tsv ) \
@@ -3350,30 +3562,33 @@ done
 - **\*-MAGs.zip** (zip files containing directories of high-quality MAGs)
 
 
-#### 22d. MAG taxonomic classification
-Uses default `gtdbtk` database setup with program's `download.sh` command.
+#### 22d. MAG Taxonomic Classification
+> Uses default `gtdbtk` database setup with program's `download.sh` command.
 
 ```bash
-gtdbtk classify_wf --genome_dir MAGs/ -x fa --out_dir gtdbtk-output-dir  --skip_ani_screen
+gtdbtk classify_wf --genome_dir MAGs/ \
+                   -x fasta \
+                   --out_dir gtdbtk-output-dir \
+                   --skip_ani_screen
 ```
 
 **Parameter Definitions:**  
 
--  `classify_wf` – specifies the workflow being utilized
--  `--genome_dir` – specifies the directory holding the MAGs generated in step 14c
--  `-x` – specifies the extension that is on the MAG fasta files that are being taxonomically classified
--  `--out_dir` – specifies the output directory
--  `--skip_ani_screen`  - specifies to skip ani_screening step to classify genomes using mash and skani
+-  `classify_wf` – Specifies the workflow being utilized.
+-  `--genome_dir` – Specifies the directory holding the MAGs to classify.
+-  `-x` – Specifies the extension that is on the MAG fasta files that are being taxonomically classified.
+-  `--out_dir` – Specifies the output directory name.
+-  `--skip_ani_screen`  - Skips the ANI screening step, which would otherwise pre-classify genomes using mash and skani.
 
 **Input Data:**
 
-- MAGs/\*.fasta (directory holding high-quality MAGs from [Step 22c](#22c-filtering-mags))
+- MAGs/\*.fasta (directory holding high-quality MAGs, output from [Step 22c](#22c-filter-mags))
 
 **Output Data:**
 
 - gtdbtk-output-dir/gtdbtk.\*.summary.tsv (files with assigned taxonomy and info)
 
-#### 22e. Generating overview table of all MAGs
+#### 22e. Generate Overview Table Of All MAGs
 
 ```bash
 # combine summaries
@@ -3413,10 +3628,10 @@ cat MAGs-overview-header.tmp MAGs-overview-sorted.tmp \
 
 **Input Data:**
 
-- assembly-summaries_GLmetagenomics.tsv (table of assembly summary statistics from [Step 13b](#13b-summarizing-assemblies))
-- MAGs/\*.fasta (directory holding high-quality MAGs from [Step 22c](#23c-filtering-mags))
-- checkm-MAGs-overview.tsv (tab-delimited file with quality estimates per MAG from [Step 22c](#22c-filtering-mags))
-- gtdbtk-output-dir/gtdbtk.\*.summary.tsv (directory of files with assigned taxonomy and info from [Step 22d](#22d-mag-taxonomic-classification))
+- assembly-summaries_GLmetagenomics.tsv (table of assembly summary statistics, output from [Step 13b](#13b-summarize-assemblies))
+- MAGs/\*.fasta (directory holding high-quality MAGs, output from [Step 22c](#22c-filter-mags))
+- checkm-MAGs-overview.tsv (tab-delimited file with quality estimates per MAG, output from [Step 22c](#22c-filter-mags))
+- gtdbtk-output-dir/gtdbtk.\*.summary.tsv (directory of files with assigned taxonomy and info, output from [Step 22d](#22d-mag-taxonomic-classification))
 
 **Output Data:**
 
@@ -3427,10 +3642,10 @@ cat MAGs-overview-header.tmp MAGs-overview-sorted.tmp \
 
 ---
 
-### 23. Generating MAG-level functional summary overview
+### 23. Generate MAG-level Functional Summary Overview
 
-#### 23a. Getting KO annotations per MAG
-This utilizes the helper script [`parse-MAG-annots.py`](../Workflow_Documentation/NF_MGIllumina/workflow_code/bin/parse-MAG-annots.py) 
+#### 23a. Get KO Annotations Per MAG
+> This utilizes the helper script [`parse-MAG-annots.py`](../Workflow_Documentation/NF_MGIllumina/workflow_code/bin/parse-MAG-annots.py).
 
 ```bash
 for file in $( ls MAGs/*.fasta )
@@ -3442,7 +3657,8 @@ do
     grep "^>" ${file} | tr -d ">" > ${MAG_ID}-contigs.tmp
 
     python parse-MAG-annots.py -i annotations-and-taxonomy/${sample_ID}-gene-coverage-annotation-and-tax.tsv \
-                               -w ${MAG_ID}-contigs.tmp -M ${MAG_ID} \
+                               -w ${MAG_ID}-contigs.tmp \
+                               -M ${MAG_ID} \
                                -o MAG-level-KO-annotations_GLmetagenomics.tsv
 
     rm ${MAG_ID}-contigs.tmp
@@ -3452,36 +3668,38 @@ done
 
 **Parameter Definitions:**  
 
-- `-i` – specifies the input sample gene-coverage-annotation-and-tax.tsv file generated in step 11
--  `-w` – specifies the appropriate temporary file holding all the contigs in the current MAG
-- `-M` – specifies the current MAG unique identifier
-- `-o` – specifies the output file
+- `-i` – Specifies the input sample TSV file containing sample coverage, annotation, and taxonomy info.
+- `-w` – Specifies the appropriate temporary file holding all the contigs in the current MAG.
+- `-M` – Specifies the current MAG unique identifier.
+- `-o` – Specifies the output file name.
 
 **Input Data:**
 
-- \*-gene-coverage-annotation-and-tax.tsv (sample gene-coverage-annotation-and-tax.tsv file generated in [Step 19](#19-combining-gene-level-coverage-taxonomy-and-functional-annotations-into-one-table-for-each-sample))
-- MAGs/\*.fasta (directory holding high-quality MAGs from [Step 22c](#22c-filtering-mags))
+- \*-gene-coverage-annotation-and-tax.tsv (tables with combined gene coverage, annotation, and taxonomy info generated for individual samples, output from [Step 19](#19-combine-gene-level-coverage-taxonomy-and-functional-annotations-for-each-sample))
+- MAGs/\*.fasta (directory holding high-quality MAGs, output from [Step 22c](#22c-filter-mags))
 
 **Output Data:**
 
 - **MAG-level-KO-annotations_GLmetagenomics.tsv** (tab-delimited table holding MAGs and their KO annotations)
 
 
-#### 23b. Summarizing KO annotations with KEGG-Decoder
+#### 23b. Summarize KO Annotations With KEGG-Decoder
 
 ```bash
-KEGG-decoder -v interactive -i MAG-level-KO-annotations_GLmetagenomics.tsv -o MAG-KEGG-Decoder-out_GLmetagenomics.tsv
+KEGG-decoder -v interactive \
+             -i MAG-level-KO-annotations_GLmetagenomics.tsv \
+             -o MAG-KEGG-Decoder-out_GLmetagenomics.tsv
 ```
 
 **Parameter Definitions:**  
 
-- `-v interactive` – specifies to create an interactive html output
-- `-i` – specifies the input MAG-level-KO-annotations_GLmetagenomics.tsv file generated in [Step 23a](#23a-getting-ko-annotations-per-mag)
-- `-o` – specifies the output table
+- `-v interactive` – Specifies to create an interactive html output.
+- `-i` – Specifies the input tab-delimited table holding MAGs and their KO annotations.
+- `-o` – Specifies the output table.
 
 **Input Data:**
 
-- MAG-level-KO-annotations_GLmetagenomics.tsv (tab-delimited table holding MAGs and their KO annotations, generated in [Step 23a](#23a-getting-ko-annotations-per-mag))
+- MAG-level-KO-annotations_GLmetagenomics.tsv (tab-delimited table holding MAGs and their KO annotations, output from [Step 23a](#23a-get-ko-annotations-per-mag))
 
 **Output Data:**
 

From 82a7043c638aa56e86abbd157fc438ac299c9895 Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Mon, 3 Nov 2025 09:14:41 -0800
Subject: [PATCH 16/47] Update GL-DPPD-7110-A_annotations.csv

Fixed a typo in the fasta file name for Mycobacterium marinum.
---
 .../GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv            | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv
index c2a881e26..bd009024f 100644
--- a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv
+++ b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv
@@ -11,7 +11,7 @@ ECOLI,Escherichia coli,str. K-12 substr. MG1655,59,ensembl_bacteria,https://ftp.
 HUMAN,Homo sapiens,,112,ensembl,https://ftp.ensembl.org/pub/release-112/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz,https://ftp.ensembl.org/pub/release-112/gtf/homo_sapiens/Homo_sapiens.GRCh38.112.gtf.gz,9606,org.Hs.eg.db,,https://figshare.com/ndownloader/files/48354445,https://figshare.com/ndownloader/files/48354448
 ,Lactobacillus acidophilus,NCFM,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/985/GCF_000011985.1_ASM1198v1/GCF_000011985.1_ASM1198v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/985/GCF_000011985.1_ASM1198v1/GCF_000011985.1_ASM1198v1_genomic.gtf.gz,272621,,,https://figshare.com/ndownloader/files/49061254,https://figshare.com/ndownloader/files/49061257
 MOUSE,Mus musculus,,112,ensembl,https://ftp.ensembl.org/pub/release-112/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz,https://ftp.ensembl.org/pub/release-112/gtf/mus_musculus/Mus_musculus.GRCm39.112.gtf.gz,10090,org.Mm.eg.db,,https://figshare.com/ndownloader/files/48354460,https://figshare.com/ndownloader/files/48354457
-,Mycobacterium marinum,M,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/018/345/GCF_000018345.1_ASM1834v1/GCF_000018345.1_ASM1834v1_genomic.gtf.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/018/345/GCF_000018345.1_ASM1834v1/GCF_000018345.1_ASM1834v1_genomic.gtf.gz,216594,,,https://figshare.com/ndownloader/files/49061260,https://figshare.com/ndownloader/files/49061263
+,Mycobacterium marinum,M,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/018/345/GCF_000018345.1_ASM1834v1/GCF_000018345.1_ASM1834v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/018/345/GCF_000018345.1_ASM1834v1/GCF_000018345.1_ASM1834v1_genomic.gtf.gz,216594,,,https://figshare.com/ndownloader/files/49061260,https://figshare.com/ndownloader/files/49061263
 ORYSJ,Oryza sativa,Japonica,59,ensembl_plants,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/oryza_sativa/dna/Oryza_sativa.IRGSP-1.0.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/gtf/oryza_sativa/Oryza_sativa.IRGSP-1.0.59.gtf.gz,39947,,,https://figshare.com/ndownloader/files/48354451,https://figshare.com/ndownloader/files/48354454
 ORYLA,Oryzias latipes,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/oryzias_latipes/dna/Oryzias_latipes.ASM223467v1.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/oryzias_latipes/Oryzias_latipes.ASM223467v1.112.gtf.gz,8090,,org.Olatipes.eg.db,https://figshare.com/ndownloader/files/48354463,https://figshare.com/ndownloader/files/48354466
 ,Pseudomonas aeruginosa,UCBPP-PA14,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/014/625/GCF_000014625.1_ASM1462v1/GCF_000014625.1_ASM1462v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/014/625/GCF_000014625.1_ASM1462v1/GCF_000014625.1_ASM1462v1_genomic.gtf.gz,208963,,,https://figshare.com/ndownloader/files/49061266,https://figshare.com/ndownloader/files/49061269
@@ -21,4 +21,5 @@ SALTY,Salmonella enterica,serovar Typhimurium str. LT2,,ncbi,https://ftp.ncbi.nl
 ,Serratia liquefaciens,ATCC 27592,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/422/085/GCF_000422085.1_ASM42208v1/GCF_000422085.1_ASM42208v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/422/085/GCF_000422085.1_ASM42208v1/GCF_000422085.1_ASM42208v1_genomic.gtf.gz,1346614,,,https://figshare.com/ndownloader/files/49061278,https://figshare.com/ndownloader/files/49061281
 ,Staphylococcus aureus,MRSA252,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/505/GCF_000011505.1_ASM1150v1/GCF_000011505.1_ASM1150v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/505/GCF_000011505.1_ASM1150v1/GCF_000011505.1_ASM1150v1_genomic.gtf.gz,282458,,,https://figshare.com/ndownloader/files/49061284,https://figshare.com/ndownloader/files/49061287
 ,Streptococcus mutans,UA159,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/465/GCF_000007465.2_ASM746v2/GCF_000007465.2_ASM746v2_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/465/GCF_000007465.2_ASM746v2/GCF_000007465.2_ASM746v2_genomic.gtf.gz,210007,,,https://figshare.com/ndownloader/files/49061290,https://figshare.com/ndownloader/files/49061293
-,Vibrio fischeri,ES114,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/805/GCF_000011805.1_ASM1180v1/GCF_000011805.1_ASM1180v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/805/GCF_000011805.1_ASM1180v1/GCF_000011805.1_ASM1180v1_genomic.gtf.gz,312309,,,https://figshare.com/ndownloader/files/49061296,https://figshare.com/ndownloader/files/49061299
\ No newline at end of file
+
+,Vibrio fischeri,ES114,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/805/GCF_000011805.1_ASM1180v1/GCF_000011805.1_ASM1180v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/805/GCF_000011805.1_ASM1180v1/GCF_000011805.1_ASM1180v1_genomic.gtf.gz,312309,,,https://figshare.com/ndownloader/files/49061296,https://figshare.com/ndownloader/files/49061299

From a93c9c4bb05c593f973f7b30783bd447b5d44afb Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Tue, 4 Nov 2025 10:41:14 -0800
Subject: [PATCH 17/47] Update humann3.yaml

Fixed typo in metaphlan version specification.
---
 .../NF_MGIllumina/workflow_code/envs/humann3.yaml              | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
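
Reviewer note (sits in the patch notes area, so it is not part of the commit): the pins `4.10` and `4.1.0` are different version strings, which is why the typo matters. Below is a minimal sketch of that difference, assuming GNU coreutils `sort -V` is available. It only illustrates version ordering and does not invoke conda's own matcher; the variable names are hypothetical.

```shell
# Illustrative only: compare the mistyped and corrected version pins.
# Assumes GNU coreutils `sort -V` (version sort); conda itself is not called.
typo="4.10"
fixed="4.1.0"

# Version sort orders by numeric components, so minor version 10 ranks above minor version 1.
newest=$(printf '%s\n' "$typo" "$fixed" | sort -V | tail -n 1)
echo "higher version string: ${newest}"
```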

diff --git a/Metagenomics/Illumina/Workflow_Documentation/NF_MGIllumina/workflow_code/envs/humann3.yaml b/Metagenomics/Illumina/Workflow_Documentation/NF_MGIllumina/workflow_code/envs/humann3.yaml
index fa616b0d7..ce82cd076 100644
--- a/Metagenomics/Illumina/Workflow_Documentation/NF_MGIllumina/workflow_code/envs/humann3.yaml
+++ b/Metagenomics/Illumina/Workflow_Documentation/NF_MGIllumina/workflow_code/envs/humann3.yaml
@@ -1,8 +1,7 @@
 channels:
     - conda-forge
     - bioconda
-    - defaults
     - biobakery
 dependencies:
     - humann=3.9
-    - metaphlan=4.10
\ No newline at end of file
+    - metaphlan=4.1.0

From 9c1b6184eecec7df95a74dabd6418b0dcdb3d03c Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Sat, 20 Dec 2025 17:28:02 -0800
Subject: [PATCH 18/47] Updated assay suffixes

---
 .../Low_Biomass/Nanopore/GL-DPPD-XXXX.md      | 161 +++++++++---------
 1 file changed, 82 insertions(+), 79 deletions(-)

diff --git a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
index 80a2ed7f1..e9cd48a79 100644
--- a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
+++ b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
@@ -135,12 +135,14 @@ Barbara Novak (GeneLab Data Processing Lead)
 |Program|Version|Relevant Links|
 |:------|:-----:|------:|
 |bbduk| 38.86 |[https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/)|
+|bit| 1.8.53 |[https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit](https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit)|
 |CAT| 5.2.3 |[https://github.com/dutilh/CAT#cat-and-bat](https://github.com/dutilh/CAT#cat-and-bat)|
 |CheckM| 1.1.3 |[https://github.com/Ecogenomics/CheckM](https://github.com/Ecogenomics/CheckM)|
 |Dorado| 1.1.1| [https://github.com/nanoporetech/dorado](https://github.com/nanoporetech/dorado)|
 |filtlong| 0.2.1 |[https://github.com/rrwick/Filtlong](https://github.com/rrwick/Filtlong)|
 |Flye| 2.9.5 | [https://github.com/mikolmogorov/Flye](https://github.com/mikolmogorov/Flye) |
 |GTDB-Tk| 2.4.0 |[https://github.com/Ecogenomics/GTDBTk](https://github.com/Ecogenomics/GTDBTk)|
+|HUMAnN| 3.9 |[https://github.com/biobakery/humann](https://github.com/biobakery/humann)|
 |Kaiju| 1.10.1 | [https://bioinformatics-centre.github.io/kaiju/](https://bioinformatics-centre.github.io/kaiju/) |
 |KEGG-Decoder| 1.2.2 |[https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder](https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder)
 |KOFamScan| 1.3.0 |[https://github.com/takaram/kofam_scan](https://github.com/takaram/kofam_scan)|
@@ -148,13 +150,14 @@ Barbara Novak (GeneLab Data Processing Lead)
 |KrakenTools | 1.2 | [https://ccb.jhu.edu/software/krakentools/](https://ccb.jhu.edu/software/krakentools/) |
 |Krona| 2.8.1 | [https://github.com/marbl/Krona/wiki](https://github.com/marbl/Krona/wiki)|
 |MetaBAT| 2.15 |[https://bitbucket.org/berkeleylab/metabat/src/master/](https://bitbucket.org/berkeleylab/metabat/src/master/)|
-|Minimap2| 2.2.8 | [https://github.com/lh3/minimap2](https://github.com/lh3/minimap2) |
+|Minimap2| 2.28 | [https://github.com/lh3/minimap2](https://github.com/lh3/minimap2) |
 |MultiQC| 1.27.1 |[https://multiqc.info/](https://multiqc.info/)|
-|Medaka| 2.0.1 | [https://github.com/nanoporetech/medaka](https://github.com/nanoporetech/medaka) |
+|Medaka| 2.1.1 | [https://github.com/nanoporetech/medaka](https://github.com/nanoporetech/medaka) |
+|MetaPhlAn| 4.1.0 |[https://github.com/biobakery/MetaPhlAn](https://github.com/biobakery/MetaPhlAn)|
 |NanoPlot| 1.44.1 | [https://github.com/wdecoster/NanoPlot](https://github.com/wdecoster/NanoPlot)|
 |Porechop| 0.2.4 | [https://github.com/rrwick/Porechop](https://github.com/rrwick/Porechop) |
 |Prodigal| 2.6.3 |[https://github.com/hyattpd/Prodigal#prodigal](https://github.com/hyattpd/Prodigal#prodigal)|
-|samtools| 1.20 |[https://github.com/samtools/samtools#samtools](https://github.com/samtools/samtools#samtools)|
+|samtools| 1.22.1 |[https://github.com/samtools/samtools#samtools](https://github.com/samtools/samtools#samtools)|
 | R | 4.5.1 | [https://www.r-project.org](https://www.r-project.org) |
 |Bioconductor | 3.21 | [https://www.bioconductor.org](https://www.bioconductor.org) |
 |decontam| 1.28.0 | [https://www.bioconductor.org/packages/release/bioc/html/decontam.html](https://www.bioconductor.org/packages/release/bioc/html/decontam.html) |
@@ -184,7 +187,7 @@ dorado basecaller ${model} ${input_directory} \
   --device auto \
   --recursive \
   --kit-name ${kit_name} \
-  --min-qscore 7 > basecalled.bam
+  --min-qscore 8 > basecalled.bam
 ```
 
 **Parameter Definitions:**
@@ -195,7 +198,7 @@ dorado basecaller ${model} ${input_directory} \
 - `--device` - Specifies CPU or GPU device; specifying 'auto' chooses either 'cpu' or 'gpu' depending on detected presence of a GPU device.
 - `--recursive` - Enables recursive scanning through input directory to load FAST5 and/or POD5 files.
 - `--kit-name` - The nanopore barcoding kit used during sequencing preparation. Enables barcoding with the provided kit name; see [dorado documentation](https://software-docs.nanoporetech.com/dorado/1.1.1/barcoding/barcoding/) for a full list of accepted kit names.
-- `--min-qscore` - Specifies the minimum Q-score, reads with a mean Q-score below this threshold are discarded (default to `7` for this pipeline).
+- `--min-qscore` - Specifies the minimum Q-score; reads with a mean Q-score below this threshold are discarded (set to `8` for this pipeline).
 
 **Input Data:**
 
@@ -316,7 +319,7 @@ NanoPlot --only-report \
 ```bash 
 multiqc --zip-data-dir \
         --outdir raw_multiqc_report \
-        --filename raw_multiqc \
+        --filename raw_multiqc_GLlbnMetag \
         --interactive \
         /path/to/raw_nanoplot_output/
 ```
@@ -335,8 +338,8 @@ multiqc --zip-data-dir \
 
 **Output Data:**
 
-- **raw_multiqc.html** (multiqc output html summary)
-- **raw_multiqc_data.zip** (zip archive containing multiqc output data)
+- **raw_multiqc_GLlbnMetag.html** (multiqc output html summary)
+- **raw_multiqc_GLlbnMetag_data.zip** (zip archive containing multiqc output data)
 
 <br>  
 
@@ -401,7 +404,7 @@ NanoPlot --only-report \
 ```bash
 multiqc  --zip-data-dir \ 
          --outdir filtered_multiqc_report \
-         --filename filtered_multiqc \
+         --filename filtered_multiqc_GLlbnMetag \
          --interactive \
          /path/to/filtered_nanoplot_output/
 ```
@@ -420,8 +423,8 @@ multiqc  --zip-data-dir \
 
 **Output Data:**
 
-- **filtered_multiqc_report/filtered_multiqc.html** (multiqc output html summary)
-- **filtered_multiqc_report/filtered_multiqc_data.zip** (zip archive containing multiqc output data)
+- **filtered_multiqc_report/filtered_multiqc_GLlbnMetag.html** (multiqc output html summary)
+- **filtered_multiqc_report/filtered_multiqc_GLlbnMetag_data.zip** (zip archive containing multiqc output data)
 
 <br>
 
@@ -490,7 +493,7 @@ NanoPlot --only-report \
 ```bash
 multiqc --zip-data-dir \ 
         --outdir trimmed_multiqc_report \
-        --filename trimmed_multiqc \
+        --filename trimmed_multiqc_GLlbnMetag \
         --interactive \
         /path/to/trimmed_nanoplot_output/
 ```
@@ -509,8 +512,8 @@ multiqc --zip-data-dir \
 
 **Output Data:**
 
-- **trimmed_multiqc.html** (multiqc output html summary)
-- **trimmed_multiqc_data.zip** (zip archive containing multiqc output data)
+- **trimmed_multiqc_GLlbnMetag.html** (multiqc output html summary)
+- **trimmed_multiqc_GLlbnMetag_data.zip** (zip archive containing multiqc output data)
 
 <br>
 
@@ -716,7 +719,7 @@ NanoPlot --only-report \
 ```bash
 multiqc --zip-data-dir \ 
         --outdir decontam_multiqc_report \
-        --filename decontam_multiqc \
+        --filename decontam_multiqc_GLlbnMetag \
         --interactive \
         /path/to/decontam_nanoplot_output/
 ```
@@ -735,8 +738,8 @@ multiqc --zip-data-dir \
 
 **Output Data:**
 
-- **decontam_multiqc.html** (multiqc output html summary)
-- **decontam_multiqc_data.zip** (zip archive containing multiqc output data)
+- **decontam_multiqc_GLlbnMetag.html** (multiqc output html summary)
+- **decontam_multiqc_GLlbnMetag_data.zip** (zip archive containing multiqc output data)
 
 <br>
 
@@ -827,7 +830,7 @@ sed -i -E 's/^([a-z0-9])/>\1/g' sample_HRrm.fasta | gzip
 ```bash
 multiqc --zip-data-dir \ 
         --outdir HRrm_multiqc_report \
-        --filename HRrm_multiqc \
+        --filename HRrm_multiqc_GLlbnMetag \
         --interactive \
         /path/to/*kraken2-report.tsv
 ```
@@ -846,8 +849,8 @@ multiqc --zip-data-dir \
 
 **Output Data:**
 
-- **HRrm_multiqc.html** (multiqc output html summary)
-- **HRrm_multiqc_data.zip** (zip archive containing multiqc output data)
+- **HRrm_multiqc_GLlbnMetag.html** (multiqc output html summary)
+- **HRrm_multiqc_GLlbnMetag_data.zip** (zip archive containing multiqc output data)
 
 <br>
 
@@ -1947,7 +1950,7 @@ combine_kreports.py --output merged-kraken2-table.tsv \
 ```bash
 multiqc --zip-data-dir \ 
         --outdir kraken2_multiqc_report \
-        --filename kraken2_multiqc \
+        --filename kraken2_multiqc_GLlbnMetag \
         --interactive \
         /path/to/*kraken2-report.tsv
 ```
@@ -1966,8 +1969,8 @@ multiqc --zip-data-dir \
 
 **Output Data:**
 
-- **kraken2_multiqc.html** (multiqc output html summary)
-- **kraken2_multiqc_data.zip** (zip archive containing multiqc output data)
+- **kraken2_multiqc_GLlbnMetag.html** (multiqc output html summary)
+- **kraken2_multiqc_GLlbnMetag_data.zip** (zip archive containing multiqc output data)
 
 
 #### 10d. Convert Kraken2 Output to Krona Format
@@ -2354,7 +2357,7 @@ bit-rename-fasta-headers -i sample_polished.fasta \
 #### 13b. Summarize Assemblies
 
 ```bash
-bit-summarize-assembly -o assembly-summaries_GLmetagenomics.tsv \
+bit-summarize-assembly -o assembly-summaries_GLlbnMetag.tsv \
                        *-assembly.fasta
 ```
 
@@ -2369,7 +2372,7 @@ bit-summarize-assembly -o assembly-summaries_GLmetagenomics.tsv \
 
 **Output files:**
 
-- **assembly-summaries_GLmetagenomics.tsv** (table of assembly summary statistics)
+- **assembly-summaries_GLlbnMetag.tsv** (table of assembly summary statistics)
 
 <br>
 
@@ -2916,10 +2919,10 @@ bit-GL-combine-KO-and-tax-tables *-gene-coverage-annotation-and-tax.tsv \
 
 **Output Data:**
 
-- **Combined-gene-level-KO-function-coverages-CPM_GLmetagenomics.tsv** (table with all samples combined based on KO annotations; normalized to coverage per million genes covered)
-- **Combined-gene-level-taxonomy-coverages-CPM_GLmetagenomics.tsv** (table with all samples combined based on gene-level taxonomic classifications; normalized to coverage per million genes covered)
-- **Combined-gene-level-KO-function-coverages_GLmetagenomics.tsv** (table with all samples combined based on KO annotations)
-- **Combined-gene-level-taxonomy-coverages_GLmetagenomics.tsv** (table with all samples combined based on gene-level taxonomic classifications)
+- **Combined-gene-level-KO-function-coverages-CPM_GLlbnMetag.tsv** (table with all samples combined based on KO annotations; normalized to coverage per million genes covered)
+- **Combined-gene-level-taxonomy-coverages-CPM_GLlbnMetag.tsv** (table with all samples combined based on gene-level taxonomic classifications; normalized to coverage per million genes covered)
+- **Combined-gene-level-KO-function-coverages_GLlbnMetag.tsv** (table with all samples combined based on KO annotations)
+- **Combined-gene-level-taxonomy-coverages_GLlbnMetag.tsv** (table with all samples combined based on gene-level taxonomic classifications)
 
 
 #### 21b. Gene-level taxonomy heatmaps --- START NEEDS REVIEW ---
@@ -2931,9 +2934,9 @@ library(pheatmap)
 # Abundant taxa with CPM > 1000
 abundance_threshold <- 1000
 
-sample_order <- get_sample_names("assembly-summaries_GLmetagenomics.tsv")
+sample_order <- get_sample_names("assembly-summaries_GLlbnMetag.tsv")
 # Read-in gene table
-gene_taxonomy_table <-  read_contig_table("Combined-gene-level-taxonomy-coverages-CPM_GLmetagenomics.tsv", sample_order)
+gene_taxonomy_table <-  read_contig_table("Combined-gene-level-taxonomy-coverages-CPM_GLlbnMetag.tsv", sample_order)
 
 # Summarize gene table
 species_gene_table <- gene_taxonomy_table %>%
@@ -2955,7 +2958,7 @@ gene.m <- gene.m[,-match("species", colnames(gene.m))] %>% as.matrix()
 # Drop unclassified assignments
 mat2plot <- gene.m[-match("Unclassified;_;_;_;_;_;_", rownames(gene.m)),]
 
-png(filename = "All-genes-taxonomy-heatmap_GLmetagenomics.png", 
+png(filename = "All-genes-taxonomy-heatmap_GLlbnMetag.png", 
     width = plot_width, height = plot_height, units = "in", res=300)
 pheatmap(mat = mat2plot,
          cluster_cols = FALSE, 
@@ -2978,7 +2981,7 @@ abund_gene.m <- gene.m[abund_taxa,]
 # Drop unclassified assignments
 mat2plot <- abund_gene.m[-match("Unclassified;_;_;_;_;_;_", rownames(abund_gene.m)),]
 
-png(filename = "Abundant-genes-taxonomy-heatmap_GLmetagenomics.png", 
+png(filename = "Abundant-genes-taxonomy-heatmap_GLlbnMetag.png", 
     width = plot_width, height = plot_height, units = "in", res=300)
 pheatmap(mat = mat2plot,
          cluster_cols = FALSE, 
@@ -2992,13 +2995,13 @@ dev.off()
 ```
 
 **Input data:**
-- assembly-summaries_GLmetagenomics.tsv (table of assembly summary statistics, output from [Step 13b](#13b-summarizing-assemblies))
-- Combined-gene-level-taxonomy-coverages-CPM_GLmetagenomics.tsv (table with all samples combined based on gene-level taxonomic classifications, output from [Step 21a](#21a-generating-gene-level-coverage-summary-tables)) 
+- assembly-summaries_GLlbnMetag.tsv (table of assembly summary statistics, output from [Step 13b](#13b-summarize-assemblies))
+- Combined-gene-level-taxonomy-coverages-CPM_GLlbnMetag.tsv (table with all samples combined based on gene-level taxonomic classifications, output from [Step 21a](#21a-generating-gene-level-coverage-summary-tables)) 
 
 **Output data:**
 - gene_taxonomy_table.csv (aggregated gene taxonomy table)
-- **All-genes-taxonomy-heatmap_GLmetagenomics.png** (heatmap of all genes taxonomy assignments)
-- **Abundant-genes-taxonomy-heatmap_GLmetagenomics.png** (heatmap of abundant genes taxonomy assignments)
+- **All-genes-taxonomy-heatmap_GLlbnMetag.png** (heatmap of all genes taxonomy assignments)
+- **Abundant-genes-taxonomy-heatmap_GLlbnMetag.png** (heatmap of abundant genes taxonomy assignments)
 
 #### 21c. Gene-level taxonomy decontamination
 
@@ -3062,7 +3065,7 @@ species_to_drop_index <- grep(x = rownames(feature_table),
                                     collapse = "|"))
 
 mat2plot <- feature_table[-species_to_drop_index,]
-png(filename = "decontaminated-gene-taxonomy-heatmap_GLmetagenomics.png", 
+png(filename = "decontaminated-gene-taxonomy-heatmap_GLlbnMetag.png", 
     width = plot_width, height = plot_height, units = "in", res=300)
 pheatmap(mat = mat2plot,
          cluster_cols = FALSE, 
@@ -3085,7 +3088,7 @@ dev.off()
 
 - **decontam-gene-taxonomy_results.csv** (decontam's results table)
 - **decontaminated-gene-taxonomy_table.csv** (decontaminated species table)
-- **decontaminated-gene-taxonomy-heatmap_GLmetagenomics.png** (heatmap after filtering out contaminants)
+- **decontaminated-gene-taxonomy-heatmap_GLlbnMetag.png** (heatmap after filtering out contaminants)
 
 
 
@@ -3098,9 +3101,9 @@ library(pheatmap)
 # Abundant functions with CPM > 2000
 abundance_threshold <- 2000
 
-sample_order <- get_sample_names("assembly-summaries_GLmetagenomics.tsv")
+sample_order <- get_sample_names("assembly-summaries_GLlbnMetag.tsv")
 # Read-in KO functions table
-functions_table <- read_input_table("Combined-gene-level-KO-function-coverages-CPM_GLmetagenomics.tsv") %>%
+functions_table <- read_input_table("Combined-gene-level-KO-function-coverages-CPM_GLlbnMetag.tsv") %>%
                     select(KO_ID, KO_function, !!sample_order)
 
 # Subset table and then convert from datafame to matrix
@@ -3118,7 +3121,7 @@ write_csv(x = table2write  , file = "genes-KO-functions_table.csv")
 # Drop unclassified assignments
 mat2plot <- functions.m[-match("Not annotated", rownames(functions.m),]
 
-png(filename = "All-genes-KO-functions-heatmap_GLmetagenomics.png", 
+png(filename = "All-genes-KO-functions-heatmap_GLlbnMetag.png", 
     width = plot_width, height = plot_height, units = "in", res=300)
 pheatmap(mat = mat2plot,
          cluster_cols = FALSE, 
@@ -3141,7 +3144,7 @@ abund_functions.m <- functions.m[abund_functions,]
 # Drop unannotated assignments
 mat2plot <- abund_functions.m[-match("Not annotated", rownames(abund_functions.m)),]
 
-png(filename = "Abundant-genes-KO-functions-heatmap_GLmetagenomics.png", 
+png(filename = "Abundant-genes-KO-functions-heatmap_GLlbnMetag.png", 
     width = plot_width, height = plot_height, units = "in", res=300)
 pheatmap(mat = mat2plot,
          cluster_cols = FALSE, 
@@ -3158,13 +3161,13 @@ dev.off()
 
 
 **Input data:**
-- assembly-summaries_GLmetagenomics.tsv (table of assembly summary statistics, output from [Step 13b](#13b-summarizing-assemblies))
-- Combined-gene-level-KO-function-coverages-CPM_GLmetagenomics.tsv (table with all samples combined based on KO annotations; normalized to coverage per million genes covered, output from [Step 21a](#21a-generating-gene-level-coverage-summary-tables))
+- assembly-summaries_GLlbnMetag.tsv (table of assembly summary statistics, output from [Step 13b](#13b-summarize-assemblies))
+- Combined-gene-level-KO-function-coverages-CPM_GLlbnMetag.tsv (table with all samples combined based on KO annotations; normalized to coverage per million genes covered, output from [Step 21a](#21a-generating-gene-level-coverage-summary-tables))
 
 **Output data:**
 - genes-KO-functions_table.csv (aggregated and subsetted gene KO functions table)
-- **All-genes-KO-functions-heatmap_GLmetagenomics.png** (heatmap of gene-wise KO function assignments)
-- **Abundant-genes-KO-functions-heatmap_GLmetagenomics.png** (heatmap of gene-wise abundant KO function assignments)
+- **All-genes-KO-functions-heatmap_GLlbnMetag.png** (heatmap of gene-wise KO function assignments)
+- **Abundant-genes-KO-functions-heatmap_GLlbnMetag.png** (heatmap of gene-wise abundant KO function assignments)
 
 #### 21e. Gene-level KO functions decontamination --- END NEEDS REVIEW ---
 
@@ -3227,7 +3230,7 @@ functions_to_drop_index <- grep(x = rownames(feature_table),
                                     collapse = "|"))
 
 mat2plot <- feature_table[-functions_to_drop_index,]
-png(filename = "decontaminated-gene-KO-functions-heatmap_GLmetagenomics.png", 
+png(filename = "decontaminated-gene-KO-functions-heatmap_GLlbnMetag.png", 
     width = plot_width, height = plot_height, units = "in", res=300)
 pheatmap(mat = mat2plot,
          cluster_cols = FALSE, 
@@ -3250,7 +3253,7 @@ dev.off()
 
 - **decontam-gene-KO-functions_results.csv** (decontam's results table)
 - **decontaminated-gene-KO-functions_table.csv** (decontaminated functions table)
-- **decontaminated-gene-KO-functions-heatmap_GLmetagenomics.png** (heatmap after filtering out contaminants)
+- **decontaminated-gene-KO-functions-heatmap_GLlbnMetag.png** (heatmap after filtering out contaminants)
 
 
 
@@ -3272,8 +3275,8 @@ bit-GL-combine-contig-tax-tables *-contig-coverage-and-tax.tsv -o Combined
 
 **Output Data:**
 
-- **Combined-contig-level-taxonomy-coverages-CPM_GLmetagenomics.tsv** (table with all samples combined based on contig-level taxonomic classifications; normalized to coverage per million contigs covered)
-- **Combined-contig-level-taxonomy-coverages_GLmetagenomics.tsv** (table with all samples combined based on contig-level taxonomic classifications)
+- **Combined-contig-level-taxonomy-coverages-CPM_GLlbnMetag.tsv** (table with all samples combined based on contig-level taxonomic classifications; normalized to coverage per million contigs covered)
+- **Combined-contig-level-taxonomy-coverages_GLlbnMetag.tsv** (table with all samples combined based on contig-level taxonomic classifications)
 
 <br>
 
@@ -3283,9 +3286,9 @@ bit-GL-combine-contig-tax-tables *-contig-coverage-and-tax.tsv -o Combined
 ```R
 plot_width <- 20
 plot_height <- 30
-sample_order <- get_sample_names("assembly-summaries_GLmetagenomics.tsv")
+sample_order <- get_sample_names("assembly-summaries_GLlbnMetag.tsv")
 
-contig_table <-  read_contig_table("Combined-contig-level-taxonomy-coverages-CPM_GLmetagenomics.tsv", sample_order)
+contig_table <-  read_contig_table("Combined-contig-level-taxonomy-coverages-CPM_GLlbnMetag.tsv", sample_order)
 species_contig_table <- contig_table %>% select(species, !!sample_order)
 
 contig.m <- species_contig_table %>%
@@ -3305,7 +3308,7 @@ contig.m <- contig.m[,-match("species", colnames(contig.m))] %>% as.matrix()
 # Drop unclassified assignments
 mat2plot <- contig.m[-match("Unclassified;_;_;_;_;_;_", rownames(contig.m)),]
 
-png(filename = "All-contig-taxonomy-heatmap_GLmetagenomics.png", 
+png(filename = "All-contig-taxonomy-heatmap_GLlbnMetag.png", 
     width = plot_width, height = plot_height, units = "in", res=300)
 pheatmap(mat = mat2plot,
          cluster_cols = FALSE, 
@@ -3326,7 +3329,7 @@ abund_contig.m <- contig.m[abund_taxa,]
 
 mat2plot <- abund_contig.m[-match("Unclassified;_;_;_;_;_;_", rownames(abund_contig.m)),]
 
-png(filename = "Abundant-contig-taxonomy-heatmap_GLmetagenomics.png", 
+png(filename = "Abundant-contig-taxonomy-heatmap_GLlbnMetag.png", 
     width = plot_width, height = plot_height, units = "in", res=300)
 pheatmap(mat = mat2plot,
          cluster_cols = FALSE, 
@@ -3345,14 +3348,14 @@ dev.off()
 
 **Input data:**
 
-- assembly-summaries_GLmetagenomics.tsv (table of assembly summary statistics, output from [Step 13b](#13b-summarizing-assemblies))
-- Combined-contig-level-taxonomy-coverages-CPM_GLmetagenomics.tsv (table with all samples combined based on contig-level taxonomic classifications; normalized to coverage per million genes covered from [Step 21f](#21f-generating-contig-level-coverage-summary-tables))
+- assembly-summaries_GLlbnMetag.tsv (table of assembly summary statistics, output from [Step 13b](#13b-summarize-assemblies))
+- Combined-contig-level-taxonomy-coverages-CPM_GLlbnMetag.tsv (table with all samples combined based on contig-level taxonomic classifications; normalized to coverage per million contigs covered, output from [Step 21f](#21f-generating-contig-level-coverage-summary-tables))
 
 **Output data:**
 
 - contig_taxonomy_table.csv (aggregated contig taxonomy)
-- **All-contig-taxonomy-heatmap_GLmetagenomics.png** (All contig level taxonomy heatmap)
-- **Abundant-contig-taxonomy-heatmap_GLmetagenomics.png** (Abundant contig level taxonomy heatmap)
+- **All-contig-taxonomy-heatmap_GLlbnMetag.png** (heatmap of all contig-level taxonomy assignments)
+- **Abundant-contig-taxonomy-heatmap_GLlbnMetag.png** (heatmap of abundant contig-level taxonomy assignments)
 
 
 #### 21h. Contig-level decontamination --- END NEEDS REVIEW ---
@@ -3417,7 +3420,7 @@ species_to_drop_index <- grep(x = rownames(feature_table),
                                     collapse = "|"))
 
 mat2plot <- feature_table[-species_to_drop_index,]
-png(filename = "decontaminated-contig-taxonomy-heatmap_GLmetagenomics.png", 
+png(filename = "decontaminated-contig-taxonomy-heatmap_GLlbnMetag.png", 
     width = plot_width, height = plot_height, units = "in", res=300)
 pheatmap(mat = mat2plot,
          cluster_cols = FALSE, 
@@ -3440,7 +3443,7 @@ dev.off()
 
 - **decontam-contig-taxonomy_results.csv** (decontam's results table)
 - **decontaminated-contig-taxonomy_table.csv** (decontaminated species table)
-- **decontaminated-contig-taxonomy-heatmap_GLmetagenomics.png** (heatmap after filtering out contaminants)
+- **decontaminated-contig-taxonomy-heatmap_GLlbnMetag.png** (heatmap after filtering out contaminants)
 
 
 ---
@@ -3501,7 +3504,7 @@ zip -r sample-bins.zip sample-bins
 > Utilizes the default `checkm` database available [here](https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz), `checkm_data_2015_01_16.tar.gz`.
 
 ```bash
-checkm lineage_wf -f bins-overview_GLmetagenomics.tsv \
+checkm lineage_wf -f bins-overview_GLlbnMetag.tsv \
                   --tab_table \
                   -x fasta \
                   ./ \
@@ -3523,18 +3526,18 @@ checkm lineage_wf -f bins-overview_GLmetagenomics.tsv \
 
 **Output Data:**
 
-- **bins-overview_GLmetagenomics.tsv** (tab-delimited file with quality estimates per bin)
+- **bins-overview_GLlbnMetag.tsv** (tab-delimited file with quality estimates per bin)
 - checkm-output-dir/ (directory holding detailed checkm outputs)
 
 #### 22c. Filter MAGs
 
 ```bash
-cat <( head -n 1 bins-overview_GLmetagenomics.tsv ) \
-    <( awk -F $'\t' ' $12 >= 90 && $13 <= 10 && $14 == 0 ' bins-overview_GLmetagenomics.tsv | sed 's/bin./MAG-/' ) \
+cat <( head -n 1 bins-overview_GLlbnMetag.tsv ) \
+    <( awk -F $'\t' ' $12 >= 90 && $13 <= 10 && $14 == 0 ' bins-overview_GLlbnMetag.tsv | sed 's/bin./MAG-/' ) \
     > checkm-MAGs-overview.tsv
     
 # copying bins into a MAGs directory in order to run tax classification
-awk -F $'\t' ' $12 >= 90 && $13 <= 10 && $14 == 0 ' bins-overview_GLmetagenomics.tsv | cut -f 1 > MAG-bin-IDs.tmp
+awk -F $'\t' ' $12 >= 90 && $13 <= 10 && $14 == 0 ' bins-overview_GLlbnMetag.tsv | cut -f 1 > MAG-bin-IDs.tmp
 
 mkdir MAGs
 for ID in MAG-bin-IDs.tmp
@@ -3553,7 +3556,7 @@ done
 
 **Input Data:**
 
-- bins-overview_GLmetagenomics.tsv (tab-delimited file with quality estimates per bin from [Step 22b](#22b-bin-quality-assessment))
+- bins-overview_GLlbnMetag.tsv (tab-delimited file with quality estimates per bin from [Step 22b](#22b-bin-quality-assessment))
 
 **Output Data:**
 
@@ -3592,7 +3595,7 @@ gtdbtk classify_wf --genome_dir MAGs/ \
 
 ```bash
 # combine summaries
-for MAG in $(cut -f 1 assembly-summaries_GLmetagenomics.tsv | tail -n +2); do
+for MAG in $(cut -f 1 assembly-summaries_GLlbnMetag.tsv | tail -n +2); do
 
     grep -w -m 1 "^${MAG}" checkm-MAGs-overview.tsv | cut -f 12,13,14 \
         >> checkm-estimates.tmp
@@ -3612,7 +3615,7 @@ cat <(printf "est. completeness\test. redundancy\test. strain heterogeneity\n")
 cat <(printf "domain\tphylum\tclass\torder\tfamily\\tgenus\tspecies\n") gtdb-taxonomies.tmp \
     > gtdb-taxonomies-with-headers.tmp
 
-paste assembly-summaries_GLmetagenomics.tsv \
+paste assembly-summaries_GLlbnMetag.tsv \
 checkm-estimates-with-headers.tmp \
 gtdb-taxonomies-with-headers.tmp \
     > MAGs-overview.tmp
@@ -3623,19 +3626,19 @@ head -n 1 MAGs-overview.tmp > MAGs-overview-header.tmp
 tail -n +2 MAGs-overview.tmp | sort -t \$'\t' -k 14,20 > MAGs-overview-sorted.tmp
 
 cat MAGs-overview-header.tmp MAGs-overview-sorted.tmp \
-    > MAGs-overview_GLmetagenomics.tsv
+    > MAGs-overview_GLlbnMetag.tsv
 ```
 
 **Input Data:**
 
-- assembly-summaries_GLmetagenomics.tsv (table of assembly summary statistics, output from [Step 13b](#13b-summarize-assemblies))
+- assembly-summaries_GLlbnMetag.tsv (table of assembly summary statistics, output from [Step 13b](#13b-summarize-assemblies))
 - MAGs/\*.fasta (directory holding high-quality MAGs, output from [Step 22c](#23c-filter-mags))
 - checkm-MAGs-overview.tsv (tab-delimited file with quality estimates per MAG, output from [Step 22c](#22c-filter-mags))
 - gtdbtk-output-dir/gtdbtk.\*.summary.tsv (directory of files with assigned taxonomy and info, output from [Step 22d](#22d-mag-taxonomic-classification))
 
 **Output Data:**
 
-- **MAGs-overview_GLmetagenomics.tsv** (a tab-delimited overview of all recovered MAGs)
+- **MAGs-overview_GLlbnMetag.tsv** (a tab-delimited overview of all recovered MAGs)
 
 
 <br>
@@ -3659,7 +3662,7 @@ do
     python parse-MAG-annots.py -i annotations-and-taxonomy/${sample_ID}-gene-coverage-annotation-and-tax.tsv \
                                -w ${MAG_ID}-contigs.tmp \
                                -M ${MAG_ID} \
-                               -o MAG-level-KO-annotations_GLmetagenomics.tsv
+                               -o MAG-level-KO-annotations_GLlbnMetag.tsv
 
     rm ${MAG_ID}-contigs.tmp
 
@@ -3680,15 +3683,15 @@ done
 
 **Output Data:**
 
-- **MAG-level-KO-annotations_GLmetagenomics.tsv** (tab-delimited table holding MAGs and their KO annotations)
+- **MAG-level-KO-annotations_GLlbnMetag.tsv** (tab-delimited table holding MAGs and their KO annotations)
 
 
 #### 23b. Summarize KO Annotations With KEGG-Decoder
 
 ```bash
 KEGG-decoder -v interactive \
-             -i MAG-level-KO-annotations_GLmetagenomics.tsv \
-             -o MAG-KEGG-Decoder-out_GLmetagenomics.tsv
+             -i MAG-level-KO-annotations_GLlbnMetag.tsv \
+             -o MAG-KEGG-Decoder-out_GLlbnMetag.tsv
 ```
 
 **Parameter Definitions:**  
@@ -3699,13 +3702,13 @@ KEGG-decoder -v interactive \
 
 **Input Data:**
 
-- MAG-level-KO-annotations_GLmetagenomics.tsv (tab-delimited table holding MAGs and their KO annotations, output from [Step 23a](#23a-getting-ko-annotations-per-mag))
+- MAG-level-KO-annotations_GLlbnMetag.tsv (tab-delimited table holding MAGs and their KO annotations, output from [Step 23a](#23a-getting-ko-annotations-per-mag))
 
 **Output Data:**
 
-- **MAG-KEGG-Decoder-out_GLmetagenomics.tsv** (tab-delimited table holding MAGs and their proportions of genes held known to be required for specific pathways/metabolisms)
+- **MAG-KEGG-Decoder-out_GLlbnMetag.tsv** (tab-delimited table holding MAGs and, for each, the proportion of genes it holds that are known to be required for specific pathways/metabolisms)
 
-- **MAG-KEGG-Decoder-out_GLmetagenomics.html** (interactive heatmap html file of the above output table)
+- **MAG-KEGG-Decoder-out_GLlbnMetag.html** (interactive heatmap html file of the above output table)
 
 <br>
 

From c886d8db435abafc4f4196cef017ef46f1f907b8 Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Fri, 23 Jan 2026 22:39:48 -0800
Subject: [PATCH 19/47] Updates to Low-biomass pipeline docs

- Updated Long-read document to better match latest workflow
- Added first draft of Short-read document

TODO:
  - Fix Assembly-based decontamination and heatmaps
  - Finish Short-read document (only revised through pre-processing,
    which also requires some overhaul)
---
 .../GL-DPPD-7117.md}                          | 1487 +++---
 .../Low_Biomass/Nanopore/GL-DPPD-7116.md      | 4212 +++++++++++++++++
 2 files changed, 5030 insertions(+), 669 deletions(-)
 rename Metagenomics/Low_Biomass/{Nanopore/GL-DPPD-XXXX.md => Illumina/GL-DPPD-7117.md} (72%)
 create mode 100644 Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md

diff --git a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md b/Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md
similarity index 72%
rename from Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
rename to Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md
index e9cd48a79..a1d96d7f2 100644
--- a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
+++ b/Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md
@@ -25,33 +25,26 @@ Barbara Novak (GeneLab Data Processing Lead)
 - [**Software used**](#software-used)
 - [**General processing overview with example commands**](#general-processing-overview-with-example-commands)
   - [**Pre-processing**](#pre-processing)
-    - [1. Basecalling](#1-basecalling)
-    - [2. Demultiplexing](#2-demultiplexing)
-      - [2a. Split fastq ](#2a-split-fastq)
-      - [2b. Concatenate files for each sample](#2b-concatenate-files-for-each-sample)
-    - [3. Raw Data QC](#3-raw-data-qc)
-      - [3a. Raw Data QC](#3a-raw-data-qc)
-      - [3b. Compile Raw Data QC](#3b-compile-raw-data-qc)
-    - [4. Quality filtering](#4-quality-filtering)
-      - [4a. Filter Raw Data](#4a-filter-raw-data)
-      - [4a. Filtered Data QC](#4b-filtered-data-qc)
-      - [4c. Compile Filtered Data QC](#4c-compile-filtered-data-qc)
-    - [5. Trimming](#5-trimming)
-      - [5a. Trim Filtered Data](#5a-trim-filtered-data)
-      - [5b. Trimmed Data QC](#5b-trimmed-data-qc)
-      - [5c. Compile Trimmed Data QC](#5c-compile-filtered-data-qc)
-    - [6. Contaminant Removal](#6-contaminant-removal)
-      - [6a. Assemble Contaminants](#6a-assemble-contaminants)
-      - [6b. Build Contaminant Index and Map Reads](#6b-build-contaminant-index-and-map-reads)
-      - [6c. Sort and Index Contaminant Reads](#6c-sort-and-index-contaminant-alignments)
-      - [6d. Gather Contaminant Mapping Metrics](#6d-gather-contaminant-mapping-metrics)
-      - [6e. Generate Decontaminated Read Files](#6e-generate-decontaminated-read-files)
-      - [6f. Contaminant Removal QC](#6f-contaminant-removal-qc)
-      - [6g. Compile Contaminant Removal QC](#6g-compile-contaminant-removal-qc)
-    - [7. Human Read Removal](#7-human-read-removal)
-      - [7a. Build Kraken2 Database](#7a-build-kraken2-database)
-      - [7b. Remove Human Reads](#7b-remove-human-reads)
-      - [7c. Compile Human Read Removal QC](#7c-compile-human-read-removal-qc)
+    - [1. Raw Data QC](#1-raw-data-qc)
+      - [1a. Raw Data QC](#1a-raw-data-qc)
+      - [1b. Compile Raw Data QC](#1b-compile-raw-data-qc)
+    - [2. Human Read Removal](#2-human-read-removal)
+      - [2a. Build Kraken2 Database](#2a-build-kraken2-database)
+      - [2b. Remove Human Reads](#2b-remove-human-reads)
+      - [2c. Compile Human Read Removal QC](#2c-compile-human-read-removal-qc)
+    - [3. Trimming and Quality filtering](#3-trimming-and-quality-filtering)
+      - [3a. Filter Quality and Trim Adapters](#3a-filter-quality-and-trim-adapters)
+      - [3b. Trim PolyG](#3b-trim-polyg)
+      - [3c. Filtered Data QC](#3c-filtered-data-qc)
+      - [3d. Compile Filtered Data QC](#3d-compile-filtered-data-qc)
+    - [4. Contaminant Removal](#7-contaminant-removal)
+      - [4a. Assemble Contaminants](#7a-assemble-contaminants)
+      - [4b. Build Contaminant Index and Map Reads](#7b-build-contaminant-index-and-map-reads)
+      - [4c. Sort and Index Contaminant Reads](#7c-sort-and-index-contaminant-alignments)
+      - [4d. Gather Contaminant Mapping Metrics](#7d-gather-contaminant-mapping-metrics)
+      - [4e. Generate Decontaminated Read Files](#7e-generate-decontaminated-read-files)
+      - [4f. Contaminant Removal QC](#7f-contaminant-removal-qc)
+      - [4g. Compile Contaminant Removal QC](#7g-compile-contaminant-removal-qc)
     - [8. R Environment Setup](#8-r-environment-setup)
       - [8a. Load Libraries](#8a-load-libraries)
       - [8b. Define Custom Functions](#8b-define-custom-functions)
@@ -175,153 +168,155 @@ Barbara Novak (GeneLab Data Processing Lead)
 
 ## Pre-processing
 
-### 1. Basecalling
 
-```bash
-model="hac" # high accuracy model
-input_directory=/path/to/pod5/or/fast5/data
-kit_name=SQK-RPB004
+### 1. Raw Data QC
+
+#### 1a. Raw Data QC
 
-dorado basecaller ${model} ${input_directory} \
-  --no-trim \
-  --device auto \
-  --recursive \
-  --kit-name ${kit_name} \
-  --min-qscore 8 > basecalled.bam
+```bash
+fastqc -o raw_fastqc_output *raw.fastq.gz
 ```
 
 **Parameter Definitions:**
 
-- `model` - Positional argument specifying the basecalling model to use or a path to the model directory. `hac` chooses the high accuracy model.
-- `input_directory` - Positional argument specifying the location of the raw data in POD5 or FAST5 format.
-- `--no-trim` - Skips trimming of barcodes, adapters, and primers.
-- `--device` - Specifies CPU or GPU device; specifying 'auto' chooses either 'cpu' or 'gpu' depending on detected presence of a GPU device.
-- `--recursive` - Enables recursive scanning through input directory to load FAST5 and/or POD5 files.
-- `--kit-name` - The nanopore barcoding kit used during sequencing preparation. Enables barcoding with the provided kit name; see [dorado documentation](https://software-docs.nanoporetech.com/dorado/1.1.1/barcoding/barcoding/) for a full list of accepted kit names.
-- `--min-qscore` - Specifies the minimum Q-score, reads with a mean Q-score below this threshold are discarded (default to `8` for this pipeline).
-
-**Input Data:**
-
-- *pod5 and/or *fast5 (raw nanopore data)
+- `-o` – the output directory to store results
+- `*raw.fastq.gz` – the input reads are specified as a positional argument, and can be given all at once with wildcards like this, or as individual arguments with spaces in between them
 
-**Output Data:**
+**Input Data:**
 
-- basecalled.bam (basecalled data in bam format)
+- *raw.fastq.gz (raw reads)
 
-<br>
+**Output Data:**
 
----
+- *fastqc.html (FastQC output html summary)
+- *fastqc.zip (FastQC output data)
 
-### 2. Demultiplexing
 
-#### 2a. Split Fastq
+#### 1b. Compile Raw Data QC
 
 ```bash
-dorado demux \
-  --output-dir /path/to/fastq/output \
-  --emit-fastq \
-  --emit-summary \
-  --kit-name ${kit_name} \
-  basecalled.bam
+multiqc --zip-data-dir \
+        --outdir raw_multiqc_report \
+        --filename raw_multiqc_GLlbsMetag \
+        --interactive \
+        /path/to/raw_fastqc_output/
 ```
 
 **Parameter Definitions:**
 
-- `--output-dir` - Specifies the output folder that is the root of the nested output structure. 
-- `--emit-fastq` - Specifies that output is fastq format.
-- `--emit-summary` - Creates a summary listing each read and its classified barcode.
-- `--kit-name` - The nanopore barcoding kit used during sequencing preparation. Enables barcoding with the provided kit name; see [dorado documentation](https://software-docs.nanoporetech.com/dorado/1.1.1/barcoding/barcoding/) for a full list of accepted kit names.
-- `basecalled.bam` - Positional argument specifying the input bam file.
+- `--zip-data-dir` - Compress the data directory.
+- `--outdir` – Specifies the output directory to store results.
+- `--filename` – Specifies the filename prefix of results.
+- `--interactive` - Force multiqc to always create interactive javascript plots.
+- `/path/to/raw_fastqc_output/` – The directory holding the output data from the FastQC run, provided as a positional argument.
 
 **Input Data:**
 
-- basecalled.bam (basecalled nanopore data in bam format, output from [Step 1](#1-basecalling))
+- /path/to/raw_fastqc_output/*fastqc.zip (FastQC output data, from [Step 1a](#1a-raw-data-qc))
 
 **Output Data:**
 
-- /path/to/fastq/output/\*_barcode\*.fastq (demultiplexed reads in fastq format)
-- /path/to/fastq/output/\*_unclassified.fastq (unclassified reads in fastq format)
-- /path/to/fastq/output/barcoding_summary.txt (barcode summary file listing each read, the file it was assigned to, and its classified barcode)
+- **raw_multiqc_report/raw_multiqc_GLlbsMetag.html** (multiqc output html summary)
+- **raw_multiqc_report/raw_multiqc_GLlbsMetag_data.zip** (zip archive containing multiqc output data)
 
 
-#### 2b. Concatenate Files For Each Sample
+<br>  
 
-```bash
-# Change to directory containing split fastq files generated from step 2a. 
-cd /path/to/fastq/output/ # output of step 2a
+---
 
-# Get unique barcode names from demultiplexed file names
-BARCODES=($(ls -1 *fastq* | sed -E 's/.+_(barcode[0-9]+)_.+/\1/g' | sort -u))
+### 2. Human Read Removal
 
-# Concat separate barcode/sample fastq files into per sample fastq gzipped files
-[ -d raw_data/ ] || mkdir raw_data/
-for sample in ${BARCODES[*]}; do
+#### 2a. Build Kraken2 Database
 
-  [ -d  ${sample}/ ] ||  mkdir ${sample}/  
-  mv *_${sample}_*  ${sample}/ 
+```bash
+kraken2-build --download-library human \
+              --db kraken2_human_db \
+              --threads numberOfThreads \
+              --no-masking
 
-  cat ${sample}/* | gzip --to-stdout raw_data/${sample}.fastq.gz
+kraken2-build --download-taxonomy \
+              --db kraken2_human_db/
 
-done
+kraken2-build --build \
+              --db kraken2_human_db/ \
+              --threads numberOfThreads
+ 
+kraken2-build --clean \
+              --db kraken2_human_db/
 ```
 
 **Parameter Definitions:**
 
-- `cat ${sample}/*` - Concatenates all fastq files with the same barcode into one fastq file.
-- `| gzip --to-stdout` - Sends the concatenated fastq file output from the `cat` command to the `gzip` command to create a compressed fastq.gz file for each barcode.
+- `--download-library` - Specifies the reference name/type to download.
+- `--db` - Specifies the directory to put the database in.
+- `--threads` - Number of parallel processing threads to use.
+- `--no-masking` - Prevents masking of low-complexity sequences. For additional 
+                   information see the [kraken documentation for masking](https://github.com/DerrickWood/kraken2/wiki/Manual#masking-of-low-complexity-sequences).
+- `--download-taxonomy` - Downloads taxonomic mapping information.
+- `--build` - Specifies to construct kraken2-formatted database.
+- `--clean` - Specifies to remove unnecessary intermediate files.
 
 **Input Data:**
 
-- /path/to/fastq/output/ (directory containing spilt fastq files from [Step 2a](#2a-split-fastq))
+- `human` - database name to download (specified with the `--download-library` parameter above)
 
 **Output Data:**
 
--  raw_data/sample.fastq.gz (gzipped per sample/barcode fastq files)
+- kraken2_human_db/ (Kraken2 human database directory, containing hash.k2d, opts.k2d, and taxo.k2d files)
 
-<br>
 
----
+#### 2b. Remove Human Reads
 
-### 3.  Raw Data QC
+```bash
+kraken2 --db kraken2_human_db \
+        --gzip-compressed \
+        --threads NumberOfThreads \
+        --use-names \
+        --output sample1-kraken2-output.txt \
+        --report sample1-kraken2-report.tsv \
+        --unclassified-out sample1_R#.fastq \
+        --paired sample1_R1_raw.fastq.gz sample1_R2_raw.fastq.gz
 
-#### 3a. Raw Data QC
+# rename and gzip output files
+mv sample1_R_1.fastq sample1_GLlbsMetag_R1_HRrm.fastq && \
+gzip sample1_GLlbsMetag_R1_HRrm.fastq
+
+mv sample1_R_2.fastq sample1_GLlbsMetag_R2_HRrm.fastq && \
+gzip sample1_GLlbsMetag_R2_HRrm.fastq
 
-```bash 
-NanoPlot --only-report \
-         --prefix sample_raw_ \
-         --outdir /path/to/raw_nanoplot_output \
-         --threads NumberOfThreads \
-         --fastq \
-         /path/to/raw_data/sample.fastq.gz
 ```
 
 **Parameter Definitions:**
 
-- `--only-report` - Output only the report files.
-- `--prefix` - Adds a sample specific prefix to the name of each output file.
-- `--outdir` – Specifies the output directory to store results.
-- `--threads` - Number of parallel processing threads to use.
-- `--fastq` - Specifies that the input data is in fastq format.
-- `/path/to/raw_data/sample.fastq.gz` – The input reads, specified as a positional argument.
+- `--db` - Specifies the directory holding the kraken2 database.
+- `--gzip-compressed` - Specifies that the input fastq files are gzip-compressed.
+- `--threads NumberOfThreads` - Number of parallel processing threads to use.
+- `--use-names` - Specifies adding taxa names in addition to taxon IDs.
+- `--output` - Specifies the name of the kraken2 read-based output file (one line per read).
+- `--report` - Specifies the name of the kraken2 report output file (one line per taxon, with the number of reads assigned to it).
+- `--unclassified-out` - Specifies the filename pattern for the output files containing reads that were not classified, i.e., non-human reads; for paired-end data run with `--paired`, kraken2 replaces `#` with `_1` and `_2`.
+- `sample1_R1_raw.fastq.gz sample1_R2_raw.fastq.gz` - Positional arguments specifying the input read files (omit read2 for single-end data).
 
 **Input Data:**
 
-- /path/to/raw_data/sample.fastq.gz (raw reads, output from [Step 2b](#2b-concatenate-files-for-each-sample))
+- kraken2_human_db/ (kraken2 human database directory, output from [Step 2a](#2a-build-kraken2-database))
+- *raw.fastq.gz (raw reads)
 
 **Output Data:**
 
-- **/path/to/raw_nanoplot_output/sample_raw_NanoPlot-report.html** (NanoPlot html summary)
-- /path/to/raw_nanoplot_output/sample_raw_NanoPlot_\<date\>_\<time\>.log (NanoPlot log file)
-- /path/to/raw_nanoplot_output/sample_raw_NanoStats.txt (text file containing basic statistics)
+- sample1-kraken2-output.txt (kraken2 read-based output file (one line per read))
+- sample1-kraken2-report.tsv (kraken2 report output file (one line per taxon, with the number of reads assigned to it))
+- **sample1_GLlbsMetag_R1_HRrm.fastq.gz**, **sample1_GLlbsMetag_R2_HRrm.fastq.gz** (raw sample reads with human reads removed, gzipped fastq files)
 
-#### 3b. Compile Raw Data QC
 
-```bash 
-multiqc --zip-data-dir \
-        --outdir raw_multiqc_report \
-        --filename raw_multiqc_GLlbnMetag \
+#### 2c. Compile Human Read Removal QC
+
+```bash
+multiqc --zip-data-dir \
+        --outdir HRrm_multiqc_report \
+        --filename HRrm_multiqc_GLlbsMetag \
         --interactive \
-        /path/to/raw_nanoplot_output/
+        /path/to/*kraken2-report.tsv
 ```
 
 **Parameter Definitions:**
@@ -330,172 +325,123 @@ multiqc --zip-data-dir \
 - `--outdir` – Specifies the output directory to store results.
 - `--filename` – Specifies the filename prefix of results.
 - `--interactive` - Force multiqc to always create interactive javascript plots.
-- `/path/to/raw_nanoplot_output/` – The directory holding the output data from the NanoPlot run, provided as a positional argument.
+- `/path/to/*kraken2-report.tsv` – The kraken2 output report files, provided as a positional argument.
 
 **Input Data:**
 
-- /path/to/raw_nanoplot_output/*raw_NanoStats.txt (NanoPlot output data, from [Step 3a](#3a-raw-data-qc))
+- /path/to/*kraken2-report.tsv (kraken2 report files, output from [Step 2b](#2b-remove-human-reads))
 
 **Output Data:**
 
-- **raw_multiqc_GLlbnMetag.html** (multiqc output html summary)
-- **raw_multiqc_GLlbnMetag_data.zip** (zip archive containing multiqc output data)
+- **HRrm_multiqc_GLlbsMetag.html** (multiqc output html summary)
+- **HRrm_multiqc_GLlbsMetag_data.zip** (zip archive containing multiqc output data)
+
+<br>
 
-<br>  
 
 ---
 
-### 4. Quality Filtering
+### 3. Trimming and Quality Filtering
 
-#### 4a. Filter Raw Data
+#### 3a. Filter Quality and Trim Adapters
 
 ```bash
-filtlong --min_length 200 --min_mean_q 8 /path/to/raw_data/sample.fastq.gz > sample_filtered.fastq
+fastp --in1 sample1_R1_raw.fastq.gz --out1 temp_sample1_R1_filtered.fastq.gz \
+      --in2 sample1_R2_raw.fastq.gz --out2 temp_sample1_R2_filtered.fastq.gz \
+      --qualified_quality_phred  20 \
+      --length_required 50 \
+      --thread 2 \
+      --detect_adapter_for_pe \
+      --json sample1.fastp.json \
+      --html sample1.fastp.html 2> sample1-fastp.log
 ```
 
 **Parameter Definitions:**
-
-- `--min_length` – Specifies the minimum read length to retain (default to `200` for this pipeline).
-- `--min_mean_q` – Specifies the minimum mean read quality to retain (default to `8` for this pipeline).
-- `/path/to/raw_data/sample.fastq.gz` - The path to the input fastq file, provided as a positional argument.
-- `> sample_filtered.fastq` - Redirects the output to a sample_filtered.fastq file.
+- `--in1` - Specifies the forward input read file
+- `--in2` - Specifies the reverse input read file (omit for single-end data)
+- `--out1` - Specifies the forward output read file
+- `--out2` - Specifies the reverse output read file (omit for single-end data)
+- `--qualified_quality_phred` - Specifies the minimum quality value at which a base is considered qualified (set to `20` here; fastp's own default is 15)
+- `--length_required` - Specifies the minimum read length; shorter reads are discarded (set to `50` here; fastp's own default is 15)
+- `--thread` - Specifies the number of worker threads (set to `2` here)
+- `--detect_adapter_for_pe` - Enables auto-detection of adapters for paired-end data
+- `--json` - Specifies the json format report file name
+- `--html` - Specifies the html format report file name
+- `2> sample1-fastp.log` - Redirects the stderr output to a log file.
 
 **Input Data:**
 
-- /path/to/raw_data/sample.fastq.gz (raw reads, output from [Step 2b](#2b-concatenate-files-for-each-sample))
+- *raw.fastq.gz (raw reads)
 
 **Output Data:**
 
-- *sample_filtered.fastq (quality filtered reads)
-
+- temp_*_filtered.fastq.gz (quality filtered and adapter trimmed reads)
 
-#### 4b. Filtered Data QC
+#### 3b. Trim polyG
 
 ```bash
-NanoPlot --only-report \
-         --prefix sample_filtered_ \
-         --outdir /path/to/filtered_nanoplot_output \
-         --threads NumberOfThreads \
-         --fastq \
-         sample_filtered.fastq
+fastp --in1 temp_sample1_R1_filtered.fastq.gz --out1 sample1_R1_filtered.fastq.gz \
+      --in2 temp_sample1_R2_filtered.fastq.gz --out2 sample1_R2_filtered.fastq.gz \
+      --qualified_quality_phred  20 \
+      --length_required 50 \
+      --thread 2 \
+      --detect_adapter_for_pe \
+      --json sample1.fastp.json \
+      --html sample1.fastp.html \
+      --trim_poly_g 2> sample1-fastp.log
 ```
 
 **Parameter Definitions:**
-
-- `--only-report` - Output only the report files.
-- `--prefix` - Adds a sample specific prefix to the name of each output file.
-- `--outdir` – Specifies the output directory to store results.
-- `--threads` - Number of parallel processing threads to use.
-- `--fastq` - Specifies that the input data is in fastq format.
-- `sample_filtered.fastq` – The input reads, specified as a positional argument.
+- `--in1` - Specifies the forward input read file
+- `--in2` - Specifies the reverse input read file (omit for single-end data)
+- `--out1` - Specifies the forward output read file
+- `--out2` - Specifies the reverse output read file (omit for single-end data)
+- `--qualified_quality_phred` - Specifies the minimum quality value at which a base is considered qualified (set to `20` here; fastp's own default is 15)
+- `--length_required` - Specifies the minimum read length; shorter reads are discarded (set to `50` here; fastp's own default is 15)
+- `--thread` - Specifies the number of worker threads (set to `2` here)
+- `--detect_adapter_for_pe` - Enables auto-detection of adapters for paired-end data
+- `--json` - Specifies the json format report file name
+- `--html` - Specifies the html format report file name
+- `--trim_poly_g` - Forces polyG tail trimming (by default fastp enables it only for NextSeq/NovaSeq data)
+- `2> sample1-fastp.log` - Redirects the stderr output to a log file.
 
 **Input Data:**
 
-- sample_filtered.fastq (filtered reads, output from [Step 4a](#4a-filter-raw-data))
+- /path/to/filtered_data/temp_sample1*.fastq.gz (quality filtered and adapter trimmed reads, output from [Step 3a](#3a-filter-quality-and-trim-adapters))
 
 **Output Data:**
 
-- **/path/to/filtered_nanoplot_output/sample_filtered_NanoPlot-report.html** (NanoPlot html summary)
-- /path/to/filtered_nanoplot_output/sample_filtered_NanoPlot_\<date\>_\<time\>.log (NanoPlot log file)
-- /path/to/filtered_nanoplot_output/sample_filtered_NanoStats.txt (text file containing basic statistics)
+- *filtered.fastq.gz (quality filtered, adapter trimmed, and polyG trimmed reads)
 
-#### 4c. Compile Filtered Data QC
+#### 3c. Filtered Data QC
 
 ```bash
-multiqc  --zip-data-dir \ 
-         --outdir filtered_multiqc_report \
-         --filename filtered_multiqc_GLlbnMetag \
-         --interactive \
-         /path/to/filtered_nanoplot_output/
+fastqc -o filtered_fastqc_output *filtered.fastq.gz
 ```
 
 **Parameter Definitions:**
 
-- `--zip-data-dir` - Compress the data directory.
-- `--outdir` – Specifies the output directory to store results.
-- `--filename` – Specifies the filename prefix of results.
-- `--interactive` - Force multiqc to always create interactive javascript plots.
-- `/path/to/filtered_nanoplot_output/` – The directory holding the output data from the NanoPlot run, provided as a positional argument.
-
-**Input Data:**
-
-- /path/to/filtered_nanoplot_output/*filtered_NanoStats.txt (NanoPlot output data, from [Step 4b](#4b-filtered-data-qc))
-
-**Output Data:**
-
-- **filtered_multiqc_report/filtered_multiqc_GLlbnMetag.html** (multiqc output html summary)
-- **filtered_multiqc_report/filtered_multiqc_GLlbnMetag_data.zip** (zip archive containing multiqc output data)
-
-<br>
-
----
-
-### 5. Trimming
-
-#### 5a. Trim Filtered Data
-
-```bash
-porechop --input sample_filtered.fastq \
-         --threads NumberOfThreads \
-         --discard_middle \
-         --output sample_trimmed.fastq  > sample_porechop.log
-```
+- `-o` – the output directory to store results
+- `*filtered.fastq.gz` – the input reads are specified as a positional argument, and can be given all at once with wildcards like this, or as individual arguments with spaces in between them
 
-**Parameter Definitions:**
-
-- `--input` – Specifies the input sequence file in fastq format.
-- `--threads` - Number of parallel processing threads to use.
-- `--discard_middle` -  Reads with middle adapters will be discarded.
-- `--output` - Specifies the trimmed reads output fastq filename.
-- `> sample_porechop.log` - Redirects the standard output to a log file.
-
-**Input Data:**
-
-- sample_filtered.fastq (filtered reads output from [Step 4a](#4a-filter-raw-data))
-
-**Output Data:**
-
-- **sample_trimmed.fastq** (filtered and trimmed reads)
-- sample_porechop.log (porechop standard output containing trimming info)
-
-#### 5b. Trimmed Data QC
-
-```bash
-NanoPlot --only-report \
-         --prefix sample_trimmed_ \
-         --outdir /path/to/trimmed_nanoplot_output \
-         --threads NumberOfThreads \
-         --fastq \
-         sample_trimmed.fastq
-```
-
-**Parameter Definitions:**
-
-- `--only-report` - Output only the report files.
-- `--prefix` - Adds a sample specific prefix to the name of each output file.
-- `--outdir` – Specifies the output directory to store results.
-- `--threads` - Number of parallel processing threads to use.
-- `--fastq` - Specifies that the input data is in fastq format.
-- `sample_trimmed.fastq` – The input reads, specified as a positional argument.
+**Input Data:**
 
-**Input Data:**
+- *filtered.fastq.gz (trimmed and filtered reads, from [Step 3b](#3b-trim-polyg))
 
-- sample_trimmed.fastq (filtered and trimmed reads, output from [Step 5a](#5a-trim-filtered-data))
+**Output Data:**
 
-**Output Data:**
+- *fastqc.html (FastQC output html summary)
+- *fastqc.zip (FastQC output data)
 
-- **/path/to/trimmed_nanoplot_output/sample_trimmed_NanoPlot-report.html** (NanoPlot html summary)
-- /path/to/trimmed_nanoplot_output/sample_trimmed_NanoPlot_\<date\>_\<time\>.log (NanoPlot log file)
-- /path/to/trimmed_nanoplot_output/sample_trimmed_NanoStats.txt (text file containing basic statistics)
 
-#### 5c. Compile Trimmed Data QC
+#### 3d. Compile Filtered Data QC
 
 ```bash
-multiqc --zip-data-dir \ 
-        --outdir trimmed_multiqc_report \
-        --filename trimmed_multiqc_GLlbnMetag \
-        --interactive \
-        /path/to/trimmed_nanoplot_output/
+multiqc --zip-data-dir \
+        --outdir filtered_multiqc_report \
+        --filename filtered_multiqc_GLlbsMetag \
+        --interactive \
+        /path/to/filtered_fastqc_output/
 ```
 
 **Parameter Definitions:**
@@ -504,32 +450,36 @@ multiqc --zip-data-dir \
 - `--outdir` – Specifies the output directory to store results.
 - `--filename` – Specifies the filename prefix of results.
 - `--interactive` - Force multiqc to always create interactive javascript plots.
-- `/path/to/trimmed_nanoplot_output/` – The directory holding the output data from the NanoPlot run, provided as a positional argument.
+- `/path/to/filtered_fastqc_output/` – The directory holding the output data from the FastQC run, provided as a positional argument.
 
 **Input Data:**
 
-- /path/to/trimmed_nanoplot_output/*trimmed_NanoStats.txt (NanoPlot output data, output from [Step 5b](#5b-trimmed-data-qc))
+- /path/to/filtered_fastqc_output/*fastqc.zip (FastQC output data, from [Step 3c](#3c-filtered-data-qc))
 
 **Output Data:**
 
-- **trimmed_multiqc_GLlbnMetag.html** (multiqc output html summary)
-- **trimmed_multiqc_GLlbnMetag_data.zip** (zip archive containing multiqc output data)
+- **filtered_multiqc_report/filtered_multiqc_GLlbsMetag.html** (multiqc output html summary)
+- **filtered_multiqc_report/filtered_multiqc_GLlbsMetag_data.zip** (zip archive containing multiqc output data)
 
 <br>
 
 ---
 
-### 6. Contaminant Removal
+### 7. Contaminant Removal
 
 > A major issue with low biomass data is the high potential for contamination due to the low amount of DNA extracted from the samples. Because negative control/blank samples should by theory be contaminant free, any sequence detected in the negative control is a potential contaminant. To filter out contaminants found in negative control samples that may have been due to cross contamination in the lab, we use a read mapping approach. First negative/blank control sample reads are assembled then the filtered and trimmed reads from each low-biomass sample are mapped to the assembled contigs from the negative/blank control samples. Reads mapping to the assembled contigs are categorized as contaminants and are therefore filtered out and thus excluded from downstream analyses.
 
-### 6a. Assemble Contaminants
+#### 7a. Assemble Contaminants
 
 ```bash
 flye --meta \
      --threads NumberOfThreads \
      --out-dir /path/to/contaminant_assembly \
-     --nano-raw /path/to/blank_samples/\*_trimmed.fastq
+     --nano-raw /path/to/blank_samples/*_GLlbsMetag_HRrm.fastq.gz
+
+# rename output
+mv /path/to/contaminant_assembly/assembly.fasta /path/to/contaminant_assembly/blank-assembly.fasta
+mv /path/to/contaminant_assembly/flye.log blank-flye.log
 ```
 
 **Parameter Definitions:**
@@ -541,17 +491,16 @@ flye --meta \
 
 **Input Data**
 
-- *_trimmed.fastq (one or more trimmed reads from blank (negative control) samples, output from [Step 5a](#5a-trim-filtered-data))
+- *_GLlbsMetag_HRrm.fastq.gz (one or more human-read-removed read files from blank (negative control) samples, output from [Step 2b](#2b-remove-human-reads))
 
 **Output Data**
 
-- /path/to/contaminant_assembly/assembly.fasta (assembly built from reads in blank samples in fasta format)
+- /path/to/contaminant_assembly/blank-assembly.fasta (assembly built from reads in blank samples in fasta format)
+- blank-flye.log (flye log file)
 
 <br>
 
----
-
-#### 6b. Build Contaminant Index and Map Reads
+#### 7b. Build Contaminant Index and Map Reads
 
 ```bash
 # Build contaminant index
@@ -559,14 +508,14 @@ minimap2 -t NumberOfThreads \
          -a \
          -x splice \
          -d blanks.mmi \
-         /path/to/contaminant_assembly/assembly.fasta
+         /path/to/contaminant_assembly/blank-assembly.fasta
 
 # Map reads to index
 minimap2 -t NumberOfThreads \
          -a \
          -x splice \
          blanks.mmi \
-         /path/to/trimmed_reads/sample_trimmed.fastq  > sample.sam
+         sample_GLlbsMetag_HRrm.fastq.gz  > sample.sam 2> sample-mapping-info.txt
 ```
 
 **Parameter Definitions:**
@@ -575,27 +524,28 @@ minimap2 -t NumberOfThreads \
 - `-a` – Output in SAM format.
 - `-x splice` - Specifies preset for spliced alignment of long reads.
 - `-d` - Specifies the output file for the index (specific to the build contaminant index command).
-- `/path/to/contaminant_assembly/assembly.fasta` - Specifies the input file in fasta format, provided as a positional argument (specific to the build contaminant index command).
+- `/path/to/contaminant_assembly/blank-assembly.fasta` - Specifies the input file in fasta format, provided as a positional argument (specific to the build contaminant index command).
 - `blanks.mmi` - Specifies the index file in mmi format, provided as a positional argument (specific to the map reads command).
-- `/path/to/trimmed_reads/sample_trimmed.fastq` - Specifies the input file in fastq format, provided as a positional argument (specific to the map reads command).
+- `sample_GLlbsMetag_HRrm.fastq.gz` - Specifies the input file in fastq format, provided as a positional argument (specific to the map reads command).
 - `> sample.sam` - Redirects the output of the map reads command to a separate SAM file (specific to the map reads command).
 
 **Input Data**
 
-- /path/to/contaminant_assembly/assembly.fasta (contaminant assembly, output from [Step 6a](#6-assemble-contaminants))
-- /path/to/trimmed_reads/sample_trimmed.fastq (filtered and trimmed reads, output from [Step 5a](#5a-trim-filtered-data))
+- /path/to/contaminant_assembly/blank-assembly.fasta (contaminant assembly, output from [Step 7a](#7a-assemble-contaminants))
+- sample_GLlbsMetag_HRrm.fastq.gz (reads with human reads removed, output from [Step 2b](#2b-remove-human-reads))
 
 **Output Data**
 
 - blanks.mmi (contaminant index in MMI format)
 - sample.sam (reads aligned to contaminant assembly in SAM format)
+- sample-mapping-info.txt (minimap2 mapping log file)
 
-#### 6c. Sort and Index Contaminant Alignments
+#### 7c. Sort and Index Contaminant Alignments
 ```bash
 # Sort Sam, convert to bam and create index
 samtools sort --threads NumberOfThreads \
-              -o sample_sorted.bam \
-              sample.sam > sample_sort.log 2>&1
+              --output sample_sorted.bam \
+              sample.sam
 
 samtools index sample_sorted.bam sample_sorted.bam.bai
 ```
@@ -604,9 +554,8 @@ samtools index sample_sorted.bam sample_sorted.bam.bai
 
 **samtools sort**
 - `--threads` - Number of parallel processing threads to use.
-- `-o` - Specifies the output file for the aligned and sorted reads.
+- `--output` - Specifies the output file for the aligned and sorted reads.
 - `sample.sam` - Specifies the input SAM file, provided as a positional argument.
-- `> sample_sort.log 2>&1` - Redirects the standard output to a log file. 
 
 **samtools index**
 - `sample_sorted.bam` - The input BAM file, provided as a positional argument.
@@ -614,15 +563,14 @@ samtools index sample_sorted.bam sample_sorted.bam.bai
 
 **Input Data:**
 
-- sample.sam (reads aligned to contaminant assembly, output from [Step 6b](#6b-build-contaminant-index-and-map-reads))
+- sample.sam (reads aligned to contaminant assembly, output from [Step 7b](#7b-build-contaminant-index-and-map-reads))
 
 **Output Data:**
 
 - sample_sorted.bam (sorted mapping to contaminant assembly file)
 - sample_sorted.bam.bai (index of sorted mapping to contaminant assembly file)
-- sample_sort.log (log file containing the samtools sort standard output)
 
-#### 6d. Gather Contaminant Mapping Metrics
+#### 7d. Gather Contaminant Mapping Metrics
 
 ```bash
 
@@ -647,8 +595,8 @@ samtools idxstats sample_sorted.bam  > sample_idxstats.txt 2> sample_idxstats.lo
 
 **Input Data:**
 
-- sample_sorted.bam (sorted mapping to contaminant assembly file, output from [Step 6c](#6c-sort-and-index-contaminant-alignments))
-- sample_sorted.bam.bai (index of sorted mapping to contaminant assembly file, output from [Step 6c](#6c-sort-and-index-contaminant-alignments))
+- sample_sorted.bam (sorted mapping to contaminant assembly file, output from [Step 7c](#7c-sort-and-index-contaminant-alignments))
+- sample_sorted.bam.bai (index of sorted mapping to contaminant assembly file, output from [Step 7c](#7c-sort-and-index-contaminant-alignments))
 
 **Output Data:**
 
@@ -659,10 +607,10 @@ samtools idxstats sample_sorted.bam  > sample_idxstats.txt 2> sample_idxstats.lo
 - sample_idxstats.txt (contig alignment summary statistics)
 - sample_idxstats.log (log file containing the idxstats standard error)
 
-#### 6e. Generate Decontaminated Read Files
+#### 7e. Generate Decontaminated Read Files
 ```bash
 # Retain reads that do not map to contaminants
-samtools fastq -t -f 4 sample_sorted.bam | gzip --to-stdout > sample_decontam.fastq.gz
+samtools fastq -t -f 4 -o sample_decontam_GLlbsMetag.fastq.gz -0 sample_decontam_GLlbsMetag.fastq.gz sample_sorted.bam 
 ```
 
 **Parameter Definitions:**
@@ -670,20 +618,19 @@ samtools fastq -t -f 4 sample_sorted.bam | gzip --to-stdout > sample_decontam.fa
 - `fastq` - Positional argument specifying the program for generating fastq files from a SAM/BAM file.
 - `-t` - Copy RG, BC, and QT tags to the FASTQ header line.
 - `-f 4` - Only retain unmapped reads that have been marked with the SAM "segment unmapped" FLAG (4).
+- `-o sample_decontam_GLlbsMetag.fastq.gz` - Writes reads flagged as either read1 or read2 to the named file (the .gz extension ensures compressed output).
+- `-0 sample_decontam_GLlbsMetag.fastq.gz` - Writes reads flagged as both read1 and read2, or as neither, to the same named file.
 - `sample_sorted.bam` - Positional argument specifying the input BAM file.
-- `| gzip --to-stdout` - Sends output from `samtools fastq` to `gzip` command to create a compressed fastq.gz file.
-- `--to-stdout` - Sends the output from the `gzip` command to standard out.
-- `> sample_decontam.fastq.gz` - Redirects the `gzip` standard output to a fastq.gz file.
 
 **Input Data:**
 
-- sample_sorted.bam (sorted mapping to contaminant assembly file, output from [Step 6c](#6c-sort-and-index-contaminant-alignments))
+- sample_sorted.bam (sorted mapping to contaminant assembly file, output from [Step 7c](#7c-sort-and-index-contaminant-alignments))
 
 **Output Data:**
 
-- sample_decontam.fastq.gz (filtered and trimmed sample reads with contaminants removed in fastq format)
+- **sample_decontam_GLlbsMetag.fastq.gz** (filtered and trimmed sample reads with contaminants removed in fastq format)
 
-#### 6f. Contaminant Removal QC
+#### 7f. Contaminant Removal QC
 
 ```bash
 NanoPlot --only-report \
@@ -691,7 +638,7 @@ NanoPlot --only-report \
          --outdir /path/to/decontam_nanoplot_output \
          --threads NumberOfThreads \
          --fastq \
-         sample_decontam.fastq.gz
+         sample_decontam_GLlbsMetag.fastq.gz
 ```
 
 **Parameter Definitions:**
@@ -701,11 +648,11 @@ NanoPlot --only-report \
 - `--outdir` – Specifies the output directory to store results.
 - `--threads` - Number of parallel processing threads to use.
 - `--fastq` - Specifies that the input data is in fastq format.
-- `sample_decontam.fastq.gz` – The input reads, specified as a positional argument.
+- `sample_decontam_GLlbsMetag.fastq.gz` – The input reads, specified as a positional argument.
 
 **Input Data:**
 
-- sample_decontam.fastq.gz (filtered and trimmed sample reads with contaminants removed, output from [Step 6e](#6e-generate-decontaminated-read-files))
+- sample_decontam_GLlbsMetag.fastq.gz (filtered and trimmed sample reads with all contaminants removed, output from [Step 7e](#7e-generate-decontaminated-read-files))
 
 **Output Data:**
 
@@ -714,12 +661,12 @@ NanoPlot --only-report \
 - /path/to/decontam_nanoplot_output/sample_decontam_NanoStats.txt (text file containing basic statistics)
 
 
-#### 6g. Compile Contaminant Removal QC
+#### 7g. Compile Contaminant Removal QC
 
 ```bash
 multiqc --zip-data-dir \ 
         --outdir decontam_multiqc_report \
-        --filename decontam_multiqc_GLlbnMetag \
+        --filename decontam_multiqc_GLlbsMetag \
         --interactive \
         /path/to/decontam_nanoplot_output/
 ```
@@ -734,129 +681,17 @@ multiqc --zip-data-dir \
 
 **Input Data:**
 
-- /path/to/decontam_nanoplot_output/*decontam_NanoStats.txt (NanoPlot output data, output from [Step 6f](#6f-contaminant-removal-qc))
+- /path/to/decontam_nanoplot_output/*decontam_NanoStats.txt (NanoPlot output data, output from [Step 7f](#7f-contaminant-removal-qc))
 
 **Output Data:**
 
-- **decontam_multiqc_GLlbnMetag.html** (multiqc output html summary)
-- **decontam_multiqc_GLlbnMetag_data.zip** (zip archive containing multiqc output data)
+- **decontam_multiqc_GLlbsMetag.html** (multiqc output html summary)
+- **decontam_multiqc_GLlbsMetag_data.zip** (zip archive containing multiqc output data)
 
 <br>
 
 ---
 
-### 7. Human Read Removal
-
-#### 7a. Build Kraken2 Database
-
-```bash
-kraken2-build --download-library human \
-              --db kraken2_human_db \
-              --threads numberOfThreads \
-              --no-masking
-
-kraken2-build --download-taxonomy \
-              --db kraken2_human_db/
-
-kraken2-build --build \
-              --db kraken2_human_db/ \
-              --threads numberOfThreads
- 
-kraken2-build --clean \
-              --db kraken2_human_db/
-```
-
-**Parameter Definitions:**
-
-- `--download-library` - Specifies the reference name/type to download.
-- `--db` - Specifies the directory to put the database in.
-- `--threads` - Number of parallel processing threads to use.
-- `--no-masking` - Prevents masking of low-complexity sequences. For additional 
-                   information see the [kraken documentation for masking](https://github.com/DerrickWood/kraken2/wiki/Manual#masking-of-low-complexity-sequences).
-- `--download-taxonomy` - Downloads taxonomic mapping information.
-- `--build` - Specifies to construct kraken2-formatted database.
-- `--clean` - Specifies to remove unnecessary intermediate files.
-
-**Input Data:**
-
-- `human` - database name to download (specified with the `--download-library` parameter above)
-
-**Output Data:**
-
-- kraken2_human_db/ - Kraken2 human database directory, containing hash.k2d, opts.k2d, and taxo.k2d files 
-
-
-#### 7b. Remove Human Reads
-
-```bash
-kraken2 --db kraken2_human_db \
-        --gzip-compressed \
-        --threads NumberOfThreads \
-        --use-names \
-        --output sample-kraken2-output.txt \
-        --report sample-kraken2-report.tsv \
-        --unclassified-out sample_HRrm.fasta \
-        sample_decontam.fastq.gz
-
-# add ">" before each sequence name and gzip fasta output file
-sed -i -E 's/^([a-z0-9])/>\1/g' sample_HRrm.fasta | gzip 
-```
-
-**Parameter Definitions:**
-
-- `--db` - Specifies the directory holding the kraken2 database.
-- `--gzip-compressed` - Specifies that the input fastq files are gzip-compressed.
-- `--threads` - Number of parallel processing threads to use.
-- `--use-names` - Specifies adding taxa names in addition to taxon IDs.
-- `--output` - Specifies the name of the kraken2 read-based output file (one line per read).
-- `--report` - Specifies the name of the kraken2 report output file (one line per taxa, with number of reads assigned to it).
-- `--unclassified-out` - Specifies the name of the output file containing reads that were not classified, i.e non-human reads.
-- `sample_decontam.fastq.gz` - Positional argument specifying the input read file.
-
-**Input Data:**
-
-- kraken2_human_db/ (kraken2 human database directory, output from [Step 7a](#7a-build-kraken2-database))
-- sample_decontam.fastq.gz (filtered and trimmed sample reads with contaminants removed, output from [Step 6e](#6e-generate-decontaminated-read-files))
-
-**Output Data:**
-
-- sample-kraken2-output.txt (kraken2 read-based output file (one line per read))
-- sample-kraken2-report.tsv (kraken2 report output file (one line per taxa, with number of reads assigned to it))
-- **sample_HRrm.fasta.gz** (filtered and trimmed sample reads with both contaminants and human reads removed, gzipped fasta file)
-
-
-#### 7c. Compile Human Read Removal QC
-
-```bash
-multiqc --zip-data-dir \ 
-        --outdir HRrm_multiqc_report \
-        --filename HRrm_multiqc_GLlbnMetag \
-        --interactive \
-        /path/to/*kraken2-report.tsv
-```
-
-**Parameter Definitions:**
-
-- `--zip-data-dir` - Compress the data directory.
-- `--outdir` – Specifies the output directory to store results.
-- `--filename` – Specifies the filename prefix of results.
-- `--interactive` - Force multiqc to always create interactive javascript plots.
-- `/path/to/*kraken2-report.tsv` – The kraken2 output report files, provided as a positional argument.
-
-**Input Data:**
-
-- /path/to/*kraken2-report.tsv (kraken2 report files, output from [Step 7b](#7b-remove-human-reads))
-
-**Output Data:**
-
-- **HRrm_multiqc_GLlbnMetag.html** (multiqc output html summary)
-- **HRrm_multiqc_GLlbnMetag_data.zip** (zip archive containing multiqc output data)
-
-<br>
-
----
-
-
 ### 8. R Environment Setup
 
 > Taxonomy bar plots, heatmaps and feature decontamination with decontam are performed in R.
@@ -878,11 +713,11 @@ library(pavian)
   <summary>retrieves the last taxonomy assignment from a taxonomy string</summary>
 
   ```R
-  get_last_assignment <- function(taxonomy_string, split_by=';', remove_prefix=NULL) {
+  get_last_assignment <- function(taxonomy_string, split_by = ';', remove_prefix = NULL) {
 
     # Spilt taxonomy string by the supplied delimiter 'split_by'
     # then convert the list of parts to a vector of parts
-    split_names <- strsplit(x =  taxonomy_string , split = split_by) %>% 
+    split_names <- strsplit(x =  taxonomy_string , split = split_by) %>%
       unlist()
     # Get the last part of the split string
     level_name <- split_names[[length(split_names)]]
@@ -892,7 +727,7 @@ library(pavian)
     }
     # remove an unwanted prefix if specified
     if(!is.null(remove_prefix)){
-      level_name <- gsub(pattern = remove_prefix, replacement = '', x = level_name)
+      level_name <- gsub(pattern = remove_prefix, replacement = "", x = level_name)
     }
     
     return(level_name)
@@ -913,22 +748,24 @@ library(pavian)
 
   ```R
   mutate_taxonomy <- function(df, taxonomy_column="taxonomy") {
-    
+
     # make sure that the taxonomy column is always named taxonomy
     col_index <- which(colnames(df) == taxonomy_column)
-    colnames(df)[col_index] <- 'taxonomy'
-    df <- df %>% dplyr::mutate(across( where(is.numeric), function(x) tidyr::replace_na(x,0)  ) )%>% 
-      dplyr::mutate(taxonomy=map_chr(taxonomy,.f = function(taxon_name=.x){
+    colnames(df)[col_index] <- "taxonomy"
+    df <- df %>% dplyr::mutate(across(where(is.numeric), function(x) tidyr::replace_na(x, 0))) %>%
+      dplyr::mutate(taxonomy=map_chr(taxonomy, .f = function(taxon_name = .x) {
         last_assignment <- get_last_assignment(taxon_name) 
-        last_assignment  <- gsub(pattern = "\\[|\\]|'", replacement = '',x = last_assignment)
+        last_assignment  <- gsub(pattern = "\\[|\\]|'", replacement = "", x = last_assignment)
         trimws(last_assignment, which = "both")
       })) %>% 
-      as.data.frame(check.names=FALSE, StringAsFactor=FASLE)
+      as.data.frame(check.names = FALSE, stringsAsFactors = FALSE)
     # Ensure the taxonomy names are unique by aggregating duplicates
-    df <- aggregate(.~taxonomy,data = df, FUN = sum)
+    df <- aggregate(.~taxonomy, data = df, FUN = sum)
     return(df)
   }
   ```
+  **Custom Functions Used:**
+  - [get_last_assignment()](#get_last_assignment)
 
   **Function Parameter Definitions:**
   - `df` - a dataframe containing the taxonomy assignments
@@ -943,15 +780,30 @@ library(pavian)
   <summary>reformat kaiju output table</summary>
 
   ```R
-  process_kaiju_table <- function(file_path, taxon_col="taxon_name") {
+  process_kaiju_table <- function(file_path, taxon_col = "taxon_name") {
   
-    abs_abun_df <-  read_delim(file = file_path,
+    # read input table
+    kaiju_table <-  read_delim(file = file_path,
                                delim = "\t",
-                               col_names = TRUE) %>% # read input table
-             select(sample, reads, taxonomy=!!sym(taxon_col)) %>%
-             pivot_wider(names_from = "sample", values_from = "reads", 
-                             names_sort = TRUE) %>% # convert long dataframe to wide dataframe
-             mutate_taxonomy # mutate the taxonomy coxlumn such that it contains only lowest taxonomy assignment
+                               col_names = TRUE)
+
+    # Create  a sample colname if the file column wasn't pre-edited
+    if(colnames(kaiju_table)[1] ==  "file" ){
+      kaiju_table <-  kaiju_table %>% rename(sample=file)
+    }
+
+    # filter out all kaiju database entries
+    kaiju_table <- kaiju_table %>% 
+      filter(!str_detect(sample, "dmp")) %>%
+      mutate(sample=str_replace_all(sample, ".+/(.+)_kaiju.out", "\\1"))
+ 
+    # keep only sample, reads, and taxonomy column (as defined by taxon_col argument) 
+    # convert long dataframe to wide dataframe
+    # mutate the taxonomy column such that it contains only lowest taxonomy assignment
+    abs_abun_df <- kaiju_table %>%
+      select(sample, reads, taxonomy=!!sym(taxon_col)) %>%
+      pivot_wider(names_from = "sample", values_from = "reads", names_sort = TRUE) %>%
+      mutate_taxonomy 
   
     # Set the taxon names as row names, drop the taxonomy column and convert to a matrix
     rownames(abs_abun_df) <- abs_abun_df[,"taxonomy"]
@@ -961,10 +813,12 @@ library(pavian)
     return(abs_abun_matrix)
   }
   ```
+  **Custom Functions Used:**
+  - [mutate_taxonomy()](#mutate_taxonomy)
 
   **Function Parameter Definitions:**
   - `file_path` - file path to the tab-delimited kaiju output table file
-  - `taxon_col=`- name of the taxon column in the input data file, default="taxon_path"
+  - `taxon_col=`- name of the taxon column in the input data file, default="taxon_name"
 
   **Returns:** a dataframe with reformated kaiju output
 
@@ -1018,6 +872,28 @@ library(pavian)
 
 </details>
 
+##### get_abundant_features()
+<details>
+  <summary>Find abundant features based on the sum of feature values</summary>
+  
+  ```R
+  get_abundant_features <- function(mat, cpm_threshold = 1000){
+  
+    features <- rowSums(mat) %>% sort()
+    
+    abund_features <- features[features > cpm_threshold] %>% names
+    
+    abund_features.m <- mat[abund_features, , drop = FALSE]
+    
+    return(abund_features.m)
+  }
+  ```
+  **Function Parameter Definitions:**
+  - `mat` - a feature count matrix with features as rows and samples as columns
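+  - `cpm_threshold = 1000` - features whose values summed across all samples exceed this threshold are considered abundant, default: 1000
+
+  **Returns:** a subset of the input matrix containing only the abundant features, with features and samples as rows and columns, respectively.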
+</details>
 
 ##### count_to_rel_abundance()
 <details>
@@ -1026,13 +902,15 @@ library(pavian)
   ```R
   count_to_rel_abundance <- function(species_table) {
 
-    abund_table <- species_table %>% 
-                        as.data.frame %>% 
-                        mutate( across(everything(), function(x) (x/sum(x, na.rm = TRUE))*100 ) )  %>% # calculate species relative abundance per sample
+    # calculate species relative abundance per sample and
+    # drop columns where none of the reads were classified or were non-microbial (NA)
+    abund_table <- species_table %>%
+      as.data.frame %>%
+      mutate(across(everything(), function(x) (x/sum(x, na.rm = TRUE))*100)) %>%
         select(
-                where( ~all(!is.na(.)) )
-              )  %>% # drop columns where none of the reads were classified or were non-microbial (NA)
-              rownames_to_column("Species") 
+          where( ~all(!is.na(.)))
+        ) %>%
+      rownames_to_column("Species")
 
     # Set rownames as species name and drop species column  
     rownames(abund_table) <- abund_table$Species
@@ -1040,6 +918,7 @@ library(pavian)
 
     return(abund_table)
   }
+
   ```
 
   **Function Parameter Definitions:**
@@ -1052,7 +931,7 @@ library(pavian)
 
 ##### filter_rare()
 <details>
-  <summary>filter out rare and non_microbial taxonomy assignments</summary>
+  <summary>filter out rare and non_microbial taxonomy assignments based on relative abundance</summary>
 
   ```R
   filter_rare <- function(species_table, non_microbial, threshold=1) {
@@ -1061,14 +940,13 @@ library(pavian)
     clean_tab_count  <-  species_table %>% 
                          as.data.frame %>% 
                          rownames_to_column("Species") %>% 
-                         filter(str_detect(Species, non_microbial, negate = TRUE))  
+                         filter(str_detect(Species, non_microbial, negate = TRUE))
     # Calculate species relative abundance
-    clean_tab <- clean_tab_count %>% 
+    clean_tab <- clean_tab_count %>%
       mutate( across( where(is.numeric), function(x) (x/sum(x, na.rm = TRUE))*100 ) )
-    # Set rownames as species name and drop species column  
+    # Set rownames as species name and drop species column
     rownames(clean_tab) <- clean_tab$Species
-    clean_tab  <- clean_tab[,-1] 
-    
+    clean_tab  <- clean_tab[, -1]
     
     # Get species with relative abundance less than `threshold` in all samples
     rare_species <- map(clean_tab, .f = function(col) rownames(clean_tab)[col < threshold])
@@ -1086,12 +964,71 @@ library(pavian)
 
   **Function Parameter Definitions:**
   - `species_table` - the species matrix to filter with species and samples as rows and columns, respectively.
-  - `non_microbial` - a regex denoting the string used to identify a species as non-microbial or unwanted
+  - `non_microbial` - a regular expression denoting the names used to identify a species as non-microbial or unwanted
   - `threshold=` - abundance threshold used to determine if the relative abundance is rare, value denotes a percentage between 0 and 100.
 
   **Returns:** a dataframe with rare and non_microbial/unwanted species removed
 </details>
 
+##### group_low_abund_taxa()
+<details>
+  <summary>Group rare taxa or return a table with only rare taxa</summary>
+
+  ```R
+  group_low_abund_taxa <- function(abund_table, threshold = 0.05,
+                                   rare_taxa = FALSE) {
+    # If rare_taxa is set to TRUE then a table with only the rare taxa will be returned
+    # initialize an empty vector that will contain the indices for the
+    # low abundance columns/ taxa to group
+    taxa_to_group <- c()
+    # initialize the index variable of species with low abundance (taxa/columns)
+    index <- 1
+    
+    # loop over every column (taxon) and check whether its max abundance is less than the set threshold
+    # if true, save the column index in the taxa_to_group vector
+    for (column in seq_len(ncol(abund_table))) {
+      if(max(abund_table[,column], na.rm = TRUE) < threshold) {
+        #print(column)
+        taxa_to_group[index] <- column
+        index <- index + 1
+      }
+    }
+    
+    if(is.null(taxa_to_group)) {
+      message(glue::glue("Rare taxa were not grouped. Please provide a higher 
+                        threshold than {threshold} for grouping rare taxa; 
+                        only numbers are allowed."))
+      return(abund_table)
+    }
+    
+    if(rare_taxa) {
+      abund_table <- abund_table[,taxa_to_group,drop=FALSE]
+    } else {
+      #remove the low abundant taxa or columns
+      abundant_taxa <-abund_table[,-(taxa_to_group), drop=FALSE]
+      #get the rare taxa
+      # rare_taxa <-abund_table[,taxa_to_group]
+      rare_taxa <- subset(x = abund_table, select = taxa_to_group)
+      #get the proportion of each sample that makes up the rare taxa
+      rare <- rowSums(rare_taxa)
+      #bind the abundant taxa to the rare taxa
+      abund_table <- cbind(abundant_taxa,rare)
+      #rename the columns i.e the taxa
+      colnames(abund_table) <- c(colnames(abundant_taxa),"Rare")
+    }
+    
+    return(abund_table)
+  }
+  ```
+
+  **Function Parameter Definitions:**
+  - `abund_table` - a relative abundance matrix with taxa as columns and samples as rows
+  - `rare_taxa` - a boolean specifying if only rare taxa should be returned
+  - `threshold` - a max abundance threshold for defining taxa as rare
+
+  **Returns:** a relative abundance matrix with rare taxa grouped or with non-rare taxa filtered out
+
+</details>
 
 ##### make_plot()
 <details>
@@ -1099,37 +1036,37 @@ library(pavian)
 
   ```R
   # Make bar plot
-make_plot <- function(abund_table, metadata, colors2use, publication_format,
-                      samples_column="Sample_ID", prefix_to_remove="barcode"){
-  
-abund_table_wide <- abund_table %>% 
-    as.data.frame() %>% 
-    rownames_to_column(samples_column) %>% 
-    inner_join(metadata) %>% 
-    select(!!!colnames(metadata), everything()) %>% 
-    mutate(!!samples_column := !!sym(samples_column) %>% str_remove(prefix_to_remove))
-    
-  
-abund_table_long <- abund_table_wide  %>%
-    pivot_longer(-colnames(metadata), 
-                 names_to = "Species",
-                 values_to = "relative_abundance")
+  make_plot <- function(abund_table, metadata, custom_palette, publication_format,
+                        samples_column="Sample_ID", prefix_to_remove="barcode"){
   
-p <- ggplot(abund_table_long, mapping = aes(x=!!sym(samples_column), 
-                                              y=relative_abundance, fill=Species)) +
-    geom_col() +
-    scale_fill_manual(values = colors2use) + 
-    labs(x=NULL, y="Relative Abundance (%)") + 
-    publication_format
-
-return(p)
-}
+    abund_table_wide <- abund_table %>%
+        as.data.frame() %>%
+        rownames_to_column(samples_column) %>%
+        inner_join(metadata) %>%
+        select(!!!colnames(metadata), everything()) %>%
+        mutate(!!samples_column := !!sym(samples_column) %>% str_remove(prefix_to_remove))
+        
+      
+    abund_table_long <- abund_table_wide  %>%
+        pivot_longer(-colnames(metadata), 
+                     names_to = "Species",
+                     values_to = "relative_abundance")
+      
+    p <- ggplot(abund_table_long, mapping = aes(x = !!sym(samples_column), 
+                                                y = relative_abundance, fill = Species)) +
+         geom_col() +
+         scale_fill_manual(values = custom_palette) + 
+         labs(x=NULL, y="Relative Abundance (%)") + 
+         publication_format
+
+    return(p)
+  }
   ```
 
   **Function Parameter Definitions:**
   - `abund_table` - a relative bundance dataframe with rows summing to 100%
   - `metadata` - a metadata dataframe with samples as row and columns describing each sample
-  - `colors2use` - a vector of strings specifying a custom color palette for coloring plots
+  - `custom_palette` - a vector of strings specifying a custom color palette for coloring plots
   - `publication_format` - a ggplot::theme object specifying a custom theme for plotting
   - `samples_column` - a character column specifying the column in `metadata` holding sample names, default is "Sample_ID"
   - `prefix_to_remove` - a string specifying a prefix or any character set to remove from sample names, default is "barcode"
@@ -1138,15 +1075,153 @@ return(p)
 
 </details>
 
+##### make_barplot()
+<details>
+  <summary>Creates barplots from a feature table file</summary>
+  
+  ```R
+  make_barplot <- function(metadata_table_file, feature_table_file, 
+                           feature_column = "species", samples_column = "sample_id", group_column = "group", 
+                           output_prefix, assay_suffix = "_GLlbsMetag",
+                           publication_format, custom_palette) {
+    # Prepare feature table
+    feature_table <- read_csv(feature_table_file) %>% as.data.frame()
+    rownames(feature_table) <- feature_table[[1]]
+    feature_table <- feature_table[, -1]
+
+    # Prepare metadata
+    metadata <- read_delim(metadata_table_file, delim = ",") %>% as.data.frame()
+    row.names(metadata) <- metadata[, samples_column]
+
+    # compute abundances from counts
+    abund_table <- count_to_rel_abundance(feature_table)
+    
+    # create plot
+    p <- make_plot(abund_table, metadata, custom_palette, publication_format, samples_column) +
+         facet_wrap(vars(!!sym(group_column)), nrow = 1, scales = "free_x")
+
+    number_of_species <- p$data$Species %>% unique() %>% length()
+    # Hide the legend if the number of species to plot is greater than 30
+    if(number_of_species > 30) {
+      p <- p + theme(legend.position = "none")
+    }
+
+    return(p)
+
+  }
+  ```
+  **Custom Functions Used:**
+  - [make_plot()](#make_plot)
+  - [count_to_rel_abundance()](#count_to_rel_abundance)
+
+  **Function Parameter Definitions:**
+  - `metadata_table_file` - path to a comma-separated file with samples as rows and columns describing each sample
+  - `feature_table_file` - path to a comma-separated feature table, i.e. a species/functions 
+                           table with species/functions as the first column and samples as other columns.
+  - `feature_column` - a character string containing the feature column name in the feature table ['Species', 'species', 'KO_ID'], default: "species".
+  - `samples_column` - a character string specifying the column in `metadata` holding sample names, default: "sample_id"
+  - `group_column` - a character string specifying the column in `metadata` used to facet/group plots, default: "group"
+  - `output_prefix` - a character string specifying the unique name to add to the output file names 
+                      used to denote the data type/source, for example "unfiltered-kaiju_species"
+  - `assay_suffix` - a character string specifying the GeneLab assay suffix (default: "_GLlbsMetag")
+  - `publication_format` - a ggplot::theme object specifying a custom theme for plotting, from [Step 8c](#8c-set-global-variables)
+  - `custom_palette` - a vector of strings specifying a custom color palette for coloring plots, from [Step 8c](#8c-set-global-variables)
+
+  **Returns:** a relative abundance stacked bar plot
+
+</details>
+
+##### make_heatmap()
+<details>
+  <summary>Creates heatmaps from a feature table file</summary>
+  
+  ```R
+  make_heatmap <- function(metadata_file, feature_table_file, 
+                           samples_column = "sample_id", group_column = "group", 
+                           output_prefix, assay_suffix = "_GLlbsMetag",
+                           custom_palette) {
+    # Prepare feature table
+    feature_table <- read_csv(feature_table_file) %>% as.data.frame()
+    rownames(feature_table) <- feature_table[[1]]
+    feature_table <- feature_table[, -1] %>% as.matrix()
+    colnames(feature_table) <- colnames(feature_table) %>% str_remove_all("barcode")
+
+    # Prepare metadata
+    metadata <- read_delim(metadata_file, delim = ",") %>% as.data.frame()
+    row.names(metadata) <- metadata[, samples_column] %>% str_remove_all("barcode")
+
+    # Get common samples and re-arrange feature table and metadata
+    common_samples <- intersect(colnames(feature_table), rownames(metadata))
+    feature_table <- feature_table[, common_samples]
+    metadata <- metadata[common_samples,]
+    metadata <- metadata %>% arrange(!!sym(group_column))
+
+    # Create column annotation
+    col_annotation <- as.data.frame(metadata)[,group_column, drop=FALSE]
+    rownames(col_annotation) <- rownames(metadata)
+
+    # Calculate output plot width and height
+    number_of_samples <- ncol(feature_table)
+    width <- 1 * number_of_samples
+    number_of_features <- nrow(feature_table)
+    height <- 0.2 * number_of_features 
+
+    # Set colors by group
+    groups <- metadata[[group_column]] %>%  unique()
+    number_of_groups <-  length(groups)
+    my_colors <- custom_palette[1:number_of_groups]
+    names(my_colors) <- groups
+    annotation_colors  <- list(my_colors)
+    names(annotation_colors) <- group_column
+
+    # create heatmap
+    png(filename = glue("{output_prefix}_heatmap.png"), width = width,
+        height = height, units = "in", res = 300)
+    pheatmap(mat = feature_table[,rownames(col_annotation)],
+            cluster_cols = FALSE, 
+            cluster_rows = FALSE, 
+            col = colorRampPalette(c('white','red'))(255), 
+            angle_col = 0, 
+            display_numbers = TRUE,
+            fontsize = 12, 
+            annotation_col = col_annotation,
+            annotation_colors = annotation_colors ,
+            number_format = "%.0f")
+    dev.off()
+
+
+  }
+  ```
+  **Custom Functions Used:**
+  - None
+
+  **Function Parameter Definitions:**
+  - `metadata_file` - path to a comma-separated file with samples as rows and columns describing each sample
+  - `feature_table_file` - path to a comma-separated feature table, i.e. a species/functions 
+                           table with species/functions as the first column and samples as other columns.
+  - `feature_column` - a character string containing the feature column name in the feature table ['Species', 'species', 'KO_ID'].
+  - `samples_column` - a character string specifying the column in `metadata` holding sample names, default: "sample_id"
+  - `group_column` - a character string specifying the column in `metadata` used to facet/group plots, default: "group"
+  - `output_prefix` - a character string specifying the unique name to add to the output file names 
+                      used to denote the data type/source, for example "unfiltered-kaiju_species"
+  - `assay_suffix` - a character string specifying the GeneLab assay suffix (default: "_GLlbsMetag")
+  - `publication_format` - a ggplot::theme object specifying a custom theme for plotting, from [Step 8c](#8c-set-global-variables)
+  - `custom_palette` - a vector of strings specifying a custom color palette for coloring plots, from [Step 8c](#8c-set-global-variables)
+
+  **Returns:** no plot object; the heatmap is written to `{output_prefix}_heatmap.png`
+
+</details>
 
 ##### run_decontam()
 <details>
   <summary>Feature table decontamination with decontam</summary>
 
   ```R
-  run_decontam <- function(feature_table, metadata, contam_threshold=0.1, prev_col=NULL, freq_col=NULL) {
+  run_decontam <- function(feature_table, metadata, contam_threshold=0.1, 
+                           prev_col = NULL, freq_col = NULL, ntc_name = "Control_Sample") {
 
-    sub_metadata <- metadata[colnames(feature_table),]
+    # retain metadata for only the samples present in the input feature table
+    sub_metadata <- metadata[colnames(feature_table), ]
     # Modify NTC concentration
     # Often times the user may set the NTC concentration to zero because they think nothing 
     # should be in the negative control but decontam fails if the value is set to zero.
@@ -1154,13 +1229,13 @@ return(p)
     # 0.0000001
     if (!is.null(freq_col)) {
 
-      sub_metadata <- sub_metadata %>% 
-        mutate(!!freq_col:=map_dbl(!!sym(freq_col), .f= function(conc) { 
-                                      if(conc == 0) return(0.0000001) else return(conc) 
-                                    } 
-                                  )
-              )
-      sub_metadata[, freq_col] <- as.numeric(sub_metadata[,freq_col])
+      sub_metadata <- sub_metadata %>%
+        mutate(!!freq_col:=map_dbl(!!sym(freq_col), .f = function(conc) {
+              if(conc == 0) return(0.0000001) else return(conc) 
+            } 
+          )
+        )
+      sub_metadata[, freq_col] <- as.numeric(sub_metadata[, freq_col])
 
     }
 
@@ -1172,33 +1247,27 @@ return(p)
     # samples, as that is the form required by isContaminant.
     # The line below assumes that control samples will always be named "Control_Sample"
     # in the `prev_col`.
-    sample_data(ps)$is.neg <- sample_data(ps)[[prev_col]] == "Control_Sample"
+    sd <- as.data.frame(sample_data(ps)) # Extract sample metadata
+    sd[, "is.neg"] <- FALSE # Initialize; stays FALSE when no prevalence column is supplied
+    if (!is.null(prev_col)) sd[, "is.neg"] <- sample_data(ps)[[prev_col]] == ntc_name # Assign boolean value
+    sample_data(ps) <- sd
 
     # Run Decontam 
-    if (!is.null(freq_col) && !is.null(prev_col)) {   
-
+    if (!is.null(freq_col) && !is.null(prev_col)) {
       # Run decontam in both prevalence and frequency modes
       contamdf <- isContaminant(ps, neg="is.neg", conc=freq_col, threshold=contam_threshold) 
-
     } else if(!is.null(freq_col)) {
-      
       # Run decontam in frequency mode
       contamdf <- isContaminant(ps, conc=freq_col, threshold=contam_threshold) 
-
     } else if(!is.null(prev_col)){
-
       # Run decontam in prevalence mode
       contamdf <- isContaminant(ps, neg="is.neg", threshold=contam_threshold)
-    
     } else {
-
       cat("Both freq_col and prev_col cannot be set to NULL.\n")
       cat("Please supply either one or both column names in your metadata")
       cat("for frequency and prevalence based analysis, respectively\n")
       stop()
-
-    }
-                    
+    }            
     return(contamdf)
   }
   ```
@@ -1214,6 +1283,88 @@ return(p)
   **Returns:** a dataframe of detailed decontam results
 </details>
 
+##### feature_decontam() 
+<details>
+  <summary>Decontaminate a feature table</summary>
+  
+  ```R
+  library(tidyverse)
+  library(glue)
+
+  feature_decontam <- function(metadata_file, feature_table_file, 
+                               feature_column = "species", samples_column = "sample_id",
+                               prevalence_column = "Sample_or_Control", ntc_name, frequency_column = "concentration", 
+                               threshold = 0.1, classification_method, 
+                               output_prefix, assay_suffix = "_GLlbsMetag") {
+    # Prepare feature table
+    feature_table <- read_csv(feature_table_file) %>%  as.data.frame()
+    rownames(feature_table) <- feature_table[[1]]
+    feature_table <- feature_table[, -1]  %>% as.matrix()
+
+    # Prepare metadata
+    metadata <- read_csv(metadata_file) %>% as.data.frame()
+    row.names(metadata) <- metadata[, samples_column]
+
+    # Run decontam
+    contamdf <- run_decontam(feature_table, metadata, threshold, prevalence_column, frequency_column, ntc_name)
+
+    contamdf <- as.data.frame(contamdf) %>% rownames_to_column(feature_column)
+
+    # Write decontam's primary results table
+    outfile <- glue("{output_prefix}decontam-{classification_method}_results{assay_suffix}.csv")
+    write_csv(x = contamdf, file = outfile)
+
+    # Get the list of contaminants identified by decontam
+    contaminants <- contamdf %>%
+                    filter(contaminant == TRUE) %>%
+                    pull(!!sym(feature_column))
+
+    # Drop contaminant(s) if detected
+    if(length(contaminants) > 0){
+      
+      # Drop contaminant features identified by decontam
+      decontaminated_table <- feature_table %>%
+        as.data.frame() %>%
+        rownames_to_column(feature_column) %>%
+        filter(str_detect(!!sym(feature_column),
+                          pattern = str_c(contaminants,
+                                          collapse = "|"),
+                          negate = TRUE))
+
+      rownames(decontaminated_table) <- decontaminated_table[[feature_column]]
+      decontaminated_table <- decontaminated_table[,-1] %>% as.matrix
+
+      outfile <- glue("{output_prefix}decontaminated-{classification_method}_species_table{assay_suffix}.csv")
+      write_csv(x = decontaminated_table %>% as.data.frame() %>% rownames_to_column(feature_column), file = outfile)
+
+      return(decontaminated_table)
+
+    } else {
+      message("No contaminants were detected by Decontam")
+      return(NULL)
+    }
+  }
+  ```
+  **Custom Functions Used:**
+  - [run_decontam()](#run_decontam)
+
+  **Function Parameter Definitions:**
+  - `metadata_file` - path to a file with samples as rows and columns describing each sample
+  - `feature_table_file` - path to a comma-separated feature table i.e. species/functions 
+                           table with species/functions as the first column and samples as other columns.
+  - `feature_column` - a character string containing the feature column name in the feature table ['Species', 'species', 'KO_ID'].
+  - `samples_column` - a character string specifying the column in `metadata` holding sample names, default: "sample_id"
+  - `frequency_column` - a character string specifying the column in `metadata` to use for frequency based analysis, default: "concentration"
+  - `prevalence_column` - a character string specifying the column in `metadata` to use for prevalence based analysis, default: "Sample_or_Control"
+  - `ntc_name` - a character string specifying the name of the NTC in the prevalence column
+  - `threshold` - a number between 0 and 1 specifying the decontam threshold for both prevalence and frequency based analyses, default: 0.1
+  - `output_prefix` - a character string specifying the unique name to add to the output file names 
+                      used to denote the data type/source, for example "unfiltered-kaiju_species"
+  - `classification_method` - a character string specifying the tool used to generate the classifications ['kaiju', 'kraken2', 'metaphlan', 'contig-taxonomy', 'gene-taxonomy', 'gene-function']
+  - `assay_suffix` - a character string specifying the GeneLab assay suffix (default: "_GLlbsMetag")
+
+  **Returns:** a matrix containing the decontaminated feature table, or `NULL` if no contaminants were detected
+</details>
 
 ##### process_taxonomy()
 <details>
@@ -1430,7 +1581,7 @@ custom_palette <- custom_palette[-c(21:23,
                                          x = custom_palette, 
                                          ignore.case = TRUE)
                                    )
-                                ]
+                                ]                      
 # Heatmap color gradient - here from white to red
 colours <- colorRampPalette(c('white','red'))(255)
 ```
@@ -1456,10 +1607,10 @@ colours <- colorRampPalette(c('white','red'))(255)
 
 ```bash
 # Make a directory that will hold the downloaded kaiju database
-mkdir kaiju-db/ && cd kaiju-db/
+mkdir kaiju-db/ && cd kaiju-db/
 
 # Download kaiju's reference database
-kaiju-makedb -s nr_euk -t NumberOfThreads
+kaiju-makedb -s nr_euk -t NumberOfThreads
 
 # Clean up
 rm nr_euk/kaiju_db_nr_euk.bwt nr_euk/kaiju_db_nr_euk.sa
@@ -1490,7 +1641,7 @@ kaiju -f kaiju-db/nr_euk/kaiju_db_nr_euk.fmi \
       -t kaiju-db/nodes.dmp \
       -z NumberOfThreads \
       -E 1e-05 \
-      -i /path/to/sample_HRrm.fasta.gz \
+      -i /path/to/sample_decontam_GLlbsMetag.fastq.gz \
       -o sample_kaiju.out
 ```
 
@@ -1507,7 +1658,7 @@ kaiju -f kaiju-db/nr_euk/kaiju_db_nr_euk.fmi \
 
 - kaiju-db/nr_euk/kaiju_db_nr_euk.fmi (FM-index file containing the main Kaiju database index, output from [Step 9a](#9a-build-kaiju-database))
 - kaiju-db/nodes.dmp (kaiju taxonomy hierarchy nodes file, output from [Step 9a](#9a-build-kaiju-database))
-- sample_HRrm.fasta.gz (filtered and trimmed sample reads with both contaminants and human reads removed, gzipped fasta file, output from [Step 7b](#7b-remove-host-reads))
+- sample_decontam_GLlbsMetag.fastq.gz (filtered and trimmed sample reads with both contaminants and human reads removed, gzipped fastq file, output from [Step 7e](#7e-generate-decontaminated-read-files))
 
 **Output Data:**
 
@@ -1516,17 +1667,17 @@ kaiju -f kaiju-db/nr_euk/kaiju_db_nr_euk.fmi \
 #### 9c. Compile Kaiju Taxonomy Results
 
 ```bash
-# Merge kaiju reports to one table at each taxonomic level, phylum, class, order, family, genus, species 
+# Merge kaiju reports to one table at the species level 
 kaiju2table -t nodes.dmp \
             -n names.dmp \
             -p \
-            -r ${TAXON_LEVEL} \
+            -r "species" \
             -o merged_kaiju_summary_${TAXON_LEVEL}.tsv \
             *_kaiju.out
 
 # Convert file names to sample names
-sed -i -E 's/.+\/(.+)_kaiju\.out/\1/g' merged_kaiju_summary_${TAXON_LEVEL}.tsv && \
-sed -i -E 's/file/sample/' merged_kaiju_summary_${TAXON_LEVEL}.tsv
+sed -i -E 's/.+\/(.+)_kaiju\.out/\1/g' merged_kaiju_table_GLlbsMetag.tsv && \
+sed -i -E 's/file/sample/' merged_kaiju_table_GLlbsMetag.tsv
 ```
 
 **Parameter Definitions:**
@@ -1534,7 +1685,7 @@ sed -i -E 's/file/sample/' merged_kaiju_summary_${TAXON_LEVEL}.tsv
 - `-t` - Specifies the path to the kaiju taxonomy hierarchy file (nodes.dmp).
 - `-n` - Specifies the path to the kaiju taxonomy names file (names.dmp).
 - `-p` - Print the full taxon path instead of only the taxon name.
-- `-r` - Specifies taxonomic rank to print the taxon path to, must be one of: phylum, class, order, family, genus, species.
+- `-r` - Specifies taxonomic rank to print the taxon path to, must be one of: phylum, class, order, family, genus, species. Set to species in this pipeline.
 - `-o` - Specifies the name of the kaiju taxon summary output file.
 - `*_kaiju.out` - Positional argument specifying the path to the kaiju output files for each sample. 
 
@@ -1546,7 +1697,7 @@ sed -i -E 's/file/sample/' merged_kaiju_summary_${TAXON_LEVEL}.tsv
 
 **Output Data:**
 
-- **merged_kaiju_summary_${TAXON_LEVEL}.tsv** (compiled kaiju summary table for each taxon level)
+- merged_kaiju_table_GLlbsMetag.tsv (compiled kaiju summary table at the species level)
 
 #### 9d. Convert Kaiju Output To Krona Format
 
@@ -1583,7 +1734,7 @@ find . -type f -name "*.krona" | sort -uV > krona_files.txt
 
 # Create a file containing a sorted list of all sample names
 FILES=($(find . -type f -name "*.krona"))
-basename --multiple --suffix='.krona' ${FILES[*]} | sort -uV  > sample_names.txt
+basename -a -s '.krona' ${FILES[*]} | sort -uV  > sample_names.txt
 
 # Create ktImportText input format files
 KTEXT_FILES=($(paste -d',' "krona_files.txt" "sample_names.txt"))
@@ -1603,12 +1754,12 @@ ktImportText  -o kaiju-report.html ${KTEXT_FILES[*]}
 
 - `-u` - Specifies to perform a unique sort.
 - `-V` - Specifies to perform a mixed type of sorting with names containing numbers within text.
-- `> {}.txt` - Redirects the sorted list to a separate text file.
+- `> krona_files.txt` - Redirects the sorted list to a separate text file.
 
 **basename**
 
-- `--multiple` - Support multiple arguments and treat each as a file name.
-- `--suffix='.krona'` - Remove a trailing '.krona' suffix.
+- `-a` - Support multiple arguments and treat each as a file name.
+- `-s '.krona'` - Remove trailing '.krona' suffix.
 
 **paste**
 
@@ -1617,8 +1768,8 @@ ktImportText  -o kaiju-report.html ${KTEXT_FILES[*]}
 **ktImportText**
 
 - `-o` - Specifies the compiled output html file name.
-- `${KTEXT_FILES[*]}` - An array positional arguement with the following content: 
-                     sample_1.krona,sample_1 sample_2.krona,sample_2 ... sample_n.krona,sample_n.
+- `${KTEXT_FILES[*]}` - An array positional argument with the following content: 
+                        sample_1.krona,sample_1 sample_2.krona,sample_2 ... sample_n.krona,sample_n.
 
 **Input Data:**
 - *.krona (all sample .krona formatted files, output from [Step 9d](#9d-convert-kaiju-output-to-krona-format)) 
@@ -1631,17 +1782,21 @@ ktImportText  -o kaiju-report.html ${KTEXT_FILES[*]}
 - **kaiju-report.html** (compiled krona html report containing all samples)
 
 
-#### 9f. Create Kaiju Species Count Table --- START NEEDS REVIEW ---
+#### 9f. Create Kaiju Species Count Table
 
 ```R
 library(tidyverse)
-feature_table <- process_kaiju_table (file_path="merged_kaiju_table.tsv")
+feature_table <- process_kaiju_table(file_path="merged_kaiju_table_GLlbsMetag.tsv")
 table2write <- feature_table  %>%
                 as.data.frame() %>%
                 rownames_to_column("Species")
-write_csv(x = table2write, file = "kaiju_species_table.csv")
+write_csv(x = table2write, file = "kaiju_species_table_GLlbsMetag.csv")
 ```
 
+**Custom Functions Used:**
+
+- [process_kaiju_table()](#process_kaiju_table)
+
 **Parameter Definitions:**
 
 - `file_path` - path to compiled kaiju table at the species taxon level
@@ -1650,44 +1805,64 @@ write_csv(x = table2write, file = "kaiju_species_table.csv")
 
 **Input Data:**
 
-- merged_kaiju_table.tsv (compiled kaiju table at the species taxon level, from [Step 9c](#10c-compile-kaiju-taxonomy-results))
+- merged_kaiju_table_GLlbsMetag.tsv (compiled kaiju table at the species taxon level, from [Step 9c](#9c-compile-kaiju-taxonomy-results))
 
 **Output Data:**
 
-- kaiju_species_table.csv (kaiju species count table in csv format)
+- **kaiju_species_table_GLlbsMetag.csv** (kaiju species count table in csv format)
 
 
-#### 9g. Read-in tables
+#### 9g. Filter Kaiju Species Count Table
 
 ```R
 library(tidyverse)
 
-# Read-in metadata
-metdata_file <- "/path/to/sample/metadata"
-samples_column <- "Sample_ID"
-metadata <- read_delim(file=metdata_file , delim = "\t") %>% as.data.frame()
-row.names(metadata) <- metadata[,samples_column]
+input_file <- "kaiju_species_table_GLlbsMetag.csv"
+output_file <- "filtered-kaiju_species_table_GLlbsMetag.csv"
+threshold <- 0.5
 
-# Read-in feature table
-species_table <- read_csv(file="kaiju_species_table.csv") %>%  as.data.frame()
-rownames(species_tablee) <- species_table$Species
-species_table <- species_table[,-1]  %>% as.matrix()
+# string used to define non-microbial taxa
+non_microbial <- "UNCLASSIFIED|Unclassifed|unclassified|Homo sapien|cannot|uncultured|unidentified"
+
+# read in feature table
+feature_table <- read_csv(input_file) %>% as.data.frame()
+feature_name <- colnames(feature_table)[1]
+rownames(feature_table) <- feature_table[,1]
+feature_table <- feature_table[, -1]
+
+# convert count table to a relative abundance matrix
+abund_table <- feature_table %>% rownames_to_column(feature_name) %>%
+  mutate(across(where(is.numeric), function(x) (x / sum(x, na.rm = TRUE)) * 100)) %>%
+  as.data.frame()
+
+rownames(abund_table) <- abund_table[,1]
+abund_table <- abund_table[,-1] %>% t 
+
+table2write <- group_low_abund_taxa(abund_table, threshold = threshold)  %>%
+  t %>% as.data.frame() %>%
+  rownames_to_column(feature_name)
+
+write_csv(x = table2write, file = output_file)
 ```
 
+**Custom Functions Used:**
+
+- [group_low_abund_taxa()](#group_low_abund_taxa)
+
 **Parameter Definitions:**
 
-- `file` - path to input tables
-- `delim` - file delimiter 
+- `threshold` - threshold for filtering out rare taxa, a percentage between 0 and 100.
+- `output_file` - output filename
+- `input_file` - input filename
 
 **Input Data:**
 
-- metadata_file  (path to sample-wise metadata file)
-- kaiju_species_table.csv (path to kaiju species table from [Step 9f](#9f-create-kaiju-species-count-table))
+- kaiju_species_table_GLlbsMetag.csv (path to kaiju species table from [Step 9f](#9f-create-kaiju-species-count-table))
 
 **Output Data:**
 
-- `metadata` - a dataframe of sample-wise metadata
-- `species_table` - a dataframe of species count per sample
+- **filtered-kaiju_species_table_GLlbsMetag.csv** (filtered species table in csv format)
+
 ---
 
 #### 9h. Taxonomy barplots
@@ -1695,68 +1870,61 @@ species_table <- species_table[,-1]  %>% as.matrix()
 ```R
 library(tidyverse)
 
-# Threshold to filter out potential false positive
-# taxonomy assignments
-filter_threshold <- 0.5
-# Filter out Rare and non-microbial assignments.
-# You can add as many species that you'd like to filter out
-# using the following syntax "|species_name1|species_name2"
-non_microbial <- "Unclassifed|unclassified|Homo sapien|cannot|uncultured|unidentified"
-
-plot_width <- 18
-plot_height <- 8
-
-# Convert count matrix to relative abundance matrix
-abund_table <- count_to_rel_abundance(species_table)
-
-# Make plot without filtering
-p <- make_plot(abund_table, metadata, custom_palette, publication_format)
-
-ggsave(filename =  "unfiltered-kaiju_species_plot.png", plot = p,
-       device = "png", width = plot_width, height = plot_height, units = "in", dpi = 300)
+species_table_file <- "kaiju_species_table_GLlbsMetag.csv"
+filtered_species_table_file <- "filtered-kaiju_species_table_GLlbsMetag.csv"
+metadata_file <- "/path/to/sample/metadata"
+number_samples <- 10 
 
+# set width based on number of samples, with a cap at 50 inches
+plot_width <- 2 * number_samples
+if(plot_width > 50) { plot_width = 50 }
 
-# Get species with relative abundance greater than `filter_threshold` in all samples
-# Drop rare and non-microbial assignments
-filtered_species_table  <- filter_rare(species_table, non_microbial, threshold=filter_threshold)
+p <- make_barplot(metadata_file = metadata_file, feature_table_file = species_table_file, 
+                  feature_column = "Species", samples_column = "sample_id", group_column = "group",
+                  publication_format = publication_format, custom_palette = custom_palette)
 
+ggsave(filename = "unfiltered-kaiju_species_barplot_GLlbsMetag.png", plot = p,
+       device = "png", width = plot_width, height = 10, units = "in", dpi = 300, limitsize = FALSE)
 
-# Convert count matrix to relative abundance matrix
-filtered_species_table <- count_to_rel_abundance(filtered_species_table)
-
-# Write filtered table to file
-table2write <- filtered_species_table %>%
-                 t %>%
-                as.data.frame() %>%
-                rownames_to_column("Species")
+# Save interactive unfiltered plot
+htmlwidgets::saveWidget(plotly::ggplotly(p), "unfiltered-kaiju_species_barplot_GLlbsMetag.html", selfcontained = TRUE)
+
-write_csv(x = table2write, file = "filtered-kaiju_species_table.csv")
+# Make barplot from the filtered species table
+p <- make_barplot(metadata_file = metadata_file, feature_table_file = filtered_species_table_file,
+                  feature_column = "Species", samples_column = "sample_id", group_column = "group",
+                  publication_format = publication_format, custom_palette = custom_palette)
 
-# Make plot after filtering
-p <- make_plot(filtered_species_table , metadata, custom_palette, publication_format)
+# Save static filtered plot
+ggsave(filename = "filtered-kaiju_species_barplot_GLlbsMetag.png", plot = p,
+      device = 'png', width = plot_width, height = 10, units = 'in', dpi = 300, limitsize = FALSE)
 
-ggsave(filename = "filtered-kaiju_species_plot.png", plot = p,
-         device = "png", width = plot_width, height = plot_height, units = "in", dpi = 300)
+# Save interactive filtered plot
+htmlwidgets::saveWidget(plotly::ggplotly(p), "filtered-kaiju_species_barplot_GLlbsMetag.html", selfcontained = TRUE)
 ```
 
 **Parameter Definitions:**
 
-- `filter_threshold` - a decimal threshold from 0-1 for filtering out rare species i.e potential false epositives.
-- `non_microbial` - a regex string  listing out assignmnets to drop before filtering based on the `filter_threshold` above. 
+- `species_table_file` - a file containing the species count table
+- `filtered_species_table_file` - a file containing the filtered species count table
+- `metadata_file` - a file containing group information for each sample in the species count files
+- `number_samples` - the total number of samples in the species count files, adjust based on input files.
 
 **Input Data:**
 
-- `species_table` (a dataframe of species count per sample, output from [Step 9g](#9g-read-in-tables))
-- `metadata` - (a dataframe of sample-wise metadata, output from [Step 9g](#9g-read-in-tables))
+- `kaiju_species_table_GLlbsMetag.csv` (a file containing the species count table, output from [Step 9f](#9f-create-kaiju-species-count-table))
+- `filtered-kaiju_species_table_GLlbsMetag.csv` (a file containing the filtered species count table, output from [Step 9g](#9g-filter-kaiju-species-count-table))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
+
 
 **Output Data:**
 
-- **unfiltered-kaiju_species_plot.png** (barplot plot without filtering)
-- **filtered-kaiju_species_table.csv** (filtered relative abundance table)
-- **filtered-kaiju_species_plot.png** (barplot after filtering rare and non-microbial taxa)
+- **unfiltered-kaiju_species_barplot_GLlbsMetag.png** (taxonomy barplot without filtering)
+- **unfiltered-kaiju_species_barplot_GLlbsMetag.html** (interactive taxonomy barplot without filtering)
+- **filtered-kaiju_species_barplot_GLlbsMetag.png** (taxonomy barplot after filtering rare and non-microbial taxa)
+- **filtered-kaiju_species_barplot_GLlbsMetag.html** (interactive taxonomy barplot after filtering rare and non-microbial taxa)
 
 
-#### 9i. Feature decontamination --- END NEEDS REVIEW ---
+#### 9i. Feature decontamination
 
 > Feature (species) decontamination with decontam. Decontam is an R package that statistically identifies contaminating features in a feature table
 
@@ -1765,72 +1933,46 @@ library(tidyverse)
 library(decontam)
 library(phyloseq)
 
-feature_table <- read_csv("filtered-kaiju_species_table.csv") %>%
-                  as.data.frame()
-
- rownames(feature_table) <- feature_table$Species
- feature_table <- feature_table[,-1]  %>% as.matrix()
-# Set to 0.5 for a more aggressive approach where species more prevalent
-# in the negative controls are considered contaminants
-contam_threshold <- 0.1
-# Control samples in this column should always be written as 
-# "Control_Sample" and true samples as "True_Sample"
-prev_col <- "Sample_or_Control"
-freq_col <- "input_conc_ng"
-plot_width <- 18
-plot_height <- 8
-
-contamdf <- run_decontam(feature_table, metadata, contam_threshold, prev_col, freq_col)
-
-# Write decontam results table to file
-write_csv(x = contamdf %>% rownames_to_column("Species"), file = "decontam-kaiju_results.csv")
-
-# Get the list of contaminants identified by decontam
-contaminants <- contamdf %>%
-                as.data.frame %>%
-                rownames_to_column("Species") %>%
-                filter(contaminant == TRUE) %>% pull(Species)
-
-# Drop contaminant features identified by decontam
-decontaminated_table <- feature_table %>% 
-                as.data.frame  %>% 
-                rownames_to_column("Species") %>% 
-                filter(str_detect(Species, 
-                                  pattern = str_c(contaminants,
-                                                  collapse = "|"),
-                                  negate = TRUE))
+feature_table_file <- "filtered-kaiju_species_table_GLlbsMetag.csv"
+metadata_table <- "/path/to/sample/metadata"
+ntc_name <- "name_of_ntc_sample"
 
-rownames(decontaminated_table) <- decontaminated_table$Species
-decontaminated_table <- decontaminated_table[,-1] %>% as.matrix
+decontaminated_table <- feature_decontam(metadata_file = metadata_table, feature_table_file = feature_table_file, 
+                               feature_column = "species", samples_column = "sample_id",
+                               prevalence_column = "Sample_or_Control", ntc_name = ntc_name, frequency_column = "concentration", 
+                               threshold = 0.1, classification_method = "kaiju", 
+                               output_prefix = "", assay_suffix = "_GLlbsMetag")
 
 # Convert count matrix to relative abundance matrix
 decontaminated_species_table <- count_to_rel_abundance(decontaminated_table)
 
-# Write decontaminated species table to file
-table2write <- decontaminated_species_table %>%
-                 t %>%
-                 as.data.frame() %>%
-                rownames_to_column("Species")
-
-write_csv(x = table2write, file = "decontaminated-kaiju_species_table.csv")
-
 # Make plot after filtering out contaminants
-p <- make_plot(decontaminated_species_table , metadata, custom_palette, publication_format)
+p <- make_plot(decontaminated_species_table, metadata, custom_palette, publication_format)
 
-ggsave(filename = "decontaminated-kaiju-species_plot.png", plot = p,
+ggsave(filename = "decontaminated-kaiju-species_barplot_GLlbsMetag.png", plot = p,
          device = "png", width = plot_width, height = plot_height, units = "in", dpi = 300)
 ```
+**Custom Functions Used:**
+- [feature_decontam()](#feature_decontam)
+- [make_plot()](#make_plot)
+- [count_to_rel_abundance()](#count_to_rel_abundance)
+
+**Parameter Definitions:**
+  - `metadata_table` - path to a file with samples as rows and columns describing each sample
+  - `feature_table_file` - path to a comma-separated feature table i.e. species/functions 
+                           table with species/functions as the first column and samples as other columns.
+  - `ntc_name` - a character string specifying the name of the NTC in the prevalence column
 
 **Input Data:**
 
-- `filtered-kaiju_species_table.csv`(path to filtered species count per sample, output from [Step 9h](#9h-taxonomy-barplots))
-- `metadata`(a dataframe of sample-wise metadata, output from [Step 9g](#9g-read-in-tables))
+- `filtered-kaiju_species_table_GLlbsMetag.csv` (path to the filtered species table, output from [Step 9g](#9g-filter-kaiju-species-count-table))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
 
 **Output Data:**
 
-- **decontam-kaiju_results.csv** (decontam's result table)
-- **decontaminated-kaiju_species_table.csv** (decontaminated species table)
-- **decontaminated-kaiju-species_plot.png** (barplot after filtering out contaminants)
+- **decontam-kaiju_results_GLlbsMetag.csv** (decontam's result table)
+- **decontaminated-kaiju_species_table_GLlbsMetag.csv** (decontaminated species table)
+- **decontaminated-kaiju-species_barplot_GLlbsMetag.png** (barplot after filtering out contaminants)
 
 <br>
 
@@ -1896,7 +2038,7 @@ kraken2 --db kraken2-db/ \
         --use-names \
         --output sample-kraken2-output.txt \
         --report sample-kraken2-report.tsv \
-        /path/to/sample_HRrm.fasta.gz
+        /path/to/sample_decontam_GLlbsMetag.fastq.gz
 ```
 
 **Parameter Definitions:**
@@ -1907,12 +2049,12 @@ kraken2 --db kraken2-db/ \
 - `--use-names` - Specifies to add taxa names in addition to taxids.
 - `--output` - Specifies the name of the kraken2 read-based output file.
 - `--report` - Specifies the name of the kraken2 report output file.
-- `sample_HRrm.fasta.gz` - Positional argument specifying the input file.
+- `sample_decontam_GLlbsMetag.fastq.gz` - Positional argument specifying the input file.
 
 **Input Data:**
 
 - kraken2-db/ (a directory containing kraken2 database files, output from [Step 10a](#10a-download-kraken2-database))
-- sample_HRrm.fasta.gz (filtered and trimmed sample reads with both contaminants and human reads removed, gzipped fasta file, output from [Step 7b](#7b-remove-host-reads))
+- sample_decontam_GLlbsMetag.fastq.gz (filtered and trimmed sample reads with both contaminants and human reads removed, gzipped fastq file, output from [Step 7e](#7e-generate-decontaminated-read-files))
 
 **Output Data:**
 
@@ -1950,7 +2092,7 @@ combine_kreports.py --output merged-kraken2-table.tsv \
 ```bash
 multiqc --zip-data-dir \ 
         --outdir kraken2_multiqc_report \
-        --filename kraken2_multiqc_GLlbnMetag \
+        --filename kraken2_multiqc_GLlbsMetag \
         --interactive \
         /path/to/*kraken2-report.tsv
 ```
@@ -1969,8 +2111,8 @@ multiqc --zip-data-dir \
 
 **Output Data:**
 
-- **kraken2_multiqc_GLlbnMetag.html** (multiqc output html summary)
-- **kraken2_multiqc_GLlbnMetag_data.zip** (zip archive containing multiqc output data)
+- **kraken2_multiqc_GLlbsMetag.html** (multiqc output html summary)
+- **kraken2_multiqc_GLlbsMetag_data.zip** (zip archive containing multiqc output data)
 
 
 #### 10d. Convert Kraken2 Output to Krona Format
@@ -2163,6 +2305,11 @@ ggsave(filename = "filtered-kraken_species_plot.png", plot = p,
          device = "png", width = plot_width, height = plot_height, units = "in", dpi = 300)
 ```
 
+**Custom Functions Used:**
+- [make_plot()](#make_plot)
+- [count_to_rel_abundance()](#count_to_rel_abundance)
+
+
 **Parameter Definitions:**
 
 - `filter_threshold` - a decimal threshold from 0-1 to filter out rare species i.e potential false positives
@@ -2194,13 +2341,11 @@ feature_table <- read_csv("filtered-kraken_species_table.csv") %>%
 
  rownames(feature_table) <- feature_table$Species
  feature_table <- feature_table[,-1]  %>% as.matrix()
-
 # Set to 0.5 for a more aggressive approach where species more prevalent
 # in the negative controls are considered contaminants
 contam_threshold <- 0.1
 # Control samples in this column should always be written as
-# "Control_Sample" and true samples as "True_Sample" for the function below to
-# function properly.
+# "Control_Sample" and true samples as "True_Sample" for the function below to function properly.
 prev_col <- "Sample_or_Control"
 freq_col <- "input_conc_ng"
 plot_width <- 18
@@ -2247,6 +2392,10 @@ ggsave(filename = "decontaminated-kraken-species_plot.png", plot = p,
          device = "png", width = plot_width, height = plot_height, units = "in", dpi = 300)
 ```
 
+**Custom Functions Used:**
+- [make_plot()](#make_plot)
+- [count_to_rel_abundance()](#count_to_rel_abundance)
+
 **Input Data:**
 
 - `filtered-kraken_species_table.csv`(path to species count per sample, output from [Step 10h](#10h-taxonomy-barplots))
@@ -2271,7 +2420,7 @@ flye --meta \
      --threads NumberOfThreads \
      --out-dir sample/ \
      --nano-hq \
-     /path/to/sample_HRrm.fasta.gz
+     /path/to/sample_decontam_GLlbsMetag.fastq.gz
 
 # rename output files            
 mv sample/assembly.fasta sample_assembly.fasta
@@ -2284,11 +2433,11 @@ mv sample/flye.log sample_flye.log
 - `--threads` - Number of parallel processing threads to use.
 - `--out-dir` - Specifies the name of the output directory.
 - `--nano-hq` - Specifies that input is from Oxford Nanopore high-quality reads (Guppy5+ SUP or Q20, <5% error). This skips a genome polishing step since the assembly will be polished with medaka in the next step.
-- `/path/to/sample_HRrm.fasta.gz` - Path to the input file, specified as a positional argument.
+- `/path/to/sample_decontam_GLlbsMetag.fastq.gz` - Path to the input file, specified as a positional argument.
 
 **Input Data**
 
-- sample_HRrm.fasta.gz (filtered and trimmed sample reads with both contaminants and human reads removed, gzipped fasta file, output from [Step 7b](#7b-remove-host-reads))
+- sample_decontam_GLlbsMetag.fastq.gz (filtered and trimmed sample reads with both contaminants and human reads removed, gzipped fastq file, output from [Step 7e](#7e-generate-decontaminated-read-files))
 
 **Output Data**
 
@@ -2303,7 +2452,7 @@ mv sample/flye.log sample_flye.log
 
 ```bash
 medaka_consensus -t NumberOfThreads \
-                 -i /path/to/sample_HRrm.fasta.gz \
+                 -i /path/to/sample_decontam_GLlbsMetag.fastq.gz \
                  -d /path/to/assemblies/sample_assembly.fasta \
                  -o sample/
   
@@ -2319,7 +2468,7 @@ mv sample/consensus.fasta sample_polished.fasta
 
 **Input Data:**
 
-- sample_HRrm.fasta.gz (filtered and trimmed sample reads with both contaminants and human reads removed, gzipped fasta file, output from [Step 7b](#7b-remove-host-reads))
+- sample_decontam_GLlbsMetag.fastq.gz (filtered and trimmed sample reads with both contaminants and human reads removed, gzipped fastq file, output from [Step 7e](#7e-generate-decontaminated-read-files))
 - /path/to/assemblies/sample_assembly.fasta (sample assembly, output from [Step 11](#11-sample-assembly))
 
 **Output Data:**
@@ -2357,7 +2506,7 @@ bit-rename-fasta-headers -i sample_polished.fasta \
 #### 13b. Summarize Assemblies
 
 ```bash
-bit-summarize-assembly -o assembly-summaries_GLlbnMetag.tsv \
+bit-summarize-assembly -o assembly-summaries_GLlbsMetag.tsv \
                        *-assembly.fasta
 ```
 
@@ -2372,7 +2521,7 @@ bit-summarize-assembly -o assembly-summaries_GLlbnMetag.tsv \
 
 **Output files:**
 
-- **assembly-summaries_GLlbnMetag.tsv** (table of assembly summary statistics)
+- **assembly-summaries_GLlbsMetag.tsv** (table of assembly summary statistics)
 
 <br>
 
@@ -2685,7 +2834,7 @@ minimap2 -a \
          -x map-ont \
          -t NumberOfThreads \
          sample_assembly.fasta \
-         sample_HRrm.fasta.gz \
+         sample_decontam_GLlbsMetag.fastq.gz \
          > sample.sam  2> sample-mapping-info.txt
 ```
 
@@ -2695,14 +2844,14 @@ minimap2 -a \
 - `-x map-ont` - Specifies preset for mapping Nanopore reads to a reference.
 - `-t` - Number of parallel processing threads to use
 - `sample_assembly.fasta` – Assembly fasta file, provided as a positional argument.
-- `sample_HRrm.fasta.gz` - Input sequence data file, provided as a positional argument.
+- `sample_decontam_GLlbsMetag.fastq.gz` - Input sequence data file, provided as a positional argument.
 - `> sample.sam` - Redirects the output to a separate file.
 - `2> sample-mapping-info.txt` - Redirects the standar error to a separate file.
 
 **Input Data**
 
 - sample-assembly.fasta (contig-renamed assembly file, output from [Step 13a](#13a-rename-contig-headers))
-- sample_HRrm.fasta.gz (filtered and trimmed sample reads with both contaminants and human reads removed, gzipped fasta file, output from [Step 7b](#7b-remove-host-reads))
+- sample_decontam_GLlbsMetag.fastq.gz (filtered and trimmed sample reads with both contaminants and human reads removed, gzipped fastq file, output from [Step 7e](#7e-generate-decontaminated-read-files))
 
 **Output Data**
 
@@ -2919,10 +3068,10 @@ bit-GL-combine-KO-and-tax-tables *-gene-coverage-annotation-and-tax.tsv \
 
 **Output Data:**
 
-- **Combined-gene-level-KO-function-coverages-CPM_GLlbnMetag.tsv** (table with all samples combined based on KO annotations; normalized to coverage per million genes covered)
-- **Combined-gene-level-taxonomy-coverages-CPM_GLlbnMetag.tsv** (table with all samples combined based on gene-level taxonomic classifications; normalized to coverage per million genes covered)
-- **Combined-gene-level-KO-function-coverages_GLlbnMetag.tsv** (table with all samples combined based on KO annotations)
-- **Combined-gene-level-taxonomy-coverages_GLlbnMetag.tsv** (table with all samples combined based on gene-level taxonomic classifications)
+- **Combined-gene-level-KO-function-coverages-CPM_GLlbsMetag.tsv** (table with all samples combined based on KO annotations; normalized to coverage per million genes covered)
+- **Combined-gene-level-taxonomy-coverages-CPM_GLlbsMetag.tsv** (table with all samples combined based on gene-level taxonomic classifications; normalized to coverage per million genes covered)
+- **Combined-gene-level-KO-function-coverages_GLlbsMetag.tsv** (table with all samples combined based on KO annotations)
+- **Combined-gene-level-taxonomy-coverages_GLlbsMetag.tsv** (table with all samples combined based on gene-level taxonomic classifications)
 
 
 #### 21b. Gene-level taxonomy heatmaps --- START NEEDS REVIEW ---
@@ -2934,9 +3083,9 @@ library(pheatmap)
 # Abundant taxa with CPM > 1000
 abundance_threshold <- 1000
 
-sample_order <- get_sample_names("assembly-summaries_GLlbnMetag.tsv")
+sample_order <- get_sample_names("assembly-summaries_GLlbsMetag.tsv")
 # Read-in gene table
-gene_taxonomy_table <-  read_contig_table("Combined-gene-level-taxonomy-coverages-CPM_GLlbnMetag.tsv", sample_order)
+gene_taxonomy_table <-  read_contig_table("Combined-gene-level-taxonomy-coverages-CPM_GLlbsMetag.tsv", sample_order)
 
 # Summarize gene table
 species_gene_table <- gene_taxonomy_table %>%
@@ -2958,7 +3107,7 @@ gene.m <- gene.m[,-match("species", colnames(gene.m))] %>% as.matrix()
 # Drop unclassified assignments
 mat2plot <- gene.m[-match("Unclassified;_;_;_;_;_;_", rownames(gene.m)),]
 
-png(filename = "All-genes-taxonomy-heatmap_GLlbnMetag.png", 
+png(filename = "All-genes-taxonomy-heatmap_GLlbsMetag.png", 
     width = plot_width, height = plot_height, units = "in", res=300)
 pheatmap(mat = mat2plot,
          cluster_cols = FALSE, 
@@ -2981,7 +3130,7 @@ abund_gene.m <- gene.m[abund_taxa,]
 # Drop unclassified assignments
 mat2plot <- abund_gene.m[-match("Unclassified;_;_;_;_;_;_", rownames(abund_gene.m)),]
 
-png(filename = "Abundant-genes-taxonomy-heatmap_GLlbnMetag.png", 
+png(filename = "Abundant-genes-taxonomy-heatmap_GLlbsMetag.png", 
     width = plot_width, height = plot_height, units = "in", res=300)
 pheatmap(mat = mat2plot,
          cluster_cols = FALSE, 
@@ -2995,13 +3144,13 @@ dev.off()
 ```
 
 **Input data:**
-- assembly-summaries_GLlbnMetag.tsv (table of assembly summary statistics, output from [Step 13b](#13b-summarizing-assemblies))
-- Combined-gene-level-taxonomy-coverages-CPM_GLlbnMetag.tsv (table with all samples combined based on gene-level taxonomic classifications, output from [Step 21a](#21a-generating-gene-level-coverage-summary-tables)) 
+- assembly-summaries_GLlbsMetag.tsv (table of assembly summary statistics, output from [Step 13b](#13b-summarize-assemblies))
+- Combined-gene-level-taxonomy-coverages-CPM_GLlbsMetag.tsv (table with all samples combined based on gene-level taxonomic classifications, output from [Step 21a](#21a-generating-gene-level-coverage-summary-tables)) 
 
 **Output data:**
 - gene_taxonomy_table.csv (aggregated gene taxonomy table)
-- **All-genes-taxonomy-heatmap_GLlbnMetag.png** (heatmap of all genes taxonomy assignments)
-- **Abundant-genes-taxonomy-heatmap_GLlbnMetag.png** (heatmap of abundant genes taxonomy assignments)
+- **All-genes-taxonomy-heatmap_GLlbsMetag.png** (heatmap of all genes taxonomy assignments)
+- **Abundant-genes-taxonomy-heatmap_GLlbsMetag.png** (heatmap of abundant genes taxonomy assignments)
 
 #### 21c. Gene-level taxonomy decontamination
 
@@ -3065,7 +3214,7 @@ species_to_drop_index <- grep(x = rownames(feature_table),
                                     collapse = "|"))
 
 mat2plot <- feature_table[-species_to_drop_index,]
-png(filename = "decontaminated-gene-taxonomy-heatmap_GLlbnMetag.png", 
+png(filename = "decontaminated-gene-taxonomy-heatmap_GLlbsMetag.png", 
     width = plot_width, height = plot_height, units = "in", res=300)
 pheatmap(mat = mat2plot,
          cluster_cols = FALSE, 
@@ -3088,7 +3237,7 @@ dev.off()
 
 - **decontam-gene-taxonomy_results.csv** (decontam's results table)
 - **decontaminated-gene-taxonomy_table.csv** (decontaminated species table)
-- **decontaminated-gene-taxonomy-heatmap_GLlbnMetag.png** (heatmap after filtering out contaminants)
+- **decontaminated-gene-taxonomy-heatmap_GLlbsMetag.png** (heatmap after filtering out contaminants)
 
 
 
@@ -3101,9 +3250,9 @@ library(pheatmap)
 # Abundant functions with CPM > 2000
 abundance_threshold <- 2000
 
-sample_order <- get_sample_names("assembly-summaries_GLlbnMetag.tsv")
+sample_order <- get_sample_names("assembly-summaries_GLlbsMetag.tsv")
 # Read-in KO functions table
-functions_table <- read_input_table("Combined-gene-level-KO-function-coverages-CPM_GLlbnMetag.tsv") %>%
+functions_table <- read_input_table("Combined-gene-level-KO-function-coverages-CPM_GLlbsMetag.tsv") %>%
                     select(KO_ID, KO_function, !!sample_order)
 
 # Subset table and then convert from datafame to matrix
@@ -3121,7 +3270,7 @@ write_csv(x = table2write  , file = "genes-KO-functions_table.csv")
 # Drop unclassified assignments
 mat2plot <- functions.m[-match("Not annotated", rownames(functions.m),]
 
-png(filename = "All-genes-KO-functions-heatmap_GLlbnMetag.png", 
+png(filename = "All-genes-KO-functions-heatmap_GLlbsMetag.png", 
     width = plot_width, height = plot_height, units = "in", res=300)
 pheatmap(mat = mat2plot,
          cluster_cols = FALSE, 
@@ -3144,7 +3293,7 @@ abund_functions.m <- functions.m[abund_functions,]
 # Drop unannotated assignments
 mat2plot <- abund_functions.m[-match("Not annotated", rownames(abund_functions.m)),]
 
-png(filename = "Abundant-genes-KO-functions-heatmap_GLlbnMetag.png", 
+png(filename = "Abundant-genes-KO-functions-heatmap_GLlbsMetag.png", 
     width = plot_width, height = plot_height, units = "in", res=300)
 pheatmap(mat = mat2plot,
          cluster_cols = FALSE, 
@@ -3161,13 +3310,13 @@ dev.off()
 
 
 **Input data:**
-- assembly-summaries_GLlbnMetag.tsv (table of assembly summary statistics, output from [Step 13b](#13b-summarizing-assemblies))
-- Combined-gene-level-KO-function-coverages-CPM_GLlbnMetag.tsv (table with all samples combined based on KO annotations; normalized to coverage per million genes covered, output from [Step 21a](#21a-generating-gene-level-coverage-summary-tables))
+- assembly-summaries_GLlbsMetag.tsv (table of assembly summary statistics, output from [Step 13b](#13b-summarize-assemblies))
+- Combined-gene-level-KO-function-coverages-CPM_GLlbsMetag.tsv (table with all samples combined based on KO annotations; normalized to coverage per million genes covered, output from [Step 21a](#21a-generating-gene-level-coverage-summary-tables))
 
 **Output data:**
 - genes-KO-functions_table.csv (aggregated and subsetted gene KO functions table)
-- **All-genes-KO-functions-heatmap_GLlbnMetag.png** (heatmap of gene-wise KO function assignments)
-- **Abundant-genes-KO-functions-heatmap_GLlbnMetag.png** (heatmap of gene-wise abundant KO function assignments)
+- **All-genes-KO-functions-heatmap_GLlbsMetag.png** (heatmap of gene-wise KO function assignments)
+- **Abundant-genes-KO-functions-heatmap_GLlbsMetag.png** (heatmap of gene-wise abundant KO function assignments)
 
 #### 21e. Gene-level KO functions decontamination --- END NEEDS REVIEW ---
 
@@ -3230,7 +3379,7 @@ functions_to_drop_index <- grep(x = rownames(feature_table),
                                     collapse = "|"))
 
 mat2plot <- feature_table[-functions_to_drop_index,]
-png(filename = "decontaminated-gene-KO-functions-heatmap_GLlbnMetag.png", 
+png(filename = "decontaminated-gene-KO-functions-heatmap_GLlbsMetag.png", 
     width = plot_width, height = plot_height, units = "in", res=300)
 pheatmap(mat = mat2plot,
          cluster_cols = FALSE, 
@@ -3253,7 +3402,7 @@ dev.off()
 
 - **decontam-gene-KO-functions_results.csv** (decontam's results table)
 - **decontaminated-gene-KO-functions_table.csv** (decontaminated functions table)
-- **decontaminated-gene-KO-functions-heatmap_GLlbnMetag.png** (heatmap after filtering out contaminants)
+- **decontaminated-gene-KO-functions-heatmap_GLlbsMetag.png** (heatmap after filtering out contaminants)
 
 
 
@@ -3275,8 +3424,8 @@ bit-GL-combine-contig-tax-tables *-contig-coverage-and-tax.tsv -o Combined
 
 **Output Data:**
 
-- **Combined-contig-level-taxonomy-coverages-CPM_GLlbnMetag.tsv** (table with all samples combined based on contig-level taxonomic classifications; normalized to coverage per million contigs covered)
-- **Combined-contig-level-taxonomy-coverages_GLlbnMetag.tsv** (table with all samples combined based on contig-level taxonomic classifications)
+- **Combined-contig-level-taxonomy-coverages-CPM_GLlbsMetag.tsv** (table with all samples combined based on contig-level taxonomic classifications; normalized to coverage per million contigs covered)
+- **Combined-contig-level-taxonomy-coverages_GLlbsMetag.tsv** (table with all samples combined based on contig-level taxonomic classifications)
 
 <br>
 
@@ -3286,9 +3435,9 @@ bit-GL-combine-contig-tax-tables *-contig-coverage-and-tax.tsv -o Combined
 ```R
 plot_width <- 20
 plot_height <- 30
-sample_order <- get_sample_names("assembly-summaries_GLlbnMetag.tsv")
+sample_order <- get_sample_names("assembly-summaries_GLlbsMetag.tsv")
 
-contig_table <-  read_contig_table("Combined-contig-level-taxonomy-coverages-CPM_GLlbnMetag.tsv", sample_order)
+contig_table <-  read_contig_table("Combined-contig-level-taxonomy-coverages-CPM_GLlbsMetag.tsv", sample_order)
 species_contig_table <- contig_table %>% select(species, !!sample_order)
 
 contig.m <- species_contig_table %>%
@@ -3308,7 +3457,7 @@ contig.m <- contig.m[,-match("species", colnames(contig.m))] %>% as.matrix()
 # Drop unclassified assignments
 mat2plot <- contig.m[-match("Unclassified;_;_;_;_;_;_", rownames(contig.m)),]
 
-png(filename = "All-contig-taxonomy-heatmap_GLlbnMetag.png", 
+png(filename = "All-contig-taxonomy-heatmap_GLlbsMetag.png", 
     width = plot_width, height = plot_height, units = "in", res=300)
 pheatmap(mat = mat2plot,
          cluster_cols = FALSE, 
@@ -3329,7 +3478,7 @@ abund_contig.m <- contig.m[abund_taxa,]
 
 mat2plot <- abund_contig.m[-match("Unclassified;_;_;_;_;_;_", rownames(abund_contig.m)),]
 
-png(filename = "Abundant-contig-taxonomy-heatmap_GLlbnMetag.png", 
+png(filename = "Abundant-contig-taxonomy-heatmap_GLlbsMetag.png", 
     width = plot_width, height = plot_height, units = "in", res=300)
 pheatmap(mat = mat2plot,
          cluster_cols = FALSE, 
@@ -3348,14 +3497,14 @@ dev.off()
 
 **Input data:**
 
-- assembly-summaries_GLlbnMetag.tsv (table of assembly summary statistics, output from [Step 13b](#13b-summarizing-assemblies))
-- Combined-contig-level-taxonomy-coverages-CPM_GLlbnMetag.tsv (table with all samples combined based on contig-level taxonomic classifications; normalized to coverage per million genes covered from [Step 21f](#21f-generating-contig-level-coverage-summary-tables))
+- assembly-summaries_GLlbsMetag.tsv (table of assembly summary statistics, output from [Step 13b](#13b-summarize-assemblies))
+- Combined-contig-level-taxonomy-coverages-CPM_GLlbsMetag.tsv (table with all samples combined based on contig-level taxonomic classifications; normalized to coverage per million contigs covered, output from [Step 21f](#21f-generating-contig-level-coverage-summary-tables))
 
 **Output data:**
 
 - contig_taxonomy_table.csv (aggregated contig taxonomy)
-- **All-contig-taxonomy-heatmap_GLlbnMetag.png** (All contig level taxonomy heatmap)
-- **Abundant-contig-taxonomy-heatmap_GLlbnMetag.png** (Abundant contig level taxonomy heatmap)
+- **All-contig-taxonomy-heatmap_GLlbsMetag.png** (All contig level taxonomy heatmap)
+- **Abundant-contig-taxonomy-heatmap_GLlbsMetag.png** (Abundant contig level taxonomy heatmap)
 
 
 #### 21h. Contig-level decontamination --- END NEEDS REVIEW ---
@@ -3420,7 +3569,7 @@ species_to_drop_index <- grep(x = rownames(feature_table),
                                     collapse = "|"))
 
 mat2plot <- feature_table[-species_to_drop_index,]
-png(filename = "decontaminated-contig-taxonomy-heatmap_GLlbnMetag.png", 
+png(filename = "decontaminated-contig-taxonomy-heatmap_GLlbsMetag.png", 
     width = plot_width, height = plot_height, units = "in", res=300)
 pheatmap(mat = mat2plot,
          cluster_cols = FALSE, 
@@ -3443,7 +3592,7 @@ dev.off()
 
 - **decontam-contig-taxonomy_results.csv** (decontam's results table)
 - **decontaminated-contig-taxonomy_table.csv** (decontaminated species table)
-- **decontaminated-contig-taxonomy-heatmap_GLlbnMetag.png** (heatmap after filtering out contaminants)
+- **decontaminated-contig-taxonomy-heatmap_GLlbsMetag.png** (heatmap after filtering out contaminants)
 
 
 ---
@@ -3504,7 +3653,7 @@ zip -r sample-bins.zip sample-bins
 > Utilizes the default `checkm` database available [here](https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz), `checkm_data_2015_01_16.tar.gz`.
 
 ```bash
-checkm lineage_wf -f bins-overview_GLlbnMetag.tsv \
+checkm lineage_wf -f bins-overview_GLlbsMetag.tsv \
                   --tab_table \
                   -x fasta \
                   ./ \
@@ -3526,18 +3675,18 @@ checkm lineage_wf -f bins-overview_GLlbnMetag.tsv \
 
 **Output Data:**
 
-- **bins-overview_GLlbnMetag.tsv** (tab-delimited file with quality estimates per bin)
+- **bins-overview_GLlbsMetag.tsv** (tab-delimited file with quality estimates per bin)
 - checkm-output-dir/ (directory holding detailed checkm outputs)
 
 #### 22c. Filter MAGs
 
 ```bash
-cat <( head -n 1 bins-overview_GLlbnMetag.tsv ) \
-    <( awk -F $'\t' ' $12 >= 90 && $13 <= 10 && $14 == 0 ' bins-overview_GLlbnMetag.tsv | sed 's/bin./MAG-/' ) \
+cat <( head -n 1 bins-overview_GLlbsMetag.tsv ) \
+    <( awk -F $'\t' ' $12 >= 90 && $13 <= 10 && $14 == 0 ' bins-overview_GLlbsMetag.tsv | sed 's/bin./MAG-/' ) \
     > checkm-MAGs-overview.tsv
     
 # copying bins into a MAGs directory in order to run tax classification
-awk -F $'\t' ' $12 >= 90 && $13 <= 10 && $14 == 0 ' bins-overview_GLlbnMetag.tsv | cut -f 1 > MAG-bin-IDs.tmp
+awk -F $'\t' ' $12 >= 90 && $13 <= 10 && $14 == 0 ' bins-overview_GLlbsMetag.tsv | cut -f 1 > MAG-bin-IDs.tmp
 
 mkdir MAGs
 for ID in MAG-bin-IDs.tmp
@@ -3556,7 +3705,7 @@ done
 
 **Input Data:**
 
-- bins-overview_GLlbnMetag.tsv (tab-delimited file with quality estimates per bin from [Step 22b](#22b-bin-quality-assessment))
+- bins-overview_GLlbsMetag.tsv (tab-delimited file with quality estimates per bin from [Step 22b](#22b-bin-quality-assessment))
 
 **Output Data:**
 
@@ -3595,7 +3744,7 @@ gtdbtk classify_wf --genome_dir MAGs/ \
 
 ```bash
 # combine summaries
-for MAG in $(cut -f 1 assembly-summaries_GLlbnMetag.tsv | tail -n +2); do
+for MAG in $(cut -f 1 assembly-summaries_GLlbsMetag.tsv | tail -n +2); do
 
     grep -w -m 1 "^${MAG}" checkm-MAGs-overview.tsv | cut -f 12,13,14 \
         >> checkm-estimates.tmp
@@ -3615,7 +3764,7 @@ cat <(printf "est. completeness\test. redundancy\test. strain heterogeneity\n")
 cat <(printf "domain\tphylum\tclass\torder\tfamily\\tgenus\tspecies\n") gtdb-taxonomies.tmp \
     > gtdb-taxonomies-with-headers.tmp
 
-paste assembly-summaries_GLlbnMetag.tsv \
+paste assembly-summaries_GLlbsMetag.tsv \
 checkm-estimates-with-headers.tmp \
 gtdb-taxonomies-with-headers.tmp \
     > MAGs-overview.tmp
@@ -3626,19 +3775,19 @@ head -n 1 MAGs-overview.tmp > MAGs-overview-header.tmp
 tail -n +2 MAGs-overview.tmp | sort -t \$'\t' -k 14,20 > MAGs-overview-sorted.tmp
 
 cat MAGs-overview-header.tmp MAGs-overview-sorted.tmp \
-    > MAGs-overview_GLlbnMetag.tsv
+    > MAGs-overview_GLlbsMetag.tsv
 ```
 
 **Input Data:**
 
-- assembly-summaries_GLlbnMetag.tsv (table of assembly summary statistics, output from [Step 13b](#13b-summarize-assemblies))
+- assembly-summaries_GLlbsMetag.tsv (table of assembly summary statistics, output from [Step 13b](#13b-summarize-assemblies))
 - MAGs/\*.fasta (directory holding high-quality MAGs, output from [Step 22c](#23c-filter-mags))
 - checkm-MAGs-overview.tsv (tab-delimited file with quality estimates per MAG, output from [Step 22c](#22c-filter-mags))
 - gtdbtk-output-dir/gtdbtk.\*.summary.tsv (directory of files with assigned taxonomy and info, output from [Step 22d](#22d-mag-taxonomic-classification))
 
 **Output Data:**
 
-- **MAGs-overview_GLlbnMetag.tsv** (a tab-delimited overview of all recovered MAGs)
+- **MAGs-overview_GLlbsMetag.tsv** (a tab-delimited overview of all recovered MAGs)
 
 
 <br>
@@ -3662,7 +3811,7 @@ do
     python parse-MAG-annots.py -i annotations-and-taxonomy/${sample_ID}-gene-coverage-annotation-and-tax.tsv \
                                -w ${MAG_ID}-contigs.tmp \
                                -M ${MAG_ID} \
-                               -o MAG-level-KO-annotations_GLlbnMetag.tsv
+                               -o MAG-level-KO-annotations_GLlbsMetag.tsv
 
     rm ${MAG_ID}-contigs.tmp
 
@@ -3683,15 +3832,15 @@ done
 
 **Output Data:**
 
-- **MAG-level-KO-annotations_GLlbnMetag.tsv** (tab-delimited table holding MAGs and their KO annotations)
+- **MAG-level-KO-annotations_GLlbsMetag.tsv** (tab-delimited table holding MAGs and their KO annotations)
 
 
 #### 23b. Summarize KO Annotations With KEGG-Decoder
 
 ```bash
 KEGG-decoder -v interactive \
-             -i MAG-level-KO-annotations_GLlbnMetag.tsv \
-             -o MAG-KEGG-Decoder-out_GLlbnMetag.tsv
+             -i MAG-level-KO-annotations_GLlbsMetag.tsv \
+             -o MAG-KEGG-Decoder-out_GLlbsMetag.tsv
 ```
 
 **Parameter Definitions:**  
@@ -3702,13 +3851,13 @@ KEGG-decoder -v interactive \
 
 **Input Data:**
 
-- MAG-level-KO-annotations_GLlbnMetag.tsv (tab-delimited table holding MAGs and their KO annotations, output from [Step 23a](#23a-getting-ko-annotations-per-mag))
+- MAG-level-KO-annotations_GLlbsMetag.tsv (tab-delimited table holding MAGs and their KO annotations, output from [Step 23a](#23a-getting-ko-annotations-per-mag))
 
 **Output Data:**
 
-- **MAG-KEGG-Decoder-out_GLlbnMetag.tsv** (tab-delimited table holding MAGs and their proportions of genes held known to be required for specific pathways/metabolisms)
+- **MAG-KEGG-Decoder-out_GLlbsMetag.tsv** (tab-delimited table holding MAGs and their proportions of genes held known to be required for specific pathways/metabolisms)
 
-- **MAG-KEGG-Decoder-out_GLlbnMetag.html** (interactive heatmap html file of the above output table)
+- **MAG-KEGG-Decoder-out_GLlbsMetag.html** (interactive heatmap html file of the above output table)
 
 <br>
 
diff --git a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md
new file mode 100644
index 000000000..ff4f3ef98
--- /dev/null
+++ b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md
@@ -0,0 +1,4212 @@
+# Bioinformatics pipeline for Low biomass long-read metagenomics data
+
+> **This document holds an overview and some example commands of how GeneLab processes low-biomass, long-read metagenomics datasets. Exact processing commands for specific datasets that have been released are provided with their processed data in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).**  
+
+---
+
+**Date:** November MM, 2025  
+**Revision:** -  
+**Document Number:** GL-DPPD-7116  
+
+**Submitted by:**  
+Olabiyi A. Obayomi (GeneLab Analysis Team)  
+
+**Approved by:**  
+Samrawit Gebre (OSDR Project Manager)  
+Jonathan Galazka (OSDR Project Scientist)  
+Amanda Saravia-Butler (GeneLab Science Lead)  
+Barbara Novak (GeneLab Data Processing Lead)  
+
+
+---
+
+# Table of contents
+
+- [**Software used**](#software-used)
+- [**General processing overview with example commands**](#general-processing-overview-with-example-commands)
+  - [**Pre-processing**](#pre-processing)
+    - [1. Basecalling](#1-basecalling)
+    - [2. Demultiplexing](#2-demultiplexing)
+      - [2a. Split fastq](#2a-split-fastq)
+      - [2b. Concatenate files for each sample](#2b-concatenate-files-for-each-sample)
+    - [3. Raw Data QC](#3-raw-data-qc)
+      - [3a. Raw Data QC](#3a-raw-data-qc)
+      - [3b. Compile Raw Data QC](#3b-compile-raw-data-qc)
+    - [4. Quality filtering](#4-quality-filtering)
+      - [4a. Filter Raw Data](#4a-filter-raw-data)
+      - [4b. Filtered Data QC](#4b-filtered-data-qc)
+      - [4c. Compile Filtered Data QC](#4c-compile-filtered-data-qc)
+    - [5. Trimming](#5-trimming)
+      - [5a. Trim Filtered Data](#5a-trim-filtered-data)
+      - [5b. Trimmed Data QC](#5b-trimmed-data-qc)
+      - [5c. Compile Trimmed Data QC](#5c-compile-trimmed-data-qc)
+    - [6. Human Read Removal](#6-human-read-removal)
+      - [6a. Build Kraken2 Database](#6a-build-kraken2-database)
+      - [6b. Remove Human Reads](#6b-remove-human-reads)
+      - [6c. Compile Human Read Removal QC](#6c-compile-human-read-removal-qc)
+    - [7. Contaminant Removal](#7-contaminant-removal)
+      - [7a. Assemble Contaminants](#7a-assemble-contaminants)
+      - [7b. Build Contaminant Index and Map Reads](#7b-build-contaminant-index-and-map-reads)
+      - [7c. Sort and Index Contaminant Alignments](#7c-sort-and-index-contaminant-alignments)
+      - [7d. Gather Contaminant Mapping Metrics](#7d-gather-contaminant-mapping-metrics)
+      - [7e. Generate Decontaminated Read Files](#7e-generate-decontaminated-read-files)
+      - [7f. Contaminant Removal QC](#7f-contaminant-removal-qc)
+      - [7g. Compile Contaminant Removal QC](#7g-compile-contaminant-removal-qc)
+    - [8. R Environment Setup](#8-r-environment-setup)
+      - [8a. Load Libraries](#8a-load-libraries)
+      - [8b. Define Custom Functions](#8b-define-custom-functions)
+      - [8c. Set Global Variables](#8c-set-global-variables)
+  - [**Read-based processing**](#read-based-processing)
+    - [9. Taxonomic Profiling Using Kaiju](#9-taxonomic-profiling-using-kaiju)
+      - [9a. Build Kaiju Database](#9a-build-kaiju-database)
+      - [9b. Kaiju Taxonomic Classification](#9b-kaiju-taxonomic-classification)
+      - [9c. Compile Kaiju Taxonomy Results](#9c-compile-kaiju-taxonomy-results)
+      - [9d. Convert Kaiju Output To Krona Format](#9d-convert-kaiju-output-to-krona-format)
+      - [9e. Compile Kaiju Krona Reports](#9e-compile-kaiju-krona-reports)
+      - [9f. Create Kaiju Species Count Table](#9f-create-kaiju-species-count-table)
+      - [9g. Read-in Tables](#9g-read-in-tables)
+      - [9h. Taxonomy Barplots](#9h-taxonomy-barplots)
+      - [9i. Feature Decontamination](#9i-feature-decontamination)
+    - [10. Taxonomic Profiling Using Kraken2](#10-taxonomic-profiling-using-kraken2)
+      - [10a. Download Kraken2 Database](#10a-download-kraken2-database)
+      - [10b. Kraken2 Taxonomic Classification](#10b-kraken2-taxonomic-classification)
+      - [10c. Compile Kraken2 Taxonomy Results](#10c-compile-kraken2-taxonomy-results)
+        - [10ci. Create Merged Kraken2 Taxonomy Table](#10ci-create-merged-kraken2-taxonomy-table)
+        - [10cii. Compile Kraken2 Taxonomy Reports](#10cii-compile-kraken2-taxonomy-reports)
+      - [10d. Convert Kraken2 Output to Krona Format](#10d-convert-kraken2-output-to-krona-format)
+      - [10e. Compile Kraken2 Krona Reports](#10e-compile-kraken2-krona-reports)
+      - [10f. Create Kraken2 Species Count Table](#10f-create-kraken2-species-count-table)
+      - [10g. Read-in Tables](#10g-read-in-tables)
+      - [10h. Taxonomy Barplots](#10h-taxonomy-barplots)
+      - [10i. Feature Decontamination](#10i-feature-decontamination)
+  - [**Assembly-based processing**](#assembly-based-processing)
+    - [11. Sample Assembly](#11-sample-assembly)
+    - [12. Polish Assembly](#12-polish-assembly)
+    - [13. Rename Contigs and Summarize Assemblies](#13-rename-contigs-and-summarize-assemblies)
+      - [13a. Rename Contig Headers](#13a-rename-contig-headers)
+      - [13b. Summarize Assemblies](#13b-summarize-assemblies)
+    - [14. Gene Prediction](#14-gene-prediction)
+      - [14a. Generate Gene Predictions](#14a-generate-gene-predictions)
+      - [14b. Remove Line Wraps In Gene Prediction Output](#14b-remove-line-wraps-in-gene-prediction-output)
+    - [15. Functional Annotation](#15-functional-annotation)
+      - [15a. Download Reference Database of HMM Models](#15a-download-reference-database-of-hmm-models)
+      - [15b. Run KEGG Annotation](#15b-run-kegg-annotation)
+      - [15c. Filter KO Outputs](#15c-filter-ko-outputs)
+    - [16. Taxonomic Classification](#16-taxonomic-classification)
+      - [16a. Pull and Unpack Pre-built Reference DB](#16a-pull-and-unpack-pre-built-reference-db)
+      - [16b. Run Taxonomic Classification](#16b-run-taxonomic-classification)
+      - [16c. Add Taxonomy Info From Taxids To Genes](#16c-add-taxonomy-info-from-taxids-to-genes)
+      - [16d. Add Taxonomy Info From Taxids To Contigs](#16d-add-taxonomy-info-from-taxids-to-contigs)
+      - [16e. Format Gene-level Output With awk and sed](#16e-format-gene-level-output-with-awk-and-sed)
+      - [16f. Format Contig-level Output With awk and sed](#16f-format-contig-level-output-with-awk-and-sed)
+    - [17. Read-Mapping](#17-read-mapping)
+      - [17a. Align Reads to Sample Assembly](#17a-align-reads-to-sample-assembly)
+      - [17b. Sort and Index Assembly Alignments](#17b-sort-and-index-assembly-alignments)
+    - [18. Get Coverage Information and Filter Based On Detection](#18-get-coverage-information-and-filter-based-on-detection)
+      - [18a. Filter Coverage Levels Based On Detection](#18a-filter-coverage-levels-based-on-detection)
+      - [18b. Filter Gene and Contig Coverage Based On Detection](#18b-filter-gene-and-contig-coverage-based-on-detection)
+    - [19. Combine Gene-level Coverage, Taxonomy, and Functional Annotations For Each Sample](#19-combine-gene-level-coverage-taxonomy-and-functional-annotations-for-each-sample)
+    - [20. Combine Contig-level Coverage and Taxonomy For Each Sample](#20-combine-contig-level-coverage-and-taxonomy-for-each-sample)
+    - [21. Generate Normalized, Gene- and Contig-level Coverage Summary Tables of KO-annotations and Taxonomy Across Samples](#21-generate-normalized-gene--and-contig-level-coverage-summary-tables-of-ko-annotations-and-taxonomy-across-samples)
+      - [21a. Generate Gene-level Coverage Summary Tables](#21a-generate-gene-level-coverage-summary-tables)
+      - [21b. Generate Contig-level Coverage Summary Tables](#21b-generate-contig-level-coverage-summary-tables)
+    - [22. **M**etagenome-**A**ssembled **G**enome (MAG) recovery](#22-metagenome-assembled-genome-mag-recovery)
+      - [22a. Bin Contigs](#22a-bin-contigs)
+      - [22b. Bin Quality Assessment](#22b-bin-quality-assessment)
+      - [22c. Filter MAGs](#22c-filter-mags)
+      - [22d. MAG Taxonomic Classification](#22d-mag-taxonomic-classification)
+      - [22e. Generate Overview Table Of All MAGs](#22e-generate-overview-table-of-all-mags)
+    - [23. Generate MAG-level Functional Summary Overview](#23-generate-mag-level-functional-summary-overview)
+      - [23a. Get KO Annotations Per MAG](#23a-get-ko-annotations-per-mag)
+      - [23b. Summarize KO Annotations With KEGG-Decoder](#23b-summarize-ko-annotations-with-kegg-decoder)
+    - [24. Decontamination and Visualization of Contig- and Gene-taxonomy and Gene-function Outputs](#24-decontamination-and-visualization-of-contig--and-gene-taxonomy-and-gene-function-outputs)
+      - [24a. Gene-level taxonomy heatmaps](#24a-gene-level-taxonomy-heatmaps)
+      - [24b. Gene-level taxonomy decontamination](#24b-gene-level-taxonomy-decontamination)
+      - [24c. Gene-level KO functions heatmaps](#24c-gene-level-ko-functions-heatmaps)
+      - [24d. Gene-level KO functions decontamination](#24d-gene-level-ko-functions-decontamination)
+      - [24e. Contig-level heatmaps](#24e-contig-level-heatmaps)
+      - [24f. Contig-level decontamination](#24f-contig-level-decontamination)
+
+
+---
+
+# Software used
+
+|Program|Version|Relevant Links|
+|:------|:-----:|------:|
+|bbduk| 38.86 |[https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/)|
+|bit| 1.8.53 |[https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit](https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit)|
+|CAT| 5.2.3 |[https://github.com/dutilh/CAT#cat-and-bat](https://github.com/dutilh/CAT#cat-and-bat)|
+|CheckM| 1.1.3 |[https://github.com/Ecogenomics/CheckM](https://github.com/Ecogenomics/CheckM)|
+|Dorado| 1.1.1| [https://github.com/nanoporetech/dorado](https://github.com/nanoporetech/dorado)|
+|filtlong| 0.2.1 |[https://github.com/rrwick/Filtlong](https://github.com/rrwick/Filtlong)|
+|Flye| 2.9.5 | [https://github.com/mikolmogorov/Flye](https://github.com/mikolmogorov/Flye) |
+|GTDB-Tk| 2.4.0 |[https://github.com/Ecogenomics/GTDBTk](https://github.com/Ecogenomics/GTDBTk)|
+|HUMAnN| 3.9 |[https://github.com/biobakery/humann](https://github.com/biobakery/humann)|
+|Kaiju| 1.10.1 | [https://bioinformatics-centre.github.io/kaiju/](https://bioinformatics-centre.github.io/kaiju/) |
+|KEGG-Decoder| 1.2.2 |[https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder](https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder)|
+|KOFamScan| 1.3.0 |[https://github.com/takaram/kofam_scan](https://github.com/takaram/kofam_scan)|
+|Kraken2| 2.1.6 | [https://github.com/DerrickWood/kraken2](https://github.com/DerrickWood/kraken2) |
+|KrakenTools | 1.2 | [https://ccb.jhu.edu/software/krakentools/](https://ccb.jhu.edu/software/krakentools/) |
+|Krona| 2.8.1 | [https://github.com/marbl/Krona/wiki](https://github.com/marbl/Krona/wiki)|
+|MetaBAT| 2.15 |[https://bitbucket.org/berkeleylab/metabat/src/master/](https://bitbucket.org/berkeleylab/metabat/src/master/)|
+|Minimap2| 2.28 | [https://github.com/lh3/minimap2](https://github.com/lh3/minimap2) |
+|MultiQC| 1.27.1 |[https://multiqc.info/](https://multiqc.info/)|
+|Medaka| 2.1.1 | [https://github.com/nanoporetech/medaka](https://github.com/nanoporetech/medaka) |
+|MetaPhlAn| 4.1.0 |[https://github.com/biobakery/MetaPhlAn](https://github.com/biobakery/MetaPhlAn)|
+|NanoPlot| 1.44.1 | [https://github.com/wdecoster/NanoPlot](https://github.com/wdecoster/NanoPlot)|
+|Porechop| 0.2.4 | [https://github.com/rrwick/Porechop](https://github.com/rrwick/Porechop) |
+|Prodigal| 2.6.3 |[https://github.com/hyattpd/Prodigal#prodigal](https://github.com/hyattpd/Prodigal#prodigal)|
+|samtools| 1.22.1 |[https://github.com/samtools/samtools#samtools](https://github.com/samtools/samtools#samtools)|
+| R | 4.5.1 | [https://www.r-project.org](https://www.r-project.org) |
+|Bioconductor | 3.21 | [https://www.bioconductor.org](https://www.bioconductor.org) |
+|decontam| 1.28.0 | [https://www.bioconductor.org/packages/release/bioc/html/decontam.html](https://www.bioconductor.org/packages/release/bioc/html/decontam.html) |
+|optparse| 1.7.5 |[https://cran.r-project.org/web/packages/optparse/index.html](https://cran.r-project.org/web/packages/optparse/index.html) |
+|pavian| 1.2.1 | [https://github.com/fbreitwieser/pavian](https://github.com/fbreitwieser/pavian) |
+|pheatmap| 1.0.13 | [https://cran.r-project.org/package=pheatmap](https://cran.r-project.org/package=pheatmap) |
+|phyloseq| 1.52.0 | [https://www.bioconductor.org/packages/release/bioc/html/phyloseq.html](https://www.bioconductor.org/packages/release/bioc/html/phyloseq.html) |
+|tidyverse| 2.0.0 | [https://www.tidyverse.org](https://www.tidyverse.org) |
+
+---
+
+# General processing overview with example commands
+
+> Exact processing commands and output files listed in **bold** below are included with each Low Biomass Metagenomics Seq processed dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).  
+
+## Pre-processing
+
+### 1. Basecalling
+
+```bash
+model="hac" # high accuracy model
+input_directory=/path/to/pod5/or/fast5/data
+kit_name=SQK-RPB004
+
+dorado basecaller ${model} ${input_directory} \
+  --no-trim \
+  --device auto \
+  --recursive \
+  --kit-name ${kit_name} \
+  --min-qscore 8 > basecalled.bam
+```
+
+**Parameter Definitions:**
+
+- `model` - Positional argument specifying the basecalling model to use or a path to the model directory. `hac` chooses the high accuracy model.
+- `input_directory` - Positional argument specifying the location of the raw data in POD5 or FAST5 format.
+- `--no-trim` - Skips trimming of barcodes, adapters, and primers.
+- `--device` - Specifies CPU or GPU device; specifying 'auto' chooses either 'cpu' or 'gpu' depending on detected presence of a GPU device.
+- `--recursive` - Enables recursive scanning through input directory to load FAST5 and/or POD5 files.
+- `--kit-name` - The nanopore barcoding kit used during sequencing preparation. Enables barcoding with the provided kit name; see [dorado documentation](https://software-docs.nanoporetech.com/dorado/1.1.1/barcoding/barcoding/) for a full list of accepted kit names.
+- `--min-qscore` - Specifies the minimum Q-score; reads with a mean Q-score below this threshold are discarded (default: `8`).
+
+**Input Data:**
+
+- *pod5 and/or *fast5 (raw nanopore data)
+
+**Output Data:**
+
+- basecalled.bam (basecalled data in bam format)
+
+<br>
+
+---
+
+### 2. Demultiplexing
+
+#### 2a. Split Fastq
+
+```bash
+dorado demux \
+  --output-dir /path/to/fastq/output \
+  --emit-fastq \
+  --emit-summary \
+  --kit-name ${kit_name} \
+  basecalled.bam
+```
+
+**Parameter Definitions:**
+
+- `--output-dir` - Specifies the output folder that is the root of the nested output structure. 
+- `--emit-fastq` - Specifies that output is fastq format.
+- `--emit-summary` - Creates a summary listing each read and its classified barcode.
+- `--kit-name` - The nanopore barcoding kit used during sequencing preparation. Enables barcoding with the provided kit name; see [dorado documentation](https://software-docs.nanoporetech.com/dorado/1.1.1/barcoding/barcoding/) for a full list of accepted kit names.
+- `basecalled.bam` - Positional argument specifying the input bam file.
+
+**Input Data:**
+
+- basecalled.bam (basecalled nanopore data in bam format, output from [Step 1](#1-basecalling))
+
+**Output Data:**
+
+- /path/to/fastq/output/\*_barcode\*.fastq (demultiplexed reads in fastq format)
+- /path/to/fastq/output/\*_unclassified.fastq (unclassified reads in fastq format)
+- /path/to/fastq/output/barcoding_summary.txt (barcode summary file listing each read, the file it was assigned to, and its classified barcode)
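+
+As a quick sanity check on the demultiplexing result, the summary can be tallied per barcode. The sketch below is not part of the GeneLab pipeline. It assumes the summary is tab-delimited with a header row that includes a column named `barcode`; `count_by_column` is a hypothetical helper name.
+
+```bash
+# Hypothetical helper: count rows per value of a named column in a
+# tab-delimited table that has a header row.
+count_by_column() {
+  awk -F'\t' -v col="$2" '
+    NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) idx = i; next }
+    idx     { counts[$idx]++ }
+    END     { for (k in counts) printf "%s\t%d\n", k, counts[k] }
+  ' "$1" | sort
+}
+
+# Assumed column name; adjust it if the summary header differs.
+summary=/path/to/fastq/output/barcoding_summary.txt
+if [ -f "${summary}" ]; then
+  count_by_column "${summary}" barcode
+fi
+```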
+
+
+#### 2b. Concatenate Files For Each Sample
+
+```bash
+# Change to directory containing split fastq files generated from step 2a. 
+cd /path/to/fastq/output/ # output of step 2a
+
+# Get unique barcode names from demultiplexed file names
+BARCODES=($(ls -1 *fastq* | sed -E 's/.+_(barcode[0-9]+)_.+/\1/g' | sort -u))
+
+# Concat separate barcode/sample fastq files into per sample fastq gzipped files
+[ -d raw_data/ ] || mkdir raw_data/
+for sample in ${BARCODES[*]}; do
+
+  [ -d  ${sample}/ ] ||  mkdir ${sample}/  
+  mv *_${sample}_*  ${sample}/ 
+
+  cat ${sample}/* | gzip > raw_data/${sample}.fastq.gz
+
+done
+```
+
+**Parameter Definitions:**
+
+- `cat ${sample}/*` - Concatenates all fastq files with the same barcode into one fastq file.
+- `| gzip` - Sends the concatenated fastq file output from the `cat` command to the `gzip` command to create a compressed fastq.gz file for each barcode.
+
+**Input Data:**
+
+- /path/to/fastq/output/ (directory containing split fastq files from [Step 2a](#2a-split-fastq))
+
+**Output Data:**
+
+-  raw_data/sample.fastq.gz (gzipped per sample/barcode fastq files)
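+
+Before moving on to QC, it can help to confirm that each concatenated file contains reads. The following is a minimal sketch, not part of the GeneLab pipeline, that assumes standard four-line fastq records; `count_fastq_reads` is a hypothetical helper name.
+
+```bash
+# Hypothetical helper: count reads in a gzipped fastq file
+# (assumes standard four-line records).
+count_fastq_reads() {
+  gzip -cd "$1" | awk 'END { print NR / 4 }'
+}
+
+# Print a tab-delimited table of sample name and read count.
+for fq in raw_data/*.fastq.gz; do
+  [ -e "${fq}" ] || continue   # skip when the glob matches nothing
+  printf '%s\t%s\n' "$(basename "${fq}" .fastq.gz)" "$(count_fastq_reads "${fq}")"
+done
+```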
+
+<br>
+
+---
+
+### 3. Raw Data QC
+
+#### 3a. Raw Data QC
+
+```bash 
+NanoPlot --only-report \
+         --prefix sample_raw_ \
+         --outdir /path/to/raw_nanoplot_output \
+         --threads NumberOfThreads \
+         --fastq \
+         /path/to/raw_data/sample.fastq.gz
+```
+
+**Parameter Definitions:**
+
+- `--only-report` - Output only the report files.
+- `--prefix` - Adds a sample specific prefix to the name of each output file.
+- `--outdir` – Specifies the output directory to store results.
+- `--threads` - Number of parallel processing threads to use.
+- `--fastq` - Specifies that the input data is in fastq format.
+- `/path/to/raw_data/sample.fastq.gz` – The input reads, specified as a positional argument.
+
+**Input Data:**
+
+- /path/to/raw_data/sample.fastq.gz (raw reads, output from [Step 2b](#2b-concatenate-files-for-each-sample))
+
+**Output Data:**
+
+- **/path/to/raw_nanoplot_output/sample_raw_NanoPlot-report.html** (NanoPlot html summary)
+- /path/to/raw_nanoplot_output/sample_raw_NanoPlot_\<date\>_\<time\>.log (NanoPlot log file)
+- /path/to/raw_nanoplot_output/sample_raw_NanoStats.txt (text file containing basic statistics)
+
+#### 3b. Compile Raw Data QC
+
+```bash 
+multiqc --zip-data-dir \
+        --outdir raw_multiqc_report \
+        --filename raw_multiqc_GLlblMetag \
+        --interactive \
+        /path/to/raw_nanoplot_output/
+```
+
+**Parameter Definitions:**
+
+- `--zip-data-dir` - Compress the data directory.
+- `--outdir` – Specifies the output directory to store results.
+- `--filename` – Specifies the filename prefix of results.
+- `--interactive` - Force multiqc to always create interactive javascript plots.
+- `/path/to/raw_nanoplot_output/` – The directory holding the output data from the NanoPlot run, provided as a positional argument.
+
+**Input Data:**
+
+- /path/to/raw_nanoplot_output/*raw_NanoStats.txt (NanoPlot output data, from [Step 3a](#3a-raw-data-qc))
+
+**Output Data:**
+
+- **raw_multiqc_GLlblMetag.html** (multiqc output html summary)
+- **raw_multiqc_GLlblMetag_data.zip** (zip archive containing multiqc output data)
+
+<br>  
+
+---
+
+### 4. Quality Filtering
+
+#### 4a. Filter Raw Data
+
+```bash
+filtlong --min_length 200 --min_mean_q 8 /path/to/raw_data/sample.fastq.gz > sample_filtered.fastq
+```
+
+**Parameter Definitions:**
+
+- `--min_length` – Specifies the minimum read length to retain (default: `200`).
+- `--min_mean_q` – Specifies the minimum mean read quality to retain (default: `8`).
+- `/path/to/raw_data/sample.fastq.gz` - The path to the input fastq file, provided as a positional argument.
+- `> sample_filtered.fastq` - Redirects the output to a sample_filtered.fastq file.
+
+**Input Data:**
+
+- /path/to/raw_data/sample.fastq.gz (raw reads, output from [Step 2b](#2b-concatenate-files-for-each-sample))
+
+**Output Data:**
+
+- sample_filtered.fastq (quality filtered reads)
+
+
+#### 4b. Filtered Data QC
+
+```bash
+NanoPlot --only-report \
+         --prefix sample_filtered_ \
+         --outdir /path/to/filtered_nanoplot_output \
+         --threads NumberOfThreads \
+         --fastq \
+         sample_filtered.fastq
+```
+
+**Parameter Definitions:**
+
+- `--only-report` - Output only the report files.
+- `--prefix` - Adds a sample specific prefix to the name of each output file.
+- `--outdir` – Specifies the output directory to store results.
+- `--threads` - Number of parallel processing threads to use.
+- `--fastq` - Specifies that the input data is in fastq format.
+- `sample_filtered.fastq` – The input reads, specified as a positional argument.
+
+**Input Data:**
+
+- sample_filtered.fastq (filtered reads, output from [Step 4a](#4a-filter-raw-data))
+
+**Output Data:**
+
+- **/path/to/filtered_nanoplot_output/sample_filtered_NanoPlot-report.html** (NanoPlot html summary)
+- /path/to/filtered_nanoplot_output/sample_filtered_NanoPlot_\<date\>_\<time\>.log (NanoPlot log file)
+- /path/to/filtered_nanoplot_output/sample_filtered_NanoStats.txt (text file containing basic statistics)
+
+#### 4c. Compile Filtered Data QC
+
+```bash
+multiqc  --zip-data-dir \
+         --outdir filtered_multiqc_report \
+         --filename filtered_multiqc_GLlblMetag \
+         --interactive \
+         /path/to/filtered_nanoplot_output/
+```
+
+**Parameter Definitions:**
+
+- `--zip-data-dir` - Compress the data directory.
+- `--outdir` – Specifies the output directory to store results.
+- `--filename` – Specifies the filename prefix of results.
+- `--interactive` - Force multiqc to always create interactive javascript plots.
+- `/path/to/filtered_nanoplot_output/` – The directory holding the output data from the NanoPlot run, provided as a positional argument.
+
+**Input Data:**
+
+- /path/to/filtered_nanoplot_output/*filtered_NanoStats.txt (NanoPlot output data, from [Step 4b](#4b-filtered-data-qc))
+
+**Output Data:**
+
+- **filtered_multiqc_report/filtered_multiqc_GLlblMetag.html** (multiqc output html summary)
+- **filtered_multiqc_report/filtered_multiqc_GLlblMetag_data.zip** (zip archive containing multiqc output data)
+
+<br>
+
+---
+
+### 5. Trimming
+
+#### 5a. Trim Filtered Data
+
+```bash
+porechop --input sample_filtered.fastq \
+         --threads NumberOfThreads \
+         --discard_middle \
+         --output sample_trimmed.fastq.gz  > sample_porechop.log
+```
+
+**Parameter Definitions:**
+
+- `--input` – Specifies the input sequence file in fastq format.
+- `--threads` - Number of parallel processing threads to use.
+- `--discard_middle` -  Reads with middle adapters will be discarded.
+- `--output` - Specifies the trimmed reads output fastq filename.
+- `> sample_porechop.log` - Redirects the standard output to a log file.
+
+**Input Data:**
+
+- sample_filtered.fastq (filtered reads output from [Step 4a](#4a-filter-raw-data))
+
+**Output Data:**
+
+- sample_trimmed.fastq.gz (filtered and trimmed reads)
+- sample_porechop.log (porechop standard output containing trimming info)
+
+#### 5b. Trimmed Data QC
+
+```bash
+NanoPlot --only-report \
+         --prefix sample_trimmed_ \
+         --outdir /path/to/trimmed_nanoplot_output \
+         --threads NumberOfThreads \
+         --fastq \
+         sample_trimmed.fastq.gz
+```
+
+**Parameter Definitions:**
+
+- `--only-report` - Output only the report files.
+- `--prefix` - Adds a sample specific prefix to the name of each output file.
+- `--outdir` – Specifies the output directory to store results.
+- `--threads` - Number of parallel processing threads to use.
+- `--fastq` - Specifies that the input data is in fastq format.
+- `sample_trimmed.fastq.gz` – The input reads, specified as a positional argument.
+
+**Input Data:**
+
+- sample_trimmed.fastq.gz (filtered and trimmed reads, output from [Step 5a](#5a-trim-filtered-data))
+
+**Output Data:**
+
+- **/path/to/trimmed_nanoplot_output/sample_trimmed_NanoPlot-report.html** (NanoPlot html summary)
+- /path/to/trimmed_nanoplot_output/sample_trimmed_NanoPlot_\<date\>_\<time\>.log (NanoPlot log file)
+- /path/to/trimmed_nanoplot_output/sample_trimmed_NanoStats.txt (text file containing basic statistics)
+
+#### 5c. Compile Trimmed Data QC
+
+```bash
+multiqc --zip-data-dir \
+        --outdir trimmed_multiqc_report \
+        --filename trimmed_multiqc_GLlblMetag \
+        --interactive \
+        /path/to/trimmed_nanoplot_output/
+```
+
+**Parameter Definitions:**
+
+- `--zip-data-dir` - Compress the data directory.
+- `--outdir` – Specifies the output directory to store results.
+- `--filename` – Specifies the filename prefix of results.
+- `--interactive` - Force multiqc to always create interactive javascript plots.
+- `/path/to/trimmed_nanoplot_output/` – The directory holding the output data from the NanoPlot run, provided as a positional argument.
+
+**Input Data:**
+
+- /path/to/trimmed_nanoplot_output/*trimmed_NanoStats.txt (NanoPlot output data, output from [Step 5b](#5b-trimmed-data-qc))
+
+**Output Data:**
+
+- **trimmed_multiqc_GLlblMetag.html** (multiqc output html summary)
+- **trimmed_multiqc_GLlblMetag_data.zip** (zip archive containing multiqc output data)
+
+<br>
+
+---
+
+### 6. Human Read Removal
+
+#### 6a. Build Kraken2 Database
+
+> **Note:** It is recommended to use NCBI genome files with kraken2 because sequences not downloaded from 
+NCBI may require explicit assignment of taxonomy information before they can be used to build the 
+database, as mentioned in the [Kraken2 Documentation](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown).
+
+```bash
+# Download NCBI taxonomic information 
+kraken2-build --download-taxonomy --db kraken2_human_db/
+
+# Add genomic sequences to your database's genomic library
+kraken2-build --add-to-library human.fasta --db kraken2_human_db/ --no-masking
+
+# Build the database
+kraken2-build --build --db kraken2_human_db/ \
+              --kmer-len 35 --minimizer-len 31
+
+# Clean up intermediate files
+kraken2-build --clean --db kraken2_human_db/
+```
+
+**Parameter Definitions:**
+
+- `--download-taxonomy` - Instructs kraken2-build to download the NCBI taxonomic information.
+- `--db` - Specifies the name of the directory for the kraken2 database.
+- `--add-to-library` - Instructs kraken2-build to add the contents of a file (`human.fasta`) to the kraken2 DB library.
+- `--no-masking` - Disables masking of low-complexity sequences. For additional information see the [kraken documentation for masking](https://github.com/DerrickWood/kraken2/wiki/Manual#masking-of-low-complexity-sequences).
+- `--build` - Instructs kraken2-build to build the kraken2 DB from the library files.
+- `--kmer-len` - Specifies the k-mer length in bp (applies to the build task).
+- `--minimizer-len` - Specifies the minimizer length in bp (applies to the build task).
+- `--clean` - Instructs kraken2-build to remove unneeded intermediate files.
+
+**Input Data:**
+
+- `human.fasta` (fasta file containing human genome)
+
+**Output Data:**
+
+- kraken2_human_db/ (kraken2 human database directory, containing hash.k2d, opts.k2d, and taxo.k2d files)
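+
+Because `--clean` removes intermediate files, a simple check that the three database files remain can catch an incomplete build early. This is a minimal sketch rather than a pipeline step; `check_kraken2_db` is a hypothetical helper name.
+
+```bash
+# Hypothetical helper: confirm the three kraken2 database files
+# exist and are non-empty.
+check_kraken2_db() {
+  db_dir="$1"
+  for f in hash.k2d opts.k2d taxo.k2d; do
+    if [ ! -s "${db_dir}/${f}" ]; then
+      echo "Missing or empty: ${db_dir}/${f}" >&2
+      return 1
+    fi
+  done
+  echo "OK: ${db_dir}"
+}
+
+check_kraken2_db kraken2_human_db || echo "Database is not ready for read classification" >&2
+```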
+
+#### 6b. Remove Human Reads
+
+```bash
+kraken2 --db kraken2_human_db \
+        --gzip-compressed \
+        --threads NumberOfThreads \
+        --use-names \
+        --output sample-kraken2-output.txt \
+        --report sample-kraken2-report.tsv \
+        --unclassified-out sample_GLlblMetag_HRrm.fastq \
+        sample_trimmed.fastq.gz
+
+# gzip fastq output file
+gzip sample_GLlblMetag_HRrm.fastq
+```
+
+**Parameter Definitions:**
+
+- `--db` - Specifies the directory holding the kraken2 database.
+- `--gzip-compressed` - Specifies that the input fastq files are gzip-compressed.
+- `--threads` - Specifies the number of parallel processing threads to use.
+- `--use-names` - Specifies adding taxa names in addition to taxon IDs.
+- `--output` - Specifies the name of the kraken2 read-based output file (one line per read).
+- `--report` - Specifies the name of the kraken2 report output file (one line per taxa, with number of reads assigned to it).
+- `--unclassified-out` - Specifies the name of the output file containing reads that were not classified, i.e., non-human reads.
+- `sample_trimmed.fastq.gz` - Positional argument specifying the input read file.
+
+**Input Data:**
+
+- kraken2_human_db/ (kraken2 human database directory, output from [Step 6a](#6a-build-kraken2-database))
+- sample_trimmed.fastq.gz (filtered and trimmed sample reads, output from [Step 5a](#5a-trim-filtered-data))
+
+**Output Data:**
+
+- sample-kraken2-output.txt (kraken2 read-based output file (one line per read))
+- sample-kraken2-report.tsv (kraken2 report output file (one line per taxa, with number of reads assigned to it))
+- **sample_GLlblMetag_HRrm.fastq.gz** (filtered and trimmed sample reads with human reads removed, gzipped fastq file)
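+
+The share of reads retained as non-human can be read directly from the kraken2 report. The sketch below is not part of the GeneLab pipeline. It assumes the standard six-column, tab-delimited report layout, in which the first column is the percentage of reads and the fourth is the rank code (`U` for unclassified); `percent_unclassified` is a hypothetical helper name.
+
+```bash
+# Hypothetical helper: print the percentage of unclassified (retained)
+# reads from a standard six-column kraken2 report.
+percent_unclassified() {
+  awk -F'\t' '
+    $4 == "U" { gsub(/ /, "", $1); print $1; found = 1 }
+    END       { if (!found) print "0.00" }   # no "U" row: every read was classified
+  ' "$1"
+}
+
+if [ -f sample-kraken2-report.tsv ]; then
+  percent_unclassified sample-kraken2-report.tsv
+fi
+```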
+
+
+#### 6c. Compile Human Read Removal QC
+
+```bash
+multiqc --zip-data-dir \
+        --outdir HRrm_multiqc_report \
+        --filename HRrm_multiqc_GLlblMetag \
+        --interactive \
+        /path/to/*kraken2-report.tsv
+```
+
+**Parameter Definitions:**
+
+- `--zip-data-dir` - Compress the data directory.
+- `--outdir` – Specifies the output directory to store results.
+- `--filename` – Specifies the filename prefix of results.
+- `--interactive` - Force multiqc to always create interactive javascript plots.
+- `/path/to/*kraken2-report.tsv` – The kraken2 output report files, provided as a positional argument.
+
+**Input Data:**
+
+- /path/to/*kraken2-report.tsv (kraken2 report files, output from [Step 6b](#6b-remove-human-reads))
+
+**Output Data:**
+
+- **HRrm_multiqc_GLlblMetag.html** (multiqc output html summary)
+- **HRrm_multiqc_GLlblMetag_data.zip** (zip archive containing multiqc output data)
+
+<br>
+
+
+---
+
+### 7. Contaminant Removal
+
+> A major issue with low-biomass data is the high potential for contamination, because so little DNA is extracted from the samples. Because negative control/blank samples should in theory contain no biological material, any sequence detected in them is a potential contaminant. To filter out contaminants found in negative control samples, which may result from cross-contamination in the lab, we use a read-mapping approach. First, reads from the negative/blank control samples are assembled; then the reads from each low-biomass sample are mapped to the resulting contigs. Reads that map to these contigs are classified as contaminants and excluded from downstream analyses.
+
+#### 7a. Assemble Contaminants
+
+```bash
+flye --meta \
+     --threads NumberOfThreads \
+     --out-dir /path/to/contaminant_assembly \
+     --nano-raw /path/to/blank_samples/*_GLlblMetag_HRrm.fastq.gz
+
+# Rename output
+mv /path/to/contaminant_assembly/assembly.fasta /path/to/contaminant_assembly/blank-assembly.fasta
+mv /path/to/contaminant_assembly/flye.log blank-flye.log
+```
+
+**Parameter Definitions:**
+
+- `--meta` – Use metagenome/uneven coverage mode.
+- `--threads` - Number of parallel processing threads to use.
+- `--out-dir` - Specifies the output directory.
+- `--nano-raw` - Specifies that input is from Oxford Nanopore regular raw reads. This adds a polishing step for error correction after the assembly is generated.
+
+**Input Data:**
+
+- *_GLlblMetag_HRrm.fastq.gz (trimmed reads with human reads removed, from one or more blank (negative control) samples, output from [Step 6b](#6b-remove-human-reads))
+
+**Output Data:**
+
+- /path/to/contaminant_assembly/blank-assembly.fasta (assembly built from reads in blank samples in fasta format)
+- blank-flye.log (flye log file)
+
+<br>
+
+#### 7b. Build Contaminant Index and Map Reads
+
+```bash
+# Build contaminant index
+minimap2 -t NumberOfThreads \
+         -a \
+         -x splice \
+         -d blanks.mmi \
+         /path/to/contaminant_assembly/blank-assembly.fasta
+
+# Map reads to index
+minimap2 -t NumberOfThreads \
+         -a \
+         -x splice \
+         blanks.mmi \
+         sample_GLlblMetag_HRrm.fastq.gz  > sample.sam 2> sample-mapping-info.txt
+```
+
+**Parameter Definitions:**
+
+- `-t` - Number of parallel processing threads.
+- `-a` – Output in SAM format.
+- `-x splice` - Specifies preset for spliced alignment of long reads.
+- `-d` - Specifies the output file for the index (specific to the build contaminant index command).
+- `/path/to/contaminant_assembly/blank-assembly.fasta` - Specifies the input file in fasta format, provided as a positional argument (specific to the build contaminant index command).
+- `blanks.mmi` - Specifies the index file in mmi format, provided as a positional argument (specific to the map reads command).
+- `sample_GLlblMetag_HRrm.fastq.gz` - Specifies the input file in fastq format, provided as a positional argument (specific to the map reads command).
+- `> sample.sam` - Redirects the output of the map reads command to a separate SAM file (specific to the map reads command).
+- `2> sample-mapping-info.txt` - Redirects the standard error of the map reads command to a log file (specific to the map reads command).
+
+**Input Data:**
+
+- /path/to/contaminant_assembly/blank-assembly.fasta (contaminant assembly, output from [Step 7a](#7a-assemble-contaminants))
+- sample_GLlblMetag_HRrm.fastq.gz (filtered and trimmed reads with human reads removed, output from [Step 6b](#6b-remove-human-reads))
+
+**Output Data:**
+
+- blanks.mmi (contaminant index in MMI format)
+- sample.sam (reads aligned to contaminant assembly in SAM format)
+- sample-mapping-info.txt (minimap2 mapping log file)
+
+#### 7c. Sort and Index Contaminant Alignments
+
+```bash
+# Sort SAM, convert to BAM, and create index
+samtools sort --threads NumberOfThreads \
+              --output sample_sorted.bam \
+              sample.sam
+
+samtools index sample_sorted.bam sample_sorted.bam.bai
+```
+
+**Parameter Definitions:**
+
+**samtools sort**
+- `--threads` - Number of parallel processing threads to use.
+- `--output` - Specifies the output file for the aligned and sorted reads.
+- `sample.sam` - Specifies the input SAM file, provided as a positional argument.
+
+**samtools index**
+- `sample_sorted.bam` - The input BAM file, provided as a positional argument.
+- `sample_sorted.bam.bai` - The output index file, provided as a positional argument.
+
+**Input Data:**
+
+- sample.sam (reads aligned to contaminant assembly, output from [Step 7b](#7b-build-contaminant-index-and-map-reads))
+
+**Output Data:**
+
+- sample_sorted.bam (sorted mapping to contaminant assembly file)
+- sample_sorted.bam.bai (index of sorted mapping to contaminant assembly file)
+
+#### 7d. Gather Contaminant Mapping Metrics
+
+```bash
+samtools flagstat sample_sorted.bam > sample_flagstats.txt 2> sample_flagstats.log
+samtools stats --remove-dups sample_sorted.bam > sample_stats.txt 2> sample_stats.log
+samtools idxstats sample_sorted.bam > sample_idxstats.txt 2> sample_idxstats.log
+```
+
+**Parameter Definitions:**
+
+- `flagstat` - Positional argument specifying the program for counting the number of alignments for each SAM FLAG type.
+- `stats` - Positional argument specifying the program for producing comprehensive statistics from the alignment file.
+- `idxstats` - Positional argument specifying the program for producing contig alignment summary statistics.
+- `--remove-dups` - Excludes reads marked as duplicates from the comprehensive statistics.
+- `sample_sorted.bam` - Positional argument specifying the input BAM file.
+- `> sample_flagstats.txt` - Redirects the flagstat standard output to a text file.
+- `2> sample_flagstats.log` - Redirects the flagstat standard error to a log file.
+- `> sample_stats.txt` - Redirects the stats standard output to a text file.
+- `2> sample_stats.log` - Redirects the stats standard error to a log file.
+- `> sample_idxstats.txt` - Redirects the idxstats standard output to a text file.
+- `2> sample_idxstats.log` - Redirects the idxstats standard error to a log file.
+
+**Input Data:**
+
+- sample_sorted.bam (sorted mapping to contaminant assembly file, output from [Step 7c](#7c-sort-and-index-contaminant-alignments))
+- sample_sorted.bam.bai (index of sorted mapping to contaminant assembly file, output from [Step 7c](#7c-sort-and-index-contaminant-alignments))
+
+**Output Data:**
+
+- sample_flagstats.txt (SAM FLAG counts)
+- sample_flagstats.log (log file containing the flagstat standard error)
+- sample_stats.txt (comprehensive alignment statistics)
+- sample_stats.log (log file containing the stats standard error)
+- sample_idxstats.txt (contig alignment summary statistics)
+- sample_idxstats.log (log file containing the idxstats standard error)
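+
+The idxstats table can be condensed into a single line showing how much of a sample mapped to the contaminant assembly. The sketch below is not part of the GeneLab pipeline. It relies on the tab-delimited idxstats layout (reference name, length, mapped read-segments, unmapped read-segments); `summarize_idxstats` is a hypothetical helper name. Note that idxstats counts alignment records, so secondary and supplementary alignments can inflate the mapped total relative to the number of reads.
+
+```bash
+# Hypothetical helper: total mapped and unmapped read-segments
+# across all rows of a samtools idxstats table.
+summarize_idxstats() {
+  awk -F'\t' '
+    { mapped += $3; unmapped += $4 }
+    END {
+      total = mapped + unmapped
+      pct = (total > 0) ? 100 * mapped / total : 0
+      printf "mapped=%d unmapped=%d pct_mapped=%.2f\n", mapped, unmapped, pct
+    }
+  ' "$1"
+}
+
+if [ -f sample_idxstats.txt ]; then
+  summarize_idxstats sample_idxstats.txt
+fi
+```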
+
+#### 7e. Generate Decontaminated Read Files
+
+```bash
+# Retain reads that do not map to contaminants
+samtools fastq -t -f 4 -o sample_GLlblMetag_decontam.fastq.gz -0 sample_GLlblMetag_decontam.fastq.gz sample_sorted.bam 
+```
+
+**Parameter Definitions:**
+
+- `fastq` - Positional argument specifying the program for generating fastq files from a SAM/BAM file.
+- `-t` - Copy RG, BC, and QT tags to the FASTQ header line.
+- `-f 4` - Only retain unmapped reads that have been marked with the SAM "segment unmapped" FLAG (4).
+- `-o sample_GLlblMetag_decontam.fastq.gz` - Send reads flagged as either read1 or read2 to the named file (.gz ending ensures compressed output)
+- `-0 sample_GLlblMetag_decontam.fastq.gz` - Send reads flagged as both read1 and read2 or neither to the same named file
+- `sample_sorted.bam` - Positional argument specifying the input BAM file.
+
+**Input Data:**
+
+- sample_sorted.bam (sorted mapping to contaminant assembly file, output from [Step 7c](#7c-sort-and-index-contaminant-alignments))
+
+**Output Data:**
+
+- **sample_GLlblMetag_decontam.fastq.gz** (filtered and trimmed sample reads with contaminants removed in fastq format)
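+
+The `-f 4` filter works on the SAM FLAG bit field: a record is kept only when bit 0x4 ("segment unmapped") is set. The short sketch below illustrates that bit test on a few example FLAG values; `is_unmapped` is a hypothetical helper name and is not part of samtools.
+
+```bash
+# Hypothetical helper: succeed (exit 0) when the "segment unmapped"
+# bit (0x4) is set in a SAM FLAG value.
+is_unmapped() {
+  [ $(( $1 & 4 )) -ne 0 ]
+}
+
+# 0 = mapped forward, 4 = unmapped, 16 = mapped reverse, 2048 = supplementary
+for flag in 0 4 16 2048; do
+  if is_unmapped "${flag}"; then
+    echo "FLAG ${flag}: unmapped, kept by -f 4"
+  else
+    echo "FLAG ${flag}: mapped, dropped by -f 4"
+  fi
+done
+```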
+
+#### 7f. Contaminant Removal QC
+
+```bash
+NanoPlot --only-report \
+         --prefix sample_decontam_ \
+         --outdir /path/to/decontam_nanoplot_output \
+         --threads NumberOfThreads \
+         --fastq \
+         sample_GLlblMetag_decontam.fastq.gz
+```
+
+**Parameter Definitions:**
+
+- `--only-report` - Output only the report files.
+- `--prefix` - Adds a sample specific prefix to the name of each output file.
+- `--outdir` – Specifies the output directory to store results.
+- `--threads` - Number of parallel processing threads to use.
+- `--fastq` - Specifies that the input data is in fastq format.
+- `sample_GLlblMetag_decontam.fastq.gz` – The input reads, specified as a positional argument.
+
+**Input Data:**
+
+- sample_GLlblMetag_decontam.fastq.gz (filtered and trimmed sample reads with all contaminants removed, output from [Step 7e](#7e-generate-decontaminated-read-files))
+
+**Output Data:**
+
+- **/path/to/decontam_nanoplot_output/sample_decontam_NanoPlot-report_GLlblMetag.html** (NanoPlot html summary)
+- /path/to/decontam_nanoplot_output/sample_decontam_NanoPlot_\<date\>_\<time\>.log (NanoPlot log file)
+- /path/to/decontam_nanoplot_output/sample_decontam_NanoStats.txt (text file containing basic statistics)
+
+
+#### 7g. Compile Contaminant Removal QC
+
+```bash
+multiqc --zip-data-dir \
+        --outdir decontam_multiqc_report \
+        --filename decontam_multiqc_GLlblMetag \
+        --interactive \
+        /path/to/decontam_nanoplot_output/
+```
+
+**Parameter Definitions:**
+
+- `--zip-data-dir` - Compress the data directory.
+- `--outdir` – Specifies the output directory to store results.
+- `--filename` – Specifies the filename prefix of results.
+- `--interactive` - Force multiqc to always create interactive javascript plots.
+- `/path/to/decontam_nanoplot_output/` – The directory holding the output data from the NanoPlot run, provided as a positional argument.
+
+**Input Data:**
+
+- /path/to/decontam_nanoplot_output/*decontam_NanoStats.txt (NanoPlot output data, output from [Step 7f](#7f-contaminant-removal-qc))
+
+**Output Data:**
+
+- **decontam_multiqc_GLlblMetag.html** (multiqc output html summary)
+- **decontam_multiqc_GLlblMetag_data.zip** (zip archive containing multiqc output data)
+
+<br>
+
+---
+
+### 8. Host Read Removal
+
+If the samples were derived from a host organism other than human, potential host reads
+should be identified and removed. This step is optional.
+
+#### 8a. Build Kraken2 Database
+
+> **Note:** It is recommended to use NCBI genome files with kraken2 because sequences not downloaded from 
+NCBI may require explicit assignment of taxonomy information before they can be used to build the 
+database, as mentioned in the [Kraken2 Documentation](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown).
+
+```bash
+# Download NCBI taxonomic information 
+kraken2-build --download-taxonomy --db kraken2_${hostname}_db/
+
+# Add genomic sequences to your database's genomic library
+kraken2-build --add-to-library ${hostname}.fasta --db kraken2_${hostname}_db/ --no-masking
+
+# Build the database
+kraken2-build --build --db kraken2_${hostname}_db/ \
+              --kmer-len 35 --minimizer-len 31
+
+# Clean up intermediate files
+kraken2-build --clean --db kraken2_${hostname}_db/
+```
+
+**Parameter Definitions:**
+
+- `--download-taxonomy` - Instructs kraken2-build to download the NCBI taxonomic information.
+- `--db` - Specifies the name of the directory for the kraken2 database.
+- `--add-to-library` - Instructs kraken2-build to add the contents of a file (`${hostname}.fasta`) to the kraken2 DB library.
+- `--no-masking` - Disables masking of low-complexity sequences. For additional information see the [kraken documentation for masking](https://github.com/DerrickWood/kraken2/wiki/Manual#masking-of-low-complexity-sequences).
+- `--build` - Instructs kraken2-build to build the kraken2 DB from the library files.
+- `--kmer-len` - Specifies the k-mer length in bp (applies to the build task).
+- `--minimizer-len` - Specifies the minimizer length in bp (applies to the build task).
+- `--clean` - Instructs kraken2-build to remove unneeded intermediate files.
+- `${hostname}` - Specifies the name of the host organism, used to uniquely identify the kraken2 database.
+
+**Input Data:**
+
+- `${hostname}.fasta` (fasta file containing host genome)
+
+**Output Data:**
+
+- kraken2_${hostname}_db/ (kraken2 host database directory, containing hash.k2d, opts.k2d, and taxo.k2d files)
+
+
+#### 8b. Remove Host Reads
+
+```bash
+kraken2 --db kraken2_host_db \
+        --gzip-compressed \
+        --threads NumberOfThreads \
+        --use-names \
+        --output sample-kraken2-output.txt \
+        --report sample-kraken2-report.tsv \
+        --unclassified-out sample_GLlblMetag_HostRm.fastq \
+        sample_GLlblMetag_decontam.fastq.gz
+
+# gzip fastq output file
+gzip sample_GLlblMetag_HostRm.fastq
+```
+
+**Parameter Definitions:**
+
+- `--db` - Specifies the directory holding the kraken2 database.
+- `--gzip-compressed` - Specifies that the input fastq files are gzip-compressed.
+- `--threads` - Number of parallel processing threads to use.
+- `--use-names` - Specifies adding taxa names in addition to taxon IDs.
+- `--output` - Specifies the name of the kraken2 read-based output file (one line per read).
+- `--report` - Specifies the name of the kraken2 report output file (one line per taxon, with the number of reads assigned to it).
+- `--unclassified-out` - Specifies the name of the output file containing reads that were not classified, i.e., non-host reads.
+- `sample_trimmed.fastq.gz` - Positional argument specifying the input read file.
+
+**Input Data:**
+
+- kraken2_host_db/ (kraken2 host database directory, output from [Step 8a](#8a-build-kraken2-database))
+- sample_*decontam.fastq.gz (filtered and trimmed sample reads with contaminants removed, output from [Step 5a](#5a-trim-filtered-data))
+
+**Output Data:**
+
+- sample-kraken2-output.txt (kraken2 read-based output file (one line per read))
+- sample-kraken2-report.tsv (kraken2 report output file (one line per taxon, with the number of reads assigned to it))
+- **sample_GLlblMetag_HostRm.fastq.gz** (filtered and trimmed sample reads with both contaminants and host reads removed, gzipped fastq file)
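+
+Because the database contains only the host genome, the fraction of reads removed can be estimated from the kraken2 report. The sketch below assumes the standard six-column, tab-separated report layout (percent, clade reads, direct reads, rank code, taxid, name), where taxid 0 is the "unclassified" row and taxid 1 is "root". The mock report and the `host_pct` variable are illustrative assumptions, not pipeline outputs.
+
+```bash
+# Mock kraken2 report (tab-separated): percent, clade reads, direct reads, rank, taxid, name
+report=$(mktemp)
+printf '25.00\t250\t250\tU\t0\tunclassified\n'       > "${report}"
+printf '75.00\t750\t0\tR\t1\troot\n'                >> "${report}"
+printf '75.00\t750\t750\tS\t9606\t  Homo sapiens\n' >> "${report}"
+
+# Reads under "root" (taxid 1) were classified against the host database;
+# reads in the "unclassified" row (taxid 0) are the ones retained.
+host_pct=$(awk -F'\t' '
+  $5 == 0 { unclassified = $2 }
+  $5 == 1 { classified = $2 }
+  END {
+    total = unclassified + classified
+    if (total > 0) printf "%.2f", 100 * classified / total
+    else printf "NA"
+  }' "${report}")
+
+echo "Host reads removed: ${host_pct}%"
+```
+
+Either row may be absent from a real report when no reads fall into that category; the awk script treats a missing row as zero reads.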
+
+
+#### 8c. Compile Host Read Removal QC
+
+```bash
+multiqc --zip-data-dir \
+        --outdir HostRm_multiqc_report \
+        --filename HostRm_multiqc_GLlblMetag \
+        --interactive \
+        /path/to/*kraken2-report.tsv
+```
+
+**Parameter Definitions:**
+
+- `--zip-data-dir` - Compress the data directory.
+- `--outdir` - Specifies the output directory to store results.
+- `--filename` - Specifies the filename prefix of results.
+- `--interactive` - Force multiqc to always create interactive javascript plots.
+- `/path/to/*kraken2-report.tsv` - The kraken2 output report files, provided as a positional argument.
+
+**Input Data:**
+
+- /path/to/*kraken2-report.tsv (kraken2 report files, output from [Step 8b](#8b-remove-host-reads))
+
+**Output Data:**
+
+- **HostRm_multiqc_GLlblMetag.html** (multiqc output html summary)
+- **HostRm_multiqc_GLlblMetag_data.zip** (zip archive containing multiqc output data)
+
+<br>
+
+---
+
+### 8. R Environment Setup
+
+> Taxonomy bar plots, heatmaps and feature decontamination with decontam are performed in R.
+
+#### 8a. Load libraries
+
+```R
+library(decontam)
+library(phyloseq)
+library(tidyverse)
+library(pheatmap)
+library(pavian)
+library(glue)
+```
+
+#### 8b. Define Custom Functions
+
+##### get_last_assignment()
+<details>
+  <summary>retrieves the last taxonomy assignment from a taxonomy string</summary>
+
+  ```R
+  get_last_assignment <- function(taxonomy_string, split_by = ';', remove_prefix = NULL) {
+
+    # Split taxonomy string by the supplied delimiter 'split_by'
+    # then convert the list of parts to a vector of parts
+    split_names <- strsplit(x =  taxonomy_string , split = split_by) %>%
+      unlist()
+    # Get the last part of the split string
+    level_name <- split_names[[length(split_names)]]
+    
+    if(level_name == "_"){
+      return(taxonomy_string)
+    }
+    # remove an unwanted prefix if specified
+    if(!is.null(remove_prefix)){
+      level_name <- gsub(pattern = remove_prefix, replacement = "", x = level_name)
+    }
+    
+    return(level_name)
+  }
+  ```
+
+  **Function Parameter Definitions:**
+  - `taxonomy_string` - a character string containing a list of taxonomy assignments separated by `split_by`
+  - `split_by=` - a character string containing a regular expression used to split the `taxonomy_string`, default=`';'`
+  - `remove_prefix=` - a character string containing a regular expression to be matched and removed from the last assignment, default=`NULL`
+
+  **Returns:** the last taxonomy assignment listed in the `taxonomy_string`, or the full `taxonomy_string` if the last assignment is `"_"`
+</details>
+
+##### mutate_taxonomy()
+<details>
+  <summary>mutate taxonomy column to contain the lowest taxonomy assignment</summary>
+
+  ```R
+  mutate_taxonomy <- function(df, taxonomy_column="taxonomy") {
+
+    # make sure that the taxonomy column is always named taxonomy
+    col_index <- which(colnames(df) == taxonomy_column)
+    colnames(df)[col_index] <- "taxonomy"
+    df <- df %>% dplyr::mutate(across(where(is.numeric), function(x) tidyr::replace_na(x, 0))) %>%
+      dplyr::mutate(taxonomy=map_chr(taxonomy, .f = function(taxon_name) {
+        last_assignment <- get_last_assignment(taxon_name) 
+        last_assignment  <- gsub(pattern = "\\[|\\]|'", replacement = "", x = last_assignment)
+        trimws(last_assignment, which = "both")
+      })) %>% 
+      as.data.frame(check.names = FALSE, stringsAsFactors = FALSE)
+    # Ensure the taxonomy names are unique by aggregating duplicates
+    df <- aggregate(.~taxonomy, data = df, FUN = sum)
+    return(df)
+  }
+  ```
+  **Custom Functions Used:**
+  - [get_last_assignment()](#get_last_assignment)
+
+  **Function Parameter Definitions:**
+  - `df` - a dataframe containing the taxonomy assignments
+  - `taxonomy_column=` - name of the column in `df` containing the taxonomy assignments, default="taxonomy"
+
+  **Returns:** a dataframe with unique last taxonomy names stored in a column named "taxonomy"
+
+</details>
+
+##### process_kaiju_table()
+<details>
+  <summary>reformat kaiju output table</summary>
+
+  ```R
+  process_kaiju_table <- function(file_path, taxon_col = "taxon_name") {
+  
+    # read input table
+    kaiju_table <-  read_delim(file = file_path,
+                               delim = "\t",
+                               col_names = TRUE)
+
+    # Create  a sample colname if the file column wasn't pre-edited
+    if(colnames(kaiju_table)[1] ==  "file" ){
+      kaiju_table <-  kaiju_table %>% rename(sample=file)
+    }
+
+    # filter out all kaiju database entries
+    kaiju_table <- kaiju_table %>% 
+      filter(!str_detect(sample, "dmp")) %>%
+      mutate(sample=str_replace_all(sample, ".+/(.+)_kaiju.out", "\\1"))
+ 
+    # keep only sample, reads, and taxonomy column (as defined by taxon_col argument) 
+    # convert long dataframe to wide dataframe
+    # mutate the taxonomy column such that it contains only lowest taxonomy assignment
+    abs_abun_df <- kaiju_table %>%
+      select(sample, reads, taxonomy=!!sym(taxon_col)) %>%
+      pivot_wider(names_from = "sample", values_from = "reads", names_sort = TRUE) %>%
+      mutate_taxonomy 
+  
+    # Set the taxon names as row names, drop the taxonomy column and convert to a matrix
+    rownames(abs_abun_df) <- abs_abun_df[,"taxonomy"]
+    abs_abun_df <- abs_abun_df[,-(which(colnames(abs_abun_df) == "taxonomy"))]
+    abs_abun_matrix <- as.matrix(abs_abun_df)
+    
+    return(abs_abun_matrix)
+  }
+  ```
+  **Custom Functions Used:**
+  - [mutate_taxonomy()](#mutate_taxonomy)
+
+  **Function Parameter Definitions:**
+  - `file_path` - file path to the tab-delimited kaiju output table file
+  - `taxon_col=` - name of the taxon column in the input data file, default="taxon_name"
+
+  **Returns:** a read count matrix built from the reformatted kaiju output, with taxa as rows and samples as columns
+
+</details>
+
+
+##### merge_kraken_reports()
+<details>
+  <summary>merge and process multiple kraken outputs to one species table</summary>
+
+  ```R
+  library(pavian)
+
+  merge_kraken_reports <- function(reports_dir) {
+
+    reports <- read_reports(reports_dir)
+
+    # Retrieve sample names from file names
+    samples <- names(reports) %>% str_split("-") %>% map_chr(function(x) pluck(x, 1))
+    merged_reports  <- merge_reports2(reports, col_names = samples)
+    taxonReads <- merged_reports$taxonReads
+    cladeReads <- merged_reports$cladeReads
+    tax_data <- merged_reports[["tax_data"]]
+
+    species_table <- tax_data %>%
+      bind_cols(cladeReads) %>%
+      filter(taxRank %in% c("U", "S")) %>% # select unclassified and species rows 
+      select(-contains("tax")) %>%
+      zero_if_na() %>%
+      filter(name != 0) %>% # drop unknown taxonomies
+      group_by(name) %>%
+      summarise(across(everything(), sum)) %>%
+      ungroup() %>%
+      as.data.frame() %>%
+      rename(species = name)
+
+    # Set rownames as species name, drop species column
+    # and convert table from dataframe to matrix
+    species_names <- species_table[, "species"]
+    rownames(species_table) <- species_names
+    species_table <- species_table[,-(which(colnames(species_table) == "species"))]
+    species_table <- as.matrix(species_table)
+    
+    return(species_table)
+  }
+  ```
+  **Custom Functions Used:**
+  - [read_reports()]()
+
+
+  **Function Parameter Definitions:**
+  - `reports_dir` - path to a directory containing kraken2 reports 
+
+  **Returns:** a kraken species count matrix with samples and species as columns and rows, respectively.
+
+</details>
+
+##### get_abundant_features()
+<details>
+  <summary>Find abundant features based on the sum of feature values</summary>
+  
+  ```R
+  get_abundant_features <- function(mat, cpm_threshold = 1000){
+  
+    features <- rowSums(mat) %>% sort()
+    
+    abund_features <- features[features > cpm_threshold] %>% names
+    
+    # drop = FALSE keeps the result a matrix even if only one feature passes
+    abund_features.m <- mat[abund_features, , drop = FALSE]
+    
+    return(abund_features.m)
+  }
+  ```
+  **Function Parameter Definitions:**
+  - `mat` - a feature count matrix with features as rows and samples as columns
+  - `cpm_threshold = 1000` - a feature is kept only if the sum of its values across all samples is strictly greater than this threshold
+
+  **Returns:** a matrix containing only the abundant features, with features and samples as rows and columns, respectively.
+</details>
+
+##### count_to_rel_abundance()
+<details>
+  <summary>Convert species count matrix to relative abundance matrix</summary>
+
+  ```R
+  count_to_rel_abundance <- function(species_table) {
+
+    # calculate species relative abundance per sample and
+    # drop columns where none of the reads were classified or were non-microbial (NA)
+    abund_table <- species_table %>%
+      as.data.frame %>%
+      mutate(across(everything(), function(x) (x/sum(x, na.rm = TRUE))*100)) %>%
+        select(
+          where( ~all(!is.na(.)))
+        ) %>%
+      rownames_to_column("Species")
+
+    # Set rownames as species name and drop species column  
+    rownames(abund_table) <- abund_table$Species
+    abund_table <- abund_table[, -match(x = "Species", colnames(abund_table))] %>% t
+
+    return(abund_table)
+  }
+
+  ```
+
+  **Function Parameter Definitions:**
+  - `species_table` - a species count matrix with samples and species as columns and rows, respectively.
+
+  **Returns:** a species relative abundance matrix (values in percent) with samples and species as rows and columns, respectively. Samples whose relative abundances could not be computed (NA) are dropped.
+
+</details>
+
+
+##### filter_rare()
+<details>
+  <summary>filter out rare and non_microbial taxonomy assignments based on relative abundance</summary>
+
+  ```R
+  filter_rare <- function(species_table, non_microbial, threshold=1) {
+    
+    # Drop species listed in 'non_microbial' regex
+    clean_tab_count  <-  species_table %>% 
+                         as.data.frame %>% 
+                         rownames_to_column("Species") %>% 
+                         filter(str_detect(Species, non_microbial, negate = TRUE))
+    # Calculate species relative abundance
+    clean_tab <- clean_tab_count %>%
+      mutate( across( where(is.numeric), function(x) (x/sum(x, na.rm = TRUE))*100 ) )
+    # Set rownames as species name and drop species column
+    rownames(clean_tab) <- clean_tab$Species
+    clean_tab  <- clean_tab[, -1]
+    
+    # Get species with relative abundance less than `threshold` in all samples
+    rare_species <- map(clean_tab, .f = function(col) rownames(clean_tab)[col < threshold])
+    rare <- Reduce(intersect, rare_species)
+    
+    # Set rownames as species name and drop species column  
+    rownames(clean_tab_count) <- clean_tab_count$Species
+    clean_tab_count  <- clean_tab_count[,-1] 
+    # Drop rare species
+    abund_table <- clean_tab_count[!(rownames(clean_tab_count) %in% rare), ]
+    
+    return(abund_table)
+  }
+  ```
+
+  **Function Parameter Definitions:**
+  - `species_table` - the species matrix to filter with species and samples as rows and columns, respectively.
+  - `non_microbial` - a regular expression denoting the names used to identify a species as non-microbial or unwanted
+  - `threshold=` - relative abundance cutoff, given as a percentage between 0 and 100; a species is treated as rare, and removed, only if its relative abundance is below this value in every sample, default=1
+
+  **Returns:** a dataframe with rare and non_microbial/unwanted species removed
+</details>
+
+##### group_low_abund_taxa()
+<details>
+  <summary>Group rare taxa or return a table with only rare taxa</summary>
+
+  ```R
+  group_low_abund_taxa <- function(abund_table, threshold = 0.05,
+                                   rare_taxa = FALSE) {
+    # If `rare_taxa` is TRUE, a table containing only the rare taxa is returned.
+    # Initialize an empty vector that will hold the column indices of the
+    # low-abundance taxa to group
+    taxa_to_group <- c()
+    # Initialize the position at which the next low-abundance column index is stored
+    index <- 1
+    
+    # Loop over every column (taxon) and check whether its max abundance is less than the threshold;
+    # if so, save the column index in the taxa_to_group vector
+    for (column in seq_len(ncol(abund_table))) {
+      if(max(abund_table[,column], na.rm = TRUE) < threshold) {
+        taxa_to_group[index] <- column
+        index <- index + 1
+      }
+    }
+    
+    if(is.null(taxa_to_group)) {
+      message(glue::glue("Rare taxa were not grouped. please provide a higher 
+                        threshold than {threshold} for grouping rare taxa, 
+                        only numbers are allowed."))
+      return(abund_table)
+    }
+    
+    if(rare_taxa) {
+      abund_table <- abund_table[,taxa_to_group,drop=FALSE]
+    } else {
+      # Remove the low-abundance taxa (columns)
+      abundant_taxa <- abund_table[, -(taxa_to_group), drop=FALSE]
+      # Get the rare taxa
+      rare_taxa_table <- abund_table[, taxa_to_group, drop=FALSE]
+      # Sum the rare taxa per sample to get the share of each sample they make up
+      rare <- rowSums(rare_taxa_table)
+      # Bind the abundant taxa to the summed rare taxa
+      abund_table <- cbind(abundant_taxa, rare)
+      # Rename the columns, i.e. the taxa
+      colnames(abund_table) <- c(colnames(abundant_taxa), "Rare")
+    }
+    
+    return(abund_table)
+  }
+  ```
+
+  **Function Parameter Definitions:**
+  - `abund_table` - a relative abundance matrix with taxa as columns and samples as rows
+  - `threshold` - a max abundance threshold, in the same units as `abund_table`; a taxon is rare if its abundance is below this value in every sample, default: 0.05
+  - `rare_taxa` - a boolean specifying if only rare taxa should be returned, default: FALSE
+
+  **Returns:** a relative abundance matrix with rare taxa grouped or with non-rare taxa filtered out
+
+</details>
+
+##### make_plot()
+<details>
+  <summary>create bar plot of relative abundance</summary>
+
+  ```R
+  # Make bar plot
+  make_plot <- function(abund_table, metadata, custom_palette, publication_format,
+                        samples_column="Sample_ID", prefix_to_remove="barcode"){
+  
+    abund_table_wide <- abund_table %>%
+        as.data.frame() %>%
+        rownames_to_column(samples_column) %>%
+        inner_join(metadata) %>%
+        select(!!!colnames(metadata), everything()) %>%
+        mutate(!!samples_column := !!sym(samples_column) %>% str_remove(prefix_to_remove))
+        
+      
+    abund_table_long <- abund_table_wide  %>%
+        pivot_longer(-colnames(metadata), 
+                     names_to = "Species",
+                     values_to = "relative_abundance")
+      
+    p <- ggplot(abund_table_long, mapping = aes(x = !!sym(samples_column), 
+                                                y = relative_abundance, fill = Species)) +
+         geom_col() +
+         scale_fill_manual(values = custom_palette) + 
+         labs(x=NULL, y="Relative Abundance (%)") + 
+         publication_format
+
+    return(p)
+  }
+  ```
+
+  **Function Parameter Definitions:**
+  - `abund_table` - a relative abundance dataframe with samples as rows and rows summing to 100%
+  - `metadata` - a metadata dataframe with samples as rows and columns describing each sample
+  - `custom_palette` - a vector of strings specifying a custom color palette for coloring plots
+  - `publication_format` - a ggplot::theme object specifying a custom theme for plotting
+  - `samples_column` - a character string specifying the column in `metadata` holding sample names, default is "Sample_ID"
+  - `prefix_to_remove` - a string specifying a prefix or any character set to remove from sample names, default is "barcode"
+
+  **Returns:** a relative abundance stacked bar plot
+
+</details>
+
+##### make_barplot()
+<details>
+  <summary>Creates barplots from a feature table file</summary>
+  
+  ```R
+  make_barplot <- function(metadata_table_file, feature_table_file, 
+                           feature_column = "species", samples_column = "sample_id", group_column = "group", 
+                           output_prefix, assay_suffix = "_GLlblMetag",
+                           publication_format, custom_palette) {
+    # Prepare feature table
+    feature_table <- read_csv(feature_table_file) %>% as.data.frame()
+    rownames(feature_table) <- feature_table[[1]]
+    feature_table <- feature_table[, -1]
+
+    # Prepare metadata
+    metadata <- read_delim(metadata_table_file, delim = ",") %>% as.data.frame()
+    row.names(metadata) <- metadata[, samples_column]
+
+    # compute abundances from counts
+    abund_table <- count_to_rel_abundance(feature_table)
+    
+    # create plot
+    p <- make_plot(abund_table, metadata, custom_palette, publication_format, samples_column) +
+         facet_wrap(vars(.data[[group_column]]), nrow=1, scales = "free_x")
+
+    number_of_species <- p$data$Species %>% unique() %>% length()
+    # Hide the legend if the number of species to plot is greater than 30
+    if(number_of_species > 30) {
+      p <- p + theme(legend.position = "none")
+    }
+
+    return(p)
+
+  }
+  ```
+  **Custom Functions Used:**
+  - [make_plot()](#make_plot)
+  - [count_to_rel_abundance()](#count_to_rel_abundance)
+
+  **Function Parameter Definitions:**
+  - `metadata_table_file` - path to a comma-separated metadata file with samples as rows and columns describing each sample
+  - `feature_table_file` - path to a comma-separated samples feature table i.e. species/functions 
+                           table with species/functions as the first column and samples as other columns.
+  - `feature_column` - a character string containing the feature column name in the feature table ['Species', 'species', 'KO_ID'], default: "species".
+  - `samples_column` - a character string specifying the column in `metadata` holding sample names, default: "sample_id"
+  - `group_column` - a character string specifying the column in `metadata` used to facet/group plots, default: "group"
+  - `output_prefix` - a character string specifying the unique name to add to the output file names 
+                      used to denote the data type/source, for example "unfiltered-kaiju_species"
+  - `assay_suffix` - a character string specifying the GeneLab assay suffix (default: "_GLlblMetag")
+  - `publication_format` - a ggplot::theme object specifying a custom theme for plotting, from [Step 8c](#8c-set-global-variables)
+  - `custom_palette` - a vector of strings specifying a custom color palette for coloring plots, from [Step 8c](#8c-set-global-variables)
+
+  **Returns:** a relative abundance stacked bar plot
+
+</details>
+
+##### make_heatmap()
+<details>
+  <summary>Creates heatmaps from a feature table file</summary>
+  
+  ```R
+  make_heatmap <- function(metadata_file, feature_table_file, 
+                           samples_column = "sample_id", group_column = "group", 
+                           output_prefix, assay_suffix = "_GLlblMetag",
+                           custom_palette) {
+    # Prepare feature table
+    feature_table <- read_csv(feature_table_file) %>% as.data.frame()
+    rownames(feature_table) <- feature_table[[1]]
+    feature_table <- feature_table[, -1] %>% as.matrix()
+
+    # Prepare metadata
+    metadata <- read_delim(metadata_file, delim = ",") %>% as.data.frame()
+    row.names(metadata) <- metadata[, samples_column]
+
+    # Get common samples and re-arrange feature table and metadata
+    common_samples <- intersect(colnames(feature_table), rownames(metadata))
+    feature_table <- feature_table[, common_samples]
+    metadata <- metadata[common_samples, ]
+    metadata <- metadata %>% arrange(!!sym(group_column))
+
+    # Create column annotation
+    col_annotation <- as.data.frame(metadata)[, group_column, drop = FALSE]
+
+    # Calculate output plot width and height
+    number_of_samples <- ncol(feature_table)
+    width <- 1 * number_of_samples
+    number_of_features <- nrow(feature_table)
+    height <- 0.2 * number_of_features
+
+    # Set colors by group
+    groups <- metadata[[group_column]] %>%  unique()
+    number_of_groups <-  length(groups)
+    my_colors <- custom_palette[1:number_of_groups]
+    names(my_colors) <- groups
+    annotation_colors  <- list(my_colors)
+    names(annotation_colors) <- group_column
+
+    # create heatmap
+    png(filename = glue::glue("{output_prefix}_heatmap{assay_suffix}.png"), width = width,
+        height = height, units = "in", res = 300)
+    pheatmap(mat = feature_table[, rownames(col_annotation)],
+             cluster_cols = FALSE,
+             cluster_rows = FALSE,
+             col = colorRampPalette(c('white','red'))(255), 
+             angle_col = 0,
+             display_numbers = TRUE,
+             fontsize = 12,
+             annotation_col = col_annotation,
+             annotation_colors = annotation_colors,
+             number_format = "%.0f")
+    dev.off()
+  }
+  ```
+  **Function Parameter Definitions:**
+  - `metadata_file` - path to a comma-separated metadata file with samples as rows and columns describing each sample
+  - `feature_table_file` - path to a comma-separated samples feature table i.e. species/functions 
+                           table with species/functions as the first column and samples as other columns.
+  - `samples_column` - a character string specifying the column in `metadata` holding sample names, default: "sample_id"
+  - `group_column` - a character string specifying the column in `metadata` used to facet/group plots, default: "group"
+  - `output_prefix` - a character string specifying the unique name to add to the output file names 
+                      used to denote the data type/source, for example "unfiltered-kaiju_species"
+  - `assay_suffix` - a character string specifying the GeneLab assay suffix (default: "_GLlblMetag")
+  - `custom_palette` - a vector of strings specifying a custom color palette for coloring plots, from [Step 8c](#8c-set-global-variables)
+
+</details>
+
+##### run_decontam()
+<details>
+  <summary>Feature table decontamination with decontam</summary>
+
+  ```R
+  run_decontam <- function(feature_table, metadata, contam_threshold=0.1, 
+                           prev_col = NULL, freq_col = NULL, ntc_name = "TRUE") {
+
+    # retain metadata for only the samples present in the input feature table
+    sub_metadata <- metadata[colnames(feature_table), ]
+    # Modify NTC concentration
+    # Often times the user may set the NTC concentration to zero because they think nothing 
+    # should be in the negative control but decontam fails if the value is set to zero.
+    # To prevent decontam from failing, we replace zero with a very small concentration value
+    # 0.0000001
+    if (!is.null(freq_col)) {
+
+      sub_metadata <- sub_metadata %>%
+        mutate(!!freq_col:=map_dbl(!!sym(freq_col), .f = function(conc) {
+              if(conc == 0) return(0.0000001) else return(conc) 
+            } 
+          )
+        )
+      sub_metadata[, freq_col] <- as.numeric(sub_metadata[, freq_col])
+
+    }
+
+    # Create phyloseq object
+    ps <- phyloseq(otu_table(feature_table, taxa_are_rows = TRUE), sample_data(sub_metadata))
+
+    # `prev_col` is the sample variable that identifies the negative controls.
+    # isContaminant() requires a logical variable that is TRUE for control samples,
+    # so samples whose `prev_col` value equals `ntc_name` are flagged in "is.neg".
+    # This is skipped when `prev_col` is NULL (frequency-only analysis).
+    if (!is.null(prev_col)) {
+      sample_data(ps)$is.neg <- sample_data(ps)[[prev_col]] == ntc_name
+    }
+
+    # Run Decontam 
+    if (!is.null(freq_col) && !is.null(prev_col)) {
+      # Run decontam using both prevalence and frequency information
+      contamdf <- isContaminant(ps, neg="is.neg", conc=freq_col, threshold=contam_threshold) 
+    } else if(!is.null(freq_col)) {
+      # Run decontam in frequency mode
+      contamdf <- isContaminant(ps, conc=freq_col, threshold=contam_threshold) 
+    } else if(!is.null(prev_col)){
+      # Run decontam in prevalence mode
+      contamdf <- isContaminant(ps, neg="is.neg", threshold=contam_threshold)
+    } else {
+      stop("Both freq_col and prev_col cannot be set to NULL. ",
+           "Please supply either one or both column names in your metadata ",
+           "for frequency and prevalence based analysis, respectively.")
+    }
+    return(contamdf)
+  }
+  ```
+
+  **Function Parameter Definitions:**
+  - `metadata` - a metadata dataframe with samples as rows and columns describing each sample
+  - `feature_table` - feature [species, functions etc.] matrix to decontaminate with sample names as columns and features as rows
+  - `prev_col` - a character string naming the column in metadata to be used for prevalence based analysis. Controls in this column are identified by the value given in `ntc_name`
+  - `freq_col` - a character string naming the numeric column in metadata to be used for frequency based analysis
+  - `ntc_name` - a character string specifying the value in `prev_col` that marks negative control samples, default: "TRUE"
+  - `contam_threshold` - the probability threshold below which (strictly less than) the null-hypothesis 
+                          (not a contaminant) should be rejected in favor of the alternate hypothesis (contaminant).
+
+  **Returns:** a dataframe of detailed decontam results
+</details>
+
+##### feature_decontam() 
+<details>
+  <summary>decontaminate a feature table</summary>
+  
+  ```R
+  library(tidyverse)
+  library(glue)
+
+  feature_decontam <- function(metadata_file, feature_table_file, 
+                               feature_column = "Species", samples_column = "sample_id",
+                               prevalence_column = "NTC", ntc_name = "TRUE", 
+                               frequency_column = "concentration", 
+                               threshold = 0.1, classification_method, 
+                               output_prefix, assay_suffix = "_GLlblMetag") {
+    # Prepare feature table
+    feature_table <- read_csv(feature_table_file) %>%  as.data.frame()
+    rownames(feature_table) <- feature_table[[1]]
+    feature_table <- feature_table[, -1]  %>% as.matrix()
+
+    # Prepare metadata
+    metadata <- read_csv(metadata_file) %>% as.data.frame()
+    row.names(metadata) <- metadata[, samples_column]
+
+    # Run decontam
+    contamdf <- run_decontam(feature_table, metadata, threshold, 
+                             prevalence_column, frequency_column, ntc_name) 
+
+    contamdf <- as.data.frame(contamdf) %>% rownames_to_column(feature_column)
+
+    # Write decontam's primary results
+    outfile <- glue("{output_prefix}{classification_method}_decontam_results{assay_suffix}.csv")
+    write_csv(x = contamdf, file = outfile)
+
+    # Get the list of contaminants identified by decontam
+    contaminants <- contamdf %>%
+                    filter(contaminant == TRUE) %>%
+                    pull(!!sym(feature_column))
+
+    # Drop contaminant(s) if detected
+    if(length(contaminants) > 0){
+      
+      # Drop contaminant features identified by decontam (exact name match)
+      decontaminated_table <- feature_table %>%
+        as.data.frame() %>%
+        rownames_to_column(feature_column) %>%
+        filter(!(.data[[feature_column]] %in% contaminants))
+
+      # Write the decontaminated feature table, keeping feature names as the first column
+      outfile <- glue("{output_prefix}{classification_method}_decontam_species_table{assay_suffix}.csv")
+      write_csv(x = decontaminated_table, file = outfile)
+
+      # Set feature names as row names and convert to a matrix
+      rownames(decontaminated_table) <- decontaminated_table[[feature_column]]
+      decontaminated_table <- decontaminated_table[, -1, drop = FALSE] %>% as.matrix
+
+      return(decontaminated_table)
+
+    } else {
+      message("No contaminants were detected by Decontam")
+      return(NULL)
+    }
+  }
+  ```
+  **Custom Functions Used:**
+  - [run_decontam()](#run_decontam)
+
+  **Function Parameter Definitions:**
+  - `metadata_file` - path to a file with samples as rows and columns describing each sample
+  - `feature_table_file` - path to a comma-separated samples feature table i.e. species/functions 
+                           table with species/functions as the first column and samples as other columns.
+  - `feature_column` - a character string containing the feature column name in the feature table ['Species', 'species', 'KO_ID'].
+  - `samples_column` - a character string specifying the column in `metadata` holding sample names, default: "sample_id"
+  - `frequency_column` - a character string specifying the column in `metadata` to use for frequency based analysis, default: "concentration"
+  - `prevalence_column` - a character string specifying the column in `metadata` to use for prevalence based analysis, default: "NTC"
+  - `ntc_name` - a character string specifying the value in the prevalence column for all negative template control samples, default: "TRUE"
+  - `threshold` - a number between 0 and 1 specifying the decontam threshold for both prevalence and frequency based analyses, default: 0.1
+  - `output_prefix` - a character string specifying the unique name to add to the output file names 
+                      used to denote the data type/source, for example "unfiltered-kaiju_species"
+  - `classification_method` - a character string specifying the tool used to generate the classifications ['kaiju', 'kraken2', 'metaphlan', 'contig-taxonomy', 'gene-taxonomy', 'gene-function']
+  - `assay_suffix` - a character string specifying the GeneLab assay suffix (default: "_GLlblMetag")
+
+  **Output Data:**
+  - {output_prefix}{classification_method}_decontam_species_table_GLlblMetag.csv - decontaminated feature table file (written only if contaminants are detected)
+  - {output_prefix}{classification_method}_decontam_results_GLlblMetag.csv - Decontam results file
+
+  **Returns:** a matrix containing the decontaminated feature table, or `NULL` if no contaminants were detected
+
+</details>
+
+##### process_taxonomy()
+<details>
+  <summary>process a taxonomy assignment table</summary>
+
+  ```R
+  process_taxonomy <- function(taxonomy, prefix='\\w__') { 
+    
+    taxonomy <- apply(X = taxonomy, MARGIN = 2, FUN = as.character) 
+
+    # replace NAs and empty cells with "Other" and delete the `prefix` from taxonomy names
+    for (rank in colnames(taxonomy)) {
+      # Delete the taxonomy prefix
+      taxonomy[,rank] <- gsub(pattern = prefix, x = taxonomy[, rank],
+                              replacement = '')
+      indices <- which(is.na(taxonomy[,rank]))
+      taxonomy[indices, rank] <- rep(x = "Other", times=length(indices)) 
+      # Replace empty cells with "Other"
+      indices <- which(taxonomy[,rank] == "")
+      taxonomy[indices,rank] <- rep(x = "Other", times=length(indices))
+    }
+    # Replace underscore with space
+    taxonomy <- apply(X = taxonomy,MARGIN = 2,
+                      FUN =  gsub,pattern = "_",replacement = " ") %>% 
+      as.data.frame(stringsAsFactors=FALSE)
+    return(taxonomy)
+  }
+  ```
+  **Function Parameter Definitions:**
+
+  - `taxonomy` - is a taxonomy assignment dataframe with ranks [Phylum, Class .. Species] as columns and taxonomy assignments as rows
+  - `prefix`  - is a regular expression specifying a character sequence to remove
+                from taxon names
+
+  **Returns:** a dataframe of reformatted taxonomy names
+
+</details>
+
+
+##### format_taxonomy_table()
+<details>
+  <summary>format a taxonomy assignment table by appending a suffix to a known name</summary>
+
+```R
+format_taxonomy_table <- function(taxonomy,stringToReplace="Other",
+                                  suffix=";Other") {
+  
+  for (taxa_index in seq_along(taxonomy)) {
+    
+    # Get the row indices of the current taxonomy column 
+    # with rows matching the string in `stringToReplace`
+    indices <- grep(x = taxonomy[,taxa_index], pattern = stringToReplace)
+    # Replace the value in that row with the value in the preceding column concatenated with `suffix` 
+    taxonomy[indices,taxa_index] <- 
+      paste0(taxonomy[indices,taxa_index-1],
+             rep(x = suffix, times=length(indices)))
+    
+  }
+  return(taxonomy)
+}
+
+```
+**Function Parameter Definitions:**
+- `taxonomy` -  taxonomy dataframe with taxonomy ranks as column names
+- `stringToReplace` - a regex string specifying what to replace
+- `suffix` - string appended to the value from the preceding rank column to form the replacement value
+
+**Returns:** a dataframe of reformatted taxonomy names
+
+</details>
+
+
+##### fix_names()
+<details>
+  <summary>clean taxonomy names</summary>
+
+```R
+fix_names <- function(taxonomy, stringToReplace, suffix) {
+  
+  for(index in seq_along(stringToReplace)){
+    taxonomy <- format_taxonomy_table(taxonomy = taxonomy,
+                                      stringToReplace=stringToReplace[index], 
+                                      suffix=suffix[index])
+  }
+  return(taxonomy)
+}
+
+```
+**Function Parameter Definitions:**
+- `taxonomy` -  taxonomy dataframe with taxonomy ranks as column names
+- `stringToReplace` - a character vector of regex strings specifying what to replace
+- `suffix` - a character vector of suffixes, one for each entry in `stringToReplace`
+
+**Returns:** a dataframe of reformatted/cleaned taxonomy names
+
+</details>
+
+
+##### read_input_table()
+<details>
+  <summary>read an input table into a dataframe</summary>
+
+  ```R
+  read_input_table <- function(file_name){
+    
+    df <- read_delim(file = file_name, delim = "\t", comment = "#")
+    return(df)
+    
+  }
+  ```
+  **Function Parameter Definitions:**
+  - `file_name` - path to file to be read
+
+  **Returns:** a tibble generated from the input file
+
+</details>
+
+
+
+##### read_assembly_coverage_table()
+<details>
+  <summary>Read Assembly-based coverage annotation table</summary>
+
+  ```R
+  read_assembly_coverage_table <- function(file_name, sample_names){
+  
+    df <- read_input_table(file_name)
+
+    # Subset taxonomy portion (domain:species) of input table
+    # and replace NA domain assignments with "Unclassified"
+    taxonomy_table <- df %>%
+      select(domain:species) %>%
+      mutate(domain=replace_na(domain, "Unclassified"))
+    
+    # Subset count table
+    counts_table <- df %>% select(any_of(sample_names))
+
+    # Clean taxonomy names
+    taxonomy_table  <- process_taxonomy(taxonomy_table)
+    taxonomy_table <- fix_names(taxonomy_table, "Other", ";_")
+
+    # Column bind taxonomy dataframe with species count dataframe
+    df <- bind_cols(taxonomy_table, counts_table)
+    
+    return(df)
+  }
+
+  ```
+
+  **Function Parameter Definitions:**
+
+  - `file_name` - path to contig taxonomy assignment file to be read
+  - `sample_names` - character vector of sample names to keep in the final dataframe
+
+  **Returns:** a dataframe with cleaned taxonomy names and per-sample species counts
+
+</details>
+
+
+
+##### get_sample_names()
+<details>
+  <summary>retrieve sample names for which assemblies were generated</summary>
+
+  ```R
+  get_sample_names <- function (assembly_summary) {
+    overview_table <-  read_input_table(assembly_summary) %>%
+                        select(
+                          where( ~all(!is.na(.)) )
+                          ) # Keep only columns that contain no NA values
+
+    col_names <- names(overview_table) %>% str_remove_all("-assembly")
+    sample_order <- col_names[-1] %>% sort()
+
+    return(sample_order)
+  }
+  ```
+  **Function Parameter Definitions:**
+
+  - `assembly_summary` - path to assembly summary file
+
+  **Returns:** a character vector of sorted sample names
+
+</details>
+
+
+#### 8c. Set global variables
+
+```R
+# Define custom theme for plotting
+publication_format <- theme_bw() +
+  theme(panel.grid = element_blank()) +
+  theme(axis.ticks.length=unit(-0.15, "cm"),
+        axis.text.x=element_text(margin=ggplot2::margin(t=0.5,r=0,b=0,l=0,unit ="cm")),
+        axis.text.y=element_text(margin=ggplot2::margin(t=0,r=0.5,b=0,l=0,unit ="cm")), 
+        axis.title = element_text(size = 18,face ='bold.italic', color = 'black'), 
+        axis.text = element_text(size = 16,face ='bold', color = 'black'),
+        legend.position = 'right', legend.title = element_text(size = 15,face ='bold', color = 'black'),
+        legend.text = element_text(size = 14,face ='bold', color = 'black'),
+        strip.text =  element_text(size = 14,face ='bold', color = 'black'))
+
+# Define custom palette for plotting
+custom_palette <- c("#A6CEE3","#1F78B4","#B2DF8A","#33A02C","#FB9A99","#E31A1C","#FDBF6F", "#FF7F00",
+                    "#CAB2D6","#6A3D9A","#FF00FFFF","#B15928","#000000","#FFC0CBFF","#8B864EFF","#F0027F",
+                    "#666666","#1B9E77", "#E6AB02","#A6761D","#FFFF00FF","#FFFF99","#00FFFFFF",
+                    "#B2182B","#FDDBC7","#D1E5F0","#CC0033","#FF00CC","#330033",
+                    "#999933","#FF9933","#FFFAFAFF",colors()) 
+# Drop hard-to-see colors: entries 21-23 (yellows and cyan) and any white, snow, azure, gray, or aliceblue shades
+custom_palette <- custom_palette[-c(21:23,
+                                    grep(pattern = "white|snow|azure|gray|#FFFAFAFF|aliceblue",
+                                         x = custom_palette, 
+                                         ignore.case = TRUE)
+                                   )
+                                ]                      
+```
+
+**Input Data:** 
+
+*No input data required*
+
+**Output Data:**
+
+- `publication_format` (a ggplot2 theme object specifying a custom theme for plotting)
+- `custom_palette` (a vector of strings specifying a custom color palette for coloring plots)
+
+<br>
+
+---
+
+## Read-based Processing
+
+### 9. Taxonomic Profiling Using Kaiju
+
+#### 9a. Build Kaiju Database
+
+```bash
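+# Make a directory that will hold the downloaded kaiju database
+mkdir kaiju-db/
+
+# Download and build kaiju's nr_euk reference database (kaiju-makedb writes to the current directory)
+(cd kaiju-db/ && kaiju-makedb -s nr_euk -t NumberOfThreads)
+
+# Remove intermediate index files that are not needed for classification
+rm kaiju-db/nr_euk/kaiju_db_nr_euk.bwt kaiju-db/nr_euk/kaiju_db_nr_euk.sa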
+```
+
+**Parameter Definitions:**
+
+- `-s nr_euk` - Specifies to download the subset of the NCBI BLAST nr (non-redundant) database containing all proteins belonging to Archaea, bacteria, and viruses, and additionally include proteins from fungi and microbial eukaryotes.
+- `-t` - Number of parallel processing threads to use.
+
+**Input Data:**
+
+*No input data required*
+
+**Output Data:**
+
+- kaiju-db/nr_euk/kaiju_db_nr_euk.fmi (FM-index file containing the main Kaiju database index)
+- kaiju-db/nr_euk/kaiju_db_nr_euk.faa (FASTA amino acid file containing the protein sequences used to build the .fmi index file)
+- kaiju-db/nodes.dmp (taxonomy hierarchy file from the NCBI Taxonomy database defining the parent-child relationships in the taxonomic tree)
+- kaiju-db/names.dmp (taxonomy names file from the NCBI Taxonomy database that maps taxonomic IDs to their scientific names)
+- kaiju-db/merged.dmp (merged taxonomy IDs file from the NCBI Taxonomy database that maps deprecated taxonomic IDs to current ones)
+
+
+#### 9b. Kaiju Taxonomic Classification
+
+```bash
+kaiju -f kaiju-db/nr_euk/kaiju_db_nr_euk.fmi \
+      -t kaiju-db/nodes.dmp \
+      -z NumberOfThreads \
+      -E 1e-05 \
+      -i /path/to/sample_GLlblMetag_decontam.fastq.gz \
+      -o sample_kaiju.out
+```
+
+**Parameter Definitions:**
+
+- `-f` - Specifies the path to the kaiju database index file (.fmi).
+- `-t` - Specifies the path to the kaiju taxonomy hierarchy file (nodes.dmp).
+- `-z` - Number of parallel processing threads to use.
+- `-E` - Specifies the E-value threshold used to filter matches in Greedy mode (the E-value is the number of matches with an equal or better score expected to occur by chance, so a threshold of 1e-05 keeps only matches that are very unlikely to be random).
+- `-i` - Specifies path to the input file.
+- `-o` - Specifies the name of the output file.
+
+**Input Data:**
+
+- kaiju-db/nr_euk/kaiju_db_nr_euk.fmi (FM-index file containing the main Kaiju database index, output from [Step 9a](#9a-build-kaiju-database))
+- kaiju-db/nodes.dmp (kaiju taxonomy hierarchy nodes file, output from [Step 9a](#9a-build-kaiju-database))
+- sample_GLlblMetag_decontam.fastq.gz or sample_GLlblMetag_HostRm.fastq.gz (filtered and trimmed sample reads with both 
+    contaminants and human reads (and optionally host reads) removed, gzipped fastq file, 
+    output from [Step 7e](#7e-generate-decontaminated-read-files) or [Step 8b](#8b-remove-host-reads))
+
+**Output Data:**
+
+- sample_kaiju.out (kaiju output file)
+
+#### 9c. Compile Kaiju Taxonomy Results
+
+```bash
+# Merge kaiju reports to one table at the species level 
+kaiju2table -t kaiju-db/nodes.dmp \
+            -n kaiju-db/names.dmp \
+            -p \
+            -r "species" \
+            -o merged_kaiju_table_GLlblMetag.tsv \
+            *_kaiju.out
+
+# Convert file names to sample names (any leading directory path is optional)
+sed -i -E 's/^(.+\/)?(.+)_kaiju\.out/\2/' merged_kaiju_table_GLlblMetag.tsv && \
+sed -i -E '1s/^file/sample/' merged_kaiju_table_GLlblMetag.tsv
+```
+
+**Parameter Definitions:**
+
+- `-t` - Specifies the path to the kaiju taxonomy hierarchy file (nodes.dmp).
+- `-n` - Specifies the path to the kaiju taxonomy names file (names.dmp).
+- `-p` - Print the full taxon path instead of only the taxon name.
+- `-r` - Specifies the taxonomic rank at which to summarize the read counts, must be one of: phylum, class, order, family, genus, species.
+- `-o` - Specifies the name of the kaiju taxon summary output file.
+- `*_kaiju.out` - Positional argument specifying the path to the kaiju output files for each sample. 
+
+**Input Data:**
+
+- kaiju-db/nodes.dmp (kaiju taxonomy hierarchy nodes file, output from [Step 9a](#9a-build-kaiju-database))
+- kaiju-db/names.dmp (kaiju taxonomy names file, output from [Step 9a](#9a-build-kaiju-database))
+- *_kaiju.out (kaiju output files, output from [Step 9b](#9b-kaiju-taxonomic-classification))
+
+**Output Data:**
+
+- merged_kaiju_table_GLlblMetag.tsv (compiled kaiju summary table at the species level)
+
+#### 9d. Convert Kaiju Output To Krona Format
+
+```bash
+kaiju2krona -u \
+            -n kaiju-db/names.dmp \
+            -t kaiju-db/nodes.dmp \
+            -i sample_kaiju.out \
+            -o sample.krona
+```
+
+**Parameter Definitions:**
+
+- `-u` - Include count for unclassified reads in output.
+- `-n` - Specifies the path to the kaiju taxonomy names file (names.dmp).
+- `-t` - Specifies the path to the kaiju taxonomy hierarchy file (nodes.dmp).
+- `-i` - Specifies the path to the kaiju output file.
+- `-o` - Specifies the name of krona formatted kaiju output file.
+
+**Input Data:**
+- kaiju-db/names.dmp (kaiju taxonomy names file, output from [Step 9a](#9a-build-kaiju-database))
+- kaiju-db/nodes.dmp (kaiju taxonomy hierarchy nodes file, output from [Step 9a](#9a-build-kaiju-database))
+- sample_kaiju.out (kaiju output file, output from [Step 9b](#9b-kaiju-taxonomic-classification))
+
+**Output Data:**
+
+- sample.krona (krona formatted kaiju output)
+
+#### 9e. Compile Kaiju Krona Reports
+
+```bash
+# Create a file containing a sorted list of all .krona files 
+find . -type f -name "*.krona" | sort -uV > krona_files.txt
+
+# Create a file containing a sorted list of all sample names
+FILES=($(find . -type f -name "*.krona"))
+basename -a -s '.krona' ${FILES[*]} | sort -uV  > sample_names.txt
+
+# Create ktImportText input format files
+KTEXT_FILES=($(paste -d',' "krona_files.txt" "sample_names.txt"))
+
+# Create html containing krona plot  
+ktImportText  -o kaiju-report_GLlblMetag.html ${KTEXT_FILES[*]}
+```
+
+**Parameter Definitions:**
+
+**find**
+
+- `-type f` -  Specifies that the type of file to find is a regular file.
+- `-name "*.krona"` - Specifies to find files ending with the .krona suffix.  
+
+**sort**
+
+- `-u` - Specifies to perform a unique sort.
+- `-V` - Specifies a natural (version) sort, so that numbers embedded in names are ordered numerically (e.g., sample_2 before sample_10).
+- `> krona_files.txt` - Redirects the sorted list to a separate text file.
+
+**basename**
+
+- `-a` - Support multiple arguments and treat each as a file name.
+- `-s '.krona'` - Remove trailing '.krona' suffix.
+
+**paste**
+
+- `-d','` - Paste both krona and sample files together line by line delimited by comma ','.
+
+**ktImportText**
+
+- `-o` - Specifies the compiled output html file name.
+- `${KTEXT_FILES[*]}` - An array positional argument with the following content: 
+                        sample_1.krona,sample_1 sample_2.krona,sample_2 ... sample_n.krona,sample_n.
+
+**Input Data:**
+- *.krona (all sample .krona formatted files, output from [Step 9d](#9d-convert-kaiju-output-to-krona-format)) 
+
+                      
+**Output Data:**
+
+- krona_files.txt (sorted list of all *.krona files)
+- sample_names.txt (sorted list of all sample names)
+- **kaiju-report_GLlblMetag.html** (compiled krona html report containing all samples)
+
+
+#### 9f. Create Kaiju Species Count Table
+
+```R
+library(tidyverse)
+feature_table <- process_kaiju_table(file_path="merged_kaiju_table_GLlblMetag.tsv")
+table2write <- feature_table  %>%
+                as.data.frame() %>%
+                rownames_to_column("Species")
+write_csv(x = table2write, file = "kaiju_species_table_GLlblMetag.csv")
+```
+
+**Custom Functions Used:**
+
+- [process_kaiju_table()](#process_kaiju_table)
+
+**Parameter Definitions:**
+
+- `file_path` - path to compiled kaiju table at the species taxon level
+- `x`  - feature table dataframe to write to file
+- `file` - path to where to write kaiju count table per sample
+
+**Input Data:**
+
+- merged_kaiju_table_GLlblMetag.tsv (compiled kaiju table at the species taxon level, from [Step 9c](#9c-compile-kaiju-taxonomy-results))
+
+**Output Data:**
+
+- **kaiju_species_table_GLlblMetag.csv** (kaiju species count table in csv format)
+
+
+#### 9g. Filter Kaiju Species Count Table
+
+```R
+library(tidyverse)
+
+input_file <- "kaiju_species_table_GLlblMetag.csv"
+output_file <- "kaiju_filtered_species_table_GLlblMetag.csv"
+threshold <- 0.5
+
+# string used to define non-microbial taxa
+non_microbial <- "UNCLASSIFIED|Unclassifed|unclassified|Homo sapien|cannot|uncultured|unidentified"
+
+# read in feature table
+feature_table <- read_csv(input_file) %>% as.data.frame()
+feature_name <- colnames(feature_table)[1]
+rownames(feature_table) <- feature_table[,1]
+feature_table <- feature_table[, -1]
+
+# convert count table to a relative abundance matrix
+abund_table <- feature_table %>% rownames_to_column(feature_name) %>%
+  mutate(across(where(is.numeric), function(x) (x / sum(x, na.rm = TRUE)) * 100)) %>%
+  as.data.frame()
+
+rownames(abund_table) <- abund_table[,1]
+abund_table <- abund_table[,-1] %>% t 
+
+table2write <- group_low_abund_taxa(abund_table, threshold = threshold)  %>%
+  t %>% as.data.frame() %>%
+  rownames_to_column(feature_name)
+
+write_csv(x = table2write, file = output_file)
+```
+
+**Custom Functions Used:**
+
+- [group_low_abund_taxa()](#group_low_abund_taxa)
+
+**Parameter Definitions:**
+
+- `threshold` - threshold for filtering out rare taxa, a percentage between 0 and 100.
+- `output_file` - output filename
+- `input_file` - input filename
+
+**Input Data:**
+
+- kaiju_species_table_GLlblMetag.csv (path to kaiju species table from [Step 9f](#9f-create-kaiju-species-count-table))
+
+**Output Data:**
+
+- **kaiju_filtered_species_table_GLlblMetag.csv** - a file containing the filtered species table
+
+---
+
+#### 9h. Taxonomy barplots
+
+```R
+library(tidyverse)
+
+species_table_file <- "kaiju_species_table_GLlblMetag.csv"
+filtered_species_table_file <- "kaiju_filtered_species_table_GLlblMetag.csv"
+metadata_file <- "/path/to/sample/metadata"
+number_samples <- 10 
+
+# set width based on number of samples, with a cap at 50 inches
+plot_width <- 2 * number_samples
+if(plot_width > 50) { plot_width = 50 }
+
+p <- make_barplot(metadata_file = metadata_file, feature_table_file = species_table_file, 
+                  feature_column = "Species", samples_column = "sample_id", group_column = "group",
+                  publication_format = publication_format, custom_palette = custom_palette)
+
+# Save static unfiltered plot
+ggsave(filename = "kaiju_unfiltered_species_barplot_GLlblMetag.png", plot = p,
+       device = "png", width = plot_width, height = 10, units = "in", dpi = 300, limitsize = FALSE)
+
+# Save interactive unfiltered plot
+htmlwidgets::saveWidget(plotly::ggplotly(p), "kaiju_unfiltered_species_barplot_GLlblMetag.html", selfcontained = TRUE)
+
+p <- make_barplot(metadata_file = metadata_file, feature_table_file = filtered_species_table_file, 
+                  feature_column = "Species", samples_column = "sample_id", group_column = "group",
+                  publication_format = publication_format, custom_palette = custom_palette)
+
+# Save static filtered plot
+ggsave(filename = "kaiju_filtered_species_barplot_GLlblMetag.png", plot = p,
+       device = "png", width = plot_width, height = 10, units = "in", dpi = 300, limitsize = FALSE)
+
+# Save interactive filtered plot
+htmlwidgets::saveWidget(plotly::ggplotly(p), "kaiju_filtered_species_barplot_GLlblMetag.html", selfcontained = TRUE)
+```
+
+**Custom Functions Used:**
+- [make_barplot()](#make_barplot)
+
+**Parameter Definitions:**
+
+- `species_table_file` - a file containing the species count table
+- `filtered_species_table_file` - a file containing the filtered species count table
+- `metadata_file` - a file containing group information for each sample in the species count files
+- `number_samples` - the total number of samples in the species count files, adjust based on number of input samples.
+
+**Input Data:**
+
+- `kaiju_species_table_GLlblMetag.csv` (a file containing the species count table, output from [Step 9f](#9f-create-kaiju-species-count-table))
+- `kaiju_filtered_species_table_GLlblMetag.csv` (a file containing the filtered species count table, output from [Step 9g](#9g-filter-kaiju-species-count-table))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
+
+
+**Output Data:**
+
+- kaiju_unfiltered_species_barplot_GLlblMetag.png (taxonomy barplot without filtering)
+- **kaiju_unfiltered_species_barplot_GLlblMetag.html** (interactive taxonomy barplot without filtering)
+- kaiju_filtered_species_barplot_GLlblMetag.png (taxonomy barplot after filtering rare and non-microbial taxa)
+- **kaiju_filtered_species_barplot_GLlblMetag.html** (interactive taxonomy barplot after filtering rare and non-microbial taxa)
+
+
+#### 9i. Feature decontamination
+
+> Feature (species) decontamination with decontam. Decontam is an R package that statistically identifies contaminating features in a feature table
+
+```R
+library(tidyverse)
+library(decontam)
+library(phyloseq)
+
+feature_table_file <- "kaiju_filtered_species_table_GLlblMetag.csv"
+metadata_table <- "/path/to/sample/metadata"
+number_of_samples <- NumberOfSamples
+
+# set width based on number of samples, with a cap at 50 inches
+plot_width <- 2 * number_of_samples
+if(plot_width > 50) { plot_width = 50 }
+
+decontaminated_table <- feature_decontam(metadata_file = metadata_table, 
+                                         feature_table_file = feature_table_file, 
+                                         feature_column = "Species", 
+                                         samples_column = "sample_id",
+                                         prevalence_column = "NTC", 
+                                         ntc_name = "TRUE", 
+                                         frequency_column = "concentration", 
+                                         threshold = 0.1, 
+                                         classification_method = "kaiju", 
+                                         output_prefix = "", 
+                                         assay_suffix = "_GLlblMetag")
+
+# Convert count matrix to relative abundance matrix
+decontaminated_species_table <- count_to_rel_abundance(decontaminated_table)
+
+# Make plot after filtering out contaminants
+p <- make_plot(decontaminated_species_table, metadata, custom_palette, publication_format)
+
+ggsave(filename = "kaiju_decontam_species_barplot_GLlblMetag.png", plot = p,
+         device = "png", width = plot_width, height = 10, units = "in", dpi = 300)
+
+# Save interactive decontaminated plot
+htmlwidgets::saveWidget(plotly::ggplotly(p), "kaiju_decontam_species_barplot_GLlblMetag.html", selfcontained = TRUE)
+```
+
+**Custom Functions Used:**
+- [feature_decontam()](#feature_decontam)
+- [make_plot()](#make_plot)
+- [count_to_rel_abundance()](#count_to_rel_abundance)
+
+**Parameter Definitions:**
+- `metadata_table` - path to a file with samples as rows and columns describing each sample
+- `feature_table_file` - path to a feature table file, i.e., a species/functions 
+                         table with species/functions as the first column and samples as the other columns.
+- `ntc_name` - a character string specifying the name of the NTC in the prevalence column
+- `number_of_samples` - the total number of samples in the species count files, adjust based on number of input samples in the feature_table_file
+
+**Input Data:**
+
+- `kaiju_filtered_species_table_GLlblMetag.csv` (path to filtered species count per sample, output from [Step 9g](#9g-filter-kaiju-species-count-table))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
+
+**Output Data:**
+
+- **kaiju_decontam_results_GLlblMetag.csv** (decontam's result table, output from [feature_decontam() function](#feature_decontam))
+- **kaiju_decontam_species_table_GLlblMetag.csv** (decontaminated species table, output from [feature_decontam() function](#feature_decontam))
+- **kaiju_decontam_species_barplot_GLlblMetag.png** (barplot after filtering out contaminants)
+- **kaiju_decontam_species_barplot_GLlblMetag.html** (interactive barplot after filtering out contaminants)
+
+<br>
+
+---
+
+### 10. Taxonomic Profiling Using Kraken2
+
+#### 10a. Download Kraken2 Database
+
+```bash 
+## Download all microbial (including eukaryotes) - https://benlangmead.github.io/aws-indexes/k2
+
+# Download the pre-built kraken2 PlusPFP database, which contains the standard database (RefSeq archaea, bacteria, viral, plasmid, human, UniVec_Core) + protozoa + fungi + plants
+
+mkdir kraken2-db/ && cd kraken2-db/
+
+# Inspect file
+INSPECT_URL=https://genome-idx.s3.amazonaws.com/kraken/pluspfp_20250714/inspect.txt
+wget ${INSPECT_URL}
+
+# Library report
+LIBRARY_REPORT_URL=https://genome-idx.s3.amazonaws.com/kraken/pluspfp_20250714/library_report.tsv
+wget ${LIBRARY_REPORT_URL}
+
+# Md5sums
+MD5_URL=https://genome-idx.s3.amazonaws.com/kraken/pluspfp_20250714/pluspfp.md5
+wget ${MD5_URL}
+
+# Download and unpack the main database files
+DB_URL=https://genome-idx.s3.amazonaws.com/kraken/k2_pluspfp_20250714.tar.gz
+wget -O k2_pluspfp.tar.gz --timeout=3600 --tries=0 --continue ${DB_URL} && \
+tar -xvzf k2_pluspfp.tar.gz
+```
+
+**Parameter Definitions:**
+
+**wget**
+
+- `-O` - Name of the file to write the downloaded url content to.
+- `--timeout=3600` - Specifies the network timeout in seconds.
+- `--tries=0` - Retry download infinitely.
+- `--continue` -  Continue getting a partially-downloaded file.
+- `*_URL` - Positional argument specifying the url to download a particular resource from.
+
+
+**Input Data:**
+
+- `INSPECT_URL=` - url specifying the location of kraken2 inspect file
+- `LIBRARY_REPORT_URL=` - url specifying the location of kraken2 library report file
+- `MD5_URL=` - url specifying the location of the md5 file of the kraken2 database
+- `DB_URL=` - url specifying the location of the main kraken2 database archive in .tar.gz format
+
+**Output Data:**
+
+- kraken2-db/  (a directory containing kraken2 database files)
+
+#### 10b. Kraken2 Taxonomic Classification
+
+```bash
+kraken2 --db kraken2-db/ \
+        --gzip-compressed \
+        --threads NumberOfThreads \
+        --use-names \
+        --output sample-kraken2-output.txt \
+        --report sample-kraken2-report.tsv \
+        /path/to/sample_GLlblMetag_decontam.fastq.gz
+```
+
+**Parameter Definitions:**
+
+- `--db` - Specifies the directory holding the kraken2 database files. 
+- `--gzip-compressed` - Specifies the input files are gzip-compressed.
+- `--threads` - Number of parallel processing threads to use.
+- `--use-names` - Specifies to add taxa names in addition to taxids.
+- `--output` - Specifies the name of the kraken2 read-based output file.
+- `--report` - Specifies the name of the kraken2 report output file.
+- `sample_GLlblMetag_decontam.fastq.gz` - Positional argument specifying the input file.
+
+**Input Data:**
+
+- kraken2-db/ (a directory containing kraken2 database files, output from [Step 10a](#10a-download-kraken2-database))
+- sample_GLlblMetag_decontam.fastq.gz or sample_GLlblMetag_HostRm.fastq.gz (filtered and trimmed sample reads with both 
+    contaminants and human reads (and, optionally, host reads) removed, gzipped fastq file, 
+    output from [Step 7e](#7e-generate-decontaminated-read-files) or [Step 8b](#8b-remove-host-reads))
+
+**Output Data:**
+
+- sample-kraken2-output.txt (kraken2 read-based output file (one line per read))
+- sample-kraken2-report.tsv (kraken2 report output file (one line per taxon, with number of reads assigned to it))
+
+
+#### 10c. Compile Kraken2 Taxonomy Results
+
+##### 10ci. Create Merged Kraken2 Taxonomy Table
+
+```R
+library(tidyverse)
+
+species_table <- merge_kraken_reports(reports_dir = '/path/to/kraken2/reports')
+write_csv(x = species_table, file = "kraken2_species_table_GLlblMetag.csv")
+```
+
+**Custom Functions Used:**
+
+- [merge_kraken_reports()](#merge_kraken_reports)
+
+**Parameter Definitions:**
+
+- `reports_dir` - path to the directory containing the kraken2 report files to merge
+- `x`  - feature table dataframe to write to file
+- `file` - path to where to write the kraken2 species count table
+
+**Input Data:**
+
+- \*-kraken2-report.tsv (kraken2 report from each sample to compile, outputs from [Step 10b](#10b-kraken2-taxonomic-classification))
+
+**Output Data:**
+
+- **kraken2_species_table_GLlblMetag.csv** (kraken species count table in csv format)
+
+
+##### 10cii. Compile Kraken2 Taxonomy Reports
+
+```bash
+multiqc --zip-data-dir \
+        --outdir kraken2_multiqc_report \
+        --filename kraken2_multiqc_GLlblMetag \
+        --interactive \
+        /path/to/*kraken2-report.tsv
+```
+
+**Parameter Definitions:**
+
+- `--zip-data-dir` - Compress the data directory.
+- `--outdir` - Specifies the output directory to store results.
+- `--filename` - Specifies the filename prefix of results.
+- `--interactive` - Force multiqc to always create interactive javascript plots.
+- `/path/to/*kraken2-report.tsv` - The kraken2 output report files, provided as a positional argument.
+
+**Input Data:**
+
+- \*-kraken2-report.tsv (kraken2 report from each sample to compile, outputs from [Step 10b](#10b-kraken2-taxonomic-classification))
+
+**Output Data:**
+
+- **kraken2_multiqc_GLlblMetag.html** (multiqc output html summary)
+- **kraken2_multiqc_GLlblMetag_data.zip** (zip archive containing multiqc output data)
+
+
+#### 10d. Convert Kraken2 Output to Krona Format
+
+```bash
+kreport2krona.py --report-file sample-kraken2-report.tsv  \
+                 --output sample.krona
+```
+
+**Parameter Definitions:**
+
+- `--report-file` - Specifies the name of the input kraken2 report file.
+- `--output` - Specifies the name of the krona output file.
+
+**Input Data:**
+
+- sample-kraken2-report.tsv (kraken2 report, output from [Step 10b](#10b-kraken2-taxonomic-classification))
+
+**Output Data:**
+
+- sample.krona (krona formatted kraken2 output)
+
+
+#### 10e. Compile Kraken2 Krona Reports
+
+```bash
+# Find, list and write all .krona files to file 
+find . -type f -name "*.krona" | sort -uV > krona_files.txt
+
+FILES=($(find . -type f -name "*.krona"))
+basename --multiple --suffix='.krona' ${FILES[*]} | sort -uV  > sample_names.txt
+
+# Create ktImportText input format files
+KTEXT_FILES=($(paste -d',' "krona_files.txt" "sample_names.txt"))
+
+# Create html   
+ktImportText -o kraken2-report_GLlblMetag.html ${KTEXT_FILES[*]}
+```
+
+**Parameter Definitions:**
+
+**find**
+
+- `-type f` -  Specifies that the type of file to find is a regular file.
+- `-name "*.krona"` - Specifies to find files ending with the .krona suffix.  
+
+**sort**
+
+- `-u` - Specifies to perform a unique sort.
+- `-V` - Specifies a natural (version) sort, so that numbers embedded in names are ordered numerically (e.g., sample_2 before sample_10).
+- `> krona_files.txt`, `> sample_names.txt` - Redirects each sorted list to a separate text file.
+
+**basename**
+
+- `--multiple` - Support multiple arguments and treat each as a file name.
+- `--suffix='.krona'` - Remove a trailing '.krona' suffix.
+
+**paste**
+
+- `-d','` - Paste both krona and sample files together line by line delimited by comma ','.
+
+**ktImportText**
+
+- `-o` - Specifies the compiled output html file name.
+- `${KTEXT_FILES[*]}` - An array positional argument with the following content: sample_1.krona,sample_1 sample_2.krona,sample_2 ... sample_n.krona,sample_n.
+
+**Input Data:**
+
+- *.krona (all sample .krona formatted files, output from [Step 10d](#10d-convert-kraken2-output-to-krona-format)) 
+
+                      
+**Output Data:**
+
+- krona_files.txt (sorted list of all *.krona files)
+- sample_names.txt (sorted list of all sample names)
+- **kraken2-report_GLlblMetag.html** (compiled krona html report containing all samples)
+
+---
+
+#### 10f. Filter Kraken2 Species Count Table
+
+```R
+library(tidyverse)
+
+input_file <- "kraken2_species_table_GLlblMetag.csv"
+output_file <- "kraken2_filtered_species_table_GLlblMetag.csv"
+threshold <- 0.5
+
+# string used to define non-microbial taxa
+non_microbial <- "UNCLASSIFIED|Unclassifed|unclassified|Homo sapien|cannot|uncultured|unidentified"
+
+# read in feature table
+feature_table <- read_csv(input_file) %>% as.data.frame()
+feature_name <- colnames(feature_table)[1]
+rownames(feature_table) <- feature_table[,1]
+feature_table <- feature_table[, -1]
+
+# read-based count table
+table2write <- filter_rare(feature_table, non_microbial, threshold = threshold) %>%
+  as.data.frame() %>%
+  rownames_to_column(feature_name)
+
+write_csv(x = table2write, file = output_file)
+```
+
+**Custom Functions Used:**
+
+- [filter_rare()](#filter_rare)
+
+**Parameter Definitions:**
+
+- `threshold` - threshold for filtering out rare taxa, a percentage between 0 and 100.
+- `output_file` - output filename
+- `input_file` - input filename
+
+**Input Data:**
+
+- kraken2_species_table_GLlblMetag.csv (path to kraken2 species table from [Step 10ci](#10ci-create-merged-kraken2-taxonomy-table))
+
+**Output Data:**
+
+- **kraken2_filtered_species_table_GLlblMetag.csv** - a file containing the filtered species table
+
+---
+
+
+#### 10g. Taxonomy barplots
+
+```R
+library(tidyverse)
+
+species_table_file <- "kraken2_species_table_GLlblMetag.csv"
+filtered_species_table_file <- "kraken2_filtered_species_table_GLlblMetag.csv"
+metadata_file <- "/path/to/sample/metadata"
+number_samples <- 10 
+
+# set width based on number of samples, with a cap at 50 inches
+plot_width <- 2 * number_samples
+if(plot_width > 50) { plot_width = 50 }
+
+p <- make_barplot(metadata_file = metadata_file, feature_table_file = species_table_file, 
+                  feature_column = "species", samples_column = "sample_id", group_column = "group",
+                  publication_format = publication_format, custom_palette = custom_palette)
+
+# Save static unfiltered plot
+ggsave(filename = "kraken2_unfiltered_species_barplot_GLlblMetag.png", plot = p,
+       device = "png", width = plot_width, height = 10, units = "in", dpi = 300, limitsize = FALSE)
+
+# Save interactive unfiltered plot
+htmlwidgets::saveWidget(plotly::ggplotly(p), "kraken2_unfiltered_species_barplot_GLlblMetag.html", selfcontained = TRUE)
+
+p <- make_barplot(metadata_file = metadata_file, feature_table_file = filtered_species_table_file, 
+                  feature_column = "Species", samples_column = "sample_id", group_column = "group",
+                  publication_format = publication_format, custom_palette = custom_palette)
+
+# Save static filtered plot
+ggsave(filename = "kraken2_filtered_species_barplot_GLlblMetag.png", plot = p,
+       device = "png", width = plot_width, height = 10, units = "in", dpi = 300, limitsize = FALSE)
+
+# Save interactive filtered plot
+htmlwidgets::saveWidget(plotly::ggplotly(p), "kraken2_filtered_species_barplot_GLlblMetag.html", selfcontained = TRUE)
+```
+**Custom Functions Used:**
+- [make_barplot()](#make_plot)
+
+**Parameter Definitions:**
+
+- `species_table_file` - a file containing the species count table
+- `filtered_species_table_file` - a file containing the filtered species count table
+- `metadata_file` - a file containing group information for each sample in the species count files
+- `number_samples` - the total number of samples in the species count files, adjust based on number of input samples.
+
+**Input Data:**
+
+- `kraken2_species_table_GLlblMetag.csv` (a file containing the species count table, output from [Step 10e](#10e-create-kraken2-species-count-table))
+- `kraken2_filtered_species_table_GLlblMetag.csv` (a file containing the filtered species count table, output from [Step 10f](#10f-filter-kraken2-species-count-table))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
+
+**Output Data:**
+
+- kraken2_unfiltered_species_barplot_GLlblMetag.png (taxonomy barplot without filtering)
+- **kraken2_unfiltered_species_barplot_GLlblMetag.html** (interactive taxonomy barplot without filtering)
+- kraken2_filtered_species_barplot_GLlblMetag.png (taxonomy barplot after filtering rare and non-microbial taxa)
+- **kraken2_filtered_species_barplot_GLlblMetag.html** (interactive taxonomy barplot after filtering rare and non-microbial taxa)
+
+
+#### 10h. Feature decontamination
+
+> Species-level features are decontaminated with decontam, an R package that statistically identifies contaminating features in a feature table.
+
+```R
+library(tidyverse)
+library(decontam)
+library(phyloseq)
+library(plotly)
+
+feature_table_file <- "kraken2_filtered_species_table_GLlblMetag.csv"
+metadata_table <- "/path/to/sample/metadata"
+number_of_samples <- NumberOfSamples
+
+# set width based on number of samples, with a cap at 50 inches
+plot_width <- 2 * number_of_samples
+if(plot_width > 50) { plot_width = 50 }
+
+decontaminated_table <- feature_decontam(metadata_file = metadata_table, 
+                                         feature_table_file = feature_table_file, 
+                                         feature_column = "species", 
+                                         samples_column = "sample_id",
+                                         prevalence_column = "NTC", 
+                                         ntc_name = "TRUE", 
+                                         frequency_column = "concentration", 
+                                         threshold = 0.1, 
+                                         classification_method = "kraken2", 
+                                         output_prefix = "", 
+                                         assay_suffix = "_GLlblMetag")
+
+# Convert count matrix to relative abundance matrix
+decontaminated_species_table <- count_to_rel_abundance(decontaminated_table)
+
+# Make plot after filtering out contaminants
+p <- make_plot(decontaminated_species_table, metadata, custom_palette, publication_format)
+
+# Save static decontaminated plot
+ggsave(filename = "kraken2_decontam_species_barplot_GLlblMetag.png", plot = p,
+       device = "png", width = plot_width, height = 10, units = "in", dpi = 300)
+
+# Save interactive decontaminated plot
+htmlwidgets::saveWidget(ggplotly(p), "kraken2_decontam_species_barplot_GLlblMetag.html", selfcontained = TRUE)
+```
+
+**Custom Functions Used:**
+- [feature_decontam()](#feature_decontam)
+- [make_plot()](#make_plot)
+- [count_to_rel_abundance()](#count_to_rel_abundance)
+
+**Parameter Definitions:**
+- `metadata_table` - path to a file with samples as rows and columns describing each sample
+- `feature_table_file` - path to a comma-separated feature table (i.e. a species/functions table) 
+                          with species/functions as the first column and samples as the other columns.
+- `number_of_samples` - the total number of samples in the species count files, adjust based on number of input samples.
+
+**Input Data:**
+
+- `kraken2_filtered_species_table_GLlblMetag.csv` (path to filtered species count per sample, output from [Step 10f](#10f-filter-kraken2-species-count-table))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
+
+**Output Data:**
+
+- **kraken2_decontam_results_GLlblMetag.csv** (decontam's result table, output from [feature_decontam() function](#feature_decontam))
+- **kraken2_decontam_species_table_GLlblMetag.csv** (decontaminated species table, output from [feature_decontam() function](#feature_decontam))
+- **kraken2_decontam_species_barplot_GLlblMetag.png** (barplot after filtering out contaminants)
+- **kraken2_decontam_species_barplot_GLlblMetag.html** (barplot after filtering out contaminants)
+
+<br>
+
+---
+
+## Assembly-based Processing
+
+### 11. Sample Assembly
+
+```bash
+flye --meta \
+     --threads NumberOfThreads \
+     --out-dir sample/ \
+     --nano-hq \
+     /path/to/sample_GLlblMetag_decontam.fastq.gz
+
+# rename output files            
+mv sample/assembly.fasta sample_assembly.fasta
+mv sample/flye.log sample_assembly.log
+```
+
+**Parameter Definitions:**
+
+- `--meta` – Use metagenome/uneven coverage mode.
+- `--threads` - Number of parallel processing threads to use.
+- `--out-dir` - Specifies the name of the output directory.
+- `--nano-hq` - Specifies that the input is Oxford Nanopore high-quality reads (Guppy5+ SUP or Q20, <5% error). The resulting assembly is polished further with medaka in the next step.
+- `/path/to/sample_GLlblMetag_decontam.fastq.gz` - Path to the input file, specified as a positional argument.
+
+**Input Data**
+
+- sample_GLlblMetag_decontam.fastq.gz or sample_GLlblMetag_HostRm.fastq.gz (filtered and trimmed sample reads with both 
+    contaminants and human reads (and optionally host reads) removed, gzipped fastq file, 
+    output from [Step 7e](#7e-generate-decontaminated-read-files) or [Step 8b](#8b-remove-host-reads))
+
+**Output Data**
+
+- sample_assembly.fasta (sample assembly fasta)
+- sample_assembly.log (flye log file)
+
+<br>
+
+---
+
+### 12. Polish Assembly
+
+```bash
+medaka_consensus -t NumberOfThreads \
+                 -i /path/to/sample_GLlblMetag_decontam.fastq.gz \
+                 -d /path/to/assemblies/sample_assembly.fasta \
+                 -o sample/
+  
+mv sample/consensus.fasta sample_polished.fasta
+```
+
+**Parameter Definitions:**
+
+- `-t` - Number of parallel processing threads to use.
+- `-i` - Specifies path to input read files used in creating the assembly.
+- `-d` - Specifies path to the assembly fasta file.
+- `-o` - Specifies the output directory.
+
+**Input Data:**
+
+- sample_GLlblMetag_decontam.fastq.gz or sample_GLlblMetag_HostRm.fastq.gz (filtered and trimmed sample reads with both 
+    contaminants and human reads (and optionally host reads) removed, gzipped fastq file, 
+    output from [Step 7e](#7e-generate-decontaminated-read-files) or [Step 8b](#8b-remove-host-reads))
+- /path/to/assemblies/sample_assembly.fasta (sample assembly, output from [Step 11](#11-sample-assembly))
+
+**Output Data:**
+
+- sample_polished.fasta (polished sample assembly)
+
+---
+
+### 13. Rename Contigs and Summarize Assemblies
+
+#### 13a. Rename Contig Headers
+
+```bash
+bit-rename-fasta-headers -i sample_polished.fasta \
+                         -w c_sample \
+                         -o sample-assembly_GLlblMetag.fasta
+```
+
+**Parameter Definitions:**  
+
+- `-i` – Specifies the input fasta file.
+- `-w` – Specifies the wanted header prefix (a number will be appended for each contig). The prefix starts with a "c" so that headers never begin with a number, which can be problematic.
+- `-o` – Specifies the output fasta file.
+
+
+**Input Data:**
+
+- sample_polished.fasta (polished assembly file from [Step 12](#12-polish-assembly))
+
+**Output Data:**
+
+- **sample-assembly_GLlblMetag.fasta** (contig-renamed assembly file)
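+
+As an illustration of what this renaming does, the same effect can be approximated with `awk`. This is a minimal sketch, not the `bit-rename-fasta-headers` implementation: the `rename_headers` helper, the toy file names, and the `<prefix>_<number>` header layout are assumptions.
+
+```bash
+# Hypothetical helper: replace each fasta header with "<prefix>_<running number>"
+rename_headers() {
+    local prefix="$1" fasta="$2"
+    awk -v p="$prefix" '/^>/ { n++; print ">" p "_" n; next } { print }' "$fasta"
+}
+
+# Toy input with two contigs (made-up sequences)
+rename_dir=$(mktemp -d)
+printf '>contig_1 length=8\nACGTACGT\n>contig_2 length=4\nGGCC\n' > "$rename_dir/toy.fasta"
+
+rename_headers c_sample "$rename_dir/toy.fasta" > "$rename_dir/toy-renamed.fasta"
+cat "$rename_dir/toy-renamed.fasta"
+```
+
+In this sketch the original header text is dropped entirely and only the sequence lines are carried over unchanged.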
+
+
+#### 13b. Summarize Assemblies
+
+```bash
+bit-summarize-assembly -o assembly-summaries_GLlblMetag.tsv \
+                       *-assembly_GLlblMetag.fasta
+```
+
+**Parameter Definitions:**  
+
+- `-o` – Specifies the output summary table.
+- `*-assembly_GLlblMetag.fasta` - Specifies the input assemblies to summarize, provided as positional arguments.
+
+**Input Data:**
+
+- *-assembly_GLlblMetag.fasta (contig-renamed assembly files from [Step 13a](#13a-rename-contig-headers))
+
+**Output Data:**
+
+- **assembly-summaries_GLlblMetag.tsv** (table of assembly summary statistics)
+
+<br>
+
+---
+
+### 14. Gene Prediction
+
+#### 14a. Generate Gene Predictions
+
+```bash
+prodigal -a sample-genes.faa \
+         -d sample-genes.fasta \
+         -f gff \
+         -p meta \
+         -c \
+         -q \
+         -o sample-genes_GLlblMetag.gff \
+         -i sample-assembly_GLlblMetag.fasta
+```
+
+**Parameter Definitions:**
+
+- `-a` – Specifies the output amino acid sequences file.
+- `-d` – Specifies the output nucleotide sequences file.
+- `-f` – Specifies the gene-calls output format, gff = GFF format.
+- `-p` – Specifies which mode to run the gene-caller in, meta = metagenomic mode. 
+- `-c` – No incomplete genes reported. 
+- `-q` – Run in quiet mode (don’t output progress on each contig). 
+- `-o` – Specifies the name of the output gene-calls file. 
+- `-i` – Specifies the input assembly file.
+
+**Input Data:**
+
+- sample-assembly_GLlblMetag.fasta (contig-renamed assembly file from [Step 13a](#13a-rename-contig-headers))
+
+**Output Data:**
+
+- sample-genes.faa (gene-calls amino-acid fasta file)
+- sample-genes.fasta (gene-calls nucleotide fasta file)
+- **sample-genes_GLlblMetag.gff** (gene-calls in general feature format)
+
+<br>
+
+#### 14b. Remove Line Wraps In Gene Prediction Output
+
+```bash
+bit-remove-wraps sample-genes.faa > sample-genes.faa.tmp 2> /dev/null
+mv sample-genes.faa.tmp sample-genes_GLlblMetag.faa
+
+bit-remove-wraps sample-genes.fasta > sample-genes.fasta.tmp 2> /dev/null
+mv sample-genes.fasta.tmp sample-genes_GLlblMetag.fasta
+```
+
+**Input Data:**
+
+- sample-genes.faa (gene-calls amino-acid fasta file, output from [Step 14a](#14a-generate-gene-predictions))
+- sample-genes.fasta (gene-calls nucleotide fasta file, output from [Step 14a](#14a-generate-gene-predictions))
+
+**Output Data:**
+
+- **sample-genes_GLlblMetag.faa** (gene-calls amino-acid fasta file with line wraps removed)
+- **sample-genes_GLlblMetag.fasta** (gene-calls nucleotide fasta file with line wraps removed)
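+
+To show what removing line wraps means, the sketch below joins each wrapped sequence onto a single line with `awk`. It is only an approximation of the behavior described above, not the `bit-remove-wraps` implementation; the `unwrap_fasta` helper and the toy file are assumptions.
+
+```bash
+# Hypothetical helper: print each header, then its sequence joined onto one line
+unwrap_fasta() {
+    awk '/^>/ { if (seq != "") print seq; print; seq = ""; next }
+         { seq = seq $0 }
+         END { if (seq != "") print seq }' "$1"
+}
+
+# Toy protein fasta whose first sequence is wrapped over two lines
+wrap_dir=$(mktemp -d)
+printf '>gene_1\nMKT\nLLV\n>gene_2\nMA\n' > "$wrap_dir/wrapped.faa"
+
+unwrap_fasta "$wrap_dir/wrapped.faa" > "$wrap_dir/unwrapped.faa"
+cat "$wrap_dir/unwrapped.faa"
+```
+
+With one line per sequence, line-based tools such as `grep -A 1` can pull out a gene together with its full sequence.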
+
+<br>
+
+---
+
+### 15. Functional Annotation
+
+> **Note:**  
+> The annotation process overwrites the same temporary directory by default. When running multiple 
+processes at a time, it is necessary to specify a specific temporary directory with the 
+`--tmp-dir` argument as shown below.
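+
+One way to keep concurrent runs from colliding is to derive the temporary directory from the sample name. The sketch below is a dry run that only prints the commands it would launch; the `build_annotation_cmds` helper, the sample names, and the thread count of 4 are assumptions for illustration.
+
+```bash
+# Hypothetical dry run: print one exec_annotation command per sample,
+# each with its own --tmp-dir so parallel runs never share a directory
+build_annotation_cmds() {
+    local sample
+    for sample in "$@"; do
+        printf 'exec_annotation -p profiles/ -k ko_list --cpu 4 -f detail-tsv -o %s-KO-tab.tmp --tmp-dir %s-tmp-KO --report-unannotated %s-genes_GLlblMetag.faa\n' \
+            "$sample" "$sample" "$sample"
+    done
+}
+
+annotation_cmds=$(build_annotation_cmds sampleA sampleB)
+printf '%s\n' "$annotation_cmds"
+```
+
+Printing the commands first makes it easy to confirm that every sample gets a distinct temporary directory before anything is run.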
+
+
+#### 15a. Download Reference Database of HMM Models
+
+> **Note:** This step only needs to be done once.
+
+```bash
+curl -LO ftp://ftp.genome.jp/pub/db/kofam/profiles.tar.gz
+curl -LO ftp://ftp.genome.jp/pub/db/kofam/ko_list.gz
+tar -xzvf profiles.tar.gz
+gunzip ko_list.gz 
+```
+
+#### 15b. Run KEGG Annotation
+
+```bash
+exec_annotation -p profiles/ \
+                -k ko_list \
+                --cpu NumberOfThreads \
+                -f detail-tsv \
+                -o sample-KO-tab.tmp \
+                --tmp-dir sample-tmp-KO \
+                --report-unannotated \
+                sample-genes_GLlblMetag.faa 
+```
+
+**Parameter Definitions:**
+
+- `-p` – Specifies the directory holding the downloaded reference HMMs.
+- `-k` – Specifies the downloaded reference KO (KEGG Orthology) terms. 
+- `--cpu` – Specifies the number of searches to run in parallel.
+- `-f` – Specifies the output format.
+- `-o` – Specifies the output file name.
+- `--tmp-dir` – Specifies the temporary directory to write to (needed if running more than one process concurrently, see Note above).
+- `--report-unannotated` – Specifies to generate an output for each entry, even when no KO is assigned.
+- `sample-genes_GLlblMetag.faa` – Specifies the input file, provided as a positional argument. 
+
+
+**Input Data:**
+
+- sample-genes_GLlblMetag.faa (amino-acid fasta file, output from [Step 14b](#14b-remove-line-wraps-in-gene-prediction-output))
+- profiles/ (reference directory holding the KO HMMs, downloaded in [Step 15a](#15a-download-reference-database-of-hmm-models))
+- ko_list (reference list of KOs to scan for, downloaded in [Step 15a](#15a-download-reference-database-of-hmm-models))
+
+**Output Data:**
+
+- sample-KO-tab.tmp (table of KO annotations assigned to gene IDs)
+
+
+#### 15c. Filter KO Outputs
+*Filter KO outputs to retain only those passing the KO-specific score and top hits.*
+
+```bash
+bit-filter-KOFamScan-results -i sample-KO-tab.tmp \
+                             -o sample-annotations.tsv
+
+# removing temporary files
+rm -rf sample-tmp-KO/ sample-KO-tab.tmp
+```
+
+**Parameter Definitions:**  
+
+- `-i` – Specifies the input table.
+- `-o` – Specifies the output table.
+
+**Input Data:**
+
+- sample-KO-tab.tmp (table of KO annotations assigned to gene IDs, output from [Step 15b](#15b-run-kegg-annotation))
+
+**Output Data:**
+
+- sample-annotations.tsv (table of KO annotations assigned to gene IDs)
+
+<br>
+
+---
+
+### 16. Taxonomic Classification 
+
+#### 16a. Pull and Unpack Pre-built Reference DB 
+
+> **Note:** This step only needs to be done once.
+
+```bash
+wget tbb.bio.uu.nl/bastiaan/CAT_prepare/CAT_prepare_20200618.tar.gz
+tar -xvzf CAT_prepare_20200618.tar.gz
+```
+
+#### 16b. Run Taxonomic Classification
+
+```bash
+CAT contigs -c sample-assembly_GLlblMetag.fasta \
+            -d CAT_prepare_20200618/2020-06-18_database/ \
+            -t CAT_prepare_20200618/2020-06-18_taxonomy/ \
+            -p sample-genes_GLlblMetag.faa \
+            -o sample-tax-out.tmp \
+            -n NumberOfThreads \
+            -r 3 \
+            --top 4 \
+            --I_know_what_Im_doing \
+            --no-stars
+```
+
+**Parameter Definitions:**  
+
+- `-c` – Specifies the input assembly fasta file.
+- `-d` – Specifies the CAT reference sequence database.
+- `-t` – Specifies the CAT reference taxonomy database.
+- `-p` – Specifies the input protein fasta file.
+- `-o` – Specifies the output file prefix.
+- `-n` – Specifies the number of CPU cores to use.
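+- `-r` – Specifies the range parameter: protein hits with a bit-score within this percentage of the top hit are considered in assigning taxonomy.
+- `--top` – Specifies the DIAMOND `--top` parameter: protein alignments scoring within this percentage of the top alignment are stored.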
+- `--I_know_what_Im_doing` – Allows us to alter the `--top` parameter.
+- `--no-stars` - Suppress marking of suggestive taxonomic assignments.
+
+**Input Data:**
+
+- CAT_prepare_20200618/2020-06-18_database/ (directory holding the CAT reference sequence database, output from [Step 16a](#16a-pull-and-unpack-pre-built-reference-db))
+- CAT_prepare_20200618/2020-06-18_taxonomy/ (directory holding the CAT reference taxonomy database, output from [Step 16a](#16a-pull-and-unpack-pre-built-reference-db))
+- sample-assembly_GLlblMetag.fasta (contig-renamed assembly file from [Step 13a](#13a-rename-contig-headers))
+- sample-genes_GLlblMetag.faa (amino-acid fasta file, output from [Step 14b](#14b-remove-line-wraps-in-gene-prediction-output))
+
+**Output Data:**
+
+- sample-tax-out.tmp.ORF2LCA.txt (gene-calls taxonomy file)
+- sample-tax-out.tmp.contig2classification.txt (contig taxonomy file)
+
+
+#### 16c. Add Taxonomy Info From Taxids To Genes
+
+```bash
+CAT add_names -i sample-tax-out.tmp.ORF2LCA.txt \
+              -o sample-gene-tax-out.tmp \
+              -t CAT_prepare_20200618/2020-06-18_taxonomy/ \
+              --only_official \
+              --exclude-scores
+```
+
+**Parameter Definitions:**  
+
+- `-i` – Specifies the input taxonomy file.
+- `-o` – Specifies the output file name.
+- `-t` – Specifies the CAT reference taxonomy database.
+- `--only_official` – Specifies to add only standard taxonomic ranks.
+- `--exclude-scores` - Specifies to exclude bit-score support scores in the lineage.
+
+**Input Data:**
+
+- sample-tax-out.tmp.ORF2LCA.txt (gene-calls taxonomy file, output from [Step 16b](#16b-run-taxonomic-classification))
+- CAT_prepare_20200618/2020-06-18_taxonomy/ (directory holding the CAT reference taxonomy database, output from [Step 16a](#16a-pull-and-unpack-pre-built-reference-db))
+
+**Output Data:**
+
+- sample-gene-tax-out.tmp (gene-calls taxonomy file with lineage info added)
+
+
+#### 16d. Add Taxonomy Info From Taxids To Contigs
+
+```bash
+CAT add_names -i sample-tax-out.tmp.contig2classification.txt \
+              -o sample-contig-tax-out.tmp \
+              -t CAT_prepare_20200618/2020-06-18_taxonomy/ \
+              --only_official \
+              --exclude-scores
+```
+
+**Parameter Definitions:**  
+
+- `-i` – Specifies the input taxonomy file.
+- `-o` – Specifies the output file name.
+- `-t` – Specifies the CAT reference taxonomy database.
+- `--only_official` – Specifies to add only standard taxonomic ranks.
+- `--exclude-scores` - Specifies to exclude bit-score support scores in the lineage.
+
+**Input Data:**
+
+- sample-tax-out.tmp.contig2classification.txt (contig taxonomy file, output from [Step 16b](#16b-run-taxonomic-classification))
+- CAT_prepare_20200618/2020-06-18_taxonomy/ (directory holding the CAT reference taxonomy database, output from [Step 16a](#16a-pull-and-unpack-pre-built-reference-db))
+
+**Output Data:**
+
+- sample-contig-tax-out.tmp (contig taxonomy file with lineage info added)
+
+
+#### 16e. Format Gene-level Output With awk and sed
+
+```bash
+awk -F $'\t' ' BEGIN { OFS=FS } { if ( $3 == "lineage" ) { print $1,$3,$5,$6,$7,$8,$9,$10,$11 } \
+    else if ( $2 == "ORF has no hit to database" || $2 ~ /^no taxid found/ ) \
+    { print $1,"NA","NA","NA","NA","NA","NA","NA","NA" } else { n=split($3,lineage,";"); \
+    print $1,lineage[n],$5,$6,$7,$8,$9,$10,$11 } } ' sample-gene-tax-out.tmp | \
+    sed 's/no support/NA/g' | sed 's/superkingdom/domain/' | sed 's/# ORF/gene_ID/' | \
+    sed 's/lineage/taxid/'  > sample-gene-tax-out.tsv
+```
+
+**Input Data:**
+
+- sample-gene-tax-out.tmp (gene-calls taxonomy file with lineage info added, output from [Step 16c](#16c-add-taxonomy-info-from-taxids-to-genes))
+
+**Output Data:**
+
+- sample-gene-tax-out.tsv (reformatted gene-calls taxonomy file with lineage info)
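+
+The effect of this reformatting is easiest to see on a toy table. The sketch below applies the same logic (written as separate `awk` rules rather than one if/else chain) to three made-up rows; the `format_gene_tax` helper, the toy values, and the 11-column input layout implied by the field numbers above are assumptions.
+
+```bash
+# Hypothetical helper wrapping the same awk/sed logic as the command above
+format_gene_tax() {
+    awk -F '\t' 'BEGIN { OFS = FS }
+        $3 == "lineage" { print $1,$3,$5,$6,$7,$8,$9,$10,$11; next }
+        $2 == "ORF has no hit to database" || $2 ~ /^no taxid found/ {
+            print $1,"NA","NA","NA","NA","NA","NA","NA","NA"; next }
+        { n = split($3, lineage, ";"); print $1,lineage[n],$5,$6,$7,$8,$9,$10,$11 }' "$1" |
+        sed 's/no support/NA/g' | sed 's/superkingdom/domain/' |
+        sed 's/# ORF/gene_ID/' | sed 's/lineage/taxid/'
+}
+
+# Toy input: header, one classified gene, one gene without a database hit
+tax_dir=$(mktemp -d)
+printf '# ORF\tnumber of hits\tlineage\ttop bit-score\tsuperkingdom\tphylum\tclass\torder\tfamily\tgenus\tspecies\n' > "$tax_dir/toy-gene-tax-out.tmp"
+printf 'c_sample_1_1\t3\t1;131567;2;1224\t250.0\tBacteria\tPseudomonadota\tno support\tno support\tno support\tno support\tno support\n' >> "$tax_dir/toy-gene-tax-out.tmp"
+printf 'c_sample_1_2\tORF has no hit to database\n' >> "$tax_dir/toy-gene-tax-out.tmp"
+
+format_gene_tax "$tax_dir/toy-gene-tax-out.tmp" > "$tax_dir/toy-gene-tax-out.tsv"
+cat "$tax_dir/toy-gene-tax-out.tsv"
+```
+
+Only the last taxid of each lineage is kept, so the `taxid` column holds the most specific assignment for the gene.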
+
+
+#### 16f. Format Contig-level Output With awk and sed
+
+```bash
+awk -F $'\t' ' BEGIN { OFS=FS } { if ( $2 == "classification" ) { print $1,$4,$6,$7,$8,$9,$10,$11,$12 } \
+    else if ( $2 == "no taxid assigned" ) { print $1,"NA","NA","NA","NA","NA","NA","NA","NA" } \
+    else { n=split($4,lineage,";"); print $1,lineage[n],$6,$7,$8,$9,$10,$11,$12 } } ' sample-contig-tax-out.tmp | \
+    sed 's/no support/NA/g' | sed 's/superkingdom/domain/' | sed 's/^# contig/contig_ID/' | \
+    sed 's/lineage/taxid/' > sample-contig-tax-out.tsv
+
+# clearing intermediate files
+rm sample*.tmp*
+```
+
+**Input Data:**
+
+- sample-contig-tax-out.tmp (contig taxonomy file with lineage info added, output from [Step 16d](#16d-add-taxonomy-info-from-taxids-to-contigs))
+
+**Output Data:**
+
+- sample-contig-tax-out.tsv (reformatted contig taxonomy file with lineage info)
+
+<br>
+
+---
+
+### 17. Read-Mapping
+
+#### 17a. Align Reads to Sample Assembly
+
+```bash
+minimap2 -a \
+         -x map-ont \
+         -t NumberOfThreads \
+         sample-assembly_GLlblMetag.fasta \
+         sample_GLlblMetag_decontam.fastq.gz \
+         > sample.sam  2> sample-mapping-info_GLlblMetag.txt
+```
+
+**Parameter Definitions:**
+
+- `-a` – Output in SAM format.
+- `-x map-ont` - Specifies preset for mapping Nanopore reads to a reference.
+- `-t` - Number of parallel processing threads to use.
+- `sample-assembly_GLlblMetag.fasta` – Assembly fasta file, provided as a positional argument.
+- `sample_GLlblMetag_decontam.fastq.gz` - Input sequence data file, provided as a positional argument.
+- `> sample.sam` - Redirects the output to a separate file.
+- `2> sample-mapping-info_GLlblMetag.txt` - Redirects the standard error to a separate file.
+
+**Input Data**
+
+- sample-assembly_GLlblMetag.fasta (contig-renamed assembly file, output from [Step 13a](#13a-rename-contig-headers))
+- sample_GLlblMetag_decontam.fastq.gz or sample_GLlblMetag_HostRm.fastq.gz (filtered and trimmed sample reads with both 
+    contaminants and human reads (and optionally host reads) removed, gzipped fastq file, 
+    output from [Step 7e](#7e-generate-decontaminated-read-files) or [Step 8b](#8b-remove-host-reads))
+
+**Output Data**
+
+- sample.sam (reads aligned to sample assembly in SAM format)
+- **sample-mapping-info_GLlblMetag.txt** (read mapping information)
+
+
+#### 17b. Sort and Index Assembly Alignments
+
+```bash
+# Sort Sam, convert to bam and create index
+samtools sort --threads NumberOfThreads \
+              -o sample_sorted_GLlblMetag.bam \
+              sample.sam > sample_sort.log 2>&1
+
+samtools index sample_sorted_GLlblMetag.bam sample_sorted_GLlblMetag.bam.bai
+```
+
+**Parameter Definitions:**
+
+**samtools sort**
+- `--threads` - Number of parallel processing threads to use.
+- `-o` - Specifies the output file for the sorted aligned reads.
+- `sample.sam` - Positional argument specifying the input SAM file.
+- `> sample_sort.log 2>&1` - Redirects the standard output and standard error to a separate file.
+
+**samtools index**
+- `sample_sorted_GLlblMetag.bam` - Positional argument specifying the input BAM file to be indexed.
+- `sample_sorted_GLlblMetag.bam.bai` - Positional argument specifying the name of the index file.
+
+**Input Data:**
+
+- sample.sam (reads aligned to sample assembly, output from [Step 17a](#17a-align-reads-to-sample-assembly))
+
+**Output Data:**
+
+- **sample_sorted_GLlblMetag.bam** (sorted mapping to sample assembly, in BAM format)
+- **sample_sorted_GLlblMetag.bam.bai** (index of sorted mapping to sample assembly)
+
+<br>
+
+---
+
+### 18. Get Coverage Information and Filter Based On Detection
+> **Note:**  
+> “Detection” is a measure of what proportion of a reference sequence recruited reads 
+(see the discussion of detection [here](http://merenlab.org/2017/05/08/anvio-views/#detection)). 
+Filtering based on detection is one way of helping to mitigate non-specific read-recruitment.
+
+#### 18a. Filter Coverage Levels Based On Detection
+
+```bash
+# pileup.sh is part of the BBTools (BBMap) package
+pileup.sh -in sample_sorted_GLlblMetag.bam \
+          fastaorf=sample-genes_GLlblMetag.fasta \
+          outorf=sample-gene-cov-and-det.tmp \
+          out=sample-contig-cov-and-det.tmp
+```
+
+**Parameter Definitions:**  
+
+- `-in` – Specifies the input BAM file.
+- `fastaorf=` – Specifies the input gene-calls nucleotide fasta file.
+- `outorf=` – Specifies the output gene-coverage tsv file name.
+- `out=` – Specifies the output contig-coverage tsv file name.
+
+**Input Data:**
+
+- sample_sorted_GLlblMetag.bam (sorted mapping to sample assembly BAM file, output from [Step 17b](#17b-sort-and-index-assembly-alignments))
+- sample-genes_GLlblMetag.fasta (gene-calls nucleotide fasta file with line wraps removed, output from [Step 14b](#14b-remove-line-wraps-in-gene-prediction-output))
+
+
+**Output Data:**
+
+- sample-gene-cov-and-det.tmp (gene-coverage tsv file)
+- sample-contig-cov-and-det.tmp (contig-coverage tsv file)
+
+
+#### 18b. Filter Gene and Contig Coverage Based On Detection
+
+> *The following commands filter the gene and contig coverage tsv files so that only genes and contigs with more than 50% detection (as defined above) keep their coverage values (all others are set to 0), then parse the tables to retain only the gene or contig IDs and their respective coverages.*
+
+```bash
+# Filtering gene coverage
+grep -v "#" sample-gene-cov-and-det.tmp | \
+awk -F $'\t' ' BEGIN { OFS=FS } { if ( $10 <= 0.5 ) $4 = 0 } \
+     { print $1,$4 } ' > sample-gene-cov.tmp
+
+cat <( printf "gene_ID\tcoverage\n" ) sample-gene-cov.tmp > sample-gene-coverages.tsv
+
+# Filtering contig coverage
+grep -v "#" sample-contig-cov-and-det.tmp | \
+awk -F $'\t' ' BEGIN { OFS=FS } { if ( $5 <= 50 ) $2 = 0 } \
+     { print $1,$2 } ' > sample-contig-cov.tmp
+
+cat <( printf "contig_ID\tcoverage\n" ) sample-contig-cov.tmp > sample-contig-coverages.tsv
+
+# removing intermediate files
+rm sample-*.tmp
+```
+
+**Input Data:**
+
+- sample-gene-cov-and-det.tmp (temporary gene-coverage tsv file, output from [Step 18a](#18a-filter-coverage-levels-based-on-detection))
+- sample-contig-cov-and-det.tmp (temporary contig-coverage tsv file, output from [Step 18a](#18a-filter-coverage-levels-based-on-detection))
+
+**Output Data:**
+
+- sample-gene-coverages.tsv (table with gene-level coverages)
+- sample-contig-coverages.tsv (table with contig-level coverages)
+
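+A small worked example may help show how the detection filter behaves. The sketch below runs the contig-level filter on a made-up table; the toy file, its values, and the five-column layout (ID, average fold coverage, length, GC, covered percent) are assumptions chosen to match the field numbers used above.
+
+```bash
+# Toy contig coverage table (made-up values); column 5 is the detection in percent
+det_dir=$(mktemp -d)
+printf '#ID\tAvg_fold\tLength\tRef_GC\tCovered_percent\n' > "$det_dir/toy-contig-cov-and-det.tmp"
+printf 'c_sample_1\t12.5\t2000\t0.50\t80.0\n' >> "$det_dir/toy-contig-cov-and-det.tmp"
+printf 'c_sample_2\t3.0\t1500\t0.45\t50.0\n' >> "$det_dir/toy-contig-cov-and-det.tmp"
+printf 'c_sample_3\t1.2\t1200\t0.40\t10.0\n' >> "$det_dir/toy-contig-cov-and-det.tmp"
+
+# Same filter as above: coverage is set to 0 when detection is 50% or lower
+grep -v "#" "$det_dir/toy-contig-cov-and-det.tmp" | \
+awk -F '\t' ' BEGIN { OFS=FS } { if ( $5 <= 50 ) $2 = 0 } { print $1,$2 } ' \
+    > "$det_dir/toy-contig-cov.tmp"
+
+cat <( printf "contig_ID\tcoverage\n" ) "$det_dir/toy-contig-cov.tmp" \
+    > "$det_dir/toy-contig-coverages.tsv"
+cat "$det_dir/toy-contig-coverages.tsv"
+```
+
+Every contig stays in the table; only the coverage value changes. Because the comparison is `<=`, a contig at exactly 50% detection is also set to 0.
+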
+<br>
+
+---
+
+### 19. Combine Gene-level Coverage, Taxonomy, and Functional Annotations For Each Sample
+> **Note:**  
+> Uses only the standard Unix commands `paste`, `tail`, `head`, `sort`, `cut`, and `cat` to combine gene-level coverage, taxonomy, and functional annotations into one table for each sample.  
+
+```bash
+paste <( tail -n +2 sample-gene-coverages.tsv | sort -V -k 1 ) \
+      <( tail -n +2 sample-annotations.tsv | sort -V -k 1 | cut -f 2- ) \
+      <( tail -n +2 sample-gene-tax-out.tsv | sort -V -k 1 | cut -f 2- ) \
+      > sample-gene-tab.tmp
+
+paste <( head -n 1 sample-gene-coverages.tsv ) \
+      <( head -n 1 sample-annotations.tsv | cut -f 2- ) \
+      <( head -n 1 sample-gene-tax-out.tsv | cut -f 2- ) \
+      > sample-header.tmp
+
+cat sample-header.tmp sample-gene-tab.tmp > sample-gene-coverage-annotation-and-tax_GLlblMetag.tsv
+
+# removing intermediate files
+rm sample*tmp sample-gene-coverages.tsv sample-annotations.tsv sample-gene-tax-out.tsv
+```
+
+**Input Data:**
+
+- sample-gene-coverages.tsv (table with gene-level coverages, output from [Step 18b](#18b-filter-gene-and-contig-coverage-based-on-detection))
+- sample-annotations.tsv (table of KO annotations assigned to gene IDs, output from [Step 15c](#15c-filter-ko-outputs))
+- sample-gene-tax-out.tsv (reformatted gene-calls taxonomy file with lineage info, output from [Step 16e](#16e-format-gene-level-output-with-awk-and-sed))
+
+
+**Output Data:**
+
+- **sample-gene-coverage-annotation-and-tax_GLlblMetag.tsv** (table with combined gene coverage, annotation, and taxonomy info)
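+
+The combination step can be tried on toy tables to see how the rows line up. In the sketch below the three input files, their column names, and their values are made up for illustration; the `paste`/`sort`/`cut` logic is the same as in the command above.
+
+```bash
+# Toy per-gene tables (made-up values); rows are deliberately in different orders
+combine_dir=$(mktemp -d)
+printf 'gene_ID\tcoverage\ng_2\t5.0\ng_1\t10.0\n' > "$combine_dir/toy-gene-coverages.tsv"
+printf 'gene_ID\tKO_ID\tKO_function\ng_1\tK00001\talcohol dehydrogenase\ng_2\tNA\tNA\n' > "$combine_dir/toy-annotations.tsv"
+printf 'gene_ID\ttaxid\tdomain\ng_2\t1224\tBacteria\ng_1\tNA\tNA\n' > "$combine_dir/toy-gene-tax-out.tsv"
+
+# Sort each table body by gene ID so the rows line up, then paste side by side
+paste <( tail -n +2 "$combine_dir/toy-gene-coverages.tsv" | sort -V -k 1 ) \
+      <( tail -n +2 "$combine_dir/toy-annotations.tsv" | sort -V -k 1 | cut -f 2- ) \
+      <( tail -n +2 "$combine_dir/toy-gene-tax-out.tsv" | sort -V -k 1 | cut -f 2- ) \
+      > "$combine_dir/toy-gene-tab.tmp"
+
+# Build the matching header the same way
+paste <( head -n 1 "$combine_dir/toy-gene-coverages.tsv" ) \
+      <( head -n 1 "$combine_dir/toy-annotations.tsv" | cut -f 2- ) \
+      <( head -n 1 "$combine_dir/toy-gene-tax-out.tsv" | cut -f 2- ) \
+      > "$combine_dir/toy-header.tmp"
+
+cat "$combine_dir/toy-header.tmp" "$combine_dir/toy-gene-tab.tmp" > "$combine_dir/toy-combined.tsv"
+cat "$combine_dir/toy-combined.tsv"
+```
+
+This approach relies on every table containing exactly the same set of gene IDs; a gene missing from any one table would shift the pasted rows out of alignment.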
+
+<br>
+
+---
+
+### 20. Combine Contig-level Coverage and Taxonomy For Each Sample
+> **Note:**  
+> Uses only the standard Unix commands `paste`, `tail`, `head`, `sort`, `cut`, and `cat` to combine contig-level coverage and taxonomy into one table for each sample.
+
+```bash
+paste <( tail -n +2 sample-contig-coverages.tsv | sort -V -k 1 ) \
+      <( tail -n +2 sample-contig-tax-out.tsv | sort -V -k 1 | cut -f 2- ) \
+      > sample-contig.tmp
+
+paste <( head -n 1 sample-contig-coverages.tsv ) \
+      <( head -n 1 sample-contig-tax-out.tsv | cut -f 2- ) \
+      > sample-contig-header.tmp
+      
+cat sample-contig-header.tmp sample-contig.tmp > sample-contig-coverage-and-tax_GLlblMetag.tsv
+
+# removing intermediate files
+rm sample*tmp sample-contig-coverages.tsv sample-contig-tax-out.tsv
+```
+
+**Input Data:**
+
+- sample-contig-coverages.tsv (table with contig-level coverages, output from [Step 18b](#18b-filter-gene-and-contig-coverage-based-on-detection))
+- sample-contig-tax-out.tsv (reformatted contig taxonomy file with lineage info, output from [Step 16f](#16f-format-contig-level-output-with-awk-and-sed))
+
+
+**Output Data:**
+
+- **sample-contig-coverage-and-tax_GLlblMetag.tsv** (table with combined contig coverage and taxonomy info)
+
+<br>
+
+---
+
+### 21. Generate Normalized, Gene- and Contig-level Coverage Summary Tables of KO-annotations and Taxonomy Across Samples
+
+> **Note:**  
+> * To combine across samples to generate these summary tables, we need the same "units". This is done for annotations 
+based on the assigned KO terms, and all non-annotated functions are included together as "Not annotated". It is done for 
+taxonomic classifications based on taxids (full lineages included in the table), and any genes not classified are included 
+together as "Not classified". 
+> * The values we are working with are coverage per gene (the number of bases recruited to the gene, normalized 
+by the length of the gene). These are then normalized by scaling the total coverage of each sample to 1,000,000 and setting 
+each individual gene-level coverage to its proportion of that 1,000,000 total. This is essentially a percentage, but out of 1,000,000 
+instead of 100, which makes the numbers easier to work with. 
+
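+The per-sample scaling described in the note can be sketched with a two-pass `awk` command. This only illustrates the arithmetic and is not how the `bit-GL-combine-*` programs are implemented; the `to_cpm` helper and the toy coverage values are assumptions.
+
+```bash
+# Hypothetical helper: scale column 2 so that the sample total becomes 1,000,000
+to_cpm() {
+    awk -F '\t' 'BEGIN { OFS = FS }
+         NR == FNR { if (FNR > 1) total += $2; next }   # first pass: sum the coverages
+         FNR == 1  { print $1, "CPM"; next }            # second pass: rewrite the header
+         { printf "%s\t%.1f\n", $1, (total > 0 ? $2 / total * 1000000 : 0) }' "$1" "$1"
+}
+
+# Toy gene coverage table (made-up values summing to 100)
+cpm_dir=$(mktemp -d)
+printf 'gene_ID\tcoverage\ng_1\t30\ng_2\t10\ng_3\t0\ng_4\t60\n' > "$cpm_dir/toy-gene-coverages.tsv"
+
+to_cpm "$cpm_dir/toy-gene-coverages.tsv" > "$cpm_dir/toy-gene-coverages-CPM.tsv"
+cat "$cpm_dir/toy-gene-coverages-CPM.tsv"
+```
+
+A gene holding 30% of the sample's total coverage ends up at 300,000, so values stay comparable between samples with different sequencing depths.
+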
+#### 21a. Generate Gene-level Coverage Summary Tables
+
+```bash
+bit-GL-combine-KO-and-tax-tables *-gene-coverage-annotation-and-tax_GLlblMetag.tsv \
+                                 -o Combined
+
+# add assay specific suffix
+mv Combined-gene-level-KO-function-coverages-CPM.tsv Combined-gene-level-KO-function-coverages-CPM_GLlblMetag.tsv
+mv Combined-gene-level-taxonomy-coverages-CPM.tsv Combined-gene-level-taxonomy-coverages-CPM_GLlblMetag.tsv
+mv Combined-gene-level-KO-function-coverages.tsv Combined-gene-level-KO-function-coverages_GLlblMetag.tsv
+mv Combined-gene-level-taxonomy-coverages.tsv Combined-gene-level-taxonomy-coverages_GLlblMetag.tsv
+```
+
+**Parameter Definitions:**  
+
+- `*-gene-coverage-annotation-and-tax_GLlblMetag.tsv` - Positional arguments specifying the input tsv files, can be provided as a space-delimited list of files, or with wildcards like above.
+- `-o` – Specifies the output file prefix.
+
+
+**Input Data:**
+
+- *-gene-coverage-annotation-and-tax_GLlblMetag.tsv (tables with combined gene coverage, annotation, and taxonomy info generated for individual samples, output from [Step 19](#19-combine-gene-level-coverage-taxonomy-and-functional-annotations-for-each-sample))
+
+**Output Data:**
+
+- **Combined-gene-level-KO-function-coverages-CPM_GLlblMetag.tsv** (table with all samples combined based on KO annotations; normalized to coverage per million genes covered)
+- **Combined-gene-level-taxonomy-coverages-CPM_GLlblMetag.tsv** (table with all samples combined based on gene-level taxonomic classifications; normalized to coverage per million genes covered)
+- **Combined-gene-level-KO-function-coverages_GLlblMetag.tsv** (table with all samples combined based on KO annotations)
+- **Combined-gene-level-taxonomy-coverages_GLlblMetag.tsv** (table with all samples combined based on gene-level taxonomic classifications)
+
+#### 21b. Generate Contig-level Coverage Summary Tables
+
+```bash
+bit-GL-combine-contig-tax-tables *-contig-coverage-and-tax_GLlblMetag.tsv -o Combined
+```
+
+**Parameter Definitions:**  
+
+- `*-contig-coverage-and-tax_GLlblMetag.tsv` - Positional arguments specifying the input tsv files, can be provided as a space-delimited list of files, or with wildcards like above.
+- `-o` – Specifies the output file prefix.
+
+
+**Input Data:**
+
+- *-contig-coverage-and-tax_GLlblMetag.tsv (tables with combined contig coverage and taxonomy info generated for individual samples, output from [Step 20](#20-combine-contig-level-coverage-and-taxonomy-for-each-sample))
+
+**Output Data:**
+
+- **Combined-contig-level-taxonomy-coverages-CPM_GLlblMetag.tsv** (table with all samples combined based on contig-level taxonomic classifications; normalized to coverage per million contigs covered)
+- **Combined-contig-level-taxonomy-coverages_GLlblMetag.tsv** (table with all samples combined based on contig-level taxonomic classifications)
+
+<br>
+
+---
+
+### 22. **M**etagenome-**A**ssembled **G**enome (MAG) Recovery
+
+#### 22a. Bin Contigs
+
+```bash
+jgi_summarize_bam_contig_depths --outputDepth sample-metabat-assembly-depth.tsv \
+                                --percentIdentity 97 \
+                                --minContigLength 1000 \
+                                --minContigDepth 1.0  \
+                                --referenceFasta sample-assembly.fasta \
+                                sample.bam
+
+metabat2  --inFile sample-assembly.fasta \
+          --outFile sample \
+          --abdFile sample-metabat-assembly-depth.tsv \
+          -t NumberOfThreads
+
+mkdir sample-bins
+mv sample*bin*.fasta sample-bins
+zip -r sample-bins.zip sample-bins
+```
+
+**Parameter Definitions:**  
+
+**jgi_summarize_bam_contig_depths**
+
+-  `--outputDepth` – Specifies the output depth file name.
+-  `--percentIdentity` – Minimum end-to-end percent identity of a mapped read to be included.
+-  `--minContigLength` – Minimum contig length to include.
+-  `--minContigDepth` – Minimum contig depth to include.
+-  `--referenceFasta` – Specifies the input assembly fasta file.
+-  `sample.bam` – Input alignment BAM file, specified as a positional argument.
+
+**metabat2**
+
+-  `--inFile` - Specifies the input assembly fasta file.
+-  `--outFile` - Specifies the prefix of the identified bins output files.
+-  `--abdFile` - The depth file generated by the previous `jgi_summarize_bam_contig_depths` command.
+-  `-t` - Number of parallel processing threads to use.
+
+
+**Input Data:**
+
+- sample-assembly.fasta (contig-renamed assembly file from [Step 13a](#13a-renaming-contig-headers))
+- sample.bam (sorted mapping to sample assembly BAM file, output from [Step 17b](#17b-sort-and-index-assembly-alignments))
+
+**Output Data:**
+
+- **sample-metabat-assembly-depth.tsv** (tab-delimited summary of coverages)
+- sample-bins/sample-bin\*.fasta (fasta files of recovered bins)
+- **sample-bins.zip** (zip file containing fasta files of recovered bins)
+
+#### 22b. Bin quality assessment 
+> Utilizes the default `checkm` database available [here](https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz), `checkm_data_2015_01_16.tar.gz`.
+
+```bash
+checkm lineage_wf -f bins-overview_GLlblMetag.tsv \
+                  --tab_table \
+                  -x fasta \
+                  ./ \
+                  checkm-output-dir
+```
+
+**Parameter Definitions:**  
+
+-  `lineage_wf` – Specifies the workflow being utilized.
+-  `-f` – Specifies the output summary file name.
+-  `--tab_table` – Specifies the output summary file should be a tab-delimited table.
+-  `-x` – Specifies the extension that is on the bin fasta files that are being assessed.
+-  `./` – Specifies the directory holding the bins, provided as a positional argument.
+-  `checkm-output-dir` – Specifies the primary checkm output directory, provided as a positional argument.
+
+**Input Data:**
+
+- sample-bins/sample-bin\*.fasta (fasta files of recovered bins, output from [Step 22a](#22a-bin-contigs))
+
+**Output Data:**
+
+- **bins-overview_GLlblMetag.tsv** (tab-delimited file with quality estimates per bin)
+- checkm-output-dir/ (directory holding detailed checkm outputs)
+
+#### 22c. Filter MAGs
+
+```bash
+# keep bins with est. completeness >= 90, est. redundancy <= 10, and
+# strain heterogeneity == 0 (columns 12-14 of the checkm summary table)
+cat <( head -n 1 bins-overview_GLlblMetag.tsv ) \
+    <( awk -F $'\t' ' $12 >= 90 && $13 <= 10 && $14 == 0 ' bins-overview_GLlblMetag.tsv | sed 's/bin./MAG-/' ) \
+    > checkm-MAGs-overview.tsv
+
+# copying bins into a MAGs directory in order to run tax classification
+awk -F $'\t' ' $12 >= 90 && $13 <= 10 && $14 == 0 ' bins-overview_GLlblMetag.tsv | cut -f 1 > MAG-bin-IDs.tmp
+
+mkdir MAGs
+# loop over the bin IDs listed in the file, not over the file name itself
+for ID in $(cat MAG-bin-IDs.tmp)
+do
+    MAG_ID=$(echo ${ID} | sed 's/bin./MAG-/')
+    cp ${ID}.fasta MAGs/${MAG_ID}.fasta
+done
+
+# packaging each sample's MAGs into its own zip file;
+# copying (not moving) so MAGs/ stays intact for classification
+for SAMPLE in $(sed 's/-bin.*//' MAG-bin-IDs.tmp | sort -u)
+do
+  mkdir ${SAMPLE}-MAGs
+  cp MAGs/${SAMPLE}-*MAG*.fasta ${SAMPLE}-MAGs
+  zip -r ${SAMPLE}-MAGs.zip ${SAMPLE}-MAGs
+done
+```
+
+**Input Data:**
+
+- bins-overview_GLlblMetag.tsv (tab-delimited file with quality estimates per bin from [Step 22b](#22b-bin-quality-assessment))
+
+**Output Data:**
+
+- checkm-MAGs-overview.tsv (tab-delimited file with quality estimates per MAG)
+- MAGs/\*.fasta (directory holding high-quality MAGs)
+- **\*-MAGs.zip** (zip files containing directories of high-quality MAGs)
+
+
+#### 22d. MAG Taxonomic Classification
+> Uses the default `gtdbtk` database, set up with the program's `download.sh` command.
+
+```bash
+gtdbtk classify_wf --genome_dir MAGs/ \
+                   -x fasta \
+                   --out_dir gtdbtk-output-dir \
+                   --skip_ani_screen
+```
+
+**Parameter Definitions:**  
+
+-  `classify_wf` – Specifies the workflow being utilized.
+-  `--genome_dir` – Specifies the directory holding the MAGs to classify.
+-  `-x` – Specifies the extension that is on the MAG fasta files that are being taxonomically classified.
+-  `--out_dir` – Specifies the output directory name.
+-  `--skip_ani_screen` - Skips the ANI screening step (which classifies genomes using mash and skani).
+
+**Input Data:**
+
+- MAGs/\*.fasta (directory holding high-quality MAGs, output from [Step 22c](#22c-filter-mags))
+
+**Output Data:**
+
+- gtdbtk-output-dir/gtdbtk.\*.summary.tsv (files with assigned taxonomy and info)
+
+#### 22e. Generate Overview Table Of All MAGs
+
+```bash
+# combine summaries
+for MAG in $(cut -f 1 assembly-summaries_GLlblMetag.tsv | tail -n +2); do
+
+    grep -w -m 1 "^${MAG}" checkm-MAGs-overview.tsv | cut -f 12,13,14 \
+        >> checkm-estimates.tmp
+
+    grep -w "^${MAG}" gtdbtk-output-dir/gtdbtk.*.summary.tsv | \
+    cut -f 2 | sed 's/^.__//' | \
+    sed 's/;.__/\t/g' | \
+    awk 'BEGIN{ OFS=FS="\t" } { for (i=1; i<=NF; i++) if ( $i ~ /^ *$/ ) $i = "NA" }; 1' \
+        >> gtdb-taxonomies.tmp
+
+done
+
+# Add headers
+cat <(printf "est. completeness\test. redundancy\test. strain heterogeneity\n") checkm-estimates.tmp \
+    > checkm-estimates-with-headers.tmp
+
+cat <(printf "domain\tphylum\tclass\torder\tfamily\tgenus\tspecies\n") gtdb-taxonomies.tmp \
+    > gtdb-taxonomies-with-headers.tmp
+
+paste assembly-summaries_GLlblMetag.tsv \
+checkm-estimates-with-headers.tmp \
+gtdb-taxonomies-with-headers.tmp \
+    > MAGs-overview.tmp
+
+# Ordering by taxonomy
+head -n 1 MAGs-overview.tmp > MAGs-overview-header.tmp
+
+tail -n +2 MAGs-overview.tmp | sort -t $'\t' -k 14,20 > MAGs-overview-sorted.tmp
+
+cat MAGs-overview-header.tmp MAGs-overview-sorted.tmp \
+    > MAGs-overview_GLlblMetag.tsv
+```
+
+**Input Data:**
+
+- assembly-summaries_GLlblMetag.tsv (table of assembly summary statistics, output from [Step 13b](#13b-summarize-assemblies))
+- MAGs/\*.fasta (directory holding high-quality MAGs, output from [Step 22c](#22c-filter-mags))
+- checkm-MAGs-overview.tsv (tab-delimited file with quality estimates per MAG, output from [Step 22c](#22c-filter-mags))
+- gtdbtk-output-dir/gtdbtk.\*.summary.tsv (directory of files with assigned taxonomy and info, output from [Step 22d](#22d-mag-taxonomic-classification))
+
+**Output Data:**
+
+- **MAGs-overview_GLlblMetag.tsv** (a tab-delimited overview of all recovered MAGs)
+
+
+<br>
+
+---
+
+### 23. Generate MAG-level Functional Summary Overview
+
+#### 23a. Get KO Annotations Per MAG
+> This utilizes the helper script [`parse-MAG-annots.py`](../Workflow_Documentation/NF_MGIllumina/workflow_code/bin/parse-MAG-annots.py) 
+
+```bash
+for file in $( ls MAGs/*.fasta )
+do
+
+    MAG_ID=$( echo ${file} | cut -f 2 -d "/" | sed 's/.fasta//' )
+    sample_ID=$( echo ${MAG_ID} | sed 's/-MAG-[0-9]*$//' )
+
+    grep "^>" ${file} | tr -d ">" > ${MAG_ID}-contigs.tmp
+
+    python parse-MAG-annots.py -i annotations-and-taxonomy/${sample_ID}-gene-coverage-annotation-and-tax.tsv \
+                               -w ${MAG_ID}-contigs.tmp \
+                               -M ${MAG_ID} \
+                               -o MAG-level-KO-annotations_GLlblMetag.tsv
+
+    rm ${MAG_ID}-contigs.tmp
+
+done
+```
+
+**Parameter Definitions:**  
+
+- `-i` – Specifies the input sample TSV file containing sample coverage, annotation, and taxonomy info.
+- `-w` – Specifies the appropriate temporary file holding all the contigs in the current MAG.
+- `-M` – Specifies the current MAG unique identifier.
+- `-o` – Specifies the output file name.
+
+**Input Data:**
+
+- \*-gene-coverage-annotation-and-tax.tsv (tables with combined gene coverage, annotation, and taxonomy info generated for individual samples, output from [Step 19](#19-combine-gene-level-coverage-taxonomy-and-functional-annotations-for-each-sample))
+- MAGs/\*.fasta (directory holding high-quality MAGs, output from [Step 22c](#22c-filter-mags))
+
+**Output Data:**
+
+- **MAG-level-KO-annotations_GLlblMetag.tsv** (tab-delimited table holding MAGs and their KO annotations)
+
+
+#### 23b. Summarize KO Annotations With KEGG-Decoder
+
+```bash
+KEGG-decoder -v interactive \
+             -i MAG-level-KO-annotations_GLlblMetag.tsv \
+             -o MAG-KEGG-Decoder-out_GLlblMetag.tsv
+```
+
+**Parameter Definitions:**  
+
+- `-v interactive` – Specifies to create an interactive html output.
+- `-i` – Specifies the input tab-delimited table holding MAGs and their KO annotations.
+- `-o` – Specifies the output table.
+
+**Input Data:**
+
+- MAG-level-KO-annotations_GLlblMetag.tsv (tab-delimited table holding MAGs and their KO annotations, output from [Step 23a](#23a-get-ko-annotations-per-mag))
+
+**Output Data:**
+
+- **MAG-KEGG-Decoder-out_GLlblMetag.tsv** (tab-delimited table holding, for each MAG, the proportion of genes present that are known to be required for specific pathways/metabolisms)
+
+- **MAG-KEGG-Decoder-f.html** (interactive heatmap html file of the above output table)
+
+<br>
+
+---
+
+### 24. Decontamination and Visualization of Contig- and Gene-taxonomy and Gene-function Outputs
+
+#### 24a. Gene-level taxonomy heatmaps
+
+```R
+library(tidyverse)
+
+# Inputs; adjust paths and column names to the dataset
+metadata_file <- "/path/to/sample/metadata"
+feature_table_file <- "Combined-gene-level-taxonomy-coverages-CPM_GLlblMetag.tsv"
+samples_column <- "sample_id"
+group_column <- "group"
+# custom_palette is assumed to be a vector of colors defined in an earlier step
+
+# Prepare metadata
+metadata <- read_delim(metadata_file, delim = ",") %>% as.data.frame()
+sample_names <- metadata[, samples_column]
+row.names(metadata) <- sample_names
+
+# Prepare feature table
+gene_taxonomy_table <- read_assembly_coverage_table(feature_table_file, sample_names) %>% as.data.frame()
+
+# Summarize gene table by species
+species_gene_table <- gene_taxonomy_table %>%
+  select(species, any_of(sample_names)) %>% 
+  group_by(species) %>% 
+  summarise(across(everything(), sum)) %>% as.data.frame()
+
+rownames(species_gene_table) <- species_gene_table[[1]]
+species_gene_table <- species_gene_table[, -1] %>% as.matrix()
+
+# Get common samples and re-arrange feature table and metadata
+common_samples <- intersect(colnames(species_gene_table), rownames(metadata))
+species_gene_table <- species_gene_table[, common_samples]
+metadata <- metadata[common_samples, ]
+metadata <- metadata %>% arrange(!!sym(group_column))
+
+# Write out gene taxonomy table (species as the first column, samples as other columns)
+species_gene_table %>% as.data.frame() %>% rownames_to_column("species") %>%
+  write_csv(file = "gene_taxonomy_table.csv")
+
+make_heatmap(metadata, species_gene_table, 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Combined-gene-level-taxonomy", 
+             assay_suffix = "_GLlblMetag", 
+             custom_palette = custom_palette)
+
+```
+
+**Input data:**
+- assembly-summaries_GLlblMetag.tsv (table of assembly summary statistics, output from [Step 13b](#13b-summarizing-assemblies))
+- Combined-gene-level-taxonomy-coverages-CPM_GLlblMetag.tsv (table with all samples combined based on gene-level taxonomic classifications, output from [Step 21a](#21a-generating-gene-level-coverage-summary-tables)) 
+
+**Output data:**
+- gene_taxonomy_table.csv (aggregated gene taxonomy table with samples in columns and species in rows)
+- **Combined-gene-level-taxonomy_heatmap_GLlblMetag.png** (heatmap of all gene-level taxonomy assignments)
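+
+The species-level aggregation used in this step can be illustrated on a toy table. This is only a sketch: the table, its values, and the names `toy_genes`, `toy_species`, and `toy_matrix` are hypothetical and not part of the pipeline.
+
+```R
+library(tidyverse)
+
+# Toy gene-level table (hypothetical values): two genes map to the same species
+toy_genes <- tibble(
+  species   = c("Species A", "Species A", "Species B"),
+  sample_01 = c(10, 5, 2),
+  sample_02 = c(0, 3, 7)
+)
+
+# Sum coverage per species across all sample columns
+toy_species <- toy_genes %>%
+  group_by(species) %>%
+  summarise(across(everything(), sum)) %>%
+  as.data.frame()
+
+# Move species names into row names and convert to a numeric matrix
+rownames(toy_species) <- toy_species$species
+toy_matrix <- as.matrix(toy_species[, -1])
+
+# "Species A" now holds 15 (sample_01) and 3 (sample_02); "Species B" holds 2 and 7
+print(toy_matrix)
+```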
+
+#### 24b. Gene-level taxonomy decontamination
+
+```R
+library(tidyverse)
+
+# Inputs; adjust paths and column names to the dataset
+metadata_file <- "/path/to/sample/metadata"
+feature_table_file <- "gene_taxonomy_table.csv"
+samples_column <- "sample_id"
+group_column <- "group"
+output_prefix <- ""
+# custom_palette is assumed to be a vector of colors defined in an earlier step
+
+# Prepare feature table
+feature_table <- read_csv(feature_table_file) %>% as.data.frame()
+rownames(feature_table) <- feature_table[[1]]
+feature_table <- feature_table[, -1] %>% as.matrix()
+colnames(feature_table) <- colnames(feature_table) %>% str_remove_all("barcode")
+
+# Prepare metadata
+metadata <- read_delim(metadata_file, delim = ",") %>% as.data.frame()
+row.names(metadata) <- metadata[, samples_column] %>% str_remove_all("barcode")
+
+# Get common samples and re-arrange feature table and metadata
+common_samples <- intersect(colnames(feature_table), rownames(metadata))
+feature_table <- feature_table[, common_samples]
+metadata <- metadata[common_samples, ]
+metadata <- metadata %>% arrange(!!sym(group_column))
+
+# Create column annotation
+col_annotation <- as.data.frame(metadata)[, group_column, drop = FALSE]
+rownames(col_annotation) <- rownames(metadata)
+
+# Calculate output plot width and height
+number_of_samples <- ncol(feature_table)
+width <- 1 * number_of_samples
+number_of_features <- nrow(feature_table)
+height <- 0.2 * number_of_features
+
+# Set colors by group
+groups <- metadata[[group_column]] %>% unique()
+number_of_groups <- length(groups)
+my_colors <- custom_palette[1:number_of_groups]
+names(my_colors) <- groups
+annotation_colors <- list(my_colors)
+names(annotation_colors) <- group_column
+
+# Set width based on number of samples, with a cap at 50 inches
+plot_width <- min(2 * number_of_samples, 50)
+
+decontaminated_table <- feature_decontam(metadata_file = metadata_file, 
+                                         feature_table_file = feature_table_file, 
+                                         feature_column = "species", 
+                                         samples_column = "sample_id",
+                                         prevalence_column = "NTC", 
+                                         ntc_name = "TRUE", 
+                                         frequency_column = "concentration", 
+                                         threshold = 0.1, 
+                                         classification_method = "gene-taxonomy", 
+                                         output_prefix = "", 
+                                         assay_suffix = "_GLlblMetag")
+
+non_microbial <- "Unclassified;_;_;_;_;_;_"
+
+# Heatmap before filtering out contaminants
+make_heatmap(metadata_file, feature_table_file, 
+                           samples_column = "sample_id", group_column = "group", 
+                           output_prefix, assay_suffix = "_GLlblMetag",
+                           custom_palette)
+
+# Convert count matrix to relative abundance matrix
+decontaminated_species_table <- count_to_rel_abundance(decontaminated_table)
+
+# Make plot after filtering out contaminants
+make_heatmap(decontaminated_species_table, metadata, custom_palette, publication_format)
+
+```
+
+**Custom Functions Used:**
+- [feature_decontam()](#feature_decontam)
+- [make_plot()](#make_plot)
+- [count_to_rel_abundance()](#count_to_rel_abundance)
+
+**Parameter Definitions:**
+- `metadata_file` - path to a file with samples as rows and columns describing each sample
+- `feature_table_file` - path to a comma-separated feature table (species or functions), 
+                         with species/functions as the first column and samples as the other columns.
+- `ntc_name` - a character string specifying the name of the NTC in the prevalence column
+- `number_of_samples` - the total number of samples in the species count files, adjust based on number of input samples in the feature_table_file
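+
+How `feature_decontam()` wraps the `decontam` package is not shown in this section. As a rough sketch only, assuming it relies on `decontam::isContaminant()` with a negative-control indicator, the prevalence-based call might look like the following; the toy matrix and the names `is_ntc`, `contamdf`, and `decontaminated` are hypothetical.
+
+```R
+library(decontam)
+
+# Toy species-by-sample count table (hypothetical values)
+feature_table <- matrix(
+  c(120, 90, 150, 110,  2,  1,   # Species A: mostly in true samples
+      5,  8,   3,   6, 40, 55,   # Species B: mostly in negative controls
+     30, 25,  40,  35,  0,  0),  # Species C: absent from negative controls
+  nrow = 3, byrow = TRUE,
+  dimnames = list(c("Species A", "Species B", "Species C"),
+                  c("S1", "S2", "S3", "S4", "NTC1", "NTC2"))
+)
+
+# TRUE marks the no-template controls, in the same order as the columns above
+is_ntc <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE)
+
+# isContaminant() expects samples as rows and features as columns, hence t()
+contamdf <- isContaminant(t(feature_table), neg = is_ntc,
+                          method = "prevalence", threshold = 0.1)
+
+# Drop flagged features by exact name; which() ignores NA calls
+contaminants <- rownames(contamdf)[which(contamdf$contaminant)]
+decontaminated <- feature_table[!rownames(feature_table) %in% contaminants, , drop = FALSE]
+```
+
+A frequency-based call would additionally pass the per-sample DNA concentrations through the `conc` argument; which method the pipeline uses should be checked against the `feature_decontam()` definition.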
+
+**Input Data:**
+
+- `gene_taxonomy_table.csv` (aggregated gene taxonomy table, output from [Step 24a](#24a-gene-level-taxonomy-heatmaps))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
+
+**Output Data:**
+
+- **decontam-gene-taxonomy_results.csv** (decontam's results table)
+- **decontaminated-gene-taxonomy_table.csv** (decontaminated species table)
+- **decontaminated-gene-taxonomy-heatmap_GLlblMetag.png** (heatmap after filtering out contaminants)
+
+- **gene-taxonomy_decontam_results_GLlblMetag.csv** (decontam's result table, output from [feature_decontam() function](#feature_decontam))
+- **gene-taxonomy_decontam_species_table_GLlblMetag.csv** (decontaminated species table, output from [feature_decontam() function](#feature_decontam))
+- **gene-taxonomy_decontam_species_barplot_GLlblMetag.png** (barplot after filtering out contaminants)
+- **gene-taxonomy_decontam_species_barplot_GLlblMetag.html** (barplot after filtering out contaminants)
+
+```R
+library(tidyverse)
+library(decontam)
+library(phyloseq)
+library(pheatmap)
+
+# Set to 0.5 for a more aggressive approach where species more prevalent
+# in the negative controls are considered contaminants
+contam_threshold <- 0.1
+# Control samples in this column should always be written as 
+# "Control_Sample" and true samples as "True_Sample"
+prev_col <- "Sample_or_Control"
+freq_col <- "input_conc_ng"
+plot_width <- 18
+plot_height <- 8
+# colours is assumed to be a vector of colors defined in an earlier step
+
+# Read-in metadata
+metadata_file <- "/path/to/sample/metadata"
+samples_column <- "Sample_ID"
+metadata <- read_delim(file = metadata_file, delim = "\t") %>% as.data.frame()
+row.names(metadata) <- metadata[, samples_column]
+
+# Read-in feature table (species as row names, samples as columns)
+gene.m <- read_csv("gene_taxonomy_table.csv") %>%
+            column_to_rownames("species") %>% as.matrix()
+feature_table <- gene.m
+
+
+contamdf <- run_decontam(feature_table, metadata, contam_threshold, prev_col, freq_col)
+
+# Write decontam results table to file
+write_csv(x = contamdf %>% rownames_to_column("Species"), file = "decontam-gene-taxonomy_results.csv")
+
+# Get the list of contaminants identified by decontam
+contaminants <- contamdf %>%
+                as.data.frame() %>%
+                rownames_to_column("Species") %>%
+                filter(contaminant == TRUE) %>% pull(Species)
+
+# Drop contaminant features identified by decontam (exact name match),
+# keeping the species names as the first column
+decontaminated_table <- feature_table %>% 
+                as.data.frame() %>% 
+                rownames_to_column("Species") %>% 
+                filter(!Species %in% contaminants)
+
+
+# Write decontaminated species table to file
+write_csv(x = decontaminated_table, file = "decontaminated-gene-taxonomy_table.csv")
+
+# Flag the species (contaminants and unclassified) to drop before plotting
+non_microbial <- "Unclassified;_;_;_;_;_;_"
+keep <- !rownames(feature_table) %in% c(contaminants, non_microbial)
+
+mat2plot <- feature_table[keep, , drop = FALSE]
+png(filename = "decontaminated-gene-taxonomy-heatmap_GLlblMetag.png", 
+    width = plot_width, height = plot_height, units = "in", res=300)
+pheatmap(mat = mat2plot,
+         cluster_cols = FALSE, 
+         cluster_rows = FALSE, 
+         col = colours, 
+         angle_col = 0, 
+         display_numbers = TRUE,
+         fontsize = 14, 
+         number_format = "%.0f")
+dev.off()
+
+```
+
+**Input data:**
+
+- metadata_file  (path to sample-wise metadata file)
+- gene_taxonomy_table.csv (aggregated gene taxonomy table, output from [Step 24a](#24a-gene-level-taxonomy-heatmaps))
+
+**Output data:**
+
+- **decontam-gene-taxonomy_results.csv** (decontam's results table)
+- **decontaminated-gene-taxonomy_table.csv** (decontaminated species table)
+- **decontaminated-gene-taxonomy-heatmap_GLlblMetag.png** (heatmap after filtering out contaminants)
+
+
+
+#### 21d. Gene-level KO functions heatmaps
+
+```R
+library(tidyverse)
+library(pheatmap)
+
+# Abundant functions with CPM > 2000
+abundance_threshold <- 2000
+# Plot size in inches; adjust to the number of samples and functions
+plot_width <- 18
+plot_height <- 8
+# colours is assumed to be a vector of colors defined in an earlier step
+
+sample_order <- get_sample_names("assembly-summaries_GLlblMetag.tsv")
+# Read-in KO functions table
+functions_table <- read_input_table("Combined-gene-level-KO-function-coverages-CPM_GLlblMetag.tsv") %>%
+                    select(KO_ID, KO_function, all_of(sample_order))
+
+# Subset table and then convert from data frame to matrix
+functions.m <- functions_table[,sample_order] %>% as.matrix()
+rownames(functions.m) <- functions_table$KO_ID
+table2write <-  functions.m %>% 
+                      as.data.frame() %>% rownames_to_column("KO_ID") %>%
+                      filter(KO_ID != "Not annotated") # Drop unannotated / unclassified
+# Write out KO functions table
+write_csv(x = table2write, file = "genes-KO-functions_table.csv")
+
+
+#------ All KO functions assignments
+
+# Drop unannotated assignments
+mat2plot <- functions.m[rownames(functions.m) != "Not annotated", , drop = FALSE]
+
+png(filename = "All-genes-KO-functions-heatmap_GLlblMetag.png", 
+    width = plot_width, height = plot_height, units = "in", res=300)
+pheatmap(mat = mat2plot,
+         cluster_cols = FALSE, 
+         cluster_rows = FALSE, 
+         col = colours, 
+         angle_col = 0, 
+         display_numbers = TRUE,
+         fontsize = 12,
+         number_format = "%.0f")
+dev.off()
+
+
+#------ Abundant KO functions assignments
+
+functions <- rowSums(functions.m) %>% sort()
+abund_functions <- functions[ functions > abundance_threshold ] %>% names()
+abund_functions.m <- functions.m[abund_functions, , drop = FALSE]
+
+
+# Drop unannotated assignments
+mat2plot <- abund_functions.m[rownames(abund_functions.m) != "Not annotated", , drop = FALSE]
+
+png(filename = "Abundant-genes-KO-functions-heatmap_GLlblMetag.png", 
+    width = plot_width, height = plot_height, units = "in", res=300)
+pheatmap(mat = mat2plot,
+         cluster_cols = FALSE, 
+         cluster_rows = FALSE, 
+         col = colours, 
+         angle_col = 0, 
+         display_numbers = TRUE,
+         fontsize = 12,
+         number_format = "%.0f")
+dev.off()
+```
+
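+**Parameter Definitions:**  
+
+- `abundance_threshold` – minimum CPM, summed across all samples, for a KO function to be included in the abundant-functions heatmap
+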
+**Input data:**
+- assembly-summaries_GLlblMetag.tsv (table of assembly summary statistics, output from [Step 13b](#13b-summarizing-assemblies))
+- Combined-gene-level-KO-function-coverages-CPM_GLlblMetag.tsv (table with all samples combined based on KO annotations; normalized to coverage per million genes covered, output from [Step 21a](#21a-generating-gene-level-coverage-summary-tables))
+
+**Output data:**
+- genes-KO-functions_table.csv (aggregated and subsetted gene KO functions table)
+- **All-genes-KO-functions-heatmap_GLlblMetag.png** (heatmap of gene-wise KO function assignments)
+- **Abundant-genes-KO-functions-heatmap_GLlblMetag.png** (heatmap of gene-wise abundant KO function assignments)
+
+#### 21e. Gene-level KO functions decontamination --- END NEEDS REVIEW ---
+
+```R
+library(tidyverse)
+library(decontam)
+library(phyloseq)
+library(pheatmap)
+
+# Set to 0.5 for a more aggressive approach where functions more prevalent
+# in negative controls are considered contaminants
+contam_threshold <- 0.1 
+# Control samples in this column should always be written as
+# "Control_Sample" and true samples as "True_Sample"
+prev_col <- "Sample_or_Control"
+freq_col <- "input_conc_ng"
+plot_width <- 18
+plot_height <- 8
+# colours is assumed to be a vector of colors defined in an earlier step
+
+# Read-in metadata
+metadata_file <- "/path/to/sample/metadata"
+samples_column <- "Sample_ID"
+metadata <- read_delim(file = metadata_file, delim = "\t") %>% as.data.frame()
+row.names(metadata) <- metadata[, samples_column]
+
+# Read-in feature table (KO IDs as row names, samples as columns)
+functions.m <- read_csv("genes-KO-functions_table.csv") %>%
+                 column_to_rownames("KO_ID") %>% as.matrix()
+feature_table <- functions.m
+
+
+contamdf <- run_decontam(feature_table, metadata, contam_threshold, prev_col, freq_col)
+
+# Write decontam results table to file
+write_csv(x = contamdf %>% rownames_to_column("KO_ID"), file = "decontam-gene-KO-functions_results.csv")
+
+# Get the list of contaminants identified by decontam
+contaminants <- contamdf %>%
+                as.data.frame() %>%
+                rownames_to_column("KO_ID") %>%
+                filter(contaminant == TRUE) %>% pull(KO_ID)
+
+# Drop contaminant features identified by decontam (exact ID match),
+# keeping the KO IDs as the first column
+decontaminated_table <- feature_table %>% 
+                as.data.frame() %>% 
+                rownames_to_column("KO_ID") %>% 
+                filter(!KO_ID %in% contaminants)
+
+
+# Write decontaminated functions table to file
+write_csv(x = decontaminated_table, file = "decontaminated-gene-KO-functions_table.csv")
+
+# Flag the functions (contaminants and unannotated) to drop before plotting
+unclassified <- "Not annotated"
+keep <- !rownames(feature_table) %in% c(contaminants, unclassified)
+
+mat2plot <- feature_table[keep, , drop = FALSE]
+png(filename = "decontaminated-gene-KO-functions-heatmap_GLlblMetag.png", 
+    width = plot_width, height = plot_height, units = "in", res=300)
+pheatmap(mat = mat2plot,
+         cluster_cols = FALSE, 
+         cluster_rows = FALSE, 
+         col = colours, 
+         angle_col = 0, 
+         display_numbers = TRUE,
+         fontsize = 14, 
+         number_format = "%.0f")
+dev.off()
+```
+
+**Input data:**
+
+- metadata_file  (path to sample-wise metadata file)
+- genes-KO-functions_table.csv (aggregated gene KO functions table, output from [Step 21d](#21d-gene-level-ko-functions-heatmaps))
+
+**Output data:**
+
+- **decontam-gene-KO-functions_results.csv** (decontam's results table)
+- **decontaminated-gene-KO-functions_table.csv** (decontaminated functions table)
+- **decontaminated-gene-KO-functions-heatmap_GLlblMetag.png** (heatmap after filtering out contaminants)
+
+
+#### 21g. Contig-level Heatmaps --- START NEEDS REVIEW ---
+
+```R
+library(tidyverse)
+library(pheatmap)
+
+plot_width <- 20
+plot_height <- 30
+# Assumed cutoff (summed CPM), mirroring the KO functions heatmaps; adjust as needed
+abundance_threshold <- 2000
+# colours is assumed to be a vector of colors defined in an earlier step
+sample_order <- get_sample_names("assembly-summaries_GLlblMetag.tsv")
+
+species_contig_table <- read_contig_table("Combined-contig-level-taxonomy-coverages-CPM_GLlblMetag.tsv")
+
+contig.m <- species_contig_table %>%
+  group_by(species) %>%
+  summarise(across(everything(), sum)) %>%
+  filter(species != "Unclassified;_;_;_;_;_;_") %>% # Drop unclassified
+  as.data.frame()
+
+# Write out contig taxonomy table
+write_csv(x = contig.m, file = "contig_taxonomy_table.csv")
+
+rownames(contig.m) <- contig.m[['species']]
+contig.m <- contig.m[,-match("species", colnames(contig.m))] %>% as.matrix()
+
+#------ All contig taxonomy assignments
+
+# Unclassified assignments were already dropped above
+mat2plot <- contig.m
+
+png(filename = "All-contig-taxonomy-heatmap_GLlblMetag.png", 
+    width = plot_width, height = plot_height, units = "in", res=300)
+pheatmap(mat = mat2plot,
+         cluster_cols = FALSE, 
+         cluster_rows = FALSE, 
+         col = colours, 
+         angle_col = 0, 
+         display_numbers = TRUE,
+         fontsize = 12,
+         number_format = "%.0f")
+dev.off()
+
+
+#------ Abundant contig taxonomy assignments
+
+taxa <- rowSums(contig.m) %>% sort()
+abund_taxa <- taxa[ taxa > abundance_threshold ] %>% names()
+abund_contig.m <- contig.m[abund_taxa, , drop = FALSE]
+
+mat2plot <- abund_contig.m
+
+png(filename = "Abundant-contig-taxonomy-heatmap_GLlblMetag.png", 
+    width = plot_width, height = plot_height, units = "in", res=300)
+pheatmap(mat = mat2plot,
+         cluster_cols = FALSE, 
+         cluster_rows = FALSE, 
+         col = colours, 
+         angle_col = 0, 
+         display_numbers = TRUE,
+         fontsize = 12,
+         number_format = "%.0f")
+dev.off()
+```
+
+
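+**Parameter Definitions:**  
+
+- `plot_width`, `plot_height` – width and height of the output heatmaps, in inches
+- `abundance_threshold` – minimum CPM, summed across all samples, for a species to be included in the abundant-taxa heatmap
+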
+**Input data:**
+
+- assembly-summaries_GLlblMetag.tsv (table of assembly summary statistics, output from [Step 13b](#13b-summarizing-assemblies))
+- Combined-contig-level-taxonomy-coverages-CPM_GLlblMetag.tsv (table with all samples combined based on contig-level taxonomic classifications; normalized to coverage per million contigs covered, output from [Step 21b](#21b-generate-contig-level-coverage-summary-tables))
+
+**Output data:**
+
+- contig_taxonomy_table.csv (aggregated contig taxonomy)
+- **All-contig-taxonomy-heatmap_GLlblMetag.png** (All contig level taxonomy heatmap)
+- **Abundant-contig-taxonomy-heatmap_GLlblMetag.png** (Abundant contig level taxonomy heatmap)
+
+
+#### 21h. Contig-level decontamination --- END NEEDS REVIEW ---
+
+```R
+library(tidyverse)
+library(decontam)
+library(phyloseq)
+library(pheatmap)
+
+# Set to 0.5 for a more aggressive approach where species more prevalent
+# in the negative controls are considered contaminants
+contam_threshold <- 0.1
+# Control samples in this column should always be written as
+# "Control_Sample" and true samples as "True_Sample"
+prev_col <- "Sample_or_Control"
+freq_col <- "input_conc_ng"
+plot_width <- 18
+plot_height <- 8
+# colours is assumed to be a vector of colors defined in an earlier step
+
+# Read-in metadata
+metadata_file <- "/path/to/sample/metadata"
+samples_column <- "Sample_ID"
+metadata <- read_delim(file = metadata_file, delim = "\t") %>% as.data.frame()
+row.names(metadata) <- metadata[, samples_column]
+
+# Read-in feature table (species as row names, samples as columns)
+contig.m <- read_csv("contig_taxonomy_table.csv") %>%
+              column_to_rownames("species") %>% as.matrix()
+feature_table <- contig.m
+
+
+contamdf <- run_decontam(feature_table, metadata, contam_threshold, prev_col, freq_col)
+
+# Write decontam results table to file
+write_csv(x = contamdf %>% rownames_to_column("Species"), file = "decontam-contig-taxonomy_results.csv")
+
+# Get a list of contaminants identified by decontam
+contaminants <- contamdf %>%
+                as.data.frame() %>%
+                rownames_to_column("Species") %>%
+                filter(contaminant == TRUE) %>% pull(Species)
+
+# Drop contaminant features identified by decontam (exact name match),
+# keeping the species names as the first column
+decontaminated_table <- feature_table %>% 
+                as.data.frame() %>% 
+                rownames_to_column("Species") %>% 
+                filter(!Species %in% contaminants)
+
+
+# Write decontaminated species table to file
+write_csv(x = decontaminated_table, file = "decontaminated-contig-taxonomy_table.csv")
+
+# Flag the species (contaminants and unclassified) to drop before plotting
+non_microbial <- "Unclassified;_;_;_;_;_;_"
+keep <- !rownames(feature_table) %in% c(contaminants, non_microbial)
+
+mat2plot <- feature_table[keep, , drop = FALSE]
+png(filename = "decontaminated-contig-taxonomy-heatmap_GLlblMetag.png", 
+    width = plot_width, height = plot_height, units = "in", res=300)
+pheatmap(mat = mat2plot,
+         cluster_cols = FALSE, 
+         cluster_rows = FALSE, 
+         col = colours, 
+         angle_col = 0, 
+         display_numbers = TRUE,
+         fontsize = 14, 
+         number_format = "%.0f")
+dev.off()
+```
+
+**Input data:**
+
+- metadata_file  (path to sample-wise metadata file)
+- contig_taxonomy_table.csv (aggregated contig taxonomy table, output from [Step 21g](#21g-contig-level-heatmaps))
+
+**Output data:**
+
+- **decontam-contig-taxonomy_results.csv** (decontam's results table)
+- **decontaminated-contig-taxonomy_table.csv** (decontaminated species table)
+- **decontaminated-contig-taxonomy-heatmap_GLlblMetag.png** (heatmap after filtering out contaminants)
+
+
+

From a196c7a9c4f6328f76cec4f0d41a2144c766e462 Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Sat, 24 Jan 2026 20:36:20 -0800
Subject: [PATCH 20/47] Updated assembly decontamination and heatmap section

---
 .../Low_Biomass/Nanopore/GL-DPPD-7116.md      | 787 ++++++------------
 1 file changed, 275 insertions(+), 512 deletions(-)
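
Note on the merged fix_names() helper: the sketch below shows, on a made-up
three-rank table, how a placeholder value is replaced by the value to its left
plus a suffix. It is an illustrative, self-contained reading of the function in
this patch, not the pipeline code itself. The name `fix_names_sketch`, the
example table, and the choice to skip the first column and empty matches are
assumptions made for the example.

```R
# Hypothetical taxonomy table; ranks and values are made up for illustration
taxonomy <- data.frame(
  domain = c("Bacteria", "Bacteria"),
  phylum = c("Firmicutes", "Other"),
  class  = c("Bacilli", "Other"),
  stringsAsFactors = FALSE
)

# Sketch of the replacement logic (assumed variant, base R only)
fix_names_sketch <- function(taxonomy, stringToReplace = "Other", suffix = ";Other") {
  for (index in seq_along(stringToReplace)) {
    # Start at the second column: the first rank has no column to its left
    for (taxa_index in seq_along(taxonomy)[-1]) {
      # Rows of the current rank that match the current pattern
      indices <- grep(x = taxonomy[, taxa_index], pattern = stringToReplace[index])
      if (length(indices) > 0) {
        # Replace with the parent rank's value plus the matching suffix
        taxonomy[indices, taxa_index] <-
          paste0(taxonomy[indices, taxa_index - 1], suffix[index])
      }
    }
  }
  taxonomy
}

cleaned <- fix_names_sketch(taxonomy)
print(cleaned)
```

Because columns are processed left to right, a placeholder in a deeper rank
inherits the already-rewritten value of its parent, so `class` picks up the
suffix twice in this example.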

diff --git a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md
index ff4f3ef98..28233bac9 100644
--- a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md
+++ b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md
@@ -1132,7 +1132,7 @@ library(pavian)
       group_by(name) %>%
       summarise(across(everything(), sum)) %>%
       ungroup() %>%
-      as.data.frame() %>%
+      as.data.frame %>%
       rename(species = name)
 
     # Set rownames as species name, drop species column
@@ -1324,7 +1324,7 @@ library(pavian)
                         samples_column="Sample_ID", prefix_to_remove="barcode"){
   
     abund_table_wide <- abund_table %>%
-        as.data.frame() %>%
+        as.data.frame %>%
         rownames_to_column(samples_column) %>%
         inner_join(metadata) %>%
         select(!!!colnames(metadata), everything()) %>%
@@ -1374,7 +1374,7 @@ library(pavian)
     feature_table <- feature_table[, -1]
 
     # Prepare metadata
-    metadata <- read_delim(metdata_file, delim = ",") %>% as.data.frame()
+    metadata <- read_delim(metdata_file, delim = ",") %>% as.data.frame
     row.names(metadata) <- metadata[, samples_column]
 
     # compute abundances from counts
@@ -1420,17 +1420,17 @@ library(pavian)
   <summary>Creates heatmaps from a feature table file</summary>
   
   ```R
-  make_heatmap <- function(metadata_file, feature_table_file, 
+  make_heatmap <- function(metadata, species_gene_table, 
                            samples_column = "sample_id", group_column = "group", 
                            output_prefix, assay_suffix = "_GLlblMetag",
                            custom_palette) {
     # Prepare feature table
-    # feature_table <- read_csv(feature_table_file) %>% as.data.frame()
+    # feature_table <- read_csv(feature_table_file) %>% as.data.frame
     # rownames(feature_table) <- feature_table[[1]]
     # feature_table <- feature_table[, -1] %>% as.matrix()
 
     # # Prepare metadata
-    # metadata <- read_delim(metadata_file, delim = ",") %>% as.data.frame()
+    # metadata <- read_delim(metadata_file, delim = ",") %>% as.data.frame
     # row.names(metadata) <- metadata[, samples_column]
 
     # # Get common samples and re-arrange feature table and metadata
@@ -1571,12 +1571,12 @@ library(pavian)
                                threshold = 0.1, classification_method, 
                                output_prefix, assay_suffix = "_GLlblMetag") {
     # Prepare feature table
-    feature_table <- read_csv(feature_table_file) %>%  as.data.frame()
+    feature_table <- read_csv(feature_table_file) %>%  as.data.frame
     rownames(feature_table) <- feature_table[[1]]
     feature_table <- feature_table[, -1]  %>% as.matrix()
 
     # Prepare metadata
-    metadata <- read_csv(metadata_file) %>% as.data.frame()
+    metadata <- read_csv(metadata_file) %>% as.data.frame
     row.names(metadata) <- metadata[, samples_column]
 
     # Run decontam
@@ -1598,7 +1598,7 @@ library(pavian)
       
       # Drop contaminant features identified by decontam
       decontaminated_table <- feature_table %>%
-        as.data.frame() %>%
+        as.data.frame %>%
         rownames_to_column(feature_column) %>%
         filter(str_detect(!!sym(feature_column),
                           pattern = str_c(contaminants,
@@ -1682,87 +1682,40 @@ library(pavian)
 
 </details>
 
-
-##### format_taxonomy_table()
-<details>
-  <summary>format a taxonomy assignment table by appending a suffix to a known name</summary>
-
-```R
-format_taxonomy_table <- function(taxonomy,stringToReplace="Other",
-                                  suffix=";Other") {
-  
-  for (taxa_index in seq_along(taxonomy)) {
-    
-    # Get the row indices of the current taxonomy columns 
-    # with rows matching the sting in `stringToReplace`
-    indices <- grep(x = taxonomy[,taxa_index], pattern = stringToReplace)
-    # Replace the value in that row with the value in the adjacent cell concated with `suffix` 
-    taxonomy[indices,taxa_index] <- 
-      paste0(taxonomy[indices,taxa_index-1],
-             rep(x = suffix, times=length(indices)))
-    
-  }
-  return(taxonomy)
-}
-
-```
-**Function Parameter Definitions:**
-- `taxonomy` -  taxonomy dataframe with taxonomy ranks as column names
-- `stringToReplace` - a regex string specifying what to replace
-- `suffix` - string specifying the replacement value
-
-**Returns:** a dataframe of reformated taxonomy names
-
-</details>
-
-
 ##### fix_names()
 <details>
   <summary>clean taxonomy names</summary>
 
-```R
-fix_names<- function(taxonomy,stringToReplace,suffix){
-  
-  for(index in seq_along(stringToReplace)){
-    taxonomy <- format_taxonomy_table(taxonomy = taxonomy,
-                                      stringToReplace=stringToReplace[index], 
-                                      suffix=suffix[index])
-  }
-  return(taxonomy)
-}
-
-```
-**Function Parameter Definitions:**
-- `taxonomy` -  taxonomy dataframe with taxonomy ranks as column names
-- `stringToReplace` - a regex string specifying what to replace
-- `suffix` - string specifying the replacement value
-
-**Returns:** a dataframe of reformated/cleaned taxonomy names
-
-</details>
-
-
-##### read_input_table()
-<details>
-  <summary>read an input table into a dataframe</summary>
-
   ```R
-  read_input_table <- function(file_name){
-    
-    df <- read_delim(file = file_name, delim = "\t", comment = "#")
-    return(df)
+  fix_names <- function(taxonomy, stringToReplace = "Other", suffix = ";Other") {
     
+    for(index in seq_along(stringToReplace)){
+
+      for (taxa_index in seq_along(taxonomy)) {
+        # Get the row indices of the current taxonomy column
+        # whose values match the current string in `stringToReplace`
+        indices <- grep(x = taxonomy[,taxa_index], pattern = stringToReplace[index])
+        # Replace each matching value with the value in the cell to its left, concatenated with the matching `suffix`
+        taxonomy[indices,taxa_index] <-
+          paste0(taxonomy[indices,taxa_index-1],
+                rep(x = suffix[index], times=length(indices)))
+      }
+
+    }
+    return(taxonomy)
   }
   ```
+
   **Function Parameter Definitions:**
-  - `file_name` - path to file to be read
+  - `taxonomy` -  taxonomy dataframe with taxonomy ranks as column names
+  - `stringToReplace` - a regex string specifying what to replace
+  - `suffix` - string specifying the replacement value
 
-  **Returns:** a tibble generated from the input file
+  **Returns:** a dataframe of reformatted/cleaned taxonomy names
 
 </details>
 
 
-
 ##### read_assembly_coverage_table()
 <details>
   <summary>Read Assembly-based coverage annotation table</summary>
@@ -1770,7 +1723,7 @@ fix_names<- function(taxonomy,stringToReplace,suffix){
   ```R
   read_assembly_coverage_table <- function(file_name, sample_names){
   
-    df <- read_input_table(file_name)
+    df <- read_delim(file = file_name, delim = "\t", comment = "#")
 
     # Subset taxoxnomy portion (domain:species) of input table
     # and replace empty/Na domain assignments with "Unclassified"
@@ -1790,9 +1743,12 @@ fix_names<- function(taxonomy,stringToReplace,suffix){
     
     return(df)
   }
-
   ```
 
+  **Custom Functions Used:**
+  - [process_taxonomy()](#process_taxonomy)
+  - [fix_names()](#fix_names)
+
   **Function Parameter Definitions:**
 
   - `file_name` - path to contig taxonomy assignment file to be read
@@ -1803,17 +1759,15 @@ fix_names<- function(taxonomy,stringToReplace,suffix){
 </details>
 
 
-
 ##### get_sample_names()
 <details>
   <summary>retrieve sample names for which assemblies were generated</summary>
 
   ```R
   get_sample_names <- function (assembly_summary) {
-    overview_table <-  read_input_table(assembly_summary) %>%
-                        select(
-                          where( ~all(!is.na(.)) )
-                          ) # Drop columns were all its rows are NAs
+    # Read in table and drop columns where all rows are NA
+    overview_table <-  read_delim(file = assembly_summary, delim = "\t", comment = "#") %>%
+                        select(where( ~all(!is.na(.)) )) 
 
     col_names <- names(overview_table) %>% str_remove_all("-assembly")
     sample_order <- col_names[-1] %>% sort()
@@ -2064,7 +2018,7 @@ ktImportText  -o kaiju-report.html ${KTEXT_FILES[*]}
 library(tidyverse)
 feature_table <- process_kaiju_table(file_path="merged_kaiju_table_GLlblMetag.tsv")
 table2write <- feature_table  %>%
-                as.data.frame() %>%
+                as.data.frame %>%
                 rownames_to_column("Species")
 write_csv(x = table2write, file = "kaiju_species_table_GLlblMetag.csv")
 ```
@@ -2101,7 +2055,7 @@ threshold <- 0.5
 non_microbial <- "UNCLASSIFIED|Unclassifed|unclassified|Homo sapien|cannot|uncultured|unidentified"
 
 # read in feature table
-feature_table <- read_csv(input_file) %>% as.data.frame()
+feature_table <- read_csv(input_file) %>% as.data.frame
 feature_name <- colnames(feature_table)[1]
 rownames(feature_table) <- feature_table[,1]
 feature_table <- feature_table[, -1]
@@ -2109,13 +2063,13 @@ feature_table <- feature_table[, -1]
 # convert count table to a relative abundance matrix
 abund_table <- feature_table %>% rownames_to_column(feature_name) %>%
   mutate(across(where(is.numeric), function(x) (x / sum(x, na.rm = TRUE)) * 100)) %>%
-  as.data.frame()
+  as.data.frame
 
 rownames(abund_table) <- abund_table[,1]
 abund_table <- abund_table[,-1] %>% t 
 
 table2write <- group_low_abund_taxa(abund_table, threshold = threshold)  %>%
-  t %>% as.data.frame() %>%
+  t %>% as.data.frame %>%
   rownames_to_column(feature_name)
 
 write_csv(x = table2write, file = output_file)
@@ -2214,7 +2168,7 @@ library(phyloseq)
 
 feature_table_file <- "kaiju_filtered_species_table_GLlblMetag.csv"
 metadata_table <- "/path/to/sample/metadata"
-number_of_samples <- NumberOfSamples
+number_samples <- NumberOfSamples # integer indicating how many samples are in the file
 
 # set width based on number of samples, with a cap at 50 inches
 plot_width <- 2 * number_samples
@@ -2254,8 +2208,7 @@ htmlwidgets::saveWidget(ggplotly(p), glue("kaiju_decontam_species_barplot_GLlblM
 - `metadata_table` - path to a file with samples as rows and columns describing each sample
 - `feature_table_file` - path to a tab separated samples feature table i.e. species/functions 
                          table with species/functions as the first column and samples as other columns.
-- `ntc_name` - a character string specifying the name of the NTC in the prevalence column
-- `number_of_samples` - the total number of samples in the species count files, adjust based on number of input samples in the feature_table_file
+- `number_samples` - the total number of samples in the species count files, adjust based on number of input samples in the feature_table_file
 
 **Input Data:**
 
@@ -2511,14 +2464,14 @@ threshold <- 0.5
 non_microbial <- "UNCLASSIFIED|Unclassifed|unclassified|Homo sapien|cannot|uncultured|unidentified"
 
 # read in feature table
-feature_table <- read_csv(input_file) %>% as.data.frame()
+feature_table <- read_csv(input_file) %>% as.data.frame
 feature_name <- colnames(feature_table)[1]
 rownames(feature_table) <- feature_table[,1]
 feature_table <- feature_table[, -1]
 
 # read-based count table
 table2write <- filter_rare(feature_table, non_microbial, threshold = threshold) %>%
-  as.data.frame() %>%
+  as.data.frame %>%
   rownames_to_column(feature_name)
 
 write_csv(x = table2write, file = output_file)
@@ -2607,7 +2560,8 @@ htmlwidgets::saveWidget(ggplotly(p), glue("kraken2_filtered_species_barplot_GLlb
 
 #### 10h. Feature decontamination
 
-> Feature (species) decontamination with decontam. Decontam is an R package that statistically identifies contaminating features in a feature table
+> Feature (species) decontamination with decontam. Decontam is an R package that statistically 
+  identifies contaminating features in a feature table
 
 ```R
 library(tidyverse)
@@ -2616,7 +2570,7 @@ library(phyloseq)
 
 feature_table_file <- "kraken2_filtered_species_table_GLlblMetag.csv"
 metadata_table <- "/path/to/sample/metadata"
-number_of_samples <- NumberOfSamples
+number_samples <- NumberOfSamples # integer indicating how many samples are in the file
 
 # set width based on number of samples, with a cap at 50 inches
 plot_width <- 2 * number_samples
@@ -2656,7 +2610,7 @@ htmlwidgets::saveWidget(ggplotly(p), glue("kraken2_decontam_species_barplot_GLlb
 - `metadata_table` - path to a file with samples as rows and columns describing each sample
 - `feature_table_file` - path to a tab separated samples feature table i.e. species/functions 
                           table with species/functions as the first column and samples as other columns.
-- `number_of_samples` - the total number of samples in the species count files, adjust based on number of input samples.
+- `number_samples` - the total number of samples in the species count files, adjust based on number of input samples.
 
 **Input Data:**
 
@@ -3631,9 +3585,9 @@ KEGG-decoder -v interactive \
 
 **Output Data:**
 
-- **MAG-KEGG-Decoder-out_GLlblMetag.tsv** (tab-delimited table holding MAGs and their proportions of genes held known to be required for specific pathways/metabolisms)
-
-- **MAG-KEGG-Decoder-f.html** (interactive heatmap html file of the above output table)
+- **MAG-KEGG-Decoder-out_GLlblMetag.tsv** (tab-delimited table holding MAGs and their proportions of 
+                                           genes held known to be required for specific pathways/metabolisms)
+- **MAG-KEGG-Decoder-out_GLlblMetag.html** (interactive heatmap html file of the above output table)
 
 <br>
 
@@ -3647,19 +3601,23 @@ KEGG-decoder -v interactive \
 library(tidyverse)
 
 metadata_file <- "/path/to/sample/metadata"
+feature_table_file <- "Combined-gene-level-taxonomy-coverages-CPM_GLlblMetag.tsv"
+
 # Prepare metadata
-metadata <- read_delim(metadata_file, delim = ",") %>% as.data.frame()
+metadata <- read_delim(metadata_file, delim = ",") %>% as.data.frame
 sample_names = metadata[, samples_column]
 row.names(metadata) <- sample_names
 
 # Prepare feature table
-gene_taxonomy_table <- read_assembly_coverage_table(feature_table_file, sample_names) %>% as.data.frame()
+gene_taxonomy_table <- read_assembly_coverage_table(feature_table_file, sample_names) %>% as.data.frame
 
 # Summarize gene table
 species_gene_table <- gene_taxonomy_table %>%
   select(species, !!any_of(sample_names)) %>% 
   group_by(species) %>% 
-  summarise(across(everything(), sum)) %>% as.data.frame
+  summarise(across(everything(), sum)) %>% 
+  filter(species != "Unclassified;_;_;_;_;_;_") %>% # Drop unclassified
+  as.data.frame
 
 rownames(species_gene_table) <- species_gene_table[[1]]
 species_gene_table <- species_gene_table[, -1] %>% as.matrix()
@@ -3670,8 +3628,9 @@ species_gene_table <- species_gene_table[, common_samples]
 metadata <- metadata[common_samples, ]
 metadata <- metadata %>% arrange(!!sym(group_column))
 
+table2write = species_gene_table %>% as.data.frame %>% rownames_to_column("species")
 # Write out gene taxonomy table
-write_csv(x = gene.m, file = "gene_taxonomy_table.csv")
+write_csv(x = table2write, file = "gene_taxonomy_table.csv")
 
 make_heatmap(metadata, species_gene_table, 
              samples_column="sample_id", group_column = "group", 
@@ -3681,64 +3640,40 @@ make_heatmap(metadata, species_gene_table,
 
 ```
 
+**Custom Functions Used:**
+- [read_assembly_coverage_table()](#read_assembly_coverage_table)
+- [make_heatmap()](#make_heatmap)
+
 **Input data:**
-- assembly-summaries_GLlblMetag.tsv (table of assembly summary statistics, output from [Step 13b](#13b-summarizing-assemblies))
-- Combined-gene-level-taxonomy-coverages-CPM_GLlblMetag.tsv (table with all samples combined based on gene-level taxonomic classifications, output from [Step 21a](#21a-generating-gene-level-coverage-summary-tables)) 
+- /path/to/sample/metadata (a file with samples as rows and columns describing each sample)
+- Combined-gene-level-taxonomy-coverages-CPM_GLlblMetag.tsv (table with all samples 
+    combined based on gene-level taxonomic classifications, output from 
+    [Step 21a](#21a-generating-gene-level-coverage-summary-tables)) 
 
 **Output data:**
 - gene_taxonomy_table.csv (aggregated gene taxonomy table with samples in columns and species in rows)
-- **Combined-gene-level-taxonomy_heatmap_GLlblMetag.png** (heatmap of all genes taxonomy assignments)
+- **Combined-gene-level-taxonomy_heatmap_GLlblMetag.png** (heatmap of all gene taxonomy assignments)
 
 #### 24b. Gene-level taxonomy decontamination
 
 ```R
 library(tidyverse)
+library(decontam)
+library(phyloseq)
 
-    # Prepare feature table
-    feature_table <- read_csv(feature_table_file) %>% as.data.frame()
-    rownames(feature_table) <- feature_table[[1]]
-    feature_table <- feature_table[, -1] %>% as.matrix()
-    colnames(feature_table) <-  colnames(feature_table) %>% str_remove_all("barcode")
-
-    # Prepare metadata
-    metadata <- read_delim(metadata_file, delim = ",") %>% as.data.frame()
-    row.names(metadata) <- metadata[, samples_column] %>% str_remove_all("barcode")
-
-    # Get common samples and re-arrange feature table and metadata
-    common_samples <- intersect(colnames(feature_table), rownames(metadata))
-    feature_table <- feature_table[, common_samples]
-    metadata <- metadata[common_samples, ]
-    metadata <- metadata %>% arrange(!!sym(group_column))
-
-    # Create column annotation
-    col_annotation <- as.data.frame(metadata)[, group_column, drop = FALSE]
-    rownames(col_annotation) <- rownames(col_annotation)
-
-    # Calculate output plot width and height
-    number_of_samples <- ncol(feature_table)
-    width <- 1 * number_of_samples
-    number_of_features <- nrow(feature_table)
-    height <- 0.2 * number_of_features
-
-    # Set colors by group
-    groups <- metadata[[group_column]] %>%  unique()
-    number_of_groups <-  length(groups)
-    my_colors <- custom_palette[1:number_of_groups]
-    names(my_colors) <- groups
-    annotation_colors  <- list(my_colors)
-    names(annotation_colors) <- group_column
-
-# Read-in featusre table
-gene.m <- read_csv("gene_taxonomy_table.csv")
-rownames(gene.m) <- gene.m[['species']]
-gene.m <- gene.m[,-match("species", colnames(gene.m))] %>% as.matrix()
-feature_table <- gene.m
-
+feature_table_file <- "gene_taxonomy_table.csv"
+metadata_table <- "/path/to/sample/metadata"
+number_samples <- NumberOfSamples # integer indicating how many samples are in the file
 
 # set width based on number of samples, with a cap at 50 inches
 plot_width <- 2 * number_samples
 if(plot_width > 50) { plot_width = 50 }
 
+# Prepare metadata
+metadata <- read_delim(metadata_table, delim = ",") %>% as.data.frame
+sample_names = metadata[, samples_column]
+row.names(metadata) <- sample_names
+
 decontaminated_table <- feature_decontam(metadata_file = metadata_table, 
                                          feature_table_file = feature_table_file, 
                                          feature_column = "species", 
@@ -3747,466 +3682,294 @@ decontaminated_table <- feature_decontam(metadata_file = metadata_table,
                                          ntc_name = "TRUE", 
                                          frequency_column = "concentration", 
                                          threshold = 0.1, 
-                                         classification_method = "gene-taxonomy", 
+                                         classification_method = "Combined-gene-level-taxonomy", 
                                          output_prefix = "", 
                                          assay_suffix = "_GLlblMetag")
 
-non_microbial <- "Unclassified;_;_;_;_;_;_"
-
-make_heatmap(metadata_file, feature_table_file, 
-                           samples_column = "sample_id", group_column = "group", 
-                           output_prefix, assay_suffix = "_GLlblMetag",
-                           custom_palette)
-
-# Convert count matrix to relative abundance matrix
-decontaminated_species_table <- count_to_rel_abundance(decontaminated_table)
+# Get common samples and re-arrange feature table and metadata
+common_samples <- intersect(colnames(decontaminated_table), rownames(metadata))
+decontaminated_table <- decontaminated_table[, common_samples]
+metadata <- metadata[common_samples, ]
+metadata <- metadata %>% arrange(!!sym(group_column))
 
-# Make plot after filtering out contaminants
-make_heatmap(decontaminated_species_table, metadata, custom_palette, publication_format)
+make_heatmap(metadata, decontaminated_table, 
+             samples_column = "sample_id", group_column = "group", 
+             output_prefix = "Combined-gene-level-taxonomy_decontam", 
+             assay_suffix = "_GLlblMetag",
+             custom_palette)
 
 ```
 
 **Custom Functions Used:**
 - [feature_decontam()](#feature_decontam)
-- [make_plot()](#make_plot)
-- [count_to_rel_abundance()](#count_to_rel_abundance)
+- [make_heatmap()](#make_heatmap)
 
 **Parameter Definitions:**
 - `metadata_table` - path to a file with samples as rows and columns describing each sample
-- `feature_table_file` - path to a tab separated samples feature table i.e. species/functions 
-                         table with species/functions as the first column and samples as other columns.
-- `ntc_name` - a character string specifying the name of the NTC in the prevalence column
-- `number_of_samples` - the total number of samples in the species count files, adjust based on number of input samples in the feature_table_file
+- `feature_table_file` - path to a comma-separated feature table of gene-level taxonomy coverage, 
+                         with species as the first column and samples as the remaining columns.
+- `number_samples` - the total number of samples in the feature_table_file, adjust based on number of input samples
 
 **Input Data:**
 
-- `kaiju_filtered_species_table_GLlblMetag.csv`(path to filtered species count per sample, output from [Step 9](#9g-filter-kaiju-species-count-table))
+- `gene_taxonomy_table.csv` (aggregated gene taxonomy table with samples in columns and species in rows, from [Step 24a](#24a-gene-level-taxonomy-heatmaps))
 - `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping samplenames to group metadata)
 
 **Output Data:**
 
-- **decontam-gene-taxonomy_results.csv** (decontam's results table)
-- **decontaminated-gene-taxonomy_table.csv** (decontaminated species table)
-- **decontaminated-gene-taxonomy-heatmap_GLlblMetag.png** (heatmap after filtering out contaminants)
-
-- **gene-taxonomy_decontam_results_GLlblMetag.csv** (decontam's result table, output from [feature_decontam() function](#feature_decontam))
-- **gene-taxonomy_decontam_species_table_GLlblMetag.csv** (decontaminated species table, output from [feature_decontam() function](#feature_decontam))
-- **gene-taxonomy_decontam_species_barplot_GLlblMetag.png** (barplot after filtering out contaminants)
-- **gene-taxonomy_decontam_species_barplot_GLlblMetag.html** (barplot after filtering out contaminants)
-
-```R
-library(tidyverse)
-library(decontam)
-library(phyloseq)
-
-# Set to 0.5 for a more aggressive approach where species more prevalent
-# in the negative controls are considered contaminants
-contam_threshold <- 0.1
-# Control samples in this column should always be written as 
-# "Control_Sample" and true samples as "True_Sample"
-prev_col <- "Sample_or_Control"
-freq_col <- "input_conc_ng"
-plot_width <- 18
-plot_height <- 8
-
-# Read-in metadata
-metdata_file <- "/path/to/sample/metadata"
-samples_column <- "Sample_ID"
-metadata <- read_delim(file=metdata_file , delim = "\t") %>% as.data.frame()
-row.names(metadata) <- metadata[,samples_column]
-
-# Read-in featusre table
-gene.m <- read_csv("gene_taxonomy_table.csv")
-rownames(gene.m) <- gene.m[['species']]
-gene.m <- gene.m[,-match("species", colnames(gene.m))] %>% as.matrix()
-feature_table <- gene.m
-
-
-contamdf <- run_decontam(feature_table, metadata, contam_threshold, prev_col, freq_col)
-
-# Write decontam results table to file
-write_csv(x = contamdf %>% rownames_to_column("Species"), file = "decontam-gene-taxonomy_results.csv")
-
-# Get the list of contaminats identified by decontam
-contaminants <- contamdf %>%
-                as.data.frame %>%
-                rownames_to_column("Species") %>%
-                filter(contaminant == TRUE) %>% pull(Species)
-
-# Drop contaminant features identified by decontam
-decontaminated_table <- feature_table %>% 
-                as.data.frame  %>% 
-                rownames_to_column("Species") %>% 
-                filter(str_detect(Species, 
-                                  pattern = str_c(contaminants,
-                                                  collapse = "|"),
-                                  negate = TRUE)) %>%
-                select(-Species) %>% as.matrix
-
-
-# Write decontaminated species table to file
-write_csv(x = decontaminated_table, file = "decontaminated-gene-taxonomy_table.csv")
-
-# Get the index of species (contaminants and unclassified) to drop
-non_microbial <- "Unclassified;_;_;_;_;_;_"
-species_to_drop_index <- grep(x = rownames(feature_table), 
-                              str_c(c(contaminants,non_microbial), 
-                                    collapse = "|"))
-
-mat2plot <- feature_table[-species_to_drop_index,]
-png(filename = "decontaminated-gene-taxonomy-heatmap_GLlblMetag.png", 
-    width = plot_width, height = plot_height, units = "in", res=300)
-pheatmap(mat = mat2plot,
-         cluster_cols = FALSE, 
-         cluster_rows = FALSE, 
-         col = colours, 
-         angle_col = 0, 
-         display_numbers = TRUE,
-         fontsize = 14, 
-         number_format = "%.0f")
-dev.off()
-
-```
-
-**Input data:**
-
-- metadata_file  (path to sample-wise metadata file)
-- gene_taxonomy_table.csv (aggregated gene taxonomy table, output from [Step 21b](#21b-gene-level-taxonomy-heatmaps))
-
-**Output data:**
-
-- **decontam-gene-taxonomy_results.csv** (decontam's results table)
-- **decontaminated-gene-taxonomy_table.csv** (decontaminated species table)
-- **decontaminated-gene-taxonomy-heatmap_GLlblMetag.png** (heatmap after filtering out contaminants)
-
+- **Combined-gene-level-taxonomy_decontam_results_GLlblMetag.csv** (decontam's results table)
+- **Combined-gene-level-taxonomy_decontam_species_table_GLlblMetag.csv** (decontaminated species table)
+- **Combined-gene-level-taxonomy_decontam_heatmap_GLlblMetag.png** (gene-level taxonomy heatmap after filtering out contaminants)
 
-
-#### 21d. Gene-level KO functions heatmaps
+#### 24c. Gene-level KO functions heatmaps
 
 ```R
 library(tidyverse)
 library(pheatmap)
 
+metadata_file <- "/path/to/sample/metadata"
+feature_table_file <- "Combined-gene-level-KO-function-coverages-CPM_GLlblMetag.tsv"
+
 # Abundant functions with CPM > 2000
 abundance_threshold <- 2000
 
-sample_order <- get_sample_names("assembly-summaries_GLlblMetag.tsv")
-# Read-in KO functions table
-functions_table <- read_input_table("Combined-gene-level-KO-function-coverages-CPM_GLlblMetag.tsv") %>%
-                    select(KO_ID, KO_function, !!sample_order)
+# Prepare metadata
+metadata <- read_delim(metadata_file, delim = ",") %>% as.data.frame
+sample_names = metadata[, samples_column]
+row.names(metadata) <- sample_names
 
-# Subset table and then convert from datafame to matrix
-functions.m <- functions_table[,sample_order] %>% as.matrix()
+# Read-in KO functions table and drop unannotated
+functions_table <- read_delim(file = feature_table_file, delim = "\t", comment = "#") %>%
+                   select(KO_ID, KO_function, !!any_of(sample_names)) %>%
+                   filter(KO_ID != "Not annotated")
+
+# Convert the sample level data into a matrix
+functions.m <- functions_table %>% select(any_of(sample_names)) %>% as.matrix()
 rownames(functions.m) <- functions_table$KO_ID
-table2write <-  functions.m %>% 
-                      as.data.frame() %>% rownames_to_column("KO_ID") %>%
-                      filter(KO_ID != "Not annotated") # Drop unannotated / unclassified
+
+# Convert to a data frame for output (unannotated functions were dropped above)
+table2write <- functions.m %>% as.data.frame %>%
+               rownames_to_column("KO_ID")
 # Write out  taxonomy table
 write_csv(x = table2write  , file = "genes-KO-functions_table.csv")
 
+# Get common samples and re-arrange feature matrix and metadata
+common_samples <- intersect(colnames(functions.m), rownames(metadata))
+functions.m <- functions.m[, common_samples]
+metadata <- metadata[common_samples, ]
+metadata <- metadata %>% arrange(!!sym(group_column))
 
-#------ All KO functions assignments
-
-# Drop unclassified assignments
-mat2plot <- functions.m[-match("Not annotated", rownames(functions.m),]
-
-png(filename = "All-genes-KO-functions-heatmap_GLlblMetag.png", 
-    width = plot_width, height = plot_height, units = "in", res=300)
-pheatmap(mat = mat2plot,
-         cluster_cols = FALSE, 
-         cluster_rows = FALSE, 
-         col = colours, 
-         angle_col = 0, 
-         display_numbers = TRUE,
-         fontsize = 12,
-         number_format = "%.0f")
-dev.off()
-
-
-#------ Abundant KO functions assignments
-
-functions <- rowSums(functions.m) %>% sort()
-abund_functions <- functions[ functions > abundance_threshold ] %>% names
-abund_functions.m <- functions.m[abund_functions,]
-
-
-# Drop unannotated assignments
-mat2plot <- abund_functions.m[-match("Not annotated", rownames(abund_functions.m)),]
+make_heatmap(metadata, functions.m,
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Combined-gene-level-KO-function", 
+             assay_suffix = "_GLlblMetag", 
+             custom_palette = custom_palette)
 
-png(filename = "Abundant-genes-KO-functions-heatmap_GLlblMetag.png", 
-    width = plot_width, height = plot_height, units = "in", res=300)
-pheatmap(mat = mat2plot,
-         cluster_cols = FALSE, 
-         cluster_rows = FALSE, 
-         col = colours, 
-         angle_col = 0, 
-         display_numbers = TRUE,
-         fontsize = 12,
-         number_format = "%.0f")
-dev.off()
 ```
 
-**Parameter Definitions:**  
-
+**Custom Functions Used:**
+- [make_heatmap()](#make_heatmap)
 
 **Input data:**
-- assembly-summaries_GLlblMetag.tsv (table of assembly summary statistics, output from [Step 13b](#13b-summarizing-assemblies))
-- Combined-gene-level-KO-function-coverages-CPM_GLlblMetag.tsv (table with all samples combined based on KO annotations; normalized to coverage per million genes covered, output from [Step 21a](#21a-generating-gene-level-coverage-summary-tables))
+- /path/to/sample/metadata (a file with samples as rows and columns describing each sample)
+- Combined-gene-level-KO-function-coverages-CPM_GLlblMetag.tsv (table with all samples combined 
+    based on KO annotations; normalized to coverage per million genes covered, output from 
+    [Step 21a](#21a-generate-gene-level-coverage-summary-tables))
 
 **Output data:**
-- genes-KO-functions_table.csv (aggregated and subsetted gene KO functions table)
-- **All-genes-KO-functions-heatmap_GLlblMetag.png** (heatmap of gene-wise KO function assignments)
-- **Abundant-genes-KO-functions-heatmap_GLlblMetag.png** (heatmap of gene-wise abundant KO function assignments)
+- genes-KO-functions_table.csv (aggregated and subsetted gene KO function table)
+- **Combined-gene-level-KO-function_heatmap_GLlblMetag.png** (heatmap of all gene-level KO function assignments)
 
-#### 21e. Gene-level KO functions decontamination --- END NEEDS REVIEW ---
+#### 24d. Gene-level KO functions decontamination
 
 ```R
 library(tidyverse)
 library(decontam)
 library(phyloseq)
 
-# Set to 0.5 for a more aggressive approach where species more prevalent
-# in negative controls are considered contaminants
-contam_threshold <- 0.1 
-# Control samples in this column should always be written as "Control_Sample" and true samples as "True_Sample"
-prev_col <- "Sample_or_Control"
-freq_col <- "input_conc_ng"
-plot_width <- 18
-plot_height <- 8
-
-# Read-in metadata
-metdata_file <- "/path/to/sample/metadata"
-samples_column <- "Sample_ID"
-metadata <- read_delim(file=metdata_file , delim = "\t") %>% as.data.frame()
-row.names(metadata) <- metadata[,samples_column]
+feature_table_file <- "genes-KO-functions_table.csv"
+metadata_table <- "/path/to/sample/metadata"
+number_samples <- NumberOfSamples # integer indicating how many samples are in the file
 
-# Read-in feature table
-functions.m <- read_csv("genes-KO-functions_table.csv")
-rownames(functions.m) <- functions.m[['KO_ID']]
-gene.m <- functions.m[,-match("KO_ID", colnames(functions.m))] %>% as.matrix()
-feature_table <- functions.m
+# set width based on number of samples, with a cap at 50 inches
+plot_width <- 2 * number_samples
+if(plot_width > 50) { plot_width = 50 }
 
+# Prepare metadata
+metadata <- read_delim(metadata_table, delim = ",") %>% as.data.frame
+sample_names = metadata[, samples_column]
+row.names(metadata) <- sample_names
 
-contamdf <- run_decontam(feature_table, metadata, contam_threshold, prev_col, freq_col)
+decontaminated_table <- feature_decontam(metadata_file = metadata_table, 
+                                         feature_table_file = feature_table_file, 
+                                         feature_column = "KO_ID", 
+                                         samples_column = "sample_id",
+                                         prevalence_column = "NTC", 
+                                         ntc_name = "TRUE", 
+                                         frequency_column = "concentration", 
+                                         threshold = 0.1, 
+                                         classification_method = "Combined-gene-level-KO-function", 
+                                         output_prefix = "", 
+                                         assay_suffix = "_GLlblMetag")
 
-# Write decontam results table to file
-write_csv(x = contamdf %>% rownames_to_column("KO_ID"), file = "decontam-gene-KO-functions_results.csv")
+# Get common samples and re-arrange feature table and metadata
+common_samples <- intersect(colnames(decontaminated_table), rownames(metadata))
+decontaminated_table <- decontaminated_table[, common_samples]
+metadata <- metadata[common_samples, ]
+metadata <- metadata %>% arrange(!!sym(group_column))
 
-# Get the list of contaminants identified by decontam
-contaminants <- contamdf %>%
-                as.data.frame %>%
-                rownames_to_column("KO_ID") %>%
-                filter(contaminant == TRUE) %>% pull(KO_ID)
-
-# Drop contaminant features identified by decontam
-decontaminated_table <- feature_table %>% 
-                as.data.frame  %>% 
-                rownames_to_column("KO_ID") %>% 
-                filter(str_detect(Species, 
-                                  pattern = str_c(contaminants,
-                                                  collapse = "|"),
-                                  negate = TRUE)) %>%
-                select(-KO_ID) %>% as.matrix
-
-
-# Write decontaminated species table to file
-write_csv(x = decontaminated_table, file = "decontaminated-gene-KO-functions_table.csv")
-
-# Get the index of species (contaminants and unclassified) to drop
-unclassified <- "Not annotated"
-functions_to_drop_index <- grep(x = rownames(feature_table), 
-                              str_c(c(contaminants,unclassified), 
-                                    collapse = "|"))
-
-mat2plot <- feature_table[-functions_to_drop_index,]
-png(filename = "decontaminated-gene-KO-functions-heatmap_GLlblMetag.png", 
-    width = plot_width, height = plot_height, units = "in", res=300)
-pheatmap(mat = mat2plot,
-         cluster_cols = FALSE, 
-         cluster_rows = FALSE, 
-         col = colours, 
-         angle_col = 0, 
-         display_numbers = TRUE,
-         fontsize = 14, 
-         number_format = "%.0f")
-dev.off()
+make_heatmap(metadata, decontaminated_table, 
+             samples_column = "sample_id", group_column = "group", 
+             output_prefix = "Combined-gene-level-KO-function_decontam", 
+             assay_suffix = "_GLlblMetag",
+             custom_palette = custom_palette)
 
 ```
 
-**Input data:**
+**Custom Functions Used:**
+- [feature_decontam()](#feature_decontam)
+- [make_heatmap()](#make_heatmap)
 
-- metadata_file  (path to sample-wise metadata file)
-- gene_taxonomy_table.csv (agggregated gene taxomy table, output from [Step 21b](#21b-gene-level-taxonomy-heatmaps))
+**Parameter Definitions:**
+- `metadata_table` - path to a file with samples as rows and columns describing each sample
+- `feature_table_file` - path to a comma-separated feature table containing gene-level KO function coverage data, 
+                         with KO_ID as the first column and samples as the other columns
+- `number_samples` - the total number of samples in the feature_table_file, adjust based on number of input samples
 
-**Output data:**
+**Input Data:**
 
-- **decontam-gene-KO-functions_results.csv** (decontam's results table)
-- **decontaminated-gene-KO-functions_table.csv** (decontaminated functions table)
-- **decontaminated-gene-KO-functions-heatmap_GLlblMetag.png** (heatmap after filtering out contaminants)
+- `genes-KO-functions_table.csv` (aggregated gene KO functions table with samples in columns and KO_ID in rows, from [Step 24c](#24c-gene-level-ko-functions-heatmaps))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
 
+**Output Data:**
 
-#### 21g. Contig-level Heatmaps --- START NEEDS REVIEW ---
+- **Combined-gene-level-KO-function_decontam_results_GLlblMetag.csv** (decontam's results table)
+- **Combined-gene-level-KO-function_decontam_species_table_GLlblMetag.csv** (decontaminated gene-level KO functions table)
+- **Combined-gene-level-KO-function_decontam_heatmap_GLlblMetag.png** (gene-level KO functions heatmap after filtering out contaminants)
+
+
+#### 24f. Contig-level Heatmaps
 
 ```R
-plot_width <- 20
-plot_height <- 30
-sample_order <- get_sample_names("assembly-summaries_GLlblMetag.tsv")
+library(tidyverse)
+
+metadata_file <- "/path/to/sample/metadata"
+feature_table_file <- "Combined-contig-level-taxonomy-coverages-CPM_GLlblMetag.tsv"
 
-species_contig_table <-  read_contig_table("Combined-contig-level-taxonomy-coverages-CPM_GLlblMetag.tsv")
+# Prepare metadata
+metadata <- read_delim(metadata_file, delim = ",") %>% as.data.frame
+sample_names = metadata[, samples_column]
+row.names(metadata) <- sample_names
 
-contig.m <- species_contig_table %>%
+# Prepare feature table
+contig_taxonomy_table <- read_assembly_coverage_table(feature_table_file, sample_names) %>% as.data.frame
+
+# Summarize contig table
+species_contig_table <- contig_taxonomy_table %>%
+  select(species, any_of(sample_names)) %>%
   group_by(species) %>%
-  summarise(across(everything(), sum)) %>%
+  summarise(across(everything(), sum)) %>% 
   filter(species != "Unclassified;_;_;_;_;_;_") %>% # Drop unclassifed
-  as.data.frame()
+  as.data.frame
 
-# Write out contig taxonomy table
-write_csv(x = contig.m, file = "contig_taxonomy_table.csv")
-
-rownames(contig.m) <- contig.m[['species']]
-contig.m <- contig.m[,-match("species", colnames(contig.m))] %>% as.matrix()
-
-#------ All contig taxonomy assignments
-
-# Drop unclassified assignments
-mat2plot <- contig.m[-match("Unclassified;_;_;_;_;_;_", rownames(contig.m)),]
-
-png(filename = "All-contig-taxonomy-heatmap_GLlblMetag.png", 
-    width = plot_width, height = plot_height, units = "in", res=300)
-pheatmap(mat = mat2plot,
-         cluster_cols = FALSE, 
-         cluster_rows = FALSE, 
-         col = colours, 
-         angle_col = 0, 
-         display_numbers = TRUE,
-         fontsize = 12,
-         number_format = "%.0f")
-dev.off()
-
-
-#------ Abundant contig taxonomy assignments
-
-taxa <- rowSums(contig.m) %>% sort()
-abund_taxa <- taxa[ taxa > abundance_threshold ] %>% names
-abund_contig.m <- contig.m[abund_taxa,]
-
-mat2plot <- abund_contig.m[-match("Unclassified;_;_;_;_;_;_", rownames(abund_contig.m)),]
-
-png(filename = "Abundant-contig-taxonomy-heatmap_GLlblMetag.png", 
-    width = plot_width, height = plot_height, units = "in", res=300)
-pheatmap(mat = mat2plot,
-         cluster_cols = FALSE, 
-         cluster_rows = FALSE, 
-         col = colours, 
-         angle_col = 0, 
-         display_numbers = TRUE,
-         fontsize = 12,
-         number_format = "%.0f")
-dev.off()
-```
+rownames(species_contig_table) <- species_contig_table[[1]]
+species_contig_table <- species_contig_table[, -1] %>% as.matrix()
 
+# Get common samples and re-arrange feature table and metadata
+common_samples <- intersect(colnames(species_contig_table), rownames(metadata))
+species_contig_table <- species_contig_table[, common_samples]
+metadata <- metadata[common_samples, ]
+metadata <- metadata %>% arrange(!!sym(group_column))
 
-**Parameter Definitions:**  
+table2write = species_contig_table %>% as.data.frame %>% rownames_to_column("species")
+# Write out contig taxonomy table
+write_csv(x = table2write, file = "contig_taxonomy_table.csv")
 
+make_heatmap(metadata, species_contig_table, 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Combined-contig-level-taxonomy", 
+             assay_suffix = "_GLlblMetag", 
+             custom_palette = custom_palette)
+```
 
-**Input data:**
+**Custom Functions Used:**
+- [read_assembly_coverage_table()](#read_assembly_coverage_table)
+- [make_heatmap()](#make_heatmap)
 
-- assembly-summaries_GLlblMetag.tsv (table of assembly summary statistics, output from [Step 13b](#13b-summarizing-assemblies))
-- Combined-contig-level-taxonomy-coverages-CPM_GLlblMetag.tsv (table with all samples combined based on contig-level taxonomic classifications; normalized to coverage per million genes covered from [Step 21f](#21f-generating-contig-level-coverage-summary-tables))
+**Input data:**
+- /path/to/sample/metadata (a file with samples as rows and columns describing each sample)
+- Combined-contig-level-taxonomy-coverages-CPM_GLlblMetag.tsv (table with all samples 
+    combined based on contig-level taxonomic classifications, output from 
+    [Step 21b](#21b-generate-contig-level-coverage-summary-tables)) 
 
 **Output data:**
+- contig_taxonomy_table.csv (aggregated contig taxonomy table with samples in columns and species in rows)
+- **Combined-contig-level-taxonomy_heatmap_GLlblMetag.png** (heatmap of all contig taxonomy assignments)
 
-- contig_taxonomy_table.csv (aggregated contig taxonomy)
-- **All-contig-taxonomy-heatmap_GLlblMetag.png** (All contig level taxonomy heatmap)
-- **Abundant-contig-taxonomy-heatmap_GLlblMetag.png** (Abundant contig level taxonomy heatmap)
-
-
-#### 21h. Contig-level decontamination --- END NEEDS REVIEW ---
+#### 24g. Contig-level decontamination
 
 ```R
 library(tidyverse)
 library(decontam)
 library(phyloseq)
 
-# Set to 0.5 for a more aggressive approach where species more prevalent
-# in the negative controls are considered contaminants
-contam_threshold <- 0.1
-# Control samples in this column should always be written as
-# "Control_Sample" and true samples as "True_Sample"
-prev_col <- "Sample_or_Control"
-freq_col <- "input_conc_ng"
-plot_width <- 18
-plot_height <- 8
-
-# Read-in metadata
-metdata_file <- "/path/to/sample/metadata"
-samples_column <- "Sample_ID"
-metadata <- read_delim(file=metdata_file , delim = "\t") %>% as.data.frame()
-row.names(metadata) <- metadata[,samples_column]
+feature_table_file <- "contig_taxonomy_table.csv"
+metadata_table <- "/path/to/sample/metadata"
+number_samples <- NumberOfSamples # integer indicating how many samples are in the file
 
-# Read-in feature table
-contig.m <- read_csv("contig_taxonomy_table.csv")
-rownames(contig.m) <- contig.m[['species']]
-contig.m <- contig.m[,-match("species", colnames(contig.m))] %>% as.matrix()
-feature_table <- contig.m
+# set width based on number of samples, with a cap at 50 inches
+plot_width <- 2 * number_samples
+if(plot_width > 50) { plot_width = 50 }
 
+# Prepare metadata
+metadata <- read_delim(metadata_table, delim = ",") %>% as.data.frame
+sample_names = metadata[, samples_column]
+row.names(metadata) <- sample_names
 
-contamdf <- run_decontam(feature_table, metadata, contam_threshold, prev_col, freq_col)
+decontaminated_table <- feature_decontam(metadata_file = metadata_table, 
+                                         feature_table_file = feature_table_file, 
+                                         feature_column = "species", 
+                                         samples_column = "sample_id",
+                                         prevalence_column = "NTC", 
+                                         ntc_name = "TRUE", 
+                                         frequency_column = "concentration", 
+                                         threshold = 0.1, 
+                                         classification_method = "Combined-contig-level-taxonomy", 
+                                         output_prefix = "", 
+                                         assay_suffix = "_GLlblMetag")
 
-# Write decontam results table to file
-write_csv(x = contamdf %>% rownames_to_column("Species"), file = "decontam-gene-taxonomy_results.csv")
+# Get common samples and re-arrange feature table and metadata
+common_samples <- intersect(colnames(decontaminated_table), rownames(metadata))
+decontaminated_table <- decontaminated_table[, common_samples]
+metadata <- metadata[common_samples, ]
+metadata <- metadata %>% arrange(!!sym(group_column))
 
-# Get a list of contaminants identified by decontam
-contaminants <- contamdf %>%
-                as.data.frame %>%
-                rownames_to_column("Species") %>%
-                filter(contaminant == TRUE) %>% pull(Species)
-
-# Drop contaminant features identified by decontam
-decontaminated_table <- feature_table %>% 
-                as.data.frame  %>% 
-                rownames_to_column("Species") %>% 
-                filter(str_detect(Species, 
-                                  pattern = str_c(contaminants,
-                                                  collapse = "|"),
-                                  negate = TRUE)) %>%
-                select(-Species) %>% as.matrix
-
-
-# Write decontaminated species table to file
-write_csv(x = decontaminated_table, file = "decontaminated-gene-taxonomy_table.csv")
-
-# Get the index of species (contaminants and unclassified) to drop
-non_microbial <- "Unclassified;_;_;_;_;_;_"
-species_to_drop_index <- grep(x = rownames(feature_table), 
-                              str_c(c(contaminants,non_microbial), 
-                                    collapse = "|"))
-
-mat2plot <- feature_table[-species_to_drop_index,]
-png(filename = "decontaminated-contig-taxonomy-heatmap_GLlblMetag.png", 
-    width = plot_width, height = plot_height, units = "in", res=300)
-pheatmap(mat = mat2plot,
-         cluster_cols = FALSE, 
-         cluster_rows = FALSE, 
-         col = colours, 
-         angle_col = 0, 
-         display_numbers = TRUE,
-         fontsize = 14, 
-         number_format = "%.0f")
-dev.off()
+make_heatmap(metadata, decontaminated_table, 
+             samples_column = "sample_id", group_column = "group", 
+             output_prefix = "Combined-contig-level-taxonomy_decontam", 
+             assay_suffix = "_GLlblMetag",
+             custom_palette = custom_palette)
 
 ```
 
-**Input data:**
+**Custom Functions Used:**
+- [feature_decontam()](#feature_decontam)
+- [make_heatmap()](#make_heatmap)
 
-- metadata_file  (path to sample-wise metadata file)
-- contig_taxonomy_table.csv (aggregated contig taxonomy table, output from [Step 21g](#21g-contig-level-heatmaps))
+**Parameter Definitions:**
+- `metadata_table` - path to a file with samples as rows and columns describing each sample
+- `feature_table_file` - path to a comma-separated feature table containing contig-level coverage data, 
+                         with species as the first column and samples as the other columns
+- `number_samples` - the total number of samples in the feature_table_file, adjust based on number of input samples
 
-**Output data:**
+**Input Data:**
 
-- **decontam-contig-taxonomy_results.csv** (decontam's results table)
-- **decontaminated-contig-taxonomy_table.csv** (decontaminated species table)
-- **decontaminated-contig-taxonomy-heatmap_GLlblMetag.png** (heatmap after filtering out contaminants)
+- `contig_taxonomy_table.csv` (aggregated contig taxonomy table with samples in columns and species in rows, from [Step 24f](#24f-contig-level-heatmaps))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
 
+**Output Data:**
 
+- **Combined-contig-level-taxonomy_decontam_results_GLlblMetag.csv** (decontam's results table)
+- **Combined-contig-level-taxonomy_decontam_species_table_GLlblMetag.csv** (decontaminated contig-level species table)
+- **Combined-contig-level-taxonomy_decontam_heatmap_GLlblMetag.png** (contig-level heatmap after filtering out contaminants)
 

From 614c1455c4b11720eb5a35a49b9ac936352831db Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Sun, 25 Jan 2026 22:42:41 -0800
Subject: [PATCH 21/47] Finished Illumina low-biomass pipeline draft

- Updated GL-DPPD-7117 to 1st draft status
- Fixed some links and formatting in GL-DPPD-7116

TODO: Add barplots and decontamination to read-based metaphlan
taxonomies
---
 .../Low_Biomass/Illumina/GL-DPPD-7117.md      | 2723 +++++++++--------
 .../Low_Biomass/Nanopore/GL-DPPD-7116.md      |   25 +-
 2 files changed, 1404 insertions(+), 1344 deletions(-)

diff --git a/Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md b/Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md
index a1d96d7f2..b9fcf4b2f 100644
--- a/Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md
+++ b/Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md
@@ -28,7 +28,7 @@ Barbara Novak (GeneLab Data Processing Lead)
     - [1. Raw Data QC](#1-raw-data-qc)
       - [1a. Raw Data QC](#1a-raw-data-qc)
       - [3b. Compile Raw Data QC](#1b-compile-raw-data-qc)
-    - [2. Human Read Removal](
+    - [2. Human Read Removal](#2-human-read-removal)
       - [2a. Build Kraken2 Database](#2a-build-kraken2-database)
       - [2b. Remove Human Reads](#2b-remove-human-reads)
       - [2c. Compile Human Read Removal QC](#2c-compile-human-read-removal-qc)
@@ -37,88 +37,100 @@ Barbara Novak (GeneLab Data Processing Lead)
       - [3b. Trim PolyG](#3b-trim-polyg)
       - [3c. Filtered Data QC](#3c-filtered-data-qc)
       - [3d. Compile Filtered Data QC](#3d-compile-filtered-data-qc)
-    - [4. Contaminant Removal](#7-contaminant-removal)
-      - [4a. Assemble Contaminants](#7a-assemble-contaminants)
-      - [4b. Build Contaminant Index and Map Reads](#7b-build-contaminant-index-and-map-reads)
-      - [4c. Sort and Index Contaminant Reads](#7c-sort-and-index-contaminant-alignments)
-      - [4d. Gather Contaminant Mapping Metrics](#7d-gather-contaminant-mapping-metrics)
-      - [4e. Generate Decontaminated Read Files](#7e-generate-decontaminated-read-files)
-      - [4f. Contaminant Removal QC](#7f-contaminant-removal-qc)
-      - [4g. Compile Contaminant Removal QC](#7g-compile-contaminant-removal-qc)
-    - [8. R Environment Setup](#8-r-environment-setup)
-      - [8a. Load Libraries](#8a-load-libraries)
-      - [8b. Define Custom Functions](#8b-define-custom-functions)
-      - [8c. Set global variables](#8c-set-global-variables)
+    - [4. Contaminant Removal](#4-contaminant-removal)
+      - [4a. Assemble Contaminants](#4a-assemble-contaminants)
+      - [4b. Build Contaminant Index and Map Reads](#4b-build-contaminant-index-and-map-reads)
+      - [4c. Contaminant Removal QC](#4c-contaminant-removal-qc)
+      - [4d. Compile Contaminant Removal QC](#4d-compile-contaminant-removal-qc)
+    - [5. Host read removal](#5-host-read-removal)
+      - [5a. Build Kraken2 Host Database](#5a-build-kraken2-host-database)
+      - [5b. Remove Host Reads](#5b-remove-host-reads)
+      - [5c. Compile Host Read Removal QC](#5c-compile-host-read-removal-qc)
+    - [6. R Environment Setup](#6-r-environment-setup)
+      - [6a. Load Libraries](#6a-load-libraries)
+      - [6b. Define Custom Functions](#6b-define-custom-functions)
+      - [6c. Set global variables](#6c-set-global-variables)
   - [**Read-based processing**](#read-based-processing)
-    - [9. Taxonomic profiling using kaiju](#9-taxonomic-profiling-using-kaiju)
-      - [9a. Build Kaiju Database](#9a-build-kaiju-database)
-      - [9b. Kaiju Taxonomic Classification](#9b-kaiju-taxonomic-classification)
-      - [9c. Compile Kaiju Taxonomy Results](#9c-compile-kaiju-taxonomy-results)
-      - [9d. Convert Kaiju Output To Krona Format](#9d-convert-kaiju-output-to-krona-format)
-      - [9e. Compile Kaiju Krona Reports](#9e-compile-kaiju-krona-reports)
-      - [9f. Create Kaiju Species Count Table](#9f-create-kaiju-species-count-table)
-      - [9g. Read-in Tables](#9g-read-in-tables)
-      - [9h. Taxonomy Barplots](#9h-taxonomy-barplots)
-      - [9i. Feature Decontamination](#9i-feature-decontamination)
-    - [10. Taxonomic Profiling Using Kraken2](#10-taxonomic-profiling-using-kraken2)
-      - [10a. Download Kraken2 Database](#10a-download-kraken2-database)
-      - [10b. Kraken2 Taxonomic Classification](#10b-kraken2-taxonomic-classification)
-      - [10c. Compile Kraken2 Taxonomy Results](#10c-compile-kraken2-taxonomy-results)
-        - [10ci. Create Merged Kraken2 Taxonomy Table](10ci-create-merged-kraken2-taxonomy-table)
-        - [10cii. Compile Kraken2 Taxonomy Reports](10cii-compile-kraken2-taxonomy-reports)
-      - [10d. Convert Kraken2 Output to Krona Format](#10d-convert-kraken2-output-to-krona-format)
-      - [10e. Compile Kraken2 Krona Reports](#10e-compile-kraken2-krona-reports)
-      - [10f. Create Kraken2 Species Count Table](#10f-create-kraken2-species-count-table)
-      - [10g. Read-in Tables](#10g-read-in-tables)
-      - [10h. Taxonomy Barplots](#10h-taxonomy-barplots)
-      - [10i. Feature Decontamination](#10i-feature-decontamination)
-  - [**Assembly-based processing**](#assembly-based-processing)
-    - [11. Sample Assembly](#11-sample-assembly)
-    - [12. Polish Assembly](#12-polish-assembly)
-    - [13. Rename Contigs and Summarize Assemblies](#13-rename-contigs-and-summarize-assemblies)
-      - [13a. Rename Contig Headers](#13a-rename-contig-headers)
-      - [13b. Summarize Assemblies](#13b-summarize-assemblies)
-    - [14. Gene Prediction](#14-gene-prediction)
-      - [14a. Generate Gene Predictions](14a-generate-gene-predictions)
-      - [14b. Remove Line Wraps In Gene Prediction Output](#14a-remove-line-wraps-in-gene-prediction-output)
-    - [15. Functional Annotation](#15-functional-annotation)
-      - [15a. Download Reference Database of HMM Models](#15a-download-reference-database-of-hmm-models)
-      - [15b. Run KEGG Annotation](#15b-run-kegg-annotation)
-      - [15c. Filter KO Outputs](#15c-filter-ko-outputs)
-    - [16. Taxonomic Classification](#16-taxonomic-classification)
-      - [16a. Pull and Unpack Pre-built Reference DB](#16a-pull-and-unpack-pre-built-reference-db)
-      - [16b. Run Taxonomic Classification](#16b-run-taxonomic-classification)
-      - [16c. Add Taxonomy Info From Taxids To Genes](#16c-add-taxonomy-info-from-taxids-to-genes)
-      - [16d. Add Taxonomy Info From Taxids To Contigs](#16d-add-taxonomy-info-from-taxids-to-contigs)
-      - [16e. Format Gene-level Output With awk and sed](#16e-format-gene-level-output-with-awk-and-sed)
-      - [16f. Format Contig-level Output With awk and sed](#16f-format-contig-level-output-with-awk-and-sed)
-    - [17. Read-Mapping](#17-read-mapping)
-      - [17a. Align Reads to Sample Assembly](#17a-align-reads-to-sample-assembly)
-      - [17b. Sort and Index Assembly Alignments](#17b-sort-and-index-assembly-alignments)
-    - [18. Get Coverage Information and Filter Based On Detection](#18-get-coverage-information-and-filter-based-on-detection)
-      - [18a. Filter Coverage Levels Based On Detection](#18a-filter-coverage-levels-based-on-detection)
-      - [18b. Filter Gene and Contig Coverage Based On Detection](#18b-filter-gene-and-contig-coverage-based-on-detection)
-    - [19. Combine Gene-level Coverage, Taxonomy, and Functional Annotations For Each Sample](#19-combine-gene-level-coverage-taxonomy-and-functional-annotations-for-each-sample)
-    - [20. Combine Contig-level Coverage and Taxonomy For Each Sample](#20-combine-contig-level-coverage-and-taxonomy-for-each-sample)
-    - [21. Generate Normalized, Gene- and Contig-level Coverage Summary Tables of KO-annotations and Taxonomy Across Samples](#21-generate-normalized-gene--and-contig-level-coverage-summary-tables-of-ko-annotations-and-taxonomy-across-samples)
-      - [21a. Generate Gene-level Coverage Summary Tables](#21a-generate-gene-level-coverage-summary-tables)
-      - [21b. Gene-level taxonomy heatmaps](#21b-gene-level-taxonomy-heatmaps)
-      - [21c. Gene-level taxonomy decontamination](#21c-gene-level-taxonomy-decontamination)
-      - [21d. Gene-level KO functions heatmaps](#21d-gene-level-ko-functions-heatmaps)
-      - [21e. Gene-level KO functions decontamination](#21e-gene-level-ko-functions-decontamination)
-      - [21f. Generate contig-level coverage summary tables](#21f-generate-contig-level-coverage-summary-tables)
-      - [21g. Contig-level Heatmaps](#21g-contig-level-heatmaps)
-      - [21h. Contig-level decontamination](#21h-contig-level-decontamination)
-    - [22. **M**etagenome-**A**ssembled **G**enome (MAG) recovery](#22-metagenome-assembled-genome-mag-recovery)
-      - [22a. Bin Contigs](#22a-bin-contigs)
-      - [22b. Bin Quality Assessment](#22b-bin-quality-assessment)
-      - [22c. Filter MAGs](#22c-filter-mags)
-      - [22d. MAG Taxonomic Classification](#22d-mag-taxonomic-classification)
-      - [22e. Generate Overview Table Of All MAGs](#22e-generate-overview-table-of-all-mags)
-    - [23. Generate MAG-level Functional Summary Overview](#23-generate-mag-level-functional-summary-overview)
-      - [23a. Get KO Annotations Per MAG](#23a-get-ko-annotations-per-mag)
-      - [23b. Summarize KO Annotations With KEGG-Decoder](#23b-summarize-ko-annotations-with-kegg-decoder)
-
+    - [7. Taxonomic profiling using kaiju](#7-taxonomic-profiling-using-kaiju)
+      - [7a. Build Kaiju Database](#7a-build-kaiju-database)
+      - [7b. Kaiju Taxonomic Classification](#7b-kaiju-taxonomic-classification)
+      - [7c. Compile Kaiju Taxonomy Results](#7c-compile-kaiju-taxonomy-results)
+      - [7d. Convert Kaiju Output To Krona Format](#7d-convert-kaiju-output-to-krona-format)
+      - [7e. Compile Kaiju Krona Reports](#7e-compile-kaiju-krona-reports)
+      - [7f. Create Kaiju Species Count Table](#7f-create-kaiju-species-count-table)
+      - [7g. Read-in Tables](#7g-read-in-tables)
+      - [7h. Taxonomy Barplots](#7h-taxonomy-barplots)
+      - [7i. Feature Decontamination](#7i-feature-decontamination)
+    - [8. Taxonomic Profiling Using Kraken2](#8-taxonomic-profiling-using-kraken2)
+      - [8a. Download Kraken2 Database](#8a-download-kraken2-database)
+      - [8b. Kraken2 Taxonomic Classification](#8b-kraken2-taxonomic-classification)
+      - [8c. Compile Kraken2 Taxonomy Results](#8c-compile-kraken2-taxonomy-results)
+        - [8ci. Create Merged Kraken2 Taxonomy Table](#8ci-create-merged-kraken2-taxonomy-table)
+        - [8cii. Compile Kraken2 Taxonomy Reports](#8cii-compile-kraken2-taxonomy-reports)
+      - [8d. Convert Kraken2 Output to Krona Format](#8d-convert-kraken2-output-to-krona-format)
+      - [8e. Compile Kraken2 Krona Reports](#8e-compile-kraken2-krona-reports)
+      - [8f. Create Kraken2 Species Count Table](#8f-create-kraken2-species-count-table)
+      - [8g. Read-in Tables](#8g-read-in-tables)
+      - [8h. Taxonomy Barplots](#8h-taxonomy-barplots)
+      - [8i. Feature Decontamination](#8i-feature-decontamination)
+    - [9. Taxonomic Profiling Using MetaPhlAn](#9-taxonomic-profiling-using-metaphlan)
+      - [9a. Download and install HUMAnN databases](#9a-download-and-install-humann-databases)
+      - [9b. HUMAnN/MetaPhlAn Taxonomic Classification](#9b-humannmetaphlan-taxonomic-classification)
+      - [9c. Merge multiple sample functional profiles](#9c-merge-multiple-sample-functional-profiles)
+      - [9e. Normalize gene families and pathway abundances tables](#9e-normalize-gene-families-and-pathway-abundances-tables)
+      - [9f. Generate a normalized gene-family table grouped by KEGG Orthologs (KOs)](#9f-generate-a-normalized-gene-family-table-grouped-by-kegg-orthologs-kos)
+    - [10. Sample Assembly](#10-sample-assembly)
+    - [11. Rename Contigs and Summarize Assemblies](#11-rename-contigs-and-summarize-assemblies)
+      - [11a. Rename Contig Headers](#11a-rename-contig-headers)
+      - [11b. Summarize Assemblies](#11b-summarize-assemblies)
+    - [12. Gene Prediction](#12-gene-prediction)
+      - [12a. Generate Gene Predictions](#12a-generate-gene-predictions)
+      - [12b. Remove Line Wraps In Gene Prediction Output](#12b-remove-line-wraps-in-gene-prediction-output)
+    - [13. Functional Annotation](#13-functional-annotation)
+      - [13a. Download Reference Database of HMM Models](#13a-download-reference-database-of-hmm-models)
+      - [13b. Run KEGG Annotation](#13b-run-kegg-annotation)
+      - [13c. Filter KO Outputs](#13c-filter-ko-outputs)
+    - [14. Taxonomic Classification](#14-taxonomic-classification)
+      - [14a. Pull and Unpack Pre-built Reference DB](#14a-pull-and-unpack-pre-built-reference-db)
+      - [14b. Run Taxonomic Classification](#14b-run-taxonomic-classification)
+      - [14c. Add Taxonomy Info From Taxids To Genes](#14c-add-taxonomy-info-from-taxids-to-genes)
+      - [14d. Add Taxonomy Info From Taxids To Contigs](#14d-add-taxonomy-info-from-taxids-to-contigs)
+      - [14e. Format Gene-level Output With awk and sed](#14e-format-gene-level-output-with-awk-and-sed)
+      - [14f. Format Contig-level Output With awk and sed](#14f-format-contig-level-output-with-awk-and-sed)
+    - [15. Read-Mapping](#15-read-mapping)
+      - [15a. Build Reference Index](#15a-build-reference-index)
+      - [15b. Align Reads to Sample Assembly](#15b-align-reads-to-sample-assembly)
+      - [15c. Sort and Index Assembly Alignments](#15c-sort-and-index-assembly-alignments)
+    - [16. Get Coverage Information and Filter Based On Detection](#16-get-coverage-information-and-filter-based-on-detection)
+      - [16a. Filter Coverage Levels Based On Detection](#16a-filter-coverage-levels-based-on-detection)
+      - [16b. Filter Gene and Contig Coverage Based On Detection](#16b-filter-gene-and-contig-coverage-based-on-detection)
+    - [17. Combine Gene-level Coverage, Taxonomy, and Functional Annotations For Each Sample](#17-combine-gene-level-coverage-taxonomy-and-functional-annotations-for-each-sample)
+    - [18. Combine Contig-level Coverage and Taxonomy For Each Sample](#18-combine-contig-level-coverage-and-taxonomy-for-each-sample)
+    - [19. Generate Normalized, Gene- and Contig-level Coverage Summary Tables of KO-annotations and Taxonomy Across Samples](#19-generate-normalized-gene--and-contig-level-coverage-summary-tables-of-ko-annotations-and-taxonomy-across-samples)
+      - [19a. Generate Gene-level Coverage Summary Tables](#19a-generate-gene-level-coverage-summary-tables)
+      - [19b. Gene-level taxonomy heatmaps](#19b-gene-level-taxonomy-heatmaps)
+      - [19c. Gene-level taxonomy decontamination](#19c-gene-level-taxonomy-decontamination)
+      - [19d. Gene-level KO functions heatmaps](#19d-gene-level-ko-functions-heatmaps)
+      - [19e. Gene-level KO functions decontamination](#19e-gene-level-ko-functions-decontamination)
+      - [19f. Generate contig-level coverage summary tables](#19f-generate-contig-level-coverage-summary-tables)
+      - [19g. Contig-level Heatmaps](#19g-contig-level-heatmaps)
+      - [19h. Contig-level decontamination](#19h-contig-level-decontamination)
+    - [20. **M**etagenome-**A**ssembled **G**enome (MAG) recovery](#20-metagenome-assembled-genome-mag-recovery)
+      - [20a. Bin Contigs](#20a-bin-contigs)
+      - [20b. Bin Quality Assessment](#20b-bin-quality-assessment)
+      - [20c. Filter MAGs](#20c-filter-mags)
+      - [20d. MAG Taxonomic Classification](#20d-mag-taxonomic-classification)
+      - [20e. Generate Overview Table Of All MAGs](#20e-generate-overview-table-of-all-mags)
+    - [21. Generate MAG-level Functional Summary Overview](#21-generate-mag-level-functional-summary-overview)
+      - [21a. Get KO Annotations Per MAG](#21a-get-ko-annotations-per-mag)
+      - [21b. Summarize KO Annotations With KEGG-Decoder](#21b-summarize-ko-annotations-with-kegg-decoder)
+    - [22. Decontamination and Visualization of Contig- and Gene-taxonomy and gene function outputs](#22-decontamination-and-visualization-of-contig--and-gene-taxonomy-and-gene-function-outputs)
+      - [22a. Gene-level taxonomy heatmaps](#22a-gene-level-taxonomy-heatmaps)
+      - [22b. Gene-level taxonomy decontamination](#22b-gene-level-taxonomy-decontamination)
+      - [22c. Gene-level KO functions heatmaps](#22c-gene-level-ko-functions-heatmaps)
+      - [22d. Gene-level KO functions decontamination](#22d-gene-level-ko-functions-decontamination)
+      - [22e. Contig-level heatmaps](#22e-contig-level-heatmaps)
+      - [22f. Contig-level decontamination](#22f-contig-level-decontamination)
 
 
 ---
@@ -133,14 +145,13 @@ Barbara Novak (GeneLab Data Processing Lead)
 |CheckM| 1.1.3 |[https://github.com/Ecogenomics/CheckM](https://github.com/Ecogenomics/CheckM)|
 |Dorado| 1.1.1| [https://github.com/nanoporetech/dorado](https://github.com/nanoporetech/dorado)|
 |filtlong| 0.2.1 |[https://github.com/rrwick/Filtlong](https://github.com/rrwick/Filtlong)|
-|Flye| 2.9.5 | [https://github.com/mikolmogorov/Flye](https://github.com/mikolmogorov/Flye) |
+|SPAdes| 4.1.0 | [https://github.com/ablab/spades](https://github.com/ablab/spades) |
 |GTDB-Tk| 2.4.0 |[https://github.com/Ecogenomics/GTDBTk](https://github.com/Ecogenomics/GTDBTk)|
 |HUMAnN| 3.9 |[https://github.com/biobakery/humann](https://github.com/biobakery/humann)|
 |Kaiju| 1.10.1 | [https://bioinformatics-centre.github.io/kaiju/](https://bioinformatics-centre.github.io/kaiju/) |
 |KEGG-Decoder| 1.2.2 |[https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder](https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder)
 |KOFamScan| 1.3.0 |[https://github.com/takaram/kofam_scan](https://github.com/takaram/kofam_scan)|
 |Kraken2| 2.1.6 | [https://github.com/DerrickWood/kraken2](https://github.com/DerrickWood/kraken2) |
-|KrakenTools | 1.2 | [https://ccb.jhu.edu/software/krakentools/](https://ccb.jhu.edu/software/krakentools/) |
 |Krona| 2.8.1 | [https://github.com/marbl/Krona/wiki](https://github.com/marbl/Krona/wiki)|
 |MetaBAT| 2.15 |[https://bitbucket.org/berkeleylab/metabat/src/master/](https://bitbucket.org/berkeleylab/metabat/src/master/)|
 |Minimap2| 2.28 | [https://github.com/lh3/minimap2](https://github.com/lh3/minimap2) |
@@ -216,8 +227,8 @@ multiqc --zip-data-dir \
 
 **Output Data:**
 
-- **raw_multiqc_report/filtered_multiqc_GLlbsMetag.html** (multiqc output html summary)
-- **raw_multiqc_report/filtered_multiqc_GLlbsMetag_data.zip** (zip archive containing multiqc output data)
+- **raw_multiqc_report/raw_multiqc_GLlbsMetag.html** (multiqc output html summary)
+- **raw_multiqc_report/raw_multiqc_GLlbsMetag_data.zip** (zip archive containing multiqc output data)
 
 
 <br>  
@@ -295,7 +306,7 @@ gzip sample1_GLlbsMetag_R2_HRrm.fastq
 - `--output` - Specifies the name of the kraken2 read-based output file (one line per read).
 - `--report` - Specifies the name of the kraken2 report output file (one line per taxa, with number of reads assigned to it).
 - `--unclassified-out` - Specifies a regular expression for the naming of the output files containing reads that were not classified, i.e non-human reads.
-- `sample1_R1_filtered.fastq.gz sample1_R2_filtered.fastq.gz` - Positional argument specifying the input read files (omit read2 for single-end data).
+- `sample1_R1_raw.fastq.gz sample1_R2_raw.fastq.gz` - Positional argument specifying the input read files
 
 **Input Data:**
 
@@ -338,12 +349,11 @@ multiqc --zip-data-dir \
 
 <br>
 
-
 ---
 
-### 2. Trimming and Quality Filtering
+### 3. Trimming and Quality Filtering
 
-#### 2a. Filter Quality and Trim Adapters
+#### 3a. Filter Quality and Trim Adapters
 
 ```bash
 fastp --in1 sample1_R1_raw.fastq.gz --out1 temp_sample1_R1_filtered.fastq.gz \
@@ -358,9 +368,9 @@ fastp --in1 sample1_R1_raw.fastq.gz --out1 temp_sample1_R1_filtered.fastq.gz \
 
 **Parameter Definitions:**
 - `--in1` - Specifies the forward input read file
-- `--in2` - Specifies the reverse input read file (omit for single-end data)
+- `--in2` - Specifies the reverse input read file
 - `--in1` - Specifies the forward output read file
-- `--in2` - Specifies the reverse output read file (omit for single-end data)
+- `--out2` - Specifies the reverse output read file
 - `--qualified_quality_phred` - the minimum quality value at which a base is qualified (default: 20)
 - `--length_required` - the minimum read length. Shorter reads will be discarded (default: 50)
 - `--thread` - number of worker threads (default: 2)
@@ -377,7 +387,7 @@ fastp --in1 sample1_R1_raw.fastq.gz --out1 temp_sample1_R1_filtered.fastq.gz \
 
 - temp_*_filtered.fastq.gz (quality filtered and adapter trimmed reads)
 
-#### 2b. Trim polyG
+#### 3b. Trim polyG
 
 ```bash
 fastp --in1 temp_sample1_R1_filtered.fastq.gz --out1 sample1_R1_filtered.fastq.gz \
@@ -393,9 +403,9 @@ fastp --in1 temp_sample1_R1_filtered.fastq.gz --out1 sample1_R1_filtered.fastq.g
 
 **Parameter Definitions:**
 - `--in1` - Specifies the forward input read file
-- `--in2` - Specifies the reverse input read file (omit for single-end data)
+- `--in2` - Specifies the reverse input read file
 - `--in1` - Specifies the forward output read file
-- `--in2` - Specifies the reverse output read file (omit for single-end data)
+- `--out2` - Specifies the reverse output read file
 - `--qualified_quality_phred` - the minimum quality value at which a base is qualified (default: 20)
 - `--length_required` - the minimum read length. Shorter reads will be discarded (default: 50)
 - `--thread` - number of worker threads (default: 2)
@@ -413,7 +423,7 @@ fastp --in1 temp_sample1_R1_filtered.fastq.gz --out1 sample1_R1_filtered.fastq.g
 
 - *filtered.fastq.gz (quality filtered and adapter trimmed reads)
 
-#### 2c. Filtered Data QC
+#### 3c. Filtered Data QC
 
 ```bash
 fastqc -o filtered_fastqc_output *filtered.fastq.gz
@@ -434,7 +444,7 @@ fastqc -o filtered_fastqc_output *filtered.fastq.gz
 - *fastqc.zip (FastQC output data)
 
 
-#### 2d. Compile Filtered Data QC
+#### 3d. Compile Filtered Data QC
 
 ```bash
 multiqc --zip-data-dir \
@@ -465,210 +475,241 @@ multiqc --zip-data-dir \
 
 ---
 
-### 7. Contaminant Removal
+### 4. Contaminant Removal
 
 > A major issue with low biomass data is the high potential for contamination due to the low amount of DNA extracted from the samples. Because negative control/blank samples should by theory be contaminant free, any sequence detected in the negative control is a potential contaminant. To filter out contaminants found in negative control samples that may have been due to cross contamination in the lab, we use a read mapping approach. First negative/blank control sample reads are assembled then the filtered and trimmed reads from each low-biomass sample are mapped to the assembled contigs from the negative/blank control samples. Reads mapping to the assembled contigs are categorized as contaminants and are therefore filtered out and thus excluded from downstream analyses.
 
-### 7a. Assemble Contaminants
+#### 4a. Assemble Contaminants
 
 ```bash
-flye --meta \
-     --threads NumberOfThreads \
-     --out-dir /path/to/contaminant_assembly \
-     --nano-raw /path/to/blank_samples/\*_GLlbsMetag_HRrm.fastq.gz
+cat /path/to/contaminant_fastq/*_R1_filtered.fastq.gz > merged_R1.fastq.gz
+cat /path/to/contaminant_fastq/*_R2_filtered.fastq.gz > merged_R2.fastq.gz
+
+spades.py --meta \
+     --threads 8 \
+     --memory 20 \
+     -o /path/to/contaminant_assembly/ \
+     -1 merged_R1.fastq.gz \
+     -2 merged_R2.fastq.gz
 
 # rename output
-mv assembly.fasta blank-assembly.fasta
-mv flye.log blank-flye.log
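+mv /path/to/contaminant_assembly/scaffolds.fasta /path/to/contaminant_assembly/blank-scaffolds.fasta
+mv /path/to/contaminant_assembly/spades.log /path/to/contaminant_assembly/blank-assembly.log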
 ```
 
 **Parameter Definitions:**
 
-- `--meta` – Use metagenome/uneven coverage mode.
-- `--threads` - Number of parallel processing threads to use.
-- `--out-dir` - Specifies the output directory.
-- `--nano-raw` - Specifies that input is from Oxford Nanopore regular raw reads. This adds a polishing step for error correction after the assembly is generated.
+- `--meta` – Use metagenome/uneven coverage mode
+- `--threads` - Number of parallel processing threads to use (set to 8 in this example)
+- `--memory` - Sets the maximum memory, in GB, that SPAdes may use (set to 20 in this example)
+- `-o` - Specifies the output directory.
 
 **Input Data**
 
-- *_GLlbsMetag_HRrm.fastq.gz (one or more trimmed, HRrm reads from blank (negative control) samples, output from [Step 6b](#6b-remove-human-reads))
+- *_R[12]_filtered.fastq.gz (one or more paired-end, trimmed and filtered, HRrm reads from blank (negative control) samples, output from [Step 3b](#3b-trim-polyg))
 
 **Output Data**
 
 - /path/to/contaminant_assembly/blank-assembly.fasta (assembly built from reads in blank samples in fasta format)
-- blank-flye.log (flye log file)
+- blank-assembly.log (SPAdes log file)
 
 <br>
 
-#### 7b. Build Contaminant Index and Map Reads
+#### 4b. Build Contaminant Index and Map Reads
 
 ```bash
 # Build contaminant index
-minimap2 -t NumberOfThreads \
-         -a \
-         -x splice \
-         -d blanks.mmi \
-         /path/to/contaminant_assembly/blank-assembly.fasta
-
+bowtie2-build /path/to/contaminant_assembly/blank-scaffolds.fasta /path/to/blank-index/blanks
+     
 # Map reads to index
-minimap2 -t NumberOfThreads \
-         -a \
-         -x splice \
-         blanks.mmi \
-         sample_GLlbsMetag_HRrm.fastq.gz  > sample.sam 2> sample-mapping-info.txt
+bowtie2 -p NumberOfThreads \
+       -x /path/to/blank-index/blanks \
+       --very-sensitive-local \
+       -1 sample1_GLlbsMetag_R1_filtered.fastq.gz \
+       -2 sample1_GLlbsMetag_R2_filtered.fastq.gz \
+       --un-conc-gz sample1_decontam.fastq.gz \
+       > sample1.sam 2> sample1-mapping-info.txt
+
+# rename blank removed fastq files
+mv sample1_decontam.fastq.1.gz sample1_GLlbsMetag_R1_decontam.fastq.gz
+mv sample1_decontam.fastq.2.gz sample1_GLlbsMetag_R2_decontam.fastq.gz
+
+# remove intermediate file
+rm -rf sample1.sam
 ```
 
 **Parameter Definitions:**
 
-- `-t` - Number of parallel processing threads.
-- `-a` – Output in SAM format.
-- `-x splice` - Specifies preset for spliced alignment of long reads.
-- `-d` - Specifies the output file for the index (specific to the build contaminant index command).
-- `/path/to/contaminant_assembly/blank-assembly.fasta` - Specifies the input file in fasta format, provided as a positional argument (specific to the build contaminant index command).
-- `blanks.mmi` - Specifies the index file in mmi format, provided as a positional argument (specific to the map reads command).
-- `/path/to/trimmed_reads/sample_GLlbsMetag_HRrm.fastq.gz` - Specifies the input file in fastq format, provided as a positional argument (specific to the map reads command).
-- `> sample.sam` - Redirects the output of the map reads command to a separate SAM file (specific to the map reads command).
+*bowtie2-build*
+- `/path/to/contaminant_assembly/blank-scaffolds.fasta` - Specifies the path to the input contaminant assembly, provided as a positional argument.
+- `/path/to/blank-index/blanks` - Specifies the path and prefix of the output contaminant index files, provided as a positional argument.
+
+*bowtie2*
+- `-p` - Number of parallel processing threads.
+- `-x` - Specifies the prefix of the reference index files to map to, generated by bowtie2-build.
+- `--very-sensitive-local` - Specifies the mapping presets for very sensitive local mode (see [bowtie2 documentation](https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#bowtie2-options-very-sensitive-local) for more information).
+- `-1` - Specifies the forward reads to map.
+- `-2` - Specifies the reverse reads to map.
+- `--un-conc-gz` - Specifies the file pattern for the gzip-compressed fastq files holding read pairs that did not align concordantly to the contaminant index. ".1" or ".2" will be added to the output filenames to distinguish the forward and reverse read files.
+- `> sample1.sam` - Redirects the alignments produced by the map reads command to a SAM file.
+- `2> sample1-mapping-info.txt` - Captures the mapping summary, which bowtie2 prints to standard error, in a log file.
 
 **Input Data**
 
-- /path/to/contaminant_assembly/blank-assembly.fasta (contaminant assembly, output from [Step 7a](#7-assemble-contaminants))
-- sample_GLlbsMetag_HRrm.fastq.gz (filtered and trimmed reads, output from [Step 6b](#6b-remove-human-reads))
+- /path/to/contaminant_assembly/blank-scaffolds.fasta (contaminant assembly, output from [Step 4a](#4a-assemble-contaminants))
+- sample1_GLlbsMetag_R[12]_filtered.fastq.gz (filtered and trimmed reads, output from [Step 3b](#3b-trim-polyg))
 
 **Output Data**
 
-- blanks.mmi (contaminant index in MMI format)
-- sample.sam (reads aligned to contaminant assembly in SAM format)
-- sample-mapping-info.txt (minimap2 mapping log file)
+- sample1_GLlbsMetag_R[12]_decontam.fastq.gz (decontaminated reads)
+- sample1-mapping-info.txt (bowtie2 mapping log file)
 
-#### 7c. Sort and Index Contaminant Alignments
-```bash
-# Sort Sam, convert to bam and create index
-samtools sort --threads NumberOfThreads \
-              --output sample_sorted.bam \
-              sample.sam
+#### 4c. Contaminant Removal QC
 
-samtools index sample_sorted.bam sample_sorted.bam.bai
+```bash
+fastqc -o decontam_fastqc_output *decontam.fastq.gz
 ```
 
 **Parameter Definitions:**
 
-**samtools sort**
-- `--threads` - Number of parallel processing threads to use.
-- `--output` - Specifies the output file for the aligned and sorted reads.
-- `sample.sam` - Specifies the input SAM file, provided as a positional argument.
+- `-o` – the output directory to store results
+- `*decontam.fastq.gz` – the input reads are specified as a positional argument, and can be given all at once with wildcards like this, or as individual arguments with spaces in between them
 
-**samtools index**
-- `sample_sorted.bam` - The input BAM file, provided as a positional argument.
-- `sample_sorted.bam.bai` - The output index file, provided as a positional argument.
+**Input Data:**
 
-**Input Data:**
+- *decontam.fastq.gz (decontaminated reads)
 
-- sample.sam (reads aligned to contaminant assembly, output from [Step 7b](#7b-build-contaminant-index-and-map-reads))
+**Output Data:**
 
-**Output Data:**
+- *fastqc.html (FastQC output html summary)
+- *fastqc.zip (FastQC output data)
 
-- sample_sorted.bam (sorted mapping to contaminant assembly file)
-- sample_sorted.bam.bai (index of sorted mapping to contaminant assembly file)
 
-#### 7d. Gather Contaminant Mapping Metrics
+#### 4d. Compile Contaminant Removal QC
 
 ```bash
-
-samtools flagstat sample_sorted.bam > sample_flagstats.txt  2> sample_flagstats.log
-samtools stats --remove-dups sample_sorted.bam > sample_stats.txt   2> sample_stats.log
-samtools idxstats sample_sorted.bam  > sample_idxstats.txt 2> sample_idxstats.log
+multiqc --zip-data-dir \
+        --outdir decontam_multiqc_report \
+        --filename decontam_multiqc_GLlbsMetag \
+        --interactive \
+        /path/to/decontam_fastqc_output/
 ```
 
 **Parameter Definitions:**
 
-- `flagstat` - Positional argument specifying the program for counting the number of alignments for each SAM FLAG type.
-- `stats` - Positional argument specifying the program for producing comprehensive statistics from the alignment file.
-- `idxstats` - Positional argument specifying the program for producing contig alignment summary statistics.
-- `--remove-dups` - Excludes reads marked as duplicates from the comprehensive statistics.
-- `sample_sorted.bam` - Positional argument specifying the input BAM file.
-- `> sample_flagstats.txt` - Redirects the flagstat standard output to a text file.
-- `2> sample_flagstats.log` - Redirects the flagstat standard error to a log file.
-- `> sample_stats.txt` - Redirects the stats standard output to a text file.
-- `2> sample_stats.log` - Redirects the stats standard error to a log file.
-- `> sample_idxstats.txt` - Redirects the idxstats standard output to a text file.
-- `2> sample_idxstats.log` - Redirects the idxstats standard error to a log file.
+- `--zip-data-dir` - Compress the data directory.
+- `--outdir` – Specifies the output directory to store results.
+- `--filename` – Specifies the filename prefix of results.
+- `--interactive` - Force multiqc to always create interactive javascript plots.
+- `/path/to/decontam_fastqc_output/` – The directory holding the output data from the FastQC run, provided as a positional argument.
 
 **Input Data:**
 
-- sample_sorted.bam (sorted mapping to contaminant assembly file, output from [Step 7c](#7c-sort-and-index-contaminant-alignments))
-- sample_sorted.bam.bai (index of sorted mapping to contaminant assembly file, output from [Step 7c](#7c-sort-and-index-contaminant-alignments))
+- /path/to/decontam_fastqc_output/*fastqc.zip (FastQC output data, from [Step 4c](#4c-contaminant-removal-qc))
 
 **Output Data:**
 
-- sample_flagstats.txt (SAM FLAG counts)
-- sample_flagstats.log (log file containing the flagstat standard error)
-- sample_stats.txt (comprehensive alignment statistics)
-- sample_stats.log (log file containing the stats standard error)
-- sample_idxstats.txt (contig alignment summary statistics)
-- sample_idxstats.log (log file containing the idxstats standard error)
+- **decontam_multiqc_report/decontam_multiqc_GLlbsMetag.html** (multiqc output html summary)
+- **decontam_multiqc_report/decontam_multiqc_GLlbsMetag_data.zip** (zip archive containing multiqc output data)
+
+---
+### 5. Host Read Removal
+
+If the samples were derived from a host organism other than human, potential host reads
+should be identified and removed. This step is optional.
+
+#### 5a. Build Kraken2 Host Database
+
+> **Note:** It is recommended to use NCBI genome files with kraken2 because sequences not downloaded from 
+NCBI may require explicit assignment of taxonomy information before they can be used to build the 
+database, as mentioned in the [Kraken2 Documentation](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown).
 
-#### 7e. Generate Decontaminated Read Files
 ```bash
-# Retain reads that do not map to contaminants
-samtools fastq -t -f 4 -o sample_decontam_GLlbsMetag.fastq.gz -0 sample_decontam_GLlbsMetag.fastq.gz sample_sorted.bam 
-```
 
-**Parameter Definitions:**
+# Build a custom Kraken2 database from the host genome
+# Download NCBI taxonomic information 
+kraken2-build --download-taxonomy --db kraken2-${hostname}-db/
+
+# Add genomic sequences to your database's genomic library
+kraken2-build --add-to-library ${hostname}.fasta --db kraken2-${hostname}-db/ \
+              --no-masking
+
-- `fastq` - Positional argument specifying the program for generating fastq files from a SAM/BAM file.
-- `-t` - Copy RG, BC, and QT tags to the FASTQ header line.
-- `-f 4` - Only retain unmapped reads that have been marked with the SAM "segment unmapped" FLAG (4).
-- `-o sample_decontam_GLlbsMetag.fastq.gz` - Send reads flagged as either read1 or read2 to the named file (.gz ending ensures compressed output)
-- `-0 sample_decontam_GLlbsMetag.fastq.gz` - Send reads flagged as both read1 and read2 or neither to the same named file
-- `sample_sorted.bam` - Positional argument specifying the input BAM file.
+# Build the database (k-mer and minimizer lengths apply to the build task only)
+kraken2-build --build --db kraken2-${hostname}-db/ --kmer-len 35 --minimizer-len 31
+
+# Clean up intermediate files
+kraken2-build --clean --db kraken2-${hostname}-db/
+```
+**Parameter Definitions:**
+- `--download-taxonomy` - Instructs kraken2-build to download the NCBI taxonomic information.
+- `--db` - Specifies the name of the directory for the kraken2 database
+- `--add-to-library` - Instructs kraken2-build to add the contents of a file (`${hostname}.fasta`) to the kraken2 DB library
+- `--no-masking` - Disables masking of low-complexity sequences. For additional 
+                   information see the [kraken documentation for masking](https://github.com/DerrickWood/kraken2/wiki/Manual#masking-of-low-complexity-sequences).
+- `--build` - Instructs kraken2-build to build the kraken2 DB from the library files
+- `--clean` - Instructs kraken2-build to remove unneeded intermediate files.
+- `${hostname}` - Shell variable holding the name of the host organism, used to uniquely identify the kraken2 database
 
 **Input Data:**
 
-- sample_sorted.bam (sorted mapping to contaminant assembly file, output from [Step 7c](#7c-sort-and-index-contaminant-alignments))
+- `${hostname}.fasta` (fasta file containing host genome)
 
 **Output Data:**
 
-- **sample_decontam_GLlbsMetag.fastq.gz** (filtered and trimmed sample reads with contaminants removed in fastq format)
+- kraken2-${hostname}-db/ (Kraken2 host database directory, containing hash.k2d, opts.k2d, and taxo.k2d files)
 
-#### 7f. Contaminant Removal QC
+#### 5b. Remove Host Reads
 
 ```bash
-NanoPlot --only-report \
-         --prefix sample_noblank_ \
-         --outdir /path/to/decontam_nanoplot_output \
-         --threads NumberOfThreads \
-         --fastq \
-         sample_decontam_GLlbsMetag.fastq.gz
+# --paired tells kraken2 that the two input files hold paired-end reads
+kraken2 --db kraken2-${hostname}-db/ \
+        --gzip-compressed --paired \
+        --threads NumberOfThreads \
+        --use-names \
+        --output sample-kraken2-output.txt \
+        --report sample-kraken2-report.tsv \
+        --unclassified-out sample1_R#.fastq \
+        sample1_GLlbsMetag_R1_decontam.fastq.gz sample1_GLlbsMetag_R2_decontam.fastq.gz
+
+# rename and gzip output files
+mv sample1_R_1.fastq sample1_GLlbsMetag_R1_HostRm.fastq && \
+gzip sample1_GLlbsMetag_R1_HostRm.fastq
+
+mv  sample1_R_2.fastq sample1_GLlbsMetag_R2_HostRm.fastq && \
+gzip sample1_GLlbsMetag_R2_HostRm.fastq
 ```
 
 **Parameter Definitions:**
 
-- `--only-report` - Output only the report files.
-- `--prefix` - Adds a sample specific prefix to the name of each output file.
-- `--outdir` – Specifies the output directory to store results.
+- `--db` - Specifies the directory holding the kraken2 database.
+- `--gzip-compressed` - Specifies that the input fastq files are gzip-compressed.
 - `--threads` - Number of parallel processing threads to use.
-- `--fastq` - Specifies that the input data is in fastq format.
-- `sample_decontam_GLlbsMetag.fastq.gz` – The input reads, specified as a positional argument.
+- `--use-names` - Specifies adding taxa names in addition to taxon IDs.
+- `--output` - Specifies the name of the kraken2 read-based output file (one line per read).
+- `--report` - Specifies the name of the kraken2 report output file (one line per taxa, with number of reads assigned to it).
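+- `--unclassified-out` - Specifies the filename pattern for the output files containing reads that were not classified, i.e., non-host reads. For paired-end data, kraken2 replaces the `#` with `_1` and `_2` for the forward and reverse reads.
+- `sample1_GLlbsMetag_R1_decontam.fastq.gz sample1_GLlbsMetag_R2_decontam.fastq.gz` - Positional arguments specifying the forward and reverse input read files.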
 
 **Input Data:**
 
-- sample_decontam_GLlbsMetag.fastq.gz (filtered and trimmed sample reads with all contaminants removed, output from [Step 7e](#7e-generate-decontaminated-read-files))
+- kraken2-${hostname}-db/ (kraken2 host database directory, output from [Step 5a](#5a-build-kraken2-host-database))
+- sample1_GLlbsMetag_R[12]_decontam.fastq.gz (filtered and trimmed sample reads with contaminants removed, output from [Step 4b](#4b-build-contaminant-index-and-map-reads))
 
 **Output Data:**
 
-- **/path/to/decontam_nanoplot_output/sample_decontam_NanoPlot-report.html** (NanoPlot html summary)
-- /path/to/decontam_nanoplot_output/sample_decontam_NanoPlot_\<date\>_\<time\>.log (NanoPlot log file)
-- /path/to/decontam_nanoplot_output/sample_decontam_NanoStats.txt (text file containing basic statistics)
+- sample-kraken2-output.txt (kraken2 read-based output file (one line per read))
+- sample-kraken2-report.tsv (kraken2 report output file (one line per taxa, with number of reads assigned to it))
+- **sample1_GLlbsMetag_R[12]_HostRm.fastq.gz** (filtered and trimmed sample reads with both contaminants and host reads removed, gzipped fastq files)
 
 
-#### 7g. Compile Contaminant Removal QC
+#### 5c. Compile Host Read Removal QC
 
 ```bash
 multiqc --zip-data-dir \ 
-        --outdir decontam_multiqc_report \
-        --filename decontam_multiqc_GLlbsMetag \
+        --outdir HostRm_multiqc_report \
+        --filename HostRm_multiqc_GLlbsMetag \
         --interactive \
-        /path/to/decontam_nanoplot_output/
+        /path/to/*kraken2-report.tsv
 ```
 
 **Parameter Definitions:**
@@ -677,26 +718,26 @@ multiqc --zip-data-dir \
 - `--outdir` – Specifies the output directory to store results.
 - `--filename` – Specifies the filename prefix of results.
 - `--interactive` - Force multiqc to always create interactive javascript plots.
-- `/path/to/decontam_nanoplot_output/` – The directory holding the output data from the NanoPlot run, provided as a positional argument.
+- `/path/to/*kraken2-report.tsv` – The kraken2 output report files, provided as a positional argument.
 
 **Input Data:**
 
-- /path/to/decontam_nanoplot_output/*decontam_NanoStats.txt (NanoPlot output data, output from [Step 7f](#7f-contaminant-removal-qc))
+- /path/to/*kraken2-report.tsv (kraken2 report files, output from [Step 5b](#5b-remove-host-reads))
 
 **Output Data:**
 
-- **decontam_multiqc_GLlbsMetag.html** (multiqc output html summary)
-- **decontam_multiqc_GLlbsMetag_data.zip** (zip archive containing multiqc output data)
+- **HostRm_multiqc_GLlbsMetag.html** (multiqc output html summary)
+- **HostRm_multiqc_GLlbsMetag_data.zip** (zip archive containing multiqc output data)
 
 <br>
 
 ---
 
-### 8. R Environment Setup
+### 6. R Environment Setup
 
 > Taxonomy bar plots, heatmaps and feature decontamination with decontam are performed in R.
 
-#### 8a. Load libraries
+#### 6a. Load libraries
 
 ```R
 library(decontam)
@@ -825,38 +866,39 @@ library(pavian)
 </details>
 
 
-##### process_kraken_table()
+##### merge_kraken_reports()
 <details>
   <summary>merge and process multiple kraken outputs to one species table</summary>
 
   ```R
-  process_kraken_table <- function(reports_dir) {
+  library(pavian)
+
+  merge_kraken_reports <- function(reports_dir) {
 
     reports <- read_reports(reports_dir)
+
     # Retrieve sample names from file names
-    samples <- names(reports) %>%
-                  str_split("-") %>%
-                  map_chr(function(x) pluck(x, 1))
+    samples <- names(reports) %>% str_split("-") %>% map_chr(function(x) pluck(x, 1))
     merged_reports  <- merge_reports2(reports, col_names = samples)
     taxonReads <- merged_reports$taxonReads
     cladeReads <- merged_reports$cladeReads
     tax_data <- merged_reports[["tax_data"]]
 
-    species_table <- tax_data %>% 
+    species_table <- tax_data %>%
       bind_cols(cladeReads) %>%
-      filter(taxRank %in% c("U","S")) %>% # select unclassified and species rows 
+      filter(taxRank %in% c("U", "S")) %>% # select unclassified and species rows 
       select(-contains("tax")) %>%
-      zero_if_na() %>% 
-      filter(name != 0) %>%  # drop unknown taxonomies
-      group_by(name) %>% 
-      summarise(across(everything(), sum)) %>% 
-      ungroup() %>% 
-      as.data.frame() %>% 
-      rename(species=name)
+      zero_if_na() %>%
+      filter(name != 0) %>% # drop unknown taxonomies
+      group_by(name) %>%
+      summarise(across(everything(), sum)) %>%
+      ungroup() %>%
+      as.data.frame %>%
+      rename(species = name)
 
     # Set rownames as species name, drop species column
     # and convert table from dataframe to matrix
-    species_names <- species_table[,"species"]
+    species_names <- species_table[, "species"]
     rownames(species_table) <- species_names
     species_table <- species_table[,-(which(colnames(species_table) == "species"))]
     species_table <- as.matrix(species_table)
@@ -864,6 +906,9 @@ library(pavian)
     return(species_table)
   }
   ```
+  **Custom Functions Used:**
+  - [read_reports()]()
+
 
   **Function Parameter Definitions:**
   - `reports_dir` - path to a directory containing kraken2 reports 
@@ -1040,7 +1085,7 @@ library(pavian)
                         samples_column="Sample_ID", prefix_to_remove="barcode"){
   
     abund_table_wide <- abund_table %>%
-        as.data.frame() %>%
+        as.data.frame %>%
         rownames_to_column(samples_column) %>%
         inner_join(metadata) %>%
         select(!!!colnames(metadata), everything()) %>%
@@ -1082,7 +1127,7 @@ library(pavian)
   ```R
   make_barplot <- function(metadata_table_file, feature_table_file, 
                            feature_column = "species", samples_column = "sample_id", group_column = "group", 
-                           output_prefix, assay_suffix = "_GLlbsMetag",
+                           output_prefix, assay_suffix = "_GLlblMetag",
                            publication_format, custom_palette) {
     # Prepare feature table
     feature_table <- read_csv(feature_table_file)
@@ -1090,7 +1135,7 @@ library(pavian)
     feature_table <- feature_table[, -1]
 
     # Prepare metadata
-    metadata <- read_delim(metdata_file, delim = ",") %>% as.data.frame()
+    metadata <- read_delim(metadata_table_file, delim = ",") %>% as.data.frame
     row.names(metadata) <- metadata[, samples_column]
 
     # compute abundances from counts
@@ -1123,7 +1168,7 @@ library(pavian)
   - `group_column` - a character string specifying the column in `metadata` used to facet/group plots, default: "group"
   - `output_prefix` - a character string specifying the unique name to add to the output file names 
                       used to denote the data type/source, for example "unfiltered-kaiju_species"
-  - `assay_suffix` - a character string specifying the GeneLab assay suffix (default: "_GLlbsMetag")
+  - `assay_suffix` - a character string specifying the GeneLab assay suffix (default: "_GLlblMetag")
   - `publication_format` - a ggplot::theme object specifying a custom theme for plotting, from [Step 8c](#8c-set-global-variables)
   - `custom_palette` - a vector of strings specifying a custom color palette for coloring plots, from [Step 8c](#8c-set-global-variables)
 
@@ -1136,35 +1181,33 @@ library(pavian)
   <summary>Creates heatmaps from a feature table file</summary>
   
   ```R
-  make_barplot <- function(metadata_file, feature_table_file, 
+  make_heatmap <- function(metadata, species_gene_table, 
                            samples_column = "sample_id", group_column = "group", 
-                           output_prefix, assay_suffix = "_GLlbsMetag",
+                           output_prefix, assay_suffix = "_GLlblMetag",
                            custom_palette) {
     # Prepare feature table
-    feature_table <- read_csv(feature_table_file) %>% as.data.frame()
-    rownames(feature_table) <- feature_table[[1]]
-    feature_table <- feature_table[, -1] %>% as.matrix()
-    colnames(feature_table) <- colnames(feature_table) %>% str_remove_all("barcode")
+    # feature_table <- read_csv(feature_table_file) %>% as.data.frame
+    # rownames(feature_table) <- feature_table[[1]]
+    # feature_table <- feature_table[, -1] %>% as.matrix()
 
-    # Prepare metadata
-    metadata <- read_delim(metadata_file, delim = ",") %>% as.data.frame()
-    row.names(metadata) <- metadata[, samples_column] %>% str_remove_all("barcode")
+    # # Prepare metadata
+    # metadata <- read_delim(metadata_file, delim = ",") %>% as.data.frame
+    # row.names(metadata) <- metadata[, samples_column]
 
-    # GFet common samples and re-arrange feature table and metadata
-    common_samples <- intersect(colnames(feature_table), rownames(metadata))
-    feature_table <- feature_table[, common_samples]
-    metadata <- metadata[common_samples,]
-    metadata <- metadata %>% arrange(!!sym(group_column))
+    # # Get common samples and re-arrange feature table and metadata
+    # common_samples <- intersect(colnames(feature_table), rownames(metadata))
+    # feature_table <- feature_table[, common_samples]
+    # metadata <- metadata[common_samples, ]
+    # metadata <- metadata %>% arrange(!!sym(group_column))
 
     # Create column annotation
-    col_annotation <- as.data.frame(metadata)[,group_column, drop=FALSE]
-    rownames(col_annotation) <- rownames(col_annotation)
+    col_annotation <- as.data.frame(metadata)[, group_column, drop = FALSE]
 
     # Calculate output plot width and height
     number_of_samples <- ncol(feature_table)
     width <- 1 * number_of_samples
     number_of_features <- nrow(feature_table)
-    height <- 0.2 * number_of_features 
+    height <- 0.2 * number_of_features
 
     # Set colors by group
     groups <- metadata[[group_column]] %>%  unique()
@@ -1175,41 +1218,32 @@ library(pavian)
     names(annotation_colors) <- group_column
 
     # create heatmap
-    png(filename = glue("{output_prefix}_heatmap.png"), width = width,
+    png(filename = glue("{output_prefix}_heatmap{assay_suffix}.png"), width = width,
         height = height, units = "in", res = 300)
-    pheatmap(mat = feature_table[,rownames(col_annotation)],
-            cluster_cols = FALSE, 
-            cluster_rows = FALSE, 
-            col = colorRampPalette(c('white','red'))(255), 
-            angle_col = 0, 
-            display_numbers = TRUE,
-            fontsize = 12, 
-            annotation_col = col_annotation,
-            annotation_colors = annotation_colors ,
-            number_format = "%.0f")
+    pheatmap(mat = feature_table[, rownames(col_annotation)],
+             cluster_cols = FALSE,
+             cluster_rows = FALSE,
+             col = colorRampPalette(c('white','red'))(255), 
+             angle_col = 0,
+             display_numbers = TRUE,
+             fontsize = 12,
+             annotation_col = col_annotation,
+             annotation_colors = annotation_colors,
+             number_format = "%.0f")
     dev.off()
-
-
   }
   ```
-  **Custom Functions Used:**
-  - [make_plot()](#make_plot)
-
   **Function Parameter Definitions:**
-  - `metadata_table_file` - path to a file with samples as rows and columns describing each sample
+  - `metadata` - a dataframe with sample names as row names and columns describing each sample
   - `feature_table_file` - path to a tab separated samples feature table i.e. species/functions 
                            table with species/functions as the first column and samples as other columns.
-  - `feature_column` - a character string containing the feature column name in the feature table ['Species', 'species', 'KO_ID'].
   - `samples_column` - a character string specifying the column in `metadata` holding sample names, default: "sample_id"
   - `group_column` - a character string specifying the column in `metadata` used to facet/group plots, default: "group"
   - `output_prefix` - a character string specifying the unique name to add to the output file names 
                       used to denote the data type/source, for example "unfiltered-kaiju_species"
-  - `assay_suffix` - a character string specifying the GeneLab assay suffix (default: "_GLlbsMetag")
-  - `publication_format` - a ggplot::theme object specifying a custom theme for plotting, from [Step 8c](#8c-set-global-variables)
+  - `assay_suffix` - a character string specifying the GeneLab assay suffix (default: "_GLlblMetag")
   - `custom_palette` - a vector of strings specifying a custom color palette for coloring plots, from [Step 8c](#8c-set-global-variables)
 
-  **Returns:** a relative abundance stacked bar plot
-
 </details>
 
 ##### run_decontam()
@@ -1218,7 +1252,7 @@ library(pavian)
 
   ```R
   run_decontam <- function(feature_table, metadata, contam_threshold=0.1, 
-                           prev_col = NULL, freq_col = NULL, ntc_name = "Control_Sample") {
+                           prev_col = NULL, freq_col = NULL, ntc_name = "TRUE") {
 
     # retain metadata for only the samples present in the input feature table
     sub_metadata <- metadata[colnames(feature_table), ]
@@ -1292,17 +1326,18 @@ library(pavian)
   library(glue)
 
   feature_decontam <- function(metadata_file, feature_table_file, 
-                               feature_column = "species", samples_column = "sample_id",
-                               prevalence_column = "Sample_or_Control", ntc_name, frequency_column = "concentration", 
+                               feature_column = "Species", samples_column = "sample_id",
+                               prevalence_column = "NTC", ntc_name = "TRUE", 
+                               frequency_column = "concentration", 
                                threshold = 0.1, classification_method, 
-                               output_prefix, assay_suffix = "_GLlbsMetag") {
+                               output_prefix, assay_suffix = "_GLlblMetag") {
     # Prepare feature table
-    feature_table <- read_csv(feature_table_file) %>%  as.data.frame()
+    feature_table <- read_csv(feature_table_file) %>%  as.data.frame
     rownames(feature_table) <- feature_table[[1]]
     feature_table <- feature_table[, -1]  %>% as.matrix()
 
     # Prepare metadata
-    metadata <- read_csv(metadata_file) %>% as.data.frame()
+    metadata <- read_csv(metadata_file) %>% as.data.frame
     row.names(metadata) <- metadata[, samples_column]
 
     # Run decontam
@@ -1311,7 +1346,7 @@ library(pavian)
     contamdf <- as.data.frame(contamdf) %>% rownames_to_column(feature_column)
 
     # Write decontaminated feature table and decontam's primary results
-    outfile <- glue("{output_prefix}decontam-{classification_method}_results{assay_suffix}.csv")
+    outfile <- glue("{output_prefix}{classification_method}_decontam_results{assay_suffix}.csv")
     write_csv(x = contamdf, file = outfile)
 
     # Get the list of contaminants identified by decontam
@@ -1324,7 +1359,7 @@ library(pavian)
       
       # Drop contaminant features identified by decontam
       decontaminated_table <- feature_table %>%
-        as.data.frame() %>%
+        as.data.frame %>%
         rownames_to_column(feature_column) %>%
         filter(str_detect(!!sym(feature_column),
                           pattern = str_c(contaminants,
@@ -1334,7 +1369,7 @@ library(pavian)
       rownames(decontaminated_table) <- decontaminated_table[[feature_column]]
       decontaminated_table <- decontaminated_table[,-1] %>% as.matrix
 
-      outfile <- glue("{output_prefix}decontaminated-{classification_method}_species_table{assay_suffix}.csv")
+      outfile <- glue("{output_prefix}{classification_method}_decontam_species_table{assay_suffix}.csv")
       write_csv(x = decontaminated_table, file = outfile)
 
       return(decontaminated_table)
@@ -1355,201 +1390,157 @@ library(pavian)
   - `feature_column` - a character string containing the feature column name in the feature table ['Species', 'species', 'KO_ID'].
   - `samples_column` - a character string specifying the column in `metadata` holding sample names, default: "sample_id"
   - `frequency_column` - a character string specifying the column in `metadata` to use for frequency based analysis, default: "concentration"
-  - `prevalence_column` - a character string specifying the column in `metadata` to use for prevalence based analysis, default: "Sample_or_Control"
-  - `ntc_name` - a character string specifying the name of the NTC in the prevalence column
+  - `prevalence_column` - a character string specifying the column in `metadata` to use for prevalence based analysis, default: "NTC"
+  - `ntc_name` - a character string specifying the value in the prevalence column for all negative template control samples, default: "TRUE"
   - `threshold` - a number between 0 and 1 specfying the decontam threshold for both prevalence and frequency based analyses. default: 0.1
   - `output_prefix` - a character string specifying the unique name to add to the output file names 
                       used to denote the data type/source, for example "unfiltered-kaiju_species"
   - `classification_method` - a character string specifying the tool used to generate the classifications ['kaiju', 'kraken2', 'metaphlan', 'contig-taxonomy', 'gene-taxonomy', 'gene-function']
-  - `assay_suffix` - a character string specifying the GeneLab assay suffix (default: "_GLlbsMetag")
+  - `assay_suffix` - a character string specifying the GeneLab assay suffix (default: "_GLlblMetag")
+
+  **Output Data:**
+  - {classification_method}_decontam_species_table_GLlblMetag.csv - decontaminated feature table file
+  - {classification_method}_decontam_results_GLlblMetag.csv - Decontam results file
 
   **Returns:** a dataframe containing the decontaminated feature table
+
 </details>
 
 ##### process_taxonomy()
 <details>
   <summary>process a taxonomy assignment table</summary>
 
-```R
-process_taxonomy <- function(taxonomy, prefix='\\w__') { 
-  
-  taxonomy <- apply(X = taxonomy, MARGIN = 2, FUN = as.character) 
-
-  # replace NAs and empty cells with "Other" and delete the `prefix` from taxonomy names
-  for (rank in colnames(taxonomy)) {
-    # Delete the taxonomy prefix
-    taxonomy[,rank] <- gsub(pattern = prefix, x = taxonomy[, rank],
-                            replacement = '')
-    indices <- which(is.na(taxonomy[,rank]))
-    taxonomy[indices, rank] <- rep(x = "Other", times=length(indices)) 
-    # Replace empty cells with "Other"
-    indices <- which(taxonomy[,rank] == "")
-    taxonomy[indices,rank] <- rep(x = "Other", times=length(indices))
-  }
-  # Replace underscore with space
-  taxonomy <- apply(X = taxonomy,MARGIN = 2,
-                    FUN =  gsub,pattern = "_",replacement = " ") %>% 
-    as.data.frame(stringAsfactor=FALSE)
-  return(taxonomy)
-
-```
-**Function Parameter Definitions:**
-
-- `taxonomy` - is a taxonomy assignment dataframe with ranks [Phylum, Class .. Species] as columns and taxonomy assignments as rows
-- `prefix`  - is a regular expression specifying a character sequence to remove
-              from taxon names
-
-**Returns:** a dataframe of reformated taxonomy names
-
-</details>
-
-
-##### format_taxonomy_table()
-<details>
-  <summary>format a taxonomy assignment table by appending a suffix to a known name</summary>
-
-```R
-format_taxonomy_table <- function(taxonomy,stringToReplace="Other",
-                                  suffix=";Other") {
-  
-  for (taxa_index in seq_along(taxonomy)) {
-    
-    # Get the row indices of the current taxonomy columns 
-    # with rows matching the sting in `stringToReplace`
-    indices <- grep(x = taxonomy[,taxa_index], pattern = stringToReplace)
-    # Replace the value in that row with the value in the adjacent cell concated with `suffix` 
-    taxonomy[indices,taxa_index] <- 
-      paste0(taxonomy[indices,taxa_index-1],
-             rep(x = suffix, times=length(indices)))
+  ```R
+  process_taxonomy <- function(taxonomy, prefix='\\w__') { 
     
+    taxonomy <- apply(X = taxonomy, MARGIN = 2, FUN = as.character) 
+
+    # replace NAs and empty cells with "Other" and delete the `prefix` from taxonomy names
+    for (rank in colnames(taxonomy)) {
+      # Delete the taxonomy prefix
+      taxonomy[,rank] <- gsub(pattern = prefix, x = taxonomy[, rank],
+                              replacement = '')
+      indices <- which(is.na(taxonomy[,rank]))
+      taxonomy[indices, rank] <- rep(x = "Other", times=length(indices)) 
+      # Replace empty cells with "Other"
+      indices <- which(taxonomy[,rank] == "")
+      taxonomy[indices,rank] <- rep(x = "Other", times=length(indices))
+    }
+    # Replace underscore with space
+    taxonomy <- apply(X = taxonomy,MARGIN = 2,
+                      FUN =  gsub,pattern = "_",replacement = " ") %>% 
+      as.data.frame(stringsAsFactors = FALSE)
+    return(taxonomy)
   }
-  return(taxonomy)
-}
+  ```
+  **Function Parameter Definitions:**
 
-```
-**Function Parameter Definitions:**
-- `taxonomy` -  taxonomy dataframe with taxonomy ranks as column names
-- `stringToReplace` - a regex string specifying what to replace
-- `suffix` - string specifying the replacement value
+  - `taxonomy` - is a taxonomy assignment dataframe with ranks [Phylum, Class .. Species] as columns and taxonomy assignments as rows
+  - `prefix`  - is a regular expression specifying a character sequence to remove
+                from taxon names
 
-**Returns:** a dataframe of reformated taxonomy names
+  **Returns:** a dataframe of reformatted taxonomy names
 
 </details>
 
-
 ##### fix_names()
 <details>
   <summary>clean taxonomy names</summary>
 
-```R
-fix_names<- function(taxonomy,stringToReplace,suffix){
-  
-  for(index in seq_along(stringToReplace)){
-    taxonomy <- format_taxonomy_table(taxonomy = taxonomy,
-                                      stringToReplace=stringToReplace[index], 
-                                      suffix=suffix[index])
-  }
-  return(taxonomy)
-}
-
-```
-**Function Parameter Definitions:**
-- `taxonomy` -  taxonomy dataframe with taxonomy ranks as column names
-- `stringToReplace` - a regex string specifying what to replace
-- `suffix` - string specifying the replacement value
-
-**Returns:** a dataframe of reformated/cleaned taxonomy names
-
-</details>
-
+  ```R
+  fix_names <- function(taxonomy, stringToReplace = "Other", suffix = ";Other") {
+
+    for (index in seq_along(stringToReplace)) {
+
+      for (taxa_index in seq_along(taxonomy)) {
+        # Get the row indices of the current taxonomy column
+        # with rows matching the current string in `stringToReplace`
+        indices <- grep(x = taxonomy[, taxa_index], pattern = stringToReplace[index])
+        # Replace the value in those rows with the value in the preceding column concatenated with `suffix`
+        taxonomy[indices, taxa_index] <-
+          paste0(taxonomy[indices, taxa_index - 1],
+                 rep(x = suffix[index], times = length(indices)))
+      }
 
-##### read_input_table()
-<details>
-  <summary>read an input table into a tibble</summary>
+    }
+    return(taxonomy)
+  }
+  ```
 
-```R
-read_input_table <- function(file_name){
-  
-   df <- read_delim(file = file_name, delim = "\t", comment = "#")
-   return(df)
-   
-}
-```
-**Function Parameter Definitions:**
+  **Function Parameter Definitions:**
+  - `taxonomy` -  taxonomy dataframe with taxonomy ranks as column names
+  - `stringToReplace` - a regex string specifying what to replace
+  - `suffix` - string specifying the replacement value
 
-- `file_name` - path to file to be read
-**Returns:** a tibble generated from the input file
+  **Returns:** a dataframe of reformatted/cleaned taxonomy names
 
 </details>
 
 
-
-##### read_contig_table()
+##### read_assembly_coverage_table()
 <details>
-  <summary>Read Assembly-based contig annotation table</summary>
+  <summary>Read Assembly-based coverage annotation table</summary>
 
   ```R
-read_contig_table <- function(file_name, sample_names){
+  read_assembly_coverage_table <- function(file_name, sample_names){
   
-  df <- read_input_table(file_name)
+    df <- read_delim(file = file_name, delim = "\t", comment = "#")
 
-  # Subset taxoxnomy portion (domain:species) of input table
-  # and replace empty/Na domain assignments with "Unclassified"
-  taxonomy_table <- df %>%
-    select(domain:species) %>%
-    mutate(domain=replace_na(domain, "Unclassified"))
-  
-  # Subset count table
-  counts_table <- df %>% select(!!sample_names)
+    # Subset taxonomy portion (domain:species) of input table
+    # and replace empty/NA domain assignments with "Unclassified"
+    taxonomy_table <- df %>%
+      select(domain:species) %>%
+      mutate(domain=replace_na(domain, "Unclassified"))
+    
+    # Subset count table
+    counts_table <- df %>% select(any_of(sample_names))
 
-  # Mutate taxonomy mames
-  taxonomy_table  <- process_taxonomy(taxonomy_table)
-  taxonomy_table <- fix_names(taxonomy_table, "Other", ";_")
+    # Mutate taxonomy names
+    taxonomy_table  <- process_taxonomy(taxonomy_table)
+    taxonomy_table <- fix_names(taxonomy_table, "Other", ";_")
 
-  # Column bind taxonomy dataframe with species count dataframe
-  df <- bind_cols(taxonomy_table, counts_table)
-  
-  return(df)
-}
+    # Column bind taxonomy dataframe with species count dataframe
+    df <- bind_cols(taxonomy_table, counts_table)
+    
+    return(df)
+  }
+  ```
 
-```
+  **Custom Functions Used:**
+  - [process_taxonomy()](#process_taxonomy)
+  - [fix_names()](#fix_names)
 
-**Function Parameter Definitions:**
+  **Function Parameter Definitions:**
 
-- `file_name` - path to contig taxonomy assignment file to be read
-- `sample_names` - string of samples names to keep in the final dataframe
+  - `file_name` - path to the contig taxonomy assignment file to be read
+  - `sample_names` - a character vector of sample names to keep in the final dataframe
 
-**Returns:** a dataframe with cleaned taxonomy names and sample species count
+  **Returns:** a dataframe with cleaned taxonomy names and sample species count
 
 </details>
 
 
-
 ##### get_sample_names()
 <details>
   <summary>retrieve sample names for which assemblies were generated</summary>
 
   ```R
-get_sample_names <- function (assembly_summary) {
-
+  get_sample_names <- function (assembly_summary) {
+    # Read in table and drop any column that contains NA values
+    overview_table <-  read_delim(file = assembly_summary, delim = "\t", comment = "#") %>%
+                        select(where( ~all(!is.na(.)) )) 
 
-  overview_table <-  read_input_table(assembly_summary) %>%
-                       select(
-                         where( ~all(!is.na(.)) )
-                         ) # Drop columns were all its rows are NAs
+    col_names <- names(overview_table) %>% str_remove_all("-assembly")
+    sample_order <- col_names[-1] %>% sort()
 
-col_names <- names(overview_table) %>% str_remove_all("-assembly")
-sample_order <- col_names[-1] %>% sort()
-
-return(sample_order)
-
-}
-```
-**Function Parameter Definitions:**
+    return(sample_order)
+  }
+  ```
+  **Function Parameter Definitions:**
 
-- `assembly_summary` - path to assembly summary file
+  - `assembly_summary` - path to assembly summary file
 
-**Returns:** a character vector of sorted sample names
+  **Returns:** a character vector of sorted sample names
 
 </details>
 
@@ -1582,8 +1573,6 @@ custom_palette <- custom_palette[-c(21:23,
                                          ignore.case = TRUE)
                                    )
                                 ]                      
-# Heatmap color gradient - here from white to red
-colours <- colorRampPalette(c('white','red'))(255)
 ```
 
 **Input Data:** 
@@ -1597,13 +1586,12 @@ colours <- colorRampPalette(c('white','red'))(255)
 
 <br>
 
----
 
 ## Read-based Processing
 
-### 9. Taxonomic Profiling Using Kaiju
+### 7. Taxonomic Profiling Using Kaiju
 
-#### 9a. Build Kaiju Database
+#### 7a. Build Kaiju Database
 
 ```bash
 # Make a directory that will hold the downloaded kaiju database
@@ -1634,14 +1622,15 @@ rm nr_euk/kaiju_db_nr_euk.bwt nr_euk/kaiju_db_nr_euk.sa
 - kaiju-db/merged.dmp (merged taxonomy IDs file from the NCBI Taxonomy database that maps deprecated taxonomic IDs to current ones)
 
 
-#### 9b. Kaiju Taxonomic Classification
+#### 7b. Kaiju Taxonomic Classification
 
 ```bash
 kaiju -f kaiju-db/nr_euk/kaiju_db_nr_euk.fmi \
       -t kaiju-db/nodes.dmp \
       -z NumberOfThreads \
       -E 1e-05 \
-      -i /path/to/sample_decontam_GLlbsMetag.fastq.gz \
+      -i /path/to/sample1_GLlbsMetag_R1_decontam.fastq.gz \
+      -j /path/to/sample1_GLlbsMetag_R2_decontam.fastq.gz \
       -o sample_kaiju.out
 ```
 
@@ -1651,20 +1640,24 @@ kaiju -f kaiju-db/nr_euk/kaiju_db_nr_euk.fmi \
 - `-t` - Specifies the path to the kaiju taxonomy hierarchy file (nodes.dmp).
 - `-z` - Number of parallel processing threads to use.
 - `-E` - Specifies the minimum E-value to use for filter matches (an E-value of 1e-05 means that there's a 0.001% chance that the matches identified occurred randomly).
-- `-i` - Specifies path to the input file.
+- `-i` - Specifies path to the forward read input file.
+- `-j` - Specifies path to the reverse read input file.
 - `-o` - Specifies the name of the output file.
 
 **Input Data:**
 
 - kaiju-db/nr_euk/kaiju_db_nr_euk.fmi (FM-index file containing the main Kaiju database index, output from [Step 9a](#9a-build-kaiju-database))
-- kaiju-db/nodes.dmp (kaiju taxonomy hierarchy nodes file, output from [Step 9a](#9a-build-kaiju-database))
-- sample_decontam_GLlbsMetag.fastq.gz (filtered and trimmed sample reads with both contaminants and human reads removed, gzipped fasta file, output from [Step 7e](#7e-generate-decontaminated-read-files))
+- kaiju-db/nodes.dmp (kaiju taxonomy hierarchy nodes file, output from [Step 7a](#7a-build-kaiju-database))
+- *_R[12]_decontam.fastq.gz or *_R[12]_HostRm.fastq.gz (filtered and trimmed sample reads with 
+    contaminants and human reads (and, optionally, host reads) removed, gzipped fastq files, 
+    output from [Step 4b](#4b-build-contaminant-index-and-map-reads) or [Step 5b](#5b-remove-host-reads))
+
 
 **Output Data:**
 
 - sample_kaiju.out (kaiju output file)
 
-#### 9c. Compile Kaiju Taxonomy Results
+#### 7c. Compile Kaiju Taxonomy Results
 
 ```bash
 # Merge kaiju reports to one table at the species level 
@@ -1676,8 +1669,8 @@ kaiju2table -t nodes.dmp \
             *_kaiju.out
 
 # Convert file names to sample names
-sed -i -E 's/.+\/(.+)_kaiju\.out/\1/g' merged_kaiju_table_GLlbsMetag.tsv && \
-sed -i -E 's/file/sample/' merged_kaiju_table_GLlbsMetag.tsv
+sed -i -E 's/.+\/(.+)_kaiju\.out/\1/g' merged_kaiju_table.tsv && \
+sed -i -E 's/file/sample/' merged_kaiju_table.tsv
 ```
 
 **Parameter Definitions:**
@@ -1691,15 +1684,15 @@ sed -i -E 's/file/sample/' merged_kaiju_table_GLlbsMetag.tsv
 
 **Input Data:**
 
-- kaiju-db/nodes.dmp (kaiju taxonomy hierarchy nodes file, output from [Step 9a](#9a-build-kaiju-database))
-- kaiju-db/names.dmp (kaiju taxonomy names file, output from [Step 9a](#9a-build-kaiju-database))
-- *kaiju.out (kaiju output files, output from [Step 9b](#9b-kaiju-taxonomic-classification))
+- kaiju-db/nodes.dmp (kaiju taxonomy hierarchy nodes file, output from [Step 7a](#7a-build-kaiju-database))
+- kaiju-db/names.dmp (kaiju taxonomy names file, output from [Step 7a](#7a-build-kaiju-database))
+- *kaiju.out (kaiju output files, output from [Step 7b](#7b-kaiju-taxonomic-classification))
 
 **Output Data:**
 
-- merged_kaiju_table_GLlbsMetag.tsv (compiled kaiju summary table at the species level)
+- merged_kaiju_table.tsv (compiled kaiju summary table at the species level)
 
-#### 9d. Convert Kaiju Output To Krona Format
+#### 7d. Convert Kaiju Output To Krona Format
 
 ```bash
 kaiju2krona -u \
@@ -1718,15 +1711,15 @@ kaiju2krona -u \
 - `-o` - Specifies the name of krona formatted kaiju output file.
 
 **Input Data:**
-- kaiju-db/names.dmp (kaiju taxonomy names file, output from [Step 9a](#9a-build-kaiju-database))
-- kaiju-db/nodes.dmp (kaiju taxonomy hierarchy nodes file, output from [Step 9a](#9a-build-kaiju-database))
-- sample_kaiju.out (kaiju output file, output from [Step 9b](#9b-kaiju-taxonomic-classification))
+- kaiju-db/names.dmp (kaiju taxonomy names file, output from [Step 7a](#7a-build-kaiju-database))
+- kaiju-db/nodes.dmp (kaiju taxonomy hierarchy nodes file, output from [Step 7a](#7a-build-kaiju-database))
+- sample_kaiju.out (kaiju output file, output from [Step 7b](#7b-kaiju-taxonomic-classification))
 
 **Output Data:**
 
 - sample.krona (krona formatted kaiju output)
 
-#### 9e. Compile Kaiju Krona Reports
+#### 7e. Compile Kaiju Krona Reports
 
 ```bash
 # Create a file containing a sorted list of all .krona files 
@@ -1772,17 +1765,17 @@ ktImportText  -o kaiju-report.html ${KTEXT_FILES[*]}
                         sample_1.krona,sample_1 sample_2.krona,sample_2 ... sample_n.krona,sample_n.
 
 **Input Data:**
-- *.krona (all sample .krona formatted files, output from [Step 9d](#9d-convert-kaiju-output-to-krona-format)) 
+- *.krona (all sample .krona formatted files, output from [Step 7d](#7d-convert-kaiju-output-to-krona-format)) 
 
                       
 **Output Data:**
 
 - krona_files.txt (sorted list of all *.krona files)
 - sample_names.txt (sorted list of all sample names)
-- **kaiju-report.html** (compiled krona html report containing all samples)
+- **kaiju-report_GLlbsMetag.html** (compiled krona html report containing all samples)
 
 
-#### 9f. Create Kaiju Species Count Table
+#### 7f. Create Kaiju Species Count Table
 
 ```R
 library(tidyverse)
@@ -1805,14 +1798,14 @@ write_csv(x = table2write, file = "kaiju_species_table_GLlbsMetag.csv")
 
 **Input Data:**
 
-- merged_kaiju_table_GLlbsMetag.tsv (compiled kaiju table at the species taxon level, from [Step 9c](#9c-compile-kaiju-taxonomy-results))
+- merged_kaiju_table_GLlbsMetag.tsv (compiled kaiju table at the species taxon level, from [Step 7c](#7c-compile-kaiju-taxonomy-results))
 
 **Output Data:**
 
 - **kaiju_species_table_GLlbsMetag.csv** (kaiju species count table in csv format)
 
 
-#### 9g. Filter Kaiju Species Count Table
+#### 7g. Filter Kaiju Species Count Table
 
 ```R
 library(tidyverse)
@@ -1833,13 +1826,13 @@ feature_table <- feature_table[, -1]
 # convert count table to a relative abundance matrix
 abund_table <- feature_table %>% rownames_to_column(feature_name) %>%
   mutate(across(where(is.numeric), function(x) (x / sum(x, na.rm = TRUE)) * 100)) %>%
-  as.data.frame()
+  as.data.frame
 
 rownames(abund_table) <- abund_table[,1]
 abund_table <- abund_table[,-1] %>% t 
 
 table2write <- group_low_abund_taxa(abund_table, threshold = threshold)  %>%
-  t %>% as.data.frame() %>%
+  t %>% as.data.frame %>%
   rownames_to_column(feature_name)
 
 write_csv(x = table2write, file = output_file)
@@ -1857,21 +1850,21 @@ write_csv(x = table2write, file = output_file)
 
 **Input Data:**
 
-- kaiju_species_table_GLlbsMetag.csv (path to kaiju species table from [Step 9f](#9f-create-kaiju-species-count-table))
+- kaiju_species_table_GLlbsMetag.csv (path to kaiju species table from [Step 7f](#7f-create-kaiju-species-count-table))
 
 **Output Data:**
 
-- **filtered-kaiju_species_table_GLlbsMetag.csv** - a file containing the filtered species table
+- **kaiju_filtered_species_table_GLlbsMetag.csv** - a file containing the filtered species table
 
 ---
 
-#### 9h. Taxonomy barplots
+#### 7h. Taxonomy barplots
 
 ```R
 library(tidyverse)
 
 species_table_file <- "kaiju_species_table_GLlbsMetag.csv"
-filtered_species_table_file <- "filtered-kaiju_species_table_GLlbsMetag.csv"
+filtered_species_table_file <- "kaiju_filtered_species_table_GLlbsMetag.csv"
 metadata_file <- "/path/to/sample/metadata"
 number_samples <- 10 
 
@@ -1883,7 +1876,7 @@ p <- make_barplot(metadata_file = metadata_file, feature_table_file = species_ta
                   feature_column = "Species", samples_column = "sample_id", group_column = "group",
                   publication_format = publication_format, custom_palette = custom_palette)
 
-ggsave(filename = "unfiltered-kaiju_species_barplot_GLlbsMetag.png", plot = p,
+ggsave(filename = "kaiju_unfiltered_species_barplot_GLlbsMetag.png", plot = p,
        device = "png", width = plot_width, height = 10, units = "in", dpi = 300, limitsize = FALSE)
 
 # Save static unfiltered plot
@@ -1892,16 +1885,19 @@ p <- make_barplot(metadata_file = metadata_file, feature_table_file = filtered_s
                   publication_format = publication_format, custom_palette = custom_palette)
 
 # Save interactive unfilterted plot
-htmlwidgets::saveWidget(ggplotly(p), glue("unfiltered-kaiju_species_barplot_GLlbsMetag.html"), selfcontained = TRUE)
+htmlwidgets::saveWidget(ggplotly(p), glue("kaiju_unfiltered_species_barplot_GLlbsMetag.html"), selfcontained = TRUE)
 
 # Save static filtered plot
-ggsave(filename = glue("filtered-kaiju_species_barplot_GLlbsMetag.png"), plot = p,
+ggsave(filename = glue("kaiju_filtered_species_barplot_GLlbsMetag.png"), plot = p,
       device = 'png', width = plot_width, height = 10, units = 'in', dpi = 300, limitsize = FALSE)
 
 # Save interactive filtered plot
-htmlwidgets::saveWidget(ggplotly(p), glue("filtered-kaiju_species_barplot_GLlbsMetag.html"), selfcontained = TRUE)
+htmlwidgets::saveWidget(ggplotly(p), glue("kaiju_filtered_species_barplot_GLlbsMetag.html"), selfcontained = TRUE)
 ```
 
+**Custom Functions Used:**
+- [make_barplot()](#make_barplot)
+
 **Parameter Definitions:**
 
 - `species_table_file` - a file containing the species count table
@@ -1911,20 +1907,20 @@ htmlwidgets::saveWidget(ggplotly(p), glue("filtered-kaiju_species_barplot_GLlbsM
 
 **Input Data:**
 
-- `kaiju_species_table_GLlbsMetag.csv` (a file containing the species count table, output from [Step 9f](#9f-create-kaiju-species-count-table))
-- `filtered-kaiju_species_table_GLlbsMetag.csv` (a file containing the filtered species count table, output from [Step 9g](#9g-filter-kaiju-species-count-table))
+- `kaiju_species_table_GLlblMetag.csv` (a file containing the species count table, output from [Step 7f](#7f-create-kaiju-species-count-table))
+- `kaiju_filtered_species_table_GLlblMetag.csv` (a file containing the filtered species count table, output from [Step 7g](#7g-filter-kaiju-species-count-table))
 - `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping samplenames to group metadata)
 
 
 **Output Data:**
 
-- **unfiltered-kaiju_species_barplot.png** (taxonomy barplot without filtering)
-- **unfiltered-kaiju_species_barplot.html** (interactive taxonomy barplot without filtering)
-- **filtered-kaiju_species_barplot.png** (taxonomy barplot after filtering rare and non-microbial taxa)
-- **filtered-kaiju_species_barplot.html** (interactive taxonomy barplot after filtering rare and non-microbial taxa)
+- kaiju_unfiltered_species_barplot_GLlblMetag.png (taxonomy barplot without filtering)
+- **kaiju_unfiltered_species_barplot_GLlblMetag.html** (interactive taxonomy barplot without filtering)
+- kaiju_filtered_species_barplot_GLlblMetag.png (taxonomy barplot after filtering rare and non-microbial taxa)
+- **kaiju_filtered_species_barplot_GLlblMetag.html** (interactive taxonomy barplot after filtering rare and non-microbial taxa)
 
 
-#### 9i. Feature decontamination
+#### 7i. Feature decontamination
 
 > Feature (species) decontamination with decontam. Decontam is an R package that statistically identifies contaminating features in a feature table
 
@@ -1937,11 +1933,21 @@ feature_table_file <- "filtered-kaiju_species_table_GLlbsMetag.csv"
 metadata_table <- "/path/to/sample/metadata"
 ntc_name <- "name_of_ntc_sample"
 
-decontaminated_table <- feature_decontam(metadata_file = metadata_table, feature_table_file = feature_table_file, 
-                               feature_column = "species", samples_column = "sample_id",
-                               prevalence_column = "Sample_or_Control", ntc_name = ntc_name, frequency_column = "concentration", 
-                               threshold = 0.1, classification_method = "kaiju", 
-                               output_prefix = "", assay_suffix = "_GLlbsMetag")
+# set width based on number of samples, with a cap at 50 inches
+plot_width <- 2 * number_samples
+if(plot_width > 50) { plot_width = 50 }
+
+decontaminated_table <- feature_decontam(metadata_file = metadata_table, 
+                                         feature_table_file = feature_table_file, 
+                                         feature_column = "species", 
+                                         samples_column = "sample_id",
+                                         prevalence_column = "NTC", 
+                                         ntc_name = "TRUE", 
+                                         frequency_column = "concentration", 
+                                         threshold = 0.1, 
+                                         classification_method = "kaiju", 
+                                         output_prefix = "", 
+                                         assay_suffix = "_GLlblMetag")
 
 # Convert count matrix to relative abundance matrix
 decontaminated_species_table <- count_to_rel_abundance(decontaminated_table)
@@ -1949,38 +1955,43 @@ decontaminated_species_table <- count_to_rel_abundance(decontaminated_table)
 # Make plot after filtering out contaminants
 p <- make_plot(decontaminated_species_table, metadata, custom_palette, publication_format)
 
-ggsave(filename = "decontaminated-kaiju-species_barplot.png", plot = p,
-         device = "png", width = plot_width, height = plot_height, units = "in", dpi = 300)
+ggsave(filename = "kaiju_decontam_species_barplot_GLlblMetag.png", plot = p,
+         device = "png", width = plot_width, height = 10, units = "in", dpi = 300)
+
+# Save interactive decontaminated plot
+htmlwidgets::saveWidget(plotly::ggplotly(p), "kaiju_decontam_species_barplot_GLlblMetag.html", selfcontained = TRUE)
 ```
+
 **Custom Functions Used:**
 - [feature_decontam()](#feature_decontam)
 - [make_plot()](#make_plot)
 - [count_to_rel_abundance()](#count_to_rel_abundance)
 
 **Parameter Definitions:**
-  - `metadata_table` - path to a file with samples as rows and columns describing each sample
-  - `feature_table_file` - path to a tab separated samples feature table i.e. species/functions 
-                           table with species/functions as the first column and samples as other columns.
-  - `ntc_name` - a character string specifying the name of the NTC in the prevalence column
+- `metadata_table` - path to a file with samples as rows and columns describing each sample
+- `feature_table_file` - path to a tab separated samples feature table i.e. species/functions 
+                         table with species/functions as the first column and samples as other columns.
+- `number_samples` - the total number of samples in the species count files, adjust based on number of input samples in the feature_table_file
 
 **Input Data:**
 
-- `filtered-kaiju_species_table_GLlbsMetag.csv`(path to filtered species count per sample, output from [Step 9h](#9h-taxonomy-barplots))
+- `kaiju_filtered_species_table_GLlblMetag.csv` (filtered species count table, output from [Step 7g](#7g-filter-kaiju-species-count-table))
 - `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping samplenames to group metadata)
 
 **Output Data:**
 
-- **decontam-kaiju_results_GLlbsMetag.csv** (decontam's result table)
-- **decontaminated-kaiju_species_table_GLlbsMetag.csv** (decontaminated species table)
-- **decontaminated-kaiju-species_barplot_GLlbsMetag.png** (barplot after filtering out contaminants)
+- **kaiju_decontam_results_GLlblMetag.csv** (decontam's result table, output from [feature_decontam() function](#feature_decontam))
+- **kaiju_decontam_species_table_GLlblMetag.csv** (decontaminated species table, output from [feature_decontam() function](#feature_decontam))
+- **kaiju_decontam_species_barplot_GLlblMetag.png** (barplot after filtering out contaminants)
+- **kaiju_decontam_species_barplot_GLlblMetag.html** (interactive barplot after filtering out contaminants)
 
 <br>
 
 ---
 
-### 10. Taxonomic Profiling Using Kraken2
+### 8. Taxonomic Profiling Using Kraken2
 
-#### 10a. Download Kraken2 Database
+#### 8a. Download Kraken2 Database
 
 ```bash 
 ## Download all microbial (including eukaryotes) - https://benlangmead.github.io/aws-indexes/k2
@@ -2038,7 +2049,7 @@ kraken2 --db kraken2-db/ \
         --use-names \
         --output sample-kraken2-output.txt \
         --report sample-kraken2-report.tsv \
-        /path/to/sample_decontam_GLlbsMetag.fastq.gz
+        /path/to/sample1_GLlbsMetag_R1_decontam.fastq.gz /path/to/sample1_GLlbsMetag_R2_decontam.fastq.gz
 ```
 
 **Parameter Definitions:**
@@ -2049,12 +2060,17 @@ kraken2 --db kraken2-db/ \
 - `--use-names` - Specifies to add taxa names in addition to taxids.
 - `--output` - Specifies the name of the kraken2 read-based output file.
 - `--report` - Specifies the name of the kraken2 report output file.
-- `sample_decontam_GLlbsMetag.fastq.gz` - Positional argument specifying the input file.
+- `sample1_GLlbsMetag_R1_decontam.fastq.gz` - Positional argument specifying the forward read input file.
+- `sample1_GLlbsMetag_R2_decontam.fastq.gz` - Positional argument specifying the reverse read input file.
+
 
 **Input Data:**
 
 - kraken2-db/ (a directory containing kraken2 database files, output from [Step 10a](#10a-download-kraken2-database))
-- sample_decontam_GLlbsMetag.fastq.gz (filtered and trimmed sample reads with both contaminants and human reads removed, gzipped fasta file, output from [Step 7e](#7e-generate-decontaminated-read-files))
+- \*_R[12]_decontam.fastq.gz or \*_R[12]_HostRm.fastq.gz (filtered and trimmed sample reads with both 
+    contaminants and human reads (and, optionally, host reads) removed, gzipped fastq files, 
+    output from [Step 4b](#4b-build-contaminant-index-and-map-reads) or [Step 5b](#5b-remove-host-reads))
+
 
 **Output Data:**
 
@@ -2062,16 +2078,25 @@ kraken2 --db kraken2-db/ \
 - sample-kraken2-report.tsv (kraken2 report output file (one line per taxa, with number of reads assigned to it))
 
 
-#### 10c. Compile Kraken2 Taxonomy Results
+#### 8c. Compile Kraken2 Taxonomy Results
 
-##### 10ci. Create Merged Kraken2 Taxonomy Table
+##### 8ci. Create Merged Kraken2 Taxonomy Table
 
-```bash
-combine_kreports.py --output merged-kraken2-table.tsv \
-                    --report-files sample1-kraken2-report.tsv sample2-kraken2-report.tsv ... sampleN-kraken2-report.tsv \
-                    --sample-names sample1 sample2 ... sampleN
+```R
+species_table <- merge_kraken_reports(reports_dir = "/path/to/kraken2/reports")
+write_csv(x = species_table, file = "kraken2_species_table_GLlblMetag.csv")
 ```
 
+**Custom Functions Used:**
+
+- [merge_kraken_reports()](#merge_kraken_reports)
+
+**Parameter Definitions:**
+
+- `reports_dir` - path to the directory containing the per-sample kraken2 report files
+- `x`  - feature table dataframe to write to file
+- `file` - path to the output file for the kraken2 species count table
+
 **Parameter Definitions:**
 
 - `--output` - Specifies the name of the kraken2 compiled results output file.
@@ -2080,14 +2105,13 @@ combine_kreports.py --output merged-kraken2-table.tsv \
 
 **Input Data:**
 
-- \*-kraken2-report.tsv (kraken report from each sample to compile, outputs from [Step 10b](#10b-taxonomic-classification))
+- \*-kraken2-report.tsv (kraken report from each sample to compile, outputs from [Step 8b](#8b-taxonomic-classification))
 
 **Output Data:**
 
-- **merged-kraken2-table.tsv** (table containing compiled kraken2 reports)
+- **kraken2_species_table_GLlblMetag.csv** (kraken species count table in csv format)
 
-
-##### 10cii. Compile Kraken2 Taxonomy Reports
+##### 8cii. Compile Kraken2 Taxonomy Reports
 
 ```bash
 multiqc --zip-data-dir \ 
@@ -2107,7 +2131,7 @@ multiqc --zip-data-dir \
 
 **Input Data:**
 
-- \*-kraken2-report.tsv (kraken report from each sample to compile, outputs from [Step 10b](#10b-taxonomic-classification))
+- \*-kraken2-report.tsv (kraken report from each sample to compile, outputs from [Step 8b](#8b-taxonomic-classification))
 
 **Output Data:**
 
@@ -2115,7 +2139,7 @@ multiqc --zip-data-dir \
 - **kraken2_multiqc_GLlbsMetag_data.zip** (zip archive containing multiqc output data)
 
 
-#### 10d. Convert Kraken2 Output to Krona Format
+#### 8d. Convert Kraken2 Output to Krona Format
 
 ```bash
 kreport2krona.py --report-file sample-kraken2-report.tsv  \
@@ -2129,14 +2153,14 @@ kreport2krona.py --report-file sample-kraken2-report.tsv  \
 
 **Input Data:**
 
-- sample-kraken2-report.tsv (kraken report, output from [Step 10b](#10b-taxonomic-classification))
+- sample-kraken2-report.tsv (kraken report, output from [Step 8b](#8b-taxonomic-classification))
 
 **Output Data:**
 
 - sample.krona (krona formatted kraken2 output)
 
 
-#### 10e. Compile Kraken2 Krona Reports
+#### 8e. Compile Kraken2 Krona Reports
 
 ```bash
 # Find, list and write all .krona files to file 
@@ -2149,7 +2173,7 @@ basename --multiple --suffix='.krona' ${FILES[*]} | sort -uV  > sample_names.txt
 KTEXT_FILES=($(paste -d',' "krona_files.txt" "sample_names.txt"))
 
 # Create html   
-ktImportText -o kraken2-report.html ${KTEXT_FILES[*]}
+ktImportText -o kraken2-report_GLlbsMetag.html ${KTEXT_FILES[*]}
 ```
 
 **Parameter Definitions:**
@@ -2181,310 +2205,439 @@ ktImportText -o kraken2-report.html ${KTEXT_FILES[*]}
 
 **Input Data:**
 
-- *.krona (all sample .krona formatted files, output from [Step 10d](#10d-convert-kraken2-output-to-krona-format)) 
+- *.krona (all sample .krona formatted files, output from [Step 8d](#8d-convert-kraken2-output-to-krona-format)) 
 
                       
 **Output Data:**
 
 - krona_files.txt (sorted list of all *.krona files)
 - sample_names.txt (sorted list of all sample names)
-- **kraken2-report.html** (compiled krona html report containing all samples)
+- **kraken2-report_GLlbsMetag.html** (compiled krona html report containing all samples)
 
 
-#### 10f. Create Kraken2 Species Count Table --- START NEEDS REVIEW ---
+#### 8f. Filter Kraken2 Species Count Table
 
 ```R
 library(tidyverse)
-library(pavian)
-
-reports_dir <- "/path/to/directory/with/*-kraken2-report.tsv"
-species_table <- process_kraken_table(reports_dir)
-table2write <- species_table  %>%
-                as.data.frame() %>%
-                rownames_to_column("Species")
-
-write_csv(x = table2write, 
-          file = "kraken_species_table.csv")
-```
-
-**Parameter Definitions:**
-
-- `reports_dir` - a directory containing kraken2 default reports
-- `x` - table to write
-- `file` - file name to write table to.
-
-**Input Data:**
 
-- *-kraken2-report.tsv (kraken2 report output file, from [Step 10b](#10b-taxonomic-classification))
-
-**Output Data:**
+input_file <- "kraken2_species_table_GLlblMetag.csv"
+output_file <- "kraken2_filtered_species_table_GLlblMetag.csv"
+threshold <- 0.5
 
-- **kraken_species_table.csv** (kraken species count table in csv format)
+# string used to define non-microbial taxa
+non_microbial <- "UNCLASSIFIED|Unclassifed|unclassified|Homo sapien|cannot|uncultured|unidentified"
 
+# read in feature table
+feature_table <- read_csv(input_file) %>% as.data.frame
+feature_name <- colnames(feature_table)[1]
+rownames(feature_table) <- feature_table[,1]
+feature_table <- feature_table[, -1]
 
-#### 10g. Read-in tables
+# filter rare and non-microbial taxa out of the read-based count table
+table2write <- filter_rare(feature_table, non_microbial, threshold = threshold) %>%
+  as.data.frame %>%
+  rownames_to_column(feature_name)
 
-```R
-library(tidyverse)
+write_csv(x = table2write, file = output_file)
+```
 
-# Read-in metadata
+**Custom Functions Used:**
 
-metdata_file <- "/path/to/sample/metadata"
-samples_column <- "Sample_ID"
-metadata <- read_delim(file=metdata_file , delim = "\t") %>% as.data.frame()
-row.names(metadata) <- metadata[,samples_column]
-# Read-in feature table
-species_table <- read_csv(file="kraken_species_table.csv") %>%  as.data.frame()
-rownames(species_table) <- species_table$Species
-# Drop the species column
-species_table <- species_table[,-match("Species", colnames(species_table))]
-```
+- [filter_rare()](#filter_rare)
 
 **Parameter Definitions:**
 
-- `file` - path to input table
-- `delim` - file delimiter 
+- `threshold` - threshold for filtering out rare taxa, a percentage between 0 and 100.
+- `output_file` - output filename
+- `input_file` - input filename
 
 **Input Data:**
 
-- metadata_file  (path to sample-wise metadata file)
-- kraken_species_table.csv (path to kraken species table)
+- kraken2_species_table_GLlblMetag.csv (kraken2 species count table, output from [Step 8ci](#8ci-create-merged-kraken2-taxonomy-table))
 
 **Output Data:**
 
-- metadata (a dataframe of sample-wise metadata)
-- species_table (a dataframe of species count with rows and columns as species and sample names, respectively)
+- **kraken2_filtered_species_table_GLlblMetag.csv** - a file containing the filtered species table
 
+---
 
-#### 10h. Taxonomy barplots
+#### 8g. Taxonomy barplots
 
 ```R
 library(tidyverse)
 
-# Threshold to filter out potential false positive
-# taxonomy assignments
-filter_threshold <- 0.5
-# Filter out Rare and non-microbial assignments.
-# You can add as many species that you'd like to filter out
-# using the following syntax "|species_name1|species_name2"
-non_microbial <- "Unclassifed|unclassified|Homo sapien"
-
-plot_width <- 18
-plot_height <- 8
-
-# Convert count matrix to relative abundance matrix
-abund_table <- count_to_rel_abundance(species_table)
-
-# Make plot without filtering
-p <- make_plot(abund_table, metadata, custom_palette, publication_format)
-
-ggsave(filename =  "unfiltered-kraken_species_plot.png", plot = p, device = "png", 
-       width = plot_width, height = plot_height, units = "in", dpi = 300)
-
+species_table_file <- "kraken2_species_table_GLlblMetag.csv"
+filtered_species_table_file <- "kraken2_filtered_species_table_GLlblMetag.csv"
+metadata_file <- "/path/to/sample/metadata"
+number_samples <- 10 
 
-# Get species with relative abundance greater than `filter_threshold` in all samples
-# Drop rare and non-microbial assignments
-filtered_species_table  <- filter_rare(species_table, non_microbial, threshold=filter_threshold)
+# set width based on number of samples, with a cap at 50 inches
+plot_width <- 2 * number_samples
+if(plot_width > 50) { plot_width = 50 }
 
+p <- make_barplot(metadata_file = metadata_file, feature_table_file = species_table_file, 
+                  feature_column = "species", samples_column = "sample_id", group_column = "group",
+                  publication_format = publication_format, custom_palette = custom_palette)
 
-# Convert count matrix to relative abundance matrix
-filtered_species_table <- count_to_rel_abundance(filtered_species_table)
+ggsave(filename = "kraken2_unfiltered_species_barplot_GLlblMetag.png", plot = p,
+       device = "png", width = plot_width, height = 10, units = "in", dpi = 300, limitsize = FALSE)
 
-# Write filtered table to file
-table2write <- filtered_species_table %>%
-                 t %>%
-                 as.data.frame() %>%
-                rownames_to_column("Species")
+# Save interactive unfiltered plot
+htmlwidgets::saveWidget(plotly::ggplotly(p), "kraken2_unfiltered_species_barplot_GLlblMetag.html", selfcontained = TRUE)
 
-write_csv(x = table2write , file = "filtered-kraken_species_table.csv")
+# Make filtered plot
+p <- make_barplot(metadata_file = metadata_file, feature_table_file = filtered_species_table_file, 
+                  feature_column = "species", samples_column = "sample_id", group_column = "group",
+                  publication_format = publication_format, custom_palette = custom_palette)
 
-# Make plot after filtering
-p <- make_plot(filtered_species_table , metadata, custom_palette, publication_format)
+# Save static filtered plot
+ggsave(filename = "kraken2_filtered_species_barplot_GLlblMetag.png", plot = p,
+       device = "png", width = plot_width, height = 10, units = "in", dpi = 300, limitsize = FALSE)
 
-ggsave(filename = "filtered-kraken_species_plot.png", plot = p,
-         device = "png", width = plot_width, height = plot_height, units = "in", dpi = 300)
+# Save interactive filtered plot
+htmlwidgets::saveWidget(plotly::ggplotly(p), "kraken2_filtered_species_barplot_GLlblMetag.html", selfcontained = TRUE)
 ```
-
 **Custom Functions Used:**
-- [make_plot()](#make_plot)
-- [count_to_rel_abundance()](#count_to_rel_abundance)
-
+- [make_barplot()](#make_barplot)
 
 **Parameter Definitions:**
 
-- `filter_threshold` - a decimal threshold from 0-1 to filter out rare species i.e potential false positives
-- `non_microbial` - a regex string listing out assignments to drop before filtering based on the `filter_threshold` above 
+- `species_table_file` - a file containing the species count table
+- `filtered_species_table_file` - a file containing the filtered species count table
+- `metadata_file` - a file containing group information for each sample in the species count files
+- `number_samples` - the total number of samples in the species count files, adjust based on number of input samples.
 
 **Input Data:**
 
-- `species_table` (a dataframe of species count per sample, output from [Step 10g](#10g-read-in-tables))
-- `metadata` - (a dataframe of sample-wise metadata, output from [Step 10g](#10g-read-in-tables))
+- `kraken2_species_table_GLlblMetag.csv` (a file containing the kraken2 species count table, output from [Step 8ci](#8ci-create-merged-kraken2-taxonomy-table))
+- `kraken2_filtered_species_table_GLlblMetag.csv` (a file containing the filtered species count table, output from [Step 8f](#8f-filter-kraken2-species-count-table))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping samplenames to group metadata)
 
 **Output Data:**
 
-- **unfiltered-kraken_species_plot.png** (barplot plot without filtering)
-- **filtered-kraken_species_table.csv** (filtered relative abundance table)
-- **filtered-kraken_species_plot.png** (barplot after filtering rare and non-microbial taxa)
+- kraken2_unfiltered_species_barplot_GLlblMetag.png (taxonomy barplot without filtering)
+- **kraken2_unfiltered_species_barplot_GLlblMetag.html** (interactive taxonomy barplot without filtering)
+- kraken2_filtered_species_barplot_GLlblMetag.png (taxonomy barplot after filtering rare and non-microbial taxa)
+- **kraken2_filtered_species_barplot_GLlblMetag.html** (interactive taxonomy barplot after filtering rare and non-microbial taxa)
 
 
-#### 10i. Feature decontamination --- END NEEDS REVIEW ---
+#### 8h. Feature decontamination
 
-Feature decontamination with decontam. Decontam is an R package that statistically identifies contaminating features in a feature table.
+> Feature (species) decontamination with decontam. Decontam is an R package that statistically 
+  identifies contaminating features in a feature table
 
 ```R
 library(tidyverse)
 library(decontam)
 library(phyloseq)
 
-feature_table <- read_csv("filtered-kraken_species_table.csv") %>%
-                  as.data.frame()
-
- rownames(feature_table) <- feature_table$Species
- feature_table <- feature_table[,-1]  %>% as.matrix()
-# Set to 0.5 for a more aggressive approach where species more prevalent
-# in the negative controls are considered contaminants
-contam_threshold <- 0.1
-# Control samples in this column should always be written as
-# "Control_Sample" and true samples as "True_Sample" for the function below to function properly.
-prev_col <- "Sample_or_Control"
-freq_col <- "input_conc_ng"
-plot_width <- 18
-plot_height <- 8
-
-contamdf <- run_decontam(feature_table, metadata, contam_threshold, prev_col, freq_col)
-
-# Write decontam result table to file
-write_csv(x = contamdf %>% rownames_to_column("Species"), file = "decontam-kraken_results.csv")
-
-# Get the list of contaminants identified by decontam
-contaminants <- contamdf %>%
-                as.data.frame %>%
-                rownames_to_column("Species") %>%
-                filter(contaminant == TRUE) %>% pull(Species)
-
-# Drop contaminant features identified by decontam
-decontaminated_table <- feature_table %>% 
-                as.data.frame  %>% 
-                rownames_to_column("Species") %>% 
-                filter(str_detect(Species, 
-                                  pattern = str_c(contaminants,
-                                                  collapse = "|"),
-                                  negate = TRUE))
-
-rownames(decontaminated_table) <- decontaminated_table$Species
-decontaminated_table <- decontaminated_table[,-1] %>% as.matrix
+feature_table_file <- "kraken2_filtered_species_table_GLlblMetag.csv"
+metadata_table <- "/path/to/sample/metadata"
+number_samples <- NumberOfSamples # integer indicating how many samples are in the file
 
-# Convert count matrix to relative abundance matrix
-decontaminated_species_table <- count_to_rel_abundance(decontaminated_table)
+# set width based on number of samples, with a cap at 50 inches
+plot_width <- 2 * number_samples
+if(plot_width > 50) { plot_width = 50 }
 
-# Write decontaminated species table to file
-table2write <- decontaminated_species_table %>%
-                 t %>%
-                 as.data.frame() %>%
-                rownames_to_column("Species")
+decontaminated_table <- feature_decontam(metadata_file = metadata_table, 
+                                         feature_table_file = feature_table_file, 
+                                         feature_column = "species", 
+                                         samples_column = "sample_id",
+                                         prevalence_column = "NTC", 
+                                         ntc_name = "TRUE", 
+                                         frequency_column = "concentration", 
+                                         threshold = 0.1, 
+                                         classification_method = "kraken2", 
+                                         output_prefix = "", 
+                                         assay_suffix = "_GLlblMetag")
 
-write_csv(x = table2write, file = "decontaminated-kraken_species_table.csv")
+# Convert count matrix to relative abundance matrix
+decontaminated_species_table <- count_to_rel_abundance(decontaminated_table)
 
 # Make plot after filtering out contaminants
-p <- make_plot(decontaminated_species_table , metadata, custom_palette, publication_format)
+p <- make_plot(decontaminated_species_table, metadata, custom_palette, publication_format)
+
+ggsave(filename = "kraken2_decontam_species_barplot_GLlblMetag.png", plot = p,
+         device = "png", width = plot_width, height = 10, units = "in", dpi = 300)
 
-ggsave(filename = "decontaminated-kraken-species_plot.png", plot = p,
-         device = "png", width = plot_width, height = plot_height, units = "in", dpi = 300)
+# Save interactive decontaminated plot
+htmlwidgets::saveWidget(plotly::ggplotly(p), "kraken2_decontam_species_barplot_GLlblMetag.html", selfcontained = TRUE)
 ```
 
 **Custom Functions Used:**
+- [feature_decontam()](#feature_decontam)
 - [make_plot()](#make_plot)
 - [count_to_rel_abundance()](#count_to_rel_abundance)
 
+**Parameter Definitions:**
+- `metadata_table` - path to a file with samples as rows and columns describing each sample
+- `feature_table_file` - path to a tab separated samples feature table i.e. species/functions 
+                          table with species/functions as the first column and samples as other columns.
+- `number_samples` - the total number of samples in the species count files, adjust based on number of input samples.
+
 **Input Data:**
 
-- `filtered-kraken_species_table.csv`(path to species count per sample, output from [Step 10h](#10h-taxonomy-barplots))
-- `metadata`(a dataframe of sample-wise metadata, output from step[Step 10g](#10g-read-in-tables))
+- `kraken2_filtered_species_table_GLlblMetag.csv` (filtered species count table, output from [Step 8f](#8f-filter-kraken2-species-count-table))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping samplenames to group metadata)
 
 **Output Data:**
 
-- **decontam-kraken_results.csv** (decontam's result table)
-- **decontaminated-kraken_species_table.csv** (decontaminated species table)
-- **decontaminated-kraken-species_plot.png** (barplot after filtering out contaminants)
+- **kraken2_decontam_results_GLlblMetag.csv** (decontam's result table, output from [feature_decontam() function](#feature_decontam))
+- **kraken2_decontam_species_table_GLlblMetag.csv** (decontaminated species table, output from [feature_decontam() function](#feature_decontam))
+- **kraken2_decontam_species_barplot_GLlblMetag.png** (barplot after filtering out contaminants)
+- **kraken2_decontam_species_barplot_GLlblMetag.html** (interactive barplot after filtering out contaminants)
 
 <br>
 
----
+### 9. Taxonomic Profiling Using MetaPhlAn
 
-## Assembly-based Processing
+#### 9a. Download and install HUMAnN databases
+
+```bash 
+mkdir -p /path/to/humann3-db
+humann_databases --download chocophlan full /path/to/humann3-db/
+humann_databases --download uniref uniref90_ec_filtered_diamond /path/to/humann3-db/
+humann_databases --download utility_mapping full /path/to/humann3-db/
+metaphlan --install
+```
+
+**Parameter Definitions:**  
+*humann_databases*
+- `--download` - Specifies the databases to download:
+  - `chocophlan full` - the full ChocoPhlAn pangenome database, which includes Archaea, Bacteria, Eukaryotes, and Viruses
+  - `uniref uniref90_ec_filtered_diamond` - Download the EC-filtered UniRef90 translated search database
+  - `utility_mapping full` - additional gene family to functional category mapping database
+- `/path/to/humann3-db/` - Specifies the database install location
 
-### 11. Sample Assembly
+*metaphlan*
+- `--install` - install the MetaPhlAn clade markers and database locally
 
+**Input Data:**
+- None
+
+**Output Data:**
+- `/path/to/humann3-db/` (directory containing the installed HUMAnN databases)
+
+#### 9b. HUMAnN/MetaPhlAn Taxonomic Classification
 ```bash
-flye --meta \
-     --threads NumberOfThreads \
-     --out-dir sample/ \
-     --nano-hq \
-     /path/to/sample_decontam_GLlbsMetag.fastq.gz
+# paired-end forward and reverse reads must be concatenated into one file (single-end reads are passed directly to --input)
+cat sample1_GLlbsMetag_R1_decontam.fastq.gz sample1_GLlbsMetag_R2_decontam.fastq.gz > sample1-combined.fastq.gz
+
+humann --input sample1-combined.fastq.gz \
+       --output sample1-humann3-out-dir \
+       --threads NumberOfThreads \
+       --output-basename sample1 \
+       --metaphlan-options "--bowtie2db /path/to/humann3-db/ --unclassified_estimation --add_viruses --sample_id sample1" \
+       --nucleotide-database /path/to/humann3-db/ \
+       --protein-database /path/to/humann3-db/ \
+       --bowtie-options "--sensitive --mm"
 
-# rename output files            
-mv sample/assembly.fasta sample_assembly.fasta
-mv sample/flye.log sample_flye.log
+mv sample1-humann3-out-dir/sample1_humann_temp/sample1_metaphlan_bugs_list.tsv \
+   sample1-humann3-out-dir/sample1_metaphlan_bugs_list.tsv
 ```
 
-**Parameter Definitions:**
+**Parameter Definitions:**  
 
-- `--meta` – Use metagenome/uneven coverage mode.
-- `--threads` - Number of parallel processing threads to use.
-- `--out-dir` - Specifies the name of the output directory.
-- `--nano-hq` - Specifies that input is from Oxford Nanopore high-quality reads (Guppy5+ SUP or Q20, <5% error). This skips a genome polishing step since the assembly will be polished with medaka in the next step.
-- `/path/to/sample_decontam_GLlbsMetag.fastq.gz` - Path to the input file, specified as a positional argument.
+- `--input` – specifies the input file (combined forward and reverse reads)
+- `--output` – specifies the output directory
+- `--threads` – specifies the number of threads to use
+- `--output-basename` – specifies the prefix of the output files
+- `--metaphlan-options` – options to be passed to metaphlan
+  - `--bowtie2db` – path to bowtie2 indexes (stored in humann database folder)
+  - `--unclassified_estimation` – scale the relative abundance profile according to the percentage of reads mapping to a clade
+  - `--add_viruses` – include viruses in the reference database
+  - `--sample_id` – specifies the sample identifier we want in the table (rather than full filename)
 
-**Input Data**
+**Input Data:**
+- `/path/to/humann3-db/` (humann databases installed in [Step 9a](#9a-download-and-install-humann-databases))
+- \*_R[12]_decontam.fastq.gz or \*_R[12]_HostRm.fastq.gz (filtered and trimmed sample reads with both 
+    contaminants and human reads (and, optionally, host reads) removed, gzipped fastq files, 
+    output from [Step 4b](#4b-build-contaminant-index-and-map-reads) or [Step 5b](#5b-remove-host-reads))
 
-- sample_decontam_GLlbsMetag.fastq.gz (filtered and trimmed sample reads with both contaminants and human reads removed, gzipped fasta file, output from [Step 7e](#7e-generate-decontaminated-read-files))
+**Output Data:**
+- sample1-humann3-out-dir/ - humann output directory containing *genefamilies.tsv, *pathabundance.tsv, and *pathcoverage.tsv files
 
-**Output Data**
+#### 9c. Merge multiple sample functional profiles
+```bash
+# each result type needs to be in its own directory before joining
+mkdir genefamily-results/ pathabundance-results/ pathcoverage-results/
 
-- sample_assembly.fasta (sample assembly)
-- sample_flye.log (log file)
+# copying results from humann3 step
+cp *-humann3-out-dir/*genefamilies.tsv genefamily-results/
+cp *-humann3-out-dir/*abundance.tsv pathabundance-results/
+cp *-humann3-out-dir/*coverage.tsv pathcoverage-results/
 
-<br>
+# join results across samples
+humann_join_tables -i genefamily-results/ -o gene-families.tsv
+humann_join_tables -i pathabundance-results/ -o path-abundances.tsv
+humann_join_tables -i pathcoverage-results/ -o path-coverages.tsv
+```
 
----
+**Parameter Definitions:**  
+
+- `-i` - the directory holding the input tables
+- `-o` - the name of the output table holding combined data
 
-### 12. Polish Assembly
+**Input Data:**
+
+- `*-humann3-out-dir/` (humann output directory for each sample, from [Step 9b](#9b-humannmetaphlan-taxonomic-classification))
+
+**Output Data:**
+
+- gene-families.tsv - Combined gene family table in tab-separated format.
+- path-abundances.tsv - Combined path abundances table in tab-separated format.
+- path-coverages.tsv - Combined path coverages table in tab-separated format.
+
+#### 9d. Split results tables
+
+The read-based functional annotation tables mix taxonomically grouped and non-taxonomically grouped information. `humann` comes with a helper script that splits each table into a non-taxonomically grouped (unstratified) file and a taxonomically grouped (stratified) file.
 
 ```bash
-medaka_consensus -t NumberOfThreads \
-                 -i /path/to/sample_decontam_GLlbsMetag.fastq.gz \
-                 -d /path/to/assemblies/sample_assembly.fasta \
-                 -o sample/
-  
-mv sample/consensus.fasta sample_polished.fasta
+humann_split_stratified_table -i gene-families.tsv -o ./
+mv gene-families_stratified.tsv Gene-families-grouped-by-taxa_GLlbsMetag.tsv
+mv gene-families_unstratified.tsv Gene-families_GLlbsMetag.tsv
+
+humann_split_stratified_table -i path-abundances.tsv -o ./
+mv path-abundances_stratified.tsv Path-abundances-grouped-by-taxa_GLlbsMetag.tsv
+mv path-abundances_unstratified.tsv Path-abundances_GLlbsMetag.tsv
+
+humann_split_stratified_table -i path-coverages.tsv -o ./
+mv path-coverages_stratified.tsv Path-coverages-grouped-by-taxa_GLlbsMetag.tsv
+mv path-coverages_unstratified.tsv Path-coverages_GLlbsMetag.tsv
 ```
 
-**Parameter Definitions:**
+**Parameter Definitions:**  
 
-- `-t` - Number of parallel processing threads to use.
-- `-i` - Specifies path to input read files used in creating the assembly.
-- `-d` - Specifies path to the assembly fasta file.
-- `-o` - Specifies the output directory.
+-	`-i` – the input combined table
+-	`-o` – output directory (here specifying current directory)
+
+**Input Data:**
+
+- gene-families.tsv (Combined gene family table from [Step 9c](#9c-merging-multiple-sample-functional-profiles-into-one-table))
+- path-abundances.tsv (Combined path abundances table from [Step 9c](#9c-merging-multiple-sample-functional-profiles-into-one-table))
+- path-coverages.tsv (Combined path coverages table from [Step 9c](#9c-merging-multiple-sample-functional-profiles-into-one-table))
+
+**Output Data:**
+- Gene-families-grouped-by-taxa_GLlbsMetag.tsv - Gene families grouped by taxa
+- Gene-families_GLlbsMetag.tsv - Non-taxonomically grouped gene families
+- Path-abundances-grouped-by-taxa_GLlbsMetag.tsv - Path abundances grouped by taxa
+- Path-abundances_GLlbsMetag.tsv - Non-taxonomically grouped path abundances
+- Path-coverages-grouped-by-taxa_GLlbsMetag.tsv - Path coverages grouped by taxa
+- Path-coverages_GLlbsMetag.tsv - Non-taxonomically grouped path coverages
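+
+As a rough illustration of what the split does, the sketch below separates a toy table using the `|` character that HUMAnN places in the feature name of per-taxon (stratified) rows. The file names and values are made up for the example, and the sketch is not a replacement for `humann_split_stratified_table`.
+
+```bash
+# toy combined table: rows whose first column contains "|" are per-taxon (stratified) rows
+printf '# Gene Family\tsample1\n'                            >  toy-gene-families.tsv
+printf 'UniRef90_A\t10\n'                                    >> toy-gene-families.tsv
+printf 'UniRef90_A|g__Bacillus.s__Bacillus_subtilis\t7\n'    >> toy-gene-families.tsv
+printf 'UniRef90_A|unclassified\t3\n'                        >> toy-gene-families.tsv
+
+# keep the header in both outputs, then route each row by whether column 1 holds a "|"
+awk -F '\t' 'NR == 1 || index($1, "|") > 0'  toy-gene-families.tsv > toy_stratified.tsv
+awk -F '\t' 'NR == 1 || index($1, "|") == 0' toy-gene-families.tsv > toy_unstratified.tsv
+```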
+
+#### 9e. Normalize gene families and pathway abundances tables
+Generate normalized (copies-per-million) versions of the read-based functional tables from humann, which are better suited to comparisons across samples.
+
+```bash
+humann_renorm_table -i Gene-families_GLlbsMetag.tsv -o Gene-families-cpm_GLlbsMetag.tsv --update-snames
+humann_renorm_table -i Path-abundances_GLlbsMetag.tsv -o Path-abundances-cpm_GLlbsMetag.tsv --update-snames
+```
+
+**Parameter Definitions:**  
+
+-	`-i` – the input combined table
+-	`-o` – name of the output normalized table
+-	`--update-snames` – change suffix of column names in tables to "-CPM"
 
 **Input Data:**
 
-- sample_decontam_GLlbsMetag.fastq.gz (filtered and trimmed sample reads with both contaminants and human reads removed, gzipped fasta file, output from [Step 7e](#7e-generate-decontaminated-read-files))
-- /path/to/assemblies/sample_assembly.fasta (sample assembly, output from [Step 11](#11-sample-assembly))
+- Gene-families_GLlbsMetag.tsv (Non-taxonomically grouped gene families, from [Step 9d](#9d-split-results-tables))
+- Path-abundances_GLlbsMetag.tsv (Non-taxonomically grouped path abundances, from [Step 9d](#9d-split-results-tables))
 
 **Output Data:**
 
-- sample_polished.fasta (polished sample assembly)
+- Gene-families-cpm_GLlbsMetag.tsv (Non-taxonomically grouped gene families, normalized to copies per million)
+- Path-abundances-cpm_GLlbsMetag.tsv (Non-taxonomically grouped path abundances, normalized to copies per million)
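+
+To make the "-CPM" units concrete, the sketch below rescales each sample column of a toy table so that it sums to 1,000,000, which is the idea behind copies-per-million normalization. The toy file names and values are assumptions for illustration only; the released tables are produced with `humann_renorm_table` as shown above.
+
+```bash
+# toy unstratified table with two samples
+printf '# Gene Family\tsample1\tsample2\n' >  toy-families.tsv
+printf 'UniRef90_A\t30\t1\n'                >> toy-families.tsv
+printf 'UniRef90_B\t10\t3\n'                >> toy-families.tsv
+
+# the file is read twice: pass 1 sums each sample column, pass 2 rescales every value
+awk -F '\t' 'BEGIN { OFS = FS }
+     NR == FNR { if ( FNR > 1 ) for ( i = 2; i <= NF; i++ ) total[i] += $i; next }
+     FNR == 1  { print; next }
+     { for ( i = 2; i <= NF; i++ ) $i = ( total[i] > 0 ) ? $i / total[i] * 1000000 : 0; print }
+     ' toy-families.tsv toy-families.tsv > toy-families-cpm.tsv
+```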
+
+#### 9f. Generate a normalized gene-family table grouped by KEGG Orthologs (KOs)
+```bash
+humann_regroup_table -i Gene-families_GLlbsMetag.tsv -g uniref90_ko | \
+humann_rename_table -n kegg-orthology | \
+humann_renorm_table -o Gene-families-KO-cpm_GLlbsMetag.tsv --update-snames
+```
+
+**Parameter Definitions:**  
+
+*humann_regroup_table*
+
+- `-i` – the input table
+- `-g` – the map to use to group UniRef IDs into KEGG Orthologs
+- `|` – sends the output to the next humann command, which adds human-readable KEGG Orthology names
+
+*humann_rename_table*
+
+- `-n` – specifies that KEGG Orthology IDs are converted into human-readable KEGG Orthology names
+- `|` – sends the output to the next humann command, which normalizes to copies per million
+
+*humann_renorm_table*
+
+- `-o` – specifies the final output file name
+- `--update-snames` – change suffix of column names in tables to "-CPM"
+
+**Input Data:**
+
+- Gene-families_GLlbsMetag.tsv (Non-taxonomically grouped gene families, from [Step 9d](#9d-split-results-tables))
+
+**Output Data:**
+
+- Gene-families-KO-cpm_GLlbsMetag.tsv (gene-families with annotations based on Kegg Orthology terms)
+
+#### 9g. Combining taxonomy tables
+
+```bash
+merge_metaphlan_tables.py *-humann3-out-dir/*_humann_temp/*_metaphlan_bugs_list.tsv > Metaphlan-taxonomy_GLlbsMetag.tsv
+
+# remove redundant text from headers
+sed -i 's/_metaphlan_bugs_list//g' Metaphlan-taxonomy_GLlbsMetag.tsv
+```
+
+**Parameter Definitions:**  
+
+*merge_metaphlan_tables.py*
+
+- positional arguments specify the input files to merge
+- `>` - redirects the merged table to the output file
+
+*sed*
+
+- `-i` - Perform the search/replace in-place on the input file.
+
+**Input Data:**
+
+- \*-humann3-out-dir/\*_humann_temp/\*_metaphlan_bugs_list.tsv (metaphlan bugs_list produced during the humann3 run in [Step 9b](#9b-running-humannmetaphlan))
+
+**Output Data:**
+- **Metaphlan-taxonomy_GLlbsMetag.tsv** - metaphlan estimated taxonomic relative abundances
+
 
 ---
 
-### 13. Rename Contigs and Summarize Assemblies
+## Assembly-based Processing
 
-#### 13a. Rename Contig Headers
+### 10. Sample Assembly
+```bash
+megahit -1 sample1_R1_decontam.fastq.gz -2 sample1_R2_decontam.fastq.gz \
+        -o sample1-assembly -t NumberOfThreads --min-contig-len 500 > sample1-assembly.log 2>&1
+```
+
+**Parameter Definitions:**  
+
+- `-1` and `-2` – specify the input forward and reverse reads (for single-end data, neither `-1` nor `-2` is used; the reads are passed to `-r` instead)
+- `-o` – specifies output directory
+- `-t` – specifies the number of threads to use
+- `--min-contig-len` – specifies the minimum contig length to write out
+- `> sample1-assembly.log 2>&1` – sends stdout/stderr to log file
+
+
+**Input data:**
+
+- `*_R[12]_decontam.fastq.gz` or `*_R[12]_HostRm.fastq.gz` (filtered and trimmed sample reads with both 
+    contaminants and human reads (and, optionally, host reads) removed, gzipped fastq files, 
+    output from [Step 4b](#4b-build-contaminant-index-and-map-reads) or [Step 5b](#5b-remove-host-reads))
+
+**Output data:**
+
+- sample1-assembly/final.contigs.fa (assembly file)
+- **sample1-assembly.log** (log file)
+
+<br>
+
+### 11. Rename Contigs and Summarize Assemblies
+
+#### 11a. Rename Contig Headers
 
 ```bash
-bit-rename-fasta-headers -i sample_polished.fasta \
+bit-rename-fasta-headers -i sample1-assembly/final.contigs.fa \
                          -w c_sample \
-                         -o sample_assembly.fasta
+                         -o sample-assembly_GLlbsMetag.fasta
 ```
 
 **Parameter Definitions:**  
@@ -2496,28 +2649,28 @@ bit-rename-fasta-headers -i sample_polished.fasta \
 
 **Input Data:**
 
-- sample_polished.fasta (polished assembly file from [Step 12](#12-polish-assembly))
+- sample1-assembly/final.contigs.fa (assembly file from [Step 10](#10-sample-assembly))
 
 **Output files:**
 
-- **sample-assembly.fasta** (contig-renamed assembly file)
+- **sample-assembly_GLlbsMetag.fasta** (contig-renamed assembly file)
 
 
-#### 13b. Summarize Assemblies
+#### 11b. Summarize Assemblies
 
 ```bash
 bit-summarize-assembly -o assembly-summaries_GLlbsMetag.tsv \
-                       *-assembly.fasta
+                       *-assembly_GLlbsMetag.fasta
 ```
 
 **Parameter Definitions:**  
 
 - `-o` – Specifies the output summary table.
-- `*-assembly.fasta` - Specifies the input assemblies to summarize, provided as positional arguments.
+- `*-assembly_GLlbsMetag.fasta` - Specifies the input assemblies to summarize, provided as positional arguments.
 
 **Input Data:**
 
-- *-assembly.fasta (contig-renamed assembly files from [Step 13a](#13a-renaming-contig-headers))
+- *-assembly_GLlbsMetag.fasta (contig-renamed assembly files from [Step 11a](#11a-rename-contig-headers))
 
 **Output files:**
 
@@ -2527,9 +2680,9 @@ bit-summarize-assembly -o assembly-summaries_GLlbsMetag.tsv \
 
 ---
 
-### 14. Gene Prediction
+### 12. Gene Prediction
 
-#### 14a. Generate Gene Predictions
+#### 12a. Generate Gene Predictions
 
 ```bash
 prodigal -a sample-genes.faa \
@@ -2538,8 +2691,8 @@ prodigal -a sample-genes.faa \
          -p meta \
          -c \
          -q \
-         -o sample-genes.gff \
-         -i sample-assembly.fasta
+         -o sample-genes_GLlbsMetag.gff \
+         -i sample-assembly_GLlbsMetag.fasta
 ```
 
 **Parameter Definitions:**
@@ -2555,41 +2708,41 @@ prodigal -a sample-genes.faa \
 
 **Input Data:**
 
-- sample-assembly.fasta (contig-renamed assembly file from [Step 13a](#13a-renaming-contig-headers))
+- sample-assembly_GLlbsMetag.fasta (contig-renamed assembly file from [Step 11a](#11a-rename-contig-headers))
 
 **Output Data:**
 
-- sample-genes.faa (gene-calls amino-acid fasta file)
-- sample-genes.fasta (gene-calls nucleotide fasta file)
-- **sample-genes.gff** (gene-calls in general feature format)
+- sample-genes.faa (gene-calls amino-acid fasta file)
+- sample-genes.fasta (gene-calls nucleotide fasta file)
+- **sample-genes_GLlbsMetag.gff** (gene-calls in general feature format)
 
 <br>
 
-#### 14b. Remove Line Wraps In Gene Prediction Output
+#### 12b. Remove Line Wraps In Gene Prediction Output
 
 ```bash
 bit-remove-wraps sample-genes.faa > sample-genes.faa.tmp 2> /dev/null
-mv sample-genes.faa.tmp sample-genes.faa
+mv sample-genes.faa.tmp sample-genes_GLlbsMetag.faa
 
 bit-remove-wraps sample-genes.fasta > sample-genes.fasta.tmp 2> /dev/null
-mv sample-genes.fasta.tmp sample-genes.fasta
+mv sample-genes.fasta.tmp sample-genes_GLlbsMetag.fasta
 ```
 
 **Input Data:**
 
-- sample-genes.faa (gene-calls amino-acid fasta file, output from [Step 14a](#14a-gene-prediction))
-- sample-genes.fasta (gene-calls nucleotide fasta file, output from [Step 14a](#14a-gene-prediction))
+- sample-genes.faa (gene-calls amino-acid fasta file, output from [Step 12a](#12a-generate-gene-predictions))
+- sample-genes.fasta (gene-calls nucleotide fasta file, output from [Step 12a](#12a-generate-gene-predictions))
 
 **Output Data:**
 
-- **sample-genes.faa** (gene-calls amino-acid fasta file with line wraps removed)
-- **sample-genes.fasta** (gene-calls nucleotide fasta file with line wraps removed)
+- **sample-genes_GLlbsMetag.faa** (gene-calls amino-acid fasta file with line wraps removed)
+- **sample-genes_GLlbsMetag.fasta** (gene-calls nucleotide fasta file with line wraps removed)
 
 <br>
 
 ---
 
-### 15. Functional Annotation
+### 13. Functional Annotation
 
 > **Note:**  
 > The annotation process overwrites the same temporary directory by default. When running multiple 
@@ -2597,7 +2750,7 @@ processses at a time, it is necessary to specify a specific temporary directory
 `--tmp-dir` argument as shown below.
 
 
-#### 15a. Download Reference Database of HMM Models
+#### 13a. Download Reference Database of HMM Models
 
 > **Note:** This step only needs to be done once.
 
@@ -2608,7 +2761,7 @@ tar -xzvf profiles.tar.gz
 gunzip ko_list.gz 
 ```
 
-#### 15b. Run KEGG Annotation
+#### 13b. Run KEGG Annotation
 
 ```bash
 exec_annotation -p profiles/ \
@@ -2635,16 +2788,16 @@ exec_annotation -p profiles/ \
 
 **Input Data:**
 
-- sample-genes.faa (amino-acid fasta file, output from [Step 14b](#14b-remove-line-wraps-in-gene-prediction-output))
-- profiles/ (reference directory holding the KO HMMs, downloaded in [Step 15a](15a-download-reference-database-of-hmm-models))
-- ko_list (reference list of KOs to scan for, downloaded in [Step 15a](15a-download-reference-database-of-hmm-models))
+- sample-genes.faa (amino-acid fasta file, output from [Step 12b](#12b-remove-line-wraps-in-gene-prediction-output))
+- profiles/ (reference directory holding the KO HMMs, downloaded in [Step 13a](#13a-download-reference-database-of-hmm-models))
+- ko_list (reference list of KOs to scan for, downloaded in [Step 13a](#13a-download-reference-database-of-hmm-models))
 
 **Output Data:**
 
 - sample-KO-tab.tmp (table of KO annotations assigned to gene IDs)
 
 
-#### 15c. Filter KO Outputs
+#### 13c. Filter KO Outputs
 *Filter KO outputs to retain only those passing the KO-specific score and top hits.*
 
 ```bash
@@ -2662,7 +2815,7 @@ rm -rf sample-tmp-KO/ sample-KO-annots.tmp
 
 **Input Data:**
 
-- sample-KO-tab.tmp (table of KO annotations assigned to gene IDs, output from [Step 15b](#15b-run-kegg-annotation))
+- sample-KO-tab.tmp (table of KO annotations assigned to gene IDs, output from [Step 13b](#13b-run-kegg-annotation))
 
 **Output Data:**
 
@@ -2672,9 +2825,9 @@ rm -rf sample-tmp-KO/ sample-KO-annots.tmp
 
 ---
 
-### 16. Taxonomic Classification 
+### 14. Taxonomic Classification 
 
-#### 16a. Pull and Unpack Pre-built Reference DB 
+#### 14a. Pull and Unpack Pre-built Reference DB 
 
 > **Note:** This step only needs to be done once.
 
@@ -2683,7 +2836,7 @@ wget tbb.bio.uu.nl/bastiaan/CAT_prepare/CAT_prepare_20200618.tar.gz
 tar -xvzf CAT_prepare_20200618.tar.gz
 ```
 
-#### 16b. Run Taxonomic Classification
+#### 14b. Run Taxonomic Classification
 
 ```bash
 CAT contigs -c sample-assembly.fasta \
@@ -2713,10 +2866,10 @@ CAT contigs -c sample-assembly.fasta \
 
 **Input Data:**
 
-- CAT_prepare_20200618/2020-06-18_database/ (directory holding the CAT reference sequence database, output from [Step 16a](16a-pull-and-unpack-pre-built-reference-db))
-- CAT_prepare_20200618/2020-06-18_taxonomy/ (directory holding the CAT reference taxonomy database, output from [Step 16a](16a-pull-and-unpack-pre-built-reference-db))
-- sample-assembly.fasta (contig-renamed assembly file from [Step 13a](#13a-rename-contig-headers))
-- sample-genes.faa (amino-acid fasta file, output from [Step 14b](#14b-remove-line-wraps-in-gene-prediction-output))
+- CAT_prepare_20200618/2020-06-18_database/ (directory holding the CAT reference sequence database, output from [Step 14a](#14a-pull-and-unpack-pre-built-reference-db))
+- CAT_prepare_20200618/2020-06-18_taxonomy/ (directory holding the CAT reference taxonomy database, output from [Step 14a](#14a-pull-and-unpack-pre-built-reference-db))
+- sample-assembly.fasta (contig-renamed assembly file from [Step 11a](#11a-rename-contig-headers))
+- sample-genes.faa (amino-acid fasta file, output from [Step 12b](#12b-remove-line-wraps-in-gene-prediction-output))
 
 **Output Data:**
 
@@ -2724,7 +2877,7 @@ CAT contigs -c sample-assembly.fasta \
 - sample-tax-out.tmp.contig2classification.txt (contig taxonomy file)
 
 
-#### 16c. Add Taxonomy Info From Taxids To Genes
+#### 14c. Add Taxonomy Info From Taxids To Genes
 
 ```bash
 CAT add_names -i sample-tax-out.tmp.ORF2LCA.txt \
@@ -2744,15 +2897,15 @@ CAT add_names -i sample-tax-out.tmp.ORF2LCA.txt \
 
 **Input Data:**
 
-- sample-tax-out.tmp.ORF2LCA.txt (gene-calls taxonomy file, output from [Step 16b](#16b-run-taxonomic-classification))
-- CAT_prepare_20200618/2020-06-18_taxonomy/ (directory holding the CAT reference taxonomy database, output from [Step 16a](#16a-pull-and-unpack-pre-built-reference-db))
+- sample-tax-out.tmp.ORF2LCA.txt (gene-calls taxonomy file, output from [Step 14b](#14b-run-taxonomic-classification))
+- CAT_prepare_20200618/2020-06-18_taxonomy/ (directory holding the CAT reference taxonomy database, output from [Step 14a](#14a-pull-and-unpack-pre-built-reference-db))
 
 **Output Data:**
 
 - sample-gene-tax-out.tmp (gene-calls taxonomy file with lineage info added)
 
 
-#### 16d. Add Taxonomy Info From Taxids To Contigs
+#### 14d. Add Taxonomy Info From Taxids To Contigs
 
 ```bash
 CAT add_names -i sample-tax-out.tmp.contig2classification.txt \
@@ -2772,15 +2925,15 @@ CAT add_names -i sample-tax-out.tmp.contig2classification.txt \
 
 **Input Data:**
 
-- sample-tax-out.tmp.contig2classification.txt (contig taxonomy file, output from [Step 16b](#16b-run-taxonomic-classification))
-- CAT_prepare_20200618/2020-06-18_taxonomy/ (directory holding the CAT reference taxonomy database, output from [Step 16a](#16a-pull-and-unpack-pre-built-reference-db))
+- sample-tax-out.tmp.contig2classification.txt (contig taxonomy file, output from [Step 14b](#14b-run-taxonomic-classification))
+- CAT_prepare_20200618/2020-06-18_taxonomy/ (directory holding the CAT reference taxonomy database, output from [Step 14a](#14a-pull-and-unpack-pre-built-reference-db))
 
 **Output Data:**
 
 - sample-contig-tax-out.tmp (contig taxonomy file with lineage info added)
 
 
-#### 16e. Format Gene-level Output With awk and sed
+#### 14e. Format Gene-level Output With awk and sed
 
 ```bash
 awk -F $'\t' ' BEGIN { OFS=FS } { if ( $3 == "lineage" ) { print $1,$3,$5,$6,$7,$8,$9,$10,$11 } \
@@ -2793,14 +2946,14 @@ awk -F $'\t' ' BEGIN { OFS=FS } { if ( $3 == "lineage" ) { print $1,$3,$5,$6,$7,
 
 **Input Data:**
 
-- sample-gene-tax-out.tmp (gene-calls taxonomy file with lineage info added, output from [Step 16c](#16c-add-taxonomy-info-from-taxids-to-genes))
+- sample-gene-tax-out.tmp (gene-calls taxonomy file with lineage info added, output from [Step 14c](#14c-add-taxonomy-info-from-taxids-to-genes))
 
 **Output Data:**
 
 - sample-gene-tax-out.tsv (reformatted gene-calls taxonomy file with lineage info)
 
 
-#### 16f. Format Contig-level Output With awk and sed
+#### 14f. Format Contig-level Output With awk and sed
 
 ```bash
 awk -F $'\t' ' BEGIN { OFS=FS } { if ( $2 == "classification" ) { print $1,$4,$6,$7,$8,$9,$10,$11,$12 } \
@@ -2815,7 +2968,7 @@ rm sample*.tmp*
 
 **Input Data:**
 
-- sample-contig-tax-out.tmp (contig taxonomy file with lineage info added, output from [Step 16d](#16d-add-taxonomy-info-from-taxids-to-contigs))
+- sample-contig-tax-out.tmp (contig taxonomy file with lineage info added, output from [Step 14d](#14d-add-taxonomy-info-from-taxids-to-contigs))
 
 **Output Data:**
 
@@ -2825,41 +2978,62 @@ rm sample*.tmp*
 
 ---
 
-### 17. Read-Mapping
+### 15. Read-Mapping
+
+#### 15a. Build reference index
+```bash
+bowtie2-build sample1-assembly_GLlbsMetag.fasta sample1-index
+```
+
+**Parameter Definitions:**  
+
+- `sample1-assembly_GLlbsMetag.fasta` - first positional argument specifies the input assembly
+- `sample1-index` - second positional argument specifies the prefix of the output index files
+
+**Input Data:**
+
+- `sample1-assembly_GLlbsMetag.fasta` (contig-renamed assembly file, output from [Step 11a](#11a-rename-contig-headers))
+
+**Output Data:**
 
-#### 17a. Align Reads to Sample Assembly
+- `sample1-index*` - the bowtie2 index files
+
+#### 15b. Align Reads to Sample Assembly
 
 ```bash
-minimap2 -a \
-         -x map-ont \
-         -t NumberOfThreads \
-         sample_assembly.fasta \
-         sample_decontam_GLlbsMetag.fastq.gz \
-         > sample.sam  2> sample-mapping-info.txt
+bowtie2 --mm --quiet --threads NumberOfThreads \
+        -x sample1-index \
+        -1 sample1_GLlbsMetag_R1_decontam.fastq.gz \
+        -2 sample1_GLlbsMetag_R2_decontam.fastq.gz \
+        --no-unal > sample1.sam 2> sample1-mapping-info_GLlbsMetag.txt
 ```
 
 **Parameter Definitions:**
+- `--mm` - Use memory-mapped I/O to load the index.
+- `--quiet` - Print only error messages.
+- `--threads` - Number of parallel processing threads.
+- `-x` - Specifies the prefix of the reference index files to map to, generated by bowtie2-build.
+- `-1` - Specifies the forward reads to map.
+- `-2` - Specifies the reverse reads to map.
+- `--no-unal` - Suppress SAM records for reads that did not align.
+- `> sample1.sam` - Redirects the alignments written to standard output into a SAM file.
+- `2> sample1-mapping-info_GLlbsMetag.txt` - Captures the printed summary results in a log file.
 
-- `-a` – Output in SAM format.
-- `-x map-ont` - Specifies preset for mapping Nanopore reads to a reference.
-- `-t` - Number of parallel processing threads to use
-- `sample_assembly.fasta` – Assembly fasta file, provided as a positional argument.
-- `sample_decontam_GLlbsMetag.fastq.gz` - Input sequence data file, provided as a positional argument.
-- `> sample.sam` - Redirects the output to a separate file.
-- `2> sample-mapping-info.txt` - Redirects the standar error to a separate file.
 
 **Input Data**
 
-- sample-assembly.fasta (contig-renamed assembly file, output from [Step 13a](#13a-rename-contig-headers))
-- sample_decontam_GLlbsMetag.fastq.gz (filtered and trimmed sample reads with both contaminants and human reads removed, gzipped fasta file, output from [Step 7e](#7e-generate-decontaminated-read-files))
+- `sample1-index*` (bowtie2 index files, output from [Step 15a](#15a-build-reference-index))
+- `*_R[12]_decontam.fastq.gz` or `*_R[12]_HostRm.fastq.gz` (filtered and trimmed sample reads with both 
+    contaminants and human reads (and, optionally, host reads) removed, gzipped fastq files, 
+    output from [Step 4b](#4b-build-contaminant-index-and-map-reads) or [Step 5b](#5b-remove-host-reads))
 
 **Output Data**
 
 - sample.sam (reads aligned to sample assembly in SAM format)
-- **sample-mapping-info.txt** (read mapping information)
+- **sample-mapping-info_GLlbsMetag.txt** (read mapping information)
 
 
-#### 17b. Sort and Index Assembly Alignments
+#### 15c. Sort and Index Assembly Alignments
 
 ```bash
 # Sort Sam, convert to bam and create index
@@ -2884,24 +3058,24 @@ samtools index sample_sorted.bam sample_sorted.bam.bai
 
 **Input Data:**
 
-- sample.sam (reads aligned to sample assembly, output from [Step 17a](#17a-align-reads-to-sample-assembly))
+- sample.sam (reads aligned to sample assembly, output from [Step 15b](#15b-align-reads-to-sample-assembly))
 
 **Output Data:**
 
-- **sample_sorted.bam** (sorted mapping to sample assembly, in BAM format)
-- **sample_sorted.bam.bai** (index of sorted mapping to sample assembly)
+- **sample_sorted_GLlbsMetag.bam** (sorted mapping to sample assembly, in BAM format)
+- **sample_sorted_GLlbsMetag.bam.bai** (index of sorted mapping to sample assembly)
 
 <br>
 
 ---
 
-### 18. Get Coverage Information and Filter Based On Detection
+### 16. Get Coverage Information and Filter Based On Detection
 > **Note:**  
 > “Detection” is a measure of what proportion of a reference sequence recruited reads 
 (see the discussion of detection [here](http://merenlab.org/2017/05/08/anvio-views/#detection)). 
 Filtering based on detection is one way of helping to mitigate non-specific read-recruitment.
 
-#### 18a. Filter Coverage Levels Based On Detection
+#### 16a. Filter Coverage Levels Based On Detection
 
 ```bash
 # pileup.sh comes from the bbduk.sh package
@@ -2920,8 +3094,8 @@ pileup.sh -in sample.bam \
 
 **Input Data:**
 
-- sample.bam (sorted mapping to sample assembly BAM file, output from [Step 17b](#17b-sort-and-index-assembly-alignments))
-- sample-genes.fasta (gene-calls nucleotide fasta file, output from [Step 14a](#14a-gene-prediction))
+- sample.bam (sorted mapping to sample assembly BAM file, output from [Step 15c](#15c-sort-and-index-assembly-alignments))
+- sample-genes.fasta (gene-calls nucleotide fasta file, output from [Step 12a](#12a-generate-gene-predictions))
 
 
 **Output Data:**
@@ -2930,7 +3104,7 @@ pileup.sh -in sample.bam \
 - sample-contig-cov-and-det.tmp (contig-coverage tsv file)
 
 
-#### 18b. Filter Gene and Contig Coverage Based On Detection
+#### 16b. Filter Gene and Contig Coverage Based On Detection
 
 > *The following commands filter gene and contig coverage tsv files to only keep genes and contigs with at least 50% detection (as defined above) then parse the tables to retain only gene IDs and respective coverage.*
 
@@ -2940,14 +3114,14 @@ grep -v "#" sample-gene-cov-and-det.tmp | \
 awk -F $'\t' ' BEGIN { OFS=FS } { if ( $10 <= 0.5 ) $4 = 0 } \
      { print $1,$4 } ' > sample-gene-cov.tmp
 
-cat <( printf "gene_ID\tcoverage\n" ) sample-gene-cov.tmp > sample-gene-coverages.tsv
+cat <( printf "gene_ID\tcoverage\n" ) sample-gene-cov.tmp > sample-gene-coverages_GLlbsMetag.tsv
 
 # Filtering contig coverage
 grep -v "#" sample-contig-cov-and-det.tmp | \
 awk -F $'\t' ' BEGIN { OFS=FS } { if ( $5 <= 50 ) $2 = 0 } \
      { print $1,$2 } ' > sample-contig-cov.tmp
 
-cat <( printf "contig_ID\tcoverage\n" ) sample-contig-cov.tmp > sample-contig-coverages.tsv
+cat <( printf "contig_ID\tcoverage\n" ) sample-contig-cov.tmp > sample-contig-coverages_GLlbsMetag.tsv
 
 # removing intermediate files
 rm sample-*.tmp
@@ -2955,19 +3129,19 @@ rm sample-*.tmp
 
 **Input Data:**
 
-- sample-gene-cov-and-det.tmp (temporary gene-coverage tsv file, output from [Step 18a](#18a-filter-coverage-levels-based-on-detection))
-- sample-contig-cov-and-det.tmp (temporary contig-coverage tsv file, output from [Step 18a](#18a-filter-coverage-levels-based-on-detection))
+- sample-gene-cov-and-det.tmp (temporary gene-coverage tsv file, output from [Step 16a](#16a-filter-coverage-levels-based-on-detection))
+- sample-contig-cov-and-det.tmp (temporary contig-coverage tsv file, output from [Step 16a](#16a-filter-coverage-levels-based-on-detection))
 
 **Output Data:**
 
-- sample-gene-coverages.tsv (table with gene-level coverages)
-- sample-contig-coverages.tsv (table with contig-level coverages)
+- sample-gene-coverages_GLlbsMetag.tsv (table with gene-level coverages)
+- sample-contig-coverages_GLlbsMetag.tsv (table with contig-level coverages)
 
 <br>
 
 ---
 
-### 19. Combine Gene-level Coverage, Taxonomy, and Functional Annotations For Each Sample
+### 17. Combine Gene-level Coverage, Taxonomy, and Functional Annotations For Each Sample
 > **Note:**  
 > Just uses `paste`, `sed`, and `awk` standard Unix commands to combine gene-level coverage, taxonomy, and functional annotations into one table for each sample.  
 
@@ -2982,7 +3156,7 @@ paste <( head -n 1 sample-gene-coverages.tsv ) \
       <( head -n 1 sample-gene-tax-out.tsv | cut -f 2- ) \
       > sample-header.tmp
 
-cat sample-header.tmp sample-gene-tab.tmp > sample-gene-coverage-annotation-and-tax.tsv
+cat sample-header.tmp sample-gene-tab.tmp > sample-gene-coverage-annotation-and-tax_GLlbsMetag.tsv
 
 # removing intermediate files
 rm sample*tmp sample-gene-coverages.tsv sample-annotations.tsv sample-gene-tax-out.tsv
@@ -2990,20 +3164,20 @@ rm sample*tmp sample-gene-coverages.tsv sample-annotations.tsv sample-gene-tax-o
 
 **Input Data:**
 
-- sample-gene-coverages.tsv (table with gene-level coverages, output from [Step 18b](#18b-filter-gene-and-contig-coverage-based-on-detection))
-- sample-annotations.tsv (table of KO annotations assigned to gene IDs, output from [Step 15c](#15c-filter-ko-outputs))
-- sample-gene-tax-out.tsv (reformatted gene-calls taxonomy file with lineage info, output from [Step 16e](#16e-format-gene-level-output-with-awk-and-sed))
+- sample-gene-coverages_GLlbsMetag.tsv (table with gene-level coverages, output from [Step 16b](#16b-filter-gene-and-contig-coverage-based-on-detection))
+- sample-annotations.tsv (table of KO annotations assigned to gene IDs, output from [Step 13c](#13c-filter-ko-outputs))
+- sample-gene-tax-out.tsv (reformatted gene-calls taxonomy file with lineage info, output from [Step 14e](#14e-format-gene-level-output-with-awk-and-sed))
 
 
 **Output Data:**
 
-- **sample-gene-coverage-annotation-and-tax.tsv** (table with combined gene coverage, annotation, and taxonomy info)
+- **sample-gene-coverage-annotation-and-tax_GLlbsMetag.tsv** (table with combined gene coverage, annotation, and taxonomy info)
 
 <br>
 
 ---
 
-### 20. Combine Contig-level Coverage and Taxonomy For Each Sample
+### 18. Combine Contig-level Coverage and Taxonomy For Each Sample
 > **Note:**  
 > Just uses `paste`, `sed`, and `awk` standard Unix commands to combine contig-level coverage and taxonomy into one table for each sample.
 
@@ -3016,7 +3190,7 @@ paste <( head -n 1 sample-contig-coverages.tsv ) \
       <( head -n 1 sample-contig-tax-out.tsv | cut -f 2- ) \
       > sample-contig-header.tmp
       
-cat sample-contig-header.tmp sample-contig.tmp > sample-contig-coverage-and-tax.tsv
+cat sample-contig-header.tmp sample-contig.tmp > sample-contig-coverage-and-tax_GLlbsMetag.tsv
 
 # removing intermediate files
 rm sample*tmp sample-contig-coverages.tsv sample-contig-tax-out.tsv
@@ -3024,19 +3198,19 @@ rm sample*tmp sample-contig-coverages.tsv sample-contig-tax-out.tsv
 
 **Input Data:**
 
-- sample-contig-coverages.tsv (table with contig-level coverages, output from [Step 18b](#18b-filter-gene-and-contig-coverage-based-on-detection))
-- sample-contig-tax-out.tsv (reformatted contig taxonomy file with lineage info, output from [Step 16f](#16f-format-contig-level-output-with-awk-and-sed))
+- sample-contig-coverages.tsv (table with contig-level coverages, output from [Step 16b](#16b-filter-gene-and-contig-coverage-based-on-detection))
+- sample-contig-tax-out.tsv (reformatted contig taxonomy file with lineage info, output from [Step 14f](#14f-format-contig-level-output-with-awk-and-sed))
 
 
 **Output Data:**
 
-- **sample-contig-coverage-and-tax.tsv** (table with combined contig coverage and taxonomy info)
+- **sample-contig-coverage-and-tax_GLlbsMetag.tsv** (table with combined contig coverage and taxonomy info)
 
 <br>
 
 ---
 
-### 21. Generate Normalized, Gene- and Contig-level Coverage Summary Tables of KO-annotations and Taxonomy Across Samples
+### 19. Generate Normalized, Gene- and Contig-level Coverage Summary Tables of KO-annotations and Taxonomy Across Samples
 
 > **Note:**  
 > * To combine across samples to generate these summary tables, we need the same "units". This is done for annotations 
@@ -3048,23 +3222,29 @@ by the length of the gene). These have been normalized by making the total cover
 each individual gene-level coverage its proportion of that 1,000,000 total. So basically percent, but out of 1,000,000 
 instead of 100 to make the numbers more friendly. 
 
-#### 21a. Generate Gene-level Coverage Summary Tables
+#### 19a. Generate Gene-level Coverage Summary Tables
 
 ```bash
-bit-GL-combine-KO-and-tax-tables *-gene-coverage-annotation-and-tax.tsv \
+bit-GL-combine-KO-and-tax-tables *-gene-coverage-annotation-and-tax_GLlbsMetag.tsv \
                                  -o Combined
+
+# add assay specific suffix
+mv Combined-gene-level-KO-function-coverages-CPM.tsv Combined-gene-level-KO-function-coverages-CPM_GLlbsMetag.tsv
+mv Combined-gene-level-taxonomy-coverages-CPM.tsv Combined-gene-level-taxonomy-coverages-CPM_GLlbsMetag.tsv
+mv Combined-gene-level-KO-function-coverages.tsv Combined-gene-level-KO-function-coverages_GLlbsMetag.tsv
+mv Combined-gene-level-taxonomy-coverages.tsv Combined-gene-level-taxonomy-coverages_GLlbsMetag.tsv
 ```
 
 **Parameter Definitions:**  
 
-- `*-gene-coverage-annotation-and-tax.tsv` - Positional arguments specifying the input tsv files, can be provided as a space-delimited list of files, or with wildcards like above.
+- `*-gene-coverage-annotation-and-tax_GLlbsMetag.tsv` - Positional arguments specifying the input tsv files, can be provided as a space-delimited list of files, or with wildcards like above.
 
 - `-o` – Specifies the output file prefix.
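+
+The assay-specific suffix could also be added with a short loop rather than one `mv` per table. The sketch below is only an illustration: the `touch` line creates empty stand-in files so it runs on its own, whereas in a real run those tables are written by `bit-GL-combine-KO-and-tax-tables`.
+
+```bash
+# empty stand-in files (assumption for the example; real runs already have these tables)
+touch Combined-gene-level-KO-function-coverages.tsv \
+      Combined-gene-level-KO-function-coverages-CPM.tsv \
+      Combined-gene-level-taxonomy-coverages.tsv \
+      Combined-gene-level-taxonomy-coverages-CPM.tsv
+
+# insert the suffix before the .tsv extension of each combined table
+for table in Combined-gene-level-*.tsv; do
+    mv "${table}" "${table%.tsv}_GLlbsMetag.tsv"
+done
+```
+
+The loop should be run once, in a directory that does not already contain suffixed tables, because the wildcard would match those as well.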
 
 
 **Input Data:**
 
-- *-gene-coverage-annotation-and-tax.tsv (tables with combined gene coverage, annotation, and taxonomy info generated for individual samples, output from [Step 19](#19-combine-gene-level-coverage-taxonomy-and-functional-annotations-for-each-sample))
+- *-gene-coverage-annotation-and-tax_GLlbsMetag.tsv (tables with combined gene coverage, annotation, and taxonomy info generated for individual samples, output from [Step 17](#17-combine-gene-level-coverage-taxonomy-and-functional-annotations-for-each-sample))
 
 **Output Data:**
 
@@ -3074,339 +3254,7 @@ bit-GL-combine-KO-and-tax-tables *-gene-coverage-annotation-and-tax.tsv \
 - **Combined-gene-level-taxonomy-coverages_GLlbsMetag.tsv** (table with all samples combined based on gene-level taxonomic classifications)
 
 
-#### 21b. Gene-level taxonomy heatmaps --- START NEEDS REVIEW ---
-
-```R
-library(tidyverse)
-library(pheatmap)
-
-# Abundant taxa with CPM > 1000
-abundance_threshold <- 1000
-
-sample_order <- get_sample_names("assembly-summaries_GLlbsMetag.tsv")
-# Read-in gene table
-gene_taxonomy_table <-  read_contig_table("Combined-gene-level-taxonomy-coverages-CPM_GLlbsMetag.tsv", sample_order)
-
-# Summarize gene table
-species_gene_table <- gene_taxonomy_table %>%
-  select(species, !!sample_order) %>% 
-  group_by(species) %>% 
-  summarise(across(everything(), sum)) 
-
-# Convert gene dataframe table to a matrix table
-gene.m <- species_gene_table %>% as.data.frame()
-# Write out gene taxonomy table
-write_csv(x = gene.m, file = "gene_taxonomy_table.csv")
-
-rownames(gene.m) <- gene.m[['species']]
-gene.m <- gene.m[,-match("species", colnames(gene.m))] %>% as.matrix()
-
-
-#------ All gene taxonomy assignments
-
-# Drop unclassified assignments
-mat2plot <- gene.m[-match("Unclassified;_;_;_;_;_;_", rownames(gene.m)),]
-
-png(filename = "All-genes-taxonomy-heatmap_GLlbsMetag.png", 
-    width = plot_width, height = plot_height, units = "in", res=300)
-pheatmap(mat = mat2plot,
-         cluster_cols = FALSE, 
-         cluster_rows = FALSE, 
-         col = colours, 
-         angle_col = 0, 
-         display_numbers = TRUE,
-         fontsize = 12,
-         number_format = "%.0f")
-dev.off()
-
-
-#------ Abundant gene taxonomy assignments
-
-taxa <- rowSums(gene.m) %>% sort()
-abund_taxa <- taxa[ taxa > abundance_threshold ] %>% names
-abund_gene.m <- gene.m[abund_taxa,]
-
-
-# Drop unclassified assignments
-mat2plot <- abund_gene.m[-match("Unclassified;_;_;_;_;_;_", rownames(abund_gene.m)),]
-
-png(filename = "Abundant-genes-taxonomy-heatmap_GLlbsMetag.png", 
-    width = plot_width, height = plot_height, units = "in", res=300)
-pheatmap(mat = mat2plot,
-         cluster_cols = FALSE, 
-         cluster_rows = FALSE, 
-         col = colours, 
-         angle_col = 0, 
-         display_numbers = TRUE,
-         fontsize = 12,
-         number_format = "%.0f")
-dev.off()
-```
-
-**Input data:**
-- assembly-summaries_GLlbsMetag.tsv (table of assembly summary statistics, output from [Step 13b](#13b-summarizing-assemblies))
-- Combined-gene-level-taxonomy-coverages-CPM_GLlbsMetag.tsv (table with all samples combined based on gene-level taxonomic classifications, output from [Step 21a](#21a-generating-gene-level-coverage-summary-tables)) 
-
-**Output data:**
-- gene_taxonomy_table.csv (aggregated gene taxonomy table)
-- **All-genes-taxonomy-heatmap_GLlbsMetag.png** (heatmap of all genes taxonomy assignments)
-- **Abundant-genes-taxonomy-heatmap_GLlbsMetag.png** (heatmap of abundant genes taxonomy assignments)
-
-#### 21c. Gene-level taxonomy decontamination
-
-```R
-library(tidyverse)
-library(decontam)
-library(phyloseq)
-
-# Set to 0.5 for a more aggressive approach where species more prevalent
-# in the negative controls are considered contaminants
-contam_threshold <- 0.1
-# Control samples in this column should always be written as 
-# "Control_Sample" and true samples as "True_Sample"
-prev_col <- "Sample_or_Control"
-freq_col <- "input_conc_ng"
-plot_width <- 18
-plot_height <- 8
-
-# Read-in metadata
-metdata_file <- "/path/to/sample/metadata"
-samples_column <- "Sample_ID"
-metadata <- read_delim(file=metdata_file , delim = "\t") %>% as.data.frame()
-row.names(metadata) <- metadata[,samples_column]
-
-# Read-in featusre table
-gene.m <- read_csv("gene_taxonomy_table.csv")
-rownames(gene.m) <- gene.m[['species']]
-gene.m <- gene.m[,-match("species", colnames(gene.m))] %>% as.matrix()
-feature_table <- gene.m
-
-
-contamdf <- run_decontam(feature_table, metadata, contam_threshold, prev_col, freq_col)
-
-# Write decontam results table to file
-write_csv(x = contamdf %>% rownames_to_column("Species"), file = "decontam-gene-taxonomy_results.csv")
-
-# Get the list of contaminats identified by decontam
-contaminants <- contamdf %>%
-                as.data.frame %>%
-                rownames_to_column("Species") %>%
-                filter(contaminant == TRUE) %>% pull(Species)
-
-# Drop contaminant features identified by decontam
-decontaminated_table <- feature_table %>% 
-                as.data.frame  %>% 
-                rownames_to_column("Species") %>% 
-                filter(str_detect(Species, 
-                                  pattern = str_c(contaminants,
-                                                  collapse = "|"),
-                                  negate = TRUE)) %>%
-                select(-Species) %>% as.matrix
-
-
-# Write decontaminated species table to file
-write_csv(x = decontaminated_table, file = "decontaminated-gene-taxonomy_table.csv")
-
-# Get the index of species (contaminants and unclassified) to drop
-non_microbial <- "Unclassified;_;_;_;_;_;_"
-species_to_drop_index <- grep(x = rownames(feature_table), 
-                              str_c(c(contaminants,non_microbial), 
-                                    collapse = "|"))
-
-mat2plot <- feature_table[-species_to_drop_index,]
-png(filename = "decontaminated-gene-taxonomy-heatmap_GLlbsMetag.png", 
-    width = plot_width, height = plot_height, units = "in", res=300)
-pheatmap(mat = mat2plot,
-         cluster_cols = FALSE, 
-         cluster_rows = FALSE, 
-         col = colours, 
-         angle_col = 0, 
-         display_numbers = TRUE,
-         fontsize = 14, 
-         number_format = "%.0f")
-dev.off()
-
-```
-
-**Input data:**
-
-- metadata_file  (path to sample-wise metadata file)
-- gene_taxonomy_table.csv (aggregated gene taxonomy table, output from [Step 21b](#21b-gene-level-taxonomy-heatmaps))
-
-**Output data:**
-
-- **decontam-gene-taxonomy_results.csv** (decontam's results table)
-- **decontaminated-gene-taxonomy_table.csv** (decontaminated species table)
-- **decontaminated-gene-taxonomy-heatmap_GLlbsMetag.png** (heatmap after filtering out contaminants)
-
-
-
-#### 21d. Gene-level KO functions heatmaps
-
-```R
-library(tidyverse)
-library(pheatmap)
-
-# Abundant functions with CPM > 2000
-abundance_threshold <- 2000
-
-sample_order <- get_sample_names("assembly-summaries_GLlbsMetag.tsv")
-# Read-in KO functions table
-functions_table <- read_input_table("Combined-gene-level-KO-function-coverages-CPM_GLlbsMetag.tsv") %>%
-                    select(KO_ID, KO_function, !!sample_order)
-
-# Subset table and then convert from datafame to matrix
-functions.m <- functions_table[,sample_order] %>% as.matrix()
-rownames(functions.m) <- functions_table$KO_ID
-table2write <-  functions.m %>% 
-                      as.data.frame() %>% rownames_to_column("KO_ID") %>%
-                      filter(KO_ID != "Not annotated") # Drop unannotated / unclassified
-# Write out  taxonomy table
-write_csv(x = table2write  , file = "genes-KO-functions_table.csv")
-
-
-#------ All KO functions assignments
-
-# Drop unclassified assignments
-mat2plot <- functions.m[-match("Not annotated", rownames(functions.m),]
-
-png(filename = "All-genes-KO-functions-heatmap_GLlbsMetag.png", 
-    width = plot_width, height = plot_height, units = "in", res=300)
-pheatmap(mat = mat2plot,
-         cluster_cols = FALSE, 
-         cluster_rows = FALSE, 
-         col = colours, 
-         angle_col = 0, 
-         display_numbers = TRUE,
-         fontsize = 12,
-         number_format = "%.0f")
-dev.off()
-
-
-#------ Abundant KO functions assignments
-
-functions <- rowSums(functions.m) %>% sort()
-abund_functions <- functions[ functions > abundance_threshold ] %>% names
-abund_functions.m <- functions.m[abund_functions,]
-
-
-# Drop unannotated assignments
-mat2plot <- abund_functions.m[-match("Not annotated", rownames(abund_functions.m)),]
-
-png(filename = "Abundant-genes-KO-functions-heatmap_GLlbsMetag.png", 
-    width = plot_width, height = plot_height, units = "in", res=300)
-pheatmap(mat = mat2plot,
-         cluster_cols = FALSE, 
-         cluster_rows = FALSE, 
-         col = colours, 
-         angle_col = 0, 
-         display_numbers = TRUE,
-         fontsize = 12,
-         number_format = "%.0f")
-dev.off()
-```
-
-**Parameter Definitions:**  
-
-
-**Input data:**
-- assembly-summaries_GLlbsMetag.tsv (table of assembly summary statistics, output from [Step 13b](#13b-summarizing-assemblies))
-- Combined-gene-level-KO-function-coverages-CPM_GLlbsMetag.tsv (table with all samples combined based on KO annotations; normalized to coverage per million genes covered, output from [Step 21a](#21a-generating-gene-level-coverage-summary-tables))
-
-**Output data:**
-- genes-KO-functions_table.csv (aggregated and subsetted gene KO functions table)
-- **All-genes-KO-functions-heatmap_GLlbsMetag.png** (heatmap of gene-wise KO function assignments)
-- **Abundant-genes-KO-functions-heatmap_GLlbsMetag.png** (heatmap of gene-wise abundant KO function assignments)
-
-#### 21e. Gene-level KO functions decontamination --- END NEEDS REVIEW ---
-
-```R
-library(tidyverse)
-library(decontam)
-library(phyloseq)
-
-# Set to 0.5 for a more aggressive approach where species more prevalent
-# in negative controls are considered contaminants
-contam_threshold <- 0.1 
-# Control samples in this column should always be written as "Control_Sample" and true samples as "True_Sample"
-prev_col <- "Sample_or_Control"
-freq_col <- "input_conc_ng"
-plot_width <- 18
-plot_height <- 8
-
-# Read-in metadata
-metdata_file <- "/path/to/sample/metadata"
-samples_column <- "Sample_ID"
-metadata <- read_delim(file=metdata_file , delim = "\t") %>% as.data.frame()
-row.names(metadata) <- metadata[,samples_column]
-
-# Read-in feature table
-functions.m <- read_csv("genes-KO-functions_table.csv")
-rownames(functions.m) <- functions.m[['KO_ID']]
-gene.m <- functions.m[,-match("KO_ID", colnames(functions.m))] %>% as.matrix()
-feature_table <- functions.m
-
-
-contamdf <- run_decontam(feature_table, metadata, contam_threshold, prev_col, freq_col)
-
-# Write decontam results table to file
-write_csv(x = contamdf %>% rownames_to_column("KO_ID"), file = "decontam-gene-KO-functions_results.csv")
-
-# Get the list of contaminants identified by decontam
-contaminants <- contamdf %>%
-                as.data.frame %>%
-                rownames_to_column("KO_ID") %>%
-                filter(contaminant == TRUE) %>% pull(KO_ID)
-
-# Drop contaminant features identified by decontam
-decontaminated_table <- feature_table %>% 
-                as.data.frame  %>% 
-                rownames_to_column("KO_ID") %>% 
-                filter(str_detect(Species, 
-                                  pattern = str_c(contaminants,
-                                                  collapse = "|"),
-                                  negate = TRUE)) %>%
-                select(-KO_ID) %>% as.matrix
-
-
-# Write decontaminated species table to file
-write_csv(x = decontaminated_table, file = "decontaminated-gene-KO-functions_table.csv")
-
-# Get the index of species (contaminants and unclassified) to drop
-unclassified <- "Not annotated"
-functions_to_drop_index <- grep(x = rownames(feature_table), 
-                              str_c(c(contaminants,unclassified), 
-                                    collapse = "|"))
-
-mat2plot <- feature_table[-functions_to_drop_index,]
-png(filename = "decontaminated-gene-KO-functions-heatmap_GLlbsMetag.png", 
-    width = plot_width, height = plot_height, units = "in", res=300)
-pheatmap(mat = mat2plot,
-         cluster_cols = FALSE, 
-         cluster_rows = FALSE, 
-         col = colours, 
-         angle_col = 0, 
-         display_numbers = TRUE,
-         fontsize = 14, 
-         number_format = "%.0f")
-dev.off()
-
-```
-
-**Input data:**
-
-- metadata_file  (path to sample-wise metadata file)
-- gene_taxonomy_table.csv (agggregated gene taxomy table, output from [Step 21b](#21b-gene-level-taxonomy-heatmaps))
-
-**Output data:**
-
-- **decontam-gene-KO-functions_results.csv** (decontam's results table)
-- **decontaminated-gene-KO-functions_table.csv** (decontaminated functions table)
-- **decontaminated-gene-KO-functions-heatmap_GLlbsMetag.png** (heatmap after filtering out contaminants)
-
-
-
-#### 21f. Generate Contig-level Coverage Summary Tables
+#### 19b. Generate Contig-level Coverage Summary Tables
 
 ```bash
 bit-GL-combine-contig-tax-tables *-contig-coverage-and-tax.tsv -o Combined
@@ -3420,186 +3268,19 @@ bit-GL-combine-contig-tax-tables *-contig-coverage-and-tax.tsv -o Combined
 
 **Input Data:**
 
-- *-contig-coverage-and-tax.tsv (tables with combined contig coverage and taxonomy info generated for individual samples, output from [Step 20](#20-combine-contig-level-coverage-and-taxonomy-for-each-sample))
-
+- *-contig-coverage-and-tax.tsv (tables with combined contig coverage and taxonomy info generated for individual samples, output from [Step 18](#18-combine-contig-level-coverage-and-taxonomy-for-each-sample))
+
 **Output Data:**
 
-- **Combined-contig-level-taxonomy-coverages-CPM_GLlbsMetag.tsv** (table with all samples combined based on contig-level taxonomic classifications; normalized to coverage per million contigs covered)
-- **Combined-contig-level-taxonomy-coverages_GLlbsMetag.tsv** (table with all samples combined based on contig-level taxonomic classifications)
+- **Combined-contig-level-taxonomy-coverages-CPM_GLlblMetag.tsv** (table with all samples combined based on contig-level taxonomic classifications; normalized to coverage per million contigs covered)
+- **Combined-contig-level-taxonomy-coverages_GLlblMetag.tsv** (table with all samples combined based on contig-level taxonomic classifications)
 
 <br>
 
-
-#### 21g. Contig-level Heatmaps --- START NEEDS REVIEW ---
-
-```R
-plot_width <- 20
-plot_height <- 30
-sample_order <- get_sample_names("assembly-summaries_GLlbsMetag.tsv")
-
-contig_table <-  read_contig_table("Combined-contig-level-taxonomy-coverages-CPM_GLlbsMetag.tsv", sample_order)
-species_contig_table <- contig_table %>% select(species, !!sample_order)
-
-contig.m <- species_contig_table %>%
-  group_by(species) %>%
-  summarise(across(everything(), sum)) %>%
-  filter(species != "Unclassified;_;_;_;_;_;_") %>% # Drop unclassifed
-  as.data.frame()
-
-# Write out contig taxonomy table
-write_csv(x = contig.m, file = "contig_taxonomy_table.csv")
-
-rownames(contig.m) <- contig.m[['species']]
-contig.m <- contig.m[,-match("species", colnames(contig.m))] %>% as.matrix()
-
-#------ All contig taxonomy assignments
-
-# Drop unclassified assignments
-mat2plot <- contig.m[-match("Unclassified;_;_;_;_;_;_", rownames(contig.m)),]
-
-png(filename = "All-contig-taxonomy-heatmap_GLlbsMetag.png", 
-    width = plot_width, height = plot_height, units = "in", res=300)
-pheatmap(mat = mat2plot,
-         cluster_cols = FALSE, 
-         cluster_rows = FALSE, 
-         col = colours, 
-         angle_col = 0, 
-         display_numbers = TRUE,
-         fontsize = 12,
-         number_format = "%.0f")
-dev.off()
-
-
-#------ Abundant contig taxonomy assignments
-
-taxa <- rowSums(contig.m) %>% sort()
-abund_taxa <- taxa[ taxa > abundance_threshold ] %>% names
-abund_contig.m <- contig.m[abund_taxa,]
-
-mat2plot <- abund_contig.m[-match("Unclassified;_;_;_;_;_;_", rownames(abund_contig.m)),]
-
-png(filename = "Abundant-contig-taxonomy-heatmap_GLlbsMetag.png", 
-    width = plot_width, height = plot_height, units = "in", res=300)
-pheatmap(mat = mat2plot,
-         cluster_cols = FALSE, 
-         cluster_rows = FALSE, 
-         col = colours, 
-         angle_col = 0, 
-         display_numbers = TRUE,
-         fontsize = 12,
-         number_format = "%.0f")
-dev.off()
-```
-
-
-**Parameter Definitions:**  
-
-
-**Input data:**
-
-- assembly-summaries_GLlbsMetag.tsv (table of assembly summary statistics, output from [Step 13b](#13b-summarizing-assemblies))
-- Combined-contig-level-taxonomy-coverages-CPM_GLlbsMetag.tsv (table with all samples combined based on contig-level taxonomic classifications; normalized to coverage per million genes covered from [Step 21f](#21f-generating-contig-level-coverage-summary-tables))
-
-**Output data:**
-
-- contig_taxonomy_table.csv (aggregated contig taxonomy)
-- **All-contig-taxonomy-heatmap_GLlbsMetag.png** (All contig level taxonomy heatmap)
-- **Abundant-contig-taxonomy-heatmap_GLlbsMetag.png** (Abundant contig level taxonomy heatmap)
-
-
-#### 21h. Contig-level decontamination --- END NEEDS REVIEW ---
-
-```R
-library(tidyverse)
-library(decontam)
-library(phyloseq)
-
-# Set to 0.5 for a more aggressive approach where species more prevalent
-# in the negative controls are considered contaminants
-contam_threshold <- 0.1
-# Control samples in this column should always be written as
-# "Control_Sample" and true samples as "True_Sample"
-prev_col <- "Sample_or_Control"
-freq_col <- "input_conc_ng"
-plot_width <- 18
-plot_height <- 8
-
-# Read-in metadata
-metdata_file <- "/path/to/sample/metadata"
-samples_column <- "Sample_ID"
-metadata <- read_delim(file=metdata_file , delim = "\t") %>% as.data.frame()
-row.names(metadata) <- metadata[,samples_column]
-
-# Read-in feature table
-contig.m <- read_csv("contig_taxonomy_table.csv")
-rownames(contig.m) <- contig.m[['species']]
-contig.m <- contig.m[,-match("species", colnames(contig.m))] %>% as.matrix()
-feature_table <- contig.m
-
-
-contamdf <- run_decontam(feature_table, metadata, contam_threshold, prev_col, freq_col)
-
-# Write decontam results table to file
-write_csv(x = contamdf %>% rownames_to_column("Species"), file = "decontam-gene-taxonomy_results.csv")
-
-# Get a list of contaminants identified by decontam
-contaminants <- contamdf %>%
-                as.data.frame %>%
-                rownames_to_column("Species") %>%
-                filter(contaminant == TRUE) %>% pull(Species)
-
-# Drop contaminant features identified by decontam
-decontaminated_table <- feature_table %>% 
-                as.data.frame  %>% 
-                rownames_to_column("Species") %>% 
-                filter(str_detect(Species, 
-                                  pattern = str_c(contaminants,
-                                                  collapse = "|"),
-                                  negate = TRUE)) %>%
-                select(-Species) %>% as.matrix
-
-
-# Write decontaminated species table to file
-write_csv(x = decontaminated_table, file = "decontaminated-gene-taxonomy_table.csv")
-
-# Get the index of species (contaminants and unclassified) to drop
-non_microbial <- "Unclassified;_;_;_;_;_;_"
-species_to_drop_index <- grep(x = rownames(feature_table), 
-                              str_c(c(contaminants,non_microbial), 
-                                    collapse = "|"))
-
-mat2plot <- feature_table[-species_to_drop_index,]
-png(filename = "decontaminated-contig-taxonomy-heatmap_GLlbsMetag.png", 
-    width = plot_width, height = plot_height, units = "in", res=300)
-pheatmap(mat = mat2plot,
-         cluster_cols = FALSE, 
-         cluster_rows = FALSE, 
-         col = colours, 
-         angle_col = 0, 
-         display_numbers = TRUE,
-         fontsize = 14, 
-         number_format = "%.0f")
-dev.off()
-
-```
-
-**Input data:**
-
-- metadata_file  (path to sample-wise metadata file)
-- contig_taxonomy_table.csv (aggregated contig taxonomy table, output from [Step 21g](#21g-contig-level-heatmaps))
-
-**Output data:**
-
-- **decontam-contig-taxonomy_results.csv** (decontam's results table)
-- **decontaminated-contig-taxonomy_table.csv** (decontaminated species table)
-- **decontaminated-contig-taxonomy-heatmap_GLlbsMetag.png** (heatmap after filtering out contaminants)
-
-
 ---
 
-### 22. **M**etagenome-**A**ssembled **G**enome (MAG) Recovery
+### 20. **M**etagenome-**A**ssembled **G**enome (MAG) Recovery
 
-#### 22a. Bin Contigs
+#### 20a. Bin Contigs
 
 ```bash
 jgi_summarize_bam_contig_depths --outputDepth sample-metabat-assembly-depth.tsv \
@@ -3640,8 +3321,8 @@ zip -r sample-bins.zip sample-bins
 
 **Input Data:**
 
-- sample-assembly.fasta (contig-renamed assembly file from [Step 13a](#13a-renaming-contig-headers))
-- sample.bam (sorted mapping to sample assembly BAM file, output from [Step 17b](#17b-sort-and-index-assembly-alignments))
+- sample-assembly.fasta (contig-renamed assembly file from [Step 11a](#11a-renaming-contig-headers))
+- sample.bam (sorted mapping to sample assembly BAM file, output from [Step 15c](#15c-sort-and-index-assembly-alignments))
 
 **Output Data:**
 
@@ -3649,11 +3330,11 @@ zip -r sample-bins.zip sample-bins
 - sample-bins/sample-bin\*.fasta (fasta files of recovered bins)
 - **sample-bins.zip** (zip file containing fasta files of recovered bins)
 
-#### 22b. Bin quality assessment 
+#### 20b. Bin quality assessment 
 > Utilizes the default `checkm` database available [here](https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz), `checkm_data_2015_01_16.tar.gz`.
 
 ```bash
-checkm lineage_wf -f bins-overview_GLlbsMetag.tsv \
+checkm lineage_wf -f bins-overview_GLlblMetag.tsv \
                   --tab_table \
                   -x fasta \
                   ./ \
@@ -3671,22 +3352,22 @@ checkm lineage_wf -f bins-overview_GLlbsMetag.tsv \
 
 **Input Data:**
 
-- sample-bins/sample-bin\*.fasta (fasta files of recovered bins, output from [Step 22a](#22a-bin-contigs))
+- sample-bins/sample-bin\*.fasta (fasta files of recovered bins, output from [Step 20a](#20a-bin-contigs))
 
 **Output Data:**
 
-- **bins-overview_GLlbsMetag.tsv** (tab-delimited file with quality estimates per bin)
+- **bins-overview_GLlblMetag.tsv** (tab-delimited file with quality estimates per bin)
 - checkm-output-dir/ (directory holding detailed checkm outputs)
 
-#### 22c. Filter MAGs
+#### 20c. Filter MAGs
 
 ```bash
-cat <( head -n 1 bins-overview_GLlbsMetag.tsv ) \
-    <( awk -F $'\t' ' $12 >= 90 && $13 <= 10 && $14 == 0 ' bins-overview_GLlbsMetag.tsv | sed 's/bin./MAG-/' ) \
+cat <( head -n 1 bins-overview_GLlblMetag.tsv ) \
+    <( awk -F $'\t' ' $12 >= 90 && $13 <= 10 && $14 == 0 ' bins-overview_GLlblMetag.tsv | sed 's/bin./MAG-/' ) \
     > checkm-MAGs-overview.tsv
     
 # copying bins into a MAGs directory in order to run tax classification
-awk -F $'\t' ' $12 >= 90 && $13 <= 10 && $14 == 0 ' bins-overview_GLlbsMetag.tsv | cut -f 1 > MAG-bin-IDs.tmp
+awk -F $'\t' ' $12 >= 90 && $13 <= 10 && $14 == 0 ' bins-overview_GLlblMetag.tsv | cut -f 1 > MAG-bin-IDs.tmp
 
 mkdir MAGs
 for ID in MAG-bin-IDs.tmp
@@ -3705,7 +3386,7 @@ done
 
 **Input Data:**
 
-- bins-overview_GLlbsMetag.tsv (tab-delimited file with quality estimates per bin from [Step 22b](#22b-bin-quality-assessment))
+- bins-overview_GLlblMetag.tsv (tab-delimited file with quality estimates per bin from [Step 20b](#20b-bin-quality-assessment))
 
 **Output Data:**
 
@@ -3714,7 +3395,7 @@ done
 - **\*-MAGs.zip** (zip files containing directories of high-quality MAGs)
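
The `awk` filter in this step keeps bins whose estimated completeness is at least 90, whose redundancy is at most 10, and whose strain heterogeneity is exactly 0 (columns 12, 13, and 14 of the CheckM table). The sketch below applies the same condition to made-up data. The `toy-mag-filter` directory, the bin IDs, and every value are invented for illustration, and columns 2-11 are filled with placeholder dots.

```bash
set -eu

workdir="toy-mag-filter"
mkdir -p "${workdir}"
overview="${workdir}/toy-bins-overview.tsv"

# Ten placeholder fields standing in for columns 2-11.
filler=$(printf '.\t.\t.\t.\t.\t.\t.\t.\t.\t.')

# Print one 14-column row: ID, filler, completeness, redundancy, strain heterogeneity.
make_row() {
    printf '%s\t%s\t%s\t%s\t%s\n' "$1" "${filler}" "$2" "$3" "$4"
}

{
    make_row "Bin Id" "Completeness" "Contamination" "Strain heterogeneity"
    make_row "sample-bin.1" 95.2 1.3 0     # passes all three thresholds
    make_row "sample-bin.2" 88.0 2.0 0     # fails: completeness below 90
    make_row "sample-bin.3" 97.5 12.4 0    # fails: redundancy above 10
    make_row "sample-bin.4" 99.1 0.5 25    # fails: strain heterogeneity is not 0
    make_row "sample-bin.5" 90 10 0        # passes: both bounds are inclusive
} > "${overview}"

# Same condition as the step above, with NR > 1 added to skip the header explicitly.
awk -F '\t' 'NR > 1 && $12 >= 90 && $13 <= 10 && $14 == 0' "${overview}" \
    | cut -f 1 > "${workdir}/passing-bin-IDs.txt"

cat "${workdir}/passing-bin-IDs.txt"
```

Only `sample-bin.1` and `sample-bin.5` are written to `passing-bin-IDs.txt`, which shows that the completeness and redundancy bounds are inclusive.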
 
 
-#### 22d. MAG Taxonomic Classification
+#### 20d. MAG Taxonomic Classification
 > Uses default `gtdbtk` database setup with program's `download.sh` command.
 
 ```bash
@@ -3734,17 +3415,17 @@ gtdbtk classify_wf --genome_dir MAGs/ \
 
 **Input Data:**
 
-- MAGs/\*.fasta (directory holding high-quality MAGs, output from [Step 22c](#22c-filter-mags))
+- MAGs/\*.fasta (directory holding high-quality MAGs, output from [Step 20c](#20c-filter-mags))
 
 **Output Data:**
 
 - gtdbtk-output-dir/gtdbtk.\*.summary.tsv (files with assigned taxonomy and info)
 
-#### 22e. Generate Overview Table Of All MAGs
+#### 20e. Generate Overview Table Of All MAGs
 
 ```bash
 # combine summaries
-for MAG in $(cut -f 1 assembly-summaries_GLlbsMetag.tsv | tail -n +2); do
+for MAG in $(cut -f 1 assembly-summaries_GLlblMetag.tsv | tail -n +2); do
 
     grep -w -m 1 "^${MAG}" checkm-MAGs-overview.tsv | cut -f 12,13,14 \
         >> checkm-estimates.tmp
@@ -3764,7 +3445,7 @@ cat <(printf "est. completeness\test. redundancy\test. strain heterogeneity\n")
 cat <(printf "domain\tphylum\tclass\torder\tfamily\\tgenus\tspecies\n") gtdb-taxonomies.tmp \
     > gtdb-taxonomies-with-headers.tmp
 
-paste assembly-summaries_GLlbsMetag.tsv \
+paste assembly-summaries_GLlblMetag.tsv \
 checkm-estimates-with-headers.tmp \
 gtdb-taxonomies-with-headers.tmp \
     > MAGs-overview.tmp
@@ -3775,28 +3456,28 @@ head -n 1 MAGs-overview.tmp > MAGs-overview-header.tmp
 tail -n +2 MAGs-overview.tmp | sort -t \$'\t' -k 14,20 > MAGs-overview-sorted.tmp
 
 cat MAGs-overview-header.tmp MAGs-overview-sorted.tmp \
-    > MAGs-overview_GLlbsMetag.tsv
+    > MAGs-overview_GLlblMetag.tsv
 ```
 
 **Input Data:**
 
-- assembly-summaries_GLlbsMetag.tsv (table of assembly summary statistics, output from [Step 13b](#13b-summarize-assemblies))
-- MAGs/\*.fasta (directory holding high-quality MAGs, output from [Step 22c](#23c-filter-mags))
-- checkm-MAGs-overview.tsv (tab-delimited file with quality estimates per MAG, output from [Step 22c](#22c-filter-mags))
-- gtdbtk-output-dir/gtdbtk.\*.summary.tsv (directory of files with assigned taxonomy and info, output from [Step 22d](#22d-mag-taxonomic-classification))
+- assembly-summaries_GLlblMetag.tsv (table of assembly summary statistics, output from [Step 11b](#11b-summarize-assemblies))
+- MAGs/\*.fasta (directory holding high-quality MAGs, output from [Step 20c](#20c-filter-mags))
+- checkm-MAGs-overview.tsv (tab-delimited file with quality estimates per MAG, output from [Step 20c](#20c-filter-mags))
+- gtdbtk-output-dir/gtdbtk.\*.summary.tsv (directory of files with assigned taxonomy and info, output from [Step 20d](#20d-mag-taxonomic-classification))
 
 **Output Data:**
 
-- **MAGs-overview_GLlbsMetag.tsv** (a tab-delimited overview of all recovered MAGs)
+- **MAGs-overview_GLlblMetag.tsv** (a tab-delimited overview of all recovered MAGs)
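
The overview table in this step is built by prepending header lines to header-less tables, pasting the tables side by side, and sorting the body by taxonomy while keeping the header on top. The sketch below shows that pattern on made-up data. The `toy-mag-overview` directory, the file names, the column set, and all values are assumptions for illustration only. It uses brace groups instead of process substitution so that it also runs under a plain POSIX shell; the result is the same.

```bash
set -eu
LC_ALL=C
export LC_ALL

outdir="toy-mag-overview"
mkdir -p "${outdir}"
tab=$(printf '\t')

# Toy stand-in for the assembly summary table: a header plus one row per MAG.
printf 'MAG\tnum_contigs\nMAG-2\t40\nMAG-1\t12\n' > "${outdir}/summaries.tsv"

# Header-less per-MAG values, in the same row order as the summary table.
printf '91.0\t3.1\n98.5\t0.4\n' > "${outdir}/estimates.tmp"
printf 'Bacteria\tPseudomonadota\nBacteria\tBacillota\n' > "${outdir}/taxonomies.tmp"

# Prepend a header line to each header-less table.
{ printf 'est. completeness\test. redundancy\n'; cat "${outdir}/estimates.tmp"; } \
    > "${outdir}/estimates-with-headers.tmp"
{ printf 'domain\tphylum\n'; cat "${outdir}/taxonomies.tmp"; } \
    > "${outdir}/taxonomies-with-headers.tmp"

# Join the three tables column-wise; this relies on identical row order.
paste "${outdir}/summaries.tsv" \
      "${outdir}/estimates-with-headers.tmp" \
      "${outdir}/taxonomies-with-headers.tmp" > "${outdir}/overview.tmp"

# Keep the header on top and sort the remaining rows by taxonomy (columns 5-6 here).
{
    head -n 1 "${outdir}/overview.tmp"
    tail -n +2 "${outdir}/overview.tmp" | sort -t "${tab}" -k 5,6
} > "${outdir}/toy-MAGs-overview.tsv"

cat "${outdir}/toy-MAGs-overview.tsv"
```

Because `paste` matches rows purely by position, the per-MAG temporary files must be written in the same order as the rows of the summary table.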
 
 
 <br>
 
 ---
 
-### 23. Generate MAG-level Functional Summary Overview
+### 21. Generate MAG-level Functional Summary Overview
 
-#### 23a. Get KO Annotations Per MAG
+#### 21a. Get KO Annotations Per MAG
 > This utilizes the helper script [`parse-MAG-annots.py`](../Workflow_Documentation/NF_MGIllumina/workflow_code/bin/parse-MAG-annots.py) 
 
 ```bash
@@ -3811,7 +3492,7 @@ do
     python parse-MAG-annots.py -i annotations-and-taxonomy/${sample_ID}-gene-coverage-annotation-and-tax.tsv \
                                -w ${MAG_ID}-contigs.tmp \
                                -M ${MAG_ID} \
-                               -o MAG-level-KO-annotations_GLlbsMetag.tsv
+                               -o MAG-level-KO-annotations_GLlblMetag.tsv
 
     rm ${MAG_ID}-contigs.tmp
 
@@ -3827,20 +3508,20 @@ done
 
 **Input Data:**
 
-- \*-gene-coverage-annotation-and-tax.tsv (tables with combined gene coverage, annotation, and taxonomy info generated for individual samples, output from [Step 19](#19-combine-gene-level-coverage-taxonomy-and-functional-annotations-for-each-sample))
-- MAGs/\*.fasta (directory holding high-quality MAGs, output from [Step 22c](#22c-filter-mags))
+- \*-gene-coverage-annotation-and-tax.tsv (tables with combined gene coverage, annotation, and taxonomy info generated for individual samples, output from [Step 17](#17-combine-gene-level-coverage-taxonomy-and-functional-annotations-for-each-sample))
+- MAGs/\*.fasta (directory holding high-quality MAGs, output from [Step 20c](#20c-filter-mags))
 
 **Output Data:**
 
-- **MAG-level-KO-annotations_GLlbsMetag.tsv** (tab-delimited table holding MAGs and their KO annotations)
+- **MAG-level-KO-annotations_GLlblMetag.tsv** (tab-delimited table holding MAGs and their KO annotations)
 
 
-#### 23b. Summarize KO Annotations With KEGG-Decoder
+#### 21b. Summarize KO Annotations With KEGG-Decoder
 
 ```bash
 KEGG-decoder -v interactive \
-             -i MAG-level-KO-annotations_GLlbsMetag.tsv \
-             -o MAG-KEGG-Decoder-out_GLlbsMetag.tsv
+             -i MAG-level-KO-annotations_GLlblMetag.tsv \
+             -o MAG-KEGG-Decoder-out_GLlblMetag.tsv
 ```
 
 **Parameter Definitions:**  
@@ -3851,15 +3532,395 @@ KEGG-decoder -v interactive \
 
 **Input Data:**
 
-- MAG-level-KO-annotations_GLlbsMetag.tsv (tab-delimited table holding MAGs and their KO annotations, output from [Step 23a](#23a-getting-ko-annotations-per-mag))
+- MAG-level-KO-annotations_GLlblMetag.tsv (tab-delimited table holding MAGs and their KO annotations, output from [Step 21a](#21a-get-ko-annotations-per-mag))
 
 **Output Data:**
 
-- **MAG-KEGG-Decoder-out_GLlbsMetag.tsv** (tab-delimited table holding MAGs and their proportions of genes held known to be required for specific pathways/metabolisms)
-
-- **MAG-KEGG-Decoder-out_GLlbsMetag.html** (interactive heatmap html file of the above output table)
+- **MAG-KEGG-Decoder-out_GLlblMetag.tsv** (tab-delimited table holding MAGs and the proportions of genes they hold that are known to be required for specific pathways/metabolisms)
+- **MAG-KEGG-Decoder-out_GLlblMetag.html** (interactive heatmap html file of the above output table)
 
 <br>
 
 ---
 
+### 22. Decontamination and Visualization of Contig- and Gene-taxonomy and Gene-function Outputs
+
+#### 22a. Gene-level taxonomy heatmaps
+
+```R
+library(tidyverse)
+
+metadata_file <- "/path/to/sample/metadata"
+feature_table_file <- "Combined-gene-level-taxonomy-coverages-CPM_GLlblMetag.tsv"
+samples_column <- "sample_id" # metadata column holding sample names
+group_column <- "group"       # metadata column holding sample groups
+
+# Prepare metadata
+metadata <- read_delim(metadata_file, delim = ",") %>% as.data.frame
+sample_names <- metadata[, samples_column]
+row.names(metadata) <- sample_names
+
+# Prepare feature table
+gene_taxonomy_table <- read_assembly_coverage_table(feature_table_file, sample_names) %>% as.data.frame
+
+# Summarize gene table
+species_gene_table <- gene_taxonomy_table %>%
+  select(species, any_of(sample_names)) %>% 
+  group_by(species) %>% 
+  summarise(across(everything(), sum)) %>% 
+  filter(species != "Unclassified;_;_;_;_;_;_") %>% # Drop unclassified
+  as.data.frame
+
+rownames(species_gene_table) <- species_gene_table[[1]]
+species_gene_table <- species_gene_table[, -1] %>% as.matrix()
+
+# Get common samples and re-arrange feature table and metadata
+common_samples <- intersect(colnames(species_gene_table), rownames(metadata))
+species_gene_table <- species_gene_table[, common_samples]
+metadata <- metadata[common_samples, ]
+metadata <- metadata %>% arrange(!!sym(group_column))
+
+table2write = species_gene_table %>% as.data.frame %>% rownames_to_column("species")
+# Write out gene taxonomy table
+write_csv(x = table2write, file = "gene_taxonomy_table.csv")
+
+make_heatmap(metadata, species_gene_table, 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Combined-gene-level-taxonomy", 
+             assay_suffix = "_GLlblMetag", 
+             custom_palette = custom_palette)
+
+```
+
+**Custom Functions Used:**
+- [read_assembly_coverage_table()](#read_assembly_coverage_table)
+- [make_heatmap()](#make_heatmap)
+
+**Input data:**
+- /path/to/sample/metadata (a file with samples as rows and columns describing each sample)
+- Combined-gene-level-taxonomy-coverages-CPM_GLlblMetag.tsv (table with all samples combined based on gene-level taxonomic classifications, output from [Step 19a](#19a-generating-gene-level-coverage-summary-tables))
+
+**Output data:**
+- gene_taxonomy_table.csv (aggregated gene taxonomy table with samples in columns and species in rows)
+- **Combined-gene-level-taxonomy_heatmap_GLlblMetag.png** (heatmap of all gene taxonomy assignments)
+
+#### 22b. Gene-level taxonomy decontamination
+
+```R
+library(tidyverse)
+library(decontam)
+library(phyloseq)
+
+feature_table_file <- "gene_taxonomy_table.csv"
+metadata_table <- "/path/to/sample/metadata"
+samples_column <- "sample_id" # metadata column holding sample names
+group_column <- "group"       # metadata column holding sample groups
+number_samples <- NumberOfSamples # integer indicating how many samples are in the file
+
+# set width based on number of samples, with a cap at 50 inches
+plot_width <- min(2 * number_samples, 50)
+
+# Prepare metadata
+metadata <- read_delim(metadata_table, delim = ",") %>% as.data.frame
+sample_names <- metadata[, samples_column]
+row.names(metadata) <- sample_names
+
+decontaminated_table <- feature_decontam(metadata_file = metadata_table, 
+                                         feature_table_file = feature_table_file, 
+                                         feature_column = "species", 
+                                         samples_column = "sample_id",
+                                         prevalence_column = "NTC", 
+                                         ntc_name = "TRUE", 
+                                         frequency_column = "concentration", 
+                                         threshold = 0.1, 
+                                         classification_method = "Combined-gene-level-taxonomy", 
+                                         output_prefix = "", 
+                                         assay_suffix = "_GLlblMetag")
+
+# Get common samples and re-arrange feature table and metadata
+common_samples <- intersect(colnames(decontaminated_table), rownames(metadata))
+decontaminated_table <- decontaminated_table[, common_samples]
+metadata <- metadata[common_samples, ]
+metadata <- metadata %>% arrange(!!sym(group_column))
+
+make_heatmap(metadata, decontaminated_table, 
+             samples_column = "sample_id", group_column = "group", 
+             output_prefix = "Combined-gene-level-taxonomy_decontam", 
+             assay_suffix = "_GLlblMetag",
+             custom_palette = custom_palette)
+
+```
+
+**Custom Functions Used:**
+- [feature_decontam()](#feature_decontam)
+- [make_heatmap()](#make_heatmap)
+
+**Parameter Definitions:**
+- `metadata_table` - path to a file with samples as rows and columns describing each sample
+- `feature_table_file` - path to a comma-separated feature table containing gene-level coverage data, 
+                         with species as the first column and samples as the other columns.
+- `number_samples` - the total number of samples in the feature_table_file, adjust based on number of input samples
+
+**Input Data:**
+
+- `gene_taxonomy_table.csv` (aggregated gene taxonomy table with samples in columns and species in rows, from [Step 22a](#22a-gene-level-taxonomy-heatmaps))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
+
+**Output Data:**
+
+- **Combined-gene-level-taxonomy_decontam_results_GLlblMetag.csv** (decontam's results table)
+- **Combined-gene-level-taxonomy_decontam_species_table_GLlblMetag.csv** (decontaminated species table)
+- **Combined-gene-level-taxonomy_decontam_heatmap_GLlblMetag.png** (gene-level taxonomy heatmap after filtering out contaminants)
+
+#### 22c. Gene-level KO functions heatmaps
+
+```R
+library(tidyverse)
+library(pheatmap)
+
+metadata_file <- "/path/to/sample/metadata"
+feature_table_file <- "Combined-gene-level-KO-function-coverages-CPM_GLlblMetag.tsv"
+
+# Abundant functions with CPM > 2000
+abundance_threshold <- 2000
+
+# Prepare metadata
+metadata <- read_delim(metadata_file, delim = ",") %>% as.data.frame
+sample_names = metadata[, samples_column]
+row.names(metadata) <- sample_names
+
+# Read-in KO functions table and drop unannotated
+functions_table <- read_delim(file = feature_table_file, delim = "\t", comment = "#") %>%
+                   select(KO_ID, KO_function, any_of(sample_names)) %>%
+                   filter(KO_ID != "Not annotated")
+
+# Convert the sample level data into a matrix
+functions.m <- functions_table %>% select(any_of(sample_names)) %>% as.matrix()
+rownames(functions.m) <- functions_table$KO_ID
+
+# Convert to a data frame for output (unannotated KOs were already dropped above)
+table2write <- functions.m %>% as.data.frame %>%
+               rownames_to_column("KO_ID")
+# Write out KO functions table
+write_csv(x = table2write, file = "genes-KO-functions_table.csv")
+
+# Get common samples and re-arrange feature table and metadata
+common_samples <- intersect(colnames(functions.m), rownames(metadata))
+functions.m <- functions.m[, common_samples]
+metadata <- metadata[common_samples, ]
+metadata <- metadata %>% arrange(!!sym(group_column))
+
+make_heatmap(metadata, functions.m,
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Combined-gene-level-KO-function", 
+             assay_suffix = "_GLlblMetag", 
+             custom_palette = custom_palette)
+
+```
+
+**Custom Functions Used:**
+- [make_heatmap()](#make_heatmap)
+
+**Input data:**
+- /path/to/sample/metadata (a file with samples as rows and columns describing each sample)
+- Combined-gene-level-KO-function-coverages-CPM_GLlblMetag.tsv (table with all samples combined 
+    based on KO annotations; normalized to coverage per million genes covered, output from 
+    [Step 19a](#19a-generate-gene-level-coverage-summary-tables))
+
+**Output data:**
+- genes-KO-functions_table.csv (aggregated and subsetted gene KO function table)
+- **Combined-gene-level-KO-function_heatmap_GLlblMetag.png** (heatmap of all gene-level KO function assignments)
+
+#### 22d. Gene-level KO functions decontamination
+
+```R
+library(tidyverse)
+library(decontam)
+library(phyloseq)
+
+feature_table_file <- "genes-KO-functions_table.csv"
+metadata_table <- "/path/to/sample/metadata"
+number_samples <- NumberOfSamples # integer indicating how many samples are in the file
+
+# set width based on number of samples, with a cap at 50 inches
+plot_width <- 2 * number_samples
+if(plot_width > 50) { plot_width = 50 }
+
+# Prepare metadata
+metadata <- read_delim(metadata_table, delim = ",") %>% as.data.frame
+sample_names = metadata[, samples_column]
+row.names(metadata) <- sample_names
+
+decontaminated_table <- feature_decontam(metadata_file = metadata_table, 
+                                         feature_table_file = feature_table_file, 
+                                         feature_column = "KO_ID", 
+                                         samples_column = "sample_id",
+                                         prevalence_column = "NTC", 
+                                         ntc_name = "TRUE", 
+                                         frequency_column = "concentration", 
+                                         threshold = 0.1, 
+                                         classification_method = "Combined-gene-level-KO-function", 
+                                         output_prefix = "", 
+                                         assay_suffix = "_GLlblMetag")
+
+# Get common samples and re-arrange feature table and metadata
+common_samples <- intersect(colnames(decontaminated_table), rownames(metadata))
+decontaminated_table <- decontaminated_table[, common_samples]
+metadata <- metadata[common_samples, ]
+metadata <- metadata %>% arrange(!!sym(group_column))
+
+make_heatmap(metadata, decontaminated_table, 
+             samples_column = "sample_id", group_column = "group", 
+             output_prefix = "Combined-gene-level-KO-function_decontam", 
+             assay_suffix = "_GLlblMetag",
+             custom_palette = custom_palette)
+
+```
+
+**Custom Functions Used:**
+- [feature_decontam()](#feature_decontam)
+- [make_heatmap()](#make_heatmap)
+
+**Parameter Definitions:**
+- `metadata_table` - path to a file with samples as rows and columns describing each sample
+- `feature_table_file` - path to a comma-separated feature table containing gene-level KO functions coverage data, 
+                         with KO_ID as the first column and samples as the other columns.
+- `number_samples` - the total number of samples in the feature_table_file, adjust based on number of input samples
+
+**Input Data:**
+
+- `genes-KO-functions_table.csv` (aggregated gene KO functions table with samples in columns and KO_ID in rows, from [Step 22c](#22c-gene-level-ko-functions-heatmaps))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
+
+**Output Data:**
+
+- **Combined-gene-level-KO-function_decontam_results_GLlblMetag.csv** (decontam's results table)
+- **Combined-gene-level-KO-function_decontam_species_table_GLlblMetag.csv** (decontaminated gene-level KO functions table)
+- **Combined-gene-level-KO-function_decontam_heatmap_GLlblMetag.png** (gene-level KO functions heatmap after filtering out contaminants)
+
+
+#### 22e. Contig-level Heatmaps
+
+```R
+library(tidyverse)
+
+metadata_file <- "/path/to/sample/metadata"
+feature_table_file <- "Combined-contig-level-taxonomy-coverages-CPM_GLlblMetag.tsv"
+
+# Prepare metadata
+metadata <- read_delim(metadata_file, delim = ",") %>% as.data.frame
+sample_names = metadata[, samples_column]
+row.names(metadata) <- sample_names
+
+# Prepare feature table
+contig_taxonomy_table <- read_assembly_coverage_table(feature_table_file, sample_names) %>% as.data.frame
+
+# Summarize contig table
+species_contig_table <- contig_taxonomy_table %>%
+  select(species, any_of(sample_names)) %>%
+  group_by(species) %>%
+  summarise(across(everything(), sum)) %>% 
+  filter(species != "Unclassified;_;_;_;_;_;_") %>% # Drop unclassified
+  as.data.frame
+
+rownames(species_contig_table) <- species_contig_table[[1]]
+species_contig_table <- species_contig_table[, -1] %>% as.matrix()
+
+# Get common samples and re-arrange feature table and metadata
+common_samples <- intersect(colnames(species_contig_table), rownames(metadata))
+species_contig_table <- species_contig_table[, common_samples]
+metadata <- metadata[common_samples, ]
+metadata <- metadata %>% arrange(!!sym(group_column))
+
+table2write = species_contig_table %>% as.data.frame %>% rownames_to_column("species")
+# Write out contig taxonomy table
+write_csv(x = table2write, file = "contig_taxonomy_table.csv")
+
+make_heatmap(metadata, species_contig_table, 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Combined-contig-level-taxonomy", 
+             assay_suffix = "_GLlblMetag", 
+             custom_palette = custom_palette)
+```
+
+**Custom Functions Used:**
+- [read_assembly_coverage_table()](#read_assembly_coverage_table)
+- [make_heatmap()](#make_heatmap)
+
+**Input data:**
+- /path/to/sample/metadata (a file with samples as rows and columns describing each sample)
+- Combined-contig-level-taxonomy-coverages-CPM_GLlblMetag.tsv (table with all samples 
+    combined based on contig-level taxonomic classifications, output from 
+    [Step 19b](#19b-generate-contig-level-coverage-summary-tables)) 
+
+**Output data:**
+- contig_taxonomy_table.csv (aggregated contig taxonomy table with samples in columns and species in rows)
+- **Combined-contig-level-taxonomy_heatmap_GLlblMetag.png** (heatmap of all contig taxonomy assignments)
+
+#### 22f. Contig-level decontamination
+
+```R
+library(tidyverse)
+library(decontam)
+library(phyloseq)
+
+feature_table_file <- "contig_taxonomy_table.csv"
+metadata_table <- "/path/to/sample/metadata"
+number_samples <- NumberOfSamples # integer indicating how many samples are in the file
+
+# set width based on number of samples, with a cap at 50 inches
+plot_width <- 2 * number_samples
+if(plot_width > 50) { plot_width = 50 }
+
+# Prepare metadata
+metadata <- read_delim(metadata_table, delim = ",") %>% as.data.frame
+sample_names = metadata[, samples_column]
+row.names(metadata) <- sample_names
+
+decontaminated_table <- feature_decontam(metadata_file = metadata_table, 
+                                         feature_table_file = feature_table_file, 
+                                         feature_column = "species", 
+                                         samples_column = "sample_id",
+                                         prevalence_column = "NTC", 
+                                         ntc_name = "TRUE", 
+                                         frequency_column = "concentration", 
+                                         threshold = 0.1, 
+                                         classification_method = "Combined-contig-level-taxonomy", 
+                                         output_prefix = "", 
+                                         assay_suffix = "_GLlblMetag")
+
+# Get common samples and re-arrange feature table and metadata
+common_samples <- intersect(colnames(decontaminated_table), rownames(metadata))
+decontaminated_table <- decontaminated_table[, common_samples]
+metadata <- metadata[common_samples, ]
+metadata <- metadata %>% arrange(!!sym(group_column))
+
+make_heatmap(metadata, decontaminated_table, 
+             samples_column = "sample_id", group_column = "group", 
+             output_prefix = "Combined-contig-level-taxonomy_decontam", 
+             assay_suffix = "_GLlblMetag",
+             custom_palette = custom_palette)
+
+```
+
+**Custom Functions Used:**
+- [feature_decontam()](#feature_decontam)
+- [make_heatmap()](#make_heatmap)
+
+**Parameter Definitions:**
+- `metadata_table` - path to a file with samples as rows and columns describing each sample
+- `feature_table_file` - path to a comma-separated feature table containing contig-level coverage data, 
+                         with species as the first column and samples as the other columns.
+- `number_samples` - the total number of samples in the feature_table_file, adjust based on number of input samples
+
+**Input Data:**
+
+- `contig_taxonomy_table.csv` (aggregated contig taxonomy table with samples in columns and species in rows, from [Step 22e](#22e-contig-level-heatmaps))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
+
+**Output Data:**
+
+- **Combined-contig-level-taxonomy_decontam_results_GLlblMetag.csv** (decontam's results table)
+- **Combined-contig-level-taxonomy_decontam_species_table_GLlblMetag.csv** (decontaminated contig-level species table)
+- **Combined-contig-level-taxonomy_decontam_heatmap_GLlblMetag.png** (contig-level heatmap after filtering out contaminants)
+
diff --git a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md
index 28233bac9..c5548e615 100644
--- a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md
+++ b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md
@@ -1888,7 +1888,7 @@ kaiju -f kaiju-db/nr_euk/kaiju_db_nr_euk.fmi \
 - kaiju-db/nodes.dmp (kaiju taxonomy hierarchy nodes file, output from [Step 9a](#9a-build-kaiju-database))
 - sample_GLlblMetag_decontam.fastq.gz or sample_GLlblMetag_HostRm.fastq.gz (filtered and trimmed sample reads with both 
     contaminants and human reads (and optionally host reads) removed, gzipped fasta file, 
-    output from [Step 7e](#7e-generate-decontaminated-read-files) or [Step 8](#8b-remove-host-reads))
+    output from [Step 7e](#7e-generate-decontaminated-read-files) or [Step 8b](#8b-remove-host-reads))
 
 **Output Data:**
 
@@ -2113,7 +2113,7 @@ p <- make_barplot(metadata_file = metadata_file, feature_table_file = species_ta
                   feature_column = "Species", samples_column = "sample_id", group_column = "group",
                   publication_format = publication_format, custom_palette = custom_palette)
 
-ggsave(filename = "unfiltered-kaiju_species_barplot_GLlblMetag.png", plot = p,
+ggsave(filename = "kaiju_unfiltered_species_barplot_GLlblMetag.png", plot = p,
        device = "png", width = plot_width, height = 10, units = "in", dpi = 300, limitsize = FALSE)
 
 # Save static unfiltered plot
@@ -2212,7 +2212,7 @@ htmlwidgets::saveWidget(ggplotly(p), glue("kaiju_decontam_species_barplot_GLlblM
 
 **Input Data:**
 
-- `kaiju_filtered_species_table_GLlblMetag.csv`(path to filtered species count per sample, output from [Step 9](#9g-filter-kaiju-species-count-table))
+- `kaiju_filtered_species_table_GLlblMetag.csv`(path to filtered species count per sample, output from [Step 9g](#9g-filter-kaiju-species-count-table))
 - `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping samplenames to group metadata)
 
 **Output Data:**
@@ -2304,7 +2304,7 @@ kraken2 --db kraken2-db/ \
 - kraken2-db/ (a directory containing kraken2 database files, output from [Step 10a](#10a-download-kraken2-database))
 - sample_GLlblMetag_decontam.fastq.gz or sample_GLlblMetag_HostRm.fastq.gz (filtered and trimmed sample reads with both 
     contaminants and human reads (and, optionally, host reads) removed, gzipped fasta file, 
-    output from [Step 7e](#7e-generate-decontaminated-read-files) or [Step 8](#8b-remove-host-reads))
+    output from [Step 7e](#7e-generate-decontaminated-read-files) or [Step 8b](#8b-remove-host-reads))
 
 **Output Data:**
 
@@ -2345,7 +2345,6 @@ write_csv(x = species_table, file = "merged-kraken2-table.csv")
 
 - **kraken2_species_table_GLlblMetag.csv** (kraken species count table in csv format)
 
-
 ##### 10cii. Compile Kraken2 Taxonomy Reports
 
 ```bash
@@ -2489,7 +2488,7 @@ write_csv(x = table2write, file = output_file)
 
 **Input Data:**
 
-- kaiju_species_table_GLlblMetag.csv (path to kaiju species table from [Step 10ci.](#10ci-create-merged-kraken2-taxonomy-table))
+- kraken2_species_table_GLlblMetag.csv (path to kraken2 species table from [Step 10ci.](#10ci-create-merged-kraken2-taxonomy-table))
 
 **Output Data:**
 
@@ -2546,8 +2545,8 @@ htmlwidgets::saveWidget(ggplotly(p), glue("kraken2_filtered_species_barplot_GLlb
 
 **Input Data:**
 
-- `kraken2_species_table_GLlblMetag.csv` (a file containing the species count table, output from [Step 9f](#9f-create-kaiju-species-count-table))
-- `kraken2_filtered_species_table_GLlblMetag.csv` (a file containing the filtered species count table, output from [Step 9g](#9g-filter-kaiju-species-count-table))
+- `kraken2_species_table_GLlblMetag.csv` (path to kraken2 species table from [Step 10ci.](#10ci-create-merged-kraken2-taxonomy-table))
+- `kraken2_filtered_species_table_GLlblMetag.csv` (a file containing the filtered species count table, output from [Step 10f](#10f-filter-kraken2-species-count-table))
 - `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping samplenames to group metadata)
 
 **Output Data:**
@@ -2656,7 +2655,7 @@ mv sample/flye.log sample_assembly.log
 
 - sample_GLlblMetag_decontam.fastq.gz or sample_GLlblMetag_HostRm.fastq.gz (filtered and trimmed sample reads with both 
     contaminants and human reads (and optionally host reads) removed, gzipped fasta file, 
-    output from [Step 7e](#7e-generate-decontaminated-read-files) or [Step 8](#8b-remove-host-reads))
+    output from [Step 7e](#7e-generate-decontaminated-read-files) or [Step 8b](#8b-remove-host-reads))
 
 **Output Data**
 
@@ -2689,7 +2688,7 @@ mv sample/consensus.fasta sample_polished.fasta
 
 - sample_GLlblMetag_decontam.fastq.gz or sample_GLlblMetag_HostRm.fastq.gz (filtered and trimmed sample reads with both 
     contaminants and human reads (and optionally host reads) removed, gzipped fasta file, 
-    output from [Step 7e](#7e-generate-decontaminated-read-files) or [Step 8](#8b-remove-host-reads))
+    output from [Step 7e](#7e-generate-decontaminated-read-files) or [Step 8b](#8b-remove-host-reads))
 - /path/to/assemblies/sample_assembly.fasta (sample assembly, output from [Step 11](#11-sample-assembly))
 
 **Output Data:**
@@ -2759,7 +2758,7 @@ prodigal -a sample-genes.faa \
          -p meta \
          -c \
          -q \
-         -o sample-genes.gff \
+         -o sample-genes_GLlblMetag.gff \
          -i sample-assembly_GLlblMetag.fasta
 ```
 
@@ -3074,7 +3073,7 @@ minimap2 -a \
 - sample-assembly.fasta (contig-renamed assembly file, output from [Step 13a](#13a-rename-contig-headers))
 - sample_GLlblMetag_decontam.fastq.gz or sample_GLlblMetag_HostRm.fastq.gz (filtered and trimmed sample reads with both 
     contaminants and human reads (and optionally host reads) removed, gzipped fasta file, 
-    output from [Step 7e](#7e-generate-decontaminated-read-files) or [Step 8](#8b-remove-host-reads))
+    output from [Step 7e](#7e-generate-decontaminated-read-files) or [Step 8b](#8b-remove-host-reads))
 
 **Output Data**
 
@@ -3286,7 +3285,7 @@ mv "Combined-gene-level-taxonomy-coverages.tsv Combined-gene-level-taxonomy-cove
 
 **Parameter Definitions:**  
 
-- `*-gene-coverage-annotation-and-tax.tsv` - Positional arguments specifying the input tsv files, can be provided as a space-delimited list of files, or with wildcards like above.
+- `*-gene-coverage-annotation-and-tax_GLlblMetag.tsv` - Positional arguments specifying the input tsv files, can be provided as a space-delimited list of files, or with wildcards like above.
 
 - `-o` – Specifies the output file prefix.
 

From ef65dd5bab30936b368b1c5bf9d1391637317592 Mon Sep 17 00:00:00 2001
From: asaravia-butler <70983120+asaravia-butler@users.noreply.github.com>
Date: Mon, 26 Jan 2026 12:24:14 -0500
Subject: [PATCH 22/47] Update GL-DPPD-7116.md (#188)

- started renaming fastq files to move assay suffix to end
- fixed typos in the table of contents
---
 .../Low_Biomass/Nanopore/GL-DPPD-7116.md      | 32 +++++++++----------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md
index c5548e615..78318bc35 100644
--- a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md
+++ b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md
@@ -4,7 +4,7 @@
 
 ---
 
-**Date:** November MM, 2025  
+**Date:** January MM, 2026  
 **Revision:** -  
 **Document Number:** GL-DPPD-7116  
 
@@ -70,9 +70,9 @@ Barbara Novak (GeneLab Data Processing Lead)
     - [10. Taxonomic Profiling Using Kraken2](#10-taxonomic-profiling-using-kraken2)
       - [10a. Download Kraken2 Database](#10a-download-kraken2-database)
       - [10b. Kraken2 Taxonomic Classification](#10b-kraken2-taxonomic-classification)
-      - [10c. Compile Kraken2 Taxonomy ](#10c-compile-kraken2-taxonomy-results)
-        - [10ci.](#10ci-create-merged-kraken2-taxonomy-table)
-        - [10cii.](#10cii-compile-kraken2-taxonomy-reports)
+      - [10c. Compile Kraken2 Taxonomy Results](#10c-compile-kraken2-taxonomy-results)
+        - [10ci. Create Merged Kraken2 Taxonomy Table](#10ci-create-merged-kraken2-taxonomy-table)
+        - [10cii. Compile Kraken2 Taxonomy Reports](#10cii-compile-kraken2-taxonomy-reports)
       - [10d. Convert Kraken2 Output to Krona Format](#10d-convert-kraken2-output-to-krona-format)
       - [10e. Compile Kraken2 Krona Reports](#10e-compile-kraken2-krona-reports)
       - [10f. Create Kraken2 Species Count Table](#10f-create-kraken2-species-count-table)
@@ -306,7 +306,7 @@ NanoPlot --only-report \
 
 **Input Data:**
 
-- /path/to/raw_data/sample.fastq.gz (raw reads, output from [Step 2b](#2b-concatenate-files-for-each-sample))
+- /path/to/raw_data/sample.fastq.gz (concatenated raw reads, output from [Step 2b](#2b-concatenate-files-for-each-sample))
 
 **Output Data:**
 
@@ -362,7 +362,7 @@ filtlong --min_length 200 --min_mean_q 8 /path/to/raw_data/sample.fastq.gz > sam
 
 **Input Data:**
 
-- /path/to/raw_data/sample.fastq.gz (raw reads, output from [Step 2b](#2b-concatenate-files-for-each-sample))
+- /path/to/raw_data/sample.fastq.gz (concatenated raw reads, output from [Step 2b](#2b-concatenate-files-for-each-sample))
 
 **Output Data:**
 
@@ -552,7 +552,7 @@ kraken2-build --clean --db kraken2-human-db/
 
 **Input Data:**
 
-- `human.fasta` (fasta file containing human genome)
+- `human.fasta` (fasta file containing human genome, SPECIFY WHERE THIS GENOME CAME FROM)
 
 **Output Data:**
 
@@ -567,11 +567,11 @@ kraken2 --db kraken2_human_db \
         --use-names \
         --output sample-kraken2-output.txt \
         --report sample-kraken2-report.tsv \
-        --unclassified-out sample_GLlblMetag_HRrm.fastq \
+        --unclassified-out sample_HRrm_GLlblMetag.fastq \
         sample_trimmed_fastq.gz
 
 # gzip fastq output file
-gzip sample_GLlblMetag_HRrm.fastq
+gzip sample_HRrm_GLlblMetag.fastq
 ```
 
 **Parameter Definitions:**
@@ -587,14 +587,14 @@ gzip sample_GLlblMetag_HRrm.fastq
 
 **Input Data:**
 
-- kraken2_human_db/ (kraken2 human database directory, output from [Step 7a](#7a-build-kraken2-database))
-- sample_trimmed.fastq.gz (filtered and trimmed sample reads with contaminants removed, output from [Step 5a](#5a-trim-filtered-data))
+- kraken2_human_db/ (kraken2 human database directory, output from [Step 6a](#6a-build-kraken2-database))
+- sample_trimmed.fastq.gz (filtered and trimmed sample reads, output from [Step 5a](#5a-trim-filtered-data))
 
 **Output Data:**
 
 - sample-kraken2-output.txt (kraken2 read-based output file (one line per read))
 - sample-kraken2-report.tsv (kraken2 report output file (one line per taxa, with number of reads assigned to it))
-- **sample_GLlblMetag_HRrm.fastq.gz** (filtered and trimmed sample reads with both contaminants and human reads removed, gzipped fasta file)
+- **sample_HRrm_GLlblMetag.fastq.gz** (filtered and trimmed sample reads with human reads removed, gzipped fastq file)
 
 
 #### 6c. Compile Human Read Removal QC
@@ -617,7 +617,7 @@ multiqc --zip-data-dir \
 
 **Input Data:**
 
-- /path/to/*kraken2-report.tsv (kraken2 report files, output from [Step 7b](#7b-remove-human-reads))
+- /path/to/*kraken2-report.tsv (kraken2 report files, output from [Step 6b](#6b-remove-human-reads))
 
 **Output Data:**
 
@@ -679,7 +679,7 @@ minimap2 -t NumberOfThreads \
          -a \
          -x splice \
          blanks.mmi \
-         sample_GLlblMetag_HRrm.fastq.gz  > sample.sam 2> sample-mapping-info.txt
+         sample_HRrm_GLlblMetag.fastq.gz  > sample.sam 2> sample-mapping-info.txt
 ```
 
 **Parameter Definitions:**
@@ -690,13 +690,13 @@ minimap2 -t NumberOfThreads \
 - `-d` - Specifies the output file for the index (specific to the build contaminant index command).
 - `/path/to/contaminant_assembly/blank-assembly.fasta` - Specifies the input file in fasta format, provided as a positional argument (specific to the build contaminant index command).
 - `blanks.mmi` - Specifies the index file in mmi format, provided as a positional argument (specific to the map reads command).
-- `/path/to/trimmed_reads/sample_GLlblMetag_HRrm.fastq.gz` - Specifies the input file in fastq format, provided as a positional argument (specific to the map reads command).
+- `/path/to/trimmed_reads/sample_HRrm_GLlblMetag.fastq.gz` - Specifies the input file in fastq format, provided as a positional argument (specific to the map reads command).
 - `> sample.sam` - Redirects the output of the map reads command to a separate SAM file (specific to the map reads command).
 
 **Input Data**
 
 - /path/to/contaminant_assembly/blank-assembly.fasta (contaminant assembly, output from [Step 7a](#7-assemble-contaminants))
-- sample_GLlblMetag_HRrm.fastq.gz (filtered and trimmed reads, output from [Step 6b](#6b-remove-human-reads))
+- sample_HRrm_GLlblMetag.fastq.gz (filtered and trimmed reads, output from [Step 6b](#6b-remove-human-reads))
 
 **Output Data**
 

From 661fbacd9efe42f0d78a0f467ba19e9900b801ed Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Mon, 26 Jan 2026 16:44:42 -0800
Subject: [PATCH 23/47] pipeline document updates through step 8c

---
 .../Low_Biomass/Nanopore/GL-DPPD-7116.md      | 213 +++++++++---------
 1 file changed, 107 insertions(+), 106 deletions(-)

diff --git a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md
index 78318bc35..0ee66cb16 100644
--- a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md
+++ b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md
@@ -52,80 +52,84 @@ Barbara Novak (GeneLab Data Processing Lead)
       - [7e. Generate Decontaminated Read Files](#7e-generate-decontaminated-read-files)
       - [7f. Contaminant Removal QC](#7f-contaminant-removal-qc)
       - [7g. Compile Contaminant Removal QC](#7g-compile-contaminant-removal-qc)
-    - [8. R Environment Setup](#8-r-environment-setup)
-      - [8a. Load Libraries](#8a-load-libraries)
-      - [8b. Define Custom Functions](#8b-define-custom-functions)
-      - [8c. Set global variables](#8c-set-global-variables)
+    - [8. Host Read Removal](#8-host-read-removal)
+      - [8a. Build Kraken2 Database](#8a-build-kraken2-database)
+      - [8b. Remove Host Reads](#8b-remove-host-reads)
+      - [8c. Compile Host Read Removal QC](#8c-compile-host-read-removal-qc)
+    - [9. R Environment Setup](#9-r-environment-setup)
+      - [9a. Load Libraries](#9a-load-libraries)
+      - [9b. Define Custom Functions](#9b-define-custom-functions)
+      - [9c. Set global variables](#9c-set-global-variables)
   - [**Read-based processing**](#read-based-processing)
-    - [9. Taxonomic profiling using kaiju](#9-taxonomic-profiling-using-kaiju)
-      - [9a. Build Kaiju Database](#9a-build-kaiju-database)
-      - [9b. Kaiju Taxonomic Classification](#9b-kaiju-taxonomic-classification)
-      - [9c. Compile Kaiju Taxonomy Results](#9c-compile-kaiju-taxonomy-results)
-      - [9d. Convert Kaiju Output To Krona Format](#9d-convert-kaiju-output-to-krona-format)
-      - [9e. Compile Kaiju Krona Reports](#9e-compile-kaiju-krona-reports)
-      - [9f. Create Kaiju Species Count Table](#9f-create-kaiju-species-count-table)
-      - [9g. Read-in Tables](#9g-read-in-tables)
-      - [9h. Taxonomy Barplots](#9h-taxonomy-barplots)
-      - [9i. Feature Decontamination](#9i-feature-decontamination)
-    - [10. Taxonomic Profiling Using Kraken2](#10-taxonomic-profiling-using-kraken2)
-      - [10a. Download Kraken2 Database](#10a-download-kraken2-database)
-      - [10b. Kraken2 Taxonomic Classification](#10b-kraken2-taxonomic-classification)
-      - [10c. Compile Kraken2 Taxonomy Results](#10c-compile-kraken2-taxonomy-results)
-        - [10ci. Create Merged Kraken2 Taxonomy Table](#10ci-create-merged-kraken2-taxonomy-table)
-        - [10cii. Compile Kraken2 Taxonomy Reports](#10cii-compile-kraken2-taxonomy-reports)
-      - [10d. Convert Kraken2 Output to Krona Format](#10d-convert-kraken2-output-to-krona-format)
-      - [10e. Compile Kraken2 Krona Reports](#10e-compile-kraken2-krona-reports)
-      - [10f. Create Kraken2 Species Count Table](#10f-create-kraken2-species-count-table)
+    - [10. Taxonomic profiling using kaiju](#10-taxonomic-profiling-using-kaiju)
+      - [10a. Build Kaiju Database](#10a-build-kaiju-database)
+      - [10b. Kaiju Taxonomic Classification](#10b-kaiju-taxonomic-classification)
+      - [10c. Compile Kaiju Taxonomy Results](#10c-compile-kaiju-taxonomy-results)
+      - [10d. Convert Kaiju Output To Krona Format](#10d-convert-kaiju-output-to-krona-format)
+      - [10e. Compile Kaiju Krona Reports](#10e-compile-kaiju-krona-reports)
+      - [10f. Create Kaiju Species Count Table](#10f-create-kaiju-species-count-table)
       - [10g. Read-in Tables](#10g-read-in-tables)
       - [10h. Taxonomy Barplots](#10h-taxonomy-barplots)
       - [10i. Feature Decontamination](#10i-feature-decontamination)
+    - [11. Taxonomic Profiling Using Kraken2](#11-taxonomic-profiling-using-kraken2)
+      - [11a. Download Kraken2 Database](#11a-download-kraken2-database)
+      - [11b. Kraken2 Taxonomic Classification](#11b-kraken2-taxonomic-classification)
+      - [11c. Compile Kraken2 Taxonomy Results](#11c-compile-kraken2-taxonomy-results)
+        - [11ci. Create Merged Kraken2 Taxonomy Table](#11ci-create-merged-kraken2-taxonomy-table)
+        - [11cii. Compile Kraken2 Taxonomy Reports](#11cii-compile-kraken2-taxonomy-reports)
+      - [11d. Convert Kraken2 Output to Krona Format](#11d-convert-kraken2-output-to-krona-format)
+      - [11e. Compile Kraken2 Krona Reports](#11e-compile-kraken2-krona-reports)
+      - [11f. Create Kraken2 Species Count Table](#11f-create-kraken2-species-count-table)
+      - [11g. Read-in Tables](#11g-read-in-tables)
+      - [11h. Taxonomy Barplots](#11h-taxonomy-barplots)
+      - [11i. Feature Decontamination](#11i-feature-decontamination)
   - [**Assembly-based processing**](#assembly-based-processing)
-    - [11. Sample Assembly](#11-sample-assembly)
-    - [12. Polish Assembly](#12-polish-assembly)
-    - [13. Rename Contigs and Summarize Assemblies](#13-rename-contigs-and-summarize-assemblies)
-      - [13a. Rename Contig Headers](#13a-rename-contig-headers)
-      - [13b. Summarize Assemblies](#13b-summarize-assemblies)
-    - [14. Gene Prediction](#14-gene-prediction)
-      - [14a. Generate Gene Predictions](14a-generate-gene-predictions)
-      - [14b. Remove Line Wraps In Gene Prediction Output](#14a-remove-line-wraps-in-gene-prediction-output)
-    - [15. Functional Annotation](#15-functional-annotation)
-      - [15a. Download Reference Database of HMM Models](#15a-download-reference-database-of-hmm-models)
-      - [15b. Run KEGG Annotation](#15b-run-kegg-annotation)
-      - [15c. Filter KO Outputs](#15c-filter-ko-outputs)
-    - [16. Taxonomic Classification](#16-taxonomic-classification)
-      - [16a. Pull and Unpack Pre-built Reference DB](#16a-pull-and-unpack-pre-built-reference-db)
-      - [16b. Run Taxonomic Classification](#16b-run-taxonomic-classification)
-      - [16c. Add Taxonomy Info From Taxids To Genes](#16c-add-taxonomy-info-from-taxids-to-genes)
-      - [16d. Add Taxonomy Info From Taxids To Contigs](#16d-add-taxonomy-info-from-taxids-to-contigs)
-      - [16e. Format Gene-level Output With awk and sed](#16e-format-gene-level-output-with-awk-and-sed)
-      - [16f. Format Contig-level Output With awk and sed](#16f-format-contig-level-output-with-awk-and-sed)
-    - [17. Read-Mapping](#17-read-mapping)
-      - [17a. Align Reads to Sample Assembly](#17a-align-reads-to-sample-assembly)
-      - [17b. Sort and Index Assembly Alignments](#17b-sort-and-index-assembly-alignments)
-    - [18. Get Coverage Information and Filter Based On Detection](#18-get-coverage-information-and-filter-based-on-detection)
-      - [18a. Filter Coverage Levels Based On Detection](#18a-filter-coverage-levels-based-on-detection)
-      - [18b. Filter Gene and Contig Coverage Based On Detection](#18b-filter-gene-and-contig-coverage-based-on-detection)
-    - [19. Combine Gene-level Coverage, Taxonomy, and Functional Annotations For Each Sample](#19-combine-gene-level-coverage-taxonomy-and-functional-annotations-for-each-sample)
-    - [20. Combine Contig-level Coverage and Taxonomy For Each Sample](#20-combine-contig-level-coverage-and-taxonomy-for-each-sample)
-    - [21. Generate Normalized, Gene- and Contig-level Coverage Summary Tables of KO-annotations and Taxonomy Across Samples](#21-generate-normalized-gene--and-contig-level-coverage-summary-tables-of-ko-annotations-and-taxonomy-across-samples)
-      - [21a. Generate Gene-level Coverage Summary Tables](#21a-generate-gene-level-coverage-summary-tables)
-      - [21b. Generate Contig-level Coverage Summary Tables](#21f-generate-contig-level-coverage-summary-tables)
-    - [22. **M**etagenome-**A**ssembled **G**enome (MAG) recovery](#22-metagenome-assembled-genome-mag-recovery)
-      - [22a. Bin Contigs](#22a-bin-contigs)
-      - [22b. Bin Quality Assessment](#22b-bin-quality-assessment)
-      - [22c. Filter MAGs](#22c-filter-mags)
-      - [22d. MAG Taxonomic Classification](#22d-mag-taxonomic-classification)
-      - [22e. Generate Overview Table Of All MAGs](#22e-generate-overview-table-of-all-mags)
-    - [23. Generate MAG-level Functional Summary Overview](#23-generate-mag-level-functional-summary-overview)
-      - [23a. Get KO Annotations Per MAG](#23a-get-ko-annotations-per-mag)
-      - [23b. Summarize KO Annotations With KEGG-Decoder](#23b-summarize-ko-annotations-with-kegg-decoder)
-    - [24. Decontamination and Visualization of Contig- and Gene-taxonomy and Gene-function Outputs](#24-decontamination-and-visualization-of-contig--and-gene-taxonomy-and-gene-function-outputs)
-      - [24a. Gene-level taxonomy heatmaps](#24a-gene-level-taxonomy-heatmaps)
-      - [24b. Gene-level taxonomy decontamination](#24b-gene-level-taxonomy-decontamination)
-      - [24c. Gene-level KO functions heatmaps](#24c-gene-level-ko-functions-heatmaps)
-      - [24d. Gene-level KO functions decontamination](#24d-gene-level-ko-functions-decontamination)
-      - [24e. Contig-level heatmaps](#24e-contig-level-heatmaps)
-      - [24f. Contig-level decontamination](#24f-contig-level-decontamination)
+    - [12. Sample Assembly](#12-sample-assembly)
+    - [13. Polish Assembly](#13-polish-assembly)
+    - [14. Rename Contigs and Summarize Assemblies](#14-rename-contigs-and-summarize-assemblies)
+      - [14a. Rename Contig Headers](#14a-rename-contig-headers)
+      - [14b. Summarize Assemblies](#14b-summarize-assemblies)
+    - [15. Gene Prediction](#15-gene-prediction)
+      - [15a. Generate Gene Predictions](#15a-generate-gene-predictions)
+      - [15b. Remove Line Wraps In Gene Prediction Output](#15b-remove-line-wraps-in-gene-prediction-output)
+    - [16. Functional Annotation](#16-functional-annotation)
+      - [16a. Download Reference Database of HMM Models](#16a-download-reference-database-of-hmm-models)
+      - [16b. Run KEGG Annotation](#16b-run-kegg-annotation)
+      - [16c. Filter KO Outputs](#16c-filter-ko-outputs)
+    - [17. Taxonomic Classification](#17-taxonomic-classification)
+      - [17a. Pull and Unpack Pre-built Reference DB](#17a-pull-and-unpack-pre-built-reference-db)
+      - [17b. Run Taxonomic Classification](#17b-run-taxonomic-classification)
+      - [17c. Add Taxonomy Info From Taxids To Genes](#17c-add-taxonomy-info-from-taxids-to-genes)
+      - [17d. Add Taxonomy Info From Taxids To Contigs](#17d-add-taxonomy-info-from-taxids-to-contigs)
+      - [17e. Format Gene-level Output With awk and sed](#17e-format-gene-level-output-with-awk-and-sed)
+      - [17f. Format Contig-level Output With awk and sed](#17f-format-contig-level-output-with-awk-and-sed)
+    - [18. Read-Mapping](#18-read-mapping)
+      - [18a. Align Reads to Sample Assembly](#18a-align-reads-to-sample-assembly)
+      - [18b. Sort and Index Assembly Alignments](#18b-sort-and-index-assembly-alignments)
+    - [19. Get Coverage Information and Filter Based On Detection](#19-get-coverage-information-and-filter-based-on-detection)
+      - [19a. Filter Coverage Levels Based On Detection](#19a-filter-coverage-levels-based-on-detection)
+      - [19b. Filter Gene and Contig Coverage Based On Detection](#19b-filter-gene-and-contig-coverage-based-on-detection)
+    - [20. Combine Gene-level Coverage, Taxonomy, and Functional Annotations For Each Sample](#20-combine-gene-level-coverage-taxonomy-and-functional-annotations-for-each-sample)
+    - [21. Combine Contig-level Coverage and Taxonomy For Each Sample](#21-combine-contig-level-coverage-and-taxonomy-for-each-sample)
+    - [22. Generate Normalized, Gene- and Contig-level Coverage Summary Tables of KO-annotations and Taxonomy Across Samples](#22-generate-normalized-gene--and-contig-level-coverage-summary-tables-of-ko-annotations-and-taxonomy-across-samples)
+      - [22a. Generate Gene-level Coverage Summary Tables](#22a-generate-gene-level-coverage-summary-tables)
+      - [22b. Generate Contig-level Coverage Summary Tables](#22b-generate-contig-level-coverage-summary-tables)
+    - [23. **M**etagenome-**A**ssembled **G**enome (MAG) recovery](#23-metagenome-assembled-genome-mag-recovery)
+      - [23a. Bin Contigs](#23a-bin-contigs)
+      - [23b. Bin Quality Assessment](#23b-bin-quality-assessment)
+      - [23c. Filter MAGs](#23c-filter-mags)
+      - [23d. MAG Taxonomic Classification](#23d-mag-taxonomic-classification)
+      - [23e. Generate Overview Table Of All MAGs](#23e-generate-overview-table-of-all-mags)
+    - [24. Generate MAG-level Functional Summary Overview](#24-generate-mag-level-functional-summary-overview)
+      - [24a. Get KO Annotations Per MAG](#24a-get-ko-annotations-per-mag)
+      - [24b. Summarize KO Annotations With KEGG-Decoder](#24b-summarize-ko-annotations-with-kegg-decoder)
+    - [25. Decontamination and Visualization of Contig- and Gene-taxonomy and Gene-function Outputs](#25-decontamination-and-visualization-of-contig--and-gene-taxonomy-and-gene-function-outputs)
+      - [25a. Gene-level taxonomy heatmaps](#25a-gene-level-taxonomy-heatmaps)
+      - [25b. Gene-level taxonomy decontamination](#25b-gene-level-taxonomy-decontamination)
+      - [25c. Gene-level KO functions heatmaps](#25c-gene-level-ko-functions-heatmaps)
+      - [25d. Gene-level KO functions decontamination](#25d-gene-level-ko-functions-decontamination)
+      - [25e. Contig-level heatmaps](#25e-contig-level-heatmaps)
+      - [25f. Contig-level decontamination](#25f-contig-level-decontamination)
 
 
 ---
@@ -395,7 +399,7 @@ NanoPlot --only-report \
 
 **Output Data:**
 
-- **/path/to/filtered_nanoplot_output/sample_filtered_NanoPlot-report.html** (NanoPlot html summary)
+- **/path/to/filtered_nanoplot_output/sample_filtered_NanoPlot-report_GLlblMetag.html** (NanoPlot html summary)
 - /path/to/filtered_nanoplot_output/sample_filtered_NanoPlot_\<date\>_\<time\>.log (NanoPlot log file)
 - /path/to/filtered_nanoplot_output/sample_filtered_NanoStats.txt (text file containing basic statistics)
 
@@ -484,7 +488,7 @@ NanoPlot --only-report \
 
 **Output Data:**
 
-- **/path/to/trimmed_nanoplot_output/sample_trimmed_NanoPlot-report.html** (NanoPlot html summary)
+- **/path/to/trimmed_nanoplot_output/sample_trimmed_NanoPlot-report_GLlblMetag.html** (NanoPlot html summary)
 - /path/to/trimmed_nanoplot_output/sample_trimmed_NanoPlot_\<date\>_\<time\>.log (NanoPlot log file)
 - /path/to/trimmed_nanoplot_output/sample_trimmed_NanoStats.txt (text file containing basic statistics)
 
@@ -631,7 +635,7 @@ multiqc --zip-data-dir \
 
 ### 7. Contaminant Removal
 
-> A major issue with low biomass data is the high potential for contamination due to the low amount of DNA extracted from the samples. Because negative control/blank samples should by theory be contaminant free, any sequence detected in the negative control is a potential contaminant. To filter out contaminants found in negative control samples that may have been due to cross contamination in the lab, we use a read mapping approach. First negative/blank control sample reads are assembled then the filtered and trimmed reads from each low-biomass sample are mapped to the assembled contigs from the negative/blank control samples. Reads mapping to the assembled contigs are categorized as contaminants and are therefore filtered out and thus excluded from downstream analyses.
+> A major issue with low-biomass data is the high potential for contamination, because so little DNA is extracted from the samples. Because negative control/blank samples should in theory be contaminant-free, any sequence detected in a negative control is a potential contaminant. To filter out contaminants found in negative control samples, which may result from cross-contamination in the lab, we use a read-mapping approach. First, reads from the negative/blank control samples are assembled; then the filtered, trimmed, and human-removed reads from each low-biomass sample are mapped to the resulting contigs. Reads that map to these contigs are classified as contaminants and excluded from downstream analyses.
 
 ### 7a. Assemble Contaminants
 
@@ -639,7 +643,7 @@ multiqc --zip-data-dir \
 flye --meta \
      --threads NumberOfThreads \
      --out-dir /path/to/contaminant_assembly \
-     --nano-raw /path/to/blank_samples/\*_GLlblMetag_HRrm.fastq.gz
+     --nano-raw /path/to/blank_samples/\*_HRrm_GLlblMetag.fastq.gz
 
 # rename output
 mv assembly.fasta blank-assembly.fasta
@@ -655,7 +659,7 @@ mv flye.log blank-flye.log
 
 **Input Data**
 
-- *_GLlblMetag_HRrm.fastq.gz (one or more trimmed, HRrm reads from blank (negative control) samples, output from [Step 6b](#6b-remove-human-reads))
+- *_HRrm_GLlblMetag.fastq.gz (one or more filtered, trimmed, and HRrm reads from blank (negative control) samples, output from [Step 6b](#6b-remove-human-reads))
 
 **Output Data**
 
@@ -696,7 +700,7 @@ minimap2 -t NumberOfThreads \
 **Input Data**
 
 - /path/to/contaminant_assembly/blank-assembly.fasta (contaminant assembly, output from [Step 7a](#7-assemble-contaminants))
-- sample_HRrm_GLlblMetag.fastq.gz (filtered and trimmed reads, output from [Step 6b](#6b-remove-human-reads))
+- sample_HRrm_GLlblMetag.fastq.gz (filtered, trimmed, and HRrm reads, output from [Step 6b](#6b-remove-human-reads))
 
 **Output Data**
 
@@ -774,7 +778,7 @@ samtools idxstats sample_sorted.bam  > sample_idxstats.txt 2> sample_idxstats.lo
 #### 7e. Generate Decontaminated Read Files
 ```bash
 # Retain reads that do not map to contaminants
-samtools fastq -t -f 4 -o sample_GLlblMetag_decontam.fastq.gz -0 sample_GLlblMetag_decontam.fastq.gz sample_sorted.bam 
+samtools fastq -t -f 4 -o sample_decontam_GLlblMetag.fastq.gz -0 sample_decontam_GLlblMetag.fastq.gz sample_sorted.bam 
 ```
 
 **Parameter Definitions:**
@@ -782,8 +786,8 @@ samtools fastq -t -f 4 -o sample_GLlblMetag_decontam.fastq.gz -0 sample_GLlblMet
 - `fastq` - Positional argument specifying the program for generating fastq files from a SAM/BAM file.
 - `-t` - Copy RG, BC, and QT tags to the FASTQ header line.
 - `-f 4` - Only retain unmapped reads that have been marked with the SAM "segment unmapped" FLAG (4).
-- `-o sample_GLlblMetag_decontam.fastq.gz` - Send reads flagged as either read1 or read2 to the named file (.gz ending ensures compressed output)
-- `-0 sample_GLlblMetag_decontam.fastq.gz` - Send reads flagged as both read1 and read2 or neither to the same named file
+- `-o sample_decontam_GLlblMetag.fastq.gz` - Send reads flagged as either read1 or read2 to the named file (.gz ending ensures compressed output)
+- `-0 sample_decontam_GLlblMetag.fastq.gz` - Send reads flagged as both read1 and read2 or neither to the same named file
 - `sample_sorted.bam` - Positional argument specifying the input BAM file.
 
 **Input Data:**
@@ -792,7 +796,7 @@ samtools fastq -t -f 4 -o sample_GLlblMetag_decontam.fastq.gz -0 sample_GLlblMet
 
 **Output Data:**
 
-- **sample_GLlblMetag_decontam.fastq.gz** (filtered and trimmed sample reads with contaminants removed in fastq format)
+- **sample_decontam_GLlblMetag.fastq.gz** (filtered, trimmed, and HRrm sample reads with contaminants removed in fastq format)
 
 #### 7f. Contaminant Removal QC
 
@@ -802,7 +806,7 @@ NanoPlot --only-report \
          --outdir /path/to/decontam_nanoplot_output \
          --threads NumberOfThreads \
          --fastq \
-         sample_GLlblMetag_decontam.fastq.gz
+         sample_decontam_GLlblMetag.fastq.gz
 ```
 
 **Parameter Definitions:**
@@ -812,11 +816,11 @@ NanoPlot --only-report \
 - `--outdir` – Specifies the output directory to store results.
 - `--threads` - Number of parallel processing threads to use.
 - `--fastq` - Specifies that the input data is in fastq format.
-- `sample_GLlblMetag_decontam.fastq.gz` – The input reads, specified as a positional argument.
+- `sample_decontam_GLlblMetag.fastq.gz` – The input reads, specified as a positional argument.
 
 **Input Data:**
 
-- sample_GLlblMetag_decontam.fastq.gz (filtered and trimmed sample reads with all contaminants removed, output from [Step 7e](#7e-generate-decontaminated-read-files))
+- sample_decontam_GLlblMetag.fastq.gz (filtered, trimmed, and HRrm sample reads with all contaminants removed, output from [Step 7e](#7e-generate-decontaminated-read-files))
 
 **Output Data:**
 
@@ -858,8 +862,7 @@ multiqc --zip-data-dir \
 
 ### 8. Host Read Removal
 
-If the samples were derived from a host organism other than human, potential host reads
-should be identified and removed. This step is optional.
+If the samples were derived from a host organism other than human, potential host reads should be identified and removed. This step is optional.
 
 #### 8a. Build Kraken2 Database
 
@@ -867,8 +870,6 @@ should be identified and removed. This step is optional.
 NCBI may require explicit assignment of taxonomy information before they can be used to build the 
 database, as mentioned in the [Kraken2 Documentation](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown).
 
-```bash
-
 ```bash
 # Download NCBI taxonomic information 
 kraken2-build --download-taxonomy --db kraken2-${hostname}-db/
@@ -911,11 +912,11 @@ kraken2 --db kraken2_host_db \
         --use-names \
         --output sample-kraken2-output.txt \
         --report sample-kraken2-report.tsv \
-        --unclassified-out sample_GLlblMetag_HostRm.fastq \
-        sample_trimmed_fastq.gz
+        --unclassified-out sample_HostRm_GLlblMetag.fastq \
+        sample_decontam_GLlblMetag.fastq.gz
 
 # gzip fastq output file
-gzip sample_GLlblMetag_HostRm.fastq
+gzip sample_HostRm_GLlblMetag.fastq
 ```
 
 **Parameter Definitions:**
@@ -927,18 +928,18 @@ gzip sample_GLlblMetag_HostRm.fastq
 - `--output` - Specifies the name of the kraken2 read-based output file (one line per read).
 - `--report` - Specifies the name of the kraken2 report output file (one line per taxa, with number of reads assigned to it).
 - `--unclassified-out` - Specifies the name of the output file containing reads that were not classified, i.e non-human reads.
-- `sample_trimmed.fastq.gz` - Positional argument specifying the input read file.
+- `sample_decontam_GLlblMetag.fastq.gz` - Positional argument specifying the input read file.
 
 **Input Data:**
 
 - kraken2_host_db/ (kraken2 host database directory, output from [Step 8a](#8a-build-kraken2-database))
-- sample_*decontam.fastq.gz (filtered and trimmed sample reads with contaminants removed, output from [Step 5a](#5a-trim-filtered-data))
+- sample_decontam_GLlblMetag.fastq.gz (filtered, trimmed, HRrm and contaminant-removed sample reads, output from [Step 7e](#7e-generate-decontaminated-read-files))
 
 **Output Data:**
 
 - sample-kraken2-output.txt (kraken2 read-based output file (one line per read))
 - sample-kraken2-report.tsv (kraken2 report output file (one line per taxa, with number of reads assigned to it))
-- **sample_GLlblMetag_HostRm.fastq.gz** (filtered and trimmed sample reads with both contaminants and human reads removed, gzipped fasta file)
+- **sample_HostRm_GLlblMetag.fastq.gz** (filtered, trimmed, HRrm and contaminant-removed sample reads with all host reads removed, gzipped fastq file)
 
 
 #### 8c. Compile Host Read Removal QC
@@ -961,7 +962,7 @@ multiqc --zip-data-dir \
 
 **Input Data:**
 
-- /path/to/*kraken2-report.tsv (kraken2 report files, output from [Step 7b](#7b-remove-human-reads))
+- /path/to/*kraken2-report.tsv (kraken2 report files, output from [Step 8b](#8b-remove-host-reads))
 
 **Output Data:**
 
@@ -1869,7 +1870,7 @@ kaiju -f kaiju-db/nr_euk/kaiju_db_nr_euk.fmi \
       -t kaiju-db/nodes.dmp \
       -z NumberOfThreads \
       -E 1e-05 \
-      -i /path/to/sample_GLlblMetag_decontam.fastq.gz \
+      -i /path/to/sample_decontam_GLlblMetag.fastq.gz \
       -o sample_kaiju.out
 ```
 
@@ -1886,7 +1887,7 @@ kaiju -f kaiju-db/nr_euk/kaiju_db_nr_euk.fmi \
 
 - kaiju-db/nr_euk/kaiju_db_nr_euk.fmi (FM-index file containing the main Kaiju database index, output from [Step 9a](#9a-build-kaiju-database))
 - kaiju-db/nodes.dmp (kaiju taxonomy hierarchy nodes file, output from [Step 9a](#9a-build-kaiju-database))
-- sample_GLlblMetag_decontam.fastq.gz or sample_GLlblMetag_HostRm.fastq.gz (filtered and trimmed sample reads with both 
+- sample_decontam_GLlblMetag.fastq.gz or sample_HostRm_GLlblMetag.fastq.gz (filtered and trimmed sample reads with both 
     contaminants and human reads (and optionally host reads) removed, gzipped fasta file, 
     output from [Step 7e](#7e-generate-decontaminated-read-files) or [Step 8b](#8b-remove-host-reads))
 
@@ -2286,7 +2287,7 @@ kraken2 --db kraken2-db/ \
         --use-names \
         --output sample-kraken2-output.txt \
         --report sample-kraken2-report.tsv \
-        /path/to/sample_GLlblMetag_decontam.fastq.gz
+        /path/to/sample_decontam_GLlblMetag.fastq.gz
 ```
 
 **Parameter Definitions:**
@@ -2297,12 +2298,12 @@ kraken2 --db kraken2-db/ \
 - `--use-names` - Specifies to add taxa names in addition to taxids.
 - `--output` - Specifies the name of the kraken2 read-based output file.
 - `--report` - Specifies the name of the kraken2 report output file.
-- `sample_GLlblMetag_decontam.fastq.gz` - Positional argument specifying the input file.
+- `sample_decontam_GLlblMetag.fastq.gz` - Positional argument specifying the input file.
 
 **Input Data:**
 
 - kraken2-db/ (a directory containing kraken2 database files, output from [Step 10a](#10a-download-kraken2-database))
-- sample_GLlblMetag_decontam.fastq.gz or sample_GLlblMetag_HostRm.fastq.gz (filtered and trimmed sample reads with both 
+- sample_decontam_GLlblMetag.fastq.gz or sample_HostRm_GLlblMetag.fastq.gz (filtered and trimmed sample reads with both 
     contaminants and human reads (and, optionally, host reads) removed, gzipped fasta file, 
     output from [Step 7e](#7e-generate-decontaminated-read-files) or [Step 8b](#8b-remove-host-reads))
 
@@ -2636,7 +2637,7 @@ flye --meta \
      --threads NumberOfThreads \
      --out-dir sample/ \
      --nano-hq \
-     /path/to/sample_GLlblMetag_decontam.fastq.gz
+     /path/to/sample_decontam_GLlblMetag.fastq.gz
 
 # rename output files            
 mv sample/assembly.fasta sample_assembly.fasta
@@ -2649,11 +2650,11 @@ mv sample/flye.log sample_assembly.log
 - `--threads` - Number of parallel processing threads to use.
 - `--out-dir` - Specifies the name of the output directory.
 - `--nano-hq` - Specifies that input is from Oxford Nanopore high-quality reads (Guppy5+ SUP or Q20, <5% error). This skips a genome polishing step since the assembly will be polished with medaka in the next step.
-- `/path/to/sample_GLlblMetag_decontam.fastq.gz` - Path to the input file, specified as a positional argument.
+- `/path/to/sample_decontam_GLlblMetag.fastq.gz` - Path to the input file, specified as a positional argument.
 
 **Input Data**
 
-- sample_GLlblMetag_decontam.fastq.gz or sample_GLlblMetag_HostRm.fastq.gz (filtered and trimmed sample reads with both 
+- sample_decontam_GLlblMetag.fastq.gz or sample_HostRm_GLlblMetag.fastq.gz (filtered and trimmed sample reads with both 
     contaminants and human reads (and optionally host reads) removed, gzipped fasta file, 
     output from [Step 7e](#7e-generate-decontaminated-read-files) or [Step 8b](#8b-remove-host-reads))
 
@@ -2670,7 +2671,7 @@ mv sample/flye.log sample_assembly.log
 
 ```bash
 medaka_consensus -t NumberOfThreads \
-                 -i /path/to/sample_GLlblMetag_decontam.fastq.gz \
+                 -i /path/to/sample_decontam_GLlblMetag.fastq.gz \
                  -d /path/to/assemblies/sample_assembly.fasta \
                  -o sample/
   
@@ -2686,7 +2687,7 @@ mv sample/consensus.fasta sample_polished.fasta
 
 **Input Data:**
 
-- sample_GLlblMetag_decontam.fastq.gz or sample_GLlblMetag_HostRm.fastq.gz (filtered and trimmed sample reads with both 
+- sample_decontam_GLlblMetag.fastq.gz or sample_HostRm_GLlblMetag.fastq.gz (filtered and trimmed sample reads with both 
     contaminants and human reads (and optionally host reads) removed, gzipped fasta file, 
     output from [Step 7e](#7e-generate-decontaminated-read-files) or [Step 8b](#8b-remove-host-reads))
 - /path/to/assemblies/sample_assembly.fasta (sample assembly, output from [Step 11](#11-sample-assembly))
@@ -3054,7 +3055,7 @@ minimap2 -a \
          -x map-ont \
          -t NumberOfThreads \
          sample_assembly.fasta \
-         sample_GLlblMetag_decontam.fastq.gz \
+         sample_decontam_GLlblMetag.fastq.gz \
          > sample.sam  2> sample-mapping-info.txt
 ```
 
@@ -3064,14 +3065,14 @@ minimap2 -a \
 - `-x map-ont` - Specifies preset for mapping Nanopore reads to a reference.
 - `-t` - Number of parallel processing threads to use
 - `sample_assembly.fasta` – Assembly fasta file, provided as a positional argument.
-- `sample_GLlblMetag_decontam.fastq.gz` - Input sequence data file, provided as a positional argument.
+- `sample_decontam_GLlblMetag.fastq.gz` - Input sequence data file, provided as a positional argument.
 - `> sample.sam` - Redirects the output to a separate file.
 - `2> sample-mapping-info.txt` - Redirects the standar error to a separate file.
 
 **Input Data**
 
 - sample-assembly.fasta (contig-renamed assembly file, output from [Step 13a](#13a-rename-contig-headers))
-- sample_GLlblMetag_decontam.fastq.gz or sample_GLlblMetag_HostRm.fastq.gz (filtered and trimmed sample reads with both 
+- sample_decontam_GLlblMetag.fastq.gz or sample_HostRm_GLlblMetag.fastq.gz (filtered and trimmed sample reads with both 
     contaminants and human reads (and optionally host reads) removed, gzipped fasta file, 
     output from [Step 7e](#7e-generate-decontaminated-read-files) or [Step 8b](#8b-remove-host-reads))
 

From cb75153a439ff3fcf0d8b62f0fa83ccaf8a5b5f4 Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Mon, 26 Jan 2026 21:14:03 -0800
Subject: [PATCH 24/47] Update GL-DPPD-7117

- Add missing steps (Read-based metaphlan taxonomies and assembly-based
  heatmaps)
- Fix documentation for kraken2-build (adding references to fasta
  acquisition)
- Update/fix table of contents
- Regularize formatting
- Fix typos
- Remove Nanopore-specific tools from software table
---
 .../Low_Biomass/Illumina/GL-DPPD-7117.md      | 1009 +++++++++++------
 1 file changed, 635 insertions(+), 374 deletions(-)

diff --git a/Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md b/Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md
index b9fcf4b2f..9cf180152 100644
--- a/Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md
+++ b/Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md
@@ -4,7 +4,7 @@
 
 ---
 
-**Date:** November MM, 2025  
+**Date:** January MM, 2026  
 **Revision:** -  
 **Document Number:** GL-DPPD-7116  
 
@@ -27,9 +27,9 @@ Barbara Novak (GeneLab Data Processing Lead)
   - [**Pre-processing**](#pre-processing)
     - [1. Raw Data QC](#1-raw-data-qc)
       - [1a. Raw Data QC](#1a-raw-data-qc)
-      - [3b. Compile Raw Data QC](#1b-compile-raw-data-qc)
+      - [1b. Compile Raw Data QC](#1b-compile-raw-data-qc)
     - [2. Human Read Removal](#2-human-read-removal)
-      - [2a. Build Kraken2 Database](#2a-build-kraken2-database)
+      - [2a. Build Kraken2 Human Database](#2a-build-kraken2-human-database)
       - [2b. Remove Human Reads](#2b-remove-human-reads)
       - [2c. Compile Human Read Removal QC](#2c-compile-human-read-removal-qc)
     - [3. Trimming and Quality filtering](#3-trimming-and-quality-filtering)
@@ -41,15 +41,15 @@ Barbara Novak (GeneLab Data Processing Lead)
       - [4a. Assemble Contaminants](#4a-assemble-contaminants)
       - [4b. Build Contaminant Index and Map Reads](#4b-build-contaminant-index-and-map-reads)
       - [4c. Contaminant Removal QC](#4c-contaminant-removal-qc)
-      - [4d. Compile Contaminant Removal QC](#4d-compile-raw-data-qc)
+      - [4d. Compile Contaminant Removal QC](#4d-compile-contaminant-removal-qc)
     - [5. Host read removal](#5-host-read-removal)
-      - [5a.](#5a-build-kraken2-host-database)
-      - [5b.](#5b-remove-host-reads)
-      - [5c.](#5c-compile-host-read-removal-qc)
-    - [6. R Environment Setup](#8-r-environment-setup)
-      - [6a. Load Libraries](#8a-load-libraries)
-      - [6b. Define Custom Functions](#8b-define-custom-functions)
-      - [6c. Set global variables](#8c-set-global-variables)
+      - [5a. Build Kraken2 Host Database](#5a-build-kraken2-host-database)
+      - [5b. Remove Host Reads](#5b-remove-host-reads)
+      - [5c. Compile Host Read Removal QC](#5c-compile-host-read-removal-qc)
+    - [6. R Environment Setup](#6-r-environment-setup)
+      - [6a. Load Libraries](#6a-load-libraries)
+      - [6b. Define Custom Functions](#6b-define-custom-functions)
+      - [6c. Set global variables](#6c-set-global-variables)
   - [**Read-based processing**](#read-based-processing)
     - [7. Taxonomic profiling using kaiju](#7-taxonomic-profiling-using-kaiju)
       - [7a. Build Kaiju Database](#7a-build-kaiju-database)
@@ -58,7 +58,7 @@ Barbara Novak (GeneLab Data Processing Lead)
       - [7d. Convert Kaiju Output To Krona Format](#7d-convert-kaiju-output-to-krona-format)
       - [7e. Compile Kaiju Krona Reports](#7e-compile-kaiju-krona-reports)
       - [7f. Create Kaiju Species Count Table](#7f-create-kaiju-species-count-table)
-      - [7g. Read-in Tables](#7g-read-in-tables)
+      - [7g. Filter Kaiju Species Count Table](#7g-filter-kaiju-species-count-table)
       - [7h. Taxonomy Barplots](#7h-taxonomy-barplots)
       - [7i. Feature Decontamination](#7i-feature-decontamination)
     - [8. Taxonomic Profiling Using Kraken2](#8-taxonomic-profiling-using-kraken2)
@@ -69,16 +69,24 @@ Barbara Novak (GeneLab Data Processing Lead)
         - [8cii. Compile Kraken2 Taxonomy Reports](8cii-compile-kraken2-taxonomy-reports)
       - [8d. Convert Kraken2 Output to Krona Format](#8d-convert-kraken2-output-to-krona-format)
       - [8e. Compile Kraken2 Krona Reports](#8e-compile-kraken2-krona-reports)
-      - [8f. Create Kraken2 Species Count Table](#8f-create-kraken2-species-count-table)
-      - [8g. Read-in Tables](#8g-read-in-tables)
-      - [8h. Taxonomy Barplots](#8h-taxonomy-barplots)
-      - [8i. Feature Decontamination](#8i-feature-decontamination)
+      - [8f. Filter Kraken2 Species Count Table](#8f-filter-kraken2-species-count-table)
+      - [8g. Taxonomy Barplots](#8g-taxonomy-barplots)
+      - [8h. Feature Decontamination](#8h-feature-decontamination)
     - [9. Taxonomic Profiling Using MetaPhlan](#9-taxonomic-profiling-using-metaphlan)
       - [9a. Download and install HUMAnN databases](#9a-download-and-install-humann-databases)
       - [9b. HUMAnN/MetaPhlAn Taxonomic Classification](#9b-humannmetaphlan-taxonomic-classification)
-      - [9c. Merge multiple sample functional profiles](#9c-merge-multiple-sample-functional-profiles)
-      - [9e. Normalize gene families and pathway abundances tables](#9e-normalize-gene-families-and-pathway-abundances-tables)
-      - [9f. Generate a normalized gene-family table grouped by Kegg Orthologs (KOs)](#9f-generate-a-normalized-gene-family-table-grouped-by-kegg-orthologs-kos)
+      - [9c. Merge Multiple Sample Functional Profiles](#9c-merge-multiple-sample-functional-profiles)
+      - [9d. Split Results Tables](#9d-split-results-tables)
+      - [9e. Normalize Gene Families and Pathway Abundances Tables](#9e-normalize-gene-families-and-pathway-abundances-tables)
+      - [9f. Generate Normalized Gene-family Table Grouped by Kegg Orthologs (KOs)](#9f-generate-normalized-gene-family-table-grouped-by-kegg-orthologs-kos)
+      - [9g. Combine MetaPhlan Taxonomy Tables](#9g-combine-metaphlan-taxonomy-tables)
+      - [9h. Create MetaPhlan Species Count Table](#9h-process-metaphlan)
+        - [9hi. Get Sample Read Counts](#9hi-get-sample-read-counts)
+        - [9hii. Process MetaPhlan Taxonomy Table](#9hii-process-metaphlan-taxonomy-table)
+      - [9i. Filter MetaPhlan Species Count Table](#9i-filter-metaphlan-species-count-table)
+      - [9j. Taxonomy Barplots](#9j-taxonomy-barplots)
+      - [9k. Feature Decontamination](#9k-feature-decontamination)
+  - [**Assembly-based Processing**](#assembly-based-processing)
     - [10. Sample Assembly](#10-sample-assembly)
     - [11. Rename Contigs and Summarize Assemblies](#11-rename-contigs-and-summarize-assemblies)
       - [11a. Rename Contig Headers](#11a-rename-contig-headers)
@@ -99,8 +107,8 @@ Barbara Novak (GeneLab Data Processing Lead)
       - [14f. Format Contig-level Output With awk and sed](#14f-format-contig-level-output-with-awk-and-sed)
     - [15. Read-Mapping](#15-read-mapping)
       - [15a. Build Reference Index](#15a-build-reference-index)
-      - [15a. Align Reads to Sample Assembly](#15b-align-reads-to-sample-assembly)
-      - [15b. Sort and Index Assembly Alignments](#15c-sort-and-index-assembly-alignments)
+      - [15b. Align Reads to Sample Assembly](#15b-align-reads-to-sample-assembly)
+      - [15c. Sort and Index Assembly Alignments](#15c-sort-and-index-assembly-alignments)
     - [16. Get Coverage Information and Filter Based On Detection](#16-get-coverage-information-and-filter-based-on-detection)
       - [16a. Filter Coverage Levels Based On Detection](#16a-filter-coverage-levels-based-on-detection)
       - [16b. Filter Gene and Contig Coverage Based On Detection](#16b-filter-gene-and-contig-coverage-based-on-detection)
@@ -108,13 +116,7 @@ Barbara Novak (GeneLab Data Processing Lead)
     - [18. Combine Contig-level Coverage and Taxonomy For Each Sample](#18-combine-contig-level-coverage-and-taxonomy-for-each-sample)
     - [19. Generate Normalized, Gene- and Contig-level Coverage Summary Tables of KO-annotations and Taxonomy Across Samples](#19-generate-normalized-gene--and-contig-level-coverage-summary-tables-of-ko-annotations-and-taxonomy-across-samples)
       - [19a. Generate Gene-level Coverage Summary Tables](#19a-generate-gene-level-coverage-summary-tables)
-      - [19b. Gene-level taxonomy heatmaps](#19b-gene-level-taxonomy-heatmaps)
-      - [19c. Gene-level taxonomy decontamination](#19c-gene-level-taxonomy-decontamination)
-      - [19d. Gene-level KO functions heatmaps](#19d-gene-level-ko-functions-heatmaps)
-      - [19e. Gene-level KO functions decontamination](#19e-gene-level-ko-functions-decontamination)
-      - [19f. Generate contig-level coverage summary tables](#19f-generate-contig-level-coverage-summary-tables)
-      - [19g. Contig-level Heatmaps](#19g-contig-level-heatmaps)
-      - [19h. Contig-level decontamination](#19h-contig-level-decontamination)
+      - [19b. Generate Contig-level Coverage Summary Tables](#19b-generate-contig-level-coverage-summary-tables)
     - [20. **M**etagenome-**A**ssembled **G**enome (MAG) recovery](#20-metagenome-assembled-genome-mag-recovery)
       - [20a. Bin Contigs](#20a-bin-contigs)
       - [20b. Bin Quality Assessment](#20b-bin-quality-assessment)
@@ -124,14 +126,13 @@ Barbara Novak (GeneLab Data Processing Lead)
     - [21. Generate MAG-level Functional Summary Overview](#21-generate-mag-level-functional-summary-overview)
       - [21a. Get KO Annotations Per MAG](#21a-get-ko-annotations-per-mag)
       - [21b. Summarize KO Annotations With KEGG-Decoder](#21b-summarize-ko-annotations-with-kegg-decoder)
-    - [22. Decontamination and Visualizaiton of Contig- and Gene-taxonomy and gene function outputs](#22-decontamination-and-visualization-of-contig--and-gene-taxonomy-and-gene-function-outputs)
-      - [22a. Gene-level taxonomy heatmaps](#22a-gene-level-taxonomy-heatmaps)
-      - [22b. Gene-level taxonomy decontamination](#22b-gene-level-taxonomy-decontamination)
-      - [22c. Gene-level KO functions heatmaps](#22c-gene-level-ko-functions-heatmaps)
-      - [22d. Gene-level KO functions decontamination](#22d-gene-level-ko-functions-decontamination)
-      - [22e. Contig-level heatmaps](#22e-contig-level-heatmaps)
-      - [22f. Contig-level decontamination](#22f-contig-level-decontamination)
-
+    - [22. Decontamination and Visualization of Contig- and Gene-taxonomy and Gene-function Outputs](#22-decontamination-and-visualization-of-contig--and-gene-taxonomy-and-gene-function-outputs)
+      - [22a. Gene-level Taxonomy Heatmaps](#22a-gene-level-taxonomy-heatmaps)
+      - [22b. Gene-level Taxonomy Decontamination](#22b-gene-level-taxonomy-decontamination)
+      - [22c. Gene-level KO Functions Heatmaps](#22c-gene-level-ko-functions-heatmaps)
+      - [22d. Gene-level KO Functions Decontamination](#22d-gene-level-ko-functions-decontamination)
+      - [22e. Contig-level Heatmaps](#22e-contig-level-heatmaps)
+      - [22f. Contig-level Decontamination](#22f-contig-level-decontamination)
 
 ---
 
@@ -143,8 +144,6 @@ Barbara Novak (GeneLab Data Processing Lead)
 |bit| 1.8.53 |[https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit](https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit)|
 |CAT| 5.2.3 |[https://github.com/dutilh/CAT#cat-and-bat](https://github.com/dutilh/CAT#cat-and-bat)|
 |CheckM| 1.1.3 |[https://github.com/Ecogenomics/CheckM](https://github.com/Ecogenomics/CheckM)|
-|Dorado| 1.1.1| [https://github.com/nanoporetech/dorado](https://github.com/nanoporetech/dorado)|
-|filtlong| 0.2.1 |[https://github.com/rrwick/Filtlong](https://github.com/rrwick/Filtlong)|
 |SPAdes| 4.1.0 | [https://github.com/ablab/spades](https://github.com/ablab/spades) |
 |GTDB-Tk| 2.4.0 |[https://github.com/Ecogenomics/GTDBTk](https://github.com/Ecogenomics/GTDBTk)|
 |HUMAnN| 3.9 |[https://github.com/biobakery/humann](https://github.com/biobakery/humann)|
@@ -158,8 +157,6 @@ Barbara Novak (GeneLab Data Processing Lead)
 |MultiQC| 1.27.1 |[https://multiqc.info/](https://multiqc.info/)|
 |Medaka| 2.1.1 | [https://github.com/nanoporetech/medaka](https://github.com/nanoporetech/medaka) |
 |MetaPhlAn| 4.1.0 |[https://github.com/biobakery/MetaPhlAn](https://github.com/biobakery/MetaPhlAn)|
-|NanoPlot| 1.44.1 | [https://github.com/wdecoster/NanoPlot](https://github.com/wdecoster/NanoPlot)|
-|Porechop| 0.2.4 | [https://github.com/rrwick/Porechop](https://github.com/rrwick/Porechop) |
 |Prodigal| 2.6.3 |[https://github.com/hyattpd/Prodigal#prodigal](https://github.com/hyattpd/Prodigal#prodigal)|
 |samtools| 1.22.1 |[https://github.com/samtools/samtools#samtools](https://github.com/samtools/samtools#samtools)|
 | R | 4.5.1 | [https://www.r-project.org](https://www.r-project.org) |
@@ -230,46 +227,46 @@ multiqc --zip-data-dir \
 - **raw_multiqc_report/raw_multiqc_GLlbsMetag.html** (multiqc output html summary)
 - **raw_multiqc_report/raw_multiqc_GLlbsMetag_data.zip** (zip archive containing multiqc output data)
 
-
 <br>  
 
 ---
 
 ### 2. Human Read Removal
 
-#### 2a. Build Kraken2 Database
+#### 2a. Build Kraken2 Human Database
+
+> **Note:** It is recommended to use NCBI genome files with kraken2 because sequences not downloaded from 
+NCBI may require explicit assignment of taxonomy information before they can be used to build the 
+database, as mentioned in the [Kraken2 Documentation](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown).
 
 ```bash
-kraken2-build --download-library human \
-              --db kraken2_human_db \
-              --threads numberOfThreads \
-              --no-masking
+# Download NCBI taxonomic information 
+kraken2-build --download-taxonomy --db kraken2_human_db/
 
-kraken2-build --download-taxonomy \
-              --db kraken2_human_db/
+# Add genomic sequences to your database's genomic library
+kraken2-build --add-to-library human.fasta --db kraken2_human_db/ --no-masking
+
+# Build the database
+kraken2-build --build --db kraken2_human_db/ --kmer-len 35 --minimizer-len 31
 
-kraken2-build --build \
-              --db kraken2_human_db/ \
-              --threads numberOfThreads
- 
-kraken2-build --clean \
-              --db kraken2_human_db/
+# Clean up intermediate files
+kraken2-build --clean --db kraken2_human_db/
 ```
 
 **Parameter Definitions:**
-
-- `--download-library` - Specifies the reference name/type to download.
-- `--db` - Specifies the directory to put the database in.
-- `--threads` - Number of parallel processing threads to use.
-- `--no-masking` - Prevents masking of low-complexity sequences. For additional 
+- `--download-taxonomy` - Instructs kraken2-build to download the NCBI taxonomic information.
+- `--db` - Specifies the name of the directory for the kraken2 database.
+- `--add-to-library` - Instructs kraken2-build to add the contents of a file to the kraken2 DB library.
+  - `--no-masking` - Disables masking of low-complexity sequences. For additional 
                    information see the [kraken documentation for masking](https://github.com/DerrickWood/kraken2/wiki/Manual#masking-of-low-complexity-sequences).
-- `--download-taxonomy` - Downloads taxonomic mapping information.
-- `--build` - Specifies to construct kraken2-formatted database.
-- `--clean` - Specifies to remove unnecessary intermediate files.
+- `--build` - Instructs kraken2-build to build the kraken2 DB from the library files.
+  - `--kmer-len` - K-mer length in bp (default: 35).
+  - `--minimizer-len` - Minimizer length in bp (default: 31).
+- `--clean` - Instructs kraken2-build to remove unneeded intermediate files.
 
 **Input Data:**
 
-- `human` - database name to download (specified with the `--download-library` parameter above)
+- `human.fasta` (fasta file containing human genome, for example, the human genome fasta downloaded from https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.fna.gz)
 
 **Output Data:**
 
@@ -289,12 +286,11 @@ kraken2 --db kraken2_human_db \
         sample1_R1_raw.fastq.gz sample1_R2_raw.fastq.gz
 
 # rename and gzip output files
-mv sample1_R_1.fastq sample1_GLlbsMetag_R1_HRrm.fastq && \
-gzip sample1_GLlbsMetag_R1_HRrm.fastq
-
-mv  sample1_R_2.fastq sample1_GLlbsMetag_R2_HRrm.fastq && \
-gzip sample1_GLlbsMetag_R2_HRrm.fastq
+mv sample1_R_1.fastq sample1_R1_HRrm_GLlbsMetag.fastq && \
+gzip sample1_R1_HRrm_GLlbsMetag.fastq
 
+mv  sample1_R_2.fastq sample1_R2_HRrm_GLlbsMetag.fastq && \
+gzip sample1_R2_HRrm_GLlbsMetag.fastq
 ```
 
 **Parameter Definitions:**
@@ -310,14 +306,14 @@ gzip sample1_GLlbsMetag_R2_HRrm.fastq
 
 **Input Data:**
 
-- kraken2_human_db/ (kraken2 human database directory, output from [Step 7a](#7a-build-kraken2-database))
+- kraken2_human_db/ (kraken2 human database directory, output from [Step 2a](#2a-build-kraken2-human-database))
 - *raw.fastq.gz (raw reads)
 
 **Output Data:**
 
 - sample1-kraken2-output.txt (kraken2 read-based output file (one line per read))
 - sample1-kraken2-report.tsv (kraken2 report output file (one line per taxa, with number of reads assigned to it))
-- **sample1_GLlbsMetag_raw_HRrm.fastq.gz** (raw sample reads with human reads removed, gzipped fasta file)
+- **sample1_raw_HRrm_GLlbsMetag.fastq.gz** (raw sample reads with human reads removed, gzipped fastq file)
 
 
 #### 2c. Compile Human Read Removal QC
@@ -340,14 +336,14 @@ multiqc --zip-data-dir \
 
 **Input Data:**
 
-- /path/to/*kraken2-report.tsv (kraken2 report files, output from [Step 7b](#7b-remove-human-reads))
+- /path/to/*kraken2-report.tsv (kraken2 report files, output from [Step 2b](#2b-remove-human-reads))
 
 **Output Data:**
 
 - **HRrm_multiqc_GLlbsMetag.html** (multiqc output html summary)
 - **HRrm_multiqc_GLlbsMetag_data.zip** (zip archive containing multiqc output data)
 
-<br>
+<br>  
 
 ---
 
@@ -367,6 +363,7 @@ fastp --in1 sample1_R1_raw.fastq.gz --out1 temp_sample1_R1_filtered.fastq.gz \
 ```
 
 **Parameter Definitions:**
+
 - `--in1` - Specifies the forward input read file
 - `--in2` - Specifies the reverse input read file
 - `--in1` - Specifies the forward output read file
@@ -381,7 +378,7 @@ fastp --in1 sample1_R1_raw.fastq.gz --out1 temp_sample1_R1_filtered.fastq.gz \
 
 **Input Data:**
 
-- *raw.fastq.gz (raw reads)
+- *raw_HRrm_GLlbsMetag.fastq.gz (raw sample reads with human reads removed, from [Step 2b](#2b-remove-human-reads))
 
 **Output Data:**
 
@@ -390,8 +387,8 @@ fastp --in1 sample1_R1_raw.fastq.gz --out1 temp_sample1_R1_filtered.fastq.gz \
 #### 3b. Trim polyG
 
 ```bash
-fastp --in1 temp_sample1_R1_filtered.fastq.gz --out1 sample1_R1_filtered.fastq.gz \
-      --in2 temp_sample1_R2_filtered.fastq.gz --out2 sample1_R2_filtered.fastq.gz \
+fastp --in1 temp_sample1_R1_filtered.fastq.gz --out1 sample1_R1_filtered_GLlbsMetag.fastq.gz \
+      --in2 temp_sample1_R2_filtered.fastq.gz --out2 sample1_R2_filtered_GLlbsMetag.fastq.gz \
       --qualified_quality_phred  20 \
       --length_required 50 \
       --thread 2 \
@@ -402,6 +399,7 @@ fastp --in1 temp_sample1_R1_filtered.fastq.gz --out1 sample1_R1_filtered.fastq.g
 ```
 
 **Parameter Definitions:**
+
 - `--in1` - Specifies the forward input read file
 - `--in2` - Specifies the reverse input read file
 - `--in1` - Specifies the forward output read file
@@ -417,11 +415,11 @@ fastp --in1 temp_sample1_R1_filtered.fastq.gz --out1 sample1_R1_filtered.fastq.g
 
 **Input Data:**
 
-- /path/to/filtered_data/temp_sample1*.fastq.gz (raw reads, output from [Step 2a](#2a-filter-quality-and-trim-adapters)
+- /path/to/filtered_data/temp_sample1*.fastq.gz (first-round quality-filtered and adapter-trimmed reads, output from [Step 3a](#3a-filter-quality-and-trim-adapters))
 
 **Output Data:**
 
-- *filtered.fastq.gz (quality filtered and adapter trimmed reads)
+- **\*filtered_GLlbsMetag.fastq.gz** (quality-filtered, adapter- and polyG-trimmed reads with human reads removed)
 
 #### 3c. Filtered Data QC
 
@@ -432,11 +430,11 @@ fastqc -o filtered_fastqc_output *filtered.fastq.gz
 **Parameter Definitions:**
 
 - `-o` – the output directory to store results
-- `*filtered.fastq.gz` – the input reads are specified as a positional argument, and can be given all at once with wildcards like this, or as individual arguments with spaces in between them
+- `*filtered_GLlbsMetag.fastq.gz` – the input reads are specified as a positional argument, and can be given all at once with wildcards like this, or as individual arguments with spaces in between them
 
 **Input data:**
 
-- *filtered.fastq.gz (trimmed and filtered reads, from [Step 2b](#2b-trim-polyg))
+- *filtered_GLlbsMetag.fastq.gz (trimmed and filtered reads, from [Step 3b](#3b-trim-polyg))
 
 **Output data:**
 
@@ -464,14 +462,14 @@ multiqc --zip-data-dir \
 
 **Input Data:**
 
-- /path/to/filtered_fastqc_output/*fastqc.zip (FastQC output data, from [Step 2c](#2c-filtered-data-qc))
+- /path/to/filtered_fastqc_output/*fastqc.zip (FastQC output data, from [Step 3c](#3c-filtered-data-qc))
 
 **Output Data:**
 
 - **filtered_multiqc_report/filtered_multiqc_GLlbsMetag.html** (multiqc output html summary)
 - **filtered_multiqc_report/filtered_multiqc_GLlbsMetag_data.zip** (zip archive containing multiqc output data)
 
-<br>
+<br>  
 
 ---
 
@@ -479,11 +477,11 @@ multiqc --zip-data-dir \
 
 > A major issue with low biomass data is the high potential for contamination due to the low amount of DNA extracted from the samples. Because negative control/blank samples should by theory be contaminant free, any sequence detected in the negative control is a potential contaminant. To filter out contaminants found in negative control samples that may have been due to cross contamination in the lab, we use a read mapping approach. First negative/blank control sample reads are assembled then the filtered and trimmed reads from each low-biomass sample are mapped to the assembled contigs from the negative/blank control samples. Reads mapping to the assembled contigs are categorized as contaminants and are therefore filtered out and thus excluded from downstream analyses.
 
-### 4a. Assemble Contaminants
+#### 4a. Assemble Contaminants
 
 ```bash
-cat /path/to/contaminant_fastq/*_R1_filtered.fastq.gz > mreged_R1.fastq.gz
-cat /path/to/contaminant_fastq/*_R2_filtered.fastq.gz > mreged_R2.fastq.gz
+cat /path/to/contaminant_fastq/*_R1_filtered_GLlbsMetag.fastq.gz > merged_R1.fastq.gz
+cat /path/to/contaminant_fastq/*_R2_filtered_GLlbsMetag.fastq.gz > merged_R2.fastq.gz
 
 spades.py --meta \
      --threads 8 \
@@ -506,7 +504,7 @@ mv spades.log blank-assembly.log
 
 **Input Data**
 
-- *_R[12]_filtered.fastq.gz (one or more paired-end, trimmed and filtered, HRrm reads from blank (negative control) samples, output from [Step 3b](#3b-trim-polyg))
+- *_R[12]_filtered_GLlbsMetag.fastq.gz (one or more paired-end, trimmed and filtered, HRrm reads from blank (negative control) samples, output from [Step 3b](#3b-trim-polyg))
 
 **Output Data**
 
@@ -525,14 +523,14 @@ bowtie2-build /path/to/contaminant_assembly/blank-scaffolds.fasta /path/to/blank
 bowtie2 -p NumberOfThreads \
        -x /path/to/blank-index/blanks \
        --very-sensitive-local \
-       -1 sample1_GLlbsMetag_R1_filtered.fastq.gz \
-       -2 sample2_GLlbsMetag_R2_filtered.fastq.gz \
+       -1 sample1_R1_filtered_GLlbsMetag.fastq.gz \
+       -2 sample1_R2_filtered_GLlbsMetag.fastq.gz \
        --un-conc-gz sample1_decontam.fastq.gz
        > sample1.sam 2> sample1-mapping-info.txt
 
 # rename blank removed fastq files
-mv sample1_decontam.fastq.1.gz sample1_GLlbsMetag_R1_decontam.fastq.gz
-mv sample1_decontam.fastq.2.gz sample1_GLlbsMetag_R2_decontam.fastq.gz
+mv sample1_decontam.fastq.1.gz sample1_R1_decontam_GLlbsMetag.fastq.gz
+mv sample1_decontam.fastq.2.gz sample1_R2_decontam_GLlbsMetag.fastq.gz
 
 # remove intermediate file
 rm -rf sample1.sam
@@ -557,35 +555,38 @@ rm -rf sample1.sam
 **Input Data**
 
 - /path/to/contaminant_assembly/blank-scaffolds.fasta (contaminant assembly, output from [Step 4a](#4a-assemble-contaminants))
-- sample1_GLlbsMetag_R[12]_filtered.fastq.gz (filtered and trimmed reads, output from [Step 3b](#3b-trim-polyg))
+- sample1_R[12]_filtered_GLlbsMetag.fastq.gz (filtered and trimmed reads, output from [Step 3b](#3b-trim-polyg))
 
 **Output Data**
 
-- sample1_GLlbsMetag_R[12]_decontam.fastq.gz (decontaminated reads)
+- sample1_R[12]_decontam_GLlbsMetag.fastq.gz (decontaminated reads)
 - sample-mapping-info.txt (bowtie2 mapping log file)
 
+<br>
+
 #### 4c. Contaminant Removal QC
 
 ```bash
-fastqc -o decontam_fastqc_output *decontam.fastq.gz
+fastqc -o decontam_fastqc_output *decontam_GLlbsMetag.fastq.gz
 ```
 
 **Parameter Definitions:**
 
 - `-o` – the output directory to store results
-- `*decontam.fastq.gz` – the input reads are specified as a positional argument, and can be given all at once with wildcards like this, or as individual arguments with spaces in between them
+- `*decontam_GLlbsMetag.fastq.gz` – the input reads are specified as a positional argument, and can be given all at once with wildcards like this, or as individual arguments with spaces in between them
 
 **Input data:**
 
-- *decontam.fastq.gz (decontaminated reads)
+- *decontam_GLlbsMetag.fastq.gz (decontaminated reads)
 
 **Output data:**
 
 - *fastqc.html (FastQC output html summary)
 - *fastqc.zip (FastQC output data)
 
+<br>
 
-#### 4d. Compile Raw Data QC
+#### 4d. Compile Contaminant Removal QC
 
 ```bash
 multiqc --zip-data-dir \
@@ -612,11 +613,14 @@ multiqc --zip-data-dir \
 - **decontam_multiqc_report/decontam_multiqc_GLlbsMetag.html** (multiqc output html summary)
 - **decontam_multiqc_report/decontam_multiqc_GLlbsMetag_data.zip** (zip archive containing multiqc output data)
 
+<br>  
+
 ---
+
 ### 5. Host Read Removal
 
 If the samples were derived from a host organism other than human, potential host reads
-should be identified and removed. This step is optional.
+should be identified and removed. This step is optional. 
 
 #### 5a. Build Kraken2 Host Database
 
@@ -624,45 +628,47 @@ should be identified and removed. This step is optional.
 NCBI may require explicit assignment of taxonomy information before they can be used to build the 
 database, as mentioned in the [Kraken2 Documentation](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown).
 
-```bash
-
 ```bash
 # Download NCBI taxonomic information 
 kraken2-build --download-taxonomy --db kraken2-${hostname}-db/
 
 # Add genomic sequences to your database's genomic library
-kraken2-build --add-to-library ${hostname}.fasta --db kraken2-${hostname}-db/ \
-              --no-masking --kmer-length 35 --minimizer-length 31
+kraken2-build --add-to-library ${hostname}.fasta --db kraken2-${hostname}-db/ --no-masking 
 
 # Build the database
-kraken2-build --build --db kraken2-${hostname}-db/
+kraken2-build --build --db kraken2-${hostname}-db/ --kmer-len 35 --minimizer-len 31
 
 # Clean up intermediate files
 kraken2-build --clean --db kraken2-${hostname}-db/
 ```
+
 **Parameter Definitions:**
+
 - `--download-taxonomy` - Instructs kraken2-build to download the NCBI taxonomic information.
 - `--db` - Specifies the name of the directory for the kraken2 database
 - `--add-to-library` - Instructs kraken2-build to add the contents of a file (`${hostname}.fasta`) to the kraken2 DB library
-- `--no-masking` - Disables masking of low-complexity sequences. For additional 
-                   information see the [kraken documentation for masking](https://github.com/DerrickWood/kraken2/wiki/Manual#masking-of-low-complexity-sequences).
+  - `--no-masking` - Disables masking of low-complexity sequences. For additional 
+                     information see the [kraken documentation for masking](https://github.com/DerrickWood/kraken2/wiki/Manual#masking-of-low-complexity-sequences).
 - `--build` - Instructs kraken2-build to build the kraken2 DB from the library files
+  - `--kmer-len` - K-mer length in bp (default: 35).
+  - `--minimizer-len` - Minimizer length in bp (default: 31).
 - `--clean` - Instructs kraken2-build to remove unneeded intermediate files.
 - `{$hostname}` - Specifies the name of the host organism used to uniquely identify the kraken2 database
 
 **Input Data:**
 
-- `${hostname}.fasta` (fasta file containing host genome)
+- `${hostname}.fasta` (fasta file containing host genome, for example, the mouse genome fasta downloaded from https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.27_GRCm39/GCF_000001635.27_GRCm39_genomic.fna.gz)
 
 **Output Data:**
 
-- kraken2_${hostname}_db/ - Kraken2 host database directory, containing hash.k2d, opts.k2d, and taxo.k2d files 
+- kraken2-${hostname}-db/ (Kraken2 host database directory, containing hash.k2d, opts.k2d, and taxo.k2d files)
+
+<br>
 
 #### 5b. Remove Host Reads
 
 ```bash
-
-kraken2 --db kraken2_human_db \
+kraken2 --db kraken2-${hostname}-db \
         --gzip-compressed \
         --threads NumberOfThreads \
         --use-names \
@@ -672,11 +678,11 @@ kraken2 --db kraken2_human_db \
         sample1_R1_decontam.fastq.gz sample1_R2_decontam.fastq.gz
 
 # rename and gzip output files
-mv sample1_R_1.fastq sample1_GLlbsMetag_R1_HostRm.fastq && \
-gzip sample1_GLlbsMetag_R1_HostRm.fastq
+mv sample1_R_1.fastq sample1_R1_HostRm_GLlbsMetag.fastq && \
+gzip sample1_R1_HostRm_GLlbsMetag.fastq
 
-mv  sample1_R_2.fastq sample1_GLlbsMetag_R2_HostRm.fastq && \
-gzip sample1_GLlbsMetag_R2_HostRm.fastq
+mv  sample1_R_2.fastq sample1_R2_HostRm_GLlbsMetag.fastq && \
+gzip sample1_R2_HostRm_GLlbsMetag.fastq
 ```
 
 **Parameter Definitions:**
@@ -687,19 +693,19 @@ gzip sample1_GLlbsMetag_R2_HostRm.fastq
 - `--use-names` - Specifies adding taxa names in addition to taxon IDs.
 - `--output` - Specifies the name of the kraken2 read-based output file (one line per read).
 - `--report` - Specifies the name of the kraken2 report output file (one line per taxa, with number of reads assigned to it).
-- `--unclassified-out` - Specifies the name of the output file containing reads that were not classified, i.e non-human reads.
-- `sample1_R1_decontam.fastq.gz sample1_R2_decontam.fastq.gz` - Positional argument specifying the input read files.
+- `--unclassified-out` - Specifies the name of the output file containing reads that were not classified, i.e., non-host reads.
+- `sample1_R1_decontam_GLlbsMetag.fastq.gz sample1_R2_decontam_GLlbsMetag.fastq.gz` - Positional argument specifying the input read files.
 
 **Input Data:**
 
 - kraken2_host_db/ (kraken2 host database directory, output from [Step 5a](#5a-build-kraken2-host-database))
-- sample_*decontam.fastq.gz (filtered and trimmed sample reads with contaminants removed, output from [Step 4b](#4b-build-contaminant-index-and-map-reads))
+- sample_*decontam_GLlbsMetag.fastq.gz (filtered and trimmed sample reads with contaminants removed, output from [Step 4b](#4b-build-contaminant-index-and-map-reads))
 
 **Output Data:**
 
 - sample-kraken2-output.txt (kraken2 read-based output file (one line per read))
 - sample-kraken2-report.tsv (kraken2 report output file (one line per taxa, with number of reads assigned to it))
-- **sample_GLlbsMetag_HostRm.fastq.gz** (filtered and trimmed sample reads with both contaminants and human reads removed, gzipped fasta file)
+- **sample_HostRm_GLlbsMetag.fastq.gz** (filtered and trimmed sample reads with contaminants, human, and host reads removed, gzipped fastq file)
 
 
 #### 5c. Compile Host Read Removal QC
@@ -747,9 +753,9 @@ library(pheatmap)
 library(pavian)
 ```
 
-#### 8b. Define Custom Functions
+#### 6b. Define Custom Functions
 
-##### get_last_assignment()
+#### get_last_assignment()
 <details>
   <summary>retrieves the last taxonomy assignment from a taxonomy string</summary>
 
@@ -783,7 +789,7 @@ library(pavian)
   **Returns:** the last taxonomy assignment listed in the `taxonomy_string`
 </details>
 
-##### mutate_taxonomy()
+#### mutate_taxonomy()
 <details>
   <summary>mutate taxonomy column to contain the lowest taxonomy assignment</summary>
 
@@ -805,6 +811,7 @@ library(pavian)
     return(df)
   }
   ```
+
   **Custom Functions Used:**
   - [get_last_assignment()](#get_last_assignment)
 
@@ -816,7 +823,7 @@ library(pavian)
 
 </details>
 
-##### process_kaiju_table()
+#### process_kaiju_table()
 <details>
   <summary>reformat kaiju output table</summary>
 
@@ -854,6 +861,7 @@ library(pavian)
     return(abs_abun_matrix)
   }
   ```
+
   **Custom Functions Used:**
   - [mutate_taxonomy()](#mutate_taxonomy)
 
@@ -866,7 +874,7 @@ library(pavian)
 </details>
 
 
-##### merge_kraken_reports()
+#### merge_kraken_reports()
 <details>
   <summary>merge and process multiple kraken outputs to one species table</summary>
 
@@ -906,9 +914,6 @@ library(pavian)
     return(species_table)
   }
   ```
-  **Custom Functions Used:**
-  - [read_reports()]()
-
 
   **Function Parameter Definitions:**
   - `reports_dir` - path to a directory containing kraken2 reports 
@@ -917,7 +922,7 @@ library(pavian)
 
 </details>
 
-##### get_abundant_features()
+#### get_abundant_features()
 <details>
   <summary>Find abundant features based on the sum of feature values</summary>
   
@@ -933,6 +938,7 @@ library(pavian)
     return(abund_features.m)
   }
   ```
+
   **Function Parameter Definitions:**
   - `mat` - a feature count matrix with features as rows and samples as columns
   - `cpm_threshold = 1000` - threshold to identify abundant features
@@ -940,7 +946,7 @@ library(pavian)
   **Returns:** a species relative abundance matrix with samples and species as rows and column, respectively.
 </details>
 
-##### count_to_rel_abundance()
+#### count_to_rel_abundance()
 <details>
   <summary>Convert species count matrix to relative abundance matrix</summary>
 
@@ -974,7 +980,7 @@ library(pavian)
 </details>
 
 
-##### filter_rare()
+#### filter_rare()
 <details>
   <summary>filter out rare and non_microbial taxonomy assignments based on relative abundance</summary>
 
@@ -1015,7 +1021,7 @@ library(pavian)
   **Returns:** a dataframe with rare and non_microbial/unwanted species removed
 </details>
 
-##### group_low_abund_taxa()
+#### group_low_abund_taxa()
 <details>
   <summary>Group rare taxa or return a table with only rare taxa</summary>
 
@@ -1077,7 +1083,7 @@ library(pavian)
 
 ##### make_plot()
 <details>
-  <summary>create bar plot of relative abundance</summary>
+  <summary>Create stacked bar plots of relative abundance from input dataframes</summary>
 
   ```R
   # Make bar plot
@@ -1122,12 +1128,12 @@ library(pavian)
 
 ##### make_barplot()
 <details>
-  <summary>Creates barplots from a feature table file</summary>
+  <summary>Parse metadata and feature table files to create stacked bar plots of relative abundance</summary>
   
   ```R
   make_barplot <- function(metadata_table_file, feature_table_file, 
                            feature_column = "species", samples_column = "sample_id", group_column = "group", 
-                           output_prefix, assay_suffix = "_GLlblMetag",
+                           output_prefix, assay_suffix = "_GLlbsMetag",
                            publication_format, custom_palette) {
     # Prepare feature table
     feature_table <- read_csv(feature_table_file)
@@ -1155,6 +1161,7 @@ library(pavian)
 
   }
   ```
+
   **Custom Functions Used:**
   - [make_plot()](#make_plot)
   - [count_to_rel_abundance()](#count_to_rel_abundance)
@@ -1168,11 +1175,11 @@ library(pavian)
   - `group_column` - a character string specifying the column in `metadata` used to facet/group plots, default: "group"
   - `output_prefix` - a character string specifying the unique name to add to the output file names 
                       used to denote the data type/source, for example "unfiltered-kaiju_species"
-  - `assay_suffix` - a character string specifying the GeneLab assay suffix (default: "_GLlblMetag")
+  - `assay_suffix` - a character string specifying the GeneLab assay suffix (default: "_GLlbsMetag")
   - `publication_format` - a ggplot::theme object specifying a custom theme for plotting, from [Step 8c](#8c-set-global-variables)
   - `custom_palette` - a vector of strings specifying a custom color palette for coloring plots, from [Step 8c](#8c-set-global-variables)
 
-  **Returns:** a relative abundance stacked bar plot
+  **Returns:** a relative abundance stacked bar plot, as returned by [make_plot()](#make_plot)
 
 </details>
 
@@ -1181,25 +1188,10 @@ library(pavian)
   <summary>Creates heatmaps from a feature table file</summary>
   
   ```R
-  make_heatmap <- function(metadata, species_gene_table, 
+  make_heatmap <- function(metadata, feature_table, 
                            samples_column = "sample_id", group_column = "group", 
-                           output_prefix, assay_suffix = "_GLlblMetag",
+                           output_prefix, assay_suffix = "_GLlbsMetag",
                            custom_palette) {
-    # Prepare feature table
-    # feature_table <- read_csv(feature_table_file) %>% as.data.frame
-    # rownames(feature_table) <- feature_table[[1]]
-    # feature_table <- feature_table[, -1] %>% as.matrix()
-
-    # # Prepare metadata
-    # metadata <- read_delim(metadata_file, delim = ",") %>% as.data.frame
-    # row.names(metadata) <- metadata[, samples_column]
-
-    # # Get common samples and re-arrange feature table and metadata
-    # common_samples <- intersect(colnames(feature_table), rownames(metadata))
-    # feature_table <- feature_table[, common_samples]
-    # metadata <- metadata[common_samples, ]
-    # metadata <- metadata %>% arrange(!!sym(group_column))
-
     # Create column annotation
     col_annotation <- as.data.frame(metadata)[, group_column, drop = FALSE]
 
@@ -1233,17 +1225,18 @@ library(pavian)
     dev.off()
   }
   ```
+
   **Function Parameter Definitions:**
-  - `metadata_file` - path to a file with samples as rows and columns describing each sample
-  - `feature_table_file` - path to a tab separated samples feature table i.e. species/functions 
-                           table with species/functions as the first column and samples as other columns.
+  - `metadata` - a dataframe with samples as rows and columns describing each sample
+  - `feature_table` - a dataframe of features with species/functions as the first column and samples as other columns.
   - `samples_column` - a character string specifying the column in `metadata` holding sample names, default: "sample_id"
   - `group_column` - a character string specifying the column in `metadata` used to facet/group plots, default: "group"
   - `output_prefix` - a character string specifying the unique name to add to the output file names 
                       used to denote the data type/source, for example "unfiltered-kaiju_species"
-  - `assay_suffix` - a character string specifying the GeneLab assay suffix (default: "_GLlblMetag")
+  - `assay_suffix` - a character string specifying the GeneLab assay suffix (default: "_GLlbsMetag")
   - `custom_palette` - a vector of strings specifying a custom color palette for coloring plots, from [Step 8c](#8c-set-global-variables)
 
+  **Returns:** A heatmap of species/functions across samples from the input feature table
 </details>
 
 ##### run_decontam()
@@ -1330,7 +1323,7 @@ library(pavian)
                                prevalence_column = "NTC", ntc_name = "TRUE", 
                                frequency_column = "concentration", 
                                threshold = 0.1, classification_method, 
-                               output_prefix, assay_suffix = "_GLlblMetag") {
+                               output_prefix, assay_suffix = "_GLlbsMetag") {
     # Prepare feature table
     feature_table <- read_csv(feature_table_file) %>%  as.data.frame
     rownames(feature_table) <- feature_table[[1]]
@@ -1380,6 +1373,7 @@ library(pavian)
     }
   }
   ```
+
   **Custom Functions Used:**
   - [run_decontam()](#run_decontam)
 
@@ -1396,14 +1390,13 @@ library(pavian)
   - `output_prefix` - a character string specifying the unique name to add to the output file names 
                       used to denote the data type/source, for example "unfiltered-kaiju_species"
   - `classification_method` - a character string specifying the tool used to generate the classifications ['kaiju', 'kraken2', 'metaphlan', 'contig-taxonomy', 'gene-taxonomy', 'gene-function']
-  - `assay_suffix` - a character string specifying the GeneLab assay suffix (default: "_GLlblMetag")
+  - `assay_suffix` - a character string specifying the GeneLab assay suffix (default: "_GLlbsMetag")
 
   **Output Data:**
-  - {classification_method}_decontam_species_table_GLlblMetag.csv - decontaminated feature table file
-  - {classification_method}_decontam_results_GLlblMetag.csv - Decontam results file
+  - {classification_method}_decontam_species_table_GLlbsMetag.csv - decontaminated feature table file
+  - {classification_method}_decontam_results_GLlbsMetag.csv - Decontam results file
 
   **Returns:** a dataframe containing the decontaminated feature table
-
 </details>
 
 ##### process_taxonomy()
@@ -1434,13 +1427,11 @@ library(pavian)
   }
   ```
   **Function Parameter Definitions:**
-
   - `taxonomy` - is a taxonomy assignment dataframe with ranks [Phylum, Class .. Species] as columns and taxonomy assignments as rows
   - `prefix`  - is a regular expression specifying a character sequence to remove
                 from taxon names
 
   **Returns:** a dataframe of reformated taxonomy names
-
 </details>
 
 ##### fix_names()
@@ -1511,7 +1502,6 @@ library(pavian)
   [fix_names()](#fix_names)
 
   **Function Parameter Definitions:**
-
   - `file_name` - path to contig taxonomy assignment file to be read
   - `sample_names` - string of samples names to keep in the final dataframe
 
@@ -1536,8 +1526,8 @@ library(pavian)
     return(sample_order)
   }
   ```
-  **Function Parameter Definitions:**
 
+  **Function Parameter Definitions:**
   - `assembly_summary` - path to assembly summary file
 
   **Returns:** a character vector of sorted sample names
@@ -1545,7 +1535,7 @@ library(pavian)
 </details>
 
 
-#### 8c. Set global variables
+#### 6c. Set global variables
 
 ```R
 # Define custom theme for plotting
@@ -1584,11 +1574,13 @@ custom_palette <- custom_palette[-c(21:23,
 - `publication_format` (a ggplot::theme object specifying a custom theme for plotting)
 - `custom_palette` (a vector of strings specifying a custom color palette for coloring plots)
 
-<br>
+<br>  
 
+---
 
 ## Read-based Processing
 
+
 ### 7. Taxonomic Profiling Using Kaiju
 
 #### 7a. Build Kaiju Database
@@ -1629,8 +1621,8 @@ kaiju -f kaiju-db/nr_euk/kaiju_db_nr_euk.fmi \
       -t kaiju-db/nodes.dmp \
       -z NumberOfThreads \
       -E 1e-05 \
-      -i /path/to/sample1_GLlbsMetag_R1_decontam.fastq.gz \
-      -j /path/to/sample1_GLlbsMetag_R2_decontam.fastq.gz \
+      -i /path/to/sample1_R1_decontam_GLlbsMetag.fastq.gz \
+      -j /path/to/sample1_R2_decontam_GLlbsMetag.fastq.gz \
       -o sample_kaiju.out
 ```
 
@@ -1738,43 +1730,37 @@ ktImportText  -o kaiju-report.html ${KTEXT_FILES[*]}
 
 **Parameter Definitions:**
 
-**find**
-
+*find*
 - `-type f` -  Specifies that the type of file to find is a regular file.
 - `-name "*.krona"` - Specifies to find files ending with the .krona suffix.  
 
-**sort**
-
+*sort*
 - `-u` - Specifies to perform a unique sort.
 - `-V` - Specifies to perform a mixed type of sorting with names containing numbers within text.
 - `> krona_files.txt` - Redirects the sorted list to a separate text file.
 
-**basename**
-
+*basename*
 - `-a` - Support multiple arguments and treat each as a file name.
 - `-s '.krona'` - Remove trailing '.krona' suffix.
 
-**paste**
-
+*paste*
 - `-d','` - Paste both krona and sample files together line by line delimited by comma ','.
 
-**ktImportText**
-
+*ktImportText*
 - `-o` - Specifies the compiled output html file name.
 - `${KTEXT_FILES[*]}` - An array positional argument with the following content: 
                         sample_1.krona,sample_1 sample_2.krona,sample_2 ... sample_n.krona,sample_n.
 
 **Input Data:**
-- *.krona (all sample .krona formatted files, output from [Step 7d](#7d-convert-kaiju-output-to-krona-format)) 
 
-                      
+- *.krona (all sample .krona formatted files, output from [Step 7d](#7d-convert-kaiju-output-to-krona-format)) 
+             
 **Output Data:**
 
 - krona_files.txt (sorted list of all *.krona files)
 - sample_names.txt (sorted list of all sample names)
 - **kaiju-report_GllbsMetag.html** (compiled krona html report containing all samples)
 
-
 #### 7f. Create Kaiju Species Count Table
 
 ```R
@@ -1787,7 +1773,6 @@ write_csv(x = table2write, file = "kaiju_species_table_GLlbsMetag.csv")
 ```
 
 **Custom Functions Used:**
-
 - [process_kaiju_table()](#process_kaiju_table)
 
 **Parameter Definitions:**
@@ -1839,7 +1824,6 @@ write_csv(x = table2write, file = output_file)
 ```
 
 **Custom Functions Used:**
-
 - [group_low_abund_taxa()](#group_low_abund_taxa)
 
 **Parameter Definitions:**
@@ -1854,11 +1838,11 @@ write_csv(x = table2write, file = output_file)
 
 **Output Data:**
 
-- **kaiju_filtered_species_table_GLlbsMetag.csv** - a file containing the filtered species table
+- **kaiju_filtered_species_table_GLlbsMetag.csv** (a file containing the filtered species table)
 
 ---
 
-#### 7h. Taxonomy barplots
+#### 7h. Taxonomy Barplots
 
 ```R
 library(tidyverse)
@@ -1907,20 +1891,20 @@ htmlwidgets::saveWidget(ggplotly(p), glue("kaiju_filtered_species_barplot_GLlbsM
 
 **Input Data:**
 
-- `kaiju_species_table_GLlblMetag.csv` (a file containing the species count table, output from [Step 7f](#7f-create-kaiju-species-count-table))
-- `kaiju_filtered_species_table_GLlblMetag.csv` (a file containing the filtered species count table, output from [Step 7g](#7g-filter-kaiju-species-count-table))
+- `kaiju_species_table_GLlbsMetag.csv` (a file containing the species count table, output from [Step 7f](#7f-create-kaiju-species-count-table))
+- `kaiju_filtered_species_table_GLlbsMetag.csv` (a file containing the filtered species count table, output from [Step 7g](#7g-filter-kaiju-species-count-table))
 - `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping samplenames to group metadata)
 
 
 **Output Data:**
 
-- kaiju_unfiltered_species_barplot_GLlblMetag.png (taxonomy barplot without filtering)
-- **kaiju_unfiltered_species_barplot_GLlblMetag.html** (interactive taxonomy barplot without filtering)
-- kaiju_filtered_species_barplot_GLlblMetag.png (taxonomy barplot after filtering rare and non-microbial taxa)
-- **kaiju_filtered_species_barplot_GLlblMetag.html** (interactive taxonomy barplot after filtering rare and non-microbial taxa)
+- kaiju_unfiltered_species_barplot_GLlbsMetag.png (taxonomy barplot without filtering)
+- **kaiju_unfiltered_species_barplot_GLlbsMetag.html** (interactive taxonomy barplot without filtering)
+- kaiju_filtered_species_barplot_GLlbsMetag.png (taxonomy barplot after filtering rare and non-microbial taxa)
+- **kaiju_filtered_species_barplot_GLlbsMetag.html** (interactive taxonomy barplot after filtering rare and non-microbial taxa)
 
 
-#### 7i. Feature decontamination
+#### 7i. Feature Decontamination
 
 > Feature (species) decontamination with decontam. Decontam is an R package that statistically identifies contaminating features in a feature table
 
@@ -1947,7 +1931,7 @@ decontaminated_table <- feature_decontam(metadata_file = metadata_table,
                                          threshold = 0.1, 
                                          classification_method = "kaiju", 
                                          output_prefix = "", 
-                                         assay_suffix = "_GLlblMetag")
+                                         assay_suffix = "_GLlbsMetag")
 
 # Convert count matrix to relative abundance matrix
 decontaminated_species_table <- count_to_rel_abundance(decontaminated_table)
@@ -1955,11 +1939,11 @@ decontaminated_species_table <- count_to_rel_abundance(decontaminated_table)
 # Make plot after filtering out contaminants
 p <- make_plot(decontaminated_species_table, metadata, custom_palette, publication_format)
 
-ggsave(filename = "kaiju_decontam_species_barplot_GLlblMetag.png", plot = p,
+ggsave(filename = "kaiju_decontam_species_barplot_GLlbsMetag.png", plot = p,
          device = "png", width = plot_width, height = 10, units = "in", dpi = 300)
 
 # Save interactive filtered plot
-htmlwidgets::saveWidget(ggplotly(p), glue("kaiju_decontam_species_barplot_GLlblMetag.html"), selfcontained = TRUE)
+htmlwidgets::saveWidget(ggplotly(p), glue("kaiju_decontam_species_barplot_GLlbsMetag.html"), selfcontained = TRUE)
 ```
 
 **Custom Functions Used:**
@@ -1968,6 +1952,7 @@ htmlwidgets::saveWidget(ggplotly(p), glue("kaiju_decontam_species_barplot_GLlblM
 - [count_to_rel_abundance()](#count_to_rel_abundance)
 
 **Parameter Definitions:**
+
 - `metadata_table` - path to a file with samples as rows and columns describing each sample
 - `feature_table_file` - path to a tab separated samples feature table i.e. species/functions 
                          table with species/functions as the first column and samples as other columns.
@@ -1975,15 +1960,15 @@ htmlwidgets::saveWidget(ggplotly(p), glue("kaiju_decontam_species_barplot_GLlblM
 
 **Input Data:**
 
-- `kaiju_filtered_species_table_GLlblMetag.csv`(path to filtered species count per sample, output from [Step 7g](#7g-filter-kaiju-species-count-table))
+- `kaiju_filtered_species_table_GLlbsMetag.csv` (path to filtered species count per sample, output from [Step 7g](#7g-filter-kaiju-species-count-table))
 - `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping samplenames to group metadata)
 
 **Output Data:**
 
-- **kaiju_decontam_results_GLlblMetag.csv** (decontam's result table, output from [feature_decontam() function](#feature_decontam))
-- **kaiju_decontam_species_table_GLlblMetag.csv** (decontaminated species table, output from [feature_decontam() function](#feature_decontam))
-- **kaiju_decontam_species_barplot_GLlblMetag.png** (barplot after filtering out contaminants)
-- **kaiju_decontam_species_barplot_GLlblMetag.html** (barplot after filtering out contaminants)
+- **kaiju_decontam_results_GLlbsMetag.csv** (decontam's result table, output from [feature_decontam() function](#feature_decontam))
+- **kaiju_decontam_species_table_GLlbsMetag.csv** (decontaminated species table, output from [feature_decontam() function](#feature_decontam))
+- kaiju_decontam_species_barplot_GLlbsMetag.png (barplot after filtering out contaminants)
+- **kaiju_decontam_species_barplot_GLlbsMetag.html** (barplot after filtering out contaminants)
 
 <br>
 
@@ -2020,21 +2005,22 @@ tar -xvzf k2_pluspfp.tar.gz
 
 **Parameter Definitions:**
 
-**wget**
-
+*wget*
 - `O` - Name of file to download the url content to.
 - `--timeout=3600` - Specifies the network timeout in seconds.
 - `--tries=0` - Retry download infinitely.
 - `--continue` -  Continue getting a partially-downloaded file.
 - `*_URL` - Position arguement specifying the url to download a particular resource from.
 
+*tar*
+- `-xvzf` - Extracts (`x`) the gzip-compressed (`z`) *.tar.gz archive named by `f`, listing files verbosely (`v`).
 
 **Input Data:**
 
-- `INSPECT_URL=` - url specifying the location of kraken2 inspect file
-- `LIRARY_REPORT_URL=` - url specifying the location of kraken2 library report file
-- `MD5_URL=` - url specifying the location of the md5 file of the kraken database
-- `DB_URL=` - url specifying the location of the main kraken database archive in .tar.gz format
+- `INSPECT_URL=` (url specifying the location of kraken2 inspect file)
+- `LIRARY_REPORT_URL=` (url specifying the location of kraken2 library report file)
+- `MD5_URL=` (url specifying the location of the md5 file of the kraken database)
+- `DB_URL=` (url specifying the location of the main kraken database archive in .tar.gz format)
 
 **Output Data:**
 
@@ -2049,7 +2035,7 @@ kraken2 --db kraken2-db/ \
         --use-names \
         --output sample-kraken2-output.txt \
         --report sample-kraken2-report.tsv \
-        /path/to/sample1_GLlbsMetag_R1_decontam.fastq.gz /path/to/sample1_GLlbsMetag_R2_decontam.fastq.gz
+        /path/to/sample1_R1_decontam_GLlbsMetag.fastq.gz /path/to/sample1_R2_decontam_GLlbsMetag.fastq.gz
 ```
 
 **Parameter Definitions:**
@@ -2060,8 +2046,8 @@ kraken2 --db kraken2-db/ \
 - `--use-names` - Specifies to add taxa names in addition to taxids.
 - `--output` - Specifies the name of the kraken2 read-based output file.
 - `--report` - Specifies the name of the kraken2 report output file.
-- `sample1_GLlbsMetag_R1_decontam.fastq.gz` - Positional argument specifying the forward read input file.
-- `sample1_GLlbsMetag_R2_decontam.fastq.gz` - Positional argument specifying the reverse read input file.
+- `sample1_R1_decontam_GLlbsMetag.fastq.gz` - Positional argument specifying the forward read input file.
+- `sample1_R2_decontam_GLlbsMetag.fastq.gz` - Positional argument specifying the reverse read input file.
 
 
 **Input Data:**
@@ -2084,24 +2070,17 @@ kraken2 --db kraken2-db/ \
 
 ```R
 species_table <- merge_kraken_reports(reports-dir='/path/to/kraken2/reports')
-write_csv(x = species_table, file = "merged-kraken2-table.csv")
+write_csv(x = species_table, file = "kraken2_species_table_GLlbsMetag.csv")
 ```
 
 **Custom Functions Used:**
-
 - [merge_kraken_reports()](#merge_kraken_reports)
 
 **Parameter Definitions:**
 
-- `file_path` - path to compiled kaiju table at the species taxon level
+- `reports-dir` - path to the directory containing the kraken2 report files
 - `x`  - feature table dataframe to write to file
-- `file` - path to where to write kaiju count table per sample
-
-**Parameter Definitions:**
-
-- `--output` - Specifies the name of the kraken2 compiled results output file.
-- `--report-files` - Specifies the name of each input kraken2 report file to compile.
-- `--sample-names` - Specifies the name of each sample. 
+- `file` - path to where to write the kraken2 species table
 
 **Input Data:**
 
@@ -2109,7 +2088,7 @@ write_csv(x = species_table, file = "merged-kraken2-table.csv")
 
 **Output Data:**
 
-- **kraken2_species_table_GLlblMetag.csv** (kraken species count table in csv format)
+- **kraken2_species_table_GLlbsMetag.csv** (kraken species count table in csv format)
 
 ##### 8cii. Compile Kraken2 Taxonomy Reports
 
@@ -2178,30 +2157,25 @@ ktImportText -o kraken2-report_GLlbsMetag.html ${KTEXT_FILES[*]}
 
 **Parameter Definitions:**
 
-**find**
+*find*
+  - `-type f` -  Specifies that the type of file to find is a regular file.
+  - `-name "*.krona"` - Specifies to find files ending with the .krona suffix. 
 
-- `-type f` -  Specifies that the type of file to find is a regular file.
-- `-name "*.krona"` - Specifies to find files ending with the .krona suffix.  
+*sort*
+  - `-u` - Specifies to perform a unique sort.
+  - `-V` - Specifies to perform a mixed type of sorting with names containing numbers within text.
+  - `> {}.txt` - Redirects the sorted list to a separate text file.
 
-**sort**
+*basename*
+  - `--multiple` - Support multiple arguments and treat each as a file name.
+  - `--suffix='.krona'` - Remove a trailing '.krona' suffix.
 
-- `-u` - Specifies to perform a unique sort.
-- `-V` - Specifies to perform a mixed type of sorting with names containing numbers within text.
-- `> {}.txt` - Redirects the sorted list to a separate text file.
-
-**basename**
-
-- `--multiple` - Support multiple arguments and treat each as a file name.
-- `--suffix='.krona'` - Remove a trailing '.krona' suffix.
-
-**paste**
-
-- `-d','` - Paste both krona and sample files together line by line delimited by comma ','.
+*paste*
+  - `-d','` - Paste both krona and sample files together line by line delimited by comma ','.
 
-**ktImportText**
-
-- `-o` - Specifies the compiled output html file name.
-- `${KTEXT_FILES[*]}` - An array positional arguement with the following content: sample_1.krona,sample_1 sample_2.krona,sample_2 .. sample_n.krona,sample_n.
+*ktImportText*
+  - `-o` - Specifies the compiled output html file name.
+  - `${KTEXT_FILES[*]}` - An array positional argument with the following content: sample_1.krona,sample_1 sample_2.krona,sample_2 .. sample_n.krona,sample_n.
 
 **Input Data:**
 
@@ -2220,8 +2194,8 @@ ktImportText -o kraken2-report_GLlbsMetag.html ${KTEXT_FILES[*]}
 ```R
 library(tidyverse)
 
-input_file <- "kraken2_species_table_GLlblMetag.csv"
-output_file <- "kraken2_filtered_species_table_GLlblMetag.csv"
+input_file <- "kraken2_species_table_GLlbsMetag.csv"
+output_file <- "kraken2_filtered_species_table_GLlbsMetag.csv"
 threshold <- 0.5
 
 # string used to define non-microbial taxa
@@ -2242,7 +2216,6 @@ write_csv(x = table2write, file = output_file)
 ```
 
 **Custom Functions Used:**
-
 - [group_low_abund_taxa()](#group_low_abund_taxa)
 
 **Parameter Definitions:**
@@ -2253,21 +2226,21 @@ write_csv(x = table2write, file = output_file)
 
 **Input Data:**
 
-- kraken2_species_table_GLlblMetag.csv (path to kaiju species table from [Step 8ci.](#8ci-create-merged-kraken2-taxonomy-table))
+- kraken2_species_table_GLlbsMetag.csv (path to kraken2 species table from [Step 8ci.](#8ci-create-merged-kraken2-taxonomy-table))
 
 **Output Data:**
 
-- **kraken2_filtered_species_table_GLlblMetag.csv** - a file containing the filtered species table
+- **kraken2_filtered_species_table_GLlbsMetag.csv** (a file containing the filtered species table)
 
 ---
 
-#### 8g. Taxonomy barplots
+#### 8g. Taxonomy Barplots
 
 ```R
 library(tidyverse)
 
-species_table_file <- "kraken2_species_table_GLlblMetag.csv"
-filtered_species_table_file <- "kraken2_filtered_species_table_GLlblMetag.csv"
+species_table_file <- "kraken2_species_table_GLlbsMetag.csv"
+filtered_species_table_file <- "kraken2_filtered_species_table_GLlbsMetag.csv"
 metadata_file <- "/path/to/sample/metadata"
 number_samples <- 10 
 
@@ -2279,7 +2252,7 @@ p <- make_barplot(metadata_file = metadata_file, feature_table_file = species_ta
                   feature_column = "species", samples_column = "sample_id", group_column = "group",
                   publication_format = publication_format, custom_palette = custom_palette)
 
-ggsave(filename = "kraken2_unfiltered_species_barplot_GLlblMetag.png", plot = p,
+ggsave(filename = "kraken2_unfiltered_species_barplot_GLlbsMetag.png", plot = p,
        device = "png", width = plot_width, height = 10, units = "in", dpi = 300, limitsize = FALSE)
 
 # Save static unfiltered plot
@@ -2288,15 +2261,16 @@ p <- make_barplot(metadata_file = metadata_file, feature_table_file = filtered_s
                   publication_format = publication_format, custom_palette = custom_palette)
 
 # Save interactive unfilterted plot
-htmlwidgets::saveWidget(ggplotly(p), glue("kraken2_unfiltered_species_barplot_GLlblMetag.html"), selfcontained = TRUE)
+htmlwidgets::saveWidget(ggplotly(p), glue("kraken2_unfiltered_species_barplot_GLlbsMetag.html"), selfcontained = TRUE)
 
 # Save static filtered plot
-ggsave(filename = glue("kraken2_filtered_species_barplot_GLlblMetag.png"), plot = p,
+ggsave(filename = glue("kraken2_filtered_species_barplot_GLlbsMetag.png"), plot = p,
       device = 'png', width = plot_width, height = 10, units = 'in', dpi = 300, limitsize = FALSE)
 
 # Save interactive filtered plot
-htmlwidgets::saveWidget(ggplotly(p), glue("kraken2_filtered_species_barplot_GLlblMetag.html"), selfcontained = TRUE)
+htmlwidgets::saveWidget(ggplotly(p), glue("kraken2_filtered_species_barplot_GLlbsMetag.html"), selfcontained = TRUE)
 ```
+
 **Custom Functions Used:**
 - [make_barplot()](#make_plot)
 
@@ -2309,19 +2283,19 @@ htmlwidgets::saveWidget(ggplotly(p), glue("kraken2_filtered_species_barplot_GLlb
 
 **Input Data:**
 
-- `kraken2_species_table_GLlblMetag.csv` (path to kaiju species table from [Step 10ci.](#8ci-create-merged-kraken2-taxonomy-table))
-- `kraken2_filtered_species_table_GLlblMetag.csv` (a file containing the filtered species count table, output from [Step 10g](#10f-filter-kraken2-species-count-table))
+- `kraken2_species_table_GLlbsMetag.csv` (path to kraken2 species table from [Step 8ci.](#8ci-create-merged-kraken2-taxonomy-table))
+- `kraken2_filtered_species_table_GLlbsMetag.csv` (a file containing the filtered species count table, output from [Step 8f](#10f-filter-kraken2-species-count-table))
 - `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping samplenames to group metadata)
 
 **Output Data:**
 
-- kraken2_unfiltered_species_barplot_GLlblMetag.png (taxonomy barplot without filtering)
-- **kraken2_unfiltered_species_barplot_GLlblMetag.html** (interactive taxonomy barplot without filtering)
-- kraken2_filtered_species_barplot_GLlblMetag.png (taxonomy barplot after filtering rare and non-microbial taxa)
-- **kraken2_filtered_species_barplot_GLlblMetag.html** (interactive taxonomy barplot after filtering rare and non-microbial taxa)
+- kraken2_unfiltered_species_barplot_GLlbsMetag.png (taxonomy barplot without filtering)
+- **kraken2_unfiltered_species_barplot_GLlbsMetag.html** (interactive taxonomy barplot without filtering)
+- kraken2_filtered_species_barplot_GLlbsMetag.png (taxonomy barplot after filtering rare and non-microbial taxa)
+- **kraken2_filtered_species_barplot_GLlbsMetag.html** (interactive taxonomy barplot after filtering rare and non-microbial taxa)
 
 
-#### 8h. Feature decontamination
+#### 8h. Feature Decontamination
 
 > Feature (species) decontamination with decontam. Decontam is an R package that statistically 
   identifies contaminating features in a feature table
@@ -2331,7 +2305,7 @@ library(tidyverse)
 library(decontam)
 library(phyloseq)
 
-feature_table_file <- "kraken2_filtered_species_table_GLlblMetag.csv"
+feature_table_file <- "kraken2_filtered_species_table_GLlbsMetag.csv"
 metadata_table <- "/path/to/sample/metadata"
 number_samples <- NumberOfSamples # integer indicating how many samples are in the file
 
@@ -2349,7 +2323,7 @@ decontaminated_table <- feature_decontam(metadata_file = metadata_table,
                                          threshold = 0.1, 
                                          classification_method = "kraken2", 
                                          output_prefix = "", 
-                                         assay_suffix = "_GLlblMetag")
+                                         assay_suffix = "_GLlbsMetag")
 
 # Convert count matrix to relative abundance matrix
 decontaminated_species_table <- count_to_rel_abundance(decontaminated_table)
@@ -2357,11 +2331,11 @@ decontaminated_species_table <- count_to_rel_abundance(decontaminated_table)
 # Make plot after filtering out contaminants
 p <- make_plot(decontaminated_species_table, metadata, custom_palette, publication_format)
 
-ggsave(filename = "kraken2_decontam_species_barplot_GLlblMetag.png", plot = p,
+ggsave(filename = "kraken2_decontam_species_barplot_GLlbsMetag.png", plot = p,
          device = "png", width = plot_width, height = 10, units = "in", dpi = 300)
 
 # Save interactive filtered plot
-htmlwidgets::saveWidget(ggplotly(p), glue("kraken2_decontam_species_barplot_GLlblMetag.html"), selfcontained = TRUE)
+htmlwidgets::saveWidget(ggplotly(p), glue("kraken2_decontam_species_barplot_GLlbsMetag.html"), selfcontained = TRUE)
 ```
 
 **Custom Functions Used:**
@@ -2370,6 +2344,7 @@ htmlwidgets::saveWidget(ggplotly(p), glue("kraken2_decontam_species_barplot_GLlb
 - [count_to_rel_abundance()](#count_to_rel_abundance)
 
 **Parameter Definitions:**
+
 - `metadata_table` - path to a file with samples as rows and columns describing each sample
 - `feature_table_file` - path to a tab separated samples feature table i.e. species/functions 
                           table with species/functions as the first column and samples as other columns.
@@ -2377,32 +2352,35 @@ htmlwidgets::saveWidget(ggplotly(p), glue("kraken2_decontam_species_barplot_GLlb
 
 **Input Data:**
 
-- `kraken2_filtered_species_table_GLlblMetag.csv`(path to filtered species count per sample, output from [Step 8f](#10f-filter-kraken2-species-count-table))
+- `kraken2_filtered_species_table_GLlbsMetag.csv` (path to filtered species count per sample, output from [Step 8f](#10f-filter-kraken2-species-count-table))
 - `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping samplenames to group metadata)
 
 **Output Data:**
 
-- **kraken2_decontam_results_GLlblMetag.csv** (decontam's result table, output from [feature_decontam() function](#feature_decontam))
-- **kraken2_decontam_species_table_GLlblMetag.csv** (decontaminated species table, output from [feature_decontam() function](#feature_decontam))
-- **kraken2_decontam_species_barplot_GLlblMetag.png** (barplot after filtering out contaminants)
-- **kraken2_decontam_species_barplot_GLlblMetag.html** (barplot after filtering out contaminants)
+- **kraken2_decontam_results_GLlbsMetag.csv** (decontam's result table, output from [feature_decontam() function](#feature_decontam))
+- **kraken2_decontam_species_table_GLlbsMetag.csv** (decontaminated species table, output from [feature_decontam() function](#feature_decontam))
+- kraken2_decontam_species_barplot_GLlbsMetag.png (barplot after filtering out contaminants)
+- **kraken2_decontam_species_barplot_GLlbsMetag.html** (barplot after filtering out contaminants)
 
-<br>
+<br>  
+
+---
 
 ### 9. Taxonomic Profiling Using MetaPhlan
 
-#### 9a. Download and install HUMAnN databases
+#### 9a. Download and Install HUMAnN Databases
 
 ```bash 
 mkdir -p /path/to/humann3-db
 humann3_databases --download chocophlan full /path/to/humann3-db/
-humann_databases --download uniref uniref90_ec_filtered_diamond /path/to/humann3-db/
-humann_databases --download utility_mapping full /path/to/human3-db/
+humann3_databases --download uniref uniref90_ec_filtered_diamond /path/to/humann3-db/
+humann3_databases --download utility_mapping full /path/to/humann3-db/
 metaphlan --install
 ```
 
 **Parameter Definition:**
-*humann_databases*
+
+*humann3_databases*
 - `--download` - Specifies the databases to download:
   - `chocophlan full` - the full ChocoPhlAn pangenome database, which includes Archaea, Bacteria, Eukaryotes, and Viruses
   - `uniref uniref90_ec_filtered_diamond` - Download the EC-filtered UniRef90 translated search database
@@ -2410,18 +2388,21 @@ metaphlan --install
 -`/path/to/humann3-db` - Specifies the database install location
 
 *metaphlan*
-`--install` - install the metaphlan clade markers and database locally
+- `--install` - install the MetaPhlan clade markers and database locally
 
 **Input Data**
-None
+
+*No input data required*
 
 **Output Data**
-`/path/to/humann3-db` - the path to the installed metaphlan databases
+
+- `/path/to/humann3-db` (the installed MetaPhlan databases)
 
 #### 9b. HUMAnN/MetaPhlAn Taxonomic Classification
+
 ```bash
   # forward and reverse reads need to be provided combined if paired-end (if not paired-end, single-end reads are provided to the --input argument next)
-cat sample1_GLlbsMetag_R1_decontam.fastq.gz sample1_GLlbsMetag_R2_decontam.fastq.gz > sample1-combined.fastq.gz
+cat sample1_R1_decontam_GLlbsMetag.fastq.gz sample1_R2_decontam_GLlbsMetag.fastq.gz > sample1-combined.fastq.gz
 
 humann --input sample1-combined.fastq.gz \
        --output sample1-humann3-out-dir \
@@ -2443,21 +2424,24 @@ mv sample1-humann3-out-dir/sample1_humann_temp/sample1_metaphlan_bugs_list.tsv \
 -	`--threads` – specifies the number of threads to use
 -	`--output-basename` – specifies prefix of the output files
 -	`--metaphlan-options` – options to be passed to metaphlan
-	- `--bowtie2db` – path to bowtie2 indexes (stored in humann database folder)
+	- `--bowtie2db` – path to bowtie2 indexes (stored in HUMAnN database folder)
   - `unclassified_estimation` - scale the relative abundance profile according to the percentage of reads mapping to a clade.
 	- `--add_viruses` – include viruses in the reference database
 	- `--sample_id` – specifies the sample identifier we want in the table (rather than full filename)
 
 **Input Data:**
-- `/path/to/humann3-db/` (humann databases installed in [Step 9a](#9a-download-and-install-humann-databases))
-- *_R[12]_decontam.fastq.gz or *_R[12]_HostRm.fastq.gz (filtered and trimmed sample reads with both 
+
+- `/path/to/humann3-db/` (HUMAnN databases installed in [Step 9a](#9a-download-and-install-humann-databases))
+- *_R[12]_decontam_GLlbsMetag.fastq.gz or *_R[12]_HostRm_GLlbsMetag.fastq.gz (filtered and trimmed sample reads with both 
     contaminants and human reads (and, optionally, host reads) removed, gzipped fasta file, 
     output from [Step 4b](#4b-build-contaminant-index-and-map-reads) or [Step 5b](#5b-remove-host-reads))
 
 **Output Data:**
-- sample1-humann3-out-dir/ - humann output directory containing *genefamilies.tsv, *pathabundance.tsv, and *pathcoverage.tsv files
 
-#### 9c. Merge multiple sample functional profiles
+- sample1-humann3-out-dir/ (HUMAnN output directory containing *genefamilies.tsv, *pathabundance.tsv, and *pathcoverage.tsv files)
+
+#### 9c. Merge Multiple Sample Functional Profiles
+
 ```bash
 # they need to be in their own directories
 mkdir genefamily-results/ pathabundance-results/ pathcoverage-results/
@@ -2480,15 +2464,15 @@ humann_join_tables -i pathcoverage-results/ -o path-coverages.tsv
 
 **Input Data:**
 
-- `sample-humann3-out-dir` (humann output directory, from [Step 9b](#9b-running-humannmetaphlan))
+- `sample-humann3-out-dir` (HUMAnN output directory, from [Step 9b](#9b-humannmetaphlan-taxonomic-classification))
 
 **Output Data:**
 
-- gene-families.tsv - Combined gene family table in tab-separated format.
-- path-abundances.tsv - Combined path abundances table in tab-separated format.
-- path-coverages.tsv - Combined path coverages table in tab-separated format.
+- gene-families.tsv (combined gene family table in tab-separated format)
+- path-abundances.tsv (combined path abundances table in tab-separated format)
+- path-coverages.tsv (combined path coverages table in tab-separated format)
 
-#### 9d. Split results tables
+#### 9d. Split Results Tables
 
 The read-based functional annotation tables have taxonomic info and non-taxonomic info mixed together. `humann` comes with a helper script to split them into both non-taxonomically grouped functional info files and taxonomically grouped functional info files.
 
@@ -2518,14 +2502,15 @@ mv path-coverages_unstratified.tsv Path-coverages_GLlbsMetag.tsv
 - path-coverages.tsv (Combined path coverages table from [Step 9c](#9c-merging-multiple-sample-functional-profiles-into-one-table))
 
 **Output Data:**
-- Gene-families-grouped-by-taxa_GLlbsMetag.tsv - Gene families grouped by taxa
-- Gene-families_GLlbsMetag.tsv - Non-taxonomically grouped gene families
-- Path-abundances-grouped-by-taxa_GLlbsMetag.tsv - Path abundances grouped by taxa
-- Path-abundances_GLlbsMetag.tsv  - Non-taxonomically grouped gene families
-- Path-coverages-grouped-by-taxa_GLlbsMetag.tsv - Path coverages grouped by taxa
-- Path-coverages_GLlbsMetag.tsv - Non-taxonomically groups path coverages
 
-#### 9e. Normalize gene families and pathway abundances tables
+- Gene-families-grouped-by-taxa_GLlbsMetag.tsv (Gene families grouped by taxa)
+- Gene-families_GLlbsMetag.tsv (Non-taxonomically grouped gene families)
+- Path-abundances-grouped-by-taxa_GLlbsMetag.tsv (Path abundances grouped by taxa)
+- Path-abundances_GLlbsMetag.tsv (Non-taxonomically grouped path abundances)
+- Path-coverages-grouped-by-taxa_GLlbsMetag.tsv (Path coverages grouped by taxa)
+- Path-coverages_GLlbsMetag.tsv (Non-taxonomically grouped path coverages)
+
+#### 9e. Normalize Gene Families and Pathway Abundances Tables
 Generates some normalized tables of the read-based functional outputs from humann that are more readily suitable for across sample comparisons.
 
 ```bash
@@ -2546,10 +2531,11 @@ humann_renorm_table -i Path-abundances_GLlbsMetag.tsv -o Path-abundances-cpm_GLl
 
 **Output Data:**
 
-- Gene-families-cpm_GLlbsMetag.tsv (Non-taxonomically grouped gene families, from [Step 9d](#9d-splitting-results-tables))
-- Path-abundances-cpm_GLlbsMetag.tsv (Non-taxonomically grouped gene families, from [Step 9d](#9d-splitting-results-tables))
+- Gene-families-cpm_GLlbsMetag.tsv (Normalized non-taxonomically grouped gene families)
+- Path-abundances-cpm_GLlbsMetag.tsv (Normalized non-taxonomically grouped path abundances)
+
+#### 9f. Generate Normalized Gene-family Table Grouped by KEGG Orthologs (KOs)
 
-#### 9f. Generate a normalized gene-family table grouped by Kegg Orthologs (KOs)
 ```bash
 humann_regroup_table -i Gene-families_GLlbsMetag.tsv -g uniref90_ko | \
 humann_rename_table -n kegg-orthology | \
@@ -2562,12 +2548,14 @@ humann_renorm_table -o Gene-families-KO-cpm_GLlbsMetag.tsv --update-snames
 -	`-i` – the input table
 -	`-g` – the map to use to group uniref IDs into Kegg Orthologs
 -	`|` – sending that output into the next humann command to add human-readable Kegg Orthology names
+
 *humann_rename_table*
 -	`-n` – specifying we are converting Kegg orthology IDs into Kegg orthology human-readable names
 -	`|` – sending that output into the next humann command to normalize to copies-per-million
+
 *humann_renorm_table*
 -	`-o` – specifying the final output file name
--  `--update-snames` – change suffix of column names in tables to "-CPM"
+-  `--update-snames` – change suffix of column names in tables to "-CPM"
 
 **Input Data:**
 
@@ -2575,36 +2563,303 @@ humann_renorm_table -o Gene-families-KO-cpm_GLlbsMetag.tsv --update-snames
 
 **Output Data:**
 
-- Gene-families-KO-cpm_GLlbsMetag.tsv (gene-families with annotations based on Kegg Orthology terms)
+- Gene-families-KO-cpm_GLlbsMetag.tsv (Normalized gene families annotated with KEGG Orthology terms)
 
-#### 9g. Combining taxonomy tables
+#### 9g. Combine MetaPhlan Taxonomy Tables
 
 ```bash
 merge_metaphlan_tables.py *-humann3-out-dir/*_humann_temp/*_metaphlan_bugs_list.tsv > Metaphlan-taxonomy_GLlbsMetag.tsv
 
 # remove redundant text from headers
-sed -i 's/_metaphlan_bugs_list//g' Metaphlan-taxonomy_GLlbsMetag_.tsv
+sed -i 's/_metaphlan_bugs_list//g' Metaphlan-taxonomy_GLlbsMetag.tsv
 ```
-**Parameter Definitions:**  
+
+**Parameter Definitions:**
+
 *merge_metaphlan_tables.py*
 - positional argument specifying input files and output filename
+
 *sed*
 - `-i` - Perform the search/replace in-place on the input file.
 
 **Input Data:**
-*	\*-humann3-out-dir/\*_humann_temp/\*_metaphlan_bugs_list.tsv (metaphlan bugs_list produced during humann3 run in [step 9b](#9b-running-humannmetaphlan)
+
+-	\*-humann3-out-dir/\*_humann_temp/\*_metaphlan_bugs_list.tsv (MetaPhlan bugs_list produced during the humann3 run in [Step 9b](#9b-running-humannmetaphlan))
+
+**Output Data:**
+
+- **Metaphlan-taxonomy_GLlbsMetag.tsv** (MetaPhlan estimated taxonomic relative abundances)
+
+#### 9h. Create MetaPhlan Species Count Table 
+
+#### 9hi. Get Sample Read Counts
+
+```bash
+unzip decontam_multiqc_GLlbsMetag_data.zip
+
+grep _R1_decontam multiqc_fastqc.txt | awk 'BEGIN{FS="\t"; OFS="\t"}{print $1,int($5)}' > reads_per_sample.tsv
+```
+
+**Input Data:**
+
+- decontam_multiqc_GLlbsMetag_data.zip or HostRm_multiqc_GLlbsMetag_data.zip (multiqc data from [Step 4d](#4d-compile-contaminant-remove-qc) or, if the optional host removal step was done, [Step 5c](#5c-compile-host-read-removal-qc))
+
+**Output Data:**
+
+- reads_per_sample.tsv (a 2-column tab-delimited file with sample names and read counts as columns 1 and 2, respectively)
+
+#### 9hii. Process Metaphlan Taxonomy Table
+
+```R
+library(tidyverse)
+
+input_file <- "Metaphlan-taxonomy_GLlbsMetag.tsv"
+read_count_file <- "reads_per_sample.tsv"
+output_file <- "Metaphlan_species_table_GLlbsMetag.csv"
+threshold <- 0.5
+
+taxon_levels <- c("Kingdom", "Phylum", "Class", "Order",
+                  "Family", "Genus", "Species")
+
+# read in feature table
+feature_table <- read_delim(input_file, delim="\t", comment="#") 
+colnames(feature_table)[1] <- "taxonomy"
+
+feature_table <- feature_table %>%
+  filter(str_detect(taxonomy, "UNCLASSIFIED|s__") & 
+         str_detect(taxonomy, "t__", negate = TRUE)) %>%
+  mutate(Species=str_replace_all(taxonomy, '\\w__', "")) %>%
+  separate(Species, into=taxon_levels, sep="\\|") %>%
+  mutate(across(where(is.character), function(x) replace_na(x, "UNCLASSIFIED"))) %>%
+  mutate(Species=str_replace_all(Species, "_", " ")) %>%
+  select(-taxonomy, -Kingdom, -Phylum, -Class, -Order, -Family, -Genus) %>%
+  select(Species, everything()) %>%
+  as.data.frame
+
+rownames(feature_table) <- feature_table$Species
+feature_table <- feature_table[,-match("Species", colnames(feature_table))]
+
+# Transpose so samples are rows and convert percent abundances (0-100) to proportions (0-1)
+tab2 <- (feature_table %>% t) / 100
+
+# read in sample read counts
+counts <- read_delim(read_count_file, delim = "\t", 
+                     col_names = c("Sample_ID", "Reads")) %>%
+  as.data.frame
+
+# Set rownames as sample names
+rownames(counts) <- counts$Sample_ID
+# Drop the Sample_ID column
+counts <- counts[, -1, drop = FALSE]
+
+tab2 <- tab2[rownames(counts),]
+
+# Convert relative abundance to raw count
+species_table <- map2(tab2 %>% as.data.frame, 
+                      colnames(tab2), function(col, specie) {
+                        df <- col * counts
+                        colnames(df) <- specie
+                        return(df) 
+                      }) %>% list_cbind() %>% t
+
+table2write <- species_table  %>%
+  as.data.frame() %>%
+  rownames_to_column("Species")
+
+write_csv(x = table2write, file = "Metaphlan_species_table_GLlbsMetag.csv")
+```
+
+**Input Data:**
+
+- Metaphlan-taxonomy_GLlbsMetag.tsv (Metaphlan taxonomy table from [Step 9g](#9g-combine-metaphlan-taxonomy-tables))
+- reads_per_sample.tsv (a 2-column tab delimited file with sample names and read counts as columns 1 and 2, respectively from [Step 9hi](#9hi-get-sample-read-counts))
+
+**Output Data:**
+
+- **Metaphlan_species_table_GLlbsMetag.csv** (a file containing the MetaPhlan species table)
+
+#### 9i. Filter MetaPhlan Species Count Table
+
+```R
+library(tidyverse)
+
+input_file <- "Metaphlan_species_table_GLlbsMetag.csv"
+output_file <- "Metaphlan_filtered_species_table_GLlbsMetag.csv"
+threshold <- 0.5
+
+# string used to define non-microbial taxa
+non_microbial <- "UNCLASSIFIED"
+
+# read in feature table
+feature_table <- read_csv(input_file) %>% as.data.frame
+feature_name <- colnames(feature_table)[1]
+rownames(feature_table) <- feature_table[,1]
+feature_table <- feature_table[, -1]
+
+# filter out rare and non-microbial taxa
+table2write <- filter_rare(feature_table, non_microbial, threshold = threshold) %>%
+  as.data.frame %>%
+  rownames_to_column(feature_name)
+
+write_csv(x = table2write, file = output_file)
+```
+
+**Custom Functions Used:**
+- [filter_rare()](#filter_rare)
+
+**Parameter Definitions:**
+
+- `threshold` - threshold for filtering out rare taxa, a percentage between 0 and 100.
+- `output_file` - output filename
+- `input_file` - input filename
+
+**Input Data:**
+
+- Metaphlan_species_table_GLlbsMetag.csv (path to Metaphlan species count table from [Step 9hii](#9hii-process-metaphlan-taxonomy-table))
 
 **Output Data:**
-- **Metaphlan-taxonomy_GLlbsMetag.tsv** - metaphlan estimated taxonomic relative abundances
 
+- **Metaphlan_filtered_species_table_GLlbsMetag.csv** (a file containing the filtered MetaPhlan species table)
+
+#### 9j. Taxonomy Barplots
+
+```R
+library(tidyverse)
+
+species_table_file <- "Metaphlan_species_table_GLlbsMetag.csv"
+filtered_species_table_file <- "Metaphlan_filtered_species_table_GLlbsMetag.csv"
+metadata_file <- "/path/to/sample/metadata"
+number_samples <- 10 
+
+# set width based on number of samples, with a cap at 50 inches
+plot_width <- 2 * number_samples
+if(plot_width > 50) { plot_width = 50 }
+
+# Make unfiltered plot
+p <- make_barplot(metadata_file = metadata_file, feature_table_file = species_table_file, 
+                  feature_column = "Species", samples_column = "sample_id", group_column = "group",
+                  publication_format = publication_format, custom_palette = custom_palette)
+# Save static unfiltered plot
+ggsave(filename = "Metaphlan_unfiltered_species_barplot_GLlbsMetag.png", plot = p,
+       device = "png", width = plot_width, height = 10, units = "in", dpi = 300, limitsize = FALSE)
+
+# Save interactive unfiltered plot
+htmlwidgets::saveWidget(plotly::ggplotly(p), "Metaphlan_unfiltered_species_barplot_GLlbsMetag.html", selfcontained = TRUE)
+
+# Make filtered plot
+p <- make_barplot(metadata_file = metadata_file, feature_table_file = filtered_species_table_file, 
+                  feature_column = "Species", samples_column = "sample_id", group_column = "group",
+                  publication_format = publication_format, custom_palette = custom_palette)
+# Save static filtered plot
+ggsave(filename = "Metaphlan_filtered_species_barplot_GLlbsMetag.png", plot = p,
+       device = "png", width = plot_width, height = 10, units = "in", dpi = 300, limitsize = FALSE)
+
+# Save interactive filtered plot
+htmlwidgets::saveWidget(plotly::ggplotly(p), "Metaphlan_filtered_species_barplot_GLlbsMetag.html", selfcontained = TRUE)
+```
+
+**Custom Functions Used:**
+- [make_barplot()](#make_barplot)
+
+**Parameter Definitions:**
+
+- `species_table_file` - a file containing the species count table
+- `filtered_species_table_file` - a file containing the filtered species count table
+- `metadata_file` - a file containing group information for each sample in the species count files
+- `number_samples` - the total number of samples in the species count files, adjust based on number of input samples.
+
+**Input Data:**
+
+- `Metaphlan_species_table_GLlbsMetag.csv` (MetaPhlan species count table, output from [Step 9hii](#9hii-process-metaphlan-taxonomy-table))
+- `Metaphlan_filtered_species_table_GLlbsMetag.csv` (filtered MetaPhlan species count table, output from [Step 9i](#9i-filter-metaphlan-species-count-table))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
+
+**Output Data:**
+
+- Metaphlan_unfiltered_species_barplot_GLlbsMetag.png (taxonomy barplot without filtering)
+- **Metaphlan_unfiltered_species_barplot_GLlbsMetag.html** (interactive taxonomy barplot without filtering)
+- Metaphlan_filtered_species_barplot_GLlbsMetag.png (taxonomy barplot after filtering rare and non-microbial taxa)
+- **Metaphlan_filtered_species_barplot_GLlbsMetag.html** (interactive taxonomy barplot after filtering rare and non-microbial taxa)
+
+
+#### 9k. Feature Decontamination
+
+> Feature (species) decontamination with decontam, an R package that statistically 
+  identifies contaminating features in a feature table.
+
+```R
+library(tidyverse)
+library(decontam)
+library(phyloseq)
+
+feature_table_file <- "Metaphlan_filtered_species_table_GLlbsMetag.csv"
+metadata_table <- "/path/to/sample/metadata"
+number_samples <- NumberOfSamples # integer indicating how many samples are in the file
+
+# set width based on number of samples, with a cap at 50 inches
+plot_width <- 2 * number_samples
+if(plot_width > 50) { plot_width = 50 }
+
+decontaminated_table <- feature_decontam(metadata_file = metadata_table, 
+                                         feature_table_file = feature_table_file, 
+                                         feature_column = "Species", 
+                                         samples_column = "sample_id",
+                                         prevalence_column = "NTC", 
+                                         ntc_name = "TRUE", 
+                                         frequency_column = "concentration", 
+                                         threshold = 0.1, 
+                                         classification_method = "Metaphlan", 
+                                         output_prefix = "", 
+                                         assay_suffix = "_GLlbsMetag")
+
+# Convert count matrix to relative abundance matrix
+decontaminated_species_table <- count_to_rel_abundance(decontaminated_table)
+
+# Make plot after filtering out contaminants
+p <- make_plot(decontaminated_species_table, metadata, custom_palette, publication_format)
+
+ggsave(filename = "Metaphlan_decontam_species_barplot_GLlbsMetag.png", plot = p,
+         device = "png", width = plot_width, height = 10, units = "in", dpi = 300)
+
+# Save interactive decontaminated plot
+htmlwidgets::saveWidget(plotly::ggplotly(p), "Metaphlan_decontam_species_barplot_GLlbsMetag.html", selfcontained = TRUE)
+```
+
+**Custom Functions Used:**
+- [feature_decontam()](#feature_decontam)
+- [make_plot()](#make_plot)
+- [count_to_rel_abundance()](#count_to_rel_abundance)
+
+**Parameter Definitions:**
+
+- `metadata_table` - path to a file with samples as rows and columns describing each sample
+- `feature_table_file` - path to a comma-separated feature table, i.e. a species/functions 
+                          table with species/functions as the first column and samples as the other columns.
+- `number_samples` - the total number of samples in the species count files, adjust based on number of input samples.
+
+**Input Data:**
+
+- `Metaphlan_filtered_species_table_GLlbsMetag.csv` (filtered species counts per sample, output from [Step 9i](#9i-filter-metaphlan-species-count-table))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
+
+**Output Data:**
+
+- **Metaphlan_decontam_results_GLlbsMetag.csv** (decontam's result table, output from [feature_decontam() function](#feature_decontam))
+- **Metaphlan_decontam_species_table_GLlbsMetag.csv** (decontaminated species table, output from [feature_decontam() function](#feature_decontam))
+- Metaphlan_decontam_species_barplot_GLlbsMetag.png (barplot after filtering out contaminants)
+- **Metaphlan_decontam_species_barplot_GLlbsMetag.html** (interactive barplot after filtering out contaminants)
+
+<br>
 
 ---
 
 ## Assembly-based Processing
 
+
 ### 10. Sample Assembly
+
 ```
-megahit -1 sample1_R1_decontam.fastq.gz -2 sample1_R2_decontam.fastq.gz \
+megahit -1 sample1_R1_decontam_GLlbsMetag.fastq.gz -2 sample1_R2_decontam_GLlbsMetag.fastq.gz \
         -o sample1-assembly -t NumberOfThreads --min-contig-length 500 > sample1-assembly.log 2>&1
 ```
 
@@ -2619,7 +2874,7 @@ megahit -1 sample1_R1_decontam.fastq.gz -2 sample1_R2_decontam.fastq.gz \
 
 **Input data:**
 
-- *_R[12]_decontam.fastq.gz or *_R[12]_HostRm.fastq.gz (filtered and trimmed sample reads with both 
+- *_R[12]_decontam_GLlbsMetag.fastq.gz or *_R[12]_HostRm_GLlbsMetag.fastq.gz (filtered and trimmed sample reads with both 
     contaminants and human reads (and, optionally, host reads) removed, gzipped fasta file, 
     output from [Step 4b](#4b-build-contaminant-index-and-map-reads) or [Step 5b](#5b-remove-host-reads))
 
@@ -2628,7 +2883,9 @@ megahit -1 sample1_R1_decontam.fastq.gz -2 sample1_R2_decontam.fastq.gz \
 - sample1-assembly/final.contigs.fa (assembly file)
 - **sample1-assembly.log** (log file)
 
-<br>
+<br>  
+
+---
 
 ### 11. Rename Contigs and Summarize Assemblies
 
@@ -2981,13 +3238,14 @@ rm sample*.tmp*
 ### 15. Read-Mapping
 
 #### 15a. Build reference index
+
 ```
-bowtie2-build ssample_assembly_GLlbsMetag.fasta sample1-index
+bowtie2-build sample1_assembly_GLlbsMetag.fasta sample1-index
 ```
 
 **Parameter Definitions:**  
 
-- `ssample_assembly_GLlbsMetag.fasta` - first positional argument specifies the input assembly
+- `sample1_assembly_GLlbsMetag.fasta` - first positional argument specifies the input assembly
 -	`sample1-index` - second positional argument specifies the prefix of the output index files
 
 **Input Data:**
@@ -3003,8 +3261,8 @@ bowtie2-build ssample_assembly_GLlbsMetag.fasta sample1-index
 ```bash
 bowtie2 --mm --quiet --threads ${task.cpus} \
         -x sample1-index \
-        -1 sample1_GLlbsMetag_R1_decontam.fastq.gz \
-        -2 sample1_GLlbsMetag_R2_decontam.fastq.gz \
+        -1 sample1_R1_decontam_GLlbsMetag.fastq.gz \
+        -2 sample1_R2_decontam_GLlbsMetag.fastq.gz \
         --no-unal > sample1.sam  2> sample1-mapping-info_GLlbsMetag.txt 
 ```
 
@@ -3022,15 +3280,15 @@ bowtie2 --mm --quiet --threads ${task.cpus} \
 
 **Input Data**
 
-- sample1-index (contig-renamed assembly file, output from [Step 15a](#15a-build-reference-index))
-- *_R[12]_decontam.fastq.gz or *_R[12]_HostRm.fastq.gz (filtered and trimmed sample reads with both 
+- sample1-index (bowtie2 index files, output from [Step 15a](#15a-build-reference-index))
+- *_R[12]_decontam_GLlbsMetag.fastq.gz or *_R[12]_HostRm_GLlbsMetag.fastq.gz (filtered and trimmed sample reads with both 
     contaminants and human reads (and, optionally, host reads) removed, gzipped fasta file, 
     output from [Step 4b](#4b-build-contaminant-index-and-map-reads) or [Step 5b](#5b-remove-host-reads))
 
 **Output Data**
 
 - sample.sam (reads aligned to sample assembly in SAM format)
-- **sample-mapping-info_GLlblMetag.txt** (read mapping information)
+- **sample-mapping-info_GLlbsMetag.txt** (read mapping information)
 
 
 #### 15c. Sort and Index Assembly Alignments
@@ -3046,13 +3304,12 @@ samtools index sample_sorted.bam sample_sorted.bam.bai
 
 **Parameter Definitions:**
 
-**samtools sort**
+*samtools sort*
 - `--threads` - Number of parallel processing threads to use.
 - `-o` - Specifies the output file for the sorted aligned reads.
 - `sample.sam` - Positional argument specifying the input SAM file.
 - `> sample_sort.log 2>&1` - Redirects the standard output and standard error to a separate file.
-
-**samtools index**
+*samtools index*
 - `sample_sorted.bam` - Positional argument specifying the input BAM file to be sorted.
 - `sample_sorted.bam.bai` - Positional argument specifying the name of the index file.
 
@@ -3201,7 +3458,6 @@ rm sample*tmp sample-contig-coverages.tsv sample-contig-tax-out.tsv
 - sample-contig-coverages.tsv (table with contig-level coverages, output from [Step 16b](#16b-filter-gene-and-contig-coverage-based-on-detection))
 - sample-contig-tax-out.tsv (reformatted contig taxonomy file with lineage info, output from [Step 14f](#14f-format-contig-level-output-with-awk-and-sed))
 
-
 **Output Data:**
 
 - **sample-contig-coverage-and-tax_GLlbsMetag.tsv** (table with combined contig coverage and taxonomy info)
@@ -3269,12 +3525,13 @@ bit-GL-combine-contig-tax-tables *-contig-coverage-and-tax.tsv -o Combined
 **Input Data:**
 
 - *-contig-coverage-and-tax.tsv (tables with combined contig coverage and taxonomy info generated for individual samples, output from [Step 18](#18-combine-contig-level-coverage-and-taxonomy-for-each-sample))
+
 **Output Data:**
 
-- **Combined-contig-level-taxonomy-coverages-CPM_GLlblMetag.tsv** (table with all samples combined based on contig-level taxonomic classifications; normalized to coverage per million contigs covered)
-- **Combined-contig-level-taxonomy-coverages_GLlblMetag.tsv** (table with all samples combined based on contig-level taxonomic classifications)
+- **Combined-contig-level-taxonomy-coverages-CPM_GLlbsMetag.tsv** (table with all samples combined based on contig-level taxonomic classifications; normalized to coverage per million contigs covered)
+- **Combined-contig-level-taxonomy-coverages_GLlbsMetag.tsv** (table with all samples combined based on contig-level taxonomic classifications)
 
-<br>
+<br>  
 
 ---
 
@@ -3302,8 +3559,7 @@ zip -r sample-bins.zip sample-bins
 
 **Parameter Definitions:**  
 
-**jgi_summarize_bam_contig_depths**
-
+*jgi_summarize_bam_contig_depths*
 -  `--outputDepth` – Specifies the output depth file name.
 -  `--percentIdentity` – Minimum end-to-end percent identity of a mapped read to be included.
 -  `--minContigLength` – Minimum contig length to include.
@@ -3311,8 +3567,7 @@ zip -r sample-bins.zip sample-bins
 -  `--referenceFasta` – Specifies the input assembly fasta file.
 -  `sample.bam` – Input alignment BAM file, specified as a positional argument.
 
-**metabat2**
-
+*metabat2*
 -  `--inFile` - Specifies the input assembly fasta file.
 -  `--outFile` - Specifies the prefix of the identified bins output files.
 -  `--abdFile` - The depth file generated by the previous `jgi_summarize_bam_contig_depths` command.
@@ -3334,7 +3589,7 @@ zip -r sample-bins.zip sample-bins
 > Utilizes the default `checkm` database available [here](https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz), `checkm_data_2015_01_16.tar.gz`.
 
 ```bash
-checkm lineage_wf -f bins-overview_GLlblMetag.tsv \
+checkm lineage_wf -f bins-overview_GLlbsMetag.tsv \
                   --tab_table \
                   -x fasta \
                   ./ \
@@ -3356,18 +3611,18 @@ checkm lineage_wf -f bins-overview_GLlblMetag.tsv \
 
 **Output Data:**
 
-- **bins-overview_GLlblMetag.tsv** (tab-delimited file with quality estimates per bin)
+- **bins-overview_GLlbsMetag.tsv** (tab-delimited file with quality estimates per bin)
 - checkm-output-dir/ (directory holding detailed checkm outputs)
 
 #### 20c. Filter MAGs
 
 ```bash
-cat <( head -n 1 bins-overview_GLlblMetag.tsv ) \
-    <( awk -F $'\t' ' $12 >= 90 && $13 <= 10 && $14 == 0 ' bins-overview_GLlblMetag.tsv | sed 's/bin./MAG-/' ) \
+cat <( head -n 1 bins-overview_GLlbsMetag.tsv ) \
+    <( awk -F $'\t' ' $12 >= 90 && $13 <= 10 && $14 == 0 ' bins-overview_GLlbsMetag.tsv | sed 's/bin./MAG-/' ) \
     > checkm-MAGs-overview.tsv
     
 # copying bins into a MAGs directory in order to run tax classification
-awk -F $'\t' ' $12 >= 90 && $13 <= 10 && $14 == 0 ' bins-overview_GLlblMetag.tsv | cut -f 1 > MAG-bin-IDs.tmp
+awk -F $'\t' ' $12 >= 90 && $13 <= 10 && $14 == 0 ' bins-overview_GLlbsMetag.tsv | cut -f 1 > MAG-bin-IDs.tmp
 
 mkdir MAGs
 for ID in MAG-bin-IDs.tmp
@@ -3386,7 +3641,7 @@ done
 
 **Input Data:**
 
-- bins-overview_GLlblMetag.tsv (tab-delimited file with quality estimates per bin from [Step 20b](#20b-bin-quality-assessment))
+- bins-overview_GLlbsMetag.tsv (tab-delimited file with quality estimates per bin from [Step 20b](#20b-bin-quality-assessment))
 
 **Output Data:**
 
@@ -3425,7 +3680,7 @@ gtdbtk classify_wf --genome_dir MAGs/ \
 
 ```bash
 # combine summaries
-for MAG in $(cut -f 1 assembly-summaries_GLlblMetag.tsv | tail -n +2); do
+for MAG in $(cut -f 1 assembly-summaries_GLlbsMetag.tsv | tail -n +2); do
 
     grep -w -m 1 "^${MAG}" checkm-MAGs-overview.tsv | cut -f 12,13,14 \
         >> checkm-estimates.tmp
@@ -3445,7 +3700,7 @@ cat <(printf "est. completeness\test. redundancy\test. strain heterogeneity\n")
 cat <(printf "domain\tphylum\tclass\torder\tfamily\\tgenus\tspecies\n") gtdb-taxonomies.tmp \
     > gtdb-taxonomies-with-headers.tmp
 
-paste assembly-summaries_GLlblMetag.tsv \
+paste assembly-summaries_GLlbsMetag.tsv \
 checkm-estimates-with-headers.tmp \
 gtdb-taxonomies-with-headers.tmp \
     > MAGs-overview.tmp
@@ -3456,22 +3711,21 @@ head -n 1 MAGs-overview.tmp > MAGs-overview-header.tmp
 tail -n +2 MAGs-overview.tmp | sort -t \$'\t' -k 14,20 > MAGs-overview-sorted.tmp
 
 cat MAGs-overview-header.tmp MAGs-overview-sorted.tmp \
-    > MAGs-overview_GLlblMetag.tsv
+    > MAGs-overview_GLlbsMetag.tsv
 ```
 
 **Input Data:**
 
-- assembly-summaries_GLlblMetag.tsv (table of assembly summary statistics, output from [Step 11b](#11b-summarize-assemblies))
+- assembly-summaries_GLlbsMetag.tsv (table of assembly summary statistics, output from [Step 11b](#11b-summarize-assemblies))
 - MAGs/\*.fasta (directory holding high-quality MAGs, output from [Step 20c](#20c-filter-mags))
 - checkm-MAGs-overview.tsv (tab-delimited file with quality estimates per MAG, output from [Step 20c](#20c-filter-mags))
 - gtdbtk-output-dir/gtdbtk.\*.summary.tsv (directory of files with assigned taxonomy and info, output from [Step 20d](#20d-mag-taxonomic-classification))
 
 **Output Data:**
 
-- **MAGs-overview_GLlblMetag.tsv** (a tab-delimited overview of all recovered MAGs)
+- **MAGs-overview_GLlbsMetag.tsv** (a tab-delimited overview of all recovered MAGs)
 
-
-<br>
+<br>  
 
 ---
 
@@ -3492,7 +3746,7 @@ do
     python parse-MAG-annots.py -i annotations-and-taxonomy/${sample_ID}-gene-coverage-annotation-and-tax.tsv \
                                -w ${MAG_ID}-contigs.tmp \
                                -M ${MAG_ID} \
-                               -o MAG-level-KO-annotations_GLlblMetag.tsv
+                               -o MAG-level-KO-annotations_GLlbsMetag.tsv
 
     rm ${MAG_ID}-contigs.tmp
 
@@ -3513,15 +3767,15 @@ done
 
 **Output Data:**
 
-- **MAG-level-KO-annotations_GLlblMetag.tsv** (tab-delimited table holding MAGs and their KO annotations)
+- **MAG-level-KO-annotations_GLlbsMetag.tsv** (tab-delimited table holding MAGs and their KO annotations)
 
 
 #### 21b. Summarize KO Annotations With KEGG-Decoder
 
 ```bash
 KEGG-decoder -v interactive \
-             -i MAG-level-KO-annotations_GLlblMetag.tsv \
-             -o MAG-KEGG-Decoder-out_GLlblMetag.tsv
+             -i MAG-level-KO-annotations_GLlbsMetag.tsv \
+             -o MAG-KEGG-Decoder-out_GLlbsMetag.tsv
 ```
 
 **Parameter Definitions:**  
@@ -3532,11 +3786,11 @@ KEGG-decoder -v interactive \
 
 **Input Data:**
 
-- MAG-level-KO-annotations_GLlblMetag.tsv (tab-delimited table holding MAGs and their KO annotations, output from [Step 21a](#21a-getting-ko-annotations-per-mag))
+- MAG-level-KO-annotations_GLlbsMetag.tsv (tab-delimited table holding MAGs and their KO annotations, output from [Step 21a](#21a-getting-ko-annotations-per-mag))
 
 **Output Data:**
 
-- **MAG-KEGG-Decoder-out_GLlblMetag.tsv** (tab-delimited table holding MAGs and their proportions of 
+- **MAG-KEGG-Decoder-out_GLlbsMetag.tsv** (tab-delimited table holding MAGs and their proportions of 
                                            genes held known to be required for specific pathways/metabolisms)
 - **MAG-KEGG-Decoder-out_GLlbnMetag.html** (interactive heatmap html file of the above output table)
 
@@ -3546,13 +3800,13 @@ KEGG-decoder -v interactive \
 
 ### 22. Decontamination and Visualization of Contig- and Gene-taxonomy and Gene-function Outputs
 
-#### 22a. Gene-level taxonomy heatmaps
+#### 22a. Gene-level Taxonomy Heatmaps
 
 ```R
 library(tidyverse)
 
 metadata_file <- "/path/to/sample/metadata"
-feature_data_file <- "Combined-gene-level-taxonomy-coverages-CPM_GLlblMetag.tsv"
+feature_data_file <- "Combined-gene-level-taxonomy-coverages-CPM_GLlbsMetag.tsv"
 
 # Prepare metadata
 metadata <- read_delim(metadata_file, delim = ",") %>% as.data.frame
@@ -3586,7 +3840,7 @@ write_csv(x = table2write, file = "gene_taxonomy_table.csv")
 make_heatmap(metadata, species_gene_table, 
              samples_column="sample_id", group_column = "group", 
              output_prefix = "Combined-gene-level-taxonomy", 
-             assay_suffix = "_GLlblMetag", 
+             assay_suffix = "_GLlbsMetag", 
              custom_palette = custom_palette)
 
 ```
@@ -3597,15 +3851,15 @@ make_heatmap(metadata, species_gene_table,
 
 **Input data:**
 - /path/to/sample/metadata (a file with samples as rows and columns describing each sample)
-- Combined-gene-level-taxonomy-coverages-CPM_GLlblMetag.tsv (table with all samples 
+- Combined-gene-level-taxonomy-coverages-CPM_GLlbsMetag.tsv (table with all samples 
     combined based on gene-level taxonomic classifications, output from 
     [Step 19a](#19a-generating-gene-level-coverage-summary-tables)) 
 
 **Output data:**
 - gene_taxonomy_table.csv (aggregated gene taxonomy table with samples in columns and species in rows)
-- **Combined-gene-level-taxonomy_heatmap_GLlblMetag.png** (heatmap of all gene taxonomy assignments)
+- **Combined-gene-level-taxonomy_heatmap_GLlbsMetag.png** (heatmap of all gene taxonomy assignments)
 
-#### 22b. Gene-level taxonomy decontamination
+#### 22b. Gene-level Taxonomy Decontamination
 
 ```R
 library(tidyverse)
@@ -3635,7 +3889,7 @@ decontaminated_table <- feature_decontam(metadata_file = metadata_table,
                                          threshold = 0.1, 
                                          classification_method = "Combined-gene-level-taxonomy", 
                                          output_prefix = "", 
-                                         assay_suffix = "_GLlblMetag")
+                                         assay_suffix = "_GLlbsMetag")
 
 # Get common samples and re-arrange feature table and metadata
 common_samples <- intersect(colnames(decontaminated_table), rownames(metadata))
@@ -3646,7 +3900,7 @@ metadata <- metadata %>% arrange(!!sym(group_column))
 make_heatmap(metadata, decontaminated_table, 
              samples_column = "sample_id", group_column = "group", 
              output_prefix = "Combined-gene-level-taxonomy_decontam", 
-             assay_suffix = "_GLlblMetag",
+             assay_suffix = "_GLlbsMetag",
              custom_palette)
 
 ```
@@ -3656,6 +3910,7 @@ make_heatmap(metadata, decontaminated_table,
 - [make_heatmap()](#make_plot)
 
 **Parameter Definitions:**
+
 - `metadata_table` - path to a file with samples as rows and columns describing each sample
 - `feature_table_file` - path to a tab separated samples feature table containing gene-level coverage data 
                          species/functions as the first column and samples as other columns.
@@ -3668,18 +3923,18 @@ make_heatmap(metadata, decontaminated_table,
 
 **Output Data:**
 
-- **Combined-gene-level-taxonomy_decontam_results_GLlblMetag.csv** (decontam's results table)
-- **Combined-gene-level-taxonomy_decontam_species_table_GLlblMetag.csv** (decontaminated species table)
-- **Combined-gene-level-taxonomy_decontam_heatmap_GLlblMetag.png** (gene-level taxonomy heatmap after filtering out contaminants)
+- **Combined-gene-level-taxonomy_decontam_results_GLlbsMetag.csv** (decontam's results table)
+- **Combined-gene-level-taxonomy_decontam_species_table_GLlbsMetag.csv** (decontaminated species table)
+- **Combined-gene-level-taxonomy_decontam_heatmap_GLlbsMetag.png** (gene-level taxonomy heatmap after filtering out contaminants)
 
-#### 22c. Gene-level KO functions heatmaps
+#### 22c. Gene-level KO Functions Heatmaps
 
 ```R
 library(tidyverse)
 library(pheatmap)
 
 metadata_file <- "/path/to/sample/metadata"
-feature_data_file <- "Combined-gene-level-KO-function-coverages-CPM_GLlblMetag.ts"
+feature_data_file <- "Combined-gene-level-KO-function-coverages-CPM_GLlbsMetag.tsv"
 
 # Abundant functions with CPM > 2000
 abundance_threshold <- 2000
@@ -3713,7 +3968,7 @@ metadata <- metadata %>% arrange(!!sym(group_column))
 make_heatmap(metadata, table2write,
              samples_column="sample_id", group_column = "group", 
              output_prefix = "Combined-gene-level-KO-function", 
-             assay_suffix = "_GLlblMetag", 
+             assay_suffix = "_GLlbsMetag", 
              custom_palette = custom_palette)
 
 ```
@@ -3722,16 +3977,18 @@ make_heatmap(metadata, table2write,
 - [make_heatmap()](#make_heatmap)
 
 **Input data:**
+
 - /path/to/sample/metadata (a file with samples as rows and columns describing each sample)
-- Combined-gene-level-KO-function-coverages-CPM_GLlblMetag.tsv (table with all samples combined 
+- Combined-gene-level-KO-function-coverages-CPM_GLlbsMetag.tsv (table with all samples combined 
     based on KO annotations; normalized to coverage per million genes covered, output from 
     [Step 19a](#19a-generate-gene-level-coverage-summary-tables)
 
 **Output data:**
+
 - genes-KO-functions_table.csv (aggregated and subsetted gene KO function table)
-- **Combined-gene-level-KO-function_heatmap_GLlblMetag.png** (heatmap of all gene-level KO function assignments)
+- **Combined-gene-level-KO-function_heatmap_GLlbsMetag.png** (heatmap of all gene-level KO function assignments)
 
-#### 22d. Gene-level KO functions decontamination
+#### 22d. Gene-level KO Functions Decontamination
 
 ```R
 library(tidyverse)
@@ -3761,7 +4018,7 @@ decontaminated_table <- feature_decontam(metadata_file = metadata_table,
                                          threshold = 0.1, 
                                          classification_method = "Combined-gene-level-KO-function", 
                                          output_prefix = "", 
-                                         assay_suffix = "_GLlblMetag")
+                                         assay_suffix = "_GLlbsMetag")
 
 # Get common samples and re-arrange feature table and metadata
 common_samples <- intersect(colnames(decontaminated_table), rownames(metadata))
@@ -3772,7 +4029,7 @@ metadata <- metadata %>% arrange(!!sym(group_column))
 make_heatmap(metadata, decontaminated_table, 
              samples_column = "sample_id", group_column = "group", 
              output_prefix = "Combined-gene-level-KO-function_decontam", 
-             assay_suffix = "_GLlblMetag",
+             assay_suffix = "_GLlbsMetag",
              custom_palette)
 
 ```
@@ -3782,6 +4039,7 @@ make_heatmap(metadata, decontaminated_table,
 - [make_heatmap()](#make_plot)
 
 **Parameter Definitions:**
+
 - `metadata_table` - path to a file with samples as rows and columns describing each sample
 - `feature_table_file` - path to a tab separated samples feature table containing gene-level KO functions coverage data 
                          with KO_ID as the first column and samples as other columns.
@@ -3794,9 +4052,9 @@ make_heatmap(metadata, decontaminated_table,
 
 **Output Data:**
 
-- **Combined-gene-level-KO-function_decontam_results_GLlblMetag.csv** (decontam's results table)
-- **Combined-gene-level-KO-function_decontam_species_table_GLlblMetag.csv** (decontaminated gene-level KO functions table)
-- **Combined-gene-level-KO-function_decontam_heatmap_GLlblMetag.png** (gene-level KO functions heatmap after filtering out contaminants)
+- **Combined-gene-level-KO-function_decontam_results_GLlbsMetag.csv** (decontam's results table)
+- **Combined-gene-level-KO-function_decontam_species_table_GLlbsMetag.csv** (decontaminated gene-level KO functions table)
+- **Combined-gene-level-KO-function_decontam_heatmap_GLlbsMetag.png** (gene-level KO functions heatmap after filtering out contaminants)
 
 
 #### 22e. Contig-level Heatmaps
@@ -3805,7 +4063,7 @@ make_heatmap(metadata, decontaminated_table,
 library(tidyverse)
 
 metadata_file <- "/path/to/sample/metadata"
-feature_data_file <- "Combined-contig-level-taxonomy-coverages-CPM_GLlblMetag.tsv"
+feature_data_file <- "Combined-contig-level-taxonomy-coverages-CPM_GLlbsMetag.tsv"
 
 # Prepare metadata
 metadata <- read_delim(metadata_file, delim = ",") %>% as.data.frame
@@ -3839,7 +4097,7 @@ write_csv(x = table2write, file = "contig_taxonomy_table.csv")
 make_heatmap(metadata, species_contig_table, 
              samples_column="sample_id", group_column = "group", 
              output_prefix = "Combined-contig-level-taxonomy", 
-             assay_suffix = "_GLlblMetag", 
+             assay_suffix = "_GLlbsMetag", 
              custom_palette = custom_palette)
 ```
 
@@ -3848,16 +4106,18 @@ make_heatmap(metadata, species_contig_table,
 - [make_heatmap()](#make_heatmap)
 
 **Input data:**
+
 - /path/to/sample/metadata (a file with samples as rows and columns describing each sample)
-- Combined-contig-level-taxonomy-coverages-CPM_GLlblMetag.tsv (table with all samples 
+- Combined-contig-level-taxonomy-coverages-CPM_GLlbsMetag.tsv (table with all samples 
     combined based on contig-level taxonomic classifications, output from 
     [Step 19b](#19b-generate-contig-level-coverage-summary-tables)) 
 
 **Output data:**
+
 - contig_taxonomy_table.csv (aggregated contig taxonomy table with samples in columns and species in rows)
-- **Combined-contig-level-taxonomy_heatmap_GLlblMetag.png** (heatmap of all contig taxonomy assignments)
+- **Combined-contig-level-taxonomy_heatmap_GLlbsMetag.png** (heatmap of all contig taxonomy assignments)
 
-#### 22f. Contig-level decontamination
+#### 22f. Contig-level Decontamination
 
 ```R
 library(tidyverse)
@@ -3887,7 +4147,7 @@ decontaminated_table <- feature_decontam(metadata_file = metadata_table,
                                          threshold = 0.1, 
                                          classification_method = "Combined-contig-level-taxonomy", 
                                          output_prefix = "", 
-                                         assay_suffix = "_GLlblMetag")
+                                         assay_suffix = "_GLlbsMetag")
 
 # Get common samples and re-arrange feature table and metadata
 common_samples <- intersect(colnames(decontaminated_table), rownames(metadata))
@@ -3898,7 +4158,7 @@ metadata <- metadata %>% arrange(!!sym(group_column))
 make_heatmap(metadata, decontaminated_table, 
              samples_column = "sample_id", group_column = "group", 
              output_prefix = "Combined-contig-level-taxonomy_decontam", 
-             assay_suffix = "_GLlblMetag",
+             assay_suffix = "_GLlbsMetag",
              custom_palette)
 
 ```
@@ -3908,6 +4168,7 @@ make_heatmap(metadata, decontaminated_table,
 - [make_heatmap()](#make_plot)
 
 **Parameter Definitions:**
+
 - `metadata_table` - path to a file with samples as rows and columns describing each sample
 - `feature_table_file` - path to a tab separated samples feature table containing contig-level coverage data 
                          species/functions as the first column and samples as other columns.
@@ -3920,7 +4181,7 @@ make_heatmap(metadata, decontaminated_table,
 
 **Output Data:**
 
-- **Combined-contig-level-taxonomy_decontam_results_GLlblMetag.csv** (decontam's results table)
-- **Combined-contig-level-taxonomy_decontam_species_table_GLlblMetag.csv** (decontaminated contig-level species table)
-- **Combined-contig-level-taxonomy_decontam_heatmap_GLlblMetag.png** (contig-level heatmap after filtering out contaminants)
+- **Combined-contig-level-taxonomy_decontam_results_GLlbsMetag.csv** (decontam's results table)
+- **Combined-contig-level-taxonomy_decontam_species_table_GLlbsMetag.csv** (decontaminated contig-level species table)
+- **Combined-contig-level-taxonomy_decontam_heatmap_GLlbsMetag.png** (contig-level heatmap after filtering out contaminants)
 

From e7cd7016a0075218270aecb5fd7f5374a0a34ab9 Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Mon, 26 Jan 2026 21:19:44 -0800
Subject: [PATCH 25/47] update get_abundant_features functions docs

---
 Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md b/Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md
index 9cf180152..067b12a00 100644
--- a/Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md
+++ b/Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md
@@ -943,7 +943,8 @@ library(pavian)
   - `mat` - a feature count matrix with features as rows and samples as columns
   - `cpm_threshold = 1000` - threshold to identify abundant features
 
-  **Returns:** a species relative abundance matrix with samples and species as rows and column, respectively.
+  **Returns:** a matrix holding the features that pass the requested threshold
+  
 </details>
 
 #### count_to_rel_abundance()

From 2d7d797cc4c836348816a3c265649ba8a8fcaf50 Mon Sep 17 00:00:00 2001
From: asaravia-butler <70983120+asaravia-butler@users.noreply.github.com>
Date: Tue, 27 Jan 2026 19:17:52 -0500
Subject: [PATCH 26/47] Dev metagenomics low biomass - nanopore updates through
 step 9c (#190)

* Update GL-DPPD-7116.md through step 9c (set global variables).
* Fix numbering in GL-DPPD-7116.md
---
 .../Low_Biomass/Nanopore/GL-DPPD-7116.md      | 320 +++++++++---------
 1 file changed, 160 insertions(+), 160 deletions(-)

diff --git a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md
index 0ee66cb16..c8c60add9 100644
--- a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md
+++ b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md
@@ -973,11 +973,11 @@ multiqc --zip-data-dir \
 
 ---
 
-### 8. R Environment Setup
+### 9. R Environment Setup
 
 > Taxonomy bar plots, heatmaps and feature decontamination with decontam are performed in R.
 
-#### 8a. Load libraries
+#### 9a. Load libraries
 
 ```R
 library(decontam)
@@ -987,7 +987,7 @@ library(pheatmap)
 library(pavian)
 ```
 
-#### 8b. Define Custom Functions
+#### 9b. Define Custom Functions
 
 ##### get_last_assignment()
 <details>
@@ -1052,7 +1052,7 @@ library(pavian)
   - `df` - a dataframe containing the taxonomy assignments
   - `taxonomy_column=` - name of the column in `df` containing the taxonomy assignments, default="taxonomy"
 
-  **Returns:** a dataframe with unique last taxonomy names stored in a column named "taxonomy"
+  **Returns:** dataframe, `df`, with unique last taxonomy names stored in a column named "taxonomy"
 
 </details>
 
@@ -1101,7 +1101,7 @@ library(pavian)
   - `file_path` - file path to the tab-delimited kaiju output table file
   - `taxon_col=`- name of the taxon column in the input data file, default="taxon_name"
 
-  **Returns:** a dataframe with reformated kaiju output
+  **Returns:** dataframe, `abs_abun_matrix`, with reformatted kaiju output
 
 </details>
 
@@ -1153,7 +1153,7 @@ library(pavian)
   **Function Parameter Definitions:**
   - `reports_dir` - path to a directory containing kraken2 reports 
 
-  **Returns:** a kraken species count matrix with samples and species as columns and rows, respectively.
+  **Returns:** a kraken species count matrix, `species_table`, with samples and species as columns and rows, respectively.
 
 </details>
 
@@ -1177,7 +1177,7 @@ library(pavian)
   - `mat` - a feature count matrix with features as rows and samples as columns
   - `cpm_threshold = 1000` - threshold to identify abundant features
 
-  **Returns:** a species relative abundance matrix with samples and species as rows and column, respectively.
+  **Returns:** a matrix, `abund_features.m`, holding the features that pass the requested threshold
 </details>
 
 ##### count_to_rel_abundance()
@@ -1209,7 +1209,7 @@ library(pavian)
   **Function Parameter Definitions:**
   - `species_table` - a species count matrix with samples and species as columns and rows, respectively.
 
-  **Returns:** a species relative abundance matrix with samples and species as rows and column, respectively.
+  **Returns:** a species relative abundance matrix, `abund_table`, with samples and species as rows and columns, respectively.
 
 </details>
 
@@ -1252,7 +1252,7 @@ library(pavian)
   - `non_microbial` - a regular expression denoting the names used to identify a species as non-microbial or unwanted
   - `threshold=` - abundance threshold used to determine if the relative abundance is rare, value denotes a percentage between 0 and 100.
 
-  **Returns:** a dataframe with rare and non_microbial/unwanted species removed
+  **Returns:** dataframe, `abund_table`, with rare and non_microbial/unwanted species removed
 </details>
 
 ##### group_low_abund_taxa()
@@ -1311,7 +1311,7 @@ library(pavian)
   - `rare_taxa` - a boolean specifying if only rare taxa should be returned
   - `threshold` - a max abundance threshold for defining taxa as rare
 
-  **Returns:** a relative abundance matrix with rare taxa grouped or with non-rare taxa filtered out
+  **Returns:** a relative abundance matrix, `abund_table`, with rare taxa grouped or with non-rare taxa filtered out
 
 </details>
 
@@ -1356,7 +1356,7 @@ library(pavian)
   - `samples_column` - a character column specifying the column in `metadata` holding sample names, default is "Sample_ID"
   - `prefix_to_remove` - a string specifying a prefix or any character set to remove from sample names, default is "barcode"
 
-  **Returns:** a relative abundance stacked bar plot
+  **Returns:** a relative abundance stacked bar plot, `p`
 
 </details>
 
@@ -1409,10 +1409,10 @@ library(pavian)
   - `output_prefix` - a character string specifying the unique name to add to the output file names 
                       used to denote the data type/source, for example "unfiltered-kaiju_species"
   - `assay_suffix` - a character string specifying the GeneLab assay suffix (default: "_GLlblMetag")
-  - `publication_format` - a ggplot::theme object specifying a custom theme for plotting, from [Step 8c](#8c-set-global-variables)
-  - `custom_palette` - a vector of strings specifying a custom color palette for coloring plots, from [Step 8c](#8c-set-global-variables)
+  - `publication_format` - a ggplot::theme object specifying a custom theme for plotting, from [Step 9c](#9c-set-global-variables)
+  - `custom_palette` - a vector of strings specifying a custom color palette for coloring plots, from [Step 9c](#9c-set-global-variables)
 
-  **Returns:** a relative abundance stacked bar plot
+  **Returns:** a relative abundance stacked bar plot, `p`
 
 </details>
 
@@ -1482,7 +1482,7 @@ library(pavian)
   - `output_prefix` - a character string specifying the unique name to add to the output file names 
                       used to denote the data type/source, for example "unfiltered-kaiju_species"
   - `assay_suffix` - a character string specifying the GeneLab assay suffix (default: "_GLlblMetag")
-  - `custom_palette` - a vector of strings specifying a custom color palette for coloring plots, from [Step 8c](#8c-set-global-variables)
+  - `custom_palette` - a vector of strings specifying a custom color palette for coloring plots, from [Step 9c](#9c-set-global-variables)
 
 </details>
 
@@ -1554,7 +1554,7 @@ library(pavian)
   - `contam_threshold` -  the probability threshold below which (strictly less than) the null-hypothesis 
                           (not a contaminant) should be rejected in favor of the alternate hypothesis (contaminant).
 
-  **Returns:** a dataframe of detailed decontam results
+  **Returns:** dataframe, `contamdf`, containing detailed decontam results
 </details>
 
 ##### feature_decontam() 
@@ -1642,7 +1642,7 @@ library(pavian)
   - {classification_method}_decontam_species_table_GLlblMetag.csv - decontaminated feature table file
   - {classification_method}_decontam_results_GLlblMetag.csv - Decontam results file
 
-  **Returns:** a dataframe containing the decontaminated feature table
+  **Returns:** dataframe, `decontaminated_table`, containing the decontaminated feature table
 
 </details>
 
@@ -1679,7 +1679,7 @@ library(pavian)
   - `prefix`  - is a regular expression specifying a character sequence to remove
                 from taxon names
 
-  **Returns:** a dataframe of reformated taxonomy names
+  **Returns:** dataframe, `taxonomy`, containing reformatted taxonomy names
 
 </details>
 
@@ -1712,7 +1712,7 @@ library(pavian)
   - `stringToReplace` - a regex string specifying what to replace
   - `suffix` - string specifying the replacement value
 
-  **Returns:** a dataframe of reformated/cleaned taxonomy names
+  **Returns:** dataframe, `taxonomy`, containing reformatted/cleaned taxonomy names
 
 </details>
 
@@ -1747,15 +1747,15 @@ library(pavian)
   ```
 
   **Custom Functions Used:**
-  [process_taxonomy](#process_taxonomy)
-  [fix_names()](#fix_names)
+  [process_taxonomy()](#process_taxonomy)  
+  [fix_names()](#fix_names)   
 
   **Function Parameter Definitions:**
 
   - `file_name` - path to contig taxonomy assignment file to be read
   - `sample_names` - string of samples names to keep in the final dataframe
 
-  **Returns:** a dataframe with cleaned taxonomy names and sample species count
+  **Returns:** dataframe, `df`, containing cleaned taxonomy names and sample species count
 
 </details>
 
@@ -1780,12 +1780,12 @@ library(pavian)
 
   - `assembly_summary` - path to assembly summary file
 
-  **Returns:** a character vector of sorted sample names
+  **Returns:** a character vector, `sample_order`, of sorted sample names
 
 </details>
 
 
-#### 8c. Set global variables
+#### 9c. Set global variables
 
 ```R
 # Define custom theme for plotting
@@ -1830,9 +1830,9 @@ custom_palette <- custom_palette[-c(21:23,
 
 ## Read-based Processing
 
-### 9. Taxonomic Profiling Using Kaiju
+### 10. Taxonomic Profiling Using Kaiju
 
-#### 9a. Build Kaiju Database
+#### 10a. Build Kaiju Database
 
 ```bash
 # Make a directory that will hold the downloaded kaiju database
@@ -1863,7 +1863,7 @@ rm nr_euk/kaiju_db_nr_euk.bwt nr_euk/kaiju_db_nr_euk.sa
 - kaiju-db/merged.dmp (merged taxonomy IDs file from the NCBI Taxonomy database that maps deprecated taxonomic IDs to current ones)
 
 
-#### 9b. Kaiju Taxonomic Classification
+#### 10b. Kaiju Taxonomic Classification
 
 ```bash
 kaiju -f kaiju-db/nr_euk/kaiju_db_nr_euk.fmi \
@@ -1885,17 +1885,17 @@ kaiju -f kaiju-db/nr_euk/kaiju_db_nr_euk.fmi \
 
 **Input Data:**
 
-- kaiju-db/nr_euk/kaiju_db_nr_euk.fmi (FM-index file containing the main Kaiju database index, output from [Step 9a](#9a-build-kaiju-database))
-- kaiju-db/nodes.dmp (kaiju taxonomy hierarchy nodes file, output from [Step 9a](#9a-build-kaiju-database))
+- kaiju-db/nr_euk/kaiju_db_nr_euk.fmi (FM-index file containing the main Kaiju database index, output from [Step 10a](#10a-build-kaiju-database))
+- kaiju-db/nodes.dmp (kaiju taxonomy hierarchy nodes file, output from [Step 10a](#10a-build-kaiju-database))
 - sample_decontam_GLlblMetag.fastq.gz or sample_HostRm_GLlblMetag.fastq.gz (filtered and trimmed sample reads with both 
-    contaminants and human reads (and optionally host reads) removed, gzipped fasta file, 
+    contaminants and human reads (and optionally host reads) removed, gzipped fastq file, 
     output from [Step 7e](#7e-generate-decontaminated-read-files) or [Step 8b](#8b-remove-host-reads))
 
 **Output Data:**
 
 - sample_kaiju.out (kaiju output file)
 
-#### 9c. Compile Kaiju Taxonomy Results
+#### 10c. Compile Kaiju Taxonomy Results
 
 ```bash
 # Merge kaiju reports to one table at the species level 
@@ -1922,9 +1922,9 @@ sed -i -E 's/file/sample/' merged_kaiju_table.tsv
 
 **Input Data:**
 
-- kaiju-db/nodes.dmp (kaiju taxonomy hierarchy nodes file, output from [Step 9a](#9a-build-kaiju-database))
-- kaiju-db/names.dmp (kaiju taxonomy names file, output from [Step 9a](#9a-build-kaiju-database))
-- *kaiju.out (kaiju output files, output from [Step 9b](#9b-kaiju-taxonomic-classification))
+- kaiju-db/nodes.dmp (kaiju taxonomy hierarchy nodes file, output from [Step 10a](#10a-build-kaiju-database))
+- kaiju-db/names.dmp (kaiju taxonomy names file, output from [Step 10a](#10a-build-kaiju-database))
+- *kaiju.out (kaiju output files, output from [Step 10b](#10b-kaiju-taxonomic-classification))
 
 **Output Data:**
 
@@ -1949,15 +1949,15 @@ kaiju2krona -u \
 - `-o` - Specifies the name of krona formatted kaiju output file.
 
 **Input Data:**
-- kaiju-db/names.dmp (kaiju taxonomy names file, output from [Step 9a](#9a-build-kaiju-database))
-- kaiju-db/nodes.dmp (kaiju taxonomy hierarchy nodes file, output from [Step 9a](#9a-build-kaiju-database))
-- sample_kaiju.out (kaiju output file, output from [Step 9b](#9b-kaiju-taxonomic-classification))
+- kaiju-db/names.dmp (kaiju taxonomy names file, output from [Step 10a](#10a-build-kaiju-database))
+- kaiju-db/nodes.dmp (kaiju taxonomy hierarchy nodes file, output from [Step 10a](#10a-build-kaiju-database))
+- sample_kaiju.out (kaiju output file, output from [Step 10b](#10b-kaiju-taxonomic-classification))
 
 **Output Data:**
 
 - sample.krona (krona formatted kaiju output)
 
-#### 9e. Compile Kaiju Krona Reports
+#### 10e. Compile Kaiju Krona Reports
 
 ```bash
 # Create a file containing a sorted list of all .krona files 
@@ -2003,7 +2003,7 @@ ktImportText  -o kaiju-report.html ${KTEXT_FILES[*]}
                         sample_1.krona,sample_1 sample_2.krona,sample_2 ... sample_n.krona,sample_n.
 
 **Input Data:**
-- *.krona (all sample .krona formatted files, output from [Step 9d](#9d-convert-kaiju-output-to-krona-format)) 
+- *.krona (all sample .krona formatted files, output from [Step 10d](#10d-convert-kaiju-output-to-krona-format)) 
 
                       
 **Output Data:**
@@ -2013,7 +2013,7 @@ ktImportText  -o kaiju-report.html ${KTEXT_FILES[*]}
 - **kaiju-report_GLlblMetag.html** (compiled krona html report containing all samples)
 
 
-#### 9f. Create Kaiju Species Count Table
+#### 10f. Create Kaiju Species Count Table
 
 ```R
 library(tidyverse)
@@ -2036,14 +2036,14 @@ write_csv(x = table2write, file = "kaiju_species_table_GLlblMetag.csv")
 
 **Input Data:**
 
-- merged_kaiju_table_GLlblMetag.tsv (compiled kaiju table at the species taxon level, from [Step 9c](#9c-compile-kaiju-taxonomy-results))
+- merged_kaiju_table_GLlblMetag.tsv (compiled kaiju table at the species taxon level, from [Step 10c](#10c-compile-kaiju-taxonomy-results))
 
 **Output Data:**
 
 - **kaiju_species_table_GLlblMetag.csv** (kaiju species count table in csv format)
 
 
-#### 9g. Filter Kaiju Species Count Table
+#### 10g. Filter Kaiju Species Count Table
 
 ```R
 library(tidyverse)
@@ -2096,7 +2096,7 @@ write_csv(x = table2write, file = output_file)
 
 ---
 
-#### 9h. Taxonomy barplots
+#### 10h. Taxonomy barplots
 
 ```R
 library(tidyverse)
@@ -2158,7 +2158,7 @@ htmlwidgets::saveWidget(ggplotly(p), glue("kaiju_filtered_species_barplot_GLlblM
 - **kaiju_filtered_species_barplot_GLlblMetag.html** (interactive taxonomy barplot after filtering rare and non-microbial taxa)
 
 
-#### 9i. Feature decontamination
+#### 10i. Feature decontamination
 
 > Feature (species) decontamination with decontam. Decontam is an R package that statistically identifies contaminating features in a feature table
 
@@ -2227,9 +2227,9 @@ htmlwidgets::saveWidget(ggplotly(p), glue("kaiju_decontam_species_barplot_GLlblM
 
 ---
 
-### 10. Taxonomic Profiling Using Kraken2
+### 11. Taxonomic Profiling Using Kraken2
 
-#### 10a. Download Kraken2 Database
+#### 11a. Download Kraken2 Database
 
 ```bash 
 ## Download all microbial (including eukaryotes) - https://benlangmead.github.io/aws-indexes/k2
@@ -2278,7 +2278,7 @@ tar -xvzf k2_pluspfp.tar.gz
 
 - kraken2-db/  (a directory containing kraken2 database files)
 
-#### 10b. Kraken2 Taxonomic Classification
+#### 11b. Kraken2 Taxonomic Classification
 
 ```bash
 kraken2 --db kraken2-db/ \
@@ -2302,7 +2302,7 @@ kraken2 --db kraken2-db/ \
 
 **Input Data:**
 
-- kraken2-db/ (a directory containing kraken2 database files, output from [Step 10a](#10a-download-kraken2-database))
+- kraken2-db/ (a directory containing kraken2 database files, output from [Step 11a](#11a-download-kraken2-database))
 - sample_decontam_GLlblMetag.fastq.gz or sample_HostRm_GLlblMetag.fastq.gz (filtered and trimmed sample reads with both 
     contaminants and human reads (and, optionally, host reads) removed, gzipped fasta file, 
     output from [Step 7e](#7e-generate-decontaminated-read-files) or [Step 8b](#8b-remove-host-reads))
@@ -2313,13 +2313,13 @@ kraken2 --db kraken2-db/ \
 - sample-kraken2-report.tsv (kraken2 report output file (one line per taxa, with number of reads assigned to it))
 
 
-#### 10c. Compile Kraken2 Taxonomy Results
+#### 11c. Compile Kraken2 Taxonomy Results
 
-##### 10ci. Create Merged Kraken2 Taxonomy Table
+##### 11ci. Create Merged Kraken2 Taxonomy Table
 
 ```R
 species_table <- merge_kraken_reports(reports-dir='/path/to/kraken2/reports')
-write_csv(x = species_table, file = "merged-kraken2-table.csv")
+write_csv(x = species_table, file = "kraken2_species_table_GLlblMetag.csv")
 ```
 
 **Custom Functions Used:**
@@ -2340,13 +2340,13 @@ write_csv(x = species_table, file = "merged-kraken2-table.csv")
 
 **Input Data:**
 
-- \*-kraken2-report.tsv (kraken report from each sample to compile, outputs from [Step 10b](#10b-taxonomic-classification))
+- \*-kraken2-report.tsv (kraken report from each sample to compile, outputs from [Step 11b](#11b-kraken2-taxonomic-classification))
 
 **Output Data:**
 
 - **kraken2_species_table_GLlblMetag.csv** (kraken species count table in csv format)
 
-##### 10cii. Compile Kraken2 Taxonomy Reports
+##### 11cii. Compile Kraken2 Taxonomy Reports
 
 ```bash
 multiqc --zip-data-dir \ 
@@ -2366,7 +2366,7 @@ multiqc --zip-data-dir \
 
 **Input Data:**
 
-- \*-kraken2-report.tsv (kraken report from each sample to compile, outputs from [Step 10b](#10b-taxonomic-classification))
+- \*-kraken2-report.tsv (kraken report from each sample to compile, outputs from [Step 11b](#11b-kraken2-taxonomic-classification))
 
 **Output Data:**
 
@@ -2374,7 +2374,7 @@ multiqc --zip-data-dir \
 - **kraken2_multiqc_GLlblMetag_data.zip** (zip archive containing multiqc output data)
 
 
-#### 10d. Convert Kraken2 Output to Krona Format
+#### 11d. Convert Kraken2 Output to Krona Format
 
 ```bash
 kreport2krona.py --report-file sample-kraken2-report.tsv  \
@@ -2388,14 +2388,14 @@ kreport2krona.py --report-file sample-kraken2-report.tsv  \
 
 **Input Data:**
 
-- sample-kraken2-report.tsv (kraken report, output from [Step 10b](#10b-taxonomic-classification))
+- sample-kraken2-report.tsv (kraken report, output from [Step 11b](#11b-kraken2-taxonomic-classification))
 
 **Output Data:**
 
 - sample.krona (krona formatted kraken2 output)
 
 
-#### 10e. Compile Kraken2 Krona Reports
+#### 11e. Compile Kraken2 Krona Reports
 
 ```bash
 # Find, list and write all .krona files to file 
@@ -2440,7 +2440,7 @@ ktImportText -o kraken2-report_GLlblMetag.html ${KTEXT_FILES[*]}
 
 **Input Data:**
 
-- *.krona (all sample .krona formatted files, output from [Step 10d](#10d-convert-kraken2-output-to-krona-format)) 
+- *.krona (all sample .krona formatted files, output from [Step 11d](#11d-convert-kraken2-output-to-krona-format)) 
 
                       
 **Output Data:**
@@ -2451,7 +2451,7 @@ ktImportText -o kraken2-report_GLlblMetag.html ${KTEXT_FILES[*]}
 
 ---
 
-#### 10f. Filter Kraken2 Species Count Table
+#### 11f. Filter Kraken2 Species Count Table
 
 ```R
 library(tidyverse)
@@ -2489,7 +2489,7 @@ write_csv(x = table2write, file = output_file)
 
 **Input Data:**
 
-- kraken2_species_table_GLlblMetag.csv (path to kaiju species table from [Step 10ci.](#10ci-create-merged-kraken2-taxonomy-table))
+- kraken2_species_table_GLlblMetag.csv (path to kraken2 species table from [Step 11ci](#11ci-create-merged-kraken2-taxonomy-table))
 
 **Output Data:**
 
@@ -2498,7 +2498,7 @@ write_csv(x = table2write, file = output_file)
 ---
 
 
-#### 10g. Taxonomy barplots
+#### 11g. Taxonomy barplots
 
 ```R
 library(tidyverse)
@@ -2546,8 +2546,8 @@ htmlwidgets::saveWidget(ggplotly(p), glue("kraken2_filtered_species_barplot_GLlb
 
 **Input Data:**
 
-- `kraken2_species_table_GLlblMetag.csv` (path to kaiju species table from [Step 10ci.](#10ci-create-merged-kraken2-taxonomy-table))
-- `kraken2_filtered_species_table_GLlblMetag.csv` (a file containing the filtered species count table, output from [Step 9g](#10f-filter-kraken2-species-count-table))
+- `kraken2_species_table_GLlblMetag.csv` (path to kraken2 species table from [Step 11ci](#11ci-create-merged-kraken2-taxonomy-table))
+- `kraken2_filtered_species_table_GLlblMetag.csv` (a file containing the filtered species count table, output from [Step 11f](#11f-filter-kraken2-species-count-table))
 - `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping samplenames to group metadata)
 
 **Output Data:**
@@ -2558,7 +2558,7 @@ htmlwidgets::saveWidget(ggplotly(p), glue("kraken2_filtered_species_barplot_GLlb
 - **kraken2_filtered_species_barplot_GLlblMetag.html** (interactive taxonomy barplot after filtering rare and non-microbial taxa)
 
 
-#### 10h. Feature decontamination
+#### 11h. Feature decontamination
 
 > Feature (species) decontamination with decontam. Decontam is an R package that statistically 
   identifies contaminating features in a feature table
@@ -2614,7 +2614,7 @@ htmlwidgets::saveWidget(ggplotly(p), glue("kraken2_decontam_species_barplot_GLlb
 
 **Input Data:**
 
-- `kraken2_filtered_species_table_GLlblMetag.csv`(path to filtered species count per sample, output from [Step 10f](#10f-filter-kraken2-species-count-table))
+- `kraken2_filtered_species_table_GLlblMetag.csv` (path to filtered species count per sample, output from [Step 11f](#11f-filter-kraken2-species-count-table))
 - `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping samplenames to group metadata)
 
 **Output Data:**
@@ -2630,7 +2630,7 @@ htmlwidgets::saveWidget(ggplotly(p), glue("kraken2_decontam_species_barplot_GLlb
 
 ## Assembly-based Processing
 
-### 11. Sample Assembly
+### 12. Sample Assembly
 
 ```bash
 flye --meta \
@@ -2667,7 +2667,7 @@ mv sample/flye.log sample_assembly.log
 
 ---
 
-### 12. Polish Assembly
+### 13. Polish Assembly
 
 ```bash
 medaka_consensus -t NumberOfThreads \
@@ -2690,7 +2690,7 @@ mv sample/consensus.fasta sample_polished.fasta
 - sample_decontam_GLlblMetag.fastq.gz or sample_HostRm_GLlblMetag.fastq.gz (filtered and trimmed sample reads with both 
     contaminants and human reads (and optionally host reads) removed, gzipped fasta file, 
     output from [Step 7e](#7e-generate-decontaminated-read-files) or [Step 8b](#8b-remove-host-reads))
-- /path/to/assemblies/sample_assembly.fasta (sample assembly, output from [Step 11](#11-sample-assembly))
+- /path/to/assemblies/sample_assembly.fasta (sample assembly, output from [Step 12](#12-sample-assembly))
 
 **Output Data:**
 
@@ -2698,9 +2698,9 @@ mv sample/consensus.fasta sample_polished.fasta
 
 ---
 
-### 13. Rename Contigs and Summarize Assemblies
+### 14. Rename Contigs and Summarize Assemblies
 
-#### 13a. Rename Contig Headers
+#### 14a. Rename Contig Headers
 
 ```bash
 bit-rename-fasta-headers -i sample_polished.fasta \
@@ -2717,14 +2717,14 @@ bit-rename-fasta-headers -i sample_polished.fasta \
 
 **Input Data:**
 
-- sample_polished.fasta (polished assembly file from [Step 12](#12-polish-assembly))
+- sample_polished.fasta (polished assembly file from [Step 13](#13-polish-assembly))
 
 **Output files:**
 
 - **sample-assembly_GLlblMetag.fasta** (contig-renamed assembly file)
 
 
-#### 13b. Summarize Assemblies
+#### 14b. Summarize Assemblies
 
 ```bash
 bit-summarize-assembly -o assembly-summaries_GLlblMetag.tsv \
@@ -2738,7 +2738,7 @@ bit-summarize-assembly -o assembly-summaries_GLlblMetag.tsv \
 
 **Input Data:**
 
-- *-assembly.fasta (contig-renamed assembly files from [Step 13a](#13a-renaming-contig-headers))
+- *-assembly.fasta (contig-renamed assembly files from [Step 14a](#14a-rename-contig-headers))
 
 **Output files:**
 
@@ -2748,9 +2748,9 @@ bit-summarize-assembly -o assembly-summaries_GLlblMetag.tsv \
 
 ---
 
-### 14. Gene Prediction
+### 15. Gene Prediction
 
-#### 14a. Generate Gene Predictions
+#### 15a. Generate Gene Predictions
 
 ```bash
 prodigal -a sample-genes.faa \
@@ -2776,7 +2776,7 @@ prodigal -a sample-genes.faa \
 
 **Input Data:**
 
-- sample-assembly_GLlblMetag.fasta (contig-renamed assembly file from [Step 13a](#13a-renaming-contig-headers))
+- sample-assembly_GLlblMetag.fasta (contig-renamed assembly file from [Step 14a](#14a-rename-contig-headers))
 
 **Output Data:**
 
@@ -2786,7 +2786,7 @@ prodigal -a sample-genes.faa \
 
 <br>
 
-#### 14b. Remove Line Wraps In Gene Prediction Output
+#### 15b. Remove Line Wraps In Gene Prediction Output
 
 ```bash
 bit-remove-wraps sample-genes.faa > sample-genes.faa.tmp 2> /dev/null
@@ -2798,8 +2798,8 @@ mv sample-genes.fasta.tmp sample-genes_GLlblMetag.fasta
 
 **Input Data:**
 
-- sample-genes.faa (gene-calls amino-acid fasta file, output from [Step 14a](#14a-gene-prediction))
-- sample-genes.fasta (gene-calls nucleotide fasta file, output from [Step 14a](#14a-gene-prediction))
+- sample-genes.faa (gene-calls amino-acid fasta file, output from [Step 15a](#15a-generate-gene-predictions))
+- sample-genes.fasta (gene-calls nucleotide fasta file, output from [Step 15a](#15a-generate-gene-predictions))
 
 **Output Data:**
 
@@ -2810,7 +2810,7 @@ mv sample-genes.fasta.tmp sample-genes_GLlblMetag.fasta
 
 ---
 
-### 15. Functional Annotation
+### 16. Functional Annotation
 
 > **Note:**  
 > The annotation process overwrites the same temporary directory by default. When running multiple 
@@ -2818,7 +2818,7 @@ processses at a time, it is necessary to specify a specific temporary directory
 `--tmp-dir` argument as shown below.
 
 
-#### 15a. Download Reference Database of HMM Models
+#### 16a. Download Reference Database of HMM Models
 
 > **Note:** This step only needs to be done once.
 
@@ -2829,7 +2829,7 @@ tar -xzvf profiles.tar.gz
 gunzip ko_list.gz 
 ```
 
-#### 15b. Run KEGG Annotation
+#### 16b. Run KEGG Annotation
 
 ```bash
 exec_annotation -p profiles/ \
@@ -2856,16 +2856,16 @@ exec_annotation -p profiles/ \
 
 **Input Data:**
 
-- sample-genes_GLlblMetag.faa (amino-acid fasta file, output from [Step 14b](#14b-remove-line-wraps-in-gene-prediction-output))
-- profiles/ (reference directory holding the KO HMMs, downloaded in [Step 15a](15a-download-reference-database-of-hmm-models))
-- ko_list (reference list of KOs to scan for, downloaded in [Step 15a](15a-download-reference-database-of-hmm-models))
+- sample-genes_GLlblMetag.faa (amino-acid fasta file, output from [Step 15b](#15b-remove-line-wraps-in-gene-prediction-output))
+- profiles/ (reference directory holding the KO HMMs, downloaded in [Step 16a](#16a-download-reference-database-of-hmm-models))
+- ko_list (reference list of KOs to scan for, downloaded in [Step 16a](#16a-download-reference-database-of-hmm-models))
 
 **Output Data:**
 
 - sample-KO-tab.tmp (table of KO annotations assigned to gene IDs)
 
 
-#### 15c. Filter KO Outputs
+#### 16c. Filter KO Outputs
 *Filter KO outputs to retain only those passing the KO-specific score and top hits.*
 
 ```bash
@@ -2883,7 +2883,7 @@ rm -rf sample-tmp-KO/ sample-KO-annots.tmp
 
 **Input Data:**
 
-- sample-KO-tab.tmp (table of KO annotations assigned to gene IDs, output from [Step 15b](#15b-run-kegg-annotation))
+- sample-KO-tab.tmp (table of KO annotations assigned to gene IDs, output from [Step 16b](#16b-run-kegg-annotation))
 
 **Output Data:**
 
@@ -2893,9 +2893,9 @@ rm -rf sample-tmp-KO/ sample-KO-annots.tmp
 
 ---
 
-### 16. Taxonomic Classification 
+### 17. Taxonomic Classification 
 
-#### 16a. Pull and Unpack Pre-built Reference DB 
+#### 17a. Pull and Unpack Pre-built Reference DB 
 
 > **Note:** This step only needs to be done once.
 
@@ -2904,7 +2904,7 @@ wget tbb.bio.uu.nl/bastiaan/CAT_prepare/CAT_prepare_20200618.tar.gz
 tar -xvzf CAT_prepare_20200618.tar.gz
 ```
 
-#### 16b. Run Taxonomic Classification
+#### 17b. Run Taxonomic Classification
 
 ```bash
 CAT contigs -c sample-assembly.fasta \
@@ -2934,10 +2934,10 @@ CAT contigs -c sample-assembly.fasta \
 
 **Input Data:**
 
-- CAT_prepare_20200618/2020-06-18_database/ (directory holding the CAT reference sequence database, output from [Step 16a](16a-pull-and-unpack-pre-built-reference-db))
-- CAT_prepare_20200618/2020-06-18_taxonomy/ (directory holding the CAT reference taxonomy database, output from [Step 16a](16a-pull-and-unpack-pre-built-reference-db))
-- sample-assembly.fasta (contig-renamed assembly file from [Step 13a](#13a-rename-contig-headers))
-- sample-genes.faa (amino-acid fasta file, output from [Step 14b](#14b-remove-line-wraps-in-gene-prediction-output))
+- CAT_prepare_20200618/2020-06-18_database/ (directory holding the CAT reference sequence database, output from [Step 17a](#17a-pull-and-unpack-pre-built-reference-db))
+- CAT_prepare_20200618/2020-06-18_taxonomy/ (directory holding the CAT reference taxonomy database, output from [Step 17a](#17a-pull-and-unpack-pre-built-reference-db))
+- sample-assembly.fasta (contig-renamed assembly file from [Step 14a](#14a-rename-contig-headers))
+- sample-genes.faa (amino-acid fasta file, output from [Step 15b](#15b-remove-line-wraps-in-gene-prediction-output))
 
 **Output Data:**
 
@@ -2945,7 +2945,7 @@ CAT contigs -c sample-assembly.fasta \
 - sample-tax-out.tmp.contig2classification.txt (contig taxonomy file)
 
 
-#### 16c. Add Taxonomy Info From Taxids To Genes
+#### 17c. Add Taxonomy Info From Taxids To Genes
 
 ```bash
 CAT add_names -i sample-tax-out.tmp.ORF2LCA.txt \
@@ -2965,15 +2965,15 @@ CAT add_names -i sample-tax-out.tmp.ORF2LCA.txt \
 
 **Input Data:**
 
-- sample-tax-out.tmp.ORF2LCA.txt (gene-calls taxonomy file, output from [Step 16b](#16b-run-taxonomic-classification))
-- CAT_prepare_20200618/2020-06-18_taxonomy/ (directory holding the CAT reference taxonomy database, output from [Step 16a](#16a-pull-and-unpack-pre-built-reference-db))
+- sample-tax-out.tmp.ORF2LCA.txt (gene-calls taxonomy file, output from [Step 17b](#17b-run-taxonomic-classification))
+- CAT_prepare_20200618/2020-06-18_taxonomy/ (directory holding the CAT reference taxonomy database, output from [Step 17a](#17a-pull-and-unpack-pre-built-reference-db))
 
 **Output Data:**
 
 - sample-gene-tax-out.tmp (gene-calls taxonomy file with lineage info added)
 
 
-#### 16d. Add Taxonomy Info From Taxids To Contigs
+#### 17d. Add Taxonomy Info From Taxids To Contigs
 
 ```bash
 CAT add_names -i sample-tax-out.tmp.contig2classification.txt \
@@ -2993,15 +2993,15 @@ CAT add_names -i sample-tax-out.tmp.contig2classification.txt \
 
 **Input Data:**
 
-- sample-tax-out.tmp.contig2classification.txt (contig taxonomy file, output from [Step 16b](#16b-run-taxonomic-classification))
-- CAT_prepare_20200618/2020-06-18_taxonomy/ (directory holding the CAT reference taxonomy database, output from [Step 16a](#16a-pull-and-unpack-pre-built-reference-db))
+- sample-tax-out.tmp.contig2classification.txt (contig taxonomy file, output from [Step 17b](#17b-run-taxonomic-classification))
+- CAT_prepare_20200618/2020-06-18_taxonomy/ (directory holding the CAT reference taxonomy database, output from [Step 17a](#17a-pull-and-unpack-pre-built-reference-db))
 
 **Output Data:**
 
 - sample-contig-tax-out.tmp (contig taxonomy file with lineage info added)
 
 
-#### 16e. Format Gene-level Output With awk and sed
+#### 17e. Format Gene-level Output With awk and sed
 
 ```bash
 awk -F $'\t' ' BEGIN { OFS=FS } { if ( $3 == "lineage" ) { print $1,$3,$5,$6,$7,$8,$9,$10,$11 } \
@@ -3014,14 +3014,14 @@ awk -F $'\t' ' BEGIN { OFS=FS } { if ( $3 == "lineage" ) { print $1,$3,$5,$6,$7,
 
 **Input Data:**
 
-- sample-gene-tax-out.tmp (gene-calls taxonomy file with lineage info added, output from [Step 16c](#16c-add-taxonomy-info-from-taxids-to-genes))
+- sample-gene-tax-out.tmp (gene-calls taxonomy file with lineage info added, output from [Step 17c](#17c-add-taxonomy-info-from-taxids-to-genes))
 
 **Output Data:**
 
 - sample-gene-tax-out.tsv (reformatted gene-calls taxonomy file with lineage info)
 
 
-#### 16f. Format Contig-level Output With awk and sed
+#### 17f. Format Contig-level Output With awk and sed
 
 ```bash
 awk -F $'\t' ' BEGIN { OFS=FS } { if ( $2 == "classification" ) { print $1,$4,$6,$7,$8,$9,$10,$11,$12 } \
@@ -3036,7 +3036,7 @@ rm sample*.tmp*
 
 **Input Data:**
 
-- sample-contig-tax-out.tmp (contig taxonomy file with lineage info added, output from [Step 16d](#16d-add-taxonomy-info-from-taxids-to-contigs))
+- sample-contig-tax-out.tmp (contig taxonomy file with lineage info added, output from [Step 17d](#17d-add-taxonomy-info-from-taxids-to-contigs))
 
 **Output Data:**
 
@@ -3046,9 +3046,9 @@ rm sample*.tmp*
 
 ---
 
-### 17. Read-Mapping
+### 18. Read-Mapping
 
-#### 17a. Align Reads to Sample Assembly
+#### 18a. Align Reads to Sample Assembly
 
 ```bash
 minimap2 -a \
@@ -3071,7 +3071,7 @@ minimap2 -a \
 
 **Input Data**
 
-- sample-assembly.fasta (contig-renamed assembly file, output from [Step 13a](#13a-rename-contig-headers))
+- sample-assembly.fasta (contig-renamed assembly file, output from [Step 14a](#14a-rename-contig-headers))
 - sample_decontam_GLlblMetag.fastq.gz or sample_HostRm_GLlblMetag.fastq.gz (filtered and trimmed sample reads with both 
     contaminants and human reads (and optionally host reads) removed, gzipped fasta file, 
     output from [Step 7e](#7e-generate-decontaminated-read-files) or [Step 8b](#8b-remove-host-reads))
@@ -3082,7 +3082,7 @@ minimap2 -a \
 - **sample-mapping-info_GLlblMetag.txt** (read mapping information)
 
 
-#### 17b. Sort and Index Assembly Alignments
+#### 18b. Sort and Index Assembly Alignments
 
 ```bash
 # Sort Sam, convert to bam and create index
@@ -3107,7 +3107,7 @@ samtools index sample_sorted_GLlblMetag.bam sample_sorted_GLlblMetag.bam.bai
 
 **Input Data:**
 
-- sample.sam (reads aligned to sample assembly, output from [Step 17a](#17a-align-reads-to-sample-assembly))
+- sample.sam (reads aligned to sample assembly, output from [Step 18a](#18a-align-reads-to-sample-assembly))
 
 **Output Data:**
 
@@ -3118,13 +3118,13 @@ samtools index sample_sorted_GLlblMetag.bam sample_sorted_GLlblMetag.bam.bai
 
 ---
 
-### 18. Get Coverage Information and Filter Based On Detection
+### 19. Get Coverage Information and Filter Based On Detection
 > **Note:**  
 > “Detection” is a measure of what proportion of a reference sequence recruited reads 
 (see the discussion of detection [here](http://merenlab.org/2017/05/08/anvio-views/#detection)). 
 Filtering based on detection is one way of helping to mitigate non-specific read-recruitment.
 
-#### 18a. Filter Coverage Levels Based On Detection
+#### 19a. Filter Coverage Levels Based On Detection
 
 ```bash
 # pileup.sh comes from the bbduk.sh package
@@ -3143,8 +3143,8 @@ pileup.sh -in sample.bam \
 
 **Input Data:**
 
-- sample.bam (sorted mapping to sample assembly BAM file, output from [Step 17b](#17b-sort-and-index-assembly-alignments))
-- sample-genes.fasta (gene-calls nucleotide fasta file, output from [Step 14a](#14a-gene-prediction))
+- sample.bam (sorted mapping to sample assembly BAM file, output from [Step 18b](#18b-sort-and-index-assembly-alignments))
+- sample-genes.fasta (gene-calls nucleotide fasta file, output from [Step 15a](#15a-generate-gene-predictions))
 
 
 **Output Data:**
@@ -3153,7 +3153,7 @@ pileup.sh -in sample.bam \
 - sample-contig-cov-and-det.tmp (contig-coverage tsv file)
 
 
-#### 18b. Filter Gene and Contig Coverage Based On Detection
+#### 19b. Filter Gene and Contig Coverage Based On Detection
 
 > *The following commands filter gene and contig coverage tsv files to only keep genes and contigs with at least 50% detection (as defined above) then parse the tables to retain only gene IDs and respective coverage.*
 
@@ -3178,8 +3178,8 @@ rm sample-*.tmp
 
 **Input Data:**
 
-- sample-gene-cov-and-det.tmp (temporary gene-coverage tsv file, output from [Step 18a](#18a-filter-coverage-levels-based-on-detection))
-- sample-contig-cov-and-det.tmp (temporary contig-coverage tsv file, output from [Step 18a](#18a-filter-coverage-levels-based-on-detection))
+- sample-gene-cov-and-det.tmp (temporary gene-coverage tsv file, output from [Step 19a](#19a-filter-coverage-levels-based-on-detection))
+- sample-contig-cov-and-det.tmp (temporary contig-coverage tsv file, output from [Step 19a](#19a-filter-coverage-levels-based-on-detection))
 
 **Output Data:**
 
@@ -3190,7 +3190,7 @@ rm sample-*.tmp
 
 ---
 
-### 19. Combine Gene-level Coverage, Taxonomy, and Functional Annotations For Each Sample
+### 20. Combine Gene-level Coverage, Taxonomy, and Functional Annotations For Each Sample
 > **Note:**  
 > Just uses `paste`, `sed`, and `awk` standard Unix commands to combine gene-level coverage, taxonomy, and functional annotations into one table for each sample.  
 
@@ -3213,9 +3213,9 @@ rm sample*tmp sample-gene-coverages.tsv sample-annotations.tsv sample-gene-tax-o
 
 **Input Data:**
 
-- sample-gene-coverages.tsv (table with gene-level coverages, output from [Step 18b](#18b-filter-gene-and-contig-coverage-based-on-detection))
-- sample-annotations.tsv (table of KO annotations assigned to gene IDs, output from [Step 15c](#15c-filter-ko-outputs))
-- sample-gene-tax-out.tsv (reformatted gene-calls taxonomy file with lineage info, output from [Step 16e](#16e-format-gene-level-output-with-awk-and-sed))
+- sample-gene-coverages.tsv (table with gene-level coverages, output from [Step 19b](#19b-filter-gene-and-contig-coverage-based-on-detection))
+- sample-annotations.tsv (table of KO annotations assigned to gene IDs, output from [Step 16c](#16c-filter-ko-outputs))
+- sample-gene-tax-out.tsv (reformatted gene-calls taxonomy file with lineage info, output from [Step 17e](#17e-format-gene-level-output-with-awk-and-sed))
 
 
 **Output Data:**
@@ -3226,7 +3226,7 @@ rm sample*tmp sample-gene-coverages.tsv sample-annotations.tsv sample-gene-tax-o
 
 ---
 
-### 20. Combine Contig-level Coverage and Taxonomy For Each Sample
+### 21. Combine Contig-level Coverage and Taxonomy For Each Sample
 > **Note:**  
 > Just uses `paste`, `sed`, and `awk` standard Unix commands to combine contig-level coverage and taxonomy into one table for each sample.
 
@@ -3247,8 +3247,8 @@ rm sample*tmp sample-contig-coverages.tsv sample-contig-tax-out.tsv
 
 **Input Data:**
 
-- sample-contig-coverages.tsv (table with contig-level coverages, output from [Step 18b](#18b-filter-gene-and-contig-coverage-based-on-detection))
-- sample-contig-tax-out.tsv (reformatted contig taxonomy file with lineage info, output from [Step 16f](#16f-format-contig-level-output-with-awk-and-sed))
+- sample-contig-coverages.tsv (table with contig-level coverages, output from [Step 19b](#19b-filter-gene-and-contig-coverage-based-on-detection))
+- sample-contig-tax-out.tsv (reformatted contig taxonomy file with lineage info, output from [Step 17f](#17f-format-contig-level-output-with-awk-and-sed))
 
 
 **Output Data:**
@@ -3259,7 +3259,7 @@ rm sample*tmp sample-contig-coverages.tsv sample-contig-tax-out.tsv
 
 ---
 
-### 21. Generate Normalized, Gene- and Contig-level Coverage Summary Tables of KO-annotations and Taxonomy Across Samples
+### 22. Generate Normalized, Gene- and Contig-level Coverage Summary Tables of KO-annotations and Taxonomy Across Samples
 
 > **Note:**  
 > * To combine across samples to generate these summary tables, we need the same "units". This is done for annotations 
@@ -3271,7 +3271,7 @@ by the length of the gene). These have been normalized by making the total cover
 each individual gene-level coverage its proportion of that 1,000,000 total. So basically percent, but out of 1,000,000 
 instead of 100 to make the numbers more friendly. 
 
-#### 21a. Generate Gene-level Coverage Summary Tables
+#### 22a. Generate Gene-level Coverage Summary Tables
 
 ```bash
 bit-GL-combine-KO-and-tax-tables *-gene-coverage-annotation-and-tax_GLlblMetag.tsv \
@@ -3293,7 +3293,7 @@ mv "Combined-gene-level-taxonomy-coverages.tsv Combined-gene-level-taxonomy-cove
 
 **Input Data:**
 
-- *-gene-coverage-annotation-and-tax_GLlblMetag.tsv (tables with combined gene coverage, annotation, and taxonomy info generated for individual samples, output from [Step 19](#19-combine-gene-level-coverage-taxonomy-and-functional-annotations-for-each-sample))
+- *-gene-coverage-annotation-and-tax_GLlblMetag.tsv (tables with combined gene coverage, annotation, and taxonomy info generated for individual samples, output from [Step 20](#20-combine-gene-level-coverage-taxonomy-and-functional-annotations-for-each-sample))
 
 **Output Data:**
 
@@ -3302,7 +3302,7 @@ mv "Combined-gene-level-taxonomy-coverages.tsv Combined-gene-level-taxonomy-cove
 - **Combined-gene-level-KO-function-coverages_GLlblMetag.tsv** (table with all samples combined based on KO annotations)
 - **Combined-gene-level-taxonomy-coverages_GLlblMetag.tsv** (table with all samples combined based on gene-level taxonomic classifications)
 
-#### 21b. Generate Contig-level Coverage Summary Tables
+#### 22b. Generate Contig-level Coverage Summary Tables
 
 ```bash
 bit-GL-combine-contig-tax-tables *-contig-coverage-and-tax.tsv -o Combined
@@ -3316,7 +3316,7 @@ bit-GL-combine-contig-tax-tables *-contig-coverage-and-tax.tsv -o Combined
 
 **Input Data:**
 
-- *-contig-coverage-and-tax.tsv (tables with combined contig coverage and taxonomy info generated for individual samples, output from [Step 20](#20-combine-contig-level-coverage-and-taxonomy-for-each-sample))
+- *-contig-coverage-and-tax.tsv (tables with combined contig coverage and taxonomy info generated for individual samples, output from [Step 21](#21-combine-contig-level-coverage-and-taxonomy-for-each-sample))
 
 **Output Data:**
 
@@ -3327,9 +3327,9 @@ bit-GL-combine-contig-tax-tables *-contig-coverage-and-tax.tsv -o Combined
 
 ---
 
-### 22. **M**etagenome-**A**ssembled **G**enome (MAG) Recovery
+### 23. **M**etagenome-**A**ssembled **G**enome (MAG) Recovery
 
-#### 22a. Bin Contigs
+#### 23a. Bin Contigs
 
 ```bash
 jgi_summarize_bam_contig_depths --outputDepth sample-metabat-assembly-depth.tsv \
@@ -3370,8 +3370,8 @@ zip -r sample-bins.zip sample-bins
 
 **Input Data:**
 
-- sample-assembly.fasta (contig-renamed assembly file from [Step 13a](#13a-renaming-contig-headers))
-- sample.bam (sorted mapping to sample assembly BAM file, output from [Step 17b](#17b-sort-and-index-assembly-alignments))
+- sample-assembly.fasta (contig-renamed assembly file from [Step 14a](#14a-renaming-contig-headers))
+- sample.bam (sorted mapping to sample assembly BAM file, output from [Step 18b](#18b-sort-and-index-assembly-alignments))
 
 **Output Data:**
 
@@ -3379,7 +3379,7 @@ zip -r sample-bins.zip sample-bins
 - sample-bins/sample-bin\*.fasta (fasta files of recovered bins)
 - **sample-bins.zip** (zip file containing fasta files of recovered bins)
 
-#### 22b. Bin quality assessment 
+#### 23b. Bin quality assessment 
 > Utilizes the default `checkm` database available [here](https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz), `checkm_data_2015_01_16.tar.gz`.
 
 ```bash
@@ -3401,14 +3401,14 @@ checkm lineage_wf -f bins-overview_GLlblMetag.tsv \
 
 **Input Data:**
 
-- sample-bins/sample-bin\*.fasta (fasta files of recovered bins, output from [Step 22a](#22a-bin-contigs))
+- sample-bins/sample-bin\*.fasta (fasta files of recovered bins, output from [Step 23a](#23a-bin-contigs))
 
 **Output Data:**
 
 - **bins-overview_GLlblMetag.tsv** (tab-delimited file with quality estimates per bin)
 - checkm-output-dir/ (directory holding detailed checkm outputs)
 
-#### 22c. Filter MAGs
+#### 23c. Filter MAGs
 
 ```bash
 cat <( head -n 1 bins-overview_GLlblMetag.tsv ) \
@@ -3435,7 +3435,7 @@ done
 
 **Input Data:**
 
-- bins-overview_GLlblMetag.tsv (tab-delimited file with quality estimates per bin from [Step 22b](#22b-bin-quality-assessment))
+- bins-overview_GLlblMetag.tsv (tab-delimited file with quality estimates per bin from [Step 23b](#23b-bin-quality-assessment))
 
 **Output Data:**
 
@@ -3444,7 +3444,7 @@ done
 - **\*-MAGs.zip** (zip files containing directories of high-quality MAGs)
 
 
-#### 22d. MAG Taxonomic Classification
+#### 23d. MAG Taxonomic Classification
 > Uses default `gtdbtk` database setup with program's `download.sh` command.
 
 ```bash
@@ -3464,13 +3464,13 @@ gtdbtk classify_wf --genome_dir MAGs/ \
 
 **Input Data:**
 
-- MAGs/\*.fasta (directory holding high-quality MAGs, output from [Step 22c](#22c-filter-mags))
+- MAGs/\*.fasta (directory holding high-quality MAGs, output from [Step 23c](#23c-filter-mags))
 
 **Output Data:**
 
 - gtdbtk-output-dir/gtdbtk.\*.summary.tsv (files with assigned taxonomy and info)
 
-#### 22e. Generate Overview Table Of All MAGs
+#### 23e. Generate Overview Table Of All MAGs
 
 ```bash
 # combine summaries
@@ -3510,10 +3510,10 @@ cat MAGs-overview-header.tmp MAGs-overview-sorted.tmp \
 
 **Input Data:**
 
-- assembly-summaries_GLlblMetag.tsv (table of assembly summary statistics, output from [Step 13b](#13b-summarize-assemblies))
-- MAGs/\*.fasta (directory holding high-quality MAGs, output from [Step 22c](#23c-filter-mags))
-- checkm-MAGs-overview.tsv (tab-delimited file with quality estimates per MAG, output from [Step 22c](#22c-filter-mags))
-- gtdbtk-output-dir/gtdbtk.\*.summary.tsv (directory of files with assigned taxonomy and info, output from [Step 22d](#22d-mag-taxonomic-classification))
+- assembly-summaries_GLlblMetag.tsv (table of assembly summary statistics, output from [Step 14b](#14b-summarize-assemblies))
+- MAGs/\*.fasta (directory holding high-quality MAGs, output from [Step 23c](#23c-filter-mags))
+- checkm-MAGs-overview.tsv (tab-delimited file with quality estimates per MAG, output from [Step 23c](#23c-filter-mags))
+- gtdbtk-output-dir/gtdbtk.\*.summary.tsv (directory of files with assigned taxonomy and info, output from [Step 23d](#23d-mag-taxonomic-classification))
 
 **Output Data:**
 
@@ -3524,9 +3524,9 @@ cat MAGs-overview-header.tmp MAGs-overview-sorted.tmp \
 
 ---
 
-### 23. Generate MAG-level Functional Summary Overview
+### 24. Generate MAG-level Functional Summary Overview
 
-#### 23a. Get KO Annotations Per MAG
+#### 24a. Get KO Annotations Per MAG
 > This utilizes the helper script [`parse-MAG-annots.py`](../Workflow_Documentation/NF_MGIllumina/workflow_code/bin/parse-MAG-annots.py) 
 
 ```bash
@@ -3558,14 +3558,14 @@ done
 **Input Data:**
 
 - \*-gene-coverage-annotation-and-tax.tsv (tables with combined gene coverage, annotation, and taxonomy info generated for individual samples, output from [Step 19](#19-combine-gene-level-coverage-taxonomy-and-functional-annotations-for-each-sample))
-- MAGs/\*.fasta (directory holding high-quality MAGs, output from [Step 22c](#22c-filter-mags))
+- MAGs/\*.fasta (directory holding high-quality MAGs, output from [Step 23c](#23c-filter-mags))
 
 **Output Data:**
 
 - **MAG-level-KO-annotations_GLlblMetag.tsv** (tab-delimited table holding MAGs and their KO annotations)
 
 
-#### 23b. Summarize KO Annotations With KEGG-Decoder
+#### 24b. Summarize KO Annotations With KEGG-Decoder
 
 ```bash
 KEGG-decoder -v interactive \
@@ -3581,7 +3581,7 @@ KEGG-decoder -v interactive \
 
 **Input Data:**
 
-- MAG-level-KO-annotations_GLlblMetag.tsv (tab-delimited table holding MAGs and their KO annotations, output from [Step 23a](#23a-getting-ko-annotations-per-mag))
+- MAG-level-KO-annotations_GLlblMetag.tsv (tab-delimited table holding MAGs and their KO annotations, output from [Step 24a](#24a-get-ko-annotations-per-mag))
 
 **Output Data:**
 
@@ -3593,9 +3593,9 @@ KEGG-decoder -v interactive \
 
 ---
 
-### 24. Decontamination and Visualization of Contig- and Gene-taxonomy and Gene-function Outputs
+### 25. Decontamination and Visualization of Contig- and Gene-taxonomy and Gene-function Outputs
 
-#### 24a. Gene-level taxonomy heatmaps
+#### 25a. Gene-level taxonomy heatmaps
 
 ```R
 library(tidyverse)
@@ -3648,13 +3648,13 @@ make_heatmap(metadata, species_gene_table,
 - /path/to/sample/metadata (a file with samples as rows and columns describing each sample)
 - Combined-gene-level-taxonomy-coverages-CPM_GLlblMetag.tsv (table with all samples 
     combined based on gene-level taxonomic classifications, output from 
-    [Step 21a](#21a-generating-gene-level-coverage-summary-tables)) 
+    [Step 22a](#22a-generate-gene-level-coverage-summary-tables)) 
 
 **Output data:**
 - gene_taxonomy_table.csv (aggregated gene taxonomy table with samples in columns and species in rows)
 - **Combined-gene-level-taxonomy_heatmap_GLlblMetag.png** (heatmap of all gene taxonomy assignments)
 
-#### 24b. Gene-level taxonomy decontamination
+#### 25b. Gene-level taxonomy decontamination
 
 ```R
 library(tidyverse)
@@ -3712,7 +3712,7 @@ make_heatmap(metadata, decontaminated_table,
 
 **Input Data:**
 
-- `gene_taxonomy_table.csv`(aggregated gene taxonomy table with samples in columns and species in rows, from [Step 24a](#24a-gene-level-taxonomy-heatmaps))
+- `gene_taxonomy_table.csv`(aggregated gene taxonomy table with samples in columns and species in rows, from [Step 25a](#25a-gene-level-taxonomy-heatmaps))
 - `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping samplenames to group metadata)
 
 **Output Data:**
@@ -3721,7 +3721,7 @@ make_heatmap(metadata, decontaminated_table,
 - **Combined-gene-level-taxonomy_decontam_species_table_GLlblMetag.csv** (decontaminated species table)
 - **Combined-gene-level-taxonomy_decontam_heatmap_GLlblMetag.png** (gene-level taxonomy heatmap after filtering out contaminants)
 
-#### 24c. Gene-level KO functions heatmaps
+#### 25c. Gene-level KO functions heatmaps
 
 ```R
 library(tidyverse)
@@ -3774,13 +3774,13 @@ make_heatmap(metadata, table2write,
 - /path/to/sample/metadata (a file with samples as rows and columns describing each sample)
 - Combined-gene-level-KO-function-coverages-CPM_GLlblMetag.tsv (table with all samples combined 
     based on KO annotations; normalized to coverage per million genes covered, output from 
-    [Step 21a](#21a-generate-gene-level-coverage-summary-tables)
+    [Step 22a](#22a-generate-gene-level-coverage-summary-tables))
 
 **Output data:**
 - genes-KO-functions_table.csv (aggregated and subsetted gene KO function table)
 - **Combined-gene-level-KO-function_heatmap_GLlblMetag.png** (heatmap of all gene-level KO function assignments)
 
-#### 24d. Gene-level KO functions decontamination
+#### 25d. Gene-level KO functions decontamination
 
 ```R
 library(tidyverse)
@@ -3838,7 +3838,7 @@ make_heatmap(metadata, decontaminated_table,
 
 **Input Data:**
 
-- `genes-KO-functions_table.csv`(aggregated gene KO functions table table with samples in columns and KO_ID in rows, from [Step 24c](#24c-gene-level-ko-functions-heatmaps))
+- `genes-KO-functions_table.csv`(aggregated gene KO functions table with samples in columns and KO_ID in rows, from [Step 25c](#25c-gene-level-ko-functions-heatmaps))
 - `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping samplenames to group metadata)
 
 **Output Data:**
@@ -3848,7 +3848,7 @@ make_heatmap(metadata, decontaminated_table,
 - **Combined-gene-level-KO-function_decontam_heatmap_GLlblMetag.png** (gene-level KO functions heatmap after filtering out contaminants)
 
 
-#### 24f. Contig-level Heatmaps
+#### 25f. Contig-level Heatmaps
 
 ```R
 library(tidyverse)
@@ -3906,7 +3906,7 @@ make_heatmap(metadata, species_contig_table,
 - contig_taxonomy_table.csv (aggregated contig taxonomy table with samples in columns and species in rows)
 - **Combined-contig-level-taxonomy_heatmap_GLlblMetag.png** (heatmap of all contig taxonomy assignments)
 
-#### 24g. Contig-level decontamination
+#### 25g. Contig-level decontamination
 
 ```R
 library(tidyverse)
@@ -3964,7 +3964,7 @@ make_heatmap(metadata, decontaminated_table,
 
 **Input Data:**
 
-- `contig_taxonomy_table.csv`(aggregated contig taxonomy table with samples in columns and species in rows, from [Step 24f](#24f-contig-level-heatmaps))
+- `contig_taxonomy_table.csv`(aggregated contig taxonomy table with samples in columns and species in rows, from [Step 25f](#25f-contig-level-heatmaps))
 - `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping samplenames to group metadata)
 
 **Output Data:**

From 4b21d2df5fc83c9fb7ca96f01ac8e68fdc838b97 Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Tue, 27 Jan 2026 16:57:47 -0800
Subject: [PATCH 27/47] Finish renumbering and minor formatting edits

- sync changes between long and short read documents
---
 .../Low_Biomass/Illumina/GL-DPPD-7117.md      |  76 +++---
 .../Low_Biomass/Nanopore/GL-DPPD-7116.md      | 239 ++++++++----------
 2 files changed, 146 insertions(+), 169 deletions(-)

diff --git a/Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md b/Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md
index 067b12a00..46e446047 100644
--- a/Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md
+++ b/Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md
@@ -153,7 +153,6 @@ Barbara Novak (GeneLab Data Processing Lead)
 |Kraken2| 2.1.6 | [https://github.com/DerrickWood/kraken2](https://github.com/DerrickWood/kraken2) |
 |Krona| 2.8.1 | [https://github.com/marbl/Krona/wiki](https://github.com/marbl/Krona/wiki)|
 |MetaBAT| 2.15 |[https://bitbucket.org/berkeleylab/metabat/src/master/](https://bitbucket.org/berkeleylab/metabat/src/master/)|
-|Minimap2| 2.28 | [https://github.com/lh3/minimap2](https://github.com/lh3/minimap2) |
 |MultiQC| 1.27.1 |[https://multiqc.info/](https://multiqc.info/)|
 |Medaka| 2.1.1 | [https://github.com/nanoporetech/medaka](https://github.com/nanoporetech/medaka) |
 |MetaPhlAn| 4.1.0 |[https://github.com/biobakery/MetaPhlAn](https://github.com/biobakery/MetaPhlAn)|
@@ -787,6 +786,7 @@ library(pavian)
   - `remove_prefix=` - a character string containing a regular expression to be matched and removed, default=`NULL`
 
   **Returns:** the last taxonomy assignment listed in the `taxonomy_string`
+
 </details>
 
 #### mutate_taxonomy()
@@ -819,7 +819,7 @@ library(pavian)
   - `df` - a dataframe containing the taxonomy assignments
   - `taxonomy_column=` - name of the column in `df` containing the taxonomy assignments, default="taxonomy"
 
-  **Returns:** a dataframe with unique last taxonomy names stored in a column named "taxonomy"
+  **Returns:** dataframe, `df`, with unique last taxonomy names stored in a column named "taxonomy"
 
 </details>
 
@@ -869,11 +869,10 @@ library(pavian)
   - `file_path` - file path to the tab-delimited kaiju output table file
   - `taxon_col=`- name of the taxon column in the input data file, default="taxon_name"
 
-  **Returns:** a dataframe with reformated kaiju output
+  **Returns:** dataframe, `abs_abun_matrix`, with reformatted kaiju output
 
 </details>
 
-
 #### merge_kraken_reports()
 <details>
   <summary>merge and process multiple kraken outputs to one species table</summary>
@@ -918,7 +917,7 @@ library(pavian)
   **Function Parameter Definitions:**
   - `reports_dir` - path to a directory containing kraken2 reports 
 
-  **Returns:** a kraken species count matrix with samples and species as columns and rows, respectively.
+  **Returns:** a kraken species count matrix, `species_table`, with samples and species as columns and rows, respectively.
 
 </details>
 
@@ -943,7 +942,7 @@ library(pavian)
   - `mat` - a feature count matrix with features as rows and samples as columns
   - `cpm_threshold = 1000` - threshold to identify abundant features
 
-  **Returns:** a matrix holding the features that pass the requested threshold
+  **Returns:** a matrix, `abund_features.m`, holding the features that pass the requested threshold
   
 </details>
 
@@ -976,11 +975,10 @@ library(pavian)
   **Function Parameter Definitions:**
   - `species_table` - a species count matrix with samples and species as columns and rows, respectively.
 
-  **Returns:** a species relative abundance matrix with samples and species as rows and column, respectively.
+  **Returns:** a species relative abundance matrix, `abund_table`, with samples and species as rows and columns, respectively.
 
 </details>
 
-
 #### filter_rare()
 <details>
   <summary>filter out rare and non_microbial taxonomy assignments based on relative abundance</summary>
@@ -1019,7 +1017,8 @@ library(pavian)
   - `non_microbial` - a regular expression denoting the names used to identify a species as non-microbial or unwanted
   - `threshold=` - abundance threshold used to determine if the relative abundance is rare, value denotes a percentage between 0 and 100.
 
-  **Returns:** a dataframe with rare and non_microbial/unwanted species removed
+  **Returns:** dataframe, `abund_table`, with rare and non_microbial/unwanted species removed
+
 </details>
 
 #### group_low_abund_taxa()
@@ -1078,11 +1077,11 @@ library(pavian)
   - `rare_taxa` - a boolean specifying if only rare taxa should be returned
   - `threshold` - a max abundance threshold for defining taxa as rare
 
-  **Returns:** a relative abundance matrix with rare taxa grouped or with non-rare taxa filtered out
+  **Returns:** a relative abundance matrix, `abund_table`, with rare taxa grouped or with non-rare taxa filtered out
 
 </details>
 
-##### make_plot()
+#### make_plot()
 <details>
   <summary>Create stacked bar plots of relative abundance from input dataframes</summary>
 
@@ -1123,11 +1122,11 @@ library(pavian)
   - `samples_column` - a character column specifying the column in `metadata` holding sample names, default is "Sample_ID"
   - `prefix_to_remove` - a string specifying a prefix or any character set to remove from sample names, default is "barcode"
 
-  **Returns:** a relative abundance stacked bar plot
+  **Returns:** a relative abundance stacked bar plot, `p`
 
 </details>
 
-##### make_barplot()
+#### make_barplot()
 <details>
   <summary>Parse Metadata and Feature table files in order to create stacked barplots of relative abundance.</summary>
   
@@ -1177,14 +1176,14 @@ library(pavian)
   - `output_prefix` - a character string specifying the unique name to add to the output file names 
                       used to denote the data type/source, for example "unfiltered-kaiju_species"
   - `assay_suffix` - a character string specifying the GeneLab assay suffix (default: "_GLlbsMetag")
-  - `publication_format` - a ggplot::theme object specifying a custom theme for plotting, from [Step 8c](#8c-set-global-variables)
-  - `custom_palette` - a vector of strings specifying a custom color palette for coloring plots, from [Step 8c](#8c-set-global-variables)
+  - `publication_format` - a ggplot::theme object specifying a custom theme for plotting, from [Step 6c](#6c-set-global-variables)
+  - `custom_palette` - a vector of strings specifying a custom color palette for coloring plots, from [Step 6c](#6c-set-global-variables)
 
-  **Returns:** a relative abundance stacked bar plot as output from [make_plot](#make_plot)
+  **Returns:** a relative abundance stacked bar plot, `p`, as output from [make_plot](#make_plot)
 
 </details>
 
-##### make_heatmap()
+#### make_heatmap()
 <details>
   <summary>Creates heatmaps from a feature table file</summary>
   
@@ -1235,12 +1234,13 @@ library(pavian)
   - `output_prefix` - a character string specifying the unique name to add to the output file names 
                       used to denote the data type/source, for example "unfiltered-kaiju_species"
   - `assay_suffix` - a character string specifying the GeneLab assay suffix (default: "_GLlbsMetag")
-  - `custom_palette` - a vector of strings specifying a custom color palette for coloring plots, from [Step 8c](#8c-set-global-variables)
+  - `custom_palette` - a vector of strings specifying a custom color palette for coloring plots, from [Step 6c](#6c-set-global-variables)
 
-  **Returns:** A heatmap of species/functions across samples from the input feature table
+  **Output Data:** heatmap png file, `{output_prefix}_heatmap{assay_suffix}.png`, of species/functions across samples from the input feature table
+  
 </details>
 
-##### run_decontam()
+#### run_decontam()
 <details>
   <summary>Feature table decontamination with decontam</summary>
 
@@ -1308,10 +1308,11 @@ library(pavian)
   - `contam_threshold` -  the probability threshold below which (strictly less than) the null-hypothesis 
                           (not a contaminant) should be rejected in favor of the alternate hypothesis (contaminant).
 
-  **Returns:** a dataframe of detailed decontam results
+  **Returns:** dataframe, `contamdf`, containing detailed decontam results
+
 </details>
 
-##### feature_decontam() 
+#### feature_decontam() 
 <details>
   <summary>decontaminate a feature table</summary>
   
@@ -1397,10 +1398,10 @@ library(pavian)
   - {classification_method}_decontam_species_table_GLlbsMetag.csv - decontaminated feature table file
   - {classification_method}_decontam_results_GLlbsMetag.csv - Decontam results file
 
-  **Returns:** a dataframe containing the decontaminated feature table
+  **Returns:** dataframe, `decontaminated_table`, containing the decontaminated feature table
 </details>
 
-##### process_taxonomy()
+#### process_taxonomy()
 <details>
   <summary>process a taxonomy assignment table</summary>
 
@@ -1432,10 +1433,10 @@ library(pavian)
   - `prefix`  - is a regular expression specifying a character sequence to remove
                 from taxon names
 
-  **Returns:** a dataframe of reformated taxonomy names
+  **Returns:** dataframe, `taxonomy`, containing reformatted taxonomy names
 </details>
 
-##### fix_names()
+#### fix_names()
 <details>
   <summary>clean taxonomy names</summary>
 
@@ -1464,12 +1465,11 @@ library(pavian)
   - `stringToReplace` - a regex string specifying what to replace
   - `suffix` - string specifying the replacement value
 
-  **Returns:** a dataframe of reformated/cleaned taxonomy names
+  **Returns:** dataframe, `taxonomy`, containing reformatted/cleaned taxonomy names
 
 </details>
 
-
-##### read_assembly_coverage_table()
+#### read_assembly_coverage_table()
 <details>
   <summary>Read Assembly-based coverage annotation table</summary>
 
@@ -1506,12 +1506,11 @@ library(pavian)
   - `file_name` - path to contig taxonomy assignment file to be read
   - `sample_names` - string of samples names to keep in the final dataframe
 
-  **Returns:** a dataframe with cleaned taxonomy names and sample species count
+  **Returns:** dataframe, `df`, containing cleaned taxonomy names and sample species count
 
 </details>
 
-
-##### get_sample_names()
+#### get_sample_names()
 <details>
   <summary>retrieve sample names for which assemblies were generated</summary>
 
@@ -1531,11 +1530,10 @@ library(pavian)
   **Function Parameter Definitions:**
   - `assembly_summary` - path to assembly summary file
 
-  **Returns:** a character vector of sorted sample names
+  **Returns:** a character vector, `sample_order`, of sorted sample names
 
 </details>
 
-
 #### 6c. Set global variables
 
 ```R
@@ -1639,8 +1637,8 @@ kaiju -f kaiju-db/nr_euk/kaiju_db_nr_euk.fmi \
 
 **Input Data:**
 
-- kaiju-db/nr_euk/kaiju_db_nr_euk.fmi (FM-index file containing the main Kaiju database index, output from [Step 9a](#9a-build-kaiju-database))
-- kaiju-db/nodes.dmp (kaiju taxonomy hierarchy nodes file, output from [Step 7a](#9a-build-kaiju-database))
+- kaiju-db/nr_euk/kaiju_db_nr_euk.fmi (FM-index file containing the main Kaiju database index, output from [Step 7a](#7a-build-kaiju-database))
+- kaiju-db/nodes.dmp (kaiju taxonomy hierarchy nodes file, output from [Step 7a](#7a-build-kaiju-database))
 - *_R[12]_decontam.fastq.gz or *_R[12]_HostRm.fastq.gz (filtered and trimmed sample reads with both 
     contaminants and human reads (and, optionally, host reads) removed, gzipped fasta file, 
     output from [Step 4b](#4b-build-contaminant-index-and-map-reads) or [Step 5b](#5b-remove-host-reads))
@@ -2056,7 +2054,7 @@ kraken2 --db kraken2-db/ \
 - kraken2-db/ (a directory containing kraken2 database files, output from [Step 10a](#10a-download-kraken2-database))
 - *_R[12]_decontam.fastq.gz or *_R[12]_HostRm.fastq.gz (filtered and trimmed sample reads with both 
     contaminants and human reads (and, optionally, host reads) removed, gzipped fasta file, 
-    output from [Step 4b](#4b-build-contaminant-index-and-map-reads) or [Step 5b](#5b-remove-host-reads))
+    output from [Step 4b](#4b-build-contaminant-index-and-map-reads) or [Step 5b](#5b-remove-host-reads))
 
 
 **Output Data:**
@@ -2840,7 +2838,7 @@ htmlwidgets::saveWidget(ggplotly(p), glue("Metaphlan_decontam_species_barplot_GL
 
 **Input Data:**
 
-- `Metaphlan_filtered_species_table_GLlbsMetag.csv`(path to filtered species count per sample, output from [Step 8f](#10f-filter-kraken2-species-count-table))
+- `Metaphlan_filtered_species_table_GLlbsMetag.csv` (path to filtered species count per sample, output from [Step 9i](#9i-filter-metaphlan-species-count-table))
 - `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping samplenames to group metadata)
 
 **Output Data:**
@@ -2882,7 +2880,7 @@ megahit -1 sample1_R1_decontam_GLlbsMetag.fastq.gz -2 sample1_R2_decontam_GLlbsM
 **Output data:**
 
 - sample1-assembly/final.contigs.fa (assembly file)
-- **sample1-assembly.log** (log file)
+- sample1-assembly.log (log file)
 
 <br>  
 
diff --git a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md
index c8c60add9..bcba05b0a 100644
--- a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md
+++ b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md
@@ -27,12 +27,12 @@ Barbara Novak (GeneLab Data Processing Lead)
   - [**Pre-processing**](#pre-processing)
     - [1. Basecalling](#1-basecalling)
     - [2. Demultiplexing](#2-demultiplexing)
-      - [2a. Split fastq ](#2a-split-fastq)
-      - [2b. Concatenate files for each sample](#2b-concatenate-files-for-each-sample)
+      - [2a. Split Fastq](#2a-split-fastq)
+      - [2b. Concatenate Files For Each Sample](#2b-concatenate-files-for-each-sample)
     - [3. Raw Data QC](#3-raw-data-qc)
       - [3a. Raw Data QC](#3a-raw-data-qc)
       - [3b. Compile Raw Data QC](#3b-compile-raw-data-qc)
-    - [4. Quality filtering](#4-quality-filtering)
+    - [4. Quality Filtering](#4-quality-filtering)
       - [4a. Filter Raw Data](#4a-filter-raw-data)
       - [4a. Filtered Data QC](#4b-filtered-data-qc)
       - [4c. Compile Filtered Data QC](#4c-compile-filtered-data-qc)
@@ -41,19 +41,19 @@ Barbara Novak (GeneLab Data Processing Lead)
       - [5b. Trimmed Data QC](#5b-trimmed-data-qc)
       - [5c. Compile Trimmed Data QC](#5c-compile-trimmed-data-qc)
     - [6. Human Read Removal](#6-human-read-removal)
-      - [6a. Build Kraken2 Database](#6a-build-kraken2-database)
+      - [6a. Build Kraken2 Human Database](#6a-build-kraken2-human-database)
       - [6b. Remove Human Reads](#6b-remove-human-reads)
       - [6c. Compile Human Read Removal QC](#6c-compile-human-read-removal-qc)
     - [7. Contaminant Removal](#7-contaminant-removal)
       - [7a. Assemble Contaminants](#7a-assemble-contaminants)
       - [7b. Build Contaminant Index and Map Reads](#7b-build-contaminant-index-and-map-reads)
-      - [7c. Sort and Index Contaminant Reads](#7c-sort-and-index-contaminant-alignments)
+      - [7c. Sort and Index Contaminant Alignments](#7c-sort-and-index-contaminant-alignments)
       - [7d. Gather Contaminant Mapping Metrics](#7d-gather-contaminant-mapping-metrics)
       - [7e. Generate Decontaminated Read Files](#7e-generate-decontaminated-read-files)
       - [7f. Contaminant Removal QC](#7f-contaminant-removal-qc)
       - [7g. Compile Contaminant Removal QC](#7g-compile-contaminant-removal-qc)
     - [8. Host Read Removal](#8-host-read-removal)
-      - [8a. Build Kraken2 Database](#8a-build-kraken2-database)
+      - [8a. Build Kraken2 Host Database](#8a-build-kraken2-host-database)
       - [8b. Remove Host Reads](#8b-remove-host-reads)
       - [8c. Compile Host Read Removal QC](#8c-compile-host-read-removal-qc)
     - [9. R Environment Setup](#9-r-environment-setup)
@@ -61,14 +61,14 @@ Barbara Novak (GeneLab Data Processing Lead)
       - [9b. Define Custom Functions](#9b-define-custom-functions)
       - [9c. Set global variables](#9c-set-global-variables)
   - [**Read-based processing**](#read-based-processing)
-    - [10. Taxonomic profiling using kaiju](#10-taxonomic-profiling-using-kaiju)
+    - [10. Taxonomic Profiling Using Kaiju](#10-taxonomic-profiling-using-kaiju)
       - [10a. Build Kaiju Database](#10a-build-kaiju-database)
       - [10b. Kaiju Taxonomic Classification](#10b-kaiju-taxonomic-classification)
       - [10c. Compile Kaiju Taxonomy Results](#10c-compile-kaiju-taxonomy-results)
       - [10d. Convert Kaiju Output To Krona Format](#10d-convert-kaiju-output-to-krona-format)
       - [10e. Compile Kaiju Krona Reports](#10e-compile-kaiju-krona-reports)
       - [10f. Create Kaiju Species Count Table](#10f-create-kaiju-species-count-table)
-      - [10g. Read-in Tables](#10g-read-in-tables)
+      - [10g. Filter Kaiju Species Count Table](#10g-filter-kaiju-species-count-table)
       - [10h. Taxonomy Barplots](#10h-taxonomy-barplots)
       - [10i. Feature Decontamination](#10i-feature-decontamination)
     - [11. Taxonomic Profiling Using Kraken2](#11-taxonomic-profiling-using-kraken2)
@@ -79,10 +79,9 @@ Barbara Novak (GeneLab Data Processing Lead)
         - [11cii. Compile Kraken2 Taxonomy Reports](#11cii-compile-kraken2-taxonomy-reports)
       - [11d. Convert Kraken2 Output to Krona Format](#11d-convert-kraken2-output-to-krona-format)
       - [11e. Compile Kraken2 Krona Reports](#11e-compile-kraken2-krona-reports)
-      - [11f. Create Kraken2 Species Count Table](#11f-create-kraken2-species-count-table)
-      - [11g. Read-in Tables](#11g-read-in-tables)
-      - [11h. Taxonomy Barplots](#11h-taxonomy-barplots)
-      - [11i. Feature Decontamination](#11i-feature-decontamination)
+      - [11f. Filter Kraken2 Species Count Table](#11f-filter-kraken2-species-count-table)
+      - [11g. Taxonomy Barplots](#11g-taxonomy-barplots)
+      - [11h. Feature Decontamination](#11h-feature-decontamination)
   - [**Assembly-based processing**](#assembly-based-processing)
     - [12. Sample Assembly](#12-sample-assembly)
     - [13. Polish Assembly](#13-polish-assembly)
@@ -124,12 +123,12 @@ Barbara Novak (GeneLab Data Processing Lead)
       - [24a. Get KO Annotations Per MAG](#24a-get-ko-annotations-per-mag)
       - [24b. Summarize KO Annotations With KEGG-Decoder](#24b-summarize-ko-annotations-with-kegg-decoder)
     - [25. Decontamination and Visualization of Contig- and Gene-taxonomy and Gene-function Outputs](#25-decontamination-and-visualization-of-contig--and-gene-taxonomy-and-gene-function-outputs)
-      - [25a. Gene-level taxonomy heatmaps](#25a-gene-level-taxonomy-heatmaps)
-      - [25b. Gene-level taxonomy decontamination](#25b-gene-level-taxonomy-decontamination)
-      - [25c. Gene-level KO functions heatmaps](#25c-gene-level-ko-functions-heatmaps)
-      - [25d. Gene-level KO functions decontamination](#25d-gene-level-ko-functions-decontamination)
-      - [25e. Contig-level heatmaps](#25e-contig-level-heatmaps)
-      - [25f. Contig-level decontamination](#25f-contig-level-decontamination)
+      - [25a. Gene-level Taxonomy Heatmaps](#25a-gene-level-taxonomy-heatmaps)
+      - [25b. Gene-level Taxonomy Decontamination](#25b-gene-level-taxonomy-decontamination)
+      - [25c. Gene-level KO Functions Heatmaps](#25c-gene-level-ko-functions-heatmaps)
+      - [25d. Gene-level KO Functions Decontamination](#25d-gene-level-ko-functions-decontamination)
+      - [25e. Contig-level Heatmaps](#25e-contig-level-heatmaps)
+      - [25f. Contig-level Decontamination](#25f-contig-level-decontamination)
 
 
 ---
@@ -157,7 +156,6 @@ Barbara Novak (GeneLab Data Processing Lead)
 |Minimap2| 2.28 | [https://github.com/lh3/minimap2](https://github.com/lh3/minimap2) |
 |MultiQC| 1.27.1 |[https://multiqc.info/](https://multiqc.info/)|
 |Medaka| 2.1.1 | [https://github.com/nanoporetech/medaka](https://github.com/nanoporetech/medaka) |
-|MetaPhlAn| 4.1.0 |[https://github.com/biobakery/MetaPhlAn](https://github.com/biobakery/MetaPhlAn)|
 |NanoPlot| 1.44.1 | [https://github.com/wdecoster/NanoPlot](https://github.com/wdecoster/NanoPlot)|
 |Porechop| 0.2.4 | [https://github.com/rrwick/Porechop](https://github.com/rrwick/Porechop) |
 |Prodigal| 2.6.3 |[https://github.com/hyattpd/Prodigal#prodigal](https://github.com/hyattpd/Prodigal#prodigal)|
@@ -179,6 +177,7 @@ Barbara Novak (GeneLab Data Processing Lead)
 
 ## Pre-processing
 
+
 ### 1. Basecalling
 
 ```bash
@@ -525,7 +524,7 @@ multiqc --zip-data-dir \
 
 ### 6. Human Read Removal
 
-#### 6a. Build Kraken2 Database
+#### 6a. Build Kraken2 Human Database
 
 > **Note:** It is recommended to use NCBI genome files with kraken2 because sequences not downloaded from 
 NCBI may require explicit assignment of taxonomy information before they can be used to build the 
@@ -536,31 +535,33 @@ database, as mentioned in the [Kraken2 Documentation](https://github.com/Derrick
 kraken2-build --download-taxonomy --db kraken2-human-db/
 
 # Add genomic sequences to your database's genomic library
-kraken2-build --add-to-library human.fasta --db kraken2-human-db/ \
-              --no-masking --kmer-length 35 --minimizer-length 31
-
+kraken2-build --add-to-library human.fasta --db kraken2-human-db/ --no-masking
+             
 # Build the database
-kraken2-build --build --db kraken2-human-db/
+kraken2-build --build --db kraken2-human-db/ --kmer-len 35 --minimizer-len 31
 
 # Clean up intermediate files
 kraken2-build --clean --db kraken2-human-db/
 ```
+
 **Parameter Definitions:**
 - `--download-taxonomy` - Instructs kraken2-build to download the NCBI taxonomic information.
 - `--db` - Specifies the name of the directory for the kraken2 database
-- `--add-to-library` - Instructs kraken2-build to add the contents of a file (`human.fasta`) to the kraken2 DB library
-- `--no-masking` - Disables masking of low-complexity sequences. For additional 
+- `--add-to-library` - Instructs kraken2-build to add the contents of a file to the kraken2 DB library
+  - `--no-masking` - Disables masking of low-complexity sequences. For additional 
                    information see the [kraken documentation for masking](https://github.com/DerrickWood/kraken2/wiki/Manual#masking-of-low-complexity-sequences).
 - `--build` - Instructs kraken2-build to build the kraken2 DB from the library files
+  - `--kmer-len` - K-mer length in bp (default: 35).
+  - `--minimizer-len` - Minimizer length in bp (default: 31)
 - `--clean` - Instructs kraken2-build to remove unneeded intermediate files.
 
 **Input Data:**
 
-- `human.fasta` (fasta file containing human genome, SPECIFY WHERE THIS GEONOME CAME FROM)
+- `human.fasta` (fasta file containing human genome, for example, the human genome fasta downloaded from https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.fna.gz)
 
 **Output Data:**
 
-- kraken2_human_db/ - Kraken2 human database directory, containing hash.k2d, opts.k2d, and taxo.k2d files 
+- kraken2-human-db/ (Kraken2 human database directory, containing hash.k2d, opts.k2d, and taxo.k2d files)
 
 #### 6b. Remove Human Reads
 
@@ -630,7 +631,6 @@ multiqc --zip-data-dir \
 
 <br>
 
-
 ---
 
 ### 7. Contaminant Removal
@@ -864,7 +864,7 @@ multiqc --zip-data-dir \
 
 If the samples were derived from a host organism other than human, potential host reads should be identified and removed. This step is optional.
 
-#### 8a. Build Kraken2 Database
+#### 8a. Build Kraken2 Host Database
 
 > **Note:** It is recommended to use NCBI genome files with kraken2 because sequences not downloaded from 
 NCBI may require explicit assignment of taxonomy information before they can be used to build the 
@@ -875,32 +875,35 @@ database, as mentioned in the [Kraken2 Documentation](https://github.com/Derrick
 kraken2-build --download-taxonomy --db kraken2-${hostname}-db/
 
 # Add genomic sequences to your database's genomic library
-kraken2-build --add-to-library ${hostname}.fasta --db kraken2-${hostname}-db/ \
-              --no-masking --kmer-length 35 --minimizer-length 31
+kraken2-build --add-to-library ${hostname}.fasta --db kraken2-${hostname}-db/ --no-masking 
 
 # Build the database
-kraken2-build --build --db kraken2-${hostname}-db/
+kraken2-build --build --db kraken2-${hostname}-db/ --kmer-len 35 --minimizer-len 31
 
 # Clean up intermediate files
 kraken2-build --clean --db kraken2-${hostname}-db/
 ```
+
 **Parameter Definitions:**
+
 - `--download-taxonomy` - Instructs kraken2-build to download the NCBI taxonomic information.
 - `--db` - Specifies the name of the directory for the kraken2 database
 - `--add-to-library` - Instructs kraken2-build to add the contents of a file (`${hostname}.fasta`) to the kraken2 DB library
-- `--no-masking` - Disables masking of low-complexity sequences. For additional 
-                   information see the [kraken documentation for masking](https://github.com/DerrickWood/kraken2/wiki/Manual#masking-of-low-complexity-sequences).
+  - `--no-masking` - Disables masking of low-complexity sequences. For additional 
+                     information see the [kraken documentation for masking](https://github.com/DerrickWood/kraken2/wiki/Manual#masking-of-low-complexity-sequences).
 - `--build` - Instructs kraken2-build to build the kraken2 DB from the library files
+  - `--kmer-len` - K-mer length in bp (default: 35).
+  - `--minimizer-len` - Minimizer length in bp (default: 31)
 - `--clean` - Instructs kraken2-build to remove unneeded intermediate files.
 - `{$hostname}` - Specifies the name of the host organism used to uniquely identify the kraken2 database
 
 **Input Data:**
 
-- `${hostname}.fasta` (fasta file containing host genome)
+- `${hostname}.fasta` (fasta file containing host genome, for example, the mouse genome fasta downloaded from https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.27_GRCm39/GCF_000001635.27_GRCm39_genomic.fna.gz)
 
 **Output Data:**
 
-- kraken2_${hostname}_db/ - Kraken2 host database directory, containing hash.k2d, opts.k2d, and taxo.k2d files 
+- kraken2-${hostname}-db/ (Kraken2 host database directory, containing hash.k2d, opts.k2d, and taxo.k2d files)
 
 
 #### 8b. Remove Host Reads
@@ -989,7 +992,7 @@ library(pavian)
 
 #### 9b. Define Custom Functions
 
-##### get_last_assignment()
+#### get_last_assignment()
 <details>
   <summary>retrieves the last taxonomy assignment from a taxonomy string</summary>
 
@@ -1021,9 +1024,10 @@ library(pavian)
   - `remove_prefix=` - a character string containing a regular expression to be matched and removed, default=`NULL`
 
   **Returns:** the last taxonomy assignment listed in the `taxonomy_string`
+
 </details>
 
-##### mutate_taxonomy()
+#### mutate_taxonomy()
 <details>
   <summary>mutate taxonomy column to contain the lowest taxonomy assignment</summary>
 
@@ -1045,6 +1049,7 @@ library(pavian)
     return(df)
   }
   ```
+
   **Custom Functions Used:**
   - [get_last_assignment()](#get_last_assignment)
 
@@ -1056,7 +1061,7 @@ library(pavian)
 
 </details>
 
-##### process_kaiju_table()
+#### process_kaiju_table()
 <details>
   <summary>reformat kaiju output table</summary>
 
@@ -1105,8 +1110,7 @@ library(pavian)
 
 </details>
 
-
-##### merge_kraken_reports()
+#### merge_kraken_reports()
 <details>
   <summary>merge and process multiple kraken outputs to one species table</summary>
 
@@ -1146,9 +1150,6 @@ library(pavian)
     return(species_table)
   }
   ```
-  **Custom Functions Used:**
-  - [read_reports()]()
-
 
   **Function Parameter Definitions:**
   - `reports_dir` - path to a directory containing kraken2 reports 
@@ -1157,7 +1158,7 @@ library(pavian)
 
 </details>
 
-##### get_abundant_features()
+#### get_abundant_features()
 <details>
   <summary>Find abundant features based on the sum of feature values</summary>
   
@@ -1177,10 +1178,11 @@ library(pavian)
   - `mat` - a feature count matrix with features as rows and samples as columns
   - `cpm_threshold = 1000` - threshold to identify abundant features
 
-  **Returns:** a species relative abundance matrix, `abund_features.m`, with samples and species as rows and columns, respectively.
+  **Returns:** a matrix, `abund_features.m`, holding the features that pass the requested threshold
+  
 </details>
 
-##### count_to_rel_abundance()
+#### count_to_rel_abundance()
 <details>
   <summary>Convert species count matrix to relative abundance matrix</summary>
 
@@ -1214,7 +1216,7 @@ library(pavian)
 </details>
 
 
-##### filter_rare()
+#### filter_rare()
 <details>
   <summary>filter out rare and non_microbial taxonomy assignments based on relative abundance</summary>
 
@@ -1253,9 +1255,10 @@ library(pavian)
   - `threshold=` - abundance threshold used to determine if the relative abundance is rare, value denotes a percentage between 0 and 100.
 
   **Returns:** dataframe, `abund_table`, with rare and non_microbial/unwanted species removed
+
 </details>
 
-##### group_low_abund_taxa()
+#### group_low_abund_taxa()
 <details>
   <summary>Group rare taxa or return a table with only rare taxa</summary>
 
@@ -1315,7 +1318,7 @@ library(pavian)
 
 </details>
 
-##### make_plot()
+#### make_plot()
 <details>
   <summary>create bar plot of relative abundance</summary>
 
@@ -1360,7 +1363,7 @@ library(pavian)
 
 </details>
 
-##### make_barplot()
+#### make_barplot()
 <details>
   <summary>Creates barplots from a feature table file</summary>
   
@@ -1412,11 +1415,11 @@ library(pavian)
   - `publication_format` - a ggplot::theme object specifying a custom theme for plotting, from [Step 9c](#9c-set-global-variables)
   - `custom_palette` - a vector of strings specifying a custom color palette for coloring plots, from [Step 9c](#9c-set-global-variables)
 
-  **Returns:** a relative abundance stacked bar plot, `p`
+  **Returns:** a relative abundance stacked bar plot, `p`, as output from [make_plot](#make_plot)
 
 </details>
 
-##### make_heatmap()
+#### make_heatmap()
 <details>
   <summary>Creates heatmaps from a feature table file</summary>
   
@@ -1425,21 +1428,6 @@ library(pavian)
                            samples_column = "sample_id", group_column = "group", 
                            output_prefix, assay_suffix = "_GLlblMetag",
                            custom_palette) {
-    # Prepare feature table
-    # feature_table <- read_csv(feature_table_file) %>% as.data.frame
-    # rownames(feature_table) <- feature_table[[1]]
-    # feature_table <- feature_table[, -1] %>% as.matrix()
-
-    # # Prepare metadata
-    # metadata <- read_delim(metadata_file, delim = ",") %>% as.data.frame
-    # row.names(metadata) <- metadata[, samples_column]
-
-    # # Get common samples and re-arrange feature table and metadata
-    # common_samples <- intersect(colnames(feature_table), rownames(metadata))
-    # feature_table <- feature_table[, common_samples]
-    # metadata <- metadata[common_samples, ]
-    # metadata <- metadata %>% arrange(!!sym(group_column))
-
     # Create column annotation
     col_annotation <- as.data.frame(metadata)[, group_column, drop = FALSE]
 
@@ -1484,9 +1472,11 @@ library(pavian)
   - `assay_suffix` - a character string specifying the GeneLab assay suffix (default: "_GLlblMetag")
   - `custom_palette` - a vector of strings specifying a custom color palette for coloring plots, from [Step 9c](#9c-set-global-variables)
 
+  **Returns:** heatmap png file, `{output_prefix}_heatmap{assay_suffix}.png`, of species/functions across samples from the input feature table
+
 </details>
 
-##### run_decontam()
+#### run_decontam()
 <details>
   <summary>Feature table decontamination with decontam</summary>
 
@@ -1555,9 +1545,10 @@ library(pavian)
                           (not a contaminant) should be rejected in favor of the alternate hypothesis (contaminant).
 
   **Returns:** dataframe, `contamdf`, containing detailed decontam results
+
 </details>
 
-##### feature_decontam() 
+#### feature_decontam() 
 <details>
   <summary>decontaminate a feature table</summary>
   
@@ -1620,6 +1611,7 @@ library(pavian)
     }
   }
   ```
+
   **Custom Functions Used:**
   - [run_decontam()](#run_decontam)
 
@@ -1646,7 +1638,7 @@ library(pavian)
 
 </details>
 
-##### process_taxonomy()
+#### process_taxonomy()
 <details>
   <summary>process a taxonomy assignment table</summary>
 
@@ -1683,7 +1675,7 @@ library(pavian)
 
 </details>
 
-##### fix_names()
+#### fix_names()
 <details>
   <summary>clean taxonomy names</summary>
 
@@ -1716,8 +1708,7 @@ library(pavian)
 
 </details>
 
-
-##### read_assembly_coverage_table()
+#### read_assembly_coverage_table()
 <details>
   <summary>Read Assembly-based coverage annotation table</summary>
 
@@ -1759,8 +1750,7 @@ library(pavian)
 
 </details>
 
-
-##### get_sample_names()
+#### get_sample_names()
 <details>
   <summary>retrieve sample names for which assemblies were generated</summary>
 
@@ -1830,6 +1820,7 @@ custom_palette <- custom_palette[-c(21:23,
 
 ## Read-based Processing
 
+
 ### 10. Taxonomic Profiling Using Kaiju
 
 #### 10a. Build Kaiju Database
@@ -1930,7 +1921,7 @@ sed -i -E 's/file/sample/' merged_kaiju_table.tsv
 
 - merged_kaiju_table.tsv (compiled kaiju summary table at the species level)
 
-#### 9d. Convert Kaiju Output To Krona Format
+#### 10d. Convert Kaiju Output To Krona Format
 
 ```bash
 kaiju2krona -u \
@@ -1950,7 +1941,7 @@ kaiju2krona -u \
 
 **Input Data:**
 - kaiju-db/names.dmp (kaiju taxonomy names file, output from [Step 10a](#10a-build-kaiju-database))
-- kaiju-db/nodes.dmp (kaiju taxonomy hierarchy nodes file, output from [Step 10a](#9a-build-kaiju-database))
+- kaiju-db/nodes.dmp (kaiju taxonomy hierarchy nodes file, output from [Step 10a](#10a-build-kaiju-database))
 - sample_kaiju.out (kaiju output file, output from [Step 10b](#10b-kaiju-taxonomic-classification))
 
 **Output Data:**
@@ -1976,28 +1967,23 @@ ktImportText  -o kaiju-report.html ${KTEXT_FILES[*]}
 
 **Parameter Definitions:**
 
-**find**
-
+*find*
 - `-type f` -  Specifies that the type of file to find is a regular file.
 - `-name "*.krona"` - Specifies to find files ending with the .krona suffix.  
 
-**sort**
-
+*sort*
 - `-u` - Specifies to perform a unique sort.
 - `-V` - Specifies to perform a mixed type of sorting with names containing numbers within text.
 - `> krona_files.txt` - Redirects the sorted list to a separate text file.
 
-**basename**
-
+*basename*
 - `-a` - Support multiple arguments and treat each as a file name.
 - `-s '.krona'` - Remove trailing '.krona' suffix.
 
-**paste**
-
+*paste*
 - `-d','` - Paste both krona and sample files together line by line delimited by comma ','.
 
-**ktImportText**
-
+*ktImportText*
 - `-o` - Specifies the compiled output html file name.
 - `${KTEXT_FILES[*]}` - An array positional argument with the following content: 
                         sample_1.krona,sample_1 sample_2.krona,sample_2 ... sample_n.krona,sample_n.
@@ -2077,7 +2063,6 @@ write_csv(x = table2write, file = output_file)
 ```
 
 **Custom Functions Used:**
-
 - [group_low_abund_taxa()](#group_low_abund_taxa)
 
 **Parameter Definitions:**
@@ -2088,7 +2073,7 @@ write_csv(x = table2write, file = output_file)
 
 **Input Data:**
 
-- kaiju_species_table_GLlblMetag.csv (path to kaiju species table from [Step 9f](#9f-create-kaiju-species-count-table))
+- kaiju_species_table_GLlblMetag.csv (path to kaiju species table from [Step 10f](#10f-create-kaiju-species-count-table))
 
 **Output Data:**
 
@@ -2096,7 +2081,7 @@ write_csv(x = table2write, file = output_file)
 
 ---
 
-#### 10h. Taxonomy barplots
+#### 10h. Taxonomy Barplots
 
 ```R
 library(tidyverse)
@@ -2145,8 +2130,8 @@ htmlwidgets::saveWidget(ggplotly(p), glue("kaiju_filtered_species_barplot_GLlblM
 
 **Input Data:**
 
-- `kaiju_species_table_GLlblMetag.csv` (a file containing the species count table, output from [Step 9f](#9f-create-kaiju-species-count-table))
-- `kaiju_filtered_species_table_GLlblMetag.csv` (a file containing the filtered species count table, output from [Step 9g](#9g-filter-kaiju-species-count-table))
+- `kaiju_species_table_GLlblMetag.csv` (a file containing the species count table, output from [Step 10f](#10f-create-kaiju-species-count-table))
+- `kaiju_filtered_species_table_GLlblMetag.csv` (a file containing the filtered species count table, output from [Step 10g](#10g-filter-kaiju-species-count-table))
 - `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping samplenames to group metadata)
 
 
@@ -2158,7 +2143,7 @@ htmlwidgets::saveWidget(ggplotly(p), glue("kaiju_filtered_species_barplot_GLlblM
 - **kaiju_filtered_species_barplot_GLlblMetag.html** (interactive taxonomy barplot after filtering rare and non-microbial taxa)
 
 
-#### 10i. Feature decontamination
+#### 10i. Feature Decontamination
 
 > Feature (species) decontamination with decontam. Decontam is an R package that statistically identifies contaminating features in a feature table
 
@@ -2206,6 +2191,7 @@ htmlwidgets::saveWidget(ggplotly(p), glue("kaiju_decontam_species_barplot_GLlblM
 - [count_to_rel_abundance()](#count_to_rel_abundance)
 
 **Parameter Definitions:**
+
 - `metadata_table` - path to a file with samples as rows and columns describing each sample
 - `feature_table_file` - path to a tab separated samples feature table i.e. species/functions 
                          table with species/functions as the first column and samples as other columns.
@@ -2213,14 +2199,14 @@ htmlwidgets::saveWidget(ggplotly(p), glue("kaiju_decontam_species_barplot_GLlblM
 
 **Input Data:**
 
-- `kaiju_filtered_species_table_GLlblMetag.csv`(path to filtered species count per sample, output from [Step 9g](#9g-filter-kaiju-species-count-table))
+- `kaiju_filtered_species_table_GLlblMetag.csv`(path to filtered species count per sample, output from [Step 10g](#10g-filter-kaiju-species-count-table))
 - `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping samplenames to group metadata)
 
 **Output Data:**
 
 - **kaiju_decontam_results_GLlblMetag.csv** (decontam's result table, output from [feature_decontam() function](#feature_decontam))
 - **kaiju_decontam_species_table_GLlblMetag.csv** (decontaminated species table, output from [feature_decontam() function](#feature_decontam))
-- **kaiju_decontam_species_barplot_GLlblMetag.png** (barplot after filtering out contaminants)
+- kaiju_decontam_species_barplot_GLlblMetag.png (barplot after filtering out contaminants)
 - **kaiju_decontam_species_barplot_GLlblMetag.html** (barplot after filtering out contaminants)
 
 <br>
@@ -2258,14 +2244,15 @@ tar -xvzf k2_pluspfp.tar.gz
 
 **Parameter Definitions:**
 
-**wget**
-
+*wget*
 - `O` - Name of file to download the url content to.
 - `--timeout=3600` - Specifies the network timeout in seconds.
 - `--tries=0` - Retry download infinitely.
 - `--continue` -  Continue getting a partially-downloaded file.
 - `*_URL` - Position arguement specifying the url to download a particular resource from.
 
+*tar*
+- `-xvzf` - Extracts (`x`) the gzip-compressed (`z`) archive file given after `f`, listing each file as it is unpacked (`v`).
 
 **Input Data:**
 
@@ -2318,7 +2305,7 @@ kraken2 --db kraken2-db/ \
 ##### 11ci. Create Merged Kraken2 Taxonomy Table
 
 ```R
-species_table <- merge_kraken_reports(reports-dir='/path/to/kraken2/reports')
+species_table <- merge_kraken_reports('/path/to/kraken2/reports')
 write_csv(x = species_table, file = "kraken2_species_table_GLlblMetag.csv")
 ```
 
@@ -2328,15 +2315,9 @@ write_csv(x = species_table, file = "kraken2_species_table_GLlblMetag.csv")
 
 **Parameter Definitions:**
 
-- `file_path` - path to compiled kaiju table at the species taxon level
+- `reports-dir` - path to the directory containing the kraken2 report files to merge
 - `x`  - feature table dataframe to write to file
-- `file` - path to where to write kaiju count table per sample
-
-**Parameter Definitions:**
-
-- `--output` - Specifies the name of the kraken2 compiled results output file.
-- `--report-files` - Specifies the name of each input kraken2 report file to compile.
-- `--sample-names` - Specifies the name of each sample. 
+- `file` - path to where to write the kraken2 species table
 
 **Input Data:**
 
@@ -2413,28 +2394,23 @@ ktImportText -o kraken2-report_GLlblMetag.html ${KTEXT_FILES[*]}
 
 **Parameter Definitions:**
 
-**find**
-
+*find*
 - `-type f` -  Specifies that the type of file to find is a regular file.
 - `-name "*.krona"` - Specifies to find files ending with the .krona suffix.  
 
-**sort**
-
+*sort*
 - `-u` - Specifies to perform a unique sort.
 - `-V` - Specifies to perform a mixed type of sorting with names containing numbers within text.
 - `> {}.txt` - Redirects the sorted list to a separate text file.
 
-**basename**
-
+*basename*
 - `--multiple` - Support multiple arguments and treat each as a file name.
 - `--suffix='.krona'` - Remove a trailing '.krona' suffix.
 
-**paste**
-
+*paste*
 - `-d','` - Paste both krona and sample files together line by line delimited by comma ','.
 
-**ktImportText**
-
+*ktImportText*
 - `-o` - Specifies the compiled output html file name.
 - `${KTEXT_FILES[*]}` - An array positional arguement with the following content: sample_1.krona,sample_1 sample_2.krona,sample_2 .. sample_n.krona,sample_n.
 
@@ -2478,7 +2454,6 @@ write_csv(x = table2write, file = output_file)
 ```
 
 **Custom Functions Used:**
-
 - [group_low_abund_taxa()](#group_low_abund_taxa)
 
 **Parameter Definitions:**
@@ -2497,8 +2472,7 @@ write_csv(x = table2write, file = output_file)
 
 ---
 
-
-#### 11g. Taxonomy barplots
+#### 11g. Taxonomy Barplots
 
 ```R
 library(tidyverse)
@@ -2558,7 +2532,7 @@ htmlwidgets::saveWidget(ggplotly(p), glue("kraken2_filtered_species_barplot_GLlb
 - **kraken2_filtered_species_barplot_GLlblMetag.html** (interactive taxonomy barplot after filtering rare and non-microbial taxa)
 
 
-#### 11h. Feature decontamination
+#### 11h. Feature Decontamination
 
 > Feature (species) decontamination with decontam. Decontam is an R package that statistically 
   identifies contaminating features in a feature table
@@ -2607,6 +2581,7 @@ htmlwidgets::saveWidget(ggplotly(p), glue("kraken2_decontam_species_barplot_GLlb
 - [count_to_rel_abundance()](#count_to_rel_abundance)
 
 **Parameter Definitions:**
+
 - `metadata_table` - path to a file with samples as rows and columns describing each sample
 - `feature_table_file` - path to a tab separated samples feature table i.e. species/functions 
                           table with species/functions as the first column and samples as other columns.
@@ -2621,7 +2596,7 @@ htmlwidgets::saveWidget(ggplotly(p), glue("kraken2_decontam_species_barplot_GLlb
 
 - **kraken2_decontam_results_GLlblMetag.csv** (decontam's result table, output from [feature_decontam() function](#feature_decontam))
 - **kraken2_decontam_species_table_GLlblMetag.csv** (decontaminated species table, output from [feature_decontam() function](#feature_decontam))
-- **kraken2_decontam_species_barplot_GLlblMetag.png** (barplot after filtering out contaminants)
+- kraken2_decontam_species_barplot_GLlblMetag.png (barplot after filtering out contaminants)
 - **kraken2_decontam_species_barplot_GLlblMetag.html** (barplot after filtering out contaminants)
 
 <br>
@@ -2630,6 +2605,7 @@ htmlwidgets::saveWidget(ggplotly(p), glue("kraken2_decontam_species_barplot_GLlb
 
 ## Assembly-based Processing
 
+
 ### 12. Sample Assembly
 
 ```bash
@@ -2696,6 +2672,8 @@ mv sample/consensus.fasta sample_polished.fasta
 
 - sample_polished.fasta (polished sample assembly)
 
+<br>
+
 ---
 
 ### 14. Rename Contigs and Summarize Assemblies
@@ -3227,6 +3205,7 @@ rm sample*tmp sample-gene-coverages.tsv sample-annotations.tsv sample-gene-tax-o
 ---
 
 ### 21. Combine Contig-level Coverage and Taxonomy For Each Sample
+
 > **Note:**  
 > Just uses `paste`, `sed`, and `awk` standard Unix commands to combine contig-level coverage and taxonomy into one table for each sample.
 
@@ -3287,7 +3266,6 @@ mv "Combined-gene-level-taxonomy-coverages.tsv Combined-gene-level-taxonomy-cove
 **Parameter Definitions:**  
 
 - `*-gene-coverage-annotation-and-tax_GLlbsMetag.tsv` - Positional arguments specifying the input tsv files, can be provided as a space-delimited list of files, or with wildcards like above.
-
 - `-o` – Specifies the output file prefix.
 
 
@@ -3379,7 +3357,7 @@ zip -r sample-bins.zip sample-bins
 - sample-bins/sample-bin\*.fasta (fasta files of recovered bins)
 - **sample-bins.zip** (zip file containing fasta files of recovered bins)
 
-#### 23b. Bin quality assessment 
+#### 23b. Bin Quality Assessment 
 > Utilizes the default `checkm` database available [here](https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz), `checkm_data_2015_01_16.tar.gz`.
 
 ```bash
@@ -3595,7 +3573,7 @@ KEGG-decoder -v interactive \
 
 ### 25. Decontamination and Visualization of Contig- and Gene-taxonomy and Gene-function Outputs
 
-#### 25a. Gene-level taxonomy heatmaps
+#### 25a. Gene-level Taxonomy Heatmaps
 
 ```R
 library(tidyverse)
@@ -3654,7 +3632,7 @@ make_heatmap(metadata, species_gene_table,
 - gene_taxonomy_table.csv (aggregated gene taxonomy table with samples in columns and species in rows)
 - **Combined-gene-level-taxonomy_heatmap_GLlblMetag.png** (heatmap of all gene taxonomy assignments)
 
-#### 25b. Gene-level taxonomy decontamination
+#### 25b. Gene-level Taxonomy Decontamination
 
 ```R
 library(tidyverse)
@@ -3721,7 +3699,7 @@ make_heatmap(metadata, decontaminated_table,
 - **Combined-gene-level-taxonomy_decontam_species_table_GLlblMetag.csv** (decontaminated species table)
 - **Combined-gene-level-taxonomy_decontam_heatmap_GLlblMetag.png** (gene-level taxonomy heatmap after filtering out contaminants)
 
-#### 25c. Gene-level KO functions heatmaps
+#### 25c. Gene-level KO Functions Heatmaps
 
 ```R
 library(tidyverse)
@@ -3780,7 +3758,7 @@ make_heatmap(metadata, table2write,
 - genes-KO-functions_table.csv (aggregated and subsetted gene KO function table)
 - **Combined-gene-level-KO-function_heatmap_GLlblMetag.png** (heatmap of all gene-level KO function assignments)
 
-#### 25d. Gene-level KO functions decontamination
+#### 25d. Gene-level KO Functions Decontamination
 
 ```R
 library(tidyverse)
@@ -3906,7 +3884,7 @@ make_heatmap(metadata, species_contig_table,
 - contig_taxonomy_table.csv (aggregated contig taxonomy table with samples in columns and species in rows)
 - **Combined-contig-level-taxonomy_heatmap_GLlblMetag.png** (heatmap of all contig taxonomy assignments)
 
-#### 25g. Contig-level decontamination
+#### 25g. Contig-level Decontamination
 
 ```R
 library(tidyverse)
@@ -3957,6 +3935,7 @@ make_heatmap(metadata, decontaminated_table,
 - [make_heatmap()](#make_plot)
 
 **Parameter Definitions:**
+
 - `metadata_table` - path to a file with samples as rows and columns describing each sample
 - `feature_table_file` - path to a tab separated samples feature table containing contig-level coverage data 
                          species/functions as the first column and samples as other columns.

From 12305efa4d2d58561bdfd22d3374702559e678e6 Mon Sep 17 00:00:00 2001
From: asaravia-butler <70983120+asaravia-butler@users.noreply.github.com>
Date: Tue, 27 Jan 2026 23:22:33 -0800
Subject: [PATCH 28/47] typo fix on GL-DPPD-7117.md

---
 Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md b/Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md
index 46e446047..0cd0adf59 100644
--- a/Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md
+++ b/Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md
@@ -1,6 +1,6 @@
 # Bioinformatics pipeline for Low biomass long-read metagenomics data
 
-> **This document holds an overview and some example commands of how GeneLab processes low-biomass, long-read metagenomics datasets. Exact processing commands for specific datasets that have been released are provided with their processed data in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).**  
+> **This document holds an overview and some example commands of how GeneLab processes low-biomass, short-read metagenomics datasets. Exact processing commands for specific datasets that have been released are provided with their processed data in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).**  
 
 ---
 

From 39655690095ef984c1ad17a98213aa91fadc22e2 Mon Sep 17 00:00:00 2001
From: asaravia-butler <70983120+asaravia-butler@users.noreply.github.com>
Date: Tue, 27 Jan 2026 23:22:59 -0800
Subject: [PATCH 29/47] typo fix on GL-DPPD-7117.md

---
 Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md b/Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md
index 0cd0adf59..4faeff5f7 100644
--- a/Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md
+++ b/Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md
@@ -1,4 +1,4 @@
-# Bioinformatics pipeline for Low biomass long-read metagenomics data
+# Bioinformatics pipeline for Low biomass short-read metagenomics data
 
 > **This document holds an overview and some example commands of how GeneLab processes low-biomass, short-read metagenomics datasets. Exact processing commands for specific datasets that have been released are provided with their processed data in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).**  
 

From 3c08da6ba6a34952d24b0cfba391bbf7ce551ab9 Mon Sep 17 00:00:00 2001
From: asaravia-butler <70983120+asaravia-butler@users.noreply.github.com>
Date: Thu, 29 Jan 2026 08:56:59 -0800
Subject: [PATCH 30/47] formatting updates to GL-DPPD-7116.md

---
 Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md
index bcba05b0a..2a2ae9e02 100644
--- a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md
+++ b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md
@@ -557,7 +557,7 @@ kraken2-build --clean --db kraken2-human-db/
 
 **Input Data:**
 
-- `human.fasta` (fasta file containing human genome, for example, the human genome fasta downloaded from https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.fna.gz)
+- human.fasta (fasta file containing human genome, for example, the human genome fasta downloaded from https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.fna.gz)
 
 **Output Data:**
 

From 341b900c1b0c632ad0f2946b0ad0691d10520617 Mon Sep 17 00:00:00 2001
From: asaravia-butler <70983120+asaravia-butler@users.noreply.github.com>
Date: Thu, 29 Jan 2026 08:59:41 -0800
Subject: [PATCH 31/47] Formatting updates to GL-DPPD-7117.md

---
 Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md b/Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md
index 4faeff5f7..0b2f6b94d 100644
--- a/Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md
+++ b/Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md
@@ -265,7 +265,7 @@ kraken2-build --clean --db kraken2-human-db/
 
 **Input Data:**
 
-- `human.fasta` (fasta file containing human genome, for example, the human genome fasta downloaded from https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.fna.gz)
+- human.fasta (fasta file containing human genome, for example, the human genome fasta downloaded from https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.fna.gz)
 
 **Output Data:**
 

From 7366989e47138b625bd18a4814ae3a8ff2616fc0 Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Wed, 18 Mar 2026 21:39:46 -0700
Subject: [PATCH 32/47] Dev metagenomics low biomass bn (#195)

Sync with working implementation
- fixed spelling/typos across both documents
- Short-read specific updates
  - removed the human read removal step and added a reference to the
    human read removal pipeline
  - add HUMAnN output downstream analysis steps
- Long-read specific updates
  - Update human read removal to sync with latest human read removal pipeline and add link to that pipeline
- update kraken2 database build steps to use k2 wrapper (as in latest RHR and EHR pipelines/workflows)
- renumbered steps and fixed internal links
- updated thresholds to match latest implementation
- add missing assay suffixes and fix incorrect suffixes
- remove unused samtools index step in Assembly-based processing
- update output file names
- change all csv output to tsv
- updated software tables to remove references to unused software in each pipeline.
- fix broken links
- add missing filtering steps in Assembly-based processing
- Updated header information
- Add top50 heatmaps
- added note about decontaminated plots and species tables only being
  present when 1 or more contaminants found
- reorganize R code
---
 .../Low_Biomass/Illumina/GL-DPPD-7117.md      | 2325 +++++++++--------
 .../Low_Biomass/Nanopore/GL-DPPD-7116.md      | 1313 ++++++----
 2 files changed, 2032 insertions(+), 1606 deletions(-)

diff --git a/Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md b/Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md
index 0b2f6b94d..b3c0174b7 100644
--- a/Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md
+++ b/Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md
@@ -4,17 +4,17 @@
 
 ---
 
-**Date:** January MM, 2026  
+**Date:** March MM, 2026  
 **Revision:** -  
-**Document Number:** GL-DPPD-7116  
+**Document Number:** GL-DPPD-7117  
 
 **Submitted by:**  
 Olabiyi A. Obayomi (GeneLab Analysis Team)  
 
 **Approved by:**  
-Samrawit Gebre (OSDR Project Manager)  
-Jonathan Galazka (OSDR Project Scientist)  
-Amanda Saravia-Butler (GeneLab Science Lead)  
+Jonathan Galazka (OSDR Project Manager)  
+Danielle Lopez (OSDR Deputy Project Manager)  
+Amanda Saravia-Butler (OSDR Subject Matter Expert)  
 Barbara Novak (GeneLab Data Processing Lead)  
 
 
@@ -28,111 +28,114 @@ Barbara Novak (GeneLab Data Processing Lead)
     - [1. Raw Data QC](#1-raw-data-qc)
       - [1a. Raw Data QC](#1a-raw-data-qc)
       - [1b. Compile Raw Data QC](#1b-compile-raw-data-qc)
-    - [2. Human Read Removal](#2-human-read-removal)
-      - [2a. Build Kraken2 Human Database](#2a-build-kraken2-human-database)
-      - [2b. Remove Human Reads](#2b-remove-human-reads)
-      - [2c. Compile Human Read Removal QC](#2c-compile-human-read-removal-qc)
-    - [3. Trimming and Quality filtering](#3-trimming-and-quality-filtering)
-      - [3a. Filter Quality and Trim Adapters](#3a-filter-quality-and-trim-adapters)
-      - [3b. Trim PolyG](#3b-trim-polyg)
-      - [3c. Filtered Data QC](#3c-filtered-data-qc)
-      - [3d. Compile Filtered Data QC](#3d-compile-filtered-data-qc)
-    - [4. Contaminant Removal](#4-contaminant-removal)
-      - [4a. Assemble Contaminants](#4a-assemble-contaminants)
-      - [4b. Build Contaminant Index and Map Reads](#4b-build-contaminant-index-and-map-reads)
-      - [4c. Contaminant Removal QC](#4c-contaminant-removal-qc)
-      - [4d. Compile Contaminant Removal QC](#4d-compile-contaminant-removal-qc)
-    - [5. Host read removal](#5-host-read-removal)
-      - [5a. Build Kraken2 Host Database](#5a-build-kraken2-host-database)
-      - [5b. Remove Host Reads](#5b-remove-host-reads)
-      - [5c. Compile Host Read Removal QC](#5c-compile-host-read-removal-qc)
-    - [6. R Environment Setup](#6-r-environment-setup)
-      - [6a. Load Libraries](#6a-load-libraries)
-      - [6b. Define Custom Functions](#6b-define-custom-functions)
-      - [6c. Set global variables](#6c-set-global-variables)
+    - [2. Trimming and Quality filtering](#2-trimming-and-quality-filtering)
+      - [2a. Filter Quality and Trim Adapters](#2a-filter-quality-and-trim-adapters)
+      - [2b. Trim PolyG](#2b-trim-polyg)
+      - [2c. Filtered Data QC](#2c-filtered-data-qc)
+      - [2d. Compile Filtered Data QC](#2d-compile-filtered-data-qc)
+    - [3. Contaminant Removal](#3-contaminant-removal)
+      - [3a. Assemble Contaminants](#3a-assemble-contaminants)
+      - [3b. Build Contaminant Index and Map Reads](#3b-build-contaminant-index-and-map-reads)
+      - [3c. Contaminant Removal QC](#3c-contaminant-removal-qc)
+      - [3d. Compile Contaminant Removal QC](#3d-compile-contaminant-removal-qc)
+    - [4. Host read removal](#4-host-read-removal)
+      - [4a. Build Kraken2 Host Database](#4a-build-kraken2-host-database)
+      - [4b. Remove Host Reads](#4b-remove-host-reads)
+      - [4c. Compile Host Read Removal QC](#4c-compile-host-read-removal-qc)
+    - [5. R Environment Setup](#5-r-environment-setup)
+      - [5a. Load Libraries](#5a-load-libraries)
+      - [5b. Define Custom Functions](#5b-define-custom-functions)
+      - [5c. Set global variables](#5c-set-global-variables)
   - [**Read-based processing**](#read-based-processing)
-    - [7. Taxonomic profiling using kaiju](#7-taxonomic-profiling-using-kaiju)
-      - [7a. Build Kaiju Database](#7a-build-kaiju-database)
-      - [7b. Kaiju Taxonomic Classification](#7b-kaiju-taxonomic-classification)
-      - [7c. Compile Kaiju Taxonomy Results](#7c-compile-kaiju-taxonomy-results)
-      - [7d. Convert Kaiju Output To Krona Format](#7d-convert-kaiju-output-to-krona-format)
-      - [7e. Compile Kaiju Krona Reports](#7e-compile-kaiju-krona-reports)
-      - [7f. Create Kaiju Species Count Table](#7f-create-kaiju-species-count-table)
-      - [7g. Filter Kaiju Species Count Table ](#7g-filter-kaiju-species-count-table)
-      - [7h. Taxonomy Barplots](#7h-taxonomy-barplots)
-      - [7i. Feature Decontamination](#7i-feature-decontamination)
-    - [8. Taxonomic Profiling Using Kraken2](#8-taxonomic-profiling-using-kraken2)
-      - [8a. Download Kraken2 Database](#8a-download-kraken2-database)
-      - [8b. Kraken2 Taxonomic Classification](#8b-kraken2-taxonomic-classification)
-      - [8c. Compile Kraken2 Taxonomy Results](#8c-compile-kraken2-taxonomy-results)
-        - [8ci. Create Merged Kraken2 Taxonomy Table](8ci-create-merged-kraken2-taxonomy-table)
-        - [8cii. Compile Kraken2 Taxonomy Reports](8cii-compile-kraken2-taxonomy-reports)
-      - [8d. Convert Kraken2 Output to Krona Format](#8d-convert-kraken2-output-to-krona-format)
-      - [8e. Compile Kraken2 Krona Reports](#8e-compile-kraken2-krona-reports)
-      - [8f. Filter Kraken2 Species Count Table](#8f-filter-kraken2-species-count-table)
-      - [8g. Taxonomy Barplots](#8g-taxonomy-barplots)
-      - [8h. Feature Decontamination](#8h-feature-decontamination)
-    - [9. Taxonomic Profiling Using MetaPhlan](#9-taxonomic-profiling-using-metaphlan)
-      - [9a. Download and install HUMAnN databases](#9a-download-and-install-humann-databases)
-      - [9b. HUMAnN/MetaPhlAn Taxonomic Classification](#9b-humannmetaphlan-taxonomic-classification)
-      - [9c. Merge Multiple Sample Functional Profiles](#9c-merge-multiple-sample-functional-profiles)
-      - [9d. Split Results Tables](#9d-split-results-tables)
-      - [9e. Normalize Gene Families and Pathway Abundances Tables](#9e-normalize-gene-families-and-pathway-abundances-tables)
-      - [9f. Generate Normalized Gene-family Table Grouped by Kegg Orthologs (KOs)](#9f-generate-normalized-gene-family-table-grouped-by-kegg-orthologs-kos)
-      - [9g. Combine MetaPhlan Taxonomy Tables](#9g-combine-metaphlan-taxonomy-tables)
-      - [9h. Create MetaPhlan Species Count Table](#9h-process-metaphlan)
-        - [9hi. Get Sample Read Counts](#9hi-get-sample-read-counts)
-        - [9hii. Process MetaPhlan Taxonomy Table](#9hii-process-metaphlan-taxonomy-table)
-      - [9i. Filter MetaPhlan Species Count Table](#9i-filter-metaphlan-species-count-table)
-      - [9j. Taxonomy Barplots](#8g-taxonomy-barplots)
-      - [9k. Feature Decontamination](#8h-feature-decontamination)
+    - [6. Taxonomic profiling using kaiju](#6-taxonomic-profiling-using-kaiju)
+      - [6a. Build Kaiju Database](#6a-build-kaiju-database)
+      - [6b. Kaiju Taxonomic Classification](#6b-kaiju-taxonomic-classification)
+      - [6c. Compile Kaiju Taxonomy Results](#6c-compile-kaiju-taxonomy-results)
+      - [6d. Convert Kaiju Output To Krona Format](#6d-convert-kaiju-output-to-krona-format)
+      - [6e. Compile Kaiju Krona Reports](#6e-compile-kaiju-krona-reports)
+      - [6f. Create Kaiju Species Count Table](#6f-create-kaiju-species-count-table)
+      - [6g. Filter Kaiju Species Count Table](#6g-filter-kaiju-species-count-table)
+      - [6h. Kaiju Taxonomy Barplots](#6h-kaiju-taxonomy-barplots)
+      - [6i. Kaiju Feature Decontamination](#6i-kaiju-feature-decontamination)
+    - [7. Taxonomic Profiling Using Kraken2](#7-taxonomic-profiling-using-kraken2)
+      - [7a. Download Kraken2 Database](#7a-download-kraken2-database)
+      - [7b. Kraken2 Taxonomic Classification](#7b-kraken2-taxonomic-classification)
+      - [7c. Compile Kraken2 Taxonomy Results](#7c-compile-kraken2-taxonomy-results)
+        - [7ci. Create Merged Kraken2 Taxonomy Table](#7ci-create-merged-kraken2-taxonomy-table)
+        - [7cii. Compile Kraken2 Taxonomy Reports](#7cii-compile-kraken2-taxonomy-reports)
+      - [7d. Convert Kraken2 Output to Krona Format](#7d-convert-kraken2-output-to-krona-format)
+      - [7e. Compile Kraken2 Krona Reports](#7e-compile-kraken2-krona-reports)
+      - [7f. Filter Kraken2 Species Count Table](#7f-filter-kraken2-species-count-table)
+      - [7g. Kraken2 Taxonomy Barplots](#7g-kraken2-taxonomy-barplots)
+      - [7h. Kraken2 Feature Decontamination](#7h-kraken2-feature-decontamination)
+    - [8. Taxonomic Profiling Using MetaPhlan](#8-taxonomic-profiling-using-metaphlan)
+      - [8a. Download and install HUMAnN databases](#8a-download-and-install-humann-databases)
+      - [8b. HUMAnN/MetaPhlAn Taxonomic Classification](#8b-humannmetaphlan-taxonomic-classification)
+      - [8c. Merge Multiple Sample Functional Profiles](#8c-merge-multiple-sample-functional-profiles)
+      - [8d. Split Results Tables](#8d-split-results-tables)
+      - [8e. Normalize Gene Families and Pathway Abundances Tables](#8e-normalize-gene-families-and-pathway-abundances-tables)
+      - [8f. Generate Normalized Gene-family Table Grouped by Kegg Orthologs (KOs)](#8f-generate-normalized-gene-family-table-grouped-by-kegg-orthologs-kos)
+      - [8g. Combine MetaPhlan Taxonomy Tables](#8g-combine-metaphlan-taxonomy-tables)
+      - [8h. Create MetaPhlan Species Count Table](#8h-create-metaphlan-species-count-table)
+        - [8hi. Get Sample Read Counts](#8hi-get-sample-read-counts)
+        - [8hii. Process MetaPhlan Taxonomy Table](#8hii-process-metaphlan-taxonomy-table)
+      - [8i. Filter MetaPhlan Species Count Table](#8i-filter-metaphlan-species-count-table)
+      - [8j. MetaPhlan Taxonomy Barplots](#8j-metaphlan-taxonomy-barplots)
+      - [8k. MetaPhlan Feature Decontamination](#8k-metaphlan-feature-decontamination)
+      - [8l. Filter Humann Output](#8l-filter-humann-output)
+      - [8m. Create Humann Function Heatmaps](#8m-create-humann-function-heatmaps)
+      - [8n. Humann Feature Decontamination](#8n-humann-feature-decontamination)
   - [**Assembly-based Processing**](#assembly-based-processing)
-    - [10. Sample Assembly](#10-sample-assembly)
-    - [11. Rename Contigs and Summarize Assemblies](#11-rename-contigs-and-summarize-assemblies)
-      - [11a. Rename Contig Headers](#11a-rename-contig-headers)
-      - [11b. Summarize Assemblies](#11b-summarize-assemblies)
-    - [12. Gene Prediction](#12-gene-prediction)
-      - [12a. Generate Gene Predictions](12a-generate-gene-predictions)
-      - [12b. Remove Line Wraps In Gene Prediction Output](#12a-remove-line-wraps-in-gene-prediction-output)
-    - [13. Functional Annotation](#13-functional-annotation)
-      - [13a. Download Reference Database of HMM Models](#13a-download-reference-database-of-hmm-models)
-      - [13b. Run KEGG Annotation](#13b-run-kegg-annotation)
-      - [13c. Filter KO Outputs](#13c-filter-ko-outputs)
-    - [14. Taxonomic Classification](#14-taxonomic-classification)
-      - [14a. Pull and Unpack Pre-built Reference DB](#14a-pull-and-unpack-pre-built-reference-db)
-      - [14b. Run Taxonomic Classification](#14b-run-taxonomic-classification)
-      - [14c. Add Taxonomy Info From Taxids To Genes](#14c-add-taxonomy-info-from-taxids-to-genes)
-      - [14d. Add Taxonomy Info From Taxids To Contigs](#14d-add-taxonomy-info-from-taxids-to-contigs)
-      - [14e. Format Gene-level Output With awk and sed](#14e-format-gene-level-output-with-awk-and-sed)
-      - [14f. Format Contig-level Output With awk and sed](#14f-format-contig-level-output-with-awk-and-sed)
-    - [15. Read-Mapping](#15-read-mapping)
-      - [15a. Build Reference Index](#15a-build-reference-index)
-      - [15b. Align Reads to Sample Assembly](#15b-align-reads-to-sample-assembly)
-      - [15c. Sort and Index Assembly Alignments](#15c-sort-and-index-assembly-alignments)
-    - [16. Get Coverage Information and Filter Based On Detection](#16-get-coverage-information-and-filter-based-on-detection)
-      - [16a. Filter Coverage Levels Based On Detection](#16a-filter-coverage-levels-based-on-detection)
-      - [16b. Filter Gene and Contig Coverage Based On Detection](#16b-filter-gene-and-contig-coverage-based-on-detection)
-    - [17. Combine Gene-level Coverage, Taxonomy, and Functional Annotations For Each Sample](#17-combine-gene-level-coverage-taxonomy-and-functional-annotations-for-each-sample)
-    - [18. Combine Contig-level Coverage and Taxonomy For Each Sample](#18-combine-contig-level-coverage-and-taxonomy-for-each-sample)
-    - [19. Generate Normalized, Gene- and Contig-level Coverage Summary Tables of KO-annotations and Taxonomy Across Samples](#19-generate-normalized-gene--and-contig-level-coverage-summary-tables-of-ko-annotations-and-taxonomy-across-samples)
-      - [19a. Generate Gene-level Coverage Summary Tables](#19a-generate-gene-level-coverage-summary-tables)
-      - [19b. Generate Contig-level Coverage Summary Tables](#19b-generate-contig-level-coverage-summary-tables)
-    - [20. **M**etagenome-**A**ssembled **G**enome (MAG) recovery](#20-metagenome-assembled-genome-mag-recovery)
-      - [20a. Bin Contigs](#20a-bin-contigs)
-      - [20b. Bin Quality Assessment](#20b-bin-quality-assessment)
-      - [20c. Filter MAGs](#20c-filter-mags)
-      - [20d. MAG Taxonomic Classification](#20d-mag-taxonomic-classification)
-      - [20e. Generate Overview Table Of All MAGs](#20e-generate-overview-table-of-all-mags)
-    - [21. Generate MAG-level Functional Summary Overview](#21-generate-mag-level-functional-summary-overview)
-      - [21a. Get KO Annotations Per MAG](#21a-get-ko-annotations-per-mag)
-      - [21b. Summarize KO Annotations With KEGG-Decoder](#21b-summarize-ko-annotations-with-kegg-decoder)
-    - [22. Decontamination and Visualization of Contig- and Gene-taxonomy and Gene-function Outputs](#22-decontamination-and-visualization-of-contig--and-gene-taxonomy-and-gene-function-outputs)
-      - [22a. Gene-level Taxonomy Heatmaps](#22a-gene-level-taxonomy-heatmaps)
-      - [22b. Gene-level Taxonomy Decontamination](#22b-gene-level-taxonomy-decontamination)
-      - [22c. Gene-level KO Functions Heatmaps](#22c-gene-level-ko-functions-heatmaps)
-      - [22d. Gene-level KO Functions Decontamination](#22d-gene-level-ko-functions-decontamination)
-      - [22e. Contig-level Heatmaps](#22e-contig-level-heatmaps)
-      - [22f. Contig-level Decontamination](#22f-contig-level-decontamination)
+    - [9. Sample Assembly](#9-sample-assembly)
+    - [10. Rename Contigs and Summarize Assemblies](#10-rename-contigs-and-summarize-assemblies)
+      - [10a. Rename Contig Headers](#10a-rename-contig-headers)
+      - [10b. Summarize Assemblies](#10b-summarize-assemblies)
+    - [11. Gene Prediction](#11-gene-prediction)
+      - [11a. Generate Gene Predictions](#11a-generate-gene-predictions)
+      - [11b. Remove Line Wraps In Gene Prediction Output](#11b-remove-line-wraps-in-gene-prediction-output)
+    - [12. Functional Annotation](#12-functional-annotation)
+      - [12a. Download Reference Database of HMM Models](#12a-download-reference-database-of-hmm-models)
+      - [12b. Run KEGG Annotation](#12b-run-kegg-annotation)
+      - [12c. Filter KO Outputs](#12c-filter-ko-outputs)
+    - [13. Taxonomic Classification](#13-taxonomic-classification)
+      - [13a. Pull and Unpack Pre-built Reference DB](#13a-pull-and-unpack-pre-built-reference-db)
+      - [13b. Run Taxonomic Classification](#13b-run-taxonomic-classification)
+      - [13c. Add Taxonomy Info From Taxids To Genes](#13c-add-taxonomy-info-from-taxids-to-genes)
+      - [13d. Add Taxonomy Info From Taxids To Contigs](#13d-add-taxonomy-info-from-taxids-to-contigs)
+      - [13e. Format Gene-level Output With awk and sed](#13e-format-gene-level-output-with-awk-and-sed)
+      - [13f. Format Contig-level Output With awk and sed](#13f-format-contig-level-output-with-awk-and-sed)
+    - [14. Read-Mapping](#14-read-mapping)
+      - [14a. Build Reference Index](#14a-build-reference-index)
+      - [14b. Align Reads to Sample Assembly](#14b-align-reads-to-sample-assembly)
+      - [14c. Sort Assembly Alignments](#14c-sort-assembly-alignments)
+    - [15. Get Coverage Information and Filter Based On Detection](#15-get-coverage-information-and-filter-based-on-detection)
+      - [15a. Filter Coverage Levels Based On Detection](#15a-filter-coverage-levels-based-on-detection)
+      - [15b. Filter Gene and Contig Coverage Based On Detection](#15b-filter-gene-and-contig-coverage-based-on-detection)
+    - [16. Combine Gene-level Coverage, Taxonomy, and Functional Annotations For Each Sample](#16-combine-gene-level-coverage-taxonomy-and-functional-annotations-for-each-sample)
+    - [17. Combine Contig-level Coverage and Taxonomy For Each Sample](#17-combine-contig-level-coverage-and-taxonomy-for-each-sample)
+    - [18. Generate Normalized, Gene- and Contig-level Coverage Summary Tables of KO-annotations and Taxonomy Across Samples](#18-generate-normalized-gene--and-contig-level-coverage-summary-tables-of-ko-annotations-and-taxonomy-across-samples)
+      - [18a. Generate Gene-level Coverage Summary Tables](#18a-generate-gene-level-coverage-summary-tables)
+      - [18b. Generate Contig-level Coverage Summary Tables](#18b-generate-contig-level-coverage-summary-tables)
+    - [19. **M**etagenome-**A**ssembled **G**enome (MAG) recovery](#19-metagenome-assembled-genome-mag-recovery)
+      - [19a. Bin Contigs](#19a-bin-contigs)
+      - [19b. Bin Quality Assessment](#19b-bin-quality-assessment)
+      - [19c. Filter MAGs](#19c-filter-mags)
+      - [19d. MAG Taxonomic Classification](#19d-mag-taxonomic-classification)
+      - [19e. Generate Overview Table Of All MAGs](#19e-generate-overview-table-of-all-mags)
+    - [20. Generate MAG-level Functional Summary Overview](#20-generate-mag-level-functional-summary-overview)
+      - [20a. Get KO Annotations Per MAG](#20a-get-ko-annotations-per-mag)
+      - [20b. Summarize KO Annotations With KEGG-Decoder](#20b-summarize-ko-annotations-with-kegg-decoder)
+    - [21. Filtering, Decontamination, and Visualization of Contig- and Gene-taxonomy and Gene-function Outputs](#21-filtering-decontamination-and-visualization-of-contig--and-gene-taxonomy-and-gene-function-outputs)
+      - [21a. Gene-level Taxonomy Heatmaps](#21a-gene-level-taxonomy-heatmaps)
+      - [21b. Gene-level Taxonomy Feature Filtering](#21b-gene-level-taxonomy-feature-filtering)
+      - [21c. Gene-level Taxonomy Decontamination](#21c-gene-level-taxonomy-decontamination)
+      - [21d. Gene-level KO Functions Heatmaps](#21d-gene-level-ko-functions-heatmaps)
+      - [21e. Gene-level KO Functions Feature Filtering](#21e-gene-level-ko-functions-feature-filtering)
+      - [21f. Gene-level KO Functions Decontamination](#21f-gene-level-ko-functions-decontamination)
+      - [21g. Contig-level Heatmaps](#21g-contig-level-heatmaps)
+      - [21h. Contig-level Feature Filtering](#21h-contig-level-feature-filtering)
+      - [21i. Contig-level Decontamination](#21i-contig-level-decontamination)
+    - [22. Generate Assembly-based Processing Overview](#22-generate-assembly-based-processing-overview)
 
 ---
 
@@ -142,8 +145,11 @@ Barbara Novak (GeneLab Data Processing Lead)
 |:------|:-----:|------:|
 |bbduk| 38.86 |[https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/)|
 |bit| 1.8.53 |[https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit](https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit)|
+|bowtie2| 2.4.1 | [https://bowtie-bio.sourceforge.net/bowtie2/index.shtml](https://bowtie-bio.sourceforge.net/bowtie2/index.shtml)|
 |CAT| 5.2.3 |[https://github.com/dutilh/CAT#cat-and-bat](https://github.com/dutilh/CAT#cat-and-bat)|
 |CheckM| 1.1.3 |[https://github.com/Ecogenomics/CheckM](https://github.com/Ecogenomics/CheckM)|
+|fastp| 0.24.0 |[https://github.com/OpenGene/fastp](https://github.com/OpenGene/fastp)|
+|FastQC|0.12.1|[https://www.bioinformatics.babraham.ac.uk/projects/fastqc/](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)|
 |SPAdes| 4.1.0 | [https://github.com/ablab/spades](https://github.com/ablab/spades) |
 |GTDB-Tk| 2.4.0 |[https://github.com/Ecogenomics/GTDBTk](https://github.com/Ecogenomics/GTDBTk)|
 |HUMAnN| 3.9 |[https://github.com/biobakery/humann](https://github.com/biobakery/humann)|
@@ -154,7 +160,6 @@ Barbara Novak (GeneLab Data Processing Lead)
 |Krona| 2.8.1 | [https://github.com/marbl/Krona/wiki](https://github.com/marbl/Krona/wiki)|
 |MetaBAT| 2.15 |[https://bitbucket.org/berkeleylab/metabat/src/master/](https://bitbucket.org/berkeleylab/metabat/src/master/)|
 |MultiQC| 1.27.1 |[https://multiqc.info/](https://multiqc.info/)|
-|Medaka| 2.1.1 | [https://github.com/nanoporetech/medaka](https://github.com/nanoporetech/medaka) |
 |MetaPhlAn| 4.1.0 |[https://github.com/biobakery/MetaPhlAn](https://github.com/biobakery/MetaPhlAn)|
 |Prodigal| 2.6.3 |[https://github.com/hyattpd/Prodigal#prodigal](https://github.com/hyattpd/Prodigal#prodigal)|
 |samtools| 1.22.1 |[https://github.com/samtools/samtools#samtools](https://github.com/samtools/samtools#samtools)|
@@ -177,21 +182,22 @@ Barbara Novak (GeneLab Data Processing Lead)
 
 
 ### 1. Raw Data QC
+> NOTE: It is NASA's policy that any human reads are to be removed from metagenomics datasets prior to being hosted in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/). As such, this pipeline starts with fastq files that have had the human reads removed using the GeneLab Remove Human Reads pipeline ([GL-DPPD-7105-A](../../Remove_human_reads_from_raw_data/Pipeline_GL-DPPD-7105_Versions/GL-DPPD-7105-A.md)).
 
 #### 1a. Raw Data QC
 
 ```bash
-fastqc -o raw_fastqc_output *raw.fastq.gz
+fastqc -o HRrm_fastqc_output *HRrm_GLlbsMetag.fastq.gz
 ```
 
 **Parameter Definitions:**
 
 - `-o` – the output directory to store results
-- `*raw.fastq.gz` – the input reads are specified as a positional argument, and can be given all at once with wildcards like this, or as individual arguments with spaces in between them
+- `*HRrm_GLlbsMetag.fastq.gz` – the input reads are specified as a positional argument, and can be given all at once with wildcards like this, or as individual arguments with spaces in between them
 
 **Input data:**
 
-- *raw.fastq.gz (raw reads)
+- *HRrm_GLlbsMetag.fastq.gz (raw reads, after human read removal)
 
 **Output data:**
 
@@ -219,140 +225,24 @@ multiqc --zip-data-dir \
 
 **Input Data:**
 
-- /path/to/raw_fastqc_output/*fastqc.zip (FastQC output data, from [Step 1a](#1a-raw-data-qc))
+- /path/to/HRrm_fastqc_output/*fastqc.zip (FastQC output data, from [Step 1a](#1a-raw-data-qc))
 
 **Output Data:**
 
-- **raw_multiqc_report/raw_multiqc_GLlbsMetag.html** (multiqc output html summary)
-- **raw_multiqc_report/raw_multiqc_GLlbsMetag_data.zip** (zip archive containing multiqc output data)
+- **HRrm_multiqc_report/HRrm_multiqc_GLlbsMetag.html** (multiqc output html summary)
+- **HRrm_multiqc_report/HRrm_multiqc_GLlbsMetag_data.zip** (zip archive containing multiqc output data)
 
 <br>  
 
 ---
 
-### 2. Human Read Removal
+### 2. Trimming and Quality Filtering
 
-#### 2a. Build Kraken2 Human Database
-
-> **Note:** It is recommended to use NCBI genome files with kraken2 because sequences not downloaded from 
-NCBI may require explicit assignment of taxonomy information before they can be used to build the 
-database, as mentioned in the [Kraken2 Documentation](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown).
-
-```bash
-# Download NCBI taxonomic information 
-kraken2-build --download-taxonomy --db kraken2-human-db/
-
-# Add genomic sequences to your database's genomic library
-kraken2-build --add-to-library human.fasta --db kraken2-human-db/ --no-masking
-             
-# Build the database
-kraken2-build --build --db kraken2-human-db/ --kmer-len 35 --minimizer-len 31
-
-# Clean up intermediate files
-kraken2-build --clean --db kraken2-human-db/
-```
-
-**Parameter Definitions:**
-- `--download-taxonomy` - Instructs kraken2-build to download the NCBI taxonomic information.
-- `--db` - Specifies the name of the directory for the kraken2 database
-- `--add-to-library` - Instructs kraken2-build to add the contents of a file to the kraken2 DB library
-  - `--no-masking` - Disables masking of low-complexity sequences. For additional 
-                   information see the [kraken documentation for masking](https://github.com/DerrickWood/kraken2/wiki/Manual#masking-of-low-complexity-sequences).
-- `--build` - Instructs kraken2-build to build the kraken2 DB from the library files
-  - `--kmer-len` - K-mer length in bp (default: 35).
-  - `--minimizer-len` - Minimizer length in bp (default: 31)
-- `--clean` - Instructs kraken2-build to remove unneeded intermediate files.
-
-**Input Data:**
-
-- human.fasta (fasta file containing human genome, for example, the human genome fasta downloaded from https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.fna.gz)
-
-**Output Data:**
-
-- kraken2_human_db/ - Kraken2 human database directory, containing hash.k2d, opts.k2d, and taxo.k2d files 
-
-
-#### 2b. Remove Human Reads
+#### 2a. Filter Quality and Trim Adapters
 
 ```bash
-kraken2 --db kraken2_human_db \
-        --gzip-compressed \
-        --threads NumberOfThreads \
-        --use-names \
-        --output sample-kraken2-output.txt \
-        --report sample-kraken2-report.tsv \
-        --unclassified-out sample1_R#.fastq \
-        sample1_R1_raw.fastq.gz sample1_R2_raw.fastq.gz
-
-# rename and gzip output files
-mv sample1_R_1.fastq sample1_R1_HRrm_GLlbsMetag.fastq && \
-gzip sample1_R1_HRrm_GLlbsMetag.fastq
-
-mv  sample1_R_2.fastq sample1_R2_HRrm_GLlbsMetag.fastq && \
-gzip sample1_R2_HRrm_GLlbsMetag.fastq
-```
-
-**Parameter Definitions:**
-
-- `--db` - Specifies the directory holding the kraken2 database.
-- `--gzip-compressed` - Specifies that the input fastq files are gzip-compressed.
-- `--threads NumberOfThreads` - Number of parallel processing threads to use.
-- `--use-names` - Specifies adding taxa names in addition to taxon IDs.
-- `--output` - Specifies the name of the kraken2 read-based output file (one line per read).
-- `--report` - Specifies the name of the kraken2 report output file (one line per taxa, with number of reads assigned to it).
-- `--unclassified-out` - Specifies a regular expression for the naming of the output files containing reads that were not classified, i.e non-human reads.
-- `sample1_R1_raw.fastq.gz sample1_R2_raw.fastq.gz` - Positional argument specifying the input read files
-
-**Input Data:**
-
-- kraken2_human_db/ (kraken2 human database directory, output from [Step 2a](#2a-build-kraken2-database))
-- *raw.fastq.gz (raw reads)
-
-**Output Data:**
-
-- sample1-kraken2-output.txt (kraken2 read-based output file (one line per read))
-- sample1-kraken2-report.tsv (kraken2 report output file (one line per taxa, with number of reads assigned to it))
-- **sample1_raw_HRrm_GLlbsMetag.fastq.gz** (raw sample reads with human reads removed, gzipped fasta file)
-
-
-#### 2c. Compile Human Read Removal QC
-
-```bash
-multiqc --zip-data-dir \ 
-        --outdir HRrm_multiqc_report \
-        --filename HRrm_multiqc_GLlbsMetag \
-        --interactive \
-        /path/to/*kraken2-report.tsv
-```
-
-**Parameter Definitions:**
-
-- `--zip-data-dir` - Compress the data directory.
-- `--outdir` – Specifies the output directory to store results.
-- `--filename` – Specifies the filename prefix of results.
-- `--interactive` - Force multiqc to always create interactive javascript plots.
-- `/path/to/*kraken2-report.tsv` – The kraken2 output report files, provided as a positional argument.
-
-**Input Data:**
-
-- /path/to/*kraken2-report.tsv (kraken2 report files, output from [Step 2b](#2b-remove-human-reads))
-
-**Output Data:**
-
-- **HRrm_multiqc_GLlbsMetag.html** (multiqc output html summary)
-- **HRrm_multiqc_GLlbsMetag_data.zip** (zip archive containing multiqc output data)
-
-<br>  
-
----
-
-### 3. Trimming and Quality Filtering
-
-#### 3a. Filter Quality and Trim Adapters
-
-```bash
-fastp --in1 sample1_R1_raw.fastq.gz --out1 temp_sample1_R1_filtered.fastq.gz \
-      --in2 sample1_R2_raw.fastq.gz --out2 temp_sample1_R2_filtered.fastq.gz \
+fastp --in1 sample1_R1_HRrm_GLlbsMetag.fastq.gz --out1 temp_sample1_R1_filtered.fastq.gz \
+      --in2 sample1_R2_HRrm_GLlbsMetag.fastq.gz --out2 temp_sample1_R2_filtered.fastq.gz \
       --qualified_quality_phred  20 \
       --length_required 50 \
       --thread 2 \
@@ -377,13 +267,13 @@ fastp --in1 sample1_R1_raw.fastq.gz --out1 temp_sample1_R1_filtered.fastq.gz \
 
 **Input Data:**
 
-- *raw_HRrm_GLlbsMetag.fastq.gz (raw sample reads with human reads removed, from [Step 2b](#2b-remove-human-reads))
+- *HRrm_GLlbsMetag.fastq.gz (raw sample reads with human reads removed)
 
 **Output Data:**
 
 - temp_*_filtered.fastq.gz (quality filtered and adapter trimmed reads)
 
-#### 3b. Trim polyG
+#### 2b. Trim polyG
 
 ```bash
 fastp --in1 temp_sample1_R1_filtered.fastq.gz --out1 sample1_R1_filtered_GLlbsMetag.fastq.gz \
@@ -414,13 +304,13 @@ fastp --in1 temp_sample1_R1_filtered.fastq.gz --out1 sample1_R1_filtered_GLlbsMe
 
 **Input Data:**
 
-- /path/to/filtered_data/temp_sample1*.fastq.gz (round1 filtered/adapter trimmed reads, output from [Step 3a](#3a-filter-quality-and-trim-adapters)
+- /path/to/filtered_data/temp_sample1*.fastq.gz (round1 filtered/adapter trimmed reads, output from [Step 2a](#2a-filter-quality-and-trim-adapters)
 
 **Output Data:**
 
 - **\*filtered_GLlbsMetag.fastq.gz** (quality filtered and adapter trimmed, human removed reads)
 
-#### 3c. Filtered Data QC
+#### 2c. Filtered Data QC
 
 ```bash
 fastqc -o filtered_fastqc_output *filtered.fastq.gz
@@ -433,7 +323,7 @@ fastqc -o filtered_fastqc_output *filtered.fastq.gz
 
 **Input data:**
 
-- *filtered_GLlbsMetag.fastq.gz (trimmed and filtered reads, from [Step 3b](#3b-trim-polyg))
+- *filtered_GLlbsMetag.fastq.gz (trimmed and filtered reads, from [Step 2b](#2b-trim-polyg))
 
 **Output data:**
 
@@ -441,7 +331,7 @@ fastqc -o filtered_fastqc_output *filtered.fastq.gz
 - *fastqc.zip (FastQC output data)
 
 
-#### 3d. Compile Filtered Data QC
+#### 2d. Compile Filtered Data QC
 
 ```bash
 multiqc --zip-data-dir \
@@ -461,7 +351,7 @@ multiqc --zip-data-dir \
 
 **Input Data:**
 
-- /path/to/filtered_fastqc_output/*fastqc.zip (FastQC output data, from [Step 3c](#3c-filtered-data-qc))
+- /path/to/filtered_fastqc_output/*fastqc.zip (FastQC output data, from [Step 2c](#2c-filtered-data-qc))
 
 **Output Data:**
 
@@ -472,11 +362,11 @@ multiqc --zip-data-dir \
 
 ---
 
-### 4. Contaminant Removal
+### 3. Contaminant Removal
 
 > A major issue with low biomass data is the high potential for contamination due to the low amount of DNA extracted from the samples. Because negative control/blank samples should by theory be contaminant free, any sequence detected in the negative control is a potential contaminant. To filter out contaminants found in negative control samples that may have been due to cross contamination in the lab, we use a read mapping approach. First negative/blank control sample reads are assembled then the filtered and trimmed reads from each low-biomass sample are mapped to the assembled contigs from the negative/blank control samples. Reads mapping to the assembled contigs are categorized as contaminants and are therefore filtered out and thus excluded from downstream analyses.
 
-#### 4a. Assemble Contaminants
+#### 3a. Assemble Contaminants
 
 ```bash
 cat /path/to/contaminant_fastq/*_R1_filtered_GLlbsMetag.fastq.gz > merged_R1.fastq.gz
@@ -503,7 +393,7 @@ mv spades.log blank-assembly.log
 
 **Input Data**
 
-- *_R[12]_filtered_GLlbsMetag.fastq.gz (one or more paired-end, trimmed and filtered, HRrm reads from blank (negative control) samples, output from [Step 3b](#3b-trim-polyg))
+- *_R[12]_filtered_GLlbsMetag.fastq.gz (one or more paired-end, trimmed and filtered, HRrm reads from blank (negative control) samples, output from [Step 2b](#2b-trim-polyg))
 
 **Output Data**
 
@@ -512,7 +402,7 @@ mv spades.log blank-assembly.log
 
 <br>
 
-#### 4b. Build Contaminant Index and Map Reads
+#### 3b. Build Contaminant Index and Map Reads
 
 ```bash
 # Build contaminant index
@@ -553,17 +443,17 @@ rm -rf sample1.sam
 
 **Input Data**
 
-- /path/to/contaminant_assembly/blank-scaffolds.fasta (contaminant assembly, output from [Step 4a](#4a-assemble-contaminants))
-- sample1_R[12]_filtered_GLlbsMetag.fastq.gz (filtered and trimmed reads, output from [Step 3b](#3b-trim-polyg))
+- /path/to/contaminant_assembly/blank-scaffolds.fasta (contaminant assembly, output from [Step 3a](#3a-assemble-contaminants))
+- sample1_R[12]_filtered_GLlbsMetag.fastq.gz (filtered and trimmed reads, output from [Step 2b](#2b-trim-polyg))
 
 **Output Data**
 
-- sample1_R[12]_decontam_GLlbsMetag.fastq.gz (decontaminated reads)
+- **sample1_R[12]_decontam_GLlbsMetag.fastq.gz** (decontaminated reads)
 - sample-mapping-info.txt (bowtie2 mapping log file)
 
 <br>
 
-#### 4c. Contaminant Removal QC
+#### 3c. Contaminant Removal QC
 
 ```bash
 fastqc -o decontam_fastqc_output *decontam_GLlbsMetag.fastq.gz
@@ -585,7 +475,7 @@ fastqc -o decontam_fastqc_output *decontam_GLlbsMetag.fastq.gz
 
 <br>
 
-#### 4d. Compile Contaminant Remove QC
+#### 3d. Compile Contaminant Remove QC
 
 ```bash
 multiqc --zip-data-dir \
@@ -605,7 +495,7 @@ multiqc --zip-data-dir \
 
 **Input Data:**
 
-- /path/to/decontam_fastqc_output/*fastqc.zip (FastQC output data, from [Step 4c](#4c-contaminant-removal-qc))
+- /path/to/decontam_fastqc_output/*fastqc.zip (FastQC output data, from [Step 3c](#3c-contaminant-removal-qc))
 
 **Output Data:**
 
@@ -616,42 +506,47 @@ multiqc --zip-data-dir \
 
 ---
 
-### 5. Host Read Removal
+### 4. Host Read Removal
 
 If the samples were derived from a host organism other than human, potential host reads
 should be identified and removed. This step is optional. 
 
-#### 5a. Build Kraken2 Host Database
+#### 4a. Build Kraken2 Host Database
 
 > **Note:** It is recommended to use NCBI genome files with kraken2 because sequences not downloaded from 
 NCBI may require explicit assignment of taxonomy information before they can be used to build the 
-database, as mentioned in the [Kraken2 Documentation](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown).
+database, as mentioned in the [Kraken2 Documentation](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown). 
+This step uses the kraken2 [k2 wrapper script](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#introducing-k2) throughout.
 
 ```bash
 # Download NCBI taxonomic information 
-kraken2-build --download-taxonomy --db kraken2-${hostname}-db/
+k2 download-taxonomy --db kraken2-${hostname}-db/
 
-# Add genomic sequences to your database's genomic library
-kraken2-build --add-to-library ${hostname}.fasta --db kraken2-${hostname}-db/ --no-masking 
+# Add host fasta sequences to the database's genomic library
+k2 add-to-library --files ${hostname}.fasta --db kraken2-${hostname}-db/ --threads 30 --no-masking
 
 # Build the database
-kraken2-build --build --db kraken2-${hostname}-db/ --kmer-len 35 --minimizer-len 31
+k2 build --db kraken2-${hostname}-db/ --kmer-len 35 --minimizer-len 31 --threads 30
 
 # Clean up intermediate files
-kraken2-build --clean --db kraken2-${hostname}-db/
+k2 clean --db kraken2-${hostname}-db/
 ```
 
 **Parameter Definitions:**
 
-- `--download-taxonomy` - Instructs kraken2-build to download the NCBI taxonomic information.
-- `--db` - Specifies the name of the directory for the kraken2 database
-- `--add-to-library` - Instructs kraken2-build to add the contents of a file (`${hostname}.fasta`) to the kraken2 DB library
+- `download-taxonomy` - Chooses the taxonomy download function
+  - `--db` - Specifies the name of the directory for the kraken2 database
+- `add-to-library` - Chooses the add-to-library function, which adds sequence files to the kraken2 database library
+  - `--files` - Specifies the file(s) to add to the kraken2 database library
   - `--no-masking` - Disables masking of low-complexity sequences. For additional 
-                     information see the [kraken documentation for masking](https://github.com/DerrickWood/kraken2/wiki/Manual#masking-of-low-complexity-sequences).
-- `--build` - Instructs kraken2-build to build the kraken2 DB from the library files
+                   information see the [kraken documentation for masking](https://github.com/DerrickWood/kraken2/wiki/Manual#masking-of-low-complexity-sequences).
+  - `--db` - Specifies the name of the directory for the kraken2 database
+- `build` - Instructs k2 to build the kraken2 DB from the available library files
   - `--kmer-len` - K-mer length in bp (default: 35).
   - `--minimizer-len` - Minimizer length in bp (default: 31)
-- `--clean` - Instructs kraken2-build to remove unneeded intermediate files.
+  - `--db` - Specifies the name of the directory for the kraken2 database
+- `clean` - Instructs k2 to remove unneeded intermediate files.
+  - `--db` - Specifies the name of the directory for the kraken2 database
 - `{$hostname}` - Specifies the name of the host organism used to uniquely identify the kraken2 database
 
 **Input Data:**
@@ -664,7 +559,7 @@ kraken2-build --clean --db kraken2-${hostname}-db/
 
 <br>
 
-#### 5b. Remove Host Reads
+#### 4b. Remove Host Reads
 
 ```bash
 kraken2 --db kraken2_${hostname}_db \
@@ -697,8 +592,8 @@ gzip sample1_R2_HostRm_GLlbsMetag.fastq
 
 **Input Data:**
 
-- kraken2_host_db/ (kraken2 host database directory, output from [Step 5a](#5a-build-kraken2-host-database))
-- sample_*decontam_GLlbsMetag.fastq.gz (filtered and trimmed sample reads with contaminants removed, output from [Step 4b](#4b-build-contaminant-index-and-map-reads))
+- kraken2_host_db/ (kraken2 host database directory, output from [Step 4a](#4a-build-kraken2-host-database))
+- sample_*decontam_GLlbsMetag.fastq.gz (filtered and trimmed sample reads with contaminants removed, output from [Step 3b](#3b-build-contaminant-index-and-map-reads))
 
 **Output Data:**
 
@@ -707,7 +602,7 @@ gzip sample1_R2_HostRm_GLlbsMetag.fastq
 - **sample_HostRm_GLlbsMetag.fastq.gz** (filtered and trimmed sample reads with contaminants, human, and host reads removed, gzipped fasta file)
 
 
-#### 5c. Compile Host Read Removal QC
+#### 4c. Compile Host Read Removal QC
 
 ```bash
 multiqc --zip-data-dir \ 
@@ -727,7 +622,7 @@ multiqc --zip-data-dir \
 
 **Input Data:**
 
-- /path/to/*kraken2-report.tsv (kraken2 report files, output from [Step 5b](#5b-remove-host-reads))
+- /path/to/*kraken2-report.tsv (kraken2 report files, output from [Step 4b](#4b-remove-host-reads))
 
 **Output Data:**
 
@@ -737,22 +632,25 @@ multiqc --zip-data-dir \
 <br>
 
 ---
-
-### 6. R Environment Setup
+ 
+### 5. R Environment Setup
 
 > Taxonomy bar plots, heatmaps and feature decontamination with decontam are performed in R.
 
-#### 6a. Load libraries
+#### 5a. Load libraries
 
 ```R
 library(decontam)
+library(glue)
+library(htmlwidgets)
+library(pavian)
+library(pheatmap)
 library(phyloseq)
+library(plotly)
 library(tidyverse)
-library(pheatmap)
-library(pavian)
 ```
 
-#### 6b. Define Custom Functions
+#### 5b. Define Custom Functions
 
 #### get_last_assignment()
 <details>
@@ -878,8 +776,6 @@ library(pavian)
   <summary>merge and process multiple kraken outputs to one species table</summary>
 
   ```R
-  library(pavian)
-
   merge_kraken_reports <- function(reports_dir) {
 
     reports <- read_reports(reports_dir)
@@ -907,8 +803,6 @@ library(pavian)
     # and convert table from dataframe to matrix
     species_names <- species_table[, "species"]
     rownames(species_table) <- species_names
-    species_table <- species_table[,-(which(colnames(species_table) == "species"))]
-    species_table <- as.matrix(species_table)
     
     return(species_table)
   }
@@ -928,7 +822,16 @@ library(pavian)
   ```R
   get_abundant_features <- function(mat, cpm_threshold = 1000){
   
-    features <- rowSums(mat) %>% sort()
+    # Filtered out unassigned functions
+    unassigned <- "UNMAPPED|UNGROUPED|UNINTEGRATED|Not annotated"
+    mat <- mat %>%
+      as.data.frame %>%
+      rownames_to_column("Feature") %>%
+      filter(str_detect(Feature, unassigned, negate = TRUE))
+    rownames(mat) <- mat$Feature
+    mat <- mat[, -1]
+
+    features <- rowSums(mat, na.rm = TRUE) %>% sort()
     
     abund_features <- features[features > cpm_threshold] %>% names
     
@@ -993,7 +896,7 @@ library(pavian)
                          filter(str_detect(Species, non_microbial, negate = TRUE))
     # Calculate species relative abundance
     clean_tab <- clean_tab_count %>%
-      mutate( across( where(is.numeric), function(x) (x/sum(x, na.rm = TRUE))*100 ) )
+      mutate(across(where(is.numeric), function(x) (x/sum(x, na.rm = TRUE))*100))
     # Set rownames as species name and drop species column
     rownames(clean_tab) <- clean_tab$Species
     clean_tab  <- clean_tab[, -1]
@@ -1046,7 +949,7 @@ library(pavian)
     }
     
     if(is.null(taxa_to_group)) {
-      message(glue::glue("Rare taxa were not grouped. please provide a higher 
+      message(glue("Rare taxa were not grouped. Please provide a higher 
                         threshold than {threshold} for grouping rare taxa, 
                         only numbers are allowed."))
       return(abund_table)
@@ -1088,34 +991,34 @@ library(pavian)
   ```R
   # Make bar plot
   make_plot <- function(abund_table, metadata, custom_palette, publication_format,
-                        samples_column="Sample_ID", prefix_to_remove="barcode"){
+                        samples_column="sample_id", prefix_to_remove="barcode"){
   
     abund_table_wide <- abund_table %>%
-        as.data.frame %>%
+        as.data.frame() %>%
         rownames_to_column(samples_column) %>%
         inner_join(metadata) %>%
         select(!!!colnames(metadata), everything()) %>%
         mutate(!!samples_column := !!sym(samples_column) %>% str_remove(prefix_to_remove))
         
       
-    abund_table_long <- abund_table_wide  %>%
-        pivot_longer(-colnames(metadata), 
+    abund_table_long <- abund_table_wide %>%
+        pivot_longer(-colnames(metadata),
                      names_to = "Species",
                      values_to = "relative_abundance")
       
-    p <- ggplot(abund_table_long, mapping = aes(x = !!sym(samples_column), 
+    p <- ggplot(abund_table_long, mapping = aes(x = !!sym(samples_column),
                                                 y = relative_abundance, fill = Species)) +
          geom_col() +
-         scale_fill_manual(values = custom_palette) + 
-         labs(x=NULL, y="Relative Abundance (%)") + 
+         scale_fill_manual(values = custom_palette) +
+         labs(x = NULL, y = "Relative Abundance (%)") +
          publication_format
-
+    
     return(p)
   }
   ```
 
   **Function Parameter Definitions:**
-  - `abund_table` - a relative bundance dataframe with rows summing to 100%
+  - `abund_table` - a relative abundance dataframe with rows summing to 100%
   - `metadata` - a metadata dataframe with samples as row and columns describing each sample
   - `custom_palette` - a vector of strings specifying a custom color palette for coloring plots
   - `publication_format` - a ggplot::theme object specifying a custom theme for plotting
@@ -1135,30 +1038,54 @@ library(pavian)
                            feature_column = "species", samples_column = "sample_id", group_column = "group", 
                            output_prefix, assay_suffix = "_GLlbsMetag",
                            publication_format, custom_palette) {
+    facet_by <- reformulate(group_column)
     # Prepare feature table
-    feature_table <- read_csv(feature_table_file)
+    feature_table <- read_delim(feature_table_file)
     rownames(feature_table) <- feature_table[[1]]
     feature_table <- feature_table[, -1]
 
+    number_of_species <- nrow(feature_table)
+
+    if (number_of_species > length(custom_palette)) {
+      N <- number_of_species / length(custom_palette)
+      custom_palette <- rep(custom_palette, times = N * 2)
+    }
+
     # Prepare metadata
-    metadata <- read_delim(metdata_file, delim = ",") %>% as.data.frame
+    metadata <- read_delim(metadata_table_file, delim = ",") %>% as.data.frame
     row.names(metadata) <- metadata[, samples_column]
 
     # compute abundances from counts
     abund_table <- count_to_rel_abundance(feature_table)
+
+    metadata <- metadata %>%
+                mutate(!!sym(group_column) := str_wrap(!!sym(group_column) %>%
+                         str_replace_all("_", " "), width = 10)
+                )
     
     # create plot
     p <- make_plot(abund_table, metadata, custom_palette, publication_format, samples_column) +
-         facet_wrap(~Description, nrow=1, scales = "free_x")
+         facet_wrap(facet_by, nrow = 1, scales = "free_x", labeller = label_wrap_gen(width = 10)) +
+         theme(axis.text.x = element_text(angle = 90))
 
+    static_plot <- p
     number_of_species <- p$data$Species %>% unique() %>% length()
-    # Don't save legend if the number of species to plot is gsreater than 30
+    # Don't save legend if the number of species to plot is greater than 30
     if(number_of_species > 30) {
-      p <- p + theme(legend.position = "none")
+      static_plot <- static_plot + theme(legend.position = "none")
     }
-
-    return(p)
-
+    
+    width <- 2 * nrow(metadata) # 2 inches per sample
+    if(width < 14) { width <- 14 } # Set minimum width to 14 inches
+    if(width > 50) { width <- 50 } # Cap plot width at 50 inches
+    # Save static plot
+    ggsave(filename = glue("{output_prefix}_barplot{assay_suffix}.png"), 
+           plot = static_plot,
+           device = 'png', width = width,
+           height = 10, units = 'in', dpi = 300 , limitsize = FALSE)
+
+    # Save interactive
+    htmlwidgets::saveWidget(ggplotly(p), glue("{output_prefix}_barplot{assay_suffix}.html"), selfcontained = TRUE)
   }
   ```
 
@@ -1176,11 +1103,11 @@ library(pavian)
   - `output_prefix` - a character string specifying the unique name to add to the output file names 
                       used to denote the data type/source, for example "unfiltered-kaiju_species"
   - `assay_suffix` - a character string specifying the GeneLab assay suffix (default: "_GLlbsMetag")
-  - `publication_format` - a ggplot::theme object specifying a custom theme for plotting, from [Step 6c](#8c-set-global-variables)
-  - `custom_palette` - a vector of strings specifying a custom color palette for coloring plots, from [Step 6c](#8c-set-global-variables)
-
-  **Returns:** a relative abundance stacked bar plot, `p`, as output from [make_plot](#make_plot)
+  - `publication_format` - a ggplot::theme object specifying a custom theme for plotting, from [Step 5c](#5c-set-global-variables)
+  - `custom_palette` - a vector of strings specifying a custom color palette for coloring plots, from [Step 5c](#5c-set-global-variables)
 
+  **Output Data:** two bar plot files, `{output_prefix}_barplot{assay_suffix}.png` (static) and `{output_prefix}_barplot{assay_suffix}.html` (interactive), each containing the relative abundance stacked bar plot produced by [make_plot](#make_plot)
+  
 </details>
 
 #### make_heatmap()
@@ -1188,18 +1115,38 @@ library(pavian)
   <summary>Creates heatmaps from a feature table file</summary>
   
   ```R
-  make_heatmap <- function(metadata, feature_table, 
+  make_heatmap <- function(metadata_table_file, feature_table_file, 
                            samples_column = "sample_id", group_column = "group", 
                            output_prefix, assay_suffix = "_GLlbsMetag",
                            custom_palette) {
+    # Prepare feature table
+    feature_table <- read_delim(feature_table_file) %>%  as.data.frame()
+    rownames(feature_table) <- feature_table[[1]]
+    feature_table <- feature_table[,-1] %>% as.matrix()
+    colnames(feature_table) <-  colnames(feature_table) %>% str_remove_all("barcode")
+
+    # Prepare metadata
+    metadata <- read_delim(metadata_table_file) %>% as.data.frame()
+    row.names(metadata) <- metadata[,samples_column] %>% str_remove_all("barcode")
+
+    # Get common samples and re-arrange feature table and metadata
+    common_samples <- intersect(colnames(feature_table), rownames(metadata))
+    feature_table <- feature_table[, common_samples]
+    metadata <- metadata[common_samples,]
+    metadata <- metadata %>% arrange(!!sym(group_column))
+
     # Create column annotation
     col_annotation <- as.data.frame(metadata)[, group_column, drop = FALSE]
 
     # Calculate output plot width and height
     number_of_samples <- ncol(feature_table)
     width <- 1 * number_of_samples
+    if (width < 10) { width <- 10} # Set the minimum width to 10 inches
+    if (width > 100) { width <- 100} # Set the maximum width to 100 inches
     number_of_features <- nrow(feature_table)
     height <- 0.2 * number_of_features
+    if (height < 10) { height <- 10 } # Set the minimum height to 10 inches
+    if (height > 100) { height <- 100 } # Set the maximum height to 100 inches (highest that won't generate an error)
 
     # Set colors by group
     groups <- metadata[[group_column]] %>%  unique()
@@ -1223,20 +1170,42 @@ library(pavian)
              annotation_colors = annotation_colors,
              number_format = "%.0f")
     dev.off()
+
+    sorted_features <- rowSums(feature_table) %>% sort(decreasing = TRUE)
+
+    # Plot only top 50 features as it is often difficult to visualize all features at once
+    if(length(sorted_features) >= 50) { 
+      top50 <- sorted_features[1:50]
+
+      png(filename = glue("{output_prefix}_top_50_heatmap{assay_suffix}.png"), width = width,
+          height = 12, units = "in", res=300)
+      pheatmap(mat = feature_table[names(top50), rownames(col_annotation)],
+               cluster_cols = FALSE, 
+               cluster_rows = FALSE,
+               col = colorRampPalette(c('white','red'))(255), 
+               angle_col = 90, 
+               display_numbers = TRUE, 
+               fontsize = 12, 
+               annotation_col = col_annotation,
+               annotation_colors = annotation_colors,
+               number_format = "%.0f")
+      dev.off()
+    }
   }
   ```
 
   **Function Parameter Definitions:**
-  - `metadata_file` - a dataframe with samples as rows and columns describing each sample
-  - `feature_table` - a dataframe of features with species/functions as the first column and samples as other columns.
+  - `metadata_table_file` - path to a file with samples as rows and columns describing each sample
+  - `feature_table_file` - path to a tab-separated feature table, i.e. a species/functions 
+                           table with species/functions as the first column and samples as the remaining columns.
   - `samples_column` - a character string specifying the column in `metadata` holding sample names, default: "sample_id"
   - `group_column` - a character string specifying the column in `metadata` used to facet/group plots, default: "group"
   - `output_prefix` - a character string specifying the unique name to add to the output file names 
                       used to denote the data type/source, for example "unfiltered-kaiju_species"
   - `assay_suffix` - a character string specifying the GeneLab assay suffix (default: "_GLlbsMetag")
-  - `custom_palette` - a vector of strings specifying a custom color palette for coloring plots, from [Step 6c](#8c-set-global-variables)
+  - `custom_palette` - a vector of strings specifying a custom color palette for coloring plots, from [Step 5c](#5c-set-global-variables)
 
-  **Output Data:** heatmap png file, `{output_prefix}_heatmap{assay_suffix}.png`, of species/functions across samples from the input feature table
+  **Output Data:** 2 heatmap png files, `{output_prefix}_heatmap{assay_suffix}.png` and `{output_prefix}_top_50_heatmap{assay_suffix}.png`, of species/functions across samples from the input feature table
   
 </details>
 
@@ -1245,8 +1214,8 @@ library(pavian)
   <summary>Feature table decontamination with decontam</summary>
 
   ```R
-  run_decontam <- function(feature_table, metadata, contam_threshold=0.1, 
-                           prev_col = NULL, freq_col = NULL, ntc_name = "TRUE") {
+  run_decontam <- function(feature_table, metadata, contam_threshold=0.5, 
+                           prev_col = NULL, freq_col = NULL, ntc_name = "true") {
 
     # retain metadata for only the samples present in the input feature table
     sub_metadata <- metadata[colnames(feature_table), ]
@@ -1264,6 +1233,7 @@ library(pavian)
           )
         )
       sub_metadata[, freq_col] <- as.numeric(sub_metadata[, freq_col])
+      sub_metadata[, prev_col] <- tolower(sub_metadata[, prev_col])
 
     }
 
@@ -1312,37 +1282,40 @@ library(pavian)
 
 </details>
 
-#### feature_decontam() 
+#### feature_decontam()
 <details>
-  <summary>decontaminate a feature table</summary>
-  
-  ```R
-  library(tidyverse)
-  library(glue)
+  <summary>decontaminate a feature table using the decontam R package, which statistically identifies contaminating features</summary>
 
+  ```R
   feature_decontam <- function(metadata_file, feature_table_file, 
                                feature_column = "Species", samples_column = "sample_id",
-                               prevalence_column = "NTC", ntc_name = "TRUE", 
+                               prevalence_column = "NTC", ntc_name = "true", 
                                frequency_column = "concentration", 
-                               threshold = 0.1, classification_method, 
+                               threshold = 0.5, classification_method, 
                                output_prefix, assay_suffix = "_GLlbsMetag") {
     # Prepare feature table
-    feature_table <- read_csv(feature_table_file) %>%  as.data.frame
+    feature_table <- read_delim(feature_table_file) %>%  as.data.frame
     rownames(feature_table) <- feature_table[[1]]
     feature_table <- feature_table[, -1]  %>% as.matrix()
 
     # Prepare metadata
-    metadata <- read_csv(metadata_file) %>% as.data.frame
+    metadata <- read_delim(metadata_file) %>% as.data.frame
     row.names(metadata) <- metadata[, samples_column]
 
     # Run decontam
+    # Use the supplied prevalence and frequency columns, or NULL if the values in a supplied column aren't unique
+    prev_col <- if(length(unique(metadata[, prevalence_column])) == 1) NULL else prevalence_column
+    freq_col <- if(length(unique(metadata[, frequency_column])) == 1) NULL else frequency_column
     contamdf <- run_decontam(feature_table, metadata, threshold, prev_col, freq_col, ntc_name) 
 
     contamdf <- as.data.frame(contamdf) %>% rownames_to_column(feature_column)
 
+    type <- 'species'
+    if (classification_method == 'gene-function') { type <- "KO" }
+
     # Write decontaminated feature table and decontam's primary results
-    outfile <- glue("{output_prefix}{classification_method}_decontam_results{assay_suffix}.csv")
-    write_csv(x = contamdf, file = outfile)
+    outfile <- glue("{output_prefix}_decontam_results{assay_suffix}.tsv")
+    write_tsv(x = contamdf, file = outfile)
 
     # Get the list of contaminants identified by decontam
     contaminants <- contamdf %>%
@@ -1364,8 +1337,8 @@ library(pavian)
       rownames(decontaminated_table) <- decontaminated_table[[feature_column]]
       decontaminated_table <- decontaminated_table[,-1] %>% as.matrix
 
-      outfile <- glue("{output_prefix}{classification_method}_decontam_species_table{assay_suffix}.csv")
-      write_csv(x = decontaminated_table, file = outfile)
+      outfile <- glue("{output_prefix}_decontam_{type}_table{assay_suffix}.tsv")
+      write_tsv(x = decontaminated_table, file = outfile)
 
       return(decontaminated_table)
 
@@ -1388,15 +1361,15 @@ library(pavian)
   - `frequency_column` - a character string specifying the column in `metadata` to use for frequency based analysis, default: "concentration"
   - `prevalence_column` - a character string specifying the column in `metadata` to use for prevalence based analysis, default: "NTC"
   - `ntc_name` - a character string specifying the value in the prevalence column for all negative template control samples, default: "TRUE"
-  - `threshold` - a number between 0 and 1 specfying the decontam threshold for both prevalence and frequency based analyses. default: 0.1
+  - `threshold` - a number between 0 and 1 specifying the decontam threshold for both prevalence and frequency based analyses. default: 0.5
   - `output_prefix` - a character string specifying the unique name to add to the output file names 
                       used to denote the data type/source, for example "unfiltered-kaiju_species"
   - `classification_method` - a character string specifying the tool used to generate the classifications ['kaiju', 'kraken2', 'metaphlan', 'contig-taxonomy', 'gene-taxonomy', 'gene-function']
   - `assay_suffix` - a character string specifying the GeneLab assay suffix (default: "_GLlbsMetag")
 
   **Output Data:**
-  - {classification_method}_decontam_species_table_GLlbsMetag.csv - decontaminated feature table file
-  - {classification_method}_decontam_results_GLlbsMetag.csv - Decontam results file
+  - {output_prefix}_decontam_{species|KO}_table_GLlbsMetag.tsv - decontaminated feature table file
+  - {output_prefix}_decontam_results_GLlbsMetag.tsv - Decontam results file
 
   **Returns:** dataframe, `decontaminated_table`, containing the decontaminated feature table
 </details>
@@ -1441,7 +1414,7 @@ library(pavian)
   <summary>clean taxonomy names</summary>
 
   ```R
-  fix_names<- function(taxonomy,stringToReplace="Othe",suffix=";Other"){
+  fix_names<- function(taxonomy,stringToReplace="Other",suffix=";_"){
     
     for(index in seq_along(stringToReplace)){
 
@@ -1449,7 +1422,7 @@ library(pavian)
         # Get the row indices of the current taxonomy columns
         # with rows matching the sting in `stringToReplace`
         indices <- grep(x = taxonomy[,taxa_index], pattern = stringToReplace)
-        # Replace the value in that row with the value in the adjacent cell concated with `suffix`
+        # Replace the value in that row with the value in the adjacent cell concatenated with `suffix`
         taxonomy[indices,taxa_index] <-
           paste0(taxonomy[indices,taxa_index-1],
                 rep(x = suffix, times=length(indices)))
@@ -1469,25 +1442,24 @@ library(pavian)
 
 </details>
 
-#### read_assembly_coverage_table()
+#### read_taxonomy_table()
 <details>
   <summary>Read Assembly-based coverage annotation table</summary>
 
   ```R
-  read_assembly_coverage_table <- function(file_name, sample_names){
+  read_taxonomy_table <- function(df, sample_names){
   
-    df <- read_delim(file = file_name, delim = "\t", comment = "#")
-
-    # Subset taxoxnomy portion (domain:species) of input table
+    # Subset taxonomy portion (domain:species) of input table
     # and replace empty/Na domain assignments with "Unclassified"
     taxonomy_table <- df %>%
       select(domain:species) %>%
       mutate(domain=replace_na(domain, "Unclassified"))
     
     # Subset count table
-    counts_table <- df %>% select(!!any_of(sample_names))
+    sample_names <- get_samples(df, sample_names)
+    counts_table <- df %>% select(!!sample_names)
 
-    # Mutate taxonomy mames
+    # Mutate taxonomy names
     taxonomy_table  <- process_taxonomy(taxonomy_table)
     taxonomy_table <- fix_names(taxonomy_table, "Other", ";_")
 
@@ -1503,38 +1475,41 @@ library(pavian)
   [fix_names()](#fix_names)
 
   **Function Parameter Definitions:**
-  - `file_name` - path to contig taxonomy assignment file to be read
-  - `sample_names` - string of samples names to keep in the final dataframe
+  - `df` - dataframe containing assembly-based coverage
+  - `sample_names` - a character vector of sample names to keep in the final dataframe
 
   **Returns:** dataframe, `df`, containing cleaned taxonomy names and sample species count
 
 </details>
 
-#### get_sample_names()
+#### get_samples()
 <details>
   <summary>retrieve sample names for which assemblies were generated</summary>
 
   ```R
-  get_sample_names <- function (assembly_summary) {
-    # Read in table and drop columns were all rows are NA
-    overview_table <-  read_delim(file = assembly_summary, delim = "\t", comment = "#") %>%
-                        select(where( ~all(!is.na(.)) )) 
-
-    col_names <- names(overview_table) %>% str_remove_all("-assembly")
-    sample_order <- col_names[-1] %>% sort()
-
-    return(sample_order)
+  get_samples <- function(assembly_table_df, sample_names, end_col='species') {
+    # Get common samples 
+    cols <- colnames(assembly_table_df)
+    index <- grep(end_col, cols)
+    start <- index + 1
+    end <- length(cols) # sample columns are assumed to run from after `end_col` to the last column
+    df_samples <- cols[start:end]
+    sample_names <- intersect(df_samples, sample_names)
+
+    return(sample_names)
   }
   ```
 
   **Function Parameter Definitions:**
-  - `assembly_summary` - path to assembly summary file
+  - `assembly_table_df` - dataframe containing assembly-based coverage
+  - `sample_names` - a character vector of sample names to keep in the final dataframe
+  - `end_col` - a character string specifying the name of the last column before the sample columns, default: "species"
 
-  **Returns:** a character vector, `sample_order`, of sorted sample names
+  **Returns:** a character vector, `sample_names`, of sample names that appear in both the assembly dataframe and the sample_names list
 
 </details>
 
-#### 6c. Set global variables
+#### 5c. Set global variables
 
 ```R
 # Define custom theme for plotting
@@ -1580,9 +1555,9 @@ custom_palette <- custom_palette[-c(21:23,
 ## Read-based Processing
 
 
-### 7. Taxonomic Profiling Using Kaiju
+### 6. Taxonomic Profiling Using Kaiju
 
-#### 7a. Build Kaiju Database
+#### 6a. Build Kaiju Database
 
 ```bash
 # Make a directory that will hold the downloaded kaiju database
@@ -1613,7 +1588,7 @@ rm nr_euk/kaiju_db_nr_euk.bwt nr_euk/kaiju_db_nr_euk.sa
 - kaiju-db/merged.dmp (merged taxonomy IDs file from the NCBI Taxonomy database that maps deprecated taxonomic IDs to current ones)
 
 
-#### 7b. Kaiju Taxonomic Classification
+#### 6b. Kaiju Taxonomic Classification
 
 ```bash
 kaiju -f kaiju-db/nr_euk/kaiju_db_nr_euk.fmi \
@@ -1637,18 +1612,18 @@ kaiju -f kaiju-db/nr_euk/kaiju_db_nr_euk.fmi \
 
 **Input Data:**
 
-- kaiju-db/nr_euk/kaiju_db_nr_euk.fmi (FM-index file containing the main Kaiju database index, output from [Step 7a](#7a-build-kaiju-database))
-- kaiju-db/nodes.dmp (kaiju taxonomy hierarchy nodes file, output from [Step 7a](#7a-build-kaiju-database))
+- kaiju-db/nr_euk/kaiju_db_nr_euk.fmi (FM-index file containing the main Kaiju database index, output from [Step 6a](#6a-build-kaiju-database))
+- kaiju-db/nodes.dmp (kaiju taxonomy hierarchy nodes file, output from [Step 6a](#6a-build-kaiju-database))
 - *_R[12]_decontam.fastq.gz or *_R[12]_HostRm.fastq.gz (filtered and trimmed sample reads with both 
     contaminants and human reads (and, optionally, host reads) removed, gzipped fasta file, 
-    output from [Step 4b](#4b-build-contaminant-index-and-map-reads) or [Step 5b](#5b-remove-host-reads))
+    output from [Step 3b](#3b-build-contaminant-index-and-map-reads) or [Step 4b](#4b-remove-host-reads))
 
 
 **Output Data:**
 
 - sample_kaiju.out (kaiju output file)
 
-#### 7c. Compile Kaiju Taxonomy Results
+#### 6c. Compile Kaiju Taxonomy Results
 
 ```bash
 # Merge kaiju reports to one table at the species level 
@@ -1675,15 +1650,15 @@ sed -i -E 's/file/sample/' merged_kaiju_table.tsv
 
 **Input Data:**
 
-- kaiju-db/nodes.dmp (kaiju taxonomy hierarchy nodes file, output from [Step 7a](#7a-build-kaiju-database))
-- kaiju-db/names.dmp (kaiju taxonomy names file, output from [Step 7a](#7a-build-kaiju-database))
-- *kaiju.out (kaiju output files, output from [Step 7b](#7b-kaiju-taxonomic-classification))
+- kaiju-db/nodes.dmp (kaiju taxonomy hierarchy nodes file, output from [Step 6a](#6a-build-kaiju-database))
+- kaiju-db/names.dmp (kaiju taxonomy names file, output from [Step 6a](#6a-build-kaiju-database))
+- *kaiju.out (kaiju output files, output from [Step 6b](#6b-kaiju-taxonomic-classification))
 
 **Output Data:**
 
 - merged_kaiju_table.tsv (compiled kaiju summary table at the species level)
 
-#### 7d. Convert Kaiju Output To Krona Format
+#### 6d. Convert Kaiju Output To Krona Format
 
 ```bash
 kaiju2krona -u \
@@ -1702,15 +1677,15 @@ kaiju2krona -u \
 - `-o` - Specifies the name of krona formatted kaiju output file.
 
 **Input Data:**
-- kaiju-db/names.dmp (kaiju taxonomy names file, output from [Step 7a](#7a-build-kaiju-database))
-- kaiju-db/nodes.dmp (kaiju taxonomy hierarchy nodes file, output from [Step 7a](#7a-build-kaiju-database))
-- sample_kaiju.out (kaiju output file, output from [Step 7b](#7b-kaiju-taxonomic-classification))
+- kaiju-db/names.dmp (kaiju taxonomy names file, output from [Step 6a](#6a-build-kaiju-database))
+- kaiju-db/nodes.dmp (kaiju taxonomy hierarchy nodes file, output from [Step 6a](#6a-build-kaiju-database))
+- sample_kaiju.out (kaiju output file, output from [Step 6b](#6b-kaiju-taxonomic-classification))
 
 **Output Data:**
 
 - sample.krona (krona formatted kaiju output)
 
-#### 7e. Compile Kaiju Krona Reports
+#### 6e. Compile Kaiju Krona Reports
 
 ```bash
 # Create a file containing a sorted list of all .krona files 
@@ -1752,23 +1727,22 @@ ktImportText  -o kaiju-report.html ${KTEXT_FILES[*]}
 
 **Input Data:**
 
-- *.krona (all sample .krona formatted files, output from [Step 7d](#7d-convert-kaiju-output-to-krona-format)) 
+- *.krona (all sample .krona formatted files, output from [Step 6d](#6d-convert-kaiju-output-to-krona-format)) 
              
 **Output Data:**
 
 - krona_files.txt (sorted list of all *.krona files)
 - sample_names.txt (sorted list of all sample names)
-- **kaiju-report_GllbsMetag.html** (compiled krona html report containing all samples)
+- **kaiju-report_GLlbsMetag.html** (compiled krona html report containing all samples)
 
-#### 7f. Create Kaiju Species Count Table
+#### 6f. Create Kaiju Species Count Table
 
 ```R
-library(tidyverse)
 feature_table <- process_kaiju_table(file_path="merged_kaiju_table_GLlbsMetag.tsv")
 table2write <- feature_table  %>%
-                as.data.frame() %>%
-                rownames_to_column("Species")
-write_csv(x = table2write, file = "kaiju_species_table_GLlbsMetag.csv")
+               as.data.frame() %>%
+               rownames_to_column("Species")
+write_tsv(x = table2write, file = "kaiju_species_table_GLlbsMetag.tsv")
 ```
 
 **Custom Functions Used:**
@@ -1782,27 +1756,27 @@ write_csv(x = table2write, file = "kaiju_species_table_GLlbsMetag.csv")
 
 **Input Data:**
 
-- merged_kaiju_table_GLlbsMetag.tsv (compiled kaiju table at the species taxon level, from [Step 7c](#7c-compile-kaiju-taxonomy-results))
+- merged_kaiju_table_GLlbsMetag.tsv (compiled kaiju table at the species taxon level, from [Step 6c](#6c-compile-kaiju-taxonomy-results))
 
 **Output Data:**
 
-- **kaiju_species_table_GLlbsMetag.csv** (kaiju species count table in csv format)
+- **kaiju_species_table_GLlbsMetag.tsv** (kaiju species count table in tsv format)
 
 
-#### 7g. Filter Kaiju Species Count Table
+#### 6g. Filter Kaiju Species Count Table
 
 ```R
-library(tidyverse)
-
-input_file <- "kaiju_species_table_GLlbsMetag.csv"
-output_file <- "kaiju_filtered_species_table_GLlbsMetag.csv"
+feature_table_file <- "kaiju_species_table_GLlbsMetag.tsv"
+output_file <- "kaiju_filtered_species_table_GLlbsMetag.tsv"
 threshold <- 0.5
 
 # string used to define non-microbial taxa
-non_microbial <- "UNCLASSIFIED|Unclassifed|unclassified|Homo sapien|cannot|uncultured|unidentified"
+non_microbial <- "UNCLASSIFIED|Unclassified|unclassified|Homo sapien|cannot|uncultured|unidentified"
 
 # read in feature table
-feature_table <- read_csv(input_file) %>% as.data.frame()
+feature_table <- read_delim(feature_table_file) %>%
+                 mutate(across(where(is.numeric), function(col) replace_na(col, 0))) %>%
+                 as.data.frame()
 feature_name <- colnames(feature_table)[1]
 rownames(feature_table) <- feature_table[,1]
 feature_table <- feature_table[, -1]
@@ -1819,7 +1793,7 @@ table2write <- group_low_abund_taxa(abund_table, threshold = threshold)  %>%
   t %>% as.data.frame %>%
   rownames_to_column(feature_name)
 
-write_csv(x = table2write, file = output_file)
+write_tsv(x = table2write, file = output_file)
 ```
 
 **Custom Functions Used:**
@@ -1833,49 +1807,31 @@ write_csv(x = table2write, file = output_file)
 
 **Input Data:**
 
-- kaiju_species_table_GLlbsMetag.csv (path to kaiju species table from [Step 7f](#7f-create-kaiju-species-count-table))
+- kaiju_species_table_GLlbsMetag.tsv (path to kaiju species table from [Step 6f](#6f-create-kaiju-species-count-table))
 
 **Output Data:**
 
-- **kaiju_filtered_species_table_GLlbsMetag.csv** (a file containing the filtered species table)
+- **kaiju_filtered_species_table_GLlbsMetag.tsv** (a file containing the filtered species table)
 
 ---
 
-#### 7h. Taxonomy Barplots
+#### 6h. Kaiju Taxonomy Barplots
 
 ```R
-library(tidyverse)
-
-species_table_file <- "kaiju_species_table_GLlbsMetag.csv"
-filtered_species_table_file <- "kaiju_filtered_species_table_GLlbsMetag.csv"
+species_table_file <- "kaiju_species_table_GLlbsMetag.tsv"
+filtered_species_table_file <- "kaiju_filtered_species_table_GLlbsMetag.tsv"
 metadata_file <- "/path/to/sample/metadata"
-number_samples <- 10 
-
-# set width based on number of samples, with a cap at 50 inches
-plot_width <- 2 * number_samples
-if(plot_width > 50) { plot_width = 50 }
 
-p <- make_barplot(metadata_file = metadata_file, feature_table_file = species_table_file, 
-                  feature_column = "Species", samples_column = "sample_id", group_column = "group",
-                  publication_format = publication_format, custom_palette = custom_palette)
-
-ggsave(filename = "kaiju_unfiltered_species_barplot_GLlbsMetag.png", plot = p,
-       device = "png", width = plot_width, height = 10, units = "in", dpi = 300, limitsize = FALSE)
+make_barplot(metadata_file = metadata_file, feature_table_file = species_table_file, 
+             feature_column = "Species", samples_column = "sample_id", group_column = "group",
+             output_prefix = "kaiju_unfiltered_species", assay_suffix = "_GLlbsMetag",
+             publication_format = publication_format, custom_palette = custom_palette)
 
 # Save static unfiltered plot
-p <- make_barplot(metadata_file = metadata_file, feature_table_file = filtered_species_table_file, 
-                  feature_column = "Species", samples_column = "sample_id", group_column = "group",
-                  publication_format = publication_format, custom_palette = custom_palette)
-
-# Save interactive unfilterted plot
-htmlwidgets::saveWidget(ggplotly(p), glue("kaiju_unfiltered_species_barplot_GLlbsMetag.html"), selfcontained = TRUE)
-
-# Save static filtered plot
-ggsave(filename = glue("kaiju_unfiltered_species_barplot_GLlbsMetag.png"), plot = p,
-      device = 'png', width = plot_width, height = 10, units = 'in', dpi = 300, limitsize = FALSE)
-
-# Save interactive filtered plot
-htmlwidgets::saveWidget(ggplotly(p), glue("kaiju_filtered_species_barplot_GLlbsMetag.html"), selfcontained = TRUE)
+make_barplot(metadata_file = metadata_file, feature_table_file = filtered_species_table_file, 
+             feature_column = "Species", samples_column = "sample_id", group_column = "group",
+             output_prefix = "kaiju_filtered_species", assay_suffix = "_GLlbsMetag",
+             publication_format = publication_format, custom_palette = custom_palette)
 ```
 
 **Custom Functions Used:**
@@ -1886,13 +1842,12 @@ htmlwidgets::saveWidget(ggplotly(p), glue("kaiju_filtered_species_barplot_GLlbsM
 - `species_table_file` - a file containing the species count table
 - `filtered_species_table_file` - a file containing the filtered species count table
 - `metadata_file` - a file containing group information for each sample in the species count files
-- `number_samples` - the total number of samples in the species count files, adjust based on input files.
 
 **Input Data:**
 
-- `kaiju_species_table_GLlbsMetag.csv` (a file containing the species count table, output from [Step 7f](#7f-create-kaiju-species-count-table))
-- `kaiju_filtered_species_table_GLlbsMetag.csv` (a file containing the filtered species count table, output from [Step 7g](#7g-filter-kaiju-species-count-table))
-- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping samplenames to group metadata)
+- `kaiju_species_table_GLlbsMetag.tsv` (a file containing the species count table, output from [Step 6f](#6f-create-kaiju-species-count-table))
+- `kaiju_filtered_species_table_GLlbsMetag.tsv` (a file containing the filtered species count table, output from [Step 6g](#6g-filter-kaiju-species-count-table))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
 
 
 **Output Data:**
@@ -1903,46 +1858,31 @@ htmlwidgets::saveWidget(ggplotly(p), glue("kaiju_filtered_species_barplot_GLlbsM
 - **kaiju_filtered_species_barplot_GLlbsMetag.html** (interactive taxonomy barplot after filtering rare and non-microbial taxa)
 
 
-#### 7i. Feature Decontamination
+#### 6i. Kaiju Feature Decontamination
 
-> Feature (species) decontamination with decontam. Decontam is an R package that statistically identifies contaminating features in a feature table
+> Note: The decontaminated species table and barplots are only generated if one or more contaminants are detected.
 
 ```R
-library(tidyverse)
-library(decontam)
-library(phyloseq)
-
-feature_table_file <- "filtered-kaiju_species_table_GLlbsMetag.csv"
+feature_table_file <- "kaiju_filtered_species_table_GLlbsMetag.tsv"
 metadata_table <- "/path/to/sample/metadata"
-ntc_name <- "name_of_ntc_sample"
-
-# set width based on number of samples, with a cap at 50 inches
-plot_width <- 2 * number_samples
-if(plot_width > 50) { plot_width = 50 }
 
 decontaminated_table <- feature_decontam(metadata_file = metadata_table, 
                                          feature_table_file = feature_table_file, 
                                          feature_column = "species", 
                                          samples_column = "sample_id",
                                          prevalence_column = "NTC", 
-                                         ntc_name = "TRUE", 
+                                         ntc_name = "true", 
                                          frequency_column = "concentration", 
-                                         threshold = 0.1, 
+                                         threshold = 0.5, 
                                          classification_method = "kaiju", 
-                                         output_prefix = "", 
+                                         output_prefix = "kaiju", 
                                          assay_suffix = "_GLlbsMetag")
 
-# Convert count matrix to relative abundance matrix
-decontaminated_species_table <- count_to_rel_abundance(decontaminated_table)
+make_barplot(metadata_file = metadata_table, feature_table_file = "kaiju_decontam_species_table_GLlbsMetag.tsv", 
+             feature_column = "Species", samples_column = "sample_id", group_column = "group",
+             output_prefix = "kaiju_decontam_species", assay_suffix = "_GLlbsMetag",
+             publication_format = publication_format, custom_palette = custom_palette)
 
-# Make plot after filtering out contaminants
-p <- make_plot(decontaminated_species_table, metadata, custom_palette, publication_format)
-
-ggsave(filename = "kaiju_decontam_species_barplot_GLlbsMetag.png", plot = p,
-         device = "png", width = plot_width, height = 10, units = "in", dpi = 300)
-
-# Save interactive filtered plot
-htmlwidgets::saveWidget(ggplotly(p), glue("kaiju_decontam_species_barplot_GLlbsMetag.html"), selfcontained = TRUE)
 ```
 
 **Custom Functions Used:**
@@ -1955,27 +1895,26 @@ htmlwidgets::saveWidget(ggplotly(p), glue("kaiju_decontam_species_barplot_GLlbsM
 - `metadata_table` - path to a file with samples as rows and columns describing each sample
 - `feature_table_file` - path to a tab separated samples feature table i.e. species/functions 
                          table with species/functions as the first column and samples as other columns.
-- `number_samples` - the total number of samples in the species count files, adjust based on number of input samples in the feature_table_file
 
 **Input Data:**
 
-- `kaiju_filtered_species_table_GLlbsMetag.csv`(path to filtered species count per sample, output from [Step 7g](#7g-filter-kaiju-species-count-table))
-- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping samplenames to group metadata)
+- `kaiju_filtered_species_table_GLlbsMetag.tsv` (path to filtered species count per sample, output from [Step 6g](#6g-filter-kaiju-species-count-table))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
 
 **Output Data:**
 
-- **kaiju_decontam_results_GLlbsMetag.csv** (decontam's result table, output from [feature_decontam() function](#feature_decontam))
-- **kaiju_decontam_species_table_GLlbsMetag.csv** (decontaminated species table, output from [feature_decontam() function](#feature_decontam))
-- kaiju_decontam_species_barplot_GLlbsMetag.png (barplot after filtering out contaminants)
-- **kaiju_decontam_species_barplot_GLlbsMetag.html** (barplot after filtering out contaminants)
+- **kaiju_decontam_results_GLlbsMetag.tsv** (decontam's result table, output from [feature_decontam()](#feature_decontam))
+- **kaiju_decontam_species_table_GLlbsMetag.tsv** (decontaminated species table, output from [feature_decontam()](#feature_decontam))
+- kaiju_decontam_species_barplot_GLlbsMetag.png (barplot after filtering out contaminants, output from [make_barplot()](#make_barplot))
+- **kaiju_decontam_species_barplot_GLlbsMetag.html** (barplot after filtering out contaminants, output from [make_barplot()](#make_barplot))
 
 <br>
 
 ---
 
-### 8. Taxonomic Profiling Using Kraken2
+### 7. Taxonomic Profiling Using Kraken2
 
-#### 8a. Download Kraken2 Database
+#### 7a. Download Kraken2 Database
 
 ```bash 
 ## Download all microbial (including eukaryotes) - https://benlangmead.github.io/aws-indexes/k2
@@ -1989,8 +1928,8 @@ INSPECT_URL=https://genome-idx.s3.amazonaws.com/kraken/pluspfp_20250714/inspect.
 wget ${INSPECT_URL}
 
 # Library report
-LIRARY_REPORT_URL=https://genome-idx.s3.amazonaws.com/kraken/pluspfp_20250714/library_report.tsv
-wget ${LIRARY_REPORT_URL}
+LIBRARY_REPORT_URL=https://genome-idx.s3.amazonaws.com/kraken/pluspfp_20250714/library_report.tsv
+wget ${LIBRARY_REPORT_URL}
 
 # Md5sums
 MD5_URL=https://genome-idx.s3.amazonaws.com/kraken/pluspfp_20250714/pluspfp.md5 
@@ -2009,7 +1948,7 @@ tar -xvzf k2_pluspfp.tar.gz
 - `--timeout=3600` - Specifies the network timeout in seconds.
 - `--tries=0` - Retry download infinitely.
 - `--continue` -  Continue getting a partially-downloaded file.
-- `*_URL` - Position arguement specifying the url to download a particular resource from.
+- `*_URL` - Position argument specifying the url to download a particular resource from.
 
 *tar*
 - `-xvzf` - unpack the specified *tar.gz archive in verbose mode
@@ -2017,7 +1956,7 @@ tar -xvzf k2_pluspfp.tar.gz
 **Input Data:**
 
 - `INSPECT_URL=` (url specifying the location of kraken2 inspect file)
-- `LIRARY_REPORT_URL=` (url specifying the location of kraken2 library report file)
+- `LIBRARY_REPORT_URL=` (url specifying the location of kraken2 library report file)
 - `MD5_URL=` (url specifying the location of the md5 file of the kraken database)
 - `DB_URL=` (url specifying the location of the main kraken database archive in .tar.gz format)
 
@@ -2025,7 +1964,7 @@ tar -xvzf k2_pluspfp.tar.gz
 
 - kraken2-db/  (a directory containing kraken2 database files)
 
-#### 10b. Kraken2 Taxonomic Classification
+#### 7b. Kraken2 Taxonomic Classification
 
 ```bash
 kraken2 --db kraken2-db/ \
@@ -2051,10 +1990,10 @@ kraken2 --db kraken2-db/ \
 
 **Input Data:**
 
-- kraken2-db/ (a directory containing kraken2 database files, output from [Step 10a](#10a-download-kraken2-database))
+- kraken2-db/ (a directory containing kraken2 database files, output from [Step 7a](#7a-download-kraken2-database))
 - *_R[12]_decontam.fastq.gz or *_R[12]_HostRm.fastq.gz (filtered and trimmed sample reads with both 
     contaminants and human reads (and, optionally, host reads) removed, gzipped fasta file, 
-    output from [Step 4b](#4b-build-contaminant-index-and-map-reads or [Step 5b](#5b-remove-host-reads))
+    output from [Step 3b](#3b-build-contaminant-index-and-map-reads) or [Step 4b](#4b-remove-host-reads))
 
 
 **Output Data:**
@@ -2063,13 +2002,13 @@ kraken2 --db kraken2-db/ \
 - sample-kraken2-report.tsv (kraken2 report output file (one line per taxa, with number of reads assigned to it))
 
 
-#### 8c. Compile Kraken2 Taxonomy Results
+#### 7c. Compile Kraken2 Taxonomy Results
 
-##### 8ci. Create Merged Kraken2 Taxonomy Table
+##### 7ci. Create Merged Kraken2 Taxonomy Table
 
 ```R
 species_table <- merge_kraken_reports(reports-dir='/path/to/kraken2/reports')
-write_csv(x = species_table, file = "kraken2_species_table_GLlbsMetag.csv")
+write_tsv(x = species_table, file = "kraken2_species_table_GLlbsMetag.tsv")
 ```
 
 **Custom Functions Used:**
@@ -2083,13 +2022,13 @@ write_csv(x = species_table, file = "kraken2_species_table_GLlbsMetag.csv")
 
 **Input Data:**
 
-- \*-kraken2-report.tsv (kraken report from each sample to compile, outputs from [Step 8b](#8b-taxonomic-classification))
+- \*-kraken2-report.tsv (kraken report from each sample to compile, outputs from [Step 7b](#7b-kraken2-taxonomic-classification))
 
 **Output Data:**
 
-- **kraken2_species_table_GLlbsMetag.csv** (kraken species count table in csv format)
+- **kraken2_species_table_GLlbsMetag.tsv** (kraken species count table in tsv format)
 
-##### 8cii. Compile Kraken2 Taxonomy Reports
+##### 7cii. Compile Kraken2 Taxonomy Reports
 
 ```bash
 multiqc --zip-data-dir \ 
@@ -2109,7 +2048,7 @@ multiqc --zip-data-dir \
 
 **Input Data:**
 
-- \*-kraken2-report.tsv (kraken report from each sample to compile, outputs from [Step 8b](#8b-taxonomic-classification))
+- \*-kraken2-report.tsv (kraken report from each sample to compile, outputs from [Step 7b](#7b-kraken2-taxonomic-classification))
 
 **Output Data:**
 
@@ -2117,7 +2056,7 @@ multiqc --zip-data-dir \
 - **kraken2_multiqc_GLlbsMetag_data.zip** (zip archive containing multiqc output data)
 
 
-#### 8d. Convert Kraken2 Output to Krona Format
+#### 7d. Convert Kraken2 Output to Krona Format
 
 ```bash
 kreport2krona.py --report-file sample-kraken2-report.tsv  \
@@ -2131,14 +2070,14 @@ kreport2krona.py --report-file sample-kraken2-report.tsv  \
 
 **Input Data:**
 
-- sample-kraken2-report.tsv (kraken report, output from [Step 8b](#8b-taxonomic-classification))
+- sample-kraken2-report.tsv (kraken report, output from [Step 7b](#7b-kraken2-taxonomic-classification))
 
 **Output Data:**
 
 - sample.krona (krona formatted kraken2 output)
 
 
-#### 8e. Compile Kraken2 Krona Reports
+#### 7e. Compile Kraken2 Krona Reports
 
 ```bash
 # Find, list and write all .krona files to file 
@@ -2174,11 +2113,11 @@ ktImportText -o kraken2-report_GLlbsMetag.html ${KTEXT_FILES[*]}
 
 *ktImportText*
   - `-o` - Specifies the compiled output html file name.
-  - `${KTEXT_FILES[*]}` - An array positional arguement with the following content: sample_1.krona,sample_1 sample_2.krona,sample_2 .. sample_n.krona,sample_n.
+  - `${KTEXT_FILES[*]}` - An array positional argument with the following content: sample_1.krona,sample_1 sample_2.krona,sample_2 .. sample_n.krona,sample_n.
 
 **Input Data:**
 
-- *.krona (all sample .krona formatted files, output from [Step 8d](#8d-convert-kraken2-output-to-krona-format)) 
+- *.krona (all sample .krona formatted files, output from [Step 7d](#7d-convert-kraken2-output-to-krona-format)) 
 
                       
 **Output Data:**
@@ -2188,20 +2127,20 @@ ktImportText -o kraken2-report_GLlbsMetag.html ${KTEXT_FILES[*]}
 - **kraken2-report_GLlbsMetag.html** (compiled krona html report containing all samples)
 
 
-#### 8f. Filter Kraken2 Species Count Table
+#### 7f. Filter Kraken2 Species Count Table
 
 ```R
-library(tidyverse)
-
-input_file <- "kraken2_species_table_GLlbsMetag.csv"
-output_file <- "kraken2_filtered_species_table_GLlbsMetag.csv"
+feature_table_file <- "kraken2_species_table_GLlbsMetag.tsv"
+output_file <- "kraken2_filtered_species_table_GLlbsMetag.tsv"
 threshold <- 0.5
 
 # string used to define non-microbial taxa
-non_microbial <- "UNCLASSIFIED|Unclassifed|unclassified|Homo sapien|cannot|uncultured|unidentified"
+non_microbial <- "UNCLASSIFIED|Unclassified|unclassified|Homo sapien|cannot|uncultured|unidentified"
 
 # read in feature table
-feature_table <- read_csv(input_file) %>% as.data.frame
+feature_table <- read_delim(feature_table_file) %>%
+                 mutate(across(where(is.numeric), function(col) replace_na(col, 0))) %>%
+                 as.data.frame()
 feature_name <- colnames(feature_table)[1]
 rownames(feature_table) <- feature_table[,1]
 feature_table <- feature_table[, -1]
@@ -2211,7 +2150,7 @@ table2write <- filter_rare(feature_table, non_microbial, threshold = threshold)
   as.data.frame %>%
   rownames_to_column(feature_name)
 
-write_csv(x = table2write, file = output_file)
+write_tsv(x = table2write, file = output_file)
 ```
 
 **Custom Functions Used:**
@@ -2225,66 +2164,47 @@ write_csv(x = table2write, file = output_file)
 
 **Input Data:**
 
-- kraken2_species_table_GLlbsMetag.csv (path to kaiju species table from [Step 8ci.](#8ci-create-merged-kraken2-taxonomy-table))
+- kraken2_species_table_GLlbsMetag.tsv (path to kraken2 species table from [Step 7ci](#7ci-create-merged-kraken2-taxonomy-table))
 
 **Output Data:**
 
-- **kraken2_filtered_species_table_GLlbsMetag.csv** (a file containing the filtered species table)
+- **kraken2_filtered_species_table_GLlbsMetag.tsv** (a file containing the filtered species table)
 
 ---
 
-#### 8g. Taxonomy Barplots
+#### 7g. Kraken2 Taxonomy Barplots
 
 ```R
-library(tidyverse)
-
-species_table_file <- "kraken2_species_table_GLlbsMetag.csv"
-filtered_species_table_file <- "kraken2_filtered_species_table_GLlbsMetag.csv"
+species_table_file <- "kraken2_species_table_GLlbsMetag.tsv"
+filtered_species_table_file <- "kraken2_filtered_species_table_GLlbsMetag.tsv"
 metadata_file <- "/path/to/sample/metadata"
-number_samples <- 10 
 
-# set width based on number of samples, with a cap at 50 inches
-plot_width <- 2 * number_samples
-if(plot_width > 50) { plot_width = 50 }
-
-p <- make_barplot(metadata_file = metadata_file, feature_table_file = species_table_file, 
-                  feature_column = "species", samples_column = "sample_id", group_column = "group",
-                  publication_format = publication_format, custom_palette = custom_palette)
-
-ggsave(filename = "kraken2_unfiltered_species_barplot_GLlbsMetag.png", plot = p,
-       device = "png", width = plot_width, height = 10, units = "in", dpi = 300, limitsize = FALSE)
+make_barplot(metadata_file = metadata_file, feature_table_file = species_table_file, 
+             feature_column = "species", samples_column = "sample_id", group_column = "group",
+             output_prefix = "kraken2_unfiltered_species", assay_suffix = "_GLlbsMetag",
+             publication_format = publication_format, custom_palette = custom_palette)
 
 # Save static unfiltered plot
-p <- make_barplot(metadata_file = metadata_file, feature_table_file = filtered_species_table_file, 
-                  feature_column = "Species", samples_column = "sample_id", group_column = "group",
-                  publication_format = publication_format, custom_palette = custom_palette)
-
-# Save interactive unfilterted plot
-htmlwidgets::saveWidget(ggplotly(p), glue("kraken2_unfiltered_species_barplot_GLlbsMetag.html"), selfcontained = TRUE)
-
-# Save static filtered plot
-ggsave(filename = glue("kraken2_filtered_species_barplot_GLlbsMetag.png"), plot = p,
-      device = 'png', width = plot_width, height = 10, units = 'in', dpi = 300, limitsize = FALSE)
-
-# Save interactive filtered plot
-htmlwidgets::saveWidget(ggplotly(p), glue("kraken2_filtered_species_barplot_GLlbsMetag.html"), selfcontained = TRUE)
+make_barplot(metadata_file = metadata_file, feature_table_file = filtered_species_table_file, 
+             feature_column = "Species", samples_column = "sample_id", group_column = "group",
+             output_prefix = "kraken2_filtered_species", assay_suffix = "_GLlbsMetag",
+             publication_format = publication_format, custom_palette = custom_palette)
 ```
 
 **Custom Functions Used:**
-- [make_barplot()](#make_plot)
+- [make_barplot()](#make_barplot)
 
 **Parameter Definitions:**
 
 - `species_table_file` - a file containing the species count table
 - `filtered_species_table_file` - a file containing the filtered species count table
 - `metadata_file` - a file containing group information for each sample in the species count files
-- `number_samples` - the total number of samples in the species count files, adjust based on number of input samples.
 
 **Input Data:**
 
-- `kraken2_species_table_GLlbsMetag.csv` (path to kaiju species table from [Step 10ci.](#8ci-create-merged-kraken2-taxonomy-table))
-- `kraken2_filtered_species_table_GLlbsMetag.csv` (a file containing the filtered species count table, output from [Step 10g](#10f-filter-kraken2-species-count-table))
-- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping samplenames to group metadata)
+- `kraken2_species_table_GLlbsMetag.tsv` (path to kraken2 species table from [Step 7ci](#7ci-create-merged-kraken2-taxonomy-table))
+- `kraken2_filtered_species_table_GLlbsMetag.tsv` (a file containing the filtered species count table, output from [Step 7f](#7f-filter-kraken2-species-count-table))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
 
 **Output Data:**
 
@@ -2294,47 +2214,32 @@ htmlwidgets::saveWidget(ggplotly(p), glue("kraken2_filtered_species_barplot_GLlb
 - **kraken2_filtered_species_barplot_GLlbsMetag.html** (interactive taxonomy barplot after filtering rare and non-microbial taxa)
 
 
-#### 8h. Feature Decontamination
+#### 7h. Kraken2 Feature Decontamination
 
-> Feature (species) decontamination with decontam. Decontam is an R package that statistically 
-  identifies contaminating features in a feature table
+> Note: The decontaminated species table and barplots are only generated if one or more contaminants are detected.
 
 ```R
-library(tidyverse)
-library(decontam)
-library(phyloseq)
-
-feature_table_file <- "kraken2_filtered_species_table_GLlbsMetag.csv"
+feature_table_file <- "kraken2_filtered_species_table_GLlbsMetag.tsv"
 metadata_table <- "/path/to/sample/metadata"
-number_samples <- NumberOfSamples # integer indicating how many samples are in the file
-
-# set width based on number of samples, with a cap at 50 inches
-plot_width <- 2 * number_samples
-if(plot_width > 50) { plot_width = 50 }
 
 decontaminated_table <- feature_decontam(metadata_file = metadata_table, 
                                          feature_table_file = feature_table_file, 
                                          feature_column = "species", 
                                          samples_column = "sample_id",
                                          prevalence_column = "NTC", 
-                                         ntc_name = "TRUE", 
+                                         ntc_name = "true", 
                                          frequency_column = "concentration", 
-                                         threshold = 0.1, 
+                                         threshold = 0.5, 
                                          classification_method = "kraken2", 
-                                         output_prefix = "", 
+                                         output_prefix = "kraken2", 
                                          assay_suffix = "_GLlbsMetag")
 
-# Convert count matrix to relative abundance matrix
-decontaminated_species_table <- count_to_rel_abundance(decontaminated_table)
-
 # Make plot after filtering out contaminants
-p <- make_plot(decontaminated_species_table, metadata, custom_palette, publication_format)
-
-ggsave(filename = "kraken2_decontam_species_barplot_GLlbsMetag.png", plot = p,
-         device = "png", width = plot_width, height = 10, units = "in", dpi = 300)
+make_barplot(metadata_file = metadata_table, feature_table_file = "kraken2_decontam_species_table_GLlbsMetag.tsv", 
+             feature_column = "Species", samples_column = "sample_id", group_column = "group",
+             output_prefix = "kraken2_decontam_species", assay_suffix = "_GLlbsMetag",
+             publication_format = publication_format, custom_palette = custom_palette)
 
-# Save interactive filtered plot
-htmlwidgets::saveWidget(ggplotly(p), glue("kraken2_decontam_species_barplot_GLlbsMetag.html"), selfcontained = TRUE)
 ```
 
 **Custom Functions Used:**
@@ -2347,27 +2252,26 @@ htmlwidgets::saveWidget(ggplotly(p), glue("kraken2_decontam_species_barplot_GLlb
 - `metadata_table` - path to a file with samples as rows and columns describing each sample
 - `feature_table_file` - path to a tab separated samples feature table i.e. species/functions 
                           table with species/functions as the first column and samples as other columns.
-- `number_samples` - the total number of samples in the species count files, adjust based on number of input samples.
 
 **Input Data:**
 
-- `kraken2_filtered_species_table_GLlbsMetag.csv`(path to filtered species count per sample, output from [Step 8f](#10f-filter-kraken2-species-count-table))
-- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping samplenames to group metadata)
+- `kraken2_filtered_species_table_GLlbsMetag.tsv` (path to filtered species count per sample, output from [Step 7f](#7f-filter-kraken2-species-count-table))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
 
 **Output Data:**
 
-- **kraken2_decontam_results_GLlbsMetag.csv** (decontam's result table, output from [feature_decontam() function](#feature_decontam))
-- **kraken2_decontam_species_table_GLlbsMetag.csv** (decontaminated species table, output from [feature_decontam() function](#feature_decontam))
-- kraken2_decontam_species_barplot_GLlbsMetag.png (barplot after filtering out contaminants)
-- **kraken2_decontam_species_barplot_GLlbsMetag.html** (barplot after filtering out contaminants)
+- **kraken2_decontam_results_GLlbsMetag.tsv** (decontam's result table, output from [feature_decontam()](#feature_decontam))
+- **kraken2_decontam_species_table_GLlbsMetag.tsv** (decontaminated species table, output from [feature_decontam()](#feature_decontam))
+- kraken2_decontam_species_barplot_GLlbsMetag.png (barplot after filtering out contaminants, output from [make_barplot()](#make_barplot))
+- **kraken2_decontam_species_barplot_GLlbsMetag.html** (barplot after filtering out contaminants, output from [make_barplot()](#make_barplot))
 
 <br>  
 
 ---
 
-### 9. Taxonomic Profiling Using MetaPhlan
+### 8. Taxonomic Profiling Using MetaPhlan
 
-#### 9a. Download and Install HUMAnN databases
+#### 8a. Download and Install HUMAnN databases
 
 ```bash 
 mkdir -p /path/to/humann3-db
@@ -2397,7 +2301,7 @@ metaphlan --install
 
 `/path/to/humann3-db` (the installed MetaPhlan databases)
 
-#### 9b. HUMAnN/MetaPhlAn Taxonomic Classification
+#### 8b. HUMAnN/MetaPhlAn Taxonomic Classification
 
 ```bash
   # forward and reverse reads need to be provided combined if paired-end (if not paired-end, single-end reads are provided to the --input argument next)
@@ -2430,30 +2334,30 @@ mv sample1-humann3-out-dir/sample1_humann_temp/sample1_metaphlan_bugs_list.tsv \
 
 **Input Data:**
 
-- `/path/to/humann3-db/` (HUMAnN databases installed in [Step 9a](#9a-download-and-install-humann-databases))
+- `/path/to/humann3-db/` (HUMAnN databases installed in [Step 8a](#8a-download-and-install-humann-databases))
 - *_R[12]_decontam_GLlbsMetag.fastq.gz or *_R[12]_HostRm_GLlbsMetag.fastq.gz (filtered and trimmed sample reads with both 
     contaminants and human reads (and, optionally, host reads) removed, gzipped fasta file, 
-    output from [Step 4b](#4b-build-contaminant-index-and-map-reads) or [Step 5b](#5b-remove-host-reads))
+    output from [Step 3b](#3b-build-contaminant-index-and-map-reads) or [Step 4b](#4b-remove-host-reads))
 
 **Output Data:**
 
 - sample1-humann3-out-dir/ *humann output directory containing *genefamilies.tsv, *pathabundance.tsv, and *pathcoverage.tsv files)
 
-#### 9c. Merge Multiple Sample Functional Profiles
+#### 8c. Merge Multiple Sample Functional Profiles
 
 ```bash
 # they need to be in their own directories
-mkdir genefamily-results/ pathabundance-results/ pathcoverage-results/
+mkdir gene-family-results/ path-abundance-results/ path-coverage-results/
 
 # copying results from humann3 step
-cp *-humann3-out-dir/*genefamilies.tsv genefamily-results/
-cp *-humann3-out-dir/*abundance.tsv pathabundance-results/
-cp *-humann3-out-dir/*coverage.tsv pathcoverage-results/
+cp *-humann3-out-dir/*genefamilies.tsv gene-family-results/
+cp *-humann3-out-dir/*abundance.tsv path-abundance-results/
+cp *-humann3-out-dir/*coverage.tsv path-coverage-results/
 
 # join results across samples
-humann_join_tables -i genefamily-results/ -o gene-families.tsv
-humann_join_tables -i pathabundance-results/ -o path-abundances.tsv
-humann_join_tables -i pathcoverage-results/ -o path-coverages.tsv
+humann_join_tables -i gene-family-results/ -o gene-families.tsv
+humann_join_tables -i path-abundance-results/ -o pathway-abundances.tsv
+humann_join_tables -i path-coverage-results/ -o pathway-coverages.tsv
 ```
 
 **Parameter Definitions:**  
@@ -2463,15 +2367,15 @@ humann_join_tables -i pathcoverage-results/ -o path-coverages.tsv
 
 **Input Data:**
 
-- `sample-humann3-out-dir` (HUMAnN output directory, from [Step 9b](#9b-running-humannmetaphlan))
+- `sample-humann3-out-dir` (HUMAnN output directory, from [Step 8b](#8b-humannmetaphlan-taxonomic-classification))
 
 **Output Data:**
 
 - gene-families.tsv (Combined gene family table in tab-separated format.)
-- path-abundances.tsv (Combined path abundances table in tab-separated format.)
-- path-coverages.tsv (Combined path coverages table in tab-separated format.)
+- pathway-abundances.tsv (Combined pathway abundances table in tab-separated format.)
+- pathway-coverages.tsv (Combined pathway coverages table in tab-separated format.)
 
-#### 9d. Split Results Tables
+#### 8d. Split Results Tables
 
 The read-based functional annotation tables have taxonomic info and non-taxonomic info mixed together. `humann` comes with a helper script to split them into both non-taxonomically grouped functional info files and taxonomically grouped functional info files.
 
@@ -2480,13 +2384,13 @@ humann_split_stratified_table -i gene-families.tsv -o ./
 mv gene-families_stratified.tsv Gene-families-grouped-by-taxa_GLlbsMetag.tsv
 mv gene-families_unstratified.tsv Gene-families_GLlbsMetag.tsv
 
-humann_split_stratified_table -i path-abundances.tsv -o ./
-mv path-abundances_stratified.tsv Path-abundances-grouped-by-taxa_GLlbsMetag.tsv
-mv path-abundances_unstratified.tsv Path-abundances_GLlbsMetag.tsv
+humann_split_stratified_table -i pathway-abundances.tsv -o ./
+mv pathway-abundances_stratified.tsv Pathway-abundances-grouped-by-taxa_GLlbsMetag.tsv
+mv pathway-abundances_unstratified.tsv Pathway-abundances_GLlbsMetag.tsv
 
-humann2_split_stratified_table -i path-coverages.tsv -o ./
-mv path-coverages_stratified.tsv Path-coverages-grouped-by-taxa_GLlbsMetag.tsv
-mv path-coverages_unstratified.tsv Path-coverages_GLlbsMetag.tsv
+humann_split_stratified_table -i pathway-coverages.tsv -o ./
+mv pathway-coverages_stratified.tsv Pathway-coverages-grouped-by-taxa_GLlbsMetag.tsv
+mv pathway-coverages_unstratified.tsv Pathway-coverages_GLlbsMetag.tsv
 ```
 
 **Parameter Definitions:**  
@@ -2496,25 +2400,25 @@ mv path-coverages_unstratified.tsv Path-coverages_GLlbsMetag.tsv
 
 **Input Data:**
 
-- gene-families.tsv (Combined gene family table from [Step 9c](#9c-merging-multiple-sample-functional-profiles-into-one-table))
-- path-abundances.tsv (Combined path abundances table from [Step 9c](#9c-merging-multiple-sample-functional-profiles-into-one-table))
-- path-coverages.tsv (Combined path coverages table from [Step 9c](#9c-merging-multiple-sample-functional-profiles-into-one-table))
+- gene-families.tsv (Combined gene family table from [Step 8c](#8c-merge-multiple-sample-functional-profiles))
+- pathway-abundances.tsv (Combined pathway abundances table from [Step 8c](#8c-merge-multiple-sample-functional-profiles))
+- pathway-coverages.tsv (Combined pathway coverages table from [Step 8c](#8c-merge-multiple-sample-functional-profiles))
 
 **Output Data:**
 
-- Gene-families-grouped-by-taxa_GLlbsMetag.tsv (Gene families grouped by taxa)
-- Gene-families_GLlbsMetag.tsv (Non-taxonomically grouped gene families)
-- Path-abundances-grouped-by-taxa_GLlbsMetag.tsv (Path abundances grouped by taxa)
-- Path-abundances_GLlbsMetag.tsv  (Non-taxonomically grouped gene families)
-- Path-coverages-grouped-by-taxa_GLlbsMetag.tsv (Path coverages grouped by taxa)
-- Path-coverages_GLlbsMetag.tsv (Non-taxonomically groups path coverages)
+- **Gene-families_GLlbsMetag.tsv** (gene-family abundances)
+- **Gene-families-grouped-by-taxa_GLlbsMetag.tsv** (gene-family abundances grouped by taxa)
+- **Pathway-abundances_GLlbsMetag.tsv**  (pathway abundances)
+- **Pathway-abundances-grouped-by-taxa_GLlbsMetag.tsv** (pathway abundances grouped by taxa)
+- **Pathway-coverages_GLlbsMetag.tsv** (pathway coverages)
+- **Pathway-coverages-grouped-by-taxa_GLlbsMetag.tsv** (pathway coverages grouped by taxa)
 
-#### 9e. Normalize Gene Families and Pathway Abundances Tables
+#### 8e. Normalize Gene Families and Pathway Abundances Tables
 Generates some normalized tables of the read-based functional outputs from humann that are more readily suitable for across sample comparisons.
 
 ```bash
 humann_renorm_table -i Gene-families_GLlbsMetag.tsv -o Gene-families-cpm_GLlbsMetag.tsv --update-snames
-humann_renorm_table -i Path-abundances_GLlbsMetag.tsv -o Path-abundances-cpm_GLlbsMetag.tsv --update-snames
+humann_renorm_table -i Pathway-abundances_GLlbsMetag.tsv -o Pathway-abundances-cpm_GLlbsMetag.tsv --update-snames
 ```
 
 **Parameter Definitions:**  
@@ -2525,15 +2429,14 @@ humann_renorm_table -i Path-abundances_GLlbsMetag.tsv -o Path-abundances-cpm_GLl
 
 **Input Data:**
 
-- Gene-families_GLlbsMetag.tsv (Non-taxonomically grouped gene families, from [Step 9d](#9d-splitting-results-tables))
-- Path-abundances_GLlbsMetag.tsv (Non-taxonomically grouped gene families, from [Step 9d](#9d-splitting-results-tables))
+- Gene-families_GLlbsMetag.tsv (gene-family abundances, from [Step 8d](#8d-split-results-tables))
+- Pathway-abundances_GLlbsMetag.tsv (pathway abundances, from [Step 8d](#8d-split-results-tables))
 
 **Output Data:**
+- **Gene-families-cpm_GLlbsMetag.tsv** (gene-family abundances normalized to copies-per-million)
+- **Pathway-abundances-cpm_GLlbsMetag.tsv** (pathway abundances normalized to copies-per-million)
 
-- Gene-families-cpm_GLlbsMetag.tsv (Normalized non-taxonomically grouped gene families)
-- Path-abundances-cpm_GLlbsMetag.tsv (Normalized on-taxonomically grouped gene families)
-
-#### 9f. Generate Normalized Gene-family Table Grouped by Kegg Orthologs (KOs)
+#### 8f. Generate Normalized Gene-family Table Grouped by KEGG Orthologs (KOs)
 
 ```bash
 humann_regroup_table -i Gene-families_GLlbsMetag.tsv -g uniref90_ko | \
@@ -2558,19 +2461,19 @@ humann_renorm_table -o Gene-families-KO-cpm_GLlbsMetag.tsv --update-snames
 
 **Input Data:**
 
-- Gene-families_GLlbsMetag.tsv (Non-taxonomically grouped gene families, from [Step 9d](#9d-splitting-results-tables))
+- Gene-families_GLlbsMetag.tsv (Non-taxonomically grouped gene families, from [Step 8d](#8d-split-results-tables))
 
 **Output Data:**
 
-- Gene-families-KO-cpm_GLlbsMetag.tsv (Normalized gene-families with annotations based on Kegg Orthology terms)
+- **Gene-families-KO-cpm_GLlbsMetag.tsv** (KO term abundances normalized to copies-per-million)
 
-#### 9g. Combine MetaPhlan Taxonomy Tables
+#### 8g. Combine MetaPhlan Taxonomy Tables
 
 ```bash
-merge_metaphlan_tables.py *-humann3-out-dir/*_humann_temp/*_metaphlan_bugs_list.tsv > Metaphlan-taxonomy_GLlbsMetag.tsv
+merge_metaphlan_tables.py *-humann3-out-dir/*_humann_temp/*_metaphlan_bugs_list.tsv > metaphlan-taxonomy_GLlbsMetag.tsv
 
 # remove redundant text from headers
-sed -i 's/_metaphlan_bugs_list//g' Metaphlan-taxonomy_GLlbsMetag.tsv
+sed -i 's/_metaphlan_bugs_list//g' metaphlan-taxonomy_GLlbsMetag.tsv
 ```
 
 **Parameter Definitions:**
@@ -2583,15 +2486,15 @@ sed -i 's/_metaphlan_bugs_list//g' Metaphlan-taxonomy_GLlbsMetag.tsv
 
 **Input Data:**
 
--	\*-humann3-out-dir/\*_humann_temp/\*_metaphlan_bugs_list.tsv (MetaPhlan bugs_list produced during humann3 run in [step 9b](#9b-running-humannmetaphlan)
+- \*-humann3-out-dir/\*_humann_temp/\*_metaphlan_bugs_list.tsv (MetaPhlan bugs_list produced during humann3 run in [Step 8b](#8b-humannmetaphlan-taxonomic-classification))
 
 **Output Data:**
 
 - **Metaphlan-taxonomy_GLlbsMetag.tsv** (MetaPhlan estimated taxonomic relative abundances)
 
-#### 9h. Create MetaPhlan Species Count Table 
+#### 8h. Create MetaPhlan Species Count Table
 
-#### 9hi. Get Sample Read Counts
+##### 8hi. Get Sample Read Counts
 
 ```bash
 unzip decontam_multiqc_GLlbsMetag_data.zip
@@ -2601,20 +2504,18 @@ grep _R1_decontam multiqc_fastqc.txt | awk 'BEGIN{FS="\t"; OFS="\t"}{print $1,in
 
 **Input Data:**
 
-- decontam_multiqc_GLlbsMetag_data.zip or HostRm_multiqc_GLlbsMetag_data.zip (multiqc data from [Step ](#4d-compile-contaminant-remove-qc) or [Step 5c](#5c-compile-host-read-removal-qc) if the optional host removal step was done, respectively)
+- decontam_multiqc_GLlbsMetag_data.zip or HostRm_multiqc_GLlbsMetag_data.zip (multiqc data from [Step 3d](#3d-compile-contaminant-remove-qc) or [Step 4c](#4c-compile-host-read-removal-qc) if the optional host removal step was done, respectively)
 
 **Output Data:**
 
 - reads_per_sample.txt (a 2-column tab delimited file with the sample names and read counts as column 1 and 2, respectively)
 
-#### 9hii. Process Metaphlan Taxonomy Table
+##### 8hii. Process MetaPhlan Taxonomy Table
 
 ```R
-library(tidyverse)
-
-input_file <- "Metaphlan-taxonomy_GLlbsMetag.tsv"
+input_file <- "metaphlan-taxonomy_GLlbsMetag.tsv"
 read_count_file <- "reads_per_sample.tsv"
-output_file <- "metaphlan_species_table_GLlbsMetag.csv"
+output_file <- "metaphlan_species_table_GLlbsMetag.tsv"
 threshold <- 0.5
 
 taxon_levels <- c("Kingdom", "Phylum", "Class", "Order",
@@ -2665,32 +2566,32 @@ table2write <- species_table  %>%
   as.data.frame() %>%
   rownames_to_column("Species")
 
-write_csv(x = table2write, file = "Metaphlan_species_table_GLlbsMetag.csv")
+write_tsv(x = table2write, file = "metaphlan_species_table_GLlbsMetag.tsv")
 ```
 
 **Input Data:**
 
-- Metaphlan-taxonomy_GLlbsMetag.tsv (Metaphlan taxonomy table from [Step 9g](#9g-combine-metaphlan-taxonomy-tables))
-- reads_per_sample.tsv (a 2-column tab delimited file with sample names and read counts as columns 1 and 2, respectively from [Step 9hi](#9hi-get-sample-read-counts))
+- metaphlan-taxonomy_GLlbsMetag.tsv (MetaPhlan taxonomy table from [Step 8g](#8g-combine-metaphlan-taxonomy-tables))
+- reads_per_sample.tsv (a 2-column tab-delimited file with sample names and read counts as columns 1 and 2, respectively, from [Step 8hi](#8hi-get-sample-read-counts))
 
 **Output Data:**
 
-- **Metaphlan_species_table_GLlbsMetag.csv** (a file containing the MetaPhlan species table)
+- **metaphlan_species_table_GLlbsMetag.tsv** (a file containing the MetaPhlan species table)
 
-#### 9i. Filter MetaPhlan Species Count Table
+#### 8i. Filter MetaPhlan Species Count Table
 
 ```R
-library(tidyverse)
-
-input_file <- "Metaphlan_species_table_GLlbsMetag.csv"
-output_file <- "Metaphlan_filtered_species_table_GLlbsMetag.csv"
+feature_table_file <- "metaphlan_species_table_GLlbsMetag.tsv"
+output_file <- "metaphlan_filtered_species_table_GLlbsMetag.tsv"
 threshold <- 0.5
 
 # string used to define non-microbial taxa
 non_microbial <- "UNCLASSIFIED"
 
 # read in feature table
-feature_table <- read_csv(input_file) %>% as.data.frame
+feature_table <- read_delim(feature_table_file) %>%
+                 mutate(across(where(is.numeric), function(col) replace_na(col, 0))) %>%
+                 as.data.frame()
 feature_name <- colnames(feature_table)[1]
 rownames(feature_table) <- feature_table[,1]
 feature_table <- feature_table[, -1]
@@ -2700,7 +2601,7 @@ table2write <- filter_rare(feature_table, non_microbial, threshold = threshold)
   as.data.frame %>%
   rownames_to_column(feature_name)
 
-write_csv(x = table2write, file = output_file)
+write_tsv(x = table2write, file = output_file)
 ```
 
 **Custom Functions Used:**
@@ -2714,114 +2615,77 @@ write_csv(x = table2write, file = output_file)
 
 **Input Data:**
 
-- Metaphlan_species_table_GLlbsMetag.csv (path to Metaphlan species count table from [Step 9hii](#9hii-process-metaphlan-taxonomy-table))
+- metaphlan_species_table_GLlbsMetag.tsv (path to MetaPhlan species count table from [Step 8hii](#8hii-process-metaphlan-taxonomy-table))
 
 **Output Data:**
 
-- **Metaphlan_filtered_species_table_GLlbsMetag.csv** (a file containing the filtered MetaPhlan species table)
+- **metaphlan_filtered_species_table_GLlbsMetag.tsv** (a file containing the filtered MetaPhlan species table)
 
-#### 9j. Taxonomy Barplots
+#### 8j. MetaPhlan Taxonomy Barplots
 
 ```R
-library(tidyverse)
-
-species_table_file <- "Metaphlan_species_table_GLlbsMetag.csv"
-filtered_species_table_file <- "Metaphlan_filtered_species_table_GLlbsMetag.csv"
+species_table_file <- "metaphlan_species_table_GLlbsMetag.tsv"
+filtered_species_table_file <- "metaphlan_filtered_species_table_GLlbsMetag.tsv"
 metadata_file <- "/path/to/sample/metadata"
-number_samples <- 10 
-
-# set width based on number of samples, with a cap at 50 inches
-plot_width <- 2 * number_samples
-if(plot_width > 50) { plot_width = 50 }
 
-p <- make_barplot(metadata_file = metadata_file, feature_table_file = species_table_file, 
-                  feature_column = "Species", samples_column = "sample_id", group_column = "group",
-                  publication_format = publication_format, custom_palette = custom_palette)
-
-ggsave(filename = "Metaphlan_unfiltered_species_barplot_GLlbsMetag.png", plot = p,
-       device = "png", width = plot_width, height = 10, units = "in", dpi = 300, limitsize = FALSE)
+make_barplot(metadata_file = metadata_file, feature_table_file = species_table_file, 
+             feature_column = "Species", samples_column = "sample_id", group_column = "group",
+             output_prefix = "metaphlan_unfiltered_species", assay_suffix = "_GLlbsMetag",
+             publication_format = publication_format, custom_palette = custom_palette)
 
 # Save static unfiltered plot
-p <- make_barplot(metadata_file = metadata_file, feature_table_file = filtered_species_table_file, 
-                  feature_column = "Species", samples_column = "sample_id", group_column = "group",
-                  publication_format = publication_format, custom_palette = custom_palette)
-
-# Save interactive unfilterted plot
-htmlwidgets::saveWidget(ggplotly(p), glue("Metaphlan_unfiltered_species_barplot_GLlbsMetag.html"), selfcontained = TRUE)
-
-# Save static filtered plot
-ggsave(filename = glue("Metaphlan_filtered_species_barplot_GLlbsMetag.png"), plot = p,
-      device = 'png', width = plot_width, height = 10, units = 'in', dpi = 300, limitsize = FALSE)
-
-# Save interactive filtered plot
-htmlwidgets::saveWidget(ggplotly(p), glue("Metaphlan_filtered_species_barplot_GLlbsMetag.html"), selfcontained = TRUE)
+make_barplot(metadata_file = metadata_file, feature_table_file = filtered_species_table_file, 
+             feature_column = "Species", samples_column = "sample_id", group_column = "group",
+             output_prefix = "metaphlan_filtered_species", assay_suffix = "_GLlbsMetag",
+             publication_format = publication_format, custom_palette = custom_palette)
 ```
 
 **Custom Functions Used:**
-- [make_barplot()](#make_plot)
+- [make_barplot()](#make_barplot)
 
 **Parameter Definitions:**
 
 - `species_table_file` - a file containing the species count table
 - `filtered_species_table_file` - a file containing the filtered species count table
 - `metadata_file` - a file containing group information for each sample in the species count files
-- `number_samples` - the total number of samples in the species count files, adjust based on number of input samples.
 
 **Input Data:**
 
-- `Metaphlan_species_table_GLlbsMetag.csv` (path to kaiju species table from [Step 10ci.](#8ci-create-merged-Metaphlan-taxonomy-table))
-- `Metaphlan_filtered_species_table_GLlbsMetag.csv` (a file containing the filtered species count table, output from [Step 10g](#10f-filter-Metaphlan-species-count-table))
-- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping samplenames to group metadata)
+- `metaphlan_species_table_GLlbsMetag.tsv` (path to MetaPhlan species table from [Step 8h](#8h-create-metaphlan-species-count-table))
+- `metaphlan_filtered_species_table_GLlbsMetag.tsv` (a file containing the filtered species count table, output from [Step 8i](#8i-filter-metaphlan-species-count-table))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
 
 **Output Data:**
 
-- Metaphlan_unfiltered_species_barplot_GLlbsMetag.png (taxonomy barplot without filtering)
-- **Metaphlan_unfiltered_species_barplot_GLlbsMetag.html** (interactive taxonomy barplot without filtering)
-- Metaphlan_filtered_species_barplot_GLlbsMetag.png (taxonomy barplot after filtering rare and non-microbial taxa)
-- **Metaphlan_filtered_species_barplot_GLlbsMetag.html** (interactive taxonomy barplot after filtering rare and non-microbial taxa)
+- metaphlan_unfiltered_species_barplot_GLlbsMetag.png (taxonomy barplot without filtering)
+- **metaphlan_unfiltered_species_barplot_GLlbsMetag.html** (interactive taxonomy barplot without filtering)
+- metaphlan_filtered_species_barplot_GLlbsMetag.png (taxonomy barplot after filtering rare and non-microbial taxa)
+- **metaphlan_filtered_species_barplot_GLlbsMetag.html** (interactive taxonomy barplot after filtering rare and non-microbial taxa)
 
 
-#### 9k. Feature Decontamination
+#### 8k. MetaPhlan Feature Decontamination
 
-> Feature (species) decontamination with decontam. Decontam is an R package that statistically 
-  identifies contaminating features in a feature table
+> Note: The decontaminated species table and barplots are only generated if one or more contaminants are detected.
 
 ```R
-library(tidyverse)
-library(decontam)
-library(phyloseq)
-
-feature_table_file <- "Metaphlan_filtered_species_table_GLlbsMetag.csv"
+feature_table_file <- "metaphlan_filtered_species_table_GLlbsMetag.tsv"
 metadata_table <- "/path/to/sample/metadata"
-number_samples <- NumberOfSamples # integer indicating how many samples are in the file
-
-# set width based on number of samples, with a cap at 50 inches
-plot_width <- 2 * number_samples
-if(plot_width > 50) { plot_width = 50 }
-
 decontaminated_table <- feature_decontam(metadata_file = metadata_table, 
                                          feature_table_file = feature_table_file, 
                                          feature_column = "species", 
                                          samples_column = "sample_id",
                                          prevalence_column = "NTC", 
-                                         ntc_name = "TRUE", 
+                                         ntc_name = "true", 
                                          frequency_column = "concentration", 
-                                         threshold = 0.1, 
-                                         classification_method = "kraken2", 
-                                         output_prefix = "", 
+                                         threshold = 0.5, 
+                                         classification_method = "metaphlan", 
+                                         output_prefix = "metaphlan", 
                                          assay_suffix = "_GLlbsMetag")
 
-# Convert count matrix to relative abundance matrix
-decontaminated_species_table <- count_to_rel_abundance(decontaminated_table)
-
-# Make plot after filtering out contaminants
-p <- make_plot(decontaminated_species_table, metadata, custom_palette, publication_format)
-
-ggsave(filename = "Metaphlan_decontam_species_barplot_GLlbsMetag.png", plot = p,
-         device = "png", width = plot_width, height = 10, units = "in", dpi = 300)
-
-# Save interactive filtered plot
-htmlwidgets::saveWidget(ggplotly(p), glue("Metaphlan_decontam_species_barplot_GLlbsMetag.html"), selfcontained = TRUE)
+make_barplot(metadata_file = metadata_table, feature_table_file = "metaphlan_decontam_species_table_GLlbsMetag.tsv", 
+             feature_column = "Species", samples_column = "sample_id", group_column = "group",
+             output_prefix = "metaphlan_decontam_species", assay_suffix = "_GLlbsMetag",
+             publication_format = publication_format, custom_palette = custom_palette)
 ```
 
 **Custom Functions Used:**
@@ -2838,15 +2702,270 @@ htmlwidgets::saveWidget(ggplotly(p), glue("Metaphlan_decontam_species_barplot_GL
 
 **Input Data:**
 
-- `Metaphlan_filtered_species_table_GLlbsMetag.csv`(path to filtered species count per sample, output from [Step 9i](#9i-filter-metaphlan-species-count-table))
-- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping samplenames to group metadata)
+- `metaphlan_filtered_species_table_GLlbsMetag.tsv`(path to filtered species count per sample, output from [Step 8i](#8i-filter-metaphlan-species-count-table))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
+
+**Output Data:**
+
+- **metaphlan_decontam_results_GLlbsMetag.tsv** (decontam's result table, output from [feature_decontam()](#feature_decontam))
+- **metaphlan_decontam_species_table_GLlbsMetag.tsv** (decontaminated species table, output from [feature_decontam()](#feature_decontam))
+- metaphlan_decontam_species_barplot_GLlbsMetag.png (barplot after filtering out contaminants, output from [make_barplot()](#make_barplot))
+- **metaphlan_decontam_species_barplot_GLlbsMetag.html** (barplot after filtering out contaminants, output from [make_barplot()](#make_barplot))
+
+<br>
+
+#### 8l. Filter Humann Output
+
+```R
+# read in humann tables
+humann_uniref_table <- read_delim(file = "Gene-families-cpm_GLlbsMetag.tsv", delim = "\t")
+humann_KO_table <- read_delim(file = "Gene-families-KO-cpm_GLlbsMetag.tsv", delim = "\t")
+humann_pathway_table <- read_delim(file = "Pathway-abundances-cpm_GLlbsMetag.tsv", delim = "\t")
+
+# rename headers
+humann_uniref_table <-  humann_uniref_table  %>% 
+  rename(Uniref90=`# Gene Family`) %>%
+  mutate(Uniref90=str_replace_all(Uniref90, "UniRef90_", "")) %>%
+  set_names(colnames(.) %>% str_replace_all("_Abundance-CPM", "")) %>%
+  as.data.frame()
+write_tsv(x = humann_uniref_table, file = "Gene-families-uniref_unfiltered_GLlbsMetag.tsv")
+
+humann_KO_table <- humann_KO_table %>%
+  rename(KO=`# Gene Family`) %>%
+  set_names(colnames(.) %>% str_replace_all("_Abundance-CPM", "")) %>%
+  as.data.frame()
+write_tsv(x = humann_KO_table, file = "Gene-families-KO_unfiltered_GLlbsMetag.tsv")
+
+humann_pathway_table <-  humann_pathway_table  %>% 
+  rename(Pathway=`# Pathway`) %>%
+  set_names(colnames(.) %>% str_replace_all("_Abundance-CPM", "")) %>%
+  as.data.frame()
+write_tsv(x = humann_pathway_table, file = "Pathway-abundances_unfiltered_GLlbsMetag.tsv")
+
+# filter data
+threshold <- 500
+
+humann_uniref_table <- humann_uniref_table %>%
+  mutate(across(where(is.numeric), function(col) replace_na(col, 0))) %>% column_to_rownames("Uniref90")
+humann_uniref_filtered <- get_abundant_features(humann_uniref_table, cpm_threshold = threshold) %>%
+  as.data.frame() %>% rownames_to_column("Uniref90")
+write_tsv(x = humann_uniref_filtered, file = "Gene-families-uniref_filtered_GLlbsMetag.tsv")
+
+humann_KO_table <- humann_KO_table %>%
+  mutate(across(where(is.numeric), function(col) replace_na(col, 0))) %>% column_to_rownames("KO")
+humann_KO_filtered <- get_abundant_features(humann_KO_table, cpm_threshold = threshold) %>%
+  as.data.frame() %>% rownames_to_column("KO")
+write_tsv(x = humann_KO_filtered, file = "Gene-families-KO_filtered_GLlbsMetag.tsv")
+
+humann_pathway_table <- humann_pathway_table %>%
+  mutate(across(where(is.numeric), function(col) replace_na(col, 0))) %>% column_to_rownames("Pathway")
+humann_pathway_filtered <- get_abundant_features(humann_pathway_table, cpm_threshold = threshold) %>%
+  as.data.frame() %>% rownames_to_column("Pathway")
+write_tsv(x = humann_pathway_filtered, file = "Pathway-abundances_filtered_GLlbsMetag.tsv")
+
+```
+
+**Custom Functions Used:**
+- [get_abundant_features()](#get_abundant_features)
+
+**Parameter Definitions:**
+
+- `threshold` - CPM threshold for filtering out low-abundance features; must be greater than 0
+
+**Input Data:**
+
+- Gene-families-cpm_GLlbsMetag.tsv (Humann gene-family abundance table normalized to copies-per-million, from [Step 8e](#8e-normalize-gene-families-and-pathway-abundances-tables))
+- Gene-families-KO-cpm_GLlbsMetag.tsv (Humann KO term abundance table normalized to copies-per-million, from [Step 8f](#8f-generate-normalized-gene-family-table-grouped-by-kegg-orthologs-kos))
+- Pathway-abundances-cpm_GLlbsMetag.tsv (Humann pathway abundance table normalized to copies-per-million, from [Step 8e](#8e-normalize-gene-families-and-pathway-abundances-tables))
+
+**Output Data:**
+
+- Gene-families-KO_unfiltered_GLlbsMetag.tsv (KO term abundances normalized to copies-per-million, with cleaned headers)
+- Gene-families-uniref_unfiltered_GLlbsMetag.tsv (gene-family abundances normalized to copies-per-million, with cleaned headers)
+- Pathway-abundances_unfiltered_GLlbsMetag.tsv (pathway abundances normalized to copies-per-million, with cleaned headers)
+- **Gene-families-KO_filtered_GLlbsMetag.tsv** (KO term abundances with features below the 500 CPM threshold removed)
+- **Gene-families-uniref_filtered_GLlbsMetag.tsv** (gene-family abundances with features below the 500 CPM threshold removed)
+- **Pathway-abundances_filtered_GLlbsMetag.tsv** (pathway abundances with features below the 500 CPM threshold removed)
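+
+As a quick sanity check on the filtering above, the sketch below counts the feature rows in each unfiltered and filtered table. It is not part of the GeneLab pipeline: the helper names `count_features` and `report_filtering` are hypothetical, and it assumes each table has a single header line and sits in the current directory.
+
+```bash
+# Count feature rows (every line except the header) in a tab-separated table.
+count_features() {
+  awk 'NR > 1 { n++ } END { print n + 0 }' "$1"
+}
+
+# Print, for each Humann table type, the feature counts before and after filtering.
+report_filtering() {
+  local prefix unfiltered filtered
+  for prefix in Gene-families-uniref Gene-families-KO Pathway-abundances; do
+    unfiltered="${prefix}_unfiltered_GLlbsMetag.tsv"
+    filtered="${prefix}_filtered_GLlbsMetag.tsv"
+    if [ -f "${unfiltered}" ] && [ -f "${filtered}" ]; then
+      printf '%s\t%s\t%s\n' "${prefix}" \
+        "$(count_features "${unfiltered}")" "$(count_features "${filtered}")"
+    else
+      # One or both tables are absent from the current directory.
+      printf '%s\tmissing\tmissing\n' "${prefix}"
+    fi
+  done
+}
+
+report_filtering
+```
+
+Each output row lists the table type, the number of features before filtering, and the number after filtering, separated by tabs.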
+
+#### 8m. Create Humann Function Heatmaps
+
+```R
+metadata_table <- "/path/to/sample/metadata"
+
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Gene-families-uniref_unfiltered_GLlbsMetag.tsv", 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Gene-families-uniref_unfiltered", 
+             assay_suffix = "_GLlbsMetag", 
+             custom_palette = custom_palette)
+
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Gene-families-uniref_filtered_GLlbsMetag.tsv", 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Gene-families-uniref_filtered", 
+             assay_suffix = "_GLlbsMetag", 
+             custom_palette = custom_palette)
+
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Gene-families-KO_unfiltered_GLlbsMetag.tsv", 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Gene-families-KO_unfiltered", 
+             assay_suffix = "_GLlbsMetag", 
+             custom_palette = custom_palette)
+
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Gene-families-KO_filtered_GLlbsMetag.tsv", 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Gene-families-KO_filtered", 
+             assay_suffix = "_GLlbsMetag", 
+             custom_palette = custom_palette)
+
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Pathway-abundances_unfiltered_GLlbsMetag.tsv", 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Pathway-abundances_unfiltered", 
+             assay_suffix = "_GLlbsMetag", 
+             custom_palette = custom_palette)
+
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Pathway-abundances_filtered_GLlbsMetag.tsv", 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Pathway-abundances_filtered", 
+             assay_suffix = "_GLlbsMetag", 
+             custom_palette = custom_palette)
+```
+
+**Custom Functions Used:**
+- [make_heatmap()](#make_heatmap)
+
+**Parameter Definitions:**
+
+- `metadata_table` - path to a file containing group information for each sample in the feature tables
+
+**Input Data:**
+
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
+- `Gene-families-uniref_unfiltered_GLlbsMetag.tsv` (gene-family abundances table, output from [Step 8l](#8l-filter-humann-output))
+- `Gene-families-KO_unfiltered_GLlbsMetag.tsv` (KO term abundances table, output from [Step 8l](#8l-filter-humann-output))
+- `Pathway-abundances_unfiltered_GLlbsMetag.tsv` (pathway abundances table, output from [Step 8l](#8l-filter-humann-output))
+- `Gene-families-uniref_filtered_GLlbsMetag.tsv` (filtered gene-family abundances table, output from [Step 8l](#8l-filter-humann-output))
+- `Gene-families-KO_filtered_GLlbsMetag.tsv` (filtered KO term abundances table, output from [Step 8l](#8l-filter-humann-output))
+- `Pathway-abundances_filtered_GLlbsMetag.tsv` (filtered pathway abundances table, output from [Step 8l](#8l-filter-humann-output))
+
+**Output Data:**
+
+- **Gene-families-uniref_unfiltered_heatmap_GLlbsMetag.png** (gene family abundances heatmap without filtering)
+- **Gene-families-uniref_filtered_heatmap_GLlbsMetag.png** (gene family abundances heatmap after filtering low-abundance features)
+- **Gene-families-KO_unfiltered_heatmap_GLlbsMetag.png** (KO term abundances heatmap without filtering)
+- **Gene-families-KO_filtered_heatmap_GLlbsMetag.png** (KO term abundances heatmap after filtering low-abundance features)
+- **Pathway-abundances_unfiltered_heatmap_GLlbsMetag.png** (pathway abundances heatmap without filtering)
+- **Pathway-abundances_filtered_heatmap_GLlbsMetag.png** (pathway abundances heatmap after filtering low-abundance features)
+- **Gene-families-uniref_unfiltered_top_50_heatmap_GLlbsMetag.png** (heatmap of the top 50 gene families without filtering)
+- **Gene-families-uniref_filtered_top_50_heatmap_GLlbsMetag.png** (heatmap of the top 50 gene families after filtering low-abundance features)
+- **Gene-families-KO_unfiltered_top_50_heatmap_GLlbsMetag.png** (heatmap of the top 50 KO terms without filtering)
+- **Gene-families-KO_filtered_top_50_heatmap_GLlbsMetag.png** (heatmap of the top 50 KO terms after filtering low-abundance features)
+- **Pathway-abundances_unfiltered_top_50_heatmap_GLlbsMetag.png** (heatmap of the top 50 pathways without filtering)
+- **Pathway-abundances_filtered_top_50_heatmap_GLlbsMetag.png** (heatmap of the top 50 pathways after filtering low-abundance features)
+
+#### 8n. Humann Feature Decontamination
+
+> Note: The decontaminated feature tables and heatmaps are only generated if one or more contaminants are detected.
+
+```R
+metadata_table <- "/path/to/sample/metadata"
+uniref_table_file <- "Gene-families-uniref_filtered_GLlbsMetag.tsv"
+KO_table_file <- "Gene-families-KO_filtered_GLlbsMetag.tsv"
+pathway_table_file <- "Pathway-abundances_filtered_GLlbsMetag.tsv"
+
+# Gene-families-uniref
+feature_decontam(metadata_file = metadata_table, 
+                feature_table_file = uniref_table_file, 
+                feature_column = "Uniref90", 
+                samples_column = "sample_id",
+                prevalence_column = "NTC", 
+                ntc_name = "true", 
+                frequency_column = "concentration", 
+                threshold = 0.5, 
+                classification_method = "Gene-families-uniref", 
+                output_prefix = "Gene-families-uniref", 
+                assay_suffix = "_GLlbsMetag")
+
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Gene-families-uniref_decontam_species_table_GLlbsMetag.tsv", 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Gene-families-uniref_decontam", 
+             assay_suffix = "_GLlbsMetag", 
+             custom_palette = custom_palette)
+
+# Gene-families-KO
+feature_decontam(metadata_file = metadata_table, 
+                feature_table_file = KO_table_file, 
+                feature_column = "KO", 
+                samples_column = "sample_id",
+                prevalence_column = "NTC", 
+                ntc_name = "true", 
+                frequency_column = "concentration", 
+                threshold = 0.5, 
+                classification_method = "Gene-families-KO", 
+                output_prefix = "Gene-families-KO", 
+                assay_suffix = "_GLlbsMetag")
+
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Gene-families-KO_decontam_species_table_GLlbsMetag.tsv", 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Gene-families-KO_decontam", 
+             assay_suffix = "_GLlbsMetag", 
+             custom_palette = custom_palette)
+
+# Pathway-abundances
+feature_decontam(metadata_file = metadata_table, 
+                feature_table_file = pathway_table_file, 
+                feature_column = "Pathway", 
+                samples_column = "sample_id",
+                prevalence_column = "NTC", 
+                ntc_name = "true", 
+                frequency_column = "concentration", 
+                threshold = 0.5, 
+                classification_method = "Pathway-abundances", 
+                output_prefix = "Pathway-abundances", 
+                assay_suffix = "_GLlbsMetag")
+
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Pathway-abundances_decontam_species_table_GLlbsMetag.tsv", 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Pathway-abundances_decontam", 
+             assay_suffix = "_GLlbsMetag", 
+             custom_palette = custom_palette)
+```
+
+**Custom Functions Used:**
+- [feature_decontam()](#feature_decontam)
+- [make_heatmap()](#make_heatmap)
+
+**Parameter Definitions:**
+
+- `metadata_table` - path to a file with samples as rows and columns describing each sample
+- `feature_table_file` - path to a tab-separated feature table (i.e., a species or functions table)
+                          with features in the first column and samples in the remaining columns
+
+**Input Data:**
+
+- `Gene-families-uniref_filtered_GLlbsMetag.tsv` (filtered gene-family abundances table, output from [Step 8l](#8l-filter-humann-output))
+- `Gene-families-KO_filtered_GLlbsMetag.tsv` (filtered KO term abundances table, output from [Step 8l](#8l-filter-humann-output))
+- `Pathway-abundances_filtered_GLlbsMetag.tsv` (filtered pathway abundances table, output from [Step 8l](#8l-filter-humann-output))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
 
 **Output Data:**
 
-- **Metaphlan_decontam_results_GLlbsMetag.csv** (decontam's result table, output from [feature_decontam() function](#feature_decontam))
-- **Metaphlan_decontam_species_table_GLlbsMetag.csv** (decontaminated species table, output from [feature_decontam() function](#feature_decontam))
-- Metaphlan_decontam_species_barplot_GLlbsMetag.png (barplot after filtering out contaminants)
-- **Metaphlan_decontam_species_barplot_GLlbsMetag.html** (barplot after filtering out contaminants)
+- **Gene-families-uniref_decontam_results_GLlbsMetag.tsv** (decontam's result table for gene-family abundances, output from [feature_decontam()](#feature_decontam))
+- **Gene-families-uniref_decontam_species_table_GLlbsMetag.tsv** (decontaminated gene-family abundances table, output from [feature_decontam()](#feature_decontam))
+- **Gene-families-uniref_decontam_species_heatmap_GLlbsMetag.png** (heatmap of decontaminated gene-family abundances, output from [make_heatmap()](#make_heatmap))
+- **Gene-families-KO_decontam_results_GLlbsMetag.tsv** (decontam's result table for KO term abundances, output from [feature_decontam()](#feature_decontam))
+- **Gene-families-KO_decontam_species_table_GLlbsMetag.tsv** (decontaminated KO term abundances table, output from [feature_decontam()](#feature_decontam))
+- **Gene-families-KO_decontam_species_heatmap_GLlbsMetag.png** (heatmap of decontaminated KO term abundances, output from [make_heatmap()](#make_heatmap))
+- **Pathway-abundances_decontam_results_GLlbsMetag.tsv** (decontam's result table for pathway abundances, output from [feature_decontam()](#feature_decontam))
+- **Pathway-abundances_decontam_species_table_GLlbsMetag.tsv** (decontaminated pathway abundances table, output from [feature_decontam()](#feature_decontam))
+- **Pathway-abundances_decontam_species_heatmap_GLlbsMetag.png** (heatmap of decontaminated pathway abundances, output from [make_heatmap()](#make_heatmap))
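+
+Because the decontaminated tables are only written when contaminants are detected, it can help to check which ones exist before using them downstream. The following is a minimal sketch, assuming the table names follow the `feature_decontam()` calls above and the files are in the current directory; `check_decontam_outputs` is a hypothetical helper, not a pipeline function.
+
+```bash
+# Report, for each Humann table type, whether a decontaminated table was written.
+check_decontam_outputs() {
+  local prefix table
+  for prefix in Gene-families-uniref Gene-families-KO Pathway-abundances; do
+    table="${prefix}_decontam_species_table_GLlbsMetag.tsv"
+    # -s is true only for files that exist and are not empty.
+    if [ -s "${table}" ]; then
+      printf '%s\tdecontaminated table present\n' "${prefix}"
+    else
+      printf '%s\tno decontaminated table found\n' "${prefix}"
+    fi
+  done
+}
+
+check_decontam_outputs
+```
+
+A missing table is consistent with no contaminants having been detected, but it can also mean the step was not run, so the decontam results table is worth checking as well.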
 
 <br>
 
@@ -2855,7 +2974,7 @@ htmlwidgets::saveWidget(ggplotly(p), glue("Metaphlan_decontam_species_barplot_GL
 ## Assembly-based Processing
 
 
-### 10. Sample Assembly
+### 9. Sample Assembly
 
 ```
 megahit -1 sample1_R1_decontam_GLlbsMetag.fastq.gz -2 sample1_R2_decontam_GLlbsMetag.fastq.gz \
@@ -2875,7 +2994,7 @@ megahit -1 sample1_R1_decontam_GLlbsMetag.fastq.gz -2 sample1_R2_decontam_GLlbsM
 
 - *_R[12]_decontam_GLlbsMetag.fastq.gz or *_R[12]_HostRm_GLlbsMetag.fastq.gz (filtered and trimmed sample reads with both 
     contaminants and human reads (and, optionally, host reads) removed, gzipped fasta file, 
-    output from [Step 4b](#4b-build-contaminant-index-and-map-reads) or [Step 5b](#5b-remove-host-reads))
+    output from [Step 3b](#3b-build-contaminant-index-and-map-reads) or [Step 4b](#4b-remove-host-reads))
 
 **Output data:**
 
@@ -2886,14 +3005,14 @@ megahit -1 sample1_R1_decontam_GLlbsMetag.fastq.gz -2 sample1_R2_decontam_GLlbsM
 
 ---
 
-### 11. Rename Contigs and Summarize Assemblies
+### 10. Rename Contigs and Summarize Assemblies
 
-#### 11a. Rename Contig Headers
+#### 10a. Rename Contig Headers
 
 ```bash
 bit-rename-fasta-headers -i sample1/final.contigs.fasta \
                          -w c_sample \
-                         -o sample_assembly_GLlbsMetag.fasta
+                         -o sample-assembly_GLlbsMetag.fasta
 ```
 
 **Parameter Definitions:**  
@@ -2905,7 +3024,7 @@ bit-rename-fasta-headers -i sample1/final.contigs.fasta \
 
 **Input Data:**
 
-- sample1/final.contigs.fasta (assembly file from [Step 10](#10-sample-assembly))
+- sample1/final.contigs.fasta (assembly file from [Step 9](#9-sample-assembly))
 
 **Output files:**
 
@@ -2917,6 +3036,14 @@ bit-rename-fasta-headers -i sample1/final.contigs.fasta \
 ```bash
 bit-summarize-assembly -o assembly-summaries_GLlbsMetag.tsv \
                        *-assembly_GLlbsMetag.fasta
+
+# flag assembly fasta files that contain no contigs (-s is true only for non-empty files)
+for assembly_file in *-assembly_GLlbsMetag.fasta; do
+  sample_id="${assembly_file%-assembly_GLlbsMetag.fasta}"
+  if [ ! -s "${assembly_file}" ]; then
+    printf '%s\tNo contigs assembled\n' "${sample_id}" >> Failed-assemblies_GLlbsMetag.tsv
+  fi
+done
 ```
 
 **Parameter Definitions:**  
@@ -2926,19 +3053,20 @@ bit-summarize-assembly -o assembly-summaries_GLlbsMetag.tsv \
 
 **Input Data:**
 
-- *-assembly_GLlbsMetag.fasta (contig-renamed assembly files from [Step 11a](#11a-renaming-contig-headers))
+- *-assembly_GLlbsMetag.fasta (contig-renamed assembly files from [Step 10a](#10a-rename-contig-headers))
 
 **Output files:**
 
 - **assembly-summaries_GLlbsMetag.tsv** (table of assembly summary statistics)
+- **Failed-assemblies_GLlbsMetag.tsv** (list of samples with no assembled contigs. Only present if no contigs were generated for at least one sample.)
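+
+One possible way to use `Failed-assemblies_GLlbsMetag.tsv` downstream is to skip the samples listed in it. The sketch below is illustrative only: `assembly_failed` and `list_assembly_status` are hypothetical helper names, and whether later steps should skip such samples is an assumption rather than something this document specifies.
+
+```bash
+# Succeed if the given sample ID appears in the first column of the failed-assemblies list.
+assembly_failed() {
+  [ -f Failed-assemblies_GLlbsMetag.tsv ] &&
+    cut -f1 Failed-assemblies_GLlbsMetag.tsv | grep -qxF -- "$1"
+}
+
+# Print each sample ID with a "skip" or "process" label.
+list_assembly_status() {
+  local assembly_file sample_id
+  for assembly_file in *-assembly_GLlbsMetag.fasta; do
+    # Skip the unexpanded pattern when no assembly files are present.
+    [ -e "${assembly_file}" ] || continue
+    sample_id="${assembly_file%-assembly_GLlbsMetag.fasta}"
+    if assembly_failed "${sample_id}"; then
+      printf '%s\tskip\n' "${sample_id}"
+    else
+      printf '%s\tprocess\n' "${sample_id}"
+    fi
+  done
+}
+
+list_assembly_status
+```
+
+The whole-line, fixed-string match (`grep -xF`) keeps a sample ID such as `sample1` from also matching `sample10`.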
 
 <br>
 
 ---
 
-### 12. Gene Prediction
+### 11. Gene Prediction
 
-#### 12a. Generate Gene Predictions
+#### 11a. Generate Gene Predictions
 
 ```bash
 prodigal -a sample-genes.faa \
@@ -2964,7 +3092,7 @@ prodigal -a sample-genes.faa \
 
 **Input Data:**
 
-- sample-assembly_GLlbsMetag.fasta (contig-renamed assembly file from [Step 11a](#11a-renaming-contig-headers))
+- sample-assembly_GLlbsMetag.fasta (contig-renamed assembly file from [Step 10a](#10a-rename-contig-headers))
 
 **Output Data:**
 
@@ -2974,7 +3102,7 @@ prodigal -a sample-genes.faa \
 
 <br>
 
-#### 12b. Remove Line Wraps In Gene Prediction Output
+#### 11b. Remove Line Wraps In Gene Prediction Output
 
 ```bash
 bit-remove-wraps sample-genes.faa > sample-genes.faa.tmp 2> /dev/null
@@ -2986,8 +3114,8 @@ mv sample-genes.fasta.tmp sample-genes_GLlbsMetag.fasta
 
 **Input Data:**
 
-- sample-genes.faa (gene-calls amino-acid fasta file, output from [Step 12a](#12a-gene-prediction))
-- sample-genes.fasta (gene-calls nucleotide fasta file, output from [Step 12a](#12a-gene-prediction))
+- sample-genes.faa (gene-calls amino-acid fasta file, output from [Step 11a](#11a-generate-gene-predictions))
+- sample-genes.fasta (gene-calls nucleotide fasta file, output from [Step 11a](#11a-generate-gene-predictions))
 
 **Output Data:**
 
@@ -2998,15 +3126,15 @@ mv sample-genes.fasta.tmp sample-genes_GLlbsMetag.fasta
 
 ---
 
-### 13. Functional Annotation
+### 12. Functional Annotation
 
 > **Note:**  
 > The annotation process overwrites the same temporary directory by default. When running multiple 
-processses at a time, it is necessary to specify a specific temporary directory with the 
+processes at a time, it is necessary to specify a specific temporary directory with the 
 `--tmp-dir` argument as shown below.
 
 
-#### 13a. Download Reference Database of HMM Models
+#### 12a. Download Reference Database of HMM Models
 
 > **Note:** This step only needs to be done once.
 
@@ -3017,7 +3145,7 @@ tar -xzvf profiles.tar.gz
 gunzip ko_list.gz 
 ```
 
-#### 13b. Run KEGG Annotation
+#### 12b. Run KEGG Annotation
 
 ```bash
 exec_annotation -p profiles/ \
@@ -3027,7 +3155,7 @@ exec_annotation -p profiles/ \
                 -o sample-KO-tab.tmp \
                 --tmp-dir sample-tmp-KO \
                 --report-unannotated \
-                sample-genes.faa 
+                sample-genes_GLlbsMetag.faa 
 ```
 
 **Parameter Definitions:**
@@ -3039,21 +3167,21 @@ exec_annotation -p profiles/ \
 - `-o` – Specifies the output file name.
 - `--tmp-dir` – Specifies the temporary directory to write to (needed if running more than one process concurrently, see Note above).
 - `--report-unannotated` – Specifies to generate an output for each entry, event when no KO is assigned.
-- `sample-genes.faa` – Specifies the input file, provided as a positional argument. 
+- `sample-genes_GLlbsMetag.faa` – Specifies the input file, provided as a positional argument. 
 
 
 **Input Data:**
 
-- sample-genes.faa (amino-acid fasta file, output from [Step 12b](#12b-remove-line-wraps-in-gene-prediction-output))
-- profiles/ (reference directory holding the KO HMMs, downloaded in [Step 13a](#13a-download-reference-database-of-hmm-models))
-- ko_list (reference list of KOs to scan for, downloaded in [Step 13a](#13a-download-reference-database-of-hmm-models))
+- sample-genes_GLlbsMetag.faa (amino-acid fasta file, output from [Step 11b](#11b-remove-line-wraps-in-gene-prediction-output))
+- profiles/ (reference directory holding the KO HMMs, downloaded in [Step 12a](#12a-download-reference-database-of-hmm-models))
+- ko_list (reference list of KOs to scan for, downloaded in [Step 12a](#12a-download-reference-database-of-hmm-models))
 
 **Output Data:**
 
 - sample-KO-tab.tmp (table of KO annotations assigned to gene IDs)
 
 
-#### 13c. Filter KO Outputs
+#### 12c. Filter KO Outputs
 *Filter KO outputs to retain only those passing the KO-specific score and top hits.*
 
 ```bash
@@ -3071,7 +3199,7 @@ rm -rf sample-tmp-KO/ sample-KO-annots.tmp
 
 **Input Data:**
 
-- sample-KO-tab.tmp (table of KO annotations assigned to gene IDs, output from [Step 14b](#14b-run-kegg-annotation))
+- sample-KO-tab.tmp (table of KO annotations assigned to gene IDs, output from [Step 12b](#12b-run-kegg-annotation))
 
 **Output Data:**
 
@@ -3081,9 +3209,9 @@ rm -rf sample-tmp-KO/ sample-KO-annots.tmp
 
 ---
 
-### 14. Taxonomic Classification 
+### 13. Taxonomic Classification
 
-#### 14a. Pull and Unpack Pre-built Reference DB 
+#### 13a. Pull and Unpack Pre-built Reference DB
 
 > **Note:** This step only needs to be done once.
 
@@ -3092,13 +3220,13 @@ wget tbb.bio.uu.nl/bastiaan/CAT_prepare/CAT_prepare_20200618.tar.gz
 tar -xvzf CAT_prepare_20200618.tar.gz
 ```
 
-#### 14b. Run Taxonomic Classification
+#### 13b. Run Taxonomic Classification
 
 ```bash
-CAT contigs -c sample-assembly.fasta \
+CAT contigs -c sample-assembly_GLlbsMetag.fasta \
             -d CAT_prepare_20200618/2020-06-18_database/ \
             -t CAT_prepare_20200618/2020-06-18_taxonomy/ \
-            -p sample-genes.faa \
+            -p sample-genes_GLlbsMetag.faa \
             -o sample-tax-out.tmp \
             -n NumberOfThreads \
             -r 3 \
@@ -3122,10 +3250,10 @@ CAT contigs -c sample-assembly.fasta \
 
 **Input Data:**
 
-- CAT_prepare_20200618/2020-06-18_database/ (directory holding the CAT reference sequence database, output from [Step 14a](14a-pull-and-unpack-pre-built-reference-db))
-- CAT_prepare_20200618/2020-06-18_taxonomy/ (directory holding the CAT reference taxonomy database, output from [Step 14a](14a-pull-and-unpack-pre-built-reference-db))
-- sample-assembly.fasta (contig-renamed assembly file from [Step 11a](#11a-rename-contig-headers)
-- sample-genes.faa (amino-acid fasta file, output from [Step 12b](#12b-remove-line-wraps-in-gene-prediction-output)
+- CAT_prepare_20200618/2020-06-18_database/ (directory holding the CAT reference sequence database, output from [Step 13a](#13a-pull-and-unpack-pre-built-reference-db))
+- CAT_prepare_20200618/2020-06-18_taxonomy/ (directory holding the CAT reference taxonomy database, output from [Step 13a](#13a-pull-and-unpack-pre-built-reference-db))
+- sample-assembly_GLlbsMetag.fasta (contig-renamed assembly file from [Step 10a](#10a-rename-contig-headers))
+- sample-genes_GLlbsMetag.faa (amino-acid fasta file, output from [Step 11b](#11b-remove-line-wraps-in-gene-prediction-output))
 
 **Output Data:**
 
@@ -3133,7 +3261,7 @@ CAT contigs -c sample-assembly.fasta \
 - sample-tax-out.tmp.contig2classification.txt (contig taxonomy file)
 
 
-#### 14c. Add Taxonomy Info From Taxids To Genes
+#### 13c. Add Taxonomy Info From Taxids To Genes
 
 ```bash
 CAT add_names -i sample-tax-out.tmp.ORF2LCA.txt \
@@ -3153,15 +3281,15 @@ CAT add_names -i sample-tax-out.tmp.ORF2LCA.txt \
 
 **Input Data:**
 
-- sample-tax-out.tmp.ORF2LCA.txt (gene-calls taxonomy file, output from [Step 14b](#14b-run-taxonomic-classification))
-- CAT_prepare_20200618/2020-06-18_taxonomy/ (directory holding the CAT reference taxonomy database, output from [Step 14a](#14a-pull-and-unpack-pre-built-reference-db))
+- sample-tax-out.tmp.ORF2LCA.txt (gene-calls taxonomy file, output from [Step 13b](#13b-run-taxonomic-classification))
+- CAT_prepare_20200618/2020-06-18_taxonomy/ (directory holding the CAT reference taxonomy database, output from [Step 13a](#13a-pull-and-unpack-pre-built-reference-db))
 
 **Output Data:**
 
 - sample-gene-tax-out.tmp (gene-calls taxonomy file with lineage info added)
 
 
-#### 14d. Add Taxonomy Info From Taxids To Contigs
+#### 13d. Add Taxonomy Info From Taxids To Contigs
 
 ```bash
 CAT add_names -i sample-tax-out.tmp.contig2classification.txt \
@@ -3181,15 +3309,15 @@ CAT add_names -i sample-tax-out.tmp.contig2classification.txt \
 
 **Input Data:**
 
-- sample-tax-out.tmp.contig2classification.txt (contig taxonomy file, output from [Step 14b](#14b-run-taxonomic-classification))
-- CAT_prepare_20200618/2020-06-18_taxonomy/ (directory holding the CAT reference taxonomy database, output from [Step 14a](#14a-pull-and-unpack-pre-built-reference-db))
+- sample-tax-out.tmp.contig2classification.txt (contig taxonomy file, output from [Step 13b](#13b-run-taxonomic-classification))
+- CAT_prepare_20200618/2020-06-18_taxonomy/ (directory holding the CAT reference taxonomy database, output from [Step 13a](#13a-pull-and-unpack-pre-built-reference-db))
 
 **Output Data:**
 
 - sample-contig-tax-out.tmp (contig taxonomy file with lineage info added)
 
 
-#### 14e. Format Gene-level Output With awk and sed
+#### 13e. Format Gene-level Output With awk and sed
 
 ```bash
 awk -F $'\t' ' BEGIN { OFS=FS } { if ( $3 == "lineage" ) { print $1,$3,$5,$6,$7,$8,$9,$10,$11 } \
@@ -3197,26 +3325,26 @@ awk -F $'\t' ' BEGIN { OFS=FS } { if ( $3 == "lineage" ) { print $1,$3,$5,$6,$7,
     { print $1,"NA","NA","NA","NA","NA","NA","NA","NA" } else { n=split($3,lineage,";"); \
     print $1,lineage[n],$5,$6,$7,$8,$9,$10,$11 } } ' sample-gene-tax-out.tmp | \
     sed 's/no support/NA/g' | sed 's/superkingdom/domain/' | sed 's/# ORF/gene_ID/' | \
-    sed 's/lineage/taxid/'  > sample-gene-tax-out.tsv
+    sed 's/lineage/taxid/'  > sample-gene-tax.tsv
 ```
 
 **Input Data:**
 
-- sample-gene-tax-out.tmp (gene-calls taxonomy file with lineage info added, output from [Step 14c](#14c-add-taxonomy-info-from-taxids-to-genes))
+- sample-gene-tax-out.tmp (gene-calls taxonomy file with lineage info added, output from [Step 13c](#13c-add-taxonomy-info-from-taxids-to-genes))
 
 **Output Data:**
 
-- sample-gene-tax-out.tsv (reformatted gene-calls taxonomy file with lineage info)
+- sample-gene-tax.tsv (reformatted gene-calls taxonomy file with lineage info)
 
 
-#### 14f. Format Contig-level Output With awk and sed
+#### 13f. Format Contig-level Output With awk and sed
 
 ```bash
 awk -F $'\t' ' BEGIN { OFS=FS } { if ( $2 == "classification" ) { print $1,$4,$6,$7,$8,$9,$10,$11,$12 } \
     else if ( $2 == "no taxid assigned" ) { print $1,"NA","NA","NA","NA","NA","NA","NA","NA" } \
     else { n=split($4,lineage,";"); print $1,lineage[n],$6,$7,$8,$9,$10,$11,$12 } } ' sample-contig-tax-out.tmp | \
     sed 's/no support/NA/g' | sed 's/superkingdom/domain/' | sed 's/^# contig/contig_ID/' | \
-    sed 's/lineage/taxid/' > sample-contig-tax-out.tsv
+    sed 's/lineage/taxid/' > sample-contig-tax.tsv
 
   # clearing intermediate files
 rm sample*.tmp*
@@ -3224,19 +3352,19 @@ rm sample*.tmp*
 
 **Input Data:**
 
-- sample-contig-tax-out.tmp (contig taxonomy file with lineage info added, output from [Step 14d](#14d-add-taxonomy-info-from-taxids-to-contigs))
+- sample-contig-tax-out.tmp (contig taxonomy file with lineage info added, output from [Step 13d](#13d-add-taxonomy-info-from-taxids-to-contigs))
 
 **Output Data:**
 
-- sample-contig-tax-out.tsv (reformatted contig taxonomy file with lineage info)
+- sample-contig-tax.tsv (reformatted contig taxonomy file with lineage info)
 
 <br>
 
 ---
 
-### 15. Read-Mapping
+### 14. Read-Mapping
 
-#### 15a. Build reference index
+#### 14a. Build reference index
 
 ```
 bowtie2-build sample1_assembly_GLlbsMetag.fasta sample1-index
@@ -3249,13 +3377,13 @@ bowtie2-build sample1_assembly_GLlbsMetag.fasta sample1-index
 
 **Input Data:**
 
-- `sample1_assembly.fasta` (contig-renamed assembly file, output from [Step 11a](#11a-rename-contig-headers))
+- `sample1-assembly_GLlbsMetag.fasta` (contig-renamed assembly file, output from [Step 10a](#10a-rename-contig-headers))
 
 **Output Data:**
 
 - `sample1-index*` - the bowtie2 index files
 
-#### 15b. Align Reads to Sample Assembly
+#### 14b. Align Reads to Sample Assembly
 
 ```bash
 bowtie2 --mm --quiet --threads ${task.cpus} \
@@ -3274,15 +3402,15 @@ bowtie2 --mm --quiet --threads ${task.cpus} \
 - `-2` – specifies the reverse reads to map
 - `--no-unal` - Suppress SAM records for reads that did not align.
 - `> sample1.sam` - Redirects the output of the map reads command to a SAM file.
-- `2> sample1-mapping-info.txt` – capture the printed summary results in a log file
+- `2> sample1-mapping-info_GLlbsMetag.txt` – capture the printed summary results in a log file
 
 
 **Input Data**
 
-- sample1-index (bowti2 index files, output from [Step 15a](#15a-build-reference-index))
+- sample1-index (bowtie2 index files, output from [Step 14a](#14a-build-reference-index))
 - *_R[12]_decontam_GLlbsMetag.fastq.gz or *_R[12]_HostRm_GLlbsMetag.fastq.gz (filtered and trimmed sample reads with both 
     contaminants and human reads (and, optionally, host reads) removed, gzipped fasta file, 
-    output from [Step 4b](#4b-build-contaminant-index-and-map-reads) or [Step 5b](#5b-remove-host-reads))
+    output from [Step 3b](#3b-build-contaminant-index-and-map-reads) or [Step 4b](#4b-remove-host-reads))
 
 **Output Data**
 
@@ -3290,15 +3418,13 @@ bowtie2 --mm --quiet --threads ${task.cpus} \
 - **sample-mapping-info_GLlbsMetag.txt** (read mapping information)
 
 
-#### 15c. Sort and Index Assembly Alignments
+#### 14c. Sort Assembly Alignments
 
 ```bash
 # Sort Sam, convert to bam and create index
 samtools sort --threads NumberOfThreads \
-              -o sample_sorted.bam \
+              -o sample_GLlbsMetag.bam \
               sample.sam > sample_sort.log 2>&1
-
-samtools index sample_sorted.bam sample_sorted.bam.bai
 ```
 
 **Parameter Definitions:**
@@ -3308,35 +3434,31 @@ samtools index sample_sorted.bam sample_sorted.bam.bai
 - `-o` - Specifies the output file for the sorted aligned reads.
 - `sample.sam` - Positional argument specifying the input SAM file.
 - `> sample_sort.log 2>&1` - Redirects the standard output and standard error to a separate file.
-*samtools index*
-- `sample_sorted.bam` - Positional argument specifying the input BAM file to be sorted.
-- `sample_sorted.bam.bai` - Positional argument specifying the name of the index file.
 
 **Input Data:**
 
-- sample.sam (reads aligned to sample assembly, output from [Step 15b](#15b-align-reads-to-sample-assembly))
+- sample.sam (reads aligned to sample assembly, output from [Step 14b](#14b-align-reads-to-sample-assembly))
 
 **Output Data:**
 
-- **sample_sorted_GLlbsMetag.bam** (sorted mapping to sample assembly, in BAM format)
-- **sample_sorted_GLlbsMetag.bam.bai** (index of sorted mapping to sample assembly)
+- **sample_GLlbsMetag.bam** (sorted mapping to sample assembly, in BAM format)
 
 <br>
 
 ---
 
-### 16. Get Coverage Information and Filter Based On Detection
+### 15. Get Coverage Information and Filter Based On Detection
 > **Note:**  
 > “Detection” is a measure of what proportion of a reference sequence recruited reads 
 (see the discussion of detection [here](http://merenlab.org/2017/05/08/anvio-views/#detection)). 
 Filtering based on detection is one way of helping to mitigate non-specific read-recruitment.
 
-#### 16a. Filter Coverage Levels Based On Detection
+#### 15a. Filter Coverage Levels Based On Detection
 
 ```bash
 # pileup.sh comes from the bbduk.sh package
-pileup.sh -in sample.bam \
-          fastaorf=sample-genes.fasta \
+pileup.sh -in sample_GLlbsMetag.bam \
+          fastaorf=sample-genes_GLlbsMetag.fasta \
           outorf=sample-gene-cov-and-det.tmp \
           out=sample-contig-cov-and-det.tmp
 ```
@@ -3350,8 +3472,8 @@ pileup.sh -in sample.bam \
 
 **Input Data:**
 
-- sample.bam (sorted mapping to sample assembly BAM file, output from [Step 15c](#15c-sort-and-index-assembly-alignments))
-- sample-genes.fasta (gene-calls nucleotide fasta file, output from [Step 12a](#12-gene-prediction))
+- sample_GLlbsMetag.bam (sorted mapping to sample assembly BAM file, output from [Step 14c](#14c-sort-assembly-alignments))
+- sample-genes_GLlbsMetag.fasta (gene-calls nucleotide fasta file, output from [Step 11b](#11b-remove-line-wraps-in-gene-prediction-output))
 
 
 **Output Data:**
@@ -3360,7 +3482,7 @@ pileup.sh -in sample.bam \
 - sample-contig-cov-and-det.tmp (contig-coverage tsv file)
 
 
-#### 16b. Filter Gene and Contig Coverage Based On Detection
+#### 15b. Filter Gene and Contig Coverage Based On Detection
 
 > *The following commands filter gene and contig coverage tsv files to only keep genes and contigs with at least 50% detection (as defined above) then parse the tables to retain only gene IDs and respective coverage.*
 
@@ -3370,14 +3492,14 @@ grep -v "#" sample-gene-cov-and-det.tmp | \
 awk -F $'\t' ' BEGIN { OFS=FS } { if ( $10 <= 0.5 ) $4 = 0 } \
      { print $1,$4 } ' > sample-gene-cov.tmp
 
-cat <( printf "gene_ID\tcoverage\n" ) sample-gene-cov.tmp > sample-gene-coverages_GLlbsMetag.tsv
+cat <( printf "gene_ID\tcoverage\n" ) sample-gene-cov.tmp > sample-gene-coverages.tsv
 
 # Filtering contig coverage
 grep -v "#" sample-contig-cov-and-det.tmp | \
 awk -F $'\t' ' BEGIN { OFS=FS } { if ( $5 <= 50 ) $2 = 0 } \
      { print $1,$2 } ' > sample-contig-cov.tmp
 
-cat <( printf "contig_ID\tcoverage\n" ) sample-contig-cov.tmp > sample-contig-coverages_GLlbsMetag.tsv
+cat <( printf "contig_ID\tcoverage\n" ) sample-contig-cov.tmp > sample-contig-coverages.tsv
 
 # removing intermediate files
 rm sample-*.tmp
@@ -3385,44 +3507,44 @@ rm sample-*.tmp
 
 **Input Data:**
 
-- sample-gene-cov-and-det.tmp (temporary gene-coverage tsv file, output from [Step 16a](#16a-filter-coverage-levels-based-on-detection))
-- sample-contig-cov-and-det.tmp (temporary contig-coverage tsv file, output from [Step 16a](#16a-filter-coverage-levels-based-on-detection))
+- sample-gene-cov-and-det.tmp (temporary gene-coverage tsv file, output from [Step 15a](#15a-filter-coverage-levels-based-on-detection))
+- sample-contig-cov-and-det.tmp (temporary contig-coverage tsv file, output from [Step 15a](#15a-filter-coverage-levels-based-on-detection))
 
 **Output Data:**
 
-- sample-gene-coverages_GLlbsMetag.tsv (table with gene-level coverages)
-- sample-contig-coverages_GllbsMetag.tsv (table with contig-level coverages)
+- sample-gene-coverages.tsv (table with gene-level coverages)
+- sample-contig-coverages.tsv (table with contig-level coverages)
 
 <br>
 
 ---
 
-### 17. Combine Gene-level Coverage, Taxonomy, and Functional Annotations For Each Sample
+### 16. Combine Gene-level Coverage, Taxonomy, and Functional Annotations For Each Sample
 > **Note:**  
 > Just uses `paste`, `sed`, and `awk` standard Unix commands to combine gene-level coverage, taxonomy, and functional annotations into one table for each sample.  
 
 ```bash
 paste <( tail -n +2 sample-gene-coverages.tsv | sort -V -k 1 ) \
       <( tail -n +2 sample-annotations.tsv | sort -V -k 1 | cut -f 2- ) \
-      <( tail -n +2 sample-gene-tax-out.tsv | sort -V -k 1 | cut -f 2- ) \
+      <( tail -n +2 sample-gene-tax.tsv | sort -V -k 1 | cut -f 2- ) \
       > sample-gene-tab.tmp
 
 paste <( head -n 1 sample-gene-coverages.tsv ) \
       <( head -n 1 sample-annotations.tsv | cut -f 2- ) \
-      <( head -n 1 sample-gene-tax-out.tsv | cut -f 2- ) \
+      <( head -n 1 sample-gene-tax.tsv | cut -f 2- ) \
       > sample-header.tmp
 
 cat sample-header.tmp sample-gene-tab.tmp > sample-gene-coverage-annotation-and-tax_GLlbsMetag.tsv
 
 # removing intermediate files
-rm sample*tmp sample-gene-coverages.tsv sample-annotations.tsv sample-gene-tax-out.tsv
+rm sample*tmp sample-gene-coverages.tsv sample-annotations.tsv sample-gene-tax.tsv
 ```
 
 **Input Data:**
 
-- sample-gene-coverages_GLlbsMetag.tsv (table with gene-level coverages, output from [Step 16b](#16b-filter-gene-and-contig-coverage-based-on-detection))
-- sample-annotations.tsv (table of KO annotations assigned to gene IDs, output from [Step 13c](#13c-filter-ko-outputs
-- sample-gene-tax-out.tsv (reformatted gene-calls taxonomy file with lineage info, output from [Step 14e](#14e-format-gene-level-output-with-awk-and-sed))
+- sample-gene-coverages.tsv (table with gene-level coverages, output from [Step 15b](#15b-filter-gene-and-contig-coverage-based-on-detection))
+- sample-annotations.tsv (table of KO annotations assigned to gene IDs, output from [Step 12c](#12c-filter-ko-outputs))
+- sample-gene-tax.tsv (reformatted gene-calls taxonomy file with lineage info, output from [Step 13e](#13e-format-gene-level-output-with-awk-and-sed))
 
 
 **Output Data:**
@@ -3433,29 +3555,29 @@ rm sample*tmp sample-gene-coverages.tsv sample-annotations.tsv sample-gene-tax-o
 
 ---
 
-### 18. Combine Contig-level Coverage and Taxonomy For Each Sample
+### 17. Combine Contig-level Coverage and Taxonomy For Each Sample
 > **Note:**  
 > Just uses `paste`, `sed`, and `awk` standard Unix commands to combine contig-level coverage and taxonomy into one table for each sample.
 
 ```bash
 paste <( tail -n +2 sample-contig-coverages.tsv | sort -V -k 1 ) \
-      <( tail -n +2 sample-contig-tax-out.tsv | sort -V -k 1 | cut -f 2- ) \
+      <( tail -n +2 sample-contig-tax.tsv | sort -V -k 1 | cut -f 2- ) \
       > sample-contig.tmp
 
 paste <( head -n 1 sample-contig-coverages.tsv ) \
-      <( head -n 1 sample-contig-tax-out.tsv | cut -f 2- ) \
+      <( head -n 1 sample-contig-tax.tsv | cut -f 2- ) \
       > sample-contig-header.tmp
       
 cat sample-contig-header.tmp sample-contig.tmp > sample-contig-coverage-and-tax_GLlbsMetag.tsv
 
 # removing intermediate files
-rm sample*tmp sample-contig-coverages.tsv sample-contig-tax-out.tsv
+rm sample*tmp sample-contig-coverages.tsv sample-contig-tax.tsv
 ```
 
 **Input Data:**
 
-- sample-contig-coverages.tsv (table with contig-level coverages, output from [Step 16b](#16b-filter-gene-and-contig-coverage-based-on-detection))
-- sample-contig-tax-out.tsv (reformatted contig taxonomy file with lineage info, output from [Step 14f](#14f-format-contig-level-output-with-awk-and-sed))
+- sample-contig-coverages.tsv (table with contig-level coverages, output from [Step 15b](#15b-filter-gene-and-contig-coverage-based-on-detection))
+- sample-contig-tax.tsv (reformatted contig taxonomy file with lineage info, output from [Step 13f](#13f-format-contig-level-output-with-awk-and-sed))
 
 **Output Data:**
 
@@ -3465,7 +3587,7 @@ rm sample*tmp sample-contig-coverages.tsv sample-contig-tax-out.tsv
 
 ---
 
-### 19. Generate Normalized, Gene- and Contig-level Coverage Summary Tables of KO-annotations and Taxonomy Across Samples
+### 18. Generate Normalized, Gene- and Contig-level Coverage Summary Tables of KO-annotations and Taxonomy Across Samples
 
 > **Note:**  
 > * To combine across samples to generate these summary tables, we need the same "units". This is done for annotations 
@@ -3477,7 +3599,7 @@ by the length of the gene). These have been normalized by making the total cover
 each individual gene-level coverage its proportion of that 1,000,000 total. So basically percent, but out of 1,000,000 
 instead of 100 to make the numbers more friendly. 
 
-#### 19a. Generate Gene-level Coverage Summary Tables
+#### 18a. Generate Gene-level Coverage Summary Tables
 
 ```bash
 bit-GL-combine-KO-and-tax-tables *-gene-coverage-annotation-and-tax_GLlbsMetag.tsv \
@@ -3499,7 +3621,7 @@ mv "Combined-gene-level-taxonomy-coverages.tsv Combined-gene-level-taxonomy-cove
 
 **Input Data:**
 
-- *-gene-coverage-annotation-and-tax_GLlbsMetag.tsv (tables with combined gene coverage, annotation, and taxonomy info generated for individual samples, output from [Step 17](#17-combine-gene-level-coverage-taxonomy-and-functional-annotations-for-each-sample))
+- *-gene-coverage-annotation-and-tax_GLlbsMetag.tsv (tables with combined gene coverage, annotation, and taxonomy info generated for individual samples, output from [Step 16](#16-combine-gene-level-coverage-taxonomy-and-functional-annotations-for-each-sample))
 
 **Output Data:**
 
@@ -3509,21 +3631,21 @@ mv "Combined-gene-level-taxonomy-coverages.tsv Combined-gene-level-taxonomy-cove
 - **Combined-gene-level-taxonomy-coverages_GLlbsMetag.tsv** (table with all samples combined based on gene-level taxonomic classifications)
 
 
-#### 19b. Generate Contig-level Coverage Summary Tables
+#### 18b. Generate Contig-level Coverage Summary Tables
 
 ```bash
-bit-GL-combine-contig-tax-tables *-contig-coverage-and-tax.tsv -o Combined
+bit-GL-combine-contig-tax-tables *-contig-coverage-and-tax_GLlbsMetag.tsv -o Combined
 ```
 
 **Parameter Definitions:**  
 
-- `*-contig-coverage-and-tax.tsv` - Positional arguments specifying the input tsv files, can be provided as a space-delimited list of files, or with wildcards like above.
+- `*-contig-coverage-and-tax_GLlbsMetag.tsv` - Positional arguments specifying the input tsv files, can be provided as a space-delimited list of files, or with wildcards like above.
 - `-o` – Specifies the output file prefix.
 
 
 **Input Data:**
 
-- *-contig-coverage-and-tax.tsv (tables with combined contig coverage and taxonomy info generated for individual samples, output from [Step 18](#18-combine-contig-level-coverage-and-taxonomy-for-each-sample))
+- *-contig-coverage-and-tax_GLlbsMetag.tsv (tables with combined contig coverage and taxonomy info generated for individual samples, output from [Step 17](#17-combine-contig-level-coverage-and-taxonomy-for-each-sample))
 
 **Output Data:**
 
@@ -3534,26 +3656,26 @@ bit-GL-combine-contig-tax-tables *-contig-coverage-and-tax.tsv -o Combined
 
 ---
 
-### 20. **M**etagenome-**A**ssembled **G**enome (MAG) Recovery
+### 19. **M**etagenome-**A**ssembled **G**enome (MAG) Recovery
 
-#### 20a. Bin Contigs
+#### 19a. Bin Contigs
 
 ```bash
-jgi_summarize_bam_contig_depths --outputDepth sample-metabat-assembly-depth.tsv \
+jgi_summarize_bam_contig_depths --outputDepth sample-metabat-assembly-depth_GLlbsMetag.tsv \
                                 --percentIdentity 97 \
                                 --minContigLength 1000 \
                                 --minContigDepth 1.0  \
-                                --referenceFasta sample-assembly.fasta \
-                                sample.bam
+                                --referenceFasta sample-assembly_GLlbsMetag.fasta \
+                                sample_GLlbsMetag.bam
 
-metabat2  --inFile sample-assembly.fasta \
+metabat2  --inFile sample-assembly_GLlbsMetag.fasta \
           --outFile sample \
-          --abdFile sample-metabat-assembly-depth.tsv \
+          --abdFile sample-metabat-assembly-depth_GLlbsMetag.tsv \
           -t NumberOfThreads
 
 mkdir sample-bins
 mv sample*bin*.fasta sample-bins
-zip -r sample-bins.zip sample-bins
+zip -r sample-bins_GLlbsMetag.zip sample-bins
 ```
 
 **Parameter Definitions:**  
@@ -3564,7 +3686,7 @@ zip -r sample-bins.zip sample-bins
 -  `--minContigLength` – Minimum contig length to include.
 -  `--minContigDepth` – Minimum contig depth to include.
 -  `--referenceFasta` – Specifies the input assembly fasta file.
--  `sample.bam` – Input alignment BAM file, specified as a positional argument.
+-  `sample_GLlbsMetag.bam` – Input alignment BAM file, specified as a positional argument.
 
 *metabat2*
 -  `--inFile` - Specifies the input assembly fasta file.
@@ -3575,17 +3697,17 @@ zip -r sample-bins.zip sample-bins
 
 **Input Data:**
 
-- sample-assembly.fasta (contig-renamed assembly file from [Step 11a](#11a-renaming-contig-headers))
-- sample.bam (sorted mapping to sample assembly BAM file, output from [Step 15c](#15c-sort-and-index-assembly-alignments))
+- sample-assembly_GLlbsMetag.fasta (contig-renamed assembly file from [Step 10a](#10a-rename-contig-headers))
+- sample_GLlbsMetag.bam (sorted mapping to sample assembly BAM file, output from [Step 14c](#14c-sort-assembly-alignments))
 
 **Output Data:**
 
-- **sample-metabat-assembly-depth.tsv** (tab-delimited summary of coverages)
+- **sample-metabat-assembly-depth_GLlbsMetag.tsv** (tab-delimited summary of coverages)
 - sample-bins/sample-bin\*.fasta (fasta files of recovered bins)
-- **sample-bins.zip** (zip file containing fasta files of recovered bins)
+- **sample-bins_GLlbsMetag.zip** (zip file containing fasta files of recovered bins)
 
-#### 20b. Bin quality assessment 
-> Utilizes the default `checkm` database available [here](https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz), `checkm_data_2015_01_16.tar.gz`.
+#### 19b. Bin quality assessment
+> Utilizes the default `checkm` database [checkm_data_2015_01_16.tar.gz](https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz).
 
 ```bash
 checkm lineage_wf -f bins-overview_GLlbsMetag.tsv \
@@ -3606,14 +3728,14 @@ checkm lineage_wf -f bins-overview_GLlbsMetag.tsv \
 
 **Input Data:**
 
-- sample-bins/sample-bin\*.fasta (fasta files of recovered bins, output from [Step 20a](#20a-bin-contigs))
+- sample-bins/sample-bin\*.fasta (fasta files of recovered bins, output from [Step 19a](#19a-bin-contigs))
 
 **Output Data:**
 
 - **bins-overview_GLlbsMetag.tsv** (tab-delimited file with quality estimates per bin)
 - checkm-output-dir/ (directory holding detailed checkm outputs)
 
-#### 20c. Filter MAGs
+#### 19c. Filter MAGs
 
 ```bash
 cat <( head -n 1 bins-overview_GLlbsMetag.tsv ) \
@@ -3640,7 +3762,7 @@ done
 
 **Input Data:**
 
-- bins-overview_GLlbsMetag.tsv (tab-delimited file with quality estimates per bin from [Step 20b](#20b-bin-quality-assessment))
+- bins-overview_GLlbsMetag.tsv (tab-delimited file with quality estimates per bin from [Step 19b](#19b-bin-quality-assessment))
 
 **Output Data:**
 
@@ -3649,7 +3771,7 @@ done
 - **\*-MAGs.zip** (zip files containing directories of high-quality MAGs)
 
 
-#### 20d. MAG Taxonomic Classification
+#### 19d. MAG Taxonomic Classification
 > Uses default `gtdbtk` database setup with program's `download.sh` command.
 
 ```bash
@@ -3669,13 +3791,13 @@ gtdbtk classify_wf --genome_dir MAGs/ \
 
 **Input Data:**
 
-- MAGs/\*.fasta (directory holding high-quality MAGs, output from [Step 20c](#20c-filter-mags))
+- MAGs/\*.fasta (directory holding high-quality MAGs, output from [Step 19c](#19c-filter-mags))
 
 **Output Data:**
 
 - gtdbtk-output-dir/gtdbtk.\*.summary.tsv (files with assigned taxonomy and info)
 
-#### 20e. Generate Overview Table Of All MAGs
+#### 19e. Generate Overview Table Of All MAGs
 
 ```bash
 # combine summaries
@@ -3715,10 +3837,10 @@ cat MAGs-overview-header.tmp MAGs-overview-sorted.tmp \
 
 **Input Data:**
 
-- assembly-summaries_GLlbsMetag.tsv (table of assembly summary statistics, output from [Step 11b](#11b-summarize-assemblies))
-- MAGs/\*.fasta (directory holding high-quality MAGs, output from [Step 20c](#20c-filter-mags))
-- checkm-MAGs-overview.tsv (tab-delimited file with quality estimates per MAG, output from [Step 20c](#20c-filter-mags))
-- gtdbtk-output-dir/gtdbtk.\*.summary.tsv (directory of files with assigned taxonomy and info, output from [Step 20d](#20d-mag-taxonomic-classification))
+- assembly-summaries_GLlbsMetag.tsv (table of assembly summary statistics, output from [Step 10b](#10b-summarize-assemblies))
+- MAGs/\*.fasta (directory holding high-quality MAGs, output from [Step 19c](#19c-filter-mags))
+- checkm-MAGs-overview.tsv (tab-delimited file with quality estimates per MAG, output from [Step 19c](#19c-filter-mags))
+- gtdbtk-output-dir/gtdbtk.\*.summary.tsv (directory of files with assigned taxonomy and info, output from [Step 19d](#19d-mag-taxonomic-classification))
 
 **Output Data:**
 
@@ -3728,10 +3850,10 @@ cat MAGs-overview-header.tmp MAGs-overview-sorted.tmp \
 
 ---
 
-### 21. Generate MAG-level Functional Summary Overview
+### 20. Generate MAG-level Functional Summary Overview
 
-#### 21a. Get KO Annotations Per MAG
-> This utilizes the helper script [`parse-MAG-annots.py`](../Workflow_Documentation/NF_MGIllumina/workflow_code/bin/parse-MAG-annots.py) 
+#### 20a. Get KO Annotations Per MAG
+> This utilizes the helper script [`parse-MAG-annots.py`](https://github.com/nasa/GeneLab_Metagenomics_Workflow/blob/DEV/bin/parse-MAG-annots.py) 
 
 ```bash
 for file in $( ls MAGs/*.fasta )
@@ -3761,15 +3883,15 @@ done
 
 **Input Data:**
 
-- \*-gene-coverage-annotation-and-tax.tsv (tables with combined gene coverage, annotation, and taxonomy info generated for individual samples, output from [Step 17](#17-combine-gene-level-coverage-taxonomy-and-functional-annotations-for-each-sample))
-- MAGs/\*.fasta (directory holding high-quality MAGs, output from [Step 20c](#20c-filter-mags))
+- \*-gene-coverage-annotation-and-tax_GLlbsMetag.tsv (tables with combined gene coverage, annotation, and taxonomy info generated for individual samples, output from [Step 16](#16-combine-gene-level-coverage-taxonomy-and-functional-annotations-for-each-sample))
+- MAGs/\*.fasta (directory holding high-quality MAGs, output from [Step 19c](#19c-filter-mags))
 
 **Output Data:**
 
 - **MAG-level-KO-annotations_GLlbsMetag.tsv** (tab-delimited table holding MAGs and their KO annotations)
 
 
-#### 21b. Summarize KO Annotations With KEGG-Decoder
+#### 20b. Summarize KO Annotations With KEGG-Decoder
 
 ```bash
 KEGG-decoder -v interactive \
@@ -3785,118 +3907,148 @@ KEGG-decoder -v interactive \
 
 **Input Data:**
 
-- MAG-level-KO-annotations_GLlbsMetag.tsv (tab-delimited table holding MAGs and their KO annotations, output from [Step 21a](#21a-getting-ko-annotations-per-mag))
+- MAG-level-KO-annotations_GLlbsMetag.tsv (tab-delimited table holding MAGs and their KO annotations, output from [Step 20a](#20a-get-ko-annotations-per-mag))
 
 **Output Data:**
 
 - **MAG-KEGG-Decoder-out_GLlbsMetag.tsv** (tab-delimited table holding MAGs and their proportions of 
                                            genes held known to be required for specific pathways/metabolisms)
-- **MAG-KEGG-Decoder-out_GLlbnMetag.html** (interactive heatmap html file of the above output table)
+- **MAG-KEGG-Decoder-out_GLlbsMetag.html** (interactive heatmap html file of the above output table)
 
 <br>
 
 ---
 
-### 22. Decontamination and Visualization of Contig- and Gene-taxonomy and Gene-function Outputs
+### 21. Filtering, Decontamination, and Visualization of Contig- and Gene-taxonomy and Gene-function Outputs
 
-#### 22a. Gene-level Taxonomy Heatmaps
+#### 21a. Gene-level Taxonomy Heatmaps
 
 ```R
-library(tidyverse)
+assembly_table <- "Combined-gene-level-taxonomy-coverages-CPM_GLlbsMetag.tsv"
+assembly_summary <- "assembly-summaries_GLlbsMetag.tsv"
+metadata_table <- "/path/to/sample/metadata"
 
-metadata_file <- "/path/to/sample/metadata"
-feature_data_file <- "Combined-gene-level-taxonomy-coverages-CPM_GLlbsMetag.tsv"
-
-# Prepare metadata
-metadata <- read_delim(metadata_file, delim = ",") %>% as.data.frame
-sample_names = metadata[, samples_column]
-row.names(metadata) <- sample_names
-
-# Prepare feature table
-gene_taxonomy_table <- read_assembly_coverage_table(feature_table_file, sample_names) %>% as.data.frame
-
-# Summarize gene table
-species_gene_table <- gene_taxonomy_table %>%
-  select(species, !!any_of(sample_names)) %>% 
-  group_by(species) %>% 
-  summarise(across(everything(), sum)) %>% 
-  filter(species != "Unclassified;_;_;_;_;_;_") %>% # Drop unclassifed
-  as.data.frame
+# Read in assembly summary table
+overview_table <- read_delim(assembly_summary, comment="#") %>%
+  select(
+    where(~all(!is.na(.)))
+  )
+
+col_names <- names(overview_table) %>% str_remove_all("-assembly")
+sample_order <- col_names[-1] %>% sort()
 
-rownames(species_gene_table) <- species_gene_table[[1]]
-species_gene_table <- species_gene_table[, -1] %>% as.matrix()
+# Read in the gene-level taxonomy table; duplicate species rows are summed together below
+df <- read_delim(assembly_table, comment = "#")
+sample_order <- get_samples(df, sample_order)
 
-# Get common samples and re-arrange feature table and metadata
-common_samples <- intersect(colnames(species_gene_table), rownames(metadata))
-species_gene_table <- species_gene_table[, common_samples]
-metadata <- metadata[common_samples, ]
-metadata <- metadata %>% arrange(!!sym(group_column))
+table2write <- read_taxonomy_table(df, sample_order) %>%
+               select(species, !!sample_order) %>%
+               group_by(species) %>%
+               summarise(across(everything(), sum)) %>%
+               filter(species != "Unclassified;_;_;_;_;_;_") %>%
+               as.data.frame()
 
-table2write = species_gene_table %>% as.data.frame %>% rownames_to_column("species")
 # Write out gene taxonomy table
-write_csv(x = table2write, file = "gene_taxonomy_table.csv")
+write_tsv(x = table2write, file = "Combined-gene-level-taxonomy_unfiltered_GLlbsMetag.tsv")
 
-make_heatmap(metadata, species_gene_table, 
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Combined-gene-level-taxonomy_unfiltered_GLlbsMetag.tsv", 
              samples_column="sample_id", group_column = "group", 
-             output_prefix = "Combined-gene-level-taxonomy", 
+             output_prefix = "Combined-gene-level-taxonomy_unfiltered", 
              assay_suffix = "_GLlbsMetag", 
              custom_palette = custom_palette)
-
 ```
 
 **Custom Functions Used:**
-- [read_assembly_coverage_table()](#read_assembly_coverage_table)
+- [get_samples()](#get_samples)
+- [read_taxonomy_table()](#read_taxonomy_table)
 - [make_heatmap()](#make_heatmap)
 
 **Input data:**
-- /path/to/sample/metadata (a file with samples as rows and columns describing each sample)
-- Combined-gene-level-taxonomy-coverages-CPM_GLlbsMetag.tsv (table with all samples 
-    combined based on gene-level taxonomic classifications, output from 
-    [Step 19a](#19a-generating-gene-level-coverage-summary-tables)) 
+- assembly-summaries_GLlbsMetag.tsv (table of assembly summary statistics, output from [Step 10b](#10b-summarize-assemblies))
+- Combined-gene-level-taxonomy-coverages-CPM_GLlbsMetag.tsv (table with all samples combined based on gene-level taxonomic classifications, output from [Step 18a](#18a-generate-gene-level-coverage-summary-tables))
+- /path/to/sample/metadata (a file containing sample-wise metadata, mapping sample names to group metadata)
 
 **Output data:**
-- gene_taxonomy_table.csv (aggregated gene taxonomy table with samples in columns and species in rows)
-- **Combined-gene-level-taxonomy_heatmap_GLlbsMetag.png** (heatmap of all gene taxonomy assignments)
+- Combined-gene-level-taxonomy_unfiltered_GLlbsMetag.tsv (aggregated gene-level taxonomy table with samples in columns and species in rows)
+- **Combined-gene-level-taxonomy_unfiltered_heatmap_GLlbsMetag.png** (heatmap of all gene-level taxonomy assignments, output from [make_heatmap()](#make_heatmap))
+- **Combined-gene-level-taxonomy_unfiltered_top_50_heatmap_GLlbsMetag.png** (heatmap of the top 50 gene-level taxonomy assignments, output from [make_heatmap()](#make_heatmap))
 
-#### 22b. Gene-level Taxonomy Decontamination
+#### 21b. Gene-level Taxonomy Feature Filtering
 
 ```R
-library(tidyverse)
-library(decontam)
-library(phyloseq)
-
-feature_table_file <- "gene_taxonomy_table.csv"
+feature_table_file <- "Combined-gene-level-taxonomy_unfiltered_GLlbsMetag.tsv"
 metadata_table <- "/path/to/sample/metadata"
-number_samples <- NumberOfSamples # integer indicating how many samples are in the file
+threshold <- 1000
+
+# read in feature table
+feature_table <- read_delim(feature_table_file) %>%
+                 mutate(across(where(is.numeric), function(col) replace_na(col, 0))) %>%
+                 as.data.frame()
+feature_name <- colnames(feature_table)[1]
+rownames(feature_table) <- feature_table[,1]
+feature_table <- feature_table[, -1]
+
+table2write <- get_abundant_features(feature_table, cpm_threshold=threshold) %>%
+               as.data.frame() %>%
+               rownames_to_column(feature_name)
+
+write_tsv(x = table2write, file = "Combined-gene-level-taxonomy_filtered_GLlbsMetag.tsv")
+
+
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Combined-gene-level-taxonomy_filtered_GLlbsMetag.tsv", 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Combined-gene-level-taxonomy_filtered", 
+             assay_suffix = "_GLlbsMetag", 
+             custom_palette = custom_palette)
+
+```
+
+**Custom Functions Used:**
+- [get_abundant_features()](#get_abundant_features)
+- [make_heatmap()](#make_heatmap)
+
+**Parameter Definitions:**
+
+- `feature_table_file` - path to a tab-separated feature table containing gene-level coverage data, with
+                         species/functions as the first column and samples as the other columns.
+- `metadata_table` - path to a file with samples as rows and columns describing each sample
+- `threshold` - CPM threshold used to identify abundant features, default: 1000
 
-# set width based on number of samples, with a cap at 50 inches
-plot_width <- 2 * number_samples
-if(plot_width > 50) { plot_width = 50 }
+**Input Data:**
+
+- `Combined-gene-level-taxonomy_unfiltered_GLlbsMetag.tsv` (aggregated gene-level taxonomy table with samples in columns and species in rows, output from [Step 21a](#21a-gene-level-taxonomy-heatmaps))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
+
+**Output Data:**
 
-# Prepare metadata
-metadata <- read_delim(metadata_file, delim = ",") %>% as.data.frame
-sample_names = metadata[, samples_column]
-row.names(metadata) <- sample_names
+- **Combined-gene-level-taxonomy_filtered_GLlbsMetag.tsv** (filtered gene-level taxonomy, output from [get_abundant_features()](#get_abundant_features))
+- **Combined-gene-level-taxonomy_filtered_heatmap_GLlbsMetag.png** (heatmap of all gene-level taxonomy assignments after filtering out non-abundant features, output from [make_heatmap()](#make_heatmap))
+- **Combined-gene-level-taxonomy_filtered_top_50_heatmap_GLlbsMetag.png** (heatmap of the top 50 gene-level taxonomy assignments after filtering out non-abundant features, output from [make_heatmap()](#make_heatmap))
+
+#### 21c. Gene-level Taxonomy Decontamination
+
+> Note: The decontaminated species table and heatmaps are only generated if one or more contaminants are detected.
+
+```R
+feature_table_file <- "Combined-gene-level-taxonomy_unfiltered_GLlbsMetag.tsv"
+metadata_table <- "/path/to/sample/metadata"
 
 decontaminated_table <- feature_decontam(metadata_file = metadata_table, 
                                          feature_table_file = feature_table_file, 
                                          feature_column = "species", 
                                          samples_column = "sample_id",
                                          prevalence_column = "NTC", 
-                                         ntc_name = "TRUE", 
+                                         ntc_name = "true", 
                                          frequency_column = "concentration", 
-                                         threshold = 0.1, 
-                                         classification_method = "Combined-gene-level-taxonomy", 
-                                         output_prefix = "", 
+                                         threshold = 0.5, 
+                                         classification_method = "gene-taxonomy", 
+                                         output_prefix = "Combined-gene-level-taxonomy", 
                                          assay_suffix = "_GLlbsMetag")
 
-# Get common samples and re-arrange feature table and metadata
-common_samples <- intersect(colnames(decontaminated_table), rownames(metadata))
-decontaminated_table <- decontaminated_table[, common_samples]
-metadata <- metadata[common_samples, ]
-metadata <- metadata %>% arrange(!!sym(group_column))
-
-make_heatmap(metadata, decontaminated_table, 
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Combined-gene-level-taxonomy_decontam_species_table_GLlbsMetag.tsv", 
              samples_column = "sample_id", group_column = "group", 
              output_prefix = "Combined-gene-level-taxonomy_decontam", 
              assay_suffix = "_GLlbsMetag",
@@ -3906,126 +4058,160 @@ make_heatmap(metadata, decontaminated_table,
 
 **Custom Functions Used:**
 - [feature_decontam()](#feature_decontam)
-- [make_heatmap()](#make_plot)
+- [make_heatmap()](#make_heatmap)
 
 **Parameter Definitions:**
 
 - `metadata_table` - path to a file with samples as rows and columns describing each sample
 - `feature_table_file` - path to a tab separated samples feature table containing gene-level coverage data 
                          species/functions as the first column and samples as other columns.
-- `number_samples` - the total number of samples in the feature_table_file, adjust based on number of input samples
 
 **Input Data:**
 
-- `gene_taxonomy_table.csv`(aggregated gene taxonomy table with samples in columns and species in rows, from [Step 22a](#22a-gene-level-taxonomy-heatmaps))
-- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping samplenames to group metadata)
+- `Combined-gene-level-taxonomy_unfiltered_GLlbsMetag.tsv` (aggregated gene-level taxonomy table with samples in columns and species in rows, output from [Step 21a](#21a-gene-level-taxonomy-heatmaps))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
 
 **Output Data:**
 
-- **Combined-gene-level-taxonomy_decontam_results_GLlbsMetag.csv** (decontam's results table)
-- **Combined-gene-level-taxonomy_decontam_species_table_GLlbsMetag.csv** (decontaminated species table)
-- **Combined-gene-level-taxonomy_decontam_heatmap_GLlbsMetag.png** (gene-level taxonomy heatmap after filtering out contaminants)
+- **Combined-gene-level-taxonomy_decontam_results_GLlbsMetag.tsv** (decontam's results table, output from [feature_decontam()](#feature_decontam))
+- **Combined-gene-level-taxonomy_decontam_species_table_GLlbsMetag.tsv** (decontaminated gene-level taxonomy, output from [feature_decontam()](#feature_decontam))
+- **Combined-gene-level-taxonomy_decontam_heatmap_GLlbsMetag.png** (heatmap of the gene-level taxonomy assignments after filtering out contaminants, output from [make_heatmap()](#make_heatmap))
+- **Combined-gene-level-taxonomy_decontam_top_50_heatmap_GLlbsMetag.png** (heatmap of the top 50 gene-level taxonomy assignments after filtering out contaminants, output from [make_heatmap()](#make_heatmap))
 
-#### 22c. Gene-level KO Functions Heatmaps
+#### 21d. Gene-level KO Functions Heatmaps
 
 ```R
-library(tidyverse)
-library(pheatmap)
+assembly_table <- "Combined-gene-level-KO-function-coverages-CPM_GLlbsMetag.tsv"
+assembly_summary <- "assembly-summaries_GLlbsMetag.tsv"
+metadata_table <- "/path/to/sample/metadata"
 
-metadata_file <- "/path/to/sample/metadata"
-feature_data_file <- "Combined-gene-level-KO-function-coverages-CPM_GLlbsMetag.ts"
-
-# Abundant functions with CPM > 2000
-abundance_threshold <- 2000
-
-# Prepare metadata
-metadata <- read_delim(metadata_file, delim = ",") %>% as.data.frame
-sample_names = metadata[, samples_column]
-row.names(metadata) <- sample_names
-
-# Read-in KO functions table and drop unannotated
-functions_table <- read_delim(file = feature_table_file, delim = "\t", comment = "#") %>%
-                   select(KO_ID, KO_function, !!any_of(sample_names)) %>%
-                   filter(KO_ID != "Not annotated")
-
-# Convert the sample level data into a matrix
-functions.m <- functions_table %>% select(any_of(sample_names)) %>% as.matrix()
-rownames(functions.m) <- functions_table$KO_ID
-
-# convert to dataframe without unannotated/unclassified species for output
-table2write <- functions.m %>% as.data.frame %>%
-               rownames_to_column("KO_ID")
-# Write out  taxonomy table
-write_csv(x = table2write  , file = "genes-KO-functions_table.csv")
-
-# Get common samples and re-arrange feature table and metadata
-common_samples <- intersect(colnames(functions_table), rownames(metadata))
-functions_table <- functions_table[, common_samples]
-metadata <- metadata[common_samples, ]
-metadata <- metadata %>% arrange(!!sym(group_column))
-
-make_heatmap(metadata, table2write,
+# Read in assembly summary table and drop any column that contains NA values
+overview_table <- read_delim(assembly_summary, comment="#") %>%
+  select(
+    where(~all(!is.na(.)))
+  )
+
+col_names <- names(overview_table) %>% str_remove_all("-assembly")
+sample_order <- col_names[-1] %>% sort()
+
+# Read in the KO function coverage table and determine which sample columns to use
+df <- read_delim(assembly_table, comment = "#")
+sample_order <- get_samples(df, sample_order, end_col="KO_function")
+
+table2write <- df %>%
+               select(KO_ID, !!sample_order)
+
+# Write out gene-level KO function table
+write_tsv(x = table2write, file = "Combined-gene-level-KO-function_unfiltered_GLlbsMetag.tsv")
+
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Combined-gene-level-KO-function_unfiltered_GLlbsMetag.tsv",
              samples_column="sample_id", group_column = "group", 
-             output_prefix = "Combined-gene-level-KO-function", 
+             output_prefix = "Combined-gene-level-KO-function_unfiltered", 
              assay_suffix = "_GLlbsMetag", 
              custom_palette = custom_palette)
 
 ```
 
 **Custom Functions Used:**
+- [get_samples()](#get_samples)
 - [make_heatmap()](#make_heatmap)
 
+**Parameter Definitions:**
+
+- `metadata_table` - path to a file with samples as rows and columns describing each sample
+- `assembly_table` - path to a tab-separated table containing gene-level KO function coverage data with
+                         species/functions as the first column and samples as other columns.
+- `assembly_summary` - path to a tab-separated file containing statistics on assemblies created for each sample
+
 **Input data:**
 
-- /path/to/sample/metadata (a file with samples as rows and columns describing each sample)
-- Combined-gene-level-KO-function-coverages-CPM_GLlbsMetag.tsv (table with all samples combined 
-    based on KO annotations; normalized to coverage per million genes covered, output from 
-    [Step 19a](#19a-generate-gene-level-coverage-summary-tables)
+- assembly-summaries_GLlbsMetag.tsv (table of assembly summary statistics, output from [Step 10b](#10b-summarize-assemblies))
+- Combined-gene-level-KO-function-coverages-CPM_GLlbsMetag.tsv (table with all samples combined based on KO annotations; 
+  normalized to coverage per million genes covered, output from [Step 18a](#18a-generate-gene-level-coverage-summary-tables))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
 
 **Output data:**
 
-- genes-KO-functions_table.csv (aggregated and subsetted gene KO function table)
-- **Combined-gene-level-KO-function_heatmap_GLlbsMetag.png** (heatmap of all gene-level KO function assignments)
+- Combined-gene-level-KO-function_unfiltered_GLlbsMetag.tsv (aggregated and subsetted gene-level KO function table)
+- **Combined-gene-level-KO-function_unfiltered_heatmap_GLlbsMetag.png** (heatmap of all gene-level KO function assignments, output from [make_heatmap()](#make_heatmap))
+- **Combined-gene-level-KO-function_unfiltered_top_50_heatmap_GLlbsMetag.png** (heatmap of the top 50 gene-level KO function assignments, output from [make_heatmap()](#make_heatmap))
 
-#### 22d. Gene-level KO Functions Decontamination
+#### 21e. Gene-level KO Functions Feature Filtering
 
 ```R
-library(tidyverse)
-library(decontam)
-library(phyloseq)
-
-feature_table_file <- "genes-KO-functions_table.csv"
+feature_table_file <- "Combined-gene-level-KO-function_unfiltered_GLlbsMetag.tsv"
 metadata_table <- "/path/to/sample/metadata"
-number_samples <- NumberOfSamples # integer indicating how many samples are in the file
+threshold <- 1000
+
+# read in feature table
+feature_table <- read_delim(feature_table_file) %>%
+                 mutate(across(where(is.numeric), function(col) replace_na(col, 0))) %>%
+                 as.data.frame()
+feature_name <- colnames(feature_table)[1]
+rownames(feature_table) <- feature_table[,1]
+feature_table <- feature_table[, -1]
+
+table2write <- get_abundant_features(feature_table, cpm_threshold=threshold) %>%
+               as.data.frame() %>%
+               rownames_to_column(feature_name)
+
+write_tsv(x = table2write, file = "Combined-gene-level-KO-function_filtered_GLlbsMetag.tsv")
+
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Combined-gene-level-KO-function_filtered_GLlbsMetag.tsv", 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Combined-gene-level-KO-function_filtered", 
+             assay_suffix = "_GLlbsMetag", 
+             custom_palette = custom_palette)
+
+```
+
+**Custom Functions Used:**
+- [get_abundant_features()](#get_abundant_features)
+- [make_heatmap()](#make_heatmap)
+
+**Parameter Definitions:**
+
+- `feature_table_file` - path to a tab-separated feature table containing gene-level KO function coverage data, with
+                         KO_ID as the first column and samples as the other columns.
+- `metadata_table` - path to a file with samples as rows and columns describing each sample
+- `threshold` - CPM threshold used to identify abundant features, default: 1000
+
+**Input Data:**
+
+- `Combined-gene-level-KO-function_unfiltered_GLlbsMetag.tsv` (aggregated gene-level KO function table with samples in columns and KO_ID in rows, output from [Step 21d](#21d-gene-level-ko-functions-heatmaps))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
+
+**Output Data:**
+
+- **Combined-gene-level-KO-function_filtered_GLlbsMetag.tsv** (filtered gene-level KO function table, output from [get_abundant_features()](#get_abundant_features))
+- **Combined-gene-level-KO-function_filtered_heatmap_GLlbsMetag.png** (heatmap of all gene-level KO function assignments after filtering out non-abundant features, output from [make_heatmap()](#make_heatmap))
+- **Combined-gene-level-KO-function_filtered_top_50_heatmap_GLlbsMetag.png** (heatmap of the top 50 gene-level KO function assignments after filtering out non-abundant features, output from [make_heatmap()](#make_heatmap))
 
-# set width based on number of samples, with a cap at 50 inches
-plot_width <- 2 * number_samples
-if(plot_width > 50) { plot_width = 50 }
 
-# Prepare metadata
-metadata <- read_delim(metadata_file, delim = ",") %>% as.data.frame
-sample_names = metadata[, samples_column]
-row.names(metadata) <- sample_names
+#### 21f. Gene-level KO Functions Decontamination
+
+> Note: The decontaminated KO table and heatmaps are only generated if one or more contaminants are detected.
+
+```R
+feature_table_file <- "Combined-gene-level-KO-function_unfiltered_GLlbsMetag.tsv"
+metadata_table <- "/path/to/sample/metadata"
 
 decontaminated_table <- feature_decontam(metadata_file = metadata_table, 
                                          feature_table_file = feature_table_file, 
                                          feature_column = "KO_ID", 
                                          samples_column = "sample_id",
                                          prevalence_column = "NTC", 
-                                         ntc_name = "TRUE", 
+                                         ntc_name = "true", 
                                          frequency_column = "concentration", 
-                                         threshold = 0.1, 
-                                         classification_method = "Combined-gene-level-KO-function", 
-                                         output_prefix = "", 
+                                         threshold = 0.5, 
+                                         classification_method = "gene-function", 
+                                         output_prefix = "Combined-gene-level-KO-function", 
                                          assay_suffix = "_GLlbsMetag")
 
-# Get common samples and re-arrange feature table and metadata
-common_samples <- intersect(colnames(decontaminated_table), rownames(metadata))
-decontaminated_table <- decontaminated_table[, common_samples]
-metadata <- metadata[common_samples, ]
-metadata <- metadata %>% arrange(!!sym(group_column))
-
-make_heatmap(metadata, decontaminated_table, 
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Combined-gene-level-KO-function_decontam_KO_table_GLlbsMetag.tsv", 
              samples_column = "sample_id", group_column = "group", 
              output_prefix = "Combined-gene-level-KO-function_decontam", 
              assay_suffix = "_GLlbsMetag",
@@ -4035,65 +4221,59 @@ make_heatmap(metadata, decontaminated_table,
 
 **Custom Functions Used:**
 - [feature_decontam()](#feature_decontam)
-- [make_heatmap()](#make_plot)
+- [make_heatmap()](#make_heatmap)
 
 **Parameter Definitions:**
 
 - `metadata_table` - path to a file with samples as rows and columns describing each sample
 - `feature_table_file` - path to a tab separated samples feature table containing gene-level KO functions coverage data 
                          with KO_ID as the first column and samples as other columns.
-- `number_samples` - the total number of samples in the feature_table_file, adjust based on number of input samples
 
 **Input Data:**
 
-- `genes-KO-functions_table.csv`(aggregated gene KO functions table table with samples in columns and KO_ID in rows, from [Step 22c](#22c-gene-level-ko-functions-heatmaps))
-- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping samplenames to group metadata)
+- `Combined-gene-level-KO-function_unfiltered_GLlbsMetag.tsv` (aggregated gene-level KO function table with samples in columns and KO_ID in rows, output from [Step 21d](#21d-gene-level-ko-functions-heatmaps))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
 
 **Output Data:**
 
-- **Combined-gene-level-KO-function_decontam_results_GLlbsMetag.csv** (decontam's results table)
-- **Combined-gene-level-KO-function_decontam_species_table_GLlbsMetag.csv** (decontaminated gene-level KO functions table)
-- **Combined-gene-level-KO-function_decontam_heatmap_GLlbsMetag.png** (gene-level KO functions heatmap after filtering out contaminants)
+- **Combined-gene-level-KO-function_decontam_results_GLlbsMetag.tsv** (decontam results table, output from [feature_decontam()](#feature_decontam))
+- **Combined-gene-level-KO-function_decontam_KO_table_GLlbsMetag.tsv** (decontaminated gene-level KO functions table, output from [feature_decontam()](#feature_decontam))
+- **Combined-gene-level-KO-function_decontam_heatmap_GLlbsMetag.png** (heatmap of all gene-level KO function assignments after filtering out contaminants, output from [make_heatmap()](#make_heatmap))
+- **Combined-gene-level-KO-function_decontam_top_50_heatmap_GLlbsMetag.png** (heatmap of the top 50 gene-level KO function assignments after filtering out contaminants, output from [make_heatmap()](#make_heatmap))
 
 
-#### 22e. Contig-level Heatmaps
+#### 21g. Contig-level Heatmaps
 
 ```R
-library(tidyverse)
+assembly_table <- "Combined-contig-level-taxonomy-coverages-CPM_GLlbsMetag.tsv"
+assembly_summary <- "assembly-summaries_GLlbsMetag.tsv"
+metadata_table <- "/path/to/sample/metadata"
 
-metadata_file <- "/path/to/sample/metadata"
-feature_data_file <- "Combined-contig-level-taxonomy-coverages-CPM_GLlbsMetag.tsv"
-
-# Prepare metadata
-metadata <- read_delim(metadata_file, delim = ",") %>% as.data.frame
-sample_names = metadata[, samples_column]
-row.names(metadata) <- sample_names
-
-# Prepare feature table
-contig_taxonomy_table <- read_assembly_coverage_table(feature_table_file, sample_names) %>% as.data.frame
-
-# Summarize contig table
-species_contig_table <- contig_taxonomy_table %>%
-  select(species, !!any_of(sample_names)) %>%
-  group_by(species) %>%
-  summarise(across(everything(), sum)) %>% 
-  filter(species != "Unclassified;_;_;_;_;_;_") %>% # Drop unclassifed
-  as.data.frame
+# Read in assembly summary table
+overview_table <- read_delim(assembly_summary, comment="#") %>%
+  select(
+    where(~all(!is.na(.)))
+  )
 
-rownames(species_contig_table) <- species_contig_table[[1]]
-species_contig_table <- species_contig_table[, -1] %>% as.matrix()
+col_names <- names(overview_table) %>% str_remove_all("-assembly")
+sample_order <- col_names[-1] %>% sort()
 
-# Get common samples and re-arrange feature table and metadata
-common_samples <- intersect(colnames(species_contig_table), rownames(metadata))
-species_contig_table <- species_contig_table[, common_samples]
-metadata <- metadata[common_samples, ]
-metadata <- metadata %>% arrange(!!sym(group_column))
+# deduplicate rows by summing together species values
+df <- read_delim(assembly_table, comment = "#")
+sample_order <- get_samples(df, sample_order)
+
+table2write <- read_taxonomy_table(df, sample_order) %>%
+               select(species, !!sample_order) %>%
+               group_by(species) %>%
+               summarise(across(everything(), sum)) %>%
+               filter(species != "Unclassified;_;_;_;_;_;_") %>%
+               as.data.frame()
 
-table2write = species_contig_table %>% as.data.frame %>% rownames_to_column("species")
 # Write out contig taxonomy table
-write_csv(x = table2write, file = "contig_taxonomy_table.csv")
+write_tsv(x = table2write, file = "Combined-contig-level-taxonomy_unfiltered_GLlbsMetag.tsv")
 
-make_heatmap(metadata, species_contig_table, 
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Combined-contig-level-taxonomy_unfiltered_GLlbsMetag.tsv", 
              samples_column="sample_id", group_column = "group", 
              output_prefix = "Combined-contig-level-taxonomy", 
              assay_suffix = "_GLlbsMetag", 
@@ -4101,60 +4281,103 @@ make_heatmap(metadata, species_contig_table,
 ```
 
 **Custom Functions Used:**
-- [read_assembly_coverage_table()](#read_assembly_coverage_table)
+- [get_samples()](#get_samples)
+- [read_taxonomy_table()](#read_taxonomy_table)
 - [make_heatmap()](#make_heatmap)
 
+**Parameter Definitions:**
+
+- `metadata_table` - path to a file with samples as rows and columns describing each sample
+- `assembly_table` - path to a tab-separated table containing contig-level taxonomy coverage data with
+                         species/functions as the first column and samples as other columns.
+- `assembly_summary` - path to a tab-separated file containing statistics on assemblies created for each sample
+
+
 **Input data:**
 
-- /path/to/sample/metadata (a file with samples as rows and columns describing each sample)
-- Combined-contig-level-taxonomy-coverages-CPM_GLlbsMetag.tsv (table with all samples 
-    combined based on contig-level taxonomic classifications, output from 
-    [Step 19b](#19b-generate-contig-level-coverage-summary-tables)) 
+- assembly-summaries_GLlbsMetag.tsv (table of assembly summary statistics, output from [Step 10b](#10b-summarize-assemblies))
+- Combined-contig-level-taxonomy-coverages-CPM_GLlbsMetag.tsv (table with all samples combined based on contig-level 
+  taxonomic classifications, output from [Step 18b](#18b-generate-contig-level-coverage-summary-tables)) 
 
 **Output data:**
 
-- contig_taxonomy_table.csv (aggregated contig taxonomy table with samples in columns and species in rows)
-- **Combined-contig-level-taxonomy_heatmap_GLlbsMetag.png** (heatmap of all contig taxonomy assignments)
+- Combined-contig-level-taxonomy_unfiltered_GLlbsMetag.tsv (aggregated contig-level taxonomy table with samples in columns and species in rows)
+- **Combined-contig-level-taxonomy_unfiltered_heatmap_GLlbsMetag.png** (heatmap of all contig-level taxonomy assignments, output from [make_heatmap()](#make_heatmap))
+- **Combined-contig-level-taxonomy_unfiltered_top_50_heatmap_GLlbsMetag.png** (heatmap of the top 50 contig-level taxonomy assignments, output from [make_heatmap()](#make_heatmap))
 
-#### 22f. Contig-level Decontamination
+#### 21h. Contig-level Feature Filtering
 
 ```R
-library(tidyverse)
-library(decontam)
-library(phyloseq)
-
-feature_table_file <- "contig_taxonomy_table.csv"
+feature_table_file <- "Combined-contig-level-taxonomy_unfiltered_GLlbsMetag.tsv"
 metadata_table <- "/path/to/sample/metadata"
-number_samples <- NumberOfSamples # integer indicating how many samples are in the file
+threshold <- 1000
+
+# read in feature table
+feature_table <- read_delim(feature_table_file) %>%
+                 mutate(across(where(is.numeric), function(col) replace_na(col, 0))) %>%
+                 as.data.frame()
+feature_name <- colnames(feature_table)[1]
+rownames(feature_table) <- feature_table[,1]
+feature_table <- feature_table[, -1]
+
+table2write <- get_abundant_features(feature_table, cpm_threshold=threshold) %>%
+               as.data.frame() %>%
+               rownames_to_column(feature_name)
+
+write_tsv(x = table2write, file = "Combined-contig-level-taxonomy_filtered_GLlbsMetag.tsv")
+
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Combined-contig-level-taxonomy_filtered_GLlbsMetag.tsv", 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Combined-contig-level-taxonomy_filtered", 
+             assay_suffix = "_GLlbsMetag", 
+             custom_palette = custom_palette)
+```
+
+**Custom Functions Used:**
+- [get_abundant_features()](#get_abundant_features)
+- [make_heatmap()](#make_heatmap)
+
+**Parameter Definitions:**
+
+- `feature_table_file` - path to a tab-separated feature table containing contig-level coverage data, with
+                         species/functions as the first column and samples as the other columns.
+- `metadata_table` - path to a file with samples as rows and columns describing each sample
+- `threshold` - CPM threshold used to identify abundant features, default: 1000
+
+**Input Data:**
+
+- `Combined-contig-level-taxonomy_unfiltered_GLlbsMetag.tsv` (aggregated contig-level taxonomy table with samples in columns and species in rows, output from [Step 21g](#21g-contig-level-heatmaps))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
+
+**Output Data:**
 
-# set width based on number of samples, with a cap at 50 inches
-plot_width <- 2 * number_samples
-if(plot_width > 50) { plot_width = 50 }
+- **Combined-contig-level-taxonomy_filtered_GLlbsMetag.tsv** (filtered contig-level taxonomy, output from [get_abundant_features()](#get_abundant_features))
+- **Combined-contig-level-taxonomy_filtered_heatmap_GLlbsMetag.png** (heatmap of all contig-level taxonomy assignments after filtering out non-abundant features, output from [make_heatmap()](#make_heatmap))
+- **Combined-contig-level-taxonomy_filtered_top_50_heatmap_GLlbsMetag.png** (heatmap of the top 50 contig-level taxonomy assignments after filtering out non-abundant features, output from [make_heatmap()](#make_heatmap))
 
-# Prepare metadata
-metadata <- read_delim(metadata_file, delim = ",") %>% as.data.frame
-sample_names = metadata[, samples_column]
-row.names(metadata) <- sample_names
+#### 21i. Contig-level Decontamination
+
+> Note: species_table and heatmaps are only generated if 1 or more contaminants were detected
+
+```R
+feature_table_file <- "Combined-contig-level-taxonomy_unfiltered_GLlbsMetag.tsv"
+metadata_table <- "/path/to/sample/metadata"
 
 decontaminated_table <- feature_decontam(metadata_file = metadata_table, 
                                          feature_table_file = feature_table_file, 
                                          feature_column = "species", 
                                          samples_column = "sample_id",
                                          prevalence_column = "NTC", 
-                                         ntc_name = "TRUE", 
+                                         ntc_name = "true", 
                                          frequency_column = "concentration", 
-                                         threshold = 0.1, 
-                                         classification_method = "Combined-contig-level-taxonomy", 
-                                         output_prefix = "", 
+                                         threshold = 0.5, 
+                                         classification_method = "contig-taxonomy", 
+                                         output_prefix = "Combined-contig-level-taxonomy", 
                                          assay_suffix = "_GLlbsMetag")
 
-# Get common samples and re-arrange feature table and metadata
-common_samples <- intersect(colnames(decontaminated_table), rownames(metadata))
-decontaminated_table <- decontaminated_table[, common_samples]
-metadata <- metadata[common_samples, ]
-metadata <- metadata %>% arrange(!!sym(group_column))
-
-make_heatmap(metadata, decontaminated_table, 
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Combined-contig-level-taxonomy_decontam_species_table_GLlbsMetag.tsv", 
              samples_column = "sample_id", group_column = "group", 
              output_prefix = "Combined-contig-level-taxonomy_decontam", 
              assay_suffix = "_GLlbsMetag",
@@ -4164,23 +4387,55 @@ make_heatmap(metadata, decontaminated_table,
 
 **Custom Functions Used:**
 - [feature_decontam()](#feature_decontam)
-- [make_heatmap()](#make_plot)
+- [make_heatmap()](#make_heatmap)
 
 **Parameter Definitions:**
 
 - `metadata_table` - path to a file with samples as rows and columns describing each sample
 - `feature_table_file` - path to a tab separated samples feature table containing contig-level coverage data 
                          species/functions as the first column and samples as other columns.
-- `number_samples` - the total number of samples in the feature_table_file, adjust based on number of input samples
 
 **Input Data:**
 
-- `contig_taxonomy_table.csv`(aggregated contig taxonomy table with samples in columns and species in rows, from [Step 22f](#22f-contig-level-heatmaps))
-- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping samplenames to group metadata)
+- `Combined-contig-level-taxonomy_unfiltered_GLlbsMetag.tsv` (aggregated contig-level taxonomy table with samples in columns and species in rows, output from [Step 21g](#21g-contig-level-heatmaps))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
 
 **Output Data:**
 
-- **Combined-contig-level-taxonomy_decontam_results_GLlbsMetag.csv** (decontam's results table)
-- **Combined-contig-level-taxonomy_decontam_species_table_GLlbsMetag.csv** (decontaminated contig-level species table)
-- **Combined-contig-level-taxonomy_decontam_heatmap_GLlbsMetag.png** (contig-level heatmap after filtering out contaminants)
+- **Combined-contig-level-taxonomy_decontam_results_GLlbsMetag.tsv** (decontam's results table, output from [feature_decontam()](#feature_decontam))
+- **Combined-contig-level-taxonomy_decontam_species_table_GLlbsMetag.tsv** (decontaminated contig-level taxonomy table, output from [feature_decontam()](#feature_decontam))
+- **Combined-contig-level-taxonomy_decontam_heatmap_GLlbsMetag.png** (heatmap of all contig-level taxonomy assignments after filtering out contaminants, output from [make_heatmap()](#make_heatmap))
+- **Combined-contig-level-taxonomy_decontam_top_50_heatmap_GLlbsMetag.png** (heatmap of the top 50 contig-level taxonomy assignments after filtering out contaminants, output from [make_heatmap()](#make_heatmap))
+
+### 22. Generate Assembly-based Processing Overview
+> This utilizes the helper script [`generate-assembly-based-overview-table.sh`](https://github.com/nasa/GeneLab_Metagenomics_Workflow/blob/DEV/bin/generate-assembly-based-overview-table.sh) 
+
+```bash
+bash generate-assembly-based-overview-table.sh sample_ids_file.txt \
+  assemblies/ predicted-genes/ read-mapping/ bins/ MAGs/ \
+  Assembly-based-processing-overview_GLlbsMetag.tsv
+```
+
+**Parameter Definitions:**
+
+- `sample_ids_file.txt` - A file listing the sample names, one on each row, provided as a positional argument.
+- `assemblies/` - The directory holding the contig-renamed assembly files generated in [Step 10a](#10a-rename-contig-headers), provided as a positional argument.
+- `predicted-genes/` - The directory holding the gene-calls amino-acid fasta files generated in [Step 11a](#11a-generate-gene-predictions) and [Step 11b](#11b-remove-line-wraps-in-gene-prediction-output), provided as a positional argument.
+- `read-mapping/` - The directory holding the sorted mapping to the sample assembly in BAM format generated in [Step 14c](#14c-sort-assembly-alignments), provided as a positional argument.
+- `bins/` - The directory holding the recovered bins fasta files generated in [Step 19a](#19a-bin-contigs), provided as a positional argument.
+- `MAGs/` - The directory holding the high-quality MAGs fasta files generated in [Step 19c](#19c-filter-mags), provided as a positional argument.
+- `Assembly-based-processing-overview_GLlbsMetag.tsv` - name of the output file, provided as a positional argument.
+
+**Input Data:**
+
+- assemblies/\*.fasta (contig-renamed assembly files from [Step 10a](#10a-rename-contig-headers))
+- predicted-genes/\*.faa (gene-calls amino-acid fasta file with line wraps removed, output from [Step 11b](#11b-remove-line-wraps-in-gene-prediction-output))
+- read-mapping/\*.bam (sorted mapping to sample assembly, in BAM format, output from [Step 14c](#14c-sort-assembly-alignments))
+- bins/\*.fasta (fasta files of recovered bins, output from [Step 19a](#19a-bin-contigs))
+- MAGs/\*.fasta (fasta files of high-quality MAGs, output from [Step 19c](#19c-filter-mags))
+
+**Output Data:**
+
+- **Assembly-based-processing-overview_GLlbsMetag.tsv** (tab-delimited text file providing a summary of assembly-based processing results for each sample)
+
 
diff --git a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md
index 2a2ae9e02..efc9e58ba 100644
--- a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md
+++ b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md
@@ -4,7 +4,7 @@
 
 ---
 
-**Date:** January MM, 2026  
+**Date:** March MM, 2026  
 **Revision:** -  
 **Document Number:** GL-DPPD-7116  
 
@@ -12,9 +12,9 @@
 Olabiyi A. Obayomi (GeneLab Analysis Team)  
 
 **Approved by:**  
-Samrawit Gebre (OSDR Project Manager)  
-Jonathan Galazka (OSDR Project Scientist)  
-Amanda Saravia-Butler (GeneLab Science Lead)  
+Jonathan Galazka (OSDR Project Manager)  
+Danielle Lopez (OSDR Deputy Project Manager)  
+Amanda Saravia-Butler (OSDR Subject Matter Expert)  
 Barbara Novak (GeneLab Data Processing Lead)  
 
 
@@ -69,8 +69,8 @@ Barbara Novak (GeneLab Data Processing Lead)
       - [10e. Compile Kaiju Krona Reports](#10e-compile-kaiju-krona-reports)
       - [10f. Create Kaiju Species Count Table](#10f-create-kaiju-species-count-table)
       - [10g. Filter Kaiju Species Count Table](#10g-filter-kaiju-species-count-table)
-      - [10h. Taxonomy Barplots](#10h-taxonomy-barplots)
-      - [10i. Feature Decontamination](#10i-feature-decontamination)
+      - [10h. Kaiju Taxonomy Barplots](#10h-kaiju-taxonomy-barplots)
+      - [10i. Kaiju Feature Decontamination](#10i-kaiju-feature-decontamination)
     - [11. Taxonomic Profiling Using Kraken2](#11-taxonomic-profiling-using-kraken2)
       - [11a. Download Kraken2 Database](#11a-download-kraken2-database)
       - [11b. Kraken2 Taxonomic Classification](#11b-kraken2-taxonomic-classification)
@@ -80,8 +80,8 @@ Barbara Novak (GeneLab Data Processing Lead)
       - [11d. Convert Kraken2 Output to Krona Format](#11d-convert-kraken2-output-to-krona-format)
       - [11e. Compile Kraken2 Krona Reports](#11e-compile-kraken2-krona-reports)
       - [11f. Filter Kraken2 Species Count Table](#11f-filter-kraken2-species-count-table)
-      - [11g. Taxonomy Barplots](#11h-taxonomy-barplots)
-      - [11h. Feature Decontamination](#11h-feature-decontamination)
+      - [11g. Kraken2 Taxonomy Barplots](#11g-kraken2-taxonomy-barplots)
+      - [11h. Kraken2 Feature Decontamination](#11h-kraken2-feature-decontamination)
   - [**Assembly-based processing**](#assembly-based-processing)
     - [12. Sample Assembly](#12-sample-assembly)
     - [13. Polish Assembly](#13-polish-assembly)
@@ -89,8 +89,8 @@ Barbara Novak (GeneLab Data Processing Lead)
       - [14a. Rename Contig Headers](#14a-rename-contig-headers)
       - [14b. Summarize Assemblies](#14b-summarize-assemblies)
     - [15. Gene Prediction](#15-gene-prediction)
-      - [15a. Generate Gene Predictions](15a-generate-gene-predictions)
-      - [15b. Remove Line Wraps In Gene Prediction Output](#15a-remove-line-wraps-in-gene-prediction-output)
+      - [15a. Generate Gene Predictions](#15a-generate-gene-predictions)
+      - [15b. Remove Line Wraps In Gene Prediction Output](#15b-remove-line-wraps-in-gene-prediction-output)
     - [16. Functional Annotation](#16-functional-annotation)
       - [16a. Download Reference Database of HMM Models](#16a-download-reference-database-of-hmm-models)
       - [16b. Run KEGG Annotation](#16b-run-kegg-annotation)
@@ -102,9 +102,9 @@ Barbara Novak (GeneLab Data Processing Lead)
       - [17d. Add Taxonomy Info From Taxids To Contigs](#17d-add-taxonomy-info-from-taxids-to-contigs)
       - [17e. Format Gene-level Output With awk and sed](#17e-format-gene-level-output-with-awk-and-sed)
       - [17f. Format Contig-level Output With awk and sed](#17f-format-contig-level-output-with-awk-and-sed)
-    - [18. Read-Mapping](#17-read-mapping)
+    - [18. Read-Mapping](#18-read-mapping)
       - [18a. Align Reads to Sample Assembly](#18a-align-reads-to-sample-assembly)
-      - [18b. Sort and Index Assembly Alignments](#18b-sort-and-index-assembly-alignments)
+      - [18b. Sort Assembly Alignments](#18b-sort-assembly-alignments)
     - [19. Get Coverage Information and Filter Based On Detection](#19-get-coverage-information-and-filter-based-on-detection)
       - [19a. Filter Coverage Levels Based On Detection](#19a-filter-coverage-levels-based-on-detection)
       - [19b. Filter Gene and Contig Coverage Based On Detection](#19b-filter-gene-and-contig-coverage-based-on-detection)
@@ -122,13 +122,17 @@ Barbara Novak (GeneLab Data Processing Lead)
     - [24. Generate MAG-level Functional Summary Overview](#24-generate-mag-level-functional-summary-overview)
       - [24a. Get KO Annotations Per MAG](#24a-get-ko-annotations-per-mag)
       - [24b. Summarize KO Annotations With KEGG-Decoder](#24b-summarize-ko-annotations-with-kegg-decoder)
-    - [25. Decontamination and Visualization of Contig- and Gene-taxonomy and Gene-function Outputs](#25-decontamination-and-visualization-of-contig--and-gene-taxonomy-and-gene-function-outputs)
+    - [25. Filtering, Decontamination, and Visualization of Contig- and Gene-taxonomy and Gene-function Outputs](#25-filtering-decontamination-and-visualization-of-contig--and-gene-taxonomy-and-gene-function-outputs)
       - [25a. Gene-level Taxonomy Heatmaps](#25a-gene-level-taxonomy-heatmaps)
-      - [25b. Gene-level Taxonomy Decontamination](#25b-gene-level-taxonomy-decontamination)
-      - [25c. Gene-level KO Functions Heatmaps](#25c-gene-level-ko-functions-heatmaps)
-      - [25d. Gene-level KO Functions Decontamination](#25d-gene-level-ko-functions-decontamination)
-      - [25e. Contig-level Heatmaps](#25e-contig-level-heatmaps)
-      - [25f. Contig-level Decontamination](#25f-contig-level-decontamination)
+      - [25b. Gene-level Taxonomy Feature Filtering](#25b-gene-level-taxonomy-feature-filtering)
+      - [25c. Gene-level Taxonomy Decontamination](#25c-gene-level-taxonomy-decontamination)
+      - [25d. Gene-level KO Functions Heatmaps](#25d-gene-level-ko-functions-heatmaps)
+      - [25e. Gene-level KO Functions Feature Filtering](#25e-gene-level-ko-functions-feature-filtering)
+      - [25f. Gene-level KO Functions Decontamination](#25f-gene-level-ko-functions-decontamination)
+      - [25g. Contig-level Heatmaps](#25g-contig-level-heatmaps)
+      - [25h. Contig-level Feature Filtering](#25h-contig-level-feature-filtering)
+      - [25i. Contig-level Decontamination](#25i-contig-level-decontamination)
+    - [26. Generate Assembly-based Processing Overview](#26-generate-assembly-based-processing-overview)
 
 
 ---
@@ -145,7 +149,6 @@ Barbara Novak (GeneLab Data Processing Lead)
 |filtlong| 0.2.1 |[https://github.com/rrwick/Filtlong](https://github.com/rrwick/Filtlong)|
 |Flye| 2.9.5 | [https://github.com/mikolmogorov/Flye](https://github.com/mikolmogorov/Flye) |
 |GTDB-Tk| 2.4.0 |[https://github.com/Ecogenomics/GTDBTk](https://github.com/Ecogenomics/GTDBTk)|
-|HUMAnN| 3.9 |[https://github.com/biobakery/humann](https://github.com/biobakery/humann)|
 |Kaiju| 1.10.1 | [https://bioinformatics-centre.github.io/kaiju/](https://bioinformatics-centre.github.io/kaiju/) |
 |KEGG-Decoder| 1.2.2 |[https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder](https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder)
 |KOFamScan| 1.3.0 |[https://github.com/takaram/kofam_scan](https://github.com/takaram/kofam_scan)|
@@ -177,12 +180,11 @@ Barbara Novak (GeneLab Data Processing Lead)
 
 ## Pre-processing
 
-
 ### 1. Basecalling
 
 ```bash
 model="hac" # high accuracy model
-input_directory=/path/to/pod5/or/fast5/data
+input_directory=/path/to/pod5/data
 kit_name=SQK-RPB004
 
 dorado basecaller ${model} ${input_directory} \
@@ -196,16 +198,16 @@ dorado basecaller ${model} ${input_directory} \
 **Parameter Definitions:**
 
 - `model` - Positional argument specifying the basecalling model to use or a path to the model directory. `hac` chooses the high accuracy model.
-- `input_directory` - Positional argument specifying the location of the raw data in POD5 or FAST5 format.
+- `input_directory` - Positional argument specifying the location of the raw data in POD5 format.
 - `--no-trim` - Skips trimming of barcodes, adapters, and primers.
 - `--device` - Specifies CPU or GPU device; specifying 'auto' chooses either 'cpu' or 'gpu' depending on detected presence of a GPU device.
-- `--recursive` - Enables recursive scanning through input directory to load FAST5 and/or POD5 files.
+- `--recursive` - Enables recursive scanning through input directory to load POD5 files.
 - `--kit-name` - The nanopore barcoding kit used during sequencing preparation. Enables barcoding with the provided kit name; see [dorado documentation](https://software-docs.nanoporetech.com/dorado/1.1.1/barcoding/barcoding/) for a full list of accepted kit names.
 - `--min-qscore` - Specifies the minimum Q-score, reads with a mean Q-score below this threshold are discarded (default to `8`).
 
 **Input Data:**
 
-- *pod5 and/or *fast5 (raw nanopore data)
+- *pod5 (raw nanopore data)
 
 **Output Data:**
 
@@ -296,6 +298,8 @@ NanoPlot --only-report \
          --threads NumberOfThreads \
          --fastq \
          /path/to/raw_data/sample.fastq.gz
+
+mv /path/to/raw_nanoplot_output/sample_raw_NanoPlot-report.html /path/to/raw_nanoplot_output/sample_raw_NanoPlot-report_GLlblMetag.html
 ```
 
 **Parameter Definitions:**
@@ -313,7 +317,7 @@ NanoPlot --only-report \
 
 **Output Data:**
 
-- **/path/to/raw_nanoplot_output/sample_raw_NanoPlot-report.html** (NanoPlot html summary)
+- **/path/to/raw_nanoplot_output/sample_raw_NanoPlot-report_GLlblMetag.html** (NanoPlot html summary)
 - /path/to/raw_nanoplot_output/sample_raw_NanoPlot_\<date\>_\<time\>.log (NanoPlot log file)
 - /path/to/raw_nanoplot_output/sample_raw_NanoStats.txt (text file containing basic statistics)
 
@@ -381,6 +385,8 @@ NanoPlot --only-report \
          --threads NumberOfThreads \
          --fastq \
          sample_filtered.fastq
+
+mv /path/to/filtered_nanoplot_output/sample_filtered_NanoPlot-report.html /path/to/filtered_nanoplot_output/sample_filtered_NanoPlot-report_GLlblMetag.html
 ```
 
 **Parameter Definitions:**
@@ -470,6 +476,8 @@ NanoPlot --only-report \
          --threads NumberOfThreads \
          --fastq \
          sample_trimmed.fastq.gz
+
+mv /path/to/trimmed_nanoplot_output/sample_trimmed_NanoPlot-report.html /path/to/trimmed_nanoplot_output/sample_trimmed_NanoPlot-report_GLlblMetag.html
 ```
 
 **Parameter Definitions:**
@@ -523,41 +531,51 @@ multiqc --zip-data-dir \
 ---
 
 ### 6. Human Read Removal
+> **Note:** The human read removal step in this pipeline is derived from the 
+[NASA GeneLab Remove Human Reads pipeline](../../Remove_human_reads_from_raw_data/Pipeline_GL-DPPD-7105_Versions/GL-DPPD-7105-A.md). 
+It is included explicitly in this pipeline document because the order of operations and QC generation steps differ for long-read data.
 
 #### 6a. Build Kraken2 Human Database
 
 > **Note:** It is recommended to use NCBI genome files with kraken2 because sequences not downloaded from 
 NCBI may require explicit assignment of taxonomy information before they can be used to build the 
-database, as mentioned in the [Kraken2 Documentation](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown).
+database, as mentioned in the [Kraken2 Documentation](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown). 
+This step is derived from the [NASA GeneLab Remove Human Reads pipeline](../../Remove_human_reads_from_raw_data/Pipeline_GL-DPPD-7105_Versions/) and uses the kraken2 [k2 wrapper script](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#introducing-k2) throughout.
 
 ```bash
+# download human fasta sequences
+k2 download-library --library human --db kraken2-human-db/ --threads 30 --no-masking
+
 # Download NCBI taxonomic information 
-kraken2-build --download-taxonomy --db kraken2-human-db/
+k2 download-taxonomy --db kraken2-human-db/
 
-# Add genomic sequences to your database's genomic library
-kraken2-build --add-to-library human.fasta --db kraken2-human-db/ --no-masking
-             
 # Build the database
-kraken2-build --build --db kraken2-human-db/ --kmer-len 35 --minimizer-len 31
+k2 build --db kraken2-human-db/ --kmer-len 35 --minimizer-len 31 --threads 30
 
 # Clean up intermediate files
-kraken2-build --clean --db kraken2-human-db/
+k2 clean --db kraken2-human-db/
 ```
 
 **Parameter Definitions:**
-- `--download-taxonomy` - Instructs kraken2-build to download the NCBI taxonomic information.
-- `--db` - Specifies the name of the directory for the kraken2 database
-- `--add-to-library` - Instructs kraken2-build to add the contents of a file to the kraken2 DB library
+
+- `download-library` - Chooses the download library function
+  - `--library` - Specifies the references to download (here the human reference genome)
   - `--no-masking` - Disables masking of low-complexity sequences. For additional 
                    information see the [kraken documentation for masking](https://github.com/DerrickWood/kraken2/wiki/Manual#masking-of-low-complexity-sequences).
-- `--build` - Instructs kraken2-build to build the kraken2 DB from the library files
+  - `--db` - Specifies the name of the directory for the kraken2 database
+- `download-taxonomy` - Chooses the taxonomy download function
+  - `--db` - Specifies the name of the directory for the kraken2 database
+- `build` - Instructs the k2 wrapper to build the kraken2 DB from the available library files
   - `--kmer-len` - K-mer length in bp (default: 35).
   - `--minimizer-len` - Minimizer length in bp (default: 31)
-- `--clean` - Instructs kraken2-build to remove unneeded intermediate files.
+  - `--db` - Specifies the name of the directory for the kraken2 database
+- `clean` - Instructs the k2 wrapper to remove unneeded intermediate files.
+  - `--db` - Specifies the name of the directory for the kraken2 database
+
 
 **Input Data:**
 
-- human.fasta (fasta file containing human genome, for example, the human genome fasta downloaded from https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.fna.gz)
+- None
 
 **Output Data:**
 
@@ -592,7 +610,7 @@ gzip sample_HRrm_GLlblMetag.fastq
 
 **Input Data:**
 
-- kraken2_human_db/ (kraken2 human database directory, output from [Step 6a](#6a-build-kraken2-database))
+- kraken2_human_db/ (kraken2 human database directory, output from [Step 6a](#6a-build-kraken2-human-database))
 - sample_trimmed.fastq.gz (filtered and trimmed sample reads, output from [Step 5a](#5a-trim-filtered-data))
 
 **Output Data:**
@@ -699,7 +717,7 @@ minimap2 -t NumberOfThreads \
 
 **Input Data**
 
-- /path/to/contaminant_assembly/blank-assembly.fasta (contaminant assembly, output from [Step 7a](#7-assemble-contaminants))
+- /path/to/contaminant_assembly/blank-assembly.fasta (contaminant assembly, output from [Step 7a](#7a-assemble-contaminants))
 - sample_HRrm_GLlblMetag.fastq.gz (filtered, trimmed, and HRrm reads, output from [Step 6b](#6b-remove-human-reads))
 
 **Output Data**
@@ -802,11 +820,13 @@ samtools fastq -t -f 4 -o sample_decontam_GLlblMetag.fastq.gz -0 sample_decontam
 
 ```bash
 NanoPlot --only-report \
-         --prefix sample_noblank_ \
+         --prefix sample_decontam_ \
          --outdir /path/to/decontam_nanoplot_output \
          --threads NumberOfThreads \
          --fastq \
          sample_decontam_GLlblMetag.fastq.gz
+
+mv /path/to/decontam_nanoplot_output/sample_decontam_NanoPlot-report.html /path/to/decontam_nanoplot_output/sample_decontam_NanoPlot-report_GLlblMetag.html
 ```
 
 **Parameter Definitions:**
@@ -868,33 +888,38 @@ If the samples were derived from a host organism other than human, potential hos
 
 > **Note:** It is recommended to use NCBI genome files with kraken2 because sequences not downloaded from 
 NCBI may require explicit assignment of taxonomy information before they can be used to build the 
-database, as mentioned in the [Kraken2 Documentation](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown).
+database, as mentioned in the [Kraken2 Documentation](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown). 
+This step uses the kraken2 [k2 wrapper script](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#introducing-k2) throughout.
 
 ```bash
 # Download NCBI taxonomic information 
-kraken2-build --download-taxonomy --db kraken2-${hostname}-db/
+k2 download-taxonomy --db kraken2-${hostname}-db/
 
-# Add genomic sequences to your database's genomic library
-kraken2-build --add-to-library ${hostname}.fasta --db kraken2-${hostname}-db/ --no-masking 
+# Add host fasta sequences to the database's genomic library
+k2 add-to-library --files ${hostname}.fasta --db kraken2-${hostname}-db/ --threads 30 --no-masking
 
 # Build the database
-kraken2-build --build --db kraken2-${hostname}-db/ --kmer-length 35 --minimizer-length 31
+k2 build --db kraken2-${hostname}-db/ --kmer-len 35 --minimizer-len 31 --threads 30
 
 # Clean up intermediate files
-kraken2-build --clean --db kraken2-${hostname}-db/
+k2 clean --db kraken2-${hostname}-db/
 ```
 
 **Parameter Definitions:**
 
-- `--download-taxonomy` - Instructs kraken2-build to download the NCBI taxonomic information.
-- `--db` - Specifies the name of the directory for the kraken2 database
-- `--add-to-library` - Instructs kraken2-build to add the contents of a file (`${hostname}.fasta`) to the kraken2 DB library
+- `download-taxonomy` - Chooses the taxonomy download function
+  - `--db` - Specifies the name of the directory for the kraken2 database
+- `add-to-library` - Chooses the add-to-library function, which adds local sequence files to the kraken2 database library
+  - `--files` - Specifies the file(s) to add to the kraken2 database library
   - `--no-masking` - Disables masking of low-complexity sequences. For additional 
-                     information see the [kraken documentation for masking](https://github.com/DerrickWood/kraken2/wiki/Manual#masking-of-low-complexity-sequences).
-- `--build` - Instructs kraken2-build to build the kraken2 DB from the library files
+                   information see the [kraken documentation for masking](https://github.com/DerrickWood/kraken2/wiki/Manual#masking-of-low-complexity-sequences).
+  - `--db` - Specifies the name of the directory for the kraken2 database
+- `build` - Instructs k2 to build the kraken2 DB from the available library files
   - `--kmer-len` - K-mer length in bp (default: 35).
   - `--minimizer-len` - Minimizer length in bp (default: 31)
-- `--clean` - Instructs kraken2-build to remove unneeded intermediate files.
+  - `--db` - Specifies the name of the directory for the kraken2 database
+- `clean` - Instructs k2 to remove unneeded intermediate files.
+  - `--db` - Specifies the name of the directory for the kraken2 database
 - `{$hostname}` - Specifies the name of the host organism used to uniquely identify the kraken2 database
 
 **Input Data:**
@@ -935,7 +960,7 @@ gzip sample_HostRm_GLlblMetag.fastq
 
 **Input Data:**
 
-- kraken2_host_db/ (kraken2 host database directory, output from [Step 8a](#8a-build-kraken2-database))
+- kraken2_host_db/ (kraken2 host database directory, output from [Step 8a](#8a-build-kraken2-host-database))
 - sample_decontam_GLlblMetag.fastq.gz (filtered, trimmed, HRrm and contaminant-removed sample reads, output from [Step 7e](#7e-generate-decontaminated-read-files))
 
 **Output Data:**
@@ -984,10 +1009,13 @@ multiqc --zip-data-dir \
 
 ```R
 library(decontam)
+library(glue)
+library(htmlwidgets)
+library(pavian)
+library(pheatmap)
 library(phyloseq)
+library(plotly)
 library(tidyverse)
-library(pheatmap)
-library(pavian)
 ```
 
 #### 9b. Define Custom Functions
@@ -1115,8 +1143,6 @@ library(pavian)
   <summary>merge and process multiple kraken outputs to one species table</summary>
 
   ```R
-  library(pavian)
-
   merge_kraken_reports <- function(reports_dir) {
 
     reports <- read_reports(reports_dir)
@@ -1144,8 +1170,6 @@ library(pavian)
     # and convert table from dataframe to matrix
     species_names <- species_table[, "species"]
     rownames(species_table) <- species_names
-    species_table <- species_table[,-(which(colnames(species_table) == "species"))]
-    species_table <- as.matrix(species_table)
     
     return(species_table)
   }
@@ -1165,7 +1189,16 @@ library(pavian)
   ```R
   get_abundant_features <- function(mat, cpm_threshold = 1000){
   
-    features <- rowSums(mat) %>% sort()
+    # Filter out unassigned functions
+    unassigned <- "UNMAPPED|UNGROUPED|UNINTEGRATED|Not annotated"
+    mat <- mat %>%
+      as.data.frame %>%
+      rownames_to_column("Feature") %>%
+      filter(str_detect(Feature, unassigned, negate = TRUE))
+    rownames(mat) <- mat$Feature
+    mat <- mat[, -1]
+
+    features <- rowSums(mat, na.rm = TRUE) %>% sort()
     
     abund_features <- features[features > cpm_threshold] %>% names
     
@@ -1230,7 +1263,7 @@ library(pavian)
                          filter(str_detect(Species, non_microbial, negate = TRUE))
     # Calculate species relative abundance
     clean_tab <- clean_tab_count %>%
-      mutate( across( where(is.numeric), function(x) (x/sum(x, na.rm = TRUE))*100 ) )
+      mutate(across(where(is.numeric), function(x) (x/sum(x, na.rm = TRUE))*100))
     # Set rownames as species name and drop species column
     rownames(clean_tab) <- clean_tab$Species
     clean_tab  <- clean_tab[, -1]
@@ -1283,7 +1316,7 @@ library(pavian)
     }
     
     if(is.null(taxa_to_group)) {
-      message(glue::glue("Rare taxa were not grouped. please provide a higher 
+      message(glue("Rare taxa were not grouped. please provide a higher 
                         threshold than {threshold} for grouping rare taxa, 
                         only numbers are allowed."))
       return(abund_table)
@@ -1325,26 +1358,26 @@ library(pavian)
   ```R
   # Make bar plot
   make_plot <- function(abund_table, metadata, custom_palette, publication_format,
-                        samples_column="Sample_ID", prefix_to_remove="barcode"){
+                        samples_column="sample_id", prefix_to_remove="barcode"){
   
     abund_table_wide <- abund_table %>%
-        as.data.frame %>%
+        as.data.frame() %>%
         rownames_to_column(samples_column) %>%
         inner_join(metadata) %>%
         select(!!!colnames(metadata), everything()) %>%
         mutate(!!samples_column := !!sym(samples_column) %>% str_remove(prefix_to_remove))
         
       
-    abund_table_long <- abund_table_wide  %>%
-        pivot_longer(-colnames(metadata), 
+    abund_table_long <- abund_table_wide %>%
+        pivot_longer(-colnames(metadata),
                      names_to = "Species",
                      values_to = "relative_abundance")
       
-    p <- ggplot(abund_table_long, mapping = aes(x = !!sym(samples_column), 
+    p <- ggplot(abund_table_long, mapping = aes(x = !!sym(samples_column),
                                                 y = relative_abundance, fill = Species)) +
          geom_col() +
-         scale_fill_manual(values = custom_palette) + 
-         labs(x=NULL, y="Relative Abundance (%)") + 
+         scale_fill_manual(values = custom_palette) +
+         labs(x = NULL, y = "Relative Abundance (%)") +
          publication_format
 
     return(p)
@@ -1352,7 +1385,7 @@ library(pavian)
   ```
 
   **Function Parameter Definitions:**
-  - `abund_table` - a relative bundance dataframe with rows summing to 100%
+  - `abund_table` - a relative abundance dataframe with rows summing to 100%
   - `metadata` - a metadata dataframe with samples as row and columns describing each sample
   - `custom_palette` - a vector of strings specifying a custom color palette for coloring plots
   - `publication_format` - a ggplot::theme object specifying a custom theme for plotting
@@ -1372,30 +1405,54 @@ library(pavian)
                            feature_column = "species", samples_column = "sample_id", group_column = "group", 
                            output_prefix, assay_suffix = "_GLlblMetag",
                            publication_format, custom_palette) {
+    facet_by <- reformulate(group_column)
     # Prepare feature table
-    feature_table <- read_csv(feature_table_file)
+    feature_table <- read_delim(feature_table_file)
     rownames(feature_table) <- feature_table[[1]]
     feature_table <- feature_table[, -1]
 
+    number_of_species <- nrow(feature_table)
+
+    if (number_of_species > length(custom_palette)) {
+      N <- number_of_species / length(custom_palette)
+      custom_palette <- rep(custom_palette, times = N * 2)
+    }
+
     # Prepare metadata
-    metadata <- read_delim(metdata_file, delim = ",") %>% as.data.frame
+    metadata <- read_delim(metadata_table_file, delim = ",") %>% as.data.frame
     row.names(metadata) <- metadata[, samples_column]
 
     # compute abundances from counts
     abund_table <- count_to_rel_abundance(feature_table)
+
+    metadata <- metadata %>%
+                mutate(!!sym(group_column) := str_wrap(!!sym(group_column) %>%
+                         str_replace_all("_", " "), width = 10)
+                )
     
     # create plot
     p <- make_plot(abund_table, metadata, custom_palette, publication_format, samples_column) +
-         facet_wrap(~Description, nrow=1, scales = "free_x")
+         facet_wrap(facet_by, nrow = 1, scales = "free_x", labeller = label_wrap_gen(width = 10)) +
+         theme(axis.text.x = element_text(angle = 90))
 
+    static_plot <- p
     number_of_species <- p$data$Species %>% unique() %>% length()
-    # Don't save legend if the number of species to plot is gsreater than 30
+    # Don't save legend if the number of species to plot is greater than 30
     if(number_of_species > 30) {
-      p <- p + theme(legend.position = "none")
+      static_plot <- static_plot + theme(legend.position = "none")
     }
 
-    return(p)
-
+    width <- 2 * nrow(metadata) # 2 inches per sample
+    if(width < 14) { width <- 14 } # Set minimum width to 14 inches
+    if(width > 50) { width <- 50 } # Cap plot width at 50 inches
+    # Save Static
+    ggsave(filename = glue("{output_prefix}_barplot{assay_suffix}.png"), 
+           plot = static_plot,
+           device = 'png', width = width,
+           height = 10, units = 'in', dpi = 300 , limitsize = FALSE)
+
+    # Save interactive
+    htmlwidgets::saveWidget(ggplotly(p), glue("{output_prefix}_barplot{assay_suffix}.html"), selfcontained = TRUE)
   }
   ```
   **Custom Functions Used:**
@@ -1415,7 +1472,7 @@ library(pavian)
   - `publication_format` - a ggplot::theme object specifying a custom theme for plotting, from [Step 9c](#9c-set-global-variables)
   - `custom_palette` - a vector of strings specifying a custom color palette for coloring plots, from [Step 9c](#9c-set-global-variables)
 
-  **Returns:** a relative abundance stacked bar plot, `p`, as output from [make_plot](#make_plot)
+  **Output Data:** 2 barplot files, `{output_prefix}_barplot{assay_suffix}.png` (static) and `{output_prefix}_barplot{assay_suffix}.html` (interactive), each containing the relative abundance stacked bar plot generated with [make_plot](#make_plot)
 
 </details>
 
@@ -1424,18 +1481,38 @@ library(pavian)
   <summary>Creates heatmaps from a feature table file</summary>
   
   ```R
-  make_heatmap <- function(metadata, species_gene_table, 
+  make_heatmap <- function(metadata_table_file, feature_table_file, 
                            samples_column = "sample_id", group_column = "group", 
                            output_prefix, assay_suffix = "_GLlblMetag",
                            custom_palette) {
+    # Prepare feature table
+    feature_table <- read_delim(feature_table_file) %>%  as.data.frame()
+    rownames(feature_table) <- feature_table[[1]]
+    feature_table <- feature_table[,-1] %>% as.matrix()
+    colnames(feature_table) <-  colnames(feature_table) %>% str_remove_all("barcode")
+
+    # Prepare metadata
+    metadata <- read_delim(metadata_table_file) %>% as.data.frame()
+    row.names(metadata) <- metadata[,samples_column] %>% str_remove_all("barcode")
+
+    # Get common samples and re-arrange feature table and metadata
+    common_samples <- intersect(colnames(feature_table), rownames(metadata))
+    feature_table <- feature_table[, common_samples]
+    metadata <- metadata[common_samples,]
+    metadata <- metadata %>% arrange(!!sym(group_column))
+
     # Create column annotation
     col_annotation <- as.data.frame(metadata)[, group_column, drop = FALSE]
 
     # Calculate output plot width and height
     number_of_samples <- ncol(feature_table)
     width <- 1 * number_of_samples
+    if (width < 10) { width <- 10} # Set the minimum width to 10 inches
+    if (width > 100) { width <- 100} # Set the maximum width to 100 inches
     number_of_features <- nrow(feature_table)
     height <- 0.2 * number_of_features
+    if (height < 10) { height <- 10 } # Set the minimum height to 10 inches
+    if (height > 100) { height <- 100 } # Set the maximum height to 100 inches (highest that won't generate an error)
 
     # Set colors by group
     groups <- metadata[[group_column]] %>%  unique()
@@ -1459,10 +1536,32 @@ library(pavian)
              annotation_colors = annotation_colors,
              number_format = "%.0f")
     dev.off()
+
+    sorted_features <- rowSums(feature_table) %>% sort(decreasing = TRUE)
+
+    # Also plot only the top 50 features, as it is often difficult to visualize all features at once
+    if(length(sorted_features) >= 50) { 
+
+      top50 <- sorted_features[1:50]
+
+      png(filename = glue("{output_prefix}_top_50_heatmap{assay_suffix}.png"), width = width,
+          height = 12, units = "in", res=300)
+      pheatmap(mat = feature_table[names(top50), rownames(col_annotation)],
+               cluster_cols = FALSE, 
+               cluster_rows = FALSE,
+               col = colorRampPalette(c('white','red'))(255), 
+               angle_col = 90, 
+               display_numbers = TRUE, 
+               fontsize = 12, 
+               annotation_col = col_annotation,
+               annotation_colors = annotation_colors,
+               number_format = "%.0f")
+      dev.off()
+    }
   }
   ```
   **Function Parameter Definitions:**
-  - `metadata_file` - path to a file with samples as rows and columns describing each sample
+  - `metadata_table_file` - path to a file with samples as rows and columns describing each sample
   - `feature_table_file` - path to a tab separated samples feature table i.e. species/functions 
                            table with species/functions as the first column and samples as other columns.
   - `samples_column` - a character string specifying the column in `metadata` holding sample names, default: "sample_id"
@@ -1472,7 +1571,7 @@ library(pavian)
   - `assay_suffix` - a character string specifying the GeneLab assay suffix (default: "_GLlblMetag")
   - `custom_palette` - a vector of strings specifying a custom color palette for coloring plots, from [Step 9c](#9c-set-global-variables)
 
-  **Returns:** heatmap png file, `{output_prefix}_heatmap{assay_suffix}.png`, of species/functions across samples from the input feature table
+  **Output Data:** 2 heatmap png files, `{output_prefix}_heatmap{assay_suffix}.png` and `{output_prefix}_top_50_heatmap{assay_suffix}.png`, of species/functions across samples from the input feature table
 
 </details>
 
@@ -1481,8 +1580,8 @@ library(pavian)
   <summary>Feature table decontamination with decontam</summary>
 
   ```R
-  run_decontam <- function(feature_table, metadata, contam_threshold=0.1, 
-                           prev_col = NULL, freq_col = NULL, ntc_name = "TRUE") {
+  run_decontam <- function(feature_table, metadata, contam_threshold=0.5, 
+                           prev_col = NULL, freq_col = NULL, ntc_name = "true") {
 
     # retain metadata for only the samples present in the input feature table
     sub_metadata <- metadata[colnames(feature_table), ]
@@ -1500,6 +1599,7 @@ library(pavian)
           )
         )
       sub_metadata[, freq_col] <- as.numeric(sub_metadata[, freq_col])
+      sub_metadata[, prev_col] <- tolower(sub_metadata[, prev_col])
 
     }
 
@@ -1548,37 +1648,40 @@ library(pavian)
 
 </details>
 
-#### feature_decontam() 
+#### feature_decontam()
 <details>
-  <summary>decontaminate a feature table</summary>
+  <summary>decontaminate a feature table using the decontam R package, which statistically identifies contaminating features</summary>
   
   ```R
-  library(tidyverse)
-  library(glue)
-
   feature_decontam <- function(metadata_file, feature_table_file, 
                                feature_column = "Species", samples_column = "sample_id",
-                               prevalence_column = "NTC", ntc_name = "TRUE", 
+                               prevalence_column = "NTC", ntc_name = "true", 
                                frequency_column = "concentration", 
-                               threshold = 0.1, classification_method, 
+                               threshold = 0.5, classification_method, 
                                output_prefix, assay_suffix = "_GLlblMetag") {
     # Prepare feature table
-    feature_table <- read_csv(feature_table_file) %>%  as.data.frame
+    feature_table <- read_delim(feature_table_file) %>%  as.data.frame
     rownames(feature_table) <- feature_table[[1]]
     feature_table <- feature_table[, -1]  %>% as.matrix()
 
     # Prepare metadata
-    metadata <- read_csv(metadata_file) %>% as.data.frame
+    metadata <- read_delim(metadata_file) %>% as.data.frame
     row.names(metadata) <- metadata[, samples_column]
 
     # Run decontam
+    # Use the prevalence/frequency columns only if their values vary across samples; otherwise pass NULL
+    prev_col <- if (length(unique(metadata[, prevalence_column])) == 1) NULL else prevalence_column
+    freq_col <- if (length(unique(metadata[, frequency_column])) == 1) NULL else frequency_column
     contamdf <- run_decontam(feature_table, metadata, threshold, prev_col, freq_col, ntc_name) 
 
     contamdf <- as.data.frame(contamdf) %>% rownames_to_column(feature_column)
 
+    type <- 'species'
+    if (classification_method == 'gene-function') { type <- "KO" }
+
     # Write decontaminated feature table and decontam's primary results
-    outfile <- glue("{output_prefix}{classification_method}_decontam_results{assay_suffix}.csv")
-    write_csv(x = contamdf, file = outfile)
+    outfile <- glue("{output_prefix}_decontam_results{assay_suffix}.tsv")
+    write_tsv(x = contamdf, file = outfile)
 
     # Get the list of contaminants identified by decontam
     contaminants <- contamdf %>%
@@ -1600,8 +1703,8 @@ library(pavian)
       rownames(decontaminated_table) <- decontaminated_table[[feature_column]]
       decontaminated_table <- decontaminated_table[,-1] %>% as.matrix
 
-      outfile <- glue("{output_prefix}{classification_method}_decontam_species_table{assay_suffix}.csv")
-      write_csv(x = decontaminated_table, file = outfile)
+      outfile <- glue("{output_prefix}_decontam_{type}_table{assay_suffix}.tsv")
+      write_tsv(x = decontaminated_table, file = outfile)
 
       return(decontaminated_table)
 
@@ -1624,15 +1727,15 @@ library(pavian)
   - `frequency_column` - a character string specifying the column in `metadata` to use for frequency based analysis, default: "concentration"
   - `prevalence_column` - a character string specifying the column in `metadata` to use for prevalence based analysis, default: "NTC"
   - `ntc_name` - a character string specifying the value in the prevalence column for all negative template control samples, default: "TRUE"
-  - `threshold` - a number between 0 and 1 specfying the decontam threshold for both prevalence and frequency based analyses. default: 0.1
+  - `threshold` - a number between 0 and 1 specifying the decontam threshold for both prevalence and frequency based analyses. default: 0.5
   - `output_prefix` - a character string specifying the unique name to add to the output file names 
                       used to denote the data type/source, for example "unfiltered-kaiju_species"
   - `classification_method` - a character string specifying the tool used to generate the classifications ['kaiju', 'kraken2', 'metaphlan', 'contig-taxonomy', 'gene-taxonomy', 'gene-function']
   - `assay_suffix` - a character string specifying the GeneLab assay suffix (default: "_GLlblMetag")
 
   **Output Data:**
-  - {classification_method}_decontam_species_table_GLlblMetag.csv - decontaminated feature table file
-  - {classification_method}_decontam_results_GLlblMetag.csv - Decontam results file
+  - {output_prefix}_decontam_{species|KO}_table_GLlblMetag.tsv - decontaminated feature table file
+  - {output_prefix}_decontam_results_GLlblMetag.tsv - Decontam results file
 
   **Returns:** dataframe, `decontaminated_table`, containing the decontaminated feature table
 
@@ -1680,7 +1783,7 @@ library(pavian)
   <summary>clean taxonomy names</summary>
 
   ```R
-  fix_names<- function(taxonomy,stringToReplace="Othe",suffix=";Other"){
+  fix_names<- function(taxonomy,stringToReplace="Other",suffix=";_"){
     
     for(index in seq_along(stringToReplace)){
 
@@ -1688,7 +1791,7 @@ library(pavian)
         # Get the row indices of the current taxonomy columns
         # with rows matching the sting in `stringToReplace`
         indices <- grep(x = taxonomy[,taxa_index], pattern = stringToReplace)
-        # Replace the value in that row with the value in the adjacent cell concated with `suffix`
+        # Replace the value in that row with the value in the adjacent cell concatenated with `suffix`
         taxonomy[indices,taxa_index] <-
           paste0(taxonomy[indices,taxa_index-1],
                 rep(x = suffix, times=length(indices)))
@@ -1708,25 +1811,24 @@ library(pavian)
 
 </details>
 
-#### read_assembly_coverage_table()
+#### read_taxonomy_table()
 <details>
   <summary>Read Assembly-based coverage annotation table</summary>
 
   ```R
-  read_assembly_coverage_table <- function(file_name, sample_names){
+  read_taxonomy_table <- function(df, sample_names){
   
-    df <- read_delim(file = file_name, delim = "\t", comment = "#")
-
-    # Subset taxoxnomy portion (domain:species) of input table
+    # Subset taxonomy portion (domain:species) of input table
     # and replace empty/Na domain assignments with "Unclassified"
     taxonomy_table <- df %>%
       select(domain:species) %>%
       mutate(domain=replace_na(domain, "Unclassified"))
     
     # Subset count table
+    sample_names <- get_samples(df, sample_names)
     counts_table <- df %>% select(!!any_of(sample_names))
 
-    # Mutate taxonomy mames
+    # Mutate taxonomy names
     taxonomy_table  <- process_taxonomy(taxonomy_table)
     taxonomy_table <- fix_names(taxonomy_table, "Other", ";_")
 
@@ -1743,38 +1845,39 @@ library(pavian)
 
   **Function Parameter Definitions:**
 
-  - `file_name` - path to contig taxonomy assignment file to be read
-  - `sample_names` - string of samples names to keep in the final dataframe
+  - `df` - dataframe containing assembly-based coverage
+  - `sample_names` - a character vector of sample names to keep in the final dataframe
 
   **Returns:** dataframe, `df`, containing cleaned taxonomy names and sample species count
 
 </details>
 
-#### get_sample_names()
+#### get_samples()
 <details>
   <summary>retrieve sample names for which assemblies were generated</summary>
 
   ```R
-  get_sample_names <- function (assembly_summary) {
-    # Read in table and drop columns were all rows are NA
-    overview_table <-  read_delim(file = assembly_summary, delim = "\t", comment = "#") %>%
-                        select(where( ~all(!is.na(.)) )) 
-
-    col_names <- names(overview_table) %>% str_remove_all("-assembly")
-    sample_order <- col_names[-1] %>% sort()
-
-    return(sample_order)
+  get_samples <- function(assembly_table_df, sample_names, end_col='species') {
+    # Get common samples 
+    cols <- colnames(assembly_table_df)
+    index <- which(cols == end_col)
+    start <- index + 1
+    end <- length(cols)
+    df_samples <- cols[start:end]
+    sample_names <- intersect(df_samples, sample_names)
+
+    return(sample_names)
   }
   ```
   **Function Parameter Definitions:**
+  - `assembly_table_df` - dataframe containing assembly-based coverage
+  - `sample_names` - a character vector of sample names to keep in the final dataframe
+  - `end_col` - a character string specifying the name of the last taxonomy column, after which the sample columns begin, default: "species"
 
-  - `assembly_summary` - path to assembly summary file
-
-  **Returns:** a character vector, `sample_order`, of sorted sample names
+  **Returns:** a character vector, `sample_names`, of sample names that appear in both the assembly dataframe and the sample_names list
 
 </details>
 
-
 #### 9c. Set global variables
 
 ```R
@@ -2002,16 +2105,14 @@ ktImportText  -o kaiju-report.html ${KTEXT_FILES[*]}
 #### 10f. Create Kaiju Species Count Table
 
 ```R
-library(tidyverse)
 feature_table <- process_kaiju_table(file_path="merged_kaiju_table_GLlblMetag.tsv")
 table2write <- feature_table  %>%
-                as.data.frame %>%
-                rownames_to_column("Species")
-write_csv(x = table2write, file = "kaiju_species_table_GLlblMetag.csv")
+               as.data.frame %>%
+               rownames_to_column("Species")
+write_tsv(x = table2write, file = "kaiju_species_table_GLlblMetag.tsv")
 ```
 
 **Custom Functions Used:**
-
 - [process_kaiju_table()](#process_kaiju_table)
 
 **Parameter Definitions:**
@@ -2026,23 +2127,23 @@ write_csv(x = table2write, file = "kaiju_species_table_GLlblMetag.csv")
 
 **Output Data:**
 
-- **kaiju_species_table_GLlblMetag.csv** (kaiju species count table in csv format)
+- **kaiju_species_table_GLlblMetag.tsv** (kaiju species count table in tsv format)
 
 
 #### 10g. Filter Kaiju Species Count Table
 
 ```R
-library(tidyverse)
-
-input_file <- "kaiju_species_table_GLlblMetag.csv"
-output_file <- "kaiju_filtered_species_table_GLlblMetag.csv"
+feature_table_file <- "kaiju_species_table_GLlblMetag.tsv"
+output_file <- "kaiju_filtered_species_table_GLlblMetag.tsv"
 threshold <- 0.5
 
 # string used to define non-microbial taxa
-non_microbial <- "UNCLASSIFIED|Unclassifed|unclassified|Homo sapien|cannot|uncultured|unidentified"
+non_microbial <- "UNCLASSIFIED|Unclassified|unclassified|Homo sapien|cannot|uncultured|unidentified"
 
 # read in feature table
-feature_table <- read_csv(input_file) %>% as.data.frame
+feature_table <- read_delim(feature_table_file) %>%
+                 mutate(across(where(is.numeric), function(col) replace_na(col, 0))) %>%
+                 as.data.frame()
 feature_name <- colnames(feature_table)[1]
 rownames(feature_table) <- feature_table[,1]
 feature_table <- feature_table[, -1]
@@ -2059,7 +2160,7 @@ table2write <- group_low_abund_taxa(abund_table, threshold = threshold)  %>%
   t %>% as.data.frame %>%
   rownames_to_column(feature_name)
 
-write_csv(x = table2write, file = output_file)
+write_tsv(x = table2write, file = output_file)
 ```
 
 **Custom Functions Used:**
@@ -2073,49 +2174,31 @@ write_csv(x = table2write, file = output_file)
 
 **Input Data:**
 
-- kaiju_species_table_GLlblMetag.csv (path to kaiju species table from [Step 10f](#10f-create-kaiju-species-count-table))
+- kaiju_species_table_GLlblMetag.tsv (path to kaiju species table from [Step 10f](#10f-create-kaiju-species-count-table))
 
 **Output Data:**
 
-- **kaiju_filtered_species_table_GLlblMetag.csv** - a file containing the filtered species table
+- **kaiju_filtered_species_table_GLlblMetag.tsv** - a file containing the filtered species table
 
 ---
 
-#### 10h. Taxonomy Barplots
+#### 10h. Kaiju Taxonomy Barplots
 
 ```R
-library(tidyverse)
-
-species_table_file <- "kaiju_species_table_GLlblMetag.csv"
-filtered_species_table_file <- "kaiju_filtered_species_table_GLlblMetag.csv"
+species_table_file <- "kaiju_species_table_GLlblMetag.tsv"
+filtered_species_table_file <- "kaiju_filtered_species_table_GLlblMetag.tsv"
 metadata_file <- "/path/to/sample/metadata"
-number_samples <- 10 
-
-# set width based on number of samples, with a cap at 50 inches
-plot_width <- 2 * number_samples
-if(plot_width > 50) { plot_width = 50 }
 
-p <- make_barplot(metadata_file = metadata_file, feature_table_file = species_table_file, 
-                  feature_column = "Species", samples_column = "sample_id", group_column = "group",
-                  publication_format = publication_format, custom_palette = custom_palette)
-
-ggsave(filename = "kaiju_unfiltered_species_barplot_GLlblMetag.png", plot = p,
-       device = "png", width = plot_width, height = 10, units = "in", dpi = 300, limitsize = FALSE)
+make_barplot(metadata_file = metadata_file, feature_table_file = species_table_file, 
+             output_prefix = "kaiju_unfiltered_species", assay_suffix = "_GLlblMetag",
+             feature_column = "Species", samples_column = "sample_id", group_column = "group",
+             publication_format = publication_format, custom_palette = custom_palette)
 
 # Save static unfiltered plot
-p <- make_barplot(metadata_file = metadata_file, feature_table_file = filtered_species_table_file, 
-                  feature_column = "Species", samples_column = "sample_id", group_column = "group",
-                  publication_format = publication_format, custom_palette = custom_palette)
-
-# Save interactive unfilterted plot
-htmlwidgets::saveWidget(ggplotly(p), glue("kaiju_unfiltered_species_barplot_GLlblMetag.html"), selfcontained = TRUE)
-
-# Save static filtered plot
-ggsave(filename = glue("kaiju_filtered_species_barplot_GLlblMetag.png"), plot = p,
-      device = 'png', width = plot_width, height = 10, units = 'in', dpi = 300, limitsize = FALSE)
-
-# Save interactive filtered plot
-htmlwidgets::saveWidget(ggplotly(p), glue("kaiju_filtered_species_barplot_GLlblMetag.html"), selfcontained = TRUE)
+make_barplot(metadata_file = metadata_file, feature_table_file = filtered_species_table_file, 
+             feature_column = "Species", samples_column = "sample_id", group_column = "group",
+             output_prefix = "kaiju_filtered_species", assay_suffix = "_GLlblMetag",
+             publication_format = publication_format, custom_palette = custom_palette)
 ```
 
 **Custom Functions Used:**
@@ -2126,13 +2209,12 @@ htmlwidgets::saveWidget(ggplotly(p), glue("kaiju_filtered_species_barplot_GLlblM
 - `species_table_file` - a file containing the species count table
 - `filtered_species_table_file` - a file containing the filtered species count table
 - `metadata_file` - a file containing group information for each sample in the species count files
-- `number_samples` - the total number of samples in the species count files, adjust based on number of input samples.
 
 **Input Data:**
 
-- `kaiju_species_table_GLlblMetag.csv` (a file containing the species count table, output from [Step 10f](#10f-create-kaiju-species-count-table))
-- `kaiju_filtered_species_table_GLlblMetag.csv` (a file containing the filtered species count table, output from [Step 10g](#10g-filter-kaiju-species-count-table))
-- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping samplenames to group metadata)
+- `kaiju_species_table_GLlblMetag.tsv` (a file containing the species count table, output from [Step 10f](#10f-create-kaiju-species-count-table))
+- `kaiju_filtered_species_table_GLlblMetag.tsv` (a file containing the filtered species count table, output from [Step 10g](#10g-filter-kaiju-species-count-table))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
 
 
 **Output Data:**
@@ -2143,46 +2225,30 @@ htmlwidgets::saveWidget(ggplotly(p), glue("kaiju_filtered_species_barplot_GLlblM
 - **kaiju_filtered_species_barplot_GLlblMetag.html** (interactive taxonomy barplot after filtering rare and non-microbial taxa)
 
 
-#### 10i. Feature Decontamination
+#### 10i. Kaiju Feature Decontamination
 
-> Feature (species) decontamination with decontam. Decontam is an R package that statistically identifies contaminating features in a feature table
+> Note: The decontaminated species table and barplots are only generated if one or more contaminants are detected.
 
 ```R
-library(tidyverse)
-library(decontam)
-library(phyloseq)
-
-feature_table_file <- "kaiju_filtered_species_table_GLlblMetag.csv"
+feature_table_file <- "kaiju_filtered_species_table_GLlblMetag.tsv"
 metadata_table <- "/path/to/sample/metadata"
-number_samples <- NumberOfSamples # integer indicating how many samples are in the file
-
-# set width based on number of samples, with a cap at 50 inches
-plot_width <- 2 * number_samples
-if(plot_width > 50) { plot_width = 50 }
 
 decontaminated_table <- feature_decontam(metadata_file = metadata_table, 
                                          feature_table_file = feature_table_file, 
                                          feature_column = "species", 
                                          samples_column = "sample_id",
                                          prevalence_column = "NTC", 
-                                         ntc_name = "TRUE", 
+                                         ntc_name = "true", 
                                          frequency_column = "concentration", 
-                                         threshold = 0.1, 
+                                         threshold = 0.5, 
                                          classification_method = "kaiju", 
-                                         output_prefix = "", 
+                                         output_prefix = "kaiju", 
                                          assay_suffix = "_GLlblMetag")
 
-# Convert count matrix to relative abundance matrix
-decontaminated_species_table <- count_to_rel_abundance(decontaminated_table)
-
-# Make plot after filtering out contaminants
-p <- make_plot(decontaminated_species_table, metadata, custom_palette, publication_format)
-
-ggsave(filename = "kaiju_decontam_species_barplot_GLlblMetag.png", plot = p,
-         device = "png", width = plot_width, height = 10, units = "in", dpi = 300)
-
-# Save interactive filtered plot
-htmlwidgets::saveWidget(ggplotly(p), glue("kaiju_decontam_species_barplot_GLlblMetag.html"), selfcontained = TRUE)
+make_barplot(metadata_file = metadata_table, feature_table_file = "kaiju_decontam_species_table_GLlblMetag.tsv", 
+             feature_column = "Species", samples_column = "sample_id", group_column = "group",
+             output_prefix = "kaiju_decontam_species", assay_suffix = "_GLlblMetag",
+             publication_format = publication_format, custom_palette = custom_palette)
 ```
 
 **Custom Functions Used:**
@@ -2195,19 +2261,18 @@ htmlwidgets::saveWidget(ggplotly(p), glue("kaiju_decontam_species_barplot_GLlblM
 - `metadata_table` - path to a file with samples as rows and columns describing each sample
 - `feature_table_file` - path to a tab separated samples feature table i.e. species/functions 
                          table with species/functions as the first column and samples as other columns.
-- `number_samples` - the total number of samples in the species count files, adjust based on number of input samples in the feature_table_file
 
 **Input Data:**
 
-- `kaiju_filtered_species_table_GLlblMetag.csv`(path to filtered species count per sample, output from [Step 10g](#10g-filter-kaiju-species-count-table))
-- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping samplenames to group metadata)
+- `kaiju_filtered_species_table_GLlblMetag.tsv`(path to filtered species count per sample, output from [Step 10g](#10g-filter-kaiju-species-count-table))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
 
 **Output Data:**
 
-- **kaiju_decontam_results_GLlblMetag.csv** (decontam's result table, output from [feature_decontam() function](#feature_decontam))
-- **kaiju_decontam_species_table_GLlblMetag.csv** (decontaminated species table, output from [feature_decontam() function](#feature_decontam))
-- kaiju_decontam_species_barplot_GLlblMetag.png (barplot after filtering out contaminants)
-- **kaiju_decontam_species_barplot_GLlblMetag.html** (barplot after filtering out contaminants)
+- **kaiju_decontam_results_GLlblMetag.tsv** (decontam's result table, output from [feature_decontam()](#feature_decontam))
+- **kaiju_decontam_species_table_GLlblMetag.tsv** (decontaminated species table, output from [feature_decontam()](#feature_decontam))
+- kaiju_decontam_species_barplot_GLlblMetag.png (barplot after filtering out contaminants, output from [make_barplot()](#make_barplot))
+- **kaiju_decontam_species_barplot_GLlblMetag.html** (barplot after filtering out contaminants, output from [make_barplot()](#make_barplot))
 
 <br>
 
@@ -2229,8 +2294,8 @@ INSPECT_URL=https://genome-idx.s3.amazonaws.com/kraken/pluspfp_20250714/inspect.
 wget ${INSPECT_URL}
 
 # Library report
-LIRARY_REPORT_URL=https://genome-idx.s3.amazonaws.com/kraken/pluspfp_20250714/library_report.tsv
-wget ${LIRARY_REPORT_URL}
+LIBRARY_REPORT_URL=https://genome-idx.s3.amazonaws.com/kraken/pluspfp_20250714/library_report.tsv
+wget ${LIBRARY_REPORT_URL}
 
 # Md5sums
 MD5_URL=https://genome-idx.s3.amazonaws.com/kraken/pluspfp_20250714/pluspfp.md5 
@@ -2249,7 +2314,7 @@ tar -xvzf k2_pluspfp.tar.gz
 - `--timeout=3600` - Specifies the network timeout in seconds.
 - `--tries=0` - Retry download infinitely.
 - `--continue` -  Continue getting a partially-downloaded file.
-- `*_URL` - Position arguement specifying the url to download a particular resource from.
+- `*_URL` - Position argument specifying the url to download a particular resource from.
 
 *tar*
 - `-xvzf` - unpack the specified *tar.gz archive in verbose mode
@@ -2257,7 +2322,7 @@ tar -xvzf k2_pluspfp.tar.gz
 **Input Data:**
 
 - `INSPECT_URL=` - url specifying the location of kraken2 inspect file
-- `LIRARY_REPORT_URL=` - url specifying the location of kraken2 library report file
+- `LIBRARY_REPORT_URL=` - url specifying the location of kraken2 library report file
 - `MD5_URL=` - url specifying the location of the md5 file of the kraken database
 - `DB_URL=` - url specifying the location of the main kraken database archive in .tar.gz format
 
@@ -2306,7 +2371,7 @@ kraken2 --db kraken2-db/ \
 
 ```R
 species_table <- merge_kraken_reports(reports-dir = '/path/to/kraken2/reports')
-write_csv(x = species_table, file = "kraken2_species_table_GLlblMetag.csv")
+write_tsv(x = species_table, file = "kraken2_species_table_GLlblMetag.tsv")
 ```
 
 **Custom Functions Used:**
@@ -2321,11 +2386,11 @@ write_csv(x = species_table, file = "kraken2_species_table_GLlblMetag.csv")
 
 **Input Data:**
 
-- \*-kraken2-report.tsv (kraken report from each sample to compile, outputs from [Step 11b](#11b-taxonomic-classification))
+- \*-kraken2-report.tsv (kraken report from each sample to compile, outputs from [Step 11b](#11b-kraken2-taxonomic-classification))
 
 **Output Data:**
 
-- **kraken2_species_table_GLlblMetag.csv** (kraken species count table in csv format)
+- **kraken2_species_table_GLlblMetag.tsv** (kraken species count table in tsv format)
 
 ##### 11cii. Compile Kraken2 Taxonomy Reports
 
@@ -2347,7 +2412,7 @@ multiqc --zip-data-dir \
 
 **Input Data:**
 
-- \*-kraken2-report.tsv (kraken report from each sample to compile, outputs from [Step 11b](#11b-taxonomic-classification))
+- \*-kraken2-report.tsv (kraken report from each sample to compile, outputs from [Step 11b](#11b-kraken2-taxonomic-classification))
 
 **Output Data:**
 
@@ -2369,7 +2434,7 @@ kreport2krona.py --report-file sample-kraken2-report.tsv  \
 
 **Input Data:**
 
-- sample-kraken2-report.tsv (kraken report, output from [Step 11b](#11b-taxonomic-classification))
+- sample-kraken2-report.tsv (kraken report, output from [Step 11b](#11b-kraken2-taxonomic-classification))
 
 **Output Data:**
 
@@ -2412,7 +2477,7 @@ ktImportText -o kraken2-report_GLlblMetag.html ${KTEXT_FILES[*]}
 
 *ktImportText*
 - `-o` - Specifies the compiled output html file name.
-- `${KTEXT_FILES[*]}` - An array positional arguement with the following content: sample_1.krona,sample_1 sample_2.krona,sample_2 .. sample_n.krona,sample_n.
+- `${KTEXT_FILES[*]}` - An array positional argument with the following content: sample_1.krona,sample_1 sample_2.krona,sample_2 .. sample_n.krona,sample_n.
 
 **Input Data:**
 
@@ -2430,17 +2495,17 @@ ktImportText -o kraken2-report_GLlblMetag.html ${KTEXT_FILES[*]}
 #### 11f. Filter Kraken2 Species Count Table
 
 ```R
-library(tidyverse)
-
-input_file <- "kraken2_species_table_GLlblMetag.csv"
-output_file <- "kraken2_filtered_species_table_GLlblMetag.csv"
+feature_table_file <- "kraken2_species_table_GLlblMetag.tsv"
+output_file <- "kraken2_filtered_species_table_GLlblMetag.tsv"
 threshold <- 0.5
 
 # string used to define non-microbial taxa
-non_microbial <- "UNCLASSIFIED|Unclassifed|unclassified|Homo sapien|cannot|uncultured|unidentified"
+non_microbial <- "UNCLASSIFIED|Unclassified|unclassified|Homo sapien|cannot|uncultured|unidentified"
 
 # read in feature table
-feature_table <- read_csv(input_file) %>% as.data.frame
+feature_table <- read_delim(feature_table_file) %>%
+                 mutate(across(where(is.numeric), function(col) replace_na(col, 0))) %>%
+                 as.data.frame()
 feature_name <- colnames(feature_table)[1]
 rownames(feature_table) <- feature_table[,1]
 feature_table <- feature_table[, -1]
@@ -2450,7 +2515,7 @@ table2write <- filter_rare(feature_table, non_microbial, threshold = threshold)
   as.data.frame %>%
   rownames_to_column(feature_name)
 
-write_csv(x = table2write, file = output_file)
+write_tsv(x = table2write, file = output_file)
 ```
 
 **Custom Functions Used:**
@@ -2464,52 +2529,33 @@ write_csv(x = table2write, file = output_file)
 
 **Input Data:**
 
-- kraken2_species_table_GLlblMetag.csv (path to kaiju species table from [Step 11ci.](#11ci-create-merged-kraken2-taxonomy-table))
+- kraken2_species_table_GLlblMetag.tsv (path to kraken2 species table from [Step 11ci.](#11ci-create-merged-kraken2-taxonomy-table))
 
 **Output Data:**
 
-- **kraken2_filtered_species_table_GLlblMetag.csv** - a file containing the filtered species table
+- **kraken2_filtered_species_table_GLlblMetag.tsv** - a file containing the filtered species table
 
 ---
 
-#### 11g. Taxonomy Barplots
+#### 11g. Kraken2 Taxonomy Barplots
 
 ```R
-library(tidyverse)
-
-species_table_file <- "kraken2_species_table_GLlblMetag.csv"
-filtered_species_table_file <- "kraken2_filtered_species_table_GLlblMetag.csv"
+species_table_file <- "kraken2_species_table_GLlblMetag.tsv"
+filtered_species_table_file <- "kraken2_filtered_species_table_GLlblMetag.tsv"
 metadata_file <- "/path/to/sample/metadata"
-number_samples <- 10 
-
-# set width based on number of samples, with a cap at 50 inches
-plot_width <- 2 * number_samples
-if(plot_width > 50) { plot_width = 50 }
-
-p <- make_barplot(metadata_file = metadata_file, feature_table_file = species_table_file, 
-                  feature_column = "species", samples_column = "sample_id", group_column = "group",
-                  publication_format = publication_format, custom_palette = custom_palette)
-
-ggsave(filename = "kraken2_unfiltered_species_barplot_GLlblMetag.png", plot = p,
-       device = "png", width = plot_width, height = 10, units = "in", dpi = 300, limitsize = FALSE)
 
-# Save static unfiltered plot
-p <- make_barplot(metadata_file = metadata_file, feature_table_file = filtered_species_table_file, 
-                  feature_column = "Species", samples_column = "sample_id", group_column = "group",
-                  publication_format = publication_format, custom_palette = custom_palette)
-
-# Save interactive unfilterted plot
-htmlwidgets::saveWidget(ggplotly(p), glue("kraken2_unfiltered_species_barplot_GLlblMetag.html"), selfcontained = TRUE)
-
-# Save static filtered plot
-ggsave(filename = glue("kraken2_filtered_species_barplot_GLlblMetag.png"), plot = p,
-      device = 'png', width = plot_width, height = 10, units = 'in', dpi = 300, limitsize = FALSE)
+make_barplot(metadata_file = metadata_file, feature_table_file = species_table_file, 
+             feature_column = "species", samples_column = "sample_id", group_column = "group",
+             output_prefix = "kraken2_unfiltered_species", assay_suffix = "_GLlblMetag",
+             publication_format = publication_format, custom_palette = custom_palette)
 
-# Save interactive filtered plot
-htmlwidgets::saveWidget(ggplotly(p), glue("kraken2_filtered_species_barplot_GLlblMetag.html"), selfcontained = TRUE)
+make_barplot(metadata_file = metadata_file, feature_table_file = filtered_species_table_file, 
+             feature_column = "Species", samples_column = "sample_id", group_column = "group",
+             output_prefix = "kraken2_filtered_species", assay_suffix = "_GLlblMetag",
+             publication_format = publication_format, custom_palette = custom_palette)
 ```
 **Custom Functions Used:**
-- [make_barplot()](#make_plot)
+- [make_barplot()](#make_barplot)
 
 **Parameter Definitions:**
 
@@ -2520,59 +2566,42 @@ htmlwidgets::saveWidget(ggplotly(p), glue("kraken2_filtered_species_barplot_GLlb
 
 **Input Data:**
 
-- `kraken2_species_table_GLlblMetag.csv` (path to kaiju species table from [Step 11ci.](#11ci-create-merged-kraken2-taxonomy-table))
-- `kraken2_filtered_species_table_GLlblMetag.csv` (a file containing the filtered species count table, output from [Step 11f](#11f-filter-kraken2-species-count-table))
-- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping samplenames to group metadata)
+- `kraken2_species_table_GLlblMetag.tsv` (path to kraken2 species table from [Step 11ci.](#11ci-create-merged-kraken2-taxonomy-table))
+- `kraken2_filtered_species_table_GLlblMetag.tsv` (a file containing the filtered species count table, output from [Step 11f](#11f-filter-kraken2-species-count-table))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
 
 **Output Data:**
 
 - kraken2_unfiltered_species_barplot_GLlblMetag.png (taxonomy barplot without filtering)
 - **kraken2_unfiltered_species_barplot_GLlblMetag.html** (interactive taxonomy barplot without filtering)
-- kraken2_filtered_species_barplot_GLlblMetag.png (taxonomy barplot after filtering rare and non-microbial taxa)
-- **kraken2_filtered_species_barplot_GLlblMetag.html** (interactive taxonomy barplot after filtering rare and non-microbial taxa)
+- kraken2_filtered_species_barplot_GLlblMetag.png (taxonomy barplot after filtering rare and non-microbial taxa, output from [make_barplot()](#make_barplot))
+- **kraken2_filtered_species_barplot_GLlblMetag.html** (interactive taxonomy barplot after filtering rare and non-microbial taxa, output from [make_barplot()](#make_barplot))
 
 
-#### 11h. Feature Decontamination
+#### 11h. Kraken2 Feature Decontamination
 
-> Feature (species) decontamination with decontam. Decontam is an R package that statistically 
-  identifies contaminating features in a feature table
+> **Note:** The decontaminated species table and barplots are only generated if one or more contaminants are detected.
 
 ```R
-library(tidyverse)
-library(decontam)
-library(phyloseq)
-
-feature_table_file <- "kraken2_filtered_species_table_GLlblMetag.csv"
+feature_table_file <- "kraken2_filtered_species_table_GLlblMetag.tsv"
 metadata_table <- "/path/to/sample/metadata"
-number_samples <- NumberOfSamples # integer indicating how many samples are in the file
-
-# set width based on number of samples, with a cap at 50 inches
-plot_width <- 2 * number_samples
-if(plot_width > 50) { plot_width = 50 }
 
 decontaminated_table <- feature_decontam(metadata_file = metadata_table, 
                                          feature_table_file = feature_table_file, 
                                          feature_column = "species", 
                                          samples_column = "sample_id",
                                          prevalence_column = "NTC", 
-                                         ntc_name = "TRUE", 
+                                         ntc_name = "true", 
                                          frequency_column = "concentration", 
-                                         threshold = 0.1, 
+                                         threshold = 0.5, 
                                          classification_method = "kraken2", 
-                                         output_prefix = "", 
+                                         output_prefix = "kraken2", 
                                          assay_suffix = "_GLlblMetag")
 
-# Convert count matrix to relative abundance matrix
-decontaminated_species_table <- count_to_rel_abundance(decontaminated_table)
-
-# Make plot after filtering out contaminants
-p <- make_plot(decontaminated_species_table, metadata, custom_palette, publication_format)
-
-ggsave(filename = "kraken2_decontam_species_barplot_GLlblMetag.png", plot = p,
-         device = "png", width = plot_width, height = 10, units = "in", dpi = 300)
-
-# Save interactive filtered plot
-htmlwidgets::saveWidget(ggplotly(p), glue("kraken2_decontam_species_barplot_GLlblMetag.html"), selfcontained = TRUE)
+make_barplot(metadata_file = metadata_table, feature_table_file = "kraken2_decontam_species_table_GLlblMetag.tsv", 
+             feature_column = "Species", samples_column = "sample_id", group_column = "group",
+             output_prefix = "kraken2_decontam_species", assay_suffix = "_GLlblMetag",
+             publication_format = publication_format, custom_palette = custom_palette)
 ```
 
 **Custom Functions Used:**
@@ -2589,15 +2618,15 @@ htmlwidgets::saveWidget(ggplotly(p), glue("kraken2_decontam_species_barplot_GLlb
 
 **Input Data:**
 
-- `kraken2_filtered_species_table_GLlblMetag.csv`(path to filtered species count per sample, output from [Step 11f](#11f-filter-kraken2-species-count-table))
-- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping samplenames to group metadata)
+- `kraken2_filtered_species_table_GLlblMetag.tsv` (path to filtered species count per sample, output from [Step 11f](#11f-filter-kraken2-species-count-table))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
 
 **Output Data:**
 
-- **kraken2_decontam_results_GLlblMetag.csv** (decontam's result table, output from [feature_decontam() function](#feature_decontam))
-- **kraken2_decontam_species_table_GLlblMetag.csv** (decontaminated species table, output from [feature_decontam() function](#feature_decontam))
-- kraken2_decontam_species_barplot_GLlblMetag.png (barplot after filtering out contaminants)
-- **kraken2_decontam_species_barplot_GLlblMetag.html** (barplot after filtering out contaminants)
+- **kraken2_decontam_results_GLlblMetag.tsv** (decontam's result table, output from [feature_decontam()](#feature_decontam))
+- **kraken2_decontam_species_table_GLlblMetag.tsv** (decontaminated species table, output from [feature_decontam()](#feature_decontam))
+- kraken2_decontam_species_barplot_GLlblMetag.png (barplot after filtering out contaminants, output from [make_barplot()](#make_barplot))
+- **kraken2_decontam_species_barplot_GLlblMetag.html** (barplot after filtering out contaminants, output from [make_barplot()](#make_barplot))
 
 <br>
 
@@ -2616,8 +2645,8 @@ flye --meta \
      /path/to/sample_decontam_GLlblMetag.fastq.gz
 
 # rename output files            
-mv sample/assembly.fasta sample_assembly.fasta
-mv sample/flye.log sample_assembly.log
+mv sample/assembly.fasta sample-assembly.fasta
+mv sample/flye.log sample-assembly.log
 ```
 
 **Parameter Definitions:**
@@ -2636,8 +2665,8 @@ mv sample/flye.log sample_assembly.log
 
 **Output Data**
 
-- sample_assembly.fasta (sample assembly fasta)
-- sample_assembly.log (flye log file)
+- sample-assembly.fasta (sample assembly fasta)
+- sample-assembly.log (flye log file)
 
 <br>
 
@@ -2648,8 +2677,8 @@ mv sample/flye.log sample_assembly.log
 ```bash
 medaka_consensus -t NumberOfThreads \
                  -i /path/to/sample_decontam_GLlblMetag.fastq.gz \
-                 -d /path/to/assemblies/sample_assembly.fasta \
-                 -o sample/
+                 -d /path/to/assemblies/sample-assembly.fasta \
+                 -o sample/ > sample-medaka.log
   
 mv sample/consensus.fasta sample_polished.fasta
 ```
@@ -2666,11 +2695,12 @@ mv sample/consensus.fasta sample_polished.fasta
 - sample_decontam_GLlblMetag.fastq.gz or sample_HostRm_GLlblMetag.fastq.gz (filtered and trimmed sample reads with both 
-    contaminants and human reads (and optionally host reads) removed, gzipped fasta file, 
+    contaminants and human reads (and optionally host reads) removed, gzipped fastq file,
     output from [Step 7e](#7e-generate-decontaminated-read-files) or [Step 8b](#8b-remove-host-reads))
-- /path/to/assemblies/sample_assembly.fasta (sample assembly, output from [Step 12](#12-sample-assembly))
+- /path/to/assemblies/sample-assembly.fasta (sample assembly, output from [Step 12](#12-sample-assembly))
 
 **Output Data:**
 
 - sample_polished.fasta (polished sample assembly)
+- sample-medaka.log (file containing medaka log output)
 
 <br>
 
@@ -2707,6 +2737,14 @@ bit-rename-fasta-headers -i sample_polished.fasta \
 ```bash
 bit-summarize-assembly -o assembly-summaries_GLlblMetag.tsv \
                        *-assembly_GLlblMetag.fasta
+
+# record samples whose assembly fasta file is empty (no contigs assembled)
+for assembly_file in *-assembly_GLlblMetag.fasta; do
+  sample_id="${assembly_file%-assembly_GLlblMetag.fasta}"
+  if [ ! -s "${assembly_file}" ]; then
+    printf "%s\tNo contigs assembled\n" "${sample_id}" >> Failed-assemblies_GLlblMetag.tsv
+  fi
+done
 ```
 
 **Parameter Definitions:**  
@@ -2716,11 +2754,12 @@ bit-summarize-assembly -o assembly-summaries_GLlblMetag.tsv \
 
 **Input Data:**
 
-- *-assembly.fasta (contig-renamed assembly files from [Step 14a](#14a-renaming-contig-headers))
+- *-assembly_GLlblMetag.fasta (contig-renamed assembly files from [Step 14a](#14a-rename-contig-headers))
 
 **Output files:**
 
 - **assembly-summaries_GLlblMetag.tsv** (table of assembly summary statistics)
+- **Failed-assemblies_GLlblMetag.tsv** (list of samples with no assembled contigs; only present if at least one sample produced no contigs)
 
 <br>
 
@@ -2754,7 +2793,7 @@ prodigal -a sample-genes.faa \
 
 **Input Data:**
 
-- sample-assembly_GLlblMetag.fasta (contig-renamed assembly file from [Step 14a](#14a-renaming-contig-headers))
+- sample-assembly_GLlblMetag.fasta (contig-renamed assembly file from [Step 14a](#14a-rename-contig-headers))
 
 **Output Data:**
 
@@ -2776,8 +2815,8 @@ mv sample-genes.fasta.tmp sample-genes_GLlblMetag.fasta
 
 **Input Data:**
 
-- sample-genes.faa (gene-calls amino-acid fasta file, output from [Step 15a](#15a-gene-prediction))
-- sample-genes.fasta (gene-calls nucleotide fasta file, output from [Step 15a](#15a-gene-prediction))
+- sample-genes.faa (gene-calls amino-acid fasta file, output from [Step 15a](#15a-generate-gene-predictions))
+- sample-genes.fasta (gene-calls nucleotide fasta file, output from [Step 15a](#15a-generate-gene-predictions))
 
 **Output Data:**
 
@@ -2792,7 +2831,7 @@ mv sample-genes.fasta.tmp sample-genes_GLlblMetag.fasta
 
 > **Note:**  
 > The annotation process overwrites the same temporary directory by default. When running multiple 
-processses at a time, it is necessary to specify a specific temporary directory with the 
+processes at a time, it is necessary to specify a specific temporary directory with the 
 `--tmp-dir` argument as shown below.
 
 
@@ -2835,8 +2874,8 @@ exec_annotation -p profiles/ \
 **Input Data:**
 
 - sample-genes_GLlblMetag.faa (amino-acid fasta file, output from [Step 15b](#15b-remove-line-wraps-in-gene-prediction-output))
-- profiles/ (reference directory holding the KO HMMs, downloaded in [Step 16a](16a-download-reference-database-of-hmm-models))
-- ko_list (reference list of KOs to scan for, downloaded in [Step 16a](16a-download-reference-database-of-hmm-models))
+- profiles/ (reference directory holding the KO HMMs, downloaded in [Step 16a](#16a-download-reference-database-of-hmm-models))
+- ko_list (reference list of KOs to scan for, downloaded in [Step 16a](#16a-download-reference-database-of-hmm-models))
 
 **Output Data:**
 
@@ -2871,9 +2910,9 @@ rm -rf sample-tmp-KO/ sample-KO-annots.tmp
 
 ---
 
-### 17. Taxonomic Classification 
+### 17. Taxonomic Classification
 
-#### 17a. Pull and Unpack Pre-built Reference DB 
+#### 17a. Pull and Unpack Pre-built Reference DB
 
 > **Note:** This step only needs to be done once.
 
@@ -2885,10 +2924,10 @@ tar -xvzf CAT_prepare_20200618.tar.gz
 #### 17b. Run Taxonomic Classification
 
 ```bash
-CAT contigs -c sample-assembly.fasta \
+CAT contigs -c sample-assembly_GLlblMetag.fasta \
             -d CAT_prepare_20200618/2020-06-18_database/ \
             -t CAT_prepare_20200618/2020-06-18_taxonomy/ \
-            -p sample-genes.faa \
+            -p sample-genes_GLlblMetag.faa \
             -o sample-tax-out.tmp \
             -n NumberOfThreads \
             -r 3 \
@@ -2912,10 +2951,10 @@ CAT contigs -c sample-assembly.fasta \
 
 **Input Data:**
 
-- CAT_prepare_20200618/2020-06-18_database/ (directory holding the CAT reference sequence database, output from [Step 17a](17a-pull-and-unpack-pre-built-reference-db))
-- CAT_prepare_20200618/2020-06-18_taxonomy/ (directory holding the CAT reference taxonomy database, output from [Step 17a](17a-pull-and-unpack-pre-built-reference-db))
-- sample-assembly.fasta (contig-renamed assembly file from [Step 14a](#14a-rename-contig-headers))
-- sample-genes.faa (amino-acid fasta file, output from [Step 15b](#15b-remove-line-wraps-in-gene-prediction-output))
+- CAT_prepare_20200618/2020-06-18_database/ (directory holding the CAT reference sequence database, output from [Step 17a](#17a-pull-and-unpack-pre-built-reference-db))
+- CAT_prepare_20200618/2020-06-18_taxonomy/ (directory holding the CAT reference taxonomy database, output from [Step 17a](#17a-pull-and-unpack-pre-built-reference-db))
+- sample-assembly_GLlblMetag.fasta (contig-renamed assembly file from [Step 14a](#14a-rename-contig-headers))
+- sample-genes_GLlblMetag.faa (amino-acid fasta file, output from [Step 15b](#15b-remove-line-wraps-in-gene-prediction-output))
 
 **Output Data:**
 
@@ -2987,7 +3026,7 @@ awk -F $'\t' ' BEGIN { OFS=FS } { if ( $3 == "lineage" ) { print $1,$3,$5,$6,$7,
     { print $1,"NA","NA","NA","NA","NA","NA","NA","NA" } else { n=split($3,lineage,";"); \
     print $1,lineage[n],$5,$6,$7,$8,$9,$10,$11 } } ' sample-gene-tax-out.tmp | \
     sed 's/no support/NA/g' | sed 's/superkingdom/domain/' | sed 's/# ORF/gene_ID/' | \
-    sed 's/lineage/taxid/'  > sample-gene-tax-out.tsv
+    sed 's/lineage/taxid/'  > sample-gene-tax.tsv
 ```
 
 **Input Data:**
@@ -2996,7 +3035,7 @@ awk -F $'\t' ' BEGIN { OFS=FS } { if ( $3 == "lineage" ) { print $1,$3,$5,$6,$7,
 
 **Output Data:**
 
-- sample-gene-tax-out.tsv (reformatted gene-calls taxonomy file with lineage info)
+- sample-gene-tax.tsv (reformatted gene-calls taxonomy file with lineage info)
 
 
 #### 17f. Format Contig-level Output With awk and sed
@@ -3006,7 +3045,7 @@ awk -F $'\t' ' BEGIN { OFS=FS } { if ( $2 == "classification" ) { print $1,$4,$6
     else if ( $2 == "no taxid assigned" ) { print $1,"NA","NA","NA","NA","NA","NA","NA","NA" } \
     else { n=split($4,lineage,";"); print $1,lineage[n],$6,$7,$8,$9,$10,$11,$12 } } ' sample-contig-tax-out.tmp | \
     sed 's/no support/NA/g' | sed 's/superkingdom/domain/' | sed 's/^# contig/contig_ID/' | \
-    sed 's/lineage/taxid/' > sample-contig-tax-out.tsv
+    sed 's/lineage/taxid/' > sample-contig-tax.tsv
 
   # clearing intermediate files
 rm sample*.tmp*
@@ -3018,7 +3057,7 @@ rm sample*.tmp*
 
 **Output Data:**
 
-- sample-contig-tax-out.tsv (reformatted contig taxonomy file with lineage info)
+- sample-contig-tax.tsv (reformatted contig taxonomy file with lineage info)
 
 <br>
 
@@ -3032,7 +3071,7 @@ rm sample*.tmp*
 minimap2 -a \
          -x map-ont \
          -t NumberOfThreads \
-         sample_assembly.fasta \
+         sample-assembly_GLlblMetag.fasta \
          sample_decontam_GLlblMetag.fastq.gz \
          > sample.sam  2> sample-mapping-info.txt
 ```
@@ -3042,14 +3081,14 @@ minimap2 -a \
 - `-a` – Output in SAM format.
 - `-x map-ont` - Specifies preset for mapping Nanopore reads to a reference.
 - `-t` - Number of parallel processing threads to use
-- `sample_assembly.fasta` – Assembly fasta file, provided as a positional argument.
+- `sample-assembly_GLlblMetag.fasta` – Assembly fasta file, provided as a positional argument.
 - `sample_decontam_GLlblMetag.fastq.gz` - Input sequence data file, provided as a positional argument.
 - `> sample.sam` - Redirects the output to a separate file.
-- `2> sample-mapping-info.txt` - Redirects the standar error to a separate file.
+- `2> sample-mapping-info.txt` - Redirects the standard error to a separate file.
 
 **Input Data**
 
-- sample-assembly.fasta (contig-renamed assembly file, output from [Step 14a](#14a-rename-contig-headers))
+- sample-assembly_GLlblMetag.fasta (contig-renamed assembly file, output from [Step 14a](#14a-rename-contig-headers))
 - sample_decontam_GLlblMetag.fastq.gz or sample_HostRm_GLlblMetag.fastq.gz (filtered and trimmed sample reads with both 
-    contaminants and human reads (and optionally host reads) removed, gzipped fasta file, 
+    contaminants and human reads (and optionally host reads) removed, gzipped fastq file,
     output from [Step 7e](#7e-generate-decontaminated-read-files) or [Step 8b](#8b-remove-host-reads))
@@ -3060,15 +3099,13 @@ minimap2 -a \
 - **sample-mapping-info_GLlblMetag.txt** (read mapping information)
 
 
-#### 18b. Sort and Index Assembly Alignments
+#### 18b. Sort Assembly Alignments
 
 ```bash
-# Sort Sam, convert to bam and create index
+# Sort SAM and convert to BAM
 samtools sort --threads NumberOfThreads \
-              -o sample_sorted_GLlblMetag.bam \
+              -o sample_GLlblMetag.bam \
               sample.sam > sample_sort.log 2>&1
-
-samtools index sample_sorted_GLlblMetag.bam sample_sorted_GLlblMetag.bam.bai
 ```
 
 **Parameter Definitions:**
@@ -3079,18 +3116,13 @@ samtools index sample_sorted_GLlblMetag.bam sample_sorted_GLlblMetag.bam.bai
 - `sample.sam` - Positional argument specifying the input SAM file.
 - `> sample_sort.log 2>&1` - Redirects the standard output and standard error to a separate file.
 
-**samtools index**
-- `sample_sorted.bam` - Positional argument specifying the input BAM file to be sorted.
-- `sample_sorted.bam.bai` - Positional argument specifying the name of the index file.
-
 **Input Data:**
 
 - sample.sam (reads aligned to sample assembly, output from [Step 18a](#18a-align-reads-to-sample-assembly))
 
 **Output Data:**
 
-- **sample_sorted_GLlblMetag.bam** (sorted mapping to sample assembly, in BAM format)
-- **sample_sorted_GLlblMetag.bam.bai** (index of sorted mapping to sample assembly)
+- **sample_GLlblMetag.bam** (sorted mapping to sample assembly, in BAM format)
 
 <br>
 
@@ -3106,7 +3138,7 @@ Filtering based on detection is one way of helping to mitigate non-specific read
 
 ```bash
 # pileup.sh comes from the bbduk.sh package
-pileup.sh -in sample.bam \
+pileup.sh -in sample_GLlblMetag.bam \
           fastaorf=sample-genes_GLlblMetag.fasta \
           outorf=sample-gene-cov-and-det.tmp \
           out=sample-contig-cov-and-det.tmp
@@ -3121,8 +3153,8 @@ pileup.sh -in sample.bam \
 
 **Input Data:**
 
-- sample.bam (sorted mapping to sample assembly BAM file, output from [Step 18b](#18b-sort-and-index-assembly-alignments))
-- sample-genes.fasta (gene-calls nucleotide fasta file, output from [Step 15a](#15a-gene-prediction))
+- sample_GLlblMetag.bam (sorted mapping to sample assembly BAM file, output from [Step 18b](#18b-sort-assembly-alignments))
+- sample-genes_GLlblMetag.fasta (gene-calls nucleotide fasta file, output from [Step 15b](#15b-remove-line-wraps-in-gene-prediction-output))
 
 
 **Output Data:**
@@ -3141,14 +3173,14 @@ grep -v "#" sample-gene-cov-and-det.tmp | \
 awk -F $'\t' ' BEGIN { OFS=FS } { if ( $10 <= 0.5 ) $4 = 0 } \
      { print $1,$4 } ' > sample-gene-cov.tmp
 
-cat <( printf "gene_ID\tcoverage\n" ) sample-gene-cov.tmp > sample-gene-coverages_GLlblMetag.tsv
+cat <( printf "gene_ID\tcoverage\n" ) sample-gene-cov.tmp > sample-gene-coverages.tsv
 
 # Filtering contig coverage
 grep -v "#" sample-contig-cov-and-det.tmp | \
 awk -F $'\t' ' BEGIN { OFS=FS } { if ( $5 <= 50 ) $2 = 0 } \
      { print $1,$2 } ' > sample-contig-cov.tmp
 
-cat <( printf "contig_ID\tcoverage\n" ) sample-contig-cov.tmp > sample-contig-coverages_GLlblMetag.tsv
+cat <( printf "contig_ID\tcoverage\n" ) sample-contig-cov.tmp > sample-contig-coverages.tsv
 
 # removing intermediate files
 rm sample-*.tmp
@@ -3161,8 +3193,8 @@ rm sample-*.tmp
 
 **Output Data:**
 
-- sample-gene-coverages_GLlblMetag.tsv (table with gene-level coverages)
-- sample-contig-coverages_GLlblMetag.tsv (table with contig-level coverages)
+- sample-gene-coverages.tsv (table with gene-level coverages)
+- sample-contig-coverages.tsv (table with contig-level coverages)
 
 <br>
 
@@ -3175,25 +3207,25 @@ rm sample-*.tmp
 ```bash
 paste <( tail -n +2 sample-gene-coverages.tsv | sort -V -k 1 ) \
       <( tail -n +2 sample-annotations.tsv | sort -V -k 1 | cut -f 2- ) \
-      <( tail -n +2 sample-gene-tax-out.tsv | sort -V -k 1 | cut -f 2- ) \
+      <( tail -n +2 sample-gene-tax.tsv | sort -V -k 1 | cut -f 2- ) \
       > sample-gene-tab.tmp
 
 paste <( head -n 1 sample-gene-coverages.tsv ) \
       <( head -n 1 sample-annotations.tsv | cut -f 2- ) \
-      <( head -n 1 sample-gene-tax-out.tsv | cut -f 2- ) \
+      <( head -n 1 sample-gene-tax.tsv | cut -f 2- ) \
       > sample-header.tmp
 
 cat sample-header.tmp sample-gene-tab.tmp > sample-gene-coverage-annotation-and-tax_GLlblMetag.tsv
 
 # removing intermediate files
-rm sample*tmp sample-gene-coverages.tsv sample-annotations.tsv sample-gene-tax-out.tsv
+rm sample*tmp sample-gene-coverages.tsv sample-annotations.tsv sample-gene-tax.tsv
 ```
 
 **Input Data:**
 
 - sample-gene-coverages.tsv (table with gene-level coverages, output from [Step 19b](#19b-filter-gene-and-contig-coverage-based-on-detection))
 - sample-annotations.tsv (table of KO annotations assigned to gene IDs, output from [Step 16c](#16c-filter-ko-outputs))
-- sample-gene-tax-out.tsv (reformatted gene-calls taxonomy file with lineage info, output from [Step 17e](#17e-format-gene-level-output-with-awk-and-sed))
+- sample-gene-tax.tsv (reformatted gene-calls taxonomy file with lineage info, output from [Step 17e](#17e-format-gene-level-output-with-awk-and-sed))
 
 
 **Output Data:**
@@ -3211,23 +3243,23 @@ rm sample*tmp sample-gene-coverages.tsv sample-annotations.tsv sample-gene-tax-o
 
 ```bash
 paste <( tail -n +2 sample-contig-coverages.tsv | sort -V -k 1 ) \
-      <( tail -n +2 sample-contig-tax-out.tsv | sort -V -k 1 | cut -f 2- ) \
+      <( tail -n +2 sample-contig-tax.tsv | sort -V -k 1 | cut -f 2- ) \
       > sample-contig.tmp
 
 paste <( head -n 1 sample-contig-coverages.tsv ) \
-      <( head -n 1 sample-contig-tax-out.tsv | cut -f 2- ) \
+      <( head -n 1 sample-contig-tax.tsv | cut -f 2- ) \
       > sample-contig-header.tmp
       
 cat sample-contig-header.tmp sample-contig.tmp > sample-contig-coverage-and-tax_GLlblMetag.tsv
 
 # removing intermediate files
-rm sample*tmp sample-contig-coverages.tsv sample-contig-tax-out.tsv
+rm sample*tmp sample-contig-coverages.tsv sample-contig-tax.tsv
 ```
 
 **Input Data:**
 
 - sample-contig-coverages.tsv (table with contig-level coverages, output from [Step 19b](#19b-filter-gene-and-contig-coverage-based-on-detection))
-- sample-contig-tax-out.tsv (reformatted contig taxonomy file with lineage info, output from [Step 17f](#17f-format-contig-level-output-with-awk-and-sed))
+- sample-contig-tax.tsv (reformatted contig taxonomy file with lineage info, output from [Step 17f](#17f-format-contig-level-output-with-awk-and-sed))
 
 
 **Output Data:**
@@ -3265,7 +3297,7 @@ mv "Combined-gene-level-taxonomy-coverages.tsv Combined-gene-level-taxonomy-cove
 
 **Parameter Definitions:**  
 
-- `*-gene-coverage-annotation-and-tax_GLlbsMetag.tsv` - Positional arguments specifying the input tsv files, can be provided as a space-delimited list of files, or with wildcards like above.
+- `*-gene-coverage-annotation-and-tax_GLlblMetag.tsv` - Positional arguments specifying the input tsv files, can be provided as a space-delimited list of files, or with wildcards like above.
 - `-o` – Specifies the output file prefix.
 
 
@@ -3283,18 +3315,18 @@ mv "Combined-gene-level-taxonomy-coverages.tsv Combined-gene-level-taxonomy-cove
 #### 22b. Generate Contig-level Coverage Summary Tables
 
 ```bash
-bit-GL-combine-contig-tax-tables *-contig-coverage-and-tax.tsv -o Combined
+bit-GL-combine-contig-tax-tables *-contig-coverage-and-tax_GLlblMetag.tsv -o Combined
 ```
 
 **Parameter Definitions:**  
 
-- `*-contig-coverage-and-tax.tsv` - Positional arguments specifying the input tsv files, can be provided as a space-delimited list of files, or with wildcards like above.
+- `*-contig-coverage-and-tax_GLlblMetag.tsv` - Positional arguments specifying the input tsv files, can be provided as a space-delimited list of files, or with wildcards like above.
 - `-o` – Specifies the output file prefix.
 
 
 **Input Data:**
 
-- *-contig-coverage-and-tax.tsv (tables with combined contig coverage and taxonomy info generated for individual samples, output from [Step 21](#21-combine-contig-level-coverage-and-taxonomy-for-each-sample))
+- *-contig-coverage-and-tax_GLlblMetag.tsv (tables with combined contig coverage and taxonomy info generated for individual samples, output from [Step 21](#21-combine-contig-level-coverage-and-taxonomy-for-each-sample))
 
 **Output Data:**
 
@@ -3310,21 +3342,21 @@ bit-GL-combine-contig-tax-tables *-contig-coverage-and-tax.tsv -o Combined
 #### 23a. Bin Contigs
 
 ```bash
-jgi_summarize_bam_contig_depths --outputDepth sample-metabat-assembly-depth.tsv \
+jgi_summarize_bam_contig_depths --outputDepth sample-metabat-assembly-depth_GLlblMetag.tsv \
                                 --percentIdentity 97 \
                                 --minContigLength 1000 \
                                 --minContigDepth 1.0  \
-                                --referenceFasta sample-assembly.fasta \
-                                sample.bam
+                                --referenceFasta sample-assembly_GLlblMetag.fasta \
+                                sample_GLlblMetag.bam
 
-metabat2  --inFile sample-assembly.fasta \
+metabat2  --inFile sample-assembly_GLlblMetag.fasta \
           --outFile sample \
-          --abdFile sample-metabat-assembly-depth.tsv \
+          --abdFile sample-metabat-assembly-depth_GLlblMetag.tsv \
           -t NumberOfThreads
 
 mkdir sample-bins
 mv sample*bin*.fasta sample-bins
-zip -r sample-bins.zip sample-bins
+zip -r sample-bins_GLlblMetag.zip sample-bins
 ```
 
 **Parameter Definitions:**  
@@ -3336,7 +3368,7 @@ zip -r sample-bins.zip sample-bins
 -  `--minContigLength` – Minimum contig length to include.
 -  `--minContigDepth` – Minimum contig depth to include.
 -  `--referenceFasta` – Specifies the input assembly fasta file.
--  `sample.bam` – Input alignment BAM file, specified as a positional argument.
+-  `sample_GLlblMetag.bam` – Input alignment BAM file, specified as a positional argument.
 
 **metabat2**
 
@@ -3348,17 +3380,17 @@ zip -r sample-bins.zip sample-bins
 
 **Input Data:**
 
-- sample-assembly.fasta (contig-renamed assembly file from [Step 14a](#14a-renaming-contig-headers))
-- sample.bam (sorted mapping to sample assembly BAM file, output from [Step 18b](#18b-sort-and-index-assembly-alignments))
+- sample-assembly_GLlblMetag.fasta (contig-renamed assembly file from [Step 14a](#14a-rename-contig-headers))
+- sample_GLlblMetag.bam (sorted mapping to sample assembly BAM file, output from [Step 18b](#18b-sort-assembly-alignments))
 
 **Output Data:**
 
-- **sample-metabat-assembly-depth.tsv** (tab-delimited summary of coverages)
+- **sample-metabat-assembly-depth_GLlblMetag.tsv** (tab-delimited summary of coverages)
 - sample-bins/sample-bin\*.fasta (fasta files of recovered bins)
-- **sample-bins.zip** (zip file containing fasta files of recovered bins)
+- **sample-bins_GLlblMetag.zip** (zip file containing fasta files of recovered bins)
 
-#### 23b. Bin Quality Assessment 
-> Utilizes the default `checkm` database available [here](https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz), `checkm_data_2015_01_16.tar.gz`.
+#### 23b. Bin Quality Assessment
+> Utilizes the default `checkm` database [checkm_data_2015_01_16.tar.gz](https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz).
 
 ```bash
 checkm lineage_wf -f bins-overview_GLlblMetag.tsv \
@@ -3505,7 +3537,7 @@ cat MAGs-overview-header.tmp MAGs-overview-sorted.tmp \
 ### 24. Generate MAG-level Functional Summary Overview
 
 #### 24a. Get KO Annotations Per MAG
-> This utilizes the helper script [`parse-MAG-annots.py`](../Workflow_Documentation/NF_MGIllumina/workflow_code/bin/parse-MAG-annots.py) 
+> This utilizes the helper script [`parse-MAG-annots.py`](https://github.com/nasa/GeneLab_Metagenomics_Workflow/blob/DEV/bin/parse-MAG-annots.py) 
 
 ```bash
 for file in $( ls MAGs/*.fasta )
@@ -3535,7 +3567,7 @@ done
 
 **Input Data:**
 
-- \*-gene-coverage-annotation-and-tax.tsv (tables with combined gene coverage, annotation, and taxonomy info generated for individual samples, output from [Step 19](#19-combine-gene-level-coverage-taxonomy-and-functional-annotations-for-each-sample))
+- \*-gene-coverage-annotation-and-tax_GLlblMetag.tsv (tables with combined gene coverage, annotation, and taxonomy info generated for individual samples, output from [Step 20](#20-combine-gene-level-coverage-taxonomy-and-functional-annotations-for-each-sample))
 - MAGs/\*.fasta (directory holding high-quality MAGs, output from [Step 23c](#23c-filter-mags))
 
 **Output Data:**
@@ -3559,118 +3591,149 @@ KEGG-decoder -v interactive \
 
 **Input Data:**
 
-- MAG-level-KO-annotations_GLlblMetag.tsv (tab-delimited table holding MAGs and their KO annotations, output from [Step 24a](#24a-getting-ko-annotations-per-mag))
+- MAG-level-KO-annotations_GLlblMetag.tsv (tab-delimited table holding MAGs and their KO annotations, output from [Step 24a](#24a-get-ko-annotations-per-mag))
 
 **Output Data:**
 
 - **MAG-KEGG-Decoder-out_GLlblMetag.tsv** (tab-delimited table holding MAGs and their proportions of 
                                            genes held known to be required for specific pathways/metabolisms)
-- **MAG-KEGG-Decoder-out_GLlbnMetag.html** (interactive heatmap html file of the above output table)
+- **MAG-KEGG-Decoder-out_GLlblMetag.html** (interactive heatmap html file of the above output table)
 
 <br>
 
 ---
 
-### 25. Decontamination and Visualization of Contig- and Gene-taxonomy and Gene-function Outputs
+### 25. Filtering, Decontamination, and Visualization of Contig- and Gene-taxonomy and Gene-function Outputs
 
 #### 25a. Gene-level Taxonomy Heatmaps
 
 ```R
-library(tidyverse)
-
-metadata_file <- "/path/to/sample/metadata"
-feature_data_file <- "Combined-gene-level-taxonomy-coverages-CPM_GLlblMetag.tsv"
-
-# Prepare metadata
-metadata <- read_delim(metadata_file, delim = ",") %>% as.data.frame
-sample_names = metadata[, samples_column]
-row.names(metadata) <- sample_names
+assembly_table <- "Combined-gene-level-taxonomy-coverages-CPM_GLlblMetag.tsv"
+assembly_summary <- "assembly-summaries_GLlblMetag.tsv"
+metadata_table <- "/path/to/sample/metadata"
 
-# Prepare feature table
-gene_taxonomy_table <- read_assembly_coverage_table(feature_table_file, sample_names) %>% as.data.frame
+# Read in assembly summary table
+overview_table <- read_delim(assembly_summary, comment="#") %>%
+  select(
+    where(~all(!is.na(.)))
+  )
 
-# Summarize gene table
-species_gene_table <- gene_taxonomy_table %>%
-  select(species, !!any_of(sample_names)) %>% 
-  group_by(species) %>% 
-  summarise(across(everything(), sum)) %>% 
-  filter(species != "Unclassified;_;_;_;_;_;_") %>% # Drop unclassifed
-  as.data.frame
+col_names <- names(overview_table) %>% str_remove_all("-assembly")
+sample_order <- col_names[-1] %>% sort()
 
-rownames(species_gene_table) <- species_gene_table[[1]]
-species_gene_table <- species_gene_table[, -1] %>% as.matrix()
+# Read in the combined table, then deduplicate rows by summing coverage values per species
+df <- read_delim(assembly_table, comment = "#")
+sample_order <- get_samples(df, sample_order)
 
-# Get common samples and re-arrange feature table and metadata
-common_samples <- intersect(colnames(species_gene_table), rownames(metadata))
-species_gene_table <- species_gene_table[, common_samples]
-metadata <- metadata[common_samples, ]
-metadata <- metadata %>% arrange(!!sym(group_column))
+table2write <- read_taxonomy_table(df, sample_order) %>%
+               select(species, !!sample_order) %>%
+               group_by(species) %>%
+               summarise(across(everything(), sum)) %>%
+               filter(species != "Unclassified;_;_;_;_;_;_") %>%
+               as.data.frame()
 
-table2write = species_gene_table %>% as.data.frame %>% rownames_to_column("species")
 # Write out gene taxonomy table
-write_csv(x = table2write, file = "gene_taxonomy_table.csv")
+write_tsv(x = table2write, file = "Combined-gene-level-taxonomy_unfiltered_GLlblMetag.tsv")
 
-make_heatmap(metadata, species_gene_table, 
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Combined-gene-level-taxonomy_unfiltered_GLlblMetag.tsv", 
              samples_column="sample_id", group_column = "group", 
-             output_prefix = "Combined-gene-level-taxonomy", 
+             output_prefix = "Combined-gene-level-taxonomy_unfiltered", 
              assay_suffix = "_GLlblMetag", 
              custom_palette = custom_palette)
 
 ```
 
 **Custom Functions Used:**
-- [read_assembly_coverage_table()](#read_assembly_coverage_table)
+- [get_samples()](#get_samples)
+- [read_taxonomy_table()](#read_taxonomy_table)
 - [make_heatmap()](#make_heatmap)
 
 **Input data:**
-- /path/to/sample/metadata (a file with samples as rows and columns describing each sample)
-- Combined-gene-level-taxonomy-coverages-CPM_GLlblMetag.tsv (table with all samples 
-    combined based on gene-level taxonomic classifications, output from 
-    [Step 22a](#22a-generating-gene-level-coverage-summary-tables)) 
+- assembly-summaries_GLlblMetag.tsv (table of assembly summary statistics, output from [Step 14b](#14b-summarize-assemblies))
+- Combined-gene-level-taxonomy-coverages-CPM_GLlblMetag.tsv (table with all samples combined based on gene-level 
+  taxonomic classifications, output from [Step 22a](#22a-generate-gene-level-coverage-summary-tables)) 
 
 **Output data:**
-- gene_taxonomy_table.csv (aggregated gene taxonomy table with samples in columns and species in rows)
-- **Combined-gene-level-taxonomy_heatmap_GLlblMetag.png** (heatmap of all gene taxonomy assignments)
+- Combined-gene-level-taxonomy_unfiltered_GLlblMetag.tsv (aggregated gene-level taxonomy table with samples in columns and species in rows)
+- **Combined-gene-level-taxonomy_unfiltered_heatmap_GLlblMetag.png** (heatmap of all gene-level taxonomy assignments, output from [make_heatmap()](#make_heatmap))
+- **Combined-gene-level-taxonomy_unfiltered_top_50_heatmap_GLlblMetag.png** (heatmap of the top 50 gene-level taxonomy assignments, output from [make_heatmap()](#make_heatmap))
 
-#### 25b. Gene-level Taxonomy Decontamination
 
-```R
-library(tidyverse)
-library(decontam)
-library(phyloseq)
+#### 25b. Gene-level Taxonomy Feature Filtering
 
-feature_table_file <- "gene_taxonomy_table.csv"
+```R
+feature_table_file <- "Combined-gene-level-taxonomy_unfiltered_GLlblMetag.tsv"
 metadata_table <- "/path/to/sample/metadata"
-number_samples <- NumberOfSamples # integer indicating how many samples are in the file
+threshold <- 1000
 
-# set width based on number of samples, with a cap at 50 inches
-plot_width <- 2 * number_samples
-if(plot_width > 50) { plot_width = 50 }
+# read in feature table
+feature_table <- read_delim(feature_table_file) %>%
+                 mutate(across(where(is.numeric), function(col) replace_na(col, 0))) %>%
+                 as.data.frame()
+feature_name <- colnames(feature_table)[1]
+rownames(feature_table) <- feature_table[,1]
+feature_table <- feature_table[, -1]
 
-# Prepare metadata
-metadata <- read_delim(metadata_file, delim = ",") %>% as.data.frame
-sample_names = metadata[, samples_column]
-row.names(metadata) <- sample_names
+table2write <- get_abundant_features(feature_table, cpm_threshold=threshold) %>%
+               as.data.frame() %>%
+               rownames_to_column(feature_name)
+
+write_tsv(x = table2write, file = "Combined-gene-level-taxonomy_filtered_GLlblMetag.tsv")
+
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Combined-gene-level-taxonomy_filtered_GLlblMetag.tsv", 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Combined-gene-level-taxonomy_filtered", 
+             assay_suffix = "_GLlblMetag", 
+             custom_palette = custom_palette)
+
+```
+
+**Custom Functions Used:**
+- [get_abundant_features()](#get_abundant_features)
+- [make_heatmap()](#make_heatmap)
+
+**Parameter Definitions:**
+
+- `feature_table_file` - path to a tab-separated feature table containing gene-level coverage data, with
+                         species/functions as the first column and samples as the other columns.
+- `metadata_table` - path to a file with samples as rows and columns describing each sample
+- `threshold` - threshold to identify abundant features, default: 1000
+
+**Input Data:**
+
+- `Combined-gene-level-taxonomy_unfiltered_GLlblMetag.tsv` (aggregated gene-level taxonomy table with samples in columns and species in rows, from [Step 25a](#25a-gene-level-taxonomy-heatmaps))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
+
+**Output Data:**
+
+- **Combined-gene-level-taxonomy_filtered_GLlblMetag.tsv** (filtered gene-level taxonomy, output from [get_abundant_features()](#get_abundant_features))
+- **Combined-gene-level-taxonomy_filtered_heatmap_GLlblMetag.png** (heatmap of all gene-level taxonomy assignments after filtering out non-abundant features, output from [make_heatmap()](#make_heatmap))
+- **Combined-gene-level-taxonomy_filtered_top_50_heatmap_GLlblMetag.png** (heatmap of the top 50 gene taxonomy assignments after filtering out non-abundant features, output from [make_heatmap()](#make_heatmap))
+
+#### 25c. Gene-level Taxonomy Decontamination
+
+> Note: species_table and heatmaps are only generated if 1 or more contaminants were detected
+
+```R
+feature_table_file <- "Combined-gene-level-taxonomy_unfiltered_GLlblMetag.tsv"
+metadata_table <- "/path/to/sample/metadata"
 
 decontaminated_table <- feature_decontam(metadata_file = metadata_table, 
                                          feature_table_file = feature_table_file, 
                                          feature_column = "species", 
                                          samples_column = "sample_id",
                                          prevalence_column = "NTC", 
-                                         ntc_name = "TRUE", 
+                                         ntc_name = "true", 
                                          frequency_column = "concentration", 
-                                         threshold = 0.1, 
-                                         classification_method = "Combined-gene-level-taxonomy", 
-                                         output_prefix = "", 
+                                         threshold = 0.5, 
+                                         classification_method = "gene-taxonomy", 
+                                         output_prefix = "Combined-gene-level-taxonomy", 
                                          assay_suffix = "_GLlblMetag")
 
-# Get common samples and re-arrange feature table and metadata
-common_samples <- intersect(colnames(decontaminated_table), rownames(metadata))
-decontaminated_table <- decontaminated_table[, common_samples]
-metadata <- metadata[common_samples, ]
-metadata <- metadata %>% arrange(!!sym(group_column))
-
-make_heatmap(metadata, decontaminated_table, 
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Combined-gene-level-taxonomy_decontam_species_table_GLlblMetag.tsv", 
              samples_column = "sample_id", group_column = "group", 
              output_prefix = "Combined-gene-level-taxonomy_decontam", 
              assay_suffix = "_GLlblMetag",
@@ -3680,98 +3743,143 @@ make_heatmap(metadata, decontaminated_table,
 
 **Custom Functions Used:**
 - [feature_decontam()](#feature_decontam)
-- [make_heatmap()](#make_plot)
+- [make_heatmap()](#make_heatmap)
 
 **Parameter Definitions:**
 - `metadata_table` - path to a file with samples as rows and columns describing each sample
 - `feature_table_file` - path to a tab separated samples feature table containing gene-level coverage data 
                          species/functions as the first column and samples as other columns.
-- `number_samples` - the total number of samples in the feature_table_file, adjust based on number of input samples
 
 **Input Data:**
 
-- `gene_taxonomy_table.csv`(aggregated gene taxonomy table with samples in columns and species in rows, from [Step 25a](#25a-gene-level-taxonomy-heatmaps))
-- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping samplenames to group metadata)
+- `Combined-gene-level-taxonomy_unfiltered_GLlblMetag.tsv` (aggregated gene-level taxonomy table with samples in columns and species in rows, from [Step 25a](#25a-gene-level-taxonomy-heatmaps))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
 
 **Output Data:**
 
-- **Combined-gene-level-taxonomy_decontam_results_GLlblMetag.csv** (decontam's results table)
-- **Combined-gene-level-taxonomy_decontam_species_table_GLlblMetag.csv** (decontaminated species table)
-- **Combined-gene-level-taxonomy_decontam_heatmap_GLlblMetag.png** (gene-level taxonomy heatmap after filtering out contaminants)
+- **Combined-gene-level-taxonomy_decontam_results_GLlblMetag.tsv** (decontam's results table, output from [feature_decontam()](#feature_decontam))
+- **Combined-gene-level-taxonomy_decontam_species_table_GLlblMetag.tsv** (decontaminated gene-level taxonomy, output from [feature_decontam()](#feature_decontam))
+- **Combined-gene-level-taxonomy_decontam_heatmap_GLlblMetag.png** (heatmap of all gene-level taxonomy assignments after filtering out contaminants, output from [make_heatmap()](#make_heatmap))
+- **Combined-gene-level-taxonomy_decontam_top_50_heatmap_GLlblMetag.png** (heatmap of the top 50 gene-level taxonomy assignments after filtering out contaminants, output from [make_heatmap()](#make_heatmap))
 
-#### 25c. Gene-level KO Functions Heatmaps
+#### 25d. Gene-level KO Functions Heatmaps
 
 ```R
-library(tidyverse)
-library(pheatmap)
-
-metadata_file <- "/path/to/sample/metadata"
-feature_data_file <- "Combined-gene-level-KO-function-coverages-CPM_GLlblMetag.ts"
-
-# Abundant functions with CPM > 2000
-abundance_threshold <- 2000
+assembly_table <- "Combined-gene-level-KO-function-coverages-CPM_GLlblMetag.tsv"
+assembly_summary <- "assembly-summaries_GLlblMetag.tsv"
+metadata_table <- "/path/to/sample/metadata"
 
-# Prepare metadata
-metadata <- read_delim(metadata_file, delim = ",") %>% as.data.frame
-sample_names = metadata[, samples_column]
-row.names(metadata) <- sample_names
+# Read in assembly summary table and remove columns where the values are NA
+overview_table <- read_delim(assembly_summary, comment="#") %>%
+  select(
+    where(~all(!is.na(.)))
+  )
 
-# Read-in KO functions table and drop unannotated
-functions_table <- read_delim(file = feature_table_file, delim = "\t", comment = "#") %>%
-                   select(KO_ID, KO_function, !!any_of(sample_names)) %>%
-                   filter(KO_ID != "Not annotated")
+col_names <- names(overview_table) %>% str_remove_all("-assembly")
+sample_order <- col_names[-1] %>% sort()
 
-# Convert the sample level data into a matrix
-functions.m <- functions_table %>% select(any_of(sample_names)) %>% as.matrix()
-rownames(functions.m) <- functions_table$KO_ID
+# read in the KO function table and identify its sample columns
+df <- read_delim(assembly_table, comment = "#")
+sample_order <- get_samples(df, sample_order, end_col="KO_function")
 
-# convert to dataframe without unannotated/unclassified species for output
-table2write <- functions.m %>% as.data.frame %>%
-               rownames_to_column("KO_ID")
-# Write out  taxonomy table
-write_csv(x = table2write  , file = "genes-KO-functions_table.csv")
+table2write <- df %>%
+               select(KO_ID, !!sample_order)
 
-# Get common samples and re-arrange feature table and metadata
-common_samples <- intersect(colnames(functions_table), rownames(metadata))
-functions_table <- functions_table[, common_samples]
-metadata <- metadata[common_samples, ]
-metadata <- metadata %>% arrange(!!sym(group_column))
+# Write out gene-level KO function table
+write_tsv(x = table2write, file = "Combined-gene-level-KO-function_unfiltered_GLlblMetag.tsv")
 
-make_heatmap(metadata, table2write,
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Combined-gene-level-KO-function_unfiltered_GLlblMetag.tsv",
              samples_column="sample_id", group_column = "group", 
-             output_prefix = "Combined-gene-level-KO-function", 
+             output_prefix = "Combined-gene-level-KO-function_unfiltered", 
              assay_suffix = "_GLlblMetag", 
              custom_palette = custom_palette)
 
 ```
 
 **Custom Functions Used:**
+- [get_samples()](#get_samples)
 - [make_heatmap()](#make_heatmap)
 
+**Parameter Definitions:**
+
+- `metadata_table` - path to a file with samples as rows and columns describing each sample
+- `assembly_table` - path to a tab-separated table containing gene-level KO function coverage data with
+                         species/functions as the first column and samples as other columns.
+- `assembly_summary` - path to a tab-separated file containing statistics on assemblies created for each sample
+
 **Input data:**
-- /path/to/sample/metadata (a file with samples as rows and columns describing each sample)
-- Combined-gene-level-KO-function-coverages-CPM_GLlblMetag.tsv (table with all samples combined 
-    based on KO annotations; normalized to coverage per million genes covered, output from 
-    [Step 22a](#22a-generate-gene-level-coverage-summary-tables)
+
+- assembly-summaries_GLlblMetag.tsv (table of assembly summary statistics, output from [Step 14b](#14b-summarize-assemblies))
+- Combined-gene-level-KO-function-coverages-CPM_GLlblMetag.tsv (table with all samples combined based on KO annotations;
+  normalized to coverage per million genes covered, output from [Step 22a](#22a-generate-gene-level-coverage-summary-tables))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
 
 **Output data:**
-- genes-KO-functions_table.csv (aggregated and subsetted gene KO function table)
-- **Combined-gene-level-KO-function_heatmap_GLlblMetag.png** (heatmap of all gene-level KO function assignments)
 
-#### 25d. Gene-level KO Functions Decontamination
+- Combined-gene-level-KO-function_unfiltered_GLlblMetag.tsv (aggregated and subsetted gene-level KO function table)
+- **Combined-gene-level-KO-function_unfiltered_heatmap_GLlblMetag.png** (heatmap of all gene-level KO function assignments, output from [make_heatmap()](#make_heatmap))
+- **Combined-gene-level-KO-function_unfiltered_top_50_heatmap_GLlblMetag.png** (heatmap of the top 50 gene-level KO function assignments, output from [make_heatmap()](#make_heatmap))
 
-```R
-library(tidyverse)
-library(decontam)
-library(phyloseq)
+#### 25e. Gene-level KO Functions Feature Filtering
 
-feature_table_file <- "genes-KO-functions_table.csv"
+```R
+feature_table_file <- "Combined-gene-level-KO-function_unfiltered_GLlblMetag.tsv"
 metadata_table <- "/path/to/sample/metadata"
-number_samples <- NumberOfSamples # integer indicating how many samples are in the file
+threshold <- 1000
+
+# read in feature table
+feature_table <- read_delim(feature_table_file) %>%
+                 mutate(across(where(is.numeric), function(col) replace_na(col, 0))) %>%
+                 as.data.frame()
+feature_name <- colnames(feature_table)[1]
+rownames(feature_table) <- feature_table[,1]
+feature_table <- feature_table[, -1]
+
+table2write <- get_abundant_features(feature_table, cpm_threshold=threshold) %>%
+               as.data.frame() %>%
+               rownames_to_column(feature_name)
 
-# set width based on number of samples, with a cap at 50 inches
-plot_width <- 2 * number_samples
-if(plot_width > 50) { plot_width = 50 }
+write_tsv(x = table2write, file = "Combined-gene-level-KO-function_filtered_GLlblMetag.tsv")
+
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Combined-gene-level-KO-function_filtered_GLlblMetag.tsv", 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Combined-gene-level-KO-function_filtered", 
+             assay_suffix = "_GLlblMetag", 
+             custom_palette = custom_palette)
+
+```
+
+**Custom Functions Used:**
+- [get_abundant_features()](#get_abundant_features)
+- [make_heatmap()](#make_heatmap)
+
+**Parameter Definitions:**
+
+- `feature_table_file` - path to a tab-separated feature table containing gene-level KO function coverage data, with
+                         KO_ID as the first column and samples as the other columns.
+- `metadata_table` - path to a file with samples as rows and columns describing each sample
+- `threshold` - threshold to identify abundant features, default: 1000
+
+**Input Data:**
+
+- `Combined-gene-level-KO-function_unfiltered_GLlblMetag.tsv` (aggregated gene-level KO function table with samples in columns and KO_ID in rows, from [Step 25d](#25d-gene-level-ko-functions-heatmaps))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
+
+**Output Data:**
+
+- **Combined-gene-level-KO-function_filtered_GLlblMetag.tsv** (filtered gene-level KO function table, output from [get_abundant_features()](#get_abundant_features))
+- **Combined-gene-level-KO-function_filtered_heatmap_GLlblMetag.png** (heatmap of all gene-level KO function assignments after filtering out non-abundant features, output from [make_heatmap()](#make_heatmap))
+- **Combined-gene-level-KO-function_filtered_top_50_heatmap_GLlblMetag.png** (heatmap of the top 50 gene-level KO function assignments after filtering out non-abundant features, output from [make_heatmap()](#make_heatmap))
+
+#### 25f. Gene-level KO Functions Decontamination
+
+> Note: KO_table and heatmaps are only generated if 1 or more contaminants were detected
+
+```R
+feature_table_file <- "Combined-gene-level-KO-function_unfiltered_GLlblMetag.tsv"
+metadata_table <- "/path/to/sample/metadata"
 
 # Prepare metadata
 metadata <- read_delim(metadata_file, delim = ",") %>% as.data.frame
@@ -3783,20 +3891,15 @@ decontaminated_table <- feature_decontam(metadata_file = metadata_table,
                                          feature_column = "KO_ID", 
                                          samples_column = "sample_id",
                                          prevalence_column = "NTC", 
-                                         ntc_name = "TRUE", 
+                                         ntc_name = "true", 
                                          frequency_column = "concentration", 
-                                         threshold = 0.1, 
-                                         classification_method = "Combined-gene-level-KO-function", 
-                                         output_prefix = "", 
+                                         threshold = 0.5, 
+                                         classification_method = "gene-function", 
+                                         output_prefix = "Combined-gene-level-KO-function", 
                                          assay_suffix = "_GLlblMetag")
 
-# Get common samples and re-arrange feature table and metadata
-common_samples <- intersect(colnames(decontaminated_table), rownames(metadata))
-decontaminated_table <- decontaminated_table[, common_samples]
-metadata <- metadata[common_samples, ]
-metadata <- metadata %>% arrange(!!sym(group_column))
-
-make_heatmap(metadata, decontaminated_table, 
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Combined-gene-level-KO-function_decontam_KO_table_GLlblMetag.tsv", 
              samples_column = "sample_id", group_column = "group", 
              output_prefix = "Combined-gene-level-KO-function_decontam", 
              assay_suffix = "_GLlblMetag",
@@ -3806,64 +3909,58 @@ make_heatmap(metadata, decontaminated_table,
 
 **Custom Functions Used:**
 - [feature_decontam()](#feature_decontam)
-- [make_heatmap()](#make_plot)
+- [make_heatmap()](#make_heatmap)
 
 **Parameter Definitions:**
 - `metadata_table` - path to a file with samples as rows and columns describing each sample
 - `feature_table_file` - path to a tab separated samples feature table containing gene-level KO functions coverage data 
                          with KO_ID as the first column and samples as other columns.
-- `number_samples` - the total number of samples in the feature_table_file, adjust based on number of input samples
 
 **Input Data:**
 
-- `genes-KO-functions_table.csv`(aggregated gene KO functions table table with samples in columns and KO_ID in rows, from [Step 25c](#25c-gene-level-ko-functions-heatmaps))
-- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping samplenames to group metadata)
+- `Combined-gene-level-KO-function_unfiltered_GLlblMetag.tsv` (aggregated gene-level KO function table with samples in columns and KO_ID in rows, from [Step 25d](#25d-gene-level-ko-functions-heatmaps))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
 
 **Output Data:**
 
-- **Combined-gene-level-KO-function_decontam_results_GLlblMetag.csv** (decontam's results table)
-- **Combined-gene-level-KO-function_decontam_species_table_GLlblMetag.csv** (decontaminated gene-level KO functions table)
-- **Combined-gene-level-KO-function_decontam_heatmap_GLlblMetag.png** (gene-level KO functions heatmap after filtering out contaminants)
+- **Combined-gene-level-KO-function_decontam_results_GLlblMetag.tsv** (decontam results table, output from [feature_decontam()](#feature_decontam))
+- **Combined-gene-level-KO-function_decontam_KO_table_GLlblMetag.tsv** (decontaminated gene-level KO functions table, output from [feature_decontam()](#feature_decontam))
+- **Combined-gene-level-KO-function_decontam_heatmap_GLlblMetag.png** (heatmap of all gene-level KO function assignments after filtering out contaminants, output from [make_heatmap()](#make_heatmap))
+- **Combined-gene-level-KO-function_decontam_top_50_heatmap_GLlblMetag.png** (heatmap of the top 50 gene-level KO function assignments after filtering out contaminants, output from [make_heatmap()](#make_heatmap))
 
 
-#### 25f. Contig-level Heatmaps
+#### 25g. Contig-level Heatmaps
 
 ```R
-library(tidyverse)
-
-metadata_file <- "/path/to/sample/metadata"
-feature_data_file <- "Combined-contig-level-taxonomy-coverages-CPM_GLlblMetag.tsv"
-
-# Prepare metadata
-metadata <- read_delim(metadata_file, delim = ",") %>% as.data.frame
-sample_names = metadata[, samples_column]
-row.names(metadata) <- sample_names
+assembly_table <- "Combined-contig-level-taxonomy-coverages-CPM_GLlblMetag.tsv"
+assembly_summary <- "assembly-summaries_GLlblMetag.tsv"
+metadata_table <- "/path/to/sample/metadata"
 
-# Prepare feature table
-contig_taxonomy_table <- read_assembly_coverage_table(feature_table_file, sample_names) %>% as.data.frame
+# Read in assembly summary table
+overview_table <- read_delim(assembly_summary, comment="#") %>%
+  select(
+    where(~all(!is.na(.)))
+  )
 
-# Summarize contig table
-species_contig_table <- contig_taxonomy_table %>%
-  select(species, !!any_of(sample_names)) %>%
-  group_by(species) %>%
-  summarise(across(everything(), sum)) %>% 
-  filter(species != "Unclassified;_;_;_;_;_;_") %>% # Drop unclassifed
-  as.data.frame
+col_names <- names(overview_table) %>% str_remove_all("-assembly")
+sample_order <- col_names[-1] %>% sort()
 
-rownames(species_contig_table) <- species_contig_table[[1]]
-species_contig_table <- species_contig_table[, -1] %>% as.matrix()
+# deduplicate rows by summing together species values
+df <- read_delim(assembly_table, comment = "#")
+sample_order <- get_samples(df, sample_order)
 
-# Get common samples and re-arrange feature table and metadata
-common_samples <- intersect(colnames(species_contig_table), rownames(metadata))
-species_contig_table <- species_contig_table[, common_samples]
-metadata <- metadata[common_samples, ]
-metadata <- metadata %>% arrange(!!sym(group_column))
+table2write <- read_taxonomy_table(df, sample_order) %>%
+               select(species, !!sample_order) %>%
+               group_by(species) %>%
+               summarise(across(everything(), sum)) %>%
+               filter(species != "Unclassified;_;_;_;_;_;_") %>%
+               as.data.frame()
 
-table2write = species_contig_table %>% as.data.frame %>% rownames_to_column("species")
 # Write out contig taxonomy table
-write_csv(x = table2write, file = "contig_taxonomy_table.csv")
+write_tsv(x = table2write, file = "Combined-contig-level-taxonomy_unfiltered_GLlblMetag.tsv")
 
-make_heatmap(metadata, species_contig_table, 
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Combined-contig-level-taxonomy_unfiltered_GLlblMetag.tsv", 
              samples_column="sample_id", group_column = "group", 
-             output_prefix = "Combined-contig-level-taxonomy", 
+             output_prefix = "Combined-contig-level-taxonomy_unfiltered", 
              assay_suffix = "_GLlblMetag", 
@@ -3871,58 +3968,100 @@ make_heatmap(metadata, species_contig_table,
 ```
 
 **Custom Functions Used:**
-- [read_assembly_coverage_table()](#read_assembly_coverage_table)
+- [get_samples()](#get_samples)
+- [read_taxonomy_table()](#read_taxonomy_table)
 - [make_heatmap()](#make_heatmap)
 
+**Parameter Definitions:**
+
+- `metadata_table` - path to a file with samples as rows and columns describing each sample
+- `assembly_table` - path to a tab-separated table containing contig-level taxonomy coverage data with
+                         taxonomy as the leading columns and samples as the other columns.
+- `assembly_summary` - path to a tab-separated file containing statistics on assemblies created for each sample
+
 **Input data:**
-- /path/to/sample/metadata (a file with samples as rows and columns describing each sample)
-- Combined-contig-level-taxonomy-coverages-CPM_GLlblMetag.tsv (table with all samples 
-    combined based on contig-level taxonomic classifications, output from 
-    [Step 21b](#21b-generate-contig-level-coverage-summary-tables)) 
+- assembly-summaries_GLlblMetag.tsv (table of assembly summary statistics, output from [Step 14b](#14b-summarize-assemblies))
+- Combined-contig-level-taxonomy-coverages-CPM_GLlblMetag.tsv (table with all samples combined based on contig-level 
+  taxonomic classifications, output from [Step 21](#21-combine-contig-level-coverage-and-taxonomy-for-each-sample))
 
 **Output data:**
-- contig_taxonomy_table.csv (aggregated contig taxonomy table with samples in columns and species in rows)
-- **Combined-contig-level-taxonomy_heatmap_GLlblMetag.png** (heatmap of all contig taxonomy assignments)
+- Combined-contig-level-taxonomy_unfiltered_GLlblMetag.tsv (aggregated contig-level taxonomy table with samples in columns and species in rows)
+- **Combined-contig-level-taxonomy_unfiltered_heatmap_GLlblMetag.png** (heatmap of all contig-level taxonomy assignments, output from [make_heatmap()](#make_heatmap))
+- **Combined-contig-level-taxonomy_unfiltered_top_50_heatmap_GLlblMetag.png** (heatmap of the top 50 contig-level taxonomy assignments, output from [make_heatmap()](#make_heatmap))
 
-#### 25g. Contig-level Decontamination
+#### 25h. Contig-level Feature Filtering
 
 ```R
-library(tidyverse)
-library(decontam)
-library(phyloseq)
-
-feature_table_file <- "contig_taxonomy_table.csv"
+feature_table_file <- "Combined-contig-level-taxonomy_unfiltered_GLlblMetag.tsv"
 metadata_table <- "/path/to/sample/metadata"
-number_samples <- NumberOfSamples # integer indicating how many samples are in the file
+threshold <- 1000
 
-# set width based on number of samples, with a cap at 50 inches
-plot_width <- 2 * number_samples
-if(plot_width > 50) { plot_width = 50 }
+# read in feature table
+feature_table <- read_delim(feature_table_file) %>%
+                 mutate(across(where(is.numeric), function(col) replace_na(col, 0))) %>%
+                 as.data.frame()
+feature_name <- colnames(feature_table)[1]
+rownames(feature_table) <- feature_table[,1]
+feature_table <- feature_table[, -1]
 
-# Prepare metadata
-metadata <- read_delim(metadata_file, delim = ",") %>% as.data.frame
-sample_names = metadata[, samples_column]
-row.names(metadata) <- sample_names
+table2write <- get_abundant_features(feature_table, cpm_threshold=threshold) %>%
+               as.data.frame() %>%
+               rownames_to_column(feature_name)
+
+write_tsv(x = table2write, file = "Combined-contig-level-taxonomy_filtered_GLlblMetag.tsv")
+
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Combined-contig-level-taxonomy_filtered_GLlblMetag.tsv", 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Combined-contig-level-taxonomy_filtered", 
+             assay_suffix = "_GLlblMetag", 
+             custom_palette = custom_palette)
+```
+
+**Custom Functions Used:**
+- [get_abundant_features()](#get_abundant_features)
+- [make_heatmap()](#make_heatmap)
+
+**Parameter Definitions:**
+
+- `feature_table_file` - path to a tab-separated feature table containing contig-level coverage data, with
+                         species as the first column and samples as the other columns.
+- `metadata_table` - path to a file with samples as rows and columns describing each sample
+- `threshold` - threshold to identify abundant features, default: 1000
+
+**Input Data:**
+
+- `Combined-contig-level-taxonomy_unfiltered_GLlblMetag.tsv` (aggregated contig-level taxonomy table with samples in columns and species in rows, from [Step 25g](#25g-contig-level-heatmaps))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
+
+**Output Data:**
+
+- **Combined-contig-level-taxonomy_filtered_GLlblMetag.tsv** (filtered contig-level taxonomy, output from [get_abundant_features()](#get_abundant_features))
+- **Combined-contig-level-taxonomy_filtered_heatmap_GLlblMetag.png** (heatmap of all contig-level taxonomy assignments after filtering out non-abundant features, output from [make_heatmap()](#make_heatmap))
+- **Combined-contig-level-taxonomy_filtered_top_50_heatmap_GLlblMetag.png** (heatmap of the top 50 contig-level taxonomy assignments after filtering out non-abundant features, output from [make_heatmap()](#make_heatmap))
+
+#### 25i. Contig-level Decontamination
+
+> Note: species_table and heatmaps are only generated if 1 or more contaminants were detected
+
+```R
+feature_table_file <- "Combined-contig-level-taxonomy_unfiltered_GLlblMetag.tsv"
+metadata_table <- "/path/to/sample/metadata"
 
 decontaminated_table <- feature_decontam(metadata_file = metadata_table, 
                                          feature_table_file = feature_table_file, 
                                          feature_column = "species", 
                                          samples_column = "sample_id",
                                          prevalence_column = "NTC", 
-                                         ntc_name = "TRUE", 
+                                         ntc_name = "true", 
                                          frequency_column = "concentration", 
-                                         threshold = 0.1, 
-                                         classification_method = "Combined-contig-level-taxonomy", 
-                                         output_prefix = "", 
+                                         threshold = 0.5, 
+                                         classification_method = "contig-taxonomy", 
+                                         output_prefix = "Combined-contig-level-taxonomy", 
                                          assay_suffix = "_GLlblMetag")
 
-# Get common samples and re-arrange feature table and metadata
-common_samples <- intersect(colnames(decontaminated_table), rownames(metadata))
-decontaminated_table <- decontaminated_table[, common_samples]
-metadata <- metadata[common_samples, ]
-metadata <- metadata %>% arrange(!!sym(group_column))
-
-make_heatmap(metadata, decontaminated_table, 
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Combined-contig-level-taxonomy_decontam_species_table_GLlblMetag.tsv", 
              samples_column = "sample_id", group_column = "group", 
              output_prefix = "Combined-contig-level-taxonomy_decontam", 
              assay_suffix = "_GLlblMetag",
@@ -3932,7 +4071,7 @@ make_heatmap(metadata, decontaminated_table,
 
 **Custom Functions Used:**
 - [feature_decontam()](#feature_decontam)
-- [make_heatmap()](#make_plot)
+- [make_heatmap()](#make_heatmap)
 
 **Parameter Definitions:**
 
@@ -3943,12 +4082,44 @@ make_heatmap(metadata, decontaminated_table,
 
 **Input Data:**
 
-- `contig_taxonomy_table.csv`(aggregated contig taxonomy table with samples in columns and species in rows, from [Step 25f](#25f-contig-level-heatmaps))
-- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping samplenames to group metadata)
+- `Combined-contig-level-taxonomy_GLlblMetag.tsv` (aggregated contig taxonomy table with samples in columns and species in rows, from [Step 25g](#25g-contig-level-heatmaps))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
+
+**Output Data:**
+
+- **Combined-contig-level-taxonomy_decontam_results_GLlblMetag.tsv** (decontam's results table, output from [feature_decontam()](#feature_decontam))
+- **Combined-contig-level-taxonomy_decontam_species_table_GLlblMetag.tsv** (decontaminated contig-level taxonomy, output from [feature_decontam()](#feature_decontam))
+- **Combined-contig-level-taxonomy_decontam_heatmap_GLlblMetag.png** (heatmap of all contig-level taxonomy assignments after filtering out contaminants, output from [make_heatmap()](#make_heatmap))
+- **Combined-contig-level-taxonomy_decontam_top_50_heatmap_GLlblMetag.png** (heatmap of the top 50 contig-level taxonomy assignments after filtering out contaminants, output from [make_heatmap()](#make_heatmap))
+
+### 26. Generate Assembly-based Processing Overview
+> This utilizes the helper script [`generate-assembly-based-overview-table.sh`](https://github.com/nasa/GeneLab_Metagenomics_Workflow/blob/DEV/bin/generate-assembly-based-overview-table.sh) 
+
+```bash
+bash generate-assembly-based-overview-table.sh sample_ids_file.txt \
+  assemblies/ predicted-genes/ read-mapping/ bins/ MAGs/ \
+  Assembly-based-processing-overview_GLlblMetag.tsv
+```
+
+**Parameter Definitions:**
+
+- `sample_ids_file.txt` - A file listing the sample names, one on each row, provided as a positional argument.
+- `assemblies/` - The directory holding the contig-renamed assembly files generated in [Step 14a](#14a-rename-contig-headers), provided as a positional argument.
+- `predicted-genes/` - The directory holding the gene-calls amino-acid fasta files generated in [Step 15b](#15b-remove-line-wraps-in-gene-prediction-output), provided as a positional argument.
+- `read-mapping/` - The directory holding the sorted mapping to the sample assembly in BAM format generated in [Step 18b](#18b-sort-assembly-alignments), provided as a positional argument.
+- `bins/` - The directory holding the recovered bins fasta files generated in [Step 23a](#23a-bin-contigs), provided as a positional argument.
+- `MAGs/` - The directory holding the high-quality MAGs fasta files generated in [Step 23c](#23c-filter-mags), provided as a positional argument.
+- `Assembly-based-processing-overview_GLlblMetag.tsv` - name of the output file, provided as a positional argument.
+
+**Input Data:**
+
+- assemblies/\*.fasta (contig-renamed assembly files from [Step 14a](#14a-rename-contig-headers))
+- predicted-genes/\*.faa (gene-calls amino-acid fasta file with line wraps removed, output from [Step 15b](#15b-remove-line-wraps-in-gene-prediction-output))
+- read-mapping/\*.bam (sorted mapping to sample assembly, in BAM format, output from [Step 18b](#18b-sort-assembly-alignments))
+- bins/\*.fasta (fasta files of recovered bins, output from [Step 23a](#23a-bin-contigs))
+- MAGs/\*.fasta (fasta files of high-quality MAGs, output from [Step 23c](#23c-filter-mags))
 
 **Output Data:**
 
-- **Combined-contig-level-taxonomy_decontam_results_GLlblMetag.csv** (decontam's results table)
-- **Combined-contig-level-taxonomy_decontam_species_table_GLlblMetag.csv** (decontaminated contig-level species table)
-- **Combined-contig-level-taxonomy_decontam_heatmap_GLlblMetag.png** (contig-level heatmap after filtering out contaminants)
+- **Assembly-based-processing-overview_GLlblMetag.tsv** (tab-delimited text file providing a summary of assembly-based processing results for each sample)
 

From 65ae99ce35523b9709da80fb4c52248836ce6c58 Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Thu, 19 Mar 2026 10:54:24 -0700
Subject: [PATCH 33/47] Add low-biomass README

---
 Metagenomics/Low_Biomass/README.md | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)
 create mode 100644 Metagenomics/Low_Biomass/README.md

diff --git a/Metagenomics/Low_Biomass/README.md b/Metagenomics/Low_Biomass/README.md
new file mode 100644
index 000000000..46384d622
--- /dev/null
+++ b/Metagenomics/Low_Biomass/README.md
@@ -0,0 +1,20 @@
+# GeneLab bioinformatics processing pipelines for low-biomass metagenomics sequencing data
+
+> **Documents [`GL-DPPD-7116`](Nanopore/GL-DPPD-7116.md) and [`GL-DPPD-7117`](Illumina/GL-DPPD-7117.md) contain an overview and example commands for how GeneLab processes low-biomass metagenomics datasets for long- and short-read data, respectively. See the [Repository Links](#repository-links) descriptions below for more information. Processed data output files and a GeneLab data processing summary are provided for each GLDS dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).**  
+
+<br>
+
+---
+## Repository Links
+
+* [**Pipeline_GL-DPPD-7116_Versions**](Nanopore)
+
+  - Contains the current and previous GeneLab low-biomass metagenomics consensus processing pipeline documentation for long-read (Nanopore) data
+
+* [**Pipeline_GL-DPPD-7117_Versions**](Illumina)
+
+  - Contains the current and previous GeneLab low-biomass metagenomics consensus processing pipeline documentation for short-read (Illumina) data
+
+---
+**Developed by:**  
+Olabiyi Obayomi

From 4f66432e77750df03c343974c291c6870d3e0530 Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Mon, 23 Mar 2026 09:46:44 -0700
Subject: [PATCH 34/47] updated Metagenomics Workflow links

- updated Metagenomics READMEs
- added Metagenomics workflow submodule for low biomass pipelines
- updated low biomass pipeline docs table-of-contents
---
 .gitmodules                                   |  4 +
 .../GL-DPPD-7116.md                           | 63 +++++++-------
 .../GL-DPPD-7117.md                           | 84 ++++++++++---------
 Metagenomics/Low_Biomass/README.md            |  6 +-
 .../Workflow_Documentation/NF_MGIllumina      |  1 +
 .../Workflow_Documentation/README.md          | 19 +++++
 6 files changed, 101 insertions(+), 76 deletions(-)
 rename Metagenomics/Low_Biomass/{Nanopore => Pipeline_GL-DPPD-7116_Versions}/GL-DPPD-7116.md (99%)
 rename Metagenomics/Low_Biomass/{Illumina => Pipeline_GL-DPPD-7117_Versions}/GL-DPPD-7117.md (98%)
 create mode 160000 Metagenomics/Low_Biomass/Workflow_Documentation/NF_MGIllumina
 create mode 100644 Metagenomics/Low_Biomass/Workflow_Documentation/README.md

diff --git a/.gitmodules b/.gitmodules
index 6ee385189..beec830a9 100644
--- a/.gitmodules
+++ b/.gitmodules
@@ -1,3 +1,7 @@
 [submodule "Amplicon/Illumina/Workflow_Documentation/NF_AmpIllumina"]
 	path = Amplicon/Illumina/Workflow_Documentation/NF_AmpIllumina
 	url = https://github.com/nasa/GeneLab_AmpliconSeq_Workflow
+[submodule "NF_MGIllumina"]
+	path = Metagenomics/Low_Biomass/Workflow_Documentation/NF_MGIllumina
+	url = https://github.com/nasa/GeneLab_Metagenomics_Workflow
+	branch = DEV
diff --git a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md b/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md
similarity index 99%
rename from Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md
rename to Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md
index efc9e58ba..b92ea15be 100644
--- a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-7116.md
+++ b/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md
@@ -1,4 +1,4 @@
-# Bioinformatics pipeline for Low biomass long-read metagenomics data
+# Bioinformatics pipeline for Low biomass long-read metagenomics data <!-- omit in toc -->
 
 > **This document holds an overview and some example commands of how GeneLab processes low-biomass, long-read metagenomics datasets. Exact processing commands for specific datasets that have been released are provided with their processed data in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).**  
 
@@ -20,11 +20,11 @@ Barbara Novak (GeneLab Data Processing Lead)
 
 ---
 
-# Table of contents
+# Table of contents <!-- omit in toc -->
 
-- [**Software used**](#software-used)
-- [**General processing overview with example commands**](#general-processing-overview-with-example-commands)
-  - [**Pre-processing**](#pre-processing)
+- [Software used](#software-used)
+- [General processing overview with example commands](#general-processing-overview-with-example-commands)
+  - [Pre-processing](#pre-processing)
     - [1. Basecalling](#1-basecalling)
     - [2. Demultiplexing](#2-demultiplexing)
       - [2a. Split Fastq](#2a-split-fastq)
@@ -34,7 +34,7 @@ Barbara Novak (GeneLab Data Processing Lead)
       - [3b. Compile Raw Data QC](#3b-compile-raw-data-qc)
     - [4. Quality Filtering](#4-quality-filtering)
       - [4a. Filter Raw Data](#4a-filter-raw-data)
-      - [4a. Filtered Data QC](#4b-filtered-data-qc)
+      - [4b. Filtered Data QC](#4b-filtered-data-qc)
       - [4c. Compile Filtered Data QC](#4c-compile-filtered-data-qc)
     - [5. Trimming](#5-trimming)
       - [5a. Trim Filtered Data](#5a-trim-filtered-data)
@@ -57,10 +57,10 @@ Barbara Novak (GeneLab Data Processing Lead)
       - [8b. Remove Host Reads](#8b-remove-host-reads)
       - [8c. Compile Host Read Removal QC](#8c-compile-host-read-removal-qc)
     - [9. R Environment Setup](#9-r-environment-setup)
-      - [9a. Load Libraries](#9a-load-libraries)
+      - [9a. Load libraries](#9a-load-libraries)
       - [9b. Define Custom Functions](#9b-define-custom-functions)
       - [9c. Set global variables](#9c-set-global-variables)
-  - [**Read-based processing**](#read-based-processing)
+  - [Read-based Processing](#read-based-processing)
     - [10. Taxonomic Profiling Using Kaiju](#10-taxonomic-profiling-using-kaiju)
       - [10a. Build Kaiju Database](#10a-build-kaiju-database)
       - [10b. Kaiju Taxonomic Classification](#10b-kaiju-taxonomic-classification)
@@ -82,7 +82,7 @@ Barbara Novak (GeneLab Data Processing Lead)
       - [11f. Filter Kraken2 Species Count Table](#11f-filter-kraken2-species-count-table)
       - [11g. Kraken2 Taxonomy Barplots](#11g-kraken2-taxonomy-barplots)
       - [11h. Kraken2 Feature Decontamination](#11h-kraken2-feature-decontamination)
-  - [**Assembly-based processing**](#assembly-based-processing)
+  - [Assembly-based Processing](#assembly-based-processing)
     - [12. Sample Assembly](#12-sample-assembly)
     - [13. Polish Assembly](#13-polish-assembly)
     - [14. Rename Contigs and Summarize Assemblies](#14-rename-contigs-and-summarize-assemblies)
@@ -113,7 +113,7 @@ Barbara Novak (GeneLab Data Processing Lead)
     - [22. Generate Normalized, Gene- and Contig-level Coverage Summary Tables of KO-annotations and Taxonomy Across Samples](#22-generate-normalized-gene--and-contig-level-coverage-summary-tables-of-ko-annotations-and-taxonomy-across-samples)
       - [22a. Generate Gene-level Coverage Summary Tables](#22a-generate-gene-level-coverage-summary-tables)
       - [22b. Generate Contig-level Coverage Summary Tables](#22b-generate-contig-level-coverage-summary-tables)
-    - [23. **M**etagenome-**A**ssembled **G**enome (MAG) recovery](#23-metagenome-assembled-genome-mag-recovery)
+    - [23. **M**etagenome-**A**ssembled **G**enome (MAG) Recovery](#23-metagenome-assembled-genome-mag-recovery)
       - [23a. Bin Contigs](#23a-bin-contigs)
       - [23b. Bin Quality Assessment](#23b-bin-quality-assessment)
       - [23c. Filter MAGs](#23c-filter-mags)
@@ -141,19 +141,19 @@ Barbara Novak (GeneLab Data Processing Lead)
 
 |Program|Version|Relevant Links|
 |:------|:-----:|------:|
-|bbduk| 38.86 |[https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/)|
+|bbduk| 38.86 |[https://bbmap.org](https://bbmap.org)|
 |bit| 1.8.53 |[https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit](https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit)|
 |CAT| 5.2.3 |[https://github.com/dutilh/CAT#cat-and-bat](https://github.com/dutilh/CAT#cat-and-bat)|
 |CheckM| 1.1.3 |[https://github.com/Ecogenomics/CheckM](https://github.com/Ecogenomics/CheckM)|
 |Dorado| 1.1.1| [https://github.com/nanoporetech/dorado](https://github.com/nanoporetech/dorado)|
-|filtlong| 0.2.1 |[https://github.com/rrwick/Filtlong](https://github.com/rrwick/Filtlong)|
+|Filtlong| 0.2.1 |[https://github.com/rrwick/Filtlong](https://github.com/rrwick/Filtlong)|
 |Flye| 2.9.5 | [https://github.com/mikolmogorov/Flye](https://github.com/mikolmogorov/Flye) |
 |GTDB-Tk| 2.4.0 |[https://github.com/Ecogenomics/GTDBTk](https://github.com/Ecogenomics/GTDBTk)|
 |Kaiju| 1.10.1 | [https://bioinformatics-centre.github.io/kaiju/](https://bioinformatics-centre.github.io/kaiju/) |
 |KEGG-Decoder| 1.2.2 |[https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder](https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder)
 |KOFamScan| 1.3.0 |[https://github.com/takaram/kofam_scan](https://github.com/takaram/kofam_scan)|
 |Kraken2| 2.1.6 | [https://github.com/DerrickWood/kraken2](https://github.com/DerrickWood/kraken2) |
-|KrakenTools | 1.2 | [https://ccb.jhu.edu/software/krakentools/](https://ccb.jhu.edu/software/krakentools/) |
+|KrakenTools| 1.2 | [https://ccb.jhu.edu/software/krakentools/](https://ccb.jhu.edu/software/krakentools/) |
 |Krona| 2.8.1 | [https://github.com/marbl/Krona/wiki](https://github.com/marbl/Krona/wiki)|
 |MetaBAT| 2.15 |[https://bitbucket.org/berkeleylab/metabat/src/master/](https://bitbucket.org/berkeleylab/metabat/src/master/)|
 |Minimap2| 2.28 | [https://github.com/lh3/minimap2](https://github.com/lh3/minimap2) |
@@ -655,7 +655,7 @@ multiqc --zip-data-dir \
 
 > A major issue with low biomass data is the high potential for contamination due to the low amount of DNA extracted from the samples. Because negative control/blank samples should by theory be contaminant free, any sequence detected in the negative control is a potential contaminant. To filter out contaminants found in negative control samples that may have been due to cross contamination in the lab, we use a read mapping approach. First negative/blank control sample reads are assembled then the filtered, trimmed, and human-removed reads from each low-biomass sample are mapped to the assembled contigs from the negative/blank control samples. Reads mapping to the assembled contigs are categorized as contaminants and are therefore filtered out and thus excluded from downstream analyses.
 
-### 7a. Assemble Contaminants
+#### 7a. Assemble Contaminants
 
 ```bash
 flye --meta \
@@ -1020,7 +1020,7 @@ library(tidyverse)
 
 #### 9b. Define Custom Functions
 
-#### get_last_assignment()
+#### get_last_assignment() <!-- omit in toc -->
 <details>
   <summary>retrieves the last taxonomy assignment from a taxonomy string</summary>
 
@@ -1055,7 +1055,7 @@ library(tidyverse)
 
 </details>
 
-#### mutate_taxonomy()
+#### mutate_taxonomy() <!-- omit in toc -->
 <details>
   <summary>mutate taxonomy column to contain the lowest taxonomy assignment</summary>
 
@@ -1089,7 +1089,7 @@ library(tidyverse)
 
 </details>
 
-#### process_kaiju_table()
+#### process_kaiju_table() <!-- omit in toc -->
 <details>
   <summary>reformat kaiju output table</summary>
 
@@ -1138,7 +1138,7 @@ library(tidyverse)
 
 </details>
 
-#### merge_kraken_reports()
+#### merge_kraken_reports() <!-- omit in toc -->
 <details>
   <summary>merge and process multiple kraken outputs to one species table</summary>
 
@@ -1182,7 +1182,7 @@ library(tidyverse)
 
 </details>
 
-#### get_abundant_features()
+#### get_abundant_features() <!-- omit in toc -->
 <details>
   <summary>Find abundant features based on the sum of feature values</summary>
   
@@ -1215,7 +1215,7 @@ library(tidyverse)
   
 </details>
 
-#### count_to_rel_abundance()
+#### count_to_rel_abundance() <!-- omit in toc -->
 <details>
   <summary>Convert species count matrix to relative abundance matrix</summary>
 
@@ -1248,8 +1248,7 @@ library(tidyverse)
 
 </details>
 
-
-#### filter_rare()
+#### filter_rare() <!-- omit in toc -->
 <details>
   <summary>filter out rare and non_microbial taxonomy assignments based on relative abundance</summary>
 
@@ -1291,7 +1290,7 @@ library(tidyverse)
 
 </details>
 
-#### group_low_abund_taxa()
+#### group_low_abund_taxa() <!-- omit in toc -->
 <details>
   <summary>Group rare taxa or return a table with only rare taxa</summary>
 
@@ -1351,7 +1350,7 @@ library(tidyverse)
 
 </details>
 
-#### make_plot()
+#### make_plot() <!-- omit in toc -->
 <details>
   <summary>create bar plot of relative abundance</summary>
 
@@ -1396,7 +1395,7 @@ library(tidyverse)
 
 </details>
 
-#### make_barplot()
+#### make_barplot() <!-- omit in toc -->
 <details>
   <summary>Creates barplots from a feature table file</summary>
   
@@ -1476,7 +1475,7 @@ library(tidyverse)
 
 </details>
 
-#### make_heatmap()
+#### make_heatmap() <!-- omit in toc -->
 <details>
   <summary>Creates heatmaps from a feature table file</summary>
   
@@ -1575,7 +1574,7 @@ library(tidyverse)
 
 </details>
 
-#### run_decontam()
+#### run_decontam() <!-- omit in toc -->
 <details>
   <summary>Feature table decontamination with decontam</summary>
 
@@ -1648,7 +1647,7 @@ library(tidyverse)
 
 </details>
 
-#### feature_decontam()
+#### feature_decontam() <!-- omit in toc -->
 <details>
   <summary>decontaminate a feature table using the Decontam R package to statistically identify contaminating features in a feature table</summary>
   
@@ -1741,7 +1740,7 @@ library(tidyverse)
 
 </details>
 
-#### process_taxonomy()
+#### process_taxonomy() <!-- omit in toc -->
 <details>
   <summary>process a taxonomy assignment table</summary>
 
@@ -1778,7 +1777,7 @@ library(tidyverse)
 
 </details>
 
-#### fix_names()
+#### fix_names() <!-- omit in toc -->
 <details>
   <summary>clean taxonomy names</summary>
 
@@ -1811,7 +1810,7 @@ library(tidyverse)
 
 </details>
 
-#### read_taxonomy_table()
+#### read_taxonomy_table() <!-- omit in toc -->
 <details>
   <summary>Read Assembly-based coverage annotation table</summary>
 
@@ -1852,7 +1851,7 @@ library(tidyverse)
 
 </details>
 
-#### get_samples()
+#### get_samples() <!-- omit in toc -->
 <details>
   <summary>retrieve sample names for which assemblies were generated</summary>
 
diff --git a/Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md b/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md
similarity index 98%
rename from Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md
rename to Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md
index b3c0174b7..c355b9baf 100644
--- a/Metagenomics/Low_Biomass/Illumina/GL-DPPD-7117.md
+++ b/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md
@@ -1,4 +1,4 @@
-# Bioinformatics pipeline for Low biomass short-read metagenomics data
+# Bioinformatics pipeline for Low biomass short-read metagenomics data <!-- omit in toc -->
 
 > **This document holds an overview and some example commands of how GeneLab processes low-biomass, short-read metagenomics datasets. Exact processing commands for specific datasets that have been released are provided with their processed data in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).**  
 
@@ -20,41 +20,41 @@ Barbara Novak (GeneLab Data Processing Lead)
 
 ---
 
-# Table of contents
+# Table of contents <!-- omit in toc -->
 
-- [**Software used**](#software-used)
-- [**General processing overview with example commands**](#general-processing-overview-with-example-commands)
-  - [**Pre-processing**](#pre-processing)
+- [Software used](#software-used)
+- [General processing overview with example commands](#general-processing-overview-with-example-commands)
+  - [Pre-processing](#pre-processing)
     - [1. Raw Data QC](#1-raw-data-qc)
       - [1a. Raw Data QC](#1a-raw-data-qc)
       - [1b. Compile Raw Data QC](#1b-compile-raw-data-qc)
-    - [2. Trimming and Quality filtering](#2-trimming-and-quality-filtering)
+    - [2. Trimming and Quality Filtering](#2-trimming-and-quality-filtering)
       - [2a. Filter Quality and Trim Adapters](#2a-filter-quality-and-trim-adapters)
-      - [2b. Trim PolyG](#2b-trim-polyg)
+      - [2b. Trim polyG](#2b-trim-polyg)
       - [2c. Filtered Data QC](#2c-filtered-data-qc)
       - [2d. Compile Filtered Data QC](#2d-compile-filtered-data-qc)
     - [3. Contaminant Removal](#3-contaminant-removal)
       - [3a. Assemble Contaminants](#3a-assemble-contaminants)
       - [3b. Build Contaminant Index and Map Reads](#3b-build-contaminant-index-and-map-reads)
       - [3c. Contaminant Removal QC](#3c-contaminant-removal-qc)
-      - [3d. Compile Contaminant Removal QC](#3d-compile-contaminant-remove-qc)
-    - [4. Host read removal](#4-host-read-removal)
+      - [3d. Compile Contaminant Removal QC](#3d-compile-contaminant-removal-qc)
+    - [4. Host Read Removal](#4-host-read-removal)
       - [4a. Build Kraken2 Host Database](#4a-build-kraken2-host-database)
       - [4b. Remove Host Reads](#4b-remove-host-reads)
       - [4c. Compile Host Read Removal QC](#4c-compile-host-read-removal-qc)
     - [5. R Environment Setup](#5-r-environment-setup)
-      - [5a. Load Libraries](#5a-load-libraries)
+      - [5a. Load libraries](#5a-load-libraries)
       - [5b. Define Custom Functions](#5b-define-custom-functions)
       - [5c. Set global variables](#5c-set-global-variables)
-  - [**Read-based processing**](#read-based-processing)
-    - [6. Taxonomic profiling using kaiju](#6-taxonomic-profiling-using-kaiju)
+  - [Read-based Processing](#read-based-processing)
+    - [6. Taxonomic Profiling Using Kaiju](#6-taxonomic-profiling-using-kaiju)
       - [6a. Build Kaiju Database](#6a-build-kaiju-database)
       - [6b. Kaiju Taxonomic Classification](#6b-kaiju-taxonomic-classification)
       - [6c. Compile Kaiju Taxonomy Results](#6c-compile-kaiju-taxonomy-results)
       - [6d. Convert Kaiju Output To Krona Format](#6d-convert-kaiju-output-to-krona-format)
       - [6e. Compile Kaiju Krona Reports](#6e-compile-kaiju-krona-reports)
       - [6f. Create Kaiju Species Count Table](#6f-create-kaiju-species-count-table)
-      - [6g. Filter Kaiju Species Count Table ](#6g-filter-kaiju-species-count-table)
+      - [6g. Filter Kaiju Species Count Table](#6g-filter-kaiju-species-count-table)
       - [6h. Kaiju Taxonomy Barplots](#6h-kaiju-taxonomy-barplots)
       - [6i. Kaiju Feature Decontamination](#6i-kaiju-feature-decontamination)
     - [7. Taxonomic Profiling Using Kraken2](#7-taxonomic-profiling-using-kraken2)
@@ -69,7 +69,7 @@ Barbara Novak (GeneLab Data Processing Lead)
       - [7g. Kraken2 Taxonomy Barplots](#7g-kraken2-taxonomy-barplots)
       - [7h. Kraken2 Feature Decontamination](#7h-kraken2-feature-decontamination)
     - [8. Taxonomic Profiling Using MetaPhlan](#8-taxonomic-profiling-using-metaphlan)
-      - [8a. Download and install HUMAnN databases](#8a-download-and-install-humann-databases)
+      - [8a. Download and Install HUMAnN databases](#8a-download-and-install-humann-databases)
       - [8b. HUMAnN/MetaPhlAn Taxonomic Classification](#8b-humannmetaphlan-taxonomic-classification)
       - [8c. Merge Multiple Sample Functional Profiles](#8c-merge-multiple-sample-functional-profiles)
       - [8d. Split Results Tables](#8d-split-results-tables)
@@ -85,7 +85,7 @@ Barbara Novak (GeneLab Data Processing Lead)
       - [8l. Filter Humann Output](#8l-filter-humann-output)
       - [8m. Create Humann Function Heatmaps](#8m-create-humann-function-heatmaps)
       - [8n. Humann Feature Decontamination](#8n-humann-feature-decontamination)
-  - [**Assembly-based Processing**](#assembly-based-processing)
+  - [Assembly-based Processing](#assembly-based-processing)
     - [9. Sample Assembly](#9-sample-assembly)
     - [10. Rename Contigs and Summarize Assemblies](#10-rename-contigs-and-summarize-assemblies)
       - [10a. Rename Contig Headers](#10a-rename-contig-headers)
@@ -105,7 +105,7 @@ Barbara Novak (GeneLab Data Processing Lead)
       - [13e. Format Gene-level Output With awk and sed](#13e-format-gene-level-output-with-awk-and-sed)
       - [13f. Format Contig-level Output With awk and sed](#13f-format-contig-level-output-with-awk-and-sed)
     - [14. Read-Mapping](#14-read-mapping)
-      - [14a. Build Reference Index](#14a-build-reference-index)
+      - [14a. Build reference index](#14a-build-reference-index)
       - [14b. Align Reads to Sample Assembly](#14b-align-reads-to-sample-assembly)
       - [14c. Sort Assembly Alignments](#14c-sort-assembly-alignments)
     - [15. Get Coverage Information and Filter Based On Detection](#15-get-coverage-information-and-filter-based-on-detection)
@@ -116,9 +116,9 @@ Barbara Novak (GeneLab Data Processing Lead)
     - [18. Generate Normalized, Gene- and Contig-level Coverage Summary Tables of KO-annotations and Taxonomy Across Samples](#18-generate-normalized-gene--and-contig-level-coverage-summary-tables-of-ko-annotations-and-taxonomy-across-samples)
       - [18a. Generate Gene-level Coverage Summary Tables](#18a-generate-gene-level-coverage-summary-tables)
       - [18b. Generate Contig-level Coverage Summary Tables](#18b-generate-contig-level-coverage-summary-tables)
-    - [19. **M**etagenome-**A**ssembled **G**enome (MAG) recovery](#19-metagenome-assembled-genome-mag-recovery)
+    - [19. **M**etagenome-**A**ssembled **G**enome (MAG) Recovery](#19-metagenome-assembled-genome-mag-recovery)
       - [19a. Bin Contigs](#19a-bin-contigs)
-      - [19b. Bin Quality Assessment](#19b-bin-quality-assessment)
+      - [19b. Bin quality assessment](#19b-bin-quality-assessment)
       - [19c. Filter MAGs](#19c-filter-mags)
       - [19d. MAG Taxonomic Classification](#19d-mag-taxonomic-classification)
       - [19e. Generate Overview Table Of All MAGs](#19e-generate-overview-table-of-all-mags)
@@ -143,26 +143,28 @@ Barbara Novak (GeneLab Data Processing Lead)
 
 |Program|Version|Relevant Links|
 |:------|:-----:|------:|
-|bbduk| 38.86 |[https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/)|
+|bbduk| 38.86 |[https://bbmap.org/](https://bbmap.org/)|
 |bit| 1.8.53 |[https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit](https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit)|
 |bowtie2| 2.4.1 | [https://bowtie-bio.sourceforge.net/bowtie2/index.shtml](https://bowtie-bio.sourceforge.net/bowtie2/index.shtml)|
 |CAT| 5.2.3 |[https://github.com/dutilh/CAT#cat-and-bat](https://github.com/dutilh/CAT#cat-and-bat)|
 |CheckM| 1.1.3 |[https://github.com/Ecogenomics/CheckM](https://github.com/Ecogenomics/CheckM)|
 |fastp| 0.24.0 |[https://github.com/OpenGene/fastp](https://github.com/OpenGene/fastp)|
 |FastQC|0.12.1|[https://www.bioinformatics.babraham.ac.uk/projects/fastqc/](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)|
-|SPAdes| 4.1.0 | [https://github.com/ablab/spades](https://github.com/ablab/spades) |
 |GTDB-Tk| 2.4.0 |[https://github.com/Ecogenomics/GTDBTk](https://github.com/Ecogenomics/GTDBTk)|
 |HUMAnN| 3.9 |[https://github.com/biobakery/humann](https://github.com/biobakery/humann)|
 |Kaiju| 1.10.1 | [https://bioinformatics-centre.github.io/kaiju/](https://bioinformatics-centre.github.io/kaiju/) |
 |KEGG-Decoder| 1.2.2 |[https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder](https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder)
 |KOFamScan| 1.3.0 |[https://github.com/takaram/kofam_scan](https://github.com/takaram/kofam_scan)|
 |Kraken2| 2.1.6 | [https://github.com/DerrickWood/kraken2](https://github.com/DerrickWood/kraken2) |
+|KrakenTools| 1.2 | [https://ccb.jhu.edu/software/krakentools/](https://ccb.jhu.edu/software/krakentools/) |
 |Krona| 2.8.1 | [https://github.com/marbl/Krona/wiki](https://github.com/marbl/Krona/wiki)|
+|MEGAHIT| 1.2.9 |[https://github.com/voutcn/megahit#megahit](https://github.com/voutcn/megahit#megahit)|
 |MetaBAT| 2.15 |[https://bitbucket.org/berkeleylab/metabat/src/master/](https://bitbucket.org/berkeleylab/metabat/src/master/)|
 |MultiQC| 1.27.1 |[https://multiqc.info/](https://multiqc.info/)|
 |MetaPhlAn| 4.1.0 |[https://github.com/biobakery/MetaPhlAn](https://github.com/biobakery/MetaPhlAn)|
 |Prodigal| 2.6.3 |[https://github.com/hyattpd/Prodigal#prodigal](https://github.com/hyattpd/Prodigal#prodigal)|
 |samtools| 1.22.1 |[https://github.com/samtools/samtools#samtools](https://github.com/samtools/samtools#samtools)|
+|SPAdes| 4.1.0 | [https://github.com/ablab/spades](https://github.com/ablab/spades) |
 | R | 4.5.1 | [https://www.r-project.org](https://www.r-project.org) |
 |Bioconductor | 3.21 | [https://www.bioconductor.org](https://www.bioconductor.org) |
 |decontam| 1.28.0 | [https://www.bioconductor.org/packages/release/bioc/html/decontam.html](https://www.bioconductor.org/packages/release/bioc/html/decontam.html) |
@@ -475,7 +477,7 @@ fastqc -o decontam_fastqc_output *decontam_GLlbsMetag.fastq.gz
 
 <br>
 
-#### 3d. Compile Contaminant Remove QC
+#### 3d. Compile Contaminant Removal QC
 
 ```bash
 multiqc --zip-data-dir \
@@ -652,7 +654,7 @@ library(tidyverse)
 
 #### 5b. Define Custom Functions
 
-#### get_last_assignment()
+#### get_last_assignment() <!-- omit in toc -->
 <details>
   <summary>retrieves the last taxonomy assignment from a taxonomy string</summary>
 
@@ -687,7 +689,7 @@ library(tidyverse)
 
 </details>
 
-#### mutate_taxonomy()
+#### mutate_taxonomy() <!-- omit in toc -->
 <details>
   <summary>mutate taxonomy column to contain the lowest taxonomy assignment</summary>
 
@@ -721,7 +723,7 @@ library(tidyverse)
 
 </details>
 
-#### process_kaiju_table()
+#### process_kaiju_table() <!-- omit in toc -->
 <details>
   <summary>reformat kaiju output table</summary>
 
@@ -771,7 +773,7 @@ library(tidyverse)
 
 </details>
 
-#### merge_kraken_reports()
+#### merge_kraken_reports() <!-- omit in toc -->
 <details>
   <summary>merge and process multiple kraken outputs to one species table</summary>
 
@@ -815,7 +817,7 @@ library(tidyverse)
 
 </details>
 
-#### get_abundant_features()
+#### get_abundant_features() <!-- omit in toc -->
 <details>
   <summary>Find abundant features based on the sum of feature values</summary>
   
@@ -849,7 +851,7 @@ library(tidyverse)
   
 </details>
 
-#### count_to_rel_abundance()
+#### count_to_rel_abundance() <!-- omit in toc -->
 <details>
   <summary>Convert species count matrix to relative abundance matrix</summary>
 
@@ -882,7 +884,7 @@ library(tidyverse)
 
 </details>
 
-#### filter_rare()
+#### filter_rare() <!-- omit in toc -->
 <details>
   <summary>filter out rare and non_microbial taxonomy assignments based on relative abundance</summary>
 
@@ -924,7 +926,7 @@ library(tidyverse)
 
 </details>
 
-#### group_low_abund_taxa()
+#### group_low_abund_taxa() <!-- omit in toc -->
 <details>
   <summary>Group rare taxa or return a table with only rare taxa</summary>
 
@@ -984,7 +986,7 @@ library(tidyverse)
 
 </details>
 
-#### make_plot()
+#### make_plot() <!-- omit in toc -->
 <details>
   <summary>Create stacked bar plots of relative abundance from input dataframes</summary>
 
@@ -1029,7 +1031,7 @@ library(tidyverse)
 
 </details>
 
-#### make_barplot()
+#### make_barplot() <!-- omit in toc -->
 <details>
   <summary>Parse Metadata and Feature table files in order to create stacked barplots of relative abundance.</summary>
   
@@ -1110,7 +1112,7 @@ library(tidyverse)
   
 </details>
 
-#### make_heatmap()
+#### make_heatmap() <!-- omit in toc -->
 <details>
   <summary>Creates heatmaps from a feature table file</summary>
   
@@ -1209,7 +1211,7 @@ library(tidyverse)
   
 </details>
 
-#### run_decontam()
+#### run_decontam() <!-- omit in toc -->
 <details>
   <summary>Feature table decontamination with decontam</summary>
 
@@ -1282,7 +1284,7 @@ library(tidyverse)
 
 </details>
 
-#### feature_decontam()
+#### feature_decontam() <!-- omit in toc -->
 <details>
   <summary>decontaminate a feature table using the Decontam R package to statistically identify contaminating features in a feature table</summary>
 
@@ -1374,7 +1376,7 @@ library(tidyverse)
   **Returns:** dataframe, `decontaminated_table`, containing the decontaminated feature table
 </details>
 
-#### process_taxonomy()
+#### process_taxonomy() <!-- omit in toc -->
 <details>
   <summary>process a taxonomy assignment table</summary>
 
@@ -1409,7 +1411,7 @@ library(tidyverse)
   **Returns:** dataframe, `taxonomy`, containing reformated taxonomy names
 </details>
 
-#### fix_names()
+#### fix_names() <!-- omit in toc -->
 <details>
   <summary>clean taxonomy names</summary>
 
@@ -1442,7 +1444,7 @@ library(tidyverse)
 
 </details>
 
-#### read_taxonomy_table()
+#### read_taxonomy_table() <!-- omit in toc -->
 <details>
   <summary>Read Assembly-based coverage annotation table</summary>
 
@@ -1482,7 +1484,7 @@ library(tidyverse)
 
 </details>
 
-#### get_samples()
+#### get_samples() <!-- omit in toc -->
 <details>
   <summary>retrieve sample names for which assemblies were generated</summary>
 
@@ -2494,7 +2496,7 @@ sed -i 's/_metaphlan_bugs_list//g' metaphlan-taxonomy_GLlbsMetag.tsv
 
 #### 8h. Create MetaPhlan Species Count Table
 
-#### 8hi. Get Sample Read Counts
+##### 8hi. Get Sample Read Counts
 
 ```bash
 unzip decontam_multiqc_GLlbsMetag_data.zip
@@ -2504,13 +2506,13 @@ grep _R1_decontam multiqc_fastqc.txt | awk 'BEGIN{FS="\t"; OFS="\t"}{print $1,in
 
 **Input Data:**
 
-- decontam_multiqc_GLlbsMetag_data.zip or HostRm_multiqc_GLlbsMetag_data.zip (multiqc data from [Step 3d](#3d-compile-contaminant-remove-qc) or [Step 4c](#4c-compile-host-read-removal-qc) if the optional host removal step was done, respectively)
+- decontam_multiqc_GLlbsMetag_data.zip or HostRm_multiqc_GLlbsMetag_data.zip (multiqc data from [Step 3d](#3d-compile-contaminant-removal-qc) or [Step 4c](#4c-compile-host-read-removal-qc) if the optional host removal step was done, respectively)
 
 **Output Data:**
 
 - reads_per_sample.txt (a 2-column tab delimited file with the sample names and read counts as column 1 and 2, respectively)
 
-#### 8hii. Process MetaPhlan Taxonomy Table
+##### 8hii. Process MetaPhlan Taxonomy Table
 
 ```R
 input_file <- "metaphlan-taxonomy_GLlbsMetag.tsv"
diff --git a/Metagenomics/Low_Biomass/README.md b/Metagenomics/Low_Biomass/README.md
index 46384d622..ca2e4a2cf 100644
--- a/Metagenomics/Low_Biomass/README.md
+++ b/Metagenomics/Low_Biomass/README.md
@@ -1,17 +1,17 @@
 # GeneLab bioinformatics processing pipelines for low-biomass metagenomics sequencing data
 
-> **Documents [`GL-DPPD-7116`](Nanopore/GL-DPPD-7116.md) and [`GL-DPPD-7117.md`](Illumina/GL-DPPD-7117.md) contain overview and example commands for how GeneLab processes low-biomass metagenomics datasets for long- and short-read data, respectively. See the [Repository Links](#repository-links) descriptions below for more information. Processed data output files and a GeneLab data processing summary is provided for each GLDS dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).**  
+> **Documents [`GL-DPPD-7116.md`](Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md) and [`GL-DPPD-7117.md`](Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md) contain an overview and example commands for how GeneLab processes low-biomass metagenomics datasets for long- and short-read data, respectively. See the [Repository Links](#repository-links) descriptions below for more information. Processed data output files and a GeneLab data processing summary are provided for each GLDS dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).**  
 
 <br>
 
 ---
 ## Repository Links
 
-* [**Pipeline_GL-DPPD-7116_Versions**](Nanopore)
+* [**Pipeline_GL-DPPD-7116_Versions**](Pipeline_GL-DPPD-7116_Versions)
 
   - Contains the current and previous GeneLab low-biomass metagenomics consensus processing pipeline documentation for long-read (Nanopore) data
 
-* [**Pipeline_GL-DPPD-7117_Versions**](Illumina)
+* [**Pipeline_GL-DPPD-7117_Versions**](Pipeline_GL-DPPD-7117_Versions)
 
   - Contains the current and previous GeneLab low-biomass metagenomics consensus processing pipeline documentation for short-read (Illumina) data
 
diff --git a/Metagenomics/Low_Biomass/Workflow_Documentation/NF_MGIllumina b/Metagenomics/Low_Biomass/Workflow_Documentation/NF_MGIllumina
new file mode 160000
index 000000000..2a4a676d5
--- /dev/null
+++ b/Metagenomics/Low_Biomass/Workflow_Documentation/NF_MGIllumina
@@ -0,0 +1 @@
+Subproject commit 2a4a676d529fe2f160fa592b302a1d3e39e5c7e3
diff --git a/Metagenomics/Low_Biomass/Workflow_Documentation/README.md b/Metagenomics/Low_Biomass/Workflow_Documentation/README.md
new file mode 100644
index 000000000..364016ece
--- /dev/null
+++ b/Metagenomics/Low_Biomass/Workflow_Documentation/README.md
@@ -0,0 +1,19 @@
+# GeneLab Low-biomass Metagenomics Workflow Information
+
+> **GeneLab has wrapped each step of the low-biomass metagenomics sequencing data processing pipelines (MGIllumina) into a workflow. The table below lists (and links to) each MGIllumina version and the corresponding workflow subdirectory; the current MGIllumina pipeline/workflow implementation is marked with an asterisk. The workflow subdirectory contains information about the workflow along with instructions for installation and usage. Exact workflow run info and the MGIllumina version used to process specific datasets that have been released are provided with their processed data in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).**  
+
+## MGIllumina Pipeline Version and Corresponding Workflow
+
+|Pipeline Version|Current Workflow Version (for respective pipeline version)|Nextflow Version| 
+|:---------------|:---------------------------------------------------------|:---------------|
+|*[GL-DPPD-7116.md](../Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md)|[NF_MGIllumina_2.0.0](NF_MGIllumina)|24.04.4|
+|*[GL-DPPD-7117.md](../Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md)|[NF_MGIllumina_2.0.0](NF_MGIllumina)|24.04.4|
+
+
+*Current GeneLab Pipeline/Workflow Implementation
+
+> See the [workflow change log](NF_MGIllumina/CHANGELOG.md) to access previous workflow versions and view all changes associated with each version update.
+
+
+> See the [NF_AmpIllumina Change Log](https://github.com/nasa/GeneLab_AmpliconSeq_Workflow/blob/main/CHANGELOG.md) to access the most recent changes to the workflow and view all changes associated with each update.<br>
+> All workflow changes associated with the previous version of the GeneLab Amplicon Pipeline ([GL-DPPD-7104-B](https://github.com/nasa/GeneLab_Data_Processing/blob/master/Amplicon/Illumina/Pipeline_GL-DPPD-7104_Versions/GL-DPPD-7104-B.md) and earlier) can be found in the [SW_AmpIllumina-B Change Log](https://github.com/nasa/GeneLab_Data_Processing/blob/master/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/CHANGELOG.md)

From 03cbf07c617cab5f5e1356c460fbe4e7ce43255a Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Mon, 23 Mar 2026 13:46:49 -0700
Subject: [PATCH 35/47] added human functional profiling to long-read pipeline

- updated all numbering
- fixed broken link
- removed incorrect references to gzipped fasta
---
 .../GL-DPPD-7116.md                           | 820 ++++++++++++++----
 .../GL-DPPD-7117.md                           |  41 +-
 2 files changed, 659 insertions(+), 202 deletions(-)

diff --git a/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md b/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md
index b92ea15be..c70ee6aa0 100644
--- a/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md
+++ b/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md
@@ -82,57 +82,67 @@ Barbara Novak (GeneLab Data Processing Lead)
       - [11f. Filter Kraken2 Species Count Table](#11f-filter-kraken2-species-count-table)
       - [11g. Kraken2 Taxonomy Barplots](#11g-kraken2-taxonomy-barplots)
       - [11h. Kraken2 Feature Decontamination](#11h-kraken2-feature-decontamination)
+    - [12. Functional Profiling Using HUMAnN/MetaPhlan](#12-functional-profiling-using-humannmetaphlan)
+      - [12a. Download and Install HUMAnN databases](#12a-download-and-install-humann-databases)
+      - [12b. HUMAnN Functional Profiling](#12b-humann-functional-profiling)
+      - [12c. Merge Multiple Sample Functional Profiles](#12c-merge-multiple-sample-functional-profiles)
+      - [12d. Split Results Tables](#12d-split-results-tables)
+      - [12e. Normalize Gene Families and Pathway Abundances Tables](#12e-normalize-gene-families-and-pathway-abundances-tables)
+      - [12f. Generate Normalized Gene-family Table Grouped by Kegg Orthologs (KOs)](#12f-generate-normalized-gene-family-table-grouped-by-kegg-orthologs-kos)
+      - [12g. Filter Humann Output](#12g-filter-humann-output)
+      - [12h. Create Humann Function Heatmaps](#12h-create-humann-function-heatmaps)
+      - [12i. Humann Feature Decontamination](#12i-humann-feature-decontamination)
   - [Assembly-based Processing](#assembly-based-processing)
-    - [12. Sample Assembly](#12-sample-assembly)
-    - [13. Polish Assembly](#13-polish-assembly)
-    - [14. Rename Contigs and Summarize Assemblies](#14-rename-contigs-and-summarize-assemblies)
-      - [14a. Rename Contig Headers](#14a-rename-contig-headers)
-      - [14b. Summarize Assemblies](#14b-summarize-assemblies)
-    - [15. Gene Prediction](#15-gene-prediction)
-      - [15a. Generate Gene Predictions](#15a-generate-gene-predictions)
-      - [15b. Remove Line Wraps In Gene Prediction Output](#15b-remove-line-wraps-in-gene-prediction-output)
-    - [16. Functional Annotation](#16-functional-annotation)
-      - [16a. Download Reference Database of HMM Models](#16a-download-reference-database-of-hmm-models)
-      - [16b. Run KEGG Annotation](#16b-run-kegg-annotation)
-      - [16c. Filter KO Outputs](#16c-filter-ko-outputs)
-    - [17. Taxonomic Classification](#17-taxonomic-classification)
-      - [17a. Pull and Unpack Pre-built Reference DB](#17a-pull-and-unpack-pre-built-reference-db)
-      - [17b. Run Taxonomic Classification](#17b-run-taxonomic-classification)
-      - [17c. Add Taxonomy Info From Taxids To Genes](#17c-add-taxonomy-info-from-taxids-to-genes)
-      - [17d. Add Taxonomy Info From Taxids To Contigs](#17d-add-taxonomy-info-from-taxids-to-contigs)
-      - [17e. Format Gene-level Output With awk and sed](#17e-format-gene-level-output-with-awk-and-sed)
-      - [17f. Format Contig-level Output With awk and sed](#17f-format-contig-level-output-with-awk-and-sed)
-    - [18. Read-Mapping](#18-read-mapping)
-      - [18a. Align Reads to Sample Assembly](#18a-align-reads-to-sample-assembly)
-      - [18b. Sort Assembly Alignments](#18b-sort-assembly-alignments)
-    - [19. Get Coverage Information and Filter Based On Detection](#19-get-coverage-information-and-filter-based-on-detection)
-      - [19a. Filter Coverage Levels Based On Detection](#19a-filter-coverage-levels-based-on-detection)
-      - [19b. Filter Gene and Contig Coverage Based On Detection](#19b-filter-gene-and-contig-coverage-based-on-detection)
-    - [20. Combine Gene-level Coverage, Taxonomy, and Functional Annotations For Each Sample](#20-combine-gene-level-coverage-taxonomy-and-functional-annotations-for-each-sample)
-    - [21. Combine Contig-level Coverage and Taxonomy For Each Sample](#21-combine-contig-level-coverage-and-taxonomy-for-each-sample)
-    - [22. Generate Normalized, Gene- and Contig-level Coverage Summary Tables of KO-annotations and Taxonomy Across Samples](#22-generate-normalized-gene--and-contig-level-coverage-summary-tables-of-ko-annotations-and-taxonomy-across-samples)
-      - [22a. Generate Gene-level Coverage Summary Tables](#22a-generate-gene-level-coverage-summary-tables)
-      - [22b. Generate Contig-level Coverage Summary Tables](#22b-generate-contig-level-coverage-summary-tables)
-    - [23. **M**etagenome-**A**ssembled **G**enome (MAG) Recovery](#23-metagenome-assembled-genome-mag-recovery)
-      - [23a. Bin Contigs](#23a-bin-contigs)
-      - [23b. Bin Quality Assessment](#23b-bin-quality-assessment)
-      - [23c. Filter MAGs](#23c-filter-mags)
-      - [23d. MAG Taxonomic Classification](#23d-mag-taxonomic-classification)
-      - [23e. Generate Overview Table Of All MAGs](#23e-generate-overview-table-of-all-mags)
-    - [24. Generate MAG-level Functional Summary Overview](#24-generate-mag-level-functional-summary-overview)
-      - [24a. Get KO Annotations Per MAG](#24a-get-ko-annotations-per-mag)
-      - [24b. Summarize KO Annotations With KEGG-Decoder](#24b-summarize-ko-annotations-with-kegg-decoder)
-    - [25. Filtering, Decontamination, and Visualization of Contig- and Gene-taxonomy and Gene-function Outputs](#25-filtering-decontamination-and-visualization-of-contig--and-gene-taxonomy-and-gene-function-outputs)
-      - [25a. Gene-level Taxonomy Heatmaps](#25a-gene-level-taxonomy-heatmaps)
-      - [25b. Gene-level Taxonomy Feature Filtering](#25b-gene-level-taxonomy-feature-filtering)
-      - [25c. Gene-level Taxonomy Decontamination](#25c-gene-level-taxonomy-decontamination)
-      - [25d. Gene-level KO Functions Heatmaps](#25d-gene-level-ko-functions-heatmaps)
-      - [25e. Gene-level KO Functions Feature Filtering](#25e-gene-level-ko-functions-feature-filtering)
-      - [25f. Gene-level KO Functions Decontamination](#25f-gene-level-ko-functions-decontamination)
-      - [25g. Contig-level Heatmaps](#25g-contig-level-heatmaps)
-      - [25h. Contig-level Feature Filtering](#25h-contig-level-feature-filtering)
-      - [25i. Contig-level Decontamination](#25i-contig-level-decontamination)
-    - [26. Generate Assembly-based Processing Overview](#26-generate-assembly-based-processing-overview)
+    - [13. Sample Assembly](#13-sample-assembly)
+    - [14. Polish Assembly](#14-polish-assembly)
+    - [15. Rename Contigs and Summarize Assemblies](#15-rename-contigs-and-summarize-assemblies)
+      - [15a. Rename Contig Headers](#15a-rename-contig-headers)
+      - [15b. Summarize Assemblies](#15b-summarize-assemblies)
+    - [16. Gene Prediction](#16-gene-prediction)
+      - [16a. Generate Gene Predictions](#16a-generate-gene-predictions)
+      - [16b. Remove Line Wraps In Gene Prediction Output](#16b-remove-line-wraps-in-gene-prediction-output)
+    - [17. Functional Annotation](#17-functional-annotation)
+      - [17a. Download Reference Database of HMM Models](#17a-download-reference-database-of-hmm-models)
+      - [17b. Run KEGG Annotation](#17b-run-kegg-annotation)
+      - [17c. Filter KO Outputs](#17c-filter-ko-outputs)
+    - [18. Taxonomic Classification](#18-taxonomic-classification)
+      - [18a. Pull and Unpack Pre-built Reference DB](#18a-pull-and-unpack-pre-built-reference-db)
+      - [18b. Run Taxonomic Classification](#18b-run-taxonomic-classification)
+      - [18c. Add Taxonomy Info From Taxids To Genes](#18c-add-taxonomy-info-from-taxids-to-genes)
+      - [18d. Add Taxonomy Info From Taxids To Contigs](#18d-add-taxonomy-info-from-taxids-to-contigs)
+      - [18e. Format Gene-level Output With awk and sed](#18e-format-gene-level-output-with-awk-and-sed)
+      - [18f. Format Contig-level Output With awk and sed](#18f-format-contig-level-output-with-awk-and-sed)
+    - [19. Read-Mapping](#19-read-mapping)
+      - [19a. Align Reads to Sample Assembly](#19a-align-reads-to-sample-assembly)
+      - [19b. Sort Assembly Alignments](#19b-sort-assembly-alignments)
+    - [20. Get Coverage Information and Filter Based On Detection](#20-get-coverage-information-and-filter-based-on-detection)
+      - [20a. Filter Coverage Levels Based On Detection](#20a-filter-coverage-levels-based-on-detection)
+      - [20b. Filter Gene and Contig Coverage Based On Detection](#20b-filter-gene-and-contig-coverage-based-on-detection)
+    - [21. Combine Gene-level Coverage, Taxonomy, and Functional Annotations For Each Sample](#21-combine-gene-level-coverage-taxonomy-and-functional-annotations-for-each-sample)
+    - [22. Combine Contig-level Coverage and Taxonomy For Each Sample](#22-combine-contig-level-coverage-and-taxonomy-for-each-sample)
+    - [23. Generate Normalized, Gene- and Contig-level Coverage Summary Tables of KO-annotations and Taxonomy Across Samples](#23-generate-normalized-gene--and-contig-level-coverage-summary-tables-of-ko-annotations-and-taxonomy-across-samples)
+      - [23a. Generate Gene-level Coverage Summary Tables](#23a-generate-gene-level-coverage-summary-tables)
+      - [23b. Generate Contig-level Coverage Summary Tables](#23b-generate-contig-level-coverage-summary-tables)
+    - [24. **M**etagenome-**A**ssembled **G**enome (MAG) Recovery](#24-metagenome-assembled-genome-mag-recovery)
+      - [24a. Bin Contigs](#24a-bin-contigs)
+      - [24b. Bin Quality Assessment](#24b-bin-quality-assessment)
+      - [24c. Filter MAGs](#24c-filter-mags)
+      - [24d. MAG Taxonomic Classification](#24d-mag-taxonomic-classification)
+      - [24e. Generate Overview Table Of All MAGs](#24e-generate-overview-table-of-all-mags)
+    - [25. Generate MAG-level Functional Summary Overview](#25-generate-mag-level-functional-summary-overview)
+      - [25a. Get KO Annotations Per MAG](#25a-get-ko-annotations-per-mag)
+      - [25b. Summarize KO Annotations With KEGG-Decoder](#25b-summarize-ko-annotations-with-kegg-decoder)
+    - [26. Filtering, Decontamination, and Visualization of Contig- and Gene-taxonomy and Gene-function Outputs](#26-filtering-decontamination-and-visualization-of-contig--and-gene-taxonomy-and-gene-function-outputs)
+      - [26a. Gene-level Taxonomy Heatmaps](#26a-gene-level-taxonomy-heatmaps)
+      - [26b. Gene-level Taxonomy Feature Filtering](#26b-gene-level-taxonomy-feature-filtering)
+      - [26c. Gene-level Taxonomy Decontamination](#26c-gene-level-taxonomy-decontamination)
+      - [26d. Gene-level KO Functions Heatmaps](#26d-gene-level-ko-functions-heatmaps)
+      - [26e. Gene-level KO Functions Feature Filtering](#26e-gene-level-ko-functions-feature-filtering)
+      - [26f. Gene-level KO Functions Decontamination](#26f-gene-level-ko-functions-decontamination)
+      - [26g. Contig-level Heatmaps](#26g-contig-level-heatmaps)
+      - [26h. Contig-level Feature Filtering](#26h-contig-level-feature-filtering)
+      - [26i. Contig-level Decontamination](#26i-contig-level-decontamination)
+    - [27. Generate Assembly-based Processing Overview](#27-generate-assembly-based-processing-overview)
 
 
 ---
@@ -149,16 +159,18 @@ Barbara Novak (GeneLab Data Processing Lead)
 |Filtlong| 0.2.1 |[https://github.com/rrwick/Filtlong](https://github.com/rrwick/Filtlong)|
 |Flye| 2.9.5 | [https://github.com/mikolmogorov/Flye](https://github.com/mikolmogorov/Flye) |
 |GTDB-Tk| 2.4.0 |[https://github.com/Ecogenomics/GTDBTk](https://github.com/Ecogenomics/GTDBTk)|
+|HUMAnN| 3.9 |[https://github.com/biobakery/humann](https://github.com/biobakery/humann)|
 |Kaiju| 1.10.1 | [https://bioinformatics-centre.github.io/kaiju/](https://bioinformatics-centre.github.io/kaiju/) |
 |KEGG-Decoder| 1.2.2 |[https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder](https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder)
 |KOFamScan| 1.3.0 |[https://github.com/takaram/kofam_scan](https://github.com/takaram/kofam_scan)|
 |Kraken2| 2.1.6 | [https://github.com/DerrickWood/kraken2](https://github.com/DerrickWood/kraken2) |
 |KrakenTools| 1.2 | [https://ccb.jhu.edu/software/krakentools/](https://ccb.jhu.edu/software/krakentools/) |
 |Krona| 2.8.1 | [https://github.com/marbl/Krona/wiki](https://github.com/marbl/Krona/wiki)|
+|Medaka| 2.1.1 | [https://github.com/nanoporetech/medaka](https://github.com/nanoporetech/medaka) |
 |MetaBAT| 2.15 |[https://bitbucket.org/berkeleylab/metabat/src/master/](https://bitbucket.org/berkeleylab/metabat/src/master/)|
+|MetaPhlAn| 4.1.0 |[https://github.com/biobakery/MetaPhlAn](https://github.com/biobakery/MetaPhlAn)|
 |Minimap2| 2.28 | [https://github.com/lh3/minimap2](https://github.com/lh3/minimap2) |
 |MultiQC| 1.27.1 |[https://multiqc.info/](https://multiqc.info/)|
-|Medaka| 2.1.1 | [https://github.com/nanoporetech/medaka](https://github.com/nanoporetech/medaka) |
 |NanoPlot| 1.44.1 | [https://github.com/wdecoster/NanoPlot](https://github.com/wdecoster/NanoPlot)|
 |Porechop| 0.2.4 | [https://github.com/rrwick/Porechop](https://github.com/rrwick/Porechop) |
 |Prodigal| 2.6.3 |[https://github.com/hyattpd/Prodigal#prodigal](https://github.com/hyattpd/Prodigal#prodigal)|
@@ -1981,8 +1993,7 @@ kaiju -f kaiju-db/nr_euk/kaiju_db_nr_euk.fmi \
 - kaiju-db/nr_euk/kaiju_db_nr_euk.fmi (FM-index file containing the main Kaiju database index, output from [Step 10a](#10a-build-kaiju-database))
 - kaiju-db/nodes.dmp (kaiju taxonomy hierarchy nodes file, output from [Step 10a](#10a-build-kaiju-database))
 - sample_decontam_GLlblMetag.fastq.gz or sample_HostRm_GLlblMetag.fastq.gz (filtered and trimmed sample reads with both 
-    contaminants and human reads (and optionally host reads) removed, gzipped fastq file, 
-    output from [Step 7e](#7e-generate-decontaminated-read-files) or [Step 8b](#8b-remove-host-reads))
+    contaminants and human reads (and optionally host reads) removed, output from [Step 7e](#7e-generate-decontaminated-read-files) or [Step 8b](#8b-remove-host-reads))
 
 **Output Data:**
 
@@ -2355,8 +2366,7 @@ kraken2 --db kraken2-db/ \
 
 - kraken2-db/ (a directory containing kraken2 database files, output from [Step 11a](#11a-download-kraken2-database))
 - sample_decontam_GLlblMetag.fastq.gz or sample_HostRm_GLlblMetag.fastq.gz (filtered and trimmed sample reads with both 
-    contaminants and human reads (and, optionally, host reads) removed, gzipped fasta file, 
-    output from [Step 7e](#7e-generate-decontaminated-read-files) or [Step 8b](#8b-remove-host-reads))
+    contaminants and human reads (and, optionally, host reads) removed, output from [Step 7e](#7e-generate-decontaminated-read-files) or [Step 8b](#8b-remove-host-reads))
 
 **Output Data:**
 
@@ -2629,12 +2639,467 @@ make_barplot(metadata_file = metadata_table, feature_table_file = "kraken2_decon
 
 <br>
 
+### 12. Functional Profiling Using HUMAnN/MetaPhlan
+
+>**Note:** For long-read data, HUMAnN produces useful functional profiling results, but MetaPhlan reports all reads as unclassified. This pipeline therefore includes only the functional profiling; it does not save the taxonomic profiling output or produce processed taxonomic profiling results. 
+
+#### 12a. Download and Install HUMAnN databases
+
+```bash 
+mkdir -p /path/to/humann3-db
+humann3_databases --download chocophlan full /path/to/humann3-db/
+humann3_databases --download uniref uniref90_ec_filtered_diamond /path/to/humann3-db/
+humann3_databases --download utility_mapping full /path/to/humann3-db/
+metaphlan --install
+```
+
+**Parameter Definitions:**
+
+*humann3_databases*
+- `--download` - Specifies the databases to download:
+  - `chocophlan full` - the full ChocoPhlAn pangenome database, which includes Archaea, Bacteria, Eukaryotes, and Viruses
+  - `uniref uniref90_ec_filtered_diamond` - Download the EC-filtered UniRef90 translated search database
+  - `utility_mapping full` - additional gene family to functional category mapping database
+- `/path/to/humann3-db` - Specifies the database install location
+
+*metaphlan*
+- `--install` - install the MetaPhlan clade markers and database locally
+
+**Input Data:**
+
+*No input data required*
+
+**Output Data:**
+
+- `/path/to/humann3-db/` (directory containing the installed HUMAnN databases)
+
+
+#### 12b. HUMAnN Functional Profiling
+
+```bash
+  # forward and reverse reads need to be provided combined if paired-end (if not paired-end, single-end reads are provided to the --input argument next)
+cat sample1_R1_decontam_GLlblMetag.fastq.gz sample1_R2_decontam_GLlblMetag.fastq.gz > sample1-combined.fastq.gz
+
+humann --input sample1-combined.fastq.gz \
+       --output sample1-humann3-out-dir \
+       --threads NumberOfThreads \
+       --output-basename sample1 \
+       --metaphlan-options "--bowtie2db /path/to/humann3-db/ --unclassified_estimation --add_viruses --sample_id sample1" \
+       --nucleotide-database /path/to/humann3-db/ \
+       --protein-database /path/to/humann3-db/ \
+       --bowtie-options "--sensitive --mm"
+
+mv sample1-humann3-out-dir/sample1_humann_temp/sample1_metaphlan_bugs_list.tsv \
+   sample1-humann3-out-dir/sample1_metaphlan_bugs_list.tsv
+```
+
+**Parameter Definitions:**  
+
+-	`--input` – specifies the input (combined forward and reverse reads)
+-	`--output` – specifies output directory
+-	`--threads` – specifies the number of threads to use
+-	`--output-basename` – specifies prefix of the output files
+-	`--metaphlan-options` – options to be passed to metaphlan
+  - `--bowtie2db` – path to bowtie2 indexes (stored in HUMAnN database folder)
+  - `--unclassified_estimation` – scale the relative abundance profile according to the percentage of reads mapping to a clade
+  - `--add_viruses` – include viruses in the reference database
+  - `--sample_id` – specifies the sample identifier we want in the table (rather than full filename)
+
+**Input Data:**
+
+- `/path/to/humann3-db/` (HUMAnN databases installed in [Step 12a](#12a-download-and-install-humann-databases))
+- *_R[12]_decontam_GLlblMetag.fastq.gz or *_R[12]_HostRm_GLlblMetag.fastq.gz (filtered and trimmed sample reads with both 
+    contaminants and human reads (and optionally host reads) removed, gzipped fastq file, output from [Step 7e](#7e-generate-decontaminated-read-files) or [Step 8b](#8b-remove-host-reads))
+
+**Output Data:**
+
+- sample1-humann3-out-dir/ (HUMAnN output directory containing `*genefamilies.tsv`, `*pathabundance.tsv`, and `*pathcoverage.tsv` files)
+
+#### 12c. Merge Multiple Sample Functional Profiles
+
+```bash
+# each HUMAnN output type needs to be in its own directory before joining
+mkdir gene-family-results/ path-abundance-results/ path-coverage-results/
+
+# copying results from humann3 step
+cp *-humann3-out-dir/*genefamilies.tsv gene-family-results/
+cp *-humann3-out-dir/*abundance.tsv path-abundance-results/
+cp *-humann3-out-dir/*coverage.tsv path-coverage-results/
+
+# join results across samples
+humann_join_tables -i gene-family-results/ -o gene-families.tsv
+humann_join_tables -i path-abundance-results/ -o pathway-abundances.tsv
+humann_join_tables -i path-coverage-results/ -o pathway-coverages.tsv
+```
+
+**Parameter Definitions:**  
+
+- `-i` - the directory holding the input tables
+- `-o` - the name of the output table holding combined data
+
+**Input Data:**
+
+- `sample-humann3-out-dir` (HUMAnN output directory, from [Step 12b](#12b-humann-functional-profiling))
+
+**Output Data:**
+
+- gene-families.tsv (Combined gene family table in tab-separated format.)
+- pathway-abundances.tsv (Combined pathway abundances table in tab-separated format.)
+- pathway-coverages.tsv (Combined pathway coverages table in tab-separated format.)
+
+#### 12d. Split Results Tables
+
+The read-based functional annotation tables mix taxonomically stratified and unstratified rows. HUMAnN includes a helper script, `humann_split_stratified_table`, that splits each table into a non-taxonomically grouped (unstratified) file and a taxonomically grouped (stratified) file.
+
+```bash
+humann_split_stratified_table -i gene-families.tsv -o ./
+mv gene-families_stratified.tsv Gene-families-grouped-by-taxa_GLlblMetag.tsv
+mv gene-families_unstratified.tsv Gene-families_GLlblMetag.tsv
+
+humann_split_stratified_table -i pathway-abundances.tsv -o ./
+mv pathway-abundances_stratified.tsv Pathway-abundances-grouped-by-taxa_GLlblMetag.tsv
+mv pathway-abundances_unstratified.tsv Pathway-abundances_GLlblMetag.tsv
+
+humann_split_stratified_table -i pathway-coverages.tsv -o ./
+mv pathway-coverages_stratified.tsv Pathway-coverages-grouped-by-taxa_GLlblMetag.tsv
+mv pathway-coverages_unstratified.tsv Pathway-coverages_GLlblMetag.tsv
+```
+
+**Parameter Definitions:**  
+
+-	`-i` – the input combined table
+-	`-o` – output directory (here specifying current directory)
+
+**Input Data:**
+
+- gene-families.tsv (Combined gene family table from [Step 12c](#12c-merge-multiple-sample-functional-profiles))
+- pathway-abundances.tsv (Combined path abundances table from [Step 12c](#12c-merge-multiple-sample-functional-profiles))
+- pathway-coverages.tsv (Combined path coverages table from [Step 12c](#12c-merge-multiple-sample-functional-profiles))
+
+**Output Data:**
+
+- **Gene-families_GLlblMetag.tsv** (gene-family abundances)
+- **Gene-families-grouped-by-taxa_GLlblMetag.tsv** (gene-family abundances grouped by taxa)
+- **Pathway-abundances_GLlblMetag.tsv**  (pathway abundances)
+- **Pathway-abundances-grouped-by-taxa_GLlblMetag.tsv** (pathway abundances grouped by taxa)
+- **Pathway-coverages_GLlblMetag.tsv** (pathway coverages)
+- **Pathway-coverages-grouped-by-taxa_GLlblMetag.tsv** (pathway coverages grouped by taxa)
+
+#### 12e. Normalize Gene Families and Pathway Abundances Tables
+Generate normalized tables of the read-based functional outputs from HUMAnN that are better suited to cross-sample comparisons.
+
+```bash
+humann_renorm_table -i Gene-families_GLlblMetag.tsv -o Gene-families-cpm_GLlblMetag.tsv --update-snames
+humann_renorm_table -i Pathway-abundances_GLlblMetag.tsv -o Pathway-abundances-cpm_GLlblMetag.tsv --update-snames
+```
+
+**Parameter Definitions:**  
+
+-	`-i` – the input combined table
+-	`-o` – name of the output normalized table
+-	`--update-snames` – change suffix of column names in tables to "-CPM"
+
+**Input Data:**
+
+- Gene-families_GLlblMetag.tsv (gene-family abundances, from [Step 12d](#12d-split-results-tables))
+- Pathway-abundances_GLlblMetag.tsv (pathway abundances, from [Step 12d](#12d-split-results-tables))
+
+**Output Data:**
+- **Gene-families-cpm_GLlblMetag.tsv** (gene-family abundances normalized to copies-per-million)
+- **Pathway-abundances-cpm_GLlblMetag.tsv** (pathway abundances normalized to copies-per-million)
+
+#### 12f. Generate Normalized Gene-family Table Grouped by Kegg Orthologs (KOs)
+
+```bash
+humann_regroup_table -i Gene-families_GLlblMetag.tsv -g uniref90_ko | \
+humann_rename_table -n kegg-orthology | \
+humann_renorm_table -o Gene-families-KO-cpm_GLlblMetag.tsv --update-snames
+```
+
+**Parameter Definitions:**  
+
+*humann_regroup_table*
+-	`-i` – the input table
+-	`-g` – the map to use to group uniref IDs into Kegg Orthologs
+-	`|` – sending that output into the next humann command to add human-readable Kegg Orthology names
+
+*humann_rename_table*
+-	`-n` – specifying we are converting Kegg orthology IDs into Kegg orthology human-readable names
+-	`|` – sending that output into the next humann command to normalize to copies-per-million
+
+*humann_renorm_table*
+-	`-o` – specifying the final output file name
+-  `--update-snames` – change suffix of column names in tables to "-CPM"
+
+**Input Data:**
+
+- Gene-families_GLlblMetag.tsv (Non-taxonomically grouped gene families, from [Step 12d](#12d-split-results-tables))
+
+**Output Data:**
+
+- **Gene-families-KO-cpm_GLlblMetag.tsv** (KO term abundances normalized to copies-per-million)
+  
+#### 12g. Filter Humann Output
+
+```R
+# read in humann tables
+humann_uniref_table <- read_delim(file = "Gene-families-cpm_GLlblMetag.tsv", delim = "\t")
+humann_KO_table <- read_delim(file = "Gene-families-KO-cpm_GLlblMetag.tsv", delim = "\t")
+humann_pathway_table <- read_delim(file = "Pathway-abundances-cpm_GLlblMetag.tsv", delim = "\t")
+
+# rename headers
+humann_uniref_table <-  humann_uniref_table  %>% 
+  rename(Uniref90=`# Gene Family`) %>%
+  mutate(Uniref90=str_replace_all(Uniref90, "UniRef90_", "")) %>%
+  set_names(colnames(.) %>% str_replace_all("_Abundance-CPM", "")) %>%
+  as.data.frame()
+write_tsv(x = humann_uniref_table, file = "Gene-families-uniref_unfiltered_GLlblMetag.tsv")
+
+humann_KO_table <- humann_KO_table %>%
+  rename(KO=`# Gene Family`) %>%
+  set_names(colnames(.) %>% str_replace_all("_Abundance-CPM", "")) %>%
+  as.data.frame()
+write_tsv(x = humann_KO_table, file = "Gene-families-KO_unfiltered_GLlblMetag.tsv")
+
+humann_pathway_table <-  humann_pathway_table  %>% 
+  rename(Pathway=`# Pathway`) %>%
+  set_names(colnames(.) %>% str_replace_all("_Abundance-CPM", "")) %>%
+  as.data.frame()
+write_tsv(x = humann_pathway_table, file = "Pathway-abundances_unfiltered_GLlblMetag.tsv")
+
+# filter data
+threshold <- 500
+
+humann_uniref_table <- humann_uniref_table %>%
+  mutate(across(where(is.numeric), function(col) replace_na(col, 0))) %>% column_to_rownames("Uniref90")
+humann_uniref_filtered <- get_abundant_features(humann_uniref_table, cpm_threshold = threshold) %>%
+  as.data.frame() %>% rownames_to_column("Uniref90")
+write_tsv(x = humann_uniref_filtered, file = "Gene-families-uniref_filtered_GLlblMetag.tsv")
+
+humann_KO_table <- humann_KO_table %>%
+  mutate(across(where(is.numeric), function(col) replace_na(col, 0))) %>% column_to_rownames("KO")
+humann_KO_filtered <- get_abundant_features(humann_KO_table, cpm_threshold = threshold) %>%
+  as.data.frame() %>% rownames_to_column("KO")
+write_tsv(x = humann_KO_filtered, file = "Gene-families-KO_filtered_GLlblMetag.tsv")
+
+humann_pathway_table <- humann_pathway_table %>%
+  mutate(across(where(is.numeric), function(col) replace_na(col, 0))) %>% column_to_rownames("Pathway")
+humann_pathway_filtered <- get_abundant_features(humann_pathway_table, cpm_threshold = threshold) %>%
+  as.data.frame() %>% rownames_to_column("Pathway")
+write_tsv(x = humann_pathway_filtered, file = "Pathway-abundances_filtered_GLlblMetag.tsv")
+
+```
+
+**Custom Functions Used:**
+- [get_abundant_features()](#get_abundant_features)
+
+**Parameter Definitions:**
+
+- `threshold` - CPM cutoff passed to `get_abundant_features()` for filtering out low-abundance features; must be greater than 0 (500 in the example above)
+
+**Input Data:**
+
+- Gene-families-cpm_GLlblMetag.tsv (normalized gene-family abundances table, from [Step 12e](#12e-normalize-gene-families-and-pathway-abundances-tables))
+- Gene-families-KO-cpm_GLlblMetag.tsv (normalized KO term abundances table, from [Step 12f](#12f-generate-normalized-gene-family-table-grouped-by-kegg-orthologs-kos))
+- Pathway-abundances-cpm_GLlblMetag.tsv (normalized pathway abundances table, from [Step 12e](#12e-normalize-gene-families-and-pathway-abundances-tables))
+
+**Output Data:**
+
+- Gene-families-KO_unfiltered_GLlblMetag.tsv (KO term abundances normalized to copies-per-million, with cleaned headers)
+- Gene-families-uniref_unfiltered_GLlblMetag.tsv (gene-family abundances normalized to copies-per-million, with cleaned headers)
+- Pathway-abundances_unfiltered_GLlblMetag.tsv (pathway abundances normalized to copies-per-million, with cleaned headers)
+- **Gene-families-KO_filtered_GLlblMetag.tsv** (KO term abundances with features below the 500 CPM threshold removed)
+- **Gene-families-uniref_filtered_GLlblMetag.tsv** (gene-family abundances with features below the 500 CPM threshold removed)
+- **Pathway-abundances_filtered_GLlblMetag.tsv** (pathway abundances with features below the 500 CPM threshold removed)
+
+#### 12h. Create Humann Function Heatmaps
+
+```R
+metadata_table <- "/path/to/sample/metadata"
+
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Gene-families-uniref_unfiltered_GLlblMetag.tsv", 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Gene-families-uniref_unfiltered", 
+             assay_suffix = "_GLlblMetag", 
+             custom_palette = custom_palette)
+
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Gene-families-uniref_filtered_GLlblMetag.tsv", 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Gene-families-uniref_filtered", 
+             assay_suffix = "_GLlblMetag", 
+             custom_palette = custom_palette)
+
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Gene-families-KO_unfiltered_GLlblMetag.tsv", 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Gene-families-KO_unfiltered", 
+             assay_suffix = "_GLlblMetag", 
+             custom_palette = custom_palette)
+
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Gene-families-KO_filtered_GLlblMetag.tsv", 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Gene-families-KO_filtered", 
+             assay_suffix = "_GLlblMetag", 
+             custom_palette = custom_palette)
+
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Pathway-abundances_unfiltered_GLlblMetag.tsv", 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Pathway-abundances_unfiltered", 
+             assay_suffix = "_GLlblMetag", 
+             custom_palette = custom_palette)
+
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Pathway-abundances_filtered_GLlblMetag.tsv", 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Pathway-abundances_filtered", 
+             assay_suffix = "_GLlblMetag", 
+             custom_palette = custom_palette)
+```
+
+**Custom Functions Used:**
+- [make_heatmap()](#make_heatmap)
+
+**Parameter Definitions:**
+
+- `metadata_table` - path to a file containing group information for each sample in the feature tables
+
+**Input Data:**
+
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
+- `Gene-families-uniref_unfiltered_GLlblMetag.tsv` (gene-family abundances table, output from [Step 12g](#12g-filter-humann-output))
+- `Gene-families-KO_unfiltered_GLlblMetag.tsv` (KO term abundances table, output from [Step 12g](#12g-filter-humann-output))
+- `Pathway-abundances_unfiltered_GLlblMetag.tsv` (pathway abundances table, output from [Step 12g](#12g-filter-humann-output))
+- `Gene-families-uniref_filtered_GLlblMetag.tsv` (filtered gene-family abundances table, output from [Step 12g](#12g-filter-humann-output)) 
+- `Gene-families-KO_filtered_GLlblMetag.tsv` (filtered KO term abundances table, output from [Step 12g](#12g-filter-humann-output)) 
+- `Pathway-abundances_filtered_GLlblMetag.tsv` (filtered Pathway abundances table, output from [Step 12g](#12g-filter-humann-output)) 
+
+**Output Data:**
+
+- **Gene-families-uniref_unfiltered_heatmap_GLlblMetag.png** (gene-family abundances heatmap without filtering)
+- **Gene-families-uniref_filtered_heatmap_GLlblMetag.png** (gene-family abundances heatmap after filtering out low-abundance features)
+- **Gene-families-KO_unfiltered_heatmap_GLlblMetag.png** (KO term abundances heatmap without filtering)
+- **Gene-families-KO_filtered_heatmap_GLlblMetag.png** (KO term abundances heatmap after filtering out low-abundance features)
+- **Pathway-abundances_unfiltered_heatmap_GLlblMetag.png** (pathway abundances heatmap without filtering)
+- **Pathway-abundances_filtered_heatmap_GLlblMetag.png** (pathway abundances heatmap after filtering out low-abundance features)
+- **Gene-families-uniref_unfiltered_top_50_heatmap_GLlblMetag.png** (heatmap of the top 50 gene families, without filtering)
+- **Gene-families-uniref_filtered_top_50_heatmap_GLlblMetag.png** (heatmap of the top 50 gene families, after filtering out low-abundance features)
+- **Gene-families-KO_unfiltered_top_50_heatmap_GLlblMetag.png** (heatmap of the top 50 KO terms, without filtering)
+- **Gene-families-KO_filtered_top_50_heatmap_GLlblMetag.png** (heatmap of the top 50 KO terms, after filtering out low-abundance features)
+- **Pathway-abundances_unfiltered_top_50_heatmap_GLlblMetag.png** (heatmap of the top 50 pathways, without filtering)
+- **Pathway-abundances_filtered_top_50_heatmap_GLlblMetag.png** (heatmap of the top 50 pathways, after filtering out low-abundance features)
+
+#### 12i. Humann Feature Decontamination
+
+> Note: The decontaminated feature tables (`*_decontam_species_table_*`) and their heatmaps are only generated if one or more contaminants are detected.
+
+```R
+metadata_table <- "/path/to/sample/metadata"
+uniref_table_file <- "Gene-families-uniref_filtered_GLlblMetag.tsv"
+KO_table_file <- "Gene-families-KO_filtered_GLlblMetag.tsv"
+pathway_table_file <- "Pathway-abundances_filtered_GLlblMetag.tsv"
+
+# Gene-families-uniref
+feature_decontam(metadata_file = metadata_table, 
+                feature_table_file = uniref_table_file, 
+                feature_column = "Uniref90", 
+                samples_column = "sample_id",
+                prevalence_column = "NTC", 
+                ntc_name = "true", 
+                frequency_column = "concentration", 
+                threshold = 0.5, 
+                classification_method = "Gene-families-uniref", 
+                output_prefix = "Gene-families-uniref", 
+                assay_suffix = "_GLlblMetag")
+
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Gene-families-uniref_decontam_species_table_GLlblMetag.tsv", 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Gene-families-uniref_decontam", 
+             assay_suffix = "_GLlblMetag", 
+             custom_palette = custom_palette)
+
+# Gene-families-KO
+feature_decontam(metadata_file = metadata_table, 
+                feature_table_file = KO_table_file, 
+                feature_column = "KO", 
+                samples_column = "sample_id",
+                prevalence_column = "NTC", 
+                ntc_name = "true", 
+                frequency_column = "concentration", 
+                threshold = 0.5, 
+                classification_method = "Gene-families-KO", 
+                output_prefix = "Gene-families-KO", 
+                assay_suffix = "_GLlblMetag")
+
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Gene-families-KO_decontam_species_table_GLlblMetag.tsv", 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Gene-families-KO_decontam", 
+             assay_suffix = "_GLlblMetag", 
+             custom_palette = custom_palette)
+
+# Pathway-abundances
+feature_decontam(metadata_file = metadata_table, 
+                feature_table_file = pathway_table_file, 
+                feature_column = "Pathway", 
+                samples_column = "sample_id",
+                prevalence_column = "NTC", 
+                ntc_name = "true", 
+                frequency_column = "concentration", 
+                threshold = 0.5, 
+                classification_method = "Pathway-abundances", 
+                output_prefix = "Pathway-abundances", 
+                assay_suffix = "_GLlblMetag")
+
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Pathway-abundances_decontam_species_table_GLlblMetag.tsv", 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Pathway-abundances_decontam", 
+             assay_suffix = "_GLlblMetag", 
+             custom_palette = custom_palette)
+```
+
+**Custom Functions Used:**
+- [feature_decontam()](#feature_decontam)
+- [make_heatmap()](#make_heatmap)
+
+**Parameter Definitions:**
+
+- `metadata_table` - path to a file with samples as rows and columns describing each sample
+- `feature_table_file` - path to a tab-separated feature table, i.e., a species/functions 
+                          table with species/functions as the first column and samples as the remaining columns.
+
+**Input Data:**
+
+- `Gene-families-uniref_filtered_GLlblMetag.tsv` (filtered gene-family abundances table, output from [Step 12g](#12g-filter-humann-output)) 
+- `Gene-families-KO_filtered_GLlblMetag.tsv` (filtered KO term abundances table, output from [Step 12g](#12g-filter-humann-output)) 
+- `Pathway-abundances_filtered_GLlblMetag.tsv` (filtered Pathway abundances table, output from [Step 12g](#12g-filter-humann-output)) 
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
+
+**Output Data:**
+
+- **Gene-families-uniref_decontam_results_GLlblMetag.tsv** (decontam's result table for gene-family abundances, output from [feature_decontam()](#feature_decontam))
+- **Gene-families-uniref_decontam_species_table_GLlblMetag.tsv** (decontaminated gene-family abundances table, output from [feature_decontam()](#feature_decontam))
+- **Gene-families-uniref_decontam_species_heatmap_GLlblMetag.png** (heatmap of decontaminated gene-family abundances, output from [make_heatmap()](#make_heatmap))
+- **Gene-families-KO_decontam_results_GLlblMetag.tsv** (decontam's result table for KO term abundances, output from [feature_decontam()](#feature_decontam))
+- **Gene-families-KO_decontam_species_table_GLlblMetag.tsv** (decontaminated KO term abundances table, output from [feature_decontam()](#feature_decontam))
+- **Gene-families-KO_decontam_species_heatmap_GLlblMetag.png** (heatmap of decontaminated KO term abundances, output from [make_heatmap()](#make_heatmap))
+- **Pathway-abundances_decontam_results_GLlblMetag.tsv** (decontam's result table for pathway abundances, output from [feature_decontam()](#feature_decontam))
+- **Pathway-abundances_decontam_species_table_GLlblMetag.tsv** (decontaminated pathway abundances table, output from [feature_decontam()](#feature_decontam))
+- **Pathway-abundances_decontam_species_heatmap_GLlblMetag.png** (heatmap of decontaminated pathway abundances, output from [make_heatmap()](#make_heatmap))
+
+<br>
+
 ---
 
 ## Assembly-based Processing
 
 
-### 12. Sample Assembly
+### 13. Sample Assembly
 
 ```bash
 flye --meta \
@@ -2659,8 +3124,7 @@ mv sample/flye.log sample-assembly.log
 **Input Data**
 
 - sample_decontam_GLlblMetag.fastq.gz or sample_HostRm_GLlblMetag.fastq.gz (filtered and trimmed sample reads with both 
-    contaminants and human reads (and optionally host reads) removed, gzipped fasta file, 
-    output from [Step 7e](#7e-generate-decontaminated-read-files) or [Step 8b](#8b-remove-host-reads))
+    contaminants and human reads (and optionally host reads) removed, output from [Step 7e](#7e-generate-decontaminated-read-files) or [Step 8b](#8b-remove-host-reads))
 
 **Output Data**
 
@@ -2671,7 +3135,7 @@ mv sample/flye.log sample-assembly.log
 
 ---
 
-### 13. Polish Assembly
+### 14. Polish Assembly
 
 ```bash
 medaka_consensus -t NumberOfThreads \
@@ -2692,9 +3156,8 @@ mv sample/consensus.fasta sample_polished.fasta
 **Input Data:**
 
 - sample_decontam_GLlblMetag.fastq.gz or sample_HostRm_GLlblMetag.fastq.gz (filtered and trimmed sample reads with both 
-    contaminants and human reads (and optionally host reads) removed, gzipped fasta file, 
-    output from [Step 7e](#7e-generate-decontaminated-read-files) or [Step 8b](#8b-remove-host-reads))
-- /path/to/assemblies/sample-assembly.fasta (sample assembly, output from [Step 12](#12-sample-assembly))
+    contaminants and human reads (and optionally host reads) removed, output from [Step 7e](#7e-generate-decontaminated-read-files) or [Step 8b](#8b-remove-host-reads))
+- /path/to/assemblies/sample-assembly.fasta (sample assembly, output from [Step 13](#13-sample-assembly))
 
 **Output Data:**
 
@@ -2705,9 +3168,9 @@ mv sample/consensus.fasta sample_polished.fasta
 
 ---
 
-### 14. Rename Contigs and Summarize Assemblies
+### 15. Rename Contigs and Summarize Assemblies
 
-#### 14a. Rename Contig Headers
+#### 15a. Rename Contig Headers
 
 ```bash
 bit-rename-fasta-headers -i sample_polished.fasta \
@@ -2724,14 +3187,14 @@ bit-rename-fasta-headers -i sample_polished.fasta \
 
 **Input Data:**
 
-- sample_polished.fasta (polished assembly file from [Step 13](#13-polish-assembly))
+- sample_polished.fasta (polished assembly file from [Step 14](#14-polish-assembly))
 
 **Output files:**
 
 - **sample-assembly_GLlblMetag.fasta** (contig-renamed assembly file)
 
 
-#### 14b. Summarize Assemblies
+#### 15b. Summarize Assemblies
 
 ```bash
 bit-summarize-assembly -o assembly-summaries_GLlblMetag.tsv \
@@ -2753,7 +3216,7 @@ done
 
 **Input Data:**
 
-- *-assembly_GLlblMetag.fasta (contig-renamed assembly files from [Step 14a](#14a-rename-contig-headers))
+- *-assembly_GLlblMetag.fasta (contig-renamed assembly files from [Step 15a](#15a-rename-contig-headers))
 
 **Output files:**
 
@@ -2764,9 +3227,9 @@ done
 
 ---
 
-### 15. Gene Prediction
+### 16. Gene Prediction
 
-#### 15a. Generate Gene Predictions
+#### 16a. Generate Gene Predictions
 
 ```bash
 prodigal -a sample-genes.faa \
@@ -2792,7 +3255,7 @@ prodigal -a sample-genes.faa \
 
 **Input Data:**
 
-- sample-assembly_GLlblMetag.fasta (contig-renamed assembly file from [Step 14a](#14a-rename-contig-headers))
+- sample-assembly_GLlblMetag.fasta (contig-renamed assembly file from [Step 15a](#15a-rename-contig-headers))
 
 **Output Data:**
 
@@ -2802,7 +3265,7 @@ prodigal -a sample-genes.faa \
 
 <br>
 
-#### 15b. Remove Line Wraps In Gene Prediction Output
+#### 16b. Remove Line Wraps In Gene Prediction Output
 
 ```bash
 bit-remove-wraps sample-genes.faa > sample-genes.faa.tmp 2> /dev/null
@@ -2814,8 +3277,8 @@ mv sample-genes.fasta.tmp sample-genes_GLlblMetag.fasta
 
 **Input Data:**
 
-- sample-genes.faa (gene-calls amino-acid fasta file, output from [Step 15a](#15a-generate-gene-predictions))
-- sample-genes.fasta (gene-calls nucleotide fasta file, output from [Step 15a](#15a-generate-gene-predictions))
+- sample-genes.faa (gene-calls amino-acid fasta file, output from [Step 16a](#16a-generate-gene-predictions))
+- sample-genes.fasta (gene-calls nucleotide fasta file, output from [Step 16a](#16a-generate-gene-predictions))
 
 **Output Data:**
 
@@ -2826,7 +3289,7 @@ mv sample-genes.fasta.tmp sample-genes_GLlblMetag.fasta
 
 ---
 
-### 16. Functional Annotation
+### 17. Functional Annotation
 
 > **Note:**  
 > The annotation process overwrites the same temporary directory by default. When running multiple 
@@ -2834,7 +3297,7 @@ processes at a time, it is necessary to specify a specific temporary directory w
 `--tmp-dir` argument as shown below.
 
 
-#### 16a. Download Reference Database of HMM Models
+#### 17a. Download Reference Database of HMM Models
 
 > **Note:** This step only needs to be done once.
 
@@ -2845,7 +3308,7 @@ tar -xzvf profiles.tar.gz
 gunzip ko_list.gz 
 ```
 
-#### 16b. Run KEGG Annotation
+#### 17b. Run KEGG Annotation
 
 ```bash
 exec_annotation -p profiles/ \
@@ -2872,16 +3335,16 @@ exec_annotation -p profiles/ \
 
 **Input Data:**
 
-- sample-genes_GLlblMetag.faa (amino-acid fasta file, output from [Step 15b](#15b-remove-line-wraps-in-gene-prediction-output))
-- profiles/ (reference directory holding the KO HMMs, downloaded in [Step 16a](#16a-download-reference-database-of-hmm-models))
-- ko_list (reference list of KOs to scan for, downloaded in [Step 16a](#16a-download-reference-database-of-hmm-models))
+- sample-genes_GLlblMetag.faa (amino-acid fasta file, output from [Step 16b](#16b-remove-line-wraps-in-gene-prediction-output))
+- profiles/ (reference directory holding the KO HMMs, downloaded in [Step 17a](#17a-download-reference-database-of-hmm-models))
+- ko_list (reference list of KOs to scan for, downloaded in [Step 17a](#17a-download-reference-database-of-hmm-models))
 
 **Output Data:**
 
 - sample-KO-tab.tmp (table of KO annotations assigned to gene IDs)
 
 
-#### 16c. Filter KO Outputs
+#### 17c. Filter KO Outputs
 *Filter KO outputs to retain only those passing the KO-specific score and top hits.*
 
 ```bash
@@ -2899,7 +3362,7 @@ rm -rf sample-tmp-KO/ sample-KO-annots.tmp
 
 **Input Data:**
 
-- sample-KO-tab.tmp (table of KO annotations assigned to gene IDs, output from [Step 16b](#16b-run-kegg-annotation))
+- sample-KO-tab.tmp (table of KO annotations assigned to gene IDs, output from [Step 17b](#17b-run-kegg-annotation))
 
 **Output Data:**
 
@@ -2909,9 +3372,9 @@ rm -rf sample-tmp-KO/ sample-KO-annots.tmp
 
 ---
 
-### 17. Taxonomic Classification
+### 18. Taxonomic Classification
 
-#### 17a. Pull and Unpack Pre-built Reference DB
+#### 18a. Pull and Unpack Pre-built Reference DB
 
 > **Note:** This step only needs to be done once.
 
@@ -2920,7 +3383,7 @@ wget tbb.bio.uu.nl/bastiaan/CAT_prepare/CAT_prepare_20200618.tar.gz
 tar -xvzf CAT_prepare_20200618.tar.gz
 ```
 
-#### 17b. Run Taxonomic Classification
+#### 18b. Run Taxonomic Classification
 
 ```bash
 CAT contigs -c sample-assembly_GLlblMetag.fasta \
@@ -2950,10 +3413,10 @@ CAT contigs -c sample-assembly_GLlblMetag.fasta \
 
 **Input Data:**
 
-- CAT_prepare_20200618/2020-06-18_database/ (directory holding the CAT reference sequence database, output from [Step 17a](#17a-pull-and-unpack-pre-built-reference-db))
-- CAT_prepare_20200618/2020-06-18_taxonomy/ (directory holding the CAT reference taxonomy database, output from [Step 17a](#17a-pull-and-unpack-pre-built-reference-db))
-- sample-assembly_GLlblMetag.fasta (contig-renamed assembly file from [Step 14a](#14a-rename-contig-headers))
-- sample-genes_GLlblMetag.faa (amino-acid fasta file, output from [Step 15b](#15b-remove-line-wraps-in-gene-prediction-output))
+- CAT_prepare_20200618/2020-06-18_database/ (directory holding the CAT reference sequence database, output from [Step 18a](#18a-pull-and-unpack-pre-built-reference-db))
+- CAT_prepare_20200618/2020-06-18_taxonomy/ (directory holding the CAT reference taxonomy database, output from [Step 18a](#18a-pull-and-unpack-pre-built-reference-db))
+- sample-assembly_GLlblMetag.fasta (contig-renamed assembly file from [Step 15a](#15a-rename-contig-headers))
+- sample-genes_GLlblMetag.faa (amino-acid fasta file, output from [Step 16b](#16b-remove-line-wraps-in-gene-prediction-output))
 
 **Output Data:**
 
@@ -2961,7 +3424,7 @@ CAT contigs -c sample-assembly_GLlblMetag.fasta \
 - sample-tax-out.tmp.contig2classification.txt (contig taxonomy file)
 
 
-#### 17c. Add Taxonomy Info From Taxids To Genes
+#### 18c. Add Taxonomy Info From Taxids To Genes
 
 ```bash
 CAT add_names -i sample-tax-out.tmp.ORF2LCA.txt \
@@ -2981,15 +3444,15 @@ CAT add_names -i sample-tax-out.tmp.ORF2LCA.txt \
 
 **Input Data:**
 
-- sample-tax-out.tmp.ORF2LCA.txt (gene-calls taxonomy file, output from [Step 17b](#17b-run-taxonomic-classification))
-- CAT_prepare_20200618/2020-06-18_taxonomy/ (directory holding the CAT reference taxonomy database, output from [Step 17a](#17a-pull-and-unpack-pre-built-reference-db))
+- sample-tax-out.tmp.ORF2LCA.txt (gene-calls taxonomy file, output from [Step 18b](#18b-run-taxonomic-classification))
+- CAT_prepare_20200618/2020-06-18_taxonomy/ (directory holding the CAT reference taxonomy database, output from [Step 18a](#18a-pull-and-unpack-pre-built-reference-db))
 
 **Output Data:**
 
 - sample-gene-tax-out.tmp (gene-calls taxonomy file with lineage info added)
 
 
-#### 17d. Add Taxonomy Info From Taxids To Contigs
+#### 18d. Add Taxonomy Info From Taxids To Contigs
 
 ```bash
 CAT add_names -i sample-tax-out.tmp.contig2classification.txt \
@@ -3009,15 +3472,15 @@ CAT add_names -i sample-tax-out.tmp.contig2classification.txt \
 
 **Input Data:**
 
-- sample-tax-out.tmp.contig2classification.txt (contig taxonomy file, output from [Step 17b](#17b-run-taxonomic-classification))
-- CAT_prepare_20200618/2020-06-18_taxonomy/ (directory holding the CAT reference taxonomy database, output from [Step 17a](#17a-pull-and-unpack-pre-built-reference-db))
+- sample-tax-out.tmp.contig2classification.txt (contig taxonomy file, output from [Step 18b](#18b-run-taxonomic-classification))
+- CAT_prepare_20200618/2020-06-18_taxonomy/ (directory holding the CAT reference taxonomy database, output from [Step 18a](#18a-pull-and-unpack-pre-built-reference-db))
 
 **Output Data:**
 
 - sample-contig-tax-out.tmp (contig taxonomy file with lineage info added)
 
 
-#### 17e. Format Gene-level Output With awk and sed
+#### 18e. Format Gene-level Output With awk and sed
 
 ```bash
 awk -F $'\t' ' BEGIN { OFS=FS } { if ( $3 == "lineage" ) { print $1,$3,$5,$6,$7,$8,$9,$10,$11 } \
@@ -3030,14 +3493,14 @@ awk -F $'\t' ' BEGIN { OFS=FS } { if ( $3 == "lineage" ) { print $1,$3,$5,$6,$7,
 
 **Input Data:**
 
-- sample-gene-tax-out.tmp (gene-calls taxonomy file with lineage info added, output from [Step 17c](#17c-add-taxonomy-info-from-taxids-to-genes))
+- sample-gene-tax-out.tmp (gene-calls taxonomy file with lineage info added, output from [Step 18c](#18c-add-taxonomy-info-from-taxids-to-genes))
 
 **Output Data:**
 
 - sample-gene-tax.tsv (reformatted gene-calls taxonomy file with lineage info)
 
 
-#### 17f. Format Contig-level Output With awk and sed
+#### 18f. Format Contig-level Output With awk and sed
 
 ```bash
 awk -F $'\t' ' BEGIN { OFS=FS } { if ( $2 == "classification" ) { print $1,$4,$6,$7,$8,$9,$10,$11,$12 } \
@@ -3052,7 +3515,7 @@ rm sample*.tmp*
 
 **Input Data:**
 
-- sample-contig-tax-out.tmp (contig taxonomy file with lineage info added, output from [Step 17d](#17d-add-taxonomy-info-from-taxids-to-contigs))
+- sample-contig-tax-out.tmp (contig taxonomy file with lineage info added, output from [Step 18d](#18d-add-taxonomy-info-from-taxids-to-contigs))
 
 **Output Data:**
 
@@ -3062,9 +3525,9 @@ rm sample*.tmp*
 
 ---
 
-### 18. Read-Mapping
+### 19. Read-Mapping
 
-#### 18a. Align Reads to Sample Assembly
+#### 19a. Align Reads to Sample Assembly
 
 ```bash
 minimap2 -a \
@@ -3087,10 +3550,9 @@ minimap2 -a \
 
 **Input Data**
 
-- sample-assembly_GLlblMetag.fasta (contig-renamed assembly file, output from [Step 14a](#14a-rename-contig-headers))
+- sample-assembly_GLlblMetag.fasta (contig-renamed assembly file, output from [Step 15a](#15a-rename-contig-headers))
 - sample_decontam_GLlblMetag.fastq.gz or sample_HostRm_GLlblMetag.fastq.gz (filtered and trimmed sample reads with both 
-    contaminants and human reads (and optionally host reads) removed, gzipped fasta file, 
-    output from [Step 7e](#7e-generate-decontaminated-read-files) or [Step 8b](#8b-remove-host-reads))
+    contaminants and human reads (and optionally host reads) removed, output from [Step 7e](#7e-generate-decontaminated-read-files) or [Step 8b](#8b-remove-host-reads))
 
 **Output Data**
 
@@ -3098,7 +3560,7 @@ minimap2 -a \
 - **sample-mapping-info_GLlblMetag.txt** (read mapping information)
 
 
-#### 18b. Sort Assembly Alignments
+#### 19b. Sort Assembly Alignments
 
 ```bash
 # Sort Sam, convert to bam and create index
@@ -3117,7 +3579,7 @@ samtools sort --threads NumberOfThreads \
 
 **Input Data:**
 
-- sample.sam (reads aligned to sample assembly, output from [Step 18a](#18a-align-reads-to-sample-assembly))
+- sample.sam (reads aligned to sample assembly, output from [Step 19a](#19a-align-reads-to-sample-assembly))
 
 **Output Data:**
 
@@ -3127,13 +3589,13 @@ samtools sort --threads NumberOfThreads \
 
 ---
 
-### 19. Get Coverage Information and Filter Based On Detection
+### 20. Get Coverage Information and Filter Based On Detection
 > **Note:**  
 > “Detection” is a measure of what proportion of a reference sequence recruited reads 
-(see the discussion of detection [here](http://merenlab.org/2017/05/08/anvio-views/#detection)). 
+(see the discussion of detection [here](https://merenlab.org/2017/05/08/anvio-views/#detection)). 
 Filtering based on detection is one way of helping to mitigate non-specific read-recruitment.
 
-#### 19a. Filter Coverage Levels Based On Detection
+#### 20a. Filter Coverage Levels Based On Detection
 
 ```bash
 # pileup.sh comes from the bbduk.sh package
@@ -3152,8 +3614,8 @@ pileup.sh -in sample_GLlblMetag.bam \
 
 **Input Data:**
 
-- sample_GLlblMetag.bam (sorted mapping to sample assembly BAM file, output from [Step 18b](#18b-sort-assembly-alignments))
-- sample-genes.fasta (gene-calls nucleotide fasta file, output from [Step 15a](#15a-generate-gene-predictions))
+- sample_GLlblMetag.bam (sorted mapping to sample assembly BAM file, output from [Step 19b](#19b-sort-assembly-alignments))
+- sample-genes.fasta (gene-calls nucleotide fasta file, output from [Step 16a](#16a-generate-gene-predictions))
 
 
 **Output Data:**
@@ -3162,7 +3624,7 @@ pileup.sh -in sample_GLlblMetag.bam \
 - sample-contig-cov-and-det.tmp (contig-coverage tsv file)
 
 
-#### 19b. Filter Gene and Contig Coverage Based On Detection
+#### 20b. Filter Gene and Contig Coverage Based On Detection
 
 > *The following commands filter gene and contig coverage tsv files to only keep genes and contigs with at least 50% detection (as defined above) then parse the tables to retain only gene IDs and respective coverage.*
 
@@ -3187,8 +3649,8 @@ rm sample-*.tmp
 
 **Input Data:**
 
-- sample-gene-cov-and-det.tmp (temporary gene-coverage tsv file, output from [Step 19a](#19a-filter-coverage-levels-based-on-detection))
-- sample-contig-cov-and-det.tmp (temporary contig-coverage tsv file, output from [Step 19a](#19a-filter-coverage-levels-based-on-detection))
+- sample-gene-cov-and-det.tmp (temporary gene-coverage tsv file, output from [Step 20a](#20a-filter-coverage-levels-based-on-detection))
+- sample-contig-cov-and-det.tmp (temporary contig-coverage tsv file, output from [Step 20a](#20a-filter-coverage-levels-based-on-detection))
 
 **Output Data:**
 
@@ -3199,7 +3661,7 @@ rm sample-*.tmp
 
 ---
 
-### 20. Combine Gene-level Coverage, Taxonomy, and Functional Annotations For Each Sample
+### 21. Combine Gene-level Coverage, Taxonomy, and Functional Annotations For Each Sample
 > **Note:**  
 > Just uses `paste`, `sed`, and `awk` standard Unix commands to combine gene-level coverage, taxonomy, and functional annotations into one table for each sample.  
 
@@ -3222,9 +3684,9 @@ rm sample*tmp sample-gene-coverages.tsv sample-annotations.tsv sample-gene-tax.t
 
 **Input Data:**
 
-- sample-gene-coverages.tsv (table with gene-level coverages, output from [Step 19b](#19b-filter-gene-and-contig-coverage-based-on-detection))
-- sample-annotations.tsv (table of KO annotations assigned to gene IDs, output from [Step 16c](#16c-filter-ko-outputs))
-- sample-gene-tax.tsv (reformatted gene-calls taxonomy file with lineage info, output from [Step 17e](#17e-format-gene-level-output-with-awk-and-sed))
+- sample-gene-coverages.tsv (table with gene-level coverages, output from [Step 20b](#20b-filter-gene-and-contig-coverage-based-on-detection))
+- sample-annotations.tsv (table of KO annotations assigned to gene IDs, output from [Step 17c](#17c-filter-ko-outputs))
+- sample-gene-tax.tsv (reformatted gene-calls taxonomy file with lineage info, output from [Step 18e](#18e-format-gene-level-output-with-awk-and-sed))
 
 
 **Output Data:**
@@ -3235,7 +3697,7 @@ rm sample*tmp sample-gene-coverages.tsv sample-annotations.tsv sample-gene-tax.t
 
 ---
 
-### 21. Combine Contig-level Coverage and Taxonomy For Each Sample
+### 22. Combine Contig-level Coverage and Taxonomy For Each Sample
 
 > **Note:**  
 > Just uses `paste`, `sed`, and `awk` standard Unix commands to combine contig-level coverage and taxonomy into one table for each sample.
@@ -3257,8 +3719,8 @@ rm sample*tmp sample-contig-coverages.tsv sample-contig-tax.tsv
 
 **Input Data:**
 
-- sample-contig-coverages.tsv (table with contig-level coverages, output from [Step 19b](#19b-filter-gene-and-contig-coverage-based-on-detection))
-- sample-contig-tax.tsv (reformatted contig taxonomy file with lineage info, output from [Step 17f](#17f-format-contig-level-output-with-awk-and-sed))
+- sample-contig-coverages.tsv (table with contig-level coverages, output from [Step 20b](#20b-filter-gene-and-contig-coverage-based-on-detection))
+- sample-contig-tax.tsv (reformatted contig taxonomy file with lineage info, output from [Step 18f](#18f-format-contig-level-output-with-awk-and-sed))
 
 
 **Output Data:**
@@ -3269,7 +3731,7 @@ rm sample*tmp sample-contig-coverages.tsv sample-contig-tax.tsv
 
 ---
 
-### 22. Generate Normalized, Gene- and Contig-level Coverage Summary Tables of KO-annotations and Taxonomy Across Samples
+### 23. Generate Normalized, Gene- and Contig-level Coverage Summary Tables of KO-annotations and Taxonomy Across Samples
 
 > **Note:**  
 > * To combine across samples to generate these summary tables, we need the same "units". This is done for annotations 
@@ -3281,7 +3743,7 @@ by the length of the gene). These have been normalized by making the total cover
 each individual gene-level coverage its proportion of that 1,000,000 total. So basically percent, but out of 1,000,000 
 instead of 100 to make the numbers more friendly. 
 
-#### 22a. Generate Gene-level Coverage Summary Tables
+#### 23a. Generate Gene-level Coverage Summary Tables
 
 ```bash
 bit-GL-combine-KO-and-tax-tables *-gene-coverage-annotation-and-tax_GLlblMetag.tsv \
@@ -3302,7 +3764,7 @@ mv "Combined-gene-level-taxonomy-coverages.tsv Combined-gene-level-taxonomy-cove
 
 **Input Data:**
 
-- *-gene-coverage-annotation-and-tax_GLlblMetag.tsv (tables with combined gene coverage, annotation, and taxonomy info generated for individual samples, output from [Step 20](#20-combine-gene-level-coverage-taxonomy-and-functional-annotations-for-each-sample))
+- *-gene-coverage-annotation-and-tax_GLlblMetag.tsv (tables with combined gene coverage, annotation, and taxonomy info generated for individual samples, output from [Step 21](#21-combine-gene-level-coverage-taxonomy-and-functional-annotations-for-each-sample))
 
 **Output Data:**
 
@@ -3311,7 +3773,7 @@ mv "Combined-gene-level-taxonomy-coverages.tsv Combined-gene-level-taxonomy-cove
 - **Combined-gene-level-KO-function-coverages_GLlblMetag.tsv** (table with all samples combined based on KO annotations)
 - **Combined-gene-level-taxonomy-coverages_GLlblMetag.tsv** (table with all samples combined based on gene-level taxonomic classifications)
 
-#### 22b. Generate Contig-level Coverage Summary Tables
+#### 23b. Generate Contig-level Coverage Summary Tables
 
 ```bash
 bit-GL-combine-contig-tax-tables *-contig-coverage-and-tax_GLlblMetag.tsv -o Combined
@@ -3325,7 +3787,7 @@ bit-GL-combine-contig-tax-tables *-contig-coverage-and-tax_GLlblMetag.tsv -o Com
 
 **Input Data:**
 
-- *-contig-coverage-and-tax_GLlblMetag.tsv (tables with combined contig coverage and taxonomy info generated for individual samples, output from [Step 21](#21-combine-contig-level-coverage-and-taxonomy-for-each-sample))
+- *-contig-coverage-and-tax_GLlblMetag.tsv (tables with combined contig coverage and taxonomy info generated for individual samples, output from [Step 22](#22-combine-contig-level-coverage-and-taxonomy-for-each-sample))
 
 **Output Data:**
 
@@ -3336,9 +3798,9 @@ bit-GL-combine-contig-tax-tables *-contig-coverage-and-tax_GLlblMetag.tsv -o Com
 
 ---
 
-### 23. **M**etagenome-**A**ssembled **G**enome (MAG) Recovery
+### 24. **M**etagenome-**A**ssembled **G**enome (MAG) Recovery
 
-#### 23a. Bin Contigs
+#### 24a. Bin Contigs
 
 ```bash
 jgi_summarize_bam_contig_depths --outputDepth sample-metabat-assembly-depth_GLlblMetag.tsv \
@@ -3379,8 +3841,8 @@ zip -r sample-bins_GLlblMetag.zip sample-bins
 
 **Input Data:**
 
-- sample-assembly_GLlblMetag.fasta (contig-renamed assembly file from [Step 14a](#14a-rename-contig-headers))
-- sample_GLlblMetag.bam (sorted mapping to sample assembly BAM file, output from [Step 18b](#18b-sort-assembly-alignments))
+- sample-assembly_GLlblMetag.fasta (contig-renamed assembly file from [Step 15a](#15a-rename-contig-headers))
+- sample_GLlblMetag.bam (sorted mapping to sample assembly BAM file, output from [Step 19b](#19b-sort-assembly-alignments))
 
 **Output Data:**
 
@@ -3388,7 +3850,7 @@ zip -r sample-bins_GLlblMetag.zip sample-bins
 - sample-bins/sample-bin\*.fasta (fasta files of recovered bins)
 - **sample-bins_GLlblMetag.zip** (zip file containing fasta files of recovered bins)
 
-#### 23b. Bin Quality Assessment
+#### 24b. Bin Quality Assessment
 > Utilizes the default `checkm` database [checkm_data_2015_01_16.tar.gz](https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz).
 
 ```bash
@@ -3410,14 +3872,14 @@ checkm lineage_wf -f bins-overview_GLlblMetag.tsv \
 
 **Input Data:**
 
-- sample-bins/sample-bin\*.fasta (fasta files of recovered bins, output from [Step 23a](#23a-bin-contigs))
+- sample-bins/sample-bin\*.fasta (fasta files of recovered bins, output from [Step 24a](#24a-bin-contigs))
 
 **Output Data:**
 
 - **bins-overview_GLlblMetag.tsv** (tab-delimited file with quality estimates per bin)
 - checkm-output-dir/ (directory holding detailed checkm outputs)
 
-#### 23c. Filter MAGs
+#### 24c. Filter MAGs
 
 ```bash
 cat <( head -n 1 bins-overview_GLlblMetag.tsv ) \
@@ -3444,7 +3906,7 @@ done
 
 **Input Data:**
 
-- bins-overview_GLlblMetag.tsv (tab-delimited file with quality estimates per bin from [Step 23b](#23b-bin-quality-assessment))
+- bins-overview_GLlblMetag.tsv (tab-delimited file with quality estimates per bin from [Step 24b](#24b-bin-quality-assessment))
 
 **Output Data:**
 
@@ -3453,7 +3915,7 @@ done
 - **\*-MAGs.zip** (zip files containing directories of high-quality MAGs)
 
 
-#### 23d. MAG Taxonomic Classification
+#### 24d. MAG Taxonomic Classification
 > Uses default `gtdbtk` database setup with program's `download.sh` command.
 
 ```bash
@@ -3473,13 +3935,13 @@ gtdbtk classify_wf --genome_dir MAGs/ \
 
 **Input Data:**
 
-- MAGs/\*.fasta (directory holding high-quality MAGs, output from [Step 23c](#23c-filter-mags))
+- MAGs/\*.fasta (directory holding high-quality MAGs, output from [Step 24c](#24c-filter-mags))
 
 **Output Data:**
 
 - gtdbtk-output-dir/gtdbtk.\*.summary.tsv (files with assigned taxonomy and info)
 
-#### 23e. Generate Overview Table Of All MAGs
+#### 24e. Generate Overview Table Of All MAGs
 
 ```bash
 # combine summaries
@@ -3519,10 +3981,10 @@ cat MAGs-overview-header.tmp MAGs-overview-sorted.tmp \
 
 **Input Data:**
 
-- assembly-summaries_GLlblMetag.tsv (table of assembly summary statistics, output from [Step 14b](#14b-summarize-assemblies))
-- MAGs/\*.fasta (directory holding high-quality MAGs, output from [Step 23c](#23c-filter-mags))
-- checkm-MAGs-overview.tsv (tab-delimited file with quality estimates per MAG, output from [Step 23c](#23c-filter-mags))
-- gtdbtk-output-dir/gtdbtk.\*.summary.tsv (directory of files with assigned taxonomy and info, output from [Step 23d](#23d-mag-taxonomic-classification))
+- assembly-summaries_GLlblMetag.tsv (table of assembly summary statistics, output from [Step 15b](#15b-summarize-assemblies))
+- MAGs/\*.fasta (directory holding high-quality MAGs, output from [Step 24c](#24c-filter-mags))
+- checkm-MAGs-overview.tsv (tab-delimited file with quality estimates per MAG, output from [Step 24c](#24c-filter-mags))
+- gtdbtk-output-dir/gtdbtk.\*.summary.tsv (directory of files with assigned taxonomy and info, output from [Step 24d](#24d-mag-taxonomic-classification))
 
 **Output Data:**
 
@@ -3533,9 +3995,9 @@ cat MAGs-overview-header.tmp MAGs-overview-sorted.tmp \
 
 ---
 
-### 24. Generate MAG-level Functional Summary Overview
+### 25. Generate MAG-level Functional Summary Overview
 
-#### 24a. Get KO Annotations Per MAG
+#### 25a. Get KO Annotations Per MAG
 > This utilizes the helper script [`parse-MAG-annots.py`](https://github.com/nasa/GeneLab_Metagenomics_Workflow/blob/DEV/bin/parse-MAG-annots.py) 
 
 ```bash
@@ -3566,15 +4028,15 @@ done
 
 **Input Data:**
 
-- \*-gene-coverage-annotation-and-tax_GLlblMetag.tsv (tables with combined gene coverage, annotation, and taxonomy info generated for individual samples, output from [Step 20](#20-combine-gene-level-coverage-taxonomy-and-functional-annotations-for-each-sample)
-- MAGs/\*.fasta (directory holding high-quality MAGs, output from [Step 23c](#23c-filter-mags))
+- \*-gene-coverage-annotation-and-tax_GLlblMetag.tsv (tables with combined gene coverage, annotation, and taxonomy info generated for individual samples, output from [Step 21](#21-combine-gene-level-coverage-taxonomy-and-functional-annotations-for-each-sample))
+- MAGs/\*.fasta (directory holding high-quality MAGs, output from [Step 24c](#24c-filter-mags))
 
 **Output Data:**
 
 - **MAG-level-KO-annotations_GLlblMetag.tsv** (tab-delimited table holding MAGs and their KO annotations)
 
 
-#### 24b. Summarize KO Annotations With KEGG-Decoder
+#### 25b. Summarize KO Annotations With KEGG-Decoder
 
 ```bash
 KEGG-decoder -v interactive \
@@ -3590,7 +4052,7 @@ KEGG-decoder -v interactive \
 
 **Input Data:**
 
-- MAG-level-KO-annotations_GLlblMetag.tsv (tab-delimited table holding MAGs and their KO annotations, output from [Step 24a](#24a-get-ko-annotations-per-mag))
+- MAG-level-KO-annotations_GLlblMetag.tsv (tab-delimited table holding MAGs and their KO annotations, output from [Step 25a](#25a-get-ko-annotations-per-mag))
 
 **Output Data:**
 
@@ -3602,9 +4064,9 @@ KEGG-decoder -v interactive \
 
 ---
 
-### 25. Filtering, Decontamination, and Visualization of Contig- and Gene-taxonomy and Gene-function Outputs
+### 26. Filtering, Decontamination, and Visualization of Contig- and Gene-taxonomy and Gene-function Outputs
 
-#### 25a. Gene-level Taxonomy Heatmaps
+#### 26a. Gene-level Taxonomy Heatmaps
 
 ```R
 assembly_table <- "Combined-gene-level-taxonomy-coverages-CPM_GLlblMetag.tsv"
@@ -3649,9 +4111,9 @@ make_heatmap(metadata_table_file = metadata_table,
 - [make_heatmap()](#make_heatmap)
 
 **Input data:**
-- assembly-summaries_GLlblMetag.tsv (table of assembly summary statistics, output from [Step 14b](#14b-summarize-assemblies))
+- assembly-summaries_GLlblMetag.tsv (table of assembly summary statistics, output from [Step 15b](#15b-summarize-assemblies))
 - Combined-gene-level-taxonomy-coverages-CPM_GLlblMetag.tsv (table with all samples combined based on gene-level 
-  taxonomic classifications, output from [Step 22a](#22a-generate-gene-level-coverage-summary-tables)) 
+  taxonomic classifications, output from [Step 23a](#23a-generate-gene-level-coverage-summary-tables)) 
 
 **Output data:**
 - Combined-gene-level-taxonomy_unfiltered_GLlblMetag.tsv (aggregated gene-level taxonomy table with samples in columns and species in rows)
@@ -3659,7 +4121,7 @@ make_heatmap(metadata_table_file = metadata_table,
 - **Combined-gene-level-taxonomy_unfiltered_top_50_heatmap_GLlblMetag.png** (heatmap of the top 50 gene-level taxonomy assignments, output from [make_heatmap()](#make_heatmap))
 
 
-#### 25b. Gene-level Taxonomy Feature Filtering
+#### 26b. Gene-level Taxonomy Feature Filtering
 
 ```R
 feature_table_file <- "Combined-gene-level-taxonomy_unfiltered_GLlblMetag.tsv"
@@ -3702,7 +4164,7 @@ make_heatmap(metadata_table_file = metadata_table,
 
 **Input Data:**
 
-- `Combined-gene-level-taxonomy_unfiltered_GLlblMetag.tsv`(aggregated gene taxonomy table with samples in columns and species in rows, from [Step 25a](#25a-gene-level-taxonomy-heatmaps))
+- `Combined-gene-level-taxonomy_unfiltered_GLlblMetag.tsv`(aggregated gene taxonomy table with samples in columns and species in rows, from [Step 26a](#26a-gene-level-taxonomy-heatmaps))
 - `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
 
 **Output Data:**
@@ -3711,7 +4173,7 @@ make_heatmap(metadata_table_file = metadata_table,
 - **Combined-gene-level-taxonomy_filtered_heatmap_GLlblMetag.png** (heatmap of all gene-level taxonomy assignments after filtering out non-abundant features, output from [make_heatmap()](#make_heatmap))
 - **Combined-gene-level-taxonomy_filtered_top_50_heatmap_GLlblMetag.png** (heatmap of the top 50 gene taxonomy assignments after filtering out non-abundant features, output from [make_heatmap()](#make_heatmap))
 
-#### 25c. Gene-level Taxonomy Decontamination
+#### 26c. Gene-level Taxonomy Decontamination
 
 > Note: species_table and heatmaps are only generated if 1 or more contaminants were detected
 
@@ -3751,7 +4213,7 @@ make_heatmap(metadata_table_file = metadata_table,
 
 **Input Data:**
 
-- `Combined-gene-level-taxonomy_unfiltered_GLlblMetag.tsv`(aggregated gene taxonomy table with samples in columns and species in rows, from [Step 25a](#25a-gene-level-taxonomy-heatmaps))
+- `Combined-gene-level-taxonomy_unfiltered_GLlblMetag.tsv`(aggregated gene taxonomy table with samples in columns and species in rows, from [Step 26a](#26a-gene-level-taxonomy-heatmaps))
 - `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
 
 **Output Data:**
@@ -3761,7 +4223,7 @@ make_heatmap(metadata_table_file = metadata_table,
 - **Combined-gene-level-taxonomy_decontam_heatmap_GLlblMetag.png** (heatmap of all gene-level taxonomy assignments after filtering out contaminants, output from [make_heatmap()](#make_heatmap))
 - **Combined-gene-level-taxonomy_decontam_top_50_heatmap_GLlblMetag.png** (heatmap of the top 50 gene-level taxonomy assignments after filtering out contaminants, output from [make_heatmap()](#make_heatmap))
 
-#### 25d. Gene-level KO Functions Heatmaps
+#### 26d. Gene-level KO Functions Heatmaps
 
 ```R
 assembly_table <- "Combined-gene-level-KO-function-coverages-CPM_GLlblMetag.tsv"
@@ -3809,9 +4271,9 @@ make_heatmap(metadata_table_file = metadata_table,
 
 **Input data:**
 
-- assembly-summaries_GLlblMetag.tsv (table of assembly summary statistics, output from [Step 14b](#14b-summarize-assemblies))
+- assembly-summaries_GLlblMetag.tsv (table of assembly summary statistics, output from [Step 15b](#15b-summarize-assemblies))
 - Combined-gene-level-KO-function-coverages-CPM_GLlblMetag.tsv (table with all samples combined based on KO annotations;
-  normalized to coverage per million genes covered, output from [Step 22a](#22a-generate-gene-level-coverage-summary-tables))
+  normalized to coverage per million genes covered, output from [Step 23a](#23a-generate-gene-level-coverage-summary-tables))
 - `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
 
 **Output data:**
@@ -3820,7 +4282,7 @@ make_heatmap(metadata_table_file = metadata_table,
 - **Combined-gene-level-KO-function_unfiltered_heatmap_GLlblMetag.png** (heatmap of all gene-level KO function assignments, output from [make_heatmap()](#make_heatmap))
 - **Combined-gene-level-KO-function_unfiltered_top_50_heatmap_GLlblMetag.png** (heatmap of the top 50 gene-level KO function assignments, output from [make_heatmap()](#make_heatmap))
 
-#### 25e. Gene-level KO Functions Feature Filtering
+#### 26e. Gene-level KO Functions Feature Filtering
 
 ```R
 feature_table_file <- "Combined-gene-level-KO-function_unfiltered_GLlblMetag.tsv"
@@ -3863,7 +4325,7 @@ make_heatmap(metadata_table_file = metadata_table,
 
 **Input Data:**
 
-- `Combined-gene-level-KO-function_unfiltered_GLlblMetag.tsv`(aggregated gene taxonomy table with samples in columns and species in rows, from [Step 25d](#25d-gene-level-ko-functions-heatmaps))
+- `Combined-gene-level-KO-function_unfiltered_GLlblMetag.tsv` (aggregated gene KO functions table with samples in columns and KO_ID in rows, from [Step 26d](#26d-gene-level-ko-functions-heatmaps))
 - `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
 
 **Output Data:**
@@ -3872,7 +4334,7 @@ make_heatmap(metadata_table_file = metadata_table,
 - **Combined-gene-level-KO-function_filtered_heatmap_GLlblMetag.png** (heatmap of all gene-level KO function assignments after filtering out non-abundant features, output from [make_heatmap()](#make_heatmap))
 - **Combined-gene-level-KO-function_filtered_top_50_heatmap_GLlblMetag.png** (heatmap of the top 50 gene-level KO function assignments after filtering out non-abundant features, output from [make_heatmap()](#make_heatmap))
 
-#### 25f. Gene-level KO Functions Decontamination
+#### 26f. Gene-level KO Functions Decontamination
 
 > Note: species_table and heatmaps are only generated if 1 or more contaminants were detected
 
@@ -3917,7 +4379,7 @@ make_heatmap(metadata_table_file = metadata_table,
 
 **Input Data:**
 
-- `Combined-gene-level-KO-function_unfiltered_GLlblMetag.tsv`(aggregated gene KO functions table table with samples in columns and KO_ID in rows, from [Step 25d](#25d-gene-level-ko-functions-heatmaps))
+- `Combined-gene-level-KO-function_unfiltered_GLlblMetag.tsv` (aggregated gene KO functions table with samples in columns and KO_ID in rows, from [Step 26d](#26d-gene-level-ko-functions-heatmaps))
 - `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
 
 **Output Data:**
@@ -3928,7 +4390,7 @@ make_heatmap(metadata_table_file = metadata_table,
 - **Combined-gene-level-KO-function_decontam_top_50_heatmap_GLlblMetag.png** (heatmap of the top 50 gene-level KO function assignments after filtering out contaminants, output from [make_heatmap()](#make_heatmap))
 
 
-#### 25g. Contig-level Heatmaps
+#### 26g. Contig-level Heatmaps
 
 ```R
 assembly_table <- "Combined-contig-level-taxonomy-coverages-CPM_GLlblMetag.tsv"
@@ -3979,16 +4441,16 @@ make_heatmap(metadata_table_file = metadata_table,
 - `assembly_summary` - path to a tab-separated file containing statistics on assemblies created for each sample
 
 **Input data:**
-- assembly-summaries_GLlblMetag.tsv (table of assembly summary statistics, output from [Step 14b](#14b-summarize-assemblies))
+- assembly-summaries_GLlblMetag.tsv (table of assembly summary statistics, output from [Step 15b](#15b-summarize-assemblies))
 - Combined-contig-level-taxonomy-coverages-CPM_GLlblMetag.tsv (table with all samples combined based on contig-level 
-  taxonomic classifications, output from [Step 21](#21-combine-contig-level-coverage-and-taxonomy-for-each-sample))
+  taxonomic classifications, output from [Step 23b](#23b-generate-contig-level-coverage-summary-tables))
 
 **Output data:**
 - Combined-contig-level-taxonomy_unfiltered_GLlblMetag.tsv (aggregated contig-level taxonomy table with samples in columns and species in rows)
 - **Combined-contig-level-taxonomy_unfiltered_heatmap_GLlblMetag.png** (heatmap of all contig-level taxonomy assignments, output from [make_heatmap()](#make_heatmap))
 - **Combined-contig-level-taxonomy_unfiltered_top_50_heatmap_GLlblMetag.png** (heatmap of the top 50 contig-level taxonomy assignments, output from [make_heatmap()](#make_heatmap))
 
-#### 25h. Contig-level Feature Filtering
+#### 26h. Contig-level Feature Filtering
 
 ```R
 feature_table_file <- "Combined-contig-level-taxonomy_GLlblMetag.tsv"
@@ -4030,7 +4492,7 @@ make_heatmap(metadata_table_file = metadata_table,
 
 **Input Data:**
 
-- `Combined-contig-level-taxonomy_unfiltered_GLlblMetag.tsv`(aggregated gene taxonomy table with samples in columns and species in rows, from [Step 25d](#25d-gene-level-ko-functions-heatmaps))
+- `Combined-contig-level-taxonomy_unfiltered_GLlblMetag.tsv` (aggregated contig taxonomy table with samples in columns and species in rows, from [Step 26g](#26g-contig-level-heatmaps))
 - `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
 
 **Output Data:**
@@ -4039,7 +4501,7 @@ make_heatmap(metadata_table_file = metadata_table,
 - **Combined-contig-level-taxonomy_filtered_heatmap_GLlblMetag.png** (heatmap of all contig-level taxonomy assignments after filtering out non-abundant features, output from [make_heatmap()](#make_heatmap))
 - **Combined-contig-level-taxonomy_filtered_top_50_heatmap_GLlblMetag.png** (heatmap of the top 50 contig-level taxonomy assignments after filtering out non-abundant features, output from [make_heatmap()](#make_heatmap))
 
-#### 25i. Contig-level Decontamination
+#### 26i. Contig-level Decontamination
 
 >Note: species_table and heatmaps are only generated if 1 or more contaminants were detected
 
@@ -4081,7 +4543,7 @@ make_heatmap(metadata_table_file = metadata_table,
 
 **Input Data:**
 
-- `Combined-contig-level-taxonomy_GLlblMetag.tsv`(aggregated contig taxonomy table with samples in columns and species in rows, from [Step 25g](#25g-contig-level-heatmaps))
+- `Combined-contig-level-taxonomy_GLlblMetag.tsv`(aggregated contig taxonomy table with samples in columns and species in rows, from [Step 26g](#26g-contig-level-heatmaps))
 - `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
 
 **Output Data:**
@@ -4091,7 +4553,7 @@ make_heatmap(metadata_table_file = metadata_table,
 - **Combined-contig-level-taxonomy_decontam_heatmap_GLlblMetag.png** (heatmap of all contig-level taxonomy assignments after filtering out contaminants, output from [make_heatmap()](#make_heatmap))
 - **Combined-contig-level-taxonomy_decontam_top_50_heatmap_GLlblMetag.png** (heatmap of the top 50 contig-level taxonomy assignments after filtering out contaminants, output from [make_heatmap()](#make_heatmap))
 
-### 26. Generate Assembly-based Processing Overview
+### 27. Generate Assembly-based Processing Overview
 > This utilizes the helper script [`generate-assembly-based-overview-table.sh`](https://github.com/nasa/GeneLab_Metagenomics_Workflow/blob/DEV/bin/generate-assembly-based-overview-table.sh) 
 
 ```bash
@@ -4103,20 +4565,20 @@ bash generate-assembly-based-overview-table.sh sample_ids_file.txt \
 **Parameter Definitions:**
 
 - `sample_ids_file.txt` - A file listing the sample names, one on each row, provided as a positional argument.
-- `assemblies/` - The directory holding the contig-renamed assembly files generated in [Step 14a](#14a-rename-contig-headers), provided as a positional argument.
-- `predicted-genes/` - The directory holding the gene-calls ammino-acid fasta files generated in [Step 15b](#15b-remove-line-wraps-in-gene-prediction-output), provided as a positional argument.
-- `read-mapping/` - The directory holding the sorted mapping to the sample assembly in BAM format generated in [Step 18c](#18b-sort-assembly-alignments), provided as a positional argument.
-- `bins/` - The directory holding the recovered bins fasta files generated in [Step 23a](#23a-bin-contigs), provided as a positional argument.
-- `MAGs/` - The directory holding the high-quality MAGs fasta files generated in [Step 23c](#23c-filter-mags), provided as a positional argument.
+- `assemblies/` - The directory holding the contig-renamed assembly files generated in [Step 15a](#15a-rename-contig-headers), provided as a positional argument.
+- `predicted-genes/` - The directory holding the gene-calls amino-acid fasta files generated in [Step 16b](#16b-remove-line-wraps-in-gene-prediction-output), provided as a positional argument.
+- `read-mapping/` - The directory holding the sorted mapping to the sample assembly in BAM format generated in [Step 19b](#19b-sort-assembly-alignments), provided as a positional argument.
+- `bins/` - The directory holding the recovered bins fasta files generated in [Step 24a](#24a-bin-contigs), provided as a positional argument.
+- `MAGs/` - The directory holding the high-quality MAGs fasta files generated in [Step 24c](#24c-filter-mags), provided as a positional argument.
 - `Assembly-based-processing-overview_GLlblMetag.tsv` - name of the output file, provided as a positional argument.
 
 **Input Data:**
 
-- assemblies/\*.fasta (contig-renamed assembly files from [Step 14a](#14a-rename-contig-headers))
-- predicted-genes/\*.faa (gene-calls amino-acid fasta file with line wraps removed, output from [Step 15b](#15b-remove-line-wraps-in-gene-prediction-output))
-- read-mapping/\*.bam (sorted mapping to sample assembly, in BAM format, output from [Step 18b](#18b-sort-assembly-alignments))
-- bins/\*.fasta (fasta files of recovered bins, output from [Step 23a](#23a-bin-contigs))
-- MAGs/\*.fasta (directory holding high-quality MAGs, output from [Step 23c](#23c-filter-mags))
+- assemblies/\*.fasta (contig-renamed assembly files from [Step 15a](#15a-rename-contig-headers))
+- predicted-genes/\*.faa (gene-calls amino-acid fasta file with line wraps removed, output from [Step 16b](#16b-remove-line-wraps-in-gene-prediction-output))
+- read-mapping/\*.bam (sorted mapping to sample assembly, in BAM format, output from [Step 19b](#19b-sort-assembly-alignments))
+- bins/\*.fasta (fasta files of recovered bins, output from [Step 24a](#24a-bin-contigs))
+- MAGs/\*.fasta (directory holding high-quality MAGs, output from [Step 24c](#24c-filter-mags))
 
 **Output Data:**
 
diff --git a/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md b/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md
index c355b9baf..a68c06bdb 100644
--- a/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md
+++ b/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md
@@ -68,7 +68,7 @@ Barbara Novak (GeneLab Data Processing Lead)
       - [7f. Filter Kraken2 Species Count Table](#7f-filter-kraken2-species-count-table)
       - [7g. Kraken2 Taxonomy Barplots](#7g-kraken2-taxonomy-barplots)
       - [7h. Kraken2 Feature Decontamination](#7h-kraken2-feature-decontamination)
-    - [8. Taxonomic Profiling Using MetaPhlan](#8-taxonomic-profiling-using-metaphlan)
+    - [8. Taxonomic Profiling Using HUMAnN/MetaPhlAn](#8-taxonomic-profiling-using-humannmetaphlan)
       - [8a. Download and Install HUMAnN databases](#8a-download-and-install-humann-databases)
       - [8b. HUMAnN/MetaPhlAn Taxonomic Classification](#8b-humannmetaphlan-taxonomic-classification)
       - [8c. Merge Multiple Sample Functional Profiles](#8c-merge-multiple-sample-functional-profiles)
@@ -601,7 +601,7 @@ gzip sample1_R2_HostRm_GLlbsMetag.fastq
 
 - sample-kraken2-output.txt (kraken2 read-based output file (one line per read))
 - sample-kraken2-report.tsv (kraken2 report output file (one line per taxa, with number of reads assigned to it))
-- **sample_HostRm_GLlbsMetag.fastq.gz** (filtered and trimmed sample reads with contaminants, human, and host reads removed, gzipped fasta file)
+- **sample_HostRm_GLlbsMetag.fastq.gz** (filtered and trimmed sample reads with contaminants, human, and host reads removed, gzipped fastq file)
 
 
 #### 4c. Compile Host Read Removal QC
@@ -1617,8 +1617,7 @@ kaiju -f kaiju-db/nr_euk/kaiju_db_nr_euk.fmi \
 - kaiju-db/nr_euk/kaiju_db_nr_euk.fmi (FM-index file containing the main Kaiju database index, output from [Step 6a](#6a-build-kaiju-database))
 - kaiju-db/nodes.dmp (kaiju taxonomy hierarchy nodes file, output from [Step 6a](#6a-build-kaiju-database))
 - *_R[12]_decontam.fastq.gz or *_R[12]_HostRm.fastq.gz (filtered and trimmed sample reads with both 
-    contaminants and human reads (and, optionally, host reads) removed, gzipped fasta file, 
-    output from [Step 3b](#3b-build-contaminant-index-and-map-reads) or [Step 4b](#4b-remove-host-reads))
+    contaminants and human reads (and, optionally, host reads) removed, output from [Step 3b](#3b-build-contaminant-index-and-map-reads) or [Step 4b](#4b-remove-host-reads))
 
 
 **Output Data:**
@@ -1994,8 +1993,7 @@ kraken2 --db kraken2-db/ \
 
 - kraken2-db/ (a directory containing kraken2 database files, output from [Step 7a](#7a-download-kraken2-database))
 - *_R[12]_decontam.fastq.gz or *_R[12]_HostRm.fastq.gz (filtered and trimmed sample reads with both 
-    contaminants and human reads (and, optionally, host reads) removed, gzipped fasta file, 
-    output from [Step 3b](#3b-build-contaminant-index-and-map-reads) or [Step 4b](#4b-remove-host-reads))
+    contaminants and human reads (and, optionally, host reads) removed, output from [Step 3b](#3b-build-contaminant-index-and-map-reads) or [Step 4b](#4b-remove-host-reads))
 
 
 **Output Data:**
@@ -2271,7 +2269,7 @@ make_barplot(metadata_file = metadata_table, feature_table_file = "kraken2_decon
 
 ---
 
-### 8. Taxonomic Profiling Using MetaPhlan
+### 8. Taxonomic Profiling Using HUMAnN/MetaPhlAn
 
 #### 8a. Download and Install HUMAnN databases
 
@@ -2338,8 +2336,7 @@ mv sample1-humann3-out-dir/sample1_humann_temp/sample1_metaphlan_bugs_list.tsv \
 
 - `/path/to/humann3-db/` (HUMAnN databases installed in [Step 8a](#8a-download-and-install-humann-databases))
 - *_R[12]_decontam_GLlbsMetag.fastq.gz or *_R[12]_HostRm_GLlbsMetag.fastq.gz (filtered and trimmed sample reads with both 
-    contaminants and human reads (and, optionally, host reads) removed, gzipped fasta file, 
-    output from [Step 3b](#3b-build-contaminant-index-and-map-reads) or [Step 4b](#4b-remove-host-reads))
+    contaminants and human reads (and, optionally, host reads) removed, output from [Step 3b](#3b-build-contaminant-index-and-map-reads) or [Step 4b](#4b-remove-host-reads))
 
 **Output Data:**
 
@@ -2847,12 +2844,12 @@ make_heatmap(metadata_table_file = metadata_table,
 **Input Data:**
 
 - `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
-- `Gene-families-uniref_unfiltered_GLlbsMetag.tsv` (gene-family abundances table, output from [Step])
-- `Gene-families-KO_unfiltered_GLlbsMetag.tsv` (KO term abundances table, output from [Step])
-- `Pathway-abundances_unfiltered_GLlbsMetag.tsv` (pathway abundances table, output from [Step])
-- `Gene-families-uniref_filtered_GLlbsMetag.tsv` (filtered gene-family abundances table, output from [Step]) 
-- `Gene-families-KO_filtered_GLlbsMetag.tsv` (filtered KO term abundances table, output from [Step]) 
-- `Pathway-abundances_filtered_GLlbsMetag.tsv` (filtered Pathway abundances table, output from [Step]) 
+- `Gene-families-uniref_unfiltered_GLlbsMetag.tsv` (gene-family abundances table, output from [Step 8l](#8l-filter-humann-output))
+- `Gene-families-KO_unfiltered_GLlbsMetag.tsv` (KO term abundances table, output from [Step 8l](#8l-filter-humann-output))
+- `Pathway-abundances_unfiltered_GLlbsMetag.tsv` (pathway abundances table, output from [Step 8l](#8l-filter-humann-output))
+- `Gene-families-uniref_filtered_GLlbsMetag.tsv` (filtered gene-family abundances table, output from [Step 8l](#8l-filter-humann-output)) 
+- `Gene-families-KO_filtered_GLlbsMetag.tsv` (filtered KO term abundances table, output from [Step 8l](#8l-filter-humann-output)) 
+- `Pathway-abundances_filtered_GLlbsMetag.tsv` (filtered Pathway abundances table, output from [Step 8l](#8l-filter-humann-output)) 
 
 **Output Data:**
 
@@ -2952,9 +2949,9 @@ make_heatmap(metadata_table_file = metadata_table,
 
 **Input Data:**
 
-- `Gene-families-uniref_filtered_GLlbsMetag.tsv` (filtered gene-family abundances table, output from [Step]) 
-- `Gene-families-KO_filtered_GLlbsMetag.tsv` (filtered KO term abundances table, output from [Step]) 
-- `Pathway-abundances_filtered_GLlbsMetag.tsv` (filtered Pathway abundances table, output from [Step]) 
+- `Gene-families-uniref_filtered_GLlbsMetag.tsv` (filtered gene-family abundances table, output from [Step 8l](#8l-filter-humann-output)) 
+- `Gene-families-KO_filtered_GLlbsMetag.tsv` (filtered KO term abundances table, output from [Step 8l](#8l-filter-humann-output)) 
+- `Pathway-abundances_filtered_GLlbsMetag.tsv` (filtered Pathway abundances table, output from [Step 8l](#8l-filter-humann-output)) 
 - `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
 
 **Output Data:**
@@ -2995,8 +2992,7 @@ megahit -1 sample1_R1_decontam_GLlbsMetag.fastq.gz -2 sample1_R2_decontam_GLlbsM
 **Input data:**
 
 - *_R[12]_decontam_GLlbsMetag.fastq.gz or *_R[12]_HostRm_GLlbsMetag.fastq.gz (filtered and trimmed sample reads with both 
-    contaminants and human reads (and, optionally, host reads) removed, gzipped fasta file, 
-    output from [Step 3b](#3b-build-contaminant-index-and-map-reads) or [Step 4b](#4b-remove-host-reads))
+    contaminants and human reads (and, optionally, host reads) removed, output from [Step 3b](#3b-build-contaminant-index-and-map-reads) or [Step 4b](#4b-remove-host-reads))
 
 **Output data:**
 
@@ -3411,8 +3407,7 @@ bowtie2 --mm --quiet --threads ${task.cpus} \
 
 - sample1-index (bowtie2 index files, output from [Step 14a](#14a-build-reference-index))
 - *_R[12]_decontam_GLlbsMetag.fastq.gz or *_R[12]_HostRm_GLlbsMetag.fastq.gz (filtered and trimmed sample reads with both 
-    contaminants and human reads (and, optionally, host reads) removed, gzipped fasta file, 
-    output from [Step 3b](#3b-build-contaminant-index-and-map-reads) or [Step 4b](#4b-remove-host-reads))
+    contaminants and human reads (and, optionally, host reads) removed, output from [Step 3b](#3b-build-contaminant-index-and-map-reads) or [Step 4b](#4b-remove-host-reads))
 
 **Output Data**
 
@@ -3452,7 +3447,7 @@ samtools sort --threads NumberOfThreads \
 ### 15. Get Coverage Information and Filter Based On Detection
 > **Note:**  
 > “Detection” is a measure of what proportion of a reference sequence recruited reads 
-(see the discussion of detection [here](http://merenlab.org/2017/05/08/anvio-views/#detection)). 
+(see the discussion of detection [here](https://merenlab.org/2017/05/08/anvio-views/#detection)). 
 Filtering based on detection is one way of helping to mitigate non-specific read-recruitment.
 
 #### 15a. Filter Coverage Levels Based On Detection

From 19747ecdc5d9e1830d22c704b3983591b67cae1a Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Tue, 24 Mar 2026 09:45:27 -0700
Subject: [PATCH 36/47] updated Biyi's team designation

---
 .../Low_Biomass/Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md  | 2 +-
 .../Low_Biomass/Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md b/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md
index c70ee6aa0..688c318b2 100644
--- a/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md
+++ b/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md
@@ -9,7 +9,7 @@
 **Document Number:** GL-DPPD-7116  
 
 **Submitted by:**  
-Olabiyi A. Obayomi (GeneLab Analysis Team)  
+Olabiyi A. Obayomi (GeneLab Data Processing Team)  
 
 **Approved by:**  
 Jonathan Galazka (OSDR Project Manager)  
diff --git a/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md b/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md
index a68c06bdb..74a784095 100644
--- a/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md
+++ b/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md
@@ -9,7 +9,7 @@
 **Document Number:** GL-DPPD-7117  
 
 **Submitted by:**  
-Olabiyi A. Obayomi (GeneLab Analysis Team)  
+Olabiyi A. Obayomi (GeneLab Data Processing Team)  
 
 **Approved by:**  
 Jonathan Galazka (OSDR Project Manager)  

From d13b7ba139648f6d9afe978f453a0582d154461c Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Tue, 24 Mar 2026 21:40:23 -0700
Subject: [PATCH 37/47] Added GL-DPPD-7107-B doc

- sync the standard metagenomics pipeline doc to the updates in the low
  biomass pipeline docs.
- also fixes some typos found in the low biomass docs
---
 .../GL-DPPD-7107-B.md                         | 3589 +++++++++++++++++
 .../GL-DPPD-7116.md                           |   16 +-
 .../GL-DPPD-7117.md                           |  124 +-
 3 files changed, 3659 insertions(+), 70 deletions(-)
 create mode 100644 Metagenomics/Illumina/Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-B.md

diff --git a/Metagenomics/Illumina/Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-B.md b/Metagenomics/Illumina/Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-B.md
new file mode 100644
index 000000000..875451cc0
--- /dev/null
+++ b/Metagenomics/Illumina/Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-B.md
@@ -0,0 +1,3589 @@
+# Bioinformatics pipeline for Illumina metagenomics data  <!-- omit in toc -->
+
+> **This document holds an overview and some example commands of how GeneLab processes Illumina metagenomics datasets. Exact processing commands for specific datasets that have been released are provided with their processed data in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).**  
+
+---
+
+**Date:** October 28, 2024  
+**Revision:** -B  
+**Document Number:** GL-DPPD-7107  
+
+**Submitted by:**  
+Olabiyi A. Obayomi (GeneLab Analysis Team)  
+
+**Approved by:**  
+Samrawit Gebre (OSDR Project Manager)  
+Lauren Sanders (OSDR Project Scientist)  
+Amanda Saravia-Butler (GeneLab Science Lead)  
+Barbara Novak (GeneLab Data Processing Lead)  
+
+---
+
+## Updates from previous version  <!-- omit in toc -->
+
+Software Updates and Changes:
+
+| Program      | Previous Version | New Version |
+| :----------- | :--------------- | :---------- |
+| MultiQC      | 1.19             | 1.27.1      |
+| samtools     | 1.20             | 1.22.1      |
+| Kaiju        | N/A              | 1.10.1      |
+| fastp        | N/A              | 0.24.0      |
+| Kraken2      | N/A              | 2.1.6       |
+| KrakenTools  | N/A              | 1.2         |
+| Krona        | N/A              | 2.8.1       |
+| SPAdes       | N/A              | 4.1.0       |
+| R            | N/A              | 4.5.1       |
+| Bioconductor | N/A              | 3.21        |
+| optparse     | N/A              | 1.7.5       |
+| pavian       | N/A              | 1.2.1       |
+| pheatmap     | N/A              | 1.0.13      |
+| phyloseq     | N/A              | 1.52.0      |
+| tidyverse    | N/A              | 2.0.0       |
+
+- Synced this pipeline with the new low-biomass pipelines (updated formatting and definitions)
+- Added new processing steps for additional taxonomic profiling tools and downstream processed data outputs in R
+  - Added read-based taxonomic profiling methods:
+    - Kaiju taxonomic profiling ([Step 18](#18-taxonomic-profiling-using-kaiju))
+    - Kraken2 taxonomic profiling ([Step 19](#19-taxonomic-profiling-using-kraken2))
+  - Added summary plots for taxonomic and functional profiling, for both read-based and assembly-based processing
+    - barplots for read-based taxonomic profiling
+    - heatmaps for read-based functional profiling
+    - heatmaps for assembly-based taxonomy and functional profiling
+  - Added filtering of rare taxa/features
+    - Read-based processing
+      - Kaiju and Kraken2 taxonomies (see [Step 18g](#18g-filter-kaiju-species-count-table) and [Step 19f](#19f-filter-kraken2-species-count-table))
+      - HUMAnN/MetaPhlAn taxonomies and functional profiling (see [Step 20i](#20i-filter-metaphlan-species-count-table) and [Step 20k](#20k-filter-humann-output))
+    - Assembly-based processing (see [Step 16](#16-filtering-and-visualization-of-contig--and-gene-taxonomy-and-gene-function-outputs))
+  - Added missing steps for generating assembly-based processing overview and failed assembly file
+- Replaced bbduk with fastp for read quality filtering and adapter trimming
+
+---
+
+# Table of contents  <!-- omit in toc -->
+
+- [Software used](#software-used)
+- [General processing overview with example commands](#general-processing-overview-with-example-commands)
+  - [Pre-processing](#pre-processing)
+    - [1. Raw Data QC](#1-raw-data-qc)
+      - [1a. Raw Data QC](#1a-raw-data-qc)
+      - [1b. Compile Raw Data QC](#1b-compile-raw-data-qc)
+    - [2. Quality filtering/trimming](#2-quality-filteringtrimming)
+      - [2a. Filter Quality and Trim Adapters](#2a-filter-quality-and-trim-adapters)
+      - [2b. Trim polyG](#2b-trim-polyg)
+      - [2c. Filtered/Trimmed Data QC](#2c-filteredtrimmed-data-qc)
+      - [2d. Compile Filtered/Trimmed Data QC](#2d-compile-filteredtrimmed-data-qc)
+    - [3. R Environment Setup](#3-r-environment-setup)
+      - [3a. Load libraries](#3a-load-libraries)
+      - [3b. Define Custom Functions](#3b-define-custom-functions)
+      - [3c. Set global variables](#3c-set-global-variables)
+  - [Assembly-based Processing](#assembly-based-processing)
+    - [4. Sample assembly](#4-sample-assembly)
+    - [5. Rename Contigs and Summarize Assemblies](#5-rename-contigs-and-summarize-assemblies)
+      - [5a. Rename Contig Headers](#5a-rename-contig-headers)
+      - [5b. Summarize assemblies](#5b-summarize-assemblies)
+    - [6. Gene prediction](#6-gene-prediction)
+      - [6a. Generate Gene Predictions](#6a-generate-gene-predictions)
+      - [6b. Remove Line Wraps In Gene Prediction Output](#6b-remove-line-wraps-in-gene-prediction-output)
+    - [7. Functional annotation](#7-functional-annotation)
+      - [7a. Download reference database of HMM models](#7a-download-reference-database-of-hmm-models)
+      - [7b. Run KEGG annotation](#7b-run-kegg-annotation)
+      - [7c. Filter KO Outputs](#7c-filter-ko-outputs)
+    - [8. Taxonomic classification](#8-taxonomic-classification)
+      - [8a. Pull and Unpack Pre-built Reference DB](#8a-pull-and-unpack-pre-built-reference-db)
+      - [8b. Run Taxonomic Classification](#8b-run-taxonomic-classification)
+      - [8c. Add taxonomy info from taxids to genes](#8c-add-taxonomy-info-from-taxids-to-genes)
+      - [8d. Add Taxonomy Info From Taxids To Contigs](#8d-add-taxonomy-info-from-taxids-to-contigs)
+      - [8e. Format Gene-level Output With awk and sed](#8e-format-gene-level-output-with-awk-and-sed)
+      - [8f. Format Contig-level Output With awk and sed](#8f-format-contig-level-output-with-awk-and-sed)
+    - [9. Read-Mapping](#9-read-mapping)
+      - [9a. Build reference index](#9a-build-reference-index)
+      - [9b. Align Reads to Sample Assembly](#9b-align-reads-to-sample-assembly)
+      - [9c. Sort Assembly Alignments](#9c-sort-assembly-alignments)
+    - [10. Get Coverage Information and Filter Based On Detection](#10-get-coverage-information-and-filter-based-on-detection)
+      - [10a. Filter Coverage Levels Based On Detection](#10a-filter-coverage-levels-based-on-detection)
+      - [10b. Filter Gene and Contig Coverage Based On Detection](#10b-filter-gene-and-contig-coverage-based-on-detection)
+    - [11. Combine Gene-level Coverage, Taxonomy, and Functional Annotations For Each Sample](#11-combine-gene-level-coverage-taxonomy-and-functional-annotations-for-each-sample)
+    - [12. Combine Contig-level Coverage and Taxonomy For Each Sample](#12-combine-contig-level-coverage-and-taxonomy-for-each-sample)
+    - [13. Generating normalized, gene- and contig-level coverage summary tables of KO-annotations and taxonomy across samples](#13-generating-normalized-gene--and-contig-level-coverage-summary-tables-of-ko-annotations-and-taxonomy-across-samples)
+      - [13a. Generate Gene-level Coverage Summary Tables](#13a-generate-gene-level-coverage-summary-tables)
+      - [13b. Generate Contig-level Coverage Summary Tables](#13b-generate-contig-level-coverage-summary-tables)
+    - [14. **M**etagenome-**A**ssembled **G**enome (MAG) recovery](#14-metagenome-assembled-genome-mag-recovery)
+      - [14a. Bin contigs](#14a-bin-contigs)
+      - [14b. Bin quality assessment](#14b-bin-quality-assessment)
+      - [14c. Filter MAGs](#14c-filter-mags)
+      - [14d. MAG taxonomic classification](#14d-mag-taxonomic-classification)
+      - [14e. Generate Overview Table Of All MAGs](#14e-generate-overview-table-of-all-mags)
+    - [15. Generate MAG-level functional summary overview](#15-generate-mag-level-functional-summary-overview)
+      - [15a. Get KO annotations per MAG](#15a-get-ko-annotations-per-mag)
+      - [15b. Summarize KO annotations with KEGG-Decoder](#15b-summarize-ko-annotations-with-kegg-decoder)
+    - [16. Filtering and Visualization of Contig- and Gene-taxonomy and Gene-function Outputs](#16-filtering-and-visualization-of-contig--and-gene-taxonomy-and-gene-function-outputs)
+      - [16a. Gene-level Taxonomy Heatmaps](#16a-gene-level-taxonomy-heatmaps)
+      - [16b. Gene-level Taxonomy Feature Filtering](#16b-gene-level-taxonomy-feature-filtering)
+      - [16c. Gene-level KO Functions Heatmaps](#16c-gene-level-ko-functions-heatmaps)
+      - [16d. Gene-level KO Functions Feature Filtering](#16d-gene-level-ko-functions-feature-filtering)
+      - [16e. Contig-level Heatmaps](#16e-contig-level-heatmaps)
+      - [16f. Contig-level Feature Filtering](#16f-contig-level-feature-filtering)
+    - [17. Generate Assembly-based Processing Overview](#17-generate-assembly-based-processing-overview)
+  - [Read-based Processing](#read-based-processing)
+    - [18. Taxonomic Profiling Using Kaiju](#18-taxonomic-profiling-using-kaiju)
+      - [18a. Build Kaiju Database](#18a-build-kaiju-database)
+      - [18b. Kaiju Taxonomic Classification](#18b-kaiju-taxonomic-classification)
+      - [18c. Compile Kaiju Taxonomy Results](#18c-compile-kaiju-taxonomy-results)
+      - [18d. Convert Kaiju Output To Krona Format](#18d-convert-kaiju-output-to-krona-format)
+      - [18e. Compile Kaiju Krona Reports](#18e-compile-kaiju-krona-reports)
+      - [18f. Create Kaiju Species Count Table](#18f-create-kaiju-species-count-table)
+      - [18g. Filter Kaiju Species Count Table](#18g-filter-kaiju-species-count-table)
+      - [18h. Kaiju Taxonomy Barplots](#18h-kaiju-taxonomy-barplots)
+    - [19. Taxonomic Profiling Using Kraken2](#19-taxonomic-profiling-using-kraken2)
+      - [19a. Download Kraken2 Database](#19a-download-kraken2-database)
+      - [19b. Kraken2 Taxonomic Classification](#19b-kraken2-taxonomic-classification)
+      - [19c. Compile Kraken2 Taxonomy Results](#19c-compile-kraken2-taxonomy-results)
+        - [19ci. Create Merged Kraken2 Taxonomy Table](#19ci-create-merged-kraken2-taxonomy-table)
+        - [19cii. Compile Kraken2 Taxonomy Reports](#19cii-compile-kraken2-taxonomy-reports)
+      - [19d. Convert Kraken2 Output to Krona Format](#19d-convert-kraken2-output-to-krona-format)
+      - [19e. Compile Kraken2 Krona Reports](#19e-compile-kraken2-krona-reports)
+      - [19f. Filter Kraken2 Species Count Table](#19f-filter-kraken2-species-count-table)
+      - [19g. Kraken2 Taxonomy Barplots](#19g-kraken2-taxonomy-barplots)
+    - [20. Taxonomic Profiling Using HUMAnN/MetaPhlan](#20-taxonomic-profiling-using-humannmetaphlan)
+      - [20a. Download and Install HUMAnN databases](#20a-download-and-install-humann-databases)
+      - [20b. HUMAnN/MetaPhlAn Taxonomic Classification](#20b-humannmetaphlan-taxonomic-classification)
+      - [20c. Merge Multiple Sample Functional Profiles](#20c-merge-multiple-sample-functional-profiles)
+      - [20d. Split Results Tables](#20d-split-results-tables)
+      - [20e. Normalize Gene Families and Pathway Abundances Tables](#20e-normalize-gene-families-and-pathway-abundances-tables)
+      - [20f. Generate Normalized Gene-family Table Grouped by Kegg Orthologs (KOs)](#20f-generate-normalized-gene-family-table-grouped-by-kegg-orthologs-kos)
+      - [20g. Combine MetaPhlan Taxonomy Tables](#20g-combine-metaphlan-taxonomy-tables)
+      - [20h. Create MetaPhlan Species Count Table](#20h-create-metaphlan-species-count-table)
+        - [20hi. Get Sample Read Counts](#20hi-get-sample-read-counts)
+        - [20hii. Process MetaPhlan Taxonomy Table](#20hii-process-metaphlan-taxonomy-table)
+      - [20i. Filter MetaPhlan Species Count Table](#20i-filter-metaphlan-species-count-table)
+      - [20j. MetaPhlan Taxonomy Barplots](#20j-metaphlan-taxonomy-barplots)
+      - [20k. Filter Humann Output](#20k-filter-humann-output)
+      - [20l. Create Humann Function Heatmaps](#20l-create-humann-function-heatmaps)
+
+---
+
+# Software used
+
+| Program      | Version | Relevant Links                                                                                                                                     |
+| :----------- | :-----: | :------------------------------------------------------------------------------------------------------------------------------------------------- |
+| bbduk        |  38.86  | [https://bbmap.org/](https://bbmap.org/)                                                                                                           |
+| bit          | 1.8.53  | [https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit](https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit)     |
+| bowtie2      |  2.4.1  | [https://bowtie-bio.sourceforge.net/bowtie2/index.shtml](https://bowtie-bio.sourceforge.net/bowtie2/index.shtml)                                   |
+| CAT          |  5.2.3  | [https://github.com/dutilh/CAT#cat-and-bat](https://github.com/dutilh/CAT#cat-and-bat)                                                             |
+| CheckM       |  1.1.3  | [https://github.com/Ecogenomics/CheckM](https://github.com/Ecogenomics/CheckM)                                                                     |
+| fastp        | 0.24.0  | [https://github.com/OpenGene/fastp](https://github.com/OpenGene/fastp)                                                                             |
+| FastQC       | 0.12.1  | [https://www.bioinformatics.babraham.ac.uk/projects/fastqc/](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)                           |
+| GTDB-Tk      |  2.4.0  | [https://github.com/Ecogenomics/GTDBTk](https://github.com/Ecogenomics/GTDBTk)                                                                     |
+| HUMAnN       |   3.9   | [https://github.com/biobakery/humann](https://github.com/biobakery/humann)                                                                         |
+| Kaiju        | 1.10.1  | [https://bioinformatics-centre.github.io/kaiju/](https://bioinformatics-centre.github.io/kaiju/)                                                   |
+| KEGG-Decoder |  1.2.2  | [https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder](https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder) |
+| KOFamScan    |  1.3.0  | [https://github.com/takaram/kofam_scan#kofamscan](https://github.com/takaram/kofam_scan#kofamscan)                                                 |
+| Kraken2      |  2.1.6  | [https://github.com/DerrickWood/kraken2](https://github.com/DerrickWood/kraken2)                                                                   |
+| KrakenTools  |   1.2   | [https://ccb.jhu.edu/software/krakentools/](https://ccb.jhu.edu/software/krakentools/)                                                             |
+| Krona        |  2.8.1  | [https://github.com/marbl/Krona/wiki](https://github.com/marbl/Krona/wiki)                                                                         |
+| MEGAHIT      |  1.2.9  | [https://github.com/voutcn/megahit#megahit](https://github.com/voutcn/megahit#megahit)                                                             |
+| MetaBAT      |  2.15   | [https://bitbucket.org/berkeleylab/metabat/src/master/](https://bitbucket.org/berkeleylab/metabat/src/master/)                                     |
+| MetaPhlAn    |  4.1.0  | [https://github.com/biobakery/MetaPhlAn](https://github.com/biobakery/MetaPhlAn)                                                                   |
+| MultiQC      | 1.27.1  | [https://multiqc.info/](https://multiqc.info/)                                                                                                     |
+| Prodigal     |  2.6.3  | [https://github.com/hyattpd/Prodigal#prodigal](https://github.com/hyattpd/Prodigal#prodigal)                                                       |
+| samtools     | 1.22.1  | [https://github.com/samtools/samtools#samtools](https://github.com/samtools/samtools#samtools)                                                     |
+| R            |  4.5.1  | [https://www.r-project.org](https://www.r-project.org)                                                                                             |
+| Bioconductor |  3.21   | [https://www.bioconductor.org](https://www.bioconductor.org)                                                                                       |
+| optparse     |  1.7.5  | [https://cran.r-project.org/web/packages/optparse/index.html](https://cran.r-project.org/web/packages/optparse/index.html)                         |
+| pavian       |  1.2.1  | [https://github.com/fbreitwieser/pavian](https://github.com/fbreitwieser/pavian)                                                                   |
+| pheatmap     | 1.0.13  | [https://cran.r-project.org/package=pheatmap](https://cran.r-project.org/package=pheatmap)                                                         |
+| phyloseq     | 1.52.0  | [https://www.bioconductor.org/packages/release/bioc/html/phyloseq.html](https://www.bioconductor.org/packages/release/bioc/html/phyloseq.html)     |
+| tidyverse    |  2.0.0  | [https://www.tidyverse.org](https://www.tidyverse.org)                                                                                             |
+
+---
+
+# General processing overview with example commands
+
+> Exact processing commands and output files listed in **bold** below are included with each Metagenomics Seq processed dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).  
+
+## Pre-processing
+
+
+### 1. Raw Data QC
+> NOTE: It is NASA's policy that any human reads are to be removed from metagenomics datasets prior to being hosted in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/). As such, this pipeline starts with fastq files that have had the human reads removed using the GeneLab Remove Human Reads pipeline ([GL-DPPD-7105-A](../../Remove_human_reads_from_raw_data/Pipeline_GL-DPPD-7105_Versions/GL-DPPD-7105-A.md)).
+
+#### 1a. Raw Data QC
+
+```bash
+fastqc -o HRrm_fastqc_output *HRrm_GLmetagenomics.fastq.gz
+```
+
+**Parameter Definitions:**
+
+* `-o` – the output directory to store results
+* `*HRrm_GLmetagenomics.fastq.gz` – the input reads are specified as a positional argument, and can be given all at once with wildcards like this, or as individual arguments with spaces in between them
+
+**Input data:**
+
+- *HRrm_GLmetagenomics.fastq.gz (raw reads, after human read removal)
+
+**Output data:**
+
+* *fastqc.html (FastQC output html summary)
+* *fastqc.zip (FastQC output data)
+
+
+#### 1b. Compile Raw Data QC
+
+```bash
+multiqc --zip-data-dir \
+        --outdir raw_multiqc_report \
+        --filename raw_multiqc_GLmetagenomics \
+        --interactive \
+        /path/to/raw_fastqc_output/
+```
+
+**Parameter Definitions:**
+
+- `--zip-data-dir` - Compress the data directory.
+- `--outdir` – Specifies the output directory to store results.
+- `--filename` – Specifies the filename prefix of results.
+- `--interactive` - Force multiqc to always create interactive javascript plots.
+- `/path/to/raw_fastqc_output/` – The directory holding the output data from the FastQC run, provided as a positional argument.
+
+**Input data:**
+
+- /path/to/HRrm_fastqc_output/*fastqc.zip (FastQC output data, from [Step 1a](#1a-raw-data-qc))
+
+**Output data:**
+
+- **HRrm_multiqc_report/HRrm_multiqc_GLmetagenomics.html** (multiqc output html summary)
+- **HRrm_multiqc_report/HRrm_multiqc_GLmetagenomics_data.zip** (zip archive containing multiqc output data)
+
+<br>  
+
+---
+
+### 2. Quality filtering/trimming
+
+#### 2a. Filter Quality and Trim Adapters
+
+```bash
+fastp --in1 sample1_R1_HRrm_GLmetagenomics.fastq.gz --out1 temp_sample1_R1_filtered.fastq.gz \
+      --in2 sample1_R2_HRrm_GLmetagenomics.fastq.gz --out2 temp_sample1_R2_filtered.fastq.gz \
+      --qualified_quality_phred  20 \
+      --length_required 50 \
+      --thread 2 \
+      --detect_adapter_for_pe \
+      --json sample1.fastp.json \
+      --html sample1.fastp.html 2> sample1-fastp.log
+```
+
+**Parameter Definitions:**
+
+- `--in1` - Specifies the forward input read file
+- `--in2` - Specifies the reverse input read file
+- `--out1` - Specifies the forward output read file
+- `--out2` - Specifies the reverse output read file
+- `--qualified_quality_phred` - the minimum quality value at which a base is qualified (set to 20 here; fastp default: 15)
+- `--length_required` - the minimum read length. Shorter reads will be discarded (set to 50 here; fastp default: 15)
+- `--thread` - number of worker threads (set to 2 here)
+- `--detect_adapter_for_pe` - for paired end data, enable auto-detection of adapters
+- `--json` - Specifies the json format report file name
+- `--html` - Specifies the html format report file name
+- `2> sample1-fastp.log` - Redirects the stderr output to a log file.
+
+**Input Data:**
+
+- *HRrm_GLmetagenomics.fastq.gz (raw sample reads with human reads removed)
+
+**Output Data:**
+
+- temp_*_filtered.fastq.gz (quality filtered and adapter trimmed reads)
+
+#### 2b. Trim polyG
+
+```bash
+fastp --in1 temp_sample1_R1_filtered.fastq.gz --out1 sample1_R1_filtered_GLmetagenomics.fastq.gz \
+      --in2 temp_sample1_R2_filtered.fastq.gz --out2 sample1_R2_filtered_GLmetagenomics.fastq.gz \
+      --qualified_quality_phred  20 \
+      --length_required 50 \
+      --thread 2 \
+      --detect_adapter_for_pe \
+      --json sample1.fastp.json \
+      --html sample1.fastp.html \
+      --trim_poly_g 2> sample1-fastp.log
+```
+
+**Parameter Definitions:**
+
+- `--in1` - Specifies the forward input read file
+- `--in2` - Specifies the reverse input read file
+- `--out1` - Specifies the forward output read file
+- `--out2` - Specifies the reverse output read file
+- `--qualified_quality_phred` - the minimum quality value at which a base is qualified (set to 20 here; fastp default: 15)
+- `--length_required` - the minimum read length. Shorter reads will be discarded (set to 50 here; fastp default: 15)
+- `--thread` - number of worker threads (set to 2 here)
+- `--detect_adapter_for_pe` - for paired end data, enable auto-detection of adapters
+- `--json` - Specifies the json format report file name
+- `--html` - Specifies the html format report file name
+- `--trim_poly_g` - force polyG trimming
+- `2> sample1-fastp.log` - Redirects the stderr output to a log file.
+
+**Input Data:**
+
+- /path/to/filtered_data/temp_sample1*.fastq.gz (round 1 quality-filtered and adapter-trimmed reads, output from [Step 2a](#2a-filter-quality-and-trim-adapters))
+
+**Output Data:**
+
+- **\*filtered_GLmetagenomics.fastq.gz** (quality-filtered, adapter-trimmed, and polyG-trimmed reads with human reads removed)<br>
+
+---
+
+#### 2c. Filtered/Trimmed Data QC
+
+```bash 
+fastqc -o filtered_fastqc_output/ *filtered_GLmetagenomics.fastq.gz
+```
+
+**Parameter Definitions:**
+
+- `-o` – the output directory to store results  
+- `*filtered_GLmetagenomics.fastq.gz` – the input reads are specified as a positional argument, and can be given all at once with wildcards like this, or as individual arguments with spaces in between them  
+
+**Input data:**
+
+- *filtered_GLmetagenomics.fastq.gz (trimmed and filtered reads, from [Step 2b](#2b-trim-polyg))
+
+**Output data:**
+
+- *fastqc.html (FastQC output html summary)
+- *fastqc.zip (FastQC output data)
+
+
+#### 2d. Compile Filtered/Trimmed Data QC
+
+```bash
+multiqc --zip-data-dir \
+        --outdir filtered_multiqc_report \
+        --filename filtered_multiqc_GLmetagenomics \
+        --interactive \
+        /path/to/filtered_fastqc_output/
+```
+
+**Parameter Definitions:**
+
+- `--zip-data-dir` - Compress the data directory.
+- `--outdir` – Specifies the output directory to store results.
+- `--filename` – Specifies the filename prefix of results.
+- `--interactive` - Force multiqc to always create interactive javascript plots.
+- `/path/to/filtered_fastqc_output/` – The directory holding the output data from the FastQC run, provided as a positional argument.
+
+**Input Data:**
+
+- /path/to/filtered_fastqc_output/*fastqc.zip (FastQC output data, from [Step 2c](#2c-filteredtrimmed-data-qc))
+
+**Output Data:**
+
+- **filtered_multiqc_report/filtered_multiqc_GLmetagenomics.html** (multiqc output html summary)
+- **filtered_multiqc_report/filtered_multiqc_GLmetagenomics_data.zip** (zip archive containing multiqc output data)
+
+<br>
+
+### 3. R Environment Setup
+
+> Taxonomy bar plots and heatmaps are generated in R.
+
+#### 3a. Load libraries
+
+```R
+library(glue)
+library(htmlwidgets)
+library(pavian)
+library(pheatmap)
+library(phyloseq)
+library(plotly)
+library(tidyverse)
+```
+
+#### 3b. Define Custom Functions
+
+#### get_last_assignment() <!-- omit in toc -->
+<details>
+  <summary>retrieves the last taxonomy assignment from a taxonomy string</summary>
+
+  ```R
+  get_last_assignment <- function(taxonomy_string, split_by = ';', remove_prefix = NULL) {
+
+    # Split taxonomy string by the supplied delimiter 'split_by'
+    # then convert the list of parts to a vector of parts
+    split_names <- strsplit(x =  taxonomy_string , split = split_by) %>%
+      unlist()
+    # Get the last part of the split string
+    level_name <- split_names[[length(split_names)]]
+    
+    if(level_name == "_"){
+      return(taxonomy_string)
+    }
+    # remove an unwanted prefix if specified
+    if(!is.null(remove_prefix)){
+      level_name <- gsub(pattern = remove_prefix, replacement = "", x = level_name)
+    }
+    
+    return(level_name)
+  }
+  ```
+
+  **Function Parameter Definitions:**
+  - `taxonomy_string` - a character string containing a list of taxonomy assignments separated by `split_by`
+  - `split_by=` - a character string containing a regular expression used to split the `taxonomy_string`, default=`';'`
+  - `remove_prefix=` - a character string containing a regular expression to be matched and removed, default=`NULL`
+
+  **Returns:** the last taxonomy assignment listed in the `taxonomy_string`
+
+</details>
+
+#### mutate_taxonomy() <!-- omit in toc -->
+<details>
+  <summary>mutate taxonomy column to contain the lowest taxonomy assignment</summary>
+
+  ```R
+  mutate_taxonomy <- function(df, taxonomy_column="taxonomy") {
+
+    # make sure that the taxonomy column is always named taxonomy
+    col_index <- which(colnames(df) == taxonomy_column)
+    colnames(df)[col_index] <- "taxonomy"
+    df <- df %>% dplyr::mutate(across(where(is.numeric), function(x) tidyr::replace_na(x, 0))) %>%
+      dplyr::mutate(taxonomy=map_chr(taxonomy, .f = function(taxon_name = .x) {
+        last_assignment <- get_last_assignment(taxon_name) 
+        last_assignment  <- gsub(pattern = "\\[|\\]|'", replacement = "", x = last_assignment)
+        trimws(last_assignment, which = "both")
+      })) %>% 
+      as.data.frame(check.names = FALSE, stringsAsFactors = FALSE)
+    # Ensure the taxonomy names are unique by aggregating duplicates
+    df <- aggregate(.~taxonomy, data = df, FUN = sum)
+    return(df)
+  }
+  ```
+
+  **Custom Functions Used:**
+  - [get_last_assignment()](#get_last_assignment)
+
+  **Function Parameter Definitions:**
+  - `df` - a dataframe containing the taxonomy assignments
+  - `taxonomy_column=` - name of the column in `df` containing the taxonomy assignments, default="taxonomy"
+
+  **Returns:** dataframe, `df`, with unique last taxonomy names stored in a column named "taxonomy"
+
+</details>
+
+#### process_kaiju_table() <!-- omit in toc -->
+<details>
+  <summary>reformat kaiju output table</summary>
+
+  ```R
+  process_kaiju_table <- function(file_path, taxon_col = "taxon_name") {
+  
+    # read input table
+    kaiju_table <-  read_delim(file = file_path,
+                               delim = "\t",
+                               col_names = TRUE)
+
+    # Create  a sample colname if the file column wasn't pre-edited
+    if(colnames(kaiju_table)[1] ==  "file" ){
+      kaiju_table <-  kaiju_table %>% rename(sample=file)
+    }
+
+    # filter out all kaiju database entries
+    kaiju_table <- kaiju_table %>% 
+      filter(!str_detect(sample, "dmp")) %>%
+      mutate(sample=str_replace_all(sample, ".+/(.+)_kaiju.out", "\\1"))
+ 
+    # keep only sample, reads, and taxonomy column (as defined by taxon_col argument) 
+    # convert long dataframe to wide dataframe
+    # mutate the taxonomy column such that it contains only lowest taxonomy assignment
+    abs_abun_df <- kaiju_table %>%
+      select(sample, reads, taxonomy=!!sym(taxon_col)) %>%
+      pivot_wider(names_from = "sample", values_from = "reads", names_sort = TRUE) %>%
+      mutate_taxonomy 
+  
+    # Set the taxon names as row names, drop the taxonomy column and convert to a matrix
+    rownames(abs_abun_df) <- abs_abun_df[,"taxonomy"]
+    abs_abun_df <- abs_abun_df[,-(which(colnames(abs_abun_df) == "taxonomy"))]
+    abs_abun_matrix <- as.matrix(abs_abun_df)
+    
+    return(abs_abun_matrix)
+  }
+  ```
+
+  **Custom Functions Used:**
+  - [mutate_taxonomy()](#mutate_taxonomy)
+
+  **Function Parameter Definitions:**
+  - `file_path` - file path to the tab-delimited kaiju output table file
+  - `taxon_col=`- name of the taxon column in the input data file, default="taxon_name"
+
+  **Returns:** matrix, `abs_abun_matrix`, holding the reformatted kaiju output with taxa as rows and samples as columns
+
+</details>
+
+#### merge_kraken_reports() <!-- omit in toc -->
+<details>
+  <summary>merge and process multiple kraken outputs to one species table</summary>
+
+  ```R
+  merge_kraken_reports <- function(reports_dir) {
+
+    reports <- read_reports(reports_dir)
+
+    # Retrieve sample names from file names
+    samples <- names(reports) %>% str_split("-") %>% map_chr(function(x) pluck(x, 1))
+    merged_reports  <- merge_reports2(reports, col_names = samples)
+    taxonReads <- merged_reports$taxonReads
+    cladeReads <- merged_reports$cladeReads
+    tax_data <- merged_reports[["tax_data"]]
+
+    species_table <- tax_data %>%
+      bind_cols(cladeReads) %>%
+      filter(taxRank %in% c("U", "S")) %>% # select unclassified and species rows 
+      select(-contains("tax")) %>%
+      zero_if_na() %>%
+      filter(name != 0) %>% # drop unknown taxonomies
+      group_by(name) %>%
+      summarise(across(everything(), sum)) %>%
+      ungroup() %>%
+      as.data.frame %>%
+      rename(species = name)
+
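+    # Set rownames as species name, drop species column
+    # and convert table from dataframe to matrix
+    species_names <- species_table[, "species"]
+    rownames(species_table) <- species_names
+    species_table <- species_table[, -match("species", colnames(species_table)), drop = FALSE] %>%
+      as.matrix()
+    
+    return(species_table)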
+  }
+  ```
+
+  **Function Parameter Definitions:**
+  - `reports_dir` - path to a directory containing kraken2 reports 
+
+  **Returns:** a kraken species count matrix, `species_table`, with samples and species as columns and rows, respectively.
+
+</details>
+
+#### get_abundant_features() <!-- omit in toc -->
+<details>
+  <summary>Find abundant features based on the sum of feature values</summary>
+  
+  ```R
+  get_abundant_features <- function(mat, cpm_threshold = 1000){
+  
+    # Filter out unassigned functions
+    unassigned <- "UNMAPPED|UNGROUPED|UNINTEGRATED|Not annotated"
+    mat <- mat %>%
+      as.data.frame %>%
+      rownames_to_column("Feature") %>%
+      filter(str_detect(Feature, unassigned, negate = TRUE))
+    rownames(mat) <- mat$Feature
+    mat <- mat[, -1]
+
+    features <- rowSums(mat, na.rm = TRUE) %>% sort()
+    
+    abund_features <- features[features > cpm_threshold] %>% names
+    
+    abund_features.m <- mat[abund_features, ]
+    
+    return(abund_features.m)
+  }
+  ```
+
+  **Function Parameter Definitions:**
+  - `mat` - a feature count matrix with features as rows and samples as columns
+  - `cpm_threshold = 1000` - threshold to identify abundant features
+
+  **Returns:** a matrix, `abund_features.m`, holding the features that pass the requested threshold
+  
+</details>
+
+#### count_to_rel_abundance() <!-- omit in toc -->
+<details>
+  <summary>Convert species count matrix to relative abundance matrix</summary>
+
+  ```R
+  count_to_rel_abundance <- function(species_table) {
+
+    # calculate species relative abundance per sample and
+    # drop columns where none of the reads were classified or were non-microbial (NA)
+    abund_table <- species_table %>%
+      as.data.frame %>%
+      mutate(across(everything(), function(x) (x/sum(x, na.rm = TRUE))*100)) %>%
+        select(
+          where( ~all(!is.na(.)))
+        ) %>%
+      rownames_to_column("Species")
+
+    # Set rownames as species name and drop species column  
+    rownames(abund_table) <- abund_table$Species
+    abund_table <- abund_table[, -match(x = "Species", colnames(abund_table))] %>% t
+
+    return(abund_table)
+  }
+
+  ```
+
+  **Function Parameter Definitions:**
+  - `species_table` - a species count matrix with samples and species as columns and rows, respectively.
+
+  **Returns:** a species relative abundance matrix, `abund_table`, with samples and species as rows and columns, respectively.
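+
+  The normalization can be illustrated with a minimal base-R sketch. It does not call `count_to_rel_abundance()`; it assumes a small made-up count matrix (`toy_counts` and its species/sample names are hypothetical) and only shows that each sample column is scaled to sum to 100 before the table is transposed so that samples become rows.
+
+  ```R
+  # Hypothetical toy count matrix: species as rows, samples as columns
+  toy_counts <- matrix(c(10, 30, 60,
+                         5,  5, 90),
+                       nrow = 3,
+                       dimnames = list(c("sp_A", "sp_B", "sp_C"),
+                                       c("sample_1", "sample_2")))
+
+  # Scale each sample (column) so its values sum to 100, then transpose
+  toy_rel_abund <- t(sweep(toy_counts, 2, colSums(toy_counts), "/") * 100)
+
+  # Samples are now rows and every row sums to 100
+  print(toy_rel_abund)
+  print(rowSums(toy_rel_abund))
+  ```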
+
+</details>
+
+#### filter_rare() <!-- omit in toc -->
+<details>
+  <summary>filter out rare and non_microbial taxonomy assignments based on relative abundance</summary>
+
+  ```R
+  filter_rare <- function(species_table, non_microbial, threshold=1) {
+    
+    # Drop species listed in 'non_microbial' regex
+    clean_tab_count  <-  species_table %>% 
+                         as.data.frame %>% 
+                         rownames_to_column("Species") %>% 
+                         filter(str_detect(Species, non_microbial, negate = TRUE))
+    # Calculate species relative abundance
+    clean_tab <- clean_tab_count %>%
+      mutate(across(where(is.numeric), function(x) (x/sum(x, na.rm = TRUE))*100))
+    # Set rownames as species name and drop species column
+    rownames(clean_tab) <- clean_tab$Species
+    clean_tab  <- clean_tab[, -1]
+    
+    # Get species with relative abundance less than `threshold` in all samples
+    rare_species <- map(clean_tab, .f = function(col) rownames(clean_tab)[col < threshold])
+    rare <- Reduce(intersect, rare_species)
+    
+    # Set rownames as species name and drop species column  
+    rownames(clean_tab_count) <- clean_tab_count$Species
+    clean_tab_count  <- clean_tab_count[,-1] 
+    # Drop rare species
+    abund_table <- clean_tab_count[!(rownames(clean_tab_count) %in% rare), ]
+    
+    return(abund_table)
+  }
+  ```
+
+  **Function Parameter Definitions:**
+  - `species_table` - the species matrix to filter with species and samples as rows and columns, respectively.
+  - `non_microbial` - a regular expression denoting the names used to identify a species as non-microbial or unwanted
+  - `threshold=` - abundance threshold used to determine if the relative abundance is rare, value denotes a percentage between 0 and 100.
+
+  **Returns:** dataframe, `abund_table`, with rare and non_microbial/unwanted species removed
+
+</details>
+
+#### group_low_abund_taxa() <!-- omit in toc -->
+<details>
+  <summary>Group rare taxa or return a table with only rare taxa</summary>
+
+  ```R
+  group_low_abund_taxa <- function(abund_table, threshold = 0.05,
+                                   rare_taxa = FALSE) {
+    # If `rare_taxa` is TRUE, a table with only the rare taxa is returned
+    # Initialize an empty vector that will contain the indices of the
+    # low abundance columns/taxa to group
+    taxa_to_group <- c()
+    # Initialize the position at which the next low abundance index is stored
+    index <- 1
+    
+    # Loop over every column (taxon) and check whether its max abundance is below the threshold;
+    # if so, save the column index in the taxa_to_group vector
+    for (column in seq_len(ncol(abund_table))) {
+      if(max(abund_table[,column], na.rm = TRUE) < threshold) {
+        taxa_to_group[index] <- column
+        index <- index + 1
+      }
+    }
+    
+    if(is.null(taxa_to_group)) {
+      message(glue("Rare taxa were not grouped. please provide a higher 
+                        threshold than {threshold} for grouping rare taxa, 
+                        only numbers are allowed."))
+      return(abund_table)
+    }
+    
+    if(rare_taxa) {
+      abund_table <- abund_table[,taxa_to_group,drop=FALSE]
+    } else {
+      #remove the low abundant taxa or columns
+      abundant_taxa <-abund_table[,-(taxa_to_group), drop=FALSE]
+      #get the rare taxa
+      # rare_taxa <-abund_table[,taxa_to_group]
+      rare_taxa <- subset(x = abund_table, select = taxa_to_group)
+      #get the proportion of each sample that makes up the rare taxa
+      rare <- rowSums(rare_taxa)
+      #bind the abundant taxa to the rare taxa
+      abund_table <- cbind(abundant_taxa,rare)
+      #rename the columns i.e the taxa
+      colnames(abund_table) <- c(colnames(abundant_taxa),"Rare")
+    }
+    
+    return(abund_table)
+  }
+  ```
+
+  **Function Parameter Definitions:**
+  - `abund_table` - a relative abundance matrix with taxa as columns and  samples as rows
+  - `rare_taxa` - a boolean specifying if only rare taxa should be returned
+  - `threshold` - a max abundance threshold for defining taxa as rare
+
+  **Returns:** a relative abundance matrix, `abund_table`, with rare taxa grouped or with non-rare taxa filtered out
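+
+  The grouping logic can be sketched in base R as follows. This is an illustration only, not the function above: `toy_abund`, its taxon and sample names, and the values are hypothetical, and the threshold simply mirrors the function's default of 0.05.
+
+  ```R
+  # Hypothetical relative abundance matrix: samples as rows, taxa as columns
+  toy_abund <- matrix(c(60.00, 39.96, 0.04,
+                        70.00, 29.98, 0.02),
+                      nrow = 2, byrow = TRUE,
+                      dimnames = list(c("sample_1", "sample_2"),
+                                      c("taxon_A", "taxon_B", "taxon_C")))
+  threshold <- 0.05
+
+  # A taxon is treated as rare when its maximum abundance across samples is below the threshold
+  is_rare <- apply(toy_abund, 2, max, na.rm = TRUE) < threshold
+
+  # Keep the abundant taxa and collapse all rare taxa into a single "Rare" column
+  toy_grouped <- cbind(toy_abund[, !is_rare, drop = FALSE],
+                       Rare = rowSums(toy_abund[, is_rare, drop = FALSE]))
+  print(toy_grouped)
+  ```
+
+  Using `drop = FALSE` keeps the subsets as matrices even when only one taxon is rare or abundant, so `rowSums()` and `cbind()` behave the same regardless of how many columns are selected.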
+
+</details>
+
+#### make_plot() <!-- omit in toc -->
+<details>
+  <summary>Create stacked bar plots of relative abundance from input dataframes</summary>
+
+  ```R
+  # Make bar plot
+  make_plot <- function(abund_table, metadata, custom_palette, publication_format,
+                        samples_column="sample_id", prefix_to_remove="barcode"){
+  
+    abund_table_wide <- abund_table %>%
+        as.data.frame() %>%
+        rownames_to_column(samples_column) %>%
+        inner_join(metadata) %>%
+        select(!!!colnames(metadata), everything()) %>%
+        mutate(!!samples_column := !!sym(samples_column) %>% str_remove(prefix_to_remove))
+        
+      
+    abund_table_long <- abund_table_wide %>%
+        pivot_longer(-colnames(metadata),
+                     names_to = "Species",
+                     values_to = "relative_abundance")
+      
+    p <- ggplot(abund_table_long, mapping = aes(x = !!sym(samples_column),
+                                                y = relative_abundance, fill = Species)) +
+         geom_col() +
+         scale_fill_manual(values = custom_palette) +
+         labs(x = NULL, y = "Relative Abundance (%)") +
+         publication_format
+    
+    return(p)
+  }
+  ```
+
+  **Function Parameter Definitions:**
+  - `abund_table` - a relative abundance dataframe with rows summing to 100%
+  - `metadata` - a metadata dataframe with samples as rows and columns describing each sample
+  - `custom_palette` - a vector of strings specifying a custom color palette for coloring plots
+  - `publication_format` - a ggplot::theme object specifying a custom theme for plotting
+  - `samples_column` - a character string specifying the column in `metadata` holding sample names, default is "sample_id"
+  - `prefix_to_remove` - a string specifying a prefix or any character set to remove from sample names, default is "barcode"
+
+  **Returns:** a relative abundance stacked bar plot, `p`
+
+</details>
+
+#### make_barplot()  <!-- omit in toc -->
+<details>
+  <summary>Parse Metadata and Feature table files in order to create stacked barplots of relative abundance.</summary>
+  
+  ```R
+  make_barplot <- function(metadata_table_file, feature_table_file, 
+                           feature_column = "species", samples_column = "sample_id", group_column = "group", 
+                           output_prefix, assay_suffix = "_GLmetagenomics",
+                           publication_format, custom_palette) {
+    facet_by <- reformulate(group_column)
+    # Prepare feature table
+    feature_table <- read_delim(feature_table_file)
+    rownames(feature_table) <- feature_table[[1]]
+    feature_table <- feature_table[, -1]
+
+    number_of_species <- nrow(feature_table)
+
+    if (number_of_species > length(custom_palette)) {
+      N <- number_of_species / length(custom_palette)
+      custom_palette <- rep(custom_palette, times = N * 2)
+    }
+
+    # Prepare metadata
+    metadata <- read_delim(metadata_table_file, delim = ",") %>% as.data.frame
+    row.names(metadata) <- metadata[, samples_column]
+
+    # compute abundances from counts
+    abund_table <- count_to_rel_abundance(feature_table)
+
+    metadata <- metadata %>%
+                mutate(!!sym(group_column) := str_wrap(!!sym(group_column) %>%
+                         str_replace_all("_", " "), width = 10)
+                )
+    
+    # create plot
+    p <- make_plot(abund_table, metadata, custom_palette, publication_format, samples_column) +
+         facet_wrap(facet_by, nrow = 1, scales = "free_x", labeller = label_wrap_gen(width = 10)) +
+         theme(axis.text.x = element_text(angle = 90))
+
+    static_plot <- p
+    number_of_species <- p$data$Species %>% unique() %>% length()
+    # Don't save legend if the number of species to plot is greater than 30
+    if(number_of_species > 30) {
+      static_plot <- static_plot + theme(legend.position = "none")
+    }
+    
+    width <- 2 * nrow(metadata) # 2 inches per sample
+    if(width < 14) { width <- 14 } # Set minimum width to 14 inches
+    if(width > 50) { width <- 50 } # Cap plot width at 50 inches
+    # Save Static
+    ggsave(filename = glue("{output_prefix}_barplot{assay_suffix}.png"), 
+           plot = static_plot,
+           device = 'png', width = width,
+           height = 10, units = 'in', dpi = 300 , limitsize = FALSE)
+
+    # Save interactive
+    htmlwidgets::saveWidget(ggplotly(p), glue("{output_prefix}_barplot{assay_suffix}.html"), selfcontained = TRUE)
+  }
+  ```
+
+  **Custom Functions Used:**
+  - [make_plot()](#make_plot)
+  - [count_to_rel_abundance()](#count_to_rel_abundance)
+
+  **Function Parameter Definitions:**
+  - `metadata_table_file` - path to a file with samples as rows and columns describing each sample
+  - `feature_table_file` - path to a tab separated samples feature table i.e. species/functions 
+                           table with species/functions as the first column and samples as other columns.
+  - `feature_column` - a character string containing the feature column name in the feature table ['Species', 'species', 'KO_ID'], default: "species".
+  - `samples_column` - a character string specifying the column in `metadata` holding sample names, default: "sample_id"
+  - `group_column` - a character string specifying the column in `metadata` used to facet/group plots, default: "group"
+  - `output_prefix` - a character string specifying the unique name to add to the output file names 
+                      used to denote the data type/source, for example "unfiltered-kaiju_species"
+  - `assay_suffix` - a character string specifying the GeneLab assay suffix (default: "_GLmetagenomics")
+  - `publication_format` - a ggplot::theme object specifying a custom theme for plotting, from [Step 3c](#3c-set-global-variables)
+  - `custom_palette` - a vector of strings specifying a custom color palette for coloring plots, from [Step 3c](#3c-set-global-variables)
+
+  **Output Data:** 2 barplot files, `{output_prefix}_barplot{assay_suffix}.png` and `{output_prefix}_barplot{assay_suffix}.html`, containing relative abundance stacked bar plot as output from [make_plot](#make_plot)
+  
+</details>
+
+#### make_heatmap() <!-- omit in toc -->
+<details>
+  <summary>Creates heatmaps from a feature table file</summary>
+  
+  ```R
+  make_heatmap <- function(metadata_table_file, feature_table_file, 
+                           samples_column = "sample_id", group_column = "group", 
+                           output_prefix, assay_suffix = "_GLmetagenomics",
+                           custom_palette) {
+    # Prepare feature table
+    feature_table <- read_delim(feature_table_file) %>%  as.data.frame()
+    rownames(feature_table) <- feature_table[[1]]
+    feature_table <- feature_table[,-1] %>% as.matrix()
+    colnames(feature_table) <-  colnames(feature_table) %>% str_remove_all("barcode")
+
+    # Prepare metadata
+    metadata <- read_delim(metadata_table_file) %>% as.data.frame()
+    row.names(metadata) <- metadata[,samples_column] %>% str_remove_all("barcode")
+
+    # Get common samples and re-arrange feature table and metadata
+    common_samples <- intersect(colnames(feature_table), rownames(metadata))
+    feature_table <- feature_table[, common_samples]
+    metadata <- metadata[common_samples,]
+    metadata <- metadata %>% arrange(!!sym(group_column))
+
+    # Create column annotation
+    col_annotation <- as.data.frame(metadata)[, group_column, drop = FALSE]
+
+    # Calculate output plot width and height
+    number_of_samples <- ncol(feature_table)
+    width <- 1 * number_of_samples
+    if (width < 10) { width <- 10} # Set the minimum width to 10 inches
+    if (width > 100) { width <- 100} # Set the maximum width to 100 inches
+    number_of_features <- nrow(feature_table)
+    height <- 0.2 * number_of_features
+    if (height < 10) { height <- 10 } # Set the minimum height to 10 inches
+    if (height > 100) { height <- 100 } # Set the maximum height to 100 inches (highest that won't generate an error)
+
+    # Set colors by group
+    groups <- metadata[[group_column]] %>%  unique()
+    number_of_groups <-  length(groups)
+    my_colors <- custom_palette[1:number_of_groups]
+    names(my_colors) <- groups
+    annotation_colors  <- list(my_colors)
+    names(annotation_colors) <- group_column
+
+    # create heatmap
+    png(filename = glue("{output_prefix}_heatmap{assay_suffix}.png"), width = width,
+        height = height, units = "in", res = 300)
+    pheatmap(mat = feature_table[, rownames(col_annotation)],
+             cluster_cols = FALSE,
+             cluster_rows = FALSE,
+             col = colorRampPalette(c('white','red'))(255), 
+             angle_col = 0,
+             display_numbers = TRUE,
+             fontsize = 12,
+             annotation_col = col_annotation,
+             annotation_colors = annotation_colors,
+             number_format = "%.0f")
+    dev.off()
+
+    sorted_features <- rowSums(feature_table) %>% sort(decreasing = TRUE)
+
+    # Plot only top 50 features as it is often difficult to visualize all features at once
+    if(length(sorted_features) >= 50) {
+      top50 <- sorted_features[1:50]
+
+      png(filename = glue("{output_prefix}_top_50_heatmap{assay_suffix}.png"), width = width,
+          height = 12, units = "in", res=300)
+      pheatmap(mat = feature_table[names(top50), rownames(col_annotation)],
+               cluster_cols = FALSE, 
+               cluster_rows = FALSE,
+               col = colorRampPalette(c('white','red'))(255), 
+               angle_col = 90, 
+               display_numbers = TRUE, 
+               fontsize = 12, 
+               annotation_col = col_annotation,
+               annotation_colors = annotation_colors,
+               number_format = "%.0f")
+      dev.off()
+    }
+  }
+  ```
+
+  **Function Parameter Definitions:**
+  - `metadata_table_file` - path to a file with samples as rows and columns describing each sample
+  - `feature_table_file` - path to a tab separated samples feature table i.e. species/functions 
+                           table with species/functions as the first column and samples as other columns.
+  - `samples_column` - a character string specifying the column in `metadata` holding sample names, default: "sample_id"
+  - `group_column` - a character string specifying the column in `metadata` used to facet/group plots, default: "group"
+  - `output_prefix` - a character string specifying the unique name to add to the output file names 
+                      used to denote the data type/source, for example "unfiltered-kaiju_species"
+  - `assay_suffix` - a character string specifying the GeneLab assay suffix (default: "_GLmetagenomics")
+  - `custom_palette` - a vector of strings specifying a custom color palette for coloring plots, from [Step 3c](#3c-set-global-variables)
+
+  **Output Data:** 2 heatmap png files, `{output_prefix}_heatmap{assay_suffix}.png` and `{output_prefix}_top_50_heatmap{assay_suffix}.png`, of species/functions across samples from the input feature table
+  
+</details>
+
+#### process_taxonomy() <!-- omit in toc -->
+<details>
+  <summary>process a taxonomy assignment table</summary>
+
+  ```R
+  process_taxonomy <- function(taxonomy, prefix='\\w__') { 
+    
+    taxonomy <- apply(X = taxonomy, MARGIN = 2, FUN = as.character) 
+
+    # replace NAs and empty cells with "Other" and delete the `prefix` from taxonomy names
+    for (rank in colnames(taxonomy)) {
+      # Delete the taxonomy prefix
+      taxonomy[,rank] <- gsub(pattern = prefix, x = taxonomy[, rank],
+                              replacement = '')
+      indices <- which(is.na(taxonomy[,rank]))
+      taxonomy[indices, rank] <- rep(x = "Other", times=length(indices)) 
+      # Replace empty cells with "Other"
+      indices <- which(taxonomy[,rank] == "")
+      taxonomy[indices,rank] <- rep(x = "Other", times=length(indices))
+    }
+    # Replace underscore with space
+    taxonomy <- apply(X = taxonomy,MARGIN = 2,
+                      FUN =  gsub,pattern = "_",replacement = " ") %>% 
+      as.data.frame(stringsAsFactors = FALSE)
+    return(taxonomy)
+  }
+  ```
+
+  **Function Parameter Definitions:**
+  - `taxonomy` - a taxonomy assignment dataframe with ranks [Phylum, Class .. Species] as columns and taxonomy assignments as rows
+  - `prefix` - a regular expression specifying a character sequence to remove from taxon names
+
+  **Returns:** dataframe, `taxonomy`, containing reformatted taxonomy names
+
+</details>
+
+#### fix_names() <!-- omit in toc -->
+<details>
+  <summary>clean taxonomy names</summary>
+
+  ```R
+  fix_names<- function(taxonomy,stringToReplace="Other",suffix=";_"){
+    
+    for(index in seq_along(stringToReplace)){
+
+      for (taxa_index in seq_along(taxonomy)) {    
+        # Get the row indices of the current taxonomy column
+        # with rows matching the current string in `stringToReplace`
+        indices <- grep(x = taxonomy[,taxa_index], pattern = stringToReplace[index])
+        # Replace the value in that row with the value in the preceding (higher rank) column concatenated with `suffix`
+        taxonomy[indices,taxa_index] <-
+          paste0(taxonomy[indices,taxa_index-1],
+                rep(x = suffix, times=length(indices)))
+      }
+
+    }
+    return(taxonomy)
+  }
+  ```
+
+  **Function Parameter Definitions:**
+  - `taxonomy` -  taxonomy dataframe with taxonomy ranks as column names
+  - `stringToReplace` - a regex string specifying what to replace
+  - `suffix` - string specifying the replacement value
+
+  **Returns:** dataframe, `taxonomy`, containing reformatted/cleaned taxonomy names
+
+</details>
+
+#### read_taxonomy_table() <!-- omit in toc -->
+<details>
+  <summary>Read Assembly-based coverage annotation table</summary>
+
+  ```R
+  read_taxonomy_table <- function(df, sample_names){
+  
+    # Subset taxonomy portion (domain:species) of input table
+    # and replace empty/NA domain assignments with "Unclassified"
+    taxonomy_table <- df %>%
+      select(domain:species) %>%
+      mutate(domain=replace_na(domain, "Unclassified"))
+    
+    # Subset count table
+    sample_names <- get_samples(df, sample_names)
+    counts_table <- df %>% select(!!sample_names)
+
+    # Mutate taxonomy names
+    taxonomy_table  <- process_taxonomy(taxonomy_table)
+    taxonomy_table <- fix_names(taxonomy_table, "Other", ";_")
+
+    # Column bind taxonomy dataframe with species count dataframe
+    df <- bind_cols(taxonomy_table, counts_table)
+    
+    return(df)
+  }
+  ```
+
+  **Custom Functions Used:**
+  - [get_samples()](#get_samples)
+  - [process_taxonomy()](#process_taxonomy)
+  - [fix_names()](#fix_names)
+
+  **Function Parameter Definitions:**
+  - `df` - dataframe containing assembly-based coverage
+  - `sample_names` - a character vector of sample names to keep in the final dataframe
+
+  **Returns:** dataframe, `df`, containing cleaned taxonomy names and sample species count
+
+</details>
+
+#### get_samples() <!-- omit in toc -->
+<details>
+  <summary>retrieve sample names for which assemblies were generated</summary>
+
+  ```R
+  get_samples <- function(assembly_table_df, sample_names, end_col='species') {
+    # Sample columns are those located after the `end_col` column
+    cols <- colnames(assembly_table_df)
+    index <- which(cols == end_col)
+    start <- index + 1
+    end <- length(cols)
+    df_samples <- cols[start:end]
+    # Get common samples
+    sample_names <- intersect(df_samples, sample_names)
+
+    return(sample_names)
+  }
+  ```
+
+  **Function Parameter Definitions:**
+  - `assembly_table_df` - dataframe containing assembly-based coverage
+  - `sample_names` - a character vector of samples names to keep in the final dataframe
+  - `end_col` - string containing the name of the last taxonomy column, i.e. the column immediately before the first sample column, default: 'species'
+
+  **Returns:** a character vector, `sample_names`, of sample names that appear in both the assembly dataframe and the sample_names list
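+
+  A minimal base-R sketch of the column selection, assuming the sample columns run from the column after `species` to the end of the table. The column names and the `requested` vector are hypothetical.
+
+  ```R
+  # Hypothetical column names of an assembly-based coverage table
+  cols <- c("contig_id", "domain", "genus", "species",
+            "sample_1", "sample_2", "sample_3")
+  # Hypothetical sample names requested by the user
+  requested <- c("sample_1", "sample_3", "sample_9")
+
+  # Columns after the last taxonomy column are treated as sample columns
+  end_col_index <- which(cols == "species")
+  candidate_samples <- cols[(end_col_index + 1):length(cols)]
+
+  # Keep only the requested samples that are present in the table
+  kept <- intersect(candidate_samples, requested)
+  print(kept)
+  ```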
+
+</details>
+
+#### 3c. Set global variables
+
+```R
+# Define custom theme for plotting
+publication_format <- theme_bw() +
+  theme(panel.grid = element_blank()) +
+  theme(axis.ticks.length=unit(-0.15, "cm"),
+        axis.text.x=element_text(margin=ggplot2::margin(t=0.5,r=0,b=0,l=0,unit ="cm")),
+        axis.text.y=element_text(margin=ggplot2::margin(t=0,r=0.5,b=0,l=0,unit ="cm")), 
+        axis.title = element_text(size = 18,face ='bold.italic', color = 'black'), 
+        axis.text = element_text(size = 16,face ='bold', color = 'black'),
+        legend.position = 'right', legend.title = element_text(size = 15,face ='bold', color = 'black'),
+        legend.text = element_text(size = 14,face ='bold', color = 'black'),
+        strip.text =  element_text(size = 14,face ='bold', color = 'black'))
+
+# Define custom palette for plotting
+custom_palette <- c("#A6CEE3","#1F78B4","#B2DF8A","#33A02C","#FB9A99","#E31A1C","#FDBF6F", "#FF7F00",
+                    "#CAB2D6","#6A3D9A","#FF00FFFF","#B15928","#000000","#FFC0CBFF","#8B864EFF","#F0027F",
+                    "#666666","#1B9E77", "#E6AB02","#A6761D","#FFFF00FF","#FFFF99","#00FFFFFF",
+                    "#B2182B","#FDDBC7","#D1E5F0","#CC0033","#FF00CC","#330033",
+                    "#999933","#FF9933","#FFFAFAFF",colors()) 
+# Drop white colors
+custom_palette <- custom_palette[-c(21:23,
+                                    grep(pattern = "white|snow|azure|gray|#FFFAFAFF|aliceblue",
+                                         x = custom_palette, 
+                                         ignore.case = TRUE)
+                                   )
+                                ]                      
+```
+
+**Input Data:** 
+
+*No input data required*
+
+**Output Data:**
+
+- `publication_format` (a ggplot::theme object specifying a custom theme for plotting)
+- `custom_palette` (a vector of strings specifying a custom color palette for coloring plots)
+
+<br>  
+
+
+---
+
+## Assembly-based Processing
+
+### 4. Sample assembly
+```bash
+megahit -1 sample-1_R1_filtered_GLmetagenomics.fastq.gz -2 sample-1_R2_filtered_GLmetagenomics.fastq.gz \
+        -o sample-1-assembly -t NumberOfThreads --min-contig-len 500 > sample-1-assembly.log 2>&1
+```
+
+**Parameter Definitions:**  
+
+- `-1` and `-2` – specify the input forward and reverse reads (for single-end data, neither `-1` nor `-2` is used; single-end reads are passed to `-r` instead)
+- `-o` – specifies the output directory
+- `-t` – specifies the number of threads to use
+- `--min-contig-len` – specifies the minimum contig length to write out
+- `> sample-1-assembly.log 2>&1` – sends stdout/stderr to the log file
+
+
+**Input data:**
+
+- *_R[12]_filtered_GLmetagenomics.fastq.gz (filtered/trimmed reads from [Step 2b](#2b-trim-polyg) above)
+
+**Output data:**
+
+* sample-1-assembly/final.contigs.fa (assembly file)
+* sample-1-assembly.log (log file)
+
+<br>
+
+---
+
+### 5. Rename Contigs and Summarize Assemblies
+
+#### 5a. Rename Contig Headers
+
+```bash
+bit-rename-fasta-headers -i sample-assembly/final.contigs.fa \
+                         -w c_sample \
+                         -o sample-assembly_GLmetagenomics.fasta
+```
+
+**Parameter Definitions:**  
+
+- `-i` – Specifies the input fasta file.
+- `-w` – Specifies the wanted header prefix (a number will be appended for each contig). The prefix starts with a "c" so that headers never begin with a number, which can be problematic for downstream tools.
+- `-o` – Specifies the output fasta file.
+
+
+**Input data:**
+
+- sample-assembly/final.contigs.fa (assembly file from [Step 4](#4-sample-assembly))
+
+**Output files:**
+
+- **sample-assembly_GLmetagenomics.fasta** (contig-renamed assembly file)
+
+
+#### 5b. Summarize assemblies
+
+```bash
+bit-summarize-assembly -o assembly-summaries_GLmetagenomics.tsv \
+                       *-assembly_GLmetagenomics.fasta
+
+# test assembly fasta files for absence of contigs
+for assembly_file in *-assembly_GLmetagenomics.fasta; do 
+  sample_id="${assembly_file%-assembly_GLmetagenomics.fasta}"
+  if [ ! -s "${assembly_file}" ]; then 
+    printf "%s\tNo contigs assembled\n" "${sample_id}" >> Failed-assemblies_GLmetagenomics.tsv
+  fi
+done
+```
+
+**Parameter Definitions:**  
+
+- `-o` – Specifies the output summary table.
+- `*-assembly_GLmetagenomics.fasta` – Specifies the input assemblies to summarize, provided as positional arguments.
+
+
+**Input data:**
+
+- *-assembly_GLmetagenomics.fasta (contig-renamed assembly files from [Step 5a](#5a-rename-contig-headers))
+
+**Output files:**
+
+- **assembly-summaries_GLmetagenomics.tsv** (table of assembly summary statistics)
+- **Failed-assemblies_GLmetagenomics.tsv** (list of samples with no assembled contigs. Only present if no contigs were generated for at least one sample.)
+
+<br>
+
+---
+
+### 6. Gene prediction
+
+#### 6a. Generate Gene Predictions
+
+```bash
+prodigal -a sample-genes.faa \
+         -d sample-genes.fasta \
+         -f gff \
+         -p meta \
+         -c \
+         -q \
+         -o sample-genes_GLmetagenomics.gff \
+         -i sample-assembly_GLmetagenomics.fasta
+```
+
+**Parameter Definitions:**
+
+- `-a` – Specifies the output amino acid sequences file.
+- `-d` – Specifies the output nucleotide sequences file.
+- `-f` – Specifies the gene-calls output format, gff = GFF format.
+- `-p` – Specifies which mode to run the gene-caller in, meta = metagenomic mode.
+- `-c` – No incomplete genes reported (genes are not allowed to run off contig edges).
+- `-q` – Run in quiet mode (don’t output progress on each contig).
+- `-o` – Specifies the name of the output gene-calls file. 
+- `-i` – Specifies the input assembly file.
+
+**Input data:**
+
+- sample-assembly_GLmetagenomics.fasta (contig-renamed assembly file from [Step 5a](#5a-rename-contig-headers))
+
+**Output data:**
+
+* sample-genes.faa (gene-calls amino-acid fasta file)
+* sample-genes.fasta (gene-calls nucleotide fasta file)
+* **sample-genes_GLmetagenomics.gff** (gene-calls in general feature format)
+
+<br>
+
+#### 6b. Remove Line Wraps In Gene Prediction Output
+
+```bash
+bit-remove-wraps sample-genes.faa > sample-genes.faa.tmp 2> /dev/null
+mv sample-genes.faa.tmp sample-genes_GLmetagenomics.faa
+
+bit-remove-wraps sample-genes.fasta > sample-genes.fasta.tmp 2> /dev/null
+mv sample-genes.fasta.tmp sample-genes_GLmetagenomics.fasta
+```
+
+**Input Data:**
+
+- sample-genes.faa (gene-calls amino-acid fasta file, output from [Step 6a](#6a-generate-gene-predictions))
+- sample-genes.fasta (gene-calls nucleotide fasta file, output from [Step 6a](#6a-generate-gene-predictions))
+
+**Output Data:**
+
+- **sample-genes_GLmetagenomics.faa** (gene-calls amino-acid fasta file with line wraps removed)
+- **sample-genes_GLmetagenomics.fasta** (gene-calls nucleotide fasta file with line wraps removed)
+
+---
+
+### 7. Functional annotation
+
+> **Note:**  
+> The annotation process overwrites the same temporary directory by default. When running multiple processes at a time, it is necessary to specify a specific temporary directory with the `--tmp-dir` argument as shown below.
+
+
+#### 7a. Download reference database of HMM models
+
+> **Note:** This step only needs to be done once.
+
+```bash
+curl -LO ftp://ftp.genome.jp/pub/db/kofam/profiles.tar.gz
+curl -LO ftp://ftp.genome.jp/pub/db/kofam/ko_list.gz
+tar -xzvf profiles.tar.gz
+gunzip ko_list.gz 
+```
+
+#### 7b. Run KEGG annotation
+
+```bash
+exec_annotation -p profiles/ \
+                -k ko_list \
+                --cpu NumberOfThreads \
+                -f detail-tsv \
+                -o sample-KO-tab.tmp \
+                --tmp-dir sample-tmp-KO \
+                --report-unannotated \
+                sample-genes_GLmetagenomics.faa 
+```
+
+**Parameter Definitions:**
+- `-p` – Specifies the directory holding the downloaded reference HMMs.
+- `-k` – Specifies the downloaded reference list of KO (KEGG Orthology) terms. 
+- `--cpu` – Specifies the number of searches to run in parallel.
+- `-f` – Specifies the output format.
+- `-o` – Specifies the output file name.
+- `--tmp-dir` – Specifies the temporary directory to write to (needed if running more than one process concurrently, see Note above).
+- `--report-unannotated` – Specifies to generate an output for each entry, even when no KO is assigned.
+- `sample-genes_GLmetagenomics.faa` – Specifies the input file, provided as a positional argument. 
+
+
+
+**Input data:**
+
+- sample-genes_GLmetagenomics.faa (amino-acid fasta file, from [Step 6b](#6b-remove-line-wraps-in-gene-prediction-output))
+- profiles/ (reference directory holding the KO HMMs, downloaded in [Step 7a](#7a-download-reference-database-of-hmm-models))
+- ko_list (reference list of KOs to scan for, downloaded in [Step 7a](#7a-download-reference-database-of-hmm-models))
+
+**Output data:**
+
+- sample-KO-tab.tmp (table of KO annotations assigned to gene IDs)
+
+
+#### 7c. Filter KO Outputs
+*Filter KO outputs to retain only those passing the KO-specific score and top hits.*
+
+```bash
+bit-filter-KOFamScan-results -i sample-KO-tab.tmp \
+                             -o sample-annotations.tsv
+
+# removing temporary files
+rm -rf sample-tmp-KO/ sample-KO-tab.tmp
+```
+
+**Parameter Definitions:**  
+
+- `-i` – Specifies the input table.
+- `-o` – Specifies the output table.
+
+
+**Input data:**
+
+- sample-KO-tab.tmp (table of KO annotations assigned to gene IDs from [Step 7b](#7b-run-kegg-annotation))
+
+**Output data:**
+
+- sample-annotations.tsv (table of KO annotations assigned to gene IDs)
+
+<br>
+
+---
+
+### 8. Taxonomic classification
+
+#### 8a. Pull and Unpack Pre-built Reference DB
+
+```bash
+wget tbb.bio.uu.nl/bastiaan/CAT_prepare/CAT_prepare_20200618.tar.gz
+tar -xvzf CAT_prepare_20200618.tar.gz
+```
+
+#### 8b. Run Taxonomic Classification
+
+```bash
+CAT contigs -c sample-assembly_GLmetagenomics.fasta \
+            -d CAT_prepare_20200618/2020-06-18_database/ \
+            -t CAT_prepare_20200618/2020-06-18_taxonomy/ \
+            -p sample-genes_GLmetagenomics.faa \
+            -o sample-tax-out.tmp \
+            -n NumberOfThreads -r 3 \
+            --top 4 \
+            --I_know_what_Im_doing \
+            --no_stars
+```
+
+**Parameter Definitions:**  
+
+- `-c` – Specifies the input assembly fasta file.
+- `-d` – Specifies the CAT reference sequence database.
+- `-t` – Specifies the CAT reference taxonomy database.
+- `-p` – Specifies the input protein fasta file.
+- `-o` – Specifies the output file prefix.
+- `-n` – Specifies the number of CPU cores to use.
+- `-r` – Specifies the range parameter: hits whose bit-score falls within this percentage of the top hit's bit-score are considered when assigning taxonomy to each gene.
+- `--top` – Specifies the DIAMOND `--top` parameter: alignments scoring within this percentage of the best alignment are stored.
+- `--I_know_what_Im_doing` – Allows us to alter the `--top` parameter.
+- `--no_stars` – Suppress marking of suggestive taxonomic assignments.
+
+
+**Input data:**
+
+- CAT_prepare_20200618/2020-06-18_database/ (directory holding the CAT reference sequence database, output from [Step 8a](#8a-pull-and-unpack-pre-built-reference-db))
+- CAT_prepare_20200618/2020-06-18_taxonomy/ (directory holding the CAT reference taxonomy database, output from [Step 8a](#8a-pull-and-unpack-pre-built-reference-db))
+- sample-assembly_GLmetagenomics.fasta (assembly file from [Step 5a](#5a-rename-contig-headers))
+- sample-genes_GLmetagenomics.faa (gene-calls amino-acid fasta file from [Step 6b](#6b-remove-line-wraps-in-gene-prediction-output))
+
+**Output data:**
+
+- sample-tax-out.tmp.ORF2LCA.txt (gene-calls taxonomy file)
+- sample-tax-out.tmp.contig2classification.txt (contig taxonomy file)
+
+#### 8c. Add taxonomy info from taxids to genes
+
+```bash
+CAT add_names -i sample-tax-out.tmp.ORF2LCA.txt \
+              -o sample-gene-tax-out.tmp \
+              -t CAT_prepare_20200618/2020-06-18_taxonomy/ \
+              --only_official \
+              --exclude_scores
+```
+
+**Parameter Definitions:**  
+
+- `-i` – Specifies the input taxonomy file.
+- `-o` – Specifies the output file name.
+- `-t` – Specifies the CAT reference taxonomy database.
+- `--only_official` – Specifies to add only standard taxonomic ranks.
+- `--exclude_scores` – Specifies to exclude bit-score support scores in the lineage.
+
+**Input data:**
+
+- sample-tax-out.tmp.ORF2LCA.txt (gene-calls taxonomy file from [Step 8b](#8b-run-taxonomic-classification))
+- CAT_prepare_20200618/2020-06-18_taxonomy/ (directory holding the CAT reference taxonomy database, output from [Step 8a](#8a-pull-and-unpack-pre-built-reference-db))
+
+**Output data:**
+
+- sample-gene-tax-out.tmp (gene-calls taxonomy file with lineage info added)
+
+
+
+#### 8d. Add Taxonomy Info From Taxids To Contigs
+
+```bash
+CAT add_names -i sample-tax-out.tmp.contig2classification.txt \
+              -o sample-contig-tax-out.tmp \
+              -t CAT_prepare_20200618/2020-06-18_taxonomy/ \
+              --only_official \
+              --exclude_scores
+```
+
+**Parameter Definitions:**  
+
+- `-i` – Specifies the input taxonomy file.
+- `-o` – Specifies the output file name.
+- `-t` – Specifies the CAT reference taxonomy database.
+- `--only_official` – Specifies to add only standard taxonomic ranks.
+- `--exclude_scores` – Specifies to exclude bit-score support scores in the lineage.
+
+
+**Input data:**
+
+- sample-tax-out.tmp.contig2classification.txt (contig taxonomy file from [Step 8b](#8b-run-taxonomic-classification))
+- CAT_prepare_20200618/2020-06-18_taxonomy/ (directory holding the CAT reference taxonomy database, output from [Step 8a](#8a-pull-and-unpack-pre-built-reference-db))
+
+**Output data:**
+
+- sample-contig-tax-out.tmp (contig taxonomy file with lineage info added)
+
+
+#### 8e. Format Gene-level Output With awk and sed
+
+```bash
+awk -F $'\t' ' BEGIN { OFS=FS } { if ( $3 == "lineage" ) { print $1,$3,$5,$6,$7,$8,$9,$10,$11 } \
+    else if ( $2 == "ORF has no hit to database" || $2 ~ /^no taxid found/ ) \
+    { print $1,"NA","NA","NA","NA","NA","NA","NA","NA" } else { n=split($3,lineage,";"); \
+    print $1,lineage[n],$5,$6,$7,$8,$9,$10,$11 } } ' sample-gene-tax-out.tmp | \
+    sed 's/no support/NA/g' | sed 's/superkingdom/domain/' | sed 's/# ORF/gene_ID/' | \
+    sed 's/lineage/taxid/'  > sample-gene-tax-out.tsv
+```
+
+**Input Data:**
+
+* sample-gene-tax-out.tmp (gene-calls taxonomy file with lineage info added from [Step 8c](#8c-add-taxonomy-info-from-taxids-to-genes))
+
+**Output Data:**
+
+- sample-gene-tax-out.tsv (reformatted gene-calls taxonomy file with lineage info)
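+
+The key operation in the `awk` command above is `split($3, lineage, ";")`, which breaks the semicolon-delimited taxid lineage into an array so that its last element can be reported in the `taxid` column. A minimal sketch on a made-up lineage string (the taxids are illustrative input only, not output from this pipeline):
+
+```bash
+# Toy semicolon-delimited lineage of taxids (assumed example values)
+lineage_string="1;131567;2;1224"
+
+# split() returns the number of elements, so lineage[n] is the last taxid in the string
+last_taxid=$(printf '%s\n' "${lineage_string}" | awk '{ n = split($1, lineage, ";"); print lineage[n] }')
+
+echo "${last_taxid}"
+```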
+
+#### 8f. Format Contig-level Output With awk and sed
+
+```bash
+awk -F $'\t' ' BEGIN { OFS=FS } { if ( $2 == "classification" ) { print $1,$4,$6,$7,$8,$9,$10,$11,$12 } \
+    else if ( $2 == "no taxid assigned" ) { print $1,"NA","NA","NA","NA","NA","NA","NA","NA" } \
+    else { n=split($4,lineage,";"); print $1,lineage[n],$6,$7,$8,$9,$10,$11,$12 } } ' sample-contig-tax-out.tmp | \
+    sed 's/no support/NA/g' | sed 's/superkingdom/domain/' | sed 's/^# contig/contig_ID/' | \
+    sed 's/lineage/taxid/' > sample-contig-tax-out.tsv
+
+  # clearing intermediate files
+rm sample*.tmp*
+```
+
+**Input data:**
+
+- sample-contig-tax-out.tmp (contig taxonomy file with lineage info added from [Step 8d](#8d-add-taxonomy-info-from-taxids-to-contigs))
+
+
+**Output data:**
+
+- sample-contig-tax-out.tsv (reformatted contig taxonomy file with lineage info)
+
+<br>
+
+---
+
+### 9. Read-Mapping
+
+#### 9a. Build reference index
+
+```bash
+bowtie2-build sample-assembly_GLmetagenomics.fasta sample-assembly-bt-index
+```
+
+**Parameter Definitions:**  
+
+- `sample-assembly_GLmetagenomics.fasta` – first positional argument specifies the input assembly
+- `sample-assembly-bt-index` – second positional argument specifies the prefix of the output index files
+
+**Input Data:**
+
+- `sample-assembly_GLmetagenomics.fasta` (contig-renamed assembly file, output from [Step 5a](#5a-rename-contig-headers))
+
+**Output Data:**
+
+- `sample-assembly-bt-index*` - the bowtie2 index files
+
+#### 9b. Align Reads to Sample Assembly
+
+```bash
+bowtie2 --mm --quiet --threads NumberOfThreads \
+        -x sample-assembly-bt-index \
+        -1 sample_R1_filtered_GLmetagenomics.fastq.gz \
+        -2 sample_R2_filtered_GLmetagenomics.fastq.gz \
+        --no-unal > sample.sam  2> sample-mapping-info_GLmetagenomics.txt 
+```
+
+**Parameter Definitions:**  
+
+- `--mm` - Use memory-mapped I/O to load the index.
+- `--quiet` - Print only error messages.
+- `--threads` - Number of parallel processing threads.
+- `-x` - specifies the prefix of the reference index files to map to, generated by bowtie2-build
+- `-1` – specifies the forward reads to map
+- `-2` – specifies the reverse reads to map
+- `--no-unal` - Suppress SAM records for reads that did not align.
+- `> sample.sam` - Redirects the output of the map reads command to a SAM file.
+- `2> sample-mapping-info_GLmetagenomics.txt` – capture the printed summary results in a log file
+
+**Input Data**
+
+- sample-assembly-bt-index (bowtie2 index files, output from [Step 9a](#9a-build-reference-index))
+- *_R[12]_filtered_GLmetagenomics.fastq.gz (filtered and trimmed sample reads, output from [Step 2b](#2b-trim-polyg))
+
+**Output Data**
+
+- sample.sam (reads aligned to sample assembly in SAM format)
+- **sample-mapping-info_GLmetagenomics.txt** (read mapping information)
+
+
+#### 9c. Sort Assembly Alignments
+
+```bash
+# Sort the SAM file and convert it to BAM
+samtools sort --threads NumberOfThreads \
+              -o sample_GLmetagenomics.bam \
+              sample.sam > sample_sort.log 2>&1
+```
+
+**Parameter Definitions:**
+
+*samtools sort*
+- `--threads` - Number of parallel processing threads to use.
+- `-o` - Specifies the output file for the sorted aligned reads.
+- `sample.sam` - Positional argument specifying the input SAM file.
+- `> sample_sort.log 2>&1` - Redirects the standard output and standard error to a separate file.
+
+**Input Data:**
+
+- sample.sam (reads aligned to sample assembly, output from [Step 9b](#9b-align-reads-to-sample-assembly))
+
+**Output Data:**
+
+- **sample_GLmetagenomics.bam** (sorted mapping to sample assembly, in BAM format)
+
+<br>
+
+---
+
+### 10. Get Coverage Information and Filter Based On Detection
+> **Note:**  
+> “Detection” is a metric of what proportion of a reference sequence recruited reads (see [here](https://merenlab.org/2017/05/08/anvio-views/#detection)). Filtering based on detection is one way of helping to mitigate non-specific read-recruitment.
+
+#### 10a. Filter Coverage Levels Based On Detection
+
+```bash
+# pileup.sh is part of the BBTools (BBMap) package
+pileup.sh -in sample_GLmetagenomics.bam \
+          fastaorf=sample-genes_GLmetagenomics.fasta \
+          outorf=sample-gene-cov-and-det.tmp \
+          out=sample-contig-cov-and-det.tmp
+```
+
+**Parameter Definitions:**  
+
+- `-in` – Specifies the input BAM file.
+- `fastaorf=` – Specifies the input gene-calls nucleotide fasta file.
+- `outorf=` – Specifies the output gene-coverage tsv file name.
+- `out=` – Specifies the output contig-coverage tsv file name.
+
+**Input Data:**
+
+- sample_GLmetagenomics.bam (sorted mapping to sample assembly BAM file, output from [Step 9c](#9c-sort-assembly-alignments))
+- sample-genes_GLmetagenomics.fasta (gene-calls nucleotide fasta file, output from [Step 6b](#6b-remove-line-wraps-in-gene-prediction-output))
+
+
+**Output Data:**
+
+- sample-gene-cov-and-det.tmp (gene-coverage tsv file)
+- sample-contig-cov-and-det.tmp (contig-coverage tsv file)
+
+
+#### 10b. Filter Gene and Contig Coverage Based On Detection
+
+> *The following commands set the coverage to 0 for genes and contigs with 50% detection or less (as defined above), then parse the tables to retain only the gene or contig IDs and their respective coverage.*
+
+```bash
+# Filtering gene coverage
+grep -v "#" sample-gene-cov-and-det.tmp | \
+awk -F $'\t' ' BEGIN { OFS=FS } { if ( $10 <= 0.5 ) $4 = 0 } \
+     { print $1,$4 } ' > sample-gene-cov.tmp
+
+cat <( printf "gene_ID\tcoverage\n" ) sample-gene-cov.tmp > sample-gene-coverages.tsv
+
+#Filtering contig coverage based on requiring 50% detection and parsing down to just contig ID and coverage:
+grep -v "#" sample-contig-cov-and-det.tmp | awk -F $'\t' ' BEGIN { OFS=FS } { if ( $5 <= 50 ) $2 = 0 } \
+     { print $1,$2 } ' > sample-contig-cov.tmp
+
+cat <( printf "contig_ID\tcoverage\n" ) sample-contig-cov.tmp > sample-contig-coverages.tsv
+
+# removing intermediate files
+rm sample-*.tmp
+```
+
+**Input data:**
+
+- sample-gene-cov-and-det.tmp (temporary gene-coverage tsv file, output from [Step 10a](#10a-filter-coverage-levels-based-on-detection))
+- sample-contig-cov-and-det.tmp (temporary contig-coverage tsv file, output from [Step 10a](#10a-filter-coverage-levels-based-on-detection))
+
+**Output data:**
+
+* sample-gene-coverages.tsv (table with gene-level coverages)
+* sample-contig-coverages.tsv (table with contig-level coverages)
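+
+To make the detection filter concrete, the sketch below applies the same contig-level `awk` logic to a small made-up table. It assumes, as the command above does, that column 2 holds the average coverage and column 5 the percent of the contig covered; the contig names and values are invented for illustration.
+
+```bash
+# Two made-up contigs: ID, average coverage, length, GC, percent covered (detection)
+filtered=$(printf 'contig_1\t12.5\t5000\t0.45\t98.2\ncontig_2\t3.1\t4000\t0.51\t20.0\n' | \
+    awk -F '\t' ' BEGIN { OFS=FS } { if ( $5 <= 50 ) $2 = 0 } { print $1,$2 } ')
+
+echo "${filtered}"
+# contig_1 keeps its coverage; contig_2 is set to 0 because only 20% of it was covered
+```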
+
+<br>
+
+---
+
+### 11. Combine Gene-level Coverage, Taxonomy, and Functional Annotations For Each Sample
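+> **Note:**  
+> Uses standard Unix commands (`paste`, `sort`, `cut`, `head`, `tail`, and `cat`) to combine gene-level coverage, taxonomy, and functional annotations into one table for each sample. Because `paste` joins rows by position, all three tables must contain the same set of gene IDs so that the rows line up after sorting.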
+
+```bash
+paste <( tail -n +2 sample-gene-coverages.tsv | sort -V -k 1 ) \
+      <( tail -n +2 sample-annotations.tsv | sort -V -k 1 | cut -f 2- ) \
+      <( tail -n +2 sample-gene-tax-out.tsv | sort -V -k 1 | cut -f 2- ) \
+      > sample-gene-tab.tmp
+
+paste <( head -n 1 sample-gene-coverages.tsv ) \
+      <( head -n 1 sample-annotations.tsv | cut -f 2- ) \
+      <( head -n 1 sample-gene-tax-out.tsv | cut -f 2- ) \
+      > sample-header.tmp
+
+cat sample-header.tmp sample-gene-tab.tmp > sample-gene-coverage-annotation-and-tax_GLmetagenomics.tsv
+
+  # removing intermediate files
+rm sample*tmp sample-gene-coverages.tsv sample-annotations.tsv sample-gene-tax-out.tsv
+```
+
+**Input data:**
+
+* sample-gene-coverages.tsv (table with gene-level coverages from [Step 10b](#10b-filter-gene-and-contig-coverage-based-on-detection))
+* sample-annotations.tsv (table of KO annotations assigned to gene IDs from [Step 7c](#7c-filter-ko-outputs))
+* sample-gene-tax-out.tsv (gene-level taxonomic classifications from [Step 8e](#8e-format-gene-level-output-with-awk-and-sed))
+
+
+**Output data:**
+
+* **sample-gene-coverage-annotation-and-tax_GLmetagenomics.tsv** (table with combined gene coverage, annotation, and taxonomy info)
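+
+Since `paste` joins rows purely by position, a quick check that the input tables list the same gene IDs can catch misaligned rows before they are combined. The sketch below is not part of the documented pipeline, and the helper name `same_ids` is hypothetical.
+
+```bash
+# Hypothetical helper: succeeds only if two TSV files contain the same set of IDs
+# in their first column (the header line is skipped)
+same_ids() {
+    ids_a=$(tail -n +2 "$1" | cut -f 1 | sort)
+    ids_b=$(tail -n +2 "$2" | cut -f 1 | sort)
+    [ "${ids_a}" = "${ids_b}" ]
+}
+
+# Example usage with the per-sample tables named in this step:
+# same_ids sample-gene-coverages.tsv sample-annotations.tsv || echo "gene IDs differ" >&2
+```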
+
+<br>
+
+---
+
+### 12. Combine Contig-level Coverage and Taxonomy For Each Sample
+> **Note:**  
+> Uses standard Unix commands (`paste`, `sort`, `cut`, `head`, `tail`, and `cat`) to combine contig-level coverage and taxonomy into one table for each sample.
+
+```bash
+paste <( tail -n +2 sample-contig-coverages.tsv | sort -V -k 1 ) \
+      <( tail -n +2 sample-contig-tax-out.tsv | sort -V -k 1 | cut -f 2- ) \
+      > sample-contig.tmp
+
+paste <( head -n 1 sample-contig-coverages.tsv ) \
+      <( head -n 1 sample-contig-tax-out.tsv | cut -f 2- ) \
+      > sample-contig-header.tmp
+      
+cat sample-contig-header.tmp sample-contig.tmp > sample-contig-coverage-and-tax_GLmetagenomics.tsv
+
+# removing intermediate files
+rm sample*tmp sample-contig-coverages.tsv sample-contig-tax-out.tsv
+```
+
+**Input data:**
+
+- sample-contig-coverages.tsv (table with contig-level coverages from [Step 10b](#10b-filter-gene-and-contig-coverage-based-on-detection))
+- sample-contig-tax-out.tsv (contig-level taxonomic classifications from [Step 8f](#8f-format-contig-level-output-with-awk-and-sed))
+
+
+**Output data:**
+
+- **sample-contig-coverage-and-tax_GLmetagenomics.tsv** (table with combined contig coverage and taxonomy info)
+
+<br>
+
+---
+
+### 13. Generating normalized, gene- and contig-level coverage summary tables of KO-annotations and taxonomy across samples
+
+> **Notes**  
+> * To combine across samples to generate these summary tables, we need the same "units". This is done for annotations based on the assigned KO terms, and all non-annotated functions are included together as "Not annotated". It is done for taxonomic classifications based on taxids (full lineages included in the table), and any not classified are included together as "Not classified". 
+> * The values we are working with are coverage per gene (so they are number of bases recruited to the gene normalized by the length of the gene). These have been normalized by making the total coverage of a sample 1,000,000 and setting each individual gene-level coverage its proportion of that 1,000,000 total. So basically percent, but out of 1,000,000 instead of 100 to make the numbers more friendly. 
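+
+As a worked example of this normalization, the sketch below scales three made-up gene coverages so that they sum to 1,000,000. It only illustrates the arithmetic described in the note; it is not the implementation used inside `bit-GL-combine-KO-and-tax-tables`.
+
+```bash
+# Made-up coverages for three genes (total coverage = 10)
+cpm=$(printf 'gene_1\t2\ngene_2\t3\ngene_3\t5\n' | \
+    awk -F '\t' ' BEGIN { OFS=FS }
+         { id[NR] = $1; cov[NR] = $2; total += $2 }
+         END { for ( i = 1; i <= NR; i++ ) print id[i], cov[i] * 1000000 / total } ')
+
+echo "${cpm}"
+# gene_1 holds 2/10 of the total coverage, so its normalized value is 200000
+```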
+
+#### 13a. Generate Gene-level Coverage Summary Tables
+
+
+```bash
+bit-GL-combine-KO-and-tax-tables *-gene-coverage-annotation-and-tax_GLmetagenomics.tsv \
+                                 -o Combined
+# add assay specific suffix
+mv Combined-gene-level-KO-function-coverages-CPM.tsv Combined-gene-level-KO-function-coverages-CPM_GLmetagenomics.tsv
+mv Combined-gene-level-taxonomy-coverages-CPM.tsv Combined-gene-level-taxonomy-coverages-CPM_GLmetagenomics.tsv
+mv Combined-gene-level-KO-function-coverages.tsv Combined-gene-level-KO-function-coverages_GLmetagenomics.tsv
+mv Combined-gene-level-taxonomy-coverages.tsv Combined-gene-level-taxonomy-coverages_GLmetagenomics.tsv
+```
+
+**Parameter Definitions:**  
+
+- `*-gene-coverage-annotation-and-tax_GLmetagenomics.tsv` - Positional arguments specifying the input tsv files, can be provided as a space-delimited list of files, or with wildcards like above.
+
+- `-o` – Specifies the output file prefix.
+
+
+**Input data:**
+
+- *-gene-coverage-annotation-and-tax_GLmetagenomics.tsv (tables with combined gene coverage, annotation, and taxonomy info generated for individual samples from [Step 11](#11-combine-gene-level-coverage-taxonomy-and-functional-annotations-for-each-sample))
+
+**Output data:**
+
+- **Combined-gene-level-KO-function-coverages-CPM_GLmetagenomics.tsv** (table with all samples combined based on KO annotations; normalized to coverage per million genes covered)
+- **Combined-gene-level-taxonomy-coverages-CPM_GLmetagenomics.tsv** (table with all samples combined based on gene-level taxonomic classifications; normalized to coverage per million genes covered)
+- **Combined-gene-level-KO-function-coverages_GLmetagenomics.tsv** (table with all samples combined based on KO annotations)
+- **Combined-gene-level-taxonomy-coverages_GLmetagenomics.tsv** (table with all samples combined based on gene-level taxonomic classifications)
+
+
+#### 13b. Generate Contig-level Coverage Summary Tables
+
+```bash
+bit-GL-combine-contig-tax-tables *-contig-coverage-and-tax_GLmetagenomics.tsv -o Combined
+```
+
+**Parameter Definitions:**  
+
+- `*-contig-coverage-and-tax_GLmetagenomics.tsv` - Positional arguments specifying the input tsv files, can be provided as a space-delimited list of files, or with wildcards like above.
+- `-o` – Specifies the output file prefix.
+
+
+**Input data:**
+
+* \*-contig-coverage-and-tax_GLmetagenomics.tsv (tables with combined contig coverage and taxonomy info generated for individual samples from [Step 12](#12-combine-contig-level-coverage-and-taxonomy-for-each-sample))
+
+**Output data:**
+
+* **Combined-contig-level-taxonomy-coverages-CPM_GLmetagenomics.tsv** (table with all samples combined based on contig-level taxonomic classifications; normalized to coverage per million genes covered)
+* **Combined-contig-level-taxonomy-coverages_GLmetagenomics.tsv** (table with all samples combined based on contig-level taxonomic classifications)
+
+<br>
+
+---
+
+### 14. **M**etagenome-**A**ssembled **G**enome (MAG) recovery
+
+#### 14a. Bin contigs
+
+```bash
+jgi_summarize_bam_contig_depths --outputDepth sample-metabat-assembly-depth_GLmetagenomics.tsv \
+                                --percentIdentity 97 \
+                                --minContigLength 1000 \
+                                --minContigDepth 1.0  \
+                                --referenceFasta sample-assembly_GLmetagenomics.fasta \
+                                sample_GLmetagenomics.bam
+
+metabat2  --inFile sample-assembly_GLmetagenomics.fasta \
+          --outFile sample \
+          --abdFile sample-metabat-assembly-depth_GLmetagenomics.tsv \
+          -t NumberOfThreads
+
+mkdir sample-bins
+mv sample*bin*.fasta sample-bins
+zip -r sample-bins_GLmetagenomics.zip sample-bins
+```
+
+**Parameter Definitions:**  
+
+*jgi_summarize_bam_contig_depths*
+-  `--outputDepth` – Specifies the output depth file name.
+-  `--percentIdentity` – Minimum end-to-end percent identity of a mapped read to be included.
+-  `--minContigLength` – Minimum contig length to include.
+-  `--minContigDepth` – Minimum contig depth to include.
+-  `--referenceFasta` – Specifies the input assembly fasta file.
+-  `sample_GLmetagenomics.bam` – Input alignment BAM file, specified as a positional argument.
+
+*metabat2*
+-  `--inFile` - Specifies the input assembly fasta file.
+-  `--outFile` - Specifies the prefix of the identified bins output files.
+-  `--abdFile` - The depth file generated by the previous `jgi_summarize_bam_contig_depths` command.
+-  `-t` - Number of parallel processing threads to use.
+
+
+**Input data:**
+
+- sample-assembly_GLmetagenomics.fasta (assembly fasta file created in [Step 5a](#5a-rename-contig-headers))
+- sample_GLmetagenomics.bam (sorted BAM file created in [Step 9c](#9c-sort-assembly-alignments))
+
+**Output data:**
+
+- **sample-metabat-assembly-depth_GLmetagenomics.tsv** (tab-delimited summary of coverages)
+- sample-bins/sample-bin\*.fasta (fasta files of recovered bins)
+- **sample-bins_GLmetagenomics.zip** (zip file containing fasta files of recovered bins)
+
+#### 14b. Bin quality assessment
+> Utilizes the default `checkm` database, available as [checkm_data_2015_01_16.tar.gz](https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz).
+
+```bash
+checkm lineage_wf -f bins-overview_GLmetagenomics.tsv \
+                  --tab_table \
+                  -x fa \
+                  ./ \
+                  checkm-output-dir
+```
+
+**Parameter Definitions:**  
+
+-  `lineage_wf` – Specifies the workflow being utilized.
+-  `-f` – Specifies the output summary file name.
+-  `--tab_table` – Specifies the output summary file should be a tab-delimited table.
+-  `-x` – Specifies the extension that is on the bin fasta files that are being assessed.
+-  `./` – Specifies the directory holding the bins, provided as a positional argument.
+-  `checkm-output-dir` – Specifies the primary checkm output directory, provided as a positional argument.
+
+**Input data:**
+
+- sample-bins/sample-bin\*.fasta (bin fasta files generated in [Step 14a](#14a-bin-contigs))
+
+**Output data:**
+
+- **bins-overview_GLmetagenomics.tsv** (tab-delimited file with quality estimates per bin)
+- checkm-output-dir/ (directory holding detailed checkm outputs)
+
+#### 14c. Filter MAGs
+
+```bash
+cat <( head -n 1 bins-overview_GLmetagenomics.tsv ) \
+    <( awk -F $'\t' ' $12 >= 90 && $13 <= 10 && $14 == 0 ' bins-overview_GLmetagenomics.tsv | sed 's/bin./MAG-/' ) \
+    > checkm-MAGs-overview.tsv
+    
+# copying bins into a MAGs directory in order to run tax classification
+awk -F $'\t' ' $12 >= 90 && $13 <= 10 && $14 == 0 ' bins-overview_GLmetagenomics.tsv | cut -f 1 > MAG-bin-IDs.tmp
+
+mkdir MAGs
+for ID in $(cat MAG-bin-IDs.tmp)
+do
+    MAG_ID=$(echo $ID | sed 's/bin./MAG-/')
+    cp ${ID}.fasta MAGs/${MAG_ID}.fasta
+done
+
+for SAMPLE in $(cat MAG-bin-IDs.tmp | sed 's/-bin.*//' | sort -u);
+do
+  mkdir ${SAMPLE}-MAGs
+  # copy (rather than move) so that MAGs/ stays populated for taxonomic classification
+  cp MAGs/${SAMPLE}-*MAG*.fasta ${SAMPLE}-MAGs
+  zip -r ${SAMPLE}-MAGs_GLmetagenomics.zip ${SAMPLE}-MAGs
+done
+```
+
+**Input data:**
+
+- bins-overview_GLmetagenomics.tsv (tab-delimited file with quality estimates per bin from [Step 14b](#14b-bin-quality-assessment))
+
+**Output data:**
+
+- checkm-MAGs-overview.tsv (tab-delimited file with quality estimates per MAG)
+- MAGs/\*.fasta (directory holding high-quality MAGs)
+- **\*-MAGs_GLmetagenomics.zip** (zip files containing directories of high-quality MAGs)
+
+
+#### 14d. MAG taxonomic classification
+> Uses the default `gtdbtk` database, set up with the program's `download.sh` command.
+
+```bash
+gtdbtk classify_wf --genome_dir MAGs/ \
+                   -x fa \
+                   --out_dir gtdbtk-output-dir  \
+                   --skip_ani_screen
+```
+
+**Parameter Definitions:**  
+
+-  `classify_wf` – Specifies the workflow being utilized.
+-  `--genome_dir` – Specifies the directory holding the MAGs to classify.
+-  `-x` – Specifies the extension that is on the MAG fasta files that are being taxonomically classified.
+-  `--out_dir` – Specifies the output directory name.
+-  `--skip_ani_screen` – Skips the ANI screening step, which would otherwise pre-classify genomes using mash and skani.
+
+**Input data:**
+
+* MAGs/\*.fasta (directory holding high-quality MAGs from [Step 14c](#14c-filter-mags))
+
+**Output data:**
+
+* gtdbtk-output-dir/gtdbtk.\*.summary.tsv (files with assigned taxonomy and info)
+
+#### 14e. Generate Overview Table Of All MAGs
+
+```bash
+# combine summaries
+for MAG in $(cut -f 1 assembly-summaries_GLmetagenomics.tsv | tail -n +2); do
+
+    grep -w -m 1 "^${MAG}" checkm-MAGs-overview.tsv | cut -f 12,13,14 \
+        >> checkm-estimates.tmp
+
+    grep -w "^${MAG}" gtdbtk-output-dir/gtdbtk.*.summary.tsv | \
+    cut -f 2 | sed 's/^.__//' | \
+    sed 's/;.__/\t/g' | \
+    awk 'BEGIN{ OFS=FS="\t" } { for (i=1; i<=NF; i++) if ( $i ~ /^ *$/ ) $i = "NA" }; 1' \
+        >> gtdb-taxonomies.tmp
+
+done
+
+# Add headers
+cat <(printf "est. completeness\test. redundancy\test. strain heterogeneity\n") checkm-estimates.tmp \
+    > checkm-estimates-with-headers.tmp
+
+cat <(printf "domain\tphylum\tclass\torder\tfamily\tgenus\tspecies\n") gtdb-taxonomies.tmp \
+    > gtdb-taxonomies-with-headers.tmp
+
+paste assembly-summaries_GLmetagenomics.tsv \
+checkm-estimates-with-headers.tmp \
+gtdb-taxonomies-with-headers.tmp \
+    > MAGs-overview.tmp
+
+# Ordering by taxonomy
+head -n 1 MAGs-overview.tmp > MAGs-overview-header.tmp
+
+tail -n +2 MAGs-overview.tmp | sort -t $'\t' -k 14,20 > MAGs-overview-sorted.tmp
+
+cat MAGs-overview-header.tmp MAGs-overview-sorted.tmp \
+    > MAGs-overview_GLmetagenomics.tsv
+```
+
+**Input data:**
+
+- assembly-summaries_GLmetagenomics.tsv (table of assembly summary statistics from [Step 5b](#5b-summarize-assemblies))
+- MAGs/\*.fasta (directory holding high-quality MAGs from [Step 14c](#14c-filter-mags))
+- checkm-MAGs-overview.tsv (tab-delimited file with quality estimates per MAG from [Step 14c](#14c-filter-mags))
+- gtdbtk-output-dir/gtdbtk.\*.summary.tsv (directory of files with assigned taxonomy and info from [Step 14d](#14d-mag-taxonomic-classification))
+
+**Output data:**
+
+* **MAGs-overview_GLmetagenomics.tsv** (a tab-delimited overview of all recovered MAGs)
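+
+The `sed` and `awk` calls in the loop above turn a GTDB-style lineage string into tab-separated rank columns and fill empty ranks with `NA`. A minimal sketch on a made-up classification string (the lineage is illustrative, not pipeline output; the `\t` in the `sed` replacement assumes GNU sed, as the command above does):
+
+```bash
+# Made-up GTDB-style lineage with an empty species rank
+lineage="d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__"
+
+# Strip the leading rank prefix, turn the remaining ";x__" separators into tabs,
+# then replace any empty field with NA
+parsed=$(printf '%s\n' "${lineage}" | sed 's/^.__//' | sed 's/;.__/\t/g' | \
+    awk 'BEGIN{ OFS=FS="\t" } { for (i=1; i<=NF; i++) if ( $i ~ /^ *$/ ) $i = "NA" }; 1')
+
+echo "${parsed}"
+```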
+
+<br>
+
+---
+
+### 15. Generate MAG-level functional summary overview
+
+#### 15a. Get KO annotations per MAG
+> This utilizes the helper script [`parse-MAG-annots.py`](https://github.com/nasa/GeneLab_Metagenomics_Workflow/blob/DEV/bin/parse-MAG-annots.py)
+
+```bash
+for file in MAGs/*.fasta
+do
+
+    MAG_ID=$( echo ${file} | cut -f 2 -d "/" | sed 's/\.fasta$//' )
+    sample_ID=$( echo ${MAG_ID} | sed 's/-MAG-[0-9]*$//' )
+
+    grep "^>" ${file} | tr -d ">" > ${MAG_ID}-contigs.tmp
+
+    python parse-MAG-annots.py -i annotations-and-taxonomy/${sample_ID}-gene-coverage-annotation-and-tax_GLmetagenomics.tsv \
+                               -w ${MAG_ID}-contigs.tmp -M ${MAG_ID} \
+                               -o MAG-level-KO-annotations_GLmetagenomics.tsv
+
+    rm ${MAG_ID}-contigs.tmp
+
+done
+```
+
+**Parameter Definitions:**  
+
+- `-i` – Specifies the input sample TSV file containing sample coverage, annotation, and taxonomy info.
+- `-w` – Specifies the appropriate temporary file holding all the contigs in the current MAG.
+- `-M` – Specifies the current MAG unique identifier.
+- `-o` – Specifies the output file name.
+
+**Input data:**
+
+- \*-gene-coverage-annotation-and-tax_GLmetagenomics.tsv (tables with combined gene coverage, annotation, and taxonomy info generated for individual samples, output from [Step 11](#11-combine-gene-level-coverage-taxonomy-and-functional-annotations-for-each-sample))
+- MAGs/\*.fasta (directory holding high-quality MAGs from [Step 14c](#14c-filter-mags))
+
+**Output data:**
+
+* **MAG-level-KO-annotations_GLmetagenomics.tsv** (tab-delimited table holding MAGs and their KO annotations)
+
+
+#### 15b. Summarize KO annotations with KEGG-Decoder
+
+```bash
+KEGG-decoder -v interactive \
+             -i MAG-level-KO-annotations_GLmetagenomics.tsv \
+             -o MAG-KEGG-Decoder-out_GLmetagenomics.tsv
+```
+
+**Parameter Definitions:**  
+
+- `-v interactive` – Specifies to create an interactive html output.
+- `-i` – Specifies the input tab-delimited table holding MAGs and their KO annotations.
+- `-o` – Specifies the output table.
+
+**Input data:**
+
+- MAG-level-KO-annotations_GLmetagenomics.tsv (tab-delimited table holding MAGs and their KO annotations, output from [Step 15a](#15a-get-ko-annotations-per-mag))
+
+**Output data:**
+
+- **MAG-KEGG-Decoder-out_GLmetagenomics.tsv** (tab-delimited table holding MAGs and, for each, the proportion of genes it holds that are known to be required for specific pathways/metabolisms)
+- **MAG-KEGG-Decoder-out_GLmetagenomics.html** (interactive heatmap html file of the above output table)
+
+<br>
+
+### 16. Filtering and Visualization of Contig- and Gene-taxonomy and Gene-function Outputs
+
+#### 16a. Gene-level Taxonomy Heatmaps
+
+```R
+assembly_table <- "Combined-gene-level-taxonomy-coverages-CPM_GLmetagenomics.tsv"
+assembly_summary <- "assembly-summaries_GLmetagenomics.tsv"
+metadata_table <- "/path/to/sample/metadata"
+
+# Read in assembly summary table
+overview_table <- read_delim(assembly_summary, comment="#") %>%
+  select(
+    where(~all(!is.na(.)))
+  )
+
+col_names <- names(overview_table) %>% str_remove_all("-assembly")
+sample_order <- col_names[-1] %>% sort()
+
+# deduplicate rows by summing together species values
+df <- read_delim(assembly_table, comment = "#")
+sample_order <- get_samples(df, sample_order)
+
+table2write <- read_taxonomy_table(df, sample_order) %>%
+               select(species, !!sample_order) %>%
+               group_by(species) %>%
+               summarise(across(everything(), sum)) %>%
+               filter(species != "Unclassified;_;_;_;_;_;_") %>%
+               as.data.frame()
+
+# Write out gene taxonomy table
+write_tsv(x = table2write, file = "Combined-gene-level-taxonomy_unfiltered_GLmetagenomics.tsv")
+
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Combined-gene-level-taxonomy_unfiltered_GLmetagenomics.tsv", 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Combined-gene-level-taxonomy_unfiltered", 
+             assay_suffix = "_GLmetagenomics", 
+             custom_palette = custom_palette)
+```
+
+**Custom Functions Used:**
+- [get_samples()](#get_samples)
+- [read_taxonomy_table()](#read_taxonomy_table)
+- [make_heatmap()](#make_heatmap)
+
+**Input data:**
+- assembly-summaries_GLmetagenomics.tsv (table of assembly summary statistics, output from [Step 5b](#5b-summarize-assemblies))
+- Combined-gene-level-taxonomy-coverages-CPM_GLmetagenomics.tsv (table with all samples combined based on gene-level 
+  taxonomic classifications, output from [Step 13a](#13a-generate-gene-level-coverage-summary-tables)) 
+
+**Output data:**
+- Combined-gene-level-taxonomy_unfiltered_GLmetagenomics.tsv (aggregated gene-level taxonomy table with samples in columns and species in rows)
+- **Combined-gene-level-taxonomy_unfiltered_heatmap_GLmetagenomics.png** (heatmap of all gene-level taxonomy assignments, output from [make_heatmap()](#make_heatmap))
+- **Combined-gene-level-taxonomy_unfiltered_top_50_heatmap_GLmetagenomics.png** (heatmap of the top 50 gene-level taxonomy assignments, output from [make_heatmap()](#make_heatmap))
+
+#### 16b. Gene-level Taxonomy Feature Filtering
+
+```R
+feature_table_file <- "Combined-gene-level-taxonomy_unfiltered_GLmetagenomics.tsv"
+metadata_table <- "/path/to/sample/metadata"
+threshold <- 1000
+
+# read in feature table
+feature_table <- read_delim(feature_table_file) %>%
+                 mutate(across(where(is.numeric), function(col) replace_na(col, 0))) %>%
+                 as.data.frame()
+feature_name <- colnames(feature_table)[1]
+rownames(feature_table) <- feature_table[,1]
+feature_table <- feature_table[, -1]
+
+table2write <- get_abundant_features(feature_table, cpm_threshold=threshold) %>%
+               as.data.frame() %>%
+               rownames_to_column(feature_name)
+
+write_tsv(x = table2write, file = "Combined-gene-level-taxonomy_filtered_GLmetagenomics.tsv")
+
+
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Combined-gene-level-taxonomy_filtered_GLmetagenomics.tsv", 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Combined-gene-level-taxonomy_filtered", 
+             assay_suffix = "_GLmetagenomics", 
+             custom_palette = custom_palette)
+
+```
+
+**Custom Functions Used:**
+- [get_abundant_features()](#get_abundant_features)
+- [make_heatmap()](#make_heatmap)
+
+**Parameter Definitions:**
+
+- `feature_table_file` - path to a tab-separated feature table containing gene-level coverage data, with
+                         species/functions as the first column and samples as other columns.
+- `metadata_table` - path to a file with samples as rows and columns describing each sample
+- `threshold` - threshold to identify abundant features, default: 1000
+
+**Input Data:**
+
+- `Combined-gene-level-taxonomy_unfiltered_GLmetagenomics.tsv` (aggregated gene-level taxonomy table with samples in columns and species in rows, from [Step 16a](#16a-gene-level-taxonomy-heatmaps))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
+
+**Output Data:**
+
+- **Combined-gene-level-taxonomy_filtered_GLmetagenomics.tsv** (filtered gene-level taxonomy, output from [get_abundant_features()](#get_abundant_features))
+- **Combined-gene-level-taxonomy_filtered_heatmap_GLmetagenomics.png** (heatmap of all gene-level taxonomy assignments after filtering out non-abundant features, output from [make_heatmap()](#make_heatmap))
+- **Combined-gene-level-taxonomy_filtered_top_50_heatmap_GLmetagenomics.png** (heatmap of the top 50 gene taxonomy assignments after filtering out non-abundant features, output from [make_heatmap()](#make_heatmap))
+
+#### 16c. Gene-level KO Functions Heatmaps
+
+```R
+assembly_table <- "Combined-gene-level-KO-function-coverages-CPM_GLmetagenomics.tsv"
+assembly_summary <- "assembly-summaries_GLmetagenomics.tsv"
+metadata_table <- "/path/to/sample/metadata"
+
+# Read in assembly summary table and remove columns where the values are NA
+overview_table <- read_delim(assembly_summary, comment="#") %>%
+  select(
+    where(~all(!is.na(.)))
+  )
+
+col_names <- names(overview_table) %>% str_remove_all("-assembly")
+sample_order <- col_names[-1] %>% sort()
+
+# Read in the KO function coverage table and subset it to the KO ID and sample columns
+df <- read_delim(assembly_table, comment = "#")
+sample_order <- get_samples(df, sample_order, end_col="KO_function")
+
+table2write <- df %>%
+               select(KO_ID, !!sample_order)
+
+# Write out gene-level KO function table
+write_tsv(x = table2write, file = "Combined-gene-level-KO-function_unfiltered_GLmetagenomics.tsv")
+
+make_heatmap(metadata_table_file = metadata_table,
+             feature_table_file = "Combined-gene-level-KO-function_unfiltered_GLmetagenomics.tsv",
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Combined-gene-level-KO-function_unfiltered", 
+             assay_suffix = "_GLmetagenomics", 
+             custom_palette = custom_palette)
+
+```
+
+**Custom Functions Used:**
+- [get_samples()](#get_samples)
+- [make_heatmap()](#make_heatmap)
+
+**Parameter Definitions:**
+
+- `metadata_table` - path to a file with samples as rows and columns describing each sample
+- `assembly_table` - path to a tab-separated table containing gene-level KO function coverage data with
+                         species/functions as the first column and samples as other columns.
+- `assembly_summary` - path to a tab-separated file containing statistics on assemblies created for each sample
+
+**Input data:**
+
+- assembly-summaries_GLmetagenomics.tsv (table of assembly summary statistics, output from [Step 5b](#5b-summarize-assemblies))
+- Combined-gene-level-KO-function-coverages-CPM_GLmetagenomics.tsv (table with all samples combined based on KO annotations; 
+  normalized to coverage per million genes covered, output from [Step 13a](#13a-generate-gene-level-coverage-summary-tables))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
+
+**Output data:**
+
+- Combined-gene-level-KO-function_unfiltered_GLmetagenomics.tsv (aggregated and subsetted gene-level KO function table)
+- **Combined-gene-level-KO-function_unfiltered_heatmap_GLmetagenomics.png** (heatmap of all gene-level KO function assignments, output from [make_heatmap()](#make_heatmap))
+- **Combined-gene-level-KO-function_unfiltered_top_50_heatmap_GLmetagenomics.png** (heatmap of the top 50 gene-level KO function assignments, output from [make_heatmap()](#make_heatmap))
+
+#### 16d. Gene-level KO Functions Feature Filtering
+
+```R
+feature_table_file <- "Combined-gene-level-KO-function_unfiltered_GLmetagenomics.tsv"
+metadata_table <- "/path/to/sample/metadata"
+threshold <- 1000
+
+# read in feature table
+feature_table <- read_delim(feature_table_file) %>%
+                 mutate(across(where(is.numeric), function(col) replace_na(col, 0))) %>%
+                 as.data.frame()
+feature_name <- colnames(feature_table)[1]
+rownames(feature_table) <- feature_table[,1]
+feature_table <- feature_table[, -1]
+
+table2write <- get_abundant_features(feature_table, cpm_threshold=threshold) %>%
+               as.data.frame() %>%
+               rownames_to_column(feature_name)
+
+write_tsv(x = table2write, file = "Combined-gene-level-KO-function_filtered_GLmetagenomics.tsv")
+
+make_heatmap(metadata_table_file = metadata_table,
+             feature_table_file = "Combined-gene-level-KO-function_filtered_GLmetagenomics.tsv",
+             samples_column="sample_id", group_column = "group",
+             output_prefix = "Combined-gene-level-KO-function_filtered",
+             assay_suffix = "_GLmetagenomics", 
+             custom_palette = custom_palette)
+
+```
+
+**Custom Functions Used:**
+- [get_abundant_features()](#get_abundant_features)
+- [make_heatmap()](#make_heatmap)
+
+**Parameter Definitions:**
+
+- `feature_table_file` - path to a tab-separated feature table containing gene-level coverage data, with
+                         species/functions as the first column and samples as other columns.
+- `metadata_table` - path to a file with samples as rows and columns describing each sample
+- `threshold` - threshold to identify abundant features, default: 1000
+
+**Input Data:**
+
+- `Combined-gene-level-KO-function_unfiltered_GLmetagenomics.tsv` (gene-level KO function table with samples in columns and KO IDs in rows, from [Step 16c](#16c-gene-level-ko-functions-heatmaps))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
+
+**Output Data:**
+
+- **Combined-gene-level-KO-function_filtered_GLmetagenomics.tsv** (filtered gene-level KO function table, output from [get_abundant_features()](#get_abundant_features))
+- **Combined-gene-level-KO-function_filtered_heatmap_GLmetagenomics.png** (heatmap of all gene-level KO function assignments after filtering out non-abundant features, output from [make_heatmap()](#make_heatmap))
+- **Combined-gene-level-KO-function_filtered_top_50_heatmap_GLmetagenomics.png** (heatmap of the top 50 gene-level KO function assignments after filtering out non-abundant features, output from [make_heatmap()](#make_heatmap))
+
+#### 16e. Contig-level Heatmaps
+
+```R
+assembly_table <- "Combined-contig-level-taxonomy-coverages-CPM_GLmetagenomics.tsv"
+assembly_summary <- "assembly-summaries_GLmetagenomics.tsv"
+metadata_table <- "/path/to/sample/metadata"
+
+# Read in assembly summary table
+overview_table <- read_delim(assembly_summary, comment="#") %>%
+  select(
+    where(~all(!is.na(.)))
+  )
+
+col_names <- names(overview_table) %>% str_remove_all("-assembly")
+sample_order <- col_names[-1] %>% sort()
+
+# deduplicate rows by summing together species values
+df <- read_delim(assembly_table, comment = "#")
+sample_order <- get_samples(df, sample_order)
+
+table2write <- read_taxonomy_table(df, sample_order) %>%
+               select(species, !!sample_order) %>%
+               group_by(species) %>%
+               summarise(across(everything(), sum)) %>%
+               filter(species != "Unclassified;_;_;_;_;_;_") %>%
+               as.data.frame()
+
+# Write out contig taxonomy table
+write_tsv(x = table2write, file = "Combined-contig-level-taxonomy_unfiltered_GLmetagenomics.tsv")
+
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Combined-contig-level-taxonomy_unfiltered_GLmetagenomics.tsv", 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Combined-contig-level-taxonomy_unfiltered",
+             assay_suffix = "_GLmetagenomics", 
+             custom_palette = custom_palette)
+```
+
+**Custom Functions Used:**
+- [get_samples()](#get_samples)
+- [read_taxonomy_table()](#read_taxonomy_table)
+- [make_heatmap()](#make_heatmap)
+
+**Parameter Definitions:**
+
+- `metadata_table` - path to a file with samples as rows and columns describing each sample
+- `assembly_table` - path to a tab-separated table containing contig-level taxonomy coverage data with
+                         species/functions as the first column and samples as other columns.
+- `assembly_summary` - path to a tab-separated file containing statistics on assemblies created for each sample
+
+
+**Input data:**
+
+- assembly-summaries_GLmetagenomics.tsv (table of assembly summary statistics, output from [Step 5b](#5b-summarize-assemblies))
+- Combined-contig-level-taxonomy-coverages-CPM_GLmetagenomics.tsv (table with all samples combined based on contig-level 
+  taxonomic classifications, output from [Step 13b](#13b-generate-contig-level-coverage-summary-tables)) 
+
+**Output data:**
+
+- Combined-contig-level-taxonomy_unfiltered_GLmetagenomics.tsv (aggregated contig-level taxonomy table with samples in columns and species in rows)
+- **Combined-contig-level-taxonomy_unfiltered_heatmap_GLmetagenomics.png** (heatmap of all contig-level taxonomy assignments, output from [make_heatmap()](#make_heatmap))
+- **Combined-contig-level-taxonomy_unfiltered_top_50_heatmap_GLmetagenomics.png** (heatmap of the top 50 contig-level taxonomy assignments, output from [make_heatmap()](#make_heatmap))
+
+#### 16f. Contig-level Feature Filtering
+
+```R
+feature_table_file <- "Combined-contig-level-taxonomy_unfiltered_GLmetagenomics.tsv"
+metadata_table <- "/path/to/sample/metadata"
+threshold <- 1000
+
+# read in feature table
+feature_table <- read_delim(feature_table_file) %>%
+                 mutate(across(where(is.numeric), function(col) replace_na(col, 0))) %>%
+                 as.data.frame()
+feature_name <- colnames(feature_table)[1]
+rownames(feature_table) <- feature_table[,1]
+feature_table <- feature_table[, -1]
+
+table2write <- get_abundant_features(feature_table, cpm_threshold=threshold) %>%
+               as.data.frame() %>%
+               rownames_to_column(feature_name)
+
+write_tsv(x = table2write, file = "Combined-contig-level-taxonomy_filtered_GLmetagenomics.tsv")
+
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Combined-contig-level-taxonomy_filtered_GLmetagenomics.tsv", 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Combined-contig-level-taxonomy_filtered", 
+             assay_suffix = "_GLmetagenomics", 
+             custom_palette = custom_palette)
+```
+
+**Custom Functions Used:**
+- [get_abundant_features()](#get_abundant_features)
+- [make_heatmap()](#make_heatmap)
+
+**Parameter Definitions:**
+
+- `feature_table_file` - path to a tab-separated feature table containing contig-level coverage data, with
+                         species as the first column and samples as other columns.
+- `metadata_table` - path to a file with samples as rows and columns describing each sample
+- `threshold` - threshold to identify abundant features, default: 1000
+
+**Input Data:**
+
+- `Combined-contig-level-taxonomy_unfiltered_GLmetagenomics.tsv` (aggregated contig-level taxonomy table with samples in columns and species in rows, from [Step 16e](#16e-contig-level-heatmaps))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
+
+**Output Data:**
+
+- **Combined-contig-level-taxonomy_filtered_GLmetagenomics.tsv** (filtered contig-level taxonomy, output from [get_abundant_features()](#get_abundant_features))
+- **Combined-contig-level-taxonomy_filtered_heatmap_GLmetagenomics.png** (heatmap of all contig-level taxonomy assignments after filtering out non-abundant features, output from [make_heatmap()](#make_heatmap))
+- **Combined-contig-level-taxonomy_filtered_top_50_heatmap_GLmetagenomics.png** (heatmap of the top 50 contig-level taxonomy assignments after filtering out non-abundant features, output from [make_heatmap()](#make_heatmap))
+
+### 17. Generate Assembly-based Processing Overview
+> This utilizes the helper script [`generate-assembly-based-overview-table.sh`](https://github.com/nasa/GeneLab_Metagenomics_Workflow/blob/DEV/bin/generate-assembly-based-overview-table.sh) 
+
+```bash
+bash generate-assembly-based-overview-table.sh sample_ids_file.txt \
+  assemblies/ predicted-genes/ read-mapping/ bins/ MAGs/ \
+  Assembly-based-processing-overview_GLmetagenomics.tsv
+```
+
+**Parameter Definitions:**
+
+- `sample_ids_file.txt` - A file listing the sample names, one on each row, provided as a positional argument.
+- `assemblies/` - The directory holding the contig-renamed assembly files generated in [Step 5a](#5a-rename-contig-headers), provided as a positional argument.
+- `predicted-genes/` - The directory holding the gene-calls amino-acid fasta files generated in [Step 6a](#6a-generate-gene-predictions) and [Step 6b](#6b-remove-line-wraps-in-gene-prediction-output), provided as a positional argument.
+- `read-mapping/` - The directory holding the sorted mapping to the sample assembly in BAM format generated in [Step 9c](#9c-sort-assembly-alignments), provided as a positional argument.
+- `bins/` - The directory holding the recovered bins fasta files generated in [Step 14a](#14a-bin-contigs), provided as a positional argument.
+- `MAGs/` - The directory holding the high-quality MAGs fasta files generated in [Step 14c](#14c-filter-mags), provided as a positional argument.
+- `Assembly-based-processing-overview_GLmetagenomics.tsv` - name of the output file, provided as a positional argument.
+
+**Input Data:**
+
+- assemblies/\*.fasta (contig-renamed assembly files from [Step 5a](#5a-rename-contig-headers))
+- predicted-genes/\*.faa (gene-calls amino-acid fasta file with line wraps removed, output from [Step 6b](#6b-remove-line-wraps-in-gene-prediction-output))
+- read-mapping/\*.bam (sorted mapping to sample assembly, in BAM format, output from [Step 9c](#9c-sort-assembly-alignments))
+- bins/\*.fasta (fasta files of recovered bins, output from [Step 14a](#14a-bin-contigs))
+- MAGs/\*.fasta (fasta files of high-quality MAGs, output from [Step 14c](#14c-filter-mags))
+
+**Output Data:**
+
+- **Assembly-based-processing-overview_GLmetagenomics.tsv** (Tab delimited text file providing a summary of assembly-based processing results for each sample)
+
+<br>
+
+---
+
+## Read-based Processing
+
+
+### 18. Taxonomic Profiling Using Kaiju
+
+#### 18a. Build Kaiju Database
+
+```bash
+# Make a directory that will hold the downloaded kaiju database
+mkdir kaiju-db/ && cd kaiju-db/
+
+# Download and build kaiju's nr_euk reference database
+kaiju-makedb -s nr_euk -t NumberOfThreads
+
+# Clean up intermediate files that are not needed for classification
+rm nr_euk/kaiju_db_nr_euk.bwt nr_euk/kaiju_db_nr_euk.sa
+```
+
+**Parameter Definitions:**
+
+- `-s nr_euk` - Specifies to download the subset of the NCBI BLAST nr (non-redundant) database containing all proteins belonging to Archaea, bacteria, and viruses, and additionally include proteins from fungi and microbial eukaryotes.
+- `-t` - Number of parallel processing threads to use.
+
+**Input Data:**
+
+*No input data required*
+
+**Output Data:**
+
+- kaiju-db/nr_euk/kaiju_db_nr_euk.fmi (FM-index file containing the main Kaiju database index)
+- kaiju-db/nr_euk/kaiju_db_nr_euk.faa (FASTA amino acid file containing the protein sequences used to build the .fmi index file)
+- kaiju-db/nodes.dmp (taxonomy hierarchy file from the NCBI Taxonomy database defining the parent-child relationships in the taxonomic tree)
+- kaiju-db/names.dmp (taxonomy names file from the NCBI Taxonomy database that maps taxonomic IDs to their scientific names)
+- kaiju-db/merged.dmp (merged taxonomy IDs file from the NCBI Taxonomy database that maps deprecated taxonomic IDs to current ones)
+
+
+#### 18b. Kaiju Taxonomic Classification
+
+```bash
+kaiju -f kaiju-db/nr_euk/kaiju_db_nr_euk.fmi \
+      -t kaiju-db/nodes.dmp \
+      -z NumberOfThreads \
+      -E 1e-05 \
+      -i /path/to/sample1_R1_filtered_GLmetagenomics.fastq.gz \
+      -j /path/to/sample1_R2_filtered_GLmetagenomics.fastq.gz \
+      -o sample_kaiju.out
+```
+
+**Parameter Definitions:**
+
+- `-f` - Specifies the path to the kaiju database index file (.fmi).
+- `-t` - Specifies the path to the kaiju taxonomy hierarchy file (nodes.dmp).
+- `-z` - Number of parallel processing threads to use.
+- `-E` - Specifies the E-value threshold used to filter matches; matches with a higher E-value are discarded (an E-value of 1e-05 means that about 0.00001 matches of equal or better quality are expected to occur by chance).
+- `-i` - Specifies path to the forward read input file.
+- `-j` - Specifies path to the reverse read input file.
+- `-o` - Specifies the name of the output file.
+
+**Input Data:**
+
+- kaiju-db/nr_euk/kaiju_db_nr_euk.fmi (FM-index file containing the main Kaiju database index, output from [Step 18a](#18a-build-kaiju-database))
+- kaiju-db/nodes.dmp (kaiju taxonomy hierarchy nodes file, output from [Step 18a](#18a-build-kaiju-database))
+- *_R[12]_filtered_GLmetagenomics.fastq.gz (filtered/trimmed reads from [Step 2b](#2b-trim-polyg) above)
+
+
+**Output Data:**
+
+- sample_kaiju.out (kaiju output file)
+
+#### 18c. Compile Kaiju Taxonomy Results
+
+```bash
+# Merge kaiju reports to one table at the species level 
+kaiju2table -t nodes.dmp \
+            -n names.dmp \
+            -p \
+            -r "species" \
+            -o merged_kaiju_table.tsv \
+            *_kaiju.out
+
+# Convert file names to sample names
+sed -i -E 's/.+\/(.+)_kaiju\.out/\1/g' merged_kaiju_table.tsv && \
+sed -i -E 's/file/sample/' merged_kaiju_table.tsv
+```
+
+**Parameter Definitions:**
+
+- `-t` - Specifies the path to the kaiju taxonomy hierarchy file (nodes.dmp).
+- `-n` - Specifies the path to the kaiju taxonomy names file (names.dmp).
+- `-p` - Print the full taxon path instead of only the taxon name.
+- `-r` - Specifies taxonomic rank to print the taxon path to, must be one of: phylum, class, order, family, genus, species. (Default: species).
+- `-o` - Specifies the name of the kaiju taxon summary output file.
+- `*_kaiju.out` - Positional argument specifying the path to the kaiju output files for each sample. 
+
+**Input Data:**
+
+- kaiju-db/nodes.dmp (kaiju taxonomy hierarchy nodes file, output from [Step 18a](#18a-build-kaiju-database))
+- kaiju-db/names.dmp (kaiju taxonomy names file, output from [Step 18a](#18a-build-kaiju-database))
+- *kaiju.out (kaiju output files, output from [Step 18b](#18b-kaiju-taxonomic-classification))
+
+**Output Data:**
+
+- merged_kaiju_table.tsv (compiled kaiju summary table at the species level)
+
+#### 18d. Convert Kaiju Output To Krona Format
+
+```bash
+kaiju2krona -u \
+            -n kaiju-db/names.dmp \
+            -t kaiju-db/nodes.dmp \
+            -i sample_kaiju.out \
+            -o sample.krona
+```
+
+**Parameter Definitions:**
+
+- `-u` - Include count for unclassified reads in output.
+- `-n` - Specifies the path to the kaiju taxonomy names file (names.dmp).
+- `-t` - Specifies the path to the kaiju taxonomy hierarchy file (nodes.dmp).
+- `-i` - Specifies the path to the kaiju output file.
+- `-o` - Specifies the name of krona formatted kaiju output file.
+
+**Input Data:**
+- kaiju-db/names.dmp (kaiju taxonomy names file, output from [Step 18a](#18a-build-kaiju-database))
+- kaiju-db/nodes.dmp (kaiju taxonomy hierarchy nodes file, output from [Step 18a](#18a-build-kaiju-database))
+- sample_kaiju.out (kaiju output file, output from [Step 18b](#18b-kaiju-taxonomic-classification))
+
+**Output Data:**
+
+- sample.krona (krona formatted kaiju output)
+
+#### 18e. Compile Kaiju Krona Reports
+
+```bash
+# Create a file containing a sorted list of all .krona files 
+find . -type f -name "*.krona" | sort -uV > krona_files.txt
+
+# Create a file containing a sorted list of all sample names
+FILES=($(find . -type f -name "*.krona"))
+basename -a -s '.krona' ${FILES[*]} | sort -uV  > sample_names.txt
+
+# Create ktImportText input format files
+KTEXT_FILES=($(paste -d',' "krona_files.txt" "sample_names.txt"))
+
+# Create html containing krona plot  
+ktImportText -o kaiju-report_GLmetagenomics.html ${KTEXT_FILES[*]}
+```
+
+**Parameter Definitions:**
+
+*find*
+- `-type f` -  Specifies that the type of file to find is a regular file.
+- `-name "*.krona"` - Specifies to find files ending with the .krona suffix.  
+
+*sort*
+- `-u` - Specifies to perform a unique sort.
+- `-V` - Specifies to perform a mixed type of sorting with names containing numbers within text.
+- `> krona_files.txt` - Redirects the sorted list to a separate text file.
+
+*basename*
+- `-a` - Support multiple arguments and treat each as a file name.
+- `-s '.krona'` - Remove trailing '.krona' suffix.
+
+*paste*
+- `-d','` - Paste both krona and sample files together line by line delimited by comma ','.
+
+*ktImportText*
+- `-o` - Specifies the compiled output html file name.
+- `${KTEXT_FILES[*]}` - An array positional argument with the following content: 
+                        sample_1.krona,sample_1 sample_2.krona,sample_2 ... sample_n.krona,sample_n.
+
+**Input Data:**
+
+- *.krona (all sample .krona formatted files, output from [Step 18d](#18d-convert-kaiju-output-to-krona-format)) 
+             
+**Output Data:**
+
+- krona_files.txt (sorted list of all *.krona files)
+- sample_names.txt (sorted list of all sample names)
+- **kaiju-report_GLmetagenomics.html** (compiled krona html report containing all samples)
+
+#### 18f. Create Kaiju Species Count Table
+
+```R
+feature_table <- process_kaiju_table(file_path="merged_kaiju_table_GLmetagenomics.tsv")
+table2write <- feature_table  %>%
+               as.data.frame() %>%
+               rownames_to_column("Species")
+write_tsv(x = table2write, file = "kaiju_species_table_GLmetagenomics.tsv")
+```
+
+**Custom Functions Used:**
+- [process_kaiju_table()](#process_kaiju_table)
+
+**Parameter Definitions:**
+
+- `file_path` - path to compiled kaiju table at the species taxon level
+- `x`  - feature table dataframe to write to file
+- `file` - path to where to write kaiju count table per sample
+
+**Input Data:**
+
+- merged_kaiju_table_GLmetagenomics.tsv (compiled kaiju table at the species taxon level, from [Step 18c](#18c-compile-kaiju-taxonomy-results))
+
+**Output Data:**
+
+- **kaiju_species_table_GLmetagenomics.tsv** (kaiju species count table in tsv format)
+
+
+#### 18g. Filter Kaiju Species Count Table
+
+```R
+feature_table_file <- "kaiju_species_table_GLmetagenomics.tsv"
+output_file <- "kaiju_filtered_species_table_GLmetagenomics.tsv"
+threshold <- 0.5
+
+# string used to define non-microbial taxa
+non_microbial <- "UNCLASSIFIED|Unclassified|unclassified|Homo sapien|cannot|uncultured|unidentified"
+
+# read in feature table
+feature_table <- read_delim(feature_table_file) %>%
+                 mutate(across(where(is.numeric), function(col) replace_na(col, 0))) %>%
+                 as.data.frame()
+feature_name <- colnames(feature_table)[1]
+rownames(feature_table) <- feature_table[,1]
+feature_table <- feature_table[, -1]
+
+# convert count table to a relative abundance matrix
+abund_table <- feature_table %>% rownames_to_column(feature_name) %>%
+  mutate(across(where(is.numeric), function(x) (x / sum(x, na.rm = TRUE)) * 100)) %>%
+  as.data.frame
+
+rownames(abund_table) <- abund_table[,1]
+abund_table <- abund_table[,-1] %>% t 
+
+table2write <- group_low_abund_taxa(abund_table, threshold = threshold)  %>%
+  t %>% as.data.frame %>%
+  rownames_to_column(feature_name)
+
+write_tsv(x = table2write, file = output_file)
+```
+
+**Custom Functions Used:**
+- [group_low_abund_taxa()](#group_low_abund_taxa)
+
+**Parameter Definitions:**
+
+- `threshold` - threshold for filtering out rare taxa, a percentage between 0 and 100.
+- `output_file` - output filename
+- `feature_table_file` - input filename
+
+**Input Data:**
+
+- kaiju_species_table_GLmetagenomics.tsv (path to kaiju species table from [Step 18f](#18f-create-kaiju-species-count-table))
+
+**Output Data:**
+
+- **kaiju_filtered_species_table_GLmetagenomics.tsv** (a file containing the filtered species table)
+
+---
+
+#### 18h. Kaiju Taxonomy Barplots
+
+```R
+species_table_file <- "kaiju_species_table_GLmetagenomics.tsv"
+filtered_species_table_file <- "kaiju_filtered_species_table_GLmetagenomics.tsv"
+metadata_file <- "/path/to/sample/metadata"
+
+make_barplot(metadata_file = metadata_file, feature_table_file = species_table_file, 
+             feature_column = "Species", samples_column = "sample_id", group_column = "group",
+             output_prefix = "kaiju_unfiltered_species", assay_suffix = "_GLmetagenomics",
+             publication_format = publication_format, custom_palette = custom_palette)
+
+# Make barplots of the filtered species table
+make_barplot(metadata_file = metadata_file, feature_table_file = filtered_species_table_file, 
+             feature_column = "Species", samples_column = "sample_id", group_column = "group",
+             output_prefix = "kaiju_filtered_species", assay_suffix = "_GLmetagenomics",
+             publication_format = publication_format, custom_palette = custom_palette)
+```
+
+**Custom Functions Used:**
+- [make_barplot()](#make_barplot)
+
+**Parameter Definitions:**
+
+- `species_table_file` - a file containing the species count table
+- `filtered_species_table_file` - a file containing the filtered species count table
+- `metadata_file` - a file containing group information for each sample in the species count files
+
+**Input Data:**
+
+- `kaiju_species_table_GLmetagenomics.tsv` (a file containing the species count table, output from [Step 18f](#18f-create-kaiju-species-count-table))
+- `kaiju_filtered_species_table_GLmetagenomics.tsv` (a file containing the filtered species count table, output from [Step 18g](#18g-filter-kaiju-species-count-table))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
+
+
+**Output Data:**
+
+- kaiju_unfiltered_species_barplot_GLmetagenomics.png (taxonomy barplot without filtering)
+- **kaiju_unfiltered_species_barplot_GLmetagenomics.html** (interactive taxonomy barplot without filtering)
+- kaiju_filtered_species_barplot_GLmetagenomics.png (taxonomy barplot after filtering rare and non-microbial taxa)
+- **kaiju_filtered_species_barplot_GLmetagenomics.html** (interactive taxonomy barplot after filtering rare and non-microbial taxa)
+
+<br>
+
+---
+
+### 19. Taxonomic Profiling Using Kraken2
+
+#### 19a. Download Kraken2 Database
+
+```bash 
+## Download all microbial (including eukaryotes) - https://benlangmead.github.io/aws-indexes/k2
+
+# Download kraken2's prebuilt pluspfp database, which contains the standard database (RefSeq archaea, bacteria, viral, plasmid, human, UniVec_Core) + plants + protists + fungi
+
+mkdir kraken2-db/ && cd kraken2-db/
+
+# Inspect file
+INSPECT_URL=https://genome-idx.s3.amazonaws.com/kraken/pluspfp_20250714/inspect.txt
+wget ${INSPECT_URL}
+
+# Library report
+LIBRARY_REPORT_URL=https://genome-idx.s3.amazonaws.com/kraken/pluspfp_20250714/library_report.tsv
+wget ${LIBRARY_REPORT_URL}
+
+# Md5sums
+MD5_URL=https://genome-idx.s3.amazonaws.com/kraken/pluspfp_20250714/pluspfp.md5 
+wget ${MD5_URL}
+
+# Download and unzip the main database files
+DB_URL=https://genome-idx.s3.amazonaws.com/kraken/k2_pluspfp_20250714.tar.gz 
+wget -O k2_pluspfp.tar.gz --timeout=3600 --tries=0 --continue ${DB_URL} && \
+tar -xvzf k2_pluspfp.tar.gz
+```
+
+**Parameter Definitions:**
+
+*wget*
+- `-O` - Name of the file to write the downloaded url content to.
+- `--timeout=3600` - Specifies the network timeout in seconds.
+- `--tries=0` - Retry download infinitely.
+- `--continue` -  Continue getting a partially-downloaded file.
+- `*_URL` - Positional argument specifying the url to download a particular resource from.
+
+*tar*
+- `-xvzf` - unpack the specified *tar.gz archive in verbose mode
+
+**Input Data:**
+
+- `INSPECT_URL=` (url specifying the location of kraken2 inspect file)
+- `LIBRARY_REPORT_URL=` (url specifying the location of kraken2 library report file)
+- `MD5_URL=` (url specifying the location of the md5 file of the kraken database)
+- `DB_URL=` (url specifying the location of the main kraken database archive in .tar.gz format)
+
+**Output Data:**
+
+- kraken2-db/  (a directory containing kraken2 database files)
+
+#### 19b. Kraken2 Taxonomic Classification
+
+```bash
+kraken2 --db kraken2-db/ \
+        --gzip-compressed \
+        --threads NumberOfThreads \
+        --use-names \
+        --output sample-kraken2-output.txt \
+        --report sample-kraken2-report.tsv \
+        /path/to/sample1_R1_filtered_GLmetagenomics.fastq.gz /path/to/sample1_R2_filtered_GLmetagenomics.fastq.gz
+```
+
+**Parameter Definitions:**
+
+- `--db` - Specifies the directory holding the kraken2 database files. 
+- `--gzip-compressed` - Specifies the input files are gzip-compressed.
+- `--threads` - Number of parallel processing threads to use.
+- `--use-names` - Specifies to add taxa names in addition to taxids.
+- `--output` - Specifies the name of the kraken2 read-based output file.
+- `--report` - Specifies the name of the kraken2 report output file.
+- `sample1_R1_filtered_GLmetagenomics.fastq.gz` - Positional argument specifying the forward read input file.
+- `sample1_R2_filtered_GLmetagenomics.fastq.gz` - Positional argument specifying the reverse read input file.
+
+
+**Input Data:**
+
+- kraken2-db/ (a directory containing kraken2 database files, output from [Step 19a](#19a-download-kraken2-database))
+- *_R[12]_filtered_GLmetagenomics.fastq.gz (filtered/trimmed reads from [Step 2b](#2b-trim-polyg) above)
+
+
+**Output Data:**
+
+- sample-kraken2-output.txt (kraken2 read-based output file (one line per read))
+- sample-kraken2-report.tsv (kraken2 report output file (one line per taxa, with number of reads assigned to it))
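+
+For a quick look at a single sample before compiling results, species-level rows can be pulled directly from a report. The following is a minimal sketch, assuming the standard six-column, tab-separated Kraken2 report layout (percentage, clade reads, direct reads, rank code, taxid, name); the helper name `top_species` is hypothetical and not part of the pipeline.
+
+```bash
+# Hypothetical helper: list the N most abundant species in a Kraken2 report.
+# Assumes the standard 6-column, tab-separated report with no header line.
+top_species() {
+    local report="$1" n="${2:-10}"
+    awk -F'\t' '$4 == "S" {
+        name = $6
+        sub(/^ +/, "", name)      # names are indented by taxonomic depth
+        print $2 "\t" name        # clade-level read count, species name
+    }' "$report" | sort -t$'\t' -k1,1nr | head -n "$n"
+}
+
+# Example usage: top_species sample-kraken2-report.tsv 5
+```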
+
+
+#### 19c. Compile Kraken2 Taxonomy Results
+
+##### 19ci. Create Merged Kraken2 Taxonomy Table
+
+```R
+species_table <- merge_kraken_reports(reports_dir = '/path/to/kraken2/reports')
+write_tsv(x = species_table, file = "kraken2_species_table_GLmetagenomics.tsv")
+```
+
+**Custom Functions Used:**
+- [merge_kraken_reports()](#merge_kraken_reports)
+
+**Parameter Definitions:**
+
+- `reports_dir` - path to the directory containing the kraken2 reports to compile
+- `x`  - feature table dataframe to write to file
+- `file` - path to write the kraken2 species table to
+
+**Input Data:**
+
+- \*-kraken2-report.tsv (kraken report from each sample to compile, outputs from [Step 19b](#19b-kraken2-taxonomic-classification))
+
+**Output Data:**
+
+- **kraken2_species_table_GLmetagenomics.tsv** (kraken species count table in tsv format)
+
+##### 19cii. Compile Kraken2 Taxonomy Reports
+
+```bash
+multiqc --zip-data-dir \
+        --outdir kraken2_multiqc_report \
+        --filename kraken2_multiqc_GLmetagenomics \
+        --interactive \
+        /path/to/*kraken2-report.tsv
+```
+
+**Parameter Definitions:**
+
+- `--zip-data-dir` - Compress the data directory.
+- `--outdir` - Specifies the output directory to store results.
+- `--filename` - Specifies the filename prefix of results.
+- `--interactive` - Force multiqc to always create interactive javascript plots.
+- `/path/to/*kraken2-report.tsv` - The kraken2 output report files, provided as a positional argument.
+
+**Input Data:**
+
+- \*-kraken2-report.tsv (kraken report from each sample to compile, outputs from [Step 19b](#19b-kraken2-taxonomic-classification))
+
+**Output Data:**
+
+- **kraken2_multiqc_GLmetagenomics.html** (multiqc output html summary)
+- **kraken2_multiqc_GLmetagenomics_data.zip** (zip archive containing multiqc output data)
+
+
+#### 19d. Convert Kraken2 Output to Krona Format
+
+```bash
+kreport2krona.py --report-file sample-kraken2-report.tsv  \
+                 --output sample.krona
+```
+
+**Parameter Definitions:**
+
+- `--report-file` - Specifies the name of the input kraken2 report file.
+- `--output` - Specifies the name of the krona output file.
+
+**Input Data:**
+
+- sample-kraken2-report.tsv (kraken report, output from [Step 19b](#19b-kraken2-taxonomic-classification))
+
+**Output Data:**
+
+- sample.krona (krona formatted kraken2 output)
+
+
+#### 19e. Compile Kraken2 Krona Reports
+
+```bash
+# Find, list and write all .krona files to file 
+find . -type f -name "*.krona" | sort -uV > krona_files.txt
+
+FILES=($(find . -type f -name "*.krona"))
+basename --multiple --suffix='.krona' ${FILES[*]} | sort -uV  > sample_names.txt
+
+# Create ktImportText input format files
+KTEXT_FILES=($(paste -d',' "krona_files.txt" "sample_names.txt"))
+
+# Create html   
+ktImportText -o kraken2-report_GLmetagenomics.html ${KTEXT_FILES[*]}
+```
+
+**Parameter Definitions:**
+
+*find*
+  - `-type f` -  Specifies that the type of file to find is a regular file.
+  - `-name "*.krona"` - Specifies to find files ending with the .krona suffix. 
+
+*sort*
+  - `-u` - Specifies to perform a unique sort.
+  - `-V` - Specifies to perform a mixed type of sorting with names containing numbers within text.
+  - `> {}.txt` - Redirects the sorted list to a separate text file.
+
+*basename*
+  - `--multiple` - Support multiple arguments and treat each as a file name.
+  - `--suffix='.krona'` - Remove a trailing '.krona' suffix.
+
+*paste*
+  - `-d','` - Paste both krona and sample files together line by line delimited by comma ','.
+
+*ktImportText*
+  - `-o` - Specifies the compiled output html file name.
+  - `${KTEXT_FILES[*]}` - An array positional argument with the following content: sample_1.krona,sample_1 sample_2.krona,sample_2 .. sample_n.krona,sample_n.
+
+**Input Data:**
+
+- *.krona (all sample .krona formatted files, output from [Step 19d](#19d-convert-kraken2-output-to-krona-format)) 
+
+                      
+**Output Data:**
+
+- krona_files.txt (sorted list of all *.krona files)
+- sample_names.txt (sorted list of all sample names)
+- **kraken2-report_GLmetagenomics.html** (compiled krona html report containing all samples)
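+
+Because the file list and the sample-name list above are sorted independently, the pairing holds only when file paths and sample names sort in the same order (for example, when all `.krona` files sit in one directory). The sketch below derives each label from its own path instead, which avoids that assumption; the helper name `pair_krona_files` is hypothetical and not part of the pipeline.
+
+```bash
+# Hypothetical helper: print one "file,label" pair per .krona file under a directory.
+# The label is taken from the file's own basename, so pairs cannot be misaligned.
+pair_krona_files() {
+    local dir="$1" f
+    find "$dir" -type f -name "*.krona" | sort -V | while IFS= read -r f; do
+        printf '%s,%s\n' "$f" "$(basename "$f" .krona)"
+    done
+}
+
+# Example usage (paths without whitespace assumed):
+# ktImportText -o kraken2-report_GLmetagenomics.html $(pair_krona_files .)
+```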
+
+
+#### 19f. Filter Kraken2 Species Count Table
+
+```R
+feature_table_file <- "kraken2_species_table_GLmetagenomics.tsv"
+output_file <- "kraken2_filtered_species_table_GLmetagenomics.tsv"
+threshold <- 0.5
+
+# string used to define non-microbial taxa
+non_microbial <- "UNCLASSIFIED|Unclassified|unclassified|Homo sapien|cannot|uncultured|unidentified"
+
+# read in feature table
+feature_table <- read_delim(feature_table_file) %>%
+                 mutate(across(where(is.numeric), function(col) replace_na(col, 0))) %>%
+                 as.data.frame()
+feature_name <- colnames(feature_table)[1]
+rownames(feature_table) <- feature_table[,1]
+feature_table <- feature_table[, -1]
+
+# read-based count table
+table2write <- filter_rare(feature_table, non_microbial, threshold = threshold) %>%
+  as.data.frame %>%
+  rownames_to_column(feature_name)
+
+write_tsv(x = table2write, file = output_file)
+```
+
+**Custom Functions Used:**
+- [group_low_abund_taxa()](#group_low_abund_taxa)
+
+**Parameter Definitions:**
+
+- `threshold` - threshold for filtering out rare taxa, a percentage between 0 and 100.
+- `output_file` - output filename
+- `feature_table_file` - input filename
+
+**Input Data:**
+
+- kraken2_species_table_GLmetagenomics.tsv (path to kraken2 species table from [Step 19ci](#19ci-create-merged-kraken2-taxonomy-table))
+
+**Output Data:**
+
+- **kraken2_filtered_species_table_GLmetagenomics.tsv** (a file containing the filtered species table)
+
+---
+
+#### 19g. Kraken2 Taxonomy Barplots
+
+```R
+species_table_file <- "kraken2_species_table_GLmetagenomics.tsv"
+filtered_species_table_file <- "kraken2_filtered_species_table_GLmetagenomics.tsv"
+metadata_file <- "/path/to/sample/metadata"
+
+make_barplot(metadata_file = metadata_file, feature_table_file = species_table_file, 
+             feature_column = "species", samples_column = "sample_id", group_column = "group",
+             output_prefix = "kraken2_unfiltered_species", assay_suffix = "_GLmetagenomics",
+             publication_format = publication_format, custom_palette = custom_palette)
+
+# Repeat for the filtered species table
+make_barplot(metadata_file = metadata_file, feature_table_file = filtered_species_table_file, 
+             feature_column = "Species", samples_column = "sample_id", group_column = "group",
+             output_prefix = "kraken2_filtered_species", assay_suffix = "_GLmetagenomics",
+             publication_format = publication_format, custom_palette = custom_palette)
+```
+
+**Custom Functions Used:**
+- [make_barplot()](#make_barplot)
+
+**Parameter Definitions:**
+
+- `species_table_file` - a file containing the species count table
+- `filtered_species_table_file` - a file containing the filtered species count table
+- `metadata_file` - a file containing group information for each sample in the species count files
+
+**Input Data:**
+
+- `kraken2_species_table_GLmetagenomics.tsv` (path to kraken2 species table from [Step 19ci](#19ci-create-merged-kraken2-taxonomy-table))
+- `kraken2_filtered_species_table_GLmetagenomics.tsv` (a file containing the filtered species count table, output from [Step 19f](#19f-filter-kraken2-species-count-table))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
+
+**Output Data:**
+
+- kraken2_unfiltered_species_barplot_GLmetagenomics.png (taxonomy barplot without filtering)
+- **kraken2_unfiltered_species_barplot_GLmetagenomics.html** (interactive taxonomy barplot without filtering)
+- kraken2_filtered_species_barplot_GLmetagenomics.png (taxonomy barplot after filtering rare and non-microbial taxa)
+- **kraken2_filtered_species_barplot_GLmetagenomics.html** (interactive taxonomy barplot after filtering rare and non-microbial taxa)
+
+<br>  
+
+---
+
+### 20. Taxonomic Profiling Using HUMAnN/MetaPhlan
+
+#### 20a. Download and Install HUMAnN databases
+
+```bash
+mkdir -p /path/to/humann3-db
+humann_databases --download chocophlan full /path/to/humann3-db/
+humann_databases --download uniref uniref90_ec_filtered_diamond /path/to/humann3-db/
+humann_databases --download utility_mapping full /path/to/humann3-db/
+metaphlan --install
+```
+
+**Parameter Definitions:**
+
+*humann_databases*
+- `--download` - Specifies the databases to download:
+  - `chocophlan full` - the full ChocoPhlAn pangenome database, which includes Archaea, Bacteria, Eukaryotes, and Viruses
+  - `uniref uniref90_ec_filtered_diamond` - the EC-filtered UniRef90 translated search database
+  - `utility_mapping full` - additional gene family to functional category mapping database
+- `/path/to/humann3-db` - Specifies the database install location
+
+*metaphlan*
+- `--install` - install the MetaPhlAn clade markers and database locally
+
+**Input Data:**
+
+*No input data required*
+
+**Output Data:**
+
+- `/path/to/humann3-db` (directory containing the installed HUMAnN databases)
+
+
+#### 20b. HUMAnN/MetaPhlAn Taxonomic Classification
+```bash
+  # forward and reverse reads need to be provided combined if paired-end (if not paired-end, single-end reads are provided to the --input argument next)
+cat sample_R1_filtered_GLmetagenomics.fastq.gz sample_R2_filtered_GLmetagenomics.fastq.gz > sample-combined.fastq.gz
+
+humann --input sample-combined.fastq.gz \
+       --output sample-humann3-out-dir \
+       --threads NumberOfThreads \
+       --output-basename sample \
+       --metaphlan-options "--bowtie2db /path/to/humann3-db/ --unclassified_estimation --add_viruses --sample_id sample" \
+       --nucleotide-database /path/to/humann3-db/ \
+       --protein-database /path/to/humann3-db/ \
+       --bowtie-options "--sensitive --mm"
+
+mv sample-humann3-out-dir/sample_humann_temp/sample_metaphlan_bugs_list.tsv \
+   sample-humann3-out-dir/sample_metaphlan_bugs_list.tsv
+```
+
+**Parameter Definitions:**  
+
+-	`--input` – specifies the input (combined forward and reverse reads)
+-	`--output` – specifies output directory
+-	`--threads` – specifies the number of threads to use
+-	`--output-basename` – specifies prefix of the output files
+-	`--metaphlan-options` – options to be passed to metaphlan
+	- `--bowtie2db` – path to bowtie2 indexes (stored in HUMAnN database folder)
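+	- `--unclassified_estimation` – scale the relative abundance profile according to the percentage of reads mapping to a clade
+	- `--add_viruses` – include viruses in the reference database
+	- `--sample_id` – specifies the sample identifier we want in the table (rather than full filename)
+-	`--nucleotide-database` – directory containing the nucleotide (ChocoPhlAn) database
+-	`--protein-database` – directory containing the protein (UniRef) database
+-	`--bowtie-options` – options to be passed to bowtie2
+	- `--sensitive` – use bowtie2's sensitive alignment preset
+	- `--mm` – use memory-mapped I/O to load the index, so multiple bowtie2 processes can share it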
+
+**Input Data:**
+
+- `/path/to/humann3-db/` (HUMAnN databases installed in [Step 20a](#20a-download-and-install-humann-databases))
+- *_R[12]_filtered_GLmetagenomics.fastq.gz (filtered/trimmed reads from [Step 2b](#2b-trim-polyg) above)
+
+**Output Data:**
+
+- sample-humann3-out-dir/ (humann output directories containing *genefamilies.tsv, *pathabundance.tsv, and *pathcoverage.tsv files)
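+
+Concatenating gzip files with `cat` yields a valid multi-member gzip stream, so the reads do not need to be decompressed before they are combined. As a sanity check, the combined file should hold the sum of the reads in the two inputs. The sketch below assumes four-line FASTQ records; the helper name `count_reads` is hypothetical and not part of the pipeline.
+
+```bash
+# Hypothetical helper: count reads in a gzip-compressed FASTQ file
+# (assumes exactly 4 lines per record).
+count_reads() {
+    echo $(( $(gzip -cd "$1" | wc -l) / 4 ))
+}
+
+# Example check after combining the forward and reverse reads:
+# r1=$(count_reads sample_R1_filtered_GLmetagenomics.fastq.gz)
+# r2=$(count_reads sample_R2_filtered_GLmetagenomics.fastq.gz)
+# [ "$(count_reads sample-combined.fastq.gz)" -eq $(( r1 + r2 )) ] || echo "read count mismatch"
+```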
+
+#### 20c. Merge Multiple Sample Functional Profiles
+
+```bash
+  # they need to be in their own directories
+mkdir genefamily-results/ pathabundance-results/ pathcoverage-results/
+
+  # copying results from the previous humann3 step (20b) to get them all together in their own directories (as is needed)
+cp *-humann3-out-dir/*genefamilies.tsv genefamily-results/
+cp *-humann3-out-dir/*abundance.tsv pathabundance-results/
+cp *-humann3-out-dir/*coverage.tsv pathcoverage-results/
+
+humann_join_tables -i genefamily-results/ -o gene-families.tsv
+humann_join_tables -i pathabundance-results/ -o path-abundances.tsv
+humann_join_tables -i pathcoverage-results/ -o path-coverages.tsv
+```
+
+**Parameter Definitions:**  
+
+- `-i` - the directory holding the input tables
+- `-o` - the name of the output table holding combined data
+
+**Input Data:**
+
+- `sample-humann3-out-dir` (HUMAnN output directory, from [Step 20b](#20b-humannmetaphlan-taxonomic-classification))
+
+**Output Data:**
+
+- gene-families.tsv (Combined gene family table in tab-separated format.)
+- path-abundances.tsv (Combined path abundances table in tab-separated format.)
+- path-coverages.tsv (Combined path coverages table in tab-separated format.)
+
+#### 20d. Split Results Tables
+
+The read-based functional annotation tables have taxonomic info and non-taxonomic info mixed together initially. `humann` comes with a helper script to split these. Here we are using that to generate both non-taxonomically grouped functional info files and taxonomically grouped ones.
+
+```bash
+humann_split_stratified_table -i gene-families.tsv -o ./
+mv gene-families_stratified.tsv Gene-families-grouped-by-taxa_GLmetagenomics.tsv
+mv gene-families_unstratified.tsv Gene-families_GLmetagenomics.tsv
+
+humann_split_stratified_table -i path-abundances.tsv -o ./
+mv path-abundances_stratified.tsv Pathway-abundances-grouped-by-taxa_GLmetagenomics.tsv
+mv path-abundances_unstratified.tsv Pathway-abundances_GLmetagenomics.tsv
+
+humann_split_stratified_table -i path-coverages.tsv -o ./
+mv path-coverages_stratified.tsv Pathway-coverages-grouped-by-taxa_GLmetagenomics.tsv
+mv path-coverages_unstratified.tsv Pathway-coverages_GLmetagenomics.tsv
+```
+
+**Parameter Definitions:**  
+
+-	`-i` – the input combined table
+-	`-o` – output directory (here specifying current directory)
+
+**Input Data:**
+
+- gene-families.tsv (Combined gene family table from [Step 20c](#20c-merge-multiple-sample-functional-profiles))
+- path-abundances.tsv (Combined path abundances table from [Step 20c](#20c-merge-multiple-sample-functional-profiles))
+- path-coverages.tsv (Combined path coverages table from [Step 20c](#20c-merge-multiple-sample-functional-profiles))
+
+**Output Data:**
+
+- **Gene-families_GLmetagenomics.tsv** (gene-family abundances)
+- **Gene-families-grouped-by-taxa_GLmetagenomics.tsv** (gene-family abundances grouped by taxa)
+- **Pathway-abundances_GLmetagenomics.tsv**  (pathway abundances)
+- **Pathway-abundances-grouped-by-taxa_GLmetagenomics.tsv** (pathway abundances grouped by taxa)
+- **Pathway-coverages_GLmetagenomics.tsv** (pathway coverages)
+- **Pathway-coverages-grouped-by-taxa_GLmetagenomics.tsv** (pathway coverages grouped by taxa)
+
+#### 20e. Normalize Gene Families and Pathway Abundances Tables
+Generates normalized tables of the read-based functional outputs from HUMAnN that are better suited to across-sample comparisons.
+
+```bash
+humann_renorm_table -i Gene-families_GLmetagenomics.tsv -o Gene-families-cpm_GLmetagenomics.tsv --update-snames
+humann_renorm_table -i Pathway-abundances_GLmetagenomics.tsv -o Pathway-abundances-cpm_GLmetagenomics.tsv --update-snames
+```
+
+**Parameter Definitions:**  
+
+-	`-i` – the input combined table
+-	`-o` – name of the output normalized table
+-	`--update-snames` – change suffix of column names in tables to "-CPM"
+
+**Input Data:**
+
+- Gene-families_GLmetagenomics.tsv (gene-family abundances, from [Step 20d](#20d-split-results-tables))
+- Pathway-abundances_GLmetagenomics.tsv (pathway abundances, from [Step 20d](#20d-split-results-tables))
+
+**Output Data:**
+- **Gene-families-cpm_GLmetagenomics.tsv** (gene-family abundances normalized to copies-per-million)
+- **Pathway-abundances-cpm_GLmetagenomics.tsv** (pathway abundances normalized to copies-per-million)
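+
+The arithmetic behind copies-per-million is a per-sample rescaling: each value is divided by its column total and multiplied by one million, so every sample column sums to 1,000,000. The sketch below illustrates only that arithmetic on a generic tab-separated table (one header row, feature names in the first column); it is not HUMAnN's implementation, and the helper name `renorm_cpm` is hypothetical.
+
+```bash
+# Hypothetical helper: rescale every sample column of a TSV so it sums to 1e6.
+# The file is read twice: the first pass totals each column, the second rescales.
+# Non-integer results are printed with awk's default precision (6 significant digits).
+renorm_cpm() {
+    awk -F'\t' -v OFS='\t' '
+        NR == FNR { if (FNR > 1) for (i = 2; i <= NF; i++) total[i] += $i; next }
+        FNR == 1  { print; next }    # keep the header row unchanged
+        { for (i = 2; i <= NF; i++) $i = (total[i] > 0) ? $i / total[i] * 1e6 : 0; print }
+    ' "$1" "$1"
+}
+
+# Example usage: renorm_cpm my_table.tsv > my_table_cpm.tsv
+```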
+
+#### 20f. Generate Normalized Gene-family Table Grouped by Kegg Orthologs (KOs)
+
+```bash
+humann_regroup_table -i Gene-families_GLmetagenomics.tsv -g uniref90_ko | \
+humann_rename_table -n kegg-orthology | \
+humann_renorm_table -o Gene-families-KO-cpm_GLmetagenomics.tsv --update-snames
+
+```
+
+**Parameter Definitions:**  
+
+*humann_regroup_table*
+-	`-i` – the input table
+-	`-g` – the map to use to group uniref IDs into Kegg Orthologs
+-	`|` – sending that output into the next humann command to add human-readable Kegg Orthology names
+
+*humann_rename_table*
+-	`-n` – specifying we are converting Kegg orthology IDs into Kegg orthology human-readable names
+-	`|` – sending that output into the next humann command to normalize to copies-per-million
+
+*humann_renorm_table*
+-	`-o` – specifying the final output file name
+-	`--update-snames` – change suffix of column names in tables to "-CPM"
+
+**Input Data:**
+
+- Gene-families_GLmetagenomics.tsv (Non-taxonomically grouped gene families, from [Step 20d](#20d-split-results-tables))
+
+**Output Data:**
+
+- **Gene-families-KO-cpm_GLmetagenomics.tsv** (KO term abundances normalized to copies-per-million)
+
+#### 20g. Combine MetaPhlan Taxonomy Tables
+
+```bash
+merge_metaphlan_tables.py *-humann3-out-dir/*_metaphlan_bugs_list.tsv > Metaphlan-taxonomy_GLmetagenomics.tsv
+```
+
+**Parameter Definitions:**  
+
+*merge_metaphlan_tables.py*
+- `*-humann3-out-dir/*_metaphlan_bugs_list.tsv` - positional arguments specifying the input files
+- `>` - redirects the merged table to the output file
+
+**Input Data:**
+
+-	\*-humann3-out-dir/\*_metaphlan_bugs_list.tsv (MetaPhlAn bugs_list produced during the humann3 run and moved out of the temporary directory in [Step 20b](#20b-humannmetaphlan-taxonomic-classification))
+
+**Output Data:**
+
+- **Metaphlan-taxonomy_GLmetagenomics.tsv** (MetaPhlan estimated taxonomic relative abundances)
+
+#### 20h. Create MetaPhlan Species Count Table
+
+##### 20hi. Get Sample Read Counts
+
+```bash
+unzip filtered_multiqc_GLmetagenomics_data.zip
+
+grep _R1_filtered multiqc_fastqc.txt | awk 'BEGIN{FS="\t"; OFS="\t"}{print $1,int($5)}' > reads_per_sample.tsv
+```
+
+**Input Data:**
+
+- filtered_multiqc_GLmetagenomics_data.zip or HostRm_multiqc_GLmetagenomics_data.zip (multiqc data from [Step 2d](#2d-compile-filteredtrimmed-data-qc))
+  
+**Output Data:**
+
+- reads_per_sample.tsv (a 2-column tab delimited file with the sample names and read counts as column 1 and 2, respectively)
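+
+The `awk` call above reads the count from column 5, which relies on a fixed column order in `multiqc_fastqc.txt`. A slightly more defensive sketch looks the column up by name instead. It assumes the header row contains a column literally named `Total Sequences`; the helper name `reads_per_sample` is hypothetical and not part of the pipeline.
+
+```bash
+# Hypothetical helper: print sample name and read count for forward-read rows,
+# locating the count column by its header name rather than by position.
+reads_per_sample() {
+    awk -F'\t' -v OFS='\t' '
+        NR == 1 {
+            for (i = 1; i <= NF; i++) if ($i == "Total Sequences") col = i
+            if (!col) { print "Total Sequences column not found" > "/dev/stderr"; exit 1 }
+            next
+        }
+        $1 ~ /_R1_filtered/ { print $1, int($col) }
+    ' "$1"
+}
+
+# Example usage: reads_per_sample multiqc_fastqc.txt > reads_per_sample.tsv
+```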
+
+##### 20hii. Process MetaPhlan Taxonomy Table
+
+```R
+input_file <- "Metaphlan-taxonomy_GLmetagenomics.tsv"
+read_count_file <- "reads_per_sample.tsv"
+output_file <- "metaphlan_species_table_GLmetagenomics.tsv"
+threshold <- 0.5
+
+taxon_levels <- c("Kingdom", "Phylum", "Class", "Order",
+                  "Family", "Genus", "Species")
+
+# read in feature table
+feature_table <- read_delim(input_file, delim="\t", comment="#") 
+colnames(feature_table)[1] <- "taxonomy"
+
+feature_table <- feature_table %>%
+  filter(str_detect(taxonomy, "UNCLASSIFIED|s__") & 
+         str_detect(taxonomy, "t__", negate = TRUE)) %>%
+  mutate(Species=str_replace_all(taxonomy, '\\w__', "")) %>%
+  separate(Species, into=taxon_levels, sep="\\|") %>%
+  mutate(across(where(is.character), function(x) replace_na(x, "UNCLASSIFIED"))) %>%
+  mutate(Species=str_replace_all(Species, "_", " ")) %>%
+  select(-taxonomy, -Kingdom, -Phylum, -Class, -Order, -Family, -Genus) %>%
+  select(Species, everything()) %>%
+  as.data.frame
+
+rownames(feature_table) <- feature_table$Species
+feature_table <- feature_table[,-match("Species", colnames(feature_table))]
+
+# Set max abundance equal to 1
+tab2 <- (feature_table %>% t) / 100
+
+# read in sample read counts
+counts <- read_delim(read_count_file, delim = "\t", 
+                     col_names = c("Sample_ID", "Reads")) %>%
+  as.data.frame
+
+# Set rownames as sample names
+rownames(counts) <- counts$Sample_ID
+# Drop the Sample_ID column
+counts <- counts[, -1, drop = FALSE]
+
+tab2 <- tab2[rownames(counts),]
+
+# Convert relative abundance to raw count
+species_table <- map2(tab2 %>% as.data.frame, 
+                      colnames(tab2), function(col, specie) {
+                        df <- col * counts
+                        colnames(df) <- specie
+                        return(df) 
+                      }) %>% list_cbind() %>% t
+
+table2write <- species_table  %>%
+  as.data.frame() %>%
+  rownames_to_column("Species")
+
+write_tsv(x = table2write, file = "metaphlan_species_table_GLmetagenomics.tsv")
+```
+
+**Input Data:**
+
+- Metaphlan-taxonomy_GLmetagenomics.tsv (MetaPhlAn taxonomy table from [Step 20g](#20g-combine-metaphlan-taxonomy-tables))
+- reads_per_sample.tsv (a 2-column tab delimited file with sample names and read counts as columns 1 and 2, respectively from [Step 20hi](#20hi-get-sample-read-counts))
+
+**Output Data:**
+
+- **metaphlan_species_table_GLmetagenomics.tsv** (a file containing the MetaPhlan species table)
+
+#### 20i. Filter MetaPhlan Species Count Table
+
+```R
+feature_table_file <- "metaphlan_species_table_GLmetagenomics.tsv"
+output_file <- "metaphlan_filtered_species_table_GLmetagenomics.tsv"
+threshold <- 0.5
+
+# string used to define non-microbial taxa
+non_microbial <- "UNCLASSIFIED"
+
+# read in feature table
+feature_table <- read_delim(feature_table_file) %>%
+                 mutate(across(where(is.numeric), function(col) replace_na(col, 0))) %>%
+                 as.data.frame()
+feature_name <- colnames(feature_table)[1]
+rownames(feature_table) <- feature_table[,1]
+feature_table <- feature_table[, -1]
+
+# read-based count table
+table2write <- filter_rare(feature_table, non_microbial, threshold = threshold) %>%
+  as.data.frame %>%
+  rownames_to_column(feature_name)
+
+write_tsv(x = table2write, file = output_file)
+```
+
+**Custom Functions Used:**
+- [group_low_abund_taxa()](#group_low_abund_taxa)
+
+**Parameter Definitions:**
+
+- `threshold` - threshold for filtering out rare taxa, a percentage between 0 and 100.
+- `output_file` - output filename
+- `feature_table_file` - input filename
+
+**Input Data:**
+
+- metaphlan_species_table_GLmetagenomics.tsv (path to MetaPhlan species count table from [Step 20hii](#20hii-process-metaphlan-taxonomy-table))
+
+**Output Data:**
+
+- **metaphlan_filtered_species_table_GLmetagenomics.tsv** (a file containing the filtered MetaPhlan species table)
+
+#### 20j. MetaPhlan Taxonomy Barplots
+
+```R
+species_table_file <- "metaphlan_species_table_GLmetagenomics.tsv"
+filtered_species_table_file <- "metaphlan_filtered_species_table_GLmetagenomics.tsv"
+metadata_file <- "/path/to/sample/metadata"
+
+make_barplot(metadata_file = metadata_file, feature_table_file = species_table_file, 
+             feature_column = "Species", samples_column = "sample_id", group_column = "group",
+             output_prefix = "metaphlan_unfiltered_species", assay_suffix = "_GLmetagenomics",
+             publication_format = publication_format, custom_palette = custom_palette)
+
+# Repeat for the filtered species table
+make_barplot(metadata_file = metadata_file, feature_table_file = filtered_species_table_file, 
+             feature_column = "Species", samples_column = "sample_id", group_column = "group",
+             output_prefix = "metaphlan_filtered_species", assay_suffix = "_GLmetagenomics",
+             publication_format = publication_format, custom_palette = custom_palette)
+```
+
+**Custom Functions Used:**
+- [make_barplot()](#make_barplot)
+
+**Parameter Definitions:**
+
+- `species_table_file` - a file containing the species count table
+- `filtered_species_table_file` - a file containing the filtered species count table
+- `metadata_file` - a file containing group information for each sample in the species count files
+
+**Input Data:**
+
+- `metaphlan_species_table_GLmetagenomics.tsv` (path to MetaPhlan species table from [Step 20h](#20h-create-metaphlan-species-count-table))
+- `metaphlan_filtered_species_table_GLmetagenomics.tsv` (a file containing the filtered species count table, output from [Step 20i](#20i-filter-metaphlan-species-count-table))
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
+
+**Output Data:**
+
+- metaphlan_unfiltered_species_barplot_GLmetagenomics.png (taxonomy barplot without filtering)
+- **metaphlan_unfiltered_species_barplot_GLmetagenomics.html** (interactive taxonomy barplot without filtering)
+- metaphlan_filtered_species_barplot_GLmetagenomics.png (taxonomy barplot after filtering rare and non-microbial taxa)
+- **metaphlan_filtered_species_barplot_GLmetagenomics.html** (interactive taxonomy barplot after filtering rare and non-microbial taxa)
+
+#### 20k. Filter Humann Output
+
+```R
+# read in humann tables
+humann_uniref_table <- read_delim(file = "Gene-families-cpm_GLmetagenomics.tsv", delim = "\t")
+humann_KO_table <- read_delim(file = "Gene-families-KO-cpm_GLmetagenomics.tsv", delim = "\t")
+humann_pathway_table <- read_delim(file = "Pathway-abundances-cpm_GLmetagenomics.tsv", delim = "\t")
+
+# rename headers
+humann_uniref_table <-  humann_uniref_table  %>% 
+  rename(Uniref90=`# Gene Family`) %>%
+  mutate(Uniref90=str_replace_all(Uniref90, "UniRef90_", "")) %>%
+  set_names(colnames(.) %>% str_replace_all("_Abundance-CPM", "")) %>%
+  as.data.frame()
+write_tsv(x = humann_uniref_table, file = "Gene-families-uniref_unfiltered_GLmetagenomics.tsv")
+
+humann_KO_table <- humann_KO_table %>%
+  rename(KO=`# Gene Family`) %>%
+  set_names(colnames(.) %>% str_replace_all("_Abundance-CPM", "")) %>%
+  as.data.frame()
+write_tsv(x = humann_KO_table, file = "Gene-families-KO_unfiltered_GLmetagenomics.tsv")
+
+humann_pathway_table <-  humann_pathway_table  %>% 
+  rename(Pathway=`# Pathway`) %>%
+  set_names(colnames(.) %>% str_replace_all("_Abundance-CPM", "")) %>%
+  as.data.frame()
+write_tsv(x = humann_pathway_table, file = "Pathway-abundances_unfiltered_GLmetagenomics.tsv")
+
+# filter data
+threshold <- 500
+
+humann_uniref_table <- humann_uniref_table %>%
+  mutate(across(where(is.numeric), function(col) replace_na(col, 0))) %>% column_to_rownames("Uniref90")
+humann_uniref_filtered <- get_abundant_features(humann_uniref_table, cpm_threshold = threshold) %>%
+  as.data.frame() %>% rownames_to_column("Uniref90")
+write_tsv(x = humann_uniref_filtered, file = "Gene-families-uniref_filtered_GLmetagenomics.tsv")
+
+humann_KO_table <- humann_KO_table %>%
+  mutate(across(where(is.numeric), function(col) replace_na(col, 0))) %>% column_to_rownames("KO")
+humann_KO_filtered <- get_abundant_features(humann_KO_table, cpm_threshold = threshold) %>%
+  as.data.frame() %>% rownames_to_column("KO")
+write_tsv(x = humann_KO_filtered, file = "Gene-families-KO_filtered_GLmetagenomics.tsv")
+
+humann_pathway_table <- humann_pathway_table %>%
+  mutate(across(where(is.numeric), function(col) replace_na(col, 0))) %>% column_to_rownames("Pathway")
+humann_pathway_filtered <- get_abundant_features(humann_pathway_table, cpm_threshold = threshold) %>%
+  as.data.frame() %>% rownames_to_column("Pathway")
+write_tsv(x = humann_pathway_filtered, file = "Pathway-abundances_filtered_GLmetagenomics.tsv")
+
+```
+
+**Custom Functions Used:**
+- [get_abundant_features()](#get_abundant_features)
+
+**Parameter Definitions:**
+
+- `threshold` - threshold for filtering out low abundance features, a value greater than 0
+
+**Input Data:**
+
+- Gene-families-cpm_GLmetagenomics.tsv (HUMAnN gene-family table normalized to copies-per-million, from [Step 20e](#20e-normalize-gene-families-and-pathway-abundances-tables))
+- Gene-families-KO-cpm_GLmetagenomics.tsv (HUMAnN KO term table normalized to copies-per-million, from [Step 20f](#20f-generate-normalized-gene-family-table-grouped-by-kegg-orthologs-kos))
+- Pathway-abundances-cpm_GLmetagenomics.tsv (HUMAnN pathway abundances table normalized to copies-per-million, from [Step 20e](#20e-normalize-gene-families-and-pathway-abundances-tables))
+
+**Output Data:**
+
+- Gene-families-KO_unfiltered_GLmetagenomics.tsv (KO term abundances normalized to copies-per-million, with cleaned headers)
+- Gene-families-uniref_unfiltered_GLmetagenomics.tsv (gene-family abundances normalized to copies-per-million, with cleaned headers)
+- Pathway-abundances_unfiltered_GLmetagenomics.tsv (pathway abundances normalized to copies-per-million, with cleaned headers)
+- **Gene-families-KO_filtered_GLmetagenomics.tsv** (KO term abundances after filtering out features below the 500 CPM threshold)
+- **Gene-families-uniref_filtered_GLmetagenomics.tsv** (gene-family abundances after filtering out features below the 500 CPM threshold)
+- **Pathway-abundances_filtered_GLmetagenomics.tsv** (pathway abundances after filtering out features below the 500 CPM threshold)
+
+#### 20l. Create Humann Function Heatmaps
+
+```R
+metadata_table <- "/path/to/sample/metadata"
+
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Gene-families-uniref_unfiltered_GLmetagenomics.tsv", 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Gene-families-uniref_unfiltered", 
+             assay_suffix = "_GLmetagenomics", 
+             custom_palette = custom_palette)
+
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Gene-families-uniref_filtered_GLmetagenomics.tsv", 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Gene-families-uniref_filtered", 
+             assay_suffix = "_GLmetagenomics", 
+             custom_palette = custom_palette)
+
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Gene-families-KO_unfiltered_GLmetagenomics.tsv", 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Gene-families-KO_unfiltered", 
+             assay_suffix = "_GLmetagenomics", 
+             custom_palette = custom_palette)
+
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Gene-families-KO_filtered_GLmetagenomics.tsv", 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Gene-families-KO_filtered", 
+             assay_suffix = "_GLmetagenomics", 
+             custom_palette = custom_palette)
+
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Pathway-abundances_unfiltered_GLmetagenomics.tsv", 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Pathway-abundances_unfiltered", 
+             assay_suffix = "_GLmetagenomics", 
+             custom_palette = custom_palette)
+
+make_heatmap(metadata_table_file = metadata_table, 
+             feature_table_file = "Pathway-abundances_filtered_GLmetagenomics.tsv", 
+             samples_column="sample_id", group_column = "group", 
+             output_prefix = "Pathway-abundances_filtered", 
+             assay_suffix = "_GLmetagenomics", 
+             custom_palette = custom_palette)
+```
+
+**Custom Functions Used:**
+- [make_heatmap()](#make_heatmap)
+
+**Parameter Definitions:**
+
+- `metadata_table` - a file containing group information for each sample in the feature tables
+
+**Input Data:**
+
+- `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
+- `Gene-families-uniref_unfiltered_GLmetagenomics.tsv` (gene-family abundances table, output from [Step 20k](#20k-filter-humann-output))
+- `Gene-families-KO_unfiltered_GLmetagenomics.tsv` (KO term abundances table, output from [Step 20k](#20k-filter-humann-output))
+- `Pathway-abundances_unfiltered_GLmetagenomics.tsv` (pathway abundances table, output from [Step 20k](#20k-filter-humann-output))
+- `Gene-families-uniref_filtered_GLmetagenomics.tsv` (filtered gene-family abundances table, output from [Step 20k](#20k-filter-humann-output)) 
+- `Gene-families-KO_filtered_GLmetagenomics.tsv` (filtered KO term abundances table, output from [Step 20k](#20k-filter-humann-output)) 
+- `Pathway-abundances_filtered_GLmetagenomics.tsv` (filtered Pathway abundances table, output from [Step 20k](#20k-filter-humann-output)) 
+
+**Output Data:**
+
+- **Gene-families-uniref_unfiltered_heatmap_GLmetagenomics.png** (gene family abundances heatmap without filtering)
+- **Gene-families-uniref_filtered_heatmap_GLmetagenomics.png** (gene family abundances heatmap after filtering low-abundance features)
+- **Gene-families-KO_unfiltered_heatmap_GLmetagenomics.png** (KO term abundances heatmap without filtering)
+- **Gene-families-KO_filtered_heatmap_GLmetagenomics.png** (KO term abundances heatmap after filtering low-abundance features)
+- **Pathway-abundances_unfiltered_heatmap_GLmetagenomics.png** (pathway abundances heatmap without filtering)
+- **Pathway-abundances_filtered_heatmap_GLmetagenomics.png** (pathway abundances heatmap after filtering low-abundance features)
+- **Gene-families-uniref_unfiltered_top_50_heatmap_GLmetagenomics.png** (heatmap of the top 50 gene families without filtering)
+- **Gene-families-uniref_filtered_top_50_heatmap_GLmetagenomics.png** (heatmap of the top 50 gene families after filtering low-abundance features)
+- **Gene-families-KO_unfiltered_top_50_heatmap_GLmetagenomics.png** (heatmap of the top 50 KO terms without filtering)
+- **Gene-families-KO_filtered_top_50_heatmap_GLmetagenomics.png** (heatmap of the top 50 KO terms after filtering low-abundance features)
+- **Pathway-abundances_unfiltered_top_50_heatmap_GLmetagenomics.png** (heatmap of the top 50 pathways without filtering)
+- **Pathway-abundances_filtered_top_50_heatmap_GLmetagenomics.png** (heatmap of the top 50 pathways after filtering low-abundance features)
+
+---
diff --git a/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md b/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md
index 688c318b2..3437ea7f2 100644
--- a/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md
+++ b/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md
@@ -2678,19 +2678,19 @@ metaphlan --install
 
 ```bash
   # forward and reverse reads need to be provided combined if paired-end (if not paired-end, single-end reads are provided to the --input argument next)
-cat sample1_R1_decontam_GLlblMetag.fastq.gz sample1_R2_decontam_GLlblMetag.fastq.gz > sample1-combined.fastq.gz
+cat sample_R1_decontam_GLlblMetag.fastq.gz sample_R2_decontam_GLlblMetag.fastq.gz > sample-combined.fastq.gz
 
-humann --input sample1-combined.fastq.gz \
-       --output sample1-humann3-out-dir \
+humann --input sample-combined.fastq.gz \
+       --output sample-humann3-out-dir \
        --threads NumberOfThreads \
-       --output-basename sample1 \
-       --metaphlan-options "--bowtie2db /path/to/humann3-db/ --unclassified_estimation --add_viruses --sample_id sample1" \
+       --output-basename sample \
+       --metaphlan-options "--bowtie2db /path/to/humann3-db/ --unclassified_estimation --add_viruses --sample_id sample" \
        --nucleotide-database /path/to/humann3-db/ \
        --protein-database /path/to/humann3-db/ \
        --bowtie-options "--sensitive --mm"
 
-mv sample1-humann3-out-dir/sample1_humann_temp/sample1_metaphlan_bugs_list.tsv \
-   sample1-humann3-out-dir/sample1_metaphlan_bugs_list.tsv
+mv sample-humann3-out-dir/sample_humann_temp/sample_metaphlan_bugs_list.tsv \
+   sample-humann3-out-dir/sample_metaphlan_bugs_list.tsv
 ```
 
 **Parameter Definitions:**  
@@ -2713,7 +2713,7 @@ mv sample1-humann3-out-dir/sample1_humann_temp/sample1_metaphlan_bugs_list.tsv \
 
 **Output Data:**
 
-- sample1-humann3-out-dir/ *humann output directory containing *genefamilies.tsv, *pathabundance.tsv, and *pathcoverage.tsv files)
+- sample-humann3-out-dir/ (humann output directory containing *genefamilies.tsv, *pathabundance.tsv, and *pathcoverage.tsv files)
 
 #### 12c. Merge Multiple Sample Functional Profiles
 
diff --git a/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md b/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md
index 74a784095..533e4d951 100644
--- a/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md
+++ b/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md
@@ -243,14 +243,14 @@ multiqc --zip-data-dir \
 #### 2a. Filter Quality and Trim Adapters
 
 ```bash
-fastp --in1 sample1_R1_HRrm_GLlbsMetag.fastq.gz --out1 temp_sample1_R1_filtered.fastq.gz \
-      --in2 sample1_R2_HRrm_GLlbsMetag.fastq.gz --out2 temp_sample1_R2_filtered.fastq.gz \
+fastp --in1 sample_R1_HRrm_GLlbsMetag.fastq.gz --out1 temp_sample_R1_filtered.fastq.gz \
+      --in2 sample_R2_HRrm_GLlbsMetag.fastq.gz --out2 temp_sample_R2_filtered.fastq.gz \
       --qualified_quality_phred  20 \
       --length_required 50 \
       --thread 2 \
       --detect_adapter_for_pe \
-      --json sample1.fastp.json \
-      --html sample1.fastp.html 2> sample1-fastp.log
+      --json sample.fastp.json \
+      --html sample.fastp.html 2> sample-fastp.log
 ```
 
 **Parameter Definitions:**
@@ -278,15 +278,15 @@ fastp --in1 sample1_R1_HRrm_GLlbsMetag.fastq.gz --out1 temp_sample1_R1_filtered.
 #### 2b. Trim polyG
 
 ```bash
-fastp --in1 temp_sample1_R1_filtered.fastq.gz --out1 sample1_R1_filtered_GLlbsMetag.fastq.gz \
-      --in2 temp_sample1_R2_filtered.fastq.gz --out2 sample1_R2_filtered_GLlbsMetag.fastq.gz \
+fastp --in1 temp_sample_R1_filtered.fastq.gz --out1 sample_R1_filtered_GLlbsMetag.fastq.gz \
+      --in2 temp_sample_R2_filtered.fastq.gz --out2 sample_R2_filtered_GLlbsMetag.fastq.gz \
       --qualified_quality_phred  20 \
       --length_required 50 \
       --thread 2 \
       --detect_adapter_for_pe \
-      --json sample1.fastp.json \
-      --html sample1.fastp.html \
-      --trim_poly_g 2> sample1-fastp.log
+      --json sample.fastp.json \
+      --html sample.fastp.html \
+      --trim_poly_g 2> sample-fastp.log
 ```
 
 **Parameter Definitions:**
@@ -306,7 +306,7 @@ fastp --in1 temp_sample1_R1_filtered.fastq.gz --out1 sample1_R1_filtered_GLlbsMe
 
 **Input Data:**
 
-- /path/to/filtered_data/temp_sample1*.fastq.gz (round1 filtered/adapter trimmed reads, output from [Step 2a](#2a-filter-quality-and-trim-adapters)
+- /path/to/filtered_data/temp_sample*.fastq.gz (round1 filtered/adapter trimmed reads, output from [Step 2a](#2a-filter-quality-and-trim-adapters))
 
 **Output Data:**
 
@@ -315,7 +315,7 @@ fastp --in1 temp_sample1_R1_filtered.fastq.gz --out1 sample1_R1_filtered_GLlbsMe
 #### 2c. Filtered Data QC
 
 ```bash
-fastqc -o filtered_fastqc_output *filtered.fastq.gz
+fastqc -o filtered_fastqc_output *filtered_GLlbsMetag.fastq.gz
 ```
 
 **Parameter Definitions:**
@@ -414,17 +414,17 @@ bowtie2-build /path/to/contaminant_assembly/blank-scaffolds.fasta /path/to/blank
 bowtie2 -p NumberOfThreads \
        -x /path/to/blank-index/blanks \
        --very-sensitive-local \
-       -1 sample1_R1_filtered_GLlbsMetag.fastq.gz \
+       -1 sample_R1_filtered_GLlbsMetag.fastq.gz \
        -2 sample2_R2_filtered_GLlbsMetag.fastq.gz \
-       --un-conc-gz sample1_decontam.fastq.gz
-       > sample1.sam 2> sample1-mapping-info.txt
+       --un-conc-gz sample_decontam.fastq.gz \
+       > sample.sam 2> sample-mapping-info.txt
 
 # rename blank removed fastq files
-mv sample1_decontam.fastq.1.gz sample1_R1_decontam_GLlbsMetag.fastq.gz
-mv sample1_decontam.fastq.2.gz sample1_R2_decontam_GLlbsMetag.fastq.gz
+mv sample_decontam.fastq.1.gz sample_R1_decontam_GLlbsMetag.fastq.gz
+mv sample_decontam.fastq.2.gz sample_R2_decontam_GLlbsMetag.fastq.gz
 
 # remove intermediate file
-rm -rf sample1.sam
+rm -rf sample.sam
 ```
 
 **Parameter Definitions:**
@@ -440,17 +440,17 @@ rm -rf sample1.sam
 -	`-1` - specifies the forward read to map
 - `-2` – specifies the reverse reads to map
 - `--un-conc-gz` - Specifies the file pattern for the unaligned read fastq.gz files. ".1" or ".2" will be added to the output filenames to distinguish the forward and reverse read files.
-- `> sample1.sam` - Redirects the output of the map reads command to a separate SAM file (specific to the map reads command).
-- `2> sample1-mapping-info.txt` – capture the printed summary results in a log file
+- `> sample.sam` - Redirects the output of the map reads command to a separate SAM file (specific to the map reads command).
+- `2> sample-mapping-info.txt` – capture the printed summary results in a log file
 
 **Input Data**
 
 - /path/to/contaminant_assembly/blank-scaffolds.fasta (contaminant assembly, output from [Step 3a](#3a-assemble-contaminants))
-- sample1_R[12]_filtered_GLlbsMetag.fastq.gz (filtered and trimmed reads, output from [Step 2b](#2b-trim-polyg))
+- sample_R[12]_filtered_GLlbsMetag.fastq.gz (filtered and trimmed reads, output from [Step 2b](#2b-trim-polyg))
 
 **Output Data**
 
-- **sample1_R[12]_decontam_GLlbsMetag.fastq.gz** (decontaminated reads)
+- **sample_R[12]_decontam_GLlbsMetag.fastq.gz** (decontaminated reads)
 - sample-mapping-info.txt (bowtie2 mapping log file)
 
 <br>
@@ -570,15 +570,15 @@ kraken2 --db kraken2_${hostname}_db \
         --use-names \
         --output sample-kraken2-output.txt \
         --report sample-kraken2-report.tsv \
-        --unclassified-out sample1_R#.fastq \
-        sample1_R1_decontam.fastq.gz sample1_R2_decontam.fastq.gz
+        --unclassified-out sample_R#.fastq \
+        sample_R1_decontam_GLlbsMetag.fastq.gz sample_R2_decontam_GLlbsMetag.fastq.gz
 
 # rename and gzip output files
-mv sample1_R_1.fastq sample1_R1_HostRm_GLlbsMetag.fastq && \
-gzip sample1_R1_HostRm_GLlbsMetag.fastq
+mv sample_R_1.fastq sample_R1_HostRm_GLlbsMetag.fastq && \
+gzip sample_R1_HostRm_GLlbsMetag.fastq
 
-mv  sample1_R_2.fastq sample1_R2_HostRm_GLlbsMetag.fastq && \
-gzip sample1_R2_HostRm_GLlbsMetag.fastq
+mv  sample_R_2.fastq sample_R2_HostRm_GLlbsMetag.fastq && \
+gzip sample_R2_HostRm_GLlbsMetag.fastq
 ```
 
 **Parameter Definitions:**
@@ -590,7 +590,7 @@ gzip sample1_R2_HostRm_GLlbsMetag.fastq
 - `--output` - Specifies the name of the kraken2 read-based output file (one line per read).
 - `--report` - Specifies the name of the kraken2 report output file (one line per taxa, with number of reads assigned to it).
 - `--unclassified-out` - Specifies the name of the output file containing reads that were not classified, i.e non-host reads.
-- `sample1_R1_decontam_GLlbsMetag.fastq.gz sample1_R2_decontam_GLlbsMetag.fastq.gz` - Positional argument specifying the input read files.
+- `sample_R1_decontam_GLlbsMetag.fastq.gz sample_R2_decontam_GLlbsMetag.fastq.gz` - Positional argument specifying the input read files.
 
 **Input Data:**
 
@@ -1597,8 +1597,8 @@ kaiju -f kaiju-db/nr_euk/kaiju_db_nr_euk.fmi \
       -t kaiju-db/nodes.dmp \
       -z NumberOfThreads \
       -E 1e-05 \
-      -i /path/to/sample1_R1_decontam_GLlbsMetag.fastq.gz \
-      -j /path/to/sample1_R2_decontam_GLlbsMetag.fastq.gz \
+      -i /path/to/sample_R1_decontam_GLlbsMetag.fastq.gz \
+      -j /path/to/sample_R2_decontam_GLlbsMetag.fastq.gz \
       -o sample_kaiju.out
 ```
 
@@ -1974,7 +1974,7 @@ kraken2 --db kraken2-db/ \
         --use-names \
         --output sample-kraken2-output.txt \
         --report sample-kraken2-report.tsv \
-        /path/to/sample1_R1_decontam_GLlbsMetag.fastq.gz /path/to/sample1_R2_decontam_GLlbsMetag.fastq.gz
+        /path/to/sample_R1_decontam_GLlbsMetag.fastq.gz /path/to/sample_R2_decontam_GLlbsMetag.fastq.gz
 ```
 
 **Parameter Definitions:**
@@ -1985,8 +1985,8 @@ kraken2 --db kraken2-db/ \
 - `--use-names` - Specifies to add taxa names in addition to taxids.
 - `--output` - Specifies the name of the kraken2 read-based output file.
 - `--report` - Specifies the name of the kraken2 report output file.
-- `sample1_R1_decontam_GLlbsMetag.fastq.gz` - Positional argument specifying the forward read input file.
-- `sample1_R2_decontam_GLlbsMetag.fastq.gz` - Positional argument specifying the reverse read input file.
+- `sample_R1_decontam_GLlbsMetag.fastq.gz` - Positional argument specifying the forward read input file.
+- `sample_R2_decontam_GLlbsMetag.fastq.gz` - Positional argument specifying the reverse read input file.
 
 
 **Input Data:**
@@ -2305,19 +2305,19 @@ metaphlan --install
 
 ```bash
   # forward and reverse reads need to be provided combined if paired-end (if not paired-end, single-end reads are provided to the --input argument next)
-cat sample1_R1_decontam_GLlbsMetag.fastq.gz sample1_R2_decontam_GLlbsMetag.fastq.gz > sample1-combined.fastq.gz
+cat sample_R1_decontam_GLlbsMetag.fastq.gz sample_R2_decontam_GLlbsMetag.fastq.gz > sample-combined.fastq.gz
 
-humann --input sample1-combined.fastq.gz \
-       --output sample1-humann3-out-dir \
+humann --input sample-combined.fastq.gz \
+       --output sample-humann3-out-dir \
        --threads NumberOfThreads \
-       --output-basename sample1 \
-       --metaphlan-options "--bowtie2db /path/to/humann3-db/ --unclassified_estimation --add_viruses --sample_id sample1" \
+       --output-basename sample \
+       --metaphlan-options "--bowtie2db /path/to/humann3-db/ --unclassified_estimation --add_viruses --sample_id sample" \
        --nucleotide-database /path/to/humann3-db/ \
        --protein-database /path/to/humann3-db/ \
        --bowtie-options "--sensitive --mm"
 
-mv sample1-humann3-out-dir/sample1_humann_temp/sample1_metaphlan_bugs_list.tsv \
-   sample1-humann3-out-dir/sample1_metaphlan_bugs_list.tsv
+mv sample-humann3-out-dir/sample_humann_temp/sample_metaphlan_bugs_list.tsv \
+   sample-humann3-out-dir/sample_metaphlan_bugs_list.tsv
 ```
 
 **Parameter Definitions:**  
@@ -2340,7 +2340,7 @@ mv sample1-humann3-out-dir/sample1_humann_temp/sample1_metaphlan_bugs_list.tsv \
 
 **Output Data:**
 
-- sample1-humann3-out-dir/ *humann output directory containing *genefamilies.tsv, *pathabundance.tsv, and *pathcoverage.tsv files)
+- sample-humann3-out-dir/ (humann output directory containing *genefamilies.tsv, *pathabundance.tsv, and *pathcoverage.tsv files)
 
 #### 8c. Merge Multiple Sample Functional Profiles
 
@@ -2469,10 +2469,10 @@ humann_renorm_table -o Gene-families-KO-cpm_GLlbsMetag.tsv --update-snames
 #### 8g. Combine MetaPhlan Taxonomy Tables
 
 ```bash
-merge_metaphlan_tables.py *-humann3-out-dir/*_humann_temp/*_metaphlan_bugs_list.tsv > metaphlan-taxonomy_GLlbsMetag.tsv
+merge_metaphlan_tables.py *-humann3-out-dir/*_metaphlan_bugs_list.tsv > Metaphlan-taxonomy_GLlbsMetag.tsv
 
 # remove redundant text from headers
-sed -i 's/_metaphlan_bugs_list//g' metaphlan-taxonomy_GLlbsMetag.tsv
+sed -i 's/_metaphlan_bugs_list//g' Metaphlan-taxonomy_GLlbsMetag.tsv
 ```
 
 **Parameter Definitions:**
@@ -2976,8 +2976,8 @@ make_heatmap(metadata_table_file = metadata_table,
 ### 9. Sample Assembly
 
 ```
-megahit -1 sample1_R1_decontam_GLlbsMetag.fastq.gz -2 sample1_R2_decontam_GLlbsMetag.fastq.gz \
-        -o sample1-assembly -t NumberOfThreads --min-contig-length 500 > sample1-assembly.log 2>&1
+megahit -1 sample_R1_decontam_GLlbsMetag.fastq.gz -2 sample_R2_decontam_GLlbsMetag.fastq.gz \
+        -o sample-assembly -t NumberOfThreads --min-contig-length 500 > sample-assembly.log 2>&1
 ```
 
 **Parameter Definitions:**  
@@ -2986,7 +2986,7 @@ megahit -1 sample1_R1_decontam_GLlbsMetag.fastq.gz -2 sample1_R2_decontam_GLlbsM
 -	`-o` – specifies output directory
 -	`-t` – specifies the number of threads to use
 -	`--min-contig-length` – specifies the minimum contig length to write out
--	`> sample1-assembly.log 2>&1` – sends stdout/stderr to log file
+-	`> sample-assembly.log 2>&1` – sends stdout/stderr to log file
 
 
 **Input data:**
@@ -2996,8 +2996,8 @@ megahit -1 sample1_R1_decontam_GLlbsMetag.fastq.gz -2 sample1_R2_decontam_GLlbsM
 
 **Output data:**
 
-- sample1-assembly/final.contigs.fa (assembly file)
-- sample1-assembly.log (log file)
+- sample-assembly/final.contigs.fa (assembly file)
+- sample-assembly.log (log file)
 
 <br>  
 
@@ -3008,7 +3008,7 @@ megahit -1 sample1_R1_decontam_GLlbsMetag.fastq.gz -2 sample1_R2_decontam_GLlbsM
 #### 10a. Rename Contig Headers
 
 ```bash
-bit-rename-fasta-headers -i sample1/final.contigs.fasta \
+bit-rename-fasta-headers -i sample-assembly/final.contigs.fa \
                          -w c_sample \
                          -o sample-assembly_GLlbsMetag.fasta
 ```
@@ -3022,7 +3022,7 @@ bit-rename-fasta-headers -i sample1/final.contigs.fasta \
 
 **Input Data:**
 
-- sample1/final.contigs.fasta (assembly file from [Step 9](#9-sample-assembly))
+- sample-assembly/final.contigs.fa (assembly file from [Step 9](#9-sample-assembly))
 
 **Output files:**
 
@@ -3365,30 +3365,30 @@ rm sample*.tmp*
 #### 14a. Build reference index
 
 ```
-bowtie2-build sample1_assembly_GLlbsMetag.fasta sample1-index
+bowtie2-build sample-assembly_GLlbsMetag.fasta sample-index
 ```
 
 **Parameter Definitions:**  
 
-- `sample1_assembly_GLlbsMetag.fasta` - first positional argument specifies the input assembly
--	`sample1-index` - second positional argument specifies the prefix of the output index files
+- `sample-assembly_GLlbsMetag.fasta` - first positional argument specifies the input assembly
+- `sample-index` - second positional argument specifies the prefix of the output index files
 
 **Input Data:**
 
-- `sample1-assembly_GLlbsMetag.fasta` (contig-renamed assembly file, output from [Step 10a](#10a-rename-contig-headers))
+- `sample-assembly_GLlbsMetag.fasta` (contig-renamed assembly file, output from [Step 10a](#10a-rename-contig-headers))
 
 **Output Data:**
 
-- `sample1-index*` - the bowtie2 index files
+- `sample-index*` - the bowtie2 index files
 
 #### 14b. Align Reads to Sample Assembly
 
 ```bash
 bowtie2 --mm --quiet --threads ${task.cpus} \
-        -x sample1-index \
-        -1 sample1_R1_decontam_GLlbsMetag.fastq.gz \
-        -2 sample1_R2_decontam_GLlbsMetag.fastq.gz \
-        --no-unal > sample1.sam  2> sample1-mapping-info_GLlbsMetag.txt 
+        -x sample-index \
+        -1 sample_R1_decontam_GLlbsMetag.fastq.gz \
+        -2 sample_R2_decontam_GLlbsMetag.fastq.gz \
+        --no-unal > sample.sam  2> sample-mapping-info_GLlbsMetag.txt 
 ```
 
 **Parameter Definitions:**
@@ -3399,13 +3399,13 @@ bowtie2 --mm --quiet --threads ${task.cpus} \
 -	`-1` - specifies the forward reads to map
 - `-2` – specifies the reverse reads to map
 - `--no-unal` - Suppress SAM records for reads that did not align.
-- `> sample1.sam` - Redirects the output of the map reads command to a SAM file.
-- `2> sample1-mapping-info_GLlbsMetag.txt` – capture the printed summary results in a log file
+- `> sample.sam` - Redirects the output of the map reads command to a SAM file.
+- `2> sample-mapping-info_GLlbsMetag.txt` – capture the printed summary results in a log file
 
 
 **Input Data**
 
-- sample1-index (bowtie2 index files, output from [Step 14a](#14a-build-reference-index))
+- sample-index (bowtie2 index files, output from [Step 14a](#14a-build-reference-index))
 - *_R[12]_decontam_GLlbsMetag.fastq.gz or *_R[12]_HostRm_GLlbsMetag.fastq.gz (filtered and trimmed sample reads with both 
     contaminants and human reads (and, optionally, host reads) removed, output from [Step 3b](#3b-build-contaminant-index-and-map-reads) or [Step 4b](#4b-remove-host-reads))
 

From 1a1ddd94fd2dae185954971d2c808a08c24c6fc7 Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Tue, 24 Mar 2026 21:47:29 -0700
Subject: [PATCH 38/47] updated low-biomass README with workflow doc link

---
 Metagenomics/Low_Biomass/README.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/Metagenomics/Low_Biomass/README.md b/Metagenomics/Low_Biomass/README.md
index ca2e4a2cf..9ede17df0 100644
--- a/Metagenomics/Low_Biomass/README.md
+++ b/Metagenomics/Low_Biomass/README.md
@@ -15,6 +15,10 @@
 
   - Contains the current and previous GeneLab low-biomass metagenomics consensus processing pipeline documentation for short-read (Illumina) data
 
+* [**Workflow_Documentation**](Workflow_Documentation)
+
+  - Contains instructions for installing and running the GeneLab MGIllumina workflow
+
 ---
 **Developed by:**  
 Olabiyi Obayomi

From a779e3c45bbe308a6cedcd14bbcfe551a7144a95 Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Wed, 25 Mar 2026 08:16:02 -0700
Subject: [PATCH 39/47] updated header

---
 .../GL-DPPD-7107-B.md                              | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/Metagenomics/Illumina/Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-B.md b/Metagenomics/Illumina/Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-B.md
index 875451cc0..77f67b5a1 100644
--- a/Metagenomics/Illumina/Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-B.md
+++ b/Metagenomics/Illumina/Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-B.md
@@ -4,17 +4,17 @@
 
 ---
 
-**Date:** October 28, 2024  
-**Revision:** -A  
-**Document Number:** GL-DPPD-7107  
+**Date:** April MM, 2026  
+**Revision:** -B  
+**Document Number:** GL-DPPD-71710716  
 
 **Submitted by:**  
-Olabiyi A. Obayomi (GeneLab Analysis Team)  
+Olabiyi A. Obayomi (GeneLab Data Processing Team)  
 
 **Approved by:**  
-Samrawit Gebre (OSDR Project Manager)  
-Lauren Sanders (OSDR Project Scientist)  
-Amanda Saravia-Butler (GeneLab Science Lead)  
+Jonathan Galazka (OSDR Project Manager)  
+Danielle Lopez (OSDR Deputy Project Manager)  
+Amanda Saravia-Butler (OSDR Subject Matter Expert)  
 Barbara Novak (GeneLab Data Processing Lead)  
 
 ---

From 5367c7ef666dd2ec4b4f2d0d24a13c791d4be179 Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Wed, 25 Mar 2026 19:37:52 -0700
Subject: [PATCH 40/47] Update GL-DPPD-7107-B.md

fix document number typo
---
 .../Illumina/Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-B.md   | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Metagenomics/Illumina/Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-B.md b/Metagenomics/Illumina/Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-B.md
index 77f67b5a1..58f971b0f 100644
--- a/Metagenomics/Illumina/Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-B.md
+++ b/Metagenomics/Illumina/Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-B.md
@@ -6,7 +6,7 @@
 
 **Date:** April MM, 2026  
 **Revision:** -B  
-**Document Number:** GL-DPPD-71710716  
+**Document Number:** GL-DPPD-7107  
 
 **Submitted by:**  
 Olabiyi A. Obayomi (GeneLab Data Processing Team)  

From 468915b80895669d795e201f809d5206e0797160 Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Wed, 1 Apr 2026 16:00:08 -0700
Subject: [PATCH 41/47] formatting and software version updates

---
 .../GL-DPPD-7107-B.md                         | 221 ++++++++----------
 .../GL-DPPD-7116.md                           |  77 +++---
 .../GL-DPPD-7117.md                           |  75 +++---
 3 files changed, 183 insertions(+), 190 deletions(-)

diff --git a/Metagenomics/Illumina/Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-B.md b/Metagenomics/Illumina/Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-B.md
index 58f971b0f..45d65a215 100644
--- a/Metagenomics/Illumina/Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-B.md
+++ b/Metagenomics/Illumina/Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-B.md
@@ -5,7 +5,7 @@
 ---
 
 **Date:** April MM, 2026  
-**Revision:** -B  
+**Revision:** B  
 **Document Number:** GL-DPPD-7107  
 
 **Submitted by:**  
@@ -17,6 +17,7 @@ Danielle Lopez (OSDR Deputy Project Manager)
 Amanda Saravia-Butler (OSDR Subject Matter Expert)  
 Barbara Novak (GeneLab Data Processing Lead)  
 
+
 ---
 
 ## Updates from previous version  <!-- omit in toc -->
@@ -24,25 +25,33 @@ Barbara Novak (GeneLab Data Processing Lead)
 Software Updates and Changes:
 
 | Program      | Previous Version | New Version |
-| :----------- | :--------------- | :---------- |
-| MultiQC      | 1.19             | 1.27.1      |
-| samtools     | 1.20             | 1.22.1      |
-| Kaiju        | N/A              | 1.10.1      |
-| fastp        | N/A              | 0.24.0      |
-| Kaiju        | N/A              | 1.10.1      |
-| Kraken2      | N/A              | 2.1.6       |
-| KrakenTools  | N/A              | 1.2         |
-| Krona        | N/A              | 2.8.1       |
-| SPAdes       | N/A              | 4.1.0       |
-| R            | N/A              | 4.5.1       |
-| Bioconductor | N/A              | 3.21        |
-| optparse     | N/A              | 1.7.5       |
-| pavian       | N/A              | 1.2.1       |
-| pheatmap     | N/A              | 1.0.13      |
-| phyloseq     | N/A              | 1.52.0      |
-| tidyverse    | N/A              | 2.0.0       |
-
-- Sync this pipeline with the new low-biomass pipelines (update formatting and definitions)
+| :----------- | :--------------: | :---------: |
+| MultiQC      |       1.19       |   1.27.1    |
+| samtools     |       1.20       |   1.22.1    |
+| Kaiju        |       N/A        |   1.10.1    |
+| fastp        |       N/A        |    1.3.1    |
+| Kaiju        |       N/A        |   1.10.1    |
+| Kraken2      |       N/A        |    2.1.6    |
+| KrakenTools  |       N/A        |     1.2     |
+| Krona        |       N/A        |    2.8.1    |
+| SPAdes       |       N/A        |    4.1.0    |
+| R            |       N/A        |    4.5.3    |
+| htmlwidgets  |       N/A        |    1.6.4    |
+| pavian       |       N/A        |    1.2.0    |
+| pheatmap     |       N/A        |   1.0.13    |
+| phyloseq     |       N/A        |   1.54.0    |
+| plotly       |       N/A        |   4.12.0    |
+| dplyr        |       N/A        |    1.2.0    |
+| ggplot2      |       N/A        |    4.0.2    |
+| glue         |       N/A        |    1.8.0    |
+| purrr        |       N/A        |    1.2.1    |
+| readr        |       N/A        |    2.2.0    |
+| stringr      |       N/A        |    1.6.0    |
+| tibble       |       N/A        |    3.3.1    |
+| tidyr        |       N/A        |    1.3.2    |
+| htmlwidgets  |       N/A        |    1.6.4    |
+
+- Synced this pipeline with the new low-biomass pipelines (updated formatting and definitions)
 - Added new processing steps for additional taxonomic profiling tools and downstream processed data outputs in R
   - Add additional read-based processing taxonomic profiling methods:
     - Kaiju taxonomic profiling ([Step 18](#18-taxonomic-profiling-using-kaiju))
@@ -168,12 +177,12 @@ Software Updates and Changes:
 
 | Program      | Version | Relevant Links                                                                                                                                     |
 | :----------- | :-----: | :------------------------------------------------------------------------------------------------------------------------------------------------- |
-| bbduk        |  38.86  | [https://bbmap.org/](https://bbmap.org/)                                                                                                           |
-| bit          | 1.8.53  | [https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit](https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit)     |
-| bowtie2      |  2.4.1  | [https://bowtie-bio.sourceforge.net/bowtie2/index.shtml](https://bowtie-bio.sourceforge.net/bowtie2/index.shtml)                                   |
-| CAT          |  5.2.3  | [https://github.com/dutilh/CAT#cat-and-bat](https://github.com/dutilh/CAT#cat-and-bat)                                                             |
+| BBTools      |  39.80  | [https://bbmap.org/](https://bbmap.org/)                                                                                                           |
+| bit          | 1.13.15 | [https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit](https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit)     |
+| bowtie2      |  2.5.5  | [https://bowtie-bio.sourceforge.net/bowtie2/index.shtml](https://bowtie-bio.sourceforge.net/bowtie2/index.shtml)                                   |
+| CAT          |  5.2.3  | [https://github.com/MGXlab/CAT_pack](https://github.com/MGXlab/CAT_pack)                                                                           |
 | CheckM       |  1.1.3  | [https://github.com/Ecogenomics/CheckM](https://github.com/Ecogenomics/CheckM)                                                                     |
-| fastp        | 0.24.0  | [https://github.com/OpenGene/fastp](https://github.com/OpenGene/fastp)                                                                             |
+| fastp        |  1.3.1  | [https://github.com/OpenGene/fastp](https://github.com/OpenGene/fastp)                                                                             |
 | FastQC       | 0.12.1  | [https://www.bioinformatics.babraham.ac.uk/projects/fastqc/](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)                           |
 | GTDB-Tk      |  2.4.0  | [https://github.com/Ecogenomics/GTDBTk](https://github.com/Ecogenomics/GTDBTk)                                                                     |
 | HUMAnN       |   3.9   | [https://github.com/biobakery/humann](https://github.com/biobakery/humann)                                                                         |
@@ -189,13 +198,22 @@ Software Updates and Changes:
 | MultiQC      | 1.27.1  | [https://multiqc.info/](https://multiqc.info/)                                                                                                     |
 | Prodigal     |  2.6.3  | [https://github.com/hyattpd/Prodigal#prodigal](https://github.com/hyattpd/Prodigal#prodigal)                                                       |
 | samtools     | 1.22.1  | [https://github.com/samtools/samtools#samtools](https://github.com/samtools/samtools#samtools)                                                     |
-| R            |  4.5.1  | [https://www.r-project.org](https://www.r-project.org)                                                                                             |
-| Bioconductor |  3.21   | [https://www.bioconductor.org](https://www.bioconductor.org)                                                                                       |
-| optparse     |  1.7.5  | [https://cran.r-project.org/web/packages/optparse/index.html](https://cran.r-project.org/web/packages/optparse/index.html)                         |
-| pavian       |  1.2.1  | [https://github.com/fbreitwieser/pavian](https://github.com/fbreitwieser/pavian)                                                                   |
+| R            |  4.5.3  | [https://www.r-project.org](https://www.r-project.org)                                                                                             |
+| decontam     | 1.28.0  | [https://www.bioconductor.org/packages/release/bioc/html/decontam.html](https://www.bioconductor.org/packages/release/bioc/html/decontam.html)     |
+| dplyr        |  1.2.0  | [https://dplyr.tidyverse.org](https://dplyr.tidyverse.org)                                                                                         |
+| ggplot2      |  4.0.2  | [https://ggplot2.tidyverse.org](https://ggplot2.tidyverse.org)                                                                                     |
+| htmlwidgets  |  1.6.4  | [http://www.htmlwidgets.org](http://www.htmlwidgets.org)                                                                                           |
+| glue         |  1.8.0  | [https://glue.tidyverse.org](https://glue.tidyverse.org)                                                                                           |
+| pavian       |  1.2.0* | [https://github.com/fbreitwieser/pavian](https://github.com/fbreitwieser/pavian)                                                                   |
 | pheatmap     | 1.0.13  | [https://cran.r-project.org/package=pheatmap](https://cran.r-project.org/package=pheatmap)                                                         |
-| phyloseq     | 1.52.0  | [https://www.bioconductor.org/packages/release/bioc/html/phyloseq.html](https://www.bioconductor.org/packages/release/bioc/html/phyloseq.html)     |
-| tidyverse    |  2.0.0  | [https://www.tidyverse.org](https://www.tidyverse.org)                                                                                             |
+| phyloseq     | 1.54.0  | [https://www.bioconductor.org/packages/release/bioc/html/phyloseq.html](https://www.bioconductor.org/packages/release/bioc/html/phyloseq.html)     |
+| plotly       | 4.12.0  | [https://plotly-r.com](https://plotly-r.com)                                                                                                       |
+| purrr        |  1.2.1  | [https://purrr.tidyverse.org](https://purrr.tidyverse.org)                                                                                         |
+| readr        |  2.2.0  | [https://readr.tidyverse.org](https://readr.tidyverse.org)                                                                                         |
+| stringr      |  1.6.0  | [https://stringr.tidyverse.org](https://stringr.tidyverse.org)                                                                                     |
+| tibble       |  3.3.1  | [https://tibble.tidyverse.org](https://tibble.tidyverse.org)                                                                                       |
+| tidyr        |  1.3.2  | [https://tidyr.tidyverse.org](https://tidyr.tidyverse.org)                                                                                         |
+> **Note:** \*The pavian R package (version marked with an asterisk above) requires R version 4.0.5.
 
 ---
 
@@ -217,8 +235,8 @@ fastqc -o HRrm_fastqc_output *HRrm_GLmetagenomics.fastq.gz
 
 **Parameter Definitions:**
 
-* `-o` – the output directory to store results
-* `*HRrm_GLmetagenomics.fastq.gz` – the input reads are specified as a positional argument, and can be given all at once with wildcards like this, or as individual arguments with spaces in between them
+- `-o` – the output directory to store results
+- `*HRrm_GLmetagenomics.fastq.gz` – the input reads are specified as a positional argument, and can be given all at once with wildcards like this, or as individual arguments with spaces in between them
 
 **Input data:**
 
@@ -226,9 +244,8 @@ fastqc -o HRrm_fastqc_output *HRrm_GLmetagenomics.fastq.gz
 
 **Output data:**
 
-* *fastqc.html (FastQC output html summary)
-* *fastqc.zip (FastQC output data)
-
+- *fastqc.html (FastQC output html summary)
+- *fastqc.zip (FastQC output data)
 
 #### 1b. Compile Raw Data QC
 
@@ -266,14 +283,14 @@ multiqc --zip-data-dir \
 #### 2a. Filter Quality and Trim Adapters
 
 ```bash
-fastp --in1 sample1_R1_HRrm_GLmetagenomics.fastq.gz --out1 temp_sample1_R1_filtered.fastq.gz \
-      --in2 sample1_R2_HRrm_GLmetagenomics.fastq.gz --out2 temp_sample1_R2_filtered.fastq.gz \
+fastp --in1 sample_R1_HRrm_GLmetagenomics.fastq.gz --out1 temp_sample_R1_filtered.fastq.gz \
+      --in2 sample_R2_HRrm_GLmetagenomics.fastq.gz --out2 temp_sample_R2_filtered.fastq.gz \
       --qualified_quality_phred  20 \
       --length_required 50 \
       --thread 2 \
-      --detect_adapter_for_pe \
-      --json sample1.fastp.json \
-      --html sample1.fastp.html 2> sample1-fastp.log
+      --detect_adapter_for_pe --disable_trim_poly_g \
+      --json sample.fastp.json \
+      --html sample.fastp.html 2> sample-fastp.log
 ```
 
 **Parameter Definitions:**
@@ -286,6 +303,7 @@ fastp --in1 sample1_R1_HRrm_GLmetagenomics.fastq.gz --out1 temp_sample1_R1_filte
 - `--length_required` - the minimum read length. Shorter reads will be discarded (default: 50)
 - `--thread` - number of worker threads (default: 2)
 - `--detect_adapter_for_pe` - for paired end data, enable auto-detection of adapters
+- `--disable_trim_poly_g` - disables polyG tail trimming in this first pass; polyG trimming is performed separately in [Step 2b](#2b-trim-polyg)
 - `--json` - Specifies the json format report file name
 - `--html` - Specifies the html format report file name
 - `2> sample-fastp.log` - Redirects the stderr output to a log file.
@@ -301,15 +319,15 @@ fastp --in1 sample1_R1_HRrm_GLmetagenomics.fastq.gz --out1 temp_sample1_R1_filte
 #### 2b. Trim polyG
 
 ```bash
-fastp --in1 temp_sample1_R1_filtered.fastq.gz --out1 sample1_R1_filtered_GLmetagenomics.fastq.gz \
-      --in2 temp_sample1_R2_filtered.fastq.gz --out2 sample1_R2_filtered_GLmetagenomics.fastq.gz \
+fastp --in1 temp_sample_R1_filtered.fastq.gz --out1 sample_R1_filtered_GLmetagenomics.fastq.gz \
+      --in2 temp_sample_R2_filtered.fastq.gz --out2 sample_R2_filtered_GLmetagenomics.fastq.gz \
       --qualified_quality_phred  20 \
       --length_required 50 \
       --thread 2 \
       --detect_adapter_for_pe \
-      --json sample1.fastp.json \
-      --html sample1.fastp.html \
-      --trim_poly_g 2> sample1-fastp.log
+      --json sample.fastp.json \
+      --html sample.fastp.html \
+      --trim_poly_g 2> sample-fastp.log
 ```
 
 **Parameter Definitions:**
@@ -329,7 +347,7 @@ fastp --in1 temp_sample1_R1_filtered.fastq.gz --out1 sample1_R1_filtered_GLmetag
 
 **Input Data:**
 
-- /path/to/filtered_data/temp_sample1*.fastq.gz (round1 filtered/adapter trimmed reads, output from [Step 2a](#2a-filter-quality-and-trim-adapters)
+- /path/to/filtered_data/temp_sample*.fastq.gz (round 1 filtered/adapter-trimmed reads, output from [Step 2a](#2a-filter-quality-and-trim-adapters))
 
 **Output Data:**
 
@@ -357,7 +375,6 @@ fastqc -o filtered_fastqc_output/ *filtered_GLmetagenomics.fastq.gz
 - *fastqc.html (FastQC output html summary)
 - *fastqc.zip (FastQC output data)
 
-
 #### 2d. Compile Filtered/Trimmed Data QC
 
 ```
@@ -1138,15 +1155,15 @@ custom_palette <- custom_palette[-c(21:23,
 
 <br>  
 
-
 ---
 
 ## Assembly-based Processing
 
+
 ### 4. Sample assembly
 ```
-megahit -1 sample-1_R1_filtered_GLmetagenomics.fastq.gz -2 sample-1_R2_filtered_GLmetagenomics.fastq.gz \
-        -o sample-1-assembly -t NumberOfThreads --min-contig-length 500 > sample-1-assembly.log 2>&1
+megahit -1 sample_R1_filtered_GLmetagenomics.fastq.gz -2 sample_R2_filtered_GLmetagenomics.fastq.gz \
+        -o sample-assembly -t NumberOfThreads --min-contig-length 500 > sample-assembly.log 2>&1
 ```
 
 **Parameter Definitions:**  
@@ -1155,8 +1172,7 @@ megahit -1 sample-1_R1_filtered_GLmetagenomics.fastq.gz -2 sample-1_R2_filtered_
 -	`-o` – specifies output directory
 -	`-t` – specifies the number of threads to use
 -	`--min-contig-length` – specifies the minimum contig length to write out
--	`> sample1-assembly.log 2>&1` – sends stdout/stderr to log file
-
+-	`> sample-assembly.log 2>&1` – sends stdout/stderr to log file
 
 **Input data:**
 
@@ -1164,8 +1180,8 @@ megahit -1 sample-1_R1_filtered_GLmetagenomics.fastq.gz -2 sample-1_R2_filtered_
 
 **Output data:**
 
-* sample-1-assembly/final.contigs.fa (assembly file)
-* sample-1-assembly.log (log file)
+- sample-assembly/final.contigs.fa (assembly file)
+- sample-assembly.log (log file)
 
 <br>
 
@@ -1196,7 +1212,6 @@ bit-rename-fasta-headers -i sample/final.contigs.fa \
 
 - **sample-assembly_GLmetagenomics.fasta** (contig-renamed assembly file)
 
-
 #### 5b. Summarize assemblies
 
 ```bash
@@ -1217,7 +1232,6 @@ done
 -	`-o` – Specifies the output summary table.
 - `*-assembly_GLmetagenomics.fasta`	– Specifies the input assemblies to summarize, provided as positional arguments
 
-
 **Input data:**
 
 - *-assembly_GLmetagenomics.fasta (contig-renamed assembly files from [Step 5a](#5a-rename-contig-headers))
@@ -1262,9 +1276,9 @@ prodigal -a sample-genes.faa \
 
 **Output data:**
 
-* sample-genes.faa (gene-calls amino-acid fasta file)
-* sample-genes.fasta (gene-calls nucleotide fasta file)
-* **sample-genes.gff** (gene-calls in general feature format)
+- sample-genes.faa (gene-calls amino-acid fasta file)
+- sample-genes.fasta (gene-calls nucleotide fasta file)
+- **sample-genes.gff** (gene-calls in general feature format)
 
 <br>
 
@@ -1295,7 +1309,6 @@ mv sample-genes.fasta.tmp sample-genes_GLmetagenomics.fasta
 > **Note:**  
 > The annotation process overwrites the same temporary directory by default. When running multiple processes at a time, it is necessary to specify a specific temporary directory with the `--tmp-dir` argument as shown below.
 
-
 #### 7a. Download reference database of HMM models
 
 > **Note:** This step only needs to be done once.
@@ -1330,8 +1343,6 @@ exec_annotation -p profiles/ \
 - `--report-unannotated` – Specifies to generate an output for each entry, event when no KO is assigned.
 - `sample-genes_GLmetagenomics.faa` – Specifies the input file, provided as a positional argument. 
 
-
-
 **Input data:**
 
 - sample-genes_GLmetagenomics.faa (amino-acid fasta file, from [Step 6b](#6b-remove-line-wraps-in-gene-prediction-output))
@@ -1340,8 +1351,7 @@ exec_annotation -p profiles/ \
 
 **Output data:**
 
-- sample-1-KO-tab.tmp (table of KO annotations assigned to gene IDs)
-
+- sample-KO-tab.tmp (table of KO annotations assigned to gene IDs)
 
 #### 7c. Filter KO Outputs
 *Filter KO outputs to retain only those passing the KO-specific score and top hits.*
@@ -1359,7 +1369,6 @@ rm -rf sample-tmp-KO/ sample-KO-annots.tmp
 - `-i` – Specifies the input table.
 - `-o` – Specifies the output table.
 
-
 **Input data:**
 
 - sample-KO-tab.tmp (table of KO annotations assigned to gene IDs from [Step 7b](#7b-run-kegg-annotation))
@@ -1388,7 +1397,7 @@ CAT contigs -c sample-assembly_GLmetagenomics.fasta \
             -d CAT_prepare_20200618/2020-06-18_database/ \
             -t CAT_prepare_20200618/2020-06-18_taxonomy/ \
             -p sample-genes_GLmetagenomics.faa \
-            -o sample-1-tax-out.tmp \
+            -o sample-tax-out.tmp \
             -n NumberOfThreads -r 3 \
             --top 4 \
             --I_know_what_Im_doing \
@@ -1408,7 +1417,6 @@ CAT contigs -c sample-assembly_GLmetagenomics.fasta \
 - `--I_know_what_Im_doing` – Allows us to alter the `--top` parameter.
 - `--no-stars` - Suppress marking of suggestive taxonomic assignments.
 
-
 **Input data:**
 
 - CAT_prepare_20200618/2020-06-18_database/ (directory holding the CAT reference sequence database, output from [Step 8a](#8a-pull-and-unpack-pre-built-reference-db))
@@ -1425,7 +1433,7 @@ CAT contigs -c sample-assembly_GLmetagenomics.fasta \
 
 ```bash
 CAT add_names -i sample-tax-out.tmp.ORF2LCA.txt \
-              -o sample-1-gene-tax-out.tmp \
+              -o sample-gene-tax-out.tmp \
               -t CAT_prepare_20200618/2020-06-18_taxonomy/ \
               --only_official \
               --exclude-scores
@@ -1448,13 +1456,11 @@ CAT add_names -i sample-tax-out.tmp.ORF2LCA.txt \
 
 - sample-gene-tax-out.tmp (gene-calls taxonomy file with lineage info added)
 
-
-
 #### 8d. Add Taxonomy Info From Taxids To Contigs
 
 ```bash
-CAT add_names -i sample-1-tax-out.tmp.contig2classification.txt \
-              -o sample-1-contig-tax-out.tmp \
+CAT add_names -i sample-tax-out.tmp.contig2classification.txt \
+              -o sample-contig-tax-out.tmp \
               -t CAT_prepare_20200618/2020-06-18_taxonomy/ \
               --only_official \
               --exclude-scores
@@ -1468,7 +1474,6 @@ CAT add_names -i sample-1-tax-out.tmp.contig2classification.txt \
 - `--only_official` – Specifies to add only standard taxonomic ranks.
 - `--exclude-scores` - Specifies to exclude bit-score support scores in the lineage.
 
-
 **Input data:**
 
 - sample-tax-out.tmp.contig2classification.txt (contig taxonomy file from [Step 8b](#8b-run-taxonomic-classification))
@@ -1478,7 +1483,6 @@ CAT add_names -i sample-1-tax-out.tmp.contig2classification.txt \
 
 - sample-contig-tax-out.tmp (contig taxonomy file with lineage info added)
 
-
 #### 8e. Format Gene-level Output With awk and sed
 
 ```bash
@@ -1492,7 +1496,7 @@ awk -F $'\t' ' BEGIN { OFS=FS } { if ( $3 == "lineage" ) { print $1,$3,$5,$6,$7,
 
 **Input Data:**
 
-* sample-gene-tax-out.tmp (gene-calls taxonomy file with lineage info added from [Step 8c](#8c-add-taxonomy-info-from-taxids-to-genes))
+- sample-gene-tax-out.tmp (gene-calls taxonomy file with lineage info added from [Step 8c](#8c-add-taxonomy-info-from-taxids-to-genes))
 
 **Output Data:**
 
@@ -1515,7 +1519,6 @@ rm sample*.tmp*
 
 - sample-contig-tax-out.tmp (contig taxonomy file with lineage info added from [Step 8d](#8d-add-taxonomy-info-from-taxids-to-contigs))
 
-
 **Output data:**
 
 - sample-contig-tax-out.tsv (reformatted contig taxonomy file with lineage info)
@@ -1577,7 +1580,6 @@ bowtie2 --mm --quiet --threads ${task.cpus} \
 - sample.sam (reads aligned to sample assembly in SAM format)
 - **sample-mapping-info_GLmetagenomics.txt** (read mapping information)
 
-
 #### 9c. Sort Assembly Alignments
 
 ```bash
@@ -1633,13 +1635,11 @@ pileup.sh -in sample_GLmetagenomics.bam \
 - sample_GLmetagenomics.bam (sorted mapping to sample assembly BAM file, output from [Step 9c](#9c-sort-assembly-alignments))
 - sample-genes_GLmetagenomics.fasta (gene-calls nucleotide fasta file, output from [Step 6b](#6b-remove-line-wraps-in-gene-prediction-output))
 
-
 **Output Data:**
 
 - sample-gene-cov-and-det.tmp (gene-coverage tsv file)
 - sample-contig-cov-and-det.tmp (contig-coverage tsv file)
 
-
 #### 10b. Filter Gene and Contig Coverage Based On Detection
 
 > *The following commands filter gene and contig coverage tsv files to only keep genes and contigs with at least 50% detection (as defined above) then parse the tables to retain only gene IDs and respective coverage.*
@@ -1669,8 +1669,8 @@ rm sample-*.tmp
 
 **Output data:**
 
-* sample-gene-coverages.tsv (table with gene-level coverages)
-* sample-contig-coverages.tsv (table with contig-level coverages)
+- sample-gene-coverages.tsv (table with gene-level coverages)
+- sample-contig-coverages.tsv (table with contig-level coverages)
 
 <br>
 
@@ -1699,14 +1699,13 @@ rm sample*tmp sample-gene-coverages.tsv sample-annotations.tsv sample-gene-tax-o
 
 **Input data:**
 
-* sample-gene-coverages.tsv (table with gene-level coverages from [Step 10b](#10b-filter-gene-and-contig-coverage-based-on-detection))
-* sample-annotations.tsv (table of KO annotations assigned to gene IDs from [Step 7c](#7c-filter-ko-outputs))
-* sample-gene-tax-out.tsv (gene-level taxonomic classifications from [Step 8f](#8f-format-contig-level-output-with-awk-and-sed))
-
+- sample-gene-coverages.tsv (table with gene-level coverages from [Step 10b](#10b-filter-gene-and-contig-coverage-based-on-detection))
+- sample-annotations.tsv (table of KO annotations assigned to gene IDs from [Step 7c](#7c-filter-ko-outputs))
+- sample-gene-tax-out.tsv (gene-level taxonomic classifications from [Step 8e](#8e-format-gene-level-output-with-awk-and-sed))
 
 **Output data:**
 
-* **sample-gene-coverage-annotation-and-tax_GLmetagenomics.tsv** (table with combined gene coverage, annotation, and taxonomy info)
+- **sample-gene-coverage-annotation-and-tax_GLmetagenomics.tsv** (table with combined gene coverage, annotation, and taxonomy info)
 
 <br>
 
@@ -1736,7 +1735,6 @@ rm sample*tmp sample-contig-coverages.tsv sample-contig-tax-out.tsv
 - sample-contig-coverages.tsv (table with contig-level coverages from [Step 10b](#10b-filter-gene-and-contig-coverage-based-on-detection))
 - sample-contig-tax-out.tsv (contig-level taxonomic classifications from [Step 8f](#8f-format-contig-level-output-with-awk-and-sed))
 
-
 **Output data:**
 
 - **sample-contig-coverage-and-tax_GLmetagenomics.tsv** (table with combined contig coverage and taxonomy info)
@@ -1753,7 +1751,6 @@ rm sample*tmp sample-contig-coverages.tsv sample-contig-tax-out.tsv
 
 #### 13a. Generate Gene-level Coverage Summary Tables
 
-
 ```bash
 bit-GL-combine-KO-and-tax-tables *-gene-coverage-annotation-and-tax_GLmetagenomics.tsv \
                                  -o Combined
@@ -1767,10 +1764,8 @@ mv "Combined-gene-level-taxonomy-coverages.tsv Combined-gene-level-taxonomy-cove
 **Parameter Definitions:**  
 
 - `*-gene-coverage-annotation-and-tax_GLmetagenomics.tsv` - Positional arguments specifying the input tsv files, can be provided as a space-delimited list of files, or with wildcards like above.
-
 - `-o` – Specifies the output file prefix.
 
-
 **Input data:**
 
 - *-gene-coverage-annotation-and-tax_GLmetagenomics.tsv (tables with combined gene coverage, annotation, and taxonomy info generated for individual samples from [Step 11](#11-combine-gene-level-coverage-taxonomy-and-functional-annotations-for-each-sample))
@@ -1782,7 +1777,6 @@ mv "Combined-gene-level-taxonomy-coverages.tsv Combined-gene-level-taxonomy-cove
 - **Combined-gene-level-KO-function-coverages_GLmetagenomics.tsv** (table with all samples combined based on KO annotations)
 - **Combined-gene-level-taxonomy-coverages_GLmetagenomics.tsv** (table with all samples combined based on gene-level taxonomic classifications)
 
-
 #### 13b. Generate Contig-level Coverage Summary Tables
 
 ```bash
@@ -1794,15 +1788,14 @@ bit-GL-combine-contig-tax-tables *-contig-coverage-and-tax_GLmetagenomics.tsv -o
 - `*-contig-coverage-and-tax_GLmetagenomics.tsv` - Positional arguments specifying the input tsv files, can be provided as a space-delimited list of files, or with wildcards like above.
 - `-o` – Specifies the output file prefix.
 
-
 **Input data:**
 
-* *-contig-coverage-annotation-and-tax_GLmetagenomics.tsv (tables with combined contig coverage, annotation, and taxonomy info generated for individual samples from [Step 12](#12-combine-contig-level-coverage-and-taxonomy-for-each-sample))
+- *-contig-coverage-and-tax_GLmetagenomics.tsv (tables with combined contig coverage and taxonomy info generated for individual samples from [Step 12](#12-combine-contig-level-coverage-and-taxonomy-for-each-sample))
 
 **Output data:**
 
-* **Combined-contig-level-taxonomy-coverages-CPM_GLmetagenomics.tsv** (table with all samples combined based on contig-level taxonomic classifications; normalized to coverage per million genes covered)
-* **Combined-contig-level-taxonomy-coverages_GLmetagenomics.tsv** (table with all samples combined based on contig-level taxonomic classifications)
+- **Combined-contig-level-taxonomy-coverages-CPM_GLmetagenomics.tsv** (table with all samples combined based on contig-level taxonomic classifications; normalized to coverage per million genes covered)
+- **Combined-contig-level-taxonomy-coverages_GLmetagenomics.tsv** (table with all samples combined based on contig-level taxonomic classifications)
 
 <br>
 
@@ -1846,7 +1839,6 @@ zip -r sample-bins_GLmetagenomics.zip sample-bins
 -  `--abdFile` - The depth file generated by the previous `jgi_summarize_bam_contig_depths` command.
 -  `-t` - Number of parallel processing threads to use.
 
-
 **Input data:**
 
 - sample-assembly_GLmetagenomics.fasta (assembly fasta file created in [Step 5a](#5a-rename-contig-headers))
@@ -1922,7 +1914,6 @@ done
 - MAGs/\*.fasta (directory holding high-quality MAGs)
 - **\*-MAGs_GLmetagenomics.zip** (zip files containing directories of high-quality MAGs)
 
-
 #### 14d. MAG taxonomic classification
 > Uses default `gtdbtk` database setup with program's `download.sh` command.
 
@@ -1943,11 +1934,11 @@ gtdbtk classify_wf --genome_dir MAGs/ \
 
 **Input data:**
 
-* MAGs/\*.fasta (directory holding high-quality MAGs from [Step 14c](#14c-filter-mags))
+- MAGs/\*.fasta (directory holding high-quality MAGs from [Step 14c](#14c-filter-mags))
 
 **Output data:**
 
-* gtdbtk-output-dir/gtdbtk.\*.summary.tsv (files with assigned taxonomy and info)
+- gtdbtk-output-dir/gtdbtk.\*.summary.tsv (files with assigned taxonomy and info)
 
 #### 14e. Generate Overview Table Of All MAGs
 
@@ -1996,7 +1987,7 @@ cat MAGs-overview-header.tmp MAGs-overview-sorted.tmp \
 
 **Output data:**
 
-* **MAGs-overview_GLmetagenomics.tsv** (a tab-delimited overview of all recovered MAGs)
+- **MAGs-overview_GLmetagenomics.tsv** (a tab-delimited overview of all recovered MAGs)
 
 <br>
 
@@ -2039,8 +2030,7 @@ done
 
 **Output data:**
 
-* **MAG-level-KO-annotations_GLmetagenomics.tsv** (tab-delimited table holding MAGs and their KO annotations)
-
+- **MAG-level-KO-annotations_GLmetagenomics.tsv** (tab-delimited table holding MAGs and their KO annotations)
 
 #### 15b. Summarize KO annotations with KEGG-Decoder
 
@@ -2143,7 +2133,6 @@ table2write <- get_abundant_features(feature_table, cpm_threshold=threshold) %>%
 
 write_tsv(x = table2write, file = "Combined-gene-level-taxonomy_filtered_GLmetagenomics.tsv")
 
-
 make_heatmap(metadata_table_file = metadata_table, 
              feature_table_file = "Combined-gene-level-taxonomy_filtered_GLmetagenomics.tsv", 
              samples_column="sample_id", group_column = "group", 
@@ -2336,7 +2325,6 @@ make_heatmap(metadata_table_file = metadata_table,
                          species/functions as the first column and samples as other columns.
 - `assembly_summary` - path to a tab-separated file containing statistics on assemblies created for each sample
 
-
 **Input data:**
 
 - assembly-summaries_GLmetagenomics.tsv (table of assembly summary statistics, output from [Step 5b](#5b-summarize-assemblies))
@@ -2470,7 +2458,6 @@ rm nr_euk/kaiju_db_nr_euk.bwt nr_euk/kaiju_db_nr_euk.sa
 - kaiju-db/names.dmp (taxonomy names file from the NCBI Taxonomy database that maps taxonomic IDs to their scientific names)
 - kaiju-db/merged.dmp (merged taxonomy IDs file from the NCBI Taxonomy database that maps deprecated taxonomic IDs to current ones)
 
-
 #### 18b. Kaiju Taxonomic Classification
 
 ```bash
@@ -2478,8 +2465,8 @@ kaiju -f kaiju-db/nr_euk/kaiju_db_nr_euk.fmi \
       -t kaiju-db/nodes.dmp \
       -z NumberOfThreads \
       -E 1e-05 \
-      -i /path/to/sample1_R1_filtered_GLmetagenomics.fastq.gz \
-      -j /path/to/sample1_R2_filtered_GLmetagenomics.fastq.gz \
+      -i /path/to/sample_R1_filtered_GLmetagenomics.fastq.gz \
+      -j /path/to/sample_R2_filtered_GLmetagenomics.fastq.gz \
       -o sample_kaiju.out
 ```
 
@@ -2499,7 +2486,6 @@ kaiju -f kaiju-db/nr_euk/kaiju_db_nr_euk.fmi \
 - kaiju-db/nodes.dmp (kaiju taxonomy hierarchy nodes file, output from [Step 18a](#18a-build-kaiju-database))
 - *_R[12]_filtered_GLmetagenomics.fastq.gz (filtered/trimmed reads from [Step 2b](#2b-trim-polyg) above)
 
-
 **Output Data:**
 
 - sample_kaiju.out (kaiju output file)
@@ -2643,7 +2629,6 @@ write_tsv(x = table2write, file = "kaiju_species_table_GLmetagenomics.tsv")
 
 - **kaiju_species_table_GLmetagenomics.tsv** (kaiju species count table in tsv format)
 
-
 #### 18g. Filter Kaiju Species Count Table
 
 ```R
@@ -2730,7 +2715,6 @@ make_barplot(metadata_file = metadata_file, feature_table_file = filtered_specie
 - `kaiju_filtered_species_table_GLmetagenomics.tsv` (a file containing the filtered species count table, output from [Step 18g](#18g-filter-kaiju-species-count-table))
 - `/path/to/sample/metadata` (a file containing sample-wise metadata, mapping sample names to group metadata)
 
-
 **Output Data:**
 
 - kaiju_unfiltered_species_barplot_GLmetagenomics.png (taxonomy barplot without filtering)
@@ -2803,7 +2787,7 @@ kraken2 --db kraken2-db/ \
         --use-names \
         --output sample-kraken2-output.txt \
         --report sample-kraken2-report.tsv \
-        /path/to/sample1_R1_filtered_GLmetagenomics.fastq.gz /path/to/sample1_R2_filtered_GLmetagenomics.fastq.gz
+        /path/to/sample_R1_filtered_GLmetagenomics.fastq.gz /path/to/sample_R2_filtered_GLmetagenomics.fastq.gz
 ```
 
 **Parameter Definitions:**
@@ -2814,22 +2798,19 @@ kraken2 --db kraken2-db/ \
 - `--use-names` - Specifies to add taxa names in addition to taxids.
 - `--output` - Specifies the name of the kraken2 read-based output file.
 - `--report` - Specifies the name of the kraken2 report output file.
-- `sample1_R1_filtered_GLmetagenomics.fastq.gz` - Positional argument specifying the forward read input file.
-- `sample1_R2_filtered_GLmetagenomics.fastq.gz` - Positional argument specifying the reverse read input file.
-
+- `sample_R1_filtered_GLmetagenomics.fastq.gz` - Positional argument specifying the forward read input file.
+- `sample_R2_filtered_GLmetagenomics.fastq.gz` - Positional argument specifying the reverse read input file.
 
 **Input Data:**
 
 - kraken2-db/ (a directory containing kraken2 database files, output from [Step 19a](#19a-download-kraken2-database))
 - *_R[12]_filtered_GLmetagenomics.fastq.gz (filtered/trimmed reads from [Step 2b](#2b-trim-polyg) above)
 
-
 **Output Data:**
 
 - sample-kraken2-output.txt (kraken2 read-based output file (one line per read))
 - sample-kraken2-report.tsv (kraken2 report output file (one line per taxa, with number of reads assigned to it))
 
-
 #### 19c. Compile Kraken2 Taxonomy Results
 
 ##### 19ci. Create Merged Kraken2 Taxonomy Table
@@ -2883,7 +2864,6 @@ multiqc --zip-data-dir \
 - **kraken2_multiqc_GLmetagenomics.html** (multiqc output html summary)
 - **kraken2_multiqc_GLmetagenomics_data.zip** (zip archive containing multiqc output data)
 
-
 #### 19d. Convert Kraken2 Output to Krona Format
 
 ```bash
@@ -2904,7 +2884,6 @@ kreport2krona.py --report-file sample-kraken2-report.tsv  \
 
 - sample.krona (krona formatted kraken2 output)
 
-
 #### 19e. Compile Kraken2 Krona Reports
 
 ```bash
@@ -2954,7 +2933,6 @@ ktImportText -o kraken2-report_GLmetagenomics.html ${KTEXT_FILES[*]}
 - sample_names.txt (sorted list of all sample names)
 - **kraken2-report_GLmetagenomics.html** (compiled krona html report containing all samples)
 
-
 #### 19f. Filter Kraken2 Species Count Table
 
 ```R
@@ -3075,8 +3053,7 @@ metaphlan --install
 
 **Output Data**
 
-`/path/to/humann3-db` (the installed MetaPhlan databases)
-
+- /path/to/humann3-db (the installed MetaPhlAn databases)
 
 #### 20b. HUMAnN/MetaPhlAn Taxonomic Classification
 ```bash
diff --git a/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md b/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md
index 3437ea7f2..36803fe2a 100644
--- a/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md
+++ b/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md
@@ -149,41 +149,48 @@ Barbara Novak (GeneLab Data Processing Lead)
 
 # Software used
 
-|Program|Version|Relevant Links|
-|:------|:-----:|------:|
-|bbduk| 38.86 |[https://bbmap.org](https://bbmap.org)|
-|bit| 1.8.53 |[https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit](https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit)|
-|CAT| 5.2.3 |[https://github.com/dutilh/CAT#cat-and-bat](https://github.com/dutilh/CAT#cat-and-bat)|
-|CheckM| 1.1.3 |[https://github.com/Ecogenomics/CheckM](https://github.com/Ecogenomics/CheckM)|
-|Dorado| 1.1.1| [https://github.com/nanoporetech/dorado](https://github.com/nanoporetech/dorado)|
-|Filtlong| 0.2.1 |[https://github.com/rrwick/Filtlong](https://github.com/rrwick/Filtlong)|
-|Flye| 2.9.5 | [https://github.com/mikolmogorov/Flye](https://github.com/mikolmogorov/Flye) |
-|GTDB-Tk| 2.4.0 |[https://github.com/Ecogenomics/GTDBTk](https://github.com/Ecogenomics/GTDBTk)|
-|HUMAnN| 3.9 |[https://github.com/biobakery/humann](https://github.com/biobakery/humann)|
-|Kaiju| 1.10.1 | [https://bioinformatics-centre.github.io/kaiju/](https://bioinformatics-centre.github.io/kaiju/) |
-|KEGG-Decoder| 1.2.2 |[https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder](https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder)
-|KOFamScan| 1.3.0 |[https://github.com/takaram/kofam_scan](https://github.com/takaram/kofam_scan)|
-|Kraken2| 2.1.6 | [https://github.com/DerrickWood/kraken2](https://github.com/DerrickWood/kraken2) |
-|KrakenTools| 1.2 | [https://ccb.jhu.edu/software/krakentools/](https://ccb.jhu.edu/software/krakentools/) |
-|Krona| 2.8.1 | [https://github.com/marbl/Krona/wiki](https://github.com/marbl/Krona/wiki)|
-|Medaka| 2.1.1 | [https://github.com/nanoporetech/medaka](https://github.com/nanoporetech/medaka) |
-|MetaBAT| 2.15 |[https://bitbucket.org/berkeleylab/metabat/src/master/](https://bitbucket.org/berkeleylab/metabat/src/master/)|
-|MetaPhlAn| 4.1.0 |[https://github.com/biobakery/MetaPhlAn](https://github.com/biobakery/MetaPhlAn)|
-|Minimap2| 2.28 | [https://github.com/lh3/minimap2](https://github.com/lh3/minimap2) |
-|MultiQC| 1.27.1 |[https://multiqc.info/](https://multiqc.info/)|
-|NanoPlot| 1.44.1 | [https://github.com/wdecoster/NanoPlot](https://github.com/wdecoster/NanoPlot)|
-|Porechop| 0.2.4 | [https://github.com/rrwick/Porechop](https://github.com/rrwick/Porechop) |
-|Prodigal| 2.6.3 |[https://github.com/hyattpd/Prodigal#prodigal](https://github.com/hyattpd/Prodigal#prodigal)|
-|samtools| 1.22.1 |[https://github.com/samtools/samtools#samtools](https://github.com/samtools/samtools#samtools)|
-| R | 4.5.1 | [https://www.r-project.org](https://www.r-project.org) |
-|Bioconductor | 3.21 | [https://www.bioconductor.org](https://www.bioconductor.org) |
-|decontam| 1.28.0 | [https://www.bioconductor.org/packages/release/bioc/html/decontam.html](https://www.bioconductor.org/packages/release/bioc/html/decontam.html) |
-|optparse| 1.7.5 |[https://cran.r-project.org/web/packages/optparse/index.html](https://cran.r-project.org/web/packages/optparse/index.html) |
-|pavian| 1.2.1 | [https://github.com/fbreitwieser/pavian](https://github.com/fbreitwieser/pavian) |
-|pheatmap| 1.0.13 | [https://cran.r-project.org/package=pheatmap](https://cran.r-project.org/package=pheatmap) |
-|phyloseq| 1.52.0 | [https://www.bioconductor.org/packages/release/bioc/html/phyloseq.html](https://www.bioconductor.org/packages/release/bioc/html/phyloseq.html) |
-|tidyverse| 2.0.0 | [https://www.tidyverse.org](https://www.tidyverse.org) |
-
+| Program      | Version | Relevant Links                                                                                                                                     |
+| :----------- | :-----: | :------------------------------------------------------------------------------------------------------------------------------------------------- |
+| BBTools      |  39.80  | [https://bbmap.org](https://bbmap.org)                                                                                                             |
+| bit          | 1.13.15 | [https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit](https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit)     |
+| CAT          |  5.2.3  | [https://github.com/dutilh/CAT#cat-and-bat](https://github.com/dutilh/CAT#cat-and-bat)                                                             |
+| CheckM       |  1.1.3  | [https://github.com/Ecogenomics/CheckM](https://github.com/Ecogenomics/CheckM)                                                                     |
+| Dorado       |  1.1.1  | [https://github.com/nanoporetech/dorado](https://github.com/nanoporetech/dorado)                                                                   |
+| Filtlong     |  0.2.1  | [https://github.com/rrwick/Filtlong](https://github.com/rrwick/Filtlong)                                                                           |
+| Flye         |  2.9.5  | [https://github.com/mikolmogorov/Flye](https://github.com/mikolmogorov/Flye)                                                                       |
+| GTDB-Tk      |  2.4.0  | [https://github.com/Ecogenomics/GTDBTk](https://github.com/Ecogenomics/GTDBTk)                                                                     |
+| HUMAnN       |   3.9   | [https://github.com/biobakery/humann](https://github.com/biobakery/humann)                                                                         |
+| Kaiju        | 1.10.1  | [https://bioinformatics-centre.github.io/kaiju/](https://bioinformatics-centre.github.io/kaiju/)                                                   |
+| KEGG-Decoder |  1.2.2  | [https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder](https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder) |
+| KOFamScan    |  1.3.0  | [https://github.com/takaram/kofam_scan](https://github.com/takaram/kofam_scan)                                                                     |
+| Kraken2      |  2.1.6  | [https://github.com/DerrickWood/kraken2](https://github.com/DerrickWood/kraken2)                                                                   |
+| KrakenTools  |  1.2.1  | [https://ccb.jhu.edu/software/krakentools/](https://ccb.jhu.edu/software/krakentools/)                                                             |
+| Krona        |  2.8.1  | [https://github.com/marbl/Krona/wiki](https://github.com/marbl/Krona/wiki)                                                                         |
+| Medaka       |  2.2.0  | [https://github.com/nanoporetech/medaka](https://github.com/nanoporetech/medaka)                                                                   |
+| MetaBAT      |  2.15   | [https://bitbucket.org/berkeleylab/metabat/src/master/](https://bitbucket.org/berkeleylab/metabat/src/master/)                                     |
+| MetaPhlAn    |  4.1.0  | [https://github.com/biobakery/MetaPhlAn](https://github.com/biobakery/MetaPhlAn)                                                                   |
+| Minimap2     |  2.28   | [https://github.com/lh3/minimap2](https://github.com/lh3/minimap2)                                                                                 |
+| MultiQC      | 1.27.1  | [https://multiqc.info/](https://multiqc.info/)                                                                                                     |
+| NanoPlot     | 1.44.1  | [https://github.com/wdecoster/NanoPlot](https://github.com/wdecoster/NanoPlot)                                                                     |
+| Porechop     |  0.2.4  | [https://github.com/rrwick/Porechop](https://github.com/rrwick/Porechop)                                                                           |
+| Prodigal     |  2.6.3  | [https://github.com/hyattpd/Prodigal#prodigal](https://github.com/hyattpd/Prodigal#prodigal)                                                       |
+| samtools     | 1.22.1  | [https://github.com/samtools/samtools#samtools](https://github.com/samtools/samtools#samtools)                                                     |
+| R            |  4.5.3  | [https://www.r-project.org](https://www.r-project.org)                                                                                             |
+| decontam     | 1.28.0  | [https://www.bioconductor.org/packages/release/bioc/html/decontam.html](https://www.bioconductor.org/packages/release/bioc/html/decontam.html)     |
+| dplyr        |  1.2.0  | [https://dplyr.tidyverse.org](https://dplyr.tidyverse.org)                                                                                         |
+| ggplot2      |  4.0.2  | [https://ggplot2.tidyverse.org](https://ggplot2.tidyverse.org)                                                                                     |
+| glue         |  1.8.0  | [https://glue.tidyverse.org](https://glue.tidyverse.org)                                                                                           |
+| purrr        |  1.2.1  | [https://purrr.tidyverse.org](https://purrr.tidyverse.org)                                                                                         |
+| readr        |  2.2.0  | [https://readr.tidyverse.org](https://readr.tidyverse.org)                                                                                         |
+| stringr      |  1.6.0  | [https://stringr.tidyverse.org](https://stringr.tidyverse.org)                                                                                     |
+| tibble       |  3.3.1  | [https://tibble.tidyverse.org](https://tibble.tidyverse.org)                                                                                       |
+| tidyr        |  1.3.2  | [https://tidyr.tidyverse.org](https://tidyr.tidyverse.org)                                                                                         |
+| htmlwidgets  |  1.6.4  | [http://www.htmlwidgets.org](http://www.htmlwidgets.org)                                                                                           |
+| pavian       |  1.2.0* | [https://github.com/fbreitwieser/pavian](https://github.com/fbreitwieser/pavian)                                                                   |
+| pheatmap     | 1.0.13  | [https://cran.r-project.org/package=pheatmap](https://cran.r-project.org/package=pheatmap)                                                         |
+| phyloseq     | 1.54.0  | [https://www.bioconductor.org/packages/release/bioc/html/phyloseq.html](https://www.bioconductor.org/packages/release/bioc/html/phyloseq.html)     |
+| plotly       | 4.12.0  | [https://plotly-r.com](https://plotly-r.com)                                                                                                       |
+> **\*Note:** The pavian R package requires R version 4.0.5.
 ---
 
 # General processing overview with example commands
diff --git a/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md b/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md
index 533e4d951..9e107a362 100644
--- a/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md
+++ b/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md
@@ -141,38 +141,46 @@ Barbara Novak (GeneLab Data Processing Lead)
 
 # Software used
 
-|Program|Version|Relevant Links|
-|:------|:-----:|------:|
-|bbduk| 38.86 |[https://bbmap.org/](https://bbmap.org/)|
-|bit| 1.8.53 |[https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit](https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit)|
-|bowtie2| 2.4.1 | [https://bowtie-bio.sourceforge.net/bowtie2/index.shtml](https://bowtie-bio.sourceforge.net/bowtie2/index.shtml)|
-|CAT| 5.2.3 |[https://github.com/dutilh/CAT#cat-and-bat](https://github.com/dutilh/CAT#cat-and-bat)|
-|CheckM| 1.1.3 |[https://github.com/Ecogenomics/CheckM](https://github.com/Ecogenomics/CheckM)|
-|fastp| 0.24.0 |[https://github.com/OpenGene/fastp](https://github.com/OpenGene/fastp)|
-|FastQC|0.12.1|[https://www.bioinformatics.babraham.ac.uk/projects/fastqc/](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)|
-|GTDB-Tk| 2.4.0 |[https://github.com/Ecogenomics/GTDBTk](https://github.com/Ecogenomics/GTDBTk)|
-|HUMAnN| 3.9 |[https://github.com/biobakery/humann](https://github.com/biobakery/humann)|
-|Kaiju| 1.10.1 | [https://bioinformatics-centre.github.io/kaiju/](https://bioinformatics-centre.github.io/kaiju/) |
-|KEGG-Decoder| 1.2.2 |[https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder](https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder)
-|KOFamScan| 1.3.0 |[https://github.com/takaram/kofam_scan](https://github.com/takaram/kofam_scan)|
-|Kraken2| 2.1.6 | [https://github.com/DerrickWood/kraken2](https://github.com/DerrickWood/kraken2) |
-|KrakenTools | 1.2 | [https://ccb.jhu.edu/software/krakentools/](https://ccb.jhu.edu/software/krakentools/) |
-|Krona| 2.8.1 | [https://github.com/marbl/Krona/wiki](https://github.com/marbl/Krona/wiki)|
-|MEGAHIT| 1.2.9 |[https://github.com/voutcn/megahit#megahit](https://github.com/voutcn/megahit#megahit)|
-|MetaBAT| 2.15 |[https://bitbucket.org/berkeleylab/metabat/src/master/](https://bitbucket.org/berkeleylab/metabat/src/master/)|
-|MultiQC| 1.27.1 |[https://multiqc.info/](https://multiqc.info/)|
-|MetaPhlAn| 4.1.0 |[https://github.com/biobakery/MetaPhlAn](https://github.com/biobakery/MetaPhlAn)|
-|Prodigal| 2.6.3 |[https://github.com/hyattpd/Prodigal#prodigal](https://github.com/hyattpd/Prodigal#prodigal)|
-|samtools| 1.22.1 |[https://github.com/samtools/samtools#samtools](https://github.com/samtools/samtools#samtools)|
-|SPAdes| 4.1.0 | [https://github.com/ablab/spades](https://github.com/ablab/spades) |
-| R | 4.5.1 | [https://www.r-project.org](https://www.r-project.org) |
-|Bioconductor | 3.21 | [https://www.bioconductor.org](https://www.bioconductor.org) |
-|decontam| 1.28.0 | [https://www.bioconductor.org/packages/release/bioc/html/decontam.html](https://www.bioconductor.org/packages/release/bioc/html/decontam.html) |
-|optparse| 1.7.5 |[https://cran.r-project.org/web/packages/optparse/index.html](https://cran.r-project.org/web/packages/optparse/index.html) |
-|pavian| 1.2.1 | [https://github.com/fbreitwieser/pavian](https://github.com/fbreitwieser/pavian) |
-|pheatmap| 1.0.13 | [https://cran.r-project.org/package=pheatmap](https://cran.r-project.org/package=pheatmap) |
-|phyloseq| 1.52.0 | [https://www.bioconductor.org/packages/release/bioc/html/phyloseq.html](https://www.bioconductor.org/packages/release/bioc/html/phyloseq.html) |
-|tidyverse| 2.0.0 | [https://www.tidyverse.org](https://www.tidyverse.org) |
+| Program      | Version | Relevant Links                                                                                                                                     |
+| :----------- | :-----: | :------------------------------------------------------------------------------------------------------------------------------------------------- |
+| BBTools      |  39.80  | [https://bbmap.org/](https://bbmap.org/)                                                                                                           |
+| bit          | 1.13.15 | [https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit](https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit)     |
+| bowtie2      |  2.5.5  | [https://bowtie-bio.sourceforge.net/bowtie2/index.shtml](https://bowtie-bio.sourceforge.net/bowtie2/index.shtml)                                   |
+| CAT          |  5.2.3  | [https://github.com/dutilh/CAT#cat-and-bat](https://github.com/dutilh/CAT#cat-and-bat)                                                             |
+| CheckM       |  1.1.3  | [https://github.com/Ecogenomics/CheckM](https://github.com/Ecogenomics/CheckM)                                                                     |
+| fastp        |  1.3.1  | [https://github.com/OpenGene/fastp](https://github.com/OpenGene/fastp)                                                                             |
+| FastQC       | 0.12.1  | [https://www.bioinformatics.babraham.ac.uk/projects/fastqc/](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)                           |
+| GTDB-Tk      |  2.4.0  | [https://github.com/Ecogenomics/GTDBTk](https://github.com/Ecogenomics/GTDBTk)                                                                     |
+| HUMAnN       |   3.9   | [https://github.com/biobakery/humann](https://github.com/biobakery/humann)                                                                         |
+| Kaiju        | 1.10.1  | [https://bioinformatics-centre.github.io/kaiju/](https://bioinformatics-centre.github.io/kaiju/)                                                   |
+| KEGG-Decoder |  1.2.2  | [https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder](https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder) |
+| KOFamScan    |  1.3.0  | [https://github.com/takaram/kofam_scan](https://github.com/takaram/kofam_scan)                                                                     |
+| Kraken2      |  2.1.6  | [https://github.com/DerrickWood/kraken2](https://github.com/DerrickWood/kraken2)                                                                   |
+| KrakenTools  |   1.2   | [https://ccb.jhu.edu/software/krakentools/](https://ccb.jhu.edu/software/krakentools/)                                                             |
+| Krona        |  2.8.1  | [https://github.com/marbl/Krona/wiki](https://github.com/marbl/Krona/wiki)                                                                         |
+| MEGAHIT      |  1.2.9  | [https://github.com/voutcn/megahit#megahit](https://github.com/voutcn/megahit#megahit)                                                             |
+| MetaBAT      |  2.15   | [https://bitbucket.org/berkeleylab/metabat/src/master/](https://bitbucket.org/berkeleylab/metabat/src/master/)                                     |
+| MultiQC      | 1.27.1  | [https://multiqc.info/](https://multiqc.info/)                                                                                                     |
+| MetaPhlAn    |  4.1.0  | [https://github.com/biobakery/MetaPhlAn](https://github.com/biobakery/MetaPhlAn)                                                                   |
+| Prodigal     |  2.6.3  | [https://github.com/hyattpd/Prodigal#prodigal](https://github.com/hyattpd/Prodigal#prodigal)                                                       |
+| samtools     | 1.22.1  | [https://github.com/samtools/samtools#samtools](https://github.com/samtools/samtools#samtools)                                                     |
+| SPAdes       |  4.1.0  | [https://github.com/ablab/spades](https://github.com/ablab/spades)                                                                                 |
+| R            |  4.5.3  | [https://www.r-project.org](https://www.r-project.org)                                                                                             |
+| decontam     | 1.28.0  | [https://www.bioconductor.org/packages/release/bioc/html/decontam.html](https://www.bioconductor.org/packages/release/bioc/html/decontam.html)     |
+| dplyr        |  1.2.0  | [https://dplyr.tidyverse.org](https://dplyr.tidyverse.org)                                                                                         |
+| ggplot2      |  4.0.2  | [https://ggplot2.tidyverse.org](https://ggplot2.tidyverse.org)                                                                                     |
+| htmlwidgets  |  1.6.4  | [http://www.htmlwidgets.org](http://www.htmlwidgets.org)                                                                                           |
+| glue         |  1.8.0  | [https://glue.tidyverse.org](https://glue.tidyverse.org)                                                                                           |
+| pavian       |  1.2.0* | [https://github.com/fbreitwieser/pavian](https://github.com/fbreitwieser/pavian)                                                                   |
+| pheatmap     | 1.0.13  | [https://cran.r-project.org/package=pheatmap](https://cran.r-project.org/package=pheatmap)                                                         |
+| phyloseq     | 1.54.0  | [https://www.bioconductor.org/packages/release/bioc/html/phyloseq.html](https://www.bioconductor.org/packages/release/bioc/html/phyloseq.html)     |
+| plotly       | 4.12.0  | [https://plotly-r.com](https://plotly-r.com)                                                                                                       |
+| purrr        |  1.2.1  | [https://purrr.tidyverse.org](https://purrr.tidyverse.org)                                                                                         |
+| readr        |  2.2.0  | [https://readr.tidyverse.org](https://readr.tidyverse.org)                                                                                         |
+| stringr      |  1.6.0  | [https://stringr.tidyverse.org](https://stringr.tidyverse.org)                                                                                     |
+| tibble       |  3.3.1  | [https://tibble.tidyverse.org](https://tibble.tidyverse.org)                                                                                       |
+| tidyr        |  1.3.2  | [https://tidyr.tidyverse.org](https://tidyr.tidyverse.org)                                                                                         |
+> **\*Note:** The pavian R package requires R version 4.0.5.
 
 ---
 
@@ -248,7 +256,7 @@ fastp --in1 sample_R1_HRrm_GLlbsMetag.fastq.gz --out1 temp_sample_R1_filtered.fa
       --qualified_quality_phred  20 \
       --length_required 50 \
       --thread 2 \
-      --detect_adapter_for_pe \
+      --detect_adapter_for_pe --disable_trim_poly_g \
       --json sample.fastp.json \
       --html sample.fastp.html 2> sample-fastp.log
 ```
@@ -263,6 +271,7 @@ fastp --in1 sample_R1_HRrm_GLlbsMetag.fastq.gz --out1 temp_sample_R1_filtered.fa
 - `--length_required` - the minimum read length. Shorter reads will be discarded (default: 50)
 - `--thread` - number of worker threads (default: 2)
 - `--detect_adapter_for_pe` - for paired end data, enable auto-detection of adapters
+- `--disable_trim_poly_g` - explicitly disable automatic polyG trimming
 - `--json` - Specifies the json format report file name
 - `--html` - Specifies the html format report file name
 - `2> sample-fastp.log` - Redirects the stderr output to a log file.

From 69b3f0e3e2e13486c1b4375b7df9b64b25590a0e Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Fri, 3 Apr 2026 20:18:42 -0700
Subject: [PATCH 42/47] updated bbduk to BBTools

---
 .../Illumina/Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-B.md   | 2 +-
 .../Low_Biomass/Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md  | 2 +-
 .../Low_Biomass/Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md  | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)
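
Reviewer note (not part of the patch): one way to confirm that no stale attribution remains after this rename is a recursive search of the documentation tree. The sketch below assumes the pipeline documents sit under a `Metagenomics/` directory, as the paths in this patch suggest; the function name `find_stale_bbduk_refs` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list Markdown lines that still attribute pileup.sh
# to the "bbduk.sh package". Matches are printed as file:line:text.
find_stale_bbduk_refs() {
    # Assumption: docs live under Metagenomics/ unless a directory is given
    local docs_dir="${1:-Metagenomics}"
    # -r: recurse, -n: show line numbers, --include: only Markdown files
    grep -rn --include='*.md' 'bbduk\.sh package' "$docs_dir"
}
```

Because `grep` exits non-zero when nothing matches, a clean tree makes the function return a failing status, which may be convenient in a pre-merge check.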

diff --git a/Metagenomics/Illumina/Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-B.md b/Metagenomics/Illumina/Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-B.md
index 45d65a215..b3bbd9d1c 100644
--- a/Metagenomics/Illumina/Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-B.md
+++ b/Metagenomics/Illumina/Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-B.md
@@ -1616,7 +1616,7 @@ samtools sort --threads NumberOfThreads \
 #### 10a. Filter Coverage Levels Based On Detection
 
 ```bash
-# pileup.sh comes from the bbduk.sh package
+# pileup.sh comes from the BBTools package
 pileup.sh -in sample_GLmetagenomics.bam \
           fastaorf=sample-genes_GLmetagenomics.fasta \
           outorf=sample-gene-cov-and-det.tmp \
diff --git a/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md b/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md
index 36803fe2a..9472b4264 100644
--- a/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md
+++ b/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md
@@ -3605,7 +3605,7 @@ Filtering based on detection is one way of helping to mitigate non-specific read
 #### 20a. Filter Coverage Levels Based On Detection
 
 ```bash
-# pileup.sh comes from the bbduk.sh package
+# pileup.sh comes from the BBTools package
 pileup.sh -in sample_GLlblMetag.bam \
           fastaorf=sample-genes_GLlblMetag.fasta \
           outorf=sample-gene-cov-and-det.tmp \
diff --git a/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md b/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md
index 9e107a362..6d5f46ff1 100644
--- a/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md
+++ b/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md
@@ -3462,7 +3462,7 @@ Filtering based on detection is one way of helping to mitigate non-specific read
 #### 15a. Filter Coverage Levels Based On Detection
 
 ```bash
-# pileup.sh comes from the bbduk.sh package
+# pileup.sh comes from the BBTools package
 pileup.sh -in sample_GLlbsMetag.bam \
           fastaorf=sample-genes_GLlbsMetag.fasta \
           outorf=sample-gene-cov-and-det.tmp \

From a3ad2d6995055072463c75fc0bf01f145b7fe5d6 Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Tue, 14 Apr 2026 19:41:43 -0700
Subject: [PATCH 43/47] Final updates

- Updated software versions to latest possible
- List individual tidyverse packages instead of the tidyverse
  collection for more granular software versions
- Add final pipeline approval date
---
 .../GL-DPPD-7107-B.md                         | 43 +++++++++-----
 .../GL-DPPD-7116.md                           | 59 +++++++++++--------
 .../GL-DPPD-7117.md                           | 49 +++++++++------
 3 files changed, 95 insertions(+), 56 deletions(-)
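
Reviewer note (not part of the patch): with versions now listed per package, the "Software used" tables in the three documents can be compared mechanically. This is a minimal sketch, assuming the three-column `| Program | Version | Relevant Links |` layout shown in this patch; the function name `list_versions` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print "program<TAB>version" for each pipe-table row
# in a pipeline document (the file itself, not a diff of it).
list_versions() {
    awk -F'|' '
        # Only consider lines that start with "|" and have at least 3 cells
        index($0, "|") == 1 && NF >= 4 {
            prog = $2; ver = $3
            gsub(/^[ \t]+|[ \t]+$/, "", prog)   # trim surrounding whitespace
            gsub(/^[ \t]+|[ \t]+$/, "", ver)
            # Skip the header row and the alignment row (e.g. :---- or :--:)
            if (prog == "Program" || prog ~ /^:?-+:?$/) next
            if (prog != "" && ver != "") print prog "\t" ver
        }
    ' "$1"
}
```

The output of two documents can then be compared in bash with, for example, `diff <(list_versions GL-DPPD-7116.md) <(list_versions GL-DPPD-7117.md)`. Any other pipe table in the same file with three or more columns would also be printed, so the result may need filtering.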

diff --git a/Metagenomics/Illumina/Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-B.md b/Metagenomics/Illumina/Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-B.md
index b3bbd9d1c..45026e247 100644
--- a/Metagenomics/Illumina/Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-B.md
+++ b/Metagenomics/Illumina/Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-B.md
@@ -4,7 +4,7 @@
 
 ---
 
-**Date:** April MM, 2026  
+**Date:** April 3, 2026  
 **Revision:** B  
 **Document Number:** GL-DPPD-7107  
 
@@ -177,39 +177,41 @@ Software Updates and Changes:
 
 | Program      | Version | Relevant Links                                                                                                                                     |
 | :----------- | :-----: | :------------------------------------------------------------------------------------------------------------------------------------------------- |
-| BBTools      |  39.80  | [https://bbmap.org/](https://bbmap.org/)                                                                                                           |
+| BBTools      |  39.81  | [https://bbmap.org/](https://bbmap.org/)                                                                                                           |
 | bit          | 1.13.15 | [https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit](https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit)     |
 | bowtie2      |  2.5.5  | [https://bowtie-bio.sourceforge.net/bowtie2/index.shtml](https://bowtie-bio.sourceforge.net/bowtie2/index.shtml)                                   |
-| CAT          |  5.2.3  | [https://github.com/MGXlab/CAT_pack](https://github.com/MGXlab/CAT_pack)                                                                           |
-| CheckM       |  1.1.3  | [https://github.com/Ecogenomics/CheckM](https://github.com/Ecogenomics/CheckM)                                                                     |
+| CAT          |   5.3   | [https://github.com/MGXlab/CAT_pack](https://github.com/MGXlab/CAT_pack)                                                                           |
+| CheckM       |  1.2.5  | [https://github.com/Ecogenomics/CheckM](https://github.com/Ecogenomics/CheckM)                                                                     |
 | fastp        |  1.3.1  | [https://github.com/OpenGene/fastp](https://github.com/OpenGene/fastp)                                                                             |
 | FastQC       | 0.12.1  | [https://www.bioinformatics.babraham.ac.uk/projects/fastqc/](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)                           |
-| GTDB-Tk      |  2.4.0  | [https://github.com/Ecogenomics/GTDBTk](https://github.com/Ecogenomics/GTDBTk)                                                                     |
+| GTDB-Tk      |  2.6.1  | [https://github.com/Ecogenomics/GTDBTk](https://github.com/Ecogenomics/GTDBTk)                                                                     |
 | HUMAnN       |   3.9   | [https://github.com/biobakery/humann](https://github.com/biobakery/humann)                                                                         |
 | Kaiju        | 1.10.1  | [https://bioinformatics-centre.github.io/kaiju/](https://bioinformatics-centre.github.io/kaiju/)                                                   |
-| KEGG-Decoder |  1.2.2  | [https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder](https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder) |
+| KEGG-Decoder |   1.3   | [https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder](https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder) |
 | KOFamScan    |  1.3.0  | [https://github.com/takaram/kofam_scan#kofamscan](https://github.com/takaram/kofam_scan#kofamscan)                                                 |
-| Kraken2      |  2.1.6  | [https://github.com/DerrickWood/kraken2](https://github.com/DerrickWood/kraken2)                                                                   |
-| KrakenTools  |   1.2   | [https://ccb.jhu.edu/software/krakentools/](https://ccb.jhu.edu/software/krakentools/)                                                             |
+| Kraken2      | 2.17.1  | [https://github.com/DerrickWood/kraken2](https://github.com/DerrickWood/kraken2)                                                                   |
+| KrakenTools  |  1.2.1  | [https://ccb.jhu.edu/software/krakentools/](https://ccb.jhu.edu/software/krakentools/)                                                             |
 | Krona        |  2.8.1  | [https://github.com/marbl/Krona/wiki](https://github.com/marbl/Krona/wiki)                                                                         |
 | MEGAHIT      |  1.2.9  | [https://github.com/voutcn/megahit#megahit](https://github.com/voutcn/megahit#megahit)                                                             |
-| MetaBAT      |  2.15   | [https://bitbucket.org/berkeleylab/metabat/src/master/](https://bitbucket.org/berkeleylab/metabat/src/master/)                                     |
+| MetaBAT      |  2.18   | [https://bitbucket.org/berkeleylab/metabat/src/master/](https://bitbucket.org/berkeleylab/metabat/src/master/)                                     |
 | MetaPhlAn    |  4.1.0  | [https://github.com/biobakery/MetaPhlAn](https://github.com/biobakery/MetaPhlAn)                                                                   |
 | MultiQC      | 1.27.1  | [https://multiqc.info/](https://multiqc.info/)                                                                                                     |
 | Prodigal     |  2.6.3  | [https://github.com/hyattpd/Prodigal#prodigal](https://github.com/hyattpd/Prodigal#prodigal)                                                       |
-| samtools     | 1.22.1  | [https://github.com/samtools/samtools#samtools](https://github.com/samtools/samtools#samtools)                                                     |
+| samtools     | 1.23.1  | [https://github.com/samtools/samtools#samtools](https://github.com/samtools/samtools#samtools)                                                     |
+| SPAdes       |  4.2.0  | [https://github.com/ablab/spades](https://github.com/ablab/spades)                                                                                 |
 | R            |  4.5.3  | [https://www.r-project.org](https://www.r-project.org)                                                                                             |
-| decontam     | 1.28.0  | [https://www.bioconductor.org/packages/release/bioc/html/decontam.html](https://www.bioconductor.org/packages/release/bioc/html/decontam.html)     |
 | dplyr        |  1.2.0  | [https://dplyr.tidyverse.org](https://dplyr.tidyverse.org)                                                                                         |
 | ggplot2      |  4.0.2  | [https://ggplot2.tidyverse.org](https://ggplot2.tidyverse.org)                                                                                     |
-| htmlwidgets  |  1.6.4  | [http://www.htmlwidgets.org](http://www.htmlwidgets.org)                                                                                           |
 | glue         |  1.8.0  | [https://glue.tidyverse.org](https://glue.tidyverse.org)                                                                                           |
-| pavian       |  1.2.0* | [https://github.com/fbreitwieser/pavian](https://github.com/fbreitwieser/pavian)                                                                   |
+| htmlwidgets  |  1.6.4  | [http://www.htmlwidgets.org](http://www.htmlwidgets.org)                                                                                           |
+| magrittr     |  2.0.5  | [https://magrittr.tidyverse.org](https://magrittr.tidyverse.org)                                                                                   |
+| pavian       | 1.2.0*  | [https://github.com/fbreitwieser/pavian](https://github.com/fbreitwieser/pavian)                                                                   |
 | pheatmap     | 1.0.13  | [https://cran.r-project.org/package=pheatmap](https://cran.r-project.org/package=pheatmap)                                                         |
 | phyloseq     | 1.54.0  | [https://www.bioconductor.org/packages/release/bioc/html/phyloseq.html](https://www.bioconductor.org/packages/release/bioc/html/phyloseq.html)     |
 | plotly       | 4.12.0  | [https://plotly-r.com](https://plotly-r.com)                                                                                                       |
 | purrr        |  1.2.1  | [https://purrr.tidyverse.org](https://purrr.tidyverse.org)                                                                                         |
 | readr        |  2.2.0  | [https://readr.tidyverse.org](https://readr.tidyverse.org)                                                                                         |
+| scales       |  1.4.0  | [https://scales.r-lib.org](https://scales.r-lib.org)                                                                                               |
 | stringr      |  1.6.0  | [https://stringr.tidyverse.org](https://stringr.tidyverse.org)                                                                                     |
 | tibble       |  3.3.1  | [https://tibble.tidyverse.org](https://tibble.tidyverse.orgtext)                                                                                   |
 | tidyr        |  1.3.2  | [https://tidyr.tidyverse.org](https://tidyr.tidyverse.orgtext)                                                                                     |
@@ -411,13 +413,24 @@ multiqc --zip-data-dir \
 #### 3a. Load libraries
 
 ```R
-library(glue)
+# load libraries
 library(htmlwidgets)
 library(pavian)
 library(pheatmap)
 library(phyloseq)
+
+# load tidyverse and related libraries
+library(dplyr)
+library(ggplot2)
+library(glue)
+library(magrittr)
 library(plotly)
-library(tidyverse)
+library(purrr)
+library(readr)
+library(scales)
+library(stringr)
+library(tibble)
+library(tidyr)
 ```
 
 #### 3b. Define Custom Functions
diff --git a/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md b/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md
index 9472b4264..8c076a270 100644
--- a/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md
+++ b/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md
@@ -4,7 +4,7 @@
 
 ---
 
-**Date:** March MM, 2026  
+**Date:** April 3, 2026  
 **Revision:** -  
 **Document Number:** GL-DPPD-7116  
 
@@ -151,45 +151,47 @@ Barbara Novak (GeneLab Data Processing Lead)
 
 | Program      | Version | Relevant Links                                                                                                                                     |
 | :----------- | :-----: | :------------------------------------------------------------------------------------------------------------------------------------------------- |
-| BBTools      |  39.80  | [https://bbmap.org](https://bbmap.org)                                                                                                             |
+| BBTools      |  39.81  | [https://bbmap.org](https://bbmap.org)                                                                                                             |
 | bit          | 1.13.15 | [https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit](https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit)     |
-| CAT          |  5.2.3  | [https://github.com/dutilh/CAT#cat-and-bat](https://github.com/dutilh/CAT#cat-and-bat)                                                             |
-| CheckM       |  1.1.3  | [https://github.com/Ecogenomics/CheckM](https://github.com/Ecogenomics/CheckM)                                                                     |
-| Dorado       |  1.1.1  | [https://github.com/nanoporetech/dorado](https://github.com/nanoporetech/dorado)                                                                   |
-| Filtlong     |  0.2.1  | [https://github.com/rrwick/Filtlong](https://github.com/rrwick/Filtlong)                                                                           |
-| Flye         |  2.9.5  | [https://github.com/mikolmogorov/Flye](https://github.com/mikolmogorov/Flye)                                                                       |
-| GTDB-Tk      |  2.4.0  | [https://github.com/Ecogenomics/GTDBTk](https://github.com/Ecogenomics/GTDBTk)                                                                     |
+| CAT          |   5.3   | [https://github.com/MGXlab/CAT_pack](https://github.com/MGXlab/CAT_pack)                                                                           |
+| CheckM       |  1.2.5  | [https://github.com/Ecogenomics/CheckM](https://github.com/Ecogenomics/CheckM)                                                                     |
+| Dorado       |  1.3.0  | [https://github.com/nanoporetech/dorado](https://github.com/nanoporetech/dorado)                                                                   |
+| Filtlong     |  0.3.1  | [https://github.com/rrwick/Filtlong](https://github.com/rrwick/Filtlong)                                                                           |
+| Flye         |  2.9.6  | [https://github.com/mikolmogorov/Flye](https://github.com/mikolmogorov/Flye)                                                                       |
+| GTDB-Tk      |  2.6.1  | [https://github.com/Ecogenomics/GTDBTk](https://github.com/Ecogenomics/GTDBTk)                                                                     |
 | HUMAnN       |   3.9   | [https://github.com/biobakery/humann](https://github.com/biobakery/humann)                                                                         |
 | Kaiju        | 1.10.1  | [https://bioinformatics-centre.github.io/kaiju/](https://bioinformatics-centre.github.io/kaiju/)                                                   |
-| KEGG-Decoder |  1.2.2  | [https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder](https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder) |
-| KOFamScan    |  1.3.0  | [https://github.com/takaram/kofam_scan](https://github.com/takaram/kofam_scan)                                                                     |
-| Kraken2      |  2.1.6  | [https://github.com/DerrickWood/kraken2](https://github.com/DerrickWood/kraken2)                                                                   |
+| KEGG-Decoder |   1.3   | [https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder](https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder) |
+| KOFamScan    |  1.3.0  | [https://github.com/takaram/kofam_scan#kofamscan](https://github.com/takaram/kofam_scan#kofamscan)                                                 |
+| Kraken2      | 2.17.1  | [https://github.com/DerrickWood/kraken2](https://github.com/DerrickWood/kraken2)                                                                   |
 | KrakenTools  |  1.2.1  | [https://ccb.jhu.edu/software/krakentools/](https://ccb.jhu.edu/software/krakentools/)                                                             |
 | Krona        |  2.8.1  | [https://github.com/marbl/Krona/wiki](https://github.com/marbl/Krona/wiki)                                                                         |
 | Medaka       |  2.2.0  | [https://github.com/nanoporetech/medaka](https://github.com/nanoporetech/medaka)                                                                   |
-| MetaBAT      |  2.15   | [https://bitbucket.org/berkeleylab/metabat/src/master/](https://bitbucket.org/berkeleylab/metabat/src/master/)                                     |
+| MetaBAT      |  2.18   | [https://bitbucket.org/berkeleylab/metabat/src/master/](https://bitbucket.org/berkeleylab/metabat/src/master/)                                     |
 | MetaPhlAn    |  4.1.0  | [https://github.com/biobakery/MetaPhlAn](https://github.com/biobakery/MetaPhlAn)                                                                   |
-| Minimap2     |  2.28   | [https://github.com/lh3/minimap2](https://github.com/lh3/minimap2)                                                                                 |
+| Minimap2     |  2.30   | [https://github.com/lh3/minimap2](https://github.com/lh3/minimap2)                                                                                 |
 | MultiQC      | 1.27.1  | [https://multiqc.info/](https://multiqc.info/)                                                                                                     |
-| NanoPlot     | 1.44.1  | [https://github.com/wdecoster/NanoPlot](https://github.com/wdecoster/NanoPlot)                                                                     |
+| NanoPlot     | 1.45.2  | [https://github.com/wdecoster/NanoPlot](https://github.com/wdecoster/NanoPlot)                                                                     |
 | Porechop     |  0.2.4  | [https://github.com/rrwick/Porechop](https://github.com/rrwick/Porechop)                                                                           |
 | Prodigal     |  2.6.3  | [https://github.com/hyattpd/Prodigal#prodigal](https://github.com/hyattpd/Prodigal#prodigal)                                                       |
-| samtools     | 1.22.1  | [https://github.com/samtools/samtools#samtools](https://github.com/samtools/samtools#samtools)                                                     |
+| samtools     | 1.23.1  | [https://github.com/samtools/samtools#samtools](https://github.com/samtools/samtools#samtools)                                                     |
 | R            |  4.5.3  | [https://www.r-project.org](https://www.r-project.org)                                                                                             |
-| decontam     | 1.28.0  | [https://www.bioconductor.org/packages/release/bioc/html/decontam.html](https://www.bioconductor.org/packages/release/bioc/html/decontam.html)     |
+| decontam     | 1.30.0  | [https://www.bioconductor.org/packages/release/bioc/html/decontam.html](https://www.bioconductor.org/packages/release/bioc/html/decontam.html)     |
 | dplyr        |  1.2.0  | [https://dplyr.tidyverse.org](https://dplyr.tidyverse.org)                                                                                         |
 | ggplot2      |  4.0.2  | [https://ggplot2.tidyverse.org](https://ggplot2.tidyverse.org)                                                                                     |
 | glue         |  1.8.0  | [https://glue.tidyverse.org](https://glue.tidyverse.org)                                                                                           |
+| htmlwidgets  |  1.6.4  | [http://www.htmlwidgets.org](http://www.htmlwidgets.org)                                                                                           |
+| magrittr     |  2.0.5  | [https://magrittr.tidyverse.org](https://magrittr.tidyverse.org)                                                                                   |
+| pavian       | 1.2.0*  | [https://github.com/fbreitwieser/pavian](https://github.com/fbreitwieser/pavian)                                                                   |
+| pheatmap     | 1.0.13  | [https://cran.r-project.org/package=pheatmap](https://cran.r-project.org/package=pheatmap)                                                         |
+| phyloseq     | 1.54.0  | [https://www.bioconductor.org/packages/release/bioc/html/phyloseq.html](https://www.bioconductor.org/packages/release/bioc/html/phyloseq.html)     |
+| plotly       | 4.12.0  | [https://plotly-r.com](https://plotly-r.com)                                                                                                       |
 | purrr        |  1.2.1  | [https://purrr.tidyverse.org](https://purrr.tidyverse.org)                                                                                         |
 | readr        |  2.2.0  | [https://readr.tidyverse.org](https://readr.tidyverse.org)                                                                                         |
+| scales       |  1.4.0  | [https://scales.r-lib.org](https://scales.r-lib.org)                                                                                               |
 | stringr      |  1.6.0  | [https://stringr.tidyverse.org](https://stringr.tidyverse.org)                                                                                     |
 | tibble       |  3.3.1  | [https://tibble.tidyverse.org](https://tibble.tidyverse.orgtext)                                                                                   |
 | tidyr        |  1.3.2  | [https://tidyr.tidyverse.org](https://tidyr.tidyverse.orgtext)                                                                                     |
-| htmlwidgets  |  1.6.4  | [http://www.htmlwidgets.org](http://www.htmlwidgets.org)                                                                                           |
-| pavian       |  1.2.0* | [https://github.com/fbreitwieser/pavian](https://github.com/fbreitwieser/pavian)                                                                   |
-| pheatmap     | 1.0.13  | [https://cran.r-project.org/package=pheatmap](https://cran.r-project.org/package=pheatmap)                                                         |
-| phyloseq     | 1.54.0  | [https://www.bioconductor.org/packages/release/bioc/html/phyloseq.html](https://www.bioconductor.org/packages/release/bioc/html/phyloseq.html)     |
-| plotly       | 4.12.0  | [https://plotly-r.com](https://plotly-r.com)                                                                                                       |
 > **Note:** pavian R package requires R version 4.0.5
 ---
 
@@ -1027,14 +1029,25 @@ multiqc --zip-data-dir \
 #### 9a. Load libraries
 
 ```R
+# load libraries
 library(decontam)
-library(glue)
 library(htmlwidgets)
 library(pavian)
 library(pheatmap)
 library(phyloseq)
+
+# load tidyverse libraries
+library(dplyr)
+library(glue)
+library(ggplot2)
+library(magrittr)
 library(plotly)
-library(tidyverse)
+library(purrr)
+library(readr)
+library(scales)
+library(stringr)
+library(tibble)
+library(tidyr)
 ```
 
 #### 9b. Define Custom Functions
diff --git a/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md b/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md
index 6d5f46ff1..8320910e5 100644
--- a/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md
+++ b/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md
@@ -4,7 +4,7 @@
 
 ---
 
-**Date:** March MM, 2026  
+**Date:** April 3, 2026  
 **Revision:** -  
 **Document Number:** GL-DPPD-7117  
 
@@ -143,40 +143,42 @@ Barbara Novak (GeneLab Data Processing Lead)
 
 | Program      | Version | Relevant Links                                                                                                                                     |
 | :----------- | :-----: | :------------------------------------------------------------------------------------------------------------------------------------------------- |
-| BBTools      |  39.80  | [https://bbmap.org/](https://bbmap.org/)                                                                                                           |
+| BBTools      |  39.81  | [https://bbmap.org/](https://bbmap.org/)                                                                                                           |
 | bit          | 1.13.15 | [https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit](https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit)     |
 | bowtie2      |  2.5.5  | [https://bowtie-bio.sourceforge.net/bowtie2/index.shtml](https://bowtie-bio.sourceforge.net/bowtie2/index.shtml)                                   |
-| CAT          |  5.2.3  | [https://github.com/dutilh/CAT#cat-and-bat](https://github.com/dutilh/CAT#cat-and-bat)                                                             |
-| CheckM       |  1.1.3  | [https://github.com/Ecogenomics/CheckM](https://github.com/Ecogenomics/CheckM)                                                                     |
+| CAT          |   5.3   | [https://github.com/MGXlab/CAT_pack](https://github.com/MGXlab/CAT_pack)                                                                           |
+| CheckM       |  1.2.5  | [https://github.com/Ecogenomics/CheckM](https://github.com/Ecogenomics/CheckM)                                                                     |
 | fastp        |  1.3.1  | [https://github.com/OpenGene/fastp](https://github.com/OpenGene/fastp)                                                                             |
 | FastQC       | 0.12.1  | [https://www.bioinformatics.babraham.ac.uk/projects/fastqc/](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)                           |
-| GTDB-Tk      |  2.4.0  | [https://github.com/Ecogenomics/GTDBTk](https://github.com/Ecogenomics/GTDBTk)                                                                     |
+| GTDB-Tk      |  2.6.1  | [https://github.com/Ecogenomics/GTDBTk](https://github.com/Ecogenomics/GTDBTk)                                                                     |
 | HUMAnN       |   3.9   | [https://github.com/biobakery/humann](https://github.com/biobakery/humann)                                                                         |
 | Kaiju        | 1.10.1  | [https://bioinformatics-centre.github.io/kaiju/](https://bioinformatics-centre.github.io/kaiju/)                                                   |
-| KEGG-Decoder |  1.2.2  | [https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder](https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder) |
-| KOFamScan    |  1.3.0  | [https://github.com/takaram/kofam_scan](https://github.com/takaram/kofam_scan)                                                                     |
-| Kraken2      |  2.1.6  | [https://github.com/DerrickWood/kraken2](https://github.com/DerrickWood/kraken2)                                                                   |
-| KrakenTools  |   1.2   | [https://ccb.jhu.edu/software/krakentools/](https://ccb.jhu.edu/software/krakentools/)                                                             |
+| KEGG-Decoder |   1.3   | [https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder](https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder) |
+| KOFamScan    |  1.3.0  | [https://github.com/takaram/kofam_scan#kofamscan](https://github.com/takaram/kofam_scan#kofamscan)                                                 |
+| Kraken2      | 2.17.1  | [https://github.com/DerrickWood/kraken2](https://github.com/DerrickWood/kraken2)                                                                   |
+| KrakenTools  |  1.2.1  | [https://ccb.jhu.edu/software/krakentools/](https://ccb.jhu.edu/software/krakentools/)                                                             |
 | Krona        |  2.8.1  | [https://github.com/marbl/Krona/wiki](https://github.com/marbl/Krona/wiki)                                                                         |
 | MEGAHIT      |  1.2.9  | [https://github.com/voutcn/megahit#megahit](https://github.com/voutcn/megahit#megahit)                                                             |
-| MetaBAT      |  2.15   | [https://bitbucket.org/berkeleylab/metabat/src/master/](https://bitbucket.org/berkeleylab/metabat/src/master/)                                     |
-| MultiQC      | 1.27.1  | [https://multiqc.info/](https://multiqc.info/)                                                                                                     |
+| MetaBAT      |  2.18   | [https://bitbucket.org/berkeleylab/metabat/src/master/](https://bitbucket.org/berkeleylab/metabat/src/master/)                                     |
 | MetaPhlAn    |  4.1.0  | [https://github.com/biobakery/MetaPhlAn](https://github.com/biobakery/MetaPhlAn)                                                                   |
+| MultiQC      | 1.27.1  | [https://multiqc.info/](https://multiqc.info/)                                                                                                     |
 | Prodigal     |  2.6.3  | [https://github.com/hyattpd/Prodigal#prodigal](https://github.com/hyattpd/Prodigal#prodigal)                                                       |
-| samtools     | 1.22.1  | [https://github.com/samtools/samtools#samtools](https://github.com/samtools/samtools#samtools)                                                     |
-| SPAdes       |  4.1.0  | [https://github.com/ablab/spades](https://github.com/ablab/spades)                                                                                 |
+| samtools     | 1.23.1  | [https://github.com/samtools/samtools#samtools](https://github.com/samtools/samtools#samtools)                                                     |
+| SPAdes       |  4.2.0  | [https://github.com/ablab/spades](https://github.com/ablab/spades)                                                                                 |
 | R            |  4.5.3  | [https://www.r-project.org](https://www.r-project.org)                                                                                             |
-| decontam     | 1.28.0  | [https://www.bioconductor.org/packages/release/bioc/html/decontam.html](https://www.bioconductor.org/packages/release/bioc/html/decontam.html)     |
+| decontam     | 1.30.0  | [https://www.bioconductor.org/packages/release/bioc/html/decontam.html](https://www.bioconductor.org/packages/release/bioc/html/decontam.html)     |
 | dplyr        |  1.2.0  | [https://dplyr.tidyverse.org](https://dplyr.tidyverse.org)                                                                                         |
 | ggplot2      |  4.0.2  | [https://ggplot2.tidyverse.org](https://ggplot2.tidyverse.org)                                                                                     |
-| htmlwidgets  |  1.6.4  | [http://www.htmlwidgets.org](http://www.htmlwidgets.org)                                                                                           |
 | glue         |  1.8.0  | [https://glue.tidyverse.org](https://glue.tidyverse.org)                                                                                           |
-| pavian       |  1.2.0* | [https://github.com/fbreitwieser/pavian](https://github.com/fbreitwieser/pavian)                                                                   |
+| htmlwidgets  |  1.6.4  | [http://www.htmlwidgets.org](http://www.htmlwidgets.org)                                                                                           |
+| magrittr     |  2.0.5  | [https://magrittr.tidyverse.org](https://magrittr.tidyverse.org)                                                                                   |
+| pavian       | 1.2.0*  | [https://github.com/fbreitwieser/pavian](https://github.com/fbreitwieser/pavian)                                                                   |
 | pheatmap     | 1.0.13  | [https://cran.r-project.org/package=pheatmap](https://cran.r-project.org/package=pheatmap)                                                         |
 | phyloseq     | 1.54.0  | [https://www.bioconductor.org/packages/release/bioc/html/phyloseq.html](https://www.bioconductor.org/packages/release/bioc/html/phyloseq.html)     |
 | plotly       | 4.12.0  | [https://plotly-r.com](https://plotly-r.com)                                                                                                       |
 | purrr        |  1.2.1  | [https://purrr.tidyverse.org](https://purrr.tidyverse.org)                                                                                         |
 | readr        |  2.2.0  | [https://readr.tidyverse.org](https://readr.tidyverse.org)                                                                                         |
+| scales       |  1.4.0  | [https://scales.r-lib.org](https://scales.r-lib.org)                                                                                               |
 | stringr      |  1.6.0  | [https://stringr.tidyverse.org](https://stringr.tidyverse.org)                                                                                     |
 | tibble       |  3.3.1  | [https://tibble.tidyverse.org](https://tibble.tidyverse.orgtext)                                                                                   |
 | tidyr        |  1.3.2  | [https://tidyr.tidyverse.org](https://tidyr.tidyverse.orgtext)                                                                                     |
@@ -651,14 +653,25 @@ multiqc --zip-data-dir \
 #### 5a. Load libraries
 
 ```R
+# load libraries
 library(decontam)
-library(glue)
 library(htmlwidgets)
 library(pavian)
 library(pheatmap)
 library(phyloseq)
+
+# load tidyverse libraries
+library(dplyr)
+library(glue)
+library(ggplot2)
+library(magrittr)
 library(plotly)
-library(tidyverse)
+library(purrr)
+library(readr)
+library(scales)
+library(stringr)
+library(tibble)
+library(tidyr)
 ```
 
 #### 5b. Define Custom Functions

From e21868f63b4fd3970d7575528865252b17fd885e Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Wed, 15 Apr 2026 10:30:51 -0700
Subject: [PATCH 44/47] Updated Metagenomics READMEs and added NF_Metagenomics
 repository links (#198)

---
 .gitmodules                                            | 10 +++++++---
 .../Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-B.md   |  6 ++++--
 Metagenomics/Illumina/README.md                        |  6 +++---
 .../Illumina/Workflow_Documentation/NF_Metagenomics    |  1 +
 Metagenomics/Illumina/Workflow_Documentation/README.md |  9 ++++++---
 .../Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md     |  4 ++--
 .../Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md     |  4 ++--
 Metagenomics/Low_Biomass/README.md                     |  4 ++--
 .../Low_Biomass/Workflow_Documentation/NF_MGIllumina   |  1 -
 .../Low_Biomass/Workflow_Documentation/NF_Metagenomics |  1 +
 .../Low_Biomass/Workflow_Documentation/README.md       | 10 +++-------
 11 files changed, 31 insertions(+), 25 deletions(-)
 create mode 160000 Metagenomics/Illumina/Workflow_Documentation/NF_Metagenomics
 delete mode 160000 Metagenomics/Low_Biomass/Workflow_Documentation/NF_MGIllumina
 create mode 160000 Metagenomics/Low_Biomass/Workflow_Documentation/NF_Metagenomics

diff --git a/.gitmodules b/.gitmodules
index beec830a9..660601944 100644
--- a/.gitmodules
+++ b/.gitmodules
@@ -1,7 +1,11 @@
 [submodule "Amplicon/Illumina/Workflow_Documentation/NF_AmpIllumina"]
 	path = Amplicon/Illumina/Workflow_Documentation/NF_AmpIllumina
 	url = https://github.com/nasa/GeneLab_AmpliconSeq_Workflow
-[submodule "NF_MGIllumina"]
-	path = Metagenomics/Low_Biomass/Workflow_Documentation/NF_MGIllumina
-	url = https://github.com/nasa/GeneLab_Metagenomics_Workflow
+[submodule "Metagenomics/Low_Biomass/Workflow_Documentation/NF_Metagenomics"]
+	path = Metagenomics/Low_Biomass/Workflow_Documentation/NF_Metagenomics
+	url = https://github.com/nasa/GeneLab_Metagenomics_Workflow/
+	branch = DEV
+[submodule "Metagenomics/Illumina/Workflow_Documentation/NF_Metagenomics"]
+	path = Metagenomics/Illumina/Workflow_Documentation/NF_Metagenomics
+	url = https://github.com/nasa/GeneLab_Metagenomics_Workflow/
 	branch = DEV
diff --git a/Metagenomics/Illumina/Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-B.md b/Metagenomics/Illumina/Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-B.md
index 45026e247..87cd4c77f 100644
--- a/Metagenomics/Illumina/Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-B.md
+++ b/Metagenomics/Illumina/Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-B.md
@@ -44,8 +44,10 @@ Software Updates and Changes:
 | dplyr        |       N/A        |    1.2.0    |
 | ggplot2      |       N/A        |    4.0.2    |
 | glue         |       N/A        |    1.8.0    |
+| magrittr     |       N/A        |    2.0.5    |
 | purrr        |       N/A        |    1.2.1    |
 | readr        |       N/A        |    2.2.0    |
+| scales       |       N/A        |    1.4.0    |
 | stringr      |       N/A        |    1.6.0    |
 | tibble       |       N/A        |    3.3.1    |
 | tidyr        |       N/A        |    1.3.2    |
@@ -213,8 +215,8 @@ Software Updates and Changes:
 | readr        |  2.2.0  | [https://readr.tidyverse.org](https://readr.tidyverse.org)                                                                                         |
 | scales       |  1.4.0  | [https://scales.r-lib.org](https://scales.r-lib.org)                                                                                               |
 | stringr      |  1.6.0  | [https://stringr.tidyverse.org](https://stringr.tidyverse.org)                                                                                     |
-| tibble       |  3.3.1  | [https://tibble.tidyverse.org](https://tibble.tidyverse.orgtext)                                                                                   |
-| tidyr        |  1.3.2  | [https://tidyr.tidyverse.org](https://tidyr.tidyverse.orgtext)                                                                                     |
+| tibble       |  3.3.1  | [https://tibble.tidyverse.org](https://tibble.tidyverse.org)                                                                                       |
+| tidyr        |  1.3.2  | [https://tidyr.tidyverse.org](https://tidyr.tidyverse.org)                                                                                         |
 > **Note:** pavian R package requires R version 4.0.5
 
 ---
diff --git a/Metagenomics/Illumina/README.md b/Metagenomics/Illumina/README.md
index 0b842d288..52ffb8770 100644
--- a/Metagenomics/Illumina/README.md
+++ b/Metagenomics/Illumina/README.md
@@ -1,9 +1,9 @@
 
 # GeneLab bioinformatics processing pipeline for Illumina metagenomics sequencing data
 
-> **The document [`GL-DPPD-7107-A.md`](Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-A.md) holds an overview and example commands for how GeneLab processes Illumina metagenomics sequencing datasets. See the [Repository Links](#repository-links) descriptions below for more information. Processed data output files and processing code are provided for each GLDS dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).**  
+> **The document [`GL-DPPD-7107-B.md`](Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-B.md) holds an overview and example commands for how GeneLab processes Illumina metagenomics sequencing datasets. See the [Repository Links](#repository-links) descriptions below for more information. Processed data output files and a GeneLab data processing summary are provided for each GLDS dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).**  
 > 
-> Note: The exact processing commands and MGIllumina version used for specific GLDS datasets can be found in the *_processing_info.zip file under "Files" for each respective GLDS dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/). 
+> Note: The exact processing commands and MGIllumina or Metagenomics workflow version used for specific GLDS datasets can be found in the *_processing_info.zip file under "Files" for each respective GLDS dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/). 
 
 ---
 
@@ -21,7 +21,7 @@
 
 * [**Workflow_Documentation**](Workflow_Documentation)
 
-  - Contains instructions for installing and running the GeneLab MGIllumina workflow
+  - Contains instructions for installing and running the GeneLab MGIllumina or Metagenomics workflows
 
 ---
 
diff --git a/Metagenomics/Illumina/Workflow_Documentation/NF_Metagenomics b/Metagenomics/Illumina/Workflow_Documentation/NF_Metagenomics
new file mode 160000
index 000000000..285ec01bc
--- /dev/null
+++ b/Metagenomics/Illumina/Workflow_Documentation/NF_Metagenomics
@@ -0,0 +1 @@
+Subproject commit 285ec01bc9967b7fa312e4517226fdd26fdfc4f9
diff --git a/Metagenomics/Illumina/Workflow_Documentation/README.md b/Metagenomics/Illumina/Workflow_Documentation/README.md
index 28ddd29cf..608037dec 100644
--- a/Metagenomics/Illumina/Workflow_Documentation/README.md
+++ b/Metagenomics/Illumina/Workflow_Documentation/README.md
@@ -1,15 +1,18 @@
 # GeneLab Illumina Metagenomics Seq Workflow Information
 
-> **GeneLab has wrapped each step of the Illumina metagenomics sequencing data processing pipeline (MGIllumina) into a workflow. The table below lists (and links to) each MGIllumina version and the corresponding workflow subdirectory, the current MGIllumina pipeline/workflow implementation is indicated. The workflow subdirectory contains information about the workflow along with instructions for installation and usage. Exact workflow run info and MGIllumina version used to process specific datasets that have been released are provided with their processed data in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).**  
+> **GeneLab has wrapped each step of the Illumina metagenomics sequencing data processing pipeline into a workflow. The table below lists (and links to) each Metagenomics Workflow version and the corresponding workflow subdirectory; the current pipeline/workflow implementation is marked with an asterisk. The workflow subdirectory contains information about the workflow along with instructions for installation and usage. The exact workflow run info and workflow version used to process specific released datasets are provided with their processed data in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).**  
 
 ## MGIllumina Pipeline Version and Corresponding Workflow
 
 |Pipeline Version|Current Workflow Version (for respective pipeline version)|Nextflow Version| 
 |:---------------|:---------------------------------------------------------|:---------------|
-|*[GL-DPPD-7107-A.md](../Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-A.md)|[NF_MGIllumina_1.0.0](NF_MGIllumina)|24.04.4|
+|*[GL-DPPD-7107-B.md](../Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-B.md)|[NF_Metagenomics_1.0.0](https://github.com/nasa/GeneLab_Metagenomics_Workflow/tree/DEV)|24.04.4|
+|[GL-DPPD-7107-A.md](../Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-A.md)|[NF_MGIllumina_1.0.0](NF_MGIllumina)|24.04.4|
 |[GL-DPPD-7107.md](../Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107.md)|[SW_MGIllumina_2.0.4](SW_MGIllumina)|N/A (Snakemake v7.26.0)|
 
 
 *Current GeneLab Pipeline/Workflow Implementation
 
-> See the [workflow change log](NF_MGIllumina/CHANGELOG.md) to access previous workflow versions and view all changes associated with each version update.
+> See the workflow change log for [NF_Metagenomics](https://github.com/nasa/GeneLab_Metagenomics_Workflow/blob/DEV/CHANGELOG.md) to view all changes associated with each workflow version update.
+
+> All workflow changes associated with the previous versions of the GeneLab Metagenomics Pipeline ([GL-DPPD-7107-A](../Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-A.md) and [GL-DPPD-7107](../Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107.md)) can be found in the [NF_MGIllumina Change Log](./NF_MGIllumina/CHANGELOG.md) and the [SW_MGIllumina Change Log](./SW_MGIllumina/CHANGELOG.md), respectively.
diff --git a/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md b/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md
index 8c076a270..82bf58489 100644
--- a/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md
+++ b/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md
@@ -190,8 +190,8 @@ Barbara Novak (GeneLab Data Processing Lead)
 | readr        |  2.2.0  | [https://readr.tidyverse.org](https://readr.tidyverse.org)                                                                                         |
 | scales       |  1.4.0  | [https://scales.r-lib.org](https://scales.r-lib.org)                                                                                               |
 | stringr      |  1.6.0  | [https://stringr.tidyverse.org](https://stringr.tidyverse.org)                                                                                     |
-| tibble       |  3.3.1  | [https://tibble.tidyverse.org](https://tibble.tidyverse.orgtext)                                                                                   |
-| tidyr        |  1.3.2  | [https://tidyr.tidyverse.org](https://tidyr.tidyverse.orgtext)                                                                                     |
+| tibble       |  3.3.1  | [https://tibble.tidyverse.org](https://tibble.tidyverse.org)                                                                                       |
+| tidyr        |  1.3.2  | [https://tidyr.tidyverse.org](https://tidyr.tidyverse.org)                                                                                         |
 > **Note:** pavian R package requires R version 4.0.5
 ---
 
diff --git a/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md b/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md
index 8320910e5..f126b6c6a 100644
--- a/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md
+++ b/Metagenomics/Low_Biomass/Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md
@@ -180,8 +180,8 @@ Barbara Novak (GeneLab Data Processing Lead)
 | readr        |  2.2.0  | [https://readr.tidyverse.org](https://readr.tidyverse.org)                                                                                         |
 | scales       |  1.4.0  | [https://scales.r-lib.org](https://scales.r-lib.org)                                                                                               |
 | stringr      |  1.6.0  | [https://stringr.tidyverse.org](https://stringr.tidyverse.org)                                                                                     |
-| tibble       |  3.3.1  | [https://tibble.tidyverse.org](https://tibble.tidyverse.orgtext)                                                                                   |
-| tidyr        |  1.3.2  | [https://tidyr.tidyverse.org](https://tidyr.tidyverse.orgtext)                                                                                     |
+| tibble       |  3.3.1  | [https://tibble.tidyverse.org](https://tibble.tidyverse.org)                                                                                       |
+| tidyr        |  1.3.2  | [https://tidyr.tidyverse.org](https://tidyr.tidyverse.org)                                                                                         |
 > **Note:** pavian R package requires R version 4.0.5
 
 ---
diff --git a/Metagenomics/Low_Biomass/README.md b/Metagenomics/Low_Biomass/README.md
index 9ede17df0..cfbde0d29 100644
--- a/Metagenomics/Low_Biomass/README.md
+++ b/Metagenomics/Low_Biomass/README.md
@@ -1,8 +1,8 @@
 # GeneLab bioinformatics processing pipelines for low-biomass metagenomics sequencing data
 
-> **Documents [`GL-DPPD-7116`](Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md) and [`GL-DPPD-7117.md`](Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md) contain overview and example commands for how GeneLab processes low-biomass metagenomics datasets for long- and short-read data, respectively. See the [Repository Links](#repository-links) descriptions below for more information. Processed data output files and a GeneLab data processing summary is provided for each GLDS dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).**  
+> **Documents [`GL-DPPD-7116`](Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md) and [`GL-DPPD-7117.md`](Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md) contain overview and example commands for how GeneLab processes low-biomass metagenomics datasets for long- and short-read data, respectively. See the [Repository Links](#repository-links) descriptions below for more information. Processed data output files and a GeneLab data processing summary are provided for each GLDS dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).**  
 
-<br>
+> Note: The exact processing commands and Metagenomics workflow version used for specific GLDS datasets can be found in the *_processing_info.zip file under "Files" for each respective GLDS dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/). 
 
 ---
 ## Repository Links
diff --git a/Metagenomics/Low_Biomass/Workflow_Documentation/NF_MGIllumina b/Metagenomics/Low_Biomass/Workflow_Documentation/NF_MGIllumina
deleted file mode 160000
index 2a4a676d5..000000000
--- a/Metagenomics/Low_Biomass/Workflow_Documentation/NF_MGIllumina
+++ /dev/null
@@ -1 +0,0 @@
-Subproject commit 2a4a676d529fe2f160fa592b302a1d3e39e5c7e3
diff --git a/Metagenomics/Low_Biomass/Workflow_Documentation/NF_Metagenomics b/Metagenomics/Low_Biomass/Workflow_Documentation/NF_Metagenomics
new file mode 160000
index 000000000..285ec01bc
--- /dev/null
+++ b/Metagenomics/Low_Biomass/Workflow_Documentation/NF_Metagenomics
@@ -0,0 +1 @@
+Subproject commit 285ec01bc9967b7fa312e4517226fdd26fdfc4f9
diff --git a/Metagenomics/Low_Biomass/Workflow_Documentation/README.md b/Metagenomics/Low_Biomass/Workflow_Documentation/README.md
index 364016ece..a269d5b98 100644
--- a/Metagenomics/Low_Biomass/Workflow_Documentation/README.md
+++ b/Metagenomics/Low_Biomass/Workflow_Documentation/README.md
@@ -6,14 +6,10 @@
 
 |Pipeline Version|Current Workflow Version (for respective pipeline version)|Nextflow Version| 
 |:---------------|:---------------------------------------------------------|:---------------|
-|*[GL-DPPD-7116.md](../Nanopore/GL-DPPD-7116.md)|[NF_MGIllumina_2.0.0](NF_MGIllumina)|24.04.4|
-|*[GL-DPPD-7117.md](../Illumina/GL-DPPD-7117.md)|[NF_MGIllumina_2.0.0](NF_MGIllumina)|24.04.4|
+|*[GL-DPPD-7116.md](../Pipeline_GL_DPPD_7116_Versions/GL-DPPD-7116.md)|[NF_Metagenomics_1.0.0](https://github.com/nasa/GeneLab_Metagenomics_Workflow)|24.04.4|
+|*[GL-DPPD-7117.md](../Pipeline_GL_DPPD_7117_Versions/GL-DPPD-7117.md)|[NF_Metagenomics_1.0.0](https://github.com/nasa/GeneLab_Metagenomics_Workflow)|24.04.4|
 
 
 *Current GeneLab Pipeline/Workflow Implementation
 
-> See the [workflow change log](NF_MGIllumina/CHANGELOG.md) to access previous workflow versions and view all changes associated with each version update.
-
-
-> See the [NF_AmpIllumina Change Log](https://github.com/nasa/GeneLab_AmpliconSeq_Workflow/blob/main/CHANGELOG.md) to access the most recent changes to the workflow and view all changes associated with each update.<br>
-> All workflow changes associated with the previous version of the GeneLab Amplicon Pipeline ([GL-DPPD-7104-B](https://github.com/nasa/GeneLab_Data_Processing/blob/master/Amplicon/Illumina/Pipeline_GL-DPPD-7104_Versions/GL-DPPD-7104-B.md) and earlier) can be found in the [SW_AmpIllumina-B Change Log](https://github.com/nasa/GeneLab_Data_Processing/blob/master/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/CHANGELOG.md)
+> See the [workflow change log](https://github.com/nasa/GeneLab_Metagenomics_Workflow/blob/DEV/CHANGELOG.md) to view all changes associated with each workflow version update.

From a8ca3057ea10279f6003edbd277d8673eed1da96 Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Wed, 15 Apr 2026 11:44:14 -0700
Subject: [PATCH 45/47] Updated Metagenomics workflow name (#199)

- changed NF_Metagenomics to NF_MetagenomeSeq
- updated READMEs
---
 .gitmodules                                               | 8 ++++----
 Metagenomics/Illumina/README.md                           | 4 ++--
 .../{NF_Metagenomics => NF_MetagenomeSeq}                 | 0
 Metagenomics/Illumina/Workflow_Documentation/README.md    | 8 ++++----
 Metagenomics/Low_Biomass/README.md                        | 4 ++--
 .../{NF_Metagenomics => NF_MetagenomeSeq}                 | 0
 Metagenomics/Low_Biomass/Workflow_Documentation/README.md | 6 +++---
 Metagenomics/README.md                                    | 7 ++++---
 8 files changed, 19 insertions(+), 18 deletions(-)
 rename Metagenomics/Illumina/Workflow_Documentation/{NF_Metagenomics => NF_MetagenomeSeq} (100%)
 rename Metagenomics/Low_Biomass/Workflow_Documentation/{NF_Metagenomics => NF_MetagenomeSeq} (100%)

diff --git a/.gitmodules b/.gitmodules
index 660601944..3f6a06245 100644
--- a/.gitmodules
+++ b/.gitmodules
@@ -1,11 +1,11 @@
 [submodule "Amplicon/Illumina/Workflow_Documentation/NF_AmpIllumina"]
 	path = Amplicon/Illumina/Workflow_Documentation/NF_AmpIllumina
 	url = https://github.com/nasa/GeneLab_AmpliconSeq_Workflow
-[submodule "Metagenomics/Low_Biomass/Workflow_Documentation/NF_Metagenomics"]
-	path = Metagenomics/Low_Biomass/Workflow_Documentation/NF_Metagenomics
+[submodule "Metagenomics/Low_Biomass/Workflow_Documentation/NF_MetagenomeSeq"]
+	path = Metagenomics/Low_Biomass/Workflow_Documentation/NF_MetagenomeSeq
 	url = https://github.com/nasa/GeneLab_Metagenomics_Workflow/
 	branch = DEV
-[submodule "Metagenomics/Illumina/Workflow_Documentation/NF_Metagenomics"]
-	path = Metagenomics/Illumina/Workflow_Documentation/NF_Metagenomics
+[submodule "Metagenomics/Illumina/Workflow_Documentation/NF_MetagenomeSeq"]
+	path = Metagenomics/Illumina/Workflow_Documentation/NF_MetagenomeSeq
 	url = https://github.com/nasa/GeneLab_Metagenomics_Workflow/
 	branch = DEV
diff --git a/Metagenomics/Illumina/README.md b/Metagenomics/Illumina/README.md
index 52ffb8770..d0d5bee72 100644
--- a/Metagenomics/Illumina/README.md
+++ b/Metagenomics/Illumina/README.md
@@ -3,7 +3,7 @@
 
 > **The document [`GL-DPPD-7107-B.md`](Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-B.md) holds an overview and example commands for how GeneLab processes Illumina metagenomics sequencing datasets. See the [Repository Links](#repository-links) descriptions below for more information. Processed data output files and a GeneLab data processing summary are provided for each GLDS dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).**  
 > 
-> Note: The exact processing commands and MGIllumina or Metagenomics workflow version used for specific GLDS datasets can be found in the *_processing_info.zip file under "Files" for each respective GLDS dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/). 
+> Note: The exact processing commands and MGIllumina or MetagenomeSeq workflow version used for specific GLDS datasets can be found in the *_processing_info.zip file under "Files" for each respective GLDS dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/). 
 
 ---
 
@@ -21,7 +21,7 @@
 
 * [**Workflow_Documentation**](Workflow_Documentation)
 
-  - Contains instructions for installing and running the GeneLab MGIllumina or Metagenomics workflows
+  - Contains instructions for installing and running the GeneLab MGIllumina or MetagenomeSeq workflows
 
 ---
 
diff --git a/Metagenomics/Illumina/Workflow_Documentation/NF_Metagenomics b/Metagenomics/Illumina/Workflow_Documentation/NF_MetagenomeSeq
similarity index 100%
rename from Metagenomics/Illumina/Workflow_Documentation/NF_Metagenomics
rename to Metagenomics/Illumina/Workflow_Documentation/NF_MetagenomeSeq
diff --git a/Metagenomics/Illumina/Workflow_Documentation/README.md b/Metagenomics/Illumina/Workflow_Documentation/README.md
index 608037dec..2244a477d 100644
--- a/Metagenomics/Illumina/Workflow_Documentation/README.md
+++ b/Metagenomics/Illumina/Workflow_Documentation/README.md
@@ -1,18 +1,18 @@
 # GeneLab Illumina Metagenomics Seq Workflow Information
 
-> **GeneLab has wrapped each step of the Illumina metagenomics sequencing data processing pipeline into a workflow. The table below lists (and links to) each Metagenomics Workflow version and the corresponding workflow subdirectory, the current pipeline/workflow implementation is indicated. The workflow subdirectory contains information about the workflow along with instructions for installation and usage. Exact workflow run info and workflow version used to process specific datasets that have been released are provided with their processed data in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).**  
+> **GeneLab has wrapped each step of the Illumina metagenomics sequencing data processing pipeline into a workflow. The table below lists (and links to) each MGIllumina or MetagenomeSeq Workflow version and the corresponding workflow subdirectory; the current MGIllumina or MetagenomeSeq pipeline/workflow implementation is indicated. The workflow subdirectory contains information about the workflow along with instructions for installation and usage. Exact workflow run info and workflow version used to process specific datasets that have been released are provided with their processed data in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).**  
 
 ## MGIllumina Pipeline Version and Corresponding Workflow
 
 |Pipeline Version|Current Workflow Version (for respective pipeline version)|Nextflow Version| 
 |:---------------|:---------------------------------------------------------|:---------------|
-|*[GL-DPPD-7107-B.md](../Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-B.md)|[NF_Metagenomics_1.0.0](https://github.com/nasa/GeneLab_Metagenomics_Workflow/tree/DEV)|24.04.4|
+|*[GL-DPPD-7107-B.md](../Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-B.md)|[NF_MetagenomeSeq_1.0.0](https://github.com/nasa/GeneLab_Metagenomics_Workflow/tree/DEV)|24.04.4|
 |[GL-DPPD-7107-A.md](../Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-A.md)|[NF_MGIllumina_1.0.0](NF_MGIllumina)|24.04.4|
 |[GL-DPPD-7107.md](../Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107.md)|[SW_MGIllumina_2.0.4](SW_MGIllumina)|N/A (Snakemake v7.26.0)|
 
 
 *Current GeneLab Pipeline/Workflow Implementation
 
-> See the workflow change log for [NF_Metagenomics](https://github.com/nasa/GeneLab_Metagenomics_Workflow/blob/DEV/CHANGELOG.md) to view all changes associated with each workflow version update.
+> See the workflow change log for [NF_MetagenomeSeq](https://github.com/nasa/GeneLab_Metagenomics_Workflow/blob/DEV/CHANGELOG.md) to view all changes associated with each workflow version update.
 
-> All workflow changes associated with the previous version of the GeneLab Metagenomics Pipeline ([GL-DPPD-7107-A](../Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-A.md) or [GL-DPPD-7107](../Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107.md)) can be found in the [NF_MGIllumina Change Log](./NF_MGIllumina/CHANGELOG.md) or the [SW_MGIllumina Change Log](./SW_MGIllumina/CHANGELOG.md), respectively.
+> All workflow changes associated with the previous versions of the GeneLab Illumina Metagenomics Pipeline ([GL-DPPD-7107-A](../Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-A.md) and [GL-DPPD-7107](../Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107.md)) can be found in the [NF_MGIllumina Change Log](./NF_MGIllumina/CHANGELOG.md) and the [SW_MGIllumina Change Log](./SW_MGIllumina/CHANGELOG.md), respectively.
diff --git a/Metagenomics/Low_Biomass/README.md b/Metagenomics/Low_Biomass/README.md
index cfbde0d29..fa610ec17 100644
--- a/Metagenomics/Low_Biomass/README.md
+++ b/Metagenomics/Low_Biomass/README.md
@@ -2,7 +2,7 @@
 
 > **Documents [`GL-DPPD-7116`](Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md) and [`GL-DPPD-7117.md`](Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md) contain overview and example commands for how GeneLab processes low-biomass metagenomics datasets for long- and short-read data, respectively. See the [Repository Links](#repository-links) descriptions below for more information. Processed data output files and a GeneLab data processing summary are provided for each GLDS dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).**  
 
-> Note: The exact processing commands and Metagenomics workflow version used for specific GLDS datasets can be found in the *_processing_info.zip file under "Files" for each respective GLDS dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/). 
+> Note: The exact processing commands and MetagenomeSeq workflow version used for specific GLDS datasets can be found in the *_processing_info.zip file under "Files" for each respective GLDS dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/). 
 
 ---
 ## Repository Links
@@ -17,7 +17,7 @@
 
 * [**Workflow_Documentation**](Workflow_Documentation)
 
-  - Contains instructions for installing and running the GeneLab MGIllumina workflow
+  - Contains instructions for installing and running the GeneLab MetagenomeSeq workflow
 
 ---
 **Developed by:**  
diff --git a/Metagenomics/Low_Biomass/Workflow_Documentation/NF_Metagenomics b/Metagenomics/Low_Biomass/Workflow_Documentation/NF_MetagenomeSeq
similarity index 100%
rename from Metagenomics/Low_Biomass/Workflow_Documentation/NF_Metagenomics
rename to Metagenomics/Low_Biomass/Workflow_Documentation/NF_MetagenomeSeq
diff --git a/Metagenomics/Low_Biomass/Workflow_Documentation/README.md b/Metagenomics/Low_Biomass/Workflow_Documentation/README.md
index a269d5b98..876dd2c3c 100644
--- a/Metagenomics/Low_Biomass/Workflow_Documentation/README.md
+++ b/Metagenomics/Low_Biomass/Workflow_Documentation/README.md
@@ -1,13 +1,13 @@
 # GeneLab Low-biomass Metagenomics Workflow Information
 
-> **GeneLab has wrapped each step of the low-biomass metagenomics sequencing data processing pipelines (MGIllumina) into a workflow. The table below lists (and links to) each MGIllumina version and the corresponding workflow subdirectory, the current MGIllumina pipeline/workflow implementation is indicated. The workflow subdirectory contains information about the workflow along with instructions for installation and usage. Exact workflow run info and MGIllumina version used to process specific datasets that have been released are provided with their processed data in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).**  
+> **GeneLab has wrapped each step of the low-biomass metagenomics sequencing data processing pipelines into a workflow. The table below lists (and links to) each MetagenomeSeq version and the corresponding workflow subdirectory; the current MetagenomeSeq pipeline/workflow implementation is indicated. The workflow subdirectory contains information about the workflow along with instructions for installation and usage. Exact workflow run info and MetagenomeSeq version used to process specific datasets that have been released are provided with their processed data in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).**  
 
 ## MGIllumina Pipeline Version and Corresponding Workflow
 
 |Pipeline Version|Current Workflow Version (for respective pipeline version)|Nextflow Version| 
 |:---------------|:---------------------------------------------------------|:---------------|
-|*[GL-DPPD-7116.md](../Pipeline_GL_DPPD_7116_Versions/GL-DPPD-7116.md)|[NF_Metagenomics_1.0.0](https://github.com/nasa/GeneLab_Metagenomics_Workflow)|24.04.4|
-|*[GL-DPPD-7117.md](../Pipeline_GL_DPPD_7117_Versions/GL-DPPD-7117.md)|[NF_Metagenomics_1.0.0](https://github.com/nasa/GeneLab_Metagenomics_Workflow)|24.04.4|
+|*[GL-DPPD-7116.md](../Pipeline_GL_DPPD_7116_Versions/GL-DPPD-7116.md)|[NF_MetagenomeSeq_1.0.0](https://github.com/nasa/GeneLab_Metagenomics_Workflow)|24.04.4|
+|*[GL-DPPD-7117.md](../Pipeline_GL_DPPD_7117_Versions/GL-DPPD-7117.md)|[NF_MetagenomeSeq_1.0.0](https://github.com/nasa/GeneLab_Metagenomics_Workflow)|24.04.4|
 
 
 *Current GeneLab Pipeline/Workflow Implementation
diff --git a/Metagenomics/README.md b/Metagenomics/README.md
index ebfd4a0db..5d1eeaf28 100644
--- a/Metagenomics/README.md
+++ b/Metagenomics/README.md
@@ -4,9 +4,10 @@
 
 ## Select a specific pipeline for more info:
 
-* [Estimating host reads](Estimate_host_reads_in_raw_data)
-* [Removing human reads](Remove_human_reads_from_raw_data)  
-* [Illumina](Illumina)  
+* [Estimating host reads](./Estimate_host_reads_in_raw_data/)
+* [Removing human reads](./Remove_human_reads_from_raw_data/)  
+* [Illumina](./Illumina/)  
+* [Low Biomass](./Low_Biomass/)  
 
 <br>
 

From 44cce130b06c36ff12a52c68580e5aeb0e4acb20 Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Fri, 17 Apr 2026 20:11:21 -0700
Subject: [PATCH 46/47] Update README.md

fix typos in links
---
 Metagenomics/Low_Biomass/Workflow_Documentation/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Metagenomics/Low_Biomass/Workflow_Documentation/README.md b/Metagenomics/Low_Biomass/Workflow_Documentation/README.md
index 876dd2c3c..5b4b9fecd 100644
--- a/Metagenomics/Low_Biomass/Workflow_Documentation/README.md
+++ b/Metagenomics/Low_Biomass/Workflow_Documentation/README.md
@@ -6,8 +6,8 @@
 
 |Pipeline Version|Current Workflow Version (for respective pipeline version)|Nextflow Version| 
 |:---------------|:---------------------------------------------------------|:---------------|
-|*[GL-DPPD-7116.md](../Pipeline_GL_DPPD_7116_Versions/GL-DPPD-7116.md)|[NF_MetagenomeSeq_1.0.0](https://github.com/nasa/GeneLab_Metagenomics_Workflow)|24.04.4|
-|*[GL-DPPD-7117.md](../Pipeline_GL_DPPD_7117_Versions/GL-DPPD-7117.md)|[NF_MetagenomeSeq_1.0.0](https://github.com/nasa/GeneLab_Metagenomics_Workflow)|24.04.4|
+|*[GL-DPPD-7116.md](../Pipeline_GL-DPPD-7116_Versions/GL-DPPD-7116.md)|[NF_MetagenomeSeq_1.0.0](https://github.com/nasa/GeneLab_Metagenomics_Workflow)|24.04.4|
+|*[GL-DPPD-7117.md](../Pipeline_GL-DPPD-7117_Versions/GL-DPPD-7117.md)|[NF_MetagenomeSeq_1.0.0](https://github.com/nasa/GeneLab_Metagenomics_Workflow)|24.04.4|
 
 
 *Current GeneLab Pipeline/Workflow Implementation

From ee00d2b3c0474444430d989ee4b9fbc0203daac7 Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Fri, 17 Apr 2026 20:16:50 -0700
Subject: [PATCH 47/47] Update README.md

Added link for Metagenomics/Low_Biomass assay type
---
 README.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index e23f668b1..143b02667 100644
--- a/README.md
+++ b/README.md
@@ -18,7 +18,8 @@ Click on an assay type below for data processing information.
 - [Metagenomics](Metagenomics)  
   - [Remove human reads](Metagenomics/Remove_human_reads_from_raw_data)
   - [Estimate host reads](Metagenomics/Estimate_host_reads_in_raw_data)
-  - [Illumina](Metagenomics/Illumina)  
+  - [Illumina](Metagenomics/Illumina)
+  - [Low_Biomass](Metagenomics/Low_Biomass)
 - [(bulk) RNAseq](RNAseq)  
 - [single cell RNAseq](scRNAseq)  
 - [Methylation Sequencing](Methyl-Seq)