From fb4614312f99a8fe143fae7bec1e0c3dece289c5 Mon Sep 17 00:00:00 2001
From: Barbara Novak <19824106+bnovak32@users.noreply.github.com>
Date: Mon, 8 Sep 2025 09:43:32 -0700
Subject: [PATCH 01/47] Added first draft of low-biomass ppl

---
 .../Low_Biomass/Nanopore/GL-DPPD-XXXX.md | 1876 +++++++++++++++++
 1 file changed, 1876 insertions(+)
 create mode 100644 Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md

diff --git a/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
new file mode 100644
index 000000000..bba4e3c74
--- /dev/null
+++ b/Metagenomics/Low_Biomass/Nanopore/GL-DPPD-XXXX.md
@@ -0,0 +1,1876 @@
+# Bioinformatics pipeline for low-biomass long-read metagenomics data
+
+> **This document holds an overview and some example commands of how GeneLab processes low-biomass, long-read metagenomics datasets. Exact processing commands for specific datasets that have been released are provided with their processed data in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).**
+
+---
+
+**Date:** XXX NN, 2025
+**Revision:** -
+**Document Number:** GL-DPPD-XXXX
+
+**Submitted by:**
+Olabiyi A. Obayomi (GeneLab Analysis Team)
+
+**Approved by:**
+Samrawit Gebre (OSDR Project Manager)
+Jonathan Galazka (OSDR Project Scientist)
+Amanda Saravia-Butler (GeneLab Science Lead)
+Barbara Novak (GeneLab Data Processing Lead)
+
+
+---
+
+# Table of contents
+
+- [**Software used**](#software-used)
+- [**General processing overview with example commands**](#general-processing-overview-with-example-commands)
+  - [**Pre-processing**](#pre-processing)
+    - [1. Basecalling](#1-basecalling)
+    - [2. Demultiplexing](#2-demultiplexing)
+      - [2a. Demultiplex]()
+      - [2b. Concatenate files for each sample]()
+    - [3. Raw Data QC](#3-raw-data-qc)
+      - [3a. Raw Data QC](#3a-raw-data-qc)
+      - [3b. Compile Raw Data QC](#3b-compile-raw-data-qc)
+    - [4. Quality filtering](#4-quality-filtering)
+      - [4a. Filter Raw Data](#4a-filter-raw-data)
+      - [4b. Filtered Data QC](#4b-filtered-data-qc)
+      - [4c. Compile Filtered Data QC](#4c-compile-filtered-data-qc)
+    - [5. Trimming](#3-filteredtrimmed-data-qc)
+      - [5a. Trim Filtered Data](#5a-trim-filtered-data)
+      - [5b. Trimmed Data QC](#5b-trimmed-data-qc)
+      - [5c. Compile Trimmed Data QC](#5c-compile-filtered-data-qc)
+    - [6. Assemble Contaminants](#6-assemble-contaminants)
+    - [7. Contaminant Removal](#7-remove-contaminants)
+      - [7a. Build Contaminant Index and Map Reads](#7a-build-contaminant-index-and-map-reads)
+      - [7b. Sort and Index Contaminant Reads](#7b-sort-and-index-contaminant-alignments)
+      - [7c. Gather Contaminant Mapping Metrics](#7c-gather-contaminant-mapping-metrics)
+      - [7d. Generate Decontaminated Read Files](#7d-generate-decontaminated-read-files)
+      - [7e. Contaminant Removal QC](#7e-contaminant-removal-qc)
+      - [7f. Compile Contaminant Removal QC](#7f-compile-contaminant-removal-qc)
+    - [8. Host Removal](#8-host-removal)
+      - [8a. Remove Host Reads](#8a)
+      - [8b. Compile Host Removal QC]()
+  - [**Read-based processing**](#read-based-processing)
+    - [9. Taxonomic and functional profiling using Kaiju](#8-taxonomic-and-functional-profiling)
+      - [9a. Taxonomic Classification](#9a-taxonomic-classification)
+      - [9b. Convert Kaiju output to Krona format](#9b-convert-kaiju-output-to-krona-format)
+      - [9c. Generate per sample Krona charts](#9c-generate-per-sample-krona-charts)
+      - [9d. Generate combined Krona chart](#9d-generate-combined-krona-chart)
+      - [9e. Compute per-sample taxon level summaries](#9e-compute-taxon-level-summaries-for-each-sample)
+      - [9f. Compile taxon level summaries](#9f-compile-kaiju-taxonomy-results)
+    - [10. Taxonomic and functional profiling using Kraken2](#10-taxonomic-and-functional-profiling-using-kraken2)
+      - [10a. Taxonomic Classification](#10a-taxonomic-classification)
+      - [10b. Combine Kraken2 reports](#10b-combine-kraken2-reports)
+      - [10c. Convert Kraken2 output to krona format](#10c-convert-kraken2-output-to-krona-format)
+      - [10d. Generate per sample Krona charts](#10d-generate-per-sample-krona-charts)
+      - [10e. Generate combined Krona chart](#10e-generate-combined-krona-chart)
+      - [10f. Compile Kraken2 Summary QC](#10f-compile-kraken2-summary-qc)
+  - [**Assembly-based processing**](#assembly-based-processing)
+    - [11. Sample assembly](#11-sample-assembly)
+    - [12. Polish assembly](#12-polish-assembly)
+    - [13. Renaming contigs and summarizing assemblies](#13-renaming-contigs-and-summarizing-assemblies)
+    - [14. Gene prediction](#14-gene-prediction)
+    - [15. Functional annotation](#15-functional-annotation)
+    - [16. Taxonomic classification](#16-taxonomic-classification)
+    - [17. Read-mapping](#17-read-mapping)
+    - [18. Getting coverage information and filtering based on detection](#18-getting-coverage-information-and-filtering-based-on-detection)
+    - [19. Combining gene-level coverage, taxonomy, and functional annotations into one table for each sample](#19-combining-gene-level-coverage-taxonomy-and-functional-annotations-into-one-table-for-each-sample)
+    - [20. Combining contig-level coverage and taxonomy into one table for each sample](#20-combining-contig-level-coverage-and-taxonomy-into-one-table-for-each-sample)
+    - [21. Generating normalized, gene- and contig-level coverage summary tables of KO-annotations and taxonomy across samples](#21-generating-normalized-gene--and-contig-level-coverage-summary-tables-of-ko-annotations-and-taxonomy-across-samples)
+    - [22. **M**etagenome-**A**ssembled **G**enome (MAG) recovery](#22-metagenome-assembled-genome-mag-recovery)
+    - [23. Generating MAG-level functional summary overview](#23-generating-mag-level-functional-summary-overview)
+
+---
+
+# Software used
+
+|Program|Version|Relevant Links|
+|:------|:-----:|------:|
+|bbduk| 38.86 |[https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/)|
+|CAT| 5.2.3 |[https://github.com/dutilh/CAT#cat-and-bat](https://github.com/dutilh/CAT#cat-and-bat)|
+|CheckM| 1.1.3 |[https://github.com/Ecogenomics/CheckM](https://github.com/Ecogenomics/CheckM)|
+|Decontam| | |
+|Dorado| 1.1.1| [https://github.com/nanoporetech/dorado](https://github.com/nanoporetech/dorado)|
+|Flye| 2.9.5 | [https://github.com/mikolmogorov/Flye](https://github.com/mikolmogorov/Flye) |
+|GTDB-Tk| 2.4.0 |[https://github.com/Ecogenomics/GTDBTk](https://github.com/Ecogenomics/GTDBTk)|
+|Kaiju| 1.10.1 | [https://bioinformatics-centre.github.io/kaiju/](https://bioinformatics-centre.github.io/kaiju/) |
+|KOFamScan| 1.3.0 |[https://github.com/takaram/kofam_scan#kofamscan](https://github.com/takaram/kofam_scan#kofamscan)|
+|Kraken2| 2.1.6 | [https://github.com/DerrickWood/kraken2](https://github.com/DerrickWood/kraken2) |
+|KrakenTools | 1.2 | [https://ccb.jhu.edu/software/krakentools/](https://ccb.jhu.edu/software/krakentools/) |
+|Krona| 2.8.1 | [https://github.com/marbl/Krona/wiki](https://github.com/marbl/Krona/wiki)|
+|MetaBAT| 2.15 |[https://bitbucket.org/berkeleylab/metabat/src/master/](https://bitbucket.org/berkeleylab/metabat/src/master/)|
+|Minimap2| 2.2.8 | [https://github.com/lh3/minimap2](https://github.com/lh3/minimap2) |
+|MultiQC| 1.27.1 |[https://multiqc.info/](https://multiqc.info/)|
+|Medaka| 2.0.1 | [https://github.com/nanoporetech/medaka](https://github.com/nanoporetech/medaka) |
+|MEGAHIT| 1.2.9 |[https://github.com/voutcn/megahit#megahit](https://github.com/voutcn/megahit#megahit)|
+|NanoPlot| 1.44.1 | [https://github.com/wdecoster/NanoPlot](https://github.com/wdecoster/NanoPlot)|
+|Porechop| 0.2.4 | [https://github.com/rrwick/Porechop](https://github.com/rrwick/Porechop) |
+|Prodigal| 2.6.3 |[https://github.com/hyattpd/Prodigal#prodigal](https://github.com/hyattpd/Prodigal#prodigal)|
+|samtools| 1.20 |[https://github.com/samtools/samtools#samtools](https://github.com/samtools/samtools#samtools)|
+|KEGG-Decoder| 1.2.2 |[https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder](https://github.com/bjtully/BioData/tree/master/KEGGDecoder#kegg-decoder)|
+|HUMAnN| 3.9 |[https://github.com/biobakery/humann](https://github.com/biobakery/humann)|
+|MetaPhlAn| 4.1.0 |[https://github.com/biobakery/MetaPhlAn](https://github.com/biobakery/MetaPhlAn)|
+
+---
+
+# General processing overview with example commands
+
+> Exact processing commands and output files listed in **bold** below are included with each Low Biomass Metagenomics Seq processed dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).
+
+## Pre-processing
+
+### 1. Basecalling
+
+```bash
+model="fast@4.3.0"
+input_directory=/path/to/raw/data
+
+dorado basecaller ${model} ${input_directory} \
+    --no-trim \
+    --device auto \
+    --recursive \
+    --kit-name ${kit_name} \
+    --min-qscore 7 > basecalled.bam
+```
+
+**Parameter Definitions:**
+
+- `--no-trim` - skips trimming of barcodes, adapters, and primers
+- `--device` - specifies CPU or GPU device; specifying 'auto' chooses either 'cpu' or 'gpu' depending on detected presence of a GPU device
+- `--recursive` - enables recursive scanning through input directory to load FAST5 and/or POD5 files
+- `--kit-name` - enables barcoding with the provided kit name; see [dorado documentation](https://software-docs.nanoporetech.com/dorado/1.1.1/barcoding/barcoding/) for a full list of accepted kit names
+- `--min-qscore` - discards reads with a mean Q-score below the specified threshold
+- `model` - positional argument specifying the basecalling model to use or a path to the model directory
+- `input_directory` - positional argument specifying the location of the raw data in POD5 or FAST5 format
+
+**Input Data:**
+
+- \*pod5 and/or \*fast5 (raw nanopore data)
+
+**Output Data:**
+
+- **basecalled.bam** (raw data in BAM format)
+
+### 2. Demultiplexing
+
+```bash
+dorado demux \
+    --output-dir /path/to/fastq/output \
+    --emit-fastq \
+    --emit-summary \
+    --kit-name ${kit_name} \
+    basecalled.bam
+```
+
+**Parameter Definitions:**
+
+- `--output-dir` - specifies the output folder that is the root of the nested output structure
+- `--emit-fastq` - specifies that output is fastq format
+- `--emit-summary` - creates a summary listing each read and its classified barcode
+- `--kit-name` - enables barcoding with the provided kit name; see [dorado documentation](https://software-docs.nanoporetech.com/dorado/1.1.1/barcoding/barcoding/) for a full list of accepted kit names
+
+**Input Data:**
+
+- basecalled.bam (raw nanopore data in BAM format, output from [Step 1](#1-basecalling))
+
+**Output Data:**
+
+- \*_barcode\*.fastq (demultiplexed reads in fastq format)
+- \*_unclassified.fastq (unclassified reads in fastq format)
+- barcoding_summary.txt (barcode summary file listing each read, the file it was assigned to, and its classified barcode)
+
+### 3. Raw Data QC
+
+#### 3a. Raw Data QC
+
+```bash
+NanoPlot --only-report --prefix sample_ -o /path/to/raw_nanoplot_output -t NumberOfThreads --fastq sample_raw.fastq.gz
+```
+
+**Parameter Definitions:**
+
+- `-o` - specifies the output directory to store results
+- `--only-report` - output only the report files
+- `--prefix` - adds a sample specific prefix to the name of each output file
+- `-t` - number of processing threads
+- `--fastq` - specifies the input reads in fastq format (here, sample_raw.fastq.gz)
+
+**Input Data:**
+
+- \*raw.fastq.gz (raw reads, output from [Step 2](#2-demultiplexing))
+
+**Output Data:**
+
+- **sample_NanoPlot-report.html** (NanoPlot html summary)
+- sample_NanoPlot__