paper/paper.md
Lines changed: 14 additions & 16 deletions
@@ -31,20 +31,22 @@ bibliography: paper.bib
31
31
32
32
# Summary
33
33
34
-
With the advancement of Next Generation Sequencing technologies, (palaeo)faeces have become a unique and valuable source in the fields of archaeology [@Battillo:2019], microbiome studies [@Rifkin:2020; @Wibowo:2021], species ecology and conservation [@Ang:2020; @Taylor:2022], and even shows promise in forensic investigations [@Quaak:2017; @Quaak:2018]. To use faecal samples for such studies, often the first question is "who deposited the faeces?", before doing additional analyses. The nf-core/coproID pipeline helps to answer this question by taking raw sequencing data and predicting the depositor's species.
34
+
With the advancement of next-generation sequencing technologies, (palaeo)faeces have become a unique and valuable source in the fields of archaeology [@Battillo:2019], microbiome studies [@Rifkin:2020; @Wibowo:2021], species ecology and conservation [@Ang:2020; @Taylor:2022], and even in forensic investigations [@Quaak:2017; @Quaak:2018]. To use faecal samples for such studies, often the first question is "who deposited the faeces?", before doing additional analyses. The nf-core/coproID pipeline helps to answer this question by taking raw sequencing data and predicting the depositor's species.
35
35
36
-
The raw sequencing data is first pre-processed to trim adapters and remove low quality and low complexity reads with fastp [@Chen:2018]. Bowtie2 [@Langmead:2018] is then used to align the reads to multiple user-specified reference genomes of potential depositor (host) species. Next, these reads are processed with sam2lca [@Borry:2022] to retain only reads specific to one of the references, and normalised according to the size of the genome. Furthermore, a taxonomic profile is created with kraken2 [@Wood:2019], and compared to a user supplied database of potential sources using sourcepredict to estimate the percentages of contributing sources [@Borry:2019]. Both the normalised host DNA and the sourcepredict results are used to predict the most likely depositor of the faeces. The pipeline also incorporates ancient DNA damage estimates using pyDamage [@Borry:2021] and damageprofiler [@Neukamm:2021] for authenticating the ancient nature of the host DNA, when working with palaeofaeces. All results are collated into a Quarto notebook html report for an easy overview of all the results.
36
+
The raw sequencing data is first pre-processed to trim adapters and remove low quality and low complexity reads with fastp [@Chen:2018]. Bowtie2 [@Langmead:2018] is then used to align the reads to multiple user-specified reference genomes of potential depositor (host) species. Next, these reads are processed with sam2lca [@Borry:2022] to retain only reads specific to one of the references, and normalised according to the size of the genome. Furthermore, a taxonomic profile is created with kraken2 [@Wood:2019], and compared to a user-supplied database of potential sources using sourcepredict to estimate the percentages of contributing sources [@Borry:2019]. Both the normalised host DNA and the sourcepredict results are used to predict the most likely depositor of the faeces. The pipeline also incorporates ancient DNA damage estimates using pyDamage [@Borry:2021] and damageprofiler [@Neukamm:2021] for authenticating the ancient nature of the host DNA, when working with palaeofaeces. All results are collated into a Quarto notebook HTML report for an easy overview of all the results.
37
37
38
38
# Statement of need
39
39
40
-
As mentioned above, (palaeo)faeces are valuable resources to study the depositor's DNA, diet, microbiome, health and more. However, it is often difficult to identify the depositor based on the faeces morphology alone. For example, humans and dogs often overlap in their diets, and produce similarly sized faeces. In 2020, the pipeline nf-core/coproID v1.0 was published [@Borry:2020], which uses both host and microbial DNA to predict the depositor of faecal samples. The microbiome can be a crucial part for a host prediction, as the host DNA content in faeces can be very low in certain species and/or individuals [@Ang:2020; @Perry:2010], including humans and modern dogs [@Borry:2020]. Since its first release, new tools have become available that can improve the accuracy and usability of nf-core/coproID. Here we present the newest version of the pipeline, nf-core/coproID v2.0, rewritten in the newest Nextflow DSL2 language to enhance modularity, reusability, and scalability [@DITommaso:2017], and with newly added features to improve accuracy and reporting.
40
+
As mentioned above, (palaeo)faeces are valuable resources to study the depositor's DNA, diet, microbiome, health, and more. However, it is often difficult to identify the depositor based on the faeces morphology alone. For example, humans and dogs often overlap in their diets, and produce similarly sized faeces. In 2020, the pipeline nf-core/coproID v1.0 was published [@Borry:2020], which uses both host and microbial DNA to predict the depositor of faecal samples. The microbiome can be a crucial part for a host prediction, as the host DNA content in faeces can be very low in certain species and/or individuals [@Ang:2020; @Perry:2010], including humans and modern dogs [@Borry:2020]. Since its first release, new tools have become available that can improve the accuracy and usability of nf-core/coproID. Here we present the newest version of the pipeline, nf-core/coproID v2.0, rewritten in the newest Nextflow DSL2 language to enhance modularity, reusability, and scalability [@DITommaso:2017], and with newly added features to improve accuracy and reporting.
41
41
42
42
# Materials and Methods
43
43
44
44
nf-core/coproID combines the analysis of the putative host (ancient) DNA with a machine learning prediction of the faeces source, based on microbiome taxonomic composition:
45
45
46
46
A. First, nf-core/coproID performs parallel mapping of all reads against two (or more) target genomes (genome1, genome2, ..., genomeX) using bowtie2 [@Langmead:2018], and computes a host-DNA species ratio (NormalisedProportion) using sam2lca [@Borry:2022].
47
-
B. Next, nf-core/coproID performs metagenomic taxonomic profiling with kraken2 [@Wood:2019], and compares the obtained profiles to user supplied modern reference samples of the target species metagenomes. Using machine learning, sourcepredict [@Borry:2019] then estimates the host source from the metagenomic taxonomic composition (SourcepredictProportion).
47
+
48
+
B. Next, nf-core/coproID performs metagenomic taxonomic profiling with kraken2 [@Wood:2019], and compares the obtained profiles to user-supplied modern reference samples of the target species metagenomes. Using machine learning, sourcepredict [@Borry:2019] then estimates the host source from the metagenomic taxonomic composition (SourcepredictProportion).
49
+
48
50
C. Finally, nf-core/coproID combines the A and B proportions to predict the likely host of the metagenomic sample.
49
51
50
52
## Workflow
@@ -54,28 +56,28 @@ Figure 1 describes the newest workflow:
54
56
1. Quality check of the input fastq reads with FastQC [@Andrews:2010].
55
57
2. Removal of adapters and low-complexity reads with fastp [@Chen:2018].
56
58
3. Mapping of adapter-trimmed reads to multiple reference genomes with Bowtie2 [@Langmead:2018].
57
-
4. Lowest Common Ancestor analysis with sam2lca [@Borry:2022] to retain only genomespecific reads, i.e. reads that align equally well to multiple references are removed from the read counts. The sam2lca read counts are normalised by the size of the genome as follows. First, a normalisation factor is calculated per reference, or source species (sp):
59
+
4. Lowest Common Ancestor analysis with sam2lca [@Borry:2022] to retain only genome-specific reads, i.e. reads that align equally well to multiple references are removed from the read counts. The sam2lca read counts are normalised by the size of the genome as follows. First, a normalisation factor is calculated per reference, or source species (sp):
58
60
59
61
$$
60
-
Average Reference Length = ∑_{sp} Reference Length_{sp} / Number of References
62
+
\mathrm{Average Reference Length = ∑_{sp} Reference Length_{sp} / Number of References}
61
63
$$
62
64
63
65
$$
64
-
Normalisation Factor_{sp} = Average Reference Length / Reference Length_{sp}
66
+
\mathrm{Normalisation Factor_{sp} = Average Reference Length / Reference Length_{sp}}
5. Taxonomic profiling is performed on adaptertrimmed reads with kraken2 [@Wood:2019], and by using a custom supplied database. Kraken2 reports are parsed and merged into one table for all samples.
74
-
6. Sourcepredict [@Borry:2019] is then used to predict the source proportions, based on the kraken2 taxonomic profiles, and by using usersupplied reference sources (which should have been created with the same reference database).
75
+
5. Taxonomic profiling is performed on adapter-trimmed reads with kraken2 [@Wood:2019], and by using a custom supplied database. Kraken2 reports are parsed and merged into one table for all samples.
76
+
6. Sourcepredict [@Borry:2019] is then used to predict the source proportions, based on the kraken2 taxonomic profiles, and by using user-supplied reference sources (which should have been created with the same reference database).
75
77
7. Both the host DNA (NormalisedReads) and sourcepredict proportion are used to predict the most likely depositor of the (palaeo)faeces. The probability of each reference species is calculated by:
8. Ancient DNA damage patterns are estimated using pyDamage [@Borry:2021] and damageprofiler [@Neukamm:2021] to authenticate the ancient nature of the DNA when working on palaeofaecal samples.
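
The genome-size normalisation factor defined in step 4 can be illustrated with a minimal Python sketch. It is not the pipeline's own code: the function name `normalisation_factors` and the example genome sizes are hypothetical, chosen only to make the arithmetic easy to follow. It computes the average reference length across all references and divides it by each reference's length, as in the two formulas above.

```python
def normalisation_factors(reference_lengths):
    """Return a normalisation factor per reference species.

    reference_lengths maps a species label to its reference genome
    length in base pairs. The factor is the average reference length
    divided by that species' reference length.
    """
    if not reference_lengths:
        raise ValueError("at least one reference genome is required")
    average_length = sum(reference_lengths.values()) / len(reference_lengths)
    return {
        species: average_length / length
        for species, length in reference_lengths.items()
    }


# Hypothetical genome sizes (assumption, not taken from the paper).
example_lengths = {"genome1": 3_000_000_000, "genome2": 2_000_000_000}
factors = normalisation_factors(example_lengths)
```

A reference longer than the average receives a factor below 1 and a shorter one a factor above 1, so read counts are not inflated simply because a genome is larger.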
@@ -84,7 +86,7 @@ $$
84
86
85
87
## Output
86
88
87
-
The results are located in a nested folder architecture. Fourteen subfolders are created within the useridentified output folder:
89
+
The results are located in a nested folder architecture. Fourteen subfolders are created within the user-identified output folder:
88
90
89
91
- bowtie2
90
92
- create
@@ -101,10 +103,6 @@ The results are located in a nested folder architecture. Fourteen subfolders are
101
103
- samtools
102
104
- sourcepredict
103
105
104
-
# Discussion and conclusions
105
-
106
-
We present a new version of the nf-core/coproID pipeline, v2.0, designed to identify the true depositor of (palaeo)faeces. Written in Nextflow DSL2, and adhering to the latest nf-core standards and guidelines, nf-core/coproID v2.0 is more modular, reusable, and scalable. It includes several new features, including fastp for faster pre-processing of the sequencing reads, sam2lca to improve and generalise host DNA prediction, pyDamage to discriminate between ancient and modern DNA, and the automated creation of a Quarto notebook html report. The modular design also makes it easier for users to customise the pipeline, for example by adding more modules and workflows.
107
-
108
106
# Funding source declaration
109
107
MO was supported by a University of Otago Doctoral Scholarship.