Here, you can find an explanation of the different computations, tools or metrics implemented in deepCSA.
We are in the process of completing the documentation, but in the meantime you can check the recently published paper and its supplementary material for more details.
This can be found in the mutdensityadj folder of the deepCSA output.
The goal is to define a mean mutation density estimate for a specific subset of mutation sites. For example, we might be interested in measuring the mutation density at the set of missense mutation sites -- i.e., mutation sites in the CDS that induce missense mutations.
In the context of bulk-ultradeep sequencing, there are several factors that can influence the occurrence of mutations at a given site.
- trinucleotide context of the site
- neutral mutagenesis, defined as:
  - normalized profile: vector of relative mutabilities for each trinucleotide context
  - exposure associated with the profile, expressed as mutation burden
- sequencing depth of the site
- selection of the site
We would like to define a way to compute the mutation density that corrects for potential confounders like depth and triplet content, thus rendering the estimates more comparable. For example, because of differences in depth or trinucleotide composition, more missense mutations per site may be observed in gene A than in gene B, even if both genes are subject to exactly the same mutational process.
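The confounding described above can be illustrated with a minimal sketch. It assumes, for illustration only, that the expected number of mutations at a site is the product of a mutation burden, the relative mutability of the site's trinucleotide context, and the sequencing depth. The profile values, the burden, the depths, and the two genes are all hypothetical and are not taken from deepCSA.

```python
# Hypothetical relative mutabilities per trinucleotide change (not a real profile).
PROFILE = {"ACG>T": 0.6, "TCA>T": 0.3, "ATA>C": 0.1}

# Hypothetical exposure: expected mutations per read for a relative mutability of 1.
BURDEN = 1e-6


def expected_mutations(sites):
    """Expected mutation count over (context, depth) pairs under the assumed model."""
    return sum(BURDEN * PROFILE[context] * depth for context, depth in sites)


def raw_density(sites):
    """Naive density: expected mutations divided by the number of sites."""
    return expected_mutations(sites) / len(sites)


# Both genes are exposed to the same profile and burden; they differ only in
# sequencing depth and in the trinucleotide contexts of their missense sites.
gene_a = [("ACG>T", 10_000)] * 50 + [("TCA>T", 10_000)] * 50
gene_b = [("ATA>C", 2_000)] * 50 + [("TCA>T", 2_000)] * 50

density_a = raw_density(gene_a)
density_b = raw_density(gene_b)
print(f"gene A: {density_a:.6f} mutations per site")
print(f"gene B: {density_b:.6f} mutations per site")
```

Under these assumptions gene A shows a much higher raw density than gene B although the mutational process is identical, which is the kind of discrepancy an adjusted density is meant to remove.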
Given a collection
Assuming even depth
Throughout we make the following assumptions:
- the only way mutations come about is by neutral mutagenesis
- discrepancies between neutral mutagenesis and the observed mutations are due to selection
Mutation sites are defined as specific single base nucleotide changes. Each mutation site corresponds to a tri-nucleotide context of the form
Let
Note that in order to render absolute mutabilities, the
Let
Let
For each mutation site
Given a collection
Note that this definition of length has an arbitrary interpretation, because of the multiple possible choices that we can make for the relative mutational profile. To render a universal length, we must set a scale to represent the mutational profile with a concrete meaning. What would be a good practical reference?
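The scale ambiguity can be made concrete with a small sketch. It assumes, purely for illustration and under even depth, that the length of a collection is the sum of the relative mutabilities of its sites; this stand-in definition, the profile values, and the site collection are hypothetical.

```python
def weighted_length(sites, profile):
    """Sum of relative mutabilities over a collection of sites (assumed definition)."""
    return sum(profile[context] for context in sites)


def rescale(profile, factor):
    """Multiply every relative mutability by the same constant."""
    return {context: value * factor for context, value in profile.items()}


# Hypothetical relative profile and collection of mutation sites.
profile = {"ACG>T": 0.6, "TCA>T": 0.3, "ATA>C": 0.1}
sites = ["ACG>T"] * 10 + ["TCA>T"] * 20 + ["ATA>C"] * 30

burden = 1e-3  # hypothetical exposure paired with the original profile

base_length = weighted_length(sites, profile)
doubled_length = weighted_length(sites, rescale(profile, 2.0))

# Doubling the profile and halving the burden describes the same mutagenesis,
# so the expected mutation count is unchanged while the length doubles.
expected_original = burden * base_length
expected_rescaled = (burden / 2.0) * doubled_length

print(base_length, doubled_length)
print(expected_original, expected_rescaled)
```

Because any such rescaling leaves the expected mutation counts untouched, the length only acquires a concrete meaning once a reference scale for the profile is fixed.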
One possibility would be to choose a scaling factor
If
Because
Applying this scaling, the effective length would take the following form:
Defining the mutation density as
For more details on omega, see the corresponding repository.
The site comparison step takes advantage of the computation of mutabilities in omega, and then compares these mutabilities either by residue, residue change or nucleotide change.
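As a rough sketch of what comparing at different levels of grouping could look like, the snippet below aggregates per-site records by residue, by residue change, or by nucleotide change. The record fields, the example values, and the observed-to-expected ratio used as the comparison are assumptions made for illustration; they are not the actual deepCSA or omega data model.

```python
from collections import defaultdict


def compare_by(records, key_fields):
    """Sum observed counts and expected mutabilities per group and return their ratio."""
    totals = defaultdict(lambda: [0.0, 0.0])
    for record in records:
        key = tuple(record[field] for field in key_fields)
        totals[key][0] += record["observed"]
        totals[key][1] += record["expected"]
    return {
        key: (observed, expected, observed / expected if expected > 0 else float("nan"))
        for key, (observed, expected) in totals.items()
    }


# Hypothetical per-site records; "expected" stands in for a mutability-derived expectation.
records = [
    {"residue": 12, "ref_aa": "G", "alt_aa": "D", "nt_change": "35G>A", "observed": 4, "expected": 1.0},
    {"residue": 12, "ref_aa": "G", "alt_aa": "V", "nt_change": "35G>T", "observed": 2, "expected": 0.5},
    {"residue": 13, "ref_aa": "G", "alt_aa": "D", "nt_change": "38G>A", "observed": 1, "expected": 1.0},
]

by_residue = compare_by(records, ["residue"])
by_residue_change = compare_by(records, ["residue", "ref_aa", "alt_aa"])
by_nt_change = compare_by(records, ["nt_change"])

for key, (observed, expected, ratio) in by_residue.items():
    print(key, observed, expected, ratio)
```

Grouping by a coarser key, such as the residue, pools evidence across all changes at that position, whereas grouping by nucleotide change keeps each site separate.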
We provide two different strategies for signature analysis.
- Using SigProfilerAssignment with a set of known SBS signatures
- Using a Hierarchical Dirichlet Process algorithm developed by Nicola Roberts and packaged by the McGranahan lab into a wrapper.
Additionally, SigProfilerExtractor can be run on the data, but this must be done outside deepCSA.
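For orientation, assigning known SBS signatures with the SigProfilerAssignment Python package on its own could look like the sketch below. This is not necessarily how deepCSA invokes the tool; the wrapper function name and the paths in the usage comment are hypothetical, and the package must be installed separately.

```python
def assign_known_signatures(matrix_path, output_dir):
    """Fit known COSMIC SBS signatures to a 96-channel mutation count matrix."""
    # Imported lazily so this module can be loaded without the package installed.
    from SigProfilerAssignment import Analyzer as Analyze

    Analyze.cosmic_fit(
        samples=matrix_path,   # tab-separated SBS96 matrix, one column per sample
        output=output_dir,     # directory where the results are written
        input_type="matrix",
        context_type="96",
    )


# Example call (hypothetical paths):
# assign_known_signatures("sbs96_counts.tsv", "signature_assignment_out")
```

The function only wraps a single `cosmic_fit` call; options such as the genome build or the COSMIC version are left at the package defaults here and would need to match the data being analysed.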