Skip to content

Commit f6335ca

Browse files
committed
update vignette to be more useful
1 parent 81a46df commit f6335ca

1 file changed

Lines changed: 27 additions & 26 deletions

File tree

vignettes/markeR.Rmd

Lines changed: 27 additions & 26 deletions
Original file line numberDiff line numberDiff line change
@@ -69,11 +69,6 @@ The package integrates multiple approaches for characterizing phenotypes:
6969
- **Enrichment-based methods**: GSEA using moderated t- or B-statistics, with options to handle unidirectional and bidirectional gene sets.
7070
- **Gene-level exploration**: expression heatmaps, violin plots, ROC curves, AUC calculations, effect size measures, and PCA analysis to assess individual gene contributions.
7171

72-
markeR supports two primary usage modes:
73-
74-
- **Benchmarking mode**: Evaluate multiple gene sets across multiple phenotypic variables, integrating both score- and enrichment-based analyses, visualizations, and null distribution comparisons.
75-
- **Discovery mode**: Focus on a single, well-characterized gene set for hypothesis generation and exploration of associations with phenotypic variables.
76-
7772
The package is designed to be fully customizable, supporting diverse visualization strategies via `ggplot2`, `ComplexHeatmap`, `ggpubr`, `cowplot`, and `grid.` Its modular structure allows easy integration of new functionalities while providing a robust framework for reproducible, standardized phenotypic characterization of gene sets.
7873

7974
# 2. Installation
@@ -82,7 +77,6 @@ Install the latest development release of markeR from [GitHub](https://github.co
8277

8378
```r
8479
devtools::install_github("DiseaseTranscriptomicsLab/markeR@*release")
85-
library(markeR)
8680
```
8781

8882
```{r, echo=FALSE}
@@ -135,7 +129,7 @@ data(genesets_example)
135129

136130
A filtered and normalised gene expression matrix (genes × samples). Row names must be gene identifiers; column names must match sample IDs in the metadata.
137131

138-
In this vignette, we use a pre‑processed dataset from Marthandan et al. (2016, GSE63577) with human fibroblasts under **replicative senescence** and **proliferative control**. Normalisation was performed with **edgeR**. See `?counts_example` for structure.
132+
In this vignette, we use a pre‑processed dataset from Marthandan et al. (2016, GSE63577) with human fibroblasts under **replicative senescence** and **proliferative control**. See `?counts_example` for structure.
139133

140134
```{r loaddata}
141135
# Load example expression data
@@ -157,12 +151,12 @@ head(metadata_example)
157151

158152
## 3.2. Select Mode of Analysis
159153

160-
* **Discovery Mode**:
161-
Explore how a single, well-characterised gene set relates to a specific variable of interest. Suitable for hypothesis generation or signature projection.
162-
163154
* **Benchmarking Mode**:
164-
Evaluate one or more gene sets against multiple metadata variables using a standardised scoring and effect size framework. This mode provides comprehensive visualisations and comparisons across methods.
155+
Evaluate one or more gene sets against a metadata variable using a standardised scoring and effect size framework. This mode provides comprehensive visualisations and comparisons across methods.
165156

157+
* **Discovery Mode**:
158+
Explore how a single, well-characterised gene set relates to a specific variable of interest. Suitable for hypothesis generation or signature projection.
159+
166160
## 3.3. Choose a Quantification Approach
167161

168162
`markeR` supports two complementary strategies for quantifying the association between gene sets and phenotypes:
@@ -179,7 +173,7 @@ Three scoring methods are available:
179173

180174
* **ssGSEA**: Computes a single-sample gene set enrichment score using the ssGSEA algorithm. Reflects the coordinated up- or down-regulation of the set in each sample.
181175

182-
These methods vary in assumptions and sensitivity. Robust gene sets are expected to perform consistently across all three.
176+
Robust gene sets are expected to perform consistently across all three.
183177

184178
### 3.3.2 Enrichment-Based Approach
185179

@@ -209,7 +203,7 @@ Example of common workflows in Benchmarking mode (full tutorial [here][tutorial-
209203

210204
We begin by quantifying gene set activity using score-based methods. These approaches generate numeric scores per sample that reflect the coordinated expression of genes in each set. This allows easy comparison of gene set behavior across phenotypic groups, while giving a single score per sample, which can be useful in certain contexts.
211205

212-
First, we compute and plot scores using the log2-median method. This method calculates median-centered expression for each gene and averages across the gene set, providing a robust summary of coordinated activity. Here, we assess whether the available gene sets can distinguish between the two conditions: Senescent and Proliferative.
206+
First, we compute and plot scores using the log2-median method, which calculates median-centered expression for each gene and averages across the gene set to provide a robust summary of coordinated activity. We then assess whether the available gene sets distinguish between Senescent and Proliferative conditions. These results suggest that the HernandezSegura and LiteratureMarkers gene sets show very strong effect sizes (|Cohen's d|), while the REACTOME_CELLULAR_SENESCENCE gene set shows a more modest effect. The distributions of scores also overlap more for the latter.
213207

214208

215209
```{r fig.width=8, fig.height=4, out.width="100%", warning=FALSE, message=FALSE}
@@ -229,7 +223,7 @@ PlotScores(data = counts_example,
229223
230224
```
231225

232-
Next, we calculate scores using all available methods (log2-median, ranking, and ssGSEA) to compare results across scoring strategies. The output includes heatmaps and volcano plots to visualize differences between conditions. Heatmaps summarize scores across samples and gene sets, while volcano plots show effect sizes (|Cohen's d|) versus statistical significance.
226+
Next, we calculate scores using multiple methods (log2-median, ranking, and ssGSEA) to compare results across scoring strategies. The output includes heatmaps and volcano plots to visualize effect sizes (|Cohen's d|) between conditions. These analyses confirm that the HernandezSegura and LiteratureMarkers gene sets consistently discriminate Senescence from Proliferative samples across all methods, whereas REACTOME_CELLULAR_SENESCENCE does not.
233227

234228
```{r warning=FALSE, message=FALSE}
235229
Overall_Scores <- PlotScores(data = counts_example,
@@ -266,7 +260,7 @@ Overall_Scores$heatmap
266260
Overall_Scores$volcano
267261
```
268262

269-
ROC curves evaluate the discriminatory power of gene set scores, providing insight into how well a signature distinguishes between experimental or clinical groups.
263+
ROC curves assess the discriminatory power of gene set scores, indicating how well a signature separates different experimental or clinical groups. When discriminating between Senescent and Proliferative samples, the REACTOME_CELLULAR_SENESCENCE gene set shows more heterogeneous ROC curves and AUC values across scoring methods, reflecting less consistent performance compared to the HernandezSegura and LiteratureMarkers gene sets.
270264

271265
```{r roc_scores, fig.width=10, fig.height=3, out.width="100%", warning=FALSE, message=FALSE}
272266
ROC_Scores(data = counts_example,
@@ -286,7 +280,8 @@ ROC_Scores(data = counts_example,
286280
287281
```
288282

289-
Finally, we perform simulations using random gene sets to estimate the false positive rate. This step ensures that observed gene set signals exceed what would be expected by chance, improving confidence in the results.
283+
Finally, we perform simulations using random gene sets to estimate the false positive rate. This step informs whether observed gene set signals exceed what would be expected by chance, improving confidence in the results. By simulating 10 random gene sets of the same size, we see that the LiteratureMarkers gene set performs best, with only two of the random sets achieving higher effect sizes than the original one using the ranking method. While increasing the number of simulated sets would yield finer resolution, this comes at the cost of additional computational time.
284+
290285

291286
```{r FDRSim, fig.width=12, fig.height=3, out.width="100%", warning=FALSE, message=FALSE}
292287
@@ -325,7 +320,8 @@ DEGs <- calculateDE(data = counts_example,
325320
DEGs$`Senescent-Proliferative`[1:5,]
326321
```
327322

328-
Once differential expression is calculated, we can visualize the results with a volcano plot. This plot highlights genes in our predefined gene sets, showing the magnitude of differential expression (log fold-change) versus statistical significance (adjusted p-value). Up- (green) and downregulated (red) genes are color-coded to facilitate interpretation. Genes in blue represent those from gene sets without known direction.
323+
Once differential expression is calculated, we can visualize the results with a volcano plot. This plot highlights genes in our predefined gene sets, showing the magnitude of differential expression (log fold-change) versus statistical significance (adjusted p-value). Up- (green) and downregulated (red) genes are color-coded. For the HernandezSegura gene set, green genes mostly appear in the positive logFC range and red genes in the negative range, though these are not the genes with the most extreme logFC values. In the LiteratureMarkers gene set, two red genes (LMNB1 and MKI67) show very negative logFC, which is consistent with their roles as proliferation markers since senescent cells are non-proliferative. Genes from the REACTOME_CELLULAR_SENESCENCE gene set are shown in blue, as this set lacks information on directionality; these genes are scattered across the full range of logFC values.
324+
329325

330326
```{r DEGsvolcano, fig.width=10, fig.height=3, out.width="100%", warning=FALSE, message=FALSE}
331327
# Change order: signatures in columns, contrast in rows
@@ -338,7 +334,7 @@ plotVolcano(DEGs, genes = genesets_example,
338334
labsize = 10, widthlabs = 24, invert = TRUE)
339335
```
340336

341-
Next, we perform Gene Set Enrichment Analysis (GSEA) using the differential expression results. This approach evaluates whether members of each gene set are non-randomly distributed toward the top or bottom of the ranked gene list, providing normalized enrichment scores (NES) and adjusted p-values.
337+
Next, we perform Gene Set Enrichment Analysis (GSEA) using the differential expression results. This approach evaluates whether members of each gene set are non-randomly distributed toward the top or bottom of the ranked gene list, providing normalized enrichment scores (NES) and adjusted p-values. The ranking of genes can be based on different metrics, reflecting the expected directionality of the gene set. This is also reflected in the plot labels below: gene sets are marked as altered when genes are ordered by the B statistic, or enriched/depleted when ordered by the t-statistic. For a full example and explanation, see the tutorial [here][tutorial-benchmarking].
342338

343339
```{r GSEA, warning=FALSE, message=FALSE}
344340
GSEAresults <- runGSEA(DEGList = DEGs,
@@ -349,7 +345,8 @@ GSEAresults <- runGSEA(DEGList = DEGs,
349345
GSEAresults
350346
```
351347

352-
To visualize the enrichment of each gene set along the ranked gene list, we use enrichment curves from `fgsea.` This plot shows the running enrichment score across the ranked list of genes, highlighting where gene set members are concentrated.
348+
To visualize the enrichment of each gene set along the ranked gene list, we use enrichment curves from `fgsea.` These plots show the running enrichment score across the ranked list, highlighting where members of each gene set are concentrated.
349+
353350

354351
```{r GSEA_plotenrichment, fig.width=10, fig.height=3, out.width="100%", warning=FALSE, message=FALSE}
355352
plotGSEAenrichment(GSEA_results=GSEAresults,
@@ -358,7 +355,7 @@ plotGSEAenrichment(GSEA_results=GSEAresults,
358355
widthTitle=40, grid = TRUE, titlesize = 10, nrow=1, ncol=3)
359356
```
360357

361-
We can also summarize enrichment results in a lollipop plot, which compactly shows the normalized enrichment score and highlights statistically significant gene sets. This provides a quick overview of which pathways are most strongly altered.
358+
We can also summarize enrichment results in a lollipop plot, which compactly shows the normalized enrichment score and highlights statistically significant gene sets. This provides a quick overview of which pathways are most strongly altered. Here, we can see that the HernandezSegura gene set clearly exhibits the strongest enrichment signal.
362359

363360
```{r GSEA_lollypop, fig.width=5, fig.height=4, out.width="60%", warning=FALSE, message=FALSE}
364361
plotNESlollipop(GSEA_results=GSEAresults,
@@ -372,12 +369,16 @@ plotNESlollipop(GSEA_results=GSEAresults,
372369
title=NULL, titlesize=12)
373370
```
374371

375-
Finally, a volcano-style scatter plot combines NES and significance for all gene sets, making it easy to identify which sets show the strongest and most statistically robust enrichment or depletion.
372+
Finally, a volcano-style scatter plot combines NES and significance for all gene sets, making it easy to identify which sets show the strongest and most statistically robust alteration.
376373

377374
```{r GSEA_volcano, fig.width=8, fig.height=3, out.width="100%", warning=FALSE, message=FALSE}
378375
plotCombinedGSEA(GSEAresults, sig_threshold = 0.05, PointSize=6, widthlegend = 26 )
379376
```
380377

378+
From this exercise in benchmarking mode, we can see that two gene sets clearly perform best at discriminating between Senescent and Proliferative conditions: HernandezSegura and LiteratureMarkers. The REACTOME_CELLULAR_SENESCENCE gene set does not show a strong signal; since it is undirected and lacks information on up- or downregulation, this also dilutes the signal, highlighting the importance of providing directionality when available.
379+
380+
Across methods, scoring and enrichment approaches provide complementary insights. Score-based methods offer sample-level resolution, capturing strong contributions from individual genes, while enrichment-based methods evaluate coordinated behaviour across the set, typically at the group level. Even among scoring methods, rank-based approaches (e.g., ssGSEA or rank scoring) are generally more robust to technical noise, whereas magnitude-based methods (e.g., log2-median) better detect shifts in well-controlled data. Performance also depends on sample size and gene set size, explaining why HernandezSegura may appear stronger in enrichment analyses and LiteratureMarkers in scoring analyses. Integrating both approaches provides the most comprehensive view of gene set behaviour.
381+
381382

382383
### 3.4.2 Discovery Mode
383384

@@ -404,7 +405,7 @@ metadata_example$DaysToSequencing <- sample(c(1:20),39, replace = TRUE)
404405
head(metadata_example)
405406
```
406407

407-
We can then examine how the selected gene set associates with these variables using enrichment-based approaches in Discovery Mode. The resulting plot highlights significant associations across variables and visually summarizes the direction and strength of the effect.
408+
We can then examine how the selected gene set associates with these variables using enrichment-based approaches in Discovery Mode. The resulting plot highlights significant associations across variables and visually summarizes the direction and strength of the effect. The “simple” mode provides comparison of effect sizes for pairwise contrasts between only two levels of the variable, but can be changed to more levels of comparison (see `?VariableAssociation`). Here, we can see that the HernandezSegura gene set is significantly enriched in samples sequenced by Francisca, when compared to those processed by Ana or John, suggesting that this gene set may be particularly relevant to her experimental conditions or sample processing. This gene set has also a strong enrichment for the comparison between proliferative and senescent, which is expected given the nature of the gene set and the results from the Benchmarking mode.
408409

409410
```{r GSEA_varassoc, fig.width=6, fig.height=6, out.width="60%", warning=FALSE, message=FALSE}
410411
VariableAssociation(
@@ -422,12 +423,12 @@ VariableAssociation(
422423
labsize = 10,
423424
titlesize = 14,
424425
pointSize = 5
425-
)
426+
) $plot
426427
427428
```
428429

429430

430-
Next, we evaluate the association using a score-based method (here, log2-median). This approach calculates per-sample scores for the gene set and summarizes them across variables. The “extensive” mode provides a comprehensive output including effect sizes for pairwise contrasts and overall statistics for continuous variables.
431+
Next, we evaluate the association using a score-based method (here, log2-median). This approach calculates per-sample scores for the gene set and summarizes them across variables. In this analysis, Condition (representing whether the sample is senescent or proliferative) shows a strong effect, reflected by a large Cohen’s f. This effect size metric is directly comparable across different variable types (categorical, numeric, etc.), making it particularly versatile. In contrast, the variable Researcher does not show a significant effect here, unlike in the enrichment analysis. This divergence illustrates the value of applying both enrichment- and score-based approaches in a complementary manner.
431432

432433

433434
```{r variableassoc_score_sen, fig.width=7, fig.height=7, out.width="100%", warning=FALSE, message=FALSE}
@@ -436,7 +437,7 @@ VariableAssociation(data = counts_example,
436437
method = "logmedian",
437438
cols = c("Condition","Researcher","DaysToSequencing"),
438439
gene_set = HernandezSegura_GeneSet,
439-
mode="extensive",
440+
mode="simple",
440441
nonsignif_color = "white", signif_color = "red", saturation_value=NULL,sig_threshold = 0.05,
441442
widthlabels=30, labsize=10, titlesize=14, pointSize=5, discrete_colors=NULL,
442443
continuous_color = "#8C6D03", color_palette = "Set2")$Overall
@@ -471,7 +472,7 @@ Computes enrichment using a user-defined gene universe and Fisher’s exact test
471472

472473
Filters can be applied based on similarity thresholds (e.g., minimum Jaccard, OR, or p-value).
473474

474-
Example of Gene Set Similarity (full tutorial [here][tutorial-signaturesimilarity])::
475+
Example of Gene Set Similarity (full tutorial [here][tutorial-signaturesimilarity]). Here, we are seeing how two user-defined signatures compare to a set of other user-defined signatures, as well as to a collection of reference gene sets from MSigDB (C2:CP:KEGG_LEGACY). The results are visualized in a heatmap, showing the similarity between the signatures based on log odds ratio (at least one with log10OR > 2).
475476

476477
```{r, fig.width=6, fig.height=8, out.width="60%"}
477478

0 commit comments

Comments
 (0)