Skip to content

add unsupervised approach NMF for clustering#106

Merged
tonywu1999 merged 4 commits into
develfrom
unsupervised-approach
Jun 16, 2026
Merged

add unsupervised approach NMF for clustering#106
tonywu1999 merged 4 commits into
develfrom
unsupervised-approach

Conversation

@tonywu1999

@tonywu1999 tonywu1999 commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Motivation and Context

This PR adds interpretable, unsupervised topic clustering to MSstatsBioNet by decomposing INDRA-derived biological subnetworks into “topic-specific” subnetworks using Non-Negative Matrix Factorization (NMF). It jointly leverages two aligned data views at the paper level—(1) PubMed abstract text (paper × word) and (2) evidence aggregated to edge-level signals (paper × edge)—so topics can be characterized by both biological terminology and evidence patterns. It also supports an option to exclude the PPI/edge evidence view.

Detailed Changes

  • Public API / package exports

    • Exported decomposeSubnetworkByTopic
    • Exported bootstrapTopicModels
    • Exported compareTopicModels
    • Added importFrom(stats, runif) for internal stochastic initialization
  • decomposeSubnetworkByTopic() (R/decomposeSubnetworkByTopic.R)

    • Validates inputs and constructs deterministic paper-aligned matrices:
      • X_text from cleaned/validated PubMed abstract word counts
      • X_edges from evidence counts aggregated to (pmid, edge_key)
    • Runs NMF in one of two modes:
      • Joint (include_ppi = TRUE): shared basis across text and edges
      • Text-only (include_ppi = FALSE): fits on X_text and derives edge loadings from learned text topics
    • Computes per-edge topic shares, assigns edges to topics using cutoff + argmax-forced inclusion, and builds topic-restricted subnetworks.
    • Returns per-topic subnetworks (topic_1..topic_k) and attaches an "nmf" attribute containing factorization outputs (W, H_text, H_edges), metadata (e.g., terms, edge keys, PMIDs), and modeling settings.
  • Topic modeling pipeline & utilities (R/utils_decomposeSubnetworkByTopic.R)

    • Added/expanded internal helpers for:
      • Subnetwork input validation and edge key construction
      • Building aligned topic matrices (including term pruning with fallback behavior)
      • NMF fitting routines (.jointNMF(), .textNMF())
      • Deriving edge loadings when fitting is text-only
      • Normalization and topic interpretation utilities (e.g., edge-topic shares, top terms)
      • Topic alignment across runs (cosine similarity + greedy matching)
      • Topic-model comparison metrics (ARI, consensus matrices, dispersion, pairwise ARI)
  • Robustness / comparison features

    • bootstrapTopicModels() (R/bootstrapTopicModels.R)
      • Bootstraps by resampling papers (fixed vocabulary/matrix construction)
      • Fits reference and bootstrap topic models, aligns bootstrap topics to reference topics, and aggregates per-topic term selection frequency and within-topic weights
    • compareTopicModels() (R/compareTopicModels.R)
      • Compares joint vs text-only topic structure across multiple seeds
      • Computes between-mode vs within-mode ARI distributions and runs a one-sided Wilcoxon test
      • Returns consensus/dispersion outputs plus partitions and intermediate statistics
  • Documentation

    • Added/updated roxygen-generated man pages for:
      • decomposeSubnetworkByTopic
      • bootstrapTopicModels
      • compareTopicModels

Unit Tests

  • No unit tests were added or modified for decomposeSubnetworkByTopic(), bootstrapTopicModels(), or compareTopicModels().
  • A repository search under tests/testthat shows no direct references to these new APIs in the existing test files.

Coding Guidelines Violations

  • Insufficient test coverage for new functionality
    • Codecov reported 0% patch coverage for the newly added topic modeling/decomposition code (all new lines uncovered), contributing to a significant overall coverage drop.
  • Pre-submission checklist not completed
    • The PR template/checklist items (including contributing-guideline verification and documentation/style checks) were not marked as completed.

@coderabbitai

coderabbitai Bot commented Jun 15, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@tonywu1999, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 51 minutes and 12 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d146b10d-51a4-42f4-903a-b245c8edf33b

📥 Commits

Reviewing files that changed from the base of the PR and between a2caf70 and aefb316.

📒 Files selected for processing (8)
  • R/bootstrapTopicModels.R
  • R/compareTopicModels.R
  • R/decomposeSubnetworkByTopic.R
  • R/filterSubnetworkByContext.R
  • man/bootstrapTopicModels.Rd
  • man/compareTopicModels.Rd
  • man/decomposeSubnetworkByTopic.Rd
  • man/filterSubnetworkByContext.Rd
📝 Walkthrough

Walkthrough

A new topic decomposition ecosystem is added to the package: the core decomposeSubnetworkByTopic function factorizes subnetworks into per-topic subgraphs using joint NMF over PubMed abstracts and edge evidence, while bootstrapTopicModels assesses word-selection stability across resamples and compareTopicModels statistically tests whether including edge data alters inferred topics beyond NMF randomness. All three functions share a comprehensive internal library for validation, matrix construction, factorization, and clustering metrics.

Changes

NMF-based topic decomposition with bootstrap stability and mode comparison

Layer / File(s) Summary
Package metadata and namespace declarations
DESCRIPTION, NAMESPACE
Updates roxygen2 configuration version and adds exports for decomposeSubnetworkByTopic, bootstrapTopicModels, compareTopicModels plus runif import from stats.
Input validation and edge key construction
R/utils_decomposeSubnetworkByTopic.R
.validateDecomposeSubnetworkByTopicInput enforces subnetwork structure (nodes/edges dataframe shapes) and parameter ranges (n_topics, edge_topic_cutoff, include_ppi). .edgeKey builds deterministic edge identifiers from source, target, and interaction type.
Text and edge matrix builders
R/utils_decomposeSubnetworkByTopic.R
.buildTopicMatrices orchestrates data preparation: extracts evidence, filters to non-empty PMIDs, reduces n_topics when papers < topics, fetches abstracts. .buildTextMatrix tokenizes via text2vec with English stopword removal, term-frequency pruning with fallback, and PMID-order alignment. .buildEdgeMatrix allocates and fills zero matrix indexed by (PMID, edge_key) with evidence sentence counts.
Joint, text-only, and folded NMF factorizations
R/utils_decomposeSubnetworkByTopic.R
.jointNMF implements shared-basis multiplicative updates over both text and edge views with optional Frobenius normalization, objective tracking, and early stopping. .textNMF runs text-only single-view NMF. .edgeLoadingsFromTopics derives edge-topic loadings post-hoc by matrix projection when text-only mode is selected.
Topic post-processing utilities
R/utils_decomposeSubnetworkByTopic.R
.edgeTopicShares column-normalizes edge-topic loadings. .topTermsForTopic extracts top-weighted terms. .topicPartition assigns hard topic labels via argmax. .adjustedRandIndex, .consensusMatrix, .dispersionCoefficient, .pairwiseARI assess clustering stability. .cosineSimMatrix and .greedyTopicMatch enable permutation-invariant topic alignment. .fitTopicModel dispatches joint vs. text-only mode.
Main decomposition function
R/decomposeSubnetworkByTopic.R
Orchestrates full workflow: validates input, extracts PMID evidence, calls matrix builders, dispatches NMF (joint or text-only based on include_ppi), computes edge-topic shares, selects topic-associated edges via soft cutoff with argmax guarantee, restricts nodes, assigns PMIDs by strongest loading, assembles per-topic subnetwork lists, attaches factorization matrices as "nmf" attribute.
Bootstrap stability analysis
R/bootstrapTopicModels.R
Builds shared matrices once to fix vocabulary, fits a reference topic model on full data, bootstraps n_boot resamples and aligns topics to reference via cosine similarity and greedy matching, aggregates per-topic word selection frequency and mean within-topic weight, returns configuration and per-topic topTerms ranked by frequency and mean weight alongside reference top-words.
Statistical comparison of NMF modes
R/compareTopicModels.R
Tests whether including PPI/edge information changes inferred topic structure beyond random NMF variation. Fits both modes (joint and text-only) across provided seeds, derives partitions and within/between-mode ARI values, runs one-sided Wilcoxon rank-sum test comparing between-mode to within-mode ARI distributions, computes consensus co-membership and dispersion coefficients per mode.
Function documentation
man/decomposeSubnetworkByTopic.Rd, man/bootstrapTopicModels.Rd, man/compareTopicModels.Rd
Roxygen-generated Rd documentation for all three exported functions, including parameter definitions, return value structures, algorithm details, runnable examples, and cross-references.
filterSubnetworkByContext documentation updates
R/filterSubnetworkByContext.R, man/filterSubnetworkByContext.Rd
Roxygen documentation consolidated @importFrom text2vec tags onto single lines and added new beta feature @note blocks to both source roxygen headers and generated Rd page.

Sequence Diagram

sequenceDiagram
  participant User
  participant decomposeSubnetwork as decomposeSubnetworkByTopic
  participant buildMatrices as .buildTopicMatrices
  participant nmf as .jointNMF / .textNMF
  participant EdgeShare as .edgeTopicShares
  participant Bootstrap as bootstrapTopicModels
  participant Compare as compareTopicModels

  User->>decomposeSubnetwork: subnetwork, n_topics, include_ppi=TRUE/FALSE, ...
  decomposeSubnetwork->>buildMatrices: extract PMIDs, abstracts, evidence
  buildMatrices-->>decomposeSubnetwork: X_text, X_edges, pmids, edge_keys
  decomposeSubnetwork->>nmf: X_text, X_edges, k, include_ppi
  nmf-->>decomposeSubnetwork: W, H_text, H_edges
  decomposeSubnetwork->>EdgeShare: H_edges
  EdgeShare-->>decomposeSubnetwork: per-edge topic shares
  decomposeSubnetwork-->>User: topic_1, topic_2, ..., attr(nmf=...)

  User->>Bootstrap: same subnetwork, n_boot, include_ppi
  Bootstrap->>buildMatrices: same matrices, fixed vocabulary
  Bootstrap->>nmf: fit reference + n_boot resamples
  Bootstrap-->>User: topTerms (freq, mean weight), reference

  User->>Compare: same subnetwork, seeds, unit, include_ppi
  Compare->>buildMatrices: same matrices
  Compare->>nmf: fit joint and text-only across seeds
  Compare-->>User: within/between ARI, Wilcoxon test, consensus
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

Review effort 3/5

Poem

🐇 With matrices built from paper and edge,
The rabbit sits perched on the knowledge ledge,
Joint NMF learns what the topics may be,
While bootstrap and tests verify all we see,
Each subgraph unwoven reveals its own tale—
A network of wisdom, a data-driven grail! 📚✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Description check ⚠️ Warning The pull request description is completely empty. All required template sections (Motivation and Context, Changes, Testing, Checklist) are missing or unfilled, leaving reviewers without essential context about the implementation. Complete the PR description template fully. Include motivation for NMF approach, detailed bullet points of the three new functions and utility helpers, describe what tests were added, and verify all checklist items including styler and document() compliance.
Title check ❓ Inconclusive The title 'add unsupervised approach NMF for clustering' is partially related to the changeset, referring to a core aspect of the implementation but lacks clarity and specificity about which NMF functionality is being added to the package. Improve the title to be more specific and clear. Consider: 'Add NMF-based topic decomposition for subnetwork clustering' to better convey that this introduces three new topic modeling functions for network analysis.
✅ Passed checks (3 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch unsupervised-approach

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@R/decomposeSubnetworkByTopic.R`:
- Around line 86-93: The pmids variable extracted from evidence$pmid may be
numeric or factor types, which will cause incorrect positional indexing when
used as list keys in abstract_list[[p]] instead of name-based indexing. After
creating pmids with unique(evidence$pmid), coerce the values to character type
using as.character() and trim any whitespace using trimws() to ensure they are
properly formatted strings before being used for subsetting the abstract list.
- Around line 131-135: The topicWeight column is only added to topic_edges when
it contains rows, but the function documentation promises that the returned
edges data frame always has a topicWeight column added. To ensure consistent
schema, you need to add the topicWeight column to topic_edges unconditionally,
not just when nrow(topic_edges) > 0. Refactor the code by either moving the
topicWeight assignment outside the conditional block and handling the empty case
separately (assigning an empty numeric vector or NA values), or by restructuring
the conditional to initialize topicWeight for all cases and only populate values
when rows exist.

In `@R/utils_decomposeSubnetworkByTopic.R`:
- Around line 46-48: The .edgeKey() function currently only uses source, target,
and interaction to create the edge identifier, but it needs to also include site
and stmt_hash parameters to ensure rows with identical source/target/interaction
but different site or stmt_hash values are treated as distinct edges. Modify the
.edgeKey() function signature to accept site and stmt_hash as additional
parameters, and include them in the paste() concatenation with the same
separator so that the resulting key uniquely identifies each combination of
source, target, interaction, site, and stmt_hash.
- Around line 7-36: Add validation for the five public tuning parameters that
currently bypass checks in the `.validateDecomposeSubnetworkByTopicInput`
function. Extend the function signature to accept `n_top_terms`,
`min_term_count`, `max_iter`, `tol`, and `seed` as parameters, then add
validation logic for each: validate that `n_top_terms` is a single positive
integer, `min_term_count` is a single positive integer, `max_iter` is a single
positive integer, `tol` is a single positive numeric value, and `seed` is either
NULL or a single integer. Then locate the call to
`.validateDecomposeSubnetworkByTopicInput` in `R/decomposeSubnetworkByTopic.R`
and update it to pass these five additional arguments from the parent function.
- Around line 154-157: The call to set.seed(seed) directly modifies the global
.Random.seed, affecting all subsequent random operations in the user's session.
Save the current random state before calling set.seed(seed) by capturing the
existing .Random.seed value, then use on.exit() to restore it when the function
completes. This ensures that the random number generator state is restored to
its original condition after the function executes, preventing side effects on
the caller's random stream.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f80e349a-4dce-41ae-bf36-4ee50357767f

📥 Commits

Reviewing files that changed from the base of the PR and between d1e6219 and 6d60e05.

📒 Files selected for processing (3)
  • NAMESPACE
  • R/decomposeSubnetworkByTopic.R
  • R/utils_decomposeSubnetworkByTopic.R

Comment thread R/decomposeSubnetworkByTopic.R Outdated
Comment on lines +131 to +135
topic_edges <- edges[in_topic, , drop = FALSE]
if (nrow(topic_edges) > 0) {
topic_edges$topicWeight <-
shares[t, match(edges_key_vec[in_topic], edge_keys)]
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Always add topicWeight, even for empty topic edge frames.

The return docs promise an edges data frame “with an added topicWeight column”, but empty topics skip that column, creating an inconsistent downstream schema.

Proposed fix
         in_topic <- edges_key_vec %in% topic_keys
         topic_edges <- edges[in_topic, , drop = FALSE]
-        if (nrow(topic_edges) > 0) {
-            topic_edges$topicWeight <-
-                shares[t, match(edges_key_vec[in_topic], edge_keys)]
-        }
+        topic_edges$topicWeight <-
+            shares[t, match(edges_key_vec[in_topic], edge_keys)]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
topic_edges <- edges[in_topic, , drop = FALSE]
if (nrow(topic_edges) > 0) {
topic_edges$topicWeight <-
shares[t, match(edges_key_vec[in_topic], edge_keys)]
}
topic_edges <- edges[in_topic, , drop = FALSE]
topic_edges$topicWeight <-
shares[t, match(edges_key_vec[in_topic], edge_keys)]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@R/decomposeSubnetworkByTopic.R` around lines 131 - 135, The topicWeight
column is only added to topic_edges when it contains rows, but the function
documentation promises that the returned edges data frame always has a
topicWeight column added. To ensure consistent schema, you need to add the
topicWeight column to topic_edges unconditionally, not just when
nrow(topic_edges) > 0. Refactor the code by either moving the topicWeight
assignment outside the conditional block and handling the empty case separately
(assigning an empty numeric vector or NA values), or by restructuring the
conditional to initialize topicWeight for all cases and only populate values
when rows exist.

Comment on lines +7 to +36
.validateDecomposeSubnetworkByTopicInput <- function(subnetwork,
n_topics,
edge_topic_cutoff) {
if (!is.list(subnetwork) ||
!all(c("nodes", "edges") %in% names(subnetwork))) {
stop("`subnetwork` must be a list containing `nodes` and `edges`, ",
"e.g. the output of getSubnetworkFromIndra().")
}
if (!is.data.frame(subnetwork$nodes) || !"id" %in% names(subnetwork$nodes)) {
stop("`subnetwork$nodes` must be a data.frame with an `id` column.")
}
required_edge_cols <- c("source", "target", "interaction",
"site", "evidenceLink", "stmt_hash")
missing_cols <- setdiff(required_edge_cols, names(subnetwork$edges))
if (!is.data.frame(subnetwork$edges) || length(missing_cols) > 0) {
stop(sprintf(
"`subnetwork$edges` must be a data.frame with columns: %s",
paste(required_edge_cols, collapse = ", ")
))
}
if (!is.numeric(n_topics) || length(n_topics) != 1L || is.na(n_topics) ||
n_topics < 1 || n_topics != as.integer(n_topics)) {
stop("`n_topics` must be a single positive integer.")
}
if (!is.numeric(edge_topic_cutoff) || length(edge_topic_cutoff) != 1L ||
is.na(edge_topic_cutoff) || edge_topic_cutoff < 0 ||
edge_topic_cutoff > 1) {
stop("`edge_topic_cutoff` must be a single numeric value in [0, 1].")
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate all public tuning parameters here.

n_top_terms, min_term_count, max_iter, tol, and seed are public arguments but bypass validation. Invalid values currently fail later in places like prune_vocabulary(), seq_len(max_iter), the tolerance if, or set.seed().

Proposed direction
 .validateDecomposeSubnetworkByTopicInput <- function(subnetwork,
                                                      n_topics,
-                                                     edge_topic_cutoff) {
+                                                     edge_topic_cutoff,
+                                                     n_top_terms,
+                                                     min_term_count,
+                                                     max_iter,
+                                                     tol,
+                                                     seed) {
@@
     if (!is.numeric(edge_topic_cutoff) || length(edge_topic_cutoff) != 1L ||
         is.na(edge_topic_cutoff) || edge_topic_cutoff < 0 ||
         edge_topic_cutoff > 1) {
         stop("`edge_topic_cutoff` must be a single numeric value in [0, 1].")
     }
+    if (!is.numeric(n_top_terms) || length(n_top_terms) != 1L ||
+        is.na(n_top_terms) || n_top_terms < 0 ||
+        n_top_terms != as.integer(n_top_terms)) {
+        stop("`n_top_terms` must be a single non-negative integer.")
+    }
+    if (!is.numeric(min_term_count) || length(min_term_count) != 1L ||
+        is.na(min_term_count) || min_term_count < 1 ||
+        min_term_count != as.integer(min_term_count)) {
+        stop("`min_term_count` must be a single positive integer.")
+    }
+    if (!is.numeric(max_iter) || length(max_iter) != 1L ||
+        is.na(max_iter) || max_iter < 1 ||
+        max_iter != as.integer(max_iter)) {
+        stop("`max_iter` must be a single positive integer.")
+    }
+    if (!is.numeric(tol) || length(tol) != 1L ||
+        is.na(tol) || !is.finite(tol) || tol < 0) {
+        stop("`tol` must be a single non-negative finite numeric value.")
+    }
+    if (!is.numeric(seed) || length(seed) != 1L ||
+        is.na(seed) || seed != as.integer(seed)) {
+        stop("`seed` must be a single integer.")
+    }
 }

Also update the call site in R/decomposeSubnetworkByTopic.R to pass the additional arguments.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@R/utils_decomposeSubnetworkByTopic.R` around lines 7 - 36, Add validation for
the five public tuning parameters that currently bypass checks in the
`.validateDecomposeSubnetworkByTopicInput` function. Extend the function
signature to accept `n_top_terms`, `min_term_count`, `max_iter`, `tol`, and
`seed` as parameters, then add validation logic for each: validate that
`n_top_terms` is a single positive integer, `min_term_count` is a single
positive integer, `max_iter` is a single positive integer, `tol` is a single
positive numeric value, and `seed` is either NULL or a single integer. Then
locate the call to `.validateDecomposeSubnetworkByTopicInput` in
`R/decomposeSubnetworkByTopic.R` and update it to pass these five additional
arguments from the parent function.

Comment on lines +46 to +48
.edgeKey <- function(source, target, interaction) {
paste(source, target, interaction, sep = "||")
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Include the full edge identity in the topic key.

The validator requires site and stmt_hash, but .edgeKey() drops both. Rows that share source/target/interaction but differ by site or statement hash collapse into one NMF edge column and receive the same topic assignment.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@R/utils_decomposeSubnetworkByTopic.R` around lines 46 - 48, The .edgeKey()
function currently only uses source, target, and interaction to create the edge
identifier, but it needs to also include site and stmt_hash parameters to ensure
rows with identical source/target/interaction but different site or stmt_hash
values are treated as distinct edges. Modify the .edgeKey() function signature
to accept site and stmt_hash as additional parameters, and include them in the
paste() concatenation with the same separator so that the resulting key uniquely
identifies each combination of source, target, interaction, site, and stmt_hash.

Comment on lines +154 to +157
set.seed(seed)
W <- matrix(runif(n * k), n, k)
H_text <- matrix(runif(k * p), k, p)
H_edges <- matrix(runif(k * m), k, m)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if the file exists and examine lines around 154-157
if [ -f "R/utils_decomposeSubnetworkByTopic.R" ]; then
    echo "=== File exists, examining lines 140-170 ==="
    sed -n '140,170p' "R/utils_decomposeSubnetworkByTopic.R" | cat -n
else
    echo "File not found, listing R directory:"
    ls -la R/ | head -20
fi

Repository: Vitek-Lab/MSstatsBioNet

Length of output: 1363


🏁 Script executed:

#!/bin/bash
# Search for set.seed calls in the repository to understand the pattern
echo "=== Searching for set.seed patterns ==="
rg "set\.seed" R/ -A 2 -B 2

Repository: Vitek-Lab/MSstatsBioNet

Length of output: 412


🏁 Script executed:

#!/bin/bash
# Verify the exact lines and check if the fix has been applied
echo "=== Checking lines 154-160 precisely ==="
sed -n '154,160p' "R/utils_decomposeSubnetworkByTopic.R" | cat -n

Repository: Vitek-Lab/MSstatsBioNet

Length of output: 344


🌐 Web query:

R best practice set.seed global RNG state mutation restore

💡 Result:

In R, the global random number generator (RNG) state is stored in the .Random.seed variable in the global environment (.GlobalEnv) [1][2]. Modifying this state globally within a function can disrupt external simulations or unexpected processes, and CRAN policies strictly discourage packages from modifying the global environment [3][4]. Best Practices for RNG State Management 1. Avoid Global Mutation in Packages: Do not modify .Random.seed in a package function unless explicitly intended [3][4]. If a function must use a fixed seed for internal reproducibility, restore the original state before returning to ensure the user's global RNG state remains unchanged [5][6]. 2. Robust Restoration Pattern: Use on.exit() to ensure the RNG state is restored even if the function encounters an error or is interrupted [6]. This prevents side effects on the user's workspace [6]. Example of the robust pattern: # Robustly setting/restoring RNG state within a function do_random_stuff <- function { genv <- globalenv # Save original state if it exists old_seed <- if (exists(".Random.seed", envir = genv, inherits = FALSE)) { get(".Random.seed", envir = genv, inherits = FALSE) } else { NULL } # Ensure restoration on exit on.exit(suspendInterrupts({ if (is.null(old_seed)) { rm(".Random.seed", envir = genv, inherits = FALSE) } else { assign(".Random.seed", value = old_seed, envir = genv, inherits = FALSE) } })) # Set desired seed for local work set.seed(42) return(runif(3)) } 3. Parallel Processing: For parallel computations, do not rely on manual seed restoration [7]. Use specialized packages like doRNG (for foreach loops) or the native L'Ecuyer-CMRG RNG stream support (via parallel::nextRNGStream) to ensure independent, reproducible streams across parallel workers [8][7][9]. 4. Use set.seed Correctly: set.seed() is the standard, recommended approach for initializing the RNG [1][2]. While you can manually assign to .Random.seed, it is an internal integer vector that should generally not be manipulated directly by users unless necessary for complex state preservation [1][10]. Always check if .Random.seed exists before attempting to save it, as it is only created in the global environment after the first random number generation call [5][11].

Citations:


Restore the RNG state after setting the seed to avoid mutating the caller's global random stream.

Calling set.seed(seed) directly modifies .Random.seed in the global environment, affecting all subsequent random operations in the user's session. This violates CRAN package policies and disrupts reproducibility of user code. Save and restore the RNG state using on.exit():

Recommended fix
+    had_seed <- exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE)
+    old_seed <- if (had_seed) {
+        get(".Random.seed", envir = .GlobalEnv, inherits = FALSE)
+    } else {
+        NULL
+    }
+    on.exit({
+        if (had_seed) {
+            assign(".Random.seed", old_seed, envir = .GlobalEnv)
+        } else if (exists(".Random.seed", envir = .GlobalEnv,
+                          inherits = FALSE)) {
+            rm(".Random.seed", envir = .GlobalEnv)
+        }
+    }, add = TRUE)
     set.seed(seed)
     W <- matrix(runif(n * k), n, k)
     H_text <- matrix(runif(k * p), k, p)
     H_edges <- matrix(runif(k * m), k, m)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@R/utils_decomposeSubnetworkByTopic.R` around lines 154 - 157, The call to
set.seed(seed) directly modifies the global .Random.seed, affecting all
subsequent random operations in the user's session. Save the current random
state before calling set.seed(seed) by capturing the existing .Random.seed
value, then use on.exit() to restore it when the function completes. This
ensures that the random number generator state is restored to its original
condition after the function executes, preventing side effects on the caller's
random stream.

@codecov-commenter

codecov-commenter commented Jun 15, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 0% with 388 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.56%. Comparing base (e274ff2) to head (aefb316).

Files with missing lines Patch % Lines
R/utils_decomposeSubnetworkByTopic.R 0.00% 225 Missing ⚠️
R/bootstrapTopicModels.R 0.00% 56 Missing ⚠️
R/compareTopicModels.R 0.00% 54 Missing ⚠️
R/decomposeSubnetworkByTopic.R 0.00% 53 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##            devel     #106       +/-   ##
===========================================
- Coverage   77.17%   57.56%   -19.61%     
===========================================
  Files           9       13        +4     
  Lines        1139     1527      +388     
===========================================
  Hits          879      879               
- Misses        260      648      +388     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@R/bootstrapTopicModels.R`:
- Around line 67-82: The `n_top_terms` parameter lacks validation at function
entry and bounds checking during runtime, which can cause NA propagation. At
R/bootstrapTopicModels.R#L67-L82 (anchor site), add input validation for
`n_top_terms` after the `n_boot` validation block to ensure it is a single
positive integer, following the same pattern as the existing `n_boot` check. At
R/bootstrapTopicModels.R#L121-L128 (sibling site), before using `n_top_terms` to
index into a row vector, clamp it to `min(n_top_terms, length(row))` to prevent
out-of-bounds access and NA values.

In `@R/utils_decomposeSubnetworkByTopic.R`:
- Around line 396-407: The as.integer() conversion in the .topicPartition
function removes the names attribute from the labels vector, which are carefully
assigned just before the conversion. To fix this, preserve the names when
converting to integer by assigning the names back to the result of
as.integer(labels), or use an alternative approach that maintains both the
integer values and the names attribute from the original labels vector.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8f92d820-f9b0-4d6a-9513-46cea53dd84b

📥 Commits

Reviewing files that changed from the base of the PR and between 6d60e05 and d3a0d6a.

📒 Files selected for processing (9)
  • DESCRIPTION
  • NAMESPACE
  • R/bootstrapTopicModels.R
  • R/compareTopicModels.R
  • R/decomposeSubnetworkByTopic.R
  • R/utils_decomposeSubnetworkByTopic.R
  • man/bootstrapTopicModels.Rd
  • man/compareTopicModels.Rd
  • man/decomposeSubnetworkByTopic.Rd
✅ Files skipped from review due to trivial changes (4)
  • man/bootstrapTopicModels.Rd
  • man/compareTopicModels.Rd
  • man/decomposeSubnetworkByTopic.Rd
  • DESCRIPTION
🚧 Files skipped from review as they are similar to previous changes (1)
  • NAMESPACE

Comment thread R/bootstrapTopicModels.R
Comment on lines +67 to +82
bootstrapTopicModels <- function(subnetwork,
n_boot = 50,
n_topics = 5,
include_ppi = TRUE,
n_top_terms = 10,
min_term_count = 2,
max_iter = 200,
tol = 1e-4,
seed = 1) {

.validateDecomposeSubnetworkByTopicInput(subnetwork, n_topics, 0.2,
include_ppi)
if (!is.numeric(n_boot) || length(n_boot) != 1L || is.na(n_boot) ||
n_boot < 2 || n_boot != as.integer(n_boot)) {
stop("`n_boot` must be a single integer >= 2.")
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

n_top_terms lacks input validation and runtime bounds checking. The parameter is used without validation at function entry and without clamping against vocabulary size inside the bootstrap loop, which could cause NA propagation if the value exceeds the vocabulary.

  • R/bootstrapTopicModels.R#L67-L82: Add validation that n_top_terms is a single positive integer, similar to the n_boot check.
  • R/bootstrapTopicModels.R#L121-L128: Clamp n_top_terms to min(n_top_terms, length(row)) before indexing to prevent out-of-bounds NAs.
🛡️ Proposed fixes

Add validation after line 83:

     n_boot <- as.integer(n_boot)
+    if (!is.numeric(n_top_terms) || length(n_top_terms) != 1L ||
+        is.na(n_top_terms) || n_top_terms < 1 ||
+        n_top_terms != as.integer(n_top_terms)) {
+        stop("`n_top_terms` must be a single positive integer.")
+    }
+    n_top_terms <- as.integer(n_top_terms)

Clamp at line 125:

-            top <- order(row, decreasing = TRUE)[seq_len(n_top_terms)]
+            n_sel <- min(n_top_terms, length(row))
+            top <- order(row, decreasing = TRUE)[seq_len(n_sel)]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
bootstrapTopicModels <- function(subnetwork,
n_boot = 50,
n_topics = 5,
include_ppi = TRUE,
n_top_terms = 10,
min_term_count = 2,
max_iter = 200,
tol = 1e-4,
seed = 1) {
.validateDecomposeSubnetworkByTopicInput(subnetwork, n_topics, 0.2,
include_ppi)
if (!is.numeric(n_boot) || length(n_boot) != 1L || is.na(n_boot) ||
n_boot < 2 || n_boot != as.integer(n_boot)) {
stop("`n_boot` must be a single integer >= 2.")
}
bootstrapTopicModels <- function(subnetwork,
n_boot = 50,
n_topics = 5,
include_ppi = TRUE,
n_top_terms = 10,
min_term_count = 2,
max_iter = 200,
tol = 1e-4,
seed = 1) {
.validateDecomposeSubnetworkByTopicInput(subnetwork, n_topics, 0.2,
include_ppi)
if (!is.numeric(n_boot) || length(n_boot) != 1L || is.na(n_boot) ||
n_boot < 2 || n_boot != as.integer(n_boot)) {
stop("`n_boot` must be a single integer >= 2.")
}
n_boot <- as.integer(n_boot)
if (!is.numeric(n_top_terms) || length(n_top_terms) != 1L ||
is.na(n_top_terms) || n_top_terms < 1 ||
n_top_terms != as.integer(n_top_terms)) {
stop("`n_top_terms` must be a single positive integer.")
}
n_top_terms <- as.integer(n_top_terms)
Suggested change
bootstrapTopicModels <- function(subnetwork,
n_boot = 50,
n_topics = 5,
include_ppi = TRUE,
n_top_terms = 10,
min_term_count = 2,
max_iter = 200,
tol = 1e-4,
seed = 1) {
.validateDecomposeSubnetworkByTopicInput(subnetwork, n_topics, 0.2,
include_ppi)
if (!is.numeric(n_boot) || length(n_boot) != 1L || is.na(n_boot) ||
n_boot < 2 || n_boot != as.integer(n_boot)) {
stop("`n_boot` must be a single integer >= 2.")
}
for (r in seq_len(k)) {
row <- fit$H_text[map[r], ]
s <- sum(row)
weightsum[r, ] <- weightsum[r, ] + (if (s > 0) row / s else row)
n_sel <- min(n_top_terms, length(row))
top <- order(row, decreasing = TRUE)[seq_len(n_sel)]
top <- top[row[top] > 0]
topcount[r, top] <- topcount[r, top] + 1
}
📍 Affects 1 file
  • R/bootstrapTopicModels.R#L67-L82 (this comment)
  • R/bootstrapTopicModels.R#L121-L128
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@R/bootstrapTopicModels.R` around lines 67 - 82, The `n_top_terms` parameter
lacks validation at function entry and bounds checking during runtime, which can
cause NA propagation. At R/bootstrapTopicModels.R#L67-L82 (anchor site), add
input validation for `n_top_terms` after the `n_boot` validation block to ensure
it is a single positive integer, following the same pattern as the existing
`n_boot` check. At R/bootstrapTopicModels.R#L121-L128 (sibling site), before
using `n_top_terms` to index into a row vector, clamp it to `min(n_top_terms,
length(row))` to prevent out-of-bounds access and NA values.

Comment on lines +396 to +407
.topicPartition <- function(model, unit = c("edges", "papers")) {
unit <- match.arg(unit)
if (unit == "papers") {
labels <- apply(model$W, 1, which.max)
names(labels) <- rownames(model$W)
} else {
shares <- .edgeTopicShares(model$H_edges)
labels <- apply(shares, 2, which.max)
names(labels) <- colnames(model$H_edges)
}
as.integer(labels)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

as.integer() strips names from the partition vector.

Lines 399-404 carefully assign names to labels, but as.integer(labels) on line 406 returns a new vector without names. This breaks .consensusMatrix (line 452-453) which expects names(partitions[[1]]) for row/column names.

Proposed fix
-.topicPartition <- function(model, unit = c("edges", "papers")) {
-    unit <- match.arg(unit)
-    if (unit == "papers") {
-        labels <- apply(model$W, 1, which.max)
-        names(labels) <- rownames(model$W)
-    } else {
-        shares <- .edgeTopicShares(model$H_edges)
-        labels <- apply(shares, 2, which.max)
-        names(labels) <- colnames(model$H_edges)
-    }
-    as.integer(labels)
-}
+.topicPartition <- function(model, unit = c("edges", "papers")) {
+    unit <- match.arg(unit)
+    if (unit == "papers") {
+        labels <- apply(model$W, 1, which.max)
+        names(labels) <- rownames(model$W)
+    } else {
+        shares <- .edgeTopicShares(model$H_edges)
+        labels <- apply(shares, 2, which.max)
+        names(labels) <- colnames(model$H_edges)
+    }
+    nms <- names(labels)
+    labels <- as.integer(labels)
+    names(labels) <- nms
+    labels
+}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@R/utils_decomposeSubnetworkByTopic.R` around lines 396 - 407, The
as.integer() conversion in the .topicPartition function removes the names
attribute from the labels vector, which are carefully assigned just before the
conversion. To fix this, preserve the names when converting to integer by
assigning the names back to the result of as.integer(labels), or use an
alternative approach that maintains both the integer values and the names
attribute from the original labels vector.

@tonywu1999 tonywu1999 force-pushed the unsupervised-approach branch from d3a0d6a to ff94b8b Compare June 16, 2026 14:21

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@man/filterSubnetworkByContext.Rd`:
- Around line 71-74: In the \note section of the filterSubnetworkByContext.Rd
documentation file, replace the markdown-style bold syntax **Beta feature:**
with the native .Rd syntax \strong{Beta feature:} since the roxygen2
configuration does not have markdown support enabled. This will ensure the text
displays as bold in the generated help page instead of showing literal
asterisks.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: abc6e1d4-32f2-4e57-9f1e-811a8038c877

📥 Commits

Reviewing files that changed from the base of the PR and between d3a0d6a and a2caf70.

📒 Files selected for processing (11)
  • DESCRIPTION
  • NAMESPACE
  • R/bootstrapTopicModels.R
  • R/compareTopicModels.R
  • R/decomposeSubnetworkByTopic.R
  • R/filterSubnetworkByContext.R
  • R/utils_decomposeSubnetworkByTopic.R
  • man/bootstrapTopicModels.Rd
  • man/compareTopicModels.Rd
  • man/decomposeSubnetworkByTopic.Rd
  • man/filterSubnetworkByContext.Rd
✅ Files skipped from review due to trivial changes (5)
  • R/filterSubnetworkByContext.R
  • man/compareTopicModels.Rd
  • man/decomposeSubnetworkByTopic.Rd
  • man/bootstrapTopicModels.Rd
  • DESCRIPTION
🚧 Files skipped from review as they are similar to previous changes (4)
  • R/bootstrapTopicModels.R
  • R/decomposeSubnetworkByTopic.R
  • R/compareTopicModels.R
  • R/utils_decomposeSubnetworkByTopic.R

Comment thread man/filterSubnetworkByContext.Rd
@tonywu1999 tonywu1999 merged commit e25917e into devel Jun 16, 2026
4 checks passed
@tonywu1999 tonywu1999 deleted the unsupervised-approach branch June 16, 2026 14:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants