add unsupervised approach NMF for clustering#106
Conversation
|
Warning Review limit reached
More reviews will be available in 51 minutes and 12 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (8)
📝 WalkthroughWalkthroughA new topic decomposition ecosystem is added to the package: the core ChangesNMF-based topic decomposition with bootstrap stability and mode comparison
Sequence DiagramsequenceDiagram
participant User
participant decomposeSubnetwork as decomposeSubnetworkByTopic
participant buildMatrices as .buildTopicMatrices
participant nmf as .jointNMF / .textNMF
participant EdgeShare as .edgeTopicShares
participant Bootstrap as bootstrapTopicModels
participant Compare as compareTopicModels
User->>decomposeSubnetwork: subnetwork, n_topics, include_ppi=TRUE/FALSE, ...
decomposeSubnetwork->>buildMatrices: extract PMIDs, abstracts, evidence
buildMatrices-->>decomposeSubnetwork: X_text, X_edges, pmids, edge_keys
decomposeSubnetwork->>nmf: X_text, X_edges, k, include_ppi
nmf-->>decomposeSubnetwork: W, H_text, H_edges
decomposeSubnetwork->>EdgeShare: H_edges
EdgeShare-->>decomposeSubnetwork: per-edge topic shares
decomposeSubnetwork-->>User: topic_1, topic_2, ..., attr(nmf=...)
User->>Bootstrap: same subnetwork, n_boot, include_ppi
Bootstrap->>buildMatrices: same matrices, fixed vocabulary
Bootstrap->>nmf: fit reference + n_boot resamples
Bootstrap-->>User: topTerms (freq, mean weight), reference
User->>Compare: same subnetwork, seeds, unit, include_ppi
Compare->>buildMatrices: same matrices
Compare->>nmf: fit joint and text-only across seeds
Compare-->>User: within/between ARI, Wilcoxon test, consensus
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested labels
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@R/decomposeSubnetworkByTopic.R`:
- Around line 86-93: The pmids variable extracted from evidence$pmid may be
numeric or factor types, which will cause incorrect positional indexing when
used as list keys in abstract_list[[p]] instead of name-based indexing. After
creating pmids with unique(evidence$pmid), coerce the values to character type
using as.character() and trim any whitespace using trimws() to ensure they are
properly formatted strings before being used for subsetting the abstract list.
- Around line 131-135: The topicWeight column is only added to topic_edges when
it contains rows, but the function documentation promises that the returned
edges data frame always has a topicWeight column added. To ensure consistent
schema, you need to add the topicWeight column to topic_edges unconditionally,
not just when nrow(topic_edges) > 0. Refactor the code by either moving the
topicWeight assignment outside the conditional block and handling the empty case
separately (assigning an empty numeric vector or NA values), or by restructuring
the conditional to initialize topicWeight for all cases and only populate values
when rows exist.
In `@R/utils_decomposeSubnetworkByTopic.R`:
- Around line 46-48: The .edgeKey() function currently only uses source, target,
and interaction to create the edge identifier, but it needs to also include site
and stmt_hash parameters to ensure rows with identical source/target/interaction
but different site or stmt_hash values are treated as distinct edges. Modify the
.edgeKey() function signature to accept site and stmt_hash as additional
parameters, and include them in the paste() concatenation with the same
separator so that the resulting key uniquely identifies each combination of
source, target, interaction, site, and stmt_hash.
- Around line 7-36: Add validation for the five public tuning parameters that
currently bypass checks in the `.validateDecomposeSubnetworkByTopicInput`
function. Extend the function signature to accept `n_top_terms`,
`min_term_count`, `max_iter`, `tol`, and `seed` as parameters, then add
validation logic for each: validate that `n_top_terms` is a single positive
integer, `min_term_count` is a single positive integer, `max_iter` is a single
positive integer, `tol` is a single positive numeric value, and `seed` is either
NULL or a single integer. Then locate the call to
`.validateDecomposeSubnetworkByTopicInput` in `R/decomposeSubnetworkByTopic.R`
and update it to pass these five additional arguments from the parent function.
- Around line 154-157: The call to set.seed(seed) directly modifies the global
.Random.seed, affecting all subsequent random operations in the user's session.
Save the current random state before calling set.seed(seed) by capturing the
existing .Random.seed value, then use on.exit() to restore it when the function
completes. This ensures that the random number generator state is restored to
its original condition after the function executes, preventing side effects on
the caller's random stream.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f80e349a-4dce-41ae-bf36-4ee50357767f
📒 Files selected for processing (3)
NAMESPACER/decomposeSubnetworkByTopic.RR/utils_decomposeSubnetworkByTopic.R
| topic_edges <- edges[in_topic, , drop = FALSE] | ||
| if (nrow(topic_edges) > 0) { | ||
| topic_edges$topicWeight <- | ||
| shares[t, match(edges_key_vec[in_topic], edge_keys)] | ||
| } |
There was a problem hiding this comment.
Always add topicWeight, even for empty topic edge frames.
The return docs promise an edges data frame “with an added topicWeight column”, but empty topics skip that column, creating an inconsistent downstream schema.
Proposed fix
in_topic <- edges_key_vec %in% topic_keys
topic_edges <- edges[in_topic, , drop = FALSE]
- if (nrow(topic_edges) > 0) {
- topic_edges$topicWeight <-
- shares[t, match(edges_key_vec[in_topic], edge_keys)]
- }
+ topic_edges$topicWeight <-
+ shares[t, match(edges_key_vec[in_topic], edge_keys)]📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| topic_edges <- edges[in_topic, , drop = FALSE] | |
| if (nrow(topic_edges) > 0) { | |
| topic_edges$topicWeight <- | |
| shares[t, match(edges_key_vec[in_topic], edge_keys)] | |
| } | |
| topic_edges <- edges[in_topic, , drop = FALSE] | |
| topic_edges$topicWeight <- | |
| shares[t, match(edges_key_vec[in_topic], edge_keys)] |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@R/decomposeSubnetworkByTopic.R` around lines 131 - 135, The topicWeight
column is only added to topic_edges when it contains rows, but the function
documentation promises that the returned edges data frame always has a
topicWeight column added. To ensure consistent schema, you need to add the
topicWeight column to topic_edges unconditionally, not just when
nrow(topic_edges) > 0. Refactor the code by either moving the topicWeight
assignment outside the conditional block and handling the empty case separately
(assigning an empty numeric vector or NA values), or by restructuring the
conditional to initialize topicWeight for all cases and only populate values
when rows exist.
| .validateDecomposeSubnetworkByTopicInput <- function(subnetwork, | ||
| n_topics, | ||
| edge_topic_cutoff) { | ||
| if (!is.list(subnetwork) || | ||
| !all(c("nodes", "edges") %in% names(subnetwork))) { | ||
| stop("`subnetwork` must be a list containing `nodes` and `edges`, ", | ||
| "e.g. the output of getSubnetworkFromIndra().") | ||
| } | ||
| if (!is.data.frame(subnetwork$nodes) || !"id" %in% names(subnetwork$nodes)) { | ||
| stop("`subnetwork$nodes` must be a data.frame with an `id` column.") | ||
| } | ||
| required_edge_cols <- c("source", "target", "interaction", | ||
| "site", "evidenceLink", "stmt_hash") | ||
| missing_cols <- setdiff(required_edge_cols, names(subnetwork$edges)) | ||
| if (!is.data.frame(subnetwork$edges) || length(missing_cols) > 0) { | ||
| stop(sprintf( | ||
| "`subnetwork$edges` must be a data.frame with columns: %s", | ||
| paste(required_edge_cols, collapse = ", ") | ||
| )) | ||
| } | ||
| if (!is.numeric(n_topics) || length(n_topics) != 1L || is.na(n_topics) || | ||
| n_topics < 1 || n_topics != as.integer(n_topics)) { | ||
| stop("`n_topics` must be a single positive integer.") | ||
| } | ||
| if (!is.numeric(edge_topic_cutoff) || length(edge_topic_cutoff) != 1L || | ||
| is.na(edge_topic_cutoff) || edge_topic_cutoff < 0 || | ||
| edge_topic_cutoff > 1) { | ||
| stop("`edge_topic_cutoff` must be a single numeric value in [0, 1].") | ||
| } | ||
| } |
There was a problem hiding this comment.
Validate all public tuning parameters here.
n_top_terms, min_term_count, max_iter, tol, and seed are public arguments but bypass validation. Invalid values currently fail later in places like prune_vocabulary(), seq_len(max_iter), the tolerance if, or set.seed().
Proposed direction
.validateDecomposeSubnetworkByTopicInput <- function(subnetwork,
n_topics,
- edge_topic_cutoff) {
+ edge_topic_cutoff,
+ n_top_terms,
+ min_term_count,
+ max_iter,
+ tol,
+ seed) {
@@
if (!is.numeric(edge_topic_cutoff) || length(edge_topic_cutoff) != 1L ||
is.na(edge_topic_cutoff) || edge_topic_cutoff < 0 ||
edge_topic_cutoff > 1) {
stop("`edge_topic_cutoff` must be a single numeric value in [0, 1].")
}
+ if (!is.numeric(n_top_terms) || length(n_top_terms) != 1L ||
+ is.na(n_top_terms) || n_top_terms < 0 ||
+ n_top_terms != as.integer(n_top_terms)) {
+ stop("`n_top_terms` must be a single non-negative integer.")
+ }
+ if (!is.numeric(min_term_count) || length(min_term_count) != 1L ||
+ is.na(min_term_count) || min_term_count < 1 ||
+ min_term_count != as.integer(min_term_count)) {
+ stop("`min_term_count` must be a single positive integer.")
+ }
+ if (!is.numeric(max_iter) || length(max_iter) != 1L ||
+ is.na(max_iter) || max_iter < 1 ||
+ max_iter != as.integer(max_iter)) {
+ stop("`max_iter` must be a single positive integer.")
+ }
+ if (!is.numeric(tol) || length(tol) != 1L ||
+ is.na(tol) || !is.finite(tol) || tol < 0) {
+ stop("`tol` must be a single non-negative finite numeric value.")
+ }
+ if (!is.numeric(seed) || length(seed) != 1L ||
+ is.na(seed) || seed != as.integer(seed)) {
+ stop("`seed` must be a single integer.")
+ }
}Also update the call site in R/decomposeSubnetworkByTopic.R to pass the additional arguments.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@R/utils_decomposeSubnetworkByTopic.R` around lines 7 - 36, Add validation for
the five public tuning parameters that currently bypass checks in the
`.validateDecomposeSubnetworkByTopicInput` function. Extend the function
signature to accept `n_top_terms`, `min_term_count`, `max_iter`, `tol`, and
`seed` as parameters, then add validation logic for each: validate that
`n_top_terms` is a single positive integer, `min_term_count` is a single
positive integer, `max_iter` is a single positive integer, `tol` is a single
positive numeric value, and `seed` is either NULL or a single integer. Then
locate the call to `.validateDecomposeSubnetworkByTopicInput` in
`R/decomposeSubnetworkByTopic.R` and update it to pass these five additional
arguments from the parent function.
| .edgeKey <- function(source, target, interaction) { | ||
| paste(source, target, interaction, sep = "||") | ||
| } |
There was a problem hiding this comment.
Include the full edge identity in the topic key.
The validator requires site and stmt_hash, but .edgeKey() drops both. Rows that share source/target/interaction but differ by site or statement hash collapse into one NMF edge column and receive the same topic assignment.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@R/utils_decomposeSubnetworkByTopic.R` around lines 46 - 48, The .edgeKey()
function currently only uses source, target, and interaction to create the edge
identifier, but it needs to also include site and stmt_hash parameters to ensure
rows with identical source/target/interaction but different site or stmt_hash
values are treated as distinct edges. Modify the .edgeKey() function signature
to accept site and stmt_hash as additional parameters, and include them in the
paste() concatenation with the same separator so that the resulting key uniquely
identifies each combination of source, target, interaction, site, and stmt_hash.
| set.seed(seed) | ||
| W <- matrix(runif(n * k), n, k) | ||
| H_text <- matrix(runif(k * p), k, p) | ||
| H_edges <- matrix(runif(k * m), k, m) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check if the file exists and examine lines around 154-157
if [ -f "R/utils_decomposeSubnetworkByTopic.R" ]; then
echo "=== File exists, examining lines 140-170 ==="
sed -n '140,170p' "R/utils_decomposeSubnetworkByTopic.R" | cat -n
else
echo "File not found, listing R directory:"
ls -la R/ | head -20
fiRepository: Vitek-Lab/MSstatsBioNet
Length of output: 1363
🏁 Script executed:
#!/bin/bash
# Search for set.seed calls in the repository to understand the pattern
echo "=== Searching for set.seed patterns ==="
rg "set\.seed" R/ -A 2 -B 2Repository: Vitek-Lab/MSstatsBioNet
Length of output: 412
🏁 Script executed:
#!/bin/bash
# Verify the exact lines and check if the fix has been applied
echo "=== Checking lines 154-160 precisely ==="
sed -n '154,160p' "R/utils_decomposeSubnetworkByTopic.R" | cat -nRepository: Vitek-Lab/MSstatsBioNet
Length of output: 344
🌐 Web query:
R best practice set.seed global RNG state mutation restore
💡 Result:
In R, the global random number generator (RNG) state is stored in the .Random.seed variable in the global environment (.GlobalEnv) [1][2]. Modifying this state globally within a function can disrupt external simulations or unexpected processes, and CRAN policies strictly discourage packages from modifying the global environment [3][4]. Best Practices for RNG State Management 1. Avoid Global Mutation in Packages: Do not modify .Random.seed in a package function unless explicitly intended [3][4]. If a function must use a fixed seed for internal reproducibility, restore the original state before returning to ensure the user's global RNG state remains unchanged [5][6]. 2. Robust Restoration Pattern: Use on.exit() to ensure the RNG state is restored even if the function encounters an error or is interrupted [6]. This prevents side effects on the user's workspace [6]. Example of the robust pattern: # Robustly setting/restoring RNG state within a function do_random_stuff <- function { genv <- globalenv # Save original state if it exists old_seed <- if (exists(".Random.seed", envir = genv, inherits = FALSE)) { get(".Random.seed", envir = genv, inherits = FALSE) } else { NULL } # Ensure restoration on exit on.exit(suspendInterrupts({ if (is.null(old_seed)) { rm(".Random.seed", envir = genv, inherits = FALSE) } else { assign(".Random.seed", value = old_seed, envir = genv, inherits = FALSE) } })) # Set desired seed for local work set.seed(42) return(runif(3)) } 3. Parallel Processing: For parallel computations, do not rely on manual seed restoration [7]. Use specialized packages like doRNG (for foreach loops) or the native L'Ecuyer-CMRG RNG stream support (via parallel::nextRNGStream) to ensure independent, reproducible streams across parallel workers [8][7][9]. 4. Use set.seed Correctly: set.seed() is the standard, recommended approach for initializing the RNG [1][2]. While you can manually assign to .Random.seed, it is an internal integer vector that should generally not be manipulated directly by users unless necessary for complex state preservation [1][10]. Always check if .Random.seed exists before attempting to save it, as it is only created in the global environment after the first random number generation call [5][11].
Citations:
- 1: https://search.r-project.org/R/refmans/base/html/Random.html
- 2: https://stat.ethz.ch/R-manual/R-patched/RHOME/library/base/html/Random.html
- 3: https://stat.ethz.ch/pipermail/r-devel/2019-October/078615.html
- 4: https://stat.ethz.ch/pipermail/r-package-devel/2022q3/008439.html
- 5: http://cookbook-r.com/Numbers/Saving_the_state_of_the_random_number_generator/
- 6: https://stat.ethz.ch/pipermail/r-package-devel/2022q3/008447.html
- 7: https://researchdatapod.com/understanding-set-seed-r-comprehensive-guide/
- 8: https://cran.r-project.org/doc/FAQ/r-patched/packages/parallel/refman/parallel.html
- 9: Discussion: About reproducible parallel RNG streams futureverse/doFuture#41
- 10: https://hypatia.math.ethz.ch/pipermail/r-help/2012-January/299887.html
- 11: https://stackoverflow.com/questions/13997444/print-the-current-random-seed-so-that-i-can-enter-it-with-set-seed-later
Restore the RNG state after setting the seed to avoid mutating the caller's global random stream.
Calling set.seed(seed) directly modifies .Random.seed in the global environment, affecting all subsequent random operations in the user's session. This violates CRAN package policies and disrupts reproducibility of user code. Save and restore the RNG state using on.exit():
Recommended fix
+ had_seed <- exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE)
+ old_seed <- if (had_seed) {
+ get(".Random.seed", envir = .GlobalEnv, inherits = FALSE)
+ } else {
+ NULL
+ }
+ on.exit({
+ if (had_seed) {
+ assign(".Random.seed", old_seed, envir = .GlobalEnv)
+ } else if (exists(".Random.seed", envir = .GlobalEnv,
+ inherits = FALSE)) {
+ rm(".Random.seed", envir = .GlobalEnv)
+ }
+ }, add = TRUE)
set.seed(seed)
W <- matrix(runif(n * k), n, k)
H_text <- matrix(runif(k * p), k, p)
H_edges <- matrix(runif(k * m), k, m)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@R/utils_decomposeSubnetworkByTopic.R` around lines 154 - 157, The call to
set.seed(seed) directly modifies the global .Random.seed, affecting all
subsequent random operations in the user's session. Save the current random
state before calling set.seed(seed) by capturing the existing .Random.seed
value, then use on.exit() to restore it when the function completes. This
ensures that the random number generator state is restored to its original
condition after the function executes, preventing side effects on the caller's
random stream.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## devel #106 +/- ##
===========================================
- Coverage 77.17% 57.56% -19.61%
===========================================
Files 9 13 +4
Lines 1139 1527 +388
===========================================
Hits 879 879
- Misses 260 648 +388 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@R/bootstrapTopicModels.R`:
- Around line 67-82: The `n_top_terms` parameter lacks validation at function
entry and bounds checking during runtime, which can cause NA propagation. At
R/bootstrapTopicModels.R#L67-L82 (anchor site), add input validation for
`n_top_terms` after the `n_boot` validation block to ensure it is a single
positive integer, following the same pattern as the existing `n_boot` check. At
R/bootstrapTopicModels.R#L121-L128 (sibling site), before using `n_top_terms` to
index into a row vector, clamp it to `min(n_top_terms, length(row))` to prevent
out-of-bounds access and NA values.
In `@R/utils_decomposeSubnetworkByTopic.R`:
- Around line 396-407: The as.integer() conversion in the .topicPartition
function removes the names attribute from the labels vector, which are carefully
assigned just before the conversion. To fix this, preserve the names when
converting to integer by assigning the names back to the result of
as.integer(labels), or use an alternative approach that maintains both the
integer values and the names attribute from the original labels vector.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 8f92d820-f9b0-4d6a-9513-46cea53dd84b
📒 Files selected for processing (9)
DESCRIPTIONNAMESPACER/bootstrapTopicModels.RR/compareTopicModels.RR/decomposeSubnetworkByTopic.RR/utils_decomposeSubnetworkByTopic.Rman/bootstrapTopicModels.Rdman/compareTopicModels.Rdman/decomposeSubnetworkByTopic.Rd
✅ Files skipped from review due to trivial changes (4)
- man/bootstrapTopicModels.Rd
- man/compareTopicModels.Rd
- man/decomposeSubnetworkByTopic.Rd
- DESCRIPTION
🚧 Files skipped from review as they are similar to previous changes (1)
- NAMESPACE
| bootstrapTopicModels <- function(subnetwork, | ||
| n_boot = 50, | ||
| n_topics = 5, | ||
| include_ppi = TRUE, | ||
| n_top_terms = 10, | ||
| min_term_count = 2, | ||
| max_iter = 200, | ||
| tol = 1e-4, | ||
| seed = 1) { | ||
|
|
||
| .validateDecomposeSubnetworkByTopicInput(subnetwork, n_topics, 0.2, | ||
| include_ppi) | ||
| if (!is.numeric(n_boot) || length(n_boot) != 1L || is.na(n_boot) || | ||
| n_boot < 2 || n_boot != as.integer(n_boot)) { | ||
| stop("`n_boot` must be a single integer >= 2.") | ||
| } |
There was a problem hiding this comment.
n_top_terms lacks input validation and runtime bounds checking. The parameter is used without validation at function entry and without clamping against vocabulary size inside the bootstrap loop, which could cause NA propagation if the value exceeds the vocabulary.
R/bootstrapTopicModels.R#L67-L82: Add validation thatn_top_termsis a single positive integer, similar to then_bootcheck.R/bootstrapTopicModels.R#L121-L128: Clampn_top_termstomin(n_top_terms, length(row))before indexing to prevent out-of-bounds NAs.
🛡️ Proposed fixes
Add validation after line 83:
n_boot <- as.integer(n_boot)
+ if (!is.numeric(n_top_terms) || length(n_top_terms) != 1L ||
+ is.na(n_top_terms) || n_top_terms < 1 ||
+ n_top_terms != as.integer(n_top_terms)) {
+ stop("`n_top_terms` must be a single positive integer.")
+ }
+ n_top_terms <- as.integer(n_top_terms)Clamp at line 125:
- top <- order(row, decreasing = TRUE)[seq_len(n_top_terms)]
+ n_sel <- min(n_top_terms, length(row))
+ top <- order(row, decreasing = TRUE)[seq_len(n_sel)]📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| bootstrapTopicModels <- function(subnetwork, | |
| n_boot = 50, | |
| n_topics = 5, | |
| include_ppi = TRUE, | |
| n_top_terms = 10, | |
| min_term_count = 2, | |
| max_iter = 200, | |
| tol = 1e-4, | |
| seed = 1) { | |
| .validateDecomposeSubnetworkByTopicInput(subnetwork, n_topics, 0.2, | |
| include_ppi) | |
| if (!is.numeric(n_boot) || length(n_boot) != 1L || is.na(n_boot) || | |
| n_boot < 2 || n_boot != as.integer(n_boot)) { | |
| stop("`n_boot` must be a single integer >= 2.") | |
| } | |
| bootstrapTopicModels <- function(subnetwork, | |
| n_boot = 50, | |
| n_topics = 5, | |
| include_ppi = TRUE, | |
| n_top_terms = 10, | |
| min_term_count = 2, | |
| max_iter = 200, | |
| tol = 1e-4, | |
| seed = 1) { | |
| .validateDecomposeSubnetworkByTopicInput(subnetwork, n_topics, 0.2, | |
| include_ppi) | |
| if (!is.numeric(n_boot) || length(n_boot) != 1L || is.na(n_boot) || | |
| n_boot < 2 || n_boot != as.integer(n_boot)) { | |
| stop("`n_boot` must be a single integer >= 2.") | |
| } | |
| n_boot <- as.integer(n_boot) | |
| if (!is.numeric(n_top_terms) || length(n_top_terms) != 1L || | |
| is.na(n_top_terms) || n_top_terms < 1 || | |
| n_top_terms != as.integer(n_top_terms)) { | |
| stop("`n_top_terms` must be a single positive integer.") | |
| } | |
| n_top_terms <- as.integer(n_top_terms) |
| bootstrapTopicModels <- function(subnetwork, | |
| n_boot = 50, | |
| n_topics = 5, | |
| include_ppi = TRUE, | |
| n_top_terms = 10, | |
| min_term_count = 2, | |
| max_iter = 200, | |
| tol = 1e-4, | |
| seed = 1) { | |
| .validateDecomposeSubnetworkByTopicInput(subnetwork, n_topics, 0.2, | |
| include_ppi) | |
| if (!is.numeric(n_boot) || length(n_boot) != 1L || is.na(n_boot) || | |
| n_boot < 2 || n_boot != as.integer(n_boot)) { | |
| stop("`n_boot` must be a single integer >= 2.") | |
| } | |
| for (r in seq_len(k)) { | |
| row <- fit$H_text[map[r], ] | |
| s <- sum(row) | |
| weightsum[r, ] <- weightsum[r, ] + (if (s > 0) row / s else row) | |
| n_sel <- min(n_top_terms, length(row)) | |
| top <- order(row, decreasing = TRUE)[seq_len(n_sel)] | |
| top <- top[row[top] > 0] | |
| topcount[r, top] <- topcount[r, top] + 1 | |
| } |
📍 Affects 1 file
R/bootstrapTopicModels.R#L67-L82(this comment)R/bootstrapTopicModels.R#L121-L128
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@R/bootstrapTopicModels.R` around lines 67 - 82, The `n_top_terms` parameter
lacks validation at function entry and bounds checking during runtime, which can
cause NA propagation. At R/bootstrapTopicModels.R#L67-L82 (anchor site), add
input validation for `n_top_terms` after the `n_boot` validation block to ensure
it is a single positive integer, following the same pattern as the existing
`n_boot` check. At R/bootstrapTopicModels.R#L121-L128 (sibling site), before
using `n_top_terms` to index into a row vector, clamp it to `min(n_top_terms,
length(row))` to prevent out-of-bounds access and NA values.
| .topicPartition <- function(model, unit = c("edges", "papers")) { | ||
| unit <- match.arg(unit) | ||
| if (unit == "papers") { | ||
| labels <- apply(model$W, 1, which.max) | ||
| names(labels) <- rownames(model$W) | ||
| } else { | ||
| shares <- .edgeTopicShares(model$H_edges) | ||
| labels <- apply(shares, 2, which.max) | ||
| names(labels) <- colnames(model$H_edges) | ||
| } | ||
| as.integer(labels) | ||
| } |
There was a problem hiding this comment.
as.integer() strips names from the partition vector.
Lines 399-404 carefully assign names to labels, but as.integer(labels) on line 406 returns a new vector without names. This breaks .consensusMatrix (line 452-453) which expects names(partitions[[1]]) for row/column names.
Proposed fix
-.topicPartition <- function(model, unit = c("edges", "papers")) {
- unit <- match.arg(unit)
- if (unit == "papers") {
- labels <- apply(model$W, 1, which.max)
- names(labels) <- rownames(model$W)
- } else {
- shares <- .edgeTopicShares(model$H_edges)
- labels <- apply(shares, 2, which.max)
- names(labels) <- colnames(model$H_edges)
- }
- as.integer(labels)
-}
+.topicPartition <- function(model, unit = c("edges", "papers")) {
+ unit <- match.arg(unit)
+ if (unit == "papers") {
+ labels <- apply(model$W, 1, which.max)
+ names(labels) <- rownames(model$W)
+ } else {
+ shares <- .edgeTopicShares(model$H_edges)
+ labels <- apply(shares, 2, which.max)
+ names(labels) <- colnames(model$H_edges)
+ }
+ nms <- names(labels)
+ labels <- as.integer(labels)
+ names(labels) <- nms
+ labels
+}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@R/utils_decomposeSubnetworkByTopic.R` around lines 396 - 407, The
as.integer() conversion in the .topicPartition function removes the names
attribute from the labels vector, which are carefully assigned just before the
conversion. To fix this, preserve the names when converting to integer by
assigning the names back to the result of as.integer(labels), or use an
alternative approach that maintains both the integer values and the names
attribute from the original labels vector.
d3a0d6a to
ff94b8b
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@man/filterSubnetworkByContext.Rd`:
- Around line 71-74: In the \note section of the filterSubnetworkByContext.Rd
documentation file, replace the markdown-style bold syntax **Beta feature:**
with the native .Rd syntax \strong{Beta feature:} since the roxygen2
configuration does not have markdown support enabled. This will ensure the text
displays as bold in the generated help page instead of showing literal
asterisks.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: abc6e1d4-32f2-4e57-9f1e-811a8038c877
📒 Files selected for processing (11)
DESCRIPTIONNAMESPACER/bootstrapTopicModels.RR/compareTopicModels.RR/decomposeSubnetworkByTopic.RR/filterSubnetworkByContext.RR/utils_decomposeSubnetworkByTopic.Rman/bootstrapTopicModels.Rdman/compareTopicModels.Rdman/decomposeSubnetworkByTopic.Rdman/filterSubnetworkByContext.Rd
✅ Files skipped from review due to trivial changes (5)
- R/filterSubnetworkByContext.R
- man/compareTopicModels.Rd
- man/decomposeSubnetworkByTopic.Rd
- man/bootstrapTopicModels.Rd
- DESCRIPTION
🚧 Files skipped from review as they are similar to previous changes (4)
- R/bootstrapTopicModels.R
- R/decomposeSubnetworkByTopic.R
- R/compareTopicModels.R
- R/utils_decomposeSubnetworkByTopic.R
Motivation and Context
This PR adds interpretable, unsupervised topic clustering to MSstatsBioNet by decomposing INDRA-derived biological subnetworks into “topic-specific” subnetworks using Non-Negative Matrix Factorization (NMF). It jointly leverages two aligned data views at the paper level—(1) PubMed abstract text (paper × word) and (2) evidence aggregated to edge-level signals (paper × edge)—so topics can be characterized by both biological terminology and evidence patterns. It also supports an option to exclude the PPI/edge evidence view.
Detailed Changes
Public API / package exports
decomposeSubnetworkByTopicbootstrapTopicModelscompareTopicModelsimportFrom(stats, runif)for internal stochastic initializationdecomposeSubnetworkByTopic()(R/decomposeSubnetworkByTopic.R)X_textfrom cleaned/validated PubMed abstract word countsX_edgesfrom evidence counts aggregated to(pmid, edge_key)include_ppi = TRUE): shared basis across text and edgesinclude_ppi = FALSE): fits onX_textand derives edge loadings from learned text topicstopic_1..topic_k) and attaches an"nmf"attribute containing factorization outputs (W,H_text,H_edges), metadata (e.g., terms, edge keys, PMIDs), and modeling settings.Topic modeling pipeline & utilities (
R/utils_decomposeSubnetworkByTopic.R).jointNMF(),.textNMF())Robustness / comparison features
bootstrapTopicModels()(R/bootstrapTopicModels.R)compareTopicModels()(R/compareTopicModels.R)Documentation
decomposeSubnetworkByTopicbootstrapTopicModelscompareTopicModelsUnit Tests
decomposeSubnetworkByTopic(),bootstrapTopicModels(), orcompareTopicModels().tests/testthatshows no direct references to these new APIs in the existing test files.Coding Guidelines Violations