Removed deep-copy data.table ops from the dataProcess pipeline #208
tonywu1999 wants to merge 7 commits into
* Replaced `input[, cols, with = FALSE]` deep-copy in MSstatsPrepareForDataProcess and MSstatsSummarizationOutput with drop-cols loops via data.table::set(j = ..., value = NULL).
* Replaced row-shuffle `input = input[order(...), ]` in .prepareForDataProcess with data.table::setorder() (in place).
* Replaced merge(all.x = TRUE) joins in MSstatsMergeFractions and .finalizeTMP with keyed-which lookups + data.table::set() writes; this avoids deep-copying the whole table.
* Replaced the synthesised `tmp` string-join filter in MSstatsMergeFractions with a direct (FEATURE, FRACTION) keyed lookup; drops two large character vectors and a paste() call.
* Replaced ifelse() full-vector writes for predicted/newABUNDANCE and nonmissing_orig with targeted [i, j := v] in-place writes.
* Collapsed the two-step subset+transform in .selectHighQualityFeatures into a single pass to eliminate one intermediate data.table copy.
* Reworked MSstatsSummarizationOutput to extract predicted_survival upfront and null per-protein second slots so the nested-list duplication is freed before .finalizeTMP runs; switched the final return to data.table::setDF() in place of as.data.frame().
* Fixed two regressions in the original commit: (1) .finalizeTMP's join_cols must intersect with predicted_survival's columns so the keyed lookup doesn't error on missing LABEL; (2) reverted the survival-column-selection tightening that dropped LABEL, because a downstream test in test_dataProcess.R relies on LABEL being kept.
* Tests: inst/tinytest/test_memory_optimization_copies.R Issues 2/3/4: 28 assertions, all green. Full suite 224/224 OK.

See MSstats-ai/todos/active/TODO-MS-20260514_fix-memory-bugs.md

Co-Authored-By: Claude <noreply@anthropic.com>
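The first two bullets can be illustrated with a minimal sketch. The table and column names below are made up for illustration and are not taken from the MSstats code; only `data.table::set()`, `data.table::setorder()` and `data.table::address()` are real API calls.

```r
library(data.table)

# Hypothetical toy table; the real pipeline tables are much larger.
dt <- data.table(id = c(3L, 1L, 2L), keep = c("c", "a", "b"), drop_me = 0)
addr_before <- address(dt)

# Copying idiom: a column subset allocates a new table.
subset_copy <- dt[, c("id", "keep"), with = FALSE]

# In-place idiom: delete the unwanted columns by reference.
for (col in setdiff(names(dt), c("id", "keep"))) {
  set(dt, j = col, value = NULL)
}

# In-place reorder instead of dt <- dt[order(id), ].
setorder(dt, id)

# The table is still the same object after both in-place steps.
same_object <- identical(address(dt), addr_before)
```

Comparing `address()` before and after is a cheap way to confirm that no new table was allocated, which is roughly what a copy-avoidance test needs to assert.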
📝 Walkthrough
Refactor to perform in-place data.table updates, extract predicted_survival from summarization outputs and pass it into finalizers, replace merges/ifelse with indexed := assignments, and add tinytests validating in-place and merge behaviors.
Changes: DataProcess Pipeline Memory Optimization and Output Refactoring
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (inconclusive)
Great update, thanks. Did you get a chance to evaluate the memory gain with lineprof or lobstr?
mstaniak left a comment
Hi,
thanks again for this update. I have a few minor comments.
Actionable comments posted: 3
🧹 Nitpick comments (1)
inst/tinytest/test_memory_optimization_copies.R (1)
328-369: ⚡ Quick win: Add a mixed-`LABEL` fixture to this contract test.
These assertions only exercise `LABEL = "L"`, so a regression that drops `LABEL` from the survival projection or join keys would still pass here. A small `L`/`H` fixture with duplicated `(RUN, FEATURE)` values would cover the regression this stack is guarding against.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@inst/tinytest/test_memory_optimization_copies.R` around lines 328 - 369, The test only uses LABEL = "L", so add mixed LABEL values and duplicated (RUN, FEATURE) combos to both finalize_input_4 and pred_surv_4 to exercise join keys: modify finalize_input_4$LABEL to contain a small mixture (e.g. "L" and "H" as a factor) with duplicated RUN/FEATURE pairs across labels, and add a LABEL column to pred_surv_4 with matching L/H entries (and duplicate RUN/FEATURE rows) so MSstats:::.finalizeTMP must preserve/join on LABEL; keep result_4 assertions but ensure the fixture includes those mixed-label cases to catch regressions that drop LABEL from survival projection or join keys.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@inst/tinytest/test_memory_optimization_copies.R`:
- Around line 206-212: The test currently counts all non-NA newABUNDANCE values
(matched_count) which can include rows that were already populated; instead
capture the rows that started with newABUNDANCE = NA before calling
.finalizeTMP() (e.g. store original_na_idx <-
is.na(original_result$newABUNDANCE)), then after .finalizeTMP() assert that
result$newABUNDANCE[original_na_idx] are non-NA and equal to the expected
imputed values from predicted_survival (use the (cen, RUN, FEATURE) key to
compare), replacing the generic expect_true(matched_count > 0) with direct
checks on those indices.
In `@R/utils_output.R`:
- Around line 41-49: Check whether summarized contains a "try-error" result
before accessing x[[1]]/x[[2]]: if any element of summarized inherits from
"try-error" (the fallback path intended for failed
MSstatsSummarizeWithSingleCore()), do not rbind or unpack
predicted_survival/protein_summaries; instead invoke the existing fallback
behavior (the same path currently guarded at the later check) and avoid calling
.finalizeInput on invalid data. Update the block that builds predicted_survival
and protein_summaries to first detect try-error in summarized and branch to the
fallback handling when present, referencing the summarized variable and the
.finalizeInput call to ensure invalid summary results are not unpacked.
- Around line 101-102: The calls to data.table::setDF(input) and
data.table::setDF(rqall) mutate caller-owned objects in place; update
MSstatsSummarizationOutput to avoid by-reference mutation by operating on copies
instead (e.g., create local copies like input_copy <- input and rqall_copy <-
rqall or coerce with as.data.frame() on copies) and call data.table::setDF() (or
as.data.frame) on those copies so the original input and rqall keep their
data.table class; ensure all subsequent uses in the function reference the
copied variables (input_copy, rqall_copy) rather than the originals.
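The by-reference concern about `setDF()` can be reproduced in a few lines. This is a sketch with hypothetical helper names, not the MSstatsSummarizationOutput code. One caveat on the wording above: a plain assignment such as `input_copy <- input` does not copy a data.table, so a real copy needs `data.table::copy()` or `as.data.frame()`.

```r
library(data.table)

# Hypothetical helpers standing in for a function that returns a data.frame.
to_df_in_place <- function(x) {
  setDF(x)           # converts by reference: the caller's object changes too
  x
}
to_df_on_copy <- function(x) {
  as.data.frame(x)   # allocates a new data.frame, caller untouched
}

caller_dt <- data.table(a = 1:3)

res_copy <- to_df_on_copy(caller_dt)
still_dt_after_copy <- is.data.table(caller_dt)

res_in_place <- to_df_in_place(caller_dt)
still_dt_after_in_place <- is.data.table(caller_dt)
```

After the in-place call, `caller_dt` is no longer a data.table. That is only safe when the function owns the object it converts.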
📒 Files selected for processing (6)
- R/dataProcess.R
- R/utils_checks.R
- R/utils_feature_selection.R
- R/utils_normalize.R
- R/utils_output.R
- inst/tinytest/test_memory_optimization_copies.R
Hi,
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@man/dot-finalizeInput.Rd`:
- Around line 7-18: The Rd documentation is out of sync: update the \arguments
block for .finalizeInput to replace the stale item {summarized} with an entry
for {predicted_survival} that describes the new data.table contract;
specifically, ensure the arguments list includes input, predicted_survival,
method, impute, and censored_symbol and that the predicted_survival description
matches the function's expected data.table structure (columns/types) used by
.finalizeInput so \usage and \arguments are consistent with the function
signature.
📒 Files selected for processing (6)
- R/utils_feature_selection.R
- R/utils_normalize.R
- R/utils_output.R
- inst/tinytest/test_MSstatsMergeFractions.R
- man/dot-finalizeInput.Rd
- man/dot-finalizeTMP.Rd
✅ Files skipped from review due to trivial changes (1)
- man/dot-finalizeTMP.Rd
🚧 Files skipped from review as they are similar to previous changes (3)
- R/utils_feature_selection.R
- R/utils_normalize.R
- R/utils_output.R
.finalizeInput(input, predicted_survival, method, impute, censored_symbol)
}
\arguments{
\item{input}{feature-level data}

\item{summarized}{protein-level data (list)}

\item{method}{summary method}

\item{impute}{if TRUE, censored missing values were imputed}

\item{censored_symbol}{censored missing value indicator}

\item{summarized}{protein-level data (list)}
Documented arguments are out of sync with the updated function signature.
Line 7 uses predicted_survival, but Line 18 still documents summarized and there is no predicted_survival argument entry. Please replace the stale summarized item with predicted_survival (and description matching the new data.table contract) so \usage and \arguments are consistent.
For issue: #208 (comment)
I looked into this, but I think the current input[, list(...)] is actually the right call here, for the following reason:
Since data.table :=/setnames modify by reference across function boundaries, renaming the columns to lowercase (or adding log2inty/is_obs) inside the function would mutate the caller's table and break that merge (its by.x keys would no longer exist). The input[, list(...)] form avoids this by returning a new table and only rebinding the local name.
Happy to reevaluate my decision if I missed something.
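The argument above can be checked with a small sketch. The helper names are hypothetical, and the columns only mirror the names mentioned in this thread.

```r
library(data.table)

# Hypothetical helpers; neither is the actual MSstats function.
rename_in_place <- function(input) {
  setnames(input, "ABUNDANCE", "log2inty")   # mutates the caller's table
  input
}
project_to_new <- function(input) {
  input[, list(log2inty = ABUNDANCE)]        # new table, caller untouched
}

caller <- data.table(FEATURE = c("f1", "f2"), ABUNDANCE = c(10, 12))

projected <- project_to_new(caller)
has_abundance_after_projection <- "ABUNDANCE" %in% names(caller)

renamed <- rename_in_place(caller)
has_abundance_after_rename <- "ABUNDANCE" %in% names(caller)
```

After `rename_in_place()` the caller no longer has an `ABUNDANCE` column, so any later merge keyed on the original names would fail. The projection form costs one allocation but leaves the caller's table intact.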
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@R/utils_checks.R`:
- Line 173: The AnomalyScores column is being initialized with plain NA which
creates a logical column; update the call to data.table::set in R/utils_checks.R
(the line that sets j = "AnomalyScores") to use NA_real_ instead of NA so the
column type matches downstream numeric expectations (see
.updateColumnsForProcessing which uses NA_real_ and the uppercasing step that
may leave the column present).
📒 Files selected for processing (3)
- R/dataProcess.R
- R/utils_checks.R
- inst/tinytest/test_memory_optimization_copies.R
🚧 Files skipped from review as they are similar to previous changes (2)
- R/dataProcess.R
- inst/tinytest/test_memory_optimization_copies.R
if (!"AnomalyScores" %in% colnames(input)){
    input$AnomalyScores = NA
    data.table::set(input, j = "AnomalyScores", value = NA)
Use NA_real_ for type consistency with downstream numeric AnomalyScores.
Using plain NA creates a logical-typed column. Since ANOMALYSCORES is expected to be numeric (values like 0.03, 0.01 per tests), and .updateColumnsForProcessing at line 318 uses NA_real_, this should match.
After column names are uppercased at line 198, the check at line 317 will find the column already exists and skip the NA_real_ assignment, leaving the column as logical.
Proposed fix
- data.table::set(input, j = "AnomalyScores", value = NA)
+ data.table::set(input, j = "AnomalyScores", value = NA_real_)
@mstaniak, glad you agree! The fix is straightforward:
- data.table::set(input, j = "AnomalyScores", value = NA)
+ data.table::set(input, j = "AnomalyScores", value = NA_real_)
This ensures type consistency throughout the pipeline: the column stays numeric from the start, avoiding any silent logical→numeric coercion later when .updateColumnsForProcessing operates on ANOMALYSCORES.
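The type difference behind this suggestion is easy to confirm on a toy table. The column names below are illustrative only.

```r
library(data.table)

dt <- data.table(id = 1:3)

# Plain NA is a logical constant, so the new column is logical.
set(dt, j = "logical_na", value = NA)

# NA_real_ is a double constant, so the new column is numeric from the start.
set(dt, j = "numeric_na", value = NA_real_)

col_types <- vapply(dt, typeof, character(1))
```

`col_types` reports `"logical"` for the first new column and `"double"` for the second, which is the mismatch the review comment describes.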
(ノ◕ヮ◕)ノ*:・゚✧
log2inty = ABUNDANCE,
is_censored = if (has_censored) censored else FALSE)]
# Censored or missing intensities are not observations.
input[is_censored | is.na(log2inty), log2inty := NA]
Should this be is_censored & is.na(log2inty)? Because the negation of !(is.na(ABUNDANCE) | is_censored) switches the OR to an AND
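A quick truth-table check may help settle the question. It assumes, as the comment implies, that the original code kept an intensity only where `!(is.na(ABUNDANCE) | is_censored)` held; the vectors are made-up values covering all four combinations.

```r
# Four rows: observed, missing only, censored only, missing and censored.
abundance   <- c(10, NA, 12, NA)
is_censored <- c(FALSE, FALSE, TRUE, TRUE)

# Assumed original "keep" condition.
keep_rows <- !(is.na(abundance) | is_censored)

# Candidate conditions for the rows to blank out.
drop_or  <- is.na(abundance) | is_censored
drop_and <- is.na(abundance) & is_censored

or_is_complement  <- identical(drop_or, !keep_rows)
and_is_complement <- identical(drop_and, !keep_rows)
```

Under that assumption the rows to blank out are the complement of the kept rows, which is the OR form. De Morgan's law turns the OR into an AND only when the negation is pushed inside, giving `!is.na(abundance) & !is_censored` for the kept rows.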
User description
* Replaced `input[, cols, with = FALSE]` deep-copy in MSstatsPrepareForDataProcess and MSstatsSummarizationOutput with drop-cols loops via data.table::set(j = ..., value = NULL).
* Replaced row-shuffle `input = input[order(...), ]` in .prepareForDataProcess with data.table::setorder() (in place).
* Replaced the synthesised `tmp` string-join filter in MSstatsMergeFractions with a direct (FEATURE, FRACTION) keyed lookup; drops two large character vectors and a paste() call.

See MSstats-ai/todos/active/TODO-MS-20260514_fix-memory-bugs.md
PR Type
Enhancement, Bug fix, Tests
Description
Replace full-table copies with in-place updates
Use keyed lookups for joins
Split survival outputs before finalization
Add memory-regression pipeline tests
File Walkthrough
dataProcess.R: Limit censored-value updates to matching rows (R/dataProcess.R)
- `ifelse()` rewrites with targeted `:=` updates
- `predicted` on applicable censored rows
- `newABUNDANCE` where imputation applies
utils_checks.R: Avoid copies during input trimming and sorting (R/utils_checks.R)
- `data.table::set(..., NULL)`
- `data.table::setorder()`
utils_feature_selection.R: Collapse feature preprocessing into one pass (R/utils_feature_selection.R)
- `censored` values inline
- `is_obs` without intermediate tables
utils_normalize.R: Use in-place cleanup and keyed fraction merges (R/utils_normalize.R)
- `merge()` with keyed `newRun` assignment
- `(FEATURE, FRACTION)` lookup
utils_output.R: Streamline summary output and imputation joins (R/utils_output.R)
- `predicted_survival` before finalization
test_memory_optimization_copies.R: Add memory regression tests for copy avoidance (inst/tinytest/test_memory_optimization_copies.R)
- `.normalizeMedian` temp-column cleanup
- `.finalizeTMP` keyed matches and unmatched `NA`s
- `MSstatsSummarizationOutput` list splitting behavior
Motivation and Context: Short summary of the solution
The dataProcess pipeline used several copy-heavy data.table idioms (column-subset deep copies, merge(all.x=TRUE) joins, order-based reassignments, and full-vector ifelse writes) that caused excessive memory allocations. This PR replaces those with in-place data.table operations (data.table::set(..., value = NULL), data.table::setorder(), keyed which lookups + data.table::set, and targeted [i, j := v] updates) to reduce memory churn while preserving behavior. It also extracts predicted_survival earlier in summarization, nulls nested survival slots, fixes two regressions (predicted_survival join-columns intersection and retention of LABEL), updates documentation for changed parameters, and adds memory-regression tinytests. Full test suite passes.
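The "keyed lookup plus `set()`" and "targeted `:=`" patterns named in this summary can be sketched as follows. The tables are hypothetical and much smaller than real pipeline data, and the code is an illustration of the idiom, not the `.finalizeTMP` implementation.

```r
library(data.table)

# Hypothetical feature-level table and survival predictions.
input <- data.table(RUN = c(1L, 1L, 2L),
                    FEATURE = c("f1", "f2", "f1"),
                    censored = c(TRUE, FALSE, TRUE),
                    newABUNDANCE = c(NA, 11, NA))
predicted_survival <- data.table(RUN = c(1L, 2L),
                                 FEATURE = c("f1", "f3"),
                                 predicted = c(9.5, 8.0))

# Row of predicted_survival matching each row of input (NA when unmatched).
idx <- predicted_survival[input, on = c("RUN", "FEATURE"),
                          which = TRUE, mult = "first"]

# Write the looked-up values by reference instead of merge(all.x = TRUE).
set(input, j = "predicted", value = predicted_survival$predicted[idx])

# Targeted update instead of rewriting the whole column with ifelse().
input[censored & !is.na(predicted), newABUNDANCE := predicted]
```

Only the first row has a matching prediction, so it alone receives an imputed `newABUNDANCE`; unmatched rows keep `NA` in `predicted`, mirroring the `all.x = TRUE` semantics without building a merged copy of `input`.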
Detailed changes
General
R/dataProcess.R
R/utils_checks.R
R/utils_feature_selection.R
R/utils_normalize.R
R/utils_output.R
Regressions fixed
Documentation
Unit tests added / modified
Coding guidelines / potential violations