Feat/batch effect array gene expression#7
Open
nttg8100 wants to merge 3 commits into
Open
Conversation
added 3 commits
June 7, 2026 01:48
Reproducible pixi+R pipeline that integrates two GPL570 breast cancer
cohorts (GSE140494, GSE146558; n=200, both 66% ER+) and removes the
technical batch effect with ComBat and limma::removeBatchEffect while
protecting er_status as a biological covariate.
Pipeline (4 steps, all idempotent and cached):
01_download.R pull series matrix for each GSE, cache .rds
02_preprocess.R combine into ExpressionSet, scale guard,
label-aware multi-column ER extractor
03_batch_effect.R PCA on top 5000 variable probes, ComBat + limma
correction, kNN purity + centroid/SD metrics
04_validate.R ESR1 boxplot + limma ER+/ER- DE on largest
mixed-ER cohort
Results on the chosen pair (n=200):
- cor PC1-dataset: 0.983 -> 0.000 (batch axis collapsed)
- kNN purity (top 5k, PC1-10): 0.996 -> 0.651 (close to random)
- cor PC2-ER (top 5k): 0.084 -> -0.655 (biology exposed)
- ESR1 limma (GSE146558): logFC=3.87, FDR=1.38e-20, rank #1
- 8,563 probes at FDR<0.05 (16% of genome)
Key design choices documented in README.md:
- replace, do not coerce MAS5-style linear-scale cohorts (scale
guard stops with a clear swap-cohort message)
- ER parser label-aware (regex on label) + multi-column (scan all
characteristics_ch1.N, take first non-NA per row)
- visualize PCA on top 5000 variable probes (where biology lives)
AND on all 54,675 probes (for honesty); report kNN purity on both
- save BOTH ComBat and limma outputs; document the trade-off
Pair was picked from the 19 GPL570 datasets in
/scratch/data/omicslab-datasets/cohorts/BRCA/output/platforms_summary.csv
by intersecting: GPL570, RMA log2 (median < 30), IHC ER with 0 NAs,
mixed 66/34 ER distribution, and similar intrinsic subtype mix
(few TNBC, few HER2+ in both cohorts).
…xpression Flatten the directory structure to match the gkit utility pattern (igv-report/, dpgt/, glnexus/, hail/ are all at the top level of gkit/). The previous batch-effect/ parent was empty except for the array-expression subdirectory, so the rename is unambiguous. This makes the utility directly addressable for the make-test CI loop and aligns with the developer guide in Readme.developer.md.
Replace the GEO-download-first pipeline with a CSV-first design so
`make test` runs end-to-end without any network access.
Changes:
- Add Makefile with two workflows:
make test -- preprocess -> batch-effect -> validate (no network)
make raw -- export-csv -> preprocess -> batch-effect -> validate
(re-downloads from GEO, rebuilds the committed CSVs)
make clean -- remove generated artifacts
- Add data/expression/<GSE>.csv (probes x samples, log2 RMA, ~118 MB
total) and data/clinical/<GSE>.csv (cleaned per-sample phenotype
from the harmonized pool, ~20 KB total). These are now the
committed inputs; the .rds files in data/raw/ are no longer in
git (.gitignore).
- Replace scripts/01_download.R with scripts/01_export_csv.R, which
pulls the series matrix from GEO, then writes the two CSVs per
cohort. The clinical CSV is sourced from
/scratch/data/omicslab-datasets/cohorts/BRCA/output/json/reviewed/
so we no longer re-parse the heterogeneous
characteristics_ch1.N columns ourselves.
- Rewrite scripts/02_preprocess.R to read the committed CSVs and
build the combined ExpressionSet. Sample alignment is asserted
explicitly (setequal on GSM IDs). Scale guard, ER counts, and
intersection logic unchanged.
- pixi.toml: rename 'download' task to 'export-csv'; drop the
'all' chain (the Makefile orchestrates now). pixi.lock unchanged
(no new dependencies).
- README.md: document the new developer workflow (`make test`
and `make raw`), the data layout, and the rationale for using
the harmonized clinical data instead of re-parsing GEO.
Same numerical results as before (the committed data is byte-identical
to the previously-tested expression matrices):
kNN purity (top 5k, ComBat) : 0.996 -> 0.651
cor PC2-ER (top 5k, ComBat) : 0.084 -> -0.655
ESR1 limma (GSE146558) : logFC=3.87, FDR=1.38e-20
DE probes at FDR<0.05 : 8,563
Truongphi20
reviewed
Jun 8, 2026
Truongphi20
left a comment
There was a problem hiding this comment.
It is brilliant, Giang!
Thank for a thoughtful tutorial on handling batch-effect between two realistic cohorts using Combat and limma. I just had a quick look through your note and left a few questions on the evaluation section. Hope to hear from you soon if you are available. Thanks.
Comment on lines
+430
to
+431
| 1. **`cor PC1-dataset`** — the old single-number summary. Both | ||
| ComBat and limma drive it to ≈ 0 on both scopes. ✅ |
There was a problem hiding this comment.
What is the meaning of the cor PC1-dataset matrix? Why is 0 being good?
|
|
||
| `results/tables/pc_variance_explained.csv`: | ||
|
|
||
| | scope | stage | cor PC1-ds | cor PC2-ds | cor PC1-er | cor PC2-er | cent/SD | kNN purity | |
There was a problem hiding this comment.
What is difference between suffixes "-ds" and "-er"?
Comment on lines
+451
to
+458
| **Which method wins here?** ComBat (kNN = 0.651) wins on local | ||
| structure, limma wins on PC2-ER exposure (-0.663 vs -0.655 — basically | ||
| tied). For a *batch-effect demo* with *biology preserved*, we | ||
| recommend **ComBat** because the limma over-aggressiveness can be | ||
| confusing when the data has internal subtype structure. For a | ||
| *downstream DE analysis* where you want the cleanest possible batch- | ||
| free expression matrix, limma is the safer choice. We save both so | ||
| you can take your pick. |
There was a problem hiding this comment.
- What is limma over-aggressiveness? Why does "having internal subtype structure" matter?
- How do we decide which is matrix to rely on?
| | top 5 000 variable | ComBat | -0.001 | 0.001 | 0.086 | **-0.655** | 0.003 | **0.651** | | ||
| | top 5 000 variable | limma | 0.000 | -0.001 | 0.079 | -0.663 | 0.002 | 0.716 | | ||
|
|
||
| Three metrics, two ways to read them: |
There was a problem hiding this comment.
Could you provide script or references of three evaluate matrices?
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.