Feat/batch effect array gene expression by nttg8100 · Pull Request #7 · gianglabs/omicslab-kit

nttg8100 · 2026-06-06T19:26:41Z

No description provided.

Reproducible pixi+R pipeline that integrates two GPL570 breast cancer cohorts (GSE140494, GSE146558; n=200, both 66% ER+) and removes the technical batch effect with ComBat and limma::removeBatchEffect while protecting er_status as a biological covariate. Pipeline (4 steps, all idempotent and cached): 01_download.R pull series matrix for each GSE, cache .rds 02_preprocess.R combine into ExpressionSet, scale guard, label-aware multi-column ER extractor 03_batch_effect.R PCA on top 5000 variable probes, ComBat + limma correction, kNN purity + centroid/SD metrics 04_validate.R ESR1 boxplot + limma ER+/ER- DE on largest mixed-ER cohort Results on the chosen pair (n=200): - cor PC1-dataset: 0.983 -> 0.000 (batch axis collapsed) - kNN purity (top 5k, PC1-10): 0.996 -> 0.651 (close to random) - cor PC2-ER (top 5k): 0.084 -> -0.655 (biology exposed) - ESR1 limma (GSE146558): logFC=3.87, FDR=1.38e-20, rank #1 - 8,563 probes at FDR<0.05 (16% of genome) Key design choices documented in README.md: - replace, do not coerce MAS5-style linear-scale cohorts (scale guard stops with a clear swap-cohort message) - ER parser label-aware (regex on label) + multi-column (scan all characteristics_ch1.N, take first non-NA per row) - visualize PCA on top 5000 variable probes (where biology lives) AND on all 54,675 probes (for honesty); report kNN purity on both - save BOTH ComBat and limma outputs; document the trade-off Pair was picked from the 19 GPL570 datasets in /scratch/data/omicslab-datasets/cohorts/BRCA/output/platforms_summary.csv by intersecting: GPL570, RMA log2 (median < 30), IHC ER with 0 NAs, mixed 66/34 ER distribution, and similar intrinsic subtype mix (few TNBC, few HER2+ in both cohorts).

…xpression Flatten the directory structure to match the gkit utility pattern (igv-report/, dpgt/, glnexus/, hail/ are all at the top level of gkit/). The previous batch-effect/ parent was empty except for the array-expression subdirectory, so the rename is unambiguous. This makes the utility directly addressable for the make-test CI loop and aligns with the developer guide in Readme.developer.md.

Replace the GEO-download-first pipeline with a CSV-first design so `make test` runs end-to-end without any network access. Changes: - Add Makefile with two workflows: make test -- preprocess -> batch-effect -> validate (no network) make raw -- export-csv -> preprocess -> batch-effect -> validate (re-downloads from GEO, rebuilds the committed CSVs) make clean -- remove generated artifacts - Add data/expression/<GSE>.csv (probes x samples, log2 RMA, ~118 MB total) and data/clinical/<GSE>.csv (cleaned per-sample phenotype from the harmonized pool, ~20 KB total). These are now the committed inputs; the .rds files in data/raw/ are no longer in git (.gitignore). - Replace scripts/01_download.R with scripts/01_export_csv.R, which pulls the series matrix from GEO, then writes the two CSVs per cohort. The clinical CSV is sourced from /scratch/data/omicslab-datasets/cohorts/BRCA/output/json/reviewed/ so we no longer re-parse the heterogeneous characteristics_ch1.N columns ourselves. - Rewrite scripts/02_preprocess.R to read the committed CSVs and build the combined ExpressionSet. Sample alignment is asserted explicitly (setequal on GSM IDs). Scale guard, ER counts, and intersection logic unchanged. - pixi.toml: rename 'download' task to 'export-csv'; drop the 'all' chain (the Makefile orchestrates now). pixi.lock unchanged (no new dependencies). - README.md: document the new developer workflow (`make test` and `make raw`), the data layout, and the rationale for using the harmonized clinical data instead of re-parsing GEO. Same numerical results as before (the committed data is byte-identical to the previously-tested expression matrices): kNN purity (top 5k, ComBat) : 0.996 -> 0.651 cor PC2-ER (top 5k, ComBat) : 0.084 -> -0.655 ESR1 limma (GSE146558) : logFC=3.87, FDR=1.38e-20 DE probes at FDR<0.05 : 8,563

Truongphi20

It is brilliant, Giang!

Thank for a thoughtful tutorial on handling batch-effect between two realistic cohorts using Combat and limma. I just had a quick look through your note and left a few questions on the evaluation section. Hope to hear from you soon if you are available. Thanks.

Truongphi20 · 2026-06-08T05:40:08Z

+1. **`cor PC1-dataset`** — the old single-number summary. Both
+   ComBat and limma drive it to ≈ 0 on both scopes. ✅


What is the meaning of the cor PC1-dataset matrix? Why is 0 being good?

Truongphi20 · 2026-06-08T05:45:22Z

+
+`results/tables/pc_variance_explained.csv`:
+
+| scope                       | stage   | cor PC1-ds | cor PC2-ds | cor PC1-er | cor PC2-er | cent/SD | kNN purity |


What is difference between suffixes "-ds" and "-er"?

Truongphi20 · 2026-06-08T05:55:29Z

+**Which method wins here?** ComBat (kNN = 0.651) wins on local
+structure, limma wins on PC2-ER exposure (-0.663 vs -0.655 — basically
+tied). For a *batch-effect demo* with *biology preserved*, we
+recommend **ComBat** because the limma over-aggressiveness can be
+confusing when the data has internal subtype structure. For a
+*downstream DE analysis* where you want the cleanest possible batch-
+free expression matrix, limma is the safer choice. We save both so
+you can take your pick.


What is limma over-aggressiveness? Why does "having internal subtype structure" matter?

How do we decide which is matrix to rely on?

Truongphi20 · 2026-06-08T06:00:43Z

+| top 5 000 variable          | ComBat  | -0.001     |  0.001     |  0.086     | **-0.655** | 0.003   | **0.651** |
+| top 5 000 variable          | limma   |  0.000     | -0.001     |  0.079     | -0.663     | 0.002   | 0.716 |
+
+Three metrics, two ways to read them:


Could you provide script or references of three evaluate matrices?

nttg8100 added 3 commits June 7, 2026 01:48

Truongphi20 reviewed Jun 8, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Feat/batch effect array gene expression#7

Feat/batch effect array gene expression#7
nttg8100 wants to merge 3 commits into
mainfrom
feat/batch-effect-array-gene-expression

nttg8100 commented Jun 6, 2026

Uh oh!

Truongphi20 left a comment

Uh oh!

Truongphi20 Jun 8, 2026

Uh oh!

Truongphi20 Jun 8, 2026

Uh oh!

Truongphi20 Jun 8, 2026

Uh oh!

Truongphi20 Jun 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		1. `cor PC1-dataset` — the old single-number summary. Both
		ComBat and limma drive it to ≈ 0 on both scopes. ✅


		`results/tables/pc_variance_explained.csv`:

		\| scope \| stage \| cor PC1-ds \| cor PC2-ds \| cor PC1-er \| cor PC2-er \| cent/SD \| kNN purity \|

Uh oh!

Conversation

nttg8100 commented Jun 6, 2026

Uh oh!

Truongphi20 left a comment

Choose a reason for hiding this comment

Uh oh!

Truongphi20 Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

Truongphi20 Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

Truongphi20 Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

Truongphi20 Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants