Skip to content

Feat/batch effect array gene expression#7

Open
nttg8100 wants to merge 3 commits into
mainfrom
feat/batch-effect-array-gene-expression
Open

Feat/batch effect array gene expression#7
nttg8100 wants to merge 3 commits into
mainfrom
feat/batch-effect-array-gene-expression

Conversation

@nttg8100

@nttg8100 nttg8100 commented Jun 6, 2026

Copy link
Copy Markdown
Member

No description provided.

nttg8100 added 3 commits June 7, 2026 01:48
Reproducible pixi+R pipeline that integrates two GPL570 breast cancer
cohorts (GSE140494, GSE146558; n=200, both 66% ER+) and removes the
technical batch effect with ComBat and limma::removeBatchEffect while
protecting er_status as a biological covariate.

Pipeline (4 steps, all idempotent and cached):
  01_download.R     pull series matrix for each GSE, cache .rds
  02_preprocess.R   combine into ExpressionSet, scale guard,
                    label-aware multi-column ER extractor
  03_batch_effect.R PCA on top 5000 variable probes, ComBat + limma
                    correction, kNN purity + centroid/SD metrics
  04_validate.R     ESR1 boxplot + limma ER+/ER- DE on largest
                    mixed-ER cohort

Results on the chosen pair (n=200):
  - cor PC1-dataset: 0.983 -> 0.000  (batch axis collapsed)
  - kNN purity (top 5k, PC1-10): 0.996 -> 0.651  (close to random)
  - cor PC2-ER (top 5k): 0.084 -> -0.655  (biology exposed)
  - ESR1 limma (GSE146558): logFC=3.87, FDR=1.38e-20, rank #1
  - 8,563 probes at FDR<0.05 (16% of genome)

Key design choices documented in README.md:
  - replace, do not coerce MAS5-style linear-scale cohorts (scale
    guard stops with a clear swap-cohort message)
  - ER parser label-aware (regex on label) + multi-column (scan all
    characteristics_ch1.N, take first non-NA per row)
  - visualize PCA on top 5000 variable probes (where biology lives)
    AND on all 54,675 probes (for honesty); report kNN purity on both
  - save BOTH ComBat and limma outputs; document the trade-off

Pair was picked from the 19 GPL570 datasets in
/scratch/data/omicslab-datasets/cohorts/BRCA/output/platforms_summary.csv
by intersecting: GPL570, RMA log2 (median < 30), IHC ER with 0 NAs,
mixed 66/34 ER distribution, and similar intrinsic subtype mix
(few TNBC, few HER2+ in both cohorts).
…xpression

Flatten the directory structure to match the gkit utility pattern
(igv-report/, dpgt/, glnexus/, hail/ are all at the top level of gkit/).
The previous batch-effect/ parent was empty except for the array-expression
subdirectory, so the rename is unambiguous.

This makes the utility directly addressable for the make-test CI loop
and aligns with the developer guide in Readme.developer.md.
Replace the GEO-download-first pipeline with a CSV-first design so
`make test` runs end-to-end without any network access.

Changes:
  - Add Makefile with two workflows:
      make test  -- preprocess -> batch-effect -> validate (no network)
      make raw   -- export-csv -> preprocess -> batch-effect -> validate
                    (re-downloads from GEO, rebuilds the committed CSVs)
      make clean -- remove generated artifacts
  - Add data/expression/<GSE>.csv (probes x samples, log2 RMA, ~118 MB
    total) and data/clinical/<GSE>.csv (cleaned per-sample phenotype
    from the harmonized pool, ~20 KB total). These are now the
    committed inputs; the .rds files in data/raw/ are no longer in
    git (.gitignore).
  - Replace scripts/01_download.R with scripts/01_export_csv.R, which
    pulls the series matrix from GEO, then writes the two CSVs per
    cohort. The clinical CSV is sourced from
    /scratch/data/omicslab-datasets/cohorts/BRCA/output/json/reviewed/
    so we no longer re-parse the heterogeneous
    characteristics_ch1.N columns ourselves.
  - Rewrite scripts/02_preprocess.R to read the committed CSVs and
    build the combined ExpressionSet. Sample alignment is asserted
    explicitly (setequal on GSM IDs). Scale guard, ER counts, and
    intersection logic unchanged.
  - pixi.toml: rename 'download' task to 'export-csv'; drop the
    'all' chain (the Makefile orchestrates now). pixi.lock unchanged
    (no new dependencies).
  - README.md: document the new developer workflow (`make test`
    and `make raw`), the data layout, and the rationale for using
    the harmonized clinical data instead of re-parsing GEO.

Same numerical results as before (the committed data is byte-identical
to the previously-tested expression matrices):
  kNN purity (top 5k, ComBat)         : 0.996 -> 0.651
  cor PC2-ER (top 5k, ComBat)         : 0.084 -> -0.655
  ESR1 limma (GSE146558)              : logFC=3.87, FDR=1.38e-20
  DE probes at FDR<0.05               : 8,563

@Truongphi20 Truongphi20 left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is brilliant, Giang!

Thank for a thoughtful tutorial on handling batch-effect between two realistic cohorts using Combat and limma. I just had a quick look through your note and left a few questions on the evaluation section. Hope to hear from you soon if you are available. Thanks.

Comment on lines +430 to +431
1. **`cor PC1-dataset`** — the old single-number summary. Both
ComBat and limma drive it to ≈ 0 on both scopes. ✅

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the meaning of the cor PC1-dataset matrix? Why is 0 being good?


`results/tables/pc_variance_explained.csv`:

| scope | stage | cor PC1-ds | cor PC2-ds | cor PC1-er | cor PC2-er | cent/SD | kNN purity |

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is difference between suffixes "-ds" and "-er"?

Comment on lines +451 to +458
**Which method wins here?** ComBat (kNN = 0.651) wins on local
structure, limma wins on PC2-ER exposure (-0.663 vs -0.655 — basically
tied). For a *batch-effect demo* with *biology preserved*, we
recommend **ComBat** because the limma over-aggressiveness can be
confusing when the data has internal subtype structure. For a
*downstream DE analysis* where you want the cleanest possible batch-
free expression matrix, limma is the safer choice. We save both so
you can take your pick.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • What is limma over-aggressiveness? Why does "having internal subtype structure" matter?
  • How do we decide which is matrix to rely on?

| top 5 000 variable | ComBat | -0.001 | 0.001 | 0.086 | **-0.655** | 0.003 | **0.651** |
| top 5 000 variable | limma | 0.000 | -0.001 | 0.079 | -0.663 | 0.002 | 0.716 |

Three metrics, two ways to read them:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you provide script or references of three evaluate matrices?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants