Commit 3284e73

Merge pull request #70 from ArcInstitute/pdex-0.2.0
Pdex 0.2.0
2 parents 787f81d + 0cd5998 commit 3284e73

21 files changed

Lines changed: 1774 additions & 3342 deletions

.github/workflows/ci.yml

Lines changed: 8 additions & 5 deletions
@@ -12,6 +12,9 @@ jobs:

   install-job:
     runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: ["3.13", "3.12", "3.11"]

     steps:
       - uses: actions/checkout@v4
@@ -21,7 +24,7 @@ jobs:
         with:
           enable-cache: true
           cache-dependency-glob: "pyproject.toml"
-          python-version: "3.12"
+          python-version: ${{ matrix.python-version }}

       - name: install dependencies
         run: |
@@ -40,7 +43,7 @@ jobs:
         with:
           enable-cache: true
           cache-dependency-glob: "pyproject.toml"
-          python-version: "3.12"
+          python-version: "3.13"

       - name: install dependencies
         run: |
@@ -63,15 +66,15 @@ jobs:
         with:
           enable-cache: true
           cache-dependency-glob: "pyproject.toml"
-          python-version: "3.12"
+          python-version: "3.13"

       - name: install dependencies
         run: |
           uv sync --all-extras --dev

       - name: run type checking
         run: |
-          uv run pyright
+          uv run ty check

   pytest:
     runs-on: ubuntu-latest
@@ -86,7 +89,7 @@ jobs:
         with:
           enable-cache: true
           cache-dependency-glob: "pyproject.toml"
-          python-version: "3.12"
+          python-version: "3.13"

       - name: install dependencies
         run: |

.python-version

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-3.10
+3.13

CLAUDE.md

Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+**Important:** This file must be kept up to date with the codebase. Any time the public API, output schema, modes, parameters, or architecture changes, update the relevant sections here before closing the task.
+
+## Project Overview
+
+`pdex` is a Python library for Parallel Differential Expression (PDEX) analysis in single-cell genomics, focused on conditional screens.
+It computes per-gene statistics comparing perturbation groups against a reference using Mann-Whitney U tests with FDR correction.
+It also provides functionality for per-gene statistics on 1-vs-rest comparisons and on-target single-gene comparisons.
+
+## Commands
+
+```bash
+# Install / sync dependencies
+uv sync
+
+# Run all tests
+uv run pytest -v
+
+# Run a specific test file
+uv run pytest tests/test_pdex.py
+
+# Run a single test by name
+uv run pytest tests/test_pdex.py::TestPdexRefMode::test_columns
+
+# Lint and format
+uv run ruff format
+
+# Type check
+uv run ty check
+```
+
+## Architecture
+
+### Core Pipeline (`src/pdex/__init__.py`)
+
+The main entry point is `pdex(adata, groupby, mode, threads, is_log1p, geometric_mean, as_pandas, **kwargs)`, which:
+
+1. Validates the `groupby` column in `adata.obs`
+2. Extracts unique groups (filters NaN and empty strings)
+3. Identifies a reference group (defaults to `"non-targeting"` in `"ref"` and `"on_target"` modes)
+4. For each non-reference group, slices the expression matrix, computes pseudobulk (mean), fold change, percent change, and Mann-Whitney U statistic vs the reference
+5. Applies per-group FDR correction (scipy) and returns a Polars DataFrame (or pandas if `as_pandas=True`)
+
+Three modes:
+
+- `"ref"`: each non-reference group vs a single reference group (reference group is excluded from output)
+- `"all"`: each group vs all remaining cells (1-vs-rest)
+- `"on_target"`: each non-reference group vs the reference, but only at the single gene targeted by that group (requires `gene_col=` kwarg)
+
+Unexpected `**kwargs` for any mode trigger a `UserWarning`.
+
+### Key Files
+
+| File | Role |
+| ---------------------- | ------------------------------------------------------------------------------------------------------- |
+| `src/pdex/__init__.py` | `pdex()` entry point and full pipeline logic |
+| `src/pdex/_math.py` | Numba JIT-compiled `fold_change()`, `percent_change()`, and `mwu()` wrappers; `pseudobulk()` dispatcher |
+| `src/pdex/_utils.py` | `set_numba_threadpool()` — sets Numba thread count before JIT warmup; `_detect_is_log1p()` heuristic |
+
+### Performance Design
+
+- Numba JIT compilation accelerates per-cell/per-gene math (`fold_change`, `percent_change`, `_log1p_col_mean`, `_expm1_vec`)
+- `numba-mwu` (external dep) provides a Numba-accelerated Mann-Whitney U implementation
+- Sparse CSR matrices are handled by reusing pre-computed non-targeting column indices to avoid redundant dense conversion
+- Parallelism is controlled via `threads` passed to `set_numba_threadpool()`
+
+### Output Schema
+
+The returned Polars DataFrame (or pandas DataFrame when `as_pandas=True`) has columns:
+
+| Column | Type | Description |
+| ------------------- | ----- | --------------------------------------------------------------------- |
+| `target` | str | Perturbation group name |
+| `feature` | str | Gene name |
+| `target_mean` | float | Pseudobulk mean for the target group, always in natural (count) space |
+| `ref_mean` | float | Pseudobulk mean for the reference, always in natural (count) space |
+| `target_membership` | int | Number of cells in the target group |
+| `ref_membership` | int | Number of cells in the reference |
+| `fold_change` | float | log2(target_mean / ref_mean) — computed from pseudobulk means |
+| `percent_change` | float | (target_mean - ref_mean) / ref_mean — computed from pseudobulk means |
+| `p_value` | float | Mann-Whitney U p-value (per-cell vectors) |
+| `statistic` | float | Mann-Whitney U statistic |
+| `fdr` | float | FDR-corrected p-value, applied per-group across genes. For `on_target` mode, applied across all groups. |
+
+`target_mean` and `ref_mean` are always in natural (count) space regardless of `is_log1p` or `geometric_mean`.
+FDR is corrected within each group (across genes) for `ref` and `all` modes. For `on_target` mode, it is applied across all resulting p-values.
+
+### Public API (`__all__`)
+
+```python
+from pdex import pdex, DEFAULT_REFERENCE
+```
+
+## Dependencies
+
+Managed with `uv`. Build backend: `hatchling`. Key packages: `anndata`, `numba`, `numba-mwu`, `polars`, `pyarrow`, `scipy`, `tqdm`. Dev tools: `pytest`, `ruff`, `ty`.
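Editor's sketch (not part of the commit): the three modes described in CLAUDE.md differ mainly in which cells form the comparison set for each group. The pure-Python sketch below illustrates that split under stated assumptions. The function name `plan_comparisons` and the `gene_of` mapping are hypothetical stand-ins (the real entry point reads the target gene from the `adata.obs` column named by `gene_col`), and treating the reference group as an ordinary target in `"all"` mode is an assumption the source does not confirm.

```python
from collections import defaultdict

# Name and value mirror the commit's docs; redefined here so the sketch is self-contained.
DEFAULT_REFERENCE = "non-targeting"


def plan_comparisons(labels, mode="ref", reference=DEFAULT_REFERENCE, gene_of=None):
    """Map each target group to (target_cells, comparison_cells, genes).

    `genes` is None when every gene is tested, or a one-element list in
    "on_target" mode. Hypothetical helper for illustration, not pdex internals.
    """
    if mode == "on_target" and gene_of is None:
        raise ValueError('"on_target" mode needs a group -> target gene mapping')

    # Collect the cell indices that belong to each group label.
    cells = defaultdict(list)
    for i, label in enumerate(labels):
        cells[label].append(i)

    plan = {}
    for group, target_cells in cells.items():
        if mode == "all":
            # 1-vs-rest: every cell outside the group is the comparison set.
            rest = [i for i, label in enumerate(labels) if label != group]
            plan[group] = (target_cells, rest, None)
        elif mode in ("ref", "on_target"):
            if group == reference:
                continue  # the reference group is excluded from the output
            genes = [gene_of[group]] if mode == "on_target" else None
            plan[group] = (target_cells, cells.get(reference, []), genes)
        else:
            raise ValueError(f"unknown mode: {mode!r}")
    return plan


labels = ["non-targeting", "A", "A", "B", "non-targeting"]
ref_plan = plan_comparisons(labels, mode="ref")
all_plan = plan_comparisons(labels, mode="all")
on_target_plan = plan_comparisons(
    labels, mode="on_target", gene_of={"A": "GENE_A", "B": "GENE_B"}
)
```

In `"ref"` and `"on_target"` modes the comparison cells are the same for every group, which is consistent with the note above about reusing pre-computed non-targeting indices.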

README.md

Lines changed: 77 additions & 103 deletions
@@ -1,128 +1,102 @@
 # pdex

-parallel differential expression for single-cell perturbation sequencing
+Parallel differential expression for single-cell perturbation sequencing.

 ## Installation

-Add to your `pyproject.toml` file with [`uv`](https://github.com/astral-sh/uv)
-
 ```bash
+# add to pyproject.toml
 uv add pdex
-```

-## Summary
+# add to env
+uv pip install pdex
+```

-This is a python package for performing parallel differential expression between multiple groups and a control.
+## Overview

-It is optimized for very large datasets and very large numbers of perturbations.
+`pdex` computes per-gene differential expression statistics between perturbation groups in single-cell data using Mann-Whitney U tests with FDR correction. It was originally designed for CRISPR screen and perturbation sequencing datasets with many groups and large cell counts.

-It makes use of shared memory to parallelize the computation to a high number of threads and minimizes the [IPC](https://en.wikipedia.org/wiki/Inter-process_communication) between processes to reduce overhead.
+It supports dense and sparse (CSR) expression matrices, and uses [numba-mwu](https://github.com/noamteyssier/numba-mwu) for Numba-accelerated Mann-Whitney U computation.

-It supports the following metrics:
+## Modes

-- Wilcoxon Rank Sum
-- Anderson-Darling
-- T-Test
+| Mode | Description |
+| ------------- | ------------------------------------------------------------------- |
+| `"ref"` | Each group vs a single reference group (default: `"non-targeting"`) |
+| `"all"` | Each group vs all remaining cells (1-vs-rest) |
+| `"on_target"` | Each group vs the reference at its single target gene only |

-## Backed vs In-Memory AnnData
+## Usage

-pdex adapts its execution strategy based on how the AnnData object is stored:
+### Reference mode (default)

-- **In-memory AnnData** (e.g., loaded via `sc.read_h5ad(path)` without `backed="r"`):
-  pdex uses a shared-memory multiprocessing workflow. Each worker process gets
-  access to the full expression matrix through shared memory, which minimizes
-  serialization overhead. Parallelism is configured via the `num_workers`
-  parameter (process count). `num_threads` is ignored in this mode because numba
-  kernels operate on per-target slices entirely in memory.
-- **Backed AnnData** (`adata.X` is an on-disk HDF5 dataset): pdex automatically
-  switches to the low-memory chunked implementation. Gene chunks are streamed
-  from disk, the reference group is computed once per chunk, and targets are
-  processed in parallel via `num_workers` (thread pool). Within each target,
-  Wilcoxon metrics can additionally use numba parallelization controlled by
-  `num_threads`. This mode avoids loading the entire matrix into RAM while still
-  enabling both target-level and gene-level parallelism.
+```python
+import anndata as ad
+from pdex import pdex

-If a backed dataset is supplied without enabling low-memory mode, pdex raises a
-helpful error explaining that chunked processing is required. Conversely, you can
-force the chunked path for large in-memory matrices by passing `low_memory=True`.
+adata = ad.read_h5ad("screen.h5ad")

-## Parallelization
+results = pdex(
+    adata,
+    groupby="guide",
+    mode="ref",
+    is_log1p=False,
+)
+```

-`parallel_differential_expression` exposes two orthogonal knobs for controlling
-parallel execution:
+### 1-vs-rest mode

-- `num_workers` controls the number of Python threads that process targets within
-  each gene chunk. `None` (default in low-memory mode) enables an auto-detected
-  worker count based on available CPUs, while `1` disables thread-level parallelism.
-- `num_threads` controls the numba thread pool used by the Wilcoxon kernel. `None`
-  lets numba auto-detect the optimal size, whereas `1` turns numba parallelization
-  off. This setting is only used in low-memory mode and only when `metric="wilcoxon"`.
-  When pdex detects non-integer expression values in a gene chunk (for example, after
-  log-normalization), it automatically disables numba for that chunk, logs a warning,
-  and falls back to the SciPy implementation to preserve correct rank ordering.
+```python
+results = pdex(
+    adata,
+    groupby="guide",
+    mode="all",
+    is_log1p=False,
+)
+```

-These strategies can be combined: for example, `num_workers=2, num_threads=8`
-runs two target threads that share an eight-thread numba pool. When the metric
-does not support numba acceleration, pdex automatically logs a warning and
-falls back to thread-only execution.
+### On-target mode

-## Usage
+Requires a column in `adata.obs` mapping each group to its target gene:

 ```python
-import anndata as ad
-import numpy as np
-import pandas as pd
-
-from pdex import parallel_differential_expression
-
-PERT_COL = "perturbation"
-CONTROL_VAR = "control"
-
-N_CELLS = 1000
-N_GENES = 100
-N_PERTS = 10
-MAX_UMI = 1e6
-
-
-def build_random_anndata(
-    n_cells: int = N_CELLS,
-    n_genes: int = N_GENES,
-    n_perts: int = N_PERTS,
-    pert_col: str = PERT_COL,
-    control_var: str = CONTROL_VAR,
-) -> ad.AnnData:
-    """Sample a random AnnData object."""
-    return ad.AnnData(
-        X=np.random.randint(0, MAX_UMI, size=(n_cells, n_genes)),
-        obs=pd.DataFrame(
-            {
-                pert_col: np.random.choice(
-                    [f"pert_{i}" for i in range(n_perts)] + [control_var],
-                    size=n_cells,
-                    replace=True,
-                ),
-            }
-        ),
-    )
-
-
-def main():
-    adata = build_random_anndata()
-
-    # Run pdex with default metric (wilcoxon)
-    results = parallel_differential_expression(
-        adata,
-        reference=CONTROL_VAR,
-        groupby_key=PERT_COL,
-    )
-    assert results.shape[0] == N_GENES * N_PERTS
-
-    # Run pdex with alt metric (anderson)
-    results = parallel_differential_expression(
-        adata,
-        reference=CONTROL_VAR,
-        groupby_key=PERT_COL,
-        metric="anderson"
-    )
-    assert results.shape[0] == N_GENES * N_PERTS
+results = pdex(
+    adata,
+    groupby="guide",
+    mode="on_target",
+    gene_col="target_gene",
+    is_log1p=False,
+)
 ```
+
+## Parameters
+
+| Parameter | Type | Default | Description |
+| ---------------- | -------------- | ----------------- | ---------------------------------------------------------- |
+| `adata` | `AnnData` | required | Annotated data matrix (dense or sparse CSR) |
+| `groupby` | `str` | required | Column in `adata.obs` defining groups |
+| `mode` | `str` | `"ref"` | Comparison mode: `"ref"`, `"all"`, or `"on_target"` |
+| `threads` | `int` | `0` | Numba thread count (`0` = all CPUs) |
+| `is_log1p` | `bool \| None` | `None` | Whether data is log1p-transformed. Auto-detected if `None` |
+| `geometric_mean` | `bool` | `True` | Use geometric mean for pseudobulk (vs arithmetic) |
+| `as_pandas` | `bool` | `False` | Return a pandas DataFrame instead of Polars |
+| `reference` | `str` | `"non-targeting"` | Reference group name (modes: `ref`, `on_target`) |
+| `gene_col` | `str` | | Column mapping groups to target genes (mode: `on_target`) |
+
+## Output
+
+Returns a Polars DataFrame (or pandas if `as_pandas=True`) with one row per (group, gene) pair:
+
+| Column | Description |
+| ------------------- | -------------------------------------------------- |
+| `target` | Perturbation group name |
+| `feature` | Gene name |
+| `target_mean` | Pseudobulk mean for the target group (count space) |
+| `ref_mean` | Pseudobulk mean for the reference (count space) |
+| `target_membership` | Number of cells in the target group |
+| `ref_membership` | Number of cells in the reference |
+| `fold_change` | log2(target_mean / ref_mean) |
+| `percent_change` | (target_mean - ref_mean) / ref_mean |
+| `p_value` | Mann-Whitney U p-value |
+| `statistic` | Mann-Whitney U statistic |
+| `fdr` | FDR-corrected p-value (per-group, across genes). For `on_target` mode, this is applied across all groups. |
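Editor's sketch (not part of the commit): the `fold_change`, `percent_change`, and `fdr` columns in the README's Output table can be reproduced by hand on a few numbers. The pure-Python functions below follow the formulas in the table. Two things are assumptions: the FDR procedure is taken to be Benjamini-Hochberg (the commit only says the correction comes from scipy), and the handling of a zero `ref_mean` is left out because the source does not describe it. The library itself uses Numba-compiled kernels, so this is a reference for the arithmetic only.

```python
import math


def fold_change(target_mean: float, ref_mean: float) -> float:
    """log2(target_mean / ref_mean), computed from pseudobulk means."""
    return math.log2(target_mean / ref_mean)


def percent_change(target_mean: float, ref_mean: float) -> float:
    """(target_mean - ref_mean) / ref_mean, computed from pseudobulk means."""
    return (target_mean - ref_mean) / ref_mean


def benjamini_hochberg(p_values: list[float]) -> list[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    Assumed stand-in for the scipy-based correction mentioned in the commit.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# One group's p-values across four genes (made-up numbers for illustration).
group_p_values = [0.01, 0.04, 0.03, 0.005]
group_fdr = benjamini_hochberg(group_p_values)
```

In `ref` and `all` modes the correction would be applied once per group over that group's genes, as in `group_fdr` above; in `on_target` mode the table states it is applied once over the p-values from all groups.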
