|
1 | 1 | # pdex |
2 | 2 |
|
3 | | -parallel differential expression for single-cell perturbation sequencing |
| 3 | +Parallel differential expression for single-cell perturbation sequencing. |
4 | 4 |
|
5 | 5 | ## Installation |
6 | 6 |
|
7 | | -Add to your `pyproject.toml` file with [`uv`](https://github.com/astral-sh/uv) |
8 | | - |
9 | 7 | ```bash |
| 8 | +# add to pyproject.toml |
10 | 9 | uv add pdex |
11 | | -``` |
12 | 10 |
|
13 | | -## Summary |
| 11 | +# add to env |
| 12 | +uv pip install pdex |
| 13 | +``` |
14 | 14 |
|
15 | | -This is a python package for performing parallel differential expression between multiple groups and a control. |
| 15 | +## Overview |
16 | 16 |
|
17 | | -It is optimized for very large datasets and very large numbers of perturbations. |
| 17 | +`pdex` computes per-gene differential expression statistics between perturbation groups in single-cell data using Mann-Whitney U tests with FDR correction. It was originally designed for CRISPR screen and perturbation sequencing datasets with many groups and large cell counts. |
18 | 18 |
|
19 | | -It makes use of shared memory to parallelize the computation to a high number of threads and minimizes the [IPC](https://en.wikipedia.org/wiki/Inter-process_communication) between processes to reduce overhead. |
| 19 | +It supports dense and sparse (CSR) expression matrices, and uses [numba-mwu](https://github.com/noamteyssier/numba-mwu) for Numba-accelerated Mann-Whitney U computation. |
20 | 20 |
|
21 | | -It supports the following metrics: |
| 21 | +## Modes |
22 | 22 |
|
23 | | -- Wilcoxon Rank Sum |
24 | | -- Anderson-Darling |
25 | | -- T-Test |
| 23 | +| Mode | Description | |
| 24 | +| ------------- | ------------------------------------------------------------------- | |
| 25 | +| `"ref"` | Each group vs a single reference group (default: `"non-targeting"`) | |
| 26 | +| `"all"` | Each group vs all remaining cells (1-vs-rest) | |
| 27 | +| `"on_target"` | Each group vs the reference at its single target gene only | |
26 | 28 |
|
27 | | -## Backed vs In-Memory AnnData |
| 29 | +## Usage |
28 | 30 |
|
29 | | -pdex adapts its execution strategy based on how the AnnData object is stored: |
| 31 | +### Reference mode (default) |
30 | 32 |
|
31 | | -- **In-memory AnnData** (e.g., loaded via `sc.read_h5ad(path)` without `backed="r"`): |
32 | | - pdex uses a shared-memory multiprocessing workflow. Each worker process gets |
33 | | - access to the full expression matrix through shared memory, which minimizes |
34 | | - serialization overhead. Parallelism is configured via the `num_workers` |
35 | | - parameter (process count). `num_threads` is ignored in this mode because numba |
36 | | - kernels operate on per-target slices entirely in memory. |
37 | | -- **Backed AnnData** (`adata.X` is an on-disk HDF5 dataset): pdex automatically |
38 | | - switches to the low-memory chunked implementation. Gene chunks are streamed |
39 | | - from disk, the reference group is computed once per chunk, and targets are |
40 | | - processed in parallel via `num_workers` (thread pool). Within each target, |
41 | | - Wilcoxon metrics can additionally use numba parallelization controlled by |
42 | | - `num_threads`. This mode avoids loading the entire matrix into RAM while still |
43 | | - enabling both target-level and gene-level parallelism. |
| 33 | +```python |
| 34 | +import anndata as ad |
| 35 | +from pdex import pdex |
44 | 36 |
|
45 | | -If a backed dataset is supplied without enabling low-memory mode, pdex raises a |
46 | | -helpful error explaining that chunked processing is required. Conversely, you can |
47 | | -force the chunked path for large in-memory matrices by passing `low_memory=True`. |
| 37 | +adata = ad.read_h5ad("screen.h5ad") |
48 | 38 |
|
49 | | -## Parallelization |
| 39 | +results = pdex( |
| 40 | + adata, |
| 41 | + groupby="guide", |
| 42 | + mode="ref", |
| 43 | + is_log1p=False, |
| 44 | +) |
| 45 | +``` |
50 | 46 |
|
51 | | -`parallel_differential_expression` exposes two orthogonal knobs for controlling |
52 | | -parallel execution: |
| 47 | +### 1-vs-rest mode |
53 | 48 |
|
54 | | -- `num_workers` controls the number of Python threads that process targets within |
55 | | - each gene chunk. `None` (default in low-memory mode) enables an auto-detected |
56 | | - worker count based on available CPUs, while `1` disables thread-level parallelism. |
57 | | -- `num_threads` controls the numba thread pool used by the Wilcoxon kernel. `None` |
58 | | - lets numba auto-detect the optimal size, whereas `1` turns numba parallelization |
59 | | - off. This setting is only used in low-memory mode and only when `metric="wilcoxon"`. |
60 | | - When pdex detects non-integer expression values in a gene chunk (for example, after |
61 | | - log-normalization), it automatically disables numba for that chunk, logs a warning, |
62 | | - and falls back to the SciPy implementation to preserve correct rank ordering. |
| 49 | +```python |
| 50 | +results = pdex( |
| 51 | + adata, |
| 52 | + groupby="guide", |
| 53 | + mode="all", |
| 54 | + is_log1p=False, |
| 55 | +) |
| 56 | +``` |
63 | 57 |
|
64 | | -These strategies can be combined: for example, `num_workers=2, num_threads=8` |
65 | | -runs two target threads that share an eight-thread numba pool. When the metric |
66 | | -does not support numba acceleration, pdex automatically logs a warning and |
67 | | -falls back to thread-only execution. |
| 58 | +### On-target mode |
68 | 59 |
|
69 | | -## Usage |
| 60 | +Requires a column in `adata.obs` mapping each group to its target gene: |
70 | 61 |
|
71 | 62 | ```python |
72 | | -import anndata as ad |
73 | | -import numpy as np |
74 | | -import pandas as pd |
75 | | - |
76 | | -from pdex import parallel_differential_expression |
77 | | - |
78 | | -PERT_COL = "perturbation" |
79 | | -CONTROL_VAR = "control" |
80 | | - |
81 | | -N_CELLS = 1000 |
82 | | -N_GENES = 100 |
83 | | -N_PERTS = 10 |
84 | | -MAX_UMI = 1e6 |
85 | | - |
86 | | - |
87 | | -def build_random_anndata( |
88 | | - n_cells: int = N_CELLS, |
89 | | - n_genes: int = N_GENES, |
90 | | - n_perts: int = N_PERTS, |
91 | | - pert_col: str = PERT_COL, |
92 | | - control_var: str = CONTROL_VAR, |
93 | | -) -> ad.AnnData: |
94 | | - """Sample a random AnnData object.""" |
95 | | - return ad.AnnData( |
96 | | - X=np.random.randint(0, MAX_UMI, size=(n_cells, n_genes)), |
97 | | - obs=pd.DataFrame( |
98 | | - { |
99 | | - pert_col: np.random.choice( |
100 | | - [f"pert_{i}" for i in range(n_perts)] + [control_var], |
101 | | - size=n_cells, |
102 | | - replace=True, |
103 | | - ), |
104 | | - } |
105 | | - ), |
106 | | - ) |
107 | | - |
108 | | - |
109 | | -def main(): |
110 | | - adata = build_random_anndata() |
111 | | - |
112 | | - # Run pdex with default metric (wilcoxon) |
113 | | - results = parallel_differential_expression( |
114 | | - adata, |
115 | | - reference=CONTROL_VAR, |
116 | | - groupby_key=PERT_COL, |
117 | | - ) |
118 | | - assert results.shape[0] == N_GENES * N_PERTS |
119 | | - |
120 | | - # Run pdex with alt metric (anderson) |
121 | | - results = parallel_differential_expression( |
122 | | - adata, |
123 | | - reference=CONTROL_VAR, |
124 | | - groupby_key=PERT_COL, |
125 | | - metric="anderson" |
126 | | - ) |
127 | | - assert results.shape[0] == N_GENES * N_PERTS |
| 63 | +results = pdex( |
| 64 | + adata, |
| 65 | + groupby="guide", |
| 66 | + mode="on_target", |
| 67 | + gene_col="target_gene", |
| 68 | + is_log1p=False, |
| 69 | +) |
128 | 70 | ``` |
| 71 | + |
| 72 | +## Parameters |
| 73 | + |
| 74 | +| Parameter | Type | Default | Description | |
| 75 | +| ---------------- | -------------- | ----------------- | ---------------------------------------------------------- | |
| 76 | +| `adata` | `AnnData` | required | Annotated data matrix (dense or sparse CSR) | |
| 77 | +| `groupby` | `str` | required | Column in `adata.obs` defining groups | |
| 78 | +| `mode` | `str` | `"ref"` | Comparison mode: `"ref"`, `"all"`, or `"on_target"` | |
| 79 | +| `threads` | `int` | `0` | Numba thread count (`0` = all CPUs) | |
| 80 | +| `is_log1p` | `bool \| None` | `None` | Whether data is log1p-transformed. Auto-detected if `None` | |
| 81 | +| `geometric_mean` | `bool` | `True` | Use geometric mean for pseudobulk (vs arithmetic) | |
| 82 | +| `as_pandas` | `bool` | `False` | Return a pandas DataFrame instead of Polars | |
| 83 | +| `reference` | `str` | `"non-targeting"` | Reference group name (modes: `ref`, `on_target`) | |
| 84 | +| `gene_col` | `str` | — | Column mapping groups to target genes (mode: `on_target`) | |
| 85 | + |
| 86 | +## Output |
| 87 | + |
| 88 | +Returns a Polars DataFrame (or pandas if `as_pandas=True`) with one row per (group, gene) pair: |
| 89 | + |
| 90 | +| Column | Description | |
| 91 | +| ------------------- | -------------------------------------------------- | |
| 92 | +| `target` | Perturbation group name | |
| 93 | +| `feature` | Gene name | |
| 94 | +| `target_mean` | Pseudobulk mean for the target group (count space) | |
| 95 | +| `ref_mean` | Pseudobulk mean for the reference (count space) | |
| 96 | +| `target_membership` | Number of cells in the target group | |
| 97 | +| `ref_membership` | Number of cells in the reference | |
| 98 | +| `fold_change` | log2(target_mean / ref_mean) | |
| 99 | +| `percent_change` | (target_mean - ref_mean) / ref_mean | |
| 100 | +| `p_value` | Mann-Whitney U p-value | |
| 101 | +| `statistic` | Mann-Whitney U statistic | |
| 102 | +| `fdr` | FDR-corrected p-value (per-group, across genes). For `on_target` mode, this is applied across all groups. | |
0 commit comments