
Commit afafab4

Authored by danielStrobl, github-actions[bot], scottgigante-immunai, and LuckyMD
Batch integration data (#355)
* initial commit datasets batch integration
* shorten long line
* pre-commit
* keep raw counts in X
* kill pytest after 2 fails for testing
* increase swap size
* set swap size
* swap
* fix syntax
* change order of tests
* remove duplicate layer
* pre-commit
* immune cell dataloader comments
* doc
* add task dataloaders and subsampling immune
* pre-commit
* add batch integration to init py
* pre-commit
* typo
* generate empty structure for metrics/methods
* init py root
* metrics wrong folder
* fix pancreas dataloader batch
* pancreas batch column
* method stub
* stub metric
* pre-commit
* import error
* one method
* method error
* pre-commit
* remove unused
* pre-commit
* change placeholder method to combat
* pre-commit
* downstream pp
* reduce data correct import
* pre-commit
* grammar
* removed random and
* addressing comments

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Scott Gigante <84813314+scottgigante-immunai@users.noreply.github.com>
Co-authored-by: MalteDLuecken <m.d.luecken@gmail.com>
1 parent 84c8287 commit afafab4

16 files changed

Lines changed: 271 additions & 3 deletions


.github/workflows/run_tests.yml

Lines changed: 1 addition & 1 deletion
@@ -180,7 +180,7 @@ jobs:
           cd ..
 
       - name: Run tests
-        run: pytest --cov=openproblems --cov-report=term-missing:skip-covered --cov-report=xml -vv
+        run: pytest --cov=openproblems --cov-report=term-missing:skip-covered --cov-report=xml -vv --maxfail=2
 
       - name: Upload coverage
         continue-on-error: ${{ github.repository != 'openproblems-bio/openproblems' }}

openproblems/data/immune_cells.py

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
+from . import utils
+
+import os
+import scanpy as sc
+import scprep
+import tempfile
+
+URL = "https://ndownloader.figshare.com/files/25717328"
+
+
+@utils.loader
+def load_immune(test=False):
+    """Download immune human data from figshare."""
+    if test:
+        # load full data first, cached if available
+        adata = load_immune(test=False)
+
+        # Subsample immune data to two batches with 250 cells each
+        adata = adata[:, :500].copy()
+        batch1 = adata[adata.obs.batch == "Oetjen_A"][:250]
+        batch2 = adata[adata.obs.batch == "Freytag"][:250]
+        adata = batch1.concatenate(batch2)
+        # Note: could also use 200-500 HVGs rather than 200 random genes
+
+        # Ensure there are no cells or genes with 0 counts
+        utils.filter_genes_cells(adata)
+
+        return adata
+
+    else:
+        with tempfile.TemporaryDirectory() as tempdir:
+            filepath = os.path.join(tempdir, "immune.h5ad")
+            scprep.io.download.download_url(URL, filepath)
+            adata = sc.read(filepath)
+
+        # Note: anndata.X contains scran log-normalized data,
+        # so we're storing it in layers['log_scran']
+        adata.layers["log_scran"] = adata.X
+        adata.X = adata.layers["counts"]
+        del adata.layers["counts"]
+
+        # Ensure there are no cells or genes with 0 counts
+        utils.filter_genes_cells(adata)
+
+        return adata

openproblems/data/pancreas.py

Lines changed: 2 additions & 2 deletions
@@ -39,10 +39,10 @@ def load_pancreas(test=False):
             scprep.io.download.download_url(URL, filepath)
             adata = sc.read(filepath)
 
-        # Remove preprocessing
+        # NOTE: X contains counts that are normalized with scran
+        adata.layers["log_scran"] = adata.X
         adata.X = adata.layers["counts"]
         del adata.layers["counts"]
-
         # Ensure there are no cells or genes with 0 counts
         utils.filter_genes_cells(adata)

openproblems/tasks/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -3,3 +3,4 @@
 from . import label_projection
 from . import multimodal_data_integration
 from . import regulatory_effect_prediction
+from ._batch_integration import batch_integration_graph
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+# Batch integration
+
+Batch (or data) integration methods integrate datasets across batches that arise from various biological (e.g., tissue, location, individual, species) and technical (e.g., ambient RNA, lab, protocol) sources. The goal of a batch integration method is to remove unwanted batch effects in the data, while retaining biologically meaningful variation that can help us to detect cell identities, fit cellular trajectories, or understand patterns of gene or pathway activity.
+
+Methods that integrate batches typically produce one or more of three output types: a corrected feature matrix, a joint embedding across batches, and/or an integrated cell-cell similarity graph (e.g., a kNN graph). In order to define a consistent input and output for each method and metric, we have divided the batch integration task into three subtasks:
+
+* [Batch integration graphs](batch_integration_graph/),
+* [Batch integration embeddings](batch_integration_embed/), and
+* [Batch integrated feature matrices]()
+
+These subtasks collate methods that share an output type, together with the metrics that evaluate that output. As corrected feature matrices can be turned into embeddings, which in turn can be processed into integrated graphs, methods overlap between the subtasks. All methods are added to the graph subtask and imported into the other subtasks from there. Information on the task API for datasets, methods, and metrics can be found on the individual subtask pages.
+
+Metrics for this task can be divided into those that assess the removal of batch effects and those that assess the conservation of biological variation; this distinction can be helpful when devising new metrics. This task, including the subtask structure, was taken from a [benchmarking study of data integration methods](https://www.biorxiv.org/content/10.1101/2020.05.22.111161v2), which is a useful reference for more background on the task and the above concepts.
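The pipeline implied above — a feature matrix yields an embedding, and an embedding yields an integrated graph — can be illustrated with a toy kNN-graph builder. This is a hypothetical, pure-Python stand-in for what `scanpy.pp.neighbors()` does on real embeddings; all names here are illustrative only.

```python
import math

def knn_graph(embedding, k=2):
    """Toy kNN graph: for each point, keep the indices of its k nearest
    neighbours (a stand-in for scanpy.pp.neighbors on an embedding)."""
    graph = {}
    for i, point in enumerate(embedding):
        # Sort all other points by Euclidean distance to this one.
        others = sorted(
            (j for j in range(len(embedding)) if j != i),
            key=lambda j: math.dist(point, embedding[j]),
        )
        graph[i] = others[:k]
    return graph

# Two tight clusters: nearest neighbours stay within a cluster.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
graph = knn_graph(points, k=1)
```

A real implementation would return sparse connectivity and distance matrices rather than an adjacency dict, but the neighbour-selection idea is the same.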
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+from ... import utils
+
+# from . import datasets, methods, metrics, checks
+
+_task_name = "Batch integration"
Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
+<!--- TODO: add links --->
+
+# Batch integration graph
+
+This is a sub-task of the overall batch integration task. Batch (or data) integration methods integrate datasets across batches that arise from various biological and technical sources. Methods that integrate batches typically produce one or more of three output types: a corrected feature matrix, a joint embedding across batches, and/or an integrated cell-cell similarity graph (e.g., a kNN graph). This sub-task focuses on all methods that can output integrated graphs, and includes methods that canonically output the other two data formats, with subsequent postprocessing to generate a graph. Other sub-tasks for batch integration can be found for:
+
+* [embeddings](../batch_integration_embed/), and
+* [corrected features]()
+
+This sub-task was taken from a [benchmarking study of data integration methods](https://www.biorxiv.org/content/10.1101/2020.05.22.111161v2).
+
+## API
+
+Datasets should contain the following attributes:
+
+* `adata.obs["batch"]` with the batch covariate,
+* `adata.obs["label"]` with the cell identity label,
+* `adata.layers['counts']` with raw, integer UMI count data,
+* `adata.obsm['X_uni']` with the PCA embedding of the unintegrated representation,
+* `adata.obsp['uni_connectivities']` with an unintegrated connectivity matrix generated by `scanpy.pp.neighbors()`, and
+* `adata.X` with log-normalized data.
+
+Methods can take anything from datasets as input and should assign output to:
+
+* `adata.obsp['connectivities']` and `adata.obsp['distances']`, or
+* `adata.uns['neighbors']['connectivities']` and `adata.uns['neighbors']['distances']`.
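As a quick illustration of this API, the attribute layout a dataset provides and the slots a method fills can be mocked with a simple stand-in object. This is a hypothetical mock for exposition only, not a real `anndata.AnnData`:

```python
from types import SimpleNamespace

# Hypothetical mock mirroring only the attribute layout listed above.
adata = SimpleNamespace(
    obs={"batch": ["batch1", "batch1", "batch2"], "label": ["T", "B", "T"]},
    obsm={"X_uni": [[0.0, 0.1], [0.2, 0.3], [0.4, 0.5]]},
    obsp={"uni_connectivities": "placeholder"},
    layers={"counts": "placeholder"},
    X="placeholder",
)

# A method writes its integrated graph into the first accepted location:
adata.obsp["connectivities"] = "placeholder"
adata.obsp["distances"] = "placeholder"

def fits_method_api(adata):
    """Sketch of the check a metric could rely on (obsp variant only)."""
    return "connectivities" in adata.obsp and "distances" in adata.obsp
```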
+
+Please note that most methods do not use cell type labels, which improves their usability.
+
+The `openproblems-python-batch-integration` docker container is used for the methods that can be installed without package conflicts. (NOTE: add additional containers here) For R methods, the `openproblems-r-extras` container is used.
+
+Methods are run in four different scenarios that include scaling and highly variable gene selection:
+
+* `full_unscaled`
+* `hvg_unscaled`
+* `full_scaled`
+* `hvg_scaled`
+
+Functions for scaling and highly variable gene selection per batch are reused from [`scib`](https://github.com/theislab/scib). Additionally, method wrappers are reused from `scib` where possible.
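The four scenarios are simply the cross-product of gene selection (all genes vs. highly variable genes) and scaling; a minimal sketch:

```python
from itertools import product

# full = all genes, hvg = highly variable genes only;
# each is run with and without per-batch scaling.
scenarios = [
    f"{genes}_{scaling}"
    for genes, scaling in product(("full", "hvg"), ("unscaled", "scaled"))
]
```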
43+
44+
Metrics can compare:
45+
* `adata.obsp['connectivities']` to `adata.obs['uni_connectivies']`,
46+
* `adata.obsp['connectivities']` to `adata.obs['label']`, and/or
47+
* `adata.obsp['connectivities']` to `adata.obs['batch']`.
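To make the graph-to-batch comparison concrete, here is a toy sketch of one way a metric could score an integrated graph against the batch covariate: the fraction of each cell's neighbours drawn from a different batch (higher means better mixing). This is an illustrative assumption, not one of the task's actual metrics:

```python
def batch_mixing(neighbors, batches):
    """Toy batch-mixing score (NOT an openproblems metric).

    neighbors[i] -- list of neighbour indices of cell i in the graph
    batches[i]   -- batch label of cell i
    """
    per_cell = [
        sum(batches[j] != batches[i] for j in nbrs) / len(nbrs)
        for i, nbrs in enumerate(neighbors)
    ]
    # Average over cells: 0 = no mixing, 1 = all neighbours cross-batch.
    return sum(per_cell) / len(per_cell)

# Three cells, two batches; each cell's graph neighbours:
score = batch_mixing([[1, 2], [0, 2], [0, 1]], ["a", "a", "b"])
```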
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+from .... import utils
+from . import api
+from . import datasets
+from . import methods
+from . import metrics
+
+_task_name = "Batch integration graph"
+
+DATASETS = utils.get_callable_members(datasets)
+METHODS = utils.get_callable_members(methods)
+METRICS = utils.get_callable_members(metrics)
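`utils.get_callable_members` itself is not part of this diff; a plausible sketch of what such a helper does (an assumption about its behaviour, not the actual openproblems implementation) is:

```python
import types

def get_callable_members(module):
    """Collect a module's public callables (hypothetical sketch)."""
    return [
        obj
        for name, obj in vars(module).items()
        if callable(obj) and not name.startswith("_")
    ]

# Usage on a throwaway module:
mod = types.ModuleType("demo")
mod.my_metric = lambda adata: 0.5  # public -> collected
mod._helper = lambda: None         # private -> skipped
members = get_callable_members(mod)
```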
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
+from .datasets.immune import immune_batch
+
+
+def check_dataset(adata):
+    """Check that dataset output fits expected API."""
+
+    assert "X_uni" in adata.obsm
+    assert "batch" in adata.obs
+    assert "labels" in adata.obs
+    assert "uni_connectivities" in adata.obsp
+
+    return True
+
+
+def check_method(adata):
+    """Check that method output fits expected API."""
+    assert "connectivities" in adata.obsp
+    assert "distances" in adata.obsp
+    return True
+
+
+def sample_dataset():
+    """Create a simple dataset to use for testing methods in this task."""
+    adata = immune_batch(True)
+    # print(adata.obs.columns)
+
+    return adata
+
+
+def sample_method(adata):
+    """Create sample method output for testing metrics in this task."""
+    import scanpy as sc
+
+    sc.pp.neighbors(adata, use_rep="X_uni")
+    return adata
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+from .immune import immune_batch
+from .pancreas import pancreas_batch
