|
| 1 | +<!--- TODO: add links ---> |
| 2 | + |
| 3 | +# Batch integration graph |
| 4 | + |
| 5 | +This is a sub-task of the overall batch integration task. Batch (or data) integration methods integrate datasets across batches that arise from various biological and technical sources. These methods typically produce one or more of three output types: a corrected feature matrix, a joint embedding across batches, and an integrated cell-cell similarity graph (e.g., a kNN graph). This sub-task covers all methods that can output an integrated graph, including methods that canonically output one of the other two formats and whose output is postprocessed into a graph. The other batch integration sub-tasks cover:
| 6 | + |
| 7 | +* [embeddings](../batch_integration_embed/), and |
| 8 | +* [corrected features]() |
| 9 | + |
| 10 | +This sub-task was taken from a [benchmarking study of data integration methods](https://www.biorxiv.org/content/10.1101/2020.05.22.111161v2). |
| 11 | + |
| 12 | + |
| 13 | +## API |
| 14 | + |
| 15 | +Datasets should contain the following attributes: |
| 16 | + |
| 17 | +* `adata.obs['batch']` with the batch covariate,
| 18 | +* `adata.obs['label']` with the cell identity label,
| 19 | +* `adata.layers['counts']` with raw, integer UMI count data,
| 20 | +* `adata.obsm['X_uni']` with the PCA embedding of the unintegrated data,
| 21 | +* `adata.obsp['uni_connectivities']` with an unintegrated connectivity matrix generated
| 22 | +  by `scanpy.pp.neighbors()`, and
| 23 | +* `adata.X` with log-normalized data.
| 24 | + |
| 25 | +Methods can take anything from datasets as input and should assign output to: |
| 26 | +* `adata.obsp['connectivities']` and `adata.obsp['distances']`, or |
| 27 | +* `adata.uns['neighbors']['connectivities']` and `adata.uns['neighbors']['distances']`. |
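As a rough illustration of the expected output, the sketch below builds a symmetric kNN graph from a small embedding using only the standard library. Plain nested lists stand in for the sparse matrices in `adata.obsp`. The function name `knn_graph` and the binary connectivity weights are assumptions made for illustration, not the weighting used by `scanpy.pp.neighbors()`.

```python
import math


def knn_graph(embedding, k):
    """Return (connectivities, distances) as dense n x n nested lists.

    Hypothetical helper: binary, symmetrized kNN connectivities.
    """
    n = len(embedding)
    distances = [[0.0] * n for _ in range(n)]
    connectivities = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Euclidean distance from cell i to every other cell
        dists = [
            (math.dist(embedding[i], embedding[j]), j)
            for j in range(n)
            if j != i
        ]
        for d, j in sorted(dists)[:k]:
            # keep the distance only for the k nearest neighbours of cell i
            distances[i][j] = d
            # symmetrize so that the graph is undirected
            connectivities[i][j] = 1.0
            connectivities[j][i] = 1.0
    return connectivities, distances


# Toy embedding: two well-separated pairs of cells
embedding = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
connectivities, distances = knn_graph(embedding, k=1)
```

In an actual method, the two results would be stored as sparse matrices in `adata.obsp['connectivities']` and `adata.obsp['distances']`.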
| 28 | + |
| 29 | +Note that most methods do not use cell type labels, so they can also be applied to data without annotations.
| 30 | + |
| 31 | +The `openproblems-python-batch-integration` Docker container is used for the methods that
| 32 | +can be installed without package conflicts. (NOTE: add additional containers here) |
| 33 | +For R methods, the `openproblems-r-extras` |
| 34 | +container is used. |
| 35 | + |
| 36 | +Methods are run in four scenarios that combine feature selection (all genes or highly variable genes only) with optional scaling:
| 37 | +* `full_unscaled` |
| 38 | +* `hvg_unscaled` |
| 39 | +* `full_scaled` |
| 40 | +* `hvg_scaled` |
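The four scenario names can be read as the cross product of two choices. The sketch below enumerates them. The flag names `select_hvg` and `scale` are hypothetical and chosen only for illustration; they are not taken from the task code.

```python
from itertools import product


def scenario_settings():
    """Map each scenario name to hypothetical preprocessing flags."""
    settings = {}
    for features, scaling in product(("full", "hvg"), ("unscaled", "scaled")):
        settings[f"{features}_{scaling}"] = {
            # restrict the input to highly variable genes
            "select_hvg": features == "hvg",
            # scale features before integration
            "scale": scaling == "scaled",
        }
    return settings


scenarios = scenario_settings()
```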
| 41 | + |
| 42 | +Functions for per-batch scaling and highly variable gene selection are reused from [`scib`](https://github.com/theislab/scib). Method wrappers are also reused from `scib` where possible.
| 43 | + |
| 44 | +Metrics can compare: |
| 45 | +* `adata.obsp['connectivities']` to `adata.obsp['uni_connectivities']`,
| 46 | +* `adata.obsp['connectivities']` to `adata.obs['label']`, and/or |
| 47 | +* `adata.obsp['connectivities']` to `adata.obs['batch']`. |
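To make these comparisons concrete, the sketch below computes two toy scores from a graph given as an edge list: the fraction of edges joining cells with the same label, and the fraction joining cells from different batches. The function `edge_fraction` is a hypothetical illustration, not a metric from the benchmark, and plain lists stand in for the integrated connectivity graph, `adata.obs['label']`, and `adata.obs['batch']`.

```python
def edge_fraction(edges, annotation, same):
    """Fraction of edges whose endpoints agree (same=True) or differ (same=False)."""
    if not edges:
        return 0.0
    hits = sum((annotation[i] == annotation[j]) == same for i, j in edges)
    return hits / len(edges)


# Toy graph over four cells, given as undirected edges between cell indices
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
labels = ["T", "T", "B", "B"]
batches = ["b1", "b2", "b1", "b2"]

# Higher values suggest that cell identity is preserved in the graph
label_agreement = edge_fraction(edges, labels, same=True)
# Higher values suggest that batches are well mixed in the graph
batch_mixing = edge_fraction(edges, batches, same=False)
```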