
Commit 3d3eb22

Authored by danielStrobl, github-actions[bot], and scottgigante-immunai
Use graph and embedding metrics for feature and embedding subtask (#807)
* wrappers for output generation
* pre-commit
* add pca to sample feature task dataset
* pre-commit
* Update api.py
* bugfixes
* pre-commit
* flake8 import
* pre-commit
* test other syntax
* pre-commit
* disable flake8 for long import
* pre-commit
* added whitespace
* pre-commit
* Address flake8
* pre-commit
* address flake8
* pre-commit
* flake8
* Fix syntax
* pre-commit
* pre-commit
* graph conn flake8
* pre-commit
* clean up gitignore
* refactor for readability
* require uncorrected PCA for feature task
* pre-commit

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Scott Gigante <84813314+scottgigante-immunai@users.noreply.github.com>
Co-authored-by: Scott Gigante <scott.gigante@immunai.com>
Former-commit-id: fe18dfb
Parent: 678b400 · Commit: 3d3eb22

20 files changed: 464 additions & 7 deletions

.gitignore
Lines changed: 2 additions & 4 deletions

@@ -146,14 +146,12 @@ nf-openproblems

 # Editor
 .idea
+.vscode

 scratch/
 openproblems/results/
 openproblems/work/
 batch_embed.txt
-immune.h5ad
+*.h5ad

-immune.h5ad
-batch_embed.txt
-.vscode/launch.json
 run_bbknn.py
Lines changed: 4 additions & 0 deletions

@@ -1,6 +1,10 @@
+from .ari import ari
 from .cc_score import cc_score
+from .graph_connectivity import graph_connectivity
+from .iso_label_f1 import isolated_labels_f1
 from .iso_label_sil import isolated_labels_sil
 from .kBET import kBET
+from .nmi import nmi
 from .pcr import pcr
 from .sil_batch import silhouette_batch
 from .silhouette import silhouette
Lines changed: 27 additions & 0 deletions

@@ -0,0 +1,27 @@
+from .....tools.decorators import metric
+from ...batch_integration_graph import metrics as graph_metrics
+
+"""
+The Rand index compares the overlap of two clusterings;
+it considers both correct clustering overlaps while also counting correct
+disagreements between two clusterings.
+Similar to NMI, we compared the cell-type labels with the NMI-optimized
+Louvain clustering computed on the integrated dataset.
+The adjustment of the Rand index corrects for randomly correct labels.
+An ARI of 0 or 1 corresponds to random labeling or a perfect match,
+respectively.
+We also used the scikit-learn (v.0.22.1) implementation of the ARI.
+"""
+
+
+@metric(
+    metric_name="ARI",
+    maximize=True,
+    paper_reference="luecken2022benchmarking",
+    image="openproblems-r-pytorch",
+)
+def ari(adata):
+    from scanpy.pp import neighbors
+
+    neighbors(adata, use_rep="X_emb")
+    return graph_metrics.ari(adata)
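As a rough illustration of what the wrapped metric computes, here is a minimal sketch of the adjusted Rand index using scikit-learn (the `graph_metrics.ari` helper above additionally performs NMI-optimized Louvain clustering on the kNN graph; the labels below are toy data):

```python
from sklearn.metrics import adjusted_rand_score

# Toy cell-type labels vs. a clustering of the same cells.
labels = [0, 0, 1, 1, 2, 2]
clusters = [1, 1, 0, 0, 2, 2]  # same partition, different cluster ids

# ARI is permutation-invariant: identical partitions score 1.0,
# and random labelings score close to 0 after the chance adjustment.
print(adjusted_rand_score(labels, clusters))  # 1.0
```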
Lines changed: 33 additions & 0 deletions

@@ -0,0 +1,33 @@
+from .....tools.decorators import metric
+from ...batch_integration_graph import metrics as graph_metrics
+
+"""
+The graph connectivity metric assesses whether the kNN graph representation,
+G, of the integrated data directly connects all cells with the same cell
+identity label. For each cell identity label c, we created the subset kNN
+graph G(Nc;Ec) to contain only cells from a given label. Using these subset
+kNN graphs, we computed the graph connectivity score using the equation:
+
+gc = 1/|C| Σ_{c∈C} |LCC(G(Nc;Ec))| / |Nc|
+
+Here, C represents the set of cell identity labels, |LCC()| is the number
+of nodes in the largest connected component of the graph, and |Nc| is the
+number of nodes with cell identity c. The resultant score has a range
+of (0;1], where 1 indicates that all cells with the same cell identity
+are connected in the integrated kNN graph, and the lowest possible score
+indicates a graph where no cell is connected. As this score is computed
+on the kNN graph, it can be used to evaluate all integration outputs.
+"""
+
+
+@metric(
+    metric_name="Graph connectivity",
+    paper_reference="luecken2022benchmarking",
+    maximize=True,
+    image="openproblems-r-pytorch",
+)
+def graph_connectivity(adata):
+    from scanpy.pp import neighbors
+
+    neighbors(adata, use_rep="X_emb")
+    return graph_metrics.graph_connectivity(adata)
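The equation in the docstring can be sketched directly with scipy. This is a toy, self-contained illustration of the formula, not the implementation the wrapped `graph_metrics.graph_connectivity` call delegates to:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def graph_connectivity_score(knn, labels):
    """gc = mean over labels c of |LCC(subgraph of label-c cells)| / |N_c|."""
    labels = np.asarray(labels)
    scores = []
    for c in np.unique(labels):
        mask = labels == c
        sub = knn[mask][:, mask]  # subset kNN graph G(Nc;Ec)
        _, comp = connected_components(csr_matrix(sub), directed=False)
        # Fraction of label-c cells inside the largest connected component.
        scores.append(np.bincount(comp).max() / mask.sum())
    return float(np.mean(scores))

# Toy adjacency: label 0 (cells 0,1) fully connected;
# label 1 (cells 2,3) has no edges, so its LCC covers half its cells.
knn = np.array([
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
])
print(graph_connectivity_score(knn, [0, 0, 1, 1]))  # (1.0 + 0.5) / 2 = 0.75
```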
Lines changed: 38 additions & 0 deletions

@@ -0,0 +1,38 @@
+from .....tools.decorators import metric
+from ...batch_integration_graph import metrics as graph_metrics
+
+"""
+We developed two isolated label scores to evaluate how well the data integration methods
+dealt with cell identity labels shared by few batches. Specifically, we identified
+isolated cell labels as the labels present in the least number of batches in the
+integration task.
+The score evaluates how well these isolated labels separate from other cell identities.
+We implemented the isolated label metric in two versions:
+(1) the best clustering of the isolated label (F1 score) and
+(2) the global ASW of the isolated label. For the cluster-based score,
+we first optimize the cluster assignment of the isolated label using the F1 score
+across louvain clustering resolutions ranging from 0.1 to 2 in resolution steps of 0.1.
+The optimal F1 score for the isolated label is then used as the metric score.
+The F1 score is a weighted mean of precision and recall given by the equation:
+F1 = 2 × (precision × recall) / (precision + recall).
+
+It returns a value between 0 and 1,
+where 1 shows that all of the isolated label cells and no others are captured in
+the cluster. For the isolated label ASW score, we compute the ASW of isolated
+versus nonisolated labels on the PCA embedding (ASW metric above) and scale this
+score to be between 0 and 1. The final score for each metric version consists of
+the mean isolated score of all isolated labels.
+"""
+
+
+@metric(
+    metric_name="Isolated label F1",
+    paper_reference="luecken2022benchmarking",
+    maximize=True,
+    image="openproblems-r-pytorch",
+)
+def isolated_labels_f1(adata):
+    from scanpy.pp import neighbors
+
+    neighbors(adata, use_rep="X_emb")
+    return graph_metrics.isolated_labels_f1(adata)
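The cluster-optimization step in the docstring can be sketched for a single clustering: treat each cluster as a binary predictor of the isolated label and keep the best F1. This toy stand-in omits the Louvain resolution sweep the repository's metric performs:

```python
from sklearn.metrics import f1_score

def isolated_label_f1(labels, clusters, isolated):
    """Best F1 over clusters, scoring each cluster as a detector of `isolated`."""
    truth = [int(lab == isolated) for lab in labels]
    best = 0.0
    for c in set(clusters):
        pred = [int(k == c) for k in clusters]
        best = max(best, f1_score(truth, pred, zero_division=0))
    return best

labels = ["iso", "iso", "a", "a", "b"]
clusters = [0, 0, 1, 1, 2]
# Cluster 0 contains exactly the "iso" cells, so the best F1 is perfect.
print(isolated_label_f1(labels, clusters, "iso"))  # 1.0
```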
Lines changed: 26 additions & 0 deletions

@@ -0,0 +1,26 @@
+from .....tools.decorators import metric
+from ...batch_integration_graph import metrics as graph_metrics
+
+"""NMI compares the overlap of two clusterings.
+We used NMI to compare the cell-type labels with Louvain clusters computed on
+the integrated dataset. The overlap was scaled using the mean of the entropy terms
+for cell-type and cluster labels. Thus, NMI scores of 0 or 1 correspond to uncorrelated
+clustering or a perfect match, respectively. We performed optimized Louvain clustering
+for this metric to obtain the best match between clusters and labels.
+Louvain clustering was performed at a resolution range of 0.1 to 2 in steps of 0.1,
+and the clustering output with the highest NMI with the label set was used. We used
+the scikit-learn (v.0.22.1) implementation of NMI.
+"""
+
+
+@metric(
+    metric_name="NMI",
+    paper_reference="luecken2022benchmarking",
+    maximize=True,
+    image="openproblems-r-pytorch",
+)
+def nmi(adata):
+    from scanpy.pp import neighbors
+
+    neighbors(adata, use_rep="X_emb")
+    return graph_metrics.nmi(adata)
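A minimal sketch of the underlying score, using the scikit-learn implementation the docstring references (the resolution sweep over Louvain clusterings is handled by the wrapped `graph_metrics.nmi`):

```python
from sklearn.metrics import normalized_mutual_info_score

labels = [0, 0, 1, 1]
clusters = [1, 1, 0, 0]  # same partition, relabeled

# Arithmetic-mean normalization scales the overlap by the mean of the
# entropy terms for the two labelings, as described in the docstring.
print(normalized_mutual_info_score(labels, clusters, average_method="arithmetic"))  # 1.0
```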

openproblems/tasks/_batch_integration/batch_integration_feature/README.md
Lines changed: 1 addition & 0 deletions

@@ -42,6 +42,7 @@ Datasets should contain the following attributes:

 * `adata.obs["batch"]` with the batch covariate, and
 * `adata.obs["label"]` with the cell identity label
+* `adata.obs["X_uni_pca"]` with a PCA embedding of the uncorrected data
 * `adata.layers['counts']` with raw, integer UMI count data,
 * `adata.layers['log_normalized']` with log-normalized data and
 * `adata.X` with log-normalized data

openproblems/tasks/_batch_integration/batch_integration_feature/api.py
Lines changed: 7 additions & 2 deletions

@@ -1,6 +1,9 @@
+from ....tools.decorators import dataset
 from .._common import api

-check_dataset = api.check_dataset
+import functools
+
+check_dataset = functools.partial(api.check_dataset, do_check_pca=True)


 def check_method(adata, is_baseline=False):
@@ -11,7 +14,9 @@ def check_method(adata, is_baseline=False):
     return True


-sample_dataset = api.sample_dataset
+@dataset()
+def sample_dataset():
+    return api.sample_dataset(run_pca=True)


 def sample_method(adata):
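The `functools.partial` change above pins an extra keyword argument onto the shared checker while keeping its call signature for callers. A small sketch of the pattern, with a hypothetical stand-in for `api.check_dataset` (the real function lives in the task's `_common/api.py`):

```python
import functools

# Hypothetical stand-in for the shared api.check_dataset.
def check_dataset(adata, do_check_pca=False):
    """Pretend dataset check: require the uncorrected-PCA key when asked."""
    if do_check_pca:
        return "X_uni_pca" in adata
    return True

# The feature subtask pins do_check_pca=True; callers still pass only `adata`.
check_dataset_feature = functools.partial(check_dataset, do_check_pca=True)

print(check_dataset_feature({"X_uni_pca": None}))  # True
print(check_dataset_feature({}))  # False
```

The same effect could be had with a wrapper `def`, but `partial` keeps the binding explicit and one line long.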
Lines changed: 10 additions & 0 deletions

@@ -1 +1,11 @@
+from .ari import ari
+from .cc_score import cc_score
+from .graph_connectivity import graph_connectivity
 from .hvg_conservation import hvg_conservation
+from .iso_label_f1 import isolated_labels_f1
+from .iso_label_sil import isolated_labels_sil
+from .kBET import kBET
+from .nmi import nmi
+from .pcr import pcr
+from .sil_batch import silhouette_batch
+from .silhouette import silhouette
Lines changed: 29 additions & 0 deletions

@@ -0,0 +1,29 @@
+from .....tools.decorators import metric
+from ...batch_integration_graph import metrics as graph_metrics
+
+"""
+The Rand index compares the overlap of two clusterings;
+it considers both correct clustering overlaps while also counting correct
+disagreements between two clusterings.
+Similar to NMI, we compared the cell-type labels with the NMI-optimized
+Louvain clustering computed on the integrated dataset.
+The adjustment of the Rand index corrects for randomly correct labels.
+An ARI of 0 or 1 corresponds to random labeling or a perfect match,
+respectively.
+We also used the scikit-learn (v.0.22.1) implementation of the ARI.
+"""
+
+
+@metric(
+    metric_name="ARI",
+    maximize=True,
+    paper_reference="luecken2022benchmarking",
+    image="openproblems-r-pytorch",
+)
+def ari(adata):
+    from scanpy.pp import neighbors
+    from scanpy.tl import pca
+
+    adata.obsm["X_emb"] = pca(adata.X)
+    neighbors(adata, use_rep="X_emb")
+    return graph_metrics.ari(adata)
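The feature-task variant differs from the embedding-task one in a single step: it first reduces the feature matrix with PCA and stores the result as the embedding on which the kNN graph is built. A toy sketch of that step using scikit-learn in place of the commit's `scanpy.tl.pca`:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy feature matrix: 20 "cells" x 10 "genes".
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))

# Reduce features to a low-dimensional embedding, mirroring
# `adata.obsm["X_emb"] = pca(adata.X)` in the diff above.
X_emb = PCA(n_components=5).fit_transform(X)
print(X_emb.shape)  # (20, 5)
```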
