Confusion about the correct formulation and interpretation of the co_occurrence calculation method in gr.co_occurrence #1205

@HXD3D0323

Report

Dear Authors,
Thank you for developing and sharing this method. While reading the implementation of the co-occurrence calculation in src/squidpy/gr/_ppatterns.py, I have been trying to understand the normalization strategy, in particular why the denominator is row_sums[c, r] rather than the total number of neighbors surrounding center cells of type c. The relevant code is:
```python
occ_prob = np.zeros((k, k, l_val), dtype=np.float64)

row_sums = counts.sum(axis=0)
totals = row_sums.sum(axis=0)

for r in prange(l_val):
    probs = row_sums[:, r] / totals[r]

    for c in range(k):
        for i in range(k):
            if probs[i] != 0.0 and row_sums[c, r] != 0.0:
                occ_prob[i, c, r] = (
                    counts[c, i, r] / row_sums[c, r]
                ) / probs[i]
```
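
Written out as a single expression, the loop above appears to compute the following. This is only my transcription of the snippet, not a formula taken from the documentation:

$$ \mathrm{occ\_prob}[i,c,r] = \frac{\mathrm{counts}[c,i,r] \,/\, \mathrm{row\_sums}[c,r]}{\mathrm{row\_sums}[i,r] \,/\, \mathrm{totals}[r]}, \qquad \mathrm{totals}[r] = \sum_{i'} \mathrm{row\_sums}[i',r]. $$

The numerator is the ratio my question is about; the denominator of the outer fraction is `probs[i]`.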

From the code, counts[c, i, r] appears to represent the number of neighboring pairs within radius r, where the center cell is of type c and the neighboring cell is of type i.

The normalization term is defined as row_sums = counts.sum(axis=0), which yields

$$ \mathrm{row\_sums}[i,r] = \sum_c \mathrm{counts}[c,i,r]. $$

Therefore, row_sums[c, r] corresponds to

$$ \sum_{c'} \mathrm{counts}[c',c,r], $$

which seems to represent the total number of times cell type $c$ appears as a neighboring cell across all center-cell types at radius $r$.
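
To check my reading of the axis convention, I tried a small NumPy example with made-up numbers. The layout `counts[center, neighbor, r]` is my assumption from the snippet, and the values are purely illustrative:

```python
import numpy as np

# Toy tensor: k = 2 cell types, a single radius (l_val = 1).
# Assumed layout (my reading of the snippet): counts[center, neighbor, r].
counts = np.zeros((2, 2, 1), dtype=np.float64)
counts[0, 0, 0] = 5.0  # center 0, neighbor 0
counts[0, 1, 0] = 1.0  # center 0, neighbor 1
counts[1, 0, 0] = 3.0  # center 1, neighbor 0
counts[1, 1, 0] = 2.0  # center 1, neighbor 1

# Summing over axis 0 collapses the center axis, so the result
# is indexed by (neighbor type, radius).
row_sums = counts.sum(axis=0)

# For comparison: summing over axis 1 collapses the neighbor axis,
# so the result is indexed by (center type, radius).
center_sums = counts.sum(axis=1)

print(row_sums[:, 0])     # totals per neighbor type
print(center_sums[:, 0])  # totals per center type
```

Here `row_sums[:, 0]` is `[8, 3]` (5 + 3 and 1 + 2), whereas the per-center totals are `[6, 5]`, so under this layout `counts.sum(axis=0)` is a sum over center types.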

My question concerns the denominator

$$ \frac{\mathrm{counts}[c,i,r]} {\mathrm{row\_sums}[c,r]}. $$

If the goal is to estimate a conditional probability such as

$$ P(\mathrm{neighbor}=i \mid \mathrm{center}=c), $$

I would have expected the denominator to be

$$ \sum_i \mathrm{counts}[c,i,r], $$

i.e., the total number of neighbors observed around center cells of type $c$, since this corresponds directly to conditioning on the center-cell type.

In contrast, the current implementation uses

$$ \sum_{c'} \mathrm{counts}[c',c,r], $$

which appears to normalize by the frequency with which cell type $c$ occurs as a neighboring cell rather than as a center cell.
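
To make the difference concrete, here is a minimal sketch comparing the two denominators on toy data. The helper names `conditional_by_center` and `ratio_as_in_snippet` are mine, not squidpy's, and the layout `counts[center, neighbor, r]` is again my assumption. One thing I noticed: if `counts` is symmetric in its first two axes (I have not verified whether the counting step guarantees this), the two denominators coincide.

```python
import numpy as np


def conditional_by_center(counts):
    """counts[c, i, r] divided by the sum over neighbor types i' of counts[c, i', r]."""
    denom = counts.sum(axis=1, keepdims=True)  # shape (k, 1, l_val)
    return np.divide(counts, denom, out=np.zeros_like(counts), where=denom != 0)


def ratio_as_in_snippet(counts):
    """counts[c, i, r] divided by row_sums[c, r], with row_sums = counts.sum(axis=0)."""
    row_sums = counts.sum(axis=0)       # shape (k, l_val)
    denom = row_sums[:, np.newaxis, :]  # denom[c, i, r] == row_sums[c, r]
    return np.divide(counts, denom, out=np.zeros_like(counts), where=denom != 0)


# Asymmetric toy counts (illustrative values only).
asym = np.array([[[5.0], [1.0]],
                 [[3.0], [2.0]]])

# Symmetrized version: counts[c, i, r] == counts[i, c, r].
sym = asym + asym.transpose(1, 0, 2)

# Per-center sums over neighbor types; a conditional distribution sums to 1.
print(conditional_by_center(asym).sum(axis=1)[:, 0])
print(ratio_as_in_snippet(asym).sum(axis=1)[:, 0])

# With symmetric counts the two normalizations agree.
print(np.allclose(conditional_by_center(sym), ratio_as_in_snippet(sym)))
```

On the asymmetric input, only the first normalization gives rows that sum to 1 over neighbor types. On the symmetric input the two agree, so whether the current denominator matters in practice may depend on how `counts` is accumulated.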

I would greatly appreciate any explanation of the statistical reasoning behind this normalization strategy.

Thank you very much for your time and for making the implementation publicly available.
