|
| 1 | +# Asymptotic Standard Error of Somers' D |
| 2 | + |
| 3 | +Reference: Goktas, A., Oznur, I., 2011. *A Comparison of the Most Commonly Used Measures of Association for Doubly Ordered Square Contingency Tables via Simulation.* Metodoloski zvezki 8 (1), 17-37. |
| 4 | + |
| 5 | +## 1. Setup |
| 6 | + |
| 7 | +Given $n$ paired observations $(y_i, x_i)$, build the contingency table $\mathbf{F}$ where rows correspond to sorted unique values of $Y$ and columns to sorted unique values of $X$. |
| 8 | + |
| 9 | +| Symbol | Definition | |
| 10 | +|--------|-----------| |
| 11 | +| $f_{ij}$ | Cell count: number of observations in row $i$, column $j$ | |
| 12 | +| $a$ | Number of rows (unique $Y$ values) | |
| 13 | +| $b$ | Number of columns (unique $X$ values) | |
| 14 | +| $r_i$ | Row sum: $r_i = \sum_j f_{ij}$ | |
| 15 | +| $W$ | Sample size: $W = \sum_{ij} f_{ij} = n$ | |
| 16 | + |
| 17 | +## 2. Concordant and Discordant Counts |
| 18 | + |
| 19 | +For each cell $(i, j)$ of the contingency table, define: |
| 20 | + |
| 21 | +**$C_{ij}$** = number of observations that would form a **concordant** pair with any observation in cell $(i, j)$. These are observations in cells strictly above-left or strictly below-right: |
| 22 | + |
| 23 | +$$C_{ij} = \sum_{\substack{r > i \\ s > j}} f_{rs} \;+\; \sum_{\substack{r < i \\ s < j}} f_{rs}$$ |
| 24 | + |
| 25 | +**$D_{ij}$** = number of observations that would form a **discordant** pair: |
| 26 | + |
| 27 | +$$D_{ij} = \sum_{\substack{r > i \\ s < j}} f_{rs} \;+\; \sum_{\substack{r < i \\ s > j}} f_{rs}$$ |
| 28 | + |
| 29 | +These are computed efficiently using 2D prefix sums. |
| 30 | + |
| 31 | +## 3. Somers' D Statistic |
| 32 | + |
| 33 | +Total concordant and discordant pairs: |
| 34 | + |
| 35 | +$$P = \sum_{ij} f_{ij} \cdot C_{ij} \qquad Q = \sum_{ij} f_{ij} \cdot D_{ij}$$ |
| 36 | + |
| 37 | +Denominator - number of ordered pairs untied on $Y$: |
| 38 | + |
| 39 | +$$D_r = W^2 - \sum_i r_i^2$$ |
| 40 | + |
| 41 | +For binary $Y$ with $m$ positives and $W - m$ negatives: $D_r = 2m(W - m)$. |
| 42 | + |
| 43 | +**Somers' D:** |
| 44 | + |
| 45 | +$$D = \frac{P - Q}{D_r}$$ |
| 46 | + |
| 47 | +For binary targets, this equals the Gini coefficient $= 2 \cdot \text{AUC} - 1$. |
| 48 | + |
| 49 | +## 4. Row Midranks |
| 50 | + |
| 51 | +Define the midrank of row $i$ as the average position of observations in that row: |
| 52 | + |
| 53 | +$$R_i = \sum_{k=1}^{i} r_k + \frac{1 - r_i}{2}$$ |
| 54 | + |
| 55 | +This places each row at the center of its range in the marginal ranking. |
| 56 | + |
| 57 | +For example, if $r = [3, 3, 3]$: |
| 58 | + |
| 59 | +- $R_1 = 3 + (1-3)/2 = 2$ |
| 60 | +- $R_2 = 6 + (1-3)/2 = 5$ |
| 61 | +- $R_3 = 9 + (1-3)/2 = 8$ |
| 62 | + |
| 63 | +## 5. ASE via the Delta Method |
| 64 | + |
| 65 | +Somers' D is a **ratio of two U-statistics**: $D = (P - Q) / D_r$. |
| 66 | + |
| 67 | +To get the variance of a ratio $f/g$, the delta method gives: |
| 68 | + |
| 69 | +$$\text{Var}\!\left(\frac{f}{g}\right) \approx \frac{1}{g^2} \cdot \text{Var}(f - D \cdot g)$$ |
| 70 | + |
| 71 | +For each cell $(i, j)$, the **linearized residual** has two parts: |
| 72 | + |
| 73 | +### Part 1: Numerator contribution (scaled) |
| 74 | + |
| 75 | +$$D_r \cdot (C_{ij} - D_{ij})$$ |
| 76 | + |
| 77 | +This is how much cell $(i, j)$ contributes to the net concordance, scaled by the denominator. |
| 78 | + |
| 79 | +### Part 2: Denominator correction |
| 80 | + |
| 81 | +$$(P - Q) \cdot (W - R_i)$$ |
| 82 | + |
| 83 | +Here $(W - R_i)$ captures how much row $i$ contributes to $D_r$ (pairs untied on $Y$). The factor $(P - Q) = D \cdot D_r$ weights this by the current value of $D$. |
| 84 | + |
| 85 | +### Combined residual per cell |
| 86 | + |
| 87 | +$$\varepsilon_{ij} = D_r \cdot (C_{ij} - D_{ij}) \;-\; (P - Q) \cdot (W - R_i)$$ |
| 88 | + |
| 89 | +Note that $R_i$ depends only on the row, not the column - the denominator $D_r$ is determined entirely by the $Y$ marginal distribution. |
| 90 | + |
| 91 | +### ASE formula |
| 92 | + |
| 93 | +The weighted sum of squared residuals, normalized: |
| 94 | + |
| 95 | +$$\boxed{\text{ASE}_1 = \frac{2}{D_r^2} \sqrt{\sum_{ij} f_{ij} \cdot \varepsilon_{ij}^2}}$$ |
| 96 | + |
| 97 | +Expanding: |
| 98 | + |
| 99 | +$$\text{ASE}_1 = \frac{2}{D_r^2} \sqrt{\sum_{ij} f_{ij} \left\{ D_r(C_{ij} - D_{ij}) - (P - Q)(W - R_i) \right\}^2}$$ |
| 100 | + |
| 101 | +The factor of 2 accounts for the symmetry of ordered pairs (each unordered pair is counted twice). |
| 102 | + |
| 103 | +## 6. ASE Under the Null ($\text{ASE}_0$) |
| 104 | + |
| 105 | +Under $H_0\!: D = 0$, the denominator correction term vanishes (since $P - Q = 0$), giving a simpler formula: |
| 106 | + |
| 107 | +$$\boxed{\text{ASE}_0 = \frac{2}{D_r} \sqrt{\sum_{ij} f_{ij} (C_{ij} - D_{ij})^2 \;-\; \frac{(P - Q)^2}{W}}}$$ |
| 108 | + |
| 109 | +This is used for hypothesis testing (is $D$ significantly different from zero?). |
| 110 | + |
| 111 | +## 7. Summary |
| 112 | + |
| 113 | +| Quantity | Formula | Use | |
| 114 | +|----------|---------|-----| |
| 115 | +| $D$ | $(P - Q) \;/\; D_r$ | Point estimate | |
| 116 | +| $\text{ASE}_1$ | $\frac{2}{D_r^2} \sqrt{\sum f_{ij} \cdot \varepsilon_{ij}^2}$ | Confidence intervals | |
| 117 | +| $\text{ASE}_0$ | $\frac{2}{D_r} \sqrt{\sum f_{ij}(C_{ij}-D_{ij})^2 - (P-Q)^2/W}$ | Hypothesis testing | |
| 118 | + |
| 119 | +For a 95% confidence interval: |
| 120 | + |
| 121 | +$$D \;\pm\; z_{0.975} \cdot \text{ASE}_1$$ |
| 122 | + |
| 123 | +For a two-sided test of $H_0\!: D = 0$: |
| 124 | + |
| 125 | +$$z = \frac{D}{\text{ASE}_0}$$ |
| 126 | + |
| 127 | +## 8. Worked Example |
| 128 | + |
| 129 | +A small binary credit risk table: 10 applicants, target $Y \in \{0, 1\}$, WOE-encoded feature $X \in \{1, 2, 3\}$ (three bins). |
| 130 | + |
| 131 | +```python |
| 132 | +import numpy as np |
| 133 | + |
| 134 | +#----------------------------------------------------------------------------------------------- |
| 135 | +# Data |
| 136 | +#----------------------------------------------------------------------------------------------- |
| 137 | +y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float) |
| 138 | +x = np.array([1, 1, 2, 2, 3, 1, 2, 3, 3, 3], dtype=float) |
| 139 | + |
| 140 | +#----------------------------------------------------------------------------------------------- |
| 141 | +# Step 1: Contingency table |
| 142 | +#----------------------------------------------------------------------------------------------- |
| 143 | +# x=1 x=2 x=3 |
| 144 | +# y=0 [ 2 2 1 ] |
| 145 | +# y=1 [ 1 1 3 ] |
| 146 | +CT = np.array([[2, 2, 1], |
| 147 | + [1, 1, 3]], dtype=float) |
| 148 | + |
| 149 | +a, b = CT.shape # 2 rows, 3 columns |
| 150 | +W = CT.sum() # 10 |
| 151 | +r = CT.sum(axis=1) # [5, 5] |
| 152 | + |
| 153 | +print(f"CT:\n{CT}") |
| 154 | +print(f"W={W}, r={r}") |
| 155 | + |
| 156 | +#----------------------------------------------------------------------------------------------- |
| 157 | +# Step 2: Concordant (C) and discordant (D) matrices |
| 158 | +#----------------------------------------------------------------------------------------------- |
| 159 | +# C[i,j] = sum of cells strictly below-right + strictly above-left |
| 160 | +# D[i,j] = sum of cells strictly below-left + strictly above-right |
| 161 | +C = np.zeros_like(CT) |
| 162 | +D = np.zeros_like(CT) |
| 163 | +for i in range(a): |
| 164 | + for j in range(b): |
| 165 | + C[i, j] = CT[i+1:, j+1:].sum() + CT[:i, :j].sum() |
| 166 | + D[i, j] = CT[i+1:, :j].sum() + CT[:i, j+1:].sum() |
| 167 | + |
| 168 | +print(f"\nC (concordant counts):\n{C}") |
| 169 | +print(f"D (discordant counts):\n{D}") |
| 170 | + |
| 171 | +#----------------------------------------------------------------------------------------------- |
| 172 | +# Step 3: Somers' D |
| 173 | +#----------------------------------------------------------------------------------------------- |
| 174 | +P = (CT * C).sum() # total concordant |
| 175 | +Q = (CT * D).sum() # total discordant |
| 176 | +Dr = W**2 - (r**2).sum() # pairs untied on Y |
| 177 | + |
| 178 | +D_val = (P - Q) / Dr |
| 179 | + |
| 180 | +print(f"\nP={P}, Q={Q}, Dr={Dr}") |
| 181 | +print(f"Somers' D = (P-Q)/Dr = {D_val:.6f}") |
| 182 | + |
| 183 | +#----------------------------------------------------------------------------------------------- |
| 184 | +# Step 4: Row midranks |
| 185 | +#----------------------------------------------------------------------------------------------- |
| 186 | +RR = np.cumsum(r) + (1.0 - r) / 2.0 |
| 187 | +print(f"\nRow midranks R = {RR}") |
| 188 | + |
| 189 | +#----------------------------------------------------------------------------------------------- |
| 190 | +# Step 5: ASE (delta method) |
| 191 | +#----------------------------------------------------------------------------------------------- |
| 192 | +RR_mat = np.repeat(RR[:, np.newaxis], b, axis=1) |
| 193 | + |
| 194 | +# Linearized residual per cell |
| 195 | +eps = Dr * (C - D) - (P - Q) * (W - RR_mat) |
| 196 | +print(f"\nResiduals eps:\n{eps}") |
| 197 | + |
| 198 | +ASE = 2.0 / Dr**2 * np.sqrt((CT * eps**2).sum()) |
| 199 | +print(f"\nASE = {ASE:.6f}") |
| 200 | + |
| 201 | +#----------------------------------------------------------------------------------------------- |
| 202 | +# Step 6: ASE under H0 |
| 203 | +#----------------------------------------------------------------------------------------------- |
| 204 | +ASE0 = 2.0 / Dr * np.sqrt((CT * (C - D)**2).sum() - (P - Q)**2 / W) |
| 205 | +print(f"ASE0 = {ASE0:.6f}") |
| 206 | + |
| 207 | +#----------------------------------------------------------------------------------------------- |
| 208 | +# Step 7: 95% confidence interval |
| 209 | +#----------------------------------------------------------------------------------------------- |
| 210 | +print(f"\n95% CI: [{D_val - 1.96*ASE:.4f}, {D_val + 1.96*ASE:.4f}]") |
| 211 | +print(f"z-test (H0: D=0): z = {D_val / ASE0:.4f}") |
| 212 | +``` |
| 213 | + |
| 214 | +Output: |
| 215 | + |
| 216 | +``` |
| 217 | +CT: |
| 218 | +[[2. 2. 1.] |
| 219 | + [1. 1. 3.]] |
| 220 | +W=10.0, r=[5. 5.] |
| 221 | +
|
| 222 | +C (concordant counts): |
| 223 | +[[4. 3. 0.] |
| 224 | + [0. 2. 4.]] |
| 225 | +D (discordant counts): |
| 226 | +[[0. 1. 2.] |
| 227 | + [3. 1. 0.]] |
| 228 | +
|
| 229 | +P=28.0, Q=8.0, Dr=50.0 |
| 230 | +
|
| 231 | +Somers' D = (P-Q)/Dr = 0.400000 |
| 232 | +
|
| 233 | +Row midranks R = [3. 8.] |
| 234 | +
|
| 235 | +Residuals eps: |
| 236 | +[[ 60. -40. -240.] |
| 237 | + [-190. 10. 160.]] |
| 238 | +
|
| 239 | +ASE = 0.340353 |
| 240 | +ASE0 = 0.314960 |
| 241 | +
|
| 242 | +95% CI: [-0.2671, 1.0671] |
| 243 | +z-test (H0: D=0): z = 1.2700 |
| 244 | +``` |
| 245 | + |
| 246 | +## 9. Production Implementation |
| 247 | + |
| 248 | +See `fastwoe.metrics.somersd_se()` - validated against the VUROCS R package with exact numerical agreement across binary, ordinal, and continuous targets. |
0 commit comments