Commit de59596

prepare release for fastwoe 0.1.7

1 parent 59a032a commit de59596
8 files changed, 1748 additions & 106 deletions

File tree

CHANGELOG.md

Lines changed: 35 additions & 0 deletions
@@ -1,5 +1,40 @@
# Changelog

## Version 0.1.7 (2026-04-11)

**WOE Recalibration & Somers' D Standard Errors**

### New Features

- **`finetune(X_new, y_new)`**: WOE recalibration without redevelopment
  - Updates WOE values using new data while preserving the bin structure (edges, categories)
  - Pass a full DataFrame to update all features, or a single-column DataFrame for one
  - Bins absent from the new data retain their original WOE (with a warning)
  - Unknown columns are skipped (with a warning)
  - `update_prior=True` shifts the global prior and warns about stale features
  - Returns `self` for method chaining
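The recalibration idea can be sketched without the library: hold the bin edges fixed and recompute each bin's WOE from the new data. Everything here is illustrative, not the fastwoe internals - the name `woe_recalibrate`, the WOE convention $\ln(\text{event share} / \text{non-event share})$, and the additive smoothing are all assumptions:

```python
import numpy as np

def woe_recalibrate(edges, x_new, y_new, eps=0.5):
    """Recompute per-bin WOE on new data, keeping the bin edges fixed.

    Illustrative sketch: WOE per bin is ln(share of events / share of
    non-events), with additive smoothing `eps` to avoid log(0).
    """
    bins = np.digitize(x_new, edges[1:-1])  # bin index per row; edges stay fixed
    n_bins = len(edges) - 1
    events = np.array([y_new[bins == b].sum() for b in range(n_bins)])
    nonevents = np.array([(bins == b).sum() for b in range(n_bins)]) - events
    ev_share = (events + eps) / (events.sum() + eps * n_bins)
    ne_share = (nonevents + eps) / (nonevents.sum() + eps * n_bins)
    return np.log(ev_share / ne_share)

edges = np.array([-np.inf, 0.0, 1.0, np.inf])   # 3 bins, structure preserved
x_new = np.array([-1.0, -0.5, 0.2, 0.7, 1.5, 2.0])
y_new = np.array([0, 0, 0, 1, 1, 1])
woe = woe_recalibrate(edges, x_new, y_new)
print(woe)  # one updated WOE value per original bin
```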
- **Asymptotic Standard Errors of Somers' D**: Uncertainty quantification via Goktas & Oznur (2011)
  - Contingency-table ASE using per-cell concordant/discordant counts and the delta method
  - Works for binary, ordinal, and continuous targets (for binary targets, Somers' D = Gini)
  - New `somersd_se()` function in `fastwoe.metrics`
  - New fields in `feature_stats_` and `get_feature_stats()`: `somersd_se`, `somersd_ci_lower`, `somersd_ci_upper`
  - Returns NaN gracefully for degenerate inputs

### Internal

- Extracted `_apply_binning_to_column()` helper from `transform()` for reuse by `finetune()`
- Added `_recalibrate_feature()` private method encapsulating per-feature recalibration logic

### Testing

- Added `TestFinetune` class with 13 test cases covering single-feature, full-DataFrame, numerical/categorical, error handling, prior update, missing bins, transform-after-finetune, and method chaining
- All 132 tests passing

### Examples

- Added `examples/notebooks/fastwoe_finetuning.ipynb` demonstrating the recalibration workflow

## Version 0.1.6 (2026-01-07)

**Stable Release: Type Safety, Robustness & Code Quality** 🎯

docs/somersd_ase.md

Lines changed: 248 additions & 0 deletions
@@ -0,0 +1,248 @@
# Asymptotic Standard Error of Somers' D

Reference: Goktas, A., Oznur, I., 2011. *A Comparison of the Most Commonly Used Measures of Association for Doubly Ordered Square Contingency Tables via Simulation.* Metodoloski zvezki 8 (1), 17-37.

## 1. Setup

Given $n$ paired observations $(y_i, x_i)$, build the contingency table $\mathbf{F}$ where rows correspond to sorted unique values of $Y$ and columns to sorted unique values of $X$.

| Symbol | Definition |
|--------|-----------|
| $f_{ij}$ | Cell count: number of observations in row $i$, column $j$ |
| $a$ | Number of rows (unique $Y$ values) |
| $b$ | Number of columns (unique $X$ values) |
| $r_i$ | Row sum: $r_i = \sum_j f_{ij}$ |
| $W$ | Sample size: $W = \sum_{ij} f_{ij} = n$ |
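The table $\mathbf{F}$ can be built directly from the paired arrays. A small sketch (the toy `y`/`x` values are made up for illustration):

```python
import numpy as np

# Paired observations (toy data)
y = np.array([0, 0, 1, 1, 1, 0])
x = np.array([2, 1, 2, 3, 3, 1])

# Map each value to its index among the sorted unique values,
# then count each (row, col) pair into F
rows, yi = np.unique(y, return_inverse=True)
cols, xi = np.unique(x, return_inverse=True)
F = np.zeros((rows.size, cols.size), dtype=int)
np.add.at(F, (yi, xi), 1)  # unbuffered accumulation handles repeated cells

print(F)  # rows follow sorted unique Y, columns sorted unique X
```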
## 2. Concordant and Discordant Counts

For each cell $(i, j)$ of the contingency table, define:

**$C_{ij}$** = number of observations that would form a **concordant** pair with any observation in cell $(i, j)$. These are observations in cells strictly above-left or strictly below-right:

$$C_{ij} = \sum_{\substack{r > i \\ s > j}} f_{rs} \;+\; \sum_{\substack{r < i \\ s < j}} f_{rs}$$

**$D_{ij}$** = number of observations that would form a **discordant** pair:

$$D_{ij} = \sum_{\substack{r > i \\ s < j}} f_{rs} \;+\; \sum_{\substack{r < i \\ s > j}} f_{rs}$$

These sums can be computed efficiently using 2D prefix sums.
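With a padded prefix sum $S$ (where `S[i, j]` holds the sum of `F[:i, :j]`), every quadrant sum - and hence every $C_{ij}$ and $D_{ij}$ - comes out in $O(1)$ per cell instead of re-summing submatrices. A sketch, using the same table as the worked example in Section 8:

```python
import numpy as np

F = np.array([[2, 2, 1],
              [1, 1, 3]], dtype=float)
a, b = F.shape

# Padded 2D prefix sum: S[i, j] = F[:i, :j].sum()
S = np.zeros((a + 1, b + 1))
S[1:, 1:] = F.cumsum(axis=0).cumsum(axis=1)
T = S[a, b]  # grand total

C = np.empty_like(F)
D = np.empty_like(F)
for i in range(a):
    for j in range(b):
        above_left  = S[i, j]
        below_right = T - S[i + 1, b] - S[a, j + 1] + S[i + 1, j + 1]
        below_left  = S[a, j] - S[i + 1, j]
        above_right = S[i, b] - S[i, j + 1]
        C[i, j] = above_left + below_right   # concordant with cell (i, j)
        D[i, j] = below_left + above_right   # discordant with cell (i, j)

print(C)
print(D)
```

Each lookup is a standard inclusion-exclusion over the prefix-sum table, so the whole pass costs $O(ab)$.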
## 3. Somers' D Statistic

Total concordant and discordant pairs:

$$P = \sum_{ij} f_{ij} \cdot C_{ij} \qquad Q = \sum_{ij} f_{ij} \cdot D_{ij}$$

Denominator - the number of ordered pairs untied on $Y$:

$$D_r = W^2 - \sum_i r_i^2$$

For binary $Y$ with $m$ positives and $W - m$ negatives: $D_r = 2m(W - m)$.

**Somers' D:**

$$D = \frac{P - Q}{D_r}$$

For binary targets, this equals the Gini coefficient $= 2 \cdot \text{AUC} - 1$.
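The binary shortcut is a quick numerical sanity check on the general formula (toy row sums, assuming the positive class is the second row):

```python
import numpy as np

# Binary Y: m positives out of W observations
r = np.array([7.0, 3.0])           # row sums: 7 negatives, 3 positives
W = r.sum()
m = r[1]

Dr_general = W**2 - (r**2).sum()   # W^2 - sum_i r_i^2
Dr_binary = 2 * m * (W - m)        # binary shortcut 2m(W - m)
print(Dr_general, Dr_binary)       # both count ordered pairs untied on Y
```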
## 4. Row Midranks

Define the midrank of row $i$ as the average position of observations in that row:

$$R_i = \sum_{k=1}^{i} r_k + \frac{1 - r_i}{2}$$

This places each row at the center of its range in the marginal ranking.

For example, if $r = [3, 3, 3]$:

- $R_1 = 3 + (1-3)/2 = 2$
- $R_2 = 6 + (1-3)/2 = 5$
- $R_3 = 9 + (1-3)/2 = 8$
## 5. ASE via the Delta Method

Somers' D is a **ratio of two U-statistics**: $D = (P - Q) / D_r$.

For the variance of a ratio $f/g$, the delta method gives:

$$\text{Var}\!\left(\frac{f}{g}\right) \approx \frac{1}{g^2} \cdot \text{Var}(f - D \cdot g)$$

For each cell $(i, j)$, the **linearized residual** has two parts:

### Part 1: Numerator contribution (scaled)

$$D_r \cdot (C_{ij} - D_{ij})$$

This is how much cell $(i, j)$ contributes to the net concordance, scaled by the denominator.

### Part 2: Denominator correction

$$(P - Q) \cdot (W - R_i)$$

Here $(W - R_i)$ captures how much row $i$ contributes to $D_r$ (pairs untied on $Y$). The factor $(P - Q) = D \cdot D_r$ weights this by the current value of $D$.

### Combined residual per cell

$$\varepsilon_{ij} = D_r \cdot (C_{ij} - D_{ij}) \;-\; (P - Q) \cdot (W - R_i)$$

Note that $R_i$ depends only on the row, not the column - the denominator $D_r$ is determined entirely by the marginal distribution of $Y$.

### ASE formula

The weighted sum of squared residuals, normalized:

$$\boxed{\text{ASE}_1 = \frac{2}{D_r^2} \sqrt{\sum_{ij} f_{ij} \cdot \varepsilon_{ij}^2}}$$

Expanding:

$$\text{ASE}_1 = \frac{2}{D_r^2} \sqrt{\sum_{ij} f_{ij} \left\{ D_r(C_{ij} - D_{ij}) - (P - Q)(W - R_i) \right\}^2}$$

The factor of 2 accounts for the symmetry of ordered pairs (each unordered pair is counted twice).

## 6. ASE Under the Null ($\text{ASE}_0$)

Under $H_0\!: D = 0$, the denominator correction term vanishes (since $P - Q = 0$), giving a simpler formula:

$$\boxed{\text{ASE}_0 = \frac{2}{D_r} \sqrt{\sum_{ij} f_{ij} (C_{ij} - D_{ij})^2 \;-\; \frac{(P - Q)^2}{W}}}$$

This is the form used for hypothesis testing (is $D$ significantly different from zero?).
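Sections 3-6 combine into one small routine. The sketch below (hypothetical name `somersd_ase`, not the fastwoe API) returns $D$, $\text{ASE}_1$, and $\text{ASE}_0$ from a contingency table:

```python
import numpy as np

def somersd_ase(F):
    """Somers' D with ASE1 (for CIs) and ASE0 (for testing H0: D = 0).

    F: contingency table, rows = sorted unique Y, cols = sorted unique X.
    Sketch of the Goktas & Oznur (2011) formulas, not the fastwoe internals.
    """
    F = np.asarray(F, dtype=float)
    a, b = F.shape
    W = F.sum()
    r = F.sum(axis=1)

    # Per-cell concordant / discordant counts (direct sums for clarity)
    C = np.zeros_like(F)
    D = np.zeros_like(F)
    for i in range(a):
        for j in range(b):
            C[i, j] = F[i+1:, j+1:].sum() + F[:i, :j].sum()
            D[i, j] = F[i+1:, :j].sum() + F[:i, j+1:].sum()

    P, Q = (F * C).sum(), (F * D).sum()
    Dr = W**2 - (r**2).sum()               # pairs untied on Y
    d = (P - Q) / Dr

    R = np.cumsum(r) + (1.0 - r) / 2.0     # row midranks
    eps = Dr * (C - D) - (P - Q) * (W - R[:, None])
    ase1 = 2.0 / Dr**2 * np.sqrt((F * eps**2).sum())
    ase0 = 2.0 / Dr * np.sqrt((F * (C - D)**2).sum() - (P - Q)**2 / W)
    return d, ase1, ase0

d, ase1, ase0 = somersd_ase([[2, 2, 1], [1, 1, 3]])
print(d, ase1, ase0)   # reproduces the worked example in Section 8
```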
## 7. Summary

| Quantity | Formula | Use |
|----------|---------|-----|
| $D$ | $(P - Q) \;/\; D_r$ | Point estimate |
| $\text{ASE}_1$ | $\frac{2}{D_r^2} \sqrt{\sum f_{ij} \cdot \varepsilon_{ij}^2}$ | Confidence intervals |
| $\text{ASE}_0$ | $\frac{2}{D_r} \sqrt{\sum f_{ij}(C_{ij}-D_{ij})^2 - (P-Q)^2/W}$ | Hypothesis testing |

For a 95% confidence interval:

$$D \;\pm\; z_{0.975} \cdot \text{ASE}_1$$

For a two-sided test of $H_0\!: D = 0$:

$$z = \frac{D}{\text{ASE}_0}$$
## 8. Worked Example

A small binary credit risk table: 10 applicants, target $Y \in \{0, 1\}$, WOE-encoded feature $X \in \{1, 2, 3\}$ (three bins).

```python
import numpy as np

#-----------------------------------------------------------------------------------------------
# Data
#-----------------------------------------------------------------------------------------------
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)
x = np.array([1, 1, 2, 2, 3, 1, 2, 3, 3, 3], dtype=float)

#-----------------------------------------------------------------------------------------------
# Step 1: Contingency table
#-----------------------------------------------------------------------------------------------
#        x=1  x=2  x=3
# y=0  [  2    2    1 ]
# y=1  [  1    1    3 ]
CT = np.array([[2, 2, 1],
               [1, 1, 3]], dtype=float)

a, b = CT.shape        # 2 rows, 3 columns
W = CT.sum()           # 10
r = CT.sum(axis=1)     # [5, 5]

print(f"CT:\n{CT}")
print(f"W={W}, r={r}")

#-----------------------------------------------------------------------------------------------
# Step 2: Concordant (C) and discordant (D) matrices
#-----------------------------------------------------------------------------------------------
# C[i,j] = sum of cells strictly below-right + strictly above-left
# D[i,j] = sum of cells strictly below-left + strictly above-right
C = np.zeros_like(CT)
D = np.zeros_like(CT)
for i in range(a):
    for j in range(b):
        C[i, j] = CT[i+1:, j+1:].sum() + CT[:i, :j].sum()
        D[i, j] = CT[i+1:, :j].sum() + CT[:i, j+1:].sum()

print(f"\nC (concordant counts):\n{C}")
print(f"D (discordant counts):\n{D}")

#-----------------------------------------------------------------------------------------------
# Step 3: Somers' D
#-----------------------------------------------------------------------------------------------
P = (CT * C).sum()           # total concordant
Q = (CT * D).sum()           # total discordant
Dr = W**2 - (r**2).sum()     # pairs untied on Y

D_val = (P - Q) / Dr

print(f"\nP={P}, Q={Q}, Dr={Dr}")
print(f"Somers' D = (P-Q)/Dr = {D_val:.6f}")

#-----------------------------------------------------------------------------------------------
# Step 4: Row midranks
#-----------------------------------------------------------------------------------------------
RR = np.cumsum(r) + (1.0 - r) / 2.0
print(f"\nRow midranks R = {RR}")

#-----------------------------------------------------------------------------------------------
# Step 5: ASE (delta method)
#-----------------------------------------------------------------------------------------------
RR_mat = np.repeat(RR[:, np.newaxis], b, axis=1)

# Linearized residual per cell
eps = Dr * (C - D) - (P - Q) * (W - RR_mat)
print(f"\nResiduals eps:\n{eps}")

ASE = 2.0 / Dr**2 * np.sqrt((CT * eps**2).sum())
print(f"\nASE = {ASE:.6f}")

#-----------------------------------------------------------------------------------------------
# Step 6: ASE under H0
#-----------------------------------------------------------------------------------------------
ASE0 = 2.0 / Dr * np.sqrt((CT * (C - D)**2).sum() - (P - Q)**2 / W)
print(f"ASE0 = {ASE0:.6f}")

#-----------------------------------------------------------------------------------------------
# Step 7: 95% confidence interval
#-----------------------------------------------------------------------------------------------
print(f"\n95% CI: [{D_val - 1.96*ASE:.4f}, {D_val + 1.96*ASE:.4f}]")
print(f"z-test (H0: D=0): z = {D_val / ASE0:.4f}")
```
Output:

```
CT:
[[2. 2. 1.]
 [1. 1. 3.]]
W=10.0, r=[5. 5.]

C (concordant counts):
[[4. 3. 0.]
 [0. 2. 4.]]
D (discordant counts):
[[0. 1. 2.]
 [3. 1. 0.]]

P=28.0, Q=8.0, Dr=50.0
Somers' D = (P-Q)/Dr = 0.400000

Row midranks R = [3. 8.]

Residuals eps:
[[  60.  -40. -240.]
 [-190.   10.  160.]]

ASE = 0.340353
ASE0 = 0.314960

95% CI: [-0.2671, 1.0671]
z-test (H0: D=0): z = 1.2700
```
## 9. Production Implementation

See `fastwoe.metrics.somersd_se()` - validated against the VUROCS R package, with exact numerical agreement across binary, ordinal, and continuous targets.
