Commit b5a1d99

Add MSD algorithm and TestPyPI workflow
1 parent ef1ba78 commit b5a1d99

13 files changed: 2535 additions & 1239 deletions

.github/workflows/test-release.yml

Lines changed: 82 additions & 0 deletions
```yaml
name: Test Release

on:
  push:
    tags:
      - '*-test'  # Tags ending with -test (e.g., v0.1.6-test)
  workflow_dispatch:
    inputs:
      version:
        description: 'Version to release to TestPyPI (e.g., 0.1.6)'
        required: true
        type: string

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install uv
        uses: astral-sh/setup-uv@v1
        with:
          version: "latest"

      - name: Install dependencies
        run: |
          uv sync --dev

      - name: Run tests
        run: |
          uv run python -m pytest tests/ -v

  build:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install uv
        uses: astral-sh/setup-uv@v1
        with:
          version: "latest"

      - name: Install build dependencies
        run: |
          uv tool install build

      - name: Build package
        run: |
          uv tool run --from build pyproject-build

      - name: Upload build artifacts
        uses: actions/upload-artifact@v4
        with:
          name: dist
          path: dist/

  publish:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - name: Download build artifacts
        uses: actions/download-artifact@v4
        with:
          name: dist
          path: dist/

      - name: Publish to TestPyPI
        uses: pypa/gh-action-pypi-publish@release/v1
        with:
          password: ${{ secrets.TEST_PYPI_API_TOKEN }}
          repository-url: https://test.pypi.org/legacy/
```

docs/marginal_somersd_guide.md

Lines changed: 266 additions & 0 deletions
# Marginal Somers' D (MSD) for Feature Selection

**Denis Burakov** | **December 2025** | **xRiskLab**

<div align="center">

[![GitHub](https://img.shields.io/badge/GitHub-FastWoe-black?logo=github)](https://github.com/xRiskLab/fastwoe)
[![PyPI](https://img.shields.io/badge/PyPI-fastwoe-blue?logo=pypi)](https://pypi.org/project/fastwoe/)

**Fast and efficient Python implementation of WOE encoding and MSD feature selection**

</div>

---

## 1. Introduction

**Marginal Somers' D (MSD)** is a feature selection method that uses rank correlation (Somers' D) instead of the traditional Information Value (IV). It implements a greedy forward selection that:

1. Transforms features using WOE encoding
2. Selects features based on their Somers' D with the target
3. Filters out features highly correlated with already-selected features
4. Works with both binary and continuous targets

**Key advantage:** Unlike IV-based methods, which are limited to binary classification, MSD handles continuous targets through rank correlation.

---

## 2. Mathematical Foundation

### Somers' D Definition

Somers' D measures the monotonic association between two variables:

```math
D_{Y|X} = \frac{\text{Concordant} - \text{Discordant}}{\text{Total pairs (excluding ties in } Y\text{)}}
```

Where:
- **Concordant**: $(x_i > x_j \text{ and } y_i > y_j)$ or $(x_i < x_j \text{ and } y_i < y_j)$
- **Discordant**: $(x_i > x_j \text{ and } y_i < y_j)$ or $(x_i < x_j \text{ and } y_i > y_j)$

For binary classification:

```math
\text{Gini} = 2 \times \text{AUC} - 1 = D_{Y|X}
```
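For a binary target, the identity above can be checked directly. Below is a minimal pure-Python sketch (O(n²) pair counting, for illustration only; this is not FastWoe's implementation, and `somers_d_yx` and `auc` are hypothetical helper names) that computes $D_{Y|X}$ from concordant/discordant pairs and compares it with $2 \times \text{AUC} - 1$:

```python
from itertools import combinations

def somers_d_yx(x, y):
    """D_{Y|X}: (concordant - discordant) / pairs not tied in y."""
    conc = disc = untied = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        if yi == yj:
            continue  # pairs tied in y are excluded from the denominator
        untied += 1
        s = (xi - xj) * (yi - yj)
        if s > 0:
            conc += 1
        elif s < 0:
            disc += 1
    return (conc - disc) / untied

def auc(x, y):
    """Rank-based AUC: P(score_pos > score_neg), counting ties as 1/2."""
    pos = [xi for xi, yi in zip(x, y) if yi == 1]
    neg = [xi for xi, yi in zip(x, y) if yi == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.2, 0.8, 0.4, 0.3, 0.1, 0.7]
labels = [0, 1, 0, 1, 0, 1]
d = somers_d_yx(scores, labels)     # 7/9: 8 concordant, 1 discordant, 9 untied pairs
gini = 2 * auc(scores, labels) - 1  # agrees with d up to floating-point rounding
print(d, gini)
```

In production, a library routine such as `scipy.stats.somersd` avoids the quadratic pair enumeration.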
### WOE Transformation

All features are transformed using Weight of Evidence (WOE) before computing Somers' D. This:
- Handles categorical variables
- Creates monotonic transformations
- Works with both binary and continuous targets
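For a categorical feature with a binary target, WOE is commonly computed as the log-ratio of the category's share of events to its share of non-events. A hedged sketch (the Laplace-style smoothing term `eps` is an assumption for illustration; FastWoe's exact binning and smoothing may differ):

```python
import math
from collections import Counter

def woe_map(categories, y, eps=0.5):
    """WOE per category: ln(share of events / share of non-events), smoothed."""
    events, nonevents = Counter(), Counter()
    for c, t in zip(categories, y):
        (events if t == 1 else nonevents)[c] += 1
    total_e = sum(events.values())
    total_ne = sum(nonevents.values())
    cats = set(categories)
    k = eps * len(cats)  # smoothing mass spread over all categories
    return {
        c: math.log(((events[c] + eps) / (total_e + k)) /
                    ((nonevents[c] + eps) / (total_ne + k)))
        for c in cats
    }

cats = ['A', 'A', 'B', 'B', 'B', 'C']
y    = [1, 1, 0, 0, 1, 0]
woe = woe_map(cats, y)
print(woe)  # 'A' (all events) is positive, 'C' (all non-events) is negative
```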
---

## 3. The MSD Algorithm

### Step-by-Step Process

1. **Pre-processing**
   - Transform all features using WOE encoding
   - Compute the pairwise Somers' D correlation matrix

2. **Initialization**
   - Calculate the univariate Somers' D for each feature
   - Select the feature with the highest univariate Somers' D

3. **Iterative Selection**
   - Fit a model with the currently selected features
   - For each remaining feature:
     - Compute the univariate Somers' D between the feature's WOE and the target
     - Check its correlation with already-selected features (using pairwise feature correlation)
   - Add the feature with the highest Somers' D if its correlation < threshold
   - Repeat until a stopping criterion is met

4. **Stopping Criteria**
   - Marginal Somers' D < `min_msd`, OR
   - Maximum number of features reached, OR
   - All remaining features are too correlated with the selected features
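The selection loop can be sketched generically. This is a simplified illustration, not FastWoe's internals: it omits the model-fitting step and operates on precomputed statistics (`univariate_d` maps each feature to its univariate Somers' D with the target; `corr` is a symmetric pairwise feature-correlation matrix; all names and values are hypothetical):

```python
def greedy_msd_selection(univariate_d, corr, min_msd=0.01, max_features=5,
                         correlation_threshold=0.5):
    """Greedy forward selection with correlation-based redundancy filtering."""
    remaining = dict(univariate_d)
    selected, history = [], []
    while remaining and len(selected) < max_features:
        # Keep only candidates not too correlated with anything already selected
        eligible = {
            f: d for f, d in remaining.items()
            if all(corr[f][s] < correlation_threshold for s in selected)
        }
        if not eligible:
            break  # every remaining feature is redundant
        best = max(eligible, key=lambda f: abs(eligible[f]))
        if abs(eligible[best]) < min_msd:
            break  # marginal Somers' D too small to justify adding
        selected.append(best)
        history.append(remaining.pop(best))
    return selected, history

# Toy statistics (invented for illustration)
uni = {'income': 0.42, 'education': 0.35, 'salary_band': 0.40, 'age': 0.12}
corr = {f: {g: 0.0 for g in uni} for f in uni}
corr['income']['salary_band'] = corr['salary_band']['income'] = 0.9  # near-duplicates
print(greedy_msd_selection(uni, corr))
```

Note how `salary_band`, despite a high univariate Somers' D, is skipped once `income` is selected, because the two are strongly correlated.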
### What Makes It "Marginal"?

The "marginal" aspect comes from:
- **Iterative evaluation**: Features are evaluated at each step, after some features have already been selected
- **Correlation filtering**: Features whose Somers' D correlation with an already-selected feature exceeds the threshold are skipped
- **Greedy selection**: The selection order implicitly accounts for redundancy through correlation filtering

> [!NOTE]
> The term "marginal" here refers to the iterative, step-wise evaluation process. At each step, features are evaluated using their univariate Somers' D with the target, but redundant features are filtered out based on their correlation with already-selected features.

Feature correlation is computed as:

```math
\text{correlation}(f_i, f_j) = \frac{|D_{f_i|f_j}| + |D_{f_j|f_i}|}{2}
```
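Since Somers' D is asymmetric, averaging the two absolute values yields an order-independent redundancy score. A minimal numeric check (the two D values are invented for illustration):

```python
def symmetric_corr(d_ij, d_ji):
    """Symmetrize two asymmetric Somers' D values by averaging their magnitudes."""
    return (abs(d_ij) + abs(d_ji)) / 2

# e.g. D(f1|f2) = 0.62 and D(f2|f1) = -0.58 (hypothetical values)
print(symmetric_corr(0.62, -0.58))  # (0.62 + 0.58) / 2 = 0.6
```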
---

## 4. Basic Usage

### Binary Classification

```python
import numpy as np
import pandas as pd
from fastwoe.modeling import marginal_somersd_selection

# Prepare data
X = pd.DataFrame({
    'feature1': np.random.choice(['A', 'B', 'C'], 1000),
    'feature2': np.random.choice(['X', 'Y', 'Z'], 1000),
    'feature3': np.random.choice(['P', 'Q', 'R'], 1000),
})
y = np.random.binomial(1, 0.3, 1000)

# Run selection
result = marginal_somersd_selection(
    X, y,
    min_msd=0.01,               # Minimum marginal Somers' D
    max_features=5,             # Maximum number of features
    correlation_threshold=0.5,  # Correlation threshold
)

print(result['selected_features'])
print(result['msd_history'])
```

### With Train/Test Split

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

result = marginal_somersd_selection(
    X_train, y_train,
    X_test=X_test,
    y_test=y_test,
    min_msd=0.01
)

# Monitor performance at each step.
# Note: test_performance has length len(selected_features) - 1
# (computed at the start of each iteration after the first feature).
for i, (feat, msd) in enumerate(zip(
    result['selected_features'],
    result['msd_history']
)):
    if i > 0:  # test_performance starts from step 2
        test_perf = result['test_performance'][i - 1]
        print(f"{feat}: Train MSD={msd:.4f}, Test D={test_perf:.4f}")
    else:
        print(f"{feat}: Train MSD={msd:.4f}")
```

### Continuous Target

```python
# Works with continuous targets
y_continuous = np.random.normal(0, 1, 1000)

result = marginal_somersd_selection(
    X, y_continuous,
    min_msd=0.01
)
```

---

## 5. Output Structure

The function returns a dictionary with:

| Key | Type | Description |
|-----|------|-------------|
| `selected_features` | `list[str]` | Feature names in selection order |
| `msd_history` | `list[float]` | Marginal Somers' D at each step (same length as `selected_features`) |
| `univariate_somersd` | `dict[str, float]` | Univariate Somers' D for all features |
| `model` | `FastWoe` | Trained WOE model with the selected features |
| `test_performance` | `list[float]` | Test Somers' D at each step (length = `len(selected_features) - 1`; only if a test set is provided) |
| `correlation_matrix` | `pd.DataFrame` | Pairwise correlations of the selected features |

---

## 6. When to Use MSD

**Use MSD when:**
- You have categorical or mixed-type features
- You need rank-correlation-based selection (robust to outliers)
- You want to handle both binary and continuous targets
- You want automatic redundancy filtering
- You're building credit scoring or risk models

**Consider alternatives when:**
- You have extremely high-dimensional data (thousands of features)
- You need very fast selection with minimal computation
- Your features are already numeric and well-scaled

---

## 7. Complete Example

```python
import numpy as np
import pandas as pd
from fastwoe.modeling import marginal_somersd_selection
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Generate data
np.random.seed(42)
n = 2000
X = pd.DataFrame({
    'age_group': np.random.choice(['18-25', '26-35', '36-45', '46+'], n),
    'income': np.random.choice(['Low', 'Medium', 'High'], n),
    'employment': np.random.choice(['Employed', 'Self-Employed', 'Unemployed'], n),
    'education': np.random.choice(['HS', 'Bachelor', 'Master', 'PhD'], n),
})

# Create target
y = (
    (X['income'] == 'High').astype(int) * 0.3 +
    (X['education'].isin(['Master', 'PhD'])).astype(int) * 0.2 +
    np.random.normal(0, 0.1, n)
)
y = (y > 0.3).astype(int)

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Select features
result = marginal_somersd_selection(
    X_train, y_train,
    X_test=X_test,
    y_test=y_test,
    min_msd=0.01,
    max_features=5
)

# Results
print("Selected features:", result["selected_features"])
print("\nUnivariate Somers' D:")
for feat, val in sorted(
    result["univariate_somersd"].items(), key=lambda x: x[1], reverse=True
):
    print(f"{feat}: {val:.4f}")

# Evaluate
model = result["model"]
y_pred = model.predict_proba(X_test[result["selected_features"]])[:, 1]
print(f"\nTest AUC: {roc_auc_score(y_test, y_pred):.4f}")
```

---

## References

1. Somers, R.H. (1962). A new asymmetric measure of association for ordinal variables. *American Sociological Review*, 27(6), 799-811.

2. Spinella, F., & Krisciunas, T. (2025). Enhancing Credit Risk Models at Revolut by Combining Deep Feature Synthesis and Marginal Information Value. *Credit Research Centre, University of Edinburgh Business School*. Available at: https://www.crc.business-school.ed.ac.uk/sites/crc/files/2025-11/Enhancing-Credit-Risk-Models-at-Revolut-by-combining-Deep-Feature-Synthesis-and-Marginal-Information-Value-paper.pdf

examples/fastwoe_visualize_woe.ipynb

Lines changed: 37 additions & 37 deletions
Large diffs are not rendered by default.
