# Marginal Somers' D (MSD) for Feature Selection

**Denis Burakov** | **December 2025** | **xRiskLab**

<div align="center">

[GitHub](https://github.com/xRiskLab/fastwoe)
[PyPI](https://pypi.org/project/fastwoe/)

**Fast and efficient Python implementation of WOE encoding and MSD feature selection**

</div>

---

## 1. Introduction

**Marginal Somers' D (MSD)** is a feature selection method that uses rank correlation (Somers' D) instead of the traditional Information Value (IV). It implements a greedy forward selection procedure that:

1. Transforms features using WOE encoding
2. Selects features based on their Somers' D with the target
3. Filters out features highly correlated with already-selected features
4. Works with both binary and continuous targets

**Key advantage:** Unlike IV-based methods, which are limited to binary classification, MSD handles continuous targets through rank correlation.

---

## 2. Mathematical Foundation

### Somers' D Definition

Somers' D measures monotonic association between two variables:

```math
D_{Y|X} = \frac{\text{Concordant} - \text{Discordant}}{\text{Total pairs (excluding ties in Y)}}
```

Where:
- **Concordant**: $(x_i > x_j \text{ and } y_i > y_j)$ or $(x_i < x_j \text{ and } y_i < y_j)$
- **Discordant**: $(x_i > x_j \text{ and } y_i < y_j)$ or $(x_i < x_j \text{ and } y_i > y_j)$

For binary classification:

```math
\text{Gini} = 2 \times \text{AUC} - 1 = D_{Y|X}
```
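
As a concrete check of the Gini identity, the sketch below counts concordant and discordant pairs directly and compares the result against `2 × AUC − 1` from scikit-learn. The `somers_d` helper is illustrative only (it is not part of `fastwoe`) and uses a naive O(n²) pair count:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def somers_d(scores, target):
    """Naive O(n^2) Somers' D of `scores`, counted over pairs with distinct `target` values."""
    scores, target = np.asarray(scores), np.asarray(target)
    concordant = discordant = valid_pairs = 0
    n = len(target)
    for i in range(n):
        for j in range(i + 1, n):
            if target[i] == target[j]:
                continue  # pairs tied on the target are excluded from the denominator
            valid_pairs += 1
            direction = (scores[i] - scores[j]) * (target[i] - target[j])
            if direction > 0:
                concordant += 1
            elif direction < 0:
                discordant += 1
    return (concordant - discordant) / valid_pairs

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 300)          # binary target
s = y + rng.normal(0, 1.0, 300)      # noisy score associated with the target
print(f"Somers' D   : {somers_d(s, y):.4f}")
print(f"2 * AUC - 1 : {2 * roc_auc_score(y, s) - 1:.4f}")  # matches the value above
```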

### WOE Transformation

All features are transformed using Weight of Evidence (WOE) before computing Somers' D. This:
- Handles categorical variables
- Creates monotonic transformations
- Works with both binary and continuous targets

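As a rough illustration of what the WOE transform does for a single categorical feature against a binary target (independent of fastwoe's own encoder and its exact smoothing and sign conventions), consider:

```python
import numpy as np
import pandas as pd

def woe_table(x: pd.Series, y: pd.Series, eps: float = 0.5) -> pd.Series:
    """Per-category WOE = ln(share of events / share of non-events).

    `eps` is a small additive smoothing term so empty cells do not yield infinities.
    Sign and smoothing conventions differ between libraries; this is one common choice.
    """
    counts = pd.crosstab(x, y)
    events = counts[1] + eps
    non_events = counts[0] + eps
    return np.log((events / events.sum()) / (non_events / non_events.sum()))

x = pd.Series(["A", "A", "B", "B", "B", "C", "C", "C", "C"], name="feature")
y = pd.Series([1, 0, 1, 1, 0, 0, 0, 0, 1], name="target")
print(woe_table(x, y))  # higher WOE -> category more associated with target == 1
```

Replacing each category with its WOE value yields a numeric, monotonically ordered feature, which is what Somers' D is then computed on.
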
---

## 3. The MSD Algorithm

### Step-by-Step Process

1. **Pre-processing**
   - Transform all features using WOE encoding
   - Compute the pairwise Somers' D correlation matrix

2. **Initialization**
   - Calculate the univariate Somers' D for each feature
   - Select the feature with the highest univariate Somers' D

3. **Iterative Selection**
   - Fit a model with the currently selected features
   - For each remaining feature:
     - Compute the univariate Somers' D between the feature's WOE values and the target
     - Check its correlation with already-selected features (using pairwise feature correlation)
   - Add the feature with the highest Somers' D if its correlation with the selected set is below the threshold
   - Repeat until a stopping criterion is met (see the sketch after this list)

4. **Stopping Criteria**
   - Marginal Somers' D < `min_msd`, OR
   - Maximum number of features reached, OR
   - All remaining features are too correlated with the selected features

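The loop below is a schematic sketch of this selection process, not fastwoe's actual implementation: it assumes `woe` is a DataFrame of WOE-encoded features, `somers_d(scores, target)` is the illustrative helper from Section 2, and `corr` is the symmetrised pairwise feature-correlation matrix described below. Model refitting and test-set tracking are omitted for brevity.

```python
# Schematic sketch of the greedy MSD loop (illustrative, not fastwoe's code).
# Assumes: `woe` is a DataFrame of WOE-encoded features, `somers_d(scores, target)`
# is the naive helper from Section 2, and `corr.loc[f, g]` holds the symmetrised
# pairwise Somers' D between features f and g.
import numpy as np

def greedy_msd_sketch(woe, y, corr, min_msd=0.01, max_features=5, correlation_threshold=0.5):
    y = np.asarray(y)
    univariate = {f: abs(somers_d(woe[f].to_numpy(), y)) for f in woe.columns}
    selected, history = [], []
    while len(selected) < max_features:
        candidates = [
            f for f in woe.columns
            if f not in selected
            and all(corr.loc[f, g] < correlation_threshold for g in selected)
        ]
        if not candidates:
            break  # everything left is too correlated with the selected set
        best = max(candidates, key=univariate.get)
        if univariate[best] < min_msd:
            break  # marginal Somers' D below the floor
        selected.append(best)
        history.append(univariate[best])
    return selected, history
```
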
### What Makes It "Marginal"?

The "marginal" aspect comes from:
- **Iterative evaluation**: Features are evaluated at each step, after some features have already been selected
- **Correlation filtering**: Features whose Somers' D correlation with already-selected features exceeds the threshold are skipped
- **Greedy selection**: The selection order implicitly accounts for redundancy through correlation filtering

> [!NOTE]
> The term "marginal" here refers to the iterative, step-wise evaluation process. At each step, features are evaluated using their univariate Somers' D with the target, but redundant features are filtered out based on their correlation with already-selected features.

Feature correlation is computed as:

```math
\text{correlation}(f_i, f_j) = \frac{|D_{f_i|f_j}| + |D_{f_j|f_i}|}{2}
```

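Reusing the illustrative `somers_d` helper from Section 2, this symmetrisation could look like:

```python
def feature_correlation(f_i, f_j):
    """Symmetrised Somers' D between two WOE-encoded features (illustrative sketch)."""
    return (abs(somers_d(f_i, f_j)) + abs(somers_d(f_j, f_i))) / 2
```
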
---

## 4. Basic Usage

### Binary Classification

```python
import numpy as np
import pandas as pd
from fastwoe.modeling import marginal_somersd_selection

# Prepare data
X = pd.DataFrame({
    'feature1': np.random.choice(['A', 'B', 'C'], 1000),
    'feature2': np.random.choice(['X', 'Y', 'Z'], 1000),
    'feature3': np.random.choice(['P', 'Q', 'R'], 1000),
})
y = np.random.binomial(1, 0.3, 1000)

# Run selection
result = marginal_somersd_selection(
    X, y,
    min_msd=0.01,               # Minimum marginal Somers' D to keep adding features
    max_features=5,             # Maximum number of features to select
    correlation_threshold=0.5   # Max allowed correlation with already-selected features
)

print(result['selected_features'])
print(result['msd_history'])
```

### With Train/Test Split

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

result = marginal_somersd_selection(
    X_train, y_train,
    X_test=X_test,
    y_test=y_test,
    min_msd=0.01
)

# Monitor performance at each step
# Note: test_performance has length len(selected_features) - 1
# (computed at the start of each iteration after the first feature)
for i, (feat, msd) in enumerate(zip(
    result['selected_features'],
    result['msd_history']
)):
    if i > 0:  # test_performance starts from step 2
        test_perf = result['test_performance'][i - 1]
        print(f"{feat}: Train MSD={msd:.4f}, Test D={test_perf:.4f}")
    else:
        print(f"{feat}: Train MSD={msd:.4f}")
```

### Continuous Target

```python
# Works with continuous targets
y_continuous = np.random.normal(0, 1, 1000)

result = marginal_somersd_selection(
    X, y_continuous,
    min_msd=0.01
)
```

---

## 5. Output Structure

The function returns a dictionary with:

| Key | Type | Description |
|-----|------|-------------|
| `selected_features` | `list[str]` | Feature names in selection order |
| `msd_history` | `list[float]` | Marginal Somers' D at each step (same length as `selected_features`) |
| `univariate_somersd` | `dict[str, float]` | Univariate Somers' D for all features |
| `model` | `FastWoe` | Trained WOE model with the selected features |
| `test_performance` | `list[float]` | Test Somers' D at each step (length `len(selected_features) - 1`; populated only when a test set is provided) |
| `correlation_matrix` | `pd.DataFrame` | Pairwise correlations of the selected features |

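A quick way to inspect these outputs after a run (key names as documented above; the test-performance entries and the correlation matrix contents depend on the run):

```python
# Walk through the selection history and show the redundancy structure
for step, (feat, msd) in enumerate(zip(result["selected_features"],
                                       result["msd_history"]), start=1):
    print(f"step {step}: added {feat!r} (marginal Somers' D = {msd:.4f})")

print(result["correlation_matrix"].round(2))  # pairwise correlation of selected features
```
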
---

## 6. When to Use MSD

**Use MSD when:**
- You have categorical or mixed-type features
- You need rank correlation-based selection (robust to outliers)
- You want to handle both binary and continuous targets
- You want automatic redundancy filtering
- You're building credit scoring or risk models

**Consider alternatives when:**
- You have extremely high-dimensional data (thousands of features)
- You need very fast selection with minimal computation
- Your features are already numeric and well-scaled

---

## 7. Complete Example

```python
import numpy as np
import pandas as pd
from fastwoe.modeling import marginal_somersd_selection
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Generate categorical data
np.random.seed(42)
n = 2000
X = pd.DataFrame({
    'age_group': np.random.choice(['18-25', '26-35', '36-45', '46+'], n),
    'income': np.random.choice(['Low', 'Medium', 'High'], n),
    'employment': np.random.choice(['Employed', 'Self-Employed', 'Unemployed'], n),
    'education': np.random.choice(['HS', 'Bachelor', 'Master', 'PhD'], n),
})

# Create a binary target driven by income and education (plus noise)
y = (
    (X['income'] == 'High').astype(int) * 0.3 +
    (X['education'].isin(['Master', 'PhD'])).astype(int) * 0.2 +
    np.random.normal(0, 0.1, n)
)
y = (y > 0.3).astype(int)

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Select features
result = marginal_somersd_selection(
    X_train, y_train,
    X_test=X_test,
    y_test=y_test,
    min_msd=0.01,
    max_features=5
)

# Results
print("Selected features:", result["selected_features"])
print("\nUnivariate Somers' D:")
for feat, val in sorted(
    result["univariate_somersd"].items(), key=lambda x: x[1], reverse=True
):
    print(f"{feat}: {val:.4f}")

# Evaluate on the test set
model = result["model"]
y_pred = model.predict_proba(X_test[result["selected_features"]])[:, 1]
print(f"\nTest AUC: {roc_auc_score(y_test, y_pred):.4f}")
```

---

## References

1. Somers, R. H. (1962). A new asymmetric measure of association for ordinal variables. *American Sociological Review*, 27(6), 799-811.

2. Spinella, F., & Krisciunas, T. (2025). Enhancing Credit Risk Models at Revolut by Combining Deep Feature Synthesis and Marginal Information Value. *Credit Research Centre, University of Edinburgh Business School*. Available at: https://www.crc.business-school.ed.ac.uk/sites/crc/files/2025-11/Enhancing-Credit-Risk-Models-at-Revolut-by-combining-Deep-Feature-Synthesis-and-Marginal-Information-Value-paper.pdf