Model Stability and Data Shift Analysis on the UCI Adult Income Dataset
This project studies model stability under data distribution shift using the UCI Adult Income dataset. Rather than focusing solely on predictive performance under i.i.d. assumptions, we analyze how covariate shift and subpopulation shift affect both ranking performance and probability calibration of different machine learning models.
The project combines:
- Statistical distribution shift analysis (KS test, PSI, Wasserstein distance)
- Supervised learning (logistic regression and tree-based models)
- Performance degradation and calibration analysis under shift
- Source: UCI Adult Income Dataset (via OpenML, dataset ID 1590)
- Task: Binary classification (income >50K)
- Cleaned size: 45,222 samples
- Features: Demographic and socioeconomic attributes (age, education, occupation, marital status, etc.)

Missing values (encoded as `?`) are explicitly normalized and rows with missing entries are removed for reproducibility.
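For reference, a minimal sketch of this loading and cleaning step (the project's actual implementation lives in `src/data.py`; this standalone version uses scikit-learn's OpenML fetcher):

```python
import numpy as np
from sklearn.datasets import fetch_openml

# Fetch the Adult dataset from OpenML (dataset ID 1590) as a pandas DataFrame
adult = fetch_openml(data_id=1590, as_frame=True)
df = adult.frame

# Cast categorical columns to plain objects so the '?' placeholder can be
# replaced uniformly, normalize it to NaN, and drop incomplete rows
cat_cols = df.select_dtypes(include="category").columns
df[cat_cols] = df[cat_cols].astype(object)
df = df.replace("?", np.nan).dropna()

print(df.shape)  # (45222, 15) after cleaning
```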
We construct three train–test scenarios (the first two are sketched below):

- A_random: stratified random split
  - Train and test share the same distribution
  - Serves as an i.i.d. reference
- B_age_shift: demographic split on age
  - Train: age ≤ 35; test: age > 35
  - Simulates a realistic demographic shift
- Prevalence shift: synthetic resampling to alter positive class prevalence
  - Used for methodological comparison (not the main focus)
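A sketch of how scenarios A_random and B_age_shift can be constructed from the cleaned frame above (the 70/30 proportion and the seed are illustrative; `src/split.py` holds the project's actual logic):

```python
from sklearn.model_selection import train_test_split

TARGET = "class"  # income label in the OpenML frame (">50K" / "<=50K")

# A_random: stratified random split, i.i.d. reference
train_a, test_a = train_test_split(
    df, test_size=0.3, stratify=df[TARGET], random_state=42
)

# B_age_shift: train on younger adults, test on older adults
train_b = df[df["age"] <= 35]
test_b = df[df["age"] > 35]
```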
Before training any model, we quantify feature-level distribution shift between training and test sets.
Numeric features:

- Kolmogorov–Smirnov (KS) statistic
- Wasserstein distance

Categorical features:

- Population Stability Index (PSI) with epsilon smoothing
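Continuing from the split above, a sketch of these drift measures (the smoothing constant and the example features are illustrative; `src/drift.py` holds the project's actual implementation):

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

def psi(expected, observed, eps=1e-6):
    """Population Stability Index with epsilon smoothing for empty bins."""
    p = expected / expected.sum() + eps
    q = observed / observed.sum() + eps
    return float(np.sum((p - q) * np.log(p / q)))

# Numeric drift on an example feature
ks = ks_2samp(train_b["hours-per-week"], test_b["hours-per-week"]).statistic
wd = wasserstein_distance(train_b["hours-per-week"], test_b["hours-per-week"])

# Categorical drift: align category counts across cohorts, then compute PSI
cats = df["marital-status"].unique()
e = train_b["marital-status"].value_counts().reindex(cats, fill_value=0).to_numpy(float)
o = test_b["marital-status"].value_counts().reindex(cats, fill_value=0).to_numpy(float)
print(f"KS={ks:.3f}  Wasserstein={wd:.2f}  PSI={psi(e, o):.2f}")
```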
Although the split is defined by age, the strongest drift occurs in socioeconomic proxy variables: marital-status (PSI ≈ 1.21) and relationship (PSI ≈ 0.84), followed by workclass, education, and occupation.

This indicates a multi-dimensional covariate shift, not a single-feature drift.
Two complementary models are used (a pipeline sketch follows the list):

- Logistic regression
  - StandardScaler for numeric features
  - One-hot encoding for categorical features
  - Serves as a linear, interpretable baseline
- HistGradientBoostingClassifier
  - Captures non-linear feature interactions
  - Used to assess robustness under shift
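A sketch of the two pipelines, continuing from the sketches above (hyperparameters and the ordinal encoding for the tree model are illustrative choices, not necessarily those in `src/train.py`):

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

num_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = [c for c in df.columns if c not in num_cols and c != TARGET]

# Linear baseline: scaled numerics + one-hot categoricals
logit = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ])),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Gradient-boosted trees; ordinal encoding is one reasonable way to feed
# categoricals to HistGradientBoostingClassifier
tree = Pipeline([
    ("prep", ColumnTransformer([
        ("num", "passthrough", num_cols),
        ("cat", OrdinalEncoder(handle_unknown="use_encoded_value",
                               unknown_value=-1), cat_cols),
    ])),
    ("clf", HistGradientBoostingClassifier(random_state=42)),
])
```

Headline metrics on the random and age-shift splits: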
| Split | Model | ROC-AUC | PR-AUC | Brier |
|---|---|---|---|---|
| A_random | Logistic | 0.906 | 0.770 | 0.104 |
| A_random | Tree | 0.925 | 0.825 | 0.092 |
| B_age_shift | Logistic | 0.818 | 0.706 | 0.218 |
| B_age_shift | Tree | 0.890 | 0.836 | 0.130 |
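For reference, the three reported metrics can be computed like this (continuing from the pipeline sketch; `average_precision_score` serves as the PR-AUC estimate here, shown for the linear model on the age-shift split):

```python
from sklearn.metrics import average_precision_score, brier_score_loss, roc_auc_score

X_train, y_train = train_b.drop(columns=TARGET), train_b[TARGET] == ">50K"
X_test, y_test = test_b.drop(columns=TARGET), test_b[TARGET] == ">50K"

# Fit the linear baseline on the young cohort, evaluate on the older cohort
logit.fit(X_train, y_train)
proba = logit.predict_proba(X_test)[:, 1]

print("ROC-AUC:", roc_auc_score(y_test, proba))
print("PR-AUC :", average_precision_score(y_test, proba))
print("Brier  :", brier_score_loss(y_test, proba))
```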
Logistic regression is highly sensitive to subpopulation shift:

- Large drops in ROC-AUC and PR-AUC
- Brier score nearly doubles, indicating severe calibration degradation

Tree-based models are more robust:

- Ranking performance degrades less under shift
- Calibration still deteriorates, but remains substantially better than the linear model
The following plots are automatically generated:
- ROC curves (Random vs Age Shift)
- Precision–Recall curves
- Calibration (reliability) curves
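For illustration, a reliability curve of this kind can be drawn as follows (continuing from the metrics sketch; `src/plot_metrics.py` generates the project's actual figures):

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

# Reliability curve: observed positive rate per bin of predicted probability
frac_pos, mean_pred = calibration_curve(y_test, proba, n_bins=10)

plt.plot(mean_pred, frac_pos, marker="o", label="Logistic (age shift)")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfectly calibrated")
plt.xlabel("Mean predicted probability")
plt.ylabel("Fraction of positives")
plt.legend()
plt.show()  # the project saves its figures under reports/figures/ instead
```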
These figures clearly illustrate that:
- Strong feature-level drift does not always imply proportional ranking degradation
- Probability estimates are significantly more sensitive to distribution shift than ranking metrics
Key takeaways:

- Subpopulation shift induces multi-dimensional covariate drift, especially in correlated socioeconomic features.
- Linear models suffer substantial performance and calibration degradation under covariate shift, even with large training samples.
- Tree-based models exhibit stronger robustness, but are not immune and still require monitoring or recalibration.
All experiments are fully reproducible:

```bash
# create environment
uv venv
source .venv/bin/activate
uv pip install -r requirements.txt

# make src/ importable
export PYTHONPATH="./src:${PYTHONPATH}"

# run stability analysis
python -m src.run_stability

# generate figures
python -m src.plot_metrics
```

Project structure:

```
adult-income-drift-lab/
├── src/
│ ├── data.py
│ ├── split.py
│ ├── drift.py
│ ├── train.py
│ ├── run_stability.py
│ └── plot_metrics.py
├── reports/
│ ├── model_stability_comparison.csv
│ └── figures/
├── docs/
│ ├── methodology.md
│ └── results.md
└── README.md
```
Planned extensions:

- Explicit drift monitoring thresholds for production systems
- Post-shift recalibration (Platt scaling / isotonic regression)
- Decision-aware evaluation under cost-sensitive settings
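As an indication of what the recalibration step could look like, a sketch using isotonic regression on a held-out slice of the shifted test set (the 50/50 calibration split is illustrative, continuing from the metrics sketch above):

```python
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Hold out half of the shifted test set to fit the recalibration mapping
p_cal, p_eval, y_cal, y_eval = train_test_split(
    proba, y_test, test_size=0.5, random_state=0
)

# Monotone mapping from raw scores to calibrated probabilities
iso = IsotonicRegression(out_of_bounds="clip")
p_recal = iso.fit(p_cal, y_cal).predict(p_eval)

print("Brier before:", brier_score_loss(y_eval, p_eval))
print("Brier after :", brier_score_loss(y_eval, p_recal))
```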