# **ML Feature Selection Study**
Comparing RFE vs SHAP
## Objective
The goal of this notebook is to compare two popular feature selection methods:
- Recursive Feature Elimination (RFE)
- SHAP-based Feature Importance
We evaluate which method selects the most effective feature set based on:
- Model performance
- Feature interpretability
- Stability across methods
# **Imports & Setup**
In this cell, we import all the required libraries for:
- Data handling (pandas, numpy)
- Visualization (matplotlib, seaborn)
- Machine learning (scikit-learn)
- Model explainability (SHAP)
This setup ensures we can:
- Train ML models
- Perform feature selection
- Visualize feature importance
- Compare model performance
# **Dataset Loading**
We use the **Breast Cancer Wisconsin Dataset**, a standard binary classification dataset.
- Target: malignant (0) vs benign (1), following scikit-learn's encoding
- Features: 30 numeric medical measurements
This dataset is ideal for feature selection experiments because:
- It has many correlated features
- Interpretability is important in healthcare
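Loading the dataset from scikit-learn might look like the following sketch (variable names are illustrative, not necessarily those used in the notebook):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target  # 0 = malignant, 1 = benign in scikit-learn's encoding

print(X.shape)            # (569, 30): 569 samples, 30 numeric features
print(data.target_names)  # ['malignant' 'benign']
```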
# **Train–Test Split & Scaling**
### Train–Test Split
The dataset is split into:
- 80% training data
- 20% testing data
This ensures:
- The model is evaluated on data it has never seen during training
- The performance estimate is unbiased
### Feature Scaling
We apply **StandardScaler** to normalize features.
Why scaling is important:
- Logistic Regression is sensitive to feature magnitude
- Feature selection methods like RFE work better with scaled data
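The split and scaling described above can be sketched as follows (`stratify` and `random_state=42` are assumptions for reproducibility, not confirmed notebook settings):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()

# 80/20 split; stratify keeps the class balance equal across splits
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, stratify=data.target, random_state=42
)

# Fit the scaler on the training set only, then apply to both splits,
# so no test-set information leaks into preprocessing
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

print(X_train_s.shape, X_test_s.shape)  # (455, 30) (114, 30)
```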
# **Baseline Model (No Feature Selection)**
We train a Logistic Regression model using **all features**.
This acts as a reference point to compare:
- RFE-selected features
- SHAP-selected features
Baseline accuracy establishes the performance of the full 30-feature set,
against which the reduced feature sets are judged.
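A sketch of the baseline fit, assuming the split-and-scale approach described earlier (the exact accuracy depends on the split, so it may differ slightly from the figure reported below):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, stratify=data.target, random_state=42
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Logistic regression on all 30 scaled features
baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train_s, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test_s))
print(f"Baseline Accuracy: {baseline_acc:.6f}")
```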
# **Output**
- Baseline Accuracy: 0.973684
- Indicates a strong initial model
# **Feature Selection Using RFE**
Recursive Feature Elimination (RFE):
- Starts with all features
- Iteratively removes the least important ones
- Retains the top 10 features based on model coefficients
**Goal**: Reduce the feature count without losing accuracy.
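The elimination step can be sketched with scikit-learn's `RFE` (the estimator choice and `n_features_to_select=10` follow the text above; the split settings are assumptions):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, stratify=data.target, random_state=42
)
X_train_s = StandardScaler().fit_transform(X_train)

# RFE drops the weakest feature(s) each round until 10 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=10)
rfe.fit(X_train_s, y_train)

selected = list(np.array(data.feature_names)[rfe.support_])
print(selected)  # the 10 surviving features
```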
# **Output**
- List of 10 selected features
- Mostly “worst”- and “error”-related medical measurements
# **RFE Ranking Visualization**
This bar chart shows the **top 10 features selected by RFE**.
Why this visualization matters:
- Confirms which features the model considers most important
- Helps validate feature selection logic visually
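One way to build such a chart: since all RFE-selected features share rank 1, the sketch below plots the refit model's absolute coefficients for the selected features as a proxy for their relative importance (an assumption about how the notebook's chart was constructed):

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, stratify=data.target, random_state=42
)
X_train_s = StandardScaler().fit_transform(X_train)

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=10)
rfe.fit(X_train_s, y_train)

# rfe.estimator_ is the model refit on the 10 selected columns
selected = np.array(data.feature_names)[rfe.support_]
importance = np.abs(rfe.estimator_.coef_).ravel()

order = np.argsort(importance)
fig, ax = plt.subplots(figsize=(8, 5))
ax.barh(selected[order], importance[order])
ax.set_xlabel("|coefficient| (scaled features)")
ax.set_title("Top 10 features selected by RFE")
fig.tight_layout()
fig.savefig("rfe_top10.png", dpi=150)
```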
# **Output**
- Bar plot of top 10 RFE features
- Lower rank = higher importance
# **Model Performance with RFE Features**
We retrain the model using only the 10 RFE-selected features.
Purpose:
- Measure whether performance drops after feature reduction
- Evaluate efficiency vs accuracy tradeoff
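The retrain step might look like this (split settings are assumptions; `rfe.transform` keeps only the 10 selected columns):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, stratify=data.target, random_state=42
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=10)
rfe.fit(X_train_s, y_train)

# Refit using only the 10 RFE-selected columns
model_rfe = LogisticRegression(max_iter=1000)
model_rfe.fit(rfe.transform(X_train_s), y_train)
rfe_acc = accuracy_score(y_test, model_rfe.predict(rfe.transform(X_test_s)))
print(f"RFE Accuracy: {rfe_acc:.6f}")
```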
# **Output**
- RFE Accuracy: 0.973684
- Same as baseline despite using fewer features
**Observation**: RFE successfully removed redundant features without hurting performance.
# **Feature Selection Using SHAP**
### SHAP Explainer
SHAP (SHapley Additive exPlanations):
- Explains how each feature contributes to predictions
- Based on cooperative game theory
- Focuses on interpretability, not just ranking
We compute SHAP values for the baseline model.
# **Output**
- SHAP values generated internally
- No printed output
# **SHAP Summary Plot**
### SHAP Summary Plot (Global Importance)
This bar plot shows:
- Average absolute contribution of each feature
- Global feature importance across the dataset
Why this matters:
- Explains *why* the model makes decisions
- Useful for regulated or trust-critical domains
# **Output**
- SHAP bar plot showing most influential features
# **Select Top SHAP Features**
We select the top 10 features by mean absolute SHAP value;
these are the features with the strongest global impact on model predictions.
# **Output**
- List of top 10 SHAP-selected features
# **Model Performance with SHAP Features**
We retrain the model using only SHAP-selected features.
This evaluates:
- Performance after prioritizing interpretability
- Accuracy vs explainability tradeoff
# **Output**
- SHAP Accuracy: 0.964912
**Observation**: A slight accuracy drop is expected; SHAP ranks features by their global contribution to the model's predictions, not by how well a refit model performs on the reduced subset.
# **Final Comparison**
### Final Accuracy Comparison
This table compares:
- Baseline model
- RFE-based model
- SHAP-based model
It summarizes the impact of feature selection methods.
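The table can be assembled with pandas, using the accuracies reported in this notebook:

```python
import pandas as pd

# Accuracy figures as reported in the sections above
comparison = pd.DataFrame(
    {"Accuracy": [0.973684, 0.973684, 0.964912]},
    index=["Baseline (30 features)", "RFE (10 features)", "SHAP (10 features)"],
)
print(comparison)
```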
# **Accuracy Bar Chart**
### Accuracy Comparison Visualization
This bar chart visually compares model accuracy
across different feature selection methods.
Helps quickly communicate results to:
- Recruiters
- Stakeholders
- Reviewers
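A minimal matplotlib sketch of this chart, using the accuracies reported above (output filename is illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

methods = ["Baseline", "RFE", "SHAP"]
accs = [0.973684, 0.973684, 0.964912]  # figures reported in this notebook

fig, ax = plt.subplots()
ax.bar(methods, accs)
ax.set_ylim(0.90, 1.00)  # zoom in so small differences are visible
ax.set_ylabel("Test accuracy")
ax.set_title("Feature selection method comparison")
fig.tight_layout()
fig.savefig("accuracy_comparison.png", dpi=150)
```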
# **Output**
- Bar chart comparing Baseline, RFE, and SHAP accuracy
# **Final Conclusion**
- RFE performs well when we want a compact, performance-driven feature set.
- SHAP provides better interpretability and explains feature contribution clearly.
- SHAP-selected features often match domain intuition.
- Best choice depends on:
- Performance focus → RFE
- Explainability & trust → SHAP
# **End**