- The Breast Cancer (Diagnostic) Dataset can be found on the UCI ML Repository
- Project focus is the implementation of ML Pipelines for binary classification, to include:
- Data preprocessing and transformations (e.g., scale, impute, convert)
- Feature selection and validation (e.g., Lasso regularization or LDA)
- Optimization of model parameters (search_space)
- Comparing resultant models of GridSearchCV and RandomizedSearchCV
- Developing a scalable and maintainable ML Pipeline for binary classification
- Validating the model with "unseen" testing data
- Python 3.10 or newer (specifically for SMOTE)
- Scikit-learn version 1.5.2 – to minimize FutureWarning errors when fitting.
- When testing models independent of the Pipeline, ensure the data is first scaled!
- Links to source documentation:
- For more on Lasso
- For more on LinearDiscriminantAnalysis
- For more on RandomForestClassifier
- For more on SGDClassifier
- For more on Pipeline
- For more on SMOTE
- For more on GridSearchCV
- For more on RandomizedSearchCV
- Drink more coffee ☕ ☕ ☕
- The highest performing model uses the following data preprocessing and classification steps:
- StandardScaler - Scale incoming data
- Lasso L1 Regularization - Feature selection
- SMOTE – Balance the dataset by synthesizing additional malignant samples
- SGDClassifier – Final binary classification of samples as benign or malignant
- The model does not underfit and shows little to no overfitting, based on the performance below (Block 8.X):
- Pipeline Testing Set Score : 0.9912
- Pipeline Training Set Score: 0.9824
- Pipeline CV Training Score : 0.9714
- Since the CV score estimates expected performance on unseen data and the testing set is unseen, the model generalizes well, scoring above the 'predicted' CV score.
- Lasso L1 Regularization (Feature Selection) has the highest ROI on model performance (Block 4.1.5):
- Using 19 of the 30 features performs best by removing noisy or unimportant features from the sample.
- Note: The model using 5 features performs as well as the model using 24 features!
- This illustrates the importance of removing sample noise.
- SGDClassifier brings significant performance and runtime benefits (Block 7.2):
- Fit and score time are reduced by two orders of magnitude over RandomForestClassifier and GradientBoosting.
- Optimal SGDC parameters include:
- loss = 'log_loss'
- alpha = 0.01
## Import UCI Dataset and write to local csv
from ucimlrepo import fetch_ucirepo
breast_ca = fetch_ucirepo(id=17)
breast_ca_df = breast_ca.data.original
breast_ca_df.to_csv('UCI_BreastCancer.csv', index=False)
print('Successfully wrote dataset to csv file!')

OR:
## Import UCI Dataset from sklearn.datasets and write to local csv
from sklearn.datasets import load_breast_cancer
breast_ca = load_breast_cancer(as_frame=True)
# Note: column names differ from the UCI version (e.g., 'target' instead of 'Diagnosis')
breast_ca.frame.to_csv('UCI_BreastCancer.csv', index=False)
print('Successfully wrote dataset to csv file!')

# Search Dataset for missing / null values
import pandas as pd
df = pd.read_csv('UCI_BreastCancer.csv')
try:
    if df.isnull().sum().any() > 0:
        print('NaN values found: ', df.isnull().sum())
    else:
        print('No NaN or null values found')
except Exception as e:
    print(e)
# Consider the number of unique values for each feature:
print(df.nunique())
# Verify features and shape
print(df.columns)
print(df.shape)

# Define features and target
y = df.Diagnosis
X = df.drop(columns=['Diagnosis','ID'])
# Verify expected shapes before and after:
print('X shape: ',X.shape)
print('y shape: ',y.shape)
# Convert target data to binary and verify value_counts.
print('\nTarget prior to binary conversion: \n',y.value_counts())
import numpy as np
try:
    y = pd.DataFrame(np.where(y == 'M', 1, 0), columns=['Diagnosis'])
    y = y.Diagnosis
    print('\nTarget post binary conversion: \n', y.value_counts(), '\n')
except Exception as e:
    print(e)
# Verify expected shapes before and after:
print('X shape: ',X.shape)
print('y shape: ',y.shape)
# train_test_split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=13)

# Lasso outperforms LDA on feature selection as validated by SGDClassifier scores:
# Below are averaged results from fitting SGDClassifier 50 times using unique
# random_states and with the respective feature sets.
# Score Std Dev Features
# Lasso feature set: 0.9909 0.0017 19
# Full features set: 0.9889 0.0053 30
# LDA feature set: 0.9691 0.0054 19
# Both Lasso and LDA have 19 features after selection; however, Lasso features
# outperform even the full data set (30 features), likely due to a reduction
# in noisy data (i.e., noise from unimportant / low-importance features).

For more on Lasso L1 Regularization see notebook blocks 4.1.X
For more on LinearDiscriminantAnalysis see notebook blocks 4.2.X
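The pipeline below drops a ClfSwitch step in where the classifier would normally go, so GridSearchCV can swap whole estimators via the search_space. Its definition is not shown in this excerpt; the following is a hypothetical reconstruction, with the class name and `estimator` parameter inferred from the `clf__estimator` keys used below:

```python
# Hypothetical reconstruction of the ClfSwitch helper used in the Pipeline
# below; the notebook's actual definition may differ.
from sklearn.base import BaseEstimator

class ClfSwitch(BaseEstimator):
    """Pipeline step whose wrapped estimator can be swapped by a search."""

    def __init__(self, estimator=None):
        self.estimator = estimator  # set via 'clf__estimator' in search_space

    def fit(self, X, y=None, **fit_params):
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        return self.estimator.predict(X)

    def score(self, X, y):
        return self.estimator.score(X, y)
```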
# Define Preprocessor, Pipeline, and search_space for GridSearchCV()
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import SGDClassifier

scaler = StandardScaler()
preprocessor = ColumnTransformer([
('scaler', scaler, X_train.columns)
])
pipeline = Pipeline([
('preprocessor', preprocessor),
('clf', ClfSwitch())
])
search_space = [
{'clf__estimator': [RandomForestClassifier(random_state=13)],
'clf__estimator__max_depth':[10,15,25],
'clf__estimator__n_estimators':[150,200,250],
},
{'clf__estimator': [GradientBoostingClassifier(random_state=13)],
'clf__estimator__learning_rate':[0.001,0.01,0.1,0.5],
'clf__estimator__n_estimators':[150,200,250],
},
{'clf__estimator': [SGDClassifier(random_state=13)],
'clf__estimator__loss': ['hinge','log_loss'],
'clf__estimator__alpha': [0.01],
'clf__estimator__penalty': ['l2']
}
]

from sklearn.model_selection import GridSearchCV
# Model Selection and Hyperparameter tuning:
gs = GridSearchCV(estimator=pipeline, param_grid=search_space, cv=5, error_score='raise')
gs.fit(X_train, y_train)
# Load the best performing model as gs_best:
gs_best = gs.best_estimator_
# Load the best performing model classifier as gs_best_clf:
gs_best_clf = gs_best.named_steps['clf']
# Print the estimator that best fit the data with the given search_space:
print(gs_best_clf.get_params()['estimator'])

OR:
from sklearn.model_selection import RandomizedSearchCV
# Model Selection and Hyperparameter tuning:
rs = RandomizedSearchCV(estimator=pipeline, param_distributions=search_space,
cv=5, n_iter=5, error_score='raise')
rs.fit(X_train, y_train)
# Load the best performing model as rs_best:
rs_best = rs.best_estimator_
# Load the best performing model classifier as rs_best_clf:
rs_best_clf = rs_best.named_steps['clf']
# Print the estimator that best fit the data with the given search_space:
print(rs_best_clf.get_params()['estimator'])

# Using gs.cv_results_ for model parameter tuning:
cv_df = pd.DataFrame(gs.cv_results_)
columns_of_interest = [
'param_clf__estimator',
'param_clf__estimator__max_depth',
'param_clf__estimator__n_estimators',
'param_clf__estimator__loss',
'param_clf__estimator__penalty',
'param_clf__estimator__alpha',
'mean_test_score',
'std_test_score',
'rank_test_score']
cv_df_results = cv_df[columns_of_interest].round(3)
cv_df_results.style.background_gradient(axis=0, cmap='Spectral')
print(cv_df_results)

from sklearn.metrics import accuracy_score
# Compare the GridSearchCV best_score_ (training data)
# to the best model's accuracy_score on testing data:
y_pred_gs = gs_best.predict(X_test)
print(accuracy_score(y_test, y_pred_gs))
print(gs.best_score_)
# Compare the RandomizedSearchCV best_score_ (training data)
# to the best model's accuracy_score on testing data:
y_pred_rs = rs_best.predict(X_test)
print(accuracy_score(y_test, y_pred_rs))
print(rs.best_score_)

Takeaways
- SGDC outperforms RandomForestClassifier in both accuracy and runtime.
- SGDC performs best with loss='log_loss' and alpha=0.01
- RandomizedSearchCV is a non-exhaustive solution to finding good parameters in a defined number of iterations.
- SMOTE (Synthetic Minority Oversampling Technique) brings moderate benefits to SGDClassifier performance.
- RandomForestClassifier (RFC) consistently underperforms (test scores: 92% full features, 97% Lasso features)
- RFC is significantly more expensive for run-time, averaging > 100 ms per fit compared to the 1 ms of SGDClassifier
- SGDClassifier (SGDC) is a highly performant model (test scores: 98.9% full features, 99.1% Lasso features)
- SGDC performs well with the 'log_loss' and 'hinge' loss parameter, but 'log_loss' outperforms overall
- SGDC brings significant run-time benefits, averaging < 1 ms per fit and score.
- See block 7.4 for SGDC performance with SMOTE dataset balancing.
- RandomizedSearchCV has improved control of run-times due to the n_iter parameter which defaults to 10.
- Runs the risk of not finding the "best" parameters due to n_iter relative to search_space.
- Ideal for quickly finding "good" parameters.
- SMOTE performs better on the Lasso feature set (19 features selected in blocks 4.1.X)
- SMOTE underperforms on the full feature set (30 features), likely due to an increase in sample noise
- BEST MODEL accuracy reaches 99.12% on the test set and ~98% on the training set
# Balances dataset with SMOTE by synthetic insertion of positive diagnosis samples
# into the training data set. Then scale and validate with SGDClassifier performance.
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=13,sampling_strategy=0.99,k_neighbors=5)
X_res, y_res = smote.fit_resample(X_train,y_train)
# Note: a float sampling_strategy is only valid for binary classification problems.

Takeaways
- To use sklearn's Pipeline, helper classes must be defined for Lasso and SMOTE (adding a transform method)
- Note: another solution is to use imblearn's Pipeline
- Defining get_shape() method aids in visualizing the transformation of data in the Pipeline.
- The Pipeline is highly performant and does not appear to suffer much from overfitting.
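As a concrete illustration of the helper-class pattern for the Lasso step, here is a sketch of a transformer exposing the transform method sklearn's Pipeline requires, plus a get_shape helper like the one mentioned above. The class name, alpha default, and coefficient threshold are illustrative assumptions; SMOTE still needs imblearn's Pipeline, since a plain sklearn transformer cannot resample y.

```python
# Illustrative Lasso feature-selection helper with the transform() method
# sklearn's Pipeline expects, plus a get_shape() debugging aid. The name,
# alpha default, and threshold are assumptions, not the notebook's code.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import Lasso

class LassoFeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, alpha=0.01):
        self.alpha = alpha

    def fit(self, X, y):
        lasso = Lasso(alpha=self.alpha).fit(X, y)
        self.mask_ = np.abs(lasso.coef_) > 1e-10  # keep non-zero coefficients
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.mask_]

    def get_shape(self, X):
        """Report the post-transform shape for pipeline debugging."""
        return self.transform(X).shape
```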
# Pipeline Steps:
# Preprocessor (Scale) --> LASSO Features Selection --> SMOTE --> SGDClassifier
# Average Scores from 100 unique train_test_splits (BLOCK 8.5):
# Pipeline Testing Set Score: 0.9732 std dev: 0.0138
# Pipeline Training Set Score: 0.9834 std dev: 0.0045
# Pipeline CV Training Score: 0.9757