
Commit 74ab16c

Author: SamoraHunter (committed)
Optimize pipeline performance and H2O stability
- **Workflow:** Reuse `HyperparameterSearch` CV results instead of re-running cross-validation, significantly reducing runtime. Added `force_second_cv` option to override.
- **H2O Performance:** Disable `return_train_score` during search and remove `h2o.assign` in `predict` to eliminate expensive garbage-collection overhead. Optimize `H2OFrame` creation by passing column types directly.
- **Stability:** Add fallback logic for `BayesSearchCV` result parsing. Sanitize H2O parameters (e.g., removing `HGLM`) and handle backend crashes (NPEs) gracefully.
- **Diagrams:** Add diagrams for the grid search CV execution flow.
1 parent 51cbd1e commit 74ab16c

12 files changed

Lines changed: 457 additions & 114 deletions
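The headline workflow change is reusing the scores the hyperparameter search has already computed instead of running a second cross-validation. A minimal sketch of that idea, assuming a fitted scikit-learn-style search object (with `cv_results_` and `best_index_`) and the new `force_second_cv` flag; this is illustrative, not the repository's exact code:

```python
# Minimal sketch of the CV-reuse idea (not the repository's exact implementation).
# Assumes a fitted sklearn-style search (GridSearchCV / RandomizedSearchCV /
# BayesSearchCV) whose cv_results_ already holds per-fold test scores.
import numpy as np
from sklearn.model_selection import cross_validate

def get_cv_scores(search, X_train, y_train, cv, force_second_cv=False):
    """Reuse scores already computed during the hyperparameter search,
    unless a second, independent cross-validation run is forced."""
    if not force_second_cv:
        try:
            results = search.cv_results_
            best = search.best_index_
            # Mean test score of the best candidate, already computed by the search.
            return {"roc_auc": results["mean_test_score"][best]}
        except (AttributeError, KeyError):
            pass  # Fall through to a fresh CV run if parsing fails.
    scores = cross_validate(search.best_estimator_, X_train, y_train,
                            cv=cv, scoring=["roc_auc"], return_train_score=False)
    return {"roc_auc": np.mean(scores["test_roc_auc"])}
```

Skipping the redundant CV is what drives the runtime reduction; `force_second_cv` restores the old behaviour when an independent estimate is wanted.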

README.md

Lines changed: 2 additions & 1 deletion
@@ -35,7 +35,7 @@ Binary classification is a common machine learning task where the goal is to cat
 
 This framework is designed to be a comprehensive toolkit for binary classification experiments, offering a wide range of configurable options:
 
-- **Diverse Model Support:** Includes a collection of standard classifiers (e.g., Logistic Regression, SVM, RandomForest, XGBoost, LightGBM, CatBoost) and specialized time-series models from the `aeon` library (e.g., HIVE-COTE v2, MUSE, OrdinalTDE).
+- **Diverse Model Support:** Includes a collection of standard classifiers (e.g., Logistic Regression, SVM, RandomForest, XGBoost, LightGBM, CatBoost, H2O AutoML/GLM/GBM) and specialized time-series models from the `aeon` library (e.g., HIVE-COTE v2, MUSE, OrdinalTDE).
 - **Advanced Hyperparameter Tuning:** Supports multiple search strategies:
   - **Grid Search:** Exhaustively search a defined parameter grid.
   - **Random Search:** Randomly sample from the parameter space.
@@ -270,3 +270,4 @@ This project is licensed under the MIT License - see the LICENSE file for detail
 ## Acknowledgments
 scikit-learn
 hyperopt
+H2O.ai
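For context, the search strategies the README lists map onto standard scikit-learn and scikit-optimize APIs. A generic sketch using those libraries directly (not this framework's own `HyperparameterSearch` wrappers, which are assumptions outside this diff):

```python
# Generic illustration of the search strategies named in the README;
# plain scikit-learn / scikit-optimize, not ml_grid's internal wrappers.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from skopt import BayesSearchCV

param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, 10]}
est = RandomForestClassifier(random_state=0)

grid = GridSearchCV(est, param_grid, scoring="roc_auc", cv=2)                   # exhaustive
rand = RandomizedSearchCV(est, param_grid, n_iter=4, scoring="roc_auc", cv=2)   # random sampling
bayes = BayesSearchCV(est, param_grid, n_iter=4, scoring="roc_auc", cv=2)       # Bayesian (skopt)
```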
Lines changed: 143 additions & 0 deletions
@@ -0,0 +1,143 @@
graph TB
    Start([Grid Search CV Initialization]) --> Init[Initialize Parameters<br/>- Algorithm<br/>- Parameter Space<br/>- CV Strategy<br/>- Global Config]

    Init --> DataPrep[Data Preparation]

    DataPrep --> CheckDF{X_train is<br/>DataFrame?}
    CheckDF -->|No| ConvertDF[Convert to DataFrame]
    CheckDF -->|Yes| CheckSeries{y_train is<br/>Series?}
    ConvertDF --> CheckSeries

    CheckSeries -->|No| ConvertSeries[Convert to Series<br/>Align with X_train index]
    CheckSeries -->|Yes| SetCategory[Set y_train as category<br/>Name = 'outcome']
    ConvertSeries --> SetCategory

    SetCategory --> ModelCheck{Model Type<br/>Detection}

    ModelCheck -->|GPU Model| GPUConfig[Configure GPU<br/>n_jobs=1<br/>TF Memory Growth]
    ModelCheck -->|SVC| ScaleData[Apply StandardScaler]
    ModelCheck -->|KNN/SimBSig| AdjustKNN[Adjust n_neighbors<br/>for small datasets]
    ModelCheck -->|CatBoost| CheckSize{Dataset<br/>Size OK?}
    ModelCheck -->|Other| CVSetup

    CheckSize -->|Too Small| ReturnDefault[Return Default Score 0.5]
    CheckSize -->|OK| AdjustCatBoost[Adjust subsample/rsm<br/>parameters]

    GPUConfig --> CVSetup[CV Strategy Setup]
    ScaleData --> CVSetup
    AdjustKNN --> CVSetup
    AdjustCatBoost --> CVSetup

    CVSetup --> TestMode{Test Mode<br/>Enabled?}
    TestMode -->|Yes| FastCV[KFold n_splits=2]
    TestMode -->|No| ProductionCV[RepeatedKFold<br/>n_splits=2, n_repeats=2]

    FastCV --> ParamValidation
    ProductionCV --> ParamValidation

    ParamValidation[Parameter Validation] --> BayesCheck{Bayesian<br/>Search?}

    BayesCheck -->|Yes| WrapCategorical[Wrap lists in<br/>Categorical for skopt]
    BayesCheck -->|No| ValidateParams[Validate parameters<br/>against estimator]

    WrapCategorical --> ConfigNIter
    ValidateParams --> ConfigNIter

    ConfigNIter[Configure n_iter] --> LocalOverride{Local<br/>Override?}
    LocalOverride -->|Yes| UseLocal[Use local n_iter]
    LocalOverride -->|No| UseGlobal[Use global n_iter]

    UseLocal --> CapIter{Exceeds<br/>max_iter?}
    UseGlobal --> CapIter
    CapIter -->|Yes| CapValue[Cap to max value]
    CapIter -->|No| Search
    CapValue --> Search

    Search[HyperparameterSearch<br/>Instantiation] --> ResetIndices[Reset DataFrame indices<br/>to integer-based]

    ResetIndices --> IndexCheck{Index<br/>Aligned?}
    IndexCheck -->|No| RaiseError[Raise AssertionError]
    IndexCheck -->|Yes| RunSearch[search.run_search]

    RunSearch --> SearchError{Search<br/>Error?}
    SearchError -->|SVC Dual Coef| SVCDefault[Return default 0.5]
    SearchError -->|Other Error| LogRaise[Log error & re-raise]
    SearchError -->|Success| TestModeCheck2{Test Mode?}

    TestModeCheck2 -->|Yes| SkipCV[Skip final CV<br/>Return 0.5]
    TestModeCheck2 -->|No| CheckClasses{Classes >= 2?}

    CheckClasses -->|No| RaiseValueError[Raise ValueError<br/>AUC not defined]
    CheckClasses -->|Yes| H2OCheck{H2O or<br/>Keras Model?}

    H2OCheck -->|Yes| SingleThread[Set n_jobs=1<br/>for CV]
    H2OCheck -->|No| MultiThread[Use grid_n_jobs]

    SingleThread --> CheckCache{Can reuse<br/>cached CV<br/>results?}
    MultiThread --> CheckCache

    CheckCache -->|Yes & Not Forced| ExtractCache[Extract scores from<br/>cv_results_]
    CheckCache -->|No or Forced| FreshCV[Run fresh<br/>cross_validate]

    ExtractCache --> CacheError{Extraction<br/>Error?}
    CacheError -->|Yes| FreshCV
    CacheError -->|No| ProcessScores

    FreshCV --> CVType{Model<br/>Type?}
    CVType -->|Keras| KerasCV[Internal CV handling<br/>in fit method]
    CVType -->|Other| StandardCV[cross_validate with<br/>multiple metrics]

    KerasCV --> CVErrors{CV<br/>Errors?}
    StandardCV --> CVErrors

    CVErrors -->|XGBoost GPU Error| FallbackCPU[Fallback to CPU<br/>tree_method='hist']
    CVErrors -->|AdaBoost Poor| AdaBoostDefault[Use default scores]
    CVErrors -->|H2O RuntimeError| H2ODefault[Use default scores]
    CVErrors -->|Other Error| GenericDefault[Use default scores<br/>Log error]
    CVErrors -->|Success| ProcessScores[Process Scores]

    FallbackCPU --> Retry[Retry cross_validate]
    Retry --> RetryError{Retry<br/>Error?}
    RetryError -->|Yes| GenericDefault
    RetryError -->|No| ProcessScores

    ProcessScores --> TimeCheck{CV time ><br/>threshold?}
    TimeCheck -->|Yes| WarnSlow[Warn about slow CV]
    TimeCheck -->|No| LogTime[Log CV completion time]

    WarnSlow --> Predict
    LogTime --> Predict

    Predict[Predict on X_test] --> UpdateLog{Score logging<br/>enabled?}

    UpdateLog -->|Yes| SaveScores[Update score log with:<br/>- CV scores<br/>- predictions<br/>- best estimator<br/>- timing info]
    UpdateLog -->|No| WarnNoLog[Warn: no logging]

    SaveScores --> CalcAUC[Calculate final AUC<br/>on test set]
    WarnNoLog --> CalcAUC

    CalcAUC --> H2OCleanup{H2O<br/>Model?}
    H2OCleanup -->|Yes| LeaveRunning[Leave H2O cluster running<br/>for next model]
    H2OCleanup -->|No| End

    LeaveRunning --> End([Return AUC Score])

    SVCDefault --> End
    SkipCV --> H2OCleanup
    ReturnDefault --> End
    RaiseError --> End
    LogRaise --> End
    RaiseValueError --> End
    AdaBoostDefault --> CalcAUC
    H2ODefault --> CalcAUC
    GenericDefault --> CalcAUC

    style Start fill:#e1f5e1
    style End fill:#ffe1e1
    style SearchError fill:#fff3cd
    style CVErrors fill:#fff3cd
    style TestMode fill:#d1ecf1
    style TestModeCheck2 fill:#d1ecf1
    style H2OCheck fill:#f8d7da
    style BayesCheck fill:#d1ecf1
    style CheckCache fill:#d4edda

assets/grid_search_cross_validate.svg

Lines changed: 102 additions & 0 deletions

config_hyperopt.yml

Lines changed: 1 addition & 0 deletions
@@ -10,6 +10,7 @@ global_params:
   # Number of iterations for RandomizedSearchCV and BayesSearchCV
   n_iter: 2
   max_param_space_iter_value : 10
+  force_second_cv: false # If True, forces a second cross-validation run even if cached results are available. Defaults to False.
 
   # Experiment settings for the hyperopt run
   experiment:
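A small sketch of how such a flag could be read from the YAML. The file name and key come from the diff above; the loader shown is generic PyYAML, not necessarily the project's own config reader:

```python
# Generic way to read the new flag from config_hyperopt.yml with PyYAML;
# the project's own config loading code may differ.
import yaml

with open("config_hyperopt.yml") as fh:
    config = yaml.safe_load(fh)

force_second_cv = config["global_params"].get("force_second_cv", False)
print(f"force_second_cv={force_second_cv}")  # False unless overridden
```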

ml_grid/model_classes/H2OAutoMLClassifier.py

Lines changed: 2 additions & 0 deletions
@@ -78,6 +78,7 @@ def fit(self, X: pd.DataFrame, y: pd.Series, **kwargs) -> "H2OAutoMLClassifier":
             self.model_ = H2OGeneralizedLinearEstimator(
                 family="binomial", ignore_const_cols=False
             )
+            self._sanitize_model_params()
             self.model_.train(y=outcome_var, x=x_vars, training_frame=train_h2o)
             self._using_dummy_model = True  # Set flag for reference
 
@@ -101,6 +102,7 @@ def _finalize_dummy_fit(self, X, y):
         self.model_ = H2OGeneralizedLinearEstimator(
             family="binomial", ignore_const_cols=False
         )
+        self._sanitize_model_params()
         # We need to create a minimal H2OFrame to train on
         train_h2o, x_vars, outcome_var, _ = self._prepare_fit(X, y)
         self.model_.train(y=outcome_var, x=x_vars, training_frame=train_h2o)

ml_grid/model_classes/H2OBaseClassifier.py

Lines changed: 29 additions & 36 deletions
@@ -409,6 +409,17 @@ def _handle_small_data_fallback(self, X: pd.DataFrame, y: pd.Series) -> bool:
             return True
         return False
 
+    def _sanitize_model_params(self):
+        """Removes problematic parameters from the H2O model instance before training.
+
+        This handles version mismatches where the Python client sends parameters
+        (like HGLM) that the H2O backend does not recognize.
+        """
+        if self.model_ and hasattr(self.model_, "_parms"):
+            if "HGLM" in self.model_._parms:
+                self.logger.debug("Removing 'HGLM' parameter from H2O model to prevent backend error.")
+                del self.model_._parms["HGLM"]
+
     def fit(self, X: pd.DataFrame, y: pd.Series, **kwargs) -> "H2OBaseClassifier":
         """Fits the H2O model.
 
@@ -447,6 +458,9 @@ def fit(self, X: pd.DataFrame, y: pd.Series, **kwargs) -> "H2OBaseClassifier":
         self.logger.debug(f"Creating H2O model with params: {model_params}")
         self.model_ = self.estimator_class(**model_params)
 
+        # Sanitize parameters to prevent backend errors (e.g. HGLM)
+        self._sanitize_model_params()
+
         # Call the train() method with ONLY the data-related arguments
         self.logger.debug("Calling H2O model.train()...")
         self.model_.train(x=x_vars, y=outcome_var, training_frame=train_h2o)
 
@@ -550,30 +564,18 @@ def predict(self, X: pd.DataFrame) -> np.ndarray:
             # This seems to create a more 'stable' frame in the H2O cluster, preventing
             # internal errors during prediction with some models like GLM.
 
-            # Create a temporary H2OFrame by uploading the pandas DataFrame.
-            # FIX: Do not pass column_types to constructor as it can be flaky.
-            # Instead, create frame and explicitly cast columns.
-            tmp_frame = h2o.H2OFrame(X, column_names=self.feature_names_)
-
-            # Enforce types explicitly to match training schema
+            # Optimization: Pass column_types directly to constructor to avoid
+            # expensive column-by-column casting loop (which triggers GC overhead).
+            # We filter feature_types_ to ensure only present columns are passed.
+            col_types = None
             if self.feature_types_:
-                for col in self.feature_names_:
-                    if col in self.feature_types_ and col in tmp_frame.columns:
-                        t_type = self.feature_types_[col]
-                        if t_type == "enum":
-                            tmp_frame[col] = tmp_frame[col].asfactor()
-                        elif t_type in ["int", "real", "numeric"]:
-                            tmp_frame[col] = tmp_frame[col].asnumeric()
-                        elif t_type == "string":
-                            tmp_frame[col] = tmp_frame[col].ascharacter()
-
-            # Assign it to a unique key in the H2O cluster. This is more reliable.
-            # Add PID and ID to ensure uniqueness across processes
-            frame_id = f"pred_{os.getpid()}_{id(self)}_{pd.Timestamp.now().strftime('%H%M%S%f')}"
-            h2o.assign(tmp_frame, frame_id)
-
-            # Get a handle to the newly created frame
-            test_h2o = h2o.get_frame(frame_id)
+                col_types = {k: v for k, v in self.feature_types_.items() if k in X.columns}
+
+            tmp_frame = h2o.H2OFrame(X, column_names=self.feature_names_, column_types=col_types)
+
+            # Optimization: Use the temporary frame directly.
+            # Explicitly assigning a key (h2o.assign) triggers expensive GC checks.
+            test_h2o = tmp_frame
 
         except Exception as e:
             raise RuntimeError(f"Failed to create H2O frame for prediction: {e}")
 
@@ -654,21 +656,12 @@ def predict_proba(self, X: pd.DataFrame) -> np.ndarray:
 
         # Create H2O frame with explicit column names
         try:
-            # FIX: Explicit type enforcement for predict_proba as well
-            tmp_frame = h2o.H2OFrame(X, column_names=self.feature_names_)
-
+            # Optimization: Pass column_types directly to constructor
+            col_types = None
             if self.feature_types_:
-                for col in self.feature_names_:
-                    if col in self.feature_types_ and col in tmp_frame.columns:
-                        t_type = self.feature_types_[col]
-                        if t_type == "enum":
-                            tmp_frame[col] = tmp_frame[col].asfactor()
-                        elif t_type in ["int", "real", "numeric"]:
-                            tmp_frame[col] = tmp_frame[col].asnumeric()
-                        elif t_type == "string":
-                            tmp_frame[col] = tmp_frame[col].ascharacter()
+                col_types = {k: v for k, v in self.feature_types_.items() if k in X.columns}
 
-            test_h2o = tmp_frame
+            test_h2o = h2o.H2OFrame(X, column_names=self.feature_names_, column_types=col_types)
         except Exception as e:
             raise RuntimeError(f"Failed to create H2O frame for prediction: {e}")
 
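Taken together, the new prediction path reduces to roughly the following. This is a condensed sketch, not the class itself, using only documented `h2o.H2OFrame` constructor arguments; `feature_names`/`feature_types` stand in for the attributes the classifier stores at fit time:

```python
# Condensed sketch of the optimized prediction path: a single H2OFrame
# construction with column_types, no per-column casting and no h2o.assign round-trip.
import h2o
import pandas as pd

def predict_sketch(model, X: pd.DataFrame, feature_names, feature_types):
    col_types = None
    if feature_types:
        # Only pass types for columns actually present in X.
        col_types = {k: v for k, v in feature_types.items() if k in X.columns}
    test_h2o = h2o.H2OFrame(X, column_names=feature_names, column_types=col_types)
    preds = model.predict(test_h2o)          # H2OFrame with a 'predict' column
    return preds.as_data_frame()["predict"].to_numpy()
```

The design point is that both `h2o.assign` and the column-by-column `asfactor()`/`asnumeric()` casts force extra round-trips and garbage-collection checks in the H2O cluster, so declaring the schema once at frame creation is cheaper.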

ml_grid/model_classes/H2OGAMClassifier.py

Lines changed: 2 additions & 0 deletions
@@ -171,6 +171,7 @@ def fit(self, X: pd.DataFrame, y: pd.Series, **kwargs) -> "H2OGAMClassifier":
         estimator_cls = self.estimator_class
 
         self.model_ = estimator_cls(**model_params)
+        self._sanitize_model_params()
 
         # --- RUNTIME TRAIN WITH FALLBACK ---
         try:
 
@@ -200,6 +201,7 @@ def fit(self, X: pd.DataFrame, y: pd.Series, **kwargs) -> "H2OGAMClassifier":
                 glm_params["lambda_search"] = False
 
                 self.model_ = H2OGeneralizedLinearEstimator(**glm_params)
+                self._sanitize_model_params()
                 self.model_.train(x=x_vars, y=outcome_var, training_frame=train_h2o)
             else:
                 raise e