Optimize validation overhead and data passing for hyperopt
Addressed profiling bottlenecks in `arraysetops.py` (numpy unique) and repeated data loading:
- `ml_grid/pipeline/data.py`: Update `pipe` constructor to accept an `input_df` argument, allowing pre-loaded data to be passed to workers and eliminating redundant disk I/O during hyperopt trials.
- `ml_grid/util/global_params.py`: Optimize `custom_roc_auc_score` to use `min() == max()` checks instead of the expensive `np.unique()` sort (O(N) vs O(N log N)).
- `ml_grid/pipeline/grid_search_cross_validate.py`:
  - Refactor H2O model checks to use a module-level `H2O_MODEL_TYPES` constant.
  - Optimize `y_train` handling: only convert to categorical for H2O models; keep as numeric/numpy for Scikit-learn to avoid validation overhead.
  - Replace `len(np.unique())` with `series.nunique()` for faster class count checks.
  - Pass numpy arrays (values) instead of Pandas objects to Scikit-learn's `cross_validate` to reduce indexing overhead.
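
Illustrative sketch of the `min() == max()` idea (not the actual `custom_roc_auc_score` code; the helper name `has_single_class` is hypothetical and assumes labels arrive as a 1-D numeric array-like). It shows why a single-class check does not need the sort behind `np.unique()`:

```python
import numpy as np


def has_single_class(y_true) -> bool:
    """Return True when y_true holds at most one distinct label.

    Hypothetical helper for illustration only. min() and max() are each
    a single O(N) pass, whereas np.unique() sorts the array first.
    """
    y = np.asarray(y_true)
    if y.size == 0:
        # min()/max() raise on empty arrays; treat empty as "not scorable".
        return True
    return bool(y.min() == y.max())


def has_single_class_slow(y_true) -> bool:
    """Sort-based equivalent, shown only for comparison."""
    return len(np.unique(np.asarray(y_true))) < 2
```

Both functions agree on which inputs are single-class; only the cost differs.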