Notebooks:
This notebook contains the code for preprocessing netCDF files and creating what will become the training dataset for the models.
The netCDF files are processed with the NetCDFPreprocessor class, which implements three modes:
- Filtered: applies filters on co-polar/cross-polar gain, SNR and satellite-surface distance
- With_lat_lons: includes geographic coordinates of specular points
- Unfiltered: removes only the SNR filter while maintaining the others
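The three modes can be sketched as a single masking function. This is a minimal illustration, not the notebook's actual implementation: the variable names, thresholds, and the assumption that "with_lat_lons" applies the same filters as "filtered" are all hypothetical.

```python
import numpy as np

def build_filter_mask(gain_copol, gain_xpol, snr, sp_rx_distance, mode="filtered"):
    """Return a boolean mask of samples to keep for the given mode.

    All thresholds below are illustrative placeholders, not the
    notebook's real values.
    """
    mask = np.isfinite(snr)
    if mode in ("filtered", "with_lat_lons", "unfiltered"):
        mask &= (gain_copol > 0) & (gain_xpol > 0)   # co-/cross-polar gain filters
        mask &= sp_rx_distance < 1.0e6               # satellite-surface distance (m)
    if mode in ("filtered", "with_lat_lons"):
        mask &= snr > 2.0                            # SNR filter, dropped in "unfiltered"
    return mask
```

In the "unfiltered" mode only the SNR condition is skipped, matching the description above; "with_lat_lons" would additionally carry the specular-point coordinates through as extra columns.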
Included features:
- NetCDF file integrity checking
- Data masking and filtering
- Binary label extraction based on surface type
- Batch processing with stratified sampling
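The stratified-sampling step of the batch processing could look like the following sketch, assuming the data has already been flattened into a pandas DataFrame; the column name "label" and the per-class count are illustrative assumptions.

```python
import pandas as pd

def stratified_sample(df, label_col="label", n_per_class=2, seed=0):
    """Draw an equal number of rows from each class (balanced sample)."""
    return (df.groupby(label_col)
              .sample(n=n_per_class, random_state=seed)  # per-group sampling
              .reset_index(drop=True))
```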
Model features are computed by the DDMFeatureExtractor class, covering the following aspects:
- Basic statistics: mean, standard deviation, skewness, kurtosis, entropy, Gini coefficient
- Positional features: peak index, center of mass, moments of inertia
- Spatial segmentation: analysis by quadrants and central region of the DDM matrix
- Temporal analysis: first derivatives, autocorrelations, Fourier transform
- Comparative features: statistical differences between quadrants and center
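A few of the basic statistics above can be sketched directly in NumPy. The function and feature names here are illustrative, not the DDMFeatureExtractor's real API.

```python
import numpy as np

def ddm_basic_features(ddm):
    """Illustrative subset of DDM statistics on a 2-D delay-Doppler map."""
    x = ddm.ravel().astype(float)
    mean, std = x.mean(), x.std()
    skew = ((x - mean) ** 3).mean() / std ** 3
    kurt = ((x - mean) ** 4).mean() / std ** 4 - 3.0   # excess kurtosis
    p = x / x.sum()                                    # normalize to a distribution
    entropy = -(p[p > 0] * np.log(p[p > 0])).sum()     # Shannon entropy
    xs = np.sort(x)                                    # Gini via the sorted-values formula
    n = x.size
    gini = (2 * np.arange(1, n + 1) - n - 1) @ xs / (n * xs.sum())
    peak = np.unravel_index(np.argmax(ddm), ddm.shape) # positional feature: peak index
    return {"mean": mean, "std": std, "skew": skew, "kurtosis": kurt,
            "entropy": entropy, "gini": gini, "peak_index": peak}
```

The quadrant/center and temporal features would apply the same statistics to sub-blocks of the matrix and to per-row derivative/autocorrelation series, respectively.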
Using the features obtained this way, the most promising models are identified with PyCaret.
Complete pipeline for binary classification with CatBoost and experiment tracking via MLflow.
- Loading balanced sample for tuning (25K samples per class)
- Executing complete pipeline: preparation, tuning, training, evaluation
- Saving development model and generating artifacts
- Loading extended balanced dataset (250K samples per class)
- Model finalization using the development model as base
- Optional scaler retraining and early stopping
- Feature importance comparison between development and production models
- Artifact generation for deployment
The same approach described for CatBoost is applied to XGBoost.
In this notebook, two pre-trained models (CatBoost and XGBoost) are loaded and a Voting classifier is created.
Ensemble classifier that combines predictions from multiple models (CatBoost, XGBoost):
- Selects the class with maximum probability among base models. In case of tie, CatBoost is favored
- Supports differentiated scaling for each model
- Implements standard scikit-learn interface
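The voting rule described above can be sketched as a small estimator. The class name, the (scaler, model) pairing, and the use of argmax stability for the tie-break are assumptions for illustration, not the notebook's actual code.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class MaxProbaVotingClassifier(BaseEstimator, ClassifierMixin):
    """Selects the class with the highest probability across base models.

    `models` is an ordered list of (scaler, fitted_classifier) pairs; each
    model may use its own scaler (differentiated scaling). On exact ties,
    earlier models win, so placing CatBoost first favors it.
    """
    def __init__(self, models):
        self.models = models

    def _stacked_proba(self, X):
        probas = []
        for scaler, clf in self.models:
            Xs = scaler.transform(X) if scaler is not None else X
            probas.append(clf.predict_proba(Xs))
        return np.stack(probas)          # shape: (n_models, n_samples, n_classes)

    def predict(self, X):
        stacked = self._stacked_proba(X)
        n_models, n_samples, n_classes = stacked.shape
        flat = stacked.transpose(1, 0, 2).reshape(n_samples, -1)
        # argmax is stable, so on exact ties the earlier model's class wins
        return flat.argmax(axis=1) % n_classes
```

A full scikit-learn-compatible version would also implement `fit` and `predict_proba`; they are omitted here to keep the voting rule in focus.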
The system implements a complete evaluation pipeline that includes:
- Threshold optimization: search for optimal threshold to maximize F1-score
- Robust validation: testing on balanced dataset (50K samples) in addition to standard test set
- Comparative analysis: performance comparison between different models and configurations
- Extended metrics: accuracy, precision, recall, F1, AUC-ROC, specificity, NPV
- Visualizations: ROC curves, precision-recall curves, confusion matrices, calibration curves
Outputs:
- Optimized ensemble models saved in joblib/CatBoost format
- Optimal threshold configurations to maximize specific metrics
- Detailed performance reports with comparative visualizations
- Preprocessed datasets in Parquet format for future reuse
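The threshold-optimization step can be sketched with scikit-learn's precision-recall curve: scan the candidate cutoffs and keep the one that maximizes F1. Function and variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, y_scores):
    """Return the probability cutoff that maximizes F1, and that F1."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    # precision_recall_curve returns one more (precision, recall) point
    # than thresholds, so drop the last F1 entry before taking the argmax.
    i = np.argmax(f1[:-1])
    return thresholds[i], f1[i]
```

The same scan could target other metrics (e.g. specificity or NPV) by swapping the score computed per threshold.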
In this notebook, n test sets of a chosen size are created; the model's performance metrics are computed on each set and then averaged.
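A minimal sketch of this repeated-evaluation scheme, assuming any model with a `predict` method; the sampling without replacement, the metric set, and all names are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def averaged_metrics(model, X, y, n_sets=5, set_size=200, seed=0):
    """Score the model on n_sets random test sets and average the metrics."""
    rng = np.random.default_rng(seed)
    accs, f1s = [], []
    for _ in range(n_sets):
        idx = rng.choice(len(X), size=set_size, replace=False)  # one test set
        pred = model.predict(X[idx])
        accs.append(accuracy_score(y[idx], pred))
        f1s.append(f1_score(y[idx], pred))
    return {"accuracy": float(np.mean(accs)), "f1": float(np.mean(f1s))}
```

Averaging over several sampled test sets gives a more stable estimate (and, if the per-set scores are kept, a spread) than a single fixed split.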