This project demonstrates production-ready machine learning pipelines with comprehensive MLflow artifact tracking, focusing on customer churn prediction.
A complete ML system with enhanced MLflow tracking that provides:
- Comprehensive Data Lineage: Track data from raw input to final model predictions
- Rich Artifact Management: Automated logging of datasets, models, visualizations, and metadata
- Production-Ready Monitoring: Real-time inference tracking and performance monitoring
- Complete Reproducibility: All artifacts needed to reproduce experiments and results
Project structure:

```
./
├── README.md                        # This file - comprehensive project documentation
├── make.ps1                         # PowerShell helper to run pipelines on Windows
├── config.yaml                      # Central configuration management
├── requirements.txt                 # Python dependencies
├── stepplan.md                      # Task planning and dependency tracking
│
├── artifacts/                       # Generated artifacts and models
│   ├── data/                        # Processed datasets
│   │   ├── X_train.csv              # Training features
│   │   ├── X_test.csv               # Testing features
│   │   ├── Y_train.csv              # Training labels
│   │   └── Y_test.csv               # Testing labels
│   ├── encode/                      # Feature encoders
│   │   ├── Gender_encoder.json      # Gender feature encoder
│   │   └── Geography_encoder.json   # Geography feature encoder
│   ├── models/                      # Trained models
│   │   └── churn_analysis.joblib    # Main trained model
│   └── mlflow_run_artifacts/        # MLflow-specific artifacts
│       └── {run_id}/                # Run-specific artifacts
│           ├── visualizations_*/    # Data visualizations by stage
│           └── final_csv_files/     # Final dataset metadata
│
├── data/                            # Data storage
│   ├── raw/                         # Original raw data
│   │   └── TelcoCustomerChurn.csv   # Raw customer churn dataset (project file)
│   └── processed/                   # Intermediate processed data
│       └── imputed.csv              # Data after missing value handling
│
├── mlruns/                          # MLflow tracking storage
│   ├── 0/                           # Default experiment
│   ├── models/                      # MLflow model registry
│   └── {experiment_id}/             # Experiment-specific runs
│       └── {run_id}/                # Individual run artifacts
│           ├── artifacts/           # Run artifacts
│           ├── metrics/             # Logged metrics
│           ├── params/              # Logged parameters
│           └── tags/                # Run tags and metadata
│
├── pipelines/                       # ML pipeline implementations
│   ├── __pycache__/                 # Python cache files
│   ├── data_pipeline.py             # ✨ Enhanced data processing pipeline
│   ├── training_pipeline.py         # ✨ Enhanced model training pipeline
│   └── streaming_inference_pipeline.py  # ✨ Enhanced inference pipeline
│
├── src/                             # Core ML modules
│   ├── __pycache__/                 # Python cache files
│   ├── __init__.py                  # Package initialization
│   ├── data_ingestion.py            # Data loading and validation
│   ├── data_spiltter.py             # Train/test splitting strategies
│   ├── feature_binning.py           # Feature binning transformations
│   ├── feature_encoding.py          # Feature encoding strategies
│   ├── feature_scaling.py           # Feature scaling transformations
│   ├── handle_missing_values.py     # Missing value handling strategies
│   ├── model_building.py            # Model architecture definitions
│   ├── model_evaluation.py          # Model evaluation metrics
│   ├── model_inference.py           # Model inference and prediction
│   ├── model_training.py            # Model training orchestration
│   └── outlier_detection.py         # Outlier detection and handling
│
└── utils/                           # Utility modules
    ├── __pycache__/                 # Python cache files
    ├── config.py                    # Configuration management
    └── mlflow_utils.py              # MLflow tracking utilities
```

Data pipeline:
- Stage-wise Tracking: Profiles data at each processing stage (raw → missing_handled → outliers_removed → encoded → final)
- Rich Visualizations: Automatic generation of distribution plots, correlation matrices
- Dataset Artifacts: Proper MLflow dataset tracking with lineage and versioning
- Metrics Tracking: Rows, columns, missing values, memory usage at each stage
- Transformation Logging: Before/after metrics for each transformation step
- Error Handling: Graceful handling of processing failures with detailed logging
```python
# Example: Data profiling and visualization
create_data_visualizations(df, 'raw', run_artifacts_dir)
log_stage_metrics(df, 'raw')

# MLflow dataset tracking
raw_dataset = mlflow.data.from_pandas(df, source=data_path, name="raw_churn_data")
mlflow.log_input(raw_dataset, context="raw_data")
```

Training pipeline:
- Comprehensive Visualizations: Confusion matrices, ROC curves, feature importance plots
- Training Metadata: Training time, model size, complexity metrics
- Performance Analytics: Detailed model performance analysis and comparison
```python
# Example: Model performance visualization
create_model_performance_visualizations(model, X_test, y_test, evaluation_results,
                                        run_artifacts_dir, 'XGboost')

# Model metadata logging
log_model_metadata(model, 'XGboost', model_params, training_time, run_artifacts_dir)
```

Inference pipeline:
- Batch Processing: Configurable batch sizes for efficient logging (default: 100 predictions)
- Performance Tracking: Inference time, prediction distributions, risk categorization
- Production Monitoring: Real-time model performance metrics
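The risk categorization mentioned above is not spelled out in this README; one illustrative way to bucket churn probabilities into the high/medium/low counts that later appear as inference metrics (the thresholds are placeholders, not project values):

```python
def risk_category(churn_probability: float,
                  high: float = 0.7, medium: float = 0.4) -> str:
    """Map a churn probability to a risk bucket (illustrative thresholds)."""
    if churn_probability >= high:
        return "high"
    if churn_probability >= medium:
        return "medium"
    return "low"


def risk_counts(probabilities) -> dict:
    """Aggregate bucket counts in the shape of the *_risk_predictions metrics."""
    counts = {"high_risk_predictions": 0,
              "medium_risk_predictions": 0,
              "low_risk_predictions": 0}
    for p in probabilities:
        counts[f"{risk_category(p)}_risk_predictions"] += 1
    return counts
```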
```python
# Example: Inference tracking
class InferenceTracker:
    def track_prediction(self, input_data, prediction_result, inference_time):
        # Tracks individual predictions with metadata
        # Logs batches automatically when the batch size is reached
        ...
```

MLflow Run Artifacts:
```
├── raw_data/                   # Original dataset
├── visualizations/             # Stage-wise data visualizations
│   ├── raw/                    # Raw data distributions
│   ├── encoded/                # Post-encoding visualizations
│   └── final/                  # Final processed data plots
├── final_datasets/             # Train/test CSV files with metadata
│   ├── X_train.csv, X_test.csv # Feature datasets
│   ├── Y_train.csv, Y_test.csv # Label datasets
│   └── final_csv_metadata.json # Comprehensive metadata
└── processed_datasets/         # Final processed datasets
```
MLflow Run Artifacts:
```
├── model_performance/          # Model performance analysis
│   └── XGboost/                # Model-specific artifacts
│       ├── confusion_matrix_XGboost.png
│       ├── roc_curve_XGboost.png
│       ├── feature_importance_XGboost.png
│       └── prediction_distribution_XGboost.png
├── model_metadata/             # Model metadata and information
│   └── model_metadata_XGboost.json
├── trained_models/             # Actual model files
│   └── churn_analysis.joblib
└── training_summary/           # Complete training summary
    └── training_summary.json
```
MLflow Run Artifacts:
```
├── inference_batches/          # Prediction batch logs
│   ├── inference_batch_20241219_143022.json
│   └── inference_batch_20241219_143122.json
└── prediction_analytics/       # Inference performance metrics
```
- MLflow Datasets: Proper dataset versioning and lineage tracking
- Schema Evolution: Automatic tracking of schema changes
- Data Lineage: Complete traceability from raw data to final models
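For the schema-evolution point, one simple approach is to fingerprint column names and dtypes and compare fingerprints between runs. This is an illustrative sketch, not the project's actual mechanism:

```python
import hashlib

import pandas as pd


def schema_fingerprint(df: pd.DataFrame) -> str:
    """Hash column names and dtypes so a schema change between runs shows
    up as a fingerprint mismatch (values do not affect the hash)."""
    schema = ";".join(f"{col}:{dtype}"
                      for col, dtype in df.dtypes.astype(str).items())
    return hashlib.sha256(schema.encode()).hexdigest()[:16]
```

Logging the fingerprint as a run tag would let two runs be compared for schema drift at a glance.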
```
# Data Pipeline Metrics
- raw_rows, raw_columns, raw_missing_values, raw_memory_mb
- missing_handled_rows_removed, outliers_removed_count
- final_train_samples, final_test_samples, final_features
- train_class_0, train_class_1, test_class_0, test_class_1

# Training Pipeline Metrics
- training_time_seconds, model_size_mb, model_complexity
- accuracy, precision, recall, f1, roc_auc
- XGboost_training_time_seconds, XGboost_model_size_mb

# Inference Pipeline Metrics
- batch_size, avg_inference_time_ms, avg_churn_probability
- high_risk_predictions, medium_risk_predictions, low_risk_predictions
```

```
# Pipeline Configuration
- final_feature_names, preprocessing_steps, data_pipeline_version
- model_type, training_strategy, sklearn_version
- feature_encoding_applied, feature_scaling_applied

# Model Parameters
- n_estimators, max_depth, random_state
- test_size, missing_value_strategy, outlier_detection_method
```

Environment setup:
```powershell
# Create a venv if you don't have one (the script also creates a venv when you run the install target):
python -m venv venv

# Activate the venv (Windows PowerShell)
.\venv\Scripts\Activate.ps1
# or, if your project uses .venv
.\.venv\Scripts\Activate.ps1

# Install dependencies with uv (recommended):
uv pip install -r requirements.txt

# Or, if you don't use uv, use pip inside the venv:
python -m pip install -r requirements.txt
```

The repository provides a PowerShell helper, make.ps1, with targets that activate the venv when present and run the pipelines.
Run the data pipeline:
```powershell
# Preferred (uses the script and venv activation if available)
.\make.ps1 data-pipeline

# Or directly (module style)
python -m pipelines.data_pipeline
```

Run the training pipeline:
```powershell
.\make.ps1 train-pipeline

# Or directly
python -m pipelines.training_pipeline
```

Run the streaming inference pipeline:
```powershell
.\make.ps1 streaming-inference

# Or directly
python -m pipelines.streaming_inference_pipeline
```

Launch the MLflow UI:
```powershell
# Use the script (it will try to activate the venv and launch MLflow)
.\make.ps1 mlflow-ui

# Or activate the venv and run MLflow explicitly (the script uses port 5001 by default):
.\venv\Scripts\Activate.ps1
mlflow ui --backend-store-uri file:./mlruns --default-artifact-root ./artifacts --host 127.0.0.1 --port 5001

# Access at: http://127.0.0.1:5001
```

Key benefits:
- Complete Lineage: Track data and model lineage from raw input to predictions
- Rich Visualizations: Automatic generation of insightful plots and charts
- Comprehensive Metrics: Detailed metrics at every pipeline stage
- Error Handling: Robust error handling with graceful degradation
- Monitoring: Real-time inference monitoring and performance tracking
- Reproducibility: Complete artifact tracking for experiment reproduction
- Automated Tracking: Minimal code changes for maximum tracking benefit
- Rich Metadata: Comprehensive metadata for all artifacts
- Easy Debugging: Quick access to intermediate results and visualizations
The system is configured through config.yaml.
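The concrete keys of config.yaml are not reproduced in this README, so the snippet below is only a hypothetical loader plus a dotted-path accessor; the key `model.n_estimators` is a placeholder:

```python
def load_config(path="config.yaml") -> dict:
    """Read the central YAML configuration (requires PyYAML)."""
    import yaml  # imported lazily so the accessor below works without PyYAML
    with open(path) as f:
        return yaml.safe_load(f)


def cfg_get(config: dict, dotted_key: str, default=None):
    """Fetch a nested value by dotted key, e.g. cfg_get(cfg, "model.n_estimators")."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node
```

A dotted accessor keeps pipeline code readable while still failing soft (returning a default) when a key is absent.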
Performance optimizations:
- 68% Code Reduction: Optimized from ~950 lines to ~300 lines in the data pipeline
- Consolidated Functions: Streamlined helper functions for better maintainability
- Essential Visualizations: Focus on most valuable plots and metrics
- Memory Efficient: Efficient handling of large datasets with cleanup
- Batch Processing: Configurable batch sizes for inference tracking
- Error Recovery: Graceful fallbacks when artifact logging fails

Future enhancements:
- Data Drift Detection: Monitor for data drift in production
- Model Registry Management: Automated model stage transitions
- Advanced Monitoring: Additional performance and quality metrics
- Integration Testing: Comprehensive pipeline testing framework
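Data drift detection is listed above as future work and no method is specified; as one minimal example, a standardized mean-shift check could serve as a first drift signal (the function names and threshold are illustrative):

```python
from statistics import mean, stdev


def drift_score(reference, current) -> float:
    """Standardized shift of the current mean relative to a reference
    window; a deliberately simple univariate drift signal."""
    ref_std = stdev(reference) or 1.0  # guard against zero-variance references
    return abs(mean(current) - mean(reference)) / ref_std


def drifted(reference, current, threshold: float = 3.0) -> bool:
    """Flag drift when the standardized shift exceeds the threshold."""
    return drift_score(reference, current) > threshold
```

Production systems typically use richer statistics (e.g. population stability index or KS tests), but a per-feature check like this is enough to wire drift metrics into the existing MLflow logging.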
This enhanced MLflow tracking system provides:
- Production-grade logging throughout all modules
- Comprehensive error handling and input validation
- Enhanced type safety and documentation
- Complete artifact traceability for ML operations
The implementation follows clean architecture principles with separation of concerns and comprehensive observability for production ML systems.