Sanila-577/Customer-Churn-Production
Enhanced MLflow Artifact Tracking for ML Pipelines

This project demonstrates production-ready machine learning pipelines with comprehensive MLflow artifact tracking, focusing on customer churn prediction.

🎯 Project Overview

A complete ML system with enhanced MLflow tracking that provides:

  • Comprehensive Data Lineage: Track data from raw input to final model predictions
  • Rich Artifact Management: Automated logging of datasets, models, visualizations, and metadata
  • Production-Ready Monitoring: Real-time inference tracking and performance monitoring
  • Complete Reproducibility: All artifacts needed to reproduce experiments and results

πŸ“ Project Structure

./
├── README.md                          # This file - comprehensive project documentation
├── make.ps1                           # PowerShell helper to run pipelines on Windows
├── config.yaml                        # Central configuration management
├── requirements.txt                   # Python dependencies
├── stepplan.md                        # Task planning and dependency tracking
│
├── artifacts/                         # Generated artifacts and models
│   ├── data/                          # Processed datasets
│   │   ├── X_train.csv                # Training features
│   │   ├── X_test.csv                 # Testing features
│   │   ├── Y_train.csv                # Training labels
│   │   └── Y_test.csv                 # Testing labels
│   ├── encode/                        # Feature encoders
│   │   ├── Gender_encoder.json        # Gender feature encoder
│   │   └── Geography_encoder.json     # Geography feature encoder
│   ├── models/                        # Trained models
│   │   └── churn_analysis.joblib      # Main trained model
│   └── mlflow_run_artifacts/          # MLflow-specific artifacts
│       └── {run_id}/                  # Run-specific artifacts
│           ├── visualizations_*/      # Data visualizations by stage
│           └── final_csv_files/       # Final dataset metadata
│
├── data/                              # Data storage
│   ├── raw/                           # Original raw data
│   │   └── TelcoCustomerChurn.csv     # Raw customer churn dataset (project file)
│   └── processed/                     # Intermediate processed data
│       └── imputed.csv                # Data after missing value handling
│
├── mlruns/                            # MLflow tracking storage
│   ├── 0/                             # Default experiment
│   ├── models/                        # MLflow model registry
│   └── {experiment_id}/               # Experiment-specific runs
│       └── {run_id}/                  # Individual run artifacts
│           ├── artifacts/             # Run artifacts
│           ├── metrics/               # Logged metrics
│           ├── params/                # Logged parameters
│           └── tags/                  # Run tags and metadata
│
├── pipelines/                         # ML pipeline implementations
│   ├── __pycache__/                   # Python cache files
│   ├── data_pipeline.py               # ✨ Enhanced data processing pipeline
│   ├── training_pipeline.py           # ✨ Enhanced model training pipeline
│   └── streaming_inference_pipeline.py # ✨ Enhanced inference pipeline
│
├── src/                               # Core ML modules
│   ├── __pycache__/                   # Python cache files
│   ├── __init__.py                    # Package initialization
│   ├── data_ingestion.py              # Data loading and validation
│   ├── data_spiltter.py               # Train/test splitting strategies
│   ├── feature_binning.py             # Feature binning transformations
│   ├── feature_encoding.py            # Feature encoding strategies
│   ├── feature_scaling.py             # Feature scaling transformations
│   ├── handle_missing_values.py       # Missing value handling strategies
│   ├── model_building.py              # Model architecture definitions
│   ├── model_evaluation.py            # Model evaluation metrics
│   ├── model_inference.py             # Model inference and prediction
│   ├── model_training.py              # Model training orchestration
│   └── outlier_detection.py           # Outlier detection and handling
│
└── utils/                             # Utility modules
    ├── __pycache__/                   # Python cache files
    ├── config.py                      # Configuration management
    └── mlflow_utils.py                # MLflow tracking utilities

🚀 Key Enhancements Implemented

1. Enhanced Data Pipeline (pipelines/data_pipeline.py)

📊 Comprehensive Data Profiling

  • Stage-wise Tracking: Profiles data at each processing stage (raw → missing_handled → outliers_removed → encoded → final)
  • Rich Visualizations: Automatic generation of distribution plots, correlation matrices
  • Dataset Artifacts: Proper MLflow dataset tracking with lineage and versioning

πŸ” Data Quality Monitoring

  • Metrics Tracking: Rows, columns, missing values, memory usage at each stage
  • Transformation Logging: Before/after metrics for each transformation step
  • Error Handling: Graceful handling of processing failures with detailed logging

πŸ“ Artifact Management

# Example: Data profiling and visualization
create_data_visualizations(df, 'raw', run_artifacts_dir)
log_stage_metrics(df, 'raw')

# MLflow dataset tracking
raw_dataset = mlflow.data.from_pandas(df, source=data_path, name="raw_churn_data")
mlflow.log_input(raw_dataset, context="raw_data")

2. Enhanced Training Pipeline (pipelines/training_pipeline.py)

🎯 Model Performance Tracking

  • Comprehensive Visualizations: Confusion matrices, ROC curves, feature importance plots
  • Training Metadata: Training time, model size, complexity metrics
  • Performance Analytics: Detailed model performance analysis and comparison

📈 Model Artifacts

# Example: Model performance visualization
create_model_performance_visualizations(model, X_test, y_test, evaluation_results, 
                                      run_artifacts_dir, 'XGboost')

# Model metadata logging
log_model_metadata(model, 'XGboost', model_params, training_time, run_artifacts_dir)
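The training time passed to the metadata logger above can be captured with a simple timer; a minimal sketch (the `timed_fit` helper name is an assumption, not the project's API):

```python
import time

def timed_fit(model, X_train, y_train):
    """Fit a model and return (model, training_time_seconds)."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    return model, time.perf_counter() - start
```

The returned duration can then be logged as `training_time_seconds` alongside the other model metadata.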

3. Enhanced Inference Pipeline (pipelines/streaming_inference_pipeline.py)

⚡ Real-time Monitoring

  • Batch Processing: Configurable batch sizes for efficient logging (default: 100 predictions)
  • Performance Tracking: Inference time, prediction distributions, risk categorization
  • Production Monitoring: Real-time model performance metrics

📊 Prediction Analytics

# Example: Inference tracking
class InferenceTracker:
    def __init__(self, batch_size=100):
        self._buffer, self.batch_size = [], batch_size

    def track_prediction(self, input_data, prediction_result, inference_time):
        # Track each prediction with its metadata
        self._buffer.append((input_data, prediction_result, inference_time))
        if len(self._buffer) >= self.batch_size:
            self._log_batch()  # log the batch automatically when full

πŸ› οΈ MLflow Artifacts Generated

Data Pipeline Artifacts

MLflow Run Artifacts:
├── raw_data/                         # Original dataset
├── visualizations/                   # Stage-wise data visualizations
│   ├── raw/                          # Raw data distributions
│   ├── encoded/                      # Post-encoding visualizations
│   └── final/                        # Final processed data plots
├── final_datasets/                   # Train/test CSV files with metadata
│   ├── X_train.csv, X_test.csv       # Feature datasets
│   ├── Y_train.csv, Y_test.csv       # Label datasets
│   └── final_csv_metadata.json       # Comprehensive metadata
└── processed_datasets/               # Final processed datasets

Training Pipeline Artifacts

MLflow Run Artifacts:
├── model_performance/                # Model performance analysis
│   └── XGboost/                      # Model-specific artifacts
│       ├── confusion_matrix_XGboost.png
│       ├── roc_curve_XGboost.png
│       ├── feature_importance_XGboost.png
│       └── prediction_distribution_XGboost.png
├── model_metadata/                   # Model metadata and information
│   └── model_metadata_XGboost.json
├── trained_models/                   # Actual model files
│   └── churn_analysis.joblib
└── training_summary/                 # Complete training summary
    └── training_summary.json
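The model-size field recorded in the metadata above can be derived from the serialized model file; a minimal sketch (the `model_size_mb` helper name is illustrative):

```python
import os

def model_size_mb(model_path):
    """Return the on-disk size of a serialized model in megabytes."""
    return os.path.getsize(model_path) / (1024 * 1024)
```

For example, `model_size_mb("artifacts/models/churn_analysis.joblib")` gives the value logged as `model_size_mb`.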

Inference Pipeline Artifacts

MLflow Run Artifacts:
├── inference_batches/                # Prediction batch logs
│   ├── inference_batch_20241219_143022.json
│   └── inference_batch_20241219_143122.json
└── prediction_analytics/             # Inference performance metrics
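The batch file names above follow a `YYYYMMDD_HHMMSS` timestamp convention, which can be generated as follows (the helper name is an assumption):

```python
from datetime import datetime

def batch_filename(now=None):
    """Build a batch file name like inference_batch_20241219_143022.json."""
    now = now or datetime.now()
    return f"inference_batch_{now.strftime('%Y%m%d_%H%M%S')}.json"
```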

📊 MLflow Tracking Features

Dataset Tracking

  • MLflow Datasets: Proper dataset versioning and lineage tracking
  • Schema Evolution: Automatic tracking of schema changes
  • Data Lineage: Complete traceability from raw data to final models

Metrics Logged

# Data Pipeline Metrics
- raw_rows, raw_columns, raw_missing_values, raw_memory_mb
- missing_handled_rows_removed, outliers_removed_count
- final_train_samples, final_test_samples, final_features
- train_class_0, train_class_1, test_class_0, test_class_1

# Training Pipeline Metrics  
- training_time_seconds, model_size_mb, model_complexity
- accuracy, precision, recall, f1, roc_auc
- XGboost_training_time_seconds, XGboost_model_size_mb

# Inference Pipeline Metrics
- batch_size, avg_inference_time_ms, avg_churn_probability
- high_risk_predictions, medium_risk_predictions, low_risk_predictions
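The high/medium/low risk counts above imply a probability-bucketing step. A minimal sketch, assuming illustrative thresholds of 0.7 and 0.3 (the project's actual cut-offs may differ):

```python
def risk_category(churn_probability, high=0.7, medium=0.3):
    """Map a churn probability to a risk bucket for monitoring metrics."""
    if churn_probability >= high:
        return "high_risk"
    if churn_probability >= medium:
        return "medium_risk"
    return "low_risk"
```

Counting predictions per bucket over a batch yields the `*_risk_predictions` metrics listed above.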

Parameters Logged

# Pipeline Configuration
- final_feature_names, preprocessing_steps, data_pipeline_version
- model_type, training_strategy, sklearn_version
- feature_encoding_applied, feature_scaling_applied

# Model Parameters
- n_estimators, max_depth, random_state
- test_size, missing_value_strategy, outlier_detection_method

🚀 Getting Started

Prerequisites

# Create a virtual environment if you don't already have one (the make.ps1 install target also creates it):
python -m venv venv

# Activate the venv (Windows PowerShell)
.\venv\Scripts\Activate.ps1
# or, if your project uses .venv
.\.venv\Scripts\Activate.ps1

# Install dependencies with uv (recommended):
uv pip install -r requirements.txt

# Or, if you don't use uv, use pip inside the venv:
python -m pip install -r requirements.txt

Running the Pipelines (preferred via make.ps1)

The repository provides a PowerShell helper make.ps1 with targets that activate the venv when present and run the pipelines.

1. Data Pipeline

# Preferred (uses the script and venv activation if available)
.\make.ps1 data-pipeline

# Or directly (module style)
python -m pipelines.data_pipeline

2. Training Pipeline

.\make.ps1 train-pipeline
# Or directly
python -m pipelines.training_pipeline

3. Inference Pipeline

.\make.ps1 streaming-inference
# Or directly
python -m pipelines.streaming_inference_pipeline

MLflow UI

# Use the script (it will try to activate the venv and launch MLflow)
.\make.ps1 mlflow-ui

# Or activate venv and run MLflow explicitly (script uses port 5001 by default):
.\venv\Scripts\Activate.ps1
mlflow ui --backend-store-uri file:./mlruns --default-artifact-root ./artifacts --host 127.0.0.1 --port 5001

# Access at: http://127.0.0.1:5001

📈 Key Benefits

🔍 Enhanced Observability

  • Complete Lineage: Track data and model lineage from raw input to predictions
  • Rich Visualizations: Automatic generation of insightful plots and charts
  • Comprehensive Metrics: Detailed metrics at every pipeline stage

🚀 Production Ready

  • Error Handling: Robust error handling with graceful degradation
  • Monitoring: Real-time inference monitoring and performance tracking
  • Reproducibility: Complete artifact tracking for experiment reproduction

⚡ Developer Experience

  • Automated Tracking: Minimal code changes for maximum tracking benefit
  • Rich Metadata: Comprehensive metadata for all artifacts
  • Easy Debugging: Quick access to intermediate results and visualizations

🔧 Configuration

The system is configured through config.yaml:
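An illustrative fragment is shown below; the keys are assumptions inferred from the parameters logged by the pipelines, not the project's exact schema:

```yaml
data:
  raw_path: data/raw/TelcoCustomerChurn.csv
  test_size: 0.2
  missing_value_strategy: impute
  outlier_detection_method: iqr
model:
  type: XGboost
  n_estimators: 100
  max_depth: 6
  random_state: 42
inference:
  batch_size: 100
```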

📊 Performance Optimizations

Code Efficiency

  • 68% Code Reduction: Optimized from ~950 lines to ~300 lines in data pipeline
  • Consolidated Functions: Streamlined helper functions for better maintainability
  • Essential Visualizations: Focus on most valuable plots and metrics

Resource Management

  • Memory Efficient: Efficient handling of large datasets with cleanup
  • Batch Processing: Configurable batch sizes for inference tracking
  • Error Recovery: Graceful fallbacks when artifact logging fails

🎯 Future Enhancements

  • Data Drift Detection: Monitor for data drift in production
  • Model Registry Management: Automated model stage transitions
  • Advanced Monitoring: Additional performance and quality metrics
  • Integration Testing: Comprehensive pipeline testing framework

πŸ“ Development Notes

This enhanced MLflow tracking system provides:

  • Production-grade logging throughout all modules
  • Comprehensive error handling and input validation
  • Enhanced type safety and documentation
  • Complete artifact traceability for ML operations

The implementation follows clean architecture principles with separation of concerns and comprehensive observability for production ML systems.

