selfcriticalmass/RIJAN_DHAKAL_STAT563_PROJ-1_FALL_2025

Modular Pipeline for Maximum Likelihood Estimation and Model Selection

This repository provides a modular pipeline for fitting probability distributions to data by maximum likelihood estimation (MLE) and computing model selection criteria: log-likelihood (LL), the Akaike Information Criterion (AIC), the Schwarz Bayesian Criterion (SBC, also known as BIC), and ICOMP (Bozdogan's information complexity criterion).

The pipeline is designed for flexibility, allowing users to define custom sub-functions for different probability density functions (PDFs), while the core optimization and metric calculations are handled automatically.

Overview

The pipeline consists of three main components:

  1. Sub-functions: User-defined functions that specify the log-likelihood for a particular PDF, along with initial parameter guesses and other metadata.
  2. Optimizer Module: Handles the numerical optimization to find the maximum likelihood estimates (MLE) and computes the Hessian matrix.
  3. Pipeline Function: Ties everything together, running the optimization and calculating the model selection metrics.

Key Features

  • Modularity: Easily swap or add new distribution sub-functions without modifying core code.
  • Automatic Optimization: Uses SciPy's minimize for robust MLE.
  • Hessian Computation: Numerically computes the Hessian for confidence intervals or further analysis.
  • Model Selection: Computes LL, AIC, SBC, and ICOMP for model comparison.
  • Bounds Support: Handles parameter bounds for constrained optimization.

Installation and Dependencies

Ensure you have Python 3.7+ installed. Required packages:

  • numpy
  • scipy
  • pandas (used by compare_models)
  • matplotlib (used by the plotting pipeline)

Install via pip:

pip install numpy scipy pandas matplotlib

Quick Start

  1. Define a sub-function for your PDF (see below).
  2. Generate or load your data as a NumPy array.
  3. Run the pipeline:
import numpy as np

from pipeline import run_pipeline
from your_sub_function_module import your_sub_function

data = np.array([1.0, 2.0, 3.0, ...])  # Your data
result = run_pipeline(data, your_sub_function)

print(result)
# Output: {'name': 'your_pdf', 'θhat': [...], 'LL': ..., 'AIC': ..., 'SBC': ..., 'ICOMP': ..., 'hessian': ...}

Implementing Custom Sub-Functions

The core of the modularity lies in the sub-functions. Each sub-function is a Python function that encapsulates the log-likelihood logic for a specific PDF.

Sub-Function Structure

A sub-function must:

  1. Accept at least one argument: data (a NumPy array of observations).
  2. Optionally accept additional parameters if needed (e.g., for truncated distributions).
  3. Define an inner function _negloglik(θ) that computes the negative log-likelihood given parameters θ.
  4. Compute initial parameter guesses (θ0).
  5. Return a tuple: (name, info_dict)

Where:

  • name: A string identifier for the PDF (e.g., "normal", "exponential").
  • info_dict: A dictionary containing:
    • 'negloglik': The callable _negloglik function.
    • 'initial_params': A list or NumPy array of initial parameter values (θ0).
    • 'bounds' (optional): A tuple of (lower, upper) bounds for each parameter, e.g., ((0, None), (0, None)) for positive parameters.
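As a concrete instance of this contract, here is a minimal sub-function for the one-parameter Rayleigh distribution (the name rayleigh_sub is illustrative, not part of the repository):

```python
import numpy as np
from scipy import stats

def rayleigh_sub(data):
    """Sub-function for the Rayleigh distribution; single parameter scale > 0."""
    def _negloglik(theta):
        scale = theta[0]
        if scale <= 0:
            return np.inf  # invalid parameter region
        return -np.sum(stats.rayleigh.logpdf(data, scale=scale))

    # The Rayleigh scale has a closed-form MLE, which makes a good starting value
    theta0 = [np.sqrt(np.mean(np.asarray(data) ** 2) / 2)]
    bounds = ((1e-6, None),)  # scale > 0

    return "rayleigh", {
        'negloglik': _negloglik,
        'initial_params': theta0,
        'bounds': bounds,
    }
```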

Step-by-Step Guide to Implementing a Sub-Function

  1. Choose Your PDF: Decide on the distribution. Research its probability density function (PDF) and parameter meanings.

  2. Import Necessary Modules:

    import numpy as np
    from scipy import stats  # For built-in PDFs, or implement manually
  3. Define the Negative Log-Likelihood Function:

    • Inside your sub-function, create a nested function _negloglik(θ).
    • θ is a sequence of parameters (e.g., for Normal: [mean, std]).
    • Compute the log-likelihood for each data point and sum (negate for minimization).
    • Handle edge cases: return np.inf for invalid parameters to avoid optimization errors.
    • Example for Normal distribution:
      def _negloglik(θ):
          mu, sigma = θ
          if sigma <= 0:
              return np.inf  # Invalid parameter
          return -np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma))
  4. Compute Initial Parameter Guesses:

    • Use method-of-moments, sample statistics, or domain knowledge.
    • Ensure guesses are within valid ranges.
    • Example: For Normal, use sample mean and std:
      θ0 = [np.mean(data), np.std(data, ddof=1)]
  5. Set Parameter Bounds (Optional but Recommended):

    • Define bounds to prevent optimization from exploring invalid regions.
    • Use None for unbounded sides.
    • Example: For Normal, sigma > 0:
      bounds = ((None, None), (1e-6, None))  # mu unbounded, sigma > 0
  6. Return the Required Tuple:

    return "normal", {
        'negloglik': _negloglik,
        'initial_params': θ0,
        'bounds': bounds
    }
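Assembled, steps 1-6 for the Normal distribution read as follows (a sketch consistent with the snippets above):

```python
import numpy as np
from scipy import stats

def normal_sub(data):
    """Sub-function for the Normal distribution; theta = [mu, sigma]."""
    def _negloglik(theta):
        mu, sigma = theta
        if sigma <= 0:
            return np.inf  # invalid parameter (step 3)
        return -np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma))

    # Sample statistics as starting values (step 4)
    theta0 = [np.mean(data), np.std(data, ddof=1)]
    # mu unbounded, sigma strictly positive (step 5)
    bounds = ((None, None), (1e-6, None))

    # Step 6: the (name, info_dict) tuple expected by the pipeline
    return "normal", {
        'negloglik': _negloglik,
        'initial_params': theta0,
        'bounds': bounds,
    }
```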

Complete Example: Exponential Distribution

def exponential_sub(data):
    """
    Sub-function for Exponential distribution.
    Parameter: rate (lambda) > 0.
    PDF: lambda * exp(-lambda * x) for x > 0.
    """
    def _negloglik(θ):
        rate = θ[0]
        if rate <= 0:
            return np.inf
        return -np.sum(stats.expon.logpdf(data, scale=1/rate))

    # Initial guess: rate = 1 / mean (method of moments)
    θ0 = [1 / np.mean(data)]

    bounds = ((1e-6, None),)  # rate > 0

    return "exponential", {
        'negloglik': _negloglik,
        'initial_params': θ0,
        'bounds': bounds
    }

Tips for Implementation

  • Validation: Check data assumptions (e.g., positivity for certain distributions).
  • Efficiency: Vectorize computations using NumPy for large datasets.
  • Robustness: Use try-except in _negloglik if needed, but returning np.inf is usually sufficient.
  • Parameter Order: Consistently order parameters (e.g., always shape first for Weibull).
  • Documentation: Add docstrings explaining the distribution and parameters.
  • Testing: Test your sub-function with known data before integrating.
  • Advanced: For complex PDFs, you can define helper functions inside the sub-function.

Common Patterns for Different Distribution Types

  • Two-Parameter Distributions (e.g., Normal, Weibull):

    • θ = [param1, param2]
    • bounds = ((min1, max1), (min2, max2))
  • Scale-Only (e.g., Exponential):

    • θ = [scale_or_rate]
    • Often one bound.
  • Location-Scale (e.g., Normal):

    • Location often unbounded, scale bounded below.
  • Shape-Scale (e.g., Weibull, Gamma):

    • Both shape and scale > 0.
  • Truncated Distributions: Modify _negloglik to account for truncation (e.g., normalize by CDF difference).
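The truncation bullet is the least mechanical, so here is one way it might look: for a Normal left-truncated at a known point, each log-density is renormalized by the log of the untruncated tail mass (truncated_normal_sub is an illustrative name, not repository code):

```python
import numpy as np
from scipy import stats

def truncated_normal_sub(data, truncation_point=0.0):
    """Normal distribution left-truncated at a known point; theta = [mu, sigma]."""
    def _negloglik(theta):
        mu, sigma = theta
        if sigma <= 0:
            return np.inf
        # Renormalize: log f(x) - log P(X > truncation_point)
        log_tail = stats.norm.logsf(truncation_point, loc=mu, scale=sigma)
        return -np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma) - log_tail)

    theta0 = [np.mean(data), np.std(data, ddof=1)]
    bounds = ((None, None), (1e-6, None))

    return "truncated_normal", {
        'negloglik': _negloglik,
        'initial_params': theta0,
        'bounds': bounds,
    }
```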

Debugging Sub-Functions

  • Check Initial LL: Compute _negloglik(θ0) manually. It should be finite and reasonable.
  • Plot LL Surface: For 1-2 parameters, visualize the log-likelihood surface.
  • Compare with SciPy: Fit using stats.distribution.fit and compare parameters.
  • Handle NaNs/Infs: Ensure data is clean; add checks in _negloglik.
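The first and third checks are easy to script. The snippet below reuses the exponential sub-function from this guide and compares its optimum against stats.expon.fit; scipy.optimize.minimize stands in here for the repository's optimizer module, which may differ in detail:

```python
import numpy as np
from scipy import stats, optimize

def exponential_sub(data):
    """The exponential sub-function from this guide."""
    def _negloglik(theta):
        rate = theta[0]
        if rate <= 0:
            return np.inf
        return -np.sum(stats.expon.logpdf(data, scale=1 / rate))
    return "exponential", {
        'negloglik': _negloglik,
        'initial_params': [1 / np.mean(data)],
        'bounds': ((1e-6, None),),
    }

rng = np.random.default_rng(42)
data = stats.expon.rvs(scale=0.5, size=1000, random_state=rng)
name, info = exponential_sub(data)

# Check 1: the negative log-likelihood at theta0 should be finite
nll0 = info['negloglik'](info['initial_params'])
assert np.isfinite(nll0)

# Check 3: the optimized rate should agree with SciPy's own fit
res = optimize.minimize(info['negloglik'], info['initial_params'],
                        bounds=info['bounds'])
loc, fitted_scale = stats.expon.fit(data, floc=0)
assert np.isclose(res.x[0], 1 / fitted_scale, rtol=1e-3)
```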

Running the Pipeline

Once you have a sub-function, use run_pipeline(data, sub_function).

  • data: NumPy array of observations.
  • sub_function: Your sub-function (not called, passed as reference).

The result dictionary contains:

  • 'name': PDF name.
  • 'θhat': Optimal parameters (NumPy array).
  • 'LL': Maximum log-likelihood.
  • 'AIC': Akaike Information Criterion.
  • 'SBC': Schwarz Bayesian Criterion (BIC).
  • 'ICOMP': Information complexity criterion (Bozdogan's ICOMP).
  • 'hessian': Hessian matrix of the negative log-likelihood at θhat (for approximate standard errors, take the square roots of the diagonal of its inverse).
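On the last point: assuming 'hessian' is the Hessian of the negative log-likelihood at θhat (the observed information), approximate standard errors follow from the diagonal of its inverse. A sketch, using the known closed-form Hessian for i.i.d. Normal data at the MLE:

```python
import numpy as np

def standard_errors(hessian):
    """Approximate standard errors from the observed-information Hessian."""
    cov = np.linalg.inv(hessian)  # inverse Hessian approximates the covariance of theta-hat
    return np.sqrt(np.diag(cov))

# Example: Hessian of the Normal negative log-likelihood at the MLE
# for n observations with sigma = 2 is diag(n / sigma^2, 2n / sigma^2).
n, sigma = 100, 2.0
H = np.array([[n / sigma**2, 0.0],
              [0.0, 2 * n / sigma**2]])
se = standard_errors(H)  # SE(mu) = sigma / sqrt(n) = 0.2
```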

Model Comparison

To compare multiple models, use compare_models(data, sub_functions_list) which returns a pandas DataFrame with columns: Model, M (number of parameters), θhat, LL, AIC, SBC, ICOMP.

Examples

See sub_functions.py for Weibull and Inverse Gaussian examples, and demo.py for usage.

Plotting Pipeline

The repository includes a modular plotting pipeline for overlaying fitted probability density functions (PDFs) on data histograms. This allows visual comparison of multiple distributions against the same dataset.

The pipeline is designed for flexibility, enabling users to define custom plotting sub-functions for different PDFs, while the core plotting logic (histogram generation, overlay management, chunking) is handled automatically.

Implementing Custom Plotting Sub-Functions

Plotting sub-functions follow a similar modular pattern to model sub-functions. Each encapsulates the parameter fitting and PDF evaluation logic for a specific distribution.

Plotting Sub-Function Structure

A plotting sub-function must:

  1. Take no required arguments (optional, data-dependent parameters are allowed but typically unnecessary).
  2. Define a fitter function that accepts data (NumPy array) and returns fitted parameters as a tuple.
  3. Specify a pdf_callable function that computes the PDF at given points.
  4. Indicate whether it's a SciPy-based distribution (is_scipy: bool).
  5. Return a tuple: (name, info_dict)

Where:

  • name: A descriptive string (e.g., "Weibull", "Custom Exponential").
  • info_dict: A dictionary containing:
    • 'fitter': Callable function that takes data and returns parameter tuple or None on failure.
    • 'pdf_callable': Callable that takes x-values and parameters, returns PDF values.
    • 'is_scipy': Boolean indicating if it's a SciPy distribution (enables KS test goodness-of-fit).
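A minimal, hand-rolled instance of this interface (exponential_plot_sub and its internals are illustrative, not repository code):

```python
import numpy as np

def exponential_plot_sub():
    """Plotting sub-function for a hand-rolled exponential distribution."""
    def fitter(data):
        try:
            rate = 1.0 / np.mean(data)  # method-of-moments (= MLE) for the rate
            print(f"Exponential Fit: Rate={rate:.3f}")
            return (rate,)
        except Exception as e:
            print(f"Error fitting exponential: {e}")
            return None

    def pdf_callable(x, rate):
        # Zero outside the support so overlays stay well-defined
        x = np.asarray(x, dtype=float)
        return np.where(x >= 0, rate * np.exp(-rate * np.clip(x, 0, None)), 0.0)

    return "Custom Exponential", {
        'fitter': fitter,
        'pdf_callable': pdf_callable,
        'is_scipy': False,
    }
```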

Step-by-Step Guide to Implementing Plotting Sub-Functions

  1. Choose Your PDF: Select the distribution. For SciPy distributions, leverage built-in fit and pdf methods. For custom, implement manually.

  2. Import Dependencies:

    import numpy as np
    from scipy import stats  # For built-in distributions
  3. Define the Fitter Function:

    • Create an inner function fitter(data).
    • Use SciPy's fit method for standard distributions.
    • Handle exceptions and return None on failure.
    • Print fit details for user feedback.
    • Example for Weibull:
      def fitter(data):
          try:
              params = stats.weibull_min.fit(data, floc=0)  # Fix location if needed
              print(f"Weibull Fit: Shape={params[0]:.3f}, Loc={params[1]:.3f}, Scale={params[2]:.3f}")
              return params
          except Exception as e:
              print(f"Error fitting Weibull: {e}")
              return None

KDE Analysis Module

The repository includes a dedicated module for Kernel Density Estimation (KDE) analysis, enabling comparison of KDE models against parametric distributions.

KDE provides non-parametric density estimation using bandwidth selection methods.

Bandwidth Estimation

Implemented using Least Squares Cross-Validation (LSCV), which minimizes an estimate of the Mean Integrated Squared Error (MISE) by balancing roughness penalty and goodness-of-fit terms.
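For reference, the textbook LSCV objective for a Gaussian kernel with bandwidth h is LSCV(h) = ∫ f̂² dx − (2/n) Σᵢ f̂₋ᵢ(xᵢ), where f̂₋ᵢ is the leave-one-out estimate; both terms have closed forms for Gaussian kernels. The sketch below implements that generic objective and is not necessarily identical to the implementation in kde_analysis:

```python
import numpy as np
from scipy import stats, optimize

def lscv_score(h, data):
    """LSCV(h) = integral of fhat^2 - (2/n) * sum_i fhat_{-i}(x_i), Gaussian kernel."""
    x = np.asarray(data, dtype=float)
    n = len(x)
    diff = x[:, None] - x[None, :]
    # Convolving two Gaussian kernels gives a kernel with bandwidth sqrt(2)*h
    term1 = stats.norm.pdf(diff, scale=np.sqrt(2) * h).sum() / n**2
    # Leave-one-out density at each observation: drop the diagonal terms
    K = stats.norm.pdf(diff, scale=h)
    loo = (K.sum(axis=1) - K.diagonal()) / (n - 1)
    return term1 - 2.0 * loo.mean()

def lscv_bandwidth(data, bounds=(1e-3, 10.0)):
    """Minimize the LSCV score over a bandwidth interval."""
    res = optimize.minimize_scalar(lscv_score, bounds=bounds,
                                   args=(data,), method='bounded')
    return res.x
```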

KDE Fitting and Evaluation

  • Fit KDE using Gaussian kernel.
  • Evaluate PDF at arbitrary points.

Information Criteria for KDE

Calculates LL, AIC, SBC, ICOMP for KDE models, treating bandwidth as a single parameter.

Usage

from kde_analysis import run_kde_analysis

result = run_kde_analysis(data)
# Returns {'bandwidth': float, 'kde_model': KernelDensity, 'IC': {'logL': ..., 'AIC': ..., ...}}

Plotting Comparisons

from plot_comparison import plot_data_kde_parametric
from sub_functions import weibull_sub

plot_data_kde_parametric(data, [weibull_sub])  # One plot per sub-function

This generates plots overlaying the data histogram, the LSCV KDE, and the fitted parametric PDFs.

  4. Define the PDF Callable (continuing the plotting sub-function steps from above):

    • For SciPy: Use stats.distribution.pdf.
    • For custom: Implement the PDF formula using NumPy for vectorization.
    • Ensure it handles parameter unpacking with *params.
    • Example for custom exponential:
      def pdf_callable(x, lambda_rate):
          return np.where(x >= 0, lambda_rate * np.exp(-lambda_rate * x), 0.0)
  5. Set SciPy Flag:

    • True for SciPy distributions (enables automatic KS test).
    • False for custom implementations.
  6. Return the Required Tuple:

    return "Custom Exponential", {
        'fitter': fitter,
        'pdf_callable': pdf_callable,
        'is_scipy': False
    }

Complete Example: Gamma Distribution

def gamma_plot_sub():
    """
    Sub-function for Gamma distribution.
    Parameters: shape (a > 0), scale (or rate).
    """
    def fitter(data):
        try:
            # Gamma fit returns (a, loc, scale)
            params = stats.gamma.fit(data, floc=0)  # Fix location at 0
            print(f"Gamma Fit: Shape={params[0]:.3f}, Loc={params[1]:.3f}, Scale={params[2]:.3f}")
            return params
        except Exception as e:
            print(f"Error fitting Gamma: {e}")
            return None

    return "Gamma", {
        'fitter': fitter,
        'pdf_callable': stats.gamma.pdf,
        'is_scipy': True
    }

Tips for Implementing Plotting Sub-Functions

  • SciPy Leverage: For built-in distributions, use stats.distribution.fit with appropriate constraints (e.g., floc=0 for positive-only).
  • Custom PDFs: Vectorize with NumPy. Use np.where for conditional logic to avoid invalid domains.
  • Parameter Validation: Check fitted parameters (e.g., ensure positivity) before returning.
  • Error Handling: Always wrap fitting in try-except; provide meaningful error messages.
  • Performance: Fitting can be slow for large datasets; consider sampling if needed.
  • Goodness-of-Fit: SciPy sub-functions automatically include KS test results in plot legends.
  • Flexibility: Sub-functions can be data-dependent (e.g., pass additional args), but keep simple for modularity.
  • Testing: Test fitter with sample data; verify PDF callable produces expected shapes.

Common Patterns for Plotting Sub-Functions

  • SciPy Distributions (e.g., Weibull, Normal, Gamma):

    • Use stats.distribution.fit and stats.distribution.pdf.
    • Set is_scipy = True for KS test.
    • Often fix loc=0 for positive domains.
  • Custom Distributions (e.g., Exponential, Custom Mixture):

    • Implement fitter using method-of-moments or MLE.
    • Define PDF manually with NumPy.
    • Set is_scipy = False.
  • Truncated Distributions: Modify PDF callable to account for truncation (normalize by CDF difference).

Debugging Plotting Sub-Functions

  • Fit Validation: Check if fitter returns sensible parameters (e.g., positive scales).
  • PDF Shape: Plot PDF manually with fitted params to ensure correct shape.
  • Domain Issues: Ensure PDF is defined over the data range; handle with np.where.
  • Chunking: Use max_overlays_per_plot to debug overlays one at a time.
  • KS Test: For SciPy, verify p-values make sense (high p = good fit).
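For the KS-test bullet, a quick standalone check might look like this (note that running a KS test with parameters estimated from the same data makes the reported p-value optimistic, so treat it as a rough diagnostic rather than a formal test):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = stats.gamma.rvs(a=2.0, scale=1.5, size=300, random_state=rng)

# Fit with the location fixed at zero, then test the fitted model against the data
params = stats.gamma.fit(data, floc=0)
ks_stat, p_value = stats.kstest(data, 'gamma', args=params)
# A small statistic and a high p-value (e.g. > 0.05) suggest no evidence against the fit
```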

Plotting Pipeline Usage

from plotting_pipeline import plot_overlay, weibull_plot_sub, gamma_plot_sub

# Single plot with all overlays
plot_overlay(data, [weibull_plot_sub, gamma_plot_sub])

# Multiple plots (2 overlays per plot)
plot_overlay(data, [weibull_plot_sub, gamma_plot_sub], max_overlays_per_plot=2, lower_bound=0, upper_bound=10)

Parameters:

  • data: NumPy array of observations.
  • sub_functions: List of sub-function callables.
  • max_overlays_per_plot: Optional chunking for multiple plots.
  • lower_bound/upper_bound: Optional x-axis limits (auto-inferred if None).

Examples

See sub_functions.py, plotting_pipeline.py, and demo.py for implementations and usage.

Extending the Pipeline

  • Add Metrics: Modify calculate_metrics in pipeline.py for custom criteria.
  • Alternative Optimizers: Edit optimize_and_hessian in optimizer.py.
  • Batch Processing: Loop over multiple sub-functions for model comparison.

Limitations

  • Assumes continuous data (modify for discrete).
  • Numerical Hessian may be slow for many parameters.
  • No built-in goodness-of-fit tests (add separately).

Contributing

Add new sub-functions to sub_functions.py or create your own module. Ensure they follow the interface.

For issues or improvements, please provide details on the sub-function and error messages.

