This repository provides a modular pipeline for fitting probability distributions to data using maximum likelihood estimation (MLE) and computing model selection criteria: Log-Likelihood (LL), Akaike Information Criterion (AIC), Schwarz Bayesian Criterion (SBC, also known as BIC), and ICOMP (corrected AIC).
The pipeline is designed for flexibility, allowing users to define custom sub-functions for different probability density functions (PDFs), while the core optimization and metric calculations are handled automatically.
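For reference, the usual definitions of these criteria, where M is the number of parameters and n the sample size (a sketch; note that this repository labels the corrected AIC as ICOMP):

```python
import numpy as np

def information_criteria(LL, M, n):
    """Standard definitions (the pipeline computes these automatically)."""
    AIC = -2 * LL + 2 * M
    SBC = -2 * LL + M * np.log(n)                 # a.k.a. BIC
    AICc = AIC + 2 * M * (M + 1) / (n - M - 1)    # corrected AIC; 'ICOMP' here
    return {'LL': LL, 'AIC': AIC, 'SBC': SBC, 'ICOMP': AICc}
```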
The pipeline consists of three main components:
- Sub-functions: User-defined functions that specify the log-likelihood for a particular PDF, along with initial parameter guesses and other metadata.
- Optimizer Module: Handles the numerical optimization to find the maximum likelihood estimates (MLE) and computes the Hessian matrix.
- Pipeline Function: Ties everything together, running the optimization and calculating the model selection metrics.
Key features:

- Modularity: Easily swap in or add new distribution sub-functions without modifying core code.
- Automatic Optimization: Uses SciPy's `minimize` for robust MLE.
- Hessian Computation: Numerically computes the Hessian for confidence intervals or further analysis.
- Model Selection: Computes LL, AIC, SBC, and ICOMP for model comparison.
- Bounds Support: Handles parameter bounds for constrained optimization.
Ensure you have Python 3.7+ installed. Required packages:
- numpy
- scipy
Install via pip:

```bash
pip install numpy scipy
```

Quick start:

- Define a sub-function for your PDF (see below).
- Generate or load your data as a NumPy array.
- Run the pipeline:
```python
import numpy as np
from pipeline import run_pipeline
from your_sub_function_module import your_sub_function

data = np.array([1.0, 2.0, 3.0, ...])  # Your data
result = run_pipeline(data, your_sub_function)
print(result)
# Output: {'name': 'your_pdf', 'θhat': [...], 'LL': ..., 'AIC': ..., 'SBC': ..., 'ICOMP': ..., 'hessian': ...}
```

The core of the modularity lies in the sub-functions. Each sub-function is a Python function that encapsulates the log-likelihood logic for a specific PDF.
A sub-function must:

- Accept at least one argument: `data` (a NumPy array of observations). Optionally, it may accept additional parameters if needed (e.g., for truncated distributions).
- Define an inner function `_negloglik(θ)` that computes the negative log-likelihood given parameters `θ`.
- Compute initial parameter guesses (`θ0`).
- Return a tuple: `(name, info_dict)`

Where:

- `name`: A string identifier for the PDF (e.g., "normal", "exponential").
- `info_dict`: A dictionary containing:
  - `'negloglik'`: The callable `_negloglik` function.
  - `'initial_params'`: A list or NumPy array of initial parameter values (`θ0`).
  - `'bounds'` (optional): A tuple of (lower, upper) bounds for each parameter, e.g., `((0, None), (0, None))` for positive parameters.
To write a new sub-function, follow these steps:

1. Choose Your PDF: Decide on the distribution. Research its probability density function (PDF) and parameter meanings.

2. Import Necessary Modules:

   ```python
   import numpy as np
   from scipy import stats  # For built-in PDFs, or implement manually
   ```

3. Define the Negative Log-Likelihood Function:
   - Inside your sub-function, create a nested function `_negloglik(θ)`. `θ` is a sequence of parameters (e.g., for Normal: [mean, std]).
   - Compute the log-likelihood for each data point and sum (negate for minimization).
   - Handle edge cases: return `np.inf` for invalid parameters to avoid optimization errors.
   - Example for the Normal distribution:

   ```python
   def _negloglik(θ):
       mu, sigma = θ
       if sigma <= 0:
           return np.inf  # Invalid parameter
       return -np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma))
   ```

4. Compute Initial Parameter Guesses:
   - Use method-of-moments, sample statistics, or domain knowledge.
   - Ensure guesses are within valid ranges.
   - Example: For Normal, use the sample mean and standard deviation:

   ```python
   θ0 = [np.mean(data), np.std(data, ddof=1)]
   ```

5. Set Parameter Bounds (Optional but Recommended):
   - Define bounds to prevent the optimizer from exploring invalid regions.
   - Use `None` for unbounded sides.
   - Example: For Normal, sigma > 0:

   ```python
   bounds = ((None, None), (1e-6, None))  # mu unbounded, sigma > 0
   ```

6. Return the Required Tuple:

   ```python
   return "normal", {
       'negloglik': _negloglik,
       'initial_params': θ0,
       'bounds': bounds
   }
   ```
A second complete example, for the Exponential distribution:

```python
def exponential_sub(data):
    """
    Sub-function for the Exponential distribution.
    Parameter: rate (lambda) > 0.
    PDF: lambda * exp(-lambda * x) for x > 0.
    """
    def _negloglik(θ):
        rate = θ[0]
        if rate <= 0:
            return np.inf
        return -np.sum(stats.expon.logpdf(data, scale=1/rate))

    # Initial guess: rate = 1 / mean (method of moments)
    θ0 = [1 / np.mean(data)]
    bounds = ((1e-6, None),)  # rate > 0
    return "exponential", {
        'negloglik': _negloglik,
        'initial_params': θ0,
        'bounds': bounds
    }
```

Best practices:

- Validation: Check data assumptions (e.g., positivity for certain distributions).
- Efficiency: Vectorize computations using NumPy for large datasets.
- Robustness: Use try-except in `_negloglik` if needed, but returning `np.inf` is usually sufficient.
- Parameter Order: Order parameters consistently (e.g., always shape first for Weibull).
- Documentation: Add docstrings explaining the distribution and parameters.
- Testing: Test your sub-function with known data before integrating (see the sanity check after this list).
- Advanced: For complex PDFs, you can define helper functions inside the sub-function.
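For the testing point, a quick sanity check might look like this (a sketch assuming the `exponential_sub` above; the true rate and tolerance are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=1 / 2.0, size=5000)  # simulate with true rate = 2.0

result = run_pipeline(data, exponential_sub)
assert np.isclose(result['θhat'][0], 2.0, rtol=0.1), result['θhat']
```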
Typical parameter patterns:

- Two-Parameter Distributions (e.g., Normal, Weibull):
  - `θ = [param1, param2]`
  - `bounds = ((min1, max1), (min2, max2))`
- Scale-Only (e.g., Exponential):
  - `θ = [scale_or_rate]`
  - Often a single bound.
- Location-Scale (e.g., Normal):
  - Location often unbounded, scale bounded below.
- Shape-Scale (e.g., Weibull, Gamma):
  - Both shape and scale > 0.
- Truncated Distributions: Modify `_negloglik` to account for truncation (e.g., normalize by the CDF difference; see the sketch after this list).
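As an illustration (a sketch, not repository code): for a Normal distribution truncated to a known interval [a, b], with all observations inside it, each log-density is shifted by the log of the in-interval probability mass:

```python
def _negloglik(θ):
    mu, sigma = θ
    if sigma <= 0:
        return np.inf
    # probability mass inside the known truncation interval [a, b]
    mass = stats.norm.cdf(b, loc=mu, scale=sigma) - stats.norm.cdf(a, loc=mu, scale=sigma)
    if mass <= 0:
        return np.inf
    return -np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma) - np.log(mass))
```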
Debugging tips:

- Check Initial LL: Compute `_negloglik(θ0)` manually; it should be finite and reasonable.
- Plot LL Surface: For one or two parameters, visualize the log-likelihood surface.
- Compare with SciPy: Fit using `stats.<distribution>.fit` and compare parameters (see the snippet after this list).
- Handle NaNs/Infs: Ensure data is clean; add checks in `_negloglik`.
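For the SciPy cross-check, something along these lines (a sketch for the exponential case; `floc=0` pins the location so the parameterizations match):

```python
from scipy import stats

# SciPy's exponential uses scale = 1/rate; compare against the pipeline's θhat
loc, scale = stats.expon.fit(data, floc=0)
print("SciPy rate:", 1 / scale, "| pipeline rate:", result['θhat'][0])
```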
Once you have a sub-function, call `run_pipeline(data, sub_function)`:

- `data`: NumPy array of observations.
- `sub_function`: Your sub-function (passed as a reference, not called).

The result dictionary contains:

- `'name'`: PDF name.
- `'θhat'`: Optimal parameters (NumPy array).
- `'LL'`: Maximum log-likelihood.
- `'AIC'`: Akaike Information Criterion.
- `'SBC'`: Schwarz Bayesian Criterion (BIC).
- `'ICOMP'`: Corrected AIC (AICc).
- `'hessian'`: Hessian matrix at θhat (for standard errors, take square roots of the diagonal of its inverse; see the snippet below).
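Expanding that last note, approximate standard errors come from the inverse Hessian (a sketch; it assumes the Hessian of the negative log-likelihood is positive definite at the optimum):

```python
import numpy as np

cov = np.linalg.inv(result['hessian'])  # inverse Hessian approximates Cov(θhat)
std_errors = np.sqrt(np.diag(cov))
```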
To compare multiple models, use `compare_models(data, sub_functions_list)`, which returns a pandas DataFrame with columns: Model, M (number of parameters), θhat, LL, AIC, SBC, ICOMP (see the example below).
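For example (a sketch; the import location of `compare_models` is an assumption, mirroring `run_pipeline`):

```python
from pipeline import compare_models  # assumed location, alongside run_pipeline

df = compare_models(data, [normal_sub, exponential_sub])
print(df.sort_values('AIC'))  # lower AIC/SBC indicates the preferred model
```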
See `sub_functions.py` for Weibull and Inverse Gaussian examples, and `demo.py` for usage.
The repository includes a modular plotting pipeline for overlaying fitted probability density functions (PDFs) on data histograms. This allows visual comparison of multiple distributions against the same dataset.
The pipeline is designed for flexibility, enabling users to define custom plotting sub-functions for different PDFs, while the core plotting logic (histogram generation, overlay management, chunking) is handled automatically.
Plotting sub-functions follow a similar modular pattern to model sub-functions. Each encapsulates the parameter fitting and PDF evaluation logic for a specific distribution.
A plotting sub-function must:

- Take no arguments (or optional data-dependent parameters if needed, but typically none).
- Define a `fitter` function that accepts `data` (a NumPy array) and returns fitted parameters as a tuple.
- Specify a `pdf_callable` function that computes the PDF at given points.
- Indicate whether it is a SciPy-based distribution (`is_scipy: bool`).
- Return a tuple: `(name, info_dict)`

Where:

- `name`: A descriptive string (e.g., "Weibull", "Custom Exponential").
- `info_dict`: A dictionary containing:
  - `'fitter'`: Callable that takes `data` and returns a parameter tuple, or None on failure.
  - `'pdf_callable'`: Callable that takes x-values and parameters and returns PDF values.
  - `'is_scipy'`: Boolean indicating a SciPy distribution (enables the KS goodness-of-fit test).
To write a plotting sub-function:

1. Choose Your PDF: Select the distribution. For SciPy distributions, leverage the built-in `fit` and `pdf` methods; for custom distributions, implement them manually.

2. Import Dependencies:

   ```python
   import numpy as np
   from scipy import stats  # For built-in distributions
   ```

3. Define the Fitter Function:
   - Create an inner function `fitter(data)`.
   - Use SciPy's `fit` method for standard distributions.
   - Handle exceptions and return `None` on failure.
   - Print fit details for user feedback.
   - Example for Weibull:
   ```python
   def fitter(data):
       try:
           # Fix location at 0 if needed
           params = stats.weibull_min.fit(data, floc=0)
           print(f"Weibull Fit: Shape={params[0]:.3f}, Loc={params[1]:.3f}, Scale={params[2]:.3f}")
           return params
       except Exception as e:
           print(f"Error fitting Weibull: {e}")
           return None
   ```

4. Define the PDF Callable:
   - For SciPy distributions: use `stats.<distribution>.pdf`.
   - For custom distributions: implement the PDF formula using NumPy for vectorization.
   - Ensure it handles parameter unpacking with `*params`.
   - Example for a custom exponential:

   ```python
   def pdf_callable(x, lambda_rate):
       return np.where(x >= 0, lambda_rate * np.exp(-lambda_rate * x), 0.0)
   ```

5. Set the SciPy Flag:
   - `True` for SciPy distributions (enables the automatic KS test).
   - `False` for custom implementations.

6. Return the Required Tuple:

   ```python
   return "Custom Exponential", {
       'fitter': fitter,
       'pdf_callable': pdf_callable,
       'is_scipy': False
   }
   ```

A complete example for the Gamma distribution:

```python
def gamma_plot_sub():
    """
    Sub-function for the Gamma distribution.
    Parameters: shape (a > 0), scale (or rate).
    """
    def fitter(data):
        try:
            # Gamma fit returns (a, loc, scale); fix location at 0
            params = stats.gamma.fit(data, floc=0)
            print(f"Gamma Fit: Shape={params[0]:.3f}, Loc={params[1]:.3f}, Scale={params[2]:.3f}")
            return params
        except Exception as e:
            print(f"Error fitting Gamma: {e}")
            return None
    return "Gamma", {
        'fitter': fitter,
        'pdf_callable': stats.gamma.pdf,
        'is_scipy': True
    }
```

The repository also includes a dedicated module for Kernel Density Estimation (KDE) analysis, enabling comparison of KDE models against parametric distributions. KDE provides non-parametric density estimation; the bandwidth is selected with Least Squares Cross-Validation (LSCV), which minimizes an estimate of the Mean Integrated Squared Error (MISE) by balancing a roughness penalty against a goodness-of-fit term. The module can:

- Fit a KDE using a Gaussian kernel.
- Evaluate the PDF at arbitrary points.
- Calculate LL, AIC, SBC, and ICOMP for KDE models, treating the bandwidth as a single parameter.

```python
from kde_analysis import run_kde_analysis

result = run_kde_analysis(data)
# Returns {'bandwidth': float, 'kde_model': KernelDensity, 'IC': {'logL': ..., 'AIC': ..., ...}}
```

To overlay the KDE against parametric fits:

```python
from plot_comparison import plot_data_kde_parametric
from sub_functions import weibull_sub

plot_data_kde_parametric(data, [weibull_sub])  # One plot per sub-function
```

This generates plots overlaying the data histogram, the LSCV KDE, and the fitted parametric PDFs.
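The `kde_analysis` module presumably implements LSCV internally; for intuition, here is a minimal numerical sketch of the LSCV objective (assuming scikit-learn's `KernelDensity` and 1-D data; this is not code from the repository):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def lscv_score(h, data):
    """LSCV objective: estimated MISE up to a constant (lower is better)."""
    n = len(data)
    grid = np.linspace(data.min() - 3 * h, data.max() + 3 * h, 2000)
    kde = KernelDensity(bandwidth=h, kernel='gaussian').fit(data[:, None])
    f_hat = np.exp(kde.score_samples(grid[:, None]))
    roughness = np.trapz(f_hat ** 2, grid)  # integral of the squared density estimate
    # leave-one-out goodness-of-fit term: (2/n) * sum_i f_hat_minus_i(x_i)
    loo = 0.0
    for i in range(n):
        kde_i = KernelDensity(bandwidth=h, kernel='gaussian').fit(np.delete(data, i)[:, None])
        loo += np.exp(kde_i.score_samples(data[i:i + 1, None]))[0]
    return roughness - 2.0 * loo / n

# Coarse grid search; O(n^2) per bandwidth, fine for small samples:
# best_h = min(np.linspace(0.05, 1.0, 20), key=lambda h: lscv_score(h, data))
```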
Best practices for plotting sub-functions:

- SciPy Leverage: For built-in distributions, use `stats.<distribution>.fit` with appropriate constraints (e.g., `floc=0` for positive-only data).
- Custom PDFs: Vectorize with NumPy; use `np.where` for conditional logic to avoid invalid domains.
- Parameter Validation: Check fitted parameters (e.g., ensure positivity) before returning.
- Error Handling: Always wrap fitting in try-except; provide meaningful error messages.
- Performance: Fitting can be slow for large datasets; consider sampling if needed.
- Goodness-of-Fit: SciPy sub-functions automatically include KS test results in plot legends.
- Flexibility: Sub-functions can be data-dependent (e.g., accept additional arguments), but keep them simple for modularity.
- Testing: Test the fitter with sample data; verify the PDF callable produces the expected shapes.
Typical patterns:

- SciPy Distributions (e.g., Weibull, Normal, Gamma):
  - Use `stats.<distribution>.fit` and `stats.<distribution>.pdf`.
  - Set `is_scipy = True` for the KS test.
  - Often fix `loc=0` for positive domains.
- Custom Distributions (e.g., Exponential, Custom Mixture):
  - Implement the fitter using method-of-moments or MLE.
  - Define the PDF manually with NumPy.
  - Set `is_scipy = False`.
- Truncated Distributions: Modify the PDF callable to account for truncation (normalize by the CDF difference; see the sketch after this list).
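For instance, a truncated-Normal PDF callable might look like this (a sketch; `a` and `b` are hypothetical, known truncation bounds):

```python
def pdf_callable(x, mu, sigma, a, b):
    # renormalize the Normal density by its probability mass inside [a, b]
    mass = stats.norm.cdf(b, loc=mu, scale=sigma) - stats.norm.cdf(a, loc=mu, scale=sigma)
    inside = (x >= a) & (x <= b)
    return np.where(inside, stats.norm.pdf(x, loc=mu, scale=sigma) / mass, 0.0)
```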
Debugging tips:

- Fit Validation: Check that the fitter returns sensible parameters (e.g., positive scales).
- PDF Shape: Plot the PDF manually with fitted parameters to confirm the shape is correct.
- Domain Issues: Ensure the PDF is defined over the data range; handle edges with `np.where`.
- Chunking: Use `max_overlays_per_plot` to debug overlays one at a time.
- KS Test: For SciPy distributions, verify that p-values make sense (high p = good fit).
Usage:

```python
from plotting_pipeline import plot_overlay, weibull_plot_sub, gamma_plot_sub

# Single plot with all overlays
plot_overlay(data, [weibull_plot_sub, gamma_plot_sub])

# Multiple plots (2 overlays per plot)
plot_overlay(data, [weibull_plot_sub, gamma_plot_sub],
             max_overlays_per_plot=2, lower_bound=0, upper_bound=10)
```

Parameters:

- `data`: NumPy array of observations.
- `sub_functions`: List of sub-function callables.
- `max_overlays_per_plot`: Optional chunking across multiple plots.
- `lower_bound` / `upper_bound`: Optional x-axis limits (auto-inferred if None).
See `sub_functions.py`, `plotting_pipeline.py`, and `demo.py` for implementations and usage.
- Add Metrics: Modify `calculate_metrics` in `pipeline.py` for custom criteria.
- Alternative Optimizers: Edit `optimize_and_hessian` in `optimizer.py`.
- Batch Processing: Loop over multiple sub-functions for model comparison (see the sketch below).
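A minimal batch-comparison sketch, assuming the sub-functions defined earlier:

```python
results = [run_pipeline(data, sf) for sf in (normal_sub, exponential_sub)]
for r in sorted(results, key=lambda r: r['AIC']):
    print(f"{r['name']}: LL={r['LL']:.2f}, AIC={r['AIC']:.2f}, SBC={r['SBC']:.2f}")
```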
Known limitations:

- Assumes continuous data (modify for discrete).
- Numerical Hessian may be slow for many parameters.
- No built-in goodness-of-fit tests (add separately).
Add new sub-functions to `sub_functions.py` or create your own module. Ensure they follow the interface.
For issues or improvements, please provide details on the sub-function and error messages.