Solution Overview: Predictive Alerting for Cloud Metrics

This repository contains the solution for predicting whether an incident will occur within the next H time steps based on the previous W steps of time-series metrics.

1. Problem Formulation & Data Preprocessing

The problem is framed as a sliding-window classification task. For a given window of W steps, several statistical features are computed (mean, standard deviation, lags, differences). The target is constructed by looking H steps ahead and marking whether an incident occurs within that horizon. To keep the evaluation realistic, the data is split chronologically into train and test sets.
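The windowing described above can be sketched in plain Python. This is a minimal illustration, not the code in `src`: the function name `make_windows`, the exact feature set, and the toy series are assumptions made for the example.

```python
from statistics import mean, pstdev


def make_windows(values, incidents, W, H):
    """Build one (features, label) pair for every step t that has
    W steps of history and H steps of future available.

    Hypothetical helper for illustration; requires W >= 2 so that a
    first difference can be computed.
    """
    X, y = [], []
    # t is the index of the last observed step in the window.
    for t in range(W - 1, len(values) - H):
        window = values[t - W + 1 : t + 1]
        features = [
            mean(window),             # window mean
            pstdev(window),           # window standard deviation
            window[-1],               # most recent value
            window[-1] - window[-2],  # first difference
        ]
        # Label is 1 if any incident flag is set in steps t+1 .. t+H.
        label = int(any(incidents[t + 1 : t + 1 + H]))
        X.append(features)
        y.append(label)
    return X, y


# Toy series (made up for the example): one incident at index 4.
values = [1.0, 1.0, 2.0, 4.0, 8.0, 3.0, 1.0, 1.0]
incidents = [0, 0, 0, 0, 1, 0, 0, 0]
X, y = make_windows(values, incidents, W=3, H=2)
```

Because each label only looks forward from the end of its window, the features never include information from the horizon they are meant to predict.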

2. Modeling Choices & Training Setup

I chose XGBoost for this task because of its efficiency and its ability to handle non-linear relationships. To evaluate the model properly and avoid data leakage, I used time series split cross-validation. This approach ensures that the model is always trained on chronologically earlier data and evaluated on later, unseen data.
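To make the idea of time-ordered folds concrete, here is a hand-rolled expanding-window splitter. It is a sketch of the general technique, similar in spirit to scikit-learn's `TimeSeriesSplit`, and not necessarily the exact splitter or settings used in this repository; the function name `expanding_window_splits` is hypothetical.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) pairs in which every training
    index precedes every test index, and the training set grows per fold.
    """
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("not enough samples for the requested number of splits")
    first_test_start = n_samples - n_splits * test_size
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train_idx = list(range(0, test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


splits = list(expanding_window_splits(n_samples=10, n_splits=3))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

In each fold the model would be fit on `train_idx` and scored on `test_idx`, so no fold is ever evaluated on data older than what it was trained on.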

3. Evaluation Setup

In cloud metrics, incident classes are often imbalanced. To account for this, I used AUCPR (area under the precision-recall curve) as my evaluation metric, as it is more informative about the model's predictions on the rare class than metrics such as accuracy.
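One common way to estimate the area under the precision-recall curve is average precision: the mean of the precision values at the rank of each true positive. The sketch below is an assumption about how such a score can be computed by hand; it does not treat tied scores specially, and library implementations may differ in those details.

```python
def average_precision(y_true, scores):
    """Average precision for binary labels (1 = incident, 0 = no incident)."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positive labels")
    # Rank samples from the highest score to the lowest.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / n_pos


# Made-up labels and scores for illustration.
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

Unlike accuracy, this score does not improve when the model simply predicts "no incident" everywhere, which is why it suits rare-event problems.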

Additionally, I tested two prediction thresholds for the evaluation: 0.5 and 0.7. A lower threshold increases recall (fewer missed incidents, but more false positives), while a higher threshold increases precision (more missed incidents, but fewer false positives).
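The trade-off between the two thresholds can be illustrated with a small helper. The labels and scores below are made up for the example and are not results from this project; `precision_recall_at` is a hypothetical name.

```python
def precision_recall_at(y_true, scores, threshold):
    """Return (precision, recall) when alerting on scores >= threshold."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(1 for p, t in zip(preds, y_true) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(preds, y_true) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(preds, y_true) if p == 0 and t == 1)
    # Convention chosen here: report 0.0 when a ratio is undefined.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative data only: 3 incidents among 8 samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.75, 0.6, 0.65, 0.55, 0.3, 0.2, 0.1]

for threshold in (0.5, 0.7):
    p, r = precision_recall_at(y_true, scores, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data the 0.7 threshold raises no false alarms but misses one incident, while the 0.5 threshold catches every incident at the cost of two false alarms.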

4. Results Analysis

On the custom dataset, a threshold of 0.7 gave the fewest false positives but missed some real incidents. A threshold of 0.5 caught about 50% more incidents but produced about 10x more false positives than the 0.7 threshold. The right choice depends on what matters more for cloud predictive alerting: when false positives are frequent, engineers can get used to them and start skipping important alerts. With a threshold of 0.7, about 80% of alerts correspond to a real incident; with a threshold of 0.5, about 25% do.

