Commit 76036b3

scotwells and claude committed
docs: add runbook for ProjectStuckCreatingSLOViolation alert
Add a runbook with investigation steps for when the project creation SLO alert fires, and link it from the alert rule via the runbook_url annotation.

Resolves datum-cloud/engineering#231 (runbook action item)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 33ceab3 commit 76036b3

4 files changed

Lines changed: 125 additions & 0 deletions

config/telemetry/alerts/resources-manager/projects.yaml

Lines changed: 1 addition & 0 deletions
@@ -15,5 +15,6 @@ spec:
   severity: critical
   slo_violation: "true"
 annotations:
+  runbook_url: "https://github.com/datum-cloud/milo/blob/main/docs/runbooks/project-stuck-creating-slo-violation.md"
   summary: "Project {{ $labels.resource_name }} is stuck creating for over 60 seconds"
   description: "Project {{ $labels.resource_name }} has been in creation state for {{ $value }} seconds without reaching Ready status, which exceeds the 60-second SLO threshold."

docs/runbooks/project-stuck-creating-slo-violation.md

Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
# ProjectStuckCreatingSLOViolation

## What This Alert Means

A project has been in a "creating" state for more than 60 seconds without
reaching a "Ready" status. This exceeds the service level objective (SLO) for
project creation and indicates something is preventing the project from being
fully provisioned.

The alert fires per-project, so multiple alerts may fire simultaneously if
several projects are affected.

## Impact

Users who created the affected project(s) are waiting longer than expected.
The project may not be usable until it reaches a Ready state.

## Investigation Steps

### 1. Identify the affected project

The alert labels include `resource_name`, which identifies the project that is
stuck. Note this name for use in subsequent steps.

### 2. Check the project status

Use `kubectl` to inspect the project resource and its status conditions:

```sh
kubectl get project <resource_name> -o yaml
```

Look at `.status.conditions` for any condition with `status: "False"` or a
`reason` and `message` that explain what is failing.
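
To scan the conditions without reading the full YAML, something like the sketch below may help. It prints one tab-separated line per condition and keeps only those whose status is not `True`. It assumes the project follows the standard Kubernetes condition layout (`type`, `status`, `reason`, `message`); the function names `project_conditions` and `failing_conditions` are illustrative and not part of milo.

```sh
# Print each status condition as: type<TAB>status<TAB>reason<TAB>message.
# Assumes the standard Kubernetes condition fields; add -n <namespace> or
# --context if your environment requires them.
project_conditions() {
  kubectl get project "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'
}

# Keep only non-empty lines whose status column is not "True" (reads stdin).
failing_conditions() {
  awk -F '\t' 'NF > 0 && $2 != "True"'
}
```

Typical use would be `project_conditions <resource_name> | failing_conditions`. An empty result suggests every reported condition is `True`, in which case the delay is more likely in the controller than in the project itself.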

### 3. Check controller manager logs

The `milo-controller-manager` is responsible for reconciling projects. Check its
logs for errors related to the affected project:

```sh
kubectl logs -l app=milo-controller-manager --tail=200 | grep <resource_name>
```

Look for:
- **Permission errors** (e.g., RBAC forbidden): The controller may lack
  permissions to create dependent resources.
- **Resource creation failures**: Errors when creating namespaces,
  ProjectControlPlane resources, or other dependent objects.
- **OOMKilled or CrashLoopBackOff**: The controller pod itself may be
  unhealthy. These states appear in pod status rather than in logs (see step 4).
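
The log check above can be narrowed with a rough keyword triage, sketched below. The keyword lists are guesses at common failure text and are not taken from milo's source; the 15-minute window and the function names are likewise arbitrary choices.

```sh
# Fetch recent controller logs that mention one project.
# --since and --tail are standard kubectl logs flags; 15m is an arbitrary window.
project_logs() {
  kubectl logs -l app=milo-controller-manager --tail=200 --since=15m | grep -F -- "$1"
}

# Tag each log line (stdin) with a rough category based on keywords.
# Lines that match none of the patterns are dropped.
classify_log_lines() {
  awk '
    { line = tolower($0) }
    line ~ /forbidden|unauthorized/ { print "rbac\t" $0; next }
    line ~ /timeout|timed out|connection refused|deadline exceeded/ { print "dependency\t" $0; next }
    line ~ /error|failed/ { print "other-error\t" $0; next }
  '
}
```

For example, `project_logs <resource_name> | classify_log_lines` gives a first pass at which row of the Common Causes table applies. Treat the categories as hints only, since a message can match a keyword for unrelated reasons.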

### 4. Check controller pod health

Verify the controller manager pod is running and not restarting:

```sh
kubectl get pods -l app=milo-controller-manager
```

If the pod is restarting, check its resource limits and recent events:

```sh
kubectl describe pod -l app=milo-controller-manager
```
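
When several replicas are running, a compact view of restart counts and the last termination reason can be quicker to read than `describe` output. The sketch below assumes each controller pod has a single container, so only the first container status is inspected; the function names are illustrative.

```sh
# List controller pods as: name<TAB>restartCount<TAB>lastTerminationReason.
# Only the first container of each pod is inspected (single-container assumption).
controller_restarts() {
  kubectl get pods -l app=milo-controller-manager \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'
}

# Keep only pods that have restarted at least once (reads stdin).
restarted_pods() {
  awk -F '\t' '$2 + 0 > 0'
}
```

Running `controller_restarts | restarted_pods` should list only the pods worth a closer look. A reason of `OOMKilled` in the third column points at the memory-limit row of the Common Causes table.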

### 5. Check for upstream dependencies

Project creation depends on several subsystems. Verify these are healthy:
- **ProjectControlPlane** resources are being created and reconciled.
- **Authorization system** (e.g., OpenFGA) is reachable and responding.
- **Infrastructure cluster** connectivity is functioning.

### 6. Check for resource conflicts

If multiple controllers or deployment systems manage overlapping resources
(e.g., ClusterRoles, ConfigMaps), one may overwrite changes made by another.
Check for recent changes to RBAC resources:

```sh
kubectl get clusterrole -l app=milo-controller-manager -o yaml
```

Look for unexpected annotations or labels that indicate a different system is
managing the same resource.
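
Besides annotations and labels, the `managedFields` metadata records which client last wrote each part of an object, which can hint at a second system touching the same ClusterRole. The sketch below is one way to summarize it; the function names are illustrative, and `--show-managed-fields` is passed so the data is present regardless of output format.

```sh
# Print the field managers recorded on a ClusterRole as:
# manager<TAB>operation<TAB>time.
clusterrole_managers() {
  kubectl get clusterrole "$1" --show-managed-fields \
    -o jsonpath='{range .metadata.managedFields[*]}{.manager}{"\t"}{.operation}{"\t"}{.time}{"\n"}{end}'
}

# Summarize stdin as: manager<TAB>number-of-entries, sorted by manager name.
distinct_managers() {
  awk -F '\t' 'NF > 0 { count[$1]++ } END { for (m in count) print m "\t" count[m] }' | sort
}
```

Usage would be `clusterrole_managers <clusterrole_name> | distinct_managers`. More than one manager is not a problem in itself, but an unfamiliar manager name with a recent timestamp is worth following up.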

## Common Causes

| Cause | Indicators |
|---|---|
| RBAC permission errors | "forbidden" errors in controller logs |
| Controller OOM crashes | Pod restarts, OOMKilled events |
| Authorization service unavailable | Timeout or connection errors in logs |
| Resource ownership conflicts | Oscillating resource annotations/labels |
| High reconciliation backlog | Many projects stuck simultaneously, controller processing slowly |

## Resolution

Resolution depends on the root cause identified above:

- **Permission errors**: Verify and restore the correct RBAC configuration for
  the controller.
- **Controller crashes**: Increase memory limits or investigate the source of
  excessive memory consumption.
- **Service unavailability**: Restore connectivity to dependent services.
- **Resource conflicts**: Ensure each deployment system manages uniquely named
  resources to avoid collisions.

After resolving the underlying issue, affected projects should automatically
reconcile and reach a Ready state. Monitor the alert to confirm it resolves.
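
To confirm recovery from the command line as well, `kubectl wait` can block until the project reports readiness. This sketch assumes the condition type is literally `Ready`, as the alert text suggests; the function name and the 120-second default timeout are arbitrary.

```sh
# Block until the project reports the Ready condition or the timeout expires.
# $1 = project name, $2 = optional timeout (defaults to 120s).
wait_for_project_ready() {
  kubectl wait --for=condition=Ready "project/$1" --timeout="${2:-120s}"
}
```

For example, `wait_for_project_ready <resource_name> 5m` returns non-zero if the project is still not Ready after five minutes, which is a reasonable point to move on to escalation.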

## Escalation

If the alert persists after investigation and you cannot identify the root cause,
escalate to the platform engineering team with the following information:

- The affected project name(s)
- Controller manager logs from the time of the alert
- Status of the controller manager pod(s)
- Any error messages found during investigation
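
The items above can be gathered in one pass with a small helper, sketched here. The output file names, the default directory, and the function name are arbitrary choices, and the commands assume the same `app=milo-controller-manager` label used earlier in this runbook.

```sh
# Collect escalation details for one project into a directory.
# $1 = project name, $2 = optional output directory.
collect_escalation_bundle() {
  project="$1"
  outdir="${2:-./escalation-$project}"
  mkdir -p "$outdir" || return 1

  # Project manifest, including status conditions.
  kubectl get project "$project" -o yaml > "$outdir/project.yaml" 2>&1
  # Controller logs; --timestamps helps line them up with the alert time.
  kubectl logs -l app=milo-controller-manager --tail=1000 --timestamps \
    > "$outdir/controller.log" 2>&1
  # Pod status and recent events for the controller.
  kubectl get pods -l app=milo-controller-manager -o wide > "$outdir/pods.txt" 2>&1
  kubectl describe pod -l app=milo-controller-manager > "$outdir/pods-describe.txt" 2>&1

  # Print the bundle location so it can be attached to the escalation.
  echo "$outdir"
}
```

Errors from each command are written into the corresponding file rather than aborting the run, so a partially failing cluster still yields a usable bundle.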

test/prometheus-rules/resources-manager/projects/projects-slo-rules.yaml

Lines changed: 1 addition & 0 deletions
@@ -11,5 +11,6 @@ groups:
   severity: critical
   slo_violation: "true"
 annotations:
+  runbook_url: "https://github.com/datum-cloud/milo/blob/main/docs/runbooks/project-stuck-creating-slo-violation.md"
   summary: "Project {{ $labels.resource_name }} is stuck creating for over 60 seconds"
   description: "Project {{ $labels.resource_name }} has been in creation state for {{ $value }} seconds without reaching Ready status, which exceeds the 60-second SLO threshold."

test/prometheus-rules/resources-manager/projects/projects-slo-tests.yaml

Lines changed: 3 additions & 0 deletions
@@ -22,6 +22,7 @@ tests:
   slo_violation: "true"
   resource_name: test-project
 exp_annotations:
+  runbook_url: "https://github.com/datum-cloud/milo/blob/main/docs/runbooks/project-stuck-creating-slo-violation.md"
   summary: "Project test-project is stuck creating for over 60 seconds"
   description: "Project test-project has been in creation state for 120 seconds without reaching Ready status, which exceeds the 60-second SLO threshold."

@@ -72,6 +73,7 @@ tests:
   slo_violation: "true"
   resource_name: stuck-project
 exp_annotations:
+  runbook_url: "https://github.com/datum-cloud/milo/blob/main/docs/runbooks/project-stuck-creating-slo-violation.md"
   summary: "Project stuck-project is stuck creating for over 60 seconds"
   description: "Project stuck-project has been in creation state for 90 seconds without reaching Ready status, which exceeds the 60-second SLO threshold."

@@ -97,5 +99,6 @@ tests:
   slo_violation: "true"
   resource_name: multi-stuck-project
 exp_annotations:
+  runbook_url: "https://github.com/datum-cloud/milo/blob/main/docs/runbooks/project-stuck-creating-slo-violation.md"
   summary: "Project multi-stuck-project is stuck creating for over 60 seconds"
   description: "Project multi-stuck-project has been in creation state for 150 seconds without reaching Ready status, which exceeds the 60-second SLO threshold."
