# ProjectStuckCreatingSLOViolation

## What This Alert Means

A project has been in a "creating" state for more than 60 seconds without
reaching a "Ready" status. This exceeds the service level objective (SLO) for
project creation and indicates that something is preventing the project from
being fully provisioned.

The alert fires per project, so multiple alerts may fire simultaneously if
several projects are affected.
| 12 | + |
## Impact

Users who created the affected project(s) are waiting longer than expected.
The project may not be usable until it reaches a Ready state.
| 17 | + |
## Investigation Steps

### 1. Identify the affected project

The alert labels include `resource_name`, which identifies the stuck project.
Note this name for use in subsequent steps.
| 24 | + |
### 2. Check the project status

Use `kubectl` to inspect the project resource and its status conditions:

```sh
kubectl get project <resource_name> -o yaml
```

Look at `.status.conditions` for any condition with `status: "False"` or a
`reason` and `message` that explain what is failing.
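If you prefer a condensed view, a JSONPath query can pull out just the
conditions (a minimal sketch; adjust the resource kind if your CRD uses a
different name):

```sh
# Print each condition's type, status, and reason on its own line.
kubectl get project <resource_name> \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'
```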
| 35 | + |
### 3. Check controller manager logs

The `milo-controller-manager` is responsible for reconciling projects. Check
its logs for errors related to the affected project:

```sh
kubectl logs -l app=milo-controller-manager --tail=200 | grep <resource_name>
```

Look for:
- **Permission errors** (e.g., RBAC "forbidden" responses): the controller may
  lack permission to create dependent resources.
- **Resource creation failures**: errors when creating namespaces,
  ProjectControlPlane resources, or other dependent objects.
- **OOMKilled or CrashLoopBackOff**: the controller pod itself may be
  unhealthy.
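If the controller has recently crashed, the relevant errors may only appear in
the previous container instance's logs (a sketch using standard `kubectl`
flags):

```sh
# Fetch logs from the previous container instance, useful after a crash or OOM kill.
kubectl logs -l app=milo-controller-manager --previous --tail=200
```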
| 52 | + |
### 4. Check controller pod health

Verify the controller manager pod is running and not restarting:

```sh
kubectl get pods -l app=milo-controller-manager
```

If the pod is restarting, check its resource limits and recent events:

```sh
kubectl describe pod -l app=milo-controller-manager
```
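To confirm an OOM kill specifically, each container's last termination state
records the reason (a minimal sketch; assumes a single container per pod):

```sh
# Show each pod's restart count and the reason its container last terminated.
kubectl get pods -l app=milo-controller-manager \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'
```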
| 66 | + |
### 5. Check for upstream dependencies

Project creation depends on several subsystems. Verify that these are healthy:
- **ProjectControlPlane** resources are being created and reconciled.
- **Authorization system** (e.g., OpenFGA) is reachable and responding.
- **Infrastructure cluster** connectivity is functioning.
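The first dependency can be spot-checked directly. The plural resource name
below is an assumption; confirm it against your cluster before relying on it:

```sh
# Confirm the CRD's plural name first, then list the resources across namespaces.
kubectl api-resources | grep -i projectcontrolplane
kubectl get projectcontrolplanes -A
```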
| 73 | + |
### 6. Check for resource conflicts

If multiple controllers or deployment systems manage overlapping resources
(e.g., ClusterRoles, ConfigMaps), one may overwrite changes made by another.
Check for recent changes to RBAC resources:

```sh
kubectl get clusterrole -l app=milo-controller-manager -o yaml
```

Look for unexpected annotations or labels that indicate a different system is
managing the same resource.
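Server-side apply also records which manager last wrote each field, so the
managed fields can reveal two systems fighting over the same resource (a
sketch using a standard `kubectl` flag):

```sh
# List the field managers that have written to each matching ClusterRole.
kubectl get clusterrole -l app=milo-controller-manager --show-managed-fields \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.metadata.managedFields[*].manager}{"\n"}{end}'
```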
| 86 | + |
## Common Causes

| Cause | Indicators |
|---|---|
| RBAC permission errors | "forbidden" errors in controller logs |
| Controller OOM crashes | Pod restarts, OOMKilled events |
| Authorization service unavailable | Timeout or connection errors in logs |
| Resource ownership conflicts | Oscillating resource annotations/labels |
| High reconciliation backlog | Many projects stuck simultaneously, controller processing slowly |
| 96 | + |
## Resolution

Resolution depends on the root cause identified above:

- **Permission errors**: verify and restore the correct RBAC configuration for
  the controller.
- **Controller crashes**: increase memory limits or investigate the source of
  excessive memory consumption.
- **Service unavailability**: restore connectivity to dependent services.
- **Resource conflicts**: ensure each deployment system manages uniquely named
  resources to avoid collisions.

After resolving the underlying issue, affected projects should automatically
reconcile and reach a Ready state. Monitor the alert to confirm it resolves.
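After restoring RBAC, the controller's effective permissions can be verified
with `kubectl auth can-i`. The service-account name and namespace below are
assumptions; substitute the actual values from your deployment:

```sh
# Check whether the controller's service account can create namespaces.
# The service-account name and namespace are assumed -- adjust to your deployment.
kubectl auth can-i create namespaces \
  --as=system:serviceaccount:milo-system:milo-controller-manager
```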
| 111 | + |
## Escalation

If the alert persists after investigation and you cannot identify the root
cause, escalate to the platform engineering team with the following
information:

- The affected project name(s)
- Controller manager logs from the time of the alert
- Status of the controller manager pod(s)
- Any error messages found during investigation