Commit 64add85

fix: eliminate 502 errors during rolling deployments (#560)
## Summary

Clients occasionally see brief 502 errors when the API server is updated. This PR makes deployments seamless so clients never see errors during rollouts.

**What changed:**

- The gateway now automatically retries failed requests on a healthy pod instead of returning the error to the client
- Health checks detect pods shutting down within seconds and stop sending them traffic
- The API server signals it's going away before actually stopping, giving the gateway time to react
- A disruption budget prevents too many pods from going down at once during cluster maintenance
- The rollout strategy now waits for a new pod to be fully ready before terminating the old one

Closes #559

## Test plan

- [ ] Deploy to staging with 2+ API server replicas
- [ ] Trigger a rolling update (image tag change) while sending continuous API requests
- [ ] Verify zero 502 errors are returned to clients during the rollout
- [ ] Verify the `BackendTrafficPolicy` is accepted by Envoy Gateway (`kubectl get btp -n milo-system`)
- [ ] Test a voluntary node drain to confirm the PDB keeps at least one pod available

🤖 Generated with [Claude Code](https://claude.com/claude-code)
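
The "continuous API requests" and "zero 502 errors" steps of the test plan could be scripted along the following lines. This is a minimal sketch and not part of the commit: `http_status` and `probe` are hypothetical helper names, and the URL in the usage comment is a placeholder for whatever endpoint staging exposes through the gateway.

```python
import time
import urllib.error
import urllib.request
from collections import Counter
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of one GET; 0 means the connection itself failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # urllib raises for 4xx/5xx; the code is still available
    except OSError:
        return 0  # DNS failure, refused connection, timeout, ...


def probe(fetch: Callable[[], int], attempts: int, pause_s: float = 0.1) -> Counter:
    """Call fetch() repeatedly and tally the status codes it returns."""
    seen: Counter = Counter()
    for _ in range(attempts):
        seen[fetch()] += 1
        time.sleep(pause_s)
    return seen


# Usage while a rollout is in progress (placeholder URL, assumed for illustration):
#   tally = probe(lambda: http_status("https://staging.example.invalid/readyz"), 600)
#   print(tally, "502s:", tally[502])
```

Keeping the fetch function injectable makes it possible to exercise the tally logic without a live cluster.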
2 parents eb29b07 + b4992cf commit 64add85

5 files changed

Lines changed: 56 additions & 5 deletions

config/apiserver/deployment.yaml

Lines changed: 8 additions & 5 deletions
@@ -11,7 +11,7 @@ spec:
   strategy:
     rollingUpdate:
       maxSurge: 25%
-      maxUnavailable: 25%
+      maxUnavailable: 0
     type: RollingUpdate
   template:
     metadata:
@@ -83,6 +83,7 @@ spec:
         - --events-provider-timeout=$(EVENTS_PROVIDER_TIMEOUT)
         - --events-provider-retries=$(EVENTS_PROVIDER_RETRIES)
         - --events-forward-extras=$(EVENTS_FORWARD_EXTRAS)
+        - --shutdown-delay-duration=$(SHUTDOWN_DELAY_DURATION)
         env:
         # Feature gates configuration
         # Sessions and UserIdentities are GA (enabled by default)
@@ -184,6 +185,8 @@ spec:
           value: "3"
         - name: EVENTS_FORWARD_EXTRAS
           value: "iam.miloapis.com/parent-api-group,iam.miloapis.com/parent-type,iam.miloapis.com/parent-name"
+        - name: SHUTDOWN_DELAY_DURATION
+          value: "10s"
         livenessProbe:
           failureThreshold: 3
           httpGet:
@@ -211,11 +214,11 @@ spec:
           timeoutSeconds: 15
         resources:
           requests:
-            cpu: 100m
-            memory: 128Mi
-          limits:
-            cpu: 500m
+            cpu: 200m
             memory: 512Mi
+          limits:
+            cpu: "1"
+            memory: 1Gi
         startupProbe:
           failureThreshold: 30
           httpGet:
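
To see what the `maxUnavailable: 25%` to `0` change means in pod counts, the sketch below applies the Deployment controller's rounding rules for percentages: `maxSurge` rounds up and `maxUnavailable` rounds down. The function name and the replica counts are assumptions chosen for illustration; the diff does not show the actual replica count.

```python
def rollout_bounds(replicas: int, max_surge_pct: int, max_unavailable_pct: int) -> tuple[int, int]:
    """Return (minimum available pods, maximum total pods) during a rolling update.

    Integer arithmetic avoids floating-point rounding surprises.
    The special case where both values resolve to 0 is not modelled here.
    """
    surge = -(-replicas * max_surge_pct // 100)           # ceiling
    unavailable = replicas * max_unavailable_pct // 100   # floor
    return replicas - unavailable, replicas + surge


print(rollout_bounds(4, 25, 25))  # (3, 5): one old pod may stop before its replacement is ready
print(rollout_bounds(4, 25, 0))   # (4, 5): a new pod must become ready first
print(rollout_bounds(2, 25, 25))  # (2, 3): 25% of 2 already rounds down to 0
```

Under these rules the new setting changes behaviour only at four or more replicas; below that, 25% already rounded down to zero unavailable pods.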

config/apiserver/kustomization.yaml

Lines changed: 1 addition & 0 deletions
@@ -3,3 +3,4 @@ kind: Kustomization
 resources:
 - deployment.yaml
 - service.yaml
+- pdb.yaml

config/apiserver/pdb.yaml

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+apiVersion: policy/v1
+kind: PodDisruptionBudget
+metadata:
+  name: milo-apiserver
+spec:
+  maxUnavailable: 20%
+  selector:
+    matchLabels:
+      app.kubernetes.io/name: milo-apiserver
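
For a PodDisruptionBudget, a percentage `maxUnavailable` is rounded up to a whole number of pods. The sketch below works through what `20%` allows at a few replica counts. The function name and replica counts are illustrative assumptions, and the calculation assumes every pod is currently healthy (unhealthy pods reduce the number of further disruptions allowed).

```python
def pdb_budget(replicas: int, max_unavailable_pct: int) -> tuple[int, int]:
    """Return (pods that may be voluntarily disrupted, pods that must stay up)."""
    may_be_down = -(-replicas * max_unavailable_pct // 100)  # ceiling
    return may_be_down, replicas - may_be_down


for replicas in (2, 5, 6, 10):
    print(replicas, pdb_budget(replicas, 20))
# 2 (1, 1)
# 5 (1, 4)
# 6 (2, 4)
# 10 (2, 8)
```

With two replicas, one pod may be drained and one stays up, which matches the last item of the test plan.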

config/components/gateway-api/backend-traffic-policy.yaml

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+apiVersion: gateway.envoyproxy.io/v1alpha1
+kind: BackendTrafficPolicy
+metadata:
+  name: milo-apiserver
+  namespace: milo-system
+spec:
+  targetRefs:
+  - group: gateway.networking.k8s.io
+    kind: HTTPRoute
+    name: milo-apiserver
+  retry:
+    numRetries: 3
+    retryOn:
+      triggers:
+      - gateway-error
+      - connect-failure
+      - reset
+    perRetry:
+      backOff:
+        baseInterval: 100ms
+        maxInterval: 1s
+      timeout: 2s
+  healthCheck:
+    active:
+      type: HTTP
+      http:
+        path: /readyz
+      interval: 5s
+      timeout: 3s
+      unhealthyThreshold: 2
+      healthyThreshold: 1
+    passive:
+      consecutive5XxErrors: 2
+      consecutiveGatewayErrors: 1
+      interval: 3s
+      baseEjectionTime: 15s
+      maxEjectionPercent: 33
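
The retry settings above imply an upper bound on how long a single client request can take. The sketch below assumes Envoy Gateway passes these values straight through to Envoy's retry policy, where the delay before retry N is drawn at random from `[0, (2^N - 1) * baseInterval)` and capped at `maxInterval`, and where the per-try timeout also applies to the initial attempt. The function names are hypothetical, and any route-level request timeout is ignored.

```python
def backoff_upper_bounds_ms(num_retries: int, base_ms: int, max_ms: int) -> list[int]:
    """Upper bound, in milliseconds, of the jittered delay before each retry."""
    return [min((2 ** n - 1) * base_ms, max_ms) for n in range(1, num_retries + 1)]


def worst_case_latency_ms(num_retries: int, per_try_timeout_ms: int,
                          base_ms: int, max_ms: int) -> int:
    """Every attempt times out and every backoff hits its upper bound."""
    attempts = num_retries + 1  # the initial try plus the retries
    backoff = sum(backoff_upper_bounds_ms(num_retries, base_ms, max_ms))
    return attempts * per_try_timeout_ms + backoff


# Values from the policy: numRetries 3, timeout 2s, baseInterval 100ms, maxInterval 1s.
print(backoff_upper_bounds_ms(3, 100, 1000))      # [100, 300, 700]
print(worst_case_latency_ms(3, 2000, 100, 1000))  # 9100
```

Under these assumptions the 1s `maxInterval` cap is never reached with three retries, and a request that fails every attempt returns after roughly nine seconds at most.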

config/components/gateway-api/kustomization.yaml

Lines changed: 1 addition & 0 deletions
@@ -4,3 +4,4 @@ kind: Component
 resources:
 - httproute.yaml
 - backend-tls-policy.yaml
+- backend-traffic-policy.yaml
