Commit 8c760be

bdchatham and claude authored
design: unified OAuth2 auth for internal platform tools (Prometheus, AlertManager) (#75)
* design: unified OAuth2 authentication for internal platform tools

  Adds design doc for protecting Prometheus and AlertManager behind Google OAuth using Istio ext_authz + OAuth2 Proxy. Fixes broken PagerDuty generatorURL links by enabling external Prometheus access with auth.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* design: reuse existing Google OAuth client for OAuth2 Proxy

  Share the Grafana Google OAuth client rather than creating a separate one. Just add the OAuth2 Proxy callback URI to the existing client.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent f803a19 commit 8c760be

1 file changed

Lines changed: 237 additions & 0 deletions

@@ -0,0 +1,237 @@
# Unified OAuth2 Authentication for Internal Platform Tools

**Status:** Accepted
**Date:** 2026-04-10
**Scope:** Platform (Istio ext_authz, OAuth2 Proxy, Gateway API, Monitoring)

---

## Problem

Internal platform tools (Prometheus, AlertManager) are not externally accessible. PagerDuty alert links carry a `generatorURL` that points to the cluster-internal Prometheus address (`http://sei-prod-prometheus.monitoring:9090/graph?g0.expr=...`), which is unreachable from a browser. On-call engineers therefore cannot click through to the metric that triggered an alert.

Exposing these tools requires authentication. The old per-chain monitoring stack (sei-infra) had Prometheus behind ALBs without auth; we should not replicate that pattern.

## Solution

Deploy a shared **OAuth2 Proxy** as an Istio **ext_authz** extension provider. Any internal tool can opt in to Google OAuth by adding an `AuthorizationPolicy` with `action: CUSTOM`. The first consumers are Prometheus and AlertManager.

### Architecture

```
Browser → NLB → sei-gateway (TLS termination)
                    ↓
        AuthorizationPolicy (CUSTOM action)
                    ↓
        ext_authz check → OAuth2 Proxy (auth namespace)
                    ↓
        ┌── 202 (authenticated) → HTTPRoute → backend
        └── 302 (unauthenticated) → Google OAuth → callback → retry
```

### Auth Flow

1. User clicks PagerDuty link to `https://prometheus.prod.platform.sei.io/graph?g0.expr=...`
2. Istio Gateway matches the CUSTOM AuthorizationPolicy and sends an ext_authz check to OAuth2 Proxy
3. OAuth2 Proxy checks for the `_sei_platform_auth` cookie and does not find it
4. OAuth2 Proxy returns 302 to Google OAuth with `redirect_uri=https://oauth2-proxy.prod.platform.sei.io/oauth2/callback` and `state=<original-url>`
5. User authenticates with Google (same SSO as Grafana: `@seinetwork.io`, `@sei.io`, `@seifdn.org`)
6. Google redirects to the callback; OAuth2 Proxy sets the cookie on `.prod.platform.sei.io` and redirects back to the original URL
7. Browser retries with the cookie → ext_authz passes → Prometheus graph loads

### SSO

Setting the cookie domain to `.prod.platform.sei.io` means one login covers all protected tools: a session established on Prometheus is also valid on AlertManager and on any future tool added under the same domain.

---

## Design Decisions

### Shared Google OAuth Client

OAuth2 Proxy reuses the existing Google OAuth client that Grafana already uses. Add the OAuth2 Proxy callback URI to the existing client's authorized redirect URIs in Google Cloud Console:

```
https://oauth2-proxy.prod.platform.sei.io/oauth2/callback
```

This avoids managing a second set of credentials. The OAuth2 Proxy deployment references the same `google-oauth` secret (client ID and client secret) plus its own `cookie-secret`.

### Dedicated `auth` Namespace

OAuth2 Proxy deploys in its own `auth` namespace rather than `monitoring`. This isolates the authentication infrastructure from the monitoring workloads it protects, limiting blast radius if either is compromised.

### Fail-Closed

The ext_authz config sets `failOpen: false`. When OAuth2 Proxy is down, protected tools return 503 rather than becoming accessible without auth. Two replicas and a PodDisruptionBudget (`minAvailable: 1`) mitigate downtime.
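
The PodDisruptionBudget mentioned above could look like the following minimal sketch. The resource name and the `app.kubernetes.io/name: oauth2-proxy` selector label are assumptions, not taken from the deployed chart; the selector has to match whatever labels the OAuth2 Proxy pods actually carry.

```yaml
# Sketch only: keep at least one OAuth2 Proxy pod available during
# voluntary disruptions such as node drains and cluster upgrades.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: oauth2-proxy                        # hypothetical name
  namespace: auth
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: oauth2-proxy  # assumed pod label
```

With 2 replicas and `minAvailable: 1`, a drain can evict one pod at a time but never both, so the fail-closed behavior should only surface during involuntary outages.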

### Cookie-Based Sessions (No Redis)

Session state lives in encrypted cookies, so there is no shared state and no Redis to operate, and any replica can serve any request. This is appropriate for a team of <100 engineers. Revisit only if the session cookie exceeds the browser limit (about 4 KB per cookie), which typically happens with large group claims and is not applicable here.

### Grafana Excluded

Grafana keeps its own Google OAuth integration. It needs user identity for role mapping (`@seinetwork.io` → Editor, others → Viewer). The AuthorizationPolicy only targets the Prometheus and AlertManager hostnames, so Grafana traffic never triggers the ext_authz check.

---

## Resources

### Istio Mesh Config — Extension Provider

Register OAuth2 Proxy as an ext_authz provider in the istiod mesh config:

```yaml
meshConfig:
  extensionProviders:
    - name: oauth2-proxy
      # HTTP ext_authz provider (OAuth2 Proxy speaks HTTP, not gRPC)
      envoyExtAuthzHttp:
        service: oauth2-proxy.auth.svc.cluster.local
        port: 4180
        # Request headers forwarded to OAuth2 Proxy for the auth check
        includeRequestHeadersInCheck:
          - authorization
          - cookie
        # Headers from the check response passed on to the backend
        headersToUpstreamOnAllow:
          - authorization
          - x-auth-request-user
          - x-auth-request-email
          - cookie
        # Headers returned to the browser on deny (needed for the 302 redirect)
        headersToDownstreamOnDeny:
          - set-cookie
          - content-type
          - location
        headersToDownstreamOnAllow:
          - set-cookie
        failOpen: false
        statusOnError: "503"
```

### OAuth2 Proxy (auth namespace)

Deployed via Helm chart `oauth2-proxy/oauth2-proxy` from `https://oauth2-proxy.github.io/manifests`.

Key configuration:
- `provider = "google"`
- `upstreams = ["static://202"]`: ext_authz mode, not reverse-proxy mode
- `email_domains = ["seinetwork.io", "sei.io", "seifdn.org"]`
- `cookie_name = "_sei_platform_auth"`
- `cookie_domains = [".prod.platform.sei.io"]`
- `cookie_expire = "12h"`, `cookie_refresh = "1h"`
- `session_store_type = "cookie"`
- `reverse_proxy = true`: trusts X-Forwarded-* from Gateway
- `set_xauthrequest = true`: forwards user identity headers to backends

Secrets (SOPS-encrypted): Google client ID, client secret, cookie secret (32-byte random).
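
As a rough illustration, the settings listed above might be passed to the chart as Helm values along the following lines. This is a sketch under assumptions: the value keys (`replicaCount`, `config.existingSecret`, `config.configFile`) should be checked against the chart version actually pinned, and the Secret name is hypothetical.

```yaml
# Sketch of Helm values for the oauth2-proxy chart (keys are assumptions
# to verify against the chart's values.yaml).
replicaCount: 2
config:
  # Secret assumed to hold the client-id, client-secret and cookie-secret keys.
  existingSecret: oauth2-proxy-credentials   # hypothetical Secret name
  # OAuth2 Proxy config file, using the options from the list above.
  configFile: |-
    provider = "google"
    upstreams = ["static://202"]
    email_domains = ["seinetwork.io", "sei.io", "seifdn.org"]
    cookie_name = "_sei_platform_auth"
    cookie_domains = [".prod.platform.sei.io"]
    cookie_expire = "12h"
    cookie_refresh = "1h"
    session_store_type = "cookie"
    reverse_proxy = true
    set_xauthrequest = true
```

Keeping the options in a single config file block keeps the Helm values close to the list in this document, which makes drift easier to spot in review.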

### HTTPRoutes

| Hostname | Backend | Namespace |
|---|---|---|
| `prometheus.prod.platform.sei.io` | `sei-prod-prometheus:9090` | monitoring |
| `alertmanager.prod.platform.sei.io` | `sei-prod-alertmanager:9093` | monitoring |
| `oauth2-proxy.prod.platform.sei.io` | `oauth2-proxy:4180` | auth |

All covered by the existing wildcard cert `*.prod.platform.sei.io`. External-DNS auto-creates DNS records from HTTPRoute hostnames.
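
For reference, the first row of the table could translate into an HTTPRoute like the sketch below. The route name is hypothetical, and the Gateway namespace `gateway` is an assumption inferred from where the AuthorizationPolicy is deployed.

```yaml
# Sketch only: route prometheus.prod.platform.sei.io to the Prometheus Service.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: prometheus              # hypothetical name
  namespace: monitoring
spec:
  parentRefs:
    - name: sei-gateway
      namespace: gateway        # assumed Gateway namespace
  hostnames:
    - prometheus.prod.platform.sei.io
  rules:
    - backendRefs:
        - name: sei-prod-prometheus
          port: 9090
```

Because the route lives in a different namespace than the Gateway, the Gateway listener's `allowedRoutes` must permit routes from `monitoring` (and from `auth` for the callback route).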

### AuthorizationPolicy

```yaml
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: platform-tools-ext-authz
  namespace: gateway
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: sei-gateway
  action: CUSTOM
  provider:
    name: oauth2-proxy
  rules:
    - to:
        - operation:
            hosts:
              - prometheus.prod.platform.sei.io
              - alertmanager.prod.platform.sei.io
```

Scoped to specific hostnames. All other gateway traffic (RPC endpoints, Grafana, etc.) is unaffected.

### Prometheus Helm Changes

```yaml
prometheus:
  prometheusSpec:
    externalUrl: https://prometheus.prod.platform.sei.io

alertmanager:
  alertmanagerSpec:
    externalUrl: https://alertmanager.prod.platform.sei.io
```

With `externalUrl` set, Prometheus builds `generatorURL` from the external hostname instead of the cluster-internal address, which fixes the broken PagerDuty links.

---

## Interaction with Existing AuthorizationPolicies

Istio evaluates authorization in order: **CUSTOM → DENY → ALLOW**.

- SeiNode AuthorizationPolicies use `action: ALLOW` with pod selectors. They operate at the sidecar level on SeiNode pods, not at the gateway.
- The new CUSTOM policy targets the Gateway workload and only fires for matching hostnames.
- The two sets of policies apply to different workloads, so they neither conflict nor interfere.
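
To make the contrast concrete, a sidecar-level ALLOW policy of the kind described in the first bullet might look like the sketch below. Every name, label, namespace, and port in it is hypothetical; it is not a copy of the actual SeiNode policies.

```yaml
# Hypothetical sidecar-level policy: selects pods by label, so it is enforced
# on the SeiNode pods themselves and never on the gateway workload.
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: seinode-allow-rpc       # hypothetical name
  namespace: sei-nodes          # hypothetical namespace
spec:
  selector:
    matchLabels:
      app: seinode              # hypothetical pod label
  action: ALLOW
  rules:
    - to:
        - operation:
            ports: ["8545"]     # hypothetical port
```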

---

## Adding Future Services

To protect a new tool (e.g., Jaeger):

1. Add an HTTPRoute for `jaeger.prod.platform.sei.io`
2. Add the hostname to the AuthorizationPolicy `hosts` list

No OAuth2 Proxy changes are needed, because the cookie domain covers all `*.prod.platform.sei.io` subdomains.
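
Step 2 amounts to a one-line change in the existing `platform-tools-ext-authz` policy. The fragment below shows only the `rules` section, using the Jaeger hostname from the example above.

```yaml
# Fragment of the platform-tools-ext-authz AuthorizationPolicy spec.
spec:
  rules:
    - to:
        - operation:
            hosts:
              - prometheus.prod.platform.sei.io
              - alertmanager.prod.platform.sei.io
              - jaeger.prod.platform.sei.io   # newly protected hostname
```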

---

## Rollout Order

1. Register ext_authz provider in Istio mesh config (safe: nothing references it yet)
2. Deploy OAuth2 Proxy (Helm, Secret, Service, PDB) in `auth` namespace
3. Create OAuth2 callback HTTPRoute, verify callback endpoint responds
4. Create Prometheus + AlertManager HTTPRoutes
5. Apply AuthorizationPolicy (auth now enforced)
6. Update Prometheus/AlertManager `externalUrl` in Helm values
7. Verify: click PagerDuty alert link → Google OAuth → Prometheus graph

---

## File Layout

```
clusters/prod/auth/
  kustomization.yaml
  namespace.yaml
  oauth2-proxy.yaml              # HelmRepository + HelmRelease
  secret.enc.yaml                # SOPS: client-id, client-secret, cookie-secret
  httproute.yaml                 # oauth2-proxy.prod.platform.sei.io
  authz-policy.yaml              # CUSTOM AuthorizationPolicy (in gateway namespace)

clusters/prod/monitoring/
  httproute-prometheus.yaml      # prometheus.prod.platform.sei.io
  httproute-alertmanager.yaml    # alertmanager.prod.platform.sei.io
  prometheus-operator.yaml       # externalUrl changes

clusters/prod/istio-system/
  mesh-config patch              # ext_authz extension provider
```

---

## Observability

OAuth2 Proxy exposes Prometheus metrics on port 44180. A ServiceMonitor scrapes these, and a PrometheusRule alerts on:
- `OAuth2ProxyHighErrorRate` (>5% 5xx for 5m): warning
- `OAuth2ProxyDown` (metrics unreachable for 3m): critical
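
The scrape config and the two alerts could be sketched as follows. Several details are assumptions rather than facts from this design: the Service port name `metrics`, the pod label, the `job="oauth2-proxy"` label value, and the metric name `oauth2_proxy_requests_total` with a `code` label, all of which should be confirmed against the live `/metrics` output before use.

```yaml
# Sketch only: scrape OAuth2 Proxy metrics and alert on the two conditions above.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: oauth2-proxy
  namespace: auth
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: oauth2-proxy   # assumed Service label
  endpoints:
    - port: metrics        # assumed name of the Service port mapped to 44180
      interval: 30s
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: oauth2-proxy
  namespace: auth
spec:
  groups:
    - name: oauth2-proxy
      rules:
        - alert: OAuth2ProxyDown
          # Fires when no scrape target for the job reports up == 1.
          expr: absent(up{job="oauth2-proxy"} == 1)   # job label is assumed
          for: 3m
          labels:
            severity: critical
        - alert: OAuth2ProxyHighErrorRate
          # Share of 5xx responses over 5 minutes; metric name is assumed.
          expr: |
            sum(rate(oauth2_proxy_requests_total{code=~"5.."}[5m]))
              / sum(rate(oauth2_proxy_requests_total[5m])) > 0.05
          for: 5m
          labels:
            severity: warning
```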
