Commit 0dc9adc

bdchatham and claude committed
design: unified OAuth2 authentication for internal platform tools
Adds design doc for protecting Prometheus and AlertManager behind Google OAuth using Istio ext_authz + OAuth2 Proxy. Fixes broken PagerDuty generatorURL links by enabling external Prometheus access with auth.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 2599c44 commit 0dc9adc

1 file changed

Lines changed: 236 additions & 0 deletions

@@ -0,0 +1,236 @@
# Unified OAuth2 Authentication for Internal Platform Tools

**Status:** Accepted
**Date:** 2026-04-10
**Scope:** Platform (Istio ext_authz, OAuth2 Proxy, Gateway API, Monitoring)

---

## Problem

Internal platform tools (Prometheus, AlertManager) are not externally accessible. PagerDuty alert links contain a `generatorURL` pointing to the cluster-internal Prometheus address (`http://sei-prod-prometheus.monitoring:9090/graph?g0.expr=...`), which is unreachable from a browser. On-call engineers cannot click through to see the metric that triggered an alert.

Exposing these tools requires authentication. The old per-chain monitoring stack (sei-infra) had Prometheus behind ALBs without auth, and we should not replicate that pattern.

## Solution

Deploy a shared **OAuth2 Proxy** as an Istio **ext_authz** extension provider. Any internal tool can opt in to Google OAuth by adding an `AuthorizationPolicy` with `action: CUSTOM`. The first consumers are Prometheus and AlertManager.

### Architecture

```
Browser → NLB → sei-gateway (TLS termination)
                    ↓
        AuthorizationPolicy (CUSTOM action)
                    ↓
        ext_authz check → OAuth2 Proxy (auth namespace)
                    ↓
                    ┌── 202 (authenticated) → HTTPRoute → backend
                    └── 302 (unauthenticated) → Google OAuth → callback → retry
```

### Auth Flow

1. User clicks a PagerDuty link to `https://prometheus.prod.platform.sei.io/graph?g0.expr=...`
2. The Istio Gateway matches the CUSTOM AuthorizationPolicy and sends an ext_authz check to OAuth2 Proxy
3. OAuth2 Proxy checks for the `_sei_platform_auth` cookie and does not find it
4. OAuth2 Proxy returns a 302 to Google OAuth with `redirect_uri=https://oauth2-proxy.prod.platform.sei.io/oauth2/callback` and `state=<original-url>`
5. User authenticates with Google (same SSO as Grafana: `@seinetwork.io`, `@sei.io`, `@seifdn.org`)
6. Google redirects to the callback; OAuth2 Proxy sets the cookie on `.prod.platform.sei.io` and redirects back to the original URL
7. Browser retries with the cookie, ext_authz passes, and the Prometheus graph loads

### SSO

The cookie domain `.prod.platform.sei.io` means one login covers all protected tools. A session established through Prometheus is also valid for AlertManager and for any future tools added under the same domain.

---

## Design Decisions

### Separate Google OAuth Client

A new Google OAuth client is required rather than reusing Grafana's. Google Cloud Console ties redirect URIs to specific clients. Mixing OAuth2 Proxy callbacks with Grafana's `/login/generic_oauth` in one client is fragile: URI rotation on one side risks breaking the other.

**Google Cloud Console config:**
- Application type: Web application
- Name: `sei-platform-oauth2-proxy-prod`
- Authorized redirect URI: `https://oauth2-proxy.prod.platform.sei.io/oauth2/callback`

### Dedicated `auth` Namespace

OAuth2 Proxy deploys in its own `auth` namespace rather than `monitoring`. This isolates the authentication infrastructure from the monitoring workloads it protects, limiting blast radius if either is compromised.

### Fail-Closed

`failOpen: false` is set in the ext_authz config. When OAuth2 Proxy is down, protected tools return 503 rather than being accessible without auth. Two replicas and a PodDisruptionBudget (`minAvailable: 1`) mitigate downtime.
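
A minimal sketch of the PodDisruptionBudget mentioned above. The resource name and the selector label are assumptions for illustration; the selector has to match whatever labels the OAuth2 Proxy Helm release actually puts on its pods.

```yaml
# Sketch only: name and selector labels are assumptions, not taken from the design.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: oauth2-proxy
  namespace: auth
spec:
  minAvailable: 1            # with 2 replicas, one pod may be evicted at a time
  selector:
    matchLabels:
      app.kubernetes.io/name: oauth2-proxy   # assumed label
```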

### Cookie-Based Sessions (No Redis)

Session state lives in encrypted cookies, so there is no shared state and no Redis to operate, and any replica can serve any request. This is appropriate for a team of <100 engineers. Revisit only if cookie size exceeds the browser limit (about 4 KB per cookie), which would happen with large group claims and is not applicable here.

### Grafana Excluded

Grafana keeps its own Google OAuth integration. It needs user identity for role mapping (`@seinetwork.io` → Editor, others → Viewer). The AuthorizationPolicy only targets the Prometheus and AlertManager hostnames, so Grafana requests are never intercepted.

---

## Resources

### Istio Mesh Config: Extension Provider

Register OAuth2 Proxy as an ext_authz provider in the istiod mesh config:

```yaml
meshConfig:
  extensionProviders:
    - name: oauth2-proxy
      # HTTP ext_authz provider (OAuth2 Proxy speaks HTTP, not gRPC)
      envoyExtAuthzHttp:
        service: oauth2-proxy.auth.svc.cluster.local
        port: 4180
        # Request headers sent to OAuth2 Proxy with each check
        includeRequestHeadersInCheck:
          - authorization
          - cookie
        # Headers from the check response passed on to the backend
        headersToUpstreamOnAllow:
          - authorization
          - x-auth-request-user
          - x-auth-request-email
          - cookie
        # Headers returned to the browser on deny (needed for the 302 redirect)
        headersToDownstreamOnDeny:
          - set-cookie
          - content-type
          - location
        # Headers returned to the browser on allow (cookie refresh)
        headersToDownstreamOnAllow:
          - set-cookie
        failOpen: false
        statusOnError: "503"
```

### OAuth2 Proxy (auth namespace)

Deployed via Helm chart `oauth2-proxy/oauth2-proxy` from `https://oauth2-proxy.github.io/manifests`.

Key configuration:
- `provider = "google"`
- `upstreams = ["static://202"]` (ext_authz mode, not reverse-proxy mode)
- `email_domains = ["seinetwork.io", "sei.io", "seifdn.org"]`
- `cookie_name = "_sei_platform_auth"`
- `cookie_domains = [".prod.platform.sei.io"]`
- `cookie_expire = "12h"`, `cookie_refresh = "1h"`
- `session_store_type = "cookie"`
- `reverse_proxy = true` (trusts X-Forwarded-* headers from the Gateway)
- `set_xauthrequest = true` (sets `X-Auth-Request-User` and `X-Auth-Request-Email` response headers, which ext_authz forwards to backends)

Secrets (SOPS-encrypted): Google client ID, client secret, cookie secret (32-byte random).
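
As a rough illustration, the settings above could be passed to the chart through Helm values along these lines. The `replicaCount`, `config.existingSecret`, and `config.configFile` keys and the Secret name are assumptions here and should be checked against the chart version in use; the config file contents are the options listed above.

```yaml
# Sketch only: value keys and the Secret name are assumptions.
replicaCount: 2
config:
  # Hypothetical Secret holding client-id, client-secret, cookie-secret
  existingSecret: oauth2-proxy-credentials
  configFile: |-
    provider = "google"
    upstreams = ["static://202"]
    email_domains = ["seinetwork.io", "sei.io", "seifdn.org"]
    cookie_name = "_sei_platform_auth"
    cookie_domains = [".prod.platform.sei.io"]
    cookie_expire = "12h"
    cookie_refresh = "1h"
    session_store_type = "cookie"
    reverse_proxy = true
    set_xauthrequest = true
```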

### HTTPRoutes

| Hostname | Backend | Namespace |
|---|---|---|
| `prometheus.prod.platform.sei.io` | `sei-prod-prometheus:9090` | monitoring |
| `alertmanager.prod.platform.sei.io` | `sei-prod-alertmanager:9093` | monitoring |
| `oauth2-proxy.prod.platform.sei.io` | `oauth2-proxy:4180` | auth |

All three hostnames are covered by the existing wildcard cert `*.prod.platform.sei.io`. External-DNS auto-creates DNS records from HTTPRoute hostnames.
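
For illustration, the Prometheus row of the table might translate into an HTTPRoute like the sketch below. The route name is hypothetical, and the Gateway namespace is assumed to be `gateway`; the hostname, backend, and port come from the table.

```yaml
# Sketch only: metadata.name and the parentRef namespace are assumptions.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: prometheus
  namespace: monitoring
spec:
  parentRefs:
    - name: sei-gateway
      namespace: gateway     # assumed namespace of the shared Gateway
  hostnames:
    - prometheus.prod.platform.sei.io
  rules:
    - backendRefs:
        - name: sei-prod-prometheus
          port: 9090
```

The AlertManager and OAuth2 Proxy routes would follow the same shape with their own hostname, backend, and namespace.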

### AuthorizationPolicy

```yaml
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: platform-tools-ext-authz
  namespace: gateway
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: sei-gateway
  action: CUSTOM
  provider:
    name: oauth2-proxy
  rules:
    - to:
        - operation:
            hosts:
              - prometheus.prod.platform.sei.io
              - alertmanager.prod.platform.sei.io
```

The policy is scoped to specific hostnames. All other gateway traffic (RPC endpoints, Grafana, etc.) is unaffected.

### Prometheus Helm Changes

```yaml
prometheus:
  prometheusSpec:
    externalUrl: https://prometheus.prod.platform.sei.io

alertmanager:
  alertmanagerSpec:
    externalUrl: https://alertmanager.prod.platform.sei.io
```

With `externalUrl` set, Prometheus builds each alert's `generatorURL` from the external hostname, which fixes the broken PagerDuty links.

---

## Interaction with Existing AuthorizationPolicies

Istio evaluates authorization in order: **CUSTOM → DENY → ALLOW**.

- SeiNode AuthorizationPolicies use `action: ALLOW` with pod selectors. They operate at the sidecar level on SeiNode pods, not at the gateway.
- The new CUSTOM policy targets the Gateway workload and only fires for matching hostnames.
- The two sets of policies apply to different workloads, so they do not conflict.

---

## Adding Future Services

To protect a new tool (e.g., Jaeger):

1. Add an HTTPRoute for `jaeger.prod.platform.sei.io`
2. Add the hostname to the AuthorizationPolicy `hosts` list

No OAuth2 Proxy changes are needed because the cookie domain covers all `*.prod.platform.sei.io` subdomains.
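
As a sketch of step 2, the `rules` fragment of the `platform-tools-ext-authz` policy would gain one entry. The Jaeger hostname is the hypothetical example from the steps above, not an existing service.

```yaml
# Fragment of the AuthorizationPolicy spec; only the last host is new.
rules:
  - to:
      - operation:
          hosts:
            - prometheus.prod.platform.sei.io
            - alertmanager.prod.platform.sei.io
            - jaeger.prod.platform.sei.io   # hypothetical new tool
```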

---

## Rollout Order

1. Register the ext_authz provider in the Istio mesh config (safe because nothing references it yet)
2. Deploy OAuth2 Proxy (Helm, Secret, Service, PDB) in the `auth` namespace
3. Create the OAuth2 callback HTTPRoute and verify the callback endpoint responds
4. Create the Prometheus and AlertManager HTTPRoutes (they are reachable without auth until step 5 is applied, so apply steps 4 and 5 together)
5. Apply the AuthorizationPolicy (auth now enforced)
6. Update Prometheus/AlertManager `externalUrl` in Helm values
7. Verify: click PagerDuty alert link → Google OAuth → Prometheus graph

---

## File Layout

```
clusters/prod/auth/
  kustomization.yaml
  namespace.yaml
  oauth2-proxy.yaml              # HelmRepository + HelmRelease
  secret.enc.yaml                # SOPS: client-id, client-secret, cookie-secret
  httproute.yaml                 # oauth2-proxy.prod.platform.sei.io
  authz-policy.yaml              # CUSTOM AuthorizationPolicy (in gateway namespace)

clusters/prod/monitoring/
  httproute-prometheus.yaml      # prometheus.prod.platform.sei.io
  httproute-alertmanager.yaml    # alertmanager.prod.platform.sei.io
  prometheus-operator.yaml       # externalUrl changes

clusters/prod/istio-system/
  mesh-config patch              # ext_authz extension provider
```

---

## Observability

OAuth2 Proxy exposes Prometheus metrics on port 44180. A ServiceMonitor scrapes these, and a PrometheusRule alerts on:
- `OAuth2ProxyHighErrorRate` (>5% 5xx for 5m), severity warning
- `OAuth2ProxyDown` (metrics unreachable for 3m), severity critical
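
One possible shape for the PrometheusRule is sketched below. The metric name `oauth2_proxy_requests_total`, its `code` label, and the `job` label value are assumptions and should be confirmed against the proxy's actual `/metrics` output; only the alert names, thresholds, durations, and severities come from the list above.

```yaml
# Sketch only: metric name, label names, and job value are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: oauth2-proxy
  namespace: auth
spec:
  groups:
    - name: oauth2-proxy
      rules:
        - alert: OAuth2ProxyHighErrorRate
          # Share of 5xx responses over the last 5 minutes
          expr: |
            sum(rate(oauth2_proxy_requests_total{code=~"5.."}[5m]))
              / sum(rate(oauth2_proxy_requests_total[5m])) > 0.05
          for: 5m
          labels:
            severity: warning
        - alert: OAuth2ProxyDown
          # Fires when no scrape target for the job reports up
          expr: absent(up{job="oauth2-proxy"} == 1)
          for: 3m
          labels:
            severity: critical
```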
