| 1 | +# Unified OAuth2 Authentication for Internal Platform Tools |
| 2 | + |
| 3 | +**Status:** Accepted |
| 4 | +**Date:** 2026-04-10 |
| 5 | +**Scope:** Platform — Istio ext_authz, OAuth2 Proxy, Gateway API, Monitoring |
| 6 | + |
| 7 | +--- |
| 8 | + |
| 9 | +## Problem |
| 10 | + |
| 11 | +Internal platform tools (Prometheus, AlertManager) are not externally accessible. PagerDuty alert links contain `generatorURL` pointing to the cluster-internal Prometheus address (`http://sei-prod-prometheus.monitoring:9090/graph?g0.expr=...`), which is unreachable from a browser. On-call engineers cannot click through to see the metric that triggered an alert. |
| 12 | + |
| 13 | +Exposing these tools requires authentication. The old per-chain monitoring stack (sei-infra) had Prometheus behind ALBs without auth — we should not replicate that pattern. |
| 14 | + |
| 15 | +## Solution |
| 16 | + |
| 17 | +Deploy a shared **OAuth2 Proxy** as an Istio **ext_authz** extension provider. Any internal tool can opt in to Google OAuth by adding an `AuthorizationPolicy` with `action: CUSTOM`. The first consumers are Prometheus and AlertManager. |
| 18 | + |
| 19 | +### Architecture |
| 20 | + |
| 21 | +``` |
| 22 | +Browser → NLB → sei-gateway (TLS termination) |
| 23 | + │ |
| 24 | + AuthorizationPolicy (CUSTOM action) |
| 25 | + │ |
| 26 | + ext_authz check → OAuth2 Proxy (auth namespace) |
| 27 | + │ |
| 28 | + ┌── 202 (authenticated) → HTTPRoute → backend |
| 29 | + └── 302 (unauthenticated) → Google OAuth → callback → retry |
| 30 | +``` |
| 31 | + |
| 32 | +### Auth Flow |
| 33 | + |
| 34 | +1. User clicks PagerDuty link to `https://prometheus.prod.platform.sei.io/graph?g0.expr=...` |
| 35 | +2. Istio Gateway matches the CUSTOM AuthorizationPolicy and sends an ext_authz check to OAuth2 Proxy |
| 36 | +3. OAuth2 Proxy checks for `_sei_platform_auth` cookie — not found |
| 37 | +4. Returns 302 to Google OAuth with `redirect_uri=https://oauth2-proxy.prod.platform.sei.io/oauth2/callback` and `state=<original-url>` |
| 38 | +5. User authenticates with Google (same SSO as Grafana — `@seinetwork.io`, `@sei.io`, `@seifdn.org`) |
| 39 | +6. Google redirects to callback; OAuth2 Proxy sets cookie on `.prod.platform.sei.io`, redirects back to original URL |
| 40 | +7. Browser retries with cookie → ext_authz passes → Prometheus graph loads |
| 41 | + |
| 42 | +### SSO |
| 43 | + |
| 44 | +The cookie domain `.prod.platform.sei.io` means one login covers all protected tools. A session established on Prometheus is also valid on AlertManager and on any future tool added under the same domain. |
| 45 | + |
| 46 | +--- |
| 47 | + |
| 48 | +## Design Decisions |
| 49 | + |
| 50 | +### Shared Google OAuth Client |
| 51 | + |
| 52 | +OAuth2 Proxy reuses the existing Google OAuth client that Grafana already uses. Add the OAuth2 Proxy callback URI to the existing client's authorized redirect URIs in Google Cloud Console: |
| 53 | + |
| 54 | +``` |
| 55 | +https://oauth2-proxy.prod.platform.sei.io/oauth2/callback |
| 56 | +``` |
| 57 | + |
| 58 | +This avoids managing a second set of credentials. The OAuth2 Proxy deployment references the same `google-oauth` secret (client ID and client secret) plus its own `cookie-secret`. |
| 59 | + |
| 60 | +### Dedicated `auth` Namespace |
| 61 | + |
| 62 | +OAuth2 Proxy deploys in its own `auth` namespace rather than `monitoring`. This isolates the authentication infrastructure from the monitoring workloads it protects, limiting blast radius if either is compromised. |
| 63 | + |
| 64 | +### Fail-Closed |
| 65 | + |
| 66 | +The ext_authz config sets `failOpen: false`. When OAuth2 Proxy is down, protected tools return 503 rather than becoming accessible without auth. Running 2 replicas with a PodDisruptionBudget (`minAvailable: 1`) mitigates downtime. |
| 67 | + |
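A minimal sketch of the PodDisruptionBudget described above. The resource name and the pod label are assumptions for illustration; the real selector has to match whatever labels the Helm chart puts on the OAuth2 Proxy pods.

```yaml
# Hypothetical manifest: name and selector labels are assumptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: oauth2-proxy
  namespace: auth
spec:
  # From the text: keep at least one replica up during voluntary disruptions.
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: oauth2-proxy   # assumed pod label
```

With 2 replicas and `minAvailable: 1`, a node drain can evict only one OAuth2 Proxy pod at a time, so the ext_authz check keeps a healthy backend during routine maintenance.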
| 68 | +### Cookie-Based Sessions (No Redis) |
| 69 | + |
| 70 | +Session state lives in encrypted cookies. There is no shared state and no Redis, so any replica can validate any request. This is appropriate for a team of <100 engineers. Revisit only if cookie size exceeds the browser limit (about 4 KB per cookie), which would happen with large group claims; that is not applicable here. |
| 71 | + |
| 72 | +### Grafana Excluded |
| 73 | + |
| 74 | +Grafana keeps its own Google OAuth integration. It needs user identity for role mapping (`@seinetwork.io` → Editor, others → Viewer). The AuthorizationPolicy only targets Prometheus and AlertManager hostnames, so Grafana is never intercepted. |
| 75 | + |
| 76 | +--- |
| 77 | + |
| 78 | +## Resources |
| 79 | + |
| 80 | +### Istio Mesh Config — Extension Provider |
| 81 | + |
| 82 | +Register OAuth2 Proxy as an ext_authz provider in the istiod mesh config: |
| 83 | + |
| 84 | +```yaml |
| 85 | +meshConfig: |
| 86 | + extensionProviders: |
| 87 | + - name: oauth2-proxy |
| 88 | + envoyExtAuthz: |
| 89 | + service: oauth2-proxy.auth.svc.cluster.local |
| 90 | + port: 4180 |
| 91 | + includeRequestHeadersInCheck: |
| 92 | + - authorization |
| 93 | + - cookie |
| 94 | + headersToUpstreamOnAllow: |
| 95 | + - authorization |
| 96 | + - x-auth-request-user |
| 97 | + - x-auth-request-email |
| 98 | + - cookie |
| 99 | + headersToDownstreamOnDeny: |
| 100 | + - set-cookie |
| 101 | + - content-type |
| 102 | + - location |
| 103 | + headersToDownstreamOnAllow: |
| 104 | + - set-cookie |
| 105 | + failOpen: false |
| 106 | + statusOnError: "503" |
| 107 | +``` |
| 108 | + |
| 109 | +### OAuth2 Proxy (auth namespace) |
| 110 | + |
| 111 | +Deployed via Helm chart `oauth2-proxy/oauth2-proxy` from `https://oauth2-proxy.github.io/manifests`. |
| 112 | + |
| 113 | +Key configuration: |
| 114 | +- `provider = "google"` |
| 115 | +- `upstreams = ["static://202"]` — ext_authz mode, not reverse-proxy mode |
| 116 | +- `email_domains = ["seinetwork.io", "sei.io", "seifdn.org"]` |
| 117 | +- `cookie_name = "_sei_platform_auth"` |
| 118 | +- `cookie_domains = [".prod.platform.sei.io"]` |
| 119 | +- `cookie_expire = "12h"`, `cookie_refresh = "1h"` |
| 120 | +- `session_store_type = "cookie"` |
| 121 | +- `reverse_proxy = true` — trusts X-Forwarded-* from Gateway |
| 122 | +- `set_xauthrequest = true` — forwards user identity headers to backends |
| 123 | + |
| 124 | +Secrets (SOPS-encrypted): Google client ID, client secret, cookie secret (32-byte random). |
| 125 | + |
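The key configuration above could be expressed as Helm values roughly as follows. This is a sketch under assumptions: the value keys (`replicaCount`, `config.existingSecret`, `config.configFile`, `podDisruptionBudget`) should be checked against the chart's `values.yaml` for the pinned chart version, and the Secret name is hypothetical. The settings inside `configFile` are the ones listed above, unchanged.

```yaml
# Sketch of values for the oauth2-proxy Helm chart.
# Value keys are assumptions about the chart; verify against its values.yaml.
replicaCount: 2
config:
  # Hypothetical Secret name; expected to hold client-id, client-secret, cookie-secret.
  existingSecret: oauth2-proxy
  configFile: |-
    provider = "google"
    upstreams = ["static://202"]
    email_domains = ["seinetwork.io", "sei.io", "seifdn.org"]
    cookie_name = "_sei_platform_auth"
    cookie_domains = [".prod.platform.sei.io"]
    cookie_expire = "12h"
    cookie_refresh = "1h"
    session_store_type = "cookie"
    reverse_proxy = true
    set_xauthrequest = true
    # Assumption, not part of the list above: redirects back to other
    # subdomains after login may need to be allowed explicitly.
    # whitelist_domains = [".prod.platform.sei.io"]
podDisruptionBudget:
  enabled: true
  minAvailable: 1
```

Keeping the settings in one `configFile` block makes the rendered config easy to diff against the list above during review.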
| 126 | +### HTTPRoutes |
| 127 | + |
| 128 | +| Hostname | Backend | Namespace | |
| 129 | +|---|---|---| |
| 130 | +| `prometheus.prod.platform.sei.io` | `sei-prod-prometheus:9090` | monitoring | |
| 131 | +| `alertmanager.prod.platform.sei.io` | `sei-prod-alertmanager:9093` | monitoring | |
| 132 | +| `oauth2-proxy.prod.platform.sei.io` | `oauth2-proxy:4180` | auth | |
| 133 | + |
| 134 | +All three hostnames are covered by the existing wildcard cert `*.prod.platform.sei.io`. External-DNS auto-creates DNS records from HTTPRoute hostnames. |
| 135 | + |
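As an illustration, the Prometheus row of the table above might translate into an HTTPRoute like the one below. The route name is hypothetical, and the Gateway namespace is assumed to be `gateway`, matching the namespace used by the AuthorizationPolicy in this document.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: prometheus              # hypothetical route name
  namespace: monitoring
spec:
  parentRefs:
    - name: sei-gateway
      namespace: gateway        # assumed Gateway namespace
  hostnames:
    - prometheus.prod.platform.sei.io
  rules:
    - backendRefs:
        - name: sei-prod-prometheus
          port: 9090
```

Because the route lives in `monitoring` and the Gateway lives elsewhere, the Gateway listener's `allowedRoutes` must admit routes from that namespace. The AlertManager and OAuth2 Proxy routes would follow the same shape with their own hostname, backend, and namespace.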
| 136 | +### AuthorizationPolicy |
| 137 | + |
| 138 | +```yaml |
| 139 | +apiVersion: security.istio.io/v1 |
| 140 | +kind: AuthorizationPolicy |
| 141 | +metadata: |
| 142 | + name: platform-tools-ext-authz |
| 143 | + namespace: gateway |
| 144 | +spec: |
| 145 | + targetRefs: |
| 146 | + - group: gateway.networking.k8s.io |
| 147 | + kind: Gateway |
| 148 | + name: sei-gateway |
| 149 | + action: CUSTOM |
| 150 | + provider: |
| 151 | + name: oauth2-proxy |
| 152 | + rules: |
| 153 | + - to: |
| 154 | + - operation: |
| 155 | + hosts: |
| 156 | + - prometheus.prod.platform.sei.io |
| 157 | + - alertmanager.prod.platform.sei.io |
| 158 | +``` |
| 159 | + |
| 160 | +The policy is scoped to specific hostnames. All other gateway traffic (RPC endpoints, Grafana, etc.) is unaffected. |
| 161 | + |
| 162 | +### Prometheus Helm Changes |
| 163 | + |
| 164 | +```yaml |
| 165 | +prometheus: |
| 166 | + prometheusSpec: |
| 167 | + externalUrl: https://prometheus.prod.platform.sei.io |
| 168 | + |
| 169 | +alertmanager: |
| 170 | + alertmanagerSpec: |
| 171 | + externalUrl: https://alertmanager.prod.platform.sei.io |
| 172 | +``` |
| 173 | + |
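| 174 | +Prometheus and AlertManager build the links they emit, including each alert's `generatorURL`, from `externalUrl`, so this fixes the broken PagerDuty links. |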
| 175 | + |
| 176 | +--- |
| 177 | + |
| 178 | +## Interaction with Existing AuthorizationPolicies |
| 179 | + |
| 180 | +Istio evaluates authorization in order: **CUSTOM → DENY → ALLOW**. |
| 181 | + |
| 182 | +- SeiNode AuthorizationPolicies use `action: ALLOW` with pod selectors — they operate at the sidecar level on SeiNode pods, not at the gateway. |
| 183 | +- The new CUSTOM policy targets the Gateway workload and only fires for matching hostnames. |
| 184 | +- The two sets of policies attach to different workloads, so they do not conflict or interfere. |
| 185 | + |
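For contrast, a sidecar-level ALLOW policy of the kind described above has roughly the following shape. Every name, namespace, and label here is hypothetical and shown only to illustrate the difference in attachment: this policy uses a pod `selector`, while the new CUSTOM policy uses `targetRefs` pointing at the Gateway.

```yaml
# Hypothetical example; not the actual SeiNode policy.
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: seinode-allow           # hypothetical
  namespace: sei-nodes          # hypothetical
spec:
  selector:
    matchLabels:
      app: seinode              # hypothetical pod label
  action: ALLOW
  rules:
    - from:
        - source:
            namespaces: ["gateway"]   # hypothetical allowed source
```

A policy like this is enforced by the sidecar of the selected pods, so it never sees the hostname-scoped ext_authz decision made at the gateway.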
| 186 | +--- |
| 187 | + |
| 188 | +## Adding Future Services |
| 189 | + |
| 190 | +To protect a new tool (e.g., Jaeger): |
| 191 | + |
| 192 | +1. Add an HTTPRoute for `jaeger.prod.platform.sei.io` |
| 193 | +2. Add the hostname to the AuthorizationPolicy `hosts` list |
| 194 | + |
| 195 | +No OAuth2 Proxy changes needed — the cookie domain covers all `*.prod.platform.sei.io` subdomains. |
| 196 | + |
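The two steps above could look like the following for the Jaeger example. The route name, namespace, Service name, and port are all hypothetical; only the hostname, the Gateway reference, and the policy itself come from this document.

```yaml
# Step 1: HTTPRoute for the new tool (backend details are hypothetical).
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: jaeger                  # hypothetical
  namespace: observability      # hypothetical
spec:
  parentRefs:
    - name: sei-gateway
      namespace: gateway
  hostnames:
    - jaeger.prod.platform.sei.io
  rules:
    - backendRefs:
        - name: jaeger-query    # hypothetical Service
          port: 16686           # hypothetical port
---
# Step 2: the existing policy with one extra host.
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: platform-tools-ext-authz
  namespace: gateway
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: sei-gateway
  action: CUSTOM
  provider:
    name: oauth2-proxy
  rules:
    - to:
        - operation:
            hosts:
              - prometheus.prod.platform.sei.io
              - alertmanager.prod.platform.sei.io
              - jaeger.prod.platform.sei.io   # new
```

Applying the policy change in the same commit as the route avoids a window in which the new hostname is routable but not yet protected.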
| 197 | +--- |
| 198 | + |
| 199 | +## Rollout Order |
| 200 | + |
| 201 | +1. Register ext_authz provider in Istio mesh config (safe — nothing references it yet) |
| 202 | +2. Deploy OAuth2 Proxy (Helm, Secret, Service, PDB) in `auth` namespace |
| 203 | +3. Create OAuth2 callback HTTPRoute, verify callback endpoint responds |
| 204 | +4. Create Prometheus + AlertManager HTTPRoutes |
| 205 | +5. Apply AuthorizationPolicy (auth now enforced) |
| 206 | +6. Update Prometheus/AlertManager `externalUrl` in Helm values |
| 207 | +7. Verify: click PagerDuty alert link → Google OAuth → Prometheus graph |
| 208 | + |
| 209 | +--- |
| 210 | + |
| 211 | +## File Layout |
| 212 | + |
| 213 | +``` |
| 214 | +clusters/prod/auth/ |
| 215 | + kustomization.yaml |
| 216 | + namespace.yaml |
| 217 | + oauth2-proxy.yaml # HelmRepository + HelmRelease |
| 218 | + secret.enc.yaml # SOPS: client-id, client-secret, cookie-secret |
| 219 | + httproute.yaml # oauth2-proxy.prod.platform.sei.io |
| 220 | + authz-policy.yaml # CUSTOM AuthorizationPolicy (in gateway namespace) |
| 221 | + |
| 222 | +clusters/prod/monitoring/ |
| 223 | + httproute-prometheus.yaml # prometheus.prod.platform.sei.io |
| 224 | + httproute-alertmanager.yaml # alertmanager.prod.platform.sei.io |
| 225 | + prometheus-operator.yaml # externalUrl changes |
| 226 | + |
| 227 | +clusters/prod/istio-system/ |
| 228 | + mesh-config patch # ext_authz extension provider |
| 229 | +``` |
| 230 | + |
| 231 | +--- |
| 232 | + |
| 233 | +## Observability |
| 234 | + |
| 235 | +OAuth2 Proxy exposes Prometheus metrics on port 44180. A ServiceMonitor scrapes these, and a PrometheusRule alerts on: |
| 236 | +- `OAuth2ProxyHighErrorRate` (>5% 5xx for 5m) — warning |
| 237 | +- `OAuth2ProxyDown` (metrics unreachable for 3m) — critical |
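
A possible PrometheusRule for the two alerts above. This is a sketch under assumptions: the metric name `oauth2_proxy_requests_total` and its `code` label, the `job` label value, and the resource name are not stated in this document and should be checked against the metrics the deployed version actually exposes. The 5m rate window combined with `for: 5m` is one reading of ">5% 5xx for 5m".

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: oauth2-proxy            # hypothetical
  namespace: auth
spec:
  groups:
    - name: oauth2-proxy
      rules:
        - alert: OAuth2ProxyHighErrorRate
          # Share of 5xx responses over all responses (assumed metric and label names).
          expr: |
            sum(rate(oauth2_proxy_requests_total{code=~"5.."}[5m]))
              /
            sum(rate(oauth2_proxy_requests_total[5m])) > 0.05
          for: 5m
          labels:
            severity: warning
        - alert: OAuth2ProxyDown
          # Fires when no scrape target for the assumed job reports up == 1.
          expr: absent(up{job="oauth2-proxy"} == 1)
          for: 3m
          labels:
            severity: critical
```

Using `absent(...)` rather than `up == 0` also covers the case where the ServiceMonitor stops discovering any target at all.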