2 | 2 |
3 | 3 | ## Scope |
4 | 4 |
5 | | -Operational procedures for production incidents, recovery, and maintenance. |
| 5 | +Operational procedures for production incidents, recovery, and maintenance across Netlify + Supabase. |
6 | 6 |
7 | | -## Runbook Areas |
| 7 | +## Ownership |
8 | 8 |
9 | | -- Incident triage |
10 | | -- Service degradation response |
11 | | -- Data export job failures |
12 | | -- Migration rollback steps |
| 9 | +- Incident Commander: on-call platform engineer. |
| 10 | +- Communications Lead: project owner or delegate. |
| 11 | +- Database Lead: engineer with Supabase migration permissions. |
13 | 12 |
14 | | -Detailed runbook tasks are tracked in `/docs/plan/M08-quality-ci-cd-and-observability.md` and `/docs/plan/M09-release-readiness-and-pilot.md`. |
| 13 | +## Severity levels |
| 14 | + |
| 15 | +- `SEV-1`: Platform unavailable, cross-tenant risk, or data integrity risk. |
| 16 | +- `SEV-2`: Major feature degraded or unavailable (submissions, moderation, exports, or the public API).
| 17 | +- `SEV-3`: Partial degradation with viable workaround. |
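The severity definitions above can be read as a simple decision order, sketched below. This is only an illustration: the function name and the boolean signal names are hypothetical and not part of the runbook, and real triage will still need human judgment.

```python
def classify_severity(
    *,
    platform_unavailable: bool = False,
    cross_tenant_risk: bool = False,
    data_integrity_risk: bool = False,
    major_feature_degraded: bool = False,
) -> str:
    """Map observed signals to a severity label (hypothetical helper).

    The most severe conditions are checked first, so an incident that
    matches several definitions gets the highest applicable level.
    """
    if platform_unavailable or cross_tenant_risk or data_integrity_risk:
        return "SEV-1"
    if major_feature_degraded:
        # Submissions, moderation, exports, or the public API.
        return "SEV-2"
    # Anything left is treated as partial degradation with a workaround.
    return "SEV-3"
```

Checking SEV-1 conditions first means a cross-tenant concern always outranks a feature-level outage, which matches the ordering of the list.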
| 18 | + |
| 19 | +## First 15 minutes |
| 20 | + |
| 21 | +1. Declare incident severity and open incident channel. |
| 22 | +2. Confirm blast radius: |
| 23 | + - Public browse/commenting |
| 24 | + - Agency operations/moderation |
| 25 | + - Public API exports |
| 26 | +3. Freeze deploys until incident is stabilized. |
| 27 | +4. Capture current signals: |
| 28 | + - Netlify deploy health |
| 29 | + - Supabase project status |
| 30 | + - Recent function errors (`submit-comment`, `public-api`, `generate-export`) |
| 31 | + |
| 32 | +## Core playbooks |
| 33 | + |
| 34 | +### 1) Public submission failures |
| 35 | + |
| 36 | +- Check edge function logs for `submit-comment` errors. |
| 37 | +- Validate required environment variables: `HCAPTCHA_SECRET_KEY`, Supabase keys. |
| 38 | +- Check abuse-event volume spikes indicating CAPTCHA/rate-limit pressure. |
| 39 | +- Mitigation: |
| 40 | + - Temporarily reduce traffic pressure using stricter edge throttles. |
| 41 | + - If the CAPTCHA provider has an outage, switch submissions to a moderated maintenance banner.
| 42 | + |
| 43 | +### 2) Public API degradation |
| 44 | + |
| 45 | +- Check `public-api` function logs and latency. |
| 46 | +- Verify `api_rate_limits` table growth and cleanup behavior. |
| 47 | +- Confirm RPC dependencies (`get_public_dockets`, `get_docket_public_detail`, `get_comment_detail`) return expected responses. |
| 48 | +- Mitigation: |
| 49 | + - Increase edge function concurrency limits (where available). |
| 50 | + - Apply temporary lower rate limits for abusive routes. |
| 51 | + |
| 52 | +### 3) Export pipeline failures |
| 53 | + |
| 54 | +- Inspect `exports` table rows with status `failed` or stalled in `processing`.
| 55 | +- Check `generate-export` function errors and storage write failures. |
| 56 | +- Mitigation: |
| 57 | + - Requeue failed jobs by creating replacement job records. |
| 58 | + - Expire stale jobs and notify agency users. |
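The export triage above can be sketched as a small sorting step over rows already fetched from the `exports` table. The column names (`id`, `status`, `updated_at`) and the 30-minute stall threshold are assumptions made for illustration; the actual schema and a sensible threshold should be confirmed before using anything like this.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold: a job still in `processing` after this long counts as stalled.
STALL_THRESHOLD = timedelta(minutes=30)


def triage_exports(rows, now, stall_after=STALL_THRESHOLD):
    """Split export rows into job ids to requeue and job ids to expire.

    `rows` is a list of dicts with assumed keys `id`, `status`, `updated_at`.
    Failed jobs are candidates for replacement job records; jobs stuck in
    `processing` longer than `stall_after` are candidates for expiry.
    """
    requeue, expire = [], []
    for row in rows:
        if row["status"] == "failed":
            requeue.append(row["id"])
        elif row["status"] == "processing" and now - row["updated_at"] > stall_after:
            expire.append(row["id"])
    return requeue, expire


# Example with made-up rows.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
sample = [
    {"id": "a", "status": "failed", "updated_at": now - timedelta(minutes=5)},
    {"id": "b", "status": "processing", "updated_at": now - timedelta(minutes=45)},
    {"id": "c", "status": "processing", "updated_at": now - timedelta(minutes=5)},
]
to_requeue, to_expire = triage_exports(sample, now)
```

Keeping the classification separate from any database writes lets the on-call engineer review both lists before creating replacement records or notifying agency users.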
| 59 | + |
| 60 | +### 4) Data integrity or policy risk |
| 61 | + |
| 62 | +- Use `audit_events` for timeline reconstruction. |
| 63 | +- Use `abuse_events` to detect suspicious submission patterns. |
| 64 | +- If cross-tenant risk is suspected, disable affected endpoints and enforce read-only mode for agency actions until triaged.
| 65 | + |
| 66 | +## Rollback procedures |
| 67 | + |
| 68 | +### Application rollback (Netlify) |
| 69 | + |
| 70 | +1. Identify last known good deploy. |
| 71 | +2. Promote that deploy in the Netlify UI/CLI.
| 72 | +3. Re-run smoke checks on public + agency flows. |
| 73 | + |
| 74 | +### Database rollback (Supabase) |
| 75 | + |
| 76 | +1. Stop deploy pipeline. |
| 77 | +2. Identify last migration applied before incident. |
| 78 | +3. Execute approved rollback playbook for affected migration set. |
| 79 | +4. Validate RLS policies and key RPCs before re-opening write traffic. |
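Step 2 of the database rollback can be sketched as a lookup over the migration history. The sketch assumes the history has already been exported as `(version, applied_at)` pairs; that shape and the function name are hypothetical, not a Supabase API.

```python
from datetime import datetime, timezone


def last_migration_before(applied, incident_start):
    """Return the version of the most recent migration applied before the incident.

    `applied` is a list of (version, applied_at) tuples in any order.
    Returns None when no migration predates the incident.
    """
    earlier = [m for m in applied if m[1] < incident_start]
    if not earlier:
        return None
    # max() by timestamp picks the migration closest to the incident start.
    return max(earlier, key=lambda m: m[1])[0]


# Example with made-up versions and timestamps.
history = [
    ("0003", datetime(2024, 1, 3, tzinfo=timezone.utc)),
    ("0001", datetime(2024, 1, 1, tzinfo=timezone.utc)),
    ("0002", datetime(2024, 1, 2, tzinfo=timezone.utc)),
]
incident = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
candidate = last_migration_before(history, incident)
```

The result is only a starting point for choosing the affected migration set; the approved rollback playbook still decides what is actually reverted.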
| 80 | + |
| 81 | +## Post-incident actions |
| 82 | + |
| 83 | +1. Publish incident summary with root cause and customer impact. |
| 84 | +2. Create follow-up tasks for preventive fixes. |
| 85 | +3. Update this runbook if a playbook gap was discovered. |