Last updated: 2026-04-03
This is the current production incident runbook for Agentbot. Use this file in preference to older mixed-era rollback docs.
- Canonical web domain: https://agentbot.sh
- Vercel project: raveculture-projects/agentbot
- Vercel root directory: web
- Railway backend health: https://YOUR_SERVICE_URL/health
- Borg dashboard: https://YOUR_SERVICE_URL/dashboard
- SEV-1: Full production outage, login broken, payments broken, dashboard unusable
- SEV-2: Major feature degraded, partial dashboard failure, broken onboarding or chat
- SEV-3: Non-critical issue with workaround
- Confirm scope.
  - Check https://agentbot.sh/api/health
  - Check Railway backend health
  - Check current Vercel production deployment id
- Freeze risky changes.
  - Stop merging unrelated PRs
  - Do not rotate secrets unless the incident requires it
- Decide whether to rollback or hotfix.
  - Roll back for SEV-1 and clear regressions introduced by the latest deployment
  - Hotfix only if the blast radius is narrow and the repair is low risk
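
The rollback-or-hotfix rule above can be sketched as a small shell function. This is only an illustration: `choose_action` and its yes/no arguments are hypothetical names, the rule is read as "SEV-1 or a clear regression from the latest deployment", and the `undecided` result is an assumption for cases the runbook does not cover.

```shell
#!/bin/sh
# choose_action SEVERITY LATEST_DEPLOY_REGRESSION NARROW_BLAST_RADIUS LOW_RISK_REPAIR
# The last three arguments are "yes" or "no". All names here are hypothetical.
choose_action() {
  if [ "$1" = "SEV-1" ] || [ "$2" = "yes" ]; then
    # SEV-1, or a clear regression introduced by the latest deployment.
    echo rollback
  elif [ "$3" = "yes" ] && [ "$4" = "yes" ]; then
    # Hotfix only when the blast radius is narrow and the repair is low risk.
    echo hotfix
  else
    # Assumption: the runbook gives no rule here, so leave it to the incident lead.
    echo undecided
  fi
}
```

For example, `choose_action SEV-2 no yes yes` prints `hotfix`, while any SEV-1 input prints `rollback`.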
- Primary ops channel: maintain one shared incident thread in the team channel
- Status updates cadence:
  - SEV-1: every 15 minutes
  - SEV-2: every 30 minutes
- Always record:
  - first detected time
  - affected surface
  - current mitigation
  - next checkpoint
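
As a rough sketch, the cadence and the four required fields could be turned into a status update template for the incident thread. The function names `cadence_minutes` and `status_update` are hypothetical, and the message layout is an assumption rather than an established team format.

```shell
#!/bin/sh
# cadence_minutes SEVERITY: print the update interval the runbook defines.
cadence_minutes() {
  case "$1" in
    SEV-1) echo 15 ;;
    SEV-2) echo 30 ;;
    *) return 1 ;;  # the runbook sets no cadence for other severities
  esac
}

# status_update SEVERITY FIRST_DETECTED AFFECTED_SURFACE CURRENT_MITIGATION
# Prints one update containing every field the runbook says to record.
status_update() {
  minutes=$(cadence_minutes "$1") || {
    echo "no update cadence defined for $1" >&2
    return 1
  }
  printf '%s status update\n' "$1"
  printf 'First detected: %s\n' "$2"
  printf 'Affected surface: %s\n' "$3"
  printf 'Current mitigation: %s\n' "$4"
  printf 'Next checkpoint: in %s minutes\n' "$minutes"
}
```

A call such as `status_update SEV-1 "<detected time>" "<surface>" "<mitigation>"` prints a five-line update ending with `Next checkpoint: in 15 minutes`.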
- Open Vercel dashboard for agentbot
- Go to Deployments
- Find the last known good deployment before the regression
- Click Promote to Production
- Re-check:
  - https://agentbot.sh
  - https://agentbot.sh/api/health
  - affected dashboard route
git -C /Users/raveculture/agentbot log --oneline -10
git -C /Users/raveculture/agentbot revert <bad_commit_sha>
git -C /Users/raveculture/agentbot push origin main

Use git revert, not history rewrite, for production rollback.
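
To see why revert is the safer choice, the following sketch runs in a throwaway repository created with `mktemp`, not in the Agentbot repository. The file name, commit messages, and identity are hypothetical. It shows that `git revert` adds a new commit that undoes the bad one while leaving the earlier history in place.

```shell
#!/bin/sh
# Demonstration in a temporary repository; nothing here touches the real repo.
set -e
demo=$(mktemp -d)
git -C "$demo" init -q
# Identity and signing are set per repository so the demo ignores global config.
git -C "$demo" config user.name "Runbook Demo"
git -C "$demo" config user.email "demo@example.invalid"
git -C "$demo" config commit.gpgsign false

echo "good" > "$demo/app.txt"
git -C "$demo" add app.txt
git -C "$demo" commit -q -m "known good release"

echo "bad" > "$demo/app.txt"
git -C "$demo" commit -q -am "bad change"
bad_sha=$(git -C "$demo" rev-parse HEAD)

# Revert creates a third commit that undoes bad_sha; no commit is rewritten.
git -C "$demo" revert --no-edit "$bad_sha" >/dev/null

commit_count=$(git -C "$demo" rev-list --count HEAD)
restored=$(cat "$demo/app.txt")
echo "commits: $commit_count, app.txt: $restored"
```

The history grows from two commits to three and the file returns to its known good content, so anyone who already pulled the bad commit can fast-forward without a force push.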
- Open the affected Railway service deployment history
- Redeploy the previous healthy release
- Verify the health endpoint before closing the incident
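
One way to verify the health endpoint after a redeploy is to poll it until it returns `200`. This is a minimal sketch: `http_status` and `wait_healthy` are hypothetical helper names, the attempt count and delay are arbitrary choices, and `https://YOUR_SERVICE_URL/health` remains the placeholder from this runbook.

```shell
#!/bin/sh
# http_status URL: print the HTTP status code (curl prints 000 if no response).
http_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1"
}

# wait_healthy URL ATTEMPTS DELAY_SECONDS
# Polls URL until it answers 200 or the attempts run out.
wait_healthy() {
  attempt=1
  while [ "$attempt" -le "$2" ]; do
    code=$(http_status "$1") || true
    if [ "$code" = "200" ]; then
      echo "healthy after $attempt attempt(s)"
      return 0
    fi
    echo "attempt $attempt: got $code" >&2
    attempt=$((attempt + 1))
    # Wait only if another attempt will follow.
    if [ "$attempt" -le "$2" ]; then
      sleep "$3"
    fi
  done
  return 1
}
```

For example, `wait_healthy https://YOUR_SERVICE_URL/health 10 15` would try for roughly two and a half minutes before giving up with a non-zero exit status.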
- Homepage loads on https://agentbot.sh
- GET /api/health returns 200
- Login page loads
- /dashboard/fleet redirects correctly or loads for authenticated users
- /dashboard/colony no longer returns 503
- /api/chat and /api/provision return expected auth or success responses
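
The checklist above could be scripted roughly as follows. Several parts are assumptions: the helper names are hypothetical, `LOGIN_PATH` must be supplied because the runbook does not name the login route, "expected auth" is taken to mean `401` or `403`, and every route is probed with an unauthenticated GET, which may not suit `/api/chat` or `/api/provision` if they only accept other methods.

```shell
#!/bin/sh
# Sketch only: the default BASE_URL and the pass rules below are assumptions.
BASE_URL="${BASE_URL:-https://agentbot.sh}"

# http_status URL: print the status code without following redirects.
http_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1"
}

# status_passes RULE CODE: return 0 when CODE satisfies RULE.
status_passes() {
  case "$1" in
    ok)             [ "$2" = "200" ] ;;
    ok_or_redirect) case "$2" in 200|3??) return 0 ;; *) return 1 ;; esac ;;
    # 000 means curl got no response at all, so it also counts as a failure.
    not_503)        [ "$2" != "503" ] && [ "$2" != "000" ] ;;
    auth_or_ok)     case "$2" in 2??|401|403) return 0 ;; *) return 1 ;; esac ;;
    *)              return 2 ;;
  esac
}

# verify RULE PATH: probe one route and report PASS or FAIL.
verify() {
  code=$(http_status "$BASE_URL$2") || true
  if status_passes "$1" "$code"; then
    echo "PASS $2 ($code)"
  else
    echo "FAIL $2 ($code)"
    return 1
  fi
}

# run_checks: walk the checklist; returns non-zero if any check failed.
run_checks() {
  failed=0
  verify ok / || failed=1
  verify ok /api/health || failed=1
  # LOGIN_PATH is hypothetical: set it to the real login route to include it.
  if [ -n "${LOGIN_PATH:-}" ]; then
    verify ok "$LOGIN_PATH" || failed=1
  fi
  verify ok_or_redirect /dashboard/fleet || failed=1
  verify not_503 /dashboard/colony || failed=1
  verify auth_or_ok /api/chat || failed=1
  verify auth_or_ok /api/provision || failed=1
  return "$failed"
}
```

Calling `run_checks` prints one PASS or FAIL line per route. A script like this supplements the manual checks and does not replace them, since it cannot confirm behavior for authenticated users.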
- Record the root cause
- Link the bad deployment id and the rollback deployment id
- List customer-visible impact
- Create a follow-up issue for any temporary mitigation