SRE And SOC Runbooks

Last Updated: 2026-03-29
Scope: Sentinel security-path incident response and recovery

1. Monitoring Priorities

Track and alert on:

replay detections (token/proof)
DPoP validation failures
nonce challenge rate (use_dpop_nonce)
SSF rejection and processing failure rates
finance authorization bounds exceeded events
cache dependency latency/error budget breach

2. Incident Severity Model

Severity	Criteria
Sev-1	security bypass suspected, widespread auth outage, or confirmed replay abuse campaign
Sev-2	localized security-path degradation with user impact
Sev-3	elevated warnings without confirmed user-impacting failures
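
The criteria in the table can be read as an ordered decision: check the Sev-1 conditions first, then Sev-2, and fall through to Sev-3. The following is a minimal sketch of that ordering, not Sentinel's actual alerting logic. The `IncidentSignals` type and all of its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentSignals:
    # All field names are hypothetical; map them to your own alert inputs.
    bypass_suspected: bool = False
    widespread_auth_outage: bool = False
    confirmed_replay_campaign: bool = False
    localized_degradation: bool = False
    user_impact: bool = False


def classify_severity(s: IncidentSignals) -> str:
    """Return the severity label from the table, checking Sev-1 first."""
    if s.bypass_suspected or s.widespread_auth_outage or s.confirmed_replay_campaign:
        return "Sev-1"
    if s.localized_degradation and s.user_impact:
        return "Sev-2"
    # Elevated warnings without confirmed user-impacting failures.
    return "Sev-3"
```

The sketch assumes an incident has already been opened, so anything that does not meet Sev-1 or Sev-2 is treated as Sev-3.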

3. Runbook: Replay Detection Spike

Trigger

sudden increase in replay-related failures/alerts

Triage

determine scope by route, client_id, subject, source network
verify whether failures map to a single integration/client rollout
check cache state health before concluding active abuse
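
One way to approach the scoping step is to count replay failures per value along each triage dimension and see whether a single value dominates. The sketch below assumes failure events have already been exported as dictionaries. The key names (`route`, `client_id`, `subject`, `source_network`) mirror the dimensions listed above but are assumptions about the log schema.

```python
from collections import Counter
from typing import Iterable, Mapping

# Assumed log field names, one per triage dimension.
SCOPE_DIMENSIONS = ("route", "client_id", "subject", "source_network")


def scope_breakdown(events: Iterable[Mapping[str, str]]) -> dict[str, Counter]:
    """Count replay failures per value for each triage dimension."""
    counts = {dim: Counter() for dim in SCOPE_DIMENSIONS}
    for event in events:
        for dim in SCOPE_DIMENSIONS:
            counts[dim][event.get(dim, "unknown")] += 1
    return counts


def dominant_share(counter: Counter) -> tuple[str, float]:
    """Return the most common value and the fraction of events it accounts for."""
    total = sum(counter.values())
    if total == 0:
        return ("", 0.0)
    value, count = counter.most_common(1)[0]
    return (value, count / total)
```

A dominant share close to 1.0 for one `client_id` points toward a single integration or rollout. Failures spread evenly across many clients and networks are more consistent with a distributed pattern, once cache health has been ruled out.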

Response

keep fail-closed behavior enabled
notify security engineering if replay pattern is distributed/coordinated
collect trace IDs and correlated logs for forensic timeline

Recovery Verification

replay rate returns to baseline
no permissive bypass behavior observed

4. Runbook: DPoP Nonce Challenge Storm

Trigger

increase in 401 responses with use_dpop_nonce

Common Causes

client not persisting latest nonce
intermediary stripping DPoP-Nonce or WWW-Authenticate headers
nonce store/cache degradation

Response

verify challenge headers are emitted by API
inspect edge/proxy header behavior
validate cache health and latency
work with affected clients on retry logic correctness (single retry with fresh nonce)
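
When reviewing client retry logic, it can help to have a reference shape for "single retry with fresh nonce". The sketch below is illustrative only. `SendFn` is a hypothetical transport callback that builds a DPoP proof with the given nonce and returns the status code and response headers, and header names are assumed to be lowercased by that transport.

```python
from typing import Callable, Mapping, Optional, Tuple

# (status_code, response_headers); header names are assumed lowercase here.
Response = Tuple[int, Mapping[str, str]]
# Hypothetical transport: takes the nonce to embed in the DPoP proof (or None).
SendFn = Callable[[Optional[str]], Response]


def is_nonce_challenge(status: int, headers: Mapping[str, str]) -> bool:
    """True when the response is a 401 asking for a (new) DPoP nonce."""
    return (
        status == 401
        and "use_dpop_nonce" in headers.get("www-authenticate", "")
        and "dpop-nonce" in headers
    )


class NonceAwareClient:
    def __init__(self, send: SendFn) -> None:
        self._send = send
        self._nonce: Optional[str] = None  # latest nonce seen from the server

    def request(self) -> Response:
        status, headers = self._send(self._nonce)
        if is_nonce_challenge(status, headers):
            # Persist the fresh nonce, then retry exactly once.
            self._nonce = headers["dpop-nonce"]
            status, headers = self._send(self._nonce)
        # The server may supply a new nonce on any response; keep the latest.
        self._nonce = headers.get("dpop-nonce", self._nonce)
        return status, headers
```

Two properties matter during a storm: the nonce is stored on the client instance so later requests reuse it, and a second challenge is returned to the caller rather than retried in a loop.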

5. Runbook: SSF Validation Failures

Trigger

repeated 401/400 outcomes on SSF event endpoint

Triage

identify failure class: auth token mismatch, signature/issuer, payload timing/shape
verify IdP discovery/JWKS reachability
check auth token config parity between sender and receiver
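
For the parity check, sender and receiver teams can compare short fingerprints instead of pasting the shared token into an incident channel. This is a sketch under one loud assumption: the token is a high-entropy random value, since a fingerprint of a short or guessable token could be brute-forced offline. The function names are hypothetical.

```python
import hashlib
import hmac


def describe_token(token: str) -> dict[str, object]:
    """Summarize a shared token without revealing it."""
    digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
    return {
        "fingerprint": digest[:16],  # truncated SHA-256, hex
        "length": len(token),
        # Stray newlines or spaces from config files are a common mismatch cause.
        "has_surrounding_whitespace": token != token.strip(),
    }


def fingerprints_match(sender_fp: str, receiver_fp: str) -> bool:
    """Compare two fingerprints in constant time."""
    return hmac.compare_digest(sender_fp, receiver_fp)
```

Each side runs `describe_token` locally against its own configured value and shares only the resulting summary.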

Response

rotate/update shared auth token if compromised or mismatched
coordinate issuer key-rotation validation path
isolate malformed sender batches to avoid event flood masking

6. Runbook: Cache Dependency Degradation

Trigger

replay/nonce/session checks showing backend unavailability or timeouts

Expected Behavior

requests on protected paths may fail closed (503 or denial, depending on the path)

Response

restore cache availability first; do not disable replay or blacklist checks
verify state writes and reads recover
run synthetic auth checks (nonce challenge + valid retry + replay rejection)
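
The three-step synthetic check (nonce challenge, valid retry, replay rejection) can be sketched as a small harness. `Probe` is a hypothetical callback that sends one request to a protected endpoint with a DPoP proof carrying the given nonce and `jti`. The sketch treats any non-2xx status as a replay rejection, which is an assumption about the API's behavior, and it assumes lowercase header names.

```python
import uuid
from typing import Callable, Mapping, Optional, Tuple

Response = Tuple[int, Mapping[str, str]]
# Hypothetical probe: sends one request whose DPoP proof carries `nonce` and `jti`.
Probe = Callable[[Optional[str], str], Response]


def run_synthetic_auth_check(probe: Probe) -> dict[str, bool]:
    """Run the three-step check and report which steps passed."""
    results = {"nonce_challenge": False, "valid_retry": False, "replay_rejected": False}

    # Step 1: a proof without a nonce should be challenged with a nonce.
    status, headers = probe(None, str(uuid.uuid4()))
    nonce = headers.get("dpop-nonce")
    results["nonce_challenge"] = status == 401 and nonce is not None
    if not results["nonce_challenge"]:
        return results

    # Step 2: a fresh proof carrying the nonce should succeed.
    jti = str(uuid.uuid4())
    status, _ = probe(nonce, jti)
    results["valid_retry"] = 200 <= status < 300
    if not results["valid_retry"]:
        return results

    # Step 3: re-sending the same proof must not succeed.
    status, _ = probe(nonce, jti)
    results["replay_rejected"] = not (200 <= status < 300)
    return results
```

Later steps are skipped when an earlier one fails, so a `False` for `replay_rejected` is only meaningful when `valid_retry` is `True`.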

7. Runbook: Finance Bounds Rejection Spike

Trigger

increased authorization-bounds-exceeded warnings and 403 responses on transfer route

Triage

compare expected signed bounds with submitted payload shapes
detect client-side currency/amount normalization regressions
assess for potential tampering/abuse signals
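
For internal analysis only, the comparison between signed bounds and a submitted payload can be sketched as below. The field names (`currency` and `max_amount` in the signed bounds, `currency` and `amount` in the payload) are hypothetical and not taken from the transfer contract. Amounts are compared as `Decimal` values to avoid float rounding.

```python
from decimal import Decimal, InvalidOperation
from typing import Mapping


def bounds_delta(signed: Mapping[str, str], submitted: Mapping[str, object]) -> list[str]:
    """Return human-readable mismatches between signed bounds and a payload."""
    findings = []

    # Currency: separate a real mismatch from a case-only normalization drift.
    if str(submitted.get("currency", "")).upper() != signed["currency"].upper():
        findings.append("currency mismatch")
    elif submitted.get("currency") != signed["currency"]:
        findings.append("currency differs only by case (normalization)")

    # Amount: a non-string amount suggests the client serialized a float.
    raw = submitted.get("amount")
    if not isinstance(raw, str):
        findings.append("amount is not a string (possible float normalization)")
    try:
        amount = Decimal(str(raw))
    except InvalidOperation:
        amount = None
    if amount is None or not amount.is_finite():
        findings.append("amount is not a finite decimal")
        return findings

    if amount > Decimal(signed["max_amount"]):
        findings.append("amount exceeds signed maximum")
    return findings
```

The findings belong in internal structured logs. Nothing from this output should be reflected in the external 403 response.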

Response

preserve opaque external 403 response semantics
use internal structured logs for detailed delta analysis
coordinate fix rollout with affected client teams

8. Post-Incident Validation Checklist

After mitigation:

protected endpoints succeed for valid DPoP flows
replay attempts are rejected
nonce challenge volume normalizes
SSF valid events are accepted and applied
finance transfer policy denials match expected baseline

9. Escalation Artifacts

Always collect:

incident time window
impacted endpoints and prefixes
trace IDs and correlation IDs
dependency health snapshots
mitigation actions and rollback conditions

10. Change Control Requirement

Any incident-driven config or policy change in auth/security paths must trigger updates to:

LIVING_THREAT_MODEL.md
COMPLIANCE_AUDIT_MATRIX.md
OPENAPI_3_1.yaml (if contract behavior changed)