For environment setup and service startup commands, use RUNBOOK.md.
For a full operator pass that includes outage/recovery, use docs/DEMO_CHECKLIST.md.
In EDGE_ENV=prod, keychain-backed device/master-key storage is required unless ALLOW_KEYCHAIN_FALLBACK=1 is set for controlled debugging.
This guide covers recovery of telemetry delivery using the edge outbox and DLQ model.
- PENDING: queued for delivery.
- SENT: delivered successfully.
- DLQ: retries exhausted, requires operator action.
curl -H "Authorization: Bearer <edge-token>" http://127.0.0.1:8787/api/v1/diagnostics

Inspect pending_count, dlq_count, sent_count, and recent error metadata.
When make release-check fails, review output/ci/invariant_report.json first. It highlights auth and support-bundle regressions separately from transport/outbox failures, which shortens triage before diving into compose or edge logs.
Primary reliability fields in diagnostics:
- outbox_pending_count
- dlq_count
- last_attempt
- last_success
- last_error_summary
- telemetry_flags
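For scripted checks, the diagnostics fields can be read without opening the response by hand. The sketch below is only an illustration: it assumes the endpoint returns JSON with integer-valued fields such as dlq_count (the exact nesting and formatting are not confirmed here), and json_int_field and dlq_is_empty are hypothetical helper names. It uses pattern matching rather than a JSON parser, so a real parser is preferable where one is available.

```shell
#!/bin/sh
# Hypothetical helper: print the integer value of the first "<field>": <n>
# pair found on stdin. Assumes integer-valued fields; this is pattern
# matching, not a JSON parser.
json_int_field() {
  field="$1"
  grep -o "\"${field}\"[[:space:]]*:[[:space:]]*[0-9][0-9]*" \
    | head -n 1 \
    | sed 's/.*:[[:space:]]*//'
}

# Hypothetical helper: succeed only when dlq_count is present and zero.
dlq_is_empty() {
  count="$(json_int_field dlq_count)"
  [ -n "$count" ] && [ "$count" -eq 0 ]
}

# Example use against the diagnostics endpoint shown earlier:
#   curl -s -H "Authorization: Bearer <edge-token>" \
#     http://127.0.0.1:8787/api/v1/diagnostics | dlq_is_empty \
#     || echo "DLQ has entries or dlq_count was not found"
```

Because the field name is matched with its surrounding quotes, asking for pending_count does not accidentally match outbox_pending_count.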
sqlite3 apps/edge/.sentinelid/audit.db

Useful queries:
SELECT id, status, attempts, created_at, last_error
FROM outbox_events
ORDER BY id DESC
LIMIT 25;

SELECT status, COUNT(*) FROM outbox_events GROUP BY status;

- Keep edge running.
- Restore cloud service.
- Pending events retry automatically.
- In demo mode, validate this with:
make smoke-cloud-recovery

- Validate cloud health endpoint.
- Confirm ingest URL configured on edge (CLOUD_INGEST_URL).
- Replay or reset failed entries after root-cause resolution.
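Before replaying anything, it can help to confirm that the ingest URL is set in the shell that launches the edge. The following is a minimal sketch, assuming CLOUD_INGEST_URL is exported in that shell; looks_like_http_url and check_ingest_url are hypothetical helper names, and the check only inspects the URL's shape, not whether the cloud service is reachable.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the value looks like an http(s) URL.
# Only the scheme and a non-empty remainder are checked; reachability is not.
looks_like_http_url() {
  case "$1" in
    http://?*|https://?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical pre-replay check: report whether CLOUD_INGEST_URL is usable.
check_ingest_url() {
  if looks_like_http_url "${CLOUD_INGEST_URL:-}"; then
    echo "CLOUD_INGEST_URL is set: ${CLOUD_INGEST_URL}"
  else
    echo "CLOUD_INGEST_URL is missing or not an http(s) URL" >&2
    return 1
  fi
}
```

Running check_ingest_url in the same environment as the edge process catches the case where .env and the runtime exports disagree.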
Replay DLQ entries back to PENDING (bearer-protected, localhost-only):
curl -X POST \
-H "Authorization: Bearer <edge-token>" \
-H "Content-Type: application/json" \
-d '{"limit": 100}' \
http://127.0.0.1:8787/api/v1/admin/outbox/replay-dlq

Replay a specific DLQ event:
curl -X POST \
-H "Authorization: Bearer <edge-token>" \
-H "Content-Type: application/json" \
-d '{"event_id": 42}' \
http://127.0.0.1:8787/api/v1/admin/outbox/replay-dlq

- Back up first.
- Remove or repair corrupted db.
- Restart edge.
Example reset:
# Keep the backup outside .sentinelid, since the next command removes that directory.
cp apps/edge/.sentinelid/audit.db apps/edge/audit.db.backup
rm -rf apps/edge/.sentinelid

- Check disk free space.
- Prune old SENT rows if retention policy allows.
- Do not delete DLQ rows before capturing last_error and payload context.
- Prefer replay after fixing connectivity or schema mismatch root causes.
- Keep cloud/admin token and URL configuration consistent across .env and runtime exports.
- Keep bcrypt hashes in .env single-quoted or use ADMIN_UI_PASSWORD_HASH_B64 to avoid compose interpolation on $.
- Validate outage recovery end-to-end with:
make smoke-cloud-recovery

- Privacy controls: docs/privacy.md
- Threat model: docs/threat-model.md
- Key lifecycle: docs/KEY_MANAGEMENT.md
Collect a sanitized support artifact for incident triage:
EDGE_AUTH_TOKEN="<edge-token>" ADMIN_API_TOKEN="<admin-token>" make support-bundle

Output:
scripts/support/out/support_bundle_<timestamp>.tar.gz
Bundle contents intentionally exclude raw biometric payloads, tokens, signatures, frames, and embeddings.
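Before sharing a bundle outside the team, a quick spot check for a known secret can add some confidence. This is a sketch only: bundle_contains is a hypothetical helper, the bundle's internal layout is not assumed, and a clean result does not prove the bundle is free of sensitive data; it only shows that one specific string is absent.

```shell
#!/bin/sh
# Hypothetical helper: succeed if any file inside a .tar.gz archive contains
# the given fixed string. File contents are streamed to stdout; nothing is
# extracted to disk.
bundle_contains() {
  archive="$1"
  needle="$2"
  tar -xzOf "$archive" | grep -q -F -- "$needle"
}

# Example use with the token exported for the support-bundle run:
#   bundle_contains scripts/support/out/support_bundle_<timestamp>.tar.gz \
#     "$EDGE_AUTH_TOKEN" && echo "WARNING: edge token found in bundle"
```

The search uses a fixed string (grep -F), so token characters such as dots or plus signs are not interpreted as regular-expression syntax.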