Telemetry Recovery (v2.6.0)

For environment setup and service startup commands, use RUNBOOK.md.

For a full operator pass that includes outage/recovery, use docs/DEMO_CHECKLIST.md.

In EDGE_ENV=prod, keychain-backed device/master-key storage is required unless ALLOW_KEYCHAIN_FALLBACK=1 is set for controlled debugging.

Scope

This guide covers recovery of telemetry delivery using the edge outbox and DLQ model.

Outbox States

  • PENDING: queued for delivery.
  • SENT: delivered successfully.
  • DLQ: retries exhausted, requires operator action.

Diagnostics

API-level diagnostics

curl -H "Authorization: Bearer <edge-token>" http://127.0.0.1:8787/api/v1/diagnostics

Inspect pending_count, dlq_count, sent_count, and recent error metadata.

When make release-check fails, review output/ci/invariant_report.json first. It highlights auth and support-bundle regressions separately from transport/outbox failures, which shortens triage before diving into compose or edge logs.

Primary reliability fields in diagnostics:

  • outbox_pending_count
  • dlq_count
  • last_attempt
  • last_success
  • last_error_summary
  • telemetry_flags
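As a rough illustration of how the counters above can drive triage, the sketch below maps them to a coarse status. The helper names (`fetch_counts`, `classify_outbox`), the status strings, and the `EDGE_URL` variable are hypothetical. It also assumes that `jq` is installed and that `outbox_pending_count` and `dlq_count` are top-level fields of the diagnostics JSON, which this guide does not confirm.

```shell
#!/usr/bin/env bash
# Sketch only: the JSON layout, EDGE_URL, and the status names are assumptions.

EDGE_URL="${EDGE_URL:-http://127.0.0.1:8787}"

# Fetch diagnostics and print "<pending> <dlq>" on one line.
# Assumes jq is installed and both counters are top-level JSON fields.
fetch_counts() {
  curl -fsS -H "Authorization: Bearer ${EDGE_AUTH_TOKEN}" \
    "${EDGE_URL}/api/v1/diagnostics" |
    jq -r '"\(.outbox_pending_count // 0) \(.dlq_count // 0)"'
}

# Map the two counters to a coarse status string.
classify_outbox() {
  local pending="$1" dlq="$2"
  if [ "$dlq" -gt 0 ]; then
    echo "ACTION_REQUIRED"   # DLQ rows need operator attention
  elif [ "$pending" -gt 0 ]; then
    echo "DRAINING"          # events are queued and still retrying
  else
    echo "HEALTHY"
  fi
}

# Usage (requires a running edge and a valid token):
#   read -r pending dlq < <(fetch_counts) && classify_outbox "$pending" "$dlq"
```

Keeping the classification separate from the HTTP call makes it easy to adjust once the real response shape is known.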

Database-level diagnostics

sqlite3 apps/edge/.sentinelid/audit.db

Useful queries:

-- Most recent 25 outbox rows, newest first.
SELECT id, status, attempts, created_at, last_error
FROM outbox_events
ORDER BY id DESC
LIMIT 25;

-- Row count per outbox state.
SELECT status, COUNT(*) FROM outbox_events GROUP BY status;
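When the DLQ is growing, grouping its rows by error often shows whether there is one root cause or several. The sketch below wraps such a query in a hypothetical helper, `dlq_error_summary`. It assumes the status column stores the literal string `DLQ`, as the state names above suggest, and it opens the database read-only so the check cannot modify the outbox.

```shell
#!/usr/bin/env bash
# Sketch only: assumes status values are stored as the literal state names.

# Print "<count>|<last_error>" for DLQ rows, most frequent error first.
dlq_error_summary() {
  local db="${1:-apps/edge/.sentinelid/audit.db}"
  sqlite3 -readonly "$db" "
    SELECT COUNT(*) AS n, last_error
    FROM outbox_events
    WHERE status = 'DLQ'
    GROUP BY last_error
    ORDER BY n DESC, last_error;
  "
}

# Usage:
#   dlq_error_summary apps/edge/.sentinelid/audit.db
```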

Recovery Procedures

Cloud unavailable

  • Keep edge running.
  • Restore cloud service.
  • Pending events retry automatically.
  • In demo mode, validate this with:
make smoke-cloud-recovery
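Outside demo mode, one way to confirm that pending events drain after the cloud service returns is to poll the diagnostics endpoint. This is a minimal sketch, not the project's own tooling. `wait_for_drain` and `get_pending_count` are hypothetical names, and the sketch assumes `jq` is installed and that `outbox_pending_count` is a top-level field of the diagnostics response.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the JSON layout are assumptions.

# Print the current pending counter from the diagnostics endpoint.
get_pending_count() {
  curl -fsS -H "Authorization: Bearer ${EDGE_AUTH_TOKEN}" \
    "http://127.0.0.1:8787/api/v1/diagnostics" |
    jq -r '.outbox_pending_count'
}

# wait_for_drain <max_tries> <delay_seconds> [count_command]
# Returns 0 once the counter reads 0, or 1 if it never does.
wait_for_drain() {
  local max_tries="$1" delay="$2" count_cmd="${3:-get_pending_count}"
  local try=1 count
  while [ "$try" -le "$max_tries" ]; do
    # A failed fetch is treated as "unknown" and retried.
    count="$("$count_cmd")" || count=""
    if [ "$count" = "0" ]; then
      echo "outbox drained after ${try} check(s)"
      return 0
    fi
    echo "check ${try}/${max_tries}: pending=${count:-unknown}" >&2
    try=$((try + 1))
    if [ "$try" -le "$max_tries" ]; then
      sleep "$delay"
    fi
  done
  echo "outbox still pending after ${max_tries} checks" >&2
  return 1
}

# Usage: check every 10 seconds for up to 5 minutes.
#   wait_for_drain 30 10
```

The optional third argument lets a different counter source be substituted, for example a direct database query.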

DLQ growth after transient outage

  • Validate cloud health endpoint.
  • Confirm ingest URL configured on edge (CLOUD_INGEST_URL).
  • Replay or reset failed entries after root-cause resolution.

Replay DLQ entries back to PENDING (bearer-protected, localhost-only):

curl -X POST \
  -H "Authorization: Bearer <edge-token>" \
  -H "Content-Type: application/json" \
  -d '{"limit": 100}' \
  http://127.0.0.1:8787/api/v1/admin/outbox/replay-dlq

Replay a specific DLQ event:

curl -X POST \
  -H "Authorization: Bearer <edge-token>" \
  -H "Content-Type: application/json" \
  -d '{"event_id": 42}' \
  http://127.0.0.1:8787/api/v1/admin/outbox/replay-dlq

Local database corruption

  1. Back up the database first.
  2. Remove or repair the corrupted database.
  3. Restart edge.

Example reset:

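# Save the backup outside .sentinelid so the reset below does not delete it.
cp apps/edge/.sentinelid/audit.db apps/edge/audit.db.backup
# This removes the whole .sentinelid directory, not only audit.db.
rm -rf apps/edge/.sentinelid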
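Before choosing between repair and reset, it can help to ask SQLite whether the file is actually damaged. The sketch below uses the standard `PRAGMA integrity_check` and writes a timestamped copy to a directory of your choice. `check_db` and `backup_db` are hypothetical helper names. The copy is only reliable if edge is stopped, because copying a database that is being written to can produce an inconsistent file.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are illustrative. Stop edge before copying.

# Print "ok" for a healthy database; any other output indicates a problem.
check_db() {
  sqlite3 "$1" "PRAGMA integrity_check;" 2>&1
}

# Copy the database to a timestamped file in dest_dir and print the new path.
# Choose a dest_dir outside the state directory so a reset cannot remove it.
backup_db() {
  local db="$1" dest_dir="$2"
  local dest="${dest_dir}/audit.db.$(date +%Y%m%dT%H%M%S).backup"
  cp -p "$db" "$dest" && echo "$dest"
}

# Usage:
#   backup_db apps/edge/.sentinelid/audit.db "$HOME/edge-backups"
#   check_db apps/edge/.sentinelid/audit.db
```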

Storage pressure

  • Check disk free space.
  • Prune old SENT rows if retention policy allows.
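A pruning pass could look like the sketch below, which keeps the newest N SENT rows by id and leaves PENDING and DLQ rows untouched. It is an assumption that deleting rows directly in SQLite is acceptable for this database, so confirm that against your retention policy and take a backup first. `prune_sent` is a hypothetical helper. It orders by `id` rather than `created_at` because the timestamp format is not documented here.

```shell
#!/usr/bin/env bash
# Sketch only: confirm retention policy and back up before deleting rows.
# Check free space first, for example: df -h apps/edge/.sentinelid

# prune_sent <db> <keep>: keep the newest <keep> SENT rows, delete older ones,
# reclaim disk space, then print the remaining row count per status.
prune_sent() {
  local db="$1" keep="$2"
  case "$keep" in
    ''|*[!0-9]*)
      echo "keep must be a non-negative integer" >&2
      return 2
      ;;
  esac
  sqlite3 "$db" "
    DELETE FROM outbox_events
    WHERE status = 'SENT'
      AND id NOT IN (
        SELECT id FROM outbox_events
        WHERE status = 'SENT'
        ORDER BY id DESC
        LIMIT ${keep}
      );
    VACUUM;
    SELECT status, COUNT(*) FROM outbox_events GROUP BY status ORDER BY status;
  "
}

# Usage: keep the 10000 most recent SENT rows.
#   prune_sent apps/edge/.sentinelid/audit.db 10000
```

`VACUUM` is what returns the freed pages to the filesystem. It needs temporary disk space of its own, so on a nearly full disk it may have to wait until some space is freed elsewhere.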

Operational Guardrails

  • Do not delete DLQ rows before capturing last_error and payload context.
  • Prefer replay after fixing connectivity or schema mismatch root causes.
  • Keep cloud/admin token and URL configuration consistent across .env and runtime exports.
  • Keep bcrypt hashes in .env single-quoted, or use ADMIN_UI_PASSWORD_HASH_B64, so that compose does not interpolate the $ characters in the hash.
  • Validate outage recovery end-to-end with:
make smoke-cloud-recovery
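For the bcrypt guardrail above, a base64-encoded value sidesteps the `$` problem because base64 output never contains that character. The sketch below assumes ADMIN_UI_PASSWORD_HASH_B64 expects standard base64 of the hash string with no trailing newline, which this guide does not spell out. `encode_hash_b64` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch only: assumes standard base64 of the exact hash string is expected.

# Encode a hash as single-line base64. printf avoids encoding a trailing
# newline, and tr removes any line wrapping added by base64.
encode_hash_b64() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# Usage (single quotes stop the shell from expanding $ in the hash):
#   echo "ADMIN_UI_PASSWORD_HASH_B64=$(encode_hash_b64 '<bcrypt-hash>')" >> .env
```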

Related Docs

  • Privacy controls: docs/privacy.md
  • Threat model: docs/threat-model.md
  • Key lifecycle: docs/KEY_MANAGEMENT.md

Support Bundle

Collect a sanitized support artifact for incident triage:

EDGE_AUTH_TOKEN="<edge-token>" ADMIN_API_TOKEN="<admin-token>" make support-bundle

Output:

  • scripts/support/out/support_bundle_<timestamp>.tar.gz

Bundle contents intentionally exclude raw biometric payloads, tokens, signatures, frames, and embeddings.