feat(backend): scheduled reconciliation, manifest persistence, resuma…#867

Merged
Junirezz merged 1 commit into Junirezz:main from king-aj-the-first:feat/breach
Jun 27, 2026


@king-aj-the-first

Contributor

…ble backfill, and SLO metrics

Summary

This PR introduces four backend observability and durability improvements: proactive ledger drift detection, durable export manifests, resumable transaction backfill jobs, and Prometheus-based SLO breach metrics.


Task 1 — Scheduled Reconciliation Drift Detection

Refactored reconciliation logic for reuse and integrated automated drift detection via scheduled jobs.

Changes

  • Refactor

    • reconciliationReport.ts now exports:

      • runReconciliationReport()

      • reconcile()

      • fetchers and types

  • Scheduler Integration

    • Added:

      • runLedgerReconciliationJob()

      • startLedgerReconciliationScheduler()

    • Registered under existing report generation policy with retry/backoff via jobGovernance.ts

  • Drift Handling (status === 'DRIFT_DETECTED')

    • Emits structured logs

    • Increments:

      • reconciliation_drift_total{issue}

    • Updates gauges:

      • reconciliation_status

      • reconciliation_last_run_timestamp

    • Sends webhook alerts when drift exceeds threshold:

      • Controlled by:

        • RECONCILIATION_DRIFT_ALERT_THRESHOLD

        • RECONCILIATION_ALERT_COOLDOWN_MS

  • Persistence

    • Stores snapshots in ReconciliationSnapshot (Prisma)

    • Keeps latest automated summary in memory for diagnostics

  • New Endpoint

    • GET /admin/reconciliation/latest

      • Returns last automated summary without re-querying Horizon

  • Diagnostics

    • Bundle now includes lastReconciliation

  • Environment Variables

    • LEDGER_RECONCILIATION_ENABLED

    • LEDGER_RECONCILIATION_INTERVAL_MS

    • RECONCILIATION_WINDOW_HOURS

    • RECONCILIATION_DRIFT_ALERT_THRESHOLD

    • RECONCILIATION_ALERT_COOLDOWN_MS

    • RECONCILIATION_ALERT_WEBHOOK_URL
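
The threshold and cooldown gating described under Drift Handling could look roughly like the sketch below. It is an illustration only: `DriftSummary`, `driftCount`, `AlertConfig`, and `DriftAlertGate` are hypothetical names, not taken from reconciliationReport.ts, and the two config fields merely stand in for RECONCILIATION_DRIFT_ALERT_THRESHOLD and RECONCILIATION_ALERT_COOLDOWN_MS.

```typescript
// Hypothetical shapes: the PR does not show the real reconciliation types.
interface DriftSummary {
  status: "OK" | "DRIFT_DETECTED";
  driftCount: number; // assumed: number of mismatched records in the window
}

interface AlertConfig {
  threshold: number; // stands in for RECONCILIATION_DRIFT_ALERT_THRESHOLD
  cooldownMs: number; // stands in for RECONCILIATION_ALERT_COOLDOWN_MS
}

class DriftAlertGate {
  private readonly config: AlertConfig;
  private lastAlertAt: number | null = null;

  constructor(config: AlertConfig) {
    this.config = config;
  }

  // Returns true when a webhook alert should be sent for this run.
  shouldAlert(summary: DriftSummary, nowMs: number): boolean {
    if (summary.status !== "DRIFT_DETECTED") return false;
    // Alert only when drift strictly exceeds the threshold.
    if (summary.driftCount <= this.config.threshold) return false;
    // Suppress repeats that fall inside the cooldown window.
    if (
      this.lastAlertAt !== null &&
      nowMs - this.lastAlertAt < this.config.cooldownMs
    ) {
      return false;
    }
    this.lastAlertAt = nowMs;
    return true;
  }
}
```

Passing the clock in as `nowMs` keeps the gate deterministic in tests; a scheduler would typically pass `Date.now()`.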


Task 2 — Export Manifest Persistence & Verification

Introduced durable export manifests with verification and retention controls.

Changes

  • Added ExportManifest Prisma model + migration

  • exportManifest.ts:

    • Persists manifests to Prisma

    • Supports memory fallback:

      • EXPORT_MANIFEST_STORAGE=memory

    • Adds:

      • Paginated listing

      • Checksum verification

      • Retention pruning (EXPORT_MANIFEST_RETENTION, default: 500)

  • Integrated manifest creation into:

    • bulkExportJobs.ts (on job completion)

  • New Endpoint

    • POST /admin/reports/exports/manifests/:id/verify

      • Returns match/mismatch without exposing raw data

  • Updated list endpoint:

    • Offset pagination

    • Total count support
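
As a rough illustration of checksum verification and retention pruning, the sketch below hashes an export payload with SHA-256 and reports only match or mismatch. The hash algorithm, the `ExportManifestRecord` shape, and all function names are assumptions made for this example; the PR does not state how exportManifest.ts computes or stores checksums.

```typescript
import { createHash } from "node:crypto";

// Hypothetical record shape; the real ExportManifest model is not shown in the PR.
interface ExportManifestRecord {
  id: string;
  checksum: string; // hex digest of the exported payload
  createdAt: number; // epoch milliseconds
}

// SHA-256 is an assumption; the PR only says "checksum".
function sha256Hex(payload: string): string {
  return createHash("sha256").update(payload, "utf8").digest("hex");
}

function createManifest(
  id: string,
  payload: string,
  createdAt: number
): ExportManifestRecord {
  return { id, checksum: sha256Hex(payload), createdAt };
}

// Reports match/mismatch only, so the raw export data never leaves the caller.
function verifyManifest(
  manifest: ExportManifestRecord,
  payload: string
): { id: string; result: "match" | "mismatch" } {
  const result = sha256Hex(payload) === manifest.checksum ? "match" : "mismatch";
  return { id: manifest.id, result };
}

// Keeps the newest `retention` manifests (EXPORT_MANIFEST_RETENTION defaults to 500).
function pruneToRetention(
  manifests: ExportManifestRecord[],
  retention: number
): ExportManifestRecord[] {
  return [...manifests]
    .sort((a, b) => b.createdAt - a.createdAt)
    .slice(0, retention);
}
```

A verify handler built this way would recompute the digest from the stored export and return only the `result` field.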


Task 3 — Resumable Transaction Backfill Jobs

Enabled durable, restart-safe backfill processing.

Changes

  • Added TransactionBackfillJob Prisma model + migration

  • transactionBackfill.ts:

    • Persists:

      • Job metadata

      • Checkpoints (lastProcessedLedger)

    • Jobs hydrate from DB on startup

    • Running jobs resume after restart

  • Behavior

    • Dry-run jobs:

      • Persisted

      • Do NOT mutate ProcessedEvent rows

    • Completed/failed jobs:

      • Prunable via:

        • pruneOldBackfillJobs()

        • BACKFILL_JOB_RETENTION_DAYS (default: 30)

  • Existing endpoints:

    • POST /admin/transactions/backfill

    • GET /admin/transactions/backfill

    • Now return durable job state
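
The checkpoint-and-resume behavior can be sketched as follows. This is a simplified model under stated assumptions: `JobStore` is an in-memory stand-in for the Prisma-backed TransactionBackfillJob table, the job fields other than `lastProcessedLedger` are hypothetical, and `maxLedgers` exists only to simulate a process stopping partway through.

```typescript
// Hypothetical job shape; only lastProcessedLedger is named in the PR.
interface BackfillJob {
  id: string;
  status: "running" | "completed" | "failed";
  dryRun: boolean;
  startLedger: number;
  endLedger: number;
  lastProcessedLedger: number | null;
}

// In-memory stand-in for the durable store.
class JobStore {
  private jobs = new Map<string, BackfillJob>();

  save(job: BackfillJob): void {
    this.jobs.set(job.id, { ...job });
  }

  load(id: string): BackfillJob | undefined {
    const job = this.jobs.get(id);
    return job ? { ...job } : undefined;
  }
}

// Processes ledgers from the last checkpoint; maxLedgers simulates an interruption.
function runBackfill(
  store: JobStore,
  jobId: string,
  processLedger: (ledger: number) => void,
  maxLedgers: number = Number.POSITIVE_INFINITY
): BackfillJob {
  const job = store.load(jobId);
  if (!job) throw new Error(`unknown backfill job: ${jobId}`);
  if (job.status !== "running") return job;

  let next =
    job.lastProcessedLedger === null ? job.startLedger : job.lastProcessedLedger + 1;
  let processed = 0;
  while (next <= job.endLedger && processed < maxLedgers) {
    // Dry-run jobs advance the checkpoint but never touch event rows.
    if (!job.dryRun) processLedger(next);
    job.lastProcessedLedger = next;
    store.save(job); // persist the checkpoint after every ledger
    next += 1;
    processed += 1;
  }
  if (next > job.endLedger) {
    job.status = "completed";
    store.save(job);
  }
  return job;
}
```

Checkpointing after every ledger is the simplest choice for a sketch; a real job might batch checkpoints to reduce write load, at the cost of reprocessing a few ledgers after a restart.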


Task 4 — Endpoint SLA Breach Prometheus Metrics

Added SLO monitoring metrics aligned with ENDPOINT_SLA_REGISTRY.

Metrics

  • backend_slo_breach_total{path,tier,type}

    • Counter incremented on alert dispatch (cooldown-aware)

  • backend_slo_p95_latency_ms{path,tier,type}

    • Rolling P95 latency

  • backend_slo_budget_ms{path,tier,type}

    • Configured latency budget

  • backend_slo_breach{path,tier,type}

    • Breach status gauge (0/1)

Integration

  • latencyMonitoringService.syncSloMetrics() runs on each /metrics scrape

  • Critical endpoints:

    • /health

    • /ready

    • Tagged with tier="critical"

  • Documentation updated:

    • docs/MONITORING_OBSERVABILITY.md (§1.6, §1.7)
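
To make the relationship between the four metrics concrete, here is a dependency-free sketch of a rolling P95, a 0/1 breach gauge, and a cooldown-aware breach counter. It does not reproduce latencyMonitoringService: `SloTracker`, the nearest-rank percentile method, the fixed-size sample window, and counting inside `snapshot()` are all assumptions made for illustration.

```typescript
// Nearest-rank P95 over a sample window (the percentile method is an assumption).
function p95(samples: number[]): number {
  if (samples.length === 0) return 0;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((95 * sorted.length) / 100);
  return sorted[rank - 1];
}

class SloTracker {
  breachTotal = 0; // mirrors backend_slo_breach_total
  private samples: number[] = [];
  private lastCountedAt: number | null = null;
  private readonly budgetMs: number;
  private readonly windowSize: number;
  private readonly cooldownMs: number;

  constructor(budgetMs: number, windowSize: number, cooldownMs: number) {
    this.budgetMs = budgetMs;
    this.windowSize = windowSize;
    this.cooldownMs = cooldownMs;
  }

  // Records one request latency, keeping only the newest windowSize samples.
  record(latencyMs: number): void {
    this.samples.push(latencyMs);
    if (this.samples.length > this.windowSize) this.samples.shift();
  }

  // Computes the gauge values; the counter moves at most once per cooldown window.
  snapshot(nowMs: number): { p95Ms: number; budgetMs: number; breach: 0 | 1 } {
    const p95Ms = p95(this.samples);
    const breach = p95Ms > this.budgetMs ? 1 : 0;
    const cooledDown =
      this.lastCountedAt === null || nowMs - this.lastCountedAt >= this.cooldownMs;
    if (breach === 1 && cooledDown) {
      this.breachTotal += 1;
      this.lastCountedAt = nowMs;
    }
    return { p95Ms, budgetMs: this.budgetMs, breach };
  }
}
```

In this model the gauge can stay at 1 across many scrapes while the counter increases only once per cooldown window, which is the distinction the metric descriptions above draw.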


Schema Migration

  • 20260627120000_add_manifest_reconciliation_backfill

    • Adds:

      • ExportManifest

      • ReconciliationSnapshot

      • TransactionBackfillJob


Tests Added

| File | Coverage |
| -- | -- |
| ledgerReconciliationJob.test.ts | Clean vs drift scenarios, metrics validation |
| exportManifest.test.ts | Creation, verification, pagination, retention |
| transactionBackfill.persistence.test.ts | Start, dry-run, restart, failure handling |
| sloMetrics.test.ts | Breach gauge + cooldown-aware counter |

Test Plan

  • Run migrations:

    npx prisma migrate deploy
    
  • Verify metrics:

    • GET /metrics exposes:

      • reconciliation_*

      • backend_slo_*

  • Reconciliation:

    • Set LEDGER_RECONCILIATION_ENABLED=true

    • Confirm:

      • GET /admin/reconciliation/latest returns summary after scheduler run

  • Export manifests:

    • Create export:

      • POST /admin/reports/exports

    • Restart service

    • Verify:

      • Manifest persists

      • Checksum via:

        • POST /admin/reports/exports/manifests/:id/verify

  • Backfill jobs:

    • Start backfill

    • Restart service

    • Confirm resume from lastProcessedLedger

  • SLO metrics:

    • Induce latency above the configured budget on /health

    • Confirm:

      • backend_slo_breach == 1

      • Counter increments once per cooldown window
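
The "Verify metrics" step can be partly automated by scanning the /metrics response body for the expected metric name prefixes. The helper below is a minimal sketch that parses Prometheus text exposition lines; `metricNames` and `missingPrefixes` are hypothetical names, and fetching the body from the service is left to the caller.

```typescript
// Collects metric names from Prometheus text exposition, skipping comments.
function metricNames(exposition: string): Set<string> {
  const names = new Set<string>();
  for (const line of exposition.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "" || trimmed.startsWith("#")) continue;
    // A sample line starts with the metric name, followed by labels or a value.
    const match = /^[a-zA-Z_:][a-zA-Z0-9_:]*/.exec(trimmed);
    if (match) names.add(match[0]);
  }
  return names;
}

// Returns the prefixes for which no metric sample was found.
function missingPrefixes(exposition: string, prefixes: string[]): string[] {
  const names = Array.from(metricNames(exposition));
  return prefixes.filter(
    (prefix) => !names.some((name) => name.startsWith(prefix))
  );
}
```

Calling `missingPrefixes(body, ["reconciliation_", "backend_slo_"])` on a scrape should return an empty array once both metric families are exposed.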


Overall Impact

  • Adds proactive ledger drift detection and alerting

  • Introduces durable export verification layer

  • Enables fault-tolerant, resumable backfill processing

  • Provides production-grade SLO observability via Prometheus

  • Strengthens reliability, monitoring, and operational visibility


closes #861
closes #863
closes #864
closes #865

@drips-wave

drips-wave Bot commented Jun 27, 2026


@king-aj-the-first Great news! 🎉 Based on an automated assessment of this PR, the linked Wave issue(s) no longer count against your application limits.

You can now apply to more issues while waiting for a review of this PR. Keep up the great work! 🚀

Learn more about application limits

@Junirezz Junirezz merged commit 54050f4 into Junirezz:main Jun 27, 2026
10 of 14 checks passed
