Add retry with exponential backoff for transient failures in external service calls #886

Description

@RUKAYAT-CODER

Overview

External service calls (email provider, payment gateway, CDN invalidation) currently fail on the first transient error, such as a brief network interruption or an HTTP 503. Because there is no retry logic, a one-second network hiccup causes a user-visible payment or email failure that requires manual intervention.

Specifications

Features:

  • Retry transient failures (5xx, network errors) with exponential backoff and jitter.
  • Do not retry client errors (4xx), since they are not transient.
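
As a rough illustration of the two features above, the sketch below computes a full-jitter delay and classifies a failure as transient or not. The names (BackoffOptions, backoffCeilingMs, fullJitterDelayMs, isTransient) are hypothetical and not taken from the codebase. The default values mirror the numbers proposed in the Tasks list, and treating a missing HTTP status as a network error is an assumption.

```typescript
// Hypothetical option names; the defaults mirror the values proposed in this issue.
interface BackoffOptions {
  baseDelayMs: number; // delay ceiling for the first retry
  multiplier: number;  // growth factor per retry
  maxDelayMs: number;  // hard cap on the ceiling
}

const DEFAULT_BACKOFF: BackoffOptions = {
  baseDelayMs: 1000,
  multiplier: 2,
  maxDelayMs: 30000,
};

// Upper bound of the wait before retry number `retry` (1-based), before jitter.
function backoffCeilingMs(retry: number, opts: BackoffOptions = DEFAULT_BACKOFF): number {
  const exponential = opts.baseDelayMs * Math.pow(opts.multiplier, retry - 1);
  return Math.min(opts.maxDelayMs, exponential);
}

// Full jitter: pick uniformly from [0, ceiling). `random` is injectable for tests.
function fullJitterDelayMs(
  retry: number,
  opts: BackoffOptions = DEFAULT_BACKOFF,
  random: () => number = Math.random,
): number {
  return Math.floor(random() * backoffCeilingMs(retry, opts));
}

// Assumption: `undefined` means the call failed before any HTTP response arrived.
// Network errors and 5xx are treated as transient; everything else (including 4xx) is not.
function isTransient(status: number | undefined): boolean {
  if (status === undefined) return true;
  return status >= 500 && status <= 599;
}
```

With these defaults the ceilings for retries 1, 2, and 3 are 1s, 2s, and 4s, so the 30s cap only matters if the retry limit is raised later.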

Tasks:

  • Create a RetryPolicy utility using cockatiel or a custom implementation with max 3 retries, 1s base delay, 2x multiplier, 30s max delay, and full jitter.
  • Apply RetryPolicy to EmailService, PaymentProviderService, and CdnService.
  • Add a Prometheus counter external_call_retry_total{service, attempt}.
  • Add unit tests for retry behavior with mocked transient failures.
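
One possible shape for the custom-implementation option is sketched below. It is not the cockatiel API, and every name in it (ExternalCallError, RetryOptions, withRetry, onRetry) is hypothetical. It assumes "max 3 retries" means three retries after the initial attempt, so up to four calls in total. If three attempts in total is intended, maxRetries would be 2. It also assumes that only errors of the hypothetical ExternalCallError type are classified, and that anything else is rethrown without retrying.

```typescript
// Hypothetical error type: `status` is set for HTTP failures and absent for network errors.
class ExternalCallError extends Error {
  readonly status?: number;
  constructor(message: string, status?: number) {
    super(message);
    this.name = "ExternalCallError";
    this.status = status;
  }
}

interface RetryOptions {
  maxRetries: number;
  baseDelayMs: number;
  multiplier: number;
  maxDelayMs: number;
  sleep: (ms: number) => Promise<void>; // injectable so tests do not wait
  random: () => number;                 // injectable so jitter is deterministic in tests
  onRetry: (attempt: number, delayMs: number, error: unknown) => void; // metrics hook
}

const defaultOptions: RetryOptions = {
  maxRetries: 3,
  baseDelayMs: 1000,
  multiplier: 2,
  maxDelayMs: 30000,
  sleep: (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
  random: Math.random,
  onRetry: () => {},
};

// Transient: network errors (no status) and 5xx. Unknown error types are not retried.
function isTransientError(error: unknown): boolean {
  if (!(error instanceof Error) || error.name !== "ExternalCallError") return false;
  const status = (error as ExternalCallError).status;
  return status === undefined || (status >= 500 && status <= 599);
}

async function withRetry<T>(
  fn: () => Promise<T>,
  overrides: Partial<RetryOptions> = {},
): Promise<T> {
  const opts: RetryOptions = { ...defaultOptions, ...overrides };
  for (let retriesUsed = 0; ; retriesUsed++) {
    try {
      return await fn();
    } catch (error) {
      // Give up on non-transient errors or once the retry budget is spent.
      if (!isTransientError(error) || retriesUsed >= opts.maxRetries) throw error;
      const ceiling = Math.min(
        opts.maxDelayMs,
        opts.baseDelayMs * Math.pow(opts.multiplier, retriesUsed),
      );
      const delayMs = Math.floor(opts.random() * ceiling); // full jitter
      opts.onRetry(retriesUsed + 1, delayMs, error);
      await opts.sleep(delayMs);
    }
  }
}
```

A service would wrap its outbound call, for example `withRetry(() => provider.send(message), { onRetry: ... })`, where `provider.send` stands in for whatever client method the service already uses. Injecting `sleep` and `random` lets unit tests mock transient failures without real delays.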

Impacted Files:

  • New src/common/utils/retry-policy.ts
  • src/notifications/email/, src/payments/providers/, src/cdn/

Acceptance Criteria

  • A single transient 503 from the email provider is retried transparently.
  • After the initial attempt and all 3 retries have failed, the error is propagated to the caller.
  • Prometheus counter shows retry counts per service.
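
To make the last criterion concrete, the sketch below shows how retry counts could be keyed by the service and attempt labels. It is a dependency-free, in-memory stand-in written for illustration, not the API of a Prometheus client library. The class name LabelledCounter and its methods are hypothetical. A real implementation would presumably register external_call_retry_total with whatever metrics client the project already uses.

```typescript
// In-memory stand-in for a labelled counter; hypothetical, for illustration only.
class LabelledCounter {
  readonly name: string;
  // service -> attempt -> count
  private readonly counts = new Map<string, Map<number, number>>();

  constructor(name: string) {
    this.name = name;
  }

  // Record one retry for the given service and attempt number.
  inc(labels: { service: string; attempt: number }): void {
    let perAttempt = this.counts.get(labels.service);
    if (perAttempt === undefined) {
      perAttempt = new Map<number, number>();
      this.counts.set(labels.service, perAttempt);
    }
    perAttempt.set(labels.attempt, (perAttempt.get(labels.attempt) ?? 0) + 1);
  }

  // Total retries for one service, summed over all attempt values.
  totalFor(service: string): number {
    let total = 0;
    const perAttempt = this.counts.get(service);
    if (perAttempt !== undefined) {
      perAttempt.forEach((count) => {
        total += count;
      });
    }
    return total;
  }

  // Render samples in the style of the Prometheus text exposition format.
  render(): string[] {
    const lines: string[] = [];
    this.counts.forEach((perAttempt, service) => {
      perAttempt.forEach((count, attempt) => {
        lines.push(`${this.name}{service="${service}",attempt="${attempt}"} ${count}`);
      });
    });
    return lines;
  }
}
```

Calling `inc({ service: "email", attempt: 1 })` from the retry hook of each service would give per-service counts. Because the retry limit is small, the attempt label takes only a few values, which keeps label cardinality bounded.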
