ZipCase Alerting System

ZipCase includes a centralized alerting system that monitors errors, provides observability, and sends notifications about critical issues. This document explains how the alerting system works and how to use it in your code.

Overview

The alerting system consists of:

AlertService: A TypeScript module for standardized error logging
CloudWatch Metrics: For tracking error rates and triggering alarms
CloudWatch Alarms: For automated monitoring of error thresholds
SNS Notifications: For delivering alerts via email

Key Features

Severity Levels: INFO, WARNING, ERROR, and CRITICAL
Error Categories: Authentication, Database, Network, Portal, Queue, System
Automatic Deduplication: Similar errors are grouped to prevent alert fatigue
Configurable Thresholds: Different thresholds for different severity levels
Contextual Metadata: Errors carry relevant context such as userId and caseNumber
Email Notifications: Critical errors trigger immediate email alerts

Using the AlertService

The AlertService gives every part of the codebase one consistent way to report errors. The examples below show how to call it from your code:

Basic Usage

import AlertService, { Severity, AlertCategory } from './AlertService';

// Log a simple error
await AlertService.logError(
  Severity.ERROR,
  AlertCategory.DATABASE,
  'Failed to save case data',
  error, // Optional Error object
  { caseNumber: '22CR123456' } // Optional context
);

Category-Specific Loggers

For cleaner code, you can create a logger for a specific category:

// Create a logger for portal-related issues
const portalLogger = AlertService.forCategory(AlertCategory.PORTAL);

// Use the scoped logger
await portalLogger.error('Failed to connect to portal', error);
await portalLogger.info('Successfully retrieved case data');

Severity Guidelines

INFO: Informational messages that don't indicate problems
WARNING: Non-critical issues that might need attention
ERROR: Problems that affect functionality but aren't system failures
CRITICAL: Severe problems that require immediate attention

Context Object

The context object helps provide relevant information about the error:

{
  userId?: string;      // The affected user
  caseNumber?: string;  // The case number involved
  searchId?: string;    // ID of the search operation
  resource?: string;    // The resource/component having issues
  operationId?: string; // A unique ID for the operation
  metadata?: object;    // Any additional data
}
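
As a rough illustration of how a context object might be built up over the life of an operation, the sketch below merges request-level and call-level fields and makes sure an operationId is always present. The `AlertContext` name, the `withOperation` helper, and the ID format are assumptions made for this example; they are not part of the AlertService API.

```typescript
// Hypothetical name for the context shape listed above.
interface AlertContext {
  userId?: string;
  caseNumber?: string;
  searchId?: string;
  resource?: string;
  operationId?: string;
  metadata?: object;
}

// Illustrative ID generator (an assumption); any unique string would do.
function newOperationId(): string {
  return `${Date.now().toString(36)}-${Math.random().toString(36).slice(2, 10)}`;
}

// Merges base context with per-call details and guarantees an operationId,
// so every entry logged during one operation can be correlated.
function withOperation(base: AlertContext, extra: AlertContext = {}): AlertContext {
  return {
    ...base,
    ...extra,
    operationId: extra.operationId ?? base.operationId ?? newOperationId(),
  };
}

// Request-level context is created once, then extended as details become known.
const requestContext = withOperation({ userId: 'user-123', resource: 'processCaseData' });
const caseContext = withOperation(requestContext, { caseNumber: '22CR123456' });
```

An object built this way could be passed as the final argument to `AlertService.logError`, as in the Basic Usage example.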

CloudWatch Alarms

The system includes two types of alarms:

Application-Level Alarms (via AlertService)

AuthenticationErrorAlarm: Triggers when authentication errors exceed normal rates
PortalCriticalErrorAlarm: Monitors for critical portal connectivity issues
SystemErrorAlarm: Alerts on high rates of system errors
DatabaseErrorAlarm: Tracks database connectivity issues

Infrastructure-Level Alarms (AWS Service Metrics)

LambdaErrorsAlarm: Monitors all Lambda function errors
LambdaThrottlesAlarm: Alerts when Lambda functions are being throttled
ApiGateway5xxErrorsAlarm: Detects server errors in API Gateway
CaseProcessingDLQAlarm: Monitors for messages in the Dead Letter Queue
Per-Function Alarms:
- processCaseSearch-Errors: Issues with search queue processing
- processCaseData-Errors: Issues with case data retrieval
- postSearch-Errors: Problems with the search API endpoint

These infrastructure alarms catch issues that originate outside the application code, such as:

Unhandled exceptions that crash Lambda functions
Memory/timeout issues
API Gateway configuration problems
Messages that repeatedly fail processing

Email Notifications

Email alerts are sent via Amazon SNS and include:

Severity level and category
Error message
Timestamp
Context information
Environment information (stage, region, service)
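
To make the list above concrete, here is a minimal sketch of how those fields could be assembled into an email subject and body. The interface names, field names, and layout are assumptions for illustration; the actual format produced by ZipCase may differ.

```typescript
// Hypothetical shapes; not the actual ZipCase notification types.
interface AlertEnvironment {
  stage: string;
  region: string;
  service: string;
}

interface AlertNotification {
  severity: string;
  category: string;
  message: string;
  timestamp: Date;
  context?: Record<string, unknown>;
  environment: AlertEnvironment;
}

// SNS requires subjects to be under 100 characters with no line breaks.
const MAX_SUBJECT_LENGTH = 99;

function formatAlertEmail(alert: AlertNotification): { subject: string; body: string } {
  const subject = `[${alert.severity}] ${alert.category}: ${alert.message}`
    .replace(/[\r\n]+/g, ' ')
    .slice(0, MAX_SUBJECT_LENGTH);

  const lines = [
    `Severity: ${alert.severity}`,
    `Category: ${alert.category}`,
    `Message: ${alert.message}`,
    `Timestamp: ${alert.timestamp.toISOString()}`,
    `Stage: ${alert.environment.stage}`,
    `Region: ${alert.environment.region}`,
    `Service: ${alert.environment.service}`,
  ];

  // Context is optional, so it is appended only when it has entries.
  if (alert.context && Object.keys(alert.context).length > 0) {
    lines.push('Context:', JSON.stringify(alert.context, null, 2));
  }

  return { subject, body: lines.join('\n') };
}
```

The returned `subject` and `body` map naturally onto the `Subject` and `Message` fields of an SNS publish request.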

Configuration

The alerting system requires these SSM parameters:

/zipcase/alert-email: Email address for notifications
/zipcase/alert-topic-arn: SNS topic ARN (created automatically)
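
A startup check can surface a missing parameter early instead of at the first failed notification. The sketch below is one possible approach, not the project's actual loader: `loadAlertConfig`, `ParameterFetcher`, and `AlertConfig` are hypothetical names, and the parameter lookup is injected so the example runs without AWS access.

```typescript
// Hypothetical sketch; only the two parameter names come from this document.
type ParameterFetcher = (name: string) => Promise<string | undefined>;

interface AlertConfig {
  alertEmail: string;
  alertTopicArn: string;
}

const REQUIRED_PARAMETERS = {
  alertEmail: '/zipcase/alert-email',
  alertTopicArn: '/zipcase/alert-topic-arn',
} as const;

// Fetches both parameters and fails with one error naming everything missing.
async function loadAlertConfig(getParameter: ParameterFetcher): Promise<AlertConfig> {
  const [alertEmail, alertTopicArn] = await Promise.all([
    getParameter(REQUIRED_PARAMETERS.alertEmail),
    getParameter(REQUIRED_PARAMETERS.alertTopicArn),
  ]);

  const missing: string[] = [];
  if (!alertEmail) missing.push(REQUIRED_PARAMETERS.alertEmail);
  if (!alertTopicArn) missing.push(REQUIRED_PARAMETERS.alertTopicArn);

  if (!alertEmail || !alertTopicArn) {
    throw new Error(`Missing required SSM parameters: ${missing.join(', ')}`);
  }
  return { alertEmail, alertTopicArn };
}
```

In a Lambda function, `getParameter` could wrap a `GetParameterCommand` sent through an `SSMClient` from `@aws-sdk/client-ssm`, returning `Parameter?.Value` from the response.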

Error Deduplication

To prevent alert fatigue, the system deduplicates similar errors:

Errors are grouped by message pattern and category
Dynamic values like UUIDs and timestamps are normalized
A cached count of similar errors is maintained
Alerts are sent only after thresholds are exceeded or time intervals pass
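
The steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the AlertService implementation: the threshold values, the re-alert interval, the regular expressions, and the `ErrorDeduplicator` class are all hypothetical.

```typescript
type Severity = 'INFO' | 'WARNING' | 'ERROR' | 'CRITICAL';

// Assumed thresholds: occurrences required before an alert is sent.
const ALERT_THRESHOLDS: Record<Severity, number> = {
  INFO: 50,
  WARNING: 10,
  ERROR: 5,
  CRITICAL: 1,
};

// Assumed interval after which a still-recurring error is alerted again.
const REALERT_INTERVAL_MS = 15 * 60 * 1000;

const UUID_PATTERN = /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi;
const ISO_TIMESTAMP_PATTERN =
  /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?/g;

// Replaces dynamic values so messages that differ only in IDs or times match.
function normalizeMessage(message: string): string {
  return message
    .replace(UUID_PATTERN, '<uuid>')
    .replace(ISO_TIMESTAMP_PATTERN, '<timestamp>');
}

interface DedupEntry {
  count: number;              // occurrences since the last alert
  lastAlertAt: number | null; // epoch milliseconds of the last alert, if any
}

class ErrorDeduplicator {
  private readonly entries = new Map<string, DedupEntry>();

  // Records one occurrence and returns true when an alert should be sent.
  shouldAlert(
    severity: Severity,
    category: string,
    message: string,
    now: number = Date.now(),
  ): boolean {
    // Group by category plus normalized message pattern.
    const key = `${category}:${normalizeMessage(message)}`;
    const entry = this.entries.get(key) ?? { count: 0, lastAlertAt: null };
    entry.count += 1;
    this.entries.set(key, entry);

    const thresholdReached = entry.count >= ALERT_THRESHOLDS[severity];
    const intervalPassed =
      entry.lastAlertAt !== null && now - entry.lastAlertAt >= REALERT_INTERVAL_MS;

    if (thresholdReached || intervalPassed) {
      entry.count = 0;
      entry.lastAlertAt = now;
      return true;
    }
    return false;
  }
}
```

With these assumed values, a CRITICAL error alerts on its first occurrence, while an ERROR alerts on the fifth matching occurrence and then again once the threshold is reached a second time or 15 minutes have passed.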

Best Practices

Use Appropriate Severity Levels: Don't mark everything as CRITICAL
Include Relevant Context: Add userId, caseNumber, etc. when available
Be Specific in Messages: Error messages should be descriptive
Log Early, Log Often: Instrument critical code paths
Group Related Errors: Use consistent categories and message patterns

Dashboard

CloudWatch automatically creates dashboards for the metrics. You can view:

Error rates by severity and category
Alarm history and current state
Error trends over time
