ZipCase includes a centralized alerting system that monitors errors, provides observability, and sends notifications about critical issues. This document explains how the alerting system works and how to use it in your code.
The alerting system consists of:
- AlertService: A TypeScript module for standardized error logging
- CloudWatch Metrics: For tracking error rates and triggering alarms
- CloudWatch Alarms: For automated monitoring of error thresholds
- SNS Notifications: For delivering alerts via email
Key features include:
- Severity Levels: INFO, WARNING, ERROR, and CRITICAL
- Error Categories: Authentication, Database, Network, Portal, Queue, System
- Automatic Deduplication: Similar errors are grouped to prevent alert fatigue
- Configurable Thresholds: Different thresholds for different severity levels
- Contextual Metadata: Errors include relevant context like userId, caseNumber, etc.
- Email Notifications: Critical errors trigger immediate email alerts
The AlertService is designed to be easy to use while providing robust error monitoring. Here's how to integrate it into your code:
```typescript
import AlertService, { Severity, AlertCategory } from './AlertService';

// Log a simple error
await AlertService.logError(
    Severity.ERROR,
    AlertCategory.DATABASE,
    'Failed to save case data',
    error, // Optional Error object
    { caseNumber: '22CR123456' } // Optional context
);
```

For cleaner code, you can create a logger for a specific category:
```typescript
// Create a logger for portal-related issues
const portalLogger = AlertService.forCategory(AlertCategory.PORTAL);

// Use the scoped logger
await portalLogger.error('Failed to connect to portal', error);
await portalLogger.info('Successfully retrieved case data');
```

Choose the severity level that matches the impact of the problem:
- INFO: Informational messages that don't indicate problems
- WARNING: Non-critical issues that might need attention
- ERROR: Problems that affect functionality but aren't system failures
- CRITICAL: Severe problems that require immediate attention
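For example, using the `logError` signature shown above, a degraded-but-working dependency might warrant WARNING, while a full outage is CRITICAL. The messages, retry count, and context values here are illustrative:

```typescript
// Portal responses are slow but succeeding: worth attention, not a page
await AlertService.logError(
    Severity.WARNING,
    AlertCategory.PORTAL,
    'Portal responses exceeding latency threshold'
);

// Portal is completely unreachable: requires immediate attention
await AlertService.logError(
    Severity.CRITICAL,
    AlertCategory.PORTAL,
    'Portal unreachable after 3 retries',
    error,
    { resource: 'portal-client' }
);
```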
The context object helps provide relevant information about the error:
```typescript
{
    userId?: string;      // The affected user
    caseNumber?: string;  // The case number involved
    searchId?: string;    // ID of the search operation
    resource?: string;    // The resource/component having issues
    operationId?: string; // A unique ID for the operation
    metadata?: object;    // Any additional data
}
```
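For instance, a queue worker might attach several of these fields at once (the values below are illustrative):

```typescript
await AlertService.logError(
    Severity.ERROR,
    AlertCategory.QUEUE,
    'Case search message failed processing',
    error,
    {
        userId: 'user-123',
        caseNumber: '22CR123456',
        operationId: 'op-9f2c',
        metadata: { attempt: 3, queue: 'case-search' },
    }
);
```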
The system includes two types of alarms. The first group tracks the error metrics published by AlertService:
- AuthenticationErrorAlarm: Triggers when authentication errors exceed normal rates
- PortalCriticalErrorAlarm: Monitors for critical portal connectivity issues
- SystemErrorAlarm: Alerts on high rates of system errors
- DatabaseErrorAlarm: Tracks database connectivity issues
The second group monitors the underlying AWS infrastructure:
- LambdaErrorsAlarm: Monitors all Lambda function errors
- LambdaThrottlesAlarm: Alerts when Lambda functions are being throttled
- ApiGateway5xxErrorsAlarm: Detects server errors in API Gateway
- CaseProcessingDLQAlarm: Monitors for messages in the Dead Letter Queue
- Per-Function Alarms:
  - processCaseSearch-Errors: Issues with search queue processing
  - processCaseData-Errors: Issues with case data retrieval
  - postSearch-Errors: Problems with the search API endpoint
These infrastructure alarms will catch issues outside the application code, such as:
- Unhandled exceptions that crash Lambda functions
- Memory/timeout issues
- API Gateway configuration problems
- Messages that repeatedly fail processing
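How these alarms are provisioned depends on the stack's IaC tooling. As a minimal sketch, here is what one of the application alarms might look like in AWS CDK; the `ZipCase` namespace, `SystemError` metric name, and threshold values are assumptions for illustration, not confirmed from the codebase:

```typescript
import { Duration, Stack, StackProps } from 'aws-cdk-lib';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import * as actions from 'aws-cdk-lib/aws-cloudwatch-actions';
import * as sns from 'aws-cdk-lib/aws-sns';
import { Construct } from 'constructs';

export class AlertingStack extends Stack {
    constructor(scope: Construct, id: string, props?: StackProps) {
        super(scope, id, props);

        // Topic for alert delivery; in ZipCase the ARN is stored in /zipcase/alert-topic-arn
        const alertTopic = new sns.Topic(this, 'AlertTopic');

        // Assumed metric emitted by AlertService for System-category errors
        const systemErrors = new cloudwatch.Metric({
            namespace: 'ZipCase',      // assumption
            metricName: 'SystemError', // assumption
            statistic: 'Sum',
            period: Duration.minutes(5),
        });

        const alarm = new cloudwatch.Alarm(this, 'SystemErrorAlarm', {
            metric: systemErrors,
            threshold: 5, // illustrative threshold
            evaluationPeriods: 1,
            treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
        });
        alarm.addAlarmAction(new actions.SnsAction(alertTopic));
    }
}
```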
Email alerts are sent via Amazon SNS and include:
- Severity level and category
- Error message
- Timestamp
- Context information
- Environment information (stage, region, service)
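As a rough sketch of what publishing such a notification might look like, here is a hypothetical `publishAlert` helper using the AWS SDK; the message shape mirrors the fields above, but the exact format AlertService uses is not shown here:

```typescript
import { SNSClient, PublishCommand } from '@aws-sdk/client-sns';

const sns = new SNSClient({});
const topicArn = process.env.ALERT_TOPIC_ARN ?? ''; // in ZipCase, stored in /zipcase/alert-topic-arn

// Hypothetical helper: publish an alert email through the configured topic
async function publishAlert(subject: string, body: object): Promise<void> {
    await sns.send(
        new PublishCommand({
            TopicArn: topicArn,
            Subject: subject, // e.g. '[CRITICAL][PORTAL] Failed to connect to portal'
            Message: JSON.stringify(body, null, 2),
        })
    );
}

await publishAlert('[CRITICAL][PORTAL] Failed to connect to portal', {
    severity: 'CRITICAL',
    category: 'PORTAL',
    message: 'Failed to connect to portal',
    timestamp: new Date().toISOString(),
    context: { caseNumber: '22CR123456' },
    environment: { stage: 'prod', region: 'us-east-1', service: 'zipcase' },
});
```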
The alerting system requires these SSM parameters:
- /zipcase/alert-email: Email address for notifications
- /zipcase/alert-topic-arn: SNS topic ARN (created automatically)
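For example, the email address could be set with the AWS SDK (the address is illustrative; the equivalent `aws ssm put-parameter` CLI call also works):

```typescript
import { SSMClient, PutParameterCommand } from '@aws-sdk/client-ssm';

const ssm = new SSMClient({});

// Register the notification address; the topic ARN parameter is written at deploy time
await ssm.send(
    new PutParameterCommand({
        Name: '/zipcase/alert-email',
        Value: 'oncall@example.com', // illustrative address
        Type: 'String',
        Overwrite: true,
    })
);
```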
To prevent alert fatigue, the system deduplicates similar errors:
- Errors are grouped by message pattern and category
- Dynamic values like UUIDs and timestamps are normalized
- A cached count of similar errors is maintained
- Alerts are sent only after thresholds are exceeded or time intervals pass
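A sketch of the normalization idea: dynamic tokens are replaced with placeholders so that two occurrences of the same failure map to one fingerprint. The exact patterns AlertService uses may differ:

```typescript
// Collapse dynamic values so similar errors share a deduplication key
function fingerprint(category: string, message: string): string {
    const normalized = message
        // UUIDs -> <uuid>
        .replace(/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, '<uuid>')
        // ISO timestamps -> <timestamp>
        .replace(/\d{4}-\d{2}-\d{2}T[\d:.]+Z?/g, '<timestamp>')
        // Remaining digit runs (ids, counts) -> <n>
        .replace(/\d+/g, '<n>');
    return `${category}:${normalized}`;
}

// Both calls yield 'DATABASE:Failed to save case <n>CR<n>'
fingerprint('DATABASE', 'Failed to save case 22CR123456');
fingerprint('DATABASE', 'Failed to save case 23CR654321');
```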
Follow these practices to keep alerts useful:
- Use Appropriate Severity Levels: Don't mark everything as CRITICAL
- Include Relevant Context: Add userId, caseNumber, etc. when available
- Be Specific in Messages: Error messages should be descriptive
- Log Early, Log Often: Instrument critical code paths
- Group Related Errors: Use consistent categories and message patterns
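As an illustration of the "be specific" and "include context" points, compare the two hypothetical calls below (assuming the scoped logger forwards an optional context object the way `logError` does):

```typescript
// Vague: hard to triage, and dedupes poorly against unrelated failures
await portalLogger.error('Request failed');

// Specific and contextual: actionable, and groups with its true siblings
await portalLogger.error('Portal session expired during case lookup', error, {
    caseNumber: '22CR123456',
    resource: 'portal-session',
});
```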
CloudWatch automatically creates dashboards for the metrics. You can view:
- Error rates by severity and category
- Alarm history and current state
- Error trends over time