121 changes: 121 additions & 0 deletions infrastructure/docs/upgrade-procedure.md
@@ -0,0 +1,121 @@
# Kubernetes Node Group Upgrade Procedure

This document outlines the procedure for safely upgrading Kubernetes versions on EKS node groups in the GistPin infrastructure.

## Overview

The upgrade process automates the safe rolling upgrade of Kubernetes node groups with the following features:
- Comprehensive pre-upgrade validation checks
- Cordoning and draining of nodes before termination
- Gradual rolling replacement of nodes
- Post-upgrade validation
- Automated rollback capability

## Prerequisites

Before running any upgrade, ensure you have:

1. **Required tools installed**:
   - AWS CLI (configured with appropriate permissions)
   - kubectl (configured to access your cluster)
   - jq (for JSON processing)

2. **AWS permissions required**:
   - `eks:DescribeCluster`
   - `eks:DescribeNodegroup`
   - `eks:UpdateNodegroupVersion`
   - `ec2:DescribeSubnets`
   - Permission to update your kubeconfig (`aws eks update-kubeconfig` relies on `eks:DescribeCluster`)

3. **Cluster readiness**:
   - The cluster control plane must already be upgraded to the target Kubernetes version
   - You can only upgrade one minor version at a time (e.g., 1.27 → 1.28 is supported, 1.27 → 1.29 is not)
   - The node group must have enough capacity to add new nodes during the rolling upgrade (max size > current desired size)
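
The one-minor-version rule can be checked mechanically before anything touches the cluster. The following is a minimal sketch, not the logic in `pre-upgrade-checks.sh` (which is not shown here); the function name `is_single_minor_upgrade` is hypothetical, and it assumes versions are written as `MAJOR.MINOR` with an optional patch component.

```bash
# Hypothetical helper: succeeds only when the target is exactly one minor
# version above the current version within the same major version.
is_single_minor_upgrade() {
  local current target cur_major tgt_major cur_minor tgt_minor
  current="$1"
  target="$2"
  cur_major="${current%%.*}"
  tgt_major="${target%%.*}"
  cur_minor="${current#*.}"
  tgt_minor="${target#*.}"
  # Drop any patch component, such as the ".3" in "1.27.3".
  cur_minor="${cur_minor%%.*}"
  tgt_minor="${tgt_minor%%.*}"
  [ "$cur_major" = "$tgt_major" ] && [ $((tgt_minor - cur_minor)) -eq 1 ]
}
```

For example, `is_single_minor_upgrade 1.27 1.28` succeeds, while `is_single_minor_upgrade 1.27 1.29` returns a non-zero status.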

## Pre-Upgrade Checks

The `pre-upgrade-checks.sh` script automatically verifies all prerequisites before an upgrade can begin:

```bash
./infrastructure/scripts/pre-upgrade-checks.sh <cluster-name> <node-group-name> <target-version>
```

### Checks Performed

1. **Dependency validation**: Verifies AWS CLI, kubectl, and jq are installed
2. **Cluster existence**: Confirms the specified EKS cluster exists and is accessible
3. **Node group eligibility**:
   - Verifies the node group exists
   - Checks that the target version differs from the current version
   - Validates that the upgrade spans only one minor version
   - Confirms the target version matches the cluster control plane version
4. **Resource capacity**: Ensures the node group has enough max capacity to add new nodes
5. **Cluster health**:
   - All nodes are in Ready state
   - No PodDisruptionBudgets that would block draining
   - All kube-system pods are running
6. **Subnet capacity**: Verifies sufficient IP addresses are available in subnets
7. **Node group status**: Confirms the node group is in ACTIVE state with no ongoing operations
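
As an illustration of the first check, dependency validation can be done with `command -v`. This is a sketch under assumptions rather than the contents of `pre-upgrade-checks.sh`; the function names `missing_tools` and `check_dependencies` are hypothetical.

```bash
# Prints the name of each listed tool that is not found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Fails with a message on stderr when any required tool is missing.
check_dependencies() {
  local missing
  missing="$(missing_tools aws kubectl jq)"
  if [ -n "$missing" ]; then
    printf 'Missing required tools: %s\n' "$(printf '%s' "$missing" | tr '\n' ' ')" >&2
    return 1
  fi
}
```

Calling `check_dependencies` at the top of a script stops the run before any AWS or cluster call is attempted.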

## Running the Upgrade

### Standard Upgrade Process

To perform a rolling upgrade of a node group:

```bash
./infrastructure/scripts/upgrade-node-group.sh <cluster-name> <node-group-name> <target-version>
```

### What Happens During Upgrade

1. **Pre-flight**: Runs all pre-upgrade checks to ensure eligibility
2. **Initiate EKS upgrade**: Triggers the AWS EKS node group version update
3. **Wait for new nodes**: Waits for EKS to provision new nodes with the target version
4. **Process old nodes sequentially**:
   - **Cordon**: Marks the node as unschedulable to prevent new pods from being assigned
   - **Drain**: Evicts all existing pods from the node (respects PodDisruptionBudgets)
   - Waits for workloads to reschedule onto new nodes
5. **Post-upgrade validation**: After all nodes are replaced, verifies:
   - All nodes are running the target Kubernetes version
   - All nodes are in Ready state
   - All pods are running correctly
   - The node group is back to ACTIVE state
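
The cordon and drain steps map onto two `kubectl` commands. The sketch below is an assumption about how a single node could be handled, not the code in `upgrade-node-group.sh`; the function name `cordon_and_drain`, the `KUBECTL` override variable, and the default timeout are all hypothetical.

```bash
# KUBECTL can be overridden (for example with "echo") to preview the commands.
KUBECTL="${KUBECTL:-kubectl}"

# Cordon a node, then evict its pods. Eviction honours PodDisruptionBudgets,
# so the drain blocks (up to the timeout) while a budget forbids disruption.
cordon_and_drain() {
  local node timeout
  node="$1"
  timeout="${2:-300s}"
  "$KUBECTL" cordon "$node" || return 1
  "$KUBECTL" drain "$node" \
    --ignore-daemonsets \
    --delete-emptydir-data \
    --timeout="$timeout"
}
```

Running `KUBECTL=echo` before calling `cordon_and_drain <node-name>` prints the two commands without contacting the cluster.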

## Rollback Procedure

If an upgrade fails or issues are discovered after the upgrade, you can roll back to the previous version:

```bash
./infrastructure/scripts/upgrade-node-group.sh <cluster-name> <node-group-name> <original-version> --rollback
```

### Rollback Process

1. Retrieves the node group details (the same script, run with `--rollback`)
2. Initiates an EKS node group version update back to the original version
3. Waits for the rollback operation to complete
4. Verifies all nodes are running the original version
5. Performs health checks to ensure cluster stability after rollback

## Best Practices

1. **Test first**: Always test the upgrade procedure in a staging environment first
2. **Schedule during low traffic**: Run upgrades during periods of lower application traffic
3. **Monitor closely**: Keep an eye on cluster metrics, application logs, and node health during the upgrade
4. **Backup critical data**: Ensure all persistent volumes have recent backups before performing infrastructure changes
5. **Upgrade sequentially**: If upgrading multiple node groups, upgrade them one at a time to maintain capacity
6. **Verify workload resiliency**: Ensure your applications are designed to tolerate node failures and rescheduling

## Troubleshooting

### Common Issues

1. **Node draining fails**: This is often due to PodDisruptionBudgets that block evictions. Check the pre-upgrade warnings for PDBs with 0 disruptions allowed.
2. **New nodes fail to join**: Check subnet IP capacity, security group configurations, and AWS service limits.
3. **Upgrade times out**: The script has built-in timeouts. If timeouts occur frequently, you may need to raise the timeout values for larger clusters.
4. **Workload issues after upgrade**: Check application logs for compatibility issues with the new Kubernetes version. Roll back immediately if critical issues are found.
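
For the draining failure in item 1, the blocking PodDisruptionBudgets can be listed directly. This is a sketch under assumptions: the function name `list_blocking_pdbs` and the `KUBECTL` override variable are hypothetical, and it reads the `status.disruptionsAllowed` field that the `policy/v1` API reports for each budget.

```bash
# KUBECTL can be overridden, for example to point at a wrapper script.
KUBECTL="${KUBECTL:-kubectl}"

# Prints namespace/name for every PodDisruptionBudget that currently allows
# zero voluntary disruptions; draining a node that hosts its pods will stall.
list_blocking_pdbs() {
  "$KUBECTL" get pdb --all-namespaces --no-headers \
    -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed' |
    awk '$3 == "0" { print $1 "/" $2 }'
}
```

An empty result means no budget is currently at zero allowed disruptions; any line printed names a budget to review before retrying the drain.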

## Emergency Contact

In case of issues during upgrade that require immediate assistance, follow the emergency procedures outlined in [emergency-procedures.md](runbooks/emergency-procedures.md).
82 changes: 82 additions & 0 deletions infrastructure/monitoring/alertmanager-routes.yml
@@ -0,0 +1,82 @@
# Alertmanager routes configuration for log-based alerts
# Extends the main alertmanager.yml configuration

route:
  # Log-specific routing rules that extend the base configuration
  routes:
    # Security-specific routing - send all security alerts to dedicated security channel.
    # Listed first with continue: true so that critical security alerts are also
    # evaluated against the routes below; a matching route with continue: false
    # stops evaluation.
    - match:
        type: "security-alert"
      receiver: slack-security
      group_by: ['alertname']
      group_wait: 5s # Short wait so security alerts go out almost immediately
      group_interval: 3m
      repeat_interval: 30m # Frequent repeats for ongoing security issues
      continue: true

    # Critical log errors - route to critical alerts channel with high priority
    - match_re:
        type: "log-error|database-error|security-alert" # Alertmanager anchors the whole regex
        severity: critical
      receiver: slack-critical
      group_by: ['alertname', 'instance', 'type']
      group_wait: 10s # Short wait for critical issues
      group_interval: 5m
      repeat_interval: 1h # More frequent repeats for active log issues
      continue: false

    # Warning-level log alerts
    - match_re:
        type: "log-error|log-volume|log-anomaly"
        severity: warning
      receiver: slack-warnings
      group_by: ['alertname', 'type']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 2h
      continue: false

# Inhibit rules for log alerts - prevent alert spam
inhibit_rules:
  # If a critical log error is firing, suppress lower-severity warnings from the same instance
  - source_match:
      severity: critical
      type: log-error
    target_match:
      severity: warning
      type: log-error
    equal: ['instance']

  # If there's a log volume anomaly, suppress high log volume alerts for the same instance
  - source_match:
      alertname: "LogVolumeAnomaly"
    target_match:
      alertname: "HighLogVolume"
    equal: ['instance']

  # Deduplication settings to prevent alert flooding
  # Prevent duplicate alerts from same instance with same error pattern
  # This works with the dedupe window in log-patterns.json to enforce deduplication
  - source_match_re:
      alertname: ".*"
    target_match_re:
      alertname: ".*"
    equal: ['instance', 'message_pattern']

receivers:
  # Dedicated security receiver for all security-related log alerts
  - name: slack-security
    slack_configs:
      - channel: '#alerts-security'
        title: '[{{ .Status | toUpper }}][SECURITY] {{ .CommonLabels.alertname }}'
        # <!channel> notifies the whole channel for every security alert
        text: |
          <!channel> :shield: *SECURITY ALERT* :shield:
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          *Instance:* {{ .Labels.instance }}
          *Runbook:* {{ .Annotations.runbook_url }}
          {{ end }}
        send_resolved: true
19 changes: 18 additions & 1 deletion infrastructure/monitoring/alertmanager.yml
@@ -16,6 +16,9 @@ route:
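        severity: warning
      receiver: slack-warnings

  # Include log alert routes (assumes a templating step such as Helm renders this file; Alertmanager does not expand include directives in its config)
  {{ include "alertmanager-routes.yml" . }}

receivers:
  - name: slack-critical
    slack_configs: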
@@ -27,10 +30,24 @@ receivers:
      - channel: '#alerts-warnings'
        title: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: slack-security
    slack_configs:
      - channel: '#alerts-security'
        title: '[{{ .Status | toUpper }}][SECURITY] {{ .CommonLabels.alertname }}'
        # <!channel> notifies the whole channel for every security alert
        text: |
          <!channel> :shield: *SECURITY ALERT* :shield:
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          *Instance:* {{ .Labels.instance }}
          *Runbook:* {{ .Annotations.runbook_url }}
          {{ end }}
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname']
89 changes: 89 additions & 0 deletions infrastructure/monitoring/log-alert-rules.yml
@@ -0,0 +1,89 @@
groups:
  - name: log-error-alerts
    rules:
      # Error pattern detection rules
      - alert: CriticalLogError
        # increase() measures counter growth (assuming logs_total is a counter);
        # count_over_time() would count scrape samples, not log entries
        expr: |
          increase(logs_total{level="critical"}[5m]) > 0
        for: 0m
        labels:
          severity: critical
          type: log-error
        annotations:
          summary: "Critical error logged in application"
          description: "Critical error detected in logs: {{ $labels.message }} (instance: {{ $labels.instance }})"
          runbook_url: "https://github.com/ykargee/GistPin/runbooks/logs/critical-error.md"

      - alert: ErrorLogSpike
        # rate() returns a per-second average, so the threshold is 10 errors/sec
        expr: |
          rate(logs_total{level="error"}[5m]) > 10
        for: 2m
        labels:
          severity: warning
          type: log-error
        annotations:
          summary: "High error rate detected in logs"
          description: 'Error rate exceeded 10 errors per second for 2 minutes. Current rate: {{ $value | printf "%.2f" }} errors/sec'
          runbook_url: "https://github.com/ykargee/GistPin/runbooks/logs/error-spike.md"

      - alert: DatabaseConnectionErrors
        # Sums counter growth across matching message series per instance
        # (assuming logs_total is a counter with an instance label)
        expr: |
          sum by (instance) (increase(logs_total{message=~".*connection refused.*",component="postgres"}[5m])) > 5
        for: 1m
        labels:
          severity: critical
          type: database-error
        annotations:
          summary: "Multiple database connection failures"
          description: '{{ $value | printf "%.0f" }} database connection failures detected in 5 minutes'
          runbook_url: "https://github.com/ykargee/GistPin/runbooks/database/connection-failed.md"

      - alert: AuthenticationFailures
        # Sums counter growth across matching message series per instance
        # (assuming logs_total is a counter with an instance label)
        expr: |
          sum by (instance) (increase(logs_total{message=~".*invalid credentials.*|.*authentication failed.*"}[10m])) > 20
        for: 0m
        labels:
          severity: critical
          type: security-alert
        annotations:
          summary: "Multiple authentication failures detected"
          description: '{{ $value | printf "%.0f" }} authentication failures detected in 10 minutes - possible brute force attempt'
          runbook_url: "https://github.com/ykargee/GistPin/runbooks/security/auth-failures.md"

      # Rate-based alerting rules
      - alert: HighLogVolume
        expr: |
          rate(logs_total[5m]) > 1000
        for: 5m
        labels:
          severity: warning
          type: log-volume
        annotations:
          summary: "Abnormally high log volume detected"
          description: 'Log ingestion rate exceeded 1000 logs/sec for 5 minutes. Current rate: {{ $value | printf "%.2f" }} logs/sec'
          runbook_url: "https://github.com/ykargee/GistPin/runbooks/logs/high-volume.md"

      # Anomaly detection using historical baseline comparison
      - alert: LogVolumeAnomaly
        expr: |
          rate(logs_total[5m]) > 3 * rate(logs_total[5m] offset 24h)
        for: 5m
        labels:
          severity: warning
          type: log-anomaly
        annotations:
          summary: "Log volume anomaly detected"
          description: 'Current log rate ({{ $value | printf "%.2f" }} logs/sec) is more than 3x the rate at the same time yesterday'
          runbook_url: "https://github.com/ykargee/GistPin/runbooks/logs/anomaly.md"

      - alert: ErrorRateAnomaly
        expr: |
          rate(logs_total{level="error"}[5m]) > 5 * rate(logs_total{level="error"}[5m] offset 24h)
        for: 3m
        labels:
          severity: warning
          type: log-anomaly
        annotations:
          summary: "Error rate anomaly detected"
          description: "Current error rate is more than 5x the rate at the same time yesterday"
          runbook_url: "https://github.com/ykargee/GistPin/runbooks/logs/error-anomaly.md"