diff --git a/infrastructure/docs/upgrade-procedure.md b/infrastructure/docs/upgrade-procedure.md new file mode 100644 index 00000000..3b564f63 --- /dev/null +++ b/infrastructure/docs/upgrade-procedure.md @@ -0,0 +1,121 @@ +# Kubernetes Node Group Upgrade Procedure + +This document outlines the procedure for safely upgrading Kubernetes versions on EKS node groups in the GistPin infrastructure. + +## Overview + +The upgrade process automates the safe rolling upgrade of Kubernetes node groups with the following features: +- Comprehensive pre-upgrade validation checks +- Cordoning and draining of nodes before termination +- Gradual rolling replacement of nodes +- Post-upgrade validation +- Automated rollback capability + +## Prerequisites + +Before running any upgrade, ensure you have: + +1. **Required tools installed**: + - AWS CLI (configured with appropriate permissions) + - kubectl (configured to access your cluster) + - jq (for JSON processing) + +2. **AWS permissions required**: + - `eks:DescribeCluster` + - `eks:DescribeNodegroup` + - `eks:UpdateNodegroupVersion` + - `ec2:DescribeSubnets` + - Permissions to update kubeconfig + +3. **Cluster readiness**: + - The cluster control plane must already be upgraded to the target Kubernetes version + - You can only upgrade one minor version at a time (e.g., 1.27 → 1.28 is supported, 1.27 → 1.29 is not) + - The node group must have enough capacity to add new nodes during the rolling upgrade (max size > current desired size) + +## Pre-Upgrade Checks + +The `pre-upgrade-checks.sh` script automatically verifies all prerequisites before an upgrade can begin. It takes the cluster name, node group name, and target version as positional arguments: + +```bash +./infrastructure/scripts/pre-upgrade-checks.sh <cluster-name> <node-group-name> <target-version> +``` + +### Checks Performed + +1. **Dependency validation**: Verifies AWS CLI, kubectl, and jq are installed +2. **Cluster existence**: Confirms the specified EKS cluster exists and is accessible +3. 
**Node group eligibility**: + - Verifies the node group exists + - Checks that the target version differs from the current version + - Validates that the upgrade spans only one minor version + - Confirms target version matches cluster control plane version +4. **Resource capacity**: Ensures node group has enough max capacity to add new nodes +5. **Cluster health**: + - All nodes are in Ready state + - No problematic PodDisruptionBudgets that would block draining + - All kube-system pods are running +6. **Subnet capacity**: Verifies sufficient IP addresses are available in subnets +7. **Node group status**: Confirms the node group is in ACTIVE state with no ongoing operations + +## Running the Upgrade + +### Standard Upgrade Process + +To perform a rolling upgrade of a node group: + +```bash +./infrastructure/scripts/upgrade-node-group.sh <cluster-name> <node-group-name> <target-version> +``` + +### What Happens During Upgrade + +1. **Pre-flight**: Runs all pre-upgrade checks to ensure eligibility +2. **Initiate EKS upgrade**: Triggers the AWS EKS node group version update +3. **Wait for new nodes**: Waits for EKS to provision new nodes with the target version +4. **Process old nodes sequentially**: + - **Cordon**: Marks the node as unschedulable to prevent new pods from being assigned + - **Drain**: Evicts all existing pods from the node (respects PodDisruptionBudgets) + - Waits for workloads to reschedule onto new nodes +5. **Post-upgrade validation**: After all nodes are replaced, verifies: + - All nodes are running the target Kubernetes version + - All nodes are in Ready state + - All pods are running correctly + - The node group is back to ACTIVE state + +## Rollback Procedure + +If an upgrade fails or issues are discovered after the upgrade, you can roll back to the previous version. The script still expects the three positional arguments, with `--rollback` as the fourth: + +```bash +./infrastructure/scripts/upgrade-node-group.sh <cluster-name> <node-group-name> <target-version> --rollback +``` + +### Rollback Process + +1. The rollback script retrieves the node group details +2. 
Initiates an EKS node group version update back to the original version +3. Waits for the rollback operation to complete +4. Verifies all nodes are running the original version +5. Performs health checks to ensure cluster stability after rollback + +## Best Practices + +1. **Test first**: Always test the upgrade procedure in a staging environment first +2. **Schedule during low traffic**: Run upgrades during periods of lower application traffic +3. **Monitor closely**: Keep an eye on cluster metrics, application logs, and node health during the upgrade +4. **Backup critical data**: Ensure all persistent volumes have recent backups before performing infrastructure changes +5. **Upgrade sequentially**: If upgrading multiple node groups, upgrade them one at a time to maintain capacity +6. **Verify workload resiliency**: Ensure your applications are designed to tolerate node failures and rescheduling + +## Troubleshooting + +### Common Issues + +1. **Node draining fails**: This is often due to PodDisruptionBudgets that block evictions. Check the pre-upgrade warnings for PDBs with 0 disruptions allowed. +2. **New nodes fail to join**: Check subnet IP capacity, security group configurations, and AWS service limits. +3. **Upgrade times out**: The script has built-in timeouts. If this happens frequently, you may need to adjust timings for larger clusters. +4. **Workload issues after upgrade**: Check application logs for compatibility issues with the new Kubernetes version. Roll back immediately if critical issues are found. + +## Emergency Contact + +In case of issues during upgrade that require immediate assistance, follow the emergency procedures outlined in [emergency-procedures.md](runbooks/emergency-procedures.md). 
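The one-minor-version rule listed under Prerequisites can be sketched as a small shell check. This is an illustrative sketch, not the logic in `pre-upgrade-checks.sh`: the helper name `is_single_minor_upgrade` is hypothetical, and it assumes versions are written as `MAJOR.MINOR` (for example `1.27`). It is deliberately stricter than a plain "gap greater than 1" test, since it also rejects downgrades and major-version changes.

```shell
set -eu

# is_single_minor_upgrade CURRENT TARGET
# Hypothetical helper: succeeds only when TARGET has the same major version
# as CURRENT and a minor version that is exactly one higher.
is_single_minor_upgrade() {
  local current="$1"
  local target="$2"
  local cur_major cur_minor tgt_major tgt_minor

  # Split MAJOR.MINOR on the dot; any patch component is ignored.
  cur_major=$(echo "$current" | cut -d. -f1)
  cur_minor=$(echo "$current" | cut -d. -f2)
  tgt_major=$(echo "$target" | cut -d. -f1)
  tgt_minor=$(echo "$target" | cut -d. -f2)

  [ "$cur_major" -eq "$tgt_major" ] && [ $((tgt_minor - cur_minor)) -eq 1 ]
}

# Example: report whether a planned step is allowed.
if is_single_minor_upgrade 1.27 1.28; then
  echo "1.27 -> 1.28: allowed"
fi
```

With this helper, `is_single_minor_upgrade 1.27 1.28` succeeds, while `1.27 1.29` (two minors) and `1.28 1.27` (a downgrade) both fail.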
\ No newline at end of file diff --git a/infrastructure/monitoring/alertmanager-routes.yml b/infrastructure/monitoring/alertmanager-routes.yml new file mode 100644 index 00000000..158253d3 --- /dev/null +++ b/infrastructure/monitoring/alertmanager-routes.yml @@ -0,0 +1,82 @@ +# Alertmanager routes configuration for log-based alerts +# Extends the main alertmanager.yml configuration + +route: + # Log-specific routing rules that extend the base configuration + routes: + # Critical log errors - route to critical alerts channel with high priority + - match_re: + type: "^log-error|database-error|security-alert$" + severity: critical + receiver: slack-critical + group_by: ['alertname', 'instance', 'type'] + group_wait: 10s # Short wait for critical issues + group_interval: 5m + repeat_interval: 1h # More frequent repeats for active log issues + continue: false + + # Warning-level log alerts + - match_re: + type: "^log-error|log-volume|log-anomaly$" + severity: warning + receiver: slack-warnings + group_by: ['alertname', 'type'] + group_wait: 30s + group_interval: 5m + repeat_interval: 2h + continue: false + + # Security-specific routing - send all security alerts to dedicated security channel + - match: + type: "security-alert" + receiver: slack-security + group_by: ['alertname'] + group_wait: 5s # Immediately send security alerts + group_interval: 3m + repeat_interval: 30m # Frequent repeats for ongoing security issues + continue: true + +# Inhibit rules for log alerts - prevent alert spam +inhibit_rules: + # If a critical log error is firing, suppress lower-severity warnings from the same instance + - source_match: + severity: critical + type: log-error + target_match: + severity: warning + type: log-error + equal: ['instance'] + + # If there's a log volume anomaly, suppress high log volume alerts for the same instance + - source_match: + alertname: "LogVolumeAnomaly" + target_match: + alertname: "HighLogVolume" + equal: ['instance'] + +# Deduplication settings to 
prevent alert flooding + # Appended to the inhibit_rules list above; a second top-level inhibit_rules key would duplicate it + - source_match_re: + alertname: ".*" + target_match_re: + alertname: ".*" + equal: ['instance', 'message_pattern'] + # Prevent duplicate alerts from same instance with same error pattern + # This works with the dedupe window in log-patterns.json to enforce deduplication + +receivers: + # Dedicated security receiver for all security-related log alerts + - name: slack-security + slack_configs: + - channel: '#alerts-security' + title: '[{{ .Status | toUpper }}][SECURITY] {{ .CommonLabels.alertname }}' + text: | + :shield: *SECURITY ALERT* :shield: + {{ range .Alerts }} + *Alert:* {{ .Annotations.summary }} + *Description:* {{ .Annotations.description }} + *Instance:* {{ .Labels.instance }} + *Runbook:* {{ .Annotations.runbook_url }} + {{ end }} + send_resolved: true + mention_channel: true # Always @channel for security alerts \ No newline at end of file diff --git a/infrastructure/monitoring/alertmanager.yml b/infrastructure/monitoring/alertmanager.yml index 48ca9046..2d322fa9 100644 --- a/infrastructure/monitoring/alertmanager.yml +++ b/infrastructure/monitoring/alertmanager.yml @@ -16,6 +16,9 @@ route: severity: warning receiver: slack-warnings +# Include log alert routes +{{ include "alertmanager-routes.yml" . 
}} + receivers: - name: slack-critical slack_configs: @@ -27,10 +30,24 @@ receivers: - channel: '#alerts-warnings' title: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}' text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}' + - name: slack-security + slack_configs: + - channel: '#alerts-security' + title: '[{{ .Status | toUpper }}][SECURITY] {{ .CommonLabels.alertname }}' + text: | + :shield: *SECURITY ALERT* :shield: + {{ range .Alerts }} + *Alert:* {{ .Annotations.summary }} + *Description:* {{ .Annotations.description }} + *Instance:* {{ .Labels.instance }} + *Runbook:* {{ .Annotations.runbook_url }} + {{ end }} + send_resolved: true + mention_channel: true inhibit_rules: - source_match: severity: critical target_match: severity: warning - equal: ['alertname'] + equal: ['alertname'] \ No newline at end of file diff --git a/infrastructure/monitoring/log-alert-rules.yml b/infrastructure/monitoring/log-alert-rules.yml new file mode 100644 index 00000000..e5f4c85a --- /dev/null +++ b/infrastructure/monitoring/log-alert-rules.yml @@ -0,0 +1,89 @@ +groups: + - name: log-error-alerts + rules: + # Error pattern detection rules + - alert: CriticalLogError + expr: | + increase(logs_total{level="critical"}[5m]) > 0 + for: 0m + labels: + severity: critical + type: log-error + annotations: + summary: "Critical error logged in application" + description: "Critical error detected in logs: {{ $labels.message }} (instance: {{ $labels.instance }})" + runbook_url: "https://github.com/ykargee/GistPin/runbooks/logs/critical-error.md" + + - alert: ErrorLogSpike + expr: | + rate(logs_total{level="error"}[5m]) > 10 + for: 2m + labels: + severity: warning + type: log-error + annotations: + summary: "High error rate detected in logs" + description: "Error rate exceeded 10 errors per second for 2 minutes. 
Current rate: {{ $value | printf "%.2f" }} errors/sec" + runbook_url: "https://github.com/ykargee/GistPin/runbooks/logs/error-spike.md" + + - alert: DatabaseConnectionErrors + expr: | + increase(logs_total{message=~".*connection refused.*",component="postgres"}[5m]) > 5 + for: 1m + labels: + severity: critical + type: database-error + annotations: + summary: "Multiple database connection failures" + description: "{{ $value }} database connection failures detected in 5 minutes" + runbook_url: "https://github.com/ykargee/GistPin/runbooks/database/connection-failed.md" + + - alert: AuthenticationFailures + expr: | + increase(logs_total{message=~".*invalid credentials.*|.*authentication failed.*"}[10m]) > 20 + for: 0m + labels: + severity: critical + type: security-alert + annotations: + summary: "Multiple authentication failures detected" + description: "{{ $value }} authentication failures detected in 10 minutes - possible brute force attempt" + runbook_url: "https://github.com/ykargee/GistPin/runbooks/security/auth-failures.md" + + # Rate-based alerting rules + - alert: HighLogVolume + expr: | + rate(logs_total[5m]) > 1000 + for: 5m + labels: + severity: warning + type: log-volume + annotations: + summary: "Abnormally high log volume detected" + description: "Log ingestion rate exceeded 1000 logs/sec for 5 minutes. 
Current rate: {{ $value | printf "%.2f" }} logs/sec" + runbook_url: "https://github.com/ykargee/GistPin/runbooks/logs/high-volume.md" + + # Anomaly detection using historical baseline comparison + - alert: LogVolumeAnomaly + expr: | + rate(logs_total[5m]) > 3 * rate(logs_total[5m] offset 24h) + for: 5m + labels: + severity: warning + type: log-anomaly + annotations: + summary: "Log volume anomaly detected" + description: "Current log rate ({{ $value | printf "%.2f" }} logs/sec) is 3x higher than the same time yesterday" + runbook_url: "https://github.com/ykargee/GistPin/runbooks/logs/anomaly.md" + + - alert: ErrorRateAnomaly + expr: | + rate(logs_total{level="error"}[5m]) > 5 * rate(logs_total{level="error"}[5m] offset 24h) + for: 3m + labels: + severity: warning + type: log-anomaly + annotations: + summary: "Error rate anomaly detected" + description: "Current error rate is 5x higher than the same time yesterday" + runbook_url: "https://github.com/ykargee/GistPin/runbooks/logs/error-anomaly.md" \ No newline at end of file diff --git a/infrastructure/monitoring/log-patterns.json b/infrastructure/monitoring/log-patterns.json new file mode 100644 index 00000000..5982568e --- /dev/null +++ b/infrastructure/monitoring/log-patterns.json @@ -0,0 +1,93 @@ +{ + "error_patterns": [ + { + "name": "critical_system_errors", + "patterns": [ + "out of memory", + "stack overflow", + "segmentation fault", + "unhandled exception", + "fatal error", + "panic:" + ], + "level": "critical", + "alert_threshold": 1, + "window": "5m", + "deduplication_window": "15m" + }, + { + "name": "database_errors", + "patterns": [ + "connection refused", + "connection reset", + "deadlock detected", + "query timeout", + "too many connections", + "integrity constraint violation" + ], + "level": "error", + "alert_threshold": 5, + "window": "5m", + "deduplication_window": "10m" + }, + { + "name": "security_errors", + "patterns": [ + "invalid credentials", + "authentication failed", + "unauthorized 
access", + "forbidden access", + "csrf token invalid", + "xss attempt detected", + "sql injection attempt" + ], + "level": "warning", + "alert_threshold": 20, + "window": "10m", + "deduplication_window": "5m" + }, + { + "name": "api_errors", + "patterns": [ + "internal server error", + "bad gateway", + "service unavailable", + "gateway timeout", + "bad request", + "not found" + ], + "level": "error", + "alert_threshold": 10, + "window": "5m", + "deduplication_window": "5m" + }, + { + "name": "network_errors", + "patterns": [ + "connection timed out", + "network unreachable", + "dns resolution failed", + "tls handshake failed", + "ssl error" + ], + "level": "warning", + "alert_threshold": 15, + "window": "10m", + "deduplication_window": "10m" + } + ], + "anomaly_detection": { + "enabled": true, + "baseline_window": "7d", + "training_window": "24h", + "sensitivity": 2.0, + "seasonality": true, + "seasonality_period": "24h" + }, + "deduplication": { + "enabled": true, + "default_window": "15m", + "fields": ["instance", "level", "message_pattern"], + "max_alerts_per_window": 5 + } +} \ No newline at end of file diff --git a/infrastructure/monitoring/prometheus.yml b/infrastructure/monitoring/prometheus.yml index 61a5d531..b1e0c943 100644 --- a/infrastructure/monitoring/prometheus.yml +++ b/infrastructure/monitoring/prometheus.yml @@ -11,6 +11,7 @@ alerting: rule_files: - "alert-rules.yml" - "gistpin-exporter-rules.yml" + - "log-alert-rules.yml" scrape_configs: - job_name: "gistpin-backend" diff --git a/infrastructure/scripts/pre-upgrade-checks.sh b/infrastructure/scripts/pre-upgrade-checks.sh new file mode 100644 index 00000000..4b8a890c --- /dev/null +++ b/infrastructure/scripts/pre-upgrade-checks.sh @@ -0,0 +1,236 @@ +#!/bin/bash +set -euo pipefail + +# EKS Node Group Pre-Upgrade Checks +# Usage: ./pre-upgrade-checks.sh + +# Color codes for output +RED='\033[0;31m' +GREEN='\033[0;32m' +YELLOW='\033[1;33m' +NC='\033[0m' # No Color + +log() { + echo -e "[$(date 
+'%Y-%m-%dT%H:%M:%S%z')] $1" +} + +error() { + log "${RED}ERROR: $1${NC}" >&2 + exit 1 +} + +warn() { + log "${YELLOW}WARNING: $1${NC}" +} + +success() { + log "${GREEN}SUCCESS: $1${NC}" +} + +# Check required arguments +if [ "$#" -ne 3 ]; then + error "Usage: $0 <cluster-name> <node-group-name> <target-version>" +fi + +CLUSTER_NAME="$1" +NODE_GROUP_NAME="$2" +TARGET_VERSION="$3" + +# Check if required tools are installed +check_dependencies() { + log "Checking required dependencies..." + + if ! command -v aws &> /dev/null; then + error "AWS CLI is not installed" + fi + + if ! command -v kubectl &> /dev/null; then + error "kubectl is not installed" + fi + + if ! command -v jq &> /dev/null; then + error "jq is not installed" + fi + + success "All dependencies are installed" +} + +# Verify cluster exists +check_cluster_exists() { + log "Checking if cluster $CLUSTER_NAME exists..." + + if ! aws eks describe-cluster --name "$CLUSTER_NAME" &> /dev/null; then + error "Cluster $CLUSTER_NAME does not exist or you don't have access to it" + fi + + success "Cluster $CLUSTER_NAME found" +} + +# Verify node group exists and get current version +check_node_group_exists() { + log "Checking if node group $NODE_GROUP_NAME exists..." + + local node_group_info + node_group_info=$(aws eks describe-nodegroup --cluster-name "$CLUSTER_NAME" --nodegroup-name "$NODE_GROUP_NAME" 2>/dev/null || error "Node group $NODE_GROUP_NAME not found in cluster $CLUSTER_NAME") + + CURRENT_VERSION=$(echo "$node_group_info" | jq -r '.nodegroup.version') + log "Current node group version: $CURRENT_VERSION" + log "Target version: $TARGET_VERSION" + + # Validate version order + if [ "$CURRENT_VERSION" = "$TARGET_VERSION" ]; then + error "Node group is already on version $TARGET_VERSION" + fi + + # Check if upgrade is allowed (only one minor version jump at a time in EKS) + local current_minor=$(echo "$CURRENT_VERSION" | cut -d. -f2) + local target_minor=$(echo "$TARGET_VERSION" | cut -d. 
-f2) + + if [ $((target_minor - current_minor)) -gt 1 ]; then + error "Cannot upgrade from $CURRENT_VERSION to $TARGET_VERSION. Only one minor version jump is allowed." + fi + + success "Node group $NODE_GROUP_NAME is eligible for upgrade" +} + +# Check cluster version compatibility +check_cluster_version_compatibility() { + log "Checking cluster version compatibility..." + + local cluster_version + cluster_version=$(aws eks describe-cluster --name "$CLUSTER_NAME" | jq -r '.cluster.version') + + if [ "$(echo "$TARGET_VERSION" | cut -d. -f1-2)" != "$cluster_version" ]; then + error "Target version $TARGET_VERSION does not match cluster version $cluster_version. Node group version must match cluster version." + fi + + success "Target version matches cluster version" +} + +# Check cluster health +check_cluster_health() { + log "Checking overall cluster health..." + + # Update kubeconfig + aws eks update-kubeconfig --name "$CLUSTER_NAME" + + # Check if all nodes are ready + local not_ready_nodes + not_ready_nodes=$(kubectl get nodes -o json | jq -r '.items[] | select(.status.conditions[] | select(.type=="Ready" and .status!="True")) | .metadata.name') + + if [ -n "$not_ready_nodes" ]; then + error "Found nodes in NotReady state: $not_ready_nodes" + fi + + success "All nodes are in Ready state" + + # Check for any PodDisruptionBudgets that might block draining + local pdb_violations + pdb_violations=$(kubectl get poddisruptionbudgets --all-namespaces -o json | jq -r '.items[] | select(.status.disruptionsAllowed == 0) | .metadata.namespace + "/" + .metadata.name') + + if [ -n "$pdb_violations" ]; then + warn "Found PDBs with 0 disruptions allowed: $pdb_violations. These may block node draining." 
+ else + success "No PDBs that would block node draining found" + fi + + # Check for critical pods that might be affected + local kube_system_pods + kube_system_pods=$(kubectl get pods -n kube-system -o json | jq -r '.items[] | select(.status.phase!="Running") | .metadata.name') + + if [ -n "$kube_system_pods" ]; then + warn "Found non-running pods in kube-system: $kube_system_pods" + else + success "All kube-system pods are running" + fi +} + +# Check node group resources +check_node_group_resources() { + log "Checking node group resources..." + + local node_group_scaling + node_group_scaling=$(aws eks describe-nodegroup --cluster-name "$CLUSTER_NAME" --nodegroup-name "$NODE_GROUP_NAME" | jq -r '.nodegroup.scalingConfig') + + local min_size=$(echo "$node_group_scaling" | jq -r '.minSize') + local max_size=$(echo "$node_group_scaling" | jq -r '.maxSize') + local desired_size=$(echo "$node_group_scaling" | jq -r '.desiredSize') + + log "Node group scaling config - Min: $min_size, Max: $max_size, Desired: $desired_size" + + # Ensure we have capacity to add new nodes before removing old ones + if [ "$desired_size" -ge "$max_size" ]; then + error "Node group is already at maximum size. Cannot perform rolling upgrade. Increase maxSize first." + fi + + success "Node group has sufficient capacity for rolling upgrade" +} + +# Check for any ongoing operations +check_ongoing_operations() { + log "Checking for ongoing operations on node group..." + + local nodegroup_status + nodegroup_status=$(aws eks describe-nodegroup --cluster-name "$CLUSTER_NAME" --nodegroup-name "$NODE_GROUP_NAME" | jq -r '.nodegroup.status') + + if [ "$nodegroup_status" != "ACTIVE" ]; then + error "Node group is in $nodegroup_status state. Must be in ACTIVE state to upgrade." + fi + + success "No ongoing operations, node group is ACTIVE" +} + +# Check if there are any other node groups that share the same subnet to ensure capacity +check_subnet_capacity() { + log "Checking subnet capacity..." 
+ + local subnets + subnets=$(aws eks describe-nodegroup --cluster-name "$CLUSTER_NAME" --nodegroup-name "$NODE_GROUP_NAME" | jq -r '.nodegroup.subnets[]') + + for subnet in $subnets; do + local available_ips + available_ips=$(aws ec2 describe-subnets --subnet-ids "$subnet" | jq -r '.Subnets[0].AvailableIpAddressCount') + + if [ "$available_ips" -lt 10 ]; then + warn "Subnet $subnet has only $available_ips IPs remaining, may impact upgrade" + else + success "Subnet $subnet has sufficient IP capacity ($available_ips available)" + fi + done +} + +# Main execution +main() { + log "Starting pre-upgrade checks for node group $NODE_GROUP_NAME in cluster $CLUSTER_NAME" + log "Target Kubernetes version: $TARGET_VERSION" + echo "----------------------------------------" + + check_dependencies + echo "----------------------------------------" + + check_cluster_exists + echo "----------------------------------------" + + check_node_group_exists + echo "----------------------------------------" + + check_cluster_version_compatibility + echo "----------------------------------------" + + check_ongoing_operations + echo "----------------------------------------" + + check_node_group_resources + echo "----------------------------------------" + + check_subnet_capacity + echo "----------------------------------------" + + check_cluster_health + echo "----------------------------------------" + + success "All pre-upgrade checks completed successfully! Node group is ready for upgrade." 
+ echo "----------------------------------------" +} + +main "$@" \ No newline at end of file diff --git a/infrastructure/scripts/upgrade-node-group.sh b/infrastructure/scripts/upgrade-node-group.sh new file mode 100644 index 00000000..c8c4a743 --- /dev/null +++ b/infrastructure/scripts/upgrade-node-group.sh @@ -0,0 +1,308 @@ +#!/bin/bash +set -euo pipefail + +# EKS Node Group Rolling Upgrade Script +# Usage: ./upgrade-node-group.sh [--rollback] + +# Color codes for output +RED='\033[0;31m' +GREEN='\033[0;32m' +YELLOW='\033[1;33m' +NC='\033[0m' # No Color + +UPGRADE_START_TIME="" +ORIGINAL_NODE_GROUP_VERSION="" +UPGRADE_IN_PROGRESS=false +NODE_GROUP_ARN="" +ROLLBACK_MODE=false + +log() { + echo -e "[$(date +'%Y-%m-%dT%H:%M:%S%z')] $1" +} + +error() { + log "${RED}ERROR: $1${NC}" >&2 + if [ "$UPGRADE_IN_PROGRESS" = true ]; then + log "${YELLOW}An error occurred during upgrade. Run with --rollback to initiate rollback procedure.${NC}" + fi + exit 1 +} + +warn() { + log "${YELLOW}WARNING: $1${NC}" +} + +success() { + log "${GREEN}SUCCESS: $1${NC}" +} + +# Function to perform rollback +perform_rollback() { + log "Initiating rollback procedure..." + + if [ -z "$ORIGINAL_NODE_GROUP_VERSION" ]; then + error "Original version not found, cannot rollback safely" + fi + + log "Rolling back node group $NODE_GROUP_NAME from current version to original version $ORIGINAL_NODE_GROUP_VERSION" + + # Update node group to original version + aws eks update-nodegroup-version \ + --cluster-name "$CLUSTER_NAME" \ + --nodegroup-name "$NODE_GROUP_NAME" \ + --version "$ORIGINAL_NODE_GROUP_VERSION" + + log "Waiting for rollback to complete..." 
+ aws eks wait nodegroup-active --cluster-name "$CLUSTER_NAME" --nodegroup-name "$NODE_GROUP_NAME" + + # Verify all nodes are rolled back + local nodes + nodes=$(kubectl get nodes -o json | jq -r '.items[] | select(.metadata.labels."eks.amazonaws.com/nodegroup"=="'"$NODE_GROUP_NAME"'") | .metadata.name') + + for node in $nodes; do + local node_version + node_version=$(kubectl get node "$node" -o json | jq -r '.status.nodeInfo.kubeletVersion' | sed 's/v//') + if [ "$node_version" != "$ORIGINAL_NODE_GROUP_VERSION" ]; then + warn "Node $node still on version $node_version, expected $ORIGINAL_NODE_GROUP_VERSION" + else + success "Node $node successfully rolled back to $node_version" + fi + done + + success "Rollback completed successfully" + UPGRADE_IN_PROGRESS=false + exit 0 +} + +# Cordon a node - mark it as unschedulable +cordon_node() { + local node_name="$1" + log "Cordoning node $node_name..." + kubectl cordon "$node_name" + success "Node $node_name cordoned successfully" +} + +# Drain a node - evict all pods from it +drain_node() { + local node_name="$1" + log "Draining node $node_name..." + # Wait 5 minutes for pod eviction, ignore daemonsets, force deletion + kubectl drain "$node_name" \ + --ignore-daemonsets \ + --delete-emptydir-data \ + --force \ + --timeout=300s + success "Node $node_name drained successfully" +} + +# Uncordon a node - mark it as schedulable again +uncordon_node() { + local node_name="$1" + log "Uncordoning node $node_name..." + kubectl uncordon "$node_name" + success "Node $node_name uncordoned successfully" +} + +# Get all nodes in the node group +get_node_group_nodes() { + kubectl get nodes -o json | jq -r '.items[] | select(.metadata.labels."eks.amazonaws.com/nodegroup"=="'"$NODE_GROUP_NAME"'") | .metadata.name' +} + +# Wait for new nodes to join and become ready +wait_for_new_nodes() { + local expected_count="$1" + log "Waiting for $expected_count new nodes to join cluster and become ready..." 
+ + local max_attempts=60 # 30 minutes total (30s per attempt) + local attempt=0 + + while [ $attempt -lt $max_attempts ]; do + local ready_nodes + ready_nodes=$(kubectl get nodes -o json | jq -r '.items[] | select(.metadata.labels."eks.amazonaws.com/nodegroup"=="'"$NODE_GROUP_NAME"'" and any(.status.conditions[]; .type=="Ready" and .status=="True")) | .metadata.name' | wc -l) + + if [ "$ready_nodes" -ge "$expected_count" ]; then + success "All $expected_count nodes are ready" + return 0 + fi + + log "Ready nodes: $ready_nodes/$expected_count. Waiting 30s..." + sleep 30 + attempt=$((attempt + 1)) # ((attempt++)) returns status 1 when attempt is 0 and would abort under set -e + done + + error "Timed out waiting for new nodes to become ready" +} + +# Validate upgrade after completion +validate_upgrade() { + log "Starting post-upgrade validation..." + + # Check all nodes in node group are on target version + local nodes + nodes=$(get_node_group_nodes) + local all_valid=true + + for node in $nodes; do + local node_version + node_version=$(kubectl get node "$node" -o json | jq -r '.status.nodeInfo.kubeletVersion' | sed 's/^v//' | cut -d. -f1-2) # kubeletVersion carries a patch and build suffix; compare MAJOR.MINOR only + if [ "$node_version" != "$TARGET_VERSION" ]; then + error "Node $node is on version $node_version, expected $TARGET_VERSION" + all_valid=false + else + success "Node $node is running target version $TARGET_VERSION" + fi + done + + # Check all pods are running + local non_running_pods + non_running_pods=$(kubectl get pods --all-namespaces -o json | jq -r '.items[] | select(.status.phase!="Running" and .status.phase!="Succeeded") | .metadata.namespace+"/"+.metadata.name') + + if [ -n "$non_running_pods" ]; then + warn "Found non-running pods after upgrade: $non_running_pods" + else + success "All pods are in Running/Succeeded state" + fi + + # Check all nodes are Ready + local not_ready_nodes + not_ready_nodes=$(kubectl get nodes -o json | jq -r '.items[] | select(.status.conditions[] | select(.type=="Ready" and .status!="True")) | .metadata.name') + + if [ -n "$not_ready_nodes" ]; then + error "Found nodes in NotReady state after 
upgrade: $not_ready_nodes" + else + success "All nodes are in Ready state" + fi + + # Verify node group status is ACTIVE + local nodegroup_status + nodegroup_status=$(aws eks describe-nodegroup --cluster-name "$CLUSTER_NAME" --nodegroup-name "$NODE_GROUP_NAME" | jq -r '.nodegroup.status') + + if [ "$nodegroup_status" != "ACTIVE" ]; then + error "Node group is in $nodegroup_status state after upgrade, expected ACTIVE" + else + success "Node group is ACTIVE" + fi + + if [ "$all_valid" = true ]; then + success "All post-upgrade validations passed!" + fi +} + +# Perform rolling upgrade of nodes +perform_rolling_upgrade() { + log "Starting rolling upgrade process..." + UPGRADE_IN_PROGRESS=true + + # Get list of current nodes before upgrade starts + local old_nodes + old_nodes=$(get_node_group_nodes) + local old_node_count=$(echo "$old_nodes" | wc -l | xargs) + log "Found $old_node_count existing nodes in node group" + + # Store original node group version + ORIGINAL_NODE_GROUP_VERSION=$(aws eks describe-nodegroup --cluster-name "$CLUSTER_NAME" --nodegroup-name "$NODE_GROUP_NAME" | jq -r '.nodegroup.version') + NODE_GROUP_ARN=$(aws eks describe-nodegroup --cluster-name "$CLUSTER_NAME" --nodegroup-name "$NODE_GROUP_NAME" | jq -r '.nodegroup.arn') + + log "Original version: $ORIGINAL_NODE_GROUP_VERSION, Target version: $TARGET_VERSION" + + # Initiate node group version update + log "Initiating node group version update in EKS..." + aws eks update-nodegroup-version \ + --cluster-name "$CLUSTER_NAME" \ + --nodegroup-name "$NODE_GROUP_NAME" \ + --version "$TARGET_VERSION" + + # Wait for node group to start provisioning new nodes + log "Waiting for EKS to provision new nodes..." 
+ sleep 60 # Initial wait for AWS to start the update + + # Wait for new nodes to join + wait_for_new_nodes $((old_node_count + 1)) # Wait for at least one new node + + # Process old nodes one by one + for node in $old_nodes; do + echo "----------------------------------------" + log "Processing old node: $node" + + # Check if node still exists + if kubectl get node "$node" &> /dev/null; then + # Cordon and drain the old node + cordon_node "$node" + drain_node "$node" + + # Wait a bit to ensure workloads are rescheduled + log "Waiting 60s for workloads to stabilize on new nodes..." + sleep 60 + + # Verify node is no longer needed (new nodes are handling workload) + success "Node $node successfully drained and can be terminated" + else + warn "Node $node no longer exists in cluster, skipping" + fi + done + + # Wait for all old nodes to be terminated and all new nodes ready + log "Waiting for all nodes to reach ready state..." + wait_for_new_nodes $old_node_count + + # All new nodes should be ready and uncordoned by default, but verify + local new_nodes + new_nodes=$(get_node_group_nodes) + for node in $new_nodes; do + # Check if node is cordoned, if so uncordon it + if kubectl get node "$node" -o json | jq -r '.spec.unschedulable' | grep -q true; then + uncordon_node "$node" + fi + done + + echo "----------------------------------------" + success "Node group upgrade completed successfully!" + + # Run post-upgrade validation + validate_upgrade + + UPGRADE_IN_PROGRESS=false +} + +# Main execution +main() { + # Parse arguments + if [ "$#" -lt 3 ]; then + error "Usage: $0 <cluster-name> <node-group-name> <target-version> [--rollback]" + fi + + CLUSTER_NAME="$1" + NODE_GROUP_NAME="$2" + TARGET_VERSION="$3" + + if [ "$#" -eq 4 ] && [ "$4" = "--rollback" ]; then + ROLLBACK_MODE=true + log "Running in rollback mode..." 
+ # Update kubeconfig + aws eks update-kubeconfig --name "$CLUSTER_NAME" + perform_rollback + exit 0 + fi + + log "Starting node group upgrade process for $NODE_GROUP_NAME in cluster $CLUSTER_NAME" + log "Target Kubernetes version: $TARGET_VERSION" + echo "----------------------------------------" + + # First run pre-upgrade checks (resolved relative to this script, not the caller's working directory) + log "Running pre-upgrade checks..." + if ! "$(dirname "$0")/pre-upgrade-checks.sh" "$CLUSTER_NAME" "$NODE_GROUP_NAME" "$TARGET_VERSION"; then + error "Pre-upgrade checks failed, aborting upgrade" + fi + echo "----------------------------------------" + + # Update kubeconfig + aws eks update-kubeconfig --name "$CLUSTER_NAME" + + # Proceed with rolling upgrade + perform_rolling_upgrade + + echo "----------------------------------------" + success "Upgrade process fully completed!" +} + +main "$@" \ No newline at end of file
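The bounded wait used for new nodes (poll, sleep, give up after a fixed number of attempts) can be sketched generically as below. This is a hedged sketch rather than the script's own code: `poll_until` is a hypothetical helper, and the command it runs stands in for the `kubectl`/`jq` readiness query. One detail matters for scripts that run under `set -e`: `((attempt++))` returns a non-zero status when the value before the increment is 0, so the sketch increments with arithmetic expansion instead.

```shell
set -eu

# poll_until MAX_ATTEMPTS INTERVAL_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until it succeeds or the attempts
# are exhausted. Returns 0 on success and 1 on timeout.
poll_until() {
  local max_attempts="$1"
  local interval="$2"
  shift 2

  local attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    # Running the command as an `if` condition keeps a failure from
    # aborting the script under `set -e`.
    if "$@"; then
      return 0
    fi
    # Arithmetic expansion is an assignment, so it always succeeds.
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  return 1
}
```

For example, `poll_until 60 30 some_readiness_check` (where `some_readiness_check` is a placeholder for any command or function) would retry every 30 seconds, for roughly 30 minutes of sleeping in total, before giving up.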