PinSpace-Org · ykargeee-bit · Jun 27, 2026 · Jun 27, 2026
diff --git a/infrastructure/docs/upgrade-procedure.md b/infrastructure/docs/upgrade-procedure.md
@@ -0,0 +1,121 @@
+# Kubernetes Node Group Upgrade Procedure
+
+This document outlines the procedure for safely upgrading Kubernetes versions on EKS node groups in the GistPin infrastructure.
+
+## Overview
+
+The upgrade scripts automate a rolling upgrade of EKS node groups and provide the following features:
+- Comprehensive pre-upgrade validation checks
+- Cordoning and draining of nodes before termination
+- Gradual rolling replacement of nodes
+- Post-upgrade validation
+- Automated rollback capability
+
+## Prerequisites
+
+Before running any upgrade, ensure you have:
+
+1. **Required tools installed**:
+   - AWS CLI (configured with appropriate permissions)
+   - kubectl (configured to access your cluster)
+   - jq (for JSON processing)
+
+2. **AWS permissions required**:
+   - `eks:DescribeCluster`
+   - `eks:DescribeNodegroup`
+   - `eks:UpdateNodegroupVersion`
+   - `ec2:DescribeSubnets`
+   - Permissions to update kubeconfig
+
+3. **Cluster readiness**:
+   - The cluster control plane must already be upgraded to the target Kubernetes version
+   - You can only upgrade one minor version at a time (e.g., 1.27 → 1.28 is supported, 1.27 → 1.29 is not)
+   - The node group must have enough capacity to add new nodes during the rolling upgrade (max size > current desired size)
+
+## Pre-Upgrade Checks
+
+The `pre-upgrade-checks.sh` script automatically verifies all prerequisites before an upgrade can begin:
+
+```bash
+./infrastructure/scripts/pre-upgrade-checks.sh <cluster-name> <node-group-name> <target-version>
+```
+
+### Checks Performed
+
+1. **Dependency validation**: Verifies AWS CLI, kubectl, and jq are installed
+2. **Cluster existence**: Confirms the specified EKS cluster exists and is accessible
+3. **Node group eligibility**:
+   - Verifies the node group exists
+   - Checks that the target version is different from current version
+   - Validates only one minor version upgrade is attempted
+   - Confirms target version matches cluster control plane version
+4. **Resource capacity**: Ensures node group has enough max capacity to add new nodes
+5. **Cluster health**:
+   - All nodes are in Ready state
+   - No PodDisruptionBudgets that allow zero disruptions, since these would block draining
+   - All kube-system pods are running
+6. **Subnet capacity**: Verifies sufficient IP addresses are available in subnets
+7. **Node group status**: Confirms the node group is in ACTIVE state with no ongoing operations
+
+## Running the Upgrade
+
+### Standard Upgrade Process
+
+To perform a rolling upgrade of a node group:
+
+```bash
+./infrastructure/scripts/upgrade-node-group.sh <cluster-name> <node-group-name> <target-version>
+```
+
+### What Happens During Upgrade
+
+1. **Pre-flight**: Runs all pre-upgrade checks to ensure eligibility
+2. **Initiate EKS upgrade**: Triggers the AWS EKS node group version update
+3. **Wait for new nodes**: Waits for EKS to provision new nodes with the target version
+4. **Process old nodes sequentially**:
+   - **Cordon**: Marks the node as unschedulable to prevent new pods from being assigned
+   - **Drain**: Evicts all existing pods from the node (respects PodDisruptionBudgets)
+   - Waits for workloads to reschedule onto new nodes
+5. **Post-upgrade validation**: After all nodes are replaced, verifies:
+   - All nodes are running the target Kubernetes version
+   - All nodes are in Ready state
+   - All pods are running correctly
+   - The node group is back to ACTIVE state
+
+## Rollback Procedure
+
+If an upgrade fails or issues are discovered after the upgrade, you can roll back to the previous version:
+
+```bash
+./infrastructure/scripts/upgrade-node-group.sh <cluster-name> <node-group-name> <original-version> --rollback
+```
+
+### Rollback Process
+
+1. The rollback script retrieves the node group details
+2. Initiates an EKS node group version update back to the original version
+3. Waits for the rollback operation to complete
+4. Verifies all nodes are running the original version
+5. Performs health checks to ensure cluster stability after rollback
+
+## Best Practices
+
+1. **Test first**: Always test the upgrade procedure in a staging environment first
+2. **Schedule during low traffic**: Run upgrades during periods of lower application traffic
+3. **Monitor closely**: Keep an eye on cluster metrics, application logs, and node health during the upgrade
+4. **Backup critical data**: Ensure all persistent volumes have recent backups before performing infrastructure changes
+5. **Upgrade sequentially**: If upgrading multiple node groups, upgrade them one at a time to maintain capacity
+6. **Verify workload resiliency**: Ensure your applications are designed to tolerate node failures and rescheduling
+
+## Troubleshooting
+
+### Common Issues
+
+1. **Node draining fails**: This is often due to PodDisruptionBudgets that block evictions. Check the pre-upgrade warnings for PDBs with 0 disruptions allowed.
+2. **New nodes fail to join**: Check subnet IP capacity, security group configurations, and AWS service limits.
+3. **Upgrade times out**: The script enforces built-in timeouts. If timeouts recur on larger clusters, increase the script's timeout values.
+4. **Workload issues after upgrade**: Check application logs for compatibility issues with the new Kubernetes version. Roll back immediately if critical issues are found.
+
+## Emergency Contact
+
+In case of issues during upgrade that require immediate assistance, follow the emergency procedures outlined in [emergency-procedures.md](runbooks/emergency-procedures.md).
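
Review note on `upgrade-procedure.md`: the cordon and drain step listed under "What Happens During Upgrade" could look roughly like the sketch below. It is not taken from `upgrade-node-group.sh`; the `run` and `drain_node` helpers, the `DRY_RUN` switch, and the 300s default timeout are assumptions made for illustration. The `kubectl` flags themselves are standard.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: cordon one node, then drain it.
# $1 = node name, $2 = drain timeout (assumed default: 300s)
drain_node() {
  local node="$1" timeout="${2:-300s}"
  # Mark the node unschedulable so no new pods are assigned to it.
  run kubectl cordon "$node" || return 1
  # Evict pods through the eviction API, which honors PodDisruptionBudgets.
  run kubectl drain "$node" \
    --ignore-daemonsets \
    --delete-emptydir-data \
    --timeout="$timeout"
}
```

Calling `DRY_RUN=1 drain_node <node-name>` prints the two `kubectl` commands without touching the cluster, which is a cheap way to review what a run would do before scheduling it.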
diff --git a/infrastructure/monitoring/alertmanager-routes.yml b/infrastructure/monitoring/alertmanager-routes.yml
@@ -0,0 +1,82 @@
+# Alertmanager routes configuration for log-based alerts
+# Extends the main alertmanager.yml configuration
+
+route:
+  # Log-specific routing rules that extend the base configuration
+  routes:
+    # Critical log errors - route to critical alerts channel with high priority
+    - match_re:
+        type: "^(log-error|database-error|security-alert)$"
+        severity: critical
+      receiver: slack-critical
+      group_by: ['alertname', 'instance', 'type']
+      group_wait: 10s  # Short wait for critical issues
+      group_interval: 5m
+      repeat_interval: 1h  # More frequent repeats for active log issues
+      continue: true  # Keep evaluating sibling routes so critical security-alert alerts also reach slack-security
+
+    # Warning-level log alerts
+    - match_re:
+        type: "^(log-error|log-volume|log-anomaly)$"
+        severity: warning
+      receiver: slack-warnings
+      group_by: ['alertname', 'type']
+      group_wait: 30s
+      group_interval: 5m
+      repeat_interval: 2h
+      continue: false
+
+    # Security-specific routing - send all security alerts to dedicated security channel
+    - match:
+        type: "security-alert"
+      receiver: slack-security
+      group_by: ['alertname']
+      group_wait: 5s  # Near-immediate delivery for security alerts
+      group_interval: 3m
+      repeat_interval: 30m  # Frequent repeats for ongoing security issues
+      continue: true
+
+# Inhibit rules for log alerts - prevent alert spam
+inhibit_rules:
+  # If a critical log error is firing, suppress lower-severity warnings from the same instance
+  - source_match:
+      severity: critical
+      type: log-error
+    target_match:
+      severity: warning
+      type: log-error
+    equal: ['instance']
+
+  # If there's a log volume anomaly, suppress high log volume alerts for the same instance
+  - source_match:
+      alertname: "LogVolumeAnomaly"
+    target_match:
+      alertname: "HighLogVolume"
+    equal: ['instance']
+
+  # Catch-all rule on instance + message_pattern, kept in the single inhibit_rules list
+  # above (a repeated top-level inhibit_rules key is a duplicate YAML key).
+  # NOTE: alerts matching both source and target cannot inhibit each other, so actual
+  # deduplication relies on group_by and the dedupe window in log-patterns.json.
+  - source_match_re:
+      alertname: ".*"
+    target_match_re:
+      alertname: ".*"
+    equal: ['instance', 'message_pattern']
+
+receivers:
+  # Dedicated security receiver for all security-related log alerts
+  - name: slack-security
+    slack_configs:
+      - channel: '#alerts-security'
+        title: '[{{ .Status | toUpper }}][SECURITY] {{ .CommonLabels.alertname }}'
+        text: |
+          :shield: <!channel> *SECURITY ALERT* :shield:
+          {{ range .Alerts }}
+          *Alert:* {{ .Annotations.summary }}
+          *Description:* {{ .Annotations.description }}
+          *Instance:* {{ .Labels.instance }}
+          *Runbook:* {{ .Annotations.runbook_url }}
+          {{ end }}
+        send_resolved: true
+        # <!channel> in the text above notifies the channel; slack_configs has no mention_channel field
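
Review note on `alertmanager-routes.yml`: whether a critical `security-alert` reaches `slack-security` depends on the `continue` flags of the routes that match before it. One way to check is `amtool config routes test`. The sketch below assumes the routes have been merged into one complete `alertmanager.yml` and that `amtool` is installed; the function name and the expected receiver order are assumptions, not part of this PR.

```bash
#!/usr/bin/env bash
# Hypothetical helper: verify that a critical security alert is routed to both
# slack-critical and slack-security. $1 = path to a complete Alertmanager config.
check_security_routing() {
  local config="$1"
  if ! command -v amtool >/dev/null 2>&1; then
    echo "amtool not found; skipping route check" >&2
    return 2
  fi
  # Resolves the receivers for the given label set and fails when they differ
  # from the expected comma-separated list.
  amtool config routes test \
    --config.file="$config" \
    --verify.receivers=slack-critical,slack-security \
    severity=critical type=security-alert
}
```

Running `check_security_routing infrastructure/monitoring/alertmanager.yml` in CI would surface a route that stops evaluation too early, since `amtool` exits non-zero when the resolved receivers do not match.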
diff --git a/infrastructure/monitoring/alertmanager.yml b/infrastructure/monitoring/alertmanager.yml
@@ -16,6 +16,9 @@ route:
         severity: warning
       receiver: slack-warnings
 
+# Include log alert routes (assumes a templating step such as Helm renders this file; Alertmanager has no native include)
+{{ include "alertmanager-routes.yml" . }}
+
 receivers:
   - name: slack-critical
     slack_configs:
@@ -27,10 +30,24 @@ receivers:
       - channel: '#alerts-warnings'
         title: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
         text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
+  - name: slack-security
+    slack_configs:
+      - channel: '#alerts-security'
+        title: '[{{ .Status | toUpper }}][SECURITY] {{ .CommonLabels.alertname }}'
+        text: |
+          :shield: <!channel> *SECURITY ALERT* :shield:
+          {{ range .Alerts }}
+          *Alert:* {{ .Annotations.summary }}
+          *Description:* {{ .Annotations.description }}
+          *Instance:* {{ .Labels.instance }}
+          *Runbook:* {{ .Annotations.runbook_url }}
+          {{ end }}
+        send_resolved: true
+        # <!channel> in the text above notifies the channel; slack_configs has no mention_channel field
 
 inhibit_rules:
   - source_match:
       severity: critical
     target_match:
       severity: warning
-    equal: ['alertname']
+    equal: ['alertname']
diff --git a/infrastructure/monitoring/log-alert-rules.yml b/infrastructure/monitoring/log-alert-rules.yml
@@ -0,0 +1,89 @@
+groups:
+  - name: log-error-alerts
+    rules:
+      # Error pattern detection rules
+      - alert: CriticalLogError
+        expr: |
+          increase(logs_total{level="critical"}[5m]) > 0
+        for: 0m
+        labels:
+          severity: critical
+          type: log-error
+        annotations:
+          summary: "Critical error logged in application"
+          description: "Critical error detected in logs: {{ $labels.message }} (instance: {{ $labels.instance }})"
+          runbook_url: "https://github.com/ykargee/GistPin/runbooks/logs/critical-error.md"
+
+      - alert: ErrorLogSpike
+        expr: |
+          rate(logs_total{level="error"}[5m]) > 10
+        for: 2m
+        labels:
+          severity: warning
+          type: log-error
+        annotations:
+          summary: "High error rate detected in logs"
+          description: 'Error rate exceeded 10 errors per second for 2 minutes. Current rate: {{ $value | printf "%.2f" }} errors/sec'
+          runbook_url: "https://github.com/ykargee/GistPin/runbooks/logs/error-spike.md"
+
+      - alert: DatabaseConnectionErrors
+        expr: |
+          increase(logs_total{message=~".*connection refused.*",component="postgres"}[5m]) > 5
+        for: 1m
+        labels:
+          severity: critical
+          type: database-error
+        annotations:
+          summary: "Multiple database connection failures"
+          description: "{{ $value }} database connection failures detected in 5 minutes"
+          runbook_url: "https://github.com/ykargee/GistPin/runbooks/database/connection-failed.md"
+
+      - alert: AuthenticationFailures
+        expr: |
+          increase(logs_total{message=~".*invalid credentials.*|.*authentication failed.*"}[10m]) > 20
+        for: 0m
+        labels:
+          severity: critical
+          type: security-alert
+        annotations:
+          summary: "Multiple authentication failures detected"
+          description: "{{ $value }} authentication failures detected in 10 minutes - possible brute force attempt"
+          runbook_url: "https://github.com/ykargee/GistPin/runbooks/security/auth-failures.md"
+
+      # Rate-based alerting rules
+      - alert: HighLogVolume
+        expr: |
+          rate(logs_total[5m]) > 1000
+        for: 5m
+        labels:
+          severity: warning
+          type: log-volume
+        annotations:
+          summary: "Abnormally high log volume detected"
+          description: 'Log ingestion rate exceeded 1000 logs/sec for 5 minutes. Current rate: {{ $value | printf "%.2f" }} logs/sec'
+          runbook_url: "https://github.com/ykargee/GistPin/runbooks/logs/high-volume.md"
+
+      # Anomaly detection using historical baseline comparison
+      - alert: LogVolumeAnomaly
+        expr: |
+          rate(logs_total[5m]) > 3 * rate(logs_total[5m] offset 24h)
+        for: 5m
+        labels:
+          severity: warning
+          type: log-anomaly
+        annotations:
+          summary: "Log volume anomaly detected"
+          description: 'Current log rate ({{ $value | printf "%.2f" }} logs/sec) is more than 3x the rate at the same time yesterday'
+          runbook_url: "https://github.com/ykargee/GistPin/runbooks/logs/anomaly.md"
+
+      - alert: ErrorRateAnomaly
+        expr: |
+          rate(logs_total{level="error"}[5m]) > 5 * rate(logs_total{level="error"}[5m] offset 24h)
+        for: 3m
+        labels:
+          severity: warning
+          type: log-anomaly
+        annotations:
+          summary: "Error rate anomaly detected"
+          description: "Current error rate is more than 5x the rate at the same time yesterday"
+          runbook_url: "https://github.com/ykargee/GistPin/runbooks/logs/error-anomaly.md"
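
Review note on `log-alert-rules.yml`: rule files like this one can be linted before merge with `promtool check rules`, which reports YAML syntax errors (for example, unescaped quotes inside a quoted annotation) and invalid PromQL. The sketch below is a minimal wrapper under the assumption that `promtool` is available in the CI image; the function name is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: lint one or more Prometheus rule files.
# Returns 2 when promtool is unavailable, otherwise promtool's own exit code.
check_rule_files() {
  if ! command -v promtool >/dev/null 2>&1; then
    echo "promtool not found; skipping rule check" >&2
    return 2
  fi
  # promtool exits non-zero if any file fails to parse or contains an invalid expression.
  promtool check rules "$@"
}
```

A typical call would be `check_rule_files infrastructure/monitoring/log-alert-rules.yml`, run as a pre-merge step so that a malformed rule never reaches the Prometheus server.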