diff --git a/infrastructure/docs/upgrade-procedure.md b/infrastructure/docs/upgrade-procedure.md
new file mode 100644
index 00000000..3b564f63
--- /dev/null
+++ b/infrastructure/docs/upgrade-procedure.md
@@ -0,0 +1,121 @@
+# Kubernetes Node Group Upgrade Procedure
+
+This document outlines the procedure for safely upgrading Kubernetes versions on EKS node groups in the GistPin infrastructure.
+
+## Overview
+
+The upgrade scripts automate a rolling upgrade of EKS node groups and provide:
+- Comprehensive pre-upgrade validation checks
+- Cordoning and draining of nodes before termination
+- Gradual rolling replacement of nodes
+- Post-upgrade validation
+- Rollback via the `--rollback` flag
+
+## Prerequisites
+
+Before running any upgrade, ensure you have:
+
+1. **Required tools installed**:
+   - AWS CLI (configured with appropriate permissions)
+   - kubectl (configured to access your cluster)
+   - jq (for JSON processing)
+
+2. **AWS permissions required**:
+   - `eks:DescribeCluster`
+   - `eks:DescribeNodegroup`
+   - `eks:UpdateNodegroupVersion`
+   - `ec2:DescribeSubnets`
+   - Write access to the local kubeconfig file (`aws eks update-kubeconfig` itself relies on `eks:DescribeCluster`)
+
+3. **Cluster readiness**:
+   - The cluster control plane must already be upgraded to the target Kubernetes version
+   - You can only upgrade one minor version at a time (e.g., 1.27 → 1.28 is supported, 1.27 → 1.29 is not)
+   - The node group must have enough capacity to add new nodes during the rolling upgrade (max size > current desired size)
+
+## Pre-Upgrade Checks
+
+The `pre-upgrade-checks.sh` script automatically verifies all prerequisites before an upgrade can begin:
+
+```bash
+./infrastructure/scripts/pre-upgrade-checks.sh <cluster-name> <node-group-name> <target-version>
+```
+
+### Checks Performed
+
+1. **Dependency validation**: Verifies AWS CLI, kubectl, and jq are installed
+2. **Cluster existence**: Confirms the specified EKS cluster exists and is accessible
+3. **Node group eligibility**:
+   - Verifies the node group exists
+   - Checks that the target version differs from the current version
+   - Validates that the upgrade spans at most one minor version
+   - Confirms target version matches cluster control plane version
+4. **Resource capacity**: Ensures node group has enough max capacity to add new nodes
+5. **Cluster health**:
+   - All nodes are in Ready state (the check fails otherwise)
+   - No PodDisruptionBudgets with 0 allowed disruptions, which would block draining (warning only)
+   - All kube-system pods are running (warning only)
+6. **Subnet capacity**: Warns if any node group subnet has fewer than 10 available IP addresses
+7. **Node group status**: Confirms the node group is in ACTIVE state with no ongoing operations
+
+## Running the Upgrade
+
+### Standard Upgrade Process
+
+To perform a rolling upgrade of a node group:
+
+```bash
+./infrastructure/scripts/upgrade-node-group.sh <cluster-name> <node-group-name> <target-version>
+```
+
+### What Happens During Upgrade
+
+1. **Pre-flight**: Runs all pre-upgrade checks to ensure eligibility
+2. **Initiate EKS upgrade**: Triggers the AWS EKS node group version update
+3. **Wait for new nodes**: Waits for EKS to provision new nodes with the target version
+4. **Process old nodes sequentially**:
+   - **Cordon**: Marks the node as unschedulable to prevent new pods from being assigned
+   - **Drain**: Evicts pods from the node, skipping DaemonSet-managed pods (respects PodDisruptionBudgets, 5-minute timeout per node)
+   - Waits for workloads to reschedule onto new nodes
+5. **Post-upgrade validation**: After all nodes are replaced, verifies:
+   - All nodes are running the target Kubernetes version
+   - All nodes are in Ready state
+   - All pods are running correctly
+   - The node group is back to ACTIVE state
+
+## Rollback Procedure
+
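+If an upgrade fails or issues are discovered afterwards, you can roll back to the previous version. Note that the EKS `UpdateNodegroupVersion` API reference states that a managed node group cannot be rolled back to an earlier Kubernetes version, so confirm this path works for your node groups in staging before relying on it: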
+
+```bash
+./infrastructure/scripts/upgrade-node-group.sh <cluster-name> <node-group-name> <original-version> --rollback
+```
+
+### Rollback Process
+
+1. The script retrieves the node group details
+2. Initiates an EKS node group version update back to the original version
+3. Waits for the rollback operation to complete
+4. Verifies all nodes are running the original version
+5. Performs health checks to ensure cluster stability after rollback
+
+## Best Practices
+
+1. **Test first**: Always test the upgrade procedure in a staging environment first
+2. **Schedule during low traffic**: Run upgrades during periods of lower application traffic
+3. **Monitor closely**: Keep an eye on cluster metrics, application logs, and node health during the upgrade
+4. **Backup critical data**: Ensure all persistent volumes have recent backups before performing infrastructure changes
+5. **Upgrade sequentially**: If upgrading multiple node groups, upgrade them one at a time to maintain capacity
+6. **Verify workload resiliency**: Ensure your applications are designed to tolerate node failures and rescheduling
+
+## Troubleshooting
+
+### Common Issues
+
+1. **Node draining fails**: This is often due to PodDisruptionBudgets that block evictions. Check the pre-upgrade warnings for PDBs with 0 disruptions allowed.
+2. **New nodes fail to join**: Check subnet IP capacity, security group configurations, and AWS service limits.
+3. **Upgrade times out**: The script waits up to 30 minutes for new nodes to become Ready and up to 5 minutes for each node drain. If these limits are hit frequently, increase them for larger clusters.
+4. **Workload issues after upgrade**: Check application logs for compatibility issues with the new Kubernetes version. Roll back immediately if critical issues are found.
+
+## Emergency Contact
+
+In case of issues during upgrade that require immediate assistance, follow the emergency procedures outlined in [emergency-procedures.md](runbooks/emergency-procedures.md).
\ No newline at end of file
diff --git a/infrastructure/monitoring/alertmanager-routes.yml b/infrastructure/monitoring/alertmanager-routes.yml
new file mode 100644
index 00000000..158253d3
--- /dev/null
+++ b/infrastructure/monitoring/alertmanager-routes.yml
@@ -0,0 +1,82 @@
+# Alertmanager routes configuration for log-based alerts
+# Extends the main alertmanager.yml configuration
+
+route:
+  # Log-specific routing rules that extend the base configuration
+  routes:
+    # Critical log errors - route to critical alerts channel with high priority
+    - match_re:
+        type: "log-error|database-error|security-alert"  # Alertmanager anchors regexes itself
+        severity: critical
+      receiver: slack-critical
+      group_by: ['alertname', 'instance', 'type']
+      group_wait: 10s  # Short wait for critical issues
+      group_interval: 5m
+      repeat_interval: 1h  # More frequent repeats for active log issues
+      continue: true  # Fall through so critical security alerts also reach the slack-security route below
+
+    # Warning-level log alerts
+    - match_re:
+        type: "log-error|log-volume|log-anomaly"  # Alertmanager anchors regexes itself
+        severity: warning
+      receiver: slack-warnings
+      group_by: ['alertname', 'type']
+      group_wait: 30s
+      group_interval: 5m
+      repeat_interval: 2h
+      continue: false
+
+    # Security-specific routing - send all security alerts to dedicated security channel
+    - match:
+        type: "security-alert"
+      receiver: slack-security
+      group_by: ['alertname']
+      group_wait: 5s  # Immediately send security alerts
+      group_interval: 3m
+      repeat_interval: 30m  # Frequent repeats for ongoing security issues
+      continue: true
+
+# Inhibit rules for log alerts - prevent alert spam
+inhibit_rules:
+  # If a critical log error is firing, suppress lower-severity warnings from the same instance
+  - source_match:
+      severity: critical
+      type: log-error
+    target_match:
+      severity: warning
+      type: log-error
+    equal: ['instance']
+
+  # If there's a log volume anomaly, suppress high log volume alerts for the same instance
+  - source_match:
+      alertname: "LogVolumeAnomaly"
+    target_match:
+      alertname: "HighLogVolume"
+    equal: ['instance']
+
+  # Deduplication settings to prevent alert flooding
+  # (appended to the inhibit_rules list above; repeating the top-level key would be a duplicate YAML key)
+  - source_match_re:
+      alertname: ".*"
+    target_match_re:
+      alertname: ".*"
+    equal: ['instance', 'message_pattern']
+    # Prevent duplicate alerts from same instance with same error pattern
+    # This works with the dedupe window in log-patterns.json to enforce deduplication
+
+receivers:
+  # Dedicated security receiver for all security-related log alerts
+  - name: slack-security
+    slack_configs:
+      - channel: '#alerts-security'
+        title: '[{{ .Status | toUpper }}][SECURITY] {{ .CommonLabels.alertname }}'
+        text: |
+          <!channel> :shield: *SECURITY ALERT* :shield:
+          {{ range .Alerts }}
+          *Alert:* {{ .Annotations.summary }}
+          *Description:* {{ .Annotations.description }}
+          *Instance:* {{ .Labels.instance }}
+          *Runbook:* {{ .Annotations.runbook_url }}
+          {{ end }}
+        send_resolved: true
+        # @channel is triggered by the <!channel> token in the text above; slack_configs has no mention_channel field
\ No newline at end of file
diff --git a/infrastructure/monitoring/alertmanager.yml b/infrastructure/monitoring/alertmanager.yml
index 48ca9046..2d322fa9 100644
--- a/infrastructure/monitoring/alertmanager.yml
+++ b/infrastructure/monitoring/alertmanager.yml
@@ -16,6 +16,9 @@ route:
         severity: warning
       receiver: slack-warnings
 
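+# Include log alert routes (Helm-style include; must be rendered before Alertmanager loads this file, as Alertmanager has no include directive)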
+{{ include "alertmanager-routes.yml" . }}
+
 receivers:
   - name: slack-critical
     slack_configs:
@@ -27,10 +30,24 @@ receivers:
       - channel: '#alerts-warnings'
         title: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
         text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
+  - name: slack-security
+    slack_configs:
+      - channel: '#alerts-security'
+        title: '[{{ .Status | toUpper }}][SECURITY] {{ .CommonLabels.alertname }}'
+        text: |
+          <!channel> :shield: *SECURITY ALERT* :shield:
+          {{ range .Alerts }}
+          *Alert:* {{ .Annotations.summary }}
+          *Description:* {{ .Annotations.description }}
+          *Instance:* {{ .Labels.instance }}
+          *Runbook:* {{ .Annotations.runbook_url }}
+          {{ end }}
+        send_resolved: true
+        # @channel is triggered by the <!channel> token in the text above; slack_configs has no mention_channel field
 
 inhibit_rules:
   - source_match:
       severity: critical
     target_match:
       severity: warning
-    equal: ['alertname']
+    equal: ['alertname']
\ No newline at end of file
diff --git a/infrastructure/monitoring/log-alert-rules.yml b/infrastructure/monitoring/log-alert-rules.yml
new file mode 100644
index 00000000..e5f4c85a
--- /dev/null
+++ b/infrastructure/monitoring/log-alert-rules.yml
@@ -0,0 +1,89 @@
+groups:
+  - name: log-error-alerts
+    rules:
+      # Error pattern detection rules
+      - alert: CriticalLogError
+        expr: |
+          increase(logs_total{level="critical"}[5m]) > 0
+        for: 0m
+        labels:
+          severity: critical
+          type: log-error
+        annotations:
+          summary: "Critical error logged in application"
+          description: "Critical error detected in logs: {{ $labels.message }} (instance: {{ $labels.instance }})"
+          runbook_url: "https://github.com/ykargee/GistPin/runbooks/logs/critical-error.md"
+
+      - alert: ErrorLogSpike
+        expr: |
+          rate(logs_total{level="error"}[5m]) > 10
+        for: 2m
+        labels:
+          severity: warning
+          type: log-error
+        annotations:
+          summary: "High error rate detected in logs"
+          description: 'Error rate exceeded 10 errors per second for 2 minutes. Current rate: {{ $value | printf "%.2f" }} errors/sec'
+          runbook_url: "https://github.com/ykargee/GistPin/runbooks/logs/error-spike.md"
+
+      - alert: DatabaseConnectionErrors
+        expr: |
+          increase(logs_total{message=~".*connection refused.*",component="postgres"}[5m]) > 5
+        for: 1m
+        labels:
+          severity: critical
+          type: database-error
+        annotations:
+          summary: "Multiple database connection failures"
+          description: "{{ $value }} database connection failures detected in 5 minutes"
+          runbook_url: "https://github.com/ykargee/GistPin/runbooks/database/connection-failed.md"
+
+      - alert: AuthenticationFailures
+        expr: |
+          increase(logs_total{message=~".*invalid credentials.*|.*authentication failed.*"}[10m]) > 20
+        for: 0m
+        labels:
+          severity: critical
+          type: security-alert
+        annotations:
+          summary: "Multiple authentication failures detected"
+          description: "{{ $value }} authentication failures detected in 10 minutes - possible brute force attempt"
+          runbook_url: "https://github.com/ykargee/GistPin/runbooks/security/auth-failures.md"
+
+      # Rate-based alerting rules
+      - alert: HighLogVolume
+        expr: |
+          rate(logs_total[5m]) > 1000
+        for: 5m
+        labels:
+          severity: warning
+          type: log-volume
+        annotations:
+          summary: "Abnormally high log volume detected"
+          description: 'Log ingestion rate exceeded 1000 logs/sec for 5 minutes. Current rate: {{ $value | printf "%.2f" }} logs/sec'
+          runbook_url: "https://github.com/ykargee/GistPin/runbooks/logs/high-volume.md"
+
+      # Anomaly detection using historical baseline comparison
+      - alert: LogVolumeAnomaly
+        expr: |
+          rate(logs_total[5m]) > 3 * rate(logs_total[5m] offset 24h)
+        for: 5m
+        labels:
+          severity: warning
+          type: log-anomaly
+        annotations:
+          summary: "Log volume anomaly detected"
+          description: 'Current log rate ({{ $value | printf "%.2f" }} logs/sec) is more than 3x the rate at the same time yesterday'
+          runbook_url: "https://github.com/ykargee/GistPin/runbooks/logs/anomaly.md"
+
+      - alert: ErrorRateAnomaly
+        expr: |
+          rate(logs_total{level="error"}[5m]) > 5 * rate(logs_total{level="error"}[5m] offset 24h)
+        for: 3m
+        labels:
+          severity: warning
+          type: log-anomaly
+        annotations:
+          summary: "Error rate anomaly detected"
+          description: "Current error rate is more than 5x the rate at the same time yesterday"
+          runbook_url: "https://github.com/ykargee/GistPin/runbooks/logs/error-anomaly.md"
\ No newline at end of file
diff --git a/infrastructure/monitoring/log-patterns.json b/infrastructure/monitoring/log-patterns.json
new file mode 100644
index 00000000..5982568e
--- /dev/null
+++ b/infrastructure/monitoring/log-patterns.json
@@ -0,0 +1,93 @@
+{
+  "error_patterns": [
+    {
+      "name": "critical_system_errors",
+      "patterns": [
+        "out of memory",
+        "stack overflow",
+        "segmentation fault",
+        "unhandled exception",
+        "fatal error",
+        "panic:"
+      ],
+      "level": "critical",
+      "alert_threshold": 1,
+      "window": "5m",
+      "deduplication_window": "15m"
+    },
+    {
+      "name": "database_errors",
+      "patterns": [
+        "connection refused",
+        "connection reset",
+        "deadlock detected",
+        "query timeout",
+        "too many connections",
+        "integrity constraint violation"
+      ],
+      "level": "error",
+      "alert_threshold": 5,
+      "window": "5m",
+      "deduplication_window": "10m"
+    },
+    {
+      "name": "security_errors",
+      "patterns": [
+        "invalid credentials",
+        "authentication failed",
+        "unauthorized access",
+        "forbidden access",
+        "csrf token invalid",
+        "xss attempt detected",
+        "sql injection attempt"
+      ],
+      "level": "warning",
+      "alert_threshold": 20,
+      "window": "10m",
+      "deduplication_window": "5m"
+    },
+    {
+      "name": "api_errors",
+      "patterns": [
+        "internal server error",
+        "bad gateway",
+        "service unavailable",
+        "gateway timeout",
+        "bad request",
+        "not found"
+      ],
+      "level": "error",
+      "alert_threshold": 10,
+      "window": "5m",
+      "deduplication_window": "5m"
+    },
+    {
+      "name": "network_errors",
+      "patterns": [
+        "connection timed out",
+        "network unreachable",
+        "dns resolution failed",
+        "tls handshake failed",
+        "ssl error"
+      ],
+      "level": "warning",
+      "alert_threshold": 15,
+      "window": "10m",
+      "deduplication_window": "10m"
+    }
+  ],
+  "anomaly_detection": {
+    "enabled": true,
+    "baseline_window": "7d",
+    "training_window": "24h",
+    "sensitivity": 2.0,
+    "seasonality": true,
+    "seasonality_period": "24h"
+  },
+  "deduplication": {
+    "enabled": true,
+    "default_window": "15m",
+    "fields": ["instance", "level", "message_pattern"],
+    "max_alerts_per_window": 5
+  }
+}
\ No newline at end of file
diff --git a/infrastructure/monitoring/prometheus.yml b/infrastructure/monitoring/prometheus.yml
index 61a5d531..b1e0c943 100644
--- a/infrastructure/monitoring/prometheus.yml
+++ b/infrastructure/monitoring/prometheus.yml
@@ -11,6 +11,7 @@ alerting:
 rule_files:
   - "alert-rules.yml"
   - "gistpin-exporter-rules.yml"
+  - "log-alert-rules.yml"
 
 scrape_configs:
   - job_name: "gistpin-backend"
diff --git a/infrastructure/scripts/pre-upgrade-checks.sh b/infrastructure/scripts/pre-upgrade-checks.sh
new file mode 100644
index 00000000..4b8a890c
--- /dev/null
+++ b/infrastructure/scripts/pre-upgrade-checks.sh
@@ -0,0 +1,236 @@
+#!/bin/bash
+set -euo pipefail
+
+# EKS Node Group Pre-Upgrade Checks
+# Usage: ./pre-upgrade-checks.sh <cluster-name> <node-group-name> <target-version>
+
+# Color codes for output
+RED='\033[0;31m'
+GREEN='\033[0;32m'
+YELLOW='\033[1;33m'
+NC='\033[0m' # No Color
+
+log() {
+    echo -e "[$(date +'%Y-%m-%dT%H:%M:%S%z')] $1"
+}
+
+error() {
+    log "${RED}ERROR: $1${NC}" >&2
+    exit 1
+}
+
+warn() {
+    log "${YELLOW}WARNING: $1${NC}"
+}
+
+success() {
+    log "${GREEN}SUCCESS: $1${NC}"
+}
+
+# Check required arguments
+if [ "$#" -ne 3 ]; then
+    error "Usage: $0 <cluster-name> <node-group-name> <target-version>"
+fi
+
+CLUSTER_NAME="$1"
+NODE_GROUP_NAME="$2"
+TARGET_VERSION="$3"
+
+# Check if required tools are installed
+check_dependencies() {
+    log "Checking required dependencies..."
+    
+    if ! command -v aws &> /dev/null; then
+        error "AWS CLI is not installed"
+    fi
+    
+    if ! command -v kubectl &> /dev/null; then
+        error "kubectl is not installed"
+    fi
+    
+    if ! command -v jq &> /dev/null; then
+        error "jq is not installed"
+    fi
+    
+    success "All dependencies are installed"
+}
+
+# Verify cluster exists
+check_cluster_exists() {
+    log "Checking if cluster $CLUSTER_NAME exists..."
+    
+    if ! aws eks describe-cluster --name "$CLUSTER_NAME" &> /dev/null; then
+        error "Cluster $CLUSTER_NAME does not exist or you don't have access to it"
+    fi
+    
+    success "Cluster $CLUSTER_NAME found"
+}
+
+# Verify node group exists and get current version
+check_node_group_exists() {
+    log "Checking if node group $NODE_GROUP_NAME exists..."
+    
+    local node_group_info
+    node_group_info=$(aws eks describe-nodegroup --cluster-name "$CLUSTER_NAME" --nodegroup-name "$NODE_GROUP_NAME" 2>/dev/null) || error "Node group $NODE_GROUP_NAME not found in cluster $CLUSTER_NAME"
+    
+    CURRENT_VERSION=$(echo "$node_group_info" | jq -r '.nodegroup.version')
+    log "Current node group version: $CURRENT_VERSION"
+    log "Target version: $TARGET_VERSION"
+    
+    # Validate version order
+    if [ "$CURRENT_VERSION" = "$TARGET_VERSION" ]; then
+        error "Node group is already on version $TARGET_VERSION"
+    fi
+    
+    # Check if upgrade is allowed (only one minor version jump at a time in EKS)
+    local current_minor=$(echo "$CURRENT_VERSION" | cut -d. -f2)
+    local target_minor=$(echo "$TARGET_VERSION" | cut -d. -f2)
+    
+    if [ $((target_minor - current_minor)) -gt 1 ]; then
+        error "Cannot upgrade from $CURRENT_VERSION to $TARGET_VERSION. Only one minor version jump is allowed."
+    fi
+    
+    success "Node group $NODE_GROUP_NAME is eligible for upgrade"
+}
+
+# Check cluster version compatibility
+check_cluster_version_compatibility() {
+    log "Checking cluster version compatibility..."
+    
+    local cluster_version
+    cluster_version=$(aws eks describe-cluster --name "$CLUSTER_NAME" | jq -r '.cluster.version')
+    
+    if [ "$(echo "$TARGET_VERSION" | cut -d. -f1-2)" != "$cluster_version" ]; then
+        error "Target version $TARGET_VERSION does not match cluster version $cluster_version. Node group version must match cluster version."
+    fi
+    
+    success "Target version matches cluster version"
+}
+
+# Check cluster health
+check_cluster_health() {
+    log "Checking overall cluster health..."
+    
+    # Update kubeconfig
+    aws eks update-kubeconfig --name "$CLUSTER_NAME"
+    
+    # Check if all nodes are ready
+    local not_ready_nodes
+    not_ready_nodes=$(kubectl get nodes -o json | jq -r '.items[] | select(.status.conditions[] | select(.type=="Ready" and .status!="True")) | .metadata.name')
+    
+    if [ -n "$not_ready_nodes" ]; then
+        error "Found nodes in NotReady state: $not_ready_nodes"
+    fi
+    
+    success "All nodes are in Ready state"
+    
+    # Check for any PodDisruptionBudgets that might block draining
+    local pdb_violations
+    pdb_violations=$(kubectl get poddisruptionbudgets --all-namespaces -o json | jq -r '.items[] | select(.status.disruptionsAllowed == 0) | .metadata.namespace + "/" + .metadata.name')
+    
+    if [ -n "$pdb_violations" ]; then
+        warn "Found PDBs with 0 disruptions allowed: $pdb_violations. These may block node draining."
+    else
+        success "No PDBs that would block node draining found"
+    fi
+    
+    # Check for critical pods that might be affected
+    local kube_system_pods
+    kube_system_pods=$(kubectl get pods -n kube-system -o json | jq -r '.items[] | select(.status.phase!="Running" and .status.phase!="Succeeded") | .metadata.name')
+    
+    if [ -n "$kube_system_pods" ]; then
+        warn "Found non-running pods in kube-system: $kube_system_pods"
+    else
+        success "All kube-system pods are running"
+    fi
+}
+
+# Check node group resources
+check_node_group_resources() {
+    log "Checking node group resources..."
+    
+    local node_group_scaling
+    node_group_scaling=$(aws eks describe-nodegroup --cluster-name "$CLUSTER_NAME" --nodegroup-name "$NODE_GROUP_NAME" | jq -r '.nodegroup.scalingConfig')
+    
+    local min_size=$(echo "$node_group_scaling" | jq -r '.minSize')
+    local max_size=$(echo "$node_group_scaling" | jq -r '.maxSize')
+    local desired_size=$(echo "$node_group_scaling" | jq -r '.desiredSize')
+    
+    log "Node group scaling config - Min: $min_size, Max: $max_size, Desired: $desired_size"
+    
+    # Ensure we have capacity to add new nodes before removing old ones
+    if [ "$desired_size" -ge "$max_size" ]; then
+        error "Node group is already at maximum size. Cannot perform rolling upgrade. Increase maxSize first."
+    fi
+    
+    success "Node group has sufficient capacity for rolling upgrade"
+}
+
+# Check for any ongoing operations
+check_ongoing_operations() {
+    log "Checking for ongoing operations on node group..."
+    
+    local nodegroup_status
+    nodegroup_status=$(aws eks describe-nodegroup --cluster-name "$CLUSTER_NAME" --nodegroup-name "$NODE_GROUP_NAME" | jq -r '.nodegroup.status')
+    
+    if [ "$nodegroup_status" != "ACTIVE" ]; then
+        error "Node group is in $nodegroup_status state. Must be in ACTIVE state to upgrade."
+    fi
+    
+    success "No ongoing operations, node group is ACTIVE"
+}
+
+# Check that each subnet used by the node group has enough free IP addresses for new nodes
+check_subnet_capacity() {
+    log "Checking subnet capacity..."
+    
+    local subnets
+    subnets=$(aws eks describe-nodegroup --cluster-name "$CLUSTER_NAME" --nodegroup-name "$NODE_GROUP_NAME" | jq -r '.nodegroup.subnets[]')
+    
+    for subnet in $subnets; do
+        local available_ips
+        available_ips=$(aws ec2 describe-subnets --subnet-ids "$subnet" | jq -r '.Subnets[0].AvailableIpAddressCount')
+        
+        if [ "$available_ips" -lt 10 ]; then
+            warn "Subnet $subnet has only $available_ips IPs remaining, may impact upgrade"
+        else
+            success "Subnet $subnet has sufficient IP capacity ($available_ips available)"
+        fi
+    done
+}
+
+# Main execution
+main() {
+    log "Starting pre-upgrade checks for node group $NODE_GROUP_NAME in cluster $CLUSTER_NAME"
+    log "Target Kubernetes version: $TARGET_VERSION"
+    echo "----------------------------------------"
+    
+    check_dependencies
+    echo "----------------------------------------"
+    
+    check_cluster_exists
+    echo "----------------------------------------"
+    
+    check_node_group_exists
+    echo "----------------------------------------"
+    
+    check_cluster_version_compatibility
+    echo "----------------------------------------"
+    
+    check_ongoing_operations
+    echo "----------------------------------------"
+    
+    check_node_group_resources
+    echo "----------------------------------------"
+    
+    check_subnet_capacity
+    echo "----------------------------------------"
+    
+    check_cluster_health
+    echo "----------------------------------------"
+    
+    success "All pre-upgrade checks completed successfully! Node group is ready for upgrade."
+    echo "----------------------------------------"
+}
+
+main "$@"
\ No newline at end of file
diff --git a/infrastructure/scripts/upgrade-node-group.sh b/infrastructure/scripts/upgrade-node-group.sh
new file mode 100644
index 00000000..c8c4a743
--- /dev/null
+++ b/infrastructure/scripts/upgrade-node-group.sh
@@ -0,0 +1,308 @@
+#!/bin/bash
+set -euo pipefail
+
+# EKS Node Group Rolling Upgrade Script
+# Usage: ./upgrade-node-group.sh <cluster-name> <node-group-name> <target-version> [--rollback]
+
+# Color codes for output
+RED='\033[0;31m'
+GREEN='\033[0;32m'
+YELLOW='\033[1;33m'
+NC='\033[0m' # No Color
+
+UPGRADE_START_TIME=""
+ORIGINAL_NODE_GROUP_VERSION=""
+UPGRADE_IN_PROGRESS=false
+NODE_GROUP_ARN=""
+ROLLBACK_MODE=false
+
+log() {
+    echo -e "[$(date +'%Y-%m-%dT%H:%M:%S%z')] $1"
+}
+
+error() {
+    log "${RED}ERROR: $1${NC}" >&2
+    if [ "$UPGRADE_IN_PROGRESS" = true ]; then
+        log "${YELLOW}An error occurred during upgrade. Run with --rollback to initiate rollback procedure.${NC}"
+    fi
+    exit 1
+}
+
+warn() {
+    log "${YELLOW}WARNING: $1${NC}"
+}
+
+success() {
+    log "${GREEN}SUCCESS: $1${NC}"
+}
+
+# Function to perform rollback
+perform_rollback() {
+    log "Initiating rollback procedure..."
+    
+    if [ -z "$ORIGINAL_NODE_GROUP_VERSION" ]; then
+        error "Original version not found, cannot rollback safely"
+    fi
+    
+    log "Rolling back node group $NODE_GROUP_NAME from current version to original version $ORIGINAL_NODE_GROUP_VERSION"
+    
+    # Update node group to original version
+    aws eks update-nodegroup-version \
+        --cluster-name "$CLUSTER_NAME" \
+        --nodegroup-name "$NODE_GROUP_NAME" \
+        --version "$ORIGINAL_NODE_GROUP_VERSION"
+    
+    log "Waiting for rollback to complete..."
+    aws eks wait nodegroup-active --cluster-name "$CLUSTER_NAME" --nodegroup-name "$NODE_GROUP_NAME"
+    
+    # Verify all nodes are rolled back
+    local nodes
+    nodes=$(kubectl get nodes -o json | jq -r '.items[] | select(.metadata.labels."eks.amazonaws.com/nodegroup"=="'"$NODE_GROUP_NAME"'") | .metadata.name')
+    
+    for node in $nodes; do
+        local node_version
+        node_version=$(kubectl get node "$node" -o json | jq -r '.status.nodeInfo.kubeletVersion' | sed -E 's/^v([0-9]+\.[0-9]+).*/\1/')
+        if [ "$node_version" != "$ORIGINAL_NODE_GROUP_VERSION" ]; then
+            warn "Node $node still on version $node_version, expected $ORIGINAL_NODE_GROUP_VERSION"
+        else
+            success "Node $node successfully rolled back to $node_version"
+        fi
+    done
+    
+    success "Rollback completed successfully"
+    UPGRADE_IN_PROGRESS=false
+    exit 0
+}
+
+# Cordon a node - mark it as unschedulable
+cordon_node() {
+    local node_name="$1"
+    log "Cordoning node $node_name..."
+    kubectl cordon "$node_name"
+    success "Node $node_name cordoned successfully"
+}
+
+# Drain a node - evict all pods from it
+drain_node() {
+    local node_name="$1"
+    log "Draining node $node_name..."
+    # Allow up to 5 minutes for eviction; skip DaemonSet pods; --force also deletes pods not managed by a controller
+    kubectl drain "$node_name" \
+        --ignore-daemonsets \
+        --delete-emptydir-data \
+        --force \
+        --timeout=300s
+    success "Node $node_name drained successfully"
+}
+
+# Uncordon a node - mark it as schedulable again
+uncordon_node() {
+    local node_name="$1"
+    log "Uncordoning node $node_name..."
+    kubectl uncordon "$node_name"
+    success "Node $node_name uncordoned successfully"
+}
+
+# Get all nodes in the node group
+get_node_group_nodes() {
+    kubectl get nodes -o json | jq -r '.items[] | select(.metadata.labels."eks.amazonaws.com/nodegroup"=="'"$NODE_GROUP_NAME"'") | .metadata.name'
+}
+
+# Wait for new nodes to join and become ready
+wait_for_new_nodes() {
+    local expected_count="$1"
+    log "Waiting for $expected_count new nodes to join cluster and become ready..."
+    
+    local max_attempts=60 # 30 minutes total (30s per attempt)
+    local attempt=0
+    
+    while [ $attempt -lt $max_attempts ]; do
+        local ready_nodes
+        ready_nodes=$(kubectl get nodes -o json | jq -r '.items[] | select(.metadata.labels."eks.amazonaws.com/nodegroup"=="'"$NODE_GROUP_NAME"'") | select(any(.status.conditions[]; .type=="Ready" and .status=="True")) | .metadata.name' | wc -l)
+        
+        if [ "$ready_nodes" -ge "$expected_count" ]; then
+            success "All $expected_count nodes are ready"
+            return 0
+        fi
+        
+        log "Ready nodes: $ready_nodes/$expected_count. Waiting 30s..."
+        sleep 30
+        attempt=$((attempt + 1))  # ((attempt++)) returns status 1 when attempt is 0, which aborts under set -e
+    done
+    
+    error "Timed out waiting for new nodes to become ready"
+}
+
+# Validate upgrade after completion
+validate_upgrade() {
+    log "Starting post-upgrade validation..."
+    
+    # Check all nodes in node group are on target version
+    local nodes
+    nodes=$(get_node_group_nodes)
+    local all_valid=true
+    
+    for node in $nodes; do
+        local node_version
+        node_version=$(kubectl get node "$node" -o json | jq -r '.status.nodeInfo.kubeletVersion' | sed -E 's/^v([0-9]+\.[0-9]+).*/\1/')
+        if [ "$node_version" != "$TARGET_VERSION" ]; then
+            error "Node $node is on version $node_version, expected $TARGET_VERSION"
+            all_valid=false
+        else
+            success "Node $node is running target version $TARGET_VERSION"
+        fi
+    done
+    
+    # Check all pods are running
+    local non_running_pods
+    non_running_pods=$(kubectl get pods --all-namespaces -o json | jq -r '.items[] | select(.status.phase!="Running" and .status.phase!="Succeeded") | .metadata.namespace+"/"+.metadata.name')
+    
+    if [ -n "$non_running_pods" ]; then
+        warn "Found non-running pods after upgrade: $non_running_pods"
+    else
+        success "All pods are in Running/Succeeded state"
+    fi
+    
+    # Check all nodes are Ready
+    local not_ready_nodes
+    not_ready_nodes=$(kubectl get nodes -o json | jq -r '.items[] | select(.status.conditions[] | select(.type=="Ready" and .status!="True")) | .metadata.name')
+    
+    if [ -n "$not_ready_nodes" ]; then
+        error "Found nodes in NotReady state after upgrade: $not_ready_nodes"
+    else
+        success "All nodes are in Ready state"
+    fi
+    
+    # Verify node group status is ACTIVE
+    local nodegroup_status
+    nodegroup_status=$(aws eks describe-nodegroup --cluster-name "$CLUSTER_NAME" --nodegroup-name "$NODE_GROUP_NAME" | jq -r '.nodegroup.status')
+    
+    if [ "$nodegroup_status" != "ACTIVE" ]; then
+        error "Node group is in $nodegroup_status state after upgrade, expected ACTIVE"
+    else
+        success "Node group is ACTIVE"
+    fi
+    
+    if [ "$all_valid" = true ]; then
+        success "All post-upgrade validations passed!"
+    fi
+}
+
+# Perform rolling upgrade of nodes
+perform_rolling_upgrade() {
+    log "Starting rolling upgrade process..."
+    UPGRADE_IN_PROGRESS=true
+    
+    # Get list of current nodes before upgrade starts
+    local old_nodes
+    old_nodes=$(get_node_group_nodes)
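+    # Count words rather than lines so an empty list yields 0 and a space-separated list is counted correctly
+    local old_node_count
+    old_node_count=$(echo "$old_nodes" | wc -w | xargs)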
+    log "Found $old_node_count existing nodes in node group"
+    
+    # Store original node group version
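+    # Describe the node group once and read both fields from the same response
+    local nodegroup_json
+    nodegroup_json=$(aws eks describe-nodegroup --cluster-name "$CLUSTER_NAME" --nodegroup-name "$NODE_GROUP_NAME")
+    ORIGINAL_NODE_GROUP_VERSION=$(echo "$nodegroup_json" | jq -r '.nodegroup.version')
+    # The ARN field in the DescribeNodegroup response is `nodegroupArn`, not `arn`
+    NODE_GROUP_ARN=$(echo "$nodegroup_json" | jq -r '.nodegroup.nodegroupArn')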
+    
+    log "Original version: $ORIGINAL_NODE_GROUP_VERSION, Target version: $TARGET_VERSION"
+    
+    # Initiate node group version update
+    log "Initiating node group version update in EKS..."
+    aws eks update-nodegroup-version \
+        --cluster-name "$CLUSTER_NAME" \
+        --nodegroup-name "$NODE_GROUP_NAME" \
+        --version "$TARGET_VERSION"
+    
+    # Wait for node group to start provisioning new nodes
+    log "Waiting for EKS to provision new nodes..."
+    sleep 60 # Initial wait for AWS to start the update
+    
+    # Wait for new nodes to join
+    wait_for_new_nodes $((old_node_count + 1)) # Wait for at least one new node
+    
+    # Process old nodes one by one
+    for node in $old_nodes; do
+        echo "----------------------------------------"
+        log "Processing old node: $node"
+        
+        # Check if node still exists
+        if kubectl get node "$node" &> /dev/null; then
+            # Cordon and drain the old node
+            cordon_node "$node"
+            drain_node "$node"
+            
+            # Wait a bit to ensure workloads are rescheduled
+            log "Waiting 60s for workloads to stabilize on new nodes..."
+            sleep 60
+            
+            # Drain completed; no further check is made here that the new nodes have taken over the workload
+            success "Node $node successfully drained and can be terminated"
+        else
+            warn "Node $node no longer exists in cluster, skipping"
+        fi
+    done
+    
+    # Wait for all old nodes to be terminated and all new nodes ready
+    log "Waiting for all nodes to reach ready state..."
+    wait_for_new_nodes $old_node_count
+    
+    # All new nodes should be ready and uncordoned by default, but verify
+    local new_nodes
+    new_nodes=$(get_node_group_nodes)
+    for node in $new_nodes; do
+        # Check if node is cordoned, if so uncordon it
+        if kubectl get node "$node" -o json | jq -r '.spec.unschedulable' | grep -q true; then
+            uncordon_node "$node"
+        fi
+    done
+    
+    echo "----------------------------------------"
+    success "Node group upgrade completed successfully!"
+    
+    # Run post-upgrade validation
+    validate_upgrade
+    
+    UPGRADE_IN_PROGRESS=false
+}
+
+# Main execution
+main() {
+    # Parse arguments
+    if [ "$#" -lt 3 ]; then
+        error "Usage: $0 <cluster-name> <node-group-name> <target-version> [--rollback]"
+    fi
+    
+    CLUSTER_NAME="$1"
+    NODE_GROUP_NAME="$2"
+    TARGET_VERSION="$3"
+    
+    if [ "$#" -eq 4 ] && [ "$4" = "--rollback" ]; then
+        ROLLBACK_MODE=true
+        log "Running in rollback mode..."
+        # Update kubeconfig
+        aws eks update-kubeconfig --name "$CLUSTER_NAME"
+        perform_rollback
+        exit 0
+    fi
+    
+    log "Starting node group upgrade process for $NODE_GROUP_NAME in cluster $CLUSTER_NAME"
+    log "Target Kubernetes version: $TARGET_VERSION"
+    echo "----------------------------------------"
+    
+    # First run pre-upgrade checks
+    log "Running pre-upgrade checks..."
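+    # Resolve the checks script relative to this script (assumed to live in the same directory),
+    # so the upgrade can be started from any working directory
+    local script_dir
+    script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+    if ! "$script_dir/pre-upgrade-checks.sh" "$CLUSTER_NAME" "$NODE_GROUP_NAME" "$TARGET_VERSION"; then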
+        error "Pre-upgrade checks failed, aborting upgrade"
+    fi
+    echo "----------------------------------------"
+    
+    # Update kubeconfig
+    aws eks update-kubeconfig --name "$CLUSTER_NAME"
+    
+    # Proceed with rolling upgrade
+    perform_rolling_upgrade
+    
+    echo "----------------------------------------"
+    success "Upgrade process fully completed!"
+}
+
+main "$@"
\ No newline at end of file