diff --git a/skills/cloud/gke-basics/references/gke-app-onboarding.md b/skills/cloud/gke-app-onboarding/SKILL.md
similarity index 57%
rename from skills/cloud/gke-basics/references/gke-app-onboarding.md
rename to skills/cloud/gke-app-onboarding/SKILL.md
index ef6ebbfca9..40f8bd766f 100644
--- a/skills/cloud/gke-basics/references/gke-app-onboarding.md
+++ b/skills/cloud/gke-app-onboarding/SKILL.md
@@ -1,8 +1,20 @@
+---
+name: gke-app-onboarding
+description: >-
+  Onboards applications to GKE, covering containerization, deployment
+  manifests, and migration. Use when onboarding or deploying an application to
+  GKE for the first time, or containerizing an app for GKE. Don't use for
+  general GKE cluster administration or upgrades (use gke-basics or
+  gke-upgrades instead).
+---
+
 # GKE App Onboarding
 
-This reference provides workflows for containerizing and deploying applications to GKE for the first time.
+This skill provides workflows for containerizing and deploying applications
+to GKE for the first time.
 
-> **MCP Tools:** `apply_k8s_manifest`, `get_k8s_resource`, `get_k8s_rollout_status`, `get_k8s_logs`, `describe_k8s_resource`
+> **MCP Tools:** `apply_k8s_manifest`, `get_k8s_resource`,
+> `get_k8s_rollout_status`, `get_k8s_logs`, `describe_k8s_resource`
 
 ## Workflow
 
@@ -10,12 +22,13 @@ This reference provides workflows for containerizing and deploying applications
 
 Before containerizing, assess the application:
 
-- **Language & Framework**: Identify the tech stack
-- **Dependencies**: List required libraries and external services
-- **Configuration**: How is the app configured? (env vars, config files, secrets)
-- **Statefulness**: Does it need persistent storage? (databases, file storage)
-- **Networking**: Port mapping and protocol (HTTP, gRPC, TCP)
-- **Health endpoints**: Does the app expose health check endpoints?
+-   **Language & Framework**: Identify the tech stack
+-   **Dependencies**: List required libraries and external services
+-   **Configuration**: How is the app configured? (env vars, config files,
+    secrets)
+-   **Statefulness**: Does it need persistent storage? (databases, file storage)
+-   **Networking**: Port mapping and protocol (HTTP, gRPC, TCP)
+-   **Health endpoints**: Does the app expose health check endpoints?
 
 ### 2. Containerization
 
@@ -38,14 +51,18 @@ ENTRYPOINT ["/server"]
 ```
 
 **Best practices:**
-- Use multi-stage builds to keep production images small
-- Use distroless or minimal base images to reduce attack surface
-- Run as non-root user
-- Log to `stdout` and `stderr` for Cloud Logging collection
+
+-   Use multi-stage builds to keep production images small
+-   Use distroless or minimal base images to reduce attack surface
+-   Run as non-root user
+-   Log to `stdout` and `stderr` for Cloud Logging collection
 
 **Alternatives:**
-- **Cloud Native Buildpacks** — auto-detect language and build without a Dockerfile: `pack build <image> --builder gcr.io/buildpacks/builder:latest`
-- **Skaffold** — development workflow tool for iterating on containerized apps: `skaffold dev`
+
+-   **Cloud Native Buildpacks** — auto-detect language and build without a
+    Dockerfile: `pack build <image> --builder gcr.io/buildpacks/builder:latest`
+-   **Skaffold** — development workflow tool for iterating on containerized
+    apps: `skaffold dev`
 
 ### 3. Image Management
 
@@ -60,7 +77,8 @@ docker build -t <REGION>-docker.pkg.dev/<PROJECT>/<REPO>/<IMAGE>:<TAG> .
 docker push <REGION>-docker.pkg.dev/<PROJECT>/<REPO>/<IMAGE>:<TAG>
 ```
 
-**Vulnerability scanning**: Enable automatic scanning in Artifact Registry to detect issues in base images and dependencies.
+**Vulnerability scanning**: Enable automatic scanning in Artifact Registry to
+detect issues in base images and dependencies.
 
 ```bash
 # Check scan results
@@ -127,10 +145,12 @@ spec:
 ```
 
 **Checklist for manifests:**
-- Resource requests and limits set
-- Liveness and readiness probes configured
-- At least 2 replicas for production
-- Service type appropriate (ClusterIP for internal, use Gateway API for external)
+
+-   Resource requests and limits set
+-   Liveness and readiness probes configured
+-   At least 2 replicas for production
+-   Service type appropriate (ClusterIP for internal, use Gateway API for
+    external)
 
 ### 5. Deploy
 
@@ -154,7 +174,10 @@ kubectl get pods -l app=my-app
 ## Next Steps
 
 Once the application is running on GKE:
-- Configure autoscaling — see [gke-scaling.md](./gke-scaling.md)
-- Set up observability — see [gke-observability.md](./gke-observability.md)
-- Harden security — see [gke-security.md](./gke-security.md)
-- Configure reliability (PDBs, topology spread) — see [gke-reliability.md](./gke-reliability.md)
+
+-   Configure autoscaling — see [gke-scaling](../gke-scaling/SKILL.md)
+-   Set up observability — see
+    [gke-observability](../gke-observability/SKILL.md)
+-   Harden security — see [gke-security](../gke-security/SKILL.md)
+-   Configure reliability (PDBs, topology spread) — see
+    [gke-reliability](../gke-reliability/SKILL.md)
diff --git a/skills/cloud/gke-basics/references/gke-backup-dr.md b/skills/cloud/gke-backup-dr/SKILL.md
similarity index 61%
rename from skills/cloud/gke-basics/references/gke-backup-dr.md
rename to skills/cloud/gke-backup-dr/SKILL.md
index eb7859d278..17d06e3bc5 100644
--- a/skills/cloud/gke-basics/references/gke-backup-dr.md
+++ b/skills/cloud/gke-backup-dr/SKILL.md
@@ -1,8 +1,19 @@
+---
+name: gke-backup-dr
+description: >-
+  Configures Backup for GKE and disaster recovery plans. Use when configuring
+  GKE backup policies, setting up disaster recovery, or restoring GKE clusters.
+  Don't use for generic database backups or persistent volume configuration
+  (use gke-storage instead).
+---
+
 # GKE Backup & Disaster Recovery
 
-This reference provides workflows for protecting stateful workloads on GKE using Backup for GKE.
+This skill provides workflows for protecting stateful workloads on GKE using
+Backup for GKE.
 
-> **MCP Tools:** `get_cluster`, `update_cluster`. **CLI-only:** `gcloud container backup-restore *`
+> **MCP Tools:** `get_cluster`, `update_cluster`. **CLI-only:**
+> `gcloud container backup-restore *`
 
 ## Workflows
 
@@ -38,9 +49,11 @@ gcloud container backup-restore backup-plans create <PLAN_NAME> \
 ```
 
 **Options:**
-- `--all-namespaces` — back up everything
-- `--included-namespaces=<ns1>,<ns2>` — back up specific namespaces
-- `--backup-encryption-key=<KEY>` — encrypt with Customer-Managed Encryption Key (CMEK)
+
+-   `--all-namespaces` — back up everything
+-   `--included-namespaces=<ns1>,<ns2>` — back up specific namespaces
+-   `--backup-encryption-key=<KEY>` — encrypt with Customer-Managed Encryption
+    Key (CMEK)
 
 ### 3. Create a Manual Backup
 
@@ -79,8 +92,11 @@ gcloud container backup-restore restores create <RESTORE_NAME> \
 
 ## Best Practices
 
-1. **Automate backups**: Always use a cron schedule for production workloads
-2. **Test restores regularly**: Restore to a separate namespace or cluster to verify data integrity
-3. **Cross-region DR**: Store backups in a different region or configure cross-region restore plans
-4. **Encrypt backups**: Use CMEK for compliance and security requirements
-5. **Scope backups**: Back up specific namespaces rather than the entire cluster when possible to reduce restore complexity
+1.  **Automate backups**: Always use a cron schedule for production workloads
+2.  **Test restores regularly**: Restore to a separate namespace or cluster to
+    verify data integrity
+3.  **Cross-region DR**: Store backups in a different region or configure
+    cross-region restore plans
+4.  **Encrypt backups**: Use CMEK for compliance and security requirements
+5.  **Scope backups**: Back up specific namespaces rather than the entire
+    cluster when possible to reduce restore complexity
diff --git a/skills/cloud/gke-basics/SKILL.md b/skills/cloud/gke-basics/SKILL.md
index fe2119da6a..43cc83923b 100644
--- a/skills/cloud/gke-basics/SKILL.md
+++ b/skills/cloud/gke-basics/SKILL.md
@@ -1,11 +1,22 @@
 ---
 name: gke-basics
-description: "Plan, create, and configure production-ready Google Kubernetes Engine (GKE) clusters using the golden path Autopilot configuration. Covers Day-0 checklist, Autopilot vs Standard, networking (private clusters, VPC-native, Gateway API), security (Workload Identity, Secret Manager, RBAC hardening), observability, scaling, cost optimization, and AI/ML inference. WHEN: create GKE cluster, provision GKE environment, design GKE networking, secure GKE, optimize GKE cost, GKE autoscaling, GKE inference, GKE upgrade, GKE observability, GKE multi-tenancy, GKE batch, GKE HPC, GKE compute class."
+description: >-
+  Plans, creates, and configures production-ready GKE clusters using the golden
+  path Autopilot configuration. Covers Day-0 checklist, Autopilot vs Standard,
+  networking, security, observability, scaling, cost optimization, and AI/ML
+  inference. Use when creating GKE clusters, provisioning GKE environments,
+  designing GKE networking, securing GKE, optimizing GKE cost, autoscaling, or
+  upgrading. Don't use if specialized skills for security, networking, scaling,
+  cost, storage, or upgrades are more applicable (use gke-security,
+  gke-networking, gke-scaling, gke-cost, gke-storage, or gke-upgrades instead).
 ---
 
 # Google Kubernetes Engine (GKE) Basics
 
-GKE is a managed Kubernetes platform on Google Cloud for deploying, scaling, and operating containerized applications. This skill defaults to the **golden path Autopilot configuration** — see [gke-golden-path.md](./references/gke-golden-path.md) for defaults, rules, and guardrails.
+GKE is a managed Kubernetes platform on Google Cloud for deploying, scaling, and
+operating containerized applications. This skill defaults to the **golden path
+Autopilot configuration** — see [gke-golden-path](../gke-golden-path/SKILL.md)
+for defaults, rules, and guardrails.
 
 ## Quick Start
 
@@ -19,31 +30,35 @@ kubectl create deployment hello-server \
 
 ## Reference Directory
 
-Load the relevant reference based on trigger keywords. Prefer the most specific match; if ambiguous, ask the user to clarify.
-
-| Scenario | Trigger Keywords | Reference |
-|----------|-----------------|-----------|
-| Core Concepts | Autopilot vs Standard, architecture, pricing, what is GKE | [core-concepts.md](./references/core-concepts.md) |
-| Golden Path & Defaults | golden path, Day-0 checklist, production defaults, cluster defaults | [gke-golden-path.md](./references/gke-golden-path.md) |
-| Cluster Creation | create cluster, new cluster, provision GKE | [gke-cluster-creation.md](./references/gke-cluster-creation.md) |
-| Networking | private cluster, VPC, subnet, Gateway API, DNS, ingress, egress, datapath | [gke-networking.md](./references/gke-networking.md) |
-| Security & IAM | Workload Identity, Secret Manager, RBAC, Binary Auth, hardening, audit, gVisor, IAM roles | [gke-security.md](./references/gke-security.md) |
-| Scaling | HPA, VPA, autoscaler, autoscaling, NAP, scale pods, scale nodes | [gke-scaling.md](./references/gke-scaling.md) |
-| Compute Classes | ComputeClass, machine family, Spot fallback, GPU node pool, node selection | [gke-compute-classes.md](./references/gke-compute-classes.md) |
-| Cost | cost, savings, Spot VMs, rightsizing, CUD, optimize spend, budget | [gke-cost.md](./references/gke-cost.md) |
-| AI/ML Inference | inference, model serving, LLM, GPU, TPU, GIQ, vLLM | [gke-inference.md](./references/gke-inference.md) |
-| Upgrades | upgrade, maintenance window, release channel, patching, version | [gke-upgrades.md](./references/gke-upgrades.md) |
-| Observability | monitoring, logging, Prometheus, Grafana, metrics, alerts, dashboards | [gke-observability.md](./references/gke-observability.md) |
-| Multi-tenancy | multi-tenant, namespace isolation, team access, enterprise, RBAC planning | [gke-multitenancy.md](./references/gke-multitenancy.md) |
-| Batch & HPC | batch, HPC, job queue, high performance, MPI, parallel | [gke-batch-hpc.md](./references/gke-batch-hpc.md) |
-| App Onboarding | containerize, deploy app, Dockerfile, onboard, migrate to GKE | [gke-app-onboarding.md](./references/gke-app-onboarding.md) |
-| Backup & DR | backup, restore, disaster recovery, CMEK | [gke-backup-dr.md](./references/gke-backup-dr.md) |
-| Storage | storage, PVC, persistent volume, StorageClass, Filestore, GCS FUSE | [gke-storage.md](./references/gke-storage.md) |
-| Reliability | PDB, health probe, liveness, readiness, topology spread, graceful shutdown | [gke-reliability.md](./references/gke-reliability.md) |
-| Client Libraries | client library, client-go, kubernetes python, kubernetes java, kubernetes SDK | [client-library-usage.md](./references/client-library-usage.md) |
-| Infrastructure as Code | Terraform, IaC, HCL, infrastructure as code | [iac-usage.md](./references/iac-usage.md) |
-| MCP Server | MCP tools, MCP server, MCP setup | [mcp-usage.md](./references/mcp-usage.md) |
-| CLI / Tools | gcloud, kubectl, commands, how to | [cli-reference.md](./references/cli-reference.md) |
-| Production Audit | production readiness, compliance, golden path check | [gke-cluster-creation.md](./references/gke-cluster-creation.md) |
+Load the relevant reference based on trigger keywords. Prefer the most specific
+match; if ambiguous, ask the user to clarify. Links that start with `../` point
+to sibling skills. If a sibling skill is not installed or cannot be accessed,
+tell the user they may need to install it (e.g., `gke-networking`), and fall
+back to your general GKE knowledge.
+
+Scenario               | Trigger Keywords                                                                          | Reference
+---------------------- | ----------------------------------------------------------------------------------------- | ---------
+Core Concepts          | Autopilot vs Standard, architecture, pricing, what is GKE                                 | [core-concepts.md](./references/core-concepts.md)
+Golden Path & Defaults | golden path, Day-0 checklist, production defaults, cluster defaults                       | [gke-golden-path](../gke-golden-path/SKILL.md)
+Cluster Creation       | create cluster, new cluster, provision GKE                                                | [gke-cluster-creation](../gke-cluster-creation/SKILL.md)
+Networking             | private cluster, VPC, subnet, Gateway API, DNS, ingress, egress, datapath                 | [gke-networking](../gke-networking/SKILL.md)
+Security & IAM         | Workload Identity, Secret Manager, RBAC, Binary Auth, hardening, audit, gVisor, IAM roles | [gke-security](../gke-security/SKILL.md)
+Scaling                | HPA, VPA, autoscaler, autoscaling, NAP, scale pods, scale nodes                           | [gke-scaling](../gke-scaling/SKILL.md)
+Compute Classes        | ComputeClass, machine family, Spot fallback, GPU node pool, node selection                | [gke-compute-classes](../gke-compute-classes/SKILL.md)
+Cost                   | cost, savings, Spot VMs, rightsizing, CUD, optimize spend, budget                         | [gke-cost](../gke-cost/SKILL.md)
+AI/ML Inference        | inference, model serving, LLM, GPU, TPU, GIQ, vLLM                                        | [gke-inference](../gke-inference/SKILL.md)
+Upgrades               | upgrade, maintenance window, release channel, patching, version                           | [gke-upgrades](../gke-upgrades/SKILL.md)
+Observability          | monitoring, logging, Prometheus, Grafana, metrics, alerts, dashboards                     | [gke-observability](../gke-observability/SKILL.md)
+Multi-tenancy          | multi-tenant, namespace isolation, team access, enterprise, RBAC planning                 | [gke-multitenancy](../gke-multitenancy/SKILL.md)
+Batch & HPC            | batch, HPC, job queue, high performance, MPI, parallel                                    | [gke-batch-hpc](../gke-batch-hpc/SKILL.md)
+App Onboarding         | containerize, deploy app, Dockerfile, onboard, migrate to GKE                             | [gke-app-onboarding](../gke-app-onboarding/SKILL.md)
+Backup & DR            | backup, restore, disaster recovery, CMEK                                                  | [gke-backup-dr](../gke-backup-dr/SKILL.md)
+Storage                | storage, PVC, persistent volume, StorageClass, Filestore, GCS FUSE                        | [gke-storage](../gke-storage/SKILL.md)
+Reliability            | PDB, health probe, liveness, readiness, topology spread, graceful shutdown                | [gke-reliability](../gke-reliability/SKILL.md)
+Client Libraries       | client library, client-go, kubernetes python, kubernetes java, kubernetes SDK             | [client-library-usage.md](./references/client-library-usage.md)
+Infrastructure as Code | Terraform, IaC, HCL, infrastructure as code                                               | [iac-usage.md](./references/iac-usage.md)
+MCP Server             | MCP tools, MCP server, MCP setup                                                          | [mcp-usage.md](./references/mcp-usage.md)
+CLI / Tools            | gcloud, kubectl, commands, how to                                                         | [cli-reference.md](./references/cli-reference.md)
+Production Audit       | production readiness, compliance, golden path check                                       | [gke-cluster-creation](../gke-cluster-creation/SKILL.md)
 
 *If you need product information not found in these references, use the Developer Knowledge MCP server `search_documents` tool.*
diff --git a/skills/cloud/gke-basics/references/cli-reference.md b/skills/cloud/gke-basics/references/cli-reference.md
index 5b6b77db9f..6d29016b91 100644
--- a/skills/cloud/gke-basics/references/cli-reference.md
+++ b/skills/cloud/gke-basics/references/cli-reference.md
@@ -12,37 +12,70 @@ Default preference order:
 
 ### When to use each
 
-| Interface | When to Use | Examples |
-|-----------|-------------|---------|
-| **GKE MCP Tools** | Default for all cluster and K8s operations when MCP server is available. Structured I/O, supports dry-run, no shell/kubeconfig needed. | `create_cluster`, `get_cluster`, `get_k8s_resource`, `apply_k8s_manifest`, `get_k8s_logs` |
-| **`gcloud` CLI** | No MCP equivalent, or user explicitly requested CLI. Required for: GIQ model discovery, available K8s versions, maintenance windows, monitoring components, IAM/SA setup, Cloud Logging queries. | `gcloud container ai profiles`, `gcloud container get-server-config`, `gcloud iam service-accounts` |
-| **`kubectl`** | Neither MCP nor `gcloud` covers the operation, or user explicitly prefers kubectl. Required for: `kubectl top`, `kubectl scale`, `kubectl exec`, `kubectl port-forward`, Helm, custom CRDs not in MCP. | `kubectl top pods`, `kubectl scale deployment`, `helm install` |
+| Interface         | When to Use                | Examples                    |
+| ----------------- | -------------------------- | --------------------------- |
+| **GKE MCP Tools** | Default for all cluster    | `create_cluster`,           |
+:                   : and K8s operations when    : `get_cluster`,              :
+:                   : MCP server is available.   : `get_k8s_resource`,         :
+:                   : Structured I/O, supports   : `apply_k8s_manifest`,       :
+:                   : dry-run, no                : `get_k8s_logs`              :
+:                   : shell/kubeconfig needed.   :                             :
+| **`gcloud` CLI**  | No MCP equivalent, or user | `gcloud container ai        |
+:                   : explicitly requested CLI.  : profiles`, `gcloud          :
+:                   : Required for\: GIQ model   : container                   :
+:                   : discovery, available K8s   : get-server-config`, `gcloud :
+:                   : versions, maintenance      : iam service-accounts`       :
+:                   : windows, monitoring        :                             :
+:                   : components, IAM/SA setup,  :                             :
+:                   : Cloud Logging queries.     :                             :
+| **`kubectl`**     | Neither MCP nor `gcloud`   | `kubectl top pods`,         |
+:                   : covers the operation, or   : `kubectl scale deployment`, :
+:                   : user explicitly prefers    : `helm install`              :
+:                   : kubectl. Required for\:    :                             :
+:                   : `kubectl top`, `kubectl    :                             :
+:                   : scale`, `kubectl exec`,    :                             :
+:                   : `kubectl port-forward`,    :                             :
+:                   : Helm, custom CRDs not in   :                             :
+:                   : MCP.                       :                             :
 
 ### User preference override
 
 If the user states a preference, respect it for the session:
 
-- **"Use gcloud" / "Use CLI"** → `gcloud` for cluster ops, `kubectl` for K8s resource ops. Skip MCP.
-- **"Use kubectl"** → `kubectl` for all K8s resource ops, `gcloud` for cluster-level ops. Skip MCP.
-- **"Use MCP"** / no preference → Default. Use MCP for everything it supports.
+-   **"Use gcloud" / "Use CLI"** → `gcloud` for cluster ops, `kubectl` for K8s
+    resource ops. Skip MCP.
+-   **"Use kubectl"** → `kubectl` for all K8s resource ops, `gcloud` for
+    cluster-level ops. Skip MCP.
+-   **"Use MCP"** / no preference → Default. Use MCP for everything it supports.
 
-Even with an override, fall back through the chain for unsupported operations (e.g., cluster creation always requires `gcloud` or MCP).
+Even with an override, fall back through the chain for unsupported operations
+(e.g., cluster creation always requires `gcloud` or MCP).
 
----
+--------------------------------------------------------------------------------
 
-> All MCP tools use hierarchical resource paths — see [`parent` format](#parent--name-format-quick-reference) at the bottom.
+> All MCP tools use hierarchical resource paths — see
+> [`parent` format](#parent--name-format-quick-reference) at the bottom.
 
 ## Cluster Operations
 
-| Operation | MCP Tool | CLI Fallback | Mode |
-|-----------|----------|-------------|------|
-| List clusters | `list_clusters` | `gcloud container clusters list` | READ |
-| Get cluster details | `get_cluster` | `gcloud container clusters describe` | READ |
-| Create cluster | `create_cluster` | `gcloud container clusters create-auto` | MUTATE |
-| Update cluster | `update_cluster` | `gcloud container clusters update` | DESTRUCTIVE |
-| Get K8s versions | — | `gcloud container get-server-config` | READ |
-| Get credentials | — | `gcloud container clusters get-credentials` | READ |
-| Delete cluster | — | `gcloud container clusters delete` | DESTRUCTIVE |
+| Operation           | MCP Tool         | CLI Fallback       | Mode        |
+| ------------------- | ---------------- | ------------------ | ----------- |
+| List clusters       | `list_clusters`  | `gcloud container  | READ        |
+:                     :                  : clusters list`     :             :
+| Get cluster details | `get_cluster`    | `gcloud container  | READ        |
+:                     :                  : clusters describe` :             :
+| Create cluster      | `create_cluster` | `gcloud container  | MUTATE      |
+:                     :                  : clusters           :             :
+:                     :                  : create-auto`       :             :
+| Update cluster      | `update_cluster` | `gcloud container  | DESTRUCTIVE |
+:                     :                  : clusters update`   :             :
+| Get K8s versions    | —                | `gcloud container  | READ        |
+:                     :                  : get-server-config` :             :
+| Get credentials     | —                | `gcloud container  | READ        |
+:                     :                  : clusters           :             :
+:                     :                  : get-credentials`   :             :
+| Delete cluster      | —                | `gcloud container  | DESTRUCTIVE |
+:                     :                  : clusters delete`   :             :
 
 ```
 # List clusters in a project (all regions)
@@ -76,12 +109,16 @@ gcloud container clusters get-credentials <CLUSTER_NAME> --region <REGION> --pro
 
 ## Node Pool Operations
 
-| Operation | MCP Tool | CLI Fallback | Mode |
-|-----------|----------|-------------|------|
-| List node pools | `list_node_pools` | `gcloud container node-pools list` | READ |
-| Get node pool | `get_node_pool` | `gcloud container node-pools describe` | READ |
-| Create node pool | `create_node_pool` | `gcloud container node-pools create` | MUTATE |
-| Update node pool | `update_node_pool` | `gcloud container node-pools update` | DESTRUCTIVE |
+| Operation        | MCP Tool           | CLI Fallback         | Mode        |
+| ---------------- | ------------------ | -------------------- | ----------- |
+| List node pools  | `list_node_pools`  | `gcloud container    | READ        |
+:                  :                    : node-pools list`     :             :
+| Get node pool    | `get_node_pool`    | `gcloud container    | READ        |
+:                  :                    : node-pools describe` :             :
+| Create node pool | `create_node_pool` | `gcloud container    | MUTATE      |
+:                  :                    : node-pools create`   :             :
+| Update node pool | `update_node_pool` | `gcloud container    | DESTRUCTIVE |
+:                  :                    : node-pools update`   :             :
 
 ```
 list_node_pools(parent="projects/<PROJECT_ID>/locations/<REGION>/clusters/<CLUSTER_NAME>")
@@ -94,11 +131,16 @@ create_node_pool(
 
 ## Cluster Updates
 
-| Operation | MCP Tool | CLI Fallback | Mode |
-|-----------|----------|-------------|------|
-| Update cluster settings | `update_cluster` | `gcloud container clusters update` | DESTRUCTIVE |
-| Update monitoring | — | `gcloud container clusters update --monitoring=...` | DESTRUCTIVE |
-| Set maintenance window | — | `gcloud container clusters update --maintenance-window-*` | DESTRUCTIVE |
+| Operation         | MCP Tool         | CLI Fallback            | Mode        |
+| ----------------- | ---------------- | ----------------------- | ----------- |
+| Update cluster    | `update_cluster` | `gcloud container       | DESTRUCTIVE |
+: settings          :                  : clusters update`        :             :
+| Update monitoring | —                | `gcloud container       | DESTRUCTIVE |
+:                   :                  : clusters update         :             :
+:                   :                  : --monitoring=...`       :             :
+| Set maintenance   | —                | `gcloud container       | DESTRUCTIVE |
+: window            :                  : clusters update         :             :
+:                   :                  : --maintenance-window-*` :             :
 
 ```
 # Enable VPA via MCP
@@ -117,15 +159,19 @@ gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
 
 ## Kubernetes Resource Operations
 
-| Operation | MCP Tool | CLI Fallback | Mode |
-|-----------|----------|-------------|------|
-| Get/list resources | `get_k8s_resource` | `kubectl get` | READ |
-| Describe resource | `describe_k8s_resource` | `kubectl describe` | READ |
-| Apply manifest | `apply_k8s_manifest` | `kubectl apply` | DESTRUCTIVE |
-| Patch resource | `patch_k8s_resource` | `kubectl patch` | DESTRUCTIVE |
-| Delete resource | `delete_k8s_resource` | `kubectl delete` | DESTRUCTIVE |
-| List API resources | `list_k8s_api_resources` | `kubectl api-resources` | READ |
-| Check auth | `check_k8s_auth` | `kubectl auth can-i` | READ |
+| Operation       | MCP Tool                 | CLI Fallback     | Mode        |
+| --------------- | ------------------------ | ---------------- | ----------- |
+| Get/list        | `get_k8s_resource`       | `kubectl get`    | READ        |
+: resources       :                          :                  :             :
+| Describe        | `describe_k8s_resource`  | `kubectl         | READ        |
+: resource        :                          : describe`        :             :
+| Apply manifest  | `apply_k8s_manifest`     | `kubectl apply`  | DESTRUCTIVE |
+| Patch resource  | `patch_k8s_resource`     | `kubectl patch`  | DESTRUCTIVE |
+| Delete resource | `delete_k8s_resource`    | `kubectl delete` | DESTRUCTIVE |
+| List API        | `list_k8s_api_resources` | `kubectl         | READ        |
+: resources       :                          : api-resources`   :             :
+| Check auth      | `check_k8s_auth`         | `kubectl auth    | READ        |
+:                 :                          : can-i`           :             :
 
 ```
 # List all deployments in a namespace
@@ -150,14 +196,14 @@ check_k8s_auth(parent="...", verb="create", resourceType="deployments", namespac
 
 ## Diagnostics & Observability
 
-| Operation | MCP Tool | CLI Fallback | Mode |
-|-----------|----------|-------------|------|
-| List events | `list_k8s_events` | `kubectl events` | READ |
-| Get container logs | `get_k8s_logs` | `kubectl logs` | READ |
-| Cluster info | `get_k8s_cluster_info` | `kubectl cluster-info` | READ |
-| K8s version | `get_k8s_version` | `kubectl version` | READ |
-| Rollout status | `get_k8s_rollout_status` | `kubectl rollout status` | READ |
-| Query Cloud Logging | — | `gcloud logging read` | READ |
+Operation           | MCP Tool                 | CLI Fallback             | Mode
+------------------- | ------------------------ | ------------------------ | ----
+List events         | `list_k8s_events`        | `kubectl events`         | READ
+Get container logs  | `get_k8s_logs`           | `kubectl logs`           | READ
+Cluster info        | `get_k8s_cluster_info`   | `kubectl cluster-info`   | READ
+K8s version         | `get_k8s_version`        | `kubectl version`        | READ
+Rollout status      | `get_k8s_rollout_status` | `kubectl rollout status` | READ
+Query Cloud Logging | —                        | `gcloud logging read`    | READ
 
 ```
 # Get recent events across all namespaces
@@ -173,11 +219,14 @@ get_k8s_rollout_status(parent="...", resourceType="deployment", name="<DEPLOY>",
 
 ## Operations Tracking
 
-| Operation | MCP Tool | CLI Fallback | Mode |
-|-----------|----------|-------------|------|
-| List operations | `list_operations` | `gcloud container operations list` | READ |
-| Get operation | `get_operation` | `gcloud container operations describe` | READ |
-| Cancel operation | `cancel_operation` | `gcloud container operations cancel` | DESTRUCTIVE |
+| Operation        | MCP Tool           | CLI Fallback         | Mode        |
+| ---------------- | ------------------ | -------------------- | ----------- |
+| List operations  | `list_operations`  | `gcloud container    | READ        |
+:                  :                    : operations list`     :             :
+| Get operation    | `get_operation`    | `gcloud container    | READ        |
+:                  :                    : operations describe` :             :
+| Cancel operation | `cancel_operation` | `gcloud container    | DESTRUCTIVE |
+:                  :                    : operations cancel`   :             :
 
 ```
 list_operations(parent="projects/<PROJECT_ID>/locations/<REGION>")
@@ -228,12 +277,12 @@ Use `locations/-` to match all regions/zones when listing.
 
 ## Error Handling
 
-| Error / Symptom | Likely Cause | Remediation |
-|-----------------|--------------|-------------|
-| `PERMISSION_DENIED` on cluster create | Missing `container.clusters.create` IAM role | Grant `roles/container.admin` or `roles/container.clusterAdmin` |
-| Quota exceeded | Regional vCPU, GPU, or IP address limits | Request quota increase or select a different region |
-| IP exhaustion / CIDR conflict | Pod subnet too small or overlapping ranges | Re-plan IP ranges; may require cluster recreation (Day-0) |
-| Workload Identity not working | Missing OIDC issuer or federated credential | Verify `workloadIdentityConfig.workloadPool`; configure federated identity binding |
-| Private cluster unreachable | No authorized networks or DNS endpoint | Enable `dnsEndpointConfig.allowExternalTraffic` or add authorized networks |
-| Secret Manager rotation failing | SA missing `secretmanager.versions.access` | Grant Secret Manager accessor role to workload's GSA |
-| Control-plane metrics missing | Monitoring components not configured | Enable APISERVER, SCHEDULER, CONTROLLER_MANAGER in `monitoringConfig` |
+Error / Symptom                       | Likely Cause                                 | Remediation
+------------------------------------- | -------------------------------------------- | -----------
+`PERMISSION_DENIED` on cluster create | Lacks `container.clusters.create` permission | Grant `roles/container.admin` or `roles/container.clusterAdmin`
+Quota exceeded                        | Regional vCPU, GPU, or IP address limits     | Request quota increase or select a different region
+IP exhaustion / CIDR conflict         | Pod subnet too small or overlapping ranges   | Re-plan IP ranges; may require cluster recreation (Day-0)
+Workload Identity not working         | Workload pool not set or IAM binding missing | Verify `workloadIdentityConfig.workloadPool`; check the IAM binding for the workload's Kubernetes service account
+Private cluster unreachable           | No authorized networks or DNS endpoint       | Enable `dnsEndpointConfig.allowExternalTraffic` or add authorized networks
+Secret Manager rotation failing       | SA missing `secretmanager.versions.access`   | Grant `roles/secretmanager.secretAccessor` to the workload's GSA
+Control-plane metrics missing         | Monitoring components not configured         | Enable APISERVER, SCHEDULER, CONTROLLER_MANAGER in `monitoringConfig`
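A note on the IP exhaustion row above: a rough capacity check before cluster creation can catch an undersized pod range. The sketch below is an illustration, not part of the patch. It assumes GKE's documented behavior for VPC-native clusters of reserving, per node, a pod range that holds at least twice the maximum pods per node, rounded up to a power of two. The function names and the example CIDRs are hypothetical.

```python
import ipaddress


def per_node_prefix(max_pods_per_node: int) -> int:
    """Prefix length of the pod range reserved for one node.

    Assumption: each node gets at least 2x its max pods in addresses,
    rounded up to the next power of two (e.g. 110 pods -> 256 -> /24).
    """
    if max_pods_per_node < 1:
        raise ValueError("max_pods_per_node must be positive")
    # bit_length of (2n - 1) is ceil(log2(2n)) for n >= 1, with no float rounding.
    host_bits = (2 * max_pods_per_node - 1).bit_length()
    return 32 - host_bits


def max_nodes(pod_cidr: str, max_pods_per_node: int) -> int:
    """Upper bound on nodes that a pod CIDR can serve (hypothetical helper)."""
    network = ipaddress.ip_network(pod_cidr)
    node_prefix = per_node_prefix(max_pods_per_node)
    if node_prefix < network.prefixlen:
        return 0  # The range cannot hold even one per-node block.
    return 2 ** (node_prefix - network.prefixlen)


if __name__ == "__main__":
    for pods in (48, 110):
        print(pods, per_node_prefix(pods), max_nodes("10.0.0.0/14", pods))
```

Under these assumptions, a `/14` pod range leaves room for 1024 nodes at 110 pods per node and 2048 nodes at 48 pods per node, which is why a higher pod density consumes more CIDR space.
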
diff --git a/skills/cloud/gke-basics/references/client-library-usage.md b/skills/cloud/gke-basics/references/client-library-usage.md
index 43e5dd273c..a15558f3e2 100644
--- a/skills/cloud/gke-basics/references/client-library-usage.md
+++ b/skills/cloud/gke-basics/references/client-library-usage.md
@@ -3,10 +3,9 @@
 To interact with the GKE (Kubernetes) API programmatically, use the official
 Kubernetes client libraries.
 
-**Prerequisite:** These libraries interact with the Kubernetes API. You
-must already have a running GKE cluster and valid credentials
-(for example, by running `gcloud container clusters get-credentials`)
-before running this code.
+**Prerequisite:** These libraries interact with the Kubernetes API. You must
+already have a running GKE cluster and valid credentials (for example, by
+running `gcloud container clusters get-credentials`) before running this code.
 
 ## Getting Started
 
@@ -15,77 +14,77 @@ within your application code.
 
 ### Python
 
-- **Installation:**
+-   **Installation:**
 
-  ```bash
-  pip install kubernetes
-  ```
+    ```bash
+    pip install kubernetes
+    ```
 
-- **Usage Example:**
+-   **Usage Example:**
 
-  ```python
-  from kubernetes import client, config
-  config.load_kube_config() # Loads from ~/.kube/config
-  v1 = client.CoreV1Api()
-  print("Listing pods with their IPs:")
-  ret = v1.list_pod_for_all_namespaces(watch=False)
-  for i in ret.items:
-      print("%s\t%s\t%s" % (i.status.pod_ip, i.metadata.namespace, i.metadata.name))
-  ```
+    ```python
+    from kubernetes import client, config
+    config.load_kube_config() # Loads from ~/.kube/config
+    v1 = client.CoreV1Api()
+    print("Listing pods with their IPs:")
+    ret = v1.list_pod_for_all_namespaces(watch=False)
+    for i in ret.items:
+        print("%s\t%s\t%s" % (i.status.pod_ip, i.metadata.namespace, i.metadata.name))
+    ```
 
 ### Go
 
-- **Installation:**
+-   **Installation:**
 
-  ```bash
-  go get k8s.io/client-go@latest
-  ```
+    ```bash
+    go get k8s.io/client-go@latest
+    ```
 
-- **Usage Example:**
+-   **Usage Example:**
 
-  ```go
-  import (
-      "k8s.io/client-go/kubernetes"
-      "k8s.io/client-go/tools/clientcmd"
-  )
-  config, _ := clientcmd.BuildConfigFromFlags("", kubeconfig)
-  clientset, _ := kubernetes.NewForConfig(config)
-  pods, _ := clientset.CoreV1().Pods("").List(
-      context.TODO, metav1.ListOptions{})
-  ```
+    ```go
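+    import (
+        "context"
+
+        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+        "k8s.io/client-go/kubernetes"
+        "k8s.io/client-go/tools/clientcmd"
+    )
+    // kubeconfig is the path to a kubeconfig file (for example, ~/.kube/config).
+    // Errors are discarded here for brevity; check them in real code.
+    config, _ := clientcmd.BuildConfigFromFlags("", kubeconfig)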
+    clientset, _ := kubernetes.NewForConfig(config)
+    pods, _ := clientset.CoreV1().Pods("").List(
+        context.Background(), metav1.ListOptions{})
+    ```
 
 ### Node.js (TypeScript)
 
-- **Installation:**
+-   **Installation:**
 
-  ```bash
-  npm install @kubernetes/client-node
-  ```
+    ```bash
+    npm install @kubernetes/client-node
+    ```
 
-- **Usage Example:**
+-   **Usage Example:**
 
-  ```javascript
-  const k8s = require('@kubernetes/client-node');
+    ```javascript
+    const k8s = require('@kubernetes/client-node');
 
-  const kc = new k8s.KubeConfig();
-  kc.loadFromDefault(); // Automatically detects local vs. in-cluster configuration
+    const kc = new k8s.KubeConfig();
+    kc.loadFromDefault(); // Automatically detects local vs. in-cluster configuration
 
-  const k8sApi = kc.makeApiClient(k8s.CoreV1Api);
+    const k8sApi = kc.makeApiClient(k8s.CoreV1Api);
 
-  // In most recent library versions, parameters must be passed inside an object
-  k8sApi.listNamespacedPod({ namespace: 'default' }).then((res) => {
-      const pods = res.items || res.body.items;
-      console.log(`Found ${pods.length} pods in 'default' namespace.`);
-  });
-  ```
+    // In most recent library versions, parameters must be passed inside an object
+    k8sApi.listNamespacedPod({ namespace: 'default' }).then((res) => {
+        const pods = res.items || res.body.items;
+        console.log(`Found ${pods.length} pods in 'default' namespace.`);
+    });
+    ```
 
 ### Java
 
-- [Java Reference](https://github.com/kubernetes-client/java)
+-   [Java Reference](https://github.com/kubernetes-client/java)
 
 ## GKE-specific API (Container Service)
 
 To manage the GKE *service* itself (e.g., create/delete clusters)
 programmatically, use the Google Cloud Container client libraries.
 
-- [Google Cloud Container Client Libraries](https://cloud.google.com/kubernetes-engine/docs/reference/libraries)
+-   [Google Cloud Container Client Libraries](https://cloud.google.com/kubernetes-engine/docs/reference/libraries)
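The Python example in this file calls `config.load_kube_config()`, which reads a local kubeconfig, whereas the Node.js example's `loadFromDefault()` detects in-cluster configuration by itself. A minimal sketch of the same fallback in Python follows. It is an illustration, not part of the patch: `load_cluster_config` and `print_pod_ips` are hypothetical helper names, and the config module is passed in as a parameter so the fallback logic can be exercised without a cluster.

```python
def load_cluster_config(config_module):
    """Load credentials from kubeconfig, falling back to in-cluster config.

    `config_module` is expected to be `kubernetes.config` (or a stand-in
    with the same three attributes). Returns the source that was used.
    """
    try:
        config_module.load_kube_config()  # Local development: ~/.kube/config
        return "kubeconfig"
    except config_module.ConfigException:
        # Inside a pod: use the mounted service account token instead.
        config_module.load_incluster_config()
        return "in-cluster"


def print_pod_ips():
    """List pods in all namespaces (hypothetical helper; needs a cluster)."""
    # Imported lazily so the fallback logic above has no hard dependency.
    from kubernetes import client, config  # pip install kubernetes

    source = load_cluster_config(config)
    print(f"Loaded credentials from: {source}")
    v1 = client.CoreV1Api()
    for pod in v1.list_pod_for_all_namespaces(watch=False).items:
        print(f"{pod.status.pod_ip}\t{pod.metadata.namespace}\t{pod.metadata.name}")
```

Calling `print_pod_ips()` requires the same prerequisites as the other examples: a running cluster and valid credentials.
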
diff --git a/skills/cloud/gke-basics/references/core-concepts.md b/skills/cloud/gke-basics/references/core-concepts.md
index b994f59828..9f02d5382e 100644
--- a/skills/cloud/gke-basics/references/core-concepts.md
+++ b/skills/cloud/gke-basics/references/core-concepts.md
@@ -1,54 +1,80 @@
 # GKE Core Concepts
 
-Google Kubernetes Engine (GKE) is a managed Kubernetes platform for deploying, managing, and scaling containerized applications on Google Cloud infrastructure. It handles cluster provisioning, upgrades, and node management, letting teams focus on workloads rather than infrastructure.
+Google Kubernetes Engine (GKE) is a managed Kubernetes platform for deploying,
+managing, and scaling containerized applications on Google Cloud infrastructure.
+It handles cluster provisioning, upgrades, and node management, letting teams
+focus on workloads rather than infrastructure.
 
 > **MCP Tools:** `list_clusters`, `get_cluster`
 
 ## Cluster Modes
 
-| Mode | Who Manages Nodes | Best For |
-|------|-------------------|----------|
-| **Autopilot** (recommended) | Google — fully managed nodes, scaling, and security | Most workloads. No node-level ops. Pay per pod resource request. |
-| **Standard** | You — full control over node pools, OS, machine types | Workloads requiring kernel customization, specific node OS, or DaemonSets not supported by Autopilot |
+| Mode          | Who Manages Nodes            | Best For                      |
+| ------------- | ---------------------------- | ----------------------------- |
+| **Autopilot** | Google — fully managed       | Most workloads. No node-level |
+: (recommended) : nodes, scaling, and security : ops. Pay per pod resource     :
+:               :                              : request.                      :
+| **Standard**  | You — full control over node | Workloads requiring kernel    |
+:               : pools, OS, machine types     : customization, specific node  :
+:               :                              : OS, or DaemonSets not         :
+:               :                              : supported by Autopilot        :
 
-**Default: Autopilot.** Use Standard only when Autopilot has a documented limitation for your workload.
+**Default: Autopilot.** Use Standard only when Autopilot has a documented
+limitation for your workload.
 
 ## Cluster Architecture
 
-- **Regional clusters** (recommended): Control plane replicated across 3 zones. Higher availability, no single-zone failure risk.
-- **Zonal clusters**: Single control plane zone. Lower cost, acceptable for dev/test.
-- **Private clusters** (golden path default): Nodes have no public IPs. Control plane accessible via private endpoint or DNS endpoint.
+-   **Regional clusters** (recommended): Control plane replicated across 3
+    zones. Higher availability, no single-zone failure risk.
+-   **Zonal clusters**: Single control plane zone. Lower cost, acceptable for
+    dev/test.
+-   **Private clusters** (golden path default): Nodes have no public IPs.
+    Control plane accessible via private endpoint or DNS endpoint.
 
 ## Networking Model
 
 GKE uses **VPC-native** clusters with alias IP ranges:
-- Each pod gets a routable IP from the pod CIDR
-- Dataplane V2 (eBPF-based) is the golden path default — provides built-in Network Policy enforcement
-- Cloud DNS for in-cluster DNS resolution
-- Gateway API for ingress/load balancing
+
+-   Each pod gets a routable IP from the pod CIDR
+-   Dataplane V2 (eBPF-based) is the golden path default — provides built-in
+    Network Policy enforcement
+-   Cloud DNS for in-cluster DNS resolution
+-   Gateway API for ingress/load balancing
 
 ## Scaling Model
 
-- **Horizontal Pod Autoscaler (HPA)**: Scales pod replicas based on CPU, memory, or custom metrics
-- **Vertical Pod Autoscaler (VPA)**: Recommends or auto-adjusts pod resource requests
-- **Cluster Autoscaler / NAP**: Scales nodes to match pod demand (Autopilot handles this automatically)
-- **ComputeClasses**: Declarative node selection — machine family, Spot VMs, GPU targeting
+-   **Horizontal Pod Autoscaler (HPA)**: Scales pod replicas based on CPU,
+    memory, or custom metrics
+-   **Vertical Pod Autoscaler (VPA)**: Recommends or auto-adjusts pod resource
+    requests
+-   **Cluster Autoscaler / node auto-provisioning (NAP)**: Scales nodes to match
+    pod demand (Autopilot handles this automatically)
+-   **ComputeClasses**: Declarative node selection — machine family, Spot VMs,
+    GPU targeting
 
 ## Identity & Security Model
 
-- **Workload Identity Federation**: Pods assume Google Cloud IAM identities without static keys
-- **Secret Manager integration**: Secrets synced to Kubernetes with automatic rotation
-- **Pod Security Standards**: `restricted` profile enforced on production namespaces
-- **Shielded Nodes**: Secure Boot and integrity monitoring (Autopilot-enforced)
+-   **Workload Identity Federation**: Pods assume Google Cloud IAM identities
+    without static keys
+-   **Secret Manager integration**: Secrets synced to Kubernetes with automatic
+    rotation
+-   **Pod Security Standards**: `restricted` profile enforced on production
+    namespaces
+-   **Shielded Nodes**: Secure Boot and integrity monitoring
+    (Autopilot-enforced)
 
 ## Regional Availability
 
-GKE is available in all Google Cloud regions. Autopilot clusters are regional by default. See https://cloud.google.com/about/locations for the full region list.
+GKE is available in all Google Cloud regions. Autopilot clusters are regional by
+default. See https://cloud.google.com/about/locations for the full region list.
 
 ## Pricing
 
 GKE pricing depends on the cluster mode:
-- **Autopilot**: Pay for pod resource requests (vCPU, memory, ephemeral storage). No cluster management fee.
-- **Standard**: Pay for underlying Compute Engine VMs plus a per-cluster management fee.
+
+-   **Autopilot**: Pay for pod resource requests (vCPU, memory, ephemeral
+    storage) plus a per-cluster management fee.
+-   **Standard**: Pay for underlying Compute Engine VMs plus a per-cluster
+    management fee.
 
 For current pricing, see https://cloud.google.com/kubernetes-engine/pricing.
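For the HPA entry in the Scaling Model section above, the replica count follows the formula given in the Kubernetes HPA documentation: `desired = ceil(current * currentMetric / targetMetric)`, with no change when the ratio is within a tolerance of 1.0. The sketch below only reproduces that arithmetic and is not part of the patch. The function name, the default bounds, and the sample numbers are hypothetical, and the real controller also accounts for pod readiness and stabilization windows.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA replica calculation for a single metric.

    current_value and target_value use the same unit, for example
    average CPU utilization in percent.
    """
    ratio = current_value / target_value
    # Within the tolerance band the HPA leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured minReplicas / maxReplicas bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 4 replicas at 90% average CPU against a 60% target.
    print(desired_replicas(4, 90, 60))
```

In this example the ratio is 1.5, so four replicas scale out to six; a reading of 63% against the same target stays inside the tolerance band and leaves the count at four.
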
diff --git a/skills/cloud/gke-basics/references/gke-compute-classes.md b/skills/cloud/gke-basics/references/gke-compute-classes.md
deleted file mode 100644
index 0edd842ece..0000000000
--- a/skills/cloud/gke-basics/references/gke-compute-classes.md
+++ /dev/null
@@ -1,172 +0,0 @@
-# GKE ComputeClasses
-
-ComputeClasses allow declarative node configuration and autoscaling priorities in GKE Autopilot (and Standard with NAP). Use them to specify machine families, Spot VM fallback, GPU requirements, and zone targeting.
-
-> **MCP Tools:** `apply_k8s_manifest`, `get_k8s_resource`, `describe_k8s_resource`, `delete_k8s_resource`
-
-## When to Use
-
-- Cost optimization: Spot VMs with on-demand fallback
-- GPU/TPU workloads: target specific accelerators
-- Performance: select specific machine families (c3, c4, n4)
-- Zone targeting: colocate workloads with zonal resources
-
-## CRD Structure
-
-```yaml
-apiVersion: cloud.google.com/v1
-kind: ComputeClass
-metadata:
-  name: <string>
-spec:
-  # Required. Ordered list of rules. GKE tries them in order.
-  priorities:
-    - <PriorityRule>
-
-  # Optional. Default: "DoNotScaleUp"
-  whenUnsatisfiable: <"DoNotScaleUp" | "ScaleUpAnyway">
-
-  # Optional. Auto-create node pools. Default: true
-  nodePoolAutoCreation:
-    enabled: <boolean>
-
-  # Optional. Move workloads back to higher-priority when available
-  activeMigration:
-    optimizeRulePriority: <boolean>
-
-  # Optional. Scale-down delay
-  autoscalingPolicy:
-    consolidationDelay: <duration>
-
-  # Optional. Defaults for fields omitted in priorities
-  priorityDefaults: <PriorityRule>
-```
-
-## PriorityRule Fields
-
-| Field | Type | Description | Example |
-|-------|------|-------------|---------|
-| `machineFamily` | string | Compute Engine machine family | `n4`, `c3`, `t2a` |
-| `machineType` | string | Specific machine type | `n4-standard-32` |
-| `spot` | boolean | Use Spot VMs | `true` |
-| `minCores` | int | Minimum vCPUs | `4` |
-| `minMemoryGb` | int | Minimum memory in GB | `16` |
-| `gpu` | object | GPU config: `type`, `count`, `driverVersion` | See below |
-| `tpu` | object | TPU config: `type`, `count`, `topology` | See below |
-| `storage` | object | Boot disk: `type`, `sizeGb`, `kmsKey`; Local SSD: `count`, `interface` | See below |
-| `location` | object | Zone targeting: `zones: [...]` or `type: "Any"` | See below |
-| `reservations` | object | Reservation consumption: `NO_RESERVATION`, `ANY_RESERVATION`, `SPECIFIC_RESERVATION` | See below |
-
-### GPU Configuration
-
-```yaml
-gpu:
-  type: "nvidia-l4"        # nvidia-l4, nvidia-h100-80gb, etc.
-  count: 1                 # GPUs per node
-  driverVersion: "latest"  # Optional
-```
-
-### TPU Configuration
-
-```yaml
-tpu:
-  type: "v5p-slice"
-  count: 8
-  topology: "2x2x1"
-```
-
-### Storage Configuration
-
-```yaml
-storage:
-  bootDisk:
-    type: "pd-balanced"     # pd-balanced (golden path), pd-ssd, hyperdisk-balanced
-    sizeGb: 100
-    kmsKey: "projects/.../cryptoKeys/..."  # Optional CMEK
-  localSsd:
-    count: 1
-    interface: "NVME"
-```
-
-### Location Configuration
-
-```yaml
-location:
-  zones:
-    - "us-central1-a"
-    - "us-central1-b"
-  # OR
-  type: "Any"              # Let GKE pick from cluster zones
-```
-
-## Common Patterns
-
-### Spot VMs with On-Demand Fallback
-
-```yaml
-apiVersion: cloud.google.com/v1
-kind: ComputeClass
-metadata:
-  name: spot-with-fallback
-spec:
-  nodePoolAutoCreation:
-    enabled: true
-  priorities:
-  - machineFamily: n4
-    spot: true
-  - machineFamily: n4
-    spot: false
-```
-
-### GPU Workload (L4)
-
-```yaml
-apiVersion: cloud.google.com/v1
-kind: ComputeClass
-metadata:
-  name: l4-gpu-class
-spec:
-  priorities:
-  - machineFamily: g2
-    gpu:
-      type: nvidia-l4
-      count: 1
-    minCores: 4
-    minMemoryGb: 16
-    storage:
-      bootDisk:
-        type: pd-balanced
-        sizeGb: 100
-```
-
-### Spot with Active Migration (Return to Spot When Available)
-
-Add `activeMigration` to the Spot-with-fallback pattern above to auto-migrate workloads back to Spot when capacity returns:
-
-```yaml
-spec:
-  activeMigration:
-    optimizeRulePriority: true
-  priorities:
-  - machineFamily: n4
-    spot: true
-  - machineFamily: n4
-    spot: false
-```
-
-> **Other patterns** — HPC (`machineFamily: c3`, `minCores: 8`) and zone targeting (`location.zones: [...]`) follow the same CRD structure. See the PriorityRule fields table and sub-config examples above.
-
-## Workload Usage
-
-Pods must specify the ComputeClass via node selector:
-
-```yaml
-nodeSelector:
-  cloud.google.com/compute-class: "<compute-class-name>"
-```
-
-## Warnings
-
-- Do not mix ComputeClass selection with other hard node selectors (like `cloud.google.com/gke-spot`) — this causes scheduling conflicts.
-- When using `activeMigration`, workloads will be evicted and rescheduled — ensure PDBs are in place.
-- Spot VMs can be evicted with 30-second notice. Set `terminationGracePeriodSeconds < 30` for Spot workloads.
diff --git a/skills/cloud/gke-basics/references/gke-cost.md b/skills/cloud/gke-basics/references/gke-cost.md
deleted file mode 100644
index 2bb88dc645..0000000000
--- a/skills/cloud/gke-basics/references/gke-cost.md
+++ /dev/null
@@ -1,158 +0,0 @@
-# GKE Cost Optimization
-
-This reference covers strategies for reducing GKE costs while maintaining the golden path security and reliability posture.
-
-> **MCP Tools:** `get_k8s_resource`, `describe_k8s_resource`, `apply_k8s_manifest`, `patch_k8s_resource`, `get_cluster`
-
-## Golden Path Cost Features
-
-The golden path already includes cost-optimizing settings:
-
-| Setting | Value | Impact |
-|---------|-------|--------|
-| `autoscalingProfile` | `OPTIMIZE_UTILIZATION` | Aggressive node scale-down reduces idle compute |
-| `verticalPodAutoscaling` | `enabled` | VPA recommendations prevent over-provisioning |
-| Autopilot pricing | Pay per pod request | No charge for unused node capacity |
-| Node Auto Provisioning | enabled | Right-sized node pools created automatically |
-
-## Cost Optimization Strategies
-
-### 1. Spot VMs via ComputeClasses
-
-Use Spot VMs for fault-tolerant workloads (60-90% cost reduction).
-
-```yaml
-apiVersion: cloud.google.com/v1
-kind: ComputeClass
-metadata:
-  name: spot-with-fallback
-spec:
-  activeMigration:
-    optimizeRulePriority: true
-  priorities:
-  - machineFamily: n4
-    spot: true
-  - machineFamily: n4
-    spot: false
-```
-
-**Spot-suitable workloads:**
-
-| Workload | Spot-Suitable? |
-|----------|----------------|
-| Batch / data processing | Yes |
-| Dev / test environments | Yes |
-| Stateless web/API (replicas >= 2) | Yes (with PDBs) |
-| Jobs with checkpointing | Yes |
-| Stateful workloads (databases) | No |
-| Single-replica critical services | No |
-
-**Handling eviction:**
-
-```yaml
-spec:
-  template:
-    spec:
-      terminationGracePeriodSeconds: 25  # Must be < 30s for Spot
-      containers:
-      - name: app
-        lifecycle:
-          preStop:
-            exec:
-              command: ["/bin/sh", "-c", "sleep 5"]
-```
-
-### 2. Pod Rightsizing
-
-Use VPA recommendations to reduce over-provisioned requests.
-
-```bash
-# 1. Deploy VPA in recommendation mode
-kubectl apply -f - <<EOF
-apiVersion: autoscaling.k8s.io/v1
-kind: VerticalPodAutoscaler
-metadata:
-  name: <DEPLOYMENT>-vpa
-spec:
-  targetRef:
-    apiVersion: apps/v1
-    kind: Deployment
-    name: <DEPLOYMENT>
-  updatePolicy:
-    updateMode: "Off"
-EOF
-
-# 2. Wait 24+ hours for data collection
-
-# 3. Read recommendations
-kubectl get vpa <DEPLOYMENT>-vpa -o jsonpath='{.status.recommendation}'
-```
-
-**Optimization rules:**
-
-| Condition | Action | Savings |
-|-----------|--------|---------|
-| CPU request >5x P95 actual | Reduce to `P95 * 1.2` | High |
-| Memory request >3x P95 actual | Reduce to `P95 * 1.2` | High |
-| CPU request >2x P95 actual | Reduce to `P95 * 1.2` | Medium |
-| No resource requests set | Add requests (enables bin-packing) | Medium |
-
-### 3. Machine Type Selection
-
-| Family | Use Case | Relative Cost |
-|--------|----------|---------------|
-| e2 | General purpose, burstable | Lowest |
-| t2a / t2d | Scale-out (Arm/AMD), price-performance optimized | Low |
-| n4a | Axion Arm-based, general-purpose price-performance | Low |
-| n4 / n4d | General purpose (Intel/AMD), flexible shapes | Low-Medium |
-| c4a | Compute-optimized (Arm), high efficiency | Medium-High |
-| c3 / c4 | Compute-optimized (Intel) | Medium-High |
-| c3d / c4d | Compute-optimized (AMD), high-performance throughput | Medium-High |
-| ek-standard | Autopilot enhanced (golden path) | Medium |
-| m3 / x4 | Memory-optimized, SAP HANA, large databases | High |
-| g2 (L4 GPU) | AI inference | High |
-| a3 (H100 GPU) | AI training | Highest |
-| a4 / a4x | Ultra-scale AI (Blackwell GPUs) | Highest |
-
-> In Autopilot, machine type is managed. Use ComputeClasses to influence selection.
-
-### 4. Committed Use Discounts (CUDs)
-
-For steady-state workloads, purchase 1-year or 3-year CUDs:
-
-- 1-year: ~20-30% discount
-- 3-year: ~50-55% discount
-- Applied automatically to matching usage in the region
-- Purchase via Google Cloud Console > Billing > Committed use discounts
-
-### 5. Cluster Management
-
-- **Stop/start dev clusters**: Idle dev clusters cost money even with no workloads (control plane fee).
-- **Right-size node pools** (Standard): Use Cluster Autoscaler with appropriate min/max.
-- **Multi-tenant clusters**: Share a single cluster across teams instead of per-team clusters (see [gke-multitenancy.md](./gke-multitenancy.md)).
-
-## Cost Monitoring
-
-```bash
-# View cluster cost breakdown (requires Cost Management API)
-gcloud billing budgets list --billing-account=<BILLING_ACCOUNT> --quiet
-
-# View node utilization
-kubectl top nodes
-
-# View pod resource usage vs requests
-kubectl top pods --all-namespaces --containers
-```
-
-## Dev/Test Cost Savings
-
-For non-production environments, these golden path deviations are acceptable:
-
-| Setting | Production (Golden Path) | Dev/Test |
-|---------|-------------------------|----------|
-| Cluster mode | Autopilot | Autopilot (cheaper with fewer pods) |
-| Release channel | Regular | Rapid (get fixes faster) |
-| Private nodes | Required | Optional (simpler access) |
-| Monitoring components | Full suite | SYSTEM_COMPONENTS only |
-| Secret Manager rotation | 120s | Disabled |
-| Maintenance windows | Configured | Not needed |
diff --git a/skills/cloud/gke-basics/references/gke-golden-path.md b/skills/cloud/gke-basics/references/gke-golden-path.md
deleted file mode 100644
index 8473c834c8..0000000000
--- a/skills/cloud/gke-basics/references/gke-golden-path.md
+++ /dev/null
@@ -1,76 +0,0 @@
-# GKE Golden Path Configuration
-
-The golden path is the recommended Autopilot configuration for production clusters. It defines sensible defaults — when the user requests different settings, apply them and note relevant trade-offs.
-
-> **MCP Tools:** `get_cluster`, `create_cluster`, `update_cluster`
-
-## Rules
-
-1. **Default to the golden path.** Use golden path values unless the user requests otherwise. When deviating, note trade-offs but respect the user's choice.
-2. **Day-0 vs Day-1.** Flag Day-0 decisions (networking, private nodes, subnets, IP allocation) prominently — they are hard/impossible to change after creation.
-3. **Tool preference: MCP > gcloud > kubectl.** See [cli-reference.md](./cli-reference.md) for full coverage matrix and override options. If the user says "use gcloud" or "use kubectl", respect that for the session.
-4. **Document decisions and rationale**, especially for Day-0 choices and golden path deviations.
-
-## Required Inputs
-
-If the user is unsure, use golden path defaults.
-
-- **Project ID** (required)
-- **Region** (required, e.g., `us-central1`)
-- **Cluster name** (required)
-- **Environment type**: dev/test or production (defaults to production)
-- **Networking**: bring-your-own VPC/subnet or auto-create (default: auto-create)
-- **Scale expectations**: expected node/pod count, workload types
-- **Cost constraints**: Spot VM tolerance, budget considerations
-
-## Always-Apply Defaults
-
-Recommended best practices applied by default. If the user requests a different setting, apply it and briefly note the security or operational trade-off.
-
-| Setting | Golden Path Value |
-|---------|-------------------|
-| `autopilot.enabled` | `true` |
-| `privateClusterConfig.enablePrivateNodes` | `true` |
-| `masterAuthorizedNetworksConfig.privateEndpointEnforcementEnabled` | `true` |
-| `secretManagerConfig.enabled` + `rotationInterval: 120s` | `true` |
-| `rbacBindingConfig.enableInsecureBinding*` | `false` (both) |
-| `workloadIdentityConfig.workloadPool` | enabled |
-| `networkConfig.datapathProvider` | `ADVANCED_DATAPATH` |
-| `networkConfig.dnsConfig.clusterDns` | `CLOUD_DNS` |
-| `autoscaling.autoscalingProfile` | `OPTIMIZE_UTILIZATION` |
-| `verticalPodAutoscaling.enabled` | `true` |
-| `monitoringConfig` components | SYSTEM_COMPONENTS, STORAGE, POD, DEPLOYMENT, STATEFULSET, DAEMONSET, HPA, JOBSET, CADVISOR, KUBELET, DCGM, APISERVER, SCHEDULER, CONTROLLER_MANAGER |
-| `advancedDatapathObservabilityConfig.enableMetrics` | `true` |
-| `nodeConfig.shieldedInstanceConfig.enableSecureBoot` | `true` |
-| `nodeConfig.workloadMetadataConfig.mode` | `GKE_METADATA` |
-| `nodeConfig.gcfsConfig.enabled` / `gvnic.enabled` | `true` / `true` |
-| `addonsConfig.statefulHaConfig.enabled` | `true` |
-| Storage CSI drivers (Filestore, GCS FUSE, Parallelstore) | enabled |
-| Pod Security Standards | `restricted` on production namespaces |
-
-## Customer-Configurable Settings
-
-These have golden path defaults but customers may deviate with valid justification. **Ask before changing.**
-
-| Setting | Default | Why Deviate |
-|---------|---------|-------------|
-| `dnsEndpointConfig.allowExternalTraffic` | `true` | Restrict if cluster only accessed from within VPC |
-| `autoIpamConfig` / `createSubnetwork` | `true` / `true` | Customer has pre-existing VPC/subnets |
-| `maxPodsPerNode` | `48` | `110` for high pod-density (costs more CIDR space) |
-| `subnetwork` | auto-created | Customer brings existing subnets |
-| Maintenance exclusion windows | configured (NO_MINOR_UPGRADES, 1yr) | Customer-specific scheduling |
-| `nodeConfig.bootDisk.diskType` | `pd-balanced` | `pd-ssd` for I/O-intensive, `pd-standard` for cost |
-| `nodeConfig.machineType` | `ek-standard-8` (Autopilot) | Varies by workload; use ComputeClasses |
-
-## Guardrails
-
-- Do not request or output secrets (tokens, keys, service account JSON).
-- Discover project/cluster context via MCP tools or `gcloud config get-value project` — don't ask users to paste project IDs.
-- For Day-0 decisions, always ask clarifying questions before proceeding.
-- For Day-1 features, propose golden path defaults with trade-offs and let the customer confirm.
-- Do not promise zero downtime; advise PDBs, health probes, replicas, and staged upgrades.
-- When auditing existing clusters, compare against golden path and report deviations with severity and remediation.
-
-## Golden Path Config
-
-See [golden-path-autopilot.yaml](../assets/golden-path-autopilot.yaml) for the full cluster-level policy settings.
diff --git a/skills/cloud/gke-basics/references/gke-inference.md b/skills/cloud/gke-basics/references/gke-inference.md
deleted file mode 100644
index 761adf2e62..0000000000
--- a/skills/cloud/gke-basics/references/gke-inference.md
+++ /dev/null
@@ -1,161 +0,0 @@
-# GKE AI/ML Inference
-
-This reference covers deploying AI/ML inference workloads on GKE using Google's Inference Quickstart (GIQ) and best practices for LLM serving.
-
-> **MCP Tools:** `apply_k8s_manifest`, `get_k8s_resource`, `get_k8s_logs`, `get_k8s_rollout_status`, `describe_k8s_resource`, `list_k8s_events`. **CLI-only:** `gcloud container ai profiles *`
-
-## When to Use
-
-- Deploy an AI model (Llama, Gemma, Mistral, etc.) to GKE
-- Generate optimized Kubernetes manifests for inference
-- Select GPU/TPU accelerators for model serving
-- Configure autoscaling for LLM inference
-
-## Prerequisites
-
-- A golden path GKE Autopilot cluster (GPU workloads are supported via ComputeClasses and NAP)
-- `gcloud` CLI authenticated
-- Sufficient GPU/TPU quota in the target region
-
-## Workflow
-
-### 1. Discovery: Find Models and Hardware
-
-```bash
-# List all supported models
-gcloud container ai profiles models list --quiet
-
-# Find valid accelerator/server combinations for a model
-gcloud container ai profiles list --model=<MODEL_NAME> --quiet
-
-# Example: what can run Gemma 2 9B?
-gcloud container ai profiles list --model=gemma-2-9b-it --quiet
-```
-
-### 2. Generate Manifest
-
-```bash
-gcloud container ai profiles manifests create \
-  --model=<MODEL_NAME> \
-  --model-server=<SERVER> \
-  --accelerator-type=<ACCELERATOR> \
-  --target-ntpot-milliseconds=<NTPOT> --quiet > inference.yaml
-```
-
-**Parameters:**
-- `--model`: Model ID (e.g., `gemma-2-9b-it`, `llama-3-8b`)
-- `--model-server`: Inference server (`vllm`, `tgi`, `triton`, `tensorrt-llm`)
-- `--accelerator-type`: GPU/TPU type (`nvidia-l4`, `nvidia-tesla-a100`, `nvidia-h100-80gb`)
-- `--target-ntpot-milliseconds`: Target Normalized Time Per Output Token (optional, for latency optimization)
-
-**Example:**
-
-```bash
-gcloud container ai profiles manifests create \
-  --model=gemma-2-9b-it \
-  --model-server=vllm \
-  --accelerator-type=nvidia-l4 \
-  --target-ntpot-milliseconds=50 --quiet > inference.yaml
-```
-
-### 3. Review and Deploy
-
-```bash
-# Review for placeholders (HF tokens, PVCs)
-cat inference.yaml
-
-# Deploy
-kubectl apply -f inference.yaml
-
-# Monitor
-kubectl get pods -w
-kubectl logs -f <POD_NAME>
-```
-
-> Some models require Hugging Face tokens. Create a Kubernetes Secret and reference it in the manifest.
-
-## GPU ComputeClass for Inference
-
-For Autopilot clusters, create a ComputeClass to target GPU nodes:
-
-```yaml
-apiVersion: cloud.google.com/v1
-kind: ComputeClass
-metadata:
-  name: l4-inference
-spec:
-  priorities:
-  - machineFamily: g2
-    gpu:
-      type: nvidia-l4
-      count: 1
-    minCores: 4
-    minMemoryGb: 16
-```
-
-## Accelerator Selection Guide
-
-| Accelerator | Best For | Memory | Relative Cost |
-|-------------|----------|--------|---------------|
-| NVIDIA T4 | Budget inference, lightweight legacy models | 16 GB | Lowest |
-| NVIDIA L4 (G2) | Small-medium model inference, video, graphics | 24 GB | Low |
-| NVIDIA RTX PRO 6000 (G4) | Multimodal AI, high-fidelity 3D, fine-tuning | 96 GB | Medium |
-| Cloud TPU v5e | Cost-effective transformer inference | Varies | Medium |
-| Cloud TPU v5p | High-performance training | Varies | High |
-| Cloud TPU v6e (Trillium) | High-efficiency next-gen training & serving | 32 GB/chip | Medium-High |
-| Cloud TPU v7x (Ironwood) | Ultra-scale inference & agentic workflows | 192 GB/chip | High |
-| NVIDIA A100 | Large model inference, enterprise ML | 40/80 GB | High |
-| NVIDIA H100 / H200 | Frontier model training, high throughput | 80/141 GB | Highest |
-| NVIDIA B200 (A4) | Blackwell-scale training, FP4 precision | 192 GB | Highest |
-| NVIDIA GB200 (A4X) | Rack-scale AI (Grace Blackwell Superchip) | Massive | Highest |
-
-## Autoscaling LLM Inference
-
-### GPU-based autoscaling
-
-Use custom metrics for GPU utilization:
-
-```yaml
-apiVersion: autoscaling/v2
-kind: HorizontalPodAutoscaler
-metadata:
-  name: llm-hpa
-spec:
-  scaleTargetRef:
-    apiVersion: apps/v1
-    kind: Deployment
-    name: llm-server
-  minReplicas: 1
-  maxReplicas: 10
-  metrics:
-  - type: Pods
-    pods:
-      metric:
-        name: gpu_duty_cycle
-      target:
-        type: AverageValue
-        averageValue: "80"
-```
-
-### Best practices for inference autoscaling
-
-1. **Use DCGM metrics**: Golden path enables DCGM monitoring for GPU utilization metrics
-2. **Set appropriate minReplicas**: At least 1 for always-on serving; 0 for batch/on-demand
-3. **Tune scale-down delay**: LLM model loading is slow; use longer stabilization windows
-4. **Consider queue depth**: Scale on pending requests rather than pure GPU utilization for latency-sensitive workloads
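-
-The scale-down delay from item 3 is set through the HPA `behavior` field. A sketch that could be merged into the `llm-hpa` spec above; the 600-second window and the one-pod-per-two-minutes policy are illustrative assumptions, not tested recommendations:
-
-```yaml
-# Fragment of a HorizontalPodAutoscaler spec (autoscaling/v2)
-spec:
-  behavior:
-    scaleDown:
-      stabilizationWindowSeconds: 600   # wait for 10 min of lower demand before removing replicas
-      policies:
-      - type: Pods
-        value: 1                        # remove at most one replica...
-        periodSeconds: 120              # ...every two minutes
-    scaleUp:
-      stabilizationWindowSeconds: 0     # react to load increases immediately
-```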
-
-## Optimization Tips
-
-- **Quantization**: Use quantized models (GPTQ, AWQ) to reduce GPU memory and increase throughput
-- **Batching**: Configure model server batch size for throughput vs latency trade-off
-- **Tensor parallelism**: Split large models across multiple GPUs within a node
-- **KV cache optimization**: Tune `--gpu-memory-utilization` in vLLM for KV cache allocation
-
-## Troubleshooting
-
-| Issue | Cause | Fix |
-|-------|-------|-----|
-| Invalid model/accelerator combination | Unsupported tuple | Re-run `gcloud container ai profiles list --model=<MODEL>` |
-| GPU quota exceeded | Regional quota limit | Request quota increase or try a different region |
-| OOM on GPU | Model too large for accelerator | Use larger GPU, enable quantization, or use tensor parallelism |
-| Slow cold start | Large model loading from registry | Use local SSD for model caching; pre-pull images |
diff --git a/skills/cloud/gke-basics/references/gke-networking.md b/skills/cloud/gke-basics/references/gke-networking.md
deleted file mode 100644
index 20eb5b49c0..0000000000
--- a/skills/cloud/gke-basics/references/gke-networking.md
+++ /dev/null
@@ -1,131 +0,0 @@
-# GKE Networking
-
-This reference covers networking configuration for GKE clusters. The golden path enforces private, VPC-native clusters with Dataplane V2.
-
-> **MCP Tools:** `get_cluster`, `update_cluster`, `apply_k8s_manifest`, `get_k8s_resource`
-
-## Golden Path Networking Defaults
-
-| Setting | Golden Path Value | Day-0/1 | Notes |
-|---------|-------------------|---------|-------|
-| `privateClusterConfig.enablePrivateNodes` | `true` | Day-0 | Nodes have no public IPs |
-| `masterAuthorizedNetworksConfig.privateEndpointEnforcementEnabled` | `true` | Day-0 | Control plane only reachable via private endpoint or DNS |
-| `controlPlaneEndpointsConfig.dnsEndpointConfig.allowExternalTraffic` | `true` | Day-0 | Allows DNS-based access from outside VPC |
-| `networkConfig.datapathProvider` | `ADVANCED_DATAPATH` (Dataplane V2) | Day-0 | eBPF-based, built-in Network Policy |
-| `networkConfig.dnsConfig.clusterDns` | `CLOUD_DNS` | Day-0 | Managed DNS, more reliable than kube-dns |
-| `networkConfig.enableIntraNodeVisibility` | `true` | Day-1 | VPC Flow Logs for intra-node traffic |
-| `networkConfig.gatewayApiConfig.channel` | `CHANNEL_STANDARD` | Day-1 | Gateway API support |
-| `ipAllocationPolicy.autoIpamConfig.enabled` | `true` | Day-0 | Automatic IP range management |
-| `ipAllocationPolicy.createSubnetwork` | `true` | Day-0 | Auto-create dedicated subnet |
-| `defaultMaxPodsConstraint.maxPodsPerNode` | `48` | Day-0 | Conservative default; 110 for high density |
-
-## Private Cluster Access Patterns
-
-The golden path creates a private cluster. Users access it via:
-
-1. **DNS endpoint (default)**: `allowExternalTraffic: true` enables access via the cluster's DNS endpoint from outside the VPC. No VPN required.
-2. **Private endpoint**: Direct access from within the VPC or via Cloud VPN/Interconnect.
-3. **Authorized networks**: Add specific CIDRs to `masterAuthorizedNetworksConfig` for IP-based access control.
-
-```bash
-# Access private cluster via DNS endpoint (golden path default)
-gcloud container clusters get-credentials <CLUSTER_NAME> \
-  --region <REGION> --dns-endpoint \
-  --quiet
-
-# Access via private endpoint (from within VPC)
-gcloud container clusters get-credentials <CLUSTER_NAME> \
-  --region <REGION> --internal-ip \
-  --quiet
-```
-
-## Bring-Your-Own VPC/Subnet
-
-If the customer has existing network infrastructure:
-
-```bash
-gcloud container clusters create-auto <CLUSTER_NAME> \
-  --region <REGION> \
-  --network <VPC_NAME> \
-  --subnetwork <SUBNET_NAME> \
-  --cluster-secondary-range-name <POD_RANGE> \
-  --services-secondary-range-name <SVC_RANGE> \
-  --enable-private-nodes \
-  --enable-master-authorized-networks \
-  --quiet
-```
-
-> **Day-0 Warning**: VPC, subnet, and IP ranges cannot be changed after cluster creation.
-
-## IP Planning
-
-| Resource | Golden Path | Notes |
-|----------|-------------|-------|
-| Pod CIDR | `/17` (auto) | ~32K pod IPs; size based on maxPodsPerNode |
-| Service CIDR | `/20` (auto) | ~4K service IPs |
-| Node subnet | auto-created | /20 recommended for growth |
-| Max pods/node | 48 | Each node gets a /25 pod range; set to 110 for /24 per node |
-
-**Pod CIDR sizing rule of thumb:**
-- `maxPodsPerNode=48` -> each node uses a `/25` (128 IPs) from pod CIDR
-- `maxPodsPerNode=110` -> each node uses a `/24` (256 IPs) from pod CIDR
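-- Larger maxPodsPerNode = fewer nodes fit in a given CIDR. For example, a `/17` pod range (32,768 IPs) holds 256 nodes at `/25` per node but only 128 nodes at `/24` per node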
-
-## Ingress
-
-**Gateway API** (golden path, enabled via `gatewayApiConfig.channel: CHANNEL_STANDARD`):
-
-```yaml
-apiVersion: gateway.networking.k8s.io/v1
-kind: Gateway
-metadata:
-  name: external-http
-spec:
-  gatewayClassName: gke-l7-global-external-managed
-  listeners:
-  - name: http
-    protocol: HTTP
-    port: 80
-```
-
-**Alternatives:**
-- `gke-l7-regional-external-managed` — regional external
-- `gke-l7-rilb` — internal load balancer
-- Istio service mesh — for advanced traffic management, mTLS
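-
-A Gateway only defines the listener; traffic reaches a workload once a route attaches to it. A minimal sketch, assuming a Service named `my-app` on port 8080 in the same namespace (both hypothetical):
-
-```yaml
-apiVersion: gateway.networking.k8s.io/v1
-kind: HTTPRoute
-metadata:
-  name: my-app-route
-spec:
-  parentRefs:
-  - name: external-http        # the Gateway defined above
-  rules:
-  - matches:
-    - path:
-        type: PathPrefix
-        value: /
-    backendRefs:
-    - name: my-app             # hypothetical Service
-      port: 8080
-```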
-
-## Egress
-
-- Default: nodes use Cloud NAT for outbound internet access (private nodes have no public IPs)
-- For static egress IPs: configure Cloud NAT with manual IP allocation
-- For restricted egress: route through a firewall appliance via custom routes
-
-## Network Policy
-
-Dataplane V2 (golden path) provides built-in Network Policy enforcement — no additional addon needed. Apply default-deny per namespace, then allow specific flows.
-
-> See [gke-security.md](./gke-security.md) for default-deny policy and [gke-multitenancy.md](./gke-multitenancy.md) for per-team allow policies.
-
-## Cloud Armor (Recommended for Public-Facing Services)
-
-Cloud Armor provides WAF and DDoS protection. **Not a golden path default** — recommended for any service with public ingress. Link via `BackendConfig`:
-
-```yaml
-# 1. Create BackendConfig referencing your Cloud Armor policy
-apiVersion: cloud.google.com/v1
-kind: BackendConfig
-metadata:
-  name: my-backend-config
-spec:
-  securityPolicy:
-    name: my-cloud-armor-policy
----
-# 2. Annotate your Service
-# cloud.google.com/backend-config: '{"default": "my-backend-config"}'
-```
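-
-Written out as a manifest, the annotation from step 2 might look like this; the Service name, selector, and ports are placeholders:
-
-```yaml
-apiVersion: v1
-kind: Service
-metadata:
-  name: my-app
-  annotations:
-    cloud.google.com/backend-config: '{"default": "my-backend-config"}'
-spec:
-  type: ClusterIP
-  selector:
-    app: my-app
-  ports:
-  - port: 80
-    targetPort: 8080
-```
-
-`BackendConfig` is read by the GKE Ingress controller. For Services exposed through Gateway API, Cloud Armor policies are attached with a `GCPBackendPolicy` resource instead, so check which controller fronts the Service before relying on this annotation.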
-
-## SSL, Container-Native LB, and PSC
-
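-- **Google-managed SSL certificates**: Auto-provisioned and auto-renewed. The `ManagedCertificate` CRD applies to GKE Ingress; with Gateway API, attach certificates through Certificate Manager or Google-managed SSL certificate resources referenced from the Gateway.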
-- **Container-native LB**: Enabled by default on VPC-native clusters (golden path). Targets pods via NEGs, bypassing iptables. Annotation: `cloud.google.com/neg: '{"ingress": true}'`.
-- **Private Service Connect (PSC)**: Use `ServiceAttachment` CRD to expose services across VPCs without peering.
-
diff --git a/skills/cloud/gke-basics/references/gke-observability.md b/skills/cloud/gke-basics/references/gke-observability.md
deleted file mode 100644
index 9b940a2041..0000000000
--- a/skills/cloud/gke-basics/references/gke-observability.md
+++ /dev/null
@@ -1,168 +0,0 @@
-# GKE Observability
-
-This reference covers monitoring, logging, and metrics configuration for GKE. The golden path enables comprehensive observability including control-plane metrics.
-
-> **MCP Tools:** `get_cluster`, `list_k8s_events`, `get_k8s_logs`, `get_k8s_cluster_info`, `describe_k8s_resource`. **CLI-only:** `gcloud container clusters update --monitoring=...`, `gcloud logging read`
-
-## Golden Path Observability Defaults
-
-| Setting | Golden Path Value | Notes |
-|---------|-------------------|-------|
-| `loggingConfig` components | SYSTEM_COMPONENTS, WORKLOADS | Full workload logging |
-| `monitoringConfig` components | SYSTEM_COMPONENTS, STORAGE, POD, DEPLOYMENT, STATEFULSET, DAEMONSET, HPA, JOBSET, CADVISOR, KUBELET, DCGM, APISERVER, SCHEDULER, CONTROLLER_MANAGER | Full suite including control-plane |
-| `managedPrometheusConfig.enabled` | `true` | Google-managed Prometheus |
-| `advancedDatapathObservabilityConfig.enableMetrics` | `true` | Dataplane V2 flow metrics |
-| `loggingService` | `logging.googleapis.com/kubernetes` | Cloud Logging |
-| `monitoringService` | `monitoring.googleapis.com/kubernetes` | Cloud Monitoring |
-
-### Control-Plane Metrics (Golden Path Addition)
-
-The golden path adds three control-plane monitoring components not present in default clusters:
-
-| Component | What It Monitors |
-|-----------|-----------------|
-| `APISERVER` | API server request latency, error rates, admission webhook performance |
-| `SCHEDULER` | Scheduling latency, pending pods, scheduling failures |
-| `CONTROLLER_MANAGER` | Controller work queue depth, reconciliation latency |
-
-These are critical for diagnosing cluster-level issues (slow API responses, scheduling delays, stuck controllers).
-
-## Enabling Full Monitoring
-
-```bash
-# Enable golden path monitoring suite
-gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
-  --monitoring=SYSTEM,API_SERVER,SCHEDULER,CONTROLLER_MANAGER,STORAGE,POD,DEPLOYMENT,STATEFULSET,DAEMONSET,HPA,JOBSET,CADVISOR,KUBELET,DCGM \
-  --quiet
-
-# Enable Managed Prometheus
-gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
-  --enable-managed-prometheus \
-  --quiet
-
-# Enable Dataplane V2 observability metrics
-gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
-  --enable-dataplane-v2-metrics \
-  --quiet
-```
-
-## Managed Prometheus
-
-Golden path enables Google Managed Prometheus for metrics collection and querying.
-
-**Querying metrics:**
-- Use Cloud Monitoring Metrics Explorer in the console
-- Use PromQL via the Prometheus UI or API
-- Grafana dashboards via Managed Grafana
-
-**Key GKE metrics:**
-
-| Metric | Source | Use |
-|--------|--------|-----|
-| `container_cpu_usage_seconds_total` | cAdvisor | Pod CPU usage |
-| `container_memory_working_set_bytes` | cAdvisor | Pod memory usage |
-| `kube_pod_status_phase` | kube-state-metrics | Pod lifecycle |
-| `apiserver_request_duration_seconds` | API Server | Control plane latency |
-| `scheduler_scheduling_duration_seconds` | Scheduler | Scheduling performance |
-| `node_cpu_seconds_total` | node-exporter | Node CPU |
-| `DCGM_FI_DEV_GPU_UTIL` | DCGM | GPU utilization |
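-
-Application metrics are scraped by managed collection only when a `PodMonitoring` resource selects the pods. A minimal sketch, assuming pods labeled `app: my-app` that expose Prometheus metrics on a container port named `metrics` (both hypothetical):
-
-```yaml
-apiVersion: monitoring.googleapis.com/v1
-kind: PodMonitoring
-metadata:
-  name: my-app-monitoring
-spec:
-  selector:
-    matchLabels:
-      app: my-app
-  endpoints:
-  - port: metrics      # named container port serving /metrics
-    interval: 30s
-```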
-
-## Live Resource Usage (kubectl-only)
-
-No MCP or gcloud equivalent exists for live resource usage. Use `kubectl top`:
-
-```bash
-kubectl top pods --all-namespaces --sort-by=cpu
-kubectl top nodes
-kubectl top pods --containers -n <NAMESPACE>  # per-container breakdown
-```
-
-## Cloud Logging (gcloud-only)
-
-**Querying cluster logs** (no MCP equivalent — use `gcloud logging read`):
-
-```bash
-# System component logs
-gcloud logging read \
-  'resource.type="k8s_cluster" AND resource.labels.cluster_name="<CLUSTER_NAME>"' \
-  --project <PROJECT_ID> --limit 50 \
-  --quiet
-
-# Workload logs for a specific namespace
-gcloud logging read \
-  'resource.type="k8s_container" AND resource.labels.cluster_name="<CLUSTER_NAME>" AND resource.labels.namespace_name="<NAMESPACE>"' \
-  --project <PROJECT_ID> --limit 50 \
-  --quiet
-
-# Audit logs (who did what)
-gcloud logging read \
-  'resource.type="k8s_cluster" AND logName:"cloudaudit.googleapis.com"' \
-  --project <PROJECT_ID> --limit 50 \
-  --quiet
-```
-
-## Diagnostic Settings
-
-For security monitoring and troubleshooting, first check which logging components are enabled on the cluster:
-
-```bash
-# View current logging config
-gcloud container clusters describe <CLUSTER_NAME> --region <REGION> \
-  --format="yaml(loggingConfig)" \
-  --quiet
-```
-
-## Alerting
-
-Set up alerts for critical conditions:
-
-| Condition | Metric | Threshold |
-|-----------|--------|-----------|
-| High API server latency | `apiserver_request_duration_seconds` | P99 > 5s |
-| Pod crash loops | `kube_pod_container_status_restarts_total` | > 5 in 10min |
-| Node not ready | `kube_node_status_condition` | condition=Ready, status!=True |
-| High GPU utilization | `DCGM_FI_DEV_GPU_UTIL` | > 95% sustained |
-| PVC near capacity | `kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes` | > 85% |
-| Scheduling failures | `scheduler_schedule_attempts_total{result="error"}` | > 0 |
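-
-With Managed Prometheus, thresholds like these can be expressed as alerting rules. A sketch of the crash-loop row using the `Rules` resource; the rule name, namespace, `for` duration, and severity label are assumptions, and the metric requires kube-state-metrics to be collected:
-
-```yaml
-apiVersion: monitoring.googleapis.com/v1
-kind: Rules
-metadata:
-  name: pod-crash-loop
-  namespace: <NAMESPACE>
-spec:
-  groups:
-  - name: workload-health
-    interval: 30s
-    rules:
-    - alert: PodCrashLooping
-      # More than 5 container restarts within the last 10 minutes
-      expr: increase(kube_pod_container_status_restarts_total[10m]) > 5
-      for: 5m
-      labels:
-        severity: warning
-```
-
-Firing alerts still need a notification path, for example a managed Alertmanager configuration or an equivalent Cloud Monitoring alerting policy.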
-
-## Cost Considerations
-
-Monitoring and logging have associated costs:
-
-- **Cloud Logging**: Charged per GiB ingested beyond free tier (50 GiB/project/month)
-- **Cloud Monitoring**: Free for GKE system metrics; custom metrics charged per time series
-- **Managed Prometheus**: Charged per sample ingested
-
-To reduce costs in non-production:
-```bash
-# Reduce to system-only monitoring
-gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
-  --monitoring=SYSTEM \
-  --quiet
-```
-
-## Distributed Tracing & Continuous Profiling (Recommended)
-
-**Not golden path defaults** — recommended for production microservice architectures and performance-sensitive workloads.
-
-- **Cloud Trace**: Add OpenTelemetry SDK to your app with the `opentelemetry-operations-go` (or equivalent) exporter. Traces appear in Cloud Trace console. Identifies cross-service latency bottlenecks.
-- **Cloud Profiler**: Add the Cloud Profiler agent to your app. Profiles CPU and memory usage in production with low overhead. Identifies hotspots and compares across versions.
-
-## LQL Query Examples
-
-Common Logging Query Language patterns for GKE troubleshooting:
-
-```
-# Error logs for a specific container
-resource.type="k8s_container" AND resource.labels.container_name="my-app" AND severity>=ERROR
-
-# OOMKilled events
-resource.type="k8s_event" AND jsonPayload.reason="OOMKilling"
-
-# Pod scheduling failures
-resource.type="k8s_event" AND jsonPayload.reason="FailedScheduling"
-
-# Audit logs (who did what)
-resource.type="k8s_cluster" AND logName:"cloudaudit.googleapis.com"
-```
-
diff --git a/skills/cloud/gke-basics/references/gke-reliability.md b/skills/cloud/gke-basics/references/gke-reliability.md
deleted file mode 100644
index 8b2f3129b6..0000000000
--- a/skills/cloud/gke-basics/references/gke-reliability.md
+++ /dev/null
@@ -1,169 +0,0 @@
-# GKE Reliability
-
-This reference covers high availability and reliability configuration for GKE clusters and workloads.
-
-> **MCP Tools:** `get_cluster`, `get_k8s_resource`, `describe_k8s_resource`, `apply_k8s_manifest`, `list_k8s_events`
-
-## Golden Path Reliability Defaults
-
-| Setting | Golden Path Value | Notes |
-|---------|-------------------|-------|
-| Cluster type | Regional (4 zones: us-central1-a/b/c/f) | Control plane replicated across zones |
-| Upgrade strategy | SURGE (`maxSurge: 1`) | Rolling upgrades with extra capacity |
-| Auto-repair | `true` | Unhealthy nodes replaced automatically |
-| Auto-upgrade | `true` | Nodes follow control plane version |
-| Release channel | REGULAR | Balanced freshness and stability |
-| Stateful HA | Enabled | Leader election for stateful workloads |
-
-## Workflows
-
-### 1. Verify Cluster High Availability
-
-```
-# MCP (preferred)
-get_cluster(name="projects/<PROJECT>/locations/<REGION>/clusters/<CLUSTER>",
-  readMask="location,locations,nodePools.locations")
-
-# gcloud fallback
-gcloud container clusters describe <CLUSTER> --region <REGION> \
-  --format="json(location, locations)" \
-  --quiet
-```
-
-- If `location` is a region (e.g., `us-central1`), the control plane is regional
-- If `locations` has multiple entries, nodes span multiple zones
-
-### 2. Pod Disruption Budgets (PDBs)
-
-PDBs ensure minimum pod availability during voluntary disruptions (node upgrades, autoscaler scale-down).
-
-**Check existing PDBs:**
-
-```
-# MCP (preferred)
-get_k8s_resource(parent="...", resourceType="poddisruptionbudget")
-
-# kubectl fallback
-kubectl get pdb --all-namespaces
-```
-
-**Create PDB:**
-
-```yaml
-apiVersion: policy/v1
-kind: PodDisruptionBudget
-metadata:
-  name: my-app-pdb
-  namespace: default
-spec:
-  minAvailable: 2       # Or use maxUnavailable: 1
-  selector:
-    matchLabels:
-      app: my-app
-```
-
-> Every production Deployment with 2+ replicas should have a PDB.
-
-### 3. Health Probes
-
-Every production container should have liveness and readiness probes. Startup probes are recommended for slow-starting apps.
-
-**Check existing probes:**
-
-```
-# MCP (preferred)
-describe_k8s_resource(parent="...", resourceType="deployment", name="<APP>", namespace="<NS>")
-
-# kubectl fallback
-kubectl get deployment <APP> -n <NS> -o yaml | grep -E "livenessProbe|readinessProbe|startupProbe"
-```
-
-**Recommended probe configuration:**
-
-```yaml
-spec:
-  containers:
-  - name: app
-    livenessProbe:
-      httpGet:
-        path: /healthz
-        port: 8080
-      initialDelaySeconds: 15
-      periodSeconds: 10
-      failureThreshold: 3
-    readinessProbe:
-      httpGet:
-        path: /readyz
-        port: 8080
-      initialDelaySeconds: 5
-      periodSeconds: 5
-      failureThreshold: 3
-    startupProbe:             # For slow-starting apps
-      httpGet:
-        path: /healthz
-        port: 8080
-      initialDelaySeconds: 10
-      periodSeconds: 5
-      failureThreshold: 30    # 30 * 5s = 150s max startup time
-```
-
-- **Readiness**: Determines when a pod can accept traffic
-- **Liveness**: Determines when to restart a container
-- **Startup**: Holds off liveness and readiness probes until the startup probe succeeds (prevents premature restarts of slow-starting apps)
-
-### 4. Graceful Shutdown
-
-Ensure applications handle `SIGTERM` and drain in-flight requests:
-
-```yaml
-spec:
-  terminationGracePeriodSeconds: 30    # Default; increase for long-running requests
-  containers:
-  - name: app
-    lifecycle:
-      preStop:
-        exec:
-          command: ["/bin/sh", "-c", "sleep 5"]  # Allow LB to deregister
-```
-
-### 5. Topology Spread Constraints
-
-Distribute pods across zones and nodes to survive failures:
-
-```yaml
-spec:
-  topologySpreadConstraints:
-  - maxSkew: 1
-    topologyKey: topology.kubernetes.io/zone
-    whenUnsatisfiable: DoNotSchedule
-    labelSelector:
-      matchLabels:
-        app: my-app
-  - maxSkew: 1
-    topologyKey: kubernetes.io/hostname
-    whenUnsatisfiable: ScheduleAnyway
-    labelSelector:
-      matchLabels:
-        app: my-app
-```
-
-- **Zone spread** (`DoNotSchedule`): Hard requirement -- pods must be balanced across zones
-- **Node spread** (`ScheduleAnyway`): Best-effort -- prefer distribution but don't block scheduling
-
-### 6. Replicas
-
-| Workload Type | Minimum Replicas | Reason |
-|--------------|-----------------|--------|
-| Stateless web/API | 2 | Survive single pod/node failure |
-| Critical services | 3 | Survive zone failure with zone spread |
-| Stateful (databases) | 3 (with replication) | Application-level quorum |
-| Batch/jobs | 1 | Ephemeral by nature |
-
-## Best Practices
-
-1. **Regional clusters for production**: Always use regional clusters to survive zone failures
-2. **PDBs for everything**: Every production workload with 2+ replicas needs a PDB
-3. **Probes for all containers**: At minimum, readiness probes on every production container
-4. **Zone spreading**: Use topology spread constraints to distribute pods across failure domains
-5. **Graceful shutdown**: Handle SIGTERM and set appropriate `terminationGracePeriodSeconds`
-6. **Maintenance windows**: Schedule upgrades during low-traffic periods (see [gke-upgrades.md](./gke-upgrades.md))
diff --git a/skills/cloud/gke-basics/references/gke-scaling.md b/skills/cloud/gke-basics/references/gke-scaling.md
deleted file mode 100644
index 2ce2a6dbb9..0000000000
--- a/skills/cloud/gke-basics/references/gke-scaling.md
+++ /dev/null
@@ -1,149 +0,0 @@
-# GKE Workload Scaling
-
-This reference covers scaling workloads on GKE. The golden path enables VPA, OPTIMIZE_UTILIZATION autoscaling profile, and Node Auto Provisioning by default.
-
-> **MCP Tools:** `get_k8s_resource`, `describe_k8s_resource`, `apply_k8s_manifest`, `patch_k8s_resource`, `get_cluster`, `update_cluster`, `update_node_pool`
-
-## Golden Path Scaling Defaults
-
-| Setting | Golden Path Value | Notes |
-|---------|-------------------|-------|
-| `autoscaling.autoscalingProfile` | `OPTIMIZE_UTILIZATION` | Aggressive scale-down for cost savings |
-| `verticalPodAutoscaling.enabled` | `true` | VPA recommendations available |
-| `autoscaling.enableNodeAutoprovisioning` | `true` | NAP creates node pools on demand |
-| GPU resource limits (T4, A100) | `1000000000` each | NAP can provision GPU nodes |
-
-## Scaling Mechanisms
-
-### 1. Manual Scaling
-
-> **kubectl-only** — no MCP equivalent for `kubectl scale`. Use kubectl directly.
-
-```bash
-kubectl scale deployment <DEPLOYMENT> --replicas=<N> -n <NAMESPACE>
-```
-
-### 2. Horizontal Pod Autoscaling (HPA)
-
-Scales the number of pods based on metrics.
-
-**Quick setup (kubectl-only — no MCP equivalent for `kubectl autoscale`):**
-
-```bash
-kubectl autoscale deployment <DEPLOYMENT> --cpu-percent=50 --min=1 --max=10
-```
-
-**Manifest approach (recommended — use MCP `apply_k8s_manifest`):**
-
-See [assets/hpa-example.yaml](../assets/hpa-example.yaml) for a template.
-
-```yaml
-apiVersion: autoscaling/v2
-kind: HorizontalPodAutoscaler
-metadata:
-  name: <DEPLOYMENT>-hpa
-spec:
-  scaleTargetRef:
-    apiVersion: apps/v1
-    kind: Deployment
-    name: <DEPLOYMENT>
-  minReplicas: 1
-  maxReplicas: 10
-  metrics:
-  - type: Resource
-    resource:
-      name: cpu
-      target:
-        type: Utilization
-        averageUtilization: 50
-```
-
-### 3. Vertical Pod Autoscaling (VPA)
-
-Adjusts CPU and memory requests to match actual usage. Enabled by default on golden path.
-
-**Update modes:**
-- `Off` — recommendations only (safest, start here)
-- `Initial` — sets resources only at pod creation
-- `Auto` — restarts pods to apply new resource values
-- `InPlaceOrRecreate` — updates resources without restart when possible (GKE 1.34+)
-
-**Create VPA in recommendation mode:**
-
-```yaml
-apiVersion: autoscaling.k8s.io/v1
-kind: VerticalPodAutoscaler
-metadata:
-  name: <DEPLOYMENT>-vpa
-spec:
-  targetRef:
-    apiVersion: apps/v1
-    kind: Deployment
-    name: <DEPLOYMENT>
-  updatePolicy:
-    updateMode: "Off"
-```
-
-**Read recommendations (prefer MCP `describe_k8s_resource`):**
-
-```
-# MCP (preferred)
-describe_k8s_resource(parent="...", resourceType="verticalpodautoscaler", name="<DEPLOYMENT>-vpa", namespace="<NAMESPACE>")
-
-# kubectl fallback
-kubectl get vpa <DEPLOYMENT>-vpa -o jsonpath='{.status.recommendation}'
-```
-
-See [assets/vpa-example.yaml](../assets/vpa-example.yaml) for a full template.
-
-### 4. Cluster Autoscaler / Node Auto Provisioning (NAP)
-
-On Autopilot (golden path), node scaling is fully managed. NAP automatically creates and sizes node pools based on workload demands.
-
-**For Standard clusters:**
-
-```bash
-# Enable cluster autoscaler on a node pool
-gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
-  --enable-autoscaling --node-pool <POOL_NAME> \
-  --min-nodes <MIN> --max-nodes <MAX> \
-  --quiet
-
-# Enable NAP
-gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
-  --enable-autoprovisioning \
-  --min-cpu <MIN_CPU> --max-cpu <MAX_CPU> \
-  --min-memory <MIN_MEM> --max-memory <MAX_MEM> \
-  --quiet
-```
-
-**Autoscaling profiles:**
-
-| Profile | Behavior | Golden Path? |
-|---------|----------|-------------|
-| `BALANCED` | Default GKE; conservative scale-down | No |
-| `OPTIMIZE_UTILIZATION` | Aggressive scale-down; lower idle resources | **Yes** |
-
-## Best Practices
-
-1. **Define resource requests**: HPA and VPA rely on accurate requests. Always set them.
-2. **Avoid metric conflicts**: Do not use HPA and VPA on the same metric. Typical pattern: HPA on CPU, VPA on memory.
-3. **Pod Disruption Budgets**: Define PDBs for all production workloads to ensure availability during scaling events.
-4. **HPA stabilization**: HPA applies a default 5-minute stabilization window to scale-down only; scale-up has no default delay. Tune `behavior` if either direction needs to respond differently.
-5. **VPA "Auto" caution**: Auto mode restarts pods. Ensure your app handles SIGTERM gracefully. VPA requires at least 2 replicas for evictions by default.
-6. **Use ComputeClasses**: For workload-specific node targeting (Spot fallback, GPU, specific machine families), use ComputeClasses instead of node selectors.
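-
-Practice 2 can be sketched by restricting VPA to memory so that a CPU-based HPA stays in control of replica count. The container wildcard and `Off` mode are illustrative choices:
-
-```yaml
-apiVersion: autoscaling.k8s.io/v1
-kind: VerticalPodAutoscaler
-metadata:
-  name: <DEPLOYMENT>-vpa
-spec:
-  targetRef:
-    apiVersion: apps/v1
-    kind: Deployment
-    name: <DEPLOYMENT>
-  updatePolicy:
-    updateMode: "Off"                   # start with recommendations only
-  resourcePolicy:
-    containerPolicies:
-    - containerName: "*"
-      controlledResources: ["memory"]   # leave CPU to the HPA
-```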
-
-## Rightsizing Workflow
-
-1. Deploy VPA in `Off` mode for 24+ hours
-2. Read recommendations: `kubectl describe vpa <NAME>`
-3. Compare `target` values against current `requests`
-4. Apply with 20% buffer: `new_request = target * 1.2`
-5. Use patch format to update Deployment
-
-| Condition | Recommendation | Risk |
-|-----------|----------------|------|
-| CPU request >5x P95 actual | Reduce to `P95 * 1.2` | Medium |
-| Memory request >3x P95 actual | Reduce to `P95 * 1.2` | Medium |
-| CPU request >2x P95 actual | Rightsizing with 20% buffer | Low |
-| No resource limits set | Add limits to prevent noisy-neighbor | Low |
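-
-Steps 4 and 5 might look like the following strategic-merge patch. The numbers are made up for illustration: a VPA target of 250m CPU and 400Mi memory, each multiplied by 1.2. The container name `app` and the file name are also placeholders:
-
-```yaml
-# rightsizing-patch.yaml (hypothetical file name)
-spec:
-  template:
-    spec:
-      containers:
-      - name: app              # must match the container name in the Deployment
-        resources:
-          requests:
-            cpu: 300m          # 250m target * 1.2
-            memory: 480Mi      # 400Mi target * 1.2
-```
-
-Applied with `kubectl patch deployment <DEPLOYMENT> -n <NAMESPACE> --patch-file rightsizing-patch.yaml`, this triggers a rolling restart with the new requests.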
diff --git a/skills/cloud/gke-basics/references/gke-security.md b/skills/cloud/gke-basics/references/gke-security.md
deleted file mode 100644
index d4699ca5a6..0000000000
--- a/skills/cloud/gke-basics/references/gke-security.md
+++ /dev/null
@@ -1,226 +0,0 @@
-# GKE Security
-
-This reference covers security configuration for GKE clusters. The golden path enforces a hardened security posture by default.
-
-> **MCP Tools:** `get_cluster`, `check_k8s_auth`, `get_k8s_resource`, `apply_k8s_manifest`, `update_cluster`
-
-## Golden Path Security Defaults
-
-| Setting | Golden Path Value | Day-0/1 | Notes |
-|---------|-------------------|---------|-------|
-| `workloadIdentityConfig.workloadPool` | `<PROJECT>.svc.id.goog` | Day-0 | Workload Identity Federation for Pods |
-| `secretManagerConfig.enabled` | `true` | Day-1 | Google Secret Manager integration |
-| `secretManagerConfig.rotationConfig` | `enabled: true, rotationInterval: 120s` | Day-1 | Automatic secret rotation |
-| `rbacBindingConfig.enableInsecureBindingSystemAuthenticated` | `false` | Day-0 | Blocks legacy `system:authenticated` bindings |
-| `rbacBindingConfig.enableInsecureBindingSystemUnauthenticated` | `false` | Day-0 | Blocks legacy `system:unauthenticated` bindings |
-| `nodeConfig.shieldedInstanceConfig.enableSecureBoot` | `true` | Day-0 | Verifiable boot integrity |
-| `nodeConfig.shieldedInstanceConfig.enableIntegrityMonitoring` | `true` | Day-0 | Runtime integrity checks |
-| `nodeConfig.workloadMetadataConfig.mode` | `GKE_METADATA` | Day-0 | Blocks legacy metadata API, enforces Workload Identity |
-| Private cluster + Dataplane V2 settings | See [gke-networking.md](./gke-networking.md) | Day-0 | Private nodes, private endpoint enforcement, ADVANCED_DATAPATH |
-
-## Workload Identity Federation
-
-Workload Identity is the recommended way for pods to access Google Cloud APIs. It eliminates the need for static service account keys.
-
-### Setup
-
-```bash
-# 1. Create a Google Service Account (GSA)
-gcloud iam service-accounts create <GSA_NAME> \
-  --project <PROJECT_ID> \
-  --display-name "Workload Identity SA" \
-  --quiet
-
-# 2. Grant IAM roles to the GSA
-gcloud projects add-iam-policy-binding <PROJECT_ID> \
-  --member "serviceAccount:<GSA_NAME>@<PROJECT_ID>.iam.gserviceaccount.com" \
-  --role "<ROLE>" \
-  --quiet
-
-# 3. Create Kubernetes Service Account (KSA)
-kubectl create namespace <NAMESPACE>
-kubectl create serviceaccount <KSA_NAME> --namespace <NAMESPACE>
-
-# 4. Bind KSA to GSA
-gcloud iam service-accounts add-iam-policy-binding \
-  <GSA_NAME>@<PROJECT_ID>.iam.gserviceaccount.com \
-  --role roles/iam.workloadIdentityUser \
-  --member "serviceAccount:<PROJECT_ID>.svc.id.goog[<NAMESPACE>/<KSA_NAME>]" \
-  --quiet
-
-# 5. Annotate KSA
-kubectl annotate serviceaccount <KSA_NAME> \
-  --namespace <NAMESPACE> \
-  iam.gke.io/gcp-service-account=<GSA_NAME>@<PROJECT_ID>.iam.gserviceaccount.com
-```
-
-> See [assets/workload-identity-pod.yaml](../assets/workload-identity-pod.yaml) for a test pod.
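-
-Steps 3 and 5 can also be expressed declaratively. A sketch of the annotated ServiceAccount and a pod that uses it; the pod name and command are assumptions and may differ from the asset file referenced above:
-
-```yaml
-apiVersion: v1
-kind: ServiceAccount
-metadata:
-  name: <KSA_NAME>
-  namespace: <NAMESPACE>
-  annotations:
-    iam.gke.io/gcp-service-account: <GSA_NAME>@<PROJECT_ID>.iam.gserviceaccount.com
----
-apiVersion: v1
-kind: Pod
-metadata:
-  name: workload-identity-test
-  namespace: <NAMESPACE>
-spec:
-  serviceAccountName: <KSA_NAME>   # pod authenticates as the bound GSA
-  containers:
-  - name: test
-    image: gcr.io/google.com/cloudsdktool/cloud-sdk:slim
-    command: ["sleep", "infinity"]
-```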
-
-### Verification
-
-```bash
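-# --serviceaccount was removed from kubectl run in v1.24; set the account via --overrides
-kubectl run workload-identity-test \
-  --image=gcr.io/google.com/cloudsdktool/cloud-sdk:slim \
-  --namespace=<NAMESPACE> \
-  --overrides='{"spec": {"serviceAccountName": "<KSA_NAME>"}}' \
-  --rm -it -- gcloud auth list --quiet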
-```
-
-## Secret Manager Integration
-
-The golden path enables Secret Manager with automatic rotation. Secrets are synced to Kubernetes Secrets.
-
-```bash
-# Verify Secret Manager is enabled on cluster
-gcloud container clusters describe <CLUSTER_NAME> --region <REGION> \
-  --format="value(secretManagerConfig.enabled)" \
-  --quiet
-
-# Enable if not already (Day-1 change)
-gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
-  --enable-secret-manager \
-  --secret-manager-rotation-interval=120s \
-  --quiet
-```
-
-## RBAC Hardening
-
-The golden path disables insecure legacy RBAC bindings that grant broad access to `system:authenticated` and `system:unauthenticated` groups.
-
-```bash
-# Verify insecure bindings are disabled
-gcloud container clusters describe <CLUSTER_NAME> --region <REGION> \
-  --format="yaml(rbacBindingConfig)" \
-  --quiet
-```
-
-**Best practices for RBAC:**
-- Use namespace-scoped Roles over cluster-wide ClusterRoles
-- Bind to specific Groups or ServiceAccounts, never to `system:authenticated`
-- Audit permissions via MCP: `check_k8s_auth(parent="...", verb="list", resourceType="pods", namespace="...")` (or `kubectl auth can-i --list --as=<user>`)
-- Review bindings via MCP: `get_k8s_resource(parent="...", resourceType="clusterrolebinding")` (or `kubectl get clusterrolebindings,rolebindings --all-namespaces`)
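-
-The first two practices can be sketched as a namespace-scoped Role bound to a group. The role name, group identity, and resource list are placeholders:
-
-```yaml
-apiVersion: rbac.authorization.k8s.io/v1
-kind: Role
-metadata:
-  name: app-deployer
-  namespace: <NAMESPACE>
-rules:
-- apiGroups: ["apps"]
-  resources: ["deployments"]
-  verbs: ["get", "list", "watch", "create", "update", "patch"]
----
-apiVersion: rbac.authorization.k8s.io/v1
-kind: RoleBinding
-metadata:
-  name: app-deployer-binding
-  namespace: <NAMESPACE>
-subjects:
-- kind: Group
-  name: <TEAM_GROUP>@<DOMAIN>     # placeholder group identity
-  apiGroup: rbac.authorization.k8s.io
-roleRef:
-  kind: Role
-  name: app-deployer
-  apiGroup: rbac.authorization.k8s.io
-```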
-
-> See [gke-multitenancy.md](./gke-multitenancy.md) for enterprise RBAC planning and https://docs.cloud.google.com/kubernetes-engine/docs/best-practices/rbac
-
-## Binary Authorization
-
-Not enabled in golden path by default but recommended for production image provenance:
-
-```bash
-# Enable Binary Authorization
-gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
-  --binauthz-evaluation-mode=PROJECT_SINGLETON_POLICY_ENFORCE \
-  --quiet
-```
-
-## Network Policies
-
-Dataplane V2 (golden path) provides built-in Network Policy enforcement. Apply default-deny per namespace:
-
-```
-# MCP (preferred)
-apply_k8s_manifest(parent="...", yamlManifest="<contents of default-deny-netpol.yaml>")
-
-# kubectl fallback
-kubectl apply -f skills/gke/assets/default-deny-netpol.yaml -n <NAMESPACE>
-```
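-
-The asset file is not reproduced here. A default-deny policy typically looks like the sketch below, which may differ from `default-deny-netpol.yaml` in naming or scope:
-
-```yaml
-apiVersion: networking.k8s.io/v1
-kind: NetworkPolicy
-metadata:
-  name: default-deny-all
-spec:
-  podSelector: {}        # empty selector = every pod in the namespace
-  policyTypes:
-  - Ingress
-  - Egress               # no rules listed, so traffic in both directions is denied
-```
-
-Denying egress also blocks DNS lookups, so an allow rule for DNS is usually needed alongside it.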
-
-## GKE Sandbox (gVisor)
-
-For running untrusted workloads in an isolated sandbox:
-
-```bash
-# Create a sandbox-enabled node pool (Standard clusters; sandboxing is set per node pool)
-gcloud container node-pools create <POOL_NAME> --cluster <CLUSTER_NAME> \
-  --region <REGION> --sandbox type=gvisor --quiet
-
-# Use in pod spec
-# Add: runtimeClassName: gvisor
-```
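-
-As a sketch, a pod that opts in to the sandbox could look like this (the name and image are placeholders):
-
-```yaml
-apiVersion: v1
-kind: Pod
-metadata:
-  name: sandboxed-app
-spec:
-  runtimeClassName: gvisor     # run under gVisor on sandbox-enabled nodes
-  containers:
-  - name: app
-    image: <IMAGE>
-```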
-
-## Pod Security Standards (Golden Path)
-
-Pod Security Standards define three profiles that restrict what pods can do. The **`restricted` profile is the golden path default** for production namespaces.
-
-| Profile | Level | Use Case |
-|---------|-------|----------|
-| `privileged` | Unrestricted | System namespaces (`kube-system`), infrastructure controllers |
-| `baseline` | Minimally restrictive | Shared/dev namespaces, legacy apps being migrated |
-| `restricted` | **Golden path** | Production workloads -- blocks privilege escalation, host access, root |
-
-**Enforce via namespace labels (Pod Security Admission):**
-
-```yaml
-apiVersion: v1
-kind: Namespace
-metadata:
-  name: production
-  labels:
-    pod-security.kubernetes.io/enforce: restricted
-    pod-security.kubernetes.io/warn: restricted
-    pod-security.kubernetes.io/audit: restricted
-```
-
-**Gradual rollout strategy:**
-1. Start with `warn` + `audit` on existing namespaces to identify violations
-2. Fix non-compliant workloads (remove `privileged`, `hostNetwork`, root user, etc.)
-3. Enable `enforce` once all workloads pass
-
-`restricted` blocks: running as root, privilege escalation, host networking/PID/IPC, host path volumes, and most capabilities. The golden path `workload-identity-pod.yaml` already complies.
-
-## Network Policy Logging (Recommended)
-
-With Dataplane V2 (golden path), you can enable logging for Network Policy decisions. **Not a golden path default** -- recommended for security auditing.
-
-```bash
-gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
-  --enable-network-policy-logging \
-  --quiet
-```
-
-This logs allowed and denied connections, useful for troubleshooting Network Policy rules and auditing traffic flows.
-
-## Common IAM Roles
-
-The five most common predefined IAM roles for GKE:
-
-| Role | Purpose | When to Use |
-|------|---------|-------------|
-| `roles/container.admin` | Full control over clusters and Kubernetes resources | Platform team admins managing cluster lifecycle |
-| `roles/container.clusterAdmin` | Manage clusters but not project-level IAM | Cluster operators who create/delete clusters |
-| `roles/container.developer` | Deploy workloads (pods, services, deployments) | Application developers deploying to existing clusters |
-| `roles/container.viewer` | Read-only access to clusters and Kubernetes resources | Monitoring, auditing, or read-only dashboards |
-| `roles/container.clusterViewer` | List and get cluster details only | CI/CD pipelines that need cluster metadata |
-
-> **Principle of least privilege**: Start with `roles/container.viewer` or `roles/container.developer` and escalate only as needed. Avoid granting `roles/container.admin` broadly.
-
-## Service Accounts & Agents
-
-- **GKE Service Agent** (`service-<PROJECT_NUMBER>@container-engine-robot.iam.gserviceaccount.com`): Automatically created. Manages nodes, networking, and cluster operations on your behalf. Do not remove or modify its permissions.
-- **Node Service Account**: By default, nodes use the Compute Engine default service account. For production, create a dedicated SA with minimal permissions and assign it via node pool config.
-- **Workload Identity**: The recommended way for pods to access Google Cloud APIs. Maps a Kubernetes ServiceAccount to a Google IAM ServiceAccount — see [Workload Identity setup](#workload-identity-federation) above.
-
-## Cross-Service Authentication Patterns
-
-Common patterns for granting GKE workloads access to other Google Cloud services:
-
-```bash
-# Grant a GKE workload access to Cloud Storage
-gcloud projects add-iam-policy-binding <PROJECT_ID> \
-  --member "serviceAccount:<GSA_NAME>@<PROJECT_ID>.iam.gserviceaccount.com" \
-  --role "roles/storage.objectViewer" \
-  --quiet
-
-# Grant a GKE workload access to Cloud SQL
-gcloud projects add-iam-policy-binding <PROJECT_ID> \
-  --member "serviceAccount:<GSA_NAME>@<PROJECT_ID>.iam.gserviceaccount.com" \
-  --role "roles/cloudsql.client" \
-  --quiet
-
-# Grant a GKE workload access to Pub/Sub
-gcloud projects add-iam-policy-binding <PROJECT_ID> \
-  --member "serviceAccount:<GSA_NAME>@<PROJECT_ID>.iam.gserviceaccount.com" \
-  --role "roles/pubsub.subscriber" \
-  --quiet
-```
-
-In all cases, the GSA must be bound to a KSA via Workload Identity (see setup above). The pod then uses the KSA to authenticate as the GSA.
-
diff --git a/skills/cloud/gke-basics/references/gke-storage.md b/skills/cloud/gke-basics/references/gke-storage.md
deleted file mode 100644
index 3b96e61cf0..0000000000
--- a/skills/cloud/gke-basics/references/gke-storage.md
+++ /dev/null
@@ -1,136 +0,0 @@
-# GKE Storage
-
-This reference covers storage configuration for GKE clusters including persistent disks, file storage, and cloud storage integration.
-
-> **MCP Tools:** `apply_k8s_manifest`, `get_k8s_resource`, `describe_k8s_resource`, `get_cluster`
-
-## Golden Path Storage Defaults
-
-The golden path Autopilot config enables these CSI drivers:
-
-| Driver | Golden Path | Access Mode | Use Case |
-|--------|-------------|-------------|----------|
-| Compute Engine Persistent Disk CSI | Enabled (default) | ReadWriteOnce | Block storage for databases, single-pod workloads |
-| Google Cloud Filestore CSI | Enabled | ReadWriteMany | Shared NFS for multi-pod access |
-| Cloud Storage FUSE CSI | Enabled | ReadWriteMany / ReadOnlyMany | Mount GCS buckets as volumes |
-| Parallelstore CSI | Enabled | ReadWriteMany | High-performance parallel file system |
-| Boot disk type | `pd-balanced` | N/A | Node boot disks |
-
-## StorageClasses
-
-### Default StorageClasses
-
-GKE provides built-in StorageClasses:
-
-| StorageClass | Disk Type | Use Case |
-|-------------|-----------|----------|
-| `standard-rwo` | `pd-standard` | Cost-effective, low IOPS |
-| `premium-rwo` | `pd-ssd` | High IOPS, databases |
-| `standard-rwx` | Filestore (Basic HDD) | Shared NFS |
-| `premium-rwx` | Filestore (Basic SSD) | Shared NFS, higher performance |
-
-### Custom StorageClass
-
-```yaml
-apiVersion: storage.k8s.io/v1
-kind: StorageClass
-metadata:
-  name: fast-regional
-provisioner: pd.csi.storage.gke.io
-parameters:
-  type: pd-ssd
-  replication-type: regional-pd    # Replicate across 2 zones
-volumeBindingMode: WaitForFirstConsumer
-allowVolumeExpansion: true         # Always enable for production
-```
-
-## PersistentVolumeClaims
-
-### Block Storage (ReadWriteOnce)
-
-```yaml
-apiVersion: v1
-kind: PersistentVolumeClaim
-metadata:
-  name: database-pvc
-spec:
-  accessModes:
-  - ReadWriteOnce
-  storageClassName: premium-rwo
-  resources:
-    requests:
-      storage: 100Gi
-```
-
-### Shared File Storage (ReadWriteMany via Filestore)
-
-```yaml
-apiVersion: v1
-kind: PersistentVolumeClaim
-metadata:
-  name: shared-data
-spec:
-  accessModes:
-  - ReadWriteMany
-  storageClassName: standard-rwx
-  resources:
-    requests:
-      storage: 1Ti    # Filestore minimum is 1 TiB for Basic tier
-```
-
-### GCS Bucket Mount (Cloud Storage FUSE)
-
-Mount a GCS bucket as a volume without a PVC:
-
-```yaml
-apiVersion: v1
-kind: Pod
-metadata:
-  name: gcs-reader
-  annotations:
-    gke-gcsfuse/volumes: "true"
-spec:
-  containers:
-  - name: reader
-    image: busybox
-    command: ["ls", "/data"]
-    volumeMounts:
-    - name: gcs-bucket
-      mountPath: /data
-  volumes:
-  - name: gcs-bucket
-    csi:
-      driver: gcsfuse.csi.storage.gke.io
-      readOnly: true
-      volumeAttributes:
-        bucketName: <BUCKET_NAME>
-```
-
-> Requires Workload Identity for the pod's service account to have `storage.objectViewer` on the bucket.
-
-## Volume Expansion
-
-If `allowVolumeExpansion: true` is set on the StorageClass, resize by updating the PVC:
-
-```bash
-# kubectl
-kubectl patch pvc <PVC_NAME> -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'
-```
-
-```
-# MCP (preferred)
-patch_k8s_resource(parent="...", resourceType="persistentvolumeclaim", name="<PVC_NAME>",
-  patch='{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}')
-```
-
-Kubernetes automatically resizes the filesystem.
-
-## Best Practices
-
-1. **Always enable volume expansion**: Set `allowVolumeExpansion: true` on all StorageClasses
-2. **Use regional PDs for production**: `replication-type: regional-pd` replicates across 2 zones for HA
-3. **Use `WaitForFirstConsumer`**: Ensures the PV is provisioned in the same zone as the pod
-4. **Choose the right disk type**: `pd-ssd` for databases, `pd-balanced` (golden path default) for general use, `pd-standard` for cold storage
-5. **Use Filestore for shared access**: When multiple pods need to read/write the same files
-6. **Use GCS FUSE for data pipelines**: Mount buckets directly for ML training data, logs, etc.
-7. **Back up PVCs**: Use Backup for GKE (see [gke-backup-dr.md](./gke-backup-dr.md)) to protect persistent data
diff --git a/skills/cloud/gke-basics/references/gke-upgrades.md b/skills/cloud/gke-basics/references/gke-upgrades.md
deleted file mode 100644
index 91e1a5ba90..0000000000
--- a/skills/cloud/gke-basics/references/gke-upgrades.md
+++ /dev/null
@@ -1,142 +0,0 @@
-# GKE Upgrades & Maintenance
-
-This reference covers upgrade strategy, maintenance windows, and release channel management for GKE clusters.
-
-> **MCP Tools:** `get_cluster`, `get_k8s_version`, `update_cluster`, `update_node_pool`, `list_operations`, `get_operation`, `cancel_operation`, `get_k8s_resource`
-> **CLI-only**: `gcloud container get-server-config` (available versions), `gcloud container clusters update --maintenance-window-*` (maintenance windows)
-
-## Golden Path Upgrade Defaults
-
-| Setting | Golden Path Value | Notes |
-|---------|-------------------|-------|
-| `releaseChannel.channel` | `REGULAR` | Balanced between freshness and stability |
-| Maintenance exclusion | `NO_MINOR_UPGRADES`, 1 year | Prevents surprise minor version bumps |
-| `upgradeSettings.strategy` | `SURGE` | Rolling upgrades with `maxSurge: 1` |
-| Auto-repair | `true` | Unhealthy nodes are automatically replaced |
-| Auto-upgrade | `true` | Nodes follow control plane version |
-
-## Release Channels
-
-| Channel | Cadence | Best For |
-|---------|---------|----------|
-| `RAPID` | Weeks after release | Dev/test, early access to features |
-| `REGULAR` (golden path) | 2-3 months after Rapid | Production workloads |
-| `STABLE` | 2-3 months after Regular | Risk-averse, highly regulated |
-
-```bash
-# Check current channel
-gcloud container clusters describe <CLUSTER_NAME> --region <REGION> \
-  --format="value(releaseChannel.channel)" \
-  --quiet
-
-# Change channel (Day-1)
-gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
-  --release-channel <CHANNEL> \
-  --quiet
-```
-
-## Maintenance Windows
-
-Control when GKE can perform automatic maintenance (upgrades, patches).
-
-```bash
-# Set maintenance window (e.g., weekends 2am-6am UTC)
-gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
-  --maintenance-window-start "2026-01-01T02:00:00Z" \
-  --maintenance-window-end "2026-01-01T06:00:00Z" \
-  --maintenance-window-recurrence "FREQ=WEEKLY;BYDAY=SA,SU" \
-  --quiet
-```
-
-### Maintenance Exclusions
-
-The golden path includes a 1-year `NO_MINOR_UPGRADES` exclusion to prevent automatic minor version changes.
-
-```bash
-# Add maintenance exclusion
-gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
-  --add-maintenance-exclusion-name "freeze-1" \
-  --add-maintenance-exclusion-start "2026-04-11T00:00:00Z" \
-  --add-maintenance-exclusion-end "2027-04-11T00:00:00Z" \
-  --add-maintenance-exclusion-scope NO_MINOR_UPGRADES \
-  --quiet
-
-# Remove exclusion
-gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
-  --remove-maintenance-exclusion "freeze-1" \
-  --quiet
-```
-
-**Exclusion scopes:**
-- `NO_UPGRADES` — blocks all upgrades (max 30 days)
-- `NO_MINOR_UPGRADES` — allows patch upgrades, blocks minor version changes (max 1 year)
-- `NO_MINOR_OR_NODE_UPGRADES` — blocks minor and node upgrades (max 1 year)
-
-## Upgrade Strategy
-
-### SURGE (Golden Path)
-
-Rolling upgrade with configurable surge capacity:
-
-```bash
-# Default: maxSurge=1 (one extra node during upgrade)
-gcloud container node-pools update <POOL_NAME> \
-  --cluster <CLUSTER_NAME> --region <REGION> \
-  --max-surge-upgrade 1 --max-unavailable-upgrade 0 \
-  --quiet
-```
-
-### Blue-Green (For Zero-Downtime Critical Workloads)
-
-```bash
-gcloud container node-pools update <POOL_NAME> \
-  --cluster <CLUSTER_NAME> --region <REGION> \
-  --enable-blue-green-upgrade \
-  --node-pool-soak-duration "3600s" \
-  --quiet
-```
-
-## Pre-Upgrade Checklist
-
-1. **Check deprecations**: Review Kubernetes API deprecations between current and target version
-2. **Review PDBs**: Ensure all production workloads have PodDisruptionBudgets
-3. **Test in non-prod**: Upgrade a staging cluster first
-4. **Check addon compatibility**: Verify third-party controllers support the target version
-5. **Review node pool versions**: All node pools should be within 2 minor versions of the control plane
-
-```bash
-# Check current versions
-gcloud container clusters describe <CLUSTER_NAME> --region <REGION> \
-  --format="table(currentMasterVersion, nodePools[].version)" \
-  --quiet
-
-# Check available upgrades
-gcloud container get-server-config --region <REGION> \
-  --format="yaml(channels)" \
-  --quiet
-
-# List deprecation warnings
-kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis
-```
-
-## Manual Upgrade (When Needed)
-
-```bash
-# Upgrade control plane
-gcloud container clusters upgrade <CLUSTER_NAME> --region <REGION> \
-  --master --cluster-version <VERSION> \
-  --quiet
-
-# Upgrade node pool
-gcloud container clusters upgrade <CLUSTER_NAME> --region <REGION> \
-  --node-pool <POOL_NAME> \
-  --quiet
-```
-
-## Best Practices
-
-1. **Stay on a release channel**: Manual version management is error-prone. Let GKE manage versions.
-2. **Use maintenance windows**: Schedule upgrades during low-traffic periods.
-3. **Set PDBs on everything**: Protects workloads during node drains.
-4. **Monitor during upgrades**: Watch for pod eviction failures, CrashLoopBackOff, and scheduling issues.
-5. **Don't skip minor versions**: Upgrade incrementally (1.28 -> 1.29 -> 1.30, not 1.28 -> 1.30).
diff --git a/skills/cloud/gke-basics/references/iac-usage.md b/skills/cloud/gke-basics/references/iac-usage.md
index efc44ca4e7..e088546d5a 100644
--- a/skills/cloud/gke-basics/references/iac-usage.md
+++ b/skills/cloud/gke-basics/references/iac-usage.md
@@ -6,14 +6,14 @@ managed using Terraform.
 ## Terraform
 
 Terraform uses two main providers for GKE:
-*   The **Google Cloud provider** connects to the Google Cloud API to manage
-    GKE cluster infrastructure using Terraform resources such as
+
+*   The **Google Cloud provider** connects to the Google Cloud API to manage GKE
+    cluster infrastructure using Terraform resources such as
     `google_container_cluster` for the cluster itself, and
     `google_container_node_pool` for nodes in Standard mode.
 *   The **Kubernetes provider** connects to the Kubernetes API to manage
-    workloads inside the cluster using Kubernetes resources such as
-    Deployments and Services.
-
+    workloads inside the cluster using Kubernetes resources such as Deployments
+    and Services.
 
 ### GKE Autopilot Cluster Example
 
@@ -65,13 +65,13 @@ resource "kubernetes_deployment_v1" "default" {
 
 ### Reference Documentation
 
-- [Terraform Google Provider - Container Cluster](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster)
+-   [Terraform Google Provider - Container Cluster](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster)
 
-- [Terraform Google Provider - Kubernetes Provider](https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs)
+-   [Terraform Kubernetes Provider](https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs)
 
 ## YAML Samples
 
 GKE cluster configurations and Kubernetes manifests can also be defined using
 YAML for use with `kubectl apply` or Deployment Manager.
 
-- [GKE YAML Samples](https://docs.cloud.google.com/docs/samples?product=googlekubernetesengine)
+-   [GKE YAML Samples](https://docs.cloud.google.com/docs/samples?product=googlekubernetesengine)
diff --git a/skills/cloud/gke-basics/references/mcp-usage.md b/skills/cloud/gke-basics/references/mcp-usage.md
index 66e6dbbf1a..4534d88c7b 100644
--- a/skills/cloud/gke-basics/references/mcp-usage.md
+++ b/skills/cloud/gke-basics/references/mcp-usage.md
@@ -1,10 +1,14 @@
 # GKE MCP Server Usage
 
-The GKE MCP server provides 23 structured tools for cluster management, Kubernetes resource operations, and diagnostics — without requiring shell access or kubeconfig setup.
+The GKE MCP server provides 23 structured tools for cluster management,
+Kubernetes resource operations, and diagnostics — without requiring shell access
+or kubeconfig setup.
 
 ## Connecting to the GKE MCP Server
 
-The GKE remote MCP server is available for AI clients that support the Model Context Protocol. For setup instructions, see https://docs.cloud.google.com/kubernetes-engine/docs/how-to/use-gke-mcp.
+The GKE remote MCP server is available for AI clients that support the Model
+Context Protocol. For setup instructions, see
+https://docs.cloud.google.com/kubernetes-engine/docs/how-to/use-gke-mcp.
 
 ## Available Tools
 
@@ -21,52 +25,60 @@ Use `locations/-` to match all regions when listing.
 
 ### Cluster Management
 
-| Tool | Mode | Purpose |
-|------|------|---------|
-| `list_clusters` | READ | Discover clusters in a project/region |
-| `get_cluster` | READ | Inspect cluster config. Use `readMask` to select fields |
-| `create_cluster` | MUTATE | Create a cluster from JSON config |
-| `update_cluster` | DESTRUCTIVE | Change Day-1 cluster settings |
+| Tool             | Mode        | Purpose                                   |
+| ---------------- | ----------- | ----------------------------------------- |
+| `list_clusters`  | READ        | Discover clusters in a project/region     |
+| `get_cluster`    | READ        | Inspect cluster config. Use `readMask` to |
+:                  :             : select fields                             :
+| `create_cluster` | MUTATE      | Create a cluster from JSON config         |
+| `update_cluster` | DESTRUCTIVE | Change Day-1 cluster settings             |
 
 ### Node Pool Management
 
-| Tool | Mode | Purpose |
-|------|------|---------|
-| `list_node_pools` | READ | List pools in a cluster |
-| `get_node_pool` | READ | Get pool details |
-| `create_node_pool` | MUTATE | Add a pool (Standard clusters) |
-| `update_node_pool` | DESTRUCTIVE | Modify a pool |
+Tool               | Mode        | Purpose
+------------------ | ----------- | ------------------------------
+`list_node_pools`  | READ        | List pools in a cluster
+`get_node_pool`    | READ        | Get pool details
+`create_node_pool` | MUTATE      | Add a pool (Standard clusters)
+`update_node_pool` | DESTRUCTIVE | Modify a pool
 
 ### Kubernetes Resources
 
-| Tool | Mode | Purpose |
-|------|------|---------|
-| `get_k8s_resource` | READ | List/get any K8s resource (supports label/field selectors) |
-| `describe_k8s_resource` | READ | Detailed info with events and conditions |
-| `apply_k8s_manifest` | DESTRUCTIVE | Apply YAML manifests (supports `dryRun`) |
-| `patch_k8s_resource` | DESTRUCTIVE | JSON patch resource fields |
-| `delete_k8s_resource` | DESTRUCTIVE | Remove resources (supports `cascade`, `dryRun`) |
-| `list_k8s_api_resources` | READ | Discover available resource types |
+| Tool                     | Mode        | Purpose                             |
+| ------------------------ | ----------- | ----------------------------------- |
+| `get_k8s_resource`       | READ        | List/get any K8s resource (supports |
+:                          :             : label/field selectors)              :
+| `describe_k8s_resource`  | READ        | Detailed info with events and       |
+:                          :             : conditions                          :
+| `apply_k8s_manifest`     | DESTRUCTIVE | Apply YAML manifests (supports      |
+:                          :             : `dryRun`)                           :
+| `patch_k8s_resource`     | DESTRUCTIVE | JSON patch resource fields          |
+| `delete_k8s_resource`    | DESTRUCTIVE | Remove resources (supports          |
+:                          :             : `cascade`, `dryRun`)                :
+| `list_k8s_api_resources` | READ        | Discover available resource types   |
 
 ### Diagnostics & Observability
 
-| Tool | Mode | Purpose |
-|------|------|---------|
-| `list_k8s_events` | READ | Scheduling failures, OOM kills, evictions |
-| `get_k8s_logs` | READ | Container logs (supports `tail`, `since`, `previous`) |
-| `get_k8s_cluster_info` | READ | Control plane and service endpoints |
-| `get_k8s_version` | READ | Kubernetes server version |
-| `get_k8s_rollout_status` | READ | Deployment/StatefulSet rollout progress |
-| `check_k8s_auth` | READ | Verify RBAC permissions for a user/SA |
+| Tool                     | Mode | Purpose                                   |
+| ------------------------ | ---- | ----------------------------------------- |
+| `list_k8s_events`        | READ | Scheduling failures, OOM kills, evictions |
+| `get_k8s_logs`           | READ | Container logs (supports `tail`, `since`, |
+:                          :      : `previous`)                               :
+| `get_k8s_cluster_info`   | READ | Control plane and service endpoints       |
+| `get_k8s_version`        | READ | Kubernetes server version                 |
+| `get_k8s_rollout_status` | READ | Deployment/StatefulSet rollout progress   |
+| `check_k8s_auth`         | READ | Verify RBAC permissions for a user/SA     |
 
 ### Operations
 
-| Tool | Mode | Purpose |
-|------|------|---------|
-| `list_operations` | READ | Pending/running cluster operations |
-| `get_operation` | READ | Track create/upgrade progress |
-| `cancel_operation` | DESTRUCTIVE | Abort stuck operations |
+Tool               | Mode        | Purpose
+------------------ | ----------- | ----------------------------------
+`list_operations`  | READ        | Pending/running cluster operations
+`get_operation`    | READ        | Track create/upgrade progress
+`cancel_operation` | DESTRUCTIVE | Abort stuck operations
 
 ## Tool Preference
 
-Default: **MCP tools > gcloud CLI > kubectl**. See [cli-reference.md](./cli-reference.md) for the full coverage comparison, CLI fallback commands, and user preference override options.
+Default: **MCP tools > gcloud CLI > kubectl**. See
+[cli-reference.md](./cli-reference.md) for the full coverage comparison, CLI
+fallback commands, and user preference override options.
diff --git a/skills/cloud/gke-basics/references/gke-batch-hpc.md b/skills/cloud/gke-batch-hpc/SKILL.md
similarity index 70%
rename from skills/cloud/gke-basics/references/gke-batch-hpc.md
rename to skills/cloud/gke-batch-hpc/SKILL.md
index 74ec29feb4..a49da717f2 100644
--- a/skills/cloud/gke-basics/references/gke-batch-hpc.md
+++ b/skills/cloud/gke-batch-hpc/SKILL.md
@@ -1,16 +1,28 @@
+---
+name: gke-batch-hpc
+description: >-
+  Runs batch and HPC workloads on GKE, utilizing job queues and parallel
+  processing. Use when running GKE batch jobs, configuring GKE HPC, or setting
+  up GKE job queues. Don't use for standard web application deployments (use
+  gke-app-onboarding instead).
+---
+
 # GKE Batch & HPC Workloads
 
-This reference covers running batch processing and high-performance computing (HPC) workloads on GKE.
+This reference covers running batch processing and high-performance computing
+(HPC) workloads on GKE.
 
-> **MCP Tools:** `apply_k8s_manifest`, `get_k8s_resource`, `describe_k8s_resource`, `get_k8s_logs`, `delete_k8s_resource`, `list_k8s_events`
+> **MCP Tools:** `apply_k8s_manifest`, `get_k8s_resource`,
+> `describe_k8s_resource`, `get_k8s_logs`, `delete_k8s_resource`,
+> `list_k8s_events`
 
 ## When to Use
 
-- Running batch data processing pipelines
-- HPC simulations (CFD, molecular dynamics, financial modeling)
-- Large-scale parallel computation (MPI, MapReduce)
-- ML training jobs
-- CI/CD build farms
+-   Running batch data processing pipelines
+-   HPC simulations (CFD, molecular dynamics, financial modeling)
+-   Large-scale parallel computation (MPI, MapReduce)
+-   ML training jobs
+-   CI/CD build farms
 
 ## Batch Processing on GKE
 
@@ -106,7 +118,8 @@ spec:
 
 ### Compact Placement (Low-Latency Networking)
 
-For tightly-coupled HPC workloads that need low-latency inter-node communication:
+For tightly-coupled HPC workloads that need low-latency inter-node
+communication:
 
 ```bash
 # Standard clusters: create node pool with compact placement
@@ -152,17 +165,25 @@ spec:
 
 ### Spot VMs for Batch
 
-Batch workloads are ideal Spot VM candidates (interruptible, can checkpoint). Use a ComputeClass with Spot-first priority and `activeMigration` to return to Spot when available. See [gke-compute-classes.md](./gke-compute-classes.md) for the Spot-with-fallback pattern.
+Batch workloads are ideal Spot VM candidates (interruptible, can checkpoint).
+Use a ComputeClass with Spot-first priority and `activeMigration` to return to
+Spot when available. See
+[gke-compute-classes](../gke-compute-classes/SKILL.md) for the
+Spot-with-fallback pattern.
 
 ### Scale-to-Zero
 
 For batch clusters, allow node pools to scale to zero when no jobs are running:
 
-- Autopilot (golden path): Automatic, nodes scale to zero when no pods are scheduled
-- Standard: Set `--min-nodes 0` on batch node pools
+-   Autopilot (golden path): Automatic, nodes scale to zero when no pods are
+    scheduled
+-   Standard: Set `--min-nodes 0` on batch node pools
 
 ## Best Practices
 
-- **Kueue** for multi-tenant job scheduling; **JobSet** for multi-component workflows
-- **Set `backoffLimit`** on Jobs; **checkpoint long jobs** for preemption resilience
-- **Spot VMs** for fault-tolerant batch; **compact placement** for tightly-coupled HPC
+-   **Kueue** for multi-tenant job scheduling; **JobSet** for multi-component
+    workflows
+-   **Set `backoffLimit`** on Jobs; **checkpoint long jobs** for preemption
+    resilience
+-   **Spot VMs** for fault-tolerant batch; **compact placement** for
+    tightly-coupled HPC
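A minimal sketch of the `backoffLimit` practice from the Best Practices list above, assuming a simple retry-tolerant Job. The Job name, image, command, and the limit of 3 are illustrative assumptions, not values taken from this skill.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-batch-job        # hypothetical name
spec:
  backoffLimit: 3                # assumed value: stop retrying after 3 failed pods
  template:
    spec:
      restartPolicy: Never       # each failed pod counts toward backoffLimit
      containers:
      - name: worker
        image: busybox           # placeholder image
        command: ["sh", "-c", "echo processing && sleep 5"]
```

With `restartPolicy: Never`, every failed pod counts toward `backoffLimit`, so the Job is marked failed once the limit is reached instead of retrying indefinitely.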
diff --git a/skills/cloud/gke-basics/references/gke-cluster-creation.md b/skills/cloud/gke-cluster-creation/SKILL.md
similarity index 52%
rename from skills/cloud/gke-basics/references/gke-cluster-creation.md
rename to skills/cloud/gke-cluster-creation/SKILL.md
index 735590011c..479a3ce28f 100644
--- a/skills/cloud/gke-basics/references/gke-cluster-creation.md
+++ b/skills/cloud/gke-cluster-creation/SKILL.md
@@ -1,38 +1,58 @@
+---
+name: gke-cluster-creation
+description: >-
+  Plans and executes GKE cluster creation, provisioning, and production
+  readiness audits. Use when creating GKE clusters, provisioning GKE
+  environments, or auditing GKE clusters. Don't use for application
+  onboarding or deployment configuration (use gke-app-onboarding instead).
+---
+
 # GKE Cluster Creation
 
-This reference guides creating GKE clusters. The **golden path Autopilot** configuration is the default for all new clusters.
+This reference guides creating GKE clusters. The **golden path Autopilot**
+configuration is the default for all new clusters.
 
-> **MCP Tools:** `list_clusters`, `create_cluster`, `get_cluster`, `list_operations`, `get_operation`
+> **MCP Tools:** `list_clusters`, `create_cluster`, `get_cluster`,
+> `list_operations`, `get_operation`
 
 ## Workflow
 
-1. **Discover context**: Use `list_clusters` to see existing clusters. Use `gcloud config get-value project` if project unknown.
-2. **Gather inputs**: project_id, region, cluster_name, environment type
-3. **Select mode**: Autopilot (default) vs Standard
-4. **Configure networking**: auto-create subnet (default) or bring-your-own
-5. **Review golden path settings**: present the config and confirm with user
-6. **Create**: Use MCP `create_cluster` tool. Fall back to `gcloud` CLI only if MCP is unavailable.
-7. **Track**: Use `get_operation` to monitor creation progress
-8. **Verify**: Use `get_cluster` with `readMask="*"` to confirm golden path settings applied
+1.  **Discover context**: Use `list_clusters` to see existing clusters. Use
+    `gcloud config get-value project` if the project is unknown.
+2.  **Gather inputs**: project_id, region, cluster_name, environment type
+3.  **Select mode**: Autopilot (default) vs Standard
+4.  **Configure networking**: auto-create subnet (default) or bring-your-own
+5.  **Review golden path settings**: present the config and confirm with user
+6.  **Create**: Use MCP `create_cluster` tool. Fall back to `gcloud` CLI only if
+    MCP is unavailable.
+7.  **Track**: Use `get_operation` to monitor creation progress
+8.  **Verify**: Use `get_cluster` with `readMask="*"` to confirm that the
+    golden path settings were applied
 
 ## Mode Selection
 
-| Criteria | Autopilot (Golden Path) | Standard |
-|----------|------------------------|----------|
-| Node management | Google-managed | Self-managed |
-| Pricing | Pay per pod resource request | Pay per node (VM) |
-| Node customization | Via ComputeClasses | Full control |
-| DaemonSets | Allowed (with restrictions) | Full control |
-| GPU/TPU | Supported via ComputeClasses | Supported via node pools |
-| Best for | Most production workloads | Kernel tuning, custom OS, privileged workloads |
-
-> **Rule**: Default to Autopilot unless the customer has a specific requirement that Autopilot cannot satisfy.
+| Criteria           | Autopilot (Golden Path)   | Standard                  |
+| ------------------ | ------------------------- | ------------------------- |
+| Node management    | Google-managed            | Self-managed              |
+| Pricing            | Pay per pod resource      | Pay per node (VM)         |
+:                    : request                   :                           :
+| Node customization | Via ComputeClasses        | Full control              |
+| DaemonSets         | Allowed (with             | Full control              |
+:                    : restrictions)             :                           :
+| GPU/TPU            | Supported via             | Supported via node pools  |
+:                    : ComputeClasses            :                           :
+| Best for           | Most production workloads | Kernel tuning, custom OS, |
+:                    :                           : privileged workloads      :
+
+> **Rule**: Default to Autopilot unless the customer has a specific requirement
+> that Autopilot cannot satisfy.
 
 ## Templates
 
 ### 1. Golden Path Autopilot (Production)
 
-This is the default. All settings match `assets/golden-path-autopilot.yaml`.
+This is the default. All settings match
+`../gke-golden-path/assets/golden-path-autopilot.yaml`.
 
 **Via gcloud:**
 
@@ -78,7 +98,8 @@ gcloud container clusters create-auto <CLUSTER_NAME> \
 
 ### 2. Autopilot Dev/Test
 
-Relaxes some golden path defaults for cost savings and easier access in non-production.
+Relaxes some golden path defaults for cost savings and easier access in
+non-production.
 
 ```bash
 gcloud container clusters create-auto <CLUSTER_NAME> \
@@ -88,7 +109,8 @@ gcloud container clusters create-auto <CLUSTER_NAME> \
   --quiet
 ```
 
-> **Warning**: This does not apply golden path security hardening. Suitable for dev/test only.
+> **Warning**: This does not apply golden path security hardening. Suitable for
+> dev/test only.
 
 ### 3. Standard Regional (When Autopilot is Not an Option)
 
@@ -112,7 +134,8 @@ gcloud container clusters create <CLUSTER_NAME> \
 
 ### 4. GPU/AI Workloads (Autopilot with ComputeClass)
 
-Create a golden path Autopilot cluster, then apply a ComputeClass for GPU workloads:
+Create a golden path Autopilot cluster, then apply a ComputeClass for GPU
+workloads:
 
 ```bash
 # 1. Create golden path cluster (same as template 1)
@@ -133,10 +156,12 @@ kubectl apply -f inference.yaml
 
 ## Instructions
 
-- **ALWAYS** ask for `project_id` if not in context
-- **ALWAYS** ask for `region`
-- **ALWAYS** ask for a unique `cluster_name`
-- **DEFAULT** to golden path Autopilot unless customer specifies otherwise
-- **WARN** about Day-0 decisions (networking, private nodes) that are hard to change later
-- **WARN** about cost for GPU or multi-region clusters
-- When using MCP `create_cluster`, the `cluster.name` should be the **short name** (e.g., `my-cluster`), not the full resource path
+-   **ALWAYS** ask for `project_id` if not in context
+-   **ALWAYS** ask for `region`
+-   **ALWAYS** ask for a unique `cluster_name`
+-   **DEFAULT** to golden path Autopilot unless customer specifies otherwise
+-   **WARN** about Day-0 decisions (networking, private nodes) that are hard to
+    change later
+-   **WARN** about cost for GPU or multi-region clusters
+-   When using MCP `create_cluster`, the `cluster.name` should be the **short
+    name** (e.g., `my-cluster`), not the full resource path
diff --git a/skills/cloud/gke-compute-classes/SKILL.md b/skills/cloud/gke-compute-classes/SKILL.md
new file mode 100644
index 0000000000..4b558d3cd2
--- /dev/null
+++ b/skills/cloud/gke-compute-classes/SKILL.md
@@ -0,0 +1,204 @@
+---
+name: gke-compute-classes
+description: >-
+  Manages GKE compute classes, node selection, Spot fallbacks, and GPU node
+  pools. Use when configuring GKE compute classes, selecting GKE machine
+  families, or configuring GPUs on GKE. Don't use for general workload-level
+  resource limits or Horizontal/Vertical Pod Autoscaling (use gke-scaling
+  instead).
+---
+
+# GKE ComputeClasses
+
+ComputeClasses allow declarative node configuration and autoscaling priorities
+in GKE Autopilot (and Standard with NAP). Use them to specify machine families,
+Spot VM fallback, GPU requirements, and zone targeting.
+
+> **MCP Tools:** `apply_k8s_manifest`, `get_k8s_resource`,
+> `describe_k8s_resource`, `delete_k8s_resource`
+
+## When to Use
+
+-   Cost optimization: Spot VMs with on-demand fallback
+-   GPU/TPU workloads: target specific accelerators
+-   Performance: select specific machine families (c3, c4, n4)
+-   Zone targeting: colocate workloads with zonal resources
+
+## CRD Structure
+
+```yaml
+apiVersion: cloud.google.com/v1
+kind: ComputeClass
+metadata:
+  name: <string>
+spec:
+  # Required. Ordered list of rules. GKE tries them in order.
+  priorities:
+    - <PriorityRule>
+
+  # Optional. Default: "DoNotScaleUp"
+  whenUnsatisfiable: <"DoNotScaleUp" | "ScaleUpAnyway">
+
+  # Optional. Auto-create node pools. Default: true
+  nodePoolAutoCreation:
+    enabled: <boolean>
+
+  # Optional. Move workloads back to higher-priority when available
+  activeMigration:
+    optimizeRulePriority: <boolean>
+
+  # Optional. Scale-down delay
+  autoscalingPolicy:
+    consolidationDelay: <duration>
+
+  # Optional. Defaults for fields omitted in priorities
+  priorityDefaults: <PriorityRule>
+```
+
+## PriorityRule Fields
+
+| Field           | Type    | Description            | Example          |
+| --------------- | ------- | ---------------------- | ---------------- |
+| `machineFamily` | string  | Compute Engine machine | `n4`, `c3`,      |
+:                 :         : family                 : `t2a`            :
+| `machineType`   | string  | Specific machine type  | `n4-standard-32` |
+| `spot`          | boolean | Use Spot VMs           | `true`           |
+| `minCores`      | int     | Minimum vCPUs          | `4`              |
+| `minMemoryGb`   | int     | Minimum memory in GB   | `16`             |
+| `gpu`           | object  | GPU config: `type`,    | See below        |
+:                 :         : `count`,               :                  :
+:                 :         : `driverVersion`        :                  :
+| `tpu`           | object  | TPU config: `type`,    | See below        |
+:                 :         : `count`, `topology`    :                  :
+| `storage`       | object  | Boot disk: `type`,     | See below        |
+:                 :         : `sizeGb`, `kmsKey`;    :                  :
+:                 :         : Local SSD\: `count`,   :                  :
+:                 :         : `interface`            :                  :
+| `location`      | object  | Zone targeting:        | See below        |
+:                 :         : `zones\: [...]` or     :                  :
+:                 :         : `type\: "Any"`         :                  :
+| `reservations`  | object  | Reservation            | See below        |
+:                 :         : consumption\:          :                  :
+:                 :         : `NO_RESERVATION`,      :                  :
+:                 :         : `ANY_RESERVATION`,     :                  :
+:                 :         : `SPECIFIC_RESERVATION` :                  :
+
+### GPU Configuration
+
+```yaml
+gpu:
+  type: "nvidia-l4"        # nvidia-l4, nvidia-h100-80gb, etc.
+  count: 1                 # GPUs per node
+  driverVersion: "latest"  # Optional
+```
+
+### TPU Configuration
+
+```yaml
+tpu:
+  type: "tpu-v5p-slice"
+  count: 4                 # TPU chips per node; must be consistent with topology
+  topology: "2x2x1"
+```
+
+### Storage Configuration
+
+```yaml
+storage:
+  bootDisk:
+    type: "pd-balanced"     # pd-balanced (golden path), pd-ssd, hyperdisk-balanced
+    sizeGb: 100
+    kmsKey: "projects/.../cryptoKeys/..."  # Optional CMEK
+  localSsd:
+    count: 1
+    interface: "NVME"
+```
+
+### Location Configuration
+
+```yaml
+location:
+  zones:
+    - "us-central1-a"
+    - "us-central1-b"
+  # OR, instead of `zones`:
+  # type: "Any"            # Let GKE pick from cluster zones
+```
+
+## Common Patterns
+
+### Spot VMs with On-Demand Fallback
+
+```yaml
+apiVersion: cloud.google.com/v1
+kind: ComputeClass
+metadata:
+  name: spot-with-fallback
+spec:
+  nodePoolAutoCreation:
+    enabled: true
+  priorities:
+  - machineFamily: n4
+    spot: true
+  - machineFamily: n4
+    spot: false
+```
+
+### GPU Workload (L4)
+
+```yaml
+apiVersion: cloud.google.com/v1
+kind: ComputeClass
+metadata:
+  name: l4-gpu-class
+spec:
+  priorities:
+  - machineFamily: g2
+    gpu:
+      type: nvidia-l4
+      count: 1
+    minCores: 4
+    minMemoryGb: 16
+    storage:
+      bootDisk:
+        type: pd-balanced
+        sizeGb: 100
+```
+
+### Spot with Active Migration (Return to Spot When Available)
+
+Add `activeMigration` to the Spot-with-fallback pattern above to auto-migrate
+workloads back to Spot when capacity returns:
+
+```yaml
+spec:
+  activeMigration:
+    optimizeRulePriority: true
+  priorities:
+  - machineFamily: n4
+    spot: true
+  - machineFamily: n4
+    spot: false
+```
+
+> **Other patterns** — HPC (`machineFamily: c3`, `minCores: 8`) and zone
+> targeting (`location.zones: [...]`) follow the same CRD structure. See the
+> PriorityRule fields table and sub-config examples above.
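+
+As a minimal sketch of the two patterns named in the note, the manifest below
+combines them in one class. The class name `hpc-zonal` and the zone values are
+illustrative assumptions, not golden path values:
+
+```yaml
+apiVersion: cloud.google.com/v1
+kind: ComputeClass
+metadata:
+  name: hpc-zonal  # hypothetical name
+spec:
+  priorities:
+  # Preferred: compute-optimized c3 nodes with at least 8 vCPUs, pinned to zones
+  - machineFamily: c3
+    minCores: 8
+    location:
+      zones:
+      - us-central1-a  # example zones; use zones the cluster actually spans
+      - us-central1-b
+  # Fallback: same shape in any cluster zone if the pinned zones lack capacity
+  - machineFamily: c3
+    minCores: 8
+```
+
+Drop the second rule if the workload must never leave the listed zones.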
+
+## Workload Usage
+
+Pods must specify the ComputeClass via node selector:
+
+```yaml
+nodeSelector:
+  cloud.google.com/compute-class: "<compute-class-name>"
+```
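+
+To show where that selector sits in a complete workload, here is a minimal
+sketch of a Deployment. It assumes the `spot-with-fallback` ComputeClass from
+the pattern above; the Deployment name, labels, image, and request sizes are
+hypothetical placeholders:
+
+```yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: web  # hypothetical name
+spec:
+  replicas: 2
+  selector:
+    matchLabels:
+      app: web
+  template:
+    metadata:
+      labels:
+        app: web
+    spec:
+      # Selects the ComputeClass; no other hard node selectors are set
+      nodeSelector:
+        cloud.google.com/compute-class: spot-with-fallback
+      terminationGracePeriodSeconds: 25  # stays under the 30-second Spot notice
+      containers:
+      - name: app
+        image: <IMAGE>  # placeholder
+        resources:
+          requests:
+            cpu: 500m      # illustrative sizes
+            memory: 512Mi
+```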
+
+## Warnings
+
+-   Do not mix ComputeClass selection with other hard node selectors (like
+    `cloud.google.com/gke-spot`) — this causes scheduling conflicts.
+-   When using `activeMigration`, workloads will be evicted and rescheduled —
+    ensure PDBs are in place.
+-   Spot VMs can be preempted with only 30 seconds of notice. Set
+    `terminationGracePeriodSeconds` to a value below 30 for Spot workloads.
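+
+The PDB mentioned in the warnings can be sketched as follows. This is a minimal
+example that assumes the workload's Pods carry a hypothetical `app: web` label
+and run at least two replicas; the PDB name is also a placeholder:
+
+```yaml
+apiVersion: policy/v1
+kind: PodDisruptionBudget
+metadata:
+  name: web-pdb  # hypothetical name
+spec:
+  # Keep at least one Pod running during voluntary disruptions such as
+  # migration back to a higher-priority rule
+  minAvailable: 1
+  selector:
+    matchLabels:
+      app: web  # must match the Pod labels of the protected workload
+```
+
+A PDB only limits voluntary evictions; it does not prevent a Spot preemption.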
diff --git a/skills/cloud/gke-cost/SKILL.md b/skills/cloud/gke-cost/SKILL.md
new file mode 100644
index 0000000000..938c740b53
--- /dev/null
+++ b/skills/cloud/gke-cost/SKILL.md
@@ -0,0 +1,184 @@
+---
+name: gke-cost
+description: >-
+  Optimizes GKE costs, rightsizes workloads, and configures Spot VMs and CUDs.
+  Use when optimizing GKE costs, rightsizing GKE workloads, or configuring GKE
+  Spot VMs. Don't use for general compute class provisioning or GPU selection
+  (use gke-compute-classes instead).
+---
+
+# GKE Cost Optimization
+
+This reference covers strategies for reducing GKE costs while maintaining the
+golden path security and reliability posture.
+
+> **MCP Tools:** `get_k8s_resource`, `describe_k8s_resource`,
+> `apply_k8s_manifest`, `patch_k8s_resource`, `get_cluster`
+
+## Golden Path Cost Features
+
+The golden path already includes cost-optimizing settings:
+
+| Setting                  | Value                  | Impact                  |
+| ------------------------ | ---------------------- | ----------------------- |
+| `autoscalingProfile`     | `OPTIMIZE_UTILIZATION` | Aggressive node         |
+:                          :                        : scale-down reduces idle :
+:                          :                        : compute                 :
+| `verticalPodAutoscaling` | `enabled`              | VPA recommendations     |
+:                          :                        : prevent                 :
+:                          :                        : over-provisioning       :
+| Autopilot pricing        | Pay per pod request    | No charge for unused    |
+:                          :                        : node capacity           :
+| Node Auto Provisioning   | enabled                | Right-sized node pools  |
+:                          :                        : created automatically   :
+
+## Cost Optimization Strategies
+
+### 1. Spot VMs via ComputeClasses
+
+Use Spot VMs for fault-tolerant workloads (60-90% cost reduction).
+
+```yaml
+apiVersion: cloud.google.com/v1
+kind: ComputeClass
+metadata:
+  name: spot-with-fallback
+spec:
+  activeMigration:
+    optimizeRulePriority: true
+  priorities:
+  - machineFamily: n4
+    spot: true
+  - machineFamily: n4
+    spot: false
+```
+
+**Spot-suitable workloads:**
+
+Workload                          | Spot-Suitable?
+--------------------------------- | ---------------
+Batch / data processing           | Yes
+Dev / test environments           | Yes
+Stateless web/API (replicas >= 2) | Yes (with PDBs)
+Jobs with checkpointing           | Yes
+Stateful workloads (databases)    | No
+Single-replica critical services  | No
+
+**Handling eviction:**
+
+```yaml
+spec:
+  template:
+    spec:
+      terminationGracePeriodSeconds: 25  # Must be < 30s for Spot
+      containers:
+      - name: app
+        lifecycle:
+          preStop:
+            exec:
+              command: ["/bin/sh", "-c", "sleep 5"]
+```
+
+### 2. Pod Rightsizing
+
+Use VPA recommendations to reduce over-provisioned requests.
+
+```bash
+# 1. Deploy VPA in recommendation mode
+kubectl apply -f - <<EOF
+apiVersion: autoscaling.k8s.io/v1
+kind: VerticalPodAutoscaler
+metadata:
+  name: <DEPLOYMENT>-vpa
+spec:
+  targetRef:
+    apiVersion: apps/v1
+    kind: Deployment
+    name: <DEPLOYMENT>
+  updatePolicy:
+    updateMode: "Off"
+EOF
+
+# 2. Wait 24+ hours for data collection
+
+# 3. Read recommendations
+kubectl get vpa <DEPLOYMENT>-vpa -o jsonpath='{.status.recommendation}'
+```
+
+**Optimization rules:**
+
+Condition                     | Action                             | Savings
+----------------------------- | ---------------------------------- | -------
+CPU request >5x P95 actual    | Reduce to `P95 * 1.2`              | High
+Memory request >3x P95 actual | Reduce to `P95 * 1.2`              | High
+CPU request >2x P95 actual    | Reduce to `P95 * 1.2`              | Medium
+No resource requests set      | Add requests (enables bin-packing) | Medium
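+
+As a worked example of the `P95 * 1.2` rule, the fragment below uses
+hypothetical observations (P95 CPU of 200m and P95 memory of 400Mi) for a
+container that previously requested 1500m CPU and 1500Mi memory. The numbers
+are assumptions chosen for illustration only:
+
+```yaml
+# Before: cpu 1500m (7.5x P95) and memory 1500Mi (3.75x P95), both "High"
+resources:
+  requests:
+    cpu: 240m      # 200m * 1.2
+    memory: 480Mi  # 400Mi * 1.2
+```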
+
+### 3. Machine Type Selection
+
+| Family        | Use Case                                     | Relative Cost |
+| ------------- | -------------------------------------------- | ------------- |
+| e2            | General purpose, burstable                   | Lowest        |
+| t2a / t2d     | Scale-out (Arm/AMD), price-performance       | Low           |
+:               : optimized                                    :               :
+| n4a           | Axion Arm-based, general-purpose             | Low           |
+:               : price-performance                            :               :
+| n4 / n4d      | General purpose (Intel/AMD), flexible shapes | Low-Medium    |
+| c4a           | Compute-optimized (Arm), high efficiency     | Medium-High   |
+| c3 / c4       | Compute-optimized (Intel)                    | Medium-High   |
+| c3d / c4d     | Compute-optimized (AMD), high-performance    | Medium-High   |
+:               : throughput                                   :               :
+| ek-standard   | Autopilot enhanced (golden path)             | Medium        |
+| m3 / x4       | Memory-optimized, SAP HANA, large databases  | High          |
+| g2 (L4 GPU)   | AI inference                                 | High          |
+| a3 (H100 GPU) | AI training                                  | Highest       |
+| a4 / a4x      | Ultra-scale AI (Blackwell GPUs)              | Highest       |
+
+> In Autopilot, machine type is managed. Use ComputeClasses to influence
+> selection.
+
+### 4. Committed Use Discounts (CUDs)
+
+For steady-state workloads, purchase 1-year or 3-year CUDs:
+
+-   1-year: ~20-30% discount
+-   3-year: ~50-55% discount
+-   Applied automatically to matching usage in the region
+-   Purchase via Google Cloud Console > Billing > Committed use discounts
+
+### 5. Cluster Management
+
+-   **Delete or scale down idle dev clusters**: GKE has no stop/start
+    operation, and an idle cluster still incurs the control plane fee.
+-   **Right-size node pools** (Standard): Use Cluster Autoscaler with
+    appropriate min/max.
+-   **Multi-tenant clusters**: Share a single cluster across teams instead of
+    per-team clusters (see [gke-multitenancy](../gke-multitenancy/SKILL.md)).
+
+## Cost Monitoring
+
+```bash
+# List budgets on the billing account (requires the Cloud Billing Budget API)
+gcloud billing budgets list --billing-account=<BILLING_ACCOUNT> --quiet
+
+# View node utilization
+kubectl top nodes
+
+# View pod resource usage vs requests
+kubectl top pods --all-namespaces --containers
+```
+
+## Dev/Test Cost Savings
+
+For non-production environments, these golden path deviations are acceptable:
+
+| Setting                 | Production (Golden | Dev/Test                      |
+:                         : Path)              :                               :
+| ----------------------- | ------------------ | ----------------------------- |
+| Cluster mode            | Autopilot          | Autopilot (cheaper with fewer |
+:                         :                    : pods)                         :
+| Release channel         | Regular            | Rapid (get fixes faster)      |
+| Private nodes           | Required           | Optional (simpler access)     |
+| Monitoring components   | Full suite         | SYSTEM_COMPONENTS only        |
+| Secret Manager rotation | 120s               | Disabled                      |
+| Maintenance windows     | Configured         | Not needed                    |
diff --git a/skills/cloud/gke-golden-path/SKILL.md b/skills/cloud/gke-golden-path/SKILL.md
new file mode 100644
index 0000000000..5f3e2ff57c
--- /dev/null
+++ b/skills/cloud/gke-golden-path/SKILL.md
@@ -0,0 +1,104 @@
+---
+name: gke-golden-path
+description: >-
+  Provides GKE golden path configuration defaults, production readiness
+  checklists, and cluster default patterns. Use when designing GKE clusters,
+  verifying GKE production readiness, or checking configurations against
+  GKE defaults. Don't use for setting up node autoscaling specifically (use
+  gke-scaling instead).
+---
+
+# GKE Golden Path Configuration
+
+The golden path is the recommended Autopilot configuration for production
+clusters. It defines sensible defaults — when the user requests different
+settings, apply them and note relevant trade-offs.
+
+> **MCP Tools:** `get_cluster`, `create_cluster`, `update_cluster`
+
+## Rules
+
+1.  **Default to the golden path.** Use golden path values unless the user
+    requests otherwise. When deviating, note trade-offs but respect the user's
+    choice.
+2.  **Day-0 vs Day-1.** Flag Day-0 decisions (networking, private nodes,
+    subnets, IP allocation) prominently — they are hard/impossible to change
+    after creation.
+3.  **Tool preference: MCP > gcloud > kubectl.** See
+    [cli-reference.md](../gke-basics/references/cli-reference.md) for full
+    coverage matrix and override options. If the user says "use gcloud" or "use
+    kubectl", respect that for the session.
+4.  **Document decisions and rationale**, especially for Day-0 choices and
+    golden path deviations.
+
+## Required Inputs
+
+If the user is unsure, use golden path defaults.
+
+-   **Project ID** (required)
+-   **Region** (required, e.g., `us-central1`)
+-   **Cluster name** (required)
+-   **Environment type**: dev/test or production (defaults to production)
+-   **Networking**: bring-your-own VPC/subnet or auto-create (default:
+    auto-create)
+-   **Scale expectations**: expected node/pod count, workload types
+-   **Cost constraints**: Spot VM tolerance, budget considerations
+
+## Always-Apply Defaults
+
+Recommended best practices applied by default. If the user requests a different
+setting, apply it and briefly note the security or operational trade-off.
+
+Setting                                                            | Golden Path Value
+------------------------------------------------------------------ | -----------------
+`autopilot.enabled`                                                | `true`
+`privateClusterConfig.enablePrivateNodes`                          | `true`
+`masterAuthorizedNetworksConfig.privateEndpointEnforcementEnabled` | `true`
+`secretManagerConfig.enabled` + `rotationInterval: 120s`           | `true`
+`rbacBindingConfig.enableInsecureBinding*`                         | `false` (both)
+`workloadIdentityConfig.workloadPool`                              | enabled
+`networkConfig.datapathProvider`                                   | `ADVANCED_DATAPATH`
+`networkConfig.dnsConfig.clusterDns`                               | `CLOUD_DNS`
+`autoscaling.autoscalingProfile`                                   | `OPTIMIZE_UTILIZATION`
+`verticalPodAutoscaling.enabled`                                   | `true`
+`monitoringConfig` components                                      | SYSTEM_COMPONENTS, STORAGE, POD, DEPLOYMENT, STATEFULSET, DAEMONSET, HPA, JOBSET, CADVISOR, KUBELET, DCGM, APISERVER, SCHEDULER, CONTROLLER_MANAGER
+`advancedDatapathObservabilityConfig.enableMetrics`                | `true`
+`nodeConfig.shieldedInstanceConfig.enableSecureBoot`               | `true`
+`nodeConfig.workloadMetadataConfig.mode`                           | `GKE_METADATA`
+`nodeConfig.gcfsConfig.enabled` / `gvnic.enabled`                  | `true` / `true`
+`addonsConfig.statefulHaConfig.enabled`                            | `true`
+Storage CSI drivers (Filestore, GCS FUSE, Parallelstore)           | enabled
+Pod Security Standards                                             | `restricted` on production namespaces
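+
+The last row can be applied with the standard Pod Security Admission namespace
+labels. The sketch below assumes a hypothetical production namespace named
+`prod-apps`; adding `warn` and `audit` alongside `enforce` is a suggestion, not
+a stated golden path requirement:
+
+```yaml
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: prod-apps  # hypothetical name
+  labels:
+    # Reject Pods that violate the restricted profile
+    pod-security.kubernetes.io/enforce: restricted
+    # Also surface violations as client warnings and audit annotations
+    pod-security.kubernetes.io/warn: restricted
+    pod-security.kubernetes.io/audit: restricted
+```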
+
+## Customer-Configurable Settings
+
+These have golden path defaults but customers may deviate with valid
+justification. **Ask before changing.**
+
+Setting                                  | Default                             | Why Deviate
+---------------------------------------- | ----------------------------------- | -----------
+`dnsEndpointConfig.allowExternalTraffic` | `true`                              | Restrict if cluster only accessed from within VPC
+`autoIpamConfig` / `createSubnetwork`    | `true` / `true`                     | Customer has pre-existing VPC/subnets
+`maxPodsPerNode`                         | `48`                                | `110` for high pod-density (costs more CIDR space)
+`subnetwork`                             | auto-created                        | Customer brings existing subnets
+Maintenance exclusion windows            | configured (NO_MINOR_UPGRADES, 1yr) | Customer-specific scheduling
+`nodeConfig.bootDisk.diskType`           | `pd-balanced`                       | `pd-ssd` for I/O-intensive, `pd-standard` for cost
+`nodeConfig.machineType`                 | `ek-standard-8` (Autopilot)         | Varies by workload; use ComputeClasses
+
+## Guardrails
+
+-   Do not request or output secrets (tokens, keys, service account JSON).
+-   Discover project/cluster context via MCP tools or
+    `gcloud config get-value project`; don't ask users to paste project IDs.
+-   For Day-0 decisions, always ask clarifying questions before proceeding.
+-   For Day-1 features, propose golden path defaults with trade-offs and let the
+    customer confirm.
+-   Do not promise zero downtime; advise PDBs, health probes, replicas, and
+    staged upgrades.
+-   When auditing existing clusters, compare against golden path and report
+    deviations with severity and remediation.
+
+## Golden Path Config
+
+See [golden-path-autopilot.yaml](./assets/golden-path-autopilot.yaml) for the
+full cluster-level policy settings.
diff --git a/skills/cloud/gke-basics/assets/golden-path-autopilot.yaml b/skills/cloud/gke-golden-path/assets/golden-path-autopilot.yaml
similarity index 100%
rename from skills/cloud/gke-basics/assets/golden-path-autopilot.yaml
rename to skills/cloud/gke-golden-path/assets/golden-path-autopilot.yaml
diff --git a/skills/cloud/gke-inference/SKILL.md b/skills/cloud/gke-inference/SKILL.md
new file mode 100644
index 0000000000..d137acfd42
--- /dev/null
+++ b/skills/cloud/gke-inference/SKILL.md
@@ -0,0 +1,206 @@
+---
+name: gke-inference
+description: >-
+  Deploys and optimizes AI/ML inference workloads on GKE, using GPUs, TPUs, and
+  model servers. Use when deploying GKE inference servers, configuring GKE GPU
+  resources for inference, or deploying LLMs on GKE. Don't use for generic
+  batch jobs or HPC task queues (use gke-batch-hpc instead).
+---
+
+# GKE AI/ML Inference
+
+This reference covers deploying AI/ML inference workloads on GKE using the GKE
+Inference Quickstart (GIQ) and best practices for LLM serving.
+
+> **MCP Tools:** `apply_k8s_manifest`, `get_k8s_resource`, `get_k8s_logs`,
+> `get_k8s_rollout_status`, `describe_k8s_resource`, `list_k8s_events`.
+> **CLI-only:** `gcloud container ai profiles *`
+
+## When to Use
+
+-   Deploy an AI model (Llama, Gemma, Mistral, etc.) to GKE
+-   Generate optimized Kubernetes manifests for inference
+-   Select GPU/TPU accelerators for model serving
+-   Configure autoscaling for LLM inference
+
+## Prerequisites
+
+-   A golden path GKE Autopilot cluster (GPU workloads are supported via
+    ComputeClasses and NAP)
+-   `gcloud` CLI authenticated
+-   Sufficient GPU/TPU quota in the target region
+
+## Workflow
+
+### 1. Discovery: Find Models and Hardware
+
+```bash
+# List all supported models
+gcloud container ai profiles models list --quiet
+
+# Find valid accelerator/server combinations for a model
+gcloud container ai profiles list --model=<MODEL_NAME> --quiet
+
+# Example: what can run Gemma 2 9B?
+gcloud container ai profiles list --model=gemma-2-9b-it --quiet
+```
+
+### 2. Generate Manifest
+
+```bash
+gcloud container ai profiles manifests create \
+  --model=<MODEL_NAME> \
+  --model-server=<SERVER> \
+  --accelerator-type=<ACCELERATOR> \
+  --target-ntpot-milliseconds=<NTPOT> --quiet > inference.yaml
+```
+
+**Parameters:**
+
+-   `--model`: Model ID (e.g., `gemma-2-9b-it`, `llama-3-8b`)
+-   `--model-server`: Inference server (`vllm`, `tgi`, `triton`, `tensorrt-llm`)
+-   `--accelerator-type`: GPU/TPU type (`nvidia-l4`, `nvidia-tesla-a100`,
+    `nvidia-h100-80gb`)
+-   `--target-ntpot-milliseconds`: Target Normalized Time Per Output Token
+    (optional, for latency optimization)
+
+**Example:**
+
+```bash
+gcloud container ai profiles manifests create \
+  --model=gemma-2-9b-it \
+  --model-server=vllm \
+  --accelerator-type=nvidia-l4 \
+  --target-ntpot-milliseconds=50 --quiet > inference.yaml
+```
+
+### 3. Review and Deploy
+
+```bash
+# Review for placeholders (HF tokens, PVCs)
+cat inference.yaml
+
+# Deploy
+kubectl apply -f inference.yaml
+
+# Monitor
+kubectl get pods -w
+kubectl logs -f <POD_NAME>
+```
+
+> Some models require Hugging Face tokens. Create a Kubernetes Secret and
+> reference it in the manifest.
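+
+A minimal sketch of that Secret and of a container reference to it follows. The
+Secret name `hf-secret`, the key `hf_api_token`, and the environment variable
+name are assumptions; use whatever names the generated `inference.yaml` and
+your model server actually expect:
+
+```yaml
+apiVersion: v1
+kind: Secret
+metadata:
+  name: hf-secret  # hypothetical name
+type: Opaque
+stringData:
+  hf_api_token: <HF_TOKEN>  # placeholder; never commit a real token
+---
+# Container excerpt (not a complete manifest)
+env:
+- name: HUGGING_FACE_HUB_TOKEN  # assumed variable name; varies by model server
+  valueFrom:
+    secretKeyRef:
+      name: hf-secret
+      key: hf_api_token
+```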
+
+## GPU ComputeClass for Inference
+
+For Autopilot clusters, create a ComputeClass to target GPU nodes:
+
+```yaml
+apiVersion: cloud.google.com/v1
+kind: ComputeClass
+metadata:
+  name: l4-inference
+spec:
+  priorities:
+  - machineFamily: g2
+    gpu:
+      type: nvidia-l4
+      count: 1
+    minCores: 4
+    minMemoryGb: 16
+```
+
+## Accelerator Selection Guide
+
+| Accelerator         | Best For                 | Memory      | Relative Cost |
+| ------------------- | ------------------------ | ----------- | ------------- |
+| NVIDIA T4           | Budget inference,        | 16 GB       | Lowest        |
+:                     : lightweight legacy       :             :               :
+:                     : models                   :             :               :
+| NVIDIA L4 (G2)      | Small-medium model       | 24 GB       | Low           |
+:                     : inference, video,        :             :               :
+:                     : graphics                 :             :               :
+| NVIDIA RTX PRO 6000 | Multimodal AI,           | 96 GB       | Medium        |
+: (G4)                : high-fidelity 3D,        :             :               :
+:                     : fine-tuning              :             :               :
+| Cloud TPU v5e       | Cost-effective           | Varies      | Medium        |
+:                     : transformer inference    :             :               :
+| Cloud TPU v5p       | High-performance         | Varies      | High          |
+:                     : training                 :             :               :
+| Cloud TPU v6e       | High-efficiency next-gen | 32 GB/chip  | Medium-High   |
+: (Trillium)          : training & serving       :             :               :
+| Cloud TPU v7x       | Ultra-scale inference &  | 192 GB/chip | High          |
+: (Ironwood)          : agentic workflows        :             :               :
+| NVIDIA A100         | Large model inference,   | 40/80 GB    | High          |
+:                     : enterprise ML            :             :               :
+| NVIDIA H100 / H200  | Frontier model training, | 80/141 GB   | Highest       |
+:                     : high throughput          :             :               :
+| NVIDIA B200 (A4)    | Blackwell-scale          | 192 GB      | Highest       |
+:                     : training, FP4 precision  :             :               :
+| NVIDIA GB200 (A4X)  | Rack-scale AI (Grace     | Massive     | Highest       |
+:                     : Blackwell Superchip)     :             :               :
+
+## Autoscaling LLM Inference
+
+### GPU-based autoscaling
+
+Use custom metrics for GPU utilization. This assumes a metrics adapter exposes
+`gpu_duty_cycle` to the HPA as a per-Pod metric:
+
+```yaml
+apiVersion: autoscaling/v2
+kind: HorizontalPodAutoscaler
+metadata:
+  name: llm-hpa
+spec:
+  scaleTargetRef:
+    apiVersion: apps/v1
+    kind: Deployment
+    name: llm-server
+  minReplicas: 1
+  maxReplicas: 10
+  metrics:
+  - type: Pods
+    pods:
+      metric:
+        name: gpu_duty_cycle
+      target:
+        type: AverageValue
+        averageValue: "80"
+```
+
+### Best practices for inference autoscaling
+
+1.  **Use DCGM metrics**: Golden path enables DCGM monitoring for GPU
+    utilization metrics
+2.  **Set appropriate minReplicas**: At least 1 for always-on serving; 0 for
+    batch/on-demand
+3.  **Tune scale-down delay**: LLM model loading is slow; use longer
+    stabilization windows
+4.  **Consider queue depth**: Scale on pending requests rather than pure GPU
+    utilization for latency-sensitive workloads
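+
+Point 3 can be sketched with the `behavior` field of `autoscaling/v2`. The
+fragment below would be merged into the `llm-hpa` spec shown earlier; the
+window and policy values are illustrative assumptions to tune per model:
+
+```yaml
+spec:
+  behavior:
+    scaleDown:
+      # Wait 10 minutes of consistently low load before removing replicas
+      stabilizationWindowSeconds: 600
+      policies:
+      # Remove at most one replica every 2 minutes
+      - type: Pods
+        value: 1
+        periodSeconds: 120
+    scaleUp:
+      # React to load increases immediately
+      stabilizationWindowSeconds: 0
+```
+
+A long scale-down window trades some idle GPU cost for fewer slow model reloads.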
+
+## Optimization Tips
+
+-   **Quantization**: Use quantized models (GPTQ, AWQ) to reduce GPU memory and
+    increase throughput
+-   **Batching**: Configure model server batch size for throughput vs latency
+    trade-off
+-   **Tensor parallelism**: Split large models across multiple GPUs within a
+    node
+-   **KV cache optimization**: Tune `--gpu-memory-utilization` in vLLM for KV
+    cache allocation
+
+## Troubleshooting
+
+| Issue              | Cause                    | Fix                         |
+| ------------------ | ------------------------ | --------------------------- |
+| Invalid            | Unsupported tuple        | Re-run `gcloud container ai |
+: model/accelerator  :                          : profiles list               :
+: combination        :                          : --model=<MODEL>`            :
+| GPU quota exceeded | Regional quota limit     | Request quota increase or   |
+:                    :                          : try a different region      :
+| OOM on GPU         | Model too large for      | Use larger GPU, enable      |
+:                    : accelerator              : quantization, or use tensor :
+:                    :                          : parallelism                 :
+| Slow cold start    | Large model loading from | Use local SSD for model     |
+:                    : registry                 : caching; pre-pull images    :
diff --git a/skills/cloud/gke-basics/references/gke-multitenancy.md b/skills/cloud/gke-multitenancy/SKILL.md
similarity index 58%
rename from skills/cloud/gke-basics/references/gke-multitenancy.md
rename to skills/cloud/gke-multitenancy/SKILL.md
index 78458a30da..eb57cb8320 100644
--- a/skills/cloud/gke-basics/references/gke-multitenancy.md
+++ b/skills/cloud/gke-multitenancy/SKILL.md
@@ -1,26 +1,45 @@
+---
+name: gke-multitenancy
+description: >-
+  Plans and configures multi-tenancy on GKE. Covers namespace isolation, RBAC
+  planning for teams, resource quotas, LimitRanges, network isolation, and
+  cost allocation. Use when designing GKE multi-tenancy, configuring GKE
+  namespaces, setting up resource quotas, or isolating GKE teams. Don't use
+  for single-tenant cluster configuration or general deployment instructions
+  (use gke-basics or gke-app-onboarding instead).
+---
+
 # GKE Multi-Tenancy
 
-This reference covers enterprise multi-tenancy patterns on GKE, including namespace isolation, RBAC planning, resource quotas, and network segmentation.
+This reference covers enterprise multi-tenancy patterns on GKE, including
+namespace isolation, RBAC planning, resource quotas, and network segmentation.
 
-> **MCP Tools:** `apply_k8s_manifest`, `get_k8s_resource`, `check_k8s_auth`, `describe_k8s_resource`, `delete_k8s_resource`
+> **MCP Tools:** `apply_k8s_manifest`, `get_k8s_resource`, `check_k8s_auth`,
+> `describe_k8s_resource`, `delete_k8s_resource`
 
 ## When to Use
 
-- Multiple teams sharing a single GKE cluster
-- Isolating workloads by environment (dev/staging/prod) within one cluster
-- Implementing least-privilege access control
-- Cost allocation across teams or projects
+-   Multiple teams sharing a single GKE cluster
+-   Isolating workloads by environment (dev/staging/prod) within one cluster
+-   Implementing least-privilege access control
+-   Cost allocation across teams or projects
 
 ## Multi-Tenancy Models
 
-| Model | Isolation | Complexity | Cost |
-|-------|-----------|------------|------|
-| **Namespace-per-team** | Soft (RBAC + Network Policy) | Low | Lowest (shared cluster) |
-| **Namespace-per-environment** | Soft | Low | Low |
-| **Node pool-per-team** | Medium (dedicated compute) | Medium | Medium |
-| **Cluster-per-team** | Hard (full isolation) | High | Highest |
-
-> **Golden path recommendation**: Start with namespace-per-team for cost efficiency. Escalate to stronger isolation only when compliance requires it.
+| Model                         | Isolation    | Complexity | Cost           |
+| ----------------------------- | ------------ | ---------- | -------------- |
+| **Namespace-per-team**        | Soft (RBAC + | Low        | Lowest (shared |
+:                               : Network      :            : cluster)       :
+:                               : Policy)      :            :                :
+| **Namespace-per-environment** | Soft         | Low        | Low            |
+| **Node pool-per-team**        | Medium       | Medium     | Medium         |
+:                               : (dedicated   :            :                :
+:                               : compute)     :            :                :
+| **Cluster-per-team**          | Hard (full   | High       | Highest        |
+:                               : isolation)   :            :                :
+
+> **Golden path recommendation**: Start with namespace-per-team for cost
+> efficiency. Escalate to stronger isolation only when compliance requires it.
 
 ## Namespace Isolation Setup
 
@@ -35,7 +54,8 @@ kubectl label namespace team-b team=b
 
 ### 2. RBAC Configuration
 
-**Principle**: Grant minimal permissions per namespace. Never bind to `system:authenticated`.
+**Principle**: Grant minimal permissions per namespace. Never bind to
+`system:authenticated`.
 
 ```yaml
 # Namespace-scoped role for a team
@@ -64,7 +84,9 @@ roleRef:
   apiGroup: rbac.authorization.k8s.io
 ```
 
-**RBAC best practices:** Use Google Groups for subject bindings. Prefer namespace-scoped Roles over ClusterRoles. See [gke-security.md](./gke-security.md) for full RBAC hardening guidance.
+**RBAC best practices:** Use Google Groups for subject bindings. Prefer
+namespace-scoped Roles over ClusterRoles. See
+[gke-security](../gke-security/SKILL.md) for full RBAC hardening guidance.
 
 ### 3. Resource Quotas
 
@@ -113,7 +135,8 @@ spec:
 
 ### 5. Network Isolation
 
-Apply default-deny per namespace (see [gke-security.md](./gke-security.md)), then allow intra-team traffic:
+Apply default-deny per namespace (see [gke-security](../gke-security/SKILL.md)),
+then allow intra-team traffic:
 
 ```yaml
 # Allow same-namespace pods to talk + DNS
@@ -160,4 +183,3 @@ gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
 ```
 
 View in Cloud Billing > GKE Cost Allocation.
-
diff --git a/skills/cloud/gke-networking/SKILL.md b/skills/cloud/gke-networking/SKILL.md
new file mode 100644
index 0000000000..fabfb8bbe9
--- /dev/null
+++ b/skills/cloud/gke-networking/SKILL.md
@@ -0,0 +1,161 @@
+---
+name: gke-networking
+description: >-
+  Plans, configures, and manages GKE networking. Covers private clusters, VPC-
+  native configurations, Gateway API, DNS, ingress/egress, Dataplane V2, and
+  IP planning. Use when designing GKE networking layouts, configuring private
+  clusters, setting up Gateway API, planning GKE IP ranges, or configuring GKE
+  ingress/egress. Don't use for basic application routing that does not
+  require dedicated network configuration.
+---
+
+# GKE Networking
+
+This reference covers networking configuration for GKE clusters. The golden path
+enforces private, VPC-native clusters with Dataplane V2.
+
+> **MCP Tools:** `get_cluster`, `update_cluster`, `apply_k8s_manifest`,
+> `get_k8s_resource`
+
+## Golden Path Networking Defaults
+
+Setting                                                              | Golden Path Value                  | Day-0/1 | Notes
+-------------------------------------------------------------------- | ---------------------------------- | ------- | -----
+`privateClusterConfig.enablePrivateNodes`                            | `true`                             | Day-0   | Nodes have no public IPs
+`masterAuthorizedNetworksConfig.privateEndpointEnforcementEnabled`   | `true`                             | Day-0   | Control plane only reachable via private endpoint or DNS
+`controlPlaneEndpointsConfig.dnsEndpointConfig.allowExternalTraffic` | `true`                             | Day-0   | Allows DNS-based access from outside VPC
+`networkConfig.datapathProvider`                                     | `ADVANCED_DATAPATH` (Dataplane V2) | Day-0   | eBPF-based, built-in Network Policy
+`networkConfig.dnsConfig.clusterDns`                                 | `CLOUD_DNS`                        | Day-0   | Managed DNS, more reliable than kube-dns
+`networkConfig.enableIntraNodeVisibility`                            | `true`                             | Day-1   | VPC Flow Logs for intra-node traffic
+`networkConfig.gatewayApiConfig.channel`                             | `CHANNEL_STANDARD`                 | Day-1   | Gateway API support
+`ipAllocationPolicy.autoIpamConfig.enabled`                          | `true`                             | Day-0   | Automatic IP range management
+`ipAllocationPolicy.createSubnetwork`                                | `true`                             | Day-0   | Auto-create dedicated subnet
+`defaultMaxPodsConstraint.maxPodsPerNode`                            | `48`                               | Day-0   | Conservative default; 110 for high density
+
+## Private Cluster Access Patterns
+
+The golden path creates a private cluster. Users access it via:
+
+1.  **DNS endpoint (default)**: `allowExternalTraffic: true` enables access via
+    the cluster's DNS endpoint from outside the VPC. No VPN required.
+2.  **Private endpoint**: Direct access from within the VPC or via Cloud
+    VPN/Interconnect.
+3.  **Authorized networks**: Add specific CIDRs to
+    `masterAuthorizedNetworksConfig` for IP-based access control.
+
+```bash
+# Access private cluster via DNS endpoint (golden path default)
+gcloud container clusters get-credentials <CLUSTER_NAME> \
+  --region <REGION> --dns-endpoint \
+  --quiet
+
+# Access via private endpoint (from within VPC)
+gcloud container clusters get-credentials <CLUSTER_NAME> \
+  --region <REGION> --internal-ip \
+  --quiet
+```
+
+## Bring-Your-Own VPC/Subnet
+
+If the customer has existing network infrastructure:
+
+```bash
+gcloud container clusters create-auto <CLUSTER_NAME> \
+  --region <REGION> \
+  --network <VPC_NAME> \
+  --subnetwork <SUBNET_NAME> \
+  --cluster-secondary-range-name <POD_RANGE> \
+  --services-secondary-range-name <SVC_RANGE> \
+  --enable-private-nodes \
+  --enable-master-authorized-networks \
+  --quiet
+```
+
+> **Day-0 Warning**: VPC, subnet, and IP ranges cannot be changed after cluster
+> creation.
+
+## IP Planning
+
+| Resource      | Golden Path  | Notes                                      |
+| ------------- | ------------ | ------------------------------------------ |
+| Pod CIDR      | `/17` (auto) | ~32K pod IPs; size based on maxPodsPerNode |
+| Service CIDR  | `/20` (auto) | ~4K service IPs                            |
+| Node subnet   | auto-created | /20 recommended for growth                 |
+| Max pods/node | 48           | Each node gets a /25 pod range; set to 110 |
+:               :              : for /24 per node                           :
+
+**Pod CIDR sizing rule of thumb:**
+
+-   `maxPodsPerNode=48` -> each node uses a `/25` (128 IPs) from pod CIDR
+-   `maxPodsPerNode=110` -> each node uses a `/24` (256 IPs) from pod CIDR
+-   Larger maxPodsPerNode = fewer nodes fit in a given CIDR: a `/17` pod CIDR
+    holds 256 nodes at `/25` per node, but only 128 nodes at `/24` per node
+
+## Ingress
+
+**Gateway API** (golden path, enabled via `gatewayApiConfig.channel:
+CHANNEL_STANDARD`):
+
+```yaml
+apiVersion: gateway.networking.k8s.io/v1
+kind: Gateway
+metadata:
+  name: external-http
+spec:
+  gatewayClassName: gke-l7-global-external-managed
+  listeners:
+  - name: http
+    protocol: HTTP
+    port: 80
+```
+
+**Alternatives:**
+
+-   `gke-l7-regional-external-managed` — regional external
+-   `gke-l7-rilb` — internal load balancer
+-   Istio service mesh — for advanced traffic management, mTLS
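+
+To send traffic through the `external-http` Gateway above, attach an
+`HTTPRoute` to it. The following is a minimal sketch rather than a golden path
+default: the route name `my-app-route`, the hostname `app.example.com`, and the
+backend Service `my-app` on port 8080 are placeholder assumptions, and the
+route is assumed to live in the same namespace as the Gateway.
+
+```yaml
+apiVersion: gateway.networking.k8s.io/v1
+kind: HTTPRoute
+metadata:
+  name: my-app-route          # hypothetical name
+spec:
+  parentRefs:
+  - name: external-http       # the Gateway defined above
+  hostnames:
+  - "app.example.com"         # placeholder hostname
+  rules:
+  - matches:
+    - path:
+        type: PathPrefix
+        value: /
+    backendRefs:
+    - name: my-app            # assumed Service name
+      port: 8080              # assumed Service port
+```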
+
+## Egress
+
+-   Default: private nodes have no public IPs, so outbound internet access
+    requires Cloud NAT configured for the cluster's subnet
+-   For static egress IPs: configure Cloud NAT with manual IP allocation
+-   For restricted egress: route through a firewall appliance via custom routes
+
+## Network Policy
+
+Dataplane V2 (golden path) provides built-in Network Policy enforcement — no
+additional addon needed. Apply default-deny per namespace, then allow specific
+flows.
+
+> See [gke-security](../gke-security/SKILL.md) for the default-deny policy and
+> [gke-multitenancy](../gke-multitenancy/SKILL.md) for per-team allow
+> policies.
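+
+For orientation, a default-deny policy usually has the shape sketched below.
+The namespace `team-a` and the policy name are placeholders, and the linked
+gke-security reference remains the authoritative version. Because this sketch
+also denies egress, DNS lookups fail until an allow rule is added.
+
+```yaml
+apiVersion: networking.k8s.io/v1
+kind: NetworkPolicy
+metadata:
+  name: default-deny-all      # hypothetical name
+  namespace: team-a           # placeholder namespace
+spec:
+  podSelector: {}             # empty selector matches every pod in the namespace
+  policyTypes:
+  - Ingress
+  - Egress
+```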
+
+## Cloud Armor (Recommended for Public-Facing Services)
+
+Cloud Armor provides WAF and DDoS protection. **Not a golden path default** —
+recommended for any service with public ingress. Link via `BackendConfig`:
+
+```yaml
+# 1. Create BackendConfig referencing your Cloud Armor policy
+apiVersion: cloud.google.com/v1
+kind: BackendConfig
+metadata:
+  name: my-backend-config
+spec:
+  securityPolicy:
+    name: my-cloud-armor-policy
+---
+# 2. Annotate your Service
+# cloud.google.com/backend-config: '{"default": "my-backend-config"}'
+```
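+
+The commented annotation in step 2 goes on the Service that backs the load
+balancer. A minimal sketch, assuming a Service named `my-service` that selects
+pods labeled `app: my-app` listening on port 8080 (all placeholders):
+
+```yaml
+apiVersion: v1
+kind: Service
+metadata:
+  name: my-service            # hypothetical name
+  annotations:
+    # Links this Service's backends to the BackendConfig defined above
+    cloud.google.com/backend-config: '{"default": "my-backend-config"}'
+spec:
+  type: ClusterIP
+  selector:
+    app: my-app               # assumed pod label
+  ports:
+  - port: 80
+    targetPort: 8080          # assumed container port
+```
+
+`BackendConfig` is read by the GKE Ingress controller. For Services exposed
+through Gateway API, the Cloud Armor policy is typically attached with a
+`GCPBackendPolicy` resource instead.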
+
+## SSL, Container-Native LB, and PSC
+
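+-   **Google-managed SSL certificates**: Auto-provisioned and auto-renewed. The
+    `ManagedCertificate` CRD applies to Ingress; with Gateway API, attach
+    certificates through Certificate Manager or pre-shared SSL certificates.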
+-   **Container-native LB**: Enabled by default on VPC-native clusters (golden
+    path). Targets pods via NEGs, bypassing iptables. Annotation:
+    `cloud.google.com/neg: '{"ingress": true}'`.
+-   **Private Service Connect (PSC)**: Use `ServiceAttachment` CRD to expose
+    services across VPCs without peering.
diff --git a/skills/cloud/gke-observability/SKILL.md b/skills/cloud/gke-observability/SKILL.md
new file mode 100644
index 0000000000..1db6143c40
--- /dev/null
+++ b/skills/cloud/gke-observability/SKILL.md
@@ -0,0 +1,197 @@
+---
+name: gke-observability
+description: >-
+  Configures GKE observability, including Cloud Logging, Cloud Monitoring, and
+  managed Prometheus. Use when configuring GKE monitoring, setting up GKE logging,
+  or configuring Prometheus metrics collection. Don't use to configure local
+  application logging frameworks or external APMs outside GKE.
+---
+
+# GKE Observability
+
+This reference covers monitoring, logging, and metrics configuration for GKE.
+The golden path enables comprehensive observability including control-plane
+metrics.
+
+> **MCP Tools:** `get_cluster`, `list_k8s_events`, `get_k8s_logs`,
+> `get_k8s_cluster_info`, `describe_k8s_resource`. **CLI-only:** `gcloud
+> container clusters update --monitoring=...`, `gcloud logging read`
+
+## Golden Path Observability Defaults
+
+Setting                                             | Golden Path Value                                                                                                                                   | Notes
+--------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- | -----
+`loggingConfig` components                          | SYSTEM_COMPONENTS, WORKLOADS                                                                                                                        | Full workload logging
+`monitoringConfig` components                       | SYSTEM_COMPONENTS, STORAGE, POD, DEPLOYMENT, STATEFULSET, DAEMONSET, HPA, JOBSET, CADVISOR, KUBELET, DCGM, APISERVER, SCHEDULER, CONTROLLER_MANAGER | Full suite including control-plane
+`managedPrometheusConfig.enabled`                   | `true`                                                                                                                                              | Google-managed Prometheus
+`advancedDatapathObservabilityConfig.enableMetrics` | `true`                                                                                                                                              | Dataplane V2 flow metrics
+`loggingService`                                    | `logging.googleapis.com/kubernetes`                                                                                                                 | Cloud Logging
+`monitoringService`                                 | `monitoring.googleapis.com/kubernetes`                                                                                                              | Cloud Monitoring
+
+### Control-Plane Metrics (Golden Path Addition)
+
+The golden path adds three control-plane monitoring components not present in
+default clusters:
+
+| Component            | What It Monitors                                      |
+| -------------------- | ----------------------------------------------------- |
+| `APISERVER`          | API server request latency, error rates, admission    |
+:                      : webhook performance                                   :
+| `SCHEDULER`          | Scheduling latency, pending pods, scheduling failures |
+| `CONTROLLER_MANAGER` | Controller work queue depth, reconciliation latency   |
+
+These are critical for diagnosing cluster-level issues (slow API responses,
+scheduling delays, stuck controllers).
+
+## Enabling Full Monitoring
+
+```bash
+# Enable golden path monitoring suite
+gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
+  --monitoring=SYSTEM,API_SERVER,SCHEDULER,CONTROLLER_MANAGER,STORAGE,POD,DEPLOYMENT,STATEFULSET,DAEMONSET,HPA,CADVISOR,KUBELET,DCGM \
+  --quiet
+
+# Enable Managed Prometheus
+gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
+  --enable-managed-prometheus \
+  --quiet
+
+# Enable Dataplane V2 observability metrics
+gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
+  --enable-dataplane-v2-flow-observability \
+  --quiet
+```
+
+## Managed Prometheus
+
+Golden path enables Google Managed Prometheus for metrics collection and
+querying.
+
+**Querying metrics:**
+
+-   Use Cloud Monitoring Metrics Explorer in the console
+-   Use PromQL via the Prometheus UI or API
+-   Grafana dashboards via Managed Grafana
+
+**Key GKE metrics:**
+
+| Metric                                  | Source             | Use           |
+| --------------------------------------- | ------------------ | ------------- |
+| `container_cpu_usage_seconds_total`     | cAdvisor           | Pod CPU usage |
+| `container_memory_working_set_bytes`    | cAdvisor           | Pod memory    |
+:                                         :                    : usage         :
+| `kube_pod_status_phase`                 | kube-state-metrics | Pod lifecycle |
+| `apiserver_request_duration_seconds`    | API Server         | Control plane |
+:                                         :                    : latency       :
+| `scheduler_scheduling_duration_seconds` | Scheduler          | Scheduling    |
+:                                         :                    : performance   :
+| `node_cpu_seconds_total`                | Kubelet            | Node CPU      |
+| `DCGM_FI_DEV_GPU_UTIL`                  | DCGM               | GPU           |
+:                                         :                    : utilization   :
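+
+Managed Prometheus scrapes application metrics only from workloads targeted by
+a `PodMonitoring` resource. A minimal sketch, assuming pods labeled
+`app: my-app` expose Prometheus metrics on a container port named `metrics`
+(both are placeholders):
+
+```yaml
+apiVersion: monitoring.googleapis.com/v1
+kind: PodMonitoring
+metadata:
+  name: my-app-monitoring     # hypothetical name
+  namespace: default          # assumed namespace; must match the target pods
+spec:
+  selector:
+    matchLabels:
+      app: my-app             # assumed pod label
+  endpoints:
+  - port: metrics             # assumed named container port
+    interval: 30s
+```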
+
+## Live Resource Usage (kubectl-only)
+
+No MCP or gcloud equivalent exists for live resource usage. Use `kubectl top`:
+
+```bash
+kubectl top pods --all-namespaces --sort-by=cpu
+kubectl top nodes
+kubectl top pods --containers -n <NAMESPACE>  # per-container breakdown
+```
+
+## Cloud Logging (gcloud-only)
+
+**Querying cluster logs** (no MCP equivalent — use `gcloud logging read`):
+
+```bash
+# System component logs
+gcloud logging read \
+  'resource.type="k8s_cluster" AND resource.labels.cluster_name="<CLUSTER_NAME>"' \
+  --project <PROJECT_ID> --limit 50 \
+  --quiet
+
+# Workload logs for a specific namespace
+gcloud logging read \
+  'resource.type="k8s_container" AND resource.labels.cluster_name="<CLUSTER_NAME>" AND resource.labels.namespace_name="<NAMESPACE>"' \
+  --project <PROJECT_ID> --limit 50 \
+  --quiet
+
+# Audit logs (who did what)
+gcloud logging read \
+  'resource.type="k8s_cluster" AND logName:"cloudaudit.googleapis.com"' \
+  --project <PROJECT_ID> --limit 50 \
+  --quiet
+```
+
+## Diagnostic Settings
+
+For security monitoring and troubleshooting, control-plane logs should be
+enabled. Check the cluster's current logging configuration first:
+
+```bash
+# View current logging config
+gcloud container clusters describe <CLUSTER_NAME> --region <REGION> \
+  --format="yaml(loggingConfig)" \
+  --quiet
+```
+
+## Alerting
+
+Set up alerts for critical conditions:
+
+Condition               | Metric                                                                  | Threshold
+----------------------- | ----------------------------------------------------------------------- | ---------
+High API server latency | `apiserver_request_duration_seconds`                                    | P99 > 5s
+Pod crash loops         | `kube_pod_container_status_restarts_total`                              | > 5 in 10min
+Node not ready          | `kube_node_status_condition`                                            | condition=Ready, status!=True
+High GPU utilization    | `DCGM_FI_DEV_GPU_UTIL`                                                  | > 95% sustained
+PVC near capacity       | `kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes` | > 85%
+Scheduling failures     | `scheduler_schedule_attempts_total{result="error"}`                     | > 0
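+
+With Managed Prometheus, conditions like these can be expressed as alerting
+rules in a `Rules` resource. The sketch below covers only the crash-loop row.
+The resource name, namespace, group name, and labels are assumptions, and it
+presumes `kube_pod_container_status_restarts_total` is already being collected
+(for example by kube-state-metrics).
+
+```yaml
+apiVersion: monitoring.googleapis.com/v1
+kind: Rules
+metadata:
+  name: pod-crash-loop-rules  # hypothetical name
+  namespace: default          # assumed namespace
+spec:
+  groups:
+  - name: workload-health     # hypothetical group name
+    interval: 30s
+    rules:
+    - alert: PodCrashLooping
+      # More than 5 restarts within the last 10 minutes
+      expr: increase(kube_pod_container_status_restarts_total[10m]) > 5
+      labels:
+        severity: warning
+      annotations:
+        summary: "Container {{ $labels.container }} is restarting repeatedly"
+```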
+
+## Cost Considerations
+
+Monitoring and logging have associated costs:
+
+-   **Cloud Logging**: Charged per GiB ingested beyond free tier (50
+    GiB/project/month)
+-   **Cloud Monitoring**: Free for GKE system metrics; other chargeable metrics
+    are billed by ingested volume
+-   **Managed Prometheus**: Charged per sample ingested
+
+To reduce costs in non-production:
+
+```bash
+# Reduce to system-only monitoring
+gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
+  --monitoring=SYSTEM \
+  --quiet
+```
+
+## Distributed Tracing & Continuous Profiling (Recommended)
+
+**Not golden path defaults** — recommended for production microservice
+architectures and performance-sensitive workloads.
+
+-   **Cloud Trace**: Add OpenTelemetry SDK to your app with the
+    `opentelemetry-operations-go` (or equivalent) exporter. Traces appear in
+    Cloud Trace console. Identifies cross-service latency bottlenecks.
+-   **Cloud Profiler**: Add the Cloud Profiler agent to your app. Profiles CPU
+    and memory usage in production with low overhead. Identifies hotspots and
+    compares across versions.
+
+## LQL Query Examples
+
+Common Logging Query Language patterns for GKE troubleshooting:
+
+```
+-- Error logs for a specific container
+resource.type="k8s_container" AND resource.labels.container_name="my-app" AND severity>=ERROR
+
+-- OOMKilled events
+resource.type="k8s_event" AND jsonPayload.reason="OOMKilling"
+
+-- Pod scheduling failures
+resource.type="k8s_event" AND jsonPayload.reason="FailedScheduling"
+
+-- Audit logs (who did what)
+resource.type="k8s_cluster" AND logName:"cloudaudit.googleapis.com"
+```
diff --git a/skills/cloud/gke-reliability/SKILL.md b/skills/cloud/gke-reliability/SKILL.md
new file mode 100644
index 0000000000..57f73520cb
--- /dev/null
+++ b/skills/cloud/gke-reliability/SKILL.md
@@ -0,0 +1,200 @@
+---
+name: gke-reliability
+description: >-
+  Improves GKE workload reliability, using PDBs, health probes, and topology
+  spread constraints. Use when configuring GKE workload reliability, setting up
+  PDBs, or configuring GKE health probes (liveness, readiness, startup). Don't
+  use for disaster recovery setup or full cluster backups (use gke-backup-dr
+  instead).
+---
+
+# GKE Reliability
+
+This reference covers high availability and reliability configuration for GKE
+clusters and workloads.
+
+> **MCP Tools:** `get_cluster`, `get_k8s_resource`, `describe_k8s_resource`,
+> `apply_k8s_manifest`, `list_k8s_events`
+
+## Golden Path Reliability Defaults
+
+| Setting          | Golden Path Value     | Notes                            |
+| ---------------- | --------------------- | -------------------------------- |
+| Cluster type     | Regional (4 zones:    | Control plane replicated across  |
+:                  : us-central1-a/b/c/f)  : zones                            :
+| Upgrade strategy | SURGE (`maxSurge: 1`) | Rolling upgrades with extra      |
+:                  :                       : capacity                         :
+| Auto-repair      | `true`                | Unhealthy nodes replaced         |
+:                  :                       : automatically                    :
+| Auto-upgrade     | `true`                | Nodes follow control plane       |
+:                  :                       : version                          :
+| Release channel  | REGULAR               | Balanced freshness and stability |
+| Stateful HA      | Enabled               | Leader election for stateful     |
+:                  :                       : workloads                        :
+
+## Workflows
+
+### 1. Verify Cluster High Availability
+
+```
+# MCP (preferred)
+get_cluster(name="projects/<PROJECT>/locations/<REGION>/clusters/<CLUSTER>",
+  readMask="location,locations,nodePools.locations")
+
+# gcloud fallback
+gcloud container clusters describe <CLUSTER> --region <REGION> \
+  --format="json(location, locations)" \
+  --quiet
+```
+
+-   If `location` is a region (e.g., `us-central1`), the control plane is
+    regional
+-   If `locations` has multiple entries, nodes span multiple zones
+
+### 2. Pod Disruption Budgets (PDBs)
+
+PDBs ensure minimum pod availability during voluntary disruptions (node
+upgrades, autoscaler scale-down).
+
+**Check existing PDBs:**
+
+```
+# MCP (preferred)
+get_k8s_resource(parent="...", resourceType="poddisruptionbudget")
+
+# kubectl fallback
+kubectl get pdb --all-namespaces
+```
+
+**Create PDB:**
+
+```yaml
+apiVersion: policy/v1
+kind: PodDisruptionBudget
+metadata:
+  name: my-app-pdb
+  namespace: default
+spec:
+  minAvailable: 2       # Or use maxUnavailable: 1
+  selector:
+    matchLabels:
+      app: my-app
+```
+
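+> Every production Deployment with 2+ replicas should have a PDB. Keep
+> `minAvailable` below the replica count (or use `maxUnavailable`); otherwise
+> node drains cannot evict any pod.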
+
+### 3. Health Probes
+
+Every production container should have liveness and readiness probes. Startup
+probes are recommended for slow-starting apps.
+
+**Check existing probes:**
+
+```
+# MCP (preferred)
+describe_k8s_resource(parent="...", resourceType="deployment", name="<APP>", namespace="<NS>")
+
+# kubectl fallback
+kubectl get deployment <APP> -n <NS> -o yaml | grep -E "livenessProbe|readinessProbe|startupProbe"
+```
+
+**Recommended probe configuration:**
+
+```yaml
+spec:
+  containers:
+  - name: app
+    livenessProbe:
+      httpGet:
+        path: /healthz
+        port: 8080
+      initialDelaySeconds: 15
+      periodSeconds: 10
+      failureThreshold: 3
+    readinessProbe:
+      httpGet:
+        path: /readyz
+        port: 8080
+      initialDelaySeconds: 5
+      periodSeconds: 5
+      failureThreshold: 3
+    startupProbe:             # For slow-starting apps
+      httpGet:
+        path: /healthz
+        port: 8080
+      initialDelaySeconds: 10
+      periodSeconds: 5
+      failureThreshold: 30    # 10s initial delay + 30 * 5s = about 160s max startup time
+```
+
+-   **Readiness**: Determines when a pod can accept traffic
+-   **Liveness**: Determines when to restart a container
+-   **Startup**: Holds off liveness and readiness checks until the startup
+    probe succeeds (prevents premature restarts)
+
+### 4. Graceful Shutdown
+
+Ensure applications handle `SIGTERM` and drain in-flight requests:
+
+```yaml
+spec:
+  terminationGracePeriodSeconds: 30    # Default; increase for long-running requests
+  containers:
+  - name: app
+    lifecycle:
+      preStop:
+        exec:
+          command: ["/bin/sh", "-c", "sleep 5"]  # Allow LB to deregister
+```
+
+### 5. Topology Spread Constraints
+
+Distribute pods across zones and nodes to survive failures:
+
+```yaml
+spec:
+  topologySpreadConstraints:
+  - maxSkew: 1
+    topologyKey: topology.kubernetes.io/zone
+    whenUnsatisfiable: DoNotSchedule
+    labelSelector:
+      matchLabels:
+        app: my-app
+  - maxSkew: 1
+    topologyKey: kubernetes.io/hostname
+    whenUnsatisfiable: ScheduleAnyway
+    labelSelector:
+      matchLabels:
+        app: my-app
+```
+
+-   **Zone spread** (`DoNotSchedule`): Hard requirement -- pods must be balanced
+    across zones
+-   **Node spread** (`ScheduleAnyway`): Best-effort -- prefer distribution but
+    don't block scheduling
+
+### 6. Replicas
+
+| Workload Type        | Minimum Replicas     | Reason                         |
+| -------------------- | -------------------- | ------------------------------ |
+| Stateless web/API    | 2                    | Survive single pod/node        |
+:                      :                      : failure                        :
+| Critical services    | 3                    | Survive zone failure with zone |
+:                      :                      : spread                         :
+| Stateful (databases) | 3 (with replication) | Application-level quorum       |
+| Batch/jobs           | 1                    | Ephemeral by nature            |
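+
+The replica guidance combines with the earlier workflows roughly as follows.
+This is a sketch for a critical stateless service. The name `my-app`, the
+`<IMAGE>` placeholder, and port 8080 are assumptions carried over from the
+examples above.
+
+```yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: my-app
+spec:
+  replicas: 3                 # critical service: survives a zone failure with zone spread
+  selector:
+    matchLabels:
+      app: my-app
+  template:
+    metadata:
+      labels:
+        app: my-app
+    spec:
+      topologySpreadConstraints:
+      - maxSkew: 1
+        topologyKey: topology.kubernetes.io/zone
+        whenUnsatisfiable: DoNotSchedule
+        labelSelector:
+          matchLabels:
+            app: my-app
+      containers:
+      - name: app
+        image: <IMAGE>        # placeholder image reference
+        ports:
+        - containerPort: 8080
+        readinessProbe:
+          httpGet:
+            path: /readyz
+            port: 8080
+```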
+
+## Best Practices
+
+1.  **Regional clusters for production**: Always use regional clusters to
+    survive zone failures
+2.  **PDBs for everything**: Every production workload with 2+ replicas needs a
+    PDB
+3.  **Probes for all containers**: At minimum, readiness probes on every
+    production container
+4.  **Zone spreading**: Use topology spread constraints to distribute pods
+    across failure domains
+5.  **Graceful shutdown**: Handle SIGTERM and set appropriate
+    `terminationGracePeriodSeconds`
+6.  **Maintenance windows**: Schedule upgrades during low-traffic periods (see
+    [gke-upgrades](../gke-upgrades/SKILL.md))
diff --git a/skills/cloud/gke-scaling/SKILL.md b/skills/cloud/gke-scaling/SKILL.md
new file mode 100644
index 0000000000..01c6fa84e5
--- /dev/null
+++ b/skills/cloud/gke-scaling/SKILL.md
@@ -0,0 +1,175 @@
+---
+name: gke-scaling
+description: >-
+  Configures GKE autoscaling, including HPA, VPA, and Node Auto-Provisioning
+  (NAP). Use when configuring GKE autoscaling, setting up GKE HPA, setting up
+  GKE VPA, or configuring GKE NAP. Don't use for configuring static cluster sizes
+  or setting node-level machine types directly (use gke-compute-classes instead).
+---
+
+# GKE Workload Scaling
+
+This reference covers scaling workloads on GKE. The golden path enables VPA,
+OPTIMIZE_UTILIZATION autoscaling profile, and Node Auto Provisioning by default.
+
+> **MCP Tools:** `get_k8s_resource`, `describe_k8s_resource`,
+> `apply_k8s_manifest`, `patch_k8s_resource`, `get_cluster`, `update_cluster`,
+> `update_node_pool`
+
+## Golden Path Scaling Defaults
+
+Setting                                  | Golden Path Value      | Notes
+---------------------------------------- | ---------------------- | -----
+`autoscaling.autoscalingProfile`         | `OPTIMIZE_UTILIZATION` | Aggressive scale-down for cost savings
+`verticalPodAutoscaling.enabled`         | `true`                 | VPA recommendations available
+`autoscaling.enableNodeAutoprovisioning` | `true`                 | NAP creates node pools on demand
+GPU resource limits (T4, A100)           | `1000000000` each      | NAP can provision GPU nodes
+
+## Scaling Mechanisms
+
+### 1. Manual Scaling
+
+> **kubectl-only** — no MCP equivalent for `kubectl scale`. Use kubectl
+> directly.
+
+```bash
+kubectl scale deployment <DEPLOYMENT> --replicas=<N> -n <NAMESPACE>
+```
+
+### 2. Horizontal Pod Autoscaling (HPA)
+
+Scales the number of pods based on metrics.
+
+**Quick setup (kubectl-only — no MCP equivalent for `kubectl autoscale`):**
+
+```bash
+kubectl autoscale deployment <DEPLOYMENT> --cpu-percent=50 --min=1 --max=10
+```
+
+**Manifest approach (recommended — use MCP `apply_k8s_manifest`):**
+
+See [assets/hpa-example.yaml](./assets/hpa-example.yaml) for a template.
+
+```yaml
+apiVersion: autoscaling/v2
+kind: HorizontalPodAutoscaler
+metadata:
+  name: <DEPLOYMENT>-hpa
+spec:
+  scaleTargetRef:
+    apiVersion: apps/v1
+    kind: Deployment
+    name: <DEPLOYMENT>
+  minReplicas: 1
+  maxReplicas: 10
+  metrics:
+  - type: Resource
+    resource:
+      name: cpu
+      target:
+        type: Utilization
+        averageUtilization: 50
+```
+
+### 3. Vertical Pod Autoscaling (VPA)
+
+Adjusts CPU and memory requests to match actual usage. Enabled by default on
+golden path.
+
+**Update modes:**
+
+-   `Off` — recommendations only (safest, start here)
+-   `Initial` — sets resources only at pod creation
+-   `Auto` — restarts pods to apply new resource values
+-   `InPlaceOrRecreate` — updates resources without restart when possible (GKE
+    1.34+)
+
+**Create VPA in recommendation mode:**
+
+```yaml
+apiVersion: autoscaling.k8s.io/v1
+kind: VerticalPodAutoscaler
+metadata:
+  name: <DEPLOYMENT>-vpa
+spec:
+  targetRef:
+    apiVersion: apps/v1
+    kind: Deployment
+    name: <DEPLOYMENT>
+  updatePolicy:
+    updateMode: "Off"
+```
+
+**Read recommendations (prefer MCP `describe_k8s_resource`):**
+
+```
+# MCP (preferred)
+describe_k8s_resource(parent="...", resourceType="verticalpodautoscaler", name="<DEPLOYMENT>-vpa", namespace="<NAMESPACE>")
+
+# kubectl fallback
+kubectl get vpa <DEPLOYMENT>-vpa -n <NAMESPACE> -o jsonpath='{.status.recommendation}'
+```
+
+See [assets/vpa-example.yaml](./assets/vpa-example.yaml) for a full template.
+
+### 4. Cluster Autoscaler / Node Auto Provisioning (NAP)
+
+On Autopilot (golden path), node scaling is fully managed. NAP automatically
+creates and sizes node pools based on workload demands.
+
+**For Standard clusters:**
+
+```bash
+# Enable cluster autoscaler on a node pool
+gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
+  --enable-autoscaling --node-pool <POOL_NAME> \
+  --min-nodes <MIN> --max-nodes <MAX> \
+  --quiet
+
+# Enable NAP
+gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
+  --enable-autoprovisioning \
+  --min-cpu <MIN_CPU> --max-cpu <MAX_CPU> \
+  --min-memory <MIN_MEM> --max-memory <MAX_MEM> \
+  --quiet
+```
+
+**Autoscaling profiles:**
+
+| Profile                | Behavior                             | Golden Path? |
+| ---------------------- | ------------------------------------ | ------------ |
+| `BALANCED`             | Default GKE; conservative scale-down | No           |
+| `OPTIMIZE_UTILIZATION` | Aggressive scale-down; lower idle    | **Yes**      |
+:                        : resources                            :              :
+
+## Best Practices
+
+1.  **Define resource requests**: HPA and VPA rely on accurate requests. Always
+    set them.
+2.  **Avoid metric conflicts**: Do not use HPA and VPA on the same metric.
+    Typical pattern: HPA on CPU, VPA on memory.
+3.  **Pod Disruption Budgets**: Define PDBs for all production workloads to
+    ensure availability during scaling events.
+4.  **HPA stabilization**: HPA applies a default 5-minute stabilization window
+    to scale-down only; scale-up has none. Tune `behavior` for faster response
+    if needed.
+5.  **VPA "Auto" caution**: Auto mode restarts pods. Ensure your app handles
+    SIGTERM gracefully. VPA requires at least 2 replicas for evictions by
+    default.
+6.  **Use ComputeClasses**: For workload-specific node targeting (Spot fallback,
+    GPU, specific machine families), use ComputeClasses instead of node
+    selectors.
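+
+The `behavior` field mentioned in item 4 is set on the HPA spec. The sketch
+below shortens the scale-down window. The 120-second window and the
+50%-per-minute policy are illustrative values, not golden path settings.
+
+```yaml
+apiVersion: autoscaling/v2
+kind: HorizontalPodAutoscaler
+metadata:
+  name: <DEPLOYMENT>-hpa
+spec:
+  scaleTargetRef:
+    apiVersion: apps/v1
+    kind: Deployment
+    name: <DEPLOYMENT>
+  minReplicas: 2
+  maxReplicas: 10
+  metrics:
+  - type: Resource
+    resource:
+      name: cpu
+      target:
+        type: Utilization
+        averageUtilization: 50
+  behavior:
+    scaleDown:
+      stabilizationWindowSeconds: 120   # illustrative; the default is 300
+      policies:
+      - type: Percent
+        value: 50                       # remove at most 50% of replicas...
+        periodSeconds: 60               # ...per 60-second period
+```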
+
+## Rightsizing Workflow
+
+1.  Deploy VPA in `Off` mode for 24+ hours
+2.  Read recommendations: `kubectl describe vpa <NAME>`
+3.  Compare `target` values against current `requests`
+4.  Apply with 20% buffer: `new_request = target * 1.2`
+5.  Use patch format to update Deployment
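+
+As a worked example of steps 4 and 5, suppose VPA reports a target of `250m`
+CPU and `512Mi` memory (hypothetical values). With the 20% buffer, the new
+requests are 250m * 1.2 = 300m and 512Mi * 1.2 = 614.4Mi, rounded up to 615Mi.
+A patch fragment for a container assumed to be named `app` might look like
+this:
+
+```yaml
+spec:
+  template:
+    spec:
+      containers:
+      - name: app              # must match the existing container name
+        resources:
+          requests:
+            cpu: 300m          # 250m target * 1.2
+            memory: 615Mi      # 512Mi target * 1.2 = 614.4Mi, rounded up
+```
+
+Saved as `patch.yaml` (a hypothetical file name), this could be applied with
+`kubectl patch deployment <DEPLOYMENT> -n <NAMESPACE> --patch-file patch.yaml`.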
+
+Condition                     | Recommendation                       | Risk
+----------------------------- | ------------------------------------ | ------
+CPU request >5x P95 actual    | Reduce to `P95 * 1.2`                | Medium
+Memory request >3x P95 actual | Reduce to `P95 * 1.2`                | Medium
+CPU request >2x P95 actual    | Rightsizing with 20% buffer          | Low
+No resource limits set        | Add limits to prevent noisy-neighbor | Low
diff --git a/skills/cloud/gke-basics/assets/hpa-example.yaml b/skills/cloud/gke-scaling/assets/hpa-example.yaml
similarity index 100%
rename from skills/cloud/gke-basics/assets/hpa-example.yaml
rename to skills/cloud/gke-scaling/assets/hpa-example.yaml
diff --git a/skills/cloud/gke-basics/assets/vpa-example.yaml b/skills/cloud/gke-scaling/assets/vpa-example.yaml
similarity index 100%
rename from skills/cloud/gke-basics/assets/vpa-example.yaml
rename to skills/cloud/gke-scaling/assets/vpa-example.yaml
diff --git a/skills/cloud/gke-security/SKILL.md b/skills/cloud/gke-security/SKILL.md
new file mode 100644
index 0000000000..4902ed81b3
--- /dev/null
+++ b/skills/cloud/gke-security/SKILL.md
@@ -0,0 +1,285 @@
+---
+name: gke-security
+description: >-
+  Plans, configures, and hardens Google Kubernetes Engine (GKE) security.
+  Covers Workload Identity Federation, Secret Manager integration, RBAC
+  hardening, Binary Authorization, Network Policies (Dataplane V2), Pod
+  Security Standards, and IAM roles. Use when securing GKE clusters, setting up
+  Workload Identity, hardening RBAC configurations, or configuring GKE secrets.
+  Don't use for general network routing configuration (use gke-networking instead).
+---
+
+# GKE Security
+
+This reference covers security configuration for GKE clusters. The golden path
+enforces a hardened security posture by default.
+
+> **MCP Tools:** `get_cluster`, `check_k8s_auth`, `get_k8s_resource`,
+> `apply_k8s_manifest`, `update_cluster`
+
+## Golden Path Security Defaults
+
+Setting                                                        | Golden Path Value                                   | Day-0/1 | Notes
+-------------------------------------------------------------- | --------------------------------------------------- | ------- | -----
+`workloadIdentityConfig.workloadPool`                          | `<PROJECT>.svc.id.goog`                             | Day-0   | Workload Identity Federation for Pods
+`secretManagerConfig.enabled`                                  | `true`                                              | Day-1   | Google Secret Manager integration
+`secretManagerConfig.rotationConfig`                           | `enabled: true, rotationInterval: 120s`             | Day-1   | Automatic secret rotation
+`rbacBindingConfig.enableInsecureBindingSystemAuthenticated`   | `false`                                             | Day-0   | Blocks legacy `system:authenticated` bindings
+`rbacBindingConfig.enableInsecureBindingSystemUnauthenticated` | `false`                                             | Day-0   | Blocks legacy `system:unauthenticated` bindings
+`nodeConfig.shieldedInstanceConfig.enableSecureBoot`           | `true`                                              | Day-0   | Verifiable boot integrity
+`nodeConfig.shieldedInstanceConfig.enableIntegrityMonitoring`  | `true`                                              | Day-0   | Runtime integrity checks
+`nodeConfig.workloadMetadataConfig.mode`                       | `GKE_METADATA`                                      | Day-0   | Blocks legacy metadata API, enforces Workload Identity
+Private cluster + Dataplane V2 settings                        | See [gke-networking.md](../gke-networking/SKILL.md) | Day-0   | Private nodes, private endpoint enforcement, ADVANCED_DATAPATH
+
+## Workload Identity Federation
+
+Workload Identity is the recommended way for pods to access Google Cloud APIs.
+It eliminates the need for static service account keys.
+
+### Setup
+
+```bash
+# 1. Create a Google Service Account (GSA)
+gcloud iam service-accounts create <GSA_NAME> \
+  --project <PROJECT_ID> \
+  --display-name "Workload Identity SA" \
+  --quiet
+
+# 2. Grant IAM roles to the GSA
+gcloud projects add-iam-policy-binding <PROJECT_ID> \
+  --member "serviceAccount:<GSA_NAME>@<PROJECT_ID>.iam.gserviceaccount.com" \
+  --role "<ROLE>" \
+  --quiet
+
+# 3. Create Kubernetes Service Account (KSA)
+kubectl create namespace <NAMESPACE>
+kubectl create serviceaccount <KSA_NAME> --namespace <NAMESPACE>
+
+# 4. Bind KSA to GSA
+gcloud iam service-accounts add-iam-policy-binding \
+  <GSA_NAME>@<PROJECT_ID>.iam.gserviceaccount.com \
+  --role roles/iam.workloadIdentityUser \
+  --member "serviceAccount:<PROJECT_ID>.svc.id.goog[<NAMESPACE>/<KSA_NAME>]" \
+  --quiet
+
+# 5. Annotate KSA
+kubectl annotate serviceaccount <KSA_NAME> \
+  --namespace <NAMESPACE> \
+  iam.gke.io/gcp-service-account=<GSA_NAME>@<PROJECT_ID>.iam.gserviceaccount.com
+```
+
+> See [assets/workload-identity-pod.yaml](./assets/workload-identity-pod.yaml)
+> for a test pod.
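+
+Steps 3 and 5 can also be expressed declaratively, which may suit GitOps
+workflows better than imperative `kubectl` commands. A minimal sketch using the
+same placeholders as above:
+
+```yaml
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: <KSA_NAME>
+  namespace: <NAMESPACE>
+  annotations:
+    # Links this KSA to the GSA that was granted roles/iam.workloadIdentityUser
+    iam.gke.io/gcp-service-account: <GSA_NAME>@<PROJECT_ID>.iam.gserviceaccount.com
+```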
+
+### Verification
+
+```bash
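+# kubectl v1.24+ removed the --serviceaccount flag from `kubectl run`,
+# so the KSA is set through --overrides instead
+kubectl run workload-identity-test \
+  --image=gcr.io/google.com/cloudsdktool/cloud-sdk:slim \
+  --namespace=<NAMESPACE> \
+  --overrides='{"spec":{"serviceAccountName":"<KSA_NAME>"}}' \
+  --rm -it -- gcloud auth list --quiet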
+```
+
+## Secret Manager Integration
+
+The golden path enables Secret Manager with automatic rotation. Secrets are
+synced to Kubernetes Secrets.
+
+```bash
+# Verify Secret Manager is enabled on cluster
+gcloud container clusters describe <CLUSTER_NAME> --region <REGION> \
+  --format="value(secretManagerConfig.enabled)" \
+  --quiet
+
+# Enable if not already (Day-1 change)
+gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
+  --enable-secret-manager \
+  --secret-manager-rotation-interval=120s \
+  --quiet
+```
+
+## RBAC Hardening
+
+The golden path disables insecure legacy RBAC bindings that grant broad access
+to `system:authenticated` and `system:unauthenticated` groups.
+
+```bash
+# Verify insecure bindings are disabled
+gcloud container clusters describe <CLUSTER_NAME> --region <REGION> \
+  --format="yaml(rbacBindingConfig)" \
+  --quiet
+```
+
+**Best practices for RBAC:**
+
+-   Use namespace-scoped Roles over cluster-wide ClusterRoles
+-   Bind to specific Groups or ServiceAccounts, never to `system:authenticated`
+-   Audit permissions via MCP: `check_k8s_auth(parent="...", verb="list",
+    resourceType="pods", namespace="...")` (or `kubectl auth can-i --list
+    --as=<user>`)
+-   Review bindings via MCP: `get_k8s_resource(parent="...",
+    resourceType="clusterrolebinding")` (or `kubectl get
+    clusterrolebindings,rolebindings --all-namespaces`)
+
+> See [gke-multitenancy.md](../gke-multitenancy/SKILL.md) for enterprise RBAC
+> planning, and the GKE RBAC best practices guide:
+> https://docs.cloud.google.com/kubernetes-engine/docs/best-practices/rbac
+
+## Binary Authorization
+
+Binary Authorization is not enabled in the golden path by default, but it is
+recommended for enforcing image provenance in production:
+
+```bash
+# Enable Binary Authorization
+gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
+  --binauthz-evaluation-mode=PROJECT_SINGLETON_POLICY_ENFORCE \
+  --quiet
+```
+
+## Network Policies
+
+Dataplane V2 (golden path) provides built-in Network Policy enforcement. Apply
+default-deny per namespace:
+
+```
+# MCP (preferred)
+apply_k8s_manifest(parent="...", yamlManifest="<contents of default-deny-netpol.yaml>")
+
+# kubectl fallback
+kubectl apply -f ./assets/default-deny-netpol.yaml -n <NAMESPACE>
+```
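+
+The contents of `default-deny-netpol.yaml` are not reproduced in this
+reference. A typical default-deny policy looks roughly like the following
+sketch, where the policy name is an assumption:
+
+```yaml
+apiVersion: networking.k8s.io/v1
+kind: NetworkPolicy
+metadata:
+  name: default-deny-all       # hypothetical name
+spec:
+  podSelector: {}              # empty selector matches every pod in the namespace
+  policyTypes:
+  - Ingress
+  - Egress
+```
+
+Denying egress also blocks DNS lookups, so a policy like this is usually
+paired with explicit allow rules for DNS and for each required dependency.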
+
+## GKE Sandbox (gVisor)
+
+For running untrusted workloads in an isolated sandbox:
+
+```bash
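+# Enable on a new node pool (Standard clusters); sandboxing is set per node pool
+gcloud container node-pools create <POOL_NAME> \
+  --cluster <CLUSTER_NAME> --region <REGION> \
+  --image-type cos_containerd --sandbox type=gvisor \
+  --quiet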
+
+# Use in pod spec
+# Add: runtimeClassName: gvisor
+```
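+
+A minimal pod spec that opts into the sandbox might look like this; the pod
+name and image placeholder are illustrative:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: sandboxed-pod          # hypothetical name
+spec:
+  runtimeClassName: gvisor     # schedules the pod onto sandbox-enabled nodes
+  containers:
+  - name: app
+    image: <IMAGE>
+```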
+
+## Pod Security Standards (Golden Path)
+
+Pod Security Standards define three profiles that restrict what pods can do. The
+**`restricted` profile is the golden path default** for production namespaces.
+
+| Profile      | Level                 | Use Case                           |
+| ------------ | --------------------- | ---------------------------------- |
+| `privileged` | Unrestricted          | System namespaces (`kube-system`), |
+:              :                       : infrastructure controllers         :
+| `baseline`   | Minimally restrictive | Shared/dev namespaces, legacy apps |
+:              :                       : being migrated                     :
+| `restricted` | Heavily restricted    | Golden path for production         |
+:              :                       : workloads -- blocks privilege      :
+:              :                       : escalation, host access, root      :
+
+**Enforce via namespace labels (Pod Security Admission):**
+
+```yaml
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: production
+  labels:
+    pod-security.kubernetes.io/enforce: restricted
+    pod-security.kubernetes.io/warn: restricted
+    pod-security.kubernetes.io/audit: restricted
+```
+
+**Gradual rollout strategy:**
+
+1.  Start with `warn` + `audit` on existing namespaces to identify violations
+2.  Fix non-compliant workloads (remove `privileged`, `hostNetwork`, root user,
+    etc.)
+3.  Enable `enforce` once all workloads pass
+
+`restricted` blocks: running as root, privilege escalation, host
+networking/PID/IPC, host path volumes, and most capabilities. The golden path
+`workload-identity-pod.yaml` already complies.
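+
+The contents of `workload-identity-pod.yaml` are not shown here. As a sketch,
+these are the fields the `restricted` profile checks on a pod (the pod name
+and image placeholder are assumptions); note that it also requires a seccomp
+profile:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: restricted-example     # hypothetical name
+  namespace: production
+spec:
+  securityContext:
+    runAsNonRoot: true
+    seccompProfile:
+      type: RuntimeDefault
+  containers:
+  - name: app
+    image: <IMAGE>
+    securityContext:
+      allowPrivilegeEscalation: false
+      capabilities:
+        drop: ["ALL"]
+```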
+
+## Network Policy Logging (Recommended)
+
+With Dataplane V2 (golden path), you can enable logging for Network Policy
+decisions. **Not a golden path default** -- recommended for security auditing.
+
+```bash
+gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
+  --enable-network-policy-logging \
+  --quiet
+```
+
+This logs allowed and denied connections, useful for troubleshooting Network
+Policy rules and auditing traffic flows.
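+
+On Dataplane V2 clusters, what gets logged is controlled by a cluster-scoped
+`NetworkLogging` object named `default`. A minimal sketch, assuming both
+allowed and denied connections should be logged:
+
+```yaml
+apiVersion: networking.gke.io/v1alpha1
+kind: NetworkLogging
+metadata:
+  name: default
+spec:
+  cluster:
+    allow:
+      log: true                # log connections permitted by a policy
+      delegate: false
+    deny:
+      log: true                # log connections rejected by a policy
+      delegate: false
+```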
+
+## Common IAM Roles
+
+The five most common predefined IAM roles for GKE:
+
+| Role                            | Purpose             | When to Use          |
+| ------------------------------- | ------------------- | -------------------- |
+| `roles/container.admin`         | Full control over   | Platform team admins |
+:                                 : clusters and        : managing cluster     :
+:                                 : Kubernetes          : lifecycle            :
+:                                 : resources           :                      :
+| `roles/container.clusterAdmin`  | Manage clusters but | Cluster operators    |
+:                                 : not Kubernetes      : who create/delete    :
+:                                 : resources in them   : clusters             :
+| `roles/container.developer`     | Deploy workloads    | Application          |
+:                                 : (pods, services,    : developers deploying :
+:                                 : deployments)        : to existing clusters :
+| `roles/container.viewer`        | Read-only access to | Monitoring,          |
+:                                 : clusters and        : auditing, or         :
+:                                 : Kubernetes          : read-only dashboards :
+:                                 : resources           :                      :
+| `roles/container.clusterViewer` | List and get        | CI/CD pipelines that |
+:                                 : cluster details     : need cluster         :
+:                                 : only                : metadata             :
+
+> **Principle of least privilege**: Start with `roles/container.viewer` or
+> `roles/container.developer` and escalate only as needed. Avoid granting
+> `roles/container.admin` broadly.
+
+## Service Accounts & Agents
+
+-   **GKE Service Agent**
+    (`service-<PROJECT_NUMBER>@container-engine-robot.iam.gserviceaccount.com`):
+    Automatically created. Manages nodes, networking, and cluster operations on
+    your behalf. Do not remove or modify its permissions.
+-   **Node Service Account**: By default, nodes use the Compute Engine default
+    service account. For production, create a dedicated SA with minimal
+    permissions and assign it via node pool config.
+-   **Workload Identity**: The recommended way for pods to access Google Cloud
+    APIs. Maps a Kubernetes ServiceAccount to a Google IAM ServiceAccount — see
+    [Workload Identity setup](#workload-identity-federation) above.
+
+## Cross-Service Authentication Patterns
+
+Common patterns for granting GKE workloads access to other Google Cloud
+services:
+
+```bash
+# Grant a GKE workload access to Cloud Storage
+gcloud projects add-iam-policy-binding <PROJECT_ID> \
+  --member "serviceAccount:<GSA_NAME>@<PROJECT_ID>.iam.gserviceaccount.com" \
+  --role "roles/storage.objectViewer" \
+  --quiet
+
+# Grant a GKE workload access to Cloud SQL
+gcloud projects add-iam-policy-binding <PROJECT_ID> \
+  --member "serviceAccount:<GSA_NAME>@<PROJECT_ID>.iam.gserviceaccount.com" \
+  --role "roles/cloudsql.client" \
+  --quiet
+
+# Grant a GKE workload access to Pub/Sub
+gcloud projects add-iam-policy-binding <PROJECT_ID> \
+  --member "serviceAccount:<GSA_NAME>@<PROJECT_ID>.iam.gserviceaccount.com" \
+  --role "roles/pubsub.subscriber" \
+  --quiet
+```
+
+In all cases, the GSA must be bound to a KSA via Workload Identity (see setup
+above). The pod then uses the KSA to authenticate as the GSA.
diff --git a/skills/cloud/gke-basics/assets/default-deny-netpol.yaml b/skills/cloud/gke-security/assets/default-deny-netpol.yaml
similarity index 100%
rename from skills/cloud/gke-basics/assets/default-deny-netpol.yaml
rename to skills/cloud/gke-security/assets/default-deny-netpol.yaml
diff --git a/skills/cloud/gke-basics/assets/workload-identity-pod.yaml b/skills/cloud/gke-security/assets/workload-identity-pod.yaml
similarity index 100%
rename from skills/cloud/gke-basics/assets/workload-identity-pod.yaml
rename to skills/cloud/gke-security/assets/workload-identity-pod.yaml
diff --git a/skills/cloud/gke-storage/SKILL.md b/skills/cloud/gke-storage/SKILL.md
new file mode 100644
index 0000000000..d48a4ccce0
--- /dev/null
+++ b/skills/cloud/gke-storage/SKILL.md
@@ -0,0 +1,161 @@
+---
+name: gke-storage
+description: >-
+  Manages GKE storage, including PVCs, PersistentVolumes, Filestore, and GCS
+  FUSE. Use when configuring GKE storage, creating PVCs, or setting up GCS FUSE
+  on GKE. Don't use for database administration or replication strategies
+  outside volume provisioning context.
+---
+
+# GKE Storage
+
+This reference covers storage configuration for GKE clusters including
+persistent disks, file storage, and cloud storage integration.
+
+> **MCP Tools:** `apply_k8s_manifest`, `get_k8s_resource`,
+> `describe_k8s_resource`, `get_cluster`
+
+## Golden Path Storage Defaults
+
+The golden path Autopilot config enables these CSI drivers:
+
+| Driver          | Golden Path       | Access Mode     | Use Case             |
+| --------------- | ----------------- | --------------- | -------------------- |
+| Compute Engine  | Enabled (default) | ReadWriteOnce   | Block storage for    |
+: Persistent Disk :                   :                 : databases,           :
+: CSI             :                   :                 : single-pod workloads :
+| Google Cloud    | Enabled           | ReadWriteMany   | Shared NFS for       |
+: Filestore CSI   :                   :                 : multi-pod access     :
+| Cloud Storage   | Enabled           | ReadWriteMany / | Mount GCS buckets as |
+: FUSE CSI        :                   : ReadOnlyMany    : volumes              :
+| Parallelstore   | Enabled           | ReadWriteMany   | High-performance     |
+: CSI             :                   :                 : parallel file system :
+| Boot disk type  | `pd-balanced`     | N/A             | Node boot disks      |
+
+## StorageClasses
+
+### Default StorageClasses
+
+GKE provides built-in StorageClasses:
+
+StorageClass   | Disk Type             | Use Case
+-------------- | --------------------- | ------------------------------
+`standard-rwo` | `pd-balanced`         | Balanced cost and performance
+`premium-rwo`  | `pd-ssd`              | High IOPS, databases
+`standard-rwx` | Filestore (Basic HDD) | Shared NFS
+`premium-rwx`  | Filestore (Basic SSD) | Shared NFS, higher performance
+
+### Custom StorageClass
+
+```yaml
+apiVersion: storage.k8s.io/v1
+kind: StorageClass
+metadata:
+  name: fast-regional
+provisioner: pd.csi.storage.gke.io
+parameters:
+  type: pd-ssd
+  replication-type: regional-pd    # Replicate across 2 zones
+volumeBindingMode: WaitForFirstConsumer
+allowVolumeExpansion: true         # Always enable for production
+```
+
+## PersistentVolumeClaims
+
+### Block Storage (ReadWriteOnce)
+
+```yaml
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: database-pvc
+spec:
+  accessModes:
+  - ReadWriteOnce
+  storageClassName: premium-rwo
+  resources:
+    requests:
+      storage: 100Gi
+```
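+
+The built-in `*-rwo` classes bind volumes on first use, so the disk is
+typically provisioned only once a pod references the claim. A sketch of a
+consuming pod, where the pod name, image placeholder, and mount path are
+assumptions:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: database               # hypothetical name
+spec:
+  containers:
+  - name: db
+    image: <IMAGE>
+    volumeMounts:
+    - name: data
+      mountPath: /var/lib/data # hypothetical mount path
+  volumes:
+  - name: data
+    persistentVolumeClaim:
+      claimName: database-pvc  # the PVC defined above
+```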
+
+### Shared File Storage (ReadWriteMany via Filestore)
+
+```yaml
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: shared-data
+spec:
+  accessModes:
+  - ReadWriteMany
+  storageClassName: standard-rwx
+  resources:
+    requests:
+      storage: 1Ti    # Filestore Basic HDD minimum is 1 TiB
+```
+
+### GCS Bucket Mount (Cloud Storage FUSE)
+
+Mount a GCS bucket as a volume without a PVC:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: gcs-reader
+  annotations:
+    gke-gcsfuse/volumes: "true"
+spec:
+  serviceAccountName: <KSA_NAME>   # KSA with read access to the bucket
+  containers:
+  - name: reader
+    image: busybox
+    command: ["ls", "/data"]
+    volumeMounts:
+    - name: gcs-bucket
+      mountPath: /data
+  volumes:
+  - name: gcs-bucket
+    csi:
+      driver: gcsfuse.csi.storage.gke.io
+      readOnly: true
+      volumeAttributes:
+        bucketName: <BUCKET_NAME>
+```
+
+> Requires Workload Identity: the pod's Kubernetes ServiceAccount must be
+> granted `roles/storage.objectViewer` on the bucket.
+
+## Volume Expansion
+
+If `allowVolumeExpansion: true` is set on the StorageClass, resize by updating
+the PVC:
+
+```bash
+# kubectl
+kubectl patch pvc <PVC_NAME> -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'
+```
+
+```
+# MCP (preferred)
+patch_k8s_resource(parent="...", resourceType="persistentvolumeclaim", name="<PVC_NAME>",
+  patch='{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}')
+```
+
+The CSI driver grows the underlying disk, and Kubernetes then resizes the
+filesystem automatically. Volumes can only be expanded, not shrunk.
+
+## Best Practices
+
+1.  **Always enable volume expansion**: Set `allowVolumeExpansion: true` on all
+    StorageClasses
+2.  **Use regional PDs for production**: `replication-type: regional-pd`
+    replicates across 2 zones for HA
+3.  **Use `WaitForFirstConsumer`**: Ensures the PV is provisioned in the same
+    zone as the pod
+4.  **Choose the right disk type**: `pd-ssd` for databases, `pd-balanced`
+    (golden path default) for general use, `pd-standard` for cold storage
+5.  **Use Filestore for shared access**: When multiple pods need to read/write
+    the same files
+6.  **Use GCS FUSE for data pipelines**: Mount buckets directly for ML training
+    data, logs, etc.
+7.  **Back up PVCs**: Use Backup for GKE (see
+    [gke-backup-dr.md](../gke-backup-dr/SKILL.md)) to protect persistent data
diff --git a/skills/cloud/gke-upgrades/SKILL.md b/skills/cloud/gke-upgrades/SKILL.md
new file mode 100644
index 0000000000..ac826fb0de
--- /dev/null
+++ b/skills/cloud/gke-upgrades/SKILL.md
@@ -0,0 +1,172 @@
+---
+name: gke-upgrades
+description: >-
+  Manages GKE upgrades, maintenance windows, and release channels. Use when
+  upgrading GKE clusters, configuring maintenance windows, or setting release
+  channels. Don't use for general cluster provisioning or resizing (use
+  gke-cluster-creation or gke-scaling instead).
+---
+
+# GKE Upgrades & Maintenance
+
+This reference covers upgrade strategy, maintenance windows, and release channel
+management for GKE clusters.
+
+> **MCP Tools:** `get_cluster`, `get_k8s_version`, `update_cluster`,
+> `update_node_pool`, `list_operations`, `get_operation`, `cancel_operation`,
+> `get_k8s_resource`
+>
+> **CLI-only:** `gcloud container get-server-config` (available versions),
+> `gcloud container clusters update --maintenance-window-*` (maintenance
+> windows)
+
+## Golden Path Upgrade Defaults
+
+| Setting                    | Golden Path Value      | Notes                  |
+| -------------------------- | ---------------------- | ---------------------- |
+| `releaseChannel.channel`   | `REGULAR`              | Balanced between       |
+:                            :                        : freshness and          :
+:                            :                        : stability              :
+| Maintenance exclusion      | `NO_MINOR_UPGRADES`, 1 | Prevents surprise      |
+:                            : year                   : minor version bumps    :
+| `upgradeSettings.strategy` | `SURGE`                | Rolling upgrades with  |
+:                            :                        : `maxSurge\: 1`         :
+| Auto-repair                | `true`                 | Unhealthy nodes are    |
+:                            :                        : automatically replaced :
+| Auto-upgrade               | `true`                 | Nodes follow control   |
+:                            :                        : plane version          :
+
+## Release Channels
+
+| Channel                 | Cadence                | Best For                  |
+| ----------------------- | ---------------------- | ------------------------- |
+| `RAPID`                 | Weeks after release    | Dev/test, early access to |
+:                         :                        : features                  :
+| `REGULAR` (golden path) | 2-3 months after Rapid | Production workloads      |
+| `STABLE`                | 2-3 months after       | Risk-averse, highly       |
+:                         : Regular                : regulated                 :
+
+```bash
+# Check current channel
+gcloud container clusters describe <CLUSTER_NAME> --region <REGION> \
+  --format="value(releaseChannel.channel)" \
+  --quiet
+
+# Change channel (Day-1)
+gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
+  --release-channel <CHANNEL> \
+  --quiet
+```
+
+## Maintenance Windows
+
+Control when GKE can perform automatic maintenance (upgrades, patches).
+
+```bash
+# Set maintenance window (e.g., weekends 02:00-10:00 UTC). GKE requires at
+# least 48 hours of availability per 32 days, in windows of 4+ hours.
+gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
+  --maintenance-window-start "2026-01-01T02:00:00Z" \
+  --maintenance-window-end "2026-01-01T10:00:00Z" \
+  --maintenance-window-recurrence "FREQ=WEEKLY;BYDAY=SA,SU" \
+  --quiet
+```
+
+### Maintenance Exclusions
+
+The golden path includes a 1-year `NO_MINOR_UPGRADES` exclusion to prevent
+automatic minor version changes.
+
+```bash
+# Add maintenance exclusion
+gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
+  --add-maintenance-exclusion-name "freeze-1" \
+  --add-maintenance-exclusion-start "2026-04-11T00:00:00Z" \
+  --add-maintenance-exclusion-end "2027-04-11T00:00:00Z" \
+  --add-maintenance-exclusion-scope no_minor_upgrades \
+  --quiet
+
+# Remove exclusion
+gcloud container clusters update <CLUSTER_NAME> --region <REGION> \
+  --remove-maintenance-exclusion "freeze-1" \
+  --quiet
+```
+
+**Exclusion scopes:**
+
+-   `NO_UPGRADES` — blocks all upgrades (max 30 days)
+-   `NO_MINOR_UPGRADES` — allows patch upgrades, blocks minor version changes
+    (max 1 year)
+-   `NO_MINOR_OR_NODE_UPGRADES` — blocks minor and node upgrades (max 1 year)
+
+## Upgrade Strategy
+
+### SURGE (Golden Path)
+
+Rolling upgrade with configurable surge capacity:
+
+```bash
+# Default: maxSurge=1 (one extra node during upgrade)
+gcloud container node-pools update <POOL_NAME> \
+  --cluster <CLUSTER_NAME> --region <REGION> \
+  --max-surge-upgrade 1 --max-unavailable-upgrade 0 \
+  --quiet
+```
+
+### Blue-Green (For Zero-Downtime Critical Workloads)
+
+```bash
+gcloud container node-pools update <POOL_NAME> \
+  --cluster <CLUSTER_NAME> --region <REGION> \
+  --enable-blue-green-upgrade \
+  --node-pool-soak-duration "3600s" \
+  --quiet
+```
+
+## Pre-Upgrade Checklist
+
+1.  **Check deprecations**: Review Kubernetes API deprecations between current
+    and target version
+2.  **Review PDBs**: Ensure all production workloads have PodDisruptionBudgets
+3.  **Test in non-prod**: Upgrade a staging cluster first
+4.  **Check addon compatibility**: Verify third-party controllers support the
+    target version
+5.  **Review node pool versions**: All node pools should be within 2 minor
+    versions of the control plane
+
+```bash
+# Check current versions
+gcloud container clusters describe <CLUSTER_NAME> --region <REGION> \
+  --format="table(currentMasterVersion, nodePools[].version)" \
+  --quiet
+
+# Check available upgrades
+gcloud container get-server-config --region <REGION> \
+  --format="yaml(channels)" \
+  --quiet
+
+# List deprecation warnings
+kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis
+```
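+
+For checklist item 2, a PodDisruptionBudget along these lines keeps a minimum
+number of replicas running while nodes drain. The name, label, and
+`minAvailable` value are illustrative assumptions:
+
+```yaml
+apiVersion: policy/v1
+kind: PodDisruptionBudget
+metadata:
+  name: web-pdb                # hypothetical name
+spec:
+  minAvailable: 2              # keep below the replica count, or drains stall
+  selector:
+    matchLabels:
+      app: web                 # hypothetical label on the workload's pods
+```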
+
+## Manual Upgrade (When Needed)
+
+```bash
+# Upgrade control plane
+gcloud container clusters upgrade <CLUSTER_NAME> --region <REGION> \
+  --master --cluster-version <VERSION> \
+  --quiet
+
+# Upgrade node pool
+gcloud container clusters upgrade <CLUSTER_NAME> --region <REGION> \
+  --node-pool <POOL_NAME> \
+  --quiet
+```
+
+## Best Practices
+
+1.  **Stay on a release channel**: Manual version management is error-prone. Let
+    GKE manage versions.
+2.  **Use maintenance windows**: Schedule upgrades during low-traffic periods.
+3.  **Set PDBs on everything**: Protects workloads during node drains.
+4.  **Monitor during upgrades**: Watch for pod eviction failures,
+    CrashLoopBackOff, and scheduling issues.
+5.  **Don't skip minor versions**: Upgrade incrementally (1.28 -> 1.29 -> 1.30,
+    not 1.28 -> 1.30).