diff --git a/skills/cloud/gke-basics/references/gke-app-onboarding.md b/skills/cloud/gke-app-onboarding/SKILL.md similarity index 57% rename from skills/cloud/gke-basics/references/gke-app-onboarding.md rename to skills/cloud/gke-app-onboarding/SKILL.md index ef6ebbfca9..40f8bd766f 100644 --- a/skills/cloud/gke-basics/references/gke-app-onboarding.md +++ b/skills/cloud/gke-app-onboarding/SKILL.md @@ -1,8 +1,20 @@ +--- +name: gke-app-onboarding +description: >- + Onboards applications to GKE, covering containerization, deployment + manifests, and migration. Use when onboarding or deploying an application to + GKE for the first time, or containerizing an app for GKE. Don't use for + general GKE cluster administration or upgrades (use gke-basics or + gke-upgrades instead). +--- + # GKE App Onboarding -This reference provides workflows for containerizing and deploying applications to GKE for the first time. +This reference provides workflows for containerizing and deploying applications +to GKE for the first time. -> **MCP Tools:** `apply_k8s_manifest`, `get_k8s_resource`, `get_k8s_rollout_status`, `get_k8s_logs`, `describe_k8s_resource` +> **MCP Tools:** `apply_k8s_manifest`, `get_k8s_resource`, +> `get_k8s_rollout_status`, `get_k8s_logs`, `describe_k8s_resource` ## Workflow @@ -10,12 +22,13 @@ This reference provides workflows for containerizing and deploying applications Before containerizing, assess the application: -- **Language & Framework**: Identify the tech stack -- **Dependencies**: List required libraries and external services -- **Configuration**: How is the app configured? (env vars, config files, secrets) -- **Statefulness**: Does it need persistent storage? (databases, file storage) -- **Networking**: Port mapping and protocol (HTTP, gRPC, TCP) -- **Health endpoints**: Does the app expose health check endpoints? 
+- **Language & Framework**: Identify the tech stack +- **Dependencies**: List required libraries and external services +- **Configuration**: How is the app configured? (env vars, config files, + secrets) +- **Statefulness**: Does it need persistent storage? (databases, file storage) +- **Networking**: Port mapping and protocol (HTTP, gRPC, TCP) +- **Health endpoints**: Does the app expose health check endpoints? ### 2. Containerization @@ -38,14 +51,18 @@ ENTRYPOINT ["/server"] ``` **Best practices:** -- Use multi-stage builds to keep production images small -- Use distroless or minimal base images to reduce attack surface -- Run as non-root user -- Log to `stdout` and `stderr` for Cloud Logging collection + +- Use multi-stage builds to keep production images small +- Use distroless or minimal base images to reduce attack surface +- Run as non-root user +- Log to `stdout` and `stderr` for Cloud Logging collection **Alternatives:** -- **Cloud Native Buildpacks** — auto-detect language and build without a Dockerfile: `pack build --builder gcr.io/buildpacks/builder:latest` -- **Skaffold** — development workflow tool for iterating on containerized apps: `skaffold dev` + +- **Cloud Native Buildpacks** — auto-detect language and build without a + Dockerfile: `pack build --builder gcr.io/buildpacks/builder:latest` +- **Skaffold** — development workflow tool for iterating on containerized + apps: `skaffold dev` ### 3. Image Management @@ -60,7 +77,8 @@ docker build -t -docker.pkg.dev///: . docker push -docker.pkg.dev///: ``` -**Vulnerability scanning**: Enable automatic scanning in Artifact Registry to detect issues in base images and dependencies. +**Vulnerability scanning**: Enable automatic scanning in Artifact Registry to +detect issues in base images and dependencies. 
```bash # Check scan results @@ -127,10 +145,12 @@ spec: ``` **Checklist for manifests:** -- Resource requests and limits set -- Liveness and readiness probes configured -- At least 2 replicas for production -- Service type appropriate (ClusterIP for internal, use Gateway API for external) + +- Resource requests and limits set +- Liveness and readiness probes configured +- At least 2 replicas for production +- Service type appropriate (ClusterIP for internal, use Gateway API for + external) ### 5. Deploy @@ -154,7 +174,10 @@ kubectl get pods -l app=my-app ## Next Steps Once the application is running on GKE: -- Configure autoscaling — see [gke-scaling.md](./gke-scaling.md) -- Set up observability — see [gke-observability.md](./gke-observability.md) -- Harden security — see [gke-security.md](./gke-security.md) -- Configure reliability (PDBs, topology spread) — see [gke-reliability.md](./gke-reliability.md) + +- Configure autoscaling — see [gke-scaling.md](../gke-scaling/SKILL.md) +- Set up observability — see + [gke-observability.md](../gke-observability/SKILL.md) +- Harden security — see [gke-security.md](../gke-security/SKILL.md) +- Configure reliability (PDBs, topology spread) — see + [gke-reliability.md](../gke-reliability/SKILL.md) diff --git a/skills/cloud/gke-basics/references/gke-backup-dr.md b/skills/cloud/gke-backup-dr/SKILL.md similarity index 61% rename from skills/cloud/gke-basics/references/gke-backup-dr.md rename to skills/cloud/gke-backup-dr/SKILL.md index eb7859d278..17d06e3bc5 100644 --- a/skills/cloud/gke-basics/references/gke-backup-dr.md +++ b/skills/cloud/gke-backup-dr/SKILL.md @@ -1,8 +1,19 @@ +--- +name: gke-backup-dr +description: >- + Configures Backup for GKE and disaster recovery plans. Use when configuring + GKE backup policies, setting up disaster recovery, or restoring GKE clusters. + Don't use for generic database backups or persistent volume configuration + (use gke-storage instead). 
+--- + # GKE Backup & Disaster Recovery -This reference provides workflows for protecting stateful workloads on GKE using Backup for GKE. +This reference provides workflows for protecting stateful workloads on GKE using +Backup for GKE. -> **MCP Tools:** `get_cluster`, `update_cluster`. **CLI-only:** `gcloud container backup-restore *` +> **MCP Tools:** `get_cluster`, `update_cluster`. **CLI-only:** `gcloud +> container backup-restore *` ## Workflows @@ -38,9 +49,11 @@ gcloud container backup-restore backup-plans create \ ``` **Options:** -- `--all-namespaces` — back up everything -- `--included-namespaces=,` — back up specific namespaces -- `--backup-encryption-key=` — encrypt with Customer-Managed Encryption Key (CMEK) + +- `--all-namespaces` — back up everything +- `--included-namespaces=,` — back up specific namespaces +- `--backup-encryption-key=` — encrypt with Customer-Managed Encryption + Key (CMEK) ### 3. Create a Manual Backup @@ -79,8 +92,11 @@ gcloud container backup-restore restores create \ ## Best Practices -1. **Automate backups**: Always use a cron schedule for production workloads -2. **Test restores regularly**: Restore to a separate namespace or cluster to verify data integrity -3. **Cross-region DR**: Store backups in a different region or configure cross-region restore plans -4. **Encrypt backups**: Use CMEK for compliance and security requirements -5. **Scope backups**: Back up specific namespaces rather than the entire cluster when possible to reduce restore complexity +1. **Automate backups**: Always use a cron schedule for production workloads +2. **Test restores regularly**: Restore to a separate namespace or cluster to + verify data integrity +3. **Cross-region DR**: Store backups in a different region or configure + cross-region restore plans +4. **Encrypt backups**: Use CMEK for compliance and security requirements +5. 
**Scope backups**: Back up specific namespaces rather than the entire + cluster when possible to reduce restore complexity diff --git a/skills/cloud/gke-basics/SKILL.md b/skills/cloud/gke-basics/SKILL.md index fe2119da6a..43cc83923b 100644 --- a/skills/cloud/gke-basics/SKILL.md +++ b/skills/cloud/gke-basics/SKILL.md @@ -1,11 +1,22 @@ --- name: gke-basics -description: "Plan, create, and configure production-ready Google Kubernetes Engine (GKE) clusters using the golden path Autopilot configuration. Covers Day-0 checklist, Autopilot vs Standard, networking (private clusters, VPC-native, Gateway API), security (Workload Identity, Secret Manager, RBAC hardening), observability, scaling, cost optimization, and AI/ML inference. WHEN: create GKE cluster, provision GKE environment, design GKE networking, secure GKE, optimize GKE cost, GKE autoscaling, GKE inference, GKE upgrade, GKE observability, GKE multi-tenancy, GKE batch, GKE HPC, GKE compute class." +description: >- + Plans, creates, and configures production-ready GKE clusters using the golden + path Autopilot configuration. Covers Day-0 checklist, Autopilot vs Standard, + networking, security, observability, scaling, cost optimization, and AI/ML + inference. Use when creating GKE clusters, provisioning GKE environments, + designing GKE networking, securing GKE, optimizing GKE cost, autoscaling, or + upgrading. Don't use if specialized skills for security, networking, scaling, + cost, storage, or upgrades are more applicable (use gke-security, + gke-networking, gke-scaling, gke-cost, gke-storage, or gke-upgrades instead). --- # Google Kubernetes Engine (GKE) Basics -GKE is a managed Kubernetes platform on Google Cloud for deploying, scaling, and operating containerized applications. This skill defaults to the **golden path Autopilot configuration** — see [gke-golden-path.md](./references/gke-golden-path.md) for defaults, rules, and guardrails. 
+GKE is a managed Kubernetes platform on Google Cloud for deploying, scaling, and +operating containerized applications. This skill defaults to the **golden path +Autopilot configuration** — see [gke-golden-path](../gke-golden-path/SKILL.md) +for defaults, rules, and guardrails. ## Quick Start @@ -19,31 +30,35 @@ kubectl create deployment hello-server \ ## Reference Directory -Load the relevant reference based on trigger keywords. Prefer the most specific match; if ambiguous, ask the user to clarify. - -| Scenario | Trigger Keywords | Reference | -|----------|-----------------|-----------| -| Core Concepts | Autopilot vs Standard, architecture, pricing, what is GKE | [core-concepts.md](./references/core-concepts.md) | -| Golden Path & Defaults | golden path, Day-0 checklist, production defaults, cluster defaults | [gke-golden-path.md](./references/gke-golden-path.md) | -| Cluster Creation | create cluster, new cluster, provision GKE | [gke-cluster-creation.md](./references/gke-cluster-creation.md) | -| Networking | private cluster, VPC, subnet, Gateway API, DNS, ingress, egress, datapath | [gke-networking.md](./references/gke-networking.md) | -| Security & IAM | Workload Identity, Secret Manager, RBAC, Binary Auth, hardening, audit, gVisor, IAM roles | [gke-security.md](./references/gke-security.md) | -| Scaling | HPA, VPA, autoscaler, autoscaling, NAP, scale pods, scale nodes | [gke-scaling.md](./references/gke-scaling.md) | -| Compute Classes | ComputeClass, machine family, Spot fallback, GPU node pool, node selection | [gke-compute-classes.md](./references/gke-compute-classes.md) | -| Cost | cost, savings, Spot VMs, rightsizing, CUD, optimize spend, budget | [gke-cost.md](./references/gke-cost.md) | -| AI/ML Inference | inference, model serving, LLM, GPU, TPU, GIQ, vLLM | [gke-inference.md](./references/gke-inference.md) | -| Upgrades | upgrade, maintenance window, release channel, patching, version | [gke-upgrades.md](./references/gke-upgrades.md) | -| 
Observability | monitoring, logging, Prometheus, Grafana, metrics, alerts, dashboards | [gke-observability.md](./references/gke-observability.md) | -| Multi-tenancy | multi-tenant, namespace isolation, team access, enterprise, RBAC planning | [gke-multitenancy.md](./references/gke-multitenancy.md) | -| Batch & HPC | batch, HPC, job queue, high performance, MPI, parallel | [gke-batch-hpc.md](./references/gke-batch-hpc.md) | -| App Onboarding | containerize, deploy app, Dockerfile, onboard, migrate to GKE | [gke-app-onboarding.md](./references/gke-app-onboarding.md) | -| Backup & DR | backup, restore, disaster recovery, CMEK | [gke-backup-dr.md](./references/gke-backup-dr.md) | -| Storage | storage, PVC, persistent volume, StorageClass, Filestore, GCS FUSE | [gke-storage.md](./references/gke-storage.md) | -| Reliability | PDB, health probe, liveness, readiness, topology spread, graceful shutdown | [gke-reliability.md](./references/gke-reliability.md) | -| Client Libraries | client library, client-go, kubernetes python, kubernetes java, kubernetes SDK | [client-library-usage.md](./references/client-library-usage.md) | -| Infrastructure as Code | Terraform, IaC, HCL, infrastructure as code | [iac-usage.md](./references/iac-usage.md) | -| MCP Server | MCP tools, MCP server, MCP setup | [mcp-usage.md](./references/mcp-usage.md) | -| CLI / Tools | gcloud, kubectl, commands, how to | [cli-reference.md](./references/cli-reference.md) | -| Production Audit | production readiness, compliance, golden path check | [gke-cluster-creation.md](./references/gke-cluster-creation.md) | +Load the relevant reference based on trigger keywords. Prefer the most specific +match; if ambiguous, ask the user to clarify. If a referenced sibling skill +(pointing to `..`) is not installed or cannot be accessed, inform the user that +they may need to install that specific skill (e.g., `gke-networking`), and fall +back to your general GKE knowledge. 
+ +Scenario | Trigger Keywords | Reference +---------------------- | ----------------------------------------------------------------------------------------- | --------- +Core Concepts | Autopilot vs Standard, architecture, pricing, what is GKE | [core-concepts.md](./references/core-concepts.md) +Golden Path & Defaults | golden path, Day-0 checklist, production defaults, cluster defaults | [gke-golden-path](../gke-golden-path/SKILL.md) +Cluster Creation | create cluster, new cluster, provision GKE | [gke-cluster-creation](../gke-cluster-creation/SKILL.md) +Networking | private cluster, VPC, subnet, Gateway API, DNS, ingress, egress, datapath | [gke-networking](../gke-networking/SKILL.md) +Security & IAM | Workload Identity, Secret Manager, RBAC, Binary Auth, hardening, audit, gVisor, IAM roles | [gke-security](../gke-security/SKILL.md) +Scaling | HPA, VPA, autoscaler, autoscaling, NAP, scale pods, scale nodes | [gke-scaling](../gke-scaling/SKILL.md) +Compute Classes | ComputeClass, machine family, Spot fallback, GPU node pool, node selection | [gke-compute-classes](../gke-compute-classes/SKILL.md) +Cost | cost, savings, Spot VMs, rightsizing, CUD, optimize spend, budget | [gke-cost](../gke-cost/SKILL.md) +AI/ML Inference | inference, model serving, LLM, GPU, TPU, GIQ, vLLM | [gke-inference](../gke-inference/SKILL.md) +Upgrades | upgrade, maintenance window, release channel, patching, version | [gke-upgrades](../gke-upgrades/SKILL.md) +Observability | monitoring, logging, Prometheus, Grafana, metrics, alerts, dashboards | [gke-observability](../gke-observability/SKILL.md) +Multi-tenancy | multi-tenant, namespace isolation, team access, enterprise, RBAC planning | [gke-multitenancy](../gke-multitenancy/SKILL.md) +Batch & HPC | batch, HPC, job queue, high performance, MPI, parallel | [gke-batch-hpc](../gke-batch-hpc/SKILL.md) +App Onboarding | containerize, deploy app, Dockerfile, onboard, migrate to GKE | [gke-app-onboarding](../gke-app-onboarding/SKILL.md) +Backup 
& DR | backup, restore, disaster recovery, CMEK | [gke-backup-dr](../gke-backup-dr/SKILL.md) +Storage | storage, PVC, persistent volume, StorageClass, Filestore, GCS FUSE | [gke-storage](../gke-storage/SKILL.md) +Reliability | PDB, health probe, liveness, readiness, topology spread, graceful shutdown | [gke-reliability](../gke-reliability/SKILL.md) +Client Libraries | client library, client-go, kubernetes python, kubernetes java, kubernetes SDK | [client-library-usage.md](./references/client-library-usage.md) +Infrastructure as Code | Terraform, IaC, HCL, infrastructure as code | [iac-usage.md](./references/iac-usage.md) +MCP Server | MCP tools, MCP server, MCP setup | [mcp-usage.md](./references/mcp-usage.md) +CLI / Tools | gcloud, kubectl, commands, how to | [cli-reference.md](./references/cli-reference.md) +Production Audit | production readiness, compliance, golden path check | [gke-cluster-creation](../gke-cluster-creation/SKILL.md) *If you need product information not found in these references, use the Developer Knowledge MCP server `search_documents` tool.* diff --git a/skills/cloud/gke-basics/references/cli-reference.md b/skills/cloud/gke-basics/references/cli-reference.md index 5b6b77db9f..6d29016b91 100644 --- a/skills/cloud/gke-basics/references/cli-reference.md +++ b/skills/cloud/gke-basics/references/cli-reference.md @@ -12,37 +12,70 @@ Default preference order: ### When to use each -| Interface | When to Use | Examples | -|-----------|-------------|---------| -| **GKE MCP Tools** | Default for all cluster and K8s operations when MCP server is available. Structured I/O, supports dry-run, no shell/kubeconfig needed. | `create_cluster`, `get_cluster`, `get_k8s_resource`, `apply_k8s_manifest`, `get_k8s_logs` | -| **`gcloud` CLI** | No MCP equivalent, or user explicitly requested CLI. Required for: GIQ model discovery, available K8s versions, maintenance windows, monitoring components, IAM/SA setup, Cloud Logging queries. 
| `gcloud container ai profiles`, `gcloud container get-server-config`, `gcloud iam service-accounts` | -| **`kubectl`** | Neither MCP nor `gcloud` covers the operation, or user explicitly prefers kubectl. Required for: `kubectl top`, `kubectl scale`, `kubectl exec`, `kubectl port-forward`, Helm, custom CRDs not in MCP. | `kubectl top pods`, `kubectl scale deployment`, `helm install` | +| Interface | When to Use | Examples | +| ----------------- | -------------------------- | --------------------------- | +| **GKE MCP Tools** | Default for all cluster | `create_cluster`, | +: : and K8s operations when : `get_cluster`, : +: : MCP server is available. : `get_k8s_resource`, : +: : Structured I/O, supports : `apply_k8s_manifest`, : +: : dry-run, no : `get_k8s_logs` : +: : shell/kubeconfig needed. : : +| **`gcloud` CLI** | No MCP equivalent, or user | `gcloud container ai | +: : explicitly requested CLI. : profiles`, `gcloud : +: : Required for\: GIQ model : container : +: : discovery, available K8s : get-server-config`, `gcloud : +: : versions, maintenance : iam service-accounts` : +: : windows, monitoring : : +: : components, IAM/SA setup, : : +: : Cloud Logging queries. : : +| **`kubectl`** | Neither MCP nor `gcloud` | `kubectl top pods`, | +: : covers the operation, or : `kubectl scale deployment`, : +: : user explicitly prefers : `helm install` : +: : kubectl. Required for\: : : +: : `kubectl top`, `kubectl : : +: : scale`, `kubectl exec`, : : +: : `kubectl port-forward`, : : +: : Helm, custom CRDs not in : : +: : MCP. : : ### User preference override If the user states a preference, respect it for the session: -- **"Use gcloud" / "Use CLI"** → `gcloud` for cluster ops, `kubectl` for K8s resource ops. Skip MCP. -- **"Use kubectl"** → `kubectl` for all K8s resource ops, `gcloud` for cluster-level ops. Skip MCP. -- **"Use MCP"** / no preference → Default. Use MCP for everything it supports. 
+- **"Use gcloud" / "Use CLI"** → `gcloud` for cluster ops, `kubectl` for K8s + resource ops. Skip MCP. +- **"Use kubectl"** → `kubectl` for all K8s resource ops, `gcloud` for + cluster-level ops. Skip MCP. +- **"Use MCP"** / no preference → Default. Use MCP for everything it supports. -Even with an override, fall back through the chain for unsupported operations (e.g., cluster creation always requires `gcloud` or MCP). +Even with an override, fall back through the chain for unsupported operations +(e.g., cluster creation always requires `gcloud` or MCP). ---- +-------------------------------------------------------------------------------- -> All MCP tools use hierarchical resource paths — see [`parent` format](#parent--name-format-quick-reference) at the bottom. +> All MCP tools use hierarchical resource paths — see +> [`parent` format](#parent--name-format-quick-reference) at the bottom. ## Cluster Operations -| Operation | MCP Tool | CLI Fallback | Mode | -|-----------|----------|-------------|------| -| List clusters | `list_clusters` | `gcloud container clusters list` | READ | -| Get cluster details | `get_cluster` | `gcloud container clusters describe` | READ | -| Create cluster | `create_cluster` | `gcloud container clusters create-auto` | MUTATE | -| Update cluster | `update_cluster` | `gcloud container clusters update` | DESTRUCTIVE | -| Get K8s versions | — | `gcloud container get-server-config` | READ | -| Get credentials | — | `gcloud container clusters get-credentials` | READ | -| Delete cluster | — | `gcloud container clusters delete` | DESTRUCTIVE | +| Operation | MCP Tool | CLI Fallback | Mode | +| ------------------- | ---------------- | ------------------ | ----------- | +| List clusters | `list_clusters` | `gcloud container | READ | +: : : clusters list` : : +| Get cluster details | `get_cluster` | `gcloud container | READ | +: : : clusters describe` : : +| Create cluster | `create_cluster` | `gcloud container | MUTATE | +: : : clusters : : +: : 
: create-auto` : : +| Update cluster | `update_cluster` | `gcloud container | DESTRUCTIVE | +: : : clusters update` : : +| Get K8s versions | — | `gcloud container | READ | +: : : get-server-config` : : +| Get credentials | — | `gcloud container | READ | +: : : clusters : : +: : : get-credentials` : : +| Delete cluster | — | `gcloud container | DESTRUCTIVE | +: : : clusters delete` : : ``` # List clusters in a project (all regions) @@ -76,12 +109,16 @@ gcloud container clusters get-credentials --region --pro ## Node Pool Operations -| Operation | MCP Tool | CLI Fallback | Mode | -|-----------|----------|-------------|------| -| List node pools | `list_node_pools` | `gcloud container node-pools list` | READ | -| Get node pool | `get_node_pool` | `gcloud container node-pools describe` | READ | -| Create node pool | `create_node_pool` | `gcloud container node-pools create` | MUTATE | -| Update node pool | `update_node_pool` | `gcloud container node-pools update` | DESTRUCTIVE | +| Operation | MCP Tool | CLI Fallback | Mode | +| ---------------- | ------------------ | -------------------- | ----------- | +| List node pools | `list_node_pools` | `gcloud container | READ | +: : : node-pools list` : : +| Get node pool | `get_node_pool` | `gcloud container | READ | +: : : node-pools describe` : : +| Create node pool | `create_node_pool` | `gcloud container | MUTATE | +: : : node-pools create` : : +| Update node pool | `update_node_pool` | `gcloud container | DESTRUCTIVE | +: : : node-pools update` : : ``` list_node_pools(parent="projects//locations//clusters/") @@ -94,11 +131,16 @@ create_node_pool( ## Cluster Updates -| Operation | MCP Tool | CLI Fallback | Mode | -|-----------|----------|-------------|------| -| Update cluster settings | `update_cluster` | `gcloud container clusters update` | DESTRUCTIVE | -| Update monitoring | — | `gcloud container clusters update --monitoring=...` | DESTRUCTIVE | -| Set maintenance window | — | `gcloud container clusters update 
--maintenance-window-*` | DESTRUCTIVE | +| Operation | MCP Tool | CLI Fallback | Mode | +| ----------------- | ---------------- | ----------------------- | ----------- | +| Update cluster | `update_cluster` | `gcloud container | DESTRUCTIVE | +: settings : : clusters update` : : +| Update monitoring | — | `gcloud container | DESTRUCTIVE | +: : : clusters update : : +: : : --monitoring=...` : : +| Set maintenance | — | `gcloud container | DESTRUCTIVE | +: window : : clusters update : : +: : : --maintenance-window-*` : : ``` # Enable VPA via MCP @@ -117,15 +159,19 @@ gcloud container clusters update --region \ ## Kubernetes Resource Operations -| Operation | MCP Tool | CLI Fallback | Mode | -|-----------|----------|-------------|------| -| Get/list resources | `get_k8s_resource` | `kubectl get` | READ | -| Describe resource | `describe_k8s_resource` | `kubectl describe` | READ | -| Apply manifest | `apply_k8s_manifest` | `kubectl apply` | DESTRUCTIVE | -| Patch resource | `patch_k8s_resource` | `kubectl patch` | DESTRUCTIVE | -| Delete resource | `delete_k8s_resource` | `kubectl delete` | DESTRUCTIVE | -| List API resources | `list_k8s_api_resources` | `kubectl api-resources` | READ | -| Check auth | `check_k8s_auth` | `kubectl auth can-i` | READ | +| Operation | MCP Tool | CLI Fallback | Mode | +| --------------- | ------------------------ | ---------------- | ----------- | +| Get/list | `get_k8s_resource` | `kubectl get` | READ | +: resources : : : : +| Describe | `describe_k8s_resource` | `kubectl | READ | +: resource : : describe` : : +| Apply manifest | `apply_k8s_manifest` | `kubectl apply` | DESTRUCTIVE | +| Patch resource | `patch_k8s_resource` | `kubectl patch` | DESTRUCTIVE | +| Delete resource | `delete_k8s_resource` | `kubectl delete` | DESTRUCTIVE | +| List API | `list_k8s_api_resources` | `kubectl | READ | +: resources : : api-resources` : : +| Check auth | `check_k8s_auth` | `kubectl auth | READ | +: : : can-i` : : ``` # List all deployments in a 
namespace @@ -150,14 +196,14 @@ check_k8s_auth(parent="...", verb="create", resourceType="deployments", namespac ## Diagnostics & Observability -| Operation | MCP Tool | CLI Fallback | Mode | -|-----------|----------|-------------|------| -| List events | `list_k8s_events` | `kubectl events` | READ | -| Get container logs | `get_k8s_logs` | `kubectl logs` | READ | -| Cluster info | `get_k8s_cluster_info` | `kubectl cluster-info` | READ | -| K8s version | `get_k8s_version` | `kubectl version` | READ | -| Rollout status | `get_k8s_rollout_status` | `kubectl rollout status` | READ | -| Query Cloud Logging | — | `gcloud logging read` | READ | +Operation | MCP Tool | CLI Fallback | Mode +------------------- | ------------------------ | ------------------------ | ---- +List events | `list_k8s_events` | `kubectl events` | READ +Get container logs | `get_k8s_logs` | `kubectl logs` | READ +Cluster info | `get_k8s_cluster_info` | `kubectl cluster-info` | READ +K8s version | `get_k8s_version` | `kubectl version` | READ +Rollout status | `get_k8s_rollout_status` | `kubectl rollout status` | READ +Query Cloud Logging | — | `gcloud logging read` | READ ``` # Get recent events across all namespaces @@ -173,11 +219,14 @@ get_k8s_rollout_status(parent="...", resourceType="deployment", name="", ## Operations Tracking -| Operation | MCP Tool | CLI Fallback | Mode | -|-----------|----------|-------------|------| -| List operations | `list_operations` | `gcloud container operations list` | READ | -| Get operation | `get_operation` | `gcloud container operations describe` | READ | -| Cancel operation | `cancel_operation` | `gcloud container operations cancel` | DESTRUCTIVE | +| Operation | MCP Tool | CLI Fallback | Mode | +| ---------------- | ------------------ | -------------------- | ----------- | +| List operations | `list_operations` | `gcloud container | READ | +: : : operations list` : : +| Get operation | `get_operation` | `gcloud container | READ | +: : : operations describe` : 
: +| Cancel operation | `cancel_operation` | `gcloud container | DESTRUCTIVE | +: : : operations cancel` : : ``` list_operations(parent="projects//locations/") @@ -228,12 +277,12 @@ Use `locations/-` to match all regions/zones when listing. ## Error Handling -| Error / Symptom | Likely Cause | Remediation | -|-----------------|--------------|-------------| -| `PERMISSION_DENIED` on cluster create | Missing `container.clusters.create` IAM role | Grant `roles/container.admin` or `roles/container.clusterAdmin` | -| Quota exceeded | Regional vCPU, GPU, or IP address limits | Request quota increase or select a different region | -| IP exhaustion / CIDR conflict | Pod subnet too small or overlapping ranges | Re-plan IP ranges; may require cluster recreation (Day-0) | -| Workload Identity not working | Missing OIDC issuer or federated credential | Verify `workloadIdentityConfig.workloadPool`; configure federated identity binding | -| Private cluster unreachable | No authorized networks or DNS endpoint | Enable `dnsEndpointConfig.allowExternalTraffic` or add authorized networks | -| Secret Manager rotation failing | SA missing `secretmanager.versions.access` | Grant Secret Manager accessor role to workload's GSA | -| Control-plane metrics missing | Monitoring components not configured | Enable APISERVER, SCHEDULER, CONTROLLER_MANAGER in `monitoringConfig` | +Error / Symptom | Likely Cause | Remediation +------------------------------------- | -------------------------------------------- | ----------- +`PERMISSION_DENIED` on cluster create | Missing `container.clusters.create` IAM role | Grant `roles/container.admin` or `roles/container.clusterAdmin` +Quota exceeded | Regional vCPU, GPU, or IP address limits | Request quota increase or select a different region +IP exhaustion / CIDR conflict | Pod subnet too small or overlapping ranges | Re-plan IP ranges; may require cluster recreation (Day-0) +Workload Identity not working | Missing OIDC issuer or federated credential | 
Verify `workloadIdentityConfig.workloadPool`; configure federated identity binding +Private cluster unreachable | No authorized networks or DNS endpoint | Enable `dnsEndpointConfig.allowExternalTraffic` or add authorized networks +Secret Manager rotation failing | SA missing `secretmanager.versions.access` | Grant Secret Manager accessor role to workload's GSA +Control-plane metrics missing | Monitoring components not configured | Enable APISERVER, SCHEDULER, CONTROLLER_MANAGER in `monitoringConfig` diff --git a/skills/cloud/gke-basics/references/client-library-usage.md b/skills/cloud/gke-basics/references/client-library-usage.md index 43e5dd273c..a15558f3e2 100644 --- a/skills/cloud/gke-basics/references/client-library-usage.md +++ b/skills/cloud/gke-basics/references/client-library-usage.md @@ -3,10 +3,9 @@ To interact with the GKE (Kubernetes) API programmatically, use the official Kubernetes client libraries. -**Prerequisite:** These libraries interact with the Kubernetes API. You -must already have a running GKE cluster and valid credentials -(for example, by running `gcloud container clusters get-credentials`) -before running this code. +**Prerequisite:** These libraries interact with the Kubernetes API. You must +already have a running GKE cluster and valid credentials (for example, by +running `gcloud container clusters get-credentials`) before running this code. ## Getting Started @@ -15,77 +14,77 @@ within your application code. 
### Python -- **Installation:** +- **Installation:** - ```bash - pip install kubernetes - ``` + ```bash + pip install kubernetes + ``` -- **Usage Example:** +- **Usage Example:** - ```python - from kubernetes import client, config - config.load_kube_config() # Loads from ~/.kube/config - v1 = client.CoreV1Api() - print("Listing pods with their IPs:") - ret = v1.list_pod_for_all_namespaces(watch=False) - for i in ret.items: - print("%s\t%s\t%s" % (i.status.pod_ip, i.metadata.namespace, i.metadata.name)) - ``` + ```python + from kubernetes import client, config + config.load_kube_config() # Loads from ~/.kube/config + v1 = client.CoreV1Api() + print("Listing pods with their IPs:") + ret = v1.list_pod_for_all_namespaces(watch=False) + for i in ret.items: + print("%s\t%s\t%s" % (i.status.pod_ip, i.metadata.namespace, i.metadata.name)) + ``` ### Go -- **Installation:** +- **Installation:** - ```bash - go get k8s.io/client-go@latest - ``` + ```bash + go get k8s.io/client-go@latest + ``` -- **Usage Example:** +- **Usage Example:** - ```go - import ( - "k8s.io/client-go/kubernetes" - "k8s.io/client-go/tools/clientcmd" - ) - config, _ := clientcmd.BuildConfigFromFlags("", kubeconfig) - clientset, _ := kubernetes.NewForConfig(config) - pods, _ := clientset.CoreV1().Pods("").List( - context.TODO, metav1.ListOptions{}) - ``` + ```go + import ( + "k8s.io/client-go/kubernetes" + "k8s.io/client-go/tools/clientcmd" + ) + config, _ := clientcmd.BuildConfigFromFlags("", kubeconfig) + clientset, _ := kubernetes.NewForConfig(config) + pods, _ := clientset.CoreV1().Pods("").List( + context.Background(), metav1.ListOptions{}) + ``` ### Node.js (TypeScript) -- **Installation:** +- **Installation:** - ```bash - npm install @kubernetes/client-node - ``` + ```bash + npm install @kubernetes/client-node + ``` -- **Usage Example:** +- **Usage Example:** - ```javascript - const k8s = require('@kubernetes/client-node'); + ```javascript + const k8s = require('@kubernetes/client-node'); - const kc 
= new k8s.KubeConfig(); - kc.loadFromDefault(); // Automatically detects local vs. in-cluster configuration + const kc = new k8s.KubeConfig(); + kc.loadFromDefault(); // Automatically detects local vs. in-cluster configuration - const k8sApi = kc.makeApiClient(k8s.CoreV1Api); + const k8sApi = kc.makeApiClient(k8s.CoreV1Api); - // In most recent library versions, parameters must be passed inside an object - k8sApi.listNamespacedPod({ namespace: 'default' }).then((res) => { - const pods = res.items || res.body.items; - console.log(`Found ${pods.length} pods in 'default' namespace.`); - }); - ``` + // In most recent library versions, parameters must be passed inside an object + k8sApi.listNamespacedPod({ namespace: 'default' }).then((res) => { + const pods = res.items || res.body.items; + console.log(`Found ${pods.length} pods in 'default' namespace.`); + }); + ``` ### Java -- [Java Reference](https://github.com/kubernetes-client/java) +- [Java Reference](https://github.com/kubernetes-client/java) ## GKE-specific API (Container Service) To manage the GKE *service* itself (e.g., create/delete clusters) programmatically, use the Google Cloud Container client libraries. -- [Google Cloud Container Client Libraries](https://cloud.google.com/kubernetes-engine/docs/reference/libraries) +- [Google Cloud Container Client Libraries](https://cloud.google.com/kubernetes-engine/docs/reference/libraries) diff --git a/skills/cloud/gke-basics/references/core-concepts.md b/skills/cloud/gke-basics/references/core-concepts.md index b994f59828..9f02d5382e 100644 --- a/skills/cloud/gke-basics/references/core-concepts.md +++ b/skills/cloud/gke-basics/references/core-concepts.md @@ -1,54 +1,80 @@ # GKE Core Concepts -Google Kubernetes Engine (GKE) is a managed Kubernetes platform for deploying, managing, and scaling containerized applications on Google Cloud infrastructure. 
It handles cluster provisioning, upgrades, and node management, letting teams focus on workloads rather than infrastructure. +Google Kubernetes Engine (GKE) is a managed Kubernetes platform for deploying, +managing, and scaling containerized applications on Google Cloud infrastructure. +It handles cluster provisioning, upgrades, and node management, letting teams +focus on workloads rather than infrastructure. > **MCP Tools:** `list_clusters`, `get_cluster` ## Cluster Modes -| Mode | Who Manages Nodes | Best For | -|------|-------------------|----------| -| **Autopilot** (recommended) | Google — fully managed nodes, scaling, and security | Most workloads. No node-level ops. Pay per pod resource request. | -| **Standard** | You — full control over node pools, OS, machine types | Workloads requiring kernel customization, specific node OS, or DaemonSets not supported by Autopilot | +| Mode | Who Manages Nodes | Best For | +| ------------- | ---------------------------- | ----------------------------- | +| **Autopilot** | Google — fully managed | Most workloads. No node-level | +: (recommended) : nodes, scaling, and security : ops. Pay per pod resource : +: : : request. : +| **Standard** | You — full control over node | Workloads requiring kernel | +: : pools, OS, machine types : customization, specific node : +: : : OS, or DaemonSets not : +: : : supported by Autopilot : -**Default: Autopilot.** Use Standard only when Autopilot has a documented limitation for your workload. +**Default: Autopilot.** Use Standard only when Autopilot has a documented +limitation for your workload. ## Cluster Architecture -- **Regional clusters** (recommended): Control plane replicated across 3 zones. Higher availability, no single-zone failure risk. -- **Zonal clusters**: Single control plane zone. Lower cost, acceptable for dev/test. -- **Private clusters** (golden path default): Nodes have no public IPs. Control plane accessible via private endpoint or DNS endpoint. 
+- **Regional clusters** (recommended): Control plane replicated across 3 + zones. Higher availability, no single-zone failure risk. +- **Zonal clusters**: Single control plane zone. Lower cost, acceptable for + dev/test. +- **Private clusters** (golden path default): Nodes have no public IPs. + Control plane accessible via private endpoint or DNS endpoint. ## Networking Model GKE uses **VPC-native** clusters with alias IP ranges: -- Each pod gets a routable IP from the pod CIDR -- Dataplane V2 (eBPF-based) is the golden path default — provides built-in Network Policy enforcement -- Cloud DNS for in-cluster DNS resolution -- Gateway API for ingress/load balancing + +- Each pod gets a routable IP from the pod CIDR +- Dataplane V2 (eBPF-based) is the golden path default — provides built-in + Network Policy enforcement +- Cloud DNS for in-cluster DNS resolution +- Gateway API for ingress/load balancing ## Scaling Model -- **Horizontal Pod Autoscaler (HPA)**: Scales pod replicas based on CPU, memory, or custom metrics -- **Vertical Pod Autoscaler (VPA)**: Recommends or auto-adjusts pod resource requests -- **Cluster Autoscaler / NAP**: Scales nodes to match pod demand (Autopilot handles this automatically) -- **ComputeClasses**: Declarative node selection — machine family, Spot VMs, GPU targeting +- **Horizontal Pod Autoscaler (HPA)**: Scales pod replicas based on CPU, + memory, or custom metrics +- **Vertical Pod Autoscaler (VPA)**: Recommends or auto-adjusts pod resource + requests +- **Cluster Autoscaler / NAP**: Scales nodes to match pod demand (Autopilot + handles this automatically) +- **ComputeClasses**: Declarative node selection — machine family, Spot VMs, + GPU targeting ## Identity & Security Model -- **Workload Identity Federation**: Pods assume Google Cloud IAM identities without static keys -- **Secret Manager integration**: Secrets synced to Kubernetes with automatic rotation -- **Pod Security Standards**: `restricted` profile enforced on production 
namespaces -- **Shielded Nodes**: Secure Boot and integrity monitoring (Autopilot-enforced) +- **Workload Identity Federation**: Pods assume Google Cloud IAM identities + without static keys +- **Secret Manager integration**: Secrets synced to Kubernetes with automatic + rotation +- **Pod Security Standards**: `restricted` profile enforced on production + namespaces +- **Shielded Nodes**: Secure Boot and integrity monitoring + (Autopilot-enforced) ## Regional Availability -GKE is available in all Google Cloud regions. Autopilot clusters are regional by default. See https://cloud.google.com/about/locations for the full region list. +GKE is available in all Google Cloud regions. Autopilot clusters are regional by +default. See https://cloud.google.com/about/locations for the full region list. ## Pricing GKE pricing depends on the cluster mode: -- **Autopilot**: Pay for pod resource requests (vCPU, memory, ephemeral storage). No cluster management fee. -- **Standard**: Pay for underlying Compute Engine VMs plus a per-cluster management fee. + +- **Autopilot**: Pay for pod resource requests (vCPU, memory, ephemeral + storage). No cluster management fee. +- **Standard**: Pay for underlying Compute Engine VMs plus a per-cluster + management fee. For current pricing, see https://cloud.google.com/kubernetes-engine/pricing. diff --git a/skills/cloud/gke-basics/references/gke-compute-classes.md b/skills/cloud/gke-basics/references/gke-compute-classes.md deleted file mode 100644 index 0edd842ece..0000000000 --- a/skills/cloud/gke-basics/references/gke-compute-classes.md +++ /dev/null @@ -1,172 +0,0 @@ -# GKE ComputeClasses - -ComputeClasses allow declarative node configuration and autoscaling priorities in GKE Autopilot (and Standard with NAP). Use them to specify machine families, Spot VM fallback, GPU requirements, and zone targeting. 
- -> **MCP Tools:** `apply_k8s_manifest`, `get_k8s_resource`, `describe_k8s_resource`, `delete_k8s_resource` - -## When to Use - -- Cost optimization: Spot VMs with on-demand fallback -- GPU/TPU workloads: target specific accelerators -- Performance: select specific machine families (c3, c4, n4) -- Zone targeting: colocate workloads with zonal resources - -## CRD Structure - -```yaml -apiVersion: cloud.google.com/v1 -kind: ComputeClass -metadata: - name: -spec: - # Required. Ordered list of rules. GKE tries them in order. - priorities: - - - - # Optional. Default: "DoNotScaleUp" - whenUnsatisfiable: <"DoNotScaleUp" | "ScaleUpAnyway"> - - # Optional. Auto-create node pools. Default: true - nodePoolAutoCreation: - enabled: - - # Optional. Move workloads back to higher-priority when available - activeMigration: - optimizeRulePriority: - - # Optional. Scale-down delay - autoscalingPolicy: - consolidationDelay: - - # Optional. Defaults for fields omitted in priorities - priorityDefaults: -``` - -## PriorityRule Fields - -| Field | Type | Description | Example | -|-------|------|-------------|---------| -| `machineFamily` | string | Compute Engine machine family | `n4`, `c3`, `t2a` | -| `machineType` | string | Specific machine type | `n4-standard-32` | -| `spot` | boolean | Use Spot VMs | `true` | -| `minCores` | int | Minimum vCPUs | `4` | -| `minMemoryGb` | int | Minimum memory in GB | `16` | -| `gpu` | object | GPU config: `type`, `count`, `driverVersion` | See below | -| `tpu` | object | TPU config: `type`, `count`, `topology` | See below | -| `storage` | object | Boot disk: `type`, `sizeGb`, `kmsKey`; Local SSD: `count`, `interface` | See below | -| `location` | object | Zone targeting: `zones: [...]` or `type: "Any"` | See below | -| `reservations` | object | Reservation consumption: `NO_RESERVATION`, `ANY_RESERVATION`, `SPECIFIC_RESERVATION` | See below | - -### GPU Configuration - -```yaml -gpu: - type: "nvidia-l4" # nvidia-l4, nvidia-h100-80gb, etc. 
- count: 1 # GPUs per node - driverVersion: "latest" # Optional -``` - -### TPU Configuration - -```yaml -tpu: - type: "v5p-slice" - count: 8 - topology: "2x2x1" -``` - -### Storage Configuration - -```yaml -storage: - bootDisk: - type: "pd-balanced" # pd-balanced (golden path), pd-ssd, hyperdisk-balanced - sizeGb: 100 - kmsKey: "projects/.../cryptoKeys/..." # Optional CMEK - localSsd: - count: 1 - interface: "NVME" -``` - -### Location Configuration - -```yaml -location: - zones: - - "us-central1-a" - - "us-central1-b" - # OR - type: "Any" # Let GKE pick from cluster zones -``` - -## Common Patterns - -### Spot VMs with On-Demand Fallback - -```yaml -apiVersion: cloud.google.com/v1 -kind: ComputeClass -metadata: - name: spot-with-fallback -spec: - nodePoolAutoCreation: - enabled: true - priorities: - - machineFamily: n4 - spot: true - - machineFamily: n4 - spot: false -``` - -### GPU Workload (L4) - -```yaml -apiVersion: cloud.google.com/v1 -kind: ComputeClass -metadata: - name: l4-gpu-class -spec: - priorities: - - machineFamily: g2 - gpu: - type: nvidia-l4 - count: 1 - minCores: 4 - minMemoryGb: 16 - storage: - bootDisk: - type: pd-balanced - sizeGb: 100 -``` - -### Spot with Active Migration (Return to Spot When Available) - -Add `activeMigration` to the Spot-with-fallback pattern above to auto-migrate workloads back to Spot when capacity returns: - -```yaml -spec: - activeMigration: - optimizeRulePriority: true - priorities: - - machineFamily: n4 - spot: true - - machineFamily: n4 - spot: false -``` - -> **Other patterns** — HPC (`machineFamily: c3`, `minCores: 8`) and zone targeting (`location.zones: [...]`) follow the same CRD structure. See the PriorityRule fields table and sub-config examples above. 
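The "other patterns" note above says HPC and zone targeting reuse the same CRD structure. A minimal sketch of that idea, assuming only the fields shown in this reference (`machineFamily`, `minCores`, `location.zones`), is a small Python helper that assembles such a manifest. The helper name `compute_class` and the class name `hpc-c3` are hypothetical, chosen for illustration only.

```python
import json
from typing import Any, Dict, List, Optional


def compute_class(
    name: str,
    machine_family: str,
    min_cores: Optional[int] = None,
    zones: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build a single-priority ComputeClass manifest as a plain dict.

    Only fields documented in the PriorityRule table are emitted;
    optional fields are left out entirely when not supplied.
    """
    rule: Dict[str, Any] = {"machineFamily": machine_family}
    if min_cores is not None:
        rule["minCores"] = min_cores
    if zones:
        rule["location"] = {"zones": list(zones)}
    return {
        "apiVersion": "cloud.google.com/v1",
        "kind": "ComputeClass",
        "metadata": {"name": name},  # hypothetical name supplied by the caller
        "spec": {"priorities": [rule]},
    }


if __name__ == "__main__":
    # HPC pattern from the note: c3 family, at least 8 cores, pinned to one zone.
    manifest = compute_class("hpc-c3", "c3", min_cores=8, zones=["us-central1-a"])
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied like any other manifest, since `kubectl apply -f` accepts JSON as well as YAML.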
- -## Workload Usage - -Pods must specify the ComputeClass via node selector: - -```yaml -nodeSelector: - cloud.google.com/compute-class: "" -``` - -## Warnings - -- Do not mix ComputeClass selection with other hard node selectors (like `cloud.google.com/gke-spot`) — this causes scheduling conflicts. -- When using `activeMigration`, workloads will be evicted and rescheduled — ensure PDBs are in place. -- Spot VMs can be evicted with 30-second notice. Set `terminationGracePeriodSeconds < 30` for Spot workloads. diff --git a/skills/cloud/gke-basics/references/gke-cost.md b/skills/cloud/gke-basics/references/gke-cost.md deleted file mode 100644 index 2bb88dc645..0000000000 --- a/skills/cloud/gke-basics/references/gke-cost.md +++ /dev/null @@ -1,158 +0,0 @@ -# GKE Cost Optimization - -This reference covers strategies for reducing GKE costs while maintaining the golden path security and reliability posture. - -> **MCP Tools:** `get_k8s_resource`, `describe_k8s_resource`, `apply_k8s_manifest`, `patch_k8s_resource`, `get_cluster` - -## Golden Path Cost Features - -The golden path already includes cost-optimizing settings: - -| Setting | Value | Impact | -|---------|-------|--------| -| `autoscalingProfile` | `OPTIMIZE_UTILIZATION` | Aggressive node scale-down reduces idle compute | -| `verticalPodAutoscaling` | `enabled` | VPA recommendations prevent over-provisioning | -| Autopilot pricing | Pay per pod request | No charge for unused node capacity | -| Node Auto Provisioning | enabled | Right-sized node pools created automatically | - -## Cost Optimization Strategies - -### 1. Spot VMs via ComputeClasses - -Use Spot VMs for fault-tolerant workloads (60-90% cost reduction). 
- -```yaml -apiVersion: cloud.google.com/v1 -kind: ComputeClass -metadata: - name: spot-with-fallback -spec: - activeMigration: - optimizeRulePriority: true - priorities: - - machineFamily: n4 - spot: true - - machineFamily: n4 - spot: false -``` - -**Spot-suitable workloads:** - -| Workload | Spot-Suitable? | -|----------|----------------| -| Batch / data processing | Yes | -| Dev / test environments | Yes | -| Stateless web/API (replicas >= 2) | Yes (with PDBs) | -| Jobs with checkpointing | Yes | -| Stateful workloads (databases) | No | -| Single-replica critical services | No | - -**Handling eviction:** - -```yaml -spec: - template: - spec: - terminationGracePeriodSeconds: 25 # Must be < 30s for Spot - containers: - - name: app - lifecycle: - preStop: - exec: - command: ["/bin/sh", "-c", "sleep 5"] -``` - -### 2. Pod Rightsizing - -Use VPA recommendations to reduce over-provisioned requests. - -```bash -# 1. Deploy VPA in recommendation mode -kubectl apply -f - <-vpa -spec: - targetRef: - apiVersion: apps/v1 - kind: Deployment - name: - updatePolicy: - updateMode: "Off" -EOF - -# 2. Wait 24+ hours for data collection - -# 3. Read recommendations -kubectl get vpa -vpa -o jsonpath='{.status.recommendation}' -``` - -**Optimization rules:** - -| Condition | Action | Savings | -|-----------|--------|---------| -| CPU request >5x P95 actual | Reduce to `P95 * 1.2` | High | -| Memory request >3x P95 actual | Reduce to `P95 * 1.2` | High | -| CPU request >2x P95 actual | Reduce to `P95 * 1.2` | Medium | -| No resource requests set | Add requests (enables bin-packing) | Medium | - -### 3. 
Machine Type Selection - -| Family | Use Case | Relative Cost | -|--------|----------|---------------| -| e2 | General purpose, burstable | Lowest | -| t2a / t2d | Scale-out (Arm/AMD), price-performance optimized | Low | -| n4a | Axion Arm-based, general-purpose price-performance | Low | -| n4 / n4d | General purpose (Intel/AMD), flexible shapes | Low-Medium | -| c4a | Compute-optimized (Arm), high efficiency | Medium-High | -| c3 / c4 | Compute-optimized (Intel) | Medium-High | -| c3d / c4d | Compute-optimized (AMD), high-performance throughput | Medium-High | -| ek-standard | Autopilot enhanced (golden path) | Medium | -| m3 / x4 | Memory-optimized, SAP HANA, large databases | High | -| g2 (L4 GPU) | AI inference | High | -| a3 (H100 GPU) | AI training | Highest | -| a4 / a4x | Ultra-scale AI (Blackwell GPUs) | Highest | - -> In Autopilot, machine type is managed. Use ComputeClasses to influence selection. - -### 4. Committed Use Discounts (CUDs) - -For steady-state workloads, purchase 1-year or 3-year CUDs: - -- 1-year: ~20-30% discount -- 3-year: ~50-55% discount -- Applied automatically to matching usage in the region -- Purchase via Google Cloud Console > Billing > Committed use discounts - -### 5. Cluster Management - -- **Stop/start dev clusters**: Idle dev clusters cost money even with no workloads (control plane fee). -- **Right-size node pools** (Standard): Use Cluster Autoscaler with appropriate min/max. -- **Multi-tenant clusters**: Share a single cluster across teams instead of per-team clusters (see [gke-multitenancy.md](./gke-multitenancy.md)). 
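The optimization rules in the Pod Rightsizing strategy earlier in this reference are simple threshold checks, so they can be sketched in a few lines of Python. This is an illustrative sketch, not part of any GKE tooling: the function name `rightsize` is hypothetical, and sizing a missing request at `P95 * 1.2` is an assumption (the table only says to add requests).

```python
from typing import Optional, Tuple

HEADROOM = 1.2  # target request = P95 * 1.2, as in the optimization rules table


def rightsize(
    resource: str, request: Optional[float], p95: float
) -> Tuple[Optional[float], str]:
    """Return (suggested_request, savings_tier) for one container resource.

    resource is "cpu" or "memory"; request and p95 use the same unit.
    The tier is "high", "medium", or "none" when no rule matches.
    """
    if p95 <= 0:
        raise ValueError("p95 must be positive")
    target = round(p95 * HEADROOM, 3)
    if request is None:
        # Assumption: size a missing request the same way as an oversized one.
        return target, "medium"
    ratio = request / p95
    if resource == "cpu" and ratio > 5:
        return target, "high"
    if resource == "memory" and ratio > 3:
        return target, "high"
    if resource == "cpu" and ratio > 2:
        return target, "medium"
    return request, "none"  # within tolerance, leave the request unchanged


if __name__ == "__main__":
    print(rightsize("cpu", 2.0, 0.25))      # request is 8x the P95 usage
    print(rightsize("memory", 512, 400.0))  # request is 1.28x the P95 usage
```

The P95 inputs would come from observed usage, for example the VPA recommendations collected in recommendation mode.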
- -## Cost Monitoring - -```bash -# View cluster cost breakdown (requires Cost Management API) -gcloud billing budgets list --billing-account= --quiet - -# View node utilization -kubectl top nodes - -# View pod resource usage vs requests -kubectl top pods --all-namespaces --containers -``` - -## Dev/Test Cost Savings - -For non-production environments, these golden path deviations are acceptable: - -| Setting | Production (Golden Path) | Dev/Test | -|---------|-------------------------|----------| -| Cluster mode | Autopilot | Autopilot (cheaper with fewer pods) | -| Release channel | Regular | Rapid (get fixes faster) | -| Private nodes | Required | Optional (simpler access) | -| Monitoring components | Full suite | SYSTEM_COMPONENTS only | -| Secret Manager rotation | 120s | Disabled | -| Maintenance windows | Configured | Not needed | diff --git a/skills/cloud/gke-basics/references/gke-golden-path.md b/skills/cloud/gke-basics/references/gke-golden-path.md deleted file mode 100644 index 8473c834c8..0000000000 --- a/skills/cloud/gke-basics/references/gke-golden-path.md +++ /dev/null @@ -1,76 +0,0 @@ -# GKE Golden Path Configuration - -The golden path is the recommended Autopilot configuration for production clusters. It defines sensible defaults — when the user requests different settings, apply them and note relevant trade-offs. - -> **MCP Tools:** `get_cluster`, `create_cluster`, `update_cluster` - -## Rules - -1. **Default to the golden path.** Use golden path values unless the user requests otherwise. When deviating, note trade-offs but respect the user's choice. -2. **Day-0 vs Day-1.** Flag Day-0 decisions (networking, private nodes, subnets, IP allocation) prominently — they are hard/impossible to change after creation. -3. **Tool preference: MCP > gcloud > kubectl.** See [cli-reference.md](./cli-reference.md) for full coverage matrix and override options. If the user says "use gcloud" or "use kubectl", respect that for the session. -4. 
**Document decisions and rationale**, especially for Day-0 choices and golden path deviations. - -## Required Inputs - -If the user is unsure, use golden path defaults. - -- **Project ID** (required) -- **Region** (required, e.g., `us-central1`) -- **Cluster name** (required) -- **Environment type**: dev/test or production (defaults to production) -- **Networking**: bring-your-own VPC/subnet or auto-create (default: auto-create) -- **Scale expectations**: expected node/pod count, workload types -- **Cost constraints**: Spot VM tolerance, budget considerations - -## Always-Apply Defaults - -Recommended best practices applied by default. If the user requests a different setting, apply it and briefly note the security or operational trade-off. - -| Setting | Golden Path Value | -|---------|-------------------| -| `autopilot.enabled` | `true` | -| `privateClusterConfig.enablePrivateNodes` | `true` | -| `masterAuthorizedNetworksConfig.privateEndpointEnforcementEnabled` | `true` | -| `secretManagerConfig.enabled` + `rotationInterval: 120s` | `true` | -| `rbacBindingConfig.enableInsecureBinding*` | `false` (both) | -| `workloadIdentityConfig.workloadPool` | enabled | -| `networkConfig.datapathProvider` | `ADVANCED_DATAPATH` | -| `networkConfig.dnsConfig.clusterDns` | `CLOUD_DNS` | -| `autoscaling.autoscalingProfile` | `OPTIMIZE_UTILIZATION` | -| `verticalPodAutoscaling.enabled` | `true` | -| `monitoringConfig` components | SYSTEM_COMPONENTS, STORAGE, POD, DEPLOYMENT, STATEFULSET, DAEMONSET, HPA, JOBSET, CADVISOR, KUBELET, DCGM, APISERVER, SCHEDULER, CONTROLLER_MANAGER | -| `advancedDatapathObservabilityConfig.enableMetrics` | `true` | -| `nodeConfig.shieldedInstanceConfig.enableSecureBoot` | `true` | -| `nodeConfig.workloadMetadataConfig.mode` | `GKE_METADATA` | -| `nodeConfig.gcfsConfig.enabled` / `gvnic.enabled` | `true` / `true` | -| `addonsConfig.statefulHaConfig.enabled` | `true` | -| Storage CSI drivers (Filestore, GCS FUSE, Parallelstore) | enabled | -| Pod 
Security Standards | `restricted` on production namespaces | - -## Customer-Configurable Settings - -These have golden path defaults but customers may deviate with valid justification. **Ask before changing.** - -| Setting | Default | Why Deviate | -|---------|---------|-------------| -| `dnsEndpointConfig.allowExternalTraffic` | `true` | Restrict if cluster only accessed from within VPC | -| `autoIpamConfig` / `createSubnetwork` | `true` / `true` | Customer has pre-existing VPC/subnets | -| `maxPodsPerNode` | `48` | `110` for high pod-density (costs more CIDR space) | -| `subnetwork` | auto-created | Customer brings existing subnets | -| Maintenance exclusion windows | configured (NO_MINOR_UPGRADES, 1yr) | Customer-specific scheduling | -| `nodeConfig.bootDisk.diskType` | `pd-balanced` | `pd-ssd` for I/O-intensive, `pd-standard` for cost | -| `nodeConfig.machineType` | `ek-standard-8` (Autopilot) | Varies by workload; use ComputeClasses | - -## Guardrails - -- Do not request or output secrets (tokens, keys, service account JSON). -- Discover project/cluster context via MCP tools or `gcloud config get-value project` — don't ask users to paste project IDs. -- For Day-0 decisions, always ask clarifying questions before proceeding. -- For Day-1 features, propose golden path defaults with trade-offs and let the customer confirm. -- Do not promise zero downtime; advise PDBs, health probes, replicas, and staged upgrades. -- When auditing existing clusters, compare against golden path and report deviations with severity and remediation. - -## Golden Path Config - -See [golden-path-autopilot.yaml](../assets/golden-path-autopilot.yaml) for the full cluster-level policy settings. 
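The guardrail about auditing existing clusters (compare against the golden path and report deviations) can be sketched as a dictionary comparison. The sketch below is illustrative and makes assumptions: it checks only a subset of the Always-Apply Defaults, it assumes the cluster description is a nested dict keyed the same way as the dotted setting names in that table, and the names `GOLDEN_PATH`, `lookup`, and `find_deviations` are hypothetical.

```python
from typing import Any, Dict, List, Tuple

# Subset of the Always-Apply Defaults table, keyed by dotted setting path.
GOLDEN_PATH: Dict[str, Any] = {
    "autopilot.enabled": True,
    "privateClusterConfig.enablePrivateNodes": True,
    "networkConfig.datapathProvider": "ADVANCED_DATAPATH",
    "networkConfig.dnsConfig.clusterDns": "CLOUD_DNS",
    "autoscaling.autoscalingProfile": "OPTIMIZE_UTILIZATION",
    "verticalPodAutoscaling.enabled": True,
}


def lookup(config: Dict[str, Any], dotted_path: str) -> Any:
    """Walk a nested dict along a dotted path; return None if any key is missing."""
    node: Any = config
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def find_deviations(cluster: Dict[str, Any]) -> List[Tuple[str, Any, Any]]:
    """Return (setting, expected, actual) for every setting that differs."""
    deviations = []
    for path, expected in GOLDEN_PATH.items():
        actual = lookup(cluster, path)
        if actual != expected:
            deviations.append((path, expected, actual))
    return deviations


if __name__ == "__main__":
    # Hypothetical cluster description with one wrong value and several unset ones.
    sample = {
        "autopilot": {"enabled": True},
        "networkConfig": {"datapathProvider": "LEGACY_DATAPATH"},
    }
    for path, expected, actual in find_deviations(sample):
        print(f"{path}: expected {expected!r}, found {actual!r}")
```

A missing setting is reported with an actual value of `None`, so unset fields surface as deviations rather than being silently skipped. Severity and remediation text would still need to be added per setting.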
diff --git a/skills/cloud/gke-basics/references/gke-inference.md b/skills/cloud/gke-basics/references/gke-inference.md deleted file mode 100644 index 761adf2e62..0000000000 --- a/skills/cloud/gke-basics/references/gke-inference.md +++ /dev/null @@ -1,161 +0,0 @@ -# GKE AI/ML Inference - -This reference covers deploying AI/ML inference workloads on GKE using Google's Inference Quickstart (GIQ) and best practices for LLM serving. - -> **MCP Tools:** `apply_k8s_manifest`, `get_k8s_resource`, `get_k8s_logs`, `get_k8s_rollout_status`, `describe_k8s_resource`, `list_k8s_events`. **CLI-only:** `gcloud container ai profiles *` - -## When to Use - -- Deploy an AI model (Llama, Gemma, Mistral, etc.) to GKE -- Generate optimized Kubernetes manifests for inference -- Select GPU/TPU accelerators for model serving -- Configure autoscaling for LLM inference - -## Prerequisites - -- A golden path GKE Autopilot cluster (GPU workloads are supported via ComputeClasses and NAP) -- `gcloud` CLI authenticated -- Sufficient GPU/TPU quota in the target region - -## Workflow - -### 1. Discovery: Find Models and Hardware - -```bash -# List all supported models -gcloud container ai profiles models list --quiet - -# Find valid accelerator/server combinations for a model -gcloud container ai profiles list --model= --quiet - -# Example: what can run Gemma 2 9B? -gcloud container ai profiles list --model=gemma-2-9b-it --quiet -``` - -### 2. 
Generate Manifest - -```bash -gcloud container ai profiles manifests create \ - --model= \ - --model-server= \ - --accelerator-type= \ - --target-ntpot-milliseconds= --quiet > inference.yaml -``` - -**Parameters:** -- `--model`: Model ID (e.g., `gemma-2-9b-it`, `llama-3-8b`) -- `--model-server`: Inference server (`vllm`, `tgi`, `triton`, `tensorrt-llm`) -- `--accelerator-type`: GPU/TPU type (`nvidia-l4`, `nvidia-tesla-a100`, `nvidia-h100-80gb`) -- `--target-ntpot-milliseconds`: Target Normalized Time Per Output Token (optional, for latency optimization) - -**Example:** - -```bash -gcloud container ai profiles manifests create \ - --model=gemma-2-9b-it \ - --model-server=vllm \ - --accelerator-type=nvidia-l4 \ - --target-ntpot-milliseconds=50 --quiet > inference.yaml -``` - -### 3. Review and Deploy - -```bash -# Review for placeholders (HF tokens, PVCs) -cat inference.yaml - -# Deploy -kubectl apply -f inference.yaml - -# Monitor -kubectl get pods -w -kubectl logs -f -``` - -> Some models require Hugging Face tokens. Create a Kubernetes Secret and reference it in the manifest. 
- -## GPU ComputeClass for Inference - -For Autopilot clusters, create a ComputeClass to target GPU nodes: - -```yaml -apiVersion: cloud.google.com/v1 -kind: ComputeClass -metadata: - name: l4-inference -spec: - priorities: - - machineFamily: g2 - gpu: - type: nvidia-l4 - count: 1 - minCores: 4 - minMemoryGb: 16 -``` - -## Accelerator Selection Guide - -| Accelerator | Best For | Memory | Relative Cost | -|-------------|----------|--------|---------------| -| NVIDIA T4 | Budget inference, lightweight legacy models | 16 GB | Lowest | -| NVIDIA L4 (G2) | Small-medium model inference, video, graphics | 24 GB | Low | -| NVIDIA RTX PRO 6000 (G4) | Multimodal AI, high-fidelity 3D, fine-tuning | 96 GB | Medium | -| Cloud TPU v5e | Cost-effective transformer inference | Varies | Medium | -| Cloud TPU v5p | High-performance training | Varies | High | -| Cloud TPU v6e (Trillium) | High-efficiency next-gen training & serving | 32 GB/chip | Medium-High | -| Cloud TPU v7x (Ironwood) | Ultra-scale inference & agentic workflows | 192 GB/chip | High | -| NVIDIA A100 | Large model inference, enterprise ML | 40/80 GB | High | -| NVIDIA H100 / H200 | Frontier model training, high throughput | 80/141 GB | Highest | -| NVIDIA B200 (A4) | Blackwell-scale training, FP4 precision | 192 GB | Highest | -| NVIDIA GB200 (A4X) | Rack-scale AI (Grace Blackwell Superchip) | Massive | Highest | - -## Autoscaling LLM Inference - -### GPU-based autoscaling - -Use custom metrics for GPU utilization: - -```yaml -apiVersion: autoscaling/v2 -kind: HorizontalPodAutoscaler -metadata: - name: llm-hpa -spec: - scaleTargetRef: - apiVersion: apps/v1 - kind: Deployment - name: llm-server - minReplicas: 1 - maxReplicas: 10 - metrics: - - type: Pods - pods: - metric: - name: gpu_duty_cycle - target: - type: AverageValue - averageValue: "80" -``` - -### Best practices for inference autoscaling - -1. **Use DCGM metrics**: Golden path enables DCGM monitoring for GPU utilization metrics -2. 
**Set appropriate minReplicas**: At least 1 for always-on serving; 0 for batch/on-demand -3. **Tune scale-down delay**: LLM model loading is slow; use longer stabilization windows -4. **Consider queue depth**: Scale on pending requests rather than pure GPU utilization for latency-sensitive workloads - -## Optimization Tips - -- **Quantization**: Use quantized models (GPTQ, AWQ) to reduce GPU memory and increase throughput -- **Batching**: Configure model server batch size for throughput vs latency trade-off -- **Tensor parallelism**: Split large models across multiple GPUs within a node -- **KV cache optimization**: Tune `--gpu-memory-utilization` in vLLM for KV cache allocation - -## Troubleshooting - -| Issue | Cause | Fix | -|-------|-------|-----| -| Invalid model/accelerator combination | Unsupported tuple | Re-run `gcloud container ai profiles list --model=` | -| GPU quota exceeded | Regional quota limit | Request quota increase or try a different region | -| OOM on GPU | Model too large for accelerator | Use larger GPU, enable quantization, or use tensor parallelism | -| Slow cold start | Large model loading from registry | Use local SSD for model caching; pre-pull images | diff --git a/skills/cloud/gke-basics/references/gke-networking.md b/skills/cloud/gke-basics/references/gke-networking.md deleted file mode 100644 index 20eb5b49c0..0000000000 --- a/skills/cloud/gke-basics/references/gke-networking.md +++ /dev/null @@ -1,131 +0,0 @@ -# GKE Networking - -This reference covers networking configuration for GKE clusters. The golden path enforces private, VPC-native clusters with Dataplane V2. 
- -> **MCP Tools:** `get_cluster`, `update_cluster`, `apply_k8s_manifest`, `get_k8s_resource` - -## Golden Path Networking Defaults - -| Setting | Golden Path Value | Day-0/1 | Notes | -|---------|-------------------|---------|-------| -| `privateClusterConfig.enablePrivateNodes` | `true` | Day-0 | Nodes have no public IPs | -| `masterAuthorizedNetworksConfig.privateEndpointEnforcementEnabled` | `true` | Day-0 | Control plane only reachable via private endpoint or DNS | -| `controlPlaneEndpointsConfig.dnsEndpointConfig.allowExternalTraffic` | `true` | Day-0 | Allows DNS-based access from outside VPC | -| `networkConfig.datapathProvider` | `ADVANCED_DATAPATH` (Dataplane V2) | Day-0 | eBPF-based, built-in Network Policy | -| `networkConfig.dnsConfig.clusterDns` | `CLOUD_DNS` | Day-0 | Managed DNS, more reliable than kube-dns | -| `networkConfig.enableIntraNodeVisibility` | `true` | Day-1 | VPC Flow Logs for intra-node traffic | -| `networkConfig.gatewayApiConfig.channel` | `CHANNEL_STANDARD` | Day-1 | Gateway API support | -| `ipAllocationPolicy.autoIpamConfig.enabled` | `true` | Day-0 | Automatic IP range management | -| `ipAllocationPolicy.createSubnetwork` | `true` | Day-0 | Auto-create dedicated subnet | -| `defaultMaxPodsConstraint.maxPodsPerNode` | `48` | Day-0 | Conservative default; 110 for high density | - -## Private Cluster Access Patterns - -The golden path creates a private cluster. Users access it via: - -1. **DNS endpoint (default)**: `allowExternalTraffic: true` enables access via the cluster's DNS endpoint from outside the VPC. No VPN required. -2. **Private endpoint**: Direct access from within the VPC or via Cloud VPN/Interconnect. -3. **Authorized networks**: Add specific CIDRs to `masterAuthorizedNetworksConfig` for IP-based access control. 
- -```bash -# Access private cluster via DNS endpoint (golden path default) -gcloud container clusters get-credentials \ - --region --dns-endpoint \ - --quiet - -# Access via private endpoint (from within VPC) -gcloud container clusters get-credentials \ - --region --internal-ip \ - --quiet -``` - -## Bring-Your-Own VPC/Subnet - -If the customer has existing network infrastructure: - -```bash -gcloud container clusters create-auto \ - --region \ - --network \ - --subnetwork \ - --cluster-secondary-range-name \ - --services-secondary-range-name \ - --enable-private-nodes \ - --enable-master-authorized-networks \ - --quiet -``` - -> **Day-0 Warning**: VPC, subnet, and IP ranges cannot be changed after cluster creation. - -## IP Planning - -| Resource | Golden Path | Notes | -|----------|-------------|-------| -| Pod CIDR | `/17` (auto) | ~32K pod IPs; size based on maxPodsPerNode | -| Service CIDR | `/20` (auto) | ~4K service IPs | -| Node subnet | auto-created | /20 recommended for growth | -| Max pods/node | 48 | Each node gets a /25 pod range; set to 110 for /24 per node | - -**Pod CIDR sizing rule of thumb:** -- `maxPodsPerNode=48` -> each node uses a `/25` (128 IPs) from pod CIDR -- `maxPodsPerNode=110` -> each node uses a `/24` (256 IPs) from pod CIDR -- Larger maxPodsPerNode = fewer nodes fit in a given CIDR - -## Ingress - -**Gateway API** (golden path, enabled via `gatewayApiConfig.channel: CHANNEL_STANDARD`): - -```yaml -apiVersion: gateway.networking.k8s.io/v1 -kind: Gateway -metadata: - name: external-http -spec: - gatewayClassName: gke-l7-global-external-managed - listeners: - - name: http - protocol: HTTP - port: 80 -``` - -**Alternatives:** -- `gke-l7-regional-external-managed` — regional external -- `gke-l7-rilb` — internal load balancer -- Istio service mesh — for advanced traffic management, mTLS - -## Egress - -- Default: nodes use Cloud NAT for outbound internet access (private nodes have no public IPs) -- For static egress IPs: configure Cloud 
NAT with manual IP allocation -- For restricted egress: route through a firewall appliance via custom routes - -## Network Policy - -Dataplane V2 (golden path) provides built-in Network Policy enforcement — no additional addon needed. Apply default-deny per namespace, then allow specific flows. - -> See [gke-security.md](./gke-security.md) for default-deny policy and [gke-multitenancy.md](./gke-multitenancy.md) for per-team allow policies. - -## Cloud Armor (Recommended for Public-Facing Services) - -Cloud Armor provides WAF and DDoS protection. **Not a golden path default** — recommended for any service with public ingress. Link via `BackendConfig`: - -```yaml -# 1. Create BackendConfig referencing your Cloud Armor policy -apiVersion: cloud.google.com/v1 -kind: BackendConfig -metadata: - name: my-backend-config -spec: - securityPolicy: - name: my-cloud-armor-policy ---- -# 2. Annotate your Service -# cloud.google.com/backend-config: '{"default": "my-backend-config"}' -``` - -## SSL, Container-Native LB, and PSC - -- **Google-managed SSL certificates**: Use `ManagedCertificate` CRD with Gateway API. Auto-provisions and renews. -- **Container-native LB**: Enabled by default on VPC-native clusters (golden path). Targets pods via NEGs, bypassing iptables. Annotation: `cloud.google.com/neg: '{"ingress": true}'`. -- **Private Service Connect (PSC)**: Use `ServiceAttachment` CRD to expose services across VPCs without peering. - diff --git a/skills/cloud/gke-basics/references/gke-observability.md b/skills/cloud/gke-basics/references/gke-observability.md deleted file mode 100644 index 9b940a2041..0000000000 --- a/skills/cloud/gke-basics/references/gke-observability.md +++ /dev/null @@ -1,168 +0,0 @@ -# GKE Observability - -This reference covers monitoring, logging, and metrics configuration for GKE. The golden path enables comprehensive observability including control-plane metrics. 
- -> **MCP Tools:** `get_cluster`, `list_k8s_events`, `get_k8s_logs`, `get_k8s_cluster_info`, `describe_k8s_resource`. **CLI-only:** `gcloud container clusters update --monitoring=...`, `gcloud logging read` - -## Golden Path Observability Defaults - -| Setting | Golden Path Value | Notes | -|---------|-------------------|-------| -| `loggingConfig` components | SYSTEM_COMPONENTS, WORKLOADS | Full workload logging | -| `monitoringConfig` components | SYSTEM_COMPONENTS, STORAGE, POD, DEPLOYMENT, STATEFULSET, DAEMONSET, HPA, JOBSET, CADVISOR, KUBELET, DCGM, APISERVER, SCHEDULER, CONTROLLER_MANAGER | Full suite including control-plane | -| `managedPrometheusConfig.enabled` | `true` | Google-managed Prometheus | -| `advancedDatapathObservabilityConfig.enableMetrics` | `true` | Dataplane V2 flow metrics | -| `loggingService` | `logging.googleapis.com/kubernetes` | Cloud Logging | -| `monitoringService` | `monitoring.googleapis.com/kubernetes` | Cloud Monitoring | - -### Control-Plane Metrics (Golden Path Addition) - -The golden path adds three control-plane monitoring components not present in default clusters: - -| Component | What It Monitors | -|-----------|-----------------| -| `APISERVER` | API server request latency, error rates, admission webhook performance | -| `SCHEDULER` | Scheduling latency, pending pods, scheduling failures | -| `CONTROLLER_MANAGER` | Controller work queue depth, reconciliation latency | - -These are critical for diagnosing cluster-level issues (slow API responses, scheduling delays, stuck controllers). 
- -## Enabling Full Monitoring - -```bash -# Enable golden path monitoring suite -gcloud container clusters update --region \ - --monitoring=SYSTEM,API_SERVER,SCHEDULER,CONTROLLER_MANAGER,STORAGE,POD,DEPLOYMENT,STATEFULSET,DAEMONSET,HPA,CADVISOR,KUBELET,DCGM \ - --quiet - -# Enable Managed Prometheus -gcloud container clusters update --region \ - --enable-managed-prometheus \ - --quiet - -# Enable Dataplane V2 observability metrics -gcloud container clusters update --region \ - --enable-dataplane-v2-flow-observability \ - --quiet -``` - -## Managed Prometheus - -Golden path enables Google Managed Prometheus for metrics collection and querying. - -**Querying metrics:** -- Use Cloud Monitoring Metrics Explorer in the console -- Use PromQL via the Prometheus UI or API -- Grafana dashboards via Managed Grafana - -**Key GKE metrics:** - -| Metric | Source | Use | -|--------|--------|-----| -| `container_cpu_usage_seconds_total` | cAdvisor | Pod CPU usage | -| `container_memory_working_set_bytes` | cAdvisor | Pod memory usage | -| `kube_pod_status_phase` | kube-state-metrics | Pod lifecycle | -| `apiserver_request_duration_seconds` | API Server | Control plane latency | -| `scheduler_scheduling_duration_seconds` | Scheduler | Scheduling performance | -| `node_cpu_seconds_total` | Kubelet | Node CPU | -| `DCGM_FI_DEV_GPU_UTIL` | DCGM | GPU utilization | - -## Live Resource Usage (kubectl-only) - -No MCP or gcloud equivalent exists for live resource usage. 
Use `kubectl top`: - -```bash -kubectl top pods --all-namespaces --sort-by=cpu -kubectl top nodes -kubectl top pods --containers -n # per-container breakdown -``` - -## Cloud Logging (gcloud-only) - -**Querying cluster logs** (no MCP equivalent — use `gcloud logging read`): - -```bash -# System component logs -gcloud logging read \ - 'resource.type="k8s_cluster" AND resource.labels.cluster_name=""' \ - --project --limit 50 \ - --quiet - -# Workload logs for a specific namespace -gcloud logging read \ - 'resource.type="k8s_container" AND resource.labels.cluster_name="" AND resource.labels.namespace_name=""' \ - --project --limit 50 \ - --quiet - -# Audit logs (who did what) -gcloud logging read \ - 'resource.type="k8s_cluster" AND logName:"cloudaudit.googleapis.com"' \ - --project --limit 50 \ - --quiet -``` - -## Diagnostic Settings - -For security monitoring and troubleshooting, enable control-plane audit logs: - -```bash -# View current logging config -gcloud container clusters describe --region \ - --format="yaml(loggingConfig)" \ - --quiet -``` - -## Alerting - -Set up alerts for critical conditions: - -| Condition | Metric | Threshold | -|-----------|--------|-----------| -| High API server latency | `apiserver_request_duration_seconds` | P99 > 5s | -| Pod crash loops | `kube_pod_container_status_restarts_total` | > 5 in 10min | -| Node not ready | `kube_node_status_condition` | condition=Ready, status!=True | -| High GPU utilization | `DCGM_FI_DEV_GPU_UTIL` | > 95% sustained | -| PVC near capacity | `kubelet_volume_stats_used_bytes / capacity` | > 85% | -| Scheduling failures | `scheduler_schedule_attempts_total{result="error"}` | > 0 | - -## Cost Considerations - -Monitoring and logging have associated costs: - -- **Cloud Logging**: Charged per GiB ingested beyond free tier (50 GiB/project/month) -- **Cloud Monitoring**: Free for GKE system metrics; custom metrics charged per time series -- **Managed Prometheus**: Charged per samples ingested - -To reduce 
costs in non-production: -```bash -# Reduce to system-only monitoring -gcloud container clusters update --region \ - --monitoring=SYSTEM \ - --quiet -``` - -## Distributed Tracing & Continuous Profiling (Recommended) - -**Not golden path defaults** — recommended for production microservice architectures and performance-sensitive workloads. - -- **Cloud Trace**: Add OpenTelemetry SDK to your app with the `opentelemetry-operations-go` (or equivalent) exporter. Traces appear in Cloud Trace console. Identifies cross-service latency bottlenecks. -- **Cloud Profiler**: Add the Cloud Profiler agent to your app. Profiles CPU and memory usage in production with low overhead. Identifies hotspots and compares across versions. - -## LQL Query Examples - -Common Logging Query Language patterns for GKE troubleshooting: - -``` -# Error logs for a specific container -resource.type="k8s_container" AND resource.labels.container_name="my-app" AND severity>=ERROR - -# OOMKilled events -resource.type="k8s_event" AND jsonPayload.reason="OOMKilling" - -# Pod scheduling failures -resource.type="k8s_event" AND jsonPayload.reason="FailedScheduling" - -# Audit logs (who did what) -resource.type="k8s_cluster" AND logName:"cloudaudit.googleapis.com" -``` - diff --git a/skills/cloud/gke-basics/references/gke-reliability.md b/skills/cloud/gke-basics/references/gke-reliability.md deleted file mode 100644 index 8b2f3129b6..0000000000 --- a/skills/cloud/gke-basics/references/gke-reliability.md +++ /dev/null @@ -1,169 +0,0 @@ -# GKE Reliability - -This reference covers high availability and reliability configuration for GKE clusters and workloads. 
- -> **MCP Tools:** `get_cluster`, `get_k8s_resource`, `describe_k8s_resource`, `apply_k8s_manifest`, `list_k8s_events` - -## Golden Path Reliability Defaults - -| Setting | Golden Path Value | Notes | -|---------|-------------------|-------| -| Cluster type | Regional (4 zones: us-central1-a/b/c/f) | Control plane replicated across zones | -| Upgrade strategy | SURGE (`maxSurge: 1`) | Rolling upgrades with extra capacity | -| Auto-repair | `true` | Unhealthy nodes replaced automatically | -| Auto-upgrade | `true` | Nodes follow control plane version | -| Release channel | REGULAR | Balanced freshness and stability | -| Stateful HA | Enabled | Leader election for stateful workloads | - -## Workflows - -### 1. Verify Cluster High Availability - -``` -# MCP (preferred) -get_cluster(name="projects//locations//clusters/", - readMask="location,locations,nodePools.locations") - -# gcloud fallback -gcloud container clusters describe --region \ - --format="json(location, locations)" \ - --quiet -``` - -- If `location` is a region (e.g., `us-central1`), the control plane is regional -- If `locations` has multiple entries, nodes span multiple zones - -### 2. Pod Disruption Budgets (PDBs) - -PDBs ensure minimum pod availability during voluntary disruptions (node upgrades, autoscaler scale-down). - -**Check existing PDBs:** - -``` -# MCP (preferred) -get_k8s_resource(parent="...", resourceType="poddisruptionbudget") - -# kubectl fallback -kubectl get pdb --all-namespaces -``` - -**Create PDB:** - -```yaml -apiVersion: policy/v1 -kind: PodDisruptionBudget -metadata: - name: my-app-pdb - namespace: default -spec: - minAvailable: 2 # Or use maxUnavailable: 1 - selector: - matchLabels: - app: my-app -``` - -> Every production Deployment with 2+ replicas should have a PDB. - -### 3. Health Probes - -Every production container should have liveness and readiness probes. Startup probes are recommended for slow-starting apps. 
- -**Check existing probes:** - -``` -# MCP (preferred) -describe_k8s_resource(parent="...", resourceType="deployment", name="", namespace="") - -# kubectl fallback -kubectl get deployment -n -o yaml | grep -E "livenessProbe|readinessProbe|startupProbe" -``` - -**Recommended probe configuration:** - -```yaml -spec: - containers: - - name: app - livenessProbe: - httpGet: - path: /healthz - port: 8080 - initialDelaySeconds: 15 - periodSeconds: 10 - failureThreshold: 3 - readinessProbe: - httpGet: - path: /readyz - port: 8080 - initialDelaySeconds: 5 - periodSeconds: 5 - failureThreshold: 3 - startupProbe: # For slow-starting apps - httpGet: - path: /healthz - port: 8080 - initialDelaySeconds: 10 - periodSeconds: 5 - failureThreshold: 30 # 30 * 5s = 150s max startup time -``` - -- **Readiness**: Determines when a pod can accept traffic -- **Liveness**: Determines when to restart a container -- **Startup**: Disables liveness/readiness until the app is ready (prevents premature restarts) - -### 4. Graceful Shutdown - -Ensure applications handle `SIGTERM` and drain in-flight requests: - -```yaml -spec: - terminationGracePeriodSeconds: 30 # Default; increase for long-running requests - containers: - - name: app - lifecycle: - preStop: - exec: - command: ["/bin/sh", "-c", "sleep 5"] # Allow LB to deregister -``` - -### 5. Topology Spread Constraints - -Distribute pods across zones and nodes to survive failures: - -```yaml -spec: - topologySpreadConstraints: - - maxSkew: 1 - topologyKey: topology.kubernetes.io/zone - whenUnsatisfiable: DoNotSchedule - labelSelector: - matchLabels: - app: my-app - - maxSkew: 1 - topologyKey: kubernetes.io/hostname - whenUnsatisfiable: ScheduleAnyway - labelSelector: - matchLabels: - app: my-app -``` - -- **Zone spread** (`DoNotSchedule`): Hard requirement -- pods must be balanced across zones -- **Node spread** (`ScheduleAnyway`): Best-effort -- prefer distribution but don't block scheduling - -### 6. 
Replicas - -| Workload Type | Minimum Replicas | Reason | -|--------------|-----------------|--------| -| Stateless web/API | 2 | Survive single pod/node failure | -| Critical services | 3 | Survive zone failure with zone spread | -| Stateful (databases) | 3 (with replication) | Application-level quorum | -| Batch/jobs | 1 | Ephemeral by nature | - -## Best Practices - -1. **Regional clusters for production**: Always use regional clusters to survive zone failures -2. **PDBs for everything**: Every production workload with 2+ replicas needs a PDB -3. **Probes for all containers**: At minimum, readiness probes on every production container -4. **Zone spreading**: Use topology spread constraints to distribute pods across failure domains -5. **Graceful shutdown**: Handle SIGTERM and set appropriate `terminationGracePeriodSeconds` -6. **Maintenance windows**: Schedule upgrades during low-traffic periods (see [gke-upgrades.md](./gke-upgrades.md)) diff --git a/skills/cloud/gke-basics/references/gke-scaling.md b/skills/cloud/gke-basics/references/gke-scaling.md deleted file mode 100644 index 2ce2a6dbb9..0000000000 --- a/skills/cloud/gke-basics/references/gke-scaling.md +++ /dev/null @@ -1,149 +0,0 @@ -# GKE Workload Scaling - -This reference covers scaling workloads on GKE. The golden path enables VPA, OPTIMIZE_UTILIZATION autoscaling profile, and Node Auto Provisioning by default. 
- -> **MCP Tools:** `get_k8s_resource`, `describe_k8s_resource`, `apply_k8s_manifest`, `patch_k8s_resource`, `get_cluster`, `update_cluster`, `update_node_pool` - -## Golden Path Scaling Defaults - -| Setting | Golden Path Value | Notes | -|---------|-------------------|-------| -| `autoscaling.autoscalingProfile` | `OPTIMIZE_UTILIZATION` | Aggressive scale-down for cost savings | -| `verticalPodAutoscaling.enabled` | `true` | VPA recommendations available | -| `autoscaling.enableNodeAutoprovisioning` | `true` | NAP creates node pools on demand | -| GPU resource limits (T4, A100) | `1000000000` each | NAP can provision GPU nodes | - -## Scaling Mechanisms - -### 1. Manual Scaling - -> **kubectl-only** — no MCP equivalent for `kubectl scale`. Use kubectl directly. - -```bash -kubectl scale deployment --replicas= -n -``` - -### 2. Horizontal Pod Autoscaling (HPA) - -Scales the number of pods based on metrics. - -**Quick setup (kubectl-only — no MCP equivalent for `kubectl autoscale`):** - -```bash -kubectl autoscale deployment --cpu-percent=50 --min=1 --max=10 -``` - -**Manifest approach (recommended — use MCP `apply_k8s_manifest`):** - -See [assets/hpa-example.yaml](../assets/hpa-example.yaml) for a template. - -```yaml -apiVersion: autoscaling/v2 -kind: HorizontalPodAutoscaler -metadata: - name: -hpa -spec: - scaleTargetRef: - apiVersion: apps/v1 - kind: Deployment - name: - minReplicas: 1 - maxReplicas: 10 - metrics: - - type: Resource - resource: - name: cpu - target: - type: Utilization - averageUtilization: 50 -``` - -### 3. Vertical Pod Autoscaling (VPA) - -Adjusts CPU and memory requests to match actual usage. Enabled by default on golden path. 
- -**Update modes:** -- `Off` — recommendations only (safest, start here) -- `Initial` — sets resources only at pod creation -- `Auto` — restarts pods to apply new resource values -- `InPlaceOrRecreate` — updates resources without restart when possible (GKE 1.34+) - -**Create VPA in recommendation mode:** - -```yaml -apiVersion: autoscaling.k8s.io/v1 -kind: VerticalPodAutoscaler -metadata: - name: -vpa -spec: - targetRef: - apiVersion: apps/v1 - kind: Deployment - name: - updatePolicy: - updateMode: "Off" -``` - -**Read recommendations (prefer MCP `describe_k8s_resource`):** - -``` -# MCP (preferred) -describe_k8s_resource(parent="...", resourceType="verticalpodautoscaler", name="-vpa", namespace="") - -# kubectl fallback -kubectl get vpa -vpa -o jsonpath='{.status.recommendation}' -``` - -See [assets/vpa-example.yaml](../assets/vpa-example.yaml) for a full template. - -### 4. Cluster Autoscaler / Node Auto Provisioning (NAP) - -On Autopilot (golden path), node scaling is fully managed. NAP automatically creates and sizes node pools based on workload demands. - -**For Standard clusters:** - -```bash -# Enable cluster autoscaler on a node pool -gcloud container clusters update --region \ - --enable-autoscaling --node-pool \ - --min-nodes --max-nodes \ - --quiet - -# Enable NAP -gcloud container clusters update --region \ - --enable-autoprovisioning \ - --min-cpu --max-cpu \ - --min-memory --max-memory \ - --quiet -``` - -**Autoscaling profiles:** - -| Profile | Behavior | Golden Path? | -|---------|----------|-------------| -| `BALANCED` | Default GKE; conservative scale-down | No | -| `OPTIMIZE_UTILIZATION` | Aggressive scale-down; lower idle resources | **Yes** | - -## Best Practices - -1. **Define resource requests**: HPA and VPA rely on accurate requests. Always set them. -2. **Avoid metric conflicts**: Do not use HPA and VPA on the same metric. Typical pattern: HPA on CPU, VPA on memory. -3. 
**Pod Disruption Budgets**: Define PDBs for all production workloads to ensure availability during scaling events. -4. **HPA stabilization**: HPA applies a default 5-minute stabilization window to scale-down only; scale-up has no stabilization window by default. Tune `behavior` for faster response if needed. -5. **VPA "Auto" caution**: Auto mode restarts pods. Ensure your app handles SIGTERM gracefully. VPA requires at least 2 replicas for evictions by default. -6. **Use ComputeClasses**: For workload-specific node targeting (Spot fallback, GPU, specific machine families), use ComputeClasses instead of node selectors. - -## Rightsizing Workflow - -1. Deploy VPA in `Off` mode for 24+ hours -2. Read recommendations: `kubectl describe vpa ` -3. Compare `target` values against current `requests` -4. Apply with 20% buffer: `new_request = target * 1.2` -5. Use patch format to update Deployment - -| Condition | Recommendation | Risk | -|-----------|----------------|------| -| CPU request >5x P95 actual | Reduce to `P95 * 1.2` | Medium | -| Memory request >3x P95 actual | Reduce to `P95 * 1.2` | Medium | -| CPU request >2x P95 actual | Rightsizing with 20% buffer | Low | -| No resource limits set | Add limits to prevent noisy-neighbor | Low | diff --git a/skills/cloud/gke-basics/references/gke-security.md b/skills/cloud/gke-basics/references/gke-security.md deleted file mode 100644 index d4699ca5a6..0000000000 --- a/skills/cloud/gke-basics/references/gke-security.md +++ /dev/null @@ -1,226 +0,0 @@ -# GKE Security - -This reference covers security configuration for GKE clusters. The golden path enforces a hardened security posture by default.
- -> **MCP Tools:** `get_cluster`, `check_k8s_auth`, `get_k8s_resource`, `apply_k8s_manifest`, `update_cluster` - -## Golden Path Security Defaults - -| Setting | Golden Path Value | Day-0/1 | Notes | -|---------|-------------------|---------|-------| -| `workloadIdentityConfig.workloadPool` | `.svc.id.goog` | Day-0 | Workload Identity Federation for Pods | -| `secretManagerConfig.enabled` | `true` | Day-1 | Google Secret Manager integration | -| `secretManagerConfig.rotationConfig` | `enabled: true, rotationInterval: 120s` | Day-1 | Automatic secret rotation | -| `rbacBindingConfig.enableInsecureBindingSystemAuthenticated` | `false` | Day-0 | Blocks legacy `system:authenticated` bindings | -| `rbacBindingConfig.enableInsecureBindingSystemUnauthenticated` | `false` | Day-0 | Blocks legacy `system:unauthenticated` bindings | -| `nodeConfig.shieldedInstanceConfig.enableSecureBoot` | `true` | Day-0 | Verifiable boot integrity | -| `nodeConfig.shieldedInstanceConfig.enableIntegrityMonitoring` | `true` | Day-0 | Runtime integrity checks | -| `nodeConfig.workloadMetadataConfig.mode` | `GKE_METADATA` | Day-0 | Blocks legacy metadata API, enforces Workload Identity | -| Private cluster + Dataplane V2 settings | See [gke-networking.md](./gke-networking.md) | Day-0 | Private nodes, private endpoint enforcement, ADVANCED_DATAPATH | - -## Workload Identity Federation - -Workload Identity is the recommended way for pods to access Google Cloud APIs. It eliminates the need for static service account keys. - -### Setup - -```bash -# 1. Create a Google Service Account (GSA) -gcloud iam service-accounts create \ - --project \ - --display-name "Workload Identity SA" \ - --quiet - -# 2. Grant IAM roles to the GSA -gcloud projects add-iam-policy-binding \ - --member "serviceAccount:@.iam.gserviceaccount.com" \ - --role "" \ - --quiet - -# 3. Create Kubernetes Service Account (KSA) -kubectl create namespace -kubectl create serviceaccount --namespace - -# 4. 
Bind KSA to GSA -gcloud iam service-accounts add-iam-policy-binding \ - @.iam.gserviceaccount.com \ - --role roles/iam.workloadIdentityUser \ - --member "serviceAccount:.svc.id.goog[/]" \ - --quiet - -# 5. Annotate KSA -kubectl annotate serviceaccount \ - --namespace \ - iam.gke.io/gcp-service-account=@.iam.gserviceaccount.com -``` - -> See [assets/workload-identity-pod.yaml](../assets/workload-identity-pod.yaml) for a test pod. - -### Verification - -```bash -kubectl run workload-identity-test \ - --image=gcr.io/google.com/cloudsdktool/cloud-sdk:slim \ - --serviceaccount= --namespace= \ - --rm -it -- gcloud auth list --quiet -``` - -## Secret Manager Integration - -The golden path enables Secret Manager with automatic rotation. Secrets are synced to Kubernetes Secrets. - -```bash -# Verify Secret Manager is enabled on cluster -gcloud container clusters describe --region \ - --format="value(secretManagerConfig.enabled)" \ - --quiet - -# Enable if not already (Day-1 change) -gcloud container clusters update --region \ - --enable-secret-manager \ - --secret-manager-rotation-interval=120s \ - --quiet -``` - -## RBAC Hardening - -The golden path disables insecure legacy RBAC bindings that grant broad access to `system:authenticated` and `system:unauthenticated` groups. 
- -```bash -# Verify insecure bindings are disabled -gcloud container clusters describe --region \ - --format="yaml(rbacBindingConfig)" \ - --quiet -``` - -**Best practices for RBAC:** -- Use namespace-scoped Roles over cluster-wide ClusterRoles -- Bind to specific Groups or ServiceAccounts, never to `system:authenticated` -- Audit permissions via MCP: `check_k8s_auth(parent="...", verb="list", resourceType="pods", namespace="...")` (or `kubectl auth can-i --list --as=`) -- Review bindings via MCP: `get_k8s_resource(parent="...", resourceType="clusterrolebinding")` (or `kubectl get clusterrolebindings,rolebindings --all-namespaces`) - -> See [gke-multitenancy.md](./gke-multitenancy.md) for enterprise RBAC planning and https://docs.cloud.google.com/kubernetes-engine/docs/best-practices/rbac - -## Binary Authorization - -Not enabled in golden path by default but recommended for production image provenance: - -```bash -# Enable Binary Authorization -gcloud container clusters update --region \ - --binauthz-evaluation-mode=PROJECT_SINGLETON_POLICY_ENFORCE \ - --quiet -``` - -## Network Policies - -Dataplane V2 (golden path) provides built-in Network Policy enforcement. Apply default-deny per namespace: - -``` -# MCP (preferred) -apply_k8s_manifest(parent="...", yamlManifest="") - -# kubectl fallback -kubectl apply -f skills/gke/assets/default-deny-netpol.yaml -n -``` - -## GKE Sandbox (gVisor) - -For running untrusted workloads in an isolated sandbox: - -```bash -# Enable on cluster (Standard clusters) -gcloud container clusters update --region --enable-gke-sandbox --quiet - -# Use in pod spec -# Add: runtimeClassName: gvisor -``` - -## Pod Security Standards (Golden Path) - -Pod Security Standards define three profiles that restrict what pods can do. The **`restricted` profile is the golden path default** for production namespaces. 
- -| Profile | Level | Use Case | -|---------|-------|----------| -| `privileged` | Unrestricted | System namespaces (`kube-system`), infrastructure controllers | -| `baseline` | Minimally restrictive | Shared/dev namespaces, legacy apps being migrated | -| `restricted` | **Golden path** | Production workloads -- blocks privilege escalation, host access, root | - -**Enforce via namespace labels (Pod Security Admission):** - -```yaml -apiVersion: v1 -kind: Namespace -metadata: - name: production - labels: - pod-security.kubernetes.io/enforce: restricted - pod-security.kubernetes.io/warn: restricted - pod-security.kubernetes.io/audit: restricted -``` - -**Gradual rollout strategy:** -1. Start with `warn` + `audit` on existing namespaces to identify violations -2. Fix non-compliant workloads (remove `privileged`, `hostNetwork`, root user, etc.) -3. Enable `enforce` once all workloads pass - -`restricted` blocks: running as root, privilege escalation, host networking/PID/IPC, host path volumes, and most capabilities. The golden path `workload-identity-pod.yaml` already complies. - -## Network Policy Logging (Recommended) - -With Dataplane V2 (golden path), you can enable logging for Network Policy decisions. **Not a golden path default** -- recommended for security auditing. - -```bash -gcloud container clusters update --region \ - --enable-network-policy-logging \ - --quiet -``` - -This logs allowed and denied connections, useful for troubleshooting Network Policy rules and auditing traffic flows. 
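On Dataplane V2 clusters, which connections get logged is usually controlled by a cluster-scoped `NetworkLogging` object. The manifest below is a minimal sketch, assuming the cluster serves the `networking.gke.io/v1alpha1` API and that the object is named `default`; confirm both with `kubectl api-resources` before applying.

```yaml
# Sketch: log denied connections only, which keeps log volume low.
# Assumption: the NetworkLogging CRD is available (Dataplane V2 cluster).
apiVersion: networking.gke.io/v1alpha1
kind: NetworkLogging
metadata:
  name: default
spec:
  cluster:
    allow:
      log: false        # do not log connections that policies allow
      delegate: false
    deny:
      log: true         # log connections that a NetworkPolicy denies
      delegate: false
```

Logging only denials is a reasonable starting point for auditing. Turning on `allow.log` as well can add considerable log volume on busy clusters, so it is probably best enabled briefly while debugging a specific policy.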
- -## Common IAM Roles - -The five most common predefined IAM roles for GKE: - -| Role | Purpose | When to Use | -|------|---------|-------------| -| `roles/container.admin` | Full control over clusters and Kubernetes resources | Platform team admins managing cluster lifecycle | -| `roles/container.clusterAdmin` | Manage clusters but not project-level IAM | Cluster operators who create/delete clusters | -| `roles/container.developer` | Deploy workloads (pods, services, deployments) | Application developers deploying to existing clusters | -| `roles/container.viewer` | Read-only access to clusters and Kubernetes resources | Monitoring, auditing, or read-only dashboards | -| `roles/container.clusterViewer` | List and get cluster details only | CI/CD pipelines that need cluster metadata | - -> **Principle of least privilege**: Start with `roles/container.viewer` or `roles/container.developer` and escalate only as needed. Avoid granting `roles/container.admin` broadly. - -## Service Accounts & Agents - -- **GKE Service Agent** (`service-@container-engine-robot.iam.gserviceaccount.com`): Automatically created. Manages nodes, networking, and cluster operations on your behalf. Do not remove or modify its permissions. -- **Node Service Account**: By default, nodes use the Compute Engine default service account. For production, create a dedicated SA with minimal permissions and assign it via node pool config. -- **Workload Identity**: The recommended way for pods to access Google Cloud APIs. Maps a Kubernetes ServiceAccount to a Google IAM ServiceAccount — see [Workload Identity setup](#workload-identity-federation) above. 
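The KSA-to-GSA mapping described in the last bullet can also be written declaratively. This is a minimal sketch: the names `app-ksa`, `app-ns`, `app-gsa`, `my-project`, and the container image are hypothetical placeholders, and the GSA is assumed to already carry the `roles/iam.workloadIdentityUser` binding for this KSA.

```yaml
# Sketch: a Kubernetes ServiceAccount mapped to a Google service account.
# All names below are hypothetical placeholders.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-ksa
  namespace: app-ns
  annotations:
    iam.gke.io/gcp-service-account: app-gsa@my-project.iam.gserviceaccount.com
---
# A Pod opts in by running as that ServiceAccount.
apiVersion: v1
kind: Pod
metadata:
  name: app
  namespace: app-ns
spec:
  serviceAccountName: app-ksa
  containers:
    - name: app
      image: us-docker.pkg.dev/my-project/my-repo/app:1.0.0  # hypothetical image
```

Keeping the annotation in a manifest rather than applying it with `kubectl annotate` lets the mapping be reviewed and versioned alongside the workload.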
- -## Cross-Service Authentication Patterns - -Common patterns for granting GKE workloads access to other Google Cloud services: - -```bash -# Grant a GKE workload access to Cloud Storage -gcloud projects add-iam-policy-binding \ - --member "serviceAccount:@.iam.gserviceaccount.com" \ - --role "roles/storage.objectViewer" \ - --quiet - -# Grant a GKE workload access to Cloud SQL -gcloud projects add-iam-policy-binding \ - --member "serviceAccount:@.iam.gserviceaccount.com" \ - --role "roles/cloudsql.client" \ - --quiet - -# Grant a GKE workload access to Pub/Sub -gcloud projects add-iam-policy-binding \ - --member "serviceAccount:@.iam.gserviceaccount.com" \ - --role "roles/pubsub.subscriber" \ - --quiet -``` - -In all cases, the GSA must be bound to a KSA via Workload Identity (see setup above). The pod then uses the KSA to authenticate as the GSA. - diff --git a/skills/cloud/gke-basics/references/gke-storage.md b/skills/cloud/gke-basics/references/gke-storage.md deleted file mode 100644 index 3b96e61cf0..0000000000 --- a/skills/cloud/gke-basics/references/gke-storage.md +++ /dev/null @@ -1,136 +0,0 @@ -# GKE Storage - -This reference covers storage configuration for GKE clusters including persistent disks, file storage, and cloud storage integration. 
- -> **MCP Tools:** `apply_k8s_manifest`, `get_k8s_resource`, `describe_k8s_resource`, `get_cluster` - -## Golden Path Storage Defaults - -The golden path Autopilot config enables these CSI drivers: - -| Driver | Golden Path | Access Mode | Use Case | -|--------|-------------|-------------|----------| -| Compute Engine Persistent Disk CSI | Enabled (default) | ReadWriteOnce | Block storage for databases, single-pod workloads | -| Google Cloud Filestore CSI | Enabled | ReadWriteMany | Shared NFS for multi-pod access | -| Cloud Storage FUSE CSI | Enabled | ReadWriteMany / ReadOnlyMany | Mount GCS buckets as volumes | -| Parallelstore CSI | Enabled | ReadWriteMany | High-performance parallel file system | -| Boot disk type | `pd-balanced` | N/A | Node boot disks | - -## StorageClasses - -### Default StorageClasses - -GKE provides built-in StorageClasses: - -| StorageClass | Disk Type | Use Case | -|-------------|-----------|----------| -| `standard-rwo` | `pd-balanced` | General-purpose, balanced cost and performance | -| `premium-rwo` | `pd-ssd` | High IOPS, databases | -| `standard-rwx` | Filestore (Basic HDD) | Shared NFS | -| `premium-rwx` | Filestore (Basic SSD) | Shared NFS, higher performance | - -### Custom StorageClass - -```yaml -apiVersion: storage.k8s.io/v1 -kind: StorageClass -metadata: - name: fast-regional -provisioner: pd.csi.storage.gke.io -parameters: - type: pd-ssd - replication-type: regional-pd # Replicate across 2 zones -volumeBindingMode: WaitForFirstConsumer -allowVolumeExpansion: true # Always enable for production -``` - -## PersistentVolumeClaims - -### Block Storage (ReadWriteOnce) - -```yaml -apiVersion: v1 -kind: PersistentVolumeClaim -metadata: - name: database-pvc -spec: - accessModes: - - ReadWriteOnce - storageClassName: premium-rwo - resources: - requests: - storage: 100Gi -``` - -### Shared File Storage (ReadWriteMany via Filestore) - -```yaml -apiVersion: v1 -kind: PersistentVolumeClaim -metadata: - name: shared-data -spec: - accessModes: - -
ReadWriteMany - storageClassName: standard-rwx - resources: - requests: - storage: 1Ti # Filestore minimum is 1 TiB for Basic tier -``` - -### GCS Bucket Mount (Cloud Storage FUSE) - -Mount a GCS bucket as a volume without a PVC: - -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: gcs-reader - annotations: - gke-gcsfuse/volumes: "true" -spec: - containers: - - name: reader - image: busybox - command: ["ls", "/data"] - volumeMounts: - - name: gcs-bucket - mountPath: /data - volumes: - - name: gcs-bucket - csi: - driver: gcsfuse.csi.storage.gke.io - readOnly: true - volumeAttributes: - bucketName: -``` - -> Requires Workload Identity for the pod's service account to have `storage.objectViewer` on the bucket. - -## Volume Expansion - -If `allowVolumeExpansion: true` is set on the StorageClass, resize by updating the PVC: - -```bash -# kubectl -kubectl patch pvc -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}' -``` - -``` -# MCP (preferred) -patch_k8s_resource(parent="...", resourceType="persistentvolumeclaim", name="", - patch='{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}') -``` - -Kubernetes automatically resizes the filesystem. - -## Best Practices - -1. **Always enable volume expansion**: Set `allowVolumeExpansion: true` on all StorageClasses -2. **Use regional PDs for production**: `replication-type: regional-pd` replicates across 2 zones for HA -3. **Use `WaitForFirstConsumer`**: Ensures the PV is provisioned in the same zone as the pod -4. **Choose the right disk type**: `pd-ssd` for databases, `pd-balanced` (golden path default) for general use, `pd-standard` for cold storage -5. **Use Filestore for shared access**: When multiple pods need to read/write the same files -6. **Use GCS FUSE for data pipelines**: Mount buckets directly for ML training data, logs, etc. -7. 
**Back up PVCs**: Use Backup for GKE (see [gke-backup-dr.md](./gke-backup-dr.md)) to protect persistent data diff --git a/skills/cloud/gke-basics/references/gke-upgrades.md b/skills/cloud/gke-basics/references/gke-upgrades.md deleted file mode 100644 index 91e1a5ba90..0000000000 --- a/skills/cloud/gke-basics/references/gke-upgrades.md +++ /dev/null @@ -1,142 +0,0 @@ -# GKE Upgrades & Maintenance - -This reference covers upgrade strategy, maintenance windows, and release channel management for GKE clusters. - -> **MCP Tools:** `get_cluster`, `get_k8s_version`, `update_cluster`, `update_node_pool`, `list_operations`, `get_operation`, `cancel_operation`, `get_k8s_resource` -> **CLI-only**: `gcloud container get-server-config` (available versions), `gcloud container clusters update --maintenance-window-*` (maintenance windows) - -## Golden Path Upgrade Defaults - -| Setting | Golden Path Value | Notes | -|---------|-------------------|-------| -| `releaseChannel.channel` | `REGULAR` | Balanced between freshness and stability | -| Maintenance exclusion | `NO_MINOR_UPGRADES`, 1 year | Prevents surprise minor version bumps | -| `upgradeSettings.strategy` | `SURGE` | Rolling upgrades with `maxSurge: 1` | -| Auto-repair | `true` | Unhealthy nodes are automatically replaced | -| Auto-upgrade | `true` | Nodes follow control plane version | - -## Release Channels - -| Channel | Cadence | Best For | -|---------|---------|----------| -| `RAPID` | Weeks after release | Dev/test, early access to features | -| `REGULAR` (golden path) | 2-3 months after Rapid | Production workloads | -| `STABLE` | 2-3 months after Regular | Risk-averse, highly regulated | - -```bash -# Check current channel -gcloud container clusters describe --region \ - --format="value(releaseChannel.channel)" \ - --quiet - -# Change channel (Day-1) -gcloud container clusters update --region \ - --release-channel \ - --quiet -``` - -## Maintenance Windows - -Control when GKE can perform automatic maintenance 
(upgrades, patches). - -```bash -# Set maintenance window (e.g., weekends 2am-6am UTC) -gcloud container clusters update --region \ - --maintenance-window-start "2026-01-01T02:00:00Z" \ - --maintenance-window-end "2026-01-01T06:00:00Z" \ - --maintenance-window-recurrence "FREQ=WEEKLY;BYDAY=SA,SU" \ - --quiet -``` - -### Maintenance Exclusions - -The golden path includes a 1-year `NO_MINOR_UPGRADES` exclusion to prevent automatic minor version changes. - -```bash -# Add maintenance exclusion -gcloud container clusters update --region \ - --add-maintenance-exclusion-name "freeze-1" \ - --add-maintenance-exclusion-start "2026-04-11T00:00:00Z" \ - --add-maintenance-exclusion-end "2027-04-11T00:00:00Z" \ - --add-maintenance-exclusion-scope NO_MINOR_UPGRADES \ - --quiet - -# Remove exclusion -gcloud container clusters update --region \ - --remove-maintenance-exclusion "freeze-1" \ - --quiet -``` - -**Exclusion scopes:** -- `NO_UPGRADES` — blocks all upgrades (max 30 days) -- `NO_MINOR_UPGRADES` — allows patch upgrades, blocks minor version changes (max 1 year) -- `NO_MINOR_OR_NODE_UPGRADES` — blocks minor and node upgrades (max 1 year) - -## Upgrade Strategy - -### SURGE (Golden Path) - -Rolling upgrade with configurable surge capacity: - -```bash -# Default: maxSurge=1 (one extra node during upgrade) -gcloud container node-pools update \ - --cluster --region \ - --max-surge-upgrade 1 --max-unavailable-upgrade 0 \ - --quiet -``` - -### Blue-Green (For Zero-Downtime Critical Workloads) - -```bash -gcloud container node-pools update \ - --cluster --region \ - --enable-blue-green-upgrade \ - --node-pool-soak-duration "3600s" \ - --quiet -``` - -## Pre-Upgrade Checklist - -1. **Check deprecations**: Review Kubernetes API deprecations between current and target version -2. **Review PDBs**: Ensure all production workloads have PodDisruptionBudgets -3. **Test in non-prod**: Upgrade a staging cluster first -4. 
**Check addon compatibility**: Verify third-party controllers support the target version -5. **Review node pool versions**: All node pools should be within 2 minor versions of the control plane - -```bash -# Check current versions -gcloud container clusters describe --region \ - --format="table(currentMasterVersion, nodePools[].version)" \ - --quiet - -# Check available upgrades -gcloud container get-server-config --region \ - --format="yaml(channels)" \ - --quiet - -# List deprecation warnings -kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis -``` - -## Manual Upgrade (When Needed) - -```bash -# Upgrade control plane -gcloud container clusters upgrade --region \ - --master --cluster-version \ - --quiet - -# Upgrade node pool -gcloud container clusters upgrade --region \ - --node-pool \ - --quiet -``` - -## Best Practices - -1. **Stay on a release channel**: Manual version management is error-prone. Let GKE manage versions. -2. **Use maintenance windows**: Schedule upgrades during low-traffic periods. -3. **Set PDBs on everything**: Protects workloads during node drains. -4. **Monitor during upgrades**: Watch for pod eviction failures, CrashLoopBackOff, and scheduling issues. -5. **Don't skip minor versions**: Upgrade incrementally (1.28 -> 1.29 -> 1.30, not 1.28 -> 1.30). diff --git a/skills/cloud/gke-basics/references/iac-usage.md b/skills/cloud/gke-basics/references/iac-usage.md index efc44ca4e7..e088546d5a 100644 --- a/skills/cloud/gke-basics/references/iac-usage.md +++ b/skills/cloud/gke-basics/references/iac-usage.md @@ -6,14 +6,14 @@ managed using Terraform. 
## Terraform Terraform uses two main providers for GKE: -* The **Google Cloud provider** connects to the Google Cloud API to manage - GKE cluster infrastructure using Terraform resources such as + +* The **Google Cloud provider** connects to the Google Cloud API to manage GKE + cluster infrastructure using Terraform resources such as `google_container_cluster` for the cluster itself, and `google_container_node_pool` for nodes in Standard mode. * The **Kubernetes provider** connects to the Kubernetes API to manage - workloads inside the cluster using Kubernetes resources such as - Deployments and Services. - + workloads inside the cluster using Kubernetes resources such as Deployments + and Services. ### GKE Autopilot Cluster Example @@ -65,13 +65,13 @@ resource "kubernetes_deployment_v1" "default" { ### Reference Documentation -- [Terraform Google Provider - Container Cluster](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster) +- [Terraform Google Provider - Container Cluster](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster) -- [Terraform Google Provider - Kubernetes Provider](https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs) +- [Terraform Google Provider - Kubernetes Provider](https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs) ## YAML Samples GKE cluster configurations and Kubernetes manifests can also be defined using YAML for use with `kubectl apply` or Deployment Manager. 
-- [GKE YAML Samples](https://docs.cloud.google.com/docs/samples?product=googlekubernetesengine) +- [GKE YAML Samples](https://docs.cloud.google.com/docs/samples?product=googlekubernetesengine) diff --git a/skills/cloud/gke-basics/references/mcp-usage.md b/skills/cloud/gke-basics/references/mcp-usage.md index 66e6dbbf1a..4534d88c7b 100644 --- a/skills/cloud/gke-basics/references/mcp-usage.md +++ b/skills/cloud/gke-basics/references/mcp-usage.md @@ -1,10 +1,14 @@ # GKE MCP Server Usage -The GKE MCP server provides 23 structured tools for cluster management, Kubernetes resource operations, and diagnostics — without requiring shell access or kubeconfig setup. +The GKE MCP server provides 23 structured tools for cluster management, +Kubernetes resource operations, and diagnostics — without requiring shell access +or kubeconfig setup. ## Connecting to the GKE MCP Server -The GKE remote MCP server is available for AI clients that support the Model Context Protocol. For setup instructions, see https://docs.cloud.google.com/kubernetes-engine/docs/how-to/use-gke-mcp. +The GKE remote MCP server is available for AI clients that support the Model +Context Protocol. For setup instructions, see +https://docs.cloud.google.com/kubernetes-engine/docs/how-to/use-gke-mcp. ## Available Tools @@ -21,52 +25,60 @@ Use `locations/-` to match all regions when listing. ### Cluster Management -| Tool | Mode | Purpose | -|------|------|---------| -| `list_clusters` | READ | Discover clusters in a project/region | -| `get_cluster` | READ | Inspect cluster config. Use `readMask` to select fields | -| `create_cluster` | MUTATE | Create a cluster from JSON config | -| `update_cluster` | DESTRUCTIVE | Change Day-1 cluster settings | +| Tool | Mode | Purpose | +| ---------------- | ----------- | ----------------------------------------- | +| `list_clusters` | READ | Discover clusters in a project/region | +| `get_cluster` | READ | Inspect cluster config. 
Use `readMask` to | +: : : select fields : +| `create_cluster` | MUTATE | Create a cluster from JSON config | +| `update_cluster` | DESTRUCTIVE | Change Day-1 cluster settings | ### Node Pool Management -| Tool | Mode | Purpose | -|------|------|---------| -| `list_node_pools` | READ | List pools in a cluster | -| `get_node_pool` | READ | Get pool details | -| `create_node_pool` | MUTATE | Add a pool (Standard clusters) | -| `update_node_pool` | DESTRUCTIVE | Modify a pool | +Tool | Mode | Purpose +------------------ | ----------- | ------------------------------ +`list_node_pools` | READ | List pools in a cluster +`get_node_pool` | READ | Get pool details +`create_node_pool` | MUTATE | Add a pool (Standard clusters) +`update_node_pool` | DESTRUCTIVE | Modify a pool ### Kubernetes Resources -| Tool | Mode | Purpose | -|------|------|---------| -| `get_k8s_resource` | READ | List/get any K8s resource (supports label/field selectors) | -| `describe_k8s_resource` | READ | Detailed info with events and conditions | -| `apply_k8s_manifest` | DESTRUCTIVE | Apply YAML manifests (supports `dryRun`) | -| `patch_k8s_resource` | DESTRUCTIVE | JSON patch resource fields | -| `delete_k8s_resource` | DESTRUCTIVE | Remove resources (supports `cascade`, `dryRun`) | -| `list_k8s_api_resources` | READ | Discover available resource types | +| Tool | Mode | Purpose | +| ------------------------ | ----------- | ----------------------------------- | +| `get_k8s_resource` | READ | List/get any K8s resource (supports | +: : : label/field selectors) : +| `describe_k8s_resource` | READ | Detailed info with events and | +: : : conditions : +| `apply_k8s_manifest` | DESTRUCTIVE | Apply YAML manifests (supports | +: : : `dryRun`) : +| `patch_k8s_resource` | DESTRUCTIVE | JSON patch resource fields | +| `delete_k8s_resource` | DESTRUCTIVE | Remove resources (supports | +: : : `cascade`, `dryRun`) : +| `list_k8s_api_resources` | READ | Discover available resource types | ### Diagnostics & 
Observability -| Tool | Mode | Purpose | -|------|------|---------| -| `list_k8s_events` | READ | Scheduling failures, OOM kills, evictions | -| `get_k8s_logs` | READ | Container logs (supports `tail`, `since`, `previous`) | -| `get_k8s_cluster_info` | READ | Control plane and service endpoints | -| `get_k8s_version` | READ | Kubernetes server version | -| `get_k8s_rollout_status` | READ | Deployment/StatefulSet rollout progress | -| `check_k8s_auth` | READ | Verify RBAC permissions for a user/SA | +| Tool | Mode | Purpose | +| ------------------------ | ---- | ----------------------------------------- | +| `list_k8s_events` | READ | Scheduling failures, OOM kills, evictions | +| `get_k8s_logs` | READ | Container logs (supports `tail`, `since`, | +: : : `previous`) : +| `get_k8s_cluster_info` | READ | Control plane and service endpoints | +| `get_k8s_version` | READ | Kubernetes server version | +| `get_k8s_rollout_status` | READ | Deployment/StatefulSet rollout progress | +| `check_k8s_auth` | READ | Verify RBAC permissions for a user/SA | ### Operations -| Tool | Mode | Purpose | -|------|------|---------| -| `list_operations` | READ | Pending/running cluster operations | -| `get_operation` | READ | Track create/upgrade progress | -| `cancel_operation` | DESTRUCTIVE | Abort stuck operations | +Tool | Mode | Purpose +------------------ | ----------- | ---------------------------------- +`list_operations` | READ | Pending/running cluster operations +`get_operation` | READ | Track create/upgrade progress +`cancel_operation` | DESTRUCTIVE | Abort stuck operations ## Tool Preference -Default: **MCP tools > gcloud CLI > kubectl**. See [cli-reference.md](./cli-reference.md) for the full coverage comparison, CLI fallback commands, and user preference override options. +Default: **MCP tools > gcloud CLI > kubectl**. See +[cli-reference.md](./cli-reference.md) for the full coverage comparison, CLI +fallback commands, and user preference override options. 
diff --git a/skills/cloud/gke-basics/references/gke-batch-hpc.md b/skills/cloud/gke-batch-hpc/SKILL.md similarity index 70% rename from skills/cloud/gke-basics/references/gke-batch-hpc.md rename to skills/cloud/gke-batch-hpc/SKILL.md index 74ec29feb4..a49da717f2 100644 --- a/skills/cloud/gke-basics/references/gke-batch-hpc.md +++ b/skills/cloud/gke-batch-hpc/SKILL.md @@ -1,16 +1,28 @@ +--- +name: gke-batch-hpc +description: >- + Runs batch and HPC workloads on GKE, utilizing job queues and parallel + processing. Use when running GKE batch jobs, configuring GKE HPC, or setting + up GKE job queues. Don't use for standard web application deployments (use + gke-app-onboarding instead). +--- + # GKE Batch & HPC Workloads -This reference covers running batch processing and high-performance computing (HPC) workloads on GKE. +This reference covers running batch processing and high-performance computing +(HPC) workloads on GKE. -> **MCP Tools:** `apply_k8s_manifest`, `get_k8s_resource`, `describe_k8s_resource`, `get_k8s_logs`, `delete_k8s_resource`, `list_k8s_events` +> **MCP Tools:** `apply_k8s_manifest`, `get_k8s_resource`, +> `describe_k8s_resource`, `get_k8s_logs`, `delete_k8s_resource`, +> `list_k8s_events` ## When to Use -- Running batch data processing pipelines -- HPC simulations (CFD, molecular dynamics, financial modeling) -- Large-scale parallel computation (MPI, MapReduce) -- ML training jobs -- CI/CD build farms +- Running batch data processing pipelines +- HPC simulations (CFD, molecular dynamics, financial modeling) +- Large-scale parallel computation (MPI, MapReduce) +- ML training jobs +- CI/CD build farms ## Batch Processing on GKE @@ -106,7 +118,8 @@ spec: ### Compact Placement (Low-Latency Networking) -For tightly-coupled HPC workloads that need low-latency inter-node communication: +For tightly-coupled HPC workloads that need low-latency inter-node +communication: ```bash # Standard clusters: create node pool with compact placement @@ -152,17 +165,25 @@ 
spec: ### Spot VMs for Batch -Batch workloads are ideal Spot VM candidates (interruptible, can checkpoint). Use a ComputeClass with Spot-first priority and `activeMigration` to return to Spot when available. See [gke-compute-classes.md](./gke-compute-classes.md) for the Spot-with-fallback pattern. +Batch workloads are ideal Spot VM candidates (interruptible, can checkpoint). +Use a ComputeClass with Spot-first priority and `activeMigration` to return to +Spot when available. See +[gke-compute-classes.md](../gke-compute-classes/SKILL.md) for the +Spot-with-fallback pattern. ### Scale-to-Zero For batch clusters, allow node pools to scale to zero when no jobs are running: -- Autopilot (golden path): Automatic, nodes scale to zero when no pods are scheduled -- Standard: Set `--min-nodes 0` on batch node pools +- Autopilot (golden path): Automatic, nodes scale to zero when no pods are + scheduled +- Standard: Set `--min-nodes 0` on batch node pools ## Best Practices -- **Kueue** for multi-tenant job scheduling; **JobSet** for multi-component workflows -- **Set `backoffLimit`** on Jobs; **checkpoint long jobs** for preemption resilience -- **Spot VMs** for fault-tolerant batch; **compact placement** for tightly-coupled HPC +- **Kueue** for multi-tenant job scheduling; **JobSet** for multi-component + workflows +- **Set `backoffLimit`** on Jobs; **checkpoint long jobs** for preemption + resilience +- **Spot VMs** for fault-tolerant batch; **compact placement** for + tightly-coupled HPC diff --git a/skills/cloud/gke-basics/references/gke-cluster-creation.md b/skills/cloud/gke-cluster-creation/SKILL.md similarity index 52% rename from skills/cloud/gke-basics/references/gke-cluster-creation.md rename to skills/cloud/gke-cluster-creation/SKILL.md index 735590011c..479a3ce28f 100644 --- a/skills/cloud/gke-basics/references/gke-cluster-creation.md +++ b/skills/cloud/gke-cluster-creation/SKILL.md @@ -1,38 +1,58 @@ +--- +name: gke-cluster-creation +description: >- + Plans and 
executes GKE cluster creation, provisioning, and production + readiness audits. Use when creating GKE clusters, provisioning GKE + environments, or auditing GKE clusters. Don't use for application + onboarding or deployment configuration (use gke-app-onboarding instead). +--- + # GKE Cluster Creation -This reference guides creating GKE clusters. The **golden path Autopilot** configuration is the default for all new clusters. +This reference guides creating GKE clusters. The **golden path Autopilot** +configuration is the default for all new clusters. -> **MCP Tools:** `list_clusters`, `create_cluster`, `get_cluster`, `list_operations`, `get_operation` +> **MCP Tools:** `list_clusters`, `create_cluster`, `get_cluster`, +> `list_operations`, `get_operation` ## Workflow -1. **Discover context**: Use `list_clusters` to see existing clusters. Use `gcloud config get-value project` if project unknown. -2. **Gather inputs**: project_id, region, cluster_name, environment type -3. **Select mode**: Autopilot (default) vs Standard -4. **Configure networking**: auto-create subnet (default) or bring-your-own -5. **Review golden path settings**: present the config and confirm with user -6. **Create**: Use MCP `create_cluster` tool. Fall back to `gcloud` CLI only if MCP is unavailable. -7. **Track**: Use `get_operation` to monitor creation progress -8. **Verify**: Use `get_cluster` with `readMask="*"` to confirm golden path settings applied +1. **Discover context**: Use `list_clusters` to see existing clusters. Use + `gcloud config get-value project` if project unknown. +2. **Gather inputs**: project_id, region, cluster_name, environment type +3. **Select mode**: Autopilot (default) vs Standard +4. **Configure networking**: auto-create subnet (default) or bring-your-own +5. **Review golden path settings**: present the config and confirm with user +6. **Create**: Use MCP `create_cluster` tool. Fall back to `gcloud` CLI only if + MCP is unavailable. +7. 
**Track**: Use `get_operation` to monitor creation progress +8. **Verify**: Use `get_cluster` with `readMask="*"` to confirm golden path + settings applied ## Mode Selection -| Criteria | Autopilot (Golden Path) | Standard | -|----------|------------------------|----------| -| Node management | Google-managed | Self-managed | -| Pricing | Pay per pod resource request | Pay per node (VM) | -| Node customization | Via ComputeClasses | Full control | -| DaemonSets | Allowed (with restrictions) | Full control | -| GPU/TPU | Supported via ComputeClasses | Supported via node pools | -| Best for | Most production workloads | Kernel tuning, custom OS, privileged workloads | - -> **Rule**: Default to Autopilot unless the customer has a specific requirement that Autopilot cannot satisfy. +| Criteria | Autopilot (Golden Path) | Standard | +| ------------------ | ------------------------- | ------------------------- | +| Node management | Google-managed | Self-managed | +| Pricing | Pay per pod resource | Pay per node (VM) | +: : request : : +| Node customization | Via ComputeClasses | Full control | +| DaemonSets | Allowed (with | Full control | +: : restrictions) : : +| GPU/TPU | Supported via | Supported via node pools | +: : ComputeClasses : : +| Best for | Most production workloads | Kernel tuning, custom OS, | +: : : privileged workloads : + +> **Rule**: Default to Autopilot unless the customer has a specific requirement +> that Autopilot cannot satisfy. ## Templates ### 1. Golden Path Autopilot (Production) -This is the default. All settings match `assets/golden-path-autopilot.yaml`. +This is the default. All settings match +`../gke-golden-path/assets/golden-path-autopilot.yaml`. **Via gcloud:** @@ -78,7 +98,8 @@ gcloud container clusters create-auto \ ### 2. Autopilot Dev/Test -Relaxes some golden path defaults for cost savings and easier access in non-production. +Relaxes some golden path defaults for cost savings and easier access in +non-production. 
```bash gcloud container clusters create-auto \ @@ -88,7 +109,8 @@ gcloud container clusters create-auto \ --quiet ``` -> **Warning**: This does not apply golden path security hardening. Suitable for dev/test only. +> **Warning**: This does not apply golden path security hardening. Suitable for +> dev/test only. ### 3. Standard Regional (When Autopilot is Not an Option) @@ -112,7 +134,8 @@ gcloud container clusters create \ ### 4. GPU/AI Workloads (Autopilot with ComputeClass) -Create a golden path Autopilot cluster, then apply a ComputeClass for GPU workloads: +Create a golden path Autopilot cluster, then apply a ComputeClass for GPU +workloads: ```bash # 1. Create golden path cluster (same as template 1) @@ -133,10 +156,12 @@ kubectl apply -f inference.yaml ## Instructions -- **ALWAYS** ask for `project_id` if not in context -- **ALWAYS** ask for `region` -- **ALWAYS** ask for a unique `cluster_name` -- **DEFAULT** to golden path Autopilot unless customer specifies otherwise -- **WARN** about Day-0 decisions (networking, private nodes) that are hard to change later -- **WARN** about cost for GPU or multi-region clusters -- When using MCP `create_cluster`, the `cluster.name` should be the **short name** (e.g., `my-cluster`), not the full resource path +- **ALWAYS** ask for `project_id` if not in context +- **ALWAYS** ask for `region` +- **ALWAYS** ask for a unique `cluster_name` +- **DEFAULT** to golden path Autopilot unless customer specifies otherwise +- **WARN** about Day-0 decisions (networking, private nodes) that are hard to + change later +- **WARN** about cost for GPU or multi-region clusters +- When using MCP `create_cluster`, the `cluster.name` should be the **short + name** (e.g., `my-cluster`), not the full resource path diff --git a/skills/cloud/gke-compute-classes/SKILL.md b/skills/cloud/gke-compute-classes/SKILL.md new file mode 100644 index 0000000000..4b558d3cd2 --- /dev/null +++ b/skills/cloud/gke-compute-classes/SKILL.md @@ -0,0 +1,204 @@ +--- 
+name: gke-compute-classes +description: >- + Manages GKE compute classes, node selection, Spot fallbacks, and GPU node + pools. Use when configuring GKE compute classes, selecting GKE machine + families, or configuring GPUs on GKE. Don't use for general workload-level + resource limits or Horizontal/Vertical Pod Autoscaling (use gke-scaling + instead). +--- + +# GKE ComputeClasses + +ComputeClasses allow declarative node configuration and autoscaling priorities +in GKE Autopilot (and Standard with NAP). Use them to specify machine families, +Spot VM fallback, GPU requirements, and zone targeting. + +> **MCP Tools:** `apply_k8s_manifest`, `get_k8s_resource`, +> `describe_k8s_resource`, `delete_k8s_resource` + +## When to Use + +- Cost optimization: Spot VMs with on-demand fallback +- GPU/TPU workloads: target specific accelerators +- Performance: select specific machine families (c3, c4, n4) +- Zone targeting: colocate workloads with zonal resources + +## CRD Structure + +```yaml +apiVersion: cloud.google.com/v1 +kind: ComputeClass +metadata: + name: +spec: + # Required. Ordered list of rules. GKE tries them in order. + priorities: + - + + # Optional. Default: "DoNotScaleUp" + whenUnsatisfiable: <"DoNotScaleUp" | "ScaleUpAnyway"> + + # Optional. Auto-create node pools. Default: true + nodePoolAutoCreation: + enabled: + + # Optional. Move workloads back to higher-priority when available + activeMigration: + optimizeRulePriority: + + # Optional. Scale-down delay + autoscalingPolicy: + consolidationDelay: + + # Optional. 
Defaults for fields omitted in priorities + priorityDefaults: +``` + +## PriorityRule Fields + +| Field | Type | Description | Example | +| --------------- | ------- | ---------------------- | ---------------- | +| `machineFamily` | string | Compute Engine machine | `n4`, `c3`, | +: : : family : `t2a` : +| `machineType` | string | Specific machine type | `n4-standard-32` | +| `spot` | boolean | Use Spot VMs | `true` | +| `minCores` | int | Minimum vCPUs | `4` | +| `minMemoryGb` | int | Minimum memory in GB | `16` | +| `gpu` | object | GPU config: `type`, | See below | +: : : `count`, : : +: : : `driverVersion` : : +| `tpu` | object | TPU config: `type`, | See below | +: : : `count`, `topology` : : +| `storage` | object | Boot disk: `type`, | See below | +: : : `sizeGb`, `kmsKey`; : : +: : : Local SSD\: `count`, : : +: : : `interface` : : +| `location` | object | Zone targeting: | See below | +: : : `zones\: [...]` or : : +: : : `type\: "Any"` : : +| `reservations` | object | Reservation | See below | +: : : consumption\: : : +: : : `NO_RESERVATION`, : : +: : : `ANY_RESERVATION`, : : +: : : `SPECIFIC_RESERVATION` : : + +### GPU Configuration + +```yaml +gpu: + type: "nvidia-l4" # nvidia-l4, nvidia-h100-80gb, etc. + count: 1 # GPUs per node + driverVersion: "latest" # Optional +``` + +### TPU Configuration + +```yaml +tpu: + type: "v5p-slice" + count: 8 + topology: "2x2x1" +``` + +### Storage Configuration + +```yaml +storage: + bootDisk: + type: "pd-balanced" # pd-balanced (golden path), pd-ssd, hyperdisk-balanced + sizeGb: 100 + kmsKey: "projects/.../cryptoKeys/..." 
# Optional CMEK + localSsd: + count: 1 + interface: "NVME" +``` + +### Location Configuration + +```yaml +location: + zones: + - "us-central1-a" + - "us-central1-b" + # OR + type: "Any" # Let GKE pick from cluster zones +``` + +## Common Patterns + +### Spot VMs with On-Demand Fallback + +```yaml +apiVersion: cloud.google.com/v1 +kind: ComputeClass +metadata: + name: spot-with-fallback +spec: + nodePoolAutoCreation: + enabled: true + priorities: + - machineFamily: n4 + spot: true + - machineFamily: n4 + spot: false +``` + +### GPU Workload (L4) + +```yaml +apiVersion: cloud.google.com/v1 +kind: ComputeClass +metadata: + name: l4-gpu-class +spec: + priorities: + - machineFamily: g2 + gpu: + type: nvidia-l4 + count: 1 + minCores: 4 + minMemoryGb: 16 + storage: + bootDisk: + type: pd-balanced + sizeGb: 100 +``` + +### Spot with Active Migration (Return to Spot When Available) + +Add `activeMigration` to the Spot-with-fallback pattern above to auto-migrate +workloads back to Spot when capacity returns: + +```yaml +spec: + activeMigration: + optimizeRulePriority: true + priorities: + - machineFamily: n4 + spot: true + - machineFamily: n4 + spot: false +``` + +> **Other patterns** — HPC (`machineFamily: c3`, `minCores: 8`) and zone +> targeting (`location.zones: [...]`) follow the same CRD structure. See the +> PriorityRule fields table and sub-config examples above. + +## Workload Usage + +Pods must specify the ComputeClass via node selector: + +```yaml +nodeSelector: + cloud.google.com/compute-class: "" +``` + +## Warnings + +- Do not mix ComputeClass selection with other hard node selectors (like + `cloud.google.com/gke-spot`) — this causes scheduling conflicts. +- When using `activeMigration`, workloads will be evicted and rescheduled — + ensure PDBs are in place. +- Spot VMs can be evicted with 30-second notice. Set + `terminationGracePeriodSeconds < 30` for Spot workloads. 
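
As a minimal sketch of how the warnings above combine, the following Deployment selects a ComputeClass through the node selector alone and keeps its grace period under the Spot notice window. The Deployment name, labels, image path, and resource requests are hypothetical placeholders; `spot-with-fallback` refers to the ComputeClass defined in the Spot-with-fallback pattern earlier in this file.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker                 # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
    spec:
      # Select the ComputeClass only. No cloud.google.com/gke-spot selector
      # is added, since mixing the two causes scheduling conflicts.
      nodeSelector:
        cloud.google.com/compute-class: "spot-with-fallback"
      # Kept below the 30-second Spot eviction notice.
      terminationGracePeriodSeconds: 25
      containers:
        - name: worker
          # Placeholder image path, not a real repository.
          image: us-docker.pkg.dev/example-project/example-repo/worker:1.0
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
```

If the ComputeClass also sets `activeMigration`, a PodDisruptionBudget for the `app: batch-worker` label would typically accompany this manifest so that migrations do not evict every replica at once.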
diff --git a/skills/cloud/gke-cost/SKILL.md b/skills/cloud/gke-cost/SKILL.md new file mode 100644 index 0000000000..938c740b53 --- /dev/null +++ b/skills/cloud/gke-cost/SKILL.md @@ -0,0 +1,184 @@ +--- +name: gke-cost +description: >- + Optimizes GKE costs, rightsizes workloads, and configures Spot VMs and CUDs. + Use when optimizing GKE costs, rightsizing GKE workloads, or configuring GKE + Spot VMs. Don't use for general compute class provisioning or GPU Selection + (use gke-compute-classes instead). +--- + +# GKE Cost Optimization + +This reference covers strategies for reducing GKE costs while maintaining the +golden path security and reliability posture. + +> **MCP Tools:** `get_k8s_resource`, `describe_k8s_resource`, +> `apply_k8s_manifest`, `patch_k8s_resource`, `get_cluster` + +## Golden Path Cost Features + +The golden path already includes cost-optimizing settings: + +| Setting | Value | Impact | +| ------------------------ | ---------------------- | ----------------------- | +| `autoscalingProfile` | `OPTIMIZE_UTILIZATION` | Aggressive node | +: : : scale-down reduces idle : +: : : compute : +| `verticalPodAutoscaling` | `enabled` | VPA recommendations | +: : : prevent : +: : : over-provisioning : +| Autopilot pricing | Pay per pod request | No charge for unused | +: : : node capacity : +| Node Auto Provisioning | enabled | Right-sized node pools | +: : : created automatically : + +## Cost Optimization Strategies + +### 1. Spot VMs via ComputeClasses + +Use Spot VMs for fault-tolerant workloads (60-90% cost reduction). + +```yaml +apiVersion: cloud.google.com/v1 +kind: ComputeClass +metadata: + name: spot-with-fallback +spec: + activeMigration: + optimizeRulePriority: true + priorities: + - machineFamily: n4 + spot: true + - machineFamily: n4 + spot: false +``` + +**Spot-suitable workloads:** + +Workload | Spot-Suitable? 
+--------------------------------- | --------------- +Batch / data processing | Yes +Dev / test environments | Yes +Stateless web/API (replicas >= 2) | Yes (with PDBs) +Jobs with checkpointing | Yes +Stateful workloads (databases) | No +Single-replica critical services | No + +**Handling eviction:** + +```yaml +spec: + template: + spec: + terminationGracePeriodSeconds: 25 # Must be < 30s for Spot + containers: + - name: app + lifecycle: + preStop: + exec: + command: ["/bin/sh", "-c", "sleep 5"] +``` + +### 2. Pod Rightsizing + +Use VPA recommendations to reduce over-provisioned requests. + +```bash +# 1. Deploy VPA in recommendation mode +kubectl apply -f - <-vpa +spec: + targetRef: + apiVersion: apps/v1 + kind: Deployment + name: + updatePolicy: + updateMode: "Off" +EOF + +# 2. Wait 24+ hours for data collection + +# 3. Read recommendations +kubectl get vpa -vpa -o jsonpath='{.status.recommendation}' +``` + +**Optimization rules:** + +Condition | Action | Savings +----------------------------- | ---------------------------------- | ------- +CPU request >5x P95 actual | Reduce to `P95 * 1.2` | High +Memory request >3x P95 actual | Reduce to `P95 * 1.2` | High +CPU request >2x P95 actual | Reduce to `P95 * 1.2` | Medium +No resource requests set | Add requests (enables bin-packing) | Medium + +### 3. 
Machine Type Selection + +| Family | Use Case | Relative Cost | +| ------------- | -------------------------------------------- | ------------- | +| e2 | General purpose, burstable | Lowest | +| t2a / t2d | Scale-out (Arm/AMD), price-performance | Low | +: : optimized : : +| n4a | Axion Arm-based, general-purpose | Low | +: : price-performance : : +| n4 / n4d | General purpose (Intel/AMD), flexible shapes | Low-Medium | +| c4a | Compute-optimized (Arm), high efficiency | Medium-High | +| c3 / c4 | Compute-optimized (Intel) | Medium-High | +| c3d / c4d | Compute-optimized (AMD), high-performance | Medium-High | +: : throughput : : +| ek-standard | Autopilot enhanced (golden path) | Medium | +| m3 / x4 | Memory-optimized, SAP HANA, large databases | High | +| g2 (L4 GPU) | AI inference | High | +| a3 (H100 GPU) | AI training | Highest | +| a4 / a4x | Ultra-scale AI (Blackwell GPUs) | Highest | + +> In Autopilot, machine type is managed. Use ComputeClasses to influence +> selection. + +### 4. Committed Use Discounts (CUDs) + +For steady-state workloads, purchase 1-year or 3-year CUDs: + +- 1-year: ~20-30% discount +- 3-year: ~50-55% discount +- Applied automatically to matching usage in the region +- Purchase via Google Cloud Console > Billing > Committed use discounts + +### 5. Cluster Management + +- **Stop/start dev clusters**: Idle dev clusters cost money even with no + workloads (control plane fee). +- **Right-size node pools** (Standard): Use Cluster Autoscaler with + appropriate min/max. +- **Multi-tenant clusters**: Share a single cluster across teams instead of + per-team clusters (see [gke-multitenancy.md](../gke-multitenancy/SKILL.md)). 
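
For the multi-tenant option above, one common Kubernetes mechanism for keeping a shared cluster's cost predictable is a per-namespace `ResourceQuota`. This is a sketch under stated assumptions, not necessarily the pattern the multitenancy skill prescribes: the namespace `team-a`, the quota name, and all limits are hypothetical values to adjust per team.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota        # hypothetical name
  namespace: team-a         # hypothetical per-team namespace
spec:
  hard:
    # Cap the total resource requests the team can make. In Autopilot,
    # requests are what is billed, so this bounds the team's spend.
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    pods: "50"
```

Once a quota on compute resources exists in a namespace, pods that omit the corresponding requests or limits are rejected, which also supports the "add requests" rightsizing rule listed earlier.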
+ +## Cost Monitoring + +```bash +# View cluster cost breakdown (requires Cost Management API) +gcloud billing budgets list --billing-account= --quiet + +# View node utilization +kubectl top nodes + +# View pod resource usage vs requests +kubectl top pods --all-namespaces --containers +``` + +## Dev/Test Cost Savings + +For non-production environments, these golden path deviations are acceptable: + +| Setting | Production (Golden | Dev/Test | +: : Path) : : +| ----------------------- | ------------------ | ----------------------------- | +| Cluster mode | Autopilot | Autopilot (cheaper with fewer | +: : : pods) : +| Release channel | Regular | Rapid (get fixes faster) | +| Private nodes | Required | Optional (simpler access) | +| Monitoring components | Full suite | SYSTEM_COMPONENTS only | +| Secret Manager rotation | 120s | Disabled | +| Maintenance windows | Configured | Not needed | diff --git a/skills/cloud/gke-golden-path/SKILL.md b/skills/cloud/gke-golden-path/SKILL.md new file mode 100644 index 0000000000..5f3e2ff57c --- /dev/null +++ b/skills/cloud/gke-golden-path/SKILL.md @@ -0,0 +1,104 @@ +--- +name: gke-golden-path +description: >- + Provides GKE golden path configuration defaults, production readiness + checklists, and cluster default patterns. Use when designing GKE clusters, + verifying GKE production readiness, or checking configurations against + GKE defaults. Don't use for setting up node autoscaling specifically (use + gke-scaling instead). +--- + +# GKE Golden Path Configuration + +The golden path is the recommended Autopilot configuration for production +clusters. It defines sensible defaults — when the user requests different +settings, apply them and note relevant trade-offs. + +> **MCP Tools:** `get_cluster`, `create_cluster`, `update_cluster` + +## Rules + +1. **Default to the golden path.** Use golden path values unless the user + requests otherwise. When deviating, note trade-offs but respect the user's + choice. +2. 
**Day-0 vs Day-1.** Flag Day-0 decisions (networking, private nodes, + subnets, IP allocation) prominently — they are hard/impossible to change + after creation. +3. **Tool preference: MCP > gcloud > kubectl.** See + [cli-reference.md](../gke-basics/references/cli-reference.md) for full + coverage matrix and override options. If the user says "use gcloud" or "use + kubectl", respect that for the session. +4. **Document decisions and rationale**, especially for Day-0 choices and + golden path deviations. + +## Required Inputs + +If the user is unsure, use golden path defaults. + +- **Project ID** (required) +- **Region** (required, e.g., `us-central1`) +- **Cluster name** (required) +- **Environment type**: dev/test or production (defaults to production) +- **Networking**: bring-your-own VPC/subnet or auto-create (default: + auto-create) +- **Scale expectations**: expected node/pod count, workload types +- **Cost constraints**: Spot VM tolerance, budget considerations + +## Always-Apply Defaults + +Recommended best practices applied by default. If the user requests a different +setting, apply it and briefly note the security or operational trade-off. 
+ +Setting | Golden Path Value +------------------------------------------------------------------ | ----------------- +`autopilot.enabled` | `true` +`privateClusterConfig.enablePrivateNodes` | `true` +`masterAuthorizedNetworksConfig.privateEndpointEnforcementEnabled` | `true` +`secretManagerConfig.enabled` + `rotationInterval: 120s` | `true` +`rbacBindingConfig.enableInsecureBinding*` | `false` (both) +`workloadIdentityConfig.workloadPool` | enabled +`networkConfig.datapathProvider` | `ADVANCED_DATAPATH` +`networkConfig.dnsConfig.clusterDns` | `CLOUD_DNS` +`autoscaling.autoscalingProfile` | `OPTIMIZE_UTILIZATION` +`verticalPodAutoscaling.enabled` | `true` +`monitoringConfig` components | SYSTEM_COMPONENTS, STORAGE, POD, DEPLOYMENT, STATEFULSET, DAEMONSET, HPA, JOBSET, CADVISOR, KUBELET, DCGM, APISERVER, SCHEDULER, CONTROLLER_MANAGER +`advancedDatapathObservabilityConfig.enableMetrics` | `true` +`nodeConfig.shieldedInstanceConfig.enableSecureBoot` | `true` +`nodeConfig.workloadMetadataConfig.mode` | `GKE_METADATA` +`nodeConfig.gcfsConfig.enabled` / `gvnic.enabled` | `true` / `true` +`addonsConfig.statefulHaConfig.enabled` | `true` +Storage CSI drivers (Filestore, GCS FUSE, Parallelstore) | enabled +Pod Security Standards | `restricted` on production namespaces + +## Customer-Configurable Settings + +These have golden path defaults but customers may deviate with valid +justification. 
**Ask before changing.** + +Setting | Default | Why Deviate +---------------------------------------- | ----------------------------------- | ----------- +`dnsEndpointConfig.allowExternalTraffic` | `true` | Restrict if cluster only accessed from within VPC +`autoIpamConfig` / `createSubnetwork` | `true` / `true` | Customer has pre-existing VPC/subnets +`maxPodsPerNode` | `48` | `110` for high pod-density (costs more CIDR space) +`subnetwork` | auto-created | Customer brings existing subnets +Maintenance exclusion windows | configured (NO_MINOR_UPGRADES, 1yr) | Customer-specific scheduling +`nodeConfig.bootDisk.diskType` | `pd-balanced` | `pd-ssd` for I/O-intensive, `pd-standard` for cost +`nodeConfig.machineType` | `ek-standard-8` (Autopilot) | Varies by workload; use ComputeClasses + +## Guardrails + +- Do not request or output secrets (tokens, keys, service account JSON). +- Discover project/cluster context via MCP tools or `gcloud config get-value + project` — don't ask users to paste project IDs. +- For Day-0 decisions, always ask clarifying questions before proceeding. +- For Day-1 features, propose golden path defaults with trade-offs and let the + customer confirm. +- Do not promise zero downtime; advise PDBs, health probes, replicas, and + staged upgrades. +- When auditing existing clusters, compare against golden path and report + deviations with severity and remediation. + +## Golden Path Config + +See [golden-path-autopilot.yaml](./assets/golden-path-autopilot.yaml) for the +full cluster-level policy settings. 
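
For orientation, a handful of the Always-Apply defaults listed earlier could be written as a nested fragment like the one below. This is only a sketch derived from the dotted setting names in that table. It is an assumption that the asset file nests them the same way, and the referenced `golden-path-autopilot.yaml` remains the source of truth.

```yaml
# Partial, illustrative fragment only.
# Values are copied from the Always-Apply table; the nesting is assumed
# from the dotted setting names and may differ from the asset file.
autopilot:
  enabled: true
privateClusterConfig:
  enablePrivateNodes: true
networkConfig:
  datapathProvider: ADVANCED_DATAPATH
  dnsConfig:
    clusterDns: CLOUD_DNS
verticalPodAutoscaling:
  enabled: true
autoscaling:
  autoscalingProfile: OPTIMIZE_UTILIZATION
```

When auditing an existing cluster, a fragment like this can serve as a quick checklist, with any mismatch reported as a golden path deviation.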
diff --git a/skills/cloud/gke-basics/assets/golden-path-autopilot.yaml b/skills/cloud/gke-golden-path/assets/golden-path-autopilot.yaml similarity index 100% rename from skills/cloud/gke-basics/assets/golden-path-autopilot.yaml rename to skills/cloud/gke-golden-path/assets/golden-path-autopilot.yaml diff --git a/skills/cloud/gke-inference/SKILL.md b/skills/cloud/gke-inference/SKILL.md new file mode 100644 index 0000000000..d137acfd42 --- /dev/null +++ b/skills/cloud/gke-inference/SKILL.md @@ -0,0 +1,206 @@ +--- +name: gke-inference +description: >- + Deploys and optimizes AI/ML inference workloads on GKE, using GPUs, TPUs, and + model servers. Use when deploying GKE inference servers, configuring GKE GPU + resources for inference, or deploying LLMs on GKE. Don't use for generic + batch jobs or HPC task queues (use gke-batch-hpc instead). +--- + +# GKE AI/ML Inference + +This reference covers deploying AI/ML inference workloads on GKE using Google's +Inference Quickstart (GIQ) and best practices for LLM serving. + +> **MCP Tools:** `apply_k8s_manifest`, `get_k8s_resource`, `get_k8s_logs`, +> `get_k8s_rollout_status`, `describe_k8s_resource`, `list_k8s_events`. +> **CLI-only:** `gcloud container ai profiles *` + +## When to Use + +- Deploy an AI model (Llama, Gemma, Mistral, etc.) to GKE +- Generate optimized Kubernetes manifests for inference +- Select GPU/TPU accelerators for model serving +- Configure autoscaling for LLM inference + +## Prerequisites + +- A golden path GKE Autopilot cluster (GPU workloads are supported via + ComputeClasses and NAP) +- `gcloud` CLI authenticated +- Sufficient GPU/TPU quota in the target region + +## Workflow + +### 1. Discovery: Find Models and Hardware + +```bash +# List all supported models +gcloud container ai profiles models list --quiet + +# Find valid accelerator/server combinations for a model +gcloud container ai profiles list --model= --quiet + +# Example: what can run Gemma 2 9B? 
+gcloud container ai profiles list --model=gemma-2-9b-it --quiet +``` + +### 2. Generate Manifest + +```bash +gcloud container ai profiles manifests create \ + --model= \ + --model-server= \ + --accelerator-type= \ + --target-ntpot-milliseconds= --quiet > inference.yaml +``` + +**Parameters:** + +- `--model`: Model ID (e.g., `gemma-2-9b-it`, `llama-3-8b`) +- `--model-server`: Inference server (`vllm`, `tgi`, `triton`, `tensorrt-llm`) +- `--accelerator-type`: GPU/TPU type (`nvidia-l4`, `nvidia-tesla-a100`, + `nvidia-h100-80gb`) +- `--target-ntpot-milliseconds`: Target Normalized Time Per Output Token + (optional, for latency optimization) + +**Example:** + +```bash +gcloud container ai profiles manifests create \ + --model=gemma-2-9b-it \ + --model-server=vllm \ + --accelerator-type=nvidia-l4 \ + --target-ntpot-milliseconds=50 --quiet > inference.yaml +``` + +### 3. Review and Deploy + +```bash +# Review for placeholders (HF tokens, PVCs) +cat inference.yaml + +# Deploy +kubectl apply -f inference.yaml + +# Monitor +kubectl get pods -w +kubectl logs -f +``` + +> Some models require Hugging Face tokens. Create a Kubernetes Secret and +> reference it in the manifest. 
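
The note above about Hugging Face tokens could be handled with a Secret along the following lines. This is a minimal sketch: the Secret name `hf-secret`, the key `hf_api_token`, and the `default` namespace are hypothetical, and the generated `inference.yaml` may expect different names, so check its placeholders before applying.

```yaml
# Hypothetical Secret holding a Hugging Face access token.
# The name, namespace, and key are assumptions; match them to the
# placeholders in the generated inference.yaml.
apiVersion: v1
kind: Secret
metadata:
  name: hf-secret
  namespace: default
type: Opaque
stringData:
  hf_api_token: REPLACE_WITH_YOUR_TOKEN  # never commit a real token
```

A container would then typically read the token through an `env` entry whose `valueFrom.secretKeyRef` points at the same name and key.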
+ +## GPU ComputeClass for Inference + +For Autopilot clusters, create a ComputeClass to target GPU nodes: + +```yaml +apiVersion: cloud.google.com/v1 +kind: ComputeClass +metadata: + name: l4-inference +spec: + priorities: + - machineFamily: g2 + gpu: + type: nvidia-l4 + count: 1 + minCores: 4 + minMemoryGb: 16 +``` + +## Accelerator Selection Guide + +| Accelerator | Best For | Memory | Relative Cost | +| ------------------- | ------------------------ | ----------- | ------------- | +| NVIDIA T4 | Budget inference, | 16 GB | Lowest | +: : lightweight legacy : : : +: : models : : : +| NVIDIA L4 (G2) | Small-medium model | 24 GB | Low | +: : inference, video, : : : +: : graphics : : : +| NVIDIA RTX PRO 6000 | Multimodal AI, | 96 GB | Medium | +: (G4) : high-fidelity 3D, : : : +: : fine-tuning : : : +| Cloud TPU v5e | Cost-effective | Varies | Medium | +: : transformer inference : : : +| Cloud TPU v5p | High-performance | Varies | High | +: : training : : : +| Cloud TPU v6e | High-efficiency next-gen | 32 GB/chip | Medium-High | +: (Trillium) : training & serving : : : +| Cloud TPU v7x | Ultra-scale inference & | 192 GB/chip | High | +: (Ironwood) : agentic workflows : : : +| NVIDIA A100 | Large model inference, | 40/80 GB | High | +: : enterprise ML : : : +| NVIDIA H100 / H200 | Frontier model training, | 80/141 GB | Highest | +: : high throughput : : : +| NVIDIA B200 (A4) | Blackwell-scale | 192 GB | Highest | +: : training, FP4 precision : : : +| NVIDIA GB200 (A4X) | Rack-scale AI (Grace | Massive | Highest | +: : Blackwell Superchip) : : : + +## Autoscaling LLM Inference + +### GPU-based autoscaling + +Use custom metrics for GPU utilization: + +```yaml +apiVersion: autoscaling/v2 +kind: HorizontalPodAutoscaler +metadata: + name: llm-hpa +spec: + scaleTargetRef: + apiVersion: apps/v1 + kind: Deployment + name: llm-server + minReplicas: 1 + maxReplicas: 10 + metrics: + - type: Pods + pods: + metric: + name: gpu_duty_cycle + target: + type: AverageValue + 
averageValue: "80" +``` + +### Best practices for inference autoscaling + +1. **Use DCGM metrics**: Golden path enables DCGM monitoring for GPU + utilization metrics +2. **Set appropriate minReplicas**: At least 1 for always-on serving; 0 for + batch/on-demand +3. **Tune scale-down delay**: LLM model loading is slow; use longer + stabilization windows +4. **Consider queue depth**: Scale on pending requests rather than pure GPU + utilization for latency-sensitive workloads + +## Optimization Tips + +- **Quantization**: Use quantized models (GPTQ, AWQ) to reduce GPU memory and + increase throughput +- **Batching**: Configure model server batch size for throughput vs latency + trade-off +- **Tensor parallelism**: Split large models across multiple GPUs within a + node +- **KV cache optimization**: Tune `--gpu-memory-utilization` in vLLM for KV + cache allocation + +## Troubleshooting + +| Issue | Cause | Fix | +| ------------------ | ------------------------ | --------------------------- | +| Invalid | Unsupported tuple | Re-run `gcloud container ai | +: model/accelerator : : profiles list : +: combination : : --model=` : +| GPU quota exceeded | Regional quota limit | Request quota increase or | +: : : try a different region : +| OOM on GPU | Model too large for | Use larger GPU, enable | +: : accelerator : quantization, or use tensor : +: : : parallelism : +| Slow cold start | Large model loading from | Use local SSD for model | +: : registry : caching; pre-pull images : diff --git a/skills/cloud/gke-basics/references/gke-multitenancy.md b/skills/cloud/gke-multitenancy/SKILL.md similarity index 58% rename from skills/cloud/gke-basics/references/gke-multitenancy.md rename to skills/cloud/gke-multitenancy/SKILL.md index 78458a30da..eb57cb8320 100644 --- a/skills/cloud/gke-basics/references/gke-multitenancy.md +++ b/skills/cloud/gke-multitenancy/SKILL.md @@ -1,26 +1,45 @@ +--- +name: gke-multitenancy +description: >- + Plans and configures multi-tenancy on GKE. 
Covers namespace isolation, RBAC + planning for teams, resource quotas, LimitRanges, network isolation, and + cost allocation. Use when designing GKE multi-tenancy, configuring GKE + namespaces, setting up resource quotas, or isolating GKE teams. Don't use + for single-tenant cluster configuration or general deployment instructions + (use gke-basics or gke-app-onboarding instead). +--- + # GKE Multi-Tenancy -This reference covers enterprise multi-tenancy patterns on GKE, including namespace isolation, RBAC planning, resource quotas, and network segmentation. +This reference covers enterprise multi-tenancy patterns on GKE, including +namespace isolation, RBAC planning, resource quotas, and network segmentation. -> **MCP Tools:** `apply_k8s_manifest`, `get_k8s_resource`, `check_k8s_auth`, `describe_k8s_resource`, `delete_k8s_resource` +> **MCP Tools:** `apply_k8s_manifest`, `get_k8s_resource`, `check_k8s_auth`, +> `describe_k8s_resource`, `delete_k8s_resource` ## When to Use -- Multiple teams sharing a single GKE cluster -- Isolating workloads by environment (dev/staging/prod) within one cluster -- Implementing least-privilege access control -- Cost allocation across teams or projects +- Multiple teams sharing a single GKE cluster +- Isolating workloads by environment (dev/staging/prod) within one cluster +- Implementing least-privilege access control +- Cost allocation across teams or projects ## Multi-Tenancy Models -| Model | Isolation | Complexity | Cost | -|-------|-----------|------------|------| -| **Namespace-per-team** | Soft (RBAC + Network Policy) | Low | Lowest (shared cluster) | -| **Namespace-per-environment** | Soft | Low | Low | -| **Node pool-per-team** | Medium (dedicated compute) | Medium | Medium | -| **Cluster-per-team** | Hard (full isolation) | High | Highest | - -> **Golden path recommendation**: Start with namespace-per-team for cost efficiency. Escalate to stronger isolation only when compliance requires it. 
+| Model | Isolation | Complexity | Cost | +| ----------------------------- | ------------ | ---------- | -------------- | +| **Namespace-per-team** | Soft (RBAC + | Low | Lowest (shared | +: : Network : : cluster) : +: : Policy) : : : +| **Namespace-per-environment** | Soft | Low | Low | +| **Node pool-per-team** | Medium | Medium | Medium | +: : (dedicated : : : +: : compute) : : : +| **Cluster-per-team** | Hard (full | High | Highest | +: : isolation) : : : + +> **Golden path recommendation**: Start with namespace-per-team for cost +> efficiency. Escalate to stronger isolation only when compliance requires it. ## Namespace Isolation Setup @@ -35,7 +54,8 @@ kubectl label namespace team-b team=b ### 2. RBAC Configuration -**Principle**: Grant minimal permissions per namespace. Never bind to `system:authenticated`. +**Principle**: Grant minimal permissions per namespace. Never bind to +`system:authenticated`. ```yaml # Namespace-scoped role for a team @@ -64,7 +84,9 @@ roleRef: apiGroup: rbac.authorization.k8s.io ``` -**RBAC best practices:** Use Google Groups for subject bindings. Prefer namespace-scoped Roles over ClusterRoles. See [gke-security.md](./gke-security.md) for full RBAC hardening guidance. +**RBAC best practices:** Use Google Groups for subject bindings. Prefer +namespace-scoped Roles over ClusterRoles. See +[gke-security](../gke-security/SKILL.md) for full RBAC hardening guidance. ### 3. Resource Quotas @@ -113,7 +135,8 @@ spec: ### 5. Network Isolation -Apply default-deny per namespace (see [gke-security.md](./gke-security.md)), then allow intra-team traffic: +Apply default-deny per namespace (see [gke-security](../gke-security/SKILL.md)), +then allow intra-team traffic: ```yaml # Allow same-namespace pods to talk + DNS @@ -160,4 +183,3 @@ gcloud container clusters update --region \ ``` View in Cloud Billing > GKE Cost Allocation. 
- diff --git a/skills/cloud/gke-networking/SKILL.md b/skills/cloud/gke-networking/SKILL.md new file mode 100644 index 0000000000..fabfb8bbe9 --- /dev/null +++ b/skills/cloud/gke-networking/SKILL.md @@ -0,0 +1,161 @@ +--- +name: gke-networking +description: >- + Plans, configures, and manages GKE networking. Covers private clusters, VPC- + native configurations, Gateway API, DNS, ingress/egress, Dataplane V2, and + IP planning. Use when designing GKE networking layouts, configuring private + clusters, setting up Gateway API, planning GKE IP ranges, or configuring GKE + ingress/egress. Don't use for basic application routing that does not + require dedicated network configuration. +--- + +# GKE Networking + +This reference covers networking configuration for GKE clusters. The golden path +enforces private, VPC-native clusters with Dataplane V2. + +> **MCP Tools:** `get_cluster`, `update_cluster`, `apply_k8s_manifest`, +> `get_k8s_resource` + +## Golden Path Networking Defaults + +Setting | Golden Path Value | Day-0/1 | Notes +-------------------------------------------------------------------- | ---------------------------------- | ------- | ----- +`privateClusterConfig.enablePrivateNodes` | `true` | Day-0 | Nodes have no public IPs +`masterAuthorizedNetworksConfig.privateEndpointEnforcementEnabled` | `true` | Day-0 | Control plane only reachable via private endpoint or DNS +`controlPlaneEndpointsConfig.dnsEndpointConfig.allowExternalTraffic` | `true` | Day-0 | Allows DNS-based access from outside VPC +`networkConfig.datapathProvider` | `ADVANCED_DATAPATH` (Dataplane V2) | Day-0 | eBPF-based, built-in Network Policy +`networkConfig.dnsConfig.clusterDns` | `CLOUD_DNS` | Day-0 | Managed DNS, more reliable than kube-dns +`networkConfig.enableIntraNodeVisibility` | `true` | Day-1 | VPC Flow Logs for intra-node traffic +`networkConfig.gatewayApiConfig.channel` | `CHANNEL_STANDARD` | Day-1 | Gateway API support +`ipAllocationPolicy.autoIpamConfig.enabled` | `true` | Day-0 
| Automatic IP range management +`ipAllocationPolicy.createSubnetwork` | `true` | Day-0 | Auto-create dedicated subnet +`defaultMaxPodsConstraint.maxPodsPerNode` | `48` | Day-0 | Conservative default; 110 for high density + +## Private Cluster Access Patterns + +The golden path creates a private cluster. Users access it via: + +1. **DNS endpoint (default)**: `allowExternalTraffic: true` enables access via + the cluster's DNS endpoint from outside the VPC. No VPN required. +2. **Private endpoint**: Direct access from within the VPC or via Cloud + VPN/Interconnect. +3. **Authorized networks**: Add specific CIDRs to + `masterAuthorizedNetworksConfig` for IP-based access control. + +```bash +# Access private cluster via DNS endpoint (golden path default) +gcloud container clusters get-credentials \ + --region --dns-endpoint \ + --quiet + +# Access via private endpoint (from within VPC) +gcloud container clusters get-credentials \ + --region --internal-ip \ + --quiet +``` + +## Bring-Your-Own VPC/Subnet + +If the customer has existing network infrastructure: + +```bash +gcloud container clusters create-auto \ + --region \ + --network \ + --subnetwork \ + --cluster-secondary-range-name \ + --services-secondary-range-name \ + --enable-private-nodes \ + --enable-master-authorized-networks \ + --quiet +``` + +> **Day-0 Warning**: VPC, subnet, and IP ranges cannot be changed after cluster +> creation. 
+ +## IP Planning + +| Resource | Golden Path | Notes | +| ------------- | ------------ | ------------------------------------------ | +| Pod CIDR | `/17` (auto) | ~32K pod IPs; size based on maxPodsPerNode | +| Service CIDR | `/20` (auto) | ~4K service IPs | +| Node subnet | auto-created | /20 recommended for growth | +| Max pods/node | 48 | Each node gets a /25 pod range; set to 110 | +: : : for /24 per node : + +**Pod CIDR sizing rule of thumb:** + +- `maxPodsPerNode=48` -> each node uses a `/25` (128 IPs) from pod CIDR +- `maxPodsPerNode=110` -> each node uses a `/24` (256 IPs) from pod CIDR +- Larger maxPodsPerNode = fewer nodes fit in a given CIDR + +## Ingress + +**Gateway API** (golden path, enabled via `gatewayApiConfig.channel: +CHANNEL_STANDARD`): + +```yaml +apiVersion: gateway.networking.k8s.io/v1 +kind: Gateway +metadata: + name: external-http +spec: + gatewayClassName: gke-l7-global-external-managed + listeners: + - name: http + protocol: HTTP + port: 80 +``` + +**Alternatives:** + +- `gke-l7-regional-external-managed` — regional external +- `gke-l7-rilb` — internal load balancer +- Istio service mesh — for advanced traffic management, mTLS + +## Egress + +- Default: nodes use Cloud NAT for outbound internet access (private nodes + have no public IPs) +- For static egress IPs: configure Cloud NAT with manual IP allocation +- For restricted egress: route through a firewall appliance via custom routes + +## Network Policy + +Dataplane V2 (golden path) provides built-in Network Policy enforcement — no +additional addon needed. Apply default-deny per namespace, then allow specific +flows. + +> See [gke-security.md](../gke-security/SKILL.md) for default-deny policy and +> [gke-multitenancy.md](../gke-multitenancy/SKILL.md) for per-team allow +> policies. + +## Cloud Armor (Recommended for Public-Facing Services) + +Cloud Armor provides WAF and DDoS protection. **Not a golden path default** — +recommended for any service with public ingress. 
Link via `BackendConfig` (for Ingress; Gateway API uses `GCPBackendPolicy`): + +```yaml +# 1. Create BackendConfig referencing your Cloud Armor policy +apiVersion: cloud.google.com/v1 +kind: BackendConfig +metadata: + name: my-backend-config +spec: + securityPolicy: + name: my-cloud-armor-policy +--- +# 2. Annotate your Service +# cloud.google.com/backend-config: '{"default": "my-backend-config"}' +``` + +## SSL, Container-Native LB, and PSC + +- **Google-managed SSL certificates**: Use Certificate Manager with Gateway API + (the `ManagedCertificate` CRD applies to Ingress). Auto-provisions and renews. +- **Container-native LB**: Enabled by default on VPC-native clusters (golden + path). Targets pods via NEGs, bypassing iptables. Annotation: + `cloud.google.com/neg: '{"ingress": true}'`. +- **Private Service Connect (PSC)**: Use `ServiceAttachment` CRD to expose + services across VPCs without peering. diff --git a/skills/cloud/gke-observability/SKILL.md b/skills/cloud/gke-observability/SKILL.md new file mode 100644 index 0000000000..1db6143c40 --- /dev/null +++ b/skills/cloud/gke-observability/SKILL.md @@ -0,0 +1,197 @@ +--- +name: gke-observability +description: >- + Configures GKE observability, including Cloud Logging, Cloud Monitoring, and + managed Prometheus. Use when configuring GKE monitoring, setting up GKE logging, + or configuring Prometheus metrics collection. Don't use to configure local + application logging frameworks or external APMs outside GKE. +--- + +# GKE Observability + +This reference covers monitoring, logging, and metrics configuration for GKE. +The golden path enables comprehensive observability including control-plane +metrics. + +> **MCP Tools:** `get_cluster`, `list_k8s_events`, `get_k8s_logs`, +> `get_k8s_cluster_info`, `describe_k8s_resource`. 
**CLI-only:** `gcloud +> container clusters update --monitoring=...`, `gcloud logging read` + +## Golden Path Observability Defaults + +Setting | Golden Path Value | Notes +--------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- | ----- +`loggingConfig` components | SYSTEM_COMPONENTS, WORKLOADS | Full workload logging +`monitoringConfig` components | SYSTEM_COMPONENTS, STORAGE, POD, DEPLOYMENT, STATEFULSET, DAEMONSET, HPA, JOBSET, CADVISOR, KUBELET, DCGM, APISERVER, SCHEDULER, CONTROLLER_MANAGER | Full suite including control-plane +`managedPrometheusConfig.enabled` | `true` | Google-managed Prometheus +`advancedDatapathObservabilityConfig.enableMetrics` | `true` | Dataplane V2 flow metrics +`loggingService` | `logging.googleapis.com/kubernetes` | Cloud Logging +`monitoringService` | `monitoring.googleapis.com/kubernetes` | Cloud Monitoring + +### Control-Plane Metrics (Golden Path Addition) + +The golden path adds three control-plane monitoring components not present in +default clusters: + +| Component | What It Monitors | +| -------------------- | ----------------------------------------------------- | +| `APISERVER` | API server request latency, error rates, admission | +: : webhook performance : +| `SCHEDULER` | Scheduling latency, pending pods, scheduling failures | +| `CONTROLLER_MANAGER` | Controller work queue depth, reconciliation latency | + +These are critical for diagnosing cluster-level issues (slow API responses, +scheduling delays, stuck controllers). 
+ +## Enabling Full Monitoring + +```bash +# Enable golden path monitoring suite +gcloud container clusters update --region \ + --monitoring=SYSTEM,API_SERVER,SCHEDULER,CONTROLLER_MANAGER,STORAGE,POD,DEPLOYMENT,STATEFULSET,DAEMONSET,HPA,CADVISOR,KUBELET,DCGM \ + --quiet + +# Enable Managed Prometheus +gcloud container clusters update --region \ + --enable-managed-prometheus \ + --quiet + +# Enable Dataplane V2 observability metrics +gcloud container clusters update --region \ + --enable-dataplane-v2-flow-observability \ + --quiet +``` + +## Managed Prometheus + +Golden path enables Google Managed Prometheus for metrics collection and +querying. + +**Querying metrics:** + +- Use Cloud Monitoring Metrics Explorer in the console +- Use PromQL via the Prometheus UI or API +- Grafana dashboards via Managed Grafana + +**Key GKE metrics:** + +| Metric | Source | Use | +| --------------------------------------- | ------------------ | ------------- | +| `container_cpu_usage_seconds_total` | cAdvisor | Pod CPU usage | +| `container_memory_working_set_bytes` | cAdvisor | Pod memory | +: : : usage : +| `kube_pod_status_phase` | kube-state-metrics | Pod lifecycle | +| `apiserver_request_duration_seconds` | API Server | Control plane | +: : : latency : +| `scheduler_scheduling_duration_seconds` | Scheduler | Scheduling | +: : : performance : +| `node_cpu_seconds_total` | Kubelet | Node CPU | +| `DCGM_FI_DEV_GPU_UTIL` | DCGM | GPU | +: : : utilization : + +## Live Resource Usage (kubectl-only) + +No MCP or gcloud equivalent exists for live resource usage. 
Use `kubectl top`: + +```bash +kubectl top pods --all-namespaces --sort-by=cpu +kubectl top nodes +kubectl top pods --containers -n # per-container breakdown +``` + +## Cloud Logging (gcloud-only) + +**Querying cluster logs** (no MCP equivalent — use `gcloud logging read`): + +```bash +# System component logs +gcloud logging read \ + 'resource.type="k8s_cluster" AND resource.labels.cluster_name=""' \ + --project --limit 50 \ + --quiet + +# Workload logs for a specific namespace +gcloud logging read \ + 'resource.type="k8s_container" AND resource.labels.cluster_name="" AND resource.labels.namespace_name=""' \ + --project --limit 50 \ + --quiet + +# Audit logs (who did what) +gcloud logging read \ + 'resource.type="k8s_cluster" AND logName:"cloudaudit.googleapis.com"' \ + --project --limit 50 \ + --quiet +``` + +## Diagnostic Settings + +For security monitoring and troubleshooting, verify which logging components are enabled: + +```bash +# View current logging config +gcloud container clusters describe --region \ + --format="yaml(loggingConfig)" \ + --quiet +``` + +## Alerting + +Set up alerts for critical conditions: + +Condition | Metric | Threshold +----------------------- | --------------------------------------------------- | --------- +High API server latency | `apiserver_request_duration_seconds` | P99 > 5s +Pod crash loops | `kube_pod_container_status_restarts_total` | > 5 in 10min +Node not ready | `kube_node_status_condition` | condition=Ready, status!=True +High GPU utilization | `DCGM_FI_DEV_GPU_UTIL` | > 95% sustained +PVC near capacity | `kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes` | > 85% +Scheduling failures | `scheduler_schedule_attempts_total{result="error"}` | > 0 + +## Cost Considerations + +Monitoring and logging have associated costs: + +- **Cloud Logging**: Charged per GiB ingested beyond free tier (50 + GiB/project/month) +- **Cloud Monitoring**: Free for GKE system metrics; custom metrics charged + per time series +- **Managed Prometheus**: Charged by volume of 
samples ingested + +To reduce costs in non-production: + +```bash +# Reduce to system-only monitoring +gcloud container clusters update --region \ + --monitoring=SYSTEM \ + --quiet +``` + +## Distributed Tracing & Continuous Profiling (Recommended) + +**Not golden path defaults** — recommended for production microservice +architectures and performance-sensitive workloads. + +- **Cloud Trace**: Add OpenTelemetry SDK to your app with the + `opentelemetry-operations-go` (or equivalent) exporter. Traces appear in + Cloud Trace console. Identifies cross-service latency bottlenecks. +- **Cloud Profiler**: Add the Cloud Profiler agent to your app. Profiles CPU + and memory usage in production with low overhead. Identifies hotspots and + compares across versions. + +## LQL Query Examples + +Common Logging Query Language patterns for GKE troubleshooting: + +``` +# Error logs for a specific container +resource.type="k8s_container" AND resource.labels.container_name="my-app" AND severity>=ERROR + +# OOMKilled events +resource.type="k8s_event" AND jsonPayload.reason="OOMKilling" + +# Pod scheduling failures +resource.type="k8s_event" AND jsonPayload.reason="FailedScheduling" + +# Audit logs (who did what) +resource.type="k8s_cluster" AND logName:"cloudaudit.googleapis.com" +``` diff --git a/skills/cloud/gke-reliability/SKILL.md b/skills/cloud/gke-reliability/SKILL.md new file mode 100644 index 0000000000..57f73520cb --- /dev/null +++ b/skills/cloud/gke-reliability/SKILL.md @@ -0,0 +1,200 @@ +--- +name: gke-reliability +description: >- + Improves GKE workload reliability, using PDBs, health probes, and topology + spread constraints. Use when configuring GKE workload reliability, setting up + PDBs, or configuring GKE health probes (liveness, readiness, startup). Don't + use for disaster recovery setup or full cluster backups (use gke-backup-dr + instead). 
+--- + +# GKE Reliability + +This reference covers high availability and reliability configuration for GKE +clusters and workloads. + +> **MCP Tools:** `get_cluster`, `get_k8s_resource`, `describe_k8s_resource`, +> `apply_k8s_manifest`, `list_k8s_events` + +## Golden Path Reliability Defaults + +| Setting | Golden Path Value | Notes | +| ---------------- | --------------------- | -------------------------------- | +| Cluster type | Regional (4 zones: | Control plane replicated across | +: : us-central1-a/b/c/f) : zones : +| Upgrade strategy | SURGE (`maxSurge: 1`) | Rolling upgrades with extra | +: : : capacity : +| Auto-repair | `true` | Unhealthy nodes replaced | +: : : automatically : +| Auto-upgrade | `true` | Nodes follow control plane | +: : : version : +| Release channel | REGULAR | Balanced freshness and stability | +| Stateful HA | Enabled | Leader election for stateful | +: : : workloads : + +## Workflows + +### 1. Verify Cluster High Availability + +``` +# MCP (preferred) +get_cluster(name="projects//locations//clusters/", + readMask="location,locations,nodePools.locations") + +# gcloud fallback +gcloud container clusters describe --region \ + --format="json(location, locations)" \ + --quiet +``` + +- If `location` is a region (e.g., `us-central1`), the control plane is + regional +- If `locations` has multiple entries, nodes span multiple zones + +### 2. Pod Disruption Budgets (PDBs) + +PDBs ensure minimum pod availability during voluntary disruptions (node +upgrades, autoscaler scale-down). 
+ +**Check existing PDBs:** + +``` +# MCP (preferred) +get_k8s_resource(parent="...", resourceType="poddisruptionbudget") + +# kubectl fallback +kubectl get pdb --all-namespaces +``` + +**Create PDB:** + +```yaml +apiVersion: policy/v1 +kind: PodDisruptionBudget +metadata: + name: my-app-pdb + namespace: default +spec: + minAvailable: 2 # Or use maxUnavailable: 1 + selector: + matchLabels: + app: my-app +``` + +> Every production Deployment with 2+ replicas should have a PDB. + +### 3. Health Probes + +Every production container should have liveness and readiness probes. Startup +probes are recommended for slow-starting apps. + +**Check existing probes:** + +``` +# MCP (preferred) +describe_k8s_resource(parent="...", resourceType="deployment", name="", namespace="") + +# kubectl fallback +kubectl get deployment -n -o yaml | grep -E "livenessProbe|readinessProbe|startupProbe" +``` + +**Recommended probe configuration:** + +```yaml +spec: + containers: + - name: app + livenessProbe: + httpGet: + path: /healthz + port: 8080 + initialDelaySeconds: 15 + periodSeconds: 10 + failureThreshold: 3 + readinessProbe: + httpGet: + path: /readyz + port: 8080 + initialDelaySeconds: 5 + periodSeconds: 5 + failureThreshold: 3 + startupProbe: # For slow-starting apps + httpGet: + path: /healthz + port: 8080 + initialDelaySeconds: 10 + periodSeconds: 5 + failureThreshold: 30 # 30 * 5s = 150s max startup time +``` + +- **Readiness**: Determines when a pod can accept traffic +- **Liveness**: Determines when to restart a container +- **Startup**: Disables liveness/readiness until the app is ready (prevents + premature restarts) + +### 4. Graceful Shutdown + +Ensure applications handle `SIGTERM` and drain in-flight requests: + +```yaml +spec: + terminationGracePeriodSeconds: 30 # Default; increase for long-running requests + containers: + - name: app + lifecycle: + preStop: + exec: + command: ["/bin/sh", "-c", "sleep 5"] # Allow LB to deregister +``` + +### 5. 
Topology Spread Constraints + +Distribute pods across zones and nodes to survive failures: + +```yaml +spec: + topologySpreadConstraints: + - maxSkew: 1 + topologyKey: topology.kubernetes.io/zone + whenUnsatisfiable: DoNotSchedule + labelSelector: + matchLabels: + app: my-app + - maxSkew: 1 + topologyKey: kubernetes.io/hostname + whenUnsatisfiable: ScheduleAnyway + labelSelector: + matchLabels: + app: my-app +``` + +- **Zone spread** (`DoNotSchedule`): Hard requirement -- pods must be balanced + across zones +- **Node spread** (`ScheduleAnyway`): Best-effort -- prefer distribution but + don't block scheduling + +### 6. Replicas + +| Workload Type | Minimum Replicas | Reason | +| -------------------- | -------------------- | ------------------------------ | +| Stateless web/API | 2 | Survive single pod/node | +: : : failure : +| Critical services | 3 | Survive zone failure with zone | +: : : spread : +| Stateful (databases) | 3 (with replication) | Application-level quorum | +| Batch/jobs | 1 | Ephemeral by nature | + +## Best Practices + +1. **Regional clusters for production**: Always use regional clusters to + survive zone failures +2. **PDBs for everything**: Every production workload with 2+ replicas needs a + PDB +3. **Probes for all containers**: At minimum, readiness probes on every + production container +4. **Zone spreading**: Use topology spread constraints to distribute pods + across failure domains +5. **Graceful shutdown**: Handle SIGTERM and set appropriate + `terminationGracePeriodSeconds` +6. **Maintenance windows**: Schedule upgrades during low-traffic periods (see + [gke-upgrades.md](../gke-upgrades/SKILL.md)) diff --git a/skills/cloud/gke-scaling/SKILL.md b/skills/cloud/gke-scaling/SKILL.md new file mode 100644 index 0000000000..01c6fa84e5 --- /dev/null +++ b/skills/cloud/gke-scaling/SKILL.md @@ -0,0 +1,175 @@ +--- +name: gke-scaling +description: >- + Configures GKE autoscaling, including HPA, VPA, and Node Auto-Provisioning + (NAP). 
Use when configuring GKE autoscaling, setting up GKE HPA, setting up + GKE VPA, or configuring GKE NAP. Don't use for configuring static cluster sizes + or setting node-level machine styles directly (use gke-compute-classes instead). +--- + +# GKE Workload Scaling + +This reference covers scaling workloads on GKE. The golden path enables VPA, +OPTIMIZE_UTILIZATION autoscaling profile, and Node Auto Provisioning by default. + +> **MCP Tools:** `get_k8s_resource`, `describe_k8s_resource`, +> `apply_k8s_manifest`, `patch_k8s_resource`, `get_cluster`, `update_cluster`, +> `update_node_pool` + +## Golden Path Scaling Defaults + +Setting | Golden Path Value | Notes +---------------------------------------- | ---------------------- | ----- +`autoscaling.autoscalingProfile` | `OPTIMIZE_UTILIZATION` | Aggressive scale-down for cost savings +`verticalPodAutoscaling.enabled` | `true` | VPA recommendations available +`autoscaling.enableNodeAutoprovisioning` | `true` | NAP creates node pools on demand +GPU resource limits (T4, A100) | `1000000000` each | NAP can provision GPU nodes + +## Scaling Mechanisms + +### 1. Manual Scaling + +> **kubectl-only** — no MCP equivalent for `kubectl scale`. Use kubectl +> directly. + +```bash +kubectl scale deployment --replicas= -n +``` + +### 2. Horizontal Pod Autoscaling (HPA) + +Scales the number of pods based on metrics. + +**Quick setup (kubectl-only — no MCP equivalent for `kubectl autoscale`):** + +```bash +kubectl autoscale deployment --cpu-percent=50 --min=1 --max=10 +``` + +**Manifest approach (recommended — use MCP `apply_k8s_manifest`):** + +See [assets/hpa-example.yaml](./assets/hpa-example.yaml) for a template. 
+ +```yaml +apiVersion: autoscaling/v2 +kind: HorizontalPodAutoscaler +metadata: + name: -hpa +spec: + scaleTargetRef: + apiVersion: apps/v1 + kind: Deployment + name: + minReplicas: 1 + maxReplicas: 10 + metrics: + - type: Resource + resource: + name: cpu + target: + type: Utilization + averageUtilization: 50 +``` + +### 3. Vertical Pod Autoscaling (VPA) + +Adjusts CPU and memory requests to match actual usage. Enabled by default on +golden path. + +**Update modes:** + +- `Off` — recommendations only (safest, start here) +- `Initial` — sets resources only at pod creation +- `Auto` — restarts pods to apply new resource values +- `InPlaceOrRecreate` — updates resources without restart when possible (GKE + 1.34+) + +**Create VPA in recommendation mode:** + +```yaml +apiVersion: autoscaling.k8s.io/v1 +kind: VerticalPodAutoscaler +metadata: + name: -vpa +spec: + targetRef: + apiVersion: apps/v1 + kind: Deployment + name: + updatePolicy: + updateMode: "Off" +``` + +**Read recommendations (prefer MCP `describe_k8s_resource`):** + +``` +# MCP (preferred) +describe_k8s_resource(parent="...", resourceType="verticalpodautoscaler", name="-vpa", namespace="") + +# kubectl fallback +kubectl get vpa -vpa -o jsonpath='{.status.recommendation}' +``` + +See [assets/vpa-example.yaml](./assets/vpa-example.yaml) for a full template. + +### 4. Cluster Autoscaler / Node Auto Provisioning (NAP) + +On Autopilot (golden path), node scaling is fully managed. NAP automatically +creates and sizes node pools based on workload demands. + +**For Standard clusters:** + +```bash +# Enable cluster autoscaler on a node pool +gcloud container clusters update --region \ + --enable-autoscaling --node-pool \ + --min-nodes --max-nodes \ + --quiet + +# Enable NAP +gcloud container clusters update --region \ + --enable-autoprovisioning \ + --min-cpu --max-cpu \ + --min-memory --max-memory \ + --quiet +``` + +**Autoscaling profiles:** + +| Profile | Behavior | Golden Path? 
| +| ---------------------- | ------------------------------------ | ------------ | +| `BALANCED` | Default GKE; conservative scale-down | No | +| `OPTIMIZE_UTILIZATION` | Aggressive scale-down; lower idle | **Yes** | +: : resources : : + +## Best Practices + +1. **Define resource requests**: HPA and VPA rely on accurate requests. Always + set them. +2. **Avoid metric conflicts**: Do not use HPA and VPA on the same metric. + Typical pattern: HPA on CPU, VPA on memory. +3. **Pod Disruption Budgets**: Define PDBs for all production workloads to + ensure availability during scaling events. +4. **HPA stabilization**: HPA applies a default 5-minute stabilization window + to scale-down only; scale-up has no window by default. Tune `behavior` if + either direction needs to respond faster or slower. +5. **VPA "Auto" caution**: Auto mode evicts and recreates pods to apply new + values. Ensure your app handles SIGTERM gracefully. VPA requires at least 2 + replicas for evictions by default. +6. **Use ComputeClasses**: For workload-specific node targeting (Spot fallback, + GPU, specific machine families), use ComputeClasses instead of node + selectors. + +## Rightsizing Workflow + +1. Deploy VPA in `Off` mode for 24+ hours +2. Read recommendations: `kubectl describe vpa ` +3. Compare `target` values against current `requests` +4. Apply with 20% buffer: `new_request = target * 1.2` +5.
Use patch format to update Deployment + +Condition | Recommendation | Risk +----------------------------- | ------------------------------------ | ------ +CPU request >5x P95 actual | Reduce to `P95 * 1.2` | Medium +Memory request >3x P95 actual | Reduce to `P95 * 1.2` | Medium +CPU request >2x P95 actual | Rightsizing with 20% buffer | Low +No resource limits set | Add limits to prevent noisy-neighbor | Low diff --git a/skills/cloud/gke-basics/assets/hpa-example.yaml b/skills/cloud/gke-scaling/assets/hpa-example.yaml similarity index 100% rename from skills/cloud/gke-basics/assets/hpa-example.yaml rename to skills/cloud/gke-scaling/assets/hpa-example.yaml diff --git a/skills/cloud/gke-basics/assets/vpa-example.yaml b/skills/cloud/gke-scaling/assets/vpa-example.yaml similarity index 100% rename from skills/cloud/gke-basics/assets/vpa-example.yaml rename to skills/cloud/gke-scaling/assets/vpa-example.yaml diff --git a/skills/cloud/gke-security/SKILL.md b/skills/cloud/gke-security/SKILL.md new file mode 100644 index 0000000000..4902ed81b3 --- /dev/null +++ b/skills/cloud/gke-security/SKILL.md @@ -0,0 +1,285 @@ +--- +name: gke-security +description: >- + Plans, configures, and hardens Google Kubernetes Engine (GKE) security. + Covers Workload Identity Federation, Secret Manager integration, RBAC + hardening, Binary Authorization, Network Policies (Dataplane V2), Pod + Security Standards, and IAM roles. Use when securing GKE clusters, setting up + Workload Identity, hardening RBAC configurations, or configuring GKE secrets. + Don't use for general network routing configuration (use gke-networking instead). +--- + +# GKE Security + +This reference covers security configuration for GKE clusters. The golden path +enforces a hardened security posture by default. 
+ +> **MCP Tools:** `get_cluster`, `check_k8s_auth`, `get_k8s_resource`, +> `apply_k8s_manifest`, `update_cluster` + +## Golden Path Security Defaults + +Setting | Golden Path Value | Day-0/1 | Notes +-------------------------------------------------------------- | --------------------------------------------------- | ------- | ----- +`workloadIdentityConfig.workloadPool` | `.svc.id.goog` | Day-0 | Workload Identity Federation for Pods +`secretManagerConfig.enabled` | `true` | Day-1 | Google Secret Manager integration +`secretManagerConfig.rotationConfig` | `enabled: true, rotationInterval: 120s` | Day-1 | Automatic secret rotation +`rbacBindingConfig.enableInsecureBindingSystemAuthenticated` | `false` | Day-0 | Blocks legacy `system:authenticated` bindings +`rbacBindingConfig.enableInsecureBindingSystemUnauthenticated` | `false` | Day-0 | Blocks legacy `system:unauthenticated` bindings +`nodeConfig.shieldedInstanceConfig.enableSecureBoot` | `true` | Day-0 | Verifiable boot integrity +`nodeConfig.shieldedInstanceConfig.enableIntegrityMonitoring` | `true` | Day-0 | Runtime integrity checks +`nodeConfig.workloadMetadataConfig.mode` | `GKE_METADATA` | Day-0 | Blocks legacy metadata API, enforces Workload Identity +Private cluster + Dataplane V2 settings | See [gke-networking.md](../gke-networking/SKILL.md) | Day-0 | Private nodes, private endpoint enforcement, ADVANCED_DATAPATH + +## Workload Identity Federation + +Workload Identity is the recommended way for pods to access Google Cloud APIs. +It eliminates the need for static service account keys. + +### Setup + +```bash +# 1. Create a Google Service Account (GSA) +gcloud iam service-accounts create \ + --project \ + --display-name "Workload Identity SA" \ + --quiet + +# 2. Grant IAM roles to the GSA +gcloud projects add-iam-policy-binding \ + --member "serviceAccount:@.iam.gserviceaccount.com" \ + --role "" \ + --quiet + +# 3. 
Create Kubernetes Service Account (KSA) +kubectl create namespace +kubectl create serviceaccount --namespace + +# 4. Bind KSA to GSA +gcloud iam service-accounts add-iam-policy-binding \ + @.iam.gserviceaccount.com \ + --role roles/iam.workloadIdentityUser \ + --member "serviceAccount:.svc.id.goog[/]" \ + --quiet + +# 5. Annotate KSA +kubectl annotate serviceaccount \ + --namespace \ + iam.gke.io/gcp-service-account=@.iam.gserviceaccount.com +``` + +> See [assets/workload-identity-pod.yaml](./assets/workload-identity-pod.yaml) +> for a test pod. + +### Verification + +```bash +# `kubectl run` no longer accepts --serviceaccount in recent kubectl releases; +# set the KSA name in serviceAccountName via --overrides instead +kubectl run workload-identity-test \ + --image=gcr.io/google.com/cloudsdktool/cloud-sdk:slim \ + --overrides='{"spec": {"serviceAccountName": ""}}' --namespace= \ + --rm -it -- gcloud auth list --quiet +``` + +## Secret Manager Integration + +The golden path enables Secret Manager with automatic rotation. Secrets are +synced to Kubernetes Secrets. + +```bash +# Verify Secret Manager is enabled on cluster +gcloud container clusters describe --region \ + --format="value(secretManagerConfig.enabled)" \ + --quiet + +# Enable if not already (Day-1 change) +gcloud container clusters update --region \ + --enable-secret-manager \ + --secret-manager-rotation-interval=120s \ + --quiet +``` + +## RBAC Hardening + +The golden path disables insecure legacy RBAC bindings that grant broad access +to `system:authenticated` and `system:unauthenticated` groups.
+ +```bash +# Verify insecure bindings are disabled +gcloud container clusters describe --region \ + --format="yaml(rbacBindingConfig)" \ + --quiet +``` + +**Best practices for RBAC:** + +- Use namespace-scoped Roles over cluster-wide ClusterRoles +- Bind to specific Groups or ServiceAccounts, never to `system:authenticated` +- Audit permissions via MCP: `check_k8s_auth(parent="...", verb="list", + resourceType="pods", namespace="...")` (or `kubectl auth can-i --list + --as=`) +- Review bindings via MCP: `get_k8s_resource(parent="...", + resourceType="clusterrolebinding")` (or `kubectl get + clusterrolebindings,rolebindings --all-namespaces`) + +> See [gke-multitenancy.md](../gke-multitenancy/SKILL.md) for enterprise RBAC +> planning and +> https://docs.cloud.google.com/kubernetes-engine/docs/best-practices/rbac + +## Binary Authorization + +Not enabled in golden path by default but recommended for production image +provenance: + +```bash +# Enable Binary Authorization +gcloud container clusters update --region \ + --binauthz-evaluation-mode=PROJECT_SINGLETON_POLICY_ENFORCE \ + --quiet +``` + +## Network Policies + +Dataplane V2 (golden path) provides built-in Network Policy enforcement. Apply +default-deny per namespace: + +``` +# MCP (preferred) +apply_k8s_manifest(parent="...", yamlManifest="") + +# kubectl fallback +kubectl apply -f third_party/skills/skills/cloud/gke-security/assets/default-deny-netpol.yaml -n +``` + +## GKE Sandbox (gVisor) + +For running untrusted workloads in an isolated sandbox: + +```bash +# Enable on cluster (Standard clusters) +gcloud container clusters update --region --enable-gke-sandbox --quiet + +# Use in pod spec +# Add: runtimeClassName: gvisor +``` + +## Pod Security Standards (Golden Path) + +Pod Security Standards define three profiles that restrict what pods can do. The +**`restricted` profile is the golden path default** for production namespaces. 
+ +| Profile | Level | Use Case | +| ------------ | --------------------- | ---------------------------------- | +| `privileged` | Unrestricted | System namespaces (`kube-system`), | +: : : infrastructure controllers : +| `baseline` | Minimally restrictive | Shared/dev namespaces, legacy apps | +: : : being migrated : +| `restricted` | **Golden path** | Production workloads -- blocks | +: : : privilege escalation, host access, : +: : : root : + +**Enforce via namespace labels (Pod Security Admission):** + +```yaml +apiVersion: v1 +kind: Namespace +metadata: + name: production + labels: + pod-security.kubernetes.io/enforce: restricted + pod-security.kubernetes.io/warn: restricted + pod-security.kubernetes.io/audit: restricted +``` + +**Gradual rollout strategy:** + +1. Start with `warn` + `audit` on existing namespaces to identify violations +2. Fix non-compliant workloads (remove `privileged`, `hostNetwork`, root user, + etc.) +3. Enable `enforce` once all workloads pass + +`restricted` blocks: running as root, privilege escalation, host +networking/PID/IPC, host path volumes, and most capabilities. The golden path +`workload-identity-pod.yaml` already complies. + +## Network Policy Logging (Recommended) + +With Dataplane V2 (golden path), you can enable logging for Network Policy +decisions. **Not a golden path default** -- recommended for security auditing. + +```bash +gcloud container clusters update --region \ + --enable-network-policy-logging \ + --quiet +``` + +This logs allowed and denied connections, useful for troubleshooting Network +Policy rules and auditing traffic flows. 
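A default-deny policy is a common source of the denied connections that show up in these logs. The manifest below is a minimal sketch, not necessarily the contents of `assets/default-deny-netpol.yaml`; the policy name and the namespace `my-namespace` are hypothetical placeholders.

```yaml
# Sketch only: denies all ingress and egress for every pod in the namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all      # hypothetical name
  namespace: my-namespace     # hypothetical namespace
spec:
  podSelector: {}             # empty selector matches all pods in the namespace
  policyTypes:
    - Ingress
    - Egress
```

Because this also blocks egress, including DNS lookups, workloads usually need additional allow policies before they can function.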
+ +## Common IAM Roles + +The five most common predefined IAM roles for GKE: + +| Role | Purpose | When to Use | +| ------------------------------- | ------------------- | -------------------- | +| `roles/container.admin` | Full control over | Platform team admins | +: : clusters and : managing cluster : +: : Kubernetes : lifecycle : +: : resources : : +| `roles/container.clusterAdmin` | Manage clusters but | Cluster operators | +: : not project-level : who create/delete : +: : IAM : clusters : +| `roles/container.developer` | Deploy workloads | Application | +: : (pods, services, : developers deploying : +: : deployments) : to existing clusters : +| `roles/container.viewer` | Read-only access to | Monitoring, | +: : clusters and : auditing, or : +: : Kubernetes : read-only dashboards : +: : resources : : +| `roles/container.clusterViewer` | List and get | CI/CD pipelines that | +: : cluster details : need cluster : +: : only : metadata : + +> **Principle of least privilege**: Start with `roles/container.viewer` or +> `roles/container.developer` and escalate only as needed. Avoid granting +> `roles/container.admin` broadly. + +## Service Accounts & Agents + +- **GKE Service Agent** + (`service-@container-engine-robot.iam.gserviceaccount.com`): + Automatically created. Manages nodes, networking, and cluster operations on + your behalf. Do not remove or modify its permissions. +- **Node Service Account**: By default, nodes use the Compute Engine default + service account. For production, create a dedicated SA with minimal + permissions and assign it via node pool config. +- **Workload Identity**: The recommended way for pods to access Google Cloud + APIs. Maps a Kubernetes ServiceAccount to a Google IAM ServiceAccount — see + [Workload Identity setup](#workload-identity-federation) above. 
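The KSA-to-GSA mapping in the last bullet can also be written declaratively. This is a minimal sketch that assumes the IAM binding from the Workload Identity setup already exists; `app-ksa`, `app-gsa`, `my-project`, and `my-namespace` are hypothetical names, while the annotation key and the image are the ones used earlier in this reference.

```yaml
# Sketch only: a KSA annotated with its GSA, and a pod that runs as that KSA.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-ksa               # hypothetical KSA name
  namespace: my-namespace     # hypothetical namespace
  annotations:
    iam.gke.io/gcp-service-account: app-gsa@my-project.iam.gserviceaccount.com
---
apiVersion: v1
kind: Pod
metadata:
  name: workload-identity-test
  namespace: my-namespace
spec:
  serviceAccountName: app-ksa # the pod authenticates as the mapped GSA
  containers:
    - name: test
      image: gcr.io/google.com/cloudsdktool/cloud-sdk:slim
      command: ["sleep", "infinity"]
```

Running `gcloud auth list` inside the pod should then report the GSA rather than the node's service account.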
+ +## Cross-Service Authentication Patterns + +Common patterns for granting GKE workloads access to other Google Cloud +services: + +```bash +# Grant a GKE workload access to Cloud Storage +gcloud projects add-iam-policy-binding \ + --member "serviceAccount:@.iam.gserviceaccount.com" \ + --role "roles/storage.objectViewer" \ + --quiet + +# Grant a GKE workload access to Cloud SQL +gcloud projects add-iam-policy-binding \ + --member "serviceAccount:@.iam.gserviceaccount.com" \ + --role "roles/cloudsql.client" \ + --quiet + +# Grant a GKE workload access to Pub/Sub +gcloud projects add-iam-policy-binding \ + --member "serviceAccount:@.iam.gserviceaccount.com" \ + --role "roles/pubsub.subscriber" \ + --quiet +``` + +In all cases, the GSA must be bound to a KSA via Workload Identity (see setup +above). The pod then uses the KSA to authenticate as the GSA. diff --git a/skills/cloud/gke-basics/assets/default-deny-netpol.yaml b/skills/cloud/gke-security/assets/default-deny-netpol.yaml similarity index 100% rename from skills/cloud/gke-basics/assets/default-deny-netpol.yaml rename to skills/cloud/gke-security/assets/default-deny-netpol.yaml diff --git a/skills/cloud/gke-basics/assets/workload-identity-pod.yaml b/skills/cloud/gke-security/assets/workload-identity-pod.yaml similarity index 100% rename from skills/cloud/gke-basics/assets/workload-identity-pod.yaml rename to skills/cloud/gke-security/assets/workload-identity-pod.yaml diff --git a/skills/cloud/gke-storage/SKILL.md b/skills/cloud/gke-storage/SKILL.md new file mode 100644 index 0000000000..d48a4ccce0 --- /dev/null +++ b/skills/cloud/gke-storage/SKILL.md @@ -0,0 +1,161 @@ +--- +name: gke-storage +description: >- + Manages GKE storage, including PVCs, PersistentVolumes, Filestore, and GCS + FUSE. Use when configuring GKE storage, creating PVCs, or setting up GCS FUSE + on GKE. Don't use for database administration or replication strategies + outside volume provisioning context. 
+--- + +# GKE Storage + +This reference covers storage configuration for GKE clusters including +persistent disks, file storage, and cloud storage integration. + +> **MCP Tools:** `apply_k8s_manifest`, `get_k8s_resource`, +> `describe_k8s_resource`, `get_cluster` + +## Golden Path Storage Defaults + +The golden path Autopilot config enables these CSI drivers: + +| Driver | Golden Path | Access Mode | Use Case | +| --------------- | ----------------- | --------------- | -------------------- | +| Compute Engine | Enabled (default) | ReadWriteOnce | Block storage for | +: Persistent Disk : : : databases, : +: CSI : : : single-pod workloads : +| Google Cloud | Enabled | ReadWriteMany | Shared NFS for | +: Filestore CSI : : : multi-pod access : +| Cloud Storage | Enabled | ReadWriteMany / | Mount GCS buckets as | +: FUSE CSI : : ReadOnlyMany : volumes : +| Parallelstore | Enabled | ReadWriteMany | High-performance | +: CSI : : : parallel file system : +| Boot disk type | `pd-balanced` | N/A | Node boot disks | + +## StorageClasses + +### Default StorageClasses + +GKE provides built-in StorageClasses: + +StorageClass | Disk Type | Use Case +-------------- | --------------------- | ------------------------------ +`standard-rwo` | `pd-balanced` | General purpose, balanced cost and performance +`premium-rwo` | `pd-ssd` | High IOPS, databases +`standard-rwx` | Filestore (Basic HDD) | Shared NFS +`premium-rwx` | Filestore (Basic SSD) | Shared NFS, higher performance + +### Custom StorageClass + +```yaml +apiVersion: storage.k8s.io/v1 +kind: StorageClass +metadata: + name: fast-regional +provisioner: pd.csi.storage.gke.io +parameters: + type: pd-ssd + replication-type: regional-pd # Replicate across 2 zones +volumeBindingMode: WaitForFirstConsumer +allowVolumeExpansion: true # Always enable for production +``` + +## PersistentVolumeClaims + +### Block Storage (ReadWriteOnce) + +```yaml +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: database-pvc +spec: + accessModes: + -
ReadWriteOnce + storageClassName: premium-rwo + resources: + requests: + storage: 100Gi +``` + +### Shared File Storage (ReadWriteMany via Filestore) + +```yaml +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: shared-data +spec: + accessModes: + - ReadWriteMany + storageClassName: standard-rwx + resources: + requests: + storage: 1Ti # Filestore minimum is 1 TiB for Basic HDD +``` + +### GCS Bucket Mount (Cloud Storage FUSE) + +Mount a GCS bucket as a volume without a PVC: + +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: gcs-reader + annotations: + gke-gcsfuse/volumes: "true" +spec: + containers: + - name: reader + image: busybox + command: ["ls", "/data"] + volumeMounts: + - name: gcs-bucket + mountPath: /data + volumes: + - name: gcs-bucket + csi: + driver: gcsfuse.csi.storage.gke.io + readOnly: true + volumeAttributes: + bucketName: +``` + +> Requires Workload Identity: the pod's service account needs +> `roles/storage.objectViewer` on the bucket. + +## Volume Expansion + +If `allowVolumeExpansion: true` is set on the StorageClass, resize by updating +the PVC: + +```bash +# kubectl +kubectl patch pvc -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}' +``` + +``` +# MCP (preferred) +patch_k8s_resource(parent="...", resourceType="persistentvolumeclaim", name="", + patch='{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}') +``` + +Kubernetes automatically resizes the filesystem. + +## Best Practices + +1. **Always enable volume expansion**: Set `allowVolumeExpansion: true` on all + StorageClasses +2. **Use regional PDs for production**: `replication-type: regional-pd` + replicates across 2 zones for HA +3. **Use `WaitForFirstConsumer`**: Ensures the PV is provisioned in the same + zone as the pod +4. **Choose the right disk type**: `pd-ssd` for databases, `pd-balanced` + (golden path default) for general use, `pd-standard` for cold storage +5.
**Use Filestore for shared access**: When multiple pods need to read/write + the same files +6. **Use GCS FUSE for data pipelines**: Mount buckets directly for ML training + data, logs, etc. +7. **Back up PVCs**: Use Backup for GKE (see + [gke-backup-dr.md](../gke-backup-dr/SKILL.md)) to protect persistent data diff --git a/skills/cloud/gke-upgrades/SKILL.md b/skills/cloud/gke-upgrades/SKILL.md new file mode 100644 index 0000000000..ac826fb0de --- /dev/null +++ b/skills/cloud/gke-upgrades/SKILL.md @@ -0,0 +1,172 @@ +--- +name: gke-upgrades +description: >- + Manages GKE upgrades, maintenance windows, and release channels. Use when + upgrading GKE clusters, configuring maintenance windows, or setting release + channels. Don't use for general cluster provisioning or resizing (use + gke-cluster-creation or gke-scaling instead). +--- + +# GKE Upgrades & Maintenance + +This reference covers upgrade strategy, maintenance windows, and release channel +management for GKE clusters. + +> **MCP Tools:** `get_cluster`, `get_k8s_version`, `update_cluster`, +> `update_node_pool`, `list_operations`, `get_operation`, `cancel_operation`, +> `get_k8s_resource` **CLI-only**: `gcloud container get-server-config` +> (available versions), `gcloud container clusters update +> --maintenance-window-*` (maintenance windows) + +## Golden Path Upgrade Defaults + +| Setting | Golden Path Value | Notes | +| -------------------------- | ---------------------- | ---------------------- | +| `releaseChannel.channel` | `REGULAR` | Balanced between | +: : : freshness and : +: : : stability : +| Maintenance exclusion | `NO_MINOR_UPGRADES`, 1 | Prevents surprise | +: : year : minor version bumps : +| `upgradeSettings.strategy` | `SURGE` | Rolling upgrades with | +: : : `maxSurge\: 1` : +| Auto-repair | `true` | Unhealthy nodes are | +: : : automatically replaced : +| Auto-upgrade | `true` | Nodes follow control | +: : : plane version : + +## Release Channels + +| Channel | Cadence | Best For | +| 
----------------------- | ---------------------- | ------------------------- | +| `RAPID` | Weeks after release | Dev/test, early access to | +: : : features : +| `REGULAR` (golden path) | 2-3 months after Rapid | Production workloads | +| `STABLE` | 2-3 months after | Risk-averse, highly | +: : Regular : regulated : + +```bash +# Check current channel +gcloud container clusters describe --region \ + --format="value(releaseChannel.channel)" \ + --quiet + +# Change channel (Day-1) +gcloud container clusters update --region \ + --release-channel \ + --quiet +``` + +## Maintenance Windows + +Control when GKE can perform automatic maintenance (upgrades, patches). + +```bash +# Set maintenance window (e.g., weekends 2am-6am UTC) +gcloud container clusters update --region \ + --maintenance-window-start "2026-01-01T02:00:00Z" \ + --maintenance-window-end "2026-01-01T06:00:00Z" \ + --maintenance-window-recurrence "FREQ=WEEKLY;BYDAY=SA,SU" \ + --quiet +``` + +### Maintenance Exclusions + +The golden path includes a 1-year `NO_MINOR_UPGRADES` exclusion to prevent +automatic minor version changes. 
+ +```bash +# Add maintenance exclusion +gcloud container clusters update --region \ + --add-maintenance-exclusion-name "freeze-1" \ + --add-maintenance-exclusion-start "2026-04-11T00:00:00Z" \ + --add-maintenance-exclusion-end "2027-04-11T00:00:00Z" \ + --add-maintenance-exclusion-scope NO_MINOR_UPGRADES \ + --quiet + +# Remove exclusion +gcloud container clusters update --region \ + --remove-maintenance-exclusion "freeze-1" \ + --quiet +``` + +**Exclusion scopes:** + +- `NO_UPGRADES` — blocks all upgrades (max 30 days) +- `NO_MINOR_UPGRADES` — allows patch upgrades, blocks minor version changes + (max 1 year) +- `NO_MINOR_OR_NODE_UPGRADES` — blocks minor and node upgrades (max 1 year) + +## Upgrade Strategy + +### SURGE (Golden Path) + +Rolling upgrade with configurable surge capacity: + +```bash +# Default: maxSurge=1 (one extra node during upgrade) +gcloud container node-pools update \ + --cluster --region \ + --max-surge-upgrade 1 --max-unavailable-upgrade 0 \ + --quiet +``` + +### Blue-Green (For Zero-Downtime Critical Workloads) + +```bash +gcloud container node-pools update \ + --cluster --region \ + --enable-blue-green-upgrade \ + --node-pool-soak-duration "3600s" \ + --quiet +``` + +## Pre-Upgrade Checklist + +1. **Check deprecations**: Review Kubernetes API deprecations between current + and target version +2. **Review PDBs**: Ensure all production workloads have PodDisruptionBudgets +3. **Test in non-prod**: Upgrade a staging cluster first +4. **Check addon compatibility**: Verify third-party controllers support the + target version +5. 
**Review node pool versions**: All node pools should be within 2 minor + versions of the control plane + +```bash +# Check current versions +gcloud container clusters describe --region \ + --format="table(currentMasterVersion, nodePools[].version)" \ + --quiet + +# Check available upgrades +gcloud container get-server-config --region \ + --format="yaml(channels)" \ + --quiet + +# List deprecation warnings +kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis +``` + +## Manual Upgrade (When Needed) + +```bash +# Upgrade control plane +gcloud container clusters upgrade --region \ + --master --cluster-version \ + --quiet + +# Upgrade node pool +gcloud container clusters upgrade --region \ + --node-pool \ + --quiet +``` + +## Best Practices + +1. **Stay on a release channel**: Manual version management is error-prone. Let + GKE manage versions. +2. **Use maintenance windows**: Schedule upgrades during low-traffic periods. +3. **Set PDBs on everything**: Protects workloads during node drains. +4. **Monitor during upgrades**: Watch for pod eviction failures, + CrashLoopBackOff, and scheduling issues. +5. **Don't skip minor versions**: Upgrade incrementally (1.28 -> 1.29 -> 1.30, + not 1.28 -> 1.30).
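A PDB only protects a workload during node drains if it still leaves room for at least one eviction; a budget that permits zero disruptions can stall the drain instead. The manifest below is a minimal sketch of an upgrade-friendly budget, reusing the illustrative `my-app` label and `my-app-pdb` name from the reliability reference and assuming the Deployment runs at least 2 replicas.

```yaml
# Sketch only: allows one pod at a time to be evicted while nodes are drained.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb            # illustrative name
  namespace: default
spec:
  maxUnavailable: 1           # keeps drains moving; assumes 2+ replicas
  selector:
    matchLabels:
      app: my-app             # must match the Deployment's pod labels
```

Setting `minAvailable` equal to the replica count would have the opposite effect, since no pod could ever be evicted voluntarily.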