Orchestrated Kubernetes Intelligence Kernel
OKIK is an AI-powered Kubernetes operations system that continuously analyzes your cluster, diagnoses issues across workloads, and takes corrective action -- all under graduated human oversight. It removes the need to write dozens of single-purpose controllers by letting an LLM reason across pods, deployments, nodes, HPAs, and jobs in a single conversation loop.
For environments where LLM-driven automation is not desired, OKIK ships a deterministic watchdog mode that detects and remediates common failures using threshold-based rules with zero external dependencies.
Classic Kubernetes controllers are single-purpose reconciliation loops. Each one watches one resource type, compares desired vs actual state, and makes a hard-coded correction. To cover resource efficiency, scaling, availability, cost, and workload health, you need to wire together HPA, VPA, node autoscaler, custom operators, and monitoring dashboards -- and none of them talk to each other.
OKIK replaces that fragmented approach with a single agent that:
- Reasons across the whole cluster. The LLM correlates signals from pods, deployments, nodes, HPAs, jobs, PVCs, and events in one pass. It can say "this pod is OOMing because the HPA target is too aggressive for this workload's memory profile" -- something no single controller can do.
- Acts with graduated trust. Five autonomy levels (L0-L4) let you start advisory-only and unlock write actions incrementally. Risky operations queue for human approval instead of executing blindly.
- Works without an LLM too. The watchdog mode provides CrashLoopBackOff detection, OOM quarantining, and pending-pod remediation with deterministic rules and zero API keys.
- Multi-provider support -- OpenAI (GPT) and Anthropic (Claude) with native tool-calling APIs, no prompt parsing hacks
- Continuous reasoning loop -- collects cluster metrics, feeds them to the LLM, executes allowed actions, queues risky ones for approval
- 30+ Kubernetes tools organized into risk-tiered packages (`OBSERVE`, `REMEDIATE`, `QUARANTINE`, `RESTART`, `DESTRUCTIVE`)
- Interactive chat -- conversational cluster assistant with markdown rendering, command history, auto-completion, and offline fallback
- Streaming responses -- real-time streamed chat and analysis output
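One tick of the continuous reasoning loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not OKIK's actual code: `run_tick` and its callable parameters are hypothetical names, and the LLM is replaced by any function that turns a metrics snapshot into proposed tool calls.

```python
from typing import Any, Callable, Dict, List

# A proposed tool call, e.g. {"tool": "get_pods", "namespace": "default"}.
ToolCall = Dict[str, Any]


def run_tick(
    collect: Callable[[], Dict[str, Any]],
    plan: Callable[[Dict[str, Any]], List[ToolCall]],
    is_allowed: Callable[[ToolCall], bool],
    execute: Callable[[ToolCall], Any],
    approval_queue: List[ToolCall],
) -> List[Any]:
    """Run one loop iteration (all names here are hypothetical).

    1. collect cluster metrics
    2. ask the planner (the LLM in agent mode) for tool calls
    3. execute the calls that are allowed right now
    4. queue everything else for human approval
    """
    metrics = collect()
    results: List[Any] = []
    for call in plan(metrics):
        if is_allowed(call):
            results.append(execute(call))
        else:
            approval_queue.append(call)
    return results
```

Passing the collector, planner, and policy in as callables keeps the loop itself trivial to test with stubs, which is the only design point this sketch is meant to show.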
| Level | Mode | Behavior |
|---|---|---|
| L0 | Advisory | No tools. Observe-only; all suggestions are textual |
| L1 | Suggest | Read-only tools (OBSERVE). All write actions queue for approval |
| L2 | Guarded (default) | Read + safe writes (REMEDIATE, QUARANTINE). Restarts and destructive ops queue |
| L3 | Conditional | Most writes execute automatically. Only DESTRUCTIVE queues |
| L4 | Full | All tool packages execute (subject to hard safety checks) |
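The table above can be read as a lookup from (autonomy level, tool package) to one of three outcomes. The sketch below encodes that reading; the names `Decision`, `AUTO_EXECUTE`, and `decide` are hypothetical, and treating L0 as "deny everything" is an assumption based on the "No tools" wording.

```python
from enum import Enum


class Decision(Enum):
    EXECUTE = "execute"  # run immediately
    QUEUE = "queue"      # hold as a pending plan for human approval
    DENY = "deny"        # tool is not available at this level


PACKAGES = ("OBSERVE", "REMEDIATE", "QUARANTINE", "RESTART", "DESTRUCTIVE")

# Packages that execute without approval at each level, per the table.
AUTO_EXECUTE = {
    "L0": set(),
    "L1": {"OBSERVE"},
    "L2": {"OBSERVE", "REMEDIATE", "QUARANTINE"},
    "L3": {"OBSERVE", "REMEDIATE", "QUARANTINE", "RESTART"},
    "L4": set(PACKAGES),
}


def decide(level: str, package: str) -> Decision:
    """Map an autonomy level and a tool package to a decision."""
    if level not in AUTO_EXECUTE:
        raise ValueError(f"unknown autonomy level: {level}")
    if package not in PACKAGES:
        raise ValueError(f"unknown tool package: {package}")
    if level == "L0":
        # Assumption: advisory mode exposes no tools, so nothing is queued.
        return Decision.DENY
    if package in AUTO_EXECUTE[level]:
        return Decision.EXECUTE
    return Decision.QUEUE
```

For example, `decide("L2", "RESTART")` yields `Decision.QUEUE`, while the same package at L3 yields `Decision.EXECUTE`. Hard safety checks such as namespace guards would still apply on top of this lookup.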
- Tool gatekeeper -- enforces package-level access by autonomy level at every tool call
- Namespace restrictions -- writes to `kube-system`, `kube-public`, `kube-node-lease`, and `okik-system` are always blocked
- Approval queue -- risky operations are held as pending plans until a human approves or rejects via CLI/API
- Emergency stop -- single command to halt all agent actions immediately
- Rate limiting -- prevents runaway tool execution
- Full audit trail -- every tool call, plan, and action is logged to SQLite with timestamps, parameters, and results
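To make the namespace guard and the audit trail concrete, here is a small sketch that checks a write against the protected namespaces and records the outcome in SQLite. The table name `tool_calls`, its columns, and the function names are assumptions for illustration; OKIK's real schema is not documented here.

```python
import json
import sqlite3
from datetime import datetime, timezone

# Namespaces where writes are always blocked (from the list above).
BLOCKED_NAMESPACES = frozenset(
    {"kube-system", "kube-public", "kube-node-lease", "okik-system"}
)


def open_audit_log(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the audit database. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tool_calls ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "ts TEXT NOT NULL, tool TEXT NOT NULL, "
        "params TEXT NOT NULL, result TEXT NOT NULL)"
    )
    return conn


def record_tool_call(conn, tool: str, params: dict, result: str) -> int:
    """Append one audit row with a UTC timestamp and JSON-encoded params."""
    ts = datetime.now(timezone.utc).isoformat()
    cur = conn.execute(
        "INSERT INTO tool_calls (ts, tool, params, result) VALUES (?, ?, ?, ?)",
        (ts, tool, json.dumps(params, sort_keys=True), result),
    )
    conn.commit()
    return cur.lastrowid


def guarded_write(conn, tool: str, namespace: str, params: dict) -> bool:
    """Return True if the write may proceed; log the decision either way."""
    allowed = namespace not in BLOCKED_NAMESPACES
    result = "allowed" if allowed else "blocked: protected namespace"
    record_tool_call(conn, tool, {"namespace": namespace, **params}, result)
    return allowed
```

Logging blocked attempts as well as allowed ones means the audit trail also shows what the agent tried to do, not only what it did.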
Category-based cluster optimization with health scoring:
- Resource Efficiency -- CPU/memory request vs actual usage, over-provisioned workloads
- Availability & Health -- restart rates, CrashLoopBackOff, pending pods, node conditions
- Cost & Waste -- idle resources, unused deployments, unbound PVCs
- Scaling -- HPA saturation, deployment availability ratios
- Workload Health -- Job/CronJob failures, DaemonSet coverage gaps
Each category gets a 0-100 health score with severity-tagged findings. In reconcile mode, the agent fixes issues as it finds them.
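The scoring formula is not spelled out here, so the following is only one plausible way to turn severity-tagged findings into a 0-100 score: start at 100 and subtract a fixed penalty per finding, clamping at zero. The severity names and penalty weights are assumptions, as are `Finding` and `health_score`.

```python
from dataclasses import dataclass
from typing import Iterable

# Hypothetical penalty per finding; not OKIK's actual weights.
PENALTY = {"critical": 25, "warning": 10, "info": 2}


@dataclass(frozen=True)
class Finding:
    category: str  # e.g. "Resource Efficiency"
    severity: str  # one of the PENALTY keys
    message: str


def health_score(findings: Iterable[Finding]) -> int:
    """Return 100 minus the summed penalties, never below 0."""
    penalty = sum(PENALTY[f.severity] for f in findings)
    return max(0, 100 - penalty)
```

With these weights, a category with one critical and one warning finding scores 65, and a category with no findings scores 100.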
LLM-free controller for environments that need predictable, rule-based operation:
- Detects CrashLoopBackOff pods exceeding restart thresholds
- Quarantines unstable deployments (scales to zero, preserves original replica count)
- Identifies pods stuck in Pending beyond a configurable timeout
- Detects OOMKilled containers
- Exposes compatible `/health`, `/status`, and `/recommendations` endpoints
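The watchdog itself is a Go service; the rules above are sketched here in Python purely to show their shape. `PodStatus`, `evaluate`, the finding labels, and the default thresholds (5 restarts, 300 seconds) are all hypothetical. The reason strings `CrashLoopBackOff` and `OOMKilled` are the values Kubernetes reports in container statuses.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PodStatus:
    """Flattened view of the pod fields the rules need (hypothetical type)."""
    name: str
    phase: str                   # pod phase, e.g. "Pending" or "Running"
    waiting_reason: str          # state.waiting.reason, "" if not waiting
    last_terminated_reason: str  # lastState.terminated.reason, "" if none
    restart_count: int
    age_seconds: float           # time since the pod was created


def evaluate(
    pod: PodStatus,
    restart_threshold: int = 5,        # assumed default
    pending_timeout_s: float = 300.0,  # assumed default
) -> List[str]:
    """Apply the three threshold rules and return the matching labels."""
    findings: List[str] = []
    if pod.waiting_reason == "CrashLoopBackOff" and pod.restart_count > restart_threshold:
        findings.append("crashloop")
    if pod.last_terminated_reason == "OOMKilled":
        findings.append("oomkilled")
    if pod.phase == "Pending" and pod.age_seconds > pending_timeout_s:
        findings.append("pending-timeout")
    return findings
```

Because every rule is a plain comparison against a threshold, the same input always produces the same findings, which is the property that distinguishes watchdog mode from LLM-driven analysis.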
- Quarantine -- scales a failing deployment to zero, annotates with reason, preserves original replica count
- Unquarantine -- restores the deployment to its pre-quarantine state
- List -- shows all quarantined deployments across namespaces
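The quarantine round trip can be sketched on a Deployment represented as a plain dictionary. The annotation keys below are placeholders (the real keys are not given in this document), and the functions are illustrative rather than OKIK's implementation.

```python
import copy

# Hypothetical annotation keys; OKIK's actual keys may differ.
REPLICAS_KEY = "okik.example/original-replicas"
REASON_KEY = "okik.example/quarantine-reason"


def quarantine(deployment: dict, reason: str) -> dict:
    """Return a copy scaled to zero, with the original replica count saved."""
    dep = copy.deepcopy(deployment)
    annotations = dep.setdefault("metadata", {}).setdefault("annotations", {})
    # Only save the count once, so re-quarantining never records 0.
    if REPLICAS_KEY not in annotations:
        # spec.replicas defaults to 1 in Kubernetes when unset.
        annotations[REPLICAS_KEY] = str(dep["spec"].get("replicas", 1))
    annotations[REASON_KEY] = reason
    dep["spec"]["replicas"] = 0
    return dep


def unquarantine(deployment: dict) -> dict:
    """Return a copy restored to its pre-quarantine replica count."""
    dep = copy.deepcopy(deployment)
    annotations = dep.get("metadata", {}).get("annotations", {})
    if REPLICAS_KEY not in annotations:
        return dep  # not quarantined; nothing to restore
    dep["spec"]["replicas"] = int(annotations.pop(REPLICAS_KEY))
    annotations.pop(REASON_KEY, None)
    return dep


def is_quarantined(deployment: dict) -> bool:
    """True if the deployment carries the saved-replicas annotation."""
    return REPLICAS_KEY in deployment.get("metadata", {}).get("annotations", {})
```

Storing the original count on the object itself means a restore needs no external state, and the list operation reduces to filtering deployments with `is_quarantined`.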
Full-featured Go CLI built with Cobra, Bubble Tea, and Lip Gloss:
| Command | Description |
|---|---|
| `deploy` | Deploy agent mode or watchdog mode (`--watchdog`) |
| `status` | Cluster and agent status with tree/classic views |
| `chat` | Interactive conversational assistant |
| `watch` | Live streaming recommendation feed |
| `optimise` | Targeted optimization by category |
| `actions` | List, approve, or reject pending plans |
| `autonomy` | Get/set autonomy level, enable/disable automation |
| `plan` | Inspect and manage generated action plans |
| `quarantine` | List, quarantine, and restore deployments |
| `tool` | List available tools, run tool tests |
| `load-test` | Deploy scenario workloads for testing |
| `logs` | Tail agent pod logs |
| `emergency-stop` | Halt all agent actions |
| `remove` | Remove OKIK from the cluster |
```mermaid
flowchart TB
    subgraph CLI["okikctl (Go CLI)"]
        cmds["deploy | status | chat | watch\noptimise | actions | autonomy | logs"]
    end
    subgraph AgentPod["Agent Pod (okik-system)"]
        agent["FastAPI Agent\n(Python)"]
        gk["Gatekeeper\nL0-L4 enforcement"]
        llm["LLM Client\nOpenAI / Anthropic"]
        audit["Audit Log\n(SQLite)"]
        tp["Toolproxy Sidecar\n(Go, client-go)"]
        agent --- gk
        agent --- llm
        agent --- audit
        agent -- "HTTP (in-pod)" --> tp
    end
    subgraph WatchdogPod["Watchdog Pod (okik-system)"]
        wd["Watchdog Service\n(Go rules engine)"]
    end
    CLI -- "Agent mode" --> agent
    CLI -- "Watchdog mode" --> wd
    tp -- "client-go" --> k8s["Kubernetes API"]
    wd -- "client-go" --> k8s
```
Agent mode -- the CLI talks to a FastAPI agent that reasons via LLM tool-calling. The gatekeeper enforces autonomy-level permissions on every tool call. All Kubernetes API access goes through the Go toolproxy sidecar (same pod, HTTP), keeping the Python agent free of direct K8s dependencies.
Watchdog mode -- the CLI talks to a standalone Go service that applies deterministic threshold-based rules with zero LLM dependency.
- Kubernetes cluster with `kubectl` access
- `metrics-server` enabled (for minikube: `minikube addons enable metrics-server`)
- API key for OpenAI or Anthropic (agent mode only)
```bash
# Build the CLI
go build -o okikctl cmd/okikctl/main.go
```

```bash
# Agent mode: OpenAI
./okikctl deploy --llm-provider openai --api-key sk-...
# Agent mode: Anthropic
./okikctl deploy --llm-provider anthropic --api-key sk-ant-...
# Watchdog mode (no LLM, no API key)
./okikctl deploy --watchdog
```

```bash
./okikctl status                      # cluster and agent overview
./okikctl chat                        # interactive assistant
./okikctl watch                       # live recommendation stream
./okikctl optimise --category all     # full optimization scan
./okikctl autonomy set L3             # increase autonomy
./okikctl actions list                # view pending approvals
./okikctl actions approve <plan-id>   # approve a queued plan
./okikctl quarantine list             # view quarantined deployments
./okikctl logs -f                     # tail agent logs
./okikctl emergency-stop              # halt agent actions
./okikctl remove                      # remove from cluster
```

- OBSERVE: `list_namespaces`, `get_pods`, `get_deployments`, `get_services`, `get_pod_logs`, `get_events`, `describe_resource`, `get_nodes`, `get_hpa`, `get_resource_quotas`, `get_pvcs`, `get_jobs`, `get_cronjobs`, `get_daemonsets`, `get_statefulsets`, `get_pdbs`, `get_node_conditions`, `get_container_statuses`
- REMEDIATE: `scale_deployment`, `update_resources`, `create_hpa`, `update_hpa`
- QUARANTINE: `quarantine_deployment`, `unquarantine_deployment`, `list_quarantined`
- RESTART: `rollout_restart`, `delete_pod`, `delete_hpa`
- DESTRUCTIVE: `scale_to_zero`, `cordon_node`, `uncordon_node`, `delete_deployment`, `delete_namespace`
The agent exposes a REST API on port 8000:
| Method | Path | Description |
|---|---|---|
| GET | `/health` | Health probe |
| GET | `/status` | Agent config, state, and tool info |
| GET | `/recommendations` | Current recommendations |
| GET | `/metrics` | Raw cluster metrics snapshot |
| GET | `/tools` | Tool definitions |
| GET | `/tools/calls` | Recent tool call history |
| GET | `/actions` | Autonomy state + pending approvals |
| POST | `/actions/approve/{id}` | Approve a queued plan |
| POST | `/actions/reject/{id}` | Reject a queued plan |
| PUT | `/autonomy` | Set autonomy level and enabled state |
| POST | `/emergency-stop` | Toggle emergency stop |
| POST | `/chat` | Chat (non-streaming) |
| POST | `/chat/stream` | Chat (streaming) |
| POST | `/optimise` | Run category-based optimization |
| POST | `/config` | Runtime config update |
| GET | `/audit/plans` | Plan history |
| GET | `/audit/plans/{id}` | Single plan detail |
| GET | `/audit/actions` | Action history |
| GET | `/audit/actions/{id}` | Single action detail |
| GET | `/audit/tools` | Tool usage stats |
| GET | `/audit/stats` | Aggregate audit stats |
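A client for these endpoints needs little more than the standard library. The sketch below builds requests for the paths in the table; it assumes the agent is reachable at a base URL such as `http://localhost:8000` (for example through a port-forward) and that responses are JSON, neither of which is confirmed here. Request body schemas are not documented, so none are shown, and the function names are hypothetical.

```python
import json
import urllib.request
from typing import Any, Optional


def build_request(
    base_url: str, method: str, path: str, body: Optional[dict] = None
) -> urllib.request.Request:
    """Build (but do not send) a request for one of the agent endpoints."""
    headers = {"Accept": "application/json"}
    data = None
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    url = base_url.rstrip("/") + path
    return urllib.request.Request(url, data=data, headers=headers, method=method)


def approve_plan(base_url: str, plan_id: str) -> urllib.request.Request:
    """POST /actions/approve/{id} for a queued plan."""
    return build_request(base_url, "POST", f"/actions/approve/{plan_id}")


def send(req: urllib.request.Request, timeout: float = 10.0) -> Any:
    """Send the request and decode the body, assuming a JSON response."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A typical call would be `send(approve_plan("http://localhost:8000", plan_id))`, where `plan_id` comes from the pending approvals returned by `GET /actions`.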
| Parameter | Default | Description |
|---|---|---|
| LLM provider | -- | `openai` or `anthropic` |
| Model | `gpt-5-mini` / `claude-sonnet-4-6` | Provider-specific default |
| Autonomy level | `L2` | `L0`-`L4` |
| Analysis interval | `30s` | Time between reasoning loop ticks |
| Temperature | `0.2` | LLM sampling temperature |
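The defaults in the table can be captured in a small validated config object. This is a sketch only: `AgentConfig` and its field names are hypothetical and do not reflect how the agent actually loads its settings.

```python
from dataclasses import dataclass

# Provider-specific default models, taken from the table above.
DEFAULT_MODELS = {"openai": "gpt-5-mini", "anthropic": "claude-sonnet-4-6"}
AUTONOMY_LEVELS = ("L0", "L1", "L2", "L3", "L4")


@dataclass(frozen=True)
class AgentConfig:
    provider: str                      # required: "openai" or "anthropic"
    model: str = ""                    # empty means "use the provider default"
    autonomy_level: str = "L2"
    analysis_interval_s: float = 30.0  # time between reasoning loop ticks
    temperature: float = 0.2

    def __post_init__(self) -> None:
        if self.provider not in DEFAULT_MODELS:
            raise ValueError(f"unsupported LLM provider: {self.provider!r}")
        if self.autonomy_level not in AUTONOMY_LEVELS:
            raise ValueError(f"unknown autonomy level: {self.autonomy_level!r}")
        if self.analysis_interval_s <= 0:
            raise ValueError("analysis interval must be positive")
        if not self.model:
            # The dataclass is frozen, so set the default via object.__setattr__.
            object.__setattr__(self, "model", DEFAULT_MODELS[self.provider])
```

For instance, `AgentConfig(provider="anthropic")` resolves `model` to `claude-sonnet-4-6` and leaves the autonomy level at `L2`.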
```
okik/
  agent/              # FastAPI agent, LLM client, gatekeeper, tools, audit
  cmd/
    okikctl/          # Go CLI
    toolproxy/        # Go toolproxy sidecar binary
    watchdog/         # Go watchdog binary
  internal/
    agent/            # Go client for agent HTTP APIs
    toolproxy/        # Toolproxy HTTP handlers
    watchdog/         # Watchdog rules engine and HTTP server
  manifests/          # Kubernetes deployment manifests
  scripts/            # Helper scripts
  examples/           # Example workloads and scenarios
```
```bash
# Build
make build

# Test
make test
# or
./test.sh

# Run agent locally
cd agent
uv venv .venv
uv pip install --python .venv/bin/python -r requirements.txt
export provider=openai model=gpt-5-mini api-key=your-key
.venv/bin/python main.py
```

- RBAC -- read/write permissions enforced by autonomy level and namespace guards
- Secret management -- API keys stored as Kubernetes secrets, never in config
- Namespace isolation -- writes always blocked for system namespaces
- Tool gatekeeper -- in-process enforcement at every tool call, not just at the API boundary
- Audit trail -- all plans, actions, and tool calls logged with full parameters and results
- Emergency stop -- single command to freeze all agent activity
- Container security -- no privileged containers, read-only root filesystem
MIT