okikorg/kubeagents

OKIK

Orchestrated Kubernetes Intelligence Kernel


OKIK is an AI-powered Kubernetes operations system that continuously analyzes your cluster, diagnoses issues across workloads, and takes corrective action -- all under graduated human oversight. It replaces the need to write dozens of single-purpose controllers by letting an LLM reason across pods, deployments, nodes, HPAs, and jobs in a single conversation loop.

For environments where LLM-driven automation is not desired, OKIK ships a deterministic watchdog mode that detects and remediates common failures using threshold-based rules with zero external dependencies.


Why OKIK

Classic Kubernetes controllers are single-purpose reconciliation loops. Each one watches one resource type, compares desired vs actual state, and makes a hard-coded correction. To cover resource efficiency, scaling, availability, cost, and workload health, you need to wire together HPA, VPA, node autoscaler, custom operators, and monitoring dashboards -- and none of them talk to each other.

OKIK replaces that fragmented approach with a single agent that:

  • Reasons across the whole cluster. The LLM correlates signals from pods, deployments, nodes, HPAs, jobs, PVCs, and events in one pass. It can say "this pod is OOMing because the HPA target is too aggressive for this workload's memory profile" -- something no single controller can do.
  • Acts with graduated trust. Five autonomy levels (L0-L4) let you start advisory-only and unlock write actions incrementally. Risky operations queue for human approval instead of executing blindly.
  • Works without an LLM too. The watchdog mode provides CrashLoopBackOff detection, OOM quarantining, and pending-pod remediation with deterministic rules and zero API keys.

Features

LLM-Driven Agent

  • Multi-provider support -- OpenAI (GPT) and Anthropic (Claude) with native tool-calling APIs, no prompt parsing hacks
  • Continuous reasoning loop -- collects cluster metrics, feeds them to the LLM, executes allowed actions, queues risky ones for approval
  • 30+ Kubernetes tools organized into risk-tiered packages (OBSERVE, REMEDIATE, QUARANTINE, RESTART, DESTRUCTIVE)
  • Interactive chat -- conversational cluster assistant with markdown rendering, command history, auto-completion, and offline fallback
  • Streaming responses -- real-time streamed chat and analysis output

Graduated Autonomy (L0-L4)

| Level | Mode | Behavior |
|-------|------|----------|
| L0 | Advisory | No tools. Observe-only; all suggestions are textual |
| L1 | Suggest | Read-only tools (OBSERVE). All write actions queue for approval |
| L2 | Guarded (default) | Read + safe writes (REMEDIATE, QUARANTINE). Restarts and destructive ops queue |
| L3 | Conditional | Most writes execute automatically. Only DESTRUCTIVE queues |
| L4 | Full | All tool packages execute (subject to hard safety checks) |
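The level-to-package mapping above can be sketched as a simple lookup. This is an illustrative sketch, not OKIK's actual gatekeeper code; the function name and the queue-everything-else policy are assumptions (at L0, for example, the real agent produces textual suggestions rather than queued plans):

```python
# Illustrative sketch of graduated autonomy: each level unlocks a set of
# tool packages; anything outside the set is queued for human approval.
ALLOWED = {
    "L0": set(),
    "L1": {"OBSERVE"},
    "L2": {"OBSERVE", "REMEDIATE", "QUARANTINE"},
    "L3": {"OBSERVE", "REMEDIATE", "QUARANTINE", "RESTART"},
    "L4": {"OBSERVE", "REMEDIATE", "QUARANTINE", "RESTART", "DESTRUCTIVE"},
}

def decide(level: str, package: str) -> str:
    """Return 'execute' if the package is unlocked at this level,
    otherwise 'queue' so a human can approve the plan."""
    return "execute" if package in ALLOWED[level] else "queue"
```

At the default L2, a `scale_deployment` call (REMEDIATE) executes immediately, while a `rollout_restart` (RESTART) waits in the approval queue.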

Safety and Guardrails

  • Tool gatekeeper -- enforces package-level access by autonomy level at every tool call
  • Namespace restrictions -- writes to kube-system, kube-public, kube-node-lease, and okik-system are always blocked
  • Approval queue -- risky operations are held as pending plans until a human approves or rejects via CLI/API
  • Emergency stop -- single command to halt all agent actions immediately
  • Rate limiting -- prevents runaway tool execution
  • Full audit trail -- every tool call, plan, and action is logged to SQLite with timestamps, parameters, and results
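The audit trail is a plain SQLite log. The sketch below shows the general shape of such a record; the table name, columns, and helper are assumptions for illustration, not OKIK's actual schema:

```python
# Minimal sketch of an SQLite audit record: one row per tool call with
# timestamp, parameters (JSON), and result. Schema names are illustrative.
import datetime
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tool_calls (ts TEXT, tool TEXT, params TEXT, result TEXT)"
)

def log_tool_call(tool: str, params: dict, result: str) -> None:
    conn.execute(
        "INSERT INTO tool_calls VALUES (?, ?, ?, ?)",
        (datetime.datetime.utcnow().isoformat(), tool,
         json.dumps(params), result),
    )
    conn.commit()

log_tool_call("scale_deployment", {"name": "web", "replicas": 3}, "ok")
rows = conn.execute("SELECT tool, params FROM tool_calls").fetchall()
```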

Optimization Engine

Category-based cluster optimization with health scoring:

  • Resource Efficiency -- CPU/memory request vs actual usage, over-provisioned workloads
  • Availability & Health -- restart rates, CrashLoopBackOff, pending pods, node conditions
  • Cost & Waste -- idle resources, unused deployments, unbound PVCs
  • Scaling -- HPA saturation, deployment availability ratios
  • Workload Health -- Job/CronJob failures, DaemonSet coverage gaps

Each category gets a 0-100 health score with severity-tagged findings. In reconcile mode, the agent fixes issues as it finds them.
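One simple way to compute such a score is to start at 100 and subtract a penalty per finding, weighted by severity. The weights below are assumptions for illustration; OKIK's actual scoring may differ:

```python
# Sketch of a 0-100 category health score: penalty per finding, clamped.
SEVERITY_PENALTY = {"low": 2, "medium": 5, "high": 15, "critical": 30}

def health_score(findings: list[dict]) -> int:
    """Each finding carries a 'severity' tag; the score never leaves 0-100."""
    score = 100 - sum(SEVERITY_PENALTY[f["severity"]] for f in findings)
    return max(0, min(100, score))
```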

Deterministic Watchdog Mode

LLM-free controller for environments that need predictable, rule-based operation:

  • Detects CrashLoopBackOff pods exceeding restart thresholds
  • Quarantines unstable deployments (scales to zero, preserves original replica count)
  • Identifies pods stuck in Pending beyond a configurable timeout
  • Detects OOMKilled containers
  • Exposes compatible /health, /status, and /recommendations endpoints
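The CrashLoopBackOff rule, for instance, reduces to a threshold check over container statuses. The sketch below follows the Kubernetes pod-status field names; the threshold value and function name are assumptions, not the watchdog's actual (Go) implementation:

```python
# Sketch of a deterministic CrashLoopBackOff rule: flag pods whose
# restart count exceeds a fixed threshold while waiting in CrashLoopBackOff.
RESTART_THRESHOLD = 5

def crashloop_pods(pods: list[dict]) -> list[str]:
    flagged = []
    for pod in pods:
        for cs in pod.get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting", {})
            if (waiting.get("reason") == "CrashLoopBackOff"
                    and cs.get("restartCount", 0) > RESTART_THRESHOLD):
                flagged.append(pod["name"])
                break  # one bad container is enough to flag the pod
    return flagged
```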

Quarantine System

  • Quarantine -- scales a failing deployment to zero, annotates with reason, preserves original replica count
  • Unquarantine -- restores the deployment to its pre-quarantine state
  • List -- shows all quarantined deployments across namespaces
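The round trip works because the original replica count is stashed in an annotation before scaling to zero. A minimal sketch of that flow, with assumed annotation keys (not OKIK's actual keys) and plain dicts standing in for deployment objects:

```python
# Sketch of quarantine/unquarantine: preserve replicas in an annotation,
# scale to zero, then restore from the annotation. Keys are illustrative.
def quarantine(deployment: dict, reason: str) -> None:
    ann = deployment.setdefault("annotations", {})
    ann["okik/original-replicas"] = str(deployment["replicas"])
    ann["okik/quarantine-reason"] = reason
    deployment["replicas"] = 0

def unquarantine(deployment: dict) -> None:
    ann = deployment.get("annotations", {})
    deployment["replicas"] = int(ann.pop("okik/original-replicas", "1"))
    ann.pop("okik/quarantine-reason", None)
```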

CLI (okikctl)

Full-featured Go CLI built with Cobra, Bubble Tea, and Lip Gloss:

| Command | Description |
|---------|-------------|
| deploy | Deploy agent mode or watchdog mode (--watchdog) |
| status | Cluster and agent status with tree/classic views |
| chat | Interactive conversational assistant |
| watch | Live streaming recommendation feed |
| optimise | Targeted optimization by category |
| actions | List, approve, or reject pending plans |
| autonomy | Get/set autonomy level, enable/disable automation |
| plan | Inspect and manage generated action plans |
| quarantine | List, quarantine, and restore deployments |
| tool | List available tools, run tool tests |
| load-test | Deploy scenario workloads for testing |
| logs | Tail agent pod logs |
| emergency-stop | Halt all agent actions |
| remove | Remove OKIK from the cluster |

Architecture

flowchart TB
    subgraph CLI["okikctl (Go CLI)"]
        cmds["deploy | status | chat | watch\noptimise | actions | autonomy | logs"]
    end

    subgraph AgentPod["Agent Pod (okik-system)"]
        agent["FastAPI Agent\n(Python)"]
        gk["Gatekeeper\nL0-L4 enforcement"]
        llm["LLM Client\nOpenAI / Anthropic"]
        audit["Audit Log\n(SQLite)"]
        tp["Toolproxy Sidecar\n(Go, client-go)"]

        agent --- gk
        agent --- llm
        agent --- audit
        agent -- "HTTP (in-pod)" --> tp
    end

    subgraph WatchdogPod["Watchdog Pod (okik-system)"]
        wd["Watchdog Service\n(Go rules engine)"]
    end

    CLI -- "Agent mode" --> agent
    CLI -- "Watchdog mode" --> wd
    tp -- "client-go" --> k8s["Kubernetes API"]
    wd -- "client-go" --> k8s

Agent mode -- the CLI talks to a FastAPI agent that reasons via LLM tool-calling. The gatekeeper enforces autonomy-level permissions on every tool call. All Kubernetes API access goes through the Go toolproxy sidecar (same pod, HTTP), keeping the Python agent free of direct K8s dependencies.

Watchdog mode -- the CLI talks to a standalone Go service that applies deterministic threshold-based rules with zero LLM dependency.


Quick Start

Prerequisites

  • Kubernetes cluster with kubectl access
  • metrics-server enabled (for minikube: minikube addons enable metrics-server)
  • API key for OpenAI or Anthropic (agent mode only)

Install

go build -o okikctl cmd/okikctl/main.go

Deploy (Agent Mode)

# OpenAI
./okikctl deploy --llm-provider openai --api-key sk-...

# Anthropic
./okikctl deploy --llm-provider anthropic --api-key sk-ant-...

Deploy (Watchdog Mode -- No LLM)

./okikctl deploy --watchdog

Use

./okikctl status                        # cluster and agent overview
./okikctl chat                          # interactive assistant
./okikctl watch                         # live recommendation stream
./okikctl optimise --category all       # full optimization scan
./okikctl autonomy set L3               # increase autonomy
./okikctl actions list                  # view pending approvals
./okikctl actions approve <plan-id>     # approve a queued plan
./okikctl quarantine list               # view quarantined deployments
./okikctl logs -f                       # tail agent logs
./okikctl emergency-stop                # halt agent actions
./okikctl remove                        # remove from cluster

Kubernetes Tools

OBSERVE (risk tier 0) -- read-only cluster inspection

list_namespaces, get_pods, get_deployments, get_services, get_pod_logs, get_events, describe_resource, get_nodes, get_hpa, get_resource_quotas, get_pvcs, get_jobs, get_cronjobs, get_daemonsets, get_statefulsets, get_pdbs, get_node_conditions, get_container_statuses

REMEDIATE (risk tier 5) -- safe scaling and resource tuning

scale_deployment, update_resources, create_hpa, update_hpa

QUARANTINE (risk tier 6) -- isolate and restore failing workloads

quarantine_deployment, unquarantine_deployment, list_quarantined

RESTART (risk tier 7) -- pod and deployment recycling

rollout_restart, delete_pod, delete_hpa

DESTRUCTIVE (risk tier 9) -- irreversible operations

scale_to_zero, cordon_node, uncordon_node, delete_deployment, delete_namespace


Agent API

The agent exposes a REST API on port 8000:

| Method | Path | Description |
|--------|------|-------------|
| GET | /health | Health probe |
| GET | /status | Agent config, state, and tool info |
| GET | /recommendations | Current recommendations |
| GET | /metrics | Raw cluster metrics snapshot |
| GET | /tools | Tool definitions |
| GET | /tools/calls | Recent tool call history |
| GET | /actions | Autonomy state + pending approvals |
| POST | /actions/approve/{id} | Approve a queued plan |
| POST | /actions/reject/{id} | Reject a queued plan |
| PUT | /autonomy | Set autonomy level and enabled state |
| POST | /emergency-stop | Toggle emergency stop |
| POST | /chat | Chat (non-streaming) |
| POST | /chat/stream | Chat (streaming) |
| POST | /optimise | Run category-based optimization |
| POST | /config | Runtime config update |
| GET | /audit/plans | Plan history |
| GET | /audit/plans/{id} | Single plan detail |
| GET | /audit/actions | Action history |
| GET | /audit/actions/{id} | Single action detail |
| GET | /audit/tools | Tool usage stats |
| GET | /audit/stats | Aggregate audit stats |
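Approving a queued plan from a script is a single POST against the table above. The helper below only builds the request (sending it requires a reachable agent); the base URL is an assumption for a locally port-forwarded agent:

```python
# Illustrative helper for POST /actions/approve/{id}; pass the returned
# Request to urllib.request.urlopen() against a live agent.
import urllib.request

BASE = "http://localhost:8000"  # assumed local port-forward to the agent

def approve_request(plan_id: str) -> urllib.request.Request:
    return urllib.request.Request(
        f"{BASE}/actions/approve/{plan_id}", method="POST")

req = approve_request("plan-123")
```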

Configuration

| Parameter | Default | Description |
|-----------|---------|-------------|
| LLM provider | -- | openai or anthropic |
| Model | gpt-5-mini / claude-sonnet-4-6 | Provider-specific default |
| Autonomy level | L2 | L0-L4 |
| Analysis interval | 30s | Time between reasoning loop ticks |
| Temperature | 0.2 | LLM sampling temperature |

Repository Layout

okik/
  agent/                  # FastAPI agent, LLM client, gatekeeper, tools, audit
  cmd/
    okikctl/              # Go CLI
    toolproxy/            # Go toolproxy sidecar binary
    watchdog/             # Go watchdog binary
  internal/
    agent/                # Go client for agent HTTP APIs
    toolproxy/            # Toolproxy HTTP handlers
    watchdog/             # Watchdog rules engine and HTTP server
  manifests/              # Kubernetes deployment manifests
  scripts/                # Helper scripts
  examples/               # Example workloads and scenarios

Development

# Build
make build

# Test
make test
# or
./test.sh

# Run agent locally
cd agent
uv venv .venv
uv pip install --python .venv/bin/python -r requirements.txt
export provider=openai model=gpt-5-mini api_key=your-key
.venv/bin/python main.py

Security

  • RBAC -- read/write permissions enforced by autonomy level and namespace guards
  • Secret management -- API keys stored as Kubernetes secrets, never in config
  • Namespace isolation -- writes always blocked for system namespaces
  • Tool gatekeeper -- in-process enforcement at every tool call, not just at the API boundary
  • Audit trail -- all plans, actions, and tool calls logged with full parameters and results
  • Emergency stop -- single command to freeze all agent activity
  • Container security -- no privileged containers, read-only root filesystem

License

MIT
