okikorg/kubeagents

OKIK

Orchestrated Kubernetes Intelligence Kernel


OKIK is an AI-powered Kubernetes operations system that continuously analyzes your cluster, diagnoses issues across workloads, and takes corrective action -- all under graduated human oversight. It replaces the need to write dozens of single-purpose controllers by letting an LLM reason across pods, deployments, nodes, HPAs, and jobs in a single conversation loop.

For environments where LLM-driven automation is not desired, OKIK ships a deterministic watchdog mode that detects and remediates common failures using threshold-based rules with zero external dependencies.


Why OKIK

Classic Kubernetes controllers are single-purpose reconciliation loops. Each one watches one resource type, compares desired vs actual state, and makes a hard-coded correction. To cover resource efficiency, scaling, availability, cost, and workload health, you need to wire together HPA, VPA, node autoscaler, custom operators, and monitoring dashboards -- and none of them talk to each other.

OKIK replaces that fragmented approach with a single agent that:

  • Reasons across the whole cluster. The LLM correlates signals from pods, deployments, nodes, HPAs, jobs, PVCs, and events in one pass. It can say "this pod is OOMing because the HPA target is too aggressive for this workload's memory profile" -- something no single controller can do.
  • Acts with graduated trust. Five autonomy levels (L0-L4) let you start advisory-only and unlock write actions incrementally. Risky operations queue for human approval instead of executing blindly.
  • Works without an LLM too. The watchdog mode provides CrashLoopBackOff detection, OOM quarantining, and pending-pod remediation with deterministic rules and zero API keys.

Features

LLM-Driven Agent

  • Multi-provider support -- OpenAI (GPT) and Anthropic (Claude) with native tool-calling APIs, no prompt parsing hacks
  • Continuous reasoning loop -- collects cluster metrics, feeds them to the LLM, executes allowed actions, queues risky ones for approval
  • 30+ Kubernetes tools organized into risk-tiered packages (OBSERVE, REMEDIATE, QUARANTINE, RESTART, DESTRUCTIVE)
  • Interactive chat -- conversational cluster assistant with markdown rendering, command history, auto-completion, and offline fallback
  • Streaming responses -- real-time streamed chat and analysis output

Graduated Autonomy (L0-L4)

| Level | Mode | Behavior |
|-------|------|----------|
| L0 | Advisory | No tools. Observe-only; all suggestions are textual |
| L1 | Suggest | Read-only tools (OBSERVE). All write actions queue for approval |
| L2 | Guarded (default) | Read + safe writes (REMEDIATE, QUARANTINE). Restarts and destructive ops queue |
| L3 | Conditional | Most writes execute automatically. Only DESTRUCTIVE queues |
| L4 | Full | All tool packages execute (subject to hard safety checks) |
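The level-to-package mapping above can be sketched as a simple lookup. This is an illustrative sketch, not OKIK's actual gatekeeper code; the function name and the queue-everything-else policy are assumptions (at L0, for example, the real agent produces textual suggestions rather than queued plans):

```python
# Illustrative sketch of graduated autonomy: each level unlocks a set of
# tool packages; anything outside the set is queued for human approval.
ALLOWED = {
    "L0": set(),
    "L1": {"OBSERVE"},
    "L2": {"OBSERVE", "REMEDIATE", "QUARANTINE"},
    "L3": {"OBSERVE", "REMEDIATE", "QUARANTINE", "RESTART"},
    "L4": {"OBSERVE", "REMEDIATE", "QUARANTINE", "RESTART", "DESTRUCTIVE"},
}

def decide(level: str, package: str) -> str:
    """Return 'execute' if the package is unlocked at this level,
    otherwise 'queue' so a human can approve the plan."""
    return "execute" if package in ALLOWED[level] else "queue"
```

At the default L2, a `scale_deployment` call (REMEDIATE) executes immediately, while a `rollout_restart` (RESTART) waits in the approval queue.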

Safety and Guardrails

  • Tool gatekeeper -- enforces package-level access by autonomy level at every tool call
  • Namespace restrictions -- writes to kube-system, kube-public, kube-node-lease, and okik-system are always blocked
  • Approval queue -- risky operations are held as pending plans until a human approves or rejects via CLI/API
  • Emergency stop -- single command to halt all agent actions immediately
  • Rate limiting -- prevents runaway tool execution
  • Full audit trail -- every tool call, plan, and action is logged to SQLite with timestamps, parameters, and results
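The audit trail is a plain SQLite log. The sketch below shows the general shape of such a record; the table name, columns, and helper are assumptions for illustration, not OKIK's actual schema:

```python
# Minimal sketch of an SQLite audit record: one row per tool call with
# timestamp, parameters (JSON), and result. Schema names are illustrative.
import datetime
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tool_calls (ts TEXT, tool TEXT, params TEXT, result TEXT)"
)

def log_tool_call(tool: str, params: dict, result: str) -> None:
    conn.execute(
        "INSERT INTO tool_calls VALUES (?, ?, ?, ?)",
        (datetime.datetime.utcnow().isoformat(), tool,
         json.dumps(params), result),
    )
    conn.commit()

log_tool_call("scale_deployment", {"name": "web", "replicas": 3}, "ok")
rows = conn.execute("SELECT tool, params FROM tool_calls").fetchall()
```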

Optimization Engine

Category-based cluster optimization with health scoring:

  • Resource Efficiency -- CPU/memory request vs actual usage, over-provisioned workloads
  • Availability & Health -- restart rates, CrashLoopBackOff, pending pods, node conditions
  • Cost & Waste -- idle resources, unused deployments, unbound PVCs
  • Scaling -- HPA saturation, deployment availability ratios
  • Workload Health -- Job/CronJob failures, DaemonSet coverage gaps

Each category gets a 0-100 health score with severity-tagged findings. In reconcile mode, the agent fixes issues as it finds them.
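One simple way to compute such a score is to start at 100 and subtract a penalty per finding, weighted by severity. The weights below are assumptions for illustration; OKIK's actual scoring may differ:

```python
# Sketch of a 0-100 category health score: penalty per finding, clamped.
SEVERITY_PENALTY = {"low": 2, "medium": 5, "high": 15, "critical": 30}

def health_score(findings: list[dict]) -> int:
    """Each finding carries a 'severity' tag; the score never leaves 0-100."""
    score = 100 - sum(SEVERITY_PENALTY[f["severity"]] for f in findings)
    return max(0, min(100, score))
```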

Deterministic Watchdog Mode

LLM-free controller for environments that need predictable, rule-based operation:

  • Detects CrashLoopBackOff pods exceeding restart thresholds
  • Quarantines unstable deployments (scales to zero, preserves original replica count)
  • Identifies pods stuck in Pending beyond a configurable timeout
  • Detects OOMKilled containers
  • Exposes compatible /health, /status, and /recommendations endpoints
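The CrashLoopBackOff rule, for instance, reduces to a threshold check over container statuses. The sketch below follows the Kubernetes pod-status field names; the threshold value and function name are assumptions, not the watchdog's actual (Go) implementation:

```python
# Sketch of a deterministic CrashLoopBackOff rule: flag pods whose
# restart count exceeds a fixed threshold while waiting in CrashLoopBackOff.
RESTART_THRESHOLD = 5

def crashloop_pods(pods: list[dict]) -> list[str]:
    flagged = []
    for pod in pods:
        for cs in pod.get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting", {})
            if (waiting.get("reason") == "CrashLoopBackOff"
                    and cs.get("restartCount", 0) > RESTART_THRESHOLD):
                flagged.append(pod["name"])
                break  # one bad container is enough to flag the pod
    return flagged
```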

Quarantine System

  • Quarantine -- scales a failing deployment to zero, annotates with reason, preserves original replica count
  • Unquarantine -- restores the deployment to its pre-quarantine state
  • List -- shows all quarantined deployments across namespaces
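The round trip works because the original replica count is stashed in an annotation before scaling to zero. A minimal sketch of that flow, with assumed annotation keys (not OKIK's actual keys) and plain dicts standing in for deployment objects:

```python
# Sketch of quarantine/unquarantine: preserve replicas in an annotation,
# scale to zero, then restore from the annotation. Keys are illustrative.
def quarantine(deployment: dict, reason: str) -> None:
    ann = deployment.setdefault("annotations", {})
    ann["okik/original-replicas"] = str(deployment["replicas"])
    ann["okik/quarantine-reason"] = reason
    deployment["replicas"] = 0

def unquarantine(deployment: dict) -> None:
    ann = deployment.get("annotations", {})
    deployment["replicas"] = int(ann.pop("okik/original-replicas", "1"))
    ann.pop("okik/quarantine-reason", None)
```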

CLI (okikctl)

Full-featured Go CLI built with Cobra, Bubble Tea, and Lip Gloss:

| Command | Description |
|---------|-------------|
| deploy | Deploy agent mode or watchdog mode (--watchdog) |
| status | Cluster and agent status with tree/classic views |
| chat | Interactive conversational assistant |
| watch | Live streaming recommendation feed |
| optimise | Targeted optimization by category |
| actions | List, approve, or reject pending plans |
| autonomy | Get/set autonomy level, enable/disable automation |
| plan | Inspect and manage generated action plans |
| quarantine | List, quarantine, and restore deployments |
| tool | List available tools, run tool tests |
| load-test | Deploy scenario workloads for testing |
| logs | Tail agent pod logs |
| emergency-stop | Halt all agent actions |
| remove | Remove OKIK from the cluster |

Architecture

flowchart TB
    subgraph CLI["okikctl (Go CLI)"]
        cmds["deploy | status | chat | watch\noptimise | actions | autonomy | logs"]
    end

    subgraph AgentPod["Agent Pod (okik-system)"]
        agent["FastAPI Agent\n(Python)"]
        gk["Gatekeeper\nL0-L4 enforcement"]
        llm["LLM Client\nOpenAI / Anthropic"]
        audit["Audit Log\n(SQLite)"]
        tp["Toolproxy Sidecar\n(Go, client-go)"]

        agent --- gk
        agent --- llm
        agent --- audit
        agent -- "HTTP (in-pod)" --> tp
    end

    subgraph WatchdogPod["Watchdog Pod (okik-system)"]
        wd["Watchdog Service\n(Go rules engine)"]
    end

    CLI -- "Agent mode" --> agent
    CLI -- "Watchdog mode" --> wd
    tp -- "client-go" --> k8s["Kubernetes API"]
    wd -- "client-go" --> k8s

Agent mode -- the CLI talks to a FastAPI agent that reasons via LLM tool-calling. The gatekeeper enforces autonomy-level permissions on every tool call. All Kubernetes API access goes through the Go toolproxy sidecar (same pod, HTTP), keeping the Python agent free of direct K8s dependencies.

Watchdog mode -- the CLI talks to a standalone Go service that applies deterministic threshold-based rules with zero LLM dependency.


Quick Start

Prerequisites

  • Kubernetes cluster with kubectl access
  • metrics-server enabled (for minikube: minikube addons enable metrics-server)
  • API key for OpenAI or Anthropic (agent mode only)

Install

go build -o okikctl cmd/okikctl/main.go

Deploy (Agent Mode)

# OpenAI
./okikctl deploy --llm-provider openai --api-key sk-...

# Anthropic
./okikctl deploy --llm-provider anthropic --api-key sk-ant-...

Deploy (Watchdog Mode -- No LLM)

./okikctl deploy --watchdog

Use

./okikctl status                        # cluster and agent overview
./okikctl chat                          # interactive assistant
./okikctl watch                         # live recommendation stream
./okikctl optimise --category all       # full optimization scan
./okikctl autonomy set L3               # increase autonomy
./okikctl actions list                  # view pending approvals
./okikctl actions approve <plan-id>     # approve a queued plan
./okikctl quarantine list               # view quarantined deployments
./okikctl logs -f                       # tail agent logs
./okikctl emergency-stop                # halt agent actions
./okikctl remove                        # remove from cluster

Kubernetes Tools

OBSERVE (risk tier 0) -- read-only cluster inspection

list_namespaces, get_pods, get_deployments, get_services, get_pod_logs, get_events, describe_resource, get_nodes, get_hpa, get_resource_quotas, get_pvcs, get_jobs, get_cronjobs, get_daemonsets, get_statefulsets, get_pdbs, get_node_conditions, get_container_statuses

REMEDIATE (risk tier 5) -- safe scaling and resource tuning

scale_deployment, update_resources, create_hpa, update_hpa

QUARANTINE (risk tier 6) -- isolate and restore failing workloads

quarantine_deployment, unquarantine_deployment, list_quarantined

RESTART (risk tier 7) -- pod and deployment recycling

rollout_restart, delete_pod, delete_hpa

DESTRUCTIVE (risk tier 9) -- irreversible operations

scale_to_zero, cordon_node, uncordon_node, delete_deployment, delete_namespace


Agent API

The agent exposes a REST API on port 8000:

| Method | Path | Description |
|--------|------|-------------|
| GET | /health | Health probe |
| GET | /status | Agent config, state, and tool info |
| GET | /recommendations | Current recommendations |
| GET | /metrics | Raw cluster metrics snapshot |
| GET | /tools | Tool definitions |
| GET | /tools/calls | Recent tool call history |
| GET | /actions | Autonomy state + pending approvals |
| POST | /actions/approve/{id} | Approve a queued plan |
| POST | /actions/reject/{id} | Reject a queued plan |
| PUT | /autonomy | Set autonomy level and enabled state |
| POST | /emergency-stop | Toggle emergency stop |
| POST | /chat | Chat (non-streaming) |
| POST | /chat/stream | Chat (streaming) |
| POST | /optimise | Run category-based optimization |
| POST | /config | Runtime config update |
| GET | /audit/plans | Plan history |
| GET | /audit/plans/{id} | Single plan detail |
| GET | /audit/actions | Action history |
| GET | /audit/actions/{id} | Single action detail |
| GET | /audit/tools | Tool usage stats |
| GET | /audit/stats | Aggregate audit stats |
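Approving a queued plan from a script is a single POST against the table above. The helper below only builds the request (sending it requires a reachable agent); the base URL is an assumption for a locally port-forwarded agent:

```python
# Illustrative helper for POST /actions/approve/{id}; pass the returned
# Request to urllib.request.urlopen() against a live agent.
import urllib.request

BASE = "http://localhost:8000"  # assumed local port-forward to the agent

def approve_request(plan_id: str) -> urllib.request.Request:
    return urllib.request.Request(
        f"{BASE}/actions/approve/{plan_id}", method="POST")

req = approve_request("plan-123")
```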

Configuration

| Parameter | Default | Description |
|-----------|---------|-------------|
| LLM provider | -- | openai or anthropic |
| Model | gpt-5-mini / claude-sonnet-4-6 | Provider-specific default |
| Autonomy level | L2 | L0-L4 |
| Analysis interval | 30s | Time between reasoning loop ticks |
| Temperature | 0.2 | LLM sampling temperature |

Repository Layout

okik/
  agent/                  # FastAPI agent, LLM client, gatekeeper, tools, audit
  cmd/
    okikctl/              # Go CLI
    toolproxy/            # Go toolproxy sidecar binary
    watchdog/             # Go watchdog binary
  internal/
    agent/                # Go client for agent HTTP APIs
    toolproxy/            # Toolproxy HTTP handlers
    watchdog/             # Watchdog rules engine and HTTP server
  manifests/              # Kubernetes deployment manifests
  scripts/                # Helper scripts
  examples/               # Example workloads and scenarios

Development

# Build
make build

# Test
make test
# or
./test.sh

# Run agent locally
cd agent
uv venv .venv
uv pip install --python .venv/bin/python -r requirements.txt
export provider=openai model=gpt-5-mini api_key=your-key
.venv/bin/python main.py

Security

  • RBAC -- read/write permissions enforced by autonomy level and namespace guards
  • Secret management -- API keys stored as Kubernetes secrets, never in config
  • Namespace isolation -- writes always blocked for system namespaces
  • Tool gatekeeper -- in-process enforcement at every tool call, not just at the API boundary
  • Audit trail -- all plans, actions, and tool calls logged with full parameters and results
  • Emergency stop -- single command to freeze all agent activity
  • Container security -- no privileged containers, read-only root filesystem

License

MIT
