Kiven — Architecture Overview
Managed Data Services, On Your Infrastructure
Version 2.0 — February 2026
This document is the entry point for Kiven's architecture.
It provides a high-level overview and links to detailed documentation.
PART I — EXECUTIVE SUMMARY
Kiven is a fully managed data platform that runs on the customer's own Kubernetes infrastructure. Starting with PostgreSQL (powered by CloudNativePG), Kiven delivers an Aiven-quality experience — but the data never leaves the customer's cluster.
How it works:
1. Customer signs up → grants Kiven access to their EKS (cross-account IAM Role)
2. Kiven provisions everything: dedicated nodes, storage, S3 backups, CNPG operator, PostgreSQL
3. Customer gets: a connection string + a dashboard
4. Kiven manages everything from that point: scaling, backups, monitoring, security, tuning
The customer never touches kubectl, YAML, CNPG, or Kubernetes internals.
| vs. Aiven | vs. Self-Managed CNPG | vs. Launchly |
|---|---|---|
| Same UX, but on customer's infra | Same PostgreSQL, but fully managed | Same CNPG, but Aiven-level depth |
| 40-60% cheaper (no Aiven markup) | No need for K8s/CNPG expertise | Full infra management (nodes, storage) |
| Data never leaves customer's VPC | Risk eliminated by best practices | DBA intelligence built-in |
Kiven is designed for:
Scalability: Support 100+ customer clusters across multiple EKS environments
Reliability: RPO 1h, RTO 15min (Kiven SaaS); RPO 5min, RTO 5min (customer databases via CNPG)
Compliance: GDPR (EU data residency), SOC2 (audit, RBAC, encryption)
Extensibility: Provider/plugin architecture for multi-operator future (Kafka, Redis, Elasticsearch)
Lifespan: 5+ years
Multi-cloud support (GKE, AKS) — Phase 3
Non-PostgreSQL data services (Kafka, Redis) — Phase 3
Self-hosted / air-gapped edition — Phase 3
Mobile app
| Parameter | Value | Impact |
|---|---|---|
| RPO (Kiven SaaS) | 1 hour | Hourly backups of product database |
| RTO (Kiven SaaS) | 15 minutes | Automated failover |
| RPO (Customer DBs) | Configurable (1min–24h) | Continuous WAL archiving via Barman |
| RTO (Customer DBs) | < 5 minutes | CNPG automatic failover, multi-AZ |
| Provisioning time | < 10 minutes | From "Create Database" to connection string |
| Agent footprint | < 50MB RAM, < 0.1 CPU | Minimal impact on customer cluster |
| On-call team | 5 people | Runbooks for both SaaS and customer infra |
| Standard | Key Requirements | Scope |
|---|---|---|
| GDPR | EU data residency, right to erasure, DPA | Kiven SaaS (eu-west-1) + customer data stays in their infra |
| SOC2 | RBAC, audit logging, encryption, incident response | Kiven SaaS operations + customer infra access audit trail |
Note: PCI-DSS is NOT in scope. Kiven does not process payment card data. Customers' own compliance obligations (HIPAA, PCI-DSS, etc.) are easier to meet because their data stays on their own infrastructure.
| Category | Choice | Rationale |
|---|---|---|
| Cloud | AWS (eu-west-1) | GDPR, proximity to EU customers |
| Orchestration | EKS + Flux | GitOps, cloud-native |
| Backend | Go (stdlib + chi) | K8s ecosystem is Go, fast, small binaries |
| Frontend | Next.js 14+ (App Router) + Tailwind + shadcn/ui | Modern, fast, beautiful |
| Agent | Go (client-go + controller-runtime) | Native K8s SDK, single binary |
| Agent Comms | gRPC + mTLS | Secure, efficient, bidirectional streaming |
| Product DB | PostgreSQL (Aiven) | Dogfooding the ecosystem, managed |
| Cache | Valkey | Sessions, rate limiting, real-time state |
| Messaging | Kafka (Aiven) | Agent events, audit trail, async operations |
| Edge/CDN | Cloudflare | WAF, DDoS, Zero Trust, Tunnel |
| Observability | Prometheus / Loki / Tempo | Self-hosted, cost-efficient |
| Secrets | HashiCorp Vault | Dynamic secrets, rotation, IRSA |
| CNI | Cilium | mTLS, Gateway API, network policies |
| Policies | Kyverno | Admission control, pod security |
| Billing | Stripe | SaaS billing, per-cluster pricing |
| CI/CD | GitHub Actions | Already in place |
| IaC | Terraform | Infrastructure as Code |
Customer-Side (Provisioned by Kiven)
| Component | Technology | Managed By |
|---|---|---|
| Kubernetes nodes | EKS Managed Node Groups | Kiven (via AWS API) |
| PostgreSQL | CloudNativePG (CNPG) | Kiven (via agent) |
| Connection pooling | PgBouncer (CNPG Pooler CRD) | Kiven (via agent) |
| Backups | Barman → S3 | Kiven (via agent + AWS API) |
| Storage | EBS gp3 (encrypted, KMS) | Kiven (via AWS API) |
| Backup storage | S3 bucket (encrypted, lifecycle) | Kiven (via AWS API) |
| TLS | cert-manager + self-signed CA | Kiven (via agent) |
| Monitoring agent | Kiven Agent (Go) | Kiven |
2.1 System Context (C4 Level 1)
┌──────────────────────────────────────────────────────────────────────────┐
│ USERS │
│ Developers (Simple Mode) DevOps (Advanced Mode) │
└──────────────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────────┐
│ CLOUDFLARE EDGE │
│ (DNS, WAF, DDoS, CDN, Zero Trust) │
└──────────────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────────┐
│ KIVEN SaaS PLATFORM │
│ (AWS EKS — eu-west-1) │
│ │
│ Dashboard + API + CLI + Terraform Provider │
│ Core Services: provisioner, infra, clusters, backups, monitoring... │
│ Provider/Plugin: CNPG Provider (Phase 1), Strimzi (future)... │
└──────────────────────────────────────────────────────────────────────────┘
│ │
│ gRPC/mTLS (Agent) │ Cross-Account
│ │ IAM AssumeRole
▼ ▼
┌──────────────────────────────────────────────────────────────────────────┐
│ CUSTOMER'S AWS ACCOUNT / EKS │
│ │
│ ┌──── Managed by Kiven ──────────────────────────────────────────────┐ │
│ │ Node Group: kiven-db-nodes (dedicated, tainted, multi-AZ) │ │
│ │ Namespace: kiven-system (agent + CNPG operator) │ │
│ │ Namespace: kiven-databases (PostgreSQL clusters) │ │
│ │ S3 Bucket: kiven-backups-{customer-id} │ │
│ │ IAM: IRSA roles for S3 access │ │
│ │ CNPG: PostgreSQL Primary + Replicas + PgBouncer │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──── Managed by Customer ───────────────────────────────────────────┐ │
│ │ Their app nodes, services, workloads │ │
│ │ Connect to: pg-main.kiven-databases.svc:5432 │ │
│ └────────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────┘
2.2 Container Diagram (C4 Level 2) — Kiven SaaS
┌──────────────────────────────────────────────────────────────────────────┐
│ KIVEN SaaS — AWS WORKLOAD ACCOUNT — eu-west-1 │
├──────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ EKS CLUSTER │ │
│ │ │ │
│ │ ┌──────────────────────────────────────────────────────────────┐ │ │
│ │ │ PLATFORM NODE POOL (taints: platform=true:NoSchedule) │ │ │
│ │ │ • Flux • Cilium • Vault Agent │ │ │
│ │ │ • OTel Collector • Prometheus • Grafana │ │ │
│ │ │ • Loki • Tempo • Kyverno │ │ │
│ │ └──────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌──────────────────────────────────────────────────────────────┐ │ │
│ │ │ APPLICATION NODE POOL (auto-scaling) │ │ │
│ │ │ │ │ │
│ │ │ ┌── Core ─────────────────────────────────────────────┐ │ │ │
│ │ │ │ svc-api svc-auth svc-provisioner │ │ │ │
│ │ │ │ svc-infra svc-clusters svc-agent-relay │ │ │ │
│ │ │ └─────────────────────────────────────────────────────┘ │ │ │
│ │ │ │ │ │
│ │ │ ┌── Data Services ────────────────────────────────────┐ │ │ │
│ │ │ │ svc-backups svc-monitoring svc-users │ │ │ │
│ │ │ │ svc-yamleditor svc-migrations │ │ │ │
│ │ │ └─────────────────────────────────────────────────────┘ │ │ │
│ │ │ │ │ │
│ │ │ ┌── Business ────────────────────────────────────────┐ │ │ │
│ │ │ │ svc-billing svc-audit svc-notification│ │ │ │
│ │ │ └─────────────────────────────────────────────────────┘ │ │ │
│ │ └──────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ │ VPC Peering │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ AIVEN VPC │ │
│ │ • PostgreSQL (Kiven product database) │ │
│ │ • Kafka (agent events, audit trail, async ops) │ │
│ │ • Valkey (sessions, rate limiting, cache) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────┘
2.3 Container Diagram (C4 Level 2) — Customer Side
┌──────────────────────────────────────────────────────────────────────────┐
│ CUSTOMER'S EKS CLUSTER │
├──────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ NODE GROUP: kiven-db-nodes (Managed by Kiven) │ │
│ │ Instance: r6g.medium–r6g.2xlarge (memory-optimized) │ │
│ │ Taint: kiven.io/role=database:NoSchedule │ │
│ │ Multi-AZ: primary in AZ-a, replica in AZ-b │ │
│ │ │ │
│ │ ┌── Namespace: kiven-system ──────────────────────────────────┐ │ │
│ │ │ Kiven Agent (Go) — gRPC → Kiven SaaS │ │ │
│ │ │ CNPG Operator — manages PG clusters │ │ │
│ │ │ cert-manager (optional) — TLS certificates │ │ │
│ │ └─────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌── Namespace: kiven-databases ───────────────────────────────┐ │ │
│ │ │ │ │ │
│ │ │ CNPG Cluster: pg-production-main │ │ │
│ │ │ ├─ Pod: pg-production-main-1 (Primary, AZ-a) │ │ │
│ │ │ ├─ Pod: pg-production-main-2 (Replica, AZ-b) │ │ │
│ │ │ ├─ Pod: pg-production-main-3 (Replica, AZ-c) │ │ │
│ │ │ ├─ Service: pg-production-main-rw (read-write) │ │ │
│ │ │ ├─ Service: pg-production-main-ro (read-only) │ │ │
│ │ │ └─ Pooler: pg-production-main-pooler (PgBouncer) │ │ │
│ │ │ │ │ │
│ │ │ ScheduledBackup → S3: kiven-backups-{customer-id} │ │ │
│ │ │ NetworkPolicy: only kiven-databases + customer-app-ns │ │ │
│ │ └─────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ NODE GROUP: customer-app-nodes (Managed by Customer) │ │
│ │ • Customer's application pods │ │
│ │ • Connect to: pg-production-main-pooler.kiven-databases.svc:5432 │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ AWS Resources (Managed by Kiven via cross-account IAM) │ │
│ │ • EBS gp3 volumes (encrypted, KMS) │ │
│ │ • S3 bucket: kiven-backups-{customer-id} │ │
│ │ • IAM IRSA role: kiven-cnpg-backup-role │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────┘
| Service | Responsibility | Language | Priority |
|---|---|---|---|
| svc-api | REST + GraphQL gateway, request routing | Go | P0 |
| svc-auth | OIDC (Google/GitHub/SAML), RBAC, API keys, org/team model | Go | P0 |
| svc-provisioner | THE BRAIN — Orchestrates full provisioning pipeline (nodes → storage → S3 → CNPG → PG) | Go | P0 |
| svc-infra | AWS resource management in customer accounts (EC2, EBS, S3, IAM, KMS) | Go | P0 |
| svc-clusters | Cluster lifecycle via provider interface (status, scale, upgrade, delete) | Go | P0 |
| svc-backups | Backup/restore management, PITR, fork/clone, backup verification | Go | P0 |
| svc-monitoring | Metrics ingestion from agents, DBA intelligence, alerts engine | Go | P0 |
| svc-users | Database user/role management, permissions, pg_hba rules | Go | P0 |
| svc-agent-relay | gRPC server, multiplexes all customer agent connections | Go | P0 |
| svc-yamleditor | YAML generation, schema validation, diff engine, change history | Go | P0 |
| svc-migrations | Import from Aiven/RDS/bare PG into Kiven-managed clusters | Go | P1 |
| svc-billing | Stripe integration, usage tracking, per-cluster pricing | Go | P1 |
| svc-audit | Immutable audit log of all operations on customer infra | Go | P1 |
| svc-notification | Alerts via Slack, email, webhook, PagerDuty | Go | P1 |
| agent | In-cluster binary — CNPG controller, PG stats, command executor, log aggregator | Go | P0 |
Provider/Plugin Architecture
The core engine is operator-agnostic. Each data service is a provider implementing a standard Go interface. Phase 1 ships the CNPG provider only. Future providers (Strimzi, Redis, ECK) plug in without rewriting core services.
Core Engine (operator-agnostic)
├── svc-provisioner → calls provider.Provision()
├── svc-clusters → calls provider.Scale(), provider.Status()
├── svc-backups → calls provider.Backup(), provider.Restore()
├── svc-monitoring → calls provider.CollectMetrics()
└── svc-users → calls provider.CreateUser()
│
▼
Provider Interface (Go interface)
│
┌─────┴───────────────────────────────┐
│ CNPG Provider (Phase 1 — PG) │
│ Strimzi Provider (Phase 3 — Kafka) │
│ Redis Provider (Phase 3 — Redis) │
│ ECK Provider (Phase 3 — ES) │
└─────────────────────────────────────┘
2.5 Data Flow — Provisioning
Customer clicks "Create Database"
│
▼
┌─── svc-api ───┐ ┌─── svc-auth ──┐
│ Validate req │────▶│ Check RBAC │
└───────┬───────┘ └───────────────┘
│
▼
┌─── svc-provisioner (THE BRAIN) ──────────────────────────────────────┐
│ │
│ 1. svc-infra → AssumeRole → Create node group (kiven-db-nodes) │
│ 2. svc-infra → AssumeRole → Create StorageClass (gp3, encrypted) │
│ 3. svc-infra → AssumeRole → Create S3 bucket (backups) │
│ 4. svc-infra → AssumeRole → Create IRSA role (CNPG → S3) │
│ 5. agent → Install CNPG operator (Helm) │
│ 6. agent → Apply CNPG Cluster YAML (generated by svc-clusters) │
│ 7. agent → Apply PgBouncer Pooler YAML │
│ 8. agent → Apply ScheduledBackup YAML │
│ 9. agent → Apply NetworkPolicy YAML │
│ 10. agent → Wait for cluster healthy │
│ 11. svc-users → Create initial database + user │
│ 12. Return connection string to customer │
│ │
│ Status updates streamed via agent gRPC → svc-agent-relay │
│ Dashboard shows real-time provisioning progress │
└───────────────────────────────────────────────────────────────────────┘
2.6 Data Flow — Steady State
┌─── Kiven Agent (in customer K8s) ─────────────────────────────┐
│ │
│ CNPG Controller ──── watches Cluster/Backup/Pooler CRDs │
│ PG Stats Collector ─ pg_stat_statements, pg_stat_activity │
│ Log Aggregator ───── PG logs from all pods │
│ Infra Reporter ───── node status, EBS usage, pod health │
│ │
│ Every 30s: streams metrics + status to svc-agent-relay │
│ On event: immediately reports (failover, backup done, error) │
└────────────────────────┬───────────────────────────────────────┘
│ gRPC/mTLS (outbound only)
▼
┌─── svc-agent-relay ───────────────────────────────────────────┐
│ Multiplexes connections from all customer agents │
│ Routes events to: svc-monitoring, svc-clusters, svc-audit │
└───────────────────────────────────────────────────────────────┘
│
┌──────────────┼──────────────┐
▼ ▼ ▼
svc-monitoring svc-clusters svc-audit
(DBA intelligence, (status update) (immutable log)
alert engine)
Each database is provisioned with a plan that determines compute, memory, storage, and HA configuration:
| Plan | CPU | RAM | Storage | Instances | HA | Node Type | Use Case |
|---|---|---|---|---|---|---|---|
| Hobbyist | 1 vCPU | 1 GB | 10 GB | 1 | No | t3.small | Testing, personal projects |
| Startup | 2 vCPU | 4 GB | 50 GB | 2 | Yes | r6g.medium | Small apps, dev/staging |
| Business | 4 vCPU | 16 GB | 100 GB | 3 | Yes | r6g.large | Production, medium traffic |
| Premium | 8 vCPU | 32 GB | 500 GB | 3 | Yes | r6g.xlarge | High-performance, analytics |
| Custom | User-defined | User-defined | User-defined | 1-5 | Configurable | Any | Specific requirements |
Each plan includes:
Pre-tuned postgresql.conf (shared_buffers, work_mem, etc. sized for the plan)
Appropriate PgBouncer pool size and mode
Right backup frequency and retention
Resource limits and requests matching the node type
Plans can be upgraded or downgraded at any time from the dashboard (triggers a rolling update via CNPG).
Databases can be paused to eliminate compute costs while preserving data. This is a fundamental advantage of the "managed on your infra" model — something Aiven cannot offer because they own the infrastructure.
Customer clicks "Power Off"
│
├─ 1. svc-clusters → agent: Delete CNPG Cluster CR
│ PV reclaim policy = Retain → EBS volumes preserved
│
├─ 2. CNPG pods terminated, K8s services removed
│ EBS volumes detached but retained in AWS
│
├─ 3. svc-infra → AWS API: Scale node group to 0
│ No more EC2 cost
│
└─ 4. Dashboard: "Paused — Data safe, no compute cost"
S3 backups and EBS volumes remain
Customer clicks "Resume"
│
├─ 1. svc-infra → AWS API: Scale node group back up
│ Wait for nodes ready (~2-3 min)
│
├─ 2. svc-clusters → agent: Apply CNPG Cluster CR
│ References existing PVCs (same EBS volume IDs)
│
├─ 3. CNPG starts PostgreSQL with existing data
│ Primary elected, replicas sync (~1-2 min)
│
└─ 4. Dashboard: "Running — Resumed"
Connection strings unchanged, total resume time ~3-5 min
Automate power schedules for non-production environments:
Example: Mon-Fri 8am-6pm ON, nights and weekends OFF
Savings: 60-70% on dev/staging compute costs
Configured via dashboard, API, CLI, or Terraform
| Scenario | Always On | Scheduled (10h/day, weekdays) | Savings |
|---|---|---|---|
| Startup plan (2×r6g.medium) | ~$180/mo | ~$55/mo | 70% |
| Business plan (3×r6g.large) | ~$450/mo | ~$140/mo | 69% |
| Paused (storage only) | N/A | ~$10/mo | 94% |
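The savings figures follow from the share of the week that compute is running: 10 hours on 5 days is 50 of 168 hours, or about 30%. The sketch below reproduces that arithmetic for compute cost only. It assumes cost scales linearly with running hours and ignores storage, so its results land slightly below the rounded figures in the table. The function names are illustrative.

```go
package main

import "fmt"

// OnFraction returns the share of a 168-hour week during which compute runs.
func OnFraction(hoursPerDay, daysPerWeek int) float64 {
	return float64(hoursPerDay*daysPerWeek) / (7 * 24)
}

// ScheduledCost scales an always-on monthly compute cost by the on-fraction.
func ScheduledCost(alwaysOn float64, hoursPerDay, daysPerWeek int) float64 {
	return alwaysOn * OnFraction(hoursPerDay, daysPerWeek)
}

// SavingsPercent is the relative reduction versus running all the time.
func SavingsPercent(alwaysOn, scheduled float64) float64 {
	return 100 * (1 - scheduled/alwaysOn)
}

func main() {
	cases := []struct {
		plan     string
		alwaysOn float64 // USD per month, from the table above
	}{
		{"Startup", 180},
		{"Business", 450},
	}
	for _, c := range cases {
		s := ScheduledCost(c.alwaysOn, 10, 5)
		fmt.Printf("%-8s always on $%.0f, scheduled $%.0f, savings %.0f%%\n",
			c.plan, c.alwaysOn, s, SavingsPercent(c.alwaysOn, s))
	}
}
```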
Simple Mode (Default) — "Aiven Experience"
For developers who just need a database. Forms, sliders, buttons. No YAML visible.
Create database → pick plan → get connection string
Manage users, backups, config via UI forms
See metrics, alerts, logs in clean dashboards
Advanced Mode — "Lens Experience"
For DevOps/Platform engineers who want full control. Like Lens for Kubernetes.
View the generated YAML for every resource (CNPG Cluster, Pooler, Backup, etc.)
Edit YAML directly in Monaco editor (VS Code-like) with CNPG schema validation
Diff view before applying changes
Change history (git-like timeline of all YAML changes)
Rollback to any previous YAML version
Toggle between modes at any time
PART III — DELIVERY MODEL
Trunk-Based Development with Cherry-Pick
main (trunk)
│
┌───────────────┼───────────────┐
│ │ │
feature/A feature/B feature/C
│ │ │
└───────────────┼───────────────┘
│
merge to main
│
┌───────────┴───────────┐
│ │
▼ ▼
maintenance/v1.x.x maintenance/v2.x.x
(cherry-pick with (cherry-pick with
label: backport-v1) label: backport-v2)
| Branch | Usage | Policy |
|---|---|---|
| main | Main trunk | All PRs merge here |
| maintenance/v*.x.x | Version maintenance | Cherry-pick from main only |
| feature/* | Development | Short-lived, merge to main |
Centralized Flux: Single instance managing all environments
Kustomization/HelmRelease pattern: Git + Kustomize/Helm generators
Auto-reconcile: Dev auto-reconcile, Staging/Prod manual approval
| Environment | Account | Cluster | Sync Policy |
|---|---|---|---|
| dev | kiven-dev | eks-dev | Auto-sync |
| staging | kiven-staging | eks-staging | Manual |
| prod | kiven-prod | eks-prod | Manual + Approval |
Detailed documentation: bootstrap/BOOTSTRAP-GUIDE.md
PART IV — REPOSITORY & OWNERSHIP MODEL
| Tier | Repos | Description | Owner |
|---|---|---|---|
| T0 — Foundation | bootstrap/ | AWS Landing Zone, Account Factory | Platform Team |
| T1 — Platform | platform-* | GitOps, Networking, Security, Observability | Platform Team |
| T2 — Contracts | contracts-proto, sdk-* | gRPC APIs, Go SDK, CLI | Platform + Backend |
| T3 — Core Services | svc-* | Kiven backend services | Backend Team |
| T4 — Agent | agent/ | Customer-deployed agent | Agent Team |
| T5 — Frontend | dashboard/ | Next.js dashboard (Simple + Advanced modes) | Frontend Team |
| T6 — Providers | provider-* | CNPG provider, Strimzi provider (future) | Backend Team |
| T7 — Quality | e2e-scenarios, chaos-* | Tests, chaos engineering | QA + Platform |
| T8 — Documentation | docs/ | Centralized documentation | All Teams |
| Tier | Owner Team | Approvers | Change Process |
|---|---|---|---|
| T0 — Foundation | Platform | Platform Lead + Security | ADR + RFC required |
| T1 — Platform | Platform | Platform Team (2 reviewers) | ADR if breaking change |
| T2 — Contracts | Platform + Backend | Tech Lead | Buf breaking detection |
| T3 — Core Services | Backend | Team Lead | Standard PR review |
| T4 — Agent | Agent / Backend | Agent Lead + Security | Security review required |
| T5 — Frontend | Frontend | Frontend Lead | Standard PR review |
| T6 — Providers | Backend | Tech Lead | Provider interface compliance |
| T7 — Quality | QA + Platform | QA Lead | Standard PR review |
| T8 — Documentation | All | Tech Lead | Standard PR review |
| Repo | Description |
|---|---|
| bootstrap/ | AWS Landing Zone, Account Factory, SCPs, SSO |

| Repo | Description |
|---|---|
| platform-gitops/ | Flux, Kustomizations, HelmReleases |
| platform-networking/ | Cilium, Gateway API |
| platform-observability/ | OTel, Prometheus, Loki, Tempo, Grafana |
| platform-security/ | Vault, External-Secrets, Kyverno |

| Repo | Description |
|---|---|
| contracts-proto/ | Protobuf definitions (agent ↔ SaaS, inter-service) |
| sdk-go/ | Go SDK for Kiven API |
| kiven-cli/ | CLI tool (kiven clusters list, kiven backup trigger) |
| terraform-provider-kiven/ | Terraform provider for Kiven |

| Repo | Description |
|---|---|
| svc-api/ | REST + GraphQL gateway |
| svc-auth/ | Authentication, RBAC, API keys |
| svc-provisioner/ | Provisioning orchestrator (THE BRAIN) |
| svc-infra/ | AWS resource management in customer accounts |
| svc-clusters/ | Cluster lifecycle (CNPG management) |
| svc-backups/ | Backup/restore, PITR, fork/clone |
| svc-monitoring/ | Metrics, DBA intelligence, alerts |
| svc-users/ | Database user/role management |
| svc-agent-relay/ | gRPC server for agent connections |
| svc-yamleditor/ | YAML generation, validation, diff, history |
| svc-migrations/ | Import from Aiven/RDS/bare PG |
| svc-billing/ | Stripe billing |
| svc-audit/ | Immutable audit log |
| svc-notification/ | Alerts (Slack, email, webhook, PagerDuty) |

| Repo | Description |
|---|---|
| kiven-agent/ | In-cluster agent (CNPG controller, PG stats, command executor) |
| kiven-agent-helm/ | Helm chart for agent deployment |

| Repo | Description |
|---|---|
| dashboard/ | Next.js dashboard (Simple + Advanced mode) |

| Repo | Description |
|---|---|
| provider-cnpg/ | CloudNativePG provider (Phase 1) |
| provider-strimzi/ | Strimzi/Kafka provider (Phase 3 — future) |
| provider-redis/ | Redis Operator provider (Phase 3 — future) |

| Repo | Description |
|---|---|
| e2e-scenarios/ | End-to-end tests (provisioning, backup, failover) |
| chaos-experiments/ | Chaos Mesh experiments (node failure, network partition) |
PART V — PLATFORM BASELINES
Defense in Depth: 7 layers of security
| Layer | Component | Protection |
|---|---|---|
| Edge | Cloudflare | WAF, DDoS, Bot protection |
| Gateway | Cilium Gateway API | TLS termination, routing |
| Network | Cilium | NetworkPolicies, default deny |
| Identity | IRSA + Vault | Dynamic secrets, mTLS, OIDC |
| Workload | Kyverno | Pod security, image signing |
| Data | KMS + EBS encryption | Encryption at rest/transit |
| Customer Access | Cross-account IAM + Audit | Least privilege, CloudTrail, revocable |
Detailed documentation: security/SECURITY-ARCHITECTURE.md
5.2 Observability Baseline
| Signal | Tool | Retention | Cost |
|---|---|---|---|
| Metrics | Prometheus + Remote Write S3 | 15d local, 1y S3 | ~5 EUR/mo |
| Logs | Loki | 30 days (GDPR) | Self-hosted |
| Traces | Tempo | 7 days | Self-hosted |
| Profiling | Pyroscope | 7 days | Self-hosted |
| Errors | Sentry (self-hosted) | 30 days | Self-hosted |
Detailed documentation: observability/OBSERVABILITY-GUIDE.md
| Component | Role | Configuration |
|---|---|---|
| Cloudflare | Edge, WAF, Tunnel | Pro tier |
| Cilium | CNI, mTLS, Gateway API | WireGuard encryption |
| VPC Peering | Aiven connectivity (Kiven product DB) | Private, no internet |
| Route53 | Private DNS, backup | Internal zones |
| Cross-Account | Customer EKS access | IAM AssumeRole, kubeconfig |
Detailed documentation: networking/NETWORKING-ARCHITECTURE.md
Kiven Product Database (SaaS side)
| Service | Provider | Purpose | Cost Estimate |
|---|---|---|---|
| PostgreSQL | Aiven | Product DB (orgs, clusters, audit) | ~300 EUR/mo |
| Kafka | Aiven | Agent events, async operations | ~400 EUR/mo |
| Valkey | Aiven | Sessions, rate limiting, cache | ~150 EUR/mo |
Customer Databases (managed by Kiven)
| Service | Technology | Where | Cost |
|---|---|---|---|
| PostgreSQL | CloudNativePG on EKS | Customer's AWS | Customer's AWS bill |
| Backups | Barman → S3 | Customer's AWS | Customer's S3 costs |
Golden rule: Kiven product DB and customer databases are completely separate. Customer data never touches Kiven's infrastructure.
Detailed documentation: data/DATA-ARCHITECTURE.md
PART VI — TESTING & QUALITY
| Layer | Test Types | Frequency |
|---|---|---|
| Base | Static analysis, linting (golangci-lint) | Pre-commit |
| Unit | Service logic, provider interface | PR |
| Integration | Agent ↔ CNPG, svc-infra ↔ AWS (LocalStack), DB (Testcontainers) | PR |
| Contract | gRPC contracts (Buf), agent protocol | PR |
| E2E | Full provisioning pipeline (kind + CNPG) | Nightly |
| Performance | Load testing, provisioning time (k6) | Weekly |
| Chaos | Node failure, agent disconnection, CNPG failover (Chaos Mesh) | Weekly |
| Metric | Target | Alert |
|---|---|---|
| API Latency P50 | < 50ms | > 100ms |
| API Latency P95 | < 100ms | > 200ms |
| API Latency P99 | < 200ms | > 500ms |
| Error Rate | < 0.1% | > 1% |
| Provisioning Time | < 10min | > 15min |
| Agent Reconnection | < 30s | > 60s |
| Backup Success Rate | > 99.9% | < 99% |
Detailed documentation: testing/TESTING-STRATEGY.md
PART VII — RESILIENCE & DR
| Failure | Detection | Recovery | RTO |
|---|---|---|---|
| Pod crash | Liveness probe | K8s restart | < 30s |
| Node failure | Node NotReady | Pod reschedule | < 2min |
| AZ failure | Multi-AZ detect | Traffic shift | < 5min |
| Product DB failure | Aiven health | Automatic failover | < 5min |
| Kafka broker failure | Aiven health | Automatic rebalance | < 2min |
| Full region failure | Manual | DR procedure | 4h (target) |
Customer Database Failures (Handled by Kiven)
| Failure | Detection | Recovery | RTO |
|---|---|---|---|
| PG pod crash | CNPG + Agent | CNPG automatic restart | < 30s |
| Primary failure | CNPG failover | Automatic promotion of replica | < 30s |
| DB node failure | Agent + AWS | Pod reschedule to healthy node | < 2min |
| EBS volume issue | Agent monitoring | Alert + manual intervention | < 15min |
| Agent disconnection | SaaS heartbeat | Agent auto-reconnects; DB keeps running | Immediate (DB unaffected) |
| Backup failure | Agent monitoring | Retry + alert to customer + Kiven ops | < 1h |
| Data corruption | Backup verification | PITR restore to last good point | < 30min |
| Data | Method | Frequency | Retention |
|---|---|---|---|
| Product DB | Aiven automated | Hourly | 7 days |
| Product DB PITR | Aiven WAL | Continuous | 24h |
| Kafka | Topic retention | N/A | 7 days |
| Terraform state | S3 versioning | Every apply | 90 days |
Customer Databases (Managed by Kiven)
| Data | Method | Frequency | Retention |
|---|---|---|---|
| PostgreSQL | Barman (CNPG) → S3 | Configurable (default: 6h) | Configurable (default: 30 days) |
| PostgreSQL PITR | WAL archiving → S3 | Continuous | Configurable (default: 7 days) |
| Backup verification | Automated restore test | Weekly | Report stored 90 days |
Detailed documentation: resilience/DR-GUIDE.md
PART VIII — PLATFORM CONTRACTS
8.1 Golden Path (New Kiven Service Checklist)
| Step | Action | Validation |
|---|---|---|
| 1 | Create repo from Go service template | Structure compliant |
| 2 | Define protos in contracts-proto | buf lint pass |
| 3 | Implement service (Go) | Unit tests > 80% |
| 4 | Configure K8s manifests | Kyverno policies pass |
| 5 | Configure External-Secret | Secrets resolved from Vault |
| 6 | Add ServiceMonitor | Metrics visible in Grafana |
| 7 | Create HTTPRoute or gRPC route | Traffic routable |
| 8 | PR review | Merge → Auto-deploy dev |
8.2 SLI/SLO/Error Budgets
| Service | SLI | SLO | Error Budget |
|---|---|---|---|
| svc-api | Availability | 99.9% | 43 min/month |
| svc-api | Latency P99 | < 200ms | N/A |
| svc-provisioner | Provisioning success rate | 99.5% | N/A |
| svc-agent-relay | Agent connection uptime | 99.9% | 43 min/month |
| Agent | Metrics delivery | 99.9% | 43 min/month |
| Customer DB | Backup success rate | 99.9% | N/A |
| Platform | Availability | 99.5% | 3.6h/month |
| Role | Responsibility | Rotation |
|---|---|---|
| Primary | First responder, triage (SaaS + customer infra) | Weekly |
| Secondary | Escalation, deep expertise | Weekly |
| Incident Commander | Coordination for P1 (customer data at risk) | On-demand |
Detailed documentation: platform/PLATFORM-ENGINEERING.md
| Phase | Focus | Duration |
|---|---|---|
| 1 | Bootstrap Layer 0-1 (IAM, VPC, EKS) | 3 weeks |
| 2 | Platform GitOps (Flux) | 1 week |
| 3 | Platform Networking (Cilium, Gateway API) + Cloudflare | 2 weeks |
| 4 | Platform Security (Vault, Kyverno) | 2 weeks |
| 5 | Platform Observability (Prometheus, Loki, Tempo) | 2 weeks |
| 6 | Agent framework + gRPC protocol + agent-relay | 3 weeks |
| 7 | CNPG Provider (provider-cnpg) | 2 weeks |
| 8 | svc-provisioner (THE BRAIN) + svc-infra (AWS resources) | 4 weeks |
| 9 | svc-clusters + svc-backups + svc-users | 3 weeks |
| 10 | svc-monitoring + DBA intelligence (basic) | 3 weeks |
| 11 | Dashboard — Simple Mode (Next.js) | 4 weeks |
| 12 | Dashboard — Advanced Mode (YAML editor) | 2 weeks |
| 13 | svc-auth (OIDC, RBAC, org model) | 2 weeks |
| 14 | CLI + API + Terraform Provider | 3 weeks |
| 15 | svc-billing (Stripe) + svc-audit | 2 weeks |
| 16 | svc-migrations (Aiven/RDS import) | 2 weeks |
| 17 | Testing (E2E, chaos, performance) | 2 weeks |
| 18 | Compliance audit (GDPR, SOC2) | 2 weeks |
Total estimated: ~44 weeks (~10 months)
GLOSSARY.md
| ADR | Title | Status |
|---|---|---|
| 001 | Landing Zone: Control Tower + Terraform | Accepted |
| 002 | CNPG as PostgreSQL Engine | Accepted |
| 003 | Agent-Based Connectivity | Accepted |
| 004 | Provider/Plugin Architecture | Accepted |
| ... | ... | ... |
adr/
C. Change Management Process
ADR Required: Any decision impacting >1 service
Review: Platform Team + Tech Lead
Communication: Slack #platform-updates
RFC required (docs/rfc/)
Migration path documented
Announce 2 sprints before
Incident Commander approval
Post-mortem required
Retroactive ADR within 48h
| Document | Description | Path |
|---|---|---|
| Bootstrap Guide | AWS setup, Account Factory | bootstrap/BOOTSTRAP-GUIDE.md |
| Security Architecture | Defense in depth, IAM, cross-account, Vault | security/SECURITY-ARCHITECTURE.md |
| Observability Guide | Metrics, logs, traces, APM, dashboards | observability/OBSERVABILITY-GUIDE.md |
| Networking Architecture | VPC, Cloudflare, Gateway API, customer connectivity | networking/NETWORKING-ARCHITECTURE.md |
| Data Architecture | Product DB, Kafka, customer DB model | data/DATA-ARCHITECTURE.md |
| Testing Strategy | Pyramid, E2E, chaos, provisioning tests | testing/TESTING-STRATEGY.md |
| Platform Engineering | Contracts, Golden Path, on-call, CI/CD | platform/PLATFORM-ENGINEERING.md |
| DR Guide | Backup, recovery, SaaS DR + customer DB DR | resilience/DR-GUIDE.md |
| Agent Architecture | Agent design, gRPC protocol, deployment | agent/AGENT-ARCHITECTURE.md |
| Customer Infra Management | Nodes, storage, S3, IAM, cross-account | infra/CUSTOMER-INFRA-MANAGEMENT.md |
| Customer Onboarding | Terraform module, EKS discovery, provisioning | onboarding/CUSTOMER-ONBOARDING.md |
| Provider Interface | Plugin architecture, Go interface, adding providers | providers/PROVIDER-INTERFACE.md |
| Glossary | All terminology | GLOSSARY.md |
Maintained by: Kiven Platform Team
Last updated: February 2026