
Commit 228621e

docs: add core platform documentation and refactor existing architecture docs
Introduce new docs for agent architecture, provider interface, customer infra management, customer onboarding, YAML config validator, local dev guide, template usage guide, and product roadmap. Restructure enterprise architecture, glossary, data architecture, bootstrap guide, and ADR-001 for consistency and clarity.
1 parent 67c9f7b commit 228621e

13 files changed

Lines changed: 3875 additions & 1388 deletions

EntrepriseArchitecture.md

Lines changed: 696 additions & 321 deletions
Large diffs are not rendered by default.

GLOSSARY.md

Lines changed: 133 additions & 643 deletions
Large diffs are not rendered by default.

adr/ADR-001-LANDING-ZONE-APPROACH.md

Lines changed: 17 additions & 6 deletions
@@ -118,12 +118,23 @@ Control Tower controls can be managed via Terraform:
118118

119119
### Phase 1: Control Tower Setup (Console)
120120

121-
| Step | Action | Duration |
122-
|------|--------|----------|
123-
| 1 | Enable Control Tower | 45 min |
124-
| 2 | Configure home region (eu-west-1) | Included |
125-
| 3 | Log Archive + Audit accounts created | Automatic |
126-
| 4 | Enable IAM Identity Center | Included |
121+
> **See [BOOTSTRAP-RUNBOOK](../../bootstrap/docs/BOOTSTRAP-RUNBOOK.md) for detailed instructions.**
122+
123+
| Step | Action |
124+
|------|--------|
125+
| 1 | Choose setup preferences (regions, region deny) |
126+
| 2 | Create OUs (Security, Sandbox) |
127+
| 3 | Configure Service integrations: **create 2 accounts** |
128+
| 4 | Review and enable (~45 min) |
129+
130+
**Accounts created in Step 3:**
131+
132+
| Service | Account | Email |
133+
|---------|---------|-------|
134+
| AWS Config Aggregator | **Audit** | `aws+audit@talq.xyz` |
135+
| CloudTrail Administrator | **Log Archive** | `aws+logs@talq.xyz` |
136+
137+
> ⚠️ Config and CloudTrail require **different** accounts.
127138
128139
### Phase 2: Terraform Layer (bootstrap/)
129140

agent/AGENT-ARCHITECTURE.md

Lines changed: 285 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,285 @@
1+
# Kiven Agent Architecture
2+
## *The Bridge Between Kiven SaaS and Customer Kubernetes*
3+
4+
---
5+
6+
> **Back to**: [Architecture Overview](../EntrepriseArchitecture.md)
7+
8+
---
9+
10+
# What Is the Kiven Agent
11+
12+
The agent is a **single Go binary** deployed inside the customer's Kubernetes cluster. It is the only component Kiven runs in the customer's environment. Everything Kiven does on the customer's cluster goes through the agent.
13+
14+
```
15+
Kiven SaaS (our infra) ◄──── gRPC/mTLS (outbound from agent) ──── Agent (customer's K8s)
16+
17+
├── Watches CNPG CRDs
18+
├── Collects PG metrics
19+
├── Executes commands
20+
├── Aggregates logs
21+
└── Reports infra status
22+
```
23+
24+
---
25+
26+
# Design Principles
27+
28+
| Principle | Implementation |
29+
|-----------|---------------|
30+
| **Outbound-only** | Agent initiates connection to Kiven SaaS. No inbound ports on customer's firewall. |
31+
| **Minimal footprint** | < 50MB RAM, < 0.1 CPU. Must not impact customer's workloads. |
32+
| **Fault-tolerant** | If agent loses connection, databases keep running. Agent auto-reconnects. |
33+
| **Secure** | mTLS for all communication. ServiceAccount limited to the least-privilege rules listed under RBAC. |
34+
| **Single binary** | One Go binary, deployed via Helm chart. No dependencies. |
35+
| **Multi-provider ready** | Plugin system: auto-detects installed operators, activates relevant modules. |
36+
37+
---
38+
39+
# Agent Components
40+
41+
```
42+
┌─────────────────────────────────────────────────────────────────────┐
43+
│ KIVEN AGENT (Go binary) │
44+
│ │
45+
│ ┌─────────────────────────────────────────────────────────────┐ │
46+
│ │ Provider Registry │ │
47+
│ │ ├── CNPG Module (Phase 1) │ │
48+
│ │ │ ├── CNPG Watcher (informers on Cluster/Backup/Pooler) │ │
49+
│ │ │ ├── PG Stats Collector (pg_stat_*, via PG connection) │ │
50+
│ │ │ └── PG Log Collector (pod logs from CNPG pods) │ │
51+
│ │ ├── Strimzi Module (Future) │ │
52+
│ │ └── Redis Module (Future) │ │
53+
│ └─────────────────────────────────────────────────────────────┘ │
54+
│ │
55+
│ ┌─────────────────────────────────────────────────────────────┐ │
56+
│ │ Core Components │ │
57+
│ │ ├── Command Executor — applies YAML, runs SQL │ │
58+
│ │ ├── Infra Reporter — node status, EBS, resource usage │ │
59+
│ │ ├── Health Monitor — self-health, connectivity check │ │
60+
│ │ └── Config Manager — agent config, hot reload │ │
61+
│ └─────────────────────────────────────────────────────────────┘ │
62+
│ │
63+
│ ┌─────────────────────────────────────────────────────────────┐ │
64+
│ │ Transport Layer │ │
65+
│ │ ├── gRPC Client (mTLS, outbound to svc-agent-relay) │ │
66+
│ │ ├── Event Buffer (in-memory, survives brief disconnects) │ │
67+
│ │ └── Heartbeat (every 30s to prove agent is alive) │ │
68+
│ └─────────────────────────────────────────────────────────────┘ │
69+
└─────────────────────────────────────────────────────────────────────┘
70+
```
71+
72+
## CNPG Watcher
73+
74+
Uses Kubernetes **informers** (via controller-runtime) to watch CNPG CRDs:
75+
- `Cluster` — status changes, failover events, replication lag
76+
- `Backup` — backup start/complete/fail events
77+
- `ScheduledBackup` — schedule status
78+
- `Pooler` — PgBouncer status, connection stats
79+
80+
On any change → event streamed to Kiven SaaS via gRPC.
81+
82+
## PG Stats Collector
83+
84+
Connects to PostgreSQL directly (using credentials from CNPG-managed K8s Secret):
85+
- `pg_stat_statements` — query performance (every 60s)
86+
- `pg_stat_activity` — active queries, blocking (every 30s)
87+
- `pg_stat_bgwriter` — checkpoint/write stats (every 60s)
88+
- `pg_stat_user_tables` — table stats, dead tuples (every 300s)
89+
- Custom queries for bloat detection, XID age (every 300s)
90+
91+
**Important**: Query parameter values are **never collected**. Only query templates are collected (for example, `SELECT * FROM users WHERE id = $1`).
92+
93+
## PG Log Collector
94+
95+
Tails PostgreSQL pod logs via Kubernetes API:
96+
- Filters for ERROR, WARNING, FATAL, PANIC levels
97+
- Applies **log scrubbing**: replaces parameter values with `$N`
98+
- Batches and streams to Kiven SaaS
99+
- Detects patterns: slow queries, connection rejections, OOM
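The filtering and scrubbing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the agent's actual code: it assumes plain-text log lines of the form `<prefix> SEVERITY:  message` (the real log layout may differ, for example JSON), and the names `FilterAndScrub` and `ScrubMessage` are hypothetical. Only the message part is scrubbed so that timestamps and PIDs in the prefix stay intact.

```go
package main

import (
	"fmt"
	"regexp"
)

// severityPattern splits a line into a prefix ending in one of the forwarded
// severities and the message that follows. The line layout is an assumption.
var severityPattern = regexp.MustCompile(`^(.*?\b(?:ERROR|WARNING|FATAL|PANIC):\s*)(.*)$`)

// literalPattern matches an existing placeholder ($1), a single-quoted string
// literal (with '' escapes), or a bare numeric literal.
var literalPattern = regexp.MustCompile(`\$\d+|'(?:[^']|'')*'|\b\d+(?:\.\d+)?\b`)

// ScrubMessage (hypothetical helper) replaces literals with $1, $2, ...
// Existing placeholders are left untouched. Limitation of this sketch: new
// numbers may collide with placeholders already present in the message.
func ScrubMessage(msg string) string {
	n := 0
	return literalPattern.ReplaceAllStringFunc(msg, func(m string) string {
		if m[0] == '$' {
			return m // already a placeholder
		}
		n++
		return fmt.Sprintf("$%d", n)
	})
}

// FilterAndScrub returns the scrubbed line and true when the line carries an
// ERROR, WARNING, FATAL, or PANIC severity; otherwise it returns "", false.
func FilterAndScrub(line string) (string, bool) {
	m := severityPattern.FindStringSubmatch(line)
	if m == nil {
		return "", false
	}
	return m[1] + ScrubMessage(m[2]), true
}

func main() {
	lines := []string{
		"2026-02-01 12:00:00 UTC [123] LOG:  checkpoint complete",
		"2026-02-01 12:00:01 UTC [124] ERROR:  invalid input for id 42 and name 'bob'",
	}
	for _, l := range lines {
		if out, ok := FilterAndScrub(l); ok {
			fmt.Println(out)
		}
	}
}
```

A regex-based scrubber like this is deliberately conservative and will miss some literal forms (for example dollar-quoted strings); a production scrubber would likely need a real SQL tokenizer.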
100+
101+
## Command Executor
102+
103+
Receives commands from Kiven SaaS (via gRPC stream) and executes them:
104+
105+
| Command Type | What It Does | Example |
106+
|-------------|-------------|---------|
107+
| `apply_yaml` | Applies K8s manifest | Create/update CNPG Cluster, Pooler, Backup |
108+
| `delete_resource` | Deletes K8s resource | Delete cluster on power-off (PVCs retained) |
109+
| `run_sql` | Executes SQL via PG connection | CREATE USER, GRANT, ALTER SYSTEM |
110+
| `install_helm` | Installs/upgrades Helm chart | Install CNPG operator |
111+
| `collect_diagnostics` | Runs diagnostic checks | Prerequisites validation |
112+
113+
Every command is:
114+
- **Logged** with full audit trail (who requested, what was executed, result)
115+
- **Idempotent** where possible (apply is naturally idempotent)
116+
- **Validated** before execution (schema validation for YAML)
117+
- **Reported** with result (success/failure + output)
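The validate, execute, audit, report sequence described above could look roughly like the following sketch. Everything here is illustrative: the `Command`, `Result`, `Handler`, and `Executor` types are hypothetical, the empty-payload check stands in for real schema validation, and the audit line is written with the standard `log` package rather than whatever audit sink the agent uses.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// Command mirrors the command types in the table above (hypothetical layout).
type Command struct {
	ID          string
	Type        string // e.g. "apply_yaml", "run_sql"
	RequestedBy string
	Payload     string
}

// Result is what gets reported back to the SaaS side.
type Result struct {
	CommandID string
	Success   bool
	Output    string
}

// Handler executes the payload of one command type.
type Handler func(payload string) (string, error)

// Executor dispatches commands to registered handlers.
type Executor struct {
	handlers map[string]Handler
}

// Execute validates the command, runs its handler, writes an audit line,
// and returns a Result in every case (success or failure).
func (e *Executor) Execute(cmd Command) Result {
	var out string
	var err error
	h, known := e.handlers[cmd.Type]
	switch {
	case !known:
		err = fmt.Errorf("unknown command type %q", cmd.Type)
	case cmd.Payload == "":
		err = errors.New("empty payload") // placeholder for schema validation
	default:
		out, err = h(cmd.Payload)
	}
	log.Printf("audit: id=%s type=%s requested_by=%s ok=%t",
		cmd.ID, cmd.Type, cmd.RequestedBy, err == nil)
	if err != nil {
		return Result{CommandID: cmd.ID, Success: false, Output: err.Error()}
	}
	return Result{CommandID: cmd.ID, Success: true, Output: out}
}

func main() {
	ex := &Executor{handlers: map[string]Handler{
		"run_sql": func(p string) (string, error) { return "executed: " + p, nil },
	}}
	fmt.Printf("%+v\n", ex.Execute(Command{ID: "c1", Type: "run_sql", RequestedBy: "svc-provisioner", Payload: "SELECT 1"}))
	fmt.Printf("%+v\n", ex.Execute(Command{ID: "c2", Type: "reboot_node", RequestedBy: "svc-provisioner", Payload: "x"}))
}
```

Rejecting unknown command types before any handler runs keeps the set of actions the agent can perform explicit and reviewable.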
118+
119+
## Infra Reporter
120+
121+
Reports infrastructure-level information:
122+
- Node status (Ready/NotReady, capacity, allocatable)
123+
- EBS volume usage (via PVC status + df)
124+
- Resource consumption (CPU/memory per CNPG pod)
125+
- Kubernetes version, CNPG operator version
126+
- Storage classes available
127+
- Namespace resource quotas
128+
129+
---
130+
131+
# Communication Protocol
132+
133+
## gRPC Service Definition (Simplified)
134+
135+
```protobuf
136+
service AgentRelay {
137+
  // Bidirectional stream: agent sends status, metrics, and results; SaaS sends commands and config
138+
rpc Connect(stream AgentMessage) returns (stream ServerMessage);
139+
140+
// Agent → SaaS: initial registration
141+
rpc Register(RegisterRequest) returns (RegisterResponse);
142+
}
143+
144+
message AgentMessage {
145+
oneof payload {
146+
Heartbeat heartbeat = 1;
147+
ClusterStatus cluster_status = 2;
148+
MetricsBatch metrics = 3;
149+
LogBatch logs = 4;
150+
EventReport event = 5;
151+
CommandResult command_result = 6;
152+
InfraReport infra_report = 7;
153+
}
154+
}
155+
156+
message ServerMessage {
157+
oneof payload {
158+
Command command = 1;
159+
ConfigUpdate config_update = 2;
160+
Ack ack = 3;
161+
}
162+
}
163+
```
164+
165+
## Connection Lifecycle
166+
167+
```
168+
Agent starts
169+
170+
├── 1. Load mTLS certificates (from K8s Secret)
171+
├── 2. Connect to svc-agent-relay (gRPC/mTLS)
172+
├── 3. Register: send agent ID, cluster info, CNPG version
173+
├── 4. Start bidirectional stream (Connect RPC)
174+
175+
│ ┌── Agent → SaaS ──────────────────────────────────┐
176+
│ │ Heartbeat every 30s │
177+
│ │ Cluster status on change (informer events) │
178+
│ │ Metrics every 30-60s │
179+
│ │ Logs (filtered, scrubbed) on arrival │
180+
│ │ Command results after execution │
181+
│ └───────────────────────────────────────────────────┘
182+
183+
│ ┌── SaaS → Agent ──────────────────────────────────┐
184+
│ │ Commands (apply_yaml, run_sql, etc.) │
185+
│ │ Config updates (collection intervals, log level) │
186+
│ │ Acknowledgements │
187+
│ └───────────────────────────────────────────────────┘
188+
189+
└── On disconnect: buffer events, retry with exponential backoff
190+
Databases continue running. No data loss.
191+
```
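The retry step at the end of the lifecycle can be illustrated with a small delay calculation. This sketch assumes the schedule given in the failure-mode table (1s, 2s, 4s, 8s, capped at 60s); the function name `BackoffDelay` is hypothetical, and jitter, which real clients often add, is left out for clarity.

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseDelay = 1 * time.Second
	maxDelay  = 60 * time.Second
)

// BackoffDelay returns the wait before reconnect attempt n (0-based):
// 1s, 2s, 4s, ... doubling each time and capped at maxDelay.
func BackoffDelay(attempt int) time.Duration {
	if attempt < 0 {
		attempt = 0
	}
	// 2^6 seconds already exceeds the cap, so skip the shift for large
	// attempts to avoid overflowing the duration.
	if attempt >= 6 {
		return maxDelay
	}
	d := baseDelay << uint(attempt)
	if d > maxDelay {
		return maxDelay
	}
	return d
}

func main() {
	for attempt := 0; attempt < 8; attempt++ {
		fmt.Printf("attempt %d: wait %s\n", attempt, BackoffDelay(attempt))
	}
}
```

The attempt counter would typically be reset to zero once a connection succeeds, so a later disconnect starts again from the 1s delay.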
192+
193+
---
194+
195+
# Deployment
196+
197+
## Helm Chart
198+
199+
```bash
200+
helm install kiven-agent kiven/agent \
201+
--namespace kiven-system \
202+
--create-namespace \
203+
  --set agentToken="<token-from-kiven-dashboard>" \
204+
--set relay.endpoint=agent-relay.kiven.io:443
205+
```
206+
207+
## Kubernetes Resources Created
208+
209+
| Resource | Namespace | Purpose |
210+
|----------|-----------|---------|
211+
| Deployment (1 replica) | kiven-system | The agent pod |
212+
| ServiceAccount | kiven-system | Identity for RBAC |
213+
| ClusterRole | (cluster-scoped) | Manage CNPG CRDs, read pods/logs, manage kiven-databases namespace |
214+
| ClusterRoleBinding | (cluster-scoped) | Binds role to ServiceAccount |
215+
| Secret | kiven-system | mTLS certificates + agent token |
216+
| ConfigMap | kiven-system | Agent configuration (intervals, log level) |
217+
218+
## RBAC (Least Privilege)
219+
220+
```yaml
221+
rules:
222+
# CNPG CRDs — full access (for provisioning)
223+
- apiGroups: ["postgresql.cnpg.io"]
224+
resources: ["clusters", "backups", "scheduledbackups", "poolers"]
225+
verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
226+
227+
  # Core resources: read only (pods/logs for collection, secrets for PG credentials, PVCs for storage reporting)
228+
- apiGroups: [""]
229+
resources: ["pods", "pods/log", "services", "secrets", "configmaps", "persistentvolumeclaims"]
230+
verbs: ["get", "list", "watch"]
231+
232+
# Namespaces — manage kiven-databases
233+
- apiGroups: [""]
234+
resources: ["namespaces"]
235+
verbs: ["get", "list", "watch", "create"]
236+
237+
# Network policies — create in kiven-databases
238+
- apiGroups: ["networking.k8s.io"]
239+
resources: ["networkpolicies"]
240+
verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
241+
242+
# Storage classes — read (for prerequisites check)
243+
- apiGroups: ["storage.k8s.io"]
244+
resources: ["storageclasses"]
245+
verbs: ["get", "list"]
246+
247+
# Nodes — read (for infra reporting)
248+
- apiGroups: [""]
249+
resources: ["nodes"]
250+
verbs: ["get", "list"]
251+
```
252+
253+
---
254+
255+
# Failure Modes
256+
257+
| Failure | Impact | Recovery |
258+
|---------|--------|----------|
259+
| **Agent pod crash** | Kiven dashboard shows "agent offline". Databases keep running. | K8s restarts pod automatically. Agent reconnects. |
260+
| **gRPC connection lost** | Events buffered in memory. Dashboard shows stale data (with warning). | Agent retries with exponential backoff (1s, 2s, 4s, 8s... max 60s). |
261+
| **Agent misconfigured** | Agent can't connect or authenticate. | Dashboard shows "agent not connected". Customer corrects the values and redeploys the chart (for example with `helm upgrade --install`). |
262+
| **CNPG operator not installed** | Agent reports "CNPG not found" during prerequisites check. | svc-provisioner installs CNPG operator via agent (install_helm command). |
263+
| **Insufficient RBAC** | Agent commands fail with 403. | Agent reports permission error. Customer adjusts the ClusterRole or ClusterRoleBinding. |
264+
265+
**Key invariant**: Agent failure NEVER affects running databases. The CNPG operator manages PostgreSQL independently; the agent serves only the Kiven management plane.
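The "gRPC connection lost" row mentions events buffered in memory. One way such a buffer might work is sketched below, under assumptions that are not stated in this document: a fixed capacity, a drop-oldest policy when full, and a counter of dropped events. The type `EventBuffer` and its methods are hypothetical, and the sketch is not safe for concurrent use without a mutex.

```go
package main

import "fmt"

// EventBuffer is a fixed-capacity FIFO. When full, the oldest event is
// dropped (an assumed policy) and counted.
type EventBuffer struct {
	events   []string
	capacity int
	dropped  int
}

// NewEventBuffer creates a buffer holding at most capacity events.
func NewEventBuffer(capacity int) *EventBuffer {
	return &EventBuffer{capacity: capacity}
}

// Push appends an event, evicting the oldest one if the buffer is full.
func (b *EventBuffer) Push(ev string) {
	if b.capacity <= 0 {
		b.dropped++
		return
	}
	if len(b.events) == b.capacity {
		b.events = b.events[1:]
		b.dropped++
	}
	b.events = append(b.events, ev)
}

// Drain returns the buffered events in arrival order and empties the buffer.
// It would be called after the stream is re-established.
func (b *EventBuffer) Drain() []string {
	out := b.events
	b.events = nil
	return out
}

// Dropped reports how many events were evicted or rejected so far.
func (b *EventBuffer) Dropped() int {
	return b.dropped
}

func main() {
	buf := NewEventBuffer(2)
	for _, ev := range []string{"failover", "backup-complete", "lag-high"} {
		buf.Push(ev)
	}
	fmt.Println(buf.Drain(), "dropped:", buf.Dropped())
}
```

A bounded buffer keeps the agent within its memory budget during long outages, at the cost of losing the oldest management events; the databases themselves are unaffected either way.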
266+
267+
---
268+
269+
# Metrics Collected
270+
271+
| Category | Metrics | Interval |
272+
|----------|---------|----------|
273+
| **PostgreSQL** | connections, QPS, transactions, replication lag, cache hit ratio | 30s |
274+
| **Queries** | top queries by time/calls, slow queries (> threshold), lock waits | 60s |
275+
| **Tables** | size, dead tuples, seq scans, idx scans, bloat estimate | 300s |
276+
| **System** | CPU, memory, disk usage (per PG pod) | 30s |
277+
| **CNPG** | cluster phase, timeline, instances ready, failover count | On change |
278+
| **Backups** | last backup time, duration, size, WAL archiving lag | On change |
279+
| **PgBouncer** | active/idle/waiting connections, pool utilization | 30s |
280+
| **Infrastructure** | node status, EBS IOPS, storage capacity | 60s |
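The polling intervals in the table above can be expressed as a simple schedule. The sketch below is only an illustration: the category keys and the `DueAt` function are hypothetical, the "On change" categories (CNPG, Backups) are event-driven and therefore omitted, and a single tick counter is assumed rather than one timer per collector.

```go
package main

import (
	"fmt"
	"sort"
)

// intervals holds the polling period in seconds for each polled category,
// taken from the table above.
var intervals = map[string]int{
	"postgresql":     30,
	"queries":        60,
	"tables":         300,
	"system":         30,
	"pgbouncer":      30,
	"infrastructure": 60,
}

// DueAt returns, in sorted order, the categories that should be collected
// when the given number of seconds has elapsed since startup.
func DueAt(elapsed int) []string {
	var due []string
	for name, every := range intervals {
		if elapsed > 0 && elapsed%every == 0 {
			due = append(due, name)
		}
	}
	sort.Strings(due)
	return due
}

func main() {
	for _, t := range []int{30, 60, 300} {
		fmt.Printf("t=%ds: %v\n", t, DueAt(t))
	}
}
```

Keeping the intervals in one map also makes it straightforward to apply the collection-interval config updates that the SaaS side can push to the agent.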
281+
282+
---
283+
284+
*Maintained by: Agent Team*
285+
*Last updated: February 2026*

bootstrap/BOOTSTRAP-GUIDE.md

Lines changed: 14 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -84,14 +84,18 @@
8484

8585
## Steps
8686

87-
| Step | Action | Duration |
88-
|------|--------|----------|
89-
| 1 | Console → Control Tower → Set up landing zone | 45 min |
90-
| 2 | Home region: eu-west-1 | Included |
91-
| 3 | Additional regions: eu-central-1 (DR) | Included |
92-
| 4 | Log Archive account created | Automatic |
93-
| 5 | Audit account created | Automatic |
94-
| 6 | IAM Identity Center enabled | Included |
87+
> **See [BOOTSTRAP-RUNBOOK](../../bootstrap/docs/BOOTSTRAP-RUNBOOK.md) for detailed instructions.**
88+
89+
| Step | Action | Description |
90+
|------|--------|-------------|
91+
| 1 | Choose setup preferences | Home region eu-west-1, Region deny enabled |
92+
| 2 | Create OUs | Security, Sandbox |
93+
| 3 | Configure Service integrations | Create Audit + Log Archive accounts |
94+
| 4 | Review and enable | Takes ~45 min to complete |
95+
96+
> ⚠️ **Important:** In Step 3, Config and CloudTrail require **different** accounts:
97+
> - AWS Config → **Audit** account
98+
> - CloudTrail → **Log Archive** account
9599
96100
## What Control Tower creates
97101

@@ -114,10 +118,9 @@
114118

115119
| Component | Module | Description |
116120
|-----------|--------|-------------|
117-
| **SSO** | `sso/` | Groups, Permission Sets |
118-
| **Custom Controls** | `control-tower/` | Additional guardrails via Terraform |
121+
| **SSO** | `sso/` | Groups, Permission Sets, Assignments |
122+
| **Custom Controls** | `control-tower/` | Additional controls via `aws_controltower_control` |
119123
| **Account Factory** | `account-factory/` | AFT module via GitHub Actions |
120-
| **Shared Services** | `core-accounts/` | ECR, Transit Gateway |
121124

122125
## SSO Groups
123126
