diff --git a/EntrepriseArchitecture.md b/EntrepriseArchitecture.md index eaf4dda..746a12c 100644 --- a/EntrepriseArchitecture.md +++ b/EntrepriseArchitecture.md @@ -1,213 +1,463 @@ -# ๐Ÿ—๏ธ **LOCAL-PLUS โ€” Architecture Dรฉfinitive** -## *Gift Card & Loyalty Platform* -### *Version 1.0 โ€” Janvier 2026* +# Kiven โ€” Architecture Overview +## *Managed Data Services, On Your Infrastructure* +### *Version 2.0 โ€” February 2026* --- -# ๐Ÿ“‹ **PARTIE I โ€” CONTEXTE & CONTRAINTES** - -## **1.1 Paramรจtres Business** - -| Paramรจtre | Valeur | Impact architectural | -|-----------|--------|---------------------| -| **RPO** | 1 heure | Backups horaires minimum, rรฉplication async acceptable | -| **RTO** | 15 minutes | Failover automatisรฉ, pas de procรฉdure manuelle | -| **TPS** | 500 transactions/sec | Pas de sharding nรฉcessaire, single Postgres suffit | -| **RPS** | 1500 requรชtes/sec | Load balancer + HPA standard | -| **Durรฉe de vie** | 5+ ans | Design pour รฉvolutivitรฉ, pas de shortcuts | -| **ร‰quipe on-call** | 5 personnes | Runbooks exhaustifs, alerting structurรฉ | - -## **1.2 Contraintes Compliance** - -| Standard | Exigences clรฉs | Impact | -|----------|---------------|--------| -| **GDPR** | Droit ร  l'oubli, consentement, data residency EU | Logs anonymisรฉs, data retention policies, EU region | -| **PCI-DSS** | Pas de stockage PAN, encryption at rest/transit, audit logs | mTLS, Vault pour secrets, audit trail immutable | -| **SOC2** | Contrรดle d'accรจs, monitoring, incident response | RBAC strict, observabilitรฉ complรจte, runbooks documentรฉs | - -## **1.3 Contraintes Techniques** - -| Contrainte | Choix | Rationale | -|------------|-------|-----------| -| **Cloud primaire** | AWS | Dรฉcision business | -| **Rรฉgion initiale** | eu-west-1 (Ireland) | GDPR, latence Europe | -| **Multi-rรฉgion** | Prรฉvu, pas immรฉdiat | Design pour, implรฉmente plus tard | -| **Database** | Aiven PostgreSQL | Managed, multicloud-ready, PCI compliant | -| **Messaging** | 
Aiven Kafka | Managed, multicloud-ready |
-| **Cache** | Aiven Valkey | Redis-compatible, managed |
-| **Edge/CDN** | Cloudflare | Free tier, WAF, DDoS, global CDN, multi-cloud ready |
-| **API Gateway / APIM** | À définir (Phase future) | Options : AWS API Gateway, Gravitee, Kong — décision ultérieure |
-| **DNS Public** | Cloudflare DNS | Authoritative, DNSSEC, global anycast |
-| **DNS Interne/Backup** | AWS Route53 | Private hosted zones, health checks, failover |
-| **Observabilité** | Self-hosted, coût minimal | Prometheus/Loki/Tempo + CloudWatch Logs (tier gratuit) |
+> **This document is the entry point for Kiven's architecture.**
+> It provides a high-level overview and links to detailed documentation.
 
 ---
 
-# 🏛️ **PARTIE II — ARCHITECTURE LOGIQUE**
+# PART I — EXECUTIVE SUMMARY
+
+## 1.1 What Is Kiven
+
+Kiven is a **fully managed data platform** that runs on the customer's own Kubernetes infrastructure. Starting with PostgreSQL (powered by CloudNativePG), Kiven delivers an Aiven-quality experience — but the data never leaves the customer's cluster.
+
+**How it works:**
+1. Customer signs up → grants Kiven access to their EKS (cross-account IAM Role)
+2. Kiven provisions everything: dedicated nodes, storage, S3 backups, CNPG operator, PostgreSQL
+3. Customer gets: a connection string + a dashboard
+4. Kiven manages everything from that point: scaling, backups, monitoring, security, tuning
+
+**The customer never touches kubectl, YAML, CNPG, or Kubernetes internals.**
+
+### Value Proposition
+
+| vs. Aiven | vs. Self-Managed CNPG | vs. 
Launchly |
+|-----------|----------------------|-------------|
+| Same UX, but on customer's infra | Same PostgreSQL, but fully managed | Same CNPG, but Aiven-level depth |
+| 40-60% cheaper (no Aiven markup) | No need for K8s/CNPG expertise | Full infra management (nodes, storage) |
+| Data never leaves customer's VPC | Operational risk reduced by built-in best practices | DBA intelligence built-in |
+
+## 1.2 Scope
+
+Kiven is designed for:
+- **Scalability**: Support 100+ customer clusters across multiple EKS environments
+- **Reliability**: RPO 1h, RTO 15min (Kiven SaaS); RPO 5min, RTO 5min (customer databases via CNPG)
+- **Compliance**: GDPR (EU data residency), SOC2 (audit, RBAC, encryption)
+- **Extensibility**: Provider/plugin architecture for a multi-operator future (Kafka, Redis, Elasticsearch)
+- **Lifespan**: 5+ years
+
+### Non-Goals (Phase 1)
+- Multi-cloud support (GKE, AKS) — Phase 3
+- Non-PostgreSQL data services (Kafka, Redis) — Phase 3
+- Self-hosted / air-gapped edition — Phase 3
+- Mobile app
+
+## 1.3 Key Parameters
+
+| Parameter | Value | Impact |
+|-----------|-------|--------|
+| **RPO (Kiven SaaS)** | 1 hour | Hourly backups of product database |
+| **RTO (Kiven SaaS)** | 15 minutes | Automated failover |
+| **RPO (Customer DBs)** | Configurable (1min–24h) | Continuous WAL archiving via Barman |
+| **RTO (Customer DBs)** | < 5 minutes | CNPG automatic failover, multi-AZ |
+| **Provisioning time** | < 10 minutes | From "Create Database" to connection string |
+| **Agent footprint** | < 50MB RAM, < 0.1 CPU | Minimal impact on customer cluster |
+| **On-call team** | 5 people | Runbooks for both SaaS and customer infra |
+
+## 1.4 Compliance Summary
+
+| Standard | Key Requirements | Scope |
+|----------|-----------------|-------|
+| **GDPR** | EU data residency, right to erasure, DPA | Kiven SaaS (eu-west-1) + customer data stays in their infra |
+| **SOC2** | RBAC, audit logging, encryption, incident response | Kiven SaaS operations + customer infra 
access audit trail |
+
+> Note: PCI-DSS is NOT in scope. Kiven does not process payment card data. Customer compliance (HIPAA, PCI, etc.) is easier to achieve because the data stays on their own infrastructure.
+
+## 1.5 Tech Stack Overview
+
+### Kiven SaaS Platform
+
+| Category | Choice | Rationale |
+|----------|--------|-----------|
+| **Cloud** | AWS (eu-west-1) | GDPR, proximity to EU customers |
+| **Orchestration** | EKS + Flux | GitOps, cloud-native |
+| **Backend** | Go (stdlib + chi) | K8s ecosystem is Go, fast, small binaries |
+| **Frontend** | Next.js 14+ (App Router) + Tailwind + shadcn/ui | Modern, fast, beautiful |
+| **Agent** | Go (client-go + controller-runtime) | Native K8s SDK, single binary |
+| **Agent Comms** | gRPC + mTLS | Secure, efficient, bidirectional streaming |
+| **Product DB** | PostgreSQL (Aiven) | Dogfooding the ecosystem, managed |
+| **Cache** | Valkey | Sessions, rate limiting, real-time state |
+| **Messaging** | Kafka (Aiven) | Agent events, audit trail, async operations |
+| **Edge/CDN** | Cloudflare | WAF, DDoS, Zero Trust, Tunnel |
+| **Observability** | Prometheus / Loki / Tempo | Self-hosted, cost-efficient |
+| **Secrets** | HashiCorp Vault | Dynamic secrets, rotation, IRSA |
+| **CNI** | Cilium | mTLS, Gateway API, network policies |
+| **Policies** | Kyverno | Admission control, pod security |
+| **Billing** | Stripe | SaaS billing, per-cluster pricing |
+| **CI/CD** | GitHub Actions | Already in place |
+| **IaC** | Terraform | Infrastructure as Code |
+
+### Customer-Side (Provisioned by Kiven)
+
+| Component | Technology | Managed By |
+|-----------|-----------|------------|
+| **Kubernetes nodes** | EKS Managed Node Groups | Kiven (via AWS API) |
+| **PostgreSQL** | CloudNativePG (CNPG) | Kiven (via agent) |
+| **Connection pooling** | PgBouncer (CNPG Pooler CRD) | Kiven (via agent) |
+| **Backups** | Barman → S3 | Kiven (via agent + AWS API) |
+| **Storage** | EBS gp3 (encrypted, KMS) | Kiven (via AWS API) |
+| **Backup storage** | S3 
bucket (encrypted, lifecycle) | Kiven (via AWS API) | +| **TLS** | cert-manager + self-signed CA | Kiven (via agent) | +| **Monitoring agent** | Kiven Agent (Go) | Kiven | -## **2.1 Vue d'ensemble** - -### **2.1.1 AWS Multi-Account Strategy (Control Tower)** +--- -``` -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ AWS CONTROL TOWER (Organization) โ”‚ -โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค -โ”‚ โ”‚ -โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ”‚ MANAGEMENT โ”‚ โ”‚ SECURITY โ”‚ โ”‚ LOG ARCHIVE โ”‚ โ”‚ -โ”‚ โ”‚ ACCOUNT โ”‚ โ”‚ ACCOUNT โ”‚ โ”‚ ACCOUNT โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Control Towerโ”‚ โ”‚ โ€ข GuardDuty โ”‚ โ”‚ โ€ข CloudTrail โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Organizationsโ”‚ โ”‚ โ€ข Security Hub โ”‚ โ”‚ โ€ข Config Logs โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข SCPs โ”‚ โ”‚ โ€ข IAM Identity โ”‚ โ”‚ โ€ข VPC Flow Logsโ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ โ”‚ Center โ”‚ โ”‚ โ”‚ โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ -โ”‚ โ”‚ -โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ”‚ WORKLOAD ACCOUNTS (OU: Workloads) โ”‚ โ”‚ -โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ 
โ”‚ -โ”‚ โ”‚ โ”‚ DEV Account โ”‚ โ”‚ STAGING โ”‚ โ”‚ PROD Accountโ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ Account โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ VPC + EKS โ”‚ โ”‚ VPC + EKS โ”‚ โ”‚ VPC + EKS โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ -โ”‚ โ”‚ -โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ”‚ SHARED SERVICES ACCOUNT (OU: Infrastructure) โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Transit Gateway Hub โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Centralized VPC Endpoints โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Container Registry (ECR) โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Artifact Storage (S3) โ”‚ โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ -โ”‚ โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ -``` +# PART II โ€” ARCHITECTURE -### **2.1.2 Architecture EKS par Environnement** +## 2.1 System Context (C4 Level 1) ``` -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ INTERNET โ”‚ -โ”‚ (End Users) โ”‚ 
-โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ USERS โ”‚ +โ”‚ Developers (Simple Mode) DevOps (Advanced Mode) โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ–ผ -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ CLOUDFLARE EDGE (Global) โ”‚ -โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค -โ”‚ โ€ข DNS (localplus.io) โ€ข WAF (OWASP rules) โ”‚ -โ”‚ โ€ข DDoS Protection (L3-L7) โ€ข SSL/TLS Termination โ”‚ -โ”‚ โ€ข CDN (static assets) โ€ข Bot Protection โ”‚ -โ”‚ โ€ข Cloudflare Tunnel โ€ข Zero Trust Access โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ CLOUDFLARE 
EDGE โ”‚ +โ”‚ (DNS, WAF, DDoS, CDN, Zero Trust) โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ - โ”‚ Cloudflare Tunnel (encrypted) โ–ผ -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ WORKLOAD ACCOUNT (PROD) โ€” eu-west-1 โ”‚ -โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค -โ”‚ โ”‚ -โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ”‚ VPC โ€” 10.0.0.0/16 โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ EKS CLUSTER โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ โ”‚ NODE POOL: platform (taints: platform=true:NoSchedule) โ”‚ โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ โ”‚ Instance: m6i.xlarge (dedicated resources) โ”‚ โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ โ”‚ โ”‚ -โ”‚ 
โ”‚ โ”‚ โ”‚ PLATFORM NAMESPACE โ”‚ โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ โ”‚ โ€ข ArgoCD (centralisรฉ) โ”‚ โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ โ”‚ โ€ข Cilium (CNI + Gateway API) โ”‚ โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ โ”‚ โ€ข Vault Agent Injector โ”‚ โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ โ”‚ โ€ข External-Secrets Operator โ”‚ โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ โ”‚ โ€ข Kyverno โ”‚ โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ โ”‚ โ€ข OTel Collector โ”‚ โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ โ”‚ โ€ข Prometheus + Loki + Tempo + Grafana โ”‚ โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ โ”‚ NODE POOL: application (default, auto-scaling) โ”‚ โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ โ”‚ Instance: m6i.large (cost-optimized) โ”‚ โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ โ”‚ APPLICATION NAMESPACES โ”‚ โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ โ”‚ โ€ข svc-ledger โ”‚ โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ โ”‚ โ€ข svc-wallet โ”‚ โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ โ”‚ โ€ข svc-merchant โ”‚ โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ โ”‚ โ€ข svc-giftcard โ”‚ โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ โ”‚ โ€ข svc-notification โ”‚ โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ 
-โ”‚ โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ VPC Peering / Transit Gateway โ”‚ โ”‚ -โ”‚ โ”‚ โ–ผ โ”‚ โ”‚ -โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ AIVEN VPC โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ โ€ข PostgreSQL (Primary + Read Replica) โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ โ€ข Kafka Cluster โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ -โ”‚ โ”‚ -โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ”‚ EXTERNAL SERVICES โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข AWS S3 (Terraform state, backups, artifacts) โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข AWS KMS (Encryption keys) โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข AWS Secrets Manager (bootstrap secrets only) โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข HashiCorp Vault (self-hosted on EKS โ€” runtime secrets) โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข AWS CloudWatch Logs (tier gratuit, fallback) โ”‚ โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ -โ”‚ โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ -``` - 
-### **2.1.3 Node Pool Strategy** - -|| Node Pool | Taints | Usage | Instance Type | Scaling | -||-----------|--------|-------|---------------|---------| -|| **platform** | `platform=true:NoSchedule` | ArgoCD, Monitoring, Security tools | m6i.xlarge | Fixed (2-3 nodes) | -|| **application** | None (default) | Domain services | m6i.large | HPA (2-10 nodes) | -|| **spot** (optionnel) | `spot=true:PreferNoSchedule` | Batch jobs, non-critical | m6i.large (spot) | Auto (0-5 nodes) | - -## **2.2 Domain Services** - -| Service | Responsabilitรฉ | Pattern | Criticitรฉ | -|---------|---------------|---------|-----------| -| **svc-ledger** | Earn/Burn transactions, ACID ledger | Sync REST + gRPC | P0 โ€” Core | -| **svc-wallet** | Balance queries, snapshots | Sync REST + gRPC | P0 โ€” Core | -| **svc-merchant** | Onboarding, configuration | Sync REST | P1 | -| **svc-giftcard** | Catalog, rewards | Sync REST | P1 | -| **svc-notification** | SMS/Email dispatch | Async (Kafka consumer) | P2 | - -## **2.3 Data Flow** - -``` -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” gRPC โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ svc-ledger โ”‚โ—„โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–บโ”‚ svc-wallet โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”˜ - โ”‚ โ”‚ - โ”‚ Outbox โ”‚ Read - โ–ผ โ–ผ -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ Kafka โ”‚ โ”‚ PostgreSQL โ”‚ -โ”‚ (Aiven) โ”‚ โ”‚ (Aiven) โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ - โ”‚ - โ”‚ Consume - โ–ผ -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ svc-notification โ”‚ -โ”‚ svc-analytics โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ -``` 
+โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ KIVEN SaaS PLATFORM โ”‚ +โ”‚ (AWS EKS โ€” eu-west-1) โ”‚ +โ”‚ โ”‚ +โ”‚ Dashboard + API + CLI + Terraform Provider โ”‚ +โ”‚ Core Services: provisioner, infra, clusters, backups, monitoring... โ”‚ +โ”‚ Provider/Plugin: CNPG Provider (Phase 1), Strimzi (future)... โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ”‚ โ”‚ + โ”‚ gRPC/mTLS (Agent) โ”‚ Cross-Account + โ”‚ โ”‚ IAM AssumeRole + โ–ผ โ–ผ +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ CUSTOMER'S AWS ACCOUNT / EKS โ”‚ +โ”‚ โ”‚ +โ”‚ โ”Œโ”€โ”€โ”€โ”€ Managed by Kiven โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ +โ”‚ โ”‚ Node Group: kiven-db-nodes (dedicated, tainted, multi-AZ) โ”‚ โ”‚ +โ”‚ โ”‚ Namespace: kiven-system (agent + CNPG operator) โ”‚ โ”‚ +โ”‚ โ”‚ Namespace: kiven-databases (PostgreSQL clusters) โ”‚ โ”‚ +โ”‚ โ”‚ S3 Bucket: kiven-backups-{customer-id} โ”‚ โ”‚ +โ”‚ โ”‚ IAM: IRSA roles for S3 access โ”‚ โ”‚ +โ”‚ โ”‚ CNPG: PostgreSQL Primary + Replicas + PgBouncer โ”‚ โ”‚ +โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ +โ”‚ โ”‚ +โ”‚ โ”Œโ”€โ”€โ”€โ”€ Managed by Customer 
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ +โ”‚ โ”‚ Their app nodes, services, workloads โ”‚ โ”‚ +โ”‚ โ”‚ Connect to: pg-main.kiven-databases.svc:5432 โ”‚ โ”‚ +โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +``` + +## 2.2 Container Diagram (C4 Level 2) โ€” Kiven SaaS + +``` +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ KIVEN SaaS โ€” AWS WORKLOAD ACCOUNT โ€” eu-west-1 โ”‚ +โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค +โ”‚ โ”‚ +โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ +โ”‚ โ”‚ EKS CLUSTER โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ PLATFORM NODE POOL (taints: platform=true:NoSchedule) โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ โ€ข Flux โ€ข Cilium โ€ข Vault Agent โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ โ€ข OTel Collector โ€ข Prometheus โ€ข Grafana โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ 
โ”‚ โ€ข Loki โ€ข Tempo โ€ข Kyverno โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ APPLICATION NODE POOL (auto-scaling) โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ โ”Œโ”€โ”€ Core โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ โ”‚ svc-api svc-auth svc-provisioner โ”‚ โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ โ”‚ svc-infra svc-clusters svc-agent-relay โ”‚ โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ โ”Œโ”€โ”€ Data Services โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ โ”‚ svc-backups svc-monitoring svc-users โ”‚ โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ โ”‚ svc-yamleditor svc-migrations โ”‚ โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ โ”Œโ”€โ”€ Business โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ โ”‚ svc-billing svc-audit svc-notificationโ”‚ โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ 
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ โ”‚ +โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ +โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ VPC Peering โ”‚ +โ”‚ โ–ผ โ”‚ +โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ +โ”‚ โ”‚ AIVEN VPC โ”‚ โ”‚ +โ”‚ โ”‚ โ€ข PostgreSQL (Kiven product database) โ”‚ โ”‚ +โ”‚ โ”‚ โ€ข Kafka (agent events, audit trail, async ops) โ”‚ โ”‚ +โ”‚ โ”‚ โ€ข Valkey (sessions, rate limiting, cache) โ”‚ โ”‚ +โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ +โ”‚ โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +``` + +## 2.3 Container Diagram (C4 Level 2) โ€” Customer Side + +``` +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ CUSTOMER'S EKS CLUSTER โ”‚ 
+โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค +โ”‚ โ”‚ +โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ +โ”‚ โ”‚ NODE GROUP: kiven-db-nodes (Managed by Kiven) โ”‚ โ”‚ +โ”‚ โ”‚ Instance: r6g.mediumโ€“r6g.2xlarge (memory-optimized) โ”‚ โ”‚ +โ”‚ โ”‚ Taint: kiven.io/role=database:NoSchedule โ”‚ โ”‚ +โ”‚ โ”‚ Multi-AZ: primary in AZ-a, replica in AZ-b โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ โ”Œโ”€โ”€ Namespace: kiven-system โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ Kiven Agent (Go) โ€” gRPC โ†’ Kiven SaaS โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ CNPG Operator โ€” manages PG clusters โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ cert-manager (optional) โ€” TLS certificates โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ โ”Œโ”€โ”€ Namespace: kiven-databases โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ CNPG Cluster: pg-production-main โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ โ”œโ”€ Pod: pg-production-main-1 (Primary, AZ-a) โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ โ”œโ”€ Pod: pg-production-main-2 (Replica, AZ-b) โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ โ”œโ”€ Pod: pg-production-main-3 (Replica, AZ-c) โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ โ”œโ”€ Service: pg-production-main-rw (read-write) โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ โ”œโ”€ Service: pg-production-main-ro (read-only) โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ โ””โ”€ Pooler: pg-production-main-pooler (PgBouncer) โ”‚ โ”‚ โ”‚ 
+โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ ScheduledBackup โ†’ S3: kiven-backups-{customer-id} โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ NetworkPolicy: only kiven-databases + customer-app-ns โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ โ”‚ +โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ +โ”‚ โ”‚ +โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ +โ”‚ โ”‚ NODE GROUP: customer-app-nodes (Managed by Customer) โ”‚ โ”‚ +โ”‚ โ”‚ โ€ข Customer's application pods โ”‚ โ”‚ +โ”‚ โ”‚ โ€ข Connect to: pg-production-main-pooler.kiven-databases.svc:5432 โ”‚ โ”‚ +โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ +โ”‚ โ”‚ +โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ +โ”‚ โ”‚ AWS Resources (Managed by Kiven via cross-account IAM) โ”‚ โ”‚ +โ”‚ โ”‚ โ€ข EBS gp3 volumes (encrypted, KMS) โ”‚ โ”‚ +โ”‚ โ”‚ โ€ข S3 bucket: kiven-backups-{customer-id} โ”‚ โ”‚ +โ”‚ โ”‚ โ€ข IAM IRSA role: kiven-cnpg-backup-role โ”‚ โ”‚ +โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ +โ”‚ โ”‚ 
+โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +``` + +## 2.4 Core Services + +### Service Catalog + +| Service | Responsibility | Language | Priority | +|---------|---------------|----------|----------| +| **svc-api** | REST + GraphQL gateway, request routing | Go | P0 | +| **svc-auth** | OIDC (Google/GitHub/SAML), RBAC, API keys, org/team model | Go | P0 | +| **svc-provisioner** | **THE BRAIN** โ€” Orchestrates full provisioning pipeline (nodes โ†’ storage โ†’ S3 โ†’ CNPG โ†’ PG) | Go | P0 | +| **svc-infra** | AWS resource management in customer accounts (EC2, EBS, S3, IAM, KMS) | Go | P0 | +| **svc-clusters** | Cluster lifecycle via provider interface (status, scale, upgrade, delete) | Go | P0 | +| **svc-backups** | Backup/restore management, PITR, fork/clone, backup verification | Go | P0 | +| **svc-monitoring** | Metrics ingestion from agents, DBA intelligence, alerts engine | Go | P0 | +| **svc-users** | Database user/role management, permissions, pg_hba rules | Go | P0 | +| **svc-agent-relay** | gRPC server, multiplexes all customer agent connections | Go | P0 | +| **svc-yamleditor** | YAML generation, schema validation, diff engine, change history | Go | P0 | +| **svc-migrations** | Import from Aiven/RDS/bare PG into Kiven-managed clusters | Go | P1 | +| **svc-billing** | Stripe integration, usage tracking, per-cluster pricing | Go | P1 | +| **svc-audit** | Immutable audit log of all operations on customer infra | Go | P1 | +| **svc-notification** | Alerts via Slack, email, webhook, PagerDuty | Go | P1 | +| **agent** | In-cluster binary โ€” CNPG controller, PG stats, command executor, log aggregator | Go | P0 | + +### Provider/Plugin Architecture + +The core engine is **operator-agnostic**. Each data service is a **provider** implementing a standard Go interface. 
Phase 1 ships the CNPG provider only. Future providers (Strimzi, Redis, ECK) plug in without rewriting core services. + +``` +Core Engine (operator-agnostic) + โ”œโ”€โ”€ svc-provisioner โ†’ calls provider.Provision() + โ”œโ”€โ”€ svc-clusters โ†’ calls provider.Scale(), provider.Status() + โ”œโ”€โ”€ svc-backups โ†’ calls provider.Backup(), provider.Restore() + โ”œโ”€โ”€ svc-monitoring โ†’ calls provider.CollectMetrics() + โ””โ”€โ”€ svc-users โ†’ calls provider.CreateUser() + โ”‚ + โ–ผ + Provider Interface (Go interface) + โ”‚ + โ”Œโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” + โ”‚ CNPG Provider (Phase 1 โ€” PG) โ”‚ + โ”‚ Strimzi Provider (Phase 3 โ€” Kafka) โ”‚ + โ”‚ Redis Provider (Phase 3 โ€” Redis) โ”‚ + โ”‚ ECK Provider (Phase 3 โ€” ES) โ”‚ + โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +``` + +## 2.5 Data Flow โ€” Provisioning + +``` +Customer clicks "Create Database" + โ”‚ + โ–ผ +โ”Œโ”€โ”€โ”€ svc-api โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€ svc-auth โ”€โ”€โ” +โ”‚ Validate req โ”‚โ”€โ”€โ”€โ”€โ–ถโ”‚ Check RBAC โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ”‚ + โ–ผ +โ”Œโ”€โ”€โ”€ svc-provisioner (THE BRAIN) โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ โ”‚ +โ”‚ 1. svc-infra โ†’ AssumeRole โ†’ Create node group (kiven-db-nodes) โ”‚ +โ”‚ 2. svc-infra โ†’ AssumeRole โ†’ Create StorageClass (gp3, encrypted) โ”‚ +โ”‚ 3. svc-infra โ†’ AssumeRole โ†’ Create S3 bucket (backups) โ”‚ +โ”‚ 4. svc-infra โ†’ AssumeRole โ†’ Create IRSA role (CNPG โ†’ S3) โ”‚ +โ”‚ 5. agent โ†’ Install CNPG operator (Helm) โ”‚ +โ”‚ 6. agent โ†’ Apply CNPG Cluster YAML (generated by svc-clusters) โ”‚ +โ”‚ 7. agent โ†’ Apply PgBouncer Pooler YAML โ”‚ +โ”‚ 8. agent โ†’ Apply ScheduledBackup YAML โ”‚ +โ”‚ 9. 
agent โ†’ Apply NetworkPolicy YAML โ”‚ +โ”‚ 10. agent โ†’ Wait for cluster healthy โ”‚ +โ”‚ 11. svc-users โ†’ Create initial database + user โ”‚ +โ”‚ 12. Return connection string to customer โ”‚ +โ”‚ โ”‚ +โ”‚ Status updates streamed via agent gRPC โ†’ svc-agent-relay โ”‚ +โ”‚ Dashboard shows real-time provisioning progress โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +``` + +## 2.6 Data Flow โ€” Steady State + +``` +โ”Œโ”€โ”€โ”€ Kiven Agent (in customer K8s) โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ โ”‚ +โ”‚ CNPG Controller โ”€โ”€โ”€โ”€ watches Cluster/Backup/Pooler CRDs โ”‚ +โ”‚ PG Stats Collector โ”€ pg_stat_statements, pg_stat_activity โ”‚ +โ”‚ Log Aggregator โ”€โ”€โ”€โ”€โ”€ PG logs from all pods โ”‚ +โ”‚ Infra Reporter โ”€โ”€โ”€โ”€โ”€ node status, EBS usage, pod health โ”‚ +โ”‚ โ”‚ +โ”‚ Every 30s: streams metrics + status to svc-agent-relay โ”‚ +โ”‚ On event: immediately reports (failover, backup done, error) โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ”‚ gRPC/mTLS (outbound only) + โ–ผ +โ”Œโ”€โ”€โ”€ svc-agent-relay โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ Multiplexes connections from all customer agents โ”‚ +โ”‚ Routes events to: svc-monitoring, svc-clusters, svc-audit โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ”‚ + โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” + โ–ผ โ–ผ 
โ–ผ + svc-monitoring svc-clusters svc-audit + (DBA intelligence, (status update) (immutable log) + alert engine) +``` + +## 2.7 Service Plans + +Each database is provisioned with a **plan** that determines compute, memory, storage, and HA configuration: + +| Plan | CPU | RAM | Storage | Instances | HA | Node Type | Use Case | +|------|-----|-----|---------|-----------|-----|-----------|----------| +| **Hobbyist** | 1 vCPU | 1 GB | 10 GB | 1 | No | t3.small | Testing, personal projects | +| **Startup** | 2 vCPU | 4 GB | 50 GB | 2 | Yes | r6g.medium | Small apps, dev/staging | +| **Business** | 4 vCPU | 16 GB | 100 GB | 3 | Yes | r6g.large | Production, medium traffic | +| **Premium** | 8 vCPU | 32 GB | 500 GB | 3 | Yes | r6g.xlarge | High-performance, analytics | +| **Custom** | User-defined | User-defined | User-defined | 1-5 | Configurable | Any | Specific requirements | + +Each plan includes: +- Pre-tuned `postgresql.conf` (shared_buffers, work_mem, etc. sized for the plan) +- Appropriate PgBouncer pool size and mode +- Right backup frequency and retention +- Resource limits and requests matching the node type + +Plans can be **upgraded or downgraded** at any time from the dashboard (triggers a rolling update via CNPG). + +## 2.8 Power Off / Power On + +Databases can be **paused** to eliminate compute costs while preserving data. This is a fundamental advantage of the "managed on your infra" model โ€” something Aiven cannot offer because they own the infrastructure. + +### Power Off (Pause) + +``` +Customer clicks "Power Off" + โ”‚ + โ”œโ”€ 1. svc-clusters โ†’ agent: Delete CNPG Cluster CR + โ”‚ PVC reclaim policy = RETAIN โ†’ EBS volumes preserved + โ”‚ + โ”œโ”€ 2. CNPG pods terminated, K8s services removed + โ”‚ EBS volumes detached but retained in AWS + โ”‚ + โ”œโ”€ 3. svc-infra โ†’ AWS API: Scale node group to 0 + โ”‚ No more EC2 cost + โ”‚ + โ””โ”€ 4. 
Dashboard: "Paused โ€” Data safe, no compute cost" + S3 backups and EBS volumes remain +``` + +### Power On (Resume) + +``` +Customer clicks "Resume" + โ”‚ + โ”œโ”€ 1. svc-infra โ†’ AWS API: Scale node group back up + โ”‚ Wait for nodes ready (~2-3 min) + โ”‚ + โ”œโ”€ 2. svc-clusters โ†’ agent: Apply CNPG Cluster CR + โ”‚ References existing PVCs (same EBS volume IDs) + โ”‚ + โ”œโ”€ 3. CNPG starts PostgreSQL with existing data + โ”‚ Primary elected, replicas sync (~1-2 min) + โ”‚ + โ””โ”€ 4. Dashboard: "Running โ€” Resumed" + Connection strings unchanged, total resume time ~3-5 min +``` + +### Scheduled Power Off/On + +Automate power schedules for non-production environments: +- Example: Mon-Fri 8am-6pm ON, nights and weekends OFF +- Savings: 60-70% on dev/staging compute costs +- Configured via dashboard, API, CLI, or Terraform + +### Cost Impact + +| Scenario | Always On | Scheduled (10h/day, weekdays) | Savings | +|----------|-----------|-------------------------------|---------| +| Startup plan (2ร—r6g.medium) | ~$180/mo | ~$55/mo | 70% | +| Business plan (3ร—r6g.large) | ~$450/mo | ~$140/mo | 69% | +| Paused (storage only) | โ€” | ~$10/mo | 94% | + +## 2.9 Two UX Modes + +### Simple Mode (Default) โ€” "Aiven Experience" + +For developers who just need a database. Forms, sliders, buttons. No YAML visible. +- Create database โ†’ pick plan โ†’ get connection string +- Manage users, backups, config via UI forms +- See metrics, alerts, logs in clean dashboards + +### Advanced Mode โ€” "Lens Experience" + +For DevOps/Platform engineers who want full control. Like Lens for Kubernetes. +- View the generated YAML for every resource (CNPG Cluster, Pooler, Backup, etc.) 
+- Edit YAML directly in Monaco editor (VS Code-like) with CNPG schema validation +- Diff view before applying changes +- Change history (git-like timeline of all YAML changes) +- Rollback to any previous YAML version +- Toggle between modes at any time --- -# ๐ŸŒฟ **PARTIE II.B โ€” GIT STRATEGY** +# PART III โ€” DELIVERY MODEL + +## 3.1 Git Strategy -## **Trunk-Based Development avec Cherry-Pick** +**Trunk-Based Development with Cherry-Pick** ``` main (trunk) @@ -224,1594 +474,435 @@ โ”‚ โ”‚ โ–ผ โ–ผ maintenance/v1.x.x maintenance/v2.x.x - (cherry-pick avec (cherry-pick avec + (cherry-pick with (cherry-pick with label: backport-v1) label: backport-v2) ``` -### **Rรจgles Git** +| Branch | Usage | Policy | +|--------|-------|--------| +| `main` | Main trunk | All PRs merge here | +| `maintenance/v*.x.x` | Version maintenance | Cherry-pick from main only | +| `feature/*` | Development | Short-lived, merge to main | -|| Branche | Usage | Politique | -||---------|-------|-----------| -|| `main` | Trunk principal | Tous les PRs mergent ici | -|| `maintenance/v1.x.x` | Maintenance version 1 | Cherry-pick depuis main uniquement | -|| `maintenance/v2.x.x` | Maintenance version 2 | Cherry-pick depuis main uniquement | -|| `feature/*` | Dรฉveloppement | Short-lived, merge to main | +## 3.2 GitOps Flow (Flux) -### **Workflow Cherry-Pick** +- **Centralized Flux**: Single instance managing all environments +- **Kustomization/HelmRelease pattern**: Git + Kustomize/Helm generators +- **Auto-reconcile**: Dev auto-reconcile, Staging/Prod manual approval -1. **Dรฉveloppeur** crรฉe un PR vers `main` -2. **Dรฉveloppeur** ajoute le label `backport-v1` si le fix doit aller dans v1.x.x -3. **CI** (aprรจs merge dans main) dรฉtecte le label et crรฉe automatiquement un PR cherry-pick vers `maintenance/v1.x.x` -4. **Reviewer** valide le cherry-pick PR +## 3.3 Environments -> **Principe :** Tout passe par `main` d'abord. 
Les branches de maintenance reรงoivent uniquement des cherry-picks validรฉs. +| Environment | Account | Cluster | Sync Policy | +|-------------|---------|---------|-------------| +| **dev** | kiven-dev | eks-dev | Auto-sync | +| **staging** | kiven-staging | eks-staging | Manual | +| **prod** | kiven-prod | eks-prod | Manual + Approval | ---- - -# ๐Ÿ—‚๏ธ **PARTIE III โ€” ORGANISATION DES REPOSITORIES** +## 3.4 CI/CD & Bootstrap -## **3.1 Structure Complรจte** - -``` -github.com/localplus/ - -โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ• -TIER 0 โ€” FOUNDATION (Platform Team ownership) -โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ• - -bootstrap/ -โ”œโ”€โ”€ layer-0/ -โ”‚ โ””โ”€โ”€ aws/ -โ”‚ โ””โ”€โ”€ README.md # Runbook: Create bootstrap IAM role -โ”œโ”€โ”€ layer-1/ -โ”‚ โ”œโ”€โ”€ foundation/ -โ”‚ โ”‚ โ”œโ”€โ”€ main.tf -โ”‚ โ”‚ โ”œโ”€โ”€ networking.tf # VPC, Subnets, NAT, VPC Peering Aiven -โ”‚ โ”‚ โ”œโ”€โ”€ eks.tf # EKS cluster -โ”‚ โ”‚ โ”œโ”€โ”€ iam.tf # IRSA, Workload Identity -โ”‚ โ”‚ โ”œโ”€โ”€ kms.tf # Encryption keys -โ”‚ โ”‚ โ””โ”€โ”€ outputs.tf -โ”‚ โ”œโ”€โ”€ tests/ -โ”‚ โ”‚ โ”œโ”€โ”€ unit/ # Terraform unit tests (terraform test) -โ”‚ โ”‚ โ”œโ”€โ”€ compliance/ # Checkov, tfsec, Regula -โ”‚ โ”‚ โ””โ”€โ”€ integration/ # Terratest -โ”‚ โ””โ”€โ”€ backend.tf # S3 native locking -โ””โ”€โ”€ docs/ - โ””โ”€โ”€ RUNBOOK-BOOTSTRAP.md - -โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ• -TIER 1 โ€” PLATFORM (Platform 
Team ownership) -โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ• - -platform-gitops/ -โ”œโ”€โ”€ argocd/ -โ”‚ โ”œโ”€โ”€ install/ # Helm values for ArgoCD -โ”‚ โ””โ”€โ”€ applicationsets/ -โ”‚ โ”œโ”€โ”€ platform.yaml # Sync platform-* repos -โ”‚ โ””โ”€โ”€ services.yaml # Sync svc-* repos (Git + Cluster generators) -โ”œโ”€โ”€ projects/ # ArgoCD Projects (RBAC) -โ””โ”€โ”€ README.md - -platform-networking/ -โ”œโ”€โ”€ cilium/ -โ”‚ โ”œโ”€โ”€ values.yaml # Cilium Helm config -โ”‚ โ””โ”€โ”€ policies/ # ClusterNetworkPolicies -โ”œโ”€โ”€ gateway-api/ -โ”‚ โ”œโ”€โ”€ gateway-class.yaml -โ”‚ โ”œโ”€โ”€ gateways/ -โ”‚ โ””โ”€โ”€ httproutes/ -โ””โ”€โ”€ README.md - -platform-observability/ -โ”œโ”€โ”€ otel-collector/ -โ”‚ โ”œโ”€โ”€ daemonset.yaml # Node-level collection -โ”‚ โ”œโ”€โ”€ deployment.yaml # Gateway collector -โ”‚ โ””โ”€โ”€ config/ -โ”‚ โ”œโ”€โ”€ receivers.yaml -โ”‚ โ”œโ”€โ”€ processors.yaml # Cardinality filtering, PII scrubbing -โ”‚ โ”œโ”€โ”€ exporters.yaml -โ”‚ โ””โ”€โ”€ sampling.yaml # Tail sampling config -โ”œโ”€โ”€ prometheus/ -โ”‚ โ”œโ”€โ”€ values.yaml -โ”‚ โ”œโ”€โ”€ rules/ # AlertRules, RecordingRules -โ”‚ โ””โ”€โ”€ serviceMonitors/ -โ”œโ”€โ”€ loki/ -โ”‚ โ”œโ”€โ”€ values.yaml -โ”‚ โ””โ”€โ”€ retention-policies.yaml # GDPR: 30 days max -โ”œโ”€โ”€ tempo/ -โ”‚ โ””โ”€โ”€ values.yaml -โ”œโ”€โ”€ pyroscope/ # Continuous Profiling (APM) -โ”‚ โ”œโ”€โ”€ values.yaml -โ”‚ โ””โ”€โ”€ scrape-configs.yaml -โ”œโ”€โ”€ sentry/ # Error Tracking (APM) -โ”‚ โ”œโ”€โ”€ values.yaml -โ”‚ โ”œโ”€โ”€ dsn-config.yaml -โ”‚ โ””โ”€โ”€ alert-rules.yaml -โ”œโ”€โ”€ grafana/ -โ”‚ โ”œโ”€โ”€ values.yaml -โ”‚ โ”œโ”€โ”€ dashboards/ -โ”‚ โ”‚ โ”œโ”€โ”€ platform/ -โ”‚ โ”‚ โ”œโ”€โ”€ services/ -โ”‚ โ”‚ โ””โ”€โ”€ apm/ # APM-specific dashboards -โ”‚ โ”‚ โ”œโ”€โ”€ service-overview.json -โ”‚ โ”‚ โ”œโ”€โ”€ dependency-map.json -โ”‚ โ”‚ 
โ”œโ”€โ”€ database-performance.json -โ”‚ โ”‚ โ””โ”€โ”€ profiling-flamegraphs.json -โ”‚ โ””โ”€โ”€ datasources/ -โ””โ”€โ”€ README.md - -platform-cache/ -โ”œโ”€โ”€ valkey/ -โ”‚ โ”œโ”€โ”€ values.yaml # Helm config for Valkey -โ”‚ โ”œโ”€โ”€ cluster-config.yaml -โ”‚ โ””โ”€โ”€ monitoring/ -โ”‚ โ”œโ”€โ”€ servicemonitor.yaml -โ”‚ โ””โ”€โ”€ alerts.yaml -โ”œโ”€โ”€ sdk/ -โ”‚ โ”œโ”€โ”€ python/ # Cache SDK helpers -โ”‚ โ”‚ โ”œโ”€โ”€ cache_client.py -โ”‚ โ”‚ โ””โ”€โ”€ patterns.py # Cache-aside, write-through -โ”‚ โ””โ”€โ”€ go/ -โ”‚ โ””โ”€โ”€ cache/ -โ””โ”€โ”€ README.md - -platform-gateway/ -โ”œโ”€โ”€ apisix/ -โ”‚ โ”œโ”€โ”€ values.yaml # APISIX Helm config -โ”‚ โ”œโ”€โ”€ routes/ -โ”‚ โ”‚ โ”œโ”€โ”€ v1/ # API v1 routes -โ”‚ โ”‚ โ””โ”€โ”€ v2/ # API v2 routes (future) -โ”‚ โ”œโ”€โ”€ plugins/ -โ”‚ โ”‚ โ”œโ”€โ”€ jwt-config.yaml -โ”‚ โ”‚ โ”œโ”€โ”€ rate-limit-config.yaml -โ”‚ โ”‚ โ””โ”€โ”€ cors-config.yaml -โ”‚ โ””โ”€โ”€ consumers/ # API consumers (partners, services) -โ”œโ”€โ”€ cloudflare/ -โ”‚ โ”œโ”€โ”€ terraform/ -โ”‚ โ”‚ โ”œโ”€โ”€ main.tf -โ”‚ โ”‚ โ”œโ”€โ”€ dns.tf -โ”‚ โ”‚ โ”œโ”€โ”€ tunnel.tf -โ”‚ โ”‚ โ”œโ”€โ”€ waf.tf -โ”‚ โ”‚ โ””โ”€โ”€ access.tf -โ”‚ โ””โ”€โ”€ policies/ -โ”‚ โ”œโ”€โ”€ waf-rules.yaml -โ”‚ โ””โ”€โ”€ access-policies.yaml -โ”œโ”€โ”€ cloudflared/ -โ”‚ โ”œโ”€โ”€ deployment.yaml # Tunnel daemon -โ”‚ โ””โ”€โ”€ config.yaml -โ””โ”€โ”€ README.md - -platform-security/ -โ”œโ”€โ”€ vault/ -โ”‚ โ”œโ”€โ”€ policies/ # Per-service policies -โ”‚ โ”œโ”€โ”€ auth-methods/ # Kubernetes auth -โ”‚ โ””โ”€โ”€ secret-engines/ -โ”œโ”€โ”€ external-secrets/ -โ”‚ โ”œโ”€โ”€ operator/ -โ”‚ โ””โ”€โ”€ cluster-secret-stores/ -โ”œโ”€โ”€ kyverno/ -โ”‚ โ”œโ”€โ”€ cluster-policies/ -โ”‚ โ”‚ โ”œโ”€โ”€ require-labels.yaml -โ”‚ โ”‚ โ”œโ”€โ”€ require-probes.yaml -โ”‚ โ”‚ โ”œโ”€โ”€ require-resource-limits.yaml -โ”‚ โ”‚ โ”œโ”€โ”€ restrict-privileged.yaml -โ”‚ โ”‚ โ”œโ”€โ”€ require-image-signature.yaml # Supply chain -โ”‚ โ”‚ โ””โ”€โ”€ mutate-default-sa.yaml -โ”‚ โ””โ”€โ”€ policy-reports/ -โ”œโ”€โ”€ supply-chain/ 
-โ”‚ โ”œโ”€โ”€ cosign/ # Image signing config -โ”‚ โ””โ”€โ”€ sbom/ # Syft config -โ”œโ”€โ”€ audit/ -โ”‚ โ””โ”€โ”€ audit-policy.yaml # K8s audit logging -โ””โ”€โ”€ README.md - -โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ• -TIER 2 โ€” CONTRACTS (Shared ownership) -โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ• - -contracts-proto/ -โ”œโ”€โ”€ buf.yaml -โ”œโ”€โ”€ buf.gen.yaml -โ”œโ”€โ”€ localplus/ -โ”‚ โ”œโ”€โ”€ ledger/v1/ -โ”‚ โ”‚ โ”œโ”€โ”€ ledger.proto -โ”‚ โ”‚ โ””โ”€โ”€ ledger_service.proto -โ”‚ โ”œโ”€โ”€ wallet/v1/ -โ”‚ โ”‚ โ”œโ”€โ”€ wallet.proto -โ”‚ โ”‚ โ””โ”€โ”€ wallet_service.proto -โ”‚ โ””โ”€โ”€ common/v1/ -โ”‚ โ”œโ”€โ”€ money.proto -โ”‚ โ””โ”€โ”€ pagination.proto -โ””โ”€โ”€ README.md - -sdk-python/ -โ”œโ”€โ”€ localplus/ -โ”‚ โ”œโ”€โ”€ clients/ # Generated gRPC clients -โ”‚ โ”œโ”€โ”€ telemetry/ # OTel instrumentation helpers -โ”‚ โ”œโ”€โ”€ testing/ # Fixtures, factories -โ”‚ โ””โ”€โ”€ security/ # Vault client wrapper -โ”œโ”€โ”€ pyproject.toml -โ””โ”€โ”€ README.md - -sdk-go/ -โ”œโ”€โ”€ clients/ -โ”œโ”€โ”€ telemetry/ -โ””โ”€โ”€ go.mod - -โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ• -TIER 3 โ€” DOMAIN SERVICES (Product Team ownership) -โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ• - -svc-ledger/ # TON LOCAL-PLUS ACTUEL 
-โ”œโ”€โ”€ src/ -โ”‚ โ””โ”€โ”€ app/ -โ”‚ โ”œโ”€โ”€ api/ -โ”‚ โ”œโ”€โ”€ domain/ -โ”‚ โ”œโ”€โ”€ infrastructure/ -โ”‚ โ””โ”€โ”€ main.py -โ”œโ”€โ”€ tests/ -โ”‚ โ”œโ”€โ”€ unit/ # pytest, mocks -โ”‚ โ”œโ”€โ”€ integration/ # testcontainers -โ”‚ โ”œโ”€โ”€ contract/ # pact / grpc-testing -โ”‚ โ””โ”€โ”€ conftest.py -โ”œโ”€โ”€ perf/ -โ”‚ โ”œโ”€โ”€ k6/ -โ”‚ โ”‚ โ”œโ”€โ”€ smoke.js -โ”‚ โ”‚ โ”œโ”€โ”€ load.js -โ”‚ โ”‚ โ””โ”€โ”€ stress.js -โ”‚ โ””โ”€โ”€ scenarios/ -โ”œโ”€โ”€ k8s/ -โ”‚ โ”œโ”€โ”€ base/ -โ”‚ โ”‚ โ”œโ”€โ”€ deployment.yaml -โ”‚ โ”‚ โ”œโ”€โ”€ service.yaml -โ”‚ โ”‚ โ”œโ”€โ”€ configmap.yaml -โ”‚ โ”‚ โ”œโ”€โ”€ hpa.yaml -โ”‚ โ”‚ โ”œโ”€โ”€ pdb.yaml -โ”‚ โ”‚ โ””โ”€โ”€ kustomization.yaml -โ”‚ โ””โ”€โ”€ overlays/ -โ”‚ โ”œโ”€โ”€ dev/ -โ”‚ โ”œโ”€โ”€ staging/ -โ”‚ โ””โ”€โ”€ prod/ -โ”œโ”€โ”€ migrations/ # Alembic -โ”œโ”€โ”€ Dockerfile -โ”œโ”€โ”€ Taskfile.yml -โ””โ”€โ”€ README.md - -svc-wallet/ # Mรชme structure -svc-merchant/ # Mรชme structure -svc-giftcard/ # Mรชme structure -svc-notification/ # Mรชme structure (+ Kafka consumer) - -โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ• -TIER 4 โ€” QUALITY ENGINEERING (Shared ownership) -โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ• - -e2e-scenarios/ -โ”œโ”€โ”€ scenarios/ -โ”‚ โ”œโ”€โ”€ earn-burn-flow.spec.ts -โ”‚ โ”œโ”€โ”€ merchant-onboarding.spec.ts -โ”‚ โ””โ”€โ”€ giftcard-purchase.spec.ts -โ”œโ”€โ”€ fixtures/ -โ”œโ”€โ”€ playwright.config.ts -โ””โ”€โ”€ README.md - -chaos-experiments/ -โ”œโ”€โ”€ litmus/ -โ”‚ โ””โ”€โ”€ chaosengine/ -โ”œโ”€โ”€ experiments/ -โ”‚ โ”œโ”€โ”€ pod-kill/ -โ”‚ โ”œโ”€โ”€ network-partition/ -โ”‚ โ”œโ”€โ”€ db-latency/ -โ”‚ โ””โ”€โ”€ 
kafka-broker-kill/ -โ””โ”€โ”€ README.md - -โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ• -TIER 5 โ€” DOCUMENTATION (Shared ownership) -โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ• - -docs/ -โ”œโ”€โ”€ adr/ # Architecture Decision Records -โ”‚ โ”œโ”€โ”€ 001-modular-monolith-first.md -โ”‚ โ”œโ”€โ”€ 002-aiven-managed-data.md -โ”‚ โ”œโ”€โ”€ 003-cilium-over-calico.md -โ”‚ โ””โ”€โ”€ ... -โ”œโ”€โ”€ runbooks/ -โ”‚ โ”œโ”€โ”€ incident-response.md -โ”‚ โ”œโ”€โ”€ database-failover.md -โ”‚ โ”œโ”€โ”€ kafka-recovery.md -โ”‚ โ””โ”€โ”€ secret-rotation.md -โ”œโ”€โ”€ platform-contracts/ -โ”‚ โ”œโ”€โ”€ deployment-sla.md -โ”‚ โ”œโ”€โ”€ observability-requirements.md -โ”‚ โ””โ”€โ”€ security-baseline.md -โ”œโ”€โ”€ compliance/ -โ”‚ โ”œโ”€โ”€ gdpr/ -โ”‚ โ”‚ โ”œโ”€โ”€ data-retention-policy.md -โ”‚ โ”‚ โ”œโ”€โ”€ right-to-erasure.md -โ”‚ โ”‚ โ””โ”€โ”€ consent-management.md -โ”‚ โ”œโ”€โ”€ pci-dss/ -โ”‚ โ”‚ โ”œโ”€โ”€ cardholder-data-flow.md -โ”‚ โ”‚ โ””โ”€โ”€ encryption-requirements.md -โ”‚ โ””โ”€โ”€ soc2/ -โ”‚ โ”œโ”€โ”€ access-control-policy.md -โ”‚ โ””โ”€โ”€ incident-response-policy.md -โ”œโ”€โ”€ threat-models/ -โ”‚ โ”œโ”€โ”€ svc-ledger-stride.md -โ”‚ โ””โ”€โ”€ platform-attack-surface.md -โ””โ”€โ”€ onboarding/ - โ”œโ”€โ”€ new-developer.md - โ””โ”€โ”€ new-service-checklist.md -``` +> Detailed documentation: [bootstrap/BOOTSTRAP-GUIDE.md](bootstrap/BOOTSTRAP-GUIDE.md) --- -# ๐Ÿฅš๐Ÿ” **PARTIE IV โ€” BOOTSTRAP STRATEGY** - -## **4.1 Layer 0 โ€” Manual Bootstrap (1x per AWS account)** - -| Action | Commande/Outil | Output | -|--------|---------------|--------| -| Crรฉer IAM Role pour Terraform CI/CD | AWS CLI | 
`arn:aws:iam::xxx:role/TerraformCI` | -| Configurer OIDC pour GitHub Actions | AWS Console/CLI | GitHub peut assumer le role | - -**C'est TOUT. Le S3 backend est auto-crรฉรฉ par Terraform 1.10+** - -## **4.1.1 GitHub Actions โ€” Reusable & Composite Workflows** - -> **Note :** Utiliser des **reusable workflows** et **composite actions** pour standardiser les pipelines CI/CD. - -- **Reusable workflows** : `.github/workflows/` partagรฉs entre repos (build, test, deploy) -- **Composite actions** : `.github/actions/` pour encapsuler des steps communs (setup-python, terraform-plan, etc.) - -## **4.2 Layer 1 โ€” Foundation (Terraform)** - -| Ordre | Ressource | Dรฉpendances | -|-------|-----------|-------------| -| 1 | VPC + Subnets | Aucune | -| 2 | KMS Keys | Aucune | -| 3 | EKS Cluster | VPC, KMS | -| 4 | IRSA (IAM Roles for Service Accounts) | EKS | -| 5 | VPC Peering avec Aiven | VPC, Aiven crรฉรฉ manuellement d'abord | -| 6 | Outputs โ†’ Platform repos | Tous | - -## **4.3 Layer 2 โ€” Platform Bootstrap** - -| Ordre | Action | Dรฉpendance | -|-------|--------|------------| -| 1 | Install ArgoCD via Helm (1x) | EKS ready | -| 2 | Apply App-of-Apps ApplicationSet | ArgoCD running | -| 3 | ArgoCD syncs platform-* repos | Reconciliation automatique | - -**ArgoCD : Instance centralisรฉe unique** (comme demandรฉ) - -## **4.4 Layer 3+ โ€” Application Services** - -ArgoCD ApplicationSets avec **Git Generator + Matrix Generator** dรฉcouvrent automatiquement les services. 
+# PART IV — REPOSITORY & OWNERSHIP MODEL
+
+## 4.1 Repository Tiers
+
+| Tier | Repos | Description | Owner |
+|------|-------|-------------|-------|
+| **T0 — Foundation** | `bootstrap/` | AWS Landing Zone, Account Factory | Platform Team |
+| **T1 — Platform** | `platform-*` | GitOps, Networking, Security, Observability | Platform Team |
+| **T2 — Contracts** | `contracts-proto`, `sdk-*` | gRPC APIs, Go SDK, CLI | Platform + Backend |
+| **T3 — Core Services** | `svc-*` | Kiven backend services | Backend Team |
+| **T4 — Agent** | `kiven-agent/` | Customer-deployed agent | Agent Team |
+| **T5 — Frontend** | `dashboard/` | Next.js dashboard (Simple + Advanced modes) | Frontend Team |
+| **T6 — Providers** | `provider-*` | CNPG provider, Strimzi provider (future) | Backend Team |
+| **T7 — Quality** | `e2e-scenarios`, `chaos-*` | Tests, chaos engineering | QA + Platform |
+| **T8 — Documentation** | `docs/` | Centralized documentation | All Teams |
+
+## 4.2 Ownership Matrix
+
+| Tier | Owner Team | Approvers | Change Process |
+|------|------------|-----------|----------------|
+| **T0 — Foundation** | Platform | Platform Lead + Security | ADR + RFC required |
+| **T1 — Platform** | Platform | Platform Team (2 reviewers) | ADR if breaking change |
+| **T2 — Contracts** | Platform + Backend | Tech Lead | Buf breaking detection |
+| **T3 — Core Services** | Backend | Team Lead | Standard PR review |
+| **T4 — Agent** | Agent / Backend | Agent Lead + Security | Security review required |
+| **T5 — Frontend** | Frontend | Frontend Lead | Standard PR review |
+| **T6 — Providers** | Backend | Tech Lead | Provider interface compliance |
+| **T7 — Quality** | QA + Platform | QA Lead | Standard PR review |
+| **T8 — Documentation** | All | Tech Lead | Standard PR review |
+
+## 4.3 Repository Index
+
+### Tier 0 — Foundation
+
+| Repo | Description |
+|------|-------------|
+| `bootstrap/` | AWS Landing Zone, Account Factory, SCPs, SSO 
| + +### Tier 1 โ€” Platform + +| Repo | Description | +|------|-------------| +| `platform-gitops/` | Flux, Kustomizations, HelmReleases | +| `platform-networking/` | Cilium, Gateway API | +| `platform-observability/` | OTel, Prometheus, Loki, Tempo, Grafana | +| `platform-security/` | Vault, External-Secrets, Kyverno | + +### Tier 2 โ€” Contracts + +| Repo | Description | +|------|-------------| +| `contracts-proto/` | Protobuf definitions (agent โ†” SaaS, inter-service) | +| `sdk-go/` | Go SDK for Kiven API | +| `kiven-cli/` | CLI tool (`kiven clusters list`, `kiven backup trigger`) | +| `terraform-provider-kiven/` | Terraform provider for Kiven | + +### Tier 3 โ€” Core Services + +| Repo | Description | +|------|-------------| +| `svc-api/` | REST + GraphQL gateway | +| `svc-auth/` | Authentication, RBAC, API keys | +| `svc-provisioner/` | Provisioning orchestrator (THE BRAIN) | +| `svc-infra/` | AWS resource management in customer accounts | +| `svc-clusters/` | Cluster lifecycle (CNPG management) | +| `svc-backups/` | Backup/restore, PITR, fork/clone | +| `svc-monitoring/` | Metrics, DBA intelligence, alerts | +| `svc-users/` | Database user/role management | +| `svc-agent-relay/` | gRPC server for agent connections | +| `svc-yamleditor/` | YAML generation, validation, diff, history | +| `svc-migrations/` | Import from Aiven/RDS/bare PG | +| `svc-billing/` | Stripe billing | +| `svc-audit/` | Immutable audit log | +| `svc-notification/` | Alerts (Slack, email, webhook, PagerDuty) | + +### Tier 4 โ€” Agent + +| Repo | Description | +|------|-------------| +| `kiven-agent/` | In-cluster agent (CNPG controller, PG stats, command executor) | +| `kiven-agent-helm/` | Helm chart for agent deployment | + +### Tier 5 โ€” Frontend + +| Repo | Description | +|------|-------------| +| `dashboard/` | Next.js dashboard (Simple + Advanced mode) | + +### Tier 6 โ€” Providers + +| Repo | Description | +|------|-------------| +| `provider-cnpg/` | CloudNativePG provider 
(Phase 1) | +| `provider-strimzi/` | Strimzi/Kafka provider (Phase 3 โ€” future) | +| `provider-redis/` | Redis Operator provider (Phase 3 โ€” future) | + +### Tier 7 โ€” Quality + +| Repo | Description | +|------|-------------| +| `e2e-scenarios/` | End-to-end tests (provisioning, backup, failover) | +| `chaos-experiments/` | Chaos Mesh experiments (node failure, network partition) | --- -# ๐Ÿงช **PARTIE V โ€” TESTING STRATEGY COMPLรˆTE** - -## **5.1 Terraform Testing** - -| Type | Outil | Quand | Bloquant | -|------|-------|-------|----------| -| **Format/Lint** | `terraform fmt`, `tflint` | Pre-commit | Oui | -| **Security scan** | `tfsec`, `checkov` | PR | Oui | -| **Compliance** | `regula`, `opa conftest`, [terraform-compliance](https://terraform-compliance.com/) | PR | Oui | -| **Policy as Code** | HashiCorp Sentinel | PR | Oui | -| **Unit tests** | `terraform test` (native 1.6+) | PR | Oui | -| **Integration** | `terratest` | Nightly | Non | -| **Drift detection** | `terraform plan` scheduled | Daily | Alerte | - -## **5.2 Application Testing** - -| Type | Localisation | Outil | Trigger | Bloquant | -|------|--------------|-------|---------|----------| -| **Unit** | `svc-*/tests/unit/` | pytest | Pre-commit, PR | Oui | -| **Integration** | `svc-*/tests/integration/` | pytest + testcontainers | PR | Oui | -| **Contract** | `svc-*/tests/contract/` | pact, grpc-testing | PR | Oui | -| **Performance** | `svc-*/perf/` | k6 | Nightly, Pre-release | Non | -| **E2E** | `e2e-scenarios/` | Playwright | Post-merge staging | Oui pour prod | -| **Chaos** | `chaos-experiments/` | Litmus | Weekly | Non | - -## **5.3 TNR (Tests de Non-Rรฉgression)** - -| Catรฉgorie | Contenu | Frรฉquence | -|-----------|---------|-----------| -| **Critical Paths** | Earn โ†’ Balance Update โ†’ Notification | Nightly | -| **Golden Master** | Snapshot des rรฉponses API | Nightly | -| **Compliance** | GDPR data retention, PCI encryption checks | Nightly | -| **Security** | Kyverno policy 
audit, image signature verification | Nightly | - -## **5.4 Compliance Testing** - -| Standard | Test | Outil | -|----------|------|-------| -| **GDPR** | PII not in logs | OTel Collector scrubbing + log audit | -| **GDPR** | Data retention < 30 days | Loki retention policy check | -| **PCI-DSS** | mTLS enforced | Cilium policy audit | -| **PCI-DSS** | Encryption at rest | AWS KMS audit | -| **SOC2** | Audit logs present | CloudTrail + K8s audit logs check | -| **SOC2** | Access control | Kyverno policy reports | +# PART V โ€” PLATFORM BASELINES ---- +## 5.1 Security Baseline -# ๐Ÿ” **PARTIE VI โ€” SECURITY ARCHITECTURE** +**Defense in Depth**: 7 layers of security -## **6.1 Defense in Depth** +| Layer | Component | Protection | +|-------|-----------|------------| +| **Edge** | Cloudflare | WAF, DDoS, Bot protection | +| **Gateway** | Cilium Gateway API | TLS termination, routing | +| **Network** | Cilium | NetworkPolicies, default deny | +| **Identity** | IRSA + Vault | Dynamic secrets, mTLS, OIDC | +| **Workload** | Kyverno | Pod security, image signing | +| **Data** | KMS + EBS encryption | Encryption at rest/transit | +| **Customer Access** | Cross-account IAM + Audit | Least privilege, CloudTrail, revocable | -``` -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ LAYER 0: EDGE (Cloudflare) โ”‚ -โ”‚ โ€ข Cloudflare WAF (OWASP Core Ruleset, custom rules) โ”‚ -โ”‚ โ€ข Cloudflare DDoS Protection (L3/L4/L7, unlimited) โ”‚ -โ”‚ โ€ข Bot Management (JS challenge, CAPTCHA) โ”‚ -โ”‚ โ€ข TLS 1.3 termination, HSTS enforced โ”‚ -โ”‚ โ€ข Cloudflare Tunnel (no public origin IP) โ”‚ 
-โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ - โ”‚ - โ–ผ -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ LAYER 1: API GATEWAY (APISIX) โ”‚ -โ”‚ โ€ข JWT/API Key validation โ”‚ -โ”‚ โ€ข Rate limiting (fine-grained, per user/tenant) โ”‚ -โ”‚ โ€ข Request validation (JSON Schema) โ”‚ -โ”‚ โ€ข Circuit breaker โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ - โ”‚ - โ–ผ -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ LAYER 2: NETWORK โ”‚ -โ”‚ โ€ข VPC isolation (private subnets only for workloads) โ”‚ -โ”‚ โ€ข Cilium NetworkPolicies (default deny, explicit allow) โ”‚ -โ”‚ โ€ข VPC Peering Aiven (no public internet for DB/Kafka) โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ - โ”‚ - โ–ผ -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ LAYER 3: IDENTITY & ACCESS โ”‚ -โ”‚ โ€ข IRSA (IAM Roles for Service Accounts) โ€” no static 
credentials โ”‚ -โ”‚ โ€ข Cilium mTLS (WireGuard) โ€” pod-to-pod encryption โ”‚ -โ”‚ โ€ข Vault dynamic secrets โ€” DB credentials rotated โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ - โ”‚ - โ–ผ -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ LAYER 4: WORKLOAD โ”‚ -โ”‚ โ€ข Kyverno policies (no privileged, resource limits, probes required) โ”‚ -โ”‚ โ€ข Image signature verification (Cosign) โ”‚ -โ”‚ โ€ข Read-only root filesystem โ”‚ -โ”‚ โ€ข Non-root containers โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ - โ”‚ - โ–ผ -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ LAYER 5: DATA โ”‚ -โ”‚ โ€ข Encryption at rest (AWS KMS, Aiven native) โ”‚ -โ”‚ โ€ข Encryption in transit (mTLS) โ”‚ -โ”‚ โ€ข PII scrubbing in logs (OTel processor) โ”‚ -โ”‚ โ€ข Audit trail immutable (CloudTrail, K8s audit logs) โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ -``` - -## **6.2 Rรฉponse ร  : "Commence simple, mais la dette technique ?"** - -**Le paradoxe :** Tu veux commencer simple mais avec GDPR/PCI-DSS/SOC2, tu ne peux PAS 
ignorer la sรฉcuritรฉ. - -**La solution : Security Baseline dรจs Day 1, รฉvolution par phases** - -| Phase | Ce qui est en place | Ce qui vient aprรจs | -|-------|---------------------|-------------------| -| **Day 1** | Cilium mTLS (zero config), Kyverno basic policies, Vault pour secrets | - | -| **Month 3** | Image signing (Cosign), SBOM generation | - | -| **Month 6** | SPIRE (si multi-cluster), Confidential Computing รฉvaluation | - | - -**Pas de dette technique SI :** -- mTLS dรจs le dรฉbut (Cilium = zero effort) -- Secrets dans Vault dรจs le dรฉbut (pas de migration douloureuse) -- Policies Kyverno dรจs le dรฉbut (culture sรฉcuritรฉ) - -**La vraie dette technique serait :** -- Commencer sans mTLS โ†’ Migration massive plus tard -- Secrets en ConfigMaps โ†’ Rotation impossible -- Pas d'audit logs โ†’ Compliance failure - ---- - -# ๐Ÿ“Š **PARTIE VII โ€” OBSERVABILITY ARCHITECTURE** - -## **7.1 Stack Self-Hosted (Coรปt Minimal)** +> Detailed documentation: [security/SECURITY-ARCHITECTURE.md](security/SECURITY-ARCHITECTURE.md) -| Composant | Outil | Coรปt | Retention | -|-----------|-------|------|-----------| -| **Metrics** | Prometheus | 0โ‚ฌ (self-hosted) | 15 jours local | -| **Metrics long-term** | Thanos Sidecar โ†’ S3 | ~5โ‚ฌ/mois S3 | 1 an | -| **Logs** | Loki | 0โ‚ฌ (self-hosted) | 30 jours (GDPR) | -| **Traces** | Tempo | 0โ‚ฌ (self-hosted) | 7 jours | -| **Dashboards** | Grafana | 0โ‚ฌ (self-hosted) | N/A | -| **Fallback logs** | CloudWatch Logs | Tier gratuit 5GB | 7 jours | +## 5.2 Observability Baseline -**Coรปt estimรฉ : < 50โ‚ฌ/mois** (principalement S3 pour Thanos) +| Signal | Tool | Retention | Cost | +|--------|------|-----------|------| +| **Metrics** | Prometheus + Remote Write S3 | 15d local, 1y S3 | ~5 EUR/mo | +| **Logs** | Loki | 30 days (GDPR) | Self-hosted | +| **Traces** | Tempo | 7 days | Self-hosted | +| **Profiling** | Pyroscope | 7 days | Self-hosted | +| **Errors** | Sentry (self-hosted) | 30 days | Self-hosted | -## **7.2 
Telemetry Pipeline** +> Detailed documentation: [observability/OBSERVABILITY-GUIDE.md](observability/OBSERVABILITY-GUIDE.md) -``` -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ Applications โ”‚ โ”‚ OTel Collector โ”‚ โ”‚ Backends โ”‚ -โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ -โ”‚ โ€ข SDK Python โ”‚โ”€โ”€โ”€โ”€โ–บโ”‚ โ€ข Receivers โ”‚โ”€โ”€โ”€โ”€โ–บโ”‚ โ€ข Prometheus โ”‚ -โ”‚ โ€ข Auto-instr โ”‚ โ”‚ โ€ข Processors โ”‚ โ”‚ โ€ข Loki โ”‚ -โ”‚ โ”‚ โ”‚ โ€ข Exporters โ”‚ โ”‚ โ€ข Tempo โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ - โ”‚ - โ”‚ Scrubbing - โ–ผ - โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” - โ”‚ GDPR Compliant โ”‚ - โ”‚ โ€ข No user_id โ”‚ - โ”‚ โ€ข No PII โ”‚ - โ”‚ โ€ข No PAN โ”‚ - โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ -``` - -## **7.3 Cardinality Management** +## 5.3 Networking Baseline -| Label | Action | Rationale | -|-------|--------|-----------| -| `user_id` | DROP | High cardinality, use traces | -| `request_id` | DROP | Use trace_id instead | -| `http.url` | DROP | URLs uniques = explosion | -| `http.route` | KEEP | Templated, low cardinality | -| `service.name` | KEEP | Essential | -| `http.method` | KEEP | Low cardinality | -| `http.status_code` | KEEP | Low cardinality | +| Component | Role | Configuration | +|-----------|------|---------------| +| **Cloudflare** | Edge, WAF, Tunnel | Pro tier | +| **Cilium** | CNI, mTLS, Gateway API | WireGuard encryption | +| **VPC Peering** | Aiven connectivity (Kiven product DB) | Private, no internet | +| **Route53** | Private DNS, backup | Internal zones | +| **Cross-Account** | Customer EKS access | IAM AssumeRole, kubeconfig | -## **7.4 SLI/SLO/Error Budgets** - -| Service | SLI | SLO | Error Budget | 
-|---------|-----|-----|--------------| -| **svc-ledger** | Availability | 99.9% | 43 min/mois | -| **svc-ledger** | Latency P99 | < 200ms | N/A | -| **svc-wallet** | Availability | 99.9% | 43 min/mois | -| **Platform (ArgoCD, Prometheus)** | Availability | 99.5% | 3.6h/mois | +> Detailed documentation: [networking/NETWORKING-ARCHITECTURE.md](networking/NETWORKING-ARCHITECTURE.md) -## **7.5 Alerting Strategy** +## 5.4 Data Baseline -| Severity | Exemple | Notification | On-call | -|----------|---------|--------------|---------| -| **P1 โ€” Critical** | svc-ledger down | PagerDuty immediate | Wake up | -| **P2 โ€” High** | Error rate > 5% | Slack + PagerDuty 15min | Within 30min | -| **P3 โ€” Medium** | Latency P99 > 500ms | Slack | Business hours | -| **P4 โ€” Low** | Disk usage > 80% | Slack | Next day | +### Kiven Product Database (SaaS side) -## **7.6 APM (Application Performance Monitoring)** +| Service | Provider | Purpose | Cost Estimate | +|---------|----------|---------|---------------| +| **PostgreSQL** | Aiven | Product DB (orgs, clusters, audit) | ~300 EUR/mo | +| **Kafka** | Aiven | Agent events, async operations | ~400 EUR/mo | +| **Valkey** | Aiven | Sessions, rate limiting, cache | ~150 EUR/mo | -### **7.6.1 Stack APM** - -| Composant | Outil | Intรฉgration | Usage | -|-----------|-------|-------------|-------| -| **Distributed Tracing** | Tempo + OTel | Auto-instrumentation Python/Go | Request flow, latency breakdown | -| **Profiling** | Pyroscope (Grafana) | SDK intรฉgrรฉ | CPU/Memory profiling continu | -| **Error Tracking** | Sentry (self-hosted) | SDK Python/Go | Exception tracking, stack traces | -| **Database APM** | pg_stat_statements | Prometheus exporter | Query performance | -| **Real User Monitoring** | Grafana Faro | JavaScript SDK | Frontend performance (si applicable) | - -### **7.6.2 APM Pipeline** - -``` 
-โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ APPLICATION LAYER โ”‚ -โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค -โ”‚ โ”‚ -โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ”‚ OTel SDK โ”‚ โ”‚ Pyroscope โ”‚ โ”‚ Sentry SDK โ”‚ โ”‚ pg_stat โ”‚ โ”‚ -โ”‚ โ”‚ (Traces) โ”‚ โ”‚ (Profiles) โ”‚ โ”‚ (Errors) โ”‚ โ”‚ (DB metrics) โ”‚ โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ -โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ - โ”‚ โ”‚ โ”‚ โ”‚ - โ–ผ โ–ผ โ–ผ โ–ผ -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ COLLECTION LAYER โ”‚ -โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค -โ”‚ 
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ”‚ OTel Collector (Gateway) โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Receives: traces, metrics, logs โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Processes: sampling, enrichment, PII scrubbing โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Exports: Tempo, Prometheus, Loki โ”‚ โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ - โ”‚ - โ–ผ -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ STORAGE & VISUALIZATION โ”‚ -โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค -โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ”‚ Tempo โ”‚ โ”‚ Pyroscope โ”‚ โ”‚ Sentry โ”‚ โ”‚ Grafana โ”‚ โ”‚ -โ”‚ โ”‚ (Traces) โ”‚ โ”‚ (Profiles) โ”‚ โ”‚ (Errors) โ”‚ โ”‚ (Unified) โ”‚ โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ 
-โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ -``` +### Customer Databases (managed by Kiven) -### **7.6.3 Instrumentation Standards** +| Service | Technology | Where | Cost | +|---------|-----------|-------|------| +| **PostgreSQL** | CloudNativePG on EKS | Customer's AWS | Customer's AWS bill | +| **Backups** | Barman โ†’ S3 | Customer's AWS | Customer's S3 costs | -| Language | Auto-instrumentation | Manual Instrumentation | Frameworks supportรฉs | -|----------|---------------------|------------------------|---------------------| -| **Python** | `opentelemetry-instrumentation` | `@tracer.start_as_current_span` | FastAPI, SQLAlchemy, httpx, grpcio | -| **Go** | OTel contrib packages | `tracer.Start()` | gRPC, net/http, pgx | +**Golden rule**: Kiven product DB and customer databases are **completely separate**. Customer data never touches Kiven's infrastructure. 
-### **7.6.4 Sampling Strategy** - -| Environment | Head Sampling | Tail Sampling | Rationale | -|-------------|---------------|---------------|-----------| -| **Dev** | 100% | N/A | Full visibility pour debug | -| **Staging** | 50% | Errors: 100% | Balance cost/visibility | -| **Prod** | 10% | Errors: 100%, Slow: 100% (>500ms) | Cost optimization | - -### **7.6.5 APM Dashboards** - -| Dashboard | Mรฉtriques clรฉs | Audience | -|-----------|---------------|----------| -| **Service Overview** | RPS, Error rate, Latency P50/P95/P99 | On-call | -| **Dependency Map** | Service topology, inter-service latency | Platform team | -| **Database Performance** | Query time, connections, deadlocks | Backend devs | -| **Error Analysis** | Error count by type, affected users | Product team | -| **Profiling Flame Graphs** | CPU hotspots, memory allocations | Performance team | - -### **7.6.6 Trace-to-Logs-to-Metrics Correlation** - -``` -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” trace_id โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ TRACES โ”‚โ—„โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–บโ”‚ LOGS โ”‚ -โ”‚ (Tempo) โ”‚ โ”‚ (Loki) โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ - โ”‚ โ”‚ - โ”‚ Exemplars (trace_id in metrics) โ”‚ - โ”‚ โ”‚ - โ–ผ โ–ผ -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ GRAFANA โ”‚ -โ”‚ โ€ข Click trace โ†’ See logs for that request โ”‚ -โ”‚ โ€ข Click metric spike โ†’ Jump to exemplar trace โ”‚ -โ”‚ โ€ข Click error log โ†’ Navigate to full trace โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ -``` - -### **7.6.7 APM Alerting** - -| Alert | Condition | Severity | Action | 
-|-------|-----------|----------|--------| -| **High Error Rate** | Error rate > 1% for 5min | P2 | Investigate errors in Sentry | -| **Latency Degradation** | P99 > 2x baseline for 10min | P2 | Check traces for slow spans | -| **Database Slow Queries** | Query time P95 > 100ms | P3 | Analyze pg_stat_statements | -| **Memory Leak Detected** | Memory growth > 10%/hour | P3 | Check Pyroscope profiles | +> Detailed documentation: [data/DATA-ARCHITECTURE.md](data/DATA-ARCHITECTURE.md) --- -# ๐Ÿ’พ **PARTIE VIII โ€” DATA ARCHITECTURE** - -## **8.1 Aiven Configuration** - -| Service | Plan | Config | Coรปt estimรฉ | -|---------|------|--------|-------------| -| **PostgreSQL** | Business-4 | Primary + Read Replica, 100GB | ~300โ‚ฌ/mois | -| **Kafka** | Business-4 | 3 brokers, 100GB retention | ~400โ‚ฌ/mois | -| **Valkey (Redis)** | Business-4 | 2 nodes, 10GB, HA | ~150โ‚ฌ/mois | +# PART VI โ€” TESTING & QUALITY -**Coรปt total Aiven estimรฉ : ~850โ‚ฌ/mois** +## 6.1 Test Pyramid -## **8.2 Database Strategy** +| Layer | Test Types | Frequency | +|-------|-----------|-----------| +| **Base** | Static analysis, linting (golangci-lint) | Pre-commit | +| **Unit** | Service logic, provider interface | PR | +| **Integration** | Agent โ†” CNPG, svc-infra โ†” AWS (LocalStack), DB (Testcontainers) | PR | +| **Contract** | gRPC contracts (Buf), agent protocol | PR | +| **E2E** | Full provisioning pipeline (kind + CNPG) | Nightly | +| **Performance** | Load testing, provisioning time (k6) | Weekly | +| **Chaos** | Node failure, agent disconnection, CNPG failover (Chaos Mesh) | Weekly | -| Aspect | Choix | Rationale | -|--------|-------|-----------| -| **Replication** | Aiven managed (async) | RPO 1h acceptable | -| **Backup** | Aiven automated hourly | RPO 1h | -| **Failover** | Aiven automated | RTO < 15min | -| **Connection** | VPC Peering (private) | PCI-DSS, no public internet | -| **Pooling** | PgBouncer (Aiven built-in) | Connection efficiency | +## 6.2 Performance Targets -## 
**8.3 Schema Ownership** +| Metric | Target | Alert | +|--------|--------|-------| +| **API Latency P50** | < 50ms | > 100ms | +| **API Latency P95** | < 100ms | > 200ms | +| **API Latency P99** | < 200ms | > 500ms | +| **Error Rate** | < 0.1% | > 1% | +| **Provisioning Time** | < 10min | > 15min | +| **Agent Reconnection** | < 30s | > 60s | +| **Backup Success Rate** | > 99.9% | < 99% | -| Table | Owner Service | Access pattern | -|-------|---------------|----------------| -| `transactions` | svc-ledger | CRUD | -| `ledger_entries` | svc-ledger | CRUD | -| `wallets` | svc-wallet | CRUD | -| `balance_snapshots` | svc-wallet | CRUD | -| `merchants` | svc-merchant | CRUD | -| `giftcards` | svc-giftcard | CRUD | +> Detailed documentation: [testing/TESTING-STRATEGY.md](testing/TESTING-STRATEGY.md) -**Rรจgle : 1 table = 1 owner. Cross-service = gRPC ou Events, jamais JOIN.** - -## **8.4 Kafka Topics** - -| Topic | Producer | Consumers | Retention | -|-------|----------|-----------|-----------| -| `ledger.transactions.v1` | svc-ledger (Outbox) | svc-notification, svc-analytics | 7 jours | -| `wallet.balance-updated.v1` | svc-wallet | svc-analytics | 7 jours | -| `merchant.onboarded.v1` | svc-merchant | svc-notification | 7 jours | - -## **8.5 Cache Architecture (Valkey/Redis)** - -### **8.5.1 Stack Cache** - -| Composant | Outil | Hรฉbergement | Coรปt estimรฉ | -|-----------|-------|-------------|-------------| -| **Cache primaire** | Valkey (Redis-compatible) | Aiven for Caching | ~150โ‚ฌ/mois | -| **Cache local (L1)** | Python `cachetools` / Go `bigcache` | In-memory | 0โ‚ฌ | - -> **Note :** Valkey est le fork open-source de Redis, maintenu par la Linux Foundation. Aiven supporte Valkey nativement. 
- -### **8.5.2 Cache Topology** - -``` -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ MULTI-LAYER CACHE โ”‚ -โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค -โ”‚ โ”‚ -โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ”‚ L1 โ€” LOCAL CACHE (per pod) โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข TTL: 30s - 5min โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Size: 100MB max per pod โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Use case: Hot data, config, user sessions โ”‚ โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ -โ”‚ โ”‚ Cache miss โ”‚ -โ”‚ โ–ผ โ”‚ -โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ”‚ L2 โ€” DISTRIBUTED CACHE (Valkey cluster) โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข TTL: 5min - 24h โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Size: 10GB โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Use case: Shared state, rate limits, session store โ”‚ โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ -โ”‚ โ”‚ Cache miss โ”‚ -โ”‚ โ–ผ โ”‚ -โ”‚ 
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ”‚ L3 โ€” DATABASE (PostgreSQL) โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Source of truth โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Write-through pour updates โ”‚ โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ -โ”‚ โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ -``` - -### **8.5.3 Cache Strategies par Use Case** - -| Use Case | Strategy | TTL | Invalidation | -|----------|----------|-----|--------------| -| **Wallet Balance** | Cache-aside (read) | 30s | Event-driven (Kafka) | -| **Merchant Config** | Read-through | 5min | TTL + Manual | -| **Rate Limiting** | Write-through | Sliding window | Auto-expire | -| **Session Data** | Write-through | 24h | Explicit logout | -| **Gift Card Catalog** | Cache-aside | 15min | Event-driven | -| **Feature Flags** | Read-through | 1min | Config push | - -### **8.5.4 Cache Patterns Implementation** - -``` -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ CACHE-ASIDE PATTERN โ”‚ -โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค -โ”‚ โ”‚ -โ”‚ 1. Application checks cache โ”‚ -โ”‚ 2. 
If HIT โ†’ return cached data โ”‚ -โ”‚ 3. If MISS โ†’ query database โ”‚ -โ”‚ 4. Store result in cache with TTL โ”‚ -โ”‚ 5. Return data to caller โ”‚ -โ”‚ โ”‚ -โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” GET โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ”‚ App โ”‚โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–บโ”‚ Cache โ”‚ โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”˜ โ”‚ -โ”‚ โ”‚ โ”‚ MISS โ”‚ -โ”‚ โ”‚ SELECT โ–ผ โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–บโ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ”‚ DB โ”‚ โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ -โ”‚ โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ - -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ WRITE-THROUGH PATTERN โ”‚ -โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค -โ”‚ โ”‚ -โ”‚ 1. Application writes to cache AND database atomically โ”‚ -โ”‚ 2. 
Cache is always consistent with database โ”‚ -โ”‚ โ”‚ -โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” SET+TTL โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ”‚ App โ”‚โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–บโ”‚ Cache โ”‚ โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ -โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ INSERT/UPDATE โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–บโ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ”‚ DB โ”‚ โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ -โ”‚ โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ -``` - -### **8.5.5 Cache Invalidation Strategy** - -| Trigger | Mรฉthode | Use Case | -|---------|---------|----------| -| **TTL Expiry** | Automatic | Default pour toutes les clรฉs | -| **Event-driven** | Kafka consumer | Wallet balance aprรจs transaction | -| **Explicit Delete** | API call | Admin actions, config updates | -| **Pub/Sub** | Valkey PUBLISH | Real-time invalidation cross-pods | - -### **8.5.6 Cache Key Naming Convention** - -``` -{service}:{entity}:{id}:{version} - -Exemples: - wallet:balance:user_123:v1 - merchant:config:merchant_456:v1 - giftcard:catalog:category_active:v1 - ratelimit:api:user_123:minute - session:auth:session_abc123 -``` - -### **8.5.7 Cache Metrics & Monitoring** - -| Metric | Seuil alerte | Action | -|--------|--------------|--------| -| **Hit Rate** | < 80% | Revoir TTL, prรฉchargement | -| **Latency P99** | > 10ms | Check network, cluster size | -| **Memory Usage** | > 80% | Eviction analysis, scale up | -| **Evictions/sec** | > 100 | Augmenter cache size | -| **Connection Errors** | > 0 | Check connectivity, pooling | - -## **8.6 Queueing & Background Jobs** - -### **8.6.1 Queueing Architecture Overview** - -``` 
-โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ QUEUEING ARCHITECTURE โ”‚ -โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค -โ”‚ โ”‚ -โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ”‚ TIER 1 โ€” EVENT STREAMING (Kafka) โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Use case: Event-driven architecture, CDC, audit logs โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Pattern: Pub/Sub, Event Sourcing โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Retention: 7 jours โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Ordering: Per-partition โ”‚ โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ -โ”‚ โ”‚ -โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ”‚ TIER 2 โ€” TASK QUEUE (Valkey + Python Dramatiq/ARQ) โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Use case: Background jobs, async processing โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Pattern: Producer/Consumer, Work Queue โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Features: Retries, priorities, scheduling โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Durability: Redis persistence โ”‚ โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ -โ”‚ 
โ”‚ -โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ”‚ TIER 3 โ€” SCHEDULED JOBS (Kubernetes CronJobs) โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Use case: Batch processing, reports, cleanup โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Pattern: Time-triggered execution โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Managed: K8s native โ”‚ โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ -โ”‚ โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ -``` - -### **8.6.2 Kafka vs Task Queue โ€” Decision Matrix** - -| Critรจre | Kafka | Task Queue (Valkey) | -|---------|-------|---------------------| -| **Message Ordering** | โœ… Per-partition | โŒ Best effort | -| **Message Replay** | โœ… Retention-based | โŒ Non | -| **Priority Queues** | โŒ Non natif | โœ… Oui | -| **Delayed Messages** | โŒ Non natif | โœ… Oui | -| **Dead Letter Queue** | โœ… Configurable | โœ… Intรฉgrรฉ | -| **Exactly-once** | โœ… Avec idempotency | โŒ At-least-once | -| **Throughput** | ๐Ÿš€ Trรจs รฉlevรฉ | ๐Ÿ“ˆ ร‰levรฉ | -| **Use Case** | Events, CDC, Streaming | Jobs, Tasks, Async work | - -### **8.6.3 Task Queue Stack** +--- -| Composant | Outil | Rรดle | -|-----------|-------|------| -| **Task Framework** | Dramatiq (Python) / Asynq (Go) | Task definition, execution | -| **Broker** | Valkey (Redis-compatible) | Message storage, routing | -| **Result Backend** | Valkey | Task results, status | -| **Scheduler** | APScheduler / Dramatiq-crontab | Periodic tasks | -| **Monitoring** | Dramatiq Dashboard / Prometheus | Task metrics 
| +# PART VII โ€” RESILIENCE & DR -### **8.6.4 Task Queue Patterns** +## 7.1 Failure Modes -``` -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ TASK PROCESSING FLOW โ”‚ -โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค -โ”‚ โ”‚ -โ”‚ Producer Broker Workers โ”‚ -โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ”‚ svc-* โ”‚โ”€โ”€โ”€โ”€ enqueue โ”€โ”€โ–บโ”‚ Valkey โ”‚โ—„โ”€โ”€ poll โ”€โ”€โ”€โ”€โ”€โ”‚ Worker โ”‚ โ”‚ -โ”‚ โ”‚ API โ”‚ โ”‚ โ”‚ โ”‚ Pods โ”‚ โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ Queues: โ”‚ โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”˜ โ”‚ -โ”‚ โ”‚ โ€ข high โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข defaultโ”‚ โ”‚ execute โ”‚ -โ”‚ โ”‚ โ€ข low โ”‚ โ–ผ โ”‚ -โ”‚ โ”‚ โ€ข dlq โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ Task โ”‚ โ”‚ -โ”‚ โ”‚ Handler โ”‚ โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ -โ”‚ โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ -``` +### Kiven SaaS Failures -### **8.6.5 Queue Definitions** +| Failure | Detection | Recovery | RTO | +|---------|-----------|----------|-----| +| Pod crash | Liveness probe | K8s restart | < 30s | +| Node failure | Node NotReady | Pod reschedule | < 2min | +| AZ failure | Multi-AZ detect | Traffic shift | < 5min | +| Product DB failure | Aiven health | Automatic failover | < 5min | +| Kafka broker failure | Aiven health | Automatic rebalance | < 2min | +| 
Full region failure | Manual | DR procedure | 4h (target) | -| Queue | Priority | Workers | Use Cases | -|-------|----------|---------|-----------| -| **critical** | P0 | 5 | Transaction rollbacks, fraud alerts | -| **high** | P1 | 10 | Email confirmations, balance updates | -| **default** | P2 | 20 | Notifications, analytics events | -| **low** | P3 | 5 | Reports, cleanup, batch exports | -| **scheduled** | N/A | 3 | Cron-like scheduled tasks | -| **dead-letter** | N/A | 1 | Failed tasks investigation | +### Customer Database Failures (Handled by Kiven) -### **8.6.6 Retry Strategy** +| Failure | Detection | Recovery | RTO | +|---------|-----------|----------|-----| +| PG pod crash | CNPG + Agent | CNPG automatic restart | < 30s | +| Primary failure | CNPG failover | Automatic promotion of replica | < 30s | +| DB node failure | Agent + AWS | Pod reschedule to healthy node | < 2min | +| EBS volume issue | Agent monitoring | Alert + manual intervention | < 15min | +| Agent disconnection | SaaS heartbeat | Agent auto-reconnects; DB keeps running | Immediate (DB unaffected) | +| Backup failure | Agent monitoring | Retry + alert to customer + Kiven ops | < 1h | +| Data corruption | Backup verification | PITR restore to last good point | < 30min | -| Retry Policy | Configuration | Use Case | -|--------------|---------------|----------| -| **Exponential Backoff** | base=1s, max=1h, multiplier=2 | API calls, external services | -| **Fixed Interval** | interval=30s, max_retries=5 | Database operations | -| **No Retry** | max_retries=0 | Idempotent operations | +## 7.2 Backup Strategy -``` -Retry Timeline (Exponential): - Attempt 1: immediate - Attempt 2: +1s - Attempt 3: +2s - Attempt 4: +4s - Attempt 5: +8s - ... 
- Attempt N: move to DLQ -``` +### Kiven SaaS -### **8.6.7 Dead Letter Queue (DLQ) Handling** +| Data | Method | Frequency | Retention | +|------|--------|-----------|-----------| +| Product DB | Aiven automated | Hourly | 7 days | +| Product DB PITR | Aiven WAL | Continuous | 24h | +| Kafka | Topic retention | N/A | 7 days | +| Terraform state | S3 versioning | Every apply | 90 days | -``` -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ DLQ WORKFLOW โ”‚ -โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค -โ”‚ โ”‚ -โ”‚ 1. Task fails after max retries โ”‚ -โ”‚ 2. Task moved to DLQ with metadata: โ”‚ -โ”‚ โ€ข Original queue โ”‚ -โ”‚ โ€ข Failure reason โ”‚ -โ”‚ โ€ข Stack trace โ”‚ -โ”‚ โ€ข Attempt count โ”‚ -โ”‚ โ€ข Timestamp โ”‚ -โ”‚ 3. Alert sent to Slack (P3) โ”‚ -โ”‚ 4. On-call investigates โ”‚ -โ”‚ 5. 
Options: โ”‚ -โ”‚ a) Fix bug โ†’ Replay task โ”‚ -โ”‚ b) Manual resolution โ†’ Delete from DLQ โ”‚ -โ”‚ c) Archive for audit โ”‚ -โ”‚ โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ -``` +### Customer Databases (Managed by Kiven) -### **8.6.8 Scheduled Jobs (CronJobs)** +| Data | Method | Frequency | Retention | +|------|--------|-----------|-----------| +| PostgreSQL | Barman (CNPG) โ†’ S3 | Configurable (default: 6h) | Configurable (default: 30 days) | +| PostgreSQL PITR | WAL archiving โ†’ S3 | Continuous | Configurable (default: 7 days) | +| Backup verification | Automated restore test | Weekly | Report stored 90 days | -| Job | Schedule | Service | Description | -|-----|----------|---------|-------------| -| **balance-reconciliation** | `0 2 * * *` | svc-wallet | Daily balance verification | -| **expired-giftcards** | `0 0 * * *` | svc-giftcard | Mark expired cards | -| **analytics-rollup** | `0 */6 * * *` | svc-analytics | 6-hourly aggregation | -| **log-cleanup** | `0 3 * * 0` | platform | Weekly log rotation | -| **backup-verification** | `0 4 * * *` | platform | Daily backup integrity check | -| **compliance-report** | `0 6 1 * *` | platform | Monthly compliance export | - -### **8.6.9 Task Queue Monitoring** - -| Metric | Seuil alerte | Action | -|--------|--------------|--------| -| **Queue Depth** | > 1000 tasks | Scale workers | -| **Processing Time P95** | > 30s | Optimize task, check resources | -| **Failure Rate** | > 5% | Investigate DLQ, check dependencies | -| **DLQ Size** | > 10 tasks | Immediate investigation | -| **Worker Availability** | < 50% | Check pod health, scale up | +> Detailed documentation: [resilience/DR-GUIDE.md](resilience/DR-GUIDE.md) --- -# ๐ŸŒ **PARTIE IX โ€” NETWORKING ARCHITECTURE** - -## **9.1 VPC Design** +# PART VIII โ€” PLATFORM 
CONTRACTS -| CIDR | Usage | -|------|-------| -| 10.0.0.0/16 | VPC Principal | -| 10.0.0.0/20 | Private Subnets (Workloads) | -| 10.0.16.0/20 | Private Subnets (Data) | -| 10.0.32.0/20 | Public Subnets (NAT, LB) | +## 8.1 Golden Path (New Kiven Service Checklist) -## **9.2 Traffic Flow** +| Step | Action | Validation | +|------|--------|------------| +| 1 | Create repo from Go service template | Structure compliant | +| 2 | Define protos in contracts-proto | `buf lint` pass | +| 3 | Implement service (Go) | Unit tests > 80% | +| 4 | Configure K8s manifests | Kyverno policies pass | +| 5 | Configure External-Secret | Secrets resolved from Vault | +| 6 | Add ServiceMonitor | Metrics visible in Grafana | +| 7 | Create HTTPRoute or gRPC route | Traffic routable | +| 8 | PR review | Merge โ†’ Auto-deploy dev | -| Flow | Path | Encryption | -|------|------|------------| -| Internet โ†’ Services | ALB โ†’ Cilium Gateway โ†’ Pod | TLS + mTLS | -| Service โ†’ Service | Pod โ†’ Pod (Cilium) | mTLS (WireGuard) | -| Service โ†’ Aiven | VPC Peering | TLS | -| Service โ†’ AWS (S3, KMS) | VPC Endpoints | TLS | +## 8.2 SLI/SLO/Error Budgets -## **9.3 Gateway API Configuration** +| Service | SLI | SLO | Error Budget | +|---------|-----|-----|--------------| +| **svc-api** | Availability | 99.9% | 43 min/month | +| **svc-api** | Latency P99 | < 200ms | N/A | +| **svc-provisioner** | Provisioning success rate | 99.5% | N/A | +| **svc-agent-relay** | Agent connection uptime | 99.9% | 43 min/month | +| **Agent** | Metrics delivery | 99.9% | 43 min/month | +| **Customer DB** | Backup success rate | 99.9% | N/A | +| **Platform** | Availability | 99.5% | 3.6h/month | -| Resource | Purpose | -|----------|---------| -| **GatewayClass** | Cilium implementation | -| **Gateway** | HTTPS listener, TLS termination | -| **HTTPRoute** | Routing vers services (path-based) | +## 8.3 On-Call Structure -## **9.4 Network Policies (Default Deny)** +| Role | Responsibility | Rotation | 
+|------|---------------|----------| +| **Primary** | First responder, triage (SaaS + customer infra) | Weekly | +| **Secondary** | Escalation, deep expertise | Weekly | +| **Incident Commander** | Coordination for P1 (customer data at risk) | On-demand | -| Policy | Effect | -|--------|--------| -| Default deny all | Aucun trafic sauf explicite | -| Allow intra-namespace | Services mรชme namespace peuvent communiquer | -| Allow specific cross-namespace | svc-ledger โ†’ svc-wallet explicite | -| Allow egress Aiven | Services โ†’ VPC Peering range only | -| Allow egress AWS endpoints | Services โ†’ VPC Endpoints only | +> Detailed documentation: [platform/PLATFORM-ENGINEERING.md](platform/PLATFORM-ENGINEERING.md) --- -# ๐ŸŒ **PARTIE IX.B โ€” EDGE, CDN & CLOUDFLARE** - -## **9.5 Cloudflare Architecture** - -### **9.5.1 Pourquoi Cloudflare ?** - -| Critรจre | Cloudflare | AWS CloudFront + WAF | Verdict | -|---------|------------|---------------------|---------| -| **Coรปt** | Free tier gรฉnรฉreux | Payant dรจs le dรฉbut | โœ… Cloudflare | -| **WAF** | Gratuit (rรจgles de base) | ~30โ‚ฌ/mois minimum | โœ… Cloudflare | -| **DDoS** | Inclus (unlimited) | AWS Shield Standard gratuit | โ‰ˆ ร‰gal | -| **SSL/TLS** | Gratuit, auto-renew | ACM gratuit | โ‰ˆ ร‰gal | -| **CDN** | 300+ PoPs, gratuit | Payant au GB | โœ… Cloudflare | -| **DNS** | Gratuit, trรจs rapide | Route53 ~0.50โ‚ฌ/zone | โœ… Cloudflare | -| **Zero Trust** | Gratuit jusqu'ร  50 users | Cognito + ALB payant | โœ… Cloudflare | -| **Terraform** | Provider officiel | Provider officiel | โ‰ˆ ร‰gal | +# PART IX โ€” ROADMAP -> **Dรฉcision :** Cloudflare en front, AWS en backend. Best of both worlds. 
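The error-budget column in the §8.2 SLO table is plain arithmetic: an availability target leaves `(1 - SLO)` of the period as allowed downtime, so 99.9% over a 30-day month is about 43 minutes. A minimal Go sketch of that conversion (illustrative only, not part of any Kiven service):

```go
// Error-budget arithmetic behind the §8.2 SLO table.
// Illustrative sketch only; not production code.
package main

import "fmt"

// Minutes in the 30-day month used for budget accounting.
const minutesPerMonth = 30 * 24 * 60 // 43,200

// errorBudgetMinutes returns the allowed downtime per 30-day month
// for an availability SLO expressed as a percentage (e.g. 99.9).
func errorBudgetMinutes(sloPercent float64) float64 {
	return (100 - sloPercent) / 100 * minutesPerMonth
}

func main() {
	for _, slo := range []float64{99.9, 99.5} {
		fmt.Printf("SLO %.2f%% -> %.1f min/month (%.2f h/month)\n",
			slo, errorBudgetMinutes(slo), errorBudgetMinutes(slo)/60)
	}
}
```

Running it reproduces the table's figures: 99.9% gives roughly 43 min/month, and 99.5% gives 3.6 h/month.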
+## 9.1 Build Sequence -### **9.5.2 Architecture Edge-to-Origin** - -``` -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ INTERNET โ”‚ -โ”‚ (End Users) โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ - โ”‚ - โ–ผ -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ CLOUDFLARE EDGE โ”‚ -โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค -โ”‚ โ”‚ -โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ”‚ LAYER 1: DNS โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Authoritative DNS (localplus.io) โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข DNSSEC enabled โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Geo-routing (future multi-region) โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Health checks โ†’ automatic failover โ”‚ โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ -โ”‚ โ”‚ โ”‚ -โ”‚ โ–ผ โ”‚ -โ”‚ 
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ”‚ LAYER 2: DDoS Protection โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Layer 3/4 DDoS mitigation (automatic, unlimited) โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Layer 7 DDoS mitigation โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Rate limiting rules โ”‚ โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ -โ”‚ โ”‚ โ”‚ -โ”‚ โ–ผ โ”‚ -โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ”‚ LAYER 3: WAF (Web Application Firewall) โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข OWASP Core Ruleset (free managed rules) โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Custom rules (rate limit, geo-block, bot score) โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Challenge pages (CAPTCHA, JS challenge) โ”‚ โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ -โ”‚ โ”‚ โ”‚ -โ”‚ โ–ผ โ”‚ -โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ”‚ LAYER 4: SSL/TLS โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Edge certificates (auto-issued, free) โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Full (strict) mode โ†’ Origin certificate โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข TLS 1.3 only, HSTS enabled โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Automatic HTTPS rewrites โ”‚ โ”‚ -โ”‚ 
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ -โ”‚ โ”‚ โ”‚ -โ”‚ โ–ผ โ”‚ -โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ”‚ LAYER 5: CDN & Caching โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Static assets caching (JS, CSS, images) โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข API responses: Cache-Control headers โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Tiered caching (edge โ†’ regional โ†’ origin) โ”‚ โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ -โ”‚ โ”‚ โ”‚ -โ”‚ โ–ผ โ”‚ -โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ”‚ LAYER 6: Cloudflare Tunnel (Argo Tunnel) โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข No public IP needed on origin โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Encrypted tunnel to Cloudflare edge โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข cloudflared daemon in K8s โ”‚ โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ -โ”‚ โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ - โ”‚ - โ”‚ Cloudflare Tunnel (encrypted) - โ–ผ 
-โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ AWS EKS CLUSTER โ”‚ -โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค -โ”‚ โ”‚ -โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ”‚ cloudflared (Deployment) โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Runs in platform namespace โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Connects to Cloudflare edge โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Routes traffic to internal services โ”‚ โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ -โ”‚ โ”‚ โ”‚ -โ”‚ โ–ผ โ”‚ -โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ”‚ API Gateway (APISIX) or Cilium Gateway โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Internal routing โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Rate limiting (L7) โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Authentication โ”‚ โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ -โ”‚ โ”‚ โ”‚ -โ”‚ โ–ผ โ”‚ -โ”‚ 
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ”‚ Application Services โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข svc-ledger, svc-wallet, etc. โ”‚ โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ -โ”‚ โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ -``` - -### **9.5.3 Cloudflare Services Configuration** - -| Service | Plan | Configuration | Coรปt | -|---------|------|---------------|------| -| **DNS** | Free | Authoritative, DNSSEC, proxy enabled | 0โ‚ฌ | -| **CDN** | Free | Cache everything, tiered caching | 0โ‚ฌ | -| **SSL/TLS** | Free | Full (strict), TLS 1.3, edge certs | 0โ‚ฌ | -| **WAF** | Free | Managed ruleset, 5 custom rules | 0โ‚ฌ | -| **DDoS** | Free | L3/L4/L7 protection, unlimited | 0โ‚ฌ | -| **Bot Management** | Free | Basic bot score, JS challenge | 0โ‚ฌ | -| **Rate Limiting** | Free | 1 rule (10K req/month free) | 0โ‚ฌ | -| **Tunnel** | Free | Unlimited tunnels, cloudflared | 0โ‚ฌ | -| **Access** | Free | Zero Trust, 50 users free | 0โ‚ฌ | - -**Coรปt Cloudflare total : 0โ‚ฌ** (Free tier suffisant pour dรฉmarrer) - -### **9.5.4 DNS Configuration** - -``` -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ DNS RECORDS โ€” localplus.io โ”‚ 
-โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค -โ”‚ โ”‚ -โ”‚ TYPE NAME CONTENT PROXY TTL โ”‚ -โ”‚ โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ โ”‚ -โ”‚ A @ Cloudflare Tunnel โ˜๏ธ ON Auto โ”‚ -โ”‚ CNAME www @ โ˜๏ธ ON Auto โ”‚ -โ”‚ CNAME api tunnel-xxx.cfargotunnel.com โ˜๏ธ ON Auto โ”‚ -โ”‚ CNAME grafana tunnel-xxx.cfargotunnel.com โ˜๏ธ ON Auto โ”‚ -โ”‚ CNAME argocd tunnel-xxx.cfargotunnel.com โ˜๏ธ ON Auto โ”‚ -โ”‚ TXT @ "v=spf1 include:_spf..." โ˜๏ธ OFF Auto โ”‚ -โ”‚ TXT _dmarc "v=DMARC1; p=reject..." โ˜๏ธ OFF Auto โ”‚ -โ”‚ MX @ mail provider โ˜๏ธ OFF Auto โ”‚ -โ”‚ โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ -``` - -### **9.5.5 WAF Rules Strategy** - -| Rule Set | Type | Action | Purpose | -|----------|------|--------|---------| -| **OWASP Core** | Managed | Block | SQLi, XSS, LFI, RFI protection | -| **Cloudflare Managed** | Managed | Block | Zero-day, emerging threats | -| **Geo-Block** | Custom | Block | Block high-risk countries (optional) | -| **Rate Limit API** | Custom | Challenge | > 100 req/min per IP on /api/* | -| **Bot Score < 30** | Custom | Challenge | Likely bot traffic | -| **Known Bad ASNs** | Custom | Block | Hosting providers, VPNs (optional) | - -### **9.5.6 SSL/TLS Configuration** - -| Setting | Value | Rationale | -|---------|-------|-----------| -| **SSL Mode** | Full (strict) | Origin has valid cert | -| **Minimum TLS** | 1.2 | PCI-DSS compliance | -| **TLS 1.3** | Enabled | 
Performance + security | -| **HSTS** | Enabled (max-age=31536000) | Force HTTPS | -| **Always Use HTTPS** | On | Redirect HTTP โ†’ HTTPS | -| **Automatic HTTPS Rewrites** | On | Fix mixed content | -| **Origin Certificate** | Cloudflare Origin CA | 15-year validity, free | - -### **9.5.7 Cloudflare Tunnel Architecture** - -| Composant | Rรดle | Dรฉploiement | -|-----------|------|-------------| -| **cloudflared daemon** | Agent tunnel, connexion sรฉcurisรฉe vers Cloudflare | 2+ replicas, namespace platform | -| **Tunnel credentials** | Secret d'authentification tunnel | Vault / External-Secrets | -| **Tunnel config** | Routing rules vers services internes | ConfigMap | -| **Health checks** | Vรฉrification disponibilitรฉ tunnel | Cloudflare dashboard | - -**Avantages Cloudflare Tunnel :** -- Pas d'IP publique exposรฉe sur l'origin -- Connexion outbound uniquement (pas de firewall inbound) -- Encryption de bout en bout -- Failover automatique entre replicas - -### **9.5.8 Cloudflare Access (Zero Trust)** - -| Resource | Policy | Authentication | -|----------|--------|----------------| -| **grafana.localplus.io** | Team only | GitHub SSO | -| **argocd.localplus.io** | Team only | GitHub SSO | -| **api.localplus.io/admin** | Admin only | GitHub SSO + MFA | -| **api.localplus.io/*** | Public | No auth (application handles) | - -### **9.5.9 Infrastructure as Code (Terraform)** - -| Ressource Terraform | Description | Module/Provider | -|---------------------|-------------|-----------------| -| **cloudflare_zone** | Zone DNS principale | cloudflare/cloudflare | -| **cloudflare_record** | Records DNS (A, CNAME, TXT) | cloudflare/cloudflare | -| **cloudflare_tunnel** | Configuration tunnel | cloudflare/cloudflare | -| **cloudflare_ruleset** | WAF rules, rate limiting | cloudflare/cloudflare | -| **cloudflare_access_application** | Zero Trust apps | cloudflare/cloudflare | -| **cloudflare_access_policy** | Policies d'accรจs | cloudflare/cloudflare | - -> **Note :** Toute la 
configuration Cloudflare est gรฉrรฉe via Terraform dans le repo `platform-gateway/cloudflare/terraform/` - -### **9.5.10 Cloudflare Monitoring & Analytics** - -| Metric | Source | Dashboard | -|--------|--------|-----------| -| **Requests** | Cloudflare Analytics | Grafana (API) | -| **Cache Hit Ratio** | Cloudflare Analytics | Grafana | -| **WAF Events** | Cloudflare Security Events | Grafana + Alerts | -| **Bot Score Distribution** | Cloudflare Analytics | Grafana | -| **Origin Response Time** | Cloudflare Analytics | Grafana | -| **DDoS Attacks** | Cloudflare Security Center | Email alerts | - -### **9.5.11 Route53 โ€” DNS Interne & Backup** - -| Use Case | Solution | Configuration | -|----------|----------|---------------| -| **DNS Public (Primary)** | Cloudflare | Authoritative pour `localplus.io` | -| **DNS Public (Backup)** | Route53 | Secondary zone, sync via AXFR | -| **DNS Privรฉ (Internal)** | Route53 Private Hosted Zones | `*.internal.localplus.io` | -| **Service Discovery** | Route53 + Cloud Map | Rรฉsolution services internes | -| **Health Checks** | Route53 Health Checks | Failover automatique si Cloudflare down | - -**Architecture DNS Hybride :** - -``` -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ DNS ARCHITECTURE โ”‚ -โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค -โ”‚ โ”‚ -โ”‚ EXTERNAL TRAFFIC INTERNAL TRAFFIC โ”‚ -โ”‚ โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ โ”‚ -โ”‚ โ”‚ -โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ”‚ 
Cloudflare DNS โ”‚ โ”‚ Route53 Private โ”‚ โ”‚ -โ”‚ โ”‚ (Primary) โ”‚ โ”‚ Hosted Zone โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ localplus.io โ”‚ โ”‚ internal. โ”‚ โ”‚ -โ”‚ โ”‚ api.localplus.ioโ”‚ โ”‚ localplus.io โ”‚ โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ -โ”‚ โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ Failover โ”‚ VPC DNS โ”‚ -โ”‚ โ–ผ โ–ผ โ”‚ -โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ”‚ Route53 Public โ”‚ โ”‚ EKS CoreDNS โ”‚ โ”‚ -โ”‚ โ”‚ (Backup) โ”‚ โ”‚ + Cloud Map โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ Health checks โ”‚ โ”‚ svc-*.svc. โ”‚ โ”‚ -โ”‚ โ”‚ Failover ready โ”‚ โ”‚ cluster.local โ”‚ โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ -โ”‚ โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ -``` - -| Route53 Feature | Use Case Local-Plus | -|-----------------|---------------------| -| **Private Hosted Zones** | Rรฉsolution DNS interne VPC, pas d'exposition internet | -| **Health Checks** | Vรฉrification santรฉ endpoints, failover automatique | -| **Alias Records** | Pointage vers ALB/NLB sans IP hardcodรฉe | -| **Geolocation Routing** | Future multi-rรฉgion, routage par gรฉographie | -| **Failover Routing** | Backup si Cloudflare indisponible | -| **Weighted Routing** | Canary deployments, A/B testing | - -### **9.5.12 Vision Multi-Cloud** - -> **Objectif :** L'architecture edge (Cloudflare) et API Gateway (APISIX) sont **cloud-agnostic** et peuvent router vers plusieurs cloud providers. 
- -``` -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ MULTI-CLOUD ARCHITECTURE (Future) โ”‚ -โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค -โ”‚ โ”‚ -โ”‚ CLOUDFLARE EDGE โ”‚ -โ”‚ (Global Load Balancing) โ”‚ -โ”‚ โ”‚ โ”‚ -โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ -โ”‚ โ–ผ โ–ผ โ–ผ โ”‚ -โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ”‚ AWS (Primary)โ”‚ โ”‚ GCP (Future) โ”‚ โ”‚ Azure (Future)โ”‚ โ”‚ -โ”‚ โ”‚ eu-west-1 โ”‚ โ”‚ europe-west1 โ”‚ โ”‚ westeurope โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ APISIX โ”‚ โ”‚ โ”‚ โ”‚ APISIX โ”‚ โ”‚ โ”‚ โ”‚ APISIX โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ Gateway โ”‚ โ”‚ โ”‚ โ”‚ Gateway โ”‚ โ”‚ โ”‚ โ”‚ Gateway โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ -โ”‚ โ”‚ โ”‚Services โ”‚ โ”‚ โ”‚ โ”‚Services โ”‚ โ”‚ โ”‚ โ”‚Services โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ 
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ -โ”‚ โ”‚ -โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ”‚ AIVEN (Multi-Cloud Data Layer) โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข PostgreSQL avec rรฉplication cross-cloud โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Kafka avec MirrorMaker cross-cloud โ”‚ โ”‚ -โ”‚ โ”‚ โ€ข Valkey avec rรฉplication โ”‚ โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ -โ”‚ โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ -``` - -| Composant | Multi-Cloud Ready | Comment | -|-----------|-------------------|---------| -| **Cloudflare** | โœ… Oui | Load balancing global, health checks multi-origin | -| **APISIX** | โœ… Oui | Dรฉployable sur tout K8s (EKS, GKE, AKS) | -| **Aiven** | โœ… Oui | PostgreSQL, Kafka, Valkey disponibles sur AWS/GCP/Azure | -| **ArgoCD** | โœ… Oui | Peut gรฉrer des clusters multi-cloud | -| **Vault** | โœ… Oui | Rรฉplication cross-datacenter | -| **OTel** | โœ… Oui | Standard ouvert, backends interchangeables | - -**Phases Multi-Cloud :** - -| Phase | Scope | Timeline | +| Phase | Focus | Duration | |-------|-------|----------| -| **Phase 1 (Actuelle)** | AWS uniquement, architecture cloud-agnostic | Now | -| **Phase 2** | DR sur GCP (read replicas, failover) | +12 mois | -| **Phase 3** | Active-Active multi-cloud | +24 mois | +| **1** | Bootstrap Layer 0-1 (IAM, VPC, EKS) | 3 weeks | +| **2** | Platform GitOps (Flux) | 1 week | +| **3** | Platform 
Networking (Cilium, Gateway API) + Cloudflare | 2 weeks |
+| **4** | Platform Security (Vault, Kyverno) | 2 weeks |
+| **5** | Platform Observability (Prometheus, Loki, Tempo) | 2 weeks |
+| **6** | Agent framework + gRPC protocol + agent-relay | 3 weeks |
+| **7** | CNPG Provider (provider-cnpg) | 2 weeks |
+| **8** | svc-provisioner (THE BRAIN) + svc-infra (AWS resources) | 4 weeks |
+| **9** | svc-clusters + svc-backups + svc-users | 3 weeks |
+| **10** | svc-monitoring + DBA intelligence (basic) | 3 weeks |
+| **11** | Dashboard — Simple Mode (Next.js) | 4 weeks |
+| **12** | Dashboard — Advanced Mode (YAML editor) | 2 weeks |
+| **13** | svc-auth (OIDC, RBAC, org model) | 2 weeks |
+| **14** | CLI + API + Terraform Provider | 3 weeks |
+| **15** | svc-billing (Stripe) + svc-audit | 2 weeks |
+| **16** | svc-migrations (Aiven/RDS import) | 2 weeks |
+| **17** | Testing (E2E, chaos, performance) | 2 weeks |
+| **18** | Compliance audit (GDPR, SOC2) | 2 weeks |
+
+**Total estimated: ~44 weeks (~10 months)**
+
+## 9.2 Pre-Start Checklist
+
+### Accounts & Access
+- [ ] AWS account created, billing configured
+- [ ] Aiven account created (product database)
+- [ ] Cloudflare account created
+- [ ] GitHub organization created
+- [ ] Stripe account created (billing)
+- [ ] DNS domain acquired (kiven.io or similar)
+
+### Decisions Validated
+- [ ] RPO 1h / RTO 15min (SaaS)
+- [ ] AWS eu-west-1
+- [ ] Go as backend language
+- [ ] Next.js as frontend
+- [ ] CNPG as PostgreSQL engine
+- [ ] Agent-based connectivity (gRPC/mTLS)
+- [ ] Cross-account IAM for customer infra access
+- [ ] Provider/plugin architecture for multi-operator future
+- [ ] Aiven for Kiven product DB + Kafka
+- [ ] Flux centralized
+- [ ] Cilium + Gateway API
+- [ ] Kyverno
+- [ ] HashiCorp Vault self-hosted
+
---
+
-# 🚪 **PARTIE IX.C — API GATEWAY / APIM (Phase Future)**
+# APPENDIX
-> **Statut :** À définir ultérieurement.
Pour le moment, l'architecture reste simple : Cloudflare โ†’ Cilium Gateway โ†’ Services. +## A. Glossary -## **9.6 Options ร  รฉvaluer (Future)** +> [GLOSSARY.md](GLOSSARY.md) -| Solution | Type | Coรปt | Notes | -|----------|------|------|-------| -| **AWS API Gateway** | Managed | Pay-per-use | Simple, intรฉgrรฉ AWS | -| **Gravitee CE** | APIM complet | Gratuit | Portal, Subscriptions inclus | -| **Kong OSS** | Gateway | Gratuit | Populaire, plugins riches | -| **APISIX** | Gateway | Gratuit | Cloud-native, performant | +## B. ADR Index -**Dรฉcision reportรฉe ร  Phase 2+ selon les besoins :** -- Si besoin B2B/Partners โ†’ APIM (Gravitee) -- Si juste rate limiting/auth โ†’ AWS API Gateway -- Si multi-cloud requis โ†’ APISIX ou Kong +| ADR | Title | Status | +|-----|-------|--------| +| 001 | Landing Zone: Control Tower + Terraform | Accepted | +| 002 | CNPG as PostgreSQL Engine | Accepted | +| 003 | Agent-Based Connectivity | Accepted | +| 004 | Provider/Plugin Architecture | Accepted | +| ... | ... | ... 
|

-### **Current Architecture (Phase 1 — Simple)**

+> [adr/](adr/)

-```
-┌──────────────────────────────────────────────────────────────────────┐
-│                   SIMPLIFIED ARCHITECTURE — PHASE 1                  │
-├──────────────────────────────────────────────────────────────────────┤
-│                                                                      │
-│  Internet                                                            │
-│     │                                                                │
-│     ▼                                                                │
-│  ┌────────────────────────────────────────────────────────────────┐  │
-│  │                          CLOUDFLARE                            │  │
-│  │                     (DNS, WAF, DDoS, TLS)                      │  │
-│  └───────────────────────────────┬────────────────────────────────┘  │
-│                                  │                                   │
-│                                  │ Tunnel or Direct                  │
-│                                  ▼                                   │
-│  ┌────────────────────────────────────────────────────────────────┐  │
-│  │                 AWS EKS — Cilium Gateway API                   │  │
-│  │                   (Internal routing, mTLS)                     │  │
-│  │                                                                │  │
-│  │  ┌──────────────────────────────────────────────────────────┐ │  │
-│  │  │ Services: svc-ledger, svc-wallet, svc-merchant, ...
│ │  │
-│  │  └──────────────────────────────────────────────────────────┘ │  │
-│  │                                                                │  │
-│  └────────────────────────────────────────────────────────────────┘  │
-│                                                                      │
-│  No dedicated API Gateway for now — Cilium Gateway API is enough.    │
-│                                                                      │
-└──────────────────────────────────────────────────────────────────────┘
-```

---

+## C. Change Management Process

+### Architecture Changes
+1. **ADR Required**: Any decision impacting >1 service
+2. **Review**: Platform Team + Tech Lead
+3.
**Communication**: Slack #platform-updates

-# ⚡ **PART X — RESILIENCE & DR**
-
-## **10.1 Failure Modes**
-
-| Failure | Detection | Recovery | RTO |
-|---------|-----------|----------|-----|
-| **Pod crash** | Liveness probe | K8s restart | < 30s |
-| **Node failure** | Node NotReady | Pod reschedule | < 2min |
-| **AZ failure** | Multi-AZ detect | Traffic shift | < 5min |
-| **DB primary failure** | Aiven health | Automatic failover | < 5min |
-| **Kafka broker failure** | Aiven health | Automatic rebalance | < 2min |
-| **Full region failure** | Manual | DR procedure (future) | 4h (target) |
-
-## **10.2 Backup Strategy**
-
-| Data | Method | Frequency | Retention | Location |
-|------|--------|-----------|-----------|----------|
-| **PostgreSQL** | Aiven automated | Hourly | 7 days | Aiven (cross-AZ) |
-| **PostgreSQL PITR** | Aiven WAL | Continuous | 24h | Aiven |
-| **Kafka** | Topic retention | N/A | 7 days | Aiven |
-| **Terraform state** | S3 versioning | Every apply | 90 days | S3 |
-| **Git repos** | GitHub | Every push | Unlimited | GitHub |
-
-## **10.3 Disaster Recovery (Future)**
-
-| Scenario | Current | Future (Multi-region) |
-|----------|---------|----------------------|
-| Single AZ failure | Automatic (multi-AZ) | Automatic |
-| Region failure | Manual restore from backup | Automatic failover |
-| Data corruption | PITR restore | PITR restore |
-
----
-
-# 🛠️ **PART XI — PLATFORM ENGINEERING**
-
-## **11.1 Platform Contracts**
-
-| Contract | Platform guarantee | Service responsibility |
-|----------|--------------------|------------------------|
-| **Deployment** | Git push → Prod < 15min | Valid K8s manifests |
-| **Secrets** | Vault dynamic, auto rotation | Use External-Secrets |
-| **Observability** | Auto-collection of traces/metrics/logs | OTel instrumentation |
-| **Networking** | mTLS enforced, Gateway API | Declare routes in HTTPRoute |
-| **Scaling** | HPA available | Configure requests/limits |
-| **Security** | Policies enforced | Pass the policies
|
-
-## **11.2 Golden Path (New Service Checklist)**
-
-| Step | Action | Validation |
-|------|--------|------------|
-| 1 | Create repo from template | Conforming structure |
-| 2 | Define protos in contracts-proto | buf lint pass |
-| 3 | Implement service | Unit tests > 80% |
-| 4 | Configure K8s manifests | Kyverno policies pass |
-| 5 | Configure External-Secret | Secrets resolved |
-| 6 | Add ServiceMonitor | Metrics visible in Grafana |
-| 7 | Create HTTPRoute | Traffic routable |
-| 8 | PR review | Merge → Auto-deploy dev |
-
-## **11.3 On-Call Structure (5 people)**
-
-| Role | Responsibility | Rotation |
-|------|----------------|----------|
-| **Primary** | First responder, triage | Weekly |
-| **Secondary** | Escalation, expertise | Weekly |
-| **Incident Commander** | Coordination for P1 | On-demand |
-
----

+### Breaking Changes
+1. RFC required (`docs/rfc/`)
+2. Migration path documented
+3. Announce 2 sprints before

-# 📊 **PART XII — TERMINOLOGY MAPPING**
-
-| Term | Concrete Local-Plus application |
-|------|---------------------------------|
-| **Reconciliation loop** | ArgoCD sync, Kyverno background scan |
-| **Desired state store** | Git repos |
-| **Drift detection** | ArgoCD diff, `terraform plan` scheduled |
-| **Blast radius** | Namespace isolation, PDB, Resource Quotas |
-| **Tenant isolation** | Vault policies per service, Network Policies |
-| **Paved road / Golden path** | Service template, onboarding checklist |
-| **Guardrails** | Kyverno policies (not gates) |
-| **Ephemeral credentials** | Vault dynamic DB secrets (TTL) |
-| **SLI/SLO/SLA** | Prometheus recording rules, Error budgets |
-| **Cardinality** | OTel Collector label filtering |
-| **Circuit breaker** | Cilium timeout policies |
-| **Outbox pattern** | svc-ledger → Kafka transactional |
-| **Control plane vs Data plane** | platform-* repos vs svc-* repos |
-| **Progressive delivery** | Argo Rollouts (canary) — future |
-| **Idempotency** |
Idempotency-Key header (SYSTEM_CONTRACT.md) |
-| **Pessimistic locking** | SELECT FOR UPDATE (SYSTEM_CONTRACT.md) |
-| **Error budget** | 43 min/month for a 99.9% SLO |
-| **MTTR** | Target < 15min (RTO) |
-| **Runbook** | docs/runbooks/*.md |
-| **Postmortem** | docs/postmortems/*.md (blameless) |
-| **APM (Application Performance Monitoring)** | Tempo + Pyroscope + Sentry |
-| **Distributed Tracing** | OTel → Tempo, trace_id correlation |
-| **Profiling** | Pyroscope (CPU/Memory flame graphs) |
-| **Cache-aside pattern** | Valkey lookup, DB fallback, cache on miss |
-| **Write-through cache** | Sync write to cache + DB |
-| **Cache invalidation** | TTL + Event-driven (Kafka) + Pub/Sub |
-| **L1/L2 Cache** | L1 = in-memory (pod), L2 = Valkey (distributed) |
-| **Task Queue** | Dramatiq + Valkey (background jobs) |
-| **Dead Letter Queue (DLQ)** | Failed tasks after max retries |
-| **Exponential Backoff** | Retry with increasing delay (1s, 2s, 4s...) |
-| **Priority Queue** | critical > high > default > low |
-| **CronJob** | K8s scheduled tasks (batch, cleanup) |
-| **Rate Limiting** | Valkey sliding window counter |
-| **Edge Computing** | Cloudflare Workers, CDN edge nodes |
-| **WAF (Web Application Firewall)** | Cloudflare WAF, OWASP ruleset |
-| **DDoS Protection** | Cloudflare L3/L4/L7 mitigation |
-| **CDN (Content Delivery Network)** | Cloudflare CDN, static asset caching |
-| **TLS Termination** | Cloudflare edge → Origin mTLS |
-| **Zero Trust** | Cloudflare Access, GitHub SSO |
-| **Cloudflare Tunnel** | Secure tunnel, no public origin IP |
-| **API Gateway / APIM** | To be defined — future phase (AWS API Gateway, Gravitee, Kong) |
-| **Bot Score** | Cloudflare bot detection metric |
-| **Origin Certificate** | Cloudflare Origin CA (15-year, free) |
-| **Private Hosted Zone** | Internal Route53 DNS (VPC only) |
-| **DNS Failover** | Route53 health checks as a backup for Cloudflare |
-| **Multi-Cloud** | Architecture deployable on AWS/GCP/Azure |
-|
**Cloud-Agnostic** | Components not tied to a specific provider |
-| **Cloudflare Tunnel** | Secure connection with no public origin IP |
-| **Upstream** | Backend service target in an API Gateway |
-| **Consumer** | API client with credentials (JWT, API Key) |
-| **Global Load Balancing** | Cloudflare routing multi-origin/multi-cloud |

+### Emergency Changes
+1. Incident Commander approval
+2. Post-mortem required
+3. Retroactive ADR within 48h

---

-# 🚀 **PART XIII — BUILD SEQUENCE**
-
-| Phase | Focus | Deliverables | Estimate |
-|-------|-------|--------------|----------|
-| **1** | Bootstrap Layer 0-1 | IAM, VPC, EKS, Aiven setup (PG, Kafka, Valkey) | 3 weeks |
-| **2** | Platform GitOps | ArgoCD, ApplicationSets | 1 week |
-| **3** | Platform Networking | Cilium, Gateway API | 1 week |
-| **3b** | Edge & CDN | Cloudflare DNS, WAF, TLS | 1 week |
-| **4** | Platform Security | Vault, External-Secrets, Kyverno | 2 weeks |
-| **5** | Platform Observability | OTel, Prometheus, Loki, Tempo, Grafana | 2 weeks |
-| **5b** | Platform APM | Pyroscope, Sentry, APM Dashboards | 1 week |
-| **6** | Platform Cache | Valkey setup, SDK integration | 1 week |
-| **7** | Contracts | Proto definitions, Python SDK | 1 week |
-| **8** | svc-ledger | Migrate the local-plus code, full tests | 3 weeks |
-| **9** | svc-wallet | Second service, gRPC integration | 2 weeks |
-| **10** | Kafka + Outbox | Event-driven patterns | 2 weeks |
-| **10b** | Task Queue | Dramatiq setup, background workers | 1 week |
-| **11** | Full testing | Regression (TNR), Perf, Chaos | 2 weeks |
-| **12** | Compliance audit | GDPR, PCI-DSS, SOC2 checks | 2 weeks |
-| **13** | Documentation | Runbooks, ADRs, Onboarding | 1 week |
-
-**Total: ~25 weeks**

+# Documentation Index
+
+| Document | Description | Path |
+|----------|-------------|------|
+| **Bootstrap Guide** | AWS setup, Account Factory |
[bootstrap/BOOTSTRAP-GUIDE.md](bootstrap/BOOTSTRAP-GUIDE.md) |
+| **Security Architecture** | Defense in depth, IAM, cross-account, Vault | [security/SECURITY-ARCHITECTURE.md](security/SECURITY-ARCHITECTURE.md) |
+| **Observability Guide** | Metrics, logs, traces, APM, dashboards | [observability/OBSERVABILITY-GUIDE.md](observability/OBSERVABILITY-GUIDE.md) |
+| **Networking Architecture** | VPC, Cloudflare, Gateway API, customer connectivity | [networking/NETWORKING-ARCHITECTURE.md](networking/NETWORKING-ARCHITECTURE.md) |
+| **Data Architecture** | Product DB, Kafka, customer DB model | [data/DATA-ARCHITECTURE.md](data/DATA-ARCHITECTURE.md) |
+| **Testing Strategy** | Pyramid, E2E, chaos, provisioning tests | [testing/TESTING-STRATEGY.md](testing/TESTING-STRATEGY.md) |
+| **Platform Engineering** | Contracts, Golden Path, on-call, CI/CD | [platform/PLATFORM-ENGINEERING.md](platform/PLATFORM-ENGINEERING.md) |
+| **DR Guide** | Backup, recovery, SaaS DR + customer DB DR | [resilience/DR-GUIDE.md](resilience/DR-GUIDE.md) |
+| **Agent Architecture** | Agent design, gRPC protocol, deployment | [agent/AGENT-ARCHITECTURE.md](agent/AGENT-ARCHITECTURE.md) |
+| **Customer Infra Management** | Nodes, storage, S3, IAM, cross-account | [infra/CUSTOMER-INFRA-MANAGEMENT.md](infra/CUSTOMER-INFRA-MANAGEMENT.md) |
+| **Customer Onboarding** | Terraform module, EKS discovery, provisioning | [onboarding/CUSTOMER-ONBOARDING.md](onboarding/CUSTOMER-ONBOARDING.md) |
+| **Provider Interface** | Plugin architecture, Go interface, adding providers | [providers/PROVIDER-INTERFACE.md](providers/PROVIDER-INTERFACE.md) |
+| **Glossary** | All terminology | [GLOSSARY.md](GLOSSARY.md) |

---

-# ✅ **PART XIV — FINAL CHECKLIST**
-
-## **Before starting:**
-
-- [ ] AWS account created, billing configured
-- [ ] Aiven account created
-- [ ] Cloudflare account created (Free tier)
-- [ ] GitHub organization created
-- [ ] Decision: HashiCorp Vault self-hosted on EKS
-- [ ] DNS domain acquired
and transferred to Cloudflare
-
-## **Architecture decisions validated:**
-
-- [ ] RPO 1h, RTO 15min — OK
-- [ ] AWS eu-west-1 — OK
-- [ ] Aiven for Kafka + PostgreSQL + Valkey — OK
-- [ ] Cloudflare for DNS + WAF + CDN — OK
-- [ ] API Gateway / APIM — To be defined (future phase)
-- [ ] Self-hosted observability — OK
-- [ ] ArgoCD centralized — OK
-- [ ] Cilium + Gateway API — OK
-- [ ] Kyverno — OK
-- [ ] GDPR + PCI-DSS + SOC2 — OK

+*Maintained by: Kiven Platform Team*
+*Last updated: February 2026*
diff --git a/GLOSSARY.md b/GLOSSARY.md
new file mode 100644
index 0000000..1135e07
--- /dev/null
+++ b/GLOSSARY.md
@@ -0,0 +1,201 @@
+# Glossary
+## *Kiven Platform Terminology*
+
+---
+
+> **Back to**: [Architecture Overview](EntrepriseArchitecture.md)
+
+---
+
+# 1. Kiven-Specific Terms
+
+| Term | Definition |
+|------|------------|
+| **Kiven** | Managed data services platform. "Aiven, but on your Kubernetes infrastructure." Finnish for "stone" — solid ground for your database. |
+| **Kiven Agent** | Lightweight Go binary deployed in the customer's K8s cluster. Executes commands, collects metrics/logs, reports status to Kiven SaaS via gRPC/mTLS. |
+| **Kiven SaaS** | The management platform running in Kiven's AWS account (eu-west-1). Dashboard, API, core services. |
+| **Provider** | Plugin that implements the Kiven provider interface for a specific K8s operator (e.g., CNPG Provider, Strimzi Provider). |
+| **CNPG Provider** | The first Kiven provider. Manages PostgreSQL via the CloudNativePG operator. |
+| **Service Plan** | Predefined resource tier (Hobbyist, Startup, Business, Premium, Custom) that maps to EC2 instance type, storage, instances, and postgresql.conf tuning. |
+| **Power Off / Power On** | Feature to pause a database by deleting compute (nodes + pods) while retaining data (EBS volumes + S3 backups). Saves 60-70% on non-production environments.
|
+| **Power Schedule** | Automated schedule for power on/off (e.g., Mon-Fri 8am-6pm). |
+| **Simple Mode** | Default dashboard UX for developers. Forms, sliders, buttons. No YAML visible. Like Aiven's UI. |
+| **Advanced Mode** | Dashboard UX for DevOps. View/edit YAML directly, diff view, change history, rollback. Like Lens for K8s. |
+| **svc-provisioner** | "The Brain" — core service that orchestrates the full provisioning pipeline (nodes → storage → S3 → CNPG → PG). |
+| **svc-infra** | Service managing AWS resources in customer accounts (EC2 node groups, EBS, S3, IAM). |
+| **svc-agent-relay** | gRPC server that multiplexes connections from all customer agents. |
+| **svc-yamleditor** | Service powering Advanced Mode: YAML generation, validation, diff, change history. |
+| **DBA Intelligence** | Kiven's automated database expertise: performance tuning, query optimization, backup verification, capacity planning, security auditing, incident diagnostics. |
+| **Backup Verification** | Automated weekly restore test: spin up a temporary CNPG cluster from the latest backup, validate, tear down. Proves backups are restorable. |
+| **Prerequisites Engine** | Validates the customer's K8s environment before provisioning (CNPG operator, storage classes, resources, cert-manager, etc.). |
+| **Customer Infrastructure** | AWS resources in the customer's account managed by Kiven: node groups, EBS volumes, S3 buckets, IAM roles. |
+| **Cross-Account IAM** | AWS IAM role in the customer's account that trusts Kiven's account. Kiven assumes this role to manage customer resources. |
+
+---
+
+# 2. CloudNativePG (CNPG) Terms
+
+| Term | Definition |
+|------|------------|
+| **CloudNativePG (CNPG)** | CNCF Kubernetes operator for PostgreSQL. Manages cluster lifecycle, HA, backups, failover. |
+| **CNPG Cluster** | Custom Resource (CR) defining a PostgreSQL cluster: instances, storage, config, backups.
| +| **CNPG Pooler** | Custom Resource for PgBouncer connection pooling, managed by CNPG operator. | +| **CNPG ScheduledBackup** | Custom Resource defining automated backup schedule (frequency, retention, S3 target). | +| **Barman** | Backup tool used by CNPG for physical backups and WAL archiving to object storage (S3). | +| **PITR (Point-in-Time Recovery)** | Ability to restore a database to any specific moment using base backup + WAL replay. | +| **WAL (Write-Ahead Log)** | PostgreSQL's transaction log. Every change is written to WAL before data files. Used for replication and PITR. | +| **Switchover** | Planned promotion of a replica to primary (graceful, zero data loss). | +| **Failover** | Automatic promotion of a replica when primary fails (may lose last few transactions depending on replication mode). | +| **Replication Lag** | Time delay between primary writing data and replica receiving it. | +| **PVC (Persistent Volume Claim)** | Kubernetes resource requesting persistent storage (maps to EBS volume). | +| **PVC Reclaim Policy** | What happens to the EBS volume when the PVC is deleted. `Retain` = keep the volume (critical for Power Off/On). | + +--- + +# 3. PostgreSQL Terms + +| Term | Definition | +|------|------------| +| **postgresql.conf** | Main PostgreSQL configuration file. Controls memory, connections, WAL, checkpoints, etc. | +| **pg_hba.conf** | PostgreSQL Host-Based Authentication config. Controls who can connect and how. | +| **shared_buffers** | RAM allocated for caching data pages. Typically 25% of total RAM. | +| **work_mem** | RAM per query operation for sorting/hashing. Too low = spills to disk. | +| **effective_cache_size** | Hint to query planner about available cache. Typically 75% of RAM. | +| **max_connections** | Maximum concurrent connections. Should be sized with connection pooling. | +| **PgBouncer** | PostgreSQL connection pooler. Reduces connection overhead. Modes: session, transaction, statement. 
| +| **pg_stat_statements** | Extension tracking execution statistics of all SQL queries. | +| **pg_stat_activity** | System view showing currently active queries and connections. | +| **pg_stat_bgwriter** | System view for background writer and checkpoint statistics. | +| **pg_stat_user_tables** | System view for table-level statistics (seq scans, idx scans, dead tuples). | +| **Autovacuum** | Background process that reclaims dead tuples and updates statistics. | +| **Bloat** | Wasted space from dead tuples that autovacuum hasn't reclaimed. | +| **XID Wraparound** | PostgreSQL transaction ID limit (~2 billion). If reached, database freezes. Autovacuum prevents this. | +| **EXPLAIN / EXPLAIN ANALYZE** | Commands showing query execution plan (estimated vs actual). | +| **Sequential Scan** | Full table scan. Often indicates missing index. | +| **Index Scan** | Targeted lookup using an index. Generally faster than seq scan. | +| **Extensions** | PostgreSQL plugins: pg_vector (AI embeddings), PostGIS (geospatial), TimescaleDB (time-series), etc. | + +--- + +# 4. AWS / Cloud Terms + +| Term | Definition | +|------|------------| +| **EKS (Elastic Kubernetes Service)** | AWS managed Kubernetes service. | +| **EBS (Elastic Block Store)** | AWS block storage for EC2. Volumes attached to K8s nodes for database data. | +| **gp3** | EBS volume type. General purpose SSD with configurable IOPS and throughput. Default for Kiven. | +| **S3 (Simple Storage Service)** | AWS object storage. Used for CNPG backups (Barman) and WAL archiving. | +| **IRSA (IAM Roles for Service Accounts)** | AWS feature mapping K8s ServiceAccounts to IAM roles. CNPG uses IRSA to write backups to S3. | +| **AssumeRole** | AWS IAM action to temporarily take on another role's permissions. Kiven assumes customer's `KivenAccessRole`. | +| **Cross-Account Access** | Pattern where one AWS account accesses resources in another account via IAM role trust. | +| **Terraform** | HashiCorp IaC tool. 
Kiven provides a Terraform module for customers to create the access role. | +| **KMS (Key Management Service)** | AWS encryption key management. Used for EBS and S3 encryption. | +| **Managed Node Group** | EKS feature for managed EC2 instances as K8s worker nodes. Kiven creates dedicated node groups for databases. | +| **Taints** | K8s mechanism to repel pods from nodes. Kiven taints DB nodes so only DB pods run there. | +| **Tolerations** | K8s mechanism allowing pods to schedule on tainted nodes. CNPG pods tolerate the database taint. | +| **Multi-AZ** | Deploying across multiple Availability Zones for high availability. Kiven spreads primary/replicas across AZs. | + +--- + +# 5. Kubernetes & Operator Terms + +| Term | Definition | +|------|------------| +| **CRD (Custom Resource Definition)** | Extends K8s API with custom resources. CNPG adds Cluster, Backup, Pooler CRDs. | +| **CR (Custom Resource)** | Instance of a CRD. A CNPG `Cluster` CR defines one PostgreSQL cluster. | +| **Operator** | K8s controller that manages complex applications via CRDs. CNPG operator manages PostgreSQL. | +| **Controller** | Control loop watching K8s resources and reconciling actual vs desired state. | +| **Reconciliation Loop** | Continuous process comparing desired state (YAML) with actual state and making corrections. | +| **client-go** | Official Go client library for Kubernetes API. Used by Kiven agent. | +| **controller-runtime** | Go library for building K8s controllers/operators. Used by Kiven agent. | +| **Informer** | K8s pattern for watching resource changes efficiently. Agent uses informers for CNPG CRDs. | +| **Namespace** | K8s logical isolation. Kiven uses `kiven-system` (agent + operator) and `kiven-databases` (PG clusters). | +| **NetworkPolicy** | K8s L3/L4 firewall rules. Kiven creates policies so only authorized app pods reach the database. | +| **StorageClass** | K8s abstraction for dynamic storage provisioning. 
Kiven creates optimized storage classes for DB workloads. |
+| **Helm** | K8s package manager. Agent and CNPG operator are installed via Helm charts. |
+
+---
+
+# 6. Communication & Protocol Terms
+
+| Term | Definition |
+|------|------------|
+| **gRPC** | High-performance RPC framework by Google. Used for agent ↔ Kiven SaaS communication. |
+| **mTLS (Mutual TLS)** | Both client and server verify each other's certificates. Used for agent ↔ SaaS security. |
+| **Protobuf** | Protocol Buffers — binary serialization format for gRPC messages. |
+| **Bidirectional Streaming** | gRPC feature where both sides can send messages continuously. Agent streams metrics, SaaS streams commands. |
+| **Outbound-Only** | Agent initiates the connection to Kiven SaaS. No inbound ports needed on the customer's firewall. |
+
+---
+
+# 7. Architecture & Software Terms
+
+| Term | Definition |
+|------|------------|
+| **Provider Interface** | Go interface that each data service (CNPG, Strimzi, Redis) implements. Enables multi-operator support. |
+| **Plugin Architecture** | Design pattern where functionality is added via plugins without modifying core code. |
+| **GitOps** | Managing infrastructure and apps using Git as single source of truth. Flux reconciles from Git. |
+| **Infrastructure as Code (IaC)** | Managing infra through code (Terraform) rather than manual processes. |
+| **Stategraph** | Terraform/OpenTofu state backend using PostgreSQL instead of flat state files. Enables parallel plans, no lock waiting, SQL-queryable state. See [stategraph.com](https://stategraph.com/). **Planned for Q4 2026** — currently using S3. |
+| **Trunk-Based Development** | All developers merge to the main branch. Short-lived feature branches. |
+| **C4 Model** | Architecture documentation: Context, Container, Component, Code diagrams. |
+| **Defense in Depth** | Multiple security layers so one breach doesn't compromise everything. |
+| **Zero Trust** | Never trust, always verify.
Every request is authenticated and authorized. |
+| **RBAC (Role-Based Access Control)** | Permissions based on roles (Admin, Operator, Viewer). |
+| **OIDC (OpenID Connect)** | Identity protocol for SSO. Login with Google, GitHub, SAML. |
+| **Idempotency** | Operation producing the same result no matter how many times executed. Critical for agent commands. |
+
+---
+
+# 8. Observability & Reliability Terms
+
+| Term | Definition |
+|------|------------|
+| **RPO (Recovery Point Objective)** | Maximum acceptable data loss in time. RPO 1h = can lose up to 1 hour of data. |
+| **RTO (Recovery Time Objective)** | Maximum acceptable downtime. RTO 15min = must recover within 15 minutes. |
+| **SLI (Service Level Indicator)** | Metric measuring service behavior (e.g., availability, latency). |
+| **SLO (Service Level Objective)** | Target for an SLI (e.g., 99.9% availability). |
+| **SLA (Service Level Agreement)** | Contractual commitment to an SLO with consequences for breach. |
+| **Error Budget** | Allowable unreliability: 100% - SLO. 99.9% SLO = 43 min/month error budget. |
+| **Prometheus** | Time-series database for metrics. Collects from Kiven services and agents. |
+| **Loki** | Log aggregation system by Grafana. Stores and queries logs. |
+| **Tempo** | Distributed tracing system by Grafana. Traces requests across services. |
+| **OpenTelemetry (OTel)** | Standard for telemetry (metrics, logs, traces) collection and export. |
+| **Chaos Engineering** | Deliberately injecting failures to test system resilience. |
+
+---
+
+# 9. Business & Compliance Terms
+
+| Term | Definition |
+|------|------------|
+| **GDPR** | EU General Data Protection Regulation. Requires data residency, consent, right to erasure. |
+| **SOC2** | Security framework requiring audit controls, RBAC, monitoring, incident response. |
+| **Data Sovereignty** | Data stored and processed within specific geographic boundaries. Kiven's model ensures this — data stays in the customer's VPC.
|
+| **Vendor Lock-In** | Dependency on a specific vendor. Kiven reduces lock-in: the customer owns their K8s infra, and CNPG is open-source. |
+| **DBaaS (Database-as-a-Service)** | Fully managed database. Aiven and Kiven are both DBaaS, but with different infrastructure models. |
+| **BYOC (Bring Your Own Cloud)** | Model where the managed service runs on the customer's cloud account. Kiven's core model. |
+| **Stripe** | Payment platform for SaaS billing. Kiven uses Stripe for subscription management. |
+
+---
+
+# 10. How to Use These Terms
+
+## In PR Reviews
+- *"This increases blast radius for customer data"*
+- *"We need idempotency on this agent command"*
+- *"Check the PVC reclaim policy — must be Retain for power off"*
+
+## In Customer Conversations
+- *"Your data never leaves your VPC"*
+- *"You can power off dev databases on weekends to save 70%"*
+- *"Our DBA intelligence will auto-tune your postgresql.conf"*
+
+## In Architecture Decisions
+- *"We need cross-account IAM for svc-infra to manage customer node groups"*
+- *"The provider interface must be stable before we add Strimzi support"*
+
+---
+
+*Maintained by: Platform Team*
+*Last updated: February 2026*
diff --git a/adr/ADR-001-LANDING-ZONE-APPROACH.md b/adr/ADR-001-LANDING-ZONE-APPROACH.md
new file mode 100644
index 0000000..adcf955
--- /dev/null
+++ b/adr/ADR-001-LANDING-ZONE-APPROACH.md
@@ -0,0 +1,222 @@
+# ADR-001: Landing Zone Approach
+
+**Status:** Accepted
+**Date:** 2026-01-27
+**Decision makers:** Platform Team
+
+---
+
+## Context
+
+LOCAL-PLUS needs an AWS multi-account strategy for a Gift Card & Loyalty Platform with SOC2, PCI-DSS, and GDPR compliance requirements.
+
+## Decision
+
+**Hybrid approach: Control Tower + Terraform**
+
+> Control Tower as the foundation, Terraform as the language.
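The hybrid decision above boils down to an ownership rule: org-wide, security, and audit concerns belong to Control Tower, while everything business-facing belongs to Terraform. A minimal Go sketch of that rule (purely illustrative; the concern names and `Layer` values are assumptions for this example, not real AWS or Terraform identifiers):

```go
package main

import "fmt"

// Layer names which toolchain owns a given concern in the hybrid
// landing-zone model (illustrative labels only).
type Layer string

const (
	ControlTower Layer = "control-tower" // org-wide, security baseline, audit
	Terraform    Layer = "terraform"     // product, business, platform
)

// ownerOf maps a concern to its responsible layer, mirroring the
// split argued for in this ADR.
func ownerOf(concern string) Layer {
	switch concern {
	case "org-scp", "cloudtrail", "log-archive", "audit-account":
		return ControlTower
	case "vpc", "eks", "rds", "iam-irsa", "observability":
		return Terraform
	default:
		// Anything unlisted is assumed business-facing.
		return Terraform
	}
}

func main() {
	fmt.Println(ownerOf("cloudtrail")) // control-tower
	fmt.Println(ownerOf("eks"))        // terraform
}
```

The point of the sketch is only that the boundary is a hard, enumerable list: a concern is either audit-critical (managed AWS) or business-facing (Git-reviewed Terraform), never both.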
+
+## Rationale
+
+### The real question
+
+> *"Who carries the legal, security, and audit responsibility?"*
+
+- **Org-wide, security baseline, audit** → Managed AWS (Control Tower)
+- **Product, business, platform** → Pure Terraform
+
+### Why not Pure Terraform?
+
+| Risk | Impact |
+|------|--------|
+| SCP mistake | Blast radius = the whole org |
+| CloudTrail omission | Non-compliance, invisible incidents |
+| Misconfigured S3 log bucket | Audit failure |
+| Late migration to CT | 2-4 weeks, high risk |
+
+> *"Terraform-only is intellectually pure but strategically risky."*
+
+### Why not Control Tower only?
+
+| Issue | Impact |
+|-------|--------|
+| Not Git-first | Platform Team friction |
+| Black box | Hard to debug |
+| AFT internals | CodePipeline managed by AWS, invisible to us |
+
+> *"Control Tower is imperfect but politically and legally powerful."*
+
+### Compliance = a shared language
+
+An audit is a **social exercise**, not a technical one.
+
+| Auditor question | With Control Tower |
+|------------------|--------------------|
+| "How do you manage logs?" | "Control Tower, Log Archive account" |
+| "Your guardrails?" | "AWS managed controls + custom SCPs" |
+| "Drift detection?"
| "AWS Config + Control Tower dashboard" |
+
+---
+
+## Architecture
+
+```
+┌──────────────────────────────────────────────────────────────┐
+│                  LAYER 0 — CONTROL TOWER                     │
+│                (Managed, Immutable, Audit)                   │
+├──────────────────────────────────────────────────────────────┤
+│  • AWS Organizations                                         │
+│  • Global SCPs (AWS managed + custom via Terraform)          │
+│  • CloudTrail org-level                                      │
+│  • AWS Config                                                │
+│  • Security Hub                                              │
+│  • Log Archive Account                                       │
+│  • Audit Account                                             │
+│  ⛔ No product logic here                                    │
+└──────────────────────────────────────────────────────────────┘
+                          │
+                          ▼
+┌──────────────────────────────────────────────────────────────┐
+│                  AFT — ACCOUNT FACTORY                       │
+│           (GitHub Actions → Terraform → AFT)                 │
+├──────────────────────────────────────────────────────────────┤
+│  • Account requests via Git PR                               │
+│  • GitHub Actions runs Terraform                             │
+│  • Terraform calls the AFT module                            │
+│  • AFT provisions account + baseline                         │
+│  ⛔ No business logic                                        │
+└──────────────────────────────────────────────────────────────┘
+                          │
+                          ▼
+┌──────────────────────────────────────────────────────────────┐
+│                 LAYER 1+ — PURE TERRAFORM                    │
+│            (Platform, GitOps, GitHub Actions)                │
+├──────────────────────────────────────────────────────────────┤
+│  • VPC / Networking                                          │
+│  • EKS / ECS                                                 │
+│  • RDS / Kafka / Cache                                       │
+│  • Business IAM (IRSA)                                       │
+│  • Observability                                             │
+│  • Everything business-facing                                │
+│  💯 PR review, GitHub Actions, 100% readable                 │
+└──────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## Control Tower via Terraform
+
+Control Tower controls can be managed via Terraform:
+
+**Reference:**
+- Terraform: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/controltower_control
+- AWS Docs: https://docs.aws.amazon.com/controltower/
+
+---
+
+## Implementation Plan
+
+### Phase 1: Control Tower Setup (Console)
+
+> **See [BOOTSTRAP-RUNBOOK](../../bootstrap/docs/BOOTSTRAP-RUNBOOK.md) for detailed instructions.**
+
+| Step | Action |
+|------|--------|
+| 1 | Choose setup preferences (regions, region deny) |
+| 2 | Create OUs (Security, Sandbox) |
+| 3 | Configure Service integrations — **create 2 accounts** |
+| 4 | Review and enable (~45 min) |
+
+**Accounts created in Step 3:**
+
+| Service | Account | Email |
+|---------|---------|-------|
+| AWS Config Aggregator | **Audit** | `aws+audit@talq.xyz` |
+| CloudTrail Administrator | **Log Archive** | `aws+logs@talq.xyz` |
+
+> โš ๏ธ Config and CloudTrail require **different** accounts.
+
+### Phase 2: Terraform Layer (bootstrap/)
+
+| Component | Approach |
+|-----------|----------|
+| Organizations | CT-managed, read via data sources |
+| OUs | CT-managed, custom via Terraform |
+| SCPs | CT-managed + custom via `aws_controltower_control` |
+| SSO | Terraform (`aws_ssoadmin_*`) |
+| Account Factory | AFT via GitHub Actions + Terraform |
+| Workload accounts | AFT baseline + Terraform customizations |
+
+### Phase 3: Platform (platform-application-provisioning/)
+
+- VPC, EKS, RDS, Kafka โ€” Pure Terraform
+- GitHub Actions CI/CD
+- 100% GitOps
+
+---
+
+## What changes in bootstrap/
+
+| Current | New |
+|---------|-----|
+| `organization/` creates org | CT creates org, we read via data |
+| `scps/` creates all SCPs | CT SCPs + custom via `aws_controltower_control` |
+| `core-accounts/` creates accounts | CT creates Log/Audit, we create others |
+| `account-factory/` custom | AFT module via GitHub Actions |
+| `sso/` | Stays Terraform (SSO is independent) |
+
+---
+
+## Decision Matrix by Stage
+
+| Stage | Approach |
+|-------|----------|
+| ๐ŸŸข Early startup (1-5 accounts, no audit < 12 months) | Pure Terraform OK (but CT-compatible design) |
+| ๐ŸŸก Scaling / Series A / B2B clients | Control Tower REQUIRED |
+| ๐Ÿ”ต Enterprise / regulated | Control Tower non-negotiable |
+
+**LOCAL-PLUS position:** ๐ŸŸก โ†’ Control Tower recommended
+
+---
+
+## Risks and Mitigations
+
+| Risk | Decision |
+|------|----------|
+| CT opacity | Resource inventory maintained in this ADR. Terraform data sources to read CT resources. |
+| AFT internal CodePipeline | AFT is a Terraform module. GitHub Actions runs Terraform โ†’ AFT. The internal CodePipeline is managed by AWS. 
|
+| CT behavior changes | Terraform provider pinned. Tested in sandbox before promotion. |
+| Expertise split | CODEOWNERS defined: `@security` for CT, `@platform` for Terraform. |
+
+---
+
+## Consequences
+
+### Positive
+
+- Compliance ready โ€” auditors know Control Tower
+- Reduced blast radius โ€” AWS manages critical controls
+- Future-proof โ€” no migration pain
+- Security baseline by default
+
+### Negative
+
+| Limitation | Resolution |
+|------------|------------|
+| One-time console setup | **Accepted.** Documented in the BOOTSTRAP-RUNBOOK. Executed only once. |
+| CT resources not in Terraform state | Data sources to read the IDs. Inventory maintained in this ADR. |
+| Team must understand CT + Terraform | Platform Team training. Clear ownership in CODEOWNERS. |
+
+---
+
+## References
+
+- [Bootstrap Guide](../bootstrap/BOOTSTRAP-GUIDE.md)
+- [Security Architecture](../security/SECURITY-ARCHITECTURE.md)
+- AWS Control Tower: https://docs.aws.amazon.com/controltower/
+- Terraform aws_controltower_control: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/controltower_control
+
+---
+
+*Maintained by: Platform Team*
+*Last updated: January 2026*
diff --git a/agent/AGENT-ARCHITECTURE.md b/agent/AGENT-ARCHITECTURE.md
new file mode 100644
index 0000000..2a7b6ac
--- /dev/null
+++ b/agent/AGENT-ARCHITECTURE.md
@@ -0,0 +1,285 @@
+# Kiven Agent Architecture
+## *The Bridge Between Kiven SaaS and Customer Kubernetes*
+
+---
+
+> **Back to**: [Architecture Overview](../EntrepriseArchitecture.md)
+
+---
+
+# What Is the Kiven Agent
+
+The agent is a **single Go binary** deployed inside the customer's Kubernetes cluster. It is the only component Kiven runs in the customer's environment. Everything Kiven does on the customer's cluster goes through the agent.
+ +``` +Kiven SaaS (our infra) โ—„โ”€โ”€โ”€โ”€ gRPC/mTLS (outbound from agent) โ”€โ”€โ”€โ”€ Agent (customer's K8s) + โ”‚ + โ”œโ”€โ”€ Watches CNPG CRDs + โ”œโ”€โ”€ Collects PG metrics + โ”œโ”€โ”€ Executes commands + โ”œโ”€โ”€ Aggregates logs + โ””โ”€โ”€ Reports infra status +``` + +--- + +# Design Principles + +| Principle | Implementation | +|-----------|---------------| +| **Outbound-only** | Agent initiates connection to Kiven SaaS. No inbound ports on customer's firewall. | +| **Minimal footprint** | < 50MB RAM, < 0.1 CPU. Must not impact customer's workloads. | +| **Fault-tolerant** | If agent loses connection, databases keep running. Agent auto-reconnects. | +| **Secure** | mTLS for all communication. ServiceAccount scoped to CNPG CRDs only. | +| **Single binary** | One Go binary, deployed via Helm chart. No dependencies. | +| **Multi-provider ready** | Plugin system: auto-detects installed operators, activates relevant modules. | + +--- + +# Agent Components + +``` +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ KIVEN AGENT (Go binary) โ”‚ +โ”‚ โ”‚ +โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ +โ”‚ โ”‚ Provider Registry โ”‚ โ”‚ +โ”‚ โ”‚ โ”œโ”€โ”€ CNPG Module (Phase 1) โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€ CNPG Watcher (informers on Cluster/Backup/Pooler) โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€ PG Stats Collector (pg_stat_*, via PG connection) โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ โ””โ”€โ”€ PG Log Collector (pod logs from CNPG pods) โ”‚ โ”‚ +โ”‚ โ”‚ โ”œโ”€โ”€ Strimzi Module (Future) โ”‚ โ”‚ +โ”‚ โ”‚ โ””โ”€โ”€ Redis Module (Future) โ”‚ โ”‚ +โ”‚ 
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ +โ”‚ โ”‚ +โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ +โ”‚ โ”‚ Core Components โ”‚ โ”‚ +โ”‚ โ”‚ โ”œโ”€โ”€ Command Executor โ€” applies YAML, runs SQL โ”‚ โ”‚ +โ”‚ โ”‚ โ”œโ”€โ”€ Infra Reporter โ€” node status, EBS, resource usage โ”‚ โ”‚ +โ”‚ โ”‚ โ”œโ”€โ”€ Health Monitor โ€” self-health, connectivity check โ”‚ โ”‚ +โ”‚ โ”‚ โ””โ”€โ”€ Config Manager โ€” agent config, hot reload โ”‚ โ”‚ +โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ +โ”‚ โ”‚ +โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ +โ”‚ โ”‚ Transport Layer โ”‚ โ”‚ +โ”‚ โ”‚ โ”œโ”€โ”€ gRPC Client (mTLS, outbound to svc-agent-relay) โ”‚ โ”‚ +โ”‚ โ”‚ โ”œโ”€โ”€ Event Buffer (in-memory, survives brief disconnects) โ”‚ โ”‚ +โ”‚ โ”‚ โ””โ”€โ”€ Heartbeat (every 30s to prove agent is alive) โ”‚ โ”‚ +โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +``` + +## CNPG Watcher + +Uses Kubernetes **informers** (via controller-runtime) to watch CNPG CRDs: +- `Cluster` โ€” status changes, failover events, replication lag +- `Backup` โ€” 
backup start/complete/fail events +- `ScheduledBackup` โ€” schedule status +- `Pooler` โ€” PgBouncer status, connection stats + +On any change โ†’ event streamed to Kiven SaaS via gRPC. + +## PG Stats Collector + +Connects to PostgreSQL directly (using credentials from CNPG-managed K8s Secret): +- `pg_stat_statements` โ€” query performance (every 60s) +- `pg_stat_activity` โ€” active queries, blocking (every 30s) +- `pg_stat_bgwriter` โ€” checkpoint/write stats (every 60s) +- `pg_stat_user_tables` โ€” table stats, dead tuples (every 300s) +- Custom queries for bloat detection, XID age (every 300s) + +**Important**: Query parameter values are **never collected**. Only query templates (`SELECT * FROM users WHERE id = $1`). + +## PG Log Collector + +Tails PostgreSQL pod logs via Kubernetes API: +- Filters for ERROR, WARNING, FATAL, PANIC levels +- Applies **log scrubbing**: replaces parameter values with `$N` +- Batches and streams to Kiven SaaS +- Detects patterns: slow queries, connection rejections, OOM + +## Command Executor + +Receives commands from Kiven SaaS (via gRPC stream) and executes them: + +| Command Type | What It Does | Example | +|-------------|-------------|---------| +| `apply_yaml` | Applies K8s manifest | Create/update CNPG Cluster, Pooler, Backup | +| `delete_resource` | Deletes K8s resource | Delete cluster on power-off (PVCs retained) | +| `run_sql` | Executes SQL via PG connection | CREATE USER, GRANT, ALTER SYSTEM | +| `install_helm` | Installs/upgrades Helm chart | Install CNPG operator | +| `collect_diagnostics` | Runs diagnostic checks | Prerequisites validation | + +Every command is: +- **Logged** with full audit trail (who requested, what was executed, result) +- **Idempotent** where possible (apply is naturally idempotent) +- **Validated** before execution (schema validation for YAML) +- **Reported** with result (success/failure + output) + +## Infra Reporter + +Reports infrastructure-level information: +- Node status (Ready/NotReady, 
capacity, allocatable) +- EBS volume usage (via PVC status + df) +- Resource consumption (CPU/memory per CNPG pod) +- Kubernetes version, CNPG operator version +- Storage classes available +- Namespace resource quotas + +--- + +# Communication Protocol + +## gRPC Service Definition (Simplified) + +```protobuf +service AgentRelay { + // Agent โ†’ SaaS: bidirectional stream for status and metrics + rpc Connect(stream AgentMessage) returns (stream ServerMessage); + + // Agent โ†’ SaaS: initial registration + rpc Register(RegisterRequest) returns (RegisterResponse); +} + +message AgentMessage { + oneof payload { + Heartbeat heartbeat = 1; + ClusterStatus cluster_status = 2; + MetricsBatch metrics = 3; + LogBatch logs = 4; + EventReport event = 5; + CommandResult command_result = 6; + InfraReport infra_report = 7; + } +} + +message ServerMessage { + oneof payload { + Command command = 1; + ConfigUpdate config_update = 2; + Ack ack = 3; + } +} +``` + +## Connection Lifecycle + +``` +Agent starts + โ”‚ + โ”œโ”€โ”€ 1. Load mTLS certificates (from K8s Secret) + โ”œโ”€โ”€ 2. Connect to svc-agent-relay (gRPC/mTLS) + โ”œโ”€โ”€ 3. Register: send agent ID, cluster info, CNPG version + โ”œโ”€โ”€ 4. Start bidirectional stream (Connect RPC) + โ”‚ + โ”‚ โ”Œโ”€โ”€ Agent โ†’ SaaS โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” + โ”‚ โ”‚ Heartbeat every 30s โ”‚ + โ”‚ โ”‚ Cluster status on change (informer events) โ”‚ + โ”‚ โ”‚ Metrics every 30-60s โ”‚ + โ”‚ โ”‚ Logs (filtered, scrubbed) on arrival โ”‚ + โ”‚ โ”‚ Command results after execution โ”‚ + โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ”‚ + โ”‚ โ”Œโ”€โ”€ SaaS โ†’ Agent โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” + โ”‚ โ”‚ Commands (apply_yaml, run_sql, etc.) 
โ”‚ + โ”‚ โ”‚ Config updates (collection intervals, log level) โ”‚ + โ”‚ โ”‚ Acknowledgements โ”‚ + โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ”‚ + โ””โ”€โ”€ On disconnect: buffer events, retry with exponential backoff + Databases continue running. No data loss. +``` + +--- + +# Deployment + +## Helm Chart + +```bash +helm install kiven-agent kiven/agent \ + --namespace kiven-system \ + --create-namespace \ + --set agentToken= \ + --set relay.endpoint=agent-relay.kiven.io:443 +``` + +## Kubernetes Resources Created + +| Resource | Namespace | Purpose | +|----------|-----------|---------| +| Deployment (1 replica) | kiven-system | The agent pod | +| ServiceAccount | kiven-system | Identity for RBAC | +| ClusterRole | โ€” | Read CNPG CRDs, read pods/logs, manage kiven-databases namespace | +| ClusterRoleBinding | โ€” | Binds role to ServiceAccount | +| Secret | kiven-system | mTLS certificates + agent token | +| ConfigMap | kiven-system | Agent configuration (intervals, log level) | + +## RBAC (Least Privilege) + +```yaml +rules: + # CNPG CRDs โ€” full access (for provisioning) + - apiGroups: ["postgresql.cnpg.io"] + resources: ["clusters", "backups", "scheduledbackups", "poolers"] + verbs: ["get", "list", "watch", "create", "update", "patch", "delete"] + + # Pods/logs โ€” read only (for metrics and log collection) + - apiGroups: [""] + resources: ["pods", "pods/log", "services", "secrets", "configmaps", "persistentvolumeclaims"] + verbs: ["get", "list", "watch"] + + # Namespaces โ€” manage kiven-databases + - apiGroups: [""] + resources: ["namespaces"] + verbs: ["get", "list", "watch", "create"] + + # Network policies โ€” create in kiven-databases + - apiGroups: ["networking.k8s.io"] + resources: ["networkpolicies"] + verbs: ["get", "list", "watch", "create", "update", "patch", "delete"] + + # Storage classes โ€” read (for prerequisites check) 
+ - apiGroups: ["storage.k8s.io"] + resources: ["storageclasses"] + verbs: ["get", "list"] + + # Nodes โ€” read (for infra reporting) + - apiGroups: [""] + resources: ["nodes"] + verbs: ["get", "list"] +``` + +--- + +# Failure Modes + +| Failure | Impact | Recovery | +|---------|--------|----------| +| **Agent pod crash** | Kiven dashboard shows "agent offline". Databases keep running. | K8s restarts pod automatically. Agent reconnects. | +| **gRPC connection lost** | Events buffered in memory. Dashboard shows stale data (with warning). | Agent retries with exponential backoff (1s, 2s, 4s, 8s... max 60s). | +| **Agent misconfigured** | Agent can't connect or authenticate. | Dashboard shows "agent not connected". Customer re-runs Helm install. | +| **CNPG operator not installed** | Agent reports "CNPG not found" during prerequisites check. | svc-provisioner installs CNPG operator via agent (install_helm command). | +| **Insufficient RBAC** | Agent commands fail with 403. | Agent reports permission error. Customer adjusts ClusterRoleBinding. | + +**Key invariant**: Agent failure NEVER affects running databases. CNPG operator manages PG independently. Agent is only for Kiven management plane. 
+
+---
+
+# Metrics Collected
+
+| Category | Metrics | Interval |
+|----------|---------|----------|
+| **PostgreSQL** | connections, QPS, transactions, replication lag, cache hit ratio | 30s |
+| **Queries** | top queries by time/calls, slow queries (> threshold), lock waits | 60s |
+| **Tables** | size, dead tuples, seq scans, idx scans, bloat estimate | 300s |
+| **System** | CPU, memory, disk usage (per PG pod) | 30s |
+| **CNPG** | cluster phase, timeline, instances ready, failover count | On change |
+| **Backups** | last backup time, duration, size, WAL archiving lag | On change |
+| **PgBouncer** | active/idle/waiting connections, pool utilization | 30s |
+| **Infrastructure** | node status, EBS IOPS, storage capacity | 60s |
+
+---
+
+*Maintained by: Agent Team*
+*Last updated: February 2026*
diff --git a/bootstrap/BOOTSTRAP-GUIDE.md b/bootstrap/BOOTSTRAP-GUIDE.md
new file mode 100644
index 0000000..2052ebe
--- /dev/null
+++ b/bootstrap/BOOTSTRAP-GUIDE.md
@@ -0,0 +1,251 @@
+# ๐Ÿฅš๐Ÿ” **Bootstrap Guide**
+## *LOCAL-PLUS Platform Initialization*
+
+---
+
+> **Back to**: [Architecture Overview](../EntrepriseArchitecture.md)
+
+---
+
+# ๐Ÿ“‹ **Table of Contents**
+
+1. [Architecture Overview](#architecture-overview)
+2. [Layer 0 โ€” Control Tower](#layer-0--control-tower)
+3. [Layer 1 โ€” Terraform Foundation](#layer-1--terraform-foundation)
+4. [Account Factory](#account-factory)
+5. [Platform Provisioning](#platform-provisioning)
+6. 
[Bootstrap Repository Structure](#bootstrap-repository-structure)
+
+---
+
+# ๐Ÿ—๏ธ **Architecture Overview**
+
+> **See [ADR-001](../adr/ADR-001-LANDING-ZONE-APPROACH.md) for the full rationale.**
+
+## Principle
+
+| Layer | Managed by | Responsibility |
+|-------|------------|----------------|
+| **Layer 0** | Control Tower | Org, SCPs, Logging, Audit |
+| **Layer 1** | Terraform | SSO, Custom Controls, Account Factory |
+| **Layer 2+** | Terraform | VPC, EKS, RDS, Platform |
+
+> *"Control Tower as the foundation, Terraform as the language."*
+
+## Diagram
+
+```
+โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
+โ”‚                         LAYER 0 โ€” CONTROL TOWER                          โ”‚
+โ”‚                       (Managed, Immutable, Audit)                        โ”‚
+โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
+โ”‚  โ€ข AWS Organizations           โ€ข CloudTrail org-level                    โ”‚
+โ”‚  โ€ข OUs (Security, Infra,       โ€ข AWS Config                              โ”‚
+โ”‚    Workloads, Suspended)       โ€ข Security Hub                            โ”‚
+โ”‚  โ€ข Guardrails (400+)           โ€ข Log Archive + Audit accounts            โ”‚
+โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
+                                     โ”‚
+                                     โ–ผ
+โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
+โ”‚                           LAYER 1 โ€” TERRAFORM                            โ”‚
+โ”‚                         (GitOps, GitHub Actions)                         โ”‚
+โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
+โ”‚  โ€ข SSO (groups, permission sets)                                         โ”‚
+โ”‚  โ€ข Custom Controls (aws_controltower_control)                            โ”‚
+โ”‚  โ€ข Account Factory (baseline: OIDC, KMS, S3 state)                       โ”‚
+โ”‚  โ€ข Shared Services Account                                               โ”‚
+โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
+                                     โ”‚
+                                     โ–ผ
+โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
+โ”‚                           LAYER 2+ โ€” PLATFORM                            โ”‚
+โ”‚                   (platform-application-provisioning/)                   โ”‚
+โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
+โ”‚  โ€ข VPC, Subnets, Transit Gateway                                         โ”‚
+โ”‚  โ€ข EKS Clusters                                                          โ”‚
+โ”‚  โ€ข RDS, Kafka, Cache (Aiven)                                             โ”‚
+โ”‚  โ€ข Observability stack                                                   โ”‚
+โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
+```
+
+---
+
+# ๐Ÿ”ง **Layer 0 โ€” Control Tower**
+
+> **Setup via the AWS Console (one-time)**
+
+## Prerequisites
+
+| Requirement | Status |
+|-------------|--------|
+| AWS Account (Management) | Required |
+| Email domain for accounts | Required |
+| Region: eu-west-1 | Required (GDPR) |
+
+## Steps
+
+> **See [BOOTSTRAP-RUNBOOK](../../bootstrap/docs/BOOTSTRAP-RUNBOOK.md) for detailed instructions.**
+
+| Step | Action | Description |
+|------|--------|-------------|
+| 1 | 
Choose setup preferences | Home region eu-west-1, Region deny enabled |
+| 2 | Create OUs | Security, Sandbox |
+| 3 | Configure Service integrations | Create Audit + Log Archive accounts |
+| 4 | Review and enable | ~45 min to complete |
+
+> โš ๏ธ **Important:** In Step 3, Config and CloudTrail require **different** accounts:
+> - AWS Config โ†’ **Audit** account
+> - CloudTrail โ†’ **Log Archive** account
+
+## What Control Tower creates
+
+> **See [ADR-001 Resource Inventory](../adr/ADR-001-LANDING-ZONE-APPROACH.md#control-tower-resource-inventory) for the complete list.**
+
+| Account | Resources created |
+|---------|-------------------|
+| **Management** | Organizations, Service Roles, CloudTrail, Config |
+| **Log Archive** | S3 Buckets (logs), KMS Key, Lifecycle Policies |
+| **Audit** | Config Aggregator, Security Hub, GuardDuty, IAM Access Analyzer |
+| **Workload (baseline)** | CloudTrail local, Config local, CT-managed roles |
+
+---
+
+# ๐Ÿ—๏ธ **Layer 1 โ€” Terraform Foundation**
+
+> **Repo: `bootstrap/`** โ€” GitHub Actions CI/CD
+
+## Terraform State โ€” S3
+
+Terraform state is stored in a versioned, encrypted S3 bucket (`localplus-terraform-state-mgmt`) with state locking via S3 native locking (`use_lockfile = true`).
+
+> **Future**: Migration to [Stategraph](https://stategraph.com/) planned for Q4 2026 (parallel plans, SQL queryable state).
+ +## What we manage in Terraform + +| Component | Module | Description | +|-----------|--------|-------------| +| **SSO** | `sso/` | Groups, Permission Sets, Assignments | +| **Custom Controls** | `control-tower/` | Additional controls via `aws_controltower_control` | +| **Account Factory** | `account-factory/` | AFT module via GitHub Actions | + +## SSO Groups + +| Group | Permission Set | Access | +|-------|----------------|--------| +| PlatformAdmins | AdministratorAccess | Full access | +| Developers | PowerUserAccess | No IAM changes | +| ReadOnly | ViewOnlyAccess | Read only | +| SecurityAuditors | SecurityAudit | Security review | +| OnCall | IncidentResponder | Break-glass | + +## Custom Controls + +| Control | Purpose | Target OU | +|---------|---------|-----------| +| Require IMDSv2 | EC2 metadata security | Workloads | +| Deny Public S3 | Data protection | All | +| Enforce EU Regions | GDPR compliance | All | +| Require Encryption | Data at rest | Workloads | + +--- + +# ๐Ÿญ **Account Factory** + +> Self-service account provisioning via PR + +## Workflow + +| Step | Action | Actor | +|------|--------|-------| +| 1 | Run `task account:create` | Developer | +| 2 | Fill account request YAML | Developer | +| 3 | Create PR | Developer | +| 4 | Review request | Platform Team | +| 5 | Merge PR | Platform Team | +| 6 | GitHub Actions applies Terraform | Automated | +| 7 | Account ready with baseline | Automated | + +## What's created per account + +| Resource | Purpose | +|----------|---------| +| AWS Account | In appropriate OU | +| Terraform state | S3 bucket (versioned, encrypted) | +| GitHub OIDC | CI/CD authentication | +| KMS Keys | Encryption (terraform, secrets, eks) | +| Security Baseline | EBS encryption, S3 block public | + +## Request fields + +| Field | Description | Example | +|-------|-------------|---------| +| account_name | Unique identifier | localplus-backend-dev | +| environment | dev / staging / prod | dev | +| owner_email | Team contact 
| backend@localplus.io | +| team | Owning team | backend | +| purpose | Business justification | Backend services development | + +--- + +# ๐Ÿ“ฆ **Platform Provisioning** + +> **Repo: `platform-application-provisioning/`** + +## Order of operations + +| Order | Resource | Dependencies | +|-------|----------|--------------| +| 1 | VPC + Subnets | Account created | +| 2 | KMS Keys | Account created | +| 3 | EKS Cluster | VPC, KMS | +| 4 | IRSA | EKS | +| 5 | VPC Peering (Aiven) | VPC, Aiven project | +| 6 | Flux | EKS | + +## Providers + +| Provider | Resources | Frequency | +|----------|-----------|-----------| +| AWS | VPC, EKS, KMS | 1x per environment | +| Aiven | PostgreSQL, Kafka, Valkey | 1x per environment | +| Cloudflare | DNS, WAF, Tunnel | 1x per zone | + +--- + +# ๐Ÿ“‹ **Bootstrap Repository Structure** + +| Directory | Purpose | +|-----------|---------| +| `.github/workflows/` | CI/CD pipelines (plan, apply, account-request) | +| `control-tower/` | Data sources + custom controls | +| `sso/` | Groups, Permission Sets | +| `account-factory/` | Account creation + baseline | +| `tests/checkov-policies/` | Custom compliance policies | +| `tests/compliance-bdd/` | BDD audit tests | +| `docs/` | Runbook | + +--- + +# ๐Ÿ”’ **Policy as Code** + +| Tool | Purpose | Format | +|------|---------|--------| +| **Trivy** | Security scanning (IaC + secrets) | Built-in | +| **Checkov** | 2000+ compliance policies | YAML | +| **terraform-compliance** | Audit-readable policies | BDD/Gherkin | + +--- + +# ๐Ÿ”— **Related Documentation** + +| Topic | Link | +|-------|------| +| **ADR Landing Zone** | [ADR-001](../adr/ADR-001-LANDING-ZONE-APPROACH.md) | +| **CI/CD & Delivery** | [Platform Engineering](../platform/PLATFORM-ENGINEERING.md) | +| **Security Setup** | [Security Architecture](../security/SECURITY-ARCHITECTURE.md) | +| **Networking** | [Networking Architecture](../networking/NETWORKING-ARCHITECTURE.md) | + +--- + +*Document maintenu par : Platform Team* +*Derniรจre 
mise ร  jour : Janvier 2026* diff --git a/data/DATA-ARCHITECTURE.md b/data/DATA-ARCHITECTURE.md new file mode 100644 index 0000000..1e39c3b --- /dev/null +++ b/data/DATA-ARCHITECTURE.md @@ -0,0 +1,297 @@ +# Data Architecture +## *Kiven โ€” Product Database, Kafka, Cache & Customer Database Model* + +--- + +> **Back to**: [Architecture Overview](../EntrepriseArchitecture.md) + +--- + +# Table of Contents + +1. [Two Data Domains](#two-data-domains) +2. [Kiven Product Database (SaaS)](#kiven-product-database-saas) +3. [Customer Databases (Managed by Kiven)](#customer-databases-managed-by-kiven) +4. [Kafka Topics](#kafka-topics) +5. [Cache Architecture (Valkey)](#cache-architecture-valkey) +6. [Data Isolation Principle](#data-isolation-principle) + +--- + +# Two Data Domains + +Kiven has **two completely separate data domains** that must never mix: + +``` +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ DOMAIN 1: Kiven Product Data โ”‚ โ”‚ DOMAIN 2: Customer Database Data โ”‚ +โ”‚ (lives in Kiven's AWS account) โ”‚ โ”‚ (lives in customer's AWS account) โ”‚ +โ”‚ โ”‚ โ”‚ โ”‚ +โ”‚ PostgreSQL (Aiven) โ€” product DB โ”‚ โ”‚ PostgreSQL (CNPG on customer EKS) โ”‚ +โ”‚ Kafka (Aiven) โ€” events โ”‚ โ”‚ Barman backups โ†’ customer's S3 โ”‚ +โ”‚ Valkey (Aiven) โ€” cache โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ Kiven NEVER accesses row data. โ”‚ +โ”‚ Contains: orgs, users, clusters, โ”‚ โ”‚ Agent collects only: pg_stat_*, โ”‚ +โ”‚ billing, audit, agent metadata โ”‚ โ”‚ logs, CRD status, metrics. 
โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +``` + +**Golden rule: Customer data never touches Kiven's infrastructure.** + +--- + +# Kiven Product Database (SaaS) + +## Aiven Configuration + +| Service | Plan | Config | Estimated Cost | +|---------|------|--------|----------------| +| **PostgreSQL** | Business-4 | Primary + Read Replica, 100GB | ~300 EUR/mo | +| **Kafka** | Business-4 | 3 brokers, 100GB retention | ~400 EUR/mo | +| **Valkey** | Business-4 | 2 nodes, 10GB, HA | ~150 EUR/mo | + +**Total estimated Aiven cost: ~850 EUR/mo** + +## Database Configuration + +| Aspect | Choice | Rationale | +|--------|--------|-----------| +| **Replication** | Aiven managed (async) | RPO 1h acceptable for product DB | +| **Backup** | Aiven automated hourly | RPO 1h | +| **Failover** | Aiven automated | RTO < 15min | +| **Connection** | VPC Peering (private) | No public internet | +| **Pooling** | PgBouncer (Aiven built-in) | Connection efficiency | + +## Schema Ownership + +### Core Tables + +| Table | Owner Service | Description | +|-------|---------------|-------------| +| `organizations` | svc-auth | Customer organizations | +| `users` | svc-auth | Dashboard users, roles, teams | +| `api_keys` | svc-auth | API key management | +| `clusters` | svc-clusters | Managed CNPG cluster metadata | +| `cluster_configs` | svc-yamleditor | YAML history, versions, diffs | +| `databases` | svc-users | PostgreSQL databases within clusters | +| `database_users` | svc-users | PostgreSQL roles within clusters | +| `backups` | svc-backups | Backup records, status, PITR points | +| `backup_verifications` | svc-backups | Restore test results | +| `agents` | svc-agent-relay | Registered agents, heartbeat status | +| `provisioning_jobs` | svc-provisioner | Provisioning pipeline state machine | +| 
`infra_resources` | svc-infra | Customer AWS resources (node groups, EBS, S3, IAM) | +| `service_plans` | svc-clusters | Plan definitions (Hobbyist, Startup, Business...) | +| `metrics_snapshots` | svc-monitoring | Aggregated metrics for dashboard display | +| `alerts` | svc-monitoring | Alert rules and status | +| `dba_recommendations` | svc-monitoring | Performance advisor suggestions | +| `audit_log` | svc-audit | Immutable audit trail | +| `billing_subscriptions` | svc-billing | Stripe subscriptions, usage | +| `invoices` | svc-billing | Invoice records | +| `migrations` | svc-migrations | Migration jobs (from Aiven/RDS) | + +### Power Schedule Tables + +| Table | Owner Service | Description | +|-------|---------------|-------------| +| `power_schedules` | svc-clusters | Scheduled power on/off rules | +| `power_events` | svc-clusters | Power on/off event history | + +**Rule: 1 table = 1 owner. Cross-service communication = gRPC or Kafka events, never JOINs.** + +## Connection Best Practices + +| Parameter | Recommended Value | Rationale | +|-----------|-------------------|-----------| +| **pool_size** | 20 | Connections per service pod | +| **max_overflow** | 10 | Extra connections at peak | +| **pool_timeout** | 30s | Max wait for connection | +| **pool_recycle** | 1800s | Recycle connections every 30min | +| **ssl** | require | Always encrypted | + +--- + +# Customer Databases (Managed by Kiven) + +## What Kiven Provisions + +For each customer database, Kiven creates: + +| Resource | Type | Where | Managed By | +|----------|------|-------|------------| +| CNPG Cluster CR | Kubernetes CRD | Customer K8s | Kiven agent | +| PostgreSQL pods | Pods (Primary + Replicas) | Customer K8s | CNPG operator | +| PgBouncer Pooler | Kubernetes CRD | Customer K8s | CNPG operator | +| EBS volumes | AWS EBS gp3 | Customer AWS | Kiven svc-infra | +| S3 backup bucket | AWS S3 | Customer AWS | Kiven svc-infra | +| IRSA role | AWS IAM | Customer AWS | Kiven svc-infra | +| 
ScheduledBackup CR | Kubernetes CRD | Customer K8s | Kiven agent | +| NetworkPolicy | Kubernetes | Customer K8s | Kiven agent | + +## Service Plan โ†’ Infrastructure Mapping + +| Plan | Node Type | Instances | Storage | Backup Freq | PgBouncer Pool | +|------|-----------|-----------|---------|-------------|----------------| +| **Hobbyist** | t3.small | 1 | 10GB gp3 | Daily | 25 | +| **Startup** | r6g.medium | 2 | 50GB gp3 | 6h | 50 | +| **Business** | r6g.large | 3 | 100GB gp3 (3000 IOPS) | 1h | 100 | +| **Premium** | r6g.xlarge | 3 | 500GB gp3 (6000 IOPS) | 30min | 200 | +| **Custom** | Any | 1-5 | Custom | Custom | Custom | + +## Auto-Tuned postgresql.conf per Plan + +| Parameter | Hobbyist | Startup | Business | Premium | +|-----------|----------|---------|----------|---------| +| `shared_buffers` | 256MB | 1GB | 4GB | 8GB | +| `effective_cache_size` | 768MB | 3GB | 12GB | 24GB | +| `work_mem` | 4MB | 16MB | 32MB | 64MB | +| `maintenance_work_mem` | 64MB | 256MB | 512MB | 1GB | +| `max_connections` | 50 | 100 | 200 | 400 | +| `wal_buffers` | 8MB | 16MB | 32MB | 64MB | +| `random_page_cost` | 1.1 | 1.1 | 1.1 | 1.1 | +| `effective_io_concurrency` | 200 | 200 | 200 | 200 | +| `checkpoint_completion_target` | 0.9 | 0.9 | 0.9 | 0.9 | + +These values are the **defaults per plan**. The DBA intelligence engine adjusts them based on real workload over time. + +## What Kiven Collects (Metadata Only โ€” Never Row Data) + +| Data Collected | Source | Purpose | Contains PII? 
| +|----------------|--------|---------|---------------| +| `pg_stat_statements` | PG catalog | Query performance analysis | No (queries anonymized) | +| `pg_stat_activity` | PG catalog | Active connections, blocking | No | +| `pg_stat_bgwriter` | PG catalog | Checkpoint/write performance | No | +| `pg_stat_user_tables` | PG catalog | Table size, seq/idx scans | No | +| CNPG Cluster status | K8s CRD | Cluster health, replication lag | No | +| Pod metrics | Kubelet | CPU, memory, disk usage | No | +| PG logs | Pod logs | Error detection, slow queries | Potentially (log scrubbing applied) | +| Node status | K8s API | Node health, capacity | No | +| EBS metrics | CloudWatch | Disk IOPS, latency | No | + +**Log scrubbing**: The agent strips potential PII from PG logs before sending to Kiven (query parameter values replaced with `$N`). + +--- + +# Kafka Topics + +## Topic Configuration + +| Topic | Producer | Consumers | Retention | Purpose | +|-------|----------|-----------|-----------|---------| +| `agent.status.v1` | Agent (via relay) | svc-clusters, svc-monitoring | 7 days | Cluster status updates | +| `agent.metrics.v1` | Agent (via relay) | svc-monitoring | 3 days | PG metrics stream | +| `agent.logs.v1` | Agent (via relay) | svc-monitoring | 3 days | PG log stream | +| `agent.events.v1` | Agent (via relay) | svc-clusters, svc-notification | 7 days | Failover, backup, error events | +| `provisioning.commands.v1` | svc-provisioner | Agent (via relay) | 1 day | Commands to execute in customer K8s | +| `provisioning.status.v1` | svc-provisioner | svc-api, dashboard | 7 days | Provisioning pipeline progress | +| `audit.actions.v1` | All services | svc-audit | 30 days | Immutable audit trail | +| `billing.usage.v1` | svc-monitoring | svc-billing | 30 days | Per-cluster usage metrics | +| `alerts.triggered.v1` | svc-monitoring | svc-notification | 7 days | Alert events for dispatch | +| `dba.recommendations.v1` | svc-monitoring | svc-api, dashboard | 7 days | DBA 
intelligence suggestions | + +## Topic Naming Convention + +``` +{domain}.{entity}.{version} + +Examples: + agent.status.v1 + provisioning.commands.v1 + audit.actions.v1 +``` + +## Kafka Monitoring + +| Metric | Alert Threshold | Severity | +|--------|----------------|----------| +| **Consumer Lag** | > 1000 messages | P2 | +| **Under-replicated Partitions** | > 0 | P1 | +| **Active Controller Count** | != 1 | P1 | +| **Offline Partitions** | > 0 | P1 | +| **Request Latency P99** | > 100ms | P2 | + +--- + +# Cache Architecture (Valkey) + +## Cache Stack + +| Component | Tool | Hosting | Estimated Cost | +|-----------|------|---------|----------------| +| **Distributed cache** | Valkey (Redis-compatible) | Aiven | ~150 EUR/mo | +| **Local cache (L1)** | Go `bigcache` | In-memory per pod | 0 EUR | + +## Cache Use Cases + +| Use Case | Strategy | TTL | Invalidation | +|----------|----------|-----|--------------| +| **Session data** | Write-through | 24h | Explicit logout | +| **Cluster status** | Cache-aside | 30s | Agent event | +| **Org/team config** | Read-through | 5min | TTL + manual | +| **Rate limiting** | Write-through | Sliding window | Auto-expire | +| **API response cache** | Cache-aside | 1min | TTL | +| **Agent connection state** | Write-through | Heartbeat interval | Agent disconnect | +| **Service plan definitions** | Read-through | 1h | Manual invalidation | + +## Cache Key Naming Convention + +``` +{service}:{entity}:{id}:{version} + +Examples: + auth:session:sess_abc123 + clusters:status:cluster_456:v1 + monitoring:metrics:cluster_456:latest + ratelimit:api:org_789:minute + plans:definition:business:v1 +``` + +## Cache Metrics + +| Metric | Alert Threshold | Action | +|--------|----------------|--------| +| **Hit Rate** | < 80% | Review TTL, preloading | +| **Latency P99** | > 10ms | Check network, cluster size | +| **Memory Usage** | > 80% | Eviction analysis, scale up | +| **Connection Errors** | > 0 | Check connectivity | + +--- + +# Data 
Isolation Principle + +``` +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ DATA ISOLATION MODEL โ”‚ +โ”‚ โ”‚ +โ”‚ โ”Œโ”€โ”€โ”€ Kiven SaaS โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ +โ”‚ โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ Product DB (Aiven PG) Kafka (Aiven) Valkey (Aiven) โ”‚ โ”‚ +โ”‚ โ”‚ โ”œโ”€ organizations โ”œโ”€ agent events โ”œโ”€ sessions โ”‚ โ”‚ +โ”‚ โ”‚ โ”œโ”€ clusters (metadata) โ”œโ”€ audit trail โ”œโ”€ rate limits โ”‚ โ”‚ +โ”‚ โ”‚ โ”œโ”€ audit_log โ”œโ”€ alerts โ”œโ”€ cache โ”‚ โ”‚ +โ”‚ โ”‚ โ””โ”€ billing โ””โ”€ billing usage โ””โ”€ agent state โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ CONTAINS: Metadata, config, status, metrics aggregates โ”‚ โ”‚ +โ”‚ โ”‚ NEVER CONTAINS: Customer's actual database rows โ”‚ โ”‚ +โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ +โ”‚ โ”‚ +โ”‚ โ”Œโ”€โ”€โ”€ Customer A's AWS โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€ Customer B's AWS โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ +โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ CNPG PostgreSQL โ”‚ โ”‚ CNPG PostgreSQL โ”‚ โ”‚ +โ”‚ โ”‚ โ”œโ”€ Their app data โ”‚ โ”‚ โ”œโ”€ Their app data โ”‚ โ”‚ +โ”‚ โ”‚ โ””โ”€ Their users โ”‚ โ”‚ โ””โ”€ Their users โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ S3: Their backups โ”‚ โ”‚ S3: Their backups โ”‚ โ”‚ +โ”‚ โ”‚ EBS: Their volumes โ”‚ โ”‚ EBS: Their volumes โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ KIVEN NEVER READS THIS โ”‚ โ”‚ KIVEN NEVER READS THIS โ”‚ โ”‚ +โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ 
โ”‚ +โ”‚ โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +``` + +This isolation is **fundamental to Kiven's value proposition**: the customer's data never leaves their infrastructure. Kiven only manages the infrastructure and configuration around it. + +--- + +*Maintained by: Platform Team + Backend Team* +*Last updated: February 2026* diff --git a/development/LOCAL-DEV-GUIDE.md b/development/LOCAL-DEV-GUIDE.md new file mode 100644 index 0000000..8b188bc --- /dev/null +++ b/development/LOCAL-DEV-GUIDE.md @@ -0,0 +1,576 @@ +# Kiven โ€” Development Strategy & Local Dev Guide + +> **Back to**: [Architecture Overview](../EntrepriseArchitecture.md) + +--- + +# Development Environments + +## Overview + +``` +Level 1: LOCAL ($0/mo) Level 2: SANDBOX (~$400/mo) Level 3: STAGING/PROD +โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ +kiven-dev repo AWS Account: kiven-sandbox AWS Accounts: + Docker Compose (shared) โ”œโ”€โ”€ EKS "kiven-dev" kiven-staging + kind cluster (shared) โ”‚ โ””โ”€โ”€ Kiven services kiven-prod + Tilt (orchestrates all) โ””โ”€โ”€ EKS "test-client" + โ””โ”€โ”€ Agent + CNPG Real Aiven (PG, Kafka) +Each svc-* repo: Real customers + Go code + Dockerfile Full AWS integration: + task init (mise + tools) node groups, EBS, S3, IAM + Own CI (reusable GH wf) + Flux deployment +``` + +--- + +# Architecture: Polyrepo + Dev Orchestrator + +## Why Polyrepo + +- Each service has its **own repo**, its **own CI** (reusable GitHub workflows), its **own release cycle** +- Shared infrastructure (Docker Compose, kind, Tilt) lives in **one `kiven-dev` repo** +- Shared Go code lives in **`kiven-go-sdk`** (imported as a Go module) +- No 
port conflicts, no duplicate infra + +## Repo Layout + +``` +kivenio/ โ† GitHub Organization +โ”‚ +โ”œโ”€โ”€ kiven-dev/ โ† DEV ORCHESTRATOR (this section) +โ”‚ โ”œโ”€โ”€ docker-compose.yml โ† ONE PostgreSQL, Redpanda, Valkey, MinIO +โ”‚ โ”œโ”€โ”€ kind/ +โ”‚ โ”‚ โ”œโ”€โ”€ cluster.yaml โ† ONE kind cluster +โ”‚ โ”‚ โ””โ”€โ”€ cnpg-test-cluster.yaml โ† Test PG cluster (simulates customer DB) +โ”‚ โ”œโ”€โ”€ Tiltfile โ† Orchestrates ALL services for local dev +โ”‚ โ”œโ”€โ”€ init-db.sql โ† Product DB schema + seed data +โ”‚ โ”œโ”€โ”€ .mise.toml โ† Shared tool versions (Go, Node, kubectl, helm...) +โ”‚ โ””โ”€โ”€ Taskfile.yml โ† task dev, task infra:up, task kind:create +โ”‚ +โ”œโ”€โ”€ kiven-go-sdk/ โ† SHARED GO CODE (imported as module) +โ”‚ โ”œโ”€โ”€ provider/ โ† Provider interface (provider.go, registry.go) +โ”‚ โ”œโ”€โ”€ grpcapi/ โ† gRPC types (generated from proto) +โ”‚ โ”œโ”€โ”€ models/ โ† Shared domain models +โ”‚ โ””โ”€โ”€ go.mod โ† github.com/kivenio/kiven-go-sdk +โ”‚ +โ”œโ”€โ”€ contracts-proto/ โ† PROTOBUF DEFINITIONS +โ”‚ โ”œโ”€โ”€ agent/v1/agent.proto โ† Agent โ†” SaaS protocol +โ”‚ โ”œโ”€โ”€ api/v1/services.proto โ† REST API types +โ”‚ โ””โ”€โ”€ buf.yaml +โ”‚ +โ”œโ”€โ”€ svc-api/ โ† SERVICE REPO (one of many) +โ”‚ โ”œโ”€โ”€ cmd/main.go +โ”‚ โ”œโ”€โ”€ internal/ โ† Service-specific logic +โ”‚ โ”œโ”€โ”€ Dockerfile +โ”‚ โ”œโ”€โ”€ .mise.toml โ† Tool versions for THIS service +โ”‚ โ”œโ”€โ”€ Taskfile.yml โ† task init, task run, task test, task build +โ”‚ โ”œโ”€โ”€ .github/workflows/ci.yml โ† Uses reusable workflow +โ”‚ โ””โ”€โ”€ go.mod โ† imports github.com/kivenio/kiven-go-sdk +โ”‚ +โ”œโ”€โ”€ svc-provisioner/ โ† Same structure as svc-api +โ”œโ”€โ”€ svc-agent-relay/ โ† Same structure +โ”œโ”€โ”€ svc-clusters/ โ† Same structure +โ”œโ”€โ”€ svc-backups/ โ† Same structure +โ”œโ”€โ”€ svc-monitoring/ โ† Same structure +โ”œโ”€โ”€ svc-users/ โ† Same structure +โ”œโ”€โ”€ kiven-agent/ โ† Same structure (deployed in customer K8s) +โ”œโ”€โ”€ 
provider-cnpg/ โ† Same structure (Go library) +โ”œโ”€โ”€ dashboard/ โ† Next.js frontend +โ”‚ โ”œโ”€โ”€ src/ +โ”‚ โ”œโ”€โ”€ .mise.toml +โ”‚ โ”œโ”€โ”€ Taskfile.yml +โ”‚ โ””โ”€โ”€ package.json +โ”‚ +โ””โ”€โ”€ platform-github-management/ โ† Repo management + reusable workflows + โ””โ”€โ”€ .github/workflows/ + โ””โ”€โ”€ reusable-go-ci.yml โ† Reusable CI for all Go services +``` + +## How It Fits Together + +``` +โ”Œโ”€โ”€โ”€ kiven-dev (Dev Orchestrator) โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ โ”‚ +โ”‚ Docker Compose (shared infra) kind cluster (shared K8s) โ”‚ +โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ +โ”‚ โ”‚Postgresโ”‚ โ”‚Redpandaโ”‚ โ”‚ CNPG Operator โ”‚ โ”‚ +โ”‚ โ”‚ :5432 โ”‚ โ”‚ :19092 โ”‚ โ”‚ kiven-agent (from repo) โ”‚ โ”‚ +โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ test-pg cluster โ”‚ โ”‚ +โ”‚ โ”‚ Valkey โ”‚ โ”‚ MinIO โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ +โ”‚ โ”‚ :6379 โ”‚ โ”‚ :9000 โ”‚ โ”‚ +โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ +โ”‚ โ”‚ +โ”‚ Tilt (watches all service repos, builds & runs them) โ”‚ +โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ +โ”‚ โ”‚svc-api โ”‚ โ”‚svc-agent-relayโ”‚ โ”‚svc-provisionerโ”‚ โ”‚svc-clustersโ”‚ โ”‚ +โ”‚ โ”‚ :8080 โ”‚ โ”‚ :9090 gRPC โ”‚ โ”‚ :8082 โ”‚ โ”‚ :8083 โ”‚ โ”‚ +โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ +โ”‚ โ”‚ +โ”‚ Dashboard (from dashboard/ repo) โ”‚ +โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ +โ”‚ โ”‚ Next.js โ”‚ โ”‚ +โ”‚ โ”‚ :3000 โ”‚ โ”‚ +โ”‚ 
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +``` + +--- + +# Level 1: Local Development + +## Prerequisites (via mise) + +Each repo has a `.mise.toml`. The `kiven-dev` repo has the global one: + +```toml +# kiven-dev/.mise.toml +[tools] +go = "1.23" +node = "22" +kubectl = "latest" +helm = "latest" +kind = "latest" +tilt = "latest" +buf = "latest" +task = "latest" +golangci-lint = "latest" +``` + +```bash +# Install mise (one time) +curl https://mise.run | sh + +# Install all tools (in kiven-dev/) +mise install +``` + +## Quick Start + +```bash +# 1. Clone the dev orchestrator +git clone git@github.com:kivenio/kiven-dev.git +cd kiven-dev + +# 2. Install tools via mise +mise install + +# 3. Clone the service repos you need (siblings of kiven-dev) +task repos:clone # Clones all service repos next to kiven-dev/ + +# 4. Start shared infrastructure +task infra:up # Docker Compose: PostgreSQL, Redpanda, Valkey, MinIO + +# 5. Create kind cluster + CNPG +task kind:create # kind cluster + CNPG operator + namespaces + +# 6. Deploy test PostgreSQL cluster +task cnpg:deploy # Creates a CNPG PostgreSQL cluster inside kind + # This simulates a customer's database. + # The agent watches this cluster and reports to svc-agent-relay. + +# 7. Start all services with Tilt +tilt up # Watches all repos, builds, runs, shows logs + # Open http://localhost:10350 for Tilt dashboard + +# OR start individual services manually: +task svc:api # Runs svc-api from ../svc-api/ +task svc:relay # Runs svc-agent-relay from ../svc-agent-relay/ +task agent # Runs kiven-agent from ../kiven-agent/ +task frontend # Runs dashboard from ../dashboard/ +``` + +## What Is the Test CNPG Cluster? 
+
+When you run `task cnpg:deploy`, `kiven-dev` creates a **real PostgreSQL cluster** inside kind using the CloudNativePG operator. This is what happens:
+
+```
+1. CNPG Operator (already installed in kind) receives a Cluster YAML
+2. Operator creates a PostgreSQL pod (test-pg-1) in namespace kiven-databases
+3. PostgreSQL starts with:
+   - Database: "app"
+   - User: "app_user" (password in K8s Secret)
+   - pg_stat_statements enabled
+   - Logs slow queries > 200ms
+
+This cluster simulates a REAL CUSTOMER DATABASE.
+The Kiven agent watches it, collects metrics, and reports to svc-agent-relay.
+When you test provisioning in the dashboard, THIS is the cluster you see.
+```
+
+You can connect to it directly:
+```bash
+# Port-forward to the test PostgreSQL
+kubectl port-forward -n kiven-databases svc/test-pg-rw 15432:5432
+
+# Get the password from the K8s Secret
+export PGPASSWORD=$(kubectl get secret -n kiven-databases test-pg-app -o jsonpath='{.data.password}' | base64 -d)
+
+# Connect with psql (picks up PGPASSWORD automatically)
+psql postgresql://app_user@localhost:15432/app
+```
+
+## What You Can Test Locally
+
+| Feature | Works Locally? | How |
+|---------|---------------|-----|
+| Agent ↔ CNPG | Yes | Agent watches test-pg in kind |
+| Agent ↔ svc-agent-relay | Yes | gRPC on localhost:9090 |
+| CNPG cluster provisioning | Yes | Agent applies YAML to kind |
+| Backup to S3 | Yes | MinIO at localhost:9000 (S3-compatible) |
+| PG metrics collection | Yes | Agent reads pg_stat_* from test-pg |
+| User/database management | Yes | Agent runs SQL on test-pg |
+| PgBouncer pooling | Yes | CNPG Pooler CRD on kind |
+| Dashboard ↔ API | Yes | Next.js :3000 → svc-api :8080 |
+| YAML editor (Advanced Mode) | Yes | svc-yamleditor generates YAML |
+| DBA intelligence | Yes | svc-monitoring analyzes PG stats |
+| Power off / Power on | Partial | Delete/recreate CNPG cluster. No node management. 
| +| **AWS node groups** | **No** | Needs real AWS (Level 2) | +| **AWS EBS/S3/IAM** | **No** | Needs real AWS or LocalStack | +| **Cross-account IAM** | **No** | Needs sandbox | +| **Multi-AZ** | **No** | kind is single-node | + +## Stopping + +```bash +tilt down # Stop all services +task infra:down # Stop Docker Compose +task kind:delete # Delete kind cluster +``` + +--- + +# Service Repo Structure + +Every Go service repo follows the same structure: + +``` +kivenio/svc-api/ โ† Example service +โ”œโ”€โ”€ cmd/ +โ”‚ โ””โ”€โ”€ main.go โ† Entry point +โ”œโ”€โ”€ internal/ +โ”‚ โ”œโ”€โ”€ handler/ โ† HTTP/gRPC handlers +โ”‚ โ”œโ”€โ”€ service/ โ† Business logic +โ”‚ โ””โ”€โ”€ repository/ โ† Database access +โ”œโ”€โ”€ migrations/ โ† SQL migrations (if needed) +โ”œโ”€โ”€ Dockerfile โ† Multi-stage build +โ”œโ”€โ”€ .mise.toml โ† Tool versions for this service +โ”œโ”€โ”€ Taskfile.yml โ† Service-level tasks +โ”œโ”€โ”€ go.mod โ† imports github.com/kivenio/kiven-go-sdk +โ”œโ”€โ”€ go.sum +โ”œโ”€โ”€ .github/ +โ”‚ โ””โ”€โ”€ workflows/ +โ”‚ โ””โ”€โ”€ ci.yml โ† Uses reusable workflow from platform-github-management +โ”œโ”€โ”€ .golangci.yml โ† Linter config +โ”œโ”€โ”€ .gitignore +โ””โ”€โ”€ README.md +``` + +## Service Taskfile (per repo) + +Each service has its own `Taskfile.yml`: + +```yaml +# svc-api/Taskfile.yml +version: "3" + +tasks: + init: + desc: "Initialize dev environment (mise + tools + dependencies)" + cmds: + - mise install + - go mod download + - echo "โœ… Ready! Run 'task run' to start." + + run: + desc: "Run the service locally" + env: + DATABASE_URL: "postgres://kiven:kiven-local-dev@localhost:5432/kiven?sslmode=disable" + KAFKA_BROKERS: "localhost:19092" + VALKEY_ADDR: "localhost:6379" + PORT: "8080" + cmds: + - go run ./cmd/ + + test: + desc: "Run unit tests" + cmds: + - go test ./... -v -count=1 -race + + test:coverage: + desc: "Run tests with coverage" + cmds: + - go test ./... 
-coverprofile=coverage.out -race + - go tool cover -html=coverage.out -o coverage.html + + build: + desc: "Build binary" + cmds: + - go build -o bin/svc-api ./cmd/ + + lint: + desc: "Run linter" + cmds: + - golangci-lint run ./... + + docker:build: + desc: "Build Docker image" + cmds: + - docker build -t kivenio/svc-api:dev . + + proto:generate: + desc: "Generate Go code from proto (if this service uses gRPC)" + cmds: + - buf generate +``` + +## Reusable GitHub Workflow + +All service repos use the same CI workflow: + +```yaml +# svc-api/.github/workflows/ci.yml +name: CI +on: + push: + branches: [main] + pull_request: + +jobs: + ci: + uses: kivenio/platform-github-management/.github/workflows/reusable-go-ci.yml@main + with: + go-version: "1.23" + secrets: inherit +``` + +The reusable workflow (in `platform-github-management`) handles: +- Go build, test, lint +- Security scan (trivy) +- Docker build + push (on main) +- Deploy to sandbox (on main, if configured) + +--- + +# Shared Go Code: kiven-go-sdk + +Shared code that multiple services import: + +``` +kivenio/kiven-go-sdk/ +โ”œโ”€โ”€ provider/ +โ”‚ โ”œโ”€โ”€ provider.go โ† Provider interface (30+ methods) +โ”‚ โ””โ”€โ”€ registry.go โ† Provider registry +โ”œโ”€โ”€ grpcapi/ +โ”‚ โ””โ”€โ”€ (generated from contracts-proto) +โ”œโ”€โ”€ models/ +โ”‚ โ”œโ”€โ”€ service.go โ† Service, Plan, Backup types +โ”‚ โ”œโ”€โ”€ cluster.go โ† ClusterSpec, ClusterStatus +โ”‚ โ””โ”€โ”€ user.go โ† DatabaseUser, UserSpec +โ”œโ”€โ”€ config/ +โ”‚ โ””โ”€โ”€ config.go โ† Shared config loading (env vars) +โ”œโ”€โ”€ telemetry/ +โ”‚ โ””โ”€โ”€ otel.go โ† OpenTelemetry setup +โ”œโ”€โ”€ go.mod โ† github.com/kivenio/kiven-go-sdk +โ””โ”€โ”€ go.sum +``` + +Each service imports it: +```go +// svc-api/go.mod +module github.com/kivenio/svc-api + +require github.com/kivenio/kiven-go-sdk v0.1.0 +``` + +--- + +# Tilt Configuration (kiven-dev) + +Tilt watches all service repos and orchestrates local development: + +```python +# kiven-dev/Tiltfile + 
+# --- Shared infrastructure (already running via Docker Compose) ---
+# PostgreSQL :5432, Redpanda :19092, Valkey :6379, MinIO :9000
+
+# --- Go services ---
+# NOTE: serve_env (not env) is the parameter that passes variables to serve_cmd.
+local_resource('svc-api',
+    serve_cmd='cd ../svc-api && go run ./cmd/',
+    deps=['../svc-api/cmd/', '../svc-api/internal/'],
+    serve_env={
+        'PORT': '8080',
+        'DATABASE_URL': 'postgres://kiven:kiven-local-dev@localhost:5432/kiven?sslmode=disable',
+        'KAFKA_BROKERS': 'localhost:19092',
+        'VALKEY_ADDR': 'localhost:6379',
+    },
+    labels=['backend'],
+)
+
+local_resource('svc-agent-relay',
+    serve_cmd='cd ../svc-agent-relay && go run ./cmd/',
+    deps=['../svc-agent-relay/cmd/', '../svc-agent-relay/internal/'],
+    serve_env={'GRPC_PORT': '9090'},
+    labels=['backend'],
+)
+
+local_resource('svc-provisioner',
+    serve_cmd='cd ../svc-provisioner && go run ./cmd/',
+    deps=['../svc-provisioner/cmd/', '../svc-provisioner/internal/'],
+    serve_env={
+        'PORT': '8082',
+        'DATABASE_URL': 'postgres://kiven:kiven-local-dev@localhost:5432/kiven?sslmode=disable',
+    },
+    labels=['backend'],
+)
+
+local_resource('svc-clusters',
+    serve_cmd='cd ../svc-clusters && go run ./cmd/',
+    deps=['../svc-clusters/cmd/', '../svc-clusters/internal/'],
+    serve_env={
+        'PORT': '8083',
+        'DATABASE_URL': 'postgres://kiven:kiven-local-dev@localhost:5432/kiven?sslmode=disable',
+    },
+    labels=['backend'],
+)
+
+# --- Agent (runs against kind cluster) ---
+local_resource('kiven-agent',
+    serve_cmd='cd ../kiven-agent && go run ./cmd/',
+    deps=['../kiven-agent/cmd/', '../kiven-agent/internal/'],
+    serve_env={
+        'KUBECONFIG': os.environ.get('KUBECONFIG', os.path.expanduser('~/.kube/config')),
+        'RELAY_ENDPOINT': 'localhost:9090',
+    },
+    labels=['agent'],
+)
+
+# --- Frontend ---
+local_resource('dashboard',
+    serve_cmd='cd ../dashboard && npm run dev',
+    deps=['../dashboard/src/'],
+    labels=['frontend'],
+)
+```
+
+Tilt UI at `http://localhost:10350` shows all services, logs, status, restart buttons. 
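+
+Each service consumes these variables at startup through the shared config loading in `kiven-go-sdk` (`config/config.go`). That package's API is not shown in this guide, so the sketch below is illustrative only — the `Config` struct and `LoadConfig` name are assumptions; only the variable names and local-dev defaults come from the Tiltfile above:
+
+```go
+package main
+
+import (
+	"fmt"
+	"os"
+)
+
+// Config mirrors the environment variables the Tiltfile injects into each
+// service. The struct and function names here are illustrative; the real
+// loader lives in kiven-go-sdk's config package and may differ.
+type Config struct {
+	Port         string
+	DatabaseURL  string
+	KafkaBrokers string
+	ValkeyAddr   string
+}
+
+// getenv returns the value of key, or fallback when the variable is unset.
+func getenv(key, fallback string) string {
+	if v := os.Getenv(key); v != "" {
+		return v
+	}
+	return fallback
+}
+
+// LoadConfig reads the variables Tilt sets, falling back to the local-dev
+// defaults provided by kiven-dev's Docker Compose stack.
+func LoadConfig() Config {
+	return Config{
+		Port:         getenv("PORT", "8080"),
+		DatabaseURL:  getenv("DATABASE_URL", "postgres://kiven:kiven-local-dev@localhost:5432/kiven?sslmode=disable"),
+		KafkaBrokers: getenv("KAFKA_BROKERS", "localhost:19092"),
+		ValkeyAddr:   getenv("VALKEY_ADDR", "localhost:6379"),
+	}
+}
+
+func main() {
+	cfg := LoadConfig()
+	fmt.Println("Kafka brokers:", cfg.KafkaBrokers)
+}
+```
+
+Running a service outside Tilt (e.g. `task svc:api`) behaves the same way, because the service Taskfiles export the identical variables.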
+ +--- + +# Level 2: Sandbox (AWS) + +## When to Use + +Move to Level 2 when: +- Agent and core services work locally +- You need to test svc-infra (real AWS APIs) +- You need to test full provisioning pipeline (node groups โ†’ CNPG) +- You're preparing for first customer demo + +## Architecture + +``` +AWS Account: kiven-sandbox (eu-west-1) +โ”‚ +โ”œโ”€โ”€ EKS "kiven-dev" +โ”‚ โ”œโ”€โ”€ Kiven services (deployed via Flux) +โ”‚ โ”œโ”€โ”€ Aiven VPC peering (product DB) +โ”‚ โ””โ”€โ”€ Platform stack (Prometheus, Loki, Flux) +โ”‚ +โ”œโ”€โ”€ EKS "test-client" +โ”‚ โ”œโ”€โ”€ Simulates a real customer cluster +โ”‚ โ”œโ”€โ”€ Kiven agent installed +โ”‚ โ”œโ”€โ”€ CNPG operator (installed by Kiven) +โ”‚ โ””โ”€โ”€ Full provisioning: +โ”‚ โ”œโ”€โ”€ Dedicated node group (created by svc-infra) +โ”‚ โ”œโ”€โ”€ CNPG cluster (created by agent) +โ”‚ โ”œโ”€โ”€ S3 backups (created by svc-infra) +โ”‚ โ””โ”€โ”€ Network policies, storage classes, IRSA +โ”‚ +โ”œโ”€โ”€ S3: kiven-backups-test-client +โ”œโ”€โ”€ IAM: KivenAccessRole (simulates customer role) +โ””โ”€โ”€ IAM: IRSA roles for CNPG +``` + +## Cost Optimization + +| Resource | Cost | Optimization | +|----------|------|-------------| +| EKS control plane x 2 | ~$146/mo | Can't avoid | +| EC2 nodes (kiven-dev, 2x t3.medium) | ~$60/mo | Power off nights/weekends | +| EC2 nodes (test-client, 2x t3.medium) | ~$60/mo | Power off when not testing | +| EBS volumes | ~$20/mo | Delete test data regularly | +| S3 | ~$5/mo | Lifecycle rules | +| **Total** | **~$300/mo** | **~$150/mo with power schedules** | + +--- + +# Level 3: Staging & Production + +Only needed when the product is ready for real customers. 
+ +| Environment | Account | EKS | Aiven | Purpose | +|-------------|---------|-----|-------|---------| +| Staging | kiven-staging | eks-staging | Staging plan | Pre-production validation | +| Production | kiven-prod | eks-prod | Business plan | Live product | + +--- + +# Development Workflow + +## Daily Workflow + +```bash +cd kiven-dev + +# Morning: start everything +task infra:up # Docker Compose +task kind:create # kind + CNPG (idempotent, skips if exists) +tilt up # All services + agent + dashboard + +# Code in any svc-* repo โ†’ Tilt auto-reloads +# Dashboard at http://localhost:3000 +# Tilt UI at http://localhost:10350 + +# End of day +tilt down +task infra:down +``` + +## Adding a New Service + +```bash +# 1. Create repo from template +gh repo create kivenio/svc-my-service --template kivenio/platform-templates-service-go --private + +# 2. Clone next to kiven-dev +cd .. && git clone git@github.com:kivenio/svc-my-service.git + +# 3. Initialize +cd svc-my-service && task init + +# 4. Add to Tiltfile in kiven-dev +# (add local_resource block) + +# 5. Develop โ†’ test โ†’ PR โ†’ merge โ†’ CI runs automatically +``` + +## Adding a New Feature to Existing Service + +```bash +cd svc-api # Go to service repo +task init # Ensure tools are up to date (mise) +# ... code ... +task test # Run tests +task lint # Run linter +# Tilt auto-reloads if running +git add . && git commit && git push +# CI runs via reusable workflow +``` + +--- + +*Maintained by: Platform Team* +*Last updated: February 2026* diff --git a/development/TEMPLATE-USAGE-GUIDE.md b/development/TEMPLATE-USAGE-GUIDE.md new file mode 100644 index 0000000..6a04fc6 --- /dev/null +++ b/development/TEMPLATE-USAGE-GUIDE.md @@ -0,0 +1,488 @@ +# Template & Workflow Architecture + +## Overview + +Kiven uses a **three-layer developer platform** to ensure every repo starts with the right tooling, CI/CD, and conventions โ€” without the developer thinking about any of it. 
+ +``` +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ Layer 1: COPIER TEMPLATES โ”‚ +โ”‚ platform-templates-service-go, platform-templates-sdk-go, ... โ”‚ +โ”‚ โ”€ Scaffold a new repo with all files, config, CI/CD โ”‚ +โ”‚ โ”€ copier copy gh:kivenio/platform-templates-service-go ./repo โ”‚ +โ”‚ โ”€ copier update (evolve existing repos when template changes) โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ”‚ references +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ Layer 2: PLATFORM-GITHUB-MANAGEMENT โ”‚ +โ”‚ Declarative YAML โ†’ GitHub repos, settings, rulesets, labels โ”‚ +โ”‚ โ”€ repos/backend/core-services.yaml defines all svc-* โ”‚ +โ”‚ โ”€ template: service-go links to the Copier template โ”‚ +โ”‚ โ”€ config/enforced.yaml locks security settings โ”‚ +โ”‚ โ”€ sync-repos.py applies changes on PR merge โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ”‚ includes +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ Layer 3: REUSABLE WORKFLOWS + COMPOSITE ACTIONS โ”‚ +โ”‚ Repo: reusable-workflows +โ”‚ reusable-workflows/.github/workflows/ci-go-reusable.yml โ”‚ +โ”‚ reusable-workflows/.github/actions/setup-go/action.yml โ”‚ +โ”‚ โ”€ Called by each repo's CI workflow (generated from 
template) โ”‚ +โ”‚ โ”€ One source of truth for all CI/CD logic โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +``` + +## How It All Connects + +### Creating a new repo + +``` +Developer platform-github-management Copier template + โ”‚ โ”‚ โ”‚ + โ”‚ 1. Add YAML entry โ”‚ โ”‚ + โ”‚ in repos/backend/*.yaml โ”‚ โ”‚ + โ”‚ with template: service-go โ”‚ โ”‚ + โ”‚โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–บโ”‚ โ”‚ + โ”‚ โ”‚ โ”‚ + โ”‚ 2. PR merged โ†’ sync-repos โ”‚ โ”‚ + โ”‚ creates GitHub repo โ”‚ โ”‚ + โ”‚ with settings, labels, โ”‚ โ”‚ + โ”‚ rulesets, topics, teams โ”‚ โ”‚ + โ”‚ โ”‚ โ”‚ + โ”‚ 3. Developer clones repo โ”‚ โ”‚ + โ”‚ and runs copier copy โ”‚ โ”‚ + โ”‚โ—„โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”‚ โ”‚ + โ”‚ โ”‚ โ”‚ + โ”‚ 4. copier copy โ”‚ โ”‚ + โ”‚ gh:kivenio/platform- โ”‚ answers copier.yml questions โ”‚ + โ”‚ templates-service-go . โ”‚โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–บโ”‚ + โ”‚ โ”‚ โ”‚ + โ”‚ 5. Template generates โ”‚ .editorconfig, .golangci.yml โ”‚ + โ”‚ ALL files: โ”‚ .mise.toml, Taskfile.yml โ”‚ + โ”‚ tooling, CI, Dockerfile, โ”‚ .pre-commit-config.yaml โ”‚ + โ”‚ vscode, go.mod, cmd/... โ”‚ .github/workflows/ci.yml โ”‚ + โ”‚โ—„โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”‚โ—„โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”‚ + โ”‚ โ”‚ โ”‚ + โ”‚ 6. 
task init → ready          │                              │
+```
+
+### Updating existing repos when template evolves
+
+```bash
+cd svc-api
+copier update
+# Copier shows diff, applies new changes, respects .copier-answers.yml
+```
+
+This is the **key advantage** of Copier over `sed`-based scaffolding: when you add a new
+linter rule, a new shared GitHub Action, or update the Go version across all services, you
+update the template once and each repo pulls the update via `copier update`. In practice this
+is automated through a reusable GitHub workflow: each generated repo ships a `copier-update.yml`
+workflow (calling `copier-update-reusable.yml`), which detects template drift and opens a PR
+in that repo with the changes.
+
+## Template Inventory
+
+| Template | Repo | For | Key Features |
+|----------|------|-----|--------------|
+| `service-go` | `platform-templates-service-go` | Go microservices (`svc-*`) | chi, gRPC, OTel, Dockerfile, air, Testcontainers |
+| `sdk-go` | `platform-templates-sdk-go` | Go libraries (`kiven-go-sdk`, `provider-*`, `kiven-cli`) | No Dockerfile, no cmd/, library-focused |
+| `infrastructure` | `platform-templates-infrastructure` | Terraform modules (`bootstrap`, `infra-customer-*`) | tflint, Checkov, terraform-docs |
+| `platform-component` | `platform-templates-platform-component` | GitOps components (`platform-gitops`, `platform-security`) | Helm/Kustomize, Flux integration |
+| `documentation` | `platform-templates-documentation` | Doc sites (`docs`) | MkDocs Material, ADR template |
+
+## Template Structure (Copier)
+
+Each `platform-templates-*` repo follows the Copier convention:
+
+```
+platform-templates-service-go/
+├── copier.yml                      # Questions + config
+├── .copier-answers.yml.jinja       # Records answers in generated project
+├── {{project_name}}/               # (not used — we generate at root)
+│
+├── .editorconfig                   # Static — copied as-is
+├── .gitignore                      # Static
+├── .golangci.yml                   # Static
+├── .pre-commit-config.yaml         # Static
+├── Dockerfile                      # Static
+│
+├── .mise.toml.jinja                # Templated — injects service name 
+โ”œโ”€โ”€ Taskfile.yml.jinja # Templated โ€” injects service name, port +โ”œโ”€โ”€ go.mod.jinja # Templated โ€” injects module path +โ”œโ”€โ”€ README.md.jinja # Templated โ€” injects name, description +โ”‚ +โ”œโ”€โ”€ .vscode/ +โ”‚ โ”œโ”€โ”€ settings.json # Static +โ”‚ โ”œโ”€โ”€ extensions.json # Static +โ”‚ โ””โ”€โ”€ launch.json.jinja # Templated โ€” injects port, env vars +โ”‚ +โ”œโ”€โ”€ renovate.json # Static โ€” Renovate dep updates (auto-merge, grouping) +โ”œโ”€โ”€ .github/ +โ”‚ โ”œโ”€โ”€ CODEOWNERS.jinja # Templated โ€” injects team +โ”‚ โ”œโ”€โ”€ pull_request_template.md # Static โ€” shared PR checklist +โ”‚ โ”œโ”€โ”€ ISSUE_TEMPLATE/ +โ”‚ โ”‚ โ”œโ”€โ”€ bug_report.yml # Static +โ”‚ โ”‚ โ””โ”€โ”€ feature_request.yml # Static +โ”‚ โ””โ”€โ”€ workflows/ +โ”‚ โ”œโ”€โ”€ ci.yml.jinja # Templated โ€” injects service name +โ”‚ โ”œโ”€โ”€ release.yml.jinja # Templated โ€” injects service name +โ”‚ โ””โ”€โ”€ copier-update.yml # Static โ€” weekly template sync check +โ”‚ +โ””โ”€โ”€ cmd/ + โ””โ”€โ”€ main.go.jinja # Templated โ€” basic service scaffold +``` + +### `copier.yml` โ€” The Questionnaire + +```yaml +# copier.yml โ€” Kiven Go Service Template +_min_copier_version: "9.0.0" +_subdirectory: "" +_answers_file: .copier-answers.yml + +project_name: + type: str + help: "Service name (e.g., svc-api, svc-auth)" + validator: "{% if not project_name | regex_search('^[a-z][a-z0-9-]+$') %}Must be lowercase with hyphens{% endif %}" + +project_description: + type: str + help: "One-line description" + +owner_team: + type: str + default: "@kivenio/backend" + help: "GitHub team that owns this repo" + +port: + type: int + default: 8080 + help: "HTTP port" + +grpc_port: + type: int + default: 0 + help: "gRPC port (0 = no gRPC)" + +enable_kafka: + type: bool + default: false + help: "Does this service consume/produce Kafka events?" 

go_version:
  type: str
  default: "1.23"
  help: "Go version"
```

### How variables connect to `platform-github-management`

In `repos/backend/core-services.yaml`, each repo definition contains a `template_variables`
field with ALL the Copier variables needed for that repo:

```yaml
- name: svc-api                   # → auto-mapped to project_name
  description: "API Gateway"      # → auto-mapped to project_description
  type: service
  template: service-go            # → which Copier template to use
  ruleset: strict
  topics: [core, api, graphql]
  template_variables:             # → ALL Copier variables for this repo
    port: 8080
    grpc_port: 0
    enable_kafka: false
```

**Variable resolution order:**

1. **Auto-mapped** (always set, derived from repo definition fields):
   - `name` → `project_name`
   - `description` → `project_description`
   - Owner team from YAML file path (`repos/backend/` → `@kivenio/backend`) → `owner_team`

2. **Explicit** (`template_variables` dict — overrides auto-mapped if same key):
   - `port`, `grpc_port`, `enable_kafka`, or any Copier variable from `copier.yml`

3. **Template defaults** (from `copier.yml` — used for any variable not specified above):
   - `go_version: "1.23"`, etc.

When `sync-repos.py` creates a repo, it passes all variables to `copier copy` via `-d key=value`.
Copier generates `.copier-answers.yml` in the new repo, recording every variable used.
This file is essential for future `copier update` runs — it tells Copier which template
was used and with which answers.
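The three layers above amount to a plain dictionary merge. The following is a minimal illustration, not the actual `sync-repos.py` code; `resolve_variables` and its arguments are hypothetical names:

```python
# Sketch of the variable resolution order used when scaffolding a repo.
# Hypothetical helper: illustrates the merge, not the real sync-repos.py code.

def resolve_variables(repo_def: dict, yaml_path: str) -> dict:
    """Merge auto-mapped fields (layer 1) with explicit template_variables (layer 2)."""
    # Layer 1: auto-mapped from the repo definition and the YAML file path
    team = "@kivenio/" + yaml_path.split("/")[1]  # repos/backend/... -> @kivenio/backend
    variables = {
        "project_name": repo_def["name"],
        "project_description": repo_def["description"],
        "owner_team": team,
    }
    # Layer 2: explicit template_variables override auto-mapped keys if they collide
    variables.update(repo_def.get("template_variables", {}))
    return variables

repo = {
    "name": "svc-api",
    "description": "API Gateway",
    "template_variables": {"port": 8080, "grpc_port": 0, "enable_kafka": False},
}
print(resolve_variables(repo, "repos/backend/core-services.yaml"))
```

Layer 3 never needs to be merged here: any key absent from the result simply falls back to its `copier.yml` default when Copier runs.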

```yaml
# .copier-answers.yml (auto-generated by Copier in the new repo)
_src_path: gh:kivenio/platform-templates-service-go
_commit: abc1234
project_name: svc-api
project_description: "API Gateway — REST + GraphQL, request routing"
owner_team: "@kivenio/backend"
port: 8080
grpc_port: 0
enable_kafka: false
go_version: "1.23"
```

## Reusable Workflows vs Composite Actions

### Reusable Workflows (job-level reuse)

A reusable workflow replaces an **entire job** (or set of jobs). Each service repo calls them
in its `.github/workflows/ci.yml`:

```yaml
# In svc-api/.github/workflows/ci.yml
jobs:
  ci:
    uses: kivenio/reusable-workflows/.github/workflows/ci-go-reusable.yml@main
    with:
      service-name: "svc-api"
      go-version: "1.23"
    secrets: inherit
```

| Workflow | File | Purpose |
|----------|------|---------|
| `ci-go-reusable.yml` | `reusable-workflows/.github/workflows/` | Lint + Test + Build + Security + Docker |
| `ci-frontend-reusable.yml` | `reusable-workflows/.github/workflows/` | Lint + TypeCheck + Test + Build + Audit |
| `ci-terraform-reusable.yml` | `reusable-workflows/.github/workflows/` | Format + Validate + tflint + Trivy + Checkov |
| `docker-build-reusable.yml` | `reusable-workflows/.github/workflows/` | Buildx + GHCR push + semver tags |
| `release-reusable.yml` | `reusable-workflows/.github/workflows/` | Conventional commits → semver → changelog → GitHub Release |
| `copier-update-reusable.yml` | `reusable-workflows/.github/workflows/` | Detect template drift → auto-PR with updates |
| `ci-copier-template-reusable.yml` | `reusable-workflows/.github/workflows/` | Validate Copier templates (syntax, dry-run, build test) |

**When to use:** When you want to share a complete CI/CD pipeline across repos.

### Composite Actions (step-level reuse)

A composite action replaces **individual steps** within a job.
Use them when multiple workflows share the same setup/teardown logic but differ in the middle.

```yaml
# In any workflow
steps:
  - uses: kivenio/reusable-workflows/.github/actions/setup-go@main
    with:
      go-version: "1.23"
      # setup-go handles: checkout + install Go + cache + download deps
```

| Action | File | Purpose |
|--------|------|---------|
| `setup-go` | `reusable-workflows/.github/actions/setup-go/action.yml` | Checkout + Go install + cache + `go mod download` |
| `setup-node` | `reusable-workflows/.github/actions/setup-node/action.yml` | Checkout + Node install + cache + `npm ci` |
| `security-scan` | `reusable-workflows/.github/actions/security-scan/action.yml` | Trivy fs + Gitleaks in one step |
| `docker-metadata` | `reusable-workflows/.github/actions/docker-metadata/action.yml` | Generate tags (sha, branch, semver) |
| `copier-update` | `reusable-workflows/.github/actions/copier-update/action.yml` | Run `copier update`, detect drift, create PR |

**When to use:** When reusable workflows are too rigid. For example, a workflow that needs
custom steps between setup and test can use composite actions for the common parts.
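For reference, a composite action is a single `action.yml` with `using: composite`, and each `run` step inside it must declare an explicit `shell`. The sketch below is illustrative only; the input names and caching details are assumptions, not the actual `kivenio` implementation:

```yaml
# .github/actions/setup-go/action.yml (illustrative sketch)
name: setup-go
description: "Checkout + Go install + module cache + go mod download"
inputs:
  go-version:
    description: "Go version to install"
    required: false
    default: "1.23"
runs:
  using: composite
  steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-go@v5
      with:
        go-version: ${{ inputs.go-version }}
        cache: true          # actions/setup-go caches the module and build caches
    - name: Download dependencies
      shell: bash            # 'shell' is required on run steps inside composite actions
      run: go mod download
```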

### How They Compose

```
Service repo CI workflow
│
├── uses: kivenio/reusable-workflows/.github/workflows/ci-go-reusable.yml
│   │
│   ├── Job: lint
│   │   └── uses: kivenio/reusable-workflows/.github/actions/setup-go
│   │       └── golangci-lint
│   │
│   ├── Job: test
│   │   └── uses: kivenio/reusable-workflows/.github/actions/setup-go
│   │       └── go test
│   │
│   ├── Job: security
│   │   └── uses: kivenio/reusable-workflows/.github/actions/security-scan
│   │
│   └── Job: docker
│       └── uses: kivenio/reusable-workflows/.github/actions/docker-metadata
│           └── docker build + push
```

## What's Mutualized in Templates

Every repo created from a Copier template automatically gets:

### Tooling (developer experience)
- `.editorconfig` — Consistent formatting (tabs for Go, spaces for YAML)
- `.vscode/settings.json` — Formatter, linter, language settings
- `.vscode/extensions.json` — Recommended VS Code extensions
- `.vscode/launch.json` — Debug configurations
- `.mise.toml` — Tool versions (Go, Node, Terraform, etc.)
- `.pre-commit-config.yaml` — Git hooks (format, lint, secrets detection)

### Quality (code standards)
- `.golangci.yml` — 18 linters with opinionated config (Go repos)
- `.prettierrc` — Code formatting (frontend repos)
- `Taskfile.yml` — Standard tasks (init, test, lint, build, clean)

### CI/CD (automation)
- `.github/workflows/ci.yml` — Calls the reusable workflow from `kivenio/reusable-workflows`
- `.github/workflows/release.yml` — Release automation
- `renovate.json` — Automated dependency updates (Renovate — grouping, auto-merge patches/minors)
- `Dockerfile` — Multi-stage, non-root, healthcheck (service repos)

### Collaboration (team workflow)
- `.github/CODEOWNERS` — Review routing (auto-assigned from `owner_team`)
- `.github/pull_request_template.md` — PR checklist (tests, docs, breaking changes)
- `.github/ISSUE_TEMPLATE/bug_report.yml` — Structured bug reports
- `.github/ISSUE_TEMPLATE/feature_request.yml` — Feature request form

### Project (bootstrapping)
- `go.mod` — Module initialized with `github.com/kivenio/`
- `cmd/main.go` — Minimal service entrypoint
- `README.md` — Generated with name, description, badges
- `.copier-answers.yml` — Records template answers for future updates

## Applying Templates to Existing Repos

For repos that were created before Copier was set up:

```bash
cd ../svc-api
copier copy gh:kivenio/platform-templates-service-go . --overwrite
# Copier asks questions, generates files, respects existing code
```

For selective file application (tooling only, no code scaffolding), use
`copier copy --exclude 'cmd/**'` to skip code directories.
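Rolling that out across several pre-existing repos is easily scripted. A minimal sketch: the repo names are examples, and it prints the commands instead of executing them so the list can be reviewed first (pipe the output to `sh` to run it):

```shell
# Print the retrofit commands for a set of existing repos (dry run).
TEMPLATE="gh:kivenio/platform-templates-service-go"
REPOS="svc-api svc-auth svc-provisioner"   # example repo names

for repo in $REPOS; do
  echo "cd $repo && copier copy $TEMPLATE . --overwrite --exclude 'cmd/**' && cd .."
done
```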

## Template Repo CI

Every `platform-templates-*` repo has its own CI (NOT a `.jinja` file — it runs on the template
repo itself) that validates the Copier template works correctly:

```yaml
# platform-templates-service-go/.github/workflows/ci.yml
jobs:
  ci:
    uses: kivenio/reusable-workflows/.github/workflows/ci-copier-template-reusable.yml@main
    with:
      template-type: "go"
```

**What it validates:**
1. `copier.yml` exists and is valid YAML
2. Jinja2 template files (`.jinja`) exist
3. Dry-run `copier copy --defaults` generates all critical files (editorconfig, gitignore, CI, CODEOWNERS, etc.)
4. No unresolved Jinja2 variables in generated output
5. Generated Go project compiles (`go build`, `go vet`)
6. Security scan (Trivy + Gitleaks) on the template itself

This means every PR to a template repo is validated end-to-end before merge.

## Automatic Scaffolding from Templates

When a new repo is defined in `platform-github-management` with a `template` field:

```yaml
- name: svc-foo
  description: "New service"
  template: service-go   # ← this triggers Copier scaffolding
```

The `sync-repos.py` script automatically:
1. Creates the GitHub repo (settings, labels, rulesets, teams)
2. Resolves the template source from `config/templates.yaml` (`service-go` → `gh:kivenio/platform-templates-service-go`)
3. Clones the new repo
4. Runs `copier copy --trust --defaults` with variables extracted from the YAML:
   - `name` → `project_name`
   - `description` → `project_description`
   - Owner team from the YAML file path (`repos/backend/` → `@kivenio/backend`)
5. Commits and pushes the scaffolded files

The developer gets a fully configured, ready-to-code repo without touching `copier` manually.

## Adding a New Template

1. Create the repo in `platform-github-management/repos/platform/templates.yaml`
2. Create the Copier template repo with `copier.yml` + files
3. Add a `ci.yml` that calls `ci-copier-template-reusable.yml` (validates the template)
4. Register it in `platform-github-management/config/templates.yaml`
5. Document it in this guide

## Evolving Templates

When you change a template (e.g., update the Go version, add a linter):

1. Update the `platform-templates-*` repo
2. Tag a new version (e.g., `v1.2.0`)
3. Each service repo pulls the update **automatically via CI**:

### Automated Template Sync (CI)

Every repo created from a Copier template includes a `copier-update.yml` workflow:

```yaml
# .github/workflows/copier-update.yml (auto-generated from template)
on:
  schedule:
    - cron: "0 7 * * 1"   # Every Monday 7am
  workflow_dispatch:       # Can also trigger manually

jobs:
  copier-update:
    uses: kivenio/reusable-workflows/.github/workflows/copier-update-reusable.yml@main
    secrets: inherit
```

**How it works:**

```
Template repo changes                     Downstream repo (svc-api)
      │                                         │
      │ 1. Developer updates                    │
      │    platform-templates-                  │
      │    service-go (new linter,              │
      │    Go version bump, etc.)               │
      │                                         │
      │                                         │ 2. Monday 7am: copier-update
      │                                         │    workflow runs automatically
      │                                         │
      │                                         │ 3. Composite action:
      │                                         │    - Installs copier
      │                                         │    - Reads .copier-answers.yml
      │                                         │    - Runs copier update --trust
      │                                         │    - Detects git diff
      │                                         │
      │                                         │ 4. If drift detected:
      │                                         │    - Creates branch chore/copier-update
      │                                         │    - Opens PR with changes
      │                                         │    - Labels: dependencies
      │                                         │
      │                                         │ 5. Developer reviews PR
      │                                         │    - Resolves conflicts if any
      │                                         │    - Merges when ready
```

### Manual Update

You can also update manually at any time:

```bash
cd svc-api
copier update --trust
# Shows diff of changes, applies non-conflicting updates
# Conflicts are shown for manual resolution
```

### Triggering Update Across All Repos

After a significant template change, trigger all downstream repos at once via
`workflow_dispatch` on each repo's `copier-update.yml`. This can be scripted:

```bash
repos=(svc-api svc-auth svc-provisioner svc-clusters svc-backups)
for repo in "${repos[@]}"; do
  gh workflow run copier-update.yml --repo kivenio/$repo
done
```

This is how the entire fleet stays consistent without manual copy-paste.

## YAML Config Validator (YCC)

For detailed documentation on the planned YAML validation tool that validates
`platform-github-management` structures and Copier template variables, see
[YAML-CONFIG-VALIDATOR.md](../platform/YAML-CONFIG-VALIDATOR.md).

diff --git a/infra/CUSTOMER-INFRA-MANAGEMENT.md b/infra/CUSTOMER-INFRA-MANAGEMENT.md
new file mode 100644
index 0000000..7b6b930
--- /dev/null
+++ b/infra/CUSTOMER-INFRA-MANAGEMENT.md

# Customer Infrastructure Management
## *How Kiven Manages AWS Resources in Customer Accounts*

---

> **Back to**: [Architecture Overview](../EntrepriseArchitecture.md)

---

# Overview

Kiven manages **four types of AWS resources** in the customer's account:

1. **EKS Node Groups** — Dedicated compute for databases
2. **EBS Volumes** — Persistent storage for PostgreSQL data
3. **S3 Buckets** — Backup storage (Barman + WAL archiving)
4. **IAM Roles** — IRSA for CNPG to access S3

All managed via `svc-infra`, which assumes the customer's `KivenAccessRole` IAM role.
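The customer-side half of this handshake is the role's trust policy, which pins both the caller and the external ID. A representative sketch; the account ID and external ID below are placeholders:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowKivenAssumeRole",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::<KIVEN_ACCOUNT_ID>:root" },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": { "sts:ExternalId": "<unique-per-customer-external-id>" }
      }
    }
  ]
}
```

Without the `sts:ExternalId` condition, a third party that learned the role ARN could trick Kiven's tooling into acting on it; that is the confused-deputy scenario this policy guards against.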

---

# Access Model

## Cross-Account IAM

```
┌─── Kiven AWS Account ──────────┐      ┌─── Customer AWS Account ──────────────┐
│                                │      │                                       │
│  svc-infra                     │      │  IAM Role: KivenAccessRole            │
│  ├── IRSA: svc-infra-role      │─────▶│  ├── Trust: Kiven account ID          │
│  └── AssumeRole call           │      │  ├── Policy: KivenAccessPolicy        │
│                                │      │  └── ExternalId: unique per customer  │
│                                │      │                                       │
└────────────────────────────────┘      └───────────────────────────────────────┘
```

## IAM Policy (KivenAccessPolicy)

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EKSAccess",
      "Effect": "Allow",
      "Action": [
        "eks:DescribeCluster",
        "eks:ListNodegroups",
        "eks:DescribeNodegroup",
        "eks:CreateNodegroup",
        "eks:UpdateNodegroupConfig",
        "eks:DeleteNodegroup"
      ],
      "Resource": "arn:aws:eks:*:*:cluster/*"
    },
    {
      "Sid": "EC2ForNodeGroups",
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "ec2:DescribeVolumes",
        "ec2:DescribeSubnets",
        "ec2:DescribeSecurityGroups",
        "ec2:CreateLaunchTemplate",
        "ec2:DeleteLaunchTemplate",
        "ec2:RunInstances",
        "ec2:TerminateInstances"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:RequestTag/managed-by": "kiven"
        }
      }
    },
    {
      "Sid": "S3BackupBucket",
      "Effect": "Allow",
      "Action": [
        "s3:CreateBucket",
        "s3:PutBucketEncryption",
        "s3:PutBucketLifecycleConfiguration",
        "s3:PutBucketVersioning",
        "s3:PutBucketPolicy",
        "s3:GetBucketLocation",
        "s3:ListBucket"
      ],
      "Resource": "arn:aws:s3:::kiven-backups-*"
    },
    {
      "Sid": "IAMForIRSA",
      "Effect": "Allow",
      "Action": [
        "iam:CreateRole",
        "iam:DeleteRole",
        "iam:AttachRolePolicy",
        "iam:DetachRolePolicy",
        "iam:PutRolePolicy",
"iam:DeleteRolePolicy", + "iam:GetRole", + "iam:TagRole" + ], + "Resource": "arn:aws:iam::*:role/kiven-*" + }, + { + "Sid": "KMSForEncryption", + "Effect": "Allow", + "Action": [ + "kms:DescribeKey", + "kms:CreateGrant", + "kms:Encrypt", + "kms:Decrypt", + "kms:GenerateDataKey" + ], + "Resource": "*" + } + ] +} +``` + +**Key security constraints:** +- EC2 actions limited to resources tagged `managed-by: kiven` +- S3 actions limited to `kiven-backups-*` bucket prefix +- IAM actions limited to `kiven-*` role prefix +- ExternalId required to prevent confused deputy attacks + +--- + +# Resource Management + +## 1. Node Groups + +### Create (on provisioning) + +| Parameter | Value | Rationale | +|-----------|-------|-----------| +| Name | `kiven-db-{cluster-id}` | Unique per database cluster | +| Instance type | Per service plan (t3.small โ†’ r6g.xlarge) | Memory-optimized for PG | +| Desired/min/max | Per plan (1-3 nodes) | HA: primary + replicas | +| Subnets | Multi-AZ (customer's private subnets) | Spread across AZs | +| Taints | `kiven.io/role=database:NoSchedule` | Only DB pods run here | +| Labels | `kiven.io/managed=true`, `kiven.io/cluster-id={id}` | Identification | +| Tags | `managed-by: kiven`, `kiven-cluster-id: {id}` | AWS-level tracking | +| AMI | EKS-optimized AL2023 | Standard, secure | + +### Scale to Zero (Power Off) + +``` +svc-infra โ†’ UpdateNodegroupConfig: + scalingConfig: + minSize: 0 + desiredSize: 0 + maxSize: 0 (or original max) +``` + +Nodes are terminated. EBS volumes detach but are RETAINED (PVC reclaim policy = Retain). + +### Scale Up (Power On) + +``` +svc-infra โ†’ UpdateNodegroupConfig: + scalingConfig: + minSize: {plan.instances} + desiredSize: {plan.instances} + maxSize: {plan.instances * 2} +``` + +New nodes join. CNPG pods re-created, PVCs reattached to new nodes. + +### Delete (on cluster deletion) + +Node group fully deleted only when customer deletes the database **and** confirms data destruction. + +## 2. 
EBS Volumes + +Managed indirectly via Kubernetes StorageClass + PVCs. Kiven creates the StorageClass: + +```yaml +apiVersion: storage.k8s.io/v1 +kind: StorageClass +metadata: + name: kiven-db-gp3 +parameters: + type: gp3 + iops: "3000" # adjusted per plan + throughput: "125" # adjusted per plan + encrypted: "true" + kmsKeyId: +provisioner: ebs.csi.aws.com +reclaimPolicy: Retain # CRITICAL: keep volumes on PVC delete +volumeBindingMode: WaitForFirstConsumer +allowVolumeExpansion: true +``` + +**Monitoring:** +- Disk usage alerts at 70%, 80%, 90% +- IOPS utilization alerts +- Auto-resize recommendation via DBA intelligence + +## 3. S3 Buckets + +One bucket per customer (shared across their database clusters): + +| Config | Value | Rationale | +|--------|-------|-----------| +| Name | `kiven-backups-{customer-id}` | Unique per customer | +| Region | Same as customer's EKS | Data locality | +| Encryption | SSE-KMS (customer's key) | Customer controls encryption | +| Versioning | Enabled | Protect against accidental delete | +| Lifecycle | Transition to IA after 30d, Glacier after 90d, delete after 365d | Cost optimization | +| Bucket policy | Only IRSA role can access | Least privilege | + +## 4. IAM Roles (IRSA) + +CNPG needs S3 access for backups. 
Kiven creates an IRSA role:

| Parameter | Value |
|-----------|-------|
| Role name | `kiven-cnpg-backup-{cluster-id}` |
| Trust | EKS OIDC provider + kiven-databases ServiceAccount |
| Policy | s3:PutObject, s3:GetObject, s3:DeleteObject on the backup bucket |

---

# Tagging Strategy

All Kiven-managed resources are tagged:

| Tag Key | Value | Purpose |
|---------|-------|---------|
| `managed-by` | `kiven` | Identify Kiven resources |
| `kiven-cluster-id` | `{cluster-id}` | Link to specific database |
| `kiven-customer-id` | `{customer-id}` | Link to customer org |
| `kiven-plan` | `hobbyist/startup/business/premium/custom` | Service plan |
| `kiven-environment` | `production/staging/development` | Environment label |

These tags enable:
- Cost tracking per database cluster
- Resource cleanup on cluster deletion
- IAM policy conditions (only manage tagged resources)

---

# Audit & Compliance

Every AWS API call made by `svc-infra` is:
1. **Logged in CloudTrail** (customer's account) — they can see exactly what Kiven does
2. **Logged in Kiven audit** (`svc-audit`) — immutable record on our side
3. **Attributed** — which Kiven user/service triggered the action

---

# Cost Tracking

Kiven tracks estimated costs per cluster by monitoring:

| Resource | Cost Calculation |
|----------|-----------------|
| **EC2 nodes** | Instance type × hours running (from power on/off events) |
| **EBS storage** | Volume size × hours provisioned + IOPS cost |
| **S3 storage** | Bucket size × storage-class pricing |
| **Data transfer** | Estimated from backup size + WAL volume |

Displayed in the customer dashboard: "Estimated AWS cost: $X/month for this cluster"

---

*Maintained by: Platform Team*
*Last updated: February 2026*

diff --git a/networking/NETWORKING-ARCHITECTURE.md b/networking/NETWORKING-ARCHITECTURE.md
new file mode 100644
index 0000000..ed763ea
--- /dev/null
+++ b/networking/NETWORKING-ARCHITECTURE.md

# 🌐 **Networking Architecture**
## *LOCAL-PLUS VPC, Edge, CDN & Gateway*

---

> **Back to**: [Architecture Overview](../EntrepriseArchitecture.md)

---

# 📋 **Table of Contents**

1. [VPC Design](#vpc-design)
2. [Traffic Flow](#traffic-flow)
3. [Gateway API Configuration](#gateway-api-configuration)
4. [Network Policies](#network-policies)
5. [Cloudflare Architecture](#cloudflare-architecture)
6. [DNS Configuration](#dns-configuration)
7. [Route53 — Internal & Backup DNS](#route53--internal--backup-dns)
8. [API Gateway / APIM (Future)](#api-gateway--apim-future)
9. [Multi-Cloud Vision](#multi-cloud-vision)
10. [Multi-Account Topology (AWS Organizations)](#topologie-multi-account-aws-organizations)

---

# 🏗️ **VPC Design**

## CIDR Allocation

| CIDR | Usage | Subnets |
|------|-------|---------|
| 10.0.0.0/16 | Main VPC | - |
| 10.0.0.0/20 | Private Subnets (Workloads) | 3 AZs |
| 10.0.16.0/20 | Private Subnets (Data) | 3 AZs |
| 10.0.32.0/20 | Public Subnets (NAT, LB) | 3 AZs |

## EKS Architecture

```
┌──────────────────────────────────────────────────────────────────────────┐
│                                 INTERNET                                 │
│                                (End Users)                               │
└──────────────────────────────────────────────────────────────────────────┘
                                     │
                                     ▼
┌──────────────────────────────────────────────────────────────────────────┐
│                         CLOUDFLARE EDGE (Global)                         │
├──────────────────────────────────────────────────────────────────────────┤
│  • DNS (localplus.io)              • WAF (OWASP rules)                   │
│  • DDoS Protection (L3-L7)         • SSL/TLS Termination                 │
│  • CDN (static assets)             • Bot Protection                      │
│  • Cloudflare Tunnel               • Zero Trust Access                   │
└──────────────────────────────────────────────────────────────────────────┘
                                     │
                                     │ Cloudflare Tunnel (encrypted)
                                     ▼
┌────────────────────────────────────────────────────────────────────────────┐
│                     WORKLOAD ACCOUNT (PROD) — eu-west-1                    │
├────────────────────────────────────────────────────────────────────────────┤
│                                                                            │
│  ┌──────────────────────────────────────────────────────────────────────┐  │
│  │                           VPC — 10.0.0.0/16                          │  │
│  │                                                                      │  │
│  │  ┌────────────────────────────────────────────────────────────────┐  │  │
│  │  │                           EKS CLUSTER                          │  │  │
│  │  │                                                                │  │  │
│  │  │  ┌──────────────────────────────────────────────────────────┐  │  │  │
│  │  │  │ NODE POOL: platform (taints: platform=true:NoSchedule)   │  │  │  │
│  │  │  │ Instance: m6i.xlarge (dedicated resources)               │  │  │  │
│  │  │  ├──────────────────────────────────────────────────────────┤  │  │  │
│  │  │  │ PLATFORM NAMESPACE                                       │  │  │  │
│  │  │  │  • Flux, Cilium, Vault, Kyverno, OTel, Grafana           │  │  │  │
│  │  │  └──────────────────────────────────────────────────────────┘  │  │  │
│  │  │                                                                │  │  │
│  │  │  ┌──────────────────────────────────────────────────────────┐  │  │  │
│  │  │  │ NODE POOL: application (default, auto-scaling)           │  │  │  │
│  │  │  │ Instance: m6i.large (cost-optimized)                     │  │  │  │
│  │  │  ├──────────────────────────────────────────────────────────┤  │  │  │
│  │  │  │ APPLICATION NAMESPACES                                   │  │  │  │
│  │  │  │  • svc-ledger, svc-wallet, svc-merchant, etc.            │  │  │  │
│  │  │  └──────────────────────────────────────────────────────────┘  │  │  │
│  │  │                                                                │  │  │
│  │  └────────────────────────────────────────────────────────────────┘  │  │
│  │                                                                      │  │
│  │              │ VPC Peering / Transit Gateway                         │  │
│  │              ▼                                                       │  │
│  │  ┌────────────────────────────────────────────────────────────────┐  │  │
│  │  │                            AIVEN VPC                           │  │  │
│  │  │   • PostgreSQL (Primary + Read Replica)                        │  │  │
│  │  │   • Kafka Cluster                                              │  │  │
│  │  │   • Valkey (Redis-compatible)                                  │  │  │
│  │  └────────────────────────────────────────────────────────────────┘  │  │
│  │                                                                      │  │
│  └──────────────────────────────────────────────────────────────────────┘  │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘
```

## Node Pool Strategy

| Node Pool | Taints | Usage | Instance Type | Scaling |
|-----------|--------|-------|---------------|---------|
| **platform** | `platform=true:NoSchedule` | Flux, Monitoring, Security tools | m6i.xlarge | Fixed (2-3 nodes) |
| **application** | None (default) | Domain services | m6i.large | HPA (2-10 nodes) |
| **spot** (optional) | `spot=true:PreferNoSchedule` | Batch jobs, non-critical | m6i.large (spot) | Auto (0-5 nodes) |

---

# 🔄 **Traffic Flow**

| Flow | Path | Encryption |
|------|------|------------|
| Internet → Services | Cloudflare → Tunnel → Cilium Gateway → Pod | TLS + mTLS |
| Service → Service | Pod → Pod (Cilium) | mTLS (WireGuard) |
| Service → Aiven | VPC Peering | TLS |
| Service → AWS (S3, KMS) | VPC Endpoints | TLS |

---

# 🚪 **Gateway API Configuration**

## Resources

| Resource | Purpose |
|----------|---------|
| **GatewayClass** | Cilium implementation |
| **Gateway** | HTTPS listener, TLS termination |
| **HTTPRoute** | Routing to services (path-based) |

## Gateway Configuration

| Setting | Value | Description |
|---------|-------|-------------|
| **GatewayClass** | `cilium` | Uses the Cilium controller |
| **Listener** | HTTPS:443 | TLS termination |
| **TLS Mode** | Terminate | Certificate managed via External-Secrets |
| **Allowed Routes** | All namespaces | Services can declare their own routes |

## HTTPRoute Routing

| Pattern | Example | Backend |
|---------|---------|---------|
| Path prefix | `/v1/ledger/*` | svc-ledger:8080 |
| Path prefix | `/v1/wallet/*` | svc-wallet:8080 |
| Path prefix | `/v1/merchant/*` | svc-merchant:8080 |
| Exact path | `/health` | All services |

---

# 🔒 **Network Policies**

## Default Deny Strategy

| Policy | Effect |
|--------|--------|
| Default deny all | No traffic allowed unless explicitly permitted |
| Allow intra-namespace | Services in the same namespace can communicate |
| Allow specific cross-namespace | svc-ledger → svc-wallet allowed explicitly |
| Allow egress Aiven | Services → VPC Peering range only |
| Allow egress AWS endpoints | Services → VPC Endpoints only |

## Cilium Network Policy Rules

### Ingress Rules

| From | To | Port | Protocol |
|------|----|------|----------|
| Gateway (platform) | All services | 8080, 50051 | TCP |
| svc-ledger | svc-wallet | 50051 | gRPC |
| Prometheus | All services | 8080 | metrics |

### Egress Rules

| From | To | Port | Description |
|------|----|------|-------------|
| All services | Aiven PostgreSQL | 5432 | Database |
| All services | Aiven Kafka | 9092 | Messaging |
| All services | Aiven Valkey | 6379 | Cache |
| All services | AWS VPC Endpoints | 443 | S3, KMS, etc. |
| OTel Collector | Tempo, Loki | 4317, 3100 | Telemetry |

---

# ☁️ **Cloudflare Architecture**

## Why Cloudflare?
+ +| Critรจre | Cloudflare | AWS CloudFront + WAF | Verdict | +|---------|------------|---------------------|---------| +| **Coรปt** | Free tier gรฉnรฉreux | Payant dรจs le dรฉbut | โœ… Cloudflare | +| **WAF** | Gratuit (rรจgles de base) | ~30โ‚ฌ/mois minimum | โœ… Cloudflare | +| **DDoS** | Inclus (unlimited) | AWS Shield Standard gratuit | โ‰ˆ ร‰gal | +| **SSL/TLS** | Gratuit, auto-renew | ACM gratuit | โ‰ˆ ร‰gal | +| **CDN** | 300+ PoPs, gratuit | Payant au GB | โœ… Cloudflare | +| **DNS** | Gratuit, trรจs rapide | Route53 ~0.50โ‚ฌ/zone | โœ… Cloudflare | +| **Zero Trust** | Gratuit jusqu'ร  50 users | Cognito + ALB payant | โœ… Cloudflare | + +> **Dรฉcision :** Cloudflare en front, AWS en backend. Best of both worlds. + +## Edge Layers + +``` +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ CLOUDFLARE EDGE โ”‚ +โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค +โ”‚ โ”‚ +โ”‚ LAYER 1: DNS โ”‚ +โ”‚ โ€ข Authoritative DNS (localplus.io) โ”‚ +โ”‚ โ€ข DNSSEC enabled โ”‚ +โ”‚ โ€ข Geo-routing (future multi-region) โ”‚ +โ”‚ โ”‚ +โ”‚ LAYER 2: DDoS Protection โ”‚ +โ”‚ โ€ข Layer 3/4 DDoS mitigation (automatic, unlimited) โ”‚ +โ”‚ โ€ข Layer 7 DDoS mitigation โ”‚ +โ”‚ โ”‚ +โ”‚ LAYER 3: WAF โ”‚ +โ”‚ โ€ข OWASP Core Ruleset โ”‚ +โ”‚ โ€ข Custom rules (rate limit, geo-block, bot score) โ”‚ +โ”‚ โ”‚ +โ”‚ LAYER 4: SSL/TLS โ”‚ +โ”‚ โ€ข Edge certificates (auto-issued) โ”‚ +โ”‚ โ€ข Full (strict) mode โ†’ Origin certificate โ”‚ +โ”‚ โ€ข TLS 1.3 only, HSTS enabled โ”‚ +โ”‚ โ”‚ +โ”‚ LAYER 5: CDN & Caching โ”‚ +โ”‚ โ€ข Static assets caching โ”‚ +โ”‚ โ€ข Tiered caching โ”‚ +โ”‚ โ”‚ +โ”‚ LAYER 6: Cloudflare Tunnel 
โ”‚ +โ”‚ โ€ข No public IP needed on origin โ”‚ +โ”‚ โ€ข Encrypted tunnel to EKS โ”‚ +โ”‚ โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +``` + +## Cloudflare Services + +| Service | Plan | Configuration | Coรปt | +|---------|------|---------------|------| +| **DNS** | Free | Authoritative, DNSSEC, proxy enabled | 0โ‚ฌ | +| **CDN** | Free | Cache everything, tiered caching | 0โ‚ฌ | +| **SSL/TLS** | Free | Full (strict), TLS 1.3, edge certs | 0โ‚ฌ | +| **WAF** | Free | Managed ruleset, 5 custom rules | 0โ‚ฌ | +| **DDoS** | Free | L3/L4/L7 protection, unlimited | 0โ‚ฌ | +| **Bot Management** | Free | Basic bot score, JS challenge | 0โ‚ฌ | +| **Rate Limiting** | Free | 1 rule (10K req/month free) | 0โ‚ฌ | +| **Tunnel** | Free | Unlimited tunnels, cloudflared | 0โ‚ฌ | +| **Access** | Free | Zero Trust, 50 users free | 0โ‚ฌ | + +**Coรปt Cloudflare total : 0โ‚ฌ** (Free tier suffisant pour dรฉmarrer) + +## WAF Rules Strategy + +| Rule Set | Type | Action | Purpose | +|----------|------|--------|---------| +| **OWASP Core** | Managed | Block | SQLi, XSS, LFI, RFI protection | +| **Cloudflare Managed** | Managed | Block | Zero-day, emerging threats | +| **Geo-Block** | Custom | Block | Block high-risk countries (optional) | +| **Rate Limit API** | Custom | Challenge | > 100 req/min per IP on /api/* | +| **Bot Score < 30** | Custom | Challenge | Likely bot traffic | + +## SSL/TLS Configuration + +| Setting | Value | Rationale | +|---------|-------|-----------| +| **SSL Mode** | Full (strict) | Origin has valid cert | +| **Minimum TLS** | 1.2 | PCI-DSS compliance | +| **TLS 1.3** | Enabled | Performance + security | +| **HSTS** | Enabled (max-age=31536000) | Force HTTPS | +| **Always Use HTTPS** | On | Redirect HTTP โ†’ HTTPS | +| **Origin Certificate** | Cloudflare Origin CA | 
15-year validity, free |
+
+## Cloudflare Tunnel
+
+| Component | Role | Deployment |
+|-----------|------|------------|
+| **cloudflared daemon** | Tunnel agent | 2+ replicas, namespace platform |
+| **Tunnel credentials** | Authentication secret | Vault / External-Secrets |
+| **Tunnel config** | Routing rules | ConfigMap |
+| **Health checks** | Availability checks | Cloudflare dashboard |
+
+**Benefits:**
+- No public IP exposed on the origin
+- Outbound-only connection (no inbound firewall rules)
+- End-to-end encryption
+- Automatic failover between replicas
+
+## Cloudflare Access (Zero Trust)
+
+| Resource | Policy | Authentication |
+|----------|--------|----------------|
+| **grafana.localplus.io** | Team only | GitHub SSO |
+| **flux.localplus.io** | Team only | GitHub SSO |
+| **api.localplus.io/admin** | Admin only | GitHub SSO + MFA |
+| **api.localplus.io/*** | Public | No auth (application handles) |
+
+---
+
+# 🌍 **DNS Configuration**
+
+## DNS Records — localplus.io
+
+| Type | Name | Content | Proxy | TTL |
+|------|------|---------|-------|-----|
+| A | @ | Cloudflare Tunnel | ☁️ ON | Auto |
+| CNAME | www | @ | ☁️ ON | Auto |
+| CNAME | api | tunnel-xxx.cfargotunnel.com | ☁️ ON | Auto |
+| CNAME | grafana | tunnel-xxx.cfargotunnel.com | ☁️ ON | Auto |
+| CNAME | flux | tunnel-xxx.cfargotunnel.com | ☁️ ON | Auto |
+| TXT | @ | SPF record | ☁️ OFF | Auto |
+| TXT | _dmarc | DMARC policy | ☁️ OFF | Auto |
+| MX | @ | Mail provider | ☁️ OFF | Auto |
+
+---
+
+# 🛣️ **Route53 — Internal & Backup DNS**
+
+| Use Case | Solution | Configuration |
+|----------|----------|---------------|
+| **Public DNS (Primary)** | Cloudflare | Authoritative for `localplus.io` |
+| **Public DNS (Backup)** | Route53 | Secondary zone, sync via AXFR |
+| **Private DNS (Internal)** | Route53 Private Hosted Zones | `*.internal.localplus.io` |
+| **Service Discovery** | Route53 + Cloud Map | 
Rรฉsolution services internes | +| **Health Checks** | Route53 Health Checks | Failover automatique | + +## Architecture DNS Hybride + +``` +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ DNS ARCHITECTURE โ”‚ +โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค +โ”‚ โ”‚ +โ”‚ EXTERNAL TRAFFIC INTERNAL TRAFFIC โ”‚ +โ”‚ โ”‚ +โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ +โ”‚ โ”‚ Cloudflare DNS โ”‚ โ”‚ Route53 Private โ”‚ โ”‚ +โ”‚ โ”‚ (Primary) โ”‚ โ”‚ Hosted Zone โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ localplus.io โ”‚ โ”‚ internal. โ”‚ โ”‚ +โ”‚ โ”‚ api.localplus.ioโ”‚ โ”‚ localplus.io โ”‚ โ”‚ +โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ +โ”‚ โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ Failover โ”‚ VPC DNS โ”‚ +โ”‚ โ–ผ โ–ผ โ”‚ +โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ +โ”‚ โ”‚ Route53 Public โ”‚ โ”‚ EKS CoreDNS โ”‚ โ”‚ +โ”‚ โ”‚ (Backup) โ”‚ โ”‚ + Cloud Map โ”‚ โ”‚ +โ”‚ โ”‚ Health checks โ”‚ โ”‚ svc-*.svc. 
โ”‚ โ”‚ +โ”‚ โ”‚ Failover ready โ”‚ โ”‚ cluster.local โ”‚ โ”‚ +โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ +โ”‚ โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +``` + +## Route53 Features + +| Feature | Use Case Local-Plus | +|---------|---------------------| +| **Private Hosted Zones** | Rรฉsolution DNS interne VPC | +| **Health Checks** | Failover automatique | +| **Alias Records** | Pointage vers ALB/NLB | +| **Geolocation Routing** | Future multi-rรฉgion | +| **Failover Routing** | Backup si Cloudflare down | +| **Weighted Routing** | Canary deployments | + +--- + +# ๐Ÿšช **API Gateway / APIM (Future)** + +> **Statut :** ร€ dรฉfinir ultรฉrieurement. Pour le moment : Cloudflare โ†’ Cilium Gateway โ†’ Services. + +## Options ร  รฉvaluer + +| Solution | Type | Coรปt | Notes | +|----------|------|------|-------| +| **AWS API Gateway** | Managed | Pay-per-use | Simple, intรฉgrรฉ AWS | +| **Gravitee CE** | APIM complet | Gratuit | Portal, Subscriptions inclus | +| **Kong OSS** | Gateway | Gratuit | Populaire, plugins riches | +| **APISIX** | Gateway | Gratuit | Cloud-native, performant | + +**Dรฉcision reportรฉe ร  Phase 2+ selon les besoins :** +- Si besoin B2B/Partners โ†’ APIM (Gravitee) +- Si juste rate limiting/auth โ†’ AWS API Gateway +- Si multi-cloud requis โ†’ APISIX ou Kong + +## Architecture Actuelle (Phase 1) + +``` +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ ARCHITECTURE SIMPLIFIร‰E โ€” PHASE 1 โ”‚ 
+โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค +โ”‚ โ”‚ +โ”‚ Internet โ”‚ +โ”‚ โ”‚ โ”‚ +โ”‚ โ–ผ โ”‚ +โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ +โ”‚ โ”‚ CLOUDFLARE (DNS, WAF, DDoS, TLS) โ”‚ โ”‚ +โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ +โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ Tunnel โ”‚ +โ”‚ โ–ผ โ”‚ +โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ +โ”‚ โ”‚ AWS EKS โ€” Cilium Gateway API (Routing interne, mTLS) โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ โ”‚ +โ”‚ โ”‚ Services : svc-ledger, svc-wallet, svc-merchant, ... โ”‚ โ”‚ +โ”‚ โ”‚ โ”‚ โ”‚ +โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ +โ”‚ โ”‚ +โ”‚ Pas d'API Gateway dรฉdiรฉ pour le moment โ€” Cilium Gateway API suffit. โ”‚ +โ”‚ โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +``` + +--- + +# ๐ŸŒ **Multi-Cloud Vision** + +> **Objectif :** L'architecture edge (Cloudflare) est **cloud-agnostic** et peut router vers plusieurs cloud providers. 
+
+```
+┌──────────────────────────────────────────────────────────────────────┐
+│                  MULTI-CLOUD ARCHITECTURE (Future)                   │
+├──────────────────────────────────────────────────────────────────────┤
+│                                                                      │
+│                          CLOUDFLARE EDGE                             │
+│                     (Global Load Balancing)                          │
+│                               │                                      │
+│          ┌────────────────────┼────────────────────┐                 │
+│          │                    │                    │                 │
+│          ▼                    ▼                    ▼                 │
+│  ┌───────────────┐    ┌───────────────┐    ┌───────────────┐         │
+│  │ AWS (Primary) │    │ GCP (Future)  │    │ Azure (Future)│         │
+│  │ eu-west-1     │    │ europe-west1  │    │ westeurope    │         │
+│  │               │    │               │    │               │         │
+│  │ Gateway +     │    │ Gateway +     │    │ Gateway +     │         │
+│  │ Services      │    │ Services      │    │ Services      │         │
+│  └───────────────┘    └───────────────┘    └───────────────┘         │
+│                                                                      │
+│  ┌────────────────────────────────────────────────────────────────┐  │
+│  │              AIVEN (Multi-Cloud Data Layer)                    │  │
+│  │  • PostgreSQL with cross-cloud replication                     │  │
+│  │  • Kafka with cross-cloud MirrorMaker                          │  │
+│  │  • Valkey with replication                                     │  │
+│  └────────────────────────────────────────────────────────────────┘  │
+│                                                                      │
+└──────────────────────────────────────────────────────────────────────┘
+```
+
+## Multi-Cloud Readiness
+
+| Component | Multi-Cloud Ready | How |
+|-----------|-------------------|-----|
+| **Cloudflare** | ✅ Yes | Global load balancing, multi-origin health checks |
+| **APISIX** | ✅ Yes | Deployable on any K8s (EKS, GKE, AKS) |
+| **Aiven** | ✅ Yes | PostgreSQL, Kafka, Valkey available on AWS/GCP/Azure |
+| **Flux** | ✅ Yes | Can manage multi-cloud clusters |
+| **Vault** | ✅ Yes | Cross-datacenter replication |
+| **OTel** | ✅ Yes | Open standard, interchangeable backends |
+
+## Multi-Cloud Phases
+
+| Phase | Scope | Timeline |
+|-------|-------|----------|
+| **Phase 1 (Current)** | AWS only, cloud-agnostic architecture | Now |
+| **Phase 2** | DR on GCP (read replicas, failover) | +12 months |
+| **Phase 3** | Active-active multi-cloud | +24 months |
+
+---
+
+# 🏢 **Multi-Account Topology (AWS Organizations)**
+
+## Current Architecture — Phase 1
+
+> **Note:** Control Tower is not used. OUs, SCPs, and accounts are managed with Terraform.
+
+```
+┌──────────────────────────────────────────────────────────────────────┐
+│                  AWS ORGANIZATIONS — MULTI-ACCOUNT                   │
+├──────────────────────────────────────────────────────────────────────┤
+│                                                                      │
+│  ┌────────────────────────────────────────────────────────────────┐  │
+│  │                      MANAGEMENT ACCOUNT                        │  │
+│  │  • AWS Organizations, SCPs, IAM Identity Center                │  │
+│  │  ⚠️ No workloads, no VPC                                       │  │
+│  └────────────────────────────────────────────────────────────────┘  │
+│                                                                      │
+│  ┌───────────────────────┐        ┌───────────────────────┐          │
+│  │     LOG ARCHIVE       │        │   SECURITY / AUDIT    │          │
+│  │  • S3: CloudTrail     │        │  • Security Hub       │          │
+│  │  • S3: Config         │        │  • GuardDuty          │          │
+│  │  • S3: VPC Flow Logs  │        │  • IAM Access Analyzer│          │
+│  │  📦 No VPC            │        │  🔍 No VPC            │          │
+│  └───────────────────────┘        └───────────────────────┘          │
+│             ▲                                ▲                       │
+│             │ S3/API                         │ Findings              │
+│             │                                │                       │
+│  ┌──────────┴────────────────────────────────┴──────────────────┐    │
+│  │                     WORKLOAD ACCOUNTS                        │    │
+│  │   ┌─────────┐      ┌─────────┐      ┌─────────┐              │    │
+│  │   │   Dev   │      │ Staging │      │  Prod   │              │    │
+│  │   │   VPC   │      │   VPC   │      │   VPC   │              │    │
+│  │   └────┬────┘      └────┬────┘      └────┬────┘              │    │
+│  │        │                │                │                   │    │
+│  │        └────────────────┴────────────────┘                   │    │
+│  │                         │                                    │    │
+│  │                         │ VPC Peering                        │    │
+│  │                         ▼                                    │    │
+│  │                 ┌───────────────┐                            │    │
+│  │                 │   AIVEN VPC   │                            │    │
+│  │                 └───────────────┘                            │    │
+│  └──────────────────────────────────────────────────────────────┘    │
+│                                                                      │
+└──────────────────────────────────────────────────────────────────────┘
+```
+
+## Decision: No Hub-and-Spoke (Phase 1)
+
+| Criterion | Assessment | Decision |
+|-----------|------------|----------|
+| **Number of VPCs** | 3-4 (dev, staging, prod + Aiven) | VPC Peering is enough |
+| **Centralized inspection** | Not required | No Network Firewall |
+| **On-premises** | No VPN/Direct Connect | No need for Transit Gateway |
+| **Shared services** | Via AWS APIs, not the network | No hub VPC |
+
+## How do the accounts communicate?
+
+> **Key principle:** accounts share through **AWS APIs**, not through network connectivity.
+
+| Communication | Method | VPC Peering? |
+|---------------|--------|--------------|
+| **Logs → Log Archive** | S3 + Organizations | ❌ No |
+| **Findings → Security Hub** | AWS API aggregation | ❌ No |
+| **GuardDuty** | API-level, Organizations member | ❌ No |
+| **Secrets (Vault)** | HTTPS via Cloudflare Tunnel | ❌ No |
+| **Workload → Aiven** | VPC Peering | ✅ Yes |
+
+## Future Evolution (if needed)
+
+### Triggers for moving to Hub-and-Spoke
+
+| Trigger | Threshold | Action |
+|---------|-----------|--------|
+| **> 5 VPCs** | 6+ workload accounts | Transit Gateway |
+| **Egress inspection** | Compliance requires a firewall | Network account + AWS Network Firewall |
+| **VPN / Direct Connect** | On-premises connectivity | Transit Gateway + VPN |
+| **Regional egress** | Centralize NAT costs | Centralized NAT via Transit Gateway |
+
+### Future Architecture (Phase 2+)
+
+```
+┌──────────────────────────────────────────────────────────────────────┐
+│                 PHASE 2+ — IF INSPECTION IS REQUIRED                 │
+├──────────────────────────────────────────────────────────────────────┤
+│                                                                      │
+│          ┌─────────────────────────────────┐                         │
+│          │         NETWORK ACCOUNT         │                         │
+│          │      (new, only if needed)      │                         │
+│          │                                 │                         │
+│          │   ┌─────────────────────────┐   │                         │
+│          │   │     Transit Gateway     │   │                         │
+│          │   └───────────┬─────────────┘   │                         │
+│          │               │                 │                         │
+│          │   ┌───────────┴─────────────┐   │                         │
+│          │   │    Network Firewall     │   │  ◄── Egress inspection  │
+│          │   │       (optional)        │   │                         │
+│          │   └───────────┬─────────────┘   │                         │
+│          │               │                 │                         │
+│          │   ┌───────────┴─────────────┐   │                         │
+│          │   │       NAT Gateway       │   │  ◄── Centralized NAT    │
+│          │   └─────────────────────────┘   │                         │
+│          └───────────────┬─────────────────┘                         │
+│                          │                                           │
+│     ┌────────────────────┼────────────────────┐                      │
+│     │                    │                    │                      │
+│     ▼                    ▼                    ▼                      │
+│  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐                 │
+│  │   Dev VPC   │   │ Staging VPC │   │  Prod VPC   │                 │
+│  │   (Spoke)   │   │   (Spoke)   │   │   (Spoke)   │                 │
+│  └─────────────┘   └─────────────┘   └─────────────┘                 │
+│                                                                      │
+└──────────────────────────────────────────────────────────────────────┘
+```
+
+### Transit Gateway cost estimate
+
+| Component | Cost | Note |
+|-----------|------|------|
+| **Transit Gateway attachment** | ~$0.05/h per VPC | × number of VPCs |
+| **Data processing** | ~$0.02/GB | All inter-VPC traffic |
+| **Network Firewall** | ~$0.40/h + $0.016/GB | If inspection is enabled |
+
+> **Phase 1:** we avoid these costs by using direct VPC Peering.
+
+---
+
+*Maintained by: Platform Team*
+*Last updated: February 2026*
diff --git a/observability/OBSERVABILITY-GUIDE.md b/observability/OBSERVABILITY-GUIDE.md
new file mode 100644
index 0000000..a2612bf
--- /dev/null
+++ b/observability/OBSERVABILITY-GUIDE.md
@@ -0,0 +1,473 @@
+# Observability Guide
+## Kiven Platform -- Monitoring, Logging, Tracing & APM
+
+---
+
+> **Back to**: [Architecture Overview](../EntrepriseArchitecture.md) | **See also**: [OTel Conventions](./OTEL-CONVENTIONS.md)
+
+---
+
+## Table of Contents
+
+1. [Stack Overview](#stack-overview)
+2. [Telemetry Pipeline](#telemetry-pipeline)
+3. [Metrics (Prometheus)](#metrics-prometheus)
+4. [Logs (Loki)](#logs-loki)
+5. [Traces (Tempo)](#traces-tempo)
+6. [APM (Application Performance Monitoring)](#apm-application-performance-monitoring)
+7. [Cardinality Management](#cardinality-management)
+8. [SLI/SLO/Error Budgets](#slisloerror-budgets)
+9. [Alerting Strategy](#alerting-strategy)
+10. [Dashboards & Visualizations](#dashboards--visualizations)
+
+> For OTel-specific conventions (span naming, attributes, Collector deployment, exporter helper, SDK usage), see [OTEL-CONVENTIONS.md](./OTEL-CONVENTIONS.md).
+
+---
+
+## Stack Overview
+
+### Self-Hosted Stack
+
+| Component | Tool | Cost | Retention |
+|-----------|------|------|-----------|
+| **Metrics** | Prometheus | €0 (self-hosted) | 15 days local |
+| **Metrics long-term** | Prometheus remote write → S3-backed store | ~€5/month S3 | 1 year |
+| **Logs** | Loki | €0 (self-hosted) | 30 days (GDPR) |
+| **Traces** | Tempo | €0 (self-hosted) | 7 days |
+| **Dashboards** | Grafana | €0 (self-hosted) | N/A |
+| **Fallback logs** | CloudWatch Logs | Free tier 5GB | 7 days |
+
+**Estimated cost: < €50/month** (mostly S3 storage)
+
+### Note on long-term storage
+
+To keep metrics beyond 15 days:
+
+| Option | Description | Complexity |
+|--------|-------------|------------|
+| **Remote write to an S3-backed store** | Prometheus remote-writes to a backend that persists to S3 | Simple |
+| **Grafana Mimir** | CNCF solution for scalable long-term storage | Medium |
+| **Victoria Metrics** | High-performance, Prometheus-compatible alternative | Medium |
+
+> **Local-Plus choice:** remote write to S3 via Grafana Mimir (or Victoria Metrics) — one extra component rather than a complex federation setup.
+ +--- + +## Telemetry Pipeline + +``` +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ Kiven Services โ”‚ โ”‚ OTel Agent โ”‚ โ”‚ OTel Gateway โ”‚ โ”‚ Backends โ”‚ +โ”‚ โ”‚ โ”‚ (DaemonSet) โ”‚ โ”‚ (Deployment) โ”‚ โ”‚ โ”‚ +โ”‚ โ€ข Go svc-* โ”‚โ”€โ”€โ”€โ”€โ–บโ”‚ โ€ข Receive OTLP โ”‚โ”€โ”€โ”€โ”€โ–บโ”‚ โ€ข Batch โ”‚โ”€โ”€โ”€โ”€โ–บโ”‚ Prometheus โ”‚ +โ”‚ โ€ข kiven-agent โ”‚ โ”‚ โ€ข Forward โ”‚ โ”‚ โ€ข Tail sample โ”‚ โ”‚ Loki โ”‚ +โ”‚ โ€ข dashboard โ”‚ โ”‚ โ€ข No processing โ”‚ โ”‚ โ€ข Scrub PII โ”‚ โ”‚ Tempo โ”‚ +โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ€ข Persistent Q โ”‚ โ”‚ Grafana โ”‚ +โ”‚ Instrumented โ”‚ โ”‚ Lightweight โ”‚ โ”‚ โ€ข Export โ”‚ โ”‚ โ”‚ +โ”‚ via kiven-go- โ”‚ โ”‚ ~50MB per node โ”‚ โ”‚ โ€ข 2-3 replicas โ”‚ โ”‚ โ”‚ +โ”‚ sdk/telemetry โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ”‚ + โ”‚ GDPR Scrubbing + โ–ผ + โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” + โ”‚ Removed: โ”‚ + โ”‚ โ€ข user_id โ”‚ + โ”‚ โ€ข user.email โ”‚ + โ”‚ โ€ข client_ip โ”‚ + โ”‚ โ€ข SQL params โ”‚ + โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +``` + +> See [OTEL-CONVENTIONS.md](./OTEL-CONVENTIONS.md) for Collector config details, exporter helper pattern, and persistent queue setup. 
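To make the Gateway's scrubbing stage concrete, here is a minimal sketch of what it does: drop the PII attributes and replace the client IP with a salted hash. The attribute names mirror the diagram and tables in this guide; the salt value and the truncation to 16 hex characters are illustrative assumptions:

```python
# Sketch of the GDPR scrubbing applied at the OTel Gateway: PII attributes
# are dropped, the client IP becomes a salted hash. Salt and truncation
# length are illustrative assumptions, not the real configuration.
import hashlib

DROP_KEYS = {"user.id", "user.email", "db.statement.params"}  # PII / SQL params
HASH_KEYS = {"http.client_ip"}
SALT = b"rotate-me"  # hypothetical; in practice this would come from Vault

def scrub(attributes: dict) -> dict:
    """Return a copy of span attributes with PII removed or pseudonymised."""
    out = {}
    for key, value in attributes.items():
        if key in DROP_KEYS:
            continue  # removed entirely
        if key in HASH_KEYS:
            value = hashlib.sha256(SALT + str(value).encode()).hexdigest()[:16]
        out[key] = value
    return out

span_attrs = {
    "http.route": "/api/v1/wallets/{id}",
    "user.email": "alice@example.com",
    "http.client_ip": "203.0.113.7",
}
print(scrub(span_attrs))  # email gone, IP pseudonymised, route untouched
```

In the real pipeline this is expressed as Collector processor configuration rather than application code, but the effect on each span is the same.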
+
+### OTel Collector — Role
+
+| Component | Role | Examples |
+|-----------|------|----------|
+| **Receivers** | Ingest telemetry data | OTLP (gRPC/HTTP), Prometheus scrape |
+| **Processors** | Transform, filter, enrich data | Batch, memory limiter, attribute deletion (PII), sampling |
+| **Exporters** | Ship to the backends | Prometheus, Loki, Tempo |
+
+### GDPR Compliance — Data removed
+
+| Data | Action | Reason |
+|------|--------|--------|
+| `user.id` | Dropped | PII |
+| `user.email` | Dropped | PII |
+| `http.client_ip` | Hashed | Anonymization |
+| High-cardinality `*_bucket` | Filtered | Performance |
+
+---
+
+## Metrics (Prometheus)
+
+### How Prometheus collects metrics
+
+Prometheus uses a **pull** model: it fetches metrics from each target.
+
+```
+┌──────────────────────────────────────────────────────────────────────┐
+│                   PROMETHEUS — COLLECTION MODEL                      │
+├──────────────────────────────────────────────────────────────────────┤
+│                                                                      │
+│                            PROMETHEUS                                │
+│                            (scraping)                                │
+│                                │                                     │
+│          ┌─────────────────────┼─────────────────────┐               │
+│          │                     │                     │               │
+│          ▼                     ▼                     ▼               │
+│    ┌──────────┐          ┌──────────┐          ┌──────────┐          │
+│    │  Pod A   │          │  Pod B   │          │  Pod C   │          │
+│    │          │          │          │          │          │          │
+│    │  :8080   │          │  :8080   │          │  :9090   │          │
+│    │ /metrics │          │ /metrics │          │ /metrics │          │
+│    └──────────┘          └──────────┘          └──────────┘          │
+│                                                                      │
+│   Prometheus does GET http://pod:port/metrics every 30s              │
+│                                                                      │
+└──────────────────────────────────────────────────────────────────────┘
+```
+
+### Target discovery — ServiceMonitor (Prometheus Operator)
+
+The **Prometheus Operator** uses **Custom Resources** to configure scrape targets automatically.
+
+| Resource | What it does |
+|----------|--------------|
+| **ServiceMonitor** | Selects Services by label; Prometheus scrapes the pods behind them |
+| **PodMonitor** | Selects Pods directly by label |
+
+**Flow:**
+
+1. Developer deploys a service with a label (e.g., `app: svc-api`)
+2. A ServiceMonitor selects this label
+3. Prometheus Operator automatically configures Prometheus
+4. Prometheus scrapes `/metrics` on the specified port
+
+**Benefits:**
+- GitOps-friendly — separate file, versioned, reviewable
+- Separation of concerns — monitoring decoupled from deployment
+- Flexibility — intervals, relabeling, TLS, authentication
+
+#### Endpoints
+
+| Service | Port | Path | Description |
+|---------|------|------|-------------|
+| **svc-api** | 8080 | `/metrics` | Via `promhttp` handler |
+| **svc-agent-relay** | 9090 | `/metrics` | gRPC service metrics |
+| **svc-provisioner** | 8082 | `/metrics` | Provisioning pipeline metrics |
+| **kiven-agent** | 9090 | `/metrics` | Agent-side CNPG + PG metrics |
+| **Grafana** | 3000 | `/metrics` | Internal metrics |
+| **Flux** | 8080 | `/metrics` | Reconciliation metrics |
+| **Node Exporter** | 9100 | `/metrics` | System metrics (CPU, RAM, disk) |
+
+---
+
+## Logs (Loki)
+
+### Configuration
+
+| Parameter | Value | Reason |
+|-----------|-------|--------|
+| **Retention** | 30 days | GDPR compliance |
+| **Max query series** | 5000 | Performance protection |
+| **Max entries per query** | 10000 | Performance protection |
+| **Storage backend** | S3 | Low cost, durability |
+
+### Log Labels (Low Cardinality)
+
+| Label | Example | Cardinality |
+|-------|---------|-------------|
+| `namespace` | svc-ledger | Low |
+| `pod` | svc-ledger-abc123 | Medium |
+| `container` | svc-ledger | Low |
+| `level` | info, error, warn | Very Low |
+| `stream` | stdout, stderr | Very Low |
+
+**⚠️ Never use as labels:** `user_id`, `request_id`, `trace_id`
+
+---
+
+## Traces (Tempo)
+
+### Configuration
+
+| Parameter | Value | Reason |
+|-----------|-------|--------|
+| **Retention** | 7 days | Cost vs. usefulness |
+| **Backend** | S3 | Durability |
+| **Protocol** | OTLP (gRPC + HTTP) | OTel standard |
+
+### Trace-to-Logs Correlation
+
+```
+┌─────────────────┐       trace_id      ┌─────────────────┐
+โ”‚ TRACES โ”‚โ—„โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–บโ”‚ LOGS โ”‚ +โ”‚ (Tempo) โ”‚ โ”‚ (Loki) โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ”‚ โ”‚ + โ”‚ Exemplars (trace_id in metrics) โ”‚ + โ”‚ โ”‚ + โ–ผ โ–ผ +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ GRAFANA โ”‚ +โ”‚ โ€ข Click trace โ†’ See logs for that request โ”‚ +โ”‚ โ€ข Click metric spike โ†’ Jump to exemplar trace โ”‚ +โ”‚ โ€ข Click error log โ†’ Navigate to full trace โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +``` + +--- + +# ๐ŸŽฏ **APM (Application Performance Monitoring)** + +## Stack APM + +| Composant | Outil | Usage | +|-----------|-------|-------| +| **Distributed Tracing** | Tempo + OTel | Request flow, latency breakdown | +| **Profiling** | Pyroscope (Grafana) | CPU/Memory profiling continu | +| **Error Tracking** | Sentry (self-hosted) | Exception tracking, stack traces | +| **Database APM** | pg_stat_statements | Query performance | +| **Real User Monitoring** | Grafana Faro | Frontend performance (si applicable) | + +## Sampling Strategy + +| Environment | Head Sampling | Tail Sampling | Rationale | +|-------------|---------------|---------------|-----------| +| **Dev** | 100% | N/A | Full visibility pour debug | +| **Staging** | 50% | Errors: 100% | Balance cost/visibility | +| **Prod** | 10% | Errors: 100%, Slow: 100% (>500ms) | Cost optimization | + +### Tail Sampling โ€” Rรจgles + +| Rรจgle | Condition | Pourquoi | +|-------|-----------|----------| +| **error-policy** | Status = ERROR | Toujours conserver les erreurs | +| **slow-policy** | Latency > 500ms | Dรฉtecter les lenteurs | +| **probabilistic-policy** | 10% 
alรฉatoire | ร‰chantillonnage de base | + +--- + +# ๐Ÿ“‰ **Cardinality Management** + +## Label Rules + +| Label | Action | Rationale | +|-------|--------|-----------| +| `user_id` | DROP | High cardinality, use traces | +| `request_id` | DROP | Use trace_id instead | +| `http.url` | DROP | URLs uniques = explosion | +| `http.route` | KEEP | Templated, low cardinality | +| `service.name` | KEEP | Essential | +| `http.method` | KEEP | Low cardinality | +| `http.status_code` | KEEP | Low cardinality | + +## Cardinality Limits + +| Metric Type | Max Labels | Max Series | +|-------------|------------|------------| +| Counter | 5 | 1000 | +| Histogram | 4 | 500 | +| Gauge | 5 | 1000 | + +--- + +# ๐ŸŽฏ **SLI/SLO/Error Budgets** + +## Service SLOs + +| Service | SLI | SLO | Error Budget | Burn Rate Alert | +|---------|-----|-----|--------------|-----------------| +| **svc-api** | Availability | 99.9% | 43 min/month | 14.4x = 1h alert | +| **svc-api** | Latency P99 | < 200ms | N/A | P99 > 200ms for 5min | +| **svc-provisioner** | Availability | 99.9% | 43 min/month | 14.4x = 1h alert | +| **svc-agent-relay** | Availability | 99.95% | 22 min/month | 14.4x = 30min alert | +| **Customer PostgreSQL** | Availability | 99.99% | 4.3 min/month | 6x = 15min alert | +| **Platform (infra)** | Availability | 99.5% | 3.6h/month | 6x = 2h alert | + +## SLO Formulas + +| Mรฉtrique | Formule | Signification | +|----------|---------|---------------| +| **Availability** | `1 - (erreurs / total)` | % de requรชtes sans erreur 5xx | +| **Error Budget Remaining** | `1 - ((1 - availability) / (1 - SLO))` | % du budget restant | +| **Burn Rate** | `error_rate / allowed_error_rate` | Vitesse de consommation du budget | + +--- + +# ๐Ÿšจ **Alerting Strategy** + +## Severity Levels + +| Severity | Exemple | Notification | On-call | +|----------|---------|--------------|---------| +| **P1 โ€” Critical** | svc-ledger down | PagerDuty immediate | Wake up | +| **P2 โ€” High** | Error rate > 5% | Slack + 
PagerDuty 15min | Within 30min |
+| **P3 — Medium** | Latency P99 > 500ms | Slack | Business hours |
+| **P4 — Low** | Disk usage > 80% | Slack | Next day |
+
+### Main Alerts
+
+| Alert | Condition | Severity | Action |
+|-------|-----------|----------|--------|
+| **ServiceDown** | `up == 0` for 1min | P1 | Runbook: restart, check logs |
+| **HighErrorRate** | Error rate > 5% for 5min | P2 | Investigate traces + Sentry |
+| **LatencyDegradation** | P99 > 2x baseline for 10min | P2 | Check slow spans in Tempo |
+| **DiskAlmostFull** | Disk > 80% | P4 | Extend volume or clean up |
+
+---
+
+## Dashboards & Visualizations
+
+### Grafana visualization types by metric type
+
+#### Counter
+
+> **Definition:** a value that can only go up (or reset to 0 on restart).
+
+| Visualization | Query | When to use |
+|---------------|-------|-------------|
+| **Stat (number)** | `sum(http_requests_total)` | Absolute total |
+| **Time Series (rate)** | `rate(http_requests_total[5m])` | Per-second throughput (RPS) |
+| **Bar Gauge** | `sum by (status_code) (rate(http_requests_total[5m]))` | Comparison across labels |
+
+```
+Visual example — Counter as a Time Series (rate)
+
+ RPS
+ 30 │        ╭───╮
+    │   ╭────╯   │
+ 20 │───╯        │
+    │            ╰────╮
+ 10 │                 ╰─────
+    └─────────────────────────▶ time
+     10:00    10:05    10:10
+```
+
+#### Gauge
+
+> **Definition:** an instantaneous value that can go up or down (temperature, active connections, CPU%).
+
+| Visualization | Query | When to use |
+|---------------|-------|-------------|
+| **Gauge (dial)** | `pg_stat_activity_count` | Current value at a glance |
+| **Stat** | `node_memory_MemAvailable_bytes / 1e9` | Single value with a unit |
+| **Time Series** | `process_resident_memory_bytes` | Evolution over time |
+| **Heatmap** | `avg by (pod) (container_memory_usage_bytes)` | Comparison across pods |
+
+```
+Visual example โ€” Gauge as dial
+
+ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
+ โ”‚      CPU %      โ”‚
+ โ”‚                 โ”‚
+ โ”‚   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”     โ”‚
+ โ”‚   โ”‚  67%   โ”‚    โ”‚
+ โ”‚   โ”‚  โ–ˆโ–ˆโ–ˆ   โ”‚    โ”‚
+ โ”‚   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜     โ”‚
+ โ”‚  0%       100%  โ”‚
+ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
+```
+
+### Histogram
+
+> **Definition:** A distribution of values across "buckets" (e.g. latency). Allows computing percentiles.
+
+| Visualization | Query | When to use |
+|---------------|-------|-------------|
+| **Time Series (P50/P95/P99)** | `histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))` | Latency trends |
+| **Heatmap** | `sum by (le) (rate(http_request_duration_seconds_bucket[5m]))` | Visual distribution |
+| **Stat** | `histogram_quantile(0.95, ...)` | Current P95 value |
+
+```
+Visual example โ€” Histogram as Heatmap (latency)
+
+ Latency
+ 1s    โ”‚โ–‘โ–‘โ–“โ–“โ–‘โ–‘
+ 500ms โ”‚โ–“โ–“โ–“โ–“โ–“โ–“โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ
+ 200ms โ”‚โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ
+ 100ms โ”‚โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ
+ 50ms  โ”‚โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ
+       โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ถ time
+        10:00        10:30        11:00
+
+ โ–‘ = few requests   โ–“ = some   โ–ˆ = many
+```
+
+### Summary
+
+> **Definition:** Like Histogram, but percentiles are computed client-side (less flexible).
+
+| Visualization | Query | When to use |
+|---------------|-------|-------------|
+| **Time Series** | `go_gc_duration_seconds{quantile="0.99"}` | Pre-computed percentile |
+| **Stat** | `go_gc_duration_seconds{quantile="0.5"}` | Median |
+
+> **Note:** Prefer Histogram. Summary is mainly seen in legacy Go exporters.
+
+---
+
+## Recommended dashboards by audience
+
+### Dashboard 1: Service Overview (On-call)
+
+| Panel | Type | Metric | Visualization |
+|-------|------|--------|---------------|
+| **Request Rate** | Counter | `rate(http_requests_total[5m])` | Time Series |
+| **Error Rate %** | Counter | `rate(errors[5m]) / rate(total[5m]) * 100` | Time Series + Threshold |
+| **Latency P50/P95/P99** | Histogram | `histogram_quantile(...)` | Time Series (3 lines) |
+| **Active Requests** | Gauge | `http_requests_in_flight` | Stat |
+
+### Dashboard 2: Infrastructure (Platform Team)
+
+| Panel | Type | Metric | Visualization |
+|-------|------|--------|---------------|
+| **CPU Usage %** | Counter | `rate(container_cpu_usage_seconds_total[5m])` | Gauge dial |
+| **Memory Usage** | Gauge | `container_memory_usage_bytes` | Bar Gauge |
+| **Network I/O** | Counter | `rate(container_network_receive_bytes_total[5m])` | Time Series |
+| **Disk Usage %** | Gauge | `node_filesystem_avail_bytes / node_filesystem_size_bytes` | Gauge |
+| **Pod Count** | Gauge | `kube_pod_status_phase{phase="Running"}` | Stat |
+
+### Dashboard 3: Database (Backend Devs)
+
+| Panel | Type | Metric | Visualization |
+|-------|------|--------|---------------|
+| **Active Connections** | Gauge | `pg_stat_activity_count` | Gauge dial |
+| **Query Duration (mean)** | Gauge | `pg_stat_statements_mean_time_seconds` | Time Series |
+| **Transactions/sec** | Counter | `rate(pg_stat_database_xact_commit[5m])` | Time Series |
+| **Replication Lag** | Gauge | 
`pg_replication_lag_seconds` | Stat with threshold |
+| **Cache Hit Ratio** | Gauge | `pg_stat_database_blks_hit / (blks_hit + blks_read)` | Stat % |
+
+### Dashboard 4: Kiven Business Metrics (Product)
+
+| Panel | Type | Metric | Visualization |
+|-------|------|--------|---------------|
+| **Services Created** | Counter | `sum(rate(kiven_services_created_total[1h]))` | Stat (big number) |
+| **Active Databases** | Gauge | `kiven_services_active_count` | Stat |
+| **Provisioning Time P95** | Histogram | `histogram_quantile(0.95, rate(kiven_provisioning_duration_bucket[1h]))` | Time Series |
+| **Connected Agents** | Gauge | `kiven_agent_connected` | Stat |
+| **Backup Success Rate** | Counter | `rate(kiven_backup_success_total[1h]) / rate(kiven_backup_total[1h])` | Gauge % |
+| **Business Errors** | Counter | `sum by (error_type) (rate(kiven_errors_total[5m]))` | Bar chart |
+
+---
+
+## Recap: which type for which metric?
+
+| Metric | Prometheus Type | Grafana Visualization |
+|--------|-----------------|-----------------------|
+| Request count | Counter | Time Series (rate) |
+| Total errors | Counter | Time Series (rate) + Stat |
+| Latency | Histogram | Time Series (quantile) + Heatmap |
+| Active connections | Gauge | Gauge dial or Stat |
+| Memory used | Gauge | Time Series or Bar Gauge |
+| CPU % | Gauge | Gauge dial |
+| GC duration | Summary | Time Series |
+| Queue size | Gauge | Stat with threshold |
+
+---
+
+*Maintained by: @kivenio/platform*
+*Last updated: February 2026*
+*See also: [OTEL-CONVENTIONS.md](./OTEL-CONVENTIONS.md) for instrumentation details*
diff --git a/observability/OTEL-CONVENTIONS.md b/observability/OTEL-CONVENTIONS.md
new file mode 100644
index 0000000..3319bcf
--- /dev/null
+++ b/observability/OTEL-CONVENTIONS.md
@@ -0,0 +1,346 @@
+# OpenTelemetry Conventions
+
+> **Back to**: [Observability Guide](./OBSERVABILITY-GUIDE.md) | [Architecture Overview](../EntrepriseArchitecture.md)
+
+## 
Architecture Decisions + +| Decision | Choice | Rationale | +|----------|--------|-----------| +| Collector deployment | Agent + Gateway (two-tier) | Resilient, scalable, centralized config for heavy lifting | +| SDK approach | `kiven-go-sdk/telemetry` wrapping OTel Go SDK | Conventions baked in, zero-config for services | +| Exporter pattern | Exporter helper with persistent queue | New OTel pattern: no separate batch processor, survives restarts | +| Propagation | W3C TraceContext + Baggage | Industry standard, cross-service compatible | +| Metrics backend | Prometheus (scrape) + OTel Collector (receive) | Prometheus for K8s ecosystem, OTel for application metrics | +| Traces backend | Tempo | Grafana-native, S3 storage, cost effective | +| Logs backend | Loki | Grafana-native, label-based, low cost | + +--- + +## Collector Deployment: Agent + Gateway Two-Tier + +``` +Services (pods) + โ”‚ + โ”‚ OTLP gRPC (localhost:4317) + โ–ผ +OTel Collector Agent (DaemonSet, 1 per node) + โ”‚ Lightweight: receive โ†’ forward + โ”‚ No processing, no sampling + โ”‚ + โ”‚ OTLP gRPC (cluster-internal) + โ–ผ +OTel Collector Gateway (Deployment, 2-3 replicas) + โ”‚ Heavy lifting: batch, filter, sample, scrub PII + โ”‚ Exporter helper with persistent queue + โ”‚ + โ”œโ”€โ”€โ–บ Tempo (traces) + โ”œโ”€โ”€โ–บ Prometheus remote write (metrics) + โ””โ”€โ”€โ–บ Loki (logs) +``` + +### Why Two-Tier + +- **Agent (DaemonSet)**: minimal config, low memory (~50MB), just forwards. If a node dies, only that node's in-flight data is lost. +- **Gateway (Deployment)**: centralized processing, tail sampling decisions, PII scrubbing, persistent queue. Horizontally scalable. +- **Alternative considered**: Sidecar per pod -- rejected because 15+ services means 15+ Collector instances consuming memory. DaemonSet shares one Collector per node. + +### Exporter Helper: Persistent Queue (New Pattern) + +Since OTel Collector v0.110+, the recommended pattern is exporter-level batching with persistent storage. 
This replaces the old separate `batch` processor.
+
+```yaml
+# Gateway Collector config
+exporters:
+  otlp/tempo:
+    endpoint: tempo:4317
+    tls:
+      insecure: true
+    sending_queue:
+      enabled: true
+      storage: file_storage/traces # persistent: survives collector restart
+      queue_size: 5000
+      num_consumers: 10
+    retry_on_failure:
+      enabled: true
+      initial_interval: 5s
+      max_interval: 30s
+      max_elapsed_time: 300s
+    batcher:
+      enabled: true
+      min_size: 500
+      max_size: 2000
+      timeout: 5s
+
+extensions:
+  file_storage/traces:
+    directory: /var/lib/otel/traces # PVC-backed for persistence
+    timeout: 10s
+    compaction:
+      on_start: true
+      directory: /tmp/otel-compaction
+```
+
+**Why this matters**: If the Gateway restarts (upgrade, OOM, node drain), queued spans are not lost. They are persisted to disk and replayed on startup.
+
+---
+
+## Span Naming Convention
+
+All spans follow this pattern:
+
+```
+kiven.{service}.{layer}.{operation}
+```
+
+### Layers
+
+| Layer | Description | Examples |
+|-------|-------------|----------|
+| `handler` | HTTP/gRPC request handlers | `kiven.svc-api.handler.CreateService` |
+| `repo` | Database repository methods | `kiven.svc-api.repo.InsertService` |
+| `provider` | Provider interface calls | `kiven.provider-cnpg.provider.GenerateClusterYAML` |
+| `infra` | AWS/cloud infrastructure calls | `kiven.svc-infra.infra.CreateNodeGroup` |
+| `agent` | Agent-side operations | `kiven.agent.agent.ApplyYAML` |
+| `grpc` | gRPC calls (auto-generated) | `/kiven.agent.v1.AgentRelay/Heartbeat` |
+
+### Auto-Generated Spans
+
+The `kiven-go-sdk/telemetry` package generates spans automatically:
+
+| Source | Span Name Format | Example |
+|--------|-------------------|---------|
+| HTTP middleware | `HTTP {method} {path}` | `HTTP GET /v1/services` |
+| gRPC server interceptor | `/{package}.{Service}/{Method}` | `/kiven.agent.v1.AgentRelay/Heartbeat` |
+| gRPC client interceptor | `/{package}.{Service}/{Method}` | `/kiven.agent.v1.AgentRelay/SendCommand` |
+| Manual (Trace helper) | 
Developer-defined | `repo.GetService`, `aws.CreateNodeGroup` | + +### Manual Span Creation + +Use the helpers from `kiven-go-sdk/telemetry`: + +```go +// Simple span +ctx, span := telemetry.Trace(ctx, "repo.GetService") +defer span.End() + +// Span with automatic error handling +err := telemetry.TraceFunc(ctx, "svc.CreateService", func(ctx context.Context) error { + return repo.Insert(ctx, svc) +}) + +// Add domain attributes to current span +telemetry.SetSpanAttributes(ctx, + attribute.String("kiven.service_id", serviceID), + attribute.String("kiven.plan", "business"), +) + +// Get trace ID for log correlation +traceID := telemetry.TraceID(ctx) +``` + +--- + +## Standard Attributes + +Every span SHOULD include these attributes where applicable: + +### Kiven Domain Attributes + +| Attribute | Type | Description | Example | +|-----------|------|-------------|---------| +| `kiven.org_id` | string | Organization ID | `org-abc123` | +| `kiven.project_id` | string | Project ID | `proj-xyz789` | +| `kiven.service_id` | string | Managed database service ID | `svc-pg-001` | +| `kiven.cluster_id` | string | Customer EKS cluster ID | `cluster-eu-west-1` | +| `kiven.plan` | string | Service plan | `business` | +| `kiven.env` | string | Environment | `production` | +| `kiven.agent_id` | string | Agent instance ID | `agent-abc123` | + +### OTel Semantic Convention Attributes (auto-set by middleware) + +| Attribute | Set By | Example | +|-----------|--------|---------| +| `http.request.method` | HTTP middleware | `GET` | +| `url.path` | HTTP middleware | `/v1/services` | +| `http.response.status_code` | HTTP middleware | `200` | +| `rpc.system` | gRPC interceptor | `grpc` | +| `rpc.service` | gRPC interceptor | `kiven.agent.v1.AgentRelay` | +| `rpc.method` | gRPC interceptor | `Heartbeat` | +| `rpc.grpc.status_code` | gRPC interceptor | `0` (OK) | +| `service.name` | Provider resource | `svc-api` | +| `deployment.environment` | Provider resource | `production` | + +--- + +## 
Metric Naming + +Follow OTel semantic conventions. Custom Kiven metrics use the `kiven.` prefix. + +### Standard Metrics (auto-collected) + +| Metric | Type | Source | +|--------|------|--------| +| `http.server.duration` | Histogram | HTTP middleware | +| `http.server.request.size` | Histogram | HTTP middleware | +| `rpc.server.duration` | Histogram | gRPC interceptor | +| `db.client.operation.duration` | Histogram | pgx tracing (Phase 1 gap) | + +### Kiven Business Metrics (service-specific) + +| Metric | Type | Service | Description | +|--------|------|---------|-------------| +| `kiven.provisioning.duration` | Histogram | svc-provisioner | Time to provision a database | +| `kiven.provisioning.active` | Gauge | svc-provisioner | In-flight provisioning jobs | +| `kiven.agent.connected` | Gauge | svc-agent-relay | Connected agents count | +| `kiven.agent.heartbeat.lag` | Histogram | svc-agent-relay | Time since last heartbeat | +| `kiven.backup.duration` | Histogram | svc-backups | Backup execution time | +| `kiven.backup.size` | Gauge | svc-backups | Last backup size in bytes | + +--- + +## Sampling Strategy + +Configured via `kiven-go-sdk/telemetry` Config: + +| Environment | Head Sampling Rate | Tail Sampling (Gateway) | Rationale | +|-------------|-------------------|-------------------------|-----------| +| `local` | 100% | N/A (stdout exporter) | Full visibility for development | +| `staging` | 100% | Keep all | Full visibility for QA | +| `production` | 10% | Errors: 100%, Slow (>500ms): 100% | Cost optimization | + +### Head Sampling (SDK-side) + +Set via `OTEL_SAMPLE_RATE` env var or `Config.SampleRate`. Uses `ParentBased(TraceIDRatioBased(rate))` so child spans always respect parent's sampling decision. 
+ +### Tail Sampling (Gateway Collector) + +The Gateway applies tail sampling after receiving all spans of a trace: + +```yaml +processors: + tail_sampling: + decision_wait: 10s + policies: + - name: errors + type: status_code + status_code: {status_codes: [ERROR]} + - name: slow + type: latency + latency: {threshold_ms: 500} + - name: probabilistic + type: probabilistic + probabilistic: {sampling_percentage: 10} +``` + +--- + +## SDK Usage: kiven-go-sdk/telemetry + +### Service Bootstrap + +```go +func main() { + ctx := context.Background() + + // Auto-configures from env vars (OTEL_EXPORTER_TYPE, OTEL_SAMPLE_RATE, etc.) + cfg, err := telemetry.NewConfigFromEnv("svc-api") + if err != nil { + log.Fatal(err) + } + + tp, err := telemetry.NewProvider(ctx, cfg) + if err != nil { + log.Fatal(err) + } + defer tp.Shutdown(ctx) + + // HTTP server with tracing middleware + r := chi.NewRouter() + r.Use(telemetry.HTTPMiddleware("svc-api")) + // ... +} +``` + +### gRPC Server with Tracing + +```go +server := grpc.NewServer( + grpc.UnaryInterceptor(telemetry.UnaryServerInterceptor()), + grpc.StreamInterceptor(telemetry.StreamServerInterceptor()), +) +``` + +### gRPC Client with Tracing + +```go +conn, _ := grpc.Dial(address, + grpc.WithUnaryInterceptor(telemetry.UnaryClientInterceptor()), + grpc.WithStreamInterceptor(telemetry.StreamClientInterceptor()), +) +``` + +### Environment Variables + +| Variable | Default | Description | +|----------|---------|-------------| +| `OTEL_EXPORTER_TYPE` | `stdout` | `stdout`, `otlp`, or `none` | +| `OTEL_EXPORTER_OTLP_ENDPOINT` | (none) | `host:port` of OTel Collector | +| `OTEL_EXPORTER_OTLP_INSECURE` | `false` | Skip TLS for local collectors | +| `OTEL_ENVIRONMENT` | `local` | `local`, `staging`, `production` | +| `OTEL_SAMPLE_RATE` | env-based | `0.0`-`1.0` (negative = auto) | +| `OTEL_SERVICE_VERSION` | (none) | Service version for resource | + +--- + +## What's Implemented vs Gaps + +### Implemented (kiven-go-sdk/telemetry) + +| File | 
What It Does | +|------|-------------| +| `config.go` | Config struct, DefaultConfig, NewConfigFromEnv, exporter types, env-based sampling | +| `provider.go` | TracerProvider with resource, exporter, sampler, global registration, W3C propagation | +| `span.go` | Trace(), TraceFunc(), SetSpanError(), SetSpanAttributes(), TraceID() helpers | +| `httpmiddleware.go` | chi-compatible HTTP middleware with W3C extraction, semantic convention attrs, X-Trace-ID header | +| `grpc.go` | Full gRPC instrumentation: Unary/Stream Server/Client interceptors with metadata propagation | +| Tests | config_test.go, provider_test.go, span_test.go, httpmiddleware_test.go, grpc_test.go | + +### Phase 1 Gaps (to implement) + +| Gap | Description | Priority | +|-----|-------------|----------| +| **MeterProvider** | OTel MeterProvider setup (like TracerProvider but for metrics). Services need `meter.Int64Counter()`, `meter.Float64Histogram()` etc. | P0 | +| **slog bridge** | Bridge Go `log/slog` to OTel Logs so structured logs flow through the same pipeline as traces/metrics | P1 | +| **pgx tracing hook** | `pgx.QueryTracer` implementation that auto-creates spans for every SQL query with `db.statement`, `db.operation` attributes | P0 | +| **Kiven attribute constants** | Package-level constants for `kiven.org_id`, `kiven.service_id` etc. 
to avoid string typos | P1 |
+
+---
+
+## GDPR Compliance in OTel Pipeline
+
+The Gateway Collector scrubs PII before exporting:
+
+| Data | Action | Processor |
+|------|--------|-----------|
+| `user.id` | Drop | `attributes/delete` |
+| `user.email` | Drop | `attributes/delete` |
+| `http.client_ip` | Redact | `transform` |
+| High cardinality metric labels | Drop | `filter` |
+| SQL query parameters | Redact | `transform` (replace bind values with `?`) |
+
+```yaml
+processors:
+  attributes/scrub:
+    actions:
+      - key: user.id
+        action: delete
+      - key: user.email
+        action: delete
+      - key: enduser.id
+        action: delete
+  transform/anonymize:
+    trace_statements:
+      - context: span
+        statements:
+          - replace_pattern(attributes["http.client_ip"], "^(.*)$", "REDACTED")
+```
diff --git a/onboarding/CUSTOMER-ONBOARDING.md b/onboarding/CUSTOMER-ONBOARDING.md
new file mode 100644
index 0000000..6fe2986
--- /dev/null
+++ b/onboarding/CUSTOMER-ONBOARDING.md
@@ -0,0 +1,210 @@
+# Customer Onboarding
+## *From Sign-Up to Running Database in 10 Minutes*
+
+---
+
+> **Back to**: [Architecture Overview](../EntrepriseArchitecture.md)
+
+---
+
+# Onboarding Flow
+
+```
+Step 1      Step 2         Step 3         Step 4       Step 5
+Sign up โ†’   Deploy TF   โ†’  Connect EKS โ†’  Create DB โ†’  Connected!
+(1 min)     module (2 min) (1 min)        (5-7 min)
+```
+
+---
+
+# Step 1: Sign Up (1 minute)
+
+Customer creates account on kiven.io:
+- Email + password or SSO (Google / GitHub)
+- Create organization
+- Invite team members (optional)
+
+---
+
+# Step 2: Deploy Terraform Module (2 minutes)
+
+Customer deploys Kiven's Terraform module in their AWS account. This creates the `KivenAccessRole` IAM role.
+
+### How It Works
+
+1. Kiven dashboard shows: "Connect your AWS account"
+2. Customer copies the Terraform module configuration (or uses the Terraform Registry)
+3. Customer runs `terraform init` and `terraform apply`
+4. 
Module creates: + - IAM Role `KivenAccessRole` (trusts Kiven's AWS account) + - IAM Policy `KivenAccessPolicy` (scoped permissions) + - ExternalId parameter (unique per customer, prevents confused deputy) +5. Terraform outputs: Role ARN โ†’ customer copies back to Kiven dashboard + +### Terraform Module (Summary) + +```hcl +module "kiven_access" { + source = "kivenio/kiven/aws" + version = "~> 1.0" + + external_id = var.kiven_external_id # Provided by Kiven dashboard + kiven_account_id = "123456789012" # Kiven's AWS account ID +} + +output "role_arn" { + description = "Paste this ARN in the Kiven dashboard" + value = module.kiven_access.role_arn +} +``` + +--- + +# Step 3: Connect EKS Cluster (1 minute) + +Customer provides their EKS cluster details: + +1. Select AWS region (auto-detected from IAM role) +2. Select EKS cluster (Kiven lists available clusters via AWS API) +3. Kiven validates: + - Can assume role โœ“ + - Can describe EKS cluster โœ“ + - Can access K8s API โœ“ +4. Dashboard shows: "Cluster connected!" + +### Cluster Discovery + +Kiven automatically discovers: +- EKS version +- VPC / subnets / AZs +- Existing node groups +- Installed operators (CNPG, cert-manager) +- Available storage classes +- Resource capacity (CPU, memory) + +--- + +# Step 4: Create Database (5-7 minutes) + +Customer clicks "Create Database" and configures: + +### Simple Mode (Form) + +| Field | Options | Default | +|-------|---------|---------| +| **Name** | Free text | `my-database` | +| **PostgreSQL Version** | 15, 16, 17 | 17 | +| **Plan** | Hobbyist, Startup, Business, Premium, Custom | Startup | +| **Region / AZ** | From customer's EKS subnets | Multi-AZ (auto) | +| **Initial Database** | Database name | `app` | +| **Initial User** | Username | `app_user` | + +### What Happens Behind the Scenes + +``` +1. 
Prerequisites Check (svc-provisioner) [~10s] + โ”œโ”€โ”€ Validate IAM permissions + โ”œโ”€โ”€ Check EKS cluster health + โ”œโ”€โ”€ Verify subnet availability across AZs + โ””โ”€โ”€ Check resource capacity + +2. Infrastructure Setup (svc-infra) [~2-3min] + โ”œโ”€โ”€ Create dedicated node group (kiven-db-{id}) + โ”œโ”€โ”€ Create StorageClass (kiven-db-gp3) + โ”œโ”€โ”€ Create S3 bucket (kiven-backups-{customer-id}) + โ”œโ”€โ”€ Create IRSA role (kiven-cnpg-backup-{id}) + โ””โ”€โ”€ Wait for nodes ready + +3. CNPG Setup (agent) [~1min] + โ”œโ”€โ”€ Create namespace (kiven-databases) + โ”œโ”€โ”€ Install CNPG operator (if not present) + โ”œโ”€โ”€ Deploy Kiven agent + โ””โ”€โ”€ Apply NetworkPolicies + +4. Database Provisioning (agent) [~2-3min] + โ”œโ”€โ”€ Apply CNPG Cluster YAML + โ”œโ”€โ”€ Apply PgBouncer Pooler YAML + โ”œโ”€โ”€ Apply ScheduledBackup YAML + โ”œโ”€โ”€ Wait for primary ready + โ”œโ”€โ”€ Wait for replicas synced + โ””โ”€โ”€ Create initial database + user + +5. Ready! [total: ~5-7min] + โ””โ”€โ”€ Return connection strings +``` + +### Dashboard Progress + +``` +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ Creating database: my-database โ”‚ +โ”‚ โ”‚ +โ”‚ [โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘] 72% โ”‚ +โ”‚ โ”‚ +โ”‚ โœ… Prerequisites validated โ”‚ +โ”‚ โœ… Node group created (2ร— r6g.medium) โ”‚ +โ”‚ โœ… Storage and backups configured โ”‚ +โ”‚ โœ… CNPG operator ready โ”‚ +โ”‚ โณ PostgreSQL starting... 
โ”‚ +โ”‚ โ—‹ Replicas syncing โ”‚ +โ”‚ โ—‹ Creating database and user โ”‚ +โ”‚ โ”‚ +โ”‚ Estimated time remaining: ~2 minutes โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +``` + +--- + +# Step 5: Connected! + +Customer receives: + +``` +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ โœ… Database ready! โ”‚ +โ”‚ โ”‚ +โ”‚ Connection Details: โ”‚ +โ”‚ โ”‚ +โ”‚ Host: pg-my-database-rw.kiven-databases.svc โ”‚ +โ”‚ Port: 5432 โ”‚ +โ”‚ Database: app โ”‚ +โ”‚ User: app_user โ”‚ +โ”‚ Password: โ€ขโ€ขโ€ขโ€ขโ€ขโ€ขโ€ขโ€ขโ€ขโ€ข [Reveal] [Copy] โ”‚ +โ”‚ โ”‚ +โ”‚ Pooler (recommended): โ”‚ +โ”‚ Host: pg-my-database-pooler.kiven-databases.svc โ”‚ +โ”‚ Port: 5432 โ”‚ +โ”‚ โ”‚ +โ”‚ Connection String: โ”‚ +โ”‚ postgresql://app_user:***@pg-my-database-pooler โ”‚ +โ”‚ .kiven-databases.svc:5432/app?sslmode=require โ”‚ +โ”‚ [Copy] โ”‚ +โ”‚ โ”‚ +โ”‚ [Open Dashboard] [View Metrics] [Add User] โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +``` + +--- + +# Offboarding + +When a customer deletes their database: + +1. **Confirmation dialog**: "This will delete your database. Backups will be retained for 30 days." +2. CNPG cluster deleted +3. Node group deleted +4. EBS volumes retained for 7 days, then deleted (configurable) +5. S3 backups retained for 30 days (configurable) +6. IAM IRSA role deleted +7. Audit log entry created + +When a customer removes Kiven entirely: +1. All databases must be deleted first (or exported) +2. Kiven agent uninstalled (`helm uninstall kiven-agent`) +3. Customer runs `terraform destroy` (removes IAM role) +4. 
Kiven retains customer metadata for 90 days (GDPR), then purges + +--- + +*Maintained by: Product Team + Platform Team* +*Last updated: February 2026* diff --git a/plans/KIVEN_ROADMAP.md b/plans/KIVEN_ROADMAP.md new file mode 100644 index 0000000..119cec4 --- /dev/null +++ b/plans/KIVEN_ROADMAP.md @@ -0,0 +1,852 @@ +name: Kiven Team Roadmap +overview: Quarterly roadmap (Q1-Q4 2026) with weekly tracking. From foundation to production-ready managed PostgreSQL platform. +todos: + - id: q1-sdk + content: "Q1: Complete kiven-go-sdk (error types, middleware, OTel, DB helpers)" + status: pending + - id: q1-proto + content: "Q1: Create contracts-proto (buf setup, agent.proto, metrics.proto)" + status: pending + - id: q1-api-scaffold + content: "Q1: Scaffold svc-api (chi router, DB layer, first read endpoints)" + status: pending + - id: q1-templates + content: "Q1: Apply Copier templates to all repos + create sdk-go template" + status: pending + - id: q1-auth-start + content: "Q1: Start svc-auth (OIDC login)" + status: pending + - id: q2-auth + content: "Q2: Complete svc-auth (API keys, RBAC)" + status: pending + - id: q2-cnpg + content: "Q2: Build provider-cnpg (YAML generation, status parsing)" + status: pending + - id: q2-infra + content: "Q2: Build svc-infra (AWS SDK, node groups, S3, IRSA)" + status: pending + - id: q2-agent + content: "Q2: Build kiven-agent (CNPG informers, gRPC client, command executor)" + status: pending + - id: q2-provisioner + content: "Q2: Build svc-provisioner (state machine, full pipeline)" + status: pending + - id: q2-e2e + content: "Q2: End-to-end in kind (create DB โ†’ get connection string)" + status: pending + - id: q2-dashboard + content: "Q2: Dashboard API integration + auth flow + real data" + status: pending + - id: q3-gitops + content: "Q3: Production deployment (Flux, Helm, staging EKS)" + status: pending + - id: q3-observability + content: "Q3: Observability stack (Prometheus, Loki, Tempo, Grafana)" + status: pending + - id: 
q3-security + content: "Q3: Security hardening (Vault, mTLS, Kyverno, cert-manager)" + status: pending + - id: q3-onboarding + content: "Q3: Customer onboarding (Terraform module, EKS discovery, wizard)" + status: pending + - id: q3-enterprise + content: "Q3: Enterprise services (monitoring, billing, audit, notifications)" + status: pending + - id: q3-first-customer + content: "Q3: First customer live on production" + status: pending + - id: q4-stategraph + content: "Q4: Migrate Terraform state from S3 to Stategraph" + status: pending + - id: q4-migrations + content: "Q4: svc-migrations (import from Aiven, RDS, bare PG)" + status: pending + - id: q4-soc2 + content: "Q4: SOC2 Type 1 evidence collection" + status: pending +isProject: false +--- + +# Kiven Roadmap 2026 โ€” Quarterly Plan with Weekly Tracking + +## Current State (as of Feb 23, 2026) + +| Asset | Status | +|-------|--------| +| `bootstrap` (Terraform SSO, Control Tower) | Done | +| `platform-github-management` (repo sync) | Done | +| `reusable-workflows` (8 workflows, 5 actions) | Done | +| `platform-templates-service-go` (Copier) | Done | +| `kiven-go-sdk` (15 files, 4 tests) | Partial | +| `dashboard` (14 pages) | Scaffolded, no API | +| `kiven-dev` (Taskfile, kind, CNPG) | Done | +| `svc-api` (OpenAPI spec) | Spec only, no Go code | +| All other services | Nothing | + +**What does NOT work**: No service has Go code, no agent, no provider, no provisioning pipeline, no auth, no customer-facing functionality. + +## Production Target + +A customer can: + +1. Sign up and log in (OIDC + SSO/SAML) +2. Register their EKS cluster (Terraform module) +3. Click "Create Database" โ†’ PostgreSQL connection string in ~10 minutes +4. See metrics, logs, backups, users, connection info in the dashboard +5. Get DBA recommendations, alerts, and performance insights +6. Power on/off databases on schedule +7. Pay via Stripe with usage tracking +8. 
Have full audit trail of all operations + +## Team (4 developers) + +| Role | Path | +|------|------| +| **Dev 1** โ€” Backend Lead | SDK โ†’ svc-auth โ†’ svc-provisioner โ†’ svc-monitoring โ†’ svc-billing | +| **Dev 2** โ€” K8s/Infra | contracts-proto โ†’ provider-cnpg โ†’ kiven-agent โ†’ observability โ†’ security | +| **Dev 3** โ€” Cloud/AWS | svc-api โ†’ svc-infra โ†’ svc-clusters/backups/users โ†’ GitOps โ†’ onboarding | +| **Dev 4** โ€” Frontend | Templates โ†’ dashboard โ†’ svc-yamleditor โ†’ svc-audit/notification | + +## Calendar Reference + +| Roadmap Week | Calendar Date | Quarter | +|---|---|---| +| W1 | Feb 23 | Q1 | +| W2 | Mar 2 | Q1 | +| W3 | Mar 9 | Q1 | +| W4 | Mar 16 | Q1 | +| W5 | Mar 23 | Q1 | +| W6 | Mar 30 | Q1โ†’Q2 | +| W7 | Apr 6 | Q2 | +| W8 | Apr 13 | Q2 | +| W9 | Apr 20 | Q2 | +| W10 | Apr 27 | Q2 | +| W11 | May 4 | Q2 | +| W12 | May 11 | Q2 | +| W13 | May 18 | Q2 | +| W14 | May 25 | Q2 | +| W15 | Jun 1 | Q2 | +| W16 | Jun 8 | Q2 | +| W17 | Jun 15 | Q2 | +| W18 | Jun 22 | Q2 | +| W19 | Jun 29 | Q2โ†’Q3 | +| W20 | Jul 6 | Q3 | +| W21 | Jul 13 | Q3 | +| W22 | Jul 20 | Q3 | +| W23 | Jul 27 | Q3 | +| W24 | Aug 3 | Q3 | +| W25-31 | Aug-Sep | Q3 | +| W32+ | Oct+ | Q4 | + +--- + +# Q1 2026 โ€” FOUNDATION (Feb 23 โ†’ Mar 31) + +> **Theme**: Build the foundation. Every repo Phase 2 depends on is ready. +> +> **Objective**: SDK complete, gRPC contracts defined, svc-api serves data from DB, all repos scaffolded, dev environment works end-to-end. 
+ +## Q1 Dependency Graph + +```mermaid +graph LR + SDK[kiven-go-sdk] --> API[svc-api scaffold] + SDK --> AUTH_START[svc-auth start] + SDK --> CNPG_START[provider-cnpg start] + PROTO[contracts-proto] --> RELAY_START[svc-agent-relay start] + TEMPLATES[Copier templates] --> ALL[All Phase 2 repos scaffolded] + API --> ENDPOINTS[First read endpoints] +``` + +## Week 1 (Feb 23 - Feb 28) + +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `kiven-go-sdk` | Error types package (`errors/errors.go`), HTTP client helpers, pagination types | Shared error handling + HTTP primitives | +| Dev 1 | `kiven-go-sdk` | Refactor models to API contracts only (remove DB-specific fields, add docs) | Clean separation: SDK = API contracts | +| Dev 2 | `contracts-proto` | Define `.proto` files: `agent.proto`, `metrics.proto`, `commands.proto` | gRPC contract drafts | +| Dev 3 | `svc-api` | Scaffold Go project from Copier template (chi router, healthcheck, graceful shutdown, OTel init) | Running HTTP server at :8080/healthz | +| Dev 4 | All Phase 2 repos | Start running `copier copy` from template for: svc-auth, svc-infra, svc-agent-relay, provider-cnpg | First repos scaffolded | + +**W1 Review checkpoint**: SDK has error types + pagination. svc-api starts. Proto files drafted. 
+ +## Week 2 (Mar 2 - Mar 6) + +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `kiven-go-sdk` | Middleware package (logging, recovery, request ID, auth context) | Reusable chi middleware for all svc-* | +| Dev 1 | `kiven-go-sdk` | OTel: MeterProvider (counters, histograms, gauges), pgx tracing hook | Metrics + auto-traced SQL queries | +| Dev 2 | `contracts-proto` | `buf.yaml`, `buf.gen.yaml`, CI with `buf lint` + `buf breaking` | Generated Go code in `gen/go/` | +| Dev 3 | `svc-api` | Database layer: pgx pool, migration runner, repository pattern, internal domain models | `ServiceRepository` + `internal/domain/service.go` | +| Dev 3 | `svc-api` | OpenAPI validation middleware (validate requests/responses against spec) | Every request validated against openapi.yaml | +| Dev 4 | All Phase 2 repos | Continue `copier copy` for: svc-clusters, svc-backups, svc-users, kiven-agent | All Phase 2 repos scaffolded | + +**W2 Review checkpoint**: SDK has middleware + OTel metrics. Proto generates Go code. svc-api has DB layer. + +## Week 3 (Mar 9 - Mar 13) + +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `kiven-go-sdk` | OTel: slog bridge (structured logs โ†’ OTel Logs), Kiven attribute constants (`kiven.org_id`, etc.) 
| Full traces + metrics + logs from day 1 | +| Dev 1 | `kiven-go-sdk` | Database helpers: pgx pool factory, migration runner, transaction helpers with OTel tracing | Every service connects to DB with 3 lines | +| Dev 2 | `contracts-proto` | Finalize agent protocol: Heartbeat (bidirectional), CommandStream (server-push), MetricsStream (agent-push) | Stable gRPC contract | +| Dev 3 | `svc-api` | Implement read-only endpoints: `GET /v1/plans`, `GET /v1/services`, `GET /v1/services/{id}` | First working API endpoints from DB | +| Dev 4 | `platform-templates-sdk-go` | Create Copier template for sdk-go (no Dockerfile, no cmd/, no gRPC) | Template ready for provider repos | +| Dev 4 | `kiven-dev` | Verify `task dev` works end-to-end: Docker Compose + kind + CNPG + svc-api starts | Working local dev environment | + +**W3 Review checkpoint**: SDK feature-complete. Proto stable. svc-api returns plans from DB. `task dev` works. + +### -- PHASE 1 COMPLETE -- + +**Exit criteria**: +- [ ] `task dev` starts infra + svc-api +- [ ] `svc-api` returns service plans from DB +- [ ] `contracts-proto` generates Go code via `buf generate` +- [ ] All Phase 2 repos scaffolded (editorconfig, golangci, pre-commit, CI, Taskfile, Dockerfile, go.mod) +- [ ] `kiven-go-sdk` has: errors, middleware, OTel (traces + metrics + logs), DB helpers + +--- + +## Week 4 (Mar 16 - Mar 20) + +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-auth` | OIDC integration (Google/GitHub login via `coreos/go-oidc`), JWT token issuance | Users can log in, get a JWT | +| Dev 2 | `provider-cnpg` | Implement `GenerateClusterYAML()` โ€” given a service definition, produce valid CNPG Cluster manifest | CNPG Cluster YAML generation | +| Dev 3 | `svc-infra` | AWS SDK integration: `AssumeRole` into customer account, EKS `DescribeCluster` | Can access customer AWS resources | +| Dev 4 | `svc-api` | CRUD endpoints: `POST /v1/services`, `DELETE /v1/services/{id}`, `PATCH /v1/services/{id}` 
| Can create/update/delete services via API | + +**W4 Review checkpoint**: OIDC login works. CNPG YAML generation started. AWS AssumeRole works. + +## Week 5 (Mar 23 - Mar 27) + +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-auth` | API key management (create, list, revoke, hash with argon2) | Programmatic access for CLI/Terraform | +| Dev 2 | `provider-cnpg` | `GeneratePoolerYAML()`, `GenerateScheduledBackupYAML()` | Full CNPG manifest generation | +| Dev 3 | `svc-infra` | Create EKS managed node group (dedicated, tainted, right instance type) | Can create DB nodes in customer cluster | +| Dev 4 | `svc-api` | Customer cluster endpoints: `POST /v1/clusters`, `GET /v1/clusters` | Can register customer EKS clusters | + +**W5 Review checkpoint**: API keys work. Provider generates Cluster + Pooler + Backup YAML. Node group creation works. + +## Q1 Exit Criteria + +- [ ] SDK feature-complete (errors, middleware, OTel, DB helpers) +- [ ] gRPC contracts defined and generating Go code +- [ ] `svc-api` serves CRUD endpoints from PostgreSQL +- [ ] All Phase 2 repos scaffolded with Copier template +- [ ] `task dev` works end-to-end locally +- [ ] OIDC login flow works (svc-auth) +- [ ] CNPG YAML generation works (provider-cnpg) +- [ ] AWS AssumeRole + node group creation works (svc-infra) + +## Q1 Evolution Tracker + +| Week | SDK | Proto | svc-api | Templates | svc-auth | provider-cnpg | svc-infra | +|------|-----|-------|---------|-----------|----------|---------------|-----------| +| W1 | ๐Ÿ”จ | ๐Ÿ”จ | ๐Ÿ”จ | ๐Ÿ”จ | โ€” | โ€” | โ€” | +| W2 | ๐Ÿ”จ | ๐Ÿ”จ | ๐Ÿ”จ | ๐Ÿ”จ | โ€” | โ€” | โ€” | +| W3 | โœ… | โœ… | โœ… | โœ… | โ€” | โ€” | โ€” | +| W4 | โ€” | โ€” | ๐Ÿ”จ | โ€” | ๐Ÿ”จ | ๐Ÿ”จ | ๐Ÿ”จ | +| W5 | โ€” | โ€” | โœ… | โ€” | ๐Ÿ”จ | ๐Ÿ”จ | ๐Ÿ”จ | + +Legend: โ€” not started, ๐Ÿ”จ in progress, โœ… done + +--- + +# Q2 2026 โ€” CORE + ORCHESTRATION + DASHBOARD (Apr 1 โ†’ Jun 30) + +> **Theme**: Build everything needed for the MVP. 
Auth, provider, agent, provisioner, dashboard. +> +> **Objective**: A user can log in, create a database from the dashboard, and get a working PostgreSQL connection string โ€” all in a local kind cluster. Full provisioning pipeline works end-to-end. + +## Q2 Dependency Graph + +```mermaid +graph TD + subgraph "Apr (W6-9): Complete Core Services" + AUTH[svc-auth complete] --> API_PROTECT[svc-api protected] + CNPG[provider-cnpg complete] --> AGENT[kiven-agent] + INFRA[svc-infra complete] --> PROV[svc-provisioner] + RELAY[svc-agent-relay] --> AGENT + end + + subgraph "May (W10-14): Orchestration" + AGENT --> PROV + PROV --> CLUSTERS[svc-clusters] + PROV --> BACKUPS[svc-backups] + CLUSTERS --> USERS[svc-users] + PROV --> E2E[E2E integration test] + end + + subgraph "Jun (W15-18): Dashboard + Polish" + API_PROTECT --> DASH[Dashboard integration] + E2E --> DASH + end +``` + +## Week 6 (Mar 30 - Apr 3) + +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-auth` | RBAC middleware (admin, operator, viewer roles), org/team model | Role-based access control | +| Dev 1 | `svc-api` | Integrate auth middleware, protect all endpoints | Every API call requires valid token | +| Dev 2 | `provider-cnpg` | Implement `ParseStatus()`, `ParseMetrics()` from CNPG CRD status fields | Can read CNPG cluster state | +| Dev 3 | `svc-infra` | Create S3 bucket (encrypted, lifecycle rules) for backups, create IRSA role for CNPG | Backup infrastructure ready | +| Dev 4 | `dashboard` | API client layer: fetch wrapper, auth token management, error handling | Type-safe API client | + +**W6 Review checkpoint**: Auth complete (OIDC + API keys + RBAC). Provider can parse status. S3 bucket creation works. 
+ +## Week 7 (Apr 6 - Apr 10) + +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-auth` | Unit + integration tests for OIDC, API keys, RBAC | Auth fully tested | +| Dev 2 | `provider-cnpg` | Integration tests: generate YAML โ†’ validate against CNPG CRD schema | Provider fully tested | +| Dev 2 | `contracts-proto` | Finalize: Heartbeat, CommandStream, MetricsStream โ€” stable contract | gRPC contract frozen | +| Dev 3 | `svc-infra` | Create EBS StorageClass (gp3, encrypted, right IOPS) | Storage ready for DB volumes | +| Dev 4 | `dashboard` | Auth flow: login page, OIDC redirect, token storage, protected routes | Users can log in via dashboard | + +**W7 Review checkpoint**: Auth tested. Provider tested. svc-infra can create storage. Dashboard login works. + +## Week 8 (Apr 13 - Apr 17) + +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-provisioner` | Scaffold + state machine design: `provisioning_jobs` table, step definitions | Provisioner architecture ready | +| Dev 2 | `svc-agent-relay` | gRPC server: agent registration, heartbeat tracking, connection management | Agents can connect | +| Dev 3 | `svc-api` | Backup endpoints, user management endpoints, remaining CRUD | Full API endpoint coverage | +| Dev 4 | `dashboard` | Service list page: real data from API, create service wizard | Can create a database from the UI | + +**W8 Review checkpoint**: Provisioner designed. Agent relay accepts connections. Full API CRUD. Dashboard shows services. 
+ +### -- CORE SERVICES COMPLETE -- + +**Exit criteria (Week 8)**: +- [ ] svc-auth: OIDC + API keys + RBAC, fully tested +- [ ] provider-cnpg: YAML generation + status parsing, fully tested +- [ ] svc-infra: AssumeRole + node groups + S3 + IRSA + StorageClass +- [ ] svc-agent-relay: gRPC server accepts agent connections +- [ ] svc-api: full CRUD for services, clusters, backups, users + +--- + +## Week 9 (Apr 20 - Apr 24) + +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-provisioner` | Steps implementation: create_nodes โ†’ create_storage โ†’ create_s3 (calls svc-infra) | First 3 provisioning steps work | +| Dev 2 | `kiven-agent` | Go binary: gRPC client to relay, CNPG informers (watch Cluster/Backup CRDs), heartbeat | Agent running in kind, reports CNPG | +| Dev 3 | `svc-clusters` | Cluster lifecycle: get status from agent, basic CRUD | Can see cluster status | +| Dev 4 | `dashboard` | Service detail page: real connection info, status from API, power on/off | Can see database status in UI | + +**W9 Review checkpoint**: Provisioner creates infra resources. Agent watches CNPG CRDs. Cluster status visible. + +## Week 10 (Apr 27 - May 1) + +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-provisioner` | Steps: install_cnpg โ†’ deploy_cluster (calls agent-relay to send commands to agent) | Full pipeline: API โ†’ provisioner โ†’ infra + agent | +| Dev 2 | `kiven-agent` | Command executor: receive YAML from relay, `kubectl apply`, report result | Can apply CNPG manifests on command | +| Dev 3 | `svc-clusters` | Scale (change instances), power on/off (delete pods + scale node group) | Scale up/down, power on/off | +| Dev 4 | `dashboard` | Backups page (real data from API), backup timeline visualization | Backups visible in UI | + +**W10 Review checkpoint**: Full provisioning pipeline works (API โ†’ provisioner โ†’ infra โ†’ agent โ†’ CNPG). Power on/off works. 
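The Week 10 command executor boils down to: apply each manifest received from the relay, and report a result either way, so one bad command never stalls the stream. A minimal sketch with a stubbed applier standing in for `kubectl apply`; the types and field names are assumptions, not the frozen gRPC contract:

```go
package main

import "fmt"

// Command is a manifest push from the relay; Result is what the agent
// reports back.
type Command struct {
	ID       string
	Manifest string // rendered CNPG YAML
}

type Result struct {
	CommandID string
	OK        bool
	Detail    string
}

// Applier abstracts "kubectl apply"; the real agent would wrap a
// Kubernetes client here.
type Applier interface {
	Apply(manifest string) error
}

// Execute applies each command and always produces a result,
// success or failure.
func Execute(a Applier, cmds []Command) []Result {
	results := make([]Result, 0, len(cmds))
	for _, c := range cmds {
		if err := a.Apply(c.Manifest); err != nil {
			results = append(results, Result{CommandID: c.ID, OK: false, Detail: err.Error()})
			continue
		}
		results = append(results, Result{CommandID: c.ID, OK: true, Detail: "applied"})
	}
	return results
}

// fakeApplier stands in for the cluster during local testing.
type fakeApplier struct{}

func (fakeApplier) Apply(m string) error {
	if m == "" {
		return fmt.Errorf("empty manifest")
	}
	return nil
}

func main() {
	rs := Execute(fakeApplier{}, []Command{{ID: "c1", Manifest: "kind: Cluster"}, {ID: "c2"}})
	for _, r := range rs {
		fmt.Println(r.CommandID, r.OK) // c1 true, c2 false
	}
}
```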
+ +## Week 11 (May 4 - May 8) + +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-provisioner` | Error handling, retries, rollback on failure, idempotency | Resilient provisioning | +| Dev 2 | `kiven-agent` | PG stats collector: connect to PG, collect pg_stat_statements, send to relay | Metrics flowing to SaaS | +| Dev 3 | `svc-backups` | Backup management: trigger backup via agent, list backups from S3, restore | Backup/restore working | +| Dev 4 | `dashboard` | Users page (CRUD), metrics page (charts from agent data) | Full dashboard pages | + +**W11 Review checkpoint**: Provisioner handles errors. PG metrics flowing. Backup/restore works. All dashboard pages exist. + +## Week 12 (May 11 - May 15) + +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-provisioner` | Integration tests: full pipeline in kind | Provisioner fully tested | +| Dev 2 | `kiven-agent` | Log aggregator: collect PG logs from all pods, send to relay | Logs flowing to SaaS | +| Dev 2 | `kiven-agent-helm` | Helm chart for agent deployment | One-command agent install | +| Dev 3 | `svc-backups` | PITR restore, fork/clone support | Advanced backup features | +| Dev 4 | `dashboard` | Loading states, error handling, empty states | UI polish | + +**W12 Review checkpoint**: Provisioner tested. Agent has Helm chart. PITR works. Dashboard polished. 
+ +## Week 13 (May 18 - May 22) + +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-provisioner` | Status reporting: webhook/polling to update service status in svc-api | Real-time provisioning status | +| Dev 2 | `kiven-agent` | Infrastructure reporter: node status, storage usage, resource availability | Infrastructure metrics in SaaS | +| Dev 3 | `kiven-go-sdk` | Create `models.DatabaseUser` (distinct from dashboard User) + migration | Domain model for PG users | +| Dev 3 | `svc-users` | PG user management via agent: CREATE ROLE, GRANT, password rotation | Can manage database users | +| Dev 4 | `dashboard` | Responsive design, dark mode, accessibility | Production-ready UI | + +**W13 Review checkpoint**: Provisioning status updates in real-time. DB user management works. Dashboard responsive. + +## Week 14 (May 25 - May 29) + +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| All | E2E | End-to-end test in kind: create service โ†’ provisioner โ†’ agent โ†’ CNPG โ†’ PG running โ†’ connection string | **MVP proof: full loop works** | +| Dev 1 | `svc-provisioner` | Performance optimization, concurrent provisioning | Can provision multiple DBs | +| Dev 4 | `dashboard` | E2E test from UI: login โ†’ create DB โ†’ see status โ†’ manage users/backups | Full UI E2E | + +**W14 Review checkpoint**: **CRITICAL MILESTONE โ€” Full loop works in kind.** User creates DB โ†’ gets connection string. 
+ +### -- ORCHESTRATION COMPLETE -- + +**Exit criteria (Week 14)**: +- [ ] In kind: user creates service via API โ†’ provisioner โ†’ agent โ†’ CNPG cluster โ†’ connection string +- [ ] Backup/restore + PITR works +- [ ] DB user management works +- [ ] Agent collects metrics + logs from PostgreSQL +- [ ] Dashboard shows everything + +--- + +## Week 15 (Jun 1 - Jun 5) + +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-provisioner` | Hardening: retry logic, circuit breakers, graceful degradation | Production-grade provisioner | +| Dev 2 | `platform-observability` | Install Prometheus + Grafana via Helm, ServiceMonitor for all services | Metrics collection started | +| Dev 3 | `platform-gitops` | Flux Kustomizations/HelmReleases for all svc-*, environments (dev, staging, prod) | GitOps deployment pipeline | +| Dev 4 | `dashboard` | Settings page, organization management, team invitations | Admin features | + +**W15 Review checkpoint**: GitOps pipeline ready. Prometheus collecting metrics. Provisioner hardened. + +## Week 16 (Jun 8 - Jun 12) + +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-monitoring` | Scaffold + metrics ingestion from agent: pg_stat_statements, connections, replication lag | Metrics pipeline from agent to SaaS | +| Dev 2 | `platform-observability` | Install Loki + Promtail, configure log ingestion | Centralized logging | +| Dev 3 | All svc-* repos | Production Helm charts (per service), Kustomize overlays for env-specific config | `helm install svc-api` works | +| Dev 4 | `dashboard` | API documentation page (Redoc), connection string helper | Developer experience | + +**W16 Review checkpoint**: Metrics pipeline ingesting. Loki collecting logs. Helm charts for all services. 
+ +## Week 17 (Jun 15 - Jun 19) + +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-monitoring` | Basic alerting: connection pool exhaustion, replication lag, disk usage | Critical alerts work | +| Dev 2 | `platform-observability` | OTel Collector (DaemonSet agents + Gateway), Tempo as trace backend | Distributed tracing | +| Dev 3 | `kiven-dev` | Staging environment in real EKS: Terraform for Kiven SaaS EKS cluster | Staging cluster on AWS | +| Dev 4 | `svc-audit` | Scaffold + immutable audit log: every API call, every infra change, who/what/when | Audit trail started | + +**W17 Review checkpoint**: Alerting works. Tracing live. Staging EKS cluster exists. Audit logging. + +## Week 18 (Jun 22 - Jun 26) + +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-monitoring` | Grafana dashboards: service health, request latency, error rates, agent status | Operations visibility | +| Dev 2 | `platform-observability` | SLO definitions (99.9% API, <200ms p95), error budget alerts (Sloth/Pyrra) | SLO monitoring | +| Dev 3 | `platform-gitops` | Promotion workflow: dev โ†’ staging โ†’ prod with approval gates | Controlled rollouts | +| Dev 4 | `svc-notification` | Alert dispatch: Slack, email, webhook integration | Multi-channel alerting | + +**W18 Review checkpoint**: Grafana dashboards live. SLOs defined. Promotion workflow. Notifications dispatch. 
+ +## Q2 Exit Criteria + +- [ ] svc-auth complete: OIDC + API keys + RBAC + tests +- [ ] provider-cnpg complete: YAML gen + status parsing + tests +- [ ] svc-infra complete: node groups + S3 + IRSA + StorageClass +- [ ] kiven-agent: CNPG informers + command executor + PG stats + logs +- [ ] svc-provisioner: full pipeline, tested, resilient +- [ ] svc-clusters + svc-backups + svc-users: all working +- [ ] **E2E in kind: login โ†’ create DB โ†’ get connection string โ†’ manage** +- [ ] Dashboard: all pages with real data, auth, responsive +- [ ] GitOps pipeline (Flux) deployed +- [ ] Observability started (Prometheus, Loki, Tempo, Grafana) +- [ ] Staging EKS cluster running +- [ ] svc-monitoring ingesting metrics + basic alerts +- [ ] svc-audit + svc-notification scaffolded + +## Q2 Evolution Tracker + +| Week | Auth | Provider | Infra | Relay | Agent | Provisioner | Clusters | Backups | Users | Dashboard | GitOps | Observ. | +|------|------|----------|-------|-------|-------|-------------|----------|---------|-------|-----------|--------|---------| +| W6 | โœ… | ๐Ÿ”จ | ๐Ÿ”จ | โ€” | โ€” | โ€” | โ€” | โ€” | โ€” | ๐Ÿ”จ | โ€” | โ€” | +| W7 | โœ… | โœ… | ๐Ÿ”จ | โ€” | โ€” | โ€” | โ€” | โ€” | โ€” | ๐Ÿ”จ | โ€” | โ€” | +| W8 | โœ… | โœ… | โœ… | ๐Ÿ”จ | โ€” | ๐Ÿ”จ | โ€” | โ€” | โ€” | ๐Ÿ”จ | โ€” | โ€” | +| W9 | โœ… | โœ… | โœ… | โœ… | ๐Ÿ”จ | ๐Ÿ”จ | ๐Ÿ”จ | โ€” | โ€” | ๐Ÿ”จ | โ€” | โ€” | +| W10 | โœ… | โœ… | โœ… | โœ… | ๐Ÿ”จ | ๐Ÿ”จ | ๐Ÿ”จ | โ€” | โ€” | ๐Ÿ”จ | โ€” | โ€” | +| W11 | โœ… | โœ… | โœ… | โœ… | ๐Ÿ”จ | ๐Ÿ”จ | โœ… | ๐Ÿ”จ | โ€” | ๐Ÿ”จ | โ€” | โ€” | +| W12 | โœ… | โœ… | โœ… | โœ… | โœ… | ๐Ÿ”จ | โœ… | ๐Ÿ”จ | โ€” | ๐Ÿ”จ | โ€” | โ€” | +| W13 | โœ… | โœ… | โœ… | โœ… | โœ… | โœ… | โœ… | โœ… | ๐Ÿ”จ | ๐Ÿ”จ | โ€” | โ€” | +| W14 | โœ… | โœ… | โœ… | โœ… | โœ… | โœ… | โœ… | โœ… | โœ… | โœ… | โ€” | โ€” | +| W15 | โœ… | โœ… | โœ… | โœ… | โœ… | โœ… | โœ… | โœ… | โœ… | โœ… | ๐Ÿ”จ | ๐Ÿ”จ | +| W16 | โœ… | โœ… | โœ… | โœ… | โœ… | โœ… | โœ… | โœ… | โœ… | โœ… | ๐Ÿ”จ | ๐Ÿ”จ | +| W17 | โœ… | โœ… 
| โœ… | โœ… | โœ… | โœ… | โœ… | โœ… | โœ… | โœ… | ๐Ÿ”จ | ๐Ÿ”จ | +| W18 | โœ… | โœ… | โœ… | โœ… | โœ… | โœ… | โœ… | โœ… | โœ… | โœ… | โœ… | ๐Ÿ”จ | + +--- + +# Q3 2026 โ€” PRODUCTION + ENTERPRISE + FIRST CUSTOMER (Jul 1 โ†’ Sep 30) + +> **Theme**: Go to production. Security hardened, customers can onboard, billing works, DBA intelligence, first real customer. +> +> **Objective**: Kiven runs in production on real AWS. First customer onboarded, paying, with a managed PostgreSQL. + +## Q3 Dependency Graph + +```mermaid +graph TD + subgraph "Jul (W19-22): Security + Onboarding" + VAULT[Vault + ESO] --> SEC[Security hardened] + KYVERNO[Kyverno policies] --> SEC + CERT[cert-manager TLS] --> SEC + TF_MOD[Terraform onboarding module] --> WIZARD[Onboarding wizard] + SEC --> PROD_READY[Production ready] + end + + subgraph "Aug (W23-27): Enterprise + Billing" + BILLING[svc-billing Stripe] --> INVOICES[Usage + invoices] + DBA[DBA intelligence] --> ALERTS[Smart alerts] + YAML_ED[svc-yamleditor] --> ADV_MODE[Advanced Mode] + end + + subgraph "Sep (W28-31): First Customer" + PROD_READY --> CUSTOMER[First customer live] + WIZARD --> CUSTOMER + INVOICES --> CUSTOMER + end +``` + +## Week 19 (Jun 29 - Jul 3) + +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-monitoring` | DBA recommendations engine: auto-tune postgresql.conf based on workload patterns | "Increase shared_buffers" alerts | +| Dev 2 | `platform-security` | HashiCorp Vault: install, configure dynamic secrets for PG and AWS credentials | No more static secrets | +| Dev 3 | `platform-gitops` | Deploy all services to staging EKS via Flux, validate full stack | Services running on real EKS | +| Dev 4 | `svc-audit` | Complete audit log: append-only table, query API, retention policies | Compliance-ready audit trail | + +**W19 Review checkpoint**: DBA recommendations work. Vault installed. All services on staging EKS. Audit log complete. 
+ +## Week 20 (Jul 6 - Jul 10) + +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-monitoring` | Query optimizer: slow query detection, visual EXPLAIN, index suggestions | Actionable query insights | +| Dev 2 | `platform-security` | External Secrets Operator: sync Vault secrets โ†’ K8s Secrets for all services | Services read secrets natively | +| Dev 3 | `infra-customer-aws` | Terraform module: creates `KivenAccessRole` in customer AWS (IAM + trust policy) | IaC-native onboarding | +| Dev 4 | `svc-yamleditor` | Advanced Mode: YAML viewer/editor with Monaco, CNPG schema validation | Experts can see/edit YAML | + +**W20 Review checkpoint**: Query optimizer works. ESO syncing secrets. Terraform onboarding module ready. YAML editor works. + +## Week 21 (Jul 13 - Jul 17) + +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-monitoring` | Capacity planner: storage/CPU growth forecasting, "disk full in 14 days" | Proactive capacity alerts | +| Dev 2 | `platform-security` | cert-manager: Let's Encrypt ClusterIssuer, auto-TLS for all services | HTTPS everywhere | +| Dev 2 | `platform-security` | Kyverno policies: require resource limits, labels, block privileged pods | Policy enforcement | +| Dev 3 | `infra-customer-aws` | EKS discovery: validate cluster access, discover nodes, storage classes, CNPG | Automated cluster validation | +| Dev 4 | `svc-yamleditor` | Change history: git-like timeline of all YAML changes, rollback to any version | Full configuration history | + +**W21 Review checkpoint**: Capacity planning works. TLS everywhere. Kyverno enforcing. EKS discovery works. 
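The Week 21 "disk full in 14 days" forecast can be approximated by fitting a least-squares line through recent daily usage samples and extrapolating to capacity. A deliberately simple stand-in for the real forecaster:

```go
package main

import "fmt"

// DaysUntilFull fits a straight line through daily disk-usage samples
// (GiB) and extrapolates to capacity; returns -1 when usage is flat or
// shrinking or there is too little data.
func DaysUntilFull(samples []float64, capacityGiB float64) int {
	n := float64(len(samples))
	if n < 2 {
		return -1
	}
	// Least-squares slope with x = 0..n-1 (GiB per day).
	var sumX, sumY, sumXY, sumXX float64
	for i, y := range samples {
		x := float64(i)
		sumX += x
		sumY += y
		sumXY += x * y
		sumXX += x * x
	}
	slope := (n*sumXY - sumX*sumY) / (n*sumXX - sumX*sumX)
	if slope <= 0 {
		return -1
	}
	last := samples[len(samples)-1]
	return int((capacityGiB - last) / slope)
}

func main() {
	usage := []float64{60, 62, 64, 66, 68} // growing 2 GiB/day
	fmt.Println(DaysUntilFull(usage, 96))  // 14
}
```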
+ +## Week 22 (Jul 20 - Jul 24) + +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-monitoring` | Backup verification: automated weekly restore tests, RPO compliance dashboard | Verified backup reliability | +| Dev 2 | `platform-networking` | Cilium network policies: restrict pod-to-pod, mTLS between services | Zero-trust networking | +| Dev 3 | `svc-api` + `dashboard` | Onboarding wizard: Terraform โ†’ paste IAM Role ARN โ†’ validate โ†’ register โ†’ create first DB | Self-service onboarding | +| Dev 4 | `svc-billing` | Stripe integration: customer/subscription lifecycle, payment methods | Customers can subscribe | + +**W22 Review checkpoint**: Backup verification automated. Cilium mTLS. Onboarding wizard works. Stripe integration. + +## Week 23 (Jul 27 - Jul 31) + +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-monitoring` | Integration tests, dashboard widgets for DBA intelligence | DBA intelligence complete | +| Dev 2 | `platform-security` | Image signing (Cosign), SBOM generation, vulnerability scanning in CI | Supply chain security | +| Dev 3 | `infra-customer-aws` | Advanced Terraform modules: VPC peering, private endpoints, custom KMS | Enterprise networking | +| Dev 4 | `svc-billing` | Usage tracking: compute hours, storage consumption, backup storage | Accurate usage metering | + +**W23 Review checkpoint**: DBA intelligence done. Supply chain security. Advanced networking. Usage metering. 
+ +## Week 24 (Aug 3 - Aug 7) + +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `platform-observability` | On-call runbooks: automated alert โ†’ runbook link, PagerDuty integration | Operational readiness | +| Dev 2 | `platform-gateway` | Cloudflare Terraform: DNS, WAF rules, DDoS protection, Tunnel to EKS | kiven.io live | +| Dev 3 | `svc-api` | API rate limiting, pagination optimization, caching | Production-grade API | +| Dev 4 | `svc-billing` | Invoice generation: monthly invoices with line items (Kiven fee + AWS estimate) | Professional invoices | + +**W24 Review checkpoint**: On-call ready. kiven.io resolves. API production-grade. Invoicing works. + +## Week 25-26 (Aug 10 - Aug 22) โ€” Production Readiness Sprint + +| Dev | Focus | Task | +|-----|-------|------| +| All | Testing | Chaos testing: node failure, agent disconnect, CNPG failover | +| All | Testing | DR test: failover to second AZ, restore from backup | +| All | Security | Security audit: Vault, mTLS, Kyverno, no static credentials | +| All | Documentation | API docs (Redoc), user guides, admin guides | +| Dev 4 | Billing | Dashboard billing page: plan upgrade/downgrade, payment history | + +**W25-26 Review checkpoint**: Chaos tests pass. DR tested. Security audited. Docs complete. Billing UI done. + +## Week 27-28 (Aug 25 - Sep 5) โ€” Test Customer Dry Run + +| Dev | Focus | Task | +|-----|-------|------| +| All | Validation | Full customer onboarding with `test-client` AWS account | +| All | Validation | Provision database, verify metrics/logs/backups, test billing flow | +| All | Bug fixes | Fix issues found during dry run | +| Dev 4 | Legal | Terms of Service, Privacy Policy, DPA preparation | + +**W27-28 Review checkpoint**: Test customer fully onboarded. All flows work end-to-end. Bugs fixed. 
+ +## Week 29-31 (Sep 8 - Sep 26) โ€” First Customer + +| Week | Focus | Task | +|------|-------|------| +| W29 | Onboarding | First real customer: guided onboarding, dedicated support | +| W30 | Monitoring | 24/7 monitoring of first customer, immediate response to any issue | +| W31 | Stabilization | Bug fixes, performance tuning, documentation updates from learnings | + +**W29-31 Review checkpoint**: **CRITICAL โ€” First customer live and healthy.** + +## Q3 Exit Criteria + +- [ ] All services deploy via Flux (no manual `kubectl apply`) +- [ ] Grafana dashboards for every service (RED metrics) +- [ ] SLOs defined and monitored (99.9% API, 99.99% DB uptime) +- [ ] On-call rotation with PagerDuty +- [ ] Vault secrets, mTLS, Kyverno, cert-manager โ€” no static credentials +- [ ] Chaos testing passed (node failure, agent disconnect, CNPG failover) +- [ ] DR tested (AZ failover, backup restore) +- [ ] Customer onboarding tested end-to-end +- [ ] Billing tested (subscription โ†’ usage โ†’ invoice โ†’ payment) +- [ ] kiven.io live behind Cloudflare +- [ ] **First customer live on production** +- [ ] DBA intelligence: recommendations, query optimizer, capacity planner, backup verification +- [ ] svc-yamleditor: Advanced Mode with change history +- [ ] svc-audit: immutable audit log +- [ ] svc-notification: Slack + email + webhook + +## Q3 Evolution Tracker + +| Week | Security | Onboarding | Billing | DBA Intel. 
| YAML Editor | Audit | Staging | First Customer | +|------|----------|------------|---------|------------|-------------|-------|---------|----------------| +| W19 | ๐Ÿ”จ | โ€” | โ€” | ๐Ÿ”จ | โ€” | ๐Ÿ”จ | ๐Ÿ”จ | โ€” | +| W20 | ๐Ÿ”จ | ๐Ÿ”จ | โ€” | ๐Ÿ”จ | ๐Ÿ”จ | โœ… | โœ… | โ€” | +| W21 | ๐Ÿ”จ | ๐Ÿ”จ | โ€” | ๐Ÿ”จ | ๐Ÿ”จ | โœ… | โœ… | โ€” | +| W22 | ๐Ÿ”จ | ๐Ÿ”จ | ๐Ÿ”จ | ๐Ÿ”จ | โœ… | โœ… | โœ… | โ€” | +| W23 | โœ… | ๐Ÿ”จ | ๐Ÿ”จ | โœ… | โœ… | โœ… | โœ… | โ€” | +| W24 | โœ… | โœ… | ๐Ÿ”จ | โœ… | โœ… | โœ… | โœ… | โ€” | +| W25-26 | โœ… | โœ… | โœ… | โœ… | โœ… | โœ… | โœ… | โ€” | +| W27-28 | โœ… | โœ… | โœ… | โœ… | โœ… | โœ… | โœ… | ๐Ÿ”จ (dry run) | +| W29-31 | โœ… | โœ… | โœ… | โœ… | โœ… | โœ… | โœ… | โœ… | + +--- + +# Q4 2026 โ€” SCALE + OPTIMIZE + EXPAND (Oct 1 โ†’ Dec 31) + +> **Theme**: Stabilize production, onboard more customers, add migration tools, prepare multi-operator architecture, compliance. +> +> **Objective**: 5+ customers live. SOC2 Type 1 started. Migration tools working. Multi-operator architecture designed. 
+ +## Week 32-33 (Oct 1 - Oct 10) โ€” Post-Launch Stabilization + +| Dev | Focus | Task | +|-----|-------|------| +| Dev 1 | Monitoring | Fine-tune alerts, reduce noise, improve DBA recommendations accuracy | +| Dev 2 | Performance | Optimize agent footprint, reduce gRPC latency, connection pooling tuning | +| Dev 3 | Onboarding | Streamline onboarding based on first customer feedback, improve docs | +| Dev 4 | Dashboard | UX improvements from first customer feedback, polish | + +## Week 34-37 (Oct 13 - Nov 7) โ€” Migration Tools + +| Week | Dev | Repo | Task | Deliverable | +|------|-----|------|------|-------------| +| W34-35 | Dev 1 | `svc-migrations` | Import from Aiven: logical replication setup, progress tracking, cutover | Customers can migrate from Aiven | +| W34-35 | Dev 3 | `svc-migrations` | Import from RDS: pg_dump/restore, pg_basebackup | Customers can migrate from RDS | +| W36-37 | Dev 1 | `svc-migrations` | Import from bare PostgreSQL, migration progress dashboard | Migrate from any PG source | +| W34-37 | Dev 2 | `svc-auth` | SSO/SAML support (enterprise), advanced RBAC (per-service permissions) | Enterprise auth | +| W34-37 | Dev 4 | `dashboard` | Migration wizard UI, SSO settings page | Migration + SSO in dashboard | + +**W37 Review checkpoint**: Migration from Aiven/RDS/bare PG works. SSO/SAML for enterprise customers. 
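The W34-35 RDS import path is essentially a custom-format `pg_dump` followed by `pg_restore` into the Kiven-managed cluster. A sketch of the command construction, using standard PostgreSQL client flags; hosts and the file layout are illustrative, and credentials would come from the environment rather than the command line:

```go
package main

import (
	"fmt"
	"strings"
)

// DumpRestoreArgs builds the pg_dump / pg_restore invocations for the
// one-shot import path: dump the source database in custom format,
// then restore it into the destination without reassigning ownership.
func DumpRestoreArgs(srcHost, dstHost, db string) (dump, restore []string) {
	dump = []string{"pg_dump", "-Fc", "-h", srcHost, "-d", db, "-f", db + ".dump"}
	restore = []string{"pg_restore", "--no-owner", "-h", dstHost, "-d", db, db + ".dump"}
	return dump, restore
}

func main() {
	d, r := DumpRestoreArgs("old.rds.example", "new.kiven.example", "app")
	fmt.Println(strings.Join(d, " "))
	fmt.Println(strings.Join(r, " "))
}
```

The Aiven path uses logical replication instead so the cutover window stays short; the dump/restore path trades a longer window for working against any source PostgreSQL.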
+ +## Week 38-40 (Nov 10 - Nov 28) โ€” Stategraph + CLI + Terraform Provider + +| Week | Dev | Repo | Task | Deliverable | +|------|-----|------|------|-------------| +| W38 | Dev 3 | `bootstrap` | Stategraph setup: deploy or configure Stategraph (PostgreSQL backend for TF state) | Stategraph ready | +| W39 | Dev 3 | `bootstrap` | Migrate sso/ and control-tower/ from S3 to Stategraph, validate | TF state in Stategraph | +| W38-39 | Dev 1 | `kiven-cli` | CLI tool (`kiven`): login, list services, create DB, get connection string, logs | Terminal-first workflows | +| W39-40 | Dev 2 | `terraform-provider-kiven` | Terraform provider: `kiven_service`, `kiven_database_user` resources | IaC-native provisioning | +| W38-40 | Dev 4 | `dashboard` | CLI download page, Terraform docs, API key management improvements | Developer experience | + +**W40 Review checkpoint**: Stategraph migrated. CLI works. Terraform provider available. + +## Week 41-44 (Dec 1 - Dec 26) โ€” Compliance + Multi-Operator Prep + +| Week | Dev | Repo | Task | Deliverable | +|------|-----|------|------|-------------| +| W41-42 | Dev 1 | `docs` | SOC2 Type 1 evidence collection: access controls, audit logs, encryption, change management | SOC2 evidence package | +| W41-42 | Dev 2 | `kiven-go-sdk` | Multi-operator architecture design: Provider interface review, Strimzi/Redis operator analysis | Architecture decision document | +| W43-44 | Dev 2 | `provider-strimzi` | Scaffold Strimzi provider (Kafka): basic YAML generation, CRD analysis | Multi-operator proof of concept | +| W41-44 | Dev 3 | Infrastructure | Customer #2-5 onboarding, Terraform module improvements from feedback | Scale validation | +| W41-44 | Dev 4 | `dashboard` | Multi-service UI prep (service type selector), performance optimization | Dashboard ready for Kafka | + +**W44 Review checkpoint**: SOC2 evidence started. Strimzi provider scaffolded. 5+ customers onboarded. 
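The W41-42 multi-operator design hinges on a Provider seam that provider-cnpg already satisfies and provider-strimzi would implement next: the provisioning pipeline dispatches a service spec to whichever provider owns its type. The interface below is a sketch of the idea under review, not the decided contract:

```go
package main

import "fmt"

// ServiceSpec is an operator-agnostic description of a managed service.
type ServiceSpec struct {
	Name     string
	Type     string // "postgresql", "kafka", ...
	Replicas int
}

// Provider is the seam between the pipeline and a specific operator.
type Provider interface {
	ServiceType() string
	GenerateManifests(spec ServiceSpec) ([]string, error)
}

// registry dispatches a spec to the provider owning its service type.
type registry map[string]Provider

func (r registry) Manifests(spec ServiceSpec) ([]string, error) {
	p, ok := r[spec.Type]
	if !ok {
		return nil, fmt.Errorf("no provider for service type %q", spec.Type)
	}
	return p.GenerateManifests(spec)
}

// cnpgProvider is a toy stand-in for provider-cnpg.
type cnpgProvider struct{}

func (cnpgProvider) ServiceType() string { return "postgresql" }
func (cnpgProvider) GenerateManifests(s ServiceSpec) ([]string, error) {
	yaml := fmt.Sprintf("apiVersion: postgresql.cnpg.io/v1\nkind: Cluster\nmetadata:\n  name: %s\nspec:\n  instances: %d\n", s.Name, s.Replicas)
	return []string{yaml}, nil
}

func main() {
	reg := registry{"postgresql": cnpgProvider{}}
	ms, err := reg.Manifests(ServiceSpec{Name: "orders-db", Type: "postgresql", Replicas: 3})
	fmt.Println(len(ms), err) // 1 <nil>
}
```

Adding Kafka then means registering one more Provider; the provisioner, relay, and API stay unchanged, which is the point of the W43-44 proof of concept.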
+ +## Q4 Exit Criteria + +- [ ] Stategraph: Terraform state migrated from S3 +- [ ] `kiven` CLI: login, create/list/delete services, get connection strings +- [ ] `terraform-provider-kiven`: resource types for services and users +- [ ] svc-migrations: import from Aiven, RDS, bare PostgreSQL +- [ ] SSO/SAML for enterprise customers +- [ ] SOC2 Type 1 evidence collection started +- [ ] Strimzi provider scaffolded (Kafka proof of concept) +- [ ] 5+ customers live on production +- [ ] Production stable for 3+ months + +## Q4 Evolution Tracker + +| Week | Migrations | SSO/SAML | Stategraph | CLI | TF Provider | SOC2 | Multi-Operator | Customers | +|------|-----------|----------|------------|-----|-------------|------|----------------|-----------| +| W32-33 | โ€” | โ€” | โ€” | โ€” | โ€” | โ€” | โ€” | 1 | +| W34-37 | ๐Ÿ”จ | ๐Ÿ”จ | โ€” | โ€” | โ€” | โ€” | โ€” | 2 | +| W38-40 | โœ… | โœ… | ๐Ÿ”จ | ๐Ÿ”จ | ๐Ÿ”จ | โ€” | โ€” | 3 | +| W41-44 | โœ… | โœ… | โœ… | โœ… | โœ… | ๐Ÿ”จ | ๐Ÿ”จ | 5+ | + +--- + +# Milestones Summary + +| Week | Date | Milestone | Verification | +|------|------|-----------|-------------| +| W3 | Mar 13 | Foundation done | `svc-api` returns plans from DB, `buf generate` works, all repos scaffolded | +| W5 | Mar 27 | Auth + Provider started | OIDC login works, CNPG YAML generates, AWS AssumeRole works | +| W7 | Apr 10 | Core services tested | Auth + Provider fully tested with unit + integration tests | +| W8 | Apr 17 | Core services complete | All building blocks ready for orchestration | +| W10 | May 1 | Provisioning pipeline works | API โ†’ provisioner โ†’ infra โ†’ agent โ†’ CNPG โ†’ PG running | +| W14 | May 29 | **E2E in kind** | Full loop: login โ†’ create DB โ†’ get connection string | +| W16 | Jun 12 | Dashboard complete | All pages with real data from API | +| W18 | Jun 26 | Staging on AWS | Services running on real EKS via Flux | +| W22 | Jul 24 | Security hardened | Vault, mTLS, Kyverno, cert-manager, Cilium | +| W24 | Aug 7 | Enterprise features | 
DBA intelligence, billing, audit, YAML editor, kiven.io live | +| W26 | Aug 22 | Production ready | Chaos tested, DR tested, security audited, docs complete | +| W29 | Sep 8 | **First customer live** | Real customer with managed PostgreSQL | +| W37 | Nov 7 | Migrations working | Import from Aiven, RDS, bare PG | +| W40 | Nov 28 | Developer tools | CLI + Terraform provider available | +| W44 | Dec 26 | Year-end | 5+ customers, SOC2 started, Kafka POC | + +--- + +# Gantt Chart + +```mermaid +gantt + title Kiven 2026 Roadmap + dateFormat YYYY-MM-DD + axisFormat %b %d + + section Q1 โ€” Foundation + SDK complete :q1a, 2026-02-23, 3w + contracts-proto :q1b, 2026-02-23, 3w + svc-api scaffold + endpoints :q1c, 2026-02-23, 5w + Apply Copier templates :q1d, 2026-02-23, 3w + svc-auth start (OIDC) :q1e, 2026-03-16, 2w + provider-cnpg start :q1f, 2026-03-16, 2w + svc-infra start :q1g, 2026-03-16, 2w + + section Q2 โ€” Core + Orchestration + svc-auth complete :q2a, 2026-03-30, 2w + provider-cnpg complete :q2b, 2026-03-30, 2w + svc-infra complete :q2c, 2026-03-30, 3w + svc-agent-relay :q2d, 2026-04-13, 2w + kiven-agent :q2e, 2026-04-20, 4w + svc-provisioner :q2f, 2026-04-20, 5w + svc-clusters + backups :q2g, 2026-04-20, 4w + svc-users :q2h, 2026-05-18, 2w + E2E integration :crit, q2i, 2026-05-25, 1w + Dashboard integration :q2j, 2026-03-30, 10w + GitOps + Helm :q2k, 2026-06-01, 3w + Observability start :q2l, 2026-06-01, 4w + svc-monitoring start :q2m, 2026-06-08, 3w + + section Q3 โ€” Production + Enterprise + Security hardening :q3a, 2026-06-29, 5w + Customer onboarding :q3b, 2026-07-06, 4w + DBA intelligence :q3c, 2026-06-29, 5w + svc-billing :q3d, 2026-07-20, 4w + svc-yamleditor :q3e, 2026-07-06, 3w + svc-audit + notification :q3f, 2026-06-29, 3w + Production readiness :q3g, 2026-08-10, 2w + Test customer dry run :q3h, 2026-08-25, 2w + First customer live :crit, q3i, 2026-09-08, 3w + + section Q4 โ€” Scale + Expand + Post-launch stabilization :q4a, 2026-10-01, 2w + 
svc-migrations                 :q4b, 2026-10-13, 4w
+    SSO/SAML                       :q4c, 2026-10-13, 4w
+    Stategraph migration           :q4d, 2026-11-10, 3w
+    CLI + TF provider              :q4e, 2026-11-10, 3w
+    SOC2 + multi-operator          :q4f, 2026-12-01, 4w
+```
+
+---
+
+# Weekly Review Template
+
+Use this template every Friday to track progress and catch risks early.
+
+## Week [N] Review — [Date]
+
+### Progress
+- [ ] **Dev 1**: [What was planned] → [What was delivered]
+- [ ] **Dev 2**: [What was planned] → [What was delivered]
+- [ ] **Dev 3**: [What was planned] → [What was delivered]
+- [ ] **Dev 4**: [What was planned] → [What was delivered]
+
+### Completion vs Plan
+| Metric | Value |
+|--------|-------|
+| Tasks planned | X |
+| Tasks completed | Y |
+| Tasks carried over | Z |
+| Completion rate | Y/X % |
+
+### Blockers
+| Blocker | Impact | Owner | Resolution |
+|---------|--------|-------|------------|
+| | | | |
+
+### Risks Identified
+| Risk | Probability | Impact | Mitigation |
+|------|-------------|--------|------------|
+| | | | |
+
+### Key Decisions Made
+- [ ] Decision: ... → Rationale: ...
+
+### Next Week Focus
+- Dev 1: ...
+- Dev 2: ...
+- Dev 3: ...
+- Dev 4: ...
+
+### Quarter Health
+
+```
+Q[N] Progress: [██████░░░░] XX%
+On track: [YES/AT RISK/BEHIND]
+Top risk: [description]
+```
diff --git a/platform/PLATFORM-ENGINEERING.md b/platform/PLATFORM-ENGINEERING.md
new file mode 100644
index 0000000..e1cf0cd
--- /dev/null
+++ b/platform/PLATFORM-ENGINEERING.md
@@ -0,0 +1,340 @@
+# 🛠️ **Platform Engineering**
+## *Kiven Contracts, Golden Path & Operations*
+
+---
+
+> **Back to**: [Architecture Overview](../EntrepriseArchitecture.md)
+
+---
+
+# 📋 **Table of Contents**
+
+1. [Platform Contracts](#platform-contracts)
+2. [CI/CD & Delivery](#cicd--delivery)
+3. [Golden Path (New Service Checklist)](#golden-path-new-service-checklist)
+4. [On-Call Structure](#on-call-structure)
+5. [Incident Management](#incident-management)
+6. 
[Service Templates](#service-templates)
+
+---
+
+# 📜 **Platform Contracts**
+
+## Guarantees
+
+| Contract | Platform Guarantee | Service Responsibility |
+|----------|--------------------|------------------------|
+| **Deployment** | Git push → Prod < 15min | Valid K8s manifests |
+| **Secrets** | Vault dynamic, automatic rotation | Use External-Secrets |
+| **Observability** | Auto-collection of traces/metrics/logs | OTel instrumentation |
+| **Networking** | mTLS enforced, Gateway API | Declare routes in HTTPRoute |
+| **Scaling** | HPA available | Configure requests/limits |
+| **Security** | Policies enforced | Pass the policies |
+
+---
+
+# 🚀 **CI/CD & Delivery**
+
+## GitOps with Flux
+
+| Concept | Implementation |
+|---------|----------------|
+| **Source of Truth** | Git repositories |
+| **Delivery Model** | Pull-based (Flux reconciles from Git) |
+| **Environments** | Kustomize overlays (dev/staging/prod) |
+| **Promotion** | PR from dev → staging → prod overlays |
+
+## GitHub Actions — Reusable Workflows
+
+> Shared workflows standardize the CI/CD pipelines. 
+
+| Type | Location | Usage |
+|------|----------|-------|
+| **Reusable workflows** | `.github/workflows/` | Shared build, test, deploy |
+| **Composite actions** | `.github/actions/` | Reusable common steps |
+
+## Standard Workflows
+
+| Workflow | Description | Target repos |
+|----------|-------------|--------------|
+| `ci-python.yml` | Lint, test, build | `svc-*`, `sdk-python` |
+| `ci-terraform.yml` | Format, lint, plan, apply | `platform-*`, `bootstrap` |
+| `cd-flux.yml` | Trigger Flux reconcile | All |
+| `security-scan.yml` | Trivy, Checkov, tfsec | All |
+
+## Pipeline Stages
+
+```
+┌─────────┐     ┌─────────┐     ┌─────────┐     ┌─────────┐     ┌─────────┐
+│  Lint   │────►│  Test   │────►│  Build  │────►│  Scan   │────►│  Push   │
+└─────────┘     └─────────┘     └─────────┘     └─────────┘     └─────────┘
+     │               │               │               │               │
+     │               │               │               │               ▼
+     │               │               │               │      ┌────────────────┐
+     │               │               │               │      │ Flux Reconcile │
+     │               │               │               │      └────────────────┘
+     ▼               ▼               ▼               ▼
+ Fail fast       Coverage       Image tag       CVE check
+```
+
+## Deployment SLA
+
+| Metric | Target | Measurement |
+|--------|--------|-------------|
+| **Git to Dev** | < 5 min | Commit to Flux reconcile |
+| **Git to Staging** | < 10 min | Commit to Flux reconcile (manual approval) |
+| **Git to Prod** | < 15 min | Commit to Flux reconcile (manual approval) |
+| **Rollback** | < 2 min | Flux rollback |
+
+## Observability Requirements
+
+| Signal | Requirement | Enforcement |
+|--------|-------------|-------------|
+| **Metrics** | `/metrics` endpoint exposed | Kyverno policy |
+| **Logs** | Structured JSON, no PII | OTel scrubbing |
+| **Traces** | OTel SDK instrumentation | Service 
template |
+| **Health** | `/health/live` + `/health/ready` | Kyverno policy |
+
+## Security Baseline
+
+| Requirement | Enforcement | Exception Process |
+|-------------|-------------|-------------------|
+| Non-root containers | Kyverno policy | ADR + Platform approval |
+| Read-only filesystem | Kyverno policy | ADR + Platform approval |
+| Resource limits | Kyverno policy | None |
+| Image signature | Kyverno policy | None |
+| mTLS | Cilium automatic | None |
+
+---
+
+# 🛤️ **Golden Path (New Service Checklist)**
+
+## Prerequisites
+
+- [ ] GitHub repo created from template
+- [ ] Team assigned in GitHub
+- [ ] CODEOWNERS configured
+
+## Step-by-Step Checklist
+
+| Step | Action | Validation | Owner |
+|------|--------|------------|-------|
+| 1 | Create repo from template | Structure compliant | Dev |
+| 2 | Define protos in `contracts-proto` | `buf lint` passes | Dev |
+| 3 | Implement the service | Unit tests > 80% coverage | Dev |
+| 4 | Configure K8s manifests | Kyverno policies pass | Dev |
+| 5 | Configure External-Secret | Secrets resolved | Dev + Platform |
+| 6 | Add ServiceMonitor | Metrics visible in Grafana | Dev |
+| 7 | Create HTTPRoute | Traffic routable | Dev |
+| 8 | Configure alerts | Runbook links | Dev + Platform |
+| 9 | PR review | Merge → Auto-deploy dev | Dev + Reviewer |
+| 10 | Staging validation | E2E tests pass | QA |
+| 11 | Prod deployment | Manual approval | Tech Lead |
+
+## Post-Deployment
+
+- [ ] Dashboard created in Grafana
+- [ ] Runbook documented
+- [ ] On-call routing configured
+- [ ] Load test baseline established
+
+---
+
+# 📞 **On-Call Structure**
+
+## Team Rotation (5 people)
+
+| Role | Responsibility | Rotation | Escalation |
+|------|----------------|----------|------------|
+| **Primary** | First responder, triage | Weekly | → Secondary (15min) |
+| **Secondary** | Escalation, expertise | Weekly | → Incident Commander |
+| **Incident Commander** | Coordination for P1 | On-demand | → 
Management |
+
+## Rotation Schedule
+
+| Week | Primary | Secondary |
+|------|---------|-----------|
+| 1 | Alice | Bob |
+| 2 | Bob | Charlie |
+| 3 | Charlie | Diana |
+| 4 | Diana | Eve |
+| 5 | Eve | Alice |
+
+## On-Call Expectations
+
+| Aspect | Requirement |
+|--------|-------------|
+| **Response Time (P1)** | < 5 min acknowledge |
+| **Response Time (P2)** | < 15 min acknowledge |
+| **Response Time (P3)** | < 1 hour acknowledge |
+| **Availability** | Reachable 24/7 during rotation |
+| **Handoff** | 30 min sync at rotation change |
+
+## Compensation
+
+| Activity | Compensation |
+|----------|--------------|
+| On-call week | Flat bonus |
+| Night incident (22:00-08:00) | Time-off + bonus |
+| Weekend incident | 1.5x time-off |
+
+---
+
+# 🚨 **Incident Management**
+
+## Severity Levels
+
+| Severity | Definition | Response | Communication |
+|----------|------------|----------|---------------|
+| **P1 — Critical** | Service down, data loss risk | Immediate, all hands | Slack + PagerDuty + Status page |
+| **P2 — High** | Degraded service, high error rate | Within 30 min | Slack + PagerDuty |
+| **P3 — Medium** | Performance degradation | Business hours | Slack |
+| **P4 — Low** | Minor issues, no user impact | Next sprint | Ticket |
+
+## Incident Workflow
+
+```
+┌──────────────────────────────────────────────────────────────┐
+│                      INCIDENT WORKFLOW                       │
+├──────────────────────────────────────────────────────────────┤
+│                                                              │
+│  1. DETECT                                                   │
+│     • Alert fires (Prometheus/Grafana)                       │
+│     • User report                                            │
+│     • Synthetic monitoring                                   │
+│                                                              │
+│  2. TRIAGE (Primary on-call)                                 │
+│     • Acknowledge alert                                      │
+│     • Assess severity                                        │
+│     • Start incident channel (#inc-YYYYMMDD-short-name)      │
+│                                                              │
+│  3. MITIGATE                                                 │
+│     • Apply runbook                                          │
+│     • Rollback if needed                                     │
+│     • Escalate if stuck > 15 min                             │
+│                                                              │
+│  4. RESOLVE                                                  │
+│     • Confirm service restored                               │
+│     • Update status page                                     │
+│     • Close alert                                            │
+│                                                              │
+│  5. POST-MORTEM (within 48h for P1/P2)                       │
+│     • Blameless analysis                                     │
+│     • Root cause identification                              │
+│     • Action items with owners                               │
+│                                                              │
+└──────────────────────────────────────────────────────────────┘
+```
+
+## Runbook Structure
+
+| Section | Content |
+|---------|---------|
+| **Overview** | Description of the alert |
+| **Impact** | User impact (High/Medium/Low), business impact |
+| **Prerequisites** | Required access and permissions |
+| **Diagnosis Steps** | Diagnostic steps (dashboards, logs, metrics) |
+| **Resolution Steps** | Resolution actions |
+| **Escalation** | Contacts and response times |
+| **Related** | Dashboards, related alerts, past incidents |
+
+## Post-Mortem Structure
+
+| Section | Content |
+|---------|---------|
+| **Summary** | One-paragraph summary |
+| **Timeline** | Timeline of events (UTC) |
+| **Root Cause** | Detailed root cause |
+| **Impact** | Affected users, revenue impact, SLO burn |
+| **What Went Well** | What worked well |
+| **What Went Wrong** | What went wrong |
+| **Action Items** | Actions with owner and due date |
+| **Lessons Learned** | Lessons learned |
+
+---
+
+# 📦 **Service Templates**
+
+## Python Service Template — Structure
+
+| Directory | Contents |
+|-----------|----------|
+| `src/app/` | FastAPI application code |
+| `src/app/api/` | Routes and 
dependencies |
+| `src/app/domain/` | Business entities and services |
+| `src/app/infrastructure/` | Database, Kafka, Cache |
+| `tests/` | Unit and integration tests |
+| `k8s/base/` | Base Kubernetes manifests |
+| `k8s/overlays/` | Per-environment overlays (dev, staging, prod) |
+| `migrations/` | Alembic migrations |
+
+## Included Kubernetes Manifests
+
+| File | Role |
+|------|------|
+| `deployment.yaml` | Deployment definition |
+| `service.yaml` | Internal exposure (ClusterIP) |
+| `configmap.yaml` | Non-sensitive configuration |
+| `hpa.yaml` | Horizontal Pod Autoscaler |
+| `pdb.yaml` | Pod Disruption Budget |
+| `servicemonitor.yaml` | Prometheus scraping |
+| `kustomization.yaml` | Kustomize base |
+
+## Deployment Configuration
+
+| Setting | Value | Reason |
+|---------|-------|--------|
+| **Replicas** | 2 (min) | High availability |
+| **Security Context** | runAsNonRoot: true, runAsUser: 1000 | Security |
+| **Filesystem** | readOnlyRootFilesystem: true | Security |
+| **Capabilities** | drop: ALL | Principle of least privilege |
+| **Resource requests** | CPU: 100m, Memory: 256Mi | Scheduling |
+| **Resource limits** | CPU: 500m, Memory: 512Mi | Protection against memory leaks |
+
+## Health Probes
+
+| Probe | Path | Delay | Period |
+|-------|------|-------|--------|
+| **Liveness** | `/health/live` | 10s | 10s |
+| **Readiness** | `/health/ready` | 5s | 5s |
+
+## Horizontal Pod Autoscaler
+
+| Metric | Target | Min/Max Replicas |
+|--------|--------|------------------|
+| **CPU** | 70% average utilization | 2 / 10 |
+| **Memory** | 80% average utilization | 2 / 10 |
+
+## ServiceMonitor Configuration
+
+| Setting | Value |
+|---------|-------|
+| **Port** | http (8080) |
+| **Path** | /metrics |
+| **Interval** | 30s |
+| **Selector** | matchLabels: app: ${SERVICE_NAME} |
+
+---
+
+## SLI/SLO/Error Budgets
+
+| Service | SLI | SLO | Error Budget | Burn Rate Alert | 
+|---------|-----|-----|--------------|-----------------|
+| **svc-ledger** | Availability | 99.9% | 43 min/month | 14.4x = 1h alert |
+| **svc-ledger** | Latency P99 | < 200ms | N/A | P99 > 200ms for 5min |
+| **svc-wallet** | Availability | 99.9% | 43 min/month | 14.4x = 1h alert |
+| **Platform (Flux, Prometheus)** | Availability | 99.5% | 3.6 h/month | 6x = 2h alert |
+
+## Error Budget Policy
+
+| Budget Consumed | Action |
+|-----------------|--------|
+| < 50% | Normal development velocity |
+| 50-75% | Increased testing, careful deployments |
+| 75-90% | Feature freeze, reliability focus |
+| > 90% | Emergency mode, only critical fixes |
+
+---
+
+*Maintained by: Platform Team*
+*Last updated: January 2026*
diff --git a/platform/YAML-CONFIG-VALIDATOR.md b/platform/YAML-CONFIG-VALIDATOR.md
new file mode 100644
index 0000000..42fd1e4
--- /dev/null
+++ b/platform/YAML-CONFIG-VALIDATOR.md
@@ -0,0 +1,141 @@
+# YCC — YAML Config Validator
+
+> **Priority:** Low
+> **Repo:** `yaml-config-validator`
+> **Language:** Python (Pydantic v2)
+> **Status:** Planned
+
+## Problem
+
+The Kiven platform relies on YAML files across multiple repos:
+- **`platform-github-management`** — Repo definitions (`repos/backend/*.yaml`, `config/enforced.yaml`, `config/defaults.yaml`)
+- **Copier templates** — `copier.yml` with questions, validators, conditional logic
+- **Service configs** — `.mise.toml`, `Taskfile.yml`, and CI workflows that reference variables
+
+These YAMLs are validated only at runtime (the sync script fails, Copier fails, CI fails).
+There is no **static validation** before merge. A typo like `template: servce-go` passes review
+and breaks provisioning.
+
+## Solution: YCC (YAML Config Checker)
+
+A Python CLI tool that validates YAML structures **statically** (schema) and **dynamically**
+(cross-references, variable resolution) using Pydantic v2. 
+ +## What It Validates + +### Static Validation (schema) + +| Source | Validates | +|--------|-----------| +| `repos/backend/*.yaml` | Required fields (`name`, `description`, `type`), valid types, valid rulesets | +| `repos/platform/*.yaml` | Template references exist in `config/templates.yaml` | +| `config/enforced.yaml` | Schema matches expected structure | +| `config/defaults.yaml` | Rulesets have required fields, label colors are valid hex | +| `copier.yml` | Question types are valid, validators use valid Jinja2 syntax | +| `renovate.json` | Matches Renovate JSON schema | + +### Dynamic Validation (cross-references) + +| Check | Description | +|-------|-------------| +| Template exists | If `template: service-go` is set, `config/templates.yaml` must have a `service-go` entry | +| No orphan repos | Every repo in `repos/` must have a corresponding template (or `# No template` comment) | +| Copier variables used | Variables defined in `copier.yml` must be referenced in at least one `.jinja` file | +| Copier variables resolved | `.jinja` files must not reference undefined variables | +| Workflow references valid | CI workflows referencing `kivenio/reusable-workflows` must point to existing workflow files | +| Ruleset exists | `ruleset: strict` must exist in `config/defaults.yaml` rulesets | + +## Architecture + +``` +yaml-config-validator/ +โ”œโ”€โ”€ ycc/ +โ”‚ โ”œโ”€โ”€ __init__.py +โ”‚ โ”œโ”€โ”€ cli.py # Click CLI entrypoint +โ”‚ โ”œโ”€โ”€ schemas/ +โ”‚ โ”‚ โ”œโ”€โ”€ repo_definition.py # Pydantic model for repos/*.yaml +โ”‚ โ”‚ โ”œโ”€โ”€ enforced_config.py # Pydantic model for enforced.yaml +โ”‚ โ”‚ โ”œโ”€โ”€ defaults_config.py # Pydantic model for defaults.yaml +โ”‚ โ”‚ โ”œโ”€โ”€ templates_config.py # Pydantic model for templates.yaml +โ”‚ โ”‚ โ””โ”€โ”€ copier_config.py # Pydantic model for copier.yml +โ”‚ โ”œโ”€โ”€ validators/ +โ”‚ โ”‚ โ”œโ”€โ”€ static.py # Schema-only validation +โ”‚ โ”‚ โ”œโ”€โ”€ cross_ref.py # Cross-reference validation +โ”‚ โ”‚ โ””โ”€โ”€ 
copier_vars.py # Copier variable resolution +โ”‚ โ””โ”€โ”€ reporters/ +โ”‚ โ”œโ”€โ”€ console.py # Pretty terminal output +โ”‚ โ””โ”€โ”€ github.py # PR comment with validation results +โ”œโ”€โ”€ tests/ +โ”‚ โ”œโ”€โ”€ test_schemas.py +โ”‚ โ”œโ”€โ”€ test_cross_ref.py +โ”‚ โ””โ”€โ”€ fixtures/ # Sample valid/invalid YAMLs +โ”œโ”€โ”€ pyproject.toml +โ””โ”€โ”€ README.md +``` + +## Usage (planned) + +```bash +# Validate platform-github-management +ycc validate ../platform-github-management/ + +# Validate a Copier template +ycc validate-template ../platform-templates-service-go/ + +# Validate a specific file +ycc validate-file repos/backend/core-services.yaml + +# CI mode (exit code 1 on failure, PR comment) +ycc validate --ci --github-pr 42 +``` + +## CI Integration (planned) + +A reusable workflow in `reusable-workflows` will call YCC on PRs to `platform-github-management`: + +```yaml +# In platform-github-management/.github/workflows/validate.yml +jobs: + validate: + uses: kivenio/reusable-workflows/.github/workflows/ycc-validate-reusable.yml@main + with: + config-path: "." 
+``` + +## Pydantic Model Example + +```python +from pydantic import BaseModel, field_validator + +class RepoDefinition(BaseModel): + name: str + description: str + type: str # service, library, platform, template, testing, documentation + template: str | None = None + ruleset: str | None = None + topics: list[str] = [] + visibility: str = "private" + + @field_validator("type") + @classmethod + def validate_type(cls, v: str) -> str: + valid = {"service", "library", "platform", "template", "testing", "documentation", "bootstrap", "sdk"} + if v not in valid: + raise ValueError(f"Invalid type '{v}', must be one of {valid}") + return v + + @field_validator("name") + @classmethod + def validate_name(cls, v: str) -> str: + if not v.replace("-", "").isalnum(): + raise ValueError(f"Name '{v}' must be alphanumeric with hyphens") + return v +``` + +## Why Pydantic v2 + +- **Type safety** โ€” Python type hints map directly to YAML schema +- **Custom validators** โ€” `@field_validator` for cross-reference checks +- **Error messages** โ€” Clear, structured errors with field paths +- **JSON Schema export** โ€” `model_json_schema()` generates JSON Schema for IDE autocomplete +- **Fast** โ€” Pydantic v2 (Rust core) validates thousands of files in milliseconds diff --git a/providers/PROVIDER-INTERFACE.md b/providers/PROVIDER-INTERFACE.md new file mode 100644 index 0000000..5fac6cc --- /dev/null +++ b/providers/PROVIDER-INTERFACE.md @@ -0,0 +1,265 @@ +# Provider Interface +## *Plugin Architecture for Multi-Operator Support* + +--- + +> **Back to**: [Architecture Overview](../EntrepriseArchitecture.md) + +--- + +# Why a Provider System + +Kiven starts with PostgreSQL (CNPG) but will expand to Kafka (Strimzi), Redis, Elasticsearch, and more. Instead of hardcoding CNPG throughout the codebase, we define a **Provider Interface** โ€” a Go interface that every data service must implement. 
+ +``` +Core Engine (operator-agnostic) + โ”‚ + โ”‚ Calls provider.Provision(), provider.Scale(), etc. + โ”‚ Doesn't know or care which operator is underneath. + โ”‚ + โ–ผ +Provider Interface (Go interface) + โ”‚ + โ”œโ”€โ”€ CNPG Provider (Phase 1) โ† implements interface for PostgreSQL + โ”œโ”€โ”€ Strimzi Provider (Phase 3) โ† implements interface for Kafka + โ”œโ”€โ”€ Redis Provider (Phase 3) โ† implements interface for Redis + โ””โ”€โ”€ ECK Provider (Phase 3) โ† implements interface for Elasticsearch +``` + +--- + +# The Interface + +```go +package provider + +import "context" + +// Provider is the interface every data service must implement. +// Core services (svc-provisioner, svc-clusters, etc.) call these methods +// without knowing which operator is underneath. +type Provider interface { + // Metadata + Name() string // "cnpg", "strimzi", "redis" + DisplayName() string // "PostgreSQL", "Kafka", "Redis" + Version() string // Provider version ("1.0.0") + SupportedVersions() []string // Data service versions ("15", "16", "17") + + // Discovery (used by agent) + Detect(ctx context.Context) (*DetectResult, error) + // Returns: operator installed? version? CRDs registered? 
+ + // Prerequisites + CheckPrerequisites(ctx context.Context, plan ServicePlan) (*PrereqReport, error) + // Returns: what's ready, what's missing, what needs fixing + + // Lifecycle + Provision(ctx context.Context, spec ClusterSpec) (*ClusterStatus, error) + Scale(ctx context.Context, id string, spec ScaleSpec) error + Upgrade(ctx context.Context, id string, targetVersion string) error + Delete(ctx context.Context, id string, retainVolumes bool) error + PowerOff(ctx context.Context, id string) error + PowerOn(ctx context.Context, id string) error + + // Status + GetStatus(ctx context.Context, id string) (*ClusterStatus, error) + ListClusters(ctx context.Context) ([]ClusterSummary, error) + + // YAML Generation (for Advanced Mode) + GenerateYAML(ctx context.Context, spec ClusterSpec) ([]YAMLResource, error) + ValidateYAML(ctx context.Context, yaml string) (*ValidationResult, error) + DiffYAML(ctx context.Context, id string, newYAML string) (*DiffResult, error) + + // Users & Access + ListUsers(ctx context.Context, id string) ([]DatabaseUser, error) + CreateUser(ctx context.Context, id string, spec UserSpec) (*DatabaseUser, error) + DeleteUser(ctx context.Context, id string, username string) error + UpdatePermissions(ctx context.Context, id string, username string, perms Permissions) error + + // Databases + ListDatabases(ctx context.Context, id string) ([]Database, error) + CreateDatabase(ctx context.Context, id string, spec DatabaseSpec) (*Database, error) + DeleteDatabase(ctx context.Context, id string, dbName string) error + + // Backups + ListBackups(ctx context.Context, id string) ([]Backup, error) + TriggerBackup(ctx context.Context, id string) (*Backup, error) + Restore(ctx context.Context, id string, target RestoreTarget) error + VerifyBackup(ctx context.Context, id string, backupID string) (*VerificationResult, error) + + // Metrics + CollectMetrics(ctx context.Context, id string) (*MetricsSnapshot, error) + GetConnectionInfo(ctx context.Context, id 
string) (*ConnectionInfo, error) + + // Configuration + GetConfig(ctx context.Context, id string) (*ServiceConfig, error) + UpdateConfig(ctx context.Context, id string, params map[string]string) error + + // Extensions / Plugins (service-specific) + ListExtensions(ctx context.Context, id string) ([]Extension, error) + EnableExtension(ctx context.Context, id string, extName string) error + DisableExtension(ctx context.Context, id string, extName string) error +} +``` + +--- + +# Key Types + +```go +// ClusterSpec defines what to provision +type ClusterSpec struct { + Name string + ServiceVersion string // "17" for PG 17 + Plan ServicePlan // Hobbyist, Startup, etc. + Instances int // Number of instances (1-5) + StorageSize string // "50Gi" + StorageIOPS int // 3000 + BackupSchedule string // "0 */6 * * *" + BackupRetention int // days + Parameters map[string]string // postgresql.conf overrides + Extensions []string // pg_vector, PostGIS... + PoolerEnabled bool + PoolerMode string // "transaction" + PoolerPoolSize int // 100 + TLSEnabled bool + Namespace string + Labels map[string]string +} + +// ClusterStatus is the current state +type ClusterStatus struct { + ID string + Name string + Phase string // "Healthy", "Provisioning", "Failing", "PoweredOff" + Instances int + ReadyInstances int + PrimaryPod string + ReplicaPods []string + ReplicationLag []ReplicaLag + StorageUsed string + StorageTotal string + ServiceVersion string + CreatedAt time.Time + ConnectionInfo ConnectionInfo +} + +// ConnectionInfo for the customer +type ConnectionInfo struct { + Host string // pg-main-rw.kiven-databases.svc + Port int // 5432 + ReadOnlyHost string // pg-main-ro.kiven-databases.svc + PoolerHost string // pg-main-pooler.kiven-databases.svc + Database string + Username string + PasswordRef string // K8s secret reference + SSLMode string // "require" +} + +// YAMLResource for Advanced Mode +type YAMLResource struct { + Kind string // "Cluster", "Pooler", "ScheduledBackup" + Name 
string + YAML string // The full YAML content + Checksum string // For diff detection +} +``` + +--- + +# CNPG Provider Implementation (Phase 1) + +The CNPG provider is the first (and currently only) implementation: + +```go +type CNPGProvider struct { + kubeClient client.Client // K8s client (via agent) + pgClient *pgxpool.Pool // PG connection (for stats, users) +} + +func (p *CNPGProvider) Name() string { return "cnpg" } +func (p *CNPGProvider) DisplayName() string { return "PostgreSQL" } +``` + +### How It Maps to CNPG CRDs + +| Provider Method | CNPG Action | +|----------------|-------------| +| `Provision()` | Create `Cluster` CR + `Pooler` CR + `ScheduledBackup` CR | +| `Scale()` | Update `Cluster.spec.instances` | +| `Upgrade()` | Update `Cluster.spec.imageName` (rolling update) | +| `Delete()` | Delete `Cluster` CR (PVCs retained if `retainVolumes=true`) | +| `PowerOff()` | Delete `Cluster` CR with `retainVolumes=true`, agent reports to svc-infra to scale nodes to 0 | +| `PowerOn()` | svc-infra scales nodes up, then re-apply `Cluster` CR with existing PVCs | +| `TriggerBackup()` | Create `Backup` CR | +| `Restore()` | Create new `Cluster` CR with `bootstrap.recovery` | +| `CreateUser()` | Execute SQL: `CREATE ROLE ... LOGIN PASSWORD ...` | +| `UpdateConfig()` | Update `Cluster.spec.postgresql.parameters` | +| `EnableExtension()` | Update `Cluster.spec.postgresql.shared_preload_libraries` + SQL `CREATE EXTENSION` | +| `GenerateYAML()` | Render CNPG CRD templates with ClusterSpec values | + +--- + +# Adding a New Provider (Future) + +To add Strimzi (Kafka) support: + +1. **Create** `provider-strimzi/` repository +2. **Implement** the `Provider` interface for Strimzi CRDs +3. **Map** Strimzi CRDs to provider methods: + - `Provision()` โ†’ Create `Kafka` CR + - `Scale()` โ†’ Update `Kafka.spec.kafka.replicas` + - `CreateUser()` โ†’ Create `KafkaUser` CR + - etc. +4. **Register** the provider in the provider registry +5. 
**Update** the agent to watch Strimzi CRDs (auto-detected)
+6. **Add** Kafka-specific UI components to the dashboard
+7. Core services (provisioner, billing, audit) work automatically — they call the interface, not the implementation.
+
+---
+
+# Provider Registry
+
+```go
+// Registry holds all available providers
+type Registry struct {
+    providers map[string]Provider
+}
+
+func NewRegistry() *Registry {
+    r := &Registry{providers: make(map[string]Provider)}
+    r.Register(cnpg.NewProvider()) // Phase 1
+    // r.Register(strimzi.NewProvider()) // Phase 3
+    // r.Register(redis.NewProvider())   // Phase 3
+    return r
+}
+
+// Register indexes a provider by its Name().
+func (r *Registry) Register(p Provider) {
+    r.providers[p.Name()] = p
+}
+
+func (r *Registry) Get(name string) (Provider, error) {
+    p, ok := r.providers[name]
+    if !ok {
+        return nil, fmt.Errorf("provider %q not found", name)
+    }
+    return p, nil
+}
+
+// List returns all registered providers.
+func (r *Registry) List() []Provider {
+    out := make([]Provider, 0, len(r.providers))
+    for _, p := range r.providers {
+        out = append(out, p)
+    }
+    return out
+}
+```
+
+---
+
+# Design Decisions
+
+| Decision | Rationale |
+|----------|-----------|
+| **Go interface** | Type-safe, compile-time verification, idiomatic for the K8s ecosystem |
+| **Single interface for all providers** | Core services don't need provider-specific code |
+| **Provider methods are high-level** | Providers handle CRD-specific details internally |
+| **YAML generation in provider** | Each provider knows its CRD schema |
+| **Agent auto-detection** | Agent discovers which operators are installed, activates relevant modules |
+
+---
+
+*Maintained by: Backend Team*
+*Last updated: February 2026*
diff --git a/resilience/DR-GUIDE.md b/resilience/DR-GUIDE.md
new file mode 100644
index 0000000..fc5fabe
--- /dev/null
+++ b/resilience/DR-GUIDE.md
@@ -0,0 +1,341 @@
+# ⚡ **Resilience & Disaster Recovery Guide**
+## *Kiven Backup, Recovery & Business Continuity*
+
+---
+
+> **Back to**: [Architecture Overview](../EntrepriseArchitecture.md)
+> **See also**: [Testing Strategy](../testing/TESTING-STRATEGY.md) — Performance and chaos tests
+
+---
+
+# 📋 **Table of Contents**
+
+1. 
[Failure Modes](#failure-modes)
+2. [Backup Strategy](#backup-strategy)
+3. [Automated Recovery](#automated-recovery)
+4. [Chaos Engineering](#chaos-engineering)
+5. [Disaster Recovery](#disaster-recovery)
+6. [Business Continuity](#business-continuity)
+
+---
+
+# 🔥 **Failure Modes**
+
+## Failure Matrix
+
+| Failure | Detection | Recovery | RTO | Impact |
+|---------|-----------|----------|-----|--------|
+| **Pod crash** | Liveness probe | Automatic K8s restart | < 30s | None (replicas) |
+| **Node failure** | Node NotReady | Automatic pod reschedule | < 2min | Minor latency spike |
+| **AZ failure** | Multi-AZ detect | Automatic traffic shift | < 5min | Reduced capacity |
+| **DB primary failure** | Aiven health | Automatic failover | < 5min | Brief connection errors |
+| **Kafka broker failure** | Aiven health | Automatic rebalance | < 2min | Brief producer retries |
+| **Cache failure** | Health check | Automatic fallback to DB | < 1min | Increased latency |
+| **Full region failure** | Health checks | DR procedure | 4h (target) | Extended outage |
+
+## Blast Radius Analysis
+
+```
+┌──────────────────────────────────────────────────────────────────────────┐
+│                          BLAST RADIUS ANALYSIS                           │
+├──────────────────────────────────────────────────────────────────────────┤
+│                                                                          │
+│  SINGLE POD FAILURE                                                      │
+│  └── Impact: None (other replicas serve traffic)                         │
+│  └── Recovery: Automatic via Kubernetes                                  │
+│                                                                          │
+│  SINGLE NODE FAILURE                                                     │
+│  └── Impact: 10-20% capacity loss temporarily                            │
+│  └── Recovery: Automatic via pod anti-affinity + reschedule              │
+│                                                                          │
+│  SINGLE AZ FAILURE                                                       │
+│  └── Impact: 33% capacity loss                                           │
+│  └── Recovery: Automatic via Multi-AZ + ALB health checks                │
+│                                                                          │
+│  DATABASE PRIMARY FAILURE                                                │
+│  └── Impact: Write unavailability ~5 min                                 │
+│  └── Recovery: Automatic via Aiven failover                              │
+│                                                                          │
+│  FULL REGION FAILURE                                                     │
+│  └── Impact: Complete service unavailability                             │
+│  └── Recovery: Semi-automatic (DR procedure)                             │
+│                                                                          │
+└──────────────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+# 💾 **Backup Strategy**
+
+## Backup Matrix
+
+| Data | Method | Frequency | Retention | Location | Encryption |
+|------|--------|-----------|-----------|----------|------------|
+| **PostgreSQL** | Aiven automated | Hourly | 7 days | Aiven (cross-AZ) | AES-256 |
+| **PostgreSQL PITR** | Aiven WAL | Continuous | 24h | Aiven | AES-256 |
+| **Kafka** | Topic retention | N/A | 7 days | Aiven | AES-256 |
+| **Valkey** | RDB + AOF | Continuous | 24h | Aiven | AES-256 |
+| **Terraform state** | S3 versioning | Every apply | 90 days | S3 bucket | AES-256 |
+| **Git repos** | GitHub | Every push | Unlimited | GitHub | At-rest |
+| **Secrets (Vault)** | Integrated storage | Continuous | 30 days | Vault HA | Transit |
+
+## Backup Verification — Automated
+
+| Check | Frequency | Automation | Alert on failure |
+|-------|-----------|------------|------------------|
+| PostgreSQL restore test | Weekly | Scheduled K8s job | P2 |
+| Terraform state backup | Daily | CI pipeline | P3 |
+| Vault backup verification | Weekly | Scheduled K8s job | P2 |
+| Git clone verification | Monthly | GitHub Actions | P4 |
+
+## Backup Monitoring
+
+| Metric | Alert Threshold | Severity |
+|--------|-----------------|----------|
+| Last backup age | > 2 hours | P2 |
+| Backup size anomaly | > 50% change | P3 |
+| Backup job failure | Any failure | P2 |
+| PITR lag | > 1 hour | P2 |
+
+---
+
+# 🤖 **Automated Recovery**
+
+## Principle: Self-Healing Infrastructure
+
+> **Goal:** Minimize human intervention. The system must repair itself automatically.
+
+```
+┌──────────────────────────────────────────────────────────────────────────┐
+│                        AUTOMATED RECOVERY LAYERS                         │
+├──────────────────────────────────────────────────────────────────────────┤
+│                                                                          │
+│  LAYER 1: APPLICATION (Kubernetes)                                       │
+│  ├── Liveness probes → Automatic restart                                 │
+│  ├── Readiness probes → Traffic routing                                  │
+│  ├── HPA → Automatic scaling                                             │
+│  └── PDB → Protection during maintenance                                 │
+│                                                                          │
+│  LAYER 2: DATABASE (Aiven)                                               │
+│  ├── Health monitoring → Automatic failover                              │
+│  ├── Connection pooling (PgBouncer) → Transparent retry                  │
+│  └── Read replicas → Load distribution                                   │
+│                                                                          │
+│  LAYER 3: MESSAGING (Kafka)                                              │
+│  ├── Broker failure → Automatic partition rebalance                      │
+│  ├── Consumer failure → Consumer group rebalance                         │
+│  └── Producer retry → Idempotent delivery                                │
+│                                                                          │
+│  LAYER 4: CACHE (Valkey)                                                 │
+│  ├── Cache miss → Automatic DB fallback (cache-aside pattern)            │
+│  ├── Node failure → Cluster failover                                     │
+│  └── TTL expiration → Lazy refresh                                       │
+│                                                                          │
+│  LAYER 5: NETWORKING (Cloudflare + Cilium)                               │
+│  ├── Origin failure → Health check + failover                            │
+│  ├── DDoS → Auto-mitigation                                              │
+│  └── mTLS → Automatic certificate rotation                               │
+│                                                                          │
+└──────────────────────────────────────────────────────────────────────────┘
+```
+
+## Automatic Recovery by Component
+
+| Component | Failure | Recovery Mechanism | Time | Intervention |
+|-----------|---------|-------------------|------|--------------|
+| **Pod** | Crash | Kubernetes restart | < 30s | None |
+| **Pod** | OOM | Kubernetes restart + alert | < 30s | Investigation |
+| **Deployment** | Bad deploy | Automatic Flux rollback (if configured) | < 2min | None |
+| **DB Primary** | Failure | Aiven automatic failover | < 5min | None |
+| **DB Connection** | Pool exhausted | PgBouncer retry + scale | < 1min | None |
+| **Kafka Consumer** | Lag > threshold | KEDA auto-scale | < 2min | None |
+| **Cache** | Node down | Cluster failover + DB fallback | < 1min | None |
+| **Certificate** | Expiring | Cert-manager auto-renew | N/A | None |
+
+---
+
+# 🔬 **Chaos Engineering**
+
+## Philosophy
+
+> **"We don't test whether the system falls down — we test whether it gets back up."**
+
+## Chaos Testing Framework
+
+| Tool | Usage | Integration |
+|------|-------|-------------|
+| **Chaos Mesh** | Kubernetes fault injection | Native K8s CRDs |
+| **Litmus** | Open-source alternative | Predefined scenarios |
+| **Gremlin** | Enterprise (if budget allows) | SaaS, more features |
+
+## Automated Experiments
+
+| Experiment | Target | Frequency | Validation |
+|------------|--------|-----------|------------|
+| **Pod Kill** | Random pod in service | Daily (staging) | Service continues responding |
+| **Network Latency** | Inter-service +100ms | Weekly | SLO latency maintained |
+| **Node Drain** | Random node | Weekly | Pods rescheduled, no downtime |
+| **DB Failover** | Force primary switch | Monthly | Connections recover < 5min |
+| **Cache Flush** | Valkey flush | Weekly | Fallback to DB 
works |
+| **AZ Failure Simulation** | Cordon all nodes in 1 AZ | Quarterly | Traffic shifts to other AZs |
+
+## Chaos Test Pipeline
+
+```
+┌──────────────────────────────────────────────────────────────────────────┐
+│                          CHAOS TESTING PIPELINE                          │
+├──────────────────────────────────────────────────────────────────────────┤
+│                                                                          │
+│  1. PRE-CHECK                                                            │
+│     • Verify system healthy (all green)                                  │
+│     • Baseline metrics recorded                                          │
+│     • Alerting team notified (staging)                                   │
+│                                                                          │
+│  2. INJECT CHAOS                                                         │
+│     • Apply Chaos Mesh experiment                                        │
+│     • Duration: 5-15 minutes                                             │
+│                                                                          │
+│  3. OBSERVE                                                              │
+│     • Monitor SLIs (error rate, latency)                                 │
+│     • Check recovery mechanisms activate                                 │
+│     • Record recovery time                                               │
+│                                                                          │
+│  4. VALIDATE                                                             │
+│     • SLO maintained? ✅ / ❌                                              │
+│     • Recovery time within RTO? ✅ / ❌                                    │
+│     • No data loss? ✅ / ❌                                                │
+│                                                                          │
+│  5. CLEANUP                                                              │
+│     • Remove chaos experiment                                            │
+│     • Verify system back to baseline                                     │
+│     • Generate report                                                    │
+│                                                                          │
+│  6. ACTION                                                               │
+│     • If failed → ticket for fix                                         │
+│     • If successful → increase chaos scope                               │
+│                                                                          │
+└──────────────────────────────────────────────────────────────────────────┘
+```
+
+## Game Days
+
+| Activity | Frequency | Participants | Scope |
+|----------|-----------|--------------|-------|
+| **Chaos Friday** | Weekly | On-call | Staging, simple experiments |
+| **Game Day** | Monthly | Full team | Staging, multi-failure scenarios |
+| **DR Drill** | Quarterly | Team + Management | Staging, full DR simulation |
+| **Production Chaos** | Annually | Team + SRE | Prod (maintenance window) |
+
+---
+
+# 🏥 **Disaster Recovery**
+
+## DR Scenarios
+
+| Scenario | Recovery | Automation Level |
+|----------|----------|------------------|
+| Single AZ failure | Automatic (multi-AZ) | 100% |
+| Region failure | Semi-automatic (IaC + GitOps) | 80% |
+| Data corruption | PITR restore | 60% |
+| Ransomware | Immutable backups restore | 50% |
+
+## DR Automation — Infrastructure as Code
+
+> **Principle:** All infrastructure is reproducible via Terraform + Flux.
+
+| Component | Reproducibility | Estimated Time |
+|-----------|-----------------|----------------|
+| **EKS Cluster** | Terraform apply | ~30 min |
+| **Platform tools** | Flux reconcile | ~15 min |
+| **Applications** | Flux reconcile | ~10 min |
+| **Database** | Aiven restore from backup | ~1-2h |
+| **DNS cutover** | Cloudflare API / Terraform | ~5 min |
+
+## DR Runbook — Region Failure
+
+| Phase | Duration | Actions | Automation |
+|-------|----------|---------|------------|
+| **1. Detection** | 15 min | Confirm the failure, declare DR | Automatic alerting |
+| **2. Infrastructure** | 1-2h | Terraform apply DR region | Semi-auto (approval required) |
+| **3. Data** | 1-2h | Aiven restore, verify integrity | Semi-auto (Aiven console) |
+| **4. Applications** | 30 min | Flux reconcile | Automatic |
+| **5. Traffic** | 15 min | Cloudflare DNS update | Semi-auto (Terraform) |
+| **6. Validation** | 30 min | E2E tests, verify SLIs | Automatic (CI) |
+
+**Total RTO: 4 hours**
+
+## DR Test Schedule
+
+| Test | Frequency | Scope | Duration |
+|------|-----------|-------|----------|
+| Backup restore | Weekly | PostgreSQL single table | 30 min |
+| Failover test | Monthly | Database failover | 1 hour |
+| DR drill | Quarterly | Full DR simulation (staging) | 4 hours |
+| Full DR test | Annually | Production DR (maintenance) | 8 hours |
+
+---
+
+# 📊 **Business Continuity**
+
+## RPO/RTO Summary
+
+| Scenario | RPO | RTO | Data Loss Risk | Automation |
+|----------|-----|-----|----------------|------------|
+| Pod failure | 0 | < 30s | None | 100% auto |
+| Node failure | 0 | < 2min | None | 100% auto |
+| AZ failure | 0 | < 5min | None | 100% auto |
+| DB failover | 0 (sync) | < 5min | None | 100% auto |
+| Region failure | 1 hour | 4 hours | Up to 1 hour | 80% auto |
+
+## Communication Plan
+
+| Audience | Channel | Frequency | Owner |
+|----------|---------|-----------|-------|
+| Engineering | Slack #incidents | Real-time | On-call |
+| Management | Email + Slack | Every 30 min | Incident Commander |
+| Customers | Status page | Every 15 min | Communications |
+| Partners | Email | Major updates | Account Management |
+
+## Status Page
+
+| Status | Definition |
+|--------|------------|
+| **Operational** | All systems normal |
+| **Degraded** | Partial impact (increased latency) |
+| **Partial Outage** | Some features unavailable |
+| **Major Outage** | Service unavailable |
+
+---
+
+## Recovery Validation — Automated
+
+> These checks run automatically after each recovery.
+
+### Health Check Pipeline
+
+| Check | Method | Failure Action |
+|-------|--------|----------------|
+| All pods running | kubectl health check | Alert P1 |
+| Metrics flowing | Prometheus query | Alert P2 |
+| Logs flowing | Loki query | Alert P2 |
+| Traces flowing | Tempo query | Alert P3 |
+| E2E critical path | Automated tests | Alert P1 |
+| Error rate normal | SLI check | Alert P2 |
+
+### Database Validation
+
+| Check | Method | Failure Action |
+|-------|--------|----------------|
+| Data integrity | Checksum validation | Alert P1 |
+| Transaction count | Count comparison | Alert P2 |
+| FK constraints | DB validation | Alert P2 |
+| Read/Write test | Smoke test | Alert P1 |
+
+---
+
+> **See also:** [Testing Strategy](../testing/TESTING-STRATEGY.md) for detailed performance, load, and chaos testing.
+
+---
+
+*Maintained by: Platform Team + SRE*
+*Last updated: January 2026*
diff --git a/security/SECURITY-ARCHITECTURE.md b/security/SECURITY-ARCHITECTURE.md
new file mode 100644
index 0000000..87e5a8c
--- /dev/null
+++ b/security/SECURITY-ARCHITECTURE.md
@@ -0,0 +1,535 @@
+# 🔐 **Security Architecture**
+## *LOCAL-PLUS Defense in Depth*
+
+---
+
+> **Back to**: [Architecture Overview](../EntrepriseArchitecture.md)
+
+---
+
+# 📋 **Table of Contents**
+
+1. [Defense in Depth](#defense-in-depth)
+2. [Layer 0 — Edge (Cloudflare)](#layer-0--edge-cloudflare)
+3. [Layer 1 — API Gateway](#layer-1--api-gateway)
+4. [Layer 2 — Network](#layer-2--network)
+5. [Layer 3 — Identity & Access](#layer-3--identity--access)
+6. [Layer 4 — Workload](#layer-4--workload)
+7. [Layer 5 — Data](#layer-5--data)
+8. [Supply Chain Security](#supply-chain-security)
+9. 
[Security Roadmap](#security-roadmap) + +--- + +# ๐Ÿ›ก๏ธ **Defense in Depth** + +``` +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ LAYER 0: EDGE (Cloudflare) โ”‚ +โ”‚ โ€ข Cloudflare WAF (OWASP Core Ruleset, custom rules) โ”‚ +โ”‚ โ€ข Cloudflare DDoS Protection (L3/L4/L7, unlimited) โ”‚ +โ”‚ โ€ข Bot Management (JS challenge, CAPTCHA) โ”‚ +โ”‚ โ€ข TLS 1.3 termination, HSTS enforced โ”‚ +โ”‚ โ€ข Cloudflare Tunnel (no public origin IP) โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ”‚ + โ–ผ +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ LAYER 1: API GATEWAY (Cilium Gateway API) โ”‚ +โ”‚ โ€ข JWT/API Key validation โ”‚ +โ”‚ โ€ข Rate limiting (fine-grained, per user/tenant) โ”‚ +โ”‚ โ€ข Request validation โ”‚ +โ”‚ โ€ข Circuit breaker โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ”‚ + โ–ผ +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ LAYER 2: NETWORK โ”‚ +โ”‚ โ€ข VPC isolation (private subnets only for workloads) โ”‚ +โ”‚ โ€ข Cilium NetworkPolicies (default deny, explicit allow) โ”‚ +โ”‚ โ€ข VPC 
Peering Aiven (no public internet for DB/Kafka) โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ”‚ + โ–ผ +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ LAYER 3: IDENTITY & ACCESS โ”‚ +โ”‚ โ€ข IRSA / Workload Identity (no static credentials) โ”‚ +โ”‚ โ€ข Cilium mTLS (WireGuard) โ€” pod-to-pod encryption โ”‚ +โ”‚ โ€ข Vault dynamic secrets โ€” DB credentials rotated โ”‚ +โ”‚ โ€ข PAM โ€” Privileged Access Management โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ”‚ + โ–ผ +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ LAYER 4: WORKLOAD โ”‚ +โ”‚ โ€ข Kyverno policies (no privileged, resource limits, probes required) โ”‚ +โ”‚ โ€ข Image signature verification (Cosign) โ”‚ +โ”‚ โ€ข Read-only root filesystem โ”‚ +โ”‚ โ€ข Non-root containers โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ”‚ + โ–ผ 
+โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ LAYER 5: DATA โ”‚ +โ”‚ โ€ข Encryption at rest (AWS KMS, Aiven native) โ”‚ +โ”‚ โ€ข Encryption in transit (mTLS) โ”‚ +โ”‚ โ€ข PII scrubbing in logs (OTel processor) โ”‚ +โ”‚ โ€ข Audit trail immutable (CloudTrail, K8s audit logs) โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +``` + +--- + +# ๐ŸŒ **Layer 0 โ€” Edge (Cloudflare)** + +## Protection Components + +| Component | Protection | Configuration | +|-----------|------------|---------------| +| **WAF** | OWASP Core Ruleset | Managed + Custom rules | +| **DDoS** | L3/L4/L7 mitigation | Unlimited, automatic | +| **Bot Protection** | JS challenge, CAPTCHA | Bot score threshold | +| **TLS** | 1.3 only, HSTS | Full (strict) mode | +| **Tunnel** | No public origin IP | Encrypted connection | + +## WAF Rules Strategy + +| Rule Set | Type | Action | Purpose | +|----------|------|--------|---------| +| **OWASP Core** | Managed | Block | SQLi, XSS, LFI, RFI protection | +| **Cloudflare Managed** | Managed | Block | Zero-day, emerging threats | +| **Geo-Block** | Custom | Block | Block high-risk countries (optional) | +| **Rate Limit API** | Custom | Challenge | > 100 req/min per IP on /api/* | +| **Bot Score < 30** | Custom | Challenge | Likely bot traffic | + +--- + +# ๐Ÿšช **Layer 1 โ€” API Gateway** + +## Cilium Gateway API (Phase 1) + +| Feature | Configuration | Purpose | +|---------|---------------|---------| +| **TLS Termination** | Cloudflare Origin cert | Encryption | +| **Path-based routing** | HTTPRoute resources | Traffic routing | +| **mTLS** | Cilium automatic | Service 
authentication |
+
+## APISIX (Future Phase 2+)
+
+| Feature | Configuration | Purpose |
+|---------|---------------|---------|
+| **JWT Validation** | RS256, JWKS endpoint | Authentication |
+| **API Key** | Header-based | Partner authentication |
+| **Rate Limiting** | Per user/tenant | Abuse prevention |
+| **Request Validation** | JSON Schema | Input validation |
+| **Circuit Breaker** | Timeout + failure threshold | Resilience |
+
+---
+
+# 🔒 **Layer 2 — Network**
+
+## VPC Isolation
+
+| Subnet Type | CIDR | Usage | Internet Access |
+|-------------|------|-------|-----------------|
+| **Private (Workloads)** | 10.0.0.0/20 | EKS nodes, pods | NAT Gateway only |
+| **Private (Data)** | 10.0.16.0/20 | VPC Endpoints | None |
+| **Public** | 10.0.32.0/20 | NAT Gateway, LB | Direct |
+
+## Cilium Network Policies
+
+| Policy | Effect |
+|--------|--------|
+| Default deny all | No traffic unless explicitly allowed |
+| Allow intra-namespace | Services in the same namespace can communicate |
+| Allow specific cross-namespace | svc-ledger → svc-wallet explicitly allowed |
+| Allow egress Aiven | Services → VPC Peering range only |
+| Allow egress AWS endpoints | Services → VPC Endpoints only |
+
+---
+
+# 🔑 **Layer 3 — Identity & Access**
+
+## Overview — Zero Static Credentials Model
+
+```
+┌──────────────────────────────────────────────────────────────────────────┐
+│                      IDENTITY & ACCESS ARCHITECTURE                      │
+├──────────────────────────────────────────────────────────────────────────┤
+│                                                                          │
+│  ┌────────────────────────────────────────────────────────────────────┐  │
+│  │                          WORKLOADS (Pods)                          │  │
+│  │                                                                    │  │
+│  │   Pod svc-ledger              Pod svc-wallet                       │  │
+│  │   ├── ServiceAccount          ├── ServiceAccount                   │  │
+│  │   └── JWT Token (auto)        └── JWT Token (auto)                 │  │
+│  │                                                                    │  │
+│  └──────────────────────────────┬─────────────────────────────────────┘  │
+│                                 │                                        │
+│         ┌───────────────────────┼───────────────────┐                    │
+│         │                       │                   │                    │
+│         ▼                       ▼                   ▼                    │
+│  ┌──────────────┐        ┌──────────────┐    ┌──────────────┐            │
+│  │     IRSA     │        │    Vault     │    │    Cilium    │            │
+│  │ (AWS Access) │        │   (Secrets)  │    │    (mTLS)    │            │
+│  └──────┬───────┘        └──────┬───────┘    └──────────────┘            │
+│         │                       │                                        │
+│         │ AssumeRole            │ Dynamic creds                          │
+│         ▼                       ▼                                        │
+│  ┌──────────────┐        ┌──────────────┐                                │
+│  │   AWS IAM    │        │  PostgreSQL  │                                │
+│  │  S3, KMS...  │        │    Kafka     │                                │
+│  └──────────────┘        └──────────────┘                                │
+│                                                                          │
+│            ZERO STATIC CREDENTIALS — everything is ephemeral             │
+│                                                                          │
+└──────────────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## IRSA — IAM Roles for Service Accounts
+
+### How does it work?
+
+**IRSA** lets a Kubernetes pod assume an AWS IAM role **without static credentials**.
+
+```
+┌──────────────────────────────────────────────────────────────────────────┐
+│                               IRSA — FLOW                                │
+├──────────────────────────────────────────────────────────────────────────┤
+│                                                                          │
+│  1. POD STARTS                                                           │
+│     ├── Kubernetes injects a JWT token (ServiceAccount)                  │
+│     └── The token contains: namespace, service account, issuer           │
+│                                                                          │
+│  2. POD WANTS TO ACCESS S3                                               │
+│     ├── The AWS SDK detects the IRSA token                               │
+│     └── The SDK calls STS AssumeRoleWithWebIdentity                      │
+│                                                                          │
+│  3. AWS STS VALIDATES                                                    │
+│     ├── Verifies the JWT via the OIDC Provider (EKS)                     │
+│     ├── Checks the ServiceAccount against the Trust Policy               │
+│     └── Returns temporary credentials (15min-12h)                        │
+│                                                                          │
+│  4. POD ACCESSES S3                                                      │
+│     ├── Uses the temporary credentials                                   │
+│     └── The AWS SDK renews them automatically                            │
+│                                                                          │
+└──────────────────────────────────────────────────────────────────────────┘
+```
+
+### Configuration
+
+| Component | Configuration |
+|-----------|---------------|
+| **OIDC Provider** | Created automatically with EKS, URL: `oidc.eks.region.amazonaws.com/id/CLUSTER_ID` |
+| **IAM Role** | Trust policy that authorizes the specific ServiceAccount |
+| **ServiceAccount** | `eks.amazonaws.com/role-arn` annotation |
+| **Pod** | Uses the ServiceAccount, receives the token automatically |
+
+### Trust Policy — Principle
+
+| Element | Description |
+|---------|-------------|
+| **Principal** | `arn:aws:iam::ACCOUNT:oidc-provider/oidc.eks...` |
+| **Condition** | `sub` = `system:serviceaccount:NAMESPACE:SA_NAME` |
+| **Action** | `sts:AssumeRoleWithWebIdentity` |
+
+### Service → Role Mapping
+
+| Service Account | IAM Role | AWS Permissions |
+|-----------------|----------|-----------------|
+| `svc-ledger` | `role-svc-ledger` | S3 read (specific bucket), KMS decrypt |
+| `svc-notification` | `role-svc-notification` | SES send email |
+| `external-secrets` | `role-external-secrets` | Secrets Manager read |
+| `otel-collector` | `role-otel-collector` | CloudWatch Logs write |
+
+---
+
+## Workload Identity Federation — General Concept
+
+> **IRSA is the AWS implementation of Workload Identity Federation.**
+
+### What is Workload Identity Federation?
+
+| Concept | Description |
+|---------|-------------|
+| **Definition** | A mechanism that lets a workload (pod, VM, CI job) obtain cloud credentials **without a static secret** |
+| **Principle** | The workload proves its identity with a token (JWT); the cloud provider exchanges it for temporary credentials |
+| **Standard** | OIDC (OpenID Connect) — an open standard |
+
+### Implementations per Cloud
+
+| Cloud | Name | How it works |
+|-------|------|--------------|
+| **AWS** | IRSA (EKS) | Pod → JWT → STS AssumeRoleWithWebIdentity → IAM Role |
+| **GCP** | Workload Identity | Pod → JWT → GCP Token Service → Service Account |
+| **Azure** | Workload Identity | Pod → JWT → Azure AD → Managed Identity |
+| **Multi-cloud** | SPIRE/SPIFFE | Open-source standard, cross-cloud federation |
+
+### Why is it better than static credentials?
+
+| Criterion | Static Credentials | Workload Identity |
+|-----------|--------------------|-------------------|
+| **Rotation** | Manual, risky | Automatic (15min-12h) |
+| **Blast radius** | Leak → permanent access | Leak → expires quickly |
+| **Audit** | Hard to trace | Every assume is logged |
+| **Management** | Secrets to distribute | Zero secret management |
+| **Compliance** | Problematic for SOC2/PCI | SOC2/PCI friendly |
+
+---
+
+## Vault — Dynamic Secrets
+
+### How Vault generates dynamic credentials
+
+```
+┌────────────────────────────────────────────────────────────────────────────┐
+│                           VAULT DYNAMIC SECRETS                            │
+├────────────────────────────────────────────────────────────────────────────┤
+│                                                                            │
+│  1. POD REQUESTS A SECRET                                                  │
+│     ├── The pod authenticates to Vault (Kubernetes auth)                   │
+│     └── Vault verifies the ServiceAccount JWT                              │
+│                                                                            │
+│  2. VAULT GENERATES THE CREDENTIALS                                        │
+│     ├── Vault connects to PostgreSQL                                       │
+│     ├── CREATE ROLE "svc-ledger-abc123" WITH PASSWORD '...' VALID UNTIL... │
+│     └── Returns username/password to the pod                               │
+│                                                                            │
+│  3. POD USES THE CREDENTIALS                                               │
+│     ├── Connects to PostgreSQL                                             │
+│     └── TTL: 1 hour (renewable)                                            │
+│                                                                            │
+│  4. EXPIRATION                                                             │
+│     ├── Vault revokes automatically                                        │
+│     └── PostgreSQL: DROP ROLE "svc-ledger-abc123"                          │
+│                                                                            │
+└────────────────────────────────────────────────────────────────────────────┘
+```
+
+### Auth Methods
+
+| Method | Use Case | Identity Source |
+|--------|----------|-----------------|
+| **Kubernetes** | EKS pods | ServiceAccount JWT |
+| **AWS IAM** | Lambda, EC2 | Instance metadata |
+| **AppRole** | CI/CD | Role ID + Secret ID |
+| **OIDC** | GitHub Actions | GitHub JWT |
+
+### Secret Engines
+
+| Engine | Path | Purpose | TTL |
+|--------|------|---------|-----|
+| **Database** | `database/` | PostgreSQL dynamic credentials | 1h (renewable) |
+| **KV v2** | `secret/` | Static secrets (external API keys) | N/A |
+| **Transit** | `transit/` | Encryption as a service | N/A |
+| **PKI** | `pki/` | TLS certificates | 24h |
+
+---
+
+## PAM — Privileged Access Management
+
+### Why PAM?
+
+| Problem | PAM Solution |
+|---------|--------------|
+| Shared SSH keys | Ephemeral access, signed SSH certificates |
+| Permanent admin accounts | Just-in-Time access |
+| No audit trail | Session recording, full audit |
+| Large blast radius | Least privilege, time-bound |
+
+### PAM Architecture
+
+```
+┌──────────────────────────────────────────────────────────────────────────┐
+│                             PAM ARCHITECTURE                             │
+├──────────────────────────────────────────────────────────────────────────┤
+│                                                                          │
+│  A USER WANTS TO ACCESS A SYSTEM                                         │
+│                                                                          │
+│  ┌──────────────┐                                                        │
+│  │   Engineer   │                                                        │
+│  │  (Browser)   │                                                        │
+│  └──────┬───────┘                                                        │
+│         │                                                                │
+│         │ 1. Request access                                              │
+│         ▼                                                                │
+│  ┌────────────────────────────────────────────────────────────────────┐  │
+│  │                            PAM SOLUTION                            │  │
+│  │   • Teleport (open-source), or                                     │  │
+│  │   • HashiCorp Boundary, or                                         │  │
+│  │   • AWS SSM Session Manager                                        │  │
+│  ├────────────────────────────────────────────────────────────────────┤  │
+│  │                                                                    │  │
+│  │  2. AUTHENTICATION                                                 │  │
+│  │     ├── SSO (GitHub, Okta, Google)                                 │  │
+│  │     └── MFA required                                               │  │
+│  │                                                                    │  │
+│  │  3. AUTHORIZATION                                                  │  │
+│  │     ├── Check RBAC (role-based)                                    │  │
+│  │     ├── Check time restrictions                                    │  │
+│  │     └── Approval workflow (for P1 incidents)                       │  │
+│  │                                                                    │  │
+│  │  4. CREDENTIAL VENDING                                             │  │
+│  │     ├── Generate short-lived SSH cert (10min-8h)                   │  │
+│  │     └── Or create temporary DB user                                │  │
+│  │                                                                    │  │
+│  │  5. SESSION                                                        │  │
+│  │     ├── Proxied connection                                         │  │
+│  │     └── Full session recording (audit)                             │  │
+│  │                                                                    │  │
+│  └─────────────────────────────────┬──────────────────────────────────┘  │
+│                                    │                                     │
+│                                    │ 6. Access granted (time-limited)    │
+│                                    ▼                                     │
+│  ┌──────────────┐        ┌──────────────┐        ┌──────────────┐        │
+│  │   EKS Node   │        │   Database   │        │   Bastion    │        │
+│  │  (kubectl)   │        │    (psql)    │        │    (SSH)     │        │
+│  └──────────────┘        └──────────────┘        └──────────────┘        │
+│                                                                          │
+└──────────────────────────────────────────────────────────────────────────┘
+```
+
+### PAM Options for Local-Plus
+
+| Solution | Type | Cost | Features |
+|----------|------|------|----------|
+| **AWS SSM Session Manager** | Managed | Free | SSH/RDP without a bastion, CloudTrail audit |
+| **Teleport** | Open-source | Free (Community) | SSH, K8s, DB, session recording |
+| **HashiCorp Boundary** | Open-source | Free (Community) | Session brokering, Vault integration |
+
+### Phase 1 Recommendation
+
+| Access Type | Solution | Justification |
+|-------------|----------|---------------|
+| **EKS kubectl** | IRSA + AWS SSO | Native, zero config |
+| **Database** | Vault dynamic creds | Already planned |
+| **SSH nodes** | SSM Session Manager | Free, no bastion needed |
+| **Emergency access** | Break-glass with MFA | Documented procedure |
+
+---
+
+## mTLS — Cilium WireGuard
+
+| Aspect | Configuration |
+|--------|---------------|
+| **Activation** | Automatic with Cilium |
+| **Protocol** | WireGuard (kernel-level) |
+| **Certificate management** | Handled by Cilium |
+| **Application changes** | None — transparent |
+| **Performance** | Minimal overhead (kernel crypto) |
+
+---
+
+# 🛡️ **Layer 4 — Workload**
+
+## Kyverno Policies
+
+| Policy | Effect | Enforcement |
+|--------|--------|-------------|
+| `require-labels` | Pods must have required labels | Enforce |
+| `require-probes` | Liveness + Readiness required | Enforce |
+| `require-resource-limits` | CPU/Memory limits required | Enforce |
+| `restrict-privileged` | No privileged containers | Enforce |
+| `require-image-signature` | Cosign signature required | Enforce |
+| `mutate-default-sa` | Auto-mount SA token disabled | Enforce |
+
+## Container Security Settings
+
+| Setting | Value | Rationale |
+|---------|-------|-----------|
+| `runAsNonRoot` | true | Prevent root execution |
+| `readOnlyRootFilesystem` | true | Prevent filesystem writes |
+| `allowPrivilegeEscalation` | false | Prevent privilege escalation |
+| `capabilities.drop` | ALL | Minimal capabilities |
+
+---
+
+# 💾 **Layer 5 — Data**
+
+## Encryption
+
+| Data State | Method | Key Management |
+|------------|--------|----------------|
+| **At rest (PostgreSQL)** | AES-256 | Aiven managed |
+| **At rest (Kafka)** | AES-256 | Aiven managed |
+| **At rest (S3)** | AES-256 | AWS KMS |
+| **In transit** | TLS 1.3 + mTLS | Cilium + Aiven |
+
+## PII Protection
+
+| Data Type | Protection | Implementation |
+|-----------|------------|----------------|
+| 
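
The log-side protections in this section (mask emails, hash IPs, anonymize identifiers) can be sketched as a small scrubbing step. A minimal Python illustration, assuming SHA-256 hashing and regex-based detection; the helper names are ours, not the actual OTel processor API:

```python
import hashlib
import re

# Hypothetical helpers illustrating the PII rules applied by the
# OTel log processor: emails masked, IPs replaced by a stable hash.

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def hash_value(value: str, salt: str = "audit-salt") -> str:
    """Stable one-way hash: records stay correlatable without exposing PII."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:16]

def scrub_log_line(line: str) -> str:
    """Apply the PII table: mask emails, hash IP addresses."""
    line = EMAIL_RE.sub("<email:masked>", line)
    line = IP_RE.sub(lambda m: f"<ip:{hash_value(m.group())}>", line)
    return line
```

In the real pipeline this logic lives in the collector's processor chain, so application code never has to remember to scrub.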
**User ID** | Anonymized in logs | OTel processor |
+| **Email** | Masked in logs | OTel processor |
+| **PAN** | Never stored | Application validation |
+| **IP Address** | Hashed in logs | OTel processor |
+
+## Audit Trail
+
+| Source | Destination | Retention | Immutability |
+|--------|-------------|-----------|--------------|
+| **AWS CloudTrail** | S3 (Log Archive) | 1 year | S3 Object Lock |
+| **K8s Audit Logs** | CloudWatch Logs | 90 days | CloudWatch retention |
+| **Application Audit** | PostgreSQL | 1 year | Append-only table |
+
+---
+
+# 🔗 **Supply Chain Security**
+
+## Image Signing — Cosign
+
+| Step | Description |
+|------|-------------|
+| **Build** | CI builds the Docker image |
+| **Sign** | Cosign signs the image with a private key |
+| **Push** | Image + signature are pushed to the registry |
+| **Deploy** | Kyverno verifies the signature before admitting the pod |
+| **Reject** | Invalid signature → pod refused |
+
+## SBOM — Software Bill of Materials
+
+| Step | Tool | Output |
+|------|------|--------|
+| **Generate** | Syft | SBOM in SPDX-JSON format |
+| **Attach** | Cosign | SBOM attached to the image |
+| **Scan** | Grype | Vulnerabilities in dependencies |
+| **Policy** | Kyverno | Reject on critical vulnerabilities |
+
+---
+
+# 📅 **Security Roadmap**
+
+## Phase 1 — Day 1 (Current)
+
+| Component | Status | Effort |
+|-----------|--------|--------|
+| Cilium mTLS | ✅ Zero config | Included |
+| IRSA (Workload Identity) | ✅ Ready | 1 day |
+| Kyverno basic policies | ✅ Ready | 2 days |
+| Vault for secrets | ✅ Ready | 1 week |
+| External-Secrets Operator | ✅ Ready | 2 days |
+| SSM Session Manager | ✅ Ready | 1 day |
+
+## Phase 2 — Month 3
+
+| Component | Status | Effort |
+|-----------|--------|--------|
+| Image signing (Cosign) | 🔜 Planned | 1 week |
+| SBOM generation (Syft) | 🔜 Planned | 2 days |
+| Supply chain verification | 🔜 Planned | 1 week |
+| Teleport (full PAM) | 
๐Ÿ”œ Evaluation | 1 week | + +## Phase 3 โ€” Month 6 + +| Component | Status | Effort | +|-----------|--------|--------| +| SPIRE (if multi-cluster) | ๐Ÿ“‹ Evaluation | TBD | +| Confidential Computing | ๐Ÿ“‹ Evaluation | TBD | + +--- + +*Document maintenu par : Platform Team + Security Team* +*Derniรจre mise ร  jour : Janvier 2026* diff --git a/testing/TESTING-STRATEGY.md b/testing/TESTING-STRATEGY.md new file mode 100644 index 0000000..a2472cd --- /dev/null +++ b/testing/TESTING-STRATEGY.md @@ -0,0 +1,473 @@ +# ๐Ÿงช **Testing Strategy** +## *LOCAL-PLUS Quality Engineering* + +--- + +> **Retour vers** : [Architecture Overview](../EntrepriseArchitecture.md) +> **Voir aussi** : [DR Guide](../resilience/DR-GUIDE.md) โ€” Chaos Engineering + +--- + +# ๐Ÿ“‹ **Table of Contents** + +1. [Test Pyramid Philosophy](#test-pyramid-philosophy) +2. [Platform Testing](#platform-testing) +3. [Application Testing](#application-testing) +4. [Integration & Contract Testing](#integration--contract-testing) +5. [Performance Testing](#performance-testing) +6. [Chaos Engineering](#chaos-engineering) +7. [Compliance Testing](#compliance-testing) +8. 
[TNR (Tests de Non-Régression)](#tnr-tests-de-non-régression)
+
+---
+
+# 🔺 **Test Pyramid Philosophy**
+
+## The Concept
+
+```
+                    ╱╲
+                   ╱  ╲
+                  ╱ E2E╲          ← Few, expensive, slow
+                 ╱──────╲           Business validation
+                ╱        ╲
+               ╱ Contract ╲       ← Verifies the interfaces
+              ╱────────────╲        between services
+             ╱              ╲
+            ╱  Integration   ╲    ← Real DB, Kafka, Cache
+           ╱──────────────────╲     Testcontainers
+          ╱                    ╲
+         ╱      Unit Tests      ╲ ← Many, fast, isolated
+        ╱────────────────────────╲  Business logic
+       ╱                          ╲
+      ╱      Static Analysis       ╲ ← Linting, type checking
+     ╱──────────────────────────────╲  Before anything even runs
+```
+
+## Key Principles
+
+| Principle | Description |
+|-----------|-------------|
+| **More tests at the bottom** | Unit tests = 70%, Integration = 20%, E2E = 10% |
+| **Faster at the bottom** | Unit tests < 1s, Integration < 30s, E2E < 5min |
+| **Isolation at the bottom** | Unit = mocks, Integration = containers, E2E = real env |
+| **Rising cost** | The higher you go, the more expensive it is to maintain |
+| **Rising confidence** | The higher you go, the more you validate the "real" system |
+
+## Application to Local-Plus
+
+| Layer | Test type | Target | Frequency |
+|-------|-----------|--------|-----------|
+| **Infrastructure** | Terraform tests, Policy checks | IaC modules | PR |
+| **Platform** | Smoke tests, Policy audit | Kubernetes, Flux | Post-deploy |
+| **Application** | Unit, Integration, Contract | Python/Go services | PR |
+| **System** | E2E, Performance, Chaos | Full stack | Nightly/Weekly |
+
+---
+
+# 🏗️ **Platform Testing**
+
+## Test Pyramid for Infrastructure
+
+```
+            ╱╲
+           ╱  ╲
+          ╱ E2E╲          ← Real deployment (staging)
+         ╱──────╲           Nightly
+        ╱        ╲
+       ╱Integration╲      ← Terratest (creates real resources)
+      ╱─────────────╲       Nightly, time-boxed
+     ╱               ╲
+    ╱   Unit Tests    ╲   ← terraform test (plan-based)
+   ╱───────────────────╲    PR, fast
+  ╱                     ╲
+ ╱    Static Analysis    ╲ ← tflint, tfsec, checkov
+╱─────────────────────────╲  Pre-commit, PR
+```
+
+## Terraform Testing
+
+| Type | Tool | When | What it checks | Blocking |
+|------|------|------|----------------|----------|
+| **Format** | `terraform fmt` | Pre-commit | Code is formatted | Yes |
+| **Lint** | `tflint` | Pre-commit | HCL best practices | Yes |
+| **Security** | `tfsec`, `checkov` | PR | Vulnerabilities, misconfigs | Yes |
+| **Compliance** | `terraform-compliance`, `conftest` | PR | Internal policies | Yes |
+| **Unit** | `terraform test` (native) | PR | Module logic | Yes |
+| **Integration** | `terratest` | Nightly | Resources created correctly | No |
+| **Drift** | `terraform plan` (scheduled) | Daily | Gap between config and reality | Alert |
+
+## Policy as Code — What We Check
+
+| Policy | Description | Tool |
+|--------|-------------|------|
+| **S3 encryption** | Every bucket must have encryption enabled | OPA/Conftest |
+| **Public access** | No public resources unless explicitly allowed | tfsec |
+| **Tagging** | Mandatory tags (env, owner, cost-center) | terraform-compliance |
+| **Naming** | Naming convention respected | Custom OPA |
+| **Networking** | No IGW on private VPCs | Checkov |
+
+## Kubernetes Testing
+
+| Type | Tool | When | What it checks |
+|------|------|------|----------------|
+| **Manifest validation** | `kubectl --dry-run`, `kubeconform` | PR | Valid YAML, correct schema |
+| **Policy check** | Kyverno CLI | PR | Policies pass |
+| **Helm lint** | `helm lint`, `helm template` | PR | Valid charts |
+| **Smoke test** | Flux reconcile + health check | Post-deploy | App deployed and healthy |
+
+---
+
+# 📱 **Application Testing**
+
+## Test Pyramid for Services
+
+```
+              ╱╲
+             ╱  ╲
+            ╱ E2E╲          ← Playwright, staging
+           ╱──────╲           Post-merge
+          ╱        ╲
+         ╱ Contract ╲       ← Pact, gRPC testing
+        ╱────────────╲        PR
+       ╱              ╲
+      ╱  Integration   ╲    ← Testcontainers
+     ╱──────────────────╲     PR
+    ╱                    ╲
+   ╱      Unit Tests      ╲ ← pytest, mocks
+  ╱────────────────────────╲  Pre-commit, PR
+ ╱                          ╲
+╱      Static Analysis       ╲ ← ruff, mypy, bandit
+──────────────────────────────  Pre-commit
+```
+
+## Unit Tests
+
+| Aspect | Approach |
+|--------|----------|
+| **Target** | Domain logic, Use cases, Utilities |
+| **Isolation** | Mocks for DB, Kafka, Cache, HTTP clients |
+| **Coverage** | Minimum 80% on the domain layer |
+| **Speed** | < 1 second per test |
+| **Framework** | pytest (Python), go test (Go) |
+
+### What we test at the unit level
+
+| Component | Tests |
+|-----------|-------|
+| **Domain entities** | Validation, business rules, state transitions |
+| **Use cases** | Orchestration logic (with mocks) |
+| **Value objects** | Immutability, equality |
+| **Utilities** | Pure functions, helpers |
+
+### What we do NOT test at the unit level
+
+| Component | Why |
+|-----------|-----|
+| **Repositories** | Requires a real DB → Integration |
+| **Kafka producers** | Requires a real broker → Integration |
+| **HTTP clients** | Real interactions → Contract |
+| **Controllers/Routes** | Wiring → Integration or E2E |
+
+## Integration Tests
+
+| Aspect | Approach |
+|--------|----------|
+| **Target** | Repositories, Message producers, Cache clients |
+| **Infrastructure** | Testcontainers (PostgreSQL, Kafka, Redis) |
+| **Isolation** | Each test gets its own DB/topic |
+| **Speed** | < 30 seconds per test |
+| 
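
To make the unit layer concrete, here is a pytest-style sketch of a domain-rule test. The `Wallet` entity and its earn/burn methods are illustrative stand-ins, not the actual svc-wallet code:

```python
# Illustrative domain entity plus a unit test of its business rules.
# Names are hypothetical; no DB, broker, or mock wiring is needed here.

class InsufficientBalance(Exception):
    pass

class Wallet:
    def __init__(self, balance_cents: int = 0):
        self.balance_cents = balance_cents

    def earn(self, amount_cents: int) -> None:
        if amount_cents <= 0:
            raise ValueError("earn amount must be positive")
        self.balance_cents += amount_cents

    def burn(self, amount_cents: int) -> None:
        if amount_cents > self.balance_cents:
            raise InsufficientBalance
        self.balance_cents -= amount_cents

def test_burn_cannot_exceed_balance():
    w = Wallet(balance_cents=500)
    w.earn(250)
    try:
        w.burn(1000)  # idiomatic pytest: with pytest.raises(InsufficientBalance)
    except InsufficientBalance:
        pass
    else:
        raise AssertionError("expected InsufficientBalance")
    assert w.balance_cents == 750  # balance untouched by the failed burn
```

Tests like this run in microseconds, which is what makes the 70% unit-test base of the pyramid affordable.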
**Framework** | pytest + testcontainers |
+
+### What we verify
+
+| Component | Checks |
+|-----------|--------|
+| **PostgreSQL Repository** | CRUD works, transactions, FK constraints |
+| **Kafka Producer** | Messages published, correct serialization |
+| **Kafka Consumer** | Messages consumed, idempotency |
+| **Cache Client** | Set/Get/Delete, TTL, invalidation |
+| **Outbox Pattern** | Transaction + event are atomic |
+
+---
+
+# 🤝 **Integration & Contract Testing**
+
+## Why Contract Testing?
+
+```
+┌──────────────────────────────────────────────────────────────────────────────┐
+│                    THE PROBLEM WITHOUT CONTRACT TESTING                      │
+├──────────────────────────────────────────────────────────────────────────────┤
+│                                                                              │
+│   svc-ledger                                      svc-wallet                 │
+│  ┌──────────┐                                    ┌──────────┐                │
+│  │  Calls   │─────── HTTP/gRPC ────────────────►│ Responds │                │
+│  │  Wallet  │                                    │          │                │
+│  └──────────┘                                    └──────────┘                │
+│                                                                              │
+│  ❌ Wallet changes its API                                                   │
+│  ❌ Ledger does not know                                                     │
+│  ❌ Discovered in production = 💥                                            │
+│                                                                              │
+└──────────────────────────────────────────────────────────────────────────────┘
+ 
+โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ LA SOLUTION : CONTRACT TESTING โ”‚ +โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค +โ”‚ โ”‚ +โ”‚ svc-ledger svc-wallet โ”‚ +โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ +โ”‚ โ”‚ Consumer โ”‚โ”€โ”€โ”€โ”€โ–บโ”‚ CONTRACT โ”‚โ—„โ”€โ”€โ”€โ”€โ”€โ”€โ”‚ Provider โ”‚ โ”‚ +โ”‚ โ”‚ Tests โ”‚ โ”‚ (Pact file) โ”‚ โ”‚ Tests โ”‚ โ”‚ +โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ +โ”‚ โ”‚ +โ”‚ โœ… Ledger dรฉclare ce qu'il attend โ”‚ +โ”‚ โœ… Wallet vรฉrifie qu'il respecte le contrat โ”‚ +โ”‚ โœ… CI bloque si contrat cassรฉ โ”‚ +โ”‚ โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +``` + +## Contract Testing Approach + +| Aspect | Approche | +|--------|----------| +| **Outil REST** | Pact | +| **Outil gRPC** | buf breaking, grpc-testing | +| **Consumer-driven** | Le consumer dรฉfinit ses besoins | +| **Provider verification** | Le provider vรฉrifie qu'il satisfait | +| **Broker** | Pact Broker (ou Pactflow) pour centraliser | + +## Ce qu'on vรฉrifie en Contract + +| Type | Vรฉrifications | +|------|---------------| +| **Request format** | Path, method, headers, body schema | +| **Response format** | Status code, headers, body schema | +| **Error cases** | 4xx/5xx responses, error messages | +| 
**Breaking changes** | Removed fields, changed types |
+
+---
+
+# ⚡ **Performance Testing**
+
+## Performance Test Types
+
+| Type | Objective | VUs | Duration | Frequency |
+|------|-----------|-----|----------|-----------|
+| **Smoke** | Verify it works | 1-5 | 1 min | Post-deploy |
+| **Load** | Normal load | 50-100 | 10 min | Nightly |
+| **Stress** | Find the breaking point | Ramping 500+ | 15 min | Weekly |
+| **Soak** | Endurance, memory leaks | 50 | 4 hours | Weekly |
+| **Spike** | Sudden traffic spikes | 10→200→10 | 5 min | Monthly |
+
+## Tool: k6
+
+| Aspect | Choice |
+|--------|--------|
+| **Tool** | k6 (Grafana) |
+| **Scripting** | JavaScript |
+| **Reporting** | Grafana Cloud or self-hosted |
+| **CI Integration** | GitHub Actions |
+
+## Thresholds (Success Criteria)
+
+| Metric | Target | Alert | Blocking |
+|--------|--------|-------|----------|
+| **Latency P50** | < 50ms | > 100ms | No |
+| **Latency P95** | < 100ms | > 200ms | Yes |
+| **Latency P99** | < 200ms | > 500ms | Yes |
+| **Error Rate** | < 0.1% | > 1% | Yes |
+| **Throughput** | > 500 TPS | < 400 TPS | No |
+
+## Test Scenarios per Service
+
+| Service | Scenario | Target VUs | Target throughput |
+|---------|----------|------------|-------------------|
+| **svc-ledger** | Create transaction | 100 | 500 TPS |
+| **svc-ledger** | Get balance | 200 | 1000 TPS |
+| **svc-wallet** | Update balance | 100 | 500 TPS |
+| **svc-merchant** | List transactions | 50 | 200 TPS |
+
+## Performance Testing Pipeline
+
+```
+┌──────────────────────────────────────────────────────────────────────────────┐
+│                        PERFORMANCE TESTING PIPELINE                          │
+โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค +โ”‚ โ”‚ +โ”‚ 1. SMOKE TEST (Post-deploy) โ”‚ +โ”‚ โ€ข 1-5 VUs, 1 minute โ”‚ +โ”‚ โ€ข Vรฉrifie que le service rรฉpond โ”‚ +โ”‚ โ€ข Gate pour continuer โ”‚ +โ”‚ โ”‚ +โ”‚ 2. LOAD TEST (Nightly) โ”‚ +โ”‚ โ€ข 50-100 VUs, 10 minutes โ”‚ +โ”‚ โ€ข Vรฉrifie performance normale โ”‚ +โ”‚ โ€ข Compare avec baseline โ”‚ +โ”‚ โ”‚ +โ”‚ 3. STRESS TEST (Weekly) โ”‚ +โ”‚ โ€ข Ramping jusqu'ร  failure โ”‚ +โ”‚ โ€ข Identifie le breaking point โ”‚ +โ”‚ โ€ข Documente les limites โ”‚ +โ”‚ โ”‚ +โ”‚ 4. SOAK TEST (Weekly) โ”‚ +โ”‚ โ€ข 50 VUs, 4 heures โ”‚ +โ”‚ โ€ข Dรฉtecte memory leaks โ”‚ +โ”‚ โ€ข Vรฉrifie stabilitรฉ long-terme โ”‚ +โ”‚ โ”‚ +โ”‚ 5. REPORT โ”‚ +โ”‚ โ€ข Dashboard Grafana โ”‚ +โ”‚ โ€ข Trend analysis โ”‚ +โ”‚ โ€ข Alertes si rรฉgression โ”‚ +โ”‚ โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +``` + +--- + +# ๐Ÿ’ฅ **Chaos Engineering** + +> **Dรฉtails complets** : voir [DR Guide โ€” Chaos Engineering](../resilience/DR-GUIDE.md#chaos-engineering) + +## Philosophie + +| Principe | Description | +|----------|-------------| +| **Build confidence** | Prouver que le systรจme rรฉsiste aux pannes | +| **Proactive** | Casser avant que รงa casse en prod | +| **Controlled** | Experiments planifiรฉs, scope limitรฉ | +| **Observable** | Mesurer l'impact, recovery time | + +## Experiments par layer + +| Layer | Experiment | Outil | Frรฉquence | +|-------|------------|-------|-----------| +| **Pod** | Kill random pod | Chaos Mesh | Daily (staging) | +| **Node** | Drain node | Chaos Mesh | Weekly | +| **Network** | Add latency 100ms | Chaos Mesh | Weekly | +| 
**Network** | Partition (isolate one service) | Chaos Mesh | Monthly |
+| **Database** | Force failover | Aiven console | Monthly |
+| **Cache** | Flush all | Chaos Mesh | Weekly |
+| **AZ** | Cordon all nodes in 1 AZ | kubectl | Quarterly |
+
+## Validation
+
+| Experiment | Expected Behavior | Success Criteria |
+|------------|-------------------|------------------|
+| Pod kill | Traffic shifts to other pods | Error rate < 1%, recovery < 30s |
+| Node drain | Pods rescheduled | No downtime |
+| Network latency | Degraded but functional | SLO latency maintained |
+| DB failover | Brief connection errors | Recovery < 5min |
+| Cache flush | Fallback to DB | Increased latency, no errors |
+
+---
+
+# ✅ **Compliance Testing**
+
+## Tests by Standard
+
+| Standard | Test | What it checks | Tool |
+|----------|------|----------------|------|
+| **GDPR** | PII in logs | No plaintext email, user_id, or IP | Log audit script |
+| **GDPR** | Data retention | Logs < 30 days | Loki config check |
+| **GDPR** | Right to delete | Deletion API works | E2E test |
+| **PCI-DSS** | Encryption in transit | mTLS enforced | Cilium policy audit |
+| **PCI-DSS** | Encryption at rest | KMS enabled | AWS Config rules |
+| **PCI-DSS** | No PAN storage | No card numbers anywhere | Code scan + log audit |
+| **SOC2** | Audit logs | CloudTrail + K8s audit | AWS Config |
+| **SOC2** | Access control | RBAC enforced | Kyverno reports |
+| **SOC2** | Change management | PR required, reviews | GitHub settings |
+
+## Automation
+
+| Check | Frequency | Blocking |
+|-------|-----------|----------|
+| Log audit (PII) | Nightly | P2 alert |
+| Policy reports (Kyverno) | Continuous | Dashboard |
+| AWS Config rules | Continuous | P2 alert |
+| Encryption verification | Weekly | P1 alert on failure |
+
+---
+
+# 🔄 **TNR (Tests de Non-Régression)**
+
+*TNR = non-regression (regression-prevention) tests.*
+
+## Categories
+
+| Category | What it checks | Frequency |
+|----------|----------------|-----------|
+| 
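
The nightly PII log audit above can be approximated by a small scanner over shipped log lines. A minimal sketch; the regex patterns and function name are illustrative, and a production audit would be stricter (for example, a Luhn check before flagging a PAN):

```python
import re

# Minimal sketch of the nightly PII log audit: scan log lines for raw
# emails, IP addresses, and card-like digit runs. Patterns are
# deliberately simple illustrations of the GDPR/PCI-DSS checks.

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ip": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "pan": re.compile(r"\b\d{13,19}\b"),
}

def audit_logs(lines: list[str]) -> list[tuple[int, str]]:
    """Return (line_number, pii_kind) findings; an empty list means compliant."""
    findings = []
    for lineno, line in enumerate(lines, start=1):
        for kind, pattern in PII_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, kind))
    return findings
```

Run nightly against a sample of shipped logs, a non-empty result raises the P2 alert from the automation table.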
**Critical Paths** | Essential business flows | Nightly |
+| **Golden Master** | API responses have not changed | Nightly |
+| **Backward Compatibility** | Older client versions still work | Pre-release |
+| **Data Migration** | Migrations did not corrupt data | Post-migration |
+
+## Critical Paths
+
+| Path | Steps | SLA |
+|------|-------|-----|
+| **Earn flow** | Transaction → Balance update → Event → Notification | < 5s end-to-end |
+| **Burn flow** | Transaction → Balance check → Deduction → Event | < 5s end-to-end |
+| **Balance query** | Request → Cache/DB → Response | < 100ms |
+| **Merchant onboarding** | Registration → Validation → Activation | < 30s |
+
+## E2E Testing
+
+| Aspect | Approach |
+|--------|----------|
+| **Tool** | Playwright |
+| **Environment** | Staging (mirror of prod) |
+| **Data** | Dedicated fixtures, cleanup afterwards |
+| **Frequency** | Post-merge on staging, pre-release on prod |
+| **Ownership** | QA Team |
+
+## TNR Pipeline
+
+```
+┌──────────────────────────────────────────────────────────────────────────────┐
+│                            TNR PIPELINE (Nightly)                            │
+├──────────────────────────────────────────────────────────────────────────────┤
+│                                                                              │
+│  00:00 ─► SETUP                                                              │
+│     • Fresh staging environment                                              │
+│     • Load test fixtures                                                     │
+│                                                                              │
+│  00:15 ─► CRITICAL PATH TESTS                                                │
+│     • Earn/Burn flows                                                        │
+│     • All major user journeys                                                │
+│                                                                              │
+│  01:00 ─► PERFORMANCE TESTS                                                  │
+│     • Load test (10 min)                                                     │
+│     • Compare with baseline                                                  │
+│                                                                              │
+│  01:30 ─► COMPLIANCE TESTS                                                   │
+│     • Log audit (PII check)                                                  │
+│     • Policy verification                                                    │
+│                                                                              │
+│  02:00 ─► REPORT                                                             │
+│     • Generate report                                                        │
+│     • Alert on failures                                                      │
+│     • Update dashboard                                                       │
+│                                                                              │
+│  02:30 ─► CLEANUP                                                            │
+│     • Reset test data                                                        │
+│     • Archive logs                                                           │
+│                                                                              │
+└──────────────────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## Recap: Who Tests What?
+
+| Team | Responsibility | Test types |
+|------|----------------|------------|
+| **Developers** | Unit, Integration | PR gate |
+| **Platform** | Terraform, Kubernetes, Chaos | PR + Nightly |
+| **QA** | E2E, TNR, Performance | Nightly + Pre-release |
+| **Security** | Compliance, Policy audit | Continuous |
+
+---
+
+*Maintained by: QA Team + Platform Team*
+*Last updated: January 2026*