From 4e65b7b5af866d7e2ddecf57ca04d72ece179fc2 Mon Sep 17 00:00:00 2001 From: NasrLadib Date: Tue, 27 Jan 2026 08:32:29 +0100 Subject: [PATCH 1/6] docs(architecture): restructure overview and add scope definition --- EntrepriseArchitecture.md | 2005 +++++++------------------------------ 1 file changed, 343 insertions(+), 1662 deletions(-) diff --git a/EntrepriseArchitecture.md b/EntrepriseArchitecture.md index eaf4dda..a44e64d 100644 --- a/EntrepriseArchitecture.md +++ b/EntrepriseArchitecture.md @@ -1,177 +1,134 @@ -# 🏗️ **LOCAL-PLUS — Architecture Définitive** +# 🏗️ **LOCAL-PLUS — Architecture Overview** ## *Gift Card & Loyalty Platform* ### *Version 1.0 — Janvier 2026* --- -# 📋 **PARTIE I — CONTEXTE & CONTRAINTES** +> **Ce document est la porte d'entrée de l'architecture LOCAL-PLUS.** +> Il fournit une vue d'ensemble et des liens vers la documentation détaillée. -## **1.1 Paramètres Business** +--- + +# 📋 **PARTIE I — EXECUTIVE SUMMARY** + +## **1.1 Scope** + +LOCAL-PLUS est une plateforme de gestion de cartes cadeaux et fidélité, conçue pour : +- **Scalabilité** : 500 TPS, 1500 RPS +- **Résilience** : RPO 1h, RTO 15min +- **Compliance** : GDPR, PCI-DSS, SOC2 +- **Durée de vie** : 5+ ans + +### **Non-Goals (Phase 1)** +- Multi-région active-active +- API Gateway/APIM dédié (évaluation future) +- Mobile apps natives + +## **1.2 Paramètres Clés** -| Paramètre | Valeur | Impact architectural | -|-----------|--------|---------------------| -| **RPO** | 1 heure | Backups horaires minimum, réplication async acceptable | -| **RTO** | 15 minutes | Failover automatisé, pas de procédure manuelle | -| **TPS** | 500 transactions/sec | Pas de sharding nécessaire, single Postgres suffit | +| Paramètre | Valeur | Impact | +|-----------|--------|--------| +| **RPO** | 1 heure | Backups horaires, réplication async | +| **RTO** | 15 minutes | Failover automatisé | +| **TPS** | 500 transactions/sec | Single Postgres suffit | | **RPS** | 1500 requêtes/sec | Load balancer + HPA 
standard | -| **Durée de vie** | 5+ ans | Design pour évolutivité, pas de shortcuts | -| **Équipe on-call** | 5 personnes | Runbooks exhaustifs, alerting structuré | - -## **1.2 Contraintes Compliance** - -| Standard | Exigences clés | Impact | -|----------|---------------|--------| -| **GDPR** | Droit à l'oubli, consentement, data residency EU | Logs anonymisés, data retention policies, EU region | -| **PCI-DSS** | Pas de stockage PAN, encryption at rest/transit, audit logs | mTLS, Vault pour secrets, audit trail immutable | -| **SOC2** | Contrôle d'accès, monitoring, incident response | RBAC strict, observabilité complète, runbooks documentés | - -## **1.3 Contraintes Techniques** - -| Contrainte | Choix | Rationale | -|------------|-------|-----------| -| **Cloud primaire** | AWS | Décision business | -| **Région initiale** | eu-west-1 (Ireland) | GDPR, latence Europe | -| **Multi-région** | Prévu, pas immédiat | Design pour, implémente plus tard | -| **Database** | Aiven PostgreSQL | Managed, multicloud-ready, PCI compliant | -| **Messaging** | Aiven Kafka | Managed, multicloud-ready | +| **Équipe on-call** | 5 personnes | Runbooks exhaustifs | + +## **1.3 Compliance Summary** + +| Standard | Exigences clés | Documentation | +|----------|---------------|---------------| +| **GDPR** | Data residency EU, droit à l'oubli | → [docs/compliance/gdpr/](compliance/gdpr/) | +| **PCI-DSS** | Pas de stockage PAN, encryption, audit | → [docs/compliance/pci-dss/](compliance/pci-dss/) | +| **SOC2** | RBAC, monitoring, incident response | → [docs/compliance/soc2/](compliance/soc2/) | + +## **1.4 Tech Stack Overview** + +| Catégorie | Choix | Rationale | +|-----------|-------|-----------| +| **Cloud** | AWS (eu-west-1) | Décision business, GDPR | +| **Orchestration** | EKS + ArgoCD | GitOps, cloud-native | +| **Database** | Aiven PostgreSQL | Managed, PCI compliant | +| **Messaging** | Aiven Kafka | Event-driven, managed | | **Cache** | Aiven Valkey | Redis-compatible, managed 
| -| **Edge/CDN** | Cloudflare | Free tier, WAF, DDoS, global CDN, multi-cloud ready | -| **API Gateway / APIM** | À définir (Phase future) | Options : AWS API Gateway, Gravitee, Kong — décision ultérieure | -| **DNS Public** | Cloudflare DNS | Authoritative, DNSSEC, global anycast | -| **DNS Interne/Backup** | AWS Route53 | Private hosted zones, health checks, failover | -| **Observabilité** | Self-hosted, coût minimal | Prometheus/Loki/Tempo + CloudWatch Logs (tier gratuit) | +| **Edge/CDN** | Cloudflare | WAF, DDoS, Zero Trust | +| **Observability** | Prometheus/Loki/Tempo | Self-hosted, coût minimal | +| **Secrets** | HashiCorp Vault | Dynamic secrets, rotation | +| **CNI** | Cilium | mTLS, Gateway API | +| **Policies** | Kyverno | Admission control | --- -# 🏛️ **PARTIE II — ARCHITECTURE LOGIQUE** - -## **2.1 Vue d'ensemble** +# 🏛️ **PARTIE II — ARCHITECTURE** -### **2.1.1 AWS Multi-Account Strategy (Control Tower)** +## **2.1 Context Diagram (C4 Level 1)** ``` ┌─────────────────────────────────────────────────────────────────────────────┐ -│ AWS CONTROL TOWER (Organization) │ -├─────────────────────────────────────────────────────────────────────────────┤ -│ │ -│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │ -│ │ MANAGEMENT │ │ SECURITY │ │ LOG ARCHIVE │ │ -│ │ ACCOUNT │ │ ACCOUNT │ │ ACCOUNT │ │ -│ │ • Control Tower│ │ • GuardDuty │ │ • CloudTrail │ │ -│ │ • Organizations│ │ • Security Hub │ │ • Config Logs │ │ -│ │ • SCPs │ │ • IAM Identity │ │ • VPC Flow Logs│ │ -│ │ │ │ Center │ │ │ │ -│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │ -│ │ -│ ┌─────────────────────────────────────────────────────────────────────┐ │ -│ │ WORKLOAD ACCOUNTS (OU: Workloads) │ │ -│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │ -│ │ │ DEV Account │ │ STAGING │ │ PROD Account│ │ │ -│ │ │ │ │ Account │ │ │ │ │ -│ │ │ VPC + EKS │ │ VPC + EKS │ │ VPC + EKS │ │ │ -│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │ -│ 
└─────────────────────────────────────────────────────────────────────┘ │ -│ │ -│ ┌─────────────────────────────────────────────────────────────────────┐ │ -│ │ SHARED SERVICES ACCOUNT (OU: Infrastructure) │ │ -│ │ • Transit Gateway Hub │ │ -│ │ • Centralized VPC Endpoints │ │ -│ │ • Container Registry (ECR) │ │ -│ │ • Artifact Storage (S3) │ │ -│ └─────────────────────────────────────────────────────────────────────┘ │ -│ │ +│ END USERS │ +│ (Merchants, Consumers, Partners) │ └─────────────────────────────────────────────────────────────────────────────┘ -``` - -### **2.1.2 Architecture EKS par Environnement** - -``` + │ + ▼ ┌─────────────────────────────────────────────────────────────────────────────┐ -│ INTERNET │ -│ (End Users) │ +│ CLOUDFLARE EDGE │ +│ (DNS, WAF, DDoS, CDN, Zero Trust, Tunnel) │ └─────────────────────────────────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────────────────────┐ -│ CLOUDFLARE EDGE (Global) │ -├─────────────────────────────────────────────────────────────────────────────┤ -│ • DNS (localplus.io) • WAF (OWASP rules) │ -│ • DDoS Protection (L3-L7) • SSL/TLS Termination │ -│ • CDN (static assets) • Bot Protection │ -│ • Cloudflare Tunnel • Zero Trust Access │ +│ LOCAL-PLUS PLATFORM │ +│ (AWS EKS) │ +│ ┌─────────────────────────────────────────────────────────────────────┐ │ +│ │ Domain Services: svc-ledger, svc-wallet, svc-merchant, svc-giftcard │ │ +│ └─────────────────────────────────────────────────────────────────────┘ │ └─────────────────────────────────────────────────────────────────────────────┘ │ - │ Cloudflare Tunnel (encrypted) ▼ ┌─────────────────────────────────────────────────────────────────────────────┐ -│ WORKLOAD ACCOUNT (PROD) — eu-west-1 │ -├─────────────────────────────────────────────────────────────────────────────┤ -│ │ -│ ┌───────────────────────────────────────────────────────────────────────┐ │ -│ │ VPC — 10.0.0.0/16 │ │ -│ │ │ │ -│ │ 
┌─────────────────────────────────────────────────────────────────┐ │ │ -│ │ │ EKS CLUSTER │ │ │ -│ │ │ │ │ │ -│ │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ -│ │ │ │ NODE POOL: platform (taints: platform=true:NoSchedule) │ │ │ │ -│ │ │ │ Instance: m6i.xlarge (dedicated resources) │ │ │ │ -│ │ │ ├─────────────────────────────────────────────────────────┤ │ │ │ -│ │ │ │ PLATFORM NAMESPACE │ │ │ │ -│ │ │ │ • ArgoCD (centralisé) │ │ │ │ -│ │ │ │ • Cilium (CNI + Gateway API) │ │ │ │ -│ │ │ │ • Vault Agent Injector │ │ │ │ -│ │ │ │ • External-Secrets Operator │ │ │ │ -│ │ │ │ • Kyverno │ │ │ │ -│ │ │ │ • OTel Collector │ │ │ │ -│ │ │ │ • Prometheus + Loki + Tempo + Grafana │ │ │ │ -│ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ -│ │ │ │ │ │ -│ │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ -│ │ │ │ NODE POOL: application (default, auto-scaling) │ │ │ │ -│ │ │ │ Instance: m6i.large (cost-optimized) │ │ │ │ -│ │ │ ├─────────────────────────────────────────────────────────┤ │ │ │ -│ │ │ │ APPLICATION NAMESPACES │ │ │ │ -│ │ │ │ • svc-ledger │ │ │ │ -│ │ │ │ • svc-wallet │ │ │ │ -│ │ │ │ • svc-merchant │ │ │ │ -│ │ │ │ • svc-giftcard │ │ │ │ -│ │ │ │ • svc-notification │ │ │ │ -│ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ -│ │ │ │ │ │ -│ │ └─────────────────────────────────────────────────────────────────┘ │ │ -│ │ │ │ -│ │ │ VPC Peering / Transit Gateway │ │ -│ │ ▼ │ │ -│ │ ┌─────────────────────────────────────────────────────────────────┐ │ │ -│ │ │ AIVEN VPC │ │ │ -│ │ │ • PostgreSQL (Primary + Read Replica) │ │ │ -│ │ │ • Kafka Cluster │ │ │ -│ │ └─────────────────────────────────────────────────────────────────┘ │ │ -│ │ │ │ -│ └───────────────────────────────────────────────────────────────────────┘ │ -│ │ -│ ┌───────────────────────────────────────────────────────────────────────┐ │ -│ │ EXTERNAL SERVICES │ │ -│ │ • AWS S3 (Terraform state, backups, artifacts) │ │ -│ 
│ • AWS KMS (Encryption keys) │ │ -│ │ • AWS Secrets Manager (bootstrap secrets only) │ │ -│ │ • HashiCorp Vault (self-hosted on EKS — runtime secrets) │ │ -│ │ • AWS CloudWatch Logs (tier gratuit, fallback) │ │ -│ └───────────────────────────────────────────────────────────────────────┘ │ -│ │ +│ AIVEN DATA LAYER │ +│ (PostgreSQL, Kafka, Valkey — VPC Peering) │ └─────────────────────────────────────────────────────────────────────────────┘ ``` -### **2.1.3 Node Pool Strategy** +## **2.2 Container Diagram (C4 Level 2)** -|| Node Pool | Taints | Usage | Instance Type | Scaling | -||-----------|--------|-------|---------------|---------| -|| **platform** | `platform=true:NoSchedule` | ArgoCD, Monitoring, Security tools | m6i.xlarge | Fixed (2-3 nodes) | -|| **application** | None (default) | Domain services | m6i.large | HPA (2-10 nodes) | -|| **spot** (optionnel) | `spot=true:PreferNoSchedule` | Batch jobs, non-critical | m6i.large (spot) | Auto (0-5 nodes) | +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ AWS WORKLOAD ACCOUNT — eu-west-1 │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ │ +│ ┌────────────────────────────────────────────────────────────────────────┐ │ +│ │ EKS CLUSTER │ │ +│ │ │ │ +│ │ ┌─────────────────────────────────────────────────────────────────┐ │ │ +│ │ │ PLATFORM NODE POOL (taints: platform=true:NoSchedule) │ │ │ +│ │ │ • ArgoCD • Cilium • Vault Agent │ │ │ +│ │ │ • OTel Collector • Prometheus • Grafana │ │ │ +│ │ │ • Loki • Tempo • Kyverno │ │ │ +│ │ └─────────────────────────────────────────────────────────────────┘ │ │ +│ │ │ │ +│ │ ┌─────────────────────────────────────────────────────────────────┐ │ │ +│ │ │ APPLICATION NODE POOL (default, auto-scaling) │ │ │ +│ │ │ • svc-ledger • svc-wallet • svc-merchant │ │ │ +│ │ │ • svc-giftcard • svc-notification │ │ │ +│ │ └─────────────────────────────────────────────────────────────────┘ │ │ +│ │ │ │ +│ 
└────────────────────────────────────────────────────────────────────────┘ │ +│ │ │ +│ │ VPC Peering │ +│ ▼ │ +│ ┌────────────────────────────────────────────────────────────────────────┐ │ +│ │ AIVEN VPC │ │ +│ │ • PostgreSQL (Primary + Read Replica) │ │ +│ │ • Kafka Cluster (3 brokers) │ │ +│ │ • Valkey Cluster (HA) │ │ +│ └────────────────────────────────────────────────────────────────────────┘ │ +│ │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` -## **2.2 Domain Services** +## **2.3 Domain Services** | Service | Responsabilité | Pattern | Criticité | |---------|---------------|---------|-----------| @@ -181,7 +138,7 @@ | **svc-giftcard** | Catalog, rewards | Sync REST | P1 | | **svc-notification** | SMS/Email dispatch | Async (Kafka consumer) | P2 | -## **2.3 Data Flow** +## **2.4 Data Flow** ``` ┌─────────────┐ gRPC ┌─────────────┐ @@ -205,9 +162,11 @@ --- -# 🌿 **PARTIE II.B — GIT STRATEGY** +# 🌿 **PARTIE III — DELIVERY MODEL** + +## **3.1 Git Strategy** -## **Trunk-Based Development avec Cherry-Pick** +**Trunk-Based Development avec Cherry-Pick** ``` main (trunk) @@ -228,1590 +187,312 @@ label: backport-v1) label: backport-v2) ``` -### **Règles Git** +| Branche | Usage | Politique | +|---------|-------|-----------| +| `main` | Trunk principal | Tous les PRs mergent ici | +| `maintenance/v*.x.x` | Maintenance versions | Cherry-pick depuis main uniquement | +| `feature/*` | Développement | Short-lived, merge to main | -|| Branche | Usage | Politique | -||---------|-------|-----------| -|| `main` | Trunk principal | Tous les PRs mergent ici | -|| `maintenance/v1.x.x` | Maintenance version 1 | Cherry-pick depuis main uniquement | -|| `maintenance/v2.x.x` | Maintenance version 2 | Cherry-pick depuis main uniquement | -|| `feature/*` | Développement | Short-lived, merge to main | +## **3.2 GitOps Flow (ArgoCD)** -### **Workflow Cherry-Pick** +- **ArgoCD centralisé** : Instance unique gérant tous les environnements +- 
**App-of-Apps pattern** : ApplicationSets avec Git + Matrix generators +- **Sync automatique** : Dev auto-sync, Staging/Prod manual approval -1. **Développeur** crée un PR vers `main` -2. **Développeur** ajoute le label `backport-v1` si le fix doit aller dans v1.x.x -3. **CI** (après merge dans main) détecte le label et crée automatiquement un PR cherry-pick vers `maintenance/v1.x.x` -4. **Reviewer** valide le cherry-pick PR +## **3.3 Environments** -> **Principe :** Tout passe par `main` d'abord. Les branches de maintenance reçoivent uniquement des cherry-picks validés. +| Environment | Account | Cluster | Sync Policy | +|-------------|---------|---------|-------------| +| **dev** | localplus-dev | eks-dev | Auto-sync | +| **staging** | localplus-staging | eks-staging | Manual | +| **prod** | localplus-prod | eks-prod | Manual + Approval | ---- +## **3.4 CI/CD** -# 🗂️ **PARTIE III — ORGANISATION DES REPOSITORIES** +→ **Documentation détaillée** : [docs/bootstrap/BOOTSTRAP-GUIDE.md](bootstrap/BOOTSTRAP-GUIDE.md) -## **3.1 Structure Complète** +--- -``` -github.com/localplus/ - -══════════════════════════════════════════════════════════════════════════════ -TIER 0 — FOUNDATION (Platform Team ownership) -══════════════════════════════════════════════════════════════════════════════ - -bootstrap/ -├── layer-0/ -│ └── aws/ -│ └── README.md # Runbook: Create bootstrap IAM role -├── layer-1/ -│ ├── foundation/ -│ │ ├── main.tf -│ │ ├── networking.tf # VPC, Subnets, NAT, VPC Peering Aiven -│ │ ├── eks.tf # EKS cluster -│ │ ├── iam.tf # IRSA, Workload Identity -│ │ ├── kms.tf # Encryption keys -│ │ └── outputs.tf -│ ├── tests/ -│ │ ├── unit/ # Terraform unit tests (terraform test) -│ │ ├── compliance/ # Checkov, tfsec, Regula -│ │ └── integration/ # Terratest -│ └── backend.tf # S3 native locking -└── docs/ - └── RUNBOOK-BOOTSTRAP.md - -══════════════════════════════════════════════════════════════════════════════ -TIER 1 — PLATFORM (Platform Team ownership) 
-══════════════════════════════════════════════════════════════════════════════ - -platform-gitops/ -├── argocd/ -│ ├── install/ # Helm values for ArgoCD -│ └── applicationsets/ -│ ├── platform.yaml # Sync platform-* repos -│ └── services.yaml # Sync svc-* repos (Git + Cluster generators) -├── projects/ # ArgoCD Projects (RBAC) -└── README.md - -platform-networking/ -├── cilium/ -│ ├── values.yaml # Cilium Helm config -│ └── policies/ # ClusterNetworkPolicies -├── gateway-api/ -│ ├── gateway-class.yaml -│ ├── gateways/ -│ └── httproutes/ -└── README.md - -platform-observability/ -├── otel-collector/ -│ ├── daemonset.yaml # Node-level collection -│ ├── deployment.yaml # Gateway collector -│ └── config/ -│ ├── receivers.yaml -│ ├── processors.yaml # Cardinality filtering, PII scrubbing -│ ├── exporters.yaml -│ └── sampling.yaml # Tail sampling config -├── prometheus/ -│ ├── values.yaml -│ ├── rules/ # AlertRules, RecordingRules -│ └── serviceMonitors/ -├── loki/ -│ ├── values.yaml -│ └── retention-policies.yaml # GDPR: 30 days max -├── tempo/ -│ └── values.yaml -├── pyroscope/ # Continuous Profiling (APM) -│ ├── values.yaml -│ └── scrape-configs.yaml -├── sentry/ # Error Tracking (APM) -│ ├── values.yaml -│ ├── dsn-config.yaml -│ └── alert-rules.yaml -├── grafana/ -│ ├── values.yaml -│ ├── dashboards/ -│ │ ├── platform/ -│ │ ├── services/ -│ │ └── apm/ # APM-specific dashboards -│ │ ├── service-overview.json -│ │ ├── dependency-map.json -│ │ ├── database-performance.json -│ │ └── profiling-flamegraphs.json -│ └── datasources/ -└── README.md - -platform-cache/ -├── valkey/ -│ ├── values.yaml # Helm config for Valkey -│ ├── cluster-config.yaml -│ └── monitoring/ -│ ├── servicemonitor.yaml -│ └── alerts.yaml -├── sdk/ -│ ├── python/ # Cache SDK helpers -│ │ ├── cache_client.py -│ │ └── patterns.py # Cache-aside, write-through -│ └── go/ -│ └── cache/ -└── README.md - -platform-gateway/ -├── apisix/ -│ ├── values.yaml # APISIX Helm config -│ ├── routes/ -│ │ ├── v1/ # 
API v1 routes -│ │ └── v2/ # API v2 routes (future) -│ ├── plugins/ -│ │ ├── jwt-config.yaml -│ │ ├── rate-limit-config.yaml -│ │ └── cors-config.yaml -│ └── consumers/ # API consumers (partners, services) -├── cloudflare/ -│ ├── terraform/ -│ │ ├── main.tf -│ │ ├── dns.tf -│ │ ├── tunnel.tf -│ │ ├── waf.tf -│ │ └── access.tf -│ └── policies/ -│ ├── waf-rules.yaml -│ └── access-policies.yaml -├── cloudflared/ -│ ├── deployment.yaml # Tunnel daemon -│ └── config.yaml -└── README.md - -platform-security/ -├── vault/ -│ ├── policies/ # Per-service policies -│ ├── auth-methods/ # Kubernetes auth -│ └── secret-engines/ -├── external-secrets/ -│ ├── operator/ -│ └── cluster-secret-stores/ -├── kyverno/ -│ ├── cluster-policies/ -│ │ ├── require-labels.yaml -│ │ ├── require-probes.yaml -│ │ ├── require-resource-limits.yaml -│ │ ├── restrict-privileged.yaml -│ │ ├── require-image-signature.yaml # Supply chain -│ │ └── mutate-default-sa.yaml -│ └── policy-reports/ -├── supply-chain/ -│ ├── cosign/ # Image signing config -│ └── sbom/ # Syft config -├── audit/ -│ └── audit-policy.yaml # K8s audit logging -└── README.md - -══════════════════════════════════════════════════════════════════════════════ -TIER 2 — CONTRACTS (Shared ownership) -══════════════════════════════════════════════════════════════════════════════ - -contracts-proto/ -├── buf.yaml -├── buf.gen.yaml -├── localplus/ -│ ├── ledger/v1/ -│ │ ├── ledger.proto -│ │ └── ledger_service.proto -│ ├── wallet/v1/ -│ │ ├── wallet.proto -│ │ └── wallet_service.proto -│ └── common/v1/ -│ ├── money.proto -│ └── pagination.proto -└── README.md - -sdk-python/ -├── localplus/ -│ ├── clients/ # Generated gRPC clients -│ ├── telemetry/ # OTel instrumentation helpers -│ ├── testing/ # Fixtures, factories -│ └── security/ # Vault client wrapper -├── pyproject.toml -└── README.md - -sdk-go/ -├── clients/ -├── telemetry/ -└── go.mod - -══════════════════════════════════════════════════════════════════════════════ -TIER 3 — DOMAIN 
SERVICES (Product Team ownership) -══════════════════════════════════════════════════════════════════════════════ - -svc-ledger/ # TON LOCAL-PLUS ACTUEL -├── src/ -│ └── app/ -│ ├── api/ -│ ├── domain/ -│ ├── infrastructure/ -│ └── main.py -├── tests/ -│ ├── unit/ # pytest, mocks -│ ├── integration/ # testcontainers -│ ├── contract/ # pact / grpc-testing -│ └── conftest.py -├── perf/ -│ ├── k6/ -│ │ ├── smoke.js -│ │ ├── load.js -│ │ └── stress.js -│ └── scenarios/ -├── k8s/ -│ ├── base/ -│ │ ├── deployment.yaml -│ │ ├── service.yaml -│ │ ├── configmap.yaml -│ │ ├── hpa.yaml -│ │ ├── pdb.yaml -│ │ └── kustomization.yaml -│ └── overlays/ -│ ├── dev/ -│ ├── staging/ -│ └── prod/ -├── migrations/ # Alembic -├── Dockerfile -├── Taskfile.yml -└── README.md - -svc-wallet/ # Même structure -svc-merchant/ # Même structure -svc-giftcard/ # Même structure -svc-notification/ # Même structure (+ Kafka consumer) - -══════════════════════════════════════════════════════════════════════════════ -TIER 4 — QUALITY ENGINEERING (Shared ownership) -══════════════════════════════════════════════════════════════════════════════ - -e2e-scenarios/ -├── scenarios/ -│ ├── earn-burn-flow.spec.ts -│ ├── merchant-onboarding.spec.ts -│ └── giftcard-purchase.spec.ts -├── fixtures/ -├── playwright.config.ts -└── README.md - -chaos-experiments/ -├── litmus/ -│ └── chaosengine/ -├── experiments/ -│ ├── pod-kill/ -│ ├── network-partition/ -│ ├── db-latency/ -│ └── kafka-broker-kill/ -└── README.md - -══════════════════════════════════════════════════════════════════════════════ -TIER 5 — DOCUMENTATION (Shared ownership) -══════════════════════════════════════════════════════════════════════════════ - -docs/ -├── adr/ # Architecture Decision Records -│ ├── 001-modular-monolith-first.md -│ ├── 002-aiven-managed-data.md -│ ├── 003-cilium-over-calico.md -│ └── ... 
-├── runbooks/ -│ ├── incident-response.md -│ ├── database-failover.md -│ ├── kafka-recovery.md -│ └── secret-rotation.md -├── platform-contracts/ -│ ├── deployment-sla.md -│ ├── observability-requirements.md -│ └── security-baseline.md -├── compliance/ -│ ├── gdpr/ -│ │ ├── data-retention-policy.md -│ │ ├── right-to-erasure.md -│ │ └── consent-management.md -│ ├── pci-dss/ -│ │ ├── cardholder-data-flow.md -│ │ └── encryption-requirements.md -│ └── soc2/ -│ ├── access-control-policy.md -│ └── incident-response-policy.md -├── threat-models/ -│ ├── svc-ledger-stride.md -│ └── platform-attack-surface.md -└── onboarding/ - ├── new-developer.md - └── new-service-checklist.md -``` +# 🗂️ **PARTIE IV — REPOSITORY & OWNERSHIP MODEL** + +## **4.1 Repository Tiers** + +| Tier | Repos | Description | Owner | +|------|-------|-------------|-------| +| **T0 — Foundation** | `bootstrap/` | AWS Landing Zone, Account Factory | Platform Team | +| **T1 — Platform** | `platform-*` | GitOps, Networking, Security, Observability | Platform Team | +| **T2 — Contracts** | `contracts-proto`, `sdk-*` | APIs, SDKs partagés | Platform + Backend | +| **T3 — Domain** | `svc-*` | Services métier | Product Teams | +| **T4 — Quality** | `e2e-scenarios`, `chaos-*` | Tests E2E, Chaos engineering | QA + Platform | +| **T5 — Documentation** | `docs/` | Documentation centralisée | All Teams | + +## **4.2 Ownership Matrix** + +| Tier | Owner Team | Approvers | Change Process | +|------|------------|-----------|----------------| +| **T0 — Foundation** | Platform | Platform Lead + Security | ADR + RFC obligatoire | +| **T1 — Platform** | Platform | Platform Team (2 reviewers) | ADR si breaking change | +| **T2 — Contracts** | Platform + Backend | Tech Lead | Buf breaking detection | +| **T3 — Domain** | Product Teams | Team Lead | Standard PR review | +| **T4 — Quality** | QA + Platform | QA Lead | Standard PR review | +| **T5 — Documentation** | All | Tech Lead | Standard PR review | + +## **4.3 
Repository Index** + +### Tier 0 — Foundation +| Repo | Description | README | +|------|-------------|--------| +| `bootstrap/` | AWS Landing Zone, Control Tower, Account Factory | → [bootstrap/README.md](../bootstrap/README.md) | + +### Tier 1 — Platform +| Repo | Description | README | +|------|-------------|--------| +| `platform-gitops/` | ArgoCD, ApplicationSets | → [platform-gitops/README.md](../platform-gitops/README.md) | +| `platform-networking/` | Cilium, Gateway API | → [platform-networking/README.md](../platform-networking/README.md) | +| `platform-observability/` | OTel, Prometheus, Loki, Tempo, Grafana | → [platform-observability/README.md](../platform-observability/README.md) | +| `platform-security/` | Vault, External-Secrets, Kyverno | → [platform-security/README.md](../platform-security/README.md) | +| `platform-cache/` | Valkey configuration, SDK | → [platform-cache/README.md](../platform-cache/README.md) | +| `platform-gateway/` | APISIX (future), Cloudflare config | → [platform-gateway/README.md](../platform-gateway/README.md) | +| `platform-application-provis/` | Terraform modules (DB, Kafka, Cache, EKS) | → [platform-application-provis/README.md](../platform-application-provis/README.md) | + +### Tier 2 — Contracts +| Repo | Description | README | +|------|-------------|--------| +| `contracts-proto/` | Protobuf definitions | → [contracts-proto/README.md](../contracts-proto/README.md) | +| `sdk-python/` | Python SDK (clients, telemetry) | → [sdk-python/README.md](../sdk-python/README.md) | +| `sdk-go/` | Go SDK | → [sdk-go/README.md](../sdk-go/README.md) | + +### Tier 3 — Domain Services +| Repo | Description | README | +|------|-------------|--------| +| `svc-ledger/` | Earn/Burn transactions | → [svc-ledger/README.md](../svc-ledger/README.md) | +| `svc-wallet/` | Balance queries | → [svc-wallet/README.md](../svc-wallet/README.md) | +| `svc-merchant/` | Merchant onboarding | → [svc-merchant/README.md](../svc-merchant/README.md) | +| 
`svc-giftcard/` | Gift card catalog | → [svc-giftcard/README.md](../svc-giftcard/README.md) | +| `svc-notification/` | Notifications (Kafka consumer) | → [svc-notification/README.md](../svc-notification/README.md) | + +### Tier 4 — Quality +| Repo | Description | README | +|------|-------------|--------| +| `e2e-scenarios/` | Playwright E2E tests | → [e2e-scenarios/README.md](../e2e-scenarios/README.md) | +| `chaos-experiments/` | Litmus chaos tests | → [chaos-experiments/README.md](../chaos-experiments/README.md) | --- -# 🥚🐔 **PARTIE IV — BOOTSTRAP STRATEGY** +# 🔐 **PARTIE V — PLATFORM BASELINES** -## **4.1 Layer 0 — Manual Bootstrap (1x per AWS account)** +## **5.1 Security Baseline** -| Action | Commande/Outil | Output | -|--------|---------------|--------| -| Créer IAM Role pour Terraform CI/CD | AWS CLI | `arn:aws:iam::xxx:role/TerraformCI` | -| Configurer OIDC pour GitHub Actions | AWS Console/CLI | GitHub peut assumer le role | +**Defense in Depth** : 6 couches de sécurité -**C'est TOUT. Le S3 backend est auto-créé par Terraform 1.10+** +| Layer | Composant | Protection | +|-------|-----------|------------| +| **Edge** | Cloudflare | WAF, DDoS, Bot protection | +| **Gateway** | APISIX (future) | JWT, Rate limiting | +| **Network** | Cilium | NetworkPolicies, default deny | +| **Identity** | IRSA + Vault | Dynamic secrets, mTLS | +| **Workload** | Kyverno | Pod security, image signing | +| **Data** | KMS + Aiven | Encryption at rest/transit | -## **4.1.1 GitHub Actions — Reusable & Composite Workflows** +→ **Documentation détaillée** : [docs/security/SECURITY-ARCHITECTURE.md](security/SECURITY-ARCHITECTURE.md) -> **Note :** Utiliser des **reusable workflows** et **composite actions** pour standardiser les pipelines CI/CD. 
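+Pour illustrer la couche **Network** ci-dessus (« NetworkPolicies, default deny »), voici une esquisse indicative de politique réseau appliquée par Cilium via l'API NetworkPolicy standard de Kubernetes. Les labels `app:`, le namespace source `svc-wallet` et le port gRPC `50051` sont des hypothèses d'illustration, pas la configuration réelle (celle-ci vit dans `platform-networking/`) :
+
+```yaml
+# Default deny : bloque tout trafic ingress/egress des pods du namespace svc-ledger
+apiVersion: networking.k8s.io/v1
+kind: NetworkPolicy
+metadata:
+  name: default-deny-all
+  namespace: svc-ledger
+spec:
+  podSelector: {}                      # tous les pods du namespace
+  policyTypes: ["Ingress", "Egress"]   # aucune règle => tout est refusé
+---
+# Allow explicite : svc-wallet -> svc-ledger sur le port gRPC (hypothèse : 50051)
+apiVersion: networking.k8s.io/v1
+kind: NetworkPolicy
+metadata:
+  name: allow-wallet-to-ledger
+  namespace: svc-ledger
+spec:
+  podSelector:
+    matchLabels:
+      app: svc-ledger                  # label hypothétique
+  policyTypes: ["Ingress"]
+  ingress:
+    - from:
+        - namespaceSelector:
+            matchLabels:
+              kubernetes.io/metadata.name: svc-wallet
+      ports:
+        - protocol: TCP
+          port: 50051
+```
+
+Chaque flux inter-service doit faire l'objet d'un « allow » explicite de ce type, versionné dans `platform-networking` (Tier 1) et synchronisé par ArgoCD.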
+## **5.2 Observability Baseline** -- **Reusable workflows** : `.github/workflows/` partagés entre repos (build, test, deploy) -- **Composite actions** : `.github/actions/` pour encapsuler des steps communs (setup-python, terraform-plan, etc.) +| Signal | Outil | Retention | Coût | +|--------|-------|-----------|------| +| **Metrics** | Prometheus + Thanos | 15j local, 1an S3 | ~5€/mois | +| **Logs** | Loki | 30 jours (GDPR) | Self-hosted | +| **Traces** | Tempo | 7 jours | Self-hosted | +| **Profiling** | Pyroscope | 7 jours | Self-hosted | +| **Errors** | Sentry (self-hosted) | 30 jours | Self-hosted | -## **4.2 Layer 1 — Foundation (Terraform)** +→ **Documentation détaillée** : [docs/observability/OBSERVABILITY-GUIDE.md](observability/OBSERVABILITY-GUIDE.md) -| Ordre | Ressource | Dépendances | -|-------|-----------|-------------| -| 1 | VPC + Subnets | Aucune | -| 2 | KMS Keys | Aucune | -| 3 | EKS Cluster | VPC, KMS | -| 4 | IRSA (IAM Roles for Service Accounts) | EKS | -| 5 | VPC Peering avec Aiven | VPC, Aiven créé manuellement d'abord | -| 6 | Outputs → Platform repos | Tous | +## **5.3 Networking Baseline** -## **4.3 Layer 2 — Platform Bootstrap** +| Composant | Rôle | Configuration | +|-----------|------|---------------| +| **Cloudflare** | Edge, WAF, Tunnel | Free tier | +| **Cilium** | CNI, mTLS, Gateway API | WireGuard encryption | +| **VPC Peering** | Aiven connectivity | Private, no internet | +| **Route53** | Private DNS, backup | Internal zones | -| Ordre | Action | Dépendance | -|-------|--------|------------| -| 1 | Install ArgoCD via Helm (1x) | EKS ready | -| 2 | Apply App-of-Apps ApplicationSet | ArgoCD running | -| 3 | ArgoCD syncs platform-* repos | Reconciliation automatique | +→ **Documentation détaillée** : [docs/networking/NETWORKING-ARCHITECTURE.md](networking/NETWORKING-ARCHITECTURE.md) -**ArgoCD : Instance centralisée unique** (comme demandé) +## **5.4 Data Baseline** -## **4.4 Layer 3+ — Application Services** +| Service | Provider | 
Plan | Coût estimé | +|---------|----------|------|-------------| +| **PostgreSQL** | Aiven | Business-4 | ~300€/mois | +| **Kafka** | Aiven | Business-4 | ~400€/mois | +| **Valkey** | Aiven | Business-4 | ~150€/mois | -ArgoCD ApplicationSets avec **Git Generator + Matrix Generator** découvrent automatiquement les services. - ---- +**Règle d'or** : 1 table = 1 owner. Cross-service = gRPC ou Events, jamais JOIN. -# 🧪 **PARTIE V — TESTING STRATEGY COMPLÈTE** - -## **5.1 Terraform Testing** - -| Type | Outil | Quand | Bloquant | -|------|-------|-------|----------| -| **Format/Lint** | `terraform fmt`, `tflint` | Pre-commit | Oui | -| **Security scan** | `tfsec`, `checkov` | PR | Oui | -| **Compliance** | `regula`, `opa conftest`, [terraform-compliance](https://terraform-compliance.com/) | PR | Oui | -| **Policy as Code** | HashiCorp Sentinel | PR | Oui | -| **Unit tests** | `terraform test` (native 1.6+) | PR | Oui | -| **Integration** | `terratest` | Nightly | Non | -| **Drift detection** | `terraform plan` scheduled | Daily | Alerte | - -## **5.2 Application Testing** - -| Type | Localisation | Outil | Trigger | Bloquant | -|------|--------------|-------|---------|----------| -| **Unit** | `svc-*/tests/unit/` | pytest | Pre-commit, PR | Oui | -| **Integration** | `svc-*/tests/integration/` | pytest + testcontainers | PR | Oui | -| **Contract** | `svc-*/tests/contract/` | pact, grpc-testing | PR | Oui | -| **Performance** | `svc-*/perf/` | k6 | Nightly, Pre-release | Non | -| **E2E** | `e2e-scenarios/` | Playwright | Post-merge staging | Oui pour prod | -| **Chaos** | `chaos-experiments/` | Litmus | Weekly | Non | - -## **5.3 TNR (Tests de Non-Régression)** - -| Catégorie | Contenu | Fréquence | -|-----------|---------|-----------| -| **Critical Paths** | Earn → Balance Update → Notification | Nightly | -| **Golden Master** | Snapshot des réponses API | Nightly | -| **Compliance** | GDPR data retention, PCI encryption checks | Nightly | -| **Security** | Kyverno 
policy audit, image signature verification | Nightly | - -## **5.4 Compliance Testing** - -| Standard | Test | Outil | -|----------|------|-------| -| **GDPR** | PII not in logs | OTel Collector scrubbing + log audit | -| **GDPR** | Data retention < 30 days | Loki retention policy check | -| **PCI-DSS** | mTLS enforced | Cilium policy audit | -| **PCI-DSS** | Encryption at rest | AWS KMS audit | -| **SOC2** | Audit logs present | CloudTrail + K8s audit logs check | -| **SOC2** | Access control | Kyverno policy reports | +→ **Documentation détaillée** : [docs/data/DATA-ARCHITECTURE.md](data/DATA-ARCHITECTURE.md) --- -# 🔐 **PARTIE VI — SECURITY ARCHITECTURE** +# ⚡ **PARTIE VI — RESILIENCE & DR** -## **6.1 Defense in Depth** +## **6.1 Failure Modes** -``` -┌─────────────────────────────────────────────────────────────────────────────┐ -│ LAYER 0: EDGE (Cloudflare) │ -│ • Cloudflare WAF (OWASP Core Ruleset, custom rules) │ -│ • Cloudflare DDoS Protection (L3/L4/L7, unlimited) │ -│ • Bot Management (JS challenge, CAPTCHA) │ -│ • TLS 1.3 termination, HSTS enforced │ -│ • Cloudflare Tunnel (no public origin IP) │ -└─────────────────────────────────────────────────────────────────────────────┘ - │ - ▼ -┌─────────────────────────────────────────────────────────────────────────────┐ -│ LAYER 1: API GATEWAY (APISIX) │ -│ • JWT/API Key validation │ -│ • Rate limiting (fine-grained, per user/tenant) │ -│ • Request validation (JSON Schema) │ -│ • Circuit breaker │ -└─────────────────────────────────────────────────────────────────────────────┘ - │ - ▼ -┌─────────────────────────────────────────────────────────────────────────────┐ -│ LAYER 2: NETWORK │ -│ • VPC isolation (private subnets only for workloads) │ -│ • Cilium NetworkPolicies (default deny, explicit allow) │ -│ • VPC Peering Aiven (no public internet for DB/Kafka) │ -└─────────────────────────────────────────────────────────────────────────────┘ - │ - ▼ 
-┌─────────────────────────────────────────────────────────────────────────────┐
-│ LAYER 3: IDENTITY & ACCESS                                                  │
-│ • IRSA (IAM Roles for Service Accounts) — no static credentials             │
-│ • Cilium mTLS (WireGuard) — pod-to-pod encryption                           │
-│ • Vault dynamic secrets — DB credentials rotated                            │
-└─────────────────────────────────────────────────────────────────────────────┘
-                                      │
-                                      ▼
-┌─────────────────────────────────────────────────────────────────────────────┐
-│ LAYER 4: WORKLOAD                                                           │
-│ • Kyverno policies (no privileged, resource limits, probes required)        │
-│ • Image signature verification (Cosign)                                     │
-│ • Read-only root filesystem                                                 │
-│ • Non-root containers                                                       │
-└─────────────────────────────────────────────────────────────────────────────┘
-                                      │
-                                      ▼
-┌─────────────────────────────────────────────────────────────────────────────┐
-│ LAYER 5: DATA                                                               │
-│ • Encryption at rest (AWS KMS, Aiven native)                                │
-│ • Encryption in transit (mTLS)                                              │
-│ • PII scrubbing in logs (OTel processor)                                    │
-│ • Audit trail immutable (CloudTrail, K8s audit logs)                        │
-└─────────────────────────────────────────────────────────────────────────────┘
-```
-
-## **6.2 Addressing "Start simple, but what about technical debt?"**
-
-**The paradox:** You want to start simple, but with GDPR, PCI-DSS, and SOC2 in scope, security cannot be ignored from day one.
- -**La solution : Security Baseline dès Day 1, évolution par phases** +| Failure | Detection | Recovery | RTO | +|---------|-----------|----------|-----| +| Pod crash | Liveness probe | K8s restart | < 30s | +| Node failure | Node NotReady | Pod reschedule | < 2min | +| AZ failure | Multi-AZ detect | Traffic shift | < 5min | +| DB primary failure | Aiven health | Automatic failover | < 5min | +| Kafka broker failure | Aiven health | Automatic rebalance | < 2min | +| Full region failure | Manual | DR procedure | 4h (target) | -| Phase | Ce qui est en place | Ce qui vient après | -|-------|---------------------|-------------------| -| **Day 1** | Cilium mTLS (zero config), Kyverno basic policies, Vault pour secrets | - | -| **Month 3** | Image signing (Cosign), SBOM generation | - | -| **Month 6** | SPIRE (si multi-cluster), Confidential Computing évaluation | - | +## **6.2 Backup Strategy** -**Pas de dette technique SI :** -- mTLS dès le début (Cilium = zero effort) -- Secrets dans Vault dès le début (pas de migration douloureuse) -- Policies Kyverno dès le début (culture sécurité) +| Data | Method | Frequency | Retention | +|------|--------|-----------|-----------| +| PostgreSQL | Aiven automated | Hourly | 7 jours | +| PostgreSQL PITR | Aiven WAL | Continuous | 24h | +| Kafka | Topic retention | N/A | 7 jours | +| Terraform state | S3 versioning | Every apply | 90 jours | -**La vraie dette technique serait :** -- Commencer sans mTLS → Migration massive plus tard -- Secrets en ConfigMaps → Rotation impossible -- Pas d'audit logs → Compliance failure +→ **Documentation détaillée** : [docs/resilience/DR-GUIDE.md](resilience/DR-GUIDE.md) --- -# 📊 **PARTIE VII — OBSERVABILITY ARCHITECTURE** - -## **7.1 Stack Self-Hosted (Coût Minimal)** - -| Composant | Outil | Coût | Retention | -|-----------|-------|------|-----------| -| **Metrics** | Prometheus | 0€ (self-hosted) | 15 jours local | -| **Metrics long-term** | Thanos Sidecar → S3 | ~5€/mois S3 | 1 an | -| **Logs** | 
Loki | 0€ (self-hosted) | 30 jours (GDPR) | -| **Traces** | Tempo | 0€ (self-hosted) | 7 jours | -| **Dashboards** | Grafana | 0€ (self-hosted) | N/A | -| **Fallback logs** | CloudWatch Logs | Tier gratuit 5GB | 7 jours | - -**Coût estimé : < 50€/mois** (principalement S3 pour Thanos) - -## **7.2 Telemetry Pipeline** - -``` -┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ -│ Applications │ │ OTel Collector │ │ Backends │ -│ │ │ │ │ │ -│ • SDK Python │────►│ • Receivers │────►│ • Prometheus │ -│ • Auto-instr │ │ • Processors │ │ • Loki │ -│ │ │ • Exporters │ │ • Tempo │ -└─────────────────┘ └─────────────────┘ └─────────────────┘ - │ - │ Scrubbing - ▼ - ┌─────────────────┐ - │ GDPR Compliant │ - │ • No user_id │ - │ • No PII │ - │ • No PAN │ - └─────────────────┘ -``` +# 🛠️ **PARTIE VII — PLATFORM CONTRACTS** -## **7.3 Cardinality Management** +## **7.1 Golden Path (New Service Checklist)** -| Label | Action | Rationale | -|-------|--------|-----------| -| `user_id` | DROP | High cardinality, use traces | -| `request_id` | DROP | Use trace_id instead | -| `http.url` | DROP | URLs uniques = explosion | -| `http.route` | KEEP | Templated, low cardinality | -| `service.name` | KEEP | Essential | -| `http.method` | KEEP | Low cardinality | -| `http.status_code` | KEEP | Low cardinality | +| Étape | Action | Validation | +|-------|--------|------------| +| 1 | Créer repo depuis template | Structure conforme | +| 2 | Définir protos dans contracts-proto | buf lint pass | +| 3 | Implémenter service | Unit tests > 80% | +| 4 | Configurer K8s manifests | Kyverno policies pass | +| 5 | Configurer External-Secret | Secrets résolus | +| 6 | Ajouter ServiceMonitor | Metrics visibles Grafana | +| 7 | Créer HTTPRoute | Trafic routable | +| 8 | PR review | Merge → Auto-deploy dev | -## **7.4 SLI/SLO/Error Budgets** +## **7.2 SLI/SLO/Error Budgets** | Service | SLI | SLO | Error Budget | |---------|-----|-----|--------------| | **svc-ledger** | Availability | 99.9% | 43 
min/mois | | **svc-ledger** | Latency P99 | < 200ms | N/A | | **svc-wallet** | Availability | 99.9% | 43 min/mois | -| **Platform (ArgoCD, Prometheus)** | Availability | 99.5% | 3.6h/mois | - -## **7.5 Alerting Strategy** - -| Severity | Exemple | Notification | On-call | -|----------|---------|--------------|---------| -| **P1 — Critical** | svc-ledger down | PagerDuty immediate | Wake up | -| **P2 — High** | Error rate > 5% | Slack + PagerDuty 15min | Within 30min | -| **P3 — Medium** | Latency P99 > 500ms | Slack | Business hours | -| **P4 — Low** | Disk usage > 80% | Slack | Next day | - -## **7.6 APM (Application Performance Monitoring)** - -### **7.6.1 Stack APM** - -| Composant | Outil | Intégration | Usage | -|-----------|-------|-------------|-------| -| **Distributed Tracing** | Tempo + OTel | Auto-instrumentation Python/Go | Request flow, latency breakdown | -| **Profiling** | Pyroscope (Grafana) | SDK intégré | CPU/Memory profiling continu | -| **Error Tracking** | Sentry (self-hosted) | SDK Python/Go | Exception tracking, stack traces | -| **Database APM** | pg_stat_statements | Prometheus exporter | Query performance | -| **Real User Monitoring** | Grafana Faro | JavaScript SDK | Frontend performance (si applicable) | - -### **7.6.2 APM Pipeline** - -``` -┌─────────────────────────────────────────────────────────────────────────────┐ -│ APPLICATION LAYER │ -├─────────────────────────────────────────────────────────────────────────────┤ -│ │ -│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ -│ │ OTel SDK │ │ Pyroscope │ │ Sentry SDK │ │ pg_stat │ │ -│ │ (Traces) │ │ (Profiles) │ │ (Errors) │ │ (DB metrics) │ │ -│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │ -│ │ │ │ │ │ -└─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────┘ - │ │ │ │ - ▼ ▼ ▼ ▼ -┌─────────────────────────────────────────────────────────────────────────────┐ -│ COLLECTION LAYER │ 
-├─────────────────────────────────────────────────────────────────────────────┤ -│ ┌──────────────────────────────────────────────────────────────────────┐ │ -│ │ OTel Collector (Gateway) │ │ -│ │ • Receives: traces, metrics, logs │ │ -│ │ • Processes: sampling, enrichment, PII scrubbing │ │ -│ │ • Exports: Tempo, Prometheus, Loki │ │ -│ └──────────────────────────────────────────────────────────────────────┘ │ -└─────────────────────────────────────────────────────────────────────────────┘ - │ - ▼ -┌─────────────────────────────────────────────────────────────────────────────┐ -│ STORAGE & VISUALIZATION │ -├─────────────────────────────────────────────────────────────────────────────┤ -│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ -│ │ Tempo │ │ Pyroscope │ │ Sentry │ │ Grafana │ │ -│ │ (Traces) │ │ (Profiles) │ │ (Errors) │ │ (Unified) │ │ -│ └────────────┘ └────────────┘ └────────────┘ └────────────┘ │ -└─────────────────────────────────────────────────────────────────────────────┘ -``` - -### **7.6.3 Instrumentation Standards** - -| Language | Auto-instrumentation | Manual Instrumentation | Frameworks supportés | -|----------|---------------------|------------------------|---------------------| -| **Python** | `opentelemetry-instrumentation` | `@tracer.start_as_current_span` | FastAPI, SQLAlchemy, httpx, grpcio | -| **Go** | OTel contrib packages | `tracer.Start()` | gRPC, net/http, pgx | - -### **7.6.4 Sampling Strategy** - -| Environment | Head Sampling | Tail Sampling | Rationale | -|-------------|---------------|---------------|-----------| -| **Dev** | 100% | N/A | Full visibility pour debug | -| **Staging** | 50% | Errors: 100% | Balance cost/visibility | -| **Prod** | 10% | Errors: 100%, Slow: 100% (>500ms) | Cost optimization | - -### **7.6.5 APM Dashboards** - -| Dashboard | Métriques clés | Audience | -|-----------|---------------|----------| -| **Service Overview** | RPS, Error rate, Latency P50/P95/P99 | On-call | -| **Dependency 
Map** | Service topology, inter-service latency | Platform team | -| **Database Performance** | Query time, connections, deadlocks | Backend devs | -| **Error Analysis** | Error count by type, affected users | Product team | -| **Profiling Flame Graphs** | CPU hotspots, memory allocations | Performance team | - -### **7.6.6 Trace-to-Logs-to-Metrics Correlation** - -``` -┌─────────────────┐ trace_id ┌─────────────────┐ -│ TRACES │◄────────────────►│ LOGS │ -│ (Tempo) │ │ (Loki) │ -└────────┬────────┘ └────────┬────────┘ - │ │ - │ Exemplars (trace_id in metrics) │ - │ │ - ▼ ▼ -┌─────────────────────────────────────────────────────────┐ -│ GRAFANA │ -│ • Click trace → See logs for that request │ -│ • Click metric spike → Jump to exemplar trace │ -│ • Click error log → Navigate to full trace │ -└─────────────────────────────────────────────────────────┘ -``` - -### **7.6.7 APM Alerting** - -| Alert | Condition | Severity | Action | -|-------|-----------|----------|--------| -| **High Error Rate** | Error rate > 1% for 5min | P2 | Investigate errors in Sentry | -| **Latency Degradation** | P99 > 2x baseline for 10min | P2 | Check traces for slow spans | -| **Database Slow Queries** | Query time P95 > 100ms | P3 | Analyze pg_stat_statements | -| **Memory Leak Detected** | Memory growth > 10%/hour | P3 | Check Pyroscope profiles | - ---- - -# 💾 **PARTIE VIII — DATA ARCHITECTURE** - -## **8.1 Aiven Configuration** - -| Service | Plan | Config | Coût estimé | -|---------|------|--------|-------------| -| **PostgreSQL** | Business-4 | Primary + Read Replica, 100GB | ~300€/mois | -| **Kafka** | Business-4 | 3 brokers, 100GB retention | ~400€/mois | -| **Valkey (Redis)** | Business-4 | 2 nodes, 10GB, HA | ~150€/mois | - -**Coût total Aiven estimé : ~850€/mois** - -## **8.2 Database Strategy** - -| Aspect | Choix | Rationale | -|--------|-------|-----------| -| **Replication** | Aiven managed (async) | RPO 1h acceptable | -| **Backup** | Aiven automated hourly | RPO 1h | -| 
**Failover** | Aiven automated | RTO < 15min | -| **Connection** | VPC Peering (private) | PCI-DSS, no public internet | -| **Pooling** | PgBouncer (Aiven built-in) | Connection efficiency | - -## **8.3 Schema Ownership** - -| Table | Owner Service | Access pattern | -|-------|---------------|----------------| -| `transactions` | svc-ledger | CRUD | -| `ledger_entries` | svc-ledger | CRUD | -| `wallets` | svc-wallet | CRUD | -| `balance_snapshots` | svc-wallet | CRUD | -| `merchants` | svc-merchant | CRUD | -| `giftcards` | svc-giftcard | CRUD | - -**Règle : 1 table = 1 owner. Cross-service = gRPC ou Events, jamais JOIN.** - -## **8.4 Kafka Topics** - -| Topic | Producer | Consumers | Retention | -|-------|----------|-----------|-----------| -| `ledger.transactions.v1` | svc-ledger (Outbox) | svc-notification, svc-analytics | 7 jours | -| `wallet.balance-updated.v1` | svc-wallet | svc-analytics | 7 jours | -| `merchant.onboarded.v1` | svc-merchant | svc-notification | 7 jours | - -## **8.5 Cache Architecture (Valkey/Redis)** +| **Platform** | Availability | 99.5% | 3.6h/mois | -### **8.5.1 Stack Cache** +## **7.3 On-Call Structure** -| Composant | Outil | Hébergement | Coût estimé | -|-----------|-------|-------------|-------------| -| **Cache primaire** | Valkey (Redis-compatible) | Aiven for Caching | ~150€/mois | -| **Cache local (L1)** | Python `cachetools` / Go `bigcache` | In-memory | 0€ | - -> **Note :** Valkey est le fork open-source de Redis, maintenu par la Linux Foundation. Aiven supporte Valkey nativement. 
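The L1/L2 split described above (a small in-process cache per pod in front of the shared Valkey tier, with PostgreSQL as the source of truth) can be sketched as a cache-aside lookup. This is a minimal illustration, not the production client: a hand-rolled TTL dict stands in for `cachetools`, a plain dict stubs the Valkey connection, and `LayeredCache`, `load_balance`, and the TTL values are invented for the example.

```python
import time

class TTLCache:
    """Minimal in-process L1 cache with per-entry TTL (stand-in for cachetools)."""
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, expiry timestamp)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() > expires_at:
            del self.store[key]  # lazy eviction on read
            return None
        return value

    def set(self, key, value):
        self.store[key] = (value, time.monotonic() + self.ttl)

class LayeredCache:
    """Cache-aside across L1 (pod-local) and L2 (distributed), with a DB loader fallback."""
    def __init__(self, l2_client, loader, l1_ttl=30):
        self.l1 = TTLCache(l1_ttl)
        self.l2 = l2_client   # a Valkey/Redis client in production; dict-like stub here
        self.loader = loader  # source of truth, e.g. a PostgreSQL query

    def get(self, key):
        value = self.l1.get(key)
        if value is not None:
            return value              # L1 hit
        value = self.l2.get(key)
        if value is None:
            value = self.loader(key)  # L1 and L2 miss: query the database
            self.l2[key] = value      # populate L2 (would carry a TTL in Valkey)
        self.l1.set(key, value)       # promote to L1 for subsequent reads
        return value

# Usage: keys follow the {service}:{entity}:{id}:{version} convention.
calls = []
def load_balance(key):
    calls.append(key)  # track database hits for the demonstration
    return {"balance_cents": 1250}

cache = LayeredCache(l2_client={}, loader=load_balance)
cache.get("wallet:balance:user_123:v1")  # miss on both layers, hits the loader
cache.get("wallet:balance:user_123:v1")  # served from L1
print(len(calls))  # the loader ran exactly once
```

Event-driven invalidation (the Kafka-triggered case for wallet balances) would delete the key from both layers when a `ledger.transactions.v1` event arrives, rather than waiting for the TTL.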
- -### **8.5.2 Cache Topology** - -``` -┌─────────────────────────────────────────────────────────────────────────────┐ -│ MULTI-LAYER CACHE │ -├─────────────────────────────────────────────────────────────────────────────┤ -│ │ -│ ┌─────────────────────────────────────────────────────────────────────┐ │ -│ │ L1 — LOCAL CACHE (per pod) │ │ -│ │ • TTL: 30s - 5min │ │ -│ │ • Size: 100MB max per pod │ │ -│ │ • Use case: Hot data, config, user sessions │ │ -│ └───────────────────────────────┬─────────────────────────────────────┘ │ -│ │ Cache miss │ -│ ▼ │ -│ ┌─────────────────────────────────────────────────────────────────────┐ │ -│ │ L2 — DISTRIBUTED CACHE (Valkey cluster) │ │ -│ │ • TTL: 5min - 24h │ │ -│ │ • Size: 10GB │ │ -│ │ • Use case: Shared state, rate limits, session store │ │ -│ └───────────────────────────────┬─────────────────────────────────────┘ │ -│ │ Cache miss │ -│ ▼ │ -│ ┌─────────────────────────────────────────────────────────────────────┐ │ -│ │ L3 — DATABASE (PostgreSQL) │ │ -│ │ • Source of truth │ │ -│ │ • Write-through pour updates │ │ -│ └─────────────────────────────────────────────────────────────────────┘ │ -│ │ -└─────────────────────────────────────────────────────────────────────────────┘ -``` - -### **8.5.3 Cache Strategies par Use Case** - -| Use Case | Strategy | TTL | Invalidation | -|----------|----------|-----|--------------| -| **Wallet Balance** | Cache-aside (read) | 30s | Event-driven (Kafka) | -| **Merchant Config** | Read-through | 5min | TTL + Manual | -| **Rate Limiting** | Write-through | Sliding window | Auto-expire | -| **Session Data** | Write-through | 24h | Explicit logout | -| **Gift Card Catalog** | Cache-aside | 15min | Event-driven | -| **Feature Flags** | Read-through | 1min | Config push | - -### **8.5.4 Cache Patterns Implementation** - -``` -┌─────────────────────────────────────────────────────────────────────────────┐ -│ CACHE-ASIDE PATTERN │ 
-├─────────────────────────────────────────────────────────────────────────────┤ -│ │ -│ 1. Application checks cache │ -│ 2. If HIT → return cached data │ -│ 3. If MISS → query database │ -│ 4. Store result in cache with TTL │ -│ 5. Return data to caller │ -│ │ -│ ┌─────────┐ GET ┌─────────┐ │ -│ │ App │───────────►│ Cache │ │ -│ └────┬────┘ └────┬────┘ │ -│ │ │ MISS │ -│ │ SELECT ▼ │ -│ └─────────────────►┌─────────┐ │ -│ │ DB │ │ -│ └─────────┘ │ -│ │ -└─────────────────────────────────────────────────────────────────────────────┘ - -┌─────────────────────────────────────────────────────────────────────────────┐ -│ WRITE-THROUGH PATTERN │ -├─────────────────────────────────────────────────────────────────────────────┤ -│ │ -│ 1. Application writes to cache AND database atomically │ -│ 2. Cache is always consistent with database │ -│ │ -│ ┌─────────┐ SET+TTL ┌─────────┐ │ -│ │ App │────────────►│ Cache │ │ -│ └────┬────┘ └─────────┘ │ -│ │ │ -│ │ INSERT/UPDATE │ -│ └─────────────────►┌─────────┐ │ -│ │ DB │ │ -│ └─────────┘ │ -│ │ -└─────────────────────────────────────────────────────────────────────────────┘ -``` - -### **8.5.5 Cache Invalidation Strategy** - -| Trigger | Méthode | Use Case | -|---------|---------|----------| -| **TTL Expiry** | Automatic | Default pour toutes les clés | -| **Event-driven** | Kafka consumer | Wallet balance après transaction | -| **Explicit Delete** | API call | Admin actions, config updates | -| **Pub/Sub** | Valkey PUBLISH | Real-time invalidation cross-pods | - -### **8.5.6 Cache Key Naming Convention** - -``` -{service}:{entity}:{id}:{version} - -Exemples: - wallet:balance:user_123:v1 - merchant:config:merchant_456:v1 - giftcard:catalog:category_active:v1 - ratelimit:api:user_123:minute - session:auth:session_abc123 -``` - -### **8.5.7 Cache Metrics & Monitoring** - -| Metric | Seuil alerte | Action | -|--------|--------------|--------| -| **Hit Rate** | < 80% | Revoir TTL, préchargement | -| **Latency P99** | > 10ms | Check 
network, cluster size | -| **Memory Usage** | > 80% | Eviction analysis, scale up | -| **Evictions/sec** | > 100 | Augmenter cache size | -| **Connection Errors** | > 0 | Check connectivity, pooling | - -## **8.6 Queueing & Background Jobs** - -### **8.6.1 Queueing Architecture Overview** - -``` -┌─────────────────────────────────────────────────────────────────────────────┐ -│ QUEUEING ARCHITECTURE │ -├─────────────────────────────────────────────────────────────────────────────┤ -│ │ -│ ┌─────────────────────────────────────────────────────────────────────┐ │ -│ │ TIER 1 — EVENT STREAMING (Kafka) │ │ -│ │ • Use case: Event-driven architecture, CDC, audit logs │ │ -│ │ • Pattern: Pub/Sub, Event Sourcing │ │ -│ │ • Retention: 7 jours │ │ -│ │ • Ordering: Per-partition │ │ -│ └─────────────────────────────────────────────────────────────────────┘ │ -│ │ -│ ┌─────────────────────────────────────────────────────────────────────┐ │ -│ │ TIER 2 — TASK QUEUE (Valkey + Python Dramatiq/ARQ) │ │ -│ │ • Use case: Background jobs, async processing │ │ -│ │ • Pattern: Producer/Consumer, Work Queue │ │ -│ │ • Features: Retries, priorities, scheduling │ │ -│ │ • Durability: Redis persistence │ │ -│ └─────────────────────────────────────────────────────────────────────┘ │ -│ │ -│ ┌─────────────────────────────────────────────────────────────────────┐ │ -│ │ TIER 3 — SCHEDULED JOBS (Kubernetes CronJobs) │ │ -│ │ • Use case: Batch processing, reports, cleanup │ │ -│ │ • Pattern: Time-triggered execution │ │ -│ │ • Managed: K8s native │ │ -│ └─────────────────────────────────────────────────────────────────────┘ │ -│ │ -└─────────────────────────────────────────────────────────────────────────────┘ -``` - -### **8.6.2 Kafka vs Task Queue — Decision Matrix** - -| Critère | Kafka | Task Queue (Valkey) | -|---------|-------|---------------------| -| **Message Ordering** | ✅ Per-partition | ❌ Best effort | -| **Message Replay** | ✅ Retention-based | ❌ Non | -| **Priority Queues** | ❌ 
Non natif | ✅ Oui | -| **Delayed Messages** | ❌ Non natif | ✅ Oui | -| **Dead Letter Queue** | ✅ Configurable | ✅ Intégré | -| **Exactly-once** | ✅ Avec idempotency | ❌ At-least-once | -| **Throughput** | 🚀 Très élevé | 📈 Élevé | -| **Use Case** | Events, CDC, Streaming | Jobs, Tasks, Async work | - -### **8.6.3 Task Queue Stack** - -| Composant | Outil | Rôle | -|-----------|-------|------| -| **Task Framework** | Dramatiq (Python) / Asynq (Go) | Task definition, execution | -| **Broker** | Valkey (Redis-compatible) | Message storage, routing | -| **Result Backend** | Valkey | Task results, status | -| **Scheduler** | APScheduler / Dramatiq-crontab | Periodic tasks | -| **Monitoring** | Dramatiq Dashboard / Prometheus | Task metrics | - -### **8.6.4 Task Queue Patterns** - -``` -┌─────────────────────────────────────────────────────────────────────────────┐ -│ TASK PROCESSING FLOW │ -├─────────────────────────────────────────────────────────────────────────────┤ -│ │ -│ Producer Broker Workers │ -│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ -│ │ svc-* │──── enqueue ──►│ Valkey │◄── poll ─────│ Worker │ │ -│ │ API │ │ │ │ Pods │ │ -│ └─────────┘ │ Queues: │ └────┬────┘ │ -│ │ • high │ │ │ -│ │ • default│ │ execute │ -│ │ • low │ ▼ │ -│ │ • dlq │ ┌─────────┐ │ -│ └─────────┘ │ Task │ │ -│ │ Handler │ │ -│ └─────────┘ │ -│ │ -└─────────────────────────────────────────────────────────────────────────────┘ -``` - -### **8.6.5 Queue Definitions** - -| Queue | Priority | Workers | Use Cases | -|-------|----------|---------|-----------| -| **critical** | P0 | 5 | Transaction rollbacks, fraud alerts | -| **high** | P1 | 10 | Email confirmations, balance updates | -| **default** | P2 | 20 | Notifications, analytics events | -| **low** | P3 | 5 | Reports, cleanup, batch exports | -| **scheduled** | N/A | 3 | Cron-like scheduled tasks | -| **dead-letter** | N/A | 1 | Failed tasks investigation | - -### **8.6.6 Retry Strategy** - -| Retry Policy | Configuration | Use Case | 
-|--------------|---------------|----------| -| **Exponential Backoff** | base=1s, max=1h, multiplier=2 | API calls, external services | -| **Fixed Interval** | interval=30s, max_retries=5 | Database operations | -| **No Retry** | max_retries=0 | Idempotent operations | - -``` -Retry Timeline (Exponential): - Attempt 1: immediate - Attempt 2: +1s - Attempt 3: +2s - Attempt 4: +4s - Attempt 5: +8s - ... - Attempt N: move to DLQ -``` - -### **8.6.7 Dead Letter Queue (DLQ) Handling** - -``` -┌─────────────────────────────────────────────────────────────────────────────┐ -│ DLQ WORKFLOW │ -├─────────────────────────────────────────────────────────────────────────────┤ -│ │ -│ 1. Task fails after max retries │ -│ 2. Task moved to DLQ with metadata: │ -│ • Original queue │ -│ • Failure reason │ -│ • Stack trace │ -│ • Attempt count │ -│ • Timestamp │ -│ 3. Alert sent to Slack (P3) │ -│ 4. On-call investigates │ -│ 5. Options: │ -│ a) Fix bug → Replay task │ -│ b) Manual resolution → Delete from DLQ │ -│ c) Archive for audit │ -│ │ -└─────────────────────────────────────────────────────────────────────────────┘ -``` - -### **8.6.8 Scheduled Jobs (CronJobs)** - -| Job | Schedule | Service | Description | -|-----|----------|---------|-------------| -| **balance-reconciliation** | `0 2 * * *` | svc-wallet | Daily balance verification | -| **expired-giftcards** | `0 0 * * *` | svc-giftcard | Mark expired cards | -| **analytics-rollup** | `0 */6 * * *` | svc-analytics | 6-hourly aggregation | -| **log-cleanup** | `0 3 * * 0` | platform | Weekly log rotation | -| **backup-verification** | `0 4 * * *` | platform | Daily backup integrity check | -| **compliance-report** | `0 6 1 * *` | platform | Monthly compliance export | - -### **8.6.9 Task Queue Monitoring** - -| Metric | Seuil alerte | Action | -|--------|--------------|--------| -| **Queue Depth** | > 1000 tasks | Scale workers | -| **Processing Time P95** | > 30s | Optimize task, check resources | -| **Failure Rate** | > 
5% | Investigate DLQ, check dependencies | -| **DLQ Size** | > 10 tasks | Immediate investigation | -| **Worker Availability** | < 50% | Check pod health, scale up | - ---- - -# 🌐 **PARTIE IX — NETWORKING ARCHITECTURE** - -## **9.1 VPC Design** - -| CIDR | Usage | -|------|-------| -| 10.0.0.0/16 | VPC Principal | -| 10.0.0.0/20 | Private Subnets (Workloads) | -| 10.0.16.0/20 | Private Subnets (Data) | -| 10.0.32.0/20 | Public Subnets (NAT, LB) | - -## **9.2 Traffic Flow** - -| Flow | Path | Encryption | -|------|------|------------| -| Internet → Services | ALB → Cilium Gateway → Pod | TLS + mTLS | -| Service → Service | Pod → Pod (Cilium) | mTLS (WireGuard) | -| Service → Aiven | VPC Peering | TLS | -| Service → AWS (S3, KMS) | VPC Endpoints | TLS | - -## **9.3 Gateway API Configuration** - -| Resource | Purpose | -|----------|---------| -| **GatewayClass** | Cilium implementation | -| **Gateway** | HTTPS listener, TLS termination | -| **HTTPRoute** | Routing vers services (path-based) | - -## **9.4 Network Policies (Default Deny)** - -| Policy | Effect | -|--------|--------| -| Default deny all | Aucun trafic sauf explicite | -| Allow intra-namespace | Services même namespace peuvent communiquer | -| Allow specific cross-namespace | svc-ledger → svc-wallet explicite | -| Allow egress Aiven | Services → VPC Peering range only | -| Allow egress AWS endpoints | Services → VPC Endpoints only | - ---- - -# 🌍 **PARTIE IX.B — EDGE, CDN & CLOUDFLARE** - -## **9.5 Cloudflare Architecture** - -### **9.5.1 Pourquoi Cloudflare ?** - -| Critère | Cloudflare | AWS CloudFront + WAF | Verdict | -|---------|------------|---------------------|---------| -| **Coût** | Free tier généreux | Payant dès le début | ✅ Cloudflare | -| **WAF** | Gratuit (règles de base) | ~30€/mois minimum | ✅ Cloudflare | -| **DDoS** | Inclus (unlimited) | AWS Shield Standard gratuit | ≈ Égal | -| **SSL/TLS** | Gratuit, auto-renew | ACM gratuit | ≈ Égal | -| **CDN** | 300+ PoPs, gratuit | Payant au GB | 
✅ Cloudflare | -| **DNS** | Gratuit, très rapide | Route53 ~0.50€/zone | ✅ Cloudflare | -| **Zero Trust** | Gratuit jusqu'à 50 users | Cognito + ALB payant | ✅ Cloudflare | -| **Terraform** | Provider officiel | Provider officiel | ≈ Égal | - -> **Décision :** Cloudflare en front, AWS en backend. Best of both worlds. - -### **9.5.2 Architecture Edge-to-Origin** - -``` -┌─────────────────────────────────────────────────────────────────────────────┐ -│ INTERNET │ -│ (End Users) │ -└─────────────────────────────────────────────────────────────────────────────┘ - │ - ▼ -┌─────────────────────────────────────────────────────────────────────────────┐ -│ CLOUDFLARE EDGE │ -├─────────────────────────────────────────────────────────────────────────────┤ -│ │ -│ ┌─────────────────────────────────────────────────────────────────────┐ │ -│ │ LAYER 1: DNS │ │ -│ │ • Authoritative DNS (localplus.io) │ │ -│ │ • DNSSEC enabled │ │ -│ │ • Geo-routing (future multi-region) │ │ -│ │ • Health checks → automatic failover │ │ -│ └─────────────────────────────────────────────────────────────────────┘ │ -│ │ │ -│ ▼ │ -│ ┌─────────────────────────────────────────────────────────────────────┐ │ -│ │ LAYER 2: DDoS Protection │ │ -│ │ • Layer 3/4 DDoS mitigation (automatic, unlimited) │ │ -│ │ • Layer 7 DDoS mitigation │ │ -│ │ • Rate limiting rules │ │ -│ └─────────────────────────────────────────────────────────────────────┘ │ -│ │ │ -│ ▼ │ -│ ┌─────────────────────────────────────────────────────────────────────┐ │ -│ │ LAYER 3: WAF (Web Application Firewall) │ │ -│ │ • OWASP Core Ruleset (free managed rules) │ │ -│ │ • Custom rules (rate limit, geo-block, bot score) │ │ -│ │ • Challenge pages (CAPTCHA, JS challenge) │ │ -│ └─────────────────────────────────────────────────────────────────────┘ │ -│ │ │ -│ ▼ │ -│ ┌─────────────────────────────────────────────────────────────────────┐ │ -│ │ LAYER 4: SSL/TLS │ │ -│ │ • Edge certificates (auto-issued, free) │ │ -│ │ • Full (strict) mode → 
Origin certificate │ │ -│ │ • TLS 1.3 only, HSTS enabled │ │ -│ │ • Automatic HTTPS rewrites │ │ -│ └─────────────────────────────────────────────────────────────────────┘ │ -│ │ │ -│ ▼ │ -│ ┌─────────────────────────────────────────────────────────────────────┐ │ -│ │ LAYER 5: CDN & Caching │ │ -│ │ • Static assets caching (JS, CSS, images) │ │ -│ │ • API responses: Cache-Control headers │ │ -│ │ • Tiered caching (edge → regional → origin) │ │ -│ └─────────────────────────────────────────────────────────────────────┘ │ -│ │ │ -│ ▼ │ -│ ┌─────────────────────────────────────────────────────────────────────┐ │ -│ │ LAYER 6: Cloudflare Tunnel (Argo Tunnel) │ │ -│ │ • No public IP needed on origin │ │ -│ │ • Encrypted tunnel to Cloudflare edge │ │ -│ │ • cloudflared daemon in K8s │ │ -│ └─────────────────────────────────────────────────────────────────────┘ │ -│ │ -└─────────────────────────────────────────────────────────────────────────────┘ - │ - │ Cloudflare Tunnel (encrypted) - ▼ -┌─────────────────────────────────────────────────────────────────────────────┐ -│ AWS EKS CLUSTER │ -├─────────────────────────────────────────────────────────────────────────────┤ -│ │ -│ ┌─────────────────────────────────────────────────────────────────────┐ │ -│ │ cloudflared (Deployment) │ │ -│ │ • Runs in platform namespace │ │ -│ │ • Connects to Cloudflare edge │ │ -│ │ • Routes traffic to internal services │ │ -│ └──────────────────────────────┬──────────────────────────────────────┘ │ -│ │ │ -│ ▼ │ -│ ┌─────────────────────────────────────────────────────────────────────┐ │ -│ │ API Gateway (APISIX) or Cilium Gateway │ │ -│ │ • Internal routing │ │ -│ │ • Rate limiting (L7) │ │ -│ │ • Authentication │ │ -│ └──────────────────────────────┬──────────────────────────────────────┘ │ -│ │ │ -│ ▼ │ -│ ┌─────────────────────────────────────────────────────────────────────┐ │ -│ │ Application Services │ │ -│ │ • svc-ledger, svc-wallet, etc. 
│ │ -│ └─────────────────────────────────────────────────────────────────────┘ │ -│ │ -└─────────────────────────────────────────────────────────────────────────────┘ -``` - -### **9.5.3 Cloudflare Services Configuration** - -| Service | Plan | Configuration | Coût | -|---------|------|---------------|------| -| **DNS** | Free | Authoritative, DNSSEC, proxy enabled | 0€ | -| **CDN** | Free | Cache everything, tiered caching | 0€ | -| **SSL/TLS** | Free | Full (strict), TLS 1.3, edge certs | 0€ | -| **WAF** | Free | Managed ruleset, 5 custom rules | 0€ | -| **DDoS** | Free | L3/L4/L7 protection, unlimited | 0€ | -| **Bot Management** | Free | Basic bot score, JS challenge | 0€ | -| **Rate Limiting** | Free | 1 rule (10K req/month free) | 0€ | -| **Tunnel** | Free | Unlimited tunnels, cloudflared | 0€ | -| **Access** | Free | Zero Trust, 50 users free | 0€ | - -**Coût Cloudflare total : 0€** (Free tier suffisant pour démarrer) - -### **9.5.4 DNS Configuration** - -``` -┌─────────────────────────────────────────────────────────────────────────────┐ -│ DNS RECORDS — localplus.io │ -├─────────────────────────────────────────────────────────────────────────────┤ -│ │ -│ TYPE NAME CONTENT PROXY TTL │ -│ ──────────────────────────────────────────────────────────────────────── │ -│ A @ Cloudflare Tunnel ☁️ ON Auto │ -│ CNAME www @ ☁️ ON Auto │ -│ CNAME api tunnel-xxx.cfargotunnel.com ☁️ ON Auto │ -│ CNAME grafana tunnel-xxx.cfargotunnel.com ☁️ ON Auto │ -│ CNAME argocd tunnel-xxx.cfargotunnel.com ☁️ ON Auto │ -│ TXT @ "v=spf1 include:_spf..." ☁️ OFF Auto │ -│ TXT _dmarc "v=DMARC1; p=reject..." 
☁️ OFF Auto │ -│ MX @ mail provider ☁️ OFF Auto │ -│ │ -└─────────────────────────────────────────────────────────────────────────────┘ -``` - -### **9.5.5 WAF Rules Strategy** - -| Rule Set | Type | Action | Purpose | -|----------|------|--------|---------| -| **OWASP Core** | Managed | Block | SQLi, XSS, LFI, RFI protection | -| **Cloudflare Managed** | Managed | Block | Zero-day, emerging threats | -| **Geo-Block** | Custom | Block | Block high-risk countries (optional) | -| **Rate Limit API** | Custom | Challenge | > 100 req/min per IP on /api/* | -| **Bot Score < 30** | Custom | Challenge | Likely bot traffic | -| **Known Bad ASNs** | Custom | Block | Hosting providers, VPNs (optional) | - -### **9.5.6 SSL/TLS Configuration** - -| Setting | Value | Rationale | -|---------|-------|-----------| -| **SSL Mode** | Full (strict) | Origin has valid cert | -| **Minimum TLS** | 1.2 | PCI-DSS compliance | -| **TLS 1.3** | Enabled | Performance + security | -| **HSTS** | Enabled (max-age=31536000) | Force HTTPS | -| **Always Use HTTPS** | On | Redirect HTTP → HTTPS | -| **Automatic HTTPS Rewrites** | On | Fix mixed content | -| **Origin Certificate** | Cloudflare Origin CA | 15-year validity, free | - -### **9.5.7 Cloudflare Tunnel Architecture** - -| Composant | Rôle | Déploiement | -|-----------|------|-------------| -| **cloudflared daemon** | Agent tunnel, connexion sécurisée vers Cloudflare | 2+ replicas, namespace platform | -| **Tunnel credentials** | Secret d'authentification tunnel | Vault / External-Secrets | -| **Tunnel config** | Routing rules vers services internes | ConfigMap | -| **Health checks** | Vérification disponibilité tunnel | Cloudflare dashboard | - -**Avantages Cloudflare Tunnel :** -- Pas d'IP publique exposée sur l'origin -- Connexion outbound uniquement (pas de firewall inbound) -- Encryption de bout en bout -- Failover automatique entre replicas - -### **9.5.8 Cloudflare Access (Zero Trust)** - -| Resource | Policy | Authentication | 
-|----------|--------|----------------| -| **grafana.localplus.io** | Team only | GitHub SSO | -| **argocd.localplus.io** | Team only | GitHub SSO | -| **api.localplus.io/admin** | Admin only | GitHub SSO + MFA | -| **api.localplus.io/*** | Public | No auth (application handles) | - -### **9.5.9 Infrastructure as Code (Terraform)** - -| Ressource Terraform | Description | Module/Provider | -|---------------------|-------------|-----------------| -| **cloudflare_zone** | Zone DNS principale | cloudflare/cloudflare | -| **cloudflare_record** | Records DNS (A, CNAME, TXT) | cloudflare/cloudflare | -| **cloudflare_tunnel** | Configuration tunnel | cloudflare/cloudflare | -| **cloudflare_ruleset** | WAF rules, rate limiting | cloudflare/cloudflare | -| **cloudflare_access_application** | Zero Trust apps | cloudflare/cloudflare | -| **cloudflare_access_policy** | Policies d'accès | cloudflare/cloudflare | - -> **Note :** Toute la configuration Cloudflare est gérée via Terraform dans le repo `platform-gateway/cloudflare/terraform/` - -### **9.5.10 Cloudflare Monitoring & Analytics** - -| Metric | Source | Dashboard | -|--------|--------|-----------| -| **Requests** | Cloudflare Analytics | Grafana (API) | -| **Cache Hit Ratio** | Cloudflare Analytics | Grafana | -| **WAF Events** | Cloudflare Security Events | Grafana + Alerts | -| **Bot Score Distribution** | Cloudflare Analytics | Grafana | -| **Origin Response Time** | Cloudflare Analytics | Grafana | -| **DDoS Attacks** | Cloudflare Security Center | Email alerts | - -### **9.5.11 Route53 — DNS Interne & Backup** - -| Use Case | Solution | Configuration | -|----------|----------|---------------| -| **DNS Public (Primary)** | Cloudflare | Authoritative pour `localplus.io` | -| **DNS Public (Backup)** | Route53 | Secondary zone, sync via AXFR | -| **DNS Privé (Internal)** | Route53 Private Hosted Zones | `*.internal.localplus.io` | -| **Service Discovery** | Route53 + Cloud Map | Résolution services internes | -| 
**Health Checks** | Route53 Health Checks | Failover automatique si Cloudflare down | - -**Architecture DNS Hybride :** - -``` -┌─────────────────────────────────────────────────────────────────────────────┐ -│ DNS ARCHITECTURE │ -├─────────────────────────────────────────────────────────────────────────────┤ -│ │ -│ EXTERNAL TRAFFIC INTERNAL TRAFFIC │ -│ ───────────────── ───────────────── │ -│ │ -│ ┌─────────────────┐ ┌─────────────────┐ │ -│ │ Cloudflare DNS │ │ Route53 Private │ │ -│ │ (Primary) │ │ Hosted Zone │ │ -│ │ │ │ │ │ -│ │ localplus.io │ │ internal. │ │ -│ │ api.localplus.io│ │ localplus.io │ │ -│ └────────┬────────┘ └────────┬────────┘ │ -│ │ │ │ -│ │ Failover │ VPC DNS │ -│ ▼ ▼ │ -│ ┌─────────────────┐ ┌─────────────────┐ │ -│ │ Route53 Public │ │ EKS CoreDNS │ │ -│ │ (Backup) │ │ + Cloud Map │ │ -│ │ │ │ │ │ -│ │ Health checks │ │ svc-*.svc. │ │ -│ │ Failover ready │ │ cluster.local │ │ -│ └─────────────────┘ └─────────────────┘ │ -│ │ -└─────────────────────────────────────────────────────────────────────────────┘ -``` - -| Route53 Feature | Use Case Local-Plus | -|-----------------|---------------------| -| **Private Hosted Zones** | Résolution DNS interne VPC, pas d'exposition internet | -| **Health Checks** | Vérification santé endpoints, failover automatique | -| **Alias Records** | Pointage vers ALB/NLB sans IP hardcodée | -| **Geolocation Routing** | Future multi-région, routage par géographie | -| **Failover Routing** | Backup si Cloudflare indisponible | -| **Weighted Routing** | Canary deployments, A/B testing | - -### **9.5.12 Vision Multi-Cloud** - -> **Objectif :** L'architecture edge (Cloudflare) et API Gateway (APISIX) sont **cloud-agnostic** et peuvent router vers plusieurs cloud providers. 
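> **Exemple (esquisse hypothétique, hors scope Terraform ci-dessus)** : pour concrétiser ce routage multi-origin, voici les payloads qu'on *pourrait* construire pour l'API Cloudflare Load Balancing (origin pools + load balancer). Les hostnames `origin-aws.localplus.io` / `origin-gcp.localplus.io` sont fictifs, et les noms de champs suivent la documentation publique Cloudflare — à revérifier avant tout usage réel.

```python
# Sketch only — hypothetical payloads for Cloudflare's Load Balancing API.
# The origin hostnames below are placeholders; only AWS exists in Phase 1.

def build_origin_pool(name, origins):
    """JSON payload for one origin pool (one cloud provider)."""
    return {
        "name": name,
        "enabled": True,
        "minimum_origins": 1,  # pool stays healthy while >= 1 origin is up
        "origins": [
            {"name": o["name"], "address": o["address"],
             "enabled": True, "weight": 1.0}
            for o in origins
        ],
    }

def build_load_balancer(hostname, pool_names):
    """Failover steering: the first healthy pool in the list gets traffic."""
    return {
        "name": hostname,
        "default_pools": pool_names,   # ordered: AWS first, future clouds after
        "fallback_pool": pool_names[-1],
        "steering_policy": "off",      # "off" = strict failover order, no geo-steering
        "proxied": True,
    }

aws_pool = build_origin_pool("aws-eu-west-1", [
    {"name": "apisix-aws", "address": "origin-aws.localplus.io"},   # hypothetical
])
gcp_pool = build_origin_pool("gcp-europe-west1", [                  # Phase 2 (DR)
    {"name": "apisix-gcp", "address": "origin-gcp.localplus.io"},   # hypothetical
])

lb = build_load_balancer("api.localplus.io",
                         [aws_pool["name"], gcp_pool["name"]])
print(lb["default_pools"])  # → ['aws-eu-west-1', 'gcp-europe-west1']
```

Avec `steering_policy: "off"`, Cloudflare sert le premier pool sain de `default_pools` ; la bascule vers GCP serait donc automatique dès que le pool AWS est déclaré unhealthy par ses health checks.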
- -``` -┌─────────────────────────────────────────────────────────────────────────────┐ -│ MULTI-CLOUD ARCHITECTURE (Future) │ -├─────────────────────────────────────────────────────────────────────────────┤ -│ │ -│ CLOUDFLARE EDGE │ -│ (Global Load Balancing) │ -│ │ │ -│ ┌───────────────┼───────────────┐ │ -│ │ │ │ │ -│ ▼ ▼ ▼ │ -│ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │ -│ │ AWS (Primary)│ │ GCP (Future) │ │ Azure (Future)│ │ -│ │ eu-west-1 │ │ europe-west1 │ │ westeurope │ │ -│ │ │ │ │ │ │ │ -│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ ┌─────────┐ │ │ -│ │ │ APISIX │ │ │ │ APISIX │ │ │ │ APISIX │ │ │ -│ │ │ Gateway │ │ │ │ Gateway │ │ │ │ Gateway │ │ │ -│ │ └────┬────┘ │ │ └────┬────┘ │ │ └────┬────┘ │ │ -│ │ │ │ │ │ │ │ │ │ │ -│ │ ┌────┴────┐ │ │ ┌────┴────┐ │ │ ┌────┴────┐ │ │ -│ │ │Services │ │ │ │Services │ │ │ │Services │ │ │ -│ │ └─────────┘ │ │ └─────────┘ │ │ └─────────┘ │ │ -│ └───────────────┘ └───────────────┘ └───────────────┘ │ -│ │ -│ ┌─────────────────────────────────────────────────────────────────────┐ │ -│ │ AIVEN (Multi-Cloud Data Layer) │ │ -│ │ • PostgreSQL avec réplication cross-cloud │ │ -│ │ • Kafka avec MirrorMaker cross-cloud │ │ -│ │ • Valkey avec réplication │ │ -│ └─────────────────────────────────────────────────────────────────────┘ │ -│ │ -└─────────────────────────────────────────────────────────────────────────────┘ -``` - -| Composant | Multi-Cloud Ready | Comment | -|-----------|-------------------|---------| -| **Cloudflare** | ✅ Oui | Load balancing global, health checks multi-origin | -| **APISIX** | ✅ Oui | Déployable sur tout K8s (EKS, GKE, AKS) | -| **Aiven** | ✅ Oui | PostgreSQL, Kafka, Valkey disponibles sur AWS/GCP/Azure | -| **ArgoCD** | ✅ Oui | Peut gérer des clusters multi-cloud | -| **Vault** | ✅ Oui | Réplication cross-datacenter | -| **OTel** | ✅ Oui | Standard ouvert, backends interchangeables | - -**Phases Multi-Cloud :** +| Rôle | Responsabilité | Rotation | +|------|---------------|----------| +| 
**Primary** | First responder, triage | Weekly | +| **Secondary** | Escalation, expertise | Weekly | +| **Incident Commander** | Coordination si P1 | On-demand | -| Phase | Scope | Timeline | -|-------|-------|----------| -| **Phase 1 (Actuelle)** | AWS uniquement, architecture cloud-agnostic | Now | -| **Phase 2** | DR sur GCP (read replicas, failover) | +12 mois | -| **Phase 3** | Active-Active multi-cloud | +24 mois | +→ **Documentation détaillée** : [docs/platform/PLATFORM-ENGINEERING.md](platform/PLATFORM-ENGINEERING.md) --- -# 🚪 **PARTIE IX.C — API GATEWAY / APIM (Phase Future)** - -> **Statut :** À définir ultérieurement. Pour le moment, l'architecture reste simple : Cloudflare → Cilium Gateway → Services. - -## **9.6 Options à évaluer (Future)** - -| Solution | Type | Coût | Notes | -|----------|------|------|-------| -| **AWS API Gateway** | Managed | Pay-per-use | Simple, intégré AWS | -| **Gravitee CE** | APIM complet | Gratuit | Portal, Subscriptions inclus | -| **Kong OSS** | Gateway | Gratuit | Populaire, plugins riches | -| **APISIX** | Gateway | Gratuit | Cloud-native, performant | - -**Décision reportée à Phase 2+ selon les besoins :** -- Si besoin B2B/Partners → APIM (Gravitee) -- Si juste rate limiting/auth → AWS API Gateway -- Si multi-cloud requis → APISIX ou Kong - -### **Architecture Actuelle (Phase 1 — Simple)** +# 🚀 **PARTIE VIII — ROADMAP** + +## **8.1 Séquence de Construction** + +| Phase | Focus | Estimation | +|-------|-------|------------| +| **1** | Bootstrap Layer 0-1 (IAM, VPC, EKS, Aiven) | 3 semaines | +| **2** | Platform GitOps (ArgoCD) | 1 semaine | +| **3** | Platform Networking (Cilium, Gateway API) | 1 semaine | +| **3b** | Edge & CDN (Cloudflare) | 1 semaine | +| **4** | Platform Security (Vault, Kyverno) | 2 semaines | +| **5** | Platform Observability | 2 semaines | +| **5b** | Platform APM | 1 semaine | +| **6** | Platform Cache (Valkey) | 1 semaine | +| **7** | Contracts (Proto, SDK) | 1 semaine | +| **8** | svc-ledger | 
3 semaines | +| **9** | svc-wallet | 2 semaines | +| **10** | Kafka + Outbox | 2 semaines | +| **10b** | Task Queue | 1 semaine | +| **11** | Testing complet | 2 semaines | +| **12** | Compliance audit | 2 semaines | +| **13** | Documentation | 1 semaine | + +**Total estimé : ~25 semaines** + +## **8.2 Checklist avant démarrage** + +### Comptes & Accès +- [ ] Compte AWS créé, billing configuré +- [ ] Compte Aiven créé +- [ ] Compte Cloudflare créé (Free tier) +- [ ] Organisation GitHub créée +- [ ] Domaine DNS acquis et transféré vers Cloudflare -``` -┌─────────────────────────────────────────────────────────────────────────────┐ -│ ARCHITECTURE SIMPLIFIÉE — PHASE 1 │ -├─────────────────────────────────────────────────────────────────────────────┤ -│ │ -│ Internet │ -│ │ │ -│ ▼ │ -│ ┌─────────────────────────────────────────────────────────────────────┐ │ -│ │ CLOUDFLARE │ │ -│ │ (DNS, WAF, DDoS, TLS) │ │ -│ └──────────────────────────────┬──────────────────────────────────────┘ │ -│ │ │ -│ │ Tunnel ou Direct │ -│ ▼ │ -│ ┌─────────────────────────────────────────────────────────────────────┐ │ -│ │ AWS EKS — Cilium Gateway API │ │ -│ │ (Routing interne, mTLS) │ │ -│ │ │ │ -│ │ ┌─────────────────────────────────────────────────────────────┐ │ │ -│ │ │ Services : svc-ledger, svc-wallet, svc-merchant, ... │ │ │ -│ │ └─────────────────────────────────────────────────────────────┘ │ │ -│ │ │ │ -│ └─────────────────────────────────────────────────────────────────────┘ │ -│ │ -│ Pas d'API Gateway dédié pour le moment — Cilium Gateway API suffit. 
│ -│ │ -└─────────────────────────────────────────────────────────────────────────────┘ -``` +### Décisions validées +- [ ] RPO 1h, RTO 15min +- [ ] AWS eu-west-1 +- [ ] Aiven pour Kafka + PostgreSQL + Valkey +- [ ] Cloudflare pour DNS + WAF + CDN +- [ ] Self-hosted observability +- [ ] ArgoCD centralisé +- [ ] Cilium + Gateway API +- [ ] Kyverno +- [ ] HashiCorp Vault self-hosted --- -# ⚡ **PARTIE X — RESILIENCE & DR** +# 📚 **APPENDIX** -## **10.1 Failure Modes** +## **A. Glossaire** -| Failure | Detection | Recovery | RTO | -|---------|-----------|----------|-----| -| **Pod crash** | Liveness probe | K8s restart | < 30s | -| **Node failure** | Node NotReady | Pod reschedule | < 2min | -| **AZ failure** | Multi-AZ detect | Traffic shift | < 5min | -| **DB primary failure** | Aiven health | Automatic failover | < 5min | -| **Kafka broker failure** | Aiven health | Automatic rebalance | < 2min | -| **Full region failure** | Manual | DR procedure (future) | 4h (target) | - -## **10.2 Backup Strategy** - -| Data | Method | Frequency | Retention | Location | -|------|--------|-----------|-----------|----------| -| **PostgreSQL** | Aiven automated | Hourly | 7 jours | Aiven (cross-AZ) | -| **PostgreSQL PITR** | Aiven WAL | Continuous | 24h | Aiven | -| **Kafka** | Topic retention | N/A | 7 jours | Aiven | -| **Terraform state** | S3 versioning | Every apply | 90 jours | S3 | -| **Git repos** | GitHub | Every push | Infini | GitHub | - -## **10.3 Disaster Recovery (Future)** - -| Scenario | Current | Future (Multi-region) | -|----------|---------|----------------------| -| Single AZ failure | Automatic (multi-AZ) | Automatic | -| Region failure | Manual restore from backup | Automatic failover | -| Data corruption | PITR restore | PITR restore | +→ [docs/GLOSSARY.md](GLOSSARY.md) ---- - -# 🛠️ **PARTIE XI — PLATFORM ENGINEERING** +## **B. 
ADR Index** -## **11.1 Platform Contracts** +| ADR | Titre | Statut | +|-----|-------|--------| +| 001 | Modular Monolith First | Accepted | +| 002 | Aiven Managed Data | Accepted | +| 003 | Cilium over Calico | Accepted | +| ... | ... | ... | -| Contrat | Garantie Platform | Responsabilité Service | -|---------|-------------------|------------------------| -| **Deployment** | Git push → Prod < 15min | Manifests K8s valides | -| **Secrets** | Vault dynamic, rotation auto | Utiliser External-Secrets | -| **Observability** | Auto-collection traces/metrics/logs | Instrumentation OTel | -| **Networking** | mTLS enforced, Gateway API | Déclarer routes dans HTTPRoute | -| **Scaling** | HPA disponible | Configurer requests/limits | -| **Security** | Policies enforced | Passer les policies | +→ [docs/adr/](adr/) -## **11.2 Golden Path (New Service Checklist)** +## **C. Change Management Process** -| Étape | Action | Validation | -|-------|--------|------------| -| 1 | Créer repo depuis template | Structure conforme | -| 2 | Définir protos dans contracts-proto | buf lint pass | -| 3 | Implémenter service | Unit tests > 80% | -| 4 | Configurer K8s manifests | Kyverno policies pass | -| 5 | Configurer External-Secret | Secrets résolus | -| 6 | Ajouter ServiceMonitor | Metrics visibles Grafana | -| 7 | Créer HTTPRoute | Trafic routable | -| 8 | PR review | Merge → Auto-deploy dev | +### Architecture Changes +1. **ADR Required** : Toute décision impactant >1 service +2. **Review** : Platform Team + Tech Lead +3. **Communication** : Slack #platform-updates -## **11.3 On-Call Structure (5 personnes)** +### Breaking Changes +1. RFC obligatoire (`docs/rfc/`) +2. Migration path documenté +3. Annonce 2 sprints avant -| Rôle | Responsabilité | Rotation | -|------|---------------|----------| -| **Primary** | First responder, triage | Weekly | -| **Secondary** | Escalation, expertise | Weekly | -| **Incident Commander** | Coordination si P1 | On-demand | +### Emergency Changes +1. 
Incident Commander approval +2. Post-mortem obligatoire +3. ADR rétroactif sous 48h --- -# 📊 **PARTIE XII — MAPPING TERMINOLOGIE** - -| Terme | Application concrète Local-Plus | -|-------|--------------------------------| -| **Reconciliation loop** | ArgoCD sync, Kyverno background scan | -| **Desired state store** | Git repos | -| **Drift detection** | ArgoCD diff, `terraform plan` scheduled | -| **Blast radius** | Namespace isolation, PDB, Resource Quotas | -| **Tenant isolation** | Vault policies per service, Network Policies | -| **Paved road / Golden path** | Template service, checklist onboarding | -| **Guardrails** | Kyverno policies (not gates) | -| **Ephemeral credentials** | Vault dynamic DB secrets (TTL) | -| **SLI/SLO/SLA** | Prometheus recording rules, Error budgets | -| **Cardinality** | OTel Collector label filtering | -| **Circuit breaker** | Cilium timeout policies | -| **Outbox pattern** | svc-ledger → Kafka transactional | -| **Control plane vs Data plane** | platform-* repos vs svc-* repos | -| **Progressive delivery** | Argo Rollouts (canary) — future | -| **Idempotency** | Idempotency-Key header (SYSTEM_CONTRACT.md) | -| **Pessimistic locking** | SELECT FOR UPDATE (SYSTEM_CONTRACT.md) | -| **Error budget** | 43 min/mois pour 99.9% SLO | -| **MTTR** | Target < 15min (RTO) | -| **Runbook** | docs/runbooks/*.md | -| **Postmortem** | docs/postmortems/*.md (blameless) | -| **APM (Application Performance Monitoring)** | Tempo + Pyroscope + Sentry | -| **Distributed Tracing** | OTel → Tempo, trace_id correlation | -| **Profiling** | Pyroscope (CPU/Memory flame graphs) | -| **Cache-aside pattern** | Valkey lookup, DB fallback, cache on miss | -| **Write-through cache** | Sync write to cache + DB | -| **Cache invalidation** | TTL + Event-driven (Kafka) + Pub/Sub | -| **L1/L2 Cache** | L1=In-memory (pod), L2=Valkey (distributed) | -| **Task Queue** | Dramatiq + Valkey (background jobs) | -| **Dead Letter Queue (DLQ)** | Failed tasks après max retries | 
-| **Exponential Backoff** | Retry avec délai croissant (1s, 2s, 4s...) | -| **Priority Queue** | critical > high > default > low | -| **CronJob** | K8s scheduled tasks (batch, cleanup) | -| **Rate Limiting** | Valkey sliding window counter | -| **Edge Computing** | Cloudflare Workers, CDN edge nodes | -| **WAF (Web Application Firewall)** | Cloudflare WAF, OWASP ruleset | -| **DDoS Protection** | Cloudflare L3/L4/L7 mitigation | -| **CDN (Content Delivery Network)** | Cloudflare CDN, static asset caching | -| **TLS Termination** | Cloudflare edge → Origin mTLS | -| **Zero Trust** | Cloudflare Access, GitHub SSO | -| **Cloudflare Tunnel** | Secure tunnel, no public origin IP | -| **API Gateway / APIM** | À définir — Phase future (AWS API Gateway, Gravitee, Kong) | -| **Bot Score** | Cloudflare bot detection metric | -| **Origin Certificate** | Cloudflare Origin CA (15-year, free) | -| **Private Hosted Zone** | Route53 DNS interne (VPC only) | -| **DNS Failover** | Route53 health checks + backup de Cloudflare | -| **Multi-Cloud** | Architecture déployable sur AWS/GCP/Azure | -| **Cloud-Agnostic** | Composants non liés à un provider spécifique | -| **Cloudflare Tunnel** | Connexion sécurisée sans IP publique origin | -| **Upstream** | Backend service target dans API Gateway | -| **Consumer** | Client API avec credentials (JWT, API Key) | -| **Global Load Balancing** | Cloudflare routing multi-origin/multi-cloud | - ---- +# 📖 **Documentation Index** -# 🚀 **PARTIE XIII — SÉQUENCE DE CONSTRUCTION** - -| Phase | Focus | Livrables | Estimation | -|-------|-------|-----------|------------| -| **1** | Bootstrap Layer 0-1 | IAM, VPC, EKS, Aiven setup (PG, Kafka, Valkey) | 3 semaines | -| **2** | Platform GitOps | ArgoCD, ApplicationSets | 1 semaine | -| **3** | Platform Networking | Cilium, Gateway API | 1 semaine | -| **3b** | Edge & CDN | Cloudflare DNS, WAF, TLS | 1 semaine | -| **4** | Platform Security | Vault, External-Secrets, Kyverno | 2 semaines | -| **5** | 
Platform Observability | OTel, Prometheus, Loki, Tempo, Grafana | 2 semaines | -| **5b** | Platform APM | Pyroscope, Sentry, APM Dashboards | 1 semaine | -| **6** | Platform Cache | Valkey setup, SDK integration | 1 semaine | -| **7** | Contracts | Proto definitions, SDK Python | 1 semaine | -| **8** | svc-ledger | Migrate ton local-plus, full tests | 3 semaines | -| **9** | svc-wallet | Second service, gRPC integration | 2 semaines | -| **10** | Kafka + Outbox | Event-driven patterns | 2 semaines | -| **10b** | Task Queue | Dramatiq setup, background workers | 1 semaine | -| **11** | Testing complet | TNR, Perf, Chaos | 2 semaines | -| **12** | Compliance audit | GDPR, PCI-DSS, SOC2 checks | 2 semaines | -| **13** | Documentation | Runbooks, ADRs, Onboarding | 1 semaine | - -**Total : ~25 semaines** +| Document | Description | Path | +|----------|-------------|------| +| **Bootstrap Guide** | AWS setup, Account Factory | → [bootstrap/BOOTSTRAP-GUIDE.md](bootstrap/BOOTSTRAP-GUIDE.md) | +| **Security Architecture** | Defense in depth, policies | → [security/SECURITY-ARCHITECTURE.md](security/SECURITY-ARCHITECTURE.md) | +| **Observability Guide** | Metrics, logs, traces, APM | → [observability/OBSERVABILITY-GUIDE.md](observability/OBSERVABILITY-GUIDE.md) | +| **Networking Architecture** | VPC, Cloudflare, Gateway API | → [networking/NETWORKING-ARCHITECTURE.md](networking/NETWORKING-ARCHITECTURE.md) | +| **Data Architecture** | PostgreSQL, Kafka, Cache, Queues | → [data/DATA-ARCHITECTURE.md](data/DATA-ARCHITECTURE.md) | +| **Testing Strategy** | Unit, Integration, E2E, Chaos | → [testing/TESTING-STRATEGY.md](testing/TESTING-STRATEGY.md) | +| **Platform Engineering** | Contracts, Golden Path, On-Call | → [platform/PLATFORM-ENGINEERING.md](platform/PLATFORM-ENGINEERING.md) | +| **DR Guide** | Backup, Recovery, Runbooks | → [resilience/DR-GUIDE.md](resilience/DR-GUIDE.md) | +| **Glossary** | Terminologie | → [GLOSSARY.md](GLOSSARY.md) | --- -# ✅ **PARTIE XIV — CHECKLIST 
FINALE** - -## **Avant de commencer :** - -- [ ] Compte AWS créé, billing configuré -- [ ] Compte Aiven créé -- [ ] Compte Cloudflare créé (Free tier) -- [ ] Organisation GitHub créée -- [ ] Décision : HashiCorp Vault self-hosted sur EKS -- [ ] Domaine DNS acquis et transféré vers Cloudflare - -## **Décisions architecturales validées :** - -- [ ] RPO 1h, RTO 15min — OK -- [ ] AWS eu-west-1 — OK -- [ ] Aiven pour Kafka + PostgreSQL + Valkey — OK -- [ ] Cloudflare pour DNS + WAF + CDN — OK -- [ ] API Gateway / APIM — À définir (Phase future) -- [ ] Self-hosted observability — OK -- [ ] ArgoCD centralisé — OK -- [ ] Cilium + Gateway API — OK -- [ ] Kyverno — OK -- [ ] GDPR + PCI-DSS + SOC2 — OK +*Document maintenu par : Platform Team* +*Dernière mise à jour : Janvier 2026* From 09866c26e62a3a5c61a98234cc0a62ac6d3dd7d1 Mon Sep 17 00:00:00 2001 From: NasrLadib Date: Tue, 27 Jan 2026 08:34:26 +0100 Subject: [PATCH 2/6] docs: add comprehensive architecture documentation structure Add detailed documentation across all platform domains: - Add GLOSSARY.md with comprehensive terminology definitions - Add bootstrap/BOOTSTRAP-GUIDE.md for infrastructure setup - Add data/DATA-ARCHITECTURE.md for data layer design - Add networking/NETWORKING-ARCHITECTURE.md for network topology - Add observability/OBSERVABILITY-GUIDE.md for monitoring strategy - Add platform/PLATFORM-ENGINEERING.md for platform capabilities - Add security/SECURITY-ARCHITECTURE.md for security controls - Add resilience/DR-GUIDE.md for disaster recovery procedures - Add testing/TESTING-STRATEGY.md for quality assurance approach --- EntrepriseArchitecture.md | 159 +++--- GLOSSARY.md | 710 ++++++++++++++++++++++++++ bootstrap/BOOTSTRAP-GUIDE.md | 213 ++++++++ data/DATA-ARCHITECTURE.md | 478 +++++++++++++++++ networking/NETWORKING-ARCHITECTURE.md | 469 +++++++++++++++++ observability/OBSERVABILITY-GUIDE.md | 457 +++++++++++++++++ platform/PLATFORM-ENGINEERING.md | 340 ++++++++++++ resilience/DR-GUIDE.md | 341 
+++++++++++++ security/SECURITY-ARCHITECTURE.md | 535 +++++++++++++++++++ testing/TESTING-STRATEGY.md | 473 +++++++++++++++++ 10 files changed, 4113 insertions(+), 62 deletions(-) create mode 100644 GLOSSARY.md create mode 100644 bootstrap/BOOTSTRAP-GUIDE.md create mode 100644 data/DATA-ARCHITECTURE.md create mode 100644 networking/NETWORKING-ARCHITECTURE.md create mode 100644 observability/OBSERVABILITY-GUIDE.md create mode 100644 platform/PLATFORM-ENGINEERING.md create mode 100644 resilience/DR-GUIDE.md create mode 100644 security/SECURITY-ARCHITECTURE.md create mode 100644 testing/TESTING-STRATEGY.md diff --git a/EntrepriseArchitecture.md b/EntrepriseArchitecture.md index a44e64d..d996bac 100644 --- a/EntrepriseArchitecture.md +++ b/EntrepriseArchitecture.md @@ -38,9 +38,9 @@ LOCAL-PLUS est une plateforme de gestion de cartes cadeaux et fidélité, conçu | Standard | Exigences clés | Documentation | |----------|---------------|---------------| -| **GDPR** | Data residency EU, droit à l'oubli | → [docs/compliance/gdpr/](compliance/gdpr/) | -| **PCI-DSS** | Pas de stockage PAN, encryption, audit | → [docs/compliance/pci-dss/](compliance/pci-dss/) | -| **SOC2** | RBAC, monitoring, incident response | → [docs/compliance/soc2/](compliance/soc2/) | +| **GDPR** | Data residency EU, droit à l'oubli | → [compliance/gdpr/](compliance/gdpr/) | +| **PCI-DSS** | Pas de stockage PAN, encryption, audit | → [compliance/pci-dss/](compliance/pci-dss/) | +| **SOC2** | RBAC, monitoring, incident response | → [compliance/soc2/](compliance/soc2/) | ## **1.4 Tech Stack Overview** @@ -207,9 +207,9 @@ LOCAL-PLUS est une plateforme de gestion de cartes cadeaux et fidélité, conçu | **staging** | localplus-staging | eks-staging | Manual | | **prod** | localplus-prod | eks-prod | Manual + Approval | -## **3.4 CI/CD** +## **3.4 CI/CD & Bootstrap** -→ **Documentation détaillée** : [docs/bootstrap/BOOTSTRAP-GUIDE.md](bootstrap/BOOTSTRAP-GUIDE.md) +→ **Documentation détaillée** : 
[bootstrap/BOOTSTRAP-GUIDE.md](bootstrap/BOOTSTRAP-GUIDE.md) --- @@ -239,43 +239,50 @@ LOCAL-PLUS est une plateforme de gestion de cartes cadeaux et fidélité, conçu ## **4.3 Repository Index** +> **Note** : Les repos ci-dessous sont la structure cible. Chaque repo aura son propre README. + ### Tier 0 — Foundation -| Repo | Description | README | -|------|-------------|--------| -| `bootstrap/` | AWS Landing Zone, Control Tower, Account Factory | → [bootstrap/README.md](../bootstrap/README.md) | + +| Repo | Description | +|------|-------------| +| `bootstrap/` | AWS Landing Zone, Control Tower, Account Factory | ### Tier 1 — Platform -| Repo | Description | README | -|------|-------------|--------| -| `platform-gitops/` | ArgoCD, ApplicationSets | → [platform-gitops/README.md](../platform-gitops/README.md) | -| `platform-networking/` | Cilium, Gateway API | → [platform-networking/README.md](../platform-networking/README.md) | -| `platform-observability/` | OTel, Prometheus, Loki, Tempo, Grafana | → [platform-observability/README.md](../platform-observability/README.md) | -| `platform-security/` | Vault, External-Secrets, Kyverno | → [platform-security/README.md](../platform-security/README.md) | -| `platform-cache/` | Valkey configuration, SDK | → [platform-cache/README.md](../platform-cache/README.md) | -| `platform-gateway/` | APISIX (future), Cloudflare config | → [platform-gateway/README.md](../platform-gateway/README.md) | -| `platform-application-provis/` | Terraform modules (DB, Kafka, Cache, EKS) | → [platform-application-provis/README.md](../platform-application-provis/README.md) | + +| Repo | Description | +|------|-------------| +| `platform-gitops/` | ArgoCD, ApplicationSets | +| `platform-networking/` | Cilium, Gateway API | +| `platform-observability/` | OTel, Prometheus, Loki, Tempo, Grafana | +| `platform-security/` | Vault, External-Secrets, Kyverno | +| `platform-cache/` | Valkey configuration, SDK | +| `platform-gateway/` | APISIX (future), 
Cloudflare config | +| `platform-application-provis/` | Terraform modules (DB, Kafka, Cache, EKS) | ### Tier 2 — Contracts -| Repo | Description | README | -|------|-------------|--------| -| `contracts-proto/` | Protobuf definitions | → [contracts-proto/README.md](../contracts-proto/README.md) | -| `sdk-python/` | Python SDK (clients, telemetry) | → [sdk-python/README.md](../sdk-python/README.md) | -| `sdk-go/` | Go SDK | → [sdk-go/README.md](../sdk-go/README.md) | + +| Repo | Description | +|------|-------------| +| `contracts-proto/` | Protobuf definitions | +| `sdk-python/` | Python SDK (clients, telemetry) | +| `sdk-go/` | Go SDK | ### Tier 3 — Domain Services -| Repo | Description | README | -|------|-------------|--------| -| `svc-ledger/` | Earn/Burn transactions | → [svc-ledger/README.md](../svc-ledger/README.md) | -| `svc-wallet/` | Balance queries | → [svc-wallet/README.md](../svc-wallet/README.md) | -| `svc-merchant/` | Merchant onboarding | → [svc-merchant/README.md](../svc-merchant/README.md) | -| `svc-giftcard/` | Gift card catalog | → [svc-giftcard/README.md](../svc-giftcard/README.md) | -| `svc-notification/` | Notifications (Kafka consumer) | → [svc-notification/README.md](../svc-notification/README.md) | + +| Repo | Description | +|------|-------------| +| `svc-ledger/` | Earn/Burn transactions | +| `svc-wallet/` | Balance queries | +| `svc-merchant/` | Merchant onboarding | +| `svc-giftcard/` | Gift card catalog | +| `svc-notification/` | Notifications (Kafka consumer) | ### Tier 4 — Quality -| Repo | Description | README | -|------|-------------|--------| -| `e2e-scenarios/` | Playwright E2E tests | → [e2e-scenarios/README.md](../e2e-scenarios/README.md) | -| `chaos-experiments/` | Litmus chaos tests | → [chaos-experiments/README.md](../chaos-experiments/README.md) | + +| Repo | Description | +|------|-------------| +| `e2e-scenarios/` | Playwright E2E tests | +| `chaos-experiments/` | Chaos Mesh experiments | --- @@ -288,25 +295,25 @@ 
LOCAL-PLUS est une plateforme de gestion de cartes cadeaux et fidélité, conçu | Layer | Composant | Protection | |-------|-----------|------------| | **Edge** | Cloudflare | WAF, DDoS, Bot protection | -| **Gateway** | APISIX (future) | JWT, Rate limiting | +| **Gateway** | Cilium Gateway API | TLS, routing | | **Network** | Cilium | NetworkPolicies, default deny | | **Identity** | IRSA + Vault | Dynamic secrets, mTLS | | **Workload** | Kyverno | Pod security, image signing | | **Data** | KMS + Aiven | Encryption at rest/transit | -→ **Documentation détaillée** : [docs/security/SECURITY-ARCHITECTURE.md](security/SECURITY-ARCHITECTURE.md) +→ **Documentation détaillée** : [security/SECURITY-ARCHITECTURE.md](security/SECURITY-ARCHITECTURE.md) ## **5.2 Observability Baseline** | Signal | Outil | Retention | Coût | |--------|-------|-----------|------| -| **Metrics** | Prometheus + Thanos | 15j local, 1an S3 | ~5€/mois | +| **Metrics** | Prometheus + Remote Write S3 | 15j local, 1an S3 | ~5€/mois | | **Logs** | Loki | 30 jours (GDPR) | Self-hosted | | **Traces** | Tempo | 7 jours | Self-hosted | | **Profiling** | Pyroscope | 7 jours | Self-hosted | | **Errors** | Sentry (self-hosted) | 30 jours | Self-hosted | -→ **Documentation détaillée** : [docs/observability/OBSERVABILITY-GUIDE.md](observability/OBSERVABILITY-GUIDE.md) +→ **Documentation détaillée** : [observability/OBSERVABILITY-GUIDE.md](observability/OBSERVABILITY-GUIDE.md) ## **5.3 Networking Baseline** @@ -317,7 +324,7 @@ LOCAL-PLUS est une plateforme de gestion de cartes cadeaux et fidélité, conçu | **VPC Peering** | Aiven connectivity | Private, no internet | | **Route53** | Private DNS, backup | Internal zones | -→ **Documentation détaillée** : [docs/networking/NETWORKING-ARCHITECTURE.md](networking/NETWORKING-ARCHITECTURE.md) +→ **Documentation détaillée** : [networking/NETWORKING-ARCHITECTURE.md](networking/NETWORKING-ARCHITECTURE.md) ## **5.4 Data Baseline** @@ -329,13 +336,41 @@ LOCAL-PLUS est une 
plateforme de gestion de cartes cadeaux et fidélité, conçu **Règle d'or** : 1 table = 1 owner. Cross-service = gRPC ou Events, jamais JOIN. -→ **Documentation détaillée** : [docs/data/DATA-ARCHITECTURE.md](data/DATA-ARCHITECTURE.md) +→ **Documentation détaillée** : [data/DATA-ARCHITECTURE.md](data/DATA-ARCHITECTURE.md) + +--- + +# 🧪 **PARTIE VI — TESTING & QUALITY** + +## **6.1 Test Pyramid** + +| Layer | Types de tests | Fréquence | +|-------|----------------|-----------| +| **Base** | Static analysis, Linting | Pre-commit | +| **Unit** | Domain logic, Use cases | PR | +| **Integration** | DB, Kafka, Cache (Testcontainers) | PR | +| **Contract** | API contracts (Pact, gRPC) | PR | +| **E2E** | Critical paths (Playwright) | Nightly | +| **Performance** | Load, Stress, Soak (k6) | Nightly/Weekly | +| **Chaos** | Failure injection (Chaos Mesh) | Weekly | + +## **6.2 Performance Targets** + +| Métrique | Target | Alerte | +|----------|--------|--------| +| **Latency P50** | < 50ms | > 100ms | +| **Latency P95** | < 100ms | > 200ms | +| **Latency P99** | < 200ms | > 500ms | +| **Error Rate** | < 0.1% | > 1% | +| **Throughput** | > 500 TPS | < 400 TPS | + +→ **Documentation détaillée** : [testing/TESTING-STRATEGY.md](testing/TESTING-STRATEGY.md) --- -# ⚡ **PARTIE VI — RESILIENCE & DR** +# ⚡ **PARTIE VII — RESILIENCE & DR** -## **6.1 Failure Modes** +## **7.1 Failure Modes** | Failure | Detection | Recovery | RTO | |---------|-----------|----------|-----| @@ -346,7 +381,7 @@ LOCAL-PLUS est une plateforme de gestion de cartes cadeaux et fidélité, conçu | Kafka broker failure | Aiven health | Automatic rebalance | < 2min | | Full region failure | Manual | DR procedure | 4h (target) | -## **6.2 Backup Strategy** +## **7.2 Backup Strategy** | Data | Method | Frequency | Retention | |------|--------|-----------|-----------| @@ -355,13 +390,13 @@ LOCAL-PLUS est une plateforme de gestion de cartes cadeaux et fidélité, conçu | Kafka | Topic retention | N/A | 7 jours | | 
Terraform state | S3 versioning | Every apply | 90 jours | -→ **Documentation détaillée** : [docs/resilience/DR-GUIDE.md](resilience/DR-GUIDE.md) +→ **Documentation détaillée** : [resilience/DR-GUIDE.md](resilience/DR-GUIDE.md) --- -# 🛠️ **PARTIE VII — PLATFORM CONTRACTS** +# 🛠️ **PARTIE VIII — PLATFORM CONTRACTS** -## **7.1 Golden Path (New Service Checklist)** +## **8.1 Golden Path (New Service Checklist)** | Étape | Action | Validation | |-------|--------|------------| @@ -374,7 +409,7 @@ LOCAL-PLUS est une plateforme de gestion de cartes cadeaux et fidélité, conçu | 7 | Créer HTTPRoute | Trafic routable | | 8 | PR review | Merge → Auto-deploy dev | -## **7.2 SLI/SLO/Error Budgets** +## **8.2 SLI/SLO/Error Budgets** | Service | SLI | SLO | Error Budget | |---------|-----|-----|--------------| @@ -383,7 +418,7 @@ LOCAL-PLUS est une plateforme de gestion de cartes cadeaux et fidélité, conçu | **svc-wallet** | Availability | 99.9% | 43 min/mois | | **Platform** | Availability | 99.5% | 3.6h/mois | -## **7.3 On-Call Structure** +## **8.3 On-Call Structure** | Rôle | Responsabilité | Rotation | |------|---------------|----------| @@ -391,13 +426,13 @@ LOCAL-PLUS est une plateforme de gestion de cartes cadeaux et fidélité, conçu | **Secondary** | Escalation, expertise | Weekly | | **Incident Commander** | Coordination si P1 | On-demand | -→ **Documentation détaillée** : [docs/platform/PLATFORM-ENGINEERING.md](platform/PLATFORM-ENGINEERING.md) +→ **Documentation détaillée** : [platform/PLATFORM-ENGINEERING.md](platform/PLATFORM-ENGINEERING.md) --- -# 🚀 **PARTIE VIII — ROADMAP** +# 🚀 **PARTIE IX — ROADMAP** -## **8.1 Séquence de Construction** +## **9.1 Séquence de Construction** | Phase | Focus | Estimation | |-------|-------|------------| @@ -420,7 +455,7 @@ LOCAL-PLUS est une plateforme de gestion de cartes cadeaux et fidélité, conçu **Total estimé : ~25 semaines** -## **8.2 Checklist avant démarrage** +## **9.2 Checklist avant démarrage** ### Comptes & Accès - [ ] 
Compte AWS créé, billing configuré @@ -446,7 +481,7 @@ LOCAL-PLUS est une plateforme de gestion de cartes cadeaux et fidélité, conçu ## **A. Glossaire** -→ [docs/GLOSSARY.md](GLOSSARY.md) +→ [GLOSSARY.md](GLOSSARY.md) ## **B. ADR Index** @@ -457,7 +492,7 @@ LOCAL-PLUS est une plateforme de gestion de cartes cadeaux et fidélité, conçu | 003 | Cilium over Calico | Accepted | | ... | ... | ... | -→ [docs/adr/](adr/) +→ [adr/](adr/) ## **C. Change Management Process** @@ -482,15 +517,15 @@ LOCAL-PLUS est une plateforme de gestion de cartes cadeaux et fidélité, conçu | Document | Description | Path | |----------|-------------|------| -| **Bootstrap Guide** | AWS setup, Account Factory | → [bootstrap/BOOTSTRAP-GUIDE.md](bootstrap/BOOTSTRAP-GUIDE.md) | -| **Security Architecture** | Defense in depth, policies | → [security/SECURITY-ARCHITECTURE.md](security/SECURITY-ARCHITECTURE.md) | -| **Observability Guide** | Metrics, logs, traces, APM | → [observability/OBSERVABILITY-GUIDE.md](observability/OBSERVABILITY-GUIDE.md) | -| **Networking Architecture** | VPC, Cloudflare, Gateway API | → [networking/NETWORKING-ARCHITECTURE.md](networking/NETWORKING-ARCHITECTURE.md) | -| **Data Architecture** | PostgreSQL, Kafka, Cache, Queues | → [data/DATA-ARCHITECTURE.md](data/DATA-ARCHITECTURE.md) | -| **Testing Strategy** | Unit, Integration, E2E, Chaos | → [testing/TESTING-STRATEGY.md](testing/TESTING-STRATEGY.md) | -| **Platform Engineering** | Contracts, Golden Path, On-Call | → [platform/PLATFORM-ENGINEERING.md](platform/PLATFORM-ENGINEERING.md) | -| **DR Guide** | Backup, Recovery, Runbooks | → [resilience/DR-GUIDE.md](resilience/DR-GUIDE.md) | -| **Glossary** | Terminologie | → [GLOSSARY.md](GLOSSARY.md) | +| **Bootstrap Guide** | AWS setup, Account Factory | [bootstrap/BOOTSTRAP-GUIDE.md](bootstrap/BOOTSTRAP-GUIDE.md) | +| **Security Architecture** | Defense in depth, IAM, PAM, Vault | [security/SECURITY-ARCHITECTURE.md](security/SECURITY-ARCHITECTURE.md) | +| **Observability 
Guide** | Metrics, logs, traces, APM, dashboards | [observability/OBSERVABILITY-GUIDE.md](observability/OBSERVABILITY-GUIDE.md) | +| **Networking Architecture** | VPC, Cloudflare, Gateway API, DNS | [networking/NETWORKING-ARCHITECTURE.md](networking/NETWORKING-ARCHITECTURE.md) | +| **Data Architecture** | PostgreSQL, Kafka, Cache, Queues | [data/DATA-ARCHITECTURE.md](data/DATA-ARCHITECTURE.md) | +| **Testing Strategy** | Pyramide, Unit, Integration, Performance, Chaos | [testing/TESTING-STRATEGY.md](testing/TESTING-STRATEGY.md) | +| **Platform Engineering** | Contracts, Golden Path, On-Call, CI/CD | [platform/PLATFORM-ENGINEERING.md](platform/PLATFORM-ENGINEERING.md) | +| **DR Guide** | Backup, Recovery, Chaos Engineering | [resilience/DR-GUIDE.md](resilience/DR-GUIDE.md) | +| **Glossary** | Terminologie complète | [GLOSSARY.md](GLOSSARY.md) | --- diff --git a/GLOSSARY.md b/GLOSSARY.md new file mode 100644 index 0000000..84baf89 --- /dev/null +++ b/GLOSSARY.md @@ -0,0 +1,710 @@ +# 📖 **Glossary** +## *LOCAL-PLUS Platform Terminology* + +--- + +> **Retour vers** : [Architecture Overview](EntrepriseArchitecture.md) + +--- + +# 🧩 **1. 
Core Software Architecture Terms** + +| Term | Definition | +|------|------------| +| **Monolith** | Single deployable unit containing all application functionality | +| **Microservices** | Architecture where application is composed of small, independent services | +| **Service boundaries** | Clear interfaces and responsibilities defining where one service ends and another begins | +| **Tight coupling** | Strong dependencies between components making them hard to change independently | +| **Loose coupling** | Minimal dependencies between components allowing independent evolution | +| **Cohesion** | Degree to which elements of a module belong together | +| **Separation of concerns** | Design principle for separating a program into distinct sections | +| **Scalability (vertical)** | Adding more power to existing machines (scale up) | +| **Scalability (horizontal)** | Adding more machines to the pool (scale out) | +| **Fault tolerance** | System's ability to continue operating when components fail | +| **Resilience** | System's ability to recover from failures and continue to function | +| **High availability** | System designed to be operational for a high percentage of time | +| **Latency budget** | Maximum acceptable delay for an operation across the system | +| **Throughput** | Number of operations a system can handle per unit of time | +| **Concurrency** | Multiple computations executing during overlapping time periods | +| **Rate limiting** | Controlling the rate of requests to protect system resources | +| **Backpressure** | Mechanism to resist and control upstream load when overwhelmed | +| **Stateless** | Component that doesn't retain client state between requests | +| **Stateful** | Component that maintains state across requests | +| **Idempotency** | Operation that produces same result regardless of how many times executed | +| **Eventual consistency** | Data will become consistent across replicas given enough time | +| **Strong consistency** | All nodes 
see the same data at the same time | +| **CAP theorem** | Distributed system can only provide 2 of 3: Consistency, Availability, Partition tolerance | +| **Data locality** | Keeping data close to where it's processed | +| **ACID** | Atomicity, Consistency, Isolation, Durability — transaction guarantees | +| **BASE** | Basically Available, Soft state, Eventually consistent — alternative to ACID | +| **CQRS** | Command Query Responsibility Segregation — separate read and write models | +| **Retry + Exponential backoff** | Retry failed operations with increasing delays | +| **Circuit breaker** | Pattern to prevent cascading failures by failing fast | +| **Bulkhead isolation** | Isolating components to prevent failure propagation | +| **Canary deployment** | Gradual rollout to a subset of users before full deployment | +| **Blue/Green deployment** | Two identical environments, switch traffic between them | +| **Progressive delivery** | Gradual rollout with automated checks and rollback | +| **Feature flags** | Toggles to enable/disable features without deployment | + +--- + +# 🚢 **2. 
DevOps Core Concepts** + +| Term | Definition | +|------|------------| +| **CI/CD** | Continuous Integration / Continuous Delivery — automated build, test, deploy | +| **Fail-Fast** | Design principle to detect and report failures immediately | +| **Deployment pipeline** | Automated sequence of stages from code to production | +| **GitOps** | Infrastructure and application management using Git as source of truth | +| **Pull-based delivery** | Agents pull desired state from Git (vs push-based) | +| **Infrastructure as Code (IaC)** | Managing infrastructure through code rather than manual processes | +| **Configuration drift** | Divergence between actual and intended configuration state | +| **Desired state vs actual state** | What should be vs what currently is | +| **Convergence loop** | Process that continuously moves actual state toward desired state | +| **Immutability** | Resources are replaced rather than modified | +| **Artifact registry** | Repository for storing build artifacts (images, packages) | +| **Environment parity** | Keeping dev, staging, prod as similar as possible | +| **Supply chain security** | Protecting the software delivery pipeline from attacks | +| **Build reproducibility** | Ability to recreate identical builds from same inputs | +| **Trunk-based development** | All developers work on a single branch (main/trunk) | +| **Shift left** | Moving testing and security earlier in the development process | +| **Continuous compliance** | Automated compliance checks integrated into pipeline | +| **Golden pipeline** | Standardized, pre-approved CI/CD pipeline | +| **Self-service delivery** | Teams can deploy without manual intervention | +| **Release automation** | Automated release process with minimal human intervention | +| **Promotion** | Moving artifacts from one environment to the next | + +--- + +# 🛠️ **3. 
Platform Engineering Vocabulary** + +| Term | Definition | +|------|------------| +| **Paved road** | Recommended path that's easy to follow and well-supported | +| **Golden path** | Opinionated, supported way to accomplish common tasks | +| **Developer experience (DevEx)** | Quality of developers' interactions with tools and processes | +| **Self-service portals** | Interfaces for teams to provision resources without tickets | +| **Platform boundaries** | Clear interfaces between platform and application teams | +| **Internal Developer Platform (IDP)** | Set of tools and services that enable self-service | +| **Tenant isolation** | Separation of resources between different users/teams | +| **Blast radius** | Scope of impact when something fails | +| **Multi-tenancy** | Single instance serving multiple isolated tenants | +| **Platform contracts** | Agreements about what the platform provides and expects | +| **Declarative everything** | Describing what you want, not how to achieve it | +| **Reconciliation loop** | Controller pattern that continuously aligns actual with desired state | +| **Policy as Code** | Expressing policies in code for automated enforcement | +| **Control plane vs data plane** | Management layer vs traffic/data processing layer | +| **Standardization** | Consistent patterns across the organization | +| **Opinionated defaults** | Pre-configured choices that work for most cases | +| **Guardrails** | Constraints that guide without blocking | +| **Drift detection** | Identifying when actual state differs from desired | +| **Day-2 operations** | Ongoing operations after initial deployment | +| **Platform lifecycle** | Stages from creation through deprecation | +| **Operational excellence** | Running workloads effectively and gaining insights | +| **Infra product thinking** | Treating infrastructure as a product with users | + +--- + +# 🐳 **4. 
Container & Kubernetes Terminology** + +| Term | Definition | +|------|------------| +| **Control plane** | Components that manage the cluster (API server, scheduler, etc.) | +| **Data plane** | Worker nodes where application workloads run | +| **Pod** | Smallest deployable unit in Kubernetes, one or more containers | +| **Deployment** | Declarative updates for Pods and ReplicaSets | +| **StatefulSet** | Manages stateful applications with stable identities | +| **DaemonSet** | Ensures a Pod runs on all (or some) nodes | +| **Service** | Abstract way to expose an application running on Pods | +| **Ingress** | API object managing external access to services | +| **Gateway API** | Next-generation Ingress, more expressive routing | +| **CRD (Custom Resource Definition)** | Extends Kubernetes API with custom resources | +| **Operator** | Controller that manages complex applications using CRDs | +| **Controller** | Control loop that watches state and makes changes | +| **Reconciliation loop** | Controller pattern comparing desired vs actual state | +| **Desired state store (etcd)** | Key-value store holding cluster state | +| **Horizontal Pod Autoscaler** | Scales Pods based on CPU/memory or custom metrics | +| **Vertical Pod Autoscaler** | Adjusts resource requests/limits automatically | +| **KEDA** | Kubernetes Event-Driven Autoscaling | +| **Knative** | Platform for serverless workloads on Kubernetes | +| **Service mesh** | Infrastructure layer for service-to-service communication | +| **Admission controller** | Intercepts requests before persistence | +| **Mutating webhook** | Modifies resources during admission | +| **Validating webhook** | Rejects invalid resources during admission | +| **Secrets** | Objects for sensitive data (passwords, tokens) | +| **ConfigMaps** | Objects for non-sensitive configuration data | +| **Namespace tenancy** | Using namespaces to isolate workloads | +| **Sidecar pattern** | Helper container running alongside main container | +| **Init 
containers** | Containers that run before app containers start | +| **Pod disruption budget** | Limits voluntary disruptions to Pods | +| **Resource requests vs limits** | Minimum guaranteed vs maximum allowed resources | +| **OOMKilled / throttling** | Container killed for memory / slowed for CPU | +| **Node pool** | Group of nodes with same configuration | +| **Taints/Tolerations** | Mechanism to repel/accept Pods on nodes | +| **Affinity rules** | Scheduling preferences for Pod placement | +| **kro** | Kubernetes Resource Orchestrator | + +--- + +# 🔄 **5. GitOps Deep Vocabulary** + +| Term | Definition | +|------|------------| +| **Declarative manifests** | YAML/JSON files describing desired state | +| **Single source of truth** | Git as the authoritative source for system state | +| **Drift** | When actual state differs from Git-defined state | +| **Convergence** | Process of moving actual state toward desired state | +| **Pull reconciliation** | Agent pulls changes from Git (vs push deployment) | +| **Progressive sync** | Gradual application of changes with health checks | +| **Rollback via Git revert** | Undoing changes by reverting Git commits | +| **Commit-driven deployments** | Deployments triggered by Git commits | +| **Audit trail** | Git history as immutable record of all changes | +| **Policy enforcement** | Automated checks before sync | +| **Drift remediation** | Automatic correction of drift | +| **Secret sealing** | Encrypting secrets for safe Git storage (ex: Sealed Secrets, pas SOPS) | +| **Environments as branches** | Different branches for different environments | +| **Kustomize overlays** | Environment-specific customizations | + +--- + +# ☁️ **6. 
Cloud Architecture Concepts** + +| Term | Definition | +|------|------------| +| **Shared responsibility model** | Division of security responsibilities between cloud and customer | +| **Multi-AZ** | Deployment across multiple Availability Zones | +| **Multi-region** | Deployment across multiple geographic regions | +| **Zonal vs regional resources** | Resources in one zone vs replicated across zones | +| **Edge caching** | Caching content at edge locations near users | +| **Network peering** | Direct network connection between VPCs | +| **Private service connect** | Private connectivity to managed services | +| **NAT gateway** | Network address translation for outbound traffic | +| **Egress costs** | Charges for data leaving cloud provider | +| **Ingress filtering** | Controlling inbound traffic | +| **Cloud IAM** | Cloud Identity and Access Management | +| **Workload identity federation** | Federating external identities with cloud IAM | +| **Service accounts** | Identity for non-human principals | +| **Service perimeter** | Boundary controlling access to resources | +| **Threat modeling** | Systematic analysis of potential threats | +| **Cloud Armor / WAF** | Web Application Firewall services | +| **Autoscaling** | Automatic adjustment of resources based on demand | +| **Rehydration** | Recreating immutable resources from scratch | +| **Blue/Green infra provisioning** | Two environments for zero-downtime infrastructure changes | +| **PAM** | Privileged Access Management | + +--- + +# 🔧 **7. Infrastructure as Code Vocabulary** + +## Terraform-specific + +| Term | Definition | +|------|------------| +| **Providers** | Plugins that interact with APIs (AWS, GCP, etc.) 
| +| **Resources** | Infrastructure components managed by Terraform | +| **Data sources** | Read-only queries to existing resources | +| **Modules** | Reusable, encapsulated Terraform configurations | +| **State** | Record of resources Terraform manages | +| **State locking** | Preventing concurrent state modifications | +| **Workspaces** | Separate state files for different environments | +| **Drift** | Difference between state and actual infrastructure | +| **Lifecycle ignore_changes** | Ignoring specific attribute changes | +| **Outputs** | Values exported from modules | +| **Variable validation** | Rules for valid variable values | +| **Sentinel** | HashiCorp's policy as code framework | + +## Platform IaC + +| Term | Definition | +|------|------------| +| **Composability** | Building complex systems from simpler parts | +| **Reusable patterns** | Standardized infrastructure blueprints | +| **Module registries** | Centralized storage for shared modules | +| **Abstraction leaks** | When implementation details break through abstractions | +| **Snowflake infrastructure** | Unique, non-reproducible configurations | + +--- + +# 🧮 **8. 
Observability (SRE Vocabulary)** + +## Three Pillars + Modern Additions + +| Term | Definition | +|------|------------| +| **Logs** | Time-stamped records of discrete events | +| **Metrics** | Numeric measurements aggregated over time | +| **Traces** | Records of request paths through distributed systems | +| **Profiles** | CPU/memory usage patterns over time | +| **Events** | Significant occurrences in the system | +| **Span attributes** | Metadata attached to trace spans | +| **Telemetry pipelines** | Collection, processing, and routing of telemetry | + +## Methods & Signals + +| Term | Definition | +|------|------------| +| **RED metrics** | Rate, Errors, Duration — for services | +| **USE method** | Utilization, Saturation, Errors — for resources | +| **Golden signals** | Latency, Traffic, Errors, Saturation | +| **Histogram buckets** | Distribution of values in ranges | +| **Sampling** | Recording only a subset of data | +| **Correlation IDs** | Identifiers linking related events | +| **Distributed tracing** | Following requests across service boundaries | +| **Log enrichment** | Adding context to log entries | +| **Span propagation** | Passing trace context between services | +| **Telemetry context** | Shared context for correlated telemetry | +| **P50/P95/P99 Latency** | Percentile latency measurements | + +## Advanced Observability + +| Term | Definition | +|------|------------| +| **Cardinality** | Number of unique label combinations | +| **Dimensionality** | Number of labels/attributes | +| **Retention policies** | Rules for how long data is kept | +| **Aggregation windows** | Time periods for aggregating data | +| **Exemplars** | Links from metrics to specific traces | +| **Structured logs (JSON)** | Machine-parseable log format | +| **High-cardinality labels** | Labels with many unique values (avoid!) 
| +| **Traceparent / tracestate** | W3C trace context headers | +| **Baggage propagation** | Passing custom context through requests | +| **Span links** | Connecting related but non-parent spans | +| **Tail-based sampling** | Sampling based on complete trace | +| **Head-based sampling** | Sampling decision at trace start | +| **Adaptive sampling** | Dynamic sampling based on conditions | +| **Context propagation** | Passing trace context between services | +| **Semantic conventions** | OpenTelemetry standard naming | +| **Continuous profiling** | Always-on performance profiling | +| **Flamegraphs** | Visualization of call stacks and time | +| **Log correlation** | Linking logs to traces and metrics | + +## Alerting & Incidents + +| Term | Definition | +|------|------------| +| **Alert fatigue** | Desensitization from too many alerts | +| **Multi-window burn rates** | Error budget consumption over multiple time windows | +| **Error budgets** | Allowable unreliability before action required | +| **Burn-rate alerts** | Alerts based on error budget consumption speed | +| **SLO/SLA/SLI** | Objective/Agreement/Indicator for service levels | +| **Availability vs reliability** | Uptime vs consistent correct behavior | +| **Thundering herd** | Many clients retrying simultaneously | +| **Retry storms** | Cascading retries overwhelming systems | +| **Cascading failures** | Failure spreading through dependencies | +| **Deadman's switch** | Alert when expected signal is absent | +| **Synthetic monitoring** | Artificial requests to test availability | +| **Service dependency graphs** | Visualization of service relationships | +| **Load shedding** | Dropping requests to protect system | +| **Health probes** | Liveness, readiness, startup checks | +| **Blameless postmortems** | Learning from incidents without blame | +| **MTTR/MTTA/MTBF/MTTD** | Mean Time To Recovery/Acknowledge/Between Failures/Detect | +| **Alert silencing** | Temporarily suppressing alerts | +| **Dead-letter 
queues (DLQ)** | Queue for failed messages |
+| **Observability debt** | Accumulated lack of observability |
+
+## Prometheus Metric Types
+
+| Type | Description | Usage | Exemple |
+|------|-------------|-------|---------|
+| **Counter** | Valeur qui ne fait qu'augmenter (jamais diminuer) | Comptage d'événements cumulatifs | `http_requests_total`, `errors_total` |
+| **Gauge** | Valeur qui peut monter ET descendre | Valeurs instantanées | `temperature`, `queue_size`, `active_connections` |
+| **Histogram** | Distribution de valeurs dans des buckets prédéfinis | Latences, tailles de requêtes | `http_request_duration_seconds` |
+| **Summary** | Comme Histogram mais calcule les percentiles côté client | Percentiles précis (mais plus coûteux) | `request_latency` |
+
+### Counter vs Gauge
+
+```
+Counter (cumulatif):        Gauge (instantané):
+    ▲                           ▲
+ 100│        ●               50│    ●       ●
+  80│      ●                 40│  ●    ●
+  60│    ●                   30│●        ●
+  40│  ●                     20│      ●
+  20│●                       10│
+    └────────►                 └────────►
+        time                       time
+```
+
+### Histogram Buckets
+
+```
+http_request_duration_seconds_bucket{le="0.1"}  → Requests ≤ 100ms
+http_request_duration_seconds_bucket{le="0.5"}  → Requests ≤ 500ms
+http_request_duration_seconds_bucket{le="1.0"}  → Requests ≤ 1s
+http_request_duration_seconds_bucket{le="+Inf"} → All requests (total)
+
+Calcul P99 (agrégé sur toutes les instances) :
+histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
+```
+
+### Quand utiliser quoi ?
+
+| Besoin | Type | Pourquoi |
+|--------|------|----------|
+| Comptage d'événements | Counter | Ne fait qu'augmenter, rate() pour débit |
+| Valeur actuelle | Gauge | Peut monter/descendre |
+| Latences (P50, P95, P99) | Histogram | Buckets permettent percentiles |
+| Taille de queue | Gauge | Valeur instantanée |
+| Nombre de requêtes | Counter | Cumulatif, rate() pour RPS |
+
+---
+
+# 🔥 **9. 
Reliability Engineering Vocabulary** + +| Term | Definition | +|------|------------| +| **SLO (Service Level Objective)** | Target reliability level | +| **SLI (Service Level Indicator)** | Metric measuring service behavior | +| **SLA (Service Level Agreement)** | Contractual reliability commitment | +| **Error budget** | Allowable unreliability (100% - SLO) | +| **Budget burn** | Rate of error budget consumption | +| **Reliability targets** | Goals for system reliability | +| **Failure domains** | Scope where failures are isolated | +| **Blast radius** | Impact area of a failure | +| **Incident commander** | Person coordinating incident response | +| **Postmortem (blameless)** | Analysis of incidents without blame | +| **MTTR** | Mean Time To Recovery | +| **MTTD** | Mean Time To Detection | +| **MTTF** | Mean Time To Failure | +| **Runbook** | Step-by-step guide for operational tasks | +| **Playbook** | Guide for responding to specific scenarios | +| **On-call rotation** | Schedule for incident response duty | +| **Escalation path** | Chain for escalating issues | +| **Severity levels** | Categories of incident impact (SEV-1, SEV-2...) | + +--- + +# 🔐 **10. 
Security Terminology** + +| Term | Definition | +|------|------------| +| **Zero Trust** | Never trust, always verify | +| **Principle of least privilege** | Grant minimum necessary access | +| **RBAC** | Role-Based Access Control | +| **ABAC** | Attribute-Based Access Control | +| **Ephemeral credentials** | Short-lived, automatically rotated credentials | +| **Dynamic secrets** | Secrets generated on-demand with TTL | +| **Secret rotation** | Regular replacement of credentials | +| **Time-bound access** | Access that expires automatically | +| **Vault Agent** | Sidecar for secret injection | +| **Token minting** | Creating authentication tokens | +| **Policy boundaries** | Limits on what policies can grant | +| **Just-in-time access** | Access granted only when needed | +| **SBOM** | Software Bill of Materials | +| **Supply chain attacks** | Compromising software delivery pipeline | +| **Secret scanning** | Detecting exposed credentials | +| **Threat modeling** | Systematic security analysis | +| **Attack surface** | All points where attacker could enter | +| **Posture management** | Continuous security state assessment | +| **Vulnerability hygiene** | Keeping systems patched and secure | + +--- + +# 🧵 **11. 
Networking Vocabulary** + +## Core Networking + +| Term | Definition | +|------|------------| +| **CIDR** | Classless Inter-Domain Routing notation | +| **Subnets** | Logical subdivisions of a network | +| **VPC peering** | Direct connection between VPCs | +| **VPC Service Controls** | Perimeter around GCP resources | +| **Route table** | Rules for directing network traffic | +| **NAT gateway** | Network Address Translation for outbound traffic | +| **Public vs private subnet** | Internet-accessible vs internal-only | +| **Load balancer (L4 vs L7)** | Transport vs application layer balancing | +| **Reverse proxy** | Proxy that handles client requests for backend servers | +| **TLS termination** | Decrypting TLS at a proxy/load balancer | +| **mTLS** | Mutual TLS — both sides authenticate | +| **VPN tunnels** | Encrypted connections over public networks | +| **Egress control** | Controlling outbound traffic | +| **DNS resolution** | Translating names to IP addresses | +| **Split-horizon DNS** | Different DNS responses internal vs external | +| **Service discovery** | Finding service endpoints dynamically | +| **Latency vs jitter** | Delay vs variation in delay | + +## Network Security + +| Term | Definition | +|------|------------| +| **Network ACLs** | Stateless firewall rules for subnets | +| **Security groups** | Stateful firewall rules for instances | +| **Firewall rules** | Rules controlling network traffic | +| **Ingress vs egress** | Inbound vs outbound traffic | +| **East-west vs north-south** | Internal vs external traffic | +| **Overlay networks** | Virtual networks on top of physical | +| **Underlay networks** | Physical network infrastructure | +| **Zero trust networking** | Verify every request regardless of source | +| **Network segmentation** | Dividing network into zones | +| **Micro-segmentation** | Fine-grained network isolation | + +## DNS + +| Term | Definition | +|------|------------| +| **DNS TTL** | Time-To-Live for DNS records | +| **DNS 
cache poisoning** | Attack corrupting DNS cache | +| **Anycast vs unicast** | Same IP multiple locations vs single location | +| **GSLB** | Global Server Load Balancing | +| **CNAMES vs ANAMEs** | Canonical names vs ALIAS records | +| **DNS SRV records** | Service location records | +| **Weighted DNS records** | Traffic distribution via DNS | +| **DNS failover** | Automatic DNS-based failover | + +## Load Balancing + +| Term | Definition | +|------|------------| +| **Round robin** | Distributing requests in rotation | +| **Least connections** | Sending to server with fewest connections | +| **Weighted** | Distribution based on server capacity | +| **IP hash** | Consistent routing based on client IP | +| **Sticky sessions** | Routing same client to same server | +| **Connection draining** | Completing requests before removing server | +| **Health checks (active/passive)** | Probing vs observing server health | + +## Advanced Networking + +| Term | Definition | +|------|------------| +| **Service mesh** | Infrastructure for service communication | +| **Sidecar proxy (Envoy)** | Proxy container alongside application | +| **Policy-based routing** | Routing based on policies not just destination | +| **BGP** | Border Gateway Protocol | +| **ASN** | Autonomous System Number | +| **Peering vs transit** | Direct connection vs paying for routing | +| **PrivateLink / VPC Endpoints** | Private connectivity to services | +| **MTU** | Maximum Transmission Unit | +| **QoS** | Quality of Service | +| **Bandwidth vs throughput** | Capacity vs actual data transfer rate | + +## Kubernetes Networking + +| Term | Definition | +|------|------------| +| **kube-proxy** | Network proxy on each node | +| **ClusterIP** | Internal-only service IP | +| **NodePort** | Service exposed on node ports | +| **LoadBalancer service** | Service with external load balancer | +| **Ingress controller** | Implementation of Ingress API | +| **Gateway API** | Next-generation ingress specification | +| 
**NetworkPolicies** | L3/L4 firewall for pods | +| **PodCIDR** | IP range allocated to pods | +| **CNI** | Container Network Interface | +| **Calico / Cilium** | Popular CNI implementations | +| **Pod-to-pod encryption** | Encrypting traffic between pods | + +--- + +# 🗄️ **12. Database Reliability Vocabulary** + +| Term | Definition | +|------|------------| +| **RPO/RTO** | Recovery Point/Time Objective | +| **Replication lag** | Delay between primary and replica | +| **Write amplification** | Extra writes from indexing/replication | +| **Connection pooling** | Reusing database connections | +| **Hot standby** | Replica ready for immediate failover | +| **Warm standby** | Replica needing some preparation | +| **Cold failover** | Failover requiring significant setup | +| **Partitioning (sharding)** | Splitting data across databases | +| **Read replicas** | Copies for read-only queries | +| **Transaction boundaries** | Scope of ACID guarantees | +| **Isolation levels** | Degree of transaction isolation | +| **Backfill** | Populating data retroactively | +| **pg_bouncer** | PostgreSQL connection pooler | +| **Vacuum** | PostgreSQL maintenance for dead tuples | +| **Dead tuple accumulation** | Buildup of deleted row versions | +| **Failover election** | Process of choosing new primary | + +--- + +# ✅ **13. 
Platform Anti-Patterns** + +| Anti-Pattern | Description | +|--------------|-------------| +| **Configuration drift** | Actual state diverging from intended | +| **Snowflake servers** | Unique, non-reproducible configurations | +| **Tight coupling** | Components that can't change independently | +| **Hidden dependencies** | Undocumented relationships between systems | +| **Mutating production manually** | Direct changes bypassing automation | +| **Silent failure** | Failures without alerts or logs | +| **Shadow ops** | Unofficial processes outside standard tooling | +| **Orphan secrets** | Unused but still valid credentials | +| **Credential sprawl** | Credentials scattered across systems | +| **Static long-lived passwords** | Credentials that never expire | +| **Single-tenant-by-accident** | Unintended tight coupling to one tenant | + +--- + +# 🧠 **14. Architecture Trade-Off Terminology** + +| Trade-Off | Description | +|-----------|-------------| +| **Latency vs throughput** | Response time vs capacity | +| **Cost vs durability** | Expense vs data safety | +| **Consistency vs availability** | Data correctness vs uptime | +| **Security vs convenience** | Protection vs ease of use | +| **Performance vs maintainability** | Speed vs code clarity | +| **Complexity vs control** | Features vs simplicity | +| **Abstraction leakage** | When implementation details break through abstractions | + +--- + +# 🎛️ **15. 
Control Plane Vocabulary** + +| Term | Definition | +|------|------------| +| **Declarative specification** | Describing what you want, not how | +| **Controller manager** | Component running controllers | +| **Watch loops** | Controllers watching for changes | +| **Reconciliation** | Aligning actual with desired state | +| **Drift remediation** | Correcting drift automatically | +| **Desired state store** | Where desired state is persisted | +| **Operator SDK** | Framework for building operators | +| **Custom resources** | User-defined Kubernetes resources | + +--- + +# 🐍 **16. FastAPI Vocabulary** + +## FastAPI Core + +| Term | Definition | +|------|------------| +| **Path operations** | HTTP method + path combinations | +| **Path operation function** | Function handling a path operation | +| **Dependency injection** | Automatic provision of dependencies | +| **Dependencies (Depends)** | FastAPI's DI mechanism | +| **Request state** | Data attached to request lifecycle | +| **Background tasks** | Tasks executed after response | +| **Middleware** | Code running before/after requests | +| **Routers** | Grouping of path operations | +| **Sub-applications** | Mounting apps within apps | +| **Exception handlers** | Custom error handling | +| **Response models** | Pydantic models for responses | +| **Startup/shutdown events** | Lifecycle hooks | +| **Lifespan protocol** | Modern async context manager for lifecycle | +| **OpenAPI schema generation** | Automatic API documentation | + +## Pydantic + +| Term | Definition | +|------|------------| +| **BaseModel** | Base class for data models | +| **Field validators** | Validation functions for fields | +| **Model config** | Configuration for model behavior | +| **Strict types** | Types that don't coerce | +| **Alias generation** | Automatic field name aliases | +| **Model inheritance** | Extending models | +| **ORM mode** | Compatibility with ORM objects | + +## Async/Concurrency + +| Term | Definition | 
+|------|------------| +| **Event loop** | Core of async execution | +| **Coroutine** | Async function | +| **Context switching** | Switching between coroutines | +| **Async DB engines** | Non-blocking database drivers | + +## API Integration + +| Term | Definition | +|------|------------| +| **Clients (httpx)** | Async HTTP client library | +| **Session reuse** | Reusing HTTP connections | +| **Circuit breakers** | Preventing cascading failures | +| **Retries with jitter** | Randomized retry timing | +| **Backoff** | Increasing delay between retries | +| **Timeout budgets** | Allocating latency across operations | + +--- + +# 🤖 **17. Modern AI Platform Terms** + +| Term | Definition | +|------|------------| +| **RAG** | Retrieval-Augmented Generation | +| **Vector embeddings** | Numerical representations of content | +| **Chunking strategies** | Methods for splitting documents | +| **Hallucination rate** | Frequency of incorrect AI outputs | +| **Prompt injection** | Attack via malicious prompts | +| **Safety guardrails** | Controls preventing harmful outputs | +| **Structured tool calling** | AI invoking tools with typed parameters | +| **Agent orchestration** | Managing multi-step AI workflows | +| **Agent handoff** | Transferring between specialized agents | +| **Latency budget (LLM)** | Acceptable delay for AI responses | +| **Function calling** | AI calling predefined functions | +| **Streaming response** | Incremental output delivery | +| **Semantic caching** | Caching based on meaning similarity | +| **Evaluation metrics (RAGAS)** | Framework for RAG evaluation | +| **Tracing (Langfuse)** | Observability for LLM applications | +| **Observability of prompts** | Tracking prompt performance | +| **Agentic** | AI that can take autonomous actions | +| **vLLM** | High-performance LLM inference engine | + +--- + +# 🧱 **18. 
How to Use These Terms** + +## In PR Reviews + +- *"This increases blast radius"* +- *"We risk configuration drift here"* +- *"Can we enforce immutability?"* +- *"Retries need idempotency guarantees"* + +## In Meetings + +- *"What's the rollback path?"* +- *"What's our boundary for tenant isolation?"* + +## In Documentation + +- *"We apply progressive delivery to reduce risk"* + +--- + +# 📚 **19. Learning Practice** + +For any term, ask: + +1. **Define** — What is it? +2. **When to use** — Appropriate scenarios +3. **When NOT to use** — Anti-patterns +4. **Trade-offs** — What you gain/lose +5. **Real-world example** — Concrete usage +6. **Sentence** — How to use it naturally + +--- + +# 🌿 **20. Git Vocabulary** + +| Term | Definition | +|------|------------| +| **Cherry-pick** | Apply specific commits to another branch | +| **Backport** | Apply fix from newer to older version | +| **Forwardport** | Apply fix from older to newer version | + +--- + +# 📨 **21. Messaging & Event-Driven Systems** + +| Term | Definition | +|------|------------| +| **Outbox Pattern** | Writing events to DB table, then to message broker atomically | +| **Event sourcing** | Storing events as source of truth | +| **CDC (Change Data Capture)** | Capturing database changes as events | +| **Exactly-once semantics** | Guarantee of processing exactly once | +| **At-least-once delivery** | Guarantee of delivery (may have duplicates) | +| **Consumer group** | Group of consumers sharing workload | +| **Partition** | Ordered subset of topic messages | +| **Dead Letter Queue (DLQ)** | Queue for failed messages | +| **Saga pattern** | Distributed transaction via event choreography | +| **Compensating transaction** | Undoing previous transaction on failure | + +--- + +*Document maintenu par : Platform Team* +*Dernière mise à jour : Janvier 2026* diff --git a/bootstrap/BOOTSTRAP-GUIDE.md b/bootstrap/BOOTSTRAP-GUIDE.md new file mode 100644 index 0000000..3f233e9 --- /dev/null +++ 
b/bootstrap/BOOTSTRAP-GUIDE.md @@ -0,0 +1,213 @@ +# 🥚🐔 **Bootstrap Guide** +## *LOCAL-PLUS Platform Initialization* + +--- + +> **Retour vers** : [Architecture Overview](../EntrepriseArchitecture.md) + +--- + +# 📋 **Table of Contents** + +1. [Layer 0 — Manual Bootstrap](#layer-0--manual-bootstrap) +2. [Account Factory — Self-Service](#account-factory--self-service) +3. [Platform Application Provisioning](#platform-application-provisioning) +4. [Workload Provisioning](#workload-provisioning) +5. [Layer 2 — Platform Bootstrap](#layer-2--platform-bootstrap) +6. [Layer 3+ — Application Services](#layer-3--application-services) +7. [Bootstrap Repository Structure](#bootstrap-repository-structure) + +--- + +# 🔧 **Layer 0 — Manual Bootstrap (1x per AWS Organization)** + +> **Principe :** Point d'entrée unique pour chaque cloud provider. +> Ces étapes sont manuelles car elles créent les fondations pour toute l'automatisation future. + +## Étapes + +| Étape | Action | Outil | Durée | +|-------|--------|-------|-------| +| 1 | Créer compte Management | Console AWS | 10 min | +| 2 | Activer AWS Organizations | Console | 5 min | +| 3 | Activer Control Tower | Console | 45 min | +| 4 | Configurer IAM Identity Center (SSO) | Console | 30 min | +| 5 | Créer OUs (Security, Infrastructure, Workloads) | Control Tower | 15 min | +| 6 | Appliquer SCPs | Console Organizations | 15 min | +| 7 | Créer Core Accounts | Control Tower | 15 min/compte | + +## AWS Multi-Account Strategy + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ AWS CONTROL TOWER (Organization) │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ │ +│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │ +│ │ MANAGEMENT │ │ SECURITY │ │ LOG ARCHIVE │ │ +│ │ ACCOUNT │ │ ACCOUNT │ │ ACCOUNT │ │ +│ │ • Control Tower│ │ • GuardDuty │ │ • CloudTrail │ │ +│ │ • Organizations│ │ • Security Hub │ │ • Config Logs │ │ +│ │ • SCPs │ │ • IAM Identity 
│ │ • VPC Flow Logs│ │ +│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │ +│ │ +│ ┌─────────────────────────────────────────────────────────────────────┐ │ +│ │ WORKLOAD ACCOUNTS (OU: Workloads) │ │ +│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │ +│ │ │ DEV Account │ │ STAGING │ │ PROD Account│ │ │ +│ │ │ VPC + EKS │ │ VPC + EKS │ │ VPC + EKS │ │ │ +│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │ +│ └─────────────────────────────────────────────────────────────────────┘ │ +│ │ +│ ┌─────────────────────────────────────────────────────────────────────┐ │ +│ │ SHARED SERVICES ACCOUNT (OU: Infrastructure) │ │ +│ │ • Transit Gateway Hub • Container Registry (ECR) │ │ +│ │ • VPC Endpoints • Artifact Storage (S3) │ │ +│ └─────────────────────────────────────────────────────────────────────┘ │ +│ │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +--- + +# 🏭 **Account Factory — Self-Service** + +> **Principe :** Les équipes demandent un AWS account via PR dans `bootstrap/account-factory/requests/` + +## Ce qui est créé automatiquement + +| Ressource | Description | +|-----------|-------------| +| **AWS Account** | Dans l'OU appropriée (Workloads/Dev, Staging, Prod) | +| **S3 Bucket** | Pour Terraform state | +| **GitHub OIDC** | Pour CI/CD sans credentials statiques | +| **Baseline IAM Roles** | Admin, Developer, ReadOnly | + +## Workflow + +1. **Équipe** crée un fichier YAML dans `bootstrap/account-factory/requests/` +2. **PR Review** par Platform Team +3. **Merge** déclenche Terraform via CI/CD +4. **Account créé** avec baseline automatique + +--- + +# 📦 **Platform Application Provisioning** + +> **Repo :** `platform-application-provisioning` +> Contient les modules Terraform pour provisionner les services applicatifs. 
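À titre d'illustration, un appel à l'un de ces modules pourrait ressembler à l'esquisse ci-dessous (hypothétique : les noms de variables d'entrée ne sont pas définis par ce document et l'interface réelle des modules peut différer) :

```hcl
# Esquisse illustrative — l'interface réelle du module `eks-namespace` peut différer.
module "ns_svc_ledger" {
  source = "./modules/eks-namespace"

  # Noms de variables hypothétiques, à aligner sur le contrat réel du module
  cluster_name = module.eks.cluster_name
  namespace    = "svc-ledger"
  team         = "backend"

  # Le module applique RBAC + NetworkPolicy (default-deny, cf. Networking)
  default_deny = true
}
```

L'intérêt du pattern : chaque service obtient son namespace, son RBAC et ses NetworkPolicies par un simple appel de module, sans configuration manuelle du cluster.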
+ +## Providers + +| Provider | Ce qui est provisionné | Fréquence | +|----------|------------------------|-----------| +| **Cloudflare** | Zone DNS, WAF, Tunnel | 1x par zone | +| **Aiven** | Projet, VPC peering | 1x par environment | +| **AWS** | VPC, EKS, KMS | 1x par environment | + +## Modules disponibles + +| Module | Description | +|--------|-------------| +| `database/` | Aiven PostgreSQL | +| `kafka/` | Aiven Kafka | +| `cache/` | Aiven Valkey | +| `vpc/` | AWS VPC | +| `eks/` | AWS EKS Cluster | +| `eks-namespace/` | Namespace + RBAC + NetworkPolicy | + +--- + +# 🖥️ **Workload Provisioning** + +> Ordre de provisionnement pour un nouvel environnement. + +| Ordre | Ressource | Dépendances | +|-------|-----------|-------------| +| 1 | VPC + Subnets | Account créé | +| 2 | KMS Keys | Account créé | +| 3 | EKS Cluster | VPC, KMS | +| 4 | IRSA | EKS | +| 5 | VPC Peering (Aiven) | VPC, Aiven projet | +| 6 | Outputs → Platform repos | Tous | + +--- + +# 🚀 **Layer 2 — Platform Bootstrap** + +> Installation des composants platform sur le cluster EKS. + +| Ordre | Action | Dépendance | +|-------|--------|------------| +| 1 | Install ArgoCD via Helm | EKS ready | +| 2 | Apply App-of-Apps ApplicationSet | ArgoCD running | +| 3 | ArgoCD syncs `platform-*` repos | Reconciliation auto | + +**ArgoCD : Instance centralisée unique** gérant tous les environnements. + +--- + +# 📱 **Layer 3+ — Application Services** + +> ArgoCD ApplicationSets découvrent automatiquement les services. + +## Fonctionnement + +1. **Git Generator** scanne les répertoires de services +2. **Matrix Generator** croise avec les clusters (dev/staging/prod) +3. **Applications créées** automatiquement pour chaque combinaison +4. 
**Sync** selon la politique (auto pour dev, manual pour prod) + +## Flux de déploiement + +``` +Git push → ArgoCD détecte → Sync (dev: auto, prod: manual) → Deployed +``` + +→ **CI/CD détaillé** : voir [Platform Engineering](../platform/PLATFORM-ENGINEERING.md) + +--- + +# 📋 **Bootstrap Repository Structure** + +``` +bootstrap/ +├── .mise.toml # Tool versions +├── Taskfile.yaml # Task orchestration +│ +├── aws-landing-zone/ +│ ├── organization/ # OUs definition +│ ├── control-tower/ # Control Tower setup +│ ├── sso/ # SSO groups, permission sets +│ ├── scps/ # Service Control Policies +│ └── core-accounts/ # Core accounts config +│ +├── account-factory/ +│ ├── main.tf # Account creation +│ ├── templates/ # Baseline resources +│ └── requests/ # Account requests (PR) +│ +├── tests/ +│ ├── unit/ # terraform test +│ ├── compliance/ # OPA/Conftest +│ └── security/ # Trivy +│ +└── docs/ + ├── RUNBOOK-BOOTSTRAP.md + └── ACCOUNT-FACTORY.md +``` + +--- + +# 🔗 **Related Documentation** + +| Topic | Link | +|-------|------| +| **CI/CD & Delivery** | [Platform Engineering](../platform/PLATFORM-ENGINEERING.md) | +| **Security Setup** | [Security Architecture](../security/SECURITY-ARCHITECTURE.md) | +| **Networking** | [Networking Architecture](../networking/NETWORKING-ARCHITECTURE.md) | + +--- + +*Document maintenu par : Platform Team* +*Dernière mise à jour : Janvier 2026* diff --git a/data/DATA-ARCHITECTURE.md b/data/DATA-ARCHITECTURE.md new file mode 100644 index 0000000..f334d48 --- /dev/null +++ b/data/DATA-ARCHITECTURE.md @@ -0,0 +1,478 @@ +# 💾 **Data Architecture** +## *LOCAL-PLUS Database, Kafka, Cache & Queues* + +--- + +> **Retour vers** : [Architecture Overview](../EntrepriseArchitecture.md) + +--- + +# 📋 **Table of Contents** + +1. [Aiven Configuration](#aiven-configuration) +2. [Database Strategy](#database-strategy) +3. [Schema Ownership](#schema-ownership) +4. [Kafka Topics](#kafka-topics) +5. [Kafka Monitoring](#kafka-monitoring) +6. 
[Cache Architecture (Valkey)](#cache-architecture-valkey) +7. [Queueing & Background Jobs](#queueing--background-jobs) + +--- + +# 🗄️ **Aiven Configuration** + +## Services Overview + +| Service | Plan | Config | Coût estimé | +|---------|------|--------|-------------| +| **PostgreSQL** | Business-4 | Primary + Read Replica, 100GB | ~300€/mois | +| **Kafka** | Business-4 | 3 brokers, 100GB retention | ~400€/mois | +| **Valkey (Redis)** | Business-4 | 2 nodes, 10GB, HA | ~150€/mois | + +**Coût total Aiven estimé : ~850€/mois** + +--- + +# 🐘 **Database Strategy** + +## Configuration + +| Aspect | Choix | Rationale | +|--------|-------|-----------| +| **Replication** | Aiven managed (async) | RPO 1h acceptable | +| **Backup** | Aiven automated hourly | RPO 1h | +| **Failover** | Aiven automated | RTO < 15min | +| **Connection** | VPC Peering (private) | PCI-DSS, no public internet | +| **Pooling** | PgBouncer (Aiven built-in) | Connection efficiency | + +## Connection Best Practices + +| Paramètre | Valeur recommandée | Rationale | +|-----------|-------------------|-----------| +| **pool_size** | 20 | Nombre de connexions par pod | +| **max_overflow** | 10 | Connexions supplémentaires en pic | +| **pool_timeout** | 30s | Attente max pour une connexion | +| **pool_recycle** | 1800s | Recycler connexions toutes les 30min | +| **ssl** | require | Obligatoire pour PCI-DSS | + +--- + +# 📊 **Schema Ownership** + +| Table | Owner Service | Access pattern | +|-------|---------------|----------------| +| `transactions` | svc-ledger | CRUD | +| `ledger_entries` | svc-ledger | CRUD | +| `wallets` | svc-wallet | CRUD | +| `balance_snapshots` | svc-wallet | CRUD | +| `merchants` | svc-merchant | CRUD | +| `giftcards` | svc-giftcard | CRUD | + +**Règle d'or : 1 table = 1 owner. 
Cross-service = gRPC ou Events, jamais JOIN.** + +--- + +# 📨 **Kafka Topics** + +## Topic Configuration + +| Topic | Producer | Consumers | Retention | +|-------|----------|-----------|-----------| +| `ledger.transactions.v1` | svc-ledger (Outbox) | svc-notification, svc-analytics | 7 jours | +| `wallet.balance-updated.v1` | svc-wallet | svc-analytics | 7 jours | +| `merchant.onboarded.v1` | svc-merchant | svc-notification | 7 jours | + +## Outbox Pattern avec Debezium + +> **Implementation** : On utilise **Debezium** avec **PostgreSQL Logical Replication** (publication + replication slot), pas le polling. + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ OUTBOX PATTERN (Debezium CDC) │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ │ +│ 1. Application writes to DB + Outbox table in same transaction │ +│ 2. Debezium reads WAL via replication slot │ +│ 3. Events published to Kafka │ +│ 4. Consumers process events │ +│ │ +│ ┌─────────┐ ┌─────────────┐ ┌──────────┐ ┌─────────────┐ │ +│ │ svc-* │───►│ PostgreSQL │───►│ Debezium │───►│ Kafka │ │ +│ │ │ │ (WAL/Slot) │ │ (CDC) │ │ │ │ +│ └─────────┘ └─────────────┘ └──────────┘ └──────┬──────┘ │ +│ │ │ +│ Publication + Replication Slot ▼ │ +│ ┌─────────────────┐ │ +│ │ Consumers │ │ +│ └─────────────────┘ │ +│ │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +## Debezium Configuration + +| Composant | Description | +|-----------|-------------| +| **Publication** | `CREATE PUBLICATION outbox_pub FOR TABLE outbox;` | +| **Replication Slot** | Créé automatiquement par Debezium | +| **Connector** | Debezium PostgreSQL Connector | +| **Output** | Kafka topic par table (ou SMT pour routing) | + +## Outbox Table Structure + +| Colonne | Type | Description | +|---------|------|-------------| +| `id` | UUID | Primary key | +| `aggregate_type` | VARCHAR(255) | Type d'entité (Transaction, Wallet...) 
| +| `aggregate_id` | VARCHAR(255) | ID de l'entité | +| `event_type` | VARCHAR(255) | Type d'événement | +| `payload` | JSONB | Données de l'événement | +| `created_at` | TIMESTAMPTZ | Timestamp création | + +--- + +# 📊 **Kafka Monitoring** + +## Métriques Essentielles + +| Métrique | Description | Seuil Alerte | Sévérité | +|----------|-------------|--------------|----------| +| **Consumer Lag** | Messages non traités | > 1000 | P2 | +| **Partition Lag** | Lag par partition | > 500 | P3 | +| **Under-replicated Partitions** | Partitions sans réplicas | > 0 | P1 | +| **Active Controller Count** | Controllers actifs | ≠ 1 | P1 | +| **Offline Partitions** | Partitions inaccessibles | > 0 | P1 | +| **Bytes In/Out Rate** | Débit Kafka | Anomalie > 50% | P3 | +| **Request Latency P99** | Latence requêtes | > 100ms | P2 | +| **ISR Shrink Rate** | Réduction In-Sync Replicas | > 0/min sustained | P2 | + +## Consumer Lag Monitoring + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ CONSUMER LAG │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ │ +│ Producer Offset: 1000 ────────────────────────────────► │ +│ Consumer Offset: 800 ──────────────────────► │ +│ │◄───── LAG = 200 ─────►│ │ +│ │ +│ LAG = Producer Offset - Consumer Offset │ +│ │ +│ Causes de Lag élevé: │ +│ • Consumer lent (processing time) │ +│ • Consumer crashé │ +│ • Pic de trafic │ +│ • Problème de partition rebalancing │ +│ │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +## Dashboard Kafka Recommandé + +| Panel | Métrique | Type | +|-------|----------|------| +| **Total Consumer Lag** | `kafka_consumergroup_lag` | Gauge | +| **Lag par Consumer Group** | `kafka_consumergroup_lag` by group | Gauge | +| **Messages In/sec** | `kafka_server_brokertopicmetrics_messagesin_total` | Counter → Rate | +| **Bytes In/Out** | `kafka_server_brokertopicmetrics_bytesin_total` | Counter → Rate | +| **Request 
Latency** | `kafka_network_requestmetrics_requestqueuetimems` | Histogram | +| **Partition Count** | `kafka_server_replicamanager_partitioncount` | Gauge | +| **Under-replicated** | `kafka_server_replicamanager_underreplicatedpartitions` | Gauge | + +--- + +# 🚀 **Cache Architecture (Valkey)** + +## Stack Cache + +| Composant | Outil | Hébergement | Coût estimé | +|-----------|-------|-------------|-------------| +| **Cache primaire** | Valkey (Redis-compatible) | Aiven for Caching | ~150€/mois | +| **Cache local (L1)** | Python `cachetools` / Go `bigcache` | In-memory | 0€ | + +> **Note :** Valkey est le fork open-source de Redis, maintenu par la Linux Foundation. Aiven supporte Valkey nativement. + +## Cache Topology + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ MULTI-LAYER CACHE │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ │ +│ ┌─────────────────────────────────────────────────────────────────────┐ │ +│ │ L1 — LOCAL CACHE (per pod) │ │ +│ │ • TTL: 30s - 5min │ │ +│ │ • Size: 100MB max per pod │ │ +│ │ • Use case: Hot data, config, user sessions │ │ +│ └───────────────────────────────┬─────────────────────────────────────┘ │ +│ │ Cache miss │ +│ ▼ │ +│ ┌─────────────────────────────────────────────────────────────────────┐ │ +│ │ L2 — DISTRIBUTED CACHE (Valkey cluster) │ │ +│ │ • TTL: 5min - 24h │ │ +│ │ • Size: 10GB │ │ +│ │ • Use case: Shared state, rate limits, session store │ │ +│ └───────────────────────────────┬─────────────────────────────────────┘ │ +│ │ Cache miss │ +│ ▼ │ +│ ┌─────────────────────────────────────────────────────────────────────┐ │ +│ │ L3 — DATABASE (PostgreSQL) │ │ +│ │ • Source of truth │ │ +│ │ • Write-through pour updates │ │ +│ └─────────────────────────────────────────────────────────────────────┘ │ +│ │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +## Cache Strategies par Use Case + +| Use Case | 
Strategy | TTL | Invalidation | +|----------|----------|-----|--------------| +| **Wallet Balance** | Cache-aside (read) | 30s | Event-driven (Kafka) | +| **Merchant Config** | Read-through | 5min | TTL + Manual | +| **Rate Limiting** | Write-through | Sliding window | Auto-expire | +| **Session Data** | Write-through | 24h | Explicit logout | +| **Gift Card Catalog** | Cache-aside | 15min | Event-driven | +| **Feature Flags** | Read-through | 1min | Config push | + +## Cache Patterns + +### Cache-Aside Pattern + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ CACHE-ASIDE PATTERN │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ │ +│ 1. Application checks cache │ +│ 2. If HIT → return cached data │ +│ 3. If MISS → query database │ +│ 4. Store result in cache with TTL │ +│ 5. Return data to caller │ +│ │ +│ ┌─────────┐ GET ┌─────────┐ │ +│ │ App │───────────►│ Cache │ │ +│ └────┬────┘ └────┬────┘ │ +│ │ │ MISS │ +│ │ SELECT ▼ │ +│ └─────────────────►┌─────────┐ │ +│ │ DB │ │ +│ └─────────┘ │ +│ │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +### Write-Through Pattern + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ WRITE-THROUGH PATTERN │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ │ +│ 1. Application writes to cache AND database atomically │ +│ 2. 
Cache is always consistent with database │ +│ │ +│ ┌─────────┐ SET+TTL ┌─────────┐ │ +│ │ App │────────────►│ Cache │ │ +│ └────┬────┘ └─────────┘ │ +│ │ │ +│ │ INSERT/UPDATE │ +│ └─────────────────►┌─────────┐ │ +│ │ DB │ │ +│ └─────────┘ │ +│ │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +## Cache Invalidation Strategy + +| Trigger | Méthode | Use Case | +|---------|---------|----------| +| **TTL Expiry** | Automatic | Default pour toutes les clés | +| **Event-driven** | Kafka consumer | Wallet balance après transaction | +| **Explicit Delete** | API call | Admin actions, config updates | +| **Pub/Sub** | Valkey PUBLISH | Real-time invalidation cross-pods | + +## Cache Key Naming Convention + +``` +{service}:{entity}:{id}:{version} + +Exemples: + wallet:balance:user_123:v1 + merchant:config:merchant_456:v1 + giftcard:catalog:category_active:v1 + ratelimit:api:user_123:minute + session:auth:session_abc123 +``` + +## Cache Metrics & Monitoring + +| Metric | Seuil alerte | Action | +|--------|--------------|--------| +| **Hit Rate** | < 80% | Revoir TTL, préchargement | +| **Latency P99** | > 10ms | Check network, cluster size | +| **Memory Usage** | > 80% | Eviction analysis, scale up | +| **Evictions/sec** | > 100 | Augmenter cache size | +| **Connection Errors** | > 0 | Check connectivity, pooling | + +--- + +# 📋 **Queueing & Background Jobs** + +## Architecture Overview + +> **Clarification** : La Task Queue est **interne** aux services, pas en frontal comme RabbitMQ. 
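Le principe « réponse immédiate + traitement en background » peut s'esquisser ainsi en Python standard (esquisse illustrative : ici `queue.Queue` et un thread tiennent lieu de Valkey et du worker Dramatiq, ce n'est pas l'implémentation réelle) :

```python
import queue
import threading

task_queue: "queue.Queue" = queue.Queue()  # tient lieu de Valkey (broker)
processed = []                             # trace des tâches exécutées (démo)

def worker() -> None:
    # Rôle du worker Dramatiq : dépiler et exécuter en arrière-plan
    while True:
        task = task_queue.get()
        if task is None:  # sentinelle d'arrêt (pour la démo uniquement)
            break
        processed.append(f"notified:{task['user_id']}")
        task_queue.task_done()

def handle_request(user_id: str) -> dict:
    # Rôle de l'API (svc-*) : enqueue la tâche lente, puis réponse immédiate
    task_queue.put({"type": "send_notification", "user_id": user_id})
    return {"status": "accepted", "user_id": user_id}

threading.Thread(target=worker, daemon=True).start()
response = handle_request("user_123")  # ne bloque pas sur le traitement
task_queue.join()                      # pour la démo : attendre le worker
task_queue.put(None)                   # arrêt propre du worker
```

En production, `handle_request` serait un endpoint FastAPI et `worker` un actor Dramatiq consommant depuis Valkey ; seul le découplage enqueue/exécution compte ici.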
+ +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ TASK QUEUE vs MESSAGE BROKER (RabbitMQ) │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ │ +│ ❌ Pattern RabbitMQ (frontal) - PAS ce qu'on fait: │ +│ │ +│ Client → RabbitMQ → Worker → Response to Client (synchrone) │ +│ │ +│ ✅ Notre pattern (Task Queue interne): │ +│ │ +│ Client → API (svc-*) → Response immédiate (< 200ms) │ +│ │ │ +│ └──► enqueue task → Valkey → Worker (async, background) │ +│ │ +│ Différence clé: │ +│ • L'API répond IMMÉDIATEMENT au client │ +│ • Le worker traite en BACKGROUND (fire-and-forget ou avec callback) │ +│ │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +## Queueing Tiers + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ QUEUEING ARCHITECTURE │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ │ +│ ┌─────────────────────────────────────────────────────────────────────┐ │ +│ │ TIER 1 — EVENT STREAMING (Kafka) │ │ +│ │ • Use case: Event-driven architecture, CDC, audit logs │ │ +│ │ • Pattern: Pub/Sub, Event Sourcing │ │ +│ │ • Ordering: Per-partition guaranteed │ │ +│ └─────────────────────────────────────────────────────────────────────┘ │ +│ │ +│ ┌─────────────────────────────────────────────────────────────────────┐ │ +│ │ TIER 2 — TASK QUEUE (Valkey + Dramatiq) │ │ +│ │ • Use case: Background jobs, async processing │ │ +│ │ • Pattern: Producer/Consumer, Work Queue │ │ +│ │ • Features: Retries, priorities, scheduling │ │ +│ └─────────────────────────────────────────────────────────────────────┘ │ +│ │ +│ ┌─────────────────────────────────────────────────────────────────────┐ │ +│ │ TIER 3 — SCHEDULED JOBS (Kubernetes CronJobs) │ │ +│ │ • Use case: Batch processing, reports, cleanup │ │ +│ │ • Pattern: Time-triggered execution │ │ +│ 
└─────────────────────────────────────────────────────────────────────┘ │ +│ │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +## Kafka vs Task Queue — Quand utiliser quoi ? + +| Critère | Kafka | Task Queue (Valkey) | +|---------|-------|---------------------| +| **Message Ordering** | ✅ Per-partition | ❌ Best effort | +| **Message Replay** | ✅ Retention-based | ❌ Non | +| **Priority Queues** | ❌ Non natif | ✅ Oui | +| **Delayed Messages** | ❌ Non natif | ✅ Oui | +| **Dead Letter Queue** | ✅ Configurable | ✅ Intégré | +| **Exactly-once** | ✅ Avec idempotency | ❌ At-least-once | +| **Use Case** | Events entre services | Jobs internes async | + +## Task Queue Stack + +| Composant | Outil | Rôle | +|-----------|-------|------| +| **Task Framework** | Dramatiq (Python) / Asynq (Go) | Task definition, execution | +| **Broker** | Valkey (Redis-compatible) | Message storage, routing | +| **Result Backend** | Valkey | Task results, status | +| **Scheduler** | APScheduler / Dramatiq-crontab | Periodic tasks | +| **Monitoring** | Dramatiq Dashboard / Prometheus | Task metrics | + +## Task Processing Flow + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ TASK PROCESSING FLOW │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ │ +│ Producer Broker Workers │ +│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ +│ │ svc-* │──── enqueue ──►│ Valkey │◄── poll ─────│ Worker │ │ +│ │ API │ │ │ │ Pods │ │ +│ └─────────┘ │ Queues: │ └────┬────┘ │ +│ │ │ • high │ │ │ +│ │ Response │ • default│ │ execute │ +│ │ immédiate │ • low │ ▼ │ +│ ▼ │ • dlq │ ┌─────────┐ │ +│ Client └─────────┘ │ Task │ │ +│ (n'attend pas) │ Handler │ │ +│ └─────────┘ │ +│ │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +## Queue Definitions + +| Queue | Priority | Workers | Use Cases | +|-------|----------|---------|-----------| +| **critical** | P0 | 5 | Transaction 
rollbacks, fraud alerts | +| **high** | P1 | 10 | Email confirmations, balance updates | +| **default** | P2 | 20 | Notifications, analytics events | +| **low** | P3 | 5 | Reports, cleanup, batch exports | +| **scheduled** | N/A | 3 | Cron-like scheduled tasks | +| **dead-letter** | N/A | 1 | Failed tasks investigation | + +## Retry Strategy + +| Retry Policy | Configuration | Use Case | +|--------------|---------------|----------| +| **Exponential Backoff** | base=1s, max=1h, multiplier=2 | API calls, external services | +| **Fixed Interval** | interval=30s, max_retries=5 | Database operations | +| **No Retry** | max_retries=0 | Idempotent operations | + +## Dead Letter Queue (DLQ) Handling + +| Étape | Action | +|-------|--------| +| 1 | Task fails après max retries | +| 2 | Task moved to DLQ avec metadata (reason, stack trace, attempts) | +| 3 | Alert Slack (P3) | +| 4 | On-call investigate | +| 5 | Options: Fix → Replay, Manual resolution, Archive | + +## Scheduled Jobs (CronJobs) + +| Job | Schedule | Service | Description | +|-----|----------|---------|-------------| +| **balance-reconciliation** | `0 2 * * *` | svc-wallet | Daily balance verification | +| **expired-giftcards** | `0 0 * * *` | svc-giftcard | Mark expired cards | +| **analytics-rollup** | `0 */6 * * *` | svc-analytics | 6-hourly aggregation | +| **log-cleanup** | `0 3 * * 0` | platform | Weekly log rotation | +| **backup-verification** | `0 4 * * *` | platform | Daily backup integrity check | +| **compliance-report** | `0 6 1 * *` | platform | Monthly compliance export | + +## Task Queue Monitoring + +| Metric | Seuil alerte | Action | +|--------|--------------|--------| +| **Queue Depth** | > 1000 tasks | Scale workers | +| **Processing Time P95** | > 30s | Optimize task, check resources | +| **Failure Rate** | > 5% | Investigate DLQ, check dependencies | +| **DLQ Size** | > 10 tasks | Immediate investigation | +| **Worker Availability** | < 50% | Check pod health, scale up | + +--- + 
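La politique **Exponential Backoff** du tableau Retry Strategy ci-dessus (base=1s, multiplier=2, plafond 1h) se calcule simplement ; esquisse en Python (le paramètre `jitter`, recommandé pour désynchroniser les retries, est une hypothèse d'illustration) :

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, multiplier: float = 2.0,
                  max_delay: float = 3600.0, jitter: float = 0.0) -> float:
    """Délai (secondes) avant la tentative `attempt` (1-indexée).

    Avec base=1s et multiplier=2 : 1s, 2s, 4s, 8s... plafonné à 1h.
    `jitter` ajoute une part aléatoire (fraction du délai) pour éviter
    les retries synchronisés (thundering herd).
    """
    delay = min(base * multiplier ** (attempt - 1), max_delay)
    return delay + random.uniform(0.0, jitter * delay)

# Sans jitter : 1s, 2s, 4s, 8s, 16s, puis plafonné à 3600s
delays = [backoff_delay(n) for n in range(1, 6)]
```

Le choix du plafond évite qu'une tâche en échec prolongé ne reparte avec des délais astronomiques ; au-delà de `max_retries`, la tâche bascule en DLQ comme décrit ci-dessus.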
+*Document maintenu par : Platform Team + Backend Team* +*Dernière mise à jour : Janvier 2026* diff --git a/networking/NETWORKING-ARCHITECTURE.md b/networking/NETWORKING-ARCHITECTURE.md new file mode 100644 index 0000000..d08d9af --- /dev/null +++ b/networking/NETWORKING-ARCHITECTURE.md @@ -0,0 +1,469 @@ +# 🌐 **Networking Architecture** +## *LOCAL-PLUS VPC, Edge, CDN & Gateway* + +--- + +> **Retour vers** : [Architecture Overview](../EntrepriseArchitecture.md) + +--- + +# 📋 **Table of Contents** + +1. [VPC Design](#vpc-design) +2. [Traffic Flow](#traffic-flow) +3. [Gateway API Configuration](#gateway-api-configuration) +4. [Network Policies](#network-policies) +5. [Cloudflare Architecture](#cloudflare-architecture) +6. [DNS Configuration](#dns-configuration) +7. [Route53 — DNS Interne & Backup](#route53--dns-interne--backup) +8. [API Gateway / APIM (Future)](#api-gateway--apim-future) +9. [Multi-Cloud Vision](#multi-cloud-vision) + +--- + +# 🏗️ **VPC Design** + +## CIDR Allocation + +| CIDR | Usage | Subnets | +|------|-------|---------| +| 10.0.0.0/16 | VPC Principal | - | +| 10.0.0.0/20 | Private Subnets (Workloads) | 3 AZs | +| 10.0.16.0/20 | Private Subnets (Data) | 3 AZs | +| 10.0.32.0/20 | Public Subnets (NAT, LB) | 3 AZs | + +## Architecture EKS + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ INTERNET │ +│ (End Users) │ +└─────────────────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────────────────┐ +│ CLOUDFLARE EDGE (Global) │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ • DNS (localplus.io) • WAF (OWASP rules) │ +│ • DDoS Protection (L3-L7) • SSL/TLS Termination │ +│ • CDN (static assets) • Bot Protection │ +│ • Cloudflare Tunnel • Zero Trust Access │ +└─────────────────────────────────────────────────────────────────────────────┘ + │ + │ Cloudflare Tunnel (encrypted) + ▼ 
+┌─────────────────────────────────────────────────────────────────────────────┐ +│ WORKLOAD ACCOUNT (PROD) — eu-west-1 │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ │ +│ ┌───────────────────────────────────────────────────────────────────────┐ │ +│ │ VPC — 10.0.0.0/16 │ │ +│ │ │ │ +│ │ ┌─────────────────────────────────────────────────────────────────┐ │ │ +│ │ │ EKS CLUSTER │ │ │ +│ │ │ │ │ │ +│ │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ +│ │ │ │ NODE POOL: platform (taints: platform=true:NoSchedule) │ │ │ │ +│ │ │ │ Instance: m6i.xlarge (dedicated resources) │ │ │ │ +│ │ │ ├─────────────────────────────────────────────────────────┤ │ │ │ +│ │ │ │ PLATFORM NAMESPACE │ │ │ │ +│ │ │ │ • ArgoCD, Cilium, Vault, Kyverno, OTel, Grafana │ │ │ │ +│ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ +│ │ │ │ │ │ +│ │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ +│ │ │ │ NODE POOL: application (default, auto-scaling) │ │ │ │ +│ │ │ │ Instance: m6i.large (cost-optimized) │ │ │ │ +│ │ │ ├─────────────────────────────────────────────────────────┤ │ │ │ +│ │ │ │ APPLICATION NAMESPACES │ │ │ │ +│ │ │ │ • svc-ledger, svc-wallet, svc-merchant, etc. 
│ │ │ │ +│ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ +│ │ │ │ │ │ +│ │ └─────────────────────────────────────────────────────────────────┘ │ │ +│ │ │ │ +│ │ │ VPC Peering / Transit Gateway │ │ +│ │ ▼ │ │ +│ │ ┌─────────────────────────────────────────────────────────────────┐ │ │ +│ │ │ AIVEN VPC │ │ │ +│ │ │ • PostgreSQL (Primary + Read Replica) │ │ │ +│ │ │ • Kafka Cluster │ │ │ +│ │ │ • Valkey (Redis-compatible) │ │ │ +│ │ └─────────────────────────────────────────────────────────────────┘ │ │ +│ │ │ │ +│ └───────────────────────────────────────────────────────────────────────┘ │ +│ │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +## Node Pool Strategy + +| Node Pool | Taints | Usage | Instance Type | Scaling | +|-----------|--------|-------|---------------|---------| +| **platform** | `platform=true:NoSchedule` | ArgoCD, Monitoring, Security tools | m6i.xlarge | Fixed (2-3 nodes) | +| **application** | None (default) | Domain services | m6i.large | HPA (2-10 nodes) | +| **spot** (optionnel) | `spot=true:PreferNoSchedule` | Batch jobs, non-critical | m6i.large (spot) | Auto (0-5 nodes) | + +--- + +# 🔄 **Traffic Flow** + +| Flow | Path | Encryption | +|------|------|------------| +| Internet → Services | Cloudflare → Tunnel → Cilium Gateway → Pod | TLS + mTLS | +| Service → Service | Pod → Pod (Cilium) | mTLS (WireGuard) | +| Service → Aiven | VPC Peering | TLS | +| Service → AWS (S3, KMS) | VPC Endpoints | TLS | + +--- + +# 🚪 **Gateway API Configuration** + +## Resources + +| Resource | Purpose | +|----------|---------| +| **GatewayClass** | Cilium implementation | +| **Gateway** | HTTPS listener, TLS termination | +| **HTTPRoute** | Routing vers services (path-based) | + +## Gateway Configuration + +| Setting | Value | Description | +|---------|-------|-------------| +| **GatewayClass** | `cilium` | Utilise le controller Cilium | +| **Listener** | HTTPS:443 | TLS termination | +| **TLS Mode** 
| Terminate | Certificate managed via External-Secrets |
+| **Allowed Routes** | All namespaces | Services can declare their own routes |
+
+## HTTPRoute Routing
+
+| Pattern | Example | Backend |
+|---------|---------|---------|
+| Path prefix | `/v1/ledger/*` | svc-ledger:8080 |
+| Path prefix | `/v1/wallet/*` | svc-wallet:8080 |
+| Path prefix | `/v1/merchant/*` | svc-merchant:8080 |
+| Exact path | `/health` | All services |
+
+---
+
+# 🔒 **Network Policies**
+
+## Default Deny Strategy
+
+| Policy | Effect |
+|--------|--------|
+| Default deny all | No traffic unless explicitly allowed |
+| Allow intra-namespace | Services in the same namespace can communicate |
+| Allow specific cross-namespace | Explicit svc-ledger → svc-wallet |
+| Allow egress Aiven | Services → VPC Peering range only |
+| Allow egress AWS endpoints | Services → VPC Endpoints only |
+
+## Cilium Network Policy Rules
+
+### Ingress Rules
+
+| From | To | Port | Protocol |
+|------|----|------|----------|
+| Gateway (platform) | All services | 8080, 50051 | TCP |
+| svc-ledger | svc-wallet | 50051 | gRPC |
+| Prometheus | All services | 8080 | metrics |
+
+### Egress Rules
+
+| From | To | Port | Description |
+|------|----|------|-------------|
+| All services | Aiven PostgreSQL | 5432 | Database |
+| All services | Aiven Kafka | 9092 | Messaging |
+| All services | Aiven Valkey | 6379 | Cache |
+| All services | AWS VPC Endpoints | 443 | S3, KMS, etc. |
+| OTel Collector | Tempo, Loki | 4317, 3100 | Telemetry |
+
+---
+
+# ☁️ **Cloudflare Architecture**
+
+## Why Cloudflare? 
+
+| Criterion | Cloudflare | AWS CloudFront + WAF | Verdict |
+|---------|------------|---------------------|---------|
+| **Cost** | Generous free tier | Paid from day one | ✅ Cloudflare |
+| **WAF** | Free (basic rules) | ~30€/month minimum | ✅ Cloudflare |
+| **DDoS** | Included (unlimited) | AWS Shield Standard free | ≈ Equal |
+| **SSL/TLS** | Free, auto-renew | ACM free | ≈ Equal |
+| **CDN** | 300+ PoPs, free | Paid per GB | ✅ Cloudflare |
+| **DNS** | Free, very fast | Route53 ~0.50€/zone | ✅ Cloudflare |
+| **Zero Trust** | Free up to 50 users | Cognito + ALB paid | ✅ Cloudflare |
+
+> **Decision:** Cloudflare at the edge, AWS as the backend. Best of both worlds.
+
+## Edge Layers
+
+```
+┌─────────────────────────────────────────────────────────────────────────────┐
+│                              CLOUDFLARE EDGE                                │
+├─────────────────────────────────────────────────────────────────────────────┤
+│                                                                             │
+│  LAYER 1: DNS                                                               │
+│  • Authoritative DNS (localplus.io)                                         │
+│  • DNSSEC enabled                                                           │
+│  • Geo-routing (future multi-region)                                        │
+│                                                                             │
+│  LAYER 2: DDoS Protection                                                   │
+│  • Layer 3/4 DDoS mitigation (automatic, unlimited)                         │
+│  • Layer 7 DDoS mitigation                                                  │
+│                                                                             │
+│  LAYER 3: WAF                                                               │
+│  • OWASP Core Ruleset                                                       │
+│  • Custom rules (rate limit, geo-block, bot score)                          │
+│                                                                             │
+│  LAYER 4: SSL/TLS                                                           │
+│  • Edge certificates (auto-issued)                                          │
+│  • Full (strict) mode → Origin certificate                                  │
+│  • TLS 1.2 minimum (PCI-DSS), TLS 1.3 preferred, HSTS enabled               │
+│                                                                             │
+│  LAYER 5: CDN & Caching                                                     │
+│  • Static assets caching                                                    │
+│  • Tiered caching                                                           │
+│                                                                             │
+│  LAYER 6: Cloudflare Tunnel                                                 │
+│  • No public IP needed on origin                                            │
+│  • Encrypted tunnel to EKS                                                  │
+│                                                                             │
+└─────────────────────────────────────────────────────────────────────────────┘
+```
+
+## Cloudflare Services
+
+| Service | Plan | Configuration | Cost |
+|---------|------|---------------|------|
+| **DNS** | Free | Authoritative, DNSSEC, proxy enabled | 0€ |
+| **CDN** | Free | Cache everything, tiered caching | 0€ |
+| **SSL/TLS** | Free | Full (strict), TLS 1.3, edge certs | 0€ |
+| 
**WAF** | Free | Managed ruleset, 5 custom rules | 0€ |
+| **DDoS** | Free | L3/L4/L7 protection, unlimited | 0€ |
+| **Bot Management** | Free | Basic bot score, JS challenge | 0€ |
+| **Rate Limiting** | Free | 1 rule (10K req/month free) | 0€ |
+| **Tunnel** | Free | Unlimited tunnels, cloudflared | 0€ |
+| **Access** | Free | Zero Trust, 50 users free | 0€ |
+
+**Total Cloudflare cost: 0€** (the free tier is enough to start)
+
+## WAF Rules Strategy
+
+| Rule Set | Type | Action | Purpose |
+|----------|------|--------|---------|
+| **OWASP Core** | Managed | Block | SQLi, XSS, LFI, RFI protection |
+| **Cloudflare Managed** | Managed | Block | Zero-day, emerging threats |
+| **Geo-Block** | Custom | Block | Block high-risk countries (optional) |
+| **Rate Limit API** | Custom | Challenge | > 100 req/min per IP on /api/* |
+| **Bot Score < 30** | Custom | Challenge | Likely bot traffic |
+
+## SSL/TLS Configuration
+
+| Setting | Value | Rationale |
+|---------|-------|-----------|
+| **SSL Mode** | Full (strict) | Origin has valid cert |
+| **Minimum TLS** | 1.2 | PCI-DSS compliance |
+| **TLS 1.3** | Enabled | Performance + security |
+| **HSTS** | Enabled (max-age=31536000) | Force HTTPS |
+| **Always Use HTTPS** | On | Redirect HTTP → HTTPS |
+| **Origin Certificate** | Cloudflare Origin CA | 15-year validity, free |
+
+## Cloudflare Tunnel
+
+| Component | Role | Deployment |
+|-----------|------|-------------|
+| **cloudflared daemon** | Tunnel agent | 2+ replicas, platform namespace |
+| **Tunnel credentials** | Authentication secret | Vault / External-Secrets |
+| **Tunnel config** | Routing rules | ConfigMap |
+| **Health checks** | Availability monitoring | Cloudflare dashboard |
+
+**Benefits:**
+- No public IP exposed on the origin
+- Outbound connections only (no inbound firewall rules)
+- End-to-end encryption
+- Automatic failover between replicas
+
+## Cloudflare Access (Zero Trust)
+
+| Resource | Policy | Authentication 
|
+|----------|--------|----------------|
+| **grafana.localplus.io** | Team only | GitHub SSO |
+| **argocd.localplus.io** | Team only | GitHub SSO |
+| **api.localplus.io/admin** | Admin only | GitHub SSO + MFA |
+| **api.localplus.io/*** | Public | No auth (application handles) |
+
+---
+
+# 🌍 **DNS Configuration**
+
+## DNS Records — localplus.io
+
+| Type | Name | Content | Proxy | TTL |
+|------|------|---------|-------|-----|
+| A | @ | Cloudflare Tunnel | ☁️ ON | Auto |
+| CNAME | www | @ | ☁️ ON | Auto |
+| CNAME | api | tunnel-xxx.cfargotunnel.com | ☁️ ON | Auto |
+| CNAME | grafana | tunnel-xxx.cfargotunnel.com | ☁️ ON | Auto |
+| CNAME | argocd | tunnel-xxx.cfargotunnel.com | ☁️ ON | Auto |
+| TXT | @ | SPF record | ☁️ OFF | Auto |
+| TXT | _dmarc | DMARC policy | ☁️ OFF | Auto |
+| MX | @ | Mail provider | ☁️ OFF | Auto |
+
+---
+
+# 🛣️ **Route53 — Internal & Backup DNS**
+
+| Use Case | Solution | Configuration |
+|----------|----------|---------------|
+| **Public DNS (Primary)** | Cloudflare | Authoritative for `localplus.io` |
+| **Public DNS (Backup)** | Route53 | Secondary zone, sync via AXFR |
+| **Private DNS (Internal)** | Route53 Private Hosted Zones | `*.internal.localplus.io` |
+| **Service Discovery** | Route53 + Cloud Map | Internal service resolution |
+| **Health Checks** | Route53 Health Checks | Automatic failover |
+
+## Hybrid DNS Architecture
+
+```
+┌─────────────────────────────────────────────────────────────────────────────┐
+│                              DNS ARCHITECTURE                               │
+├─────────────────────────────────────────────────────────────────────────────┤
+│                                                                             │
+│   EXTERNAL TRAFFIC                          INTERNAL TRAFFIC                │
+│                                                                             │
+│   ┌─────────────────┐                       ┌─────────────────┐             │
+│   │ Cloudflare DNS  │                       │ Route53 Private │             │
+│   │   (Primary)     │                       │  Hosted Zone    │             │
+│   │                 │                       │                 │             │
+│   │ localplus.io    │                       │ internal. 
│             │
+│   │ api.localplus.io│                       │  localplus.io   │             │
+│   └────────┬────────┘                       └────────┬────────┘             │
+│            │                                         │                      │
+│            │ Failover                                │ VPC DNS              │
+│            ▼                                         ▼                      │
+│   ┌─────────────────┐                       ┌─────────────────┐             │
+│   │ Route53 Public  │                       │  EKS CoreDNS    │             │
+│   │   (Backup)      │                       │  + Cloud Map    │             │
+│   │ Health checks   │                       │  svc-*.svc.     │             │
+│   │ Failover ready  │                       │  cluster.local  │             │
+│   └─────────────────┘                       └─────────────────┘             │
+│                                                                             │
+└─────────────────────────────────────────────────────────────────────────────┘
+```
+
+## Route53 Features
+
+| Feature | Local-Plus Use Case |
+|---------|---------------------|
+| **Private Hosted Zones** | Internal DNS resolution inside the VPC |
+| **Health Checks** | Automatic failover |
+| **Alias Records** | Pointing to ALB/NLB |
+| **Geolocation Routing** | Future multi-region |
+| **Failover Routing** | Backup if Cloudflare is down |
+| **Weighted Routing** | Canary deployments |
+
+---
+
+# 🚪 **API Gateway / APIM (Future)**
+
+> **Status:** To be defined later. For now: Cloudflare → Cilium Gateway → Services. 
+
+## Options to Evaluate
+
+| Solution | Type | Cost | Notes |
+|----------|------|------|-------|
+| **AWS API Gateway** | Managed | Pay-per-use | Simple, AWS-integrated |
+| **Gravitee CE** | Full APIM | Free | Portal, subscriptions included |
+| **Kong OSS** | Gateway | Free | Popular, rich plugin ecosystem |
+| **APISIX** | Gateway | Free | Cloud-native, high performance |
+
+**Decision deferred to Phase 2+, depending on needs:**
+- B2B/partner needs → APIM (Gravitee)
+- Just rate limiting/auth → AWS API Gateway
+- Multi-cloud required → APISIX or Kong
+
+## Current Architecture (Phase 1)
+
+```
+┌─────────────────────────────────────────────────────────────────────────────┐
+│                     SIMPLIFIED ARCHITECTURE — PHASE 1                       │
+├─────────────────────────────────────────────────────────────────────────────┤
+│                                                                             │
+│   Internet                                                                  │
+│      │                                                                      │
+│      ▼                                                                      │
+│   ┌─────────────────────────────────────────────────────────────────────┐   │
+│   │              CLOUDFLARE (DNS, WAF, DDoS, TLS)                       │   │
+│   └──────────────────────────────┬──────────────────────────────────────┘   │
+│                                  │                                          │
+│                                  │ Tunnel                                   │
+│                                  ▼                                          │
+│   ┌─────────────────────────────────────────────────────────────────────┐   │
+│   │     AWS EKS — Cilium Gateway API (internal routing, mTLS)           │   │
+│   │                                                                     │   │
+│   │     Services: svc-ledger, svc-wallet, svc-merchant, ...             │   │
+│   │                                                                     │   │
+│   └─────────────────────────────────────────────────────────────────────┘   │
+│                                                                             │
+│   No dedicated API Gateway for now — the Cilium Gateway API is enough.      │
+│                                                                             │
+└─────────────────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+# 🌍 **Multi-Cloud Vision**
+
+> **Goal:** The edge architecture (Cloudflare) is **cloud-agnostic** and can route to multiple cloud providers. 
+
+```
+┌─────────────────────────────────────────────────────────────────────────────┐
+│                      MULTI-CLOUD ARCHITECTURE (Future)                      │
+├─────────────────────────────────────────────────────────────────────────────┤
+│                                                                             │
+│                              CLOUDFLARE EDGE                                │
+│                         (Global Load Balancing)                             │
+│                                   │                                         │
+│                   ┌───────────────┼───────────────┐                         │
+│                   │               │               │                         │
+│                   ▼               ▼               ▼                         │
+│           ┌───────────────┐ ┌───────────────┐ ┌───────────────┐             │
+│           │  AWS (Primary)│ │  GCP (Future) │ │ Azure (Future)│             │
+│           │  eu-west-1    │ │  europe-west1 │ │  westeurope   │             │
+│           │               │ │               │ │               │             │
+│           │  Gateway +    │ │  Gateway +    │ │  Gateway +    │             │
+│           │  Services     │ │  Services     │ │  Services     │             │
+│           └───────────────┘ └───────────────┘ └───────────────┘             │
+│                                                                             │
+│   ┌─────────────────────────────────────────────────────────────────────┐   │
+│   │                 AIVEN (Multi-Cloud Data Layer)                      │   │
+│   │   • PostgreSQL with cross-cloud replication                         │   │
+│   │   • Kafka with cross-cloud MirrorMaker                              │   │
+│   │   • Valkey with replication                                         │   │
+│   └─────────────────────────────────────────────────────────────────────┘   │
+│                                                                             │
+└─────────────────────────────────────────────────────────────────────────────┘
+```
+
+## Multi-Cloud Readiness
+
+| Component | Multi-Cloud Ready | How |
+|-----------|-------------------|---------|
+| **Cloudflare** | ✅ Yes | Global load balancing, multi-origin health checks |
+| **APISIX** | ✅ Yes | Deployable on any K8s (EKS, GKE, AKS) |
+| **Aiven** | ✅ Yes | PostgreSQL, Kafka, Valkey available on AWS/GCP/Azure |
+| **ArgoCD** | ✅ Yes | Can manage multi-cloud clusters |
+| **Vault** | ✅ Yes | Cross-datacenter replication |
+| **OTel** | ✅ Yes | Open standard, interchangeable backends |
+
+## Multi-Cloud Phases
+
+| Phase | Scope | Timeline |
+|-------|-------|----------|
+| **Phase 1 (Current)** | AWS only, cloud-agnostic architecture | Now |
+| **Phase 2** | DR on GCP (read replicas, failover) | +12 months |
+| **Phase 3** | Active-active multi-cloud | +24 months |
+
+---
+
+*Maintained by: Platform Team*
+*Last updated: January 2026* 
diff --git a/observability/OBSERVABILITY-GUIDE.md b/observability/OBSERVABILITY-GUIDE.md
new file mode 100644
index 0000000..fdea3dc
--- /dev/null
+++ b/observability/OBSERVABILITY-GUIDE.md
@@ -0,0 +1,457 @@
+# 📊 **Observability Guide**
+## *LOCAL-PLUS Monitoring, Logging, Tracing & APM*
+
+---
+
+> **Back to**: [Architecture Overview](../EntrepriseArchitecture.md)
+
+---
+
+# 📋 **Table of Contents**
+
+1. [Stack Overview](#stack-overview)
+2. [Telemetry Pipeline](#telemetry-pipeline)
+3. [Metrics (Prometheus)](#metrics-prometheus)
+4. [Logs (Loki)](#logs-loki)
+5. [Traces (Tempo)](#traces-tempo)
+6. [APM (Application Performance Monitoring)](#apm-application-performance-monitoring)
+7. [Cardinality Management](#cardinality-management)
+8. [SLI/SLO/Error Budgets](#slisloerror-budgets)
+9. [Alerting Strategy](#alerting-strategy)
+10. [Dashboards & Visualizations](#dashboards--visualizations)
+
+---
+
+# 🏗️ **Stack Overview**
+
+## Self-Hosted Stack (Minimal Cost)
+
+| Component | Tool | Cost | Retention |
+|-----------|-------|------|-----------|
+| **Metrics** | Prometheus | 0€ (self-hosted) | 15 days local |
+| **Metrics long-term** | Prometheus Remote Write → S3-backed store | ~5€/month S3 | 1 year |
+| **Logs** | Loki | 0€ (self-hosted) | 30 days (GDPR) |
+| **Traces** | Tempo | 0€ (self-hosted) | 7 days |
+| **Dashboards** | Grafana | 0€ (self-hosted) | N/A |
+| **Fallback logs** | CloudWatch Logs | Free tier 5GB | 7 days |
+
+**Estimated cost: < 50€/month** (mostly S3 storage)
+
+### Note on long-term storage
+
+To keep metrics beyond 15 days:
+
+| Option | Description | Complexity |
+|--------|-------------|------------|
+| **Remote Write to an S3-backed store** | Prometheus streams samples to a remote backend that persists them in S3 | Simple |
+| **Grafana Mimir** | CNCF solution for scalable long-term storage | Medium |
+| **Victoria Metrics** | High-performance, Prometheus-compatible alternative | Medium |
+
+> **Local-Plus choice:** Remote Write 
to S3 via Grafana Mimir (or Victoria Metrics). Prometheus cannot write straight to object storage, so one of these backends provides the S3 storage layer while staying simple to operate.
+
+---
+
+# 🔄 **Telemetry Pipeline**
+
+```
+┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
+│  Applications   │      │  OTel Collector │      │    Backends     │
+│                 │      │                 │      │                 │
+│  • SDK Python   │─────►│  • Receivers    │─────►│  • Prometheus   │
+│  • Auto-instr   │      │  • Processors   │      │  • Loki         │
+│                 │      │  • Exporters    │      │  • Tempo        │
+└─────────────────┘      └─────────────────┘      └─────────────────┘
+                                  │
+                                  │ Scrubbing
+                                  ▼
+                         ┌─────────────────┐
+                         │ GDPR Compliant  │
+                         │ • No user_id    │
+                         │ • No PII        │
+                         │ • No PAN        │
+                         └─────────────────┘
+```
+
+## OTel Collector — Role
+
+| Component | Role | Examples |
+|-----------|------|----------|
+| **Receivers** | Ingest telemetry data | OTLP (gRPC/HTTP), Prometheus scrape |
+| **Processors** | Transform, filter, and enrich data | Batch, Memory limiter, Attribute deletion (PII), Sampling |
+| **Exporters** | Ship data to the backends | Prometheus, Loki, Tempo |
+
+## GDPR Compliance — Scrubbed Data
+
+| Data | Action | Reason |
+|--------|--------|--------|
+| `user.id` | Dropped | PII |
+| `user.email` | Dropped | PII |
+| `http.client_ip` | Hashed | Anonymization |
+| High-cardinality `*_bucket` | Filtered | Performance |
+
+---
+
+# 📈 **Metrics (Prometheus)**
+
+## How Prometheus collects metrics
+
+Prometheus uses a **pull** model: it fetches the metrics from each target itself. 
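The pull model can be sketched with nothing but Python's standard library: a target is just an HTTP endpoint serving plain text in the Prometheus exposition format, and a scrape is a plain GET. This is only an illustrative sketch (real services use a client library such as `prometheus-fastapi-instrumentator`); the metric names below are made up for the demo.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Prometheus text exposition format: HELP/TYPE comments, then samples.
METRICS = """\
# HELP http_requests_total Total HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027
# HELP http_requests_in_flight Requests currently being served.
# TYPE http_requests_in_flight gauge
http_requests_in_flight 3
"""

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = METRICS.encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# This GET is exactly what Prometheus does on every scrape interval.
url = f"http://127.0.0.1:{server.server_port}/metrics"
scraped = urllib.request.urlopen(url).read().decode()
server.shutdown()
```

In production the scrape interval, port, and path come from the ServiceMonitor described below; the service only has to keep the endpoint cheap and fast.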
+
+```
+┌─────────────────────────────────────────────────────────────────────────────┐
+│                         PROMETHEUS — PULL MODEL                             │
+├─────────────────────────────────────────────────────────────────────────────┤
+│                                                                             │
+│                               PROMETHEUS                                    │
+│                               (scraping)                                    │
+│                                   │                                         │
+│                ┌──────────────────┼──────────────────┐                      │
+│                │                  │                  │                      │
+│                ▼                  ▼                  ▼                      │
+│          ┌──────────┐       ┌──────────┐       ┌──────────┐                 │
+│          │  Pod A   │       │  Pod B   │       │  Pod C   │                 │
+│          │          │       │          │       │          │                 │
+│          │  :8080   │       │  :8080   │       │  :9090   │                 │
+│          │ /metrics │       │ /metrics │       │ /metrics │                 │
+│          └──────────┘       └──────────┘       └──────────┘                 │
+│                                                                             │
+│   Prometheus issues GET http://pod:port/metrics every 30s                   │
+│                                                                             │
+└─────────────────────────────────────────────────────────────────────────────┘
+```
+
+## Target Discovery — ServiceMonitor (Prometheus Operator)
+
+The **Prometheus Operator** uses **Custom Resources** to configure scrape targets automatically.
+
+| Resource | What it does |
+|-----------|-----------------|
+| **ServiceMonitor** | Selects Services by label; Prometheus scrapes the pods behind them |
+| **PodMonitor** | Selects Pods directly by label |
+
+**Flow:**
+
+1. The developer deploys the service with a label (e.g. `app: svc-ledger`)
+2. A ServiceMonitor selects that label
+3. The Prometheus Operator configures Prometheus automatically
+4. 
Prometheus scrapes `/metrics` on the declared port
+
+**Benefits:**
+- GitOps-friendly — separate file, versioned, reviewable
+- Separation of concerns — monitoring decoupled from deployment
+- Flexibility — intervals, relabeling, TLS, authentication
+
+## Typical Endpoints
+
+| Service | Port | Path | Description |
+|---------|------|------|-------------|
+| **FastAPI (Python)** | 8080 | `/metrics` | Via `prometheus-fastapi-instrumentator` |
+| **Go gRPC** | 9090 | `/metrics` | Via `promhttp` handler |
+| **Grafana** | 3000 | `/metrics` | Internal metrics |
+| **ArgoCD** | 8083 | `/metrics` | Application metrics |
+| **Node Exporter** | 9100 | `/metrics` | System metrics (CPU, RAM, disk) |
+
+---
+
+# 📝 **Logs (Loki)**
+
+## Configuration
+
+| Parameter | Value | Reason |
+|-----------|--------|--------|
+| **Retention** | 30 days | GDPR compliance |
+| **Max query series** | 5000 | Performance protection |
+| **Max entries per query** | 10000 | Performance protection |
+| **Storage backend** | S3 | Low cost, durability |
+
+## Log Labels (Low Cardinality)
+
+| Label | Example | Cardinality |
+|-------|---------|-------------|
+| `namespace` | svc-ledger | Low |
+| `pod` | svc-ledger-abc123 | Medium |
+| `container` | svc-ledger | Low |
+| `level` | info, error, warn | Very Low |
+| `stream` | stdout, stderr | Very Low |
+
+**⚠️ Never use as labels:** `user_id`, `request_id`, `trace_id`
+
+---
+
+# 🔍 **Traces (Tempo)**
+
+## Configuration
+
+| Parameter | Value | Reason |
+|-----------|--------|--------|
+| **Retention** | 7 days | Cost / usefulness trade-off |
+| **Backend** | S3 | Durability |
+| **Protocol** | OTLP (gRPC + HTTP) | OTel standard |
+
+## Trace-to-Logs Correlation
+
+```
+┌─────────────────┐     trace_id     ┌─────────────────┐
+│     TRACES      │◄────────────────►│      LOGS       │
+│     (Tempo)     │                  │     (Loki)      │
+└────────┬────────┘                  └────────┬────────┘
+         │                                    │
+         │  Exemplars (trace_id in metrics)   │
+         │                                    │
+         ▼                                    ▼
+┌─────────────────────────────────────────────────────────┐
+│ 
GRAFANA                         │
+│   • Click trace → See logs for that request             │
+│   • Click metric spike → Jump to exemplar trace         │
+│   • Click error log → Navigate to full trace            │
+└─────────────────────────────────────────────────────────┘
+```
+
+---
+
+# 🎯 **APM (Application Performance Monitoring)**
+
+## APM Stack
+
+| Component | Tool | Usage |
+|-----------|-------|-------|
+| **Distributed Tracing** | Tempo + OTel | Request flow, latency breakdown |
+| **Profiling** | Pyroscope (Grafana) | Continuous CPU/memory profiling |
+| **Error Tracking** | Sentry (self-hosted) | Exception tracking, stack traces |
+| **Database APM** | pg_stat_statements | Query performance |
+| **Real User Monitoring** | Grafana Faro | Frontend performance (if applicable) |
+
+## Sampling Strategy
+
+| Environment | Head Sampling | Tail Sampling | Rationale |
+|-------------|---------------|---------------|-----------|
+| **Dev** | 100% | N/A | Full visibility for debugging |
+| **Staging** | 50% | Errors: 100% | Balance cost/visibility |
+| **Prod** | 10% | Errors: 100%, Slow: 100% (>500ms) | Cost optimization |
+
+### Tail Sampling — Rules
+
+| Rule | Condition | Why |
+|-------|-----------|----------|
+| **error-policy** | Status = ERROR | Always keep errors |
+| **slow-policy** | Latency > 500ms | Catch slow requests |
+| **probabilistic-policy** | 10% random | Baseline sampling |
+
+---
+
+# 📉 **Cardinality Management**
+
+## Label Rules
+
+| Label | Action | Rationale |
+|-------|--------|-----------|
+| `user_id` | DROP | High cardinality, use traces |
+| `request_id` | DROP | Use trace_id instead |
+| `http.url` | DROP | Unique URLs = series explosion |
+| `http.route` | KEEP | Templated, low cardinality |
+| `service.name` | KEEP | Essential |
+| `http.method` | KEEP | Low cardinality |
+| `http.status_code` | KEEP | Low cardinality |
+
+## Cardinality Limits
+
+| Metric Type | Max Labels | Max Series |
+|-------------|------------|------------|
+| Counter | 5 | 1000 |
+| 
Histogram | 4 | 500 |
+| Gauge | 5 | 1000 |
+
+---
+
+# 🎯 **SLI/SLO/Error Budgets**
+
+## Service SLOs
+
+| Service | SLI | SLO | Error Budget | Burn Rate Alert |
+|---------|-----|-----|--------------|-----------------|
+| **svc-ledger** | Availability | 99.9% | 43 min/month | 14.4x = 1h alert |
+| **svc-ledger** | Latency P99 | < 200ms | N/A | P99 > 200ms for 5min |
+| **svc-wallet** | Availability | 99.9% | 43 min/month | 14.4x = 1h alert |
+| **Platform** | Availability | 99.5% | 3.6h/month | 6x = 2h alert |
+
+## SLO Formulas
+
+| Metric | Formula | Meaning |
+|----------|---------|---------------|
+| **Availability** | `1 - (errors / total)` | % of requests without a 5xx error |
+| **Error Budget Remaining** | `1 - ((1 - availability) / (1 - SLO))` | % of the budget left |
+| **Burn Rate** | `error_rate / allowed_error_rate` | How fast the budget is being consumed |
+
+---
+
+# 🚨 **Alerting Strategy**
+
+## Severity Levels
+
+| Severity | Example | Notification | On-call |
+|----------|---------|--------------|---------|
+| **P1 — Critical** | svc-ledger down | PagerDuty immediate | Wake up |
+| **P2 — High** | Error rate > 5% | Slack + PagerDuty 15min | Within 30min |
+| **P3 — Medium** | Latency P99 > 500ms | Slack | Business hours |
+| **P4 — Low** | Disk usage > 80% | Slack | Next day |
+
+## Main Alerts
+
+| Alert | Condition | Severity | Action |
+|--------|-----------|----------|--------|
+| **ServiceDown** | `up == 0` for 1min | P1 | Runbook: restart, check logs |
+| **HighErrorRate** | Error rate > 5% for 5min | P2 | Investigate traces + Sentry |
+| **LatencyDegradation** | P99 > 2x baseline for 10min | P2 | Check slow spans in Tempo |
+| **DiskAlmostFull** | Disk > 80% | P4 | Extend volume or cleanup |
+
+---
+
+# 📊 **Dashboards & Visualizations**
+
+## Grafana visualization types per metric type
+
+### Counter
+
+> **Definition:** A value that can only increase (or reset to 0 on restart). 
+
+| Visualization | Query | When to use |
+|---------------|-------|----------------|
+| **Stat (number)** | `sum(http_requests_total)` | Absolute total |
+| **Time Series (rate)** | `rate(http_requests_total[5m])` | Throughput per second (RPS) |
+| **Bar Gauge** | `sum by (status_code) (rate(http_requests_total[5m]))` | Comparison across labels |
+
+```
+Visual example — Counter as Time Series (rate)
+
+  RPS
+  30 │           ╭───╮
+     │      ╭────╯   │
+  20 │───╯           │
+     │               ╰────╮
+  10 │                    ╰─────
+     └─────────────────────────▶ time
+      10:00     10:05     10:10
+```
+
+### Gauge
+
+> **Definition:** An instantaneous value that can go up or down (temperature, active connections, CPU%).
+
+| Visualization | Query | When to use |
+|---------------|-------|----------------|
+| **Gauge (dial)** | `pg_stat_activity_count` | Visual current value |
+| **Stat** | `node_memory_MemAvailable_bytes / 1e9` | Simple value with a unit |
+| **Time Series** | `process_resident_memory_bytes` | Evolution over time |
+| **Heatmap** | `avg by (pod) (container_memory_usage_bytes)` | Multi-pod comparison |
+
+```
+Visual example — Gauge as a dial
+
+        ┌─────────────────┐
+        │      CPU %      │
+        │                 │
+        │    ┌───────┐    │
+        │    │  67%  │    │
+        │    │  ███  │    │
+        │    └───────┘    │
+        │   0%     100%   │
+        └─────────────────┘
+```
+
+### Histogram
+
+> **Definition:** A distribution of values across "buckets" (e.g. latency). Enables percentile calculations. 
+
+| Visualization | Query | When to use |
+|---------------|-------|----------------|
+| **Time Series (P50/P95/P99)** | `histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))` | Latency trends |
+| **Heatmap** | `sum by (le) (rate(http_request_duration_seconds_bucket[5m]))` | Visual distribution |
+| **Stat** | `histogram_quantile(0.95, ...)` | Current P95 value |
+
+```
+Visual example — Histogram as a Heatmap (latency)
+
+  Latency
+  1s    │░░▓▓░░
+  500ms │▓▓▓▓▓▓████████
+  200ms │████████████████████████
+  100ms │██████████████████████████████
+  50ms  │████████████████████████████████████
+        └────────────────────────────────────▶ time
+         10:00        10:30        11:00
+
+  ░ = few requests   ▓ = some   █ = many
+```
+
+### Summary
+
+> **Definition:** Like a Histogram, but the percentiles are computed client-side (less flexible).
+
+| Visualization | Query | When to use |
+|---------------|-------|----------------|
+| **Time Series** | `go_gc_duration_seconds{quantile="0.99"}` | Pre-computed |
+| **Stat** | `go_gc_duration_seconds{quantile="0.5"}` | Median |
+
+> **Note:** Prefer Histogram. Summary is mostly seen in legacy Go exporters. 
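The `histogram_quantile` queries used throughout this guide interpolate a percentile from cumulative bucket counts. A simplified pure-Python sketch of that interpolation (Prometheus's real implementation also handles the mandatory `+Inf` bucket and other edge cases; the bucket values here are invented for illustration):

```python
def histogram_quantile(q, buckets):
    """Approximate the q-quantile from cumulative histogram buckets,
    in the spirit of PromQL's histogram_quantile().
    `buckets` is a list of (upper_bound, cumulative_count), sorted by bound."""
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total  # target cumulative count
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if count == prev_count:
                return bound
            # Linear interpolation inside the bucket, as Prometheus does.
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Cumulative counts: 60 requests ≤ 50ms, 90 ≤ 100ms, 99 ≤ 200ms, 100 ≤ 500ms
buckets = [(0.05, 60), (0.1, 90), (0.2, 99), (0.5, 100)]
p50 = histogram_quantile(0.50, buckets)  # interpolated inside the 0-50ms bucket
p99 = histogram_quantile(0.99, buckets)
```

This also makes the cardinality trade-off concrete: percentile accuracy depends entirely on bucket boundaries, which is why the bucket count per histogram is capped in the limits above.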
+
+---
+
+## Recommended dashboards per audience
+
+### Dashboard 1: Service Overview (On-call)
+
+| Panel | Type | Metric | Visualization |
+|-------|------|----------|---------------|
+| **Request Rate** | Counter | `rate(http_requests_total[5m])` | Time Series |
+| **Error Rate %** | Counter | `rate(errors[5m]) / rate(total[5m]) * 100` | Time Series + Threshold |
+| **Latency P50/P95/P99** | Histogram | `histogram_quantile(...)` | Time Series (3 lines) |
+| **Active Requests** | Gauge | `http_requests_in_flight` | Stat |
+
+### Dashboard 2: Infrastructure (Platform Team)
+
+| Panel | Type | Metric | Visualization |
+|-------|------|----------|---------------|
+| **CPU Usage %** | Counter | `rate(container_cpu_usage_seconds_total[5m])` | Gauge dial |
+| **Memory Usage** | Gauge | `container_memory_usage_bytes` | Bar Gauge |
+| **Network I/O** | Counter | `rate(container_network_receive_bytes_total[5m])` | Time Series |
+| **Disk Usage %** | Gauge | `node_filesystem_avail_bytes / node_filesystem_size_bytes` | Gauge |
+| **Pod Count** | Gauge | `kube_pod_status_phase{phase="Running"}` | Stat |
+
+### Dashboard 3: Database (Backend Devs)
+
+| Panel | Type | Metric | Visualization |
+|-------|------|----------|---------------|
+| **Active Connections** | Gauge | `pg_stat_activity_count` | Gauge dial |
+| **Query Duration P95** | Histogram | `pg_stat_statements_mean_time_seconds` | Time Series |
+| **Transactions/sec** | Counter | `rate(pg_stat_database_xact_commit[5m])` | Time Series |
+| **Replication Lag** | Gauge | `pg_replication_lag_seconds` | Stat with threshold |
+| **Cache Hit Ratio** | Gauge | `pg_stat_database_blks_hit / (blks_hit + blks_read)` | Stat % |
+
+### Dashboard 4: Business Metrics (Product)
+
+| Panel | Type | Metric | Visualization |
+|-------|------|----------|---------------|
+| **Transactions Created** | Counter | `sum(rate(ledger_transactions_total[1h]))` | Stat (big number) |
+| **Total Amount Processed** | Counter | 
`sum(ledger_amount_processed_total)` | Stat with € unit |
+| **Active Wallets** | Gauge | `wallet_active_count` | Stat |
+| **Business Errors** | Counter | `sum by (error_type) (rate(business_errors_total[5m]))` | Bar chart |
+
+---
+
+## Recap: which type for which metric?
+
+| Metric | Prometheus Type | Grafana Visualization |
+|----------|-----------------|----------------------|
+| Request count | Counter | Time Series (rate) |
+| Total errors | Counter | Time Series (rate) + Stat |
+| Latency | Histogram | Time Series (quantile) + Heatmap |
+| Active connections | Gauge | Gauge dial or Stat |
+| Memory used | Gauge | Time Series or Bar Gauge |
+| CPU % | Gauge | Gauge dial |
+| GC duration | Summary | Time Series |
+| Queue size | Gauge | Stat with threshold |
+
+---
+
+*Maintained by: Platform Team*
+*Last updated: January 2026*
diff --git a/platform/PLATFORM-ENGINEERING.md b/platform/PLATFORM-ENGINEERING.md
new file mode 100644
index 0000000..18b4145
--- /dev/null
+++ b/platform/PLATFORM-ENGINEERING.md
@@ -0,0 +1,340 @@
+# 🛠️ **Platform Engineering**
+## *LOCAL-PLUS Contracts, Golden Path & Operations*
+
+---
+
+> **Back to**: [Architecture Overview](../EntrepriseArchitecture.md)
+
+---
+
+# 📋 **Table of Contents**
+
+1. [Platform Contracts](#platform-contracts)
+2. [CI/CD & Delivery](#cicd--delivery)
+3. [Golden Path (New Service Checklist)](#golden-path-new-service-checklist)
+4. [On-Call Structure](#on-call-structure)
+5. [Incident Management](#incident-management)
+6. 
[Service Templates](#service-templates)
+
+---
+
+# 📜 **Platform Contracts**
+
+## Guarantees
+
+| Contract | Platform Guarantee | Service Responsibility |
+|---------|-------------------|------------------------|
+| **Deployment** | Git push → Prod < 15min | Valid K8s manifests |
+| **Secrets** | Vault dynamic, auto rotation | Use External-Secrets |
+| **Observability** | Auto-collection of traces/metrics/logs | OTel instrumentation |
+| **Networking** | mTLS enforced, Gateway API | Declare routes in HTTPRoute |
+| **Scaling** | HPA available | Configure requests/limits |
+| **Security** | Policies enforced | Pass the policies |
+
+---
+
+# 🚀 **CI/CD & Delivery**
+
+## GitOps with ArgoCD
+
+| Concept | Implementation |
+|---------|----------------|
+| **Source of Truth** | Git repositories |
+| **Delivery Model** | Pull-based (ArgoCD syncs from Git) |
+| **Environments** | Kustomize overlays (dev/staging/prod) |
+| **Promotion** | PR from dev → staging → prod overlays |
+
+## GitHub Actions — Reusable Workflows
+
+> Shared workflows standardize the CI/CD pipelines. 
+
+| Type | Location | Usage |
+|------|--------------|-------|
+| **Reusable workflows** | `.github/workflows/` | Shared build, test, deploy |
+| **Composite actions** | `.github/actions/` | Reusable common steps |
+
+## Standard Workflows
+
+| Workflow | Description | Target repos |
+|----------|-------------|--------------|
+| `ci-python.yml` | Lint, test, build | `svc-*`, `sdk-python` |
+| `ci-terraform.yml` | Format, lint, plan, apply | `platform-*`, `bootstrap` |
+| `cd-argocd.yml` | Trigger ArgoCD sync | All |
+| `security-scan.yml` | Trivy, Checkov, tfsec | All |
+
+## Pipeline Stages
+
+```
+┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
+│  Lint   │──►│  Test   │──►│  Build  │──►│  Scan   │──►│  Push   │
+└─────────┘   └─────────┘   └─────────┘   └─────────┘   └─────────┘
+     │             │             │             │             │
+     │             │             │             │             ▼
+     │             │             │             │      ┌─────────────┐
+     │             │             │             │      │ ArgoCD Sync │
+     │             │             │             │      └─────────────┘
+     ▼             ▼             ▼             ▼
+ Fail fast     Coverage      Image tag     CVE check
+```
+
+## Deployment SLA
+
+| Metric | Target | Measurement |
+|--------|--------|-------------|
+| **Git to Dev** | < 5 min | Commit to ArgoCD sync |
+| **Git to Staging** | < 10 min | Commit to ArgoCD sync (manual approval) |
+| **Git to Prod** | < 15 min | Commit to ArgoCD sync (manual approval) |
+| **Rollback** | < 2 min | ArgoCD rollback |
+
+## Observability Requirements
+
+| Signal | Requirement | Enforcement |
+|--------|-------------|-------------|
+| **Metrics** | `/metrics` endpoint exposed | Kyverno policy |
+| **Logs** | Structured JSON, no PII | OTel scrubbing |
+| **Traces** | OTel SDK instrumentation | Service template |
+| **Health** | `/health/live` + `/health/ready` | Kyverno policy |
+
+## Security Baseline
+
+| Requirement | Enforcement | Exception Process |
+|-------------|-------------|-------------------|
+| Non-root containers | Kyverno policy | ADR + Platform approval |
+| Read-only filesystem | Kyverno policy | ADR + Platform approval |
+| Resource limits | Kyverno policy | None |
+| Image signature 
| Kyverno policy | None |
+| **mTLS** | Cilium automatic | None |
+
+---
+
+# 🛤️ **Golden Path (New Service Checklist)**
+
+## Prerequisites
+
+- [ ] GitHub repo created from template
+- [ ] Team assigned in GitHub
+- [ ] CODEOWNERS configured
+
+## Step-by-Step Checklist
+
+| Step | Action | Validation | Owner |
+|------|--------|------------|-------|
+| 1 | Create repo from template | Structure compliant | Dev |
+| 2 | Define protos in `contracts-proto` | `buf lint` passes | Dev |
+| 3 | Implement service | Unit tests > 80% coverage | Dev |
+| 4 | Configure K8s manifests | Kyverno policies pass | Dev |
+| 5 | Configure External-Secret | Secrets resolved | Dev + Platform |
+| 6 | Add ServiceMonitor | Metrics visible in Grafana | Dev |
+| 7 | Create HTTPRoute | Traffic routable | Dev |
+| 8 | Configure alerts | Runbook links | Dev + Platform |
+| 9 | PR review | Merge → Auto-deploy dev | Dev + Reviewer |
+| 10 | Staging validation | E2E tests pass | QA |
+| 11 | Prod deployment | Manual approval | Tech Lead |
+
+## Post-Deployment
+
+- [ ] Dashboard created in Grafana
+- [ ] Runbook documented
+- [ ] On-call routing configured
+- [ ] Load test baseline established
+
+---
+
+# 📞 **On-Call Structure**
+
+## Team Rotation (5 people)
+
+| Role | Responsibility | Rotation | Escalation |
+|------|---------------|----------|------------|
+| **Primary** | First responder, triage | Weekly | → Secondary (15min) |
+| **Secondary** | Escalation, expertise | Weekly | → Incident Commander |
+| **Incident Commander** | Coordination for P1 incidents | On-demand | → Management |
+
+## Rotation Schedule
+
+| Week | Primary | Secondary |
+|------|---------|-----------|
+| 1 | Alice | Bob |
+| 2 | Bob | Charlie |
+| 3 | Charlie | Diana |
+| 4 | Diana | Eve |
+| 5 | Eve | Alice |
+
+## On-Call Expectations
+
+| Aspect | Requirement |
+|--------|-------------|
+| **Response Time (P1)** | < 5 min acknowledge |
+| **Response Time (P2)** | < 15 min acknowledge |
+| **Response Time (P3)** | < 1 hour
acknowledge | +| **Availability** | Reachable 24/7 during rotation | +| **Handoff** | 30 min sync at rotation change | + +## Compensation + +| Activity | Compensation | +|----------|--------------| +| On-call week | Flat bonus | +| Night incident (22:00-08:00) | Time-off + bonus | +| Weekend incident | 1.5x time-off | + +--- + +# 🚨 **Incident Management** + +## Severity Levels + +| Severity | Definition | Response | Communication | +|----------|------------|----------|---------------| +| **P1 — Critical** | Service down, data loss risk | Immediate, all hands | Slack + PagerDuty + Status page | +| **P2 — High** | Degraded service, high error rate | Within 30 min | Slack + PagerDuty | +| **P3 — Medium** | Performance degradation | Business hours | Slack | +| **P4 — Low** | Minor issues, no user impact | Next sprint | Ticket | + +## Incident Workflow + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ INCIDENT WORKFLOW │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ │ +│ 1. DETECT │ +│ • Alert fires (Prometheus/Grafana) │ +│ • User report │ +│ • Synthetic monitoring │ +│ │ +│ 2. TRIAGE (Primary on-call) │ +│ • Acknowledge alert │ +│ • Assess severity │ +│ • Start incident channel (#inc-YYYYMMDD-short-name) │ +│ │ +│ 3. MITIGATE │ +│ • Apply runbook │ +│ • Rollback if needed │ +│ • Escalate if stuck > 15 min │ +│ │ +│ 4. RESOLVE │ +│ • Confirm service restored │ +│ • Update status page │ +│ • Close alert │ +│ │ +│ 5. 
POST-MORTEM (within 48h for P1/P2) │ +│ • Blameless analysis │ +│ • Root cause identification │ +│ • Action items with owners │ +│ │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +## Runbook Structure + +| Section | Contenu | +|---------|---------| +| **Overview** | Description de l'alerte | +| **Impact** | User impact (High/Medium/Low), Business impact | +| **Prerequisites** | Accès et permissions nécessaires | +| **Diagnosis Steps** | Étapes de diagnostic (dashboards, logs, métriques) | +| **Resolution Steps** | Actions de résolution | +| **Escalation** | Contacts et délais | +| **Related** | Dashboards, alertes liées, incidents passés | + +## Post-Mortem Structure + +| Section | Contenu | +|---------|---------| +| **Summary** | Résumé en un paragraphe | +| **Timeline** | Chronologie des événements (UTC) | +| **Root Cause** | Cause racine détaillée | +| **Impact** | Users affectés, impact revenue, SLO burn | +| **What Went Well** | Ce qui a bien fonctionné | +| **What Went Wrong** | Ce qui a mal fonctionné | +| **Action Items** | Actions avec owner et due date | +| **Lessons Learned** | Leçons apprises | + +--- + +# 📦 **Service Templates** + +## Python Service Template — Structure + +| Répertoire | Contenu | +|------------|---------| +| `src/app/` | Code applicatif FastAPI | +| `src/app/api/` | Routes et dependencies | +| `src/app/domain/` | Entities et services métier | +| `src/app/infrastructure/` | Database, Kafka, Cache | +| `tests/` | Unit et integration tests | +| `k8s/base/` | Manifests Kubernetes de base | +| `k8s/overlays/` | Overlays par environnement (dev, staging, prod) | +| `migrations/` | Alembic migrations | + +## Kubernetes Manifests Inclus + +| Fichier | Rôle | +|---------|------| +| `deployment.yaml` | Définition du Deployment | +| `service.yaml` | Exposition interne (ClusterIP) | +| `configmap.yaml` | Configuration non-sensible | +| `hpa.yaml` | Horizontal Pod Autoscaler | +| `pdb.yaml` | Pod Disruption 
Budget | +| `servicemonitor.yaml` | Prometheus scraping | +| `kustomization.yaml` | Kustomize base | + +## Deployment Configuration + +| Setting | Valeur | Raison | +|---------|--------|--------| +| **Replicas** | 2 (min) | Haute disponibilité | +| **Security Context** | runAsNonRoot: true, runAsUser: 1000 | Sécurité | +| **Filesystem** | readOnlyRootFilesystem: true | Sécurité | +| **Capabilities** | drop: ALL | Principle of least privilege | +| **Resources requests** | CPU: 100m, Memory: 256Mi | Scheduling | +| **Resources limits** | CPU: 500m, Memory: 512Mi | Protection contre les fuites | + +## Health Probes + +| Probe | Path | Delay | Period | +|-------|------|-------|--------| +| **Liveness** | `/health/live` | 10s | 10s | +| **Readiness** | `/health/ready` | 5s | 5s | + +## Horizontal Pod Autoscaler + +| Métrique | Target | Min/Max Replicas | +|----------|--------|------------------| +| **CPU** | 70% average utilization | 2 / 10 | +| **Memory** | 80% average utilization | 2 / 10 | + +## ServiceMonitor Configuration + +| Setting | Valeur | +|---------|--------| +| **Port** | http (8080) | +| **Path** | /metrics | +| **Interval** | 30s | +| **Selector** | matchLabels: app: ${SERVICE_NAME} | + +--- + +## SLI/SLO/Error Budgets + +| Service | SLI | SLO | Error Budget | Burn Rate Alert | +|---------|-----|-----|--------------|-----------------| +| **svc-ledger** | Availability | 99.9% | 43 min/mois | 14.4x = 1h alert | +| **svc-ledger** | Latency P99 | < 200ms | N/A | P99 > 200ms for 5min | +| **svc-wallet** | Availability | 99.9% | 43 min/mois | 14.4x = 1h alert | +| **Platform (ArgoCD, Prometheus)** | Availability | 99.5% | 3.6h/mois | 6x = 2h alert | + +## Error Budget Policy + +| Budget Consumed | Action | +|-----------------|--------| +| < 50% | Normal development velocity | +| 50-75% | Increased testing, careful deployments | +| 75-90% | Feature freeze, reliability focus | +| > 90% | Emergency mode, only critical fixes | + +--- + +*Document maintenu par : 
Platform Team* +*Dernière mise à jour : Janvier 2026* diff --git a/resilience/DR-GUIDE.md b/resilience/DR-GUIDE.md new file mode 100644 index 0000000..59ec32f --- /dev/null +++ b/resilience/DR-GUIDE.md @@ -0,0 +1,341 @@ +# ⚡ **Resilience & Disaster Recovery Guide** +## *LOCAL-PLUS Backup, Recovery & Business Continuity* + +--- + +> **Retour vers** : [Architecture Overview](../EntrepriseArchitecture.md) +> **Voir aussi** : [Testing Strategy](../testing/TESTING-STRATEGY.md) — Tests de performance et chaos + +--- + +# 📋 **Table of Contents** + +1. [Failure Modes](#failure-modes) +2. [Backup Strategy](#backup-strategy) +3. [Automated Recovery](#automated-recovery) +4. [Chaos Engineering](#chaos-engineering) +5. [Disaster Recovery](#disaster-recovery) +6. [Business Continuity](#business-continuity) + +--- + +# 🔥 **Failure Modes** + +## Failure Matrix + +| Failure | Detection | Recovery | RTO | Impact | +|---------|-----------|----------|-----|--------| +| **Pod crash** | Liveness probe | K8s restart automatique | < 30s | None (replicas) | +| **Node failure** | Node NotReady | Pod reschedule automatique | < 2min | Minor latency spike | +| **AZ failure** | Multi-AZ detect | Traffic shift automatique | < 5min | Reduced capacity | +| **DB primary failure** | Aiven health | Failover automatique | < 5min | Brief connection errors | +| **Kafka broker failure** | Aiven health | Rebalance automatique | < 2min | Brief producer retries | +| **Cache failure** | Health check | Fallback to DB automatique | < 1min | Increased latency | +| **Full region failure** | Health checks | DR procedure | 4h (target) | Extended outage | + +## Blast Radius Analysis + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ BLAST RADIUS ANALYSIS │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ │ +│ SINGLE POD FAILURE │ +│ └── Impact: None (other replicas serve traffic) │ +│ └── Recovery: Automatique via Kubernetes │ +│ │ +│ 
SINGLE NODE FAILURE │ +│ └── Impact: 10-20% capacity loss temporarily │ +│ └── Recovery: Automatique via pod anti-affinity + reschedule │ +│ │ +│ SINGLE AZ FAILURE │ +│ └── Impact: 33% capacity loss │ +│ └── Recovery: Automatique via Multi-AZ + ALB health checks │ +│ │ +│ DATABASE PRIMARY FAILURE │ +│ └── Impact: Write unavailability ~5 min │ +│ └── Recovery: Automatique via Aiven failover │ +│ │ +│ FULL REGION FAILURE │ +│ └── Impact: Complete service unavailability │ +│ └── Recovery: Semi-automatique (DR procedure) │ +│ │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +--- + +# 💾 **Backup Strategy** + +## Backup Matrix + +| Data | Method | Frequency | Retention | Location | Encryption | +|------|--------|-----------|-----------|----------|------------| +| **PostgreSQL** | Aiven automated | Hourly | 7 jours | Aiven (cross-AZ) | AES-256 | +| **PostgreSQL PITR** | Aiven WAL | Continuous | 24h | Aiven | AES-256 | +| **Kafka** | Topic retention | N/A | 7 jours | Aiven | AES-256 | +| **Valkey** | RDB + AOF | Continuous | 24h | Aiven | AES-256 | +| **Terraform state** | S3 versioning | Every apply | 90 jours | S3 | KMS | +| **Git repos** | GitHub | Every push | Infini | GitHub | At-rest | +| **Secrets (Vault)** | Integrated storage | Continuous | 30 jours | Vault HA | Transit | + +## Backup Verification — Automatisée + +| Check | Frequency | Automation | Alert si échec | +|-------|-----------|------------|----------------| +| PostgreSQL restore test | Weekly | Job K8s scheduled | P2 | +| Terraform state integrity | Daily | CI pipeline | P3 | +| Vault backup verification | Weekly | Job K8s scheduled | P2 | +| Git clone verification | Monthly | GitHub Actions | P4 | + +## Backup Monitoring + +| Metric | Alert Threshold | Severity | +|--------|-----------------|----------| +| Last backup age | > 2 hours | P2 | +| Backup size anomaly | > 50% change | P3 | +| Backup job failure | Any failure | P2 | +| PITR lag | > 1 hour | P2 | + +--- 
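
The thresholds in the Backup Monitoring table above can be encoded directly in the scheduled verification job. A minimal Python sketch, assuming nothing beyond those thresholds (function name and alert messages are illustrative, not an existing job):

```python
from datetime import datetime, timedelta, timezone

# Thresholds mirror the Backup Monitoring table:
# last backup age > 2 hours -> P2, size change > 50% -> P3.
AGE_THRESHOLD = timedelta(hours=2)
SIZE_CHANGE_THRESHOLD = 0.5

def backup_alerts(last_backup_at, size_bytes, previous_size_bytes, now=None):
    """Return a list of (severity, message) tuples for the latest backup."""
    now = now or datetime.now(timezone.utc)
    alerts = []
    if now - last_backup_at > AGE_THRESHOLD:
        alerts.append(("P2", "last backup older than 2 hours"))
    if previous_size_bytes > 0:
        change = abs(size_bytes - previous_size_bytes) / previous_size_bytes
        if change > SIZE_CHANGE_THRESHOLD:
            alerts.append(("P3", "backup size changed by more than 50%"))
    return alerts
```

Routing the returned severities to the alerting path is left to the existing Prometheus/PagerDuty setup.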
+ +# 🤖 **Automated Recovery** + +## Principe : Self-Healing Infrastructure + +> **Objectif :** Minimiser l'intervention humaine. Le système doit se réparer automatiquement. + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ AUTOMATED RECOVERY LAYERS │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ │ +│ LAYER 1: APPLICATION (Kubernetes) │ +│ ├── Liveness probes → Restart automatique │ +│ ├── Readiness probes → Traffic routing │ +│ ├── HPA → Scale automatique │ +│ └── PDB → Protection pendant maintenance │ +│ │ +│ LAYER 2: DATABASE (Aiven) │ +│ ├── Health monitoring → Failover automatique │ +│ ├── Connection pooling (PgBouncer) → Retry transparent │ +│ └── Read replicas → Load distribution │ +│ │ +│ LAYER 3: MESSAGING (Kafka) │ +│ ├── Broker failure → Partition rebalance automatique │ +│ ├── Consumer failure → Rebalance consumer group │ +│ └── Producer retry → Idempotent delivery │ +│ │ +│ LAYER 4: CACHE (Valkey) │ +│ ├── Cache miss → Fallback DB automatique (cache-aside pattern) │ +│ ├── Node failure → Cluster failover │ +│ └── TTL expiration → Lazy refresh │ +│ │ +│ LAYER 5: NETWORKING (Cloudflare + Cilium) │ +│ ├── Origin failure → Health check + failover │ +│ ├── DDoS → Auto-mitigation │ +│ └── mTLS → Automatic certificate rotation │ +│ │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +## Recovery automatique par composant + +| Composant | Failure | Recovery Mechanism | Temps | Intervention | +|-----------|---------|-------------------|-------|--------------| +| **Pod** | Crash | Kubernetes restart | < 30s | Aucune | +| **Pod** | OOM | Kubernetes restart + alert | < 30s | Investigation | +| **Deployment** | Bad deploy | ArgoCD rollback auto (si configuré) | < 2min | Aucune | +| **DB Primary** | Failure | Aiven automatic failover | < 5min | Aucune | +| **DB Connection** | Pool exhausted | PgBouncer retry + scale | < 1min | Aucune | +| **Kafka 
Consumer** | Lag > threshold | KEDA auto-scale | < 2min | Aucune | +| **Cache** | Node down | Cluster failover + fallback DB | < 1min | Aucune | +| **Certificate** | Expiring | Cert-manager auto-renew | N/A | Aucune | + +--- + +# 🔬 **Chaos Engineering** + +## Philosophie + +> **"Nous ne testons pas si le système tombe, mais si le système se relève."** + +## Chaos Testing Framework + +| Outil | Usage | Intégration | +|-------|-------|-------------| +| **Chaos Mesh** | Injection de pannes Kubernetes | CRD natif K8s | +| **Litmus** | Alternative open-source | Scenarios prédéfinis | +| **Gremlin** | Enterprise (si budget) | SaaS, plus de features | + +## Experiments Automatisés + +| Experiment | Target | Fréquence | Validation | +|------------|--------|-----------|------------| +| **Pod Kill** | Random pod in service | Daily (staging) | Service continues responding | +| **Network Latency** | Inter-service +100ms | Weekly | SLO latency maintained | +| **Node Drain** | Random node | Weekly | Pods rescheduled, no downtime | +| **DB Failover** | Force primary switch | Monthly | Connections recover < 5min | +| **Cache Flush** | Valkey flush | Weekly | Fallback to DB works | +| **AZ Failure Simulation** | Cordon all nodes in 1 AZ | Quarterly | Traffic shifts to other AZs | + +## Chaos Test Pipeline + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ CHAOS TESTING PIPELINE │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ │ +│ 1. PRE-CHECK │ +│ • Verify system healthy (all green) │ +│ • Baseline metrics recorded │ +│ • Alerting team notified (staging) │ +│ │ +│ 2. INJECT CHAOS │ +│ • Apply Chaos Mesh experiment │ +│ • Duration: 5-15 minutes │ +│ │ +│ 3. OBSERVE │ +│ • Monitor SLIs (error rate, latency) │ +│ • Check recovery mechanisms activate │ +│ • Record recovery time │ +│ │ +│ 4. VALIDATE │ +│ • SLO maintained? ✅ / ❌ │ +│ • Recovery time within RTO? ✅ / ❌ │ +│ • No data loss? ✅ / ❌ │ +│ │ +│ 5. 
CLEANUP │ +│ • Remove chaos experiment │ +│ • Verify system back to baseline │ +│ • Generate report │ +│ │ +│ 6. ACTION │ +│ • Si échec → Ticket pour fix │ +│ • Si succès → Increase chaos scope │ +│ │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +## Game Days + +| Activité | Fréquence | Participants | Scope | +|----------|-----------|--------------|-------| +| **Chaos Friday** | Weekly | On-call | Staging, experiments simples | +| **Game Day** | Monthly | Full team | Staging, multi-failure scenarios | +| **DR Drill** | Quarterly | Team + Management | Staging, full DR simulation | +| **Production Chaos** | Annually | Team + SRE | Prod (maintenance window) | + +--- + +# 🏥 **Disaster Recovery** + +## DR Scenarios + +| Scenario | Recovery | Automation Level | +|----------|----------|------------------| +| Single AZ failure | Automatique (multi-AZ) | 100% | +| Region failure | Semi-automatique (IaC + GitOps) | 80% | +| Data corruption | PITR restore | 60% | +| Ransomware | Immutable backups restore | 50% | + +## DR Automation — Infrastructure as Code + +> **Principe :** Toute l'infrastructure est reproductible via Terraform + ArgoCD. + +| Composant | Reproductibilité | Temps estimé | +|-----------|------------------|--------------| +| **EKS Cluster** | Terraform apply | ~30 min | +| **Platform tools** | ArgoCD sync | ~15 min | +| **Applications** | ArgoCD sync | ~10 min | +| **Database** | Aiven restore from backup | ~1-2h | +| **DNS cutover** | Cloudflare API / Terraform | ~5 min | + +## DR Runbook — Region Failure + +| Phase | Durée | Actions | Automation | +|-------|-------|---------|------------| +| **1. Detection** | 15 min | Confirmer failure, déclarer DR | Alerting automatique | +| **2. Infrastructure** | 1-2h | Terraform apply DR region | Semi-auto (approval required) | +| **3. Data** | 1-2h | Aiven restore, verify integrity | Semi-auto (Aiven console) | +| **4. 
 Applications** | 30 min | ArgoCD sync | Automatic |
+| **5. Traffic** | 15 min | Cloudflare DNS update | Semi-auto (Terraform) |
+| **6. Validation** | 30 min | E2E tests, verify SLIs | Automatic (CI) |
+
+**Total RTO: 4 hours** (assumes phases 2 and 3 overlap; a fully serial run could take up to 5.5 hours)
+
+## DR Test Schedule
+
+| Test | Frequency | Scope | Duration |
+|------|-----------|-------|----------|
+| Backup restore | Weekly | PostgreSQL single table | 30 min |
+| Failover test | Monthly | Database failover | 1 hour |
+| DR drill | Quarterly | Full DR simulation (staging) | 4 hours |
+| Full DR test | Annually | Production DR (maintenance) | 8 hours |
+
+---
+
+# 📊 **Business Continuity**
+
+## RPO/RTO Summary
+
+| Scenario | RPO | RTO | Data Loss Risk | Automation |
+|----------|-----|-----|----------------|------------|
+| Pod failure | 0 | < 30s | None | 100% auto |
+| Node failure | 0 | < 2min | None | 100% auto |
+| AZ failure | 0 | < 5min | None | 100% auto |
+| DB failover | 0 (sync) | < 5min | None | 100% auto |
+| Region failure | 1 hour | 4 hours | Up to 1 hour | 80% auto |
+
+## Communication Plan
+
+| Audience | Channel | Frequency | Owner |
+|----------|---------|-----------|-------|
+| Engineering | Slack #incidents | Real-time | On-call |
+| Management | Email + Slack | Every 30 min | Incident Commander |
+| Customers | Status page | Every 15 min | Communications |
+| Partners | Email | Major updates | Account Management |
+
+## Status Page
+
+| Status | Definition |
+|--------|------------|
+| **Operational** | All systems normal |
+| **Degraded** | Partial impact (increased latency) |
+| **Partial Outage** | Some features unavailable |
+| **Major Outage** | Service unavailable |
+
+---
+
+## Recovery Validation — Automated
+
+> These checks run automatically after every recovery.
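
Such a post-recovery gate can be scripted as a severity-mapped check runner. A minimal Python sketch; check names and severities mirror the validation tables in this guide, while the runner itself and the stubbed check implementations are illustrative:

```python
# Each named check maps to the alert severity to page at if it fails.
# Real implementations (kubectl, PromQL, Loki, Tempo, E2E suite) are
# stubbed out; only the gating logic is shown.
CHECK_SEVERITY = {
    "pods_running": "P1",       # kubectl: all pods Ready
    "metrics_flowing": "P2",    # Prometheus query returns recent samples
    "logs_flowing": "P2",       # Loki query returns recent lines
    "traces_flowing": "P3",     # Tempo query returns recent traces
    "e2e_critical_path": "P1",  # automated E2E tests
    "error_rate_normal": "P2",  # SLI check against SLO threshold
}

def validate_recovery(results: dict) -> list:
    """results maps check name -> passed (bool); returns alerts to raise."""
    return [(name, CHECK_SEVERITY[name])
            for name, ok in sorted(results.items()) if not ok]
```

A recovery is declared complete only when this returns an empty list.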
+ +### Health Check Pipeline + +| Check | Method | Failure Action | +|-------|--------|----------------| +| All pods running | kubectl health check | Alert P1 | +| Metrics flowing | Prometheus query | Alert P2 | +| Logs flowing | Loki query | Alert P2 | +| Traces flowing | Tempo query | Alert P3 | +| E2E critical path | Automated tests | Alert P1 | +| Error rate normal | SLI check | Alert P2 | + +### Database Validation + +| Check | Method | Failure Action | +|-------|--------|----------------| +| Data integrity | Checksum validation | Alert P1 | +| Transaction count | Count comparison | Alert P2 | +| FK constraints | DB validation | Alert P2 | +| Read/Write test | Smoke test | Alert P1 | + +--- + +> **Voir aussi :** [Testing Strategy](../testing/TESTING-STRATEGY.md) pour les tests de performance, load testing, et chaos engineering détaillés. + +--- + +*Document maintenu par : Platform Team + SRE* +*Dernière mise à jour : Janvier 2026* diff --git a/security/SECURITY-ARCHITECTURE.md b/security/SECURITY-ARCHITECTURE.md new file mode 100644 index 0000000..87e5a8c --- /dev/null +++ b/security/SECURITY-ARCHITECTURE.md @@ -0,0 +1,535 @@ +# 🔐 **Security Architecture** +## *LOCAL-PLUS Defense in Depth* + +--- + +> **Retour vers** : [Architecture Overview](../EntrepriseArchitecture.md) + +--- + +# 📋 **Table of Contents** + +1. [Defense in Depth](#defense-in-depth) +2. [Layer 0 — Edge (Cloudflare)](#layer-0--edge-cloudflare) +3. [Layer 1 — API Gateway](#layer-1--api-gateway) +4. [Layer 2 — Network](#layer-2--network) +5. [Layer 3 — Identity & Access](#layer-3--identity--access) +6. [Layer 4 — Workload](#layer-4--workload) +7. [Layer 5 — Data](#layer-5--data) +8. [Supply Chain Security](#supply-chain-security) +9. 
[Security Roadmap](#security-roadmap) + +--- + +# 🛡️ **Defense in Depth** + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ LAYER 0: EDGE (Cloudflare) │ +│ • Cloudflare WAF (OWASP Core Ruleset, custom rules) │ +│ • Cloudflare DDoS Protection (L3/L4/L7, unlimited) │ +│ • Bot Management (JS challenge, CAPTCHA) │ +│ • TLS 1.3 termination, HSTS enforced │ +│ • Cloudflare Tunnel (no public origin IP) │ +└─────────────────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────────────────┐ +│ LAYER 1: API GATEWAY (Cilium Gateway API) │ +│ • JWT/API Key validation │ +│ • Rate limiting (fine-grained, per user/tenant) │ +│ • Request validation │ +│ • Circuit breaker │ +└─────────────────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────────────────┐ +│ LAYER 2: NETWORK │ +│ • VPC isolation (private subnets only for workloads) │ +│ • Cilium NetworkPolicies (default deny, explicit allow) │ +│ • VPC Peering Aiven (no public internet for DB/Kafka) │ +└─────────────────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────────────────┐ +│ LAYER 3: IDENTITY & ACCESS │ +│ • IRSA / Workload Identity (no static credentials) │ +│ • Cilium mTLS (WireGuard) — pod-to-pod encryption │ +│ • Vault dynamic secrets — DB credentials rotated │ +│ • PAM — Privileged Access Management │ +└─────────────────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────────────────┐ +│ LAYER 4: WORKLOAD │ +│ • Kyverno policies (no privileged, resource limits, probes required) │ +│ • Image signature verification (Cosign) │ +│ • Read-only root filesystem │ +│ • Non-root containers │ 
+└─────────────────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────────────────┐ +│ LAYER 5: DATA │ +│ • Encryption at rest (AWS KMS, Aiven native) │ +│ • Encryption in transit (mTLS) │ +│ • PII scrubbing in logs (OTel processor) │ +│ • Audit trail immutable (CloudTrail, K8s audit logs) │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +--- + +# 🌐 **Layer 0 — Edge (Cloudflare)** + +## Protection Components + +| Component | Protection | Configuration | +|-----------|------------|---------------| +| **WAF** | OWASP Core Ruleset | Managed + Custom rules | +| **DDoS** | L3/L4/L7 mitigation | Unlimited, automatic | +| **Bot Protection** | JS challenge, CAPTCHA | Bot score threshold | +| **TLS** | 1.3 only, HSTS | Full (strict) mode | +| **Tunnel** | No public origin IP | Encrypted connection | + +## WAF Rules Strategy + +| Rule Set | Type | Action | Purpose | +|----------|------|--------|---------| +| **OWASP Core** | Managed | Block | SQLi, XSS, LFI, RFI protection | +| **Cloudflare Managed** | Managed | Block | Zero-day, emerging threats | +| **Geo-Block** | Custom | Block | Block high-risk countries (optional) | +| **Rate Limit API** | Custom | Challenge | > 100 req/min per IP on /api/* | +| **Bot Score < 30** | Custom | Challenge | Likely bot traffic | + +--- + +# 🚪 **Layer 1 — API Gateway** + +## Cilium Gateway API (Phase 1) + +| Feature | Configuration | Purpose | +|---------|---------------|---------| +| **TLS Termination** | Cloudflare Origin cert | Encryption | +| **Path-based routing** | HTTPRoute resources | Traffic routing | +| **mTLS** | Cilium automatic | Service authentication | + +## APISIX (Future Phase 2+) + +| Feature | Configuration | Purpose | +|---------|---------------|---------| +| **JWT Validation** | RS256, JWKS endpoint | Authentication | +| **API Key** | Header-based | Partner authentication | +| **Rate Limiting** | 
Per user/tenant | Abuse prevention | +| **Request Validation** | JSON Schema | Input validation | +| **Circuit Breaker** | Timeout + failure threshold | Resilience | + +--- + +# 🔒 **Layer 2 — Network** + +## VPC Isolation + +| Subnet Type | CIDR | Usage | Internet Access | +|-------------|------|-------|-----------------| +| **Private (Workloads)** | 10.0.0.0/20 | EKS nodes, pods | NAT Gateway only | +| **Private (Data)** | 10.0.16.0/20 | VPC Endpoints | None | +| **Public** | 10.0.32.0/20 | NAT Gateway, LB | Direct | + +## Cilium Network Policies + +| Policy | Effect | +|--------|--------| +| Default deny all | Aucun trafic sauf explicite | +| Allow intra-namespace | Services même namespace peuvent communiquer | +| Allow specific cross-namespace | svc-ledger → svc-wallet explicite | +| Allow egress Aiven | Services → VPC Peering range only | +| Allow egress AWS endpoints | Services → VPC Endpoints only | + +--- + +# 🔑 **Layer 3 — Identity & Access** + +## Vue d'ensemble — Modèle Zero Static Credentials + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ IDENTITY & ACCESS ARCHITECTURE │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ │ +│ ┌─────────────────────────────────────────────────────────────────────┐ │ +│ │ WORKLOADS (Pods) │ │ +│ │ │ │ +│ │ Pod svc-ledger Pod svc-wallet │ │ +│ │ ├── ServiceAccount ├── ServiceAccount │ │ +│ │ └── JWT Token (auto) └── JWT Token (auto) │ │ +│ │ │ │ +│ └───────────────────────────┬─────────────────────────────────────────┘ │ +│ │ │ +│ ┌───────────────────┼───────────────────┐ │ +│ │ │ │ │ +│ ▼ ▼ ▼ │ +│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ +│ │ IRSA │ │ Vault │ │ Cilium │ │ +│ │ (AWS Access) │ │ (Secrets) │ │ (mTLS) │ │ +│ └──────┬───────┘ └──────┬───────┘ └──────────────┘ │ +│ │ │ │ +│ │ AssumeRole │ Dynamic creds │ +│ ▼ ▼ │ +│ ┌──────────────┐ ┌──────────────┐ │ +│ │ AWS IAM │ │ PostgreSQL │ │ +│ │ S3, KMS... 
│ │ Kafka │ │ +│ └──────────────┘ └──────────────┘ │ +│ │ +│ ZERO STATIC CREDENTIALS — Tout est éphémère │ +│ │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +--- + +## IRSA — IAM Roles for Service Accounts + +### Comment ça marche ? + +**IRSA** permet à un pod Kubernetes d'assumer un rôle IAM AWS **sans credentials statiques**. + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ IRSA — FLOW │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ │ +│ 1. POD DÉMARRE │ +│ ├── Kubernetes injecte un JWT token (ServiceAccount) │ +│ └── Le token contient: namespace, service account, issuer │ +│ │ +│ 2. POD VEUT ACCÉDER À S3 │ +│ ├── AWS SDK détecte le token IRSA │ +│ └── SDK appelle STS AssumeRoleWithWebIdentity │ +│ │ +│ 3. AWS STS VALIDE │ +│ ├── Vérifie le JWT via OIDC Provider (EKS) │ +│ ├── Vérifie que le ServiceAccount match le Trust Policy │ +│ └── Retourne des credentials temporaires (15min-12h) │ +│ │ +│ 4. 
POD ACCÈDE À S3 │ +│ ├── Utilise les credentials temporaires │ +│ └── AWS SDK renouvelle automatiquement │ +│ │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +### Configuration + +| Composant | Configuration | +|-----------|---------------| +| **OIDC Provider** | Créé automatiquement avec EKS, URL: `oidc.eks.region.amazonaws.com/id/CLUSTER_ID` | +| **IAM Role** | Trust policy qui autorise le ServiceAccount spécifique | +| **ServiceAccount** | Annotation `eks.amazonaws.com/role-arn` | +| **Pod** | Utilise le ServiceAccount, reçoit le token automatiquement | + +### Trust Policy — Principe + +| Élément | Description | +|---------|-------------| +| **Principal** | `arn:aws:iam::ACCOUNT:oidc-provider/oidc.eks...` | +| **Condition** | `sub` = `system:serviceaccount:NAMESPACE:SA_NAME` | +| **Action** | `sts:AssumeRoleWithWebIdentity` | + +### Mapping Services → Roles + +| Service Account | IAM Role | Permissions AWS | +|-----------------|----------|-----------------| +| `svc-ledger` | `role-svc-ledger` | S3 read (specific bucket), KMS decrypt | +| `svc-notification` | `role-svc-notification` | SES send email | +| `external-secrets` | `role-external-secrets` | Secrets Manager read | +| `otel-collector` | `role-otel-collector` | CloudWatch Logs write | + +--- + +## Workload Identity Federation — Concept Général + +> **IRSA est une implémentation AWS de Workload Identity Federation.** + +### Qu'est-ce que Workload Identity Federation ? 
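
In code terms, the token-for-credentials exchange behind IRSA (steps 2 and 3 of the flow above) can be sketched with boto3. The AWS SDK performs this automatically; the role ARN, session name, and helper names below are illustrative placeholders, not production code:

```python
import os

# Standard EKS projected ServiceAccount token location (also exposed via
# the AWS_WEB_IDENTITY_TOKEN_FILE environment variable).
DEFAULT_TOKEN_PATH = "/var/run/secrets/eks.amazonaws.com/serviceaccount/token"

def build_assume_role_request(role_arn: str, session_name: str, token: str) -> dict:
    """Parameters for sts:AssumeRoleWithWebIdentity (temporary credentials)."""
    return {
        "RoleArn": role_arn,
        "RoleSessionName": session_name,
        "WebIdentityToken": token,
        "DurationSeconds": 3600,  # within the 15min-12h window noted above
    }

def fetch_temporary_credentials(role_arn: str, session_name: str) -> dict:
    """Exchange the projected ServiceAccount JWT for short-lived AWS creds."""
    import boto3  # assumed present in the pod image
    token_path = os.environ.get("AWS_WEB_IDENTITY_TOKEN_FILE", DEFAULT_TOKEN_PATH)
    with open(token_path) as f:
        token = f.read()
    resp = boto3.client("sts").assume_role_with_web_identity(
        **build_assume_role_request(role_arn, session_name, token))
    return resp["Credentials"]  # AccessKeyId, SecretAccessKey, SessionToken
```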
+ +| Concept | Description | +|---------|-------------| +| **Définition** | Mécanisme permettant à une workload (pod, VM, CI job) d'obtenir des credentials cloud **sans secret statique** | +| **Principe** | Le workload prouve son identité via un token (JWT), le cloud provider échange contre des credentials temporaires | +| **Standard** | OIDC (OpenID Connect) — standard ouvert | + +### Implémentations par Cloud + +| Cloud | Nom | Comment ça marche | +|-------|-----|-------------------| +| **AWS** | IRSA (EKS) | Pod → JWT → STS AssumeRoleWithWebIdentity → IAM Role | +| **GCP** | Workload Identity | Pod → JWT → GCP Token Service → Service Account | +| **Azure** | Workload Identity | Pod → JWT → Azure AD → Managed Identity | +| **Multi-cloud** | SPIRE/SPIFFE | Standard open-source, fédération cross-cloud | + +### Pourquoi c'est mieux que les credentials statiques ? + +| Critère | Credentials Statiques | Workload Identity | +|---------|----------------------|-------------------| +| **Rotation** | Manuelle, risquée | Automatique (15min-12h) | +| **Blast radius** | Si leak → accès permanent | Si leak → expire rapidement | +| **Audit** | Difficile à tracer | Chaque assume est loggé | +| **Gestion** | Secrets à distribuer | Zero secret management | +| **Compliance** | SOC2/PCI problématique | SOC2/PCI friendly | + +--- + +## Vault — Dynamic Secrets + +### Comment Vault génère des credentials dynamiques + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ VAULT DYNAMIC SECRETS │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ │ +│ 1. POD DEMANDE UN SECRET │ +│ ├── Pod s'authentifie à Vault (Kubernetes auth) │ +│ └── Vault vérifie le ServiceAccount JWT │ +│ │ +│ 2. VAULT GÉNÈRE LES CREDENTIALS │ +│ ├── Vault se connecte à PostgreSQL │ +│ ├── CREATE ROLE "svc-ledger-abc123" WITH PASSWORD '...' VALID UNTIL... │ +│ └── Retourne username/password au pod │ +│ │ +│ 3. 
POD UTILISE LES CREDENTIALS │ +│ ├── Connexion à PostgreSQL │ +│ └── TTL: 1 heure (renouvelable) │ +│ │ +│ 4. EXPIRATION │ +│ ├── Vault révoque automatiquement │ +│ └── PostgreSQL: DROP ROLE "svc-ledger-abc123" │ +│ │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +### Auth Methods + +| Method | Use Case | Identity Source | +|--------|----------|-----------------| +| **Kubernetes** | Pods EKS | ServiceAccount JWT | +| **AWS IAM** | Lambda, EC2 | Instance metadata | +| **AppRole** | CI/CD | Role ID + Secret ID | +| **OIDC** | GitHub Actions | GitHub JWT | + +### Secret Engines + +| Engine | Path | Purpose | TTL | +|--------|------|---------|-----| +| **Database** | `database/` | PostgreSQL dynamic credentials | 1h (renouvelable) | +| **KV v2** | `secret/` | Static secrets (API keys externes) | N/A | +| **Transit** | `transit/` | Encryption as a service | N/A | +| **PKI** | `pki/` | Certificats TLS | 24h | + +--- + +## PAM — Privileged Access Management + +### Pourquoi PAM ? + +| Problème | Solution PAM | +|----------|--------------| +| SSH keys partagées | Accès éphémère, certificat SSH signé | +| Admin accounts permanents | Just-in-Time access | +| Pas d'audit | Session recording, audit complet | +| Blast radius élevé | Least privilege, time-bound | + +### Architecture PAM + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ PAM ARCHITECTURE │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ │ +│ UTILISATEUR VEUT ACCÉDER À UN SYSTÈME │ +│ │ +│ ┌──────────────┐ │ +│ │ Engineer │ │ +│ │ (Browser) │ │ +│ └──────┬───────┘ │ +│ │ │ +│ │ 1. Request access │ +│ ▼ │ +│ ┌──────────────────────────────────────────────────────────────────────┐ │ +│ │ PAM SOLUTION │ │ +│ │ • Teleport (open-source) ou │ │ +│ │ • HashiCorp Boundary ou │ │ +│ │ • AWS SSM Session Manager │ │ +│ ├──────────────────────────────────────────────────────────────────────┤ │ +│ │ │ │ +│ │ 2. 
AUTHENTICATION │ │ +│ │ ├── SSO (GitHub, Okta, Google) │ │ +│ │ └── MFA required │ │ +│ │ │ │ +│ │ 3. AUTHORIZATION │ │ +│ │ ├── Check RBAC (role-based) │ │ +│ │ ├── Check time restrictions │ │ +│ │ └── Approval workflow (si P1 incident) │ │ +│ │ │ │ +│ │ 4. CREDENTIAL VENDING │ │ +│ │ ├── Generate short-lived SSH cert (10min-8h) │ │ +│ │ └── Or create temporary DB user │ │ +│ │ │ │ +│ │ 5. SESSION │ │ +│ │ ├── Proxied connection │ │ +│ │ └── Full session recording (audit) │ │ +│ │ │ │ +│ └──────────────────────────────────────────────────────────────────────┘ │ +│ │ │ +│ │ 6. Access granted (time-limited) │ +│ ▼ │ +│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ +│ │ EKS Node │ │ Database │ │ Bastion │ │ +│ │ (kubectl) │ │ (psql) │ │ (SSH) │ │ +│ └──────────────┘ └──────────────┘ └──────────────┘ │ +│ │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +### Options PAM pour Local-Plus + +| Solution | Type | Coût | Features | +|----------|------|------|----------| +| **AWS SSM Session Manager** | Managed | Gratuit | SSH/RDP sans bastion, audit CloudTrail | +| **Teleport** | Open-source | Gratuit (Community) | SSH, K8s, DB, session recording | +| **HashiCorp Boundary** | Open-source | Gratuit (Community) | Session brokering, Vault integration | + +### Recommandation Phase 1 + +| Access Type | Solution | Justification | +|-------------|----------|---------------| +| **EKS kubectl** | IRSA + AWS SSO | Native, zero config | +| **Database** | Vault dynamic creds | Already planned | +| **SSH nodes** | SSM Session Manager | Gratuit, no bastion needed | +| **Emergency access** | Break-glass with MFA | Documented procedure | + +--- + +## mTLS — Cilium WireGuard + +| Aspect | Configuration | +|--------|---------------| +| **Activation** | Automatique avec Cilium | +| **Protocol** | WireGuard (kernel-level) | +| **Certificate management** | Géré par Cilium | +| **Application changes** | Aucun — transparent | +| **Performance** | 
Minimal overhead (kernel crypto) | + +--- + +# 🛡️ **Layer 4 — Workload** + +## Kyverno Policies + +| Policy | Effect | Enforcement | +|--------|--------|-------------| +| `require-labels` | Pods must have required labels | Enforce | +| `require-probes` | Liveness + Readiness required | Enforce | +| `require-resource-limits` | CPU/Memory limits required | Enforce | +| `restrict-privileged` | No privileged containers | Enforce | +| `require-image-signature` | Cosign signature required | Enforce | +| `mutate-default-sa` | Auto-mount SA token disabled | Enforce | + +## Container Security Settings + +| Setting | Value | Rationale | +|---------|-------|-----------| +| `runAsNonRoot` | true | Prevent root execution | +| `readOnlyRootFilesystem` | true | Prevent filesystem writes | +| `allowPrivilegeEscalation` | false | Prevent privilege escalation | +| `capabilities.drop` | ALL | Minimal capabilities | + +--- + +# 💾 **Layer 5 — Data** + +## Encryption + +| Data State | Method | Key Management | +|------------|--------|----------------| +| **At rest (PostgreSQL)** | AES-256 | Aiven managed | +| **At rest (Kafka)** | AES-256 | Aiven managed | +| **At rest (S3)** | AES-256 | AWS KMS | +| **In transit** | TLS 1.3 + mTLS | Cilium + Aiven | + +## PII Protection + +| Data Type | Protection | Implementation | +|-----------|------------|----------------| +| **User ID** | Anonymized in logs | OTel processor | +| **Email** | Masked in logs | OTel processor | +| **PAN** | Never stored | Application validation | +| **IP Address** | Hashed in logs | OTel processor | + +## Audit Trail + +| Source | Destination | Retention | Immutability | +|--------|-------------|-----------|--------------| +| **AWS CloudTrail** | S3 (Log Archive) | 1 year | S3 Object Lock | +| **K8s Audit Logs** | CloudWatch Logs | 90 days | CloudWatch retention | +| **Application Audit** | PostgreSQL | 1 year | Append-only table | + +--- + +# 🔗 **Supply Chain Security** + +## Image Signing — Cosign + +| Étape | 
Description | +|-------|-------------| +| **Build** | CI build l'image Docker | +| **Sign** | Cosign signe l'image avec une clé privée | +| **Push** | Image + signature pushées vers registry | +| **Deploy** | Kyverno vérifie la signature avant d'admettre le pod | +| **Reject** | Si signature invalide → pod refusé | + +## SBOM — Software Bill of Materials + +| Étape | Outil | Output | +|-------|-------|--------| +| **Generate** | Syft | SBOM en format SPDX-JSON | +| **Attach** | Cosign | SBOM attaché à l'image | +| **Scan** | Grype | Vulnérabilités dans les dépendances | +| **Policy** | Kyverno | Reject si vulnérabilités critiques | + +--- + +# 📅 **Security Roadmap** + +## Phase 1 — Day 1 (Current) + +| Component | Status | Effort | +|-----------|--------|--------| +| Cilium mTLS | ✅ Zero config | Included | +| IRSA (Workload Identity) | ✅ Ready | 1 day | +| Kyverno basic policies | ✅ Ready | 2 days | +| Vault for secrets | ✅ Ready | 1 week | +| External-Secrets Operator | ✅ Ready | 2 days | +| SSM Session Manager | ✅ Ready | 1 day | + +## Phase 2 — Month 3 + +| Component | Status | Effort | +|-----------|--------|--------| +| Image signing (Cosign) | 🔜 Planned | 1 week | +| SBOM generation (Syft) | 🔜 Planned | 2 days | +| Supply chain verification | 🔜 Planned | 1 week | +| Teleport (full PAM) | 🔜 Evaluation | 1 week | + +## Phase 3 — Month 6 + +| Component | Status | Effort | +|-----------|--------|--------| +| SPIRE (if multi-cluster) | 📋 Evaluation | TBD | +| Confidential Computing | 📋 Evaluation | TBD | + +--- + +*Document maintenu par : Platform Team + Security Team* +*Dernière mise à jour : Janvier 2026* diff --git a/testing/TESTING-STRATEGY.md b/testing/TESTING-STRATEGY.md new file mode 100644 index 0000000..ee0a5bd --- /dev/null +++ b/testing/TESTING-STRATEGY.md @@ -0,0 +1,473 @@ +# 🧪 **Testing Strategy** +## *LOCAL-PLUS Quality Engineering* + +--- + +> **Retour vers** : [Architecture Overview](../EntrepriseArchitecture.md) +> **Voir aussi** : [DR 
Guide](../resilience/DR-GUIDE.md) — Chaos Engineering + +--- + +# 📋 **Table of Contents** + +1. [Test Pyramid Philosophy](#test-pyramid-philosophy) +2. [Platform Testing](#platform-testing) +3. [Application Testing](#application-testing) +4. [Integration & Contract Testing](#integration--contract-testing) +5. [Performance Testing](#performance-testing) +6. [Chaos Engineering](#chaos-engineering) +7. [Compliance Testing](#compliance-testing) +8. [TNR (Tests de Non-Régression)](#tnr-tests-de-non-régression) + +--- + +# 🔺 **Test Pyramid Philosophy** + +## Le concept + +``` + ╱╲ + ╱ ╲ + ╱ E2E╲ ← Peu, coûteux, lents + ╱──────╲ Validation business + ╱ ╲ + ╱ Contract ╲ ← Vérifie les interfaces + ╱────────────╲ Entre services + ╱ ╲ + ╱ Integration ╲ ← DB, Kafka, Cache réels + ╱──────────────────╲ Testcontainers + ╱ ╲ + ╱ Unit Tests ╲ ← Beaucoup, rapides, isolés + ╱────────────────────────╲ Logique métier + ╱ ╲ + ╱ Static Analysis ╲ ← Linting, type checking + ╱──────────────────────────────╲ Avant même d'exécuter +``` + +## Principes clés + +| Principe | Description | +|----------|-------------| +| **Plus de tests en bas** | Unit tests = 70%, Integration = 20%, E2E = 10% | +| **Rapidité en bas** | Unit tests < 1s, Integration < 30s, E2E < 5min | +| **Isolation en bas** | Unit = mocks, Integration = containers, E2E = real env | +| **Coût croissant** | Plus on monte, plus c'est cher à maintenir | +| **Confiance croissante** | Plus on monte, plus on valide le "vrai" système | + +## Application à Local-Plus + +| Layer | Type de test | Cible | Fréquence | +|-------|--------------|-------|-----------| +| **Infrastructure** | Terraform tests, Policy checks | IaC modules | PR | +| **Platform** | Smoke tests, Policy audit | Kubernetes, ArgoCD | Post-deploy | +| **Application** | Unit, Integration, Contract | Services Python/Go | PR | +| **System** | E2E, Performance, Chaos | Full stack | Nightly/Weekly | + +--- + +# 🏗️ **Platform Testing** + +## Test Pyramid pour l'Infrastructure + 
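The base of the pyramid below — static analysis and plan-based checks — runs before any real resource exists. As a minimal sketch of such a plan-based policy check, here is a Python stand-in for the OPA/Conftest rules listed further down (the required tag keys and the inline `server_side_encryption_configuration` attribute are illustrative assumptions; real plans come from `terraform show -json tfplan`):

```python
import json

# Policies mirrored from the "Policy as Code" table: S3 buckets must be
# encrypted, and every resource must carry the mandatory tags.
REQUIRED_TAGS = {"env", "owner", "cost-center"}  # assumed tag keys

def check_plan(plan: dict) -> list:
    """Return policy violations found in a `terraform show -json` plan."""
    violations = []
    for rc in plan.get("resource_changes", []):
        after = (rc.get("change") or {}).get("after") or {}
        # Assumed legacy inline encryption attribute, for illustration only
        if rc["type"] == "aws_s3_bucket" and not after.get("server_side_encryption_configuration"):
            violations.append(f"{rc['address']}: S3 bucket without encryption")
        missing = REQUIRED_TAGS - set((after.get("tags") or {}).keys())
        if missing:
            violations.append(f"{rc['address']}: missing tags {sorted(missing)}")
    return violations

# Tiny inline plan for illustration (normally read from `terraform show -json`)
plan = json.loads("""
{"resource_changes": [
  {"address": "aws_s3_bucket.logs", "type": "aws_s3_bucket",
   "change": {"after": {"tags": {"env": "dev"}}}}
]}
""")
for violation in check_plan(plan):
    print(violation)
```

In CI this kind of check sits at the "PR, rapide" level: it needs only the plan JSON, no AWS credentials, and fails the pipeline before anything is applied.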
+``` + ╱╲ + ╱ ╲ + ╱ E2E╲ ← Déploiement réel (staging) + ╱──────╲ Nightly + ╱ ╲ + ╱Integration╲ ← Terratest (crée vraies ressources) + ╱────────────╲ Nightly, temps limité + ╱ ╲ + ╱ Unit Tests ╲ ← terraform test (plan-based) + ╱──────────────────╲ PR, rapide + ╱ ╲ + ╱ Static Analysis ╲ ← tflint, tfsec, checkov + ╱────────────────────────╲ Pre-commit, PR +``` + +## Terraform Testing + +| Type | Outil | Quand | Ce que ça vérifie | Bloquant | +|------|-------|-------|-------------------|----------| +| **Format** | `terraform fmt` | Pre-commit | Code formatté | Oui | +| **Lint** | `tflint` | Pre-commit | Best practices HCL | Oui | +| **Security** | `tfsec`, `checkov` | PR | Vulnérabilités, misconfigs | Oui | +| **Compliance** | `terraform-compliance`, `conftest` | PR | Policies internes | Oui | +| **Unit** | `terraform test` (native) | PR | Logique des modules | Oui | +| **Integration** | `terratest` | Nightly | Ressources créées correctement | Non | +| **Drift** | `terraform plan` (scheduled) | Daily | Écart config vs réalité | Alerte | + +## Policy as Code — Ce qu'on vérifie + +| Policy | Description | Outil | +|--------|-------------|-------| +| **S3 encryption** | Tous les buckets doivent avoir encryption | OPA/Conftest | +| **Public access** | Aucune ressource publique sauf explicite | tfsec | +| **Tagging** | Tags obligatoires (env, owner, cost-center) | terraform-compliance | +| **Naming** | Convention de nommage respectée | Custom OPA | +| **Networking** | Pas d'IGW sur VPC privé | Checkov | + +## Kubernetes Testing + +| Type | Outil | Quand | Ce que ça vérifie | +|------|-------|-------|-------------------| +| **Manifest validation** | `kubectl --dry-run`, `kubeconform` | PR | YAML valide, schema correct | +| **Policy check** | Kyverno CLI | PR | Policies passent | +| **Helm lint** | `helm lint`, `helm template` | PR | Charts valides | +| **Smoke test** | ArgoCD sync + health check | Post-deploy | App déployée et healthy | + +--- + +# 📱 **Application Testing** 
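At the base of the application pyramid, unit tests exercise pure domain logic with no I/O. A hedged sketch of what such a test looks like — the `GiftCard` entity and its rules are illustrative, not the actual Local-Plus domain model:

```python
from dataclasses import dataclass

class InsufficientBalance(Exception):
    """Business-rule violation: burn exceeds available balance."""

@dataclass
class GiftCard:
    balance: int = 0  # cents, to avoid floating-point money

    def earn(self, amount: int) -> None:
        if amount <= 0:
            raise ValueError("amount must be positive")
        self.balance += amount

    def burn(self, amount: int) -> None:
        if amount > self.balance:
            raise InsufficientBalance(f"balance {self.balance} < {amount}")
        self.balance -= amount

# pytest-style test: fast, isolated, no DB/Kafka/cache involved
def test_burn_more_than_balance_is_rejected():
    card = GiftCard(balance=500)
    try:
        card.burn(600)
        raise AssertionError("expected InsufficientBalance")
    except InsufficientBalance:
        assert card.balance == 500  # state unchanged on failure

test_burn_more_than_balance_is_rejected()
```

Because the entity has no dependencies, hundreds of such tests run in well under a second, which is what makes the 70% unit-test base of the pyramid sustainable.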
+ +## Test Pyramid pour les Services + +``` + ╱╲ + ╱ ╲ + ╱ E2E╲ ← Playwright, staging + ╱──────╲ Post-merge + ╱ ╲ + ╱ Contract ╲ ← Pact, gRPC testing + ╱────────────╲ PR + ╱ ╲ + ╱ Integration ╲ ← Testcontainers + ╱──────────────────╲ PR + ╱ ╲ + ╱ Unit Tests ╲ ← pytest, mocks + ╱────────────────────────╲ Pre-commit, PR + ╱ ╲ + ╱ Static Analysis ╲ ← ruff, mypy, bandit + ╱──────────────────────────────╲ Pre-commit +``` + +## Unit Tests + +| Aspect | Approche | +|--------|----------| +| **Cible** | Domain logic, Use cases, Utilities | +| **Isolation** | Mocks pour DB, Kafka, Cache, HTTP clients | +| **Coverage** | Minimum 80% sur le domain layer | +| **Vitesse** | < 1 seconde par test | +| **Framework** | pytest (Python), go test (Go) | + +### Ce qu'on teste en Unit + +| Composant | Tests | +|-----------|-------| +| **Domain entities** | Validation, business rules, state transitions | +| **Use cases** | Orchestration logic (avec mocks) | +| **Value objects** | Immutabilité, égalité | +| **Utilities** | Pure functions, helpers | + +### Ce qu'on NE teste PAS en Unit + +| Composant | Pourquoi | +|-----------|----------| +| **Repositories** | Nécessite vraie DB → Integration | +| **Kafka producers** | Nécessite vrai broker → Integration | +| **HTTP clients** | Interactions réelles → Contract | +| **Controllers/Routes** | Wiring → Integration ou E2E | + +## Integration Tests + +| Aspect | Approche | +|--------|----------| +| **Cible** | Repositories, Message producers, Cache clients | +| **Infrastructure** | Testcontainers (PostgreSQL, Kafka, Redis) | +| **Isolation** | Chaque test a sa propre DB/topic | +| **Vitesse** | < 30 secondes par test | +| **Framework** | pytest + testcontainers | + +### Ce qu'on vérifie + +| Composant | Vérifications | +|-----------|---------------| +| **PostgreSQL Repository** | CRUD fonctionne, transactions, contraintes FK | +| **Kafka Producer** | Messages publiés, sérialisation correcte | +| **Kafka Consumer** | Messages consommés, idempotence 
| +| **Cache Client** | Set/Get/Delete, TTL, invalidation | +| **Outbox Pattern** | Transaction + event atomiques | + +--- + +# 🤝 **Integration & Contract Testing** + +## Pourquoi Contract Testing ? + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ LE PROBLÈME SANS CONTRACT TESTING │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ │ +│ svc-ledger svc-wallet │ +│ ┌──────────┐ ┌──────────┐ │ +│ │ Appelle │─────── HTTP/gRPC ────────►│ Répond │ │ +│ │ Wallet │ │ │ │ +│ └──────────┘ └──────────┘ │ +│ │ +│ ❌ Wallet change son API │ +│ ❌ Ledger ne le sait pas │ +│ ❌ Découvert en production = 💥 │ +│ │ +└─────────────────────────────────────────────────────────────────────────────┘ + +┌─────────────────────────────────────────────────────────────────────────────┐ +│ LA SOLUTION : CONTRACT TESTING │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ │ +│ svc-ledger svc-wallet │ +│ ┌──────────┐ ┌──────────────┐ ┌──────────┐ │ +│ │ Consumer │────►│ CONTRACT │◄──────│ Provider │ │ +│ │ Tests │ │ (Pact file) │ │ Tests │ │ +│ └──────────┘ └──────────────┘ └──────────┘ │ +│ │ +│ ✅ Ledger déclare ce qu'il attend │ +│ ✅ Wallet vérifie qu'il respecte le contrat │ +│ ✅ CI bloque si contrat cassé │ +│ │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +## Contract Testing Approach + +| Aspect | Approche | +|--------|----------| +| **Outil REST** | Pact | +| **Outil gRPC** | buf breaking, grpc-testing | +| **Consumer-driven** | Le consumer définit ses besoins | +| **Provider verification** | Le provider vérifie qu'il satisfait | +| **Broker** | Pact Broker (ou Pactflow) pour centraliser | + +## Ce qu'on vérifie en Contract + +| Type | Vérifications | +|------|---------------| +| **Request format** | Path, method, headers, body schema | +| **Response format** | Status code, headers, body schema | +| **Error cases** | 4xx/5xx responses, error 
messages | +| **Breaking changes** | Champs supprimés, types changés | + +--- + +# ⚡ **Performance Testing** + +## Types de tests de performance + +| Type | Objectif | VUs | Durée | Fréquence | +|------|----------|-----|-------|-----------| +| **Smoke** | Vérifier que ça marche | 1-5 | 1 min | Post-deploy | +| **Load** | Charge normale | 50-100 | 10 min | Nightly | +| **Stress** | Trouver le breaking point | Ramping 500+ | 15 min | Weekly | +| **Soak** | Endurance, memory leaks | 50 | 4 hours | Weekly | +| **Spike** | Pics soudains | 10→200→10 | 5 min | Monthly | + +## Outil : k6 + +| Aspect | Choix | +|--------|-------| +| **Outil** | k6 (Grafana) | +| **Scripting** | JavaScript | +| **Reporting** | Grafana Cloud ou self-hosted | +| **CI Integration** | GitHub Actions | + +## Thresholds (Critères de succès) + +| Métrique | Target | Alerte | Bloquant | +|----------|--------|--------|----------| +| **Latency P50** | < 50ms | > 100ms | Non | +| **Latency P95** | < 100ms | > 200ms | Oui | +| **Latency P99** | < 200ms | > 500ms | Oui | +| **Error Rate** | < 0.1% | > 1% | Oui | +| **Throughput** | > 500 TPS | < 400 TPS | Non | + +## Scénarios de test par service + +| Service | Scenario | VUs cible | Throughput cible | +|---------|----------|-----------|------------------| +| **svc-ledger** | Create transaction | 100 | 500 TPS | +| **svc-ledger** | Get balance | 200 | 1000 TPS | +| **svc-wallet** | Update balance | 100 | 500 TPS | +| **svc-merchant** | List transactions | 50 | 200 TPS | + +## Performance Testing Pipeline + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ PERFORMANCE TESTING PIPELINE │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ │ +│ 1. SMOKE TEST (Post-deploy) │ +│ • 1-5 VUs, 1 minute │ +│ • Vérifie que le service répond │ +│ • Gate pour continuer │ +│ │ +│ 2. 
LOAD TEST (Nightly) │ +│ • 50-100 VUs, 10 minutes │ +│ • Vérifie performance normale │ +│ • Compare avec baseline │ +│ │ +│ 3. STRESS TEST (Weekly) │ +│ • Ramping jusqu'à failure │ +│ • Identifie le breaking point │ +│ • Documente les limites │ +│ │ +│ 4. SOAK TEST (Weekly) │ +│ • 50 VUs, 4 heures │ +│ • Détecte memory leaks │ +│ • Vérifie stabilité long-terme │ +│ │ +│ 5. REPORT │ +│ • Dashboard Grafana │ +│ • Trend analysis │ +│ • Alertes si régression │ +│ │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +--- + +# 💥 **Chaos Engineering** + +> **Détails complets** : voir [DR Guide — Chaos Engineering](../resilience/DR-GUIDE.md#chaos-engineering) + +## Philosophie + +| Principe | Description | +|----------|-------------| +| **Build confidence** | Prouver que le système résiste aux pannes | +| **Proactive** | Casser avant que ça casse en prod | +| **Controlled** | Experiments planifiés, scope limité | +| **Observable** | Mesurer l'impact, recovery time | + +## Experiments par layer + +| Layer | Experiment | Outil | Fréquence | +|-------|------------|-------|-----------| +| **Pod** | Kill random pod | Chaos Mesh | Daily (staging) | +| **Node** | Drain node | Chaos Mesh | Weekly | +| **Network** | Add latency 100ms | Chaos Mesh | Weekly | +| **Network** | Partition (isoler un service) | Chaos Mesh | Monthly | +| **Database** | Force failover | Aiven console | Monthly | +| **Cache** | Flush all | Chaos Mesh | Weekly | +| **AZ** | Cordon all nodes in 1 AZ | kubectl | Quarterly | + +## Validation + +| Experiment | Expected Behavior | Success Criteria | +|------------|-------------------|------------------| +| Pod kill | Traffic shifts to other pods | Error rate < 1%, recovery < 30s | +| Node drain | Pods rescheduled | No downtime | +| Network latency | Degraded but functional | SLO latency maintained | +| DB failover | Brief connection errors | Recovery < 5min | +| Cache flush | Fallback to DB | Increased latency, no errors | + 
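The success criteria in the validation table above can be encoded so each experiment passes or fails mechanically rather than by eyeballing dashboards. A minimal sketch — the metric values would be fed in by the chaos harness, and the `ExperimentResult` shape is an assumption, not a Chaos Mesh API:

```python
from dataclasses import dataclass

@dataclass
class ExperimentResult:
    name: str
    error_rate_pct: float    # error rate observed during the fault window
    recovery_seconds: float  # time until steady state is restored

def validate(result: ExperimentResult, max_error_pct: float, max_recovery_s: float) -> bool:
    """Apply the table's success criteria: bounded error rate and recovery time."""
    ok = (result.error_rate_pct < max_error_pct
          and result.recovery_seconds < max_recovery_s)
    status = "PASS" if ok else "FAIL"
    print(f"[{status}] {result.name}: "
          f"errors={result.error_rate_pct:.2f}% recovery={result.recovery_seconds:.0f}s")
    return ok

# "Pod kill" criteria from the table: error rate < 1%, recovery < 30s
pod_kill = ExperimentResult("pod-kill", error_rate_pct=0.4, recovery_seconds=12)
validate(pod_kill, max_error_pct=1.0, max_recovery_s=30)
```

Wiring this check into the nightly staging run turns each experiment into a red/green signal that can page the on-call team when a previously tolerated fault starts breaching its criteria.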
+--- + +# ✅ **Compliance Testing** + +## Tests par standard + +| Standard | Test | Ce qu'on vérifie | Outil | +|----------|------|------------------|-------| +| **GDPR** | PII in logs | Pas d'email, user_id, IP en clair | Log audit script | +| **GDPR** | Data retention | Logs < 30 jours | Loki config check | +| **GDPR** | Right to delete | API de suppression fonctionne | E2E test | +| **PCI-DSS** | Encryption in transit | mTLS enforced | Cilium policy audit | +| **PCI-DSS** | Encryption at rest | KMS enabled | AWS Config rules | +| **PCI-DSS** | No PAN storage | Pas de numéro de carte | Code scan + log audit | +| **SOC2** | Audit logs | CloudTrail + K8s audit | AWS Config | +| **SOC2** | Access control | RBAC enforced | Kyverno reports | +| **SOC2** | Change management | PR required, reviews | GitHub settings | + +## Automatisation + +| Check | Fréquence | Bloquant | +|-------|-----------|----------| +| Log audit (PII) | Nightly | Alerte P2 | +| Policy reports (Kyverno) | Continuous | Dashboard | +| AWS Config rules | Continuous | Alerte P2 | +| Encryption verification | Weekly | Alerte P1 si échec | + +--- + +# 🔄 **TNR (Tests de Non-Régression)** + +## Catégories + +| Catégorie | Ce qu'on vérifie | Fréquence | +|-----------|------------------|-----------| +| **Critical Paths** | Flux métier essentiels | Nightly | +| **Golden Master** | Réponses API n'ont pas changé | Nightly | +| **Backward Compatibility** | Anciennes versions clients fonctionnent | Pre-release | +| **Data Migration** | Migrations n'ont pas cassé les données | Post-migration | + +## Critical Paths + +| Path | Étapes | SLA | +|------|--------|-----| +| **Earn flow** | Transaction → Balance update → Event → Notification | < 5s end-to-end | +| **Burn flow** | Transaction → Balance check → Deduction → Event | < 5s end-to-end | +| **Balance query** | Request → Cache/DB → Response | < 100ms | +| **Merchant onboarding** | Registration → Validation → Activation | < 30s | + +## E2E Testing + +| Aspect | 
Approche | +|--------|----------| +| **Outil** | Playwright | +| **Environment** | Staging (miroir de prod) | +| **Data** | Fixtures dédiées, cleanup après | +| **Fréquence** | Post-merge staging, pre-release prod | +| **Ownership** | QA Team | + +## Pipeline TNR + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ TNR PIPELINE (Nightly) │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ │ +│ 00:00 ─► SETUP │ +│ • Fresh staging environment │ +│ • Load test fixtures │ +│ │ +│ 00:15 ─► CRITICAL PATH TESTS │ +│ • Earn/Burn flows │ +│ • All major user journeys │ +│ │ +│ 01:00 ─► PERFORMANCE TESTS │ +│ • Load test (10 min) │ +│ • Compare with baseline │ +│ │ +│ 01:30 ─► COMPLIANCE TESTS │ +│ • Log audit (PII check) │ +│ • Policy verification │ +│ │ +│ 02:00 ─► REPORT │ +│ • Generate report │ +│ • Alert if failures │ +│ • Update dashboard │ +│ │ +│ 02:30 ─► CLEANUP │ +│ • Reset test data │ +│ • Archive logs │ +│ │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +--- + +## Récapitulatif : Qui teste quoi ? 
+ +| Équipe | Responsabilité | Types de tests | +|--------|----------------|----------------| +| **Developers** | Unit, Integration | PR gate | +| **Platform** | Terraform, Kubernetes, Chaos | PR + Nightly | +| **QA** | E2E, TNR, Performance | Nightly + Pre-release | +| **Security** | Compliance, Policy audit | Continuous | + +--- + +*Document maintenu par : QA Team + Platform Team* +*Dernière mise à jour : Janvier 2026* From 5fb69cada6fad7bb3f7de01f5c7fdf7e137d5225 Mon Sep 17 00:00:00 2001 From: NasrLadib Date: Tue, 27 Jan 2026 09:13:38 +0100 Subject: [PATCH 3/6] docs(networking): add multi-account topology architecture Add comprehensive multi-account (Control Tower) topology section: - Add topology overview and architecture diagram - Document decision to defer Hub and Spoke to Phase 2 - Define cross-account communication methods via AWS APIs - Specify future evolution triggers and criteria - Include cost estimates for Control Tower setup --- networking/NETWORKING-ARCHITECTURE.md | 125 ++++++++++++++++++++++++++ 1 file changed, 125 insertions(+) diff --git a/networking/NETWORKING-ARCHITECTURE.md b/networking/NETWORKING-ARCHITECTURE.md index d08d9af..ab19582 100644 --- a/networking/NETWORKING-ARCHITECTURE.md +++ b/networking/NETWORKING-ARCHITECTURE.md @@ -18,6 +18,7 @@ 7. [Route53 — DNS Interne & Backup](#route53--dns-interne--backup) 8. [API Gateway / APIM (Future)](#api-gateway--apim-future) 9. [Multi-Cloud Vision](#multi-cloud-vision) +10. 
[Topologie Multi-Account (Control Tower)](#topologie-multi-account-control-tower) --- @@ -465,5 +466,129 @@ --- +# 🏢 **Topologie Multi-Account (Control Tower)** + +## Architecture Actuelle — Phase 1 + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ CONTROL TOWER — MULTI-ACCOUNT │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ │ +│ ┌─────────────────────────────────────────────────────────────────────┐ │ +│ │ MANAGEMENT ACCOUNT │ │ +│ │ • AWS Organizations, Control Tower, SCPs │ │ +│ │ ⚠️ Pas de workloads, pas de VPC │ │ +│ └─────────────────────────────────────────────────────────────────────┘ │ +│ │ +│ ┌───────────────────────┐ ┌───────────────────────┐ │ +│ │ LOG ARCHIVE │ │ SECURITY / AUDIT │ │ +│ │ • S3: CloudTrail │ │ • Security Hub │ │ +│ │ • S3: Config │ │ • GuardDuty │ │ +│ │ • S3: VPC Flow Logs │ │ • IAM Access Analyzer│ │ +│ │ 📦 Pas de VPC │ │ 🔍 Pas de VPC │ │ +│ └───────────────────────┘ └───────────────────────┘ │ +│ ▲ ▲ │ +│ │ S3/API │ Findings │ +│ │ │ │ +│ ┌─────────────────┴────────────────────┴───────────────────────────────┐ │ +│ │ WORKLOAD ACCOUNTS │ │ +│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │ +│ │ │ Dev │ │ Staging │ │ Prod │ │ │ +│ │ │ VPC │ │ VPC │ │ VPC │ │ │ +│ │ └────┬────┘ └────┬────┘ └────┬────┘ │ │ +│ │ │ │ │ │ │ +│ │ └────────────┴────────────┘ │ │ +│ │ │ │ │ +│ │ │ VPC Peering │ │ +│ │ ▼ │ │ +│ │ ┌─────────────┐ │ │ +│ │ │ AIVEN VPC │ │ │ +│ │ └─────────────┘ │ │ +│ └──────────────────────────────────────────────────────────────────────┘ │ +│ │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +## Décision : Pas de Hub and Spoke (Phase 1) + +| Critère | Évaluation | Décision | +|---------|------------|----------| +| **Nombre de VPCs** | 3-4 (dev, staging, prod + Aiven) | VPC Peering suffit | +| **Inspection centralisée** | Non requise | Pas de Network Firewall | +| **On-premises** | Pas de VPN/Direct Connect | Pas 
besoin de Transit Gateway | +| **Services partagés** | Via AWS APIs, pas via réseau | Pas de Hub VPC | + +## Comment les comptes communiquent ? + +> **Principe clé :** Les comptes partagent via **AWS APIs**, pas via connectivité réseau. + +| Communication | Méthode | VPC Peering ? | +|---------------|---------|---------------| +| **Logs → Log Archive** | S3 + Organizations | ❌ Non | +| **Findings → Security Hub** | AWS API aggregation | ❌ Non | +| **GuardDuty** | API-level, membre Organizations | ❌ Non | +| **Secrets (Vault)** | HTTPS via Cloudflare Tunnel | ❌ Non | +| **Workload → Aiven** | VPC Peering | ✅ Oui | + +## Évolution Future (si besoin) + +### Triggers pour passer à Hub and Spoke + +| Trigger | Seuil | Action | +|---------|-------|--------| +| **> 5 VPCs** | 6+ workload accounts | Transit Gateway | +| **Inspection egress** | Compliance exige firewall | Network Account + AWS Network Firewall | +| **VPN / Direct Connect** | Connexion on-premises | Transit Gateway + VPN | +| **Régional egress** | Centraliser coûts NAT | Centralized NAT via Transit Gateway | + +### Architecture Future (Phase 2+) + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ PHASE 2+ — SI INSPECTION REQUISE │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ │ +│ ┌─────────────────────────────┐ │ +│ │ NETWORK ACCOUNT │ │ +│ │ (nouveau si besoin) │ │ +│ │ │ │ +│ │ ┌─────────────────────┐ │ │ +│ │ │ Transit Gateway │ │ │ +│ │ └──────────┬──────────┘ │ │ +│ │ │ │ │ +│ │ ┌──────────┴──────────┐ │ │ +│ │ │ Network Firewall │ │ ◄── Inspection egress │ +│ │ │ (optionnel) │ │ │ +│ │ └──────────┬──────────┘ │ │ +│ │ │ │ │ +│ │ ┌──────────┴──────────┐ │ │ +│ │ │ NAT Gateway │ │ ◄── Centralized NAT │ +│ │ └─────────────────────┘ │ │ +│ └─────────────┬───────────────┘ │ +│ │ │ +│ ┌────────────────────────┼────────────────────────┐ │ +│ │ │ │ │ +│ ▼ ▼ ▼ │ +│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ +│ │ Dev VPC │ │ 
Staging VPC │ │ Prod VPC │ │ +│ │ (Spoke) │ │ (Spoke) │ │ (Spoke) │ │ +│ └─────────────┘ └─────────────┘ └─────────────┘ │ +│ │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +### Estimation coûts Transit Gateway + +| Composant | Coût | Note | +|-----------|------|------| +| **Transit Gateway attachment** | ~$0.05/h par VPC | × nombre de VPCs | +| **Data processing** | ~$0.02/GB | Tout le trafic inter-VPC | +| **Network Firewall** | ~$0.40/h + $0.016/GB | Si inspection | + +> **Phase 1 :** On évite ces coûts en utilisant VPC Peering direct. + +--- + *Document maintenu par : Platform Team* *Dernière mise à jour : Janvier 2026* From 67c9f7ba4958a1e00f6218c607c80d89da4195f4 Mon Sep 17 00:00:00 2001 From: NasrLadib Date: Sat, 31 Jan 2026 09:47:39 +0100 Subject: [PATCH 4/6] docs(bootstrap): restructure bootstrap guide and update architecture --- EntrepriseArchitecture.md | 2 +- adr/ADR-001-LANDING-ZONE-APPROACH.md | 211 ++++++++++++++++++ bootstrap/BOOTSTRAP-GUIDE.md | 307 ++++++++++++++------------ networking/NETWORKING-ARCHITECTURE.md | 10 +- 4 files changed, 386 insertions(+), 144 deletions(-) create mode 100644 adr/ADR-001-LANDING-ZONE-APPROACH.md diff --git a/EntrepriseArchitecture.md b/EntrepriseArchitecture.md index d996bac..831222c 100644 --- a/EntrepriseArchitecture.md +++ b/EntrepriseArchitecture.md @@ -245,7 +245,7 @@ LOCAL-PLUS est une plateforme de gestion de cartes cadeaux et fidélité, conçu | Repo | Description | |------|-------------| -| `bootstrap/` | AWS Landing Zone, Control Tower, Account Factory | +| `bootstrap/` | AWS Landing Zone, Account Factory, SCPs, SSO | ### Tier 1 — Platform diff --git a/adr/ADR-001-LANDING-ZONE-APPROACH.md b/adr/ADR-001-LANDING-ZONE-APPROACH.md new file mode 100644 index 0000000..c824d0b --- /dev/null +++ b/adr/ADR-001-LANDING-ZONE-APPROACH.md @@ -0,0 +1,211 @@ +# ADR-001: Landing Zone Approach + +**Status:** Accepted +**Date:** 2026-01-27 +**Decision makers:** Platform Team + +--- + 
+## Context + +LOCAL-PLUS needs an AWS multi-account strategy for a Gift Card & Loyalty Platform with SOC2, PCI-DSS, GDPR compliance requirements. + +## Decision + +**Hybrid approach: Control Tower + Terraform** + +> Control Tower comme fondation, Terraform comme langage. + +## Rationale + +### The real question + +> *"Qui porte la responsabilité légale, sécurité et audit ?"* + +- **Org-wide, security baseline, audit** → Managed AWS (Control Tower) +- **Produit, métier, plateforme** → Terraform pur + +### Why not Pure Terraform? + +| Risk | Impact | +|------|--------| +| Erreur SCP | Blast radius = toute l'org | +| Oubli CloudTrail | Non-compliance, incident invisible | +| S3 log mal configuré | Audit failure | +| Migration tardive vers CT | 2-4 semaines, risque élevé | + +> *"Le Terraform-only est intellectuellement pur mais stratégiquement risqué."* + +### Why not Control Tower only? + +| Issue | Impact | +|-------|--------| +| Pas Git-first | Platform Team friction | +| Black box | Debugging difficile | +| AFT interne | CodePipeline géré par AWS, invisible pour nous | + +> *"Control Tower est imparfait mais politiquement et légalement puissant."* + +### Compliance = langage commun + +Un audit est un **exercice social**, pas technique. + +| Question auditeur | Avec Control Tower | +|-------------------|-------------------| +| "Comment vous gérez les logs ?" | "Control Tower, Log Archive account" | +| "Vos guardrails ?" | "AWS managed controls + custom SCPs" | +| "Drift detection ?" 
| "AWS Config + Control Tower dashboard" | + +--- + +## Architecture + +``` +┌─────────────────────────────────────────────────────────────────────┐ +│ LAYER 0 — CONTROL TOWER │ +│ (Managed, Immutable, Audit) │ +├─────────────────────────────────────────────────────────────────────┤ +│ • AWS Organizations │ +│ • SCPs globales (AWS managed + custom via Terraform) │ +│ • CloudTrail org-level │ +│ • AWS Config │ +│ • Security Hub │ +│ • Log Archive Account │ +│ • Audit Account │ +│ ⛔ Aucune logique produit ici │ +└─────────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────────┐ +│ AFT — ACCOUNT FACTORY │ +│ (GitHub Actions → Terraform → AFT) │ +├─────────────────────────────────────────────────────────────────────┤ +│ • Account requests via Git PR │ +│ • GitHub Actions exécute Terraform │ +│ • Terraform appelle AFT module │ +│ • AFT provisionne compte + baseline │ +│ ⛔ Pas de logique métier │ +└─────────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────────┐ +│ LAYER 1+ — TERRAFORM PUR │ +│ (Platform, GitOps, GitHub Actions) │ +├─────────────────────────────────────────────────────────────────────┤ +│ • VPC / Networking │ +│ • EKS / ECS │ +│ • RDS / Kafka / Cache │ +│ • IAM métiers (IRSA) │ +│ • Observabilité │ +│ • Everything business-facing │ +│ 💯 PR review, GitHub Actions, 100% lisible │ +└─────────────────────────────────────────────────────────────────────┘ +``` + +--- + +## Control Tower via Terraform + +Control Tower controls can be managed via Terraform: + +**Reference:** +- Terraform: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/controltower_control +- AWS Docs: https://docs.aws.amazon.com/controltower/ + +--- + +## Implementation Plan + +### Phase 1: Control Tower Setup (Console) + +| Step | Action | Duration | +|------|--------|----------| +| 1 | Enable 
Control Tower | 45 min | +| 2 | Configure home region (eu-west-1) | Included | +| 3 | Log Archive + Audit accounts created | Automatic | +| 4 | Enable IAM Identity Center | Included | + +### Phase 2: Terraform Layer (bootstrap/) + +| Component | Approach | +|-----------|----------| +| Organizations | CT-managed, read via data sources | +| OUs | CT-managed, custom via Terraform | +| SCPs | CT-managed + custom via `aws_controltower_control` | +| SSO | Terraform (`aws_ssoadmin_*`) | +| Account Factory | AFT via GitHub Actions + Terraform | +| Workload accounts | AFT baseline + Terraform customizations | + +### Phase 3: Platform (platform-application-provisioning/) + +- VPC, EKS, RDS, Kafka — Pure Terraform +- GitHub Actions CI/CD +- 100% GitOps + +--- + +## What changes in bootstrap/ + +| Current | New | +|---------|-----| +| `organization/` creates org | CT creates org, we read via data | +| `scps/` creates all SCPs | CT SCPs + custom via `aws_controltower_control` | +| `core-accounts/` creates accounts | CT creates Log/Audit, we create others | +| `account-factory/` custom | AFT module via GitHub Actions | +| `sso/` | Stays Terraform (SSO is independent) | + +--- + +## Decision Matrix by Stage + +| Stage | Approach | +|-------|----------| +| 🟢 Early startup (1-5 accounts, no audit < 12 months) | Pure Terraform OK (but CT-compatible design) | +| 🟡 Scaling / Series A / B2B clients | Control Tower OBLIGATOIRE | +| 🔵 Enterprise / regulated | Control Tower non-negotiable | + +**LOCAL-PLUS position:** 🟡 → Control Tower recommended + +--- + +## Risks and Mitigations + +| Risk | Decision | +|------|----------| +| CT opacity | Resource inventory maintenu dans cet ADR. Data sources Terraform pour lire les ressources CT. | +| AFT interne CodePipeline | AFT est un module Terraform. GitHub Actions exécute Terraform → AFT. Le CodePipeline interne est géré par AWS. | +| CT behavior changes | Provider Terraform pinné. Tests en sandbox avant promotion. 
|
| Expertise split | CODEOWNERS defined: `@security` for CT, `@platform` for Terraform. |

---

## Consequences

### Positive

- Compliance ready — auditors know Control Tower
- Reduced blast radius — AWS manages critical controls
- Future-proof — no migration pain
- Security baseline by default

### Negative

| Limitation | Resolution |
|------------|------------|
| Console setup is one-time | **Accepted.** Documented in the BOOTSTRAP-RUNBOOK. Executed exactly once. |
| CT resources are not in Terraform state | Data sources to read their IDs. Inventory maintained in this ADR. |
| Team must understand both CT and Terraform | Platform Team training. Clear ownership in CODEOWNERS. |

---

## References

- [Bootstrap Guide](../bootstrap/BOOTSTRAP-GUIDE.md)
- [Security Architecture](../security/SECURITY-ARCHITECTURE.md)
- AWS Control Tower: https://docs.aws.amazon.com/controltower/
- Terraform aws_controltower_control: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/controltower_control

---

*Maintained by: Platform Team*
*Last updated: January 2026*
diff --git a/bootstrap/BOOTSTRAP-GUIDE.md b/bootstrap/BOOTSTRAP-GUIDE.md
index 3f233e9..a078f53 100644
--- a/bootstrap/BOOTSTRAP-GUIDE.md
+++ b/bootstrap/BOOTSTRAP-GUIDE.md
@@ -9,193 +9,221 @@

# 📋 **Table of Contents**

-1. [Layer 0 — Manual Bootstrap](#layer-0--manual-bootstrap)
-2. [Account Factory — Self-Service](#account-factory--self-service)
-3. [Platform Application Provisioning](#platform-application-provisioning)
-4. [Workload Provisioning](#workload-provisioning)
-5. [Layer 2 — Platform Bootstrap](#layer-2--platform-bootstrap)
-6. [Layer 3+ — Application Services](#layer-3--application-services)
-7. [Bootstrap Repository Structure](#bootstrap-repository-structure)
+1. [Architecture Overview](#architecture-overview)
+2. [Layer 0 — Control Tower](#layer-0--control-tower)
+3. [Layer 1 — Terraform Foundation](#layer-1--terraform-foundation)
+4. 
[Account Factory](#account-factory) +5. [Platform Provisioning](#platform-provisioning) +6. [Bootstrap Repository Structure](#bootstrap-repository-structure) --- -# 🔧 **Layer 0 — Manual Bootstrap (1x per AWS Organization)** +# 🏗️ **Architecture Overview** -> **Principe :** Point d'entrée unique pour chaque cloud provider. -> Ces étapes sont manuelles car elles créent les fondations pour toute l'automatisation future. +> **Voir [ADR-001](../adr/ADR-001-LANDING-ZONE-APPROACH.md) pour le rationale complet.** -## Étapes +## Principe -| Étape | Action | Outil | Durée | -|-------|--------|-------|-------| -| 1 | Créer compte Management | Console AWS | 10 min | -| 2 | Activer AWS Organizations | Console | 5 min | -| 3 | Activer Control Tower | Console | 45 min | -| 4 | Configurer IAM Identity Center (SSO) | Console | 30 min | -| 5 | Créer OUs (Security, Infrastructure, Workloads) | Control Tower | 15 min | -| 6 | Appliquer SCPs | Console Organizations | 15 min | -| 7 | Créer Core Accounts | Control Tower | 15 min/compte | +| Layer | Géré par | Responsabilité | +|-------|----------|----------------| +| **Layer 0** | Control Tower | Org, SCPs, Logging, Audit | +| **Layer 1** | Terraform | SSO, Custom Controls, Account Factory | +| **Layer 2+** | Terraform | VPC, EKS, RDS, Platform | -## AWS Multi-Account Strategy +> *"Control Tower comme fondation, Terraform comme langage."* + +## Diagram ``` -┌─────────────────────────────────────────────────────────────────────────────┐ -│ AWS CONTROL TOWER (Organization) │ -├─────────────────────────────────────────────────────────────────────────────┤ -│ │ -│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │ -│ │ MANAGEMENT │ │ SECURITY │ │ LOG ARCHIVE │ │ -│ │ ACCOUNT │ │ ACCOUNT │ │ ACCOUNT │ │ -│ │ • Control Tower│ │ • GuardDuty │ │ • CloudTrail │ │ -│ │ • Organizations│ │ • Security Hub │ │ • Config Logs │ │ -│ │ • SCPs │ │ • IAM Identity │ │ • VPC Flow Logs│ │ -│ └─────────────────┘ └─────────────────┘ 
└─────────────────┘ │ -│ │ -│ ┌─────────────────────────────────────────────────────────────────────┐ │ -│ │ WORKLOAD ACCOUNTS (OU: Workloads) │ │ -│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │ -│ │ │ DEV Account │ │ STAGING │ │ PROD Account│ │ │ -│ │ │ VPC + EKS │ │ VPC + EKS │ │ VPC + EKS │ │ │ -│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │ -│ └─────────────────────────────────────────────────────────────────────┘ │ -│ │ -│ ┌─────────────────────────────────────────────────────────────────────┐ │ -│ │ SHARED SERVICES ACCOUNT (OU: Infrastructure) │ │ -│ │ • Transit Gateway Hub • Container Registry (ECR) │ │ -│ │ • VPC Endpoints • Artifact Storage (S3) │ │ -│ └─────────────────────────────────────────────────────────────────────┘ │ -│ │ -└─────────────────────────────────────────────────────────────────────────────┘ +┌─────────────────────────────────────────────────────────────────────┐ +│ LAYER 0 — CONTROL TOWER │ +│ (Managed, Immutable, Audit) │ +├─────────────────────────────────────────────────────────────────────┤ +│ • AWS Organizations • CloudTrail org-level │ +│ • OUs (Security, Infra, • AWS Config │ +│ Workloads, Suspended) • Security Hub │ +│ • Guardrails (400+) • Log Archive + Audit accounts │ +└─────────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────────┐ +│ LAYER 1 — TERRAFORM │ +│ (GitOps, GitHub Actions) │ +├─────────────────────────────────────────────────────────────────────┤ +│ • SSO (groups, permission sets) │ +│ • Custom Controls (aws_controltower_control) │ +│ • Account Factory (baseline: OIDC, KMS, S3 state) │ +│ • Shared Services Account │ +└─────────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────────┐ +│ LAYER 2+ — PLATFORM │ +│ (platform-application-provisioning/) │ +├─────────────────────────────────────────────────────────────────────┤ +│ • 
VPC, Subnets, Transit Gateway │ +│ • EKS Clusters │ +│ • RDS, Kafka, Cache (Aiven) │ +│ • Observability stack │ +└─────────────────────────────────────────────────────────────────────┘ ``` --- -# 🏭 **Account Factory — Self-Service** +# 🔧 **Layer 0 — Control Tower** -> **Principe :** Les équipes demandent un AWS account via PR dans `bootstrap/account-factory/requests/` +> **Setup via Console AWS (one-time)** -## Ce qui est créé automatiquement +## Prerequisites -| Ressource | Description | -|-----------|-------------| -| **AWS Account** | Dans l'OU appropriée (Workloads/Dev, Staging, Prod) | -| **S3 Bucket** | Pour Terraform state | -| **GitHub OIDC** | Pour CI/CD sans credentials statiques | -| **Baseline IAM Roles** | Admin, Developer, ReadOnly | +| Requirement | Status | +|-------------|--------| +| AWS Account (Management) | Required | +| Email domain for accounts | Required | +| Region: eu-west-1 | Required (GDPR) | -## Workflow +## Steps -1. **Équipe** crée un fichier YAML dans `bootstrap/account-factory/requests/` -2. **PR Review** par Platform Team -3. **Merge** déclenche Terraform via CI/CD -4. **Account créé** avec baseline automatique +| Step | Action | Duration | +|------|--------|----------| +| 1 | Console → Control Tower → Set up landing zone | 45 min | +| 2 | Home region: eu-west-1 | Included | +| 3 | Additional regions: eu-central-1 (DR) | Included | +| 4 | Log Archive account created | Automatic | +| 5 | Audit account created | Automatic | +| 6 | IAM Identity Center enabled | Included | ---- +## What Control Tower creates -# 📦 **Platform Application Provisioning** +> **Voir [ADR-001 Resource Inventory](../adr/ADR-001-LANDING-ZONE-APPROACH.md#control-tower-resource-inventory) pour la liste complète.** -> **Repo :** `platform-application-provisioning` -> Contient les modules Terraform pour provisionner les services applicatifs. 
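Because Control Tower owns these resources, the Terraform layer reads them through data sources rather than creating them. A minimal sketch — the account name and output wiring are illustrative assumptions, not taken from this repo:

```hcl
# Hypothetical sketch — reads Control Tower-created resources instead of
# managing them. Account names are assumptions, not from this repo.
data "aws_organizations_organization" "this" {}

# OUs created by Control Tower under the organization root
data "aws_organizations_organizational_units" "root" {
  parent_id = data.aws_organizations_organization.this.roots[0].id
}

# Resolve the Log Archive account ID for downstream modules
output "log_archive_account_id" {
  value = one([
    for a in data.aws_organizations_organization.this.accounts : a.id
    if a.name == "Log Archive"
  ])
}
```
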
+| Account | Resources créées | +|---------|------------------| +| **Management** | Organizations, Service Roles, CloudTrail, Config | +| **Log Archive** | S3 Buckets (logs), KMS Key, Lifecycle Policies | +| **Audit** | Config Aggregator, Security Hub, GuardDuty, IAM Access Analyzer | +| **Workload (baseline)** | CloudTrail local, Config local, CT-managed roles | -## Providers +--- -| Provider | Ce qui est provisionné | Fréquence | -|----------|------------------------|-----------| -| **Cloudflare** | Zone DNS, WAF, Tunnel | 1x par zone | -| **Aiven** | Projet, VPC peering | 1x par environment | -| **AWS** | VPC, EKS, KMS | 1x par environment | +# 🏗️ **Layer 1 — Terraform Foundation** -## Modules disponibles +> **Repo: `bootstrap/`** — GitHub Actions CI/CD -| Module | Description | -|--------|-------------| -| `database/` | Aiven PostgreSQL | -| `kafka/` | Aiven Kafka | -| `cache/` | Aiven Valkey | -| `vpc/` | AWS VPC | -| `eks/` | AWS EKS Cluster | -| `eks-namespace/` | Namespace + RBAC + NetworkPolicy | +## What we manage in Terraform ---- +| Component | Module | Description | +|-----------|--------|-------------| +| **SSO** | `sso/` | Groups, Permission Sets | +| **Custom Controls** | `control-tower/` | Additional guardrails via Terraform | +| **Account Factory** | `account-factory/` | AFT module via GitHub Actions | +| **Shared Services** | `core-accounts/` | ECR, Transit Gateway | -# 🖥️ **Workload Provisioning** +## SSO Groups -> Ordre de provisionnement pour un nouvel environnement. 
+| Group | Permission Set | Access | +|-------|----------------|--------| +| PlatformAdmins | AdministratorAccess | Full access | +| Developers | PowerUserAccess | No IAM changes | +| ReadOnly | ViewOnlyAccess | Read only | +| SecurityAuditors | SecurityAudit | Security review | +| OnCall | IncidentResponder | Break-glass | -| Ordre | Ressource | Dépendances | -|-------|-----------|-------------| -| 1 | VPC + Subnets | Account créé | -| 2 | KMS Keys | Account créé | -| 3 | EKS Cluster | VPC, KMS | -| 4 | IRSA | EKS | -| 5 | VPC Peering (Aiven) | VPC, Aiven projet | -| 6 | Outputs → Platform repos | Tous | +## Custom Controls + +| Control | Purpose | Target OU | +|---------|---------|-----------| +| Require IMDSv2 | EC2 metadata security | Workloads | +| Deny Public S3 | Data protection | All | +| Enforce EU Regions | GDPR compliance | All | +| Require Encryption | Data at rest | Workloads | --- -# 🚀 **Layer 2 — Platform Bootstrap** +# 🏭 **Account Factory** -> Installation des composants platform sur le cluster EKS. +> Self-service account provisioning via PR -| Ordre | Action | Dépendance | -|-------|--------|------------| -| 1 | Install ArgoCD via Helm | EKS ready | -| 2 | Apply App-of-Apps ApplicationSet | ArgoCD running | -| 3 | ArgoCD syncs `platform-*` repos | Reconciliation auto | +## Workflow -**ArgoCD : Instance centralisée unique** gérant tous les environnements. 
+| Step | Action | Actor | +|------|--------|-------| +| 1 | Run `task account:create` | Developer | +| 2 | Fill account request YAML | Developer | +| 3 | Create PR | Developer | +| 4 | Review request | Platform Team | +| 5 | Merge PR | Platform Team | +| 6 | GitHub Actions applies Terraform | Automated | +| 7 | Account ready with baseline | Automated | + +## What's created per account + +| Resource | Purpose | +|----------|---------| +| AWS Account | In appropriate OU | +| S3 Bucket | Terraform state | +| GitHub OIDC | CI/CD authentication | +| KMS Keys | Encryption (terraform, secrets, eks) | +| Security Baseline | EBS encryption, S3 block public | + +## Request fields + +| Field | Description | Example | +|-------|-------------|---------| +| account_name | Unique identifier | localplus-backend-dev | +| environment | dev / staging / prod | dev | +| owner_email | Team contact | backend@localplus.io | +| team | Owning team | backend | +| purpose | Business justification | Backend services development | --- -# 📱 **Layer 3+ — Application Services** - -> ArgoCD ApplicationSets découvrent automatiquement les services. +# 📦 **Platform Provisioning** -## Fonctionnement +> **Repo: `platform-application-provisioning/`** -1. **Git Generator** scanne les répertoires de services -2. **Matrix Generator** croise avec les clusters (dev/staging/prod) -3. **Applications créées** automatiquement pour chaque combinaison -4. 
**Sync** selon la politique (auto pour dev, manual pour prod) +## Order of operations -## Flux de déploiement +| Order | Resource | Dependencies | +|-------|----------|--------------| +| 1 | VPC + Subnets | Account created | +| 2 | KMS Keys | Account created | +| 3 | EKS Cluster | VPC, KMS | +| 4 | IRSA | EKS | +| 5 | VPC Peering (Aiven) | VPC, Aiven project | +| 6 | ArgoCD | EKS | -``` -Git push → ArgoCD détecte → Sync (dev: auto, prod: manual) → Deployed -``` +## Providers -→ **CI/CD détaillé** : voir [Platform Engineering](../platform/PLATFORM-ENGINEERING.md) +| Provider | Resources | Frequency | +|----------|-----------|-----------| +| AWS | VPC, EKS, KMS | 1x per environment | +| Aiven | PostgreSQL, Kafka, Valkey | 1x per environment | +| Cloudflare | DNS, WAF, Tunnel | 1x per zone | --- # 📋 **Bootstrap Repository Structure** -``` -bootstrap/ -├── .mise.toml # Tool versions -├── Taskfile.yaml # Task orchestration -│ -├── aws-landing-zone/ -│ ├── organization/ # OUs definition -│ ├── control-tower/ # Control Tower setup -│ ├── sso/ # SSO groups, permission sets -│ ├── scps/ # Service Control Policies -│ └── core-accounts/ # Core accounts config -│ -├── account-factory/ -│ ├── main.tf # Account creation -│ ├── templates/ # Baseline resources -│ └── requests/ # Account requests (PR) -│ -├── tests/ -│ ├── unit/ # terraform test -│ ├── compliance/ # OPA/Conftest -│ └── security/ # Trivy -│ -└── docs/ - ├── RUNBOOK-BOOTSTRAP.md - └── ACCOUNT-FACTORY.md -``` +| Directory | Purpose | +|-----------|---------| +| `.github/workflows/` | CI/CD pipelines (plan, apply, account-request) | +| `control-tower/` | Data sources + custom controls | +| `sso/` | Groups, Permission Sets | +| `account-factory/` | Account creation + baseline | +| `tests/checkov-policies/` | Custom compliance policies | +| `tests/compliance-bdd/` | BDD audit tests | +| `docs/` | Runbook | + +--- + +# 🔒 **Policy as Code** + +| Tool | Purpose | Format | +|------|---------|--------| +| **Trivy** | Security 
scanning (IaC + secrets) | Built-in | +| **Checkov** | 2000+ compliance policies | YAML | +| **terraform-compliance** | Audit-readable policies | BDD/Gherkin | --- @@ -203,6 +231,7 @@ bootstrap/ | Topic | Link | |-------|------| +| **ADR Landing Zone** | [ADR-001](../adr/ADR-001-LANDING-ZONE-APPROACH.md) | | **CI/CD & Delivery** | [Platform Engineering](../platform/PLATFORM-ENGINEERING.md) | | **Security Setup** | [Security Architecture](../security/SECURITY-ARCHITECTURE.md) | | **Networking** | [Networking Architecture](../networking/NETWORKING-ARCHITECTURE.md) | diff --git a/networking/NETWORKING-ARCHITECTURE.md b/networking/NETWORKING-ARCHITECTURE.md index ab19582..5790d32 100644 --- a/networking/NETWORKING-ARCHITECTURE.md +++ b/networking/NETWORKING-ARCHITECTURE.md @@ -18,7 +18,7 @@ 7. [Route53 — DNS Interne & Backup](#route53--dns-interne--backup) 8. [API Gateway / APIM (Future)](#api-gateway--apim-future) 9. [Multi-Cloud Vision](#multi-cloud-vision) -10. [Topologie Multi-Account (Control Tower)](#topologie-multi-account-control-tower) +10. [Topologie Multi-Account (AWS Organizations)](#topologie-multi-account-aws-organizations) --- @@ -466,18 +466,20 @@ --- -# 🏢 **Topologie Multi-Account (Control Tower)** +# 🏢 **Topologie Multi-Account (AWS Organizations)** ## Architecture Actuelle — Phase 1 +> **Note :** Control Tower n'est pas utilisé. Nous gérons OUs, SCPs et comptes via Terraform. 
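A minimal sketch of that pure-Terraform approach — resource names, the region list, and the policy content are illustrative assumptions, not from this repo:

```hcl
# Hypothetical sketch — OU + SCP managed directly in Terraform
# (no Control Tower). Names and regions are illustrative only.
resource "aws_organizations_organization" "this" {
  feature_set = "ALL"
}

resource "aws_organizations_organizational_unit" "workloads" {
  name      = "Workloads"
  parent_id = aws_organizations_organization.this.roots[0].id
}

# Example guardrail: restrict activity to EU regions (GDPR)
resource "aws_organizations_policy" "deny_non_eu_regions" {
  name = "deny-non-eu-regions"
  content = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Deny"
      Action   = "*"
      Resource = "*"
      Condition = {
        StringNotEquals = {
          "aws:RequestedRegion" = ["eu-west-1", "eu-central-1"]
        }
      }
    }]
  })
}

resource "aws_organizations_policy_attachment" "workloads_eu_only" {
  policy_id = aws_organizations_policy.deny_non_eu_regions.id
  target_id = aws_organizations_organizational_unit.workloads.id
}
```

A production SCP would also exempt global services from the region condition; this fragment only illustrates how OUs and SCPs stay in ordinary Terraform resources under this approach.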
+ ``` ┌─────────────────────────────────────────────────────────────────────────────┐ -│ CONTROL TOWER — MULTI-ACCOUNT │ +│ AWS ORGANIZATIONS — MULTI-ACCOUNT │ ├─────────────────────────────────────────────────────────────────────────────┤ │ │ │ ┌─────────────────────────────────────────────────────────────────────┐ │ │ │ MANAGEMENT ACCOUNT │ │ -│ │ • AWS Organizations, Control Tower, SCPs │ │ +│ │ • AWS Organizations, SCPs, IAM Identity Center │ │ │ │ ⚠️ Pas de workloads, pas de VPC │ │ │ └─────────────────────────────────────────────────────────────────────┘ │ │ │ From 4dcfd0a1fc267ce68510dbf546a89e36b85373a0 Mon Sep 17 00:00:00 2001 From: NasrLadib Date: Mon, 23 Feb 2026 00:17:28 +0100 Subject: [PATCH 5/6] docs: add core platform documentation and refactor existing architecture docs Introduce new docs for agent architecture, provider interface, customer infra management, customer onboarding, YAML config validator, local dev guide, template usage guide, and product roadmap. Restructure enterprise architecture, glossary, data architecture, bootstrap guide, and ADR-001 for consistency and clarity. 
--- EntrepriseArchitecture.md | 1017 ++++++++++++++++++-------- GLOSSARY.md | 776 ++++---------------- adr/ADR-001-LANDING-ZONE-APPROACH.md | 23 +- agent/AGENT-ARCHITECTURE.md | 285 ++++++++ bootstrap/BOOTSTRAP-GUIDE.md | 25 +- data/DATA-ARCHITECTURE.md | 633 ++++++---------- development/LOCAL-DEV-GUIDE.md | 576 +++++++++++++++ development/TEMPLATE-USAGE-GUIDE.md | 488 ++++++++++++ infra/CUSTOMER-INFRA-MANAGEMENT.md | 270 +++++++ observability/OBSERVABILITY-GUIDE.md | 104 +-- observability/OTEL-CONVENTIONS.md | 346 +++++++++ onboarding/CUSTOMER-ONBOARDING.md | 233 ++++++ plans/KIVEN_ROADMAP.md | 532 ++++++++++++++ platform/YAML-CONFIG-VALIDATOR.md | 141 ++++ providers/PROVIDER-INTERFACE.md | 265 +++++++ 15 files changed, 4282 insertions(+), 1432 deletions(-) create mode 100644 agent/AGENT-ARCHITECTURE.md create mode 100644 development/LOCAL-DEV-GUIDE.md create mode 100644 development/TEMPLATE-USAGE-GUIDE.md create mode 100644 infra/CUSTOMER-INFRA-MANAGEMENT.md create mode 100644 observability/OTEL-CONVENTIONS.md create mode 100644 onboarding/CUSTOMER-ONBOARDING.md create mode 100644 plans/KIVEN_ROADMAP.md create mode 100644 platform/YAML-CONFIG-VALIDATOR.md create mode 100644 providers/PROVIDER-INTERFACE.md diff --git a/EntrepriseArchitecture.md b/EntrepriseArchitecture.md index 831222c..09ebd0f 100644 --- a/EntrepriseArchitecture.md +++ b/EntrepriseArchitecture.md @@ -1,172 +1,463 @@ -# 🏗️ **LOCAL-PLUS — Architecture Overview** -## *Gift Card & Loyalty Platform* -### *Version 1.0 — Janvier 2026* +# Kiven — Architecture Overview +## *Managed Data Services, On Your Infrastructure* +### *Version 2.0 — February 2026* --- -> **Ce document est la porte d'entrée de l'architecture LOCAL-PLUS.** -> Il fournit une vue d'ensemble et des liens vers la documentation détaillée. +> **This document is the entry point for Kiven's architecture.** +> It provides a high-level overview and links to detailed documentation. 
--- -# 📋 **PARTIE I — EXECUTIVE SUMMARY** +# PART I — EXECUTIVE SUMMARY -## **1.1 Scope** +## 1.1 What Is Kiven -LOCAL-PLUS est une plateforme de gestion de cartes cadeaux et fidélité, conçue pour : -- **Scalabilité** : 500 TPS, 1500 RPS -- **Résilience** : RPO 1h, RTO 15min -- **Compliance** : GDPR, PCI-DSS, SOC2 -- **Durée de vie** : 5+ ans +Kiven is a **fully managed data platform** that runs on the customer's own Kubernetes infrastructure. Starting with PostgreSQL (powered by CloudNativePG), Kiven delivers an Aiven-quality experience — but the data never leaves the customer's cluster. -### **Non-Goals (Phase 1)** -- Multi-région active-active -- API Gateway/APIM dédié (évaluation future) -- Mobile apps natives +**How it works:** +1. Customer signs up → grants Kiven access to their EKS (cross-account IAM Role) +2. Kiven provisions everything: dedicated nodes, storage, S3 backups, CNPG operator, PostgreSQL +3. Customer gets: a connection string + a dashboard +4. Kiven manages everything from that point: scaling, backups, monitoring, security, tuning -## **1.2 Paramètres Clés** +**The customer never touches kubectl, YAML, CNPG, or Kubernetes internals.** -| Paramètre | Valeur | Impact | -|-----------|--------|--------| -| **RPO** | 1 heure | Backups horaires, réplication async | -| **RTO** | 15 minutes | Failover automatisé | -| **TPS** | 500 transactions/sec | Single Postgres suffit | -| **RPS** | 1500 requêtes/sec | Load balancer + HPA standard | -| **Équipe on-call** | 5 personnes | Runbooks exhaustifs | +### Value Proposition -## **1.3 Compliance Summary** +| vs. Aiven | vs. Self-Managed CNPG | vs. 
Launchly | +|-----------|----------------------|-------------| +| Same UX, but on customer's infra | Same PostgreSQL, but fully managed | Same CNPG, but Aiven-level depth | +| 40-60% cheaper (no Aiven markup) | No need for K8s/CNPG expertise | Full infra management (nodes, storage) | +| Data never leaves customer's VPC | Risk eliminated by best practices | DBA intelligence built-in | -| Standard | Exigences clés | Documentation | -|----------|---------------|---------------| -| **GDPR** | Data residency EU, droit à l'oubli | → [compliance/gdpr/](compliance/gdpr/) | -| **PCI-DSS** | Pas de stockage PAN, encryption, audit | → [compliance/pci-dss/](compliance/pci-dss/) | -| **SOC2** | RBAC, monitoring, incident response | → [compliance/soc2/](compliance/soc2/) | +## 1.2 Scope -## **1.4 Tech Stack Overview** +Kiven is designed for: +- **Scalability**: Support 100+ customer clusters across multiple EKS environments +- **Reliability**: RPO 1h, RTO 15min (Kiven SaaS); RPO 5min, RTO 5min (customer databases via CNPG) +- **Compliance**: GDPR (EU data residency), SOC2 (audit, RBAC, encryption) +- **Extensibility**: Provider/plugin architecture for multi-operator future (Kafka, Redis, Elasticsearch) +- **Lifespan**: 5+ years -| Catégorie | Choix | Rationale | -|-----------|-------|-----------| -| **Cloud** | AWS (eu-west-1) | Décision business, GDPR | +### Non-Goals (Phase 1) +- Multi-cloud support (GKE, AKS) — Phase 3 +- Non-PostgreSQL data services (Kafka, Redis) — Phase 3 +- Self-hosted / air-gapped edition — Phase 3 +- Mobile app + +## 1.3 Key Parameters + +| Parameter | Value | Impact | +|-----------|-------|--------| +| **RPO (Kiven SaaS)** | 1 hour | Hourly backups of product database | +| **RTO (Kiven SaaS)** | 15 minutes | Automated failover | +| **RPO (Customer DBs)** | Configurable (1min–24h) | Continuous WAL archiving via Barman | +| **RTO (Customer DBs)** | < 5 minutes | CNPG automatic failover, multi-AZ | +| **Provisioning time** | < 10 minutes | From "Create 
Database" to connection string | +| **Agent footprint** | < 50MB RAM, < 0.1 CPU | Minimal impact on customer cluster | +| **On-call team** | 5 people | Runbooks for both SaaS and customer infra | + +## 1.4 Compliance Summary + +| Standard | Key Requirements | Scope | +|----------|-----------------|-------| +| **GDPR** | EU data residency, right to erasure, DPA | Kiven SaaS (eu-west-1) + customer data stays in their infra | +| **SOC2** | RBAC, audit logging, encryption, incident response | Kiven SaaS operations + customer infra access audit trail | + +> Note: PCI-DSS is NOT in scope. Kiven does not process payment card data. Customer compliance (HIPAA, PCI, etc.) is helped by data staying on their own infra. + +## 1.5 Tech Stack Overview + +### Kiven SaaS Platform + +| Category | Choice | Rationale | +|----------|--------|-----------| +| **Cloud** | AWS (eu-west-1) | GDPR, proximity to EU customers | | **Orchestration** | EKS + ArgoCD | GitOps, cloud-native | -| **Database** | Aiven PostgreSQL | Managed, PCI compliant | -| **Messaging** | Aiven Kafka | Event-driven, managed | -| **Cache** | Aiven Valkey | Redis-compatible, managed | -| **Edge/CDN** | Cloudflare | WAF, DDoS, Zero Trust | -| **Observability** | Prometheus/Loki/Tempo | Self-hosted, coût minimal | -| **Secrets** | HashiCorp Vault | Dynamic secrets, rotation | -| **CNI** | Cilium | mTLS, Gateway API | -| **Policies** | Kyverno | Admission control | +| **Backend** | Go (stdlib + chi) | K8s ecosystem is Go, fast, small binaries | +| **Frontend** | Next.js 14+ (App Router) + Tailwind + shadcn/ui | Modern, fast, beautiful | +| **Agent** | Go (client-go + controller-runtime) | Native K8s SDK, single binary | +| **Agent Comms** | gRPC + mTLS | Secure, efficient, bidirectional streaming | +| **Product DB** | PostgreSQL (Aiven) | Dogfooding the ecosystem, managed | +| **Cache** | Valkey | Sessions, rate limiting, real-time state | +| **Messaging** | Kafka (Aiven) | Agent events, audit trail, async operations | 
+| **Edge/CDN** | Cloudflare | WAF, DDoS, Zero Trust, Tunnel | +| **Observability** | Prometheus / Loki / Tempo | Self-hosted, cost-efficient | +| **Secrets** | HashiCorp Vault | Dynamic secrets, rotation, IRSA | +| **CNI** | Cilium | mTLS, Gateway API, network policies | +| **Policies** | Kyverno | Admission control, pod security | +| **Billing** | Stripe | SaaS billing, per-cluster pricing | +| **CI/CD** | GitHub Actions | Already in place | +| **IaC** | Terraform | Infrastructure as Code | + +### Customer-Side (Provisioned by Kiven) + +| Component | Technology | Managed By | +|-----------|-----------|------------| +| **Kubernetes nodes** | EKS Managed Node Groups | Kiven (via AWS API) | +| **PostgreSQL** | CloudNativePG (CNPG) | Kiven (via agent) | +| **Connection pooling** | PgBouncer (CNPG Pooler CRD) | Kiven (via agent) | +| **Backups** | Barman → S3 | Kiven (via agent + AWS API) | +| **Storage** | EBS gp3 (encrypted, KMS) | Kiven (via AWS API) | +| **Backup storage** | S3 bucket (encrypted, lifecycle) | Kiven (via AWS API) | +| **TLS** | cert-manager + self-signed CA | Kiven (via agent) | +| **Monitoring agent** | Kiven Agent (Go) | Kiven | --- -# 🏛️ **PARTIE II — ARCHITECTURE** +# PART II — ARCHITECTURE -## **2.1 Context Diagram (C4 Level 1)** +## 2.1 System Context (C4 Level 1) ``` -┌─────────────────────────────────────────────────────────────────────────────┐ -│ END USERS │ -│ (Merchants, Consumers, Partners) │ -└─────────────────────────────────────────────────────────────────────────────┘ - │ - ▼ -┌─────────────────────────────────────────────────────────────────────────────┐ -│ CLOUDFLARE EDGE │ -│ (DNS, WAF, DDoS, CDN, Zero Trust, Tunnel) │ -└─────────────────────────────────────────────────────────────────────────────┘ +┌──────────────────────────────────────────────────────────────────────────┐ +│ USERS │ +│ Developers (Simple Mode) DevOps (Advanced Mode) │ +└──────────────────────────────────────────────────────────────────────────┘ │ ▼ 
-┌─────────────────────────────────────────────────────────────────────────────┐ -│ LOCAL-PLUS PLATFORM │ -│ (AWS EKS) │ -│ ┌─────────────────────────────────────────────────────────────────────┐ │ -│ │ Domain Services: svc-ledger, svc-wallet, svc-merchant, svc-giftcard │ │ -│ └─────────────────────────────────────────────────────────────────────┘ │ -└─────────────────────────────────────────────────────────────────────────────┘ +┌──────────────────────────────────────────────────────────────────────────┐ +│ CLOUDFLARE EDGE │ +│ (DNS, WAF, DDoS, CDN, Zero Trust) │ +└──────────────────────────────────────────────────────────────────────────┘ │ ▼ -┌─────────────────────────────────────────────────────────────────────────────┐ -│ AIVEN DATA LAYER │ -│ (PostgreSQL, Kafka, Valkey — VPC Peering) │ -└─────────────────────────────────────────────────────────────────────────────┘ +┌──────────────────────────────────────────────────────────────────────────┐ +│ KIVEN SaaS PLATFORM │ +│ (AWS EKS — eu-west-1) │ +│ │ +│ Dashboard + API + CLI + Terraform Provider │ +│ Core Services: provisioner, infra, clusters, backups, monitoring... │ +│ Provider/Plugin: CNPG Provider (Phase 1), Strimzi (future)... 
│ +└──────────────────────────────────────────────────────────────────────────┘ + │ │ + │ gRPC/mTLS (Agent) │ Cross-Account + │ │ IAM AssumeRole + ▼ ▼ +┌──────────────────────────────────────────────────────────────────────────┐ +│ CUSTOMER'S AWS ACCOUNT / EKS │ +│ │ +│ ┌──── Managed by Kiven ──────────────────────────────────────────────┐ │ +│ │ Node Group: kiven-db-nodes (dedicated, tainted, multi-AZ) │ │ +│ │ Namespace: kiven-system (agent + CNPG operator) │ │ +│ │ Namespace: kiven-databases (PostgreSQL clusters) │ │ +│ │ S3 Bucket: kiven-backups-{customer-id} │ │ +│ │ IAM: IRSA roles for S3 access │ │ +│ │ CNPG: PostgreSQL Primary + Replicas + PgBouncer │ │ +│ └────────────────────────────────────────────────────────────────────┘ │ +│ │ +│ ┌──── Managed by Customer ───────────────────────────────────────────┐ │ +│ │ Their app nodes, services, workloads │ │ +│ │ Connect to: pg-main.kiven-databases.svc:5432 │ │ +│ └────────────────────────────────────────────────────────────────────┘ │ +└──────────────────────────────────────────────────────────────────────────┘ ``` -## **2.2 Container Diagram (C4 Level 2)** +## 2.2 Container Diagram (C4 Level 2) — Kiven SaaS ``` -┌─────────────────────────────────────────────────────────────────────────────┐ -│ AWS WORKLOAD ACCOUNT — eu-west-1 │ -├─────────────────────────────────────────────────────────────────────────────┤ -│ │ -│ ┌────────────────────────────────────────────────────────────────────────┐ │ -│ │ EKS CLUSTER │ │ -│ │ │ │ -│ │ ┌─────────────────────────────────────────────────────────────────┐ │ │ -│ │ │ PLATFORM NODE POOL (taints: platform=true:NoSchedule) │ │ │ -│ │ │ • ArgoCD • Cilium • Vault Agent │ │ │ -│ │ │ • OTel Collector • Prometheus • Grafana │ │ │ -│ │ │ • Loki • Tempo • Kyverno │ │ │ -│ │ └─────────────────────────────────────────────────────────────────┘ │ │ -│ │ │ │ -│ │ ┌─────────────────────────────────────────────────────────────────┐ │ │ -│ │ │ APPLICATION NODE POOL (default, auto-scaling) │ │ 
│ -│ │ │ • svc-ledger • svc-wallet • svc-merchant │ │ │ -│ │ │ • svc-giftcard • svc-notification │ │ │ -│ │ └─────────────────────────────────────────────────────────────────┘ │ │ -│ │ │ │ -│ └────────────────────────────────────────────────────────────────────────┘ │ -│ │ │ -│ │ VPC Peering │ -│ ▼ │ -│ ┌────────────────────────────────────────────────────────────────────────┐ │ -│ │ AIVEN VPC │ │ -│ │ • PostgreSQL (Primary + Read Replica) │ │ -│ │ • Kafka Cluster (3 brokers) │ │ -│ │ • Valkey Cluster (HA) │ │ -│ └────────────────────────────────────────────────────────────────────────┘ │ -│ │ -└─────────────────────────────────────────────────────────────────────────────┘ +┌──────────────────────────────────────────────────────────────────────────┐ +│ KIVEN SaaS — AWS WORKLOAD ACCOUNT — eu-west-1 │ +├──────────────────────────────────────────────────────────────────────────┤ +│ │ +│ ┌─────────────────────────────────────────────────────────────────────┐ │ +│ │ EKS CLUSTER │ │ +│ │ │ │ +│ │ ┌──────────────────────────────────────────────────────────────┐ │ │ +│ │ │ PLATFORM NODE POOL (taints: platform=true:NoSchedule) │ │ │ +│ │ │ • ArgoCD • Cilium • Vault Agent │ │ │ +│ │ │ • OTel Collector • Prometheus • Grafana │ │ │ +│ │ │ • Loki • Tempo • Kyverno │ │ │ +│ │ └──────────────────────────────────────────────────────────────┘ │ │ +│ │ │ │ +│ │ ┌──────────────────────────────────────────────────────────────┐ │ │ +│ │ │ APPLICATION NODE POOL (auto-scaling) │ │ │ +│ │ │ │ │ │ +│ │ │ ┌── Core ─────────────────────────────────────────────┐ │ │ │ +│ │ │ │ svc-api svc-auth svc-provisioner │ │ │ │ +│ │ │ │ svc-infra svc-clusters svc-agent-relay │ │ │ │ +│ │ │ └─────────────────────────────────────────────────────┘ │ │ │ +│ │ │ │ │ │ +│ │ │ ┌── Data Services ────────────────────────────────────┐ │ │ │ +│ │ │ │ svc-backups svc-monitoring svc-users │ │ │ │ +│ │ │ │ svc-yamleditor svc-migrations │ │ │ │ +│ │ │ └─────────────────────────────────────────────────────┘ │ │ │ +│ │ 
│ │ │ │ +│ │ │ ┌── Business ────────────────────────────────────────┐ │ │ │ +│ │ │ │ svc-billing svc-audit svc-notification│ │ │ │ +│ │ │ └─────────────────────────────────────────────────────┘ │ │ │ +│ │ └──────────────────────────────────────────────────────────────┘ │ │ +│ │ │ │ +│ └─────────────────────────────────────────────────────────────────────┘ │ +│ │ │ +│ │ VPC Peering │ +│ ▼ │ +│ ┌─────────────────────────────────────────────────────────────────────┐ │ +│ │ AIVEN VPC │ │ +│ │ • PostgreSQL (Kiven product database) │ │ +│ │ • Kafka (agent events, audit trail, async ops) │ │ +│ │ • Valkey (sessions, rate limiting, cache) │ │ +│ └─────────────────────────────────────────────────────────────────────┘ │ +│ │ +└──────────────────────────────────────────────────────────────────────────┘ ``` -## **2.3 Domain Services** +## 2.3 Container Diagram (C4 Level 2) — Customer Side -| Service | Responsabilité | Pattern | Criticité | -|---------|---------------|---------|-----------| -| **svc-ledger** | Earn/Burn transactions, ACID ledger | Sync REST + gRPC | P0 — Core | -| **svc-wallet** | Balance queries, snapshots | Sync REST + gRPC | P0 — Core | -| **svc-merchant** | Onboarding, configuration | Sync REST | P1 | -| **svc-giftcard** | Catalog, rewards | Sync REST | P1 | -| **svc-notification** | SMS/Email dispatch | Async (Kafka consumer) | P2 | +``` +┌──────────────────────────────────────────────────────────────────────────┐ +│ CUSTOMER'S EKS CLUSTER │ +├──────────────────────────────────────────────────────────────────────────┤ +│ │ +│ ┌─────────────────────────────────────────────────────────────────────┐ │ +│ │ NODE GROUP: kiven-db-nodes (Managed by Kiven) │ │ +│ │ Instance: r6g.medium–r6g.2xlarge (memory-optimized) │ │ +│ │ Taint: kiven.io/role=database:NoSchedule │ │ +│ │ Multi-AZ: primary in AZ-a, replica in AZ-b │ │ +│ │ │ │ +│ │ ┌── Namespace: kiven-system ──────────────────────────────────┐ │ │ +│ │ │ Kiven Agent (Go) — gRPC → Kiven SaaS │ │ │ +│ │ │ CNPG 
Operator — manages PG clusters │ │ │ +│ │ │ cert-manager (optional) — TLS certificates │ │ │ +│ │ └─────────────────────────────────────────────────────────────┘ │ │ +│ │ │ │ +│ │ ┌── Namespace: kiven-databases ───────────────────────────────┐ │ │ +│ │ │ │ │ │ +│ │ │ CNPG Cluster: pg-production-main │ │ │ +│ │ │ ├─ Pod: pg-production-main-1 (Primary, AZ-a) │ │ │ +│ │ │ ├─ Pod: pg-production-main-2 (Replica, AZ-b) │ │ │ +│ │ │ ├─ Pod: pg-production-main-3 (Replica, AZ-c) │ │ │ +│ │ │ ├─ Service: pg-production-main-rw (read-write) │ │ │ +│ │ │ ├─ Service: pg-production-main-ro (read-only) │ │ │ +│ │ │ └─ Pooler: pg-production-main-pooler (PgBouncer) │ │ │ +│ │ │ │ │ │ +│ │ │ ScheduledBackup → S3: kiven-backups-{customer-id} │ │ │ +│ │ │ NetworkPolicy: only kiven-databases + customer-app-ns │ │ │ +│ │ └─────────────────────────────────────────────────────────────┘ │ │ +│ │ │ │ +│ └─────────────────────────────────────────────────────────────────────┘ │ +│ │ +│ ┌─────────────────────────────────────────────────────────────────────┐ │ +│ │ NODE GROUP: customer-app-nodes (Managed by Customer) │ │ +│ │ • Customer's application pods │ │ +│ │ • Connect to: pg-production-main-pooler.kiven-databases.svc:5432 │ │ +│ └─────────────────────────────────────────────────────────────────────┘ │ +│ │ +│ ┌─────────────────────────────────────────────────────────────────────┐ │ +│ │ AWS Resources (Managed by Kiven via cross-account IAM) │ │ +│ │ • EBS gp3 volumes (encrypted, KMS) │ │ +│ │ • S3 bucket: kiven-backups-{customer-id} │ │ +│ │ • IAM IRSA role: kiven-cnpg-backup-role │ │ +│ └─────────────────────────────────────────────────────────────────────┘ │ +│ │ +└──────────────────────────────────────────────────────────────────────────┘ +``` + +## 2.4 Core Services + +### Service Catalog + +| Service | Responsibility | Language | Priority | +|---------|---------------|----------|----------| +| **svc-api** | REST + GraphQL gateway, request routing | Go | P0 | +| **svc-auth** | OIDC 
(Google/GitHub/SAML), RBAC, API keys, org/team model | Go | P0 | +| **svc-provisioner** | **THE BRAIN** — Orchestrates full provisioning pipeline (nodes → storage → S3 → CNPG → PG) | Go | P0 | +| **svc-infra** | AWS resource management in customer accounts (EC2, EBS, S3, IAM, KMS) | Go | P0 | +| **svc-clusters** | Cluster lifecycle via provider interface (status, scale, upgrade, delete) | Go | P0 | +| **svc-backups** | Backup/restore management, PITR, fork/clone, backup verification | Go | P0 | +| **svc-monitoring** | Metrics ingestion from agents, DBA intelligence, alerts engine | Go | P0 | +| **svc-users** | Database user/role management, permissions, pg_hba rules | Go | P0 | +| **svc-agent-relay** | gRPC server, multiplexes all customer agent connections | Go | P0 | +| **svc-yamleditor** | YAML generation, schema validation, diff engine, change history | Go | P0 | +| **svc-migrations** | Import from Aiven/RDS/bare PG into Kiven-managed clusters | Go | P1 | +| **svc-billing** | Stripe integration, usage tracking, per-cluster pricing | Go | P1 | +| **svc-audit** | Immutable audit log of all operations on customer infra | Go | P1 | +| **svc-notification** | Alerts via Slack, email, webhook, PagerDuty | Go | P1 | +| **agent** | In-cluster binary — CNPG controller, PG stats, command executor, log aggregator | Go | P0 | -## **2.4 Data Flow** +### Provider/Plugin Architecture + +The core engine is **operator-agnostic**. Each data service is a **provider** implementing a standard Go interface. Phase 1 ships the CNPG provider only. Future providers (Strimzi, Redis, ECK) plug in without rewriting core services. 
``` -┌─────────────┐ gRPC ┌─────────────┐ -│ svc-ledger │◄─────────────►│ svc-wallet │ -└──────┬──────┘ └──────┬──────┘ - │ │ - │ Outbox │ Read - ▼ ▼ -┌─────────────┐ ┌─────────────┐ -│ Kafka │ │ PostgreSQL │ -│ (Aiven) │ │ (Aiven) │ -└──────┬──────┘ └─────────────┘ - │ - │ Consume - ▼ -┌─────────────────────┐ -│ svc-notification │ -│ svc-analytics │ -└─────────────────────┘ +Core Engine (operator-agnostic) + ├── svc-provisioner → calls provider.Provision() + ├── svc-clusters → calls provider.Scale(), provider.Status() + ├── svc-backups → calls provider.Backup(), provider.Restore() + ├── svc-monitoring → calls provider.CollectMetrics() + └── svc-users → calls provider.CreateUser() + │ + ▼ + Provider Interface (Go interface) + │ + ┌─────┴───────────────────────────────┐ + │ CNPG Provider (Phase 1 — PG) │ + │ Strimzi Provider (Phase 3 — Kafka) │ + │ Redis Provider (Phase 3 — Redis) │ + │ ECK Provider (Phase 3 — ES) │ + └─────────────────────────────────────┘ ``` +## 2.5 Data Flow — Provisioning + +``` +Customer clicks "Create Database" + │ + ▼ +┌─── svc-api ───┐ ┌─── svc-auth ──┐ +│ Validate req │────▶│ Check RBAC │ +└───────┬───────┘ └───────────────┘ + │ + ▼ +┌─── svc-provisioner (THE BRAIN) ──────────────────────────────────────┐ +│ │ +│ 1. svc-infra → AssumeRole → Create node group (kiven-db-nodes) │ +│ 2. svc-infra → AssumeRole → Create StorageClass (gp3, encrypted) │ +│ 3. svc-infra → AssumeRole → Create S3 bucket (backups) │ +│ 4. svc-infra → AssumeRole → Create IRSA role (CNPG → S3) │ +│ 5. agent → Install CNPG operator (Helm) │ +│ 6. agent → Apply CNPG Cluster YAML (generated by svc-clusters) │ +│ 7. agent → Apply PgBouncer Pooler YAML │ +│ 8. agent → Apply ScheduledBackup YAML │ +│ 9. agent → Apply NetworkPolicy YAML │ +│ 10. agent → Wait for cluster healthy │ +│ 11. svc-users → Create initial database + user │ +│ 12. 
Return connection string to customer │ +│ │ +│ Status updates streamed via agent gRPC → svc-agent-relay │ +│ Dashboard shows real-time provisioning progress │ +└───────────────────────────────────────────────────────────────────────┘ +``` + +## 2.6 Data Flow — Steady State + +``` +┌─── Kiven Agent (in customer K8s) ─────────────────────────────┐ +│ │ +│ CNPG Controller ──── watches Cluster/Backup/Pooler CRDs │ +│ PG Stats Collector ─ pg_stat_statements, pg_stat_activity │ +│ Log Aggregator ───── PG logs from all pods │ +│ Infra Reporter ───── node status, EBS usage, pod health │ +│ │ +│ Every 30s: streams metrics + status to svc-agent-relay │ +│ On event: immediately reports (failover, backup done, error) │ +└────────────────────────┬───────────────────────────────────────┘ + │ gRPC/mTLS (outbound only) + ▼ +┌─── svc-agent-relay ───────────────────────────────────────────┐ +│ Multiplexes connections from all customer agents │ +│ Routes events to: svc-monitoring, svc-clusters, svc-audit │ +└───────────────────────────────────────────────────────────────┘ + │ + ┌──────────────┼──────────────┐ + ▼ ▼ ▼ + svc-monitoring svc-clusters svc-audit + (DBA intelligence, (status update) (immutable log) + alert engine) +``` + +## 2.7 Service Plans + +Each database is provisioned with a **plan** that determines compute, memory, storage, and HA configuration: + +| Plan | CPU | RAM | Storage | Instances | HA | Node Type | Use Case | +|------|-----|-----|---------|-----------|-----|-----------|----------| +| **Hobbyist** | 1 vCPU | 1 GB | 10 GB | 1 | No | t3.small | Testing, personal projects | +| **Startup** | 2 vCPU | 4 GB | 50 GB | 2 | Yes | r6g.medium | Small apps, dev/staging | +| **Business** | 4 vCPU | 16 GB | 100 GB | 3 | Yes | r6g.large | Production, medium traffic | +| **Premium** | 8 vCPU | 32 GB | 500 GB | 3 | Yes | r6g.xlarge | High-performance, analytics | +| **Custom** | User-defined | User-defined | User-defined | 1-5 | Configurable | Any | Specific requirements | + 
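
As an illustration, the **Business** plan (4 vCPU / 16 GB / 100 GB / 3 instances, HA) could materialize as a CNPG `Cluster` manifest roughly like this — a sketch with assumed values and names, not the manifest Kiven actually generates:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-production-main
  namespace: kiven-databases
spec:
  instances: 3                          # Business plan: HA, 3 instances
  storage:
    size: 100Gi
    storageClass: kiven-gp3-encrypted   # assumed StorageClass name
  resources:
    requests:
      cpu: "4"
      memory: 16Gi
    limits:
      cpu: "4"
      memory: 16Gi
  postgresql:
    parameters:                         # pre-tuned for the plan size
      shared_buffers: "4GB"             # ~25% of RAM
      work_mem: "16MB"
  affinity:
    tolerations:                        # land on the tainted kiven-db-nodes
      - key: kiven.io/role
        operator: Equal
        value: database
        effect: NoSchedule
```

A plan change rewrites `instances`, `resources`, and the tuned parameters; CNPG then applies it as a rolling update.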
+Each plan includes: +- Pre-tuned `postgresql.conf` (shared_buffers, work_mem, etc. sized for the plan) +- Appropriate PgBouncer pool size and mode +- Right backup frequency and retention +- Resource limits and requests matching the node type + +Plans can be **upgraded or downgraded** at any time from the dashboard (triggers a rolling update via CNPG). + +## 2.8 Power Off / Power On + +Databases can be **paused** to eliminate compute costs while preserving data. This is a fundamental advantage of the "managed on your infra" model — something Aiven cannot offer because they own the infrastructure. + +### Power Off (Pause) + +``` +Customer clicks "Power Off" + │ + ├─ 1. svc-clusters → agent: Delete CNPG Cluster CR + │ PVC reclaim policy = RETAIN → EBS volumes preserved + │ + ├─ 2. CNPG pods terminated, K8s services removed + │ EBS volumes detached but retained in AWS + │ + ├─ 3. svc-infra → AWS API: Scale node group to 0 + │ No more EC2 cost + │ + └─ 4. Dashboard: "Paused — Data safe, no compute cost" + S3 backups and EBS volumes remain +``` + +### Power On (Resume) + +``` +Customer clicks "Resume" + │ + ├─ 1. svc-infra → AWS API: Scale node group back up + │ Wait for nodes ready (~2-3 min) + │ + ├─ 2. svc-clusters → agent: Apply CNPG Cluster CR + │ References existing PVCs (same EBS volume IDs) + │ + ├─ 3. CNPG starts PostgreSQL with existing data + │ Primary elected, replicas sync (~1-2 min) + │ + └─ 4. 
Dashboard: "Running — Resumed" + Connection strings unchanged, total resume time ~3-5 min +``` + +### Scheduled Power Off/On + +Automate power schedules for non-production environments: +- Example: Mon-Fri 8am-6pm ON, nights and weekends OFF +- Savings: 60-70% on dev/staging compute costs +- Configured via dashboard, API, CLI, or Terraform + +### Cost Impact + +| Scenario | Always On | Scheduled (10h/day, weekdays) | Savings | +|----------|-----------|-------------------------------|---------| +| Startup plan (2×r6g.medium) | ~$180/mo | ~$55/mo | 70% | +| Business plan (3×r6g.large) | ~$450/mo | ~$140/mo | 69% | +| Paused (storage only) | — | ~$10/mo | 94% | + +## 2.9 Two UX Modes + +### Simple Mode (Default) — "Aiven Experience" + +For developers who just need a database. Forms, sliders, buttons. No YAML visible. +- Create database → pick plan → get connection string +- Manage users, backups, config via UI forms +- See metrics, alerts, logs in clean dashboards + +### Advanced Mode — "Lens Experience" + +For DevOps/Platform engineers who want full control. Like Lens for Kubernetes. +- View the generated YAML for every resource (CNPG Cluster, Pooler, Backup, etc.) 
+- Edit YAML directly in Monaco editor (VS Code-like) with CNPG schema validation +- Diff view before applying changes +- Change history (git-like timeline of all YAML changes) +- Rollback to any previous YAML version +- Toggle between modes at any time + --- -# 🌿 **PARTIE III — DELIVERY MODEL** +# PART III — DELIVERY MODEL -## **3.1 Git Strategy** +## 3.1 Git Strategy -**Trunk-Based Development avec Cherry-Pick** +**Trunk-Based Development with Cherry-Pick** ``` main (trunk) @@ -183,63 +474,67 @@ LOCAL-PLUS est une plateforme de gestion de cartes cadeaux et fidélité, conçu │ │ ▼ ▼ maintenance/v1.x.x maintenance/v2.x.x - (cherry-pick avec (cherry-pick avec + (cherry-pick with (cherry-pick with label: backport-v1) label: backport-v2) ``` -| Branche | Usage | Politique | -|---------|-------|-----------| -| `main` | Trunk principal | Tous les PRs mergent ici | -| `maintenance/v*.x.x` | Maintenance versions | Cherry-pick depuis main uniquement | -| `feature/*` | Développement | Short-lived, merge to main | +| Branch | Usage | Policy | +|--------|-------|--------| +| `main` | Main trunk | All PRs merge here | +| `maintenance/v*.x.x` | Version maintenance | Cherry-pick from main only | +| `feature/*` | Development | Short-lived, merge to main | -## **3.2 GitOps Flow (ArgoCD)** +## 3.2 GitOps Flow (ArgoCD) -- **ArgoCD centralisé** : Instance unique gérant tous les environnements -- **App-of-Apps pattern** : ApplicationSets avec Git + Matrix generators -- **Sync automatique** : Dev auto-sync, Staging/Prod manual approval +- **Centralized ArgoCD**: Single instance managing all environments +- **App-of-Apps pattern**: ApplicationSets with Git + Matrix generators +- **Auto-sync**: Dev auto-sync, Staging/Prod manual approval -## **3.3 Environments** +## 3.3 Environments | Environment | Account | Cluster | Sync Policy | |-------------|---------|---------|-------------| -| **dev** | localplus-dev | eks-dev | Auto-sync | -| **staging** | localplus-staging | eks-staging | Manual | 
-| **prod** | localplus-prod | eks-prod | Manual + Approval | +| **dev** | kiven-dev | eks-dev | Auto-sync | +| **staging** | kiven-staging | eks-staging | Manual | +| **prod** | kiven-prod | eks-prod | Manual + Approval | -## **3.4 CI/CD & Bootstrap** +## 3.4 CI/CD & Bootstrap -→ **Documentation détaillée** : [bootstrap/BOOTSTRAP-GUIDE.md](bootstrap/BOOTSTRAP-GUIDE.md) +> Detailed documentation: [bootstrap/BOOTSTRAP-GUIDE.md](bootstrap/BOOTSTRAP-GUIDE.md) --- -# 🗂️ **PARTIE IV — REPOSITORY & OWNERSHIP MODEL** +# PART IV — REPOSITORY & OWNERSHIP MODEL -## **4.1 Repository Tiers** +## 4.1 Repository Tiers | Tier | Repos | Description | Owner | |------|-------|-------------|-------| | **T0 — Foundation** | `bootstrap/` | AWS Landing Zone, Account Factory | Platform Team | | **T1 — Platform** | `platform-*` | GitOps, Networking, Security, Observability | Platform Team | -| **T2 — Contracts** | `contracts-proto`, `sdk-*` | APIs, SDKs partagés | Platform + Backend | -| **T3 — Domain** | `svc-*` | Services métier | Product Teams | -| **T4 — Quality** | `e2e-scenarios`, `chaos-*` | Tests E2E, Chaos engineering | QA + Platform | -| **T5 — Documentation** | `docs/` | Documentation centralisée | All Teams | +| **T2 — Contracts** | `contracts-proto`, `sdk-*` | gRPC APIs, Go SDK, CLI | Platform + Backend | +| **T3 — Core Services** | `svc-*` | Kiven backend services | Backend Team | +| **T4 — Agent** | `agent/` | Customer-deployed agent | Agent Team | +| **T5 — Frontend** | `dashboard/` | Next.js dashboard (Simple + Advanced modes) | Frontend Team | +| **T6 — Providers** | `provider-*` | CNPG provider, Strimzi provider (future) | Backend Team | +| **T7 — Quality** | `e2e-scenarios`, `chaos-*` | Tests, chaos engineering | QA + Platform | +| **T8 — Documentation** | `docs/` | Centralized documentation | All Teams | -## **4.2 Ownership Matrix** +## 4.2 Ownership Matrix | Tier | Owner Team | Approvers | Change Process | |------|------------|-----------|----------------| -| **T0 — 
Foundation** | Platform | Platform Lead + Security | ADR + RFC obligatoire | -| **T1 — Platform** | Platform | Platform Team (2 reviewers) | ADR si breaking change | +| **T0 — Foundation** | Platform | Platform Lead + Security | ADR + RFC required | +| **T1 — Platform** | Platform | Platform Team (2 reviewers) | ADR if breaking change | | **T2 — Contracts** | Platform + Backend | Tech Lead | Buf breaking detection | -| **T3 — Domain** | Product Teams | Team Lead | Standard PR review | -| **T4 — Quality** | QA + Platform | QA Lead | Standard PR review | -| **T5 — Documentation** | All | Tech Lead | Standard PR review | - -## **4.3 Repository Index** +| **T3 — Core Services** | Backend | Team Lead | Standard PR review | +| **T4 — Agent** | Agent / Backend | Agent Lead + Security | Security review required | +| **T5 — Frontend** | Frontend | Frontend Lead | Standard PR review | +| **T6 — Providers** | Backend | Tech Lead | Provider interface compliance | +| **T7 — Quality** | QA + Platform | QA Lead | Standard PR review | +| **T8 — Documentation** | All | Tech Lead | Standard PR review | -> **Note** : Les repos ci-dessous sont la structure cible. Chaque repo aura son propre README. 
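
Per repo, the approver column can be enforced with a GitHub `CODEOWNERS` file — for example in a Tier 1 `platform-*` repo (the `@kiven/...` team handles are placeholders):

```
# Review routing for a Tier 1 platform repo — handles are illustrative
*            @kiven/platform
/policies/   @kiven/platform @kiven/security
```

Required reviewer counts (e.g. the two reviewers for T1) are configured in branch protection rules, which CODEOWNERS itself cannot express.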
+## 4.3 Repository Index ### Tier 0 — Foundation @@ -255,279 +550,359 @@ LOCAL-PLUS est une plateforme de gestion de cartes cadeaux et fidélité, conçu | `platform-networking/` | Cilium, Gateway API | | `platform-observability/` | OTel, Prometheus, Loki, Tempo, Grafana | | `platform-security/` | Vault, External-Secrets, Kyverno | -| `platform-cache/` | Valkey configuration, SDK | -| `platform-gateway/` | APISIX (future), Cloudflare config | -| `platform-application-provis/` | Terraform modules (DB, Kafka, Cache, EKS) | ### Tier 2 — Contracts | Repo | Description | |------|-------------| -| `contracts-proto/` | Protobuf definitions | -| `sdk-python/` | Python SDK (clients, telemetry) | -| `sdk-go/` | Go SDK | +| `contracts-proto/` | Protobuf definitions (agent ↔ SaaS, inter-service) | +| `sdk-go/` | Go SDK for Kiven API | +| `kiven-cli/` | CLI tool (`kiven clusters list`, `kiven backup trigger`) | +| `terraform-provider-kiven/` | Terraform provider for Kiven | + +### Tier 3 — Core Services + +| Repo | Description | +|------|-------------| +| `svc-api/` | REST + GraphQL gateway | +| `svc-auth/` | Authentication, RBAC, API keys | +| `svc-provisioner/` | Provisioning orchestrator (THE BRAIN) | +| `svc-infra/` | AWS resource management in customer accounts | +| `svc-clusters/` | Cluster lifecycle (CNPG management) | +| `svc-backups/` | Backup/restore, PITR, fork/clone | +| `svc-monitoring/` | Metrics, DBA intelligence, alerts | +| `svc-users/` | Database user/role management | +| `svc-agent-relay/` | gRPC server for agent connections | +| `svc-yamleditor/` | YAML generation, validation, diff, history | +| `svc-migrations/` | Import from Aiven/RDS/bare PG | +| `svc-billing/` | Stripe billing | +| `svc-audit/` | Immutable audit log | +| `svc-notification/` | Alerts (Slack, email, webhook, PagerDuty) | + +### Tier 4 — Agent + +| Repo | Description | +|------|-------------| +| `kiven-agent/` | In-cluster agent (CNPG controller, PG stats, command executor) | +| 
`kiven-agent-helm/` | Helm chart for agent deployment | + +### Tier 5 — Frontend + +| Repo | Description | +|------|-------------| +| `dashboard/` | Next.js dashboard (Simple + Advanced mode) | -### Tier 3 — Domain Services +### Tier 6 — Providers | Repo | Description | |------|-------------| -| `svc-ledger/` | Earn/Burn transactions | -| `svc-wallet/` | Balance queries | -| `svc-merchant/` | Merchant onboarding | -| `svc-giftcard/` | Gift card catalog | -| `svc-notification/` | Notifications (Kafka consumer) | +| `provider-cnpg/` | CloudNativePG provider (Phase 1) | +| `provider-strimzi/` | Strimzi/Kafka provider (Phase 3 — future) | +| `provider-redis/` | Redis Operator provider (Phase 3 — future) | -### Tier 4 — Quality +### Tier 7 — Quality | Repo | Description | |------|-------------| -| `e2e-scenarios/` | Playwright E2E tests | -| `chaos-experiments/` | Chaos Mesh experiments | +| `e2e-scenarios/` | End-to-end tests (provisioning, backup, failover) | +| `chaos-experiments/` | Chaos Mesh experiments (node failure, network partition) | --- -# 🔐 **PARTIE V — PLATFORM BASELINES** +# PART V — PLATFORM BASELINES -## **5.1 Security Baseline** +## 5.1 Security Baseline -**Defense in Depth** : 6 couches de sécurité +**Defense in Depth**: 7 layers of security -| Layer | Composant | Protection | +| Layer | Component | Protection | |-------|-----------|------------| | **Edge** | Cloudflare | WAF, DDoS, Bot protection | -| **Gateway** | Cilium Gateway API | TLS, routing | +| **Gateway** | Cilium Gateway API | TLS termination, routing | | **Network** | Cilium | NetworkPolicies, default deny | -| **Identity** | IRSA + Vault | Dynamic secrets, mTLS | +| **Identity** | IRSA + Vault | Dynamic secrets, mTLS, OIDC | | **Workload** | Kyverno | Pod security, image signing | -| **Data** | KMS + Aiven | Encryption at rest/transit | +| **Data** | KMS + EBS encryption | Encryption at rest/transit | +| **Customer Access** | Cross-account IAM + Audit | Least privilege, CloudTrail, 
revocable | -→ **Documentation détaillée** : [security/SECURITY-ARCHITECTURE.md](security/SECURITY-ARCHITECTURE.md) +> Detailed documentation: [security/SECURITY-ARCHITECTURE.md](security/SECURITY-ARCHITECTURE.md) -## **5.2 Observability Baseline** +## 5.2 Observability Baseline -| Signal | Outil | Retention | Coût | -|--------|-------|-----------|------| -| **Metrics** | Prometheus + Remote Write S3 | 15j local, 1an S3 | ~5€/mois | -| **Logs** | Loki | 30 jours (GDPR) | Self-hosted | -| **Traces** | Tempo | 7 jours | Self-hosted | -| **Profiling** | Pyroscope | 7 jours | Self-hosted | -| **Errors** | Sentry (self-hosted) | 30 jours | Self-hosted | +| Signal | Tool | Retention | Cost | +|--------|------|-----------|------| +| **Metrics** | Prometheus + Remote Write S3 | 15d local, 1y S3 | ~5 EUR/mo | +| **Logs** | Loki | 30 days (GDPR) | Self-hosted | +| **Traces** | Tempo | 7 days | Self-hosted | +| **Profiling** | Pyroscope | 7 days | Self-hosted | +| **Errors** | Sentry (self-hosted) | 30 days | Self-hosted | -→ **Documentation détaillée** : [observability/OBSERVABILITY-GUIDE.md](observability/OBSERVABILITY-GUIDE.md) +> Detailed documentation: [observability/OBSERVABILITY-GUIDE.md](observability/OBSERVABILITY-GUIDE.md) -## **5.3 Networking Baseline** +## 5.3 Networking Baseline -| Composant | Rôle | Configuration | +| Component | Role | Configuration | |-----------|------|---------------| -| **Cloudflare** | Edge, WAF, Tunnel | Free tier | +| **Cloudflare** | Edge, WAF, Tunnel | Pro tier | | **Cilium** | CNI, mTLS, Gateway API | WireGuard encryption | -| **VPC Peering** | Aiven connectivity | Private, no internet | +| **VPC Peering** | Aiven connectivity (Kiven product DB) | Private, no internet | | **Route53** | Private DNS, backup | Internal zones | +| **Cross-Account** | Customer EKS access | IAM AssumeRole, kubeconfig | -→ **Documentation détaillée** : [networking/NETWORKING-ARCHITECTURE.md](networking/NETWORKING-ARCHITECTURE.md) +> Detailed documentation: 
[networking/NETWORKING-ARCHITECTURE.md](networking/NETWORKING-ARCHITECTURE.md) -## **5.4 Data Baseline** +## 5.4 Data Baseline -| Service | Provider | Plan | Coût estimé | -|---------|----------|------|-------------| -| **PostgreSQL** | Aiven | Business-4 | ~300€/mois | -| **Kafka** | Aiven | Business-4 | ~400€/mois | -| **Valkey** | Aiven | Business-4 | ~150€/mois | +### Kiven Product Database (SaaS side) -**Règle d'or** : 1 table = 1 owner. Cross-service = gRPC ou Events, jamais JOIN. +| Service | Provider | Purpose | Cost Estimate | +|---------|----------|---------|---------------| +| **PostgreSQL** | Aiven | Product DB (orgs, clusters, audit) | ~300 EUR/mo | +| **Kafka** | Aiven | Agent events, async operations | ~400 EUR/mo | +| **Valkey** | Aiven | Sessions, rate limiting, cache | ~150 EUR/mo | -→ **Documentation détaillée** : [data/DATA-ARCHITECTURE.md](data/DATA-ARCHITECTURE.md) +### Customer Databases (managed by Kiven) + +| Service | Technology | Where | Cost | +|---------|-----------|-------|------| +| **PostgreSQL** | CloudNativePG on EKS | Customer's AWS | Customer's AWS bill | +| **Backups** | Barman → S3 | Customer's AWS | Customer's S3 costs | + +**Golden rule**: Kiven product DB and customer databases are **completely separate**. Customer data never touches Kiven's infrastructure. 
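
On the customer side, part of that separation is the NetworkPolicy noted in the container diagram (section 2.3): only the `kiven-databases` namespace itself and the customer's application namespace may reach the database pods. A sketch, assuming the standard `kubernetes.io/metadata.name` namespace label and an illustrative `customer-app-ns` name:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: pg-production-main-allow
  namespace: kiven-databases
spec:
  podSelector:
    matchLabels:
      cnpg.io/cluster: pg-production-main   # CNPG labels every DB pod
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}                   # same-namespace (operator, pooler)
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: customer-app-ns
      ports:
        - protocol: TCP
          port: 5432
```

Once this policy selects the pods, all other ingress to them is denied by default.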
+ +> Detailed documentation: [data/DATA-ARCHITECTURE.md](data/DATA-ARCHITECTURE.md) --- -# 🧪 **PARTIE VI — TESTING & QUALITY** +# PART VI — TESTING & QUALITY -## **6.1 Test Pyramid** +## 6.1 Test Pyramid -| Layer | Types de tests | Fréquence | -|-------|----------------|-----------| -| **Base** | Static analysis, Linting | Pre-commit | -| **Unit** | Domain logic, Use cases | PR | -| **Integration** | DB, Kafka, Cache (Testcontainers) | PR | -| **Contract** | API contracts (Pact, gRPC) | PR | -| **E2E** | Critical paths (Playwright) | Nightly | -| **Performance** | Load, Stress, Soak (k6) | Nightly/Weekly | -| **Chaos** | Failure injection (Chaos Mesh) | Weekly | +| Layer | Test Types | Frequency | +|-------|-----------|-----------| +| **Base** | Static analysis, linting (golangci-lint) | Pre-commit | +| **Unit** | Service logic, provider interface | PR | +| **Integration** | Agent ↔ CNPG, svc-infra ↔ AWS (LocalStack), DB (Testcontainers) | PR | +| **Contract** | gRPC contracts (Buf), agent protocol | PR | +| **E2E** | Full provisioning pipeline (kind + CNPG) | Nightly | +| **Performance** | Load testing, provisioning time (k6) | Weekly | +| **Chaos** | Node failure, agent disconnection, CNPG failover (Chaos Mesh) | Weekly | -## **6.2 Performance Targets** +## 6.2 Performance Targets -| Métrique | Target | Alerte | -|----------|--------|--------| -| **Latency P50** | < 50ms | > 100ms | -| **Latency P95** | < 100ms | > 200ms | -| **Latency P99** | < 200ms | > 500ms | +| Metric | Target | Alert | +|--------|--------|-------| +| **API Latency P50** | < 50ms | > 100ms | +| **API Latency P95** | < 100ms | > 200ms | +| **API Latency P99** | < 200ms | > 500ms | | **Error Rate** | < 0.1% | > 1% | -| **Throughput** | > 500 TPS | < 400 TPS | +| **Provisioning Time** | < 10min | > 15min | +| **Agent Reconnection** | < 30s | > 60s | +| **Backup Success Rate** | > 99.9% | < 99% | -→ **Documentation détaillée** : [testing/TESTING-STRATEGY.md](testing/TESTING-STRATEGY.md) +> 
Detailed documentation: [testing/TESTING-STRATEGY.md](testing/TESTING-STRATEGY.md) --- -# ⚡ **PARTIE VII — RESILIENCE & DR** +# PART VII — RESILIENCE & DR + +## 7.1 Failure Modes -## **7.1 Failure Modes** +### Kiven SaaS Failures | Failure | Detection | Recovery | RTO | |---------|-----------|----------|-----| | Pod crash | Liveness probe | K8s restart | < 30s | | Node failure | Node NotReady | Pod reschedule | < 2min | | AZ failure | Multi-AZ detect | Traffic shift | < 5min | -| DB primary failure | Aiven health | Automatic failover | < 5min | +| Product DB failure | Aiven health | Automatic failover | < 5min | | Kafka broker failure | Aiven health | Automatic rebalance | < 2min | | Full region failure | Manual | DR procedure | 4h (target) | -## **7.2 Backup Strategy** +### Customer Database Failures (Handled by Kiven) + +| Failure | Detection | Recovery | RTO | +|---------|-----------|----------|-----| +| PG pod crash | CNPG + Agent | CNPG automatic restart | < 30s | +| Primary failure | CNPG failover | Automatic promotion of replica | < 30s | +| DB node failure | Agent + AWS | Pod reschedule to healthy node | < 2min | +| EBS volume issue | Agent monitoring | Alert + manual intervention | < 15min | +| Agent disconnection | SaaS heartbeat | Agent auto-reconnects; DB keeps running | Immediate (DB unaffected) | +| Backup failure | Agent monitoring | Retry + alert to customer + Kiven ops | < 1h | +| Data corruption | Backup verification | PITR restore to last good point | < 30min | + +## 7.2 Backup Strategy + +### Kiven SaaS + +| Data | Method | Frequency | Retention | +|------|--------|-----------|-----------| +| Product DB | Aiven automated | Hourly | 7 days | +| Product DB PITR | Aiven WAL | Continuous | 24h | +| Kafka | Topic retention | N/A | 7 days | +| Terraform state | S3 versioning | Every apply | 90 days | + +### Customer Databases (Managed by Kiven) | Data | Method | Frequency | Retention | |------|--------|-----------|-----------| -| PostgreSQL | Aiven 
automated | Hourly | 7 jours | -| PostgreSQL PITR | Aiven WAL | Continuous | 24h | -| Kafka | Topic retention | N/A | 7 jours | -| Terraform state | S3 versioning | Every apply | 90 jours | +| PostgreSQL | Barman (CNPG) → S3 | Configurable (default: 6h) | Configurable (default: 30 days) | +| PostgreSQL PITR | WAL archiving → S3 | Continuous | Configurable (default: 7 days) | +| Backup verification | Automated restore test | Weekly | Report stored 90 days | -→ **Documentation détaillée** : [resilience/DR-GUIDE.md](resilience/DR-GUIDE.md) +> Detailed documentation: [resilience/DR-GUIDE.md](resilience/DR-GUIDE.md) --- -# 🛠️ **PARTIE VIII — PLATFORM CONTRACTS** +# PART VIII — PLATFORM CONTRACTS -## **8.1 Golden Path (New Service Checklist)** +## 8.1 Golden Path (New Kiven Service Checklist) -| Étape | Action | Validation | -|-------|--------|------------| -| 1 | Créer repo depuis template | Structure conforme | -| 2 | Définir protos dans contracts-proto | buf lint pass | -| 3 | Implémenter service | Unit tests > 80% | -| 4 | Configurer K8s manifests | Kyverno policies pass | -| 5 | Configurer External-Secret | Secrets résolus | -| 6 | Ajouter ServiceMonitor | Metrics visibles Grafana | -| 7 | Créer HTTPRoute | Trafic routable | +| Step | Action | Validation | +|------|--------|------------| +| 1 | Create repo from Go service template | Structure compliant | +| 2 | Define protos in contracts-proto | `buf lint` pass | +| 3 | Implement service (Go) | Unit tests > 80% | +| 4 | Configure K8s manifests | Kyverno policies pass | +| 5 | Configure External-Secret | Secrets resolved from Vault | +| 6 | Add ServiceMonitor | Metrics visible in Grafana | +| 7 | Create HTTPRoute or gRPC route | Traffic routable | | 8 | PR review | Merge → Auto-deploy dev | -## **8.2 SLI/SLO/Error Budgets** +## 8.2 SLI/SLO/Error Budgets | Service | SLI | SLO | Error Budget | |---------|-----|-----|--------------| -| **svc-ledger** | Availability | 99.9% | 43 min/mois | -| **svc-ledger** | Latency P99 
| < 200ms | N/A | -| **svc-wallet** | Availability | 99.9% | 43 min/mois | -| **Platform** | Availability | 99.5% | 3.6h/mois | +| **svc-api** | Availability | 99.9% | 43 min/month | +| **svc-api** | Latency P99 | < 200ms | N/A | +| **svc-provisioner** | Provisioning success rate | 99.5% | N/A | +| **svc-agent-relay** | Agent connection uptime | 99.9% | 43 min/month | +| **Agent** | Metrics delivery | 99.9% | 43 min/month | +| **Customer DB** | Backup success rate | 99.9% | N/A | +| **Platform** | Availability | 99.5% | 3.6h/month | -## **8.3 On-Call Structure** +## 8.3 On-Call Structure -| Rôle | Responsabilité | Rotation | +| Role | Responsibility | Rotation | |------|---------------|----------| -| **Primary** | First responder, triage | Weekly | -| **Secondary** | Escalation, expertise | Weekly | -| **Incident Commander** | Coordination si P1 | On-demand | +| **Primary** | First responder, triage (SaaS + customer infra) | Weekly | +| **Secondary** | Escalation, deep expertise | Weekly | +| **Incident Commander** | Coordination for P1 (customer data at risk) | On-demand | -→ **Documentation détaillée** : [platform/PLATFORM-ENGINEERING.md](platform/PLATFORM-ENGINEERING.md) +> Detailed documentation: [platform/PLATFORM-ENGINEERING.md](platform/PLATFORM-ENGINEERING.md) --- -# 🚀 **PARTIE IX — ROADMAP** - -## **9.1 Séquence de Construction** - -| Phase | Focus | Estimation | -|-------|-------|------------| -| **1** | Bootstrap Layer 0-1 (IAM, VPC, EKS, Aiven) | 3 semaines | -| **2** | Platform GitOps (ArgoCD) | 1 semaine | -| **3** | Platform Networking (Cilium, Gateway API) | 1 semaine | -| **3b** | Edge & CDN (Cloudflare) | 1 semaine | -| **4** | Platform Security (Vault, Kyverno) | 2 semaines | -| **5** | Platform Observability | 2 semaines | -| **5b** | Platform APM | 1 semaine | -| **6** | Platform Cache (Valkey) | 1 semaine | -| **7** | Contracts (Proto, SDK) | 1 semaine | -| **8** | svc-ledger | 3 semaines | -| **9** | svc-wallet | 2 semaines | -| **10** | 
Kafka + Outbox | 2 semaines | -| **10b** | Task Queue | 1 semaine | -| **11** | Testing complet | 2 semaines | -| **12** | Compliance audit | 2 semaines | -| **13** | Documentation | 1 semaine | - -**Total estimé : ~25 semaines** - -## **9.2 Checklist avant démarrage** - -### Comptes & Accès -- [ ] Compte AWS créé, billing configuré -- [ ] Compte Aiven créé -- [ ] Compte Cloudflare créé (Free tier) -- [ ] Organisation GitHub créée -- [ ] Domaine DNS acquis et transféré vers Cloudflare - -### Décisions validées -- [ ] RPO 1h, RTO 15min +# PART IX — ROADMAP + +## 9.1 Build Sequence + +| Phase | Focus | Duration | +|-------|-------|----------| +| **1** | Bootstrap Layer 0-1 (IAM, VPC, EKS) | 3 weeks | +| **2** | Platform GitOps (ArgoCD) | 1 week | +| **3** | Platform Networking (Cilium, Gateway API) + Cloudflare | 2 weeks | +| **4** | Platform Security (Vault, Kyverno) | 2 weeks | +| **5** | Platform Observability (Prometheus, Loki, Tempo) | 2 weeks | +| **6** | Agent framework + gRPC protocol + agent-relay | 3 weeks | +| **7** | CNPG Provider (provider-cnpg) | 2 weeks | +| **8** | svc-provisioner (THE BRAIN) + svc-infra (AWS resources) | 4 weeks | +| **9** | svc-clusters + svc-backups + svc-users | 3 weeks | +| **10** | svc-monitoring + DBA intelligence (basic) | 3 weeks | +| **11** | Dashboard — Simple Mode (Next.js) | 4 weeks | +| **12** | Dashboard — Advanced Mode (YAML editor) | 2 weeks | +| **13** | svc-auth (OIDC, RBAC, org model) | 2 weeks | +| **14** | CLI + API + Terraform Provider | 3 weeks | +| **15** | svc-billing (Stripe) + svc-audit | 2 weeks | +| **16** | svc-migrations (Aiven/RDS import) | 2 weeks | +| **17** | Testing (E2E, chaos, performance) | 2 weeks | +| **18** | Compliance audit (GDPR, SOC2) | 2 weeks | + +**Total estimated: ~43 weeks (~10 months)** + +## 9.2 Pre-Start Checklist + +### Accounts & Access +- [ ] AWS account created, billing configured +- [ ] Aiven account created (product database) +- [ ] Cloudflare account created +- [ ] GitHub 
organization created +- [ ] Stripe account created (billing) +- [ ] DNS domain acquired (kiven.io or similar) + +### Decisions Validated +- [ ] RPO 1h / RTO 15min (SaaS) - [ ] AWS eu-west-1 -- [ ] Aiven pour Kafka + PostgreSQL + Valkey -- [ ] Cloudflare pour DNS + WAF + CDN -- [ ] Self-hosted observability -- [ ] ArgoCD centralisé +- [ ] Go as backend language +- [ ] Next.js as frontend +- [ ] CNPG as PostgreSQL engine +- [ ] Agent-based connectivity (gRPC/mTLS) +- [ ] Cross-account IAM for customer infra access +- [ ] Provider/plugin architecture for multi-operator future +- [ ] Aiven for Kiven product DB + Kafka +- [ ] ArgoCD centralized - [ ] Cilium + Gateway API - [ ] Kyverno - [ ] HashiCorp Vault self-hosted --- -# 📚 **APPENDIX** +# APPENDIX -## **A. Glossaire** +## A. Glossary -→ [GLOSSARY.md](GLOSSARY.md) +> [GLOSSARY.md](GLOSSARY.md) -## **B. ADR Index** +## B. ADR Index -| ADR | Titre | Statut | +| ADR | Title | Status | |-----|-------|--------| -| 001 | Modular Monolith First | Accepted | -| 002 | Aiven Managed Data | Accepted | -| 003 | Cilium over Calico | Accepted | +| 001 | Landing Zone: Control Tower + Terraform | Accepted | +| 002 | CNPG as PostgreSQL Engine | Accepted | +| 003 | Agent-Based Connectivity | Accepted | +| 004 | Provider/Plugin Architecture | Accepted | | ... | ... | ... | -→ [adr/](adr/) +> [adr/](adr/) -## **C. Change Management Process** +## C. Change Management Process ### Architecture Changes -1. **ADR Required** : Toute décision impactant >1 service -2. **Review** : Platform Team + Tech Lead -3. **Communication** : Slack #platform-updates +1. **ADR Required**: Any decision impacting >1 service +2. **Review**: Platform Team + Tech Lead +3. **Communication**: Slack #platform-updates ### Breaking Changes -1. RFC obligatoire (`docs/rfc/`) -2. Migration path documenté -3. Annonce 2 sprints avant +1. RFC required (`docs/rfc/`) +2. Migration path documented +3. Announce 2 sprints before ### Emergency Changes 1. 
Incident Commander approval -2. Post-mortem obligatoire -3. ADR rétroactif sous 48h +2. Post-mortem required +3. Retroactive ADR within 48h --- -# 📖 **Documentation Index** +# Documentation Index | Document | Description | Path | |----------|-------------|------| | **Bootstrap Guide** | AWS setup, Account Factory | [bootstrap/BOOTSTRAP-GUIDE.md](bootstrap/BOOTSTRAP-GUIDE.md) | -| **Security Architecture** | Defense in depth, IAM, PAM, Vault | [security/SECURITY-ARCHITECTURE.md](security/SECURITY-ARCHITECTURE.md) | +| **Security Architecture** | Defense in depth, IAM, cross-account, Vault | [security/SECURITY-ARCHITECTURE.md](security/SECURITY-ARCHITECTURE.md) | | **Observability Guide** | Metrics, logs, traces, APM, dashboards | [observability/OBSERVABILITY-GUIDE.md](observability/OBSERVABILITY-GUIDE.md) | -| **Networking Architecture** | VPC, Cloudflare, Gateway API, DNS | [networking/NETWORKING-ARCHITECTURE.md](networking/NETWORKING-ARCHITECTURE.md) | -| **Data Architecture** | PostgreSQL, Kafka, Cache, Queues | [data/DATA-ARCHITECTURE.md](data/DATA-ARCHITECTURE.md) | -| **Testing Strategy** | Pyramide, Unit, Integration, Performance, Chaos | [testing/TESTING-STRATEGY.md](testing/TESTING-STRATEGY.md) | -| **Platform Engineering** | Contracts, Golden Path, On-Call, CI/CD | [platform/PLATFORM-ENGINEERING.md](platform/PLATFORM-ENGINEERING.md) | -| **DR Guide** | Backup, Recovery, Chaos Engineering | [resilience/DR-GUIDE.md](resilience/DR-GUIDE.md) | -| **Glossary** | Terminologie complète | [GLOSSARY.md](GLOSSARY.md) | +| **Networking Architecture** | VPC, Cloudflare, Gateway API, customer connectivity | [networking/NETWORKING-ARCHITECTURE.md](networking/NETWORKING-ARCHITECTURE.md) | +| **Data Architecture** | Product DB, Kafka, customer DB model | [data/DATA-ARCHITECTURE.md](data/DATA-ARCHITECTURE.md) | +| **Testing Strategy** | Pyramid, E2E, chaos, provisioning tests | [testing/TESTING-STRATEGY.md](testing/TESTING-STRATEGY.md) | +| **Platform Engineering** | 
Contracts, Golden Path, on-call, CI/CD | [platform/PLATFORM-ENGINEERING.md](platform/PLATFORM-ENGINEERING.md) | +| **DR Guide** | Backup, recovery, SaaS DR + customer DB DR | [resilience/DR-GUIDE.md](resilience/DR-GUIDE.md) | +| **Agent Architecture** | Agent design, gRPC protocol, deployment | [agent/AGENT-ARCHITECTURE.md](agent/AGENT-ARCHITECTURE.md) | +| **Customer Infra Management** | Nodes, storage, S3, IAM, cross-account | [infra/CUSTOMER-INFRA-MANAGEMENT.md](infra/CUSTOMER-INFRA-MANAGEMENT.md) | +| **Customer Onboarding** | CloudFormation, EKS discovery, provisioning | [onboarding/CUSTOMER-ONBOARDING.md](onboarding/CUSTOMER-ONBOARDING.md) | +| **Provider Interface** | Plugin architecture, Go interface, adding providers | [providers/PROVIDER-INTERFACE.md](providers/PROVIDER-INTERFACE.md) | +| **Glossary** | All terminology | [GLOSSARY.md](GLOSSARY.md) | --- -*Document maintenu par : Platform Team* -*Dernière mise à jour : Janvier 2026* +*Maintained by: Kiven Platform Team* +*Last updated: February 2026* diff --git a/GLOSSARY.md b/GLOSSARY.md index 84baf89..b6455e7 100644 --- a/GLOSSARY.md +++ b/GLOSSARY.md @@ -1,710 +1,200 @@ -# 📖 **Glossary** -## *LOCAL-PLUS Platform Terminology* +# Glossary +## *Kiven Platform Terminology* --- -> **Retour vers** : [Architecture Overview](EntrepriseArchitecture.md) +> **Back to**: [Architecture Overview](EntrepriseArchitecture.md) --- -# 🧩 **1. Core Software Architecture Terms** +# 1. 
Kiven-Specific Terms | Term | Definition | |------|------------| -| **Monolith** | Single deployable unit containing all application functionality | -| **Microservices** | Architecture where application is composed of small, independent services | -| **Service boundaries** | Clear interfaces and responsibilities defining where one service ends and another begins | -| **Tight coupling** | Strong dependencies between components making them hard to change independently | -| **Loose coupling** | Minimal dependencies between components allowing independent evolution | -| **Cohesion** | Degree to which elements of a module belong together | -| **Separation of concerns** | Design principle for separating a program into distinct sections | -| **Scalability (vertical)** | Adding more power to existing machines (scale up) | -| **Scalability (horizontal)** | Adding more machines to the pool (scale out) | -| **Fault tolerance** | System's ability to continue operating when components fail | -| **Resilience** | System's ability to recover from failures and continue to function | -| **High availability** | System designed to be operational for a high percentage of time | -| **Latency budget** | Maximum acceptable delay for an operation across the system | -| **Throughput** | Number of operations a system can handle per unit of time | -| **Concurrency** | Multiple computations executing during overlapping time periods | -| **Rate limiting** | Controlling the rate of requests to protect system resources | -| **Backpressure** | Mechanism to resist and control upstream load when overwhelmed | -| **Stateless** | Component that doesn't retain client state between requests | -| **Stateful** | Component that maintains state across requests | -| **Idempotency** | Operation that produces same result regardless of how many times executed | -| **Eventual consistency** | Data will become consistent across replicas given enough time | -| **Strong consistency** | All nodes see the same data at 
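Several Kiven-specific terms in this section (Service Plan, svc-provisioner) describe a mapping from a tier name to concrete AWS resources. A minimal sketch of that lookup, with entirely hypothetical instance types and sizes:

```go
package main

import "fmt"

// Plan captures what a Kiven Service Plan resolves to. All sizing
// values below are hypothetical illustrations, not real Kiven tiers.
type Plan struct {
	InstanceType string // EC2 instance type for the node group
	StorageGB    int    // EBS (gp3) volume size per instance
	Instances    int    // primary + replicas
	SharedBufMB  int    // derived postgresql.conf tuning
}

var plans = map[string]Plan{
	"hobbyist": {InstanceType: "t4g.medium", StorageGB: 20, Instances: 1, SharedBufMB: 1024},
	"startup":  {InstanceType: "m7g.large", StorageGB: 100, Instances: 2, SharedBufMB: 2048},
	"business": {InstanceType: "m7g.xlarge", StorageGB: 500, Instances: 3, SharedBufMB: 4096},
}

// resolvePlan is the kind of lookup svc-provisioner would perform
// before creating node groups and CNPG clusters (sketch only).
func resolvePlan(tier string) (Plan, error) {
	p, ok := plans[tier]
	if !ok {
		return Plan{}, fmt.Errorf("unknown service plan %q", tier)
	}
	return p, nil
}

func main() {
	p, _ := resolvePlan("startup")
	fmt.Printf("startup: %s x%d, %d GB gp3\n", p.InstanceType, p.Instances, p.StorageGB)
}
```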
the same time | -| **CAP theorem** | Distributed system can only provide 2 of 3: Consistency, Availability, Partition tolerance | -| **Data locality** | Keeping data close to where it's processed | -| **ACID** | Atomicity, Consistency, Isolation, Durability — transaction guarantees | -| **BASE** | Basically Available, Soft state, Eventually consistent — alternative to ACID | -| **CQRS** | Command Query Responsibility Segregation — separate read and write models | -| **Retry + Exponential backoff** | Retry failed operations with increasing delays | -| **Circuit breaker** | Pattern to prevent cascading failures by failing fast | -| **Bulkhead isolation** | Isolating components to prevent failure propagation | -| **Canary deployment** | Gradual rollout to a subset of users before full deployment | -| **Blue/Green deployment** | Two identical environments, switch traffic between them | -| **Progressive delivery** | Gradual rollout with automated checks and rollback | -| **Feature flags** | Toggles to enable/disable features without deployment | +| **Kiven** | Managed data services platform. "Aiven, but on your Kubernetes infrastructure." Finnish for "stone" — solid ground for your database. | +| **Kiven Agent** | Lightweight Go binary deployed in the customer's K8s cluster. Executes commands, collects metrics/logs, reports status to Kiven SaaS via gRPC/mTLS. | +| **Kiven SaaS** | The management platform running in Kiven's AWS account (eu-west-1). Dashboard, API, core services. | +| **Provider** | Plugin that implements the Kiven provider interface for a specific K8s operator (e.g., CNPG Provider, Strimzi Provider). | +| **CNPG Provider** | The first Kiven provider. Manages PostgreSQL via the CloudNativePG operator. | +| **Service Plan** | Predefined resource tier (Hobbyist, Startup, Business, Premium, Custom) that maps to EC2 instance type, storage, instances, and postgresql.conf tuning. 
| +| **Power Off / Power On** | Feature to pause a database by deleting compute (nodes + pods) while retaining data (EBS volumes + S3 backups). Saves 60-70% on non-production environments. | +| **Power Schedule** | Automated schedule for power on/off (e.g., Mon-Fri 8am-6pm). | +| **Simple Mode** | Default dashboard UX for developers. Forms, sliders, buttons. No YAML visible. Like Aiven's UI. | +| **Advanced Mode** | Dashboard UX for DevOps. View/edit YAML directly, diff view, change history, rollback. Like Lens for K8s. | +| **svc-provisioner** | "The Brain" — core service that orchestrates full provisioning pipeline (nodes → storage → S3 → CNPG → PG). | +| **svc-infra** | Service managing AWS resources in customer accounts (EC2 node groups, EBS, S3, IAM). | +| **svc-agent-relay** | gRPC server that multiplexes connections from all customer agents. | +| **svc-yamleditor** | Service powering Advanced Mode: YAML generation, validation, diff, change history. | +| **DBA Intelligence** | Kiven's automated database expertise: performance tuning, query optimization, backup verification, capacity planning, security auditing, incident diagnostics. | +| **Backup Verification** | Automated weekly restore test: spin up temporary CNPG cluster from latest backup, validate, tear down. Proves backups are restorable. | +| **Prerequisites Engine** | Validates customer's K8s environment before provisioning (CNPG operator, storage classes, resources, cert-manager, etc.). | +| **Customer Infrastructure** | AWS resources in the customer's account managed by Kiven: node groups, EBS volumes, S3 buckets, IAM roles. | +| **Cross-Account IAM** | AWS IAM role in customer's account that trusts Kiven's account. Kiven assumes this role to manage customer resources. | --- -# 🚢 **2. DevOps Core Concepts** +# 2. 
CloudNativePG (CNPG) Terms | Term | Definition | |------|------------| -| **CI/CD** | Continuous Integration / Continuous Delivery — automated build, test, deploy | -| **Fail-Fast** | Design principle to detect and report failures immediately | -| **Deployment pipeline** | Automated sequence of stages from code to production | -| **GitOps** | Infrastructure and application management using Git as source of truth | -| **Pull-based delivery** | Agents pull desired state from Git (vs push-based) | -| **Infrastructure as Code (IaC)** | Managing infrastructure through code rather than manual processes | -| **Configuration drift** | Divergence between actual and intended configuration state | -| **Desired state vs actual state** | What should be vs what currently is | -| **Convergence loop** | Process that continuously moves actual state toward desired state | -| **Immutability** | Resources are replaced rather than modified | -| **Artifact registry** | Repository for storing build artifacts (images, packages) | -| **Environment parity** | Keeping dev, staging, prod as similar as possible | -| **Supply chain security** | Protecting the software delivery pipeline from attacks | -| **Build reproducibility** | Ability to recreate identical builds from same inputs | -| **Trunk-based development** | All developers work on a single branch (main/trunk) | -| **Shift left** | Moving testing and security earlier in the development process | -| **Continuous compliance** | Automated compliance checks integrated into pipeline | -| **Golden pipeline** | Standardized, pre-approved CI/CD pipeline | -| **Self-service delivery** | Teams can deploy without manual intervention | -| **Release automation** | Automated release process with minimal human intervention | -| **Promotion** | Moving artifacts from one environment to the next | +| **CloudNativePG (CNPG)** | CNCF Kubernetes operator for PostgreSQL. Manages cluster lifecycle, HA, backups, failover. 
|
+| **CNPG Cluster** | Custom Resource (CR) defining a PostgreSQL cluster: instances, storage, config, backups. |
+| **CNPG Pooler** | Custom Resource for PgBouncer connection pooling, managed by CNPG operator. |
+| **CNPG ScheduledBackup** | Custom Resource defining automated backup schedule (frequency, retention, S3 target). |
+| **Barman** | Backup tool used by CNPG for physical backups and WAL archiving to object storage (S3). |
+| **PITR (Point-in-Time Recovery)** | Ability to restore a database to any specific moment using base backup + WAL replay. |
+| **WAL (Write-Ahead Log)** | PostgreSQL's transaction log. Every change is written to WAL before data files. Used for replication and PITR. |
+| **Switchover** | Planned promotion of a replica to primary (graceful, zero data loss). |
+| **Failover** | Automatic promotion of a replica when primary fails (may lose last few transactions depending on replication mode). |
+| **Replication Lag** | Time delay between primary writing data and replica receiving it. |
+| **PVC (Persistent Volume Claim)** | Kubernetes resource requesting persistent storage (maps to EBS volume). |
+| **PVC Reclaim Policy** | Policy on the backing PersistentVolume controlling what happens to the EBS volume when the PVC is deleted. `Retain` = keep the volume (critical for Power Off/On). |
 
 ---
 
-# 🛠️ **3. Platform Engineering Vocabulary**
+# 3. 
PostgreSQL Terms | Term | Definition | |------|------------| -| **Paved road** | Recommended path that's easy to follow and well-supported | -| **Golden path** | Opinionated, supported way to accomplish common tasks | -| **Developer experience (DevEx)** | Quality of developers' interactions with tools and processes | -| **Self-service portals** | Interfaces for teams to provision resources without tickets | -| **Platform boundaries** | Clear interfaces between platform and application teams | -| **Internal Developer Platform (IDP)** | Set of tools and services that enable self-service | -| **Tenant isolation** | Separation of resources between different users/teams | -| **Blast radius** | Scope of impact when something fails | -| **Multi-tenancy** | Single instance serving multiple isolated tenants | -| **Platform contracts** | Agreements about what the platform provides and expects | -| **Declarative everything** | Describing what you want, not how to achieve it | -| **Reconciliation loop** | Controller pattern that continuously aligns actual with desired state | -| **Policy as Code** | Expressing policies in code for automated enforcement | -| **Control plane vs data plane** | Management layer vs traffic/data processing layer | -| **Standardization** | Consistent patterns across the organization | -| **Opinionated defaults** | Pre-configured choices that work for most cases | -| **Guardrails** | Constraints that guide without blocking | -| **Drift detection** | Identifying when actual state differs from desired | -| **Day-2 operations** | Ongoing operations after initial deployment | -| **Platform lifecycle** | Stages from creation through deprecation | -| **Operational excellence** | Running workloads effectively and gaining insights | -| **Infra product thinking** | Treating infrastructure as a product with users | +| **postgresql.conf** | Main PostgreSQL configuration file. Controls memory, connections, WAL, checkpoints, etc. 
|
+| **pg_hba.conf** | PostgreSQL Host-Based Authentication config. Controls who can connect and how. |
+| **shared_buffers** | RAM allocated for caching data pages. Typically 25% of total RAM. |
+| **work_mem** | RAM per query operation for sorting/hashing. Too low = spills to disk. |
+| **effective_cache_size** | Hint to query planner about available cache. Typically 75% of RAM. |
+| **max_connections** | Maximum concurrent connections. Should be sized with connection pooling. |
+| **PgBouncer** | PostgreSQL connection pooler. Reduces connection overhead. Modes: session, transaction, statement. |
+| **pg_stat_statements** | Extension tracking execution statistics of all SQL queries. |
+| **pg_stat_activity** | System view showing currently active queries and connections. |
+| **pg_stat_bgwriter** | System view for background writer and checkpoint statistics. |
+| **pg_stat_user_tables** | System view for table-level statistics (seq scans, idx scans, dead tuples). |
+| **Autovacuum** | Background process that reclaims dead tuples and updates statistics. |
+| **Bloat** | Wasted space from dead tuples that autovacuum hasn't reclaimed. |
+| **XID Wraparound** | PostgreSQL transaction ID limit (~2 billion). If exhausted, PostgreSQL stops accepting writes to protect data. Autovacuum's freeze process prevents this. |
+| **EXPLAIN / EXPLAIN ANALYZE** | Commands showing query execution plan (estimated vs actual). |
+| **Sequential Scan** | Full table scan. Often indicates missing index. |
+| **Index Scan** | Targeted lookup using an index. Generally faster than seq scan. |
+| **Extensions** | PostgreSQL plugins: pgvector (AI embeddings), PostGIS (geospatial), TimescaleDB (time-series), etc. |
 
 ---
 
-# 🐳 **4. Container & Kubernetes Terminology**
+# 4. AWS / Cloud Terms
 
 | Term | Definition |
 |------|------------|
-| **Control plane** | Components that manage the cluster (API server, scheduler, etc.) 
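The memory rules of thumb in Section 3 (shared_buffers at roughly 25% of RAM, effective_cache_size at roughly 75%) translate mechanically into per-plan tuning. A sketch under those assumptions; the work_mem formula is an invented illustration, not Kiven's actual tuning logic:

```go
package main

import "fmt"

// memorySettings derives headline postgresql.conf memory values from
// a node's RAM, following the 25% / 75% rules of thumb. The work_mem
// line (a quarter of RAM spread across connections) is an assumption.
func memorySettings(ramMB, maxConnections int) map[string]string {
	return map[string]string{
		"shared_buffers":       fmt.Sprintf("%dMB", ramMB/4),
		"effective_cache_size": fmt.Sprintf("%dMB", ramMB*3/4),
		// Leave headroom: divide a quarter of RAM across connections.
		"work_mem": fmt.Sprintf("%dMB", ramMB/4/maxConnections),
	}
}

func main() {
	// Example: a 16 GiB database node with 100 connections.
	for k, v := range memorySettings(16384, 100) {
		fmt.Printf("%s = %s\n", k, v)
	}
}
```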
| -| **Data plane** | Worker nodes where application workloads run | -| **Pod** | Smallest deployable unit in Kubernetes, one or more containers | -| **Deployment** | Declarative updates for Pods and ReplicaSets | -| **StatefulSet** | Manages stateful applications with stable identities | -| **DaemonSet** | Ensures a Pod runs on all (or some) nodes | -| **Service** | Abstract way to expose an application running on Pods | -| **Ingress** | API object managing external access to services | -| **Gateway API** | Next-generation Ingress, more expressive routing | -| **CRD (Custom Resource Definition)** | Extends Kubernetes API with custom resources | -| **Operator** | Controller that manages complex applications using CRDs | -| **Controller** | Control loop that watches state and makes changes | -| **Reconciliation loop** | Controller pattern comparing desired vs actual state | -| **Desired state store (etcd)** | Key-value store holding cluster state | -| **Horizontal Pod Autoscaler** | Scales Pods based on CPU/memory or custom metrics | -| **Vertical Pod Autoscaler** | Adjusts resource requests/limits automatically | -| **KEDA** | Kubernetes Event-Driven Autoscaling | -| **Knative** | Platform for serverless workloads on Kubernetes | -| **Service mesh** | Infrastructure layer for service-to-service communication | -| **Admission controller** | Intercepts requests before persistence | -| **Mutating webhook** | Modifies resources during admission | -| **Validating webhook** | Rejects invalid resources during admission | -| **Secrets** | Objects for sensitive data (passwords, tokens) | -| **ConfigMaps** | Objects for non-sensitive configuration data | -| **Namespace tenancy** | Using namespaces to isolate workloads | -| **Sidecar pattern** | Helper container running alongside main container | -| **Init containers** | Containers that run before app containers start | -| **Pod disruption budget** | Limits voluntary disruptions to Pods | -| **Resource requests vs limits** | 
Minimum guaranteed vs maximum allowed resources | -| **OOMKilled / throttling** | Container killed for memory / slowed for CPU | -| **Node pool** | Group of nodes with same configuration | -| **Taints/Tolerations** | Mechanism to repel/accept Pods on nodes | -| **Affinity rules** | Scheduling preferences for Pod placement | -| **kro** | Kubernetes Resource Orchestrator | +| **EKS (Elastic Kubernetes Service)** | AWS managed Kubernetes service. | +| **EBS (Elastic Block Store)** | AWS block storage for EC2. Volumes attached to K8s nodes for database data. | +| **gp3** | EBS volume type. General purpose SSD with configurable IOPS and throughput. Default for Kiven. | +| **S3 (Simple Storage Service)** | AWS object storage. Used for CNPG backups (Barman) and WAL archiving. | +| **IRSA (IAM Roles for Service Accounts)** | AWS feature mapping K8s ServiceAccounts to IAM roles. CNPG uses IRSA to write backups to S3. | +| **AssumeRole** | AWS IAM action to temporarily take on another role's permissions. Kiven assumes customer's `KivenAccessRole`. | +| **Cross-Account Access** | Pattern where one AWS account accesses resources in another account via IAM role trust. | +| **CloudFormation** | AWS IaC service. Kiven provides a CF template for customers to create the access role. | +| **KMS (Key Management Service)** | AWS encryption key management. Used for EBS and S3 encryption. | +| **Managed Node Group** | EKS feature for managed EC2 instances as K8s worker nodes. Kiven creates dedicated node groups for databases. | +| **Taints** | K8s mechanism to repel pods from nodes. Kiven taints DB nodes so only DB pods run there. | +| **Tolerations** | K8s mechanism allowing pods to schedule on tainted nodes. CNPG pods tolerate the database taint. | +| **Multi-AZ** | Deploying across multiple Availability Zones for high availability. Kiven spreads primary/replicas across AZs. | --- -# 🔄 **5. GitOps Deep Vocabulary** +# 5. 
Kubernetes & Operator Terms | Term | Definition | |------|------------| -| **Declarative manifests** | YAML/JSON files describing desired state | -| **Single source of truth** | Git as the authoritative source for system state | -| **Drift** | When actual state differs from Git-defined state | -| **Convergence** | Process of moving actual state toward desired state | -| **Pull reconciliation** | Agent pulls changes from Git (vs push deployment) | -| **Progressive sync** | Gradual application of changes with health checks | -| **Rollback via Git revert** | Undoing changes by reverting Git commits | -| **Commit-driven deployments** | Deployments triggered by Git commits | -| **Audit trail** | Git history as immutable record of all changes | -| **Policy enforcement** | Automated checks before sync | -| **Drift remediation** | Automatic correction of drift | -| **Secret sealing** | Encrypting secrets for safe Git storage (ex: Sealed Secrets, pas SOPS) | -| **Environments as branches** | Different branches for different environments | -| **Kustomize overlays** | Environment-specific customizations | +| **CRD (Custom Resource Definition)** | Extends K8s API with custom resources. CNPG adds Cluster, Backup, Pooler CRDs. | +| **CR (Custom Resource)** | Instance of a CRD. A CNPG `Cluster` CR defines one PostgreSQL cluster. | +| **Operator** | K8s controller that manages complex applications via CRDs. CNPG operator manages PostgreSQL. | +| **Controller** | Control loop watching K8s resources and reconciling actual vs desired state. | +| **Reconciliation Loop** | Continuous process comparing desired state (YAML) with actual state and making corrections. | +| **client-go** | Official Go client library for Kubernetes API. Used by Kiven agent. | +| **controller-runtime** | Go library for building K8s controllers/operators. Used by Kiven agent. | +| **Informer** | K8s pattern for watching resource changes efficiently. Agent uses informers for CNPG CRDs. 
| +| **Namespace** | K8s logical isolation. Kiven uses `kiven-system` (agent + operator) and `kiven-databases` (PG clusters). | +| **NetworkPolicy** | K8s L3/L4 firewall rules. Kiven creates policies so only authorized app pods reach the database. | +| **StorageClass** | K8s abstraction for dynamic storage provisioning. Kiven creates optimized storage classes for DB workloads. | +| **Helm** | K8s package manager. Agent and CNPG operator are installed via Helm charts. | --- -# ☁️ **6. Cloud Architecture Concepts** +# 6. Communication & Protocol Terms | Term | Definition | |------|------------| -| **Shared responsibility model** | Division of security responsibilities between cloud and customer | -| **Multi-AZ** | Deployment across multiple Availability Zones | -| **Multi-region** | Deployment across multiple geographic regions | -| **Zonal vs regional resources** | Resources in one zone vs replicated across zones | -| **Edge caching** | Caching content at edge locations near users | -| **Network peering** | Direct network connection between VPCs | -| **Private service connect** | Private connectivity to managed services | -| **NAT gateway** | Network address translation for outbound traffic | -| **Egress costs** | Charges for data leaving cloud provider | -| **Ingress filtering** | Controlling inbound traffic | -| **Cloud IAM** | Cloud Identity and Access Management | -| **Workload identity federation** | Federating external identities with cloud IAM | -| **Service accounts** | Identity for non-human principals | -| **Service perimeter** | Boundary controlling access to resources | -| **Threat modeling** | Systematic analysis of potential threats | -| **Cloud Armor / WAF** | Web Application Firewall services | -| **Autoscaling** | Automatic adjustment of resources based on demand | -| **Rehydration** | Recreating immutable resources from scratch | -| **Blue/Green infra provisioning** | Two environments for zero-downtime infrastructure changes | -| **PAM** | 
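The Cross-Account Access pattern from Section 4 reduces to a trust policy on the customer's `KivenAccessRole`. A sketch of that document in Go; the account ID, and the `sts:ExternalId` condition (a common confused-deputy safeguard), are assumptions about what the CloudFormation template would emit:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TrustPolicy models the IAM trust policy attached to KivenAccessRole:
// it lets Kiven's AWS account assume the role, but only when the
// agreed ExternalId is presented.
type TrustPolicy struct {
	Version   string      `json:"Version"`
	Statement []Statement `json:"Statement"`
}

type Statement struct {
	Effect    string                       `json:"Effect"`
	Principal map[string]string            `json:"Principal"`
	Action    string                       `json:"Action"`
	Condition map[string]map[string]string `json:"Condition,omitempty"`
}

// kivenTrustPolicy builds the policy for a given Kiven account ID and
// per-customer ExternalId (both placeholders here).
func kivenTrustPolicy(kivenAccountID, externalID string) TrustPolicy {
	return TrustPolicy{
		Version: "2012-10-17",
		Statement: []Statement{{
			Effect:    "Allow",
			Principal: map[string]string{"AWS": "arn:aws:iam::" + kivenAccountID + ":root"},
			Action:    "sts:AssumeRole",
			Condition: map[string]map[string]string{
				"StringEquals": {"sts:ExternalId": externalID},
			},
		}},
	}
}

func main() {
	doc, _ := json.MarshalIndent(kivenTrustPolicy("111122223333", "cust-42"), "", "  ")
	fmt.Println(string(doc))
}
```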
Privileged Access Management | +| **gRPC** | High-performance RPC framework by Google. Used for agent ↔ Kiven SaaS communication. | +| **mTLS (Mutual TLS)** | Both client and server verify each other's certificates. Used for agent ↔ SaaS security. | +| **Protobuf** | Protocol Buffers — binary serialization format for gRPC messages. | +| **Bidirectional Streaming** | gRPC feature where both sides can send messages continuously. Agent streams metrics, SaaS streams commands. | +| **Outbound-Only** | Agent initiates the connection to Kiven SaaS. No inbound ports needed on customer's firewall. | --- -# 🔧 **7. Infrastructure as Code Vocabulary** - -## Terraform-specific - -| Term | Definition | -|------|------------| -| **Providers** | Plugins that interact with APIs (AWS, GCP, etc.) | -| **Resources** | Infrastructure components managed by Terraform | -| **Data sources** | Read-only queries to existing resources | -| **Modules** | Reusable, encapsulated Terraform configurations | -| **State** | Record of resources Terraform manages | -| **State locking** | Preventing concurrent state modifications | -| **Workspaces** | Separate state files for different environments | -| **Drift** | Difference between state and actual infrastructure | -| **Lifecycle ignore_changes** | Ignoring specific attribute changes | -| **Outputs** | Values exported from modules | -| **Variable validation** | Rules for valid variable values | -| **Sentinel** | HashiCorp's policy as code framework | - -## Platform IaC - -| Term | Definition | -|------|------------| -| **Composability** | Building complex systems from simpler parts | -| **Reusable patterns** | Standardized infrastructure blueprints | -| **Module registries** | Centralized storage for shared modules | -| **Abstraction leaks** | When implementation details break through abstractions | -| **Snowflake infrastructure** | Unique, non-reproducible configurations | - ---- - -# 🧮 **8. 
Observability (SRE Vocabulary)** - -## Three Pillars + Modern Additions - -| Term | Definition | -|------|------------| -| **Logs** | Time-stamped records of discrete events | -| **Metrics** | Numeric measurements aggregated over time | -| **Traces** | Records of request paths through distributed systems | -| **Profiles** | CPU/memory usage patterns over time | -| **Events** | Significant occurrences in the system | -| **Span attributes** | Metadata attached to trace spans | -| **Telemetry pipelines** | Collection, processing, and routing of telemetry | - -## Methods & Signals - -| Term | Definition | -|------|------------| -| **RED metrics** | Rate, Errors, Duration — for services | -| **USE method** | Utilization, Saturation, Errors — for resources | -| **Golden signals** | Latency, Traffic, Errors, Saturation | -| **Histogram buckets** | Distribution of values in ranges | -| **Sampling** | Recording only a subset of data | -| **Correlation IDs** | Identifiers linking related events | -| **Distributed tracing** | Following requests across service boundaries | -| **Log enrichment** | Adding context to log entries | -| **Span propagation** | Passing trace context between services | -| **Telemetry context** | Shared context for correlated telemetry | -| **P50/P95/P99 Latency** | Percentile latency measurements | - -## Advanced Observability - -| Term | Definition | -|------|------------| -| **Cardinality** | Number of unique label combinations | -| **Dimensionality** | Number of labels/attributes | -| **Retention policies** | Rules for how long data is kept | -| **Aggregation windows** | Time periods for aggregating data | -| **Exemplars** | Links from metrics to specific traces | -| **Structured logs (JSON)** | Machine-parseable log format | -| **High-cardinality labels** | Labels with many unique values (avoid!) 
| -| **Traceparent / tracestate** | W3C trace context headers | -| **Baggage propagation** | Passing custom context through requests | -| **Span links** | Connecting related but non-parent spans | -| **Tail-based sampling** | Sampling based on complete trace | -| **Head-based sampling** | Sampling decision at trace start | -| **Adaptive sampling** | Dynamic sampling based on conditions | -| **Context propagation** | Passing trace context between services | -| **Semantic conventions** | OpenTelemetry standard naming | -| **Continuous profiling** | Always-on performance profiling | -| **Flamegraphs** | Visualization of call stacks and time | -| **Log correlation** | Linking logs to traces and metrics | - -## Alerting & Incidents - -| Term | Definition | -|------|------------| -| **Alert fatigue** | Desensitization from too many alerts | -| **Multi-window burn rates** | Error budget consumption over multiple time windows | -| **Error budgets** | Allowable unreliability before action required | -| **Burn-rate alerts** | Alerts based on error budget consumption speed | -| **SLO/SLA/SLI** | Objective/Agreement/Indicator for service levels | -| **Availability vs reliability** | Uptime vs consistent correct behavior | -| **Thundering herd** | Many clients retrying simultaneously | -| **Retry storms** | Cascading retries overwhelming systems | -| **Cascading failures** | Failure spreading through dependencies | -| **Deadman's switch** | Alert when expected signal is absent | -| **Synthetic monitoring** | Artificial requests to test availability | -| **Service dependency graphs** | Visualization of service relationships | -| **Load shedding** | Dropping requests to protect system | -| **Health probes** | Liveness, readiness, startup checks | -| **Blameless postmortems** | Learning from incidents without blame | -| **MTTR/MTTA/MTBF/MTTD** | Mean Time To Recovery/Acknowledge/Between Failures/Detect | -| **Alert silencing** | Temporarily suppressing alerts | -| **Dead-letter 
queues (DLQ)** | Queue for failed messages | -| **Observability debt** | Accumulated lack of observability | - -## Prometheus Metric Types - -| Type | Description | Usage | Exemple | -|------|-------------|-------|---------| -| **Counter** | Valeur qui ne fait qu'augmenter (jamais diminuer) | Comptage d'événements cumulatifs | `http_requests_total`, `errors_total` | -| **Gauge** | Valeur qui peut monter ET descendre | Valeurs instantanées | `temperature`, `queue_size`, `active_connections` | -| **Histogram** | Distribution de valeurs dans des buckets prédéfinis | Latences, tailles de requêtes | `http_request_duration_seconds` | -| **Summary** | Comme Histogram mais calcule les percentiles côté client | Percentiles précis (mais plus coûteux) | `request_latency` | - -### Counter vs Gauge - -``` -Counter (cumulatif): Gauge (instantané): - ▲ ▲ - 100│ ● 50│ ● ● - 80│ ● 40│ ● - 60│ ● 30│● ● - 40│ ● 20│ ● - 20│● 10│ - └────────► └────────► - time time -``` - -### Histogram Buckets - -``` -http_request_duration_seconds_bucket{le="0.1"} → Requests < 100ms -http_request_duration_seconds_bucket{le="0.5"} → Requests < 500ms -http_request_duration_seconds_bucket{le="1.0"} → Requests < 1s -http_request_duration_seconds_bucket{le="+Inf"} → All requests (total) - -Calcul P99: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) -``` - -### Quand utiliser quoi ? - -| Besoin | Type | Pourquoi | -|--------|------|----------| -| Comptage d'événements | Counter | Ne fait qu'augmenter, rate() pour débit | -| Valeur actuelle | Gauge | Peut monter/descendre | -| Latences (P50, P95, P99) | Histogram | Buckets permettent percentiles | -| Taille de queue | Gauge | Valeur instantanée | -| Nombre de requêtes | Counter | Cumulatif, rate() pour RPS | - ---- - -# 🔥 **9. 
Reliability Engineering Vocabulary** - -| Term | Definition | -|------|------------| -| **SLO (Service Level Objective)** | Target reliability level | -| **SLI (Service Level Indicator)** | Metric measuring service behavior | -| **SLA (Service Level Agreement)** | Contractual reliability commitment | -| **Error budget** | Allowable unreliability (100% - SLO) | -| **Budget burn** | Rate of error budget consumption | -| **Reliability targets** | Goals for system reliability | -| **Failure domains** | Scope where failures are isolated | -| **Blast radius** | Impact area of a failure | -| **Incident commander** | Person coordinating incident response | -| **Postmortem (blameless)** | Analysis of incidents without blame | -| **MTTR** | Mean Time To Recovery | -| **MTTD** | Mean Time To Detection | -| **MTTF** | Mean Time To Failure | -| **Runbook** | Step-by-step guide for operational tasks | -| **Playbook** | Guide for responding to specific scenarios | -| **On-call rotation** | Schedule for incident response duty | -| **Escalation path** | Chain for escalating issues | -| **Severity levels** | Categories of incident impact (SEV-1, SEV-2...) | - ---- - -# 🔐 **10. 
Security Terminology** - -| Term | Definition | -|------|------------| -| **Zero Trust** | Never trust, always verify | -| **Principle of least privilege** | Grant minimum necessary access | -| **RBAC** | Role-Based Access Control | -| **ABAC** | Attribute-Based Access Control | -| **Ephemeral credentials** | Short-lived, automatically rotated credentials | -| **Dynamic secrets** | Secrets generated on-demand with TTL | -| **Secret rotation** | Regular replacement of credentials | -| **Time-bound access** | Access that expires automatically | -| **Vault Agent** | Sidecar for secret injection | -| **Token minting** | Creating authentication tokens | -| **Policy boundaries** | Limits on what policies can grant | -| **Just-in-time access** | Access granted only when needed | -| **SBOM** | Software Bill of Materials | -| **Supply chain attacks** | Compromising software delivery pipeline | -| **Secret scanning** | Detecting exposed credentials | -| **Threat modeling** | Systematic security analysis | -| **Attack surface** | All points where attacker could enter | -| **Posture management** | Continuous security state assessment | -| **Vulnerability hygiene** | Keeping systems patched and secure | - ---- - -# 🧵 **11. 
Networking Vocabulary** - -## Core Networking - -| Term | Definition | -|------|------------| -| **CIDR** | Classless Inter-Domain Routing notation | -| **Subnets** | Logical subdivisions of a network | -| **VPC peering** | Direct connection between VPCs | -| **VPC Service Controls** | Perimeter around GCP resources | -| **Route table** | Rules for directing network traffic | -| **NAT gateway** | Network Address Translation for outbound traffic | -| **Public vs private subnet** | Internet-accessible vs internal-only | -| **Load balancer (L4 vs L7)** | Transport vs application layer balancing | -| **Reverse proxy** | Proxy that handles client requests for backend servers | -| **TLS termination** | Decrypting TLS at a proxy/load balancer | -| **mTLS** | Mutual TLS — both sides authenticate | -| **VPN tunnels** | Encrypted connections over public networks | -| **Egress control** | Controlling outbound traffic | -| **DNS resolution** | Translating names to IP addresses | -| **Split-horizon DNS** | Different DNS responses internal vs external | -| **Service discovery** | Finding service endpoints dynamically | -| **Latency vs jitter** | Delay vs variation in delay | - -## Network Security - -| Term | Definition | -|------|------------| -| **Network ACLs** | Stateless firewall rules for subnets | -| **Security groups** | Stateful firewall rules for instances | -| **Firewall rules** | Rules controlling network traffic | -| **Ingress vs egress** | Inbound vs outbound traffic | -| **East-west vs north-south** | Internal vs external traffic | -| **Overlay networks** | Virtual networks on top of physical | -| **Underlay networks** | Physical network infrastructure | -| **Zero trust networking** | Verify every request regardless of source | -| **Network segmentation** | Dividing network into zones | -| **Micro-segmentation** | Fine-grained network isolation | - -## DNS - -| Term | Definition | -|------|------------| -| **DNS TTL** | Time-To-Live for DNS records | -| **DNS 
cache poisoning** | Attack corrupting DNS cache | -| **Anycast vs unicast** | Same IP multiple locations vs single location | -| **GSLB** | Global Server Load Balancing | -| **CNAMES vs ANAMEs** | Canonical names vs ALIAS records | -| **DNS SRV records** | Service location records | -| **Weighted DNS records** | Traffic distribution via DNS | -| **DNS failover** | Automatic DNS-based failover | - -## Load Balancing - -| Term | Definition | -|------|------------| -| **Round robin** | Distributing requests in rotation | -| **Least connections** | Sending to server with fewest connections | -| **Weighted** | Distribution based on server capacity | -| **IP hash** | Consistent routing based on client IP | -| **Sticky sessions** | Routing same client to same server | -| **Connection draining** | Completing requests before removing server | -| **Health checks (active/passive)** | Probing vs observing server health | - -## Advanced Networking +# 7. Architecture & Software Terms | Term | Definition | |------|------------| -| **Service mesh** | Infrastructure for service communication | -| **Sidecar proxy (Envoy)** | Proxy container alongside application | -| **Policy-based routing** | Routing based on policies not just destination | -| **BGP** | Border Gateway Protocol | -| **ASN** | Autonomous System Number | -| **Peering vs transit** | Direct connection vs paying for routing | -| **PrivateLink / VPC Endpoints** | Private connectivity to services | -| **MTU** | Maximum Transmission Unit | -| **QoS** | Quality of Service | -| **Bandwidth vs throughput** | Capacity vs actual data transfer rate | - -## Kubernetes Networking - -| Term | Definition | -|------|------------| -| **kube-proxy** | Network proxy on each node | -| **ClusterIP** | Internal-only service IP | -| **NodePort** | Service exposed on node ports | -| **LoadBalancer service** | Service with external load balancer | -| **Ingress controller** | Implementation of Ingress API | -| **Gateway API** | Next-generation 
ingress specification | -| **NetworkPolicies** | L3/L4 firewall for pods | -| **PodCIDR** | IP range allocated to pods | -| **CNI** | Container Network Interface | -| **Calico / Cilium** | Popular CNI implementations | -| **Pod-to-pod encryption** | Encrypting traffic between pods | - ---- - -# 🗄️ **12. Database Reliability Vocabulary** - -| Term | Definition | -|------|------------| -| **RPO/RTO** | Recovery Point/Time Objective | -| **Replication lag** | Delay between primary and replica | -| **Write amplification** | Extra writes from indexing/replication | -| **Connection pooling** | Reusing database connections | -| **Hot standby** | Replica ready for immediate failover | -| **Warm standby** | Replica needing some preparation | -| **Cold failover** | Failover requiring significant setup | -| **Partitioning (sharding)** | Splitting data across databases | -| **Read replicas** | Copies for read-only queries | -| **Transaction boundaries** | Scope of ACID guarantees | -| **Isolation levels** | Degree of transaction isolation | -| **Backfill** | Populating data retroactively | -| **pg_bouncer** | PostgreSQL connection pooler | -| **Vacuum** | PostgreSQL maintenance for dead tuples | -| **Dead tuple accumulation** | Buildup of deleted row versions | -| **Failover election** | Process of choosing new primary | +| **Provider Interface** | Go interface that each data service (CNPG, Strimzi, Redis) implements. Enables multi-operator support. | +| **Plugin Architecture** | Design pattern where functionality is added via plugins without modifying core code. | +| **GitOps** | Managing infrastructure and apps using Git as single source of truth. ArgoCD pulls from Git. | +| **Infrastructure as Code (IaC)** | Managing infra through code (Terraform, CloudFormation) rather than manual processes. | +| **Trunk-Based Development** | All developers merge to main branch. Short-lived feature branches. 
| +| **C4 Model** | Architecture documentation: Context, Container, Component, Code diagrams. | +| **Defense in Depth** | Multiple security layers so one breach doesn't compromise everything. | +| **Zero Trust** | Never trust, always verify. Every request is authenticated and authorized. | +| **RBAC (Role-Based Access Control)** | Permissions based on roles (Admin, Operator, Viewer). | +| **OIDC (OpenID Connect)** | Identity protocol for SSO. Login with Google, GitHub, SAML. | +| **Idempotency** | Operation producing the same result no matter how many times executed. Critical for agent commands. | --- -# ✅ **13. Platform Anti-Patterns** - -| Anti-Pattern | Description | -|--------------|-------------| -| **Configuration drift** | Actual state diverging from intended | -| **Snowflake servers** | Unique, non-reproducible configurations | -| **Tight coupling** | Components that can't change independently | -| **Hidden dependencies** | Undocumented relationships between systems | -| **Mutating production manually** | Direct changes bypassing automation | -| **Silent failure** | Failures without alerts or logs | -| **Shadow ops** | Unofficial processes outside standard tooling | -| **Orphan secrets** | Unused but still valid credentials | -| **Credential sprawl** | Credentials scattered across systems | -| **Static long-lived passwords** | Credentials that never expire | -| **Single-tenant-by-accident** | Unintended tight coupling to one tenant | - ---- - -# 🧠 **14. 
Architecture Trade-Off Terminology** - -| Trade-Off | Description | -|-----------|-------------| -| **Latency vs throughput** | Response time vs capacity | -| **Cost vs durability** | Expense vs data safety | -| **Consistency vs availability** | Data correctness vs uptime | -| **Security vs convenience** | Protection vs ease of use | -| **Performance vs maintainability** | Speed vs code clarity | -| **Complexity vs control** | Features vs simplicity | -| **Abstraction leakage** | When implementation details break through abstractions | - ---- - -# 🎛️ **15. Control Plane Vocabulary** - -| Term | Definition | -|------|------------| -| **Declarative specification** | Describing what you want, not how | -| **Controller manager** | Component running controllers | -| **Watch loops** | Controllers watching for changes | -| **Reconciliation** | Aligning actual with desired state | -| **Drift remediation** | Correcting drift automatically | -| **Desired state store** | Where desired state is persisted | -| **Operator SDK** | Framework for building operators | -| **Custom resources** | User-defined Kubernetes resources | - ---- - -# 🐍 **16. 
FastAPI Vocabulary** - -## FastAPI Core - -| Term | Definition | -|------|------------| -| **Path operations** | HTTP method + path combinations | -| **Path operation function** | Function handling a path operation | -| **Dependency injection** | Automatic provision of dependencies | -| **Dependencies (Depends)** | FastAPI's DI mechanism | -| **Request state** | Data attached to request lifecycle | -| **Background tasks** | Tasks executed after response | -| **Middleware** | Code running before/after requests | -| **Routers** | Grouping of path operations | -| **Sub-applications** | Mounting apps within apps | -| **Exception handlers** | Custom error handling | -| **Response models** | Pydantic models for responses | -| **Startup/shutdown events** | Lifecycle hooks | -| **Lifespan protocol** | Modern async context manager for lifecycle | -| **OpenAPI schema generation** | Automatic API documentation | - -## Pydantic - -| Term | Definition | -|------|------------| -| **BaseModel** | Base class for data models | -| **Field validators** | Validation functions for fields | -| **Model config** | Configuration for model behavior | -| **Strict types** | Types that don't coerce | -| **Alias generation** | Automatic field name aliases | -| **Model inheritance** | Extending models | -| **ORM mode** | Compatibility with ORM objects | - -## Async/Concurrency - -| Term | Definition | -|------|------------| -| **Event loop** | Core of async execution | -| **Coroutine** | Async function | -| **Context switching** | Switching between coroutines | -| **Async DB engines** | Non-blocking database drivers | - -## API Integration +# 8. 
Observability & Reliability Terms | Term | Definition | |------|------------| -| **Clients (httpx)** | Async HTTP client library | -| **Session reuse** | Reusing HTTP connections | -| **Circuit breakers** | Preventing cascading failures | -| **Retries with jitter** | Randomized retry timing | -| **Backoff** | Increasing delay between retries | -| **Timeout budgets** | Allocating latency across operations | +| **RPO (Recovery Point Objective)** | Maximum acceptable data loss in time. RPO 1h = can lose up to 1 hour of data. | +| **RTO (Recovery Time Objective)** | Maximum acceptable downtime. RTO 15min = must recover within 15 minutes. | +| **SLI (Service Level Indicator)** | Metric measuring service behavior (e.g., availability, latency). | +| **SLO (Service Level Objective)** | Target for an SLI (e.g., 99.9% availability). | +| **SLA (Service Level Agreement)** | Contractual commitment to SLO with consequences for breach. | +| **Error Budget** | Allowable unreliability: 100% - SLO. 99.9% SLO = 43 min/month error budget. | +| **Prometheus** | Time-series database for metrics. Collects from Kiven services and agents. | +| **Loki** | Log aggregation system by Grafana. Stores and queries logs. | +| **Tempo** | Distributed tracing system by Grafana. Traces requests across services. | +| **OpenTelemetry (OTel)** | Standard for telemetry (metrics, logs, traces) collection and export. | +| **Chaos Engineering** | Deliberately injecting failures to test system resilience. | --- -# 🤖 **17. Modern AI Platform Terms** +# 9. 
Business & Compliance Terms | Term | Definition | |------|------------| -| **RAG** | Retrieval-Augmented Generation | -| **Vector embeddings** | Numerical representations of content | -| **Chunking strategies** | Methods for splitting documents | -| **Hallucination rate** | Frequency of incorrect AI outputs | -| **Prompt injection** | Attack via malicious prompts | -| **Safety guardrails** | Controls preventing harmful outputs | -| **Structured tool calling** | AI invoking tools with typed parameters | -| **Agent orchestration** | Managing multi-step AI workflows | -| **Agent handoff** | Transferring between specialized agents | -| **Latency budget (LLM)** | Acceptable delay for AI responses | -| **Function calling** | AI calling predefined functions | -| **Streaming response** | Incremental output delivery | -| **Semantic caching** | Caching based on meaning similarity | -| **Evaluation metrics (RAGAS)** | Framework for RAG evaluation | -| **Tracing (Langfuse)** | Observability for LLM applications | -| **Observability of prompts** | Tracking prompt performance | -| **Agentic** | AI that can take autonomous actions | -| **vLLM** | High-performance LLM inference engine | +| **GDPR** | EU General Data Protection Regulation. Requires data residency, consent, right to erasure. | +| **SOC2** | Security framework requiring audit controls, RBAC, monitoring, incident response. | +| **Data Sovereignty** | Data stored and processed within specific geographic boundaries. Kiven's model ensures this — data stays in customer's VPC. | +| **Vendor Lock-In** | Dependency on a specific vendor. Kiven reduces lock-in: customer owns their K8s infra, CNPG is open-source. | +| **DBaaS (Database-as-a-Service)** | Fully managed database. Aiven and Kiven are both DBaaS, but with different infrastructure models. | +| **BYOC (Bring Your Own Cloud)** | Model where the managed service runs on the customer's cloud account. Kiven's core model. | +| **Stripe** | Payment platform for SaaS billing. 
Kiven uses Stripe for subscription management. | --- -# 🧱 **18. How to Use These Terms** +# 10. How to Use These Terms ## In PR Reviews +- *"This increases blast radius for customer data"* +- *"We need idempotency on this agent command"* +- *"Check the PVC reclaim policy — must be Retain for power off"* -- *"This increases blast radius"* -- *"We risk configuration drift here"* -- *"Can we enforce immutability?"* -- *"Retries need idempotency guarantees"* - -## In Meetings - -- *"What's the rollback path?"* -- *"What's our boundary for tenant isolation?"* - -## In Documentation - -- *"We apply progressive delivery to reduce risk"* - ---- - -# 📚 **19. Learning Practice** +## In Customer Conversations +- *"Your data never leaves your VPC"* +- *"You can power off dev databases on weekends to save 70%"* +- *"Our DBA intelligence will auto-tune your postgresql.conf"* -For any term, ask: - -1. **Define** — What is it? -2. **When to use** — Appropriate scenarios -3. **When NOT to use** — Anti-patterns -4. **Trade-offs** — What you gain/lose -5. **Real-world example** — Concrete usage -6. **Sentence** — How to use it naturally - ---- - -# 🌿 **20. Git Vocabulary** - -| Term | Definition | -|------|------------| -| **Cherry-pick** | Apply specific commits to another branch | -| **Backport** | Apply fix from newer to older version | -| **Forwardport** | Apply fix from older to newer version | - ---- - -# 📨 **21. 
Messaging & Event-Driven Systems** - -| Term | Definition | -|------|------------| -| **Outbox Pattern** | Writing events to DB table, then to message broker atomically | -| **Event sourcing** | Storing events as source of truth | -| **CDC (Change Data Capture)** | Capturing database changes as events | -| **Exactly-once semantics** | Guarantee of processing exactly once | -| **At-least-once delivery** | Guarantee of delivery (may have duplicates) | -| **Consumer group** | Group of consumers sharing workload | -| **Partition** | Ordered subset of topic messages | -| **Dead Letter Queue (DLQ)** | Queue for failed messages | -| **Saga pattern** | Distributed transaction via event choreography | -| **Compensating transaction** | Undoing previous transaction on failure | +## In Architecture Decisions +- *"We need cross-account IAM for svc-infra to manage customer node groups"* +- *"The provider interface must be stable before we add Strimzi support"* --- -*Document maintenu par : Platform Team* -*Dernière mise à jour : Janvier 2026* +*Maintained by: Platform Team* +*Last updated: February 2026* diff --git a/adr/ADR-001-LANDING-ZONE-APPROACH.md b/adr/ADR-001-LANDING-ZONE-APPROACH.md index c824d0b..adcf955 100644 --- a/adr/ADR-001-LANDING-ZONE-APPROACH.md +++ b/adr/ADR-001-LANDING-ZONE-APPROACH.md @@ -118,12 +118,23 @@ Control Tower controls can be managed via Terraform: ### Phase 1: Control Tower Setup (Console) -| Step | Action | Duration | -|------|--------|----------| -| 1 | Enable Control Tower | 45 min | -| 2 | Configure home region (eu-west-1) | Included | -| 3 | Log Archive + Audit accounts created | Automatic | -| 4 | Enable IAM Identity Center | Included | +> **Voir [BOOTSTRAP-RUNBOOK](../../bootstrap/docs/BOOTSTRAP-RUNBOOK.md) pour les instructions détaillées.** + +| Step | Action | +|------|--------| +| 1 | Choose setup preferences (regions, region deny) | +| 2 | Create OUs (Security, Sandbox) | +| 3 | Configure Service integrations — **créer 2 comptes** | +| 
4 | Review and enable (~45 min) | + +**Comptes créés dans Step 3 :** + +| Service | Account | Email | +|---------|---------|-------| +| AWS Config Aggregator | **Audit** | `aws+audit@talq.xyz` | +| CloudTrail Administrator | **Log Archive** | `aws+logs@talq.xyz` | + +> ⚠️ Config et CloudTrail exigent des comptes **différents**. ### Phase 2: Terraform Layer (bootstrap/) diff --git a/agent/AGENT-ARCHITECTURE.md b/agent/AGENT-ARCHITECTURE.md new file mode 100644 index 0000000..2a7b6ac --- /dev/null +++ b/agent/AGENT-ARCHITECTURE.md @@ -0,0 +1,285 @@ +# Kiven Agent Architecture +## *The Bridge Between Kiven SaaS and Customer Kubernetes* + +--- + +> **Back to**: [Architecture Overview](../EntrepriseArchitecture.md) + +--- + +# What Is the Kiven Agent + +The agent is a **single Go binary** deployed inside the customer's Kubernetes cluster. It is the only component Kiven runs in the customer's environment. Everything Kiven does on the customer's cluster goes through the agent. + +``` +Kiven SaaS (our infra) ◄──── gRPC/mTLS (outbound from agent) ──── Agent (customer's K8s) + │ + ├── Watches CNPG CRDs + ├── Collects PG metrics + ├── Executes commands + ├── Aggregates logs + └── Reports infra status +``` + +--- + +# Design Principles + +| Principle | Implementation | +|-----------|---------------| +| **Outbound-only** | Agent initiates connection to Kiven SaaS. No inbound ports on customer's firewall. | +| **Minimal footprint** | < 50MB RAM, < 0.1 CPU. Must not impact customer's workloads. | +| **Fault-tolerant** | If agent loses connection, databases keep running. Agent auto-reconnects. | +| **Secure** | mTLS for all communication. ServiceAccount scoped to CNPG CRDs only. | +| **Single binary** | One Go binary, deployed via Helm chart. No dependencies. | +| **Multi-provider ready** | Plugin system: auto-detects installed operators, activates relevant modules. 
| + +--- + +# Agent Components + +``` +┌─────────────────────────────────────────────────────────────────────┐ +│ KIVEN AGENT (Go binary) │ +│ │ +│ ┌─────────────────────────────────────────────────────────────┐ │ +│ │ Provider Registry │ │ +│ │ ├── CNPG Module (Phase 1) │ │ +│ │ │ ├── CNPG Watcher (informers on Cluster/Backup/Pooler) │ │ +│ │ │ ├── PG Stats Collector (pg_stat_*, via PG connection) │ │ +│ │ │ └── PG Log Collector (pod logs from CNPG pods) │ │ +│ │ ├── Strimzi Module (Future) │ │ +│ │ └── Redis Module (Future) │ │ +│ └─────────────────────────────────────────────────────────────┘ │ +│ │ +│ ┌─────────────────────────────────────────────────────────────┐ │ +│ │ Core Components │ │ +│ │ ├── Command Executor — applies YAML, runs SQL │ │ +│ │ ├── Infra Reporter — node status, EBS, resource usage │ │ +│ │ ├── Health Monitor — self-health, connectivity check │ │ +│ │ └── Config Manager — agent config, hot reload │ │ +│ └─────────────────────────────────────────────────────────────┘ │ +│ │ +│ ┌─────────────────────────────────────────────────────────────┐ │ +│ │ Transport Layer │ │ +│ │ ├── gRPC Client (mTLS, outbound to svc-agent-relay) │ │ +│ │ ├── Event Buffer (in-memory, survives brief disconnects) │ │ +│ │ └── Heartbeat (every 30s to prove agent is alive) │ │ +│ └─────────────────────────────────────────────────────────────┘ │ +└─────────────────────────────────────────────────────────────────────┘ +``` + +## CNPG Watcher + +Uses Kubernetes **informers** (via controller-runtime) to watch CNPG CRDs: +- `Cluster` — status changes, failover events, replication lag +- `Backup` — backup start/complete/fail events +- `ScheduledBackup` — schedule status +- `Pooler` — PgBouncer status, connection stats + +On any change → event streamed to Kiven SaaS via gRPC. 
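
The watch-and-stream flow above can be sketched in plain Go. This is a minimal illustration under stated assumptions, not the agent's actual code: the real watcher uses controller-runtime informers on the `postgresql.cnpg.io` CRDs, whereas here `ClusterStatus` and the in-memory `out` slice are hypothetical stand-ins for the CR status and the gRPC stream. What it shows is the deduplication the watcher needs: informer resyncs periodically re-deliver unchanged objects, so an event should only be emitted when the observed status actually changed.

```go
package main

import "fmt"

// ClusterStatus is a simplified, hypothetical view of the fields the
// watcher would read from a CNPG Cluster CR's status.
type ClusterStatus struct {
	Name           string
	Phase          string // e.g. "Cluster in healthy state", "Failing over"
	ReadyInstances int
}

// Watcher remembers the last status seen per cluster so that informer
// resyncs (which re-deliver unchanged objects) do not flood the stream.
type Watcher struct {
	last map[string]ClusterStatus
	out  []string // stand-in for the outbound gRPC stream
}

func NewWatcher() *Watcher {
	return &Watcher{last: make(map[string]ClusterStatus)}
}

// OnUpdate is the informer update handler: emit only on a real change.
func (w *Watcher) OnUpdate(s ClusterStatus) {
	if prev, ok := w.last[s.Name]; ok && prev == s {
		return // periodic resync, nothing changed: drop it
	}
	w.last[s.Name] = s
	w.out = append(w.out,
		fmt.Sprintf("cluster_status name=%s phase=%q ready=%d",
			s.Name, s.Phase, s.ReadyInstances))
}

func main() {
	w := NewWatcher()
	w.OnUpdate(ClusterStatus{"pg-main", "Cluster in healthy state", 3})
	w.OnUpdate(ClusterStatus{"pg-main", "Cluster in healthy state", 3}) // resync: dropped
	w.OnUpdate(ClusterStatus{"pg-main", "Failing over", 2})             // real change
	for _, e := range w.out {
		fmt.Println(e)
	}
}
```

In the real agent the equivalent of `w.out` would be the event buffer feeding the transport layer, so a brief disconnect loses nothing while still guaranteeing at most one message per actual status transition.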
+ +## PG Stats Collector + +Connects to PostgreSQL directly (using credentials from CNPG-managed K8s Secret): +- `pg_stat_statements` — query performance (every 60s) +- `pg_stat_activity` — active queries, blocking (every 30s) +- `pg_stat_bgwriter` — checkpoint/write stats (every 60s) +- `pg_stat_user_tables` — table stats, dead tuples (every 300s) +- Custom queries for bloat detection, XID age (every 300s) + +**Important**: Query parameter values are **never collected**. Only query templates (`SELECT * FROM users WHERE id = $1`). + +## PG Log Collector + +Tails PostgreSQL pod logs via Kubernetes API: +- Filters for ERROR, WARNING, FATAL, PANIC levels +- Applies **log scrubbing**: replaces parameter values with `$N` +- Batches and streams to Kiven SaaS +- Detects patterns: slow queries, connection rejections, OOM + +## Command Executor + +Receives commands from Kiven SaaS (via gRPC stream) and executes them: + +| Command Type | What It Does | Example | +|-------------|-------------|---------| +| `apply_yaml` | Applies K8s manifest | Create/update CNPG Cluster, Pooler, Backup | +| `delete_resource` | Deletes K8s resource | Delete cluster on power-off (PVCs retained) | +| `run_sql` | Executes SQL via PG connection | CREATE USER, GRANT, ALTER SYSTEM | +| `install_helm` | Installs/upgrades Helm chart | Install CNPG operator | +| `collect_diagnostics` | Runs diagnostic checks | Prerequisites validation | + +Every command is: +- **Logged** with full audit trail (who requested, what was executed, result) +- **Idempotent** where possible (apply is naturally idempotent) +- **Validated** before execution (schema validation for YAML) +- **Reported** with result (success/failure + output) + +## Infra Reporter + +Reports infrastructure-level information: +- Node status (Ready/NotReady, capacity, allocatable) +- EBS volume usage (via PVC status + df) +- Resource consumption (CPU/memory per CNPG pod) +- Kubernetes version, CNPG operator version +- Storage classes available +- 
Namespace resource quotas + +--- + +# Communication Protocol + +## gRPC Service Definition (Simplified) + +```protobuf +service AgentRelay { + // Agent → SaaS: bidirectional stream for status and metrics + rpc Connect(stream AgentMessage) returns (stream ServerMessage); + + // Agent → SaaS: initial registration + rpc Register(RegisterRequest) returns (RegisterResponse); +} + +message AgentMessage { + oneof payload { + Heartbeat heartbeat = 1; + ClusterStatus cluster_status = 2; + MetricsBatch metrics = 3; + LogBatch logs = 4; + EventReport event = 5; + CommandResult command_result = 6; + InfraReport infra_report = 7; + } +} + +message ServerMessage { + oneof payload { + Command command = 1; + ConfigUpdate config_update = 2; + Ack ack = 3; + } +} +``` + +## Connection Lifecycle + +``` +Agent starts + │ + ├── 1. Load mTLS certificates (from K8s Secret) + ├── 2. Connect to svc-agent-relay (gRPC/mTLS) + ├── 3. Register: send agent ID, cluster info, CNPG version + ├── 4. Start bidirectional stream (Connect RPC) + │ + │ ┌── Agent → SaaS ──────────────────────────────────┐ + │ │ Heartbeat every 30s │ + │ │ Cluster status on change (informer events) │ + │ │ Metrics every 30-60s │ + │ │ Logs (filtered, scrubbed) on arrival │ + │ │ Command results after execution │ + │ └───────────────────────────────────────────────────┘ + │ + │ ┌── SaaS → Agent ──────────────────────────────────┐ + │ │ Commands (apply_yaml, run_sql, etc.) │ + │ │ Config updates (collection intervals, log level) │ + │ │ Acknowledgements │ + │ └───────────────────────────────────────────────────┘ + │ + └── On disconnect: buffer events, retry with exponential backoff + Databases continue running. No data loss. 
+``` + +--- + +# Deployment + +## Helm Chart + +```bash +helm install kiven-agent kiven/agent \ + --namespace kiven-system \ + --create-namespace \ + --set agentToken= \ + --set relay.endpoint=agent-relay.kiven.io:443 +``` + +## Kubernetes Resources Created + +| Resource | Namespace | Purpose | +|----------|-----------|---------| +| Deployment (1 replica) | kiven-system | The agent pod | +| ServiceAccount | kiven-system | Identity for RBAC | +| ClusterRole | — | Read CNPG CRDs, read pods/logs, manage kiven-databases namespace | +| ClusterRoleBinding | — | Binds role to ServiceAccount | +| Secret | kiven-system | mTLS certificates + agent token | +| ConfigMap | kiven-system | Agent configuration (intervals, log level) | + +## RBAC (Least Privilege) + +```yaml +rules: + # CNPG CRDs — full access (for provisioning) + - apiGroups: ["postgresql.cnpg.io"] + resources: ["clusters", "backups", "scheduledbackups", "poolers"] + verbs: ["get", "list", "watch", "create", "update", "patch", "delete"] + + # Pods/logs — read only (for metrics and log collection) + - apiGroups: [""] + resources: ["pods", "pods/log", "services", "secrets", "configmaps", "persistentvolumeclaims"] + verbs: ["get", "list", "watch"] + + # Namespaces — manage kiven-databases + - apiGroups: [""] + resources: ["namespaces"] + verbs: ["get", "list", "watch", "create"] + + # Network policies — create in kiven-databases + - apiGroups: ["networking.k8s.io"] + resources: ["networkpolicies"] + verbs: ["get", "list", "watch", "create", "update", "patch", "delete"] + + # Storage classes — read (for prerequisites check) + - apiGroups: ["storage.k8s.io"] + resources: ["storageclasses"] + verbs: ["get", "list"] + + # Nodes — read (for infra reporting) + - apiGroups: [""] + resources: ["nodes"] + verbs: ["get", "list"] +``` + +--- + +# Failure Modes + +| Failure | Impact | Recovery | +|---------|--------|----------| +| **Agent pod crash** | Kiven dashboard shows "agent offline". Databases keep running. 
| K8s restarts pod automatically. Agent reconnects. | +| **gRPC connection lost** | Events buffered in memory. Dashboard shows stale data (with warning). | Agent retries with exponential backoff (1s, 2s, 4s, 8s... max 60s). | +| **Agent misconfigured** | Agent can't connect or authenticate. | Dashboard shows "agent not connected". Customer re-runs Helm install. | +| **CNPG operator not installed** | Agent reports "CNPG not found" during prerequisites check. | svc-provisioner installs CNPG operator via agent (install_helm command). | +| **Insufficient RBAC** | Agent commands fail with 403. | Agent reports permission error. Customer adjusts ClusterRoleBinding. | + +**Key invariant**: Agent failure NEVER affects running databases. CNPG operator manages PG independently. Agent is only for Kiven management plane. + +--- + +# Metrics Collected + +| Category | Metrics | Interval | +|----------|---------|----------| +| **PostgreSQL** | connections, QPS, transactions, replication lag, cache hit ratio | 30s | +| **Queries** | top queries by time/calls, slow queries (> threshold), lock waits | 60s | +| **Tables** | size, dead tuples, seq scans, idx scans, bloat estimate | 300s | +| **System** | CPU, memory, disk usage (per PG pod) | 30s | +| **CNPG** | cluster phase, timeline, instances ready, failover count | On change | +| **Backups** | last backup time, duration, size, WAL archiving lag | On change | +| **PgBouncer** | active/idle/waiting connections, pool utilization | 30s | +| **Infrastructure** | node status, EBS IOPS, storage capacity | 60s | + +--- + +*Maintained by: Agent Team* +*Last updated: February 2026* diff --git a/bootstrap/BOOTSTRAP-GUIDE.md b/bootstrap/BOOTSTRAP-GUIDE.md index a078f53..7bf0359 100644 --- a/bootstrap/BOOTSTRAP-GUIDE.md +++ b/bootstrap/BOOTSTRAP-GUIDE.md @@ -84,14 +84,18 @@ ## Steps -| Step | Action | Duration | -|------|--------|----------| -| 1 | Console → Control Tower → Set up landing zone | 45 min | -| 2 | Home region: eu-west-1 | 
Included | -| 3 | Additional regions: eu-central-1 (DR) | Included | -| 4 | Log Archive account created | Automatic | -| 5 | Audit account created | Automatic | -| 6 | IAM Identity Center enabled | Included | +> **Voir [BOOTSTRAP-RUNBOOK](../../bootstrap/docs/BOOTSTRAP-RUNBOOK.md) pour les instructions détaillées.** + +| Step | Action | Description | +|------|--------|-------------| +| 1 | Choose setup preferences | Home region eu-west-1, Region deny enabled | +| 2 | Create OUs | Security, Sandbox | +| 3 | Configure Service integrations | Créer Audit + Log Archive accounts | +| 4 | Review and enable | ~45 min pour compléter | + +> ⚠️ **Important:** Dans Step 3, Config et CloudTrail exigent des comptes **différents** : +> - AWS Config → **Audit** account +> - CloudTrail → **Log Archive** account ## What Control Tower creates @@ -114,10 +118,9 @@ | Component | Module | Description | |-----------|--------|-------------| -| **SSO** | `sso/` | Groups, Permission Sets | -| **Custom Controls** | `control-tower/` | Additional guardrails via Terraform | +| **SSO** | `sso/` | Groups, Permission Sets, Assignments | +| **Custom Controls** | `control-tower/` | Additional controls via `aws_controltower_control` | | **Account Factory** | `account-factory/` | AFT module via GitHub Actions | -| **Shared Services** | `core-accounts/` | ECR, Transit Gateway | ## SSO Groups diff --git a/data/DATA-ARCHITECTURE.md b/data/DATA-ARCHITECTURE.md index f334d48..1e39c3b 100644 --- a/data/DATA-ARCHITECTURE.md +++ b/data/DATA-ARCHITECTURE.md @@ -1,478 +1,297 @@ -# 💾 **Data Architecture** -## *LOCAL-PLUS Database, Kafka, Cache & Queues* +# Data Architecture +## *Kiven — Product Database, Kafka, Cache & Customer Database Model* --- -> **Retour vers** : [Architecture Overview](../EntrepriseArchitecture.md) +> **Back to**: [Architecture Overview](../EntrepriseArchitecture.md) --- -# 📋 **Table of Contents** +# Table of Contents -1. [Aiven Configuration](#aiven-configuration) -2. 
[Database Strategy](#database-strategy) -3. [Schema Ownership](#schema-ownership) +1. [Two Data Domains](#two-data-domains) +2. [Kiven Product Database (SaaS)](#kiven-product-database-saas) +3. [Customer Databases (Managed by Kiven)](#customer-databases-managed-by-kiven) 4. [Kafka Topics](#kafka-topics) -5. [Kafka Monitoring](#kafka-monitoring) -6. [Cache Architecture (Valkey)](#cache-architecture-valkey) -7. [Queueing & Background Jobs](#queueing--background-jobs) +5. [Cache Architecture (Valkey)](#cache-architecture-valkey) +6. [Data Isolation Principle](#data-isolation-principle) --- -# 🗄️ **Aiven Configuration** +# Two Data Domains -## Services Overview +Kiven has **two completely separate data domains** that must never mix: -| Service | Plan | Config | Coût estimé | -|---------|------|--------|-------------| -| **PostgreSQL** | Business-4 | Primary + Read Replica, 100GB | ~300€/mois | -| **Kafka** | Business-4 | 3 brokers, 100GB retention | ~400€/mois | -| **Valkey (Redis)** | Business-4 | 2 nodes, 10GB, HA | ~150€/mois | +``` +┌─────────────────────────────────────┐ ┌─────────────────────────────────────┐ +│ DOMAIN 1: Kiven Product Data │ │ DOMAIN 2: Customer Database Data │ +│ (lives in Kiven's AWS account) │ │ (lives in customer's AWS account) │ +│ │ │ │ +│ PostgreSQL (Aiven) — product DB │ │ PostgreSQL (CNPG on customer EKS) │ +│ Kafka (Aiven) — events │ │ Barman backups → customer's S3 │ +│ Valkey (Aiven) — cache │ │ │ +│ │ │ Kiven NEVER accesses row data. │ +│ Contains: orgs, users, clusters, │ │ Agent collects only: pg_stat_*, │ +│ billing, audit, agent metadata │ │ logs, CRD status, metrics. 
│ +└─────────────────────────────────────┘ └─────────────────────────────────────┘ +``` -**Coût total Aiven estimé : ~850€/mois** +**Golden rule: Customer data never touches Kiven's infrastructure.** --- -# 🐘 **Database Strategy** +# Kiven Product Database (SaaS) + +## Aiven Configuration + +| Service | Plan | Config | Estimated Cost | +|---------|------|--------|----------------| +| **PostgreSQL** | Business-4 | Primary + Read Replica, 100GB | ~300 EUR/mo | +| **Kafka** | Business-4 | 3 brokers, 100GB retention | ~400 EUR/mo | +| **Valkey** | Business-4 | 2 nodes, 10GB, HA | ~150 EUR/mo | -## Configuration +**Total estimated Aiven cost: ~850 EUR/mo** -| Aspect | Choix | Rationale | -|--------|-------|-----------| -| **Replication** | Aiven managed (async) | RPO 1h acceptable | +## Database Configuration + +| Aspect | Choice | Rationale | +|--------|--------|-----------| +| **Replication** | Aiven managed (async) | RPO 1h acceptable for product DB | | **Backup** | Aiven automated hourly | RPO 1h | | **Failover** | Aiven automated | RTO < 15min | -| **Connection** | VPC Peering (private) | PCI-DSS, no public internet | +| **Connection** | VPC Peering (private) | No public internet | | **Pooling** | PgBouncer (Aiven built-in) | Connection efficiency | +## Schema Ownership + +### Core Tables + +| Table | Owner Service | Description | +|-------|---------------|-------------| +| `organizations` | svc-auth | Customer organizations | +| `users` | svc-auth | Dashboard users, roles, teams | +| `api_keys` | svc-auth | API key management | +| `clusters` | svc-clusters | Managed CNPG cluster metadata | +| `cluster_configs` | svc-yamleditor | YAML history, versions, diffs | +| `databases` | svc-users | PostgreSQL databases within clusters | +| `database_users` | svc-users | PostgreSQL roles within clusters | +| `backups` | svc-backups | Backup records, status, PITR points | +| `backup_verifications` | svc-backups | Restore test results | +| `agents` | svc-agent-relay | 
Registered agents, heartbeat status | +| `provisioning_jobs` | svc-provisioner | Provisioning pipeline state machine | +| `infra_resources` | svc-infra | Customer AWS resources (node groups, EBS, S3, IAM) | +| `service_plans` | svc-clusters | Plan definitions (Hobbyist, Startup, Business...) | +| `metrics_snapshots` | svc-monitoring | Aggregated metrics for dashboard display | +| `alerts` | svc-monitoring | Alert rules and status | +| `dba_recommendations` | svc-monitoring | Performance advisor suggestions | +| `audit_log` | svc-audit | Immutable audit trail | +| `billing_subscriptions` | svc-billing | Stripe subscriptions, usage | +| `invoices` | svc-billing | Invoice records | +| `migrations` | svc-migrations | Migration jobs (from Aiven/RDS) | + +### Power Schedule Tables + +| Table | Owner Service | Description | +|-------|---------------|-------------| +| `power_schedules` | svc-clusters | Scheduled power on/off rules | +| `power_events` | svc-clusters | Power on/off event history | + +**Rule: 1 table = 1 owner. 
Cross-service communication = gRPC or Kafka events, never JOINs.** + ## Connection Best Practices -| Paramètre | Valeur recommandée | Rationale | +| Parameter | Recommended Value | Rationale | |-----------|-------------------|-----------| -| **pool_size** | 20 | Nombre de connexions par pod | -| **max_overflow** | 10 | Connexions supplémentaires en pic | -| **pool_timeout** | 30s | Attente max pour une connexion | -| **pool_recycle** | 1800s | Recycler connexions toutes les 30min | -| **ssl** | require | Obligatoire pour PCI-DSS | +| **pool_size** | 20 | Connections per service pod | +| **max_overflow** | 10 | Extra connections at peak | +| **pool_timeout** | 30s | Max wait for connection | +| **pool_recycle** | 1800s | Recycle connections every 30min | +| **ssl** | require | Always encrypted | --- -# 📊 **Schema Ownership** - -| Table | Owner Service | Access pattern | -|-------|---------------|----------------| -| `transactions` | svc-ledger | CRUD | -| `ledger_entries` | svc-ledger | CRUD | -| `wallets` | svc-wallet | CRUD | -| `balance_snapshots` | svc-wallet | CRUD | -| `merchants` | svc-merchant | CRUD | -| `giftcards` | svc-giftcard | CRUD | - -**Règle d'or : 1 table = 1 owner. 
Cross-service = gRPC ou Events, jamais JOIN.** +# Customer Databases (Managed by Kiven) + +## What Kiven Provisions + +For each customer database, Kiven creates: + +| Resource | Type | Where | Managed By | +|----------|------|-------|------------| +| CNPG Cluster CR | Kubernetes CRD | Customer K8s | Kiven agent | +| PostgreSQL pods | Pods (Primary + Replicas) | Customer K8s | CNPG operator | +| PgBouncer Pooler | Kubernetes CRD | Customer K8s | CNPG operator | +| EBS volumes | AWS EBS gp3 | Customer AWS | Kiven svc-infra | +| S3 backup bucket | AWS S3 | Customer AWS | Kiven svc-infra | +| IRSA role | AWS IAM | Customer AWS | Kiven svc-infra | +| ScheduledBackup CR | Kubernetes CRD | Customer K8s | Kiven agent | +| NetworkPolicy | Kubernetes | Customer K8s | Kiven agent | + +## Service Plan → Infrastructure Mapping + +| Plan | Node Type | Instances | Storage | Backup Freq | PgBouncer Pool | +|------|-----------|-----------|---------|-------------|----------------| +| **Hobbyist** | t3.small | 1 | 10GB gp3 | Daily | 25 | +| **Startup** | r6g.medium | 2 | 50GB gp3 | 6h | 50 | +| **Business** | r6g.large | 3 | 100GB gp3 (3000 IOPS) | 1h | 100 | +| **Premium** | r6g.xlarge | 3 | 500GB gp3 (6000 IOPS) | 30min | 200 | +| **Custom** | Any | 1-5 | Custom | Custom | Custom | + +## Auto-Tuned postgresql.conf per Plan + +| Parameter | Hobbyist | Startup | Business | Premium | +|-----------|----------|---------|----------|---------| +| `shared_buffers` | 256MB | 1GB | 4GB | 8GB | +| `effective_cache_size` | 768MB | 3GB | 12GB | 24GB | +| `work_mem` | 4MB | 16MB | 32MB | 64MB | +| `maintenance_work_mem` | 64MB | 256MB | 512MB | 1GB | +| `max_connections` | 50 | 100 | 200 | 400 | +| `wal_buffers` | 8MB | 16MB | 32MB | 64MB | +| `random_page_cost` | 1.1 | 1.1 | 1.1 | 1.1 | +| `effective_io_concurrency` | 200 | 200 | 200 | 200 | +| `checkpoint_completion_target` | 0.9 | 0.9 | 0.9 | 0.9 | + +These values are the **defaults per plan**. 
The DBA intelligence engine adjusts them based on real workload over time. + +## What Kiven Collects (Metadata Only — Never Row Data) + +| Data Collected | Source | Purpose | Contains PII? | +|----------------|--------|---------|---------------| +| `pg_stat_statements` | PG catalog | Query performance analysis | No (queries anonymized) | +| `pg_stat_activity` | PG catalog | Active connections, blocking | No | +| `pg_stat_bgwriter` | PG catalog | Checkpoint/write performance | No | +| `pg_stat_user_tables` | PG catalog | Table size, seq/idx scans | No | +| CNPG Cluster status | K8s CRD | Cluster health, replication lag | No | +| Pod metrics | Kubelet | CPU, memory, disk usage | No | +| PG logs | Pod logs | Error detection, slow queries | Potentially (log scrubbing applied) | +| Node status | K8s API | Node health, capacity | No | +| EBS metrics | CloudWatch | Disk IOPS, latency | No | + +**Log scrubbing**: The agent strips potential PII from PG logs before sending to Kiven (query parameter values replaced with `$N`). --- -# 📨 **Kafka Topics** +# Kafka Topics ## Topic Configuration -| Topic | Producer | Consumers | Retention | -|-------|----------|-----------|-----------| -| `ledger.transactions.v1` | svc-ledger (Outbox) | svc-notification, svc-analytics | 7 jours | -| `wallet.balance-updated.v1` | svc-wallet | svc-analytics | 7 jours | -| `merchant.onboarded.v1` | svc-merchant | svc-notification | 7 jours | - -## Outbox Pattern avec Debezium - -> **Implementation** : On utilise **Debezium** avec **PostgreSQL Logical Replication** (publication + replication slot), pas le polling. 
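All topic names in the table below follow the `{domain}.{entity}.{version}` convention. As an illustration only (a hypothetical helper, not part of any Kiven service), the convention can be enforced with a small Go validator before topics are created, for example in a CI check:

```go
package main

import (
	"fmt"
	"regexp"
)

// topicPattern encodes the {domain}.{entity}.{version} naming convention,
// e.g. "agent.status.v1" or "provisioning.commands.v1". Hyphenated words
// inside a segment are also accepted.
var topicPattern = regexp.MustCompile(`^[a-z]+(?:-[a-z]+)*\.[a-z]+(?:-[a-z]+)*\.v[0-9]+$`)

// validTopic reports whether a topic name matches the convention.
func validTopic(name string) bool {
	return topicPattern.MatchString(name)
}

func main() {
	for _, t := range []string{"agent.status.v1", "audit.actions.v1", "AgentStatus"} {
		fmt.Printf("%s -> %t\n", t, validTopic(t))
	}
}
```

Running such a check in CI keeps producers from silently introducing unversioned or inconsistently named topics.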
+| Topic | Producer | Consumers | Retention | Purpose | +|-------|----------|-----------|-----------|---------| +| `agent.status.v1` | Agent (via relay) | svc-clusters, svc-monitoring | 7 days | Cluster status updates | +| `agent.metrics.v1` | Agent (via relay) | svc-monitoring | 3 days | PG metrics stream | +| `agent.logs.v1` | Agent (via relay) | svc-monitoring | 3 days | PG log stream | +| `agent.events.v1` | Agent (via relay) | svc-clusters, svc-notification | 7 days | Failover, backup, error events | +| `provisioning.commands.v1` | svc-provisioner | Agent (via relay) | 1 day | Commands to execute in customer K8s | +| `provisioning.status.v1` | svc-provisioner | svc-api, dashboard | 7 days | Provisioning pipeline progress | +| `audit.actions.v1` | All services | svc-audit | 30 days | Immutable audit trail | +| `billing.usage.v1` | svc-monitoring | svc-billing | 30 days | Per-cluster usage metrics | +| `alerts.triggered.v1` | svc-monitoring | svc-notification | 7 days | Alert events for dispatch | +| `dba.recommendations.v1` | svc-monitoring | svc-api, dashboard | 7 days | DBA intelligence suggestions | + +## Topic Naming Convention ``` -┌─────────────────────────────────────────────────────────────────────────────┐ -│ OUTBOX PATTERN (Debezium CDC) │ -├─────────────────────────────────────────────────────────────────────────────┤ -│ │ -│ 1. Application writes to DB + Outbox table in same transaction │ -│ 2. Debezium reads WAL via replication slot │ -│ 3. Events published to Kafka │ -│ 4. 
Consumers process events │ -│ │ -│ ┌─────────┐ ┌─────────────┐ ┌──────────┐ ┌─────────────┐ │ -│ │ svc-* │───►│ PostgreSQL │───►│ Debezium │───►│ Kafka │ │ -│ │ │ │ (WAL/Slot) │ │ (CDC) │ │ │ │ -│ └─────────┘ └─────────────┘ └──────────┘ └──────┬──────┘ │ -│ │ │ -│ Publication + Replication Slot ▼ │ -│ ┌─────────────────┐ │ -│ │ Consumers │ │ -│ └─────────────────┘ │ -│ │ -└─────────────────────────────────────────────────────────────────────────────┘ -``` - -## Debezium Configuration - -| Composant | Description | -|-----------|-------------| -| **Publication** | `CREATE PUBLICATION outbox_pub FOR TABLE outbox;` | -| **Replication Slot** | Créé automatiquement par Debezium | -| **Connector** | Debezium PostgreSQL Connector | -| **Output** | Kafka topic par table (ou SMT pour routing) | +{domain}.{entity}.{version} -## Outbox Table Structure - -| Colonne | Type | Description | -|---------|------|-------------| -| `id` | UUID | Primary key | -| `aggregate_type` | VARCHAR(255) | Type d'entité (Transaction, Wallet...) 
| -| `aggregate_id` | VARCHAR(255) | ID de l'entité | -| `event_type` | VARCHAR(255) | Type d'événement | -| `payload` | JSONB | Données de l'événement | -| `created_at` | TIMESTAMPTZ | Timestamp création | - ---- - -# 📊 **Kafka Monitoring** - -## Métriques Essentielles - -| Métrique | Description | Seuil Alerte | Sévérité | -|----------|-------------|--------------|----------| -| **Consumer Lag** | Messages non traités | > 1000 | P2 | -| **Partition Lag** | Lag par partition | > 500 | P3 | -| **Under-replicated Partitions** | Partitions sans réplicas | > 0 | P1 | -| **Active Controller Count** | Controllers actifs | ≠ 1 | P1 | -| **Offline Partitions** | Partitions inaccessibles | > 0 | P1 | -| **Bytes In/Out Rate** | Débit Kafka | Anomalie > 50% | P3 | -| **Request Latency P99** | Latence requêtes | > 100ms | P2 | -| **ISR Shrink Rate** | Réduction In-Sync Replicas | > 0/min sustained | P2 | - -## Consumer Lag Monitoring - -``` -┌─────────────────────────────────────────────────────────────────────────────┐ -│ CONSUMER LAG │ -├─────────────────────────────────────────────────────────────────────────────┤ -│ │ -│ Producer Offset: 1000 ────────────────────────────────► │ -│ Consumer Offset: 800 ──────────────────────► │ -│ │◄───── LAG = 200 ─────►│ │ -│ │ -│ LAG = Producer Offset - Consumer Offset │ -│ │ -│ Causes de Lag élevé: │ -│ • Consumer lent (processing time) │ -│ • Consumer crashé │ -│ • Pic de trafic │ -│ • Problème de partition rebalancing │ -│ │ -└─────────────────────────────────────────────────────────────────────────────┘ +Examples: + agent.status.v1 + provisioning.commands.v1 + audit.actions.v1 ``` -## Dashboard Kafka Recommandé +## Kafka Monitoring -| Panel | Métrique | Type | -|-------|----------|------| -| **Total Consumer Lag** | `kafka_consumergroup_lag` | Gauge | -| **Lag par Consumer Group** | `kafka_consumergroup_lag` by group | Gauge | -| **Messages In/sec** | `kafka_server_brokertopicmetrics_messagesin_total` | Counter → Rate | -| **Bytes 
In/Out** | `kafka_server_brokertopicmetrics_bytesin_total` | Counter → Rate | -| **Request Latency** | `kafka_network_requestmetrics_requestqueuetimems` | Histogram | -| **Partition Count** | `kafka_server_replicamanager_partitioncount` | Gauge | -| **Under-replicated** | `kafka_server_replicamanager_underreplicatedpartitions` | Gauge | +| Metric | Alert Threshold | Severity | +|--------|----------------|----------| +| **Consumer Lag** | > 1000 messages | P2 | +| **Under-replicated Partitions** | > 0 | P1 | +| **Active Controller Count** | != 1 | P1 | +| **Offline Partitions** | > 0 | P1 | +| **Request Latency P99** | > 100ms | P2 | --- -# 🚀 **Cache Architecture (Valkey)** - -## Stack Cache +# Cache Architecture (Valkey) -| Composant | Outil | Hébergement | Coût estimé | -|-----------|-------|-------------|-------------| -| **Cache primaire** | Valkey (Redis-compatible) | Aiven for Caching | ~150€/mois | -| **Cache local (L1)** | Python `cachetools` / Go `bigcache` | In-memory | 0€ | +## Cache Stack -> **Note :** Valkey est le fork open-source de Redis, maintenu par la Linux Foundation. Aiven supporte Valkey nativement. 
- -## Cache Topology - -``` -┌─────────────────────────────────────────────────────────────────────────────┐ -│ MULTI-LAYER CACHE │ -├─────────────────────────────────────────────────────────────────────────────┤ -│ │ -│ ┌─────────────────────────────────────────────────────────────────────┐ │ -│ │ L1 — LOCAL CACHE (per pod) │ │ -│ │ • TTL: 30s - 5min │ │ -│ │ • Size: 100MB max per pod │ │ -│ │ • Use case: Hot data, config, user sessions │ │ -│ └───────────────────────────────┬─────────────────────────────────────┘ │ -│ │ Cache miss │ -│ ▼ │ -│ ┌─────────────────────────────────────────────────────────────────────┐ │ -│ │ L2 — DISTRIBUTED CACHE (Valkey cluster) │ │ -│ │ • TTL: 5min - 24h │ │ -│ │ • Size: 10GB │ │ -│ │ • Use case: Shared state, rate limits, session store │ │ -│ └───────────────────────────────┬─────────────────────────────────────┘ │ -│ │ Cache miss │ -│ ▼ │ -│ ┌─────────────────────────────────────────────────────────────────────┐ │ -│ │ L3 — DATABASE (PostgreSQL) │ │ -│ │ • Source of truth │ │ -│ │ • Write-through pour updates │ │ -│ └─────────────────────────────────────────────────────────────────────┘ │ -│ │ -└─────────────────────────────────────────────────────────────────────────────┘ -``` +| Component | Tool | Hosting | Estimated Cost | +|-----------|------|---------|----------------| +| **Distributed cache** | Valkey (Redis-compatible) | Aiven | ~150 EUR/mo | +| **Local cache (L1)** | Go `bigcache` | In-memory per pod | 0 EUR | -## Cache Strategies par Use Case +## Cache Use Cases | Use Case | Strategy | TTL | Invalidation | |----------|----------|-----|--------------| -| **Wallet Balance** | Cache-aside (read) | 30s | Event-driven (Kafka) | -| **Merchant Config** | Read-through | 5min | TTL + Manual | -| **Rate Limiting** | Write-through | Sliding window | Auto-expire | -| **Session Data** | Write-through | 24h | Explicit logout | -| **Gift Card Catalog** | Cache-aside | 15min | Event-driven | -| **Feature Flags** | Read-through | 1min | 
Config push | - -## Cache Patterns - -### Cache-Aside Pattern - -``` -┌─────────────────────────────────────────────────────────────────────────────┐ -│ CACHE-ASIDE PATTERN │ -├─────────────────────────────────────────────────────────────────────────────┤ -│ │ -│ 1. Application checks cache │ -│ 2. If HIT → return cached data │ -│ 3. If MISS → query database │ -│ 4. Store result in cache with TTL │ -│ 5. Return data to caller │ -│ │ -│ ┌─────────┐ GET ┌─────────┐ │ -│ │ App │───────────►│ Cache │ │ -│ └────┬────┘ └────┬────┘ │ -│ │ │ MISS │ -│ │ SELECT ▼ │ -│ └─────────────────►┌─────────┐ │ -│ │ DB │ │ -│ └─────────┘ │ -│ │ -└─────────────────────────────────────────────────────────────────────────────┘ -``` - -### Write-Through Pattern - -``` -┌─────────────────────────────────────────────────────────────────────────────┐ -│ WRITE-THROUGH PATTERN │ -├─────────────────────────────────────────────────────────────────────────────┤ -│ │ -│ 1. Application writes to cache AND database atomically │ -│ 2. 
Cache is always consistent with database │ -│ │ -│ ┌─────────┐ SET+TTL ┌─────────┐ │ -│ │ App │────────────►│ Cache │ │ -│ └────┬────┘ └─────────┘ │ -│ │ │ -│ │ INSERT/UPDATE │ -│ └─────────────────►┌─────────┐ │ -│ │ DB │ │ -│ └─────────┘ │ -│ │ -└─────────────────────────────────────────────────────────────────────────────┘ -``` - -## Cache Invalidation Strategy - -| Trigger | Méthode | Use Case | -|---------|---------|----------| -| **TTL Expiry** | Automatic | Default pour toutes les clés | -| **Event-driven** | Kafka consumer | Wallet balance après transaction | -| **Explicit Delete** | API call | Admin actions, config updates | -| **Pub/Sub** | Valkey PUBLISH | Real-time invalidation cross-pods | +| **Session data** | Write-through | 24h | Explicit logout | +| **Cluster status** | Cache-aside | 30s | Agent event | +| **Org/team config** | Read-through | 5min | TTL + manual | +| **Rate limiting** | Write-through | Sliding window | Auto-expire | +| **API response cache** | Cache-aside | 1min | TTL | +| **Agent connection state** | Write-through | Heartbeat interval | Agent disconnect | +| **Service plan definitions** | Read-through | 1h | Manual invalidation | ## Cache Key Naming Convention ``` {service}:{entity}:{id}:{version} -Exemples: - wallet:balance:user_123:v1 - merchant:config:merchant_456:v1 - giftcard:catalog:category_active:v1 - ratelimit:api:user_123:minute - session:auth:session_abc123 +Examples: + auth:session:sess_abc123 + clusters:status:cluster_456:v1 + monitoring:metrics:cluster_456:latest + ratelimit:api:org_789:minute + plans:definition:business:v1 ``` -## Cache Metrics & Monitoring +## Cache Metrics -| Metric | Seuil alerte | Action | -|--------|--------------|--------| -| **Hit Rate** | < 80% | Revoir TTL, préchargement | +| Metric | Alert Threshold | Action | +|--------|----------------|--------| +| **Hit Rate** | < 80% | Review TTL, preloading | | **Latency P99** | > 10ms | Check network, cluster size | | **Memory Usage** | > 80% | 
Eviction analysis, scale up | -| **Evictions/sec** | > 100 | Augmenter cache size | -| **Connection Errors** | > 0 | Check connectivity, pooling | +| **Connection Errors** | > 0 | Check connectivity | --- -# 📋 **Queueing & Background Jobs** - -## Architecture Overview - -> **Clarification** : La Task Queue est **interne** aux services, pas en frontal comme RabbitMQ. - -``` -┌─────────────────────────────────────────────────────────────────────────────┐ -│ TASK QUEUE vs MESSAGE BROKER (RabbitMQ) │ -├─────────────────────────────────────────────────────────────────────────────┤ -│ │ -│ ❌ Pattern RabbitMQ (frontal) - PAS ce qu'on fait: │ -│ │ -│ Client → RabbitMQ → Worker → Response to Client (synchrone) │ -│ │ -│ ✅ Notre pattern (Task Queue interne): │ -│ │ -│ Client → API (svc-*) → Response immédiate (< 200ms) │ -│ │ │ -│ └──► enqueue task → Valkey → Worker (async, background) │ -│ │ -│ Différence clé: │ -│ • L'API répond IMMÉDIATEMENT au client │ -│ • Le worker traite en BACKGROUND (fire-and-forget ou avec callback) │ -│ │ -└─────────────────────────────────────────────────────────────────────────────┘ -``` - -## Queueing Tiers - -``` -┌─────────────────────────────────────────────────────────────────────────────┐ -│ QUEUEING ARCHITECTURE │ -├─────────────────────────────────────────────────────────────────────────────┤ -│ │ -│ ┌─────────────────────────────────────────────────────────────────────┐ │ -│ │ TIER 1 — EVENT STREAMING (Kafka) │ │ -│ │ • Use case: Event-driven architecture, CDC, audit logs │ │ -│ │ • Pattern: Pub/Sub, Event Sourcing │ │ -│ │ • Ordering: Per-partition guaranteed │ │ -│ └─────────────────────────────────────────────────────────────────────┘ │ -│ │ -│ ┌─────────────────────────────────────────────────────────────────────┐ │ -│ │ TIER 2 — TASK QUEUE (Valkey + Dramatiq) │ │ -│ │ • Use case: Background jobs, async processing │ │ -│ │ • Pattern: Producer/Consumer, Work Queue │ │ -│ │ • Features: Retries, priorities, scheduling │ │ -│ 
└─────────────────────────────────────────────────────────────────────┘ │ -│ │ -│ ┌─────────────────────────────────────────────────────────────────────┐ │ -│ │ TIER 3 — SCHEDULED JOBS (Kubernetes CronJobs) │ │ -│ │ • Use case: Batch processing, reports, cleanup │ │ -│ │ • Pattern: Time-triggered execution │ │ -│ └─────────────────────────────────────────────────────────────────────┘ │ -│ │ -└─────────────────────────────────────────────────────────────────────────────┘ -``` - -## Kafka vs Task Queue — Quand utiliser quoi ? - -| Critère | Kafka | Task Queue (Valkey) | -|---------|-------|---------------------| -| **Message Ordering** | ✅ Per-partition | ❌ Best effort | -| **Message Replay** | ✅ Retention-based | ❌ Non | -| **Priority Queues** | ❌ Non natif | ✅ Oui | -| **Delayed Messages** | ❌ Non natif | ✅ Oui | -| **Dead Letter Queue** | ✅ Configurable | ✅ Intégré | -| **Exactly-once** | ✅ Avec idempotency | ❌ At-least-once | -| **Use Case** | Events entre services | Jobs internes async | - -## Task Queue Stack - -| Composant | Outil | Rôle | -|-----------|-------|------| -| **Task Framework** | Dramatiq (Python) / Asynq (Go) | Task definition, execution | -| **Broker** | Valkey (Redis-compatible) | Message storage, routing | -| **Result Backend** | Valkey | Task results, status | -| **Scheduler** | APScheduler / Dramatiq-crontab | Periodic tasks | -| **Monitoring** | Dramatiq Dashboard / Prometheus | Task metrics | - -## Task Processing Flow +# Data Isolation Principle ``` -┌─────────────────────────────────────────────────────────────────────────────┐ -│ TASK PROCESSING FLOW │ -├─────────────────────────────────────────────────────────────────────────────┤ -│ │ -│ Producer Broker Workers │ -│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ -│ │ svc-* │──── enqueue ──►│ Valkey │◄── poll ─────│ Worker │ │ -│ │ API │ │ │ │ Pods │ │ -│ └─────────┘ │ Queues: │ └────┬────┘ │ -│ │ │ • high │ │ │ -│ │ Response │ • default│ │ execute │ -│ │ immédiate │ • low │ ▼ │ -│ ▼ │ • dlq │ 
┌─────────┐ │ -│ Client └─────────┘ │ Task │ │ -│ (n'attend pas) │ Handler │ │ -│ └─────────┘ │ -│ │ -└─────────────────────────────────────────────────────────────────────────────┘ +┌──────────────────────────────────────────────────────────────────────────┐ +│ DATA ISOLATION MODEL │ +│ │ +│ ┌─── Kiven SaaS ───────────────────────────────────────────────────┐ │ +│ │ │ │ +│ │ Product DB (Aiven PG) Kafka (Aiven) Valkey (Aiven) │ │ +│ │ ├─ organizations ├─ agent events ├─ sessions │ │ +│ │ ├─ clusters (metadata) ├─ audit trail ├─ rate limits │ │ +│ │ ├─ audit_log ├─ alerts ├─ cache │ │ +│ │ └─ billing └─ billing usage └─ agent state │ │ +│ │ │ │ +│ │ CONTAINS: Metadata, config, status, metrics aggregates │ │ +│ │ NEVER CONTAINS: Customer's actual database rows │ │ +│ └───────────────────────────────────────────────────────────────────┘ │ +│ │ +│ ┌─── Customer A's AWS ──────┐ ┌─── Customer B's AWS ──────┐ │ +│ │ │ │ │ │ +│ │ CNPG PostgreSQL │ │ CNPG PostgreSQL │ │ +│ │ ├─ Their app data │ │ ├─ Their app data │ │ +│ │ └─ Their users │ │ └─ Their users │ │ +│ │ │ │ │ │ +│ │ S3: Their backups │ │ S3: Their backups │ │ +│ │ EBS: Their volumes │ │ EBS: Their volumes │ │ +│ │ │ │ │ │ +│ │ KIVEN NEVER READS THIS │ │ KIVEN NEVER READS THIS │ │ +│ └────────────────────────────┘ └────────────────────────────┘ │ +│ │ +└──────────────────────────────────────────────────────────────────────────┘ ``` -## Queue Definitions - -| Queue | Priority | Workers | Use Cases | -|-------|----------|---------|-----------| -| **critical** | P0 | 5 | Transaction rollbacks, fraud alerts | -| **high** | P1 | 10 | Email confirmations, balance updates | -| **default** | P2 | 20 | Notifications, analytics events | -| **low** | P3 | 5 | Reports, cleanup, batch exports | -| **scheduled** | N/A | 3 | Cron-like scheduled tasks | -| **dead-letter** | N/A | 1 | Failed tasks investigation | - -## Retry Strategy - -| Retry Policy | Configuration | Use Case | -|--------------|---------------|----------| -| 
**Exponential Backoff** | base=1s, max=1h, multiplier=2 | API calls, external services | -| **Fixed Interval** | interval=30s, max_retries=5 | Database operations | -| **No Retry** | max_retries=0 | Idempotent operations | - -## Dead Letter Queue (DLQ) Handling - -| Étape | Action | -|-------|--------| -| 1 | Task fails après max retries | -| 2 | Task moved to DLQ avec metadata (reason, stack trace, attempts) | -| 3 | Alert Slack (P3) | -| 4 | On-call investigate | -| 5 | Options: Fix → Replay, Manual resolution, Archive | - -## Scheduled Jobs (CronJobs) - -| Job | Schedule | Service | Description | -|-----|----------|---------|-------------| -| **balance-reconciliation** | `0 2 * * *` | svc-wallet | Daily balance verification | -| **expired-giftcards** | `0 0 * * *` | svc-giftcard | Mark expired cards | -| **analytics-rollup** | `0 */6 * * *` | svc-analytics | 6-hourly aggregation | -| **log-cleanup** | `0 3 * * 0` | platform | Weekly log rotation | -| **backup-verification** | `0 4 * * *` | platform | Daily backup integrity check | -| **compliance-report** | `0 6 1 * *` | platform | Monthly compliance export | - -## Task Queue Monitoring - -| Metric | Seuil alerte | Action | -|--------|--------------|--------| -| **Queue Depth** | > 1000 tasks | Scale workers | -| **Processing Time P95** | > 30s | Optimize task, check resources | -| **Failure Rate** | > 5% | Investigate DLQ, check dependencies | -| **DLQ Size** | > 10 tasks | Immediate investigation | -| **Worker Availability** | < 50% | Check pod health, scale up | +This isolation is **fundamental to Kiven's value proposition**: the customer's data never leaves their infrastructure. Kiven only manages the infrastructure and configuration around it. 
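The log scrubbing mentioned earlier (query parameter values replaced with `$N` before logs leave the customer's cluster) can be sketched as below. This is a simplified illustration under the assumption that only quoted strings and bare numeric literals carry PII; a production agent would need SQL-aware normalization, along the lines of what `pg_stat_statements` does:

```go
package main

import (
	"fmt"
	"regexp"
)

// literalPattern matches single-quoted SQL string literals (including the
// '' escape) and bare numeric literals found in logged statements.
var literalPattern = regexp.MustCompile(`'(?:[^']|'')*'|\b\d+(?:\.\d+)?\b`)

// scrubQuery replaces each literal with a positional placeholder so that
// parameter values (potential PII) are stripped before leaving the cluster.
func scrubQuery(q string) string {
	n := 0
	return literalPattern.ReplaceAllStringFunc(q, func(string) string {
		n++
		return fmt.Sprintf("$%d", n)
	})
}

func main() {
	q := "SELECT * FROM users WHERE email = 'alice@example.com' AND age > 30"
	fmt.Println(scrubQuery(q))
	// Output: SELECT * FROM users WHERE email = $1 AND age > $2
}
```

Note that this keeps the query *shape* (useful for performance analysis) while dropping every value, which is exactly the boundary the isolation model above requires.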
--- -*Document maintenu par : Platform Team + Backend Team* -*Dernière mise à jour : Janvier 2026* +*Maintained by: Platform Team + Backend Team* +*Last updated: February 2026* diff --git a/development/LOCAL-DEV-GUIDE.md b/development/LOCAL-DEV-GUIDE.md new file mode 100644 index 0000000..cf027a1 --- /dev/null +++ b/development/LOCAL-DEV-GUIDE.md @@ -0,0 +1,576 @@ +# Kiven — Development Strategy & Local Dev Guide + +> **Back to**: [Architecture Overview](../EntrepriseArchitecture.md) + +--- + +# Development Environments + +## Overview + +``` +Level 1: LOCAL ($0/mo) Level 2: SANDBOX (~$400/mo) Level 3: STAGING/PROD +───────────────────── ────────────────────────── ────────────────────── +kiven-dev repo AWS Account: kiven-sandbox AWS Accounts: + Docker Compose (shared) ├── EKS "kiven-dev" kiven-staging + kind cluster (shared) │ └── Kiven services kiven-prod + Tilt (orchestrates all) └── EKS "test-client" + └── Agent + CNPG Real Aiven (PG, Kafka) +Each svc-* repo: Real customers + Go code + Dockerfile Full AWS integration: + task init (mise + tools) node groups, EBS, S3, IAM + Own CI (reusable GH wf) + ArgoCD deployment +``` + +--- + +# Architecture: Polyrepo + Dev Orchestrator + +## Why Polyrepo + +- Each service has its **own repo**, its **own CI** (reusable GitHub workflows), its **own release cycle** +- Shared infrastructure (Docker Compose, kind, Tilt) lives in **one `kiven-dev` repo** +- Shared Go code lives in **`kiven-go-sdk`** (imported as a Go module) +- No port conflicts, no duplicate infra + +## Repo Layout + +``` +kivenio/ ← GitHub Organization +│ +├── kiven-dev/ ← DEV ORCHESTRATOR (this section) +│ ├── docker-compose.yml ← ONE PostgreSQL, Redpanda, Valkey, MinIO +│ ├── kind/ +│ │ ├── cluster.yaml ← ONE kind cluster +│ │ └── cnpg-test-cluster.yaml ← Test PG cluster (simulates customer DB) +│ ├── Tiltfile ← Orchestrates ALL services for local dev +│ ├── init-db.sql ← Product DB schema + seed data +│ ├── .mise.toml ← Shared tool versions (Go, Node, kubectl, 
helm...) +│ └── Taskfile.yml ← task dev, task infra:up, task kind:create +│ +├── kiven-go-sdk/ ← SHARED GO CODE (imported as module) +│ ├── provider/ ← Provider interface (provider.go, registry.go) +│ ├── grpcapi/ ← gRPC types (generated from proto) +│ ├── models/ ← Shared domain models +│ └── go.mod ← github.com/kivenio/kiven-go-sdk +│ +├── contracts-proto/ ← PROTOBUF DEFINITIONS +│ ├── agent/v1/agent.proto ← Agent ↔ SaaS protocol +│ ├── api/v1/services.proto ← REST API types +│ └── buf.yaml +│ +├── svc-api/ ← SERVICE REPO (one of many) +│ ├── cmd/main.go +│ ├── internal/ ← Service-specific logic +│ ├── Dockerfile +│ ├── .mise.toml ← Tool versions for THIS service +│ ├── Taskfile.yml ← task init, task run, task test, task build +│ ├── .github/workflows/ci.yml ← Uses reusable workflow +│ └── go.mod ← imports github.com/kivenio/kiven-go-sdk +│ +├── svc-provisioner/ ← Same structure as svc-api +├── svc-agent-relay/ ← Same structure +├── svc-clusters/ ← Same structure +├── svc-backups/ ← Same structure +├── svc-monitoring/ ← Same structure +├── svc-users/ ← Same structure +├── kiven-agent/ ← Same structure (deployed in customer K8s) +├── provider-cnpg/ ← Same structure (Go library) +├── dashboard/ ← Next.js frontend +│ ├── src/ +│ ├── .mise.toml +│ ├── Taskfile.yml +│ └── package.json +│ +└── platform-github-management/ ← Repo management + reusable workflows + └── .github/workflows/ + └── reusable-go-ci.yml ← Reusable CI for all Go services +``` + +## How It Fits Together + +``` +┌─── kiven-dev (Dev Orchestrator) ──────────────────────────────────┐ +│ │ +│ Docker Compose (shared infra) kind cluster (shared K8s) │ +│ ┌────────┐ ┌────────┐ ┌────────────────────────────┐ │ +│ │Postgres│ │Redpanda│ │ CNPG Operator │ │ +│ │ :5432 │ │ :19092 │ │ kiven-agent (from repo) │ │ +│ ├────────┤ ├────────┤ │ test-pg cluster │ │ +│ │ Valkey │ │ MinIO │ └────────────────────────────┘ │ +│ │ :6379 │ │ :9000 │ │ +│ └────────┘ └────────┘ │ +│ │ +│ Tilt (watches all service repos, builds 
& runs them) │ +│ ┌────────┐ ┌──────────────┐ ┌───────────────┐ ┌──────────┐ │ +│ │svc-api │ │svc-agent-relay│ │svc-provisioner│ │svc-clusters│ │ +│ │ :8080 │ │ :9090 gRPC │ │ :8082 │ │ :8083 │ │ +│ └────────┘ └──────────────┘ └───────────────┘ └──────────┘ │ +│ │ +│ Dashboard (from dashboard/ repo) │ +│ ┌──────────┐ │ +│ │ Next.js │ │ +│ │ :3000 │ │ +│ └──────────┘ │ +└────────────────────────────────────────────────────────────────────┘ +``` + +--- + +# Level 1: Local Development + +## Prerequisites (via mise) + +Each repo has a `.mise.toml`. The `kiven-dev` repo has the global one: + +```toml +# kiven-dev/.mise.toml +[tools] +go = "1.23" +node = "22" +kubectl = "latest" +helm = "latest" +kind = "latest" +tilt = "latest" +buf = "latest" +task = "latest" +golangci-lint = "latest" +``` + +```bash +# Install mise (one time) +curl https://mise.run | sh + +# Install all tools (in kiven-dev/) +mise install +``` + +## Quick Start + +```bash +# 1. Clone the dev orchestrator +git clone git@github.com:kivenio/kiven-dev.git +cd kiven-dev + +# 2. Install tools via mise +mise install + +# 3. Clone the service repos you need (siblings of kiven-dev) +task repos:clone # Clones all service repos next to kiven-dev/ + +# 4. Start shared infrastructure +task infra:up # Docker Compose: PostgreSQL, Redpanda, Valkey, MinIO + +# 5. Create kind cluster + CNPG +task kind:create # kind cluster + CNPG operator + namespaces + +# 6. Deploy test PostgreSQL cluster +task cnpg:deploy # Creates a CNPG PostgreSQL cluster inside kind + # This simulates a customer's database. + # The agent watches this cluster and reports to svc-agent-relay. + +# 7. 
Start all services with Tilt +tilt up # Watches all repos, builds, runs, shows logs + # Open http://localhost:10350 for Tilt dashboard + +# OR start individual services manually: +task svc:api # Runs svc-api from ../svc-api/ +task svc:relay # Runs svc-agent-relay from ../svc-agent-relay/ +task agent # Runs kiven-agent from ../kiven-agent/ +task frontend # Runs dashboard from ../dashboard/ +``` + +## What Is the Test CNPG Cluster? + +When you run `task cnpg:deploy`, Kiven-dev creates a **real PostgreSQL cluster** inside kind using the CloudNativePG operator. This is what happens: + +``` +1. CNPG Operator (already installed in kind) receives a Cluster YAML +2. Operator creates a PostgreSQL pod (test-pg-1) in namespace kiven-databases +3. PostgreSQL starts with: + - Database: "app" + - User: "app_user" (password in K8s Secret) + - pg_stat_statements enabled + - Logs slow queries > 200ms + +This cluster simulates a REAL CUSTOMER DATABASE. +The Kiven agent watches it, collects metrics, and reports to svc-agent-relay. +When you test provisioning in the dashboard, THIS is the cluster you see. +``` + +You can connect to it directly: +```bash +# Port-forward to the test PostgreSQL +kubectl port-forward -n kiven-databases svc/test-pg-rw 15432:5432 + +# Connect with psql +psql postgresql://app_user:@localhost:15432/app + +# Get the password +kubectl get secret -n kiven-databases test-pg-app -o jsonpath='{.data.password}' | base64 -d +``` + +## What You Can Test Locally + +| Feature | Works Locally? 
| How | +|---------|---------------|-----| +| Agent ↔ CNPG | Yes | Agent watches test-pg in kind | +| Agent ↔ svc-agent-relay | Yes | gRPC on localhost:9090 | +| CNPG cluster provisioning | Yes | Agent applies YAML to kind | +| Backup to S3 | Yes | MinIO at localhost:9000 (S3-compatible) | +| PG metrics collection | Yes | Agent reads pg_stat_* from test-pg | +| User/database management | Yes | Agent runs SQL on test-pg | +| PgBouncer pooling | Yes | CNPG Pooler CRD on kind | +| Dashboard ↔ API | Yes | Next.js :3000 → svc-api :8080 | +| YAML editor (Advanced Mode) | Yes | svc-yamleditor generates YAML | +| DBA intelligence | Yes | svc-monitoring analyzes PG stats | +| Power off / Power on | Partial | Delete/recreate CNPG cluster. No node management. | +| **AWS node groups** | **No** | Needs real AWS (Level 2) | +| **AWS EBS/S3/IAM** | **No** | Needs real AWS or LocalStack | +| **Cross-account IAM** | **No** | Needs sandbox | +| **Multi-AZ** | **No** | kind is single-node | + +## Stopping + +```bash +tilt down # Stop all services +task infra:down # Stop Docker Compose +task kind:delete # Delete kind cluster +``` + +--- + +# Service Repo Structure + +Every Go service repo follows the same structure: + +``` +kivenio/svc-api/ ← Example service +├── cmd/ +│ └── main.go ← Entry point +├── internal/ +│ ├── handler/ ← HTTP/gRPC handlers +│ ├── service/ ← Business logic +│ └── repository/ ← Database access +├── migrations/ ← SQL migrations (if needed) +├── Dockerfile ← Multi-stage build +├── .mise.toml ← Tool versions for this service +├── Taskfile.yml ← Service-level tasks +├── go.mod ← imports github.com/kivenio/kiven-go-sdk +├── go.sum +├── .github/ +│ └── workflows/ +│ └── ci.yml ← Uses reusable workflow from platform-github-management +├── .golangci.yml ← Linter config +├── .gitignore +└── README.md +``` + +## Service Taskfile (per repo) + +Each service has its own `Taskfile.yml`: + +```yaml +# svc-api/Taskfile.yml +version: "3" + +tasks: + init: + desc: "Initialize dev 
environment (mise + tools + dependencies)" + cmds: + - mise install + - go mod download + - echo "✅ Ready! Run 'task run' to start." + + run: + desc: "Run the service locally" + env: + DATABASE_URL: "postgres://kiven:kiven-local-dev@localhost:5432/kiven?sslmode=disable" + KAFKA_BROKERS: "localhost:19092" + VALKEY_ADDR: "localhost:6379" + PORT: "8080" + cmds: + - go run ./cmd/ + + test: + desc: "Run unit tests" + cmds: + - go test ./... -v -count=1 -race + + test:coverage: + desc: "Run tests with coverage" + cmds: + - go test ./... -coverprofile=coverage.out -race + - go tool cover -html=coverage.out -o coverage.html + + build: + desc: "Build binary" + cmds: + - go build -o bin/svc-api ./cmd/ + + lint: + desc: "Run linter" + cmds: + - golangci-lint run ./... + + docker:build: + desc: "Build Docker image" + cmds: + - docker build -t kivenio/svc-api:dev . + + proto:generate: + desc: "Generate Go code from proto (if this service uses gRPC)" + cmds: + - buf generate +``` + +## Reusable GitHub Workflow + +All service repos use the same CI workflow: + +```yaml +# svc-api/.github/workflows/ci.yml +name: CI +on: + push: + branches: [main] + pull_request: + +jobs: + ci: + uses: kivenio/platform-github-management/.github/workflows/reusable-go-ci.yml@main + with: + go-version: "1.23" + secrets: inherit +``` + +The reusable workflow (in `platform-github-management`) handles: +- Go build, test, lint +- Security scan (trivy) +- Docker build + push (on main) +- Deploy to sandbox (on main, if configured) + +--- + +# Shared Go Code: kiven-go-sdk + +Shared code that multiple services import: + +``` +kivenio/kiven-go-sdk/ +├── provider/ +│ ├── provider.go ← Provider interface (30+ methods) +│ └── registry.go ← Provider registry +├── grpcapi/ +│ └── (generated from contracts-proto) +├── models/ +│ ├── service.go ← Service, Plan, Backup types +│ ├── cluster.go ← ClusterSpec, ClusterStatus +│ └── user.go ← DatabaseUser, UserSpec +├── config/ +│ └── config.go ← Shared config loading (env 
vars) +├── telemetry/ +│ └── otel.go ← OpenTelemetry setup +├── go.mod ← github.com/kivenio/kiven-go-sdk +└── go.sum +``` + +Each service imports it: +```go +// svc-api/go.mod +module github.com/kivenio/svc-api + +require github.com/kivenio/kiven-go-sdk v0.1.0 +``` + +--- + +# Tilt Configuration (kiven-dev) + +Tilt watches all service repos and orchestrates local development: + +```python +# kiven-dev/Tiltfile + +# --- Shared infrastructure (already running via Docker Compose) --- +# PostgreSQL :5432, Redpanda :19092, Valkey :6379, MinIO :9000 + +# --- Go services --- +local_resource('svc-api', + serve_cmd='cd ../svc-api && go run ./cmd/', + deps=['../svc-api/cmd/', '../svc-api/internal/'], + env={ + 'PORT': '8080', + 'DATABASE_URL': 'postgres://kiven:kiven-local-dev@localhost:5432/kiven?sslmode=disable', + 'KAFKA_BROKERS': 'localhost:19092', + 'VALKEY_ADDR': 'localhost:6379', + }, + labels=['backend'], +) + +local_resource('svc-agent-relay', + serve_cmd='cd ../svc-agent-relay && go run ./cmd/', + deps=['../svc-agent-relay/cmd/', '../svc-agent-relay/internal/'], + env={'GRPC_PORT': '9090'}, + labels=['backend'], +) + +local_resource('svc-provisioner', + serve_cmd='cd ../svc-provisioner && go run ./cmd/', + deps=['../svc-provisioner/cmd/', '../svc-provisioner/internal/'], + env={ + 'PORT': '8082', + 'DATABASE_URL': 'postgres://kiven:kiven-local-dev@localhost:5432/kiven?sslmode=disable', + }, + labels=['backend'], +) + +local_resource('svc-clusters', + serve_cmd='cd ../svc-clusters && go run ./cmd/', + deps=['../svc-clusters/cmd/', '../svc-clusters/internal/'], + env={ + 'PORT': '8083', + 'DATABASE_URL': 'postgres://kiven:kiven-local-dev@localhost:5432/kiven?sslmode=disable', + }, + labels=['backend'], +) + +# --- Agent (runs against kind cluster) --- +local_resource('kiven-agent', + serve_cmd='cd ../kiven-agent && go run ./cmd/', + deps=['../kiven-agent/cmd/', '../kiven-agent/internal/'], + env={ + 'KUBECONFIG': os.environ.get('KUBECONFIG', 
os.path.expanduser('~/.kube/config')), + 'RELAY_ENDPOINT': 'localhost:9090', + }, + labels=['agent'], +) + +# --- Frontend --- +local_resource('dashboard', + serve_cmd='cd ../dashboard && npm run dev', + deps=['../dashboard/src/'], + labels=['frontend'], +) +``` + +Tilt UI at `http://localhost:10350` shows all services, logs, status, restart buttons. + +--- + +# Level 2: Sandbox (AWS) + +## When to Use + +Move to Level 2 when: +- Agent and core services work locally +- You need to test svc-infra (real AWS APIs) +- You need to test full provisioning pipeline (node groups → CNPG) +- You're preparing for first customer demo + +## Architecture + +``` +AWS Account: kiven-sandbox (eu-west-1) +│ +├── EKS "kiven-dev" +│ ├── Kiven services (deployed via ArgoCD) +│ ├── Aiven VPC peering (product DB) +│ └── Platform stack (Prometheus, Loki, ArgoCD) +│ +├── EKS "test-client" +│ ├── Simulates a real customer cluster +│ ├── Kiven agent installed +│ ├── CNPG operator (installed by Kiven) +│ └── Full provisioning: +│ ├── Dedicated node group (created by svc-infra) +│ ├── CNPG cluster (created by agent) +│ ├── S3 backups (created by svc-infra) +│ └── Network policies, storage classes, IRSA +│ +├── S3: kiven-backups-test-client +├── IAM: KivenAccessRole (simulates customer role) +└── IAM: IRSA roles for CNPG +``` + +## Cost Optimization + +| Resource | Cost | Optimization | +|----------|------|-------------| +| EKS control plane x 2 | ~$146/mo | Can't avoid | +| EC2 nodes (kiven-dev, 2x t3.medium) | ~$60/mo | Power off nights/weekends | +| EC2 nodes (test-client, 2x t3.medium) | ~$60/mo | Power off when not testing | +| EBS volumes | ~$20/mo | Delete test data regularly | +| S3 | ~$5/mo | Lifecycle rules | +| **Total** | **~$300/mo** | **~$150/mo with power schedules** | + +--- + +# Level 3: Staging & Production + +Only needed when the product is ready for real customers. 
+ +| Environment | Account | EKS | Aiven | Purpose | +|-------------|---------|-----|-------|---------| +| Staging | kiven-staging | eks-staging | Staging plan | Pre-production validation | +| Production | kiven-prod | eks-prod | Business plan | Live product | + +--- + +# Development Workflow + +## Daily Workflow + +```bash +cd kiven-dev + +# Morning: start everything +task infra:up # Docker Compose +task kind:create # kind + CNPG (idempotent, skips if exists) +tilt up # All services + agent + dashboard + +# Code in any svc-* repo → Tilt auto-reloads +# Dashboard at http://localhost:3000 +# Tilt UI at http://localhost:10350 + +# End of day +tilt down +task infra:down +``` + +## Adding a New Service + +```bash +# 1. Create repo from template +gh repo create kivenio/svc-my-service --template kivenio/platform-templates-service-go --private + +# 2. Clone next to kiven-dev +cd .. && git clone git@github.com:kivenio/svc-my-service.git + +# 3. Initialize +cd svc-my-service && task init + +# 4. Add to Tiltfile in kiven-dev +# (add local_resource block) + +# 5. Develop → test → PR → merge → CI runs automatically +``` + +## Adding a New Feature to Existing Service + +```bash +cd svc-api # Go to service repo +task init # Ensure tools are up to date (mise) +# ... code ... +task test # Run tests +task lint # Run linter +# Tilt auto-reloads if running +git add . && git commit && git push +# CI runs via reusable workflow +``` + +--- + +*Maintained by: Platform Team* +*Last updated: February 2026* diff --git a/development/TEMPLATE-USAGE-GUIDE.md b/development/TEMPLATE-USAGE-GUIDE.md new file mode 100644 index 0000000..47025a8 --- /dev/null +++ b/development/TEMPLATE-USAGE-GUIDE.md @@ -0,0 +1,488 @@ +# Template & Workflow Architecture + +## Overview + +Kiven uses a **three-layer developer platform** to ensure every repo starts with the right tooling, CI/CD, and conventions — without the developer thinking about any of it. 
+ +``` +┌─────────────────────────────────────────────────────────────────┐ +│ Layer 1: COPIER TEMPLATES │ +│ platform-templates-service-go, platform-templates-sdk-go, ... │ +│ ─ Scaffold a new repo with all files, config, CI/CD │ +│ ─ copier copy gh:kivenio/platform-templates-service-go ./repo │ +│ ─ copier update (evolve existing repos when template changes) │ +└───────────────────────┬─────────────────────────────────────────┘ + │ references +┌───────────────────────▼─────────────────────────────────────────┐ +│ Layer 2: PLATFORM-GITHUB-MANAGEMENT │ +│ Declarative YAML → GitHub repos, settings, rulesets, labels │ +│ ─ repos/backend/core-services.yaml defines all svc-* │ +│ ─ template: service-go links to the Copier template │ +│ ─ config/enforced.yaml locks security settings │ +│ ─ sync-repos.py applies changes on PR merge │ +└───────────────────────┬─────────────────────────────────────────┘ + │ includes +┌───────────────────────▼─────────────────────────────────────────┐ +│ Layer 3: REUSABLE WORKFLOWS + COMPOSITE ACTIONS │ +│ Repo: reusable-workflows +│ reusable-workflows/.github/workflows/ci-go-reusable.yml │ +│ reusable-workflows/.github/actions/setup-go/action.yml │ +│ ─ Called by each repo's CI workflow (generated from template) │ +│ ─ One source of truth for all CI/CD logic │ +└─────────────────────────────────────────────────────────────────┘ +``` + +## How It All Connects + +### Creating a new repo + +``` +Developer platform-github-management Copier template + │ │ │ + │ 1. Add YAML entry │ │ + │ in repos/backend/*.yaml │ │ + │ with template: service-go │ │ + │─────────────────────────────►│ │ + │ │ │ + │ 2. PR merged → sync-repos │ │ + │ creates GitHub repo │ │ + │ with settings, labels, │ │ + │ rulesets, topics, teams │ │ + │ │ │ + │ 3. Developer clones repo │ │ + │ and runs copier copy │ │ + │◄─────────────────────────────│ │ + │ │ │ + │ 4. copier copy │ │ + │ gh:kivenio/platform- │ answers copier.yml questions │ + │ templates-service-go . 
│──────────────────────────────►│
+    │                              │                               │
+    │ 5. Template generates        │  .editorconfig, .golangci.yml │
+    │    ALL files:                │  .mise.toml, Taskfile.yml     │
+    │    tooling, CI, Dockerfile,  │  .pre-commit-config.yaml      │
+    │    vscode, go.mod, cmd/...   │  .github/workflows/ci.yml     │
+    │◄─────────────────────────────│◄──────────────────────────────│
+    │                              │                               │
+    │ 6. task init → ready         │                               │
+```
+
+### Updating existing repos when template evolves
+
+```bash
+cd svc-api
+copier update
+# Copier shows diff, applies new changes, respects .copier-answers.yml
+```
+
+This is the **key advantage** of Copier over `sed`-based scaffolding: when you add a new
+linter rule, a new shared GitHub Action, or update Go version across all services, you
+update the template once and each repo pulls the update via `copier update`. In practice
+this is triggered through a reusable GitHub workflow: each repo generated from the template
+runs `copier update` automatically (see "Automated Template Sync (CI)" below).
+
+## Template Inventory
+
+| Template | Repo | For | Key Features |
+|----------|------|-----|--------------|
+| `service-go` | `platform-templates-service-go` | Go microservices (`svc-*`) | chi, gRPC, OTel, Dockerfile, air, Testcontainers |
+| `sdk-go` | `platform-templates-sdk-go` | Go libraries (`kiven-go-sdk`, `provider-*`, `kiven-cli`) | No Dockerfile, no cmd/, library-focused |
+| `infrastructure` | `platform-templates-infrastructure` | Terraform modules (`bootstrap`, `infra-customer-*`) | tflint, Checkov, terraform-docs |
+| `platform-component` | `platform-templates-platform-component` | GitOps components (`platform-gitops`, `platform-security`) | Helm/Kustomize, ArgoCD integration |
+| `documentation` | `platform-templates-documentation` | Doc sites (`docs`) | MkDocs Material, ADR template |
+
+## Template Structure (Copier)
+
+Each `platform-templates-*` repo follows the Copier convention:
+
+```
+platform-templates-service-go/
+├── copier.yml                    # Questions + config
+├── .copier-answers.yml.jinja     # Records answers in generated project
+├── 
{{project_name}}/ # (not used — we generate at root) +│ +├── .editorconfig # Static — copied as-is +├── .gitignore # Static +├── .golangci.yml # Static +├── .pre-commit-config.yaml # Static +├── Dockerfile # Static +│ +├── .mise.toml.jinja # Templated — injects service name +├── Taskfile.yml.jinja # Templated — injects service name, port +├── go.mod.jinja # Templated — injects module path +├── README.md.jinja # Templated — injects name, description +│ +├── .vscode/ +│ ├── settings.json # Static +│ ├── extensions.json # Static +│ └── launch.json.jinja # Templated — injects port, env vars +│ +├── renovate.json # Static — Renovate dep updates (auto-merge, grouping) +├── .github/ +│ ├── CODEOWNERS.jinja # Templated — injects team +│ ├── pull_request_template.md # Static — shared PR checklist +│ ├── ISSUE_TEMPLATE/ +│ │ ├── bug_report.yml # Static +│ │ └── feature_request.yml # Static +│ └── workflows/ +│ ├── ci.yml.jinja # Templated — injects service name +│ ├── release.yml.jinja # Templated — injects service name +│ └── copier-update.yml # Static — weekly template sync check +│ +└── cmd/ + └── main.go.jinja # Templated — basic service scaffold +``` + +### `copier.yml` — The Questionnaire + +```yaml +# copier.yml — Kiven Go Service Template +_min_copier_version: "9.0.0" +_subdirectory: "" +_answers_file: .copier-answers.yml + +project_name: + type: str + help: "Service name (e.g., svc-api, svc-auth)" + validator: "{% if not project_name | regex_search('^[a-z][a-z0-9-]+$') %}Must be lowercase with hyphens{% endif %}" + +project_description: + type: str + help: "One-line description" + +owner_team: + type: str + default: "@kivenio/backend" + help: "GitHub team that owns this repo" + +port: + type: int + default: 8080 + help: "HTTP port" + +grpc_port: + type: int + default: 0 + help: "gRPC port (0 = no gRPC)" + +enable_kafka: + type: bool + default: false + help: "Does this service consume/produce Kafka events?" 
+ +go_version: + type: str + default: "1.23" + help: "Go version" +``` + +### How variables connect to `platform-github-management` + +In `repos/backend/core-services.yaml`, each repo definition contains a `template_variables` +field with ALL the Copier variables needed for that repo: + +```yaml +- name: svc-api # → auto-mapped to project_name + description: "API Gateway" # → auto-mapped to project_description + type: service + template: service-go # → which Copier template to use + ruleset: strict + topics: [core, api, graphql] + template_variables: # → ALL Copier variables for this repo + port: 8080 + grpc_port: 0 + enable_kafka: false +``` + +**Variable resolution order:** + +1. **Auto-mapped** (always set, derived from repo definition fields): + - `name` → `project_name` + - `description` → `project_description` + - Owner team from YAML file path (`repos/backend/` → `@kivenio/backend`) → `owner_team` + +2. **Explicit** (`template_variables` dict — overrides auto-mapped if same key): + - `port`, `grpc_port`, `enable_kafka`, or any Copier variable from `copier.yml` + +3. **Template defaults** (from `copier.yml` — used for any variable not specified above): + - `go_version: "1.23"`, etc. + +When `sync-repos.py` creates a repo, it passes all variables to `copier copy` via `-d key=value`. +Copier generates `.copier-answers.yml` in the new repo, recording every variable used. +This file is essential for future `copier update` runs — it tells Copier which template +was used and with which answers. 
+ +```yaml +# .copier-answers.yml (auto-generated by Copier in the new repo) +_src_path: gh:kivenio/platform-templates-service-go +_commit: abc1234 +project_name: svc-api +project_description: "API Gateway — REST + GraphQL, request routing" +owner_team: "@kivenio/backend" +port: 8080 +grpc_port: 0 +enable_kafka: false +go_version: "1.23" +``` + +## Reusable Workflows vs Composite Actions + +### Reusable Workflows (job-level reuse) + +A reusable workflow replaces an **entire job** (or set of jobs). Each service repo calls them +in its `.github/workflows/ci.yml`: + +```yaml +# In svc-api/.github/workflows/ci.yml +jobs: + ci: + uses: kivenio/reusable-workflows/.github/workflows/ci-go-reusable.yml@main + with: + service-name: "svc-api" + go-version: "1.23" + secrets: inherit +``` + +| Workflow | File | Purpose | +|----------|------|---------| +| `ci-go-reusable.yml` | `reusable-workflows/.github/workflows/` | Lint + Test + Build + Security + Docker | +| `ci-frontend-reusable.yml` | `reusable-workflows/.github/workflows/` | Lint + TypeCheck + Test + Build + Audit | +| `ci-terraform-reusable.yml` | `reusable-workflows/.github/workflows/` | Format + Validate + tflint + Trivy + Checkov | +| `docker-build-reusable.yml` | `reusable-workflows/.github/workflows/` | Buildx + GHCR push + semver tags | +| `release-reusable.yml` | `reusable-workflows/.github/workflows/` | Conventional commits → semver → changelog → GitHub Release | +| `copier-update-reusable.yml` | `reusable-workflows/.github/workflows/` | Detect template drift → auto-PR with updates | +| `ci-copier-template-reusable.yml` | `reusable-workflows/.github/workflows/` | Validate Copier templates (syntax, dry-run, build test) | + +**When to use:** When you want to share a complete CI/CD pipeline across repos. + +### Composite Actions (step-level reuse) + +A composite action replaces **individual steps** within a job. Use them when multiple +workflows share the same setup/teardown logic but differ in the middle. 
+ +```yaml +# In any workflow +steps: + - uses: kivenio/reusable-workflows/.github/actions/setup-go@main + with: + go-version: "1.23" + # setup-go handles: checkout + install Go + cache + download deps +``` + +| Action | File | Purpose | +|--------|------|---------| +| `setup-go` | `reusable-workflows/.github/actions/setup-go/action.yml` | Checkout + Go install + cache + `go mod download` | +| `setup-node` | `reusable-workflows/.github/actions/setup-node/action.yml` | Checkout + Node install + cache + `npm ci` | +| `security-scan` | `reusable-workflows/.github/actions/security-scan/action.yml` | Trivy fs + Gitleaks in one step | +| `docker-metadata` | `reusable-workflows/.github/actions/docker-metadata/action.yml` | Generate tags (sha, branch, semver) | +| `copier-update` | `reusable-workflows/.github/actions/copier-update/action.yml` | Run `copier update`, detect drift, create PR | + +**When to use:** When reusable workflows are too rigid. For example, a workflow that needs +custom steps between setup and test can use composite actions for the common parts. 
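As an illustration of the step-level pattern, a minimal `setup-go` composite action might look like this (a sketch under assumptions; the real `action.yml` in `reusable-workflows` may differ):

```yaml
# reusable-workflows/.github/actions/setup-go/action.yml (illustrative sketch)
name: "Setup Go"
description: "Checkout + Go install + module cache + go mod download"
inputs:
  go-version:
    description: "Go version to install"
    required: false
    default: "1.23"
runs:
  using: "composite"
  steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-go@v5
      with:
        go-version: ${{ inputs.go-version }}
        cache: true
    # Composite actions must declare an explicit shell on every run step
    - run: go mod download
      shell: bash
```

The explicit `shell:` on each `run` step is the main syntactic difference from regular workflow steps, and is required by GitHub Actions for composite actions.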
+ +### How They Compose + +``` +Service repo CI workflow +│ +├── uses: kivenio/reusable-workflows/.github/workflows/ci-go-reusable.yml +│ │ +│ ├── Job: lint +│ │ └── uses: kivenio/reusable-workflows/.github/actions/setup-go +│ │ └── golangci-lint +│ │ +│ ├── Job: test +│ │ └── uses: kivenio/reusable-workflows/.github/actions/setup-go +│ │ └── go test +│ │ +│ ├── Job: security +│ │ └── uses: kivenio/reusable-workflows/.github/actions/security-scan +│ │ +│ └── Job: docker +│ └── uses: kivenio/reusable-workflows/.github/actions/docker-metadata +│ └── docker build + push +``` + +## What's Mutualized in Templates + +Every repo created from a Copier template automatically gets: + +### Tooling (developer experience) +- `.editorconfig` — Consistent formatting (tabs for Go, spaces for YAML) +- `.vscode/settings.json` — Formatter, linter, language settings +- `.vscode/extensions.json` — Recommended VS Code extensions +- `.vscode/launch.json` — Debug configurations +- `.mise.toml` — Tool versions (Go, Node, Terraform, etc.) 
+- `.pre-commit-config.yaml` — Git hooks (format, lint, secrets detection)
+
+### Quality (code standards)
+- `.golangci.yml` — 18 linters with opinionated config (Go repos)
+- `.prettierrc` — Code formatting (frontend repos)
+- `Taskfile.yml` — Standard tasks (init, test, lint, build, clean)
+
+### CI/CD (automation)
+- `.github/workflows/ci.yml` — Calls the reusable workflow from `reusable-workflows`
+- `.github/workflows/release.yml` — Release automation
+- `renovate.json` — Renovate dependency updates (grouping, auto-merge patches/minors)
+- `Dockerfile` — Multi-stage, non-root, healthcheck (service repos)
+
+### Collaboration (team workflow)
+- `.github/CODEOWNERS` — Review routing (auto-assigned from `owner_team`)
+- `.github/pull_request_template.md` — PR checklist (tests, docs, breaking changes)
+- `.github/ISSUE_TEMPLATE/bug_report.yml` — Structured bug reports
+- `.github/ISSUE_TEMPLATE/feature_request.yml` — Feature request form
+
+### Project (bootstrapping)
+- `go.mod` — Module initialized with `github.com/kivenio/`
+- `cmd/main.go` — Minimal service entrypoint
+- `README.md` — Generated with name, description, badges
+- `.copier-answers.yml` — Records template answers for future updates
+
+## Applying Templates to Existing Repos
+
+For repos that were created before Copier was set up:
+
+```bash
+cd ../svc-api
+copier copy gh:kivenio/platform-templates-service-go . --overwrite
+# Copier asks questions, generates files, respects existing code
+```
+
+For selective file application (tooling only, no code scaffolding), use
+`copier copy --exclude 'cmd/**'` to skip code directories.
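Whether a template is applied manually (as above) or by `sync-repos.py`, the short template names resolve through a registry in `platform-github-management`. The shape below is a hypothetical sketch; the real `config/templates.yaml` schema may differ:

```yaml
# platform-github-management/config/templates.yaml (illustrative sketch)
templates:
  service-go:
    source: gh:kivenio/platform-templates-service-go
  sdk-go:
    source: gh:kivenio/platform-templates-sdk-go
  infrastructure:
    source: gh:kivenio/platform-templates-infrastructure
  platform-component:
    source: gh:kivenio/platform-templates-platform-component
  documentation:
    source: gh:kivenio/platform-templates-documentation
```

`sync-repos.py` looks up a repo's `template:` field here to find the Copier source to run.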
+ +## Template Repo CI + +Every `platform-templates-*` repo has its own CI (NOT a `.jinja` file — it runs on the template +repo itself) that validates the Copier template works correctly: + +```yaml +# platform-templates-service-go/.github/workflows/ci.yml +jobs: + ci: + uses: kivenio/reusable-workflows/.github/workflows/ci-copier-template-reusable.yml@main + with: + template-type: "go" +``` + +**What it validates:** +1. `copier.yml` exists and is valid YAML +2. Jinja2 template files (`.jinja`) exist +3. Dry-run `copier copy --defaults` generates all critical files (editorconfig, gitignore, CI, CODEOWNERS, etc.) +4. No unresolved Jinja2 variables in generated output +5. Generated Go project compiles (`go build`, `go vet`) +6. Security scan (Trivy + Gitleaks) on the template itself + +This means every PR to a template repo is validated end-to-end before merge. + +## Automatic Scaffolding from Templates + +When a new repo is defined in `platform-github-management` with a `template` field: + +```yaml +- name: svc-foo + description: "New service" + template: service-go # ← this triggers Copier scaffolding +``` + +The `sync-repos.py` script automatically: +1. Creates the GitHub repo (settings, labels, rulesets, teams) +2. Resolves the template source from `config/templates.yaml` (`service-go` → `gh:kivenio/platform-templates-service-go`) +3. Clones the new repo +4. Runs `copier copy --trust --defaults` with variables extracted from the YAML: + - `name` → `project_name` + - `description` → `project_description` + - Owner team from the YAML file path (`repos/backend/` → `@kivenio/backend`) +5. Commits and pushes the scaffolded files + +The developer gets a fully configured, ready-to-code repo without touching `copier` manually. + +## Adding a New Template + +1. Create the repo in `platform-github-management/repos/platform/templates.yaml` +2. Create the Copier template repo with `copier.yml` + files +3. 
Add a `ci.yml` that calls `ci-copier-template-reusable.yml` (validates the template) +4. Register it in `platform-github-management/config/templates.yaml` +5. Document it in this guide + +## Evolving Templates + +When you change a template (e.g., update Go version, add a linter): + +1. Update the `platform-templates-*` repo +2. Tag a new version (e.g., `v1.2.0`) +3. Each service repo pulls the update **automatically via CI**: + +### Automated Template Sync (CI) + +Every repo created from a Copier template includes a `copier-update.yml` workflow: + +```yaml +# .github/workflows/copier-update.yml (auto-generated from template) +on: + schedule: + - cron: "0 7 * * 1" # Every Monday 7am + workflow_dispatch: # Can also trigger manually + +jobs: + copier-update: + uses: kivenio/reusable-workflows/.github/workflows/copier-update-reusable.yml@main + secrets: inherit +``` + +**How it works:** + +``` +Template repo changes Downstream repo (svc-api) + │ │ + │ 1. Developer updates │ + │ platform-templates- │ + │ service-go (new linter, │ + │ Go version bump, etc.) │ + │ │ + │ │ 2. Monday 7am: copier-update + │ │ workflow runs automatically + │ │ + │ │ 3. Composite action: + │ │ - Installs copier + │ │ - Reads .copier-answers.yml + │ │ - Runs copier update --trust + │ │ - Detects git diff + │ │ + │ │ 4. If drift detected: + │ │ - Creates branch chore/copier-update + │ │ - Opens PR with changes + │ │ - Labels: dependencies + │ │ + │ │ 5. Developer reviews PR + │ │ - Resolves conflicts if any + │ │ - Merges when ready +``` + +### Manual Update + +You can also update manually at any time: + +```bash +cd svc-api +copier update --trust +# Shows diff of changes, applies non-conflicting updates +# Conflicts are shown for manual resolution +``` + +### Triggering Update Across All Repos + +After a significant template change, trigger all downstream repos at once via +`workflow_dispatch` on each repo's `copier-update.yml`. 
This can be scripted: + +```bash +repos=(svc-api svc-auth svc-provisioner svc-clusters svc-backups) +for repo in "${repos[@]}"; do + gh workflow run copier-update.yml --repo kivenio/$repo +done +``` + +This is how the entire fleet stays consistent without manual copy-paste. + +## YAML Config Validator (YCC) + +For detailed documentation on the planned YAML validation tool that validates +`platform-github-management` structures and Copier template variables, see +[YAML-CONFIG-VALIDATOR.md](../platform/YAML-CONFIG-VALIDATOR.md). diff --git a/infra/CUSTOMER-INFRA-MANAGEMENT.md b/infra/CUSTOMER-INFRA-MANAGEMENT.md new file mode 100644 index 0000000..7b6b930 --- /dev/null +++ b/infra/CUSTOMER-INFRA-MANAGEMENT.md @@ -0,0 +1,270 @@ +# Customer Infrastructure Management +## *How Kiven Manages AWS Resources in Customer Accounts* + +--- + +> **Back to**: [Architecture Overview](../EntrepriseArchitecture.md) + +--- + +# Overview + +Kiven manages **four types of AWS resources** in the customer's account: + +1. **EKS Node Groups** — Dedicated compute for databases +2. **EBS Volumes** — Persistent storage for PostgreSQL data +3. **S3 Buckets** — Backup storage (Barman + WAL archiving) +4. **IAM Roles** — IRSA for CNPG to access S3 + +All managed via `svc-infra`, which assumes the customer's `KivenAccessRole` IAM role. 
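All four resource types follow fixed naming patterns (`kiven-db-{cluster-id}`, `kiven-backups-{customer-id}`, `kiven-cnpg-backup-{cluster-id}`), detailed in the sections below. A minimal Go sketch of these conventions (the helper names are hypothetical, not svc-infra's actual API):

```go
// Illustrative sketch of Kiven's naming conventions for customer-account
// resources. Function names are hypothetical; the patterns come from the
// resource tables in this document.
package main

import "fmt"

// NodeGroupName returns the EKS node group name for a database cluster.
func NodeGroupName(clusterID string) string {
	return "kiven-db-" + clusterID
}

// BackupBucketName returns the per-customer S3 backup bucket name.
func BackupBucketName(customerID string) string {
	return "kiven-backups-" + customerID
}

// BackupRoleName returns the IRSA role name CNPG uses for S3 backups.
func BackupRoleName(clusterID string) string {
	return "kiven-cnpg-backup-" + clusterID
}

func main() {
	fmt.Println(NodeGroupName("c-42"))    // kiven-db-c-42
	fmt.Println(BackupBucketName("acme")) // kiven-backups-acme
	fmt.Println(BackupRoleName("c-42"))   // kiven-cnpg-backup-c-42
}
```

These prefixes also scope the IAM policy below (`kiven-backups-*` buckets, `kiven-*` role names), so deriving names in one place keeps naming and permissions from drifting apart.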
+ +--- + +# Access Model + +## Cross-Account IAM + +``` +┌─── Kiven AWS Account ──────────┐ ┌─── Customer AWS Account ────────────┐ +│ │ │ │ +│ svc-infra │ │ IAM Role: KivenAccessRole │ +│ ├── IRSA: svc-infra-role │────▶│ ├── Trust: Kiven account ID │ +│ └── AssumeRole call │ │ ├── Policy: KivenAccessPolicy │ +│ │ │ └── ExternalId: unique per customer │ +│ │ │ │ +└─────────────────────────────────┘ └──────────────────────────────────────┘ +``` + +## IAM Policy (KivenAccessPolicy) + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "EKSAccess", + "Effect": "Allow", + "Action": [ + "eks:DescribeCluster", + "eks:ListNodegroups", + "eks:DescribeNodegroup", + "eks:CreateNodegroup", + "eks:UpdateNodegroupConfig", + "eks:DeleteNodegroup" + ], + "Resource": "arn:aws:eks:*:*:cluster/*" + }, + { + "Sid": "EC2ForNodeGroups", + "Effect": "Allow", + "Action": [ + "ec2:DescribeInstances", + "ec2:DescribeVolumes", + "ec2:DescribeSubnets", + "ec2:DescribeSecurityGroups", + "ec2:CreateLaunchTemplate", + "ec2:DeleteLaunchTemplate", + "ec2:RunInstances", + "ec2:TerminateInstances" + ], + "Resource": "*", + "Condition": { + "StringEquals": { + "aws:RequestTag/managed-by": "kiven" + } + } + }, + { + "Sid": "S3BackupBucket", + "Effect": "Allow", + "Action": [ + "s3:CreateBucket", + "s3:PutBucketEncryption", + "s3:PutBucketLifecycleConfiguration", + "s3:PutBucketVersioning", + "s3:PutBucketPolicy", + "s3:GetBucketLocation", + "s3:ListBucket" + ], + "Resource": "arn:aws:s3:::kiven-backups-*" + }, + { + "Sid": "IAMForIRSA", + "Effect": "Allow", + "Action": [ + "iam:CreateRole", + "iam:DeleteRole", + "iam:AttachRolePolicy", + "iam:DetachRolePolicy", + "iam:PutRolePolicy", + "iam:DeleteRolePolicy", + "iam:GetRole", + "iam:TagRole" + ], + "Resource": "arn:aws:iam::*:role/kiven-*" + }, + { + "Sid": "KMSForEncryption", + "Effect": "Allow", + "Action": [ + "kms:DescribeKey", + "kms:CreateGrant", + "kms:Encrypt", + "kms:Decrypt", + "kms:GenerateDataKey" + ], + "Resource": "*" + 
} + ] +} +``` + +**Key security constraints:** +- EC2 actions limited to resources tagged `managed-by: kiven` +- S3 actions limited to `kiven-backups-*` bucket prefix +- IAM actions limited to `kiven-*` role prefix +- ExternalId required to prevent confused deputy attacks + +--- + +# Resource Management + +## 1. Node Groups + +### Create (on provisioning) + +| Parameter | Value | Rationale | +|-----------|-------|-----------| +| Name | `kiven-db-{cluster-id}` | Unique per database cluster | +| Instance type | Per service plan (t3.small → r6g.xlarge) | Memory-optimized for PG | +| Desired/min/max | Per plan (1-3 nodes) | HA: primary + replicas | +| Subnets | Multi-AZ (customer's private subnets) | Spread across AZs | +| Taints | `kiven.io/role=database:NoSchedule` | Only DB pods run here | +| Labels | `kiven.io/managed=true`, `kiven.io/cluster-id={id}` | Identification | +| Tags | `managed-by: kiven`, `kiven-cluster-id: {id}` | AWS-level tracking | +| AMI | EKS-optimized AL2023 | Standard, secure | + +### Scale to Zero (Power Off) + +``` +svc-infra → UpdateNodegroupConfig: + scalingConfig: + minSize: 0 + desiredSize: 0 + maxSize: 0 (or original max) +``` + +Nodes are terminated. EBS volumes detach but are RETAINED (PVC reclaim policy = Retain). + +### Scale Up (Power On) + +``` +svc-infra → UpdateNodegroupConfig: + scalingConfig: + minSize: {plan.instances} + desiredSize: {plan.instances} + maxSize: {plan.instances * 2} +``` + +New nodes join. CNPG pods re-created, PVCs reattached to new nodes. + +### Delete (on cluster deletion) + +Node group fully deleted only when customer deletes the database **and** confirms data destruction. + +## 2. EBS Volumes + +Managed indirectly via Kubernetes StorageClass + PVCs. 
Kiven creates the StorageClass:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: kiven-db-gp3
parameters:
  type: gp3
  iops: "3000"        # adjusted per plan
  throughput: "125"   # adjusted per plan
  encrypted: "true"
  kmsKeyId: <customer-kms-key-arn>
provisioner: ebs.csi.aws.com
reclaimPolicy: Retain             # CRITICAL: keep volumes on PVC delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```

**Monitoring:**
- Disk usage alerts at 70%, 80%, 90%
- IOPS utilization alerts
- Auto-resize recommendation via DBA intelligence

## 3. S3 Buckets

One bucket per customer (shared across their database clusters):

| Config | Value | Rationale |
|--------|-------|-----------|
| Name | `kiven-backups-{customer-id}` | Unique per customer |
| Region | Same as customer's EKS | Data locality |
| Encryption | SSE-KMS (customer's key) | Customer controls encryption |
| Versioning | Enabled | Protect against accidental delete |
| Lifecycle | Transition to IA after 30d, Glacier after 90d, delete after 365d | Cost optimization |
| Bucket policy | Only IRSA role can access | Least privilege |

## 4. IAM Roles (IRSA)

CNPG needs S3 access for backups. 
Kiven creates an IRSA role: + +| Parameter | Value | +|-----------|-------| +| Role name | `kiven-cnpg-backup-{cluster-id}` | +| Trust | EKS OIDC provider + kiven-databases ServiceAccount | +| Policy | s3:PutObject, s3:GetObject, s3:DeleteObject on backup bucket | + +--- + +# Tagging Strategy + +All Kiven-managed resources are tagged: + +| Tag Key | Value | Purpose | +|---------|-------|---------| +| `managed-by` | `kiven` | Identify Kiven resources | +| `kiven-cluster-id` | `{cluster-id}` | Link to specific database | +| `kiven-customer-id` | `{customer-id}` | Link to customer org | +| `kiven-plan` | `hobbyist/startup/business/premium/custom` | Service plan | +| `kiven-environment` | `production/staging/development` | Environment label | + +These tags enable: +- Cost tracking per database cluster +- Resource cleanup on cluster deletion +- IAM policy conditions (only manage tagged resources) + +--- + +# Audit & Compliance + +Every AWS API call made by `svc-infra` is: +1. **Logged in CloudTrail** (customer's account) — they can see exactly what Kiven does +2. **Logged in Kiven audit** (`svc-audit`) — immutable record on our side +3. 
**Attributed** — which Kiven user/service triggered the action + +--- + +# Cost Tracking + +Kiven tracks estimated costs per cluster by monitoring: + +| Resource | Cost Calculation | +|----------|-----------------| +| **EC2 nodes** | Instance type × hours running (from power on/off events) | +| **EBS storage** | Volume size × hours provisioned + IOPS cost | +| **S3 storage** | Bucket size × storage class pricing | +| **Data transfer** | Estimated from backup size + WAL volume | + +Displayed in customer dashboard: "Estimated AWS cost: $X/month for this cluster" + +--- + +*Maintained by: Platform Team* +*Last updated: February 2026* diff --git a/observability/OBSERVABILITY-GUIDE.md b/observability/OBSERVABILITY-GUIDE.md index fdea3dc..511ef4f 100644 --- a/observability/OBSERVABILITY-GUIDE.md +++ b/observability/OBSERVABILITY-GUIDE.md @@ -1,13 +1,13 @@ -# 📊 **Observability Guide** -## *LOCAL-PLUS Monitoring, Logging, Tracing & APM* +# Observability Guide +## Kiven Platform -- Monitoring, Logging, Tracing & APM --- -> **Retour vers** : [Architecture Overview](../EntrepriseArchitecture.md) +> **Back to**: [Architecture Overview](../EntrepriseArchitecture.md) | **See also**: [OTel Conventions](./OTEL-CONVENTIONS.md) --- -# 📋 **Table of Contents** +## Table of Contents 1. [Stack Overview](#stack-overview) 2. [Telemetry Pipeline](#telemetry-pipeline) @@ -20,11 +20,13 @@ 9. [Alerting Strategy](#alerting-strategy) 10. [Dashboards & Visualizations](#dashboards--visualizations) +> For OTel-specific conventions (span naming, attributes, Collector deployment, exporter helper, SDK usage), see [OTEL-CONVENTIONS.md](./OTEL-CONVENTIONS.md). 
+ --- -# 🏗️ **Stack Overview** +## Stack Overview -## Self-Hosted Stack (Coût Minimal) +### Self-Hosted Stack | Composant | Outil | Coût | Retention | |-----------|-------|------|-----------| @@ -51,27 +53,34 @@ Pour conserver les métriques au-delà de 15 jours : --- -# 🔄 **Telemetry Pipeline** +## Telemetry Pipeline ``` -┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ -│ Applications │ │ OTel Collector │ │ Backends │ -│ │ │ │ │ │ -│ • SDK Python │────►│ • Receivers │────►│ • Prometheus │ -│ • Auto-instr │ │ • Processors │ │ • Loki │ -│ │ │ • Exporters │ │ • Tempo │ -└─────────────────┘ └─────────────────┘ └─────────────────┘ - │ - │ Scrubbing - ▼ - ┌─────────────────┐ - │ GDPR Compliant │ - │ • No user_id │ - │ • No PII │ - │ • No PAN │ - └─────────────────┘ +┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐ ┌──────────────┐ +│ Kiven Services │ │ OTel Agent │ │ OTel Gateway │ │ Backends │ +│ │ │ (DaemonSet) │ │ (Deployment) │ │ │ +│ • Go svc-* │────►│ • Receive OTLP │────►│ • Batch │────►│ Prometheus │ +│ • kiven-agent │ │ • Forward │ │ • Tail sample │ │ Loki │ +│ • dashboard │ │ • No processing │ │ • Scrub PII │ │ Tempo │ +│ │ │ │ │ • Persistent Q │ │ Grafana │ +│ Instrumented │ │ Lightweight │ │ • Export │ │ │ +│ via kiven-go- │ │ ~50MB per node │ │ • 2-3 replicas │ │ │ +│ sdk/telemetry │ │ │ │ │ │ │ +└──────────────────┘ └──────────────────┘ └──────────────────┘ └──────────────┘ + │ + │ GDPR Scrubbing + ▼ + ┌──────────────────┐ + │ Removed: │ + │ • user_id │ + │ • user.email │ + │ • client_ip │ + │ • SQL params │ + └──────────────────┘ ``` +> See [OTEL-CONVENTIONS.md](./OTEL-CONVENTIONS.md) for Collector config details, exporter helper pattern, and persistent queue setup. + ## OTel Collector — Rôle | Composant | Rôle | Exemples | @@ -131,8 +140,8 @@ Le **Prometheus Operator** utilise des **Custom Resources** pour configurer auto **Flux :** -1. Le développeur déploie son service avec un label (ex: `app: svc-ledger`) -2. 
Un ServiceMonitor sélectionne ce label +1. Developer deploys service with a label (e.g., `app: svc-api`) +2. A ServiceMonitor selects this label 3. Prometheus Operator configure automatiquement Prometheus 4. Prometheus scrape `/metrics` sur le port spécifié @@ -141,15 +150,17 @@ Le **Prometheus Operator** utilise des **Custom Resources** pour configurer auto - Séparation des concerns — monitoring découplé du déploiement - Flexibilité — intervalles, relabeling, TLS, authentification -## Endpoints typiques +### Endpoints | Service | Port | Path | Description | |---------|------|------|-------------| -| **FastAPI (Python)** | 8080 | `/metrics` | Via `prometheus-fastapi-instrumentator` | -| **Go gRPC** | 9090 | `/metrics` | Via `promhttp` handler | -| **Grafana** | 3000 | `/metrics` | Métriques internes | -| **ArgoCD** | 8083 | `/metrics` | Métriques application | -| **Node Exporter** | 9100 | `/metrics` | Métriques système (CPU, RAM, disk) | +| **svc-api** | 8080 | `/metrics` | Via `promhttp` handler | +| **svc-agent-relay** | 9090 | `/metrics` | gRPC service metrics | +| **svc-provisioner** | 8082 | `/metrics` | Provisioning pipeline metrics | +| **kiven-agent** | 9090 | `/metrics` | Agent-side CNPG + PG metrics | +| **Grafana** | 3000 | `/metrics` | Internal metrics | +| **ArgoCD** | 8083 | `/metrics` | Application sync metrics | +| **Node Exporter** | 9100 | `/metrics` | System metrics (CPU, RAM, disk) | --- @@ -269,10 +280,12 @@ Le **Prometheus Operator** utilise des **Custom Resources** pour configurer auto | Service | SLI | SLO | Error Budget | Burn Rate Alert | |---------|-----|-----|--------------|-----------------| -| **svc-ledger** | Availability | 99.9% | 43 min/mois | 14.4x = 1h alert | -| **svc-ledger** | Latency P99 | < 200ms | N/A | P99 > 200ms for 5min | -| **svc-wallet** | Availability | 99.9% | 43 min/mois | 14.4x = 1h alert | -| **Platform** | Availability | 99.5% | 3.6h/mois | 6x = 2h alert | +| **svc-api** | Availability | 99.9% | 43 min/month | 
14.4x = 1h alert | +| **svc-api** | Latency P99 | < 200ms | N/A | P99 > 200ms for 5min | +| **svc-provisioner** | Availability | 99.9% | 43 min/month | 14.4x = 1h alert | +| **svc-agent-relay** | Availability | 99.95% | 22 min/month | 14.4x = 30min alert | +| **Customer PostgreSQL** | Availability | 99.99% | 4.3 min/month | 6x = 15min alert | +| **Platform (infra)** | Availability | 99.5% | 3.6h/month | 6x = 2h alert | ## SLO Formulas @@ -427,14 +440,16 @@ Exemple visuel — Histogram en Heatmap (latence) | **Replication Lag** | Gauge | `pg_replication_lag_seconds` | Stat avec threshold | | **Cache Hit Ratio** | Gauge | `pg_stat_database_blks_hit / (blks_hit + blks_read)` | Stat % | -### Dashboard 4 : Business Metrics (Product) +### Dashboard 4 : Kiven Business Metrics (Product) -| Panel | Type | Métrique | Visualisation | -|-------|------|----------|---------------| -| **Transactions Créées** | Counter | `sum(rate(ledger_transactions_total[1h]))` | Stat (big number) | -| **Montant Total Traité** | Counter | `sum(ledger_amount_processed_total)` | Stat avec unité € | -| **Wallets Actifs** | Gauge | `wallet_active_count` | Stat | -| **Erreurs Métier** | Counter | `sum by (error_type) (rate(business_errors_total[5m]))` | Bar chart | +| Panel | Type | Metric | Visualization | +|-------|------|--------|---------------| +| **Services Created** | Counter | `sum(rate(kiven_services_created_total[1h]))` | Stat (big number) | +| **Active Databases** | Gauge | `kiven_services_active_count` | Stat | +| **Provisioning Time P95** | Histogram | `histogram_quantile(0.95, rate(kiven_provisioning_duration_bucket[1h]))` | Time Series | +| **Connected Agents** | Gauge | `kiven_agent_connected` | Stat | +| **Backup Success Rate** | Counter | `rate(kiven_backup_success_total[1h]) / rate(kiven_backup_total[1h])` | Gauge % | +| **Business Errors** | Counter | `sum by (error_type) (rate(kiven_errors_total[5m]))` | Bar chart | --- @@ -453,5 +468,6 @@ Exemple visuel — Histogram en Heatmap 
(latence) --- -*Document maintenu par : Platform Team* -*Dernière mise à jour : Janvier 2026* +*Maintained by: @kivenio/platform* +*Last updated: February 2026* +*See also: [OTEL-CONVENTIONS.md](./OTEL-CONVENTIONS.md) for instrumentation details* diff --git a/observability/OTEL-CONVENTIONS.md b/observability/OTEL-CONVENTIONS.md new file mode 100644 index 0000000..3319bcf --- /dev/null +++ b/observability/OTEL-CONVENTIONS.md @@ -0,0 +1,346 @@ +# OpenTelemetry Conventions + +> **Back to**: [Observability Guide](./OBSERVABILITY-GUIDE.md) | [Architecture Overview](../EntrepriseArchitecture.md) + +## Architecture Decisions + +| Decision | Choice | Rationale | +|----------|--------|-----------| +| Collector deployment | Agent + Gateway (two-tier) | Resilient, scalable, centralized config for heavy lifting | +| SDK approach | `kiven-go-sdk/telemetry` wrapping OTel Go SDK | Conventions baked in, zero-config for services | +| Exporter pattern | Exporter helper with persistent queue | New OTel pattern: no separate batch processor, survives restarts | +| Propagation | W3C TraceContext + Baggage | Industry standard, cross-service compatible | +| Metrics backend | Prometheus (scrape) + OTel Collector (receive) | Prometheus for K8s ecosystem, OTel for application metrics | +| Traces backend | Tempo | Grafana-native, S3 storage, cost effective | +| Logs backend | Loki | Grafana-native, label-based, low cost | + +--- + +## Collector Deployment: Agent + Gateway Two-Tier + +``` +Services (pods) + │ + │ OTLP gRPC (localhost:4317) + ▼ +OTel Collector Agent (DaemonSet, 1 per node) + │ Lightweight: receive → forward + │ No processing, no sampling + │ + │ OTLP gRPC (cluster-internal) + ▼ +OTel Collector Gateway (Deployment, 2-3 replicas) + │ Heavy lifting: batch, filter, sample, scrub PII + │ Exporter helper with persistent queue + │ + ├──► Tempo (traces) + ├──► Prometheus remote write (metrics) + └──► Loki (logs) +``` + +### Why Two-Tier + +- **Agent (DaemonSet)**: minimal config, low 
memory (~50MB), just forwards. If a node dies, only that node's in-flight data is lost.
- **Gateway (Deployment)**: centralized processing, tail sampling decisions, PII scrubbing, persistent queue. Horizontally scalable.
- **Alternative considered**: Sidecar per pod -- rejected because 15+ services means 15+ Collector instances consuming memory. DaemonSet shares one Collector per node.

### Exporter Helper: Persistent Queue (New Pattern)

Since OTel Collector v0.110+, the recommended pattern is exporter-level batching with persistent storage. This replaces the old separate `batch` processor.

```yaml
# Gateway Collector config
exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
    sending_queue:
      enabled: true
      storage: file_storage/traces  # persistent: survives collector restart
      queue_size: 5000
      num_consumers: 10
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
    batcher:
      enabled: true
      min_size: 500
      max_size: 2000
      timeout: 5s

extensions:
  file_storage/traces:
    directory: /var/lib/otel/traces  # PVC-backed for persistence
    timeout: 10s
    compaction:
      on_start: true
      directory: /tmp/otel-compaction
```

**Why this matters**: If the Gateway restarts (upgrade, OOM, node drain), queued spans are not lost. They're persisted to disk and replayed on startup.

---

## Span Naming Convention

All spans follow this pattern:

```
kiven.<service>.<layer>.<operation>
+``` + +### Layers + +| Layer | Description | Examples | +|-------|-------------|----------| +| `handler` | HTTP/gRPC request handlers | `kiven.svc-api.handler.CreateService` | +| `repo` | Database repository methods | `kiven.svc-api.repo.InsertService` | +| `provider` | Provider interface calls | `kiven.provider-cnpg.provider.GenerateClusterYAML` | +| `infra` | AWS/cloud infrastructure calls | `kiven.svc-infra.infra.CreateNodeGroup` | +| `agent` | Agent-side operations | `kiven.agent.agent.ApplyYAML` | +| `grpc` | gRPC calls (auto-generated) | `/kiven.agent.v1.AgentRelay/Heartbeat` | + +### Auto-Generated Spans + +The `kiven-go-sdk/telemetry` package generates spans automatically: + +| Source | Span Name Format | Example | +|--------|-------------------|---------| +| HTTP middleware | `HTTP {method} {path}` | `HTTP GET /v1/services` | +| gRPC server interceptor | `/{package}.{Service}/{Method}` | `/kiven.agent.v1.AgentRelay/Heartbeat` | +| gRPC client interceptor | `/{package}.{Service}/{Method}` | `/kiven.agent.v1.AgentRelay/SendCommand` | +| Manual (Trace helper) | Developer-defined | `repo.GetService`, `aws.CreateNodeGroup` | + +### Manual Span Creation + +Use the helpers from `kiven-go-sdk/telemetry`: + +```go +// Simple span +ctx, span := telemetry.Trace(ctx, "repo.GetService") +defer span.End() + +// Span with automatic error handling +err := telemetry.TraceFunc(ctx, "svc.CreateService", func(ctx context.Context) error { + return repo.Insert(ctx, svc) +}) + +// Add domain attributes to current span +telemetry.SetSpanAttributes(ctx, + attribute.String("kiven.service_id", serviceID), + attribute.String("kiven.plan", "business"), +) + +// Get trace ID for log correlation +traceID := telemetry.TraceID(ctx) +``` + +--- + +## Standard Attributes + +Every span SHOULD include these attributes where applicable: + +### Kiven Domain Attributes + +| Attribute | Type | Description | Example | +|-----------|------|-------------|---------| +| `kiven.org_id` | string | 
Organization ID | `org-abc123` | +| `kiven.project_id` | string | Project ID | `proj-xyz789` | +| `kiven.service_id` | string | Managed database service ID | `svc-pg-001` | +| `kiven.cluster_id` | string | Customer EKS cluster ID | `cluster-eu-west-1` | +| `kiven.plan` | string | Service plan | `business` | +| `kiven.env` | string | Environment | `production` | +| `kiven.agent_id` | string | Agent instance ID | `agent-abc123` | + +### OTel Semantic Convention Attributes (auto-set by middleware) + +| Attribute | Set By | Example | +|-----------|--------|---------| +| `http.request.method` | HTTP middleware | `GET` | +| `url.path` | HTTP middleware | `/v1/services` | +| `http.response.status_code` | HTTP middleware | `200` | +| `rpc.system` | gRPC interceptor | `grpc` | +| `rpc.service` | gRPC interceptor | `kiven.agent.v1.AgentRelay` | +| `rpc.method` | gRPC interceptor | `Heartbeat` | +| `rpc.grpc.status_code` | gRPC interceptor | `0` (OK) | +| `service.name` | Provider resource | `svc-api` | +| `deployment.environment` | Provider resource | `production` | + +--- + +## Metric Naming + +Follow OTel semantic conventions. Custom Kiven metrics use the `kiven.` prefix. 
+ +### Standard Metrics (auto-collected) + +| Metric | Type | Source | +|--------|------|--------| +| `http.server.duration` | Histogram | HTTP middleware | +| `http.server.request.size` | Histogram | HTTP middleware | +| `rpc.server.duration` | Histogram | gRPC interceptor | +| `db.client.operation.duration` | Histogram | pgx tracing (Phase 1 gap) | + +### Kiven Business Metrics (service-specific) + +| Metric | Type | Service | Description | +|--------|------|---------|-------------| +| `kiven.provisioning.duration` | Histogram | svc-provisioner | Time to provision a database | +| `kiven.provisioning.active` | Gauge | svc-provisioner | In-flight provisioning jobs | +| `kiven.agent.connected` | Gauge | svc-agent-relay | Connected agents count | +| `kiven.agent.heartbeat.lag` | Histogram | svc-agent-relay | Time since last heartbeat | +| `kiven.backup.duration` | Histogram | svc-backups | Backup execution time | +| `kiven.backup.size` | Gauge | svc-backups | Last backup size in bytes | + +--- + +## Sampling Strategy + +Configured via `kiven-go-sdk/telemetry` Config: + +| Environment | Head Sampling Rate | Tail Sampling (Gateway) | Rationale | +|-------------|-------------------|-------------------------|-----------| +| `local` | 100% | N/A (stdout exporter) | Full visibility for development | +| `staging` | 100% | Keep all | Full visibility for QA | +| `production` | 10% | Errors: 100%, Slow (>500ms): 100% | Cost optimization | + +### Head Sampling (SDK-side) + +Set via `OTEL_SAMPLE_RATE` env var or `Config.SampleRate`. Uses `ParentBased(TraceIDRatioBased(rate))` so child spans always respect parent's sampling decision. 
+ +### Tail Sampling (Gateway Collector) + +The Gateway applies tail sampling after receiving all spans of a trace: + +```yaml +processors: + tail_sampling: + decision_wait: 10s + policies: + - name: errors + type: status_code + status_code: {status_codes: [ERROR]} + - name: slow + type: latency + latency: {threshold_ms: 500} + - name: probabilistic + type: probabilistic + probabilistic: {sampling_percentage: 10} +``` + +--- + +## SDK Usage: kiven-go-sdk/telemetry + +### Service Bootstrap + +```go +func main() { + ctx := context.Background() + + // Auto-configures from env vars (OTEL_EXPORTER_TYPE, OTEL_SAMPLE_RATE, etc.) + cfg, err := telemetry.NewConfigFromEnv("svc-api") + if err != nil { + log.Fatal(err) + } + + tp, err := telemetry.NewProvider(ctx, cfg) + if err != nil { + log.Fatal(err) + } + defer tp.Shutdown(ctx) + + // HTTP server with tracing middleware + r := chi.NewRouter() + r.Use(telemetry.HTTPMiddleware("svc-api")) + // ... +} +``` + +### gRPC Server with Tracing + +```go +server := grpc.NewServer( + grpc.UnaryInterceptor(telemetry.UnaryServerInterceptor()), + grpc.StreamInterceptor(telemetry.StreamServerInterceptor()), +) +``` + +### gRPC Client with Tracing + +```go +conn, _ := grpc.Dial(address, + grpc.WithUnaryInterceptor(telemetry.UnaryClientInterceptor()), + grpc.WithStreamInterceptor(telemetry.StreamClientInterceptor()), +) +``` + +### Environment Variables + +| Variable | Default | Description | +|----------|---------|-------------| +| `OTEL_EXPORTER_TYPE` | `stdout` | `stdout`, `otlp`, or `none` | +| `OTEL_EXPORTER_OTLP_ENDPOINT` | (none) | `host:port` of OTel Collector | +| `OTEL_EXPORTER_OTLP_INSECURE` | `false` | Skip TLS for local collectors | +| `OTEL_ENVIRONMENT` | `local` | `local`, `staging`, `production` | +| `OTEL_SAMPLE_RATE` | env-based | `0.0`-`1.0` (negative = auto) | +| `OTEL_SERVICE_VERSION` | (none) | Service version for resource | + +--- + +## What's Implemented vs Gaps + +### Implemented (kiven-go-sdk/telemetry) + +| File | 
What It Does | +|------|-------------| +| `config.go` | Config struct, DefaultConfig, NewConfigFromEnv, exporter types, env-based sampling | +| `provider.go` | TracerProvider with resource, exporter, sampler, global registration, W3C propagation | +| `span.go` | Trace(), TraceFunc(), SetSpanError(), SetSpanAttributes(), TraceID() helpers | +| `httpmiddleware.go` | chi-compatible HTTP middleware with W3C extraction, semantic convention attrs, X-Trace-ID header | +| `grpc.go` | Full gRPC instrumentation: Unary/Stream Server/Client interceptors with metadata propagation | +| Tests | config_test.go, provider_test.go, span_test.go, httpmiddleware_test.go, grpc_test.go | + +### Phase 1 Gaps (to implement) + +| Gap | Description | Priority | +|-----|-------------|----------| +| **MeterProvider** | OTel MeterProvider setup (like TracerProvider but for metrics). Services need `meter.Int64Counter()`, `meter.Float64Histogram()` etc. | P0 | +| **slog bridge** | Bridge Go `log/slog` to OTel Logs so structured logs flow through the same pipeline as traces/metrics | P1 | +| **pgx tracing hook** | `pgx.QueryTracer` implementation that auto-creates spans for every SQL query with `db.statement`, `db.operation` attributes | P0 | +| **Kiven attribute constants** | Package-level constants for `kiven.org_id`, `kiven.service_id` etc. 
to avoid string typos | P1 | + +--- + +## GDPR Compliance in OTel Pipeline + +The Gateway Collector scrubs PII before exporting: + +| Data | Action | Processor | +|------|--------|-----------| +| `user.id` | Drop | `attributes/delete` | +| `user.email` | Drop | `attributes/delete` | +| `http.client_ip` | Hash | `transform` | +| High cardinality metric labels | Drop | `filter` | +| SQL query parameters | Redact | `transform` (replace bind values with `?`) | + +```yaml +processors: + attributes/scrub: + actions: + - key: user.id + action: delete + - key: user.email + action: delete + - key: enduser.id + action: delete + transform/anonymize: + trace_statements: + - context: span + statements: + - replace_pattern(attributes["http.client_ip"], "^(.*)$", "REDACTED") +``` diff --git a/onboarding/CUSTOMER-ONBOARDING.md b/onboarding/CUSTOMER-ONBOARDING.md new file mode 100644 index 0000000..e928ba6 --- /dev/null +++ b/onboarding/CUSTOMER-ONBOARDING.md @@ -0,0 +1,233 @@ +# Customer Onboarding +## *From Sign-Up to Running Database in 10 Minutes* + +--- + +> **Back to**: [Architecture Overview](../EntrepriseArchitecture.md) + +--- + +# Onboarding Flow + +``` +Step 1 Step 2 Step 3 Step 4 Step 5 +Sign up → Deploy CFN → Connect EKS → Create DB → Connected! +(1 min) template (2 min) (1 min) (5-7 min) +``` + +--- + +# Step 1: Sign Up (1 minute) + +Customer creates account on kiven.io: +- Email + password or SSO (Google / GitHub) +- Create organization +- Invite team members (optional) + +--- + +# Step 2: Deploy CloudFormation Template (2 minutes) + +Customer deploys Kiven's CloudFormation template in their AWS account. This creates the `KivenAccessRole` IAM role. + +### How It Works + +1. Kiven dashboard shows: "Connect your AWS account" +2. Customer clicks → redirected to AWS CloudFormation console with pre-filled template URL +3. Customer reviews and clicks "Create Stack" +4. 
Stack creates: + - IAM Role `KivenAccessRole` (trusts Kiven's AWS account) + - IAM Policy `KivenAccessPolicy` (scoped permissions) + - ExternalId parameter (unique per customer, prevents confused deputy) +5. Stack outputs: Role ARN → customer copies back to Kiven dashboard + +### CloudFormation Template (Summary) + +```yaml +AWSTemplateFormatVersion: '2010-09-09' +Description: Kiven Access Role — Allows Kiven to manage database infrastructure + +Parameters: + ExternalId: + Type: String + Description: Unique ID provided by Kiven (do not change) + KivenAccountId: + Type: String + Default: '123456789012' # Kiven's AWS account ID + +Resources: + KivenAccessRole: + Type: AWS::IAM::Role + Properties: + RoleName: KivenAccessRole + AssumeRolePolicyDocument: + Version: '2012-10-17' + Statement: + - Effect: Allow + Principal: + AWS: !Sub 'arn:aws:iam::${KivenAccountId}:root' + Action: 'sts:AssumeRole' + Condition: + StringEquals: + 'sts:ExternalId': !Ref ExternalId + Policies: + - PolicyName: KivenAccessPolicy + PolicyDocument: + # ... (see CUSTOMER-INFRA-MANAGEMENT.md for full policy) + +Outputs: + RoleArn: + Description: Paste this ARN in the Kiven dashboard + Value: !GetAtt KivenAccessRole.Arn +``` + +--- + +# Step 3: Connect EKS Cluster (1 minute) + +Customer provides their EKS cluster details: + +1. Select AWS region (auto-detected from IAM role) +2. Select EKS cluster (Kiven lists available clusters via AWS API) +3. Kiven validates: + - Can assume role ✓ + - Can describe EKS cluster ✓ + - Can access K8s API ✓ +4. Dashboard shows: "Cluster connected!" 
+ +### Cluster Discovery + +Kiven automatically discovers: +- EKS version +- VPC / subnets / AZs +- Existing node groups +- Installed operators (CNPG, cert-manager) +- Available storage classes +- Resource capacity (CPU, memory) + +--- + +# Step 4: Create Database (5-7 minutes) + +Customer clicks "Create Database" and configures: + +### Simple Mode (Form) + +| Field | Options | Default | +|-------|---------|---------| +| **Name** | Free text | `my-database` | +| **PostgreSQL Version** | 15, 16, 17 | 17 | +| **Plan** | Hobbyist, Startup, Business, Premium, Custom | Startup | +| **Region / AZ** | From customer's EKS subnets | Multi-AZ (auto) | +| **Initial Database** | Database name | `app` | +| **Initial User** | Username | `app_user` | + +### What Happens Behind the Scenes + +``` +1. Prerequisites Check (svc-provisioner) [~10s] + ├── Validate IAM permissions + ├── Check EKS cluster health + ├── Verify subnet availability across AZs + └── Check resource capacity + +2. Infrastructure Setup (svc-infra) [~2-3min] + ├── Create dedicated node group (kiven-db-{id}) + ├── Create StorageClass (kiven-db-gp3) + ├── Create S3 bucket (kiven-backups-{customer-id}) + ├── Create IRSA role (kiven-cnpg-backup-{id}) + └── Wait for nodes ready + +3. CNPG Setup (agent) [~1min] + ├── Create namespace (kiven-databases) + ├── Install CNPG operator (if not present) + ├── Deploy Kiven agent + └── Apply NetworkPolicies + +4. Database Provisioning (agent) [~2-3min] + ├── Apply CNPG Cluster YAML + ├── Apply PgBouncer Pooler YAML + ├── Apply ScheduledBackup YAML + ├── Wait for primary ready + ├── Wait for replicas synced + └── Create initial database + user + +5. Ready! 
[total: ~5-7min] + └── Return connection strings +``` + +### Dashboard Progress + +``` +┌──────────────────────────────────────────────────────────────┐ +│ Creating database: my-database │ +│ │ +│ [████████████████████████████░░░░░░░░░░] 72% │ +│ │ +│ ✅ Prerequisites validated │ +│ ✅ Node group created (2× r6g.medium) │ +│ ✅ Storage and backups configured │ +│ ✅ CNPG operator ready │ +│ ⏳ PostgreSQL starting... │ +│ ○ Replicas syncing │ +│ ○ Creating database and user │ +│ │ +│ Estimated time remaining: ~2 minutes │ +└──────────────────────────────────────────────────────────────┘ +``` + +--- + +# Step 5: Connected! + +Customer receives: + +``` +┌──────────────────────────────────────────────────────────────┐ +│ ✅ Database ready! │ +│ │ +│ Connection Details: │ +│ │ +│ Host: pg-my-database-rw.kiven-databases.svc │ +│ Port: 5432 │ +│ Database: app │ +│ User: app_user │ +│ Password: •••••••••• [Reveal] [Copy] │ +│ │ +│ Pooler (recommended): │ +│ Host: pg-my-database-pooler.kiven-databases.svc │ +│ Port: 5432 │ +│ │ +│ Connection String: │ +│ postgresql://app_user:***@pg-my-database-pooler │ +│ .kiven-databases.svc:5432/app?sslmode=require │ +│ [Copy] │ +│ │ +│ [Open Dashboard] [View Metrics] [Add User] │ +└──────────────────────────────────────────────────────────────┘ +``` + +--- + +# Offboarding + +When a customer deletes their database: + +1. **Confirmation dialog**: "This will delete your database. Backups will be retained for 30 days." +2. CNPG cluster deleted +3. Node group deleted +4. EBS volumes retained for 7 days, then deleted (configurable) +5. S3 backups retained for 30 days (configurable) +6. IAM IRSA role deleted +7. Audit log entry created + +When a customer removes Kiven entirely: +1. All databases must be deleted first (or exported) +2. Kiven agent uninstalled (`helm uninstall kiven-agent`) +3. Customer deletes CloudFormation stack (removes IAM role) +4. 
Kiven retains customer metadata for 90 days (GDPR), then purges
+
+---
+
+*Maintained by: Product Team + Platform Team*
+*Last updated: February 2026*
diff --git a/plans/KIVEN_ROADMAP.md b/plans/KIVEN_ROADMAP.md
new file mode 100644
index 0000000..597de06
--- /dev/null
+++ b/plans/KIVEN_ROADMAP.md
@@ -0,0 +1,532 @@
+name: Kiven Team Roadmap
+overview: A phased roadmap for the Kiven team to go from current state (~10% implemented) to production-ready. 6 phases covering foundation, core services, orchestration, dashboard, production infrastructure, and enterprise features.
+todos:
+  - id: phase1-sdk
+    content: "Phase 1A: Complete kiven-go-sdk (error types, middleware, OTel instrumentation, DB helpers)"
+    status: pending
+  - id: phase1-proto
+    content: "Phase 1A: Create contracts-proto (buf setup, agent.proto, metrics.proto)"
+    status: pending
+  - id: phase1-api
+    content: "Phase 1B: Scaffold svc-api (chi router, DB layer, first read endpoints)"
+    status: pending
+  - id: phase1-templates
+    content: "Phase 1C: Apply Copier templates to all Phase 2 repos + create sdk-go template"
+    status: pending
+  - id: phase2-auth
+    content: "Phase 2A: Build svc-auth (OIDC, API keys, RBAC)"
+    status: pending
+  - id: phase2-cnpg
+    content: "Phase 2B: Build provider-cnpg (YAML generation, status parsing)"
+    status: pending
+  - id: phase2-infra
+    content: "Phase 2C: Build svc-infra (AWS SDK, node groups, S3, IRSA)"
+    status: pending
+  - id: phase2-relay
+    content: "Phase 2B: Build svc-agent-relay (gRPC server, agent registration)"
+    status: pending
+  - id: phase2-api-endpoints
+    content: "Phase 2D: Implement svc-api CRUD endpoints"
+    status: pending
+  - id: phase3-agent
+    content: "Phase 3A: Build kiven-agent (CNPG informers, gRPC client, command executor)"
+    status: pending
+  - id: phase3-provisioner
+    content: "Phase 3B: Build svc-provisioner (state machine, orchestrate infra + agent)"
+    status: pending
+  - id: phase3-services
+    content: "Phase 3B: Build svc-clusters, svc-backups, svc-users"
+    status: pending
+  - id: phase3-e2e
+    content: "Phase 3: End-to-end integration test in kind"
+    status: pending
+  - id: phase4-dashboard
+    content: "Phase 4: Dashboard API integration, auth flow, real data"
+    status: pending
+  - id: phase5-deploy
+    content: "Phase 5A: Production deployment (ArgoCD, Helm charts, platform-gitops)"
+    status: pending
+  - id: phase5-observability
+    content: "Phase 5B: Observability stack (Prometheus, Loki, Tempo, Grafana, OTel)"
+    status: pending
+  - id: phase5-security
+    content: "Phase 5C: Security hardening (Vault, mTLS, Kyverno, cert-manager)"
+    status: pending
+  - id: phase5-onboarding
+    content: "Phase 5D: Customer onboarding (Terraform module, EKS discovery)"
+    status: pending
+  - id: phase6-monitoring
+    content: "Phase 6A: svc-monitoring (DBA intelligence, alerts, query optimizer)"
+    status: pending
+  - id: phase6-billing
+    content: "Phase 6B: svc-billing (Stripe, usage tracking, invoices)"
+    status: pending
+  - id: phase6-enterprise
+    content: "Phase 6C: Enterprise features (svc-audit, svc-notification, svc-yamleditor, svc-migrations)"
+    status: pending
+isProject: false
+---
+
+# Kiven Team Roadmap -- From Foundation to Production
+
+## Current State
+
+What works today:
+
+- **Platform tooling**: `bootstrap` (Terraform), `platform-github-management` (sync script), `reusable-workflows` (8 workflows + 5 actions), `platform-templates-service-go` (Copier template), `docs` (18 docs)
+- **SDK**: `kiven-go-sdk` (15 Go files, 4 test files -- provider interface, 11 models, config, logging)
+- **Dashboard**: 14 pages scaffolded, no API integration
+- **Dev environment**: `kiven-dev` with Taskfile, migrations, kind/CNPG setup
+- **API contract**: `svc-api` has OpenAPI spec, no Go code yet
+
+What does NOT work: no service has Go code, no agent, no provider, no provisioning pipeline, no auth, no customer-facing functionality.
+
+## Production Target
+
+**Production = A customer can:**
+
+1. Sign up and log in (OIDC + SSO/SAML)
+2. Register their EKS cluster (Terraform module or CloudFormation one-click)
+3. Click "Create Database" and get a PostgreSQL connection string in ~10 minutes
+4. See metrics, logs, backups, users, connection info in the dashboard
+5. Get DBA recommendations, alerts, and performance insights
+6. Power on/off databases on schedule
+7. Pay via Stripe with usage tracking
+8. Have full audit trail of all operations
+
+**6 phases**: Foundation -> Core Services -> Orchestration -> Dashboard -> Production Infrastructure -> Enterprise Features.
+
+## Dependency Graph
+
+```mermaid
+graph TD
+    subgraph phase1 ["Phase 1: Foundation -- Weeks 1-3"]
+        SDK["kiven-go-sdk<br/>complete models"]
+        PROTO["contracts-proto<br/>gRPC definitions"]
+        API_SCAFFOLD["svc-api<br/>scaffold + DB layer"]
+        TEMPLATES["Apply Copier templates<br/>to all repos"]
+    end
+
+    subgraph phase2 ["Phase 2: Core Services -- Weeks 4-8"]
+        AUTH["svc-auth<br/>OIDC + API keys + RBAC"]
+        CNPG["provider-cnpg<br/>YAML generation"]
+        INFRA["svc-infra<br/>AWS SDK integration"]
+        API_IMPL["svc-api<br/>implement endpoints"]
+        RELAY["svc-agent-relay<br/>gRPC server"]
+    end
+
+    subgraph phase3 ["Phase 3: Orchestration -- Weeks 9-14"]
+        AGENT["kiven-agent<br/>CNPG watcher + gRPC client"]
+        PROV["svc-provisioner<br/>state machine"]
+        CLUSTERS["svc-clusters<br/>lifecycle management"]
+        BACKUPS["svc-backups<br/>backup/restore"]
+        USERS["svc-users<br/>PG user management"]
+    end
+
+    subgraph phase4 ["Phase 4: Dashboard -- Weeks 8-16"]
+        DASH["dashboard<br/>API integration + auth flow"]
+    end
+
+    subgraph phase5 ["Phase 5: Production Infra -- Weeks 15-20"]
+        GITOPS["platform-gitops<br/>ArgoCD + Helm"]
+        OBS["platform-observability<br/>Prometheus + Loki + Tempo"]
+        SEC["platform-security<br/>Vault + Kyverno + mTLS"]
+        ONBOARD["infra-customer-aws<br/>Terraform onboarding"]
+    end
+
+    subgraph phase6 ["Phase 6: Enterprise -- Weeks 17-24"]
+        MON["svc-monitoring<br/>DBA intelligence + alerts"]
+        BILL["svc-billing<br/>Stripe + usage tracking"]
+        AUDIT["svc-audit<br/>immutable audit log"]
+        NOTIF["svc-notification<br/>Slack + email + webhook"]
+        YAML["svc-yamleditor<br/>Advanced Mode"]
+        MIG["svc-migrations<br/>import from Aiven/RDS"]
+    end
+
+    SDK --> AUTH
+    SDK --> CNPG
+    SDK --> INFRA
+    SDK --> API_IMPL
+    PROTO --> RELAY
+    PROTO --> AGENT
+    API_SCAFFOLD --> API_IMPL
+    TEMPLATES --> AUTH
+    TEMPLATES --> CNPG
+    TEMPLATES --> INFRA
+
+    AUTH --> API_IMPL
+    CNPG --> AGENT
+    CNPG --> CLUSTERS
+    CNPG --> BACKUPS
+    CNPG --> USERS
+    INFRA --> PROV
+    RELAY --> AGENT
+    RELAY --> PROV
+
+    API_IMPL --> DASH
+
+    PROV --> GITOPS
+    AGENT --> OBS
+    CLUSTERS --> MON
+    BACKUPS --> MON
+    AUTH --> ONBOARD
+    INFRA --> ONBOARD
+    API_IMPL --> BILL
+    API_IMPL --> AUDIT
+    CLUSTERS --> YAML
+    MON --> NOTIF
+```
+
+## Phase 1: Foundation (Weeks 1-3)
+
+**Goal**: Every repo that Phase 2 depends on is ready: SDK complete, gRPC contracts defined, svc-api scaffolded with DB layer, Copier templates applied.
+
+### Work Stream A: SDK + Contracts (1 developer)
+
+| Week | Repo | Task | Deliverable |
+| ---- | ---- | ---- | ----------- |
+| 1 | `kiven-go-sdk` | Add error types package (`errors/errors.go`), HTTP client helpers, pagination types | Shared error handling + HTTP primitives for all services |
+| 1 | `kiven-go-sdk` | Add middleware package (logging, recovery, request ID, auth context) | Reusable chi middleware for all svc-* |
+| 1-2 | `kiven-go-sdk` | Telemetry package gaps (traces provider, HTTP middleware, gRPC interceptors, span helpers already done): **add MeterProvider** (OTel metrics: counters, histograms, gauges), **add pgx tracing hook** (auto-trace SQL queries), **add slog bridge** (structured logs to OTel Logs), **add Kiven attribute constants** (`kiven.org_id`, `kiven.service_id` etc. as typed constants). See `docs/observability/OTEL-CONVENTIONS.md`. | Full traces + metrics + logs instrumentation from day 1 |
+| 2 | `docs` | DONE: `docs/observability/OTEL-CONVENTIONS.md` created. DONE: `OBSERVABILITY-GUIDE.md` updated to Kiven context. | OTel architecture decisions documented |
+| 2 | `contracts-proto` | Define `.proto` files: `agent.proto` (Heartbeat, Status, Command streams), `metrics.proto`, `commands.proto` | gRPC contract for agent-relay communication |
+| 2 | `contracts-proto` | Set up `buf.yaml`, `buf.gen.yaml`, CI with `buf lint` + `buf breaking` | Generated Go code in `gen/go/` |
+| 3 | `kiven-go-sdk` | Add database helpers (pgx pool factory, migration runner, transaction helpers) with OTel pgx tracing | Every service can connect to DB with 3 lines + auto-traced queries |
+
+### Work Stream B: svc-api Scaffold (1 developer)
+
+| Week | Repo | Task | Deliverable |
+| ---- | ---- | ---- | ----------- |
+| 1 | `svc-api` | Scaffold Go project from Copier template (`copier copy`), set up chi router, healthcheck, graceful shutdown, OTel trace provider init | Running HTTP server at :8080/healthz with OTel traces |
+| 2 | `svc-api` | Database layer: pgx connection pool, migration runner on startup, repository pattern (interfaces) | `ServiceRepository`, `OrganizationRepository`, `BackupRepository` interfaces |
+| 2 | `svc-api` | OpenAPI validation middleware (validate requests/responses against spec) | Every request validated against openapi.yaml |
+| 3 | `svc-api` | Implement read-only endpoints: `GET /v1/plans`, `GET /v1/services`, `GET /v1/services/{id}` | First working API endpoints returning data from DB |
+
+### Work Stream C: Apply Templates (1 developer, part-time)
+
+| Week | Repo | Task | Deliverable |
+| ---- | ---- | ---- | ----------- |
+| 1-2 | All Phase 2 repos | Run `copier copy` from `platform-templates-service-go` for: `svc-auth`, `svc-infra`, `svc-agent-relay`, `svc-clusters`, `svc-backups`, `svc-users`, `provider-cnpg`, `kiven-agent` | All repos have: editorconfig, golangci, pre-commit, CI workflow, Taskfile, Dockerfile, go.mod, OTel init |
+| 2-3 | `platform-templates-sdk-go` | Create Copier template for sdk-go (same as service-go but no Dockerfile, no cmd/, no gRPC) | Template ready for provider repos and CLI |
+| 3 | `kiven-dev` | Verify `task dev` works end-to-end: Docker Compose + kind + CNPG + all services can start | Working local dev environment |
+
+**Phase 1 exit criteria**: `task dev` starts infra, `svc-api` returns service plans from DB, `contracts-proto` generates Go code, all Phase 2 repos are scaffolded.
+
+---
+
+## Phase 2: Core Services (Weeks 4-8)
+
+**Goal**: Authentication works, CNPG YAML can be generated, AWS resources can be created, agent can connect to relay. These are the building blocks the provisioner needs.
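The RBAC building block above (admin, operator, viewer roles enforced on every endpoint) boils down to ordinary HTTP middleware. A minimal sketch, not the real `svc-auth` code: the role ranking, function names, and the `X-Kiven-Role` header are assumptions for illustration; in practice the role would come from validated JWT claims.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// roleRank orders the three roles named in the roadmap (illustrative).
var roleRank = map[string]int{"viewer": 1, "operator": 2, "admin": 3}

// hasAtLeast reports whether role meets the minimum required role.
// Unknown roles never pass.
func hasAtLeast(role, required string) bool {
	r, okRole := roleRank[role]
	req, okReq := roleRank[required]
	return okRole && okReq && r >= req
}

// RequireRole is chi-compatible middleware: it rejects callers whose
// role (here read from a header for brevity) is below the required one.
func RequireRole(required string) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			if !hasAtLeast(r.Header.Get("X-Kiven-Role"), required) {
				http.Error(w, "forbidden", http.StatusForbidden)
				return
			}
			next.ServeHTTP(w, r)
		})
	}
}

func main() {
	// Protect a destructive endpoint so only operator and above may call it.
	protected := RequireRole("operator")(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))

	req := httptest.NewRequest("DELETE", "/v1/services/abc", nil)
	req.Header.Set("X-Kiven-Role", "viewer")
	rec := httptest.NewRecorder()
	protected.ServeHTTP(rec, req)
	fmt.Println("viewer:", rec.Code) // 403

	req.Header.Set("X-Kiven-Role", "admin")
	rec = httptest.NewRecorder()
	protected.ServeHTTP(rec, req)
	fmt.Println("admin:", rec.Code) // 200
}
```

Because the middleware wraps any `http.Handler`, the same check composes with chi routers and the SDK's logging/recovery middleware without changes.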
+
+### Work Stream A: Auth (1 developer)
+
+| Week | Repo | Task | Deliverable |
+| ---- | ---- | ---- | ----------- |
+| 4 | `svc-auth` | OIDC integration (Google/GitHub login via `coreos/go-oidc`), JWT token issuance | Users can log in, get a JWT |
+| 5 | `svc-auth` | API key management (create, list, revoke, hash with argon2) | Programmatic access for CLI/Terraform |
+| 6 | `svc-auth` | RBAC middleware (admin, operator, viewer roles), org/team model | Role-based access control on all endpoints |
+| 6 | `svc-api` | Integrate auth middleware from `svc-auth`, protect all endpoints | Every API call requires valid token |
+
+### Work Stream B: Provider + Agent Foundation (1 developer)
+
+| Week | Repo | Task | Deliverable |
+| ---- | ---- | ---- | ----------- |
+| 4-5 | `provider-cnpg` | Implement Provider interface for CNPG: `GenerateClusterYAML()`, `GeneratePoolerYAML()`, `GenerateScheduledBackupYAML()` | Given a service definition, produce valid CNPG YAML |
+| 5-6 | `provider-cnpg` | Implement `ParseStatus()`, `ParseMetrics()` from CNPG CRD status fields | Can read CNPG cluster state |
+| 6 | `contracts-proto` | Finalize agent protocol: Heartbeat (bidirectional), CommandStream (server-push), MetricsStream (agent-push) | Stable gRPC contract |
+| 7-8 | `svc-agent-relay` | gRPC server: agent registration, heartbeat tracking, command dispatch queue | Agents can connect and receive commands |
+
+### Work Stream C: Infrastructure (1 developer)
+
+| Week | Repo | Task | Deliverable |
+| ---- | ---- | ---- | ----------- |
+| 4-5 | `svc-infra` | AWS SDK integration: `AssumeRole` into customer account, EKS `DescribeCluster` | Can access customer AWS resources |
+| 5-6 | `svc-infra` | Create EKS managed node group (dedicated for databases, tainted, right instance type) | Can create database nodes in customer cluster |
+| 6-7 | `svc-infra` | Create S3 bucket (encrypted, lifecycle rules) for backups, create IRSA role for CNPG | Backup infrastructure ready |
+| 7-8 | `svc-infra` | Create EBS StorageClass (gp3, encrypted, right IOPS) | Storage ready for database volumes |
+
+### Work Stream D: svc-api Endpoints (shared across team)
+
+| Week | Repo | Task | Deliverable |
+| ---- | ---- | ---- | ----------- |
+| 5-6 | `svc-api` | CRUD endpoints: `POST /v1/services`, `DELETE /v1/services/{id}`, `PATCH /v1/services/{id}` | Can create/update/delete services via API |
+| 7-8 | `svc-api` | Customer cluster endpoints: `POST /v1/clusters` (register EKS), `GET /v1/clusters` | Can register customer EKS clusters |
+| 8 | `svc-api` | Backup endpoints, user management endpoints | Full CRUD for all resources |
+
+**Phase 2 exit criteria**: User can log in via OIDC, create a service via API (stored in DB), `provider-cnpg` generates valid CNPG YAML, `svc-infra` can create node groups in test AWS account, agent can connect to relay.
+
+---
+
+## Phase 3: Orchestration (Weeks 9-14)
+
+**Goal**: The provisioning pipeline works end-to-end. Customer clicks "Create Database" and gets a running PostgreSQL.
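The heart of that pipeline is svc-provisioner's state machine over a `provisioning_jobs` table. A minimal sketch of the step-transition logic, assuming the step names from the roadmap; the persistence layer, retries, and real step executors are omitted, and this is not the actual svc-provisioner code.

```go
package main

import (
	"errors"
	"fmt"
)

// steps mirrors the provisioning pipeline order from the roadmap.
var steps = []string{
	"create_nodes", "create_storage", "create_s3",
	"install_cnpg", "deploy_cluster", "done",
}

// nextStep returns the step that follows current, or an error when the
// job is already finished or is in an unknown state.
func nextStep(current string) (string, error) {
	for i, s := range steps[:len(steps)-1] {
		if s == current {
			return steps[i+1], nil
		}
	}
	return "", errors.New("no transition from state: " + current)
}

func main() {
	// Walk a job through the whole pipeline; in the real service each
	// iteration would call svc-infra or send a command via svc-agent-relay,
	// then persist the new state before continuing.
	step := steps[0]
	for step != "done" {
		fmt.Println("running step:", step)
		next, err := nextStep(step)
		if err != nil {
			fmt.Println("job failed:", err)
			return
		}
		step = next
	}
	fmt.Println("service provisioned")
}
```

Keeping the order in one slice means a crashed worker can resume a job from its persisted step without re-running earlier ones.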
+
+### Work Stream A: Agent (1 developer)
+
+| Week | Repo | Task | Deliverable |
+| ---- | ---- | ---- | ----------- |
+| 9-10 | `kiven-agent` | Go binary: gRPC client to relay, CNPG informers (watch Cluster/Backup CRDs), heartbeat | Agent running in kind, reporting CNPG status |
+| 10-11 | `kiven-agent` | Command executor: receive YAML from relay, `kubectl apply`, report result | Can apply CNPG manifests on command |
+| 11-12 | `kiven-agent` | PG stats collector: connect to PostgreSQL, collect pg_stat_statements, send to relay | Metrics flowing to SaaS |
+| 12 | `kiven-agent-helm` | Helm chart for agent deployment | One-command agent install in customer cluster |
+
+### Work Stream B: Provisioner + Services (2 developers)
+
+| Week | Repo | Task | Deliverable |
+| ---- | ---- | ---- | ----------- |
+| 9-10 | `svc-provisioner` | State machine: `provisioning_jobs` table, steps: create_nodes -> create_storage -> create_s3 -> install_cnpg -> deploy_cluster | Provisioning pipeline orchestration |
+| 10-11 | `svc-provisioner` | Integration with `svc-infra` (create AWS resources) + `svc-agent-relay` (send commands to agent) | Full pipeline: API -> provisioner -> infra + agent |
+| 11-12 | `svc-clusters` | Cluster lifecycle: get status from agent, scale (change instances), power on/off | Can see cluster status, scale up/down |
+| 12-13 | `svc-backups` | Backup management: trigger backup via agent, list backups from S3, PITR restore | Backup/restore working |
+| 13-14 | `svc-users` | PG user management: create user via agent (SQL execution), list users, reset password | Can manage database users |
+
+### Integration Testing (all developers)
+
+| Week | Task | Deliverable |
+| ---- | ---- | ----------- |
+| 13-14 | End-to-end test in kind: create service -> provisioner runs -> agent deploys CNPG -> PostgreSQL running -> connection string returned | MVP proof: the full loop works |
+
+**Phase 3 exit criteria**: In the local dev environment (kind), a user can create a service via API, the provisioner creates a CNPG cluster via the agent, and the user gets a working PostgreSQL connection string.
+
+---
+
+## Phase 4: Dashboard Integration (Weeks 8-16, parallel with Phase 3)
+
+**Goal**: The dashboard is connected to the API and a customer can do everything via the UI.
+
+### Work Stream (1 frontend developer)
+
+| Week | Repo | Task | Deliverable |
+| ---- | ---- | ---- | ----------- |
+| 8-9 | `dashboard` | API client layer: fetch wrapper, auth token management, error handling | Type-safe API client |
+| 9-10 | `dashboard` | Auth flow: login page, OIDC redirect, token storage, protected routes | Users can log in |
+| 10-11 | `dashboard` | Service list page: real data from API, create service wizard connected to API | Can create a database from the UI |
+| 11-12 | `dashboard` | Service detail page: real connection info, status from API, power on/off | Can see database status |
+| 12-14 | `dashboard` | Backups page (real data), Users page (CRUD), Metrics page (charts from agent data) | Full dashboard functionality |
+| 14-16 | `dashboard` | Polish: loading states, error handling, responsive design, dark mode | Production-ready UI |
+
+**Phase 4 exit criteria**: Customer can log in, create a database, see status, manage users/backups -- all from the dashboard.
+
+---
+
+## Phase 5: Production Infrastructure (Weeks 15-20, overlaps with Phase 4)
+
+**Goal**: Everything needed to run Kiven in production on real AWS infrastructure. Services deploy via GitOps, observability is in place, security is hardened, customers can onboard autonomously.
+
+### Work Stream A: Deployment + GitOps (1 developer)
+
+| Week | Repo | Task | Deliverable |
+| ---- | ---- | ---- | ----------- |
+| 15 | `platform-gitops` | ArgoCD ApplicationSets for all svc-* services, environments (dev, staging, prod) | GitOps deployment pipeline |
+| 15-16 | All svc-* repos | Production Helm charts (per service), Kustomize overlays for env-specific config | `helm install svc-api` works |
+| 16-17 | `kiven-dev` | Staging environment in real EKS (not kind): Terraform for Kiven SaaS EKS cluster | Staging cluster running on AWS |
+| 17-18 | `platform-gitops` | Promotion workflow: dev -> staging -> prod with approval gates | Controlled rollouts |
+| 18 | `platform-gateway` | Cloudflare Terraform: DNS, WAF rules, DDoS protection, Tunnel to EKS | kiven.io resolves, API accessible via Cloudflare |
+
+### Work Stream B: Observability (1 developer)
+
+| Week | Repo | Task | Deliverable |
+| ---- | ---- | ---- | ----------- |
+| 15-16 | `platform-observability` | Install Prometheus + Grafana via Helm, ServiceMonitor for all svc-* services | Metrics collection and dashboards |
+| 16-17 | `platform-observability` | Install Loki + Promtail, configure log ingestion from all pods | Centralized logging |
+| 17 | `platform-observability` | OTel Collector two-tier: DaemonSet agents (forward) + Gateway Deployment (batch, filter, export). Exporter helper with persistent queue (file-backed, survives restarts). Tempo as trace backend. | Distributed tracing with resilient pipeline |
+| 17-18 | `platform-observability` | Grafana dashboards: service health, request latency, error rates, agent status | Operations visibility |
+| 18-19 | `platform-observability` | SLO definitions (99.9% API availability, <200ms p95 latency), error budget alerts | SLO monitoring with Sloth or Pyrra |
+| 19-20 | `platform-observability` | On-call runbooks: automated alert -> runbook link, PagerDuty integration | Operational readiness |
+
+### Work Stream C: Security Hardening (1 developer)
+
+| Week | Repo | Task | Deliverable |
+| ---- | ---- | ---- | ----------- |
+| 15-16 | `platform-security` | HashiCorp Vault: install, configure dynamic secrets for PostgreSQL and AWS credentials | No more static secrets |
+| 16-17 | `platform-security` | External Secrets Operator: sync Vault secrets to Kubernetes Secrets for all services | Services read secrets from K8s natively |
+| 17 | `platform-security` | cert-manager: install, configure Let's Encrypt ClusterIssuer, auto-TLS for all services | HTTPS everywhere |
+| 17-18 | `platform-security` | Kyverno policies: require resource limits, require labels, block privileged pods | Policy enforcement |
+| 18-19 | `platform-networking` | Cilium network policies: restrict pod-to-pod traffic, mTLS between services | Zero-trust networking |
+| 19-20 | `platform-security` | Image signing (Cosign), SBOM generation, vulnerability scanning in CI | Supply chain security |
+
+### Work Stream D: Customer Onboarding (shared)
+
+| Week | Repo | Task | Deliverable |
+| ---- | ---- | ---- | ----------- |
+| 16-17 | `infra-customer-aws` | Terraform module (primary): creates `KivenAccessRole` in customer AWS account (IAM + trust policy). Published to Terraform Registry. | IaC-native onboarding |
+| 17 | `infra-customer-aws` | CloudFormation template (alternative): same IAM Role, launchable via one-click URL for non-Terraform customers | Quick-start onboarding for all |
+| 17-18 | `infra-customer-aws` | EKS discovery: validate cluster access, discover node capacity, storage classes, CNPG | Automated cluster validation |
+| 18-19 | `svc-api` + `dashboard` | Onboarding wizard: choose Terraform or CloudFormation -> paste IAM Role ARN -> validate -> register cluster -> create first DB | Self-service customer onboarding |
+| 19-20 | `infra-customer-aws` | Advanced Terraform modules: VPC peering, private endpoints, custom KMS | Enterprise networking options |
+
+**Phase 5 exit criteria**: Services deploy via ArgoCD to staging EKS, Grafana shows metrics/logs/traces, Vault manages secrets, customers can onboard via Terraform module OR CloudFormation + dashboard wizard, Cloudflare serves kiven.io.
+
+---
+
+## Phase 6: Enterprise Features (Weeks 17-24, overlaps with Phase 5)
+
+**Goal**: Revenue-generating features, compliance, DBA intelligence, operational maturity.
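One concrete piece of the DBA intelligence mentioned above is the capacity planner's "disk full in N days" warning, which is essentially a linear projection over recent storage samples. A hedged sketch; the function name, threshold, and sample values are invented, and a real forecaster would fit more than two points.

```go
package main

import "fmt"

// daysUntilFull projects when a volume fills up, assuming linear growth
// between two usage samples taken daysBetween days apart. The second
// return value is false when usage is flat or shrinking (no forecast).
func daysUntilFull(usedThenGB, usedNowGB, capacityGB, daysBetween float64) (float64, bool) {
	growthPerDay := (usedNowGB - usedThenGB) / daysBetween
	if growthPerDay <= 0 {
		return 0, false
	}
	return (capacityGB - usedNowGB) / growthPerDay, true
}

func main() {
	// 300 GB used a week ago, 330 GB now, on a 500 GB volume.
	days, ok := daysUntilFull(300, 330, 500, 7)
	if ok && days < 30 {
		fmt.Printf("warning: disk full in %.0f days\n", days)
	} else {
		fmt.Printf("ok: about %.0f days of headroom\n", days)
	}
}
```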
+
+### Work Stream A: Monitoring + DBA Intelligence (1 developer)
+
+| Week | Repo | Task | Deliverable |
+| ---- | ---- | ---- | ----------- |
+| 17-18 | `svc-monitoring` | Metrics ingestion from agent: store pg_stat_statements, connection counts, replication lag | Metrics pipeline from agent to SaaS |
+| 18-19 | `svc-monitoring` | DBA recommendations engine: auto-tune postgresql.conf based on workload patterns | "Increase shared_buffers to 4GB" alerts |
+| 19-20 | `svc-monitoring` | Query optimizer: slow query detection, visual EXPLAIN, index suggestions | Actionable query performance insights |
+| 20-21 | `svc-monitoring` | Capacity planner: storage/CPU growth forecasting, "disk full in 14 days" warnings | Proactive capacity alerts |
+| 21-22 | `svc-monitoring` | Backup verification: automated weekly restore tests, RPO compliance dashboard | Verified backup reliability |
+
+### Work Stream B: Billing (1 developer)
+
+| Week | Repo | Task | Deliverable |
+| ---- | ---- | ---- | ----------- |
+| 18-19 | `svc-billing` | Stripe integration: customer/subscription lifecycle, payment methods | Customers can subscribe to plans |
+| 19-20 | `svc-billing` | Usage tracking: compute hours per cluster, storage consumption, backup storage | Accurate usage metering |
+| 20-21 | `svc-billing` | Invoice generation: monthly invoices with line items (Kiven fee + AWS estimate) | Professional invoices |
+| 21-22 | `svc-billing` | Dashboard billing page: plan upgrade/downgrade, payment history, cost breakdown | Self-service billing |
+
+### Work Stream C: Enterprise Services (1-2 developers)
+
+| Week | Repo | Task | Deliverable |
+| ---- | ---- | ---- | ----------- |
+| 17-18 | `svc-audit` | Immutable audit log: every API call, every infra change, who/what/when, stored in append-only table | Compliance-ready audit trail |
+| 18-19 | `svc-notification` | Alert dispatch: Slack, email, webhook, PagerDuty integration | Multi-channel alerting |
+| 19-20 | `svc-yamleditor` | Advanced Mode: YAML viewer/editor with Monaco, CNPG schema validation, diff before apply | Expert users can see/edit all YAML |
+| 20-21 | `svc-yamleditor` | Change history: git-like timeline of all YAML changes, rollback to any version | Full configuration history |
+| 21-22 | `svc-migrations` | Import from Aiven: logical replication setup, progress tracking, cutover | Customers can migrate from Aiven |
+| 22-24 | `svc-migrations` | Import from RDS + bare PostgreSQL: pg_dump/restore, pg_basebackup | Customers can migrate from any PG |
+| 22-24 | `svc-auth` | SSO/SAML support (enterprise customers), advanced RBAC (per-service permissions) | Enterprise auth requirements |
+
+**Phase 6 exit criteria**: Billing works (Stripe), audit log records everything, alerts dispatch to Slack/email, DBA intelligence gives recommendations, Advanced Mode lets experts edit YAML, customers can migrate from Aiven/RDS.
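The usage-tracking half of billing is mostly careful arithmetic: metered quantities times a rate, accumulated in integer cents so invoices never drift from float rounding. A small sketch under made-up rates; the real svc-billing rate card and Stripe wiring are not shown here.

```go
package main

import (
	"fmt"
	"math"
)

// lineItemCents prices one month of metered usage in whole cents.
// Rounding happens once, at the end, so repeated accrual cannot
// accumulate sub-cent float error on the invoice.
func lineItemCents(computeHours, hourlyRateUSD, storageGBMonths, gbMonthRateUSD float64) int64 {
	total := computeHours*hourlyRateUSD + storageGBMonths*gbMonthRateUSD
	return int64(math.Round(total * 100))
}

func main() {
	// 720 compute hours at $0.12/h plus 100 GB-months at $0.10/GB-month
	// (illustrative rates only).
	cents := lineItemCents(720, 0.12, 100, 0.10)
	fmt.Printf("monthly line item: $%.2f\n", float64(cents)/100)
}
```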
+
+---
+
+## Team Allocation (assuming 4 developers)
+
+```mermaid
+gantt
+    title Kiven Production Roadmap
+    dateFormat YYYY-MM-DD
+    axisFormat %b %d
+
+    section Phase1_Foundation
+    SDK_complete :p1a, 2026-02-23, 3w
+    contracts_proto :p1b, 2026-02-23, 3w
+    svc_api_scaffold :p1c, 2026-02-23, 3w
+    apply_copier_templates :p1d, 2026-02-23, 2w
+
+    section Phase2_Core
+    svc_auth :p2a, after p1a, 3w
+    provider_cnpg :p2b, after p1a, 5w
+    svc_infra :p2c, after p1a, 5w
+    svc_agent_relay :p2d, after p1b, 3w
+    svc_api_endpoints :p2e, after p1c, 4w
+
+    section Phase3_Orchestration
+    kiven_agent :p3a, after p2d, 4w
+    svc_provisioner :p3b, after p2c, 4w
+    svc_clusters :p3c, after p2b, 3w
+    svc_backups :p3d, after p3c, 2w
+    svc_users :p3e, after p3c, 2w
+    e2e_integration :p3f, after p3b, 2w
+
+    section Phase4_Dashboard
+    dashboard_integration :p4a, after p2e, 8w
+
+    section Phase5_Production
+    argocd_helm_gitops :p5a, after p3f, 4w
+    observability_stack :p5b, after p3f, 6w
+    security_hardening :p5c, after p3f, 6w
+    customer_onboarding :p5d, after p5a, 4w
+
+    section Phase6_Enterprise
+    svc_monitoring_dba :p6a, after p5b, 6w
+    svc_billing_stripe :p6b, after p5a, 4w
+    svc_audit_notif_yaml :p6c, after p5a, 6w
+    svc_migrations :p6d, after p6a, 4w
+```
+
+### Developer Assignment Suggestion
+
+- **Dev 1 (Backend Lead)**: SDK -> svc-auth -> svc-provisioner -> svc-monitoring -> svc-billing
+- **Dev 2 (K8s/Infra)**: contracts-proto -> provider-cnpg -> kiven-agent -> platform-observability -> platform-security
+- **Dev 3 (Cloud/AWS)**: svc-api scaffold -> svc-infra -> svc-clusters + svc-backups + svc-users -> platform-gitops -> infra-customer-aws
+- **Dev 4 (Frontend)**: Apply templates -> dashboard -> svc-yamleditor -> svc-audit + svc-notification + svc-migrations
+
+---
+
+## Milestones
+
+| Week | Milestone | How to Verify |
+| ---- | --------- | ------------- |
+| 3 | Foundation done | `svc-api` returns plans from DB, `buf generate` works, all repos scaffolded |
+| 5 | Auth works | Log in via GitHub OIDC, get JWT, access protected endpoint |
+| 6 | CNPG YAML generates | `provider-cnpg` produces valid Cluster YAML from service definition |
+| 8 | AWS resources work | `svc-infra` creates node group + S3 bucket in test-client AWS account |
+| 10 | Agent connects | Agent in kind sends heartbeat to relay, receives commands |
+| 12 | Provisioning works | Create service -> provisioner -> agent deploys CNPG -> PG running |
+| 14 | E2E in kind | Full loop works locally: login -> create DB -> get connection string |
+| 16 | Dashboard complete | Everything works from the UI |
+| 17 | Staging on AWS | Services running in real EKS via ArgoCD |
+| 19 | Observability live | Grafana dashboards with metrics, logs, traces from staging |
+| 20 | Security hardened | Vault secrets, mTLS, Kyverno policies, cert-manager TLS |
+| 20 | Customer onboarding works | Terraform module OR CloudFormation -> EKS discovery -> create first DB |
+| 22 | Billing live | Customer subscribes, gets invoiced via Stripe |
+| 22 | DBA intelligence | Recommendations, slow query detection, backup verification |
+| 24 | Production ready | All enterprise features, migrations, SSO/SAML, audit log |
+
+## Production Readiness Checklist (before first customer)
+
+- All services deploy via ArgoCD (no manual `kubectl apply`)
+- Grafana dashboards for every service (RED metrics: Rate, Errors, Duration)
+- SLOs defined and monitored (99.9% API, 99.99% database uptime)
+- On-call rotation set up with PagerDuty/OpsGenie
+- Runbooks for top 10 alert scenarios
+- DR tested: failover to second AZ, restore from backup
+- Security audit: Vault secrets, mTLS, Kyverno, no static credentials
+- Chaos testing: node failure, agent disconnect, CNPG failover
+- Backup verification: automated weekly restore test passes
+- Customer onboarding flow tested end-to-end with test-client AWS account
+- Billing tested: subscription -> usage -> invoice -> payment
+- Documentation complete: API docs (Redoc), user guides, admin guides
+- Legal: Terms of Service, Privacy Policy, DPA (Data Processing Agreement)
+- SOC2 Type 1 evidence collection started
diff --git a/platform/YAML-CONFIG-VALIDATOR.md b/platform/YAML-CONFIG-VALIDATOR.md
new file mode 100644
index 0000000..42fd1e4
--- /dev/null
+++ b/platform/YAML-CONFIG-VALIDATOR.md
@@ -0,0 +1,141 @@
+# YCC — YAML Config Validator
+
+> **Priority:** Low
+> **Repo:** `yaml-config-validator`
+> **Language:** Python (Pydantic v2)
+> **Status:** Planned
+
+## Problem
+
+The Kiven platform relies on YAML files across multiple repos:
+- **`platform-github-management`** — Repo definitions (`repos/backend/*.yaml`, `config/enforced.yaml`, `config/defaults.yaml`)
+- **Copier templates** — `copier.yml` with questions, validators, conditional logic
+- **Service configs** — `.mise.toml`, `Taskfile.yml`, CI workflows reference variables
+
+These YAMLs are validated only at runtime (sync script fails, Copier fails, CI fails).
+There is no **static validation** before merge. A typo in `template: servce-go` passes review
+and breaks provisioning.
+
+## Solution: YCC (YAML Config Checker)
+
+A Python CLI tool that validates YAML structures **statically** (schema) and **dynamically**
+(cross-references, variable resolution) using Pydantic v2.
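The dynamic side can be sketched in a few lines of plain Python: given the parsed repo definitions and the template registry, flag any repo whose `template` key points nowhere. The dict shapes and function name below are illustrative assumptions, not YCC's real API; this is exactly the class of check that would catch the `servce-go` typo above.

```python
def check_template_refs(repos: list[dict], templates: dict) -> list[str]:
    """Return one error string per repo whose `template` key is unknown."""
    errors = []
    for repo in repos:
        tmpl = repo.get("template")
        if tmpl is not None and tmpl not in templates:
            errors.append(f"{repo['name']}: unknown template '{tmpl}'")
    return errors


# Parsed stand-ins for config/templates.yaml and repos/backend/*.yaml.
templates = {"service-go": {"repo": "platform-templates-service-go"}}
repos = [
    {"name": "svc-api", "template": "service-go"},
    {"name": "svc-auth", "template": "servce-go"},  # the typo this check catches
]

print(check_template_refs(repos, templates))
```

Running it reports the bad reference for `svc-auth` and stays silent for repos with no `template` key at all (the "no template" case is a separate orphan check).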
+
+## What It Validates
+
+### Static Validation (schema)
+
+| Source | Validates |
+|--------|-----------|
+| `repos/backend/*.yaml` | Required fields (`name`, `description`, `type`), valid types, valid rulesets |
+| `repos/platform/*.yaml` | Template references exist in `config/templates.yaml` |
+| `config/enforced.yaml` | Schema matches expected structure |
+| `config/defaults.yaml` | Rulesets have required fields, label colors are valid hex |
+| `copier.yml` | Question types are valid, validators use valid Jinja2 syntax |
+| `renovate.json` | Matches Renovate JSON schema |
+
+### Dynamic Validation (cross-references)
+
+| Check | Description |
+|-------|-------------|
+| Template exists | If `template: service-go` is set, `config/templates.yaml` must have a `service-go` entry |
+| No orphan repos | Every repo in `repos/` must have a corresponding template (or `# No template` comment) |
+| Copier variables used | Variables defined in `copier.yml` must be referenced in at least one `.jinja` file |
+| Copier variables resolved | `.jinja` files must not reference undefined variables |
+| Workflow references valid | CI workflows referencing `kivenio/reusable-workflows` must point to existing workflow files |
+| Ruleset exists | `ruleset: strict` must exist in `config/defaults.yaml` rulesets |
+
+## Architecture
+
+```
+yaml-config-validator/
+├── ycc/
+│   ├── __init__.py
+│   ├── cli.py                   # Click CLI entrypoint
+│   ├── schemas/
+│   │   ├── repo_definition.py   # Pydantic model for repos/*.yaml
+│   │   ├── enforced_config.py   # Pydantic model for enforced.yaml
+│   │   ├── defaults_config.py   # Pydantic model for defaults.yaml
+│   │   ├── templates_config.py  # Pydantic model for templates.yaml
+│   │   └── copier_config.py     # Pydantic model for copier.yml
+│   ├── validators/
+│   │   ├── static.py            # Schema-only validation
+│   │   ├── cross_ref.py         # Cross-reference validation
+│   │   └── copier_vars.py       # Copier variable resolution
+│   └── reporters/
+│       ├── console.py           # Pretty terminal output
+│       └── github.py            # PR comment with validation results
+├── tests/
+│   ├── test_schemas.py
+│   ├── test_cross_ref.py
+│   └── fixtures/                # Sample valid/invalid YAMLs
+├── pyproject.toml
+└── README.md
+```
+
+## Usage (planned)
+
+```bash
+# Validate platform-github-management
+ycc validate ../platform-github-management/
+
+# Validate a Copier template
+ycc validate-template ../platform-templates-service-go/
+
+# Validate a specific file
+ycc validate-file repos/backend/core-services.yaml
+
+# CI mode (exit code 1 on failure, PR comment)
+ycc validate --ci --github-pr 42
+```
+
+## CI Integration (planned)
+
+A reusable workflow in `reusable-workflows` will call YCC on PRs to `platform-github-management`:
+
+```yaml
+# In platform-github-management/.github/workflows/validate.yml
+jobs:
+  validate:
+    uses: kivenio/reusable-workflows/.github/workflows/ycc-validate-reusable.yml@main
+    with:
+      config-path: "."
+```
+
+## Pydantic Model Example
+
+```python
+from pydantic import BaseModel, field_validator
+
+class RepoDefinition(BaseModel):
+    name: str
+    description: str
+    type: str  # service, library, platform, template, testing, documentation
+    template: str | None = None
+    ruleset: str | None = None
+    topics: list[str] = []
+    visibility: str = "private"
+
+    @field_validator("type")
+    @classmethod
+    def validate_type(cls, v: str) -> str:
+        valid = {"service", "library", "platform", "template", "testing", "documentation", "bootstrap", "sdk"}
+        if v not in valid:
+            raise ValueError(f"Invalid type '{v}', must be one of {valid}")
+        return v
+
+    @field_validator("name")
+    @classmethod
+    def validate_name(cls, v: str) -> str:
+        if not v.replace("-", "").isalnum():
+            raise ValueError(f"Name '{v}' must be alphanumeric with hyphens")
+        return v
+```
+
+## Why Pydantic v2
+
+- **Type safety** — Python type hints map directly to YAML schema
+- **Custom validators** — `@field_validator` for cross-reference checks
+- **Error messages** — Clear, structured errors with field paths
+- **JSON Schema export** — `model_json_schema()` generates JSON Schema for IDE autocomplete
+- **Fast** — Pydantic v2 (Rust core) validates thousands of files in milliseconds
diff --git a/providers/PROVIDER-INTERFACE.md b/providers/PROVIDER-INTERFACE.md
new file mode 100644
index 0000000..5fac6cc
--- /dev/null
+++ b/providers/PROVIDER-INTERFACE.md
@@ -0,0 +1,265 @@
+# Provider Interface
+## *Plugin Architecture for Multi-Operator Support*
+
+---
+
+> **Back to**: [Architecture Overview](../EntrepriseArchitecture.md)
+
+---
+
+# Why a Provider System
+
+Kiven starts with PostgreSQL (CNPG) but will expand to Kafka (Strimzi), Redis, Elasticsearch, and more. Instead of hardcoding CNPG throughout the codebase, we define a **Provider Interface** — a Go interface that every data service must implement.
+
+```
+Core Engine (operator-agnostic)
+    │
+    │  Calls provider.Provision(), provider.Scale(), etc.
+    │  Doesn't know or care which operator is underneath.
+    │
+    ▼
+Provider Interface (Go interface)
+    │
+    ├── CNPG Provider (Phase 1)      ← implements interface for PostgreSQL
+    ├── Strimzi Provider (Phase 3)   ← implements interface for Kafka
+    ├── Redis Provider (Phase 3)     ← implements interface for Redis
+    └── ECK Provider (Phase 3)       ← implements interface for Elasticsearch
+```
+
+---
+
+# The Interface
+
+```go
+package provider
+
+import "context"
+
+// Provider is the interface every data service must implement.
+// Core services (svc-provisioner, svc-clusters, etc.) call these methods
+// without knowing which operator is underneath.
+type Provider interface {
+    // Metadata
+    Name() string                // "cnpg", "strimzi", "redis"
+    DisplayName() string         // "PostgreSQL", "Kafka", "Redis"
+    Version() string             // Provider version ("1.0.0")
+    SupportedVersions() []string // Data service versions ("15", "16", "17")
+
+    // Discovery (used by agent)
+    Detect(ctx context.Context) (*DetectResult, error)
+    // Returns: operator installed? version? CRDs registered?
+ + // Prerequisites + CheckPrerequisites(ctx context.Context, plan ServicePlan) (*PrereqReport, error) + // Returns: what's ready, what's missing, what needs fixing + + // Lifecycle + Provision(ctx context.Context, spec ClusterSpec) (*ClusterStatus, error) + Scale(ctx context.Context, id string, spec ScaleSpec) error + Upgrade(ctx context.Context, id string, targetVersion string) error + Delete(ctx context.Context, id string, retainVolumes bool) error + PowerOff(ctx context.Context, id string) error + PowerOn(ctx context.Context, id string) error + + // Status + GetStatus(ctx context.Context, id string) (*ClusterStatus, error) + ListClusters(ctx context.Context) ([]ClusterSummary, error) + + // YAML Generation (for Advanced Mode) + GenerateYAML(ctx context.Context, spec ClusterSpec) ([]YAMLResource, error) + ValidateYAML(ctx context.Context, yaml string) (*ValidationResult, error) + DiffYAML(ctx context.Context, id string, newYAML string) (*DiffResult, error) + + // Users & Access + ListUsers(ctx context.Context, id string) ([]DatabaseUser, error) + CreateUser(ctx context.Context, id string, spec UserSpec) (*DatabaseUser, error) + DeleteUser(ctx context.Context, id string, username string) error + UpdatePermissions(ctx context.Context, id string, username string, perms Permissions) error + + // Databases + ListDatabases(ctx context.Context, id string) ([]Database, error) + CreateDatabase(ctx context.Context, id string, spec DatabaseSpec) (*Database, error) + DeleteDatabase(ctx context.Context, id string, dbName string) error + + // Backups + ListBackups(ctx context.Context, id string) ([]Backup, error) + TriggerBackup(ctx context.Context, id string) (*Backup, error) + Restore(ctx context.Context, id string, target RestoreTarget) error + VerifyBackup(ctx context.Context, id string, backupID string) (*VerificationResult, error) + + // Metrics + CollectMetrics(ctx context.Context, id string) (*MetricsSnapshot, error) + GetConnectionInfo(ctx context.Context, id 
string) (*ConnectionInfo, error) + + // Configuration + GetConfig(ctx context.Context, id string) (*ServiceConfig, error) + UpdateConfig(ctx context.Context, id string, params map[string]string) error + + // Extensions / Plugins (service-specific) + ListExtensions(ctx context.Context, id string) ([]Extension, error) + EnableExtension(ctx context.Context, id string, extName string) error + DisableExtension(ctx context.Context, id string, extName string) error +} +``` + +--- + +# Key Types + +```go +// ClusterSpec defines what to provision +type ClusterSpec struct { + Name string + ServiceVersion string // "17" for PG 17 + Plan ServicePlan // Hobbyist, Startup, etc. + Instances int // Number of instances (1-5) + StorageSize string // "50Gi" + StorageIOPS int // 3000 + BackupSchedule string // "0 */6 * * *" + BackupRetention int // days + Parameters map[string]string // postgresql.conf overrides + Extensions []string // pg_vector, PostGIS... + PoolerEnabled bool + PoolerMode string // "transaction" + PoolerPoolSize int // 100 + TLSEnabled bool + Namespace string + Labels map[string]string +} + +// ClusterStatus is the current state +type ClusterStatus struct { + ID string + Name string + Phase string // "Healthy", "Provisioning", "Failing", "PoweredOff" + Instances int + ReadyInstances int + PrimaryPod string + ReplicaPods []string + ReplicationLag []ReplicaLag + StorageUsed string + StorageTotal string + ServiceVersion string + CreatedAt time.Time + ConnectionInfo ConnectionInfo +} + +// ConnectionInfo for the customer +type ConnectionInfo struct { + Host string // pg-main-rw.kiven-databases.svc + Port int // 5432 + ReadOnlyHost string // pg-main-ro.kiven-databases.svc + PoolerHost string // pg-main-pooler.kiven-databases.svc + Database string + Username string + PasswordRef string // K8s secret reference + SSLMode string // "require" +} + +// YAMLResource for Advanced Mode +type YAMLResource struct { + Kind string // "Cluster", "Pooler", "ScheduledBackup" + Name 
string + YAML string // The full YAML content + Checksum string // For diff detection +} +``` + +--- + +# CNPG Provider Implementation (Phase 1) + +The CNPG provider is the first (and currently only) implementation: + +```go +type CNPGProvider struct { + kubeClient client.Client // K8s client (via agent) + pgClient *pgxpool.Pool // PG connection (for stats, users) +} + +func (p *CNPGProvider) Name() string { return "cnpg" } +func (p *CNPGProvider) DisplayName() string { return "PostgreSQL" } +``` + +### How It Maps to CNPG CRDs + +| Provider Method | CNPG Action | +|----------------|-------------| +| `Provision()` | Create `Cluster` CR + `Pooler` CR + `ScheduledBackup` CR | +| `Scale()` | Update `Cluster.spec.instances` | +| `Upgrade()` | Update `Cluster.spec.imageName` (rolling update) | +| `Delete()` | Delete `Cluster` CR (PVCs retained if `retainVolumes=true`) | +| `PowerOff()` | Delete `Cluster` CR with `retainVolumes=true`, agent reports to svc-infra to scale nodes to 0 | +| `PowerOn()` | svc-infra scales nodes up, then re-apply `Cluster` CR with existing PVCs | +| `TriggerBackup()` | Create `Backup` CR | +| `Restore()` | Create new `Cluster` CR with `bootstrap.recovery` | +| `CreateUser()` | Execute SQL: `CREATE ROLE ... LOGIN PASSWORD ...` | +| `UpdateConfig()` | Update `Cluster.spec.postgresql.parameters` | +| `EnableExtension()` | Update `Cluster.spec.postgresql.shared_preload_libraries` + SQL `CREATE EXTENSION` | +| `GenerateYAML()` | Render CNPG CRD templates with ClusterSpec values | + +--- + +# Adding a New Provider (Future) + +To add Strimzi (Kafka) support: + +1. **Create** `provider-strimzi/` repository +2. **Implement** the `Provider` interface for Strimzi CRDs +3. **Map** Strimzi CRDs to provider methods: + - `Provision()` → Create `Kafka` CR + - `Scale()` → Update `Kafka.spec.kafka.replicas` + - `CreateUser()` → Create `KafkaUser` CR + - etc. +4. **Register** the provider in the provider registry +5. 
**Update** the agent to watch Strimzi CRDs (auto-detected)
+6. **Add** Kafka-specific UI components to dashboard
+7. Core services (provisioner, billing, audit) work automatically — they call the interface, not the implementation.
+
+---
+
+# Provider Registry
+
+```go
+// Registry holds all available providers
+type Registry struct {
+	providers map[string]Provider
+}
+
+// Register adds a provider, keyed by its Name().
+func (r *Registry) Register(p Provider) {
+	r.providers[p.Name()] = p
+}
+
+func NewRegistry() *Registry {
+	r := &Registry{providers: make(map[string]Provider)}
+	r.Register(cnpg.NewProvider()) // Phase 1
+	// r.Register(strimzi.NewProvider()) // Phase 3
+	// r.Register(redis.NewProvider()) // Phase 3
+	return r
+}
+
+func (r *Registry) Get(name string) (Provider, error) {
+	p, ok := r.providers[name]
+	if !ok {
+		return nil, fmt.Errorf("provider %q not found", name)
+	}
+	return p, nil
+}
+
+// List returns all registered providers.
+func (r *Registry) List() []Provider {
+	out := make([]Provider, 0, len(r.providers))
+	for _, p := range r.providers {
+		out = append(out, p)
+	}
+	return out
+}
+```
+
+---
+
+# Design Decisions
+
+| Decision | Rationale |
+|----------|-----------|
+| **Go interface** | Type-safe, compile-time verification, idiomatic for K8s ecosystem |
+| **Single interface for all providers** | Core services don't need provider-specific code |
+| **Provider methods are high-level** | Providers handle CRD-specific details internally |
+| **YAML generation in provider** | Each provider knows its CRD schema |
+| **Agent auto-detection** | Agent discovers which operators are installed, activates relevant modules |
+
+---
+
+*Maintained by: Backend Team*
+*Last updated: February 2026*

From 2a8e6f15a3c35123ab753b5112757359456b94f9 Mon Sep 17 00:00:00 2001
From: NasrLadib
Date: Thu, 26 Mar 2026 11:40:24 +0100
Subject: [PATCH 6/6] docs: update GitOps documentation to reflect transition from ArgoCD to Flux

Revise multiple documents to replace references to ArgoCD with Flux, including architecture, onboarding, and platform engineering guides. Update GitOps flow descriptions, deployment instructions, and Terraform module details to align with the new implementation. 
Ensure consistency across all related documentation. --- EntrepriseArchitecture.md | 20 +- GLOSSARY.md | 7 +- bootstrap/BOOTSTRAP-GUIDE.md | 12 +- development/LOCAL-DEV-GUIDE.md | 6 +- development/TEMPLATE-USAGE-GUIDE.md | 2 +- networking/NETWORKING-ARCHITECTURE.md | 10 +- observability/OBSERVABILITY-GUIDE.md | 2 +- onboarding/CUSTOMER-ONBOARDING.md | 73 +- plans/KIVEN_ROADMAP.md | 1062 ++++++++++++++++--------- platform/PLATFORM-ENGINEERING.md | 18 +- resilience/DR-GUIDE.md | 14 +- testing/TESTING-STRATEGY.md | 4 +- 12 files changed, 767 insertions(+), 463 deletions(-) diff --git a/EntrepriseArchitecture.md b/EntrepriseArchitecture.md index 09ebd0f..746a12c 100644 --- a/EntrepriseArchitecture.md +++ b/EntrepriseArchitecture.md @@ -74,7 +74,7 @@ Kiven is designed for: | Category | Choice | Rationale | |----------|--------|-----------| | **Cloud** | AWS (eu-west-1) | GDPR, proximity to EU customers | -| **Orchestration** | EKS + ArgoCD | GitOps, cloud-native | +| **Orchestration** | EKS + Flux | GitOps, cloud-native | | **Backend** | Go (stdlib + chi) | K8s ecosystem is Go, fast, small binaries | | **Frontend** | Next.js 14+ (App Router) + Tailwind + shadcn/ui | Modern, fast, beautiful | | **Agent** | Go (client-go + controller-runtime) | Native K8s SDK, single binary | @@ -166,7 +166,7 @@ Kiven is designed for: │ │ │ │ │ │ ┌──────────────────────────────────────────────────────────────┐ │ │ │ │ │ PLATFORM NODE POOL (taints: platform=true:NoSchedule) │ │ │ -│ │ │ • ArgoCD • Cilium • Vault Agent │ │ │ +│ │ │ • Flux • Cilium • Vault Agent │ │ │ │ │ │ • OTel Collector • Prometheus • Grafana │ │ │ │ │ │ • Loki • Tempo • Kyverno │ │ │ │ │ └──────────────────────────────────────────────────────────────┘ │ │ @@ -484,11 +484,11 @@ For DevOps/Platform engineers who want full control. Like Lens for Kubernetes. 
| `maintenance/v*.x.x` | Version maintenance | Cherry-pick from main only |
| `feature/*` | Development | Short-lived, merge to main |

-## 3.2 GitOps Flow (ArgoCD)
+## 3.2 GitOps Flow (Flux)

-- **Centralized ArgoCD**: Single instance managing all environments
-- **App-of-Apps pattern**: ApplicationSets with Git + Matrix generators
-- **Auto-sync**: Dev auto-sync, Staging/Prod manual approval
+- **Centralized Flux**: Single instance managing all environments
+- **Kustomization/HelmRelease pattern**: GitRepository sources reconciled by the Kustomize and Helm controllers
+- **Auto-reconcile**: Dev auto-reconcile, Staging/Prod manual approval

## 3.3 Environments

@@ -546,7 +546,7 @@ For DevOps/Platform engineers who want full control. Like Lens for Kubernetes.

| Repo | Description |
|------|-------------|
-| `platform-gitops/` | ArgoCD, ApplicationSets |
+| `platform-gitops/` | Flux, Kustomizations, HelmReleases |
| `platform-networking/` | Cilium, Gateway API |
| `platform-observability/` | OTel, Prometheus, Loki, Tempo, Grafana |
| `platform-security/` | Vault, External-Secrets, Kyverno |
@@ -800,7 +800,7 @@ For DevOps/Platform engineers who want full control. Like Lens for Kubernetes.

| Phase | Focus | Duration |
|-------|-------|----------|
| **1** | Bootstrap Layer 0-1 (IAM, VPC, EKS) | 3 weeks |
-| **2** | Platform GitOps (ArgoCD) | 1 week |
+| **2** | Platform GitOps (Flux) | 1 week |
| **3** | Platform Networking (Cilium, Gateway API) + Cloudflare | 2 weeks |
| **4** | Platform Security (Vault, Kyverno) | 2 weeks |
| **5** | Platform Observability (Prometheus, Loki, Tempo) | 2 weeks |
@@ -840,7 +840,7 @@ For DevOps/Platform engineers who want full control. Like Lens for Kubernetes. 
- [ ] Cross-account IAM for customer infra access - [ ] Provider/plugin architecture for multi-operator future - [ ] Aiven for Kiven product DB + Kafka -- [ ] ArgoCD centralized +- [ ] Flux centralized - [ ] Cilium + Gateway API - [ ] Kyverno - [ ] HashiCorp Vault self-hosted @@ -898,7 +898,7 @@ For DevOps/Platform engineers who want full control. Like Lens for Kubernetes. | **DR Guide** | Backup, recovery, SaaS DR + customer DB DR | [resilience/DR-GUIDE.md](resilience/DR-GUIDE.md) | | **Agent Architecture** | Agent design, gRPC protocol, deployment | [agent/AGENT-ARCHITECTURE.md](agent/AGENT-ARCHITECTURE.md) | | **Customer Infra Management** | Nodes, storage, S3, IAM, cross-account | [infra/CUSTOMER-INFRA-MANAGEMENT.md](infra/CUSTOMER-INFRA-MANAGEMENT.md) | -| **Customer Onboarding** | CloudFormation, EKS discovery, provisioning | [onboarding/CUSTOMER-ONBOARDING.md](onboarding/CUSTOMER-ONBOARDING.md) | +| **Customer Onboarding** | Terraform module, EKS discovery, provisioning | [onboarding/CUSTOMER-ONBOARDING.md](onboarding/CUSTOMER-ONBOARDING.md) | | **Provider Interface** | Plugin architecture, Go interface, adding providers | [providers/PROVIDER-INTERFACE.md](providers/PROVIDER-INTERFACE.md) | | **Glossary** | All terminology | [GLOSSARY.md](GLOSSARY.md) | diff --git a/GLOSSARY.md b/GLOSSARY.md index b6455e7..1135e07 100644 --- a/GLOSSARY.md +++ b/GLOSSARY.md @@ -88,7 +88,7 @@ | **IRSA (IAM Roles for Service Accounts)** | AWS feature mapping K8s ServiceAccounts to IAM roles. CNPG uses IRSA to write backups to S3. | | **AssumeRole** | AWS IAM action to temporarily take on another role's permissions. Kiven assumes customer's `KivenAccessRole`. | | **Cross-Account Access** | Pattern where one AWS account accesses resources in another account via IAM role trust. | -| **CloudFormation** | AWS IaC service. Kiven provides a CF template for customers to create the access role. | +| **Terraform** | HashiCorp IaC tool. 
Kiven provides a Terraform module for customers to create the access role. | | **KMS (Key Management Service)** | AWS encryption key management. Used for EBS and S3 encryption. | | **Managed Node Group** | EKS feature for managed EC2 instances as K8s worker nodes. Kiven creates dedicated node groups for databases. | | **Taints** | K8s mechanism to repel pods from nodes. Kiven taints DB nodes so only DB pods run there. | @@ -134,8 +134,9 @@ |------|------------| | **Provider Interface** | Go interface that each data service (CNPG, Strimzi, Redis) implements. Enables multi-operator support. | | **Plugin Architecture** | Design pattern where functionality is added via plugins without modifying core code. | -| **GitOps** | Managing infrastructure and apps using Git as single source of truth. ArgoCD pulls from Git. | -| **Infrastructure as Code (IaC)** | Managing infra through code (Terraform, CloudFormation) rather than manual processes. | +| **GitOps** | Managing infrastructure and apps using Git as single source of truth. Flux reconciles from Git. | +| **Infrastructure as Code (IaC)** | Managing infra through code (Terraform) rather than manual processes. | +| **Stategraph** | Terraform/OpenTofu state backend using PostgreSQL instead of flat state files. Enables parallel plans, no lock waiting, SQL-queryable state. See [stategraph.com](https://stategraph.com/). **Planned for Q4 2026** — currently using S3. | | **Trunk-Based Development** | All developers merge to main branch. Short-lived feature branches. | | **C4 Model** | Architecture documentation: Context, Container, Component, Code diagrams. | | **Defense in Depth** | Multiple security layers so one breach doesn't compromise everything. 
| diff --git a/bootstrap/BOOTSTRAP-GUIDE.md b/bootstrap/BOOTSTRAP-GUIDE.md index 7bf0359..2052ebe 100644 --- a/bootstrap/BOOTSTRAP-GUIDE.md +++ b/bootstrap/BOOTSTRAP-GUIDE.md @@ -52,7 +52,7 @@ ├─────────────────────────────────────────────────────────────────────┤ │ • SSO (groups, permission sets) │ │ • Custom Controls (aws_controltower_control) │ -│ • Account Factory (baseline: OIDC, KMS, S3 state) │ +│ • Account Factory (baseline: OIDC, KMS, S3 state) │ │ • Shared Services Account │ └─────────────────────────────────────────────────────────────────────┘ │ @@ -114,6 +114,12 @@ > **Repo: `bootstrap/`** — GitHub Actions CI/CD +## Terraform State — S3 + +Terraform state is stored in a versioned, encrypted S3 bucket (`localplus-terraform-state-mgmt`) with state locking via S3 native locking (`use_lockfile = true`). + +> **Future**: Migration to [Stategraph](https://stategraph.com/) planned for Q4 2026 (parallel plans, SQL queryable state). + ## What we manage in Terraform | Component | Module | Description | @@ -164,7 +170,7 @@ | Resource | Purpose | |----------|---------| | AWS Account | In appropriate OU | -| S3 Bucket | Terraform state | +| Terraform state | S3 bucket (versioned, encrypted) | | GitHub OIDC | CI/CD authentication | | KMS Keys | Encryption (terraform, secrets, eks) | | Security Baseline | EBS encryption, S3 block public | @@ -194,7 +200,7 @@ | 3 | EKS Cluster | VPC, KMS | | 4 | IRSA | EKS | | 5 | VPC Peering (Aiven) | VPC, Aiven project | -| 6 | ArgoCD | EKS | +| 6 | Flux | EKS | ## Providers diff --git a/development/LOCAL-DEV-GUIDE.md b/development/LOCAL-DEV-GUIDE.md index cf027a1..8b188bc 100644 --- a/development/LOCAL-DEV-GUIDE.md +++ b/development/LOCAL-DEV-GUIDE.md @@ -20,7 +20,7 @@ Each svc-* repo: Real customers Go code + Dockerfile Full AWS integration: task init (mise + tools) node groups, EBS, S3, IAM Own CI (reusable GH wf) - ArgoCD deployment + Flux deployment ``` --- @@ -475,9 +475,9 @@ Move to Level 2 when: AWS Account: kiven-sandbox 
(eu-west-1) │ ├── EKS "kiven-dev" -│ ├── Kiven services (deployed via ArgoCD) +│ ├── Kiven services (deployed via Flux) │ ├── Aiven VPC peering (product DB) -│ └── Platform stack (Prometheus, Loki, ArgoCD) +│ └── Platform stack (Prometheus, Loki, Flux) │ ├── EKS "test-client" │ ├── Simulates a real customer cluster diff --git a/development/TEMPLATE-USAGE-GUIDE.md b/development/TEMPLATE-USAGE-GUIDE.md index 47025a8..6a04fc6 100644 --- a/development/TEMPLATE-USAGE-GUIDE.md +++ b/development/TEMPLATE-USAGE-GUIDE.md @@ -85,7 +85,7 @@ update the template once and each repo pulls the update via `copier update`. =>> | `service-go` | `platform-templates-service-go` | Go microservices (`svc-*`) | chi, gRPC, OTel, Dockerfile, air, Testcontainers | | `sdk-go` | `platform-templates-sdk-go` | Go libraries (`kiven-go-sdk`, `provider-*`, `kiven-cli`) | No Dockerfile, no cmd/, library-focused | | `infrastructure` | `platform-templates-infrastructure` | Terraform modules (`bootstrap`, `infra-customer-*`) | tflint, Checkov, terraform-docs | -| `platform-component` | `platform-templates-platform-component` | GitOps components (`platform-gitops`, `platform-security`) | Helm/Kustomize, ArgoCD integration | +| `platform-component` | `platform-templates-platform-component` | GitOps components (`platform-gitops`, `platform-security`) | Helm/Kustomize, Flux integration | | `documentation` | `platform-templates-documentation` | Doc sites (`docs`) | MkDocs Material, ADR template | ## Template Structure (Copier) diff --git a/networking/NETWORKING-ARCHITECTURE.md b/networking/NETWORKING-ARCHITECTURE.md index 5790d32..ed763ea 100644 --- a/networking/NETWORKING-ARCHITECTURE.md +++ b/networking/NETWORKING-ARCHITECTURE.md @@ -68,7 +68,7 @@ │ │ │ │ Instance: m6i.xlarge (dedicated resources) │ │ │ │ │ │ │ ├─────────────────────────────────────────────────────────┤ │ │ │ │ │ │ │ PLATFORM NAMESPACE │ │ │ │ -│ │ │ │ • ArgoCD, Cilium, Vault, Kyverno, OTel, Grafana │ │ │ │ +│ │ │ │ • Flux, Cilium, Vault, 
Kyverno, OTel, Grafana │ │ │ │ │ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ │ │ │ │ │ │ │ │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ @@ -99,7 +99,7 @@ | Node Pool | Taints | Usage | Instance Type | Scaling | |-----------|--------|-------|---------------|---------| -| **platform** | `platform=true:NoSchedule` | ArgoCD, Monitoring, Security tools | m6i.xlarge | Fixed (2-3 nodes) | +| **platform** | `platform=true:NoSchedule` | Flux, Monitoring, Security tools | m6i.xlarge | Fixed (2-3 nodes) | | **application** | None (default) | Domain services | m6i.large | HPA (2-10 nodes) | | **spot** (optionnel) | `spot=true:PreferNoSchedule` | Batch jobs, non-critical | m6i.large (spot) | Auto (0-5 nodes) | @@ -289,7 +289,7 @@ | Resource | Policy | Authentication | |----------|--------|----------------| | **grafana.localplus.io** | Team only | GitHub SSO | -| **argocd.localplus.io** | Team only | GitHub SSO | +| **flux.localplus.io** | Team only | GitHub SSO | | **api.localplus.io/admin** | Admin only | GitHub SSO + MFA | | **api.localplus.io/*** | Public | No auth (application handles) | @@ -305,7 +305,7 @@ | CNAME | www | @ | ☁️ ON | Auto | | CNAME | api | tunnel-xxx.cfargotunnel.com | ☁️ ON | Auto | | CNAME | grafana | tunnel-xxx.cfargotunnel.com | ☁️ ON | Auto | -| CNAME | argocd | tunnel-xxx.cfargotunnel.com | ☁️ ON | Auto | +| CNAME | flux | tunnel-xxx.cfargotunnel.com | ☁️ ON | Auto | | TXT | @ | SPF record | ☁️ OFF | Auto | | TXT | _dmarc | DMARC policy | ☁️ OFF | Auto | | MX | @ | Mail provider | ☁️ OFF | Auto | @@ -452,7 +452,7 @@ | **Cloudflare** | ✅ Oui | Load balancing global, health checks multi-origin | | **APISIX** | ✅ Oui | Déployable sur tout K8s (EKS, GKE, AKS) | | **Aiven** | ✅ Oui | PostgreSQL, Kafka, Valkey disponibles sur AWS/GCP/Azure | -| **ArgoCD** | ✅ Oui | Peut gérer des clusters multi-cloud | +| **Flux** | ✅ Oui | Peut gérer des clusters multi-cloud | | **Vault** | ✅ Oui | Réplication 
cross-datacenter | | **OTel** | ✅ Oui | Standard ouvert, backends interchangeables | diff --git a/observability/OBSERVABILITY-GUIDE.md b/observability/OBSERVABILITY-GUIDE.md index 511ef4f..a2612bf 100644 --- a/observability/OBSERVABILITY-GUIDE.md +++ b/observability/OBSERVABILITY-GUIDE.md @@ -159,7 +159,7 @@ Le **Prometheus Operator** utilise des **Custom Resources** pour configurer auto | **svc-provisioner** | 8082 | `/metrics` | Provisioning pipeline metrics | | **kiven-agent** | 9090 | `/metrics` | Agent-side CNPG + PG metrics | | **Grafana** | 3000 | `/metrics` | Internal metrics | -| **ArgoCD** | 8083 | `/metrics` | Application sync metrics | +| **Flux** | 8080 | `/metrics` | Reconciliation metrics | | **Node Exporter** | 9100 | `/metrics` | System metrics (CPU, RAM, disk) | --- diff --git a/onboarding/CUSTOMER-ONBOARDING.md b/onboarding/CUSTOMER-ONBOARDING.md index e928ba6..6fe2986 100644 --- a/onboarding/CUSTOMER-ONBOARDING.md +++ b/onboarding/CUSTOMER-ONBOARDING.md @@ -11,8 +11,8 @@ ``` Step 1 Step 2 Step 3 Step 4 Step 5 -Sign up → Deploy CFN → Connect EKS → Create DB → Connected! -(1 min) template (2 min) (1 min) (5-7 min) +Sign up → Deploy TF → Connect EKS → Create DB → Connected! +(1 min) module (2 min) (1 min) (5-7 min) ``` --- @@ -26,59 +26,36 @@ Customer creates account on kiven.io: --- -# Step 2: Deploy CloudFormation Template (2 minutes) +# Step 2: Deploy Terraform Module (2 minutes) -Customer deploys Kiven's CloudFormation template in their AWS account. This creates the `KivenAccessRole` IAM role. +Customer deploys Kiven's Terraform module in their AWS account. This creates the `KivenAccessRole` IAM role. ### How It Works 1. Kiven dashboard shows: "Connect your AWS account" -2. Customer clicks → redirected to AWS CloudFormation console with pre-filled template URL -3. Customer reviews and clicks "Create Stack" -4. Stack creates: +2. Customer copies the Terraform module configuration (or uses the Terraform Registry) +3. 
Customer runs `terraform init` and `terraform apply` +4. Module creates: - IAM Role `KivenAccessRole` (trusts Kiven's AWS account) - IAM Policy `KivenAccessPolicy` (scoped permissions) - ExternalId parameter (unique per customer, prevents confused deputy) -5. Stack outputs: Role ARN → customer copies back to Kiven dashboard - -### CloudFormation Template (Summary) - -```yaml -AWSTemplateFormatVersion: '2010-09-09' -Description: Kiven Access Role — Allows Kiven to manage database infrastructure - -Parameters: - ExternalId: - Type: String - Description: Unique ID provided by Kiven (do not change) - KivenAccountId: - Type: String - Default: '123456789012' # Kiven's AWS account ID - -Resources: - KivenAccessRole: - Type: AWS::IAM::Role - Properties: - RoleName: KivenAccessRole - AssumeRolePolicyDocument: - Version: '2012-10-17' - Statement: - - Effect: Allow - Principal: - AWS: !Sub 'arn:aws:iam::${KivenAccountId}:root' - Action: 'sts:AssumeRole' - Condition: - StringEquals: - 'sts:ExternalId': !Ref ExternalId - Policies: - - PolicyName: KivenAccessPolicy - PolicyDocument: - # ... (see CUSTOMER-INFRA-MANAGEMENT.md for full policy) - -Outputs: - RoleArn: - Description: Paste this ARN in the Kiven dashboard - Value: !GetAtt KivenAccessRole.Arn +5. Terraform outputs: Role ARN → customer copies back to Kiven dashboard + +### Terraform Module (Summary) + +```hcl +module "kiven_access" { + source = "kivenio/kiven/aws" + version = "~> 1.0" + + external_id = var.kiven_external_id # Provided by Kiven dashboard + kiven_account_id = "123456789012" # Kiven's AWS account ID +} + +output "role_arn" { + description = "Paste this ARN in the Kiven dashboard" + value = module.kiven_access.role_arn +} ``` --- @@ -224,7 +201,7 @@ When a customer deletes their database: When a customer removes Kiven entirely: 1. All databases must be deleted first (or exported) 2. Kiven agent uninstalled (`helm uninstall kiven-agent`) -3. Customer deletes CloudFormation stack (removes IAM role) +3. 
Customer runs `terraform destroy` (removes IAM role) 4. Kiven retains customer metadata for 90 days (GDPR), then purges --- diff --git a/plans/KIVEN_ROADMAP.md b/plans/KIVEN_ROADMAP.md index 597de06..119cec4 100644 --- a/plans/KIVEN_ROADMAP.md +++ b/plans/KIVEN_ROADMAP.md @@ -1,532 +1,852 @@ name: Kiven Team Roadmap -overview: A phased roadmap for the Kiven team to go from current state (~10% implemented) to production-ready. 6 phases covering foundation, core services, orchestration, dashboard, production infrastructure, and enterprise features. +overview: Quarterly roadmap (Q1-Q4 2026) with weekly tracking. From foundation to production-ready managed PostgreSQL platform. todos: - - id: phase1-sdk - content: "Phase 1A: Complete kiven-go-sdk (error types, middleware, OTel instrumentation, DB helpers)" + - id: q1-sdk + content: "Q1: Complete kiven-go-sdk (error types, middleware, OTel, DB helpers)" status: pending - - id: phase1-proto - content: "Phase 1A: Create contracts-proto (buf setup, agent.proto, metrics.proto)" + - id: q1-proto + content: "Q1: Create contracts-proto (buf setup, agent.proto, metrics.proto)" status: pending - - id: phase1-api - content: "Phase 1B: Scaffold svc-api (chi router, DB layer, first read endpoints)" + - id: q1-api-scaffold + content: "Q1: Scaffold svc-api (chi router, DB layer, first read endpoints)" status: pending - - id: phase1-templates - content: "Phase 1C: Apply Copier templates to all Phase 2 repos + create sdk-go template" + - id: q1-templates + content: "Q1: Apply Copier templates to all repos + create sdk-go template" status: pending - - id: phase2-auth - content: "Phase 2A: Build svc-auth (OIDC, API keys, RBAC)" + - id: q1-auth-start + content: "Q1: Start svc-auth (OIDC login)" status: pending - - id: phase2-cnpg - content: "Phase 2B: Build provider-cnpg (YAML generation, status parsing)" + - id: q2-auth + content: "Q2: Complete svc-auth (API keys, RBAC)" status: pending - - id: phase2-infra - content: "Phase 2C: Build 
svc-infra (AWS SDK, node groups, S3, IRSA)" + - id: q2-cnpg + content: "Q2: Build provider-cnpg (YAML generation, status parsing)" status: pending - - id: phase2-relay - content: "Phase 2B: Build svc-agent-relay (gRPC server, agent registration)" + - id: q2-infra + content: "Q2: Build svc-infra (AWS SDK, node groups, S3, IRSA)" status: pending - - id: phase2-api-endpoints - content: "Phase 2D: Implement svc-api CRUD endpoints" + - id: q2-agent + content: "Q2: Build kiven-agent (CNPG informers, gRPC client, command executor)" status: pending - - id: phase3-agent - content: "Phase 3A: Build kiven-agent (CNPG informers, gRPC client, command executor)" + - id: q2-provisioner + content: "Q2: Build svc-provisioner (state machine, full pipeline)" status: pending - - id: phase3-provisioner - content: "Phase 3B: Build svc-provisioner (state machine, orchestrate infra + agent)" + - id: q2-e2e + content: "Q2: End-to-end in kind (create DB → get connection string)" status: pending - - id: phase3-services - content: "Phase 3B: Build svc-clusters, svc-backups, svc-users" + - id: q2-dashboard + content: "Q2: Dashboard API integration + auth flow + real data" status: pending - - id: phase3-e2e - content: "Phase 3: End-to-end integration test in kind" + - id: q3-gitops + content: "Q3: Production deployment (Flux, Helm, staging EKS)" status: pending - - id: phase4-dashboard - content: "Phase 4: Dashboard API integration, auth flow, real data" + - id: q3-observability + content: "Q3: Observability stack (Prometheus, Loki, Tempo, Grafana)" status: pending - - id: phase5-deploy - content: "Phase 5A: Production deployment (ArgoCD, Helm charts, platform-gitops)" + - id: q3-security + content: "Q3: Security hardening (Vault, mTLS, Kyverno, cert-manager)" status: pending - - id: phase5-observability - content: "Phase 5B: Observability stack (Prometheus, Loki, Tempo, Grafana, OTel)" + - id: q3-onboarding + content: "Q3: Customer onboarding (Terraform module, EKS discovery, wizard)" status: 
pending - - id: phase5-security - content: "Phase 5C: Security hardening (Vault, mTLS, Kyverno, cert-manager)" + - id: q3-enterprise + content: "Q3: Enterprise services (monitoring, billing, audit, notifications)" status: pending - - id: phase5-onboarding - content: "Phase 5D: Customer onboarding (Terraform module, EKS discovery)" + - id: q3-first-customer + content: "Q3: First customer live on production" status: pending - - id: phase6-monitoring - content: "Phase 6A: svc-monitoring (DBA intelligence, alerts, query optimizer)" + - id: q4-stategraph + content: "Q4: Migrate Terraform state from S3 to Stategraph" status: pending - - id: phase6-billing - content: "Phase 6B: svc-billing (Stripe, usage tracking, invoices)" + - id: q4-migrations + content: "Q4: svc-migrations (import from Aiven, RDS, bare PG)" status: pending - - id: phase6-enterprise - content: "Phase 6C: Enterprise features (svc-audit, svc-notification, svc-yamleditor, svc-migrations)" + - id: q4-soc2 + content: "Q4: SOC2 Type 1 evidence collection" status: pending isProject: false --- -# Kiven Team Roadmap -- From Foundation to Production +# Kiven Roadmap 2026 — Quarterly Plan with Weekly Tracking -## Current State +## Current State (as of Feb 23, 2026) -What works today: +| Asset | Status | +|-------|--------| +| `bootstrap` (Terraform SSO, Control Tower) | Done | +| `platform-github-management` (repo sync) | Done | +| `reusable-workflows` (8 workflows, 5 actions) | Done | +| `platform-templates-service-go` (Copier) | Done | +| `kiven-go-sdk` (15 files, 4 tests) | Partial | +| `dashboard` (14 pages) | Scaffolded, no API | +| `kiven-dev` (Taskfile, kind, CNPG) | Done | +| `svc-api` (OpenAPI spec) | Spec only, no Go code | +| All other services | Nothing | -- **Platform tooling**: `bootstrap` (Terraform), `platform-github-management` (sync script), `reusable-workflows` (8 workflows + 5 actions), `platform-templates-service-go` (Copier template), `docs` (18 docs) -- **SDK**: `kiven-go-sdk` (15 Go files, 
4 test files -- provider interface, 11 models, config, logging) -- **Dashboard**: 14 pages scaffolded, no API integration -- **Dev environment**: `kiven-dev` with Taskfile, migrations, kind/CNPG setup -- **API contract**: `svc-api` has OpenAPI spec, no Go code yet - -What does NOT work: no service has Go code, no agent, no provider, no provisioning pipeline, no auth, no customer-facing functionality. +**What does NOT work**: No service has Go code, no agent, no provider, no provisioning pipeline, no auth, no customer-facing functionality. ## Production Target -**Production = A customer can:** +A customer can: 1. Sign up and log in (OIDC + SSO/SAML) -2. Register their EKS cluster (Terraform module or CloudFormation one-click) -3. Click "Create Database" and get a PostgreSQL connection string in ~10 minutes +2. Register their EKS cluster (Terraform module) +3. Click "Create Database" → PostgreSQL connection string in ~10 minutes 4. See metrics, logs, backups, users, connection info in the dashboard 5. Get DBA recommendations, alerts, and performance insights 6. Power on/off databases on schedule 7. Pay via Stripe with usage tracking 8. Have full audit trail of all operations -**6 phases**: Foundation -> Core Services -> Orchestration -> Dashboard -> Production Infrastructure -> Enterprise Features. 
+## Team (4 developers) + +| Role | Path | +|------|------| +| **Dev 1** — Backend Lead | SDK → svc-auth → svc-provisioner → svc-monitoring → svc-billing | +| **Dev 2** — K8s/Infra | contracts-proto → provider-cnpg → kiven-agent → observability → security | +| **Dev 3** — Cloud/AWS | svc-api → svc-infra → svc-clusters/backups/users → GitOps → onboarding | +| **Dev 4** — Frontend | Templates → dashboard → svc-yamleditor → svc-audit/notification | + +## Calendar Reference + +| Roadmap Week | Calendar Date | Quarter | +|---|---|---| +| W1 | Feb 23 | Q1 | +| W2 | Mar 2 | Q1 | +| W3 | Mar 9 | Q1 | +| W4 | Mar 16 | Q1 | +| W5 | Mar 23 | Q1 | +| W6 | Mar 30 | Q1→Q2 | +| W7 | Apr 6 | Q2 | +| W8 | Apr 13 | Q2 | +| W9 | Apr 20 | Q2 | +| W10 | Apr 27 | Q2 | +| W11 | May 4 | Q2 | +| W12 | May 11 | Q2 | +| W13 | May 18 | Q2 | +| W14 | May 25 | Q2 | +| W15 | Jun 1 | Q2 | +| W16 | Jun 8 | Q2 | +| W17 | Jun 15 | Q2 | +| W18 | Jun 22 | Q2 | +| W19 | Jun 29 | Q2→Q3 | +| W20 | Jul 6 | Q3 | +| W21 | Jul 13 | Q3 | +| W22 | Jul 20 | Q3 | +| W23 | Jul 27 | Q3 | +| W24 | Aug 3 | Q3 | +| W25-31 | Aug-Sep | Q3 | +| W32+ | Oct+ | Q4 | -## Dependency Graph +--- -```mermaid -graph TD - subgraph phase1 [Phase 1: Foundation -- Weeks 1-3] - SDK[kiven-go-sdk
complete models]
- PROTO[contracts-proto<br/>gRPC definitions]
- API_SCAFFOLD[svc-api
scaffold + DB layer] - TEMPLATES[Apply Copier templates
to all repos] - end +# Q1 2026 — FOUNDATION (Feb 23 → Mar 31) - subgraph phase2 [Phase 2: Core Services -- Weeks 4-8] - AUTH[svc-auth
OIDC + API keys + RBAC]
- CNPG[provider-cnpg<br/>YAML generation]
- INFRA[svc-infra
AWS SDK integration]
- API_IMPL[svc-api<br/>implement endpoints]
- RELAY[svc-agent-relay
gRPC server] - end +> **Theme**: Build the foundation. Every repo Phase 2 depends on is ready. +> +> **Objective**: SDK complete, gRPC contracts defined, svc-api serves data from DB, all repos scaffolded, dev environment works end-to-end. - subgraph phase3 [Phase 3: Orchestration -- Weeks 9-14] - AGENT[kiven-agent
CNPG watcher + gRPC client] - PROV[svc-provisioner
state machine]
- CLUSTERS[svc-clusters<br/>lifecycle management]
- BACKUPS[svc-backups
backup/restore] - USERS[svc-users
PG user management] - end +## Q1 Dependency Graph - subgraph phase4 [Phase 4: Dashboard -- Weeks 8-16] - DASH[dashboard
API integration + auth flow] - end +```mermaid +graph LR + SDK[kiven-go-sdk] --> API[svc-api scaffold] + SDK --> AUTH_START[svc-auth start] + SDK --> CNPG_START[provider-cnpg start] + PROTO[contracts-proto] --> RELAY_START[svc-agent-relay start] + TEMPLATES[Copier templates] --> ALL[All Phase 2 repos scaffolded] + API --> ENDPOINTS[First read endpoints] +``` - subgraph phase5 [Phase 5: Production Infra -- Weeks 15-20] - GITOPS[platform-gitops
ArgoCD + Helm]
- OBS[platform-observability<br/>Prometheus + Loki + Tempo]
- SEC[platform-security
Vault + Kyverno + mTLS] - ONBOARD[infra-customer-aws
Terraform onboarding] - end +## Week 1 (Feb 23 - Feb 28) - subgraph phase6 [Phase 6: Enterprise -- Weeks 17-24] - MON[svc-monitoring
DBA intelligence + alerts]
- BILL[svc-billing<br/>Stripe + usage tracking]
- AUDIT[svc-audit
immutable audit log]
- NOTIF[svc-notification<br/>Slack + email + webhook]
- YAML[svc-yamleditor
Advanced Mode] - MIG[svc-migrations
import from Aiven/RDS] - end +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `kiven-go-sdk` | Error types package (`errors/errors.go`), HTTP client helpers, pagination types | Shared error handling + HTTP primitives | +| Dev 1 | `kiven-go-sdk` | Refactor models to API contracts only (remove DB-specific fields, add docs) | Clean separation: SDK = API contracts | +| Dev 2 | `contracts-proto` | Define `.proto` files: `agent.proto`, `metrics.proto`, `commands.proto` | gRPC contract drafts | +| Dev 3 | `svc-api` | Scaffold Go project from Copier template (chi router, healthcheck, graceful shutdown, OTel init) | Running HTTP server at :8080/healthz | +| Dev 4 | All Phase 2 repos | Start running `copier copy` from template for: svc-auth, svc-infra, svc-agent-relay, provider-cnpg | First repos scaffolded | - SDK --> AUTH - SDK --> CNPG - SDK --> INFRA - SDK --> API_IMPL - PROTO --> RELAY - PROTO --> AGENT - API_SCAFFOLD --> API_IMPL - TEMPLATES --> AUTH - TEMPLATES --> CNPG - TEMPLATES --> INFRA - - AUTH --> API_IMPL - CNPG --> AGENT - CNPG --> CLUSTERS - CNPG --> BACKUPS - CNPG --> USERS - INFRA --> PROV - RELAY --> AGENT - RELAY --> PROV - - API_IMPL --> DASH - - PROV --> GITOPS - AGENT --> OBS - CLUSTERS --> MON - BACKUPS --> MON - AUTH --> ONBOARD - INFRA --> ONBOARD - API_IMPL --> BILL - API_IMPL --> AUDIT - CLUSTERS --> YAML - MON --> NOTIF -``` +**W1 Review checkpoint**: SDK has error types + pagination. svc-api starts. Proto files drafted. 
+## Week 2 (Mar 2 - Mar 6) +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `kiven-go-sdk` | Middleware package (logging, recovery, request ID, auth context) | Reusable chi middleware for all svc-* | +| Dev 1 | `kiven-go-sdk` | OTel: MeterProvider (counters, histograms, gauges), pgx tracing hook | Metrics + auto-traced SQL queries | +| Dev 2 | `contracts-proto` | `buf.yaml`, `buf.gen.yaml`, CI with `buf lint` + `buf breaking` | Generated Go code in `gen/go/` | +| Dev 3 | `svc-api` | Database layer: pgx pool, migration runner, repository pattern, internal domain models | `ServiceRepository` + `internal/domain/service.go` | +| Dev 3 | `svc-api` | OpenAPI validation middleware (validate requests/responses against spec) | Every request validated against openapi.yaml | +| Dev 4 | All Phase 2 repos | Continue `copier copy` for: svc-clusters, svc-backups, svc-users, kiven-agent | All Phase 2 repos scaffolded | -## Phase 1: Foundation (Weeks 1-3) +**W2 Review checkpoint**: SDK has middleware + OTel metrics. Proto generates Go code. svc-api has DB layer. -**Goal**: Every repo that Phase 2 depends on is ready: SDK complete, gRPC contracts defined, svc-api scaffolded with DB layer, Copier templates applied. +## Week 3 (Mar 9 - Mar 13) -### Work Stream A: SDK + Contracts (1 developer) +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `kiven-go-sdk` | OTel: slog bridge (structured logs → OTel Logs), Kiven attribute constants (`kiven.org_id`, etc.) 
| Full traces + metrics + logs from day 1 | +| Dev 1 | `kiven-go-sdk` | Database helpers: pgx pool factory, migration runner, transaction helpers with OTel tracing | Every service connects to DB with 3 lines | +| Dev 2 | `contracts-proto` | Finalize agent protocol: Heartbeat (bidirectional), CommandStream (server-push), MetricsStream (agent-push) | Stable gRPC contract | +| Dev 3 | `svc-api` | Implement read-only endpoints: `GET /v1/plans`, `GET /v1/services`, `GET /v1/services/{id}` | First working API endpoints from DB | +| Dev 4 | `platform-templates-sdk-go` | Create Copier template for sdk-go (no Dockerfile, no cmd/, no gRPC) | Template ready for provider repos | +| Dev 4 | `kiven-dev` | Verify `task dev` works end-to-end: Docker Compose + kind + CNPG + svc-api starts | Working local dev environment | +**W3 Review checkpoint**: SDK feature-complete. Proto stable. svc-api returns plans from DB. `task dev` works. -| Week | Repo | Task | Deliverable | -| ---- | ----------------- | ------------------------------------------------------------------------------------------------------------ | -------------------------------------------------------- | -| 1 | `kiven-go-sdk` | Add error types package (`errors/errors.go`), HTTP client helpers, pagination types | Shared error handling + HTTP primitives for all services | -| 1 | `kiven-go-sdk` | Add middleware package (logging, recovery, request ID, auth context) | Reusable chi middleware for all svc-* | -| 1-2 | `kiven-go-sdk` | Telemetry package gaps (traces provider, HTTP middleware, gRPC interceptors, span helpers already done): **add MeterProvider** (OTel metrics: counters, histograms, gauges), **add pgx tracing hook** (auto-trace SQL queries), **add slog bridge** (structured logs to OTel Logs), **add Kiven attribute constants** (`kiven.org_id`, `kiven.service_id` etc. as typed constants). See `docs/observability/OTEL-CONVENTIONS.md`. 
| Full traces + metrics + logs instrumentation from day 1 | -| 2 | `docs` | DONE: `docs/observability/OTEL-CONVENTIONS.md` created. DONE: `OBSERVABILITY-GUIDE.md` updated to Kiven context. | OTel architecture decisions documented | -| 2 | `contracts-proto` | Define `.proto` files: `agent.proto` (Heartbeat, Status, Command streams), `metrics.proto`, `commands.proto` | gRPC contract for agent-relay communication | -| 2 | `contracts-proto` | Set up `buf.yaml`, `buf.gen.yaml`, CI with `buf lint` + `buf breaking` | Generated Go code in `gen/go/` | -| 3 | `kiven-go-sdk` | Add database helpers (pgx pool factory, migration runner, transaction helpers) with OTel pgx tracing | Every service can connect to DB with 3 lines + auto-traced queries | +### -- PHASE 1 COMPLETE -- +**Exit criteria**: +- [ ] `task dev` starts infra + svc-api +- [ ] `svc-api` returns service plans from DB +- [ ] `contracts-proto` generates Go code via `buf generate` +- [ ] All Phase 2 repos scaffolded (editorconfig, golangci, pre-commit, CI, Taskfile, Dockerfile, go.mod) +- [ ] `kiven-go-sdk` has: errors, middleware, OTel (traces + metrics + logs), DB helpers -### Work Stream B: svc-api Scaffold (1 developer) +--- +## Week 4 (Mar 16 - Mar 20) -| Week | Repo | Task | Deliverable | -| ---- | --------- | ----------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------- | -| 1 | `svc-api` | Scaffold Go project from Copier template (`copier copy`), set up chi router, healthcheck, graceful shutdown, OTel trace provider init | Running HTTP server at :8080/healthz with OTel traces | -| 2 | `svc-api` | Database layer: pgx connection pool, migration runner on startup, repository pattern (interfaces) | `ServiceRepository`, `OrganizationRepository`, `BackupRepository` interfaces | -| 2 | `svc-api` | OpenAPI validation middleware (validate requests/responses against spec) | Every request 
validated against openapi.yaml | -| 3 | `svc-api` | Implement read-only endpoints: `GET /v1/plans`, `GET /v1/services`, `GET /v1/services/{id}` | First working API endpoints returning data from DB | +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-auth` | OIDC integration (Google/GitHub login via `coreos/go-oidc`), JWT token issuance | Users can log in, get a JWT | +| Dev 2 | `provider-cnpg` | Implement `GenerateClusterYAML()` — given a service definition, produce valid CNPG Cluster manifest | CNPG Cluster YAML generation | +| Dev 3 | `svc-infra` | AWS SDK integration: `AssumeRole` into customer account, EKS `DescribeCluster` | Can access customer AWS resources | +| Dev 4 | `svc-api` | CRUD endpoints: `POST /v1/services`, `DELETE /v1/services/{id}`, `PATCH /v1/services/{id}` | Can create/update/delete services via API | +**W4 Review checkpoint**: OIDC login works. CNPG YAML generation started. AWS AssumeRole works. -### Work Stream C: Apply Templates (1 developer, part-time) +## Week 5 (Mar 23 - Mar 27) +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-auth` | API key management (create, list, revoke, hash with argon2) | Programmatic access for CLI/Terraform | +| Dev 2 | `provider-cnpg` | `GeneratePoolerYAML()`, `GenerateScheduledBackupYAML()` | Full CNPG manifest generation | +| Dev 3 | `svc-infra` | Create EKS managed node group (dedicated, tainted, right instance type) | Can create DB nodes in customer cluster | +| Dev 4 | `svc-api` | Customer cluster endpoints: `POST /v1/clusters`, `GET /v1/clusters` | Can register customer EKS clusters | -| Week | Repo | Task | Deliverable | -| ---- | --------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | 
--------------------------------------------------------------------------------------------- | -| 1-2 | All Phase 2 repos | Run `copier copy` from `platform-templates-service-go` for: `svc-auth`, `svc-infra`, `svc-agent-relay`, `svc-clusters`, `svc-backups`, `svc-users`, `provider-cnpg`, `kiven-agent` | All repos have: editorconfig, golangci, pre-commit, CI workflow, Taskfile, Dockerfile, go.mod, OTel init | -| 2-3 | `platform-templates-sdk-go` | Create Copier template for sdk-go (same as service-go but no Dockerfile, no cmd/, no gRPC) | Template ready for provider repos and CLI | -| 3 | `kiven-dev` | Verify `task dev` works end-to-end: Docker Compose + kind + CNPG + all services can start | Working local dev environment | +**W5 Review checkpoint**: API keys work. Provider generates Cluster + Pooler + Backup YAML. Node group creation works. +## Q1 Exit Criteria -**Phase 1 exit criteria**: `task dev` starts infra, `svc-api` returns service plans from DB, `contracts-proto` generates Go code, all Phase 2 repos are scaffolded. 
+- [ ] SDK feature-complete (errors, middleware, OTel, DB helpers) +- [ ] gRPC contracts defined and generating Go code +- [ ] `svc-api` serves CRUD endpoints from PostgreSQL +- [ ] All Phase 2 repos scaffolded with Copier template +- [ ] `task dev` works end-to-end locally +- [ ] OIDC login flow works (svc-auth) +- [ ] CNPG YAML generation works (provider-cnpg) +- [ ] AWS AssumeRole + node group creation works (svc-infra) ---- +## Q1 Evolution Tracker -## Phase 2: Core Services (Weeks 4-8) +| Week | SDK | Proto | svc-api | Templates | svc-auth | provider-cnpg | svc-infra | +|------|-----|-------|---------|-----------|----------|---------------|-----------| +| W1 | 🔨 | 🔨 | 🔨 | 🔨 | — | — | — | +| W2 | 🔨 | 🔨 | 🔨 | 🔨 | — | — | — | +| W3 | ✅ | ✅ | ✅ | ✅ | — | — | — | +| W4 | — | — | 🔨 | — | 🔨 | 🔨 | 🔨 | +| W5 | — | — | ✅ | — | 🔨 | 🔨 | 🔨 | -**Goal**: Authentication works, CNPG YAML can be generated, AWS resources can be created, agent can connect to relay. These are the building blocks the provisioner needs. +Legend: — not started, 🔨 in progress, ✅ done -### Work Stream A: Auth (1 developer) +--- +# Q2 2026 — CORE + ORCHESTRATION + DASHBOARD (Apr 1 → Jun 30) -| Week | Repo | Task | Deliverable | -| ---- | ---------- | ------------------------------------------------------------------------------- | ------------------------------------------ | -| 4 | `svc-auth` | OIDC integration (Google/GitHub login via `coreos/go-oidc`), JWT token issuance | Users can log in, get a JWT | -| 5 | `svc-auth` | API key management (create, list, revoke, hash with argon2) | Programmatic access for CLI/Terraform | -| 6 | `svc-auth` | RBAC middleware (admin, operator, viewer roles), org/team model | Role-based access control on all endpoints | -| 6 | `svc-api` | Integrate auth middleware from `svc-auth`, protect all endpoints | Every API call requires valid token | +> **Theme**: Build everything needed for the MVP. Auth, provider, agent, provisioner, dashboard. 
+> +> **Objective**: A user can log in, create a database from the dashboard, and get a working PostgreSQL connection string — all in a local kind cluster. Full provisioning pipeline works end-to-end. +## Q2 Dependency Graph -### Work Stream B: Provider + Agent Foundation (1 developer) +```mermaid +graph TD + subgraph "Apr (W6-9): Complete Core Services" + AUTH[svc-auth complete] --> API_PROTECT[svc-api protected] + CNPG[provider-cnpg complete] --> AGENT[kiven-agent] + INFRA[svc-infra complete] --> PROV[svc-provisioner] + RELAY[svc-agent-relay] --> AGENT + end + subgraph "May (W10-14): Orchestration" + AGENT --> PROV + PROV --> CLUSTERS[svc-clusters] + PROV --> BACKUPS[svc-backups] + CLUSTERS --> USERS[svc-users] + PROV --> E2E[E2E integration test] + end -| Week | Repo | Task | Deliverable | -| ---- | ----------------- | ----------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------- | -| 4-5 | `provider-cnpg` | Implement Provider interface for CNPG: `GenerateClusterYAML()`, `GeneratePoolerYAML()`, `GenerateScheduledBackupYAML()` | Given a service definition, produce valid CNPG YAML | -| 5-6 | `provider-cnpg` | Implement `ParseStatus()`, `ParseMetrics()` from CNPG CRD status fields | Can read CNPG cluster state | -| 6 | `contracts-proto` | Finalize agent protocol: Heartbeat (bidirectional), CommandStream (server-push), MetricsStream (agent-push) | Stable gRPC contract | -| 7-8 | `svc-agent-relay` | gRPC server: agent registration, heartbeat tracking, command dispatch queue | Agents can connect and receive commands | + subgraph "Jun (W15-18): Dashboard + Polish" + API_PROTECT --> DASH[Dashboard integration] + E2E --> DASH + end +``` +## Week 6 (Mar 30 - Apr 3) -### Work Stream C: Infrastructure (1 developer) +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-auth` | RBAC middleware (admin, operator, viewer roles), 
org/team model | Role-based access control | +| Dev 1 | `svc-api` | Integrate auth middleware, protect all endpoints | Every API call requires valid token | +| Dev 2 | `provider-cnpg` | Implement `ParseStatus()`, `ParseMetrics()` from CNPG CRD status fields | Can read CNPG cluster state | +| Dev 3 | `svc-infra` | Create S3 bucket (encrypted, lifecycle rules) for backups, create IRSA role for CNPG | Backup infrastructure ready | +| Dev 4 | `dashboard` | API client layer: fetch wrapper, auth token management, error handling | Type-safe API client | +**W6 Review checkpoint**: Auth complete (OIDC + API keys + RBAC). Provider can parse status. S3 bucket creation works. -| Week | Repo | Task | Deliverable | -| ---- | ----------- | ------------------------------------------------------------------------------------- | --------------------------------------------- | -| 4-5 | `svc-infra` | AWS SDK integration: `AssumeRole` into customer account, EKS `DescribeCluster` | Can access customer AWS resources | -| 5-6 | `svc-infra` | Create EKS managed node group (dedicated for databases, tainted, right instance type) | Can create database nodes in customer cluster | -| 6-7 | `svc-infra` | Create S3 bucket (encrypted, lifecycle rules) for backups, create IRSA role for CNPG | Backup infrastructure ready | -| 7-8 | `svc-infra` | Create EBS StorageClass (gp3, encrypted, right IOPS) | Storage ready for database volumes | +## Week 7 (Apr 6 - Apr 10) +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-auth` | Unit + integration tests for OIDC, API keys, RBAC | Auth fully tested | +| Dev 2 | `provider-cnpg` | Integration tests: generate YAML → validate against CNPG CRD schema | Provider fully tested | +| Dev 2 | `contracts-proto` | Finalize: Heartbeat, CommandStream, MetricsStream — stable contract | gRPC contract frozen | +| Dev 3 | `svc-infra` | Create EBS StorageClass (gp3, encrypted, right IOPS) | Storage ready for DB volumes | +| Dev 4 | 
`dashboard` | Auth flow: login page, OIDC redirect, token storage, protected routes | Users can log in via dashboard | -### Work Stream D: svc-api Endpoints (shared across team) +**W7 Review checkpoint**: Auth tested. Provider tested. svc-infra can create storage. Dashboard login works. +## Week 8 (Apr 13 - Apr 17) -| Week | Repo | Task | Deliverable | -| ---- | --------- | ------------------------------------------------------------------------------------------ | ----------------------------------------- | -| 5-6 | `svc-api` | CRUD endpoints: `POST /v1/services`, `DELETE /v1/services/{id}`, `PATCH /v1/services/{id}` | Can create/update/delete services via API | -| 7-8 | `svc-api` | Customer cluster endpoints: `POST /v1/clusters` (register EKS), `GET /v1/clusters` | Can register customer EKS clusters | -| 8 | `svc-api` | Backup endpoints, user management endpoints | Full CRUD for all resources | +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-provisioner` | Scaffold + state machine design: `provisioning_jobs` table, step definitions | Provisioner architecture ready | +| Dev 2 | `svc-agent-relay` | gRPC server: agent registration, heartbeat tracking, connection management | Agents can connect | +| Dev 3 | `svc-api` | Backup endpoints, user management endpoints, remaining CRUD | Full API endpoint coverage | +| Dev 4 | `dashboard` | Service list page: real data from API, create service wizard | Can create a database from the UI | +**W8 Review checkpoint**: Provisioner designed. Agent relay accepts connections. Full API CRUD. Dashboard shows services. -**Phase 2 exit criteria**: User can log in via OIDC, create a service via API (stored in DB), `provider-cnpg` generates valid CNPG YAML, `svc-infra` can create node groups in test AWS account, agent can connect to relay. 
+### -- CORE SERVICES COMPLETE -- ---- +**Exit criteria (Week 8)**: +- [ ] svc-auth: OIDC + API keys + RBAC, fully tested +- [ ] provider-cnpg: YAML generation + status parsing, fully tested +- [ ] svc-infra: AssumeRole + node groups + S3 + IRSA + StorageClass +- [ ] svc-agent-relay: gRPC server accepts agent connections +- [ ] svc-api: full CRUD for services, clusters, backups, users -## Phase 3: Orchestration (Weeks 9-14) +--- -**Goal**: The provisioning pipeline works end-to-end. Customer clicks "Create Database" and gets a running PostgreSQL. +## Week 9 (Apr 20 - Apr 24) -### Work Stream A: Agent (1 developer) +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-provisioner` | Steps implementation: create_nodes → create_storage → create_s3 (calls svc-infra) | First 3 provisioning steps work | +| Dev 2 | `kiven-agent` | Go binary: gRPC client to relay, CNPG informers (watch Cluster/Backup CRDs), heartbeat | Agent running in kind, reports CNPG | +| Dev 3 | `svc-clusters` | Cluster lifecycle: get status from agent, basic CRUD | Can see cluster status | +| Dev 4 | `dashboard` | Service detail page: real connection info, status from API, power on/off | Can see database status in UI | +**W9 Review checkpoint**: Provisioner creates infra resources. Agent watches CNPG CRDs. Cluster status visible. 
-| Week | Repo | Task | Deliverable | -| ----- | ------------------ | -------------------------------------------------------------------------------------- | --------------------------------------------- | -| 9-10 | `kiven-agent` | Go binary: gRPC client to relay, CNPG informers (watch Cluster/Backup CRDs), heartbeat | Agent running in kind, reporting CNPG status | -| 10-11 | `kiven-agent` | Command executor: receive YAML from relay, `kubectl apply`, report result | Can apply CNPG manifests on command | -| 11-12 | `kiven-agent` | PG stats collector: connect to PostgreSQL, collect pg_stat_statements, send to relay | Metrics flowing to SaaS | -| 12 | `kiven-agent-helm` | Helm chart for agent deployment | One-command agent install in customer cluster | +## Week 10 (Apr 27 - May 1) +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-provisioner` | Steps: install_cnpg → deploy_cluster (calls agent-relay to send commands to agent) | Full pipeline: API → provisioner → infra + agent | +| Dev 2 | `kiven-agent` | Command executor: receive YAML from relay, `kubectl apply`, report result | Can apply CNPG manifests on command | +| Dev 3 | `svc-clusters` | Scale (change instances), power on/off (delete pods + scale node group) | Scale up/down, power on/off | +| Dev 4 | `dashboard` | Backups page (real data from API), backup timeline visualization | Backups visible in UI | -### Work Stream B: Provisioner + Services (2 developers) +**W10 Review checkpoint**: Full provisioning pipeline works (API → provisioner → infra → agent → CNPG). Power on/off works. 
+## Week 11 (May 4 - May 8) -| Week | Repo | Task | Deliverable | -| ----- | ----------------- | ------------------------------------------------------------------------------------------------------------------------------ | -------------------------------------------------- | -| 9-10 | `svc-provisioner` | State machine: `provisioning_jobs` table, steps: create_nodes -> create_storage -> create_s3 -> install_cnpg -> deploy_cluster | Provisioning pipeline orchestration | -| 10-11 | `svc-provisioner` | Integration with `svc-infra` (create AWS resources) + `svc-agent-relay` (send commands to agent) | Full pipeline: API -> provisioner -> infra + agent | -| 11-12 | `svc-clusters` | Cluster lifecycle: get status from agent, scale (change instances), power on/off | Can see cluster status, scale up/down | -| 12-13 | `svc-backups` | Backup management: trigger backup via agent, list backups from S3, PITR restore | Backup/restore working | -| 13-14 | `svc-users` | PG user management: create user via agent (SQL execution), list users, reset password | Can manage database users | +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-provisioner` | Error handling, retries, rollback on failure, idempotency | Resilient provisioning | +| Dev 2 | `kiven-agent` | PG stats collector: connect to PG, collect pg_stat_statements, send to relay | Metrics flowing to SaaS | +| Dev 3 | `svc-backups` | Backup management: trigger backup via agent, list backups from S3, restore | Backup/restore working | +| Dev 4 | `dashboard` | Users page (CRUD), metrics page (charts from agent data) | Full dashboard pages | +**W11 Review checkpoint**: Provisioner handles errors. PG metrics flowing. Backup/restore works. All dashboard pages exist. 
-### Integration Testing (all developers) +## Week 12 (May 11 - May 15) +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-provisioner` | Integration tests: full pipeline in kind | Provisioner fully tested | +| Dev 2 | `kiven-agent` | Log aggregator: collect PG logs from all pods, send to relay | Logs flowing to SaaS | +| Dev 2 | `kiven-agent-helm` | Helm chart for agent deployment | One-command agent install | +| Dev 3 | `svc-backups` | PITR restore, fork/clone support | Advanced backup features | +| Dev 4 | `dashboard` | Loading states, error handling, empty states | UI polish | -| Week | Task | Deliverable | -| ----- | ------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------ | -| 13-14 | End-to-end test in kind: create service -> provisioner runs -> agent deploys CNPG -> PostgreSQL running -> connection string returned | MVP proof: the full loop works | +**W12 Review checkpoint**: Provisioner tested. Agent has Helm chart. PITR works. Dashboard polished. +## Week 13 (May 18 - May 22) -**Phase 3 exit criteria**: In the local dev environment (kind), a user can create a service via API, the provisioner creates a CNPG cluster via the agent, and the user gets a working PostgreSQL connection string. 
+| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-provisioner` | Status reporting: webhook/polling to update service status in svc-api | Real-time provisioning status | +| Dev 2 | `kiven-agent` | Infrastructure reporter: node status, storage usage, resource availability | Infrastructure metrics in SaaS | +| Dev 3 | `kiven-go-sdk` | Create `models.DatabaseUser` (distinct from dashboard User) + migration | Domain model for PG users | +| Dev 3 | `svc-users` | PG user management via agent: CREATE ROLE, GRANT, password rotation | Can manage database users | +| Dev 4 | `dashboard` | Responsive design, dark mode, accessibility | Production-ready UI | ---- +**W13 Review checkpoint**: Provisioning status updates in real-time. DB user management works. Dashboard responsive. -## Phase 4: Dashboard Integration (Weeks 8-16, parallel with Phase 3) +## Week 14 (May 25 - May 29) -**Goal**: The dashboard is connected to the API and a customer can do everything via the UI. +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| All | E2E | End-to-end test in kind: create service → provisioner → agent → CNPG → PG running → connection string | **MVP proof: full loop works** | +| Dev 1 | `svc-provisioner` | Performance optimization, concurrent provisioning | Can provision multiple DBs | +| Dev 4 | `dashboard` | E2E test from UI: login → create DB → see status → manage users/backups | Full UI E2E | -### Work Stream (1 frontend developer) +**W14 Review checkpoint**: **CRITICAL MILESTONE — Full loop works in kind.** User creates DB → gets connection string. 
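The milestone above ends with the user handed a connection string. One way to assemble it safely is to URL-escape the generated credentials; the host and field values here are made up for illustration:

```go
// Sketch: building a postgres:// URI from provisioned parts. Escaping the
// credentials keeps generated passwords with special characters valid.
package main

import (
	"fmt"
	"net/url"
)

// ConnString builds a postgres:// URI with sslmode=require.
func ConnString(user, pass, host string, port int, db string) string {
	u := url.URL{
		Scheme:   "postgres",
		User:     url.UserPassword(user, pass),
		Host:     fmt.Sprintf("%s:%d", host, port),
		Path:     "/" + db,
		RawQuery: "sslmode=require",
	}
	return u.String()
}

func main() {
	// Hypothetical values; real ones come from the CNPG cluster and svc-users.
	fmt.Println(ConnString("app", "p@ss/w", "db-1.example", 5432, "app"))
}
```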
+### -- ORCHESTRATION COMPLETE -- -| Week | Repo | Task | Deliverable | -| ----- | ----------- | ---------------------------------------------------------------------------------- | --------------------------------- | -| 8-9 | `dashboard` | API client layer: fetch wrapper, auth token management, error handling | Type-safe API client | -| 9-10 | `dashboard` | Auth flow: login page, OIDC redirect, token storage, protected routes | Users can log in | -| 10-11 | `dashboard` | Service list page: real data from API, create service wizard connected to API | Can create a database from the UI | -| 11-12 | `dashboard` | Service detail page: real connection info, status from API, power on/off | Can see database status | -| 12-14 | `dashboard` | Backups page (real data), Users page (CRUD), Metrics page (charts from agent data) | Full dashboard functionality | -| 14-16 | `dashboard` | Polish: loading states, error handling, responsive design, dark mode | Production-ready UI | +**Exit criteria (Week 14)**: +- [ ] In kind: user creates service via API → provisioner → agent → CNPG cluster → connection string +- [ ] Backup/restore + PITR works +- [ ] DB user management works +- [ ] Agent collects metrics + logs from PostgreSQL +- [ ] Dashboard shows everything +--- -**Phase 4 exit criteria**: Customer can log in, create a database, see status, manage users/backups -- all from the dashboard. 
+## Week 15 (Jun 1 - Jun 5) + +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-provisioner` | Hardening: retry logic, circuit breakers, graceful degradation | Production-grade provisioner | +| Dev 2 | `platform-observability` | Install Prometheus + Grafana via Helm, ServiceMonitor for all services | Metrics collection started | +| Dev 3 | `platform-gitops` | Flux Kustomizations/HelmReleases for all svc-*, environments (dev, staging, prod) | GitOps deployment pipeline | +| Dev 4 | `dashboard` | Settings page, organization management, team invitations | Admin features | + +**W15 Review checkpoint**: GitOps pipeline ready. Prometheus collecting metrics. Provisioner hardened. + +## Week 16 (Jun 8 - Jun 12) + +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-monitoring` | Scaffold + metrics ingestion from agent: pg_stat_statements, connections, replication lag | Metrics pipeline from agent to SaaS | +| Dev 2 | `platform-observability` | Install Loki + Promtail, configure log ingestion | Centralized logging | +| Dev 3 | All svc-* repos | Production Helm charts (per service), Kustomize overlays for env-specific config | `helm install svc-api` works | +| Dev 4 | `dashboard` | API documentation page (Redoc), connection string helper | Developer experience | + +**W16 Review checkpoint**: Metrics pipeline ingesting. Loki collecting logs. Helm charts for all services. 
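The basic alerting built on top of this metrics ingestion reduces to threshold checks over each sample. A sketch with illustrative thresholds — real values would live in per-service alert rule configuration, not code:

```go
// Sketch of threshold-based alert evaluation over one agent metrics sample.
// The metric names match the roadmap (connections, replication lag, disk);
// the thresholds are illustrative assumptions.
package main

import "fmt"

type DBMetrics struct {
	Connections    int
	MaxConnections int
	ReplicationLag float64 // seconds
	DiskUsedPct    float64
}

// Alerts returns the names of all alerts firing for one sample.
func Alerts(m DBMetrics) []string {
	var firing []string
	if m.MaxConnections > 0 && float64(m.Connections)/float64(m.MaxConnections) > 0.9 {
		firing = append(firing, "connection_pool_near_exhaustion")
	}
	if m.ReplicationLag > 30 {
		firing = append(firing, "replication_lag_high")
	}
	if m.DiskUsedPct > 85 {
		firing = append(firing, "disk_usage_high")
	}
	return firing
}

func main() {
	m := DBMetrics{Connections: 95, MaxConnections: 100, ReplicationLag: 4.2, DiskUsedPct: 91}
	fmt.Println(Alerts(m)) // [connection_pool_near_exhaustion disk_usage_high]
}
```

Evaluating per sample keeps the first iteration simple; duration-based conditions ("over threshold for 5 minutes") can be layered on later.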
+ +## Week 17 (Jun 15 - Jun 19) + +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-monitoring` | Basic alerting: connection pool exhaustion, replication lag, disk usage | Critical alerts work | +| Dev 2 | `platform-observability` | OTel Collector (DaemonSet agents + Gateway), Tempo as trace backend | Distributed tracing | +| Dev 3 | `kiven-dev` | Staging environment in real EKS: Terraform for Kiven SaaS EKS cluster | Staging cluster on AWS | +| Dev 4 | `svc-audit` | Scaffold + immutable audit log: every API call, every infra change, who/what/when | Audit trail started | + +**W17 Review checkpoint**: Alerting works. Tracing live. Staging EKS cluster exists. Audit logging. + +## Week 18 (Jun 22 - Jun 26) + +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-monitoring` | Grafana dashboards: service health, request latency, error rates, agent status | Operations visibility | +| Dev 2 | `platform-observability` | SLO definitions (99.9% API, <200ms p95), error budget alerts (Sloth/Pyrra) | SLO monitoring | +| Dev 3 | `platform-gitops` | Promotion workflow: dev → staging → prod with approval gates | Controlled rollouts | +| Dev 4 | `svc-notification` | Alert dispatch: Slack, email, webhook integration | Multi-channel alerting | + +**W18 Review checkpoint**: Grafana dashboards live. SLOs defined. Promotion workflow. Notifications dispatch. 
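The W18 SLO targets translate into a concrete error budget, which is what Sloth/Pyrra burn-rate alerts are built on. The arithmetic, as a sketch:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, observed_availability: float) -> float:
    """Fraction of the error budget still unspent (negative means blown)."""
    allowed = 1.0 - slo
    spent = 1.0 - observed_availability
    return 1.0 - spent / allowed


# 99.9% over 30 days allows 43.2 minutes of downtime;
# an observed 99.95% over the same window has spent half the budget.
assert abs(error_budget_minutes(0.999) - 43.2) < 1e-6
assert abs(budget_remaining(0.999, 0.9995) - 0.5) < 1e-9
```

A 99.9% monthly target therefore means roughly 43 minutes of tolerated downtime; the alerts fire on how fast that budget is being consumed, not on individual errors.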
+ +## Q2 Exit Criteria + +- [ ] svc-auth complete: OIDC + API keys + RBAC + tests +- [ ] provider-cnpg complete: YAML gen + status parsing + tests +- [ ] svc-infra complete: node groups + S3 + IRSA + StorageClass +- [ ] kiven-agent: CNPG informers + command executor + PG stats + logs +- [ ] svc-provisioner: full pipeline, tested, resilient +- [ ] svc-clusters + svc-backups + svc-users: all working +- [ ] **E2E in kind: login → create DB → get connection string → manage** +- [ ] Dashboard: all pages with real data, auth, responsive +- [ ] GitOps pipeline (Flux) deployed +- [ ] Observability started (Prometheus, Loki, Tempo, Grafana) +- [ ] Staging EKS cluster running +- [ ] svc-monitoring ingesting metrics + basic alerts +- [ ] svc-audit + svc-notification scaffolded + +## Q2 Evolution Tracker + +| Week | Auth | Provider | Infra | Relay | Agent | Provisioner | Clusters | Backups | Users | Dashboard | GitOps | Observ. | +|------|------|----------|-------|-------|-------|-------------|----------|---------|-------|-----------|--------|---------| +| W6 | ✅ | 🔨 | 🔨 | — | — | — | — | — | — | 🔨 | — | — | +| W7 | ✅ | ✅ | 🔨 | — | — | — | — | — | — | 🔨 | — | — | +| W8 | ✅ | ✅ | ✅ | 🔨 | — | 🔨 | — | — | — | 🔨 | — | — | +| W9 | ✅ | ✅ | ✅ | ✅ | 🔨 | 🔨 | 🔨 | — | — | 🔨 | — | — | +| W10 | ✅ | ✅ | ✅ | ✅ | 🔨 | 🔨 | 🔨 | — | — | 🔨 | — | — | +| W11 | ✅ | ✅ | ✅ | ✅ | 🔨 | 🔨 | ✅ | 🔨 | — | 🔨 | — | — | +| W12 | ✅ | ✅ | ✅ | ✅ | ✅ | 🔨 | ✅ | 🔨 | — | 🔨 | — | — | +| W13 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🔨 | 🔨 | — | — | +| W14 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | — | — | +| W15 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🔨 | 🔨 | +| W16 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🔨 | 🔨 | +| W17 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🔨 | 🔨 | +| W18 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🔨 | --- -## Phase 5: Production Infrastructure (Weeks 15-20, overlaps with Phase 4) +# Q3 2026 — PRODUCTION + ENTERPRISE + FIRST CUSTOMER (Jul 1 → Sep 30) -**Goal**: Everything needed to run Kiven in 
production on real AWS infrastructure. Services deploy via GitOps, observability is in place, security is hardened, customers can onboard autonomously. +> **Theme**: Go to production. Security hardened, customers can onboard, billing works, DBA intelligence, first real customer. +> +> **Objective**: Kiven runs in production on real AWS. First customer onboarded, paying, with a managed PostgreSQL. -### Work Stream A: Deployment + GitOps (1 developer) +## Q3 Dependency Graph +```mermaid +graph TD + subgraph "Jul (W19-22): Security + Onboarding" + VAULT[Vault + ESO] --> SEC[Security hardened] + KYVERNO[Kyverno policies] --> SEC + CERT[cert-manager TLS] --> SEC + TF_MOD[Terraform onboarding module] --> WIZARD[Onboarding wizard] + SEC --> PROD_READY[Production ready] + end -| Week | Repo | Task | Deliverable | -| ----- | ------------------ | -------------------------------------------------------------------------------- | ------------------------------------------------ | -| 15 | `platform-gitops` | ArgoCD ApplicationSets for all svc-* services, environments (dev, staging, prod) | GitOps deployment pipeline | -| 15-16 | All svc-* repos | Production Helm charts (per service), Kustomize overlays for env-specific config | `helm install svc-api` works | -| 16-17 | `kiven-dev` | Staging environment in real EKS (not kind): Terraform for Kiven SaaS EKS cluster | Staging cluster running on AWS | -| 17-18 | `platform-gitops` | Promotion workflow: dev -> staging -> prod with approval gates | Controlled rollouts | -| 18 | `platform-gateway` | Cloudflare Terraform: DNS, WAF rules, DDoS protection, Tunnel to EKS | kiven.io resolves, API accessible via Cloudflare | - + subgraph "Aug (W23-27): Enterprise + Billing" + BILLING[svc-billing Stripe] --> INVOICES[Usage + invoices] + DBA[DBA intelligence] --> ALERTS[Smart alerts] + YAML_ED[svc-yamleditor] --> ADV_MODE[Advanced Mode] + end -### Work Stream B: Observability (1 developer) + subgraph "Sep (W28-31): First Customer" + PROD_READY 
--> CUSTOMER[First customer live] + WIZARD --> CUSTOMER + INVOICES --> CUSTOMER + end +``` +## Week 19 (Jun 29 - Jul 3) -| Week | Repo | Task | Deliverable | -| ----- | ------------------------ | --------------------------------------------------------------------------------- | ---------------------------------- | -| 15-16 | `platform-observability` | Install Prometheus + Grafana via Helm, ServiceMonitor for all svc-* services | Metrics collection and dashboards | -| 16-17 | `platform-observability` | Install Loki + Promtail, configure log ingestion from all pods | Centralized logging | -| 17 | `platform-observability` | OTel Collector two-tier: DaemonSet agents (forward) + Gateway Deployment (batch, filter, export). Exporter helper with persistent queue (file-backed, survives restarts). Tempo as trace backend. | Distributed tracing with resilient pipeline | -| 17-18 | `platform-observability` | Grafana dashboards: service health, request latency, error rates, agent status | Operations visibility | -| 18-19 | `platform-observability` | SLO definitions (99.9% API availability, <200ms p95 latency), error budget alerts | SLO monitoring with Sloth or Pyrra | -| 19-20 | `platform-observability` | On-call runbooks: automated alert -> runbook link, PagerDuty integration | Operational readiness | +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-monitoring` | DBA recommendations engine: auto-tune postgresql.conf based on workload patterns | "Increase shared_buffers" alerts | +| Dev 2 | `platform-security` | HashiCorp Vault: install, configure dynamic secrets for PG and AWS credentials | No more static secrets | +| Dev 3 | `platform-gitops` | Deploy all services to staging EKS via Flux, validate full stack | Services running on real EKS | +| Dev 4 | `svc-audit` | Complete audit log: append-only table, query API, retention policies | Compliance-ready audit trail | +**W19 Review checkpoint**: DBA recommendations work. Vault installed. 
All services on staging EKS. Audit log complete. -### Work Stream C: Security Hardening (1 developer) +## Week 20 (Jul 6 - Jul 10) +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-monitoring` | Query optimizer: slow query detection, visual EXPLAIN, index suggestions | Actionable query insights | +| Dev 2 | `platform-security` | External Secrets Operator: sync Vault secrets → K8s Secrets for all services | Services read secrets natively | +| Dev 3 | `infra-customer-aws` | Terraform module: creates `KivenAccessRole` in customer AWS (IAM + trust policy) | IaC-native onboarding | +| Dev 4 | `svc-yamleditor` | Advanced Mode: YAML viewer/editor with Monaco, CNPG schema validation | Experts can see/edit YAML | -| Week | Repo | Task | Deliverable | -| ----- | --------------------- | --------------------------------------------------------------------------------------- | --------------------------------------- | -| 15-16 | `platform-security` | HashiCorp Vault: install, configure dynamic secrets for PostgreSQL and AWS credentials | No more static secrets | -| 16-17 | `platform-security` | External Secrets Operator: sync Vault secrets to Kubernetes Secrets for all services | Services read secrets from K8s natively | -| 17 | `platform-security` | cert-manager: install, configure Let's Encrypt ClusterIssuer, auto-TLS for all services | HTTPS everywhere | -| 17-18 | `platform-security` | Kyverno policies: require resource limits, require labels, block privileged pods | Policy enforcement | -| 18-19 | `platform-networking` | Cilium network policies: restrict pod-to-pod traffic, mTLS between services | Zero-trust networking | -| 19-20 | `platform-security` | Image signing (Cosign), SBOM generation, vulnerability scanning in CI | Supply chain security | +**W20 Review checkpoint**: Query optimizer works. ESO syncing secrets. Terraform onboarding module ready. YAML editor works. 
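The onboarding module's output is an IAM Role ARN that the customer pastes back into Kiven, so validation starts with a format check before any STS call. A sketch (aws partition only; the real flow would follow up with `sts:AssumeRole` and surface a readable error if the trust policy was not applied):

```python
import re

# arn:aws:iam::<12-digit-account>:role/<name, optionally with a path>
_ROLE_ARN = re.compile(
    r"^arn:aws:iam::(?P<account>\d{12}):role/(?P<name>[\w+=,.@/-]+)$"
)


def parse_role_arn(arn: str) -> tuple[str, str]:
    """Return (account_id, role_name) or raise ValueError on a malformed ARN."""
    m = _ROLE_ARN.match(arn.strip())
    if m is None:
        raise ValueError(f"not a valid IAM role ARN: {arn!r}")
    return m.group("account"), m.group("name")
```

Rejecting malformed input early keeps the wizard's error messages about the ARN itself distinct from trust-policy failures reported by STS.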
+## Week 21 (Jul 13 - Jul 17) -### Work Stream D: Customer Onboarding (shared) +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-monitoring` | Capacity planner: storage/CPU growth forecasting, "disk full in 14 days" | Proactive capacity alerts | +| Dev 2 | `platform-security` | cert-manager: Let's Encrypt ClusterIssuer, auto-TLS for all services | HTTPS everywhere | +| Dev 2 | `platform-security` | Kyverno policies: require resource limits, labels, block privileged pods | Policy enforcement | +| Dev 3 | `infra-customer-aws` | EKS discovery: validate cluster access, discover nodes, storage classes, CNPG | Automated cluster validation | +| Dev 4 | `svc-yamleditor` | Change history: git-like timeline of all YAML changes, rollback to any version | Full configuration history | + +**W21 Review checkpoint**: Capacity planning works. TLS everywhere. Kyverno enforcing. EKS discovery works. + +## Week 22 (Jul 20 - Jul 24) + +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-monitoring` | Backup verification: automated weekly restore tests, RPO compliance dashboard | Verified backup reliability | +| Dev 2 | `platform-networking` | Cilium network policies: restrict pod-to-pod, mTLS between services | Zero-trust networking | +| Dev 3 | `svc-api` + `dashboard` | Onboarding wizard: Terraform → paste IAM Role ARN → validate → register → create first DB | Self-service onboarding | +| Dev 4 | `svc-billing` | Stripe integration: customer/subscription lifecycle, payment methods | Customers can subscribe | + +**W22 Review checkpoint**: Backup verification automated. Cilium mTLS. Onboarding wizard works. Stripe integration. 
+ +## Week 23 (Jul 27 - Jul 31) + +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `svc-monitoring` | Integration tests, dashboard widgets for DBA intelligence | DBA intelligence complete | +| Dev 2 | `platform-security` | Image signing (Cosign), SBOM generation, vulnerability scanning in CI | Supply chain security | +| Dev 3 | `infra-customer-aws` | Advanced Terraform modules: VPC peering, private endpoints, custom KMS | Enterprise networking | +| Dev 4 | `svc-billing` | Usage tracking: compute hours, storage consumption, backup storage | Accurate usage metering | + +**W23 Review checkpoint**: DBA intelligence done. Supply chain security. Advanced networking. Usage metering. + +## Week 24 (Aug 3 - Aug 7) + +| Dev | Repo | Task | Deliverable | +|-----|------|------|-------------| +| Dev 1 | `platform-observability` | On-call runbooks: automated alert → runbook link, PagerDuty integration | Operational readiness | +| Dev 2 | `platform-gateway` | Cloudflare Terraform: DNS, WAF rules, DDoS protection, Tunnel to EKS | kiven.io live | +| Dev 3 | `svc-api` | API rate limiting, pagination optimization, caching | Production-grade API | +| Dev 4 | `svc-billing` | Invoice generation: monthly invoices with line items (Kiven fee + AWS estimate) | Professional invoices | + +**W24 Review checkpoint**: On-call ready. kiven.io resolves. API production-grade. Invoicing works. + +## Week 25-26 (Aug 10 - Aug 22) — Production Readiness Sprint + +| Dev | Focus | Task | +|-----|-------|------| +| All | Testing | Chaos testing: node failure, agent disconnect, CNPG failover | +| All | Testing | DR test: failover to second AZ, restore from backup | +| All | Security | Security audit: Vault, mTLS, Kyverno, no static credentials | +| All | Documentation | API docs (Redoc), user guides, admin guides | +| Dev 4 | Billing | Dashboard billing page: plan upgrade/downgrade, payment history | + +**W25-26 Review checkpoint**: Chaos tests pass. DR tested. 
Security audited. Docs complete. Billing UI done. + +## Week 27-28 (Aug 25 - Sep 5) — Test Customer Dry Run + +| Dev | Focus | Task | +|-----|-------|------| +| All | Validation | Full customer onboarding with `test-client` AWS account | +| All | Validation | Provision database, verify metrics/logs/backups, test billing flow | +| All | Bug fixes | Fix issues found during dry run | +| Dev 4 | Legal | Terms of Service, Privacy Policy, DPA preparation | + +**W27-28 Review checkpoint**: Test customer fully onboarded. All flows work end-to-end. Bugs fixed. + +## Week 29-31 (Sep 8 - Sep 26) — First Customer + +| Week | Focus | Task | +|------|-------|------| +| W29 | Onboarding | First real customer: guided onboarding, dedicated support | +| W30 | Monitoring | 24/7 monitoring of first customer, immediate response to any issue | +| W31 | Stabilization | Bug fixes, performance tuning, documentation updates from learnings | + +**W29-31 Review checkpoint**: **CRITICAL — First customer live and healthy.** + +## Q3 Exit Criteria + +- [ ] All services deploy via Flux (no manual `kubectl apply`) +- [ ] Grafana dashboards for every service (RED metrics) +- [ ] SLOs defined and monitored (99.9% API, 99.99% DB uptime) +- [ ] On-call rotation with PagerDuty +- [ ] Vault secrets, mTLS, Kyverno, cert-manager — no static credentials +- [ ] Chaos testing passed (node failure, agent disconnect, CNPG failover) +- [ ] DR tested (AZ failover, backup restore) +- [ ] Customer onboarding tested end-to-end +- [ ] Billing tested (subscription → usage → invoice → payment) +- [ ] kiven.io live behind Cloudflare +- [ ] **First customer live on production** +- [ ] DBA intelligence: recommendations, query optimizer, capacity planner, backup verification +- [ ] svc-yamleditor: Advanced Mode with change history +- [ ] svc-audit: immutable audit log +- [ ] svc-notification: Slack + email + webhook + +## Q3 Evolution Tracker + +| Week | Security | Onboarding | Billing | DBA Intel. 
| YAML Editor | Audit | Staging | First Customer | +|------|----------|------------|---------|------------|-------------|-------|---------|----------------| +| W19 | 🔨 | — | — | 🔨 | — | 🔨 | 🔨 | — | +| W20 | 🔨 | 🔨 | — | 🔨 | 🔨 | ✅ | ✅ | — | +| W21 | 🔨 | 🔨 | — | 🔨 | 🔨 | ✅ | ✅ | — | +| W22 | 🔨 | 🔨 | 🔨 | 🔨 | ✅ | ✅ | ✅ | — | +| W23 | ✅ | 🔨 | 🔨 | ✅ | ✅ | ✅ | ✅ | — | +| W24 | ✅ | ✅ | 🔨 | ✅ | ✅ | ✅ | ✅ | — | +| W25-26 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | — | +| W27-28 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🔨 (dry run) | +| W29-31 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | +--- -| Week | Repo | Task | Deliverable | -| ----- | ----------------------- | ---------------------------------------------------------------------------------------- | --------------------------------- | -| 16-17 | `infra-customer-aws` | Terraform module (primary): creates `KivenAccessRole` in customer AWS account (IAM + trust policy). Published to Terraform Registry. | IaC-native onboarding | -| 17 | `infra-customer-aws` | CloudFormation template (alternative): same IAM Role, launchable via one-click URL for non-Terraform customers | Quick-start onboarding for all | -| 17-18 | `infra-customer-aws` | EKS discovery: validate cluster access, discover node capacity, storage classes, CNPG | Automated cluster validation | -| 18-19 | `svc-api` + `dashboard` | Onboarding wizard: choose Terraform or CloudFormation -> paste IAM Role ARN -> validate -> register cluster -> create first DB | Self-service customer onboarding | -| 19-20 | `infra-customer-aws` | Advanced Terraform modules: VPC peering, private endpoints, custom KMS | Enterprise networking options | +# Q4 2026 — SCALE + OPTIMIZE + EXPAND (Oct 1 → Dec 31) +> **Theme**: Stabilize production, onboard more customers, add migration tools, prepare multi-operator architecture, compliance. +> +> **Objective**: 5+ customers live. SOC2 Type 1 started. Migration tools working. Multi-operator architecture designed. 
-**Phase 5 exit criteria**: Services deploy via ArgoCD to staging EKS, Grafana shows metrics/logs/traces, Vault manages secrets, customers can onboard via Terraform module OR CloudFormation + dashboard wizard, Cloudflare serves kiven.io. +## Week 32-33 (Oct 1 - Oct 10) — Post-Launch Stabilization ---- +| Dev | Focus | Task | +|-----|-------|------| +| Dev 1 | Monitoring | Fine-tune alerts, reduce noise, improve DBA recommendations accuracy | +| Dev 2 | Performance | Optimize agent footprint, reduce gRPC latency, connection pooling tuning | +| Dev 3 | Onboarding | Streamline onboarding based on first customer feedback, improve docs | +| Dev 4 | Dashboard | UX improvements from first customer feedback, polish | -## Phase 6: Enterprise Features (Weeks 17-24, overlaps with Phase 5) +## Week 34-37 (Oct 13 - Nov 7) — Migration Tools -**Goal**: Revenue-generating features, compliance, DBA intelligence, operational maturity. +| Week | Dev | Repo | Task | Deliverable | +|------|-----|------|------|-------------| +| W34-35 | Dev 1 | `svc-migrations` | Import from Aiven: logical replication setup, progress tracking, cutover | Customers can migrate from Aiven | +| W34-35 | Dev 3 | `svc-migrations` | Import from RDS: pg_dump/restore, pg_basebackup | Customers can migrate from RDS | +| W36-37 | Dev 1 | `svc-migrations` | Import from bare PostgreSQL, migration progress dashboard | Migrate from any PG source | +| W34-37 | Dev 2 | `svc-auth` | SSO/SAML support (enterprise), advanced RBAC (per-service permissions) | Enterprise auth | +| W34-37 | Dev 4 | `dashboard` | Migration wizard UI, SSO settings page | Migration + SSO in dashboard | -### Work Stream A: Monitoring + DBA Intelligence (1 developer) +**W37 Review checkpoint**: Migration from Aiven/RDS/bare PG works. SSO/SAML for enterprise customers. 
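The Aiven import path rides on PostgreSQL's native logical replication: a publication on the source, a subscription on the target, then a monitored cutover. The SQL pair, generated as strings (names illustrative; identifier quoting, DDL and sequence sync are deliberately out of scope here):

```python
def publication_sql(pub_name: str = "kiven_migration") -> str:
    """Run on the source (e.g. Aiven) database."""
    return f"CREATE PUBLICATION {pub_name} FOR ALL TABLES;"


def subscription_sql(sub_name: str, source_dsn: str,
                     pub_name: str = "kiven_migration") -> str:
    """Run on the target database; kicks off initial copy + streaming."""
    return (
        f"CREATE SUBSCRIPTION {sub_name} "
        f"CONNECTION '{source_dsn}' "
        f"PUBLICATION {pub_name};"
    )
```

Progress tracking would then poll `pg_stat_subscription` on the target until replication lag is near zero, at which point writes are stopped on the source and the application is repointed.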
+## Week 38-40 (Nov 10 - Nov 28) — Stategraph + CLI + Terraform Provider -| Week | Repo | Task | Deliverable | -| ----- | ---------------- | ------------------------------------------------------------------------------------------ | --------------------------------------- | -| 17-18 | `svc-monitoring` | Metrics ingestion from agent: store pg_stat_statements, connection counts, replication lag | Metrics pipeline from agent to SaaS | -| 18-19 | `svc-monitoring` | DBA recommendations engine: auto-tune postgresql.conf based on workload patterns | "Increase shared_buffers to 4GB" alerts | -| 19-20 | `svc-monitoring` | Query optimizer: slow query detection, visual EXPLAIN, index suggestions | Actionable query performance insights | -| 20-21 | `svc-monitoring` | Capacity planner: storage/CPU growth forecasting, "disk full in 14 days" warnings | Proactive capacity alerts | -| 21-22 | `svc-monitoring` | Backup verification: automated weekly restore tests, RPO compliance dashboard | Verified backup reliability | +| Week | Dev | Repo | Task | Deliverable | +|------|-----|------|------|-------------| +| W38 | Dev 3 | `bootstrap` | Stategraph setup: deploy or configure Stategraph (PostgreSQL backend for TF state) | Stategraph ready | +| W39 | Dev 3 | `bootstrap` | Migrate sso/ and control-tower/ from S3 to Stategraph, validate | TF state in Stategraph | +| W38-39 | Dev 1 | `kiven-cli` | CLI tool (`kiven`): login, list services, create DB, get connection string, logs | Terminal-first workflows | +| W39-40 | Dev 2 | `terraform-provider-kiven` | Terraform provider: `kiven_service`, `kiven_database_user` resources | IaC-native provisioning | +| W38-40 | Dev 4 | `dashboard` | CLI download page, Terraform docs, API key management improvements | Developer experience | +**W40 Review checkpoint**: Stategraph migrated. CLI works. Terraform provider available. 
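The CLI surface listed above (login, list services, create DB, connection string, logs) maps directly onto subcommands. A minimal argparse skeleton — command and flag names are placeholders, since the real `kiven` CLI design is W38-39 work:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """`kiven` command tree sketch; names and flags are placeholders."""
    parser = argparse.ArgumentParser(prog="kiven")
    sub = parser.add_subparsers(dest="command", required=True)

    sub.add_parser("login", help="authenticate (OIDC device flow)")
    sub.add_parser("list", help="list services")

    create = sub.add_parser("create", help="create a managed database")
    create.add_argument("name")
    create.add_argument("--plan", default="hobby")

    conn = sub.add_parser("connection", help="print a connection string")
    conn.add_argument("name")

    logs = sub.add_parser("logs", help="tail service logs")
    logs.add_argument("name")
    logs.add_argument("--follow", action="store_true")
    return parser
```

Keeping the parser as a pure builder function makes each subcommand testable without invoking the network-facing handlers.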
-### Work Stream B: Billing (1 developer) +## Week 41-44 (Dec 1 - Dec 26) — Compliance + Multi-Operator Prep +| Week | Dev | Repo | Task | Deliverable | +|------|-----|------|------|-------------| +| W41-42 | Dev 1 | `docs` | SOC2 Type 1 evidence collection: access controls, audit logs, encryption, change management | SOC2 evidence package | +| W41-42 | Dev 2 | `kiven-go-sdk` | Multi-operator architecture design: Provider interface review, Strimzi/Redis operator analysis | Architecture decision document | +| W43-44 | Dev 2 | `provider-strimzi` | Scaffold Strimzi provider (Kafka): basic YAML generation, CRD analysis | Multi-operator proof of concept | +| W41-44 | Dev 3 | Infrastructure | Customer #2-5 onboarding, Terraform module improvements from feedback | Scale validation | +| W41-44 | Dev 4 | `dashboard` | Multi-service UI prep (service type selector), performance optimization | Dashboard ready for Kafka | -| Week | Repo | Task | Deliverable | -| ----- | ------------- | ------------------------------------------------------------------------------- | -------------------------------- | -| 18-19 | `svc-billing` | Stripe integration: customer/subscription lifecycle, payment methods | Customers can subscribe to plans | -| 19-20 | `svc-billing` | Usage tracking: compute hours per cluster, storage consumption, backup storage | Accurate usage metering | -| 20-21 | `svc-billing` | Invoice generation: monthly invoices with line items (Kiven fee + AWS estimate) | Professional invoices | -| 21-22 | `svc-billing` | Dashboard billing page: plan upgrade/downgrade, payment history, cost breakdown | Self-service billing | +**W44 Review checkpoint**: SOC2 evidence started. Strimzi provider scaffolded. 5+ customers onboarded. 
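The multi-operator design question is largely an interface question: whatever seam `provider-cnpg` exposes today must generalize to Strimzi. One way to capture that seam — the method names and normalized states are assumptions for illustration, and the CNPG phase string is the value CNPG reports for a healthy cluster:

```python
from typing import Protocol


class Provider(Protocol):
    """Seam between Kiven services and one Kubernetes operator."""

    service_type: str  # "postgresql", "kafka", ...

    def render(self, spec: dict) -> list[dict]:
        """Service definition -> operator CRD manifests (as dicts)."""
        ...

    def parse_status(self, manifest: dict) -> str:
        """Operator status -> a normalized state (PENDING/RUNNING/...)."""
        ...


class CnpgProvider:
    service_type = "postgresql"

    def render(self, spec: dict) -> list[dict]:
        return [{
            "apiVersion": "postgresql.cnpg.io/v1",
            "kind": "Cluster",
            "metadata": {"name": spec["name"]},
            "spec": {"instances": spec.get("instances", 2)},
        }]

    def parse_status(self, manifest: dict) -> str:
        # Assumed: CNPG reports "Cluster in healthy state" when healthy.
        phase = manifest.get("status", {}).get("phase", "")
        return "RUNNING" if phase == "Cluster in healthy state" else "PENDING"
```

A `StrimziProvider` satisfying the same Protocol is then the W43-44 proof of concept: new CRDs and status mapping, no changes in svc-provisioner.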
+## Q4 Exit Criteria -### Work Stream C: Enterprise Services (1-2 developers) +- [ ] Stategraph: Terraform state migrated from S3 +- [ ] `kiven` CLI: login, create/list/delete services, get connection strings +- [ ] `terraform-provider-kiven`: resource types for services and users +- [ ] svc-migrations: import from Aiven, RDS, bare PostgreSQL +- [ ] SSO/SAML for enterprise customers +- [ ] SOC2 Type 1 evidence collection started +- [ ] Strimzi provider scaffolded (Kafka proof of concept) +- [ ] 5+ customers live on production +- [ ] Production stable for 3+ months +## Q4 Evolution Tracker -| Week | Repo | Task | Deliverable | -| ----- | ------------------ | --------------------------------------------------------------------------------------------------- | ---------------------------------- | -| 17-18 | `svc-audit` | Immutable audit log: every API call, every infra change, who/what/when, stored in append-only table | Compliance-ready audit trail | -| 18-19 | `svc-notification` | Alert dispatch: Slack, email, webhook, PagerDuty integration | Multi-channel alerting | -| 19-20 | `svc-yamleditor` | Advanced Mode: YAML viewer/editor with Monaco, CNPG schema validation, diff before apply | Expert users can see/edit all YAML | -| 20-21 | `svc-yamleditor` | Change history: git-like timeline of all YAML changes, rollback to any version | Full configuration history | -| 21-22 | `svc-migrations` | Import from Aiven: logical replication setup, progress tracking, cutover | Customers can migrate from Aiven | -| 22-24 | `svc-migrations` | Import from RDS + bare PostgreSQL: pg_dump/restore, pg_basebackup | Customers can migrate from any PG | -| 22-24 | `svc-auth` | SSO/SAML support (enterprise customers), advanced RBAC (per-service permissions) | Enterprise auth requirements | +| Week | Migrations | SSO/SAML | Stategraph | CLI | TF Provider | SOC2 | Multi-Operator | Customers | +|------|-----------|----------|------------|-----|-------------|------|----------------|-----------| 
+| W32-33 | — | — | — | — | — | — | — | 1 | +| W34-37 | 🔨 | 🔨 | — | — | — | — | — | 2 | +| W38-40 | ✅ | ✅ | 🔨 | 🔨 | 🔨 | — | — | 3 | +| W41-44 | ✅ | ✅ | ✅ | ✅ | ✅ | 🔨 | 🔨 | 5+ | +--- -**Phase 6 exit criteria**: Billing works (Stripe), audit log records everything, alerts dispatch to Slack/email, DBA intelligence gives recommendations, Advanced Mode lets experts edit YAML, customers can migrate from Aiven/RDS. +# Milestones Summary + +| Week | Date | Milestone | Verification | +|------|------|-----------|-------------| +| W3 | Mar 13 | Foundation done | `svc-api` returns plans from DB, `buf generate` works, all repos scaffolded | +| W5 | Mar 27 | Auth + Provider started | OIDC login works, CNPG YAML generates, AWS AssumeRole works | +| W7 | Apr 10 | Core services tested | Auth + Provider fully tested with unit + integration tests | +| W8 | Apr 17 | Core services complete | All building blocks ready for orchestration | +| W10 | May 1 | Provisioning pipeline works | API → provisioner → infra → agent → CNPG → PG running | +| W14 | May 29 | **E2E in kind** | Full loop: login → create DB → get connection string | +| W16 | Jun 12 | Dashboard complete | All pages with real data from API | +| W18 | Jun 26 | Staging on AWS | Services running on real EKS via Flux | +| W22 | Jul 24 | Security hardened | Vault, mTLS, Kyverno, cert-manager, Cilium | +| W24 | Aug 7 | Enterprise features | DBA intelligence, billing, audit, YAML editor, kiven.io live | +| W26 | Aug 22 | Production ready | Chaos tested, DR tested, security audited, docs complete | +| W29 | Sep 8 | **First customer live** | Real customer with managed PostgreSQL | +| W37 | Nov 7 | Migrations working | Import from Aiven, RDS, bare PG | +| W40 | Nov 28 | Developer tools | CLI + Terraform provider available | +| W44 | Dec 26 | Year-end | 5+ customers, SOC2 started, Kafka POC | --- -## Team Allocation (assuming 4 developers) +# Gantt Chart ```mermaid gantt - title Kiven Production Roadmap + title Kiven 2026 Roadmap 
dateFormat YYYY-MM-DD axisFormat %b %d - section Phase1_Foundation - SDK_complete :p1a, 2026-02-23, 3w - contracts_proto :p1b, 2026-02-23, 3w - svc_api_scaffold :p1c, 2026-02-23, 3w - apply_copier_templates :p1d, 2026-02-23, 2w - - section Phase2_Core - svc_auth :p2a, after p1a, 3w - provider_cnpg :p2b, after p1a, 5w - svc_infra :p2c, after p1a, 5w - svc_agent_relay :p2d, after p1b, 3w - svc_api_endpoints :p2e, after p1c, 4w - - section Phase3_Orchestration - kiven_agent :p3a, after p2d, 4w - svc_provisioner :p3b, after p2c, 4w - svc_clusters :p3c, after p2b, 3w - svc_backups :p3d, after p3c, 2w - svc_users :p3e, after p3c, 2w - e2e_integration :p3f, after p3b, 2w - - section Phase4_Dashboard - dashboard_integration :p4a, after p2e, 8w - - section Phase5_Production - argocd_helm_gitops :p5a, after p3f, 4w - observability_stack :p5b, after p3f, 6w - security_hardening :p5c, after p3f, 6w - customer_onboarding :p5d, after p5a, 4w - - section Phase6_Enterprise - svc_monitoring_dba :p6a, after p5b, 6w - svc_billing_stripe :p6b, after p5a, 4w - svc_audit_notif_yaml :p6c, after p5a, 6w - svc_migrations :p6d, after p6a, 4w + section Q1 — Foundation + SDK complete :q1a, 2026-02-23, 3w + contracts-proto :q1b, 2026-02-23, 3w + svc-api scaffold + endpoints :q1c, 2026-02-23, 5w + Apply Copier templates :q1d, 2026-02-23, 3w + svc-auth start (OIDC) :q1e, 2026-03-16, 2w + provider-cnpg start :q1f, 2026-03-16, 2w + svc-infra start :q1g, 2026-03-16, 2w + + section Q2 — Core + Orchestration + svc-auth complete :q2a, 2026-03-30, 2w + provider-cnpg complete :q2b, 2026-03-30, 2w + svc-infra complete :q2c, 2026-03-30, 3w + svc-agent-relay :q2d, 2026-04-13, 2w + kiven-agent :q2e, 2026-04-20, 4w + svc-provisioner :q2f, 2026-04-20, 5w + svc-clusters + backups :q2g, 2026-04-20, 4w + svc-users :q2h, 2026-05-18, 2w + E2E integration :crit, q2i, 2026-05-25, 1w + Dashboard integration :q2j, 2026-03-30, 10w + GitOps + Helm :q2k, 2026-06-01, 3w + Observability start :q2l, 2026-06-01, 4w + 
svc-monitoring start :q2m, 2026-06-08, 3w + + section Q3 — Production + Enterprise + Security hardening :q3a, 2026-06-29, 5w + Customer onboarding :q3b, 2026-07-06, 4w + DBA intelligence :q3c, 2026-06-29, 5w + svc-billing :q3d, 2026-07-20, 4w + svc-yamleditor :q3e, 2026-07-06, 3w + svc-audit + notification :q3f, 2026-06-29, 3w + Production readiness :q3g, 2026-08-10, 2w + Test customer dry run :q3h, 2026-08-25, 2w + First customer live :crit, q3i, 2026-09-08, 3w + + section Q4 — Scale + Expand + Post-launch stabilization :q4a, 2026-10-01, 2w + svc-migrations :q4b, 2026-10-13, 4w + SSO/SAML :q4c, 2026-10-13, 4w + Stategraph migration :q4d, 2026-11-10, 3w + CLI + TF provider :q4e, 2026-11-10, 3w + SOC2 + multi-operator :q4f, 2026-12-01, 4w ``` +--- +# Weekly Review Template -### Developer Assignment Suggestion +Use this template every Friday to track progress and catch risks early. -- **Dev 1 (Backend Lead)**: SDK -> svc-auth -> svc-provisioner -> svc-monitoring -> svc-billing -- **Dev 2 (K8s/Infra)**: contracts-proto -> provider-cnpg -> kiven-agent -> platform-observability -> platform-security -- **Dev 3 (Cloud/AWS)**: svc-api scaffold -> svc-infra -> svc-clusters + svc-backups + svc-users -> platform-gitops -> infra-customer-aws -- **Dev 4 (Frontend)**: Apply templates -> dashboard -> svc-yamleditor -> svc-audit + svc-notification + svc-migrations +## Week [N] Review — [Date] ---- +### Progress +- [ ] **Dev 1**: [What was planned] → [What was delivered] +- [ ] **Dev 2**: [What was planned] → [What was delivered] +- [ ] **Dev 3**: [What was planned] → [What was delivered] +- [ ] **Dev 4**: [What was planned] → [What was delivered] + +### Completion vs Plan +| Metric | Value | +|--------|-------| +| Tasks planned | X | +| Tasks completed | Y | +| Tasks carried over | Z | +| Completion rate | Y/X % | + +### Blockers +| Blocker | Impact | Owner | Resolution | +|---------|--------|-------|------------| +| | | | | + +### Risks Identified +| Risk | Probability | Impact | 
Mitigation | +|------|-------------|--------|------------| +| | | | | -## Milestones - - -| Week | Milestone | How to Verify | -| ---- | ------------------------- | --------------------------------------------------------------------------- | -| 3 | Foundation done | `svc-api` returns plans from DB, `buf generate` works, all repos scaffolded | -| 5 | Auth works | Log in via GitHub OIDC, get JWT, access protected endpoint | -| 6 | CNPG YAML generates | `provider-cnpg` produces valid Cluster YAML from service definition | -| 8 | AWS resources work | `svc-infra` creates node group + S3 bucket in test-client AWS account | -| 10 | Agent connects | Agent in kind sends heartbeat to relay, receives commands | -| 12 | Provisioning works | Create service -> provisioner -> agent deploys CNPG -> PG running | -| 14 | E2E in kind | Full loop works locally: login -> create DB -> get connection string | -| 16 | Dashboard complete | Everything works from the UI | -| 17 | Staging on AWS | Services running in real EKS via ArgoCD | -| 19 | Observability live | Grafana dashboards with metrics, logs, traces from staging | -| 20 | Security hardened | Vault secrets, mTLS, Kyverno policies, cert-manager TLS | -| 20 | Customer onboarding works | Terraform module OR CloudFormation -> EKS discovery -> create first DB | -| 22 | Billing live | Customer subscribes, gets invoiced via Stripe | -| 22 | DBA intelligence | Recommendations, slow query detection, backup verification | -| 24 | Production ready | All enterprise features, migrations, SSO/SAML, audit log | - - -## Production Readiness Checklist (before first customer) - -- All services deploy via ArgoCD (no manual `kubectl apply`) -- Grafana dashboards for every service (RED metrics: Rate, Errors, Duration) -- SLOs defined and monitored (99.9% API, 99.99% database uptime) -- On-call rotation set up with PagerDuty/OpsGenie -- Runbooks for top 10 alert scenarios -- DR tested: failover to second AZ, restore from backup -- Security audit: 
Vault secrets, mTLS, Kyverno, no static credentials
-- Chaos testing: node failure, agent disconnect, CNPG failover
-- Backup verification: automated weekly restore test passes
-- Customer onboarding flow tested end-to-end with test-client AWS account
-- Billing tested: subscription -> usage -> invoice -> payment
-- Documentation complete: API docs (Redoc), user guides, admin guides
-- Legal: Terms of Service, Privacy Policy, DPA (Data Processing Agreement)
-- SOC2 Type 1 evidence collection started
+### Key Decisions Made
+- [ ] Decision: ... → Rationale: ...
+
+### Next Week Focus
+- Dev 1: ...
+- Dev 2: ...
+- Dev 3: ...
+- Dev 4: ...
+
+### Quarter Health
+
+```
+Q[N] Progress: [██████░░░░] XX%
+On track: [YES/AT RISK/BEHIND]
+Top risk: [description]
+```
diff --git a/platform/PLATFORM-ENGINEERING.md b/platform/PLATFORM-ENGINEERING.md
index 18b4145..e1cf0cd 100644
--- a/platform/PLATFORM-ENGINEERING.md
+++ b/platform/PLATFORM-ENGINEERING.md
@@ -35,12 +35,12 @@
 
 # 🚀 **CI/CD & Delivery**
 
-## GitOps avec ArgoCD
+## GitOps avec Flux
 
 | Concept | Implementation |
 |---------|----------------|
 | **Source of Truth** | Git repositories |
-| **Delivery Model** | Pull-based (ArgoCD syncs from Git) |
+| **Delivery Model** | Pull-based (Flux reconciles from Git) |
 | **Environments** | Kustomize overlays (dev/staging/prod) |
 | **Promotion** | PR from dev → staging → prod overlays |
 
@@ -59,7 +59,7 @@
 |----------|-------------|--------------|
 | `ci-python.yml` | Lint, test, build | `svc-*`, `sdk-python` |
 | `ci-terraform.yml` | Format, lint, plan, apply | `platform-*`, `bootstrap` |
-| `cd-argocd.yml` | Trigger ArgoCD sync | Tous |
+| `cd-flux.yml` | Trigger Flux reconcile | Tous |
 | `security-scan.yml` | Trivy, Checkov, tfsec | Tous |
 
 ## Pipeline Stages
 
@@ -71,7 +71,7 @@
 │   │       │        │         │
 │   │       │        ▼         │
-│   │       │   ┌─────────────┐
-│   │       │   │ ArgoCD Sync │
-│   │       │   └─────────────┘
+│   │       │   ┌────────────────┐
+│   │       │   │ Flux Reconcile │
+│   │       │   └────────────────┘
 ▼   ▼       ▼        ▼
 Fail fast Coverage Image tag CVE check
 
@@ -81,10 +81,10 @@
 
| Metric | Target | 
Measurement | |--------|--------|-------------| -| **Git to Dev** | < 5 min | Commit to ArgoCD sync | -| **Git to Staging** | < 10 min | Commit to ArgoCD sync (manual approval) | -| **Git to Prod** | < 15 min | Commit to ArgoCD sync (manual approval) | -| **Rollback** | < 2 min | ArgoCD rollback | +| **Git to Dev** | < 5 min | Commit to Flux reconcile | +| **Git to Staging** | < 10 min | Commit to Flux reconcile (manual approval) | +| **Git to Prod** | < 15 min | Commit to Flux reconcile (manual approval) | +| **Rollback** | < 2 min | Flux rollback | ## Observability Requirements @@ -323,7 +323,7 @@ | **svc-ledger** | Availability | 99.9% | 43 min/mois | 14.4x = 1h alert | | **svc-ledger** | Latency P99 | < 200ms | N/A | P99 > 200ms for 5min | | **svc-wallet** | Availability | 99.9% | 43 min/mois | 14.4x = 1h alert | -| **Platform (ArgoCD, Prometheus)** | Availability | 99.5% | 3.6h/mois | 6x = 2h alert | +| **Platform (Flux, Prometheus)** | Availability | 99.5% | 3.6h/mois | 6x = 2h alert | ## Error Budget Policy diff --git a/resilience/DR-GUIDE.md b/resilience/DR-GUIDE.md index 59ec32f..fc5fabe 100644 --- a/resilience/DR-GUIDE.md +++ b/resilience/DR-GUIDE.md @@ -75,7 +75,7 @@ | **PostgreSQL PITR** | Aiven WAL | Continuous | 24h | Aiven | AES-256 | | **Kafka** | Topic retention | N/A | 7 jours | Aiven | AES-256 | | **Valkey** | RDB + AOF | Continuous | 24h | Aiven | AES-256 | -| **Terraform state** | S3 versioning | Every apply | 90 jours | S3 | KMS | +| **Terraform state** | S3 versioning | Every apply | 90 jours | S3 bucket | AES-256 | | **Git repos** | GitHub | Every push | Infini | GitHub | At-rest | | **Secrets (Vault)** | Integrated storage | Continuous | 30 jours | Vault HA | Transit | @@ -84,7 +84,7 @@ | Check | Frequency | Automation | Alert si échec | |-------|-----------|------------|----------------| | PostgreSQL restore test | Weekly | Job K8s scheduled | P2 | -| Terraform state integrity | Daily | CI pipeline | P3 | +| Terraform state backup | Daily 
| CI pipeline | P3 | | Vault backup verification | Weekly | Job K8s scheduled | P2 | | Git clone verification | Monthly | GitHub Actions | P4 | @@ -145,7 +145,7 @@ |-----------|---------|-------------------|-------|--------------| | **Pod** | Crash | Kubernetes restart | < 30s | Aucune | | **Pod** | OOM | Kubernetes restart + alert | < 30s | Investigation | -| **Deployment** | Bad deploy | ArgoCD rollback auto (si configuré) | < 2min | Aucune | +| **Deployment** | Bad deploy | Flux rollback auto (si configuré) | < 2min | Aucune | | **DB Primary** | Failure | Aiven automatic failover | < 5min | Aucune | | **DB Connection** | Pool exhausted | PgBouncer retry + scale | < 1min | Aucune | | **Kafka Consumer** | Lag > threshold | KEDA auto-scale | < 2min | Aucune | @@ -241,13 +241,13 @@ ## DR Automation — Infrastructure as Code -> **Principe :** Toute l'infrastructure est reproductible via Terraform + ArgoCD. +> **Principe :** Toute l'infrastructure est reproductible via Terraform + Flux. | Composant | Reproductibilité | Temps estimé | |-----------|------------------|--------------| | **EKS Cluster** | Terraform apply | ~30 min | -| **Platform tools** | ArgoCD sync | ~15 min | -| **Applications** | ArgoCD sync | ~10 min | +| **Platform tools** | Flux reconcile | ~15 min | +| **Applications** | Flux reconcile | ~10 min | | **Database** | Aiven restore from backup | ~1-2h | | **DNS cutover** | Cloudflare API / Terraform | ~5 min | @@ -258,7 +258,7 @@ | **1. Detection** | 15 min | Confirmer failure, déclarer DR | Alerting automatique | | **2. Infrastructure** | 1-2h | Terraform apply DR region | Semi-auto (approval required) | | **3. Data** | 1-2h | Aiven restore, verify integrity | Semi-auto (Aiven console) | -| **4. Applications** | 30 min | ArgoCD sync | Automatique | +| **4. Applications** | 30 min | Flux reconcile | Automatique | | **5. Traffic** | 15 min | Cloudflare DNS update | Semi-auto (Terraform) | | **6. 
Validation** | 30 min | E2E tests, verify SLIs | Automatique (CI) | diff --git a/testing/TESTING-STRATEGY.md b/testing/TESTING-STRATEGY.md index ee0a5bd..a2472cd 100644 --- a/testing/TESTING-STRATEGY.md +++ b/testing/TESTING-STRATEGY.md @@ -59,7 +59,7 @@ | Layer | Type de test | Cible | Fréquence | |-------|--------------|-------|-----------| | **Infrastructure** | Terraform tests, Policy checks | IaC modules | PR | -| **Platform** | Smoke tests, Policy audit | Kubernetes, ArgoCD | Post-deploy | +| **Platform** | Smoke tests, Policy audit | Kubernetes, Flux | Post-deploy | | **Application** | Unit, Integration, Contract | Services Python/Go | PR | | **System** | E2E, Performance, Chaos | Full stack | Nightly/Weekly | @@ -114,7 +114,7 @@ | **Manifest validation** | `kubectl --dry-run`, `kubeconform` | PR | YAML valide, schema correct | | **Policy check** | Kyverno CLI | PR | Policies passent | | **Helm lint** | `helm lint`, `helm template` | PR | Charts valides | -| **Smoke test** | ArgoCD sync + health check | Post-deploy | App déployée et healthy | +| **Smoke test** | Flux reconcile + health check | Post-deploy | App déployée et healthy | ---