This document describes the production Kubernetes deployment architecture for the Cybersecurity Agent Platform.
The platform is designed for cloud-native deployment with horizontal scaling, high availability, and complete observability.
- Gateway (Go) - API gateway with RBAC
- Brain (Python) - AI analysis and reporting
- Core Scanners (Rust) - High-performance scanning
- Celery Workers (Python) - Distributed task processing
- PostgreSQL - Multi-tenant database with RLS
- Redis - Queue and caching layer
- Prometheus - Metrics collection
- Structured Logs - JSON logging
graph TB
subgraph "External"
USER[Users]
PROM[Prometheus]
end
subgraph "Ingress"
LB[Load Balancer]
end
subgraph "Kubernetes Cluster"
subgraph "API Layer - 3-20 replicas"
GW1[Gateway Pod 1]
GW2[Gateway Pod 2]
GW3[Gateway Pod N]
end
subgraph "Intelligence Layer - 2-10 replicas"
BRAIN1[Brain Pod 1]
BRAIN2[Brain Pod 2]
end
subgraph "Worker Layer - 3-20 replicas"
CELERY1[Celery Worker 1]
CELERY2[Celery Worker 2]
CELERY3[Celery Worker N]
end
subgraph "Data Layer"
PG[(PostgreSQL<br/>StatefulSet)]
RD[(Redis<br/>Deployment)]
end
subgraph "Monitoring"
METRICS["/metrics endpoint"]
end
end
USER --> LB
LB --> GW1
LB --> GW2
LB --> GW3
GW1 --> BRAIN1
GW2 --> BRAIN2
GW3 --> BRAIN1
GW1 --> PG
GW2 --> PG
GW3 --> PG
GW1 --> RD
GW2 --> RD
GW3 --> RD
CELERY1 --> RD
CELERY2 --> RD
CELERY3 --> RD
CELERY1 --> PG
CELERY2 --> PG
CELERY3 --> PG
PROM --> METRICS
GW1 -.-> METRICS
GW2 -.-> METRICS
GW3 -.-> METRICS
Gateway:
- Min: 3 replicas
- Max: 20 replicas
- Trigger: average utilization above 70% CPU or 80% memory
Brain:
- Min: 2 replicas
- Max: 10 replicas
- Trigger: average utilization above 70% CPU or 80% memory
Celery Workers:
- Min: 3 replicas
- Max: 20 replicas
- Trigger: average utilization above 70% CPU or 80% memory
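
As a rough illustration of how the replica bounds and triggers above interact, the sketch below applies the replica formula from the Kubernetes Horizontal Pod Autoscaler documentation, `ceil(current * metric / target)`, to the 70% CPU and 80% memory thresholds. The function names are hypothetical, the 10% tolerance is the documented HPA default rather than a value stated here, and stabilization windows are left out.

```python
import math

def desired_replicas(current, utilization, target, min_replicas, max_replicas, tolerance=0.1):
    """Replica count proposed by a single metric."""
    ratio = utilization / target
    # Close enough to the target: keep the current replica count.
    if abs(ratio - 1.0) <= tolerance:
        desired = current
    else:
        desired = math.ceil(current * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))

def scale_decision(current, cpu_pct, mem_pct, min_replicas, max_replicas):
    """With several metrics, the largest proposal wins."""
    by_cpu = desired_replicas(current, cpu_pct, 70, min_replicas, max_replicas)
    by_mem = desired_replicas(current, mem_pct, 80, min_replicas, max_replicas)
    return max(by_cpu, by_mem)

if __name__ == "__main__":
    # Gateway at its minimum of 3 replicas, CPU at 90%, memory at 50%.
    print(scale_decision(3, cpu_pct=90, mem_pct=50, min_replicas=3, max_replicas=20))
```

For a Gateway running 3 replicas at 90% CPU, the CPU metric proposes `ceil(3 * 90 / 70) = 4` replicas while memory proposes staying at the minimum, so the result is 4.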
| Component | CPU Request | Memory Request | CPU Limit | Memory Limit |
|---|---|---|---|---|
| Gateway | 100m | 256Mi | 500m | 1Gi |
| Brain | 250m | 512Mi | 1000m | 2Gi |
| Celery Worker | 250m | 512Mi | 1000m | 2Gi |
| PostgreSQL | 250m | 512Mi | 1000m | 2Gi |
| Redis | 100m | 256Mi | 500m | 512Mi |
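
To get a feel for what these requests mean for cluster sizing, the following sketch totals CPU and memory requests for the three autoscaled components at their minimum and maximum replica counts. PostgreSQL and Redis are excluded because their replica counts are not stated; the helper names are illustrative only.

```python
def parse_cpu(quantity):
    """Convert a CPU quantity such as '250m' or '1' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)

def parse_mem(quantity):
    """Convert a memory quantity such as '512Mi' or '2Gi' to MiB."""
    units = {"Mi": 1, "Gi": 1024}
    return int(quantity[:-2]) * units[quantity[-2:]]

# (CPU request, memory request, min replicas, max replicas)
SCALED = {
    "Gateway": ("100m", "256Mi", 3, 20),
    "Brain": ("250m", "512Mi", 2, 10),
    "Celery Worker": ("250m", "512Mi", 3, 20),
}

def total_requests(use_max):
    """Sum requests across components; returns (millicores, MiB)."""
    cpu = mem = 0
    for cpu_q, mem_q, lo, hi in SCALED.values():
        replicas = hi if use_max else lo
        cpu += replicas * parse_cpu(cpu_q)
        mem += replicas * parse_mem(mem_q)
    return cpu, mem

if __name__ == "__main__":
    print("min:", total_requests(use_max=False))
    print("max:", total_requests(use_max=True))
```

Under these figures, the application pods request 1550m CPU and 3328 MiB at minimum scale, and 9500m CPU and 20480 MiB (20 GiB) at maximum scale.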
- PostgreSQL StatefulSet with persistent volume
- 20GB storage (expandable)
- Daily automated backups
- Point-in-time recovery
- All API services are stateless
- Session state in Redis
- Can scale horizontally without coordination
- Redis persistence enabled
- Job retry on worker failure
- Dead letter queue for failed jobs
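
The retry and dead-letter behavior listed above can be pictured with a small in-memory model. This is not Celery's API and not the platform's worker code; it is a toy sketch in plain Python, and the retry budget of three attempts is an assumption, since no limit is stated here.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumed retry budget; not specified in this document

def process_queue(jobs, handler, max_attempts=MAX_ATTEMPTS):
    """Run each job, re-queueing failures until the attempt budget is spent."""
    pending = deque((job, 1) for job in jobs)
    done, dead_letter = [], []
    while pending:
        job, attempt = pending.popleft()
        try:
            done.append(handler(job))
        except Exception as exc:
            if attempt < max_attempts:
                # Put the job back at the end of the queue for another try.
                pending.append((job, attempt + 1))
            else:
                # Budget exhausted: park the job for later inspection.
                dead_letter.append((job, repr(exc)))
    return done, dead_letter
```

A job that keeps failing is attempted three times and then lands in `dead_letter` together with the last error, while a job that fails once and then succeeds ends up in `done`.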
- Pods can only communicate with required services
- External access only through Load Balancer
- Database access restricted to app pods
- Kubernetes Secrets for sensitive data
- Environment-specific configurations
- Rotation support for API keys
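
One way a service might consume such secrets is sketched below: it prefers a file from a mounted Secret volume and falls back to an environment variable. The mount directory and the name-to-variable mapping are hypothetical. Reading the file on each call is one possible way to pick up rotated keys, since mounted Secret volumes are refreshed by Kubernetes while environment variables are fixed at pod start.

```python
import os
from pathlib import Path

def read_secret(name, mount_dir="/var/run/secrets/app"):  # hypothetical mount path
    """Return a secret from a mounted file if present, else from the environment."""
    path = Path(mount_dir) / name
    if path.is_file():
        return path.read_text().strip()
    # Fallback: 'api-key' maps to the environment variable API_KEY.
    value = os.environ.get(name.upper().replace("-", "_"))
    if value is None:
        raise KeyError(f"secret {name!r} not found in {mount_dir} or environment")
    return value
```

For example, `read_secret("api-key")` would return the contents of `/var/run/secrets/app/api-key` when that file exists and the value of `API_KEY` otherwise.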
- Service accounts with minimal permissions
- Pod security policies enforced
- Note: PodSecurityPolicy was removed in Kubernetes 1.25; on newer clusters the equivalent enforcement is Pod Security Admission
- Network policies for traffic control
- HTTP request rates and latencies
- Scan job metrics
- Authentication attempts
- Active sessions
- Emergency stop status
- Audit log statistics
- Structured JSON logs
- ISO8601 timestamps
- Correlation IDs for request tracing
- Log aggregation to centralized system
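
The logging conventions above (one JSON object per line, ISO8601 timestamps, a correlation ID) can be sketched with Python's standard `logging` module. The field names and the `make_logger` helper are assumptions for illustration, not the platform's actual log schema.

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object."""

    def format(self, record):
        payload = {
            # ISO8601 timestamp in UTC, e.g. 2024-01-01T12:00:00.123456+00:00
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Supplied via `extra=`; None when there is no request context.
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)

def make_logger(name, stream=None):
    """Build a logger that writes JSON lines to `stream` (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger

if __name__ == "__main__":
    log = make_logger("gateway")
    log.info("scan queued", extra={"correlation_id": "req-123"})
```

Passing the same `correlation_id` on every log call made while handling one request lets the aggregation system group those lines into a single trace.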
- Liveness probes (process health)
- Readiness probes (service availability)
- Startup probes for slow-starting services
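
The difference between the three probe types can be illustrated with a small decision function. The paths `/healthz`, `/readyz`, and `/startupz` and the `ServiceState` fields are assumed names for this sketch; the actual endpoints are not specified here.

```python
from dataclasses import dataclass

@dataclass
class ServiceState:
    started: bool = False    # initial warm-up has finished
    db_ok: bool = False      # PostgreSQL is reachable
    redis_ok: bool = False   # Redis is reachable

def probe_status(path, state):
    """Map a probe path to the HTTP status code the service would return."""
    if path == "/healthz":
        # Liveness: the process is able to answer at all.
        return 200
    if path == "/startupz":
        # Startup: hold off the other probes until warm-up completes.
        return 200 if state.started else 503
    if path == "/readyz":
        # Readiness: only receive traffic when dependencies are usable.
        ready = state.started and state.db_ok and state.redis_ok
        return 200 if ready else 503
    return 404
```

Keeping dependency checks out of the liveness path is a deliberate choice in this sketch: a database outage then removes pods from load balancing without causing them to be restarted.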
- Database: Daily full backups + WAL archiving
- Configuration: Git-based version control
- Secrets: Encrypted backup in secure storage
- RTO: 1 hour (maximum downtime)
- RPO: 15 minutes (maximum data loss)
- Detect failure (automated alerts)
- Restore from backup
- Verify data integrity
- Resume operations
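
The recovery objectives above can be turned into simple checks. The sketch below assumes that the age of the most recently archived WAL segment approximates the data that would be lost on failure, and that downtime is the sum of the detect, restore, and verify steps; both are simplifications, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

RPO = timedelta(minutes=15)  # maximum tolerated data loss
RTO = timedelta(hours=1)     # maximum tolerated downtime

def rpo_exposure(last_archived_wal, now):
    """Window of changes that would be lost if the primary failed at `now`."""
    return now - last_archived_wal

def within_rpo(last_archived_wal, now, rpo=RPO):
    return rpo_exposure(last_archived_wal, now) <= rpo

def within_rto(detect, restore, verify, rto=RTO):
    """Check that the recovery steps fit inside the downtime budget."""
    return detect + restore + verify <= rto

if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(within_rpo(now - timedelta(minutes=10), now))
    print(within_rto(timedelta(minutes=5), timedelta(minutes=40), timedelta(minutes=10)))
```

A check like `within_rpo` could feed an alert when WAL archiving falls behind, and `within_rto` could be evaluated against timings measured during recovery drills.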
- API Throughput: 10,000 requests/second
- Scan Processing: 1,000 concurrent scans
- Report Generation: <5 seconds
- Auto-scaling Response: <60 seconds
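
As a back-of-the-envelope check on the throughput target, the sketch below relates the 10,000 requests/second figure to the Gateway replica range of 3 to 20. It assumes load is spread evenly across Gateway pods; the per-pod capacity passed to `replicas_needed` is a hypothetical input that would come from load testing.

```python
import math

TARGET_RPS = 10_000  # API throughput target

def per_pod_rps(replicas, target=TARGET_RPS):
    """Load each pod must sustain if traffic is spread evenly."""
    return target / replicas

def replicas_needed(pod_capacity_rps, target=TARGET_RPS):
    """Smallest replica count that covers the target at a given per-pod capacity."""
    return math.ceil(target / pod_capacity_rps)

if __name__ == "__main__":
    print(per_pod_rps(20))
    print(replicas_needed(600))
```

At the maximum of 20 replicas each Gateway pod would need to sustain 500 requests/second; if a pod could only handle 400, the target would require 25 replicas, which exceeds the configured maximum.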
- Burst scaling during peak hours
- Scale down during low traffic
- Spot instances for non-critical workers
- Compression for large reports
- Archive old scan results
- S3/GCS for cold storage
graph LR
A[Code Commit] --> B[CI Build]
B --> C[Unit Tests]
C --> D[Integration Tests]
D --> E[Docker Build]
E --> F[Push to Registry]
F --> G[Deploy to Staging]
G --> H[E2E Tests]
H --> I{Manual Approval}
I -->|Approved| J[Deploy to Production]
I -->|Rejected| K[Roll Back]
- Zero-downtime deployments
- Gradual pod replacement
- Health check validation
- Automatic rollback on failure
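
The rolling update behavior above (gradual replacement, health validation, rollback on failure) can be modeled in a few lines. This is a toy simulation, not the Kubernetes Deployment controller; `rolling_update` and the `is_healthy` callback are hypothetical names, and pods are replaced strictly one at a time.

```python
def rolling_update(pods, new_version, is_healthy):
    """Replace pods one by one; restore the original set if a health check fails.

    Returns (resulting pod versions, success flag).
    """
    original = list(pods)
    current = list(pods)
    for i in range(len(current)):
        current[i] = new_version              # replace a single pod
        if not is_healthy(i, new_version):    # validate before continuing
            return original, False            # roll back to the previous state
    return current, True

if __name__ == "__main__":
    print(rolling_update(["v1", "v1", "v1"], "v2", lambda i, v: True))
    print(rolling_update(["v1", "v1", "v1"], "v2", lambda i, v: i != 2))
```

Because only one pod changes at a time, the remaining pods keep serving traffic throughout, which is what makes the update zero-downtime in this model.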
- For major updates
- Full environment duplication
- Traffic switch after validation
- Quick rollback capability