Obiente Cloud is a Platform-as-a-Service (PaaS) similar to Vercel, designed to run user deployments across a distributed Docker Swarm cluster. The platform automatically orchestrates, routes, and monitors user applications across multiple nodes.
These are the core services that manage the platform:
- API (`api`): ConnectRPC service handling deployment operations
  - Manages Docker containers via the Docker API
  - Tracks deployment locations across nodes
  - Handles the deployment lifecycle (create, update, delete)
  - User authentication and authorization (Zitadel)
  - Project and deployment management
  - One instance per node (global mode) for direct Docker access
- Orchestrator: decides which node should host new deployments
  - Strategies: least-loaded, round-robin, resource-based
  - Monitors node health and capacity
  - Handles deployment migration when needed
  - Integrates with the Docker Swarm API for placement
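The resource-based strategy above can be sketched as a scoring function over spare capacity. This is an illustrative sketch only: the `Node` fields, the `SelectResourceBased` name, and the weights are assumptions, not the platform's actual types.

```go
package main

import "fmt"

// Node is a hypothetical view of a Swarm node's spare capacity.
type Node struct {
	ID          string
	CPUFree     float64 // fraction of CPU still available, 0.0-1.0
	MemFree     float64 // fraction of memory still available, 0.0-1.0
	Deployments int     // containers already placed on this node
}

// score ranks a node by spare capacity; higher is better.
// The weights are illustrative, not tuned values from the platform.
func score(n Node) float64 {
	return 0.5*n.CPUFree + 0.4*n.MemFree - 0.1*float64(n.Deployments)/100
}

// SelectResourceBased returns the node with the highest score.
func SelectResourceBased(nodes []Node) (Node, error) {
	if len(nodes) == 0 {
		return Node{}, fmt.Errorf("no available nodes")
	}
	best := nodes[0]
	for _, n := range nodes[1:] {
		if score(n) > score(best) {
			best = n
		}
	}
	return best, nil
}

func main() {
	nodes := []Node{
		{"node-1", 0.10, 0.20, 80},
		{"node-2", 0.70, 0.60, 12},
		{"node-3", 0.40, 0.50, 35},
	}
	n, _ := SelectResourceBased(nodes)
	fmt.Println(n.ID) // node-2
}
```

The same shape extends to least-loaded (score on CPU alone) and round-robin (a rotating cursor over the node list).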
- PostgreSQL: 3-node cluster with automatic failover
  - Patroni: manages PostgreSQL replication and failover
  - etcd: distributed consensus for leader election
  - PgPool: connection pooling and load balancing
  - Stores:
    - User accounts and organizations
    - Project and deployment metadata
    - Deployment locations (node_id, container_id, status)
    - Node resource usage and capacity
    - Routing configuration
- TimescaleDB: separate instance optimized for time-series data
  - Production HA: 3-node TimescaleDB cluster with Patroni + etcd (mirrors the PostgreSQL setup)
  - PgPool: `metrics-pgpool` for connection pooling and load balancing
  - Stores:
    - Container metrics (CPU, memory, network, disk I/O)
    - Aggregated hourly usage statistics
    - Historical deployment performance data
  - Benefits:
    - Isolated from the main database workload
    - Optimized for time-series queries and aggregations
    - Automatic hypertable partitioning for performance
  - Falls back to the main PostgreSQL cluster if TimescaleDB is unavailable
- Redis: 3-node cluster for distributed caching
  - Use cases:
    - Session storage
    - API response caching
    - Job queue for deployment operations
    - Real-time deployment status updates
    - Rate limiting state
User applications run as Docker containers distributed across worker nodes:
```
Node 1:            Node 2:            Node 3:
- user-app-a-v1    - user-app-b-v1    - user-app-c-v1
- user-app-d-v2    - user-app-e-v1    - user-app-f-v1
- user-app-g-v1    - user-app-h-v2    - user-app-i-v1
```
Each deployment is tracked in the database:
deployment_locations table:
- deployment_id
- node_id (which Swarm node)
- container_id (Docker container ID)
- status (running/stopped/failed)
- port (assigned port)
- domain (custom domain)
- health_status
- resource_usage

Traefik acts as the dynamic reverse proxy:
- Service Discovery: Automatically discovers services via Docker Swarm API
- Dynamic Routing: Routes traffic based on domains and paths
- SSL/TLS: Automatic HTTPS via Let's Encrypt
- Load Balancing: Distributes traffic across deployment replicas
- Middleware: Rate limiting, authentication, compression
Domain-Based Routing:
Obiente Cloud uses domain-based routing for service-to-service communication, enabling:
- Cross-node communication: Services on different nodes communicate via domains
- Cross-network communication: Works with VPNs, service meshes, and custom networks
- Multi-cluster support: Multiple Swarm clusters can share the same domain
- Automatic load balancing: Traefik load balances across all healthy replicas
By default, services communicate via HTTPS through Traefik (e.g., https://auth-service.${DOMAIN}) instead of direct service-to-service HTTP. This enables distributed deployments across nodes, networks, and clusters.
See Domain-Based Routing Guide for detailed configuration.
Routing flow:
```
User Request (app.example.com)
        ↓
Traefik (checks routing table)
        ↓
Looks up deployment_routing table in DB
        ↓
Routes to correct node + container
        ↓
User's deployed application
```
Service-to-service communication:
```
Service A (Node 1)
        ↓
Requests: https://auth-service.${DOMAIN}
        ↓
Traefik (load balances across all nodes)
        ↓
Service B (Node 2 or Node 3)
```
The platform uses a production-ready metrics system with:
- Live Metrics Streaming: Real-time container stats collection (5-second intervals)
- In-Memory Caching: Fast access to recent metrics for UI streaming
- Batch Storage: Aggregated metrics written to TimescaleDB every minute
- Resilience Features:
  - Circuit breaker pattern for Docker API protection
  - Exponential backoff retry mechanism
  - Automatic graceful degradation under load
  - Health monitoring with failure detection
  - Backpressure handling for slow subscribers
```
Container Stats (Docker API)
        ↓
Metrics Streamer (Parallel Collection)
        ↓
Live Cache (Memory) → Subscribers (UI streaming)
        ↓
Aggregation (Every 60s)
        ↓
TimescaleDB (Historical Storage)
```
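The "Live Cache (Memory) → Subscribers" hop is where backpressure handling matters: a slow UI subscriber must not stall collection. One common pattern, sketched here with illustrative names, is a buffered channel per subscriber with a non-blocking send that drops the oldest sample when the buffer fills.

```go
package main

import "fmt"

// Sample is a hypothetical metrics data point.
type Sample struct {
	ContainerID string
	CPUPercent  float64
}

// publish delivers a sample without ever blocking the collector:
// if the subscriber's buffer is full, the oldest sample is dropped
// to make room for the newest one.
func publish(ch chan Sample, s Sample) (dropped bool) {
	select {
	case ch <- s:
		return false
	default:
		<-ch    // evict the oldest buffered sample
		ch <- s // then enqueue the new one
		return true
	}
}

func main() {
	sub := make(chan Sample, 2) // slow subscriber with a small buffer
	drops := 0
	for i := 0; i < 5; i++ {
		if publish(sub, Sample{"abc123", float64(i)}) {
			drops++
		}
	}
	fmt.Println(len(sub), drops) // buffer stays at 2; 3 samples dropped
}
```

Note this single-goroutine sketch elides the locking a real fan-out needs when the subscriber drains concurrently; the point is only that the collector side never blocks.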
- `/health`: Health check including metrics system status
- `/metrics/observability`: Real-time metrics collection statistics
  - Collection rates and error counts
  - Database write success/failure rates
  - Circuit breaker state
  - Subscriber and cache metrics
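Counters like these can be kept with atomic operations so the hot collection path never takes a lock. A minimal sketch with illustrative names (not the service's actual types):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// CollectorStats tracks collection and database-write outcomes.
// atomic.Int64 keeps increments lock-free on the hot path.
type CollectorStats struct {
	Collected   atomic.Int64
	CollectErrs atomic.Int64
	DBWrites    atomic.Int64
	DBWriteErrs atomic.Int64
}

// Snapshot returns a point-in-time copy for the observability endpoint.
func (s *CollectorStats) Snapshot() map[string]int64 {
	return map[string]int64{
		"collected":       s.Collected.Load(),
		"collect_errors":  s.CollectErrs.Load(),
		"db_writes":       s.DBWrites.Load(),
		"db_write_errors": s.DBWriteErrs.Load(),
	}
}

func main() {
	var stats CollectorStats
	stats.Collected.Add(3)
	stats.CollectErrs.Add(1)
	fmt.Println(stats.Snapshot()["collected"]) // 3
}
```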
- Prometheus: Scrapes metrics from all services and nodes (optional)
- Grafana: Visualizes metrics and creates dashboards (optional)
- Metrics tracked:
  - Node resource usage (CPU, memory, disk)
  - Deployment resource consumption (real-time and historical)
  - API request latency and throughput
  - Database performance
  - Network traffic per deployment
  - Metrics collection health and performance
1. User initiates deployment via API

   ```
   POST /api/v1/deployments
   { project_id, git_repo, branch, env_vars }
   ```

2. Orchestrator selects target node

   ```go
   func SelectNode(strategy string) (*Node, error) {
       nodes := GetAvailableNodes()
       switch strategy {
       case "least-loaded":
           return nodes.WithLowestCPU()
       case "round-robin":
           return nodes.NextInRotation()
       default:
           return nil, fmt.Errorf("unknown strategy %q", strategy)
       }
   }
   ```

3. Go API on target node creates the container

   ```go
   container := dockerClient.CreateContainer(deployment.Image, deployment.Config)
   dockerClient.StartContainer(container.ID)
   ```

4. Location is recorded in the database

   ```go
   location := DeploymentLocation{
       DeploymentID: deployment.ID,
       NodeID:       currentNode.ID,
       ContainerID:  container.ID,
       Status:       "running",
       Port:         assignedPort,
   }
   db.RecordDeploymentLocation(location)
   ```

5. Routing is configured

   ```go
   routing := DeploymentRouting{
       DeploymentID: deployment.ID,
       Domain:       deployment.Domain,
       TargetPort:   assignedPort,
       SSLEnabled:   true,
   }
   db.UpsertDeploymentRouting(routing)
   ```

6. Traefik discovers the new container via Docker labels

   ```yaml
   labels:
     - "traefik.enable=true"
     - "traefik.http.routers.{deployment-id}.rule=Host(`{domain}`)"
     - "traefik.http.services.{deployment-id}.loadbalancer.server.port={port}"
   ```
The system maintains several tracking mechanisms:
deployment_locations: Real-time deployment locations

```sql
SELECT * FROM deployment_locations WHERE deployment_id = 'dep_123';
-- Result: node_id='node-worker-2', container_id='abc123', status='running'
```

node_metadata: Cluster node information

```sql
SELECT * FROM node_metadata WHERE availability='active' ORDER BY deployment_count ASC;
-- Returns nodes sorted by current load
```

deployment_routing: Traffic routing configuration

```sql
SELECT * FROM deployment_routing WHERE domain = 'myapp.com';
-- Returns: deployment_id, target_port, load_balancer_algo
```

The Go API queries Docker Swarm directly:
```go
// Get all nodes in the cluster
nodes, _ := dockerClient.NodeList(ctx, types.NodeListOptions{})

// Get containers on the current node
containers, _ := dockerClient.ContainerList(ctx, types.ContainerListOptions{
    Filters: filters.NewArgs(
        filters.Arg("label", "cloud.obiente.deployment=true"),
    ),
})
```

A background job runs every minute:
```go
func ReconcileDeployments() {
    // 1. Query actual containers from Docker
    actualContainers := getAllContainersFromAllNodes()

    // 2. Compare with database records
    dbRecords := db.GetAllDeploymentLocations()

    // 3. Update discrepancies
    for _, container := range actualContainers {
        if !existsInDB(container) {
            db.RecordDeploymentLocation(container)
        }
    }

    // 4. Clean up stale records
    for _, record := range dbRecords {
        if !existsInCluster(record) {
            db.RemoveDeploymentLocation(record.ContainerID)
        }
    }
}
```

Nodes are labeled for deployment placement:
```bash
# Label nodes for PostgreSQL replicas
docker node update --label-add postgres.replica=1 node-1
docker node update --label-add postgres.replica=2 node-2
docker node update --label-add postgres.replica=3 node-3

# Label nodes for Redis
docker node update --label-add redis=1 node-1
docker node update --label-add redis=2 node-2
docker node update --label-add redis=3 node-3

# Label compute nodes for user deployments
docker node update --label-add compute=true node-4
docker node update --label-add compute=true node-5
docker node update --label-add compute=true node-6
```

Control Plane Services:
- API services: Scale replicas based on request load
- Orchestrator: 2-3 replicas for redundancy
Data Plane:
- PostgreSQL: 3-5 replicas (1 primary + 2-4 replicas)
- Redis: 3-6 nodes for cluster
User Deployments:
- Add more worker nodes as capacity increases
- Each node can handle 50-100 deployments (configurable)
Increase resources per node based on workload:
- Manager nodes: 4-8 CPU, 8-16GB RAM
- Worker nodes: 8-16 CPU, 16-32GB RAM
- Database nodes: 4-8 CPU, 16-32GB RAM
High Availability:
- PostgreSQL: Automatic failover within seconds via Patroni
- Redis: Cluster mode with automatic resharding
- API Services: Multiple replicas behind load balancer
- Deployments: Can be replicated across multiple nodes
- Traefik: Multiple instances for routing redundancy
Security:
- Network Isolation: Overlay network isolates services
- TLS Everywhere: Automatic HTTPS via Let's Encrypt
- Resource Limits: CPU/memory limits prevent resource exhaustion
- Authentication: Zitadel for user authentication
- API Security: Rate limiting, CORS, helmet middleware
- Database: Connection pooling, prepared statements
- Secrets Management: Docker secrets for sensitive data
Backup:
- PostgreSQL: Continuous archiving + daily snapshots
- Redis: AOF persistence + periodic snapshots
- Deployment metadata: Replicated across 3 nodes
Recovery:
- Database failover: Automatic via Patroni (< 30 seconds)
- Node failure: Swarm reschedules containers automatically
- Complete cluster failure: Restore from backups
Node metrics:
- CPU usage per node
- Memory usage per node
- Deployment count per node
- Network throughput
Deployment metrics:
- Total deployments
- Deployments per project
- Resource usage per deployment
- Request latency per deployment
Platform metrics:
- API response times
- Database query performance
- Cache hit rates
- Error rates
Performance Optimization:
- Database Connection Pooling: PgPool with 100 connections
- Caching: Redis for frequently accessed data (5min TTL)
- CDN: Serve static assets via CDN
- Image Optimization: Use multi-stage Docker builds
- Lazy Loading: Load deployment metadata on-demand
- Batch Operations: Bulk update deployment status
Cost Optimization:
- Resource Limits: Prevent over-provisioning
- Auto-scaling: Scale down during low usage
- Spot Instances: Use for non-critical worker nodes
- Caching: Reduce database queries
- Compression: Enable gzip for API responses
Future Enhancements:
- Multi-region deployments: Deploy to multiple geographic regions
- Edge computing: Run deployments closer to users
- Serverless functions: Support for FaaS workloads
- Auto-scaling: Automatically scale deployments based on traffic
- Blue-green deployments: Zero-downtime deployment updates
- A/B testing: Traffic splitting between deployment versions
- Cost analytics: Per-deployment resource usage and billing