Harmony Chat is a fullstack chat application created for learning purposes as an exploration of distributed systems, event-driven architecture, and modern DevOps practices.
We intentionally keep the application itself simple in order to focus on the more complex underlying architecture patterns needed for high scalability, state consistency, and real-time data streaming.
- Backend: FastAPI, SQLAlchemy, Pytest
- Frontend: TypeScript, Next.js, React Query, Tailwind CSS, Orval
- Infrastructure: AutoMQ Kafka, PostgreSQL, DynamoDB, Redis, Centrifugo, Conduit, Redpanda Connect
The system is designed around separating relational state (users, chat metadata) from high-volume, append-only data (chat messages), bound together by a fault-tolerant event-driven backbone.
Here is the logic behind our core architectural decisions:
- PostgreSQL (State Source of Truth): Ideal for storing user profiles, chat metadata, and user-chat relational mappings. These domains require complex transactions and row-based locking, making Postgres the strict source of truth for the application state. We rely on Change Data Capture (CDC) to keep the rest of the system eventually consistent with this state.
- DynamoDB (Message Storage): Chosen to store real-time chat data. Its low latency and high scalability are perfectly suited for a small message record data model that doesn't require relational joins. It serves as the long-term storage for messages and is eventually consistent with the Kafka topics that source them.
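The "small message record" data model above can be sketched as a single-table layout keyed for time-ordered reads. Everything here is illustrative (the real table, attribute names, and ULID library are not specified in this document); the minimal ULID generator below just shows why a ULID sort key gives send-order pagination for free:

```python
import os
import time

# Crockford base32 alphabet used by ULID (no I, L, O, U)
_B32 = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def new_ulid(ts_ms=None) -> str:
    """26-char ULID: 48-bit ms timestamp + 80 random bits.
    Lexicographic order matches creation time, which is what makes it
    usable as a DynamoDB sort key and a pagination cursor."""
    ts = int(time.time() * 1000) if ts_ms is None else ts_ms
    value = (ts << 80) | int.from_bytes(os.urandom(10), "big")
    chars = []
    for _ in range(26):
        chars.append(_B32[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))

def message_item(chat_id: str, author_id: str, text: str) -> dict:
    """Illustrative item shape: partition key = chat_id, sort key = ULID
    message id, so a Query on chat_id returns messages in send order
    with no relational joins."""
    return {
        "chat_id": chat_id,        # partition key
        "message_id": new_ulid(),  # sort key (time-ordered)
        "author_id": author_id,
        "text": text,
    }
```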
Kafka acts as the source of truth for chat history. When a user sends a message, it is published to Kafka and considered committed once persisted there.
- A Kafka Connect sink with a DynamoDB integration listens to this topic, strips transient author metadata from the payload, and persists the data to DynamoDB.
- Simultaneously, Centrifugo consumes the same topic to instantly broadcast messages to active WebSocket listeners.
- The Architectural Advantage: Kafka acts as a buffer that decouples Centrifugo, DynamoDB, and the FastAPI backend.
- For instance, if DynamoDB experiences throttling or downtime, the system maintains short-term consistency: Centrifugo continues delivering messages to active users in real time, while DynamoDB safely catches up once it recovers.
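The fan-out above can be sketched with an in-memory stand-in for the Kafka topic (the real system uses AutoMQ, Redpanda Connect, and Centrifugo; the classes and the `author_name` field here are illustrative). The point is that the append is the commit, and each consumer reads independently, so a stalled sink never blocks the real-time path:

```python
from dataclasses import dataclass, field

@dataclass
class Topic:
    """In-memory stand-in for a Kafka topic."""
    log: list = field(default_factory=list)

    def produce(self, event: dict) -> int:
        self.log.append(event)    # message is "committed" here
        return len(self.log) - 1  # offset

    def poll(self, offset: int):
        return self.log[offset:]  # each consumer tracks its own offset

def sink_to_dynamo(events, table: dict) -> None:
    """Connect-style sink: strip transient author metadata, then persist."""
    for e in events:
        record = {k: v for k, v in e.items() if k != "author_name"}
        table[record["message_id"]] = record

def broadcast(events, websockets: list) -> None:
    """Centrifugo-style fan-out of the full payload to active listeners."""
    for e in events:
        for ws in websockets:
            ws.append(e)
```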
Application state changes (e.g., adding users to a chat, deleting a chat) produce side effects that must propagate through the entire system.
- We utilize the Transactional Outbox Pattern in PostgreSQL to atomically write state changes alongside outbox events within the same transaction.
- Debezium reads the Postgres Write-Ahead Log (WAL) and streams these outbox events into Kafka.
- A custom Python CDC consumer worker (`harmony.consumer`) processes these events with strict idempotency guarantees to update secondary storage (PostgreSQL and DynamoDB) and invalidate Redis caches.
- The Architectural Advantage:
- Because Kafka guarantees at-least-once delivery and our consumer event handlers are strictly idempotent, this pattern guarantees that the distributed system eventually converges with the relational source of truth, avoiding the need for distributed transactions.
We use Centrifugo to manage persistent WebSocket connections, providing a simple and highly scalable pub/sub model for frontend clients.
- It securely authenticates connections and channel subscriptions by proxying requests back to the FastAPI backend, which completely decouples the WebSocket connection pool from the Python API instances.
- Under the hood, it uses a Redis engine to cache user presence and recent messages, and utilizes Redis pub/sub as the broker for real-time message data.
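The backend side of that authorization proxying can be sketched as a pure decision function (the channel prefix, membership lookup, and error code here are illustrative, and the allow/deny response shapes follow Centrifugo's documented proxy format; in the real backend the membership set is a Postgres/Redis lookup):

```python
def on_subscribe(user_id: str, channel: str, memberships: set) -> dict:
    """Decide a Centrifugo subscribe-proxy request: Centrifugo POSTs the
    client and channel to the backend, which replies allow or deny."""
    chat_id = channel.removeprefix("chat:")
    if (user_id, chat_id) in memberships:
        return {"result": {}}  # empty result object = allow subscription
    # deny with an error object (code shown is illustrative)
    return {"error": {"code": 103, "message": "permission denied"}}
```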
We use the Cache-Aside pattern with Redis to increase response speed and decrease database load for metadata lookups.
- We frequently need to verify user-chat membership for authorization and hydrate chat messages with user metadata. By caching this data, we dramatically increase response speeds and reduce the load on PostgreSQL.
- Cache invalidation is integrated into the CDC event consumer to prevent stale reads when the relational state changes.
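The membership check above can be sketched as cache-aside with a dict-backed TTL cache standing in for Redis (the key format and TTL are illustrative); `delete` is what the CDC consumer calls when the relational state changes:

```python
import time

class Cache:
    """Dict-backed TTL cache standing in for Redis."""

    def __init__(self):
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        hit = self._store.get(key)
        if hit is None or hit[0] < time.monotonic():
            return None  # miss or expired
        return hit[1]

    def set(self, key, value, ttl: float = 60.0):
        self._store[key] = (time.monotonic() + ttl, value)

    def delete(self, key):
        self._store.pop(key, None)  # called by the CDC consumer on change

def is_member(cache: Cache, db_lookup, chat_id: str, user_id: str) -> bool:
    """Cache-aside: check the cache, fall back to the database loader,
    then populate the cache so the next check skips PostgreSQL."""
    key = f"membership:{chat_id}:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    result = db_lookup(chat_id, user_id)  # hits PostgreSQL only on a miss
    cache.set(key, result)
    return result
```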
Authentication utilizes rotating JWT access and refresh tokens.
- The Next.js server acts as a proxy (`api/proxy/{api_path}`), securely attaching HTTP-only cookies to outgoing requests bound for the internal FastAPI backend.
- We transparently handle access token management without disrupting the client experience: middleware (`proxy.ts`) resolves expired credentials, and `api/refresh` handles HTTP 401s.
- The backend uses opaque string refresh tokens stored in PostgreSQL and rotates them on refresh, with a short grace period before invalidation to handle multiple refresh requests in close proximity.
- The frontend heavily utilizes React Query for optimistic UI updates, complete with automatic rollback capabilities upon network failure.
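The rotation-with-grace-period behavior can be sketched in memory (in the real backend the token rows live in PostgreSQL; the class, dict store, and `GRACE_SECONDS` value are illustrative). The grace window lets two near-simultaneous refreshes, e.g. from parallel tabs, both succeed:

```python
import secrets
import time

GRACE_SECONDS = 5.0  # illustrative grace window before hard invalidation

class RefreshStore:
    """In-memory stand-in for the Postgres refresh-token table."""

    def __init__(self):
        self._tokens = {}  # token -> {"user": str, "retired_at": float or None}

    def issue(self, user_id: str) -> str:
        token = secrets.token_urlsafe(32)  # opaque string, stored server-side
        self._tokens[token] = {"user": user_id, "retired_at": None}
        return token

    def rotate(self, token: str):
        entry = self._tokens.get(token)
        if entry is None:
            return None  # unknown or already hard-invalidated
        now = time.monotonic()
        if entry["retired_at"] is None:
            entry["retired_at"] = now  # first use: start the grace window
        elif now - entry["retired_at"] > GRACE_SECONDS:
            del self._tokens[token]  # grace expired: hard invalidate
            return None
        return self.issue(entry["user"])  # hand back a fresh token
```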
Our infrastructure deployment strictly adheres to GitOps principles, decoupling the application logic from environment-specific configurations:
- App Repository (Environment-Agnostic): Houses the application code and publishes template logic via Git tags and GitHub Releases (container images, Terraform templates, Helmfile templates). It maintains zero awareness of downstream configurations.
- Config Repository (Environment-Specific): Contains environment variables, SOPS-encrypted secrets, and references to specific App Repo Git tags to trigger deployments.
The CI/CD Pipeline:
We utilize a high-level values.yaml and secrets.yaml to control the entire deployment. Rendered manifests are published as an OCI Artifact using a Helmfile template, carefully decoupling secrets and dynamic Terraform outputs via AWS Secrets Manager and Parameter Store. ArgoCD then continuously syncs this artifact to the Kubernetes cluster.
- Infrastructure as Code (IaC): We currently use Terragrunt to parse, decrypt, and bootstrap the underlying AWS infrastructure, including the EKS cluster. (Note: We are planning a migration to Spacelift to fully automate infrastructure creation).
- Helmfile & Chart Management: We use Helmfile to facilitate multi-layer templating. We maintain custom Helm charts for our core application, Conduit, and utility specifications (like Karpenter and external secrets).
- Autoscaling: We utilize Karpenter for dynamic, node-level cluster autoscaling. At the pod level, Horizontal Pod Autoscaling (HPA) is built into all stateless components.
Instead of deploying a traditional shared-nothing Kafka architecture (like Apache Kafka or Redpanda) or relying on costly AWS MSK clusters, we use AutoMQ for our Kafka workloads.
The Architectural Advantage: AutoMQ utilizes a compute-storage decoupled model backed by AWS S3, providing several critical benefits for a Kubernetes environment:
- Painless Scaling: Because it uses a shared storage backend (S3), we can scale brokers vertically or horizontally without suffering the massive network overhead of partition reassignment. Data never needs to be copied over the network after it is committed.
- Optimized Durability vs. Latency: AutoMQ maintains low broker commit latency by writing to a persistent WAL buffer (an EBS volume) to achieve immediate single-AZ durability. It then batches these writes to S3 for long-term, high durability guarantees. While this trades off immediate multi-AZ consistency on commit, it completely bypasses multi-AZ network latency while maintaining excellent data safety.
We use devbox to manage all development dependencies. Run `devbox shell` to download and enter the devbox environment.
We use task to manage all development commands, which are defined in the `Taskfile.yaml` at the root of the repository.
Download Python and Node dependencies for both the frontend and backend (runs `uv sync` and `npm install`):

```shell
task setup
```

Start the local development environment:

```shell
task run:dev
```

Set up a local kind cluster:

```shell
task kind:setup
```

To see a full list of available development tasks (like running tests, generating fake data, compiling environments, or syncing OpenAPI schemas):

```shell
task
```

- Implement repo wrappers to interact with DynamoDB tables
- Implement simple poll-based FastAPI backend
- Implement simple frontend with Next.js
- Implement backend endpoints for chat and user management
- Implement authentication with access token JWTs
- Use the Backend-For-Frontend pattern proxying API requests to securely set HTTP-only cookies
- Implement pub/sub model for real-time chat updates
- Use React Query in the frontend for optimistic server-side hydration and client-side fallback
- Implement ULID cursor-based pagination for chat history
- Migrate to Postgres storage for metadata, relying on transactions and row-based locking
- Implement refresh tokens with rotation and automatic resolution of expired access tokens
- Implement Redis caching layer for metadata and membership authorization
- Migrate WebSocket handling to Centrifugo + Redis
- Implement Kafka-based event sourcing for chat messages with Kafka Connect DynamoDB sink
- Implement CDC (Change Data Capture) with Postgres Outbox + Debezium + Kafka
- Build k8s setup with Helm and deploy locally using kind
- Switch from Apache Kafka to AutoMQ, from Debezium to Conduit, from Kafka Connect to Redpanda Connect
- Deploy simplified application to AWS via Terraform
- Set up GitHub Actions CI
- Set up ArgoCD
- Upgrade k8s setup with API Gateway and Autoscaling
- Launch production deployment

