Project made as part of the "Big Data - modeling, management, processing and integration" class.
The stack simulates an end-to-end pipeline: transactional database → change data capture → data lake → curated analytics layers → feature store → model training and serving.
```mermaid
flowchart TD
    Airflow((Airflow))
    DataGen[Seed Postgres]
    PSQL[(PostgreSQL)]
    Debezium[Debezium Connector]
    Kafka[Kafka Topics]
    Connect[Kafka Connect S3 Sink]
    Bronze[Bronze layer - MinIO]
    Silver[Silver layer]
    SodaSilver[Soda Silver scan]
    Gold[Gold layer]
    SodaGold[Soda Gold scan]
    FeatureStore[Feature store table]
    MLflow[MLflow Registry]
    Serving[Model Serving]
    Airflow -.->|generate_and_load_data| DataGen
    DataGen --> PSQL
    PSQL --> Debezium --> Kafka --> Connect --> Bronze
    Bronze -->|Spark - bronze_to_silver| Silver
    Silver -->|Spark - silver_to_gold| Gold
    Gold -->|Spark - gold_to_features| FeatureStore
    FeatureStore -->|Spark ML| MLflow --> Serving
    Airflow -.->|bronze_to_silver| Silver
    Airflow -.->|silver_to_gold| Gold
    Airflow -.->|gold_to_features| FeatureStore
    Airflow -.->|ml| MLflow
    Airflow -.->|soda_check_silver| SodaSilver
    Silver --> SodaSilver
    Airflow -.->|soda_check_gold| SodaGold
    Gold --> SodaGold
```
- Automated CDC pipeline from Postgres to an S3-compatible data lake (MinIO) through Kafka, Kafka Connect and Debezium.
- Airflow DAGs orchestrating synthetic data generation, multi-hop Spark ETL, Soda quality rules, feature engineering, and ML training.
- Spark-based ML pipeline that logs to MLflow, manages model registry aliases per CPU architecture, and exports models for serving.
- Containerised local infrastructure (Docker Compose) with observability endpoints: Airflow UI, Kafka UI, Spark UI, MinIO console, MLflow UI.
- Utilities for cleaning local state, exporting models, and serving the latest registered model with MLflow.
```
.
├── airflow/              # Custom Airflow image, DAGs, and dependencies
│   └── dags/
│       ├── etl_dag.py    # Main orchestration DAG
│       ├── ...           # Other DAGs
│       └── miscellaneous/ # See airflow/dags/miscellaneous/README.md
├── connect/              # Kafka Connect image with additional plugins
├── connector-setup/      # See connector-setup/config/README.md
├── docker-compose.yml    # Full infrastructure definition
├── cleanup.sh            # Resets logs, cache and local MLflow state
├── serve_model.sh        # Fetch & serve the latest registered model via MLflow
├── serve_model_local.sh  # Serve an exported model directory
├── ml_models/            # See ml_models/README.md
├── mlflow/               # MLflow server image
├── postgres-init/        # SQL bootstrap for OLTP schema and triggers
└── spark/                # Spark image and ETL / ML applications
```
- Docker ≥ 24 and Docker Compose v2.
- At least 16 GB RAM available to Docker (the full stack runs multiple JVM-based services).
- Python 3.11+ locally if you want to run helper scripts or serve models outside Docker.
- `openssl` (or similar) to generate secrets.
Create a `.env` file in the project root and populate it with all required secrets:

```
POSTGRES_PASSWORD=postgres
MINIO_ROOT_USER=minioadmin
MINIO_ROOT_PASSWORD=minioadmin
AIRFLOW__CORE__FERNET_KEY=<generate with: python -c 'from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())'>
AIRFLOW__SECRET_KEY=<generate with: openssl rand -hex 32>
```

All variables are consumed by Docker Compose and reused by Airflow, Spark and MLflow. Keep the file private - `.env` is excluded from version control.
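If you prefer a single script over the inline commands above, both Airflow secrets can be generated with the Python standard library alone. This is a minimal sketch: a Fernet key is 32 random bytes in URL-safe base64, so `cryptography` is not strictly required just to generate one, and `secrets.token_hex(32)` matches `openssl rand -hex 32`.

```python
import base64
import os
import secrets


def generate_fernet_key() -> str:
    # A Fernet key is 32 random bytes encoded as URL-safe base64.
    return base64.urlsafe_b64encode(os.urandom(32)).decode()


def generate_secret_key() -> str:
    # Equivalent of `openssl rand -hex 32`: 32 random bytes as 64 hex chars.
    return secrets.token_hex(32)


if __name__ == "__main__":
    print(f"AIRFLOW__CORE__FERNET_KEY={generate_fernet_key()}")
    print(f"AIRFLOW__SECRET_KEY={generate_secret_key()}")
```

Paste the printed lines straight into `.env`.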
1. Build images (first run or after Dockerfile changes):

   ```bash
   docker compose build
   ```

2. Start the stack:

   ```bash
   docker compose up --build -d
   ```

3. Wait for health checks – PostgreSQL, Redis, Kafka, MinIO and MLflow must be healthy before Airflow initialises. The helper containers `init-airflow`, `init-minio`, and the connector setup jobs will exit automatically when ready.

4. Confirm connectors – visit http://localhost:8083/connectors to ensure the Debezium source and S3 sink connectors are registered.
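The connector check above can also be scripted. A minimal sketch against the Kafka Connect REST API: `GET /connectors` returns the registered connector names and `GET /connectors/<name>/status` returns the connector and per-task states, which should all be `RUNNING`.

```python
import json
import urllib.request

CONNECT_URL = "http://localhost:8083"


def connector_healthy(status: dict) -> bool:
    """True when the connector itself and all of its tasks report RUNNING."""
    if status.get("connector", {}).get("state") != "RUNNING":
        return False
    tasks = status.get("tasks", [])
    return bool(tasks) and all(t.get("state") == "RUNNING" for t in tasks)


def check_connector(name: str) -> bool:
    with urllib.request.urlopen(f"{CONNECT_URL}/connectors/{name}/status") as resp:
        return connector_healthy(json.load(resp))


if __name__ == "__main__":
    with urllib.request.urlopen(f"{CONNECT_URL}/connectors") as resp:
        for name in json.load(resp):
            print(name, "OK" if check_connector(name) else "NOT RUNNING")
```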
| Service | URL / Port | Notes |
|---|---|---|
| Airflow API/UI | http://localhost:8080 | Simple auth enabled for local dev (any credentials work) |
| Kafka UI | http://localhost:8082 | Inspect topics & messages |
| Kafka Connect | http://localhost:8083 | Connector status API |
| MinIO UI | http://localhost:9001 | Use MINIO_ROOT_USER / MINIO_ROOT_PASSWORD |
| Spark Master UI | http://localhost:8090 | View submitted Spark jobs |
| MLflow UI | http://localhost:5001 | Tracking server & model registry |
| PostgreSQL | localhost:5432 | `postgres` / `POSTGRES_PASSWORD` |
- Open the Airflow UI and enable the `etl_dag`. Trigger a run manually for the first execution.
- The DAG runs the following tasks in sequence:
  - `generate_and_load_data` – Python-based synthetic data generator that loads OLTP tables.
  - `bronze_to_silver` – Spark job cleansing CDC snapshots from Bronze.
  - `soda_check_silver` – Soda data quality scan on the Silver layer.
  - `silver_to_gold` – business-rule cleansing and validation before Gold promotion.
  - `soda_check_gold` – final data quality validation.
  - `gold_to_features` – feature engineering; publishes the result to `s3a://datalake/feature_store`.
  - `ml` – Spark ML pipeline training a RandomForest model, logging artefacts, metrics and SHAP-style plots, and registering the model in MLflow (per CPU architecture).
- Inspect each stage using the respective UIs (Spark logs, MLflow experiment `cancellation_prediction`, MinIO object browser).
- `s3a://datalake/bronze/<topic>` – raw CDC parquet data from Kafka topics.
- `s3a://datalake/silver/...` – cleansed tables with standardised schema.
- `s3a://datalake/gold/...` – fully validated business entities.
- `s3a://datalake/feature_store/cancellation_features` – model-ready feature table.
- `s3://mlflow` – MLflow model artefacts managed by the tracking server.
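For ad-hoc scripts that address these locations programmatically, the layout above can be captured in a small helper. This is only an illustration of the path convention; table names other than `cancellation_features` are hypothetical.

```python
BUCKET = "datalake"
LAYERS = {"bronze", "silver", "gold", "feature_store"}


def lake_path(layer: str, table: str) -> str:
    """Build the s3a URI for a table in a given medallion layer."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return f"s3a://{BUCKET}/{layer}/{table}"
```

For example, `lake_path("feature_store", "cancellation_features")` yields the feature table URI listed above.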
- The `ml` Spark job logs to MLflow and registers the model under architecture-specific names (`cancellation-predictor-amd64`, `cancellation-predictor-arm64`, etc.) with aliases such as `staging-amd64`.
- Use `serve_model.sh` to pull the latest registered model for your architecture and expose it on port 6000:

  ```bash
  ./serve_model.sh
  ```

  The script expects a prepared Python virtual environment at `.venv` managed by `uv` (for example: `uv venv .venv && source .venv/bin/activate && uv sync`). This uses the `pyproject.toml` dependencies.
- To serve an exported model directory directly (e.g., produced by the `export_model` DAG), run:

  ```bash
  ./serve_model_local.sh path/to/model
  ```

- The Postman collection `MLFlow Local.postman_collection.json` contains sample scoring requests.
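A scoring request equivalent to the Postman collection can also be issued from Python. MLflow's serving endpoint accepts JSON on `/invocations`, with inputs wrapped in a field such as `dataframe_split`. The feature names below are placeholders – match them to the actual columns of the feature store table.

```python
import json
import urllib.request


def build_payload(columns, rows):
    # MLflow model serving expects inputs wrapped in a named field
    # such as dataframe_split (columns + row-oriented data).
    return {"dataframe_split": {"columns": columns, "data": rows}}


def score(rows, columns, url="http://localhost:6000/invocations"):
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(columns, rows)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Placeholder feature names - use the real feature_store columns.
    print(score([[2, 14.0]], ["num_prior_bookings", "lead_time_days"]))
```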
- `cleanup.sh` – removes Airflow logs, Soda outputs, cached bytecode and local MLflow artefacts to reclaim disk space.
- `connect/`, `connector-setup/` – Docker contexts for Kafka Connect and automated connector registration (see `connector-setup/config/README.md`).
- `ml_models/` – model packaging instructions and Dockerfile template (see `ml_models/README.md`).
- `serve_model.sh`, `serve_model_local.sh` – helper scripts for local model serving workflows.
- Airflow workers stuck starting – ensure Redis and Postgres health checks pass; run `docker compose logs airflow-worker` for details.
- No data arriving in MinIO – check connector status (`GET /connectors/<name>/status`), confirm the Debezium replication slot is running, and verify Postgres tables contain data.
- Spark jobs fail on macOS (Java errors) – make sure Java 17 is installed and exported when running local scripts. The containers already use the correct JRE.
- MLflow authentication errors when serving locally – confirm `MLFLOW_TRACKING_URI`, `MLFLOW_S3_ENDPOINT_URL`, and MinIO credentials are exported (the scripts do this automatically if `.env` is present).
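The automatic export mentioned above is easy to reproduce when writing your own helper scripts. A minimal `.env` parser as a sketch – it skips comments and blank lines but deliberately does not handle quoting or multi-line values:

```python
import os


def parse_env_lines(lines):
    """Parse KEY=VALUE lines, skipping blanks, comments and malformed lines."""
    parsed = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        parsed[key.strip()] = value.strip()
    return parsed


def load_env(path=".env"):
    """Read a .env file (if present) and export its variables."""
    if not os.path.exists(path):
        return {}
    with open(path) as fh:
        loaded = parse_env_lines(fh)
    os.environ.update(loaded)
    return loaded
```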
- `airflow/dags/miscellaneous/README.md` – background on synthetic data signal injection and DAG helper utilities.
- `connector-setup/config/README.md` – detailed explanation of Debezium and S3 sink connector configuration.
- `ml_models/README.md` – guidance on exporting MLflow models and building serving images.