Big Data – End-To-End Pipeline

> This repository was archived by the owner on Feb 16, 2026. It is now read-only.

Project made as part of the Big Data - Modeling, Management, Processing and Integration course.

The stack simulates an end-to-end pipeline: transactional database → change data capture → data lake → curated analytics layers → feature store → model training and serving.

```mermaid
flowchart TD
    Airflow((Airflow))
    DataGen[Seed Postgres]
    PSQL[(PostgreSQL)]
    Debezium[Debezium Connector]
    Kafka[Kafka Topics]
    Connect[Kafka Connect S3 Sink]
    Bronze[Bronze layer - MinIO]
    Silver[Silver layer]
    SodaSilver[Soda Silver scan]
    Gold[Gold layer]
    SodaGold[Soda Gold scan]
    FeatureStore[Feature store table]
    MLflow[MLflow Registry]
    Serving[Model Serving]

    Airflow -.->|generate_and_load_data| DataGen
    DataGen --> PSQL

    PSQL --> Debezium --> Kafka --> Connect --> Bronze
    Bronze -->|Spark - bronze_to_silver| Silver
    Silver -->|Spark - silver_to_gold| Gold
    Gold -->|Spark - gold_to_features| FeatureStore
    FeatureStore -->|Spark ML| MLflow --> Serving

    Airflow -.->|bronze_to_silver| Silver
    Airflow -.->|silver_to_gold| Gold
    Airflow -.->|gold_to_features| FeatureStore
    Airflow -.->|ml| MLflow

    Airflow -.->|soda_check_silver| SodaSilver
    Silver --> SodaSilver
    Airflow -.->|soda_check_gold| SodaGold
    Gold --> SodaGold
```

Key Capabilities

  • Automated CDC pipeline from Postgres to an S3-compatible data lake (MinIO) through Kafka, Kafka Connect and Debezium.
  • Airflow DAGs orchestrating synthetic data generation, multi-hop Spark ETL, Soda quality rules, feature engineering, and ML training.
  • Spark-based ML pipeline that logs to MLflow, manages model registry aliases per CPU architecture, and exports models for serving.
  • Containerised local infrastructure (Docker Compose) with observability endpoints: Airflow UI, Kafka UI, Spark UI, MinIO console, MLflow UI.
  • Utilities for cleaning local state, exporting models, and serving the latest registered model with MLflow.

Repository Layout

```
.
├── airflow/                 # Custom Airflow image, DAGs, and dependencies
│   └── dags/
│       ├── etl_dag.py       # Main orchestration DAG
│       ├── ...              # Other DAGs
│       └── miscellaneous/   # See airflow/dags/miscellaneous/README.md
├── connect/                 # Kafka Connect image with additional plugins
├── connector-setup/         # See connector-setup/config/README.md
├── docker-compose.yml       # Full infrastructure definition
├── cleanup.sh               # Resets logs, cache and local MLflow state
├── serve_model.sh           # Fetch & serve the latest registered model via MLflow
├── serve_model_local.sh     # Serve an exported model directory
├── ml_models/               # See ml_models/README.md
├── mlflow/                  # MLflow server image
├── postgres-init/           # SQL bootstrap for OLTP schema and triggers
└── spark/                   # Spark image and ETL / ML applications
```

Prerequisites

  • Docker ≥ 24 and Docker Compose v2.
  • At least 16 GB RAM available to Docker (the full stack runs multiple JVM-based services).
  • Python 3.11+ locally if you want to run helper scripts or serve models outside Docker.
  • openssl (or similar) to generate secrets.

Configuration (.env)

Create a .env file in the project root and populate it with all required secrets:

```
POSTGRES_PASSWORD=postgres
MINIO_ROOT_USER=minioadmin
MINIO_ROOT_PASSWORD=minioadmin
AIRFLOW__CORE__FERNET_KEY=<generate with: python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())">
AIRFLOW__SECRET_KEY=<generate with: openssl rand -hex 32>
```

All variables are consumed by Docker Compose and reused by Airflow, Spark and MLflow. Keep the file private - .env is excluded from version control.
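
If the cryptography package is not installed locally, both secrets can be generated with the standard library alone. This is a sketch that relies on the fact that a Fernet key is simply 32 random bytes, urlsafe-base64 encoded, and that `openssl rand -hex 32` is equivalent to 32 random bytes rendered as hex:

```python
import base64
import os
import secrets

# A Fernet key is 32 random bytes, urlsafe-base64 encoded --
# the same value cryptography's Fernet.generate_key() produces.
fernet_key = base64.urlsafe_b64encode(os.urandom(32)).decode()

# Stdlib equivalent of `openssl rand -hex 32`.
secret_key = secrets.token_hex(32)

print(f"AIRFLOW__CORE__FERNET_KEY={fernet_key}")
print(f"AIRFLOW__SECRET_KEY={secret_key}")
```

Paste the two printed lines into `.env`.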

Bootstrapping the Platform

  1. Build images (first run or after Dockerfile changes)

    docker compose build
  2. Start the stack

    docker compose up --build -d
  3. Wait for health checks – PostgreSQL, Redis, Kafka, MinIO and MLflow must be healthy before Airflow initialises. The helper containers init-airflow, init-minio, and the connector setup jobs will exit automatically when ready.

  4. Confirm connectors – visit http://localhost:8083/connectors to ensure the Debezium source and S3 sink connectors are registered.
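
The connector check in step 4 can also be scripted against the Kafka Connect REST API (`GET /connectors/<name>/status`). A minimal sketch, assuming the stack's default port 8083; the connector names in the commented usage are placeholders, so substitute whatever `GET /connectors` returns for your deployment:

```python
import json
from urllib.request import urlopen


def failed_parts(status: dict) -> list:
    """Return the components of a connector status payload that are not RUNNING."""
    bad = []
    if status.get("connector", {}).get("state") != "RUNNING":
        bad.append("connector")
    for i, task in enumerate(status.get("tasks", [])):
        if task.get("state") != "RUNNING":
            bad.append(f"task-{i}")
    return bad


def check(name: str, base: str = "http://localhost:8083") -> list:
    """Fetch a connector's status and report any non-RUNNING parts."""
    with urlopen(f"{base}/connectors/{name}/status") as resp:
        return failed_parts(json.load(resp))


# Usage (connector names are assumptions -- list them via GET /connectors):
# for name in ("debezium-source", "s3-sink"):
#     print(name, check(name) or "all RUNNING")
```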

Service Endpoints

| Service | URL / Port | Notes |
| --- | --- | --- |
| Airflow API/UI | http://localhost:8080 | Simple auth enabled for local dev (any credentials work) |
| Kafka UI | http://localhost:8082 | Inspect topics & messages |
| Kafka Connect | http://localhost:8083 | Connector status API |
| MinIO UI | http://localhost:9001 | Use MINIO_ROOT_USER / MINIO_ROOT_PASSWORD |
| Spark Master UI | http://localhost:8090 | View submitted Spark jobs |
| MLflow UI | http://localhost:5001 | Tracking server & model registry |
| PostgreSQL | localhost:5432 | postgres / POSTGRES_PASSWORD |

Running the End-to-End Pipeline

  1. Open the Airflow UI and enable the etl_dag. Trigger a run manually for the first execution.
  2. The DAG runs the following tasks in sequence:
    • generate_and_load_data – Python-based synthetic data generator that loads OLTP tables.
    • bronze_to_silver – Spark job cleansing CDC snapshots from Bronze.
    • soda_check_silver – Soda data quality scan on the Silver layer.
    • silver_to_gold – Business-rule cleansing and validation before Gold promotion.
    • soda_check_gold – Final data quality validation.
    • gold_to_features – Feature engineering and publish to s3a://datalake/feature_store.
    • ml – Spark ML pipeline training a RandomForest model, logging artefacts, metrics, SHAP-style plots and registering the model in MLflow (per CPU architecture).
  3. Inspect each stage using the respective UIs (Spark logs, MLflow experiment cancellation_prediction, MinIO object browser).
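
Subsequent runs can also be triggered programmatically. A sketch using Airflow's stable REST API (`POST /api/v1/dags/{dag_id}/dagRuns`); the path is the Airflow 2 API and the basic-auth credentials are placeholders (the local setup accepts any credentials), so adjust both for your Airflow version:

```python
import base64
import json
from urllib import request


def trigger_dag_request(dag_id: str, base: str = "http://localhost:8080",
                        user: str = "airflow", password: str = "airflow") -> request.Request:
    """Build a POST request that queues a new run of the given DAG."""
    req = request.Request(
        f"{base}/api/v1/dags/{dag_id}/dagRuns",
        data=json.dumps({"conf": {}}).encode(),
        method="POST",
    )
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    req.add_header("Content-Type", "application/json")
    return req


# Usage (requires the stack to be up):
# with request.urlopen(trigger_dag_request("etl_dag")) as resp:
#     print(resp.status)
```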

Data Lake Layout

  • s3a://datalake/bronze/<topic> – raw CDC parquet data from Kafka topics.
  • s3a://datalake/silver/... – cleansed tables with standardised schema.
  • s3a://datalake/gold/... – fully validated business entities.
  • s3a://datalake/feature_store/cancellation_features – model-ready feature table.
  • s3://mlflow – MLflow model artefacts managed by the tracking server.
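
The medallion paths above follow one predictable pattern, which jobs and notebooks can build with a small helper. This function is a hypothetical convenience, not part of the repository:

```python
def lake_path(layer: str, name: str, bucket: str = "datalake") -> str:
    """Build the s3a:// URI for a table in the given medallion layer."""
    if layer not in {"bronze", "silver", "gold", "feature_store"}:
        raise ValueError(f"unknown layer: {layer}")
    return f"s3a://{bucket}/{layer}/{name}"


# Example: the feature table consumed by the ml job
# lake_path("feature_store", "cancellation_features")
```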

ML Lifecycle & Model Serving

  • The ml Spark job logs to MLflow and registers the model under architecture-specific names (cancellation-predictor-amd64, cancellation-predictor-arm64, etc.) with aliases such as staging-amd64.

  • Use serve_model.sh to pull the latest registered model for your architecture and expose it on port 6000:

    ./serve_model.sh

    The script expects a Python virtual environment at .venv managed by uv (for example: uv venv .venv && source .venv/bin/activate && uv sync), which installs the dependencies declared in pyproject.toml.

  • To serve an exported model directory directly (e.g., produced by the export_model DAG), run:

    ./serve_model_local.sh path/to/model
  • The Postman collection MLFlow Local.postman_collection.json contains sample scoring requests.
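
Outside Postman, a scoring request can be assembled by hand. MLflow's serving endpoint accepts a JSON body in dataframe_split form (`{"dataframe_split": {"columns": [...], "data": [...]}}`); the feature names below are purely illustrative, so match them to the actual feature_store schema:

```python
import json


def scoring_payload(columns, rows) -> str:
    """JSON body for MLflow's /invocations endpoint (dataframe_split format)."""
    return json.dumps({
        "dataframe_split": {
            "columns": list(columns),
            "data": [list(r) for r in rows],
        }
    })


# Hypothetical feature names -- replace with the real feature table columns:
body = scoring_payload(
    ["lead_time_days", "total_price", "num_guests"],
    [[14, 250.0, 2]],
)

# Send it to the served model (port 6000 per serve_model.sh), e.g.:
# curl -X POST http://localhost:6000/invocations \
#      -H "Content-Type: application/json" -d "<body>"
```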

Utilities & Maintenance

  • cleanup.sh – removes Airflow logs, Soda outputs, cached bytecode and local MLflow artefacts to reclaim disk space.
  • connect/, connector-setup/ – Docker contexts for Kafka Connect and automated connector registration (see connector-setup/config/README.md).
  • ml_models/ – model packaging instructions and Dockerfile template (see ml_models/README.md).
  • serve_model.sh, serve_model_local.sh – helper scripts for local model serving workflows.

Troubleshooting

  • Airflow workers stuck starting – ensure Redis and Postgres health checks pass; run docker compose logs airflow-worker for details.
  • No data arriving in MinIO – check connector status (GET /connectors/<name>/status), confirm Debezium slot is running, and verify Postgres tables contain data.
  • Spark jobs fail on macOS (Java errors) – make sure Java 17 is installed and exported when running local scripts. The containers already use the correct JRE.
  • MLflow authentication errors when serving locally – confirm MLFLOW_TRACKING_URI, MLFLOW_S3_ENDPOINT_URL, and MinIO credentials are exported (the scripts do this automatically if .env is present).
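
When running the serving scripts without a `.env` file, the variables from the last bullet can be exported manually. A sketch using the endpoints listed under Service Endpoints; the MinIO S3 API port 9000 is an assumption (only the console port 9001 is documented above), so verify it against docker-compose.yml:

```shell
# Values mirror the local defaults; adjust to your setup.
export MLFLOW_TRACKING_URI=http://localhost:5001
# Assumed MinIO S3 API port -- confirm in docker-compose.yml:
export MLFLOW_S3_ENDPOINT_URL=http://localhost:9000
export AWS_ACCESS_KEY_ID="$MINIO_ROOT_USER"
export AWS_SECRET_ACCESS_KEY="$MINIO_ROOT_PASSWORD"
```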

Additional Documentation

  • airflow/dags/miscellaneous/README.md – background on synthetic data signal injection and DAG helper utilities.
  • connector-setup/config/README.md – detailed explanation of Debezium and S3 sink connector configuration.
  • ml_models/README.md – guidance on exporting MLflow models and building serving images.
