Big Data – End-To-End Pipeline

> This repository was archived by the owner on Feb 16, 2026. It is now read-only.

Project made as part of the Big Data - Modeling, Management, Processing and Integration course.

The stack simulates an end-to-end pipeline: transactional database → change data capture → data lake → curated analytics layers → feature store → model training and serving.

```mermaid
flowchart TD
    Airflow((Airflow))
    DataGen[Seed Postgres]
    PSQL[(PostgreSQL)]
    Debezium[Debezium Connector]
    Kafka[Kafka Topics]
    Connect[Kafka Connect S3 Sink]
    Bronze[Bronze layer - MinIO]
    Silver[Silver layer]
    SodaSilver[Soda Silver scan]
    Gold[Gold layer]
    SodaGold[Soda Gold scan]
    FeatureStore[Feature store table]
    MLflow[MLflow Registry]
    Serving[Model Serving]

    Airflow -.->|generate_and_load_data| DataGen
    DataGen --> PSQL

    PSQL --> Debezium --> Kafka --> Connect --> Bronze
    Bronze -->|Spark - bronze_to_silver| Silver
    Silver -->|Spark - silver_to_gold| Gold
    Gold -->|Spark - gold_to_features| FeatureStore
    FeatureStore -->|Spark ML| MLflow --> Serving

    Airflow -.->|bronze_to_silver| Silver
    Airflow -.->|silver_to_gold| Gold
    Airflow -.->|gold_to_features| FeatureStore
    Airflow -.->|ml| MLflow

    Airflow -.->|soda_check_silver| SodaSilver
    Silver --> SodaSilver
    Airflow -.->|soda_check_gold| SodaGold
    Gold --> SodaGold
```

Key Capabilities

  • Automated CDC pipeline from Postgres to an S3-compatible data lake (MinIO) through Kafka, Kafka Connect and Debezium.
  • Airflow DAGs orchestrating synthetic data generation, multi-hop Spark ETL, Soda quality rules, feature engineering, and ML training.
  • Spark-based ML pipeline that logs to MLflow, manages model registry aliases per CPU architecture, and exports models for serving.
  • Containerised local infrastructure (Docker Compose) with observability endpoints: Airflow UI, Kafka UI, Spark UI, MinIO console, MLflow UI.
  • Utilities for cleaning local state, exporting models, and serving the latest registered model with MLflow.

Repository Layout

```
.
├── airflow/                 # Custom Airflow image, DAGs, and dependencies
│   └── dags/
│       ├── etl_dag.py       # Main orchestration DAG
│       ├── ...              # Other DAGs
│       └── miscellaneous/   # See airflow/dags/miscellaneous/README.md
├── connect/                 # Kafka Connect image with additional plugins
├── connector-setup/         # See connector-setup/config/README.md
├── docker-compose.yml       # Full infrastructure definition
├── cleanup.sh               # Resets logs, cache and local MLflow state
├── serve_model.sh           # Fetch & serve the latest registered model via MLflow
├── serve_model_local.sh     # Serve an exported model directory
├── ml_models/               # See ml_models/README.md
├── mlflow/                  # MLflow server image
├── postgres-init/           # SQL bootstrap for OLTP schema and triggers
└── spark/                   # Spark image and ETL / ML applications
```

Prerequisites

  • Docker ≥ 24 and Docker Compose v2.
  • At least 16 GB RAM available to Docker (the full stack runs multiple JVM-based services).
  • Python 3.11+ locally if you want to run helper scripts or serve models outside Docker.
  • openssl (or similar) to generate secrets.

Configuration (.env)

Create a .env file in the project root and populate it with all required secrets:

```
POSTGRES_PASSWORD=postgres
MINIO_ROOT_USER=minioadmin
MINIO_ROOT_PASSWORD=minioadmin
AIRFLOW__CORE__FERNET_KEY=<generate with: python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())">
AIRFLOW__SECRET_KEY=<generate with: openssl rand -hex 32>
```

All variables are consumed by Docker Compose and reused by Airflow, Spark and MLflow. Keep the file private - .env is excluded from version control.
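
If the cryptography package is not installed locally, both secrets can be generated with the standard library alone. This is a sketch that relies on the fact that a Fernet key is simply 32 random bytes, urlsafe-base64 encoded, and that `openssl rand -hex 32` is equivalent to 32 random bytes rendered as hex:

```python
import base64
import os
import secrets

# A Fernet key is 32 random bytes, urlsafe-base64 encoded --
# the same value cryptography's Fernet.generate_key() produces.
fernet_key = base64.urlsafe_b64encode(os.urandom(32)).decode()

# Stdlib equivalent of `openssl rand -hex 32`.
secret_key = secrets.token_hex(32)

print(f"AIRFLOW__CORE__FERNET_KEY={fernet_key}")
print(f"AIRFLOW__SECRET_KEY={secret_key}")
```

Paste the two printed lines into `.env`.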

Bootstrapping the Platform

  1. Build images (first run or after Dockerfile changes)

    docker compose build
  2. Start the stack

    docker compose up --build -d
  3. Wait for health checks – PostgreSQL, Redis, Kafka, MinIO and MLflow must be healthy before Airflow initialises. The helper containers init-airflow, init-minio, and the connector setup jobs will exit automatically when ready.

  4. Confirm connectors – visit http://localhost:8083/connectors to ensure the Debezium source and S3 sink connectors are registered.
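
The connector check in step 4 can also be scripted against the Kafka Connect REST API (`GET /connectors/<name>/status`). A minimal sketch, assuming the stack's default port 8083; the connector names in the commented usage are placeholders, so substitute whatever `GET /connectors` returns for your deployment:

```python
import json
from urllib.request import urlopen


def failed_parts(status: dict) -> list:
    """Return the components of a connector status payload that are not RUNNING."""
    bad = []
    if status.get("connector", {}).get("state") != "RUNNING":
        bad.append("connector")
    for i, task in enumerate(status.get("tasks", [])):
        if task.get("state") != "RUNNING":
            bad.append(f"task-{i}")
    return bad


def check(name: str, base: str = "http://localhost:8083") -> list:
    """Fetch a connector's status and report any non-RUNNING parts."""
    with urlopen(f"{base}/connectors/{name}/status") as resp:
        return failed_parts(json.load(resp))


# Usage (connector names are assumptions -- list them via GET /connectors):
# for name in ("debezium-source", "s3-sink"):
#     print(name, check(name) or "all RUNNING")
```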

Service Endpoints

| Service | URL / Port | Notes |
| --- | --- | --- |
| Airflow API/UI | http://localhost:8080 | Simple auth enabled for local dev (any credentials work) |
| Kafka UI | http://localhost:8082 | Inspect topics & messages |
| Kafka Connect | http://localhost:8083 | Connector status API |
| MinIO UI | http://localhost:9001 | Use MINIO_ROOT_USER / MINIO_ROOT_PASSWORD |
| Spark Master UI | http://localhost:8090 | View submitted Spark jobs |
| MLflow UI | http://localhost:5001 | Tracking server & model registry |
| PostgreSQL | localhost:5432 | postgres / POSTGRES_PASSWORD |

Running the End-to-End Pipeline

  1. Open the Airflow UI and enable the etl_dag. Trigger a run manually for the first execution.
  2. The DAG runs the following tasks in sequence:
    • generate_and_load_data – Python-based synthetic data generator that loads OLTP tables.
    • bronze_to_silver – Spark job cleansing CDC snapshots from Bronze.
    • soda_check_silver – Soda data quality scan on the Silver layer.
    • silver_to_gold – Business-rule cleansing and validation before Gold promotion.
    • soda_check_gold – Final data quality validation.
    • gold_to_features – Feature engineering and publish to s3a://datalake/feature_store.
    • ml – Spark ML pipeline training a RandomForest model, logging artefacts, metrics, SHAP-style plots and registering the model in MLflow (per CPU architecture).
  3. Inspect each stage using the respective UIs (Spark logs, MLflow experiment cancellation_prediction, MinIO object browser).
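
Subsequent runs can also be triggered programmatically. A sketch using Airflow's stable REST API (`POST /api/v1/dags/{dag_id}/dagRuns`); the path is the Airflow 2 API and the basic-auth credentials are placeholders (the local setup accepts any credentials), so adjust both for your Airflow version:

```python
import base64
import json
from urllib import request


def trigger_dag_request(dag_id: str, base: str = "http://localhost:8080",
                        user: str = "airflow", password: str = "airflow") -> request.Request:
    """Build a POST request that queues a new run of the given DAG."""
    req = request.Request(
        f"{base}/api/v1/dags/{dag_id}/dagRuns",
        data=json.dumps({"conf": {}}).encode(),
        method="POST",
    )
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    req.add_header("Content-Type", "application/json")
    return req


# Usage (requires the stack to be up):
# with request.urlopen(trigger_dag_request("etl_dag")) as resp:
#     print(resp.status)
```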

Data Lake Layout

  • s3a://datalake/bronze/<topic> – raw CDC parquet data from Kafka topics.
  • s3a://datalake/silver/... – cleansed tables with standardised schema.
  • s3a://datalake/gold/... – fully validated business entities.
  • s3a://datalake/feature_store/cancellation_features – model-ready feature table.
  • s3://mlflow – MLflow model artefacts managed by the tracking server.
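
The medallion paths above follow one predictable pattern, which jobs and notebooks can build with a small helper. This function is a hypothetical convenience, not part of the repository:

```python
def lake_path(layer: str, name: str, bucket: str = "datalake") -> str:
    """Build the s3a:// URI for a table in the given medallion layer."""
    if layer not in {"bronze", "silver", "gold", "feature_store"}:
        raise ValueError(f"unknown layer: {layer}")
    return f"s3a://{bucket}/{layer}/{name}"


# Example: the feature table consumed by the ml job
# lake_path("feature_store", "cancellation_features")
```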

ML Lifecycle & Model Serving

  • The ml Spark job logs to MLflow and registers the model under architecture-specific names (cancellation-predictor-amd64, cancellation-predictor-arm64, etc.) with aliases such as staging-amd64.

  • Use serve_model.sh to pull the latest registered model for your architecture and expose it on port 6000:

    ./serve_model.sh

    The script expects a Python virtual environment at .venv managed by uv (for example: uv venv .venv && source .venv/bin/activate && uv sync), which installs the dependencies declared in pyproject.toml.

  • To serve an exported model directory directly (e.g., produced by the export_model DAG), run:

    ./serve_model_local.sh path/to/model
  • The Postman collection MLFlow Local.postman_collection.json contains sample scoring requests.
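
Outside Postman, a scoring request can be assembled by hand. MLflow's serving endpoint accepts a JSON body in dataframe_split form (`{"dataframe_split": {"columns": [...], "data": [...]}}`); the feature names below are purely illustrative, so match them to the actual feature_store schema:

```python
import json


def scoring_payload(columns, rows) -> str:
    """JSON body for MLflow's /invocations endpoint (dataframe_split format)."""
    return json.dumps({
        "dataframe_split": {
            "columns": list(columns),
            "data": [list(r) for r in rows],
        }
    })


# Hypothetical feature names -- replace with the real feature table columns:
body = scoring_payload(
    ["lead_time_days", "total_price", "num_guests"],
    [[14, 250.0, 2]],
)

# Send it to the served model (port 6000 per serve_model.sh), e.g.:
# curl -X POST http://localhost:6000/invocations \
#      -H "Content-Type: application/json" -d "<body>"
```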

Utilities & Maintenance

  • cleanup.sh – removes Airflow logs, Soda outputs, cached bytecode and local MLflow artefacts to reclaim disk space.
  • connect/, connector-setup/ – Docker contexts for Kafka Connect and automated connector registration (see connector-setup/config/README.md).
  • ml_models/ – model packaging instructions and Dockerfile template (see ml_models/README.md).
  • serve_model.sh, serve_model_local.sh – helper scripts for local model serving workflows.

Troubleshooting

  • Airflow workers stuck starting – ensure Redis and Postgres health checks pass; run docker compose logs airflow-worker for details.
  • No data arriving in MinIO – check connector status (GET /connectors/<name>/status), confirm Debezium slot is running, and verify Postgres tables contain data.
  • Spark jobs fail on macOS (Java errors) – make sure Java 17 is installed and exported when running local scripts. The containers already use the correct JRE.
  • MLflow authentication errors when serving locally – confirm MLFLOW_TRACKING_URI, MLFLOW_S3_ENDPOINT_URL, and MinIO credentials are exported (the scripts do this automatically if .env is present).
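
When running the serving scripts without a `.env` file, the variables from the last bullet can be exported manually. A sketch using the endpoints listed under Service Endpoints; the MinIO S3 API port 9000 is an assumption (only the console port 9001 is documented above), so verify it against docker-compose.yml:

```shell
# Values mirror the local defaults; adjust to your setup.
export MLFLOW_TRACKING_URI=http://localhost:5001
# Assumed MinIO S3 API port -- confirm in docker-compose.yml:
export MLFLOW_S3_ENDPOINT_URL=http://localhost:9000
export AWS_ACCESS_KEY_ID="$MINIO_ROOT_USER"
export AWS_SECRET_ACCESS_KEY="$MINIO_ROOT_PASSWORD"
```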

Additional Documentation

  • airflow/dags/miscellaneous/README.md – background on synthetic data signal injection and DAG helper utilities.
  • connector-setup/config/README.md – detailed explanation of Debezium and S3 sink connector configuration.
  • ml_models/README.md – guidance on exporting MLflow models and building serving images.
