Global Retail — Databricks Medallion Pipeline

This project is a complete retail data engineering pipeline that I built on Databricks to practice modern lakehouse patterns end to end. It starts from raw customer, product and transaction files in CSV, JSON and Parquet format, loads them into Delta Lake using the Bronze / Silver / Gold medallion architecture, and ends with a Power BI dashboard that answers business questions about daily sales, category performance and customer segmentation.

The pipeline began as a classroom exercise on DBFS. I rewrote it to use Unity Catalog Volumes, refactored the notebook logic into a Python package under src/, added environment-driven configuration so the same code runs locally and on Databricks, and wrote PyTest unit tests that spin up a local Spark session so the silver transformations can be validated without a cluster.

Every stage does one clear job. Bronze lands the raw data as Delta with an ingestion timestamp. Silver applies cleaning rules, validation and enrichment, and uses an incremental MERGE with a watermark so it only processes new rows. Gold builds the reporting tables on top of silver and feeds Power BI. The goal was to treat it like a real production pipeline rather than a notebook demo.

End-to-end retail data engineering pipeline built on Databricks, PySpark, and Delta Lake, following the Bronze / Silver / Gold medallion architecture, with reporting in Power BI.

Adapted from a legacy DBFS-based course implementation to the modern Unity Catalog Volumes approach.

Architecture

    Raw files (CSV / JSON / Parquet)
            │
            │  append + ingestion_timestamp
            │
    BRONZE ───────────────────────────────> Delta tables (raw)
            │
            │  incremental MERGE on watermark
            │  cleaning · validation · enrichment
            │
    SILVER ───────────────────────────────> Delta tables (curated)
            │
            │  aggregations
            │
    GOLD ─────────────────────────────────> Reporting tables
            │
            │
    Power BI
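
The Bronze to Silver step in the diagram (incremental MERGE on a watermark) can be sketched as plain Spark SQL strings. This is a minimal illustration, not the code in src/silver/: the table names bronze_customer and silver_customers and the key customer_id are assumptions, only the ingestion_timestamp column comes from the diagram, and the cleaning, validation and enrichment rules are left out.

```python
from datetime import datetime
from typing import Optional


def watermark_sql(silver_table: str) -> str:
    """Query for the newest ingestion timestamp already present in silver."""
    return f"SELECT MAX(ingestion_timestamp) AS watermark FROM {silver_table}"


def incremental_source_sql(bronze_table: str, watermark: Optional[datetime]) -> str:
    """Select only the bronze rows that arrived after the watermark."""
    if watermark is None:
        # First run: silver is empty, so every bronze row is new.
        return f"SELECT * FROM {bronze_table}"
    ts = watermark.strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"SELECT * FROM {bronze_table} "
        f"WHERE ingestion_timestamp > TIMESTAMP '{ts}'"
    )


def merge_sql(silver_table: str, source_sql: str, key: str) -> str:
    """Delta Lake MERGE that upserts the new rows into silver by key.

    UPDATE SET * and INSERT * assume the source and target share the same
    columns; a real transform would select cleaned columns explicitly.
    """
    return (
        f"MERGE INTO {silver_table} AS t "
        f"USING ({source_sql}) AS s "
        f"ON t.{key} = s.{key} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    source = incremental_source_sql("bronze_customer", datetime(2024, 1, 2, 3, 4, 5))
    print(merge_sql("silver_customers", source, "customer_id"))
```

On a cluster, each statement would be passed to spark.sql(...). The first query's result becomes the watermark for the second, so a rerun with no new bronze rows merges nothing.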

Project Structure

    global-retail-databricks-pipeline/
    ├── src/
    │   ├── global_retail/    # shared config, constants, utils
    │   ├── bronze/           # raw → bronze loaders
    │   ├── silver/           # bronze → silver transformations
    │   └── gold/             # silver → gold aggregations
    ├── notebooks/            # original Databricks notebooks
    ├── tests/                # PyTest unit tests with local Spark
    ├── data_samples/
    │   ├── raw/
    │   │   ├── customer.csv
    │   │   ├── product.json
    │   │   └── transaction.snappy.parquet
    │   └── processed/
    │       ├── GlobalRetail_Bronze/    # customer, products, transactions in .csv format
    │       ├── GlobalRetail_Silver/    # customer, orders, products in .csv format
    │       └── GlobalRetail_Gold/      # sales category, daily sales in .csv format
    ├── docs/                 # architecture, dashboards, screenshots
    ├── pyproject.toml
    ├── requirements.txt
    └── README.md

Running locally

python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest -q

Running on Databricks

1. Upload src/ as a Repo (or build a wheel and install it on the cluster).
2. Place raw files in a Unity Catalog Volume, e.g. /Volumes/main/globalretail/raw/.
3. Set environment variables on the cluster:
   - GR_RAW_ROOT=/Volumes/main/globalretail/raw
   - GR_ARCHIVE_ROOT=/Volumes/main/globalretail/archive
4. Schedule jobs in this order:
   1. bronze.customer_loader · bronze.product_loader · bronze.transaction_loader
   2. silver.customer_transform · silver.product_transform · silver.order_transform
   3. gold.daily_sales · gold.category_sales
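
The two environment variables above are what let the same code run locally and on Databricks. A rough sketch of how a shared config module could resolve them follows. The PipelineConfig class, the load_config function and the local fallback paths are assumptions for illustration, not the actual contents of src/global_retail/.

```python
import os
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Mapping, Optional


@dataclass(frozen=True)
class PipelineConfig:
    """Storage roots used by the loaders (hypothetical helper class)."""

    raw_root: PurePosixPath
    archive_root: PurePosixPath

    def raw_path(self, filename: str) -> str:
        """Full path of a raw input file under the configured root."""
        return str(self.raw_root / filename)


def load_config(env: Optional[Mapping[str, str]] = None) -> PipelineConfig:
    """Read GR_RAW_ROOT and GR_ARCHIVE_ROOT, falling back to local folders.

    The fallback values are assumptions: data_samples/raw exists in the
    repository layout, data_samples/archive is made up for this sketch.
    """
    env = os.environ if env is None else env
    raw = env.get("GR_RAW_ROOT", "data_samples/raw")
    archive = env.get("GR_ARCHIVE_ROOT", "data_samples/archive")
    return PipelineConfig(PurePosixPath(raw), PurePosixPath(archive))


if __name__ == "__main__":
    cfg = load_config()
    print(cfg.raw_path("customer.csv"))
```

Passing the environment as an optional mapping keeps the function easy to unit test: a test can hand in a plain dict instead of patching os.environ.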

Code quality

black src tests
isort src tests
pylint src
pytest

Power BI dashboards

See docs/dashboard_notes.md for the full DAX measures, modeling steps, and dashboard layout used to build the final Power BI report.

Tech Stack

Databricks · PySpark · Delta Lake · Spark SQL · Unity Catalog Volumes · Power BI · Python 3.10+

License

MIT
