This project implements an end-to-end streaming data pipeline using Databricks, Auto Loader, and Delta Lake. It simulates incremental ingestion of partitioned JSON data and materializes Bronze, Silver, and Gold layers for downstream analytics and dashboarding.
The focus of this project is architectural clarity, reliability, and production-style design decisions rather than heavy infrastructure or large-scale cluster tuning.
This pipeline follows a Medallion (Bronze / Silver / Gold) pattern:
```mermaid
flowchart LR
    subgraph Ingestion
        A(Partitioned JSON Files<br/>year=YYYY / quarter=Q) --> B(Auto Loader<br/>cloudFiles<br/>Schema Evolution + Checkpointing)
    end
    subgraph Processing
        B --> C(Bronze Layer<br/>Raw Ingest<br/>_metadata + _ingest_ts)
        C --> D("Silver Layer<br/>Hashtag Normalization<br/>(explode + trim)")
        D --> E(Gold Layer<br/>Quarterly Aggregations<br/>Tag Counts + Distinct Users)
    end
    subgraph Serving
        E --> F(SQL Views<br/>Top Tags / Trend Views)
        F --> G(Lakehouse<br/>SQL Dashboard)
    end
```
Bronze
- Auto Loader ingests JSON files from a managed Volume (file arrival trigger).
- Captures metadata such as `_metadata.file_path`.
- Handles corrupt records and schema evolution.
- Writes raw data to a Delta table.
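A minimal sketch of the Bronze ingestion stream (paths, catalog, and table names are illustrative assumptions, not the exact notebook code; `spark` is the Databricks notebook session):

```python
from pyspark.sql import functions as F

bronze_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/Volumes/main/raw/_schemas/posts")
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .load("/Volumes/main/raw/posts/")  # partitioned as year=YYYY/quarter=Q
    .withColumn("_file_path", F.col("_metadata.file_path"))  # source file lineage
    .withColumn("_ingest_ts", F.current_timestamp())
)

(bronze_stream.writeStream
    .option("checkpointLocation", "/Volumes/main/raw/_checkpoints/bronze_posts")
    .trigger(availableNow=True)  # drain all pending files, then stop
    .toTable("main.bronze.posts"))
```

The checkpoint location is what makes re-runs idempotent: already-ingested files are skipped on the next trigger.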
Silver
- Cleans and normalizes data.
- Converts string "null" values to proper nulls.
- Parses timestamps.
- Explodes hashtags into normalized tag records.
- Adds derived columns (post_year, post_quarter).
- Writes clean structured data to Delta.
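The Silver steps above can be sketched as follows (column names such as `user_id`, `post_ts`, and `hashtags` are assumptions for illustration):

```python
from pyspark.sql import functions as F

bronze = spark.read.table("main.bronze.posts")  # illustrative table name

silver = (
    bronze
    # Convert literal "null" strings from the source into proper nulls.
    .withColumn("user_id",
                F.when(F.col("user_id") == "null", None).otherwise(F.col("user_id")))
    .withColumn("post_ts", F.to_timestamp("post_ts"))
    # One row per hashtag: split, explode, trim, lowercase.
    .withColumn("tag", F.explode(F.split(F.col("hashtags"), ",")))
    .withColumn("tag", F.lower(F.trim(F.col("tag"))))
    .filter(F.col("tag") != "")
    # Derived partition-friendly columns.
    .withColumn("post_year", F.year("post_ts"))
    .withColumn("post_quarter", F.quarter("post_ts"))
)

silver.write.format("delta").mode("overwrite").saveAsTable("main.silver.post_tags")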
Gold
- Aggregates tags by year and quarter.
- Computes:
- tag_count
- distinct_users
- Materializes analytics-ready Delta tables for dashboards.
The Gold layer is designed to minimize recomputation by pre-aggregating tag counts. Dashboard queries operate on this reduced dataset to avoid repeated heavy transformations on raw data.
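A hedged sketch of the Gold aggregation (table and column names carried over from the illustrative Silver example above):

```python
from pyspark.sql import functions as F

gold = (
    spark.read.table("main.silver.post_tags")
    .groupBy("post_year", "post_quarter", "tag")
    .agg(
        F.count("*").alias("tag_count"),
        F.countDistinct("user_id").alias("distinct_users"),
    )
)

gold.write.format("delta").mode("overwrite").saveAsTable("main.gold.quarterly_tag_stats")
```

Because this table is small relative to the raw data, dashboard queries against it stay cheap even as ingestion volume grows.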
- Databricks Auto Loader (`cloudFiles`)
- Streaming ingestion with `availableNow=True`
- Delta Lake tables (Bronze/Silver/Gold)
- Unity Catalog compatibility
- Metadata-driven partition extraction
- Window functions for ranking
- Dashboard visualizations using Databricks SQL
- Partition-aware ingestion simulation
- Volume-based storage (no DBFS dependency)
- Partitioned JSON batches written to a Volume.
- Auto Loader detects new files.
- Bronze streaming ingestion appends to Delta.
- Silver transformation notebook normalizes and explodes tags.
- Gold aggregation computes metrics for dashboard consumption.
- Databricks SQL dashboard queries Gold layer views.
- Ensured the `last_4_quarters` CTE is broadcast-joined to minimize shuffle.
- Verified use of `BroadcastHashJoin` for small dimension sets.
- Confirmed window ranking (`RANK() OVER (PARTITION BY ...)`) triggers expected shuffle boundaries.
- Observed adaptive execution (AQE) behavior in Spark.
- Validated that Top-K filtering reduces shuffle volume before final sort.
- Avoided unnecessary materialization at this dataset scale.
Result: Efficient execution plans with only necessary shuffle stages and broadcast joins for small dimension filters.
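The Top-K filter mentioned above can be mimicked in plain Python to show why it shrinks the data before the final sort: each partition keeps only its k best rows, mirroring `RANK() OVER (PARTITION BY year, quarter ORDER BY tag_count DESC) <= k` (the sample rows are made up for illustration, and `heapq.nlargest` ignores rank ties that `RANK()` would keep):

```python
import heapq
from collections import defaultdict

# Pre-aggregated (year, quarter, tag, tag_count) rows, as in the Gold layer.
rows = [
    (2025, 1, "ai", 120), (2025, 1, "data", 95), (2025, 1, "ml", 40),
    (2025, 2, "ai", 80), (2025, 2, "delta", 70), (2025, 2, "ml", 60),
]

def top_k_per_quarter(rows, k):
    """Keep only the k highest-count tags per (year, quarter) partition."""
    parts = defaultdict(list)
    for year, quarter, tag, count in rows:
        parts[(year, quarter)].append((tag, count))
    return {
        part: heapq.nlargest(k, tags, key=lambda t: t[1])
        for part, tags in parts.items()
    }

top2 = top_k_per_quarter(rows, 2)
# top2[(2025, 1)] == [("ai", 120), ("data", 95)]
```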
- Overwrite mode for Gold layer: acceptable for demo scale; MERGE would be used in production.
- Auto Loader chosen over batch read: enables incremental file arrival simulation.
- Schema evolution enabled: supports future column additions.
- Checkpointing enabled: ensures idempotent streaming execution.
- Volume storage used: aligns with Unity Catalog best practices.
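The production-style alternative to Gold overwrite mentioned above might look like this hedged MERGE sketch (names are illustrative, and `updates` stands for a freshly aggregated quarter slice):

```python
from delta.tables import DeltaTable

gold = DeltaTable.forName(spark, "main.gold.quarterly_tag_stats")

(gold.alias("t")
 .merge(
     updates.alias("s"),
     "t.post_year = s.post_year AND t.post_quarter = s.post_quarter AND t.tag = s.tag",
 )
 .whenMatchedUpdateAll()    # refresh counts for existing (quarter, tag) keys
 .whenNotMatchedInsertAll() # add newly seen tags
 .execute())
```

This keeps untouched quarters intact instead of rewriting the whole table on every run.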
```
exploration/   # Early experiments and validation notebooks
pipeline/      # Bronze, Silver, Gold notebooks/scripts
dashboards/    # SQL queries for dashboard panels
docs/          # Notes and architectural references
```
Exploration artifacts are intentionally preserved to demonstrate iterative development and validation of Auto Loader behavior and schema handling.
In a production environment:
- Gold layer would use incremental MERGE logic.
- Data quality checks (e.g., Great Expectations) would be added.
- Orchestration would include retry policies and monitoring.
- Cluster autoscaling would handle larger ingestion volumes.
The Bronze and Silver layers are orchestrated using a Databricks multi-task job:
- Task 1: File-triggered Auto Loader ingestion
- Task 2: Silver transformation
- Task 3: Gold aggregation
In production this would include retry policies, monitoring, and alerting.
- Databricks (Serverless)
- Apache Spark / PySpark
- Delta Lake
- Databricks SQL
- GitHub integration
- LLM-assisted development
This project demonstrates the ability to:
- Design and implement streaming Lakehouse pipelines
- Handle distributed data processing patterns
- Build production-style layered architectures
- Iterate rapidly while maintaining structural clarity
Created: 2026-02-13
Updated: 2026-02-20