kangaseric4152-max/pinterest_databricks_pipeline_simulation

Pinterest Streaming Lakehouse Pipeline (Databricks)

Overview

This project implements an end-to-end streaming data pipeline using Databricks, Auto Loader, and Delta Lake. It simulates incremental ingestion of partitioned JSON data and materializes Bronze, Silver, and Gold layers for downstream analytics and dashboarding.

The focus of this project is architectural clarity, reliability, and production-style design decisions rather than heavy infrastructure or large-scale cluster tuning.


Architecture

This pipeline follows a Medallion (Bronze / Silver / Gold) pattern:

Data Flow

```mermaid
flowchart LR
    subgraph Ingestion
        A(Partitioned JSON Files<br/>year=YYYY / quarter=Q) --> B(Auto Loader<br/>cloudFiles<br/>Schema Evolution + Checkpointing)
    end

    subgraph Processing
        B --> C(Bronze Layer<br/>Raw Ingest<br/>_metadata + _ingest_ts)
        C --> D("Silver Layer<br/>Hashtag Normalization<br/>(explode + trim)")
        D --> E(Gold Layer<br/>Quarterly Aggregations<br/>Tag Counts + Distinct Users)
    end

    subgraph Serving
        E --> F(SQL Views<br/>Top Tags / Trend Views)
        F --> G(Lakehouse<br/>SQL Dashboard)
    end
```

Bronze

  • Auto Loader ingests JSON files from a managed Volume (file arrival trigger).
  • Captures metadata such as _metadata.file_path.
  • Handles corrupt records and schema evolution.
  • Writes raw data to a Delta table.
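The Bronze ingestion described above can be sketched as follows. Paths and table names (`/Volumes/main/pinterest/...`, `main.pinterest.bronze_posts`) are placeholders, not the project's actual names; the Auto Loader options and metadata columns follow the behaviors listed above.

```python
# Hypothetical locations -- substitute the real catalog/schema/volume names.
SOURCE = "/Volumes/main/pinterest/raw_posts"
CHECKPOINT = "/Volumes/main/pinterest/_checkpoints/bronze"
TARGET = "main.pinterest.bronze_posts"

# Auto Loader (cloudFiles) options: JSON input, tracked schema with
# evolution enabled, and a rescue column for corrupt/unexpected records.
AUTOLOADER_OPTIONS = {
    "cloudFiles.format": "json",
    "cloudFiles.schemaLocation": CHECKPOINT,
    "cloudFiles.schemaEvolutionMode": "addNewColumns",
    "cloudFiles.rescuedDataColumn": "_rescued_data",
}

def start_bronze(spark):
    """Ingest newly arrived JSON files and append them to the Bronze Delta table."""
    stream = (
        spark.readStream.format("cloudFiles")
        .options(**AUTOLOADER_OPTIONS)
        .load(SOURCE)
        # Capture file lineage and an ingestion timestamp alongside the raw data.
        .selectExpr(
            "*",
            "_metadata.file_path AS _source_file",
            "current_timestamp() AS _ingest_ts",
        )
    )
    return (
        stream.writeStream
        .option("checkpointLocation", CHECKPOINT)
        .trigger(availableNow=True)   # process all pending files, then stop
        .toTable(TARGET)
    )
```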

Silver

  • Cleans and normalizes data.
  • Converts string "null" values to proper nulls.
  • Parses timestamps.
  • Explodes hashtags into normalized tag records.
  • Adds derived columns (post_year, post_quarter).
  • Writes clean structured data to Delta.
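A minimal sketch of the Silver cleaning logic. The pure-Python helpers mirror the "null"-string and tag normalization rules; `silver_transform` shows the equivalent Spark expressions. Column names (`user_id`, `post_ts`, `tags`) are assumptions about the simulated schema.

```python
def clean_value(v):
    """Map the literal string "null" (any casing) and empty strings to None."""
    if v is None:
        return None
    v = v.strip()
    return None if v == "" or v.lower() == "null" else v

def normalize_hashtags(raw_tags):
    """Trim, lowercase, and drop empty/"null" tags -- mirrors the
    explode + trim logic applied in the Silver notebook."""
    out = []
    for t in raw_tags or []:
        t = clean_value(t)
        if t is not None:
            out.append(t.lower())
    return out

def silver_transform(bronze_df):
    """Spark version (sketch): explode trimmed tags and derive partitions."""
    return bronze_df.selectExpr(
        "user_id",                                          # assumed column
        "to_timestamp(post_ts) AS post_ts",                 # parse timestamp
        "explode(transform(tags, t -> trim(t))) AS tag",    # one row per tag
        "year(to_timestamp(post_ts)) AS post_year",
        "quarter(to_timestamp(post_ts)) AS post_quarter",
    )
```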

Gold

  • Aggregates tags by year and quarter.
  • Computes:
    • tag_count
    • distinct_users
  • Materializes analytics-ready Delta tables for dashboards.
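The Gold aggregation can be expressed as a single SQL statement over the Silver table. Table and column names are placeholders; the two metrics match those listed above.

```python
# Hypothetical table names; the aggregation itself computes the Gold metrics.
GOLD_SQL = """
CREATE OR REPLACE TABLE main.pinterest.gold_tag_quarterly AS
SELECT
    post_year,
    post_quarter,
    tag,
    COUNT(*)                AS tag_count,
    COUNT(DISTINCT user_id) AS distinct_users
FROM main.pinterest.silver_tags
GROUP BY post_year, post_quarter, tag
"""

def build_gold(spark):
    """Materialize the analytics-ready Gold table (overwrite semantics)."""
    spark.sql(GOLD_SQL)
```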

Execution Plan Considerations

The Gold layer is designed to minimize recomputation by pre-aggregating tag counts. Dashboard queries operate on this reduced dataset to avoid repeated heavy transformations on raw data.


Key Technical Features

  • Databricks Auto Loader (cloudFiles)
  • Streaming ingestion with availableNow=True
  • Delta Lake tables (Bronze/Silver/Gold)
  • Unity Catalog compatibility
  • Metadata-driven partition extraction
  • Window functions for ranking
  • Dashboard visualizations using Databricks SQL
  • Partition-aware ingestion simulation
  • Volume-based storage (no DBFS dependency)

Data Flow

  1. Partitioned JSON batches written to a Volume.
  2. Auto Loader detects new files.
  3. Bronze streaming ingestion appends to Delta.
  4. Silver transformation notebook normalizes and explodes tags.
  5. Gold aggregation computes metrics for dashboard consumption.
  6. Databricks SQL dashboard queries Gold layer views.

Query Optimization & Execution Analysis

  • Ensured last_4_quarters CTE is broadcast-joined to minimize shuffle.
  • Verified use of BroadcastHashJoin for small dimension sets.
  • Confirmed window ranking (RANK() OVER (PARTITION BY ...)) triggers expected shuffle boundaries.
  • Observed adaptive execution (AQE) behavior in Spark.
  • Validated that Top-K filtering reduces shuffle volume before final sort.
  • Avoided unnecessary materialization at this dataset scale.

Result: Efficient execution plans with only necessary shuffle stages and broadcast joins for small dimension filters.
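A dashboard query shaped like the one analyzed above might look as follows. `last_4_quarters` is the small CTE the plan analysis refers to; the `BROADCAST` hint and `RANK()` window correspond to the verified behaviors, while table and column names are assumptions.

```python
# Sketch of a Top-K dashboard query over the Gold table.
TOP_TAGS_SQL = """
WITH last_4_quarters AS (
    SELECT DISTINCT post_year, post_quarter
    FROM main.pinterest.gold_tag_quarterly
    ORDER BY post_year DESC, post_quarter DESC
    LIMIT 4
),
ranked AS (
    SELECT /*+ BROADCAST(q) */
        g.post_year,
        g.post_quarter,
        g.tag,
        g.tag_count,
        RANK() OVER (
            PARTITION BY g.post_year, g.post_quarter
            ORDER BY g.tag_count DESC
        ) AS tag_rank
    FROM main.pinterest.gold_tag_quarterly g
    JOIN last_4_quarters q
      ON g.post_year = q.post_year
     AND g.post_quarter = q.post_quarter
)
SELECT *
FROM ranked
WHERE tag_rank <= 10          -- Top-K filter applied before the final sort
ORDER BY post_year, post_quarter, tag_rank
"""
```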


Design Decisions

  • Overwrite mode for Gold layer: acceptable for demo scale; MERGE would be used in production.
  • Auto Loader chosen over batch read: enables incremental file arrival simulation.
  • Schema evolution enabled: supports future column additions.
  • Checkpointing enabled: ensures idempotent streaming execution.
  • Volume storage used: aligns with Unity Catalog best practices.
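For the first decision above, the production-grade alternative to overwriting the Gold table would be an incremental upsert. A sketch, assuming a staged `updated_quarters` source and the Gold column names used elsewhere in this README:

```python
# Hypothetical incremental MERGE for the Gold layer -- what would replace
# full overwrite in production. `updated_quarters` is an assumed staging view.
GOLD_MERGE_SQL = """
MERGE INTO main.pinterest.gold_tag_quarterly AS t
USING updated_quarters AS s
ON  t.post_year    = s.post_year
AND t.post_quarter = s.post_quarter
AND t.tag          = s.tag
WHEN MATCHED THEN UPDATE SET
    t.tag_count      = s.tag_count,
    t.distinct_users = s.distinct_users
WHEN NOT MATCHED THEN INSERT *
"""
```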

Repository Structure

exploration/    # Early experiments and validation notebooks
pipeline/       # Bronze, Silver, Gold notebooks/scripts
dashboards/     # SQL queries for dashboard panels
docs/           # Notes and architectural references

Exploration artifacts are intentionally preserved to demonstrate iterative development and validation of Auto Loader behavior and schema handling.


How This Would Scale

In a production environment:

  • Gold layer would use incremental MERGE logic.
  • Data quality checks (e.g., Great Expectations) would be added.
  • Orchestration would include retry policies and monitoring.
  • Cluster autoscaling would handle larger ingestion volumes.

Orchestration

The Bronze and Silver layers are orchestrated using a Databricks multi-task job:

  • Task 1: File-triggered Auto Loader ingestion
  • Task 2: Silver transformation
  • Task 3: Gold aggregation

In production this would include retry policies, monitoring, and alerting.
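The multi-task job above can be sketched as a Databricks Jobs API 2.1-style payload. Notebook paths and the file-arrival trigger URL are assumptions; `depends_on` encodes the Bronze → Silver → Gold ordering.

```python
# Hypothetical job spec for the three-task pipeline.
JOB_SPEC = {
    "name": "pinterest_medallion_pipeline",
    # Fires when new files land in the source Volume (path is a placeholder).
    "trigger": {"file_arrival": {"url": "/Volumes/main/pinterest/raw_posts"}},
    "tasks": [
        {
            "task_key": "bronze_ingest",
            "notebook_task": {"notebook_path": "/pipeline/bronze_autoloader"},
        },
        {
            "task_key": "silver_transform",
            "depends_on": [{"task_key": "bronze_ingest"}],
            "notebook_task": {"notebook_path": "/pipeline/silver_transform"},
        },
        {
            "task_key": "gold_aggregate",
            "depends_on": [{"task_key": "silver_transform"}],
            "notebook_task": {"notebook_path": "/pipeline/gold_aggregate"},
        },
    ],
}
```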


Technologies Used

  • Databricks (Serverless)
  • Apache Spark / PySpark
  • Delta Lake
  • Databricks SQL
  • GitHub integration
  • LLMs

Purpose

This project demonstrates the ability to:

  • Design and implement streaming Lakehouse pipelines
  • Handle distributed data processing patterns
  • Build production-style layered architectures
  • Iterate rapidly while maintaining structural clarity

Created: 2026-02-13

Updated: 2026-02-20
