kangaseric4152-max/pinterest_databricks_pipeline_simulation

Pinterest Streaming Lakehouse Pipeline (Databricks)

Overview

This project implements an end-to-end streaming data pipeline using Databricks, Auto Loader, and Delta Lake. It simulates incremental ingestion of partitioned JSON data and materializes Bronze, Silver, and Gold layers for downstream analytics and dashboarding.

The focus of this project is architectural clarity, reliability, and production-style design decisions rather than heavy infrastructure or large-scale cluster tuning.


Architecture

This pipeline follows a Medallion (Bronze / Silver / Gold) pattern:

Data Flow

```mermaid
flowchart LR
    subgraph Ingestion
        A(Partitioned JSON Files<br/>year=YYYY / quarter=Q) --> B(Auto Loader<br/>cloudFiles<br/>Schema Evolution + Checkpointing)
    end

    subgraph Processing
        B --> C(Bronze Layer<br/>Raw Ingest<br/>_metadata + _ingest_ts)
        C --> D("Silver Layer<br/>Hashtag Normalization<br/>(explode + trim)")
        D --> E(Gold Layer<br/>Quarterly Aggregations<br/>Tag Counts + Distinct Users)
    end

    subgraph Serving
        E --> F(SQL Views<br/>Top Tags / Trend Views)
        F --> G(Lakehouse<br/>SQL Dashboard)
    end
```

Bronze

  • Auto Loader ingests JSON files from a managed Volume (file arrival trigger).
  • Captures metadata such as _metadata.file_path.
  • Handles corrupt records and schema evolution.
  • Writes raw data to a Delta table.
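The Bronze ingestion described above can be sketched as follows. Paths and table names (`/Volumes/main/pinterest/...`, `main.pinterest.bronze_posts`) are placeholders, not the project's actual names; the Auto Loader options and metadata columns follow the behaviors listed above.

```python
# Hypothetical locations -- substitute the real catalog/schema/volume names.
SOURCE = "/Volumes/main/pinterest/raw_posts"
CHECKPOINT = "/Volumes/main/pinterest/_checkpoints/bronze"
TARGET = "main.pinterest.bronze_posts"

# Auto Loader (cloudFiles) options: JSON input, tracked schema with
# evolution enabled, and a rescue column for corrupt/unexpected records.
AUTOLOADER_OPTIONS = {
    "cloudFiles.format": "json",
    "cloudFiles.schemaLocation": CHECKPOINT,
    "cloudFiles.schemaEvolutionMode": "addNewColumns",
    "cloudFiles.rescuedDataColumn": "_rescued_data",
}

def start_bronze(spark):
    """Ingest newly arrived JSON files and append them to the Bronze Delta table."""
    stream = (
        spark.readStream.format("cloudFiles")
        .options(**AUTOLOADER_OPTIONS)
        .load(SOURCE)
        # Capture file lineage and an ingestion timestamp alongside the raw data.
        .selectExpr(
            "*",
            "_metadata.file_path AS _source_file",
            "current_timestamp() AS _ingest_ts",
        )
    )
    return (
        stream.writeStream
        .option("checkpointLocation", CHECKPOINT)
        .trigger(availableNow=True)   # process all pending files, then stop
        .toTable(TARGET)
    )
```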

Silver

  • Cleans and normalizes data.
  • Converts string "null" values to proper nulls.
  • Parses timestamps.
  • Explodes hashtags into normalized tag records.
  • Adds derived columns (post_year, post_quarter).
  • Writes clean structured data to Delta.
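A minimal sketch of the Silver cleaning logic. The pure-Python helpers mirror the "null"-string and tag normalization rules; `silver_transform` shows the equivalent Spark expressions. Column names (`user_id`, `post_ts`, `tags`) are assumptions about the simulated schema.

```python
def clean_value(v):
    """Map the literal string "null" (any casing) and empty strings to None."""
    if v is None:
        return None
    v = v.strip()
    return None if v == "" or v.lower() == "null" else v

def normalize_hashtags(raw_tags):
    """Trim, lowercase, and drop empty/"null" tags -- mirrors the
    explode + trim logic applied in the Silver notebook."""
    out = []
    for t in raw_tags or []:
        t = clean_value(t)
        if t is not None:
            out.append(t.lower())
    return out

def silver_transform(bronze_df):
    """Spark version (sketch): explode trimmed tags and derive partitions."""
    return bronze_df.selectExpr(
        "user_id",                                          # assumed column
        "to_timestamp(post_ts) AS post_ts",                 # parse timestamp
        "explode(transform(tags, t -> trim(t))) AS tag",    # one row per tag
        "year(to_timestamp(post_ts)) AS post_year",
        "quarter(to_timestamp(post_ts)) AS post_quarter",
    )
```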

Gold

  • Aggregates tags by year and quarter.
  • Computes:
    • tag_count
    • distinct_users
  • Materializes analytics-ready Delta tables for dashboards.
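The Gold aggregation can be expressed as a single SQL statement over the Silver table. Table and column names are placeholders; the two metrics match those listed above.

```python
# Hypothetical table names; the aggregation itself computes the Gold metrics.
GOLD_SQL = """
CREATE OR REPLACE TABLE main.pinterest.gold_tag_quarterly AS
SELECT
    post_year,
    post_quarter,
    tag,
    COUNT(*)                AS tag_count,
    COUNT(DISTINCT user_id) AS distinct_users
FROM main.pinterest.silver_tags
GROUP BY post_year, post_quarter, tag
"""

def build_gold(spark):
    """Materialize the analytics-ready Gold table (overwrite semantics)."""
    spark.sql(GOLD_SQL)
```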

Execution Plan Considerations

The Gold layer is designed to minimize recomputation by pre-aggregating tag counts. Dashboard queries operate on this reduced dataset to avoid repeated heavy transformations on raw data.


Key Technical Features

  • Databricks Auto Loader (cloudFiles)
  • Streaming ingestion with availableNow=True
  • Delta Lake tables (Bronze/Silver/Gold)
  • Unity Catalog compatibility
  • Metadata-driven partition extraction
  • Window functions for ranking
  • Dashboard visualizations using Databricks SQL
  • Partition-aware ingestion simulation
  • Volume-based storage (no DBFS dependency)

Data Flow

  1. Partitioned JSON batches written to a Volume.
  2. Auto Loader detects new files.
  3. Bronze streaming ingestion appends to Delta.
  4. Silver transformation notebook normalizes and explodes tags.
  5. Gold aggregation computes metrics for dashboard consumption.
  6. Databricks SQL dashboard queries Gold layer views.

Query Optimization & Execution Analysis

  • Ensured last_4_quarters CTE is broadcast-joined to minimize shuffle.
  • Verified use of BroadcastHashJoin for small dimension sets.
  • Confirmed window ranking (RANK() OVER (PARTITION BY ...)) triggers expected shuffle boundaries.
  • Observed adaptive execution (AQE) behavior in Spark.
  • Validated that Top-K filtering reduces shuffle volume before final sort.
  • Avoided unnecessary materialization at this dataset scale.

Result: Efficient execution plans with only necessary shuffle stages and broadcast joins for small dimension filters.
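A dashboard query shaped like the one analyzed above might look as follows. `last_4_quarters` is the small CTE the plan analysis refers to; the `BROADCAST` hint and `RANK()` window correspond to the verified behaviors, while table and column names are assumptions.

```python
# Sketch of a Top-K dashboard query over the Gold table.
TOP_TAGS_SQL = """
WITH last_4_quarters AS (
    SELECT DISTINCT post_year, post_quarter
    FROM main.pinterest.gold_tag_quarterly
    ORDER BY post_year DESC, post_quarter DESC
    LIMIT 4
),
ranked AS (
    SELECT /*+ BROADCAST(q) */
        g.post_year,
        g.post_quarter,
        g.tag,
        g.tag_count,
        RANK() OVER (
            PARTITION BY g.post_year, g.post_quarter
            ORDER BY g.tag_count DESC
        ) AS tag_rank
    FROM main.pinterest.gold_tag_quarterly g
    JOIN last_4_quarters q
      ON g.post_year = q.post_year
     AND g.post_quarter = q.post_quarter
)
SELECT *
FROM ranked
WHERE tag_rank <= 10          -- Top-K filter applied before the final sort
ORDER BY post_year, post_quarter, tag_rank
"""
```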


Design Decisions

  • Overwrite mode for Gold layer: acceptable for demo scale; MERGE would be used in production.
  • Auto Loader chosen over batch read: enables incremental file arrival simulation.
  • Schema evolution enabled: supports future column additions.
  • Checkpointing enabled: ensures idempotent streaming execution.
  • Volume storage used: aligns with Unity Catalog best practices.
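For the first decision above, the production-grade alternative to overwriting the Gold table would be an incremental upsert. A sketch, assuming a staged `updated_quarters` source and the Gold column names used elsewhere in this README:

```python
# Hypothetical incremental MERGE for the Gold layer -- what would replace
# full overwrite in production. `updated_quarters` is an assumed staging view.
GOLD_MERGE_SQL = """
MERGE INTO main.pinterest.gold_tag_quarterly AS t
USING updated_quarters AS s
ON  t.post_year    = s.post_year
AND t.post_quarter = s.post_quarter
AND t.tag          = s.tag
WHEN MATCHED THEN UPDATE SET
    t.tag_count      = s.tag_count,
    t.distinct_users = s.distinct_users
WHEN NOT MATCHED THEN INSERT *
"""
```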

Repository Structure

exploration/    # Early experiments and validation notebooks
pipeline/       # Bronze, Silver, Gold notebooks/scripts
dashboards/     # SQL queries for dashboard panels
docs/           # Notes and architectural references

Exploration artifacts are intentionally preserved to demonstrate iterative development and validation of Auto Loader behavior and schema handling.


How This Would Scale

In a production environment:

  • Gold layer would use incremental MERGE logic.
  • Data quality checks (e.g., Great Expectations) would be added.
  • Orchestration would include retry policies and monitoring.
  • Cluster autoscaling would handle larger ingestion volumes.

Orchestration

The Bronze and Silver layers are orchestrated using a Databricks multi-task job:

  • Task 1: File-triggered Auto Loader ingestion
  • Task 2: Silver transformation
  • Task 3: Gold aggregation

In production this would include retry policies, monitoring, and alerting.
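The multi-task job above can be sketched as a Databricks Jobs API 2.1-style payload. Notebook paths and the file-arrival trigger URL are assumptions; `depends_on` encodes the Bronze → Silver → Gold ordering.

```python
# Hypothetical job spec for the three-task pipeline.
JOB_SPEC = {
    "name": "pinterest_medallion_pipeline",
    # Fires when new files land in the source Volume (path is a placeholder).
    "trigger": {"file_arrival": {"url": "/Volumes/main/pinterest/raw_posts"}},
    "tasks": [
        {
            "task_key": "bronze_ingest",
            "notebook_task": {"notebook_path": "/pipeline/bronze_autoloader"},
        },
        {
            "task_key": "silver_transform",
            "depends_on": [{"task_key": "bronze_ingest"}],
            "notebook_task": {"notebook_path": "/pipeline/silver_transform"},
        },
        {
            "task_key": "gold_aggregate",
            "depends_on": [{"task_key": "silver_transform"}],
            "notebook_task": {"notebook_path": "/pipeline/gold_aggregate"},
        },
    ],
}
```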


Technologies Used

  • Databricks (Serverless)
  • Apache Spark / PySpark
  • Delta Lake
  • Databricks SQL
  • GitHub integration
  • LLMs

Purpose

This project demonstrates the ability to:

  • Design and implement streaming Lakehouse pipelines
  • Handle distributed data processing patterns
  • Build production-style layered architectures
  • Iterate rapidly while maintaining structural clarity

Created: 2026-02-13

Updated: 2026-02-20
