Spotify Real-Time Data Engineering Pipeline (Azure & Databricks)

📌 Project Overview

This repository contains a production-grade Data Engineering pipeline built on the Microsoft Azure ecosystem. The project transforms raw Spotify clickstream data into analytics-ready datasets using a Medallion Architecture (Bronze, Silver, Gold).

The technical core of this project is the implementation of Delta Live Tables (DLT) and Change Data Capture (CDC) to maintain a high-performance, incremental Gold layer, all managed via Databricks Asset Bundles (DABs) for a professional CI/CD workflow.

Architecture

🛠️ Tech Stack & Key Keywords

Cloud Platform: Microsoft Azure (ADLS Gen2, Azure SQL, ADF)
Data Processing & Compute: Azure Databricks (Spark, Delta Live Tables)
Orchestration: Azure Data Factory with Watermark-based Incremental Loading
Deployment: Databricks Asset Bundles (DABs) as Infrastructure as Code (IaC) for repeatable deployments
Security: Managed Identity-based access, which removes the need to store credentials

🏗️ Implementation Details

Phase 1: Ingestion & Bronze Layer

Incremental ELT: Configured ADF to move data from Azure SQL Database to ADLS Gen2.
Watermarking Logic: Implemented a Lookup-based watermark pattern in which a Lookup activity reads the last processed timestamp and each run copies only records newer than it.
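
The watermark pattern above can be sketched in a few lines. This is a minimal illustration, not the ADF pipeline itself: SQLite stands in for Azure SQL, and the `plays` and `watermark` tables, their columns, and both function names are hypothetical.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory source table plus a watermark table (both hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE plays (play_id INTEGER, track TEXT, updated_at TEXT)")
    conn.execute("CREATE TABLE watermark (table_name TEXT PRIMARY KEY, watermark_value TEXT)")
    conn.execute("INSERT INTO watermark VALUES ('plays', '1900-01-01T00:00:00')")
    conn.executemany(
        "INSERT INTO plays VALUES (?, ?, ?)",
        [(1, "track_a", "2024-01-01T10:00:00"), (2, "track_b", "2024-01-01T11:00:00")],
    )
    return conn


def incremental_copy(conn, table, ts_col):
    """Return only rows newer than the stored watermark, then advance it.

    table and ts_col are interpolated into SQL, so they must be trusted identifiers.
    Timestamps are ISO-8601 strings, which sort correctly as plain text.
    """
    # Lookup 1: the last processed timestamp.
    (old_wm,) = conn.execute(
        "SELECT watermark_value FROM watermark WHERE table_name = ?", (table,)
    ).fetchone()
    # Lookup 2: the current high-water mark in the source.
    (new_wm,) = conn.execute(f"SELECT MAX({ts_col}) FROM {table}").fetchone()
    if new_wm is None or new_wm <= old_wm:
        return []  # nothing new since the last run
    # Copy step: bounded on both sides so rows arriving mid-run wait for the next run.
    rows = conn.execute(
        f"SELECT * FROM {table} WHERE {ts_col} > ? AND {ts_col} <= ?",
        (old_wm, new_wm),
    ).fetchall()
    # Advance the watermark only after the copy step has succeeded.
    conn.execute(
        "UPDATE watermark SET watermark_value = ? WHERE table_name = ?",
        (new_wm, table),
    )
    conn.commit()
    return rows
```

Calling `incremental_copy(conn, "plays", "updated_at")` twice in a row on the demo database returns the two demo rows the first time and an empty list the second, because the watermark has moved forward.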

Phase 2: Transformation & Silver Layer

The Silver layer is implemented with Azure Databricks and Spark Structured Streaming and follows a path-based Delta Lake approach.

Rather than creating managed catalog tables up front, this project writes and maintains Delta files at explicit storage paths (URIs) in ADLS Gen2. These paths act as the system of record, and tables are registered in Unity Catalog only once the data at those paths is stable.

Key Silver Layer Design Decisions

Delta files are read and written using explicit storage paths (URIs)
Upsert (MERGE) logic targets the Delta path directly rather than a catalog table
Streaming micro-batches are handled using foreachBatch
First load creates Delta files; subsequent loads perform MERGE-based upserts
Unity Catalog tables are created on top of existing Delta files (external tables)
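
The decisions above can be sketched as a small set of helpers. This is an illustration under assumptions, not the repository's Silver writer class: all function names are hypothetical, the input is assumed to be an already-built streaming DataFrame, the key columns are assumed to be unique within each micro-batch, and `delta-spark` with PySpark 3.3 or later is assumed at run time.

```python
def build_merge_condition(keys, target="t", source="s"):
    """Join condition that matches target and source rows on the key columns."""
    return " AND ".join(f"{target}.{k} = {source}.{k}" for k in keys)


def make_upsert_fn(target_path, keys):
    """Return a foreachBatch handler that upserts each micro-batch into target_path."""

    def upsert_batch(batch_df, batch_id):
        # Imported lazily so this module loads even where delta-spark is absent.
        from delta.tables import DeltaTable

        spark = batch_df.sparkSession
        if not DeltaTable.isDeltaTable(spark, target_path):
            # First load: no Delta log at the path yet, so create the files.
            batch_df.write.format("delta").mode("overwrite").save(target_path)
            return
        # Subsequent loads: MERGE against the path, not a catalog table.
        # MERGE fails if several batch rows match one target row, so
        # deduplicate batch_df on the keys first if that can happen.
        (
            DeltaTable.forPath(spark, target_path)
            .alias("t")
            .merge(batch_df.alias("s"), build_merge_condition(keys))
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute()
        )

    return upsert_batch


def start_silver_stream(stream_df, target_path, checkpoint_path, keys):
    """Wire a streaming DataFrame to the path-based upsert handler."""
    return (
        stream_df.writeStream
        .foreachBatch(make_upsert_fn(target_path, keys))
        .option("checkpointLocation", checkpoint_path)
        .trigger(once=True)  # process what is available, then stop
        .start()
    )


def register_external_table_sql(table_name, target_path):
    """SQL that registers an external table over the existing Delta files."""
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name} "
        f"USING DELTA LOCATION '{target_path}'"
    )
```

Keeping the merge condition and the registration statement as plain string builders makes them easy to check without a cluster; the statement from `register_external_table_sql` would be passed to `spark.sql` once the files exist.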

Why This Approach?

Decouples physical storage from logical table definitions
Safer and more flexible for streaming + incremental pipelines
Aligns with real-world lakehouse patterns used in production
Avoids early dependency on catalog availability

Reference Implementation: A dedicated Silver writer class handles path-based Delta upserts, streaming checkpoints, and later registration into Unity Catalog. See the Writer and Reader implementations in the codebase for details.

Phase 3: Gold Layer & CDC (The "DABs" Advantage)

The Gold layer is built using Databricks Delta Live Tables (DLT) pipelines and represents the final, analytics-ready datasets.

Unlike ad-hoc notebook transformations, this project uses declarative DLT pipelines to apply CDC-based transformations consistently across all fact and dimension tables.

Gold Layer Design Pattern

For each Gold table, the following standardized pattern is implemented:

Streaming Staging Table
- Reads from the Silver layer using readStream
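- Configured with ignoreChanges = true so the stream keeps running when upstream MERGE operations rewrite Delta files; rewritten files can re-emit unchanged rows, so the downstream CDC step must tolerate duplicates
- Acts as a controlled staging layer for CDC processing

Explicit Target Table Declaration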
- Target Gold tables are explicitly declared as streaming tables
- Prevents runtime dependency and ordering issues during pipeline execution

CDC Application Using apply_changes
- Uses primary keys and sequence columns
- Implements SCD Type 1 logic
- Automatically handles inserts and updates
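
The three steps above can be sketched as one reusable function. This is a hedged illustration rather than the repository's pipeline code: `define_gold_table`, `cdc_settings`, and the `_stg` suffix are hypothetical names, and `dlt` and `spark` are passed in as parameters (an assumption made so the sketch can be imported outside a Databricks pipeline).

```python
def cdc_settings(target, source, keys, sequence_col):
    """Keyword arguments for dlt.apply_changes with SCD Type 1 semantics."""
    return {
        "target": target,
        "source": source,
        "keys": list(keys),           # primary key columns
        "sequence_by": sequence_col,  # ordering column; the latest value wins
        "stored_as_scd_type": 1,      # overwrite in place, keep no history
    }


def define_gold_table(dlt, spark, name, silver_path, keys, sequence_col):
    """Declare staging table, target table, and CDC flow for one Gold table."""
    staging = f"{name}_stg"

    # Step 1: streaming staging table that reads the Silver Delta path.
    @dlt.table(name=staging)
    def _staging():
        return (
            spark.readStream.format("delta")
            .option("ignoreChanges", "true")  # tolerate MERGE rewrites upstream
            .load(silver_path)
        )

    # Step 2: declare the target explicitly before any flow writes to it.
    dlt.create_streaming_table(name=name)

    # Step 3: apply inserts and updates by key, ordered by the sequence column.
    dlt.apply_changes(**cdc_settings(name, staging, keys, sequence_col))
```

Inside a pipeline notebook this would be called after `import dlt`, for example as `define_gold_table(dlt, spark, "dim_user", silver_path, ["user_id"], "updated_at")`, where the table name, columns, and `silver_path` are placeholders.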

Why Delta Live Tables?

Declarative, production-grade pipeline management
Built-in data quality, lineage, and dependency handling
Simplified CDC implementation at scale
Clear separation between staging and curated layers

Reference Implementation: All Gold tables follow a consistent DLT pattern using staging tables and apply_changes logic. See the Gold layer DLT pipeline definitions for table-specific configurations.

🔐 Security & IAM (Concise)

Azure Managed Identities used for ADF and Databricks (no secrets)
RBAC enforced on ADLS Gen2 using least-privilege access
Unity Catalog controls access between Silver (read) and Gold (write) layers
External locations secured via managed identity
Optional integration with Azure Key Vault for API secrets

⚙️ Performance & Reliability (Concise)

Incremental ingestion using CDC / watermarks
Delta Lake ensures ACID transactions and schema enforcement
DLT pipelines provide fault-tolerant, declarative Gold layer processing
Streaming handled with checkpoints and trigger(once=True)
Optimized reads using Delta file compaction & Z-Ordering
Pipelines are restart-safe and idempotent
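
For the compaction and Z-Ordering point, a small helper can build the maintenance statement for a path-based Delta table. This is a sketch only: the helper name, the example path, and the choice of Z-Order column are hypothetical, and the resulting `OPTIMIZE ... ZORDER BY` statement is assumed to be run on Databricks through `spark.sql`.

```python
def optimize_sql(path, zorder_cols=()):
    """Build an OPTIMIZE statement for a Delta table addressed by storage path."""
    stmt = f"OPTIMIZE delta.`{path}`"  # compacts small files
    if zorder_cols:
        # Co-locate rows by columns that are frequently used in filters.
        stmt += " ZORDER BY (" + ", ".join(zorder_cols) + ")"
    return stmt
```

A scheduled job could then call something like `spark.sql(optimize_sql(silver_path, ["user_id"]))` after the streaming run completes.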

📊 Use Cases

Incremental processing of Spotify-style streaming and user activity data
Near-real-time fact table creation using CDC and Delta Live Tables
User engagement and listening behavior analysis
Track and artist popularity analytics
Analytics-ready datasets for BI and reporting tools
Reference architecture for cloud-scale Azure data platforms

✅ Conclusion

This project demonstrates a production-grade Azure data engineering solution built using modern Lakehouse principles. It showcases watermark-based incremental ingestion, CDC-driven Gold tables, secure cloud-native orchestration, and scalable transformations using Azure Data Factory, Databricks, Delta Lake, and Delta Live Tables.

The design emphasizes reliability, performance, and maintainability, following industry-standard Bronze–Silver–Gold architecture and best practices used in real enterprise data platforms.

Overall, this project reflects hands-on experience with end-to-end data engineering workflows and the ability to design robust, analytics-ready data systems on Azure.
