Sanjay-dev-ds/yt-analytics-data-platform

YouTube ELT on AWS

This project is an end-to-end ELT pipeline for YouTube analytics built on AWS.
It uses Amazon MWAA (Airflow) for orchestration, Amazon Redshift Serverless as the warehouse, Amazon S3 for raw data and code, AWS Secrets Manager for credentials, and dbt for transformations.

High‑Level Architecture

[Diagram: YouTube ELT Architecture]

At a glance:

  • YouTube API → MWAA: Airflow calls the YouTube API and lands raw JSON files in S3.
  • S3 → Redshift (landing/staging): Airflow loads data from S3 into Redshift using COPY.
  • Redshift + dbt: dbt models transform data into clean dimensions and fact tables.
  • Secrets Manager: Stores Redshift and API credentials used by Airflow.
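The extract leg (YouTube API → MWAA → S3) can be sketched roughly as below. The `forHandle` parameter comes from the YouTube Data API v3 `channels` endpoint; the `raw/youtube/...` key layout and the function names are illustrative assumptions, not the project's actual code.

```python
import json
from datetime import datetime, timezone

CHANNELS_URL = "https://www.googleapis.com/youtube/v3/channels"

def build_raw_key(channel_handle: str, ts: datetime) -> str:
    """S3 key for one raw extract; the raw/ layout is an assumption."""
    return f"raw/youtube/{channel_handle}/{ts:%Y/%m/%d}/channels_{ts:%H%M%S}.json"

def extract_channel_to_s3(api_key: str, channel_handle: str, bucket: str) -> str:
    """Call the YouTube Data API and land the raw JSON response in S3."""
    import boto3      # deferred so the pure helper above needs no AWS deps
    import requests

    resp = requests.get(
        CHANNELS_URL,
        params={"part": "snippet,statistics",
                "forHandle": channel_handle,
                "key": api_key},
        timeout=30,
    )
    resp.raise_for_status()
    key = build_raw_key(channel_handle, datetime.now(timezone.utc))
    boto3.client("s3").put_object(
        Bucket=bucket, Key=key, Body=json.dumps(resp.json()).encode("utf-8")
    )
    return key
```

Partitioning the raw keys by date keeps S3 listings cheap and makes later incremental loads into the landing schema straightforward.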

1. Prerequisites

  • AWS account with permissions to create S3, MWAA, Redshift Serverless, Secrets Manager, and IAM roles.
  • AWS CLI configured locally.
  • Python 3.10+, with the dependencies from the project's requirements.txt installed.
  • A YouTube API key.

2. Quick Start

  1. Create S3 buckets

    • One bucket for MWAA code (DAGs, dbt project, requirements, plugins).
    • One bucket for raw data (YouTube JSON files).
  2. Upload project to S3

    • Sync the dags/ folder to the MWAA bucket under dags/.
    • Upload requirements.txt (and plugins if you have them) to the MWAA bucket root.
  3. Provision Redshift Serverless

    • Create a namespace, workgroup, and default database (for example elt_db).
    • Ensure it is reachable from the MWAA VPC/subnets.
  4. Create secrets

    • In Secrets Manager, create:
      • A Redshift secret with host, port, db name, username, and password.
      • A YouTube API key secret (or store the key as an Airflow Variable).
  5. Create the MWAA environment

    • Point MWAA to:
      • DAGs folder: your dags/ path in the MWAA S3 bucket.
      • Requirements file: requirements.txt in the same bucket.
    • Attach an execution role that can read your S3 buckets, retrieve secrets from Secrets Manager, and connect to Redshift.
  6. Configure Airflow

    • Connections:
      • redshift_db_yt_elt (Postgres type) pointing to Redshift Serverless.
    • Variables:
      • API_KEY, CHANNEL_HANDLE, S3_DATA_BUCKET, and optionally REDSHIFT_IAM_ROLE.
  7. Initialize schemas

    • In Redshift, create:
      CREATE SCHEMA IF NOT EXISTS landing;
      CREATE SCHEMA IF NOT EXISTS staging;
      CREATE SCHEMA IF NOT EXISTS core;
      CREATE SCHEMA IF NOT EXISTS analytics;
  8. Run the pipeline

    • In the MWAA Airflow UI:
      • Trigger youtube_extract_data_pipeline to pull data from the YouTube API to S3.
      • Confirm youtube_stage_load_pipeline loads data into Redshift.
      • Trigger youtube_dbt_pipeline to run dbt models and build marts.
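For step 4, the Redshift secret can be created with boto3's Secrets Manager client. The JSON field names below are an assumption; they must match whatever keys the DAGs read back from the secret.

```python
import json

def redshift_secret_payload(host: str, port: int, dbname: str,
                            username: str, password: str) -> str:
    """JSON body for the Redshift secret. Field names are an assumption;
    keep them in sync with what the Airflow code expects."""
    return json.dumps({"host": host, "port": port, "dbname": dbname,
                       "username": username, "password": password})

def create_redshift_secret(name: str, **fields) -> None:
    """Create the secret in Secrets Manager (requires AWS credentials)."""
    import boto3  # deferred so the payload helper works without AWS deps
    boto3.client("secretsmanager").create_secret(
        Name=name, SecretString=redshift_secret_payload(**fields))
```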
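For the redshift_db_yt_elt connection in step 6, one option is to define it from a URI (Airflow also accepts connections via AIRFLOW_CONN_* environment variables in this form). A small sketch, with the password percent-quoted to survive special characters; the host and database names are placeholders:

```python
from urllib.parse import quote

def airflow_conn_uri(user: str, password: str, host: str,
                     port: int, dbname: str) -> str:
    """Airflow connection URI for a Postgres-type connection pointing at
    Redshift Serverless. Quoting guards against '@', ':', etc. in the
    password."""
    return (f"postgres://{quote(user)}:{quote(password, safe='')}"
            f"@{host}:{port}/{dbname}")
```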
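Step 8 can also be driven programmatically: MWAA exposes an Airflow-CLI-over-HTTP endpoint, reached with a short-lived token from boto3's create_cli_token. This follows the pattern in the AWS docs; the environment name is a placeholder for your MWAA environment.

```python
def trigger_command(dag_id: str) -> str:
    """Airflow CLI command string the MWAA /aws_mwaa/cli endpoint expects."""
    return f"dags trigger {dag_id}"

def trigger_dag(env_name: str, dag_id: str) -> int:
    """Trigger a DAG remotely through the MWAA CLI endpoint."""
    import boto3      # deferred imports; calling this needs AWS credentials
    import requests

    token = boto3.client("mwaa").create_cli_token(Name=env_name)
    resp = requests.post(
        f"https://{token['WebServerHostname']}/aws_mwaa/cli",
        headers={"Authorization": f"Bearer {token['CliToken']}",
                 "Content-Type": "text/plain"},
        data=trigger_command(dag_id),
        timeout=30,
    )
    return resp.status_code
```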

3. dbt Project at a Glance

The dbt project lives under dags/youtube_analytics/ and is orchestrated by the youtube_dbt_pipeline DAG.

  • Staging models: Clean, standardized data from the landing schema.
  • Intermediate models: Enriched metrics and joins.
  • Marts: Final tables such as dim_channels, dim_videos, dim_categories, and fct_video_performance in the core schema.
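The youtube_dbt_pipeline DAG presumably shells out to the dbt CLI; a minimal sketch of building that invocation, where the project path mirrors the layout described above and the --select value is an illustrative assumption:

```python
from typing import List, Optional

def dbt_run_args(project_dir: str, select: Optional[str] = None) -> List[str]:
    """Argument list a BashOperator or subprocess could use to run the dbt
    project; profiles configuration is left to the environment."""
    args = ["dbt", "run", "--project-dir", project_dir]
    if select:
        args += ["--select", select]  # e.g. run only the staging layer
    return args
```

In a DAG, this list could be passed to `subprocess.run(...)` or joined into a BashOperator command, with `--select staging`, `--select intermediate`, and `--select marts` steps run in order if layer-by-layer execution is wanted.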
