This project is an end‑to‑end ELT pipeline for YouTube analytics built on AWS.
It uses Amazon MWAA (Airflow) for orchestration, Amazon Redshift Serverless as the warehouse, Amazon S3 for raw data and code, AWS Secrets Manager for credentials, and dbt for transformations.
At a glance:
- YouTube API → MWAA: Airflow calls the YouTube API and lands raw JSON files in S3.
- S3 → Redshift (landing/staging): Airflow loads data from S3 into Redshift using COPY.
- Redshift + dbt: dbt models transform data into clean dimensions and fact tables.
- Secrets Manager: Stores Redshift and API credentials used by Airflow.
Prerequisites:

- An AWS account with permissions to create S3, MWAA, Redshift Serverless, Secrets Manager, and IAM resources.
- The AWS CLI configured locally.
- Python 3.10+ with the project's `requirements.txt` installed.
- A YouTube API key.
Setup:

- Create S3 buckets:
  - One bucket for MWAA code (DAGs, dbt project, requirements, plugins).
  - One bucket for raw data (YouTube JSON files).
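If you prefer to script this step, here is a minimal boto3 sketch. The naming convention in `bucket_names` is illustrative (not part of the project), and the AWS call requires boto3 plus configured credentials:

```python
def bucket_names(prefix: str) -> dict:
    """Derive the two bucket names from a project prefix (naming is illustrative)."""
    return {
        "code": f"{prefix}-mwaa-code",  # DAGs, dbt project, requirements, plugins
        "data": f"{prefix}-raw-data",   # raw YouTube JSON files
    }


def create_buckets(prefix: str, region: str = "us-east-1") -> None:
    import boto3  # requires boto3 and AWS credentials; local so the helper above is import-free

    s3 = boto3.client("s3", region_name=region)
    for name in bucket_names(prefix).values():
        if region == "us-east-1":
            s3.create_bucket(Bucket=name)  # us-east-1 rejects a LocationConstraint
        else:
            s3.create_bucket(
                Bucket=name,
                CreateBucketConfiguration={"LocationConstraint": region},
            )
```

Remember that S3 bucket names are globally unique, so pick a prefix nobody else is using.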
- Upload the project to S3:
  - Sync the `dags/` folder to the MWAA bucket under `dags/`.
  - Upload `requirements.txt` (and `plugins` if you have them) to the MWAA bucket root.
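The sync step is typically `aws s3 sync`, but it can also be sketched in Python; `s3_keys_for` is a hypothetical helper that mirrors the same key layout:

```python
from pathlib import Path


def s3_keys_for(local_dir: str, s3_prefix: str) -> list:
    """Map every file under local_dir to (local_path, s3_key) pairs."""
    root = Path(local_dir)
    return [
        (str(path), f"{s3_prefix}/{path.relative_to(root).as_posix()}")
        for path in sorted(root.rglob("*"))
        if path.is_file()
    ]


def sync_dags(bucket: str, local_dir: str = "dags", s3_prefix: str = "dags") -> None:
    import boto3  # requires boto3 and AWS credentials

    s3 = boto3.client("s3")
    for local_path, key in s3_keys_for(local_dir, s3_prefix):
        s3.upload_file(local_path, bucket, key)
```

Unlike `aws s3 sync`, this sketch re-uploads everything on each run; that is fine for a small DAGs folder.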
- Provision Redshift Serverless:
  - Create a namespace, workgroup, and default database (for example `elt_db`).
  - Ensure it is reachable from the MWAA VPC/subnets.
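Provisioning can be scripted with the `redshift-serverless` boto3 client. Everything below (names, credentials, subnet IDs) is a placeholder, and the calls require boto3 plus AWS credentials:

```python
def namespace_params(namespace: str, db: str, user: str, password: str) -> dict:
    # Key names follow the redshift-serverless CreateNamespace API.
    return {
        "namespaceName": namespace,
        "dbName": db,
        "adminUsername": user,
        "adminUserPassword": password,
    }


def provision(namespace: str, workgroup: str, subnet_ids: list,
              security_group_ids: list, db: str, user: str, password: str) -> None:
    import boto3  # requires boto3 and AWS credentials

    client = boto3.client("redshift-serverless")
    client.create_namespace(**namespace_params(namespace, db, user, password))
    # The workgroup's subnets/security groups must be reachable from the MWAA VPC.
    client.create_workgroup(
        workgroupName=workgroup,
        namespaceName=namespace,
        subnetIds=subnet_ids,
        securityGroupIds=security_group_ids,
    )
```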
- Create secrets in Secrets Manager:
  - A Redshift secret with host, port, database name, username, and password.
  - A YouTube API key secret (or store the key as an Airflow Variable).
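A sketch of creating both secrets with boto3. The secret names and JSON key names here are illustrative — match whatever your DAGs actually read:

```python
import json


def redshift_secret_string(host: str, port: int, dbname: str,
                           username: str, password: str) -> str:
    # JSON key names are illustrative; align them with the consuming DAG code.
    return json.dumps({"host": host, "port": port, "dbname": dbname,
                       "username": username, "password": password})


def create_secrets(redshift_json: str, youtube_api_key: str) -> None:
    import boto3  # requires boto3 and AWS credentials

    sm = boto3.client("secretsmanager")
    sm.create_secret(Name="elt/redshift", SecretString=redshift_json)          # name is illustrative
    sm.create_secret(Name="elt/youtube_api_key", SecretString=youtube_api_key)  # name is illustrative
```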
- Create the MWAA environment:
  - DAGs folder: your `dags/` path in the MWAA S3 bucket.
  - Requirements file: `requirements.txt` in the same bucket.
  - Attach an execution role that can read your S3 buckets, access Secrets Manager, and connect to Redshift.
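Environment creation can likewise be sketched with the MWAA boto3 client. `environment_params` is a hypothetical helper; the role ARN, subnets, and security groups are placeholders you must supply:

```python
def environment_params(name: str, code_bucket: str, execution_role_arn: str,
                       subnet_ids: list, security_group_ids: list) -> dict:
    # Key names follow the MWAA CreateEnvironment API.
    return {
        "Name": name,
        "SourceBucketArn": f"arn:aws:s3:::{code_bucket}",
        "DagS3Path": "dags",                       # the dags/ prefix synced earlier
        "RequirementsS3Path": "requirements.txt",  # at the bucket root
        "ExecutionRoleArn": execution_role_arn,
        "NetworkConfiguration": {
            "SubnetIds": subnet_ids,
            "SecurityGroupIds": security_group_ids,
        },
    }


def create_environment(params: dict) -> None:
    import boto3  # requires boto3 and AWS credentials

    boto3.client("mwaa").create_environment(**params)
```

Environment creation takes a while (often 20–30 minutes), so prefer creating it once and updating in place.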
- Configure Airflow:
  - Connections: `redshift_db_yt_elt` (Postgres type) pointing to Redshift Serverless.
  - Variables: `API_KEY`, `CHANNEL_HANDLE`, `S3_DATA_BUCKET`, and optionally `REDSHIFT_IAM_ROLE`.
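For reference, a Postgres-type Airflow connection can also be expressed as a URI (useful when importing connections rather than clicking through the UI). This sketch builds one for `redshift_db_yt_elt`; the hostname is a placeholder and `connection_uri` is a hypothetical helper:

```python
from urllib.parse import quote

# REDSHIFT_IAM_ROLE is optional, so it is not listed here.
REQUIRED_VARIABLES = ["API_KEY", "CHANNEL_HANDLE", "S3_DATA_BUCKET"]


def connection_uri(host: str, port: int, dbname: str, user: str, password: str) -> str:
    """Postgres-style connection URI; credentials are percent-encoded."""
    return f"postgres://{quote(user)}:{quote(password, safe='')}@{host}:{port}/{dbname}"
```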
- Initialize schemas in Redshift:
  - `CREATE SCHEMA IF NOT EXISTS landing;`
  - `CREATE SCHEMA IF NOT EXISTS staging;`
  - `CREATE SCHEMA IF NOT EXISTS core;`
  - `CREATE SCHEMA IF NOT EXISTS analytics;`
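You can run the statements in the Redshift query editor, or script them. A minimal sketch using the `redshift_connector` driver (which is not necessarily in this project's requirements):

```python
SCHEMAS = ["landing", "staging", "core", "analytics"]


def schema_ddl() -> list:
    return [f"CREATE SCHEMA IF NOT EXISTS {schema};" for schema in SCHEMAS]


def create_schemas(host: str, database: str, user: str, password: str) -> None:
    import redshift_connector  # requires the redshift_connector package and network access

    conn = redshift_connector.connect(host=host, database=database,
                                      user=user, password=password)
    try:
        cursor = conn.cursor()
        for statement in schema_ddl():
            cursor.execute(statement)
        conn.commit()
    finally:
        conn.close()
```

`CREATE SCHEMA IF NOT EXISTS` is idempotent, so it is safe to rerun this against an existing warehouse.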
- Run the pipeline in the MWAA Airflow UI:
  - Trigger `youtube_extract_data_pipeline` to pull data from the YouTube API into S3.
  - Confirm `youtube_stage_load_pipeline` loads the data into Redshift.
  - Trigger `youtube_dbt_pipeline` to run the dbt models and build the marts.
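Outside the UI, MWAA exposes a CLI endpoint that can trigger DAGs. This sketch assumes you want to trigger all three DAGs by hand (per the steps above, the stage-load DAG may already run on its own) and requires `boto3`, `requests`, and AWS credentials:

```python
PIPELINE_DAGS = [
    "youtube_extract_data_pipeline",
    "youtube_stage_load_pipeline",
    "youtube_dbt_pipeline",
]


def trigger_commands() -> list:
    return [f"dags trigger {dag_id}" for dag_id in PIPELINE_DAGS]


def trigger_all(environment_name: str) -> None:
    import boto3     # requires boto3 and AWS credentials
    import requests  # requires the requests package

    # Exchange AWS credentials for a short-lived MWAA CLI token.
    token = boto3.client("mwaa").create_cli_token(Name=environment_name)
    for command in trigger_commands():
        requests.post(
            f"https://{token['WebServerHostname']}/aws_mwaa/cli",
            headers={"Authorization": f"Bearer {token['CliToken']}",
                     "Content-Type": "text/plain"},
            data=command,
        )
```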
The dbt project lives under `dags/youtube_analytics/` and is orchestrated by the `youtube_dbt_pipeline` DAG.
- Staging models: clean, standardized data from the `landing` schema.
- Intermediate models: enriched metrics and joins.
- Marts: final tables such as `dim_channels`, `dim_videos`, `dim_categories`, and `fct_video_performance` in the `core` schema.
