This project demonstrates a production-style, end-to-end data engineering pipeline that ingests raw Walmart sales data from AWS S3 into Snowflake and transforms it using dbt into a dimensional warehouse model implementing both SCD Type 1 and SCD Type 2 logic.
The pipeline follows Medallion Architecture principles (Bronze → Silver) and includes full DEV and PROD environment separation to simulate real-world enterprise deployment practices.
A CI/CD workflow is implemented using GitHub version control and dbt Cloud job orchestration:
- All development occurs in a dedicated DEV environment (WALMART_DEV)
- Changes are version-controlled in GitHub
- Code is merged into the main branch
- A production dbt job automatically pulls the latest code
- Models and snapshots are executed in the PROD environment (WALMART_PROD)
This mirrors modern analytics engineering practices by combining:
- Cloud data ingestion (AWS S3)
- Scalable cloud data warehousing (Snowflake)
- Transformation-as-code (dbt)
- Environment isolation (DEV → PROD promotion)
- Continuous Integration and Continuous Deployment (CI/CD)
- Business analytics delivery via Python (Seaborn and Plotly)
The result is a structured, production-ready analytics warehouse capable of supporting historical reporting, dimensional analysis, and executive-level business insights.
The end-to-end pipeline flow:
- CSV files uploaded to AWS S3
- dbt loads raw data into Snowflake Bronze tables
- dbt transforms Bronze → Silver dimensional model
- SCD Type 1 logic applied to dimension tables
- SCD Type 2 snapshot logic applied to fact table
- dbt Production job executes models in WALMART_PROD database
- Python script queries warehouse and generates visualizations
The repository is organized as follows:
```
walmart-sales-analytics-snowflake-dbt/
│
├── README.md
│
├── architecture/
│   └── architecture-diagram.png
│
├── dbt/
│   ├── dbt_project.yml
│   ├── package-lock.yml
│   ├── packages.yml
│   ├── models/
│   ├── snapshots/
│   └── macros/
│
├── snowflake/
│   ├── dev_setup.sql
│   └── prod_setup.sql
│
├── python_analytics/
│   └── generate_visualizations.py
│
├── visualizations/
│
└── screenshots/
```
The project uses three datasets:
- stores.csv – Store information
- department.csv – Department-level sales
- fact.csv – Weekly store metrics (temperature, CPI, fuel price, etc.)
All files were uploaded into an AWS S3 bucket before ingestion.
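For context, the snippet below sketches one common way to expose those files to Snowflake through an external stage and load them into Bronze tables. The bucket path, credentials, stage, table, and column names are all placeholders; the repo may instead handle this step through dbt.
```sql
-- Illustrative only: bucket, stage, table, and column names are placeholders.
CREATE OR REPLACE STAGE WALMART_DEV.BRONZE.WALMART_S3_STAGE
  URL = 's3://<your-bucket>/walmart/'
  CREDENTIALS = (AWS_KEY_ID = '<aws_key_id>' AWS_SECRET_KEY = '<aws_secret_key>')
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

-- Target Bronze table mirroring stores.csv (assumed columns).
CREATE TABLE IF NOT EXISTS WALMART_DEV.BRONZE.STORES_RAW (
  store_id   INTEGER,
  store_type VARCHAR,
  store_size INTEGER
);

-- Load one of the three CSVs from the stage.
COPY INTO WALMART_DEV.BRONZE.STORES_RAW
  FROM @WALMART_DEV.BRONZE.WALMART_S3_STAGE/stores.csv;
```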
The Bronze layer:
- Raw ingestion from S3
- Minimal transformations
- Mirrors the source CSV structure (see the sketch below)
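In dbt, a Bronze model is typically just a thin pass-through over the raw table. A minimal sketch, assuming a source named walmart_raw and the column names shown (both illustrative, not taken from the repo):
```sql
-- models/bronze/bronze_stores.sql (illustrative path, source, and column names)
-- Thin pass-through that mirrors the raw stores.csv structure.
select
    store_id,
    store_type,
    store_size
from {{ source('walmart_raw', 'stores_raw') }}
```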
The Silver layer:
- Dimensional model (example below)
- Two dimension tables
- One fact table
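A Silver dimension then lightly reshapes the Bronze data. A hypothetical dim_store sketch (the size banding is an illustrative enrichment, not necessarily present in the repo):
```sql
-- models/silver/dim_store.sql (illustrative names)
select
    store_id,
    store_type,
    store_size,
    case
        when store_size >= 150000 then 'Large'
        when store_size >= 75000  then 'Medium'
        else 'Small'
    end as size_band
from {{ ref('bronze_stores') }}
```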
SCD Type 1 (dimension tables):
- Implemented using MERGE logic in dbt (sketched below)
- Overwrites historical values
- Maintains only the current state
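In dbt, Type 1 behavior is commonly expressed as an incremental model with the merge strategy: on Snowflake this compiles to a MERGE statement, so matched rows are overwritten in place and no history is retained. A sketch building on the dimension above (model, key, and column names are assumptions):
```sql
-- Hypothetical SCD Type 1 dimension: matched rows are simply overwritten by the MERGE.
{{ config(
    materialized = 'incremental',
    incremental_strategy = 'merge',
    unique_key = 'store_id'
) }}

select
    store_id,
    store_type,
    store_size,
    current_timestamp() as updated_at
from {{ ref('bronze_stores') }}
```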
SCD Type 2 (fact table):
- Implemented using dbt snapshots (sketched below)
- Tracks historical changes
- Maintains version history
- Enables historical reporting
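dbt snapshots implement Type 2 by adding dbt_valid_from / dbt_valid_to columns and closing out a row whenever a tracked value changes, which is what enables the historical reporting above. A sketch assuming a check-strategy snapshot over a fact model named fact_weekly_sales (all names and tracked columns are illustrative):
```sql
-- snapshots/fact_weekly_sales_snapshot.sql (illustrative names and columns)
{% snapshot fact_weekly_sales_snapshot %}

{{ config(
    target_schema = 'snapshots',
    unique_key = "store_id || '-' || week_date",
    strategy = 'check',
    check_cols = ['weekly_sales', 'temperature', 'fuel_price', 'cpi']
) }}

select * from {{ ref('fact_weekly_sales') }}

{% endsnapshot %}
```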
Two isolated environments were implemented:
| Environment | Database |
|---|---|
| DEV | WALMART_DEV |
| PROD | WALMART_PROD |
Schemas and table names remain consistent across environments.
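The dev_setup.sql and prod_setup.sql scripts provision these databases; a minimal sketch of what they might contain (schema names are assumptions):
```sql
-- Sketch of snowflake/dev_setup.sql (schema names are assumptions)
CREATE DATABASE IF NOT EXISTS WALMART_DEV;
CREATE SCHEMA IF NOT EXISTS WALMART_DEV.BRONZE;
CREATE SCHEMA IF NOT EXISTS WALMART_DEV.SILVER;

-- snowflake/prod_setup.sql would mirror the same objects under WALMART_PROD
CREATE DATABASE IF NOT EXISTS WALMART_PROD;
CREATE SCHEMA IF NOT EXISTS WALMART_PROD.BRONZE;
CREATE SCHEMA IF NOT EXISTS WALMART_PROD.SILVER;
```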
After transformation, the warehouse powers multiple business insights generated using Python (Seaborn and Plotly).
- Weekly sales by store and holiday
- Weekly sales by temperature and year
- Weekly sales by store size
- Weekly sales by store type and month
- Markdown sales by year and store
- Weekly sales by store type
- Fuel price by year
- Weekly sales by year
- Weekly sales by month
- Weekly sales by date
- Weekly sales by CPI
- Weekly sales by department
All visualization outputs are stored in the visualizations/ directory.
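Each chart is driven by a straightforward aggregate over the Silver layer. A hypothetical example for the "weekly sales by store type and month" view (object and column names are illustrative, not the repo's actual model names):
```sql
-- Hypothetical query behind one visualization; object and column names are illustrative.
select
    d.store_type,
    month(f.week_date)  as sales_month,
    sum(f.weekly_sales) as total_weekly_sales
from WALMART_PROD.SILVER.FACT_WEEKLY_SALES f
join WALMART_PROD.SILVER.DIM_STORE d
  on f.store_id = d.store_id
group by 1, 2
order by 1, 2;
```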
Technologies and concepts used:
- AWS S3
- Snowflake
- dbt
- Python
- Seaborn
- Plotly
- Dimensional Modeling
- Medallion Architecture
- SCD Type 1
- SCD Type 2
Key practices demonstrated:
- Cloud ingestion workflow
- Medallion architecture implementation
- Dimensional modeling best practices
- Snapshot-based historical tracking
- Environment isolation (DEV vs PROD)
- Production job orchestration
- Analytical data product delivery
This project demonstrates how raw retail sales data can be transformed into a structured analytics warehouse capable of supporting:
- Historical trend analysis
- Store performance comparison
- Seasonality insights
- External factor impact analysis (CPI, fuel price, temperature)
- Executive-level reporting
Planned future enhancements:
- Implement GitHub Actions for automated dbt test execution on pull requests
- Add automated data quality gates prior to production deployment
- Provision Snowflake infrastructure using Terraform
- Implement branch-based deployment strategies
- Add BI dashboard layer (Power BI / Tableau / Streamlit)
Johnathon Smith
Data Engineer focused on building scalable cloud data platforms using AWS, Snowflake, and dbt.