Skip to content

teevirta/Exploratory_Data_Analysis_SQL

Repository files navigation

Exploratory Data Analysis with SQL

A SQL project analyzing the data engineer job market using real world job posting data. It demonstrates my ability to write production-quality analytical SQL, design efficient queries, and turn business questions into data-driven insights.

EDA Project Overview

Summary

  • Project scope: Built 3 analytical queries that answer key questions about the data engineer job market
  • Data modeling: Used multi-table joins across fact and dimension tables to extract insights
  • Analytics: Applied aggregations, filtering, and sorting to find top skills by demand, salary, and overall value
  • Outcomes: Delivered actionable insights on SQL/Python dominance, cloud trends, and salary patterns

Problem & Context

Job market analysts need to answer questions like:

  • Most in-demand: Which skills are most in-demand for data engineers?
  • Highest paid: Which skills command the highest salaries?
  • Best trade-off: What is the optimal skill set balancing demand and compensation?

This project analyzes a data warehouse built using a star schema design. The warehouse structure consists of:

Data Warehouse Schema

  • Fact Table: job_postings_fact - Central table containing job posting details (job titles, locations, salaries, dates, etc.)
  • Dimension Tables:
    • company_dim - Company information linked to job postings
    • skills_dim - Skills catalog with skill names and types
  • Bridge Table: skills_job_dim - Resolves the many-to-many relationship between job postings and skills

By querying across these interconnected tables, I extracted insights about skill demand, salary patterns, and optimal skill combinations for data engineering roles.

Tech Stack

  • Query Engine: DuckDB
  • Language: SQL
  • Data Model: Star schema with fact + dimension + bridge tables
  • Development: VS Code for SQL editing + Terminal for DuckDB CLI
  • Version Control: Git/GitHub for versioned SQL scripts

Analysis Overview

Query Structure

  1. Top Demanded Skills – Identifies the 10 most in-demand skills for remote data engineer positions
  2. Top Paying Skills – Analyzes the 25 highest-paying skills with salary and demand metrics
  3. Optimal Skills – Calculates an optimal score using natural log of demand combined with median salary to identify the most valuable skills to learn

Key Insights

  • Core languages: SQL and Python each appear in ~29,000 job postings, making them the most demanded skills
  • Cloud platforms: AWS and Azure are critical for modern data engineering roles-
  • Infra & tooling: Kubernetes, Docker, and Terraform are associated with premium salaries
  • Big data tools: Apache Spark shows strong demand with competitive compensation

About

SQL exploratory analysis of the data engineer job market — identifying top skills by demand, salary, and optimal value using DuckDB and a star schema data warehouse.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors