Exploratory_Data_Analysis_SQL/03_most_optimal_skills.sql at main · teevirta/Exploratory_Data_Analysis_SQL · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
/*
Question: What are the most optimal skills for data engineers—balancing both
demand and salary?

- Create a ranking column that combines demand count and median salary to identify
the most valuable skills.
- Focus only on remote Data Engineer positions with specified annual salaries.
- Why?
    - This approach highlights skills that balance market demand and financial
    reward. It weights core skills appropriately, rather than letting rare,
    outlier skills distort the results.
    - The natural log transformation ensures that both high-salary and widely
    in-demand skills surface as the most practical and valuable to learn for
    data engineering careers.
*/

SELECT
    sd.skills,
    ROUND(MEDIAN(jpf.salary_year_avg), 0) AS median_salary,
    COUNT(jpf.*) AS demand_count,
    ROUND(LN(COUNT(jpf.*)), 2) AS ln_demand_count,
    ROUND(MEDIAN(jpf.salary_year_avg) * LN(COUNT(jpf.*)) / 1_000_000, 3) as ranking
FROM job_postings_fact AS jpf
    INNER JOIN skills_job_dim AS sjd
        ON jpf.job_id = sjd.job_id
    INNER JOIN skills_dim AS sd
        ON sjd.skill_id = sd.skill_id
WHERE
    jpf.job_title_short = 'Data Engineer'
    AND jpf.job_work_from_home = TRUE
    AND jpf.salary_year_avg IS NOT NULL
GROUP BY
    sd.skills
ORDER BY
    ranking DESC
LIMIT 25;


/*
Here's a breakdown of the most optimal skills for Data Engineers, based on high
demand and strong salaries:

Top Skills by Optimal Score:
- Terraform leads with a $184K median salary and 193 postings, achieving the highest
overall "optimal score."
- Python and SQL dominate demand with over 1,100 postings each and strong median
salaries of $135K and $130K.
- AWS (783 postings, $137K median), Spark (503 postings, $140K median), and Airflow
(386 postings, $150K median) are highly sought-after cloud and big data technologies.
- Kafka offers high compensation ($145K median) with solid demand (292 postings).
- Snowflake, Azure, and Databricks each have 250–475 postings and median salaries
between $128K–$137K.

DevOps & Engineering Tools:
- Airflow ($150K), Kubernetes ($150.5K), and Docker ($135K) stand out for combining
strong demand with top median salaries.
- Git ($140K/208 postings) and GitHub ($135K/127 postings) offer broad utility and
competitive compensation.

Noteworthy Languages:
- Java (303 postings, $135K median) and Scala (247 postings, $137K median) remain
strong choices for well-paid data engineering roles.
- Go ($140K/113 postings) also delivers excellent compensation.

Databases & Cloud:
- Redshift ($130K/274 postings), GCP ($136K/196 postings), Hadoop ($135K/198 postings),
NoSQL ($134.4K/193 postings), and MongoDB ($135.8K/136 postings) round out a
well-balanced data engineering skill set.
- R, PySpark, and BigQuery each deliver competitive salaries and meet the demand
threshold.

Summary: Skills near the top balance market demand (job security) with financial
reward. Python, SQL, AWS, Spark, Airflow, and Terraform are particularly strategic
for immediate opportunities and long-term career growth in data engineering.

┌────────────┬───────────────┬──────────────┬─────────────────┬─────────┐
│   skills   │ median_salary │ demand_count │ ln_demand_count │ ranking │
│  varchar   │    double     │    int64     │     double      │ double  │
├────────────┼───────────────┼──────────────┼─────────────────┼─────────┤
│ terraform  │      184000.0 │          193 │            5.26 │   0.968 │
│ python     │      135000.0 │         1133 │            7.03 │   0.949 │
│ aws        │      137320.0 │          783 │            6.66 │   0.915 │
│ sql        │      130000.0 │         1128 │            7.03 │   0.914 │
│ airflow    │      150000.0 │          386 │            5.96 │   0.893 │
│ spark      │      140000.0 │          503 │            6.22 │   0.871 │
│ snowflake  │      135500.0 │          438 │            6.08 │   0.824 │
│ kafka      │      145000.0 │          292 │            5.68 │   0.823 │
│ azure      │      128000.0 │          475 │            6.16 │   0.789 │
│ java       │      135000.0 │          303 │            5.71 │   0.771 │
│ scala      │      137290.0 │          247 │            5.51 │   0.756 │
│ kubernetes │      150500.0 │          147 │            4.99 │   0.751 │
│ git        │      140000.0 │          208 │            5.34 │   0.747 │
│ databricks │      132750.0 │          266 │            5.58 │   0.741 │
│ redshift   │      130000.0 │          274 │            5.61 │    0.73 │
│ gcp        │      136000.0 │          196 │            5.28 │   0.718 │
│ hadoop     │      135000.0 │          198 │            5.29 │   0.714 │
│ nosql      │      134415.0 │          193 │            5.26 │   0.707 │
│ pyspark    │      140000.0 │          152 │            5.02 │   0.703 │
│ golang     │      184000.0 │           39 │            3.66 │   0.674 │
│ docker     │      135000.0 │          144 │            4.97 │   0.671 │
│ mongodb    │      135750.0 │          136 │            4.91 │   0.667 │
│ go         │      140000.0 │          113 │            4.73 │   0.662 │
│ r          │      134775.0 │          133 │            4.89 │   0.659 │
│ rust       │      210000.0 │           23 │            3.14 │   0.658 │
├────────────┴───────────────┴──────────────┴─────────────────┴─────────┤
│ 25 rows                                                     5 columns │
└───────────────────────────────────────────────────────────────────────┘
*/