-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy path03_most_optimal_skills.sql
More file actions
108 lines (98 loc) · 6.21 KB
/
03_most_optimal_skills.sql
File metadata and controls
108 lines (98 loc) · 6.21 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
/*
Question: What are the most optimal skills for data engineers—balancing both
demand and salary?
- Create a ranking column that combines demand count and median salary to identify
the most valuable skills.
- Focus only on remote Data Engineer positions with specified annual salaries.
- Why?
- This approach highlights skills that balance market demand and financial
reward. It weights core skills appropriately, rather than letting rare,
outlier skills distort the results.
- The natural log transformation ensures that both high-salary and widely
in-demand skills surface as the most practical and valuable to learn for
data engineering careers.
*/
SELECT
sd.skills,
ROUND(MEDIAN(jpf.salary_year_avg), 0) AS median_salary,
COUNT(jpf.*) AS demand_count,
ROUND(LN(COUNT(jpf.*)), 2) AS ln_demand_count,
ROUND(MEDIAN(jpf.salary_year_avg) * LN(COUNT(jpf.*)) / 1_000_000, 3) as ranking
FROM job_postings_fact AS jpf
INNER JOIN skills_job_dim AS sjd
ON jpf.job_id = sjd.job_id
INNER JOIN skills_dim AS sd
ON sjd.skill_id = sd.skill_id
WHERE
jpf.job_title_short = 'Data Engineer'
AND jpf.job_work_from_home = TRUE
AND jpf.salary_year_avg IS NOT NULL
GROUP BY
sd.skills
ORDER BY
ranking DESC
LIMIT 25;
/*
Here's a breakdown of the most optimal skills for Data Engineers, based on high
demand and strong salaries:
Top Skills by Optimal Score:
- Terraform leads with a $184K median salary and 193 postings, achieving the highest
overall "optimal score."
- Python and SQL dominate demand with over 1,100 postings each and strong median
salaries of $135K and $130K.
- AWS (783 postings, $137K median), Spark (503 postings, $140K median), and Airflow
(386 postings, $150K median) are highly sought-after cloud and big data technologies.
- Kafka offers high compensation ($145K median) with solid demand (292 postings).
- Snowflake, Azure, and Databricks each have 250–475 postings and median salaries
between $128K–$137K.
DevOps & Engineering Tools:
- Airflow ($150K), Kubernetes ($150.5K), and Docker ($135K) stand out for combining
strong demand with top median salaries.
- Git ($140K/208 postings) and GitHub ($135K/127 postings) offer broad utility and
competitive compensation.
Noteworthy Languages:
- Java (303 postings, $135K median) and Scala (247 postings, $137K median) remain
strong choices for well-paid data engineering roles.
- Go ($140K/113 postings) also delivers excellent compensation.
Databases & Cloud:
- Redshift ($130K/274 postings), GCP ($136K/196 postings), Hadoop ($135K/198 postings),
NoSQL ($134.4K/193 postings), and MongoDB ($135.8K/136 postings) round out a
well-balanced data engineering skill set.
- R, PySpark, and BigQuery each deliver competitive salaries and meet the demand
threshold.
Summary: Skills near the top balance market demand (job security) with financial
reward. Python, SQL, AWS, Spark, Airflow, and Terraform are particularly strategic
for immediate opportunities and long-term career growth in data engineering.
┌────────────┬───────────────┬──────────────┬─────────────────┬─────────┐
│ skills │ median_salary │ demand_count │ ln_demand_count │ ranking │
│ varchar │ double │ int64 │ double │ double │
├────────────┼───────────────┼──────────────┼─────────────────┼─────────┤
│ terraform │ 184000.0 │ 193 │ 5.26 │ 0.968 │
│ python │ 135000.0 │ 1133 │ 7.03 │ 0.949 │
│ aws │ 137320.0 │ 783 │ 6.66 │ 0.915 │
│ sql │ 130000.0 │ 1128 │ 7.03 │ 0.914 │
│ airflow │ 150000.0 │ 386 │ 5.96 │ 0.893 │
│ spark │ 140000.0 │ 503 │ 6.22 │ 0.871 │
│ snowflake │ 135500.0 │ 438 │ 6.08 │ 0.824 │
│ kafka │ 145000.0 │ 292 │ 5.68 │ 0.823 │
│ azure │ 128000.0 │ 475 │ 6.16 │ 0.789 │
│ java │ 135000.0 │ 303 │ 5.71 │ 0.771 │
│ scala │ 137290.0 │ 247 │ 5.51 │ 0.756 │
│ kubernetes │ 150500.0 │ 147 │ 4.99 │ 0.751 │
│ git │ 140000.0 │ 208 │ 5.34 │ 0.747 │
│ databricks │ 132750.0 │ 266 │ 5.58 │ 0.741 │
│ redshift │ 130000.0 │ 274 │ 5.61 │ 0.73 │
│ gcp │ 136000.0 │ 196 │ 5.28 │ 0.718 │
│ hadoop │ 135000.0 │ 198 │ 5.29 │ 0.714 │
│ nosql │ 134415.0 │ 193 │ 5.26 │ 0.707 │
│ pyspark │ 140000.0 │ 152 │ 5.02 │ 0.703 │
│ golang │ 184000.0 │ 39 │ 3.66 │ 0.674 │
│ docker │ 135000.0 │ 144 │ 4.97 │ 0.671 │
│ mongodb │ 135750.0 │ 136 │ 4.91 │ 0.667 │
│ go │ 140000.0 │ 113 │ 4.73 │ 0.662 │
│ r │ 134775.0 │ 133 │ 4.89 │ 0.659 │
│ rust │ 210000.0 │ 23 │ 3.14 │ 0.658 │
├────────────┴───────────────┴──────────────┴─────────────────┴─────────┤
│ 25 rows 5 columns │
└───────────────────────────────────────────────────────────────────────┘
*/