
Commit cee48ac

feat/spark-4.x-hist: Add Spark History Server (#173)
* Add Spark History Server to the Spark Standalone cluster
* Configure eventLog for Spark Master and Workers
* Set SPARK_PUBLIC_DNS=localhost so Web UI worker links are accessible from the host
* Pin PySpark to 4.0.1 to prevent version mismatches between Driver and Workers
* Update README with spark-submit instructions
1 parent 0668bb8 commit cee48ac

5 files changed

Lines changed: 55 additions & 5 deletions

module5-batch-processing/compose.spark-4.0-standalone.yaml

Lines changed: 24 additions & 3 deletions
@@ -7,12 +7,15 @@ x-spark-common:
   image: *spark-image
   environment:
     &spark-common-env
-    SPARK_NO_DAEMONIZE: true # Forces the process to run in foreground (req. for Docker)
+    SPARK_NO_DAEMONIZE: true # Forces the process to run in foreground (req. for Docker)
+    SPARK_PUBLIC_DNS: localhost # Ensures Web UI links point to localhost instead of container IPs
+    GOOGLE_APPLICATION_CREDENTIALS: "/secrets/gcp_credentials.json"
   volumes:
     &spark-common-vol
-    - vol-spark-extra-jars:/opt/spark/extra-jars/
+    - ./logs/:/opt/spark/logs/
     - ./spark-4.0-standalone.conf:/opt/spark/conf/spark-standalone.conf
     - ~/.gcp/spark_credentials.json:/secrets/gcp_credentials.json
+    - vol-spark-extra-jars:/opt/spark/extra-jars/
   depends_on:
     &spark-common-depends-on
     spark-init:
@@ -77,7 +80,24 @@ services:
     depends_on:
       spark-master:
         condition: service_started
-    restart: on-failure:3
+    restart: on-failure:5
+
+  spark-history-server:
+    <<: *spark-common
+    container_name: spark-history-server
+    command: |
+      /opt/spark/sbin/start-history-server.sh
+      --properties-file /opt/spark/conf/spark-standalone.conf
+    environment:
+      <<: *spark-common-env
+      SPARK_HISTORY_OPTS: >-
+        -Dspark.history.fs.logDirectory=/opt/spark/logs/
+    ports:
+      - '18080:18080'
+    depends_on:
+      spark-master:
+        condition: service_started
+    restart: on-failure:5
 
   hive-db:
     image: *postgres-image
@@ -130,6 +150,7 @@ services:
       - |
         apt-get update && apt-get install curl -y
         curl --create-dirs -O --output-dir /opt/spark/extra-jars/ https://repo1.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/4.0.2/gcs-connector-4.0.2-shaded.jar
+        chown -R 185:185 /opt/spark/extra-jars/
     volumes:
       - vol-spark-extra-jars:/opt/spark/extra-jars/
 
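A quick way to exercise the new service is a short smoke test. The sketch below is illustrative, assuming the commands run from `module5-batch-processing/` and that `curl` is available on the host:

```shell
# Bring up the standalone cluster together with the new History Server
docker compose -f compose.spark-4.0-standalone.yaml up -d

# Once it is up, the History Server's REST API lists completed applications
curl -s http://localhost:18080/api/v1/applications
```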

module5-batch-processing/pyspark-4.x/README.md

Lines changed: 25 additions & 0 deletions
@@ -37,6 +37,31 @@ pre-commit install
 docker compose -f ../compose.yaml up -d
 ```
 
+**5.** Spark Web UI
+- Spark Master Web UI can be accessed at [http://localhost:4040](http://localhost:4040)
+- Spark History Server can be accessed at [http://localhost:18080](http://localhost:18080)
+
+
+## Spark-submit Application
+
+### Local (Spark Driver running on local machine)
+
+With `--deploy-mode client` (default), the Spark Driver runs locally and doesn't pick up [spark-4.0-standalone.conf](../compose.spark-4.0-standalone.yaml), so the `--conf spark.hadoop.*` options must be set explicitly.
+
+```shell
+spark-submit \
+  --master spark://localhost:7077 \
+  --jars https://repo1.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/4.0.2/gcs-connector-4.0.2-shaded.jar \
+  --conf spark.eventLog.enabled=true \
+  --conf spark.eventLog.dir=file://$(pwd)/../logs/ \
+  --conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
+  --conf spark.hadoop.fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS \
+  --conf spark.hadoop.google.cloud.auth.type=APPLICATION_DEFAULT \
+  fhv_zones_gcs.py
+```
+
+> **Note:** `APPLICATION_DEFAULT` is recommended here. `spark.hadoop.*` confs set via `spark-submit` propagate to both the driver and the executors. With `SERVICE_ACCOUNT_JSON_KEYFILE`, the keyfile path must be valid on **both** the local machine (driver) and inside the Docker containers (executors). Since the executors already have their own SA keyfile configured via [spark-4.0-standalone.conf](../spark-4.0-standalone.conf), using `APPLICATION_DEFAULT` lets the driver authenticate with local ADC (`gcloud auth application-default login`) while the executors fall back to their cluster-side SA config.
+
 
 ## Compatibility Matrix for GCS
 
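The note above implies a two-step flow on the host. A minimal sketch, assuming the gcloud CLI is installed and the job is launched from `module5-batch-processing/pyspark-4.x/`:

```shell
# Driver-side auth: create local Application Default Credentials (one-time)
gcloud auth application-default login

# After a run, the event log written to ../logs/ should appear
# in the History Server UI at http://localhost:18080
ls ../logs/
```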

module5-batch-processing/pyspark-4.x/pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ readme = "README.md"
 requires-python = ">=3.12,<3.14"
 
 dependencies = [
-    "pyspark[connect]>=4.0.1,<4.1",
+    "pyspark[connect]==4.0.1",
     "pyarrow>=23.0.0,<24.0",
 ]
 
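One way to confirm the pin lines up with the 4.0.1 cluster image is a version probe. This is a sketch, assuming `uv` manages the environment as elsewhere in this module:

```shell
# The client-side PySpark version must match the Workers exactly
uv run python -c "import pyspark; print(pyspark.__version__)"  # expect 4.0.1
```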

module5-batch-processing/pyspark-4.x/uv.lock

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default.

module5-batch-processing/spark-4.0-standalone.conf

Lines changed: 4 additions & 0 deletions
@@ -14,6 +14,10 @@ spark.worker.cleanup.interval=600
 spark.shuffle.service.db.enabled=true
 spark.shuffle.service.db.backend=ROCKSDB
 
+# Event Log (History Server)
+spark.eventLog.enabled=true
+spark.eventLog.dir=/opt/spark/logs/
+
 # Classpath
 spark.driver.extraClassPath=/opt/spark/extra-jars/*
 spark.executor.extraClassPath=/opt/spark/extra-jars/*
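Since `./logs/` on the host is bind-mounted to `/opt/spark/logs/` in every container (see the compose diff above), the same event logs should be visible from both sides. A sketch, assuming the stack is running:

```shell
# Host view of the event log directory
ls logs/

# Container view of the same directory (container name from the compose diff)
docker exec spark-history-server ls /opt/spark/logs/
```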
