Fix Python issues and add reverse proxy, metrics, Jupyter, and experimental GPU features#9
Merged
Conversation
Bugs: - CLI: --file-level was ignored; file sink used the console level. Wire --file-level through configure/start and fix its help text. - Configure logging in the stop command (and the now-enabled clean command). - hive: pass check=True to schematool so metastore init failures surface. - cluster_manager: raise InvalidConfiguration when computed worker memory is <= 0 instead of passing a negative value to Spark. - Fix managed_cluster docstring (managed_start -> managed_cluster) and the _read_spark_defaults whitespace-only-line filter. Performance: - Start/stop Spark workers over ssh concurrently with a thread pool instead of serially, which previously serialized multi-node cluster startup. - Cache the Slurm node-name lookup so a single configure/stop no longer shells out to squeue/scontrol multiple times. Usability: - Implement ClusterManager.clean() and enable the `sparkctl clean` command. - Make the Spark Connect port configurable (connect_server_port), with a CLI default that falls back to the model default for older settings files. - stop(): when the status file is missing, infer running processes from the configuration rather than assuming every server is up. - default-config: warn when writing to a directory sparkctl cannot auto-discover. Security: - Pass the Postgres password to the setup scripts via the environment instead of the command line so it no longer appears in process listings. - hive: extract the Hive tarball with filter="data". Cleanup: - Use click.Path(path_type=Path) instead of lambda callbacks for path options. - Remove dead code in slurm_compute.get_node_names, the duplicate import in config.py, and honor the node_memory_overhead_gb argument consistently. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Replace mypy with ty and pre-commit with prek, and standardize on uv for environment management and CI, matching the datasight conventions. - pyproject: drop mypy/pre-commit dev deps; add prek, ty, and an explicit ruff pin. Replace [tool.mypy] with [tool.ty.src] (include src/tests, exclude vendored tests/data). - .pre-commit-config.yaml: bump ruff-pre-commit and run ty via a local hook (uv run ty check) instead of mypy. - CI: run lint (ruff check, ruff format --check) and ty through uv; install with uv sync --frozen. Convert the docs and PyPI publish workflows to uv. - config.py: annotate the RUNTIME/APP settings dicts as dict[str, Any] so the **dict model construction type-checks under ty, and drop the now-unused dynaconf type: ignore. - README: document the uv/ruff/ty/prek development workflow. - Commit uv.lock so uv sync --frozen works in CI. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Present pip and uv side by side with sphinx-tabs for creating the virtual environment and installing sparkctl, and add a tip covering `uv tool install sparkctl` for CLI-only usage. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Installation: - Fix the environment-activation command in every tutorial and how-to: `source ~/python-envs/sparkctl` -> `source ~/python-envs/sparkctl/bin/activate`. - Correct the `~/.sparkctl.toml` keys in the explanation page: `spark_path`/ `java_path` (not `spark_home`/`java_home`) and `start_connect_server` under a `[runtime]` section (not `spark_connect_server`). - Finish the truncated "Python script" tutorial: close the code block and add the step to run the script. Correctness: - Hive warehouse directory is `spark-warehouse`, not `spark_warehouse`. - `--spark-log-level` only accepts debug/info/warn/error; note that trace/ fatal/off require editing log4j2.properties. Refresh the stale Spark version in the example output. - Fix the out-of-order list in the "Why is my job slow?" FAQ entry and the `spark-default.conf` -> `spark-defaults.conf` filename typo. Debuggability: - Executor logs live under `spark_scratch/workers/` (plural); fix the path and the `tail` command in the debugging page. - Replace the macOS-only `stat -f` invocation with GNU `stat -c`, which is what runs on the Linux compute nodes. - Fix the Spark Web UI SSH tunnel in the FAQ to target the compute node (`$COMPUTE_NODE`) instead of `$(hostname)`, and fix the `--constraint` typo. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Deploy four additional Spark features through sparkctl: - Web UI reverse proxy (--reverse-proxy): run the master as a reverse proxy for worker/application UIs, so the UIs are reachable through the master node on HPC clusters where compute nodes are not. - Prometheus metrics (--prometheus): register Spark's PrometheusServlet via metrics.properties on the existing web UI ports. - JupyterLab service (--jupyter): start/stop a notebook server on the master node, pre-wired to the Connect server via SPARK_REMOTE. - GPU scheduling and NVIDIA RAPIDS (--gpus/--rapids), marked EXPERIMENTAL and untested: GPU detection in the compute interface, a generated GPU discovery script, and RAPIDS plugin wiring via spark.jars. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add --metrics-csv (and --metrics-csv-period) to write Spark metrics to CSV files in <base>/metrics-csv, leaving a durable on-disk record after an ephemeral cluster shuts down (the Prometheus sink only exposes metrics over HTTP). Refactor the Prometheus setup into _configure_metrics so the Prometheus and CSV sinks compose into a single metrics.properties when both are enabled. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #9 +/- ##
==========================================
+ Coverage 67.32% 70.98% +3.65%
==========================================
Files 20 20
Lines 1561 2071 +510
==========================================
+ Hits 1051 1470 +419
- Misses 510 601 +91 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This PR expands sparkctl’s configurable Spark runtime features (reverse-proxied UIs, Prometheus/CSV metrics sinks, JupyterLab service, experimental GPU/RAPIDS wiring) while also improving cluster process handling and updating Python tooling/workflows (uv/ty/prek).
Changes:
- Add new runtime configuration knobs and generation logic for reverse proxy, metrics sinks, JupyterLab, and experimental GPU/RAPIDS support.
- Improve robustness/usability of cluster lifecycle operations (concurrent worker ssh start/stop, rollback refactor,
cleanimplementation, connect server port configurability). - Migrate developer/CI workflows toward
uv+ruff+tyand update docs accordingly.
Reviewed changes
Copilot reviewed 41 out of 42 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_sparkctl_cli.py | Adds CLI coverage for new feature flags persisted into config. |
| tests/test_spark_process_runner.py | Makes worker-start test resilient to concurrent execution ordering. |
| tests/test_cluster_manager.py | Adds unit coverage for reverse proxy, metrics sinks, GPU/RAPIDS config generation and validation. |
| src/sparkctl/spark_process_runner.py | Adds Jupyter start/stop and concurrent ssh worker operations; connect server port wiring. |
| src/sparkctl/slurm_compute.py | Caches Slurm node list and adds Slurm GPU detection helpers. |
| src/sparkctl/postgres/start_container.sh | Switches Postgres password handling to environment variable. |
| src/sparkctl/postgres/setup_metastore.sh | Switches Postgres password handling to environment variable. |
| src/sparkctl/native_compute.py | Adds native GPU detection via nvidia-smi; fixes memory overhead calc. |
| src/sparkctl/models.py | Introduces new runtime fields (reverse proxy, metrics, Jupyter, GPUs/RAPIDS) and new config file helpers. |
| src/sparkctl/hive.py | Passes Postgres password via environment; uses safer tar extraction filtering; ensures schematool failures are surfaced. |
| src/sparkctl/fake_compute.py | Implements GPU count method for test compute environment; fixes memory overhead calc. |
| src/sparkctl/config.py | Updates settings-file discovery and runtime defaults typing to support new runtime fields cleanly. |
| src/sparkctl/compute_interface.py | Adds abstract GPU-count method to compute interface. |
| src/sparkctl/cluster_manager.py | Implements clean, adds rollback helper, metrics/reverse-proxy/GPU/RAPIDS config emission, connect server port support. |
| src/sparkctl/cli/sparkctl.py | Adds CLI options for new features and ports; improves settings discovery messaging; enables clean subcommand. |
| README.md | Updates dev workflow instructions for uv/ruff/ty/prek. |
| pyproject.toml | Updates dev dependencies and switches type-check configuration from mypy to ty. |
| docs/tutorials/run_spark_jobs.md | Fixes venv activation instructions. |
| docs/tutorials/run_python_spark_jobs_spark_connect.md | Fixes venv activation instructions. |
| docs/tutorials/run_python_spark_jobs_script.md | Fixes venv activation instructions; clarifies script naming and adds run step. |
| docs/tutorials/run_python_spark_jobs_interactively.md | Fixes venv activation instructions. |
| docs/tutorials/run_ibis_spark_jobs.md | Fixes venv activation instructions. |
| docs/how_tos/getting_started/installation.md | Adds uv-based installation paths and improves install docs structure. |
| docs/how_tos/execution/start_a_cluster.md | Fixes venv activation instructions. |
| docs/how_tos/execution/prometheus_metrics.md | New how-to for Prometheus + CSV metrics sinks. |
| docs/how_tos/execution/index.md | Adds Prometheus metrics doc to the execution how-to index. |
| docs/how_tos/debugging/index.md | Updates log paths and Linux stat invocation in examples. |
| docs/how_tos/configuration/web_ui_reverse_proxy.md | New how-to for Spark UI reverse proxy configuration. |
| docs/how_tos/configuration/spark_log_level.md | Clarifies accepted log levels; updates example Spark version. |
| docs/how_tos/configuration/index.md | Adds reverse proxy + GPU docs to configuration index. |
| docs/how_tos/configuration/gpus.md | New how-to for experimental GPU scheduling and RAPIDS. |
| docs/how_tos/configuration/custom_spark_defaults.md | Fixes spark-defaults.conf naming in docs. |
| docs/how_tos/applications/jupyter.md | New how-to for running JupyterLab against the cluster. |
| docs/how_tos/applications/index.md | Adds Jupyter doc to applications index. |
| docs/how_tos/applications/hive_metastore.md | Updates warehouse directory naming in docs. |
| docs/faq.md | Fixes Slurm flag typo; improves SSH tunnel guidance; adds performance troubleshooting note. |
| docs/explanation/index.md | Aligns settings key names and connect-server config docs with code. |
| .pre-commit-config.yaml | Updates hooks to ruff + ty (via uv) and removes mypy hook. |
| .github/workflows/publish_to_pypi.yml | Switches build step to uv. |
| .github/workflows/gh-pages.yml | Switches doc build/install steps to uv. |
| .github/workflows/ci.yml | Reworks CI to use uv; adds dedicated lint job and updates pytest job. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Replace the hardcoded `jupyter lab` invocation with a configurable jupyter_command (--jupyter-command), defaulting to the classic 'notebook' frontend, which is lighter to install and simpler for a single-user cluster. JupyterLab is still available with --jupyter-command lab. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Jupyter is only needed for `sparkctl configure --jupyter` and is gated at runtime, so add it as an optional extra (sparkctl[jupyter] -> notebook) rather than a core dependency. This keeps the default install slim. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Bind the Jupyter server to 127.0.0.1 by default (configurable via --jupyter-ip); document an SSH ProxyJump tunnel and the 0.0.0.0 escape hatch. Avoids exposing the notebook server on all interfaces. - Cap the ssh worker start/stop thread pool at 32 so large allocations do not spawn thousands of threads. - Validate SPARKCTL_PG_PASSWORD is set and non-empty in the Postgres start/setup scripts instead of silently using an empty password. - Lock the notebook dependency added by the jupyter extra in uv.lock. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Per review discussion, keep clean() deleting the configured scratch directory (which may intentionally live outside the base directory, e.g. a dedicated scratch filesystem) and instead document that spark_scratch should be a dedicated directory. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Running clean on an active cluster deletes the state needed to stop it (config.json, status.json, srun_workers.pid, jupyter.pid, the conf and scratch dirs), orphaning the Spark/Jupyter processes. clean() now checks status.json and raises if any tracked process is recorded as running, directing the user to run `sparkctl stop` first. A --force flag overrides the guard, and the CLI reports the error cleanly instead of a traceback. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- hive.py: guard the tar extraction `filter="data"` argument (PEP 706) behind a `tarfile.data_filter` capability check so enabling the Postgres Hive metastore does not TypeError on Python 3.11 patch releases before 3.11.4, which this project still supports. - docs/debugging: replace the GNU-only `find ... -exec stat -c` with a portable `ls -lt | head` that also works on macOS/BSD. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The Jupyter server was running but unreachable: the 127.0.0.1 default (from the earlier security fix) does not work with the standard HPC pattern of tunneling to the compute node's hostname through a login node. - Default jupyter_ip back to 0.0.0.0 (keep --jupyter-ip 127.0.0.1 as an opt-in for the bind-to-localhost workflow). Access stays protected by Jupyter's token. - After startup, detect the access URL from the log and print a clean, copy-pasteable SSH tunnel command and token URL, so connecting does not depend on reading past the log noise. - Pass --LanguageServerManager.autodetect=False to stop jupyter_lsp from flooding the log with tracebacks for uninstalled language servers. - Document that the residual Node/yarn error comes from a jupyterlab package in the environment and is harmless (the jupyter extra installs only the classic notebook). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Add "Working in the notebook" (kernel environment, where notebooks are saved) and a "Troubleshooting" section to the Jupyter how-to, and link to the Spark Connect tutorial. - Surface Jupyter in the tutorials decision guide so users looking for an interactive/notebook workflow can find it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Fix wrong config key in gpus.md (spark.task.resource.gpu.amount). - GPU discovery script now prefers CUDA_VISIBLE_DEVICES (correct on Slurm and works when nvidia-smi is unavailable in executors), falls back to nvidia-smi, and fails fast if neither is available. - Document what applications must do for RAPIDS vs. custom GPU code, and when GPUs are advantageous over CPUs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Previously, configuring an executor with more cores than the worker node provides (e.g. a GPU allocation that grants GPUs but only one CPU) produced a config that Spark could never schedule: jobs hung forever with "Initial job has not accepted any resources". Mirror the existing executor-memory check and raise InvalidConfiguration at configure time with guidance to request more CPUs or lower executor_cores. Also document this and the related "worker advertises no GPUs after enabling --gpus without a restart" case in the debugging guide. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
sparkctl's resource monitor does not capture GPU utilization, so add a section to the GPU how-to covering nvidia-smi / dmon / nvtop, attaching to the worker nodes in the allocation, CSV logging, multi-node monitoring, and how to read utilization to confirm work is on the GPU. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
GPU-aware scheduling and RAPIDS acceleration have now been validated on a real GPU cluster, so remove the "EXPERIMENTAL (untested)" prefixes from the config field descriptions (which feed the CLI help) and the configure log messages, the warning admonition and "(experimental)" title in the GPU how-to, and the matching caveat in the qualification tool tip. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The RAPIDS Accelerator supports Apache Spark up to 4.1.1 (with Scala 2.13); 4.1.2 is not a supported shim target. Pin pyspark/pyspark-client to 4.1.1, regenerate the lockfile, update the test download URLs, and update the deployment/tutorial/reference docs and CLI examples to 4.1.1. Also make the .gitignore Spark pattern version-agnostic (spark-*-bin-hadoop3*) so a version bump no longer un-ignores the previously extracted test-data directory. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The executor count was derived purely from CPU cores and memory, so a GPU node whose core count did not divide evenly by executor_cores could leave GPUs idle (e.g. 4 GPUs but only 3 executors fit -> 1 GPU unused). Make executor_cores default to None (auto). When GPUs are enabled and the user has not set it explicitly, target NVIDIA's recommended layout of one executor per GPU by dividing the node's usable cores evenly among the GPUs. Also cap the executor count at the GPU count so executors are not left unschedulable for lack of a GPU, and warn when CPUs/memory allow fewer executors than there are GPUs (some GPUs would sit idle). An explicit executor_cores still overrides the auto sizing. Document the behavior in the GPU how-to and update the debugging guide. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
sparkctl writes a single spark.worker.resource.gpu.amount for every worker, equal to the GPU count detected where configure runs. When GPUs are requested as a job-wide total (Slurm --gpus) instead of per node (--gpus-per-node), Slurm can split them unevenly, so a worker with fewer GPUs than declared silently fails to start and the job hangs with "Initial job has not accepted any resources". Add ComputeInterface.check_gpu_allocation() (no-op by default) and a SlurmCompute implementation that raises InvalidConfiguration at configure time when a multi-node job's per-node GPU count does not divide the job-wide total evenly. A per-node request (--gpus-per-node) short -circuits the check since it guarantees uniformity. Called from _enable_gpus, with a note in the GPU how-to. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The startup message listed SPARK_CONF_DIR and JAVA_HOME but not SPARK_HOME. Since sparkctl depends on pyspark-client (which bundles no Spark distribution), an interactive `pyspark`/`spark-submit` fails with "Could not find valid SPARK_HOME" unless SPARK_HOME points at the Spark distribution. Add SPARK_HOME (from binaries.spark_path) to the message plus a ready-to-paste example that adds Spark to PATH and connects a client to the master URL. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
start() already exports SPARK_CONF_DIR and JAVA_HOME into the current process so in-process Spark usage works; add SPARK_HOME (from binaries.spark_path) for consistency, so a caller that shells out to spark-submit/pyspark or builds a classic session in the same process finds the Spark distribution. managed_cluster clears it on exit like the others. The yielded managed_cluster session uses Spark Connect, which does not need it, so this is purely additive. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drop the PATH export from the start message: sparkctl installs pyspark-client, so the pyspark launcher is already on PATH and only needs SPARK_HOME (shown above); prepending $SPARK_HOME/bin could shadow it with the distribution's own launcher. Also lower the "Enabled GPU scheduling" and "Enabled RAPIDS GPU acceleration" messages from warning to info now that GPU/RAPIDS support is tested and no longer experimental. The idle-GPU message stays a warning since it flags a real misconfiguration. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
github-actions Bot
pushed a commit
that referenced
this pull request
Jun 14, 2026
Fix Python issues and add reverse proxy, metrics, Jupyter, and experimental GPU features
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This branch combines a round of correctness/usability fixes and a tooling migration with four new Spark features that can be deployed through sparkctl.
New Spark features
--reverse-proxy,--reverse-proxy-url): run the master as a reverse proxy for the worker/application UIs so they are reachable through the master node on HPC clusters where compute nodes are not.--prometheus,--metrics-csv): expose Spark's internal metrics via thePrometheusServleton the existing web UI ports, and/or write them to CSV files in<base>/metrics-csvfor a durable record after an ephemeral cluster shuts down. Both compose into a single auto-loadedmetrics.properties.--jupyter,--jupyter-port): start/stop a notebook server on the master node, pre-wired to the Spark Connect server viaSPARK_REMOTEsoSparkSession.builder.getOrCreate()just works.--gpus,--rapids), marked EXPERIMENTAL and untested: GPU detection in the compute interface (Slurm env vars /nvidia-smi), a generated GPU discovery script, and RAPIDS plugin wiring viaspark.jars(avoids clobbering the Postgres metastore classpath).Other changes already on the branch
uv.lock, updatespyproject.toml).Testing
ruff,ruff format, andtypass.🤖 Generated with Claude Code