Merged
24 commits
ead18ff
Fix bugs, performance, and usability issues in Python code
daniel-thom Jun 13, 2026
da7cd71
Switch tooling to prek/ty/uv
daniel-thom Jun 13, 2026
121317d
docs: add uv-based install option to the installation guide
daniel-thom Jun 13, 2026
5fd0a78
docs: fix install, correctness, and debuggability issues
daniel-thom Jun 13, 2026
b94e20e
Add reverse proxy, Prometheus, JupyterLab, and experimental GPU features
daniel-thom Jun 13, 2026
7c5ea5a
Add CSV metrics sink option
daniel-thom Jun 13, 2026
ddca495
Make the Jupyter frontend configurable, default to classic notebook
daniel-thom Jun 13, 2026
9c1ccce
Add optional jupyter extra instead of a core dependency
daniel-thom Jun 13, 2026
b8e8bd8
Address Copilot review: Jupyter bind, ssh thread cap, pg password checks
daniel-thom Jun 13, 2026
ec1fff4
Document that sparkctl clean deletes the spark_scratch directory
daniel-thom Jun 13, 2026
730c759
Make `sparkctl clean` refuse to run while a cluster is running
daniel-thom Jun 13, 2026
571cf61
Address Copilot round 2: tar filter compat and portable stat in docs
daniel-thom Jun 13, 2026
a506589
Fix Jupyter connectivity and reduce log noise
daniel-thom Jun 13, 2026
2057c18
Improve Jupyter user documentation
daniel-thom Jun 13, 2026
d08c1a8
Address Copilot round 3 and expand GPU docs
daniel-thom Jun 13, 2026
60164f2
Fail fast when executor_cores exceeds available worker CPUs
daniel-thom Jun 13, 2026
c0f4489
Document monitoring GPU usage with NVIDIA tools
daniel-thom Jun 13, 2026
dfc6780
Drop experimental/untested labels from GPU features
daniel-thom Jun 13, 2026
4b5c53b
Downgrade Spark/PySpark from 4.1.2 to 4.1.1 for RAPIDS support
daniel-thom Jun 13, 2026
97c4d21
Auto-size one executor per GPU when GPUs are enabled
daniel-thom Jun 13, 2026
9c6e6ff
Fail fast on uneven multi-node GPU allocations
daniel-thom Jun 13, 2026
06ab7be
Print SPARK_HOME and an interactive connect command on start
daniel-thom Jun 13, 2026
2058030
Set SPARK_HOME automatically alongside JAVA_HOME/SPARK_CONF_DIR
daniel-thom Jun 13, 2026
0454775
Simplify connect hint and lower GPU enable logs to INFO
daniel-thom Jun 14, 2026
74 changes: 34 additions & 40 deletions .github/workflows/ci.yml
@@ -11,52 +11,46 @@ env:
DEFAULT_OS: ubuntu-latest

jobs:
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Install uv
uses: astral-sh/setup-uv@v7
with:
python-version: "3.12"
enable-cache: false
- name: Install Python project
run: uv sync --extra dev --frozen
- name: Run Ruff lint
run: uv run ruff check .
- name: Check Ruff formatting
run: uv run ruff format --check .
- name: Run ty
run: uv run ty check

pytest:
runs-on: ${{ matrix.os }}
strategy:
matrix:
python-version: ["3.12", "3.13"]
os: [ubuntu-latest]

steps:
- uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
python -m pip install ".[dev]"
- name: Run pytest with coverage
run: |
pytest -v -m "not integration" --cov --cov-report=xml
- name: codecov
uses: codecov/codecov-action@v4.2.0
if: ${{ matrix.os == env.DEFAULT_OS && matrix.python-version == env.DEFAULT_PYTHON }}
with:
token: ${{ secrets.CODECOV_TOKEN }}
name: sparkctl-tests
fail_ci_if_error: false
verbose: true
mypy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: 3.12
- name: Install dependencies
run: |
python -m pip install --upgrade pip
python -m pip install ".[dev]"
mypy
ruff:
runs-on: ubuntu-latest
name: "ruff"
steps:
- uses: actions/checkout@v4
- uses: chartboost/ruff-action@v1
- name: Install uv
uses: astral-sh/setup-uv@v7
with:
python-version: ${{ matrix.python-version }}
enable-cache: false
- name: Install Python project
run: uv sync --extra dev --frozen
- name: Run pytest with coverage
run: uv run pytest -v -m "not integration" --cov --cov-report=xml
- name: codecov
uses: codecov/codecov-action@v4.2.0
if: ${{ matrix.os == env.DEFAULT_OS && matrix.python-version == env.DEFAULT_PYTHON }}
with:
src: "./src"
token: ${{ secrets.CODECOV_TOKEN }}
name: sparkctl-tests
fail_ci_if_error: false
verbose: true
15 changes: 7 additions & 8 deletions .github/workflows/gh-pages.yml
@@ -9,19 +9,18 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: select python version
uses: actions/setup-python@v5
- name: Install uv
uses: astral-sh/setup-uv@v7
with:
python-version: "3.12"
- name: install dependencies
run: |
python -m pip install --upgrade pip
python -m pip install ".[dev]"
enable-cache: false
- name: Install Python project
run: uv sync --extra dev --frozen
- name: build documentation
run: |
cd docs
make clean
make html
uv run make clean
uv run make html
- name: deploy
uses: peaceiris/actions-gh-pages@v3.6.1
with:
14 changes: 5 additions & 9 deletions .github/workflows/publish_to_pypi.yml
@@ -11,16 +11,12 @@ jobs:
id-token: write
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
- name: Install uv
uses: astral-sh/setup-uv@v7
with:
python-version: "3.12"
- name: Install dependencies
run: |
python -m pip install --upgrade pip
python -m pip install build
- name: Build and publish
run: |
python -m build
enable-cache: false
- name: Build
run: uv build
- name: Publish package distributions to PyPI
uses: pypa/gh-action-pypi-publish@release/v1
2 changes: 1 addition & 1 deletion .gitignore
@@ -133,7 +133,7 @@ dmypy.json
.vscode

tests/data/apache-hive-4.0.1-bin*
tests/data/spark-4.1.2-bin-hadoop3*
tests/data/spark-*-bin-hadoop3*
tests/data/postgresql*
tests/data/jdk-21.0.7.jdk*
conf
29 changes: 15 additions & 14 deletions .pre-commit-config.yaml
@@ -1,15 +1,16 @@
repos:
- repo: https://github.com/astral-sh/ruff-pre-commit
# Ruff version.
rev: v0.2.1
hooks:
# Run the linter.
- id: ruff
args: [ --fix ]
# Run the formatter.
- id: ruff-format
- repo: https://github.com/pre-commit/mirrors-mypy
rev: v1.13.0
hooks:
- id: mypy
language: system
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.15.8
hooks:
- id: ruff
args: [--fix]
- id: ruff-format

- repo: local
hooks:
- id: ty
name: ty (type check)
entry: uv run ty check
language: system
types: [python]
pass_filenames: false
26 changes: 21 additions & 5 deletions README.md
@@ -49,24 +49,40 @@ support.
Contributions are welcome.

## Development
Install the package with its development dependencies:
This project uses [uv](https://docs.astral.sh/uv/) for environment management. Install the
package with its development dependencies:
```console
$ pip install -e ".[dev]"
$ uv sync --extra dev
```

Lint, format, and type-check the code with [ruff](https://docs.astral.sh/ruff/) and
[ty](https://github.com/astral-sh/ty):
```console
$ uv run ruff check .
$ uv run ruff format --check .
$ uv run ty check
```

These checks also run as Git hooks via [prek](https://github.com/j178/prek). Install the hooks
once and then run them on demand:
```console
$ uv run prek install
$ uv run prek run --all-files
```

Run the unit tests. These are fast, require no special resources, and are what CI runs:
```console
$ pytest -m "not integration"
$ uv run pytest -m "not integration"
```

The integration tests download a real Spark and Java distribution into `tests/data/` and start a
real single-node Spark cluster, so they are slower and require network access and sufficient
memory. They are excluded from CI; run them locally with:
```console
$ pytest -m integration
$ uv run pytest -m integration
```

Run the complete suite (unit and integration tests) with `pytest`.
Run the complete suite (unit and integration tests) with `uv run pytest`.

## License
sparkctl is released under a BSD 3-Clause [license](https://github.com/NatLabRockies/sparkctl/blob/main/LICENSE).
12 changes: 9 additions & 3 deletions docs/explanation/index.md
@@ -94,13 +94,19 @@ This TOML file stores environment-specific settings that rarely change:

```toml
[binaries]
spark_home = "/path/to/spark"
java_home = "/path/to/java"
spark_path = "/path/to/spark"
java_path = "/path/to/java"
```

These settings tell sparkctl where to find Spark and Java. You can also set global settings that
apply every time you run `sparkctl configure`. For example, if you always want to use Spark Connect,
you can set `spark_connect_server = true` and avoid having to set it each time you configure.
you can set `start_connect_server = true` in a `[runtime]` section and avoid having to set it each
time you configure:

```toml
[runtime]
start_connect_server = true
```

### Runtime Configuration (`./conf/`)

18 changes: 10 additions & 8 deletions docs/faq.md
@@ -70,24 +70,26 @@ export the relevant variables yourself, for example through `spark-env.sh` in th

If you are running pyspark/spark-submit after installing via `pip install sparkctl[pyspark]`,
your version of pyspark must match the cluster version exactly. Client version 4.1.3 is
incompatible with cluster version 4.1.2.
incompatible with cluster version 4.1.1.
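
A minimal sketch of the exact-match rule, for illustration: the `versions_match` helper is hypothetical, and the version strings are the ones from the example above. In practice the client version could be taken from `pyspark.__version__`.

```python
def versions_match(client_version: str, cluster_version: str) -> bool:
    """Return True only when the two version strings are identical.

    The rule described above is an exact match, so a patch-level
    difference (4.1.3 vs 4.1.1) counts as a mismatch.
    """
    return client_version.strip() == cluster_version.strip()


# Versions taken from the example above.
print(versions_match("4.1.3", "4.1.1"))
print(versions_match("4.1.1", "4.1.1"))
```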

### Why can't my workers connect to the master?

Common causes:

1. **High-bandwidth nodes**: Some NLR Kestrel compute nodes have two network cards, which Spark
cannot deal with. Set `--constaint lbw` when allocating nodes.
cannot deal with. Set `--constraint lbw` when allocating nodes.

Check the Spark master logs in `./spark_scratch/logs/` for connection errors.

### How do I connect to the Spark Web UI?

The Spark master runs a web UI on port 4040 (driver) or 8080 (master). Since HPC compute nodes
aren't directly accessible, use SSH tunneling:
Spark runs a web UI on port 8080 (master) and port 4040 (driver/application). Since HPC compute
nodes aren't directly accessible, use SSH tunneling. Substitute the name of your compute node
(it is listed in `./conf/workers`) for `$COMPUTE_NODE`:

```console
$ ssh -L 8080:$(hostname):8080 user@hpc-login-node
$ export COMPUTE_NODE=<your-compute-node-name>
$ ssh -L 8080:$COMPUTE_NODE:8080 -L 4040:$COMPUTE_NODE:4040 user@hpc-login-node
```

Then open `http://localhost:8080` or `http://localhost:4040` in your browser.
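
If you forward the same ports often, the command can be assembled programmatically. The sketch below is illustrative only: the `tunnel_command` helper and the node name `node1` are hypothetical, and the ports and login host are the ones used in the example above.

```python
import shlex


def tunnel_command(compute_node: str, login_host: str, ports=(8080, 4040)) -> str:
    """Build an ssh command that forwards each local port to the same port on the compute node."""
    args = ["ssh"]
    for port in ports:
        # -L local_port:remote_host:remote_port
        args += ["-L", f"{port}:{compute_node}:{port}"]
    args.append(login_host)
    # shlex.join quotes any argument that needs it for a POSIX shell.
    return shlex.join(args)


# "node1" is a placeholder; use the node name listed in ./conf/workers.
print(tunnel_command("node1", "user@hpc-login-node"))
```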
@@ -104,13 +106,13 @@ Common causes:
3. **Too few partitions**: Increase `spark.sql.shuffle.partitions`.
4. **Too many partitions**: Decrease partitions if you have many small tasks.
5. **Slow storage**: Ensure shuffle storage uses fast local SSDs, not shared filesystem.
6. **Non-ideal partitioning**: If you are trying to partition-by-column in the same query as your
main work, especially where you significantly increased the shuffle partitions, persist your
main work first. Then repartition in a second task.
7. **Query too complex**: If you are trying to run a very complex query where subtasks have
different data sizes and partitioning needs, consider breaking the query into smaller parts with
different settings. Persist intermediate results to the filesystem so that you can checkpoint and
make incremental progress.
6. **Non-ideal partitioning**: If you are trying to partition-by-column in the same query as your
main work, especially where you significantly increased the shuffle partitions, persist your
main work first. Then repartition in a second task.

See the {ref}`how-tos-debugging` for performance troubleshooting.

2 changes: 1 addition & 1 deletion docs/how_tos/applications/hive_metastore.md
@@ -26,6 +26,6 @@ start the server. Apptainer will cache the container image and you can reuse the
across Slurm allocations.

**Note**: The metadata about your tables will be stored in Derby or Postgres. Your tables will
be stored on the filesystem (Parquet files by default) in a directory called `spark_warehouse`,
be stored on the filesystem (Parquet files by default) in a directory called `spark-warehouse`,
which gets created in the directory passed to `--metastore-dir` (current directory by default).
Postgres data, if enabled, will be in the same directory (`pg_data`).
1 change: 1 addition & 0 deletions docs/how_tos/applications/index.md
@@ -7,4 +7,5 @@

hive_metastore
tableau
jupyter
```