🕵️ SLURM Job Detective

Are you wasting GPU memory? Underusing CPUs? Thrashing disk?
sjdet shows you instantly — one command, no environment to activate.

Why sjdet?

HPC clusters are expensive. Running a GPU job that uses 2% of VRAM, or a CPU job at 5% utilization, wastes both your compute quota and valuable cluster resources. Normally you'd need to ssh into nodes, run nvidia-smi, parse sstat output, and do mental math — every time.

sjdet does all of that and fits it in one table:

| What you see | What it tells you |
| --- | --- |
| CPU eff bar | Are your cores actually doing work, or are they idle? |
| Mem Use / Req + suggest | How much RAM you're really using vs. what you requested, plus the optimal `--mem` to request next time |
| VRAM Use / Req | GPU memory used vs. total, with an inverted color scale: green means you're filling it up as expected |
| VRAM trend ↑↓ | Is VRAM growing, stable, or shrinking between polls? Catches memory leaks or loading phases |
| MaxPages / MaxDisk ↑↓ | Disk thrashing and page fault trends, immediately visible |
| Node column | Exactly which node your job landed on, so you can `ssh` or `srun --overlap` instantly |

All of this from two SLURM calls (squeue + sstat), batched, with a local cache to avoid hammering the scheduler.

Install

No root. No sudo. Works on any cluster where you have a home directory.

Requires uv (recommended) or pipx — both install sjdet to ~/.local/bin so you can type sjdet from anywhere, forever, without activating anything.

uv (recommended)

curl -LsSf https://astral.sh/uv/install.sh | sh  # install uv itself (once)
uv tool install git+https://github.com/e-candeloro/slurm_job_detective

Upgrade later:

uv tool upgrade slurm-job-detective

pipx

python3 -m pip install --user pipx
pipx install git+https://github.com/e-candeloro/slurm_job_detective

Upgrade later:

pipx upgrade slurm-job-detective

Usage

sjdet                             # auto-detects $USER, shows all your jobs
sjdet --user alice                # inspect another user's jobs
sjdet --max-jobs 20               # show up to 20 jobs (default: 10)
sjdet --interval 120              # minimum seconds between sstat polls (default: 60)
sjdet --headroom 0.30             # memory suggestion headroom: ceil(MaxRSS × 1.30) (default: 0.20)
sjdet --force-update-nodes        # forces an update to the node info cache
sjdet --clear-cache               # clears the local user cache completely and exits
sjdet --version                   # print installed version and exit
sjdet --update                    # upgrades sjdet and exits
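
The `--headroom` arithmetic can be illustrated with a short Python sketch. The helper name `suggest_mem_mb` and the choice of whole megabytes as the unit are assumptions made for illustration, not sjdet's actual code.

```python
import math


def suggest_mem_mb(max_rss_mb: float, headroom: float = 0.20) -> int:
    """Return ceil(MaxRSS * (1 + headroom)) in whole megabytes.

    Hypothetical helper: the name and the MB unit are illustrative only.
    """
    if max_rss_mb < 0 or headroom < 0:
        raise ValueError("max_rss_mb and headroom must be non-negative")
    return math.ceil(max_rss_mb * (1.0 + headroom))


# A job that peaked at 4096 MB, with --headroom 0.30:
print(suggest_mem_mb(4096, headroom=0.30))  # 5325
```

Rounding up rather than to the nearest value keeps the suggestion from ever landing below the observed peak.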

Update checks

Normal runs perform a lightweight upstream update check and print a notice when a newer version is detected.
Notices are cached to avoid spam (cooldown: 7 days).
sjdet does not prompt interactively for updates during normal runs.
Update notices show current_version -> target_version.
sjdet --update upgrades directly to the latest published GitHub release version.
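
A minimal sketch of how a notice cooldown like this might be checked. The function names and the idea of storing the last notice time as a Unix timestamp are assumptions, not taken from sjdet's source.

```python
import time

COOLDOWN_SECONDS = 7 * 24 * 60 * 60  # the 7-day cooldown described above


def should_notify(last_notified, now=None, cooldown=COOLDOWN_SECONDS):
    """Return True if no notice was shown yet or the cooldown has elapsed.

    `last_notified` is a Unix timestamp or None (hypothetical cache field).
    """
    now = time.time() if now is None else now
    return last_notified is None or (now - last_notified) >= cooldown


def format_notice(current_version, target_version):
    """Render the notice in the current_version -> target_version form."""
    return f"{current_version} -> {target_version}"
```

Passing `now` explicitly keeps the check easy to test without waiting for real time to pass.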

Update-source strategy:

GitHub latest release only

Update-command strategy for --update:

uv tool install --upgrade git+https://github.com/e-candeloro/slurm_job_detective.git@v<latest_version>
pipx install --force git+https://github.com/e-candeloro/slurm_job_detective.git@v<latest_version>
python -m pip install --upgrade git+https://github.com/e-candeloro/slurm_job_detective.git@v<latest_version>
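
One way the choice between these three commands could be made is sketched below. The preference order (uv, then pipx, then pip) is assumed from the order of the list above, and `build_update_command` is a hypothetical name; sjdet's real selection logic may differ.

```python
import shutil
import sys

REPO = "git+https://github.com/e-candeloro/slurm_job_detective.git"


def build_update_command(latest_version, which=shutil.which):
    """Build the upgrade command for the first installer found on PATH.

    Assumed order: uv, then pipx, then pip as a fallback.
    `which` is injectable so the logic can be tested without those tools.
    """
    target = f"{REPO}@v{latest_version}"
    if which("uv"):
        return ["uv", "tool", "install", "--upgrade", target]
    if which("pipx"):
        return ["pipx", "install", "--force", target]
    return [sys.executable, "-m", "pip", "install", "--upgrade", target]
```

Returning an argument list rather than a shell string means the result can be handed to `subprocess.run` without shell quoting concerns.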

Reading the GPU column:

🔴 red = low VRAM usage (you over-requested or the job hasn't loaded yet)
🟡 yellow = moderate usage
🟢 green = using ≥ 80% of allocated VRAM (well utilized)
↑ / ↓ / - = VRAM trend since last poll
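
The color and trend rules above can be sketched in a few lines of Python. Only the 80% green threshold comes from this README; the 40% boundary between red and yellow is a placeholder assumption, and both function names are hypothetical.

```python
def vram_color(used_mb, total_mb, yellow_at=0.40, green_at=0.80):
    """Map VRAM usage to a color name on the inverted scale.

    green_at=0.80 is from the README; yellow_at=0.40 is an assumed value.
    """
    if total_mb <= 0:
        raise ValueError("total_mb must be positive")
    ratio = used_mb / total_mb
    if ratio >= green_at:
        return "green"
    if ratio >= yellow_at:
        return "yellow"
    return "red"


def vram_trend(previous_mb, current_mb):
    """Return the trend marker since the last poll ('-' if unknown or unchanged)."""
    if previous_mb is None or current_mb == previous_mb:
        return "-"
    return "↑" if current_mb > previous_mb else "↓"
```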

To jump to your job's node:

ssh <node>               # SSH directly (node shown in table)
srun --overlap --jobid=<JOBID> --pty nvidia-smi   # run nvidia-smi inside your allocation

How it works (no scheduler spam)

Each invocation makes at most two SLURM calls for job data:

squeue — one call for all your jobs (state, node, GRES)
sstat — one batched call for all running job IDs
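
The two calls could be assembled roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, and the specific format string and sstat field list are illustrative choices, not necessarily what sjdet requests. The flags themselves (`-u`, `-h`, `-o` for squeue; `-j`, `-a`, `-n`, `-P`, `--format` for sstat) are standard SLURM options.

```python
def build_squeue_cmd(user):
    """One squeue call for all of a user's jobs: id, state, node list, GRES."""
    return ["squeue", "-u", user, "-h", "-o", "%i|%T|%N|%b"]


def build_sstat_cmd(job_ids):
    """One sstat call covering every running job ID (comma-separated)."""
    if not job_ids:
        raise ValueError("no running job IDs to query")
    return [
        "sstat",
        "-j", ",".join(job_ids),
        "-a",  # include all steps of each job
        "-n",  # no header line
        "-P",  # pipe-delimited output, easy to split
        "--format=JobID,MaxRSS,MaxPages,MaxDiskRead",  # assumed field selection
    ]


print(build_sstat_cmd(["101", "102"])[2])  # 101,102
```

Joining the job IDs into a single `-j` argument is what keeps the scheduler load constant regardless of how many jobs are running.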

GPU hardware info (scontrol show node) is queried once per node, then cached locally until you pass --force-update-nodes, since GPU models rarely change between runs.

The sstat result is cached for --interval seconds (default 60s) to respect cluster etiquette. Re-running sjdet within that window is instant.
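
A time-bounded cache like this can be sketched as a JSON file holding a timestamp next to the data. The file layout, the `fetched_at` key, and the function names below are assumptions for illustration and do not describe sjdet's actual cache format.

```python
import json
import time
from pathlib import Path


def store(cache_path, data, now=None):
    """Write data to the cache together with the time it was fetched."""
    now = time.time() if now is None else now
    Path(cache_path).write_text(json.dumps({"fetched_at": now, "data": data}))


def load_fresh(cache_path, interval, now=None):
    """Return cached data if it is younger than `interval` seconds, else None."""
    now = time.time() if now is None else now
    try:
        entry = json.loads(Path(cache_path).read_text())
    except (OSError, ValueError):
        return None  # missing or corrupt cache: treat as a miss
    if not isinstance(entry, dict):
        return None
    if now - entry.get("fetched_at", 0) < interval:
        return entry.get("data")
    return None
```

A caller would try `load_fresh` first and only run sstat, followed by `store`, when it returns `None`.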

Development

git clone https://github.com/e-candeloro/slurm_job_detective
cd slurm_job_detective
uv sync --dev        # creates .venv, installs deps + torch (for GPU load testing)
uv run sjdet         # run without activating the venv

GPU load test (to see real VRAM numbers in the table):

# inside a GPU srun session:
uv run scripts/gpu_load_test.py --gb 8 --seconds 300
# in another terminal:
uv run sjdet

Project layout

src/sjdet/
├── cli.py      ← argument parsing + main() orchestration
├── slurm.py    ← squeue/sstat/scontrol calls, data model, parsing
├── display.py  ← rich table, progress bars, color logic
└── cache.py    ← local JSON cache (throttles sstat + persists node info)
scripts/
├── gpu_load_test.py  ← dev tool to burn VRAM and verify GPU column
└── mock_cli.py       ← dev tool to test the CLI without access to a SLURM scheduler
