Are you wasting GPU memory? Underusing CPUs? Thrashing disk?
`sjdet` shows you instantly: one command, no environment to activate.
HPC clusters are expensive. Running a GPU job that uses 2% of VRAM, or a CPU job at 5% utilization, wastes both your compute quota and valuable cluster resources. Normally you'd need to ssh into nodes, run `nvidia-smi`, parse `sstat` output, and do mental math — every time.
`sjdet` does all of that and fits it in one table:
| What you see | What it tells you |
|---|---|
| CPU eff bar | Are your cores actually doing work, or are they idle? |
| Mem Use / Req + suggest | How much RAM you're really using vs. what you requested, plus the optimal --mem to request next time |
| VRAM Use / Req | GPU memory used vs. total, with inverted color scale — green means you're filling it up as expected |
| VRAM trend ↑↓ | Is VRAM growing, stable, or shrinking between polls? Catches memory leaks or loading phases |
| MaxPages / MaxDisk ↑↓ | Disk thrashing and page fault trends — immediately visible |
| Node column | Exactly which node your job landed on, so you can ssh or srun --overlap instantly |
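The `--mem` suggestion in the Mem column follows the headroom rule shown later in the flag reference (`ceil(MaxRSS × 1.20)` with the default 20%). A minimal sketch of that rule — an illustration assuming `MaxRSS` has already been parsed into megabytes, not the tool's actual implementation:

```python
import math

def suggest_mem_mb(max_rss_mb: float, headroom: float = 0.20) -> int:
    """Suggest a --mem value: peak RSS plus a safety headroom, rounded up.

    The round() guards against floating-point noise (e.g. 12090.0000000000004)
    pushing ceil() one megabyte too high.
    """
    return math.ceil(round(max_rss_mb * (1.0 + headroom), 6))

# A job that peaked at 9300 MB:
print(suggest_mem_mb(9300))        # 11160 -> request e.g. --mem=11160M
print(suggest_mem_mb(9300, 0.30))  # 12090 with --headroom 0.30
```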
All of this from two SLURM calls (`squeue` + `sstat`), batched, with a local cache to avoid hammering the scheduler.
No root. No sudo. Works on any cluster where you have a home directory.
Requires `uv` (recommended) or `pipx` — both install `sjdet` to `~/.local/bin` so you can type `sjdet` from anywhere, forever, without activating anything.
```sh
curl -LsSf https://astral.sh/uv/install.sh | sh   # install uv itself (once)
uv tool install git+https://github.com/e-candeloro/slurm_job_detective
```

Upgrade later:

```sh
uv tool upgrade slurm-job-detective
```

Or with pipx:

```sh
python3 -m pip install --user pipx
pipx install git+https://github.com/e-candeloro/slurm_job_detective
```

Upgrade later:

```sh
pipx upgrade slurm-job-detective
```

Usage:

```sh
sjdet                        # auto-detects $USER, shows all your jobs
sjdet --user alice           # inspect another user's jobs
sjdet --max-jobs 20          # show up to 20 jobs (default: 10)
sjdet --interval 120         # minimum seconds between sstat polls (default: 60)
sjdet --headroom 0.30        # memory suggestion headroom: ceil(MaxRSS × 1.30) (default: 20%)
sjdet --force-update-nodes   # forces an update to the node info cache
sjdet --clear-cache          # clears the local user cache completely and exits
sjdet --version              # print installed version and exit
sjdet --update               # upgrades sjdet and exits
```

- Normal runs perform a lightweight upstream update check and print a notice when a newer version is detected.
- Notices are cached to avoid spam (cooldown: 7 days).
- `sjdet` does not prompt interactively for updates during normal runs.
- Update notices show `current_version -> target_version`.
- `sjdet --update` upgrades directly to the latest published GitHub release version.
Update-source strategy:
- GitHub latest release only
Update-command strategy for `--update`:
```sh
uv tool install --upgrade git+https://github.com/e-candeloro/slurm_job_detective.git@v<latest_version>
pipx install --force git+https://github.com/e-candeloro/slurm_job_detective.git@v<latest_version>
python -m pip install --upgrade git+https://github.com/e-candeloro/slurm_job_detective.git@v<latest_version>
```
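Deciding whether a notice is warranted comes down to comparing `current_version` against the latest release tag numerically, not as strings. A minimal sketch (the helper name and tag formats are illustrative, not the tool's actual code):

```python
def parse_version(tag: str) -> tuple:
    """Turn 'v1.10.2' or '1.10.2' into (1, 10, 2) for numeric comparison."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))

current, latest = "v1.2.0", "v1.10.0"
assert current > latest                                # string compare gets it wrong
assert parse_version(latest) > parse_version(current)  # numeric compare is right

if parse_version(latest) > parse_version(current):
    print(f"update available: {current} -> {latest}")
```

Tuple comparison handles any number of components, which is why the tags are split on `.` rather than compared lexicographically.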
Reading the GPU column:
- 🔴 red = low VRAM usage (you over-requested or the job hasn't loaded yet)
- 🟡 yellow = moderate usage
- 🟢 green = using ≥ 80% of allocated VRAM (well utilized)
- `↑`/`↓`/`-` = VRAM trend since last poll
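The inverted color scale and the trend arrows can be sketched as below. The 80% green bound comes from the legend; the yellow/red split and the stability epsilon are assumptions, and the function names are illustrative, not the tool's API:

```python
def vram_color(used_gb: float, total_gb: float) -> str:
    """Inverted scale: high VRAM usage is good (green), low is suspicious (red)."""
    ratio = used_gb / total_gb
    if ratio >= 0.80:
        return "green"   # well utilized
    if ratio >= 0.40:    # assumed midpoint; the legend only fixes the 80% bound
        return "yellow"
    return "red"         # over-requested, or the job hasn't loaded yet

def vram_trend(prev_gb: float, curr_gb: float, eps: float = 0.1) -> str:
    """Trend arrow between two polls: growing, shrinking, or stable."""
    if curr_gb - prev_gb > eps:
        return "↑"
    if prev_gb - curr_gb > eps:
        return "↓"
    return "-"

print(vram_color(30, 32), vram_trend(28.0, 30.0))  # green ↑
```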
To jump to your job's node:

```sh
ssh <node>                                        # SSH directly (node shown in table)
srun --overlap --jobid=<JOBID> --pty nvidia-smi   # run nvidia-smi inside your allocation
```

Each invocation makes exactly two SLURM calls:

- `squeue` — one call for all your jobs (state, node, GRES)
- `sstat` — one batched call for all running job IDs
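A sketch of what "two calls, batched" means in practice. The command construction assumes standard `squeue`/`sstat` options; the exact fields and the pipe-delimited format are illustrative choices, not necessarily what the tool uses:

```python
import subprocess

def squeue_cmd(user: str) -> list:
    # one call covers every job: id, state, node list, GRES
    return ["squeue", "-u", user, "--noheader", "-o", "%i|%T|%N|%b"]

def sstat_cmd(job_ids: list) -> list:
    # one *batched* call: comma-separated job IDs, not one sstat per job
    return ["sstat", "-j", ",".join(job_ids), "--parsable2",
            "--format=JobID,MaxRSS,MaxPages,MaxDiskRead,MaxDiskWrite"]

def parse_squeue(out: str) -> list:
    """Split squeue's pipe-delimited lines into job dicts."""
    jobs = []
    for line in out.strip().splitlines():
        job_id, state, node, gres = line.split("|")
        jobs.append({"id": job_id, "state": state, "node": node, "gres": gres})
    return jobs

def my_jobs(user: str) -> list:
    out = subprocess.run(squeue_cmd(user), capture_output=True,
                         text=True, check=True).stdout
    return parse_squeue(out)
```

Batching matters because each `sstat` invocation is a round trip to the scheduler; joining job IDs with commas keeps that to a single call regardless of job count.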
GPU hardware info (`scontrol show node`) is queried once per node, then cached locally forever — GPU models don't change between runs.
The `sstat` result is cached for `--interval` seconds (default 60s) to respect cluster etiquette. Re-running `sjdet` within that window is instant.
```sh
git clone https://github.com/e-candeloro/slurm_job_detective
cd slurm_job_detective
uv sync --dev   # creates .venv, installs deps + torch (for GPU load testing)
uv run sjdet    # run without activating the venv
```

GPU load test (to see real VRAM numbers in the table):

```sh
# inside a GPU srun session:
uv run scripts/gpu_load_test.py --gb 8 --seconds 300

# in another terminal:
uv run sjdet
```

```
src/sjdet/
├── cli.py       ← argument parsing + main() orchestration
├── slurm.py     ← squeue/sstat/scontrol calls, data model, parsing
├── display.py   ← rich table, progress bars, color logic
└── cache.py     ← local JSON cache (throttles sstat + persists node info)
scripts/
├── gpu_load_test.py ← dev tool to burn VRAM and verify GPU column
└── mock_cli.py      ← dev tool to test the CLI without access to a SLURM scheduler
```
