
Commit 2b8e221 (wdvr and claude committed)

Add ptq auto: full GPU lifecycle with skills and Dockerfile support

- ptq auto: single command that reserves GPU via gpu-dev, sets up workspace, launches agent, fetches results, and cancels reservation on exit
- Skills: verify-fix (test + perf checks) and make-pr (draft PR creation) baked into Docker image via --dockerfile to gpu-dev
- Docker context: generates Dockerfile extending gpu-dev base image with skills at /home/dev/.claude/skills/
- Provision: thin wrapper around gpu-dev-cli (lazy import, optional dep)
- SSH hardening: retry on 255 (proxy flakiness), base64 file transfer (scp unreliable over WebSocket proxy), _extract_hash for noisy output
- Agent: Skill tool enabled in allowedTools + display in stream output
- Job tracking: save/get reservation_id for clean command integration
- CLI updates: list shows reservation column, clean cancels reservations, --prompt/-p for extra agent guidance
- 91 tests, all passing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1 parent: d13defe

17 files changed: 1010 additions & 103 deletions
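The SSH-hardening bullets (retry on 255, base64 file transfer) can be sketched roughly like this; the function names and backoff policy are illustrative, not the actual ptq code:

```python
import base64
import shlex
import subprocess
import time

def transfer_command(data: bytes, remote_path: str) -> str:
    # scp is unreliable over the WebSocket proxy, so the file's bytes ride
    # inside the SSH command itself as base64 and are decoded remotely.
    payload = base64.b64encode(data).decode()
    return f"echo {payload} | base64 -d > {shlex.quote(remote_path)}"

def ssh_run(host: str, command: str, retries: int = 3) -> subprocess.CompletedProcess:
    # Exit code 255 means the SSH transport failed (flaky proxy), not that
    # the remote command failed, so only 255 is retried, with backoff.
    for attempt in range(retries):
        proc = subprocess.run(["ssh", host, command], capture_output=True, text=True)
        if proc.returncode != 255:
            return proc
        time.sleep(2 ** attempt)
    raise RuntimeError(f"ssh to {host} kept failing with exit code 255")
```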


README.md (155 additions & 82 deletions)
# ptq — PyTorch Job Queue

CLI tool that dispatches Claude agents to autonomously investigate and fix PyTorch GitHub issues on GPU machines. The agent reproduces the bug, finds the root cause, applies a fix, and produces a report and diff.

## Install

```bash
cd pt_job_queue
uv pip install -e .

# With GPU auto-provisioning support (optional)
uv pip install -e '.[gpu]'
```

## Quick start

### Fully automatic (recommended)

```bash
# Reserve a GPU, set up workspace, run agent, fetch results — all in one command
ptq auto --issue 149002

# With options
ptq auto --issue 149002 --gpu-type a100 --hours 2
ptq auto --issue 149002 -p "focus on the inductor nan_assert codepath"
ptq auto --issue 149002 --no-pr --no-follow
```

`ptq auto` handles the full lifecycle:

1. Fetches the GitHub issue (fails fast before reserving a GPU)
2. Reserves a GPU pod via gpu-dev
3. Sets up the workspace (uv, PyTorch nightly, source clone)
4. Launches a Claude agent to investigate and fix the issue
5. Fetches results when the agent finishes
6. Cancels the reservation (in `--follow` mode; in `--no-follow`, the reservation auto-expires)

The `--hours` flag (default: 4) is a max cap — the reservation is cancelled as soon as the agent finishes.
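Conceptually, the cleanup guarantee in step 6 is a try/finally around the reservation; a minimal sketch (helper names are illustrative, not ptq's actual API):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Reservation:
    id: str
    host: str

def run_auto(
    reserve: Callable[[], Reservation],
    steps: list[Callable[[Reservation], None]],
    cancel: Callable[[Reservation], None],
) -> None:
    # Reserve first, then run the pipeline; `finally` guarantees the GPU
    # is released even when a step (setup, agent, fetch) raises.
    res = reserve()
    try:
        for step in steps:
            step(res)
    finally:
        cancel(res)
```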
### Manual workflow

If you already have a GPU machine provisioned:

```bash
# 1. One-time setup
ptq setup my-gpu-box

# 2. Run an investigation
ptq run --issue 174923 --machine my-gpu-box

# 3. View results
ptq results 174923

# 4. Apply the fix locally
ptq apply 174923 --pytorch-path ~/pytorch
```

## Commands

### `ptq auto` — Full auto mode

Reserves a GPU, runs the full pipeline, and cleans up.

```bash
ptq auto --issue 149002                               # investigate a GitHub issue
ptq auto --issue 149002 --gpu-type a100 --hours 2     # custom GPU and duration
ptq auto --issue 149002 -p "check flex_attention.py"  # extra guidance for agent
ptq auto --issue 149002 --no-pr                       # skip PR creation
ptq auto -m "benchmark torch.compile on H100" -h 6    # freeform task
ptq auto --issue 149002 --no-follow                   # launch in background
```

| Option | Default | Description |
|--------|---------|-------------|
| `--issue` | | GitHub issue number |
| `--message`, `-m` | | Freeform task (instead of issue) |
| `--input`, `-i` | | Read task from file |
| `--prompt`, `-p` | | Extra guidance passed to the agent |
| `--gpu-type` | `h100` | GPU type (`h100`, `a100`, `a10g`, `t4`, etc.) |
| `--gpus` | `1` | Number of GPUs |
| `--hours` | `4.0` | Max reservation hours (auto-cancels when done) |
| `--no-pr` | | Skip PR creation after fix |
| `--model` | `opus` | Claude model |
| `--max-turns` | `100` | Max agent turns |
| `--follow/--no-follow` | `follow` | Stream output or run in background |
| `--dockerfile` | | Custom Dockerfile for the GPU pod |
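When `--dockerfile` is not given, ptq generates a Docker context that (per the commit message) extends the gpu-dev base image and bakes the skills in. A sketch of that generation, with the function name and base image tag assumed:

```python
def render_dockerfile(base_image: str) -> str:
    # Layer the skills onto the gpu-dev base image at the standard Claude
    # skills path, so the agent can invoke them via the Skill tool.
    return (
        f"FROM {base_image}\n"
        "COPY skills/ /home/dev/.claude/skills/\n"
    )
```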
### `ptq setup` — Workspace setup

One-time setup on a remote machine or locally.

```bash
ptq setup my-gpu-box               # remote (auto-detects CUDA)
ptq setup my-gpu-box --cuda cu130  # explicit CUDA version
ptq setup --local --cpu            # local (for testing)
```

### `ptq run` — Launch agent

Launch a Claude agent on a pre-provisioned machine.

```bash
ptq run --issue 174923 --machine my-gpu-box
ptq run -m "investigate OOM in flex_attention" --machine gpu-dev
ptq run -i task.md --machine gpu-dev
ptq run 174923 -m "try a different approach"  # re-run with steering
```

### `ptq results` — Fetch results

```bash
ptq results 174923
ptq results 20260214-174923
```

Fetches `report.md`, `fix.diff`, `worklog.md`, and the agent log.

### `ptq apply` — Apply fix locally

```bash
ptq apply 174923 --pytorch-path ~/pytorch
```

Creates branch `ptq/{issue_number}`, applies the diff, prints next steps for PR.

### `ptq list` — List all jobs

```bash
ptq list
```

Shows status (running/stopped), job ID, issue number, run count, target machine, and reservation ID (if provisioned via `ptq auto`).

### `ptq peek` — Check progress

```bash
ptq peek 174923           # show worklog
ptq peek 174923 --log 30  # also show last 30 log lines
```

### `ptq status` — Check if agent is running

```bash
ptq status 174923
```

### `ptq kill` — Stop an agent

```bash
ptq kill 174923
```

### `ptq clean` — Remove jobs

```bash
ptq clean 174923               # single job
ptq clean my-gpu-box           # all stopped jobs on machine
ptq clean my-gpu-box --keep 2  # keep 2 most recent
ptq clean my-gpu-box --all     # include running jobs
ptq clean --local              # local workspace
```

When cleaning a job created by `ptq auto`, the associated GPU reservation is also cancelled.
## How the agent works

The agent receives a system prompt with the issue context, a workspace with PyTorch nightly + source, and a repro script extracted from the issue. It then:

1. **Reproduces** the bug using the repro script
2. **Investigates** by reading PyTorch source in an isolated git worktree
3. **Fixes** with minimal Python-only edits
4. **Tests** by syncing edits to site-packages and re-running the repro
5. **Outputs** `report.md` and `fix.diff`
6. **Verifies & creates PR** (if instructed) using built-in skills

### Skills

The agent has access to two skills (baked into the workspace):

- **verify-fix**: Runs the repro to confirm the fix, executes related regression tests, and does a before/after performance comparison.
- **make-pr**: Creates a `ptq/{issue}` branch, commits the fix, pushes, and creates a draft PR via `gh pr create --draft`.
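A Claude Code skill is a directory containing a `SKILL.md`; a rough sketch of what `verify-fix`'s file could look like (the frontmatter fields are the standard ones, but the body here is paraphrased from the description above, not the shipped skill):

```markdown
---
name: verify-fix
description: Confirm a fix by re-running the repro, running related regression tests, and comparing performance before/after.
---

1. Re-run the repro script and confirm the original failure is gone.
2. Run the regression tests closest to the files the fix touched.
3. Time the repro before and after the fix and report the difference.
```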
## GPU provisioning (gpu-dev integration)

`ptq auto` uses [gpu-dev-cli](https://github.com/wdvr/osdc) to manage GPU reservations. Requirements:

1. `gpu-dev-cli` installed (`pip install 'ptq[gpu]'`)
2. AWS credentials configured for gpu-dev
3. GitHub username set: `gpu-dev config set github_user <username>`
4. SSH config includes gpu-dev: the CLI handles this automatically

The reservation uses `no_persistent_disk=True` since results are fetched before cancellation.
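The "lazy import, optional dep" pattern behind `ptq/provision.py` can be sketched generically (the helper name is illustrative; ptq's wrapper is specific to gpu-dev-cli):

```python
import importlib

def lazy_import(module: str, install_hint: str):
    # Defer the import until provisioning is actually requested, so plain
    # `ptq run` works without the optional [gpu] extra installed.
    try:
        return importlib.import_module(module)
    except ImportError as exc:
        raise RuntimeError(f"{module} is required: {install_hint}") from exc
```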
## Project layout

```
pt_job_queue/
├── pyproject.toml
├── ptq/
│   ├── cli.py           # Typer CLI (setup, run, auto, results, etc.)
│   ├── ssh.py           # SSH + local subprocess backends
│   ├── issue.py         # GitHub issue fetching via gh
│   ├── job.py           # Job ID generation + local state tracking
│   ├── workspace.py     # Remote workspace setup (uv, PyTorch, git)
│   ├── agent.py         # Agent prompt construction + Claude Code launch
│   ├── results.py       # Fetch + display results
│   ├── apply.py         # Apply diff to local pytorch checkout
│   ├── provision.py     # GPU reservation lifecycle (gpu-dev wrapper)
│   └── docker.py        # Dockerfile generation + build context
├── prompts/
│   ├── investigate.md   # System prompt for issue investigations
│   └── adhoc.md         # System prompt for freeform tasks
├── skills/
│   ├── verify-fix/SKILL.md  # Testing + perf verification skill
│   └── make-pr/SKILL.md     # Draft PR creation skill
├── scripts/
│   └── apply_to_site_pkgs.sh
└── tests/
    ├── test_agent.py
    ├── test_cli.py
    ├── test_docker.py
    ├── test_job.py
    ├── test_provision.py
    ├── test_issue.py
    ├── test_results.py
    └── test_workspace.py
```

## Workspace layout (on remote/local)

prompts/adhoc.md (4 additions & 0 deletions)

````diff
@@ -51,3 +51,7 @@ cd {workspace}/jobs/{job_id}/pytorch && git diff > {workspace}/jobs/{job_id}/fix
 ```
 
 IMPORTANT: Always generate report.md before finishing. Generate fix.diff if you made any code changes.
+
+### Verify & PR (if instructed)
+If you made code changes and were asked to create a PR, use the verify-fix
+skill to run tests and perf checks, then use the make-pr skill to create a draft PR.
````

prompts/investigate.md (4 additions & 0 deletions)

````diff
@@ -83,3 +83,7 @@ cd {workspace}/jobs/{job_id}/pytorch && git diff > {workspace}/jobs/{job_id}/fix
 ```
 
 IMPORTANT: Always generate both report.md and fix.diff before finishing.
+
+### 6. Verify & PR (if instructed)
+If your fix is verified and you were asked to create a PR, use the verify-fix
+skill to run tests and perf checks, then use the make-pr skill to create a draft PR.
````

ptq/agent.py (5 additions & 1 deletion)

```diff
@@ -119,6 +119,10 @@ def _print_stream_event(line: str) -> None:
                 console.print(
                     f"  [cyan]glob[/cyan] [dim]{inp.get('pattern', '')}[/dim]"
                 )
+            case "Skill":
+                console.print(
+                    f"  [magenta]skill[/magenta] [dim]{inp.get('skill', '')}[/dim]"
+                )
             case _:
                 console.print(f"  [dim]{tool}[/dim]")
         case "user":
@@ -324,7 +328,7 @@ def launch_agent(
         f"claude -p '{escaped_message}' "
         f"--model {model} "
         f"--max-turns {max_turns} "
-        f"--allowedTools 'Read,Edit,Write,Bash,Grep,Glob' "
+        f"--allowedTools 'Read,Edit,Write,Bash,Grep,Glob,Skill' "
         f"--dangerously-skip-permissions "
         f"--append-system-prompt-file {job_dir}/system_prompt.md "
         f"--output-format stream-json "
```
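For context, `_print_stream_event` consumes Claude Code's `--output-format stream-json`, which emits one JSON event per line; a simplified reader in the same spirit (event shape abbreviated, names illustrative):

```python
import json

def iter_tool_uses(lines):
    # Yield (tool_name, tool_input) for each assistant tool_use event,
    # skipping non-JSON noise and other event types.
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        if event.get("type") != "assistant":
            continue
        for block in event.get("message", {}).get("content", []):
            if block.get("type") == "tool_use":
                yield block.get("name"), block.get("input", {})
```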
