# ptq — PyTorch Job Queue

CLI tool that dispatches Claude agents to autonomously investigate and fix PyTorch GitHub issues on GPU machines. The agent reproduces the bug, finds the root cause, applies a fix, and produces a report and diff.

## Install

```bash
cd pt_job_queue
uv pip install -e .

# With GPU auto-provisioning support (optional)
uv pip install -e '.[gpu]'
```

## Quick start

### Fully automatic (recommended)

```bash
# Reserve a GPU, set up the workspace, run the agent, fetch results — all in one command
ptq auto --issue 149002

# With options
ptq auto --issue 149002 --gpu-type a100 --hours 2
ptq auto --issue 149002 -p "focus on the inductor nan_assert codepath"
ptq auto --issue 149002 --no-pr --no-follow
```

`ptq auto` handles the full lifecycle:

1. Fetches the GitHub issue (failing fast before reserving a GPU)
2. Reserves a GPU pod via gpu-dev
3. Sets up the workspace (uv, PyTorch nightly, source clone)
4. Launches a Claude agent to investigate and fix the issue
5. Fetches results when the agent finishes
6. Cancels the reservation (in `--follow` mode; with `--no-follow`, the reservation auto-expires instead)

The `--hours` flag (default: 4) is a maximum cap — the reservation is cancelled as soon as the agent finishes.

### Manual workflow

If you already have a GPU machine provisioned:

```bash
# 1. One-time setup
ptq setup my-gpu-box

# 2. Run an investigation
ptq run --issue 174923 --machine my-gpu-box

# 3. View results
ptq results 174923

# 4. Apply the fix locally
ptq apply 174923 --pytorch-path ~/pytorch
```

## Commands

### `ptq auto` — Full auto mode

Reserves a GPU, runs the full pipeline, and cleans up.

```bash
ptq auto --issue 149002                               # investigate a GitHub issue
ptq auto --issue 149002 --gpu-type a100 --hours 2     # custom GPU and duration
ptq auto --issue 149002 -p "check flex_attention.py"  # extra guidance for the agent
ptq auto --issue 149002 --no-pr                       # skip PR creation
ptq auto -m "benchmark torch.compile on H100" -h 6    # freeform task
ptq auto --issue 149002 --no-follow                   # launch in background
```

| Option | Default | Description |
|--------|---------|-------------|
| `--issue` | | GitHub issue number |
| `--message`, `-m` | | Freeform task (instead of an issue) |
| `--input`, `-i` | | Read the task from a file |
| `--prompt`, `-p` | | Extra guidance passed to the agent |
| `--gpu-type` | `h100` | GPU type (`h100`, `a100`, `a10g`, `t4`, etc.) |
| `--gpus` | `1` | Number of GPUs |
| `--hours` | `4.0` | Max reservation hours (auto-cancelled when done) |
| `--no-pr` | | Skip PR creation after the fix |
| `--model` | `opus` | Claude model |
| `--max-turns` | `100` | Max agent turns |
| `--follow/--no-follow` | `follow` | Stream output or run in the background |
| `--dockerfile` | | Custom Dockerfile for the GPU pod |

### `ptq setup` — Workspace setup

One-time setup on a remote machine or locally. Creates a workspace with a `uv`-managed venv running PyTorch nightly, a pytorch source clone at the matching nightly commit, and helper scripts for applying fixes to site-packages.

```bash
ptq setup my-gpu-box               # remote (auto-detects CUDA)
ptq setup my-gpu-box --cuda cu130  # explicit CUDA version
ptq setup --local --cpu            # local (for testing)
```

### `ptq run` — Launch agent

Launches a Claude agent on a pre-provisioned machine.

```bash
ptq run --issue 174923 --machine my-gpu-box
ptq run -m "investigate OOM in flex_attention" --machine gpu-dev
ptq run -i task.md --machine gpu-dev
ptq run 174923 -m "try a different approach"   # re-run with steering
```

Re-running the same issue reuses the existing worktree and preserves prior edits; each run gets its own log (`claude-1.log`, `claude-2.log`, ...). Different issues run concurrently via separate git worktrees.

### `ptq results` — Fetch results

```bash
ptq results 174923            # by issue number (uses the most recent job)
ptq results 20260214-174923   # by full job ID
```

Fetches `report.md`, `fix.diff`, `worklog.md`, and the agent log.
80 | 116 |
|
81 | | -### 5. Apply the fix |
| 117 | +### `ptq apply` — Apply fix locally |
82 | 118 |
|
83 | 119 | ```bash |
84 | | -ptq apply 174923 --pytorch-path ~/meta/pytorch |
| 120 | +ptq apply 174923 --pytorch-path ~/pytorch |
85 | 121 | ``` |
86 | 122 |
|
87 | | -Creates a branch `ptq/{issue_number}`, applies the diff, and prints next steps for creating a PR. |
| 123 | +Creates branch `ptq/{issue_number}`, applies the diff, prints next steps for PR. |
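
Under the hood this is roughly equivalent to the following git sequence (a sketch; `ptq/apply.py` may differ in detail):

```bash
cd ~/pytorch                  # your local checkout
git switch -c ptq/174923      # branch named after the issue
git apply --check fix.diff    # dry run: verify the diff applies cleanly
git apply fix.diff            # apply the agent's fix
git status --short            # review the changed files
```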

### `ptq list` — List all jobs

```bash
ptq list
```

Shows status (running/stopped), job ID, issue number, run count, target machine, and reservation ID (if provisioned via `ptq auto`).

### `ptq peek` — Check progress

```bash
ptq peek 174923           # show the worklog
ptq peek 174923 --log 30  # also show the last 30 log lines
```

The agent maintains a `worklog.md` with an entry after each significant step, so you can check progress without streaming the full output.

### `ptq status` — Check if an agent is running

```bash
ptq status 174923
```

### `ptq kill` — Stop an agent

```bash
ptq kill 174923
```

### `ptq clean` — Remove jobs

```bash
ptq clean 174923               # single job
ptq clean my-gpu-box           # all stopped jobs on a machine
ptq clean my-gpu-box --keep 2  # keep the 2 most recent
ptq clean my-gpu-box --all     # include running jobs
ptq clean --local              # local workspace
```

Removes job directories and prunes git worktrees. When cleaning a job created by `ptq auto`, the associated GPU reservation is also cancelled.

## How the agent works

The agent receives a system prompt with the issue context, a workspace with PyTorch nightly plus source, and a repro script extracted from the issue. It then:

1. **Reproduces** the bug using the repro script
2. **Investigates** by reading PyTorch source in an isolated git worktree
3. **Fixes** with minimal Python-only edits
4. **Tests** by syncing its edits to site-packages and re-running the repro
5. **Outputs** `report.md` and `fix.diff`
6. **Verifies and creates a PR** (if instructed) using built-in skills
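
The repro script in step 1 is pulled from fenced code blocks in the issue body. A minimal sketch of that extraction (the real implementation lives in `ptq/issue.py` and is an assumption here):

```python
import re

def extract_repro(issue_body: str) -> str:
    """Collect the contents of all ```python fenced blocks in an issue body."""
    blocks = re.findall(r"```(?:python|py)\n(.*?)```", issue_body, re.DOTALL)
    return "\n\n".join(block.strip() for block in blocks)
```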

### Skills

The agent has access to two skills baked into the workspace:

- **verify-fix**: runs the repro to confirm the fix, executes related regression tests, and does a before/after performance comparison.
- **make-pr**: creates a `ptq/{issue}` branch, commits the fix, pushes, and opens a draft PR via `gh pr create --draft`.

## GPU provisioning (gpu-dev integration)

`ptq auto` uses [gpu-dev-cli](https://github.com/wdvr/osdc) to manage GPU reservations. Requirements:

1. `gpu-dev-cli` installed (`pip install 'ptq[gpu]'`)
2. AWS credentials configured for gpu-dev
3. GitHub username set: `gpu-dev config set github_user <username>`
4. SSH config entries for gpu-dev hosts (the CLI handles this automatically)

The reservation is made with `no_persistent_disk=True`, since results are fetched before the reservation is cancelled.

## Project layout

```
pt_job_queue/
├── pyproject.toml
├── ptq/
│   ├── cli.py           # Typer CLI (setup, run, auto, results, etc.)
│   ├── ssh.py           # SSH + local subprocess backends
│   ├── issue.py         # GitHub issue fetching via gh
│   ├── job.py           # Job ID generation + local state tracking
│   ├── workspace.py     # Remote workspace setup (uv, PyTorch, git)
│   ├── agent.py         # Agent prompt construction + Claude Code launch
│   ├── results.py       # Fetch + display results
│   ├── apply.py         # Apply diff to local pytorch checkout
│   ├── provision.py     # GPU reservation lifecycle (gpu-dev wrapper)
│   └── docker.py        # Dockerfile generation + build context
├── prompts/
│   ├── investigate.md   # System prompt for issue investigations
│   └── adhoc.md         # System prompt for freeform tasks
├── skills/
│   ├── verify-fix/SKILL.md  # Testing + perf verification skill
│   └── make-pr/SKILL.md     # Draft PR creation skill
├── scripts/
│   └── apply_to_site_pkgs.sh
└── tests/
    ├── test_agent.py
    ├── test_cli.py
    ├── test_docker.py
    ├── test_job.py
    ├── test_provision.py
    ├── test_issue.py
    ├── test_results.py
    └── test_workspace.py
```
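
`scripts/apply_to_site_pkgs.sh` is what lets the agent test edits without rebuilding PyTorch: changed files are copied from the source worktree over the installed package. A minimal sketch of the idea, with assumed paths (the real script may differ):

```bash
WORKTREE=~/ptq_workspace/worktrees/174923/pytorch   # assumed worktree path
SITE_TORCH=$(python -c 'import os, torch; print(os.path.dirname(torch.__file__))')

# Copy every Python file the agent has edited over the installed copy
# (assumes the corresponding subdirectories already exist in site-packages)
for f in $(git -C "$WORKTREE" diff --name-only -- 'torch/*.py'); do
    cp "$WORKTREE/$f" "$SITE_TORCH/${f#torch/}"
done
```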

## Workspace layout (on remote/local)