
Commit 2b8e221 (wdvr and claude committed)

Add ptq auto: full GPU lifecycle with skills and Dockerfile support

- ptq auto: single command that reserves GPU via gpu-dev, sets up workspace, launches agent, fetches results, and cancels reservation on exit
- Skills: verify-fix (test + perf checks) and make-pr (draft PR creation) baked into Docker image via --dockerfile to gpu-dev
- Docker context: generates Dockerfile extending gpu-dev base image with skills at /home/dev/.claude/skills/
- Provision: thin wrapper around gpu-dev-cli (lazy import, optional dep)
- SSH hardening: retry on 255 (proxy flakiness), base64 file transfer (scp unreliable over WebSocket proxy), _extract_hash for noisy output
- Agent: Skill tool enabled in allowedTools + display in stream output
- Job tracking: save/get reservation_id for clean command integration
- CLI updates: list shows reservation column, clean cancels reservations, --prompt/-p for extra agent guidance
- 91 tests, all passing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1 parent: d13defe

17 files changed: 1010 additions & 103 deletions
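The SSH-hardening bullets (retry on 255, base64 file transfer) can be sketched roughly like this; the function names and backoff policy are illustrative, not the actual ptq code:

```python
import base64
import shlex
import subprocess
import time

def transfer_command(data: bytes, remote_path: str) -> str:
    # scp is unreliable over the WebSocket proxy, so the file's bytes ride
    # inside the SSH command itself as base64 and are decoded remotely.
    payload = base64.b64encode(data).decode()
    return f"echo {payload} | base64 -d > {shlex.quote(remote_path)}"

def ssh_run(host: str, command: str, retries: int = 3) -> subprocess.CompletedProcess:
    # Exit code 255 means the SSH transport failed (flaky proxy), not that
    # the remote command failed, so only 255 is retried, with backoff.
    for attempt in range(retries):
        proc = subprocess.run(["ssh", host, command], capture_output=True, text=True)
        if proc.returncode != 255:
            return proc
        time.sleep(2 ** attempt)
    raise RuntimeError(f"ssh to {host} kept failing with exit code 255")
```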


README.md (155 additions & 82 deletions)
# ptq — PyTorch Job Queue

CLI tool that dispatches Claude agents to autonomously investigate and fix PyTorch GitHub issues on GPU machines. The agent reproduces the bug, finds the root cause, applies a fix, and produces a report and diff.

## Install

```bash
cd pt_job_queue
uv pip install -e .

# With GPU auto-provisioning support (optional)
uv pip install -e '.[gpu]'
```

## Quick start

### Fully automatic (recommended)

```bash
# Reserve a GPU, set up workspace, run agent, fetch results — all in one command
ptq auto --issue 149002

# With options
ptq auto --issue 149002 --gpu-type a100 --hours 2
ptq auto --issue 149002 -p "focus on the inductor nan_assert codepath"
ptq auto --issue 149002 --no-pr --no-follow
```

`ptq auto` handles the full lifecycle:

1. Fetches the GitHub issue (fails fast before reserving a GPU)
2. Reserves a GPU pod via gpu-dev
3. Sets up the workspace (uv, PyTorch nightly, source clone)
4. Launches a Claude agent to investigate and fix the issue
5. Fetches results when the agent finishes
6. Cancels the reservation (in `--follow` mode; in `--no-follow`, the reservation auto-expires)

The `--hours` flag (default: 4) is a max cap — the reservation is cancelled as soon as the agent finishes.
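Conceptually, the cleanup guarantee in step 6 is a try/finally around the reservation; a minimal sketch (helper names are illustrative, not ptq's actual API):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Reservation:
    id: str
    host: str

def run_auto(
    reserve: Callable[[], Reservation],
    steps: list[Callable[[Reservation], None]],
    cancel: Callable[[Reservation], None],
) -> None:
    # Reserve first, then run the pipeline; `finally` guarantees the GPU
    # is released even when a step (setup, agent, fetch) raises.
    res = reserve()
    try:
        for step in steps:
            step(res)
    finally:
        cancel(res)
```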
### Manual workflow

If you already have a GPU machine provisioned:

```bash
# 1. One-time setup
ptq setup my-gpu-box

# 2. Run an investigation
ptq run --issue 174923 --machine my-gpu-box

# 3. View results
ptq results 174923

# 4. Apply the fix locally
ptq apply 174923 --pytorch-path ~/pytorch
```

## Commands

### `ptq auto` — Full auto mode

Reserves a GPU, runs the full pipeline, and cleans up.

```bash
ptq auto --issue 149002                               # investigate a GitHub issue
ptq auto --issue 149002 --gpu-type a100 --hours 2     # custom GPU and duration
ptq auto --issue 149002 -p "check flex_attention.py"  # extra guidance for agent
ptq auto --issue 149002 --no-pr                       # skip PR creation
ptq auto -m "benchmark torch.compile on H100" -h 6    # freeform task
ptq auto --issue 149002 --no-follow                   # launch in background
```

| Option | Default | Description |
|--------|---------|-------------|
| `--issue` | | GitHub issue number |
| `--message`, `-m` | | Freeform task (instead of issue) |
| `--input`, `-i` | | Read task from file |
| `--prompt`, `-p` | | Extra guidance passed to the agent |
| `--gpu-type` | `h100` | GPU type (`h100`, `a100`, `a10g`, `t4`, etc.) |
| `--gpus` | `1` | Number of GPUs |
| `--hours` | `4.0` | Max reservation hours (auto-cancels when done) |
| `--no-pr` | | Skip PR creation after fix |
| `--model` | `opus` | Claude model |
| `--max-turns` | `100` | Max agent turns |
| `--follow/--no-follow` | `follow` | Stream output or run in background |
| `--dockerfile` | | Custom Dockerfile for the GPU pod |
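When `--dockerfile` is not given, ptq generates a Docker context that (per the commit message) extends the gpu-dev base image and bakes the skills in. A sketch of that generation, with the function name and base image tag assumed:

```python
def render_dockerfile(base_image: str) -> str:
    # Layer the skills onto the gpu-dev base image at the standard Claude
    # skills path, so the agent can invoke them via the Skill tool.
    return (
        f"FROM {base_image}\n"
        "COPY skills/ /home/dev/.claude/skills/\n"
    )
```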
### `ptq setup` — Workspace setup

One-time setup on a remote machine or locally.

```bash
ptq setup my-gpu-box               # remote (auto-detects CUDA)
ptq setup my-gpu-box --cuda cu130  # explicit CUDA version
ptq setup --local --cpu            # local (for testing)
```

### `ptq run` — Launch agent

Launch a Claude agent on a pre-provisioned machine.

```bash
ptq run --issue 174923 --machine my-gpu-box
ptq run -m "investigate OOM in flex_attention" --machine gpu-dev
ptq run -i task.md --machine gpu-dev
ptq run 174923 -m "try a different approach"  # re-run with steering
```

### `ptq results` — Fetch results

```bash
ptq results 174923
ptq results 20260214-174923
```

Fetches `report.md`, `fix.diff`, `worklog.md`, and the agent log.

### `ptq apply` — Apply fix locally

```bash
ptq apply 174923 --pytorch-path ~/pytorch
```

Creates branch `ptq/{issue_number}`, applies the diff, prints next steps for PR.

### `ptq list` — List all jobs

```bash
ptq list
```

Shows status (running/stopped), job ID, issue number, run count, target machine, and reservation ID (if provisioned via `ptq auto`).

### `ptq peek` — Check progress

```bash
ptq peek 174923           # show worklog
ptq peek 174923 --log 30  # also show last 30 log lines
```

### `ptq status` — Check if agent is running

```bash
ptq status 174923
```

### `ptq kill` — Stop an agent

```bash
ptq kill 174923
```

### `ptq clean` — Remove jobs

```bash
ptq clean 174923               # single job
ptq clean my-gpu-box           # all stopped jobs on machine
ptq clean my-gpu-box --keep 2  # keep 2 most recent
ptq clean my-gpu-box --all     # include running jobs
ptq clean --local              # local workspace
```

When cleaning a job created by `ptq auto`, the associated GPU reservation is also cancelled.
## How the agent works

The agent receives a system prompt with the issue context, a workspace with PyTorch nightly + source, and a repro script extracted from the issue. It then:

1. **Reproduces** the bug using the repro script
2. **Investigates** by reading PyTorch source in an isolated git worktree
3. **Fixes** with minimal Python-only edits
4. **Tests** by syncing edits to site-packages and re-running the repro
5. **Outputs** `report.md` and `fix.diff`
6. **Verifies & creates PR** (if instructed) using built-in skills

### Skills

The agent has access to two skills (baked into the workspace):

- **verify-fix**: Runs the repro to confirm the fix, executes related regression tests, and does a before/after performance comparison.
- **make-pr**: Creates a `ptq/{issue}` branch, commits the fix, pushes, and creates a draft PR via `gh pr create --draft`.
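A Claude Code skill is a directory containing a `SKILL.md`; a rough sketch of what `verify-fix`'s file could look like (the frontmatter fields are the standard ones, but the body here is paraphrased from the description above, not the shipped skill):

```markdown
---
name: verify-fix
description: Confirm a fix by re-running the repro, running related regression tests, and comparing performance before/after.
---

1. Re-run the repro script and confirm the original failure is gone.
2. Run the regression tests closest to the files the fix touched.
3. Time the repro before and after the fix and report the difference.
```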
## GPU provisioning (gpu-dev integration)

`ptq auto` uses [gpu-dev-cli](https://github.com/wdvr/osdc) to manage GPU reservations. Requirements:

1. `gpu-dev-cli` installed (`pip install 'ptq[gpu]'`)
2. AWS credentials configured for gpu-dev
3. GitHub username set: `gpu-dev config set github_user <username>`
4. SSH config includes gpu-dev: the CLI handles this automatically

The reservation uses `no_persistent_disk=True` since results are fetched before cancellation.
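The "lazy import, optional dep" pattern behind `ptq/provision.py` can be sketched generically (the helper name is illustrative; ptq's wrapper is specific to gpu-dev-cli):

```python
import importlib

def lazy_import(module: str, install_hint: str):
    # Defer the import until provisioning is actually requested, so plain
    # `ptq run` works without the optional [gpu] extra installed.
    try:
        return importlib.import_module(module)
    except ImportError as exc:
        raise RuntimeError(f"{module} is required: {install_hint}") from exc
```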
## Project layout

```
pt_job_queue/
├── pyproject.toml
├── ptq/
│   ├── cli.py           # Typer CLI (setup, run, auto, results, etc.)
│   ├── ssh.py           # SSH + local subprocess backends
│   ├── issue.py         # GitHub issue fetching via gh
│   ├── job.py           # Job ID generation + local state tracking
│   ├── workspace.py     # Remote workspace setup (uv, PyTorch, git)
│   ├── agent.py         # Agent prompt construction + Claude Code launch
│   ├── results.py       # Fetch + display results
│   ├── apply.py         # Apply diff to local pytorch checkout
│   ├── provision.py     # GPU reservation lifecycle (gpu-dev wrapper)
│   └── docker.py        # Dockerfile generation + build context
├── prompts/
│   ├── investigate.md   # System prompt for issue investigations
│   └── adhoc.md         # System prompt for freeform tasks
├── skills/
│   ├── verify-fix/SKILL.md  # Testing + perf verification skill
│   └── make-pr/SKILL.md     # Draft PR creation skill
├── scripts/
│   └── apply_to_site_pkgs.sh
└── tests/
    ├── test_agent.py
    ├── test_cli.py
    ├── test_docker.py
    ├── test_job.py
    ├── test_provision.py
    ├── test_issue.py
    ├── test_results.py
    └── test_workspace.py
```

## Workspace layout (on remote/local)

prompts/adhoc.md (4 additions & 0 deletions)

````diff
@@ -51,3 +51,7 @@ cd {workspace}/jobs/{job_id}/pytorch && git diff > {workspace}/jobs/{job_id}/fix
 ```
 
 IMPORTANT: Always generate report.md before finishing. Generate fix.diff if you made any code changes.
+
+### Verify & PR (if instructed)
+If you made code changes and were asked to create a PR, use the verify-fix
+skill to run tests and perf checks, then use the make-pr skill to create a draft PR.
````

prompts/investigate.md (4 additions & 0 deletions)

````diff
@@ -83,3 +83,7 @@ cd {workspace}/jobs/{job_id}/pytorch && git diff > {workspace}/jobs/{job_id}/fix
 ```
 
 IMPORTANT: Always generate both report.md and fix.diff before finishing.
+
+### 6. Verify & PR (if instructed)
+If your fix is verified and you were asked to create a PR, use the verify-fix
+skill to run tests and perf checks, then use the make-pr skill to create a draft PR.
````

ptq/agent.py (5 additions & 1 deletion)

```diff
@@ -119,6 +119,10 @@ def _print_stream_event(line: str) -> None:
                 console.print(
                     f"  [cyan]glob[/cyan] [dim]{inp.get('pattern', '')}[/dim]"
                 )
+            case "Skill":
+                console.print(
+                    f"  [magenta]skill[/magenta] [dim]{inp.get('skill', '')}[/dim]"
+                )
             case _:
                 console.print(f"  [dim]{tool}[/dim]")
         case "user":
@@ -324,7 +328,7 @@ def launch_agent(
         f"claude -p '{escaped_message}' "
         f"--model {model} "
         f"--max-turns {max_turns} "
-        f"--allowedTools 'Read,Edit,Write,Bash,Grep,Glob' "
+        f"--allowedTools 'Read,Edit,Write,Bash,Grep,Glob,Skill' "
         f"--dangerously-skip-permissions "
         f"--append-system-prompt-file {job_dir}/system_prompt.md "
         f"--output-format stream-json "
```
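For context, `_print_stream_event` consumes Claude Code's `--output-format stream-json`, which emits one JSON event per line; a simplified reader in the same spirit (event shape abbreviated, names illustrative):

```python
import json

def iter_tool_uses(lines):
    # Yield (tool_name, tool_input) for each assistant tool_use event,
    # skipping non-JSON noise and other event types.
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        if event.get("type") != "assistant":
            continue
        for block in event.get("message", {}).get("content", []):
            if block.get("type") == "tool_use":
                yield block.get("name"), block.get("input", {})
```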
