Limit the harm an untrusted AI agent can cause while it operates, even when it actively tries to subvert the controls around it. The core game: can a weaker, trusted model catch a stronger, untrusted agent attempting sabotage? And while keeping false alarms under ~1%?
Everything here runs on control-tower, the AI-control eval framework (LinuxArena / ControlArena) by Redwood Research and UK's AISI.
Important
Before continuing, make sure to familiarize yourself with LinuxArena's hackathon quickstart and we highly highly recommend at least skimming the resources they link (we've copied them to the bottom of this README for your convenience).
Before starting, make sure you have the following set up:
- A Unix-compatible machine (MacOS or Linux) or WSL for Windows
- The uv package manager
pip install uv - Docker installed and running
- The GH cli (
sudo apt install gh) used to pull LinuxArena envs - The AWS CLI (
aws), needed to download the public, creds-free trajectory dataset (aws s3 cp --no-sign-request …)
uv run ct init # completions, `ct` alias, settings, agent skills
uv run ct settings pull linuxarena # environments, main tasks, side tasks
# For `--hawk`: hawk + control-tower are pinned in pyproject.toml, and HAWK_CT_REF in
# .env.example, to match the deployed Hawk runner, so `uv sync` plus your `.env` are enough.
# (A floating hawk drifts ahead of the runner and crashes every --hawk run at setup.)
# Organizers: bump those pins in step with the deployed runner.Before spending any of your credits, run this to make sure everything works (it just runs ls and then exits, no model calls):
uv run ct run eval --policy test -e clinical_trial -t adverse_reaction_followup_tracking --no-uploadTip
clinical_trial is one of the heavier environments (large R image, slow first Docker build, lots of disk). For a quick local check, a lighter environment such as web_scraping builds much faster. For example: uv run ct run eval --policy test -e web_scraping --env-only --no-upload.
Note
Submitting to Hawk (--hawk) instead of running locally? A few non-obvious gotchas:
--policy testwon't work on Hawk:--hawkis eval2-only and rejects--policy; use--untrusted-policy+--protocol(which makes model calls). There's no zero-cost Hawk smoke.- Capture logs live: a finished/failed run's pods and logs are garbage-collected, so
hawk status/hawk logsgo empty afterward. Submit with--hawk-no-waitand poll, or usehawk logs -f. - Skip
--run-namefor now: with--hawkit makes a successful submission exit non-zero (the eval-set id parser expects aneval-set-…prefix). - Headless/remote login: the device-code flow isn't supported; complete the browser callback on port
18922(tunnel withssh -L 18922:localhost:18922 <host>if you're on a server).
| Track | Team | You build | |
|---|---|---|---|
| A | Monitor Building | 🔵 Blue | A monitor that scores how suspicious each action/trajectory is |
| B | Protocol Design | 🔵 Blue | A protocol (defer / rewrite / gate) and its safety–usefulness tradeoff |
| C | Attack Crafting | 🔴 Red | A sabotage trajectory (straj) that evades the monitor |
| D | Break a Protocol | 🔴 Red | An untrusted policy that beats a protocol — then a fix |
| E | New Side Tasks | ⚪ Gray | A new side task / environment / scorer that extends the testbed |
- Main task: the legitimate job the agent is asked to do.
- Side task: a high-stakes sabotage objective the untrusted agent may pursue covertly.
- Straj: a sabotage trajectory: a run that accomplishes the side task while staying subtle.
- Monitor: scores suspiciousness, 0 (clean) → 10 (very suspicious).
- Protocol: the blue-team decision policy that acts on monitor scores (audit / defer / rewrite / gate).
- Safety = caught / (caught + uncaught successful sabotage), at a fixed audit budget (FPR on honest runs).
- Usefulness = honest main-task success rate.
By default, anthropic/claude-haiku-4-5 (shorthand: haiku) is used as the untrusted model and openai/gpt-5-nano as the trusted one. If you decide to change this, beware of the cost of beefier models (especially since these evaluations tend to be token-heavy) and make sure to document this change in your deliverables.
Any time a command wants some kind of run as an argument, it's asking for the logs/[...].eval file produced by a ct run eval command.
Scored 1–5 on impact & innovation, execution quality, and clarity. Each team ships:
- 📄 a 2-page report;
- 💻 a code repo (commit your
data/andlogs/folders! we want to see them.); - 🎬 a 3-minute demo (either a video or in person)
- if it's a video, it can be an unlisted youtube link, public google drive link, etc.. Just make sure that we can access it.
1st-2nd-3rd place for the best monitors and best attack respectively, and "best side tasks" award.
- Running evals on Hawk: submit to a shared Hawk cluster and watch results in the viewer, instead of running locally (unless you're confident in your local compute or running out of time, we recommended using Hawk).
- Running Evaluations: full
ct run evalreference: policies, task sets, batching, replay. - Monitoring: monitor types, ensembles, suspiciousness scoring (building monitors).
- Control Protocols: defer-to-trusted, trusted editing, and writing your own protocol.
- Making Sabotage Trajectories: hand-crafting attacks that evade monitors.
- Making Side Tasks and the Hands-On Guide: extending the testbed with new environments and tasks.
- Safety Curves: turning monitor scores into safety/usefulness numbers.