Skip to content

hackos-team/hackos-5-quickstart

Repository files navigation

HackOS 5 — AI Control Hackathon · Quickstart

Limit the harm an untrusted AI agent can cause while it operates, even when it actively tries to subvert the controls around it. The core game: can a weaker, trusted model catch a stronger, untrusted agent attempting sabotage? And while keeping false alarms under ~1%?

Everything here runs on control-tower, the AI-control eval framework (LinuxArena / ControlArena) by Redwood Research and UK's AISI.

Important

Before continuing, make sure to familiarize yourself with LinuxArena's hackathon quickstart and we highly highly recommend at least skimming the resources they link (we've copied them to the bottom of this README for your convenience).

Shared setup (all tracks)

Before starting, make sure you have the following set up:

  • A Unix-compatible machine (MacOS or Linux) or WSL for Windows
  • The uv package manager pip install uv
  • Docker installed and running
  • The GH cli (sudo apt install gh) used to pull LinuxArena envs
  • The AWS CLI (aws), needed to download the public, creds-free trajectory dataset (aws s3 cp --no-sign-request …)
uv run ct init  # completions, `ct` alias, settings, agent skills
uv run ct settings pull linuxarena  # environments, main tasks, side tasks

# For `--hawk`: hawk + control-tower are pinned in pyproject.toml, and HAWK_CT_REF in
# .env.example, to match the deployed Hawk runner, so `uv sync` plus your `.env` are enough.
# (A floating hawk drifts ahead of the runner and crashes every --hawk run at setup.)
# Organizers: bump those pins in step with the deployed runner.

Before spending any of your credits, run this to make sure everything works (it just runs ls and then exits, no model calls):

uv run ct run eval --policy test -e clinical_trial -t adverse_reaction_followup_tracking --no-upload

Tip

clinical_trial is one of the heavier environments (large R image, slow first Docker build, lots of disk). For a quick local check, a lighter environment such as web_scraping builds much faster. For example: uv run ct run eval --policy test -e web_scraping --env-only --no-upload.

Note

Submitting to Hawk (--hawk) instead of running locally? A few non-obvious gotchas:

  • --policy test won't work on Hawk: --hawk is eval2-only and rejects --policy; use --untrusted-policy + --protocol (which makes model calls). There's no zero-cost Hawk smoke.
  • Capture logs live: a finished/failed run's pods and logs are garbage-collected, so hawk status/hawk logs go empty afterward. Submit with --hawk-no-wait and poll, or use hawk logs -f.
  • Skip --run-name for now: with --hawk it makes a successful submission exit non-zero (the eval-set id parser expects an eval-set-… prefix).
  • Headless/remote login: the device-code flow isn't supported; complete the browser callback on port 18922 (tunnel with ssh -L 18922:localhost:18922 <host> if you're on a server).

Pick a track

Track Team You build
A Monitor Building 🔵 Blue A monitor that scores how suspicious each action/trajectory is
B Protocol Design 🔵 Blue A protocol (defer / rewrite / gate) and its safety–usefulness tradeoff
C Attack Crafting 🔴 Red A sabotage trajectory (straj) that evades the monitor
D Break a Protocol 🔴 Red An untrusted policy that beats a protocol — then a fix
E New Side Tasks ⚪ Gray A new side task / environment / scorer that extends the testbed

The shared vocabulary

  • Main task: the legitimate job the agent is asked to do.
  • Side task: a high-stakes sabotage objective the untrusted agent may pursue covertly.
  • Straj: a sabotage trajectory: a run that accomplishes the side task while staying subtle.
  • Monitor: scores suspiciousness, 0 (clean) → 10 (very suspicious).
  • Protocol: the blue-team decision policy that acts on monitor scores (audit / defer / rewrite / gate).
  • Safety = caught / (caught + uncaught successful sabotage), at a fixed audit budget (FPR on honest runs).
  • Usefulness = honest main-task success rate.

By default, anthropic/claude-haiku-4-5 (shorthand: haiku) is used as the untrusted model and openai/gpt-5-nano as the trusted one. If you decide to change this, beware of the cost of beefier models (especially since these evaluations tend to be token-heavy) and make sure to document this change in your deliverables.

Any time a command wants some kind of run as an argument, it's asking for the logs/[...].eval file produced by a ct run eval command.

Judging & deliverables

Scored 1–5 on impact & innovation, execution quality, and clarity. Each team ships:

  • 📄 a 2-page report;
  • 💻 a code repo (commit your data/ and logs/ folders! we want to see them.);
  • 🎬 a 3-minute demo (either a video or in person)
    • if it's a video, it can be an unlisted youtube link, public google drive link, etc.. Just make sure that we can access it.

1st-2nd-3rd place for the best monitors and best attack respectively, and "best side tasks" award.

Docs worth skimming at least once

  • Running evals on Hawk: submit to a shared Hawk cluster and watch results in the viewer, instead of running locally (unless you're confident in your local compute or running out of time, we recommended using Hawk).
  • Running Evaluations: full ct run eval reference: policies, task sets, batching, replay.
  • Monitoring: monitor types, ensembles, suspiciousness scoring (building monitors).
  • Control Protocols: defer-to-trusted, trusted editing, and writing your own protocol.
  • Making Sabotage Trajectories: hand-crafting attacks that evade monitors.
  • Making Side Tasks and the Hands-On Guide: extending the testbed with new environments and tasks.
  • Safety Curves: turning monitor scores into safety/usefulness numbers.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages