v2 adapter hardening, shared git server, submission-based patch by akhatua2 · Pull Request #48 · cooperbench/CooperBench

akhatua2 · 2026-04-30T21:19:28Z

Summary

Stabilizes the mini_swe_agent_v2 adapter and removes a pile of latent
bugs that were silently degrading coop+git runs. Four commits, each
self-contained:

- cli: auto-load ./.env. cooperbench itself never called
  load_dotenv(), so project-local OPENAI_API_KEY etc. only worked
  when users manually exported them. One-liner at the top of cli.py.
- mini_swe_agent_v2: harden adapter, split mini.yaml into solo/coop,
  submit via patch.txt. The bulk of the change:
  - Adapter accepts **kwargs; wires up the previously-dead
    agent_config flag; sanitizes content=None from tool-calling
    turns (CooperBench's _extract_conversation does
    "send_message" in content and raises TypeError on None); drops
    the dead SEND_MESSAGE_TOOL import.
  - Drops _get_patch() / working-tree extraction; the patch now
    comes verbatim from result['submission']. No fallback: if the
    agent didn't submit, there's no patch. This is the upstream
    mini-swe-agent SWE-bench design.
  - Splits mini.yaml (single file with {% if agent_id %} branches)
    into solo.yaml + coop.yaml; fixes a leak in the solo branch
    that mentioned a non-existent colleague.
  - Updates the prompts to instruct the agent to curate
    git diff -- file1 file2 > patch.txt, cat patch.txt, then
    echo COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT && cat patch.txt,
    mirroring upstream's three-step submission flow.
- mini_swe_agent_v2: shared singleton git server, path-prefixed
  per-run repos. Replaces the per-run apt install git + sleep 3
  container with a Redis-style singleton: one cooperbench-git
  container running git daemon on a shared cooperbench bridge
  network, with per-run isolation via path namespacing
  (/git/<run_id>/repo.git). Auto-created on first use, idempotent
  thereafter. Also adds a typed network field to
  DockerEnvironmentConfig so the --network kwarg actually reaches
  docker run (previously Pydantic silently dropped it).
- mini_swe_agent_v2: serialize() no longer mutates _segments.
  DefaultAgent.run() calls save() in its finally clause every
  step, and save → serialize → _close_current_segment was appending
  a fresh solver segment per save. Result: _full_traj.json
  ballooned with overlapping post-compaction snapshots. Fix:
  serialize() snapshots locally without mutating state.

Verification

All four CI checks pass locally on the branch:

uv run --extra dev ruff check src/cooperbench/ — clean
uv run --extra dev ruff format --check src/cooperbench/ — clean
uv run --extra dev mypy src/cooperbench/ — clean
uv run --extra dev pytest tests/ -v — 155 passed, 63 skipped

End-to-end coop+git run against a Modal-served Qwen3.5-9B endpoint with
the new pipeline (cooperbench run -n test -r dottxt_ai_outlines_task -t 1655 -f 6,7 -m openai/Qwen/Qwen3.5-9B -a mini_swe_agent_v2 --git):
both agents reached Submitted status, and the resulting agent6.patch
and agent7.patch contain only the modified source file (no scratch
test scripts), which shows the curated-submission flow working as intended.

Test plan

Lint clean (ruff check, ruff format --check)
mypy clean
pytest unit suite passes
End-to-end coop+git smoke run with Modal-served Qwen3.5-9B
CI green on this PR

mini_swe_agent_v2 already loads dotenv from a global config dir (~/.config/mini-swe-agent/.env), but cooperbench itself never loaded the project-local .env, so OPENAI_API_KEY etc. only made it through when the user manually exported them. Calling dotenv.load_dotenv() at the top of cli.py auto-loads ./.env from cwd before any env-var-dependent imports run, matching how projects with python-dotenv conventionally pick up local config.
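The load order described above can be illustrated with a small sketch. It does not use python-dotenv itself: `load_env_file` is a hypothetical stdlib-only stand-in for `dotenv.load_dotenv()`, written here only to show the behaviour that matters (values from `./.env` fill in missing variables, while variables the user already exported win by default).

```python
import os
from pathlib import Path


def load_env_file(path: Path, override: bool = False) -> dict[str, str]:
    """Hypothetical stand-in for dotenv.load_dotenv(): copy KEY=VALUE lines into os.environ."""
    loaded: dict[str, str] = {}
    if not path.is_file():
        return loaded  # a missing .env is not an error
    for raw in path.read_text().splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip().strip("'\"")
        # Already-exported variables take precedence unless override is set.
        if override or key not in os.environ:
            os.environ[key] = value
            loaded[key] = value
    return loaded


# Assumed usage: call this before importing anything that reads os.environ.
# load_env_file(Path.cwd() / ".env")
```

In the actual change the call is `dotenv.load_dotenv()` at the top of cli.py; the point of the sketch is only that it has to run before any import that reads the environment.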

…bmit via patch.txt

Adapter:
- accept **kwargs so unknown caller-side args don't crash run()
- wire up the agent_config CLI flag that was previously listed in the signature but never read; load YAML and deep-merge the config: block over the defaults
- sanitize content=None on tool-calling assistant turns before returning AgentResult.messages (CooperBench's downstream coop runner does '"send_message" in content', which raises TypeError on None)
- drop the dead SEND_MESSAGE_TOOL import (only BASH_TOOL is registered; send_message is intercepted from inside the bash command string)
- drop _get_patch() and the base_commit capture; the patch now comes straight from result['submission'] (no working-tree extraction fallback: if the agent didn't submit, there is no patch)

Config:
- delete config/mini.yaml (was a single file with {% if agent_id %} branches handling both solo and coop) and split into config/solo.yaml and config/coop.yaml; adapter picks based on is_coop = len(agents) > 1
- fix a leak in the solo branch where the CRITICAL REQUIREMENTS section still mentioned 'send_message to your colleague' even when the agent has no colleague
- replace the bare 'echo COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT' submit step with the upstream mini-swe-agent SWE-bench three-step flow: curate via 'git diff -- file1 file2 > patch.txt', verify via cat, submit via 'echo COMPLETE... && cat patch.txt'. The submission field then carries the patch verbatim.
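Three of the adapter changes above are small enough to sketch in isolation. The following is a minimal illustration, not the adapter's code: the function names `sanitize_messages`, `deep_merge`, and `pick_config_name` are hypothetical, and the config keys used in the usage note are made up for the example.

```python
from copy import deepcopy
from typing import Any


def sanitize_messages(messages: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return copies of the messages with content=None (or missing) replaced by ''."""
    return [{**m, "content": ""} if m.get("content") is None else dict(m) for m in messages]


def deep_merge(defaults: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Recursively lay `override` over `defaults` without mutating either input."""
    merged = deepcopy(defaults)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)  # merge nested blocks
        else:
            merged[key] = deepcopy(value)  # scalars and lists replace the default
    return merged


def pick_config_name(agents: list[str]) -> str:
    """Mirror the is_coop = len(agents) > 1 rule from the description."""
    return "coop.yaml" if len(agents) > 1 else "solo.yaml"
```

With the sanitizer in place, a downstream check such as `"send_message" in m["content"]` no longer raises on tool-calling turns, and a user config like `{"agent": {"step_limit": 50}}` overrides one key while leaving sibling defaults intact.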

… repos

Old design spun up a fresh debian-slim container per coop pair, ran 'apt-get install git' inside it, slept 3 seconds, and returned. Two problems:
- the 3-second sleep was far too short for the apt install (~30-60s on a cold image), so agents' initial 'git push' raced the daemon start and got 'Connection refused'
- the per-run container lived on its own bridge network cooperbench-git-<run_id>, but DockerEnvironmentConfig had no 'network' field, so the kwarg the adapter passed got silently dropped by Pydantic and agent containers ended up on the default bridge, with no route to the git server's IP

Replaced with a Redis-style shared singleton:
- one image cooperbench-git-server:local (built lazily on first use from a 4-line Dockerfile)
- one container cooperbench-git running 'git daemon --base-path=/git --listen=0.0.0.0' as PID 1, with a docker volume for /git
- one shared bridge network 'cooperbench' that all agent containers join
- per-run isolation via path namespacing: each coop pair gets /git/<run_id>/repo.git, served at git://cooperbench-git:9418/<run_id>/repo.git

DockerGitServer.create() now just ensures the singleton infra is up (idempotent, ~140ms after first call) and exec's a quick 'mkdir + git init --bare' inside the running daemon. cleanup() removes only the per-run path and leaves the singleton alive.

DockerEnvironmentConfig also gets a typed 'network' field so the --network flag actually reaches docker run.
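The path-namespacing scheme can be sketched as a few pure functions. This is an illustration under assumptions, not DockerGitServer itself: the helper names, the run_id validation rule, and the exact shape of the `docker exec` command are hypothetical; only the container name, port, base path, and URL layout come from the description above.

```python
import re

GIT_CONTAINER = "cooperbench-git"  # singleton container name from the description
GIT_PORT = 9418
BASE_PATH = "/git"

# Assumption: run ids are simple tokens; anything else is rejected to keep paths contained.
_RUN_ID_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9_.-]*")


def repo_path(run_id: str) -> str:
    """Filesystem path of the bare repo inside the git server container."""
    if not _RUN_ID_RE.fullmatch(run_id) or ".." in run_id:
        raise ValueError(f"unsafe run_id: {run_id!r}")
    return f"{BASE_PATH}/{run_id}/repo.git"


def repo_url(run_id: str) -> str:
    """git:// URL that agent containers on the shared network would use."""
    repo_path(run_id)  # reuse the validation
    return f"git://{GIT_CONTAINER}:{GIT_PORT}/{run_id}/repo.git"


def init_command(run_id: str) -> list[str]:
    """Hypothetical 'mkdir + git init --bare' invocation run inside the live daemon."""
    path = repo_path(run_id)
    return ["docker", "exec", GIT_CONTAINER, "sh", "-c", f"mkdir -p {path} && git init --bare {path}"]
```

Because isolation is only a path prefix, validating the run id before it reaches a shell string is what keeps one run from addressing another run's directory.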

DefaultAgent.run() calls self.save(self.config.output_path) in its finally clause every step, and save() calls serialize(). Once compaction has fired, serialize() was unconditionally calling _close_current_segment('solver'), which appends a snapshot of the current live messages as a new segment AND resets the buffer (which the next query() then repopulates). Net effect: each step after the first compaction added another post-compaction solver segment to _segments, each one a near-superset of the previous. In a real run we observed segment counts like [86, 85, 8, 10, 12, 14, 15] where the last 5 should have been a single segment.

Fix: serialize() builds a snapshot list locally without mutating self._segments. The current open buffer is appended as a transient 'solver' segment in the snapshot. Multiple calls to serialize() now produce identical output and leave state unchanged.
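A toy model makes the fixed behaviour concrete. `SegmentedTrajectory` below is a hypothetical stand-in for the agent's bookkeeping; only the names `_segments`, `_close_current_segment`, and `serialize` echo the description, and the data shapes are assumptions.

```python
from typing import Any


class SegmentedTrajectory:
    """Toy stand-in for the agent's segment bookkeeping (hypothetical class)."""

    def __init__(self) -> None:
        self._segments: list[dict[str, Any]] = []   # closed segments
        self._messages: list[dict[str, str]] = []   # current open buffer

    def add_message(self, role: str, content: str) -> None:
        self._messages.append({"role": role, "content": content})

    def _close_current_segment(self, kind: str) -> None:
        """Mutating operation: archive the open buffer and reset it."""
        self._segments.append({"kind": kind, "messages": list(self._messages)})
        self._messages = []

    def serialize(self) -> dict[str, Any]:
        """Read-only: build the snapshot locally and leave self._segments alone."""
        snapshot = list(self._segments)
        if self._messages:
            # The open buffer appears only in the snapshot, as a transient segment.
            snapshot.append({"kind": "solver", "messages": list(self._messages)})
        return {"segments": snapshot}
```

Calling `serialize()` any number of times between steps yields the same output and the same segment count, which is the property the per-step `save()` in the finally clause relies on.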

When an agent submits a malformed patch (e.g. a 'new file mode' diff against an existing file), 'git apply' rejects it but the eval would silently commit an empty branch, then report the subsequent merge as 'clean' because there was nothing to disagree with. An agent self-sabotage (the canonical case is an agent running 'rm -rf .git' mid-run) would look like a passing eval. _setup_branches now emits explicit per-agent markers (PATCH<N>_APPLIED / _SKIPPED / _FAILED) and returns an apply_status dict. test_merged threads it into the eval result and overrides merge.status to 'missing_input' when any patch failed. While in there, _run_tests now exposes exit_code in its result, and the per-feature dict gains feature_id / exit_code / tests_passed / tests_failed alongside the existing passed + test_output (which is still a 50KB blob). Consumers can now reason about results without grepping raw pytest output.
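The marker-to-status flow described above might look roughly like this. It is a sketch under assumptions: the function names are hypothetical, the lower-cased status strings are invented for the example, and only the marker spelling (PATCH<N>_APPLIED / _SKIPPED / _FAILED) and the 'missing_input' override come from the commit message.

```python
import re

# One marker per line, e.g. "PATCH1_APPLIED" or "PATCH2_FAILED".
_MARKER_RE = re.compile(r"^PATCH(\d+)_(APPLIED|SKIPPED|FAILED)$", re.MULTILINE)


def parse_apply_status(output: str) -> dict[int, str]:
    """Collect per-agent apply markers from the branch-setup output."""
    return {int(n): status.lower() for n, status in _MARKER_RE.findall(output)}


def resolve_merge_status(merge_status: str, apply_status: dict[int, str]) -> str:
    """A 'clean' merge means nothing if one side never received its patch."""
    if any(status == "failed" for status in apply_status.values()):
        return "missing_input"
    return merge_status
```

The override is what stops an empty branch from being reported as a clean merge: the merge result is only trusted once every input patch is known to have applied or been deliberately skipped.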

…truction

The previous Submission section was ~40 lines of three-step procedure plus a CRITICAL block. Small models (Qwen 9B, observed in coop+git runs) tended to follow the recipe but still hit footguns we hadn't forbidden: the canonical failure was an agent running 'rm -rf .git' mid-task, then 'git init' to 'fix' it, then producing a malformed 'new file mode' diff that the eval silently dropped.

Trim the recipe to the bare flow:

    git diff -- path/to/file > patch.txt
    cat patch.txt
    echo COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT && cat patch.txt

Then a tight CRITICAL block forbidding 'rm -rf .git', 'git init', 'git rm -rf .', and 'git reset --hard' inside /workspace/repo, which corrupt the local .git/ directory regardless of whether the team remote is enabled.

In a follow-up empirical run, agent6 (the agent that previously did rm -rf .git) issued zero destructive git commands and produced a clean modify-existing patch that passed all 100 of feature 6's tests.

CHANGELOG: extend the v0.0.12 entry to cover both this and the preceding eval-observability commit.

Earlier simplification collapsed the four file-category exclusions (reproduction scripts, helper tools, build/config files, binaries) into a single inline parenthetical, which small models tend to under-weight. Restore the bulleted list — bullets are easier to parse and harder to skim past — without re-adding the Step 1/2/3 scaffolding or the env-var notation that the simplification was meant to remove.

Single fenced bash block invited models to chain the three commands with '&&' or run them as one heredoc. That breaks the design — the env's COMPLETE sentinel detection only fires when 'echo COMPLETE...' is the first line of bash output, and chaining makes the diff happen on the same line as cat, which the env then doesn't capture as a submission. Split into three separate fenced bash blocks (write / verify / submit), with explicit 'SEPARATE bash tool call' instruction.
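The first-line rule can be sketched as follows. This is an approximation of the check described above, not the environment's actual code: `extract_submission` is a hypothetical name, and treating everything after the sentinel line as the submission is an assumption based on the 'echo COMPLETE... && cat patch.txt' step.

```python
from typing import Optional

SENTINEL = "COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT"


def extract_submission(output: str) -> Optional[str]:
    """Return the submitted text if the sentinel is the first output line, else None."""
    lines = output.lstrip().splitlines(keepends=True)
    if not lines or lines[0].strip() != SENTINEL:
        return None  # sentinel missing or not first: treated as ordinary output
    return "".join(lines[1:])  # everything after the sentinel is the patch


# The submit step run on its own: sentinel first, patch after it.
ok = "COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT\ndiff --git a/x.py b/x.py\n"
# Steps chained into one call: earlier output pushes the sentinel off line one.
chained = "diff --git a/x.py b/x.py\nCOMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT\ndiff --git a/x.py b/x.py\n"
```

Under this rule `extract_submission(ok)` yields the diff while `extract_submission(chained)` yields nothing, which is why the prompt asks for three separate tool calls.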

The previous wording opened with 'Edit files in place. Don't commit.' — which contradicts the existing Shared Git Remote section (telling agents they have a branch to coordinate on) and over-prescribes the workflow. Reframe: patch.txt is the artifact we evaluate, the agent writes whatever unified diff they want to submit to that file, however fits the workflow they used. The 'git diff -- file > patch.txt' recipe stays as 'one common way', not the only way. Agents are free to commit, fetch, merge, or do whatever else — the contract is only what ends up in patch.txt.

…yaml too Mirrors the trim already applied to coop.yaml — the sentence was unnecessary scaffolding now that the three steps are in their own fenced blocks.

akhatua2 added 11 commits April 30, 2026 21:17

v0.0.12: bump version, add CHANGELOG entry

ec5483c

mini_swe_agent_v2: drop the 'SEPARATE bash tool call' line from solo.…

6bf32ce

akhatua2 merged commit b47df3a into main Apr 30, 2026
3 checks passed

akhatua2 deleted the v2-hardening branch April 30, 2026 22:38

akhatua2 merged 11 commits into main from v2-hardening
