v2 adapter hardening, shared git server, submission-based patch #48

Merged
akhatua2 merged 11 commits into main from v2-hardening on Apr 30, 2026

@akhatua2 akhatua2 commented Apr 30, 2026

Summary

Stabilizes the mini_swe_agent_v2 adapter and removes a pile of latent
bugs that were silently degrading coop+git runs. Four commits, each
self-contained:

  1. cli: auto-load ./.env. cooperbench itself never called
    load_dotenv(), so project-local OPENAI_API_KEY etc. only worked
    when users manually exported them. One-liner at the top of cli.py.

  2. mini_swe_agent_v2: harden adapter, split mini.yaml into solo/coop, submit via patch.txt — the bulk of the change:

    • Adapter accepts **kwargs; wires up the previously-dead
      agent_config flag; sanitizes content=None from tool-calling
      turns (CooperBench's _extract_conversation does
      "send_message" in content and TypeErrors on None); drops the
      dead SEND_MESSAGE_TOOL import.
    • Drops _get_patch() / working-tree extraction; the patch now
      comes verbatim from result['submission']. No fallback — if the
      agent didn't submit, there's no patch. This is the upstream
      mini-swe-agent SWE-bench design.
    • Splits mini.yaml (single file with {% if agent_id %} branches)
      into solo.yaml + coop.yaml; fixes a leak in the solo branch
      that mentioned a non-existent colleague.
    • Updates the prompts to instruct the agent to curate
      git diff -- file1 file2 > patch.txt, cat patch.txt, then
      echo COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT && cat patch.txt
      mirroring upstream's three-step submission flow.
  3. mini_swe_agent_v2: shared singleton git server, path-prefixed per-run repos — replaces the per-run apt install git + sleep 3
    container with a Redis-style singleton. One cooperbench-git
    container running git daemon on a shared cooperbench bridge
    network; per-run isolation via path namespacing
    (/git/<run_id>/repo.git). Auto-created on first use, idempotent
    thereafter. Also adds a typed network field to
    DockerEnvironmentConfig so the --network kwarg actually reaches
    docker run (previously Pydantic silently dropped it).

  4. mini_swe_agent_v2: serialize() no longer mutates _segments.
    DefaultAgent.run() calls save() in its finally clause every
    step, and save → serialize → _close_current_segment was appending
    a fresh solver segment per save. Result: _full_traj.json
    ballooned with overlapping post-compaction snapshots. Fix:
    serialize() snapshots locally without mutating state.

Verification

All four CI checks pass locally on the branch:

  • uv run --extra dev ruff check src/cooperbench/ — clean
  • uv run --extra dev ruff format --check src/cooperbench/ — clean
  • uv run --extra dev mypy src/cooperbench/ — clean
  • uv run --extra dev pytest tests/ -v — 155 passed, 63 skipped

End-to-end coop+git run against a Modal-served Qwen3.5-9B endpoint with
the new pipeline (cooperbench run -n test -r dottxt_ai_outlines_task -t 1655 -f 6,7 -m openai/Qwen/Qwen3.5-9B -a mini_swe_agent_v2 --git):
both agents reached Submitted status; the resulting agent6.patch
and agent7.patch contain only the modified source file (no scratch
test scripts), which confirms the curated-submission flow for this run.

Test plan

  • Lint clean (ruff check, ruff format --check)
  • mypy clean
  • pytest unit suite passes
  • End-to-end coop+git smoke run with Modal-served Qwen3.5-9B
  • CI green on this PR

akhatua2 added 11 commits April 30, 2026 21:17
mini_swe_agent_v2 already loads dotenv from a global config dir
(~/.config/mini-swe-agent/.env), but cooperbench itself never loaded
the project-local .env, so OPENAI_API_KEY etc. only made it through
when the user manually exported them.

Calling dotenv.load_dotenv() at the top of cli.py auto-loads ./.env
from cwd before any env-var-dependent imports run, matching how
projects with python-dotenv conventionally pick up local config.
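
A minimal sketch of the behaviour described above, not the actual cli.py
code. It is a stdlib-only stand-in for dotenv.load_dotenv() so it runs
without the dependency; the helper name load_local_env is hypothetical.
Like load_dotenv() with its default override=False, it leaves variables
that are already exported untouched.

```python
import os
from pathlib import Path


def load_local_env(path: Path) -> dict[str, str]:
    """Stand-in for dotenv.load_dotenv(): read KEY=VALUE lines from `path`.

    Returns only the variables that were newly set. Variables already
    present in os.environ win over the file.
    """
    loaded: dict[str, str] = {}
    if not path.is_file():
        return loaded
    for raw in path.read_text().splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip("'\"")
        if key and key not in os.environ:
            os.environ[key] = value
            loaded[key] = value
    return loaded


# At the top of a CLI module this would run before any import that reads
# os.environ, e.g. load_local_env(Path.cwd() / ".env").
```

The ordering is the point: the load has to happen before any module
that reads OPENAI_API_KEY (or similar) at import time.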
…bmit via patch.txt

Adapter:
- accept **kwargs so unknown caller-side args don't crash run()
- wire up the agent_config CLI flag that was previously listed in the
  signature but never read; load YAML and deep-merge config: block over
  the defaults
- sanitize content=None on tool-calling assistant turns before returning
  AgentResult.messages (CooperBench's downstream coop runner does
  '"send_message" in content' which TypeErrors on None)
- drop the dead SEND_MESSAGE_TOOL import (only BASH_TOOL is registered;
  send_message is intercepted from inside the bash command string)
- drop _get_patch() and the base_commit capture; the patch now comes
  straight from result['submission'] (no working-tree extraction
  fallback — if the agent didn't submit, there is no patch)
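
Two of the adapter changes above can be sketched as follows. This is
an illustration under assumptions, not the adapter's code: the function
names sanitize_messages and deep_merge, and the config keys in the usage
note, are hypothetical.

```python
from copy import deepcopy
from typing import Any


def sanitize_messages(messages: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return copies of `messages` where a missing or None content becomes "".

    Tool-calling assistant turns often carry content=None, which makes a
    downstream check such as `"send_message" in content` raise TypeError.
    """
    cleaned = []
    for message in messages:
        message = dict(message)  # shallow copy so the caller's dicts are untouched
        if message.get("content") is None:
            message["content"] = ""
        cleaned.append(message)
    return cleaned


def deep_merge(defaults: dict[str, Any], overrides: dict[str, Any]) -> dict[str, Any]:
    """Recursively merge `overrides` over `defaults` without mutating either."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged
```

With this shape, deep_merge({"agent": {"step_limit": 10, "mode": "yolo"}},
{"agent": {"step_limit": 50}}) keeps "mode" and only replaces
"step_limit", which is the difference between a deep merge and a plain
dict.update().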

Config:
- delete config/mini.yaml (was a single file with {% if agent_id %}
  branches handling both solo and coop) and split into
  config/solo.yaml and config/coop.yaml; adapter picks based on
  is_coop = len(agents) > 1
- fix a leak in the solo branch where the CRITICAL REQUIREMENTS
  section still mentioned 'send_message to your colleague' even when
  the agent has no colleague
- replace the bare 'echo COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT' submit
  step with the upstream mini-swe-agent SWE-bench three-step flow:
  curate via 'git diff -- file1 file2 > patch.txt', verify via cat,
  submit via 'echo COMPLETE... && cat patch.txt'.  The submission
  field then carries the patch verbatim.
… repos

Old design spun up a fresh debian-slim container per coop pair, ran
'apt-get install git' inside it, slept 3 seconds, and returned.  Two
problems:

- the 3-second sleep was far too short for the apt install (~30-60s
  on a cold image), so agents' initial 'git push' raced the daemon
  start and got 'Connection refused'
- the per-run container lived on its own bridge network
  cooperbench-git-<run_id>, but DockerEnvironmentConfig had no
  'network' field, so the kwarg the adapter passed got silently
  dropped by Pydantic and agent containers ended up on the default
  bridge — no route to the git server's IP

Replaced with a Redis-style shared singleton:

- one image cooperbench-git-server:local (built lazily on first use
  from a 4-line Dockerfile)
- one container cooperbench-git running 'git daemon --base-path=/git
  --listen=0.0.0.0' as PID 1, with a docker volume for /git
- one shared bridge network 'cooperbench' that all agent containers
  join
- per-run isolation via path namespacing: each coop pair gets
  /git/<run_id>/repo.git, served at git://cooperbench-git:9418/<run_id>/repo.git

DockerGitServer.create() now just ensures the singleton infra is up
(idempotent, ~140ms after first call) and exec's a quick
'mkdir + git init --bare' inside the running daemon.  cleanup()
removes only the per-run path and leaves the singleton alive.
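
The path namespacing described above can be sketched like this. The
container name, port, and base path come from the commit message; the
helper names and the run_id validation rule are assumptions added for
illustration, and the command lists are only built here, not executed.

```python
import re

GIT_CONTAINER = "cooperbench-git"  # container name from the commit message
GIT_PORT = 9418                    # port shown in the git:// URL above
BASE_PATH = "/git"                 # matches --base-path=/git

# Assumption: run ids are simple tokens; reject anything that could escape /git.
_RUN_ID = re.compile(r"[A-Za-z0-9][A-Za-z0-9_.-]*")


def _check(run_id: str) -> str:
    if not _RUN_ID.fullmatch(run_id) or ".." in run_id:
        raise ValueError(f"unsafe run_id: {run_id!r}")
    return run_id


def repo_path(run_id: str) -> str:
    """Path of the bare repo inside the git server container."""
    return f"{BASE_PATH}/{_check(run_id)}/repo.git"


def remote_url(run_id: str) -> str:
    """URL that agent containers on the shared network would use as a remote."""
    return f"git://{GIT_CONTAINER}:{GIT_PORT}/{_check(run_id)}/repo.git"


def create_commands(run_id: str) -> list[list[str]]:
    """argv lists for the 'mkdir + git init --bare' step, suitable for subprocess.run."""
    path = repo_path(run_id)
    return [
        ["docker", "exec", GIT_CONTAINER, "mkdir", "-p", f"{BASE_PATH}/{run_id}"],
        ["docker", "exec", GIT_CONTAINER, "git", "init", "--bare", path],
    ]


def cleanup_command(run_id: str) -> list[str]:
    """Remove only this run's directory; the singleton container stays up."""
    return ["docker", "exec", GIT_CONTAINER, "rm", "-rf", f"{BASE_PATH}/{_check(run_id)}"]
```

Because isolation is purely by path, validating run_id before it is
interpolated into 'rm -rf' is worth the few extra lines.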

DockerEnvironmentConfig also gets a typed 'network' field so the
--network flag actually reaches docker run.

DefaultAgent.run() calls self.save(self.config.output_path) in its
finally clause every step, and save() calls serialize().  Once
compaction has fired, serialize() was unconditionally calling
_close_current_segment('solver') — which appends a snapshot of the
current live messages as a new segment AND resets the buffer (which
the next query() then repopulates).

Net effect: each step after the first compaction added another
post-compaction solver segment to _segments, each one near-superset
of the previous.  In a real run we observed segment counts like
[86, 85, 8, 10, 12, 14, 15] where the last 5 should have been a
single segment.

Fix: serialize() builds a snapshot list locally without mutating
self._segments.  The current open buffer is appended as a transient
'solver' segment in the snapshot.  Multiple calls to serialize() now
produce identical output and leave state unchanged.
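
The fix can be illustrated with a toy class. This is a sketch of the
idea only: the Trajectory class and its method names are hypothetical
and much simpler than the real agent, but the snapshot-without-mutation
pattern is the same.

```python
from copy import deepcopy
from typing import Any


class Trajectory:
    """Toy stand-in for the agent's segment bookkeeping."""

    def __init__(self) -> None:
        self._segments: list[dict[str, Any]] = []  # closed segments
        self._messages: list[dict[str, Any]] = []  # current open buffer

    def add_message(self, message: dict[str, Any]) -> None:
        self._messages.append(message)

    def close_current_segment(self, kind: str) -> None:
        """Mutating: move the open buffer into _segments (e.g. at compaction)."""
        self._segments.append({"kind": kind, "messages": self._messages})
        self._messages = []

    def serialize(self) -> list[dict[str, Any]]:
        """Non-mutating: build the output from a local snapshot.

        The open buffer is included as a transient 'solver' segment in the
        snapshot only; self._segments and self._messages are left as they were.
        """
        snapshot = deepcopy(self._segments)
        if self._messages:
            snapshot.append({"kind": "solver", "messages": deepcopy(self._messages)})
        return snapshot
```

Calling serialize() any number of times between steps now returns the
same structure, so a save() in a finally clause is safe to repeat.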

When an agent submits a malformed patch (e.g. a 'new file mode' diff
against an existing file), 'git apply' rejects it, but the eval would
silently commit an empty branch and then report the subsequent merge as
'clean' because there was nothing to disagree with.  Agent
self-sabotage (the canonical case is an agent running 'rm -rf .git'
mid-run) would therefore look like a passing eval.

_setup_branches now emits explicit per-agent markers
(PATCH<N>_APPLIED / _SKIPPED / _FAILED) and returns an apply_status
dict.  test_merged threads it into the eval result and overrides
merge.status to 'missing_input' when any patch failed.

While in there, _run_tests now exposes exit_code in its result, and
the per-feature dict gains feature_id / exit_code / tests_passed /
tests_failed alongside the existing passed + test_output (which is
still a 50KB blob).  Consumers can now reason about results without
grepping raw pytest output.
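
A rough sketch of how the markers and the status override could fit
together. The exact marker spelling (PATCH1_APPLIED, PATCH2_FAILED, ...)
is inferred from the description above, and the function names are
hypothetical; the real _setup_branches / test_merged code may differ.

```python
import re

# Assumed marker shape: PATCH<N>_APPLIED, PATCH<N>_SKIPPED or PATCH<N>_FAILED
# at the start of a line in the branch-setup output.
MARKER = re.compile(r"^PATCH(\d+)_(APPLIED|SKIPPED|FAILED)\b", re.MULTILINE)


def parse_apply_status(setup_output: str) -> dict[int, str]:
    """Map agent index -> 'applied' | 'skipped' | 'failed'."""
    return {int(n): status.lower() for n, status in MARKER.findall(setup_output)}


def resolve_merge_status(merge_status: str, apply_status: dict[int, str]) -> str:
    """Override the merge status when any patch failed to apply.

    A 'clean' merge of a branch whose patch never applied is not a real
    result, so it is reported as 'missing_input' instead.
    """
    if any(status == "failed" for status in apply_status.values()):
        return "missing_input"
    return merge_status
```

A consumer can then branch on the returned status directly rather than
searching the raw test output.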
…truction

The previous Submission section was ~40 lines of three-step procedure
plus a CRITICAL block.  Small models (Qwen 9B, observed in coop+git
runs) tended to follow the recipe but still hit footguns we hadn't
forbidden — the canonical failure was an agent running 'rm -rf .git'
mid-task, then 'git init' to 'fix' it, then producing a malformed
'new file mode' diff that the eval silently dropped.

Trim the recipe to the bare flow:

    git diff -- path/to/file > patch.txt
    cat patch.txt
    echo COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT && cat patch.txt

Then a tight CRITICAL block forbidding 'rm -rf .git', 'git init',
'git rm -rf .', and 'git reset --hard' inside /workspace/repo —
which corrupt the local .git/ directory regardless of whether the
team remote is enabled.

In a follow-up empirical run, agent6 (the agent that previously did
rm -rf .git) issued zero destructive git commands and produced a
clean modify-existing patch that passed all 100 of feature 6's tests.

CHANGELOG: extend the v0.0.12 entry to cover both this and the
preceding eval-observability commit.

An earlier simplification collapsed the four file-category exclusions
(reproduction scripts, helper tools, build/config files, binaries)
into a single inline parenthetical, which small models tend to
under-weight.  Restore the bulleted list — bullets are easier to
parse and harder to skim past — without re-adding the Step 1/2/3
scaffolding or the env-var notation that the simplification was
meant to remove.

A single fenced bash block invited models to chain the three commands
with '&&' or run them as one heredoc.  That breaks the design — the
env's COMPLETE sentinel detection only fires when 'echo COMPLETE...'
is the first line of bash output, and chaining makes the diff happen
on the same line as cat, which the env then doesn't capture as a
submission.
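
To make the first-line rule concrete, here is a sketch of a sentinel
check of this kind. The function name extract_submission is hypothetical
and the env's real detection code may differ in detail; the chained
output below is one illustrative way a combined command can put other
text ahead of the sentinel.

```python
from typing import Optional

SENTINEL = "COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT"


def extract_submission(output: str) -> Optional[str]:
    """Return the submission if the sentinel is the first line of `output`.

    Everything after the sentinel line is treated as the patch. If any other
    text comes first, nothing is captured and None is returned.
    """
    lines = output.lstrip().splitlines(keepends=True)
    if lines and lines[0].strip() == SENTINEL:
        return "".join(lines[1:])
    return None


# Separate tool call: the sentinel is the first line, so the patch is captured.
separate = "COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT\ndiff --git a/f.py b/f.py\n"

# Chained commands: earlier output precedes the sentinel, so nothing is captured.
chained = "diff --git a/f.py b/f.py\n" + separate

captured = extract_submission(separate)
missed = extract_submission(chained)
```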

Split into three separate fenced bash blocks (write / verify /
submit), with an explicit 'SEPARATE bash tool call' instruction.

The previous wording opened with 'Edit files in place. Don't commit.',
which contradicts the existing Shared Git Remote section (telling agents
they have a branch to coordinate on) and over-prescribes the workflow.

Reframe: patch.txt is the artifact we evaluate.  The agent writes whatever
unified diff they want to submit to that file, in whatever way fits the
workflow they used.  The 'git diff -- file > patch.txt' recipe stays as 'one
common way', not the only way.  Agents are free to commit, fetch, merge, or do
whatever else; the contract is only what ends up in patch.txt.
…yaml too

Mirrors the trim already applied to coop.yaml — the sentence was
unnecessary scaffolding now that the three steps are in their own
fenced blocks.
@akhatua2 akhatua2 merged commit b47df3a into main Apr 30, 2026
3 checks passed
@akhatua2 akhatua2 deleted the v2-hardening branch April 30, 2026 22:38