Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
24 changes: 24 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,30 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.0.12] - 2026-04-30

### Changed

- **`mini_swe_agent_v2` patch is now the agent's `submission`** — the adapter no longer captures `base_commit` and runs `git diff <base>` at end-of-run. Instead the patch comes directly from `result['submission']`, which the env populates with everything the agent emits after `echo COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT`. Mirrors upstream mini-swe-agent's SWE-bench config. The `coop.yaml` / `solo.yaml` prompts now instruct the agent to curate via `git diff -- file1 file2 > patch.txt`, verify with `cat patch.txt`, and submit with `echo COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT && cat patch.txt`. No working-tree-extraction fallback — if the agent didn't submit, there is no patch.
- **`config/mini.yaml` split into `config/solo.yaml` + `config/coop.yaml`** — the previous single file conditioned everything on `{% if agent_id %}` blocks. The adapter now selects which file to load via `is_coop = len(agents) > 1`. While splitting, fixed a leak in the solo branch where the `CRITICAL REQUIREMENTS` section still mentioned `send_message to your colleague` for an agent with no colleague.
- **Shared singleton git server for `--git` coop runs** — replaces the previous design that spun up a fresh `debian:bookworm-slim` container per run, ran `apt-get install git`, slept 3s, and returned (resulting in race conditions where agents' initial `git push` beat the daemon to startup). The new design auto-creates one image (`cooperbench-git-server:local`), one network (`cooperbench`), and one container (`cooperbench-git`) on first use; per-run isolation comes from path namespacing under `/git/<run_id>/repo.git`. Idempotent — first run pays a ~30s image-build cost, subsequent runs reuse the singleton in ~140ms. Mirrors the Redis-style "one daemon, many namespaces" pattern.
- **Submission prompts simplified + `.git` footgun warnings** — the `## Submission` section in `coop.yaml` / `solo.yaml` is now ~5 lines (write a `git diff` to `patch.txt`, `cat` it, submit with `echo COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT && cat patch.txt`). Adds an explicit `<CRITICAL>` block forbidding `rm -rf .git`, `git init`, `git rm -rf .`, and `git reset --hard` inside `/workspace/repo` — these are easy footguns for small models, observed in the wild causing patches to come out as malformed `new file mode` diffs that fail to apply.

### Fixed

- **Eval surfaces `git apply` failures instead of silently masking them.** `_setup_branches` now emits explicit `PATCH<N>_APPLIED` / `_SKIPPED` / `_FAILED` markers per agent and returns an `apply_status` dict. `test_merged` writes that to the result and refuses to call merge `clean` when any input patch failed to apply — instead reporting `merge.status: "missing_input"`. Previously, an agent submitting a malformed patch (e.g. a `new file mode` diff against an existing file) would have its branch silently end up empty, and the subsequent merge against the other agent's branch would report `clean` despite the missing input — making the eval lie about partial success.
- **Per-feature eval result schema enriched.** `feature1` / `feature2` now carry `feature_id`, `exit_code`, `tests_passed`, and `tests_failed` (was just `passed: bool` + a 50KB `test_output` blob). Lets consumers reason about results without grepping raw pytest output.

- **`mini_swe_agent_v2` adapter no longer crashes on `content=None`** — tool-calling assistant turns leave `content=None` (the body lives in `tool_calls`), and CooperBench's downstream `_extract_conversation` does `"send_message" in content`, which raises `TypeError` on None. The adapter now coerces to `""` before populating `AgentResult.messages`.
- **`mini_swe_agent_v2` adapter wires up the `agent_config` flag** — previously listed in `MiniSweAgentV2Runner.run`'s signature but never read from. Now loads the YAML and deep-merges its `config:` block over the defaults. Forward-compatible: `**kwargs` accepted so unknown caller-side args don't crash `run()`.
- **`mini_swe_agent_v2` adapter drops the dead `SEND_MESSAGE_TOOL` import** — only `BASH_TOOL` is registered with the model; `send_message` is intercepted from inside the bash command string. The leftover import was confusing.
- **`DockerEnvironmentConfig.network`** — added a typed `network` field so the `--network <name>` flag reaches `docker run`. Previously the adapter passed `network=...` as a kwarg, but Pydantic silently dropped it (no such field), and agent containers ended up on the default bridge with no route to the per-run git server's IP. With the new shared-singleton git server design, agent containers must join the shared `cooperbench` network for DNS-by-name resolution to work.
- **`DefaultAgent.serialize()` no longer mutates `_segments`** — `run()` calls `save()` in its finally clause every step, and once compaction had fired, each `serialize()` call appended another snapshot of the current live messages as a fresh solver segment (and reset the buffer, which the next `query()` repopulated). Net effect: one compaction produced N+1 overlapping post-compaction solver segments instead of 1. Fix: `serialize()` builds the snapshot list locally without touching `self._segments`.

### Added

- **`cooperbench` CLI auto-loads `./.env`** — `cli.py` now calls `dotenv.load_dotenv()` at module load, so project-local `OPENAI_API_KEY` etc. is picked up without users having to `set -a && source .env` ahead of every invocation. Matches the convention used elsewhere in the codebase.

## [0.0.11] - 2026-04-18

### Added
Expand Down
2 changes: 1 addition & 1 deletion src/cooperbench/__about__.py
Original file line number Diff line number Diff line change
@@ -1,3 +1,3 @@
"""Version information for CooperBench."""

__version__ = "0.0.11"
__version__ = "0.0.12"
53 changes: 36 additions & 17 deletions src/cooperbench/agents/mini_swe_agent_v2/adapter.py
Original file line number Diff line number Diff line change
Expand Up @@ -40,13 +40,29 @@ def run(
config: dict | None = None,
agent_config: str | None = None,
log_dir: str | None = None,
**kwargs,
) -> AgentResult:
"""Run mini-swe-agent v2 on a task."""
# Always load default config, then merge with any overrides
config_path = get_config_path("mini")
# Load coop config when multiple agents, otherwise solo config.
is_coop = bool(agents) and len(agents) > 1
config_name = "coop" if is_coop else "solo"
config_path = get_config_path(config_name)
with open(config_path) as f:
default_config = yaml.safe_load(f)

# If the caller passed an agent_config YAML path, deep-merge its
# `config:` block into the defaults. This is what CooperBench's
# ``--agent-config`` flag forwards to the adapter.
if agent_config:
try:
with open(agent_config) as f:
overrides = yaml.safe_load(f) or {}
default_config = recursive_merge(default_config, overrides.get("config", overrides))
except FileNotFoundError:
logger.error(f"agent_config file not found: {agent_config}")
except Exception as e:
logger.error(f"Error loading agent_config {agent_config}: {e}")

# Deep-merge passed config overrides into default config so that partial
# overrides (e.g. only agent.compaction_enabled) don't clobber sibling keys.
if config is not None:
Expand Down Expand Up @@ -77,10 +93,6 @@ def run(

env = ModalEnvironment(**env_kwargs)

# Capture base commit for patch generation
base_commit_result = env.execute({"command": "git rev-parse HEAD"})
base_commit = base_commit_result.get("output", "").strip()

# Setup messaging connector if enabled
comm = None
use_messaging = messaging_enabled and comm_url and agents and len(agents) > 1
Expand Down Expand Up @@ -122,15 +134,20 @@ def run(

# Run agent
error_msg = None
result = {}
try:
result = agent.run(task=task)
status = result.get("exit_status", "Submitted")
except Exception as e:
status = "Error"
error_msg = str(e)

# Extract patch (committed + uncommitted changes)
patch = self._get_patch(env, base_commit)
# The agent submits its patch via ``echo COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT
# && cat patch.txt`` (see coop.yaml/solo.yaml). Whatever follows the
# sentinel is captured into result["submission"] by the env. No
# working-tree extraction fallback — if the agent didn't submit a
# patch, there is no patch.
patch = result.get("submission", "").strip()

# Save full trajectory (includes segments when compaction occurred)
if log_dir and agent._compaction_count > 0:
Expand All @@ -144,20 +161,22 @@ def run(
# Cleanup
env.cleanup()

# Tool-calling assistant turns leave content=None (the body lives in
# tool_calls). CooperBench's downstream conversation extractor does
# ``"send_message" in content`` which raises TypeError on None — coerce
# to "" before returning.
sanitized_messages = []
for msg in agent.messages:
if msg.get("content") is None:
msg = {**msg, "content": ""}
sanitized_messages.append(msg)

return AgentResult(
status=status,
patch=patch,
cost=agent.cost,
steps=agent.n_calls,
messages=agent.messages,
messages=sanitized_messages,
sent_messages=agent.sent_messages,
error=error_msg,
)

def _get_patch(self, env, base_commit: str) -> str:
"""Extract git diff from base commit to current working tree state."""
try:
result = env.execute({"command": f"git diff {base_commit}"})
return result.get("output", "").strip()
except Exception:
return ""
7 changes: 5 additions & 2 deletions src/cooperbench/agents/mini_swe_agent_v2/agents/default.py
Original file line number Diff line number Diff line change
Expand Up @@ -320,8 +320,11 @@ def serialize(self, *extra_dicts) -> dict:
"trajectory_format": "mini-swe-agent-1.1",
}
if self._compaction_count > 0:
self._close_current_segment("solver")
agent_data["segments"] = self._segments
segments = list(self._segments)
current = self._current_segment_messages or self.messages
if current:
segments.append({"kind": "solver", "messages": list(current)})
agent_data["segments"] = segments
return recursive_merge(agent_data, self.model.serialize(), self.env.serialize(), *extra_dicts)

def save(self, path: Path | None, *extra_dicts) -> dict:
Expand Down
Original file line number Diff line number Diff line change
@@ -1,12 +1,7 @@
agent:
system_template: |
{% if agent_id %}
You are a software engineer working alongside a colleague on a shared codebase. You each have your own workspace and are implementing different features in parallel. You communicate naturally — like engineers on the same team — to make sure your combined work integrates cleanly.
{% else %}
You are a helpful assistant that can interact with a computer.
{% endif %}
instance_template: |
{% if agent_id %}
## Your Task

{{task}}
Expand All @@ -32,8 +27,7 @@ agent:
3. Edit the source code to resolve the issue
4. Verify your fix works by running your script again
5. Test edge cases to ensure your fix is robust
6. Submit your changes: `echo COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT`.
Do not combine it with any other command. <important>After this command, you cannot continue working on this task.</important>
6. Submit your changes — see the **Submission** section below for the exact procedure.

But you are not working solo but rather as a team, so follow the workflow above for your individual tasks but you have complete freedom to communicate with your colleague in whatever way, whenever, and however often you see fit. Think about how experienced software engineers coordinate when working on the same codebase — and do that.

Expand Down Expand Up @@ -86,25 +80,6 @@ agent:
Do NOT run: `git merge` (without --abort), `git pull`, `git rebase`, or `git reset --hard` against your colleague's branch or `origin/main`. These will corrupt your patch.
{% endif %}

{% else %}
## Your Task

{{task}}

## Recommended Workflow

This workflow should be done step-by-step so that you can iterate on your changes and any possible problems.

1. Analyze the codebase by finding and reading relevant files
2. Create a script to reproduce the issue
3. Edit the source code to resolve the issue
4. Verify your fix works by running your script again
5. Test edge cases to ensure your fix is robust
6. Submit your changes: `echo COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT`.
Do not combine it with any other command. <important>After this command, you cannot continue working on this task.</important>

{% endif %}

## Command Execution Rules

You are operating in an environment where
Expand All @@ -125,8 +100,7 @@ agent:
- Your response MUST include AT LEAST ONE bash tool call — this can be a coding command, a `send_message` to your colleague, or both
- Directory or environment variable changes are not persistent. Every action is executed in a new subshell.
- However, you can prefix any action with `MY_ENV_VAR=MY_VALUE cd /path/to/working/dir && ...` or write/load environment variables from files
- Submit your changes and finish your work by issuing the following command: `echo COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT`.
Do not combine it with any other command. <important>After this command, you cannot continue working on this task.</important>
- To submit your work, follow the **Submission** section below (`echo COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT && cat patch.txt`).

Example of a CORRECT response:
<example_response>
Expand Down Expand Up @@ -185,9 +159,48 @@ agent:
```bash
anything
```

## Submission

`patch.txt` is the artifact we evaluate — write whatever unified diff
you want to submit to that file, however it makes sense given how you
worked:

Write the patch (one common way — `git diff` of your in-place edits):

```bash
git diff -- path/to/file1 path/to/file2 > patch.txt
```

Verify it contains what you intend:

```bash
cat patch.txt
```

Submit:

```bash
echo COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT && cat patch.txt
```

The patch must be a unified diff and contain only source files you
intentionally modified. Exclude:

- reproduction or scratch test scripts you wrote
- helper scripts or tools you created
- installation, build, packaging, or configuration files
- binaries or compiled files

<CRITICAL>
Do NOT run `rm -rf .git`, `git init`, `git rm -rf .`, or `git reset --hard`
inside `/workspace/repo` — these corrupt `.git/` and your patch will be
unapplyable.
</CRITICAL>
step_limit: 100
cost_limit: 3.
mode: confirm
compaction_token_trigger: 28000 # leave headroom on 32K-context models
environment:
env:
PAGER: cat
Expand All @@ -196,6 +209,7 @@ environment:
PIP_PROGRESS_BAR: 'off'
TQDM_DISABLE: '1'
model:
cost_tracking: ignore_errors # vLLM-served models have no pricing data
observation_template: |
{%- if output.output | length < 10000 -%}
{
Expand Down
Loading
Loading