feat(omnigent): Databricks Omnigent pi meta-harness as a non-ACP BenchFlow agent #7
…BenchFlow agent

Wraps Databricks Omnigent's `pi` meta-harness as a BenchFlow agent over the NON-ACP Session path: registers `omnigent-pi` with protocol="session-factory" and a session_factory entrypoint. OmnigentSession shells the one-shot `omnigent run --harness pi` CLI inside the sandbox via Sandbox.exec; the omnigent-client SDK pins starlette<1 and conflicts with BenchFlow's litellm/starlette-1.x usage proxy, so it can't be imported in-process.

Verified end-to-end in `bench eval` on Daytona (x86_64) with deepseek-chat: reward 1.0 on hello-world AND the real citation-check research task (read a BibTeX file, query citation APIs, detect hallucinated entries; all 9 verifier tests pass).

Requires a BenchFlow build carrying the session-factory seam (AgentConfig.session_factory + "session-factory" in VALID_PROTOCOLS + rollout._connect_session_factory); without it register() logs a warning and returns None rather than crashing the import, mirroring acp-registry's acp_model_via_env degradation. x86_64 only: cel-expr-python ships no aarch64 wheel.

The install command provisions omnigent (uv tool, --python 3.12) + the pi CLI in the sandbox, symlinks node/npm/npx onto the bare PATH (pi is a `#!/usr/bin/env node` script the runner spawns from a fresh shell), and installs tmux (the runner's managed REPL terminal hard-fails without it).

Adds per-package CI (test-omnigent.yaml), an omnigent lint step, and README entries (Agents table, repo layout, license).
Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.
Reviewed by Cursor Bugbot for commit b8500ad.
    sandbox,
    model=model,
    exec_user=self._exec_user,
)
Ignoring config write failure
Medium Severity
`connect` awaits `sandbox.exec` to write `~/.omnigent/config.yaml` but never inspects the command result. A non-zero exit or partial write still returns a live `OmnigentSession`, so later `omnigent run` turns can proceed without valid gateway credentials, and the failure surfaces only in stderr.
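A minimal sketch of the missing check, assuming the object returned by `sandbox.exec` exposes an integer exit code and captured stderr. The `ExecResult` dataclass, its field names, and `ensure_ok` are hypothetical stand-ins for illustration, not BenchFlow's actual API.

```python
from dataclasses import dataclass


@dataclass
class ExecResult:
    """Hypothetical stand-in for whatever Sandbox.exec returns."""

    return_code: int
    stdout: str = ""
    stderr: str = ""


def ensure_ok(result: ExecResult, what: str) -> ExecResult:
    """Raise if a sandbox command failed, so a half-configured session is never returned."""
    if result.return_code != 0:
        detail = result.stderr.strip() or "<no stderr>"
        raise RuntimeError(f"{what} failed (exit {result.return_code}): {detail}")
    return result
```

In `connect`, the awaited config write could pass through something like `ensure_ok(result, "writing ~/.omnigent/config.yaml")` before the `OmnigentSession` is constructed.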
    f"omnigent stop >/dev/null 2>&1; "
    f"omnigent run --harness pi --model {shlex.quote(model)} "
    f"-p {shlex.quote(text)}"
)
Run proceeds after failed cd
Medium Severity
The per-turn shell command chains `cd /app && omnigent stop` with `omnigent run` using `;`, so `omnigent run` still executes when `cd /app` fails. Work then happens outside the task workspace `/app`, so verifier file checks can miss agent output even though the turn appears to complete.
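One possible shape for the fix, sketched under assumptions: abort the shell as soon as `cd` fails, and keep `;` only between `omnigent stop` (whose failure is tolerated) and `omnigent run`. The helper name `build_turn_command` and its `workdir` parameter are hypothetical; the `omnigent` flags are the ones already shown in the diff.

```python
import shlex


def build_turn_command(model: str, text: str, workdir: str = "/app") -> str:
    """Build the per-turn shell command; a failed cd exits before anything else runs."""
    return (
        # `|| exit 1` ends the shell, so nothing below runs outside the workspace.
        f"cd {shlex.quote(workdir)} || exit 1; "
        # A failing `omnigent stop` is ignored on purpose, hence the plain `;`.
        f"omnigent stop >/dev/null 2>&1; "
        f"omnigent run --harness pi --model {shlex.quote(model)} "
        f"-p {shlex.quote(text)}"
    )
```

With this shape a missing `/app` yields a non-zero exit from the turn instead of a run in the wrong directory.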
Published BenchFlow does NOT validate `protocol` at register_agent() time, so the previous try/except never tripped and register() silently registered a non-connectable omnigent-pi on a seam-less build (CI: test job failed asserting register() is None). Replace the try/except with an explicit `_session_factory_seam_present()` gate (`"session-factory" in VALID_PROTOCOLS`, import-guarded for old BenchFlow) so behaviour is identical across versions: no seam → warn + return None + do not register. Make the degradation test version-independent by monkeypatching the gate (works on a seam-carrying venv too), and add a test that the package's seam probe agrees with VALID_PROTOCOLS membership.
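A rough sketch of the gate described above, assuming `VALID_PROTOCOLS` is a module-level collection. The module path `benchflow.agents.registry`, the function names, and the `do_register` callback are hypothetical placeholders; only the `"session-factory" in VALID_PROTOCOLS` probe and the warn-and-return-`None` behaviour come from the commit message.

```python
import importlib
import logging
from typing import Callable, Optional, TypeVar

logger = logging.getLogger("omnigent")
T = TypeVar("T")


def session_factory_seam_present(module_name: str = "benchflow.agents.registry") -> bool:
    """Import-guarded probe: True only if BenchFlow lists the session-factory protocol."""
    try:
        module = importlib.import_module(module_name)  # hypothetical module path
    except ImportError:
        return False  # old or missing BenchFlow: treat as seam-less
    return "session-factory" in getattr(module, "VALID_PROTOCOLS", ())


def register(
    do_register: Callable[[], T],
    seam_present: Callable[[], bool] = session_factory_seam_present,
) -> Optional[T]:
    """Register only when the seam exists; otherwise warn and return None."""
    if not seam_present():
        logger.warning("BenchFlow lacks the session-factory seam; omnigent-pi not registered")
        return None
    return do_register()
```

Passing the probe as a parameter is one way to keep the degradation test version-independent: a test can supply `lambda: False` without touching the installed BenchFlow, which plays the same role as monkeypatching the gate.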
…on 0.7 line

omnigent's model calls run in-sandbox, so they must route through BenchFlow's litellm usage proxy. With usage_tracking=off the 0.7 zero-activity guard (zero tokens + zero tool calls) nulls the reward. Note reward 1.0 was validated on the 0.7 line (feat/0.7-on-release) with the session-factory seam.
…race-limitation note

- _RUN_TIMEOUT_SEC: hardcoded 600 clipped tasks with a 900s [agent] budget (2 tasks timed out at scale). Make it BENCHFLOW_OMNIGENT_RUN_TIMEOUT_SEC (default 1800); document that the kernel's asyncio.wait_for on the task budget is the authoritative per-turn bound and this is only a hung-exec backstop.
- README: the prior 'parse --debug-events JSONL' suggestion is wrong. Verified in-sandbox that omnigent 0.1.0 headless -p mode exposes NO tool-call stream (--debug-events is interactive-only, --log is rejected with -p, chat.db is ephemeral). Surfacing tool calls needs a server-API rework (poll /v1/sessions/<conv>/items), tracked as a follow-up.
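For illustration, the environment override could be read roughly as follows. The variable name and the 1800 default come from the commit message; the helper name `run_timeout_sec` and the choice to fall back to the default on empty, non-numeric, or non-positive values are assumptions, not necessarily what the package does.

```python
import os
from typing import Mapping, Optional

_ENV_VAR = "BENCHFLOW_OMNIGENT_RUN_TIMEOUT_SEC"
_DEFAULT_SEC = 1800.0


def run_timeout_sec(env: Optional[Mapping[str, str]] = None) -> float:
    """Return the hung-exec backstop in seconds, read from the environment."""
    env = os.environ if env is None else env
    raw = env.get(_ENV_VAR, "").strip()
    if not raw:
        return _DEFAULT_SEC
    try:
        value = float(raw)
    except ValueError:
        return _DEFAULT_SEC  # assumption: unparsable values fall back to the default
    # Assumption: zero or negative values are treated as misconfiguration.
    return value if value > 0 else _DEFAULT_SEC
```

Because the task budget enforced by `asyncio.wait_for` remains the authoritative per-turn bound, this value only needs to be comfortably larger than the longest `[agent]` budget in use.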


What
Adds `omnigent/`, a new self-contained package wrapping Databricks Omnigent's `pi` meta-harness as a BenchFlow agent.
It's the first non-ACP agent in this repo. Instead of an ACP shim, it rides BenchFlow's Session path: it registers `omnigent-pi` with `protocol="session-factory"` and a `session_factory` entrypoint, and `OmnigentSession` shells the one-shot `omnigent run --harness pi` CLI inside the sandbox via `Sandbox.exec`.
Why in-sandbox subprocess (not the in-process `omnigent-client` SDK): omnigent's runner pins `starlette<1` and ships a conflicting FastAPI/litellm stack, which breaks when imported into the BenchFlow host process (it runs a litellm/starlette-1.x usage proxy).
Verified
End-to-end in `bench eval` on Daytona (x86_64) with `deepseek/deepseek-chat`:
- `hello-world` (toy): reward 1.0, `error: None`
- `citation-check` (real research task): reward 1.0, all 9 verifier tests pass
`citation-check` is a genuine medium-difficulty task, the same one the `harness-pi` README flags as not yet passing, so this is verified beyond a toy.
Requirements (documented in the package README)
- A BenchFlow build carrying the session-factory seam (`AgentConfig.session_factory` + `"session-factory"` in `VALID_PROTOCOLS` + `rollout._connect_session_factory`). This is not in published `0.6.x`; without it, `register()` logs a warning and returns `None` rather than crashing the import, the same graceful degradation `acp-registry` uses for its `acp_model_via_env` flag.
- x86_64 only: `cel-expr-python` has no `linux-aarch64` wheel (installs on Daytona, not local Apple-Silicon docker); the install pins `--python 3.12` (no `cp314` wheel either).
Install internals (the two non-obvious in-container fixes)
The `install_cmd` provisions omnigent (`uv tool`, `--python 3.12`) + the `pi` CLI in the sandbox, plus:
- Symlinks `node`/`npm`/`npx` onto the bare PATH: `pi` is a `#!/usr/bin/env node` script and omnigent's runner spawns it from a fresh shell that doesn't inherit the install PATH; without `node` resolvable, `pi` never launches and writes no file.
- Installs `tmux`: the runner auto-creates a per-conversation REPL terminal and hard-fails without it.
(The residual `bwrap` REPL-terminal error in omnigent's logs is non-fatal: the `pi` harness runs its own shell to do the task work.)
Known limitation
The stdout-parsing adapter emits only the prompt + final agent message, so per-tool-call trajectory granularity is coarse (`n_tool_calls` reads 0 even though the harness uses tools). The reward is real; richer trajectories would come from parsing omnigent's `--debug-events` JSONL.
Includes
- `omnigent/` package (src + tests + README + pyproject + LICENSE), passing `pytest` + `ruff`.
- Per-package CI (`.github/workflows/test-omnigent.yaml`) + an omnigent lint step.
Note
Medium Risk
Connect writes literal API keys into sandbox config and install_cmd runs broad package/network setup in eval sandboxes; behavior depends on unpublished BenchFlow session-factory seam and x86_64/Python 3.12 constraints.
Overview
Adds a new `omnigent/` package that registers `omnigent-pi` via BenchFlow's `session-factory` path (not ACP): `OmnigentAgent.connect` writes provider credentials into the sandbox as `~/.omnigent/config.yaml`, and `OmnigentSession.prompt` runs `omnigent run --harness pi` per turn through `Sandbox.exec` in `/app` (in-sandbox CLI to avoid host litellm/starlette conflicts). `register.py` supplies a sandbox `install_cmd` (Node/`pi` on bare PATH, tmux, `uv tool install` omnigent 0.1.0 on Python 3.12) and no-ops registration when BenchFlow lacks the session-factory seam. Trajectories are coarse (prompt + filtered final stdout).
Repo docs and CI gain an omnigent row in the root README, `test-omnigent.yaml`, and a ruff lint step for `omnigent/`.
Reviewed by Cursor Bugbot for commit a86d0b6. Bugbot is set up for automated code reviews on this repo.