feat(omnigent): Databricks Omnigent pi meta-harness as a non-ACP BenchFlow agent by xdotli · Pull Request #7 · benchflow-ai/agents

xdotli · 2026-06-13T22:33:55Z

What

Adds omnigent/ — a new self-contained package wrapping Databricks Omnigent's pi meta-harness as a BenchFlow agent.

It's the first non-ACP agent in this repo. Instead of an ACP shim, it rides BenchFlow's Session path: it registers omnigent-pi with protocol="session-factory" and a session_factory entrypoint, and OmnigentSession shells out to the one-shot `omnigent run --harness pi` CLI inside the sandbox via Sandbox.exec.

Why an in-sandbox subprocess (not the in-process omnigent-client SDK): omnigent's runner pins starlette<1 and ships a conflicting FastAPI/litellm stack, which breaks when imported into the BenchFlow host process (the host itself runs a litellm/starlette-1.x usage proxy).

Verified

End-to-end in bench eval on Daytona (x86_64) with deepseek/deepseek-chat:

| Task | Result |
| --- | --- |
| `hello-world` (toy) | reward 1.0, `error: None` |
| `citation-check` (real research task) | reward 1.0: all 9 verifier tests pass (read BibTeX, query citation APIs over the network, detect the 3 hallucinated entries, write sorted JSON) |

citation-check is a genuine medium-difficulty task — the same one the harness-pi README flags as not yet passing — so this is verified beyond a toy.

Requirements (documented in the package README)

- A BenchFlow build with the session-factory seam (AgentConfig.session_factory + "session-factory" in VALID_PROTOCOLS + rollout._connect_session_factory). This is not in published 0.6.x; without it, register() logs a warning and returns None rather than crashing the import, the same graceful degradation acp-registry uses for its acp_model_via_env flag.
- An x86_64 sandbox: cel-expr-python has no linux-aarch64 wheel (it installs on Daytona, not on local Apple-Silicon Docker). The install pins --python 3.12 because there is no cp314 wheel either.

Install internals (the two non-obvious in-container fixes)

The install_cmd provisions omnigent (uv tool, --python 3.12) + the pi CLI in the sandbox, plus:

- symlinks node/npm/npx onto the bare PATH: pi is a #!/usr/bin/env node script, and omnigent's runner spawns it from a fresh shell that doesn't inherit the install PATH; without node resolvable, pi never launches and writes no file.
- installs tmux: the runner auto-creates a per-conversation REPL terminal and hard-fails without it.

(The residual bwrap REPL-terminal error in omnigent's logs is non-fatal — the pi harness runs its own shell to do the task work.)
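The provisioning steps above could be assembled along these lines. This is a minimal sketch, not the package's actual install_cmd: the function name, the `/usr/local/bin` symlink target, and the use of `apt-get` are assumptions, and it presumes Node is already installed. The pi CLI install step is left out because its package name is not given here.

```python
import shlex

OMNIGENT_VERSION = "0.1.0"  # version named in the Bugbot overview
BARE_PATH_DIR = "/usr/local/bin"  # assumption: a directory on the default PATH


def build_install_cmd(python_version: str = "3.12") -> str:
    """Join the provisioning steps into one fail-fast shell command."""
    steps = [
        # Assumption: a Debian-style image where apt-get is available.
        "apt-get update && apt-get install -y tmux",
        # cel-expr-python has no cp314 wheel, so pin the interpreter.
        f"uv tool install --python {shlex.quote(python_version)} "
        f"omnigent=={OMNIGENT_VERSION}",
    ]
    # Expose node/npm/npx to fresh shells that do not inherit the install PATH.
    # Assumes Node is already installed somewhere on the install-time PATH.
    for tool in ("node", "npm", "npx"):
        steps.append(f'ln -sf "$(command -v {tool})" {BARE_PATH_DIR}/{tool}')
    return " && ".join(steps)
```

Chaining every step with `&&` means a failed tmux or uv install stops provisioning instead of leaving a half-configured sandbox.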

Known limitation

The stdout-parsing adapter emits only the prompt + final agent message, so per-tool-call trajectory granularity is coarse (n_tool_calls reads 0 even though the harness uses tools). The reward is real; richer trajectories would come from parsing omnigent's --debug-events JSONL.

Includes

- omnigent/ package (src + tests + README + pyproject + LICENSE), passing pytest + ruff.
- Per-package CI (.github/workflows/test-omnigent.yaml) + an omnigent lint step.
- README entries (Agents table, repo layout, license).

Note

Medium Risk
Connect writes literal API keys into sandbox config and install_cmd runs broad package/network setup in eval sandboxes; behavior depends on unpublished BenchFlow session-factory seam and x86_64/Python 3.12 constraints.

Overview
Adds a new omnigent/ package that registers omnigent-pi via BenchFlow’s session-factory path (not ACP): OmnigentAgent.connect writes provider credentials into the sandbox as ~/.omnigent/config.yaml, and OmnigentSession.prompt runs omnigent run --harness pi per turn through Sandbox.exec in /app (in-sandbox CLI to avoid host litellm/starlette conflicts).

register.py supplies a sandbox install_cmd (Node/pi on bare PATH, tmux, uv tool install omnigent 0.1.0 on Python 3.12) and no-ops registration when BenchFlow lacks the session-factory seam. Trajectories are coarse (prompt + filtered final stdout).

Repo docs and CI gain an omnigent row in the root README, test-omnigent.yaml, and a ruff lint step for omnigent/.

Reviewed by Cursor Bugbot for commit a86d0b6. Bugbot is set up for automated code reviews on this repo.

…BenchFlow agent

Wraps Databricks Omnigent's `pi` meta-harness as a BenchFlow agent over the NON-ACP Session path: registers `omnigent-pi` with protocol="session-factory" and a session_factory entrypoint. OmnigentSession shells out to the one-shot `omnigent run --harness pi` CLI inside the sandbox via Sandbox.exec, because the omnigent-client SDK pins starlette<1 and conflicts with BenchFlow's litellm/starlette-1.x usage proxy, so it can't be imported in-process.

Verified end-to-end in `bench eval` on Daytona (x86_64) with deepseek-chat: reward 1.0 on hello-world AND the real citation-check research task (read a BibTeX file, query citation APIs, detect hallucinated entries; all 9 verifier tests pass).

Requires a BenchFlow build carrying the session-factory seam (AgentConfig.session_factory + "session-factory" in VALID_PROTOCOLS + rollout._connect_session_factory); without it register() logs a warning and returns None rather than crashing the import, mirroring acp-registry's acp_model_via_env degradation. x86_64 only: cel-expr-python ships no aarch64 wheel.

The install command provisions omnigent (uv tool, --python 3.12) + the pi CLI in the sandbox, symlinks node/npm/npx onto the bare PATH (pi is a `#!/usr/bin/env node` script the runner spawns from a fresh shell), and installs tmux (the runner's managed REPL terminal hard-fails without it).

Adds per-package CI (test-omnigent.yaml), an omnigent lint step, and README entries (Agents table, repo layout, license).

cursor

Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.


Reviewed by Cursor Bugbot for commit b8500ad.

cursor · 2026-06-13T22:35:26Z

+            sandbox,
+            model=model,
+            exec_user=self._exec_user,
+        )


Ignoring config write failure

Medium Severity

connect awaits sandbox.exec to write ~/.omnigent/config.yaml but never inspects the command result. A non-zero exit or partial write still returns a live OmnigentSession, so later omnigent run turns can proceed without valid gateway credentials until failure surfaces only in stderr.
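One way to address this finding is to inspect the exec result before handing back a session. The sketch below is illustrative only: `ExecResult`, its `return_code` and `stderr` fields, `ConfigWriteError`, and `FakeSandbox` are hypothetical stand-ins, since the shape of what Sandbox.exec actually returns is not shown in this PR.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class ExecResult:
    """Hypothetical stand-in for whatever Sandbox.exec returns."""

    return_code: int
    stderr: str = ""


class ConfigWriteError(RuntimeError):
    """Raised when the credentials file could not be written."""


async def write_config_checked(sandbox, write_cmd: str) -> None:
    """Run the config-writing command and fail fast on a non-zero exit."""
    result = await sandbox.exec(write_cmd)
    if result.return_code != 0:
        raise ConfigWriteError(
            f"writing ~/.omnigent/config.yaml failed "
            f"(exit {result.return_code}): {result.stderr.strip()}"
        )


class FakeSandbox:
    """Test double that returns a fixed exit code."""

    def __init__(self, return_code: int) -> None:
        self._return_code = return_code

    async def exec(self, cmd: str) -> ExecResult:
        stderr = "disk full" if self._return_code else ""
        return ExecResult(self._return_code, stderr)
```

With a check like this, connect would raise before constructing OmnigentSession, so a bad credentials write surfaces at connect time rather than in the stderr of a later turn.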


cursor · 2026-06-13T22:35:26Z

+            f"omnigent stop >/dev/null 2>&1; "
+            f"omnigent run --harness pi --model {shlex.quote(model)} "
+            f"-p {shlex.quote(text)}"
+        )


Run proceeds after failed cd

Medium Severity

The per-turn shell command chains cd /app && omnigent stop with omnigent run using ;, so omnigent run still executes when cd /app fails. Work then happens outside the task workspace /app, so verifier file checks can miss agent output even though the turn appears to complete.
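A possible fix is to gate the whole stop-then-run sequence on the `cd` with a brace group. This is a sketch under assumptions, not the merged change: `build_turn_cmd` and its `workdir` parameter are hypothetical names, and only the CLI flags quoted in the diff above are used.

```python
import shlex


def build_turn_cmd(model: str, text: str, workdir: str = "/app") -> str:
    """Build the per-turn command so nothing runs if `cd` fails."""
    run = (
        f"omnigent run --harness pi --model {shlex.quote(model)} "
        f"-p {shlex.quote(text)}"
    )
    # `&&` gates the whole group on a successful cd; inside the group, `;`
    # still lets the run proceed when `omnigent stop` exits non-zero.
    return (
        f"cd {shlex.quote(workdir)} && "
        f"{{ omnigent stop >/dev/null 2>&1; {run}; }}"
    )
```

The brace group keeps the original tolerance for a failing `omnigent stop` (for example when nothing is running), while the group's exit status is that of `omnigent run`.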


Published BenchFlow does NOT validate `protocol` at register_agent() time, so the previous try/except never tripped and register() silently registered a non-connectable omnigent-pi on a seam-less build (CI: test job failed asserting register() is None). Replace the try/except with an explicit `_session_factory_seam_present()` gate (`"session-factory" in VALID_PROTOCOLS`, import-guarded for old BenchFlow) so behaviour is identical across versions: no seam → warn + return None + do not register. Make the degradation test version-independent by monkeypatching the gate (works on a seam-carrying venv too), and add a test that the package's seam probe agrees with VALID_PROTOCOLS membership.

…on 0.7 line

omnigent's model calls run in-sandbox, so they must route through BenchFlow's litellm usage proxy. With usage_tracking=off, the 0.7 zero-activity guard (zero tokens + zero tool calls) nulls the reward. Note: reward 1.0 was validated on the 0.7 line (feat/0.7-on-release) with the session-factory seam.
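The interaction with the zero-activity guard can be illustrated with a small sketch. This is not BenchFlow's code: the function name and parameters are hypothetical, and it only encodes the rule as stated in the commit message (zero tokens plus zero tool calls nulls the reward).

```python
from typing import Optional


def apply_zero_activity_guard(
    reward: Optional[float], total_tokens: int, n_tool_calls: int
) -> Optional[float]:
    """Null the reward when the run shows no tokens and no tool calls."""
    if total_tokens == 0 and n_tool_calls == 0:
        return None
    return reward
```

Because this adapter reports n_tool_calls as 0, token counts recorded by the usage proxy are the only activity signal left, which is why disabling usage tracking would null an otherwise valid reward.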

cursor · 2026-06-13T23:30:47Z

Bugbot couldn't run - usage limit reached

Bugbot is counted against Cursor usage for this user or team, and this run hit a usage or spend limit.

A user or team admin can review and increase usage limits in the Cursor dashboard.

(requestId: serverGenReqId_563cc9fd-470c-4e2b-a2d6-ef31a992062d)

…race-limitation note

- _RUN_TIMEOUT_SEC: hardcoded 600 clipped tasks with a 900s [agent] budget (2 tasks timed out at scale). Make it BENCHFLOW_OMNIGENT_RUN_TIMEOUT_SEC (default 1800); document that the kernel's asyncio.wait_for on the task budget is the authoritative per-turn bound and this is only a hung-exec backstop.
- README: the prior 'parse --debug-events JSONL' suggestion is wrong. Verified in-sandbox that omnigent 0.1.0 headless -p mode exposes NO tool-call stream (--debug-events is interactive-only, --log is rejected with -p, chat.db is ephemeral). Surfacing tool calls needs a server-API rework (poll /v1/sessions/<conv>/items), tracked as a follow-up.
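Reading the backstop from the environment could be as simple as the sketch below. The helper name `run_timeout_sec` is hypothetical, and falling back to the default on an unparsable or non-positive value is an assumption rather than documented behaviour.

```python
import os

DEFAULT_RUN_TIMEOUT_SEC = 1800


def run_timeout_sec(env=None) -> int:
    """Read the hung-exec backstop from the environment, defaulting to 1800s."""
    env = os.environ if env is None else env
    raw = env.get("BENCHFLOW_OMNIGENT_RUN_TIMEOUT_SEC", "")
    try:
        value = int(raw)
    except ValueError:
        # Assumption: unset or malformed values fall back to the default.
        return DEFAULT_RUN_TIMEOUT_SEC
    return value if value > 0 else DEFAULT_RUN_TIMEOUT_SEC
```

A default of 1800 sits above the 900s task budget mentioned above, so the per-turn asyncio.wait_for fires first and this value only matters when an exec hangs.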

cursor · 2026-06-14T03:13:44Z

Bugbot couldn't run - usage limit reached

Bugbot is counted against Cursor usage for this user or team, and this run hit a usage or spend limit.

A user or team admin can review and increase usage limits in the Cursor dashboard.

(requestId: serverGenReqId_0176e57d-019f-4b78-84d2-f8dc83c13aed)

cursor Bot reviewed Jun 13, 2026


xdotli added 2 commits June 13, 2026 18:38

xdotli mentioned this pull request Jun 14, 2026

feat(rollout): non-ACP session-factory CONNECT seam benchflow-ai/benchflow#753

Closed

xdotli merged commit 7be1d74 into main Jun 14, 2026
4 checks passed

xdotli merged 4 commits into main from add-omnigent-agent

xdotli commented Jun 13, 2026 •

edited by cursor Bot

Loading

Uh oh!

cursor Bot left a comment

Uh oh!

cursor Bot Jun 13, 2026

Uh oh!

cursor Bot Jun 13, 2026

Uh oh!

cursor Bot commented Jun 13, 2026

Uh oh!

cursor Bot commented Jun 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

xdotli commented Jun 13, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Verified

Requirements (documented in the package README)

Install internals (the two non-obvious in-container fixes)

Known limitation

Includes

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot Jun 13, 2026

Choose a reason for hiding this comment

Ignoring config write failure

Uh oh!

cursor Bot Jun 13, 2026

Choose a reason for hiding this comment

Run proceeds after failed cd

Uh oh!

cursor Bot commented Jun 13, 2026

Bugbot couldn't run - usage limit reached

Uh oh!

cursor Bot commented Jun 14, 2026

Bugbot couldn't run - usage limit reached

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

xdotli commented Jun 13, 2026 •

edited by cursor Bot

Loading