
feat(omnigent): Databricks Omnigent pi meta-harness as a non-ACP BenchFlow agent #7

Merged
xdotli merged 4 commits into main from add-omnigent-agent on Jun 14, 2026

Conversation

@xdotli

@xdotli xdotli commented Jun 13, 2026

What

Adds omnigent/ — a new self-contained package wrapping Databricks Omnigent's pi meta-harness as a BenchFlow agent.

It's the first non-ACP agent in this repo. Instead of an ACP shim, it rides BenchFlow's Session path: it registers omnigent-pi with protocol="session-factory" and a session_factory entrypoint, and OmnigentSession shells the one-shot omnigent run --harness pi CLI inside the sandbox via Sandbox.exec.

Why an in-sandbox subprocess (not the in-process omnigent-client SDK): omnigent's runner pins starlette<1 and ships a conflicting FastAPI/litellm stack, which breaks when imported into the BenchFlow host process (the host itself runs a litellm/starlette-1.x usage proxy).

Verified

End-to-end in bench eval on Daytona (x86_64) with deepseek/deepseek-chat:

Task | Result
--- | ---
hello-world (toy) | reward 1.0, error: None
citation-check (real research task) | reward 1.0; all 9 verifier tests pass (read BibTeX, query citation APIs over the network, detect the 3 hallucinated entries, write sorted JSON)

citation-check is a genuine medium-difficulty task — the same one the harness-pi README flags as not yet passing — so this is verified beyond a toy.

Requirements (documented in the package README)

  • A BenchFlow build with the session-factory seam (AgentConfig.session_factory + "session-factory" in VALID_PROTOCOLS + rollout._connect_session_factory). This is not in published 0.6.x; without it, register() logs a warning and returns None rather than crashing the import — the same graceful degradation acp-registry uses for its acp_model_via_env flag.
  • x86_64 sandbox: cel-expr-python has no linux-aarch64 wheel (it installs on Daytona, but not in local Apple-Silicon Docker); the install pins --python 3.12 (there is no cp314 wheel either).

Install internals (the two non-obvious in-container fixes)

The install_cmd provisions omnigent (uv tool, --python 3.12) + the pi CLI in the sandbox, plus:

  • symlinks node/npm/npx onto the bare PATH: pi is a #!/usr/bin/env node script, and omnigent's runner spawns it from a fresh shell that doesn't inherit the install PATH; without node resolvable, pi never launches and writes no file.
  • installs tmux — the runner auto-creates a per-conversation REPL terminal and hard-fails without it.

(The residual bwrap REPL-terminal error in omnigent's logs is non-fatal — the pi harness runs its own shell to do the task work.)
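
As a rough illustration of the provisioning steps described above, the command string could be assembled along these lines. This is a sketch under assumptions, not the package's actual install_cmd: the helper name `build_install_cmd`, the `/usr/local/bin` symlink target, the use of `apt-get`, and the `omnigent==0.1.0` package spec are all illustrative guesses, and the pi CLI install step is left out because the PR does not show it.

```python
import shlex


def build_install_cmd(bin_dir: str = "/usr/local/bin") -> str:
    """Assemble a sandbox provisioning command (hypothetical helper)."""
    steps = [
        # tmux: the runner hard-fails without it (apt-get is an assumption).
        "apt-get update && apt-get install -y tmux",
        # omnigent as a uv tool on Python 3.12 (package spec is an assumption).
        "uv tool install --python 3.12 omnigent==0.1.0",
    ]
    # Symlink node/npm/npx into a directory that a fresh shell already has
    # on its PATH, so `#!/usr/bin/env node` scripts can resolve node.
    for tool in ("node", "npm", "npx"):
        steps.append(f'ln -sf "$(command -v {tool})" {shlex.quote(bin_dir)}/{tool}')
    return " && ".join(steps)
```

Joining the steps with `&&` means a failed step stops provisioning early instead of leaving a half-installed sandbox.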

Known limitation

The stdout-parsing adapter emits only the prompt + final agent message, so per-tool-call trajectory granularity is coarse (n_tool_calls reads 0 even though the harness uses tools). The reward is real; richer trajectories would come from parsing omnigent's --debug-events JSONL.

Includes

  • omnigent/ package (src + tests + README + pyproject + LICENSE), passing pytest + ruff.
  • Per-package CI (.github/workflows/test-omnigent.yaml) + an omnigent lint step.
  • README entries (Agents table, repo layout, license).

Note

Medium Risk
Connect writes literal API keys into sandbox config and install_cmd runs broad package/network setup in eval sandboxes; behavior depends on unpublished BenchFlow session-factory seam and x86_64/Python 3.12 constraints.

Overview
Adds a new omnigent/ package that registers omnigent-pi via BenchFlow’s session-factory path (not ACP): OmnigentAgent.connect writes provider credentials into the sandbox as ~/.omnigent/config.yaml, and OmnigentSession.prompt runs omnigent run --harness pi per turn through Sandbox.exec in /app (in-sandbox CLI to avoid host litellm/starlette conflicts).

register.py supplies a sandbox install_cmd (Node/pi on bare PATH, tmux, uv tool install omnigent 0.1.0 on Python 3.12) and no-ops registration when BenchFlow lacks the session-factory seam. Trajectories are coarse (prompt + filtered final stdout).

Repo docs and CI gain an omnigent row in the root README, test-omnigent.yaml, and a ruff lint step for omnigent/.

Reviewed by Cursor Bugbot for commit a86d0b6. Bugbot is set up for automated code reviews on this repo. Configure here.

…BenchFlow agent

Wraps Databricks Omnigent's `pi` meta-harness as a BenchFlow agent over the
NON-ACP Session path: registers `omnigent-pi` with protocol="session-factory"
and a session_factory entrypoint. OmnigentSession shells the one-shot
`omnigent run --harness pi` CLI inside the sandbox via Sandbox.exec — the
omnigent-client SDK pins starlette<1 and conflicts with BenchFlow's
litellm/starlette-1.x usage proxy, so it can't be imported in-process.

Verified end-to-end in `bench eval` on Daytona (x86_64) with deepseek-chat:
reward 1.0 on hello-world AND the real citation-check research task (read a
BibTeX file, query citation APIs, detect hallucinated entries — all 9 verifier
tests pass).

Requires a BenchFlow build carrying the session-factory seam (AgentConfig
.session_factory + "session-factory" in VALID_PROTOCOLS + rollout
._connect_session_factory); without it register() logs a warning and returns
None rather than crashing the import, mirroring acp-registry's acp_model_via_env
degradation. x86_64 only — cel-expr-python ships no aarch64 wheel.

The install command provisions omnigent (uv tool, --python 3.12) + the pi CLI
in the sandbox, symlinks node/npm/npx onto the bare PATH (pi is a
`#!/usr/bin/env node` script the runner spawns from a fresh shell), and installs
tmux (the runner's managed REPL terminal hard-fails without it).

Adds per-package CI (test-omnigent.yaml), an omnigent lint step, and README
entries (Agents table, repo layout, license).

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.

Reviewed by Cursor Bugbot for commit b8500ad. Configure here.

sandbox,
model=model,
exec_user=self._exec_user,
)

Ignoring config write failure

Medium Severity

connect awaits sandbox.exec to write ~/.omnigent/config.yaml but never inspects the command result. A non-zero exit or partial write still returns a live OmnigentSession, so later omnigent run turns can proceed without valid gateway credentials until failure surfaces only in stderr.
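
One way to address this, sketched under assumptions: check the exec result and raise before a session is returned. The `ExecResult` shape (a `return_code` and `stderr`), the `ConfigWriteError` class, and the `write_config_checked` helper are hypothetical stand-ins; the real return type of Sandbox.exec is not shown in this PR.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class ExecResult:
    """Stand-in for whatever Sandbox.exec returns (field names assumed)."""
    return_code: int
    stderr: str = ""


class ConfigWriteError(RuntimeError):
    """Raised when the sandbox config write does not succeed."""


async def write_config_checked(sandbox, command: str) -> None:
    """Run the config-writing command and fail fast on a non-zero exit."""
    result = await sandbox.exec(command)
    if result.return_code != 0:
        raise ConfigWriteError(
            f"config write failed (exit {result.return_code}): "
            f"{result.stderr.strip()}"
        )


class FakeSandbox:
    """Minimal fake used only to exercise the sketch."""

    def __init__(self, return_code: int) -> None:
        self._return_code = return_code

    async def exec(self, command: str) -> ExecResult:
        stderr = "disk full" if self._return_code else ""
        return ExecResult(self._return_code, stderr)
```

With a check like this in connect, a failed write surfaces as an error at connect time rather than as a credentials failure in a later turn's stderr.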

f"omnigent stop >/dev/null 2>&1; "
f"omnigent run --harness pi --model {shlex.quote(model)} "
f"-p {shlex.quote(text)}"
)

Run proceeds after failed cd

Medium Severity

The per-turn shell command chains cd /app && omnigent stop with omnigent run using ;, so omnigent run still executes when cd /app fails. Work then happens outside the task workspace /app, so verifier file checks can miss agent output even though the turn appears to complete.
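
A possible shape for the fix, as a sketch: guard the whole turn behind the cd by wrapping the stop/run pair in a brace group. The helper name `build_turn_command` and its `workdir` parameter are illustrative; only the `omnigent stop` and `omnigent run --harness pi --model ... -p ...` invocations come from the diff under review.

```python
import shlex


def build_turn_command(model: str, text: str, workdir: str = "/app") -> str:
    """Build a per-turn command that aborts if the cd fails."""
    run = (
        f"omnigent run --harness pi --model {shlex.quote(model)} "
        f"-p {shlex.quote(text)}"
    )
    # `cd ... &&` guards the whole brace group; inside it, `;` still lets
    # the run proceed when `omnigent stop` exits non-zero.
    return (
        f"cd {shlex.quote(workdir)} && "
        f"{{ omnigent stop >/dev/null 2>&1; {run}; }}"
    )
```

The brace group keeps the original tolerance for a failing `omnigent stop` while ensuring nothing runs outside the task workspace.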

xdotli added 2 commits June 13, 2026 18:38
Published BenchFlow does NOT validate `protocol` at register_agent() time, so
the previous try/except never tripped and register() silently registered a
non-connectable omnigent-pi on a seam-less build (CI: test job failed asserting
register() is None).

Replace the try/except with an explicit `_session_factory_seam_present()` gate
(`"session-factory" in VALID_PROTOCOLS`, import-guarded for old BenchFlow) so
behaviour is identical across versions: no seam → warn + return None + do not
register. Make the degradation test version-independent by monkeypatching the
gate (works on a seam-carrying venv too), and add a test that the package's
seam probe agrees with VALID_PROTOCOLS membership.
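
The gate described in this commit might look roughly like the following. It is a sketch only: the module name is passed in as a parameter because the PR does not say where BenchFlow defines VALID_PROTOCOLS, and `do_register` is a hypothetical callback standing in for the real registration call.

```python
import importlib
import logging

logger = logging.getLogger("omnigent_sketch")


def session_factory_seam_present(module_name: str) -> bool:
    """Return True only if the module lists the session-factory protocol."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False  # BenchFlow missing, or too old to have the module
    protocols = getattr(module, "VALID_PROTOCOLS", ())
    return "session-factory" in protocols


def register(module_name: str, do_register):
    """Register only when the seam exists; otherwise warn and return None."""
    if not session_factory_seam_present(module_name):
        logger.warning("session-factory seam missing; omnigent-pi not registered")
        return None
    return do_register()
```

Because the probe is a plain function, a test can monkeypatch it to force the degraded path on any BenchFlow version, which matches the version-independent test the commit describes.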
…on 0.7 line

omnigent's model calls run in-sandbox, so they must route through BenchFlow's
litellm usage proxy. With usage_tracking=off the 0.7 zero-activity guard (zero
tokens + zero tool calls) nulls the reward. Note reward 1.0 was validated on the
0.7 line (feat/0.7-on-release) with the session-factory seam.
@cursor

cursor Bot commented Jun 13, 2026

Bugbot couldn't run - usage limit reached

Bugbot is counted against Cursor usage for this user or team, and this run hit a usage or spend limit.

A user or team admin can review and increase usage limits in the Cursor dashboard.

(requestId: serverGenReqId_563cc9fd-470c-4e2b-a2d6-ef31a992062d)

…race-limitation note

- _RUN_TIMEOUT_SEC: hardcoded 600 clipped tasks with a 900s [agent] budget (2
  tasks timed out at scale). Make it BENCHFLOW_OMNIGENT_RUN_TIMEOUT_SEC (default
  1800); document that the kernel's asyncio.wait_for on the task budget is the
  authoritative per-turn bound and this is only a hung-exec backstop.
- README: the prior 'parse --debug-events JSONL' suggestion is wrong — verified
  in-sandbox that omnigent 0.1.0 headless -p mode exposes NO tool-call stream
  (--debug-events is interactive-only, --log is rejected with -p, chat.db is
  ephemeral). Surfacing tool calls needs a server-API rework (poll
  /v1/sessions/<conv>/items), tracked as a follow-up.
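
Reading the configurable backstop could be sketched as below. The environment variable name and the 1800-second default come from the commit message; the helper name `run_timeout_sec` and the choice to fall back to the default on unparsable or non-positive values are assumptions.

```python
import os

_DEFAULT_RUN_TIMEOUT_SEC = 1800


def run_timeout_sec(environ=None) -> int:
    """Read the hung-exec backstop timeout, falling back to the default."""
    env = os.environ if environ is None else environ
    raw = env.get("BENCHFLOW_OMNIGENT_RUN_TIMEOUT_SEC", "")
    try:
        value = int(raw)
    except ValueError:
        # Unset or unparsable: use the default (an assumed policy).
        return _DEFAULT_RUN_TIMEOUT_SEC
    return value if value > 0 else _DEFAULT_RUN_TIMEOUT_SEC
```

Since the task budget remains the authoritative per-turn bound, this value only needs to be comfortably larger than any realistic budget.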
@cursor

cursor Bot commented Jun 14, 2026

Bugbot couldn't run - usage limit reached

Bugbot is counted against Cursor usage for this user or team, and this run hit a usage or spend limit.

A user or team admin can review and increase usage limits in the Cursor dashboard.

(requestId: serverGenReqId_0176e57d-019f-4b78-84d2-f8dc83c13aed)

@xdotli xdotli merged commit 7be1d74 into main Jun 14, 2026
4 checks passed