test: pin run-level Bedrock thinking-effort end-to-end (Docker/Daytona parity) (#599) by ElegantLin · Pull Request #736 · benchflow-ai/benchflow

ElegantLin · 2026-06-13T17:32:28Z

Context

#599 (P0) reported: the old host BedrockProxyServer stored the per-run env but the Bedrock request translators resolved thinking effort from process-global os.environ, so a Docker run silently fell back to high while Daytona honored a run-level BENCHFLOW_BEDROCK_THINKING_EFFORT.

The implicated code no longer exists on main — PR #613 (Replace provider proxies with LiteLLM runtime) deleted bedrock_proxy.py / bedrock_runtime.py. The LiteLLM runtime sources effort from the run-level agent env via two run-aware paths:

Route config — litellm_config._bedrock_thinking_effort(model, env) reads agent_env and bakes reasoning_effort into config.yaml.
Proxy-process env — the proxy is launched as os.environ + agent_env, and the bedrock patch's effort override reads it.

So the specific bug is unreachable — but nothing pinned the behavior end to end (existing tests only checked route params, not the emitted wire payload). Per the issue's "regression tests to add", this adds wire-level coverage.

What this adds (test-only — no production change)

Drives the real litellm Bedrock Converse transform with the benchflow patch applied and asserts output_config.effort (the actual wire field the issue's repro inspected):

test_run_level_effort_from_route_lands_in_bedrock_wire_payload — route-config effort reaches the wire with host os.environ empty (sourced from the run, not the host process).
test_run_level_effort_via_proxy_process_env_overrides_stale_route — a run-level value in the proxy-process env overrides a stale-default route effort in the wire payload — the exact divergence the old Docker translator got wrong. Verified to FAIL when the patch override is neutered (fail-then-pass).
test_docker_and_daytona_resolve_identical_bedrock_effort_from_run_env — both lanes resolve identical effort from the same agent_env; with host env scrubbed it still flows through, so neither can diverge.

Uses medium (litellm-accepted, distinct from the high default) so a silent fallback fails the assertion.

Disposition

This pins #599's behavior; it does not close it — leaving the keep/close call to maintainers.

⚠️ Separate finding (out of scope, will file/flag): litellm 1.88.0rc1 rejects effort values xhigh and max for opus-4-8 (BadRequestError), but BenchFlow's _BEDROCK_THINKING_EFFORTS still allows them — so BENCHFLOW_BEDROCK_THINKING_EFFORT=max (the #598 "MAX thinking" usage) would error at request time. Not #599; flagged for a follow-up.

🤖 Generated with Claude Code

…a parity) (#599) #599 reported that the old host BedrockProxyServer stored per-run env but the Bedrock translators resolved thinking effort from process-global os.environ, so Docker silently fell back to `high` while Daytona honored a run-level BENCHFLOW_BEDROCK_THINKING_EFFORT. PR #613 deleted that proxy (bedrock_proxy.py / bedrock_runtime.py no longer exist); the LiteLLM runtime now sources effort from the run-level agent env via two run-aware paths — route config (reasoning_effort baked into config.yaml from agent_env) and the proxy-process env (launched as os.environ + agent_env). The specific bug is unreachable, but nothing pinned the behavior end to end. Adds wire-level regression coverage (the issue's actual ask — assert output_config.effort, not just route params) by driving the REAL litellm Bedrock Converse transform with the benchflow patch applied: - route-config effort lands in the wire payload with host os.environ empty (sourced from the run, not the host process); - a run-level value in the proxy-process env overrides a stale-default route effort in the wire payload — the exact divergence the old Docker translator got wrong (verified to FAIL when the patch override is neutered); - Docker and Daytona resolve identical effort from the same agent_env, so neither lane can diverge. No production change — the architecture already behaves correctly; this guards the regression. Uses `medium` (litellm-accepted, distinct from the `high` default) so a silent fallback fails the assertion. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

bingran-you · 2026-06-15T05:49:53Z

Codex review/update for #599 on 2026-06-15:

Status: implementation remains test-only and scoped to tests/test_bedrock_thinking.py. I rebased the branch onto current main (143011865af981231f547f92257e805d343efaca) so the stale CI failure from the old branch head is superseded. The previous integration-eval failure was due to the old branch missing .github/scripts/select_integration_provider.py, which exists on current main.

Validation completed on the rebased head:

uv run ruff check src tests tools passed.
uv run ruff format --check src tests tools passed.
uv run ty check passed.
uv run python -m pytest tests/ passed: 4139 passed, 4 skipped, 7 deselected.
CI-equivalent audit command passed: uv export --locked --extra dev --extra sandbox-daytona --extra sandbox-modal --format requirements-txt --no-hashes --output-file /tmp/req-audit && uv run --with pip-audit pip-audit -r /tmp/req-audit --no-deps --disable-pip; result: no known vulnerabilities found.

Real end-to-end checks against latest SkillsBench main (87df9cbd82fcf07fcb0cab960a1f8435d0c415ee):

Daytona + Gemini + 3d-scan-calc + with-skill passed: reward 1.0, 15 tool calls, partial_trajectory=false, verifier passed. Manual trajectory audit confirmed mesh-analysis was activated from /home/agent/.gemini/skills/mesh-analysis and used by the agent.
Daytona + Gemini + citation-check + no-skill produced complete BenchFlow/verifier/timing artifacts and no obvious skill-file leakage, but the agent hit wall-clock timeout with a partial ACP trajectory. I am not treating that run as a publishable healthy trial; it is only path-validation evidence.

Caveats / merge gate:

New GitHub checks are running on the rebased head now.
I am not self-approving or merging this PR because the repo rule requires human review before main.
Docker-lane e2e is still pending because the local Docker daemon is unavailable; the GCP runner has Docker and the Codex/Claude CLIs were updated, but I have not used it as merge evidence yet.

Disposition: ready for maintainer review once the new GitHub checks are green. After merge, #599 can be closed as fixed-by-coverage / architecture-regression-guarded rather than a production code change.

bingran-you · 2026-06-15T06:54:04Z

Review request/status refresh: this PR remains green on test, pip-audit, and integration-eval, with merge state CLEAN. It still needs human review before merge. Separate strict parity evidence remains incomplete: Daytona with-skill passed, but the no-skill run timed out with partial trajectory and the broader Docker/Daytona/API matrix is not complete.

bingran-you · 2026-06-15T07:03:58Z

Status correction after fresh Bedrock preflight (2026-06-15):

GitHub CI is green and the PR is CLEAN, but I moved this from status:ready to status:blocked under the requested release/e2e gate. The remaining blocker is real Bedrock parity evidence, not unit/CI.

Fresh raw provider preflight on benchflow-docker-runner-03 using $HOME/.env returned HTTP 403 for us.anthropic.claude-opus-4-8 (Authentication failed: Please make sure your API Key is valid.). runner-06 does not have $HOME/.env, and local /Users/bingran_you/Downloads/GitHub/bingran-you/.env does not define AWS_BEARER_TOKEN_BEDROCK.

Per repo guidance, I did not launch a Bedrock e2e run with a known-invalid key because it would only produce an unhealthy provider/auth trajectory. Once a valid Bedrock bearer token is available, the next gate is the Docker/Daytona with-skill and no-skill canary matrix with trajectory audit.

This was referenced Jun 13, 2026

Bedrock 4.8 thinking effort 'max'/'xhigh' are accepted by BenchFlow but rejected by LiteLLM → request-time BadRequestError #737

Open

fix: clamp Bedrock thinking effort to the LiteLLM-accepted ceiling (#737) #739

Open

bingran-you had a problem deploying to pypi-internal-preview June 15, 2026 05:19 — with GitHub Actions Failure

ElegantLin and others added 2 commits June 15, 2026 01:25

test: document Bedrock effort regression guards

1430118

bingran-you force-pushed the test/599-bedrock-effort-parity branch from e2df289 to 1430118 Compare June 15, 2026 05:49

bingran-you temporarily deployed to pypi-internal-preview June 15, 2026 05:49 — with GitHub Actions Inactive

bingran-you added review:pending PR is ready-for-review, no reviewer engagement yet. status:ready Triaged, unassigned, available to claim. area:rollout Issue / PR lives primarily in the "rollout" subsystem. area:sandbox Issue / PR lives primarily in the "sandbox" subsystem. labels Jun 15, 2026

bingran-you mentioned this pull request Jun 15, 2026

Docker Bedrock proxy ignores run-level BENCHFLOW_BEDROCK_THINKING_EFFORT #599

Open

bingran-you added status:blocked Waiting on external dependency. Add a comment explaining why. and removed status:ready Triaged, unassigned, available to claim. labels Jun 15, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test: pin run-level Bedrock thinking-effort end-to-end (Docker/Daytona parity) (#599)#736

test: pin run-level Bedrock thinking-effort end-to-end (Docker/Daytona parity) (#599)#736
ElegantLin wants to merge 2 commits into
mainfrom
test/599-bedrock-effort-parity

ElegantLin commented Jun 13, 2026

Uh oh!

bingran-you commented Jun 15, 2026

Uh oh!

bingran-you commented Jun 15, 2026

Uh oh!

bingran-you commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ElegantLin commented Jun 13, 2026

Context

What this adds (test-only — no production change)

Disposition

Uh oh!

bingran-you commented Jun 15, 2026

Uh oh!

bingran-you commented Jun 15, 2026

Uh oh!

bingran-you commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants