Skip to content

test: pin run-level Bedrock thinking-effort end-to-end (Docker/Daytona parity) (#599)#736

Open
ElegantLin wants to merge 2 commits into
mainfrom
test/599-bedrock-effort-parity
Open

test: pin run-level Bedrock thinking-effort end-to-end (Docker/Daytona parity) (#599)#736
ElegantLin wants to merge 2 commits into
mainfrom
test/599-bedrock-effort-parity

Conversation

@ElegantLin

Copy link
Copy Markdown
Contributor

Context

#599 (P0) reported: the old host BedrockProxyServer stored the per-run env but the Bedrock request translators resolved thinking effort from process-global os.environ, so a Docker run silently fell back to high while Daytona honored a run-level BENCHFLOW_BEDROCK_THINKING_EFFORT.

The implicated code no longer exists on main — PR #613 (Replace provider proxies with LiteLLM runtime) deleted bedrock_proxy.py / bedrock_runtime.py. The LiteLLM runtime sources effort from the run-level agent env via two run-aware paths:

  1. Route configlitellm_config._bedrock_thinking_effort(model, env) reads agent_env and bakes reasoning_effort into config.yaml.
  2. Proxy-process env — the proxy is launched as os.environ + agent_env, and the bedrock patch's effort override reads it.

So the specific bug is unreachable — but nothing pinned the behavior end to end (existing tests only checked route params, not the emitted wire payload). Per the issue's "regression tests to add", this adds wire-level coverage.

What this adds (test-only — no production change)

Drives the real litellm Bedrock Converse transform with the benchflow patch applied and asserts output_config.effort (the actual wire field the issue's repro inspected):

  • test_run_level_effort_from_route_lands_in_bedrock_wire_payload — route-config effort reaches the wire with host os.environ empty (sourced from the run, not the host process).
  • test_run_level_effort_via_proxy_process_env_overrides_stale_route — a run-level value in the proxy-process env overrides a stale-default route effort in the wire payload — the exact divergence the old Docker translator got wrong. Verified to FAIL when the patch override is neutered (fail-then-pass).
  • test_docker_and_daytona_resolve_identical_bedrock_effort_from_run_env — both lanes resolve identical effort from the same agent_env; with host env scrubbed it still flows through, so neither can diverge.

Uses medium (litellm-accepted, distinct from the high default) so a silent fallback fails the assertion.

Disposition

This pins #599's behavior; it does not close it — leaving the keep/close call to maintainers.

⚠️ Separate finding (out of scope, will file/flag): litellm 1.88.0rc1 rejects effort values xhigh and max for opus-4-8 (BadRequestError), but BenchFlow's _BEDROCK_THINKING_EFFORTS still allows them — so BENCHFLOW_BEDROCK_THINKING_EFFORT=max (the #598 "MAX thinking" usage) would error at request time. Not #599; flagged for a follow-up.

🤖 Generated with Claude Code

ElegantLin and others added 2 commits June 15, 2026 01:25
…a parity) (#599)

#599 reported that the old host BedrockProxyServer stored per-run env but the
Bedrock translators resolved thinking effort from process-global os.environ, so
Docker silently fell back to `high` while Daytona honored a run-level
BENCHFLOW_BEDROCK_THINKING_EFFORT. PR #613 deleted that proxy (bedrock_proxy.py
/ bedrock_runtime.py no longer exist); the LiteLLM runtime now sources effort
from the run-level agent env via two run-aware paths — route config
(reasoning_effort baked into config.yaml from agent_env) and the proxy-process
env (launched as os.environ + agent_env). The specific bug is unreachable, but
nothing pinned the behavior end to end.

Adds wire-level regression coverage (the issue's actual ask — assert
output_config.effort, not just route params) by driving the REAL litellm
Bedrock Converse transform with the benchflow patch applied:

- route-config effort lands in the wire payload with host os.environ empty
  (sourced from the run, not the host process);
- a run-level value in the proxy-process env overrides a stale-default route
  effort in the wire payload — the exact divergence the old Docker translator
  got wrong (verified to FAIL when the patch override is neutered);
- Docker and Daytona resolve identical effort from the same agent_env, so
  neither lane can diverge.

No production change — the architecture already behaves correctly; this guards
the regression. Uses `medium` (litellm-accepted, distinct from the `high`
default) so a silent fallback fails the assertion.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@bingran-you bingran-you force-pushed the test/599-bedrock-effort-parity branch from e2df289 to 1430118 Compare June 15, 2026 05:49
@bingran-you bingran-you temporarily deployed to pypi-internal-preview June 15, 2026 05:49 — with GitHub Actions Inactive
@bingran-you bingran-you added review:pending PR is ready-for-review, no reviewer engagement yet. status:ready Triaged, unassigned, available to claim. area:rollout Issue / PR lives primarily in the "rollout" subsystem. area:sandbox Issue / PR lives primarily in the "sandbox" subsystem. labels Jun 15, 2026
@bingran-you

Copy link
Copy Markdown
Collaborator

Codex review/update for #599 on 2026-06-15:

Status: implementation remains test-only and scoped to tests/test_bedrock_thinking.py. I rebased the branch onto current main (143011865af981231f547f92257e805d343efaca) so the stale CI failure from the old branch head is superseded. The previous integration-eval failure was due to the old branch missing .github/scripts/select_integration_provider.py, which exists on current main.

Validation completed on the rebased head:

  • uv run ruff check src tests tools passed.
  • uv run ruff format --check src tests tools passed.
  • uv run ty check passed.
  • uv run python -m pytest tests/ passed: 4139 passed, 4 skipped, 7 deselected.
  • CI-equivalent audit command passed: uv export --locked --extra dev --extra sandbox-daytona --extra sandbox-modal --format requirements-txt --no-hashes --output-file /tmp/req-audit && uv run --with pip-audit pip-audit -r /tmp/req-audit --no-deps --disable-pip; result: no known vulnerabilities found.

Real end-to-end checks against latest SkillsBench main (87df9cbd82fcf07fcb0cab960a1f8435d0c415ee):

  • Daytona + Gemini + 3d-scan-calc + with-skill passed: reward 1.0, 15 tool calls, partial_trajectory=false, verifier passed. Manual trajectory audit confirmed mesh-analysis was activated from /home/agent/.gemini/skills/mesh-analysis and used by the agent.
  • Daytona + Gemini + citation-check + no-skill produced complete BenchFlow/verifier/timing artifacts and no obvious skill-file leakage, but the agent hit wall-clock timeout with a partial ACP trajectory. I am not treating that run as a publishable healthy trial; it is only path-validation evidence.

Caveats / merge gate:

  • New GitHub checks are running on the rebased head now.
  • I am not self-approving or merging this PR because the repo rule requires human review before main.
  • Docker-lane e2e is still pending because the local Docker daemon is unavailable; the GCP runner has Docker and the Codex/Claude CLIs were updated, but I have not used it as merge evidence yet.

Disposition: ready for maintainer review once the new GitHub checks are green. After merge, #599 can be closed as fixed-by-coverage / architecture-regression-guarded rather than a production code change.

@bingran-you

Copy link
Copy Markdown
Collaborator

Review request/status refresh: this PR remains green on test, pip-audit, and integration-eval, with merge state CLEAN. It still needs human review before merge. Separate strict parity evidence remains incomplete: Daytona with-skill passed, but the no-skill run timed out with partial trajectory and the broader Docker/Daytona/API matrix is not complete.

@bingran-you bingran-you added status:blocked Waiting on external dependency. Add a comment explaining why. and removed status:ready Triaged, unassigned, available to claim. labels Jun 15, 2026
@bingran-you

Copy link
Copy Markdown
Collaborator

Status correction after fresh Bedrock preflight (2026-06-15):

GitHub CI is green and the PR is CLEAN, but I moved this from status:ready to status:blocked under the requested release/e2e gate. The remaining blocker is real Bedrock parity evidence, not unit/CI.

Fresh raw provider preflight on benchflow-docker-runner-03 using $HOME/.env returned HTTP 403 for us.anthropic.claude-opus-4-8 (Authentication failed: Please make sure your API Key is valid.). runner-06 does not have $HOME/.env, and local /Users/bingran_you/Downloads/GitHub/bingran-you/.env does not define AWS_BEARER_TOKEN_BEDROCK.

Per repo guidance, I did not launch a Bedrock e2e run with a known-invalid key because it would only produce an unhealthy provider/auth trajectory. Once a valid Bedrock bearer token is available, the next gate is the Docker/Daytona with-skill and no-skill canary matrix with trajectory audit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:rollout Issue / PR lives primarily in the "rollout" subsystem. area:sandbox Issue / PR lives primarily in the "sandbox" subsystem. review:pending PR is ready-for-review, no reviewer engagement yet. status:blocked Waiting on external dependency. Add a comment explaining why.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants