test: pin run-level Bedrock thinking-effort end-to-end (Docker/Daytona parity) (#599) #736
ElegantLin wants to merge 2 commits into
…a parity) (#599)

#599 reported that the old host BedrockProxyServer stored per-run env but the Bedrock translators resolved thinking effort from process-global os.environ, so Docker silently fell back to `high` while Daytona honored a run-level BENCHFLOW_BEDROCK_THINKING_EFFORT.

PR #613 deleted that proxy (bedrock_proxy.py / bedrock_runtime.py no longer exist); the LiteLLM runtime now sources effort from the run-level agent env via two run-aware paths: route config (reasoning_effort baked into config.yaml from agent_env) and the proxy-process env (launched as os.environ + agent_env). The specific bug is unreachable, but nothing pinned the behavior end to end.

Adds wire-level regression coverage (the issue's actual ask: assert output_config.effort, not just route params) by driving the REAL litellm Bedrock Converse transform with the benchflow patch applied:

- route-config effort lands in the wire payload with host os.environ empty (sourced from the run, not the host process);
- a run-level value in the proxy-process env overrides a stale-default route effort in the wire payload, which is the exact divergence the old Docker translator got wrong (verified to FAIL when the patch override is neutered);
- Docker and Daytona resolve identical effort from the same agent_env, so neither lane can diverge.

No production change: the architecture already behaves correctly; this guards the regression. Uses `medium` (litellm-accepted, distinct from the `high` default) so a silent fallback fails the assertion.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
e2df289 to 1430118
Codex review/update for #599 on 2026-06-15:

Status: implementation remains test-only and scoped to

Validation completed on the rebased head:
Real end-to-end checks against latest SkillsBench
Caveats / merge gate:
Disposition: ready for maintainer review once the new GitHub checks are green. After merge, #599 can be closed as fixed-by-coverage / architecture-regression-guarded rather than a production code change.
Review request/status refresh: this PR remains green on
Status correction after fresh Bedrock preflight (2026-06-15): GitHub CI is green and the PR is CLEAN, but I moved this from

Fresh raw provider preflight on

Per repo guidance, I did not launch a Bedrock e2e run with a known-invalid key because it would only produce an unhealthy provider/auth trajectory. Once a valid Bedrock bearer token is available, the next gate is the Docker/Daytona with-skill and no-skill canary matrix with trajectory audit.
Context
#599 (P0) reported: the old host `BedrockProxyServer` stored the per-run env but the Bedrock request translators resolved thinking effort from process-global `os.environ`, so a Docker run silently fell back to `high` while Daytona honored a run-level `BENCHFLOW_BEDROCK_THINKING_EFFORT`.

The implicated code no longer exists on main: PR #613 (Replace provider proxies with LiteLLM runtime) deleted `bedrock_proxy.py` / `bedrock_runtime.py`. The LiteLLM runtime sources effort from the run-level agent env via two run-aware paths:

- Route config: `litellm_config._bedrock_thinking_effort(model, env)` reads `agent_env` and bakes `reasoning_effort` into `config.yaml`.
- Proxy-process env: the proxy process is launched as `os.environ + agent_env`, and the bedrock patch's effort override reads it.

So the specific bug is unreachable, but nothing pinned the behavior end to end (existing tests only checked route params, not the emitted wire payload). Per the issue's "regression tests to add", this adds wire-level coverage.
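A minimal sketch of how the two run-aware paths could combine, for readers who have not seen the runtime. It is an illustration, not the benchflow implementation: the helper names `route_effort`, `proxy_process_env`, and `resolve_effort` are hypothetical, and it assumes that `agent_env` entries win over host entries when the two are merged.

```python
from typing import Dict, Mapping

ENV_KEY = "BENCHFLOW_BEDROCK_THINKING_EFFORT"
DEFAULT_EFFORT = "high"  # the default named in the PR description


def route_effort(agent_env: Mapping[str, str]) -> str:
    """Path 1 (hypothetical helper): effort baked into the route config from the run-level env."""
    return agent_env.get(ENV_KEY, DEFAULT_EFFORT)


def proxy_process_env(host_env: Mapping[str, str], agent_env: Mapping[str, str]) -> Dict[str, str]:
    """Path 2 (hypothetical helper): host env overlaid with the run-level env.

    Assumption: agent_env takes precedence over host_env on key collisions.
    """
    return {**host_env, **agent_env}


def resolve_effort(host_env: Mapping[str, str], agent_env: Mapping[str, str]) -> str:
    """Effort that would reach the wire: the proxy-process env value if set, else the route value."""
    env = proxy_process_env(host_env, agent_env)
    return env.get(ENV_KEY, route_effort(agent_env))


run_env = {ENV_KEY: "medium"}

# Host env empty: the value still comes from the run, not the host process.
effort_with_empty_host = resolve_effort({}, run_env)

# Host env carries a stale value: the run-level value still wins under the stated assumption.
effort_with_stale_host = resolve_effort({ENV_KEY: "high"}, run_env)

# No run-level value at all: the default applies.
effort_without_run_value = resolve_effort({}, {})
```

Because both lanes would call the same resolution with the same `agent_env`, there is no lane-specific input left that could make Docker and Daytona disagree.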
What this adds (test-only — no production change)
Drives the real litellm Bedrock Converse transform with the benchflow patch applied and asserts `output_config.effort` (the actual wire field the issue's repro inspected):

- `test_run_level_effort_from_route_lands_in_bedrock_wire_payload`: route-config effort reaches the wire with host `os.environ` empty (sourced from the run, not the host process).
- `test_run_level_effort_via_proxy_process_env_overrides_stale_route`: a run-level value in the proxy-process env overrides a stale-default route effort in the wire payload, which is the exact divergence the old Docker translator got wrong. Verified to FAIL when the patch override is neutered (fail-then-pass).
- `test_docker_and_daytona_resolve_identical_bedrock_effort_from_run_env`: both lanes resolve identical effort from the same `agent_env`; with host env scrubbed it still flows through, so neither can diverge.

Uses `medium` (litellm-accepted, distinct from the `high` default) so a silent fallback fails the assertion.

Disposition
This pins #599's behavior; it does not close it — leaving the keep/close call to maintainers.
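The fail-then-pass check described for the override test can be sketched in isolation. This is a toy model under stated assumptions, not the real litellm transform or the benchflow patch: `patched_transform` and `neutered_transform` are hypothetical stand-ins, and the payload is reduced to the single `output_config.effort` field that the tests assert on.

```python
from typing import Callable, Dict, Mapping

ENV_KEY = "BENCHFLOW_BEDROCK_THINKING_EFFORT"
Payload = Dict[str, Dict[str, str]]
Transform = Callable[[str, Mapping[str, str]], Payload]


def patched_transform(route_effort: str, proxy_env: Mapping[str, str]) -> Payload:
    """Hypothetical stand-in: the run-level env value overrides the route effort."""
    effort = proxy_env.get(ENV_KEY, route_effort)
    return {"output_config": {"effort": effort}}


def neutered_transform(route_effort: str, proxy_env: Mapping[str, str]) -> Payload:
    """Hypothetical stand-in with the override removed: the route effort is used as-is."""
    return {"output_config": {"effort": route_effort}}


def override_reaches_wire(transform: Transform) -> bool:
    """Stale route effort is `high`; the run-level value is `medium`."""
    payload = transform("high", {ENV_KEY: "medium"})
    return payload["output_config"]["effort"] == "medium"


passes_with_patch = override_reaches_wire(patched_transform)
passes_when_neutered = override_reaches_wire(neutered_transform)
```

Choosing `medium` rather than `high` is what makes the neutered variant fail: a silent fallback to the default would otherwise be indistinguishable from a correct override.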
🤖 Generated with Claude Code