Summary
LiteLLMBackend (src/llm/litellm.py) hardcodes generation parameters and offers no way to enable or disable extended thinking (reasoning). We should standardize these parameters and make them configurable across the LLM abstraction so that agents (plan-execute, claude-agent, openai-agent) get consistent, tunable behavior.
Current state / problems
In src/llm/litellm.py:
- max_tokens is hardcoded to 2048, which is too small for plan/execute trajectories and long tool outputs, and it cannot be overridden per call or per backend.
- No support for enabling/disabling extended thinking / reasoning (e.g. reasoning_effort / thinking budget for Claude and reasoning models). It is neither on nor explicitly off; behavior is provider-default and undocumented.
- temperature is the only tunable param, passed positionally; there is no shared place to set top_p, stop, etc.
- No env-var driven defaults, so tuning requires code edits.
The LLMBackend ABC (src/llm/base.py) signature (generate(prompt, temperature)) is too narrow to carry these settings, so each backend diverges.
Proposed changes
- Add a config object (GenerationParams dataclass: max_tokens, temperature, thinking_enabled, thinking_budget / reasoning_effort, top_p, stop).
- Extend the LLMBackend ABC to carry it without breaking existing call sites (sensible defaults; keep generate(prompt, temperature=...) working).
- Apply the params on both the WatsonX and litellm_proxy/ paths.
- Pass reasoning_effort / thinking={"type": "enabled", "budget_tokens": N} only when the model supports it; explicitly disable otherwise.
- Add env-var defaults (LLM_MAX_TOKENS, LLM_THINKING, LLM_REASONING_EFFORT); document in .env.public.
- Raise the default max_tokens to a reasonable value (e.g. 4096+) and make it overridable.
Tests
- Mock litellm.completion for: thinking enabled, thinking disabled, custom max_tokens, WatsonX vs proxy.
- Use the _CapturingLLM / SequentialMockLLM fixtures per testing conventions.
Acceptance criteria
- Thinking can be explicitly enabled/disabled per backend and is correctly translated for supported models.
- max_tokens and core generation params are configurable via config object + env vars, no longer hardcoded.
- Existing call sites and uv run pytest src/ -v -k "not integration" pass.
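To make the proposal concrete, here is a rough sketch of what the GenerationParams config object might look like, including env-var defaults and the translation into keyword arguments for litellm.completion. Nothing here is taken from the existing code. The default values, the from_env and to_litellm_kwargs helper names, and the supports_thinking flag are all assumptions for illustration. In this sketch "disabled" is modeled by omitting the thinking keys; whether a provider needs an explicit off value is left open.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Any, Mapping, Optional

DEFAULT_MAX_TOKENS = 4096  # assumed default; the issue only says "4096+"
_TRUTHY = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class GenerationParams:
    """Hypothetical shared generation settings for all LLM backends."""

    max_tokens: int = DEFAULT_MAX_TOKENS
    temperature: float = 0.0  # assumed default, not specified in the issue
    thinking_enabled: bool = False
    thinking_budget: Optional[int] = None
    reasoning_effort: Optional[str] = None
    top_p: Optional[float] = None
    stop: Optional[tuple[str, ...]] = None

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "GenerationParams":
        """Build defaults from LLM_MAX_TOKENS, LLM_THINKING, LLM_REASONING_EFFORT."""
        env = os.environ if env is None else env
        return cls(
            max_tokens=int(env.get("LLM_MAX_TOKENS", DEFAULT_MAX_TOKENS)),
            thinking_enabled=env.get("LLM_THINKING", "").strip().lower() in _TRUTHY,
            reasoning_effort=env.get("LLM_REASONING_EFFORT") or None,
        )

    def to_litellm_kwargs(self, supports_thinking: bool) -> dict[str, Any]:
        """Translate to completion kwargs; thinking keys are added only if supported."""
        kwargs: dict[str, Any] = {
            "max_tokens": self.max_tokens,
            "temperature": self.temperature,
        }
        if self.top_p is not None:
            kwargs["top_p"] = self.top_p
        if self.stop:
            kwargs["stop"] = list(self.stop)
        if self.thinking_enabled and supports_thinking:
            # Prefer an explicit token budget; otherwise fall back to effort level.
            if self.thinking_budget is not None:
                kwargs["thinking"] = {
                    "type": "enabled",
                    "budget_tokens": self.thinking_budget,
                }
            elif self.reasoning_effort is not None:
                kwargs["reasoning_effort"] = self.reasoning_effort
        return kwargs
```

A backend could then call something like litellm.completion(model=..., messages=..., **params.to_litellm_kwargs(supports_thinking=...)), while generate(prompt, temperature=...) keeps working by building a GenerationParams internally. How supports_thinking is determined per model is not decided here.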