Skip to content

Standardize LiteLLM params: thinking toggle, max_tokens, and generation settings #304

@ShuxinLin

Description

@ShuxinLin

Summary

LiteLLMBackend (src/llm/litellm.py) hardcodes generation parameters and offers no way to enable/disable extended thinking (reasoning). We should standardize and make these parameters configurable across the LLM abstraction so agents (plan-execute, claude-agent, openai-agent) get consistent, tunable behavior.

Current state / problems

In src/llm/litellm.py:

  • max_tokens is hardcoded to 2048 — too small for plan/execute trajectories and long tool outputs, and not overridable per call or per backend.
  • No support for enabling/disabling extended thinking / reasoning (e.g. reasoning_effort / thinking budget for Claude & reasoning models). It is neither on nor explicitly off — behavior is provider-default and undocumented.
  • temperature is the only tunable param, passed positionally; no shared place to set top_p, stop, etc.
  • No env-var driven defaults, so tuning requires code edits.
  • LLMBackend ABC (src/llm/base.py) signature (generate(prompt, temperature)) is too narrow to carry these settings, so each backend diverges.

Proposed changes

  • Introduce a single generation-config object (e.g. GenerationParams dataclass: max_tokens, temperature, thinking_enabled, thinking_budget/reasoning_effort, top_p, stop).
  • Thread it through the LLMBackend ABC without breaking existing call sites (sensible defaults; keep generate(prompt, temperature=...) working).
  • Map config → litellm kwargs correctly per provider:
    • WatsonX vs litellm_proxy/ paths.
    • Thinking: pass reasoning_effort / thinking={"type": "enabled", "budget_tokens": N} only when the model supports it; explicitly disable otherwise.
  • Env-var defaults following existing conventions (e.g. LLM_MAX_TOKENS, LLM_THINKING, LLM_REASONING_EFFORT); document in .env.public.
  • Raise the default max_tokens to a reasonable value (e.g. 4096+) and make it overridable.
  • Audit agent runners for any other hardcoded/duplicated LLM params and route them through the new config.

Tests

  • Unit tests asserting the kwargs passed to litellm.completion for: thinking enabled, thinking disabled, custom max_tokens, WatsonX vs proxy.
  • Use the _CapturingLLM / SequentialMockLLM fixtures per testing conventions.

Acceptance criteria

  • Thinking can be explicitly enabled/disabled per backend and is correctly translated for supported models.
  • max_tokens and core generation params are configurable via config object + env vars, no longer hardcoded.
  • Existing call sites and uv run pytest src/ -v -k "not integration" pass.

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions