Standardize LiteLLM params: thinking toggle, max_tokens, and generation settings

## Summary

`LiteLLMBackend` (`src/llm/litellm.py`) hardcodes generation parameters and offers no way to enable/disable extended thinking (reasoning). We should standardize and make these parameters configurable across the LLM abstraction so agents (`plan-execute`, `claude-agent`, `openai-agent`) get consistent, tunable behavior.

## Current state / problems

In `src/llm/litellm.py`:
- `max_tokens` is **hardcoded to `2048`** — too small for plan/execute trajectories and long tool outputs, and not overridable per call or per backend.
- No support for **enabling/disabling extended thinking / reasoning** (e.g. `reasoning_effort` / `thinking` budget for Claude & reasoning models). It is neither on nor explicitly off — behavior is provider-default and undocumented.
- `temperature` is the only tunable param, passed positionally; no shared place to set top_p, stop, etc.
- No env-var driven defaults, so tuning requires code edits.
- `LLMBackend` ABC (`src/llm/base.py`) signature (`generate(prompt, temperature)`) is too narrow to carry these settings, so each backend diverges.

## Proposed changes

- [ ] Introduce a single generation-config object (e.g. `GenerationParams` dataclass: `max_tokens`, `temperature`, `thinking_enabled`, `thinking_budget`/`reasoning_effort`, `top_p`, `stop`).
- [ ] Thread it through the `LLMBackend` ABC without breaking existing call sites (sensible defaults; keep `generate(prompt, temperature=...)` working).
- [ ] Map config → litellm kwargs correctly per provider:
  - WatsonX vs `litellm_proxy/` paths.
  - Thinking: pass `reasoning_effort` / `thinking={"type": "enabled", "budget_tokens": N}` only when the model supports it; explicitly disable otherwise.
- [ ] Env-var defaults following existing conventions (e.g. `LLM_MAX_TOKENS`, `LLM_THINKING`, `LLM_REASONING_EFFORT`); document in `.env.public`.
- [ ] Raise the default `max_tokens` to a reasonable value (e.g. 4096+) and make it overridable.
- [ ] Audit agent runners for any other hardcoded/duplicated LLM params and route them through the new config.

## Tests
- [ ] Unit tests asserting the kwargs passed to `litellm.completion` for: thinking enabled, thinking disabled, custom `max_tokens`, WatsonX vs proxy.
- [ ] Use the `_CapturingLLM` / `SequentialMockLLM` fixtures per testing conventions.

## Acceptance criteria
- Thinking can be explicitly enabled/disabled per backend and is correctly translated for supported models.
- `max_tokens` and core generation params are configurable via config object + env vars, no longer hardcoded.
- Existing call sites and `uv run pytest src/ -v -k "not integration"` pass.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Standardize LiteLLM params: thinking toggle, max_tokens, and generation settings #304

Summary

Current state / problems

Proposed changes

Tests

Acceptance criteria

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Standardize LiteLLM params: thinking toggle, max_tokens, and generation settings #304

Description

Summary

Current state / problems

Proposed changes

Tests

Acceptance criteria

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions