Commit 91b91b2

Merge pull request #473 from krissetto/thinking-budgets-google
Google/Gemini thinking budget support

2 parents fce5af9 + 77152b8, commit 91b91b2

8 files changed
Lines changed: 117 additions & 29 deletions
cagent-schema.json
Lines changed: 4 additions & 4 deletions

```diff
@@ -182,7 +182,7 @@
       "description": "Whether to track usage"
     },
     "thinking_budget": {
-      "description": "Controls reasoning effort/budget. For OpenAI: string levels ('minimal', 'low', 'medium', 'high'). For Anthropic: integer token budget (1024-32768)",
+      "description": "Controls reasoning effort/budget. OpenAI: string levels ('minimal','low','medium','high'). Anthropic: integer token budget (1024-32768). Gemini: integer token budget (-1 for dynamic, 0 to disable, 24576 max for most models).",
       "oneOf": [
         {
           "type": "string",
@@ -191,12 +191,12 @@
         },
         {
           "type": "integer",
-          "minimum": 1024,
+          "minimum": -1,
           "maximum": 32768,
-          "description": "Token budget for extended thinking (Anthropic)"
+          "description": "Token budget for extended thinking (Anthropic, Google)"
         }
       ],
-      "examples": ["minimal", "low", "medium", "high", 1024, 32768]
+      "examples": ["minimal", "low", "medium", "high", -1, 0, 1024, 24576, 32768]
    },
    "additionalProperties": false
```
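The widened `oneOf` above can be mirrored in application code. A minimal sketch, assuming the schema's value ranges; the helper `validateThinkingBudget` is hypothetical and not part of cagent:

```go
package main

import "fmt"

// validateThinkingBudget mirrors the schema's oneOf: either an effort
// level string (OpenAI) or an integer token budget in [-1, 32768]
// (Anthropic/Google). Hypothetical helper, not part of cagent.
func validateThinkingBudget(v any) error {
	switch b := v.(type) {
	case string:
		switch b {
		case "minimal", "low", "medium", "high":
			return nil
		}
		return fmt.Errorf("invalid effort level %q", b)
	case int:
		if b < -1 || b > 32768 {
			return fmt.Errorf("token budget %d outside [-1, 32768]", b)
		}
		return nil
	default:
		return fmt.Errorf("thinking_budget must be a string or integer, got %T", v)
	}
}

func main() {
	fmt.Println(validateThinkingBudget("high")) // <nil>
	fmt.Println(validateThinkingBudget(-1))     // <nil>
	fmt.Println(validateThinkingBudget(40000))  // out-of-range error
}
```

Note that per-provider constraints (such as Anthropic's 1024 minimum) are stricter than this shared schema; the schema only enforces the union of the allowed ranges.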

docs/USAGE.md
Lines changed: 46 additions & 17 deletions

````diff
@@ -139,17 +139,17 @@ cagent run ./agent.yaml --command ls
 
 ### Model Properties
 
-| Property            | Type       | Description                                                           | Required |
-|---------------------|------------|-----------------------------------------------------------------------|----------|
-| `provider`          | string     | Provider: `openai`, `anthropic`, `dmr`                                | ✓        |
-| `model`             | string     | Model name (e.g., `gpt-4o`, `claude-sonnet-4-0`)                      | ✓        |
-| `temperature`       | float      | Randomness (0.0-1.0)                                                  | ✗        |
-| `max_tokens`        | integer    | Response length limit                                                 | ✗        |
-| `top_p`             | float      | Nucleus sampling (0.0-1.0)                                            | ✗        |
-| `frequency_penalty` | float      | Repetition penalty (0.0-2.0)                                          | ✗        |
-| `presence_penalty`  | float      | Topic repetition penalty (0.0-2.0)                                    | ✗        |
-| `base_url`          | string     | Custom API endpoint                                                   | ✗        |
-| `thinking_budget`   | string/int | Reasoning effort — OpenAI: effort string, Anthropic: token budget int | ✗        |
+| Property            | Type       | Description                                                                  | Required |
+|---------------------|------------|------------------------------------------------------------------------------|----------|
+| `provider`          | string     | Provider: `openai`, `anthropic`, `google`, `dmr`                             | ✓        |
+| `model`             | string     | Model name (e.g., `gpt-4o`, `claude-sonnet-4-0`, `gemini-2.5-flash`)         | ✓        |
+| `temperature`       | float      | Randomness (0.0-1.0)                                                         | ✗        |
+| `max_tokens`        | integer    | Response length limit                                                        | ✗        |
+| `top_p`             | float      | Nucleus sampling (0.0-1.0)                                                   | ✗        |
+| `frequency_penalty` | float      | Repetition penalty (0.0-2.0)                                                 | ✗        |
+| `presence_penalty`  | float      | Topic repetition penalty (0.0-2.0)                                           | ✗        |
+| `base_url`          | string     | Custom API endpoint                                                          | ✗        |
+| `thinking_budget`   | string/int | Reasoning effort — OpenAI: effort string, Anthropic/Google: token budget int | ✗        |
 
 #### Example
 
@@ -164,15 +164,16 @@ models:
     frequency_penalty: float # Repetition penalty (0.0-2.0)
     presence_penalty: float # Topic repetition penalty (0.0-2.0)
     parallel_tool_calls: boolean
-    thinking_budget: string|integer # OpenAI: effort level string; Anthropic: integer token budget
+    thinking_budget: string|integer # OpenAI: effort level string; Anthropic/Google: integer token budget
 ```
 
 ### Reasoning Effort (thinking_budget)
 
 Determine how much the model should think by setting the `thinking_budget`
 
 - **OpenAI**: use effort levels — `minimal`, `low`, `medium`, `high`
-- **Anthropic**: set an integer token budget. Minimum is 1024; range is 1024–32768; must be strictly less than `max_tokens`. When set, cagent uses Anthropic's Beta Messages API with interleaved thinking enabled.
+- **Anthropic**: set an integer token budget. Range is 1024–32768; must be strictly less than `max_tokens`.
+- **Google (Gemini)**: set an integer token budget. `0` disables thinking; `-1` enables dynamic thinking (model decides). Most models: 0–24576 tokens. Gemini 2.5 Pro: 128–32768 tokens (thinking cannot be disabled).
 
 Examples (OpenAI):
 
@@ -204,6 +205,31 @@ agents:
     instruction: you are a helpful assistant that doesn't think very much
 ```
 
+Examples (Google):
+
+```yaml
+models:
+  gemini-no-thinking:
+    provider: google
+    model: gemini-2.5-flash
+    thinking_budget: 0 # Disable thinking
+
+  gemini-dynamic:
+    provider: google
+    model: gemini-2.5-flash
+    thinking_budget: -1 # Dynamic thinking (model decides)
+
+  gemini-fixed:
+    provider: google
+    model: gemini-2.5-flash
+    thinking_budget: 8192 # Fixed token budget
+
+agents:
+  root:
+    model: gemini-fixed
+    instruction: you are a helpful assistant
+```
+
 #### Interleaved Thinking (Anthropic)
 
 Anthropic's interleaved thinking feature uses the Beta Messages API to provide tool calling during model reasoning. You can control this behavior using the `interleaved_thinking` provider option:
@@ -220,11 +246,14 @@ models:
 
 Notes:
 
-- If an invalid OpenAI effort value is set, the request will fail with a clear error
-- For Anthropic, values < 1024 or ≥ `max_tokens` are ignored (warning logged)
-- When `interleaved_thinking` is enabled, cagent uses Anthropic's Beta Messages API with a default thinking budget of 16384 tokens if not specified
+- **OpenAI**: If an invalid effort value is set, the request will fail with a clear error
+- **Anthropic**: Values < 1024 or ≥ `max_tokens` are ignored (warning logged). When `interleaved_thinking` is enabled, cagent uses Anthropic's Beta Messages API with a default thinking budget of 16384 tokens if not specified
+- **Google**:
+  - Most models support values between -1 and 24576 tokens. Set to `0` to disable, `-1` for dynamic thinking
+  - Gemini 2.5 Pro: supports 128–32768 tokens. Thinking cannot be disabled (minimum 128)
+  - Gemini 2.5 Flash-Lite: supports 512–24576 tokens. Set to `0` to disable, `-1` for dynamic thinking
 - For unsupported providers, `thinking_budget` has no effect
-- Debug logs include the applied effort (e.g., "OpenAI request using thinking_budget", "Anthropic Beta API using thinking_budget")
+- Debug logs include the applied effort (e.g., "OpenAI request using thinking_budget", "Gemini request using thinking_budget")
 
 See `examples/thinking_budget.yaml` for a complete runnable demo.
````
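The per-model ranges in the notes above can be condensed into a single predicate. A hedged sketch: the model names come from the docs, but the helper itself is illustrative, and whether cagent validates these ranges client-side is not shown in this diff:

```go
package main

import "fmt"

// geminiBudgetValid condenses the per-model thinking_budget limits:
// -1 means dynamic thinking, 0 disables it (where allowed), and each
// model family has its own fixed-budget range. Illustrative helper.
func geminiBudgetValid(model string, budget int) bool {
	switch model {
	case "gemini-2.5-pro":
		// Pro cannot disable thinking: only -1 (dynamic) or 128-32768.
		return budget == -1 || (budget >= 128 && budget <= 32768)
	case "gemini-2.5-flash-lite":
		// Flash-Lite: off (0), dynamic (-1), or 512-24576.
		return budget == 0 || budget == -1 || (budget >= 512 && budget <= 24576)
	default:
		// Most models: off, dynamic, or up to 24576 tokens.
		return budget >= -1 && budget <= 24576
	}
}

func main() {
	fmt.Println(geminiBudgetValid("gemini-2.5-pro", 0))      // false
	fmt.Println(geminiBudgetValid("gemini-2.5-flash", 8192)) // true
}
```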

examples/thinking_budget.yaml
Lines changed: 24 additions & 3 deletions

```diff
@@ -9,8 +9,9 @@ agents:
   root:
     model: gpt-5-mini-min # <- try with gpt-5-mini-high
     # model: claude-4-5-sonnet-min # <- try with claude-4-5-sonnet-high
+    # model: gemini-2-5-flash-dynamic-thinking # <- try with -no-thinking, -low or -high variants
     description: a helpful assistant that thinks
-    instruction: you are a helpful assistant
+    instruction: you are a helpful assistant who can also use tools, but only if you need to
     commands:
       demo: "hey i need python code for a mandelbrot fractal"
     toolsets:
@@ -35,6 +36,26 @@ models:
   claude-4-5-sonnet-high:
     provider: anthropic
     model: claude-sonnet-4-5-20250929
-    thinking_budget: 32768 # <- tokens, 32768 is the suggested maximum without batching
+    thinking_budget: 32768 # <- tokens, 32768 is the Anthropic suggested maximum without batching
     provider_opts:
-      interleaved_thinking: true # <- enable interleaved thinking, aka tool calling during model reasoning
+      interleaved_thinking: true # <- enables interleaved thinking, aka tool calling during model reasoning
+
+  gemini-2-5-flash-dynamic-thinking:
+    provider: google
+    model: gemini-2.5-flash
+    thinking_budget: -1 # <- google only, dynamic thinking
+
+  gemini-2-5-flash-no-thinking:
+    provider: google
+    model: gemini-2.5-flash
+    thinking_budget: 0 # <- google only, no thinking
+
+  gemini-2-5-flash-low:
+    provider: google
+    model: gemini-2.5-flash
+    thinking_budget: 1024
+
+  gemini-2-5-flash-high:
+    provider: google
+    model: gemini-2.5-flash
+    thinking_budget: 24576 # <- google's maximum thinking budget for all models except Gemini 2.5 Pro (max 32768)
```

pkg/chat/chat.go
Lines changed: 1 addition & 0 deletions

```diff
@@ -119,6 +119,7 @@ type Usage struct {
 	OutputTokens       int `json:"output_tokens"`
 	CachedInputTokens  int `json:"cached_input_tokens"`
 	CachedOutputTokens int `json:"cached_output_tokens"`
+	ReasoningTokens    int `json:"reasoning_tokens,omitempty"`
 }
 
 // MessageStream interface represents a stream of chat completions
```

pkg/model/provider/gemini/adapter.go
Lines changed: 11 additions & 2 deletions

```diff
@@ -179,20 +179,29 @@ func (g *StreamAdapter) Recv() (chat.MessageStreamResponse, error) {
 			OutputTokens:       int(res.resp.UsageMetadata.CandidatesTokenCount),
 			CachedInputTokens:  int(res.resp.UsageMetadata.CachedContentTokenCount),
 			CachedOutputTokens: 0, // Gemini doesn't provide cached output tokens
+			ReasoningTokens:    int(res.resp.UsageMetadata.ThoughtsTokenCount),
 		}
 	}
 
-	// Handle text content without using Text() to avoid warnings
+	// Handle text and thoughts separately so TUI can render them distinctly
 	var textContent string
+	var reasoningText string
 	for _, candidate := range res.resp.Candidates {
 		if candidate.Content != nil {
 			for _, part := range candidate.Content.Parts {
 				if part.Text != "" {
-					textContent += part.Text
+					if part.Thought {
+						reasoningText += part.Text
+					} else {
+						textContent += part.Text
+					}
 				}
 			}
 		}
 	}
+	if reasoningText != "" {
+		resp.Choices[0].Delta.ReasoningContent = reasoningText
+	}
 	if textContent != "" {
 		resp.Choices[0].Delta.Content = textContent
 	}
```
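The part-splitting loop above can be isolated for testing. A sketch on a simplified part type (a stand-in, not the real `genai.Part`):

```go
package main

import "fmt"

// part is a simplified stand-in for genai.Part: streamed chunks carry a
// Thought flag marking reasoning text.
type part struct {
	Text    string
	Thought bool
}

// splitParts mirrors the adapter's loop: thought parts accumulate into
// reasoning text, everything else into visible content.
func splitParts(parts []part) (text, reasoning string) {
	for _, p := range parts {
		if p.Text == "" {
			continue
		}
		if p.Thought {
			reasoning += p.Text
		} else {
			text += p.Text
		}
	}
	return text, reasoning
}

func main() {
	text, reasoning := splitParts([]part{
		{Text: "Let me check... ", Thought: true},
		{Text: "The answer is 4."},
	})
	fmt.Println(text)      // The answer is 4.
	fmt.Println(reasoning) // Let me check...
}
```

Keeping the two streams separate is what lets the TUI render reasoning in a distinct style instead of interleaving it with the answer.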

pkg/model/provider/gemini/client.go
Lines changed: 24 additions & 0 deletions

```diff
@@ -220,6 +220,30 @@ func (c *Client) buildConfig() *genai.GenerateContentConfig {
 	if c.config.MaxTokens > 0 {
 		config.MaxOutputTokens = int32(c.config.MaxTokens)
 	}
+
+	// Apply thinking budget for Gemini models using token-based configuration.
+	// Per official docs: https://ai.google.dev/gemini-api/docs/thinking
+	// - Set thinkingBudget to 0 to disable thinking
+	// - Set thinkingBudget to -1 for dynamic thinking (model decides)
+	// - Set to a specific value for a fixed token budget;
+	//   maximum is 24576 for all models except Gemini 2.5 Pro (max 32768)
+	if c.config.ThinkingBudget != nil {
+		if config.ThinkingConfig == nil {
+			config.ThinkingConfig = &genai.ThinkingConfig{}
+		}
+		config.ThinkingConfig.IncludeThoughts = true
+		tokens := c.config.ThinkingBudget.Tokens
+		config.ThinkingConfig.ThinkingBudget = genai.Ptr(int32(tokens))
+
+		switch tokens {
+		case 0:
+			slog.Debug("Gemini request with thinking disabled", "budget_tokens", tokens)
+		case -1:
+			slog.Debug("Gemini request with dynamic thinking", "budget_tokens", tokens)
+		default:
+			slog.Debug("Gemini request using thinking_budget", "budget_tokens", tokens)
+		}
+	}
 	return config
 }
```
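Note the `genai.Ptr(int32(tokens))` call: the SDK takes a `*int32` so that nil ("budget not set", provider default) stays distinct from an explicit 0 ("thinking disabled"). A sketch of the same pattern with illustrative stand-in types:

```go
package main

import "fmt"

// ptr mimics genai.Ptr: returning a pointer lets the request distinguish
// "budget not set" (nil) from an explicit 0 ("disabled").
func ptr[T any](v T) *T { return &v }

// thinkingConfig is an illustrative stand-in for genai.ThinkingConfig.
type thinkingConfig struct {
	IncludeThoughts bool
	ThinkingBudget  *int32 // nil = provider default
}

// describeBudget mirrors the debug-log switch in buildConfig.
func describeBudget(tokens int32) string {
	switch tokens {
	case 0:
		return "thinking disabled"
	case -1:
		return "dynamic thinking"
	default:
		return fmt.Sprintf("fixed budget of %d tokens", tokens)
	}
}

func main() {
	cfg := thinkingConfig{IncludeThoughts: true, ThinkingBudget: ptr(int32(8192))}
	fmt.Println(describeBudget(*cfg.ThinkingBudget)) // fixed budget of 8192 tokens
}
```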

pkg/model/provider/oaistream/adapter.go
Lines changed: 4 additions & 0 deletions

```diff
@@ -49,10 +49,14 @@ func (a *StreamAdapter) Recv() (chat.MessageStreamResponse, error) {
 		OutputTokens:       openaiResponse.Usage.CompletionTokens,
 		CachedInputTokens:  0,
 		CachedOutputTokens: 0,
+		ReasoningTokens:    0,
 	}
 	if openaiResponse.Usage.PromptTokensDetails != nil {
 		response.Usage.CachedInputTokens = openaiResponse.Usage.PromptTokensDetails.CachedTokens
 	}
+	if openaiResponse.Usage.CompletionTokensDetails != nil {
+		response.Usage.ReasoningTokens = openaiResponse.Usage.CompletionTokensDetails.ReasoningTokens
+	}
 	// Use the tracked finish reason instead of hardcoding stop
 	finishReason := a.lastFinishReason
 	if finishReason == "" {
```
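The new guard mirrors the existing `PromptTokensDetails` check: the details struct is optional in the response, so dereferencing without a nil check would panic. A reduced sketch with stand-in types:

```go
package main

import "fmt"

// completionTokensDetails and usage are reduced stand-ins for the OpenAI
// response types; the details struct is optional on the wire.
type completionTokensDetails struct {
	ReasoningTokens int
}

type usage struct {
	CompletionTokens        int
	CompletionTokensDetails *completionTokensDetails
}

// reasoningTokens extracts the detail safely, defaulting to 0 when the
// provider omits the struct.
func reasoningTokens(u usage) int {
	if u.CompletionTokensDetails != nil {
		return u.CompletionTokensDetails.ReasoningTokens
	}
	return 0
}

func main() {
	fmt.Println(reasoningTokens(usage{CompletionTokens: 100})) // 0
	fmt.Println(reasoningTokens(usage{
		CompletionTokens:        100,
		CompletionTokensDetails: &completionTokensDetails{ReasoningTokens: 42},
	})) // 42
}
```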

pkg/runtime/runtime.go
Lines changed: 3 additions & 3 deletions

```diff
@@ -481,19 +481,19 @@ func (r *runtime) handleStream(ctx context.Context, stream chat.MessageStream, a
 	if response.Usage != nil {
 		if m != nil {
 			sess.Cost += (float64(response.Usage.InputTokens)*m.Cost.Input +
-				float64(response.Usage.OutputTokens)*m.Cost.Output +
+				float64(response.Usage.OutputTokens+response.Usage.ReasoningTokens)*m.Cost.Output +
 				float64(response.Usage.CachedInputTokens)*m.Cost.CacheRead +
 				float64(response.Usage.CachedOutputTokens)*m.Cost.CacheWrite) / 1e6
 		}
 
 		sess.InputTokens = response.Usage.InputTokens + response.Usage.CachedInputTokens
-		sess.OutputTokens = response.Usage.OutputTokens + response.Usage.CachedOutputTokens
+		sess.OutputTokens = response.Usage.OutputTokens + response.Usage.CachedOutputTokens + response.Usage.ReasoningTokens
 
 		modelName := "unknown"
 		if m != nil {
 			modelName = m.Name
 		}
-		telemetry.RecordTokenUsage(ctx, modelName, int64(response.Usage.InputTokens), int64(response.Usage.OutputTokens), sess.Cost)
+		telemetry.RecordTokenUsage(ctx, modelName, int64(response.Usage.InputTokens), int64(response.Usage.OutputTokens+response.Usage.ReasoningTokens), sess.Cost)
	}
 
 	if len(response.Choices) == 0 {
```
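The updated accounting bills reasoning tokens at the output rate, so sessions that enable thinking show the full spend. A standalone sketch of the arithmetic (field and type names are illustrative; rates are dollars per million tokens, as in the `/ 1e6` above):

```go
package main

import "fmt"

// usage and rates are illustrative stand-ins for chat.Usage and the model
// cost table; rates are dollars per million tokens.
type usage struct{ In, Out, Reasoning, CachedIn, CachedOut int }
type rates struct{ Input, Output, CacheRead, CacheWrite float64 }

// sessionCost mirrors the updated accounting: reasoning tokens are billed
// at the same rate as ordinary output tokens.
func sessionCost(u usage, r rates) float64 {
	return (float64(u.In)*r.Input +
		float64(u.Out+u.Reasoning)*r.Output +
		float64(u.CachedIn)*r.CacheRead +
		float64(u.CachedOut)*r.CacheWrite) / 1e6
}

func main() {
	// 1000 input tokens at $3/M plus 500+500 output-rate tokens at $15/M.
	c := sessionCost(
		usage{In: 1000, Out: 500, Reasoning: 500},
		rates{Input: 3, Output: 15},
	)
	fmt.Println(c) // 0.018
}
```

Before this change the 500 reasoning tokens would have been dropped from both the cost and the telemetry output count.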
