Current state
Pilo's chat history (this.messages: ModelMessage[] in webAgent.ts) grows monotonically over a task run. Sources of appends:
| Source | What's appended | Frequency |
| --- | --- | --- |
| `initializeSystemPromptAndTask` (webAgent.ts:1656-1671) | System + task+plan user message | Once at start, once on browser reconnect |
| `addPageSnapshot` (webAgent.ts:779-868) | Snapshot user message (text + optional image) | Per iteration when `needsPageSnapshot` |
| `generateAndProcessAction` (webAgent.ts:985-989) | All messages from `aiResponse.response.messages` (assistant turn + tool calls + tool results) | Per iteration |
| `checkAndHandleRepeatedAction` (webAgent.ts:1149-1150) | Repetition warning user message | On warning threshold |
| `addErrorFeedback` (webAgent.ts:682-718) | Step error feedback user message | On non-recoverable tool error |
| `validateTaskCompletion` (webAgent.ts:1281-1289) | Validation feedback user message | On validation rejection |
The only trimming is `truncateOldExternalContent` (webAgent.ts:743-774), which runs before each new snapshot push and clips the body of prior `<EXTERNAL-CONTENT>` blocks:

```ts
const clipExternalContent = (text: string): string =>
  text.replace(
    /(<EXTERNAL-CONTENT[\s\S]*?>)\n[\s\S]*?\n(<\/EXTERNAL-CONTENT>)/g,
    "$1\n> [clipped for brevity]\n$2",
  );
```

It also replaces prior `image` content parts with `{ type: "text", text: "[screenshot clipped for brevity]" }`.
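A small runnable check of what that regex does to a prior snapshot message (the snapshot text below is made up for illustration, not the agent's real snapshot format):

```typescript
// The existing clip, reproduced verbatim from webAgent.ts.
const clipExternalContent = (text: string): string =>
  text.replace(
    /(<EXTERNAL-CONTENT[\s\S]*?>)\n[\s\S]*?\n(<\/EXTERNAL-CONTENT>)/g,
    "$1\n> [clipped for brevity]\n$2",
  );

// Made-up snapshot message for illustration.
const snapshot = [
  "Page state:",
  '<EXTERNAL-CONTENT source="page">',
  "<html>... thousands of tokens of DOM ...</html>",
  "</EXTERNAL-CONTENT>",
].join("\n");

const clipped = clipExternalContent(snapshot);
// The wrapper tags (and their attributes) survive; only the body is replaced.
```

Note the opening tag's attributes are preserved by the first capture group, so later code that keys off the tag still works.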
The gap
Tool-call assistant messages and tool-result messages are never trimmed. Validation feedback messages, repeat warnings, and error feedback messages are never trimmed.
On a 50-iteration run with multiple validation failures and a few errors, the conversation can accumulate:
- 50 snapshot user messages (clipped down to small markers, fine)
- 50 assistant tool-call messages (each ~50-200 tokens; not clipped)
- 50 tool result messages (each ~20-100 tokens; not clipped)
- Up to ~10 validation feedback user messages (each ~100-200 tokens; not clipped)
- Up to ~10 error feedback user messages (each ~100-300 tokens; not clipped)
- Up to ~5 repetition warning messages (~50 tokens; not clipped)
Rough math: a worst-case 50-iteration run can grow to 15-25k tokens of non-snapshot history that the model re-reads every iteration. This is fine for frontier-class context windows but:
- Pure inefficiency — most of those old tool-call args are not relevant to the current step.
- On smaller-context models, this can squeeze out room for the current snapshot.
- The validator (`validateTaskCompletion`) builds a `messages.slice(-30)` view that is partly dominated by these accumulated assistant turns.
When the AI SDK eventually fails with a context-window error, it surfaces as a regular generation error that cycles through `maxConsecutiveErrors` before the task fails. There's no early warning.
The gap (continued)
There's also a related smaller issue: stale validation feedback messages accumulate. If the agent gets rejected on attempt 1, then again on attempt 2, both feedback messages persist into attempt 3. The most-recent feedback is the most relevant; older feedback is mostly noise.
Proposed scope
Extend `truncateOldExternalContent` (or rename it to `trimOldHistory`) to also handle:
A. Clip old assistant tool-call message content
For assistant messages older than the last K (default K=3):
- Keep `role: "assistant"` and the tool call structure (so the AI SDK's tool-result pairing still works).
- Strip any `text` content (the model's intermediate reasoning, if surfaced).
- Replace tool call `args` with a placeholder object that preserves the tool name but blanks the value: `{ toolName: "fill", args: { /* clipped */ } }`.
Tradeoff: if a tool result references something specific (e.g., the model reading back what it filled), aggressive clipping breaks recall. K=3 (or a more conservative K=5) preserves enough recent context that the model can self-reference its recent actions.
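A sketch of step A, assuming simplified message and part shapes (the real AI SDK `ModelMessage` and tool-call part types differ in detail):

```typescript
// Simplified stand-ins for the AI SDK message types (assumed shapes).
type ToolCallPart = { type: "tool-call"; toolCallId: string; toolName: string; args: unknown };
type TextPart = { type: "text"; text: string };
type AssistantMessage = { role: "assistant"; content: Array<ToolCallPart | TextPart> };
type Message = AssistantMessage | { role: "user" | "tool" | "system"; content: unknown };

// Clip every assistant message except the last `keepLast` of them.
function clipOldAssistantMessages(messages: Message[], keepLast = 3): Message[] {
  const assistantIdxs = messages
    .map((m, i) => (m.role === "assistant" ? i : -1))
    .filter((i) => i >= 0);
  const protectedIdx = new Set(assistantIdxs.slice(-keepLast));
  return messages.map((m, i) => {
    if (m.role !== "assistant" || protectedIdx.has(i)) return m;
    const content = m.content
      // Drop intermediate reasoning text entirely...
      .filter((part) => part.type !== "text")
      // ...but keep toolName and toolCallId (pairing) with blanked args.
      .map((part) =>
        part.type === "tool-call" ? { ...part, args: { clipped: true } } : part,
      );
    return { ...m, content };
  });
}
```

The key invariant: the clipped message keeps its role, its `toolCallId`, and its `toolName`, so tool-result pairing and "what did I just do" recall both survive; only the argument payloads go.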
B. Aggregate old error/feedback messages into a single summary placeholder
Walk back from the end of `messages`. Find consecutive runs of validation-feedback and error-feedback user messages. Replace each run with a single placeholder:

```
[3 earlier feedback messages clipped: 2 validation rejections, 1 step error]
```
Only the most recent feedback message of each kind stays full text.
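A sketch of step B, assuming feedback messages carry a hypothetical `kind` tag (the real code would have to recognize them by content, or record the kind when appending):

```typescript
type FeedbackKind = "validation" | "error";
// Simplified message shape; `kind` is a hypothetical tag, not an AI SDK field.
type Msg = { role: string; content: string; kind?: FeedbackKind };

const s = (n: number) => (n === 1 ? "" : "s");

function collapseOldFeedback(messages: Msg[]): Msg[] {
  // The most recent feedback message of each kind stays full text.
  const keep = new Set<number>();
  for (const kind of ["validation", "error"] as FeedbackKind[]) {
    for (let i = messages.length - 1; i >= 0; i--) {
      if (messages[i]?.kind === kind) { keep.add(i); break; }
    }
  }
  const out: Msg[] = [];
  let run: Msg[] = [];
  const flush = () => {
    if (run.length === 0) return;
    const v = run.filter((m) => m.kind === "validation").length;
    const e = run.length - v;
    // Collapse the whole consecutive run into one placeholder message.
    out.push({
      role: "user",
      content: `[${run.length} earlier feedback messages clipped: ` +
        `${v} validation rejection${s(v)}, ${e} step error${s(e)}]`,
    });
    run = [];
  };
  messages.forEach((m, i) => {
    if (m.kind && !keep.has(i)) run.push(m);
    else { flush(); out.push(m); }
  });
  flush();
  return out;
}
```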
C. Tool-result message clipping
For tool-result messages older than K, replace the content with `[tool result clipped for brevity]`. Keep the message role/structure intact so AI SDK pairing works.
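A sketch of step C under the same caveat — these are simplified stand-in shapes, and for brevity it operates on a pre-filtered list of tool messages rather than the full history:

```typescript
// Simplified stand-ins for the AI SDK tool-result message types (assumed shapes).
type ToolResultPart = { type: "tool-result"; toolCallId: string; toolName: string; result: unknown };
type ToolMessage = { role: "tool"; content: ToolResultPart[] };

function clipOldToolResults(messages: ToolMessage[], keepLast = 3): ToolMessage[] {
  return messages.map((m, i) =>
    i < messages.length - keepLast
      ? {
          ...m,
          // Role and toolCallId stay intact, so pairing with the assistant's
          // tool-call message survives; only the payload is replaced.
          content: m.content.map((p) => ({ ...p, result: "[tool result clipped for brevity]" })),
        }
      : m,
  );
}
```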
D. Surface a token-budget metric
Add an event:
SYSTEM_DEBUG_HISTORY_SIZE: {
iterationId: string;
estimatedTokens: number; // rough sum of content text lengths / 4
messageCount: number;
}
Emit once per iteration before the LLM call. Telemetry consumers (eval-judge, logs) can spot tasks approaching the context window.
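A sketch of step D: the chars/4 heuristic and the per-iteration emit. The `emit` callback is a hypothetical stand-in for whatever event emitter the agent actually uses:

```typescript
type AnyMsg = { role: string; content: unknown };

// Rough token estimate: total content characters / 4. Structured content
// (tool calls, image markers) is stringified as an approximation.
function estimateTokens(messages: AnyMsg[]): number {
  let chars = 0;
  for (const m of messages) {
    chars += typeof m.content === "string" ? m.content.length : JSON.stringify(m.content).length;
  }
  return Math.ceil(chars / 4);
}

// Call once per iteration, just before the LLM call.
function emitHistorySize(
  emit: (event: string, payload: Record<string, unknown>) => void,
  iterationId: string,
  messages: AnyMsg[],
): void {
  emit("SYSTEM_DEBUG_HISTORY_SIZE", {
    iterationId,
    estimatedTokens: estimateTokens(messages),
    messageCount: messages.length,
  });
}
```

The chars/4 heuristic is deliberately crude; it only needs to be monotone enough for telemetry to spot runs trending toward the context window.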
E. Optional: hard cap on history age
If the agent runs >K iterations, drop intermediate snapshots entirely (not just clip the body). Keep:
- The original system + task+plan messages
- A summary placeholder ("[N earlier iterations summarized]" — content TBD; could be just the count)
- The last K=10 iterations in full
Defer this to a follow-up issue (the LLM-based history compaction one) if the simpler clipping handles most cases.
Implementation notes
- AI SDK tool-call / tool-result pairing is by `toolCallId`. Don't break the pairing — clipping the content is fine; deleting the message is not.
- Don't clip the most-recent assistant tool-call message — the model needs to see its own immediately-prior action.
- Test with a long-running task (forced to 50 iterations) — verify the trimming actually keeps token count bounded. Without this safeguard, the existing test suite doesn't exercise long histories.
- The existing `<EXTERNAL-CONTENT>` regex clipping should stay — the new clipping covers a different category of messages.
Acceptance criteria
- Assistant messages older than K iterations have their tool args clipped.
- Tool-result messages older than K iterations are placeholders.
- Consecutive runs of old feedback/error messages collapse into single summary placeholders.
- `SYSTEM_DEBUG_HISTORY_SIZE` event fires per iteration with estimated token count and message count.
- A 50-iteration test scenario shows history-token count plateaus rather than growing linearly.
- Existing tests still pass; new tests cover the clipping behaviors.
Effort estimate
1-2 days. Most complexity is in preserving tool-call/tool-result pairing while clipping content.
Related issues
This is the "Tier 2" / simpler step. A "Tier 3" follow-up (separate issue) covers LLM-based summarization for tasks where even this aggressive clipping isn't enough.
Files likely affected
- `packages/core/src/webAgent.ts` (`truncateOldExternalContent` and friends)
- `packages/core/src/events.ts` (new event type)
- `packages/core/test/webAgent.test.ts` (snapshot truncation `describe` block)