Add flash-mode system prompt variant for fast/cheap models #434

@lmorchard

Description

Current state

The action-loop system prompt at packages/core/src/prompts.ts:312-426 is a single template applied to every provider and model. The template is dense — roughly 3000-4000 tokens after Liquid render — and includes:

  • The full youArePrompt persona block
  • All tool examples (dynamic by capability)
  • Core Rules (7 numbered rules + the "EXACTLY ONE tool" block)
  • Best Practices (~15 bullets covering autocomplete handling, date pickers, modal dismissal, PDF handling, search vs. browser-search guidance, etc.)
  • Conditional guardrails block
  • Conditional interactive-mode block (a substantial section with mandatory rules)
  • When using done() block (5 bullets on formatting)
  • The toolCallInstruction tail (repeats "exactly one tool")

This applies whether the active model is claude-opus-4-7 or gpt-4.1-mini or gemini-2.5-flash or llama3.2.

The gap

Fast/cheap models (Claude Haiku, GPT-4.1-mini, Gemini Flash, local models, etc.) behave differently from frontier models:

  • They follow JSON schemas reliably but can drift with long lists of prescriptive "best practices."
  • The 15-bullet Best Practices section is closer to harmful than helpful at small model sizes — the model spends attention budget pattern-matching on the wrong best-practice rather than actually planning.
  • A 4000-token system prompt is a meaningful fraction of a small model's effective context.
  • Many of the rules are about edge cases (date pickers, autocomplete dropdowns, PDF handling) that only apply on certain pages — surfacing them as conditional or page-class-specific would be more efficient.
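The page-conditional idea above could be sketched roughly like this (nothing of the sort exists in prompts.ts today; the helper name, patterns, and hint text are all illustrative):

```typescript
// Hypothetical sketch: surface edge-case guidance only when the current page
// plausibly needs it, instead of listing every best practice up front.
type PageHint = { pattern: RegExp; hint: string };

const pageHints: PageHint[] = [
  {
    pattern: /<input[^>]*type="date"/i,
    hint: "Prefer typing an ISO date over clicking through the date-picker widget.",
  },
  {
    pattern: /role="listbox"|autocomplete/i,
    hint: "After typing into an autocomplete field, select an option from the dropdown before submitting.",
  },
  {
    pattern: /application\/pdf|\.pdf(\?|$)/i,
    hint: "For PDFs, extract text with a reader tool rather than scrolling screenshots.",
  },
];

// Returns only the hints whose trigger pattern appears in the page source.
export function hintsForPage(pageHtml: string): string[] {
  return pageHints.filter(h => h.pattern.test(pageHtml)).map(h => h.hint);
}
```

This keeps the flash prompt short on ordinary pages while still injecting the date-picker or PDF guidance when the page actually contains those elements.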

Conversely, frontier models (Claude Opus, GPT o-series, Gemini Pro) benefit from being told what to think about — the structured reasoning hints (data grounding, pre-done verification — see the dedicated prompt-checklist issue) are exactly the scaffolding they're designed to use well.

Proposed scope

A. Add a "flash" system prompt variant

In prompts.ts, add buildActionLoopSystemPromptFlash(...) that:

  • Keeps the youArePrompt persona (5 lines, anchor for the role)
  • Keeps tool examples (dynamic by capability) — these are essential
  • Keeps Core Rules (numbered, short)
  • Drops the long Best Practices bullets — replace with a single line: "Adapt your approach based on what's actually available; if an action fails, try a different element or use search/find tools to inventory the page."
  • Keeps guardrails block conditional (when present)
  • Keeps interactive-mode block conditional (when present) — non-negotiable
  • Compresses When using done() to one line: "Format the result as VALID Markdown matching what the user asked for."
  • Keeps the toolCallInstruction tail

Expected output: ~1000-1500 tokens.
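A minimal sketch of the assembly (the section constants are placeholders for the real template fragments in prompts.ts, and the signature is simplified relative to the capability flags the real builder takes):

```typescript
// Sketch only: placeholder strings stand in for the real Liquid fragments.
const youArePrompt = "You are a careful web-browsing agent...";
const coreRules = "Core Rules:\n1. Call EXACTLY ONE tool per turn.\n...";
const toolCallInstruction = "Respond with exactly one tool call.";

export function buildActionLoopSystemPromptFlash(
  hasGuardrails: boolean,
  hasInteractive: boolean,
  toolExamples: string,
  guardrailsBlock = "",
  interactiveBlock = "",
): string {
  const sections = [
    youArePrompt,
    toolExamples,
    coreRules,
    // Single line replacing the ~15 Best Practices bullets:
    "Adapt your approach based on what's actually available; if an action " +
      "fails, try a different element or use search/find tools to inventory the page.",
    hasGuardrails ? guardrailsBlock : "",
    // Interactive-mode rules are kept whenever applicable (privacy/security).
    hasInteractive ? interactiveBlock : "",
    // Compressed done() guidance:
    "When using done(): format the result as VALID Markdown matching what the user asked for.",
    toolCallInstruction,
  ];
  return sections.filter(Boolean).join("\n\n");
}
```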

B. Add a model-class detector

```typescript
// In provider.ts:
export type ModelClass = "frontier" | "balanced" | "flash";

export function classifyModel(providerConfig: ProviderConfig): ModelClass {
  // Lowercase so the pattern match is case-insensitive across providers.
  const modelId = (providerConfig.model?.modelId ?? "").toLowerCase();
  // Known small/fast models. /flash/ already covers gemini-*-flash ids;
  // gemini-1.5-flash is listed explicitly for clarity.
  const flashPatterns = [
    /gpt-4\.1-mini/, /gpt-4o-mini/, /flash/, /haiku/, /local-model/,
    /llama3\.2(?!.*70b)/, /gemini-1\.5-flash/,
  ];
  // Known frontier models. Checked second, so an id matching both buckets
  // resolves to flash.
  const frontierPatterns = [
    /opus/, /gpt-4(\.\d)?$/, /o1\b/, /o3\b/, /gemini-2\.5-pro/, /sonnet/,
  ];
  if (flashPatterns.some(p => p.test(modelId))) return "flash";
  if (frontierPatterns.some(p => p.test(modelId))) return "frontier";
  // Anything unrecognized gets the default prompt.
  return "balanced";
}
```

The classifier is heuristic — known to mis-identify in edge cases (custom OpenAI-compatible endpoints, novel model names). Provide a WebAgentOptions.promptVariant?: "auto" | "frontier" | "balanced" | "flash" override so callers can pin the variant when the auto-detect is wrong.

C. Wire the variant selection

In initializeSystemPromptAndTask (webAgent.ts:1641-1672):

```typescript
const variant = this.options.promptVariant === "auto" || !this.options.promptVariant
  ? classifyModel(this.providerConfig)
  : this.options.promptVariant;

const systemPrompt = variant === "flash"
  ? buildActionLoopSystemPromptFlash(hasGuardrails, hasWebSearch, hasTabstack, hasStartingUrl, hasInteractive)
  : buildActionLoopSystemPrompt(hasGuardrails, hasWebSearch, hasTabstack, hasStartingUrl, hasInteractive);
```

(For now, only flash and the default differ. A future "frontier" variant could add Browser Use's <reasoning_rules>-style structured-thinking scaffolding.)

D. Surface the chosen variant in telemetry

Add promptVariant to the TASK_SETUP event payload so logs and eval-judge consumption can correlate variant choice with outcomes.
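A sketch of the payload change (the payload shape and builder name are assumptions, not taken from events.ts):

```typescript
// Assumed shapes; the real TASK_SETUP payload lives in events.ts.
type ModelClass = "frontier" | "balanced" | "flash";

interface TaskSetupPayload {
  taskId: string;
  modelId: string;
  promptVariant: ModelClass; // new field
}

// Hypothetical helper illustrating where the variant would be threaded in.
export function buildTaskSetupPayload(
  taskId: string,
  modelId: string,
  promptVariant: ModelClass,
): TaskSetupPayload {
  return { taskId, modelId, promptVariant };
}
```

With the variant in the payload, eval runs can group task outcomes by prompt variant without re-deriving the classification from the model id.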

Implementation notes

  • The model-class detector is necessarily heuristic. Erring on the side of balanced (the default prompt) is safest — flash variant should only fire when we're confident.
  • Test extensively against custom OpenAI-compatible endpoints (provider: openai-compatible): users running local models through them may set arbitrary modelId values. The override config option handles this case.
  • The flash variant should still include the interactive-mode block when applicable — that's a privacy/security concern, not a "best practice."
  • Snapshot tests on the existing default prompt should not change (both the frontier and balanced classifications render the default prompt for now, so known models keep their current output).
  • Verify on a small benchmark: does the flash variant on a flash model match or beat the default prompt's task-completion rate? If it underperforms, the variant needs tuning, not just stripping.

Acceptance criteria

  • buildActionLoopSystemPromptFlash exists in prompts.ts and produces a noticeably shorter rendered output.
  • classifyModel correctly identifies common flash models (Haiku, gpt-4.1-mini, Gemini Flash) in tests.
  • WebAgentOptions.promptVariant override works ("flash", "balanced", "frontier", "auto").
  • TASK_SETUP event includes the chosen variant.
  • Benchmark on at least one small eval set: flash-on-flash-model matches or beats default-on-flash-model.
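The classifier criterion could be covered by a table-driven test along these lines (a trimmed copy of the section B classifier is inlined here so the sketch runs standalone; in the repo it would be imported from provider.ts):

```typescript
// Inline, trimmed copy of the sketch classifier from section B.
type ModelClass = "frontier" | "balanced" | "flash";

function classifyModelId(modelId: string): ModelClass {
  const id = modelId.toLowerCase();
  const flashPatterns = [
    /gpt-4\.1-mini/, /gpt-4o-mini/, /flash/, /haiku/, /local-model/,
    /llama3\.2(?!.*70b)/,
  ];
  const frontierPatterns = [/opus/, /o1\b/, /o3\b/, /gemini-2\.5-pro/, /sonnet/];
  if (flashPatterns.some(p => p.test(id))) return "flash";
  if (frontierPatterns.some(p => p.test(id))) return "frontier";
  return "balanced";
}

// Table-driven expectations matching the acceptance criterion.
const cases: Array<[string, ModelClass]> = [
  ["claude-3-5-haiku-latest", "flash"],
  ["gpt-4.1-mini", "flash"],
  ["gemini-2.5-flash", "flash"],
  ["claude-opus-4", "frontier"],
  ["gemini-2.5-pro", "frontier"],
  ["some-unknown-model", "balanced"],
];

for (const [id, expected] of cases) {
  const got = classifyModelId(id);
  if (got !== expected) throw new Error(`${id}: expected ${expected}, got ${got}`);
}
```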

Effort estimate

1-2 days. The prompt-stripping work is fast; benchmarking takes a separate eval run.

Related issues

Pairs naturally with the pre-done verification checklist issue — the checklist should live in the default prompt variant; the flash variant gets a condensed form. Pairs with the prompt-caching issue — flash variant has a shorter cacheable prefix, but caching still helps.

Files likely affected

  • packages/core/src/prompts.ts (new buildActionLoopSystemPromptFlash)
  • packages/core/src/provider.ts (new classifyModel, ModelClass type)
  • packages/core/src/webAgent.ts (initializeSystemPromptAndTask, options surface)
  • packages/core/src/types/ (WebAgentOptions type)
  • packages/core/src/events.ts (TASK_SETUP payload)
  • packages/core/test/webAgent.test.ts

Metadata

Labels: enhancement (New feature or request)