Vision fallback: caption images via a secondary model when the active model lacks vision #18

Description

@mattmezza

Summary

When the active model has no native image support, route image attachments to a small vision-capable
model to produce a text description / OCR, and inject that text in place of the image so
non-multimodal models can still "see".

Approach

Hook the existing seam in core/agent.py::_build_user_message, which already branches on image
attachments. Reuse the existing secondary-model pattern (_memory_llm / _background_llm):

  • Add a supports_vision capability flag per provider/model.
  • Add a _vision_llm(provider) helper and a vision config block (enabled / provider /
    model), mirroring the search / voice blocks. Default off; engages only when the active model
    lacks vision and images are present.
  • Caption each image with a task-aware prompt (the user's text is available) and replace the
    image block with [Image: <description>].
  • Cache descriptions by image hash to avoid re-captioning.
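
A minimal sketch of how the steps above might fit together, not the actual core/agent.py code. Every name here (VisionConfig, VisionFallback, TextBlock, ImageBlock, the caption callable) is hypothetical, and the captioner stands in for whatever _vision_llm(provider) would return. It assumes the message is a list of text and image blocks and that the caller already knows whether the active model supports vision.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Union


@dataclass
class VisionConfig:
    """Hypothetical `vision` config block; off by default."""
    enabled: bool = False
    provider: str = ""
    model: str = ""


@dataclass
class TextBlock:
    text: str


@dataclass
class ImageBlock:
    data: bytes


Block = Union[TextBlock, ImageBlock]
# A captioner takes (image bytes, user text) and returns a description.
Captioner = Callable[[bytes, str], str]


@dataclass
class VisionFallback:
    config: VisionConfig
    caption: Captioner  # stand-in for a call to the secondary vision model
    _cache: Dict[str, str] = field(default_factory=dict)

    def should_engage(self, supports_vision: bool, blocks: List[Block]) -> bool:
        # Engage only when enabled, the active model lacks vision,
        # and the message actually contains images.
        has_images = any(isinstance(b, ImageBlock) for b in blocks)
        return self.config.enabled and not supports_vision and has_images

    def describe(self, image: bytes, user_text: str) -> str:
        # Cache by content hash so identical images are captioned once.
        key = hashlib.sha256(image).hexdigest()
        if key not in self._cache:
            self._cache[key] = self.caption(image, user_text)
        return self._cache[key]

    def apply(self, supports_vision: bool, blocks: List[Block]) -> List[Block]:
        if not self.should_engage(supports_vision, blocks):
            return blocks  # pass through unchanged, no extra call
        user_text = " ".join(b.text for b in blocks if isinstance(b, TextBlock))
        out: List[Block] = []
        for b in blocks:
            if isinstance(b, ImageBlock):
                out.append(TextBlock(f"[Image: {self.describe(b.data, user_text)}]"))
            else:
                out.append(b)
        return out


if __name__ == "__main__":
    def fake_captioner(image: bytes, user_text: str) -> str:
        return "a receipt with three line items"

    fallback = VisionFallback(VisionConfig(enabled=True), fake_captioner)
    message = [TextBlock("What is the total?"), ImageBlock(b"\x89PNG...")]
    for block in fallback.apply(supports_vision=False, blocks=message):
        print(block)
```

One design point worth deciding: the caption prompt is task-aware, but the cache key here is the image hash alone, so a later question about the same image reuses the first description. Adding the user text to the key would avoid that at the cost of fewer cache hits.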

UX & product

  • Mostly invisible by design — it just works when needed. Product touch: when the user picks a
    model with no image support, surface a gentle prompt — "this model can't read images; set up a
    vision fallback?" — with a one-tap path to configure it.
  • Admin UI: a small vision section (enable, provider, model) alongside search/voice settings,
    with an indicator on the model picker showing whether the active model sees images natively —
    responsive/touch-friendly at phone width, reusing consistent settings components.
  • Mobile-first (Telegram): the payoff is mobile-native. Send a photo from your phone and get a
    sensible answer even on a non-vision model; enabling the fallback takes one tap.

Setup & onboarding

  • Off by default; auto-suggested only when the active model lacks vision. One-tap enable with a
    sensible default vision model — offered inline at model selection rather than as a separate chore.
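
The onboarding rule above could be sketched roughly as follows. This is an illustration under assumptions: the function names, the settings dict shape, and DEFAULT_VISION_MODEL are all hypothetical, and the issue does not name a default vision model, so a placeholder string is used.

```python
from typing import Optional

# Hypothetical placeholder; the issue does not specify a default vision model.
DEFAULT_VISION_MODEL = "example-vision-model"


def fallback_prompt(model_supports_vision: bool, fallback_enabled: bool) -> Optional[str]:
    """Return the suggestion text to show at model selection, or None."""
    if model_supports_vision or fallback_enabled:
        return None  # nothing to suggest
    return "This model can't read images; set up a vision fallback?"


def one_tap_enable(settings: dict, provider: str) -> dict:
    """Return a copy of the settings with the vision block switched on."""
    updated = dict(settings)
    updated["vision"] = {
        "enabled": True,
        "provider": provider,
        "model": DEFAULT_VISION_MODEL,
    }
    return updated
```

Returning a new settings dict instead of mutating the input keeps the one-tap action easy to undo if the user dismisses the prompt.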

Acceptance criteria

  • With a non-vision model selected and the fallback enabled, an image message yields a useful text
    description injected into context.
  • Selecting a non-vision model surfaces a one-tap prompt to enable the fallback; sending a photo
    from mobile then yields a useful response.
  • With a native-vision model, images pass through unchanged (no extra call).
  • Repeated identical images hit the cache.

Related

  • Enables screenshot reading for browser automation on non-vision models.

Metadata

Labels

  • in-progress: someone is working on this
  • new: new addition
  • todo: planned / not yet started
