Summary
When the active model has no native image support, route image attachments to a small vision-capable
model to produce a text description / OCR, and inject that text in place of the image so
non-multimodal models can still "see".
Approach
Hook the existing seam in core/agent.py::_build_user_message, which already branches on image
attachments. Reuse the existing secondary-model pattern (_memory_llm / _background_llm):
- Add a supports_vision capability flag per provider/model.
- Add a _vision_llm(provider) helper and a vision config block (enabled / provider / model),
mirroring the search / voice blocks. Default off; engages only when the active model lacks vision
and images are present.
- Caption each image with a task-aware prompt (the user's text is available) and replace the
image block with [Image: <description>].
- Cache descriptions by image hash to avoid re-captioning.
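The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the capability table, the VisionConfig dataclass, the block dictionaries, and the apply_vision_fallback / describe_image names are all assumptions made for the example. The captioner is passed in as a plain callable; in the real code it would presumably be built from _vision_llm(provider).

```python
import hashlib
from dataclasses import dataclass
from typing import Callable

# Hypothetical capability table; the real supports_vision flag would live
# alongside the provider/model metadata.
SUPPORTS_VISION = {
    ("example-provider", "text-only-model"): False,
    ("example-provider", "vision-model"): True,
}


@dataclass
class VisionConfig:
    """Assumed shape of the vision config block (enabled / provider / model)."""
    enabled: bool = False  # default off
    provider: str = ""
    model: str = ""


# A captioner takes (image_bytes, user_text) and returns a text description.
Captioner = Callable[[bytes, str], str]

# Descriptions cached by image hash to avoid re-captioning.
_caption_cache: dict[str, str] = {}


def supports_vision(provider: str, model: str) -> bool:
    # Assumption: unknown models are treated as non-vision.
    return SUPPORTS_VISION.get((provider, model), False)


def describe_image(data: bytes, user_text: str, caption: Captioner) -> str:
    key = hashlib.sha256(data).hexdigest()
    if key not in _caption_cache:
        _caption_cache[key] = caption(data, user_text)
    return _caption_cache[key]


def apply_vision_fallback(
    blocks: list[dict],
    user_text: str,
    active: tuple[str, str],
    cfg: VisionConfig,
    caption: Captioner,
) -> list[dict]:
    """Replace image blocks with '[Image: <description>]' text blocks.

    Blocks are assumed to look like {"type": "text", "text": ...} or
    {"type": "image", "data": <bytes>}. The input is returned unchanged when
    the fallback is disabled or the active model has native vision.
    """
    provider, model = active
    if not cfg.enabled or supports_vision(provider, model):
        return blocks
    out = []
    for block in blocks:
        if block.get("type") == "image":
            desc = describe_image(block["data"], user_text, caption)
            out.append({"type": "text", "text": f"[Image: {desc}]"})
        else:
            out.append(block)
    return out
```

One design point to settle: the captioning prompt is task-aware, but the cache key here is the image hash alone, so a cached description may reflect an earlier question about the same image. Including the user's text in the key is one alternative, at the cost of fewer cache hits.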
UX & product
- Mostly invisible by design — it just works when needed. Product touch: when the user picks a
model with no image support, surface a gentle prompt — "this model can't read images; set up a
vision fallback?" — with a one-tap path to configure it.
- Admin UI: a small vision section (enable, provider, model) alongside search/voice settings, with
an indicator on the model picker showing whether the active model sees images natively. The section
should be responsive and touch-friendly at phone width and reuse the existing settings components.
- On the go (Telegram): the payoff is mobile-native — send a photo from your phone and get a
sensible answer even on a non-vision model.
- Mobile-first: the whole point is that a photo sent from a phone "just works"; the enable prompt
is one tap.
Setup & onboarding
- Off by default; auto-suggested only when the active model lacks vision. One-tap enable with a
sensible default vision model — offered inline at model selection rather than as a separate chore.
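The suggestion rule and the one-tap enable could be sketched as below. The VisionSettings dataclass and both function names are hypothetical, and the default provider and model are left as parameters because the proposal does not name them.

```python
from dataclasses import dataclass


@dataclass
class VisionSettings:
    """Assumed shape of the vision config block."""
    enabled: bool = False  # off by default
    provider: str = ""
    model: str = ""


def should_suggest_fallback(model_supports_vision: bool, settings: VisionSettings) -> bool:
    # Suggest only when the active model lacks vision and the fallback is not already on.
    return not model_supports_vision and not settings.enabled


def enable_with_default(
    settings: VisionSettings, default_provider: str, default_model: str
) -> VisionSettings:
    # One-tap enable: keep any provider/model already configured, else use the defaults.
    return VisionSettings(
        enabled=True,
        provider=settings.provider or default_provider,
        model=settings.model or default_model,
    )
```

Calling should_suggest_fallback at model selection time keeps the prompt inline with the model picker, so enabling the fallback does not become a separate settings step.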
Acceptance criteria
- With a non-vision model selected and the fallback enabled, an image message yields a useful text
description injected into context.
- Selecting a non-vision model surfaces a one-tap prompt to enable the fallback; sending a photo
from mobile then yields a useful response.
- With a native-vision model, images pass through unchanged (no extra call).
- Repeated identical images hit the cache.
Related
- Enables screenshot reading for: browser automation (on non-vision models).