Vision fallback: caption images via a secondary model when the active model lacks vision

## Summary
When the active model has no native image support, route image attachments to a small vision-capable
model that produces a text description or OCR transcript, then inject that text in place of the
image so non-multimodal models can still "see".

## Approach
Hook the existing seam in `core/agent.py::_build_user_message`, which already branches on image
attachments. Reuse the existing secondary-model pattern (`_memory_llm` / `_background_llm`):
- Add a `supports_vision` capability flag per provider/model.
- Add a `_vision_llm(provider)` helper and a `vision` config block (`enabled` / `provider` /
  `model`), mirroring the `search` / `voice` blocks. Default off; engages only when the active model
  lacks vision and images are present.
- Caption each image with a **task-aware** prompt (the user's text is available) and replace the
  image block with `[Image: <description>]`.
- Cache descriptions by image hash to avoid re-captioning.
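
The bullets above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: `VisionFallback`, `Captioner`, `build_content`, and the dict-shaped content blocks are hypothetical names and shapes, and the real call to the vision model is stood in for by a plain callable.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical signature: (image_bytes, user_text) -> description.
# In the real agent this would wrap whatever `_vision_llm(provider)` returns.
Captioner = Callable[[bytes, str], str]


@dataclass
class VisionFallback:
    """Illustrative stand-in for the fallback path (assumed API, not the project's)."""

    caption: Captioner
    enabled: bool = False  # default off, as proposed
    _cache: dict[str, str] = field(default_factory=dict)

    def describe(self, image: bytes, user_text: str) -> str:
        # Cache by image hash so identical images are captioned only once.
        key = hashlib.sha256(image).hexdigest()
        if key not in self._cache:
            self._cache[key] = self.caption(image, user_text)
        return self._cache[key]

    def build_content(
        self, user_text: str, images: list[bytes], supports_vision: bool
    ) -> list[dict]:
        blocks: list[dict] = [{"type": "text", "text": user_text}]
        for image in images:
            if supports_vision or not self.enabled:
                # Native vision, or fallback disabled: pass the image through unchanged.
                blocks.append({"type": "image", "data": image})
            else:
                # Replace the image block with a task-aware text description.
                description = self.describe(image, user_text)
                blocks.append({"type": "text", "text": f"[Image: {description}]"})
        return blocks
```

One design point worth deciding: because captions are task-aware, a description cached for one question may be a poor fit for a different question about the same image. Keying the cache on the image hash alone (as proposed) maximizes hits; including the user's text in the key would trade hit rate for relevance.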

## UX & product
- **Mostly invisible by design** — it just works when needed. Product touch: when the user picks a
  model with no image support, surface a gentle prompt — "this model can't read images; set up a
  vision fallback?" — with a one-tap path to configure it.
- **Admin UI:** a small `vision` section (enable, provider, model) alongside the search/voice
  settings, plus an indicator on the model picker showing whether the active model reads images
  natively. The section should be **responsive/touch-friendly** at phone width and reuse the same
  settings components as the other sections.
- **Mobile-first (Telegram):** the payoff is on the go. A photo sent from a phone should "just work"
  and get a sensible answer even on a non-vision model, and the enable prompt is one tap.

## Setup & onboarding
- Off by default; auto-suggested only when the active model lacks vision. One-tap enable with a
  sensible default vision model — offered inline at model selection rather than as a separate chore.
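
As a rough sketch of the onboarding logic described here, assuming a config object that mirrors the proposed `vision` block: `VisionConfig`, `should_suggest_fallback`, and `enable_with_default` are hypothetical names, and the default provider/model are left as parameters because the issue does not name them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VisionConfig:
    """Hypothetical mirror of the proposed `vision` config block."""

    enabled: bool = False  # off by default
    provider: str = ""     # empty here; the issue does not specify a default
    model: str = ""


def should_suggest_fallback(active_supports_vision: bool, config: VisionConfig) -> bool:
    """Show the inline prompt only when the active model lacks vision
    and the fallback is not already enabled."""
    return not active_supports_vision and not config.enabled


def enable_with_default(config: VisionConfig, provider: str, model: str) -> VisionConfig:
    """What the one-tap path might do: flip `enabled` and fill in any
    unset field with the suggested default, keeping values the user already set."""
    return VisionConfig(
        enabled=True,
        provider=config.provider or provider,
        model=config.model or model,
    )
```

Keeping the suggestion check separate from the enable step means the model picker can call the former on every selection without touching stored settings.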

## Acceptance criteria
- With a non-vision model selected and the fallback enabled, an image message yields a useful text
  description injected into context.
- Selecting a non-vision model surfaces a **one-tap prompt** to enable the fallback; sending a photo
  from mobile then yields a useful response.
- With a native-vision model, images pass through unchanged (no extra call).
- Repeated identical images hit the cache.

## Related
- Enables screenshot reading for browser automation on non-vision models.

