Skip to content

Feature Request: Gemma Multimodal Support #2224

@rycerzes

Description

@rycerzes

Is your feature request related to a problem? Please describe.
Users can run Gemma in text mode, but cannot use Gemma multimodal models (image + text) through a first-class Gemma path in llama-cpp-python. Today, gemma is only a text chat format, server multimodal branches cover other formats (llava/moondream/nanollava/llama-3-vision-alpha/minicpm/qwen), and docs do not show Gemma multimodal setup. This creates confusion and extra integration work.

Describe the solution you'd like
Add official Gemma multimodal support in both Python API and server API:

  • Add a Gemma multimodal chat handler (MTMD/image_url flow).
  • Add server routing for a Gemma multimodal chat_format.
  • Document required model/projector files and provide working examples.
  • Add tests for registration, config validation, and a multimodal smoke path.

Describe alternatives you've considered

  • Continue using existing multimodal formats (for example llava or qwen2.5-vl) instead of Gemma.
  • Build a custom local handler outside the project.
  • Use HF tokenizer-template fallback with manual prompt engineering and custom preprocessing.

These workarounds are possible but reduce portability and are harder to maintain.

Additional context
Expected outcome:

  • Users can start the server with a Gemma multimodal format and send OpenAI-style image_url chat requests.
  • Errors should be explicit when required projector/mtmd assets are missing or when model vision support is unavailable.

Potential acceptance checks:

  • New chat format loads successfully.
  • Mixed text + image request returns a valid response.
  • README/docs include Gemma multimodal setup instructions.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions