LLM-jp-4-VL

| 🤗 Model  | 📄 Blog  | 🧑‍💻 Code  |


LLM-jp-4-VL is a series of vision-language models developed by LLM-jp. Currently, only a beta version is available.

This repository provides sample code for running inference with the LLM-jp-4-VL models.

Figure: LLM-jp-4-VL model architecture.

Usage

Install dependencies:

uv sync
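If you save the inference example that follows as a script, you can then run it inside the project environment with uv (the filename inference.py is illustrative, not part of the repository):

uv run inference.py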

Below is example code for inference.

import torch
from transformers import AutoProcessor, AutoModel

model_id = "llm-jp/llm-jp-4-vl-9B-beta"

# load model
model = (
    AutoModel.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        use_flash_attn=True,
    )
    .eval()
    .cuda()
)

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)


def generate(messages, max_new_tokens=256, temperature=0.0):
    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    if "pixel_values" in inputs:
        inputs["pixel_values"] = inputs["pixel_values"].to(dtype=model.dtype)

    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=temperature > 0,
        temperature=temperature if temperature > 0 else None,
    )

    text = processor.decode(outputs[0], skip_special_tokens=False)
    text = text.replace("<|channel|>final<|message|>", "")
    text = text.replace("<|return|>", "")
    text = text.replace(processor.tokenizer.eos_token, "")
    return text.strip()

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "assets/tweet.png"},
            {"type": "text", "text": "ツイート内容を全て抜き出してください"},
        ],
    }
]
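
Calling the helper runs generation on the sample image and prints the decoded response:

print(generate(messages))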

For more details, please refer to the code in the codebooks directory.

Evaluation Reproduction

To reproduce the evaluation results reported in our blog post, please refer to simple-evals-mm, our VLM evaluation framework.

License

This code is released under the Apache 2.0 license.

Citation

If you find our work useful, please consider citing the following papers:

@misc{sugiura2026jaglebuildinglargescalejapanese,
      title={Jagle: Building a Large-Scale Japanese Multimodal Post-Training Dataset for Vision-Language Models},
      author={Issa Sugiura and Keito Sasagawa and Keisuke Nakao and Koki Maeda and Ziqi Yin and Zhishen Yang and Shuhei Kurita and Yusuke Oda and Ryoko Tokuhisa and Daisuke Kawahara and Naoaki Okazaki},
      year={2026},
      eprint={2604.02048},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.02048},
}

@misc{sugiura2026jammevalrefinedcollectionjapanese,
      title={JAMMEval: A Refined Collection of Japanese Benchmarks for Reliable VLM Evaluation},
      author={Issa Sugiura and Koki Maeda and Shuhei Kurita and Yusuke Oda and Daisuke Kawahara and Naoaki Okazaki},
      year={2026},
      eprint={2604.00909},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.00909},
}
