Dataset Preparation Guide

This tutorial will guide you on how to prepare datasets for supervised fine-tuning of MiniCPM-o. Please strictly follow the format and requirements below to organize your data.

1. Data Format Description

Each data sample should be a dictionary containing the image path and multi-turn conversation content. For example:

{
  "image": "path/to/image.jpg",
  "conversations": [
    {"role": "user", "content": "Describe this image."},
    {"role": "assistant", "content": "This is a cat."}
  ]
}

image: Image path, supports both single and multiple images (see below).
conversations: Multi-turn conversation, must start with user, and role only supports "user" and "assistant".

Multi-image Input Format

To support multiple images, the image field should be a dictionary, with keys like <image_00>, <image_01>, and values as image paths:

{
  "image": {
    "<image_00>": "path/to/image1.jpg",
    "<image_01>": "path/to/image2.jpg"
  },
  "conversations": [
    {"role": "user", "content": "Compare <image_00> and <image_01>."},
    {"role": "assistant", "content": "The first image is a cat, the second is a dog."}
  ]
}

2. Dataset Loading and Preprocessing

Dataset loading and preprocessing mainly rely on the SupervisedDataset class. The core process is as follows:

Read raw data (in json or list format).
Load and transform images.
Tokenize and encode conversation content to generate input_ids, labels, position_ids, etc.
Support advanced image processing such as slicing and patching (slice_config).

Main Parameter Description

raw_data: List of raw data samples.
transform: Image preprocessing method (e.g., normalization, resizing, etc.).
tokenizer: Tokenizer, should match the model.
slice_config: Image slicing configuration (optional).
llm_type: Large model type (e.g., "minicpm", "llama3", "qwen").
patch_size: Image patch size, default is 14.
query_nums: Number of image tokens, default is 64.
batch_vision: Whether to process images in batch, default is False.
max_length: Maximum text length, default is 2048.

3. Dataset Examples

Single-image multi-turn conversation example:

{
  "image": "images/cat.jpg",
  "conversations": [
    {"role": "user", "content": "<image> What animal is this?"},
    {"role": "assistant", "content": "This is a cat."},
    {"role": "user", "content": "What is it doing?"},
    {"role": "assistant", "content": "It is sleeping."}
  ]
}

Multi-image conversation example:

{
  "image": {
    "<image_00>": "images/cat.jpg",
    "<image_01>": "images/dog.jpg"
  },
  "conversations": [
    {"role": "user", "content": "Compare <image_00> and <image_01>."},
    {"role": "assistant", "content": "The first image is a cat, the second is a dog."}
  ]
}

4. Common Issues and Notes

The conversation must start with user, and role only supports "user" and "assistant".
Image paths must be valid and support local paths.
For multiple images, use <image_xx> placeholders in conversations to correspond to the image dictionary.
For large images, it is recommended to configure slice_config for slicing.
If data loading fails, the logger will automatically resample a data sample.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dataset Preparation Guide

1. Data Format Description

Multi-image Input Format

2. Dataset Loading and Preprocessing

Main Parameter Description

3. Dataset Examples

4. Common Issues and Notes

FilesExpand file tree

dataset_guidance.md

Latest commit

History

dataset_guidance.md

File metadata and controls

Dataset Preparation Guide

1. Data Format Description

Multi-image Input Format

2. Dataset Loading and Preprocessing

Main Parameter Description

3. Dataset Examples

4. Common Issues and Notes