Skip to content

feat(sft): surface multimodal payload through build_training_sample#94

Merged
eligotts merged 1 commit into
mainfrom
pr1-sft-mm
Jun 30, 2026
Merged

feat(sft): surface multimodal payload through build_training_sample#94
eligotts merged 1 commit into
mainfrom
pr1-sft-mm

Conversation

@hubert-marek

@hubert-marek hubert-marek commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

build_training_sample now returns a RenderedTrainingSample struct (token_ids, loss_mask, multi_modal_data, mm_token_type_ids) instead of a bare (token_ids, loss_mask) tuple. The mm fields are populated only when the renderer emitted media — None for text-only renderers and text-only samples — so text token_ids/loss_mask are byte-identical to before. This lets prime-rl's SFT consume one helper for both text and VLM instead of re-implementing the render+mask path inline for VLMs.

mm_token_type_ids (0=text, 1=image, 2=video) are built from the rendered placeholder ranges at full token-stream length; the consumer truncates/shifts them in lockstep with token_ids.

Rebased onto current renderers/main; current PR head is 99bedaf, which is the renderer commit pinned by PrimeIntellect-ai/prime-rl#2485.

Note

Surface multimodal payload through build_training_sample as a structured RenderedTrainingSample

  • Introduces a new frozen dataclass RenderedTrainingSample in renderers/base.py with fields token_ids, loss_mask, multi_modal_data, and mm_token_type_ids, replacing the previous (token_ids, loss_mask) tuple return type of build_training_sample.
  • Adds _build_mm_token_type_ids to generate per-token modality IDs (0=text, 1=image, 2=video) from placeholder ranges, aligned to the prime-rl convention.
  • For text-only samples or renderers, multi_modal_data and mm_token_type_ids are set to None.
  • RenderedTrainingSample is publicly exported from renderers/init.py.
  • Risk: callers unpacking the previous tuple return value will break and must be updated to access .token_ids and .loss_mask on the returned object.

Macroscope summarized 99bedaf.


Note

Medium Risk
Breaking API change: any caller that unpacks the old 2-tuple must switch to .token_ids / .loss_mask (and optionally the new mm fields).

Overview
build_training_sample now returns a RenderedTrainingSample dataclass instead of a (token_ids, loss_mask) tuple, so SFT pipelines can get multimodal fields from one render+mask path.

The struct always includes token_ids and loss_mask. When a VLM renderer actually emits media, it also fills multi_modal_data and mm_token_type_ids (0=text, 1=image, 2=video per prime-rl), built from placeholder ranges via _build_mm_token_type_ids. Text-only renderers and empty media keep those fields None; text token_ids / loss_mask stay the same as before.

RenderedTrainingSample is exported from renderers. Tests were updated to use the new return type and to cover modality flag marking.

Reviewed by Cursor Bugbot for commit 99bedaf. Bugbot is set up for automated code reviews on this repo. Configure here.

@hubert-marek hubert-marek marked this pull request as ready for review June 26, 2026 07:41
@macroscopeapp

macroscopeapp Bot commented Jun 26, 2026

Copy link
Copy Markdown

Approvability

Verdict: Needs human review

This PR introduces a breaking API change by changing build_training_sample's return type from a tuple to a dataclass, and adds new multimodal capability. Breaking API changes and new features warrant human review.

You can customize Macroscope's approvability policy. Learn more.

Comment thread renderers/base.py
@@ -1534,15 +1534,54 @@ def _resolve_auto_config(
# ---------------------------------------------------------------------------


Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment here saying this mapping is matching the convention in prime-rl just to be crystal clear

from renderers.base import PlaceholderRange, _build_mm_token_type_ids


def test_build_mm_token_type_ids_marks_ranges():

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dont need these three tests

build_training_sample now returns a RenderedTrainingSample struct
(token_ids, loss_mask, multi_modal_data, mm_token_type_ids) instead of a
bare (token_ids, loss_mask) tuple. The mm fields are populated only when
the renderer emitted media — None for text-only renderers and text-only
samples — so text token_ids/loss_mask are byte-identical to before. This
lets prime-rl's SFT consume one helper for both text and VLM instead of
re-implementing the render+mask path inline for VLMs.

mm_token_type_ids (0=text, 1=image, 2=video) are built from the
rendered placeholder ranges at full token-stream length; the consumer
truncates/shifts them in lockstep with token_ids.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@eligotts eligotts merged commit db8698a into main Jun 30, 2026
10 checks passed
@eligotts eligotts deleted the pr1-sft-mm branch June 30, 2026 20:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants