feat(sft): surface multimodal payload through build_training_sample by hubert-marek · Pull Request #94 · PrimeIntellect-ai/renderers

hubert-marek · 2026-06-25T23:20:28Z

build_training_sample now returns a RenderedTrainingSample struct (token_ids, loss_mask, multi_modal_data, mm_token_type_ids) instead of a bare (token_ids, loss_mask) tuple. The mm fields are populated only when the renderer emitted media — None for text-only renderers and text-only samples — so text token_ids/loss_mask are byte-identical to before. This lets prime-rl's SFT consume one helper for both text and VLM instead of re-implementing the render+mask path inline for VLMs.

mm_token_type_ids (0=text, 1=image, 2=video) are built from the rendered placeholder ranges at full token-stream length; the consumer truncates/shifts them in lockstep with token_ids.

Rebased onto current renderers/main; current PR head is 99bedaf, which is the renderer commit pinned by PrimeIntellect-ai/prime-rl#2485.

Note

Surface multimodal payload through `build_training_sample` as a structured `RenderedTrainingSample`

Introduces a new frozen dataclass RenderedTrainingSample in renderers/base.py with fields token_ids, loss_mask, multi_modal_data, and mm_token_type_ids, replacing the previous (token_ids, loss_mask) tuple return type of build_training_sample.
Adds _build_mm_token_type_ids to generate per-token modality IDs (0=text, 1=image, 2=video) from placeholder ranges, aligned to the prime-rl convention.
For text-only samples or renderers, multi_modal_data and mm_token_type_ids are set to None.
RenderedTrainingSample is publicly exported from renderers/init.py.
Risk: callers unpacking the previous tuple return value will break and must be updated to access .token_ids and .loss_mask on the returned object.

^{Macroscope summarized 99bedaf.}

Note

Medium Risk
Breaking API change: any caller that unpacks the old 2-tuple must switch to .token_ids / .loss_mask (and optionally the new mm fields).

Overview
build_training_sample now returns a RenderedTrainingSample dataclass instead of a (token_ids, loss_mask) tuple, so SFT pipelines can get multimodal fields from one render+mask path.

The struct always includes token_ids and loss_mask. When a VLM renderer actually emits media, it also fills multi_modal_data and mm_token_type_ids (0=text, 1=image, 2=video per prime-rl), built from placeholder ranges via _build_mm_token_type_ids. Text-only renderers and empty media keep those fields None; text token_ids / loss_mask stay the same as before.

RenderedTrainingSample is exported from renderers. Tests were updated to use the new return type and to cover modality flag marking.

^{Reviewed by Cursor Bugbot for commit 99bedaf. Bugbot is set up for automated code reviews on this repo. Configure here.}

macroscopeapp · 2026-06-26T07:43:05Z

Approvability

Verdict: Needs human review

This PR introduces a breaking API change by changing build_training_sample's return type from a tuple to a dataclass, and adds new multimodal capability. Breaking API changes and new features warrant human review.

^{You can customize Macroscope's approvability policy. Learn more.}

eligotts · 2026-06-30T19:40:38Z

@@ -1534,15 +1534,54 @@ def _resolve_auto_config(
 # ---------------------------------------------------------------------------




can you add a comment here saying this mapping is matching the convention in prime-rl just to be crystal clear

eligotts · 2026-06-30T19:43:39Z

+from renderers.base import PlaceholderRange, _build_mm_token_type_ids
+
+
+def test_build_mm_token_type_ids_marks_ranges():


dont need these three tests

build_training_sample now returns a RenderedTrainingSample struct (token_ids, loss_mask, multi_modal_data, mm_token_type_ids) instead of a bare (token_ids, loss_mask) tuple. The mm fields are populated only when the renderer emitted media — None for text-only renderers and text-only samples — so text token_ids/loss_mask are byte-identical to before. This lets prime-rl's SFT consume one helper for both text and VLM instead of re-implementing the render+mask path inline for VLMs. mm_token_type_ids (0=text, 1=image, 2=video) are built from the rendered placeholder ranges at full token-stream length; the consumer truncates/shifts them in lockstep with token_ids. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

hubert-marek marked this pull request as ready for review June 26, 2026 07:41

hubert-marek force-pushed the pr1-sft-mm branch from b4c4051 to b9dcdae Compare June 30, 2026 19:05

hubert-marek requested a review from samsja June 30, 2026 19:05

eligotts reviewed Jun 30, 2026

View reviewed changes

hubert-marek force-pushed the pr1-sft-mm branch from b9dcdae to 99bedaf Compare June 30, 2026 20:02

eligotts approved these changes Jun 30, 2026

View reviewed changes

eligotts merged commit db8698a into main Jun 30, 2026
10 checks passed

eligotts deleted the pr1-sft-mm branch June 30, 2026 20:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(sft): surface multimodal payload through build_training_sample#94

feat(sft): surface multimodal payload through build_training_sample#94
eligotts merged 1 commit into
mainfrom
pr1-sft-mm

hubert-marek commented Jun 25, 2026 •

edited by macroscopeapp Bot

Loading

Uh oh!

macroscopeapp Bot commented Jun 26, 2026

Uh oh!

eligotts Jun 30, 2026

Uh oh!

eligotts Jun 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		@@ -1534,15 +1534,54 @@ def _resolve_auto_config(
		# ---------------------------------------------------------------------------

		from renderers.base import PlaceholderRange, _build_mm_token_type_ids


		def test_build_mm_token_type_ids_marks_ranges():

Uh oh!

Conversation

hubert-marek commented Jun 25, 2026 • edited by macroscopeapp Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Surface multimodal payload through build_training_sample as a structured RenderedTrainingSample

Uh oh!

macroscopeapp Bot commented Jun 26, 2026

Approvability

Uh oh!

eligotts Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

eligotts Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

hubert-marek commented Jun 25, 2026 •

edited by macroscopeapp Bot

Loading

Surface multimodal payload through `build_training_sample` as a structured `RenderedTrainingSample`