Add video-scaling function for SFT padding by hengtaoguo · Pull Request #4353 · AI-Hypercomputer/maxtext

hengtaoguo · 2026-07-05T04:06:41Z

Description

Scale-to-fit before padding in video preprocessor and fix audio-in-video sequence serialization. This PR serves as the groundwork for varied-size video SFT:

Scale-to-fit before padding: Added scale_to_fit_video_grid preprocessing step in Qwen3-Omni image preprocessor to proportionally downscale videos exceeding video_max_grid_h/w (resolving downstream shape crashes during batch padding).
Unit test: Added test_scale_to_fit_video_before_padding to verify correct aspect-ratio preserving downscaling and successful padding setup.
Fixed a typo in add_extra_tokens_for_qwen3_omni where <|audio_pad|> (151675) was appended at the end of the interleaved video-audio sequence instead of <|audio_end|> (151670).

Tests

python -m pytest tests/unit/qwen3_omni_layers_test.py::TestQwen3OmniMoeVisionPatchEmbed::test_scale_to_fit_video_before_padding -vv -s

Checklist

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

codecov · 2026-07-05T04:11:00Z

Codecov Report

❌ Patch coverage is 0% with 17 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/maxtext/multimodal/processor_qwen3_omni.py	0.00%	17 Missing ⚠️

📢 Thoughts on this report? Let us know!

github-actions · 2026-07-05T04:25:33Z

🤖 Hi @hengtaoguo, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

github-actions

## 📋 Review Summary

This PR introduces video rescaling functionality (scale_to_fit_video_grid) to proactively prevent sequence shape crashes during padding by enforcing the video_max_grid_h/w constraints. The test additions are well thought out and the fix for audio_end token appending is correct.

🔍 General Feedback

The core implementation of aspect-ratio preserving downscaling is solid and functionally sound.
There are a few edge cases related to floating point precision and consistency of the image scaling factor between the intermediate and final steps of the preprocessing pipeline that should be addressed.
Great job adding the unit tests to explicitly verify the grid limits and downscaling behavior!

github-actions · 2026-07-05T04:29:04Z

+  scaled_h = max(factor, math.floor(height * scale / factor) * factor)
+  scaled_w = max(factor, math.floor(width * scale / factor) * factor)


🟢 Floating point precision during division can sometimes result in `height * scale` or `width * scale` being marginally less than the exact integer (e.g. `511.99999999999994` instead of `512.0`). This causes `math.floor` to drop down by an entire grid unit unnecessarily.
Adding a tiny epsilon before flooring prevents this unintended truncation while preserving the intended rounding behavior.

Suggested change

scaled_h = max(factor, math.floor(height * scale / factor) * factor)

scaled_w = max(factor, math.floor(width * scale / factor) * factor)

scaled_h = max(factor, math.floor(height * scale / factor + 1e-5) * factor)

scaled_w = max(factor, math.floor(width * scale / factor + 1e-5) * factor)

github-actions · 2026-07-05T04:29:04Z

+      patch_size=patch_size,
+      merge_size=merge_size,
+  )


🟡 Pass `patch_size * merge_size` here to maintain the 32-multiple constraint intended for the final resize pass.

Suggested change

patch_size=patch_size,

merge_size=merge_size,

)

patch_size=patch_size,

factor=patch_size * merge_size,

)

github-actions · 2026-07-05T04:29:05Z

+    patch_size: int,
+    merge_size: int,
+) -> tuple[int, int]:
+  """Rescales height and width proportionally if they exceed the maximum grid dimensions.
+
+  Args:
+    height: Image height in pixels.
+    width: Image width in pixels.
+    max_grid_h: Maximum allowed height in grid units (patches), or None.
+    max_grid_w: Maximum allowed width in grid units (patches), or None.
+    patch_size: ViT patch size in pixels.
+    merge_size: Spatial merge size in patches.
+
+  Returns:
+    Tuple of (scaled_height, scaled_width) in pixels, divisible by factor = patch_size * merge_size.
+  """
+  if max_grid_h is None and max_grid_w is None:
+    return height, width
+  if max_grid_h is None or max_grid_w is None:
+    raise ValueError("video_max_grid_h and video_max_grid_w must be set together or both None.")
+
+  factor = patch_size * merge_size


🟡 `scale_to_fit_video_grid` currently hardcodes the divisibility constraint to `patch_size * merge_size`. In `preprocess_video`, the first resize step targets a divisibility constraint of `IMAGE_FACTOR` (28), whereas the second resize targets `patch_size * merge_size` (32). Hardcoding this constraint could alter the aspect ratio of the intermediate image if it gets downscaled.
Consider passing the factor as an explicit argument so that it can match the target constraint of the respective resize step.

Suggested change

patch_size: int,

merge_size: int,

) -> tuple[int, int]:

"""Rescales height and width proportionally if they exceed the maximum grid dimensions.

Args:

height: Image height in pixels.

width: Image width in pixels.

max_grid_h: Maximum allowed height in grid units (patches), or None.

max_grid_w: Maximum allowed width in grid units (patches), or None.

patch_size: ViT patch size in pixels.

merge_size: Spatial merge size in patches.

Returns:

Tuple of (scaled_height, scaled_width) in pixels, divisible by factor = patch_size * merge_size.

"""

if max_grid_h is None and max_grid_w is None:

return height, width

if max_grid_h is None or max_grid_w is None:

raise ValueError("video_max_grid_h and video_max_grid_w must be set together or both None.")

factor = patch_size * merge_size

patch_size: int,

factor: int,

) -> tuple[int, int]:

"""Rescales height and width proportionally if they exceed the maximum grid dimensions.

Args:

height: Image height in pixels.

width: Image width in pixels.

max_grid_h: Maximum allowed height in grid units (patches), or None.

max_grid_w: Maximum allowed width in grid units (patches), or None.

patch_size: ViT patch size in pixels.

factor: The divisibility factor to apply to the output dimensions.

Returns:

Tuple of (scaled_height, scaled_width) in pixels, divisible by factor.

"""

if max_grid_h is None and max_grid_w is None:

return height, width

if max_grid_h is None or max_grid_w is None:

raise ValueError("video_max_grid_h and video_max_grid_w must be set together or both None.")

github-actions · 2026-07-05T04:29:05Z

+      patch_size=patch_size,
+      merge_size=merge_size,
+  )


🟡 Pass `IMAGE_FACTOR` here to maintain the 28-multiple constraint intended for the first resize pass.

Suggested change

patch_size=patch_size,

merge_size=merge_size,

)

patch_size=patch_size,

factor=IMAGE_FACTOR,

)

github-actions · 2026-07-05T04:29:05Z

+        patch_size=patch_size,
+        merge_size=merge_size,
+    )


🟡 Update the test call to use the explicit `factor` argument.

Suggested change

patch_size=patch_size,

merge_size=merge_size,

)

patch_size=patch_size,

factor=patch_size * merge_size,

)

Add video-scaling function for SFT padding

d586be8

hengtaoguo force-pushed the hengtaoguo-sft-video branch from 5eead93 to d586be8 Compare July 5, 2026 04:22

hengtaoguo added the gemini-review label Jul 5, 2026

github-actions Bot reviewed Jul 5, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add video-scaling function for SFT padding#4353

Add video-scaling function for SFT padding#4353
hengtaoguo wants to merge 1 commit into
mainfrom
hengtaoguo-sft-video

hengtaoguo commented Jul 5, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jul 5, 2026

Uh oh!

github-actions Bot commented Jul 5, 2026

Uh oh!

github-actions Bot left a comment

Uh oh!

github-actions Bot Jul 5, 2026

Uh oh!

github-actions Bot Jul 5, 2026

Uh oh!

github-actions Bot Jul 5, 2026

Uh oh!

github-actions Bot Jul 5, 2026

Uh oh!

github-actions Bot Jul 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		scaled_h = max(factor, math.floor(height * scale / factor) * factor)
		scaled_w = max(factor, math.floor(width * scale / factor) * factor)

Uh oh!

Conversation

hengtaoguo commented Jul 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Tests

Checklist

Uh oh!

codecov Bot commented Jul 5, 2026

Codecov Report

Uh oh!

github-actions Bot commented Jul 5, 2026

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

🔍 General Feedback

Uh oh!

github-actions Bot Jul 5, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jul 5, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jul 5, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jul 5, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jul 5, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

hengtaoguo commented Jul 5, 2026 •

edited

Loading