Skip to content

Add video-scaling function for SFT padding#4353

Draft
hengtaoguo wants to merge 1 commit into
mainfrom
hengtaoguo-sft-video
Draft

Add video-scaling function for SFT padding#4353
hengtaoguo wants to merge 1 commit into
mainfrom
hengtaoguo-sft-video

Conversation

@hengtaoguo

@hengtaoguo hengtaoguo commented Jul 5, 2026

Copy link
Copy Markdown
Collaborator

Description

Scale-to-fit before padding in video preprocessor and fix audio-in-video sequence serialization. This PR serves as the groundwork for varied-size video SFT:

  • Scale-to-fit before padding: Added scale_to_fit_video_grid preprocessing step in Qwen3-Omni image preprocessor to proportionally downscale videos exceeding video_max_grid_h/w (resolving downstream shape crashes during batch padding).
  • Unit test: Added test_scale_to_fit_video_before_padding to verify correct aspect-ratio preserving downscaling and successful padding setup.
  • Fixed a typo in add_extra_tokens_for_qwen3_omni where <|audio_pad|> (151675) was appended at the end of the interleaved video-audio sequence instead of <|audio_end|> (151670).

Tests

python -m pytest tests/unit/qwen3_omni_layers_test.py::TestQwen3OmniMoeVisionPatchEmbed::test_scale_to_fit_video_before_padding -vv -s

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@codecov

codecov Bot commented Jul 5, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 0% with 17 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/maxtext/multimodal/processor_qwen3_omni.py 0.00% 17 Missing ⚠️

📢 Thoughts on this report? Let us know!

@github-actions

github-actions Bot commented Jul 5, 2026

Copy link
Copy Markdown

🤖 Hi @hengtaoguo, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

## 📋 Review Summary

This PR introduces video rescaling functionality (scale_to_fit_video_grid) to proactively prevent sequence shape crashes during padding by enforcing the video_max_grid_h/w constraints. The test additions are well thought out and the fix for audio_end token appending is correct.

🔍 General Feedback

  • The core implementation of aspect-ratio preserving downscaling is solid and functionally sound.
  • There are a few edge cases related to floating point precision and consistency of the image scaling factor between the intermediate and final steps of the preprocessing pipeline that should be addressed.
  • Great job adding the unit tests to explicitly verify the grid limits and downscaling behavior!

Comment on lines +264 to +265
scaled_h = max(factor, math.floor(height * scale / factor) * factor)
scaled_w = max(factor, math.floor(width * scale / factor) * factor)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Floating point precision during division can sometimes result in `height * scale` or `width * scale` being marginally less than the exact integer (e.g. `511.99999999999994` instead of `512.0`). This causes `math.floor` to drop down by an entire grid unit unnecessarily.

Adding a tiny epsilon before flooring prevents this unintended truncation while preserving the intended rounding behavior.

Suggested change
scaled_h = max(factor, math.floor(height * scale / factor) * factor)
scaled_w = max(factor, math.floor(width * scale / factor) * factor)
scaled_h = max(factor, math.floor(height * scale / factor + 1e-5) * factor)
scaled_w = max(factor, math.floor(width * scale / factor + 1e-5) * factor)

Comment on lines +544 to +546
patch_size=patch_size,
merge_size=merge_size,
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Pass `patch_size * merge_size` here to maintain the 32-multiple constraint intended for the final resize pass.
Suggested change
patch_size=patch_size,
merge_size=merge_size,
)
patch_size=patch_size,
factor=patch_size * merge_size,
)

Comment on lines +235 to +256
patch_size: int,
merge_size: int,
) -> tuple[int, int]:
"""Rescales height and width proportionally if they exceed the maximum grid dimensions.

Args:
height: Image height in pixels.
width: Image width in pixels.
max_grid_h: Maximum allowed height in grid units (patches), or None.
max_grid_w: Maximum allowed width in grid units (patches), or None.
patch_size: ViT patch size in pixels.
merge_size: Spatial merge size in patches.

Returns:
Tuple of (scaled_height, scaled_width) in pixels, divisible by factor = patch_size * merge_size.
"""
if max_grid_h is None and max_grid_w is None:
return height, width
if max_grid_h is None or max_grid_w is None:
raise ValueError("video_max_grid_h and video_max_grid_w must be set together or both None.")

factor = patch_size * merge_size

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 `scale_to_fit_video_grid` currently hardcodes the divisibility constraint to `patch_size * merge_size`. In `preprocess_video`, the first resize step targets a divisibility constraint of `IMAGE_FACTOR` (28), whereas the second resize targets `patch_size * merge_size` (32). Hardcoding this constraint could alter the aspect ratio of the intermediate image if it gets downscaled.

Consider passing the factor as an explicit argument so that it can match the target constraint of the respective resize step.

Suggested change
patch_size: int,
merge_size: int,
) -> tuple[int, int]:
"""Rescales height and width proportionally if they exceed the maximum grid dimensions.
Args:
height: Image height in pixels.
width: Image width in pixels.
max_grid_h: Maximum allowed height in grid units (patches), or None.
max_grid_w: Maximum allowed width in grid units (patches), or None.
patch_size: ViT patch size in pixels.
merge_size: Spatial merge size in patches.
Returns:
Tuple of (scaled_height, scaled_width) in pixels, divisible by factor = patch_size * merge_size.
"""
if max_grid_h is None and max_grid_w is None:
return height, width
if max_grid_h is None or max_grid_w is None:
raise ValueError("video_max_grid_h and video_max_grid_w must be set together or both None.")
factor = patch_size * merge_size
patch_size: int,
factor: int,
) -> tuple[int, int]:
"""Rescales height and width proportionally if they exceed the maximum grid dimensions.
Args:
height: Image height in pixels.
width: Image width in pixels.
max_grid_h: Maximum allowed height in grid units (patches), or None.
max_grid_w: Maximum allowed width in grid units (patches), or None.
patch_size: ViT patch size in pixels.
factor: The divisibility factor to apply to the output dimensions.
Returns:
Tuple of (scaled_height, scaled_width) in pixels, divisible by factor.
"""
if max_grid_h is None and max_grid_w is None:
return height, width
if max_grid_h is None or max_grid_w is None:
raise ValueError("video_max_grid_h and video_max_grid_w must be set together or both None.")

Comment on lines +515 to +517
patch_size=patch_size,
merge_size=merge_size,
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Pass `IMAGE_FACTOR` here to maintain the 28-multiple constraint intended for the first resize pass.
Suggested change
patch_size=patch_size,
merge_size=merge_size,
)
patch_size=patch_size,
factor=IMAGE_FACTOR,
)

Comment on lines +458 to +460
patch_size=patch_size,
merge_size=merge_size,
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Update the test call to use the explicit `factor` argument.
Suggested change
patch_size=patch_size,
merge_size=merge_size,
)
patch_size=patch_size,
factor=patch_size * merge_size,
)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant