Add video-scaling function for SFT padding #4353
This PR introduces video rescaling functionality (scale_to_fit_video_grid) to proactively prevent sequence shape crashes during padding by enforcing the video_max_grid_h/w constraints. The test additions are well thought out and the fix for audio_end token appending is correct.
🔍 General Feedback
- The core implementation of aspect-ratio preserving downscaling is solid and functionally sound.
- Two edge cases should be addressed: floating-point precision in the flooring step, and consistency of the image scaling factor between the intermediate and final steps of the preprocessing pipeline.
- Great job adding the unit tests to explicitly verify the grid limits and downscaling behavior!
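As a companion to the audio_end fix noted in the summary, a regression check could look roughly like the sketch below. The token IDs are the ones quoted in the PR description; the helper name ends_with_audio_end and the sample sequences are hypothetical and are not code from this PR.

```python
AUDIO_PAD_ID = 151675  # <|audio_pad|>, as quoted in the PR description
AUDIO_END_ID = 151670  # <|audio_end|>, as quoted in the PR description


def ends_with_audio_end(token_ids: list) -> bool:
    """Return True when a non-empty sequence closes with <|audio_end|>."""
    return bool(token_ids) and token_ids[-1] == AUDIO_END_ID


# Hypothetical sequences: 1 and 2 stand in for arbitrary earlier tokens.
buggy_sequence = [1, 2, AUDIO_PAD_ID, AUDIO_PAD_ID]
fixed_sequence = [1, 2, AUDIO_PAD_ID, AUDIO_END_ID]

print(ends_with_audio_end(buggy_sequence))  # False
print(ends_with_audio_end(fixed_sequence))  # True
```

A check of this shape only inspects the final token, so it would not catch errors earlier in the interleaved sequence.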
    scaled_h = max(factor, math.floor(height * scale / factor) * factor)
    scaled_w = max(factor, math.floor(width * scale / factor) * factor)
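Floating-point error can leave height * scale / factor just below an integer when the exact result is a whole number (for example 15.999999999999998 instead of 16), so math.floor drops a full factor step. Adding a tiny epsilon before flooring prevents this unintended truncation while preserving the intended rounding behavior.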
Suggested change:
    - scaled_h = max(factor, math.floor(height * scale / factor) * factor)
    - scaled_w = max(factor, math.floor(width * scale / factor) * factor)
    + scaled_h = max(factor, math.floor(height * scale / factor + 1e-5) * factor)
    + scaled_w = max(factor, math.floor(width * scale / factor + 1e-5) * factor)
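The truncation this suggestion guards against can be reproduced in isolation. The following is a minimal sketch, not the PR's code: floor_to_factor is a hypothetical helper, factor = 32 is an assumed value, and the near-16 input is constructed by hand to stand in for a product whose exact value is 16.

```python
import math


def floor_to_factor(units: float, factor: int, eps: float = 0.0) -> int:
    """Floor a size given in multiples of `factor` and return it in pixels.

    `units` plays the role of `height * scale / factor` in the reviewed code.
    """
    return max(factor, math.floor(units + eps) * factor)


# Largest double strictly below 16.0 (doubles in [8, 16) are spaced 2**-49 apart).
ALMOST_16 = 16.0 - 2.0**-49

print(floor_to_factor(ALMOST_16, 32))        # 480: a whole factor step is lost
print(floor_to_factor(ALMOST_16, 32, 1e-5))  # 512: epsilon recovers the intended value
print(floor_to_factor(15.5, 32, 1e-5))       # 480: genuine fractions still round down
```

The epsilon only changes the result when the fractional part is within 1e-5 of the next integer, so ordinary downscaling results are unaffected.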
        patch_size=patch_size,
        merge_size=merge_size,
    )
Suggested change:
          patch_size=patch_size,
    -     merge_size=merge_size,
    +     factor=patch_size * merge_size,
      )
        patch_size: int,
        merge_size: int,
    ) -> tuple[int, int]:
      """Rescales height and width proportionally if they exceed the maximum grid dimensions.

      Args:
        height: Image height in pixels.
        width: Image width in pixels.
        max_grid_h: Maximum allowed height in grid units (patches), or None.
        max_grid_w: Maximum allowed width in grid units (patches), or None.
        patch_size: ViT patch size in pixels.
        merge_size: Spatial merge size in patches.

      Returns:
        Tuple of (scaled_height, scaled_width) in pixels, divisible by factor = patch_size * merge_size.
      """
      if max_grid_h is None and max_grid_w is None:
        return height, width
      if max_grid_h is None or max_grid_w is None:
        raise ValueError("video_max_grid_h and video_max_grid_w must be set together or both None.")

      factor = patch_size * merge_size
Consider passing the factor as an explicit argument so that it can match the target constraint of the respective resize step.
Suggested change:
          patch_size: int,
    -     merge_size: int,
    +     factor: int,
      ) -> tuple[int, int]:
        """Rescales height and width proportionally if they exceed the maximum grid dimensions.
        Args:
          height: Image height in pixels.
          width: Image width in pixels.
          max_grid_h: Maximum allowed height in grid units (patches), or None.
          max_grid_w: Maximum allowed width in grid units (patches), or None.
          patch_size: ViT patch size in pixels.
    -     merge_size: Spatial merge size in patches.
    +     factor: The divisibility factor to apply to the output dimensions.
        Returns:
    -     Tuple of (scaled_height, scaled_width) in pixels, divisible by factor = patch_size * merge_size.
    +     Tuple of (scaled_height, scaled_width) in pixels, divisible by factor.
        """
        if max_grid_h is None and max_grid_w is None:
          return height, width
        if max_grid_h is None or max_grid_w is None:
          raise ValueError("video_max_grid_h and video_max_grid_w must be set together or both None.")
    -   factor = patch_size * merge_size
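For orientation, a function with the suggested signature might look roughly like the sketch below. Only the signature, the None handling, and the flooring lines come from the diff in this review; the name scale_to_fit_sketch, the way scale is derived from the grid limits, and the early return for inputs that already fit are assumptions made for illustration.

```python
import math
from typing import Optional


def scale_to_fit_sketch(
    height: int,
    width: int,
    max_grid_h: Optional[int],
    max_grid_w: Optional[int],
    patch_size: int,
    factor: int,
) -> tuple:
  """Illustrative stand-in for the reviewed function, with `factor` passed explicitly."""
  if max_grid_h is None and max_grid_w is None:
    return height, width
  if max_grid_h is None or max_grid_w is None:
    raise ValueError("video_max_grid_h and video_max_grid_w must be set together or both None.")

  # Assumption: grid limits are counted in patches, so the pixel limit is grid * patch_size.
  max_h_px = max_grid_h * patch_size
  max_w_px = max_grid_w * patch_size
  if height <= max_h_px and width <= max_w_px:
    return height, width  # Assumption: inputs that already fit are left untouched.

  # One shared scale for both axes keeps the aspect ratio.
  scale = min(max_h_px / height, max_w_px / width)
  scaled_h = max(factor, math.floor(height * scale / factor + 1e-5) * factor)
  scaled_w = max(factor, math.floor(width * scale / factor + 1e-5) * factor)
  return scaled_h, scaled_w


# Hypothetical values: 1080x1920 input, 32x32 grid limit, patch 16, factor 32.
print(scale_to_fit_sketch(1080, 1920, 32, 32, 16, 32))  # (288, 512)
```

Passing factor explicitly lets each call site choose the divisibility that its own resize step requires, which is the point of the suggestion.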
        patch_size=patch_size,
        merge_size=merge_size,
    )
Suggested change:
          patch_size=patch_size,
    -     merge_size=merge_size,
    +     factor=IMAGE_FACTOR,
      )
        patch_size=patch_size,
        merge_size=merge_size,
    )
Suggested change:
          patch_size=patch_size,
    -     merge_size=merge_size,
    +     factor=patch_size * merge_size,
      )
Description
Scale-to-fit before padding in video preprocessor and fix audio-in-video sequence serialization. This PR serves as the groundwork for varied-size video SFT:
- Add a scale_to_fit_video_grid preprocessing step in the Qwen3-Omni image preprocessor to proportionally downscale videos exceeding video_max_grid_h/w (resolving downstream shape crashes during batch padding).
- Add test_scale_to_fit_video_before_padding to verify correct aspect-ratio preserving downscaling and successful padding setup.
- Fix a bug in add_extra_tokens_for_qwen3_omni where <|audio_pad|> (151675) was appended at the end of the interleaved video-audio sequence instead of <|audio_end|> (151670).
Tests
Checklist
Before submitting this PR, please make sure (put X in square brackets):
gemini-review label.