
fix(python): handle empty filtered shards in ShardedBatchSampler #7366

Draft
xushiyan wants to merge 1 commit into lance-format:main from xushiyan:fix/sharded-batch-sampler-empty-filtered-shard

@xushiyan xushiyan commented Jun 18, 2026

Summary

  • Fix ShardedBatchSampler._shard_scan so filtered scans with zero matches or empty per-rank shards yield an empty stream instead of raising.
  • Preserve global round-robin sharding across filtered scanner batches by carrying rows_to_skip from the full batch length, not a sliced view.
  • Skip batch.take([]) when a rank owns no rows in a batch (PyArrow has no array_take(int64, null) kernel).
  • Add regression tests for empty shard, zero-match filter, and cross-fragment carryover.

Problem

When ShardedBatchSampler is called with a filter, the filtered path can crash in two cases:

  1. Global zero-match: the filter matches no rows, so no batches accumulate and pa.Table.from_batches([]) raises ValueError.
  2. Empty per-rank shard: a sparse filter with world_size > 1 leaves some ranks with zero rows, so batch.take([]) raises ArrowNotImplementedError.

The unfiltered path already handles emptiness; this aligns the filtered path with that contract.

Test plan

  • test_sharded_batch_sampler_empty_filtered_shard (randomize on/off)
  • test_sharded_batch_sampler_filtered_carryover_across_fragments
  • uv run pytest python/tests/test_sampler.py (full sampler suite)

Found during distributed training work with sparse segment filters.

github-actions bot added the A-python (Python bindings) and bug (Something isn't working) labels on Jun 18, 2026

ShardedBatchSampler._shard_scan crashed on the filtered path instead of
yielding an empty stream in two cases:

- Global zero-match: nothing accumulates, so the final
  pa.Table.from_batches([]) raised ValueError.
- Empty per-rank shard: a sparse filter with world_size > 1 leaves some
  ranks with no rows after round-robin, so batch.take([]) built a
  null-typed index array and raised ArrowNotImplementedError.

Compute take indices against the full batch (range(rows_to_skip, n, N))
and carry rows_to_skip from the full batch length so the round-robin
offset stays correct when a filtered scan splits matches across many
small batches/fragments; skip batches where this rank owns no row; and
make the final flush conditional. Empty shards now yield an empty
iterator, matching the unfiltered _sample_all contract.
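
The carryover described above can be sketched in pure Python, with lists standing in for Arrow batches. This is a minimal illustration and not the code in this PR: the function name `shard_round_robin` and the modular update of `rows_to_skip` are assumptions made for the example.

```python
from typing import Iterable, Iterator, List, Sequence


def shard_round_robin(
    batches: Iterable[Sequence[int]], rank: int, world_size: int
) -> Iterator[List[int]]:
    """Yield this rank's rows from each batch under a global round-robin."""
    # Number of rows to pass over before this rank's next row.
    rows_to_skip = rank
    for batch in batches:
        n = len(batch)
        # Indices are computed against the full batch, not a sliced view.
        indices = range(rows_to_skip, n, world_size)
        # Carry the offset forward using the full batch length, so the
        # round-robin stays aligned across small or empty batches.
        rows_to_skip = (rows_to_skip - n) % world_size
        if len(indices) == 0:
            # This rank owns no row in this batch: skip it rather than take([]).
            continue
        yield [batch[i] for i in indices]


# Matching rows 0..9 arrive split unevenly across batches (and fragments).
batches = [[0, 1, 2], [3], [4, 5, 6, 7], [], [8, 9]]
shards = [list(shard_round_robin(batches, rank, 3)) for rank in range(3)]
```

Each rank still receives exactly the rows whose global position is congruent to its rank modulo `world_size`, however the scanner splits the matches, and a rank with nothing to yield produces an empty iterator.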

Adds regression coverage for the empty-shard, zero-match, and
cross-fragment carryover cases (parametrized over randomize).

xushiyan force-pushed the fix/sharded-batch-sampler-empty-filtered-shard branch from 624fcb7 to de13330 on June 18, 2026 20:22
xushiyan marked this pull request as draft on June 18, 2026 20:34