[10017] Slice REE values along with run_ends#10094

Closed
Rich-T-kid wants to merge 2 commits into
apache:mainfrom
Rich-T-kid:rich-T-kid/REE-fix-slice-ends

Conversation

@Rich-T-kid Rich-T-kid commented Jun 9, 2026

Contributor

Which issue does this PR close?

Closes #10017.

Rationale for this change

RunArray::slice() kept the full physical run_ends and values buffers, so downstream operations (e.g. length(), substring()) would iterate over physical runs outside the logical slice range. The wasted work grows with the number of physical runs that fall outside the slice, so it is largest for a narrow slice of a long array.

What changes are included in this PR?

  • RunArray::slice() now trims values and run_ends to only the physical runs that overlap the logical slice range.
  • Trimmed run_ends are normalized: each entry has the logical offset subtracted and is capped at logical_length, so the result is self-describing. Double-slicing is safe.
  • Added early-return guard in logical_nulls() for zero-length arrays.
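The trimming and normalization described in the list above can be sketched on plain vectors. This is only an illustration under stated assumptions, not the code from this PR: `trim_runs` is a hypothetical helper, run ends are stored as `i32`, and the input is assumed to be unsliced (logical offset 0), which is what makes repeated slicing of the normalized result safe.

```rust
/// Hypothetical helper (not the arrow-rs API): trim a run-end encoded array
/// to the logical window [offset, offset + len) and normalize the run ends
/// so that they are relative to the start of that window.
fn trim_runs<T: Clone>(
    run_ends: &[i32],
    values: &[T],
    offset: usize,
    len: usize,
) -> (Vec<i32>, Vec<T>) {
    assert_eq!(run_ends.len(), values.len());
    // Early return for zero-length slices: no physical run is needed.
    if len == 0 {
        return (Vec::new(), Vec::new());
    }
    let end = offset + len;
    let total = run_ends.last().map_or(0, |&e| e as usize);
    assert!(end <= total, "slice out of bounds");

    // First physical run that overlaps the window: its end lies past `offset`.
    let first = run_ends.partition_point(|&e| (e as usize) <= offset);
    // Last physical run that overlaps the window: first run whose end reaches `end`.
    let last = run_ends.partition_point(|&e| (e as usize) < end);

    // Subtract the logical offset and cap each entry at the logical length.
    let new_ends: Vec<i32> = run_ends[first..=last]
        .iter()
        .map(|&e| (e as usize - offset).min(len) as i32)
        .collect();
    (new_ends, values[first..=last].to_vec())
}
```

For run_ends [2, 5, 8, 10] and values ["a", "b", "a", "c"], calling `trim_runs` with offset 3 and length 5 keeps only the two overlapping runs, and the result can be passed back into `trim_runs` for a nested slice without tracking any extra offset.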

Are these changes tested?

Yes

  • Added unit tests in arrow-array (mod slice_trim_values) covering single-run slices, multi-boundary slices, and repeated nested slices.
  • Added a test in arrow-string demonstrating that length() on a sliced REE touches only the trimmed physical runs.

Are there any user-facing changes?

No.

@github-actions github-actions Bot added the arrow Changes to the arrow crate label Jun 9, 2026
@Rich-T-kid Rich-T-kid marked this pull request as ready for review June 9, 2026 15:15
@Rich-T-kid

Contributor Author

As referenced in #9959 (comment), this could have some (minor) performance implications.

@Jefffrey

Contributor

I'm not sure we can go with this approach, since the contract for slicing states that it is zero-copy 🤔

Returns a zero-copy slice of this array with the indicated offset and length.

https://docs.rs/arrow/latest/arrow/array/trait.Array.html#tymethod.slice

(though I'm not sure how strict this constraint is)
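To make the zero-copy concern concrete, here is a minimal sketch of what a zero-copy slice looks like for a run-end encoded layout. `RunView` is a hypothetical stand-in, not the arrow-rs `RunArray`; it assumes both buffers are reference-counted and that slicing only adjusts a logical window over them.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a run-end encoded array (not the arrow-rs type).
#[derive(Clone)]
struct RunView {
    run_ends: Arc<[i32]>,
    values: Arc<[&'static str]>,
    offset: usize, // logical offset into the shared buffers
    len: usize,    // logical length of this view
}

impl RunView {
    fn new(run_ends: Vec<i32>, values: Vec<&'static str>) -> Self {
        assert_eq!(run_ends.len(), values.len());
        let len = run_ends.last().map_or(0, |&e| e as usize);
        Self { run_ends: run_ends.into(), values: values.into(), offset: 0, len }
    }

    /// Zero-copy slice: both buffers are shared, only the window changes.
    fn slice(&self, offset: usize, len: usize) -> Self {
        assert!(offset + len <= self.len, "slice out of bounds");
        Self {
            run_ends: Arc::clone(&self.run_ends),
            values: Arc::clone(&self.values),
            offset: self.offset + offset,
            len,
        }
    }
}
```

In this sketch a slice costs two reference-count increments regardless of array size. Rewriting run_ends, as this PR does, would instead allocate a new buffer on every slice, which is the tension with the documented contract.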

@Rich-T-kid

Contributor Author

Yeah, I agree; I wanted to give it an attempt. The main issue is that if the run_ends aren't rewritten, the logical array that is expressed is incorrect. For example:

// pseudocode; represents: [a, a, b, b, b, a, a, a, c, c]
run_array = { run_ends: [2, 5, 8, 10], values: ["a", "b", "a", "c"] }
let sliced = run_array.slice(3, 5);
// represents: [b, b, a, a, a]

But in both arrays the run_ends are [2,5,8,10] and the values are ["a","b","a","c"]. The only differences between the two are logical_len and logical_offset. We also can't just naively cut up the values array, because the run_ends buffer would then misrepresent the correct logical form.
I think it's fine to leave it as it is: the swapping values pattern is somewhat uncommon and the performance boost should be negligible.
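The point about run_ends and values having to stay in step can be illustrated with a small lookup sketch. The functions `logical_value` and `expand` are hypothetical helpers on plain slices, not arrow-rs APIs; they assume a logical index is resolved by adding the slice offset and binary-searching run_ends for the first run end strictly greater than the result.

```rust
/// Hypothetical helper: value at logical index `i` of a slice that starts
/// at logical `offset` in the unsliced array. Returns `None` when the
/// physical run index has no matching entry in `values`.
fn logical_value<'a>(
    run_ends: &[i32],
    values: &[&'a str],
    offset: usize,
    i: usize,
) -> Option<&'a str> {
    let absolute = offset + i;
    // Physical run containing `absolute`: first run end strictly greater than it.
    let run = run_ends.partition_point(|&e| (e as usize) <= absolute);
    values.get(run).copied()
}

/// Expand a slice of length `len` into its logical elements.
fn expand<'a>(
    run_ends: &[i32],
    values: &[&'a str],
    offset: usize,
    len: usize,
) -> Vec<Option<&'a str>> {
    (0..len)
        .map(|i| logical_value(run_ends, values, offset, i))
        .collect()
}
```

With the untouched buffers, `expand(&[2, 5, 8, 10], &["a", "b", "a", "c"], 3, 5)` yields b, b, a, a, a. If only values is cut down to ["b", "a"] while run_ends stays [2, 5, 8, 10], the same call resolves the first two elements to "a" and finds no value for the rest, which is the misrepresentation described above.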

@Rich-T-kid Rich-T-kid closed this Jun 10, 2026

Development

Successfully merging this pull request may close these issues.

RunArray::slice() should align run_ends and values with the logical slice

2 participants