perf: optimize array_remove for scalar needle#22390
Conversation
neilconway
left a comment
There was a problem hiding this comment.
Nice performance win!
| ); | ||
| } | ||
| }; | ||
| let original_data = list_array.values().to_data(); |
There was a problem hiding this comment.
This will be inefficient for sliced arrays.
There was a problem hiding this comment.
I now slice the values to the range actually referenced by the offsets.
That said, I wanted to understand your concern better: when a GenericListArray is sliced, values() returns the full underlying array, and to_data() on it wraps the existing buffer references into ArrayData without copying. So the main downside I could identify is that Capacities::Array(original_data.len()) over-estimates the pre-allocation for sliced inputs. Were you thinking of a different inefficiency, or is the over-allocation what you had in mind?
There was a problem hiding this comment.
The over-allocation was one part, but the bigger concern is calling the distinct kernel on the entire values buffer (see other comment).
| let list_array = array.as_list::<i64>(); | ||
| general_remove_with_scalar::<i64>(list_array, needle, arr_n) | ||
| } | ||
| array_type => exec_err!("array_remove does not support type '{array_type}'."), |
There was a problem hiding this comment.
This is called by more than just array_remove; can we improve the error message?
| for (i, keep) in eq_array.iter().enumerate() { | ||
| if keep == Some(false) && removed < max_removals { | ||
| if let Some(bs) = pending_batch_to_retain { | ||
| mutable.extend(0, start + bs, start + i); | ||
| copied += i - bs; | ||
| pending_batch_to_retain = None; | ||
| } | ||
| removed += 1; | ||
| } else if pending_batch_to_retain.is_none() { | ||
| pending_batch_to_retain = Some(i); | ||
| } | ||
| } |
There was a problem hiding this comment.
I wonder if it would be possible to iterate only over the "false" bits, e.g., by negating the buffer and looking at BooleanBuffer::set_indices.
There was a problem hiding this comment.
Great suggestion. Benchmarks show a ~20–40% improvement with this optimization.
| let mut copied = 0usize; | ||
| let mut pending_batch_to_retain: Option<usize> = None; | ||
| for (i, keep) in eq_array.iter().enumerate() { | ||
| if keep == Some(false) && removed < max_removals { |
There was a problem hiding this comment.
Can we break from the loop once we hit max_removals?
There was a problem hiding this comment.
Good point. now break early once max_removals is reached.
| // Iterate only over the positions that need removal using set_indices, | ||
| // which is more efficient than scanning every bit. |
There was a problem hiding this comment.
Might be worth elaborating that the win here is mostly because we expect the # of values-to-remove is a lot smaller than the total array size, which it usually (but not always) will be.
| let keep_mask = | ||
| arrow_ord::cmp::distinct(list_array.values(), &Scalar::new(Arc::clone(needle)))?; |
There was a problem hiding this comment.
This will call the distinct kernel on all the elements in the value buffer, not just the ones that are visible in a sliced array.
| ); | ||
| } | ||
| }; | ||
| let original_data = list_array.values().to_data(); |
There was a problem hiding this comment.
The over-allocation was one part, but the bigger concern is calling the distinct kernel on the entire values buffer (see other comment).
neilconway
left a comment
There was a problem hiding this comment.
Awesome, nice work on this! lgtm. Maybe update the benchmark numbers in the PR description if you get a chance.
|
Thanks @neilconway! Really appreciate the thorough and speedy review 🙏 Great suggestion — running benchmarks now and will update the pr description with fresh numbers shortly. |
Which issue does this PR close?
Rationale for this change
Similar to #22387 (array_replace scalar optimization)
array_remove/array_remove_n/array_remove_allperform element-wise comparison by invokingcompare_element_to_listagainst each row's sub-array individually. When the needle is a scalar, this can be optimized by performing a single vectorizeddistinctcomparison over the entire flattened values buffer.What changes are included in this PR?
general_remove_with_scalar) that usesarrow_ord::cmp::distinctwithScalarwrapper for a single bulk comparison pass over the flat values buffer.nvalues, and LargeList type coverage.Benchmarks
Are these changes tested?
Yes, existing and new SLT edge-case tests in
array_remove.slt.Are there any user-facing changes?
No.