fix(java): expose updatedFragmentOffsets on Update operation for RewriteColumns#6748
fix(java): expose updatedFragmentOffsets on Update operation for RewriteColumns#6748jerryjch wants to merge 4 commits into
Conversation
c116d01 to
456e6dc
Compare
ed07fea to
cbf1475
Compare
…iteColumns commits
cbf1475 to
1884a7d
Compare
| let mut iter = jmap.iter(env)?; | ||
| let mut offsets: HashMap<u64, RoaringBitmap> = HashMap::new(); | ||
| env.with_local_frame(32, |env| { | ||
| while let Some((key, value)) = iter.next(env)? { |
There was a problem hiding this comment.
The Java-to-Rust offset map import keeps all iterator-created JNI local references alive for the whole map. Large multi-fragment updates can exhaust the local reference table and fail the commit before Rust validation runs.
| let frag_id = | ||
| env.call_method(&key, "longValue", "()J", &[])?.j()? as u64; | ||
| let buf: Vec<u8> = env.convert_byte_array(JByteArray::from(value))?; | ||
| let bitmap = RoaringBitmap::deserialize_from(buf.as_slice())?; |
There was a problem hiding this comment.
Malformed bitmap bytes are reported as I/O failures even though they come from Java API input. Callers can misclassify bad arguments as retryable storage errors.
…UpdateResult Serialize matched_offsets once at the executor JNI boundary instead of expanding to long[]. Add getUpdatedRowOffsetBytes(); deprecate getUpdatedRowOffsets() with expandRowOffsetsFromBytes for lance-format#6650 compat. Follow-up to lance-format#6650. Pairs with Update.updatedFragmentOffsets() for lance-spark#418. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
@Xuanwo Thanks for the review. I addressed your comments in commit |
Summary
RewriteColumns). Required by lance-spark#418.
Update.java: addsMap<Long, byte[]> updatedFragmentOffsetsfield, 7-arg constructor,accessor
updatedFragmentOffsets(), andBuilder.updatedFragmentOffsets(...)setter.Defaults to
Collections.emptyMap(). Values are portable RoaringBitmap bytes.java/lance-jni/src/transaction.rs— two JNI directions updated: FromJava deserializeseach
byte[]value into aRoaringBitmapand setsupdated_fragment_offsetson the Rustoperation; IntoJava serializes each bitmap to
byte[]and populates aHashMap<Long, byte[]>passed to the 7-arg
Updateconstructor (previously the field was ignored and the 6-argform was used).
Update.javaequals/hashCode: deep-comparesbyte[]values by content;hashCodeadded per the Java contract.
Background
PR #6650 added
updated_fragment_offsetson the RustOperation::Update(proto field 9),build_manifestpartial refresh logic, andFragmentUpdateResult.getUpdatedRowOffsets().Two gaps remained:
The Java
Updateclass had no field for these offsets andconvert_to_rust_operationalways set
updated_fragment_offsets: None, so the lance-spark commit path(UpdateColumnsBackfillBatchWrite) had no way to pass offsets to Rust and the partial
refresh in
build_manifestcould never activate from a JVM caller.convert_to_java_operation_innerstill used the old 6-arg constructor signature fornew_object. With the 6-arg constructor removed fromUpdate.java(replaced by the7-arg form), any Rust→Java materialization of
Operation::Update(e.g. reading back atransaction) would fail at runtime with
NoSuchMethodError.Implementation notes
stays O(bitmap size) rather than O(n matched rows).
with_local_frame(4, ..)per bitmap entry in IntoJava bounds local-ref growth on largeoffset maps.
JMapwas avoided inside the frame because it holds aJObjectwith theouter frame's lifetime, causing borrow-checker conflicts;
call_methodon the outerjava_mapreference is used instead.Vec<u8>buffer for each bitmap is allocated in Rust before entering the frame, soits lifetime is independent of JNI frame scope.
with_local_frame(8, ..)per iteration in FromJava bounds local-ref growth for largemulti-fragment maps.
UpdatedFragmentOffsetsadded to thelance::dataset::transactionimport.Why the protobuf field alone is not enough
lance-spark commits by calling
CommitBuilder.execute(transaction), which passes the JavaTransactionobject tonativeCommitToDatasetvia JNI. The JNI handler callsconvert_to_rust_transaction→convert_to_rust_operation, which reflects on the JavaUpdateobject to build the RustOperation::Updatestruct. The protobuf field (field 9)is only used when a Transaction is serialized as a proto blob; it has no effect on the
reflection-based JNI path unless the Java
Updateclass exposes the field and the JNIdeserialization reads it.
Additional change
FragmentUpdateResult(from #6650) returned matched row offsets as an expandedlong[]atthe executor JNI boundary. This PR also passes those offsets as portable RoaringBitmap bytes
so lance-spark can wire them through to
Update.updatedFragmentOffsets()without an O(n rows)expansion on the executor→driver path.
FragmentUpdateResult.getUpdatedRowOffsetBytes()— primary accessor; values are the sameportable RoaringBitmap byte format as
Update.updatedFragmentOffsets().java/lance-jni/src/fragment.rs—update_columns_with_offsetsserializesmatched_offsetsonce withRoaringBitmap::serialize_into; JNI constructs results with(FragmentMetadata, long[] fieldsModified, byte[] updatedRowOffsetBytes).@Deprecated getUpdatedRowOffsets()— retained for backward compatibility; expands bytes viaexpandRowOffsetsFromBytesonly when called (lazy O(n rows)).@Deprecated3-arg(FragmentMetadata, long[] fieldsModified, long[] updatedRowOffsets)constructor — encodes offsets via
encodeRowOffsetsToBytesfor source compat.FragmentUpdateResultTest— round-trip bytes, deprecated constructor encode, andupdateColumns()integration asserting matched offsets{0,1,2,3}on the test fixture.Test plan
UpdateTest#testUpdatedFragmentOffsetsRoundTrip— commits anUpdatewith a non-emptyupdatedFragmentOffsetsmap throughCommitBuilder.execute(exercises the FromJava JNIpath), reads the transaction back via
Dataset.readTransaction()(exercises the IntoJavaJNI path), and asserts the offsets match. Map value is hardcoded portable RoaringBitmap
bytes encoding {1, 3, 5}; verified with
assertArrayEqualsafter the round-trip.FragmentUpdateResultTest— see Additional change.Compatibility
Update— theupdatedFragmentOffsetsfield did not exist in anyprior release. The builder setter is optional and defaults to
Collections.emptyMap(), soexisting
Update.builder()...build()call sites compile and behave identically.equals/hashCode:equalsusesoffsetMapsEqualto deep-comparebyte[]values via
Arrays.equals;hashCodeis added per the Java contract.new_objectcall is updated from the 6-arg tothe 7-arg form in the same PR. Both files must ship together; within that atomic change
there is no compatibility gap.
updated_fragment_offsetsproto field and Rust structfield were already added in fix: propagate update_columns offsets and partial last_updated for RewriteColumns #6650.
FragmentUpdateResult—@Deprecatedlong[] getter and constructor retained; new bytes getter is the supported path for new callers(see Additional change).