Skip to content

fix(java): expose updatedFragmentOffsets on Update operation for RewriteColumns#6748

Open
jerryjch wants to merge 4 commits into
lance-format:mainfrom
jerryjch:update-with-rewrite-columns-fix-follow-up
Open

fix(java): expose updatedFragmentOffsets on Update operation for RewriteColumns#6748
jerryjch wants to merge 4 commits into
lance-format:mainfrom
jerryjch:update-with-rewrite-columns-fix-follow-up

Conversation

@jerryjch

@jerryjch jerryjch commented May 12, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Follow-up to fix: propagate update_columns offsets and partial last_updated for RewriteColumns #6650 (fix: propagate update_columns offsets and partial last_updated for
    RewriteColumns). Required by lance-spark#418.
  • Update.java: adds Map<Long, byte[]> updatedFragmentOffsets field, 7-arg constructor,
    accessor updatedFragmentOffsets(), and Builder.updatedFragmentOffsets(...) setter.
    Defaults to Collections.emptyMap(). Values are portable RoaringBitmap bytes.
  • java/lance-jni/src/transaction.rs — two JNI directions updated: FromJava deserializes
    each byte[] value into a RoaringBitmap and sets updated_fragment_offsets on the Rust
    operation; IntoJava serializes each bitmap to byte[] and populates a HashMap<Long, byte[]>
    passed to the 7-arg Update constructor (previously the field was ignored and the 6-arg
    form was used).
  • Update.java equals / hashCode: deep-compares byte[] values by content; hashCode
    added per the Java contract.

Background

PR #6650 added updated_fragment_offsets on the Rust Operation::Update (proto field 9),
build_manifest partial refresh logic, and FragmentUpdateResult.getUpdatedRowOffsets().
Two gaps remained:

  1. The Java Update class had no field for these offsets and convert_to_rust_operation
    always set updated_fragment_offsets: None, so the lance-spark commit path
    (UpdateColumnsBackfillBatchWrite) had no way to pass offsets to Rust and the partial
    refresh in build_manifest could never activate from a JVM caller.

  2. convert_to_java_operation_inner still used the old 6-arg constructor signature for
    new_object. With the 6-arg constructor removed from Update.java (replaced by the
    7-arg form), any Rust→Java materialization of Operation::Update (e.g. reading back a
    transaction) would fail at runtime with NoSuchMethodError.

Implementation notes

  • Values are portable RoaringBitmap bytes (little-endian, spec-compliant). The JNI boundary
    stays O(bitmap size) rather than O(n matched rows).
  • with_local_frame(4, ..) per bitmap entry in IntoJava bounds local-ref growth on large
    offset maps. JMap was avoided inside the frame because it holds a JObject with the
    outer frame's lifetime, causing borrow-checker conflicts; call_method on the outer
    java_map reference is used instead.
  • The Vec<u8> buffer for each bitmap is allocated in Rust before entering the frame, so
    its lifetime is independent of JNI frame scope.
  • with_local_frame(8, ..) per iteration in FromJava bounds local-ref growth for large
    multi-fragment maps.
  • UpdatedFragmentOffsets added to the lance::dataset::transaction import.

Why the protobuf field alone is not enough

lance-spark commits by calling CommitBuilder.execute(transaction), which passes the Java
Transaction object to nativeCommitToDataset via JNI. The JNI handler calls
convert_to_rust_transactionconvert_to_rust_operation, which reflects on the Java
Update object to build the Rust Operation::Update struct. The protobuf field (field 9)
is only used when a Transaction is serialized as a proto blob; it has no effect on the
reflection-based JNI path unless the Java Update class exposes the field and the JNI
deserialization reads it.

Additional change

FragmentUpdateResult (from #6650) returned matched row offsets as an expanded long[] at
the executor JNI boundary. This PR also passes those offsets as portable RoaringBitmap bytes
so lance-spark can wire them through to Update.updatedFragmentOffsets() without an O(n rows)
expansion on the executor→driver path.

  • FragmentUpdateResult.getUpdatedRowOffsetBytes() — primary accessor; values are the same
    portable RoaringBitmap byte format as Update.updatedFragmentOffsets().
  • java/lance-jni/src/fragment.rsupdate_columns_with_offsets serializes
    matched_offsets once with RoaringBitmap::serialize_into; JNI constructs results with
    (FragmentMetadata, long[] fieldsModified, byte[] updatedRowOffsetBytes).
  • @Deprecated getUpdatedRowOffsets() — retained for backward compatibility; expands bytes via
    expandRowOffsetsFromBytes only when called (lazy O(n rows)).
  • @Deprecated 3-arg (FragmentMetadata, long[] fieldsModified, long[] updatedRowOffsets)
    constructor — encodes offsets via encodeRowOffsetsToBytes for source compat.
  • FragmentUpdateResultTest — round-trip bytes, deprecated constructor encode, and
    updateColumns() integration asserting matched offsets {0,1,2,3} on the test fixture.

Test plan

  • UpdateTest#testUpdatedFragmentOffsetsRoundTrip — commits an Update with a non-empty
    updatedFragmentOffsets map through CommitBuilder.execute (exercises the FromJava JNI
    path), reads the transaction back via Dataset.readTransaction() (exercises the IntoJava
    JNI path), and asserts the offsets match. Map value is hardcoded portable RoaringBitmap
    bytes encoding {1, 3, 5}; verified with assertArrayEquals after the round-trip.
  • FragmentUpdateResultTest — see Additional change.

Compatibility

  • Additive new API on Update — the updatedFragmentOffsets field did not exist in any
    prior release. The builder setter is optional and defaults to Collections.emptyMap(), so
    existing Update.builder()...build() call sites compile and behave identically.
  • Java — equals / hashCode: equals uses offsetMapsEqual to deep-compare byte[]
    values via Arrays.equals; hashCode is added per the Java contract.
  • JNI — constructor signature: the IntoJava new_object call is updated from the 6-arg to
    the 7-arg form in the same PR. Both files must ship together; within that atomic change
    there is no compatibility gap.
  • Rust / proto: no changes. The updated_fragment_offsets proto field and Rust struct
    field were already added in fix: propagate update_columns offsets and partial last_updated for RewriteColumns #6650.
  • FragmentUpdateResult@Deprecated long[] getter and constructor retained; new bytes getter is the supported path for new callers
    (see Additional change).

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions github-actions Bot added bug Something isn't working A-java Java bindings + JNI labels May 12, 2026
@jerryjch jerryjch force-pushed the update-with-rewrite-columns-fix-follow-up branch from c116d01 to 456e6dc Compare May 12, 2026 18:57
@jerryjch

Copy link
Copy Markdown
Contributor Author

cc @jackye1995 @hamersaw

@jerryjch jerryjch force-pushed the update-with-rewrite-columns-fix-follow-up branch from cbf1475 to 1884a7d Compare June 10, 2026 07:11
Comment thread java/lance-jni/src/transaction.rs Outdated
let mut iter = jmap.iter(env)?;
let mut offsets: HashMap<u64, RoaringBitmap> = HashMap::new();
env.with_local_frame(32, |env| {
while let Some((key, value)) = iter.next(env)? {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Java-to-Rust offset map import keeps all iterator-created JNI local references alive for the whole map. Large multi-fragment updates can exhaust the local reference table and fail the commit before Rust validation runs.

Comment thread java/lance-jni/src/transaction.rs Outdated
let frag_id =
env.call_method(&key, "longValue", "()J", &[])?.j()? as u64;
let buf: Vec<u8> = env.convert_byte_array(JByteArray::from(value))?;
let bitmap = RoaringBitmap::deserialize_from(buf.as_slice())?;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Malformed bitmap bytes are reported as I/O failures even though they come from Java API input. Callers can misclassify bad arguments as retryable storage errors.

Jing chen He and others added 3 commits June 19, 2026 22:45
…UpdateResult

Serialize matched_offsets once at the executor JNI boundary instead of
expanding to long[]. Add getUpdatedRowOffsetBytes(); deprecate
getUpdatedRowOffsets() with expandRowOffsetsFromBytes for lance-format#6650 compat.
Follow-up to lance-format#6650. Pairs with Update.updatedFragmentOffsets() for
lance-spark#418.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions Bot added the A-deps Dependency updates label Jun 20, 2026
@jerryjch

Copy link
Copy Markdown
Contributor Author

@Xuanwo Thanks for the review. I addressed your comments in commit fix(java): bound JNI locals and classify bad offset bitmap bytes as input errors. In addition, please check the Additional change added in commit perf(java): return matched row offsets as Roaring bytes from FragmentUpdateResult.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-deps Dependency updates A-java Java bindings + JNI bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants