Skip to content

Incremental insertion to existing graph#167

Merged
sam-herman merged 1 commit intoopensearch-project:mainfrom
sam-herman:incremental-insertion-to-existing-graph
Oct 16, 2025
Merged

Incremental insertion to existing graph#167
sam-herman merged 1 commit intoopensearch-project:mainfrom
sam-herman:incremental-insertion-to-existing-graph

Conversation

@sam-herman
Copy link
Copy Markdown
Collaborator

@sam-herman sam-herman commented Sep 23, 2025

Description

This change is the first part of leading segment merge epic.
It leverages previously created leading segments to facilitate incremental insertion.
Also in this change some important bug fixes.

The following items changed/added/fixed:

  1. Persistent Ordinal To docID Mapping - we previously previously relied on complete graph reconstruction on every merge. However this blocks the way to have a flexible merging technique that re-uses the ordinal in existing graphs, which is the case in incremental insertion of new nodes to existing graphs.
  2. Incremental Insertion With Leading Segment - during merges, when no quantization or deletions are present, we will leverage the existing graph from the leading segment and incrementally insert the new vectors from other graphs into it.
  3. Remove Redundant FlatVectorFormat - Removing the currently redundant Lucene FlatVectorFormat that stores the FP vectors and only use the inlined vectors instead.
  4. Remove Redundant DocValuesFormat - Removing the redundant doc values storage for flat vectors.
  5. Fix for sorted indices - the docId to ordinal mapping allows us with easily fixing the mapping between the jVector graph nodes to the docIDs in the cases where the Lucene index is sorted by a sorted field.
  6. Fix for missing fields - The graph construction works when the vector field is sometime missing missing in the index
  7. Remove jVector Codec - Removed the previously specialized jVector codec and use for testing the same per field codec used otherwise. This provides better test coverage for the production code pathways.

Note: this change also has a dependency on the jVector change that enables reloading of OnDiskGraphIndex back into a mutable OnHeaphGraph

Testing

Run:
python create_and_test_large_index.py --batch-size 2000 --force-merge-frequency 2000 --num-vectors 50000 --csv-output merge_times.csv

Benchmark for merges before incremental merges:
Screenshot 2025-09-24 at 12 39 05 PM

Benchmark for merges after incremental merges:
Screenshot 2025-09-24 at 12 35 39 PM

Before and after unified view:

merge_times_comparison

Redundant storage is down 3x less than earlier due to elimination of lucene FlatVectorFormat and DocValues for FP vectors:
image (9)

Related Issues

Resolves #133

Check List

  • [ x] New functionality includes testing.
  • [ x] New functionality has been documented.
  • API changes companion pull request created.
  • [x ] Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

KNNCounter.KNN_QUANTIZATION_TRAINING_TIME.add(trainingTime);
log.info("Encoding and building PQ vectors for field {} for {} vectors", fieldName, randomAccessVectorValues.size());
PQVectors pqVectors = (PQVectors) pq.encodeAll(randomAccessVectorValues, SIMD_POOL);
// PQVectors pqVectors = pq.encodeAll(randomAccessVectorValues, SIMD_POOL);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove

tjake
tjake previously requested changes Sep 23, 2025
Copy link
Copy Markdown
Collaborator

@tjake tjake left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good, There are a couple things from #145 that need to be added back, would you like me todo that?

I left some other comments as well.

Comment on lines +1005 to +1010
// The GraphIndexBuilder can add nodes to an existing index
forkJoinTask.add(PhysicalCoreExecutor.pool().submit(() -> builder.addGraphNode(nodeId, vector)));
}
for (ForkJoinTask<?> task : forkJoinTask) {
task.join();
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should use SIMD_POOL and I would suggest using CompletableFuture.supplyAsync(() -> builder.addGraphNode(nodeId, vector), SIMD_POOL)

So then you can use CompletableFutures.allOf(tasks).join()

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should use SIMD_POOL and I would suggest using CompletableFuture.supplyAsync(() -> builder.addGraphNode(nodeId, vector), SIMD_POOL)

Yeah, in-fact this is temp code I need to move into the jVector as well as I had to do with the PQVectors when you provide the newToOld ord mapping.
Totally agree, we should provide the SIMD_POOL

So then you can use CompletableFutures.allOf(tasks).join()

+1 will make the changes

*
* Which means that we also need to persist this mapping to disk to be available across merges.
*/
public static class JVectorLuceneDocMap {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be its own file?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'll move it over to separate file and add specific tests around it as well.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

this.ordinalsToDocIds = new int[ordinalsToDocIds.length];
System.arraycopy(ordinalsToDocIds, 0, this.ordinalsToDocIds, 0, ordinalsToDocIds.length);
final int maxDocId = Arrays.stream(ordinalsToDocIds).max().getAsInt();
final int maxDocs = maxDocId + 1;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check for overflow

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread gradle.properties Outdated
version=1.0.0
systemProp.bwc.version=1.3.4
jvector_version=4.0.0-rc.2
jvector_version=4.0.0-rc.3-SNAPSHOT
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4.0.0-rc.3 was released can you switch to this?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, will update

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tjake also keep in mind we have a dependency on this change , so we would have to keep it "snapshot" for local testing until the change is merged.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update to 4.0.0-rc.4-SNAPSHOT until it's becoming available following the JV change

@sam-herman
Copy link
Copy Markdown
Collaborator Author

Overall looks good, There are a couple things from #145 that need to be added back, would you like me todo that?

I left some other comments as well.

Thank you for reviewing @tjake! Sure, if you like to add those, or maybe let me know what they are I don't mind moving them over as well and can also run the tests and stuff with the pending JV change

private final JVectorWriter.JVectorLuceneDocMap jVectorLuceneDocMap;

public JVectorFloatVectorValues(OnDiskGraphIndex onDiskGraphIndex, VectorSimilarityFunction similarityFunction) throws IOException {
public JVectorFloatVectorValues(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might be helpful to add some documentation for public methods.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'll try to get all of those documented. Many of those are simply a passthrough to the Lucene methods with some delegate functionality.

@tjake
Copy link
Copy Markdown
Collaborator

tjake commented Sep 24, 2025

Sure, if you like to add those, or maybe let me know what they are I don't mind moving them over

I think the main change to add is this: https://github.com/opensearch-project/opensearch-jvector/pull/145/files#diff-6386363107f6fd4dbe6275645348aa24b637b0b23ca60fbe039e1301122ee5e2R239-R240

This avoids blocking the SIMD threads with calculating the partial sums.

@sam-herman
Copy link
Copy Markdown
Collaborator Author

Sure, if you like to add those, or maybe let me know what they are I don't mind moving them over

I think the main change to add is this: https://github.com/opensearch-project/opensearch-jvector/pull/145/files#diff-6386363107f6fd4dbe6275645348aa24b637b0b23ca60fbe039e1301122ee5e2R239-R240

This avoids blocking the SIMD threads with calculating the partial sums.

Gotcha, will make sure to add it as well! thank you @tjake !

@akash-shankaran
Copy link
Copy Markdown
Collaborator

@sam-herman think you need to update the changelog and sign your commits to get validation to pass?

indexInputDelegate.readBytes(internalFloatBuffer, 0, Float.BYTES);
FloatBuffer buffer = ByteBuffer.wrap(internalFloatBuffer).asFloatBuffer();
return buffer.get();
return buffer.get();*/
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove?

@sam-herman
Copy link
Copy Markdown
Collaborator Author

sam-herman commented Oct 16, 2025

@sam-herman think you need to update the changelog and sign your commits to get validation to pass?

All the commits are signed, not sure what's going on with that check. changelog updated.

Copy link
Copy Markdown
Collaborator

@jimdickinson jimdickinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@sam-herman sam-herman dismissed tjake’s stale review October 16, 2025 16:41

Changes have already been applied, had an offline sync with tjake regarding the changes.

@sam-herman sam-herman force-pushed the incremental-insertion-to-existing-graph branch from 4401286 to 1262bcf Compare October 16, 2025 17:09
persist neighbors cache
add support for sorted index
searcher fixes for resolving the node -> docId
add incremental merge construction with leading segment
move additional tests to internal test for transparency
update documentation
add readme pictures
remove doc values by default
separate docIdtoOrdMap class and add tests

Signed-off-by: Samuel Herman <sherman8915@gmail.com>
@sam-herman sam-herman force-pushed the incremental-insertion-to-existing-graph branch from 1262bcf to 85cee15 Compare October 16, 2025 17:11
@sam-herman sam-herman merged commit 998e932 into opensearch-project:main Oct 16, 2025
10 of 36 checks passed
@sam-herman sam-herman deleted the incremental-insertion-to-existing-graph branch October 16, 2025 17:15
akash-shankaran pushed a commit to akash-shankaran/opensearch-jvector that referenced this pull request Oct 16, 2025
…nsearch-project#167)

persist neighbors cache
add support for sorted index
searcher fixes for resolving the node -> docId
add incremental merge construction with leading segment
move additional tests to internal test for transparency
update documentation
add readme pictures
remove doc values by default
separate docIdtoOrdMap class and add tests

Signed-off-by: Samuel Herman <sherman8915@gmail.com>
sam-herman added a commit that referenced this pull request Oct 16, 2025
* changes to perform JVector 3.2 upgrade

Signed-off-by: Akash Shankaran <akash.shankaran@ibm.com>

* fix failing workflow, and spotless apply

Signed-off-by: Akash Shankaran <akash.shankaran@ibm.com>

* few more package upgrades for JDK24

Signed-off-by: Akash Shankaran <akash.shankaran@ibm.com>

* add Akash as maintainer (#174)

* add Akash as maintainer
* change to IBM

---------

Signed-off-by: Samuel Herman <sherman8915@gmail.com>

* update akash in codeowners (#176)

Signed-off-by: Samuel Herman <sherman8915@gmail.com>

* Onboarding new maven snapshots publishing to s3 (jVector) (#178)

Signed-off-by: Peter Zhu <zhujiaxi@amazon.com>

* increment version of jVector to support incremental construction (#167)

persist neighbors cache
add support for sorted index
searcher fixes for resolving the node -> docId
add incremental merge construction with leading segment
move additional tests to internal test for transparency
update documentation
add readme pictures
remove doc values by default
separate docIdtoOrdMap class and add tests

Signed-off-by: Samuel Herman <sherman8915@gmail.com>

* update changelog

Signed-off-by: Akash Shankaran <akash.shankaran@ibm.com>

---------

Signed-off-by: Akash Shankaran <akash.shankaran@ibm.com>
Signed-off-by: Samuel Herman <sherman8915@gmail.com>
Signed-off-by: Peter Zhu <zhujiaxi@amazon.com>
Co-authored-by: sam-herman <97131656+sam-herman@users.noreply.github.com>
Co-authored-by: Peter Zhu <zhujiaxi@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE] Leading Segment Merge

4 participants