Incremental insertion to existing graph by sam-herman · Pull Request #167 · opensearch-project/opensearch-jvector

sam-herman · 2025-09-23T03:34:49Z

Description

This change is the first part of leading segment merge epic.
It leverages previously created leading segments to facilitate incremental insertion.
Also in this change some important bug fixes.

The following items changed/added/fixed:

Persistent Ordinal To docID Mapping - we previously previously relied on complete graph reconstruction on every merge. However this blocks the way to have a flexible merging technique that re-uses the ordinal in existing graphs, which is the case in incremental insertion of new nodes to existing graphs.
Incremental Insertion With Leading Segment - during merges, when no quantization or deletions are present, we will leverage the existing graph from the leading segment and incrementally insert the new vectors from other graphs into it.
Remove Redundant FlatVectorFormat - Removing the currently redundant Lucene FlatVectorFormat that stores the FP vectors and only use the inlined vectors instead.
Remove Redundant DocValuesFormat - Removing the redundant doc values storage for flat vectors.
Fix for sorted indices - the docId to ordinal mapping allows us with easily fixing the mapping between the jVector graph nodes to the docIDs in the cases where the Lucene index is sorted by a sorted field.
Fix for missing fields - The graph construction works when the vector field is sometime missing missing in the index
Remove jVector Codec - Removed the previously specialized jVector codec and use for testing the same per field codec used otherwise. This provides better test coverage for the production code pathways.

Note: this change also has a dependency on the jVector change that enables reloading of OnDiskGraphIndex back into a mutable OnHeaphGraph

Testing

Run:
python create_and_test_large_index.py --batch-size 2000 --force-merge-frequency 2000 --num-vectors 50000 --csv-output merge_times.csv

Benchmark for merges before incremental merges:

Benchmark for merges after incremental merges:

Before and after unified view:

Redundant storage is down 3x less than earlier due to elimination of lucene FlatVectorFormat and DocValues for FP vectors:

Related Issues

Resolves #133

Check List

[ x] New functionality includes testing.
[ x] New functionality has been documented.
API changes companion pull request created.
[x ] Commits are signed per the DCO using --signoff.
Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

tjake · 2025-09-23T14:08:02Z

        KNNCounter.KNN_QUANTIZATION_TRAINING_TIME.add(trainingTime);
        log.info("Encoding and building PQ vectors for field {} for {} vectors", fieldName, randomAccessVectorValues.size());
-        PQVectors pqVectors = (PQVectors) pq.encodeAll(randomAccessVectorValues, SIMD_POOL);
+        // PQVectors pqVectors = pq.encodeAll(randomAccessVectorValues, SIMD_POOL);


tjake

Overall looks good, There are a couple things from #145 that need to be added back, would you like me todo that?

I left some other comments as well.

tjake · 2025-09-23T14:15:45Z

+                // The GraphIndexBuilder can add nodes to an existing index
+                forkJoinTask.add(PhysicalCoreExecutor.pool().submit(() -> builder.addGraphNode(nodeId, vector)));
+            }
+            for (ForkJoinTask<?> task : forkJoinTask) {
+                task.join();
+            }


This should use SIMD_POOL and I would suggest using CompletableFuture.supplyAsync(() -> builder.addGraphNode(nodeId, vector), SIMD_POOL)

So then you can use CompletableFutures.allOf(tasks).join()

This should use SIMD_POOL and I would suggest using CompletableFuture.supplyAsync(() -> builder.addGraphNode(nodeId, vector), SIMD_POOL)

Yeah, in-fact this is temp code I need to move into the jVector as well as I had to do with the PQVectors when you provide the newToOld ord mapping.
Totally agree, we should provide the SIMD_POOL

So then you can use CompletableFutures.allOf(tasks).join()

+1 will make the changes

tjake · 2025-09-23T14:16:31Z

+     *
+     * Which means that we also need to persist this mapping to disk to be available across merges.
+     */
+    public static class JVectorLuceneDocMap {


Should this be its own file?

Yeah, I'll move it over to separate file and add specific tests around it as well.

tjake · 2025-09-23T14:18:15Z

+            this.ordinalsToDocIds = new int[ordinalsToDocIds.length];
+            System.arraycopy(ordinalsToDocIds, 0, this.ordinalsToDocIds, 0, ordinalsToDocIds.length);
+            final int maxDocId = Arrays.stream(ordinalsToDocIds).max().getAsInt();
+            final int maxDocs = maxDocId + 1;


Check for overflow

tjake · 2025-09-23T14:23:16Z

 version=1.0.0
 systemProp.bwc.version=1.3.4
-jvector_version=4.0.0-rc.2
+jvector_version=4.0.0-rc.3-SNAPSHOT


4.0.0-rc.3 was released can you switch to this?

sure, will update

@tjake also keep in mind we have a dependency on this change , so we would have to keep it "snapshot" for local testing until the change is merged.

I'll update to 4.0.0-rc.4-SNAPSHOT until it's becoming available following the JV change

sam-herman · 2025-09-23T16:55:42Z

Overall looks good, There are a couple things from #145 that need to be added back, would you like me todo that?

I left some other comments as well.

Thank you for reviewing @tjake! Sure, if you like to add those, or maybe let me know what they are I don't mind moving them over as well and can also run the tests and stuff with the pending JV change

akash-shankaran · 2025-09-23T18:15:03Z

+    private final JVectorWriter.JVectorLuceneDocMap jVectorLuceneDocMap;

-    public JVectorFloatVectorValues(OnDiskGraphIndex onDiskGraphIndex, VectorSimilarityFunction similarityFunction) throws IOException {
+    public JVectorFloatVectorValues(


it might be helpful to add some documentation for public methods.

Yeah, I'll try to get all of those documented. Many of those are simply a passthrough to the Lucene methods with some delegate functionality.

tjake · 2025-09-24T14:06:21Z

Sure, if you like to add those, or maybe let me know what they are I don't mind moving them over

I think the main change to add is this: https://github.com/opensearch-project/opensearch-jvector/pull/145/files#diff-6386363107f6fd4dbe6275645348aa24b637b0b23ca60fbe039e1301122ee5e2R239-R240

This avoids blocking the SIMD threads with calculating the partial sums.

sam-herman · 2025-09-24T19:52:55Z

Sure, if you like to add those, or maybe let me know what they are I don't mind moving them over

I think the main change to add is this: https://github.com/opensearch-project/opensearch-jvector/pull/145/files#diff-6386363107f6fd4dbe6275645348aa24b637b0b23ca60fbe039e1301122ee5e2R239-R240

This avoids blocking the SIMD threads with calculating the partial sums.

Gotcha, will make sure to add it as well! thank you @tjake !

akash-shankaran · 2025-10-15T22:53:45Z

@sam-herman think you need to update the changelog and sign your commits to get validation to pass?

akash-shankaran · 2025-10-15T22:57:04Z

        indexInputDelegate.readBytes(internalFloatBuffer, 0, Float.BYTES);
        FloatBuffer buffer = ByteBuffer.wrap(internalFloatBuffer).asFloatBuffer();
-        return buffer.get();
+        return buffer.get();*/


sam-herman · 2025-10-16T00:37:43Z

@sam-herman think you need to update the changelog and sign your commits to get validation to pass?

All the commits are signed, not sure what's going on with that check. changelog updated.

jimdickinson

lgtm

Changes have already been applied, had an offline sync with tjake regarding the changes.

persist neighbors cache add support for sorted index searcher fixes for resolving the node -> docId add incremental merge construction with leading segment move additional tests to internal test for transparency update documentation add readme pictures remove doc values by default separate docIdtoOrdMap class and add tests Signed-off-by: Samuel Herman <sherman8915@gmail.com>

…nsearch-project#167) persist neighbors cache add support for sorted index searcher fixes for resolving the node -> docId add incremental merge construction with leading segment move additional tests to internal test for transparency update documentation add readme pictures remove doc values by default separate docIdtoOrdMap class and add tests Signed-off-by: Samuel Herman <sherman8915@gmail.com>

* changes to perform JVector 3.2 upgrade Signed-off-by: Akash Shankaran <akash.shankaran@ibm.com> * fix failing workflow, and spotless apply Signed-off-by: Akash Shankaran <akash.shankaran@ibm.com> * few more package upgrades for JDK24 Signed-off-by: Akash Shankaran <akash.shankaran@ibm.com> * add Akash as maintainer (#174) * add Akash as maintainer * change to IBM --------- Signed-off-by: Samuel Herman <sherman8915@gmail.com> * update akash in codeowners (#176) Signed-off-by: Samuel Herman <sherman8915@gmail.com> * Onboarding new maven snapshots publishing to s3 (jVector) (#178) Signed-off-by: Peter Zhu <zhujiaxi@amazon.com> * increment version of jVector to support incremental construction (#167) persist neighbors cache add support for sorted index searcher fixes for resolving the node -> docId add incremental merge construction with leading segment move additional tests to internal test for transparency update documentation add readme pictures remove doc values by default separate docIdtoOrdMap class and add tests Signed-off-by: Samuel Herman <sherman8915@gmail.com> * update changelog Signed-off-by: Akash Shankaran <akash.shankaran@ibm.com> --------- Signed-off-by: Akash Shankaran <akash.shankaran@ibm.com> Signed-off-by: Samuel Herman <sherman8915@gmail.com> Signed-off-by: Peter Zhu <zhujiaxi@amazon.com> Co-authored-by: sam-herman <97131656+sam-herman@users.noreply.github.com> Co-authored-by: Peter Zhu <zhujiaxi@amazon.com>

sam-herman requested review from jimdickinson, sandoichi and tjake as code owners September 23, 2025 03:34

tjake reviewed Sep 23, 2025

View reviewed changes

tjake previously requested changes Sep 23, 2025

View reviewed changes

akash-shankaran reviewed Sep 23, 2025

View reviewed changes

sam-herman force-pushed the incremental-insertion-to-existing-graph branch from d0c8738 to f373bb5 Compare September 25, 2025 01:33

sam-herman requested a review from tjake September 25, 2025 17:19

This was referenced Oct 1, 2025

changes to perform OpenSearch Jvector 3.2 upgrade #165

Merged

Reconstruct heap graph from disk graph datastax/jvector#536

Merged

sam-herman mentioned this pull request Oct 15, 2025

[RFC] In-Place Updating Segments with jVector Graphs #169

Open

sam-herman requested a review from akash-shankaran October 15, 2025 21:04

akash-shankaran reviewed Oct 15, 2025

View reviewed changes

sam-herman added the backport 3.0 label Oct 16, 2025

akash-shankaran approved these changes Oct 16, 2025

View reviewed changes

jimdickinson approved these changes Oct 16, 2025

View reviewed changes

sam-herman force-pushed the incremental-insertion-to-existing-graph branch from 4401286 to 1262bcf Compare October 16, 2025 17:09

sam-herman force-pushed the incremental-insertion-to-existing-graph branch from 1262bcf to 85cee15 Compare October 16, 2025 17:11

sam-herman removed the backport 3.0 label Oct 16, 2025

sam-herman merged commit 998e932 into opensearch-project:main Oct 16, 2025
10 of 36 checks passed

sam-herman deleted the incremental-insertion-to-existing-graph branch October 16, 2025 17:15

sam-herman mentioned this pull request Oct 17, 2025

Remove duplicated vector storage and rely on inline vectors for RAVV #145

Closed

5 tasks

sam-herman mentioned this pull request Nov 7, 2025

support deletes for leading segment merges #203

Merged

4 tasks

Conversation

sam-herman commented Sep 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Testing

Related Issues

Check List

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tjake left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sam-herman commented Sep 23, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tjake commented Sep 24, 2025

Uh oh!

sam-herman commented Sep 24, 2025

Uh oh!

akash-shankaran commented Oct 15, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sam-herman commented Oct 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jimdickinson left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

sam-herman commented Sep 23, 2025 •

edited

Loading

sam-herman commented Oct 16, 2025 •

edited

Loading