Incremental insertion to existing graph #167
| KNNCounter.KNN_QUANTIZATION_TRAINING_TIME.add(trainingTime); | ||
| log.info("Encoding and building PQ vectors for field {} for {} vectors", fieldName, randomAccessVectorValues.size()); | ||
| PQVectors pqVectors = (PQVectors) pq.encodeAll(randomAccessVectorValues, SIMD_POOL); | ||
| // PQVectors pqVectors = pq.encodeAll(randomAccessVectorValues, SIMD_POOL); |
| // The GraphIndexBuilder can add nodes to an existing index | ||
| forkJoinTask.add(PhysicalCoreExecutor.pool().submit(() -> builder.addGraphNode(nodeId, vector))); | ||
| } | ||
| for (ForkJoinTask<?> task : forkJoinTask) { | ||
| task.join(); | ||
| } |
This should use SIMD_POOL, and I would suggest using CompletableFuture.supplyAsync(() -> builder.addGraphNode(nodeId, vector), SIMD_POOL)
Then you can use CompletableFuture.allOf(tasks).join()
> This should use SIMD_POOL and I would suggest using CompletableFuture.supplyAsync(() -> builder.addGraphNode(nodeId, vector), SIMD_POOL)
Yeah, in fact this is temporary code that I need to move into jVector, just as I had to do with the PQVectors when you provide the newToOld ord mapping.
Totally agree, we should provide the SIMD_POOL
> So then you can use CompletableFuture.allOf(tasks).join()
+1 will make the changes
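The suggestion in this thread can be sketched roughly as follows. This is a minimal, self-contained illustration rather than the plugin's actual code: `SIMD_POOL` here is a plain `ForkJoinPool` standing in for the project's pool, and `addGraphNode` is a hypothetical stand-in for `builder.addGraphNode(nodeId, vector)` that only records the vector in a map. Note that the JDK class is `CompletableFuture` (singular), and `allOf` takes an array of futures.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ForkJoinPool;

public class ParallelGraphInsertSketch {
    // Assumption: stand-in for the project's SIMD_POOL; the real pool is configured elsewhere.
    static final ForkJoinPool SIMD_POOL =
        new ForkJoinPool(Math.max(1, Runtime.getRuntime().availableProcessors()));

    // Assumption: stand-in for the graph held by the builder.
    static final Map<Integer, float[]> GRAPH = new ConcurrentHashMap<>();

    // Hypothetical stand-in for builder.addGraphNode(nodeId, vector).
    static long addGraphNode(int nodeId, float[] vector) {
        GRAPH.put(nodeId, vector);
        return (long) vector.length * Float.BYTES;
    }

    // Submits one insertion task per vector to SIMD_POOL and waits for all of them.
    static int insertAll(float[][] vectors) {
        List<CompletableFuture<Long>> tasks = new ArrayList<>();
        for (int i = 0; i < vectors.length; i++) {
            final int nodeId = i;
            final float[] vector = vectors[i];
            tasks.add(CompletableFuture.supplyAsync(() -> addGraphNode(nodeId, vector), SIMD_POOL));
        }
        // allOf accepts an array; join() blocks until every insertion has completed
        // and rethrows (wrapped in CompletionException) if any task failed.
        CompletableFuture.allOf(tasks.toArray(new CompletableFuture[0])).join();
        return GRAPH.size();
    }
}
```

Passing the executor explicitly keeps the insertion work off the common pool, which appears to be the point of the review comment.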
| * | ||
| * Which means that we also need to persist this mapping to disk to be available across merges. | ||
| */ | ||
| public static class JVectorLuceneDocMap { |
Should this be its own file?
Yeah, I'll move it over to a separate file and add specific tests around it as well.
| this.ordinalsToDocIds = new int[ordinalsToDocIds.length]; | ||
| System.arraycopy(ordinalsToDocIds, 0, this.ordinalsToDocIds, 0, ordinalsToDocIds.length); | ||
| final int maxDocId = Arrays.stream(ordinalsToDocIds).max().getAsInt(); | ||
| final int maxDocs = maxDocId + 1; |
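For readers following the hunk above, here is a rough sketch of how a bidirectional ordinal/doc ID mapping could be built from the `ordinalsToDocIds` array. It is an illustration under stated assumptions, not the PR's `JVectorLuceneDocMap`: the class name `DocMapSketch`, the `-1` sentinel for unmapped doc IDs, and the accessor names are all hypothetical, and doc IDs are assumed to be non-negative and unique.

```java
import java.util.Arrays;

public class DocMapSketch {
    private final int[] ordinalsToDocIds;
    private final int[] docIdsToOrdinals;

    public DocMapSketch(int[] ordinalsToDocIds) {
        // Defensive copy, as in the hunk above.
        this.ordinalsToDocIds = Arrays.copyOf(ordinalsToDocIds, ordinalsToDocIds.length);
        // orElse(-1) tolerates an empty input, where getAsInt() would throw.
        final int maxDocId = Arrays.stream(ordinalsToDocIds).max().orElse(-1);
        final int maxDocs = maxDocId + 1;
        // Assumption: -1 marks doc IDs that have no vector ordinal.
        this.docIdsToOrdinals = new int[maxDocs];
        Arrays.fill(this.docIdsToOrdinals, -1);
        for (int ord = 0; ord < ordinalsToDocIds.length; ord++) {
            this.docIdsToOrdinals[ordinalsToDocIds[ord]] = ord;
        }
    }

    // Returns the doc ID stored for a graph node ordinal.
    public int docId(int ordinal) {
        return ordinalsToDocIds[ordinal];
    }

    // Returns the ordinal for a doc ID, or -1 if the doc has no vector.
    public int ordinal(int docId) {
        return (docId >= 0 && docId < docIdsToOrdinals.length) ? docIdsToOrdinals[docId] : -1;
    }
}
```

For example, `new DocMapSketch(new int[] {4, 0, 2})` maps ordinal 0 to doc ID 4, and reports `-1` for doc IDs 1 and 3, which carry no vector.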
| version=1.0.0 | ||
| systemProp.bwc.version=1.3.4 | ||
| jvector_version=4.0.0-rc.2 | ||
| jvector_version=4.0.0-rc.3-SNAPSHOT |
4.0.0-rc.3 was released; can you switch to it?
sure, will update
I'll update to 4.0.0-rc.4-SNAPSHOT until the release becomes available following the JV change.
Thank you for reviewing, @tjake! Sure, if you'd like to add those, or let me know what they are, I don't mind moving them over as well, and I can also run the tests with the pending JV change.
| private final JVectorWriter.JVectorLuceneDocMap jVectorLuceneDocMap; | ||
| public JVectorFloatVectorValues(OnDiskGraphIndex onDiskGraphIndex, VectorSimilarityFunction similarityFunction) throws IOException { | ||
| public JVectorFloatVectorValues( |
it might be helpful to add some documentation for public methods.
Yeah, I'll try to get all of those documented. Many of those are simply a passthrough to the Lucene methods with some delegate functionality.
I think the main change to add is this: https://github.com/opensearch-project/opensearch-jvector/pull/145/files#diff-6386363107f6fd4dbe6275645348aa24b637b0b23ca60fbe039e1301122ee5e2R239-R240
This avoids blocking the SIMD threads with the partial-sum calculation.
Gotcha, will make sure to add it as well! Thank you @tjake!
d0c8738 to f373bb5
@sam-herman think you need to update the changelog and sign your commits to get validation to pass?
| indexInputDelegate.readBytes(internalFloatBuffer, 0, Float.BYTES); | ||
| FloatBuffer buffer = ByteBuffer.wrap(internalFloatBuffer).asFloatBuffer(); | ||
| return buffer.get(); | ||
| return buffer.get();*/ |
All the commits are signed; not sure what's going on with that check. Changelog updated.
Changes have already been applied, had an offline sync with tjake regarding the changes.
4401286 to 1262bcf
persist neighbors cache
add support for sorted index searcher
fixes for resolving the node -> docId
add incremental merge construction with leading segment
move additional tests to internal test for transparency
update documentation
add readme pictures
remove doc values by default
separate docIdtoOrdMap class and add tests
Signed-off-by: Samuel Herman <sherman8915@gmail.com>
1262bcf to 85cee15
…nsearch-project#167)
persist neighbors cache
add support for sorted index searcher
fixes for resolving the node -> docId
add incremental merge construction with leading segment
move additional tests to internal test for transparency
update documentation
add readme pictures
remove doc values by default
separate docIdtoOrdMap class and add tests
Signed-off-by: Samuel Herman <sherman8915@gmail.com>
* changes to perform JVector 3.2 upgrade
  Signed-off-by: Akash Shankaran <akash.shankaran@ibm.com>
* fix failing workflow, and spotless apply
  Signed-off-by: Akash Shankaran <akash.shankaran@ibm.com>
* few more package upgrades for JDK24
  Signed-off-by: Akash Shankaran <akash.shankaran@ibm.com>
* add Akash as maintainer (#174)
  * add Akash as maintainer
  * change to IBM
  ---------
  Signed-off-by: Samuel Herman <sherman8915@gmail.com>
* update akash in codeowners (#176)
  Signed-off-by: Samuel Herman <sherman8915@gmail.com>
* Onboarding new maven snapshots publishing to s3 (jVector) (#178)
  Signed-off-by: Peter Zhu <zhujiaxi@amazon.com>
* increment version of jVector to support incremental construction (#167)
  persist neighbors cache
  add support for sorted index searcher
  fixes for resolving the node -> docId
  add incremental merge construction with leading segment
  move additional tests to internal test for transparency
  update documentation
  add readme pictures
  remove doc values by default
  separate docIdtoOrdMap class and add tests
  Signed-off-by: Samuel Herman <sherman8915@gmail.com>
* update changelog
  Signed-off-by: Akash Shankaran <akash.shankaran@ibm.com>
---------
Signed-off-by: Akash Shankaran <akash.shankaran@ibm.com>
Signed-off-by: Samuel Herman <sherman8915@gmail.com>
Signed-off-by: Peter Zhu <zhujiaxi@amazon.com>
Co-authored-by: sam-herman <97131656+sam-herman@users.noreply.github.com>
Co-authored-by: Peter Zhu <zhujiaxi@amazon.com>
Description
This change is the first part of the leading segment merge epic.
It leverages previously created leading segments to facilitate incremental insertion.
This change also includes some important bug fixes.
The following items were changed, added, or fixed:
FlatVectorFormat that stores the FP vectors and only use the inlined vectors instead.
Note: this change also has a dependency on the jVector change that enables reloading of OnDiskGraphIndex back into a mutable OnHeapGraph
Testing
Run:
python create_and_test_large_index.py --batch-size 2000 --force-merge-frequency 2000 --num-vectors 50000 --csv-output merge_times.csv
Benchmark for merges before incremental merges:

Benchmark for merges after incremental merges:

Before and after unified view:
Redundant storage is 3x lower than before due to the elimination of the Lucene FlatVectorFormat and DocValues for FP vectors:

Related Issues
Resolves #133
Check List
--signoff.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.