Conversation
|
Before you submit for review:
If you did not complete any of these, then please explain below. |
… upgrades remote file handling from a sequential HTTP-only path to a transport-routed implementation with S3 support, shared S3 client/transfer-manager reuse, and parallel base/query/gt downloads while preserving the existing logging and local catalog behavior.
|
Performance data highlighting the improvements (S3 to 180-core Genoa machine on GCP):
Summary
|
…ch should never be used
…ctory. Renamed run.yml as run-config.yml for clarity.
… (formerly private) datasets now falls under dataset_cache in subdirs. Cleaned up minor errors in the javadocs. Improved Exception handling for missing dataset metadata.
MarkWolters
left a comment
There was a problem hiding this comment.
Really glad to see this get done, data integrity aside the hard coding in the data loaders has been a pain for a long time. Overall looks like really good work. I've commented on a couple concerns but no showstoppers.
...r-examples/src/main/java/io/github/jbellis/jvector/example/benchmarks/datasets/DataSets.java
Show resolved
Hide resolved
.../main/java/io/github/jbellis/jvector/example/benchmarks/datasets/DataSetLoaderSimpleMFD.java
Outdated
Show resolved
Hide resolved
.../main/java/io/github/jbellis/jvector/example/benchmarks/datasets/DataSetLoaderSimpleMFD.java
Outdated
Show resolved
Hide resolved
.../main/java/io/github/jbellis/jvector/example/benchmarks/datasets/DataSetLoaderSimpleMFD.java
Show resolved
Hide resolved
…dule-root working directories.
|
I fixed some failing loader tests. The problem was that DataSetMetadataReader.load() used a hardcoded relative path, so it only worked when the JVM happened to start from a working directory where jvector-examples/yaml-configs/dataset-metadata.yml resolved correctly. The fix was to keep the existing default path but add a fallback for the module-root-relative path yaml-configs/dataset-metadata.yml, making metadata loading work whether tests run from the repo root or from the jvector-examples module directory. |
…loading and local-catalog precedence.
…e datasets remain usable offline while preserving local override precedence. New tests added.
tlwillke
left a comment
There was a problem hiding this comment.
LGTM. Just needed to update index-parameters/default.yml.
This dataset loader is like the previous MDF loader but with very targeted changes to make things easier and more robust.
It uses HTTP, not requiring S3 to fetch data. It's still plenty fast enough and less complicated.datasets.yamlfile. This is just a name to facet mapping file.catalog_entries.yamlwithin the source directory for clarity.This canonically sets the dataset strategy for jvector to a verified set of data files. All previous datasets which are not in the vetted set will be discontinued.