Simplified clean dataset loader#654

Merged
tlwillke merged 34 commits into main from simplified_loader
Apr 8, 2026

Conversation

@jshook
Contributor

@jshook jshook commented Apr 6, 2026

This dataset loader is like the previous MFD loader, but with targeted changes that make it simpler and more robust.

  • It uses HTTP and does not require S3 to fetch data. HTTP is still fast enough and is less complicated.
    • Ted improved the transport to support both. It will use HTTP* or S3 as specified.
  • It maps each dataset name to its data facets through a datasets.yaml file. This is just a name-to-facet mapping file.
    • It uses the catalog entries file catalog_entries.yaml within the source directory for clarity.
  • It supports local and remote use: it caches the remote datasets.yaml file locally and uses the cached copy by default.
  • It will still inform the user if the remote datasets.yaml file has changed, if they choose.
  • Once initialized, it does the same kind of loading as the previous MFD loader. All other dataset and run settings are unaffected.

This canonically sets the dataset strategy for jvector to a verified set of data files. All previous datasets that are not in the vetted set will be discontinued.
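
To illustrate the caching behaviour described in the list above, here is a minimal sketch, not the code in this PR. It assumes the remote datasets.yaml can be fetched as raw bytes; the class name `CachedCatalog`, its methods, and the `remoteFetch` supplier are all hypothetical. The cached copy is used whenever it exists, and the change check is a separate, opt-in call.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.function.Supplier;

/** Hypothetical helper for illustration; not the loader class added by this PR. */
public final class CachedCatalog {
    private CachedCatalog() {}

    /** Returns the cached catalog bytes, fetching the remote copy only if no cache exists yet. */
    public static byte[] load(Path cacheFile, Supplier<byte[]> remoteFetch) throws IOException {
        if (Files.notExists(cacheFile)) {
            Path parent = cacheFile.getParent();
            if (parent != null) {
                Files.createDirectories(parent);
            }
            Files.write(cacheFile, remoteFetch.get());
        }
        return Files.readAllBytes(cacheFile);
    }

    /** Opt-in check: true when the remote content differs from the cached copy. */
    public static boolean remoteChanged(Path cacheFile, Supplier<byte[]> remoteFetch) throws IOException {
        byte[] cachedDigest = sha256(Files.readAllBytes(cacheFile));
        byte[] remoteDigest = sha256(remoteFetch.get());
        return !MessageDigest.isEqual(cachedDigest, remoteDigest);
    }

    private static byte[] sha256(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(data);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

Comparing digests rather than timestamps is one possible design choice here; the actual loader may detect changes differently.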

@github-actions
Contributor

github-actions bot commented Apr 6, 2026

Before you submit for review:

  • Does your PR follow guidelines from CONTRIBUTIONS.md?
  • Did you summarize what this PR does clearly and concisely?
  • Did you include performance data for changes which may be performance impacting?
  • Did you include useful docs for any user-facing changes or features?
  • Did you include useful javadocs for developer oriented changes, explaining new concepts or key changes?
  • Did you trigger and review regression testing results against the base branch via Run Bench Main?
  • Did you adhere to the code formatting guidelines (TBD)
  • Did you group your changes for easy review, providing meaningful descriptions for each commit?
  • Did you ensure that all files contain the correct copyright header?

If you did not complete any of these, then please explain below.

@jshook jshook changed the title from "Simplified loader" to "Simplified clean dataset loader" Apr 6, 2026
jshook and others added 4 commits April 6, 2026 14:42
… upgrades remote file handling from a sequential HTTP-only path to a transport-routed implementation with S3 support, shared S3 client/transfer-manager reuse, and parallel base/query/gt downloads while preserving the existing logging and local catalog behavior.
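
The parallel base/query/gt download mentioned in the commit message above could look roughly like the following. This is a sketch under stated assumptions, not the PR's implementation: the class `ParallelFacetFetch` and the `fetchOne` function (standing in for a single HTTP or S3 transfer) are hypothetical. All transfers are started before any result is awaited, so they overlap.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Hypothetical illustration of fetching several dataset facets concurrently. */
public final class ParallelFacetFetch {
    private ParallelFacetFetch() {}

    /**
     * Starts one fetch per facet, then waits for all of them.
     * fetchOne is an assumed stand-in for a single transfer and returns a local location.
     */
    public static Map<String, String> fetchAll(List<String> facets, Function<String, String> fetchOne) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, facets.size()));
        try {
            // Submit every fetch first so the transfers run in parallel.
            Map<String, CompletableFuture<String>> pending = new LinkedHashMap<>();
            for (String facet : facets) {
                pending.put(facet, CompletableFuture.supplyAsync(() -> fetchOne.apply(facet), pool));
            }
            // Collect results in submission order; join() rethrows any failure.
            Map<String, String> results = new LinkedHashMap<>();
            pending.forEach((facet, future) -> results.put(facet, future.join()));
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

A call such as `ParallelFacetFetch.fetchAll(List.of("base", "query", "gt"), myFetch)` would return a map from facet name to whatever `myFetch` produced for it.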
@tlwillke
Collaborator

tlwillke commented Apr 6, 2026

Performance data highlighting the improvements (S3 to 180-core Genoa machine on GCP):

| Dataset | DataSetLoaderMFD (old), s | DataSetLoaderSimpleMFD (new), s | Speedup | Time saved, s |
|---|---|---|---|---|
| cohere-english-v3-100k | 14.00 | 3.00 | 4.67x | 11.00 |
| ada002-100k | 21.90 | 4.70 | 4.66x | 17.20 |
| openai-v3-small-100k | 21.10 | 4.30 | 4.91x | 16.80 |
| gecko-100k | 10.40 | 2.10 | 4.95x | 8.30 |
| openai-v3-large-3072-100k | 54.20 | 6.80 | 7.97x | 47.40 |
| openai-v3-large-1536-100k | 22.85 | 3.70 | 6.18x | 19.15 |
| e5-small-v2-100k | 5.85 | 1.30 | 4.50x | 4.55 |
| e5-base-v2-100k | 14.36 | 2.20 | 6.53x | 12.16 |
| e5-large-v2-100k | 18.65 | 4.30 | 4.34x | 14.35 |
| ada002-1M | 246.80 | 35.40 | 6.97x | 211.40 |
| colbert-1M | 22.10 | 3.40 | 6.50x | 18.70 |
| glove-25-angular.hdf5 | 4.80 | 1.20 | 4.00x | 3.60 |
| glove-50-angular.hdf5 | 9.10 | 2.00 | 4.55x | 7.10 |
| lastfm-64-dot.hdf5 | 4.43 | 0.97 | 4.57x | 3.46 |
| glove-100-angular.hdf5 | 19.80 | 2.90 | 6.83x | 16.90 |
| glove-200-angular.hdf5 | 33.40 | 5.40 | 6.19x | 28.00 |
| nytimes-256-angular.hdf5 | 10.20 | 1.80 | 5.67x | 8.40 |
| sift-128-euclidean.hdf5 | 17.70 | 4.20 | 4.21x | 13.50 |
| Total | 551.64 | 89.67 | 6.15x | 461.97 |

Summary

  • Total old time: 551.64 s
  • Total new time: 89.67 s
  • Total speedup: 6.15x
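
For reference, the total speedup is the ratio of the two column totals, not the average of the per-dataset speedups. Worked from the figures reported above (no new measurements):

```latex
\[
\text{speedup}_{\text{total}} = \frac{551.64\ \text{s}}{89.67\ \text{s}} \approx 6.15,
\qquad
\text{time saved} = 551.64\ \text{s} - 89.67\ \text{s} = 461.97\ \text{s}
\]
```

The unweighted mean of the 18 per-dataset speedups is about 5.46x. The total is higher because the two slowest loads, ada002-1M and openai-v3-large-3072-100k, both show above-average speedups and weigh most heavily in the totals.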

@tlwillke tlwillke requested a review from ashkrisk April 6, 2026 23:52
Collaborator

@tlwillke tlwillke left a comment

I have reviewed @jshook's contributions and he's addressed everything. I also ran all of the dataset loading scenarios through end-to-end testing. I have reported some performance data below as well. LGTM.

@tlwillke tlwillke added the bug and enhancement labels Apr 7, 2026
Contributor

@MarkWolters MarkWolters left a comment

Really glad to see this get done. Data integrity aside, the hard coding in the data loaders has been a pain for a long time. Overall this looks like really good work. I've commented on a couple of concerns, but there are no showstoppers.

@tlwillke
Collaborator

tlwillke commented Apr 8, 2026

I fixed some failing loader tests. The problem was that DataSetMetadataReader.load() used a hardcoded relative path, so it only worked when the JVM happened to start from a working directory where jvector-examples/yaml-configs/dataset-metadata.yml resolved correctly. The fix was to keep the existing default path but add a fallback for the module-root-relative path yaml-configs/dataset-metadata.yml, making metadata loading work whether tests run from the repo root or from the jvector-examples module directory.
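
The fallback described above can be sketched as follows. This is an illustration only, not the actual body of DataSetMetadataReader.load(): the class `MetadataPathResolver` and its `resolve` method are hypothetical, and the working directory is passed in explicitly so the behaviour is easy to test. Only the two relative paths come from the comment above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;

/** Hypothetical illustration of trying a default path and then a fallback. */
public final class MetadataPathResolver {
    private MetadataPathResolver() {}

    /** Candidate locations, in priority order: repo-root-relative first, then module-root-relative. */
    static final List<String> CANDIDATES = List.of(
            "jvector-examples/yaml-configs/dataset-metadata.yml",
            "yaml-configs/dataset-metadata.yml");

    /** Returns the first candidate that exists under workingDir, or empty if none does. */
    public static Optional<Path> resolve(Path workingDir) {
        for (String candidate : CANDIDATES) {
            Path path = workingDir.resolve(candidate);
            if (Files.isRegularFile(path)) {
                return Optional.of(path);
            }
        }
        return Optional.empty();
    }
}
```

Keeping the default path first preserves the existing behaviour when the JVM starts from the repo root, and the second entry covers a start from the jvector-examples module directory.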

tlwillke added 3 commits April 8, 2026 17:12
…e datasets remain usable offline while preserving local override precedence. New tests added.
@tlwillke tlwillke self-requested a review April 8, 2026 21:14
Collaborator

@tlwillke tlwillke left a comment

LGTM. Just needed to update index-parameters/default.yml.

@tlwillke tlwillke merged commit 8c75f1b into main Apr 8, 2026
11 checks passed
@tlwillke tlwillke deleted the simplified_loader branch April 8, 2026 21:59

4 participants