This project implements and evaluates nearest neighbor search algorithms for high-dimensional datasets, following the format and requirements of the SISAP 2025 Nearest Neighbor Search Challenge.
It supports two tasks:
- Task 1: Approximate k-NN search with disjoint training and query sets.
- Task 2: All-k-NN search within the same dataset (queries = data).
- `main.py`: Entry point for both tasks.
- `datasets.py`: Handles dataset loading, linking/downloading, and structure validation.
- `eval.py`: Computes recall against the ground truth and evaluates search results.
- `metadata.py`: Defines dataset metadata and parsing logic.
- `timer.py`: Utility for measuring and logging execution time.
- `logger.py` and `utils.py`: Logging and formatting utilities.
- `pca.py`: Implements PCA variants (e.g., FAISS, scikit-learn).
Install the required libraries:

```bash
pip install -r requirements.txt
```

Prepare the data first:

```bash
python datasets.py
```

Then, to run a search task:

```bash
python main.py --task task1 --dataset pubmed23
python main.py --task task2 --dataset gooaq
```

Options:

- `--task`: `task1` or `task2`
- `--dataset`: A dataset present in the `data/` directory (e.g., `gooaq`, `ccnews`, `pubmed23`)
Results will be stored in the `results/` directory with appropriate metadata.
To evaluate recall:
```bash
python eval.py
```

It reads the results file and computes recall by comparing against the ground truth stored in `data/[dataset]/task[1|2]/gt/`.
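Recall@k is the fraction of the true k nearest neighbors that the search actually returned, averaged over all queries. A minimal sketch of this computation, assuming both the results file and the ground truth hold arrays of neighbor ids per query (the actual `eval.py` may structure this differently):

```python
import numpy as np

def recall_at_k(results: np.ndarray, ground_truth: np.ndarray, k: int) -> float:
    """Mean fraction of the true k nearest neighbors recovered per query.

    results:      (n_queries, >=k) array of retrieved neighbor ids
    ground_truth: (n_queries, >=k) array of true neighbor ids
    """
    hits = 0
    for found, true in zip(results[:, :k], ground_truth[:, :k]):
        # set intersection of retrieved vs. true ids for this query
        hits += len(np.intersect1d(found, true))
    return hits / (results.shape[0] * k)
```

For example, if a query's true top-3 neighbors are `[1, 2, 3]` and the search returns `[1, 2, 9]`, the recall for that query is 2/3.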
This project expects datasets to follow the SISAP structure:
```text
data/
└── [dataset]/
    ├── task1/
    │   ├── [dataset].h5
    │   └── gt/
    │       └── gt_[dataset].h5
    └── task2/
        ├── [dataset].h5
        └── gt/
            └── gt_[dataset].h5
```
If your data is not organized this way, you can specify custom paths in `metadata.py`.
You can tweak search parameters such as the PCA dimension (`d_pca`) and search depth (`k_search`) in `main.py` by editing `params_task1` and `params_task2`.
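The exact shape of these dictionaries is defined in `main.py`; as a purely hypothetical illustration of the two parameters named above:

```python
# Hypothetical illustration only; the real dictionaries live in main.py
# and may carry additional keys.
params_task1 = {
    "d_pca": 192,     # target dimensionality after PCA reduction
    "k_search": 100,  # candidates retrieved before keeping the top k
}
```

A larger `k_search` generally raises recall at the cost of query time, while a smaller `d_pca` speeds up distance computations at some loss of accuracy.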
- PCA-based dimensionality reduction is optional and supports both `sklearn` and `faiss` implementations.
- Multithreading is used to parallelize query processing.
- Results include recall metrics and runtime stats for benchmarking.
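The optional PCA step fits a projection on the training vectors and applies it to both sets before search. A minimal sketch using the scikit-learn variant (the FAISS variant in `pca.py` would use `faiss.PCAMatrix` to the same effect; the function name here is illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_dim(train: np.ndarray, queries: np.ndarray, d_pca: int):
    """Fit PCA on the training vectors and project both sets.

    Illustrative sketch of the optional reduction step; pca.py may
    organize this differently.
    """
    pca = PCA(n_components=d_pca, random_state=0)
    train_r = pca.fit_transform(train)       # fit on the database side only
    queries_r = pca.transform(queries)       # reuse the same projection
    return train_r.astype(np.float32), queries_r.astype(np.float32)
```

Fitting on the training side only keeps the query set disjoint from model fitting, matching the Task 1 setup.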