This project implements and evaluates nearest neighbor search algorithms for high-dimensional datasets, following the format and requirements of the SISAP 2025 Nearest Neighbor Search Challenge.
It supports two tasks:
- Task 1: Approximate k-NN search with disjoint training and query sets.
- Task 2: All-k-NN search within the same dataset (queries = data).
- `main.py`: Entry point for both tasks.
- `datasets.py`: Handles dataset loading, linking/downloading, and structure validation.
- `eval.py`: Computes recall against the ground truth and evaluates search results.
- `metadata.py`: Defines dataset metadata and parsing logic.
- `timer.py`: Utility for measuring and logging execution time.
- `logger.py` and `utils.py`: Logging and formatting utilities.
- `pca.py`: Implements PCA variants (e.g., FAISS, scikit-learn).
Install the required libraries:

```bash
pip install -r requirements.txt
```

Prepare the data first:

```bash
python datasets.py
```

Then, to run a search task:

```bash
python main.py --task task1 --dataset pubmed23
python main.py --task task2 --dataset gooaq
```

Options:

- `--task`: `task1` or `task2`
- `--dataset`: A dataset present in the `data/` directory (e.g., `gooaq`, `ccnews`, `pubmed23`)
Results will be stored in the `results/` directory with appropriate metadata.
To evaluate recall:
```bash
python eval.py
```

It reads the results file and computes recall by comparing against the ground truth stored in `data/[dataset]/task[1|2]/gt/`.
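Recall@k is the fraction of the true k nearest neighbors that the search actually returned, averaged over all queries. A minimal sketch of this computation, assuming both the results file and the ground truth hold arrays of neighbor ids per query (the actual `eval.py` may structure this differently):

```python
import numpy as np

def recall_at_k(results: np.ndarray, ground_truth: np.ndarray, k: int) -> float:
    """Mean fraction of the true k nearest neighbors recovered per query.

    results:      (n_queries, >=k) array of retrieved neighbor ids
    ground_truth: (n_queries, >=k) array of true neighbor ids
    """
    hits = 0
    for found, true in zip(results[:, :k], ground_truth[:, :k]):
        # set intersection of retrieved vs. true ids for this query
        hits += len(np.intersect1d(found, true))
    return hits / (results.shape[0] * k)
```

For example, if a query's true top-3 neighbors are `[1, 2, 3]` and the search returns `[1, 2, 9]`, the recall for that query is 2/3.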
This project expects datasets to follow the SISAP structure:
```text
data/
└── [dataset]/
    ├── task1/
    │   ├── [dataset].h5
    │   └── gt/
    │       └── gt_[dataset].h5
    └── task2/
        ├── [dataset].h5
        └── gt/
            └── gt_[dataset].h5
```
If your data is not organized this way, you can specify custom paths in `metadata.py`.
You can tweak search parameters such as the PCA dimension (`d_pca`) and search depth (`k_search`) in `main.py` by editing `params_task1` and `params_task2`.
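The exact shape of these dictionaries is defined in `main.py`; as a purely hypothetical illustration of the two parameters named above:

```python
# Hypothetical illustration only; the real dictionaries live in main.py
# and may carry additional keys.
params_task1 = {
    "d_pca": 192,     # target dimensionality after PCA reduction
    "k_search": 100,  # candidates retrieved before keeping the top k
}
```

A larger `k_search` generally raises recall at the cost of query time, while a smaller `d_pca` speeds up distance computations at some loss of accuracy.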
- PCA-based dimensionality reduction is optional and supports both `sklearn` and `faiss` implementations.
- Multithreading is used to parallelize query processing.
- Results include recall metrics and runtime stats for benchmarking.
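The optional PCA step fits a projection on the training vectors and applies it to both sets before search. A minimal sketch using the scikit-learn variant (the FAISS variant in `pca.py` would use `faiss.PCAMatrix` to the same effect; the function name here is illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_dim(train: np.ndarray, queries: np.ndarray, d_pca: int):
    """Fit PCA on the training vectors and project both sets.

    Illustrative sketch of the optional reduction step; pca.py may
    organize this differently.
    """
    pca = PCA(n_components=d_pca, random_state=0)
    train_r = pca.fit_transform(train)       # fit on the database side only
    queries_r = pca.transform(queries)       # reuse the same projection
    return train_r.astype(np.float32), queries_r.astype(np.float32)
```

Fitting on the training side only keeps the query set disjoint from model fitting, matching the Task 1 setup.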