feat: support hamming clustering #7379

Merged
jackye1995 merged 1 commit into lance-format:main from brendanclement:feat/hamming-clustering
Jun 19, 2026

Conversation

@brendanclement

Contributor

Add SIMD-accelerated pairwise hamming distance over 64-bit binary hashes, plus union-find clustering to group binary vectors within a hamming-distance threshold (near-duplicate detection).

  • lance-linalg: pairwise_hamming_distance[_parallel] with AVX-512 / AVX2 / scalar kernels, PairwiseResult, UnionFind, Cluster/ClusteringResult, extract_hashes_from_fixed_list, cluster_edges/cluster_pairwise_result.
  • lance: hamming_clustering_for_ivf_partition / for_sample / for_range / from_hashes and get_ivf_partition_info, returning a RecordBatchReader of (representative, duplicates) clusters.
  • python: thin bindings + wrappers in lance.vector, type stubs, and a test.
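
To make the description concrete, here is a minimal pure-Python sketch of the two ideas named above: pairwise hamming distance over 64-bit hashes, followed by union-find grouping of every pair within a threshold. It is an illustration only, not the lance or lance-linalg API. The names `hamming64`, `pairwise_within`, and `cluster_hashes` are hypothetical, and the inclusive threshold, the choice of the smallest index as representative, and the omission of singleton clusters are all assumptions rather than documented behavior of this PR.

```python
from itertools import combinations

MASK64 = (1 << 64) - 1  # treat every hash as an unsigned 64-bit value


def hamming64(a: int, b: int) -> int:
    """Count the bits that differ between two 64-bit hashes (XOR + popcount)."""
    return bin((a ^ b) & MASK64).count("1")


def pairwise_within(hashes, threshold):
    """Return (i, j, distance) for every pair i < j with distance <= threshold.

    The inclusive comparison is an assumption made for this sketch.
    """
    edges = []
    for (i, a), (j, b) in combinations(enumerate(hashes), 2):
        d = hamming64(a, b)
        if d <= threshold:
            edges.append((i, j, d))
    return edges


class UnionFind:
    """Disjoint-set structure with path halving."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Assumption: keep the smaller index as root so the representative
        # of each cluster is deterministic.
        if ra > rb:
            ra, rb = rb, ra
        self.parent[rb] = ra


def cluster_hashes(hashes, threshold):
    """Group indices whose hashes are linked by edges within the threshold.

    Returns a sorted list of (representative, duplicates) tuples and skips
    singletons, which is an assumption of this sketch.
    """
    uf = UnionFind(len(hashes))
    for i, j, _ in pairwise_within(hashes, threshold):
        uf.union(i, j)
    groups = {}
    for idx in range(len(hashes)):
        groups.setdefault(uf.find(idx), []).append(idx)
    return [
        (root, [m for m in members if m != root])
        for root, members in sorted(groups.items())
        if len(members) > 1
    ]
```

Because union-find merges transitively, two hashes can share a cluster even when their direct distance exceeds the threshold, as long as a chain of close pairs connects them. The all-pairs loop is quadratic in the number of hashes, which is the cost the SIMD kernels and parallel variant in this PR are meant to reduce.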

Recreates #6265 (originally authored by Jack Ye) on top of current main, updating imports/signatures for upstream API drift.

@github-actions (bot) added the labels A-python (Python bindings), A-index (Vector index, linalg, tokenizer), A-deps (Dependency updates), and enhancement (New feature or request) on Jun 19, 2026
Co-Authored-By: Jack Ye <yezhaoqin@gmail.com>
@brendanclement force-pushed the feat/hamming-clustering branch from 415f70a to ec02d9a on June 19, 2026 21:22
@codecov

codecov Bot commented Jun 19, 2026


Codecov Report

❌ Patch coverage is 80.74128% with 265 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance/src/index/vector/hamming.rs | 73.25% | 115 Missing and 27 partials ⚠️ |
| rust/lance-linalg/src/distance/hamming.rs | 85.44% | 117 Missing and 6 partials ⚠️ |


@jackye1995 merged commit 6ba89d5 into lance-format:main on Jun 19, 2026
31 checks passed