fix: compute SQ dot distance from dequantized values#7355

Open
cwj0bzxg wants to merge 1 commit into lance-format:main from cwj0bzxg:fix-sq-dot-dequantized

@cwj0bzxg

This PR fixes Dot distance computation for scalar-quantized (SQ) vectors. See #7352.

Previously, the SQ Dot path computed the dot product directly from u8 quantized codes and only applied a scale factor. This is incorrect when SQ uses a non-zero lower_bound, because each code represents an offset value:

value ≈ lower_bound + step * code

As a result, the old Dot path missed the offset-related terms and could produce a very different ranking from the actual vector values. This caused severe recall degradation for SQ indexes with metric="dot".
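A minimal sketch of why the offset matters, using hypothetical helper names (`dequantize`, `code_dot`, `dequantized_dot`) that are not taken from the Lance codebase. It compares a dot product taken directly on the u8 codes with one taken on the dequantized values:

```rust
/// Dequantize one SQ code: value ≈ lower_bound + step * code.
fn dequantize(code: u8, lower_bound: f32, step: f32) -> f32 {
    lower_bound + step * code as f32
}

/// Dot product taken directly on the u8 codes, ignoring the offset.
/// (Illustrative stand-in for the old behavior, before any scale factor.)
fn code_dot(a: &[u8], b: &[u8]) -> u32 {
    a.iter().zip(b).map(|(&x, &y)| x as u32 * y as u32).sum()
}

/// Dot product on the dequantized values.
fn dequantized_dot(a: &[u8], b: &[u8], lower_bound: f32, step: f32) -> f32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| dequantize(x, lower_bound, step) * dequantize(y, lower_bound, step))
        .sum()
}
```

For example, with `lower_bound = -1.0`, `step = 1.0`, query codes `[0, 2]`, and candidate codes `a = [2, 2]` and `b = [0, 1]`, the code-only dot product prefers `a` (4 vs 2), while the dequantized dot product prefers `b` (1 vs 0). A positive scale factor cannot undo that reordering.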

The fix computes SQ Dot using the full expansion:

dot(x, q) ≈ sum_i((lower_bound + step * cx_i) * (lower_bound + step * cq_i))

Equivalently:

dot =
    step² · sum_i(cx_i · cq_i)
  + lower_bound·step · sum_i(cx_i)
  + lower_bound·step · sum_i(cq_i)
  + dim·lower_bound²

The Dot distance is then computed as distance = 1 - dot, matching the existing Dot distance convention.
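The expansion can be sketched as a standalone function. The names `sq_dot_expanded` and `sq_dot_distance` are illustrative assumptions, not the functions in this patch, and the `u32` accumulators assume a dimension small enough not to overflow:

```rust
/// Expanded affine dot product over SQ codes:
/// step² · Σ(cx·cq) + lower_bound·step · (Σcx + Σcq) + dim · lower_bound².
fn sq_dot_expanded(x: &[u8], q: &[u8], lower_bound: f32, step: f32) -> f32 {
    // Integer accumulation keeps the code sums exact.
    let code_dot: u32 = x.iter().zip(q).map(|(&a, &b)| a as u32 * b as u32).sum();
    let sum_x: u32 = x.iter().map(|&c| c as u32).sum();
    let sum_q: u32 = q.iter().map(|&c| c as u32).sum();
    let dim = x.len() as f32;

    step * step * code_dot as f32
        + lower_bound * step * (sum_x + sum_q) as f32
        + dim * lower_bound * lower_bound
}

/// Dot distance under the `1 - dot` convention described above.
fn sq_dot_distance(x: &[u8], q: &[u8], lower_bound: f32, step: f32) -> f32 {
    1.0 - sq_dot_expanded(x, q, lower_bound, step)
}
```

With `lower_bound = -10.0`, `step = 0.5`, `x = [3, 200, 17, 255]`, and `q = [0, 128, 64, 9]`, the three terms are 7245.75, -3380, and 400, which sum to 4265.75. This is the same value obtained by dequantizing both vectors first and then taking their dot product.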

Changes

  • Fix SQ Dot distance to include the lower_bound offset terms.
  • Keep the existing SQ L2 and Cosine paths unchanged.
  • Cache / compute the SQ query code sum needed by the Dot formula.
  • Add Rust unit tests for:
    • Dot distance from a float query.
    • Dot distance from an indexed vector id.
    • Constant-bound SQ behavior.
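As a rough illustration of the cached query code sum, the query-side sum can be computed once and reused for every candidate. The `SqDotQuery` type below is hypothetical and is not the structure used in this patch:

```rust
/// Hypothetical per-query state: the query's SQ codes plus their sum,
/// computed once and reused for every candidate vector.
struct SqDotQuery {
    codes: Vec<u8>,
    code_sum: u32,
}

impl SqDotQuery {
    fn new(codes: Vec<u8>) -> Self {
        let code_sum = codes.iter().map(|&c| c as u32).sum();
        Self { codes, code_sum }
    }

    /// Dot distance (1 - dot) from the query to one stored SQ vector.
    fn distance(&self, x: &[u8], lower_bound: f32, step: f32) -> f32 {
        let code_dot: u32 = x
            .iter()
            .zip(&self.codes)
            .map(|(&a, &b)| a as u32 * b as u32)
            .sum();
        let sum_x: u32 = x.iter().map(|&c| c as u32).sum();
        let dot = step * step * code_dot as f32
            + lower_bound * step * (sum_x + self.code_sum) as f32
            + x.len() as f32 * lower_bound * lower_bound;
        1.0 - dot
    }
}
```

Only the per-candidate sums are recomputed on each call; the query sum is paid for once in `new`.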

Validation

I rebuilt the Python extension with this patch and reran MSMARCO WebSearch 1M Dot benchmarks.

Dataset:

  • 1M base vectors
  • 9,376 queries
  • dimension 768
  • metric: Dot
  • evaluated with recall@10

Before this fix, IVF_HNSW_SQ recall@10 was only around 0.0250 to 0.0684 across ef=20..640, while the IVF_HNSW_FLAT baseline reached 0.5179 to 0.9377.

After this fix:

IVF_HNSW_SQ

| ef | QPS | recall@10 |
|---:|---:|---:|
| 20 | 944.54 | 0.5176 |
| 40 | 880.87 | 0.6455 |
| 80 | 762.09 | 0.7448 |
| 160 | 607.29 | 0.8198 |
| 320 | 453.09 | 0.8716 |
| 640 | 305.57 | 0.9040 |

IVF_SQ

| nprobes | QPS | recall@10 |
|---:|---:|---:|
| 16 | 578.43 | 0.6392 |
| 32 | 417.61 | 0.7352 |
| 64 | 272.83 | 0.8104 |
| 96 | 205.88 | 0.8456 |
| 128 | 165.96 | 0.8684 |

IVF_HNSW_FLAT (baseline)

| ef | QPS | recall@10 |
|---:|---:|---:|
| 20 | 886.27 | 0.5195 |
| 40 | 773.42 | 0.6507 |
| 80 | 627.24 | 0.7563 |
| 160 | 470.82 | 0.8390 |
| 320 | 315.20 | 0.8983 |
| 640 | 193.39 | 0.9373 |

With the corrected distance formula, SQ Dot recall is restored to the same range as the Flat baseline. IVF_HNSW_SQ is close to IVF_HNSW_FLAT at the same ef, while generally providing higher QPS. The remaining recall gap at high ef is expected from quantization loss rather than an incorrect distance formula.

github-actions bot added the A-index (Vector index, linalg, tokenizer) and bug (Something isn't working) labels on Jun 18, 2026
lower_bound: f32,
) -> f32 {
let code_dot = dot_u8(sq_code, query_sq_code) as f32;
let dot = step * step * code_dot

Collaborator

This expanded affine dot calculation is performed in f32, and the large offset terms can cancel in high-dimensional near-zero vectors enough to flip SQ Dot rankings.
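One possible way to reduce the cancellation, sketched here as an assumption rather than as what the patch does, is to accumulate the code sums exactly in integers and combine the terms in f64 before narrowing to f32. The names `sq_dot_wide` and `dequantized_dot_f64` are hypothetical:

```rust
/// Expanded SQ dot product with exact integer sums and f64 combination.
fn sq_dot_wide(x: &[u8], q: &[u8], lower_bound: f32, step: f32) -> f32 {
    // u64 accumulators cannot overflow for any realistic dimension.
    let code_dot: u64 = x.iter().zip(q).map(|(&a, &b)| a as u64 * b as u64).sum();
    let code_sum: u64 = x.iter().chain(q).map(|&c| c as u64).sum();

    let (lb, st) = (lower_bound as f64, step as f64);
    let dot = st * st * code_dot as f64
        + lb * st * code_sum as f64
        + x.len() as f64 * lb * lb;
    // Narrow only after the large terms have cancelled.
    dot as f32
}

/// Reference: dequantize each component in f64 and sum the products.
fn dequantized_dot_f64(x: &[u8], q: &[u8], lower_bound: f32, step: f32) -> f64 {
    let (lb, st) = (lower_bound as f64, step as f64);
    x.iter()
        .zip(q)
        .map(|(&a, &b)| (lb + st * a as f64) * (lb + st * b as f64))
        .sum()
}
```

As an illustration of the scale involved: with 768 dimensions, `lower_bound = -0.5`, `step ≈ 1/255`, and codes near the middle of the range, the three terms are roughly 192, -384, and 192, while their sum is on the order of 1e-3. Whether the extra f64 arithmetic is acceptable on the hot path is a separate question this sketch does not address.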

],
);
let storage =
ScalarQuantizationStorage::try_new(8, DistanceType::Dot, -10.0..245.0, [batch], None)

Collaborator

The new coverage only exercises unit-step and constant bounds, so regressions in the step and step * step terms can pass while arbitrary-range SQ Dot distances remain wrong.
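To illustrate the concern, the sketch below uses a deliberately broken expansion (`broken_dot`, a hypothetical helper) that scales the code product by `step` instead of `step * step`. It is self-contained and does not use `ScalarQuantizationStorage`:

```rust
/// Reference: dot product of the dequantized values.
fn reference_dot(x: &[u8], q: &[u8], lower_bound: f32, step: f32) -> f32 {
    x.iter()
        .zip(q)
        .map(|(&a, &b)| (lower_bound + step * a as f32) * (lower_bound + step * b as f32))
        .sum()
}

/// Deliberately broken expansion: uses `step` where `step * step` belongs.
fn broken_dot(x: &[u8], q: &[u8], lower_bound: f32, step: f32) -> f32 {
    let code_dot: u32 = x.iter().zip(q).map(|(&a, &b)| a as u32 * b as u32).sum();
    let code_sum: u32 = x.iter().chain(q).map(|&c| c as u32).sum();
    step * code_dot as f32
        + lower_bound * step * code_sum as f32
        + x.len() as f32 * lower_bound * lower_bound
}

/// True when the broken expansion disagrees with the reference for these bounds.
fn bug_is_detected(lower_bound: f32, step: f32) -> bool {
    let x = [3u8, 200, 17, 255];
    let q = [0u8, 128, 64, 9];
    (broken_dot(&x, &q, lower_bound, step) - reference_dot(&x, &q, lower_bound, step)).abs() > 1e-3
}
```

With `step = 1.0` the broken version matches the reference exactly, so `bug_is_detected(-10.0, 1.0)` is false; with `step = 0.5` the two diverge and `bug_is_detected(-10.0, 0.5)` is true. A test over bounds with a non-unit step would therefore catch this class of regression.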
