Hybrid keyword search: quality refinements (RRF tuning, CJK unigrams, collision noise) #613

Non-blocking quality follow-ups from the #610 review:

1. **Tune fusion.** Hybrid currently uses Qdrant's default RRF (k=60, equal leg weights). Evaluate weighted RRF (Qdrant ≥1.17) or DBSF (≥1.11) once there's a representative query set. No public nDCG benchmark exists, so this needs real-vault evaluation.
2. **Single-char CJK queries.** The tokenizer emits character *bigrams* for CJK runs, so a single-character CJK query (`猫`) won't match multi-character CJK content (`飼猫` emits only the bigram `飼猫`, never the unigram `猫`). Consider also emitting CJK unigrams, or document the limitation.
3. **u32 collision noise at scale.** HMAC dims are truncated to u32; per-user vocab in the low-millions starts accruing collisions (two terms → one bucket, weights summed — graceful ranking noise, not a correctness/security issue). Quantify the ranking-noise floor; widen the dim space if it bites.
4. **avgdl re-normalization (#605).** The current `ReindexKeyword` stub recomputes against a *drifting* avgdl during the pass; the real impl should snapshot avgdl once at the start and pass it via the existing `Bm25` injected-param seam.
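
For item 1, a minimal offline sketch of weighted RRF that could be used to compare leg weightings against a query set before changing anything in Qdrant. It is not Qdrant's implementation: the function name, the 1-based rank convention, and the tie-break by id are assumptions made for illustration.

```python
from collections import defaultdict


def weighted_rrf(legs, weights=None, k=60):
    """Fuse ranked id lists (best first) with weighted reciprocal rank fusion.

    Hypothetical helper for offline evaluation only. Ranks are 1-based here,
    which is an assumption and may differ from the server-side convention.
    """
    if weights is None:
        weights = [1.0] * len(legs)  # equal leg weights = plain RRF
    if len(weights) != len(legs):
        raise ValueError("need exactly one weight per leg")

    scores = defaultdict(float)
    for ranking, weight in zip(legs, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)

    # Highest fused score first; ties broken by id for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_leg = ["a", "b", "c"]
keyword_leg = ["c", "a", "d"]
equal = weighted_rrf([dense_leg, keyword_leg])
keyword_heavy = weighted_rrf([dense_leg, keyword_leg], weights=[1.0, 3.0])
```

With equal weights the fused order is `a, c, b, d`; tripling the keyword leg's weight moves `c` ahead of `a` and `d` ahead of `b`, which is the kind of shift a real-vault evaluation would need to score.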

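For item 2, a small sketch of what emitting unigrams alongside bigrams would look like for a single CJK run. The function name and the `with_unigrams` flag are hypothetical, and it assumes the current tokenizer emits a lone character as its own token; the real tokenizer may differ.

```python
def cjk_run_tokens(run: str, with_unigrams: bool = False) -> list[str]:
    """Tokenize one contiguous CJK run into character bigrams.

    Illustrative only. Assumes a one-character run is emitted as itself.
    With with_unigrams=True, every character is also emitted on its own.
    """
    if len(run) <= 1:
        return [run] if run else []
    bigrams = [run[i:i + 2] for i in range(len(run) - 1)]
    if not with_unigrams:
        return bigrams
    return list(run) + bigrams


query = set(cjk_run_tokens("猫"))
bigram_only = set(cjk_run_tokens("飼猫"))
with_unigrams = set(cjk_run_tokens("飼猫", with_unigrams=True))

# Bigram-only indexing shares no token with the single-character query.
no_match = query.isdisjoint(bigram_only)
# Adding unigrams gives the query something to hit.
match = not query.isdisjoint(with_unigrams)
```

The trade-off is a larger sparse vector per document and more matches on very common single characters, so unigrams may warrant a lower weight than bigrams.
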
Refs #595, #605, #610
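
For item 3, a back-of-the-envelope estimate of how many term pairs share a u32 bucket at a given vocabulary size. It assumes truncated HMAC output is uniformly distributed over the 2^32 dims (the usual birthday approximation); the function names are hypothetical, and this counts collisions, not their effect on ranking.

```python
import math

DIM_SPACE = 2 ** 32  # u32 dimension space


def expected_colliding_pairs(vocab_size: int, buckets: int = DIM_SPACE) -> float:
    """Expected number of term pairs hashed to the same bucket."""
    return vocab_size * (vocab_size - 1) / (2 * buckets)


def prob_any_collision(vocab_size: int, buckets: int = DIM_SPACE) -> float:
    """Poisson approximation of P(at least one collision)."""
    return -math.expm1(-expected_colliding_pairs(vocab_size, buckets))


pairs_at_1m = expected_colliding_pairs(1_000_000)
p_at_10k = prob_any_collision(10_000)
```

Under that assumption a one-million-term vocabulary has roughly 116 colliding pairs, while a 10,000-term vocabulary has about a 1% chance of any collision at all. Whether that many collisions moves rankings still has to be measured.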
