
Hybrid keyword search: quality refinements (RRF tuning, CJK unigrams, collision noise) #613

@Rasbandit

Non-blocking quality follow-ups from the #610 review:

  1. Tune fusion. Hybrid currently uses Qdrant default RRF (k=60, equal leg weights). Evaluate weighted RRF (Qdrant ≥1.17) or DBSF (≥1.11) once there's a representative query set — no public nDCG benchmark exists, so this needs real-vault evaluation.
  2. Single-char CJK queries. The tokenizer emits character bigrams for CJK runs, so a single-character CJK query won't match multi-char CJK content (飼猫 is indexed only as the bigram 飼猫). Consider also emitting CJK unigrams, or document the limitation.
  3. u32 collision noise at scale. HMAC dims are truncated to u32; per-user vocab in the low-millions starts accruing collisions (two terms → one bucket, weights summed — graceful ranking noise, not a correctness/security issue). Quantify the ranking-noise floor; widen the dim space if it bites.
  4. avgdl re-normalization (#605, "Backfill worker for notes_fts keyword index (+ Tiger Cloud migration tool)"): the current ReindexKeyword stub recomputes against a drifting avgdl during the pass; the real implementation should snapshot avgdl once and pass it in through the existing Bm25 injected-parameter seam.
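
For item 1, a minimal offline sketch of weighted RRF that could be run against a real-vault query set before touching the Qdrant config. It uses the textbook form, where each leg contributes `weight / (k + rank)` with ranks starting at 1; Qdrant's exact rank indexing and weighting may differ, so treat this as an evaluation aid rather than a description of its implementation. The function name and the sample result lists are hypothetical.

```python
from collections import defaultdict

def weighted_rrf(legs, weights, k=60):
    """Fuse ranked ID lists: each leg adds weight / (k + rank), rank starting at 1."""
    scores = defaultdict(float)
    for ranked_ids, weight in zip(legs, weights):
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical per-leg results for one query.
dense = ["a", "b", "c"]
sparse = ["c", "a", "d"]

equal = weighted_rrf([dense, sparse], [1.0, 1.0])          # current behaviour: equal weights
keyword_heavy = weighted_rrf([dense, sparse], [1.0, 3.0])  # favour the keyword leg
print(equal)
print(keyword_heavy)
```

With equal weights the fused order is `a, c, b, d`; tripling the keyword-leg weight moves it to `c, a, d, b`, which is the kind of shift an nDCG comparison on a real query set would need to judge.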

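For item 2, a rough sketch of the bigram-only miss and of what adding unigrams would change. This is not the project's tokenizer: the character range, the function name, and the fallback that emits a lone character as its own term are all assumptions, and 猫 is used only as an illustrative query character.

```python
import re

# Assumed CJK range for illustration: Hiragana, Katakana, CJK Unified Ideographs.
CJK_RUN = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]+")

def cjk_terms(text, unigrams=False):
    """Return terms for each CJK run: bigrams, optionally plus unigrams."""
    terms = []
    for run in CJK_RUN.findall(text):
        bigrams = [run[i:i + 2] for i in range(len(run) - 1)]
        if unigrams:
            terms.extend(run)      # one term per character
            terms.extend(bigrams)  # plus overlapping bigrams
        else:
            # Assumption: a one-character run falls back to the character itself.
            terms.extend(bigrams or [run])
    return terms

doc_bigram_only = set(cjk_terms("飼猫"))
doc_with_unigrams = set(cjk_terms("飼猫", unigrams=True))
query = set(cjk_terms("猫"))

print(query & doc_bigram_only)    # empty: the single-character query misses
print(query & doc_with_unigrams)  # the unigram term now overlaps
```

Emitting unigrams grows the per-document term count and the per-user vocabulary, which feeds back into the collision question in item 3, so the two are worth evaluating together.
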
Refs #595, #605, #610
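
For item 3, a back-of-the-envelope estimate of how many collisions to expect in a u32 dim space. It assumes truncated HMAC output behaves like a uniform hash over 2^32 buckets (the birthday approximation); it counts collisions only and says nothing about how much ranking noise each one causes, which still needs measuring. Function names are hypothetical.

```python
import math

DIM_SPACE = 2 ** 32  # number of u32 buckets

def expected_colliding_pairs(vocab_size, buckets=DIM_SPACE):
    """Expected number of term pairs sharing a bucket: C(n, 2) / buckets."""
    return vocab_size * (vocab_size - 1) / (2 * buckets)

def expected_terms_in_shared_bucket(vocab_size, buckets=DIM_SPACE):
    """Expected number of terms that share a bucket with at least one other term."""
    # P(a given term is alone) = (1 - 1/buckets) ** (n - 1), computed stably.
    p_shared = -math.expm1((vocab_size - 1) * math.log1p(-1.0 / buckets))
    return vocab_size * p_shared

for n in (1_000_000, 3_000_000):
    pairs = expected_colliding_pairs(n)
    shared = expected_terms_in_shared_bucket(n)
    print(f"vocab={n}: ~{pairs:.0f} colliding pairs, ~{shared:.0f} affected terms")
```

Under these assumptions a one-million-term vocabulary gives roughly 116 colliding pairs and three million gives roughly 1,048, so collisions grow quadratically while remaining a small fraction of the vocabulary.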
