
Fix buffered MinHashLSH query aggregation across storage backends #307

Merged
ekzhu merged 6 commits into ekzhu:master from dipeshbabu:fix/lsh-query-buffer-union
Apr 14, 2026

Conversation

Contributor

@dipeshbabu dipeshbabu commented Mar 31, 2026

Summary

Fix MinHashLSH.collect_query_buffer() so buffered queries aggregate candidates the same way as repeated calls to query(), including when
using the Cassandra storage backend.

Problem

The buffered query path was intersecting per-band result sets directly. That is stricter than normal LSH query behavior, which unions
candidates across bands for a query and only then intersects across multiple buffered queries.

This caused valid candidates to be dropped when using buffered queries.

The Cassandra backend also exposed a related issue in buffered selects: repeated buffered lookups with the same hash key could be collapsed
instead of preserving one result list per buffered query. That breaks per-query aggregation logic.

Fix

  • union bucket hits across bands for each buffered query
  • intersect only the per-query candidate sets across the buffer
  • preserve existing prepickle behavior
  • make Cassandra buffered selects preserve query order and count, including duplicate hash-key lookups
  • replace a broken LSH Forest documentation link with a stable reference
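
A minimal sketch of the aggregation order described in the first two bullets, written in plain Python rather than against the datasketch code. The function names and the `per_band_results[band][query]` layout are assumptions made for illustration; they are not taken from datasketch/lsh.py.

```python
def aggregate_buffered(per_band_results):
    """Union across bands per query, then intersect across buffered queries.

    per_band_results[b][q] holds the keys found in band b for buffered
    query q (hypothetical layout, chosen for this sketch).
    """
    # Transpose so each item groups all band hits for one buffered query.
    per_query = zip(*per_band_results)
    # A key is a candidate for a query if it collides in ANY band.
    candidate_sets = [set().union(*band_hits) for band_hits in per_query]
    if not candidate_sets:
        return []
    # Only the per-query candidate sets are intersected.
    return sorted(set.intersection(*candidate_sets))


def intersect_every_band(per_band_results):
    """The stricter behaviour described under Problem: intersect all band hits."""
    flat = [set(hits) for band in per_band_results for hits in band]
    if not flat:
        return []
    return sorted(set.intersection(*flat))


# Two bands, one buffered query: key 1 collides only in band 0.
per_band = [
    [[0, 1]],  # band 0
    [[0]],     # band 1
]
fixed = aggregate_buffered(per_band)
strict = intersect_every_band(per_band)
print(fixed, strict)
```

With this input the union-first version keeps both keys, while the intersect-everything version loses key 1 because it is missing from band 1.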

Test

  • add a regression test showing collect_query_buffer() returns the same candidates as query() for a case where the old implementation
    dropped a valid match

Verification

  • confirmed with a direct local repro that buffered and non-buffered query paths now both return [0, 1]
  • ran uvx ruff check .
  • ran the README test command uv run pytest in Linux/WSL: 158 passed, 76 skipped
  • verified the docs link check fix locally with lychee


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the collect_query_buffer method in datasketch/lsh.py to correctly process buffered queries by unioning candidates across bands for each query before intersecting across the buffer. It also adds a test case to verify that buffered results match direct query results. Feedback identifies a potential bug where the use of zip could truncate results when using the Cassandra storage backend and suggests a more efficient implementation for the set().union() call.

Comment thread datasketch/lsh.py
Comment thread datasketch/lsh.py Outdated

codecov-commenter commented Mar 31, 2026


Codecov Report

❌ Patch coverage is 61.90476% with 8 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (master@1e6c764). Learn more about missing BASE report.

Files with missing lines    Patch %   Lines
datasketch/storage.py       0.00%     6 Missing ⚠️
datasketch/lsh.py           87.50%    1 Missing ⚠️
datasketch/lshforest.py     85.71%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##             master     #307   +/-   ##
=========================================
  Coverage          ?   77.44%           
=========================================
  Files             ?       15           
  Lines             ?     2062           
  Branches          ?        0           
=========================================
  Hits              ?     1597           
  Misses            ?      465           
  Partials          ?        0           
Flag        Coverage Δ
unittests   77.44% <61.90%> (?)


@dipeshbabu dipeshbabu changed the title from "Fix MinHashLSH collect_query_buffer candidate aggregation" to "Fix buffered MinHashLSH query aggregation across storage backends" Mar 31, 2026
Owner

@ekzhu ekzhu left a comment


Review

The fix is correct and the regression test is well-targeted. I verified the diagnosis by reading both code paths and by running the new test against master's buggy code — it fails with exactly the symptom the PR describes (key 1 missing from the buffered result), and passes on this branch. The non-backend test_lsh.py suite is green locally (19 passing).

datasketch/lsh.py — looks good

The original collect_query_buffer() was flattening per-band result lists into one list of sets and intersecting all of them, which intersects across bands. That contradicts query() at datasketch/lsh.py:425-432, which unions across bands and only uses intersection implicitly via the multi-query case. Any candidate that landed in only some bands' buckets was being silently dropped on the buffered path. The new sequence (transpose with zip(*collected_result_lists), union per query, intersect across queries) is the right shape and matches query() semantics exactly.

datasketch/storage.py — necessary companion fix

Worth highlighting why this change is required and not just incidental cleanup: the new lsh.py code relies on zip(*…) lining up, which means each hashtable's collect_select_buffer() must return one entry per buffered statement, in order. The old Cassandra implementation bucketed rows by key_decoder(row.key) into a defaultdict, so two buffered queries hashing to the same band key would collapse into a single entry; zip() would then truncate and silently drop later queries. The in-memory Storage.collect_select_buffer at storage.py:192-198 already preserves order/count via getmany(*keys), so the Cassandra path now matches. Renaming key_decoder to _key_decoder is fine since it's genuinely unused now.
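
To make the truncation concrete, here is a small plain-Python sketch. It does not use the Cassandra driver or the datasketch storage classes; the helper names, the row tuples, and the keys are all hypothetical stand-ins for the grouping behaviour described above.

```python
from collections import defaultdict


def collect_grouped_by_key(statement_rows):
    """Group rows by hash key (stand-in for the old defaultdict approach)."""
    grouped = defaultdict(list)
    for rows in statement_rows:
        for key, value in rows:
            grouped[key].append(value)
    # Duplicate hash keys share one bucket, so entries can be fewer than statements.
    return list(grouped.values())


def collect_per_statement(statement_rows):
    """Return one result list per buffered statement, in submission order."""
    return [[value for _key, value in rows] for rows in statement_rows]


# Three buffered queries; in band 0, queries 0 and 2 hash to the same key "h1".
band0_rows = [[("h1", "a")], [("h2", "b")], [("h1", "a")]]
# Band 1 (hypothetical) already returns one entry per query.
band1 = [["a"], ["c"], ["a"]]

collapsed = collect_grouped_by_key(band0_rows)  # 2 entries for 3 queries
ordered = collect_per_statement(band0_rows)     # 3 entries for 3 queries

# zip() stops at the shortest input, so the third query disappears silently.
truncated = list(zip(collapsed, band1))
aligned = list(zip(ordered, band1))
print(len(truncated), len(aligned))
```

The grouped variant yields only two entries for three buffered queries, so the transpose has nothing to pair with the third query, which is why the per-statement contract matters for the lsh.py change.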

One thing to double-check: the test suite doesn't exercise this Cassandra path under the new contract (no Cassandra integration test for buffered queries with duplicate hash keys). Since test-cassandra.yml runs against a real Cassandra in CI, it would be worth adding a small buffered-query test there if you want to lock the invariant down. Optional — the in-memory test already protects the lsh.py change.

docs/lshforest.rst — link-check is still failing

The link-check job on the latest commit (64f5a43) is failing — see job 71196853352. The PR description says the lychee fix was verified locally, but CI disagrees. Could be:

  • the new https://dblp.org/rec/conf/www/BawaCG05 URL returns a status not in the lychee --accept '200,203,206,403,429' allow-list from .github/workflows/checks.yml:35, or
  • there's an unrelated broken link the new run is now catching.

Worth pulling the lychee output from the failing job to confirm. As a side note, dblp's /rec/... page is a bibliographic record rather than the paper itself — if the goal is a stable, citeable reference you might prefer the canonical DOI page or the ACM DL entry for Bawa, Condie, Ganesan (WWW 2005). Not blocking, just a thought.

Test coverage suggestion (optional)

test_query_buffer_matches_query_candidates only exercises a single buffered query, which is exactly the regression case — good. One follow-up worth adding: a test with two buffered queries where the per-query candidate sets differ, to lock in the across-buffer intersection semantics from the docstring ("the intersection of the results of all query MinHash will be returned"). Not blocking.
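
As a rough illustration of what such a follow-up test would pin down, the sketch below checks the across-buffer intersection on hand-written bucket hits. It is plain Python, not a datasketch test: `buffered_candidates` and the `per_band[band][query]` layout are assumptions made for this example.

```python
def buffered_candidates(per_band_results):
    """Union band hits per query, then intersect the per-query sets."""
    per_query_sets = [set().union(*hits) for hits in zip(*per_band_results)]
    return set.intersection(*per_query_sets) if per_query_sets else set()


def test_two_buffered_queries_intersect():
    # per_band[band][query]: two bands, two buffered queries.
    per_band = [
        [["a", "b"], ["b"]],   # band 0
        [["c"], ["b", "c"]],   # band 1
    ]
    # Query 0 candidates: {a, b, c}; query 1 candidates: {b, c}.
    # Only keys that are candidates for BOTH queries survive.
    assert buffered_candidates(per_band) == {"b", "c"}


test_two_buffered_queries_intersect()
```

Key "a" is a candidate for the first query only, so it is excluded, while "b" and "c" remain even though neither appears in every band.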

Summary

Code change is correct, the storage fix is load-bearing (not optional), and the test is a true regression test. Main thing to resolve before merge is the link-check CI failure — everything else LGTM.


Generated by Claude Code

@dipeshbabu dipeshbabu requested a review from ekzhu April 14, 2026 04:05
@dipeshbabu
Contributor Author

@ekzhu could you review now?

@ekzhu ekzhu merged commit cb708a8 into ekzhu:master Apr 14, 2026
10 checks passed