Fix buffered MinHashLSH query aggregation across storage backends#307
Fix buffered MinHashLSH query aggregation across storage backends#307ekzhu merged 6 commits intoekzhu:masterfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the collect_query_buffer method in datasketch/lsh.py to correctly process buffered queries by unioning candidates across bands for each query before intersecting across the buffer. It also adds a test case to verify that buffered results match direct query results. Feedback identifies a potential bug where the use of zip could truncate results when using the Cassandra storage backend and suggests a more efficient implementation for the set().union() call.
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #307 +/- ##
=========================================
Coverage ? 77.44%
=========================================
Files ? 15
Lines ? 2062
Branches ? 0
=========================================
Hits ? 1597
Misses ? 465
Partials ? 0
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
ekzhu
left a comment
There was a problem hiding this comment.
Review
The fix is correct and the regression test is well-targeted. I verified the diagnosis by reading both code paths and by running the new test against master's buggy code — it fails with exactly the symptom the PR describes (key 1 missing from the buffered result), and passes on this branch. The non-backend test_lsh.py suite is green locally (19 passing).
datasketch/lsh.py — looks good
The original collect_query_buffer() was flattening per-band result lists into one list of sets and intersecting all of them, which intersects across bands. That contradicts query() at datasketch/lsh.py:425-432, which unions across bands and only uses intersection implicitly via the multi-query case. Any candidate that landed in only some bands' buckets was being silently dropped on the buffered path. The new zip(*collected_result_lists) transpose, union per query, intersect across queries is the right shape and matches query() semantics exactly.
datasketch/storage.py — necessary companion fix
Worth highlighting why this change is required and not just incidental cleanup: the new lsh.py code relies on zip(*…) lining up, which means each hashtable's collect_select_buffer() must return one entry per buffered statement, in order. The old Cassandra implementation bucketed rows by key_decoder(row.key) into a defaultdict, so two buffered queries hashing to the same band key would collapse into a single entry — zip() would then truncate and silently drop later queries. The in-memory Storage.collect_select_buffer at storage.py:192-198 already preserves order/count via getmany(*keys), so the Cassandra path now matches. Renaming key_decoder → _key_decoder is fine since it's genuinely unused now.
One thing to double-check: the test suite doesn't exercise this Cassandra path under the new contract (no Cassandra integration test for buffered queries with duplicate hash keys). Since test-cassandra.yml runs against a real Cassandra in CI, it would be worth adding a small buffered-query test there if you want to lock the invariant down. Optional — the in-memory test already protects the lsh.py change.
docs/lshforest.rst — link-check is still failing
The link-check job on the latest commit (64f5a43) is failing — see job 71196853352. The PR description says the lychee fix was verified locally, but CI disagrees. Could be:
- the new
https://dblp.org/rec/conf/www/BawaCG05URL returns a status not in the lychee--accept '200,203,206,403,429'allow-list from.github/workflows/checks.yml:35, or - there's an unrelated broken link the new run is now catching.
Worth pulling the lychee output from the failing job to confirm. As a side note, dblp's /rec/... page is a bibliographic record rather than the paper itself — if the goal is a stable, citeable reference you might prefer the canonical DOI page or the ACM DL entry for Bawa, Condie, Ganesan (WWW 2005). Not blocking, just a thought.
Test coverage suggestion (optional)
test_query_buffer_matches_query_candidates only exercises a single buffered query, which is exactly the regression case — good. One follow-up worth adding: a test with two buffered queries where the per-query candidate sets differ, to lock in the across-buffer intersection semantics from the docstring ("the intersection of the results of all query MinHash will be returned"). Not blocking.
Summary
Code change is correct, the storage fix is load-bearing (not optional), and the test is a true regression test. Main thing to resolve before merge is the link-check CI failure — everything else LGTM.
Generated by Claude Code
|
@ekzhu could you review now? |
Summary
Fix
MinHashLSH.collect_query_buffer()so buffered queries aggregate candidates the same way as repeated calls toquery(), including whenusing the Cassandra storage backend.
Problem
The buffered query path was intersecting per-band result sets directly. That is stricter than normal LSH query behavior, which unions
candidates across bands for a query and only then intersects across multiple buffered queries.
This caused valid candidates to be dropped when using buffered queries.
The Cassandra backend also exposed a related issue in buffered selects: repeated buffered lookups with the same hash key could be collapsed
instead of preserving one result list per buffered query. That breaks per-query aggregation logic.
Fix
prepicklebehaviorTest
collect_query_buffer()returns the same candidates asquery()for a case where the old implementationdropped a valid match
Verification
[0, 1]uvx ruff check .uv run pytestin Linux/WSL:158 passed, 76 skippedlychee