Fix buffered MinHashLSH query aggregation across storage backends by dipeshbabu · Pull Request #307 · ekzhu/datasketch

dipeshbabu · 2026-03-31T07:54:52Z

Summary

Fix MinHashLSH.collect_query_buffer() so buffered queries aggregate candidates the same way as repeated calls to query(), including when
using the Cassandra storage backend.

Problem

The buffered query path was intersecting per-band result sets directly. That is stricter than normal LSH query behavior, which unions
candidates across bands for a query and only then intersects across multiple buffered queries.

This caused valid candidates to be dropped when using buffered queries.

The Cassandra backend also exposed a related issue in buffered selects: repeated buffered lookups with the same hash key could be collapsed
instead of preserving one result list per buffered query. That breaks per-query aggregation logic.

Fix

- union bucket hits across bands for each buffered query
- intersect only the per-query candidate sets across the buffer
- preserve existing prepickle behavior
- make Cassandra buffered selects preserve query order and count, including duplicate hash-key lookups
- replace a broken LSH Forest documentation link with a stable reference
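The difference between the old and new aggregation can be illustrated with plain Python sets. This is a minimal sketch, not the code from the patch: the function names `flat_intersection` and `union_then_intersect` and the nested-list input shape are assumptions made for illustration only.

```python
def flat_intersection(per_query_band_hits):
    """Model of the old buffered path: intersect every per-band result set."""
    all_sets = [set(band) for bands in per_query_band_hits for band in bands]
    return set.intersection(*all_sets) if all_sets else set()


def union_then_intersect(per_query_band_hits):
    """Model of the fixed path: union bands per query, then intersect queries."""
    per_query = [set().union(*bands) for bands in per_query_band_hits]
    return set.intersection(*per_query) if per_query else set()


# One buffered query; key 0 is found in both bands, key 1 only in the first.
hits = [[[0, 1], [0]]]

old_result = sorted(flat_intersection(hits))     # key 1 is dropped
new_result = sorted(union_then_intersect(hits))  # key 1 is kept
print(old_result, new_result)
```

With a single buffered query, the second function reduces to a plain union across bands, which is the behavior the description attributes to query().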

Test

- add a regression test showing collect_query_buffer() returns the same candidates as query() for a case where the old implementation dropped a valid match

Verification

- confirmed with a direct local repro that buffered and non-buffered query paths now both return [0, 1]
- ran uvx ruff check .
- ran the README test command uv run pytest in Linux/WSL: 158 passed, 76 skipped
- verified the docs link check fix locally with lychee

gemini-code-assist

Code Review

This pull request refactors the collect_query_buffer method in datasketch/lsh.py to correctly process buffered queries by unioning candidates across bands for each query before intersecting across the buffer. It also adds a test case to verify that buffered results match direct query results. Feedback identifies a potential bug where the use of zip could truncate results when using the Cassandra storage backend and suggests a more efficient implementation for the set().union() call.

codecov-commenter · 2026-03-31T08:04:14Z

Codecov Report

❌ Patch coverage is 61.90476% with 8 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (master@1e6c764).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| datasketch/storage.py | 0.00% | 6 Missing ⚠️ |
| datasketch/lsh.py | 87.50% | 1 Missing ⚠️ |
| datasketch/lshforest.py | 85.71% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff            @@
##             master     #307   +/-   ##
=========================================
  Coverage          ?   77.44%           
=========================================
  Files             ?       15           
  Lines             ?     2062           
  Branches          ?        0           
=========================================
  Hits              ?     1597           
  Misses            ?      465           
  Partials          ?        0

| Flag | Coverage Δ |
|---|---|
| unittests | `77.44% <61.90%> (?)` |

ekzhu

Review

The fix is correct and the regression test is well-targeted. I verified the diagnosis by reading both code paths and by running the new test against master's buggy code — it fails with exactly the symptom the PR describes (key 1 missing from the buffered result), and passes on this branch. The non-backend test_lsh.py suite is green locally (19 passing).

`datasketch/lsh.py` — looks good

The original collect_query_buffer() flattened the per-band result lists into one list of sets and intersected all of them, which intersects across bands. That contradicts query() at datasketch/lsh.py:425-432, which unions across bands; intersection applies only across multiple buffered queries. Any candidate that landed in only some bands' buckets was silently dropped on the buffered path. The new sequence (transpose with zip(*collected_result_lists), union per query, intersect across queries) is the right shape and matches query() semantics exactly.
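The transpose step is easiest to see on a small example. The following is a hedged sketch of the shape described above, not the merged implementation: `collect` and the `per_table_results` layout (one list per band, each holding one result list per buffered query) are assumptions for illustration.

```python
def collect(per_table_results):
    """Sketch: per_table_results[b][q] holds keys found in band b for query q."""
    # zip(*...) transposes band-major results into one tuple of band hits per query.
    per_query_bands = zip(*per_table_results)
    # Union across bands gives each query's candidate set.
    per_query_candidates = [set().union(*bands) for bands in per_query_bands]
    if not per_query_candidates:
        return []
    # Intersect across buffered queries, as the docstring quoted in this review describes.
    return sorted(set.intersection(*per_query_candidates))


per_table = [
    [["a", "b"], ["b"]],   # band 0: query 0 hits a and b, query 1 hits b
    [["a"], ["b", "c"]],   # band 1: query 0 hits a, query 1 hits b and c
]

# Query 0 candidates are {a, b}, query 1 candidates are {b, c}.
result = collect(per_table)
print(result)
```

The transpose only lines up if every band returns exactly one entry per buffered query, in the same order.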

`datasketch/storage.py` — necessary companion fix

Worth highlighting why this change is required and not just incidental cleanup: the new lsh.py code relies on zip(*…) lining up, which means each hashtable's collect_select_buffer() must return one entry per buffered statement, in order. The old Cassandra implementation bucketed rows by key_decoder(row.key) into a defaultdict, so two buffered queries hashing to the same band key would collapse into a single entry — zip() would then truncate and silently drop later queries. The in-memory Storage.collect_select_buffer at storage.py:192-198 already preserves order/count via getmany(*keys), so the Cassandra path now matches. Renaming key_decoder → _key_decoder is fine since it's genuinely unused now.
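The truncation hazard can be reproduced without Cassandra. This is a simplified model under stated assumptions, not the storage code itself: `collapsed`, `ordered`, `aggregate`, and the `(key, values)` pairs standing in for per-statement results are all hypothetical names chosen for this sketch.

```python
from collections import defaultdict


def collapsed(statement_results):
    """Group results by hash key; duplicate keys merge into a single entry."""
    grouped = defaultdict(list)
    for key, values in statement_results:
        grouped[key].extend(values)
    return list(grouped.values())


def ordered(statement_results):
    """Keep one result list per buffered statement, in buffer order."""
    return [list(values) for _key, values in statement_results]


def aggregate(per_band_lists):
    """Transpose with zip, union per query, intersect across queries."""
    per_query = [set().union(*bands) for bands in zip(*per_band_lists)]
    return sorted(set.intersection(*per_query)) if per_query else []


# Two buffered queries. In band 0 both hash to the same key "h1".
band0 = [("h1", ["x"]), ("h1", ["x"])]
band1 = [("h2", ["x", "z"]), ("h3", ["y"])]

# Band 0 collapses to one entry, so zip() stops after the first query.
truncated = aggregate([collapsed(band0), collapsed(band1)])
# With one entry per statement, both queries take part in the intersection.
correct = aggregate([ordered(band0), ordered(band1)])
print(truncated, correct)
```

In the collapsed case the second query never constrains the result, so "z" survives even though the second query did not match it.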

One thing to double-check: the test suite doesn't exercise this Cassandra path under the new contract (no Cassandra integration test for buffered queries with duplicate hash keys). Since test-cassandra.yml runs against a real Cassandra in CI, it would be worth adding a small buffered-query test there if you want to lock the invariant down. Optional — the in-memory test already protects the lsh.py change.

`docs/lshforest.rst` — link-check is still failing

The link-check job on the latest commit (64f5a43) is failing — see job 71196853352. The PR description says the lychee fix was verified locally, but CI disagrees. Could be:

- the new https://dblp.org/rec/conf/www/BawaCG05 URL returns a status not in the lychee --accept '200,203,206,403,429' allow-list from .github/workflows/checks.yml:35, or
- there's an unrelated broken link the new run is now catching.

Worth pulling the lychee output from the failing job to confirm. As a side note, dblp's /rec/... page is a bibliographic record rather than the paper itself — if the goal is a stable, citeable reference you might prefer the canonical DOI page or the ACM DL entry for Bawa, Condie, Ganesan (WWW 2005). Not blocking, just a thought.

Test coverage suggestion (optional)

test_query_buffer_matches_query_candidates only exercises a single buffered query, which is exactly the regression case — good. One follow-up worth adding: a test with two buffered queries where the per-query candidate sets differ, to lock in the across-buffer intersection semantics from the docstring ("the intersection of the results of all query MinHash will be returned"). Not blocking.

Summary

Code change is correct, the storage fix is load-bearing (not optional), and the test is a true regression test. Main thing to resolve before merge is the link-check CI failure — everything else LGTM.

Generated by Claude Code

dipeshbabu · 2026-04-14T04:06:48Z

@ekzhu could you review now?

Fix MinHashLSH buffered query candidate aggregation

5c24d06

gemini-code-assist bot reviewed Mar 31, 2026

Fix Ruff lint in MinHashLSHEnsemble

a537760

dipeshbabu added 2 commits March 31, 2026 04:07

Cover prepickled buffered MinHashLSH queries

cda9ba5

Fix buffered LSH queries and broken docs link

03c6ce8

dipeshbabu changed the title ~~Fix MinHashLSH collect_query_buffer candidate aggregation~~ Fix buffered MinHashLSH query aggregation across storage backends Mar 31, 2026

Merge branch 'master' into fix/lsh-query-buffer-union

64f5a43

ekzhu reviewed Apr 14, 2026

Remove rate-limited LSH Forest paper link

fa5baba

dipeshbabu requested a review from ekzhu April 14, 2026 04:05

ekzhu approved these changes Apr 14, 2026

ekzhu merged commit cb708a8 into ekzhu:master Apr 14, 2026
10 checks passed

ekzhu merged 6 commits into ekzhu:master from dipeshbabu:fix/lsh-query-buffer-union

3 participants
