feat: bloom filter building from left table data #30
Conversation
- Build local bloom filter during Phase 1 left table scan
- Collect and OR-aggregate bits via BloomFilterBits RPC
- Broadcast aggregated filter via BloomFilterPush before Phase 2
- Apply sender-side filtering before PushData in Phase 2
📝 Walkthrough

This change introduces a distributed bloom filter collection mechanism. New ClusterManager APIs enable per-context storage and retrieval of local bloom filter bits. A new RPC message type (BloomFilterBits) carries filter state across nodes. The distributed executor now aggregates per-node bloom filters via bitwise OR during Phase 2. Local bloom filter construction occurs during shuffle operations and is persisted via ClusterManager.

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Coordinator
    participant DataNode1
    participant DataNode2
    participant ClusterMgr as Cluster Manager
    Coordinator->>DataNode1: Phase 1: Execute shuffle (RPC)
    DataNode1->>DataNode1: Scan tuples & build local bloom filter
    DataNode1->>ClusterMgr: set_local_bloom_bits(context_id, bits, expected, hashes)
    ClusterMgr->>ClusterMgr: Store filter bits & metadata
    Coordinator->>DataNode2: Phase 1: Execute shuffle (RPC)
    DataNode2->>DataNode2: Scan tuples & build local bloom filter
    DataNode2->>ClusterMgr: set_local_bloom_bits(context_id, bits, expected, hashes)
    ClusterMgr->>ClusterMgr: Store filter bits & metadata
    Note over Coordinator: Phase 2: Aggregate
    Coordinator->>DataNode1: Retrieve bloom bits (RpcType::BloomFilterBits)
    DataNode1->>ClusterMgr: get_local_bloom_bits(context_id)
    ClusterMgr-->>DataNode1: Return stored bits
    DataNode1-->>Coordinator: BloomFilterBitsArgs response
    Coordinator->>DataNode2: Retrieve bloom bits (RpcType::BloomFilterBits)
    DataNode2->>ClusterMgr: get_local_bloom_bits(context_id)
    ClusterMgr-->>DataNode2: Return stored bits
    DataNode2-->>Coordinator: BloomFilterBitsArgs response
    Coordinator->>Coordinator: Aggregate: OR all bits, sum expected, max hashes
    Coordinator->>DataNode1: Broadcast aggregated filter (RpcType::BloomFilterPush)
    Coordinator->>DataNode2: Broadcast aggregated filter (RpcType::BloomFilterPush)
```
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
include/common/cluster_manager.hpp (1)
321-327: ⚠️ Potential issue | 🟡 Minor: Local bloom filter entries are never cleaned up.
`clear_bloom_filter` cleans up `bloom_filters_` but doesn't clean up `local_bloom_bits_`. Over time, entries will accumulate for completed contexts.

♻️ Proposed fix

```diff
 void clear_bloom_filter(const std::string& context_id) {
   const std::scoped_lock<std::mutex> lock(mutex_);
   bloom_filters_.erase(context_id);
+  local_bloom_bits_.erase(context_id);
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/common/cluster_manager.hpp` around lines 321 - 327, clear_bloom_filter currently erases from bloom_filters_ but neglects to remove entries from local_bloom_bits_, causing leaked per-context state; update clear_bloom_filter to also erase local_bloom_bits_. Use the same std::scoped_lock<std::mutex> lock(mutex_) already present to ensure thread safety and call local_bloom_bits_.erase(context_id) (or equivalent) alongside bloom_filters_.erase(context_id) inside the function.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@include/common/cluster_manager.hpp`:
- Around line 285-291: The method set_local_bloom_bits stores bits per
context_id but writes global metadata fields (local_expected_elements_ and
local_num_hashes_), causing races; change the metadata to be stored per-context
(e.g., replace local_expected_elements_ and local_num_hashes_ with maps keyed by
context_id, such as local_expected_elements_map_ and local_num_hashes_map_) and
assign into those maps inside set_local_bloom_bits (holding mutex_), and update
any corresponding getters/accessors and the analogous code mentioned for lines
353-356 to read metadata from the per-context maps instead of the old globals so
each context's bits/metadata remain matched and thread-safe.
In `@src/main.cpp`:
- Around line 519-544: The BloomFilterBits RPC handler currently uses global
metadata getters get_local_expected_elements() and get_local_num_hashes(),
causing mismatched metadata for multiple contexts; after ClusterManager is
updated to store per-context values, change those calls in the BloomFilterBits
handler to call the per-context variants (pass args.context_id) so
reply_args.expected_elements =
cluster_manager->get_local_expected_elements(args.context_id) and
reply_args.num_hashes = cluster_manager->get_local_num_hashes(args.context_id),
keeping reply_args.context_id and filter_data assignment via
get_local_bloom_bits(args.context_id) as-is.
📒 Files selected for processing (5)

- include/common/cluster_manager.hpp
- include/network/rpc_message.hpp
- src/distributed/distributed_executor.cpp
- src/main.cpp
- tests/distributed_tests.cpp
- Store expected_elements and num_hashes in per-context maps instead of globals
- Update set_local_bloom_bits to write to per-context maps
- Update getters to take context_id parameter
- clear_bloom_filter now also erases local_bloom_bits and metadata maps
- Update BloomFilterBits handler to use per-context getters
The distributed shuffle join algorithm only supports INNER joins. LEFT, RIGHT, and FULL outer joins require different handling (e.g., broadcasting the outer table, or double-shuffle with side tables) that is not yet implemented. Instead of producing incorrect results, we now return a clear error message. Also add unit tests RightJoinRejection and FullJoinRejection to verify this behavior.
The distributed shuffle join can correctly handle LEFT joins in the current implementation because each node executes the query locally and LEFT join only requires preserving unmatched left-table rows (which are already local to each node). RIGHT and FULL joins require tracking unmatched rows across partitions which is not yet implemented. Update error message to say "INNER and LEFT" instead of just "INNER".
Force-pushed from c92194f to e35159e
- Skip bloom filter for RIGHT/FULL joins to prevent false negatives
- Add integration tests for bloom filter skip
- Allow LEFT joins in distributed shuffle join

The bloom filter can cause false negatives (rows filtered when they shouldn't be), which violates outer join semantics. For RIGHT/FULL joins, all right rows must be sent to ensure correct results.

NOTE: Phase 3-5 for collecting unmatched outer rows is temporarily disabled due to column indexing issues with non-SELECT * queries. The local executor on each data node handles unmatched right rows correctly for RIGHT JOIN. FULL JOIN unmatched left rows will be implemented separately.
Force-pushed from ef30b18 to c171913
Summary
Test plan
Summary by CodeRabbit
New Features
Tests