
feat: bloom filter building from left table data#30

Merged
poyrazK merged 10 commits into main from feature/bloom-filter-building-v2 on Apr 15, 2026

Conversation


poyrazK (Owner) commented Apr 13, 2026

Summary

  • Build local bloom filter during Phase 1 left table scan on each data node
  • Collect and OR-aggregate bloom filter bits via new BloomFilterBits RPC
  • Broadcast aggregated filter via BloomFilterPush before Phase 2
  • Apply sender-side filtering before PushData in Phase 2 to avoid sending rows that can't match
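
The sender-side filtering step relies on standard bloom filter operations: set k bits per key when building, and test those same k bits before sending. A minimal sketch follows; the class and double-hashing scheme are illustrative assumptions, not the PR's actual implementation.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Illustrative minimal bloom filter; names and hashing scheme are
// assumptions for this sketch, not the repository's actual code.
struct BloomFilter {
    std::vector<uint8_t> bits;  // packed bit array
    size_t num_hashes;

    BloomFilter(size_t num_bits, size_t k)
        : bits((num_bits + 7) / 8, 0), num_hashes(k) {}

    // Derive k probe positions from two base hashes (double hashing).
    size_t position(const std::string& key, size_t i) const {
        uint64_t h1 = std::hash<std::string>{}(key);
        uint64_t h2 = h1 * 0x9e3779b97f4a7c15ULL + 1;
        return static_cast<size_t>((h1 + i * h2) % (bits.size() * 8));
    }

    void add(const std::string& key) {
        for (size_t i = 0; i < num_hashes; ++i) {
            size_t p = position(key, i);
            bits[p / 8] |= static_cast<uint8_t>(1u << (p % 8));
        }
    }

    // False positives are possible; false negatives are not, which is
    // what makes pre-send filtering safe for inner joins.
    bool may_contain(const std::string& key) const {
        for (size_t i = 0; i < num_hashes; ++i) {
            size_t p = position(key, i);
            if (!(bits[p / 8] & (1u << (p % 8)))) return false;
        }
        return true;
    }
};
```

In Phase 2 a sender would test each row's join key with may_contain and skip PushData for rows that return false.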

Test plan

  • Unit tests pass (distributed_tests.cpp)
  • E2E join tests pass

Summary by CodeRabbit

  • New Features

    • Enhanced distributed query execution with improved per-node Bloom filter collection and aggregation across database nodes.
    • Added support for retrieving local Bloom filter state from individual nodes with associated metadata parameters.
    • Optimized Phase-2 Bloom filter broadcasting by aggregating per-node filter data instead of using metadata-only filters.
  • Tests

    • Added RPC handler tests for Bloom filter retrieval functionality in distributed query scenarios.

- Build local bloom filter during Phase 1 left table scan
- Collect and OR-aggregate bits via BloomFilterBits RPC
- Broadcast aggregated filter via BloomFilterPush before Phase 2
- Apply sender-side filtering before PushData in Phase 2

coderabbitai bot commented Apr 13, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8dc3f73e-92ec-4288-916d-c5c8df7eedd2

📥 Commits

Reviewing files that changed from the base of the PR and between 1a2874a and 654d198.

📒 Files selected for processing (9)
  • docs/performance/SQLITE_COMPARISON.md
  • docs/phases/PHASE_6_DISTRIBUTED_JOIN.md
  • include/common/cluster_manager.hpp
  • include/executor/operator.hpp
  • include/network/rpc_message.hpp
  • src/distributed/distributed_executor.cpp
  • src/executor/operator.cpp
  • src/main.cpp
  • tests/distributed_tests.cpp
📝 Walkthrough

Walkthrough

This change introduces a distributed bloom filter collection mechanism. New ClusterManager APIs enable per-context storage and retrieval of local bloom filter bits. A new RPC message type (BloomFilterBits) carries filter state across nodes. The distributed executor now aggregates per-node bloom filters via bitwise OR during Phase 2. Local bloom filter construction occurs during shuffle operations and is persisted via ClusterManager.

Changes

  • Bloom Filter Storage API (include/common/cluster_manager.hpp): Added four new public methods: set_local_bloom_bits() stores per-context filter bits and metadata, get_local_bloom_bits() retrieves filter bits, and get_local_expected_elements()/get_local_num_hashes() expose aggregate parameters. New internal state: a local_bloom_bits_ map and scalar metadata fields.
  • RPC Protocol Extension (include/network/rpc_message.hpp): Extended the RpcType enum with a new value, BloomFilterBits = 12. Added a BloomFilterBitsArgs struct with context_id, filter_data, expected_elements, and num_hashes fields plus serialize/deserialize methods.
  • Distributed Aggregation Logic (src/distributed/distributed_executor.cpp): Replaced the Phase-2 bloom-filter behavior to retrieve per-node filter bits via RpcType::BloomFilterBits, aggregate results by bitwise OR, compute the total expected elements and maximum hash count, then broadcast the aggregated filter.
  • Local Construction & Persistence (src/main.cpp): Added an RPC handler for the BloomFilterBits type. Introduced local bloom filter construction during shuffle/partitioning by scanning tuples, storing filter bits and metadata via ClusterManager::set_local_bloom_bits().
  • Test Support (tests/distributed_tests.cpp): Registered a mock RPC handler for BloomFilterBits in the ShuffleJoinOrchestration test. The handler deserializes requests and responds with empty filter data and default parameters.
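
A length-prefixed layout is one plausible shape for the BloomFilterBitsArgs serialization described above. The field names follow the walkthrough, but this sketch assumes little-endian length-prefixed encoding and does not reflect the actual code in rpc_message.hpp.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical wire format for BloomFilterBitsArgs; an assumption for
// illustration, not the repository's actual serialization.
struct BloomFilterBitsArgs {
    std::string context_id;
    std::vector<uint8_t> filter_data;
    uint64_t expected_elements = 0;
    uint32_t num_hashes = 0;

    std::vector<uint8_t> serialize() const {
        std::vector<uint8_t> out;
        auto put_u64 = [&](uint64_t v) {
            for (int i = 0; i < 8; ++i) out.push_back((v >> (i * 8)) & 0xff);
        };
        put_u64(context_id.size());
        out.insert(out.end(), context_id.begin(), context_id.end());
        put_u64(filter_data.size());
        out.insert(out.end(), filter_data.begin(), filter_data.end());
        put_u64(expected_elements);
        put_u64(num_hashes);
        return out;
    }

    static BloomFilterBitsArgs deserialize(const std::vector<uint8_t>& in) {
        BloomFilterBitsArgs args;
        size_t pos = 0;
        auto get_u64 = [&]() {
            uint64_t v = 0;
            for (int i = 0; i < 8; ++i) v |= uint64_t(in[pos++]) << (i * 8);
            return v;
        };
        uint64_t id_len = get_u64();
        args.context_id.assign(in.begin() + pos, in.begin() + pos + id_len);
        pos += id_len;
        uint64_t data_len = get_u64();
        args.filter_data.assign(in.begin() + pos, in.begin() + pos + data_len);
        pos += data_len;
        args.expected_elements = get_u64();
        args.num_hashes = static_cast<uint32_t>(get_u64());
        return args;
    }
};
```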

Sequence Diagram

sequenceDiagram
    participant Coordinator
    participant DataNode1
    participant DataNode2
    participant ClusterMgr as Cluster Manager

    Coordinator->>DataNode1: Phase 1: Execute shuffle (RPC)
    DataNode1->>DataNode1: Scan tuples & build local bloom filter
    DataNode1->>ClusterMgr: set_local_bloom_bits(context_id, bits, expected, hashes)
    ClusterMgr->>ClusterMgr: Store filter bits & metadata

    Coordinator->>DataNode2: Phase 1: Execute shuffle (RPC)
    DataNode2->>DataNode2: Scan tuples & build local bloom filter
    DataNode2->>ClusterMgr: set_local_bloom_bits(context_id, bits, expected, hashes)
    ClusterMgr->>ClusterMgr: Store filter bits & metadata

    Note over Coordinator: Phase 2: Aggregate
    Coordinator->>DataNode1: Retrieve bloom bits (RpcType::BloomFilterBits)
    DataNode1->>ClusterMgr: get_local_bloom_bits(context_id)
    ClusterMgr-->>DataNode1: Return stored bits
    DataNode1-->>Coordinator: BloomFilterBitsArgs response

    Coordinator->>DataNode2: Retrieve bloom bits (RpcType::BloomFilterBits)
    DataNode2->>ClusterMgr: get_local_bloom_bits(context_id)
    ClusterMgr-->>DataNode2: Return stored bits
    DataNode2-->>Coordinator: BloomFilterBitsArgs response

    Coordinator->>Coordinator: Aggregate: OR all bits, sum expected, max hashes
    Coordinator->>DataNode1: Broadcast aggregated filter (RpcType::BloomFilterPush)
    Coordinator->>DataNode2: Broadcast aggregated filter (RpcType::BloomFilterPush)
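
The coordinator's aggregation step in the diagram (OR all bits, sum expected element counts, take the maximum hash count) can be sketched as follows; the struct and function names are assumptions, not the real response types.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative per-node response; not the repository's actual type.
struct NodeFilter {
    std::vector<uint8_t> bits;
    uint64_t expected_elements;
    uint32_t num_hashes;
};

// Coordinator-side Phase-2 aggregation: bitwise OR every node's bits,
// sum expected elements, and keep the maximum hash count.
// Assumes a non-empty input with equally sized bit arrays.
NodeFilter aggregate(const std::vector<NodeFilter>& per_node) {
    NodeFilter out{std::vector<uint8_t>(per_node.front().bits.size(), 0), 0, 0};
    for (const auto& f : per_node) {
        for (size_t i = 0; i < out.bits.size(); ++i) out.bits[i] |= f.bits[i];
        out.expected_elements += f.expected_elements;
        out.num_hashes = std::max(out.num_hashes, f.num_hashes);
    }
    return out;
}
```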

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Whispers from the burrow:
Bloom filters scattered far and wide,
Each node builds filters with rabbit pride,
Bits aggregate through networked air,
OR'd together with utmost care,
Distributed wisdom, fluffy and fair! 🌸

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 54.55%, which is below the required 80.00% threshold. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the title clearly and concisely describes the main change (bloom filter building from left table data), which aligns with the PR objectives and the substantive changes across multiple files.




coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
include/common/cluster_manager.hpp (1)

321-327: ⚠️ Potential issue | 🟡 Minor

Local bloom filter entries are never cleaned up.

clear_bloom_filter cleans up bloom_filters_ but doesn't clean up local_bloom_bits_. Over time, entries will accumulate for completed contexts.

♻️ Proposed fix
     void clear_bloom_filter(const std::string& context_id) {
         const std::scoped_lock<std::mutex> lock(mutex_);
         bloom_filters_.erase(context_id);
+        local_bloom_bits_.erase(context_id);
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@include/common/cluster_manager.hpp` around lines 321 - 327,
clear_bloom_filter currently erases from bloom_filters_ but neglects to remove
entries from local_bloom_bits_, causing leaked per-context state; update
clear_bloom_filter to also erase local_bloom_bits_. Use the same
std::scoped_lock<std::mutex> lock(mutex_) already present to ensure thread
safety and call local_bloom_bits_.erase(context_id) (or equivalent) alongside
bloom_filters_.erase(context_id) inside the function.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@include/common/cluster_manager.hpp`:
- Around line 285-291: The method set_local_bloom_bits stores bits per
context_id but writes global metadata fields (local_expected_elements_ and
local_num_hashes_), causing races; change the metadata to be stored per-context
(e.g., replace local_expected_elements_ and local_num_hashes_ with maps keyed by
context_id, such as local_expected_elements_map_ and local_num_hashes_map_) and
assign into those maps inside set_local_bloom_bits (holding mutex_), and update
any corresponding getters/accessors and the analogous code mentioned for lines
353-356 to read metadata from the per-context maps instead of the old globals so
each context's bits/metadata remain matched and thread-safe.

In `@src/main.cpp`:
- Around line 519-544: The BloomFilterBits RPC handler currently uses global
metadata getters get_local_expected_elements() and get_local_num_hashes(),
causing mismatched metadata for multiple contexts; after ClusterManager is
updated to store per-context values, change those calls in the BloomFilterBits
handler to call the per-context variants (pass args.context_id) so
reply_args.expected_elements =
cluster_manager->get_local_expected_elements(args.context_id) and
reply_args.num_hashes = cluster_manager->get_local_num_hashes(args.context_id),
keeping reply_args.context_id and filter_data assignment via
get_local_bloom_bits(args.context_id) as-is.

---

Outside diff comments:
In `@include/common/cluster_manager.hpp`:
- Around line 321-327: clear_bloom_filter currently erases from bloom_filters_
but neglects to remove entries from local_bloom_bits_, causing leaked
per-context state; update clear_bloom_filter to also erase local_bloom_bits_.
Use the same std::scoped_lock<std::mutex> lock(mutex_) already present to ensure
thread safety and call local_bloom_bits_.erase(context_id) (or equivalent)
alongside bloom_filters_.erase(context_id) inside the function.
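
Taken together, the review comments above ask for per-context metadata storage with matching cleanup. A hedged sketch of what that could look like follows; the member and method names mirror the comments, but this is not the repository's actual class.

```cpp
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch of the per-context storage the reviewer requests; names are
// assumptions drawn from the review comments, not the real header.
class ClusterManager {
public:
    void set_local_bloom_bits(const std::string& context_id,
                              std::vector<uint8_t> bits,
                              uint64_t expected_elements,
                              uint32_t num_hashes) {
        const std::scoped_lock<std::mutex> lock(mutex_);
        // All three writes are keyed by context_id, so concurrent
        // contexts no longer race on shared scalar metadata.
        local_bloom_bits_[context_id] = std::move(bits);
        local_expected_elements_map_[context_id] = expected_elements;
        local_num_hashes_map_[context_id] = num_hashes;
    }

    std::vector<uint8_t> get_local_bloom_bits(const std::string& context_id) {
        const std::scoped_lock<std::mutex> lock(mutex_);
        auto it = local_bloom_bits_.find(context_id);
        return it == local_bloom_bits_.end() ? std::vector<uint8_t>{} : it->second;
    }

    uint64_t get_local_expected_elements(const std::string& context_id) {
        const std::scoped_lock<std::mutex> lock(mutex_);
        auto it = local_expected_elements_map_.find(context_id);
        return it == local_expected_elements_map_.end() ? 0 : it->second;
    }

    uint32_t get_local_num_hashes(const std::string& context_id) {
        const std::scoped_lock<std::mutex> lock(mutex_);
        auto it = local_num_hashes_map_.find(context_id);
        return it == local_num_hashes_map_.end() ? 0 : it->second;
    }

    // Cleanup now erases the per-context bloom state as well, so
    // completed contexts no longer leak entries.
    void clear_bloom_filter(const std::string& context_id) {
        const std::scoped_lock<std::mutex> lock(mutex_);
        local_bloom_bits_.erase(context_id);
        local_expected_elements_map_.erase(context_id);
        local_num_hashes_map_.erase(context_id);
    }

private:
    std::mutex mutex_;
    std::unordered_map<std::string, std::vector<uint8_t>> local_bloom_bits_;
    std::unordered_map<std::string, uint64_t> local_expected_elements_map_;
    std::unordered_map<std::string, uint32_t> local_num_hashes_map_;
};
```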

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: da036d01-db6a-48ac-b39e-ccad3e7983c3

📥 Commits

Reviewing files that changed from the base of the PR and between 078c09b and 1a2874a.

📒 Files selected for processing (5)
  • include/common/cluster_manager.hpp
  • include/network/rpc_message.hpp
  • src/distributed/distributed_executor.cpp
  • src/main.cpp
  • tests/distributed_tests.cpp

  • Comment thread: include/common/cluster_manager.hpp
  • Comment thread: src/main.cpp
poyrazK and others added 6 commits on April 14, 2026 at 14:22
- Store expected_elements and num_hashes in per-context maps instead of globals
- Update set_local_bloom_bits to write to per-context maps
- Update getters to take context_id parameter
- clear_bloom_filter now also erases local_bloom_bits and metadata maps
- Update BloomFilterBits handler to use per-context getters
The distributed shuffle join algorithm only supports INNER joins.
LEFT, RIGHT, and FULL outer joins require different handling
(e.g., broadcasting the outer table, or double-shuffle with side tables)
that is not yet implemented. Instead of producing incorrect results,
we now return a clear error message.

Also add unit tests RightJoinRejection and FullJoinRejection to
verify this behavior.
The distributed shuffle join can correctly handle LEFT joins in the
current implementation because each node executes the query locally
and LEFT join only requires preserving unmatched left-table rows
(which are already local to each node). RIGHT and FULL joins require
tracking unmatched rows across partitions which is not yet implemented.

Update error message to say "INNER and LEFT" instead of just "INNER".
poyrazK force-pushed the feature/bloom-filter-building-v2 branch from c92194f to e35159e on April 15, 2026 at 15:20
- Skip bloom filter for RIGHT/FULL joins to prevent false negatives
- Add integration tests for bloom filter skip
- Allow LEFT joins in distributed shuffle join

The bloom filter can cause false negatives (rows filtered when they shouldn't be),
which violates outer join semantics. For RIGHT/FULL joins, all right rows must be
sent to ensure correct results.

NOTE: Phase 3-5 for collecting unmatched outer rows is temporarily disabled
due to column indexing issues with non-SELECT * queries. The local executor
on each data node handles unmatched right rows correctly for RIGHT JOIN.
FULL JOIN unmatched left rows will be implemented separately.
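
The join-type guard this commit describes (skip the bloom filter when false negatives would violate outer-join semantics) might reduce to a predicate like the following; the enum and function names are illustrative, not the repository's actual code.

```cpp
// Illustrative join types; names are assumptions for this sketch.
enum class JoinType { Inner, Left, Right, Full };

// Bloom filtering drops right-side rows that cannot match the left
// table. INNER and LEFT joins only ever emit right rows that match,
// so dropping non-matching rows early is safe. RIGHT and FULL joins
// must preserve unmatched right rows, so the filter is skipped.
bool bloom_filter_allowed(JoinType t) {
    return t == JoinType::Inner || t == JoinType::Left;
}
```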
poyrazK force-pushed the feature/bloom-filter-building-v2 branch from ef30b18 to c171913 on April 15, 2026 at 15:28
poyrazK merged commit c72a39e into main on Apr 15, 2026
1 check passed