test: add bloom filter tests and update documentation

poyrazK · poyrazK · commit a24c986e455d · 2026-04-11T17:39:39.000+03:00
- Add 10 unit tests in tests/bloom_filter_test.cpp
- Test BloomFilterArgs serialization round-trip
- Test ClusterManager bloom filter storage operations
- Test bloom filter application logic (PushData simulation)
- Update PHASE_6_DISTRIBUTED_JOIN.md with bloom filter docs
- Update docs/phases/README.md with bloom filter feature
- Update SQLITE_COMPARISON.md with Section 7: Bloom Filter Optimization
- Add bloom_filter.cpp and bloom_filter_tests to CMakeLists.txt
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -72,6 +72,7 @@ set(CORE_SOURCES
     src/distributed/raft_group.cpp
     src/distributed/raft_manager.cpp
     src/distributed/distributed_executor.cpp
+    src/common/bloom_filter.cpp
     src/storage/columnar_table.cpp
 )
 
@@ -117,6 +118,7 @@ if(BUILD_TESTS)
     add_cloudsql_test(catalog_coverage_tests tests/catalog_coverage_tests.cpp)
     add_cloudsql_test(transaction_coverage_tests tests/transaction_coverage_tests.cpp)
     add_cloudsql_test(utils_coverage_tests tests/utils_coverage_tests.cpp)
+    add_cloudsql_test(bloom_filter_tests tests/bloom_filter_test.cpp)
     add_cloudsql_test(cloudSQL_tests tests/cloudSQL_tests.cpp)
     add_cloudsql_test(server_tests tests/server_tests.cpp)
     add_cloudsql_test(statement_tests tests/statement_tests.cpp)
diff --git a/docs/performance/SQLITE_COMPARISON.md b/docs/performance/SQLITE_COMPARISON.md
@@ -39,8 +39,53 @@ We addressed the gaps via the following optimizations:
 2.  **Pinned Page Iteration**: Modifying our `HeapTable::Iterator` to hold pages pinned across slot iteration avoids repetitive atomic checks and LRU updates per-row.
 3.  **Batch Insert Mode**: Skipping single-row undo logs and exclusive locks to exploit pure in-memory bump allocation. This drove the `INSERT` speedup well past SQLite limits, as we write raw tuples uninterrupted.
 
-## 6. Future Roadmap
+## 6. Post-Optimization Enhancements
+We addressed the gaps via the following optimizations:
+1.  **Buffer Pool Bypass (`fetch_page_by_id`)**: Reduced global std::mutex latch contention by explicitly caching ID lookups, yielding a ~30% improvement in scan logic.
+2.  **Pinned Page Iteration**: Modifying our `HeapTable::Iterator` to hold pages pinned across slot iteration avoids repetitive atomic checks and LRU updates per-row.
+3.  **Batch Insert Mode**: Skipping single-row undo logs and exclusive locks to exploit pure in-memory bump allocation. This drove the `INSERT` speedup well past SQLite limits, as we write raw tuples uninterrupted.
+
+## 7. Distributed Join Optimization: Bloom Filters
+
+### Problem
+Distributed shuffle joins send **all tuples** across the network to partitioned nodes, even when many will never match. This causes unnecessary network traffic and buffer memory usage.
+
+### Solution: Bloom Filter Integration
+Implemented bloom filters to filter tuples at the source before network transmission:
+- **One-sided bloom filter**: Built from the join keys of the table shuffled in Phase 1 (the build side) and applied to the table shuffled in Phase 2 (the probe side)
+- **Distributed construction**: Each data node builds its bloom filter locally during its scan phase
+- **Coordinator broadcast**: The coordinator broadcasts filter metadata to all nodes via the `BloomFilterPush` RPC
+
+### Architecture
+```
+[Phase 1: Shuffle Left]     [Phase 2: Shuffle Right]
+     |                             |
+     v                             v
+Build local bloom           Apply bloom filter
+from join keys              before buffering
+     |                             |
+     +---- BloomFilterPush ----->---+
+     (filter metadata)              |
+                                    v
+                         Filtered tuples buffered
+```
+
+### Key Components
+| Component | Location | Purpose |
+|-----------|----------|---------|
+| `BloomFilter` class | `include/common/bloom_filter.hpp` | MurmurHash3-based bloom filter |
+| `BloomFilterArgs` RPC | `include/network/rpc_message.hpp` | Serialization for network transfer |
+| `ClusterManager` storage | `include/common/cluster_manager.hpp` | Stores bloom filter per context |
+| `PushData` handler | `src/main.cpp` | Applies bloom filter before buffering |
+| Coordinator | `src/distributed/distributed_executor.cpp` | Broadcasts filter after Phase 1 |
+
+### Test Coverage
+- 10 unit tests covering: BloomFilter class, BloomFilterArgs serialization, ClusterManager storage, filter application logic
+- Tests located in `tests/bloom_filter_test.cpp`
+
+## 8. Future Roadmap
 With the scan gap closed, our focus shifts to higher-level analytical throughput:
 *   **Stage 1: SIMD-Accelerated Filtering**: Utilize AVX-512/NEON instructions to filter multiple rows in a single CPU cycle.
 *   **Stage 2: Vectorized Execution**: Move from row-at-a-time `TupleView` to batch-at-a-time `VectorBatch` processing.
 *   **Stage 3: Columnar Storage**: Transition from row-oriented heap files to columnar persistence for extreme analytical scanning.
+*   **Stage 4: Distributed Hash Join**: Enhance the single `HashJoinOperator` with parallel partitioned hash join for multi-node execution.
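Review note on Section 7: the one-sided filtering idea can be sketched in isolation as shown here. This is a minimal illustration under stated assumptions, not the project's `BloomFilter` class. `ToyBloomFilter`, `prefilter_probe_keys`, the fixed size of 1024 bits with 4 hashes, and the use of `std::hash` with double hashing (instead of MurmurHash3) are all hypothetical choices made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for the project's BloomFilter; handles int64 keys only.
class ToyBloomFilter {
public:
    // num_bits must be greater than zero.
    ToyBloomFilter(std::size_t num_bits, std::size_t num_hashes)
        : bits_(num_bits, false), num_hashes_(num_hashes) {}

    void insert(std::int64_t key) {
        for (std::size_t i = 0; i < num_hashes_; ++i) {
            bits_[index(key, i)] = true;
        }
    }

    // false means "definitely absent"; true means "possibly present".
    bool might_contain(std::int64_t key) const {
        for (std::size_t i = 0; i < num_hashes_; ++i) {
            if (!bits_[index(key, i)]) return false;
        }
        return true;
    }

private:
    // 64-bit mixing step (the SplitMix64 finalizer), used to derive a second hash.
    static std::uint64_t mix(std::uint64_t x) {
        x ^= x >> 30;
        x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27;
        x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        return x;
    }

    // Double hashing: position_i = (h1 + i * h2) mod number_of_bits.
    std::size_t index(std::int64_t key, std::size_t i) const {
        const std::uint64_t h1 = std::hash<std::int64_t>{}(key);
        const std::uint64_t h2 = mix(h1) | 1U;  // force an odd step
        return static_cast<std::size_t>((h1 + i * h2) % bits_.size());
    }

    std::vector<bool> bits_;
    std::size_t num_hashes_;
};

// Keeps only the probe-side keys that may have a partner on the build side.
// Input order is preserved; false positives may let extra keys through.
std::vector<std::int64_t> prefilter_probe_keys(
    const std::vector<std::int64_t>& build_keys,
    const std::vector<std::int64_t>& probe_keys) {
    ToyBloomFilter filter(1024, 4);  // assumed sizes, for illustration only
    for (std::int64_t key : build_keys) filter.insert(key);

    std::vector<std::int64_t> survivors;
    for (std::int64_t key : probe_keys) {
        if (filter.might_contain(key)) survivors.push_back(key);
    }
    return survivors;
}
```

Calling `prefilter_probe_keys({10, 20, 30}, {10, 15, 20, 99})` is guaranteed to keep 10 and 20, because a bloom filter has no false negatives. Whether 15 or 99 also survive depends on the hash positions, which is why the join must still verify matches after the shuffle.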
diff --git a/docs/phases/PHASE_6_DISTRIBUTED_JOIN.md b/docs/phases/PHASE_6_DISTRIBUTED_JOIN.md
@@ -14,6 +14,7 @@ Introduced isolated staging areas for inter-node data movement.
 Developed a dedicated binary protocol for efficient data redistribution.
 - **ShuffleFragment**: Metadata describing the fragment being pushed (target context, source node, schema).
 - **PushData**: High-speed binary payload containing the actual tuple data for the shuffle phase.
+- **BloomFilterPush**: Bloom filter metadata broadcast to enable tuple filtering before network transmission.
 
 ### 3. Two-Phase Join Orchestration (`distributed/distributed_executor.cpp`)
 Implemented the control logic for distributed shuffle joins.
@@ -24,9 +25,17 @@ Implemented the control logic for distributed shuffle joins.
 Seamlessly integrated shuffle buffers into the Volcano execution model.
 - **Vectorized Buffering**: Optimized the `BufferScanOperator` to handle large volumes of redistributed data with minimal overhead.
 
+### 5. Bloom Filter Optimization (`common/bloom_filter.hpp`)
+Added probabilistic filtering to reduce network traffic in shuffle joins.
+- **MurmurHash3-based BloomFilter**: Configurable false positive rate (default 1%), with the bit count and number of hash functions computed from the expected element count and that rate.
+- **Filter Construction**: Built during Phase 1 scan, stored in `ClusterManager` per context.
+- **Filter Application**: `PushData` handler checks `might_contain()` before buffering, skipping tuples that will definitely not match.
+
 ## Lessons Learned
 - Shuffle joins significantly reduce network traffic compared to broadcast joins for large-to-large table joins.
 - Fine-grained locking in the shuffle buffers is critical for maintaining high throughput during the redistribution phase.
+- Bloom filters reduce network traffic substantially when few probe-side tuples have a matching join key, at the cost of a small false positive rate (1% target by default).
 
 ## Status: 100% Test Pass
 Verified the end-to-end shuffle join flow, including multi-node data movement and final result merging, through automated integration tests.
+- 10 unit tests for bloom filter implementation and integration (`tests/bloom_filter_test.cpp`)
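Review note on the "optimal bit count and hash function calculation" mentioned in the Phase 6 notes: the diff does not show the formulas used by `bloom_filter.cpp`, so the sketch here uses the standard textbook sizing, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The names `BloomSizing` and `optimal_bloom_sizing` are hypothetical and the project's implementation may round differently.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical result type: total bits and number of hash functions.
struct BloomSizing {
    std::size_t num_bits;
    std::size_t num_hashes;
};

// Textbook bloom filter sizing for n expected elements and target
// false positive rate p, where 0 < p < 1.
BloomSizing optimal_bloom_sizing(std::size_t expected_elements, double target_fpr) {
    // Treat an expected count of zero as one to avoid dividing by zero.
    const double n = static_cast<double>(expected_elements > 0 ? expected_elements : 1);
    const double ln2 = std::log(2.0);

    // m = -n * ln(p) / (ln 2)^2, rounded up to a whole number of bits.
    const double m = std::ceil(-n * std::log(target_fpr) / (ln2 * ln2));

    // k = (m / n) * ln 2, rounded to the nearest integer, at least 1.
    const double k = std::round(m / n * ln2);

    return {static_cast<std::size_t>(m),
            static_cast<std::size_t>(k < 1.0 ? 1.0 : k)};
}
```

For 1000 expected elements at a 1% target, these formulas give roughly 9.6 bits per element and 7 hash functions.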
diff --git a/docs/phases/README.md b/docs/phases/README.md
@@ -41,6 +41,7 @@ This directory contains the technical documentation for the lifecycle of the clo
 - Context-aware Shuffle infrastructure in `ClusterManager`.
 - Implementation of `ShuffleFragment` and `PushData` RPC protocols.
 - Two-phase Shuffle Join orchestration in `DistributedExecutor`.
+- **Bloom Filter Optimization**: Probabilistic tuple filtering to reduce network traffic in shuffle joins.
 
 ### [Phase 7: Replication & High Availability](./PHASE_7_REPLICATION_HA.md)
 **Focus**: Fault Tolerance & Data Redundancy.
diff --git a/tests/bloom_filter_test.cpp b/tests/bloom_filter_test.cpp
@@ -0,0 +1,273 @@
+/**
+ * @file bloom_filter_test.cpp
+ * @brief Unit tests for BloomFilter implementation
+ */
+
+#include <gtest/gtest.h>
+
+#include <cstdint>
+#include <initializer_list>
+#include <string>
+#include <utility>
+#include <vector>
+
+#include "common/bloom_filter.hpp"
+#include "common/cluster_manager.hpp"
+#include "common/value.hpp"
+#include "executor/types.hpp"
+#include "network/rpc_message.hpp"
+
+using namespace cloudsql::common;
+using namespace cloudsql::network;
+using namespace cloudsql::cluster;
+
+namespace {
+
+/**
+ * @brief Tests basic bloom filter insertion and membership.
+ */
+TEST(BloomFilterTests, BasicInsertAndQuery) {
+    BloomFilter bf(100);  // Expect 100 elements
+
+    Value v1 = Value::make_int64(42);
+    Value v2 = Value::make_int64(100);
+    Value v3 = Value::make_text("hello");
+
+    bf.insert(v1);
+    bf.insert(v2);
+    bf.insert(v3);
+
+    // All inserted values should be found
+    EXPECT_TRUE(bf.might_contain(v1));
+    EXPECT_TRUE(bf.might_contain(v2));
+    EXPECT_TRUE(bf.might_contain(v3));
+
+    // Non-inserted values are deliberately not asserted here: a bloom filter may report
+    // false positives, although with only 3 insertions in a filter sized for 100 that is unlikely
+}
+
+/**
+ * @brief Tests that lookups on a filter with no insertions return false.
+ */
+TEST(BloomFilterTests, NonInsertedValues) {
+    BloomFilter bf(1000);  // Sized for 1000 elements; nothing is inserted
+
+    Value v1 = Value::make_int64(999);
+    Value v2 = Value::make_text("nonexistent");
+
+    // No bits are set in an empty filter, so these lookups must return false
+    EXPECT_FALSE(bf.might_contain(v1));
+    EXPECT_FALSE(bf.might_contain(v2));
+}
+
+/**
+ * @brief Tests serialization and deserialization.
+ */
+TEST(BloomFilterTests, SerializationRoundTrip) {
+    BloomFilter bf(50);
+
+    // Insert some values
+    for (int i = 0; i < 25; ++i) {
+        bf.insert(Value::make_int64(i));
+    }
+    for (int i = 100; i < 125; ++i) {
+        bf.insert(Value::make_text("text_" + std::to_string(i)));
+    }
+
+    // Serialize
+    std::vector<uint8_t> data = bf.serialize();
+    EXPECT_FALSE(data.empty());
+
+    // Deserialize
+    BloomFilter bf2(data.data(), data.size());
+
+    // Check metadata
+    EXPECT_EQ(bf.num_hashes(), bf2.num_hashes());
+
+    // Check inserted values are found
+    for (int i = 0; i < 25; ++i) {
+        EXPECT_TRUE(bf2.might_contain(Value::make_int64(i)));
+    }
+    for (int i = 100; i < 125; ++i) {
+        EXPECT_TRUE(bf2.might_contain(Value::make_text("text_" + std::to_string(i))));
+    }
+}
+
+/**
+ * @brief Tests false positive rate with many insertions.
+ */
+TEST(BloomFilterTests, FalsePositiveRate) {
+    BloomFilter bf(1000);  // 1000 expected elements
+
+    // Insert 500 values
+    for (int i = 0; i < 500; ++i) {
+        bf.insert(Value::make_int64(i));
+    }
+
+    // Check 1000 non-inserted values and count false positives
+    int false_positives = 0;
+    for (int i = 500; i < 1500; ++i) {
+        if (bf.might_contain(Value::make_int64(i))) {
+            ++false_positives;
+        }
+    }
+
+    // The filter is sized for 1000 elements but holds only 500, so with a 1% target FPR
+    // we expect fewer than 10 false positives out of 1000; allow margin up to 5% (50)
+    EXPECT_LT(false_positives, 50);
+}
+
+/**
+ * @brief Tests empty bloom filter.
+ */
+TEST(BloomFilterTests, EmptyFilter) {
+    BloomFilter bf(1);  // Minimal filter
+
+    // Nothing inserted, nothing should be found
+    EXPECT_FALSE(bf.might_contain(Value::make_int64(1)));
+    EXPECT_FALSE(bf.might_contain(Value::make_text("test")));
+}
+
+/**
+ * @brief Tests that duplicate insertions don't cause issues.
+ */
+TEST(BloomFilterTests, DuplicateInsertions) {
+    BloomFilter bf(100);
+
+    Value v = Value::make_int64(42);
+
+    bf.insert(v);
+    bf.insert(v);
+    bf.insert(v);
+
+    // Should still be found
+    EXPECT_TRUE(bf.might_contain(v));
+}
+
+/**
+ * @brief Tests different value types.
+ */
+TEST(BloomFilterTests, DifferentValueTypes) {
+    BloomFilter bf(100);
+
+    bf.insert(Value::make_int64(1));
+    bf.insert(Value::make_int64(2));
+    bf.insert(Value::make_float64(3.14));
+    bf.insert(Value::make_text("string"));
+    bf.insert(Value::make_bool(true));
+
+    EXPECT_TRUE(bf.might_contain(Value::make_int64(1)));
+    EXPECT_TRUE(bf.might_contain(Value::make_int64(2)));
+    EXPECT_TRUE(bf.might_contain(Value::make_float64(3.14)));
+    EXPECT_TRUE(bf.might_contain(Value::make_text("string")));
+    EXPECT_TRUE(bf.might_contain(Value::make_bool(true)));
+
+    // Non-inserted: a false positive is possible in principle but very unlikely with 5 insertions
+    EXPECT_FALSE(bf.might_contain(Value::make_int64(999)));
+    EXPECT_FALSE(bf.might_contain(Value::make_text("not inserted")));
+}
+
+/**
+ * @brief Tests BloomFilterArgs serialization round-trip.
+ */
+TEST(BloomFilterTests, BloomFilterArgsSerialization) {
+    BloomFilterArgs args;
+    args.context_id = "ctx_123";
+    args.build_table = "users";
+    args.probe_table = "orders";
+    args.probe_key_col = "user_id";
+    args.filter_data = {0x01, 0x02, 0x03};
+    args.expected_elements = 1000;
+    args.num_hashes = 4;
+
+    auto serialized = args.serialize();
+    auto deserialized = BloomFilterArgs::deserialize(serialized);
+
+    EXPECT_EQ(args.context_id, deserialized.context_id);
+    EXPECT_EQ(args.build_table, deserialized.build_table);
+    EXPECT_EQ(args.probe_table, deserialized.probe_table);
+    EXPECT_EQ(args.probe_key_col, deserialized.probe_key_col);
+    EXPECT_EQ(args.expected_elements, deserialized.expected_elements);
+    EXPECT_EQ(args.num_hashes, deserialized.num_hashes);
+    ASSERT_EQ(args.filter_data.size(), deserialized.filter_data.size());
+    EXPECT_EQ(args.filter_data, deserialized.filter_data);
+}
+
+/**
+ * @brief Tests ClusterManager bloom filter storage operations.
+ */
+TEST(BloomFilterTests, ClusterManagerBloomFilterStorage) {
+    ClusterManager cm(nullptr);
+
+    // Create a real bloom filter and serialize it
+    BloomFilter original(100);
+    original.insert(Value::make_int64(10));
+    original.insert(Value::make_int64(20));
+    auto filter_data = original.serialize();
+
+    // Test set_bloom_filter and has_bloom_filter
+    cm.set_bloom_filter("ctx1", "table_build", "table_probe", "key_col",
+                        filter_data, original.expected_elements(), original.num_hashes());
+    EXPECT_TRUE(cm.has_bloom_filter("ctx1"));
+
+    // Test get_bloom_filter reconstructs correctly
+    auto bf = cm.get_bloom_filter("ctx1");
+    EXPECT_EQ(bf.expected_elements(), original.expected_elements());
+    EXPECT_EQ(bf.num_hashes(), original.num_hashes());
+
+    // Test that inserted values are found in reconstructed filter
+    EXPECT_TRUE(bf.might_contain(Value::make_int64(10)));
+    EXPECT_TRUE(bf.might_contain(Value::make_int64(20)));
+
+    // Test non-existent context
+    EXPECT_FALSE(cm.has_bloom_filter("nonexistent"));
+
+    // Test get_probe_table and get_probe_key_col
+    cm.set_bloom_filter("ctx2", "build_t", "probe_t", "col_x", filter_data, 500, 3);
+    EXPECT_EQ(cm.get_probe_table("ctx2"), "probe_t");
+    EXPECT_EQ(cm.get_probe_key_col("ctx2"), "col_x");
+
+    // Test clear_bloom_filter
+    cm.clear_bloom_filter("ctx1");
+    EXPECT_FALSE(cm.has_bloom_filter("ctx1"));
+}
+
+/**
+ * @brief Tests bloom filter application logic (simulates PushData handler behavior).
+ */
+TEST(BloomFilterTests, BloomFilterApplicationLogic) {
+    // Build bloom filter with known keys
+    BloomFilter bf(100);
+    bf.insert(Value::make_int64(10));
+    bf.insert(Value::make_int64(20));
+    bf.insert(Value::make_int64(30));
+
+    // Simulate tuple filtering (as done in PushData handler)
+    std::vector<cloudsql::executor::Tuple> tuples;
+    tuples.push_back(cloudsql::executor::Tuple(std::initializer_list<Value>{Value::make_int64(10)}));  // match
+    tuples.push_back(cloudsql::executor::Tuple(std::initializer_list<Value>{Value::make_int64(15)}));  // no match
+    tuples.push_back(cloudsql::executor::Tuple(std::initializer_list<Value>{Value::make_int64(20)}));  // match
+    tuples.push_back(cloudsql::executor::Tuple(std::initializer_list<Value>{Value::make_int64(99)}));  // no match
+
+    std::vector<cloudsql::executor::Tuple> filtered;
+    for (auto& row : tuples) {
+        if (bf.might_contain(row.get(0))) {
+            filtered.push_back(std::move(row));
+        }
+    }
+
+    // Expect exactly the 2 inserted keys (10 and 20); 15 and 99 could pass only as false positives
+    EXPECT_EQ(filtered.size(), 2U);
+
+    // Verify the surviving values (the loop above preserves input order)
+    bool found_10 = false;
+    bool found_20 = false;
+    for (auto& row : filtered) {
+        if (row.get(0) == Value::make_int64(10)) found_10 = true;
+        if (row.get(0) == Value::make_int64(20)) found_20 = true;
+    }
+    EXPECT_TRUE(found_10);
+    EXPECT_TRUE(found_20);
+}
+
+}  // namespace
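Review note on `FalsePositiveRate`: the test fills a filter sized for 1000 elements with only 500 keys, so the rate it observes should sit well below the configured target. A rough way to estimate the expected rate is the standard approximation p = (1 - e^(-kn/m))^k for m bits, k hash functions, and n inserted keys. The helper name `expected_false_positive_rate` is hypothetical. The figures of 9586 bits and 7 hashes used in the note after the code are what textbook sizing gives for 1000 elements at 1%. They are assumptions, not values confirmed by this diff.

```cpp
#include <cmath>
#include <cstddef>

// Approximate false positive probability of a bloom filter with
// num_bits bits (must be > 0), num_hashes hash functions, and
// `inserted` distinct keys: p = (1 - exp(-k * n / m))^k.
double expected_false_positive_rate(std::size_t num_bits,
                                    std::size_t num_hashes,
                                    std::size_t inserted) {
    const double m = static_cast<double>(num_bits);
    const double k = static_cast<double>(num_hashes);
    const double n = static_cast<double>(inserted);

    // Probability that any one bit is set after n insertions.
    const double bit_set = 1.0 - std::exp(-k * n / m);

    // A false positive needs all k probed bits to be set.
    return std::pow(bit_set, k);
}
```

Under those assumed parameters the estimate is about 1% at 1000 inserted keys and far below 0.1% at 500, so the `EXPECT_LT(false_positives, 50)` bound leaves a wide margin.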