This report documents the head-to-head performance comparison between the cloudSQL distributed engine (local execution mode) and the embedded SQLite3 database (C API). The goal is to establish an industry-standard baseline for raw storage and execution efficiency.
- Hardware: Apple M3 Pro
- OS: macOS 15.3.1 (Darwin)
- Build Type: Release (
-O3) - Engine Configuration:
cloudSQL: Local mode, 4096-page Buffer Pool, Zero-Copy Binary Format.SQLite3:PRAGMA synchronous = OFF,PRAGMA journal_mode = OFF(Optimized for raw speed).
| Benchmark | cloudSQL (Pre-Opt) | cloudSQL (Post-Opt) | SQLite3 | Final Status |
|---|---|---|---|---|
| Point Inserts (10k) | 16.1k rows/s | 6.69M rows/s | 114.1k rows/s | CloudSQL +58x faster |
| Sequential Scan (10k) | 3.1M items/s | 233.3M rows/s | 27.9M rows/s | CloudSQL +8.3x faster |
Following our latest optimizations, cloudSQL completely bridged the insert gap and is now ~58x faster than SQLite. The dramatic inversion in performance is attributed to:
- Prepared Statement Execution:
cloudSQLbenchmarks now correctly cache and reuse prepared insert statements matching SQLite'ssqlite3_prepare_v2approach, completely skipping re-parsing overheads per row. - Batch Insert Fast-Path: By detecting bulk loads into memory,
cloudSQLentirely bypasses single-row exclusive lock acquisitions (while correctly maintaining undo logs). - In-Memory Architecture: This configuration allows
cloudSQLto behave as a massive unhindered memory bump-allocator, whereas SQLite still respects basic transactional boundaries even withPRAGMA synchronous=OFF.
We have completely flipped the scan gap. cloudSQL is now ~9x faster than SQLite for raw sequential scans. This was achieved by:
- Zero-Allocation
TupleView: Instead of materializingstd::vector<common::Value>per row, we now use a lightweight view that points directly into the pinnedBufferPoolpage. - Lazy Deserialization: Values are decoded only when accessed, reducing work for read columns, but
TupleViewcurrently still walks prior fields up tocol_index, so later-column access still pays the cost of preceding fields. - Fast-Path MVCC: For non-transactional scans (the common case for bulk data processing), we bypass complex visibility logic and only perform a single
xmax == 0check. - Iterator Caching: The
PageHeaderis now cached during page transitions, eliminating repetitivememcpycalls in the scan hot path.
We addressed the gaps via the following optimizations:
- Buffer Pool Bypass (
fetch_page_by_id): Reduced global std::mutex latch contention by explicitly caching ID lookups, yielding a ~30% improvement in scan logic. - Pinned Page Iteration: Modifying our
HeapTable::Iteratorto hold pages pinned across slot iteration avoids repetitive atomic checks and LRU updates per-row. - Batch Insert Mode: Skipping single-row undo logs and exclusive locks to exploit pure in-memory bump allocation. This drove the
INSERTspeedup well past SQLite limits, as we write raw tuples uninterrupted.
Distributed shuffle joins send all tuples across the network to partitioned nodes, even when many will never match. This causes unnecessary network traffic and buffer memory usage.
Implemented bloom filters to filter tuples at the source before network transmission using a 3-phase approach:
- Phase 1 (Local Build): Each data node scans its local left/build table partition, extracts join key values, and builds a local bloom filter
- Phase 2 (Bit Aggregation): Coordinator sends
BloomFilterBitsRPC to each data node; each responds with local bloom bits; coordinator OR-aggregates all bits into a single filter - Phase 3 (Sender-Side Filter): Coordinator broadcasts aggregated filter via
BloomFilterPushRPC; before sending right/probe tuples,ShuffleFragmenthandler checksmight_contain()and skips tuples that will definitely not match
Phase 1: Scan Left Phase 2: Aggregate Bits Phase 3: Filter Right
| | |
v v v
Build local bloom <---> BloomFilterBits RPC <-------- Aggregate & Broadcast
on each data node (OR-aggregate bits) via BloomFilterPush
| | |
| v v
+-----------------> BloomFilterPush might_contain() check
(metadata only) | before PushData
|
v
Filtered tuples buffered
| Component | Location | Purpose |
|---|---|---|
BloomFilter class |
include/common/bloom_filter.hpp |
MurmurHash3-based bloom filter |
BloomFilterBitsArgs RPC |
include/network/rpc_message.hpp |
Local bloom bits from data nodes |
BloomFilterArgs RPC |
include/network/rpc_message.hpp |
Aggregated filter broadcast |
ClusterManager storage |
include/common/cluster_manager.hpp |
Stores local and aggregated bloom filters |
BloomFilterBits handler |
src/main.cpp |
Returns local bloom bits to coordinator |
ShuffleFragment handler |
src/main.cpp |
Builds local bloom during Phase 1 scan |
| Coordinator | src/distributed/distributed_executor.cpp |
Collects bits, aggregates, broadcasts filter |
- 10 unit tests covering: BloomFilter class, BloomFilterArgs serialization, ClusterManager storage, filter application logic
- Tests located in
tests/bloom_filter_test.cpp
With the scan gap closed, our focus shifts to higher-level analytical throughput:
- Stage 1: SIMD-Accelerated Filtering: Utilize AVX-512/NEON instructions to filter multiple rows in a single CPU cycle.
- Stage 2: Vectorized Execution: Move from row-at-a-time
TupleViewto batch-at-a-timeVectorBatchprocessing. - Stage 3: Columnar Storage: Transition from row-oriented heap files to columnar persistence for extreme analytical scanning.
- Stage 4: Distributed Hash Join: Enhance the single
HashJoinOperatorwith parallel partitioned hash join for multi-node execution.