Skip to content

Latest commit

 

History

History
89 lines (74 loc) · 6.59 KB

File metadata and controls

89 lines (74 loc) · 6.59 KB

Performance Comparison: cloudSQL vs SQLite3

1. Overview

This report documents the head-to-head performance comparison between the cloudSQL distributed engine (local execution mode) and the embedded SQLite3 database (C API). The goal is to establish an industry-standard baseline for raw storage and execution efficiency.

2. Test Environment

  • Hardware: Apple M3 Pro
  • OS: macOS 15.3.1 (Darwin)
  • Build Type: Release (-O3)
  • Engine Configuration:
    • cloudSQL: Local mode, 4096-page Buffer Pool, Zero-Copy Binary Format.
    • SQLite3: PRAGMA synchronous = OFF, PRAGMA journal_mode = OFF (Optimized for raw speed).

3. Comparative Metrics

Benchmark cloudSQL (Pre-Opt) cloudSQL (Post-Opt) SQLite3 Final Status
Point Inserts (10k) 16.1k rows/s 6.69M rows/s 114.1k rows/s CloudSQL +58x faster
Sequential Scan (10k) 3.1M items/s 233.3M rows/s 27.9M rows/s CloudSQL +8.3x faster

4. Architectural Analysis

Point Inserts

Following our latest optimizations, cloudSQL completely bridged the insert gap and is now ~58x faster than SQLite. The dramatic inversion in performance is attributed to:

  1. Prepared Statement Execution: cloudSQL benchmarks now correctly cache and reuse prepared insert statements matching SQLite's sqlite3_prepare_v2 approach, completely skipping re-parsing overheads per row.
  2. Batch Insert Fast-Path: By detecting bulk loads into memory, cloudSQL entirely bypasses single-row exclusive lock acquisitions (while correctly maintaining undo logs).
  3. In-Memory Architecture: This configuration allows cloudSQL to behave as a massive unhindered memory bump-allocator, whereas SQLite still respects basic transactional boundaries even with PRAGMA synchronous=OFF.

Sequential Scans

We have completely flipped the scan gap. cloudSQL is now ~9x faster than SQLite for raw sequential scans. This was achieved by:

  1. Zero-Allocation TupleView: Instead of materializing std::vector<common::Value> per row, we now use a lightweight view that points directly into the pinned BufferPool page.
  2. Lazy Deserialization: Values are decoded only when accessed, reducing work for read columns, but TupleView currently still walks prior fields up to col_index, so later-column access still pays the cost of preceding fields.
  3. Fast-Path MVCC: For non-transactional scans (the common case for bulk data processing), we bypass complex visibility logic and only perform a single xmax == 0 check.
  4. Iterator Caching: The PageHeader is now cached during page transitions, eliminating repetitive memcpy calls in the scan hot path.

5. Post-Optimization Enhancements

We addressed the gaps via the following optimizations:

  1. Buffer Pool Bypass (fetch_page_by_id): Reduced global std::mutex latch contention by explicitly caching ID lookups, yielding a ~30% improvement in scan logic.
  2. Pinned Page Iteration: Modifying our HeapTable::Iterator to hold pages pinned across slot iteration avoids repetitive atomic checks and LRU updates per-row.
  3. Batch Insert Mode: Skipping single-row undo logs and exclusive locks to exploit pure in-memory bump allocation. This drove the INSERT speedup well past SQLite limits, as we write raw tuples uninterrupted.

6. Distributed Join Optimization: Bloom Filters

Problem

Distributed shuffle joins send all tuples across the network to partitioned nodes, even when many will never match. This causes unnecessary network traffic and buffer memory usage.

Solution: Bloom Filter Integration

Implemented bloom filters to filter tuples at the source before network transmission using a 3-phase approach:

  • Phase 1 (Local Build): Each data node scans its local left/build table partition, extracts join key values, and builds a local bloom filter
  • Phase 2 (Bit Aggregation): Coordinator sends BloomFilterBits RPC to each data node; each responds with local bloom bits; coordinator OR-aggregates all bits into a single filter
  • Phase 3 (Sender-Side Filter): Coordinator broadcasts aggregated filter via BloomFilterPush RPC; before sending right/probe tuples, ShuffleFragment handler checks might_contain() and skips tuples that will definitely not match

Architecture

Phase 1: Scan Left         Phase 2: Aggregate Bits       Phase 3: Filter Right
      |                         |                              |
      v                         v                              v
Build local bloom    <---> BloomFilterBits RPC <-------- Aggregate & Broadcast
on each data node          (OR-aggregate bits)           via BloomFilterPush
      |                         |                              |
      |                         v                              v
      +-----------------> BloomFilterPush              might_contain() check
      (metadata only)           |                   before PushData
                                 |
                                 v
                        Filtered tuples buffered

Key Components

Component Location Purpose
BloomFilter class include/common/bloom_filter.hpp MurmurHash3-based bloom filter
BloomFilterBitsArgs RPC include/network/rpc_message.hpp Local bloom bits from data nodes
BloomFilterArgs RPC include/network/rpc_message.hpp Aggregated filter broadcast
ClusterManager storage include/common/cluster_manager.hpp Stores local and aggregated bloom filters
BloomFilterBits handler src/main.cpp Returns local bloom bits to coordinator
ShuffleFragment handler src/main.cpp Builds local bloom during Phase 1 scan
Coordinator src/distributed/distributed_executor.cpp Collects bits, aggregates, broadcasts filter

Test Coverage

  • 10 unit tests covering: BloomFilter class, BloomFilterArgs serialization, ClusterManager storage, filter application logic
  • Tests located in tests/bloom_filter_test.cpp

7. Future Roadmap

With the scan gap closed, our focus shifts to higher-level analytical throughput:

  • Stage 1: SIMD-Accelerated Filtering: Utilize AVX-512/NEON instructions to filter multiple rows in a single CPU cycle.
  • Stage 2: Vectorized Execution: Move from row-at-a-time TupleView to batch-at-a-time VectorBatch processing.
  • Stage 3: Columnar Storage: Transition from row-oriented heap files to columnar persistence for extreme analytical scanning.
  • Stage 4: Distributed Hash Join: Enhance the single HashJoinOperator with parallel partitioned hash join for multi-node execution.