Phase 5 introduced high-level optimizations to reduce network latency and enable complex multi-shard query patterns.
Optimized query routing based on partitioning keys.
- Predicate Analysis: Detects filters on sharding keys (e.g.,
WHERE id = 100). - Targeted Dispatch: Routes fragments only to the specific node owning the shard, avoiding cluster-wide broadcasts.
Implemented coordination for distributed analytics.
- Partial Aggregation: Data nodes compute local counts and sums.
- Global Merge: The coordinator identifies aggregate functions in the SELECT list and merges partial results from all shards into a final result set.
Developed a prototype for cross-shard JOINs.
- Table Fetching: Coordinator retrieves full data from a smaller table across all shards.
- Broadcasting: Pushes the gathered data to the
ShuffleBufferof every node in the cluster. - Local Execution: Rewrites the query so each node joins its local shard with the broadcasted buffer data.
Enabled inter-node data movement.
- BufferScanOperator: A physical operator that reads from in-memory shuffle buffers instead of heap files.
- ClusterManager Buffering: Thread-safe staging area for data received via
PushDataRPCs.
- Broadcast joins are highly effective for small-to-large table joins but require careful consideration of coordinator memory limits.
- Merging aggregates at the coordinator is a bottleneck for very large clusters; future work could explore tree-based merging.
All scenarios, including distributed transactions (2PC) and join orchestration, have been verified with automated integration tests.