[common] Introduce RTreeFileIndex for query optimization #7919
xuzifu666 wants to merge 11 commits into
| } else { | ||
| for (int i = 0; i < entryCount; i++) { | ||
| RTreeNode child = new RTreeNode(dimensions, maxEntries, false); | ||
| node.addChild(child); |
RTreeNode.isLeaf is final. When the RTree constructor creates the root, isLeaf=true. During deserialization, however, if the root of the tree is an internal node, the deserialization code adds children to a node whose isLeaf is still true. Later, when searching, node.isLeaf() returns true, so the search reads leafRowIds instead of recursing into the children. The deserialized tree cannot be queried correctly at all.
I changed private final boolean isLeaf → private boolean isLeaf, added a setLeaf(boolean) method, and now call setLeaf() during deserialization to correct the root node's leaf flag.
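To make the failure mode concrete, here is a minimal sketch of the bug and the fix. LeafFlagNode and LeafFlagSketch are hypothetical, simplified stand-ins and not the PR's RTreeNode; only the names isLeaf, setLeaf and leafRowIds come from the discussion above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified node; NOT the PR's RTreeNode.
final class LeafFlagNode {
    private boolean leaf; // mutable, so deserialization can correct it
    final List<LeafFlagNode> children = new ArrayList<>();
    final List<Integer> leafRowIds = new ArrayList<>();

    LeafFlagNode(boolean leaf) {
        this.leaf = leaf;
    }

    boolean isLeaf() {
        return leaf;
    }

    void setLeaf(boolean leaf) {
        this.leaf = leaf;
    }

    // Collects every row id reachable from this node.
    void collect(List<Integer> out) {
        if (leaf) {
            out.addAll(leafRowIds);
        } else {
            for (LeafFlagNode child : children) {
                child.collect(out);
            }
        }
    }
}

final class LeafFlagSketch {
    // Simulates reading an internal root with two leaf children into a root
    // that was constructed with leaf=true, as the RTree constructor does.
    static LeafFlagNode readInternalRoot(boolean correctFlag) {
        LeafFlagNode root = new LeafFlagNode(true);
        if (correctFlag) {
            root.setLeaf(false); // the fix: restore the serialized flag
        }
        for (int i = 0; i < 2; i++) {
            LeafFlagNode child = new LeafFlagNode(true);
            child.leafRowIds.add(i);
            root.children.add(child);
        }
        return root;
    }
}
```

Without the setLeaf(false) call, collect() on the root stops at the empty leafRowIds list and never visits the children.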
| RTreeNode newNode = new RTreeNode(dimensions, maxEntries, true); | ||
| int mid = entries.size() / 2; |
This completely disregards spatial location. A correct R-Tree split should minimize the MBR overlap between the two resulting nodes; otherwise a large number of nodes will intersect the search box during queries, and the search degenerates into a linear scan.
I made these changes:
- Implemented QuadraticSplit algorithm (2-step approach):
  a. PickSeeds: Select the two entries with maximum distance as initial seeds
  b. Assign: Assign each remaining entry to the group with minimum bbox expansion
- Implemented QuadraticSplitInternal: Same algorithm for internal node splitting
- Modified RTree.java to use QuadraticSplit instead of linear splitting
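The two steps listed above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's QuadraticSplit class: boxes are 2-D arrays laid out as {minX, minY, maxX, maxY}, "distance" is taken to mean distance between box centers, and the minimum-fill guarantee and PickNext ordering of the textbook quadratic split are left out. It expects at least two entries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; box layout {minX, minY, maxX, maxY} is an assumption.
final class QuadraticSplitSketch {
    static double area(double[] b) {
        return (b[2] - b[0]) * (b[3] - b[1]);
    }

    static double[] union(double[] a, double[] b) {
        return new double[] {
            Math.min(a[0], b[0]), Math.min(a[1], b[1]),
            Math.max(a[2], b[2]), Math.max(a[3], b[3])
        };
    }

    static double centerDistSq(double[] a, double[] b) {
        double dx = (a[0] + a[2]) / 2 - (b[0] + b[2]) / 2;
        double dy = (a[1] + a[3]) / 2 - (b[1] + b[3]) / 2;
        return dx * dx + dy * dy;
    }

    static List<List<double[]>> split(List<double[]> entries) {
        // PickSeeds: the pair of entries whose centers are farthest apart.
        int s1 = 0;
        int s2 = 1;
        double best = -1;
        for (int i = 0; i < entries.size(); i++) {
            for (int j = i + 1; j < entries.size(); j++) {
                double d = centerDistSq(entries.get(i), entries.get(j));
                if (d > best) {
                    best = d;
                    s1 = i;
                    s2 = j;
                }
            }
        }
        List<double[]> g1 = new ArrayList<>();
        List<double[]> g2 = new ArrayList<>();
        g1.add(entries.get(s1));
        g2.add(entries.get(s2));
        double[] mbr1 = entries.get(s1);
        double[] mbr2 = entries.get(s2);
        // Assign: each remaining entry joins the group whose MBR grows least.
        for (int i = 0; i < entries.size(); i++) {
            if (i == s1 || i == s2) {
                continue;
            }
            double[] e = entries.get(i);
            double grow1 = area(union(mbr1, e)) - area(mbr1);
            double grow2 = area(union(mbr2, e)) - area(mbr2);
            if (grow1 <= grow2) {
                g1.add(e);
                mbr1 = union(mbr1, e);
            } else {
                g2.add(e);
                mbr2 = union(mbr2, e);
            }
        }
        return Arrays.asList(g1, g2);
    }
}
```

Compared with cutting the entry list at entries.size() / 2, two well-separated clusters end up in separate nodes regardless of insertion order.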
| if (node.isLeaf()) { | ||
| for (Integer rowId : node.getLeafRowIds()) { | ||
| results.add(rowId); |
Row IDs are added directly, without checking whether each entry.bbox intersects searchBox.
Changed as follows:
- Enhanced RTree.java search() method:
  a. Leaf nodes: Added a per-entry check, entry.getBbox().intersects(searchBox)
  b. Previously only node.getBoundingBox().intersects() was checked
- LeafEntry structure: Stores rowId + bbox for precision verification
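A rough sketch of the per-entry check described above, assuming 2-D boxes. SearchSketch and its nested Box, LeafEntry and Node types are hypothetical simplifications and do not mirror the PR's actual classes or signatures.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a search that filters leaf entries individually.
final class SearchSketch {
    static final class Box {
        final double minX, minY, maxX, maxY;

        Box(double minX, double minY, double maxX, double maxY) {
            this.minX = minX;
            this.minY = minY;
            this.maxX = maxX;
            this.maxY = maxY;
        }

        // Closed-interval overlap test on both axes.
        boolean intersects(Box o) {
            return minX <= o.maxX && o.minX <= maxX
                && minY <= o.maxY && o.minY <= maxY;
        }
    }

    static final class LeafEntry {
        final int rowId;
        final Box bbox;

        LeafEntry(int rowId, Box bbox) {
            this.rowId = rowId;
            this.bbox = bbox;
        }
    }

    static final class Node {
        final boolean leaf;
        final Box mbr;
        final List<LeafEntry> entries = new ArrayList<>();
        final List<Node> children = new ArrayList<>();

        Node(boolean leaf, Box mbr) {
            this.leaf = leaf;
            this.mbr = mbr;
        }
    }

    static void search(Node node, Box query, List<Integer> out) {
        if (!node.mbr.intersects(query)) {
            return; // prune the whole subtree
        }
        if (node.leaf) {
            for (LeafEntry e : node.entries) {
                // The node MBR matching is not enough; test each entry.
                if (e.bbox.intersects(query)) {
                    out.add(e.rowId);
                }
            }
        } else {
            for (Node child : node.children) {
                search(child, query, out);
            }
        }
    }
}
```

With only the node-level test, a leaf whose MBR touches the query would return all of its row IDs, including entries far from the search box.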
| import org.apache.paimon.types.DataType; | ||
| /** The implementation of R-Tree file index. */ | ||
| public class RTreeFileIndex implements FileIndexer { |
As a file index, all data is already known at write time, and a tree built by inserting items one at a time is of much lower quality than one built with STR (Sort-Tile-Recursive) bulk loading.
My current alternative is:
- Implemented STRBulkLoader (Sort-Tile-Recursive algorithm):
  a. Sort entries by current dimension
  b. Partition into vertical tiles (~maxEntries per tile)
  c. Recursively process each tile with next dimension
  d. Build tree bottom-up
- Enhanced RTreeFileIndexWriter:
  a. write() method: Collect entries into list
  b. serializedBytes(): Use STRBulkLoader for batch tree construction
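For reference, the classic 2-D form of STR packing can be sketched as below. This is an assumption-laden illustration, not the PR's STRBulkLoader: StrSketch and its Node type are hypothetical, boxes are {minX, minY, maxX, maxY}, the input must be non-empty, and maxEntries must be at least 2.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of 2-D Sort-Tile-Recursive bulk loading.
final class StrSketch {
    static final class Node {
        final double[] box;        // {minX, minY, maxX, maxY}
        final int rowId;           // >= 0 for data entries, -1 for tree nodes
        final List<Node> children;

        Node(double[] box, int rowId, List<Node> children) {
            this.box = box;
            this.rowId = rowId;
            this.children = children;
        }
    }

    static Node entry(int rowId, double[] box) {
        return new Node(box, rowId, new ArrayList<>());
    }

    static double[] union(List<Node> nodes) {
        double[] u = nodes.get(0).box.clone();
        for (Node n : nodes) {
            u[0] = Math.min(u[0], n.box[0]);
            u[1] = Math.min(u[1], n.box[1]);
            u[2] = Math.max(u[2], n.box[2]);
            u[3] = Math.max(u[3], n.box[3]);
        }
        return u;
    }

    // Packs one level: sort by x center, cut into vertical slices,
    // sort each slice by y center, then group runs of maxEntries.
    static List<Node> packLevel(List<Node> items, int maxEntries) {
        List<Node> sorted = new ArrayList<>(items);
        sorted.sort(Comparator.comparingDouble((Node n) -> n.box[0] + n.box[2]));
        int pages = (sorted.size() + maxEntries - 1) / maxEntries;
        int slices = (int) Math.ceil(Math.sqrt(pages));
        int sliceSize = slices * maxEntries;
        List<Node> parents = new ArrayList<>();
        for (int s = 0; s < sorted.size(); s += sliceSize) {
            int sliceEnd = Math.min(s + sliceSize, sorted.size());
            List<Node> slice = new ArrayList<>(sorted.subList(s, sliceEnd));
            slice.sort(Comparator.comparingDouble((Node n) -> n.box[1] + n.box[3]));
            for (int i = 0; i < slice.size(); i += maxEntries) {
                int end = Math.min(i + maxEntries, slice.size());
                List<Node> group = new ArrayList<>(slice.subList(i, end));
                parents.add(new Node(union(group), -1, group));
            }
        }
        return parents;
    }

    // Builds the tree bottom-up by packing levels until one root remains.
    static Node build(List<Node> entries, int maxEntries) {
        List<Node> level = packLevel(entries, maxEntries);
        while (level.size() > 1) {
            level = packLevel(level, maxEntries);
        }
        return level.get(0);
    }
}
```

In a writer that follows the approach described above, write() would only append to a list and the call to build(...) would happen once at serialization time.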
@JingsongLi Thank you for the review! Your comments are very helpful, and I will refine the implementation based on these issues.
Thanks for the update. I found two blocking correctness/contract issues in the current head (
I reproduced both with a small contract test:
The existing tests pass because they mostly call
Please fix these before merge:
Purpose
Paimon currently does not support R-tree indexes. Refer to this paper https://postgis.net/docs/support/rtree.pdf for details on how to implement this index.
The following are the relevant benchmark test results:
Hardware Configuration
Software Configuration
Test Parameters
Query Performance (10,000 queries)
Analysis by Dataset Size
Point Query vs Range Query
Search Performance on 100K Dataset:
Improvement vs Linear Scan:
Point query: 214× speedup
Range query: 182× speedup
Sequential Data Access Pattern
Tests
Run comparison benchmark
org.apache.paimon.fileindex.rtree.RTreeVsLinearScanBenchmark
Run detailed benchmark
org.apache.paimon.fileindex.rtree.RTreeBenchmark
### Here is a schematic diagram of the implementation:
2.1 Full Node Before Split
2.2 Split Process
2.3 Leaf Node Content Details
3.1 Point Query Flow
3.2 Range Query Flow
Example 1: Point Query
Example 2: Range Query
Memory Layout