[common] Introduce RTreeFileIndex for query optimization#7919

Open
xuzifu666 wants to merge 11 commits into apache:master from xuzifu666:rtree_support

xuzifu666 (Member) commented May 20, 2026

Purpose

Paimon currently does not support R-tree indexes. This PR adds one; see the paper at https://postgis.net/docs/support/rtree.pdf for how the index is implemented.
Benchmark results follow:

Hardware Configuration

  • CPU: MacBook Pro (M-series processor)
  • Memory: 16GB LPDDR5
  • Operating System: macOS 14.x

Software Configuration

  • Java Version: OpenJDK 11+
  • Build: Maven 3.8.x

Test Parameters

  • Warmup Iterations: 3
  • Benchmark Iterations: 10
  • Query Count: 1000-10000 queries
  • Random Seed: 42 (for reproducibility)

Query Performance (10,000 queries)

R-Tree:       0.47 µs per query
Linear Scan:  464.41 µs per query
Speedup:      985.58×
Average results per query: 20 records

Analysis by Dataset Size

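| Dataset Size | R-Tree (µs) | Linear Scan (µs) | Speedup | Query Selectivity |
|--------------|-------------|------------------|---------|-------------------|
| 1K           | 0.20        | 14.90            | 75×     | 2%                |
| 10K          | 0.12        | 50.24            | 403×    | 2%                |
| 100K         | 0.35        | 492.44           | 1407×   | 2%                |
| 1M           | 0.39        | 495.25           | 1279×   | 2%                |

| Query Type | Area Size | R-Tree (µs) | Linear Scan (µs) | Speedup | Selectivity |
|------------|-----------|-------------|------------------|---------|-------------|
| Small      | 500×500   | 0.22        | 366.27           | 1684×   | 0.02%       |
| Medium     | 1500×1500 | 0.21        | 400.52           | 1899×   | 0.02%       |
| Large      | 5000×5000 | 0.28        | 556.48           | 1997×   | 0.03%       |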

Point Query vs Range Query
Search Performance on 100K Dataset:

Point queries (1000):      303.76 µs/query (with warmup optimization)
Range queries (100):       357.04 µs/query
Linear scan (100 scans):   65170.62 µs/scan

Improvement vs Linear Scan:
Point query: 214× speedup
Range query: 182× speedup

Sequential Data Access Pattern

1M grid data (1000×1000 points)

Average query time: 1.54 µs
Results returned: 30 records

Performance Characteristics:
- First query: 8.38 µs (cache warmup)
- Subsequent queries: 0.67-0.88 µs (steady state)

Tests

Run comparison benchmark
org.apache.paimon.fileindex.rtree.RTreeVsLinearScanBenchmark

Run detailed benchmark
org.apache.paimon.fileindex.rtree.RTreeBenchmark

### Schematic diagrams of the implementation

  1. R-Tree Basic Structure

                        ┌─────────────────────────────────────┐
                        │         Root Node                   │
                        │  (Internal, stores child bboxes)    │
                        │  bbox: [0,0] to [100,100]           │
                        └──────────┬──────────────┬────────────┘
                                   │              │
                    ┌──────────────┘              └─────────────────┐
                    │                                              │
        ┌───────────▼──────────────┐              ┌────────────────▼──────┐
        │  Internal Node 1         │              │  Internal Node 2      │
        │  bbox: [0,0]-[50,100]    │              │  bbox: [50,0]-[100,100]
        └───────┬──────────────────┘              └────────────┬──────────┘
                │                                             │
        ┌───────┴────────┐                           ┌────────┴────────┐
        │                │                           │                 │
   ┌────▼──────┐  ┌─────▼──────┐           ┌───────▼────┐  ┌────────▼─┐
   │ Leaf 1    │  │ Leaf 2     │           │ Leaf 3     │  │ Leaf 4   │
   │ bbox:     │  │ bbox:      │           │ bbox:      │  │ bbox:    │
   │ [0,0]-    │  │ [20,30]-   │           │ [50,50]-   │  │ [75,75]- │
   │ [20,30]   │  │ [40,50]    │           │ [70,70]    │  │ [100,100]
   │ rowId: 1  │  │ rowId: 2   │           │ rowId: 3   │  │ rowId: 4 │
   └───────────┘  └────────────┘           └────────────┘  └──────────┘
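
The node bboxes in the diagram depend on three operations: expanding a box to cover another, testing intersection, and testing point containment. The sketch below shows one way to write them, assuming boxes stored as min[] and max[] arrays. The class name `BBoxSketch` and its method signatures are illustrative and are not taken from the PR's BoundingBox class.

```java
import java.util.Arrays;

/** Illustrative axis-aligned bounding box; not the PR's BoundingBox class. */
final class BBoxSketch {
    final double[] min;
    final double[] max;

    BBoxSketch(double[] min, double[] max) {
        this.min = min.clone();
        this.max = max.clone();
    }

    /** Returns the smallest box covering both this box and other. */
    BBoxSketch expand(BBoxSketch other) {
        double[] lo = new double[min.length];
        double[] hi = new double[max.length];
        for (int d = 0; d < min.length; d++) {
            lo[d] = Math.min(min[d], other.min[d]);
            hi[d] = Math.max(max[d], other.max[d]);
        }
        return new BBoxSketch(lo, hi);
    }

    /** True if the boxes overlap or touch in every dimension. */
    boolean intersects(BBoxSketch other) {
        for (int d = 0; d < min.length; d++) {
            if (other.max[d] < min[d] || other.min[d] > max[d]) {
                return false;
            }
        }
        return true;
    }

    /** True if the point lies inside or on the boundary of this box. */
    boolean contains(double[] point) {
        for (int d = 0; d < min.length; d++) {
            if (point[d] < min[d] || point[d] > max[d]) {
                return false;
            }
        }
        return true;
    }

    @Override
    public String toString() {
        return Arrays.toString(min) + "-" + Arrays.toString(max);
    }

    public static void main(String[] args) {
        // Leaf 1 and Leaf 2 from the diagram above.
        BBoxSketch leaf1 = new BBoxSketch(new double[] {0, 0}, new double[] {20, 30});
        BBoxSketch leaf2 = new BBoxSketch(new double[] {20, 30}, new double[] {40, 50});
        System.out.println(leaf1.expand(leaf2));     // box covering both leaves
        System.out.println(leaf1.intersects(leaf2)); // the boxes touch at (20,30)
        System.out.println(leaf1.contains(new double[] {12, 8}));
    }
}
```

Treating touching boxes as intersecting is a design choice in this sketch; a stricter test would use `<=` and `>=` in the rejection condition.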

  2. Node Split Process

2.1 Full Node Before Split

┌──────────────────────────────────────────────┐
│           Full Leaf Node (maxEntries=4)      │
│  [1-entry]  [2-entry]  [3-entry]  [4-entry] │
│   point(5,5) point(8,3) point(10,7) point(12,9)
│                                              │
│  Need to insert 5th point: (15, 12) ✗ Over  │
└──────────────────────────────────────────────┘
                    │
                    │ Trigger split
                    ▼

2.2 Split Process

                 Before split (5 entries)
    ┌───────────────────────────────────────┐
    │ Node A: [①②③④⑤]                     │
    │ bbox: [5,3] to [15,12]                 │
    └─────────────────┬─────────────────────┘
                      │
            ┌─────────┴──────────┐
            │ Linear split:      │
            │ Keep first 2.5 ≈ 2 │
            │ New node gets 3    │
            ▼                    ▼

    ┌──────────────────┐        ┌─────────────────┐
    │ Node A (Leaf)    │        │ Node B (Leaf)   │
    │ [①②]            │        │ [③④⑤]          │
    │ bbox: [5,3]-     │        │ bbox: [10,7]-   │
    │       [8,5]      │        │       [15,12]   │
    └────────┬─────────┘        └────────┬────────┘
             │                          │
             └──────────┬───────────────┘
                        │
                        ▼
             ┌──────────────────────┐
             │  Parent Node Updated │
             │  Add Node B ref      │
             │  [Node A] [Node B]   │
             └──────────────────────┘

2.3 Leaf Node Content Details

Before split:
┌─────────────────────────────────────────────────┐
│          RTreeNode (leaf=true)                  │
│                                                  │
│  leafEntries: [                                  │
│    LeafEntry(rowId=1, bbox=[5,5]-[5,5]),        │
│    LeafEntry(rowId=2, bbox=[8,3]-[8,3]),        │
│    LeafEntry(rowId=3, bbox=[10,7]-[10,7]),      │
│    LeafEntry(rowId=4, bbox=[12,9]-[12,9]),      │
│    LeafEntry(rowId=5, bbox=[15,12]-[15,12])     │
│  ]                                               │
│                                                  │
│  boundingBox: [5,3] to [15,12]                   │
│  (automatically expanded from all entries)       │
└─────────────────────────────────────────────────┘

After split:
┌────────────────────────────────┐  ┌──────────────────────────┐
│   Node A (leaf=true)           │  │   Node B (leaf=true)     │
│                                │  │                          │
│ leafEntries: [                 │  │ leafEntries: [           │
│   LeafEntry(1, [5,5]),         │  │   LeafEntry(3, [10,7]),  │
│   LeafEntry(2, [8,3])          │  │   LeafEntry(4, [12,9]),  │
│ ]                              │  │   LeafEntry(5, [15,12])  │
│                                │  │ ]                        │
│ bbox: [5,3]-[8,5]              │  │ bbox: [10,7]-[15,12]     │
└────────────────────────────────┘  └──────────────────────────┘
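
The midpoint split illustrated above can be sketched as follows: keep the first size/2 entries in the original node, move the rest to a new node, and recompute each bbox from the entries that remain. The names `MidpointSplitSketch`, `split`, and `bboxOf`, and the `double[] {x, y}` point representation, are assumptions made for illustration; the PR's RTreeNode and LeafEntry classes are not reproduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MidpointSplitSketch {

    /** Returns {minX, minY, maxX, maxY} covering all 2D points. */
    static double[] bboxOf(List<double[]> points) {
        double[] box = {
            Double.POSITIVE_INFINITY, Double.POSITIVE_INFINITY,
            Double.NEGATIVE_INFINITY, Double.NEGATIVE_INFINITY
        };
        for (double[] p : points) {
            box[0] = Math.min(box[0], p[0]);
            box[1] = Math.min(box[1], p[1]);
            box[2] = Math.max(box[2], p[0]);
            box[3] = Math.max(box[3], p[1]);
        }
        return box;
    }

    /** Splits at the midpoint: index 0 is the kept half, index 1 the new node. */
    static List<List<double[]>> split(List<double[]> entries) {
        int mid = entries.size() / 2; // 5 entries: keep 2, move 3
        List<List<double[]>> halves = new ArrayList<>();
        halves.add(new ArrayList<>(entries.subList(0, mid)));
        halves.add(new ArrayList<>(entries.subList(mid, entries.size())));
        return halves;
    }

    public static void main(String[] args) {
        // The five points from the split diagram.
        List<double[]> entries = Arrays.asList(
                new double[] {5, 5}, new double[] {8, 3}, new double[] {10, 7},
                new double[] {12, 9}, new double[] {15, 12});
        List<List<double[]>> halves = split(entries);
        System.out.println("Node A bbox: " + Arrays.toString(bboxOf(halves.get(0))));
        System.out.println("Node B bbox: " + Arrays.toString(bboxOf(halves.get(1))));
    }
}
```

Because this split depends only on insertion order, the two resulting bboxes can overlap heavily on unordered input, which is the concern raised later in the review thread.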

  3. Query Process

3.1 Point Query Flow

             Input: Point (72, 72)
                    │
                    ▼
        ┌──────────────────────────────┐
        │ Check Root bbox              │
        │ Point(72,72) intersects      │
        │ [0,0]-[100,100]?             │
        │       Yes ✓                  │
        └────────┬─────────────────────┘
                 │
        ┌────────▼─────────────────────┐
        │ Check child bboxes           │
        │ Point in [0,0]-[50,100]?     │
        │       No ✗                   │
        │ Point in [50,0]-[100,100]?   │
        │       Yes ✓                  │
        └────────┬─────────────────────┘
                 │
        ┌────────▼─────────────────────┐
        │ Recursively check Internal   │
        │ Node 2                       │
        │ Point in [50,50]-[70,70]?    │
        │       No ✗                   │
        │ Point in [75,75]-[100,100]?  │
        │       No ✗                   │
        └────────┬─────────────────────┘
                 │
        ┌────────▼─────────────────────┐
        │ No leaf bbox contains        │
        │ Point(72,72)                 │
        └────────┬─────────────────────┘
                 │
                 ▼
            ┌──────────┐
            │ Empty    │
            │ result   │
            └──────────┘

3.2 Range Query Flow

Input: BoundingBox [10,5] to [50,60]
           │
           ▼
   ┌───────────────────────────────┐
   │ Check Root bbox intersection  │
   │ [10,5]-[50,60] intersects     │
   │ [0,0]-[100,100]?              │
   │      Yes ✓ Continue           │
   └───────┬─────────────────────┬─┘
           │                     │
   ┌───────▼──────────┐  ┌──────▼──────────┐
   │ Subtree1 inter?  │  │ Subtree2 inter? │
   │ [0,0]-[50,100]?  │  │ [50,0]-[100,100]
   │ Yes ✓ Recurse    │  │ Yes ✓ Recurse  │
   └───────┬──────────┘  └──────┬──────────┘
           │                    │
   ┌───────▼──────┐    ┌────────▼──────┐
   │ Check Leaf   │    │ Check Leaf    │
   │ 1-2 entries  │    │ 3-4 entries   │
   │              │    │               │
   └───────┬──────┘    └────────┬──────┘
           │                    │
   ┌───────▼─────────────────┬──▼──────┐
   │ Return all intersecting │ rowIds  │
   │ [rowId1, rowId3, ...]   │         │
   └──────────────────────────────────┘
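
The two query flows reduce to one recursive routine: prune any subtree whose bbox misses the query, and at a leaf test every entry's own bbox before returning its rowId. The sketch below assumes 2D boxes stored as `{minX, minY, maxX, maxY}` and treats a point query as a degenerate box. `RangeSearchSketch`, its nested `Node`, and the helper names are hypothetical and do not mirror the PR's RTree or RTreeNode.

```java
import java.util.ArrayList;
import java.util.List;

final class RangeSearchSketch {

    /** Boxes are {minX, minY, maxX, maxY}; touching boxes count as intersecting. */
    static boolean intersects(double[] a, double[] b) {
        return a[0] <= b[2] && b[0] <= a[2] && a[1] <= b[3] && b[1] <= a[3];
    }

    static final class Node {
        final double[] bbox;
        final List<Node> children = new ArrayList<>();       // internal nodes only
        final List<Integer> rowIds = new ArrayList<>();      // leaves only
        final List<double[]> entryBoxes = new ArrayList<>(); // parallel to rowIds

        Node(double... bbox) {
            this.bbox = bbox;
        }

        boolean isLeaf() {
            return children.isEmpty();
        }
    }

    /** Builds a leaf holding a single entry whose box equals the leaf's bbox. */
    static Node leaf(int rowId, double... box) {
        Node n = new Node(box);
        n.rowIds.add(rowId);
        n.entryBoxes.add(box);
        return n;
    }

    static void search(Node node, double[] query, List<Integer> out) {
        if (!intersects(node.bbox, query)) {
            return; // prune the whole subtree
        }
        if (node.isLeaf()) {
            for (int i = 0; i < node.rowIds.size(); i++) {
                // Per-entry check: a matching node bbox alone is not sufficient.
                if (intersects(node.entryBoxes.get(i), query)) {
                    out.add(node.rowIds.get(i));
                }
            }
        } else {
            for (Node child : node.children) {
                search(child, query, out);
            }
        }
    }

    /** The four-leaf tree from the basic-structure diagram. */
    static Node diagramTree() {
        Node root = new Node(0, 0, 100, 100);
        Node n1 = new Node(0, 0, 50, 100);
        Node n2 = new Node(50, 0, 100, 100);
        n1.children.add(leaf(1, 0, 0, 20, 30));
        n1.children.add(leaf(2, 20, 30, 40, 50));
        n2.children.add(leaf(3, 50, 50, 70, 70));
        n2.children.add(leaf(4, 75, 75, 100, 100));
        root.children.add(n1);
        root.children.add(n2);
        return root;
    }

    public static void main(String[] args) {
        List<Integer> hits = new ArrayList<>();
        search(diagramTree(), new double[] {10, 5, 50, 60}, hits);
        System.out.println("Range query hits: " + hits);
    }
}
```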

  4. Complete Insert Flow

Insert operation: insert(point=[25, 35], rowId=10)

         ┌──────────────────────────────┐
         │ 1. Choose best path          │
         │ (minimum expansion area)     │
         └──────────────┬───────────────┘
                        │
         ┌──────────────▼───────────────┐
         │ 2. Recursively descend       │
         │ from Root to leaf            │
         │ Calculate expansion cost     │
         └──────────────┬───────────────┘
                        │
         ┌──────────────▼───────────────┐
         │ 3. Reach leaf node          │
         │ Create LeafEntry            │
         └──────────────┬───────────────┘
                        │
         ┌──────────────▼───────────────┐
         │ 4. Add Entry                │
         │ node.addLeafEntry(entry)    │
         │ Update bbox                 │
         └──────────────┬───────────────┘
                        │
         ┌──────────────▼───────────────┐
         │ 5. Check capacity           │
         │ if (node.canSplit())        │
         │   splitNode()               │
         └──────────────┬───────────────┘
                        │
         ┌──────────────▼───────────────┐
         │ 6. If split:                │
         │ Parent adds new node        │
         │ Possible cascading split    │
         └──────────────┬───────────────┘
                        │
         ┌──────────────▼───────────────┐
         │ 7. Backtrack update parent  │
         │ bboxes until Root           │
         └──────────────┬───────────────┘
                        │
                        ▼
                   ✓ Insert complete
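
Steps 1 and 2 of the flow above hinge on one computation: how much a child's bbox must grow to cover the new point. A minimal sketch of that choice follows, assuming 2D boxes as `{minX, minY, maxX, maxY}` and breaking ties by smaller area (the usual rule in Guttman-style R-trees). `ChooseSubtreeSketch` and its methods are illustrative names, not the PR's API.

```java
import java.util.Arrays;
import java.util.List;

final class ChooseSubtreeSketch {

    static double area(double[] b) {
        return (b[2] - b[0]) * (b[3] - b[1]);
    }

    /** Area growth needed for box to also cover the point (x, y). */
    static double enlargement(double[] box, double x, double y) {
        double[] grown = {
            Math.min(box[0], x), Math.min(box[1], y),
            Math.max(box[2], x), Math.max(box[3], y)
        };
        return area(grown) - area(box);
    }

    /** Index of the child needing the least enlargement; ties go to the smaller box. */
    static int chooseChild(List<double[]> childBoxes, double x, double y) {
        int best = -1;
        double bestGrowth = Double.POSITIVE_INFINITY;
        for (int i = 0; i < childBoxes.size(); i++) {
            double growth = enlargement(childBoxes.get(i), x, y);
            boolean smallerOnTie = best >= 0
                    && growth == bestGrowth
                    && area(childBoxes.get(i)) < area(childBoxes.get(best));
            if (growth < bestGrowth || smallerOnTie) {
                best = i;
                bestGrowth = growth;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // insert(point=[25, 35]) against the two internal nodes from the structure diagram.
        List<double[]> internal = Arrays.asList(
                new double[] {0, 0, 50, 100}, new double[] {50, 0, 100, 100});
        System.out.println("Chosen internal child: " + chooseChild(internal, 25, 35));

        // Then against Leaf 1 and Leaf 2 under the chosen internal node.
        List<double[]> leaves = Arrays.asList(
                new double[] {0, 0, 20, 30}, new double[] {20, 30, 40, 50});
        System.out.println("Chosen leaf: " + chooseChild(leaves, 25, 35));
    }
}
```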

  5. Data Serialization Flow

In-memory RTree structure
       │
       ▼
┌──────────────────────────────────────────┐
│ Serialization format:                    │
│                                          │
│ ┌─ Metadata                             │
│ │ ├─ dimensions: int (2)                │
│ │ ├─ maxEntries: int (32)               │
│ │ └─ treeSize: int (1000)               │
│ │                                       │
│ ├─ Recursively serialize Node:          │
│ │ ├─ isLeaf: boolean (true/false)      │
│ │ ├─ entryCount: int (N)               │
│ │ ├─ boundingBox: double[4]            │
│ │ │  (min_x, min_y, max_x, max_y)      │
│ │ │                                    │
│ │ └─ If Leaf:                          │
│ │    ├─ rowId1: int + bbox             │
│ │    ├─ rowId2: int + bbox             │
│ │    └─ ...                            │
│ │                                       │
│ │ If Internal:                          │
│ │    ├─ Child1 Node (recursive)        │
│ │    ├─ Child2 Node (recursive)        │
│ │    └─ ...                            │
│ │                                       │
│ └─ All nodes in DFS order              │
│                                          │
│ Result: byte[] (binary data)            │
└──────────────────────────────────────────┘
       │
       ▼
   File storage
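
The layout above can be exercised with a small round trip: write three metadata ints, then each node in pre-order DFS as (isLeaf, entryCount, bbox, payload), and read it back the same way. The sketch below is one possible encoding under that description, fixed to 2D; `SerializationSketch` and its nested `Node` are hypothetical and are not the PR's writer or reader. Note that the reader constructs every node from its own serialized isLeaf flag.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class SerializationSketch {

    /** Illustrative node; boxes are {minX, minY, maxX, maxY}. */
    static final class Node {
        final boolean leaf;
        final double[] bbox;
        final List<Node> children = new ArrayList<>();       // internal nodes only
        final List<Integer> rowIds = new ArrayList<>();      // leaves only
        final List<double[]> entryBoxes = new ArrayList<>(); // parallel to rowIds

        Node(boolean leaf, double... bbox) {
            this.leaf = leaf;
            this.bbox = bbox;
        }
    }

    static byte[] serialize(Node root, int maxEntries, int treeSize) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(2); // dimensions
            out.writeInt(maxEntries);
            out.writeInt(treeSize);
            writeNode(root, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    private static void writeNode(Node n, DataOutputStream out) throws IOException {
        out.writeBoolean(n.leaf);
        out.writeInt(n.leaf ? n.rowIds.size() : n.children.size());
        for (double v : n.bbox) {
            out.writeDouble(v);
        }
        if (n.leaf) {
            for (int i = 0; i < n.rowIds.size(); i++) {
                out.writeInt(n.rowIds.get(i));
                for (double v : n.entryBoxes.get(i)) {
                    out.writeDouble(v);
                }
            }
        } else {
            for (Node child : n.children) {
                writeNode(child, out); // pre-order DFS
            }
        }
    }

    static Node deserialize(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            in.readInt(); // dimensions (fixed at 2 in this sketch)
            in.readInt(); // maxEntries
            in.readInt(); // treeSize
            return readNode(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static Node readNode(DataInputStream in) throws IOException {
        boolean leaf = in.readBoolean();
        int count = in.readInt();
        Node n = new Node(leaf, readBox(in)); // each node takes its own serialized flag
        for (int i = 0; i < count; i++) {
            if (leaf) {
                n.rowIds.add(in.readInt());
                n.entryBoxes.add(readBox(in));
            } else {
                n.children.add(readNode(in));
            }
        }
        return n;
    }

    private static double[] readBox(DataInputStream in) throws IOException {
        return new double[] {in.readDouble(), in.readDouble(), in.readDouble(), in.readDouble()};
    }

    public static void main(String[] args) {
        Node root = new Node(false, 0, 0, 100, 100);
        Node leaf = new Node(true, 5, 5, 5, 5);
        leaf.rowIds.add(1);
        leaf.entryBoxes.add(new double[] {5, 5, 5, 5});
        root.children.add(leaf);
        Node copy = deserialize(serialize(root, 32, 1));
        System.out.println("root leaf=" + copy.leaf + ", child leaf=" + copy.children.get(0).leaf);
    }
}
```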

  6. Class Relationship Diagram

┌─────────────────────────────────────────────────────┐
│                     RTree                          │
│  - root: RTreeNode                                 │
│  - dimensions: int                                 │
│  - maxEntries: int                                 │
│                                                     │
│  + insert(point[], rowId)                          │
│  + search(BoundingBox): List<rowId>                │
│  + search(point[]): List<rowId>                    │
└─────────────────────────────────────────────────────┘
                        │
                        │ contains
                        │ (tree structure)
                        ▼
┌──────────────────────────────────────────────────────────┐
│                    RTreeNode                           │
│  - boundingBox: BoundingBox                            │
│  - leafEntries: List<LeafEntry> (if leaf=true)        │
│  - children: List<RTreeNode> (if leaf=false)          │
│  - parent: RTreeNode                                  │
│  - isLeaf: boolean                                    │
│  - maxEntries: int                                    │
│                                                        │
│  + isLeaf(): boolean                                  │
│  + addLeafEntry(LeafEntry)                           │
│  + addChild(RTreeNode)                               │
│  + canSplit(): boolean                               │
│  + getBoundingBox(): BoundingBox                      │
└──────────────────────────────────────────────────────────┘
         │                              │
         │ contains                     │ references
         │                              │ parent
         ▼                              ▼
    ┌──────────────────┐         ┌─────────────┐
    │   LeafEntry      │         │BoundingBox  │
    │                  │         │             │
    │ - rowId: int     │         │ - min[]     │
    │ - bbox:BBox      │         │ - max[]     │
    │                  │         │ - dimensions│
    │ + getRowId()     │         │             │
    │ + getBbox()      │         │ + expand()  │
    └──────────────────┘         │ + intersects
                                 │ + contains()
                                 └─────────────┘

  7. Typical Query Examples

Example 1: Point Query

Query: search(Point[15, 35])

Tree structure:
           Root [0,0]-[100,100]
          /            \
      [0,0]-[50,100]   [50,0]-[100,100]
      /         \         /          \
   Leaf1      Leaf2    Leaf3      Leaf4
 [5,5]      [20,30]  [50,50]    [75,75]
 
Execution:
Step 1: Point[15,35] in Root bbox? YES → Continue
Step 2: Point in left subtree [0,0]-[50,100]? YES → Recurse
        Point in right subtree [50,0]-[100,100]? NO → Skip
Step 3: Check Leaf1 [5,5]-[5,5]: Point inside? NO
Step 4: Check Leaf2 [20,30]-[20,30]: Point inside? NO
Step 5: Result: Empty

Complexity: O(log N) node visits; constant for N=4 leaves
           Actual operations: 3 node bbox checks + 2 leaf entry checks

Example 2: Range Query

Query: search(BoundingBox[10,10]-[60,60])

Execution:
Step 1: Range[10,10]-[60,60] intersects Root[0,0]-[100,100]? YES
Step 2: Intersects left subtree[0,0]-[50,100]? YES → Recurse
        Intersects right subtree[50,0]-[100,100]? YES → Recurse
Step 3: Left subtree:
        Leaf1[5,5] in range? NO
        Leaf2[20,30] in range? YES → Add to result
Step 4: Right subtree:
        Leaf3[50,50] in range? YES → Add to result
        Leaf4[75,75] in range? NO (75 > 60)
Step 5: Result: [Leaf2_rowId, Leaf3_rowId]

Complexity: O(log N + K)
           where N=4, K=2 (result count)
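
A brute-force scan is a convenient way to check walkthroughs like the two examples, and it is also the baseline that the benchmarks compare against. The sketch below runs both example queries over the four example points; `LinearScanSketch` and the choice of rowId = leaf number are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class LinearScanSketch {

    /** Returns 1-based row ids of points inside query = {minX, minY, maxX, maxY}. */
    static List<Integer> scan(double[][] points, double[] query) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < points.length; i++) {
            double x = points[i][0];
            double y = points[i][1];
            if (x >= query[0] && x <= query[2] && y >= query[1] && y <= query[3]) {
                hits.add(i + 1); // row id n corresponds to Leaf n in the examples
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        double[][] points = {{5, 5}, {20, 30}, {50, 50}, {75, 75}};
        // Example 1: point query, expressed as a degenerate box.
        System.out.println(scan(points, new double[] {15, 35, 15, 35})); // []
        // Example 2: range query [10,10]-[60,60].
        System.out.println(scan(points, new double[] {10, 10, 60, 60})); // [2, 3]
    }
}
```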

Memory Layout

Java heap memory R-Tree structure:

┌─────────────────────────────────────────────────────┐
│                  RTree object                      │
│  ┌──────────────────────────────────────────┐      │
│  │ root: RTreeNode                          │      │
│  │ dimensions: 2                            │      │
│  │ maxEntries: 32                           │      │
│  └──────────────────────────────────────────┘      │
└────────────┬──────────────────────────────────────┘
             │ references
             ▼
    ┌────────────────────┐
    │   RTreeNode obj    │ (root)
    │ (internal node)    │
    │                    │
    │ children:          │
    │  [Node1 ref]       │
    │  [Node2 ref]       │
    │  [Node3 ref]       │
    │                    │
    │ leafEntries: []    │
    │ (empty)            │
    │                    │
    │ boundingBox:       │
    │  ├─ min: [0, 0]    │
    │  └─ max: [100,100] │
    │                    │
    │ parent: null       │
    └────┬─┬─┬───────────┘
         │ │ │
    ┌────┘ │ │
    │      │ └────────┐
    │      └──────┐   │
    ▼             ▼   ▼
  Node1         Node2 Node3
  (leaf)        (leaf)(leaf)
   │             │     │
   ▼             ▼     ▼
[LeafEntry]  [LE][LE] [LE][LE][LE]
rowId:1      rowId:2-3 rowId:4-6

} else {
for (int i = 0; i < entryCount; i++) {
RTreeNode child = new RTreeNode(dimensions, maxEntries, false);
node.addChild(child);

Contributor

RTreeNode.isLeaf is final, and the RTree constructor creates the root with isLeaf=true. During deserialization, if the root of the tree is an internal node, the code adds children to a node whose isLeaf is still true. A later search then sees node.isLeaf() return true and reads leafRowIds instead of recursing into the children, so the deserialized tree cannot be queried correctly.

Member Author

I changed private final boolean isLeaf to private boolean isLeaf, added a setLeaf(boolean) method, and now call setLeaf() during deserialization to correct the root node's leaf flag.


RTreeNode newNode = new RTreeNode(dimensions, maxEntries, true);

int mid = entries.size() / 2;

Contributor

This completely disregards spatial location. A correct R-Tree split should minimize the MBR overlap between the two resulting nodes; otherwise many nodes will intersect each query and the search degenerates into a linear scan.

Member Author

I made these changes:

  1. Implemented QuadraticSplit algorithm (2-step approach):
    a. PickSeeds: Select two entries with maximum distance as initial seeds
    b. Assign: Assign remaining entries to group with minimum bbox expansion
  2. Implemented QuadraticSplitInternal: Same algorithm for internal node splitting
  3. Modified RTree.java to use QuadraticSplit instead of linear splitting
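
The two steps listed above can be sketched as follows, following the description in this comment: seeds are the pair of entries farthest apart, and each remaining entry joins the group whose bbox grows least in area. This is an illustration for 2D point entries only. `QuadraticSplitSketch` is a hypothetical name, and minimum-fill handling, which a production split would normally need, is omitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class QuadraticSplitSketch {

    static double area(double[] b) {
        return (b[2] - b[0]) * (b[3] - b[1]);
    }

    /** Returns box b = {minX, minY, maxX, maxY} grown to cover point p. */
    static double[] grow(double[] b, double[] p) {
        return new double[] {
            Math.min(b[0], p[0]), Math.min(b[1], p[1]),
            Math.max(b[2], p[0]), Math.max(b[3], p[1])
        };
    }

    /** Splits 2D points into two groups of indices; needs at least two points. */
    static List<List<Integer>> split(List<double[]> points) {
        if (points.size() < 2) {
            throw new IllegalArgumentException("need at least two entries to split");
        }
        // PickSeeds: the pair with the largest squared distance.
        int seedA = 0;
        int seedB = 1;
        double farthest = -1;
        for (int i = 0; i < points.size(); i++) {
            for (int j = i + 1; j < points.size(); j++) {
                double dx = points.get(i)[0] - points.get(j)[0];
                double dy = points.get(i)[1] - points.get(j)[1];
                double dist = dx * dx + dy * dy;
                if (dist > farthest) {
                    farthest = dist;
                    seedA = i;
                    seedB = j;
                }
            }
        }
        List<Integer> groupA = new ArrayList<>();
        List<Integer> groupB = new ArrayList<>();
        groupA.add(seedA);
        groupB.add(seedB);
        double[] a = points.get(seedA);
        double[] b = points.get(seedB);
        double[] boxA = {a[0], a[1], a[0], a[1]};
        double[] boxB = {b[0], b[1], b[0], b[1]};
        // Assign: each remaining entry goes to the group whose bbox grows least.
        for (int i = 0; i < points.size(); i++) {
            if (i == seedA || i == seedB) {
                continue;
            }
            double[] grownA = grow(boxA, points.get(i));
            double[] grownB = grow(boxB, points.get(i));
            if (area(grownA) - area(boxA) <= area(grownB) - area(boxB)) {
                groupA.add(i);
                boxA = grownA;
            } else {
                groupB.add(i);
                boxB = grownB;
            }
        }
        return Arrays.asList(groupA, groupB);
    }

    public static void main(String[] args) {
        // The five points from the split diagram in the PR description.
        List<double[]> points = Arrays.asList(
                new double[] {5, 5}, new double[] {8, 3}, new double[] {10, 7},
                new double[] {12, 9}, new double[] {15, 12});
        System.out.println(split(points));
    }
}
```

The pairwise seed search is what makes the split quadratic in the number of entries, which is acceptable because a node holds at most maxEntries + 1 entries when it splits.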


if (node.isLeaf()) {
for (Integer rowId : node.getLeafRowIds()) {
results.add(rowId);

Contributor

Row IDs are added directly, without checking whether entry.bbox intersects searchBox.

Member Author

Changed as follows:

  1. Enhanced RTree.java search() method:
    a. Leaf nodes: Added per-entry checking entry.getBbox().intersects(searchBox)
    b. Previously only checked node.getBoundingBox().intersects()
  2. LeafEntry structure: Stores rowId + bbox for precision verification

import org.apache.paimon.types.DataType;

/** The implementation of R-Tree file index. */
public class RTreeFileIndex implements FileIndexer {

Contributor

As a file index, all data is already known at write time, and a tree built by inserting items one at a time is of much lower quality than one built with STR (Sort-Tile-Recursive) bulk loading.

Member Author

My current alternative is:

  1. Implemented STRBulkLoader (Sort-Tile-Recursive algorithm):
    a. Sort entries by current dimension
    b. Partition into vertical tiles (~maxEntries per tile)
    c. Recursively process each tile with next dimension
    d. Build tree bottom-up
  2. Enhanced RTreeFileIndexWriter:
    a. write() method: Collect entries into list
    b. serializedBytes(): Use STRBulkLoader for batch tree construction
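
For the 2D case, the leaf level of the STR steps listed above can be sketched like this: sort by x, cut into roughly sqrt(leafCount) vertical slices, sort each slice by y, and pack runs of maxEntries points into leaves. Upper levels would repeat the same tiling on the resulting leaf bboxes and are omitted here. `StrPackingSketch` and `packLeaves` are illustrative names, not the PR's STRBulkLoader.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class StrPackingSketch {

    /** Packs 2D points into leaves of at most maxEntries points each. */
    static List<List<double[]>> packLeaves(List<double[]> points, int maxEntries) {
        if (maxEntries < 1) {
            throw new IllegalArgumentException("maxEntries must be positive");
        }
        // a. Sort by the first dimension.
        List<double[]> sorted = new ArrayList<>(points);
        sorted.sort(Comparator.comparingDouble((double[] p) -> p[0]));

        // b. Partition into vertical slices of sliceCount * maxEntries points.
        int leafCount = (sorted.size() + maxEntries - 1) / maxEntries;
        int sliceCount = (int) Math.ceil(Math.sqrt(leafCount));
        int sliceSize = sliceCount * maxEntries;

        List<List<double[]>> leaves = new ArrayList<>();
        for (int start = 0; start < sorted.size(); start += sliceSize) {
            int end = Math.min(start + sliceSize, sorted.size());
            List<double[]> slice = new ArrayList<>(sorted.subList(start, end));

            // c. Within each slice, sort by the next dimension and cut into leaves.
            slice.sort(Comparator.comparingDouble((double[] p) -> p[1]));
            for (int i = 0; i < slice.size(); i += maxEntries) {
                int leafEnd = Math.min(i + maxEntries, slice.size());
                leaves.add(new ArrayList<>(slice.subList(i, leafEnd)));
            }
        }
        return leaves;
    }

    public static void main(String[] args) {
        // A 10x10 grid packed with maxEntries = 4.
        List<double[]> grid = new ArrayList<>();
        for (int x = 0; x < 10; x++) {
            for (int y = 0; y < 10; y++) {
                grid.add(new double[] {x, y});
            }
        }
        List<List<double[]>> leaves = packLeaves(grid, 4);
        System.out.println("Leaves: " + leaves.size());
    }
}
```

On the grid in `main`, each leaf ends up holding a compact 2×2 tile, which is why bulk-loaded leaves tend to overlap far less than leaves built by repeated insertion.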

xuzifu666 (Member, Author)

@JingsongLi Thank you for the review! Your comments are very helpful, and I will refine the PR based on these issues.

leaves12138 (Contributor)

Thanks for the update. I found two blocking correctness/contract issues in the current head (a4143bf63109).

  1. RTreeFileIndexWriter does not accept the object type passed by the real file-index write path for ARRAY<DOUBLE> columns. DataFileIndexWriter.FileIndexMaintainer.write() calls fileIndexWriter.writeRecord(getter.getFieldOrNull(row)), so an array field is passed as Paimon's InternalArray implementation, e.g. GenericArray / binary array, not as double[] or java.util.List. However, RTreeFileIndexWriter.extractPoint() only handles List and double[]. A normal write through the file-index path can therefore fail with:

    Cannot extract point from: org.apache.paimon.data.GenericArray
    
  2. Production deserialization still loses the leaf flag for non-root nodes. RTreeFileIndexReader.deserializeNode() only calls node.setLeaf(isLeaf) for the root. Child nodes are always created with new RTreeNode(..., false), and the recursive call does not update their serialized isLeaf value. As a result, leaf children under an internal node remain isLeaf=false; after serializing a multi-level tree with RTreeFileIndexWriter and reading it back with RTreeFileIndex.createReader(), an equality query for an existing point can return SKIP.

I reproduced both with a small contract test:

  • writer.writeRecord(new GenericArray(new double[] {1.0, 2.0})) throws from RTreeFileIndexWriter.extractPoint().
  • Writing 200 points, serializing with RTreeFileIndexWriter, deserializing with the production RTreeFileIndexReader, and querying [50.0, 50.0] returns no match.

The existing tests pass because they mostly call writer.write(new double[] {...}) directly, bypassing FileIndexWriter.writeRecord(), and RTreeSerializationTest uses its own test deserializer that constructs nodes with the serialized isLeaf flag, bypassing the production reader bug.

Please fix these before merge:

  1. Make the writer handle Paimon's InternalArray for ARRAY<DOUBLE> values, and check whether reader literals need the same treatment.
  2. Preserve isLeaf for every node during production deserialization, not only the root.
  3. Add regression coverage through the real FileIndexWriter.writeRecord() / RTreeFileIndex.createReader() path.
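
For the first requested fix, one possible shape is an extractPoint that accepts an array view in addition to double[] and List. To stay self-contained, the sketch below uses a hypothetical `DoubleArrayView` interface as a stand-in for Paimon's InternalArray; a real fix would test for `org.apache.paimon.data.InternalArray` instead, assuming it exposes `size()` and `getDouble(int)`. Null elements are not handled here.

```java
import java.util.List;

final class ExtractPointSketch {

    /** Hypothetical stand-in for Paimon's InternalArray; only the accessors used here. */
    interface DoubleArrayView {
        int size();

        double getDouble(int pos);
    }

    static double[] extractPoint(Object value) {
        if (value instanceof double[]) {
            return ((double[]) value).clone();
        }
        if (value instanceof DoubleArrayView) {
            // The branch the real file-index write path would need for ARRAY<DOUBLE>.
            DoubleArrayView array = (DoubleArrayView) value;
            double[] point = new double[array.size()];
            for (int i = 0; i < point.length; i++) {
                point[i] = array.getDouble(i);
            }
            return point;
        }
        if (value instanceof List) {
            List<?> list = (List<?>) value;
            double[] point = new double[list.size()];
            for (int i = 0; i < point.length; i++) {
                Object element = list.get(i);
                if (!(element instanceof Number)) {
                    throw new IllegalArgumentException("Non-numeric coordinate at index " + i);
                }
                point[i] = ((Number) element).doubleValue();
            }
            return point;
        }
        String type = value == null ? "null" : value.getClass().getName();
        throw new IllegalArgumentException("Cannot extract point from: " + type);
    }

    public static void main(String[] args) {
        DoubleArrayView view = new DoubleArrayView() {
            @Override
            public int size() {
                return 2;
            }

            @Override
            public double getDouble(int pos) {
                return pos == 0 ? 1.0 : 2.0;
            }
        };
        double[] point = extractPoint(view);
        System.out.println(point[0] + ", " + point[1]);
    }
}
```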
