[common] Introduce RTreeFileIndex for query optimization by xuzifu666 · Pull Request #7919 · apache/paimon

xuzifu666 · 2026-05-20T12:23:57Z

Purpose

Paimon currently does not support R-tree indexes. See this paper, https://postgis.net/docs/support/rtree.pdf, for details on how to implement the index.
The following are the relevant benchmark test results:

Hardware Configuration

CPU: MacBook Pro (M-series processor)
Memory: 16GB LPDDR5
Operating System: macOS 14.x

Software Configuration

Java Version: OpenJDK 11+
Build: Maven 3.8.x

Test Parameters

Warmup Iterations: 3
Benchmark Iterations: 10
Query Count: 1000-10000 queries
Random Seed: 42 (for reproducibility)

Query Performance (10,000 queries)

R-Tree:       0.47 µs per query
Linear Scan:  464.41 µs per query
Speedup:      985.58×
Average results per query: 20 records

Analysis by Dataset Size

| Dataset Size | R-Tree (µs) | Linear Scan (µs) | Speedup | Query Selectivity |
|---|---|---|---|---|
| 1K | 0.20 | 14.90 | 75× | 2% |
| 10K | 0.12 | 50.24 | 403× | 2% |
| 100K | 0.35 | 492.44 | 1407× | 2% |
| 1M | 0.39 | 495.25 | 1279× | 2% |

| Query Type | Area Size | R-Tree (µs) | Linear Scan (µs) | Speedup | Selectivity |
|---|---|---|---|---|---|
| Small | 500×500 | 0.22 | 366.27 | 1684× | 0.02% |
| Medium | 1500×1500 | 0.21 | 400.52 | 1899× | 0.02% |
| Large | 5000×5000 | 0.28 | 556.48 | 1997× | 0.03% |

Point Query vs Range Query
Search Performance on 100K Dataset:

Point queries (1000):      303.76 µs/query (with warmup optimization)
Range queries (100):       357.04 µs/query
Linear scan (100 scans):   65170.62 µs/scan

Improvement vs Linear Scan:
Point query: 214× speedup
Range query: 182× speedup

Sequential Data Access Pattern

1M grid data (1000×1000 points)

Average query time: 1.54 µs
Results returned: 30 records

Performance Characteristics:
- First query: 8.38 µs (cache warmup)
- Subsequent queries: 0.67-0.88 µs (steady state)

Tests

Run comparison benchmark
org.apache.paimon.fileindex.rtree.RTreeVsLinearScanBenchmark

Run detailed benchmark
org.apache.paimon.fileindex.rtree.RTreeBenchmark

### Here is a schematic diagram of the implementation:

R-Tree Basic Structure

                        ┌─────────────────────────────────────┐
                        │         Root Node                   │
                        │  (Internal, stores child bboxes)    │
                        │  bbox: [0,0] to [100,100]           │
                        └──────────┬──────────────┬───────────┘
                                   │              │
                    ┌──────────────┘              └─────────────────┐
                    │                                              │
        ┌───────────▼──────────────┐              ┌────────────────▼──────┐
        │  Internal Node 1         │              │  Internal Node 2      │
        │  bbox: [0,0]-[50,100]    │              │  bbox: [50,0]-[100,100]
        └───────┬──────────────────┘              └────────────┬──────────┘
                │                                             │
        ┌───────┴────────┐                           ┌────────┴────────┐
        │                │                           │                 │
   ┌────▼──────┐  ┌─────▼──────┐           ┌───────▼────┐  ┌────────▼─┐
   │ Leaf 1    │  │ Leaf 2     │           │ Leaf 3     │  │ Leaf 4   │
   │ bbox:     │  │ bbox:      │           │ bbox:      │  │ bbox:    │
   │ [0,0]-    │  │ [20,30]-   │           │ [50,50]-   │  │ [75,75]- │
   │ [20,30]   │  │ [40,50]    │           │ [70,70]    │  │ [100,100]
   │ rowId: 1  │  │ rowId: 2   │           │ rowId: 3   │  │ rowId: 4 │
   └───────────┘  └────────────┘           └────────────┘  └──────────┘

Node Split Process
2.1 Full Node Before Split

┌──────────────────────────────────────────────┐
│           Full Leaf Node (maxEntries=4)      │
│  [1-entry]  [2-entry]  [3-entry]  [4-entry] │
│   point(5,5) point(8,3) point(10,7) point(12,9)
│                                              │
│  Need to insert 5th point: (15, 12) ✗ Over  │
└──────────────────────────────────────────────┘
                    │
                    │ Trigger split
                    ▼

2.2 Split Process

                 Before split (5 entries)
    ┌───────────────────────────────────────┐
    │ Node A: [①②③④⑤]                     │
    │ bbox: [5,3] to [15,12]                 │
    └─────────────────┬─────────────────────┘
                      │
            ┌─────────┴──────────┐
            │ Linear split:      │
            │ Keep first 2.5 ≈ 2 │
            │ New node gets 3    │
            ▼                    ▼

    ┌──────────────────┐        ┌─────────────────┐
    │ Node A (Leaf)    │        │ Node B (Leaf)   │
    │ [①②]            │        │ [③④⑤]          │
    │ bbox: [5,3]-     │        │ bbox: [10,7]-   │
    │       [8,5]      │        │       [15,12]   │
    └────────┬─────────┘        └────────┬────────┘
             │                          │
             └──────────┬───────────────┘
                        │
                        ▼
             ┌──────────────────────┐
             │  Parent Node Updated │
             │  Add Node B ref      │
             │  [Node A] [Node B]   │
             └──────────────────────┘

2.3 Leaf Node Content Details

Before split:
┌─────────────────────────────────────────────────┐
│          RTreeNode (leaf=true)                  │
│                                                  │
│  leafEntries: [                                  │
│    LeafEntry(rowId=1, bbox=[5,5]-[5,5]),        │
│    LeafEntry(rowId=2, bbox=[8,3]-[8,3]),        │
│    LeafEntry(rowId=3, bbox=[10,7]-[10,7]),      │
│    LeafEntry(rowId=4, bbox=[12,9]-[12,9]),      │
│    LeafEntry(rowId=5, bbox=[15,12]-[15,12])     │
│  ]                                               │
│                                                  │
│  boundingBox: [5,3] to [15,12]                   │
│  (automatically expanded from all entries)       │
└─────────────────────────────────────────────────┘

After split:
┌────────────────────────────────┐  ┌──────────────────────────┐
│   Node A (leaf=true)           │  │   Node B (leaf=true)     │
│                                │  │                          │
│ leafEntries: [                 │  │ leafEntries: [           │
│   LeafEntry(1, [5,5]),         │  │   LeafEntry(3, [10,7]),  │
│   LeafEntry(2, [8,3])          │  │   LeafEntry(4, [12,9]),  │
│ ]                              │  │   LeafEntry(5, [15,12])  │
│                                │  │ ]                        │
│ bbox: [5,3]-[8,5]              │  │ bbox: [10,7]-[15,12]     │
└────────────────────────────────┘  └──────────────────────────┘
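The midpoint split shown above can be sketched in a few lines. This is only an illustration under stated assumptions: `MidpointSplitSketch`, `split`, and `boundingBox` are hypothetical names, points are plain `double[]` pairs, and only the `entries.size() / 2` midpoint comes from the PR diff. Running it on the five example points shows that the retained half spans y from 3 to 5, because point (5,5) stays in Node A.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: order-based split of a full leaf, as in the diagram. */
final class MidpointSplitSketch {

    /** Returns {minX, minY, maxX, maxY} covering all 2-D points in the list. */
    static double[] boundingBox(List<double[]> points) {
        double[] box = {
            Double.POSITIVE_INFINITY, Double.POSITIVE_INFINITY,
            Double.NEGATIVE_INFINITY, Double.NEGATIVE_INFINITY
        };
        for (double[] p : points) {
            box[0] = Math.min(box[0], p[0]);
            box[1] = Math.min(box[1], p[1]);
            box[2] = Math.max(box[2], p[0]);
            box[3] = Math.max(box[3], p[1]);
        }
        return box;
    }

    /** The first size/2 entries stay in the old node, the rest go to the new one. */
    static List<List<double[]>> split(List<double[]> entries) {
        int mid = entries.size() / 2;
        List<List<double[]>> halves = new ArrayList<>();
        halves.add(new ArrayList<>(entries.subList(0, mid)));
        halves.add(new ArrayList<>(entries.subList(mid, entries.size())));
        return halves;
    }

    public static void main(String[] args) {
        List<double[]> entries = Arrays.asList(
                new double[] {5, 5}, new double[] {8, 3}, new double[] {10, 7},
                new double[] {12, 9}, new double[] {15, 12});
        List<List<double[]>> halves = split(entries);
        // Node A keeps 2 entries, Node B receives 3.
        System.out.println(Arrays.toString(boundingBox(halves.get(0)))); // [5.0, 3.0, 8.0, 5.0]
        System.out.println(Arrays.toString(boundingBox(halves.get(1)))); // [10.0, 7.0, 15.0, 12.0]
    }
}
```

Because the split looks only at insertion order, the two resulting boxes are compact here only because the sample points happen to be inserted in spatial order.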

Query Process
3.1 Point Query Flow

             Input: Point (60, 72)
                    │
                    ▼
        ┌─────────────────────────┐
        │ Check Root bbox         │
        │ Is Point(60,72) inside  │
        │ [0,0]-[100,100]?        │
        │       Yes ✓             │
        └────────┬────────────────┘
                 │
        ┌────────▼────────────────┐
        │ Check child bbox        │
        │ Point in [0,0]-[50,100]?│
        │       No ✗              │
        │ Point in [50,0]-        │
        │ [100,100]?              │
        │       Yes ✓             │
        └────────┬────────────────┘
                 │
        ┌────────▼──────────────────────┐
        │ Recursively check Internal    │
        │ Node 2                        │
        │ Point in [50,50]-[70,70]?     │
        │       No ✗                    │
        │ Point in [75,75]-[100,100]?   │
        │       No ✗                    │
        └────────┬──────────────────────┘
                 │
        ┌────────▼──────────────────┐
        │ Reach leaf node           │
        │ Check Leaf 4 bbox         │
        │ Point(60,72) not in here  │
        └───────────────────────────┘
                 │
                 ▼
            ┌──────────┐
            │ Empty    │
            │ result   │
            └──────────┘

3.2 Range Query Flow

Input: BoundingBox [10,5] to [50,60]
           │
           ▼
   ┌───────────────────────────────┐
   │ Check Root bbox intersection  │
   │ [10,5]-[50,60] intersects     │
   │ [0,0]-[100,100]?              │
   │      Yes ✓ Continue           │
   └───────┬─────────────────────┬─┘
           │                     │
   ┌───────▼──────────┐  ┌──────▼──────────┐
   │ Subtree1 inter?  │  │ Subtree2 inter? │
   │ [0,0]-[50,100]?  │  │ [50,0]-[100,100]
   │ Yes ✓ Recurse    │  │ Yes ✓ Recurse  │
   └───────┬──────────┘  └──────┬──────────┘
           │                    │
   ┌───────▼──────┐    ┌────────▼──────┐
   │ Check Leaf   │    │ Check Leaf    │
   │ 1-2 entries  │    │ 3-4 entries   │
   │              │    │               │
   └───────┬──────┘    └────────┬──────┘
           │                    │
   ┌───────▼─────────────────┬──▼──────┐
   │ Return all intersecting │ rowIds  │
   │ [rowId1, rowId3, ...]   │         │
   └──────────────────────────────────┘
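The range query flow above can be sketched as a short recursive method. This is a minimal illustration, not the PR's code: `RangeSearchSketch` and its nested `Node` are hypothetical, boxes are `{minX, minY, maxX, maxY}` arrays, and touching borders are assumed to count as intersecting. The tree mirrors the four leaves from the "R-Tree Basic Structure" diagram.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: recursive range search with subtree pruning. */
final class RangeSearchSketch {

    /** Box layout: {minX, minY, maxX, maxY}. Touching borders count as intersecting. */
    static boolean intersects(double[] a, double[] b) {
        return a[0] <= b[2] && b[0] <= a[2] && a[1] <= b[3] && b[1] <= a[3];
    }

    static final class Node {
        final double[] bbox;
        final List<Node> children = new ArrayList<>();       // used by internal nodes
        final List<double[]> entryBoxes = new ArrayList<>(); // used by leaves
        final List<Integer> rowIds = new ArrayList<>();      // parallel to entryBoxes

        Node(double... bbox) {
            this.bbox = bbox;
        }

        /** Builds a leaf holding a single entry whose box equals the leaf box. */
        static Node leaf(int rowId, double... bbox) {
            Node node = new Node(bbox);
            node.entryBoxes.add(bbox);
            node.rowIds.add(rowId);
            return node;
        }

        boolean isLeaf() {
            return children.isEmpty();
        }
    }

    static void search(Node node, double[] query, List<Integer> results) {
        if (!intersects(node.bbox, query)) {
            return; // prune the whole subtree
        }
        if (node.isLeaf()) {
            // Check every entry: the node box alone can produce false positives.
            for (int i = 0; i < node.entryBoxes.size(); i++) {
                if (intersects(node.entryBoxes.get(i), query)) {
                    results.add(node.rowIds.get(i));
                }
            }
            return;
        }
        for (Node child : node.children) {
            search(child, query, results);
        }
    }

    /** Rebuilds the four-leaf example tree from the structure diagram. */
    static Node exampleTree() {
        Node root = new Node(0, 0, 100, 100);
        Node left = new Node(0, 0, 50, 100);
        Node right = new Node(50, 0, 100, 100);
        left.children.add(Node.leaf(1, 0, 0, 20, 30));
        left.children.add(Node.leaf(2, 20, 30, 40, 50));
        right.children.add(Node.leaf(3, 50, 50, 70, 70));
        right.children.add(Node.leaf(4, 75, 75, 100, 100));
        root.children.add(left);
        root.children.add(right);
        return root;
    }

    public static void main(String[] args) {
        List<Integer> results = new ArrayList<>();
        search(exampleTree(), new double[] {10, 5, 50, 60}, results);
        System.out.println(results); // [1, 2, 3]
    }
}
```

Leaf 3 is returned only because the query's right edge (x = 50) touches its left edge; under a strict-overlap definition it would be skipped.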

Complete Insert Flow

Insert operation: insert(point=[25, 35], rowId=10)

         ┌──────────────────────────────┐
         │ 1. Choose best path          │
         │ (minimum expansion area)     │
         └──────────────┬───────────────┘
                        │
         ┌──────────────▼───────────────┐
         │ 2. Recursively descend       │
         │ from Root to leaf            │
         │ Calculate expansion cost     │
         └──────────────┬───────────────┘
                        │
         ┌──────────────▼───────────────┐
         │ 3. Reach leaf node          │
         │ Create LeafEntry            │
         └──────────────┬───────────────┘
                        │
         ┌──────────────▼───────────────┐
         │ 4. Add Entry                │
         │ node.addLeafEntry(entry)    │
         │ Update bbox                 │
         └──────────────┬───────────────┘
                        │
         ┌──────────────▼───────────────┐
         │ 5. Check capacity           │
         │ if (node.canSplit())        │
         │   splitNode()               │
         └──────────────┬───────────────┘
                        │
         ┌──────────────▼───────────────┐
         │ 6. If split:                │
         │ Parent adds new node        │
         │ Possible cascading split    │
         └──────────────┬───────────────┘
                        │
         ┌──────────────▼───────────────┐
         │ 7. Backtrack update parent  │
         │ bboxes until Root           │
         └──────────────┬───────────────┘
                        │
                        ▼
                   ✓ Insert complete
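Steps 1 and 2 of the flow above (choosing the path with minimum expansion area) can be sketched as follows. The names `ChooseSubtreeSketch`, `enlargement`, and `chooseChild` are hypothetical, boxes are `{minX, minY, maxX, maxY}` arrays, and breaking ties in favour of the smaller box is an assumption borrowed from the classic R-tree formulation rather than something the PR states.

```java
/** Illustrative only: pick the child whose box grows least when a point is added. */
final class ChooseSubtreeSketch {

    /** Area of a 2-D box laid out as {minX, minY, maxX, maxY}. */
    static double area(double[] box) {
        return (box[2] - box[0]) * (box[3] - box[1]);
    }

    /** Extra area the box would need in order to cover the point (x, y). */
    static double enlargement(double[] box, double x, double y) {
        double[] grown = {
            Math.min(box[0], x), Math.min(box[1], y),
            Math.max(box[2], x), Math.max(box[3], y)
        };
        return area(grown) - area(box);
    }

    /** Index of the child needing the least enlargement; ties go to the smaller box. */
    static int chooseChild(double[][] childBoxes, double x, double y) {
        int best = -1;
        double bestCost = Double.POSITIVE_INFINITY;
        for (int i = 0; i < childBoxes.length; i++) {
            double cost = enlargement(childBoxes[i], x, y);
            boolean better = cost < bestCost
                    || (cost == bestCost && area(childBoxes[i]) < area(childBoxes[best]));
            if (better) {
                best = i;
                bestCost = cost;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // insert(point=[25, 35]) against the boxes of the example tree.
        double[][] underRoot = {{0, 0, 50, 100}, {50, 0, 100, 100}};
        System.out.println(chooseChild(underRoot, 25, 35)); // 0: already inside, cost 0
        double[][] underNode1 = {{0, 0, 20, 30}, {20, 30, 40, 50}};
        System.out.println(chooseChild(underNode1, 25, 35)); // 1: inside Leaf 2's box
    }
}
```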

Data Serialization Flow

In-memory RTree structure
       │
       ▼
┌──────────────────────────────────────────┐
│ Serialization format:                    │
│                                          │
│ ┌─ Metadata                             │
│ │ ├─ dimensions: int (2)                │
│ │ ├─ maxEntries: int (32)               │
│ │ └─ treeSize: int (1000)               │
│ │                                       │
│ ├─ Recursively serialize Node:          │
│ │ ├─ isLeaf: boolean (true/false)      │
│ │ ├─ entryCount: int (N)               │
│ │ ├─ boundingBox: double[4]            │
│ │ │  (min_x, min_y, max_x, max_y)      │
│ │ │                                    │
│ │ └─ If Leaf:                          │
│ │    ├─ rowId1: int + bbox             │
│ │    ├─ rowId2: int + bbox             │
│ │    └─ ...                            │
│ │                                       │
│ │ If Internal:                          │
│ │    ├─ Child1 Node (recursive)        │
│ │    ├─ Child2 Node (recursive)        │
│ │    └─ ...                            │
│ │                                       │
│ └─ All nodes in DFS order              │
│                                          │
│ Result: byte[] (binary data)            │
└──────────────────────────────────────────┘
       │
       ▼
   File storage
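The per-node part of the format above can be sketched with `DataOutputStream` and `DataInputStream`. This is an assumption-laden illustration, not the PR's writer or reader: `NodeSerdeSketch` and its `Node` class are hypothetical, the metadata header (dimensions, maxEntries, treeSize) is omitted, and boxes are fixed at four doubles. The point it demonstrates is that the reader takes `isLeaf` from the stream for every node it creates, at any depth.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: depth-first node serialization that round-trips isLeaf. */
final class NodeSerdeSketch {

    static final class Node {
        final boolean leaf;
        final double[] bbox; // {minX, minY, maxX, maxY}
        final List<Node> children = new ArrayList<>();
        final List<Integer> rowIds = new ArrayList<>();
        final List<double[]> entryBoxes = new ArrayList<>();

        Node(boolean leaf, double... bbox) {
            this.leaf = leaf;
            this.bbox = bbox;
        }
    }

    static void writeBox(DataOutputStream out, double[] box) throws IOException {
        for (double v : box) {
            out.writeDouble(v);
        }
    }

    static double[] readBox(DataInputStream in) throws IOException {
        double[] box = new double[4];
        for (int i = 0; i < box.length; i++) {
            box[i] = in.readDouble();
        }
        return box;
    }

    static void write(DataOutputStream out, Node node) throws IOException {
        out.writeBoolean(node.leaf);
        out.writeInt(node.leaf ? node.rowIds.size() : node.children.size());
        writeBox(out, node.bbox);
        if (node.leaf) {
            for (int i = 0; i < node.rowIds.size(); i++) {
                out.writeInt(node.rowIds.get(i));
                writeBox(out, node.entryBoxes.get(i));
            }
        } else {
            for (Node child : node.children) {
                write(out, child); // children follow their parent in DFS order
            }
        }
    }

    static Node read(DataInputStream in) throws IOException {
        boolean leaf = in.readBoolean(); // taken from the stream for every node
        int count = in.readInt();
        Node node = new Node(leaf, readBox(in));
        for (int i = 0; i < count; i++) {
            if (leaf) {
                node.rowIds.add(in.readInt());
                node.entryBoxes.add(readBox(in));
            } else {
                node.children.add(read(in));
            }
        }
        return node;
    }

    static byte[] toBytes(Node root) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            write(out, root);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static Node fromBytes(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return read(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A round trip through `toBytes` and `fromBytes` on a two-level tree should give back an internal root whose children report `leaf == true`, which is a cheap regression check for the reader.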

Class Relationship Diagram

┌─────────────────────────────────────────────────────┐
│                     RTree                          │
│  - root: RTreeNode                                 │
│  - dimensions: int                                 │
│  - maxEntries: int                                 │
│                                                     │
│  + insert(point[], rowId)                          │
│  + search(BoundingBox): List<rowId>                │
│  + search(point[]): List<rowId>                    │
└─────────────────────────────────────────────────────┘
                        │
                        │ contains
                        │ (tree structure)
                        ▼
┌──────────────────────────────────────────────────────────┐
│                    RTreeNode                           │
│  - boundingBox: BoundingBox                            │
│  - leafEntries: List<LeafEntry> (if leaf=true)        │
│  - children: List<RTreeNode> (if leaf=false)          │
│  - parent: RTreeNode                                  │
│  - isLeaf: boolean                                    │
│  - maxEntries: int                                    │
│                                                        │
│  + isLeaf(): boolean                                  │
│  + addLeafEntry(LeafEntry)                           │
│  + addChild(RTreeNode)                               │
│  + canSplit(): boolean                               │
│  + getBoundingBox(): BoundingBox                      │
└──────────────────────────────────────────────────────────┘
         │                              │
         │ contains                     │ references
         │                              │ parent
         ▼                              ▼
    ┌──────────────────┐         ┌─────────────┐
    │   LeafEntry      │         │BoundingBox  │
    │                  │         │             │
    │ - rowId: int     │         │ - min[]     │
    │ - bbox:BBox      │         │ - max[]     │
    │                  │         │ - dimensions│
    │ + getRowId()     │         │             │
    │ + getBbox()      │         │ + expand()  │
    └──────────────────┘         │ + intersects
                                 │ + contains()
                                 └─────────────┘
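The `BoundingBox` operations named in the diagram (`expand`, `intersects`, `contains`) could look roughly like the sketch below. The class name `BoundingBoxSketch`, the `ofPoint` helper, and the choice to treat touching borders as intersecting are assumptions for illustration; the PR's actual class may differ.

```java
import java.util.Arrays;

/** Illustrative only: an axis-aligned box with per-dimension min and max. */
final class BoundingBoxSketch {
    final double[] min;
    final double[] max;

    BoundingBoxSketch(double[] min, double[] max) {
        if (min.length != max.length) {
            throw new IllegalArgumentException("min and max must have the same dimensions");
        }
        this.min = min.clone();
        this.max = max.clone();
    }

    /** A point is stored as a degenerate box whose min equals its max. */
    static BoundingBoxSketch ofPoint(double... point) {
        return new BoundingBoxSketch(point, point);
    }

    /** Grows this box in place so that it also covers the other box. */
    void expand(BoundingBoxSketch other) {
        for (int d = 0; d < min.length; d++) {
            min[d] = Math.min(min[d], other.min[d]);
            max[d] = Math.max(max[d], other.max[d]);
        }
    }

    /** True when the boxes overlap or touch in every dimension. */
    boolean intersects(BoundingBoxSketch other) {
        for (int d = 0; d < min.length; d++) {
            if (other.max[d] < min[d] || other.min[d] > max[d]) {
                return false;
            }
        }
        return true;
    }

    /** True when the other box lies entirely inside this one (borders included). */
    boolean contains(BoundingBoxSketch other) {
        for (int d = 0; d < min.length; d++) {
            if (other.min[d] < min[d] || other.max[d] > max[d]) {
                return false;
            }
        }
        return true;
    }

    @Override
    public String toString() {
        return Arrays.toString(min) + "-" + Arrays.toString(max);
    }

    public static void main(String[] args) {
        BoundingBoxSketch box =
                new BoundingBoxSketch(new double[] {0, 0}, new double[] {50, 100});
        box.expand(new BoundingBoxSketch(new double[] {50, 0}, new double[] {100, 100}));
        System.out.println(box); // [0.0, 0.0]-[100.0, 100.0]
        System.out.println(box.contains(ofPoint(15, 35))); // true
    }
}
```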

Typical Query Examples
Example 1: Point Query

Query: search(Point[15, 35])

Tree structure:
           Root [0,0]-[100,100]
          /            \
      [0,0]-[50,100]   [50,0]-[100,100]
      /         \         /          \
   Leaf1      Leaf2    Leaf3      Leaf4
 [5,5]      [20,30]  [50,50]    [75,75]
 
Execution:
Step 1: Point[15,35] in Root bbox? YES → Continue
Step 2: Point in left subtree [0,0]-[50,100]? YES → Recurse
        Point in right subtree [50,0]-[100,100]? NO → Skip
Step 3: Check Leaf1 [5,5]-[5,5]: Point inside? NO
Step 4: Check Leaf2 [20,30]-[20,30]: Point inside? NO
Step 5: Result: Empty

Complexity: O(log N) node visits; effectively constant for 4 leaves
           Actual operations: 3 node bbox checks (Steps 1-2) plus 2 leaf entry checks (Steps 3-4)

Example 2: Range Query

Query: search(BoundingBox[10,10]-[60,60])

Execution:
Step 1: Range[10,10]-[60,60] intersects Root[0,0]-[100,100]? YES
Step 2: Intersects left subtree[0,0]-[50,100]? YES → Recurse
        Intersects right subtree[50,0]-[100,100]? YES → Recurse
Step 3: Left subtree:
        Leaf1[5,5] in range? NO
        Leaf2[20,30] in range? YES → Add to result
Step 4: Right subtree:
        Leaf3[50,50] in range? YES → Add to result
        Leaf4[75,75] in range? NO (75 > 60)
Step 5: Result: [Leaf2_rowId, Leaf3_rowId]

Complexity: O(log N + K)
           where N=4, K=2 (result count)

Memory Layout

Java heap memory R-Tree structure:

┌─────────────────────────────────────────────────────┐
│                  RTree object                      │
│  ┌──────────────────────────────────────────┐      │
│  │ root: RTreeNode                          │      │
│  │ dimensions: 2                            │      │
│  │ maxEntries: 32                           │      │
│  └──────────────────────────────────────────┘      │
└────────────┬──────────────────────────────────────┘
             │ references
             ▼
    ┌────────────────────┐
    │   RTreeNode obj    │ (root)
    │ (internal node)    │
    │                    │
    │ children:          │
    │  [Node1 ref]       │
    │  [Node2 ref]       │
    │  [Node3 ref]       │
    │                    │
    │ leafEntries: []    │
    │ (empty)            │
    │                    │
    │ boundingBox:       │
    │  ├─ min: [0, 0]    │
    │  └─ max: [100,100] │
    │                    │
    │ parent: null       │
    └────┬─┬─┬───────────┘
         │ │ │
    ┌────┘ │ │
    │      │ └────────┐
    │      └──────┐   │
    ▼             ▼   ▼
  Node1         Node2 Node3
  (leaf)        (leaf)(leaf)
   │             │     │
   ▼             ▼     ▼
[LeafEntry]  [LE][LE] [LE][LE][LE]
rowId:1      rowId:2-3 rowId:4-6

JingsongLi · 2026-05-21T02:35:59Z

+        } else {
+            for (int i = 0; i < entryCount; i++) {
+                RTreeNode child = new RTreeNode(dimensions, maxEntries, false);
+                node.addChild(child);


RTreeNode.isLeaf is final. When the RTree constructor creates the root, isLeaf=true. However, during deserialization, if the root of the tree is an internal node, the deserialization code adds children to a node with isLeaf=true. Afterwards, when searching, node.isLeaf() returns true and the search looks at leafRowIds instead of recursing into the children, so the deserialized tree cannot be queried correctly at all.

I changed private final boolean isLeaf → private boolean isLeaf, added setLeaf(boolean) method, then called setLeaf() during deserialization to correct the root node's leaf flag.

JingsongLi · 2026-05-21T02:36:45Z

+
+        RTreeNode newNode = new RTreeNode(dimensions, maxEntries, true);
+
+        int mid = entries.size() / 2;


This completely disregards spatial location. The correct R-Tree splitting should minimize the MBR overlap between two result nodes, otherwise a large number of nodes will "intersect" during queries, degenerating into linear scans.

I made these changes:

Implemented QuadraticSplit algorithm (2-step approach):
a. PickSeeds: Select two entries with maximum distance as initial seeds
b. Assign: Assign remaining entries to group with minimum bbox expansion

Implemented QuadraticSplitInternal: Same algorithm for internal node splitting

Modified RTree.java to use QuadraticSplit instead of linear splitting
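The two steps described above (pick the two farthest entries as seeds, then assign each remaining entry to the group whose box grows least) can be sketched like this. It is a simplified illustration, not the PR's `QuadraticSplit`: `SeedSplitSketch` is a hypothetical name, distance is measured between box centers, boxes are `{minX, minY, maxX, maxY}` arrays, and no minimum fill per group is enforced.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: seed-based split that keeps nearby entries together. */
final class SeedSplitSketch {

    static double area(double[] b) {
        return (b[2] - b[0]) * (b[3] - b[1]);
    }

    static double[] union(double[] a, double[] b) {
        return new double[] {
            Math.min(a[0], b[0]), Math.min(a[1], b[1]),
            Math.max(a[2], b[2]), Math.max(a[3], b[3])
        };
    }

    static double centerDistanceSq(double[] a, double[] b) {
        double dx = (a[0] + a[2]) / 2 - (b[0] + b[2]) / 2;
        double dy = (a[1] + a[3]) / 2 - (b[1] + b[3]) / 2;
        return dx * dx + dy * dy;
    }

    /** Splits entry boxes into two groups; expects at least two entries. */
    static List<List<double[]>> split(List<double[]> boxes) {
        // PickSeeds: the pair of entries that are farthest apart.
        int seedA = 0;
        int seedB = 1;
        double farthest = -1;
        for (int i = 0; i < boxes.size(); i++) {
            for (int j = i + 1; j < boxes.size(); j++) {
                double d = centerDistanceSq(boxes.get(i), boxes.get(j));
                if (d > farthest) {
                    farthest = d;
                    seedA = i;
                    seedB = j;
                }
            }
        }
        List<double[]> groupA = new ArrayList<>();
        List<double[]> groupB = new ArrayList<>();
        groupA.add(boxes.get(seedA));
        groupB.add(boxes.get(seedB));
        double[] boxA = boxes.get(seedA);
        double[] boxB = boxes.get(seedB);
        // Assign: each remaining entry joins the group whose box grows least.
        for (int k = 0; k < boxes.size(); k++) {
            if (k == seedA || k == seedB) {
                continue;
            }
            double[] entry = boxes.get(k);
            double growA = area(union(boxA, entry)) - area(boxA);
            double growB = area(union(boxB, entry)) - area(boxB);
            if (growA <= growB) {
                groupA.add(entry);
                boxA = union(boxA, entry);
            } else {
                groupB.add(entry);
                boxB = union(boxB, entry);
            }
        }
        List<List<double[]>> groups = new ArrayList<>();
        groups.add(groupA);
        groups.add(groupB);
        return groups;
    }
}
```

For example, given the points (1,1), (90,90), (2,2), (91,92), (1,2) in that insertion order, a midpoint split would mix the two clusters, while this sketch groups the three points near the origin together and the two points near (90,90) together.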

JingsongLi · 2026-05-21T02:37:09Z

+
+        if (node.isLeaf()) {
+            for (Integer rowId : node.getLeafRowIds()) {
+                results.add(rowId);


Row IDs are added to the results directly, without checking whether entry.bbox intersects searchBox.

Changes made:

Enhanced RTree.java search() method:
a. Leaf nodes: Added per-entry checking entry.getBbox().intersects(searchBox)
b. Previously only checked node.getBoundingBox().intersects()

LeafEntry structure: Stores rowId + bbox for precision verification

JingsongLi · 2026-05-21T02:37:36Z

+import org.apache.paimon.types.DataType;
+
+/** The implementation of R-Tree file index. */
+public class RTreeFileIndex implements FileIndexer {


As a file index, all data is already known at write time, and a tree built by inserting items one at a time is of much lower quality than one built with STR (Sort-Tile-Recursive) bulk loading.

My current alternative is:

Implemented STRBulkLoader (Sort-Tile-Recursive algorithm):
a. Sort entries by current dimension
b. Partition into vertical tiles (~maxEntries per tile)
c. Recursively process each tile with next dimension
d. Build tree bottom-up

Enhanced RTreeFileIndexWriter:
a. write() method: Collect entries into list
b. serializedBytes(): Use STRBulkLoader for batch tree construction
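The leaf-packing stage of steps a to c can be sketched for 2-D points as below. This is a hedged illustration, not the PR's `STRBulkLoader`: `StrPackSketch` and `packLeaves` are hypothetical names, points are `double[]` pairs, and only the leaf level is built (upper levels would repeat the same packing on the leaf boxes).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: Sort-Tile-Recursive packing of 2-D points into leaves. */
final class StrPackSketch {

    /** Groups points into leaves of at most maxEntries (must be positive). */
    static List<List<double[]>> packLeaves(List<double[]> points, int maxEntries) {
        // a. Sort all entries by the first dimension (x).
        List<double[]> sorted = new ArrayList<>(points);
        sorted.sort(Comparator.comparingDouble((double[] p) -> p[0]));

        // b. Cut the sorted list into about sqrt(leafCount) vertical slices.
        int leafCount = (sorted.size() + maxEntries - 1) / maxEntries;
        int sliceCount = (int) Math.ceil(Math.sqrt(leafCount));
        int sliceSize = sliceCount * maxEntries;

        List<List<double[]>> leaves = new ArrayList<>();
        for (int start = 0; start < sorted.size(); start += sliceSize) {
            int end = Math.min(start + sliceSize, sorted.size());
            List<double[]> slice = new ArrayList<>(sorted.subList(start, end));

            // c. Within a slice, sort by the next dimension (y) and fill leaves.
            slice.sort(Comparator.comparingDouble((double[] p) -> p[1]));
            for (int i = 0; i < slice.size(); i += maxEntries) {
                int leafEnd = Math.min(i + maxEntries, slice.size());
                leaves.add(new ArrayList<>(slice.subList(i, leafEnd)));
            }
        }
        return leaves;
    }

    public static void main(String[] args) {
        // A 4x4 grid packed with maxEntries=4 yields four 2x2 tiles.
        List<double[]> grid = new ArrayList<>();
        for (int x = 0; x < 4; x++) {
            for (int y = 0; y < 4; y++) {
                grid.add(new double[] {x, y});
            }
        }
        System.out.println(packLeaves(grid, 4).size()); // 4
    }
}
```

Because every leaf is filled to capacity and covers a compact tile, the resulting boxes overlap far less than those produced by inserting the same points one at a time.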

xuzifu666 · 2026-05-21T04:35:36Z

@JingsongLi Thank you for the review! Your comments are very helpful, and I will refine the PR based on these issues.

leaves12138 · 2026-05-21T14:27:09Z

Thanks for the update. I found two blocking correctness/contract issues in the current head (a4143bf63109).

RTreeFileIndexWriter does not accept the object type passed by the real file-index write path for ARRAY<DOUBLE> columns. DataFileIndexWriter.FileIndexMaintainer.write() calls fileIndexWriter.writeRecord(getter.getFieldOrNull(row)), so an array field is passed as Paimon's InternalArray implementation, e.g. GenericArray / binary array, not as double[] or java.util.List. However, RTreeFileIndexWriter.extractPoint() only handles List and double[]. A normal write through the file-index path can therefore fail with:
```
Cannot extract point from: org.apache.paimon.data.GenericArray
```
Production deserialization still loses the leaf flag for non-root nodes. RTreeFileIndexReader.deserializeNode() only calls node.setLeaf(isLeaf) for the root. Child nodes are always created with new RTreeNode(..., false), and the recursive call does not update their serialized isLeaf value. As a result, leaf children under an internal node remain isLeaf=false; after serializing a multi-level tree with RTreeFileIndexWriter and reading it back with RTreeFileIndex.createReader(), an equality query for an existing point can return SKIP.

I reproduced both with a small contract test:

writer.writeRecord(new GenericArray(new double[] {1.0, 2.0})) throws from RTreeFileIndexWriter.extractPoint().
Writing 200 points, serializing with RTreeFileIndexWriter, deserializing with the production RTreeFileIndexReader, and querying [50.0, 50.0] returns no match.

The existing tests pass because they mostly call writer.write(new double[] {...}) directly, bypassing FileIndexWriter.writeRecord(), and RTreeSerializationTest uses its own test deserializer that constructs nodes with the serialized isLeaf flag, bypassing the production reader bug.

Please fix these before merge:

Make the writer handle Paimon's InternalArray for ARRAY<DOUBLE> values, and check whether reader literals need the same treatment.
Preserve isLeaf for every node during production deserialization, not only the root.
Add regression coverage through the real FileIndexWriter.writeRecord() / RTreeFileIndex.createReader() path.

xuzifu666 added 7 commits May 20, 2026 17:32

[core] Support rtree file index

03371e9

add files

1d73ee1

add files

4bf8991

add files

7acdbe3

add docs

66e251d

improve adjustParent

d98cc1b

serialization fix

d97b734

JingsongLi requested changes May 21, 2026


xuzifu666 added 4 commits May 21, 2026 12:57

Address 3

b911327

Addressed

3d85e6d

fix

f221bd9

fix

a4143bf

xuzifu666 wants to merge 11 commits into apache:master from xuzifu666:rtree_support