RocksDB as The Replica of MDT/RLI #18296
Replies: 4 comments 5 replies
-
Question 1: If we treat RocksDB as the primary source of truth during writes, how do concurrent updates from another writer become visible correctly during
+1. This will also mean that for partitioned RLI, i.e. records always having immutable partitioning field(s), the index size can be simply
Question 2: Is the 2x due just to compression? I think there will be an additional 2x of storage for un-compacted updates, since RocksDB will also do its own async compaction periodically.
-
The concurrent upsert consistency still works under OCC, since a task failover would retrigger the full bootstrap of the RocksDB replica anyway. As of now, the simple bucket index is a prerequisite for NBCC, so it should not be a strong concern or blocker for the RLI.
While RocksDB suggests light compression (LZ4/Snappy) for L0–L2 to get the best write throughput and heavier compression (Zstd/Zlib) for L3+, the native MDT has a similar situation with un-compacted update payloads in its log files. The difference may come down to the compaction frequency gap between the MDT (10 delta commits trigger a compaction) and RocksDB (4 SSTs trigger a compaction).
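To make the compaction-cadence comparison concrete, here is a toy back-of-envelope model. The per-commit update count and the one-flush-per-delta-commit assumption are hypothetical illustrations, not numbers from the discussion; only the trigger counts (10 delta commits for the MDT, 4 SSTs for RocksDB) come from the text above.

```python
# Toy model: worst-case un-compacted update backlog just before a compaction
# fires, assuming (hypothetically) one flush per delta commit and a fixed
# number of updates per commit.

def max_uncompacted(updates_per_commit: int, trigger: int) -> int:
    """Updates sitting un-compacted right before the trigger-th flush compacts."""
    return updates_per_commit * (trigger - 1)

# Trigger counts from the discussion: MDT compacts after 10 delta commits,
# RocksDB after 4 SSTs.
mdt_backlog = max_uncompacted(updates_per_commit=1_000, trigger=10)
rocks_backlog = max_uncompacted(updates_per_commit=1_000, trigger=4)
```

Under these assumptions RocksDB's more frequent compaction keeps the un-compacted backlog a few times smaller than the MDT's, which is the intuition behind not treating the un-compacted-update overhead as a RocksDB-specific cost.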
-
This is a very interesting proposal! I had a few questions to understand its motivation and mechanics:
-
I'm assuming you already read the Flink RLI RFC.
The 2nd one would take a lot of effort and we deem it a long-term solution; as of now, we prefer the 1st.
-
RocksDB as The Replica of MDT
The initial Flink RLI RFC: #17610
The update to RocksDB
The RocksDB instances are initialized and bootstrapped from scratch by reading the full MDT RLI on each job restart or task failover.
The incremental index upserts inferred from the data inputs are applied directly to these RocksDB instances. These index upserts are then passed along with the data payloads to the IndexWrite op for the actual MDT update. The MDT update happens in the same lifecycle as the data record write, and the incremental upserts are a replica image of the upserts into RocksDB. The RLI would be utilized for two cases:
The new write flow with RocksDB replica:
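A minimal sketch of the bootstrap-then-incremental-upsert flow described above, with a plain Python dict standing in for the per-task RocksDB instance and return values standing in for the Flink stream hand-off. All names here (`bootstrap_replica`, `apply_upserts`, `IndexUpdate`) are hypothetical illustrations, not Hudi APIs.

```python
from dataclasses import dataclass

@dataclass
class IndexUpdate:
    record_key: str
    location: str          # e.g. "partition/fileId"
    is_delete: bool = False

def bootstrap_replica(mdt_rli_scan):
    """Rebuild the replica from scratch via a full scan of the MDT RLI
    (done on every job restart or task failover)."""
    return {key: loc for key, loc in mdt_rli_scan}

def apply_upserts(replica, updates):
    """Apply incremental upserts locally, then forward the same updates
    downstream so the MDT write stays a replica image of the local state."""
    for u in updates:
        if u.is_delete:
            replica.pop(u.record_key, None)
        else:
            replica[u.record_key] = u.location
    return updates   # forwarded to the index write op alongside the data payloads

replica = bootstrap_replica([("k1", "p0/f1"), ("k2", "p0/f2")])
forwarded = apply_upserts(replica, [IndexUpdate("k2", "p1/f3"),
                                    IndexUpdate("k1", "", is_delete=True)])
```

The key property is that the updates applied locally and the updates forwarded to the MDT writer are the same objects, so the two stores only diverge until the in-flight commit lands, and a failover simply rebuilds from the MDT.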

The Clean/Eviction of Index Payloads in RocksDB
For global RLI, the RocksDB instance would be closed and removed each time a task fails over or the job restarts.
For partitioned RLI, with a local RocksDB instance per BucketAssign task, the payloads under the same data partition are stored as a separate column family; when the data partition is based on datetime, the column family can be dropped very efficiently with a configurable partition lookup TTL.
The Additional Storage Cost
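The TTL-based eviction for partitioned RLI described above can be sketched roughly as follows, with a dict of dicts standing in for RocksDB column families. The function name and the date-named partitions are illustrative; in RocksJava the per-partition drop would map to the real `dropColumnFamily` call, which is what makes this eviction cheap.

```python
from datetime import date, timedelta

def evict_expired(families: dict, today: date, ttl_days: int) -> list:
    """Drop every datetime-named column family older than the lookup TTL."""
    cutoff = today - timedelta(days=ttl_days)
    expired = [p for p in families if date.fromisoformat(p) < cutoff]
    for p in expired:
        del families[p]        # ~ db.dropColumnFamily(handle) in RocksJava
    return expired

# One column family per datetime data partition (contents are placeholders).
cfs = {"2024-01-01": {"k1": "f1"}, "2024-03-01": {"k2": "f2"}}
dropped = evict_expired(cfs, today=date(2024, 3, 5), ttl_days=7)
```

Dropping a whole column family avoids per-key deletes and tombstone buildup, which is why partition-aligned column families pair well with a datetime-based lookup TTL.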
A local transcoding from HFile to RocksDB SST shows that the storage size difference mainly comes from the compression codec: Hudi uses gzip by default while RocksDB uses Snappy, and the RocksDB storage turns out to be nearly 2x the size of the native HFile (the RocksDB WAL has been disabled explicitly since it is not necessary in our use cases).
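To illustrate how much the codec choice alone can move the on-disk size, here is a stdlib-only sketch. Snappy is not in the Python standard library, so zlib level 1 stands in for a light/fast codec and zlib level 9 for gzip-class compression; the synthetic payload and the resulting ratio are illustrative only, and the actual ~2x figure depends on the real index data and block layouts.

```python
import zlib

# Synthetic "index payload": repetitive record keys mapping to file locations,
# loosely resembling RLI entries (format is made up for illustration).
payload = b"".join(b"record-key-%08d|partition/file-%04d\n" % (i, i % 50)
                   for i in range(5_000))

heavy = len(zlib.compress(payload, level=9))  # gzip-class: smaller, slower
light = len(zlib.compress(payload, level=1))  # fast codec tradeoff: larger output
```

On data like this the heavy codec produces a noticeably smaller output than the light one, which is the direction of the HFile-vs-SST gap observed above; disabling the WAL removes the other write-amplification source, leaving the codec as the dominant factor.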