RocksDB as The Replica of MDT/RLI #18296
Replies: 4 comments 5 replies
-
Question 1: If we treat RocksDB as the primary source of truth during writes, how do concurrent updates from another writer become visible correctly during
+1. This will also mean that for partitioned RLI, i.e. records always having immutable partitioning field(s), the index size can be simply
Question 2: Is the 2x due just to compression? I think there will be an additional 2x of storage for un-compacted updates, since RocksDB will also do its own async compaction periodically.
-
The concurrent upsert consistency still works under OCC, since a task failover would retrigger the full bootstrap of the RocksDB replica anyway. As of now, the simple bucket index is a prerequisite for NBCC, so it should not be a strong concern or blocker for the RLI.
While RocksDB suggests light compression (LZ4/Snappy) for L0–L2 to get the best write throughput and heavier compression (Zstd/Zlib) for L3+, the native MDT has a similar situation with un-compacted update payloads in its log files. The difference may come down to the compaction frequency gap between the MDT (10 delta commits trigger a compaction) and RocksDB (4 SSTs trigger a compaction).
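To make the compaction-cadence comparison concrete, here is a toy back-of-envelope model. The per-commit update count and the one-flush-per-delta-commit assumption are hypothetical illustrations, not numbers from the discussion; only the trigger counts (10 delta commits for the MDT, 4 SSTs for RocksDB) come from the text above.

```python
# Toy model: worst-case un-compacted update backlog just before a compaction
# fires, assuming (hypothetically) one flush per delta commit and a fixed
# number of updates per commit.

def max_uncompacted(updates_per_commit: int, trigger: int) -> int:
    """Updates sitting un-compacted right before the trigger-th flush compacts."""
    return updates_per_commit * (trigger - 1)

# Trigger counts from the discussion: MDT compacts after 10 delta commits,
# RocksDB after 4 SSTs.
mdt_backlog = max_uncompacted(updates_per_commit=1_000, trigger=10)
rocks_backlog = max_uncompacted(updates_per_commit=1_000, trigger=4)
```

Under these assumptions RocksDB's more frequent compaction keeps the un-compacted backlog a few times smaller than the MDT's, which is the intuition behind not treating the un-compacted-update overhead as a RocksDB-specific cost.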
-
This is a very interesting proposal! I had a few questions to understand its motivation and mechanics:
-
I'm assuming you already read the Flink RLI RFC.
The 2nd one would take a lot of effort and we deem it a long-term solution; as of now, we prefer the 1st.
-
RocksDB as The Replica of MDT
The initial Flink RLI RFC: #17610
The update to RocksDB
The RocksDB instances are initialized and bootstrapped from scratch by reading the full MDT RLI on each job restart or task failover.
The incremental index upserts inferred from the data inputs are applied directly to these RocksDB instances. These index upserts are then passed along with the data payloads to the IndexWrite op for the actual MDT update. The MDT update happens in the same lifecycle as the data record write, and the incremental upserts are a replica image of the upserts into RocksDB. The RLI would be utilized for two cases:
The new write flow with RocksDB replica:
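A minimal sketch of the bootstrap-then-incremental-upsert flow described above, with a plain Python dict standing in for the per-task RocksDB instance and return values standing in for the Flink stream hand-off. All names here (`bootstrap_replica`, `apply_upserts`, `IndexUpdate`) are hypothetical illustrations, not Hudi APIs.

```python
from dataclasses import dataclass

@dataclass
class IndexUpdate:
    record_key: str
    location: str          # e.g. "partition/fileId"
    is_delete: bool = False

def bootstrap_replica(mdt_rli_scan):
    """Rebuild the replica from scratch via a full scan of the MDT RLI
    (done on every job restart or task failover)."""
    return {key: loc for key, loc in mdt_rli_scan}

def apply_upserts(replica, updates):
    """Apply incremental upserts locally, then forward the same updates
    downstream so the MDT write stays a replica image of the local state."""
    for u in updates:
        if u.is_delete:
            replica.pop(u.record_key, None)
        else:
            replica[u.record_key] = u.location
    return updates   # forwarded to the index write op alongside the data payloads

replica = bootstrap_replica([("k1", "p0/f1"), ("k2", "p0/f2")])
forwarded = apply_upserts(replica, [IndexUpdate("k2", "p1/f3"),
                                    IndexUpdate("k1", "", is_delete=True)])
```

The key property is that the updates applied locally and the updates forwarded to the MDT writer are the same objects, so the two stores only diverge until the in-flight commit lands, and a failover simply rebuilds from the MDT.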

The Clean/Eviction of Index Payloads in RocksDB
For global RLI, the RocksDB instance would be closed and removed each time a task fails over or the job restarts.
For partitioned RLI, with a local RocksDB instance per BucketAssign task, the payloads under the same data partition are stored as a separate column family; when the data partition is based on datetime, the column family can be dropped very efficiently with a configurable partition lookup TTL.
The Additional Storage Cost
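The TTL-based eviction for partitioned RLI described above can be sketched roughly as follows, with a dict of dicts standing in for RocksDB column families. The function name and the date-named partitions are illustrative; in RocksJava the per-partition drop would map to the real `dropColumnFamily` call, which is what makes this eviction cheap.

```python
from datetime import date, timedelta

def evict_expired(families: dict, today: date, ttl_days: int) -> list:
    """Drop every datetime-named column family older than the lookup TTL."""
    cutoff = today - timedelta(days=ttl_days)
    expired = [p for p in families if date.fromisoformat(p) < cutoff]
    for p in expired:
        del families[p]        # ~ db.dropColumnFamily(handle) in RocksJava
    return expired

# One column family per datetime data partition (contents are placeholders).
cfs = {"2024-01-01": {"k1": "f1"}, "2024-03-01": {"k2": "f2"}}
dropped = evict_expired(cfs, today=date(2024, 3, 5), ttl_days=7)
```

Dropping a whole column family avoids per-key deletes and tombstone buildup, which is why partition-aligned column families pair well with a datetime-based lookup TTL.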
A local transcoding from HFile to RocksDB SST shows that the storage size difference mainly comes from the compression codec: Hudi uses gzip by default while RocksDB uses Snappy, and the RocksDB storage turns out to be nearly 2x the size of the native HFile (the RocksDB WAL has been disabled explicitly since it is not necessary in our use cases).
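To illustrate how much the codec choice alone can move the on-disk size, here is a stdlib-only sketch. Snappy is not in the Python standard library, so zlib level 1 stands in for a light/fast codec and zlib level 9 for gzip-class compression; the synthetic payload and the resulting ratio are illustrative only, and the actual ~2x figure depends on the real index data and block layouts.

```python
import zlib

# Synthetic "index payload": repetitive record keys mapping to file locations,
# loosely resembling RLI entries (format is made up for illustration).
payload = b"".join(b"record-key-%08d|partition/file-%04d\n" % (i, i % 50)
                   for i in range(5_000))

heavy = len(zlib.compress(payload, level=9))  # gzip-class: smaller, slower
light = len(zlib.compress(payload, level=1))  # fast codec tradeoff: larger output
```

On data like this the heavy codec produces a noticeably smaller output than the light one, which is the direction of the HFile-vs-SST gap observed above; disabling the WAL removes the other write-amplification source, leaving the codec as the dominant factor.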