columnar: support bucket parallel read in region by yongman · Pull Request #10871 · pingcap/tiflash

yongman · 2026-05-26T10:39:18Z

What problem does this PR solve?

Issue Number: close #10844

Problem Summary:
The read concurrency is limited in region level.

What is changed and how it works?

Support buckets level concurrency read in one region.
Optimize the streams dispatching for pipelines. Change read task pre-alloc to on-demand to avoid thread imbalance.
Lazy create columnar reader & prefetch create reader in async thread.
Reuse SnapAccess for buckets concurrent read in one request.

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No code

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Documentation

Release note

None

Summary by CodeRabbit

New Features
- PD-backed caching for region bucket keys and shared-snapshot de-duplication to reduce duplicate leader snapshot requests and improve read latency.
- New engine interface hooks to fetch region bucket keys and clear shared-snapshot state.
Refactor
- Disaggregated columnar read pipeline rebuilt to a shared reader-task/work model with lazy, on-demand reader materialization for better parallelism and resource efficiency.
- Server skips legacy read-thread initialization when columnar mode is enabled.
- Source operator logging adds rows_per_sec and bytes_per_sec.
Tests
- Unit test added for cache eviction behavior during in-flight loads.

Signed-off-by: yongman <yming0221@gmail.com>

ti-chi-bot · 2026-05-26T10:39:20Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

coderabbitai · 2026-05-26T10:39:26Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
✅ Review completed - (🔄 Check again to review again)

📝 Walkthrough

Walkthrough

Adds FFI callbacks and hub caches (region bucket keys; shared snap access), exposes them via FFI, and refactors the C++ disaggregated columnar reader to use a shared serialized context, bucket-aware planning, and lazy, slot-based proxy reader materialization; also gates DeltaMerge thread init when columnar mode is active.

Changes

Columnar Storage Read Path Enhancement

Layer / File(s)	Summary
FFI Contract Extension for Bucket Keys and Snapshot Clearing `contrib/tiflash-columnar-hub/hub-runtime/ffi/src/RaftStoreProxyFFI/ProxyFFI.h`, `contrib/tiflash-columnar-hub/hub-runtime/src/interfaces.rs`, `contrib/tiflash-columnar-hub/hub-runtime/src/columnar_impls.rs`, `contrib/tiflash-columnar-hub/hub-runtime/src/run.rs`	Adds `fn_get_region_bucket_keys` and `fn_clear_shared_snap_access_by_start_ts` callbacks, exports `ffi_get_region_bucket_keys` and `ffi_clear_shared_snap_access_by_start_ts`, and wires them into the hub FFI helper.
Region Bucket Key Caching in PD Client `contrib/tiflash-columnar-hub/hub-runtime/src/cloud_helper.rs`	Adds `RegionBucketCacheEntry` and `PdClientWithCache::region_bucket_cache`, initializes and evicts it with region cache, and exposes `get_region_bucket_keys` that validates region epoch.
Shared Snapshot Access Deduplication `contrib/tiflash-columnar-hub/hub-runtime/src/cloud_helper.rs`	Introduces `SharedSnapAccessCache`, `SharedSnapAccessGroup`, and `SharedSnapAccessKey`, implements weak-ref reuse with per-key async loader mutexes, `get_or_request_shared_snapshot`, and exposes `clear_shared_snap_access_by_start_ts`.
Shared context and bucket helpers (C++) `dbms/src/Storages/StorageDisaggregatedColumnar.cpp`, `dbms/src/Storages/StorageDisaggregatedColumnar.h`	Adds `RNProxyReaderSharedContext`, bucket-boundary helpers, and a proxy-backed fetch for region bucket keys with Rust-GC cleanup; serializes shared inputs and normalizes datetime literals to UTC.
createProxyColumnarReader and RegionError handling `dbms/src/Storages/StorageDisaggregatedColumnar.cpp`	Implements per-reader payload construction, calls `fn_get_columnar_reader` with buffer/GC guards, and refines RegionError handling (epoch vs not-found) and lock resolution via shared context.
Slot-based materialization and runtime `dbms/src/Storages/StorageDisaggregatedColumnar.cpp`, `dbms/src/Storages/StorageDisaggregatedColumnar.h`	Implements `RNProxyReaderPlan`/slots, `RNProxyReadTask` slot lifecycle, `getOrCreateReader` with backoff, `prefetchReader`, `tryAcquireReaderWork()`, and lazy `RNProxyInputStream` creation and acquisition.
Bucket-aware planning and integration `dbms/src/Storages/StorageDisaggregatedColumnar.cpp`	During `buildProxyReadTask`, fetches region bucket keys, splits physical ranges by bucket boundaries, expands into multiple `RNProxyReaderPlan` entries when enabled, and creates a shared `RNProxyReadTask`.
Source IO loop and throughput metrics `dbms/src/Storages/StorageDisaggregatedColumnar.cpp`	Rewrites `RNProxyInputStream::readImpl` and `RNProxySourceOp::executeIOImpl` to use lazy readers, prefetch next readers, handle drained streams, and log rows/sec and bytes/sec.
Header/API refactors for plan-based readers `dbms/src/Storages/StorageDisaggregatedColumnar.h`	Refactors `RNProxyReadTask` constructor and public API, updates `RNProxyInputStream::Options` and `RNProxySourceOp::Options`, and changes runtime state types to support the new model.

DeltaMerge Pool Conditional Initialization

Layer / File(s)	Summary
Columnar Mode Detection in Server Startup `dbms/src/Server/Server.cpp`	Guards DeltaMerge read-thread pool initialization so it runs only when `disagg_opt.use_columnar` is false.

Sequence Diagram(s)

sequenceDiagram
  participant StorageDisaggregated as StorageDisaggregatedColumnar
  participant RNTask as RNProxyReadTask
  participant HubFFI as Hub ffi (fn_get_columnar_reader / fn_get_region_bucket_keys)
  participant PD as PD (Bucket lookup)
  StorageDisaggregated->>RNTask: buildProxyReadTask(reader_plans)
  RNTask->>HubFFI: fn_get_region_bucket_keys(region_id, region_ver)
  HubFFI->>PD: request bucket keys (PD API)
  PD-->>HubFFI: bucket keys
  RNTask->>HubFFI: fn_get_columnar_reader(serialized_payload)
  HubFFI-->>RNTask: ColumnarReader handle / error

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

pingcap/tiflash#10842: Adds cloud-helper FFI callbacks referenced by the disaggregated columnar proxy read path and start-ts-based shared snapshot cleanup.

Suggested reviewers

JaySon-Huang
JinheLin
Lloyd-Pottiger

Poem

🐰 I nibbled through buckets, keys in tow,
I cached the snaps where soft winds blow,
Readers wake only when they're due,
One fetch shared — then many flew,
Hooray — the columnar streams now flow!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 5.56% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'columnar: support bucket parallel read in region' is specific and directly related to the main objective of enabling bucket-level concurrent reads within regions.
Description check	✅ Passed	The PR description follows the required template structure with issue number, problem summary, and what-changed section. However, the commit-message section is empty and only manual testing is marked despite testing being required.
Linked Issues check	✅ Passed	The PR addresses objectives from `#10844` (columnar storage data source) by implementing bucket-level parallel reads, lazy reader creation, and shared snapshot access for concurrent region reads as required.
Out of Scope Changes check	✅ Passed	Code changes are primarily scoped to FFI interfaces, region bucket key handling, shared snapshot caching, lazy reader materialization, and stream dispatching optimization—all aligned with stated PR objectives.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Signed-off-by: yongman <yming0221@gmail.com>

pantheon-ai · 2026-06-09T03:48:14Z

@yongman I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details.

⏳ This process typically takes 10-30 minutes depending on the complexity of the changes.

_{ℹ️ Learn more details on Pantheon AI.}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (2)

dbms/src/Storages/StorageDisaggregatedColumnar.cpp (2)
95-107: ⚡ Quick win

Log the original cleanup exception.

This broad catch (...) drops the exception details from clear_shared_snap_access_by_start_ts, so failures here become almost impossible to diagnose. tryLogCurrentException(log, "...") is the project-standard way to preserve that context in cleanup paths.
♻️ Suggested change
         catch (...)
         {
-            try
-            {
-                LOG_WARNING(log, "clear shared snapaccess cache failed, start_ts={}", start_ts);
-            }
-            catch (...)
-            {
-            }
+            tryLogCurrentException(log, fmt::format("clear shared snapaccess cache failed, start_ts={}", start_ts));
         }
As per coding guidelines, Use tryLogCurrentException(log, "context") in broad catch (...) paths to avoid duplicated exception-formatting code.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dbms/src/Storages/StorageDisaggregatedColumnar.cpp` around lines 95 - 107,
The catch-all after calling clear_shared_snap_access_by_start_ts(start_ts,
proxy_ptr) drops the original exception; replace the inner LOG_WARNING/empty
catch with a call to tryLogCurrentException(log, "clear shared snapaccess cache
failed, start_ts={}", start_ts) (or at minimum tryLogCurrentException(log,
"clear shared snapaccess cache failed, start_ts=" + toString(start_ts))) so the
original exception context is preserved; update the catch (...) block in
StorageDisaggregatedColumnar.cpp accordingly to call tryLogCurrentException(log,
...) instead of swallowing the exception.
Source: Coding guidelines

675-675: ⚡ Quick win

Use the fmt-style DB::Exception constructor here.

This new branch is still using the legacy (message, code) form while the surrounding paths already use the repository-standard Exception(ErrorCodes::..., "...") style.
♻️ Suggested change
-        throw Exception("lock error", ErrorCodes::COLUMNAR_SNAPSHOT_ERROR);
+        throw Exception(ErrorCodes::COLUMNAR_SNAPSHOT_ERROR, "lock error");
As per coding guidelines, Use DB::Exception for error handling with the fmt-style constructor: throw Exception(ErrorCodes::SOME_CODE, "Message with {}", arg);.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dbms/src/Storages/StorageDisaggregatedColumnar.cpp` at line 675, Replace the
legacy throw Exception("lock error", ErrorCodes::COLUMNAR_SNAPSHOT_ERROR) usage
with the repository-standard fmt-style constructor: throw
Exception(ErrorCodes::COLUMNAR_SNAPSHOT_ERROR, "lock error"); update the throw
site in StorageDisaggregatedColumnar (the throw of Exception referencing
ErrorCodes::COLUMNAR_SNAPSHOT_ERROR) so it uses the ErrorCodes-first signature
and follow the same fmt-style pattern as surrounding paths.
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@contrib/tiflash-columnar-hub/hub-runtime/src/cloud_helper.rs`:
- Around line 974-981: The group is being recreated by insert/get_loader after
remove_by_start_ts tears it down, so make evictions sticky: add a
tombstone/generation field to SharedSnapAccessGroup (e.g., evicted: AtomicBool
or generation: AtomicU64) and update remove_by_start_ts to mark the group's
tombstone/generation before removing it from groups; then change group
creation/lookup logic in insert and get_loader to consult that
tombstone/generation (if evicted or generation mismatches, refuse to recreate or
insert into the group and return an error/ignore) so any in-flight loader
holding the group's lock cannot resurrect it after clear (also ensure
request_snapshot_from_leader paths check the group's generation/tombstone before
calling insert).

In `@dbms/src/Storages/StorageDisaggregatedColumnar.cpp`:
- Around line 1036-1076: The block-stream path is currently allowed to create
one RNProxyInputStream per bucket unit because planned_reader_num (derived from
total_max_reader_num after splitRangesByBucketKeys) can exceed num_streams;
modify the logic so getInputStreams() does not instantiate more streams than
num_streams: compute an effective_reader_num = min(planned_reader_num,
num_streams) (or cap by num_streams where planned_reader_num is used) and change
distribution so region_reader_plans and their bucket_units are assigned/shared
among those effective_reader_num logical readers (i.e., allow a single
RNProxyInputStream to process multiple bucket units/region plans) instead of
creating a stream per bucket unit; ensure enable_bucket_parallel,
planned_reader_num, total_max_reader_num and splitRangesByBucketKeys usages stay
but the instantiation point that creates RNProxyInputStream honors the cap and
uses a shared/task-pool style assignment.

---

Nitpick comments:
In `@dbms/src/Storages/StorageDisaggregatedColumnar.cpp`:
- Around line 95-107: The catch-all after calling
clear_shared_snap_access_by_start_ts(start_ts, proxy_ptr) drops the original
exception; replace the inner LOG_WARNING/empty catch with a call to
tryLogCurrentException(log, "clear shared snapaccess cache failed, start_ts={}",
start_ts) (or at minimum tryLogCurrentException(log, "clear shared snapaccess
cache failed, start_ts=" + toString(start_ts))) so the original exception
context is preserved; update the catch (...) block in
StorageDisaggregatedColumnar.cpp accordingly to call tryLogCurrentException(log,
...) instead of swallowing the exception.
- Line 675: Replace the legacy throw Exception("lock error",
ErrorCodes::COLUMNAR_SNAPSHOT_ERROR) usage with the repository-standard
fmt-style constructor: throw Exception(ErrorCodes::COLUMNAR_SNAPSHOT_ERROR,
"lock error"); update the throw site in StorageDisaggregatedColumnar (the throw
of Exception referencing ErrorCodes::COLUMNAR_SNAPSHOT_ERROR) so it uses the
ErrorCodes-first signature and follow the same fmt-style pattern as surrounding
paths.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 44b8edea-875b-4b99-9507-46688bf65dd6

📥 Commits

Reviewing files that changed from the base of the PR and between 88bb9d7 and d0564d7.

📒 Files selected for processing (8)

contrib/tiflash-columnar-hub/hub-runtime/ffi/src/RaftStoreProxyFFI/ProxyFFI.h
contrib/tiflash-columnar-hub/hub-runtime/src/cloud_helper.rs
contrib/tiflash-columnar-hub/hub-runtime/src/columnar_impls.rs
contrib/tiflash-columnar-hub/hub-runtime/src/interfaces.rs
contrib/tiflash-columnar-hub/hub-runtime/src/run.rs
dbms/src/Server/Server.cpp
dbms/src/Storages/StorageDisaggregatedColumnar.cpp
dbms/src/Storages/StorageDisaggregatedColumnar.h

Signed-off-by: yongman <yming0221@gmail.com>

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

dbms/src/Storages/StorageDisaggregatedColumnar.cpp (1)
471-485: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Handle the empty-plan path before reading the first stream header.

Line 1136 now allows buildProxyReadTask() to return no tasks, but the block-stream path still assumes at least one source exists and unconditionally calls pipeline.firstStream() on Line 480. An empty-range scan will therefore crash here instead of returning an empty result.

Also applies to: 1136-1142
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dbms/src/Storages/StorageDisaggregatedColumnar.cpp` around lines 471 - 485,
The code assumes pipeline.firstStream() exists even when
buildProxyReadTask()/read_proxy_tasks produced no streams; detect the empty-plan
path by checking if pipeline.streams (or read_proxy_tasks) is empty before
calling pipeline.firstStream(), and return or set analyzer appropriately to
avoid dereferencing a non-existent stream. Specifically, after populating
pipeline.streams (and after executeGeneratedColumnPlaceholder), if
pipeline.streams.empty() then bypass constructing DAGExpressionAnalyzer from
pipeline.firstStream() (e.g., create an empty NamesAndTypes for analyzer or
leave analyzer null and ensure callers handle empty results) so functions like
DAGExpressionAnalyzer construction and further processing do not crash when
there are no proxy read tasks.

🧹 Nitpick comments (1)

dbms/src/Storages/StorageDisaggregatedColumnar.cpp (1)

668-680: ⚡ Quick win

Use the repo-standard DB::Exception constructor here.

These throws reverse the expected (error_code, format, ...) order. Please switch them to the fmt-style form so they do not depend on a legacy overload.

Suggested cleanup

-        throw Exception("lock error", ErrorCodes::COLUMNAR_SNAPSHOT_ERROR);
+        throw Exception(ErrorCodes::COLUMNAR_SNAPSHOT_ERROR, "lock error");
...
-            throw Exception("read_block failed in tiflash-proxy", ErrorCodes::LOGICAL_ERROR);
+            throw Exception(ErrorCodes::LOGICAL_ERROR, "read_block failed in tiflash-proxy");

As per coding guidelines, **/*.cpp: Use DB::Exception for error handling with the fmt-style constructor: throw Exception(ErrorCodes::SOME_CODE, "Message with {}", arg);

Also applies to: 1250-1254

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dbms/src/Storages/StorageDisaggregatedColumnar.cpp` around lines 668 - 680,
Replace legacy Exception throws that pass message then code with the
repo-standard DB::Exception fmt-style constructor; specifically in
StorageDisaggregatedColumnar.cpp where the LockedError handling uses throw
Exception("lock error", ErrorCodes::COLUMNAR_SNAPSHOT_ERROR) (the block that
parses lock_info, calls cluster->lock_resolver->resolveLocks and logs
before_expired) change it to throw
Exception(ErrorCodes::COLUMNAR_SNAPSHOT_ERROR, "lock error") and apply the same
conversion for the similar throw in the other location referenced around the
1250–1254 area so all Exception instantiations use the (ErrorCodes::..., "fmt
{}", args...) form.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@dbms/src/Storages/StorageDisaggregatedColumnar.cpp`:
- Around line 471-485: The code assumes pipeline.firstStream() exists even when
buildProxyReadTask()/read_proxy_tasks produced no streams; detect the empty-plan
path by checking if pipeline.streams (or read_proxy_tasks) is empty before
calling pipeline.firstStream(), and return or set analyzer appropriately to
avoid dereferencing a non-existent stream. Specifically, after populating
pipeline.streams (and after executeGeneratedColumnPlaceholder), if
pipeline.streams.empty() then bypass constructing DAGExpressionAnalyzer from
pipeline.firstStream() (e.g., create an empty NamesAndTypes for analyzer or
leave analyzer null and ensure callers handle empty results) so functions like
DAGExpressionAnalyzer construction and further processing do not crash when
there are no proxy read tasks.

---

Nitpick comments:
In `@dbms/src/Storages/StorageDisaggregatedColumnar.cpp`:
- Around line 668-680: Replace legacy Exception throws that pass message then
code with the repo-standard DB::Exception fmt-style constructor; specifically in
StorageDisaggregatedColumnar.cpp where the LockedError handling uses throw
Exception("lock error", ErrorCodes::COLUMNAR_SNAPSHOT_ERROR) (the block that
parses lock_info, calls cluster->lock_resolver->resolveLocks and logs
before_expired) change it to throw
Exception(ErrorCodes::COLUMNAR_SNAPSHOT_ERROR, "lock error") and apply the same
conversion for the similar throw in the other location referenced around the
1250–1254 area so all Exception instantiations use the (ErrorCodes::..., "fmt
{}", args...) form.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 3573fc15-b3ea-48c0-b058-892e56d64f1d

📥 Commits

Reviewing files that changed from the base of the PR and between d0564d7 and 7b19a1c.

📒 Files selected for processing (2)

dbms/src/Storages/StorageDisaggregatedColumnar.cpp
dbms/src/Storages/StorageDisaggregatedColumnar.h

🚧 Files skipped from review as they are similar to previous changes (1)

dbms/src/Storages/StorageDisaggregatedColumnar.h

Signed-off-by: yongman <yming0221@gmail.com>

yongman · 2026-06-09T08:29:28Z

/hold

Signed-off-by: yongman <yming0221@gmail.com>

JaySon-Huang · 2026-06-09T13:37:55Z

/test pull-integration-next-gen-columnar

yongman · 2026-06-10T01:46:49Z

/test pull-integration-next-gen-columnar

JaySon-Huang · 2026-06-10T05:56:28Z

Bug: EPOCH_NOT_MATCH Infinite Retry on Region 125 in Disaggregated Columnar Read

Summary

When running fullstack-test2/clustered_index/ddl.test against a next-gen columnar cluster, the test fails at line 79 with Region error EPOCH_NOT_MATCH(125:68:1). The TiFlash compute node repeatedly retries with a stale region epoch (shard_ver=68) for ~60 seconds, never updating to the current epoch (version=69), causing the query to time out.

Test Case

Test file: tests/fullstack-test2/clustered_index/ddl.test
Failing line: 79 — set session tidb_isolation_read_engines='tiflash'; select * from test.t_3;
Cluster: next-gen columnar (tiflash-cn0 only, no tiflash-wn0)
Branch: bucket-read

Test Flow (simplified)

-- t_1 (int handle clustered): PASSED
create table test.t_1(a int primary key clustered, col int);
insert into test.t_1 values(1,2),(2,3);
alter table test.t_1 set tiflash replica 1;
-- wait_table → select → alter column → select → drop

-- t_2 (common handle clustered): PASSED
create table test.t_2(a varchar(10), b int, c int, primary key(a, b) clustered);
insert into test.t_2 values('1',2,3),('2',3,4);
alter table test.t_2 set tiflash replica 1;
-- wait_table → select → alter column → select → drop

-- t_3 (composite PRIMARY KEY clustered): FAILED
create table test.t_3 (A int, B varchar(20), C int, D int, PRIMARY KEY(A,C) CLUSTERED);
insert into test.t_3 values (1,'1',1,1),(2,'2',2,2);
alter table test.t_3 set tiflash replica 1;
-- wait_table → select → ERROR!

Error

ERROR 1105 (HY000) at line 1: other error for mpp stream:
Poco::Exception. Code: 1000, e.code() = 0,
e.displayText() = Exception: Region error EPOCH_NOT_MATCH(125:68:1),
e.what() = Exception

Log Analysis

All log files are under tests/fullstack-test-next-gen-columnar/log/tiflash-cn0/.

Timeline

Timestamp	Event	Source
`12:36:27.749`	t_2 query: `buildProxyReadTask, physical_table_id=14, region_ver_id={125,1,68}`	tiflash.log:362
`12:36:27.750`	Proxy fetches snapshot: `shard_ver=68, start_key=...0E, end_key=...0F` — SUCCESS	tiflash_tikv.log:42
`12:36:27.860`	t_2 query (second): `cache_hit: true` — SUCCESS	tiflash_tikv.log:58
`12:36:27.932`	`drop table test.t_1` — triggers region split	test output
`12:36:27.998`	`drop table test.t_2` — region 125 epoch bumps 68→69	test output
`12:36:30.168`	t_3 query: `buildProxyReadTask, physical_table_id=23, region_ver_id={125,1,68}` ← STALE!	tiflash.log:560
`12:36:30.169`	Proxy request: `shard_ver=68` — EPOCH_NOT_MATCH (current version=69)	tiflash_tikv.log:66
`12:36:30.169`	`try drop region, region_ver_id={125,1,68}`	tiflash.log:581
`12:36:30.169`	`drop region because of send failure`	tiflash.log:582
`12:36:30.169` ~ `12:37:30`	Infinite retry loop: same `shard_ver=68`, same EPOCH_NOT_MATCH, repeated ~40+ times	tiflash_error.log
`12:37:30`	Query times out after ~60s	test output

Region 125 — Before vs After Split

Before (epoch=68, t_2 exists):

start_key: 78FFFFFE74800000000000000E   → table_id=14 (t_2)
end_key:   78FFFFFE74800000000000000F
columnar_table_ids: 14

After (epoch=69, t_2 dropped, t_3 created):

start_key: 78FFFFFE74800000FF0000000017000000FC  → table_id=23 (t_3)
end_key:   78FFFFFE748000FFFFFFFFFFFFC1000000FC
region_epoch { conf_ver: 1 version: 69 }

Key changes:

Epoch bumped from 68 to 69 — region split after dropping t_1/t_2
Key encoding changed — 78FFFFFE7480...0E vs 78FFFFFE74800000FF...17 — the FF indicates a different key encoding for the new table's clustered index
Key range completely different — from table 14's range to table 23's range

Root Cause

The bug is a missing epoch refresh in the EPOCH_NOT_MATCH handling path, spanning two layers:

Architecture

C++ (StorageDisaggregatedColumnar.cpp)     Rust (cloud_helper.rs in hub-runtime)
─────────────────────────────────────     ─────────────────────────────────────
                                          ┌─────────────────────────────────┐
buildProxyReadTask()                      │ request_snapshot_from_leader()  │
  ↓ resolve key ranges from region_cache  │   while backoff.next_delay():   │
  ↓ captures region_ver=68                │     HTTP GET tikv0/.../125?     │
  ↓                                      │       shard_ver=68&start_key=.. │
reader_plans = [{125, 68, 1}, ...]       │                                  │
  ↓                                      │     if not_leader:               │
createColumnarReaderWithBackoff()        │       evict + retry ✅           │
  while(true):                           │                                  │
    ↓                                    │     if epoch_not_match:          │
    createProxyColumnarReader(plan)      │       return Err(...) ❌ ← BUG   │
      ↓ FFI with region_ver=68           │       (never retries!)           │
      ↓                                  │                                  │
      ← EPOCH_NOT_MATCH                  └─────────────────────────────────┘
      ← dropRegion() from cache
      ← throw RegionException
      ↓
    catch → backoff → RETRY with SAME plan (still ver=68!)

Two Contributing Issues

Issue 1 (C++ layer): `reader_plan` is immutable in the retry loop

StorageDisaggregatedColumnar.cpp:763-792 — createColumnarReaderWithBackoff:

ColumnarReaderPtr RNProxyReadTask::createColumnarReaderWithBackoff(size_t reader_index) const
{
    const auto & reader_plan = reader_plans[reader_index];  // const ref — NEVER updated!
    pingcap::kv::Backoffer bo(pingcap::kv::copNextMaxBackoff);
    while (true)
    {
        try
        {
            return createProxyColumnarReader(*shared_reader_context, reader_plan);
            //                                  ^^^^^^^^^^^^^^ always the same {125, 68, 1}
        }
        catch (RegionException & e)
        {
            // Only backoff — reader_plan is NOT updated with the new epoch
            bo.backoff(pingcap::kv::boRegionMiss, pingcap::Exception(e.message(), e.code()));
        }
    }
}

The reader_plan is a const reference captured once before the loop. Every retry uses the same stale epoch.

Issue 2 (Rust layer — PRIMARY): EPOCH_NOT_MATCH is not retried

hub-runtime/src/cloud_helper.rs:715-722 — request_snapshot_from_leader:

// NOT_LEADER: properly retried ✅
if delegate_resp.get_region_error().has_not_leader() {
    error!("{} request_snapshot_from_leader failed, not leader", tag);
    pd_client.evict_region_cache(shard_id);
    leader_changed = true;
    last_err = Some(Error::RegionError(delegate_resp.take_region_error()));
    tokio::time::sleep(next_delay).await;
    continue;  // ← retries!
}

// EPOCH_NOT_MATCH: immediately returned ❌
if delegate_resp.get_region_error().has_epoch_not_match() {
    // Return epoch not match error to TiDB to retry.  ← comment is wrong; caller is TiFlash CN
    error!("{} request_snapshot_from_leader failed, epoch not match, {:?}", ...);
    return Err(Error::RegionError(delegate_resp.take_region_error()));  // ← never retries!
}

Both not_leader and epoch_not_match are transient region errors that could be resolved by evicting the stale cache and retrying. Yet not_leader retries inside the same backoff loop, while epoch_not_match immediately gives up and returns the error to C++.

Why Previous Tables Passed

Tables t_1 and t_2 worked because:

The region cache was fresh (no region split had occurred yet)
shard_ver=68 was still current when their queries ran
Their queries completed before the drop table triggered the region split

Table t_3 failed because:

The region split (epoch 68→69) happened between t_2's last query and t_3's first query
buildProxyReadTask resolved t_3's key ranges and got the stale region info (ver=68) from cache
The Rust layer refused to retry on EPOCH_NOT_MATCH

The Infinite Loop

C++ retry loop           Rust request_snapshot_from_leader      TiKV
──────────────           ─────────────────────────────────      ────
build plan {125,68,1}
    ↓
create reader(68)  ──→  GET /snapshot/125?shard_ver=68  ──→  EPOCH_NOT_MATCH(69)
                         return Err(RegionError)          ←──  epoch {version:69}
    ← RegionException
dropRegion from cache
backoff & retry
    ↓
create reader(68)  ──→  GET /snapshot/125?shard_ver=68  ──→  EPOCH_NOT_MATCH(69)
    ↓                        ↓                                  ↓
  (repeats ~40+ times for ~60 seconds until timeout)

Recommended Fix

Fix in Rust layer (hub-runtime/src/cloud_helper.rs:715-722) — update shard_ver from the error response and retry, matching the not_leader pattern:

if delegate_resp.get_region_error().has_epoch_not_match() {
    error!(
        "{} request_snapshot_from_leader failed, epoch not match, {:?}",
        tag,
        delegate_resp.get_region_error()
    );
    // Extract new shard_ver from the error and retry (same pattern as not_leader)
    let epoch_not_match = delegate_resp.get_region_error().get_epoch_not_match();
    if !epoch_not_match.get_current_regions().is_empty() {
        shard_ver = epoch_not_match.get_current_regions()[0]
            .get_region_epoch()
            .get_version();
    }
    pd_client.evict_region_cache(shard_id);
    last_err = Some(Error::RegionError(delegate_resp.take_region_error()));
    tokio::time::sleep(next_delay).await;
    continue;  // retry with updated shard_ver
}

Why fix in Rust rather than C++:

Precedent already exists — not_leader is handled this way in the same function
All needed info is available — the new shard_ver is in the error response
Follows TiKV client conventions — client libraries handle epoch refresh internally
Minimal change — only ~5 lines, no C++ API changes needed
Transparent to callers — C++ retry loop works correctly without changes

Affected Files

File	Role
`dbms/src/Storages/StorageDisaggregatedColumnar.cpp:763-792`	C++ retry loop — never updates reader_plan
`dbms/src/Storages/StorageDisaggregatedColumnar.cpp:605-634`	C++ error handling — drops cache but doesn't update plan
`dbms/src/Storages/Columnar/RNProxyReaderPlan.h:30-36`	Plan struct — region_ver is a value type
`contrib/tiflash-columnar-hub/hub-runtime/src/cloud_helper.rs:715-722`	★ PRIMARY BUG — EPOCH_NOT_MATCH not retried
`contrib/tiflash-columnar-hub/hub-runtime/src/cloud_helper.rs:707-713`	Reference — `not_leader` is retried correctly

Log Files

tests/fullstack-test-next-gen-columnar/log/tiflash-cn0/tiflash.log — C++ side (plan building, retry loop)
tests/fullstack-test-next-gen-columnar/log/tiflash-cn0/tiflash_error.log — C++ warnings (repeated EPOCH_NOT_MATCH)
tests/fullstack-test-next-gen-columnar/log/tiflash-cn0/tiflash_tikv.log — Rust proxy side (snapshot requests, epoch details)

How Bucket-Read Branch Introduced This Bug

Key Commits

Commit	Author	Date	Description
`61d37c1750`	Ray Yan	May 22	`: refactor proxy to hub lib for columnar` — introduced `cloud_helper.rs` and `EPOCH_NOT_MATCH` non-retry (on both* master and bucket-read)
`d20fc65bb4`	yongman	Jun 2	`snapaccess cache` — wrapped `request_snapshot_from_leader` with `get_or_request_shared_snapshot` caching layer
`d0564d766b`	yongman	Jun 9	`optimize snapaccess cache for buckets read` — refactored cache to `DashMap` groups by `start_ts`; added `clear_shared_snap_access_by_start_ts` FFI
`82f11a7bc9`	yongman	Jun 9	`clear snap access in last owner drop` — changed cache clearing from eager (every context destruction) to lazy (only last owner)
`de6a182f67`	yongman	Jun 9	`avoid inflight request leak snapaccess` — added terminal state to prevent inflight loader leaks after cache clear

Latent Bug vs. Exposure

The EPOCH_NOT_MATCH non-retry in request_snapshot_from_leader has existed since 61d37c1750 (which is on both master and bucket-read):

// hub-runtime/src/cloud_helper.rs:715-722
// Same on master AND bucket-read:
if delegate_resp.get_region_error().has_epoch_not_match() {
    // Return epoch not match error to TiDB to retry.  ← comment is wrong
    return Err(Error::RegionError(delegate_resp.take_region_error()));  // never retries
}

Compare with not_leader handling in the same backoff loop:

// NotLeader is properly retried:
if delegate_resp.get_region_error().has_not_leader() {
    pd_client.evict_region_cache(shard_id);
    leader_changed = true;
    tokio::time::sleep(next_delay).await;
    continue;  // retries ✅
}

Why master doesn't reproduce:

No caching wrapper — master calls request_snapshot_from_leader directly. Bucket-read interposes get_or_request_shared_snapshot which adds loader mutex serialization and SnapAccess caching by (shard_id, shard_ver, start_ts, start_table_id, end_table_id).
Different timing characteristics — the caching layer changes request serialization patterns. Combined with lazy cache clearing (82f11a7bc9), stale region cache entries survive longer, making the race window between region split and query execution wider.
Test path not exercised — the fullstack-test-next-gen-columnar test infrastructure may not have been actively run against upstream/master.

What Master and Bucket-Read Share

Component	Master	Bucket-Read	Same?
`request_snapshot_from_leader` EPOCH_NOT_MATCH handling	`return Err`	`return Err`	✅ Identical
C++ `createColumnarReaderWithBackoff` retry loop	const `reader_plan`, same stall epoch	const `reader_plan`, same stale epoch	✅ Identical
C++ `createProxyColumnarReader` epoch handling	drops region cache, throws	drops region cache, throws	✅ Identical

What Bucket-Read Added

Component	Master	Bucket-Read
`make_columnar_reader` → snapshot request	Direct call to `request_snapshot_from_leader`	Via `get_or_request_shared_snapshot` caching wrapper
SnapAccess caching	None	`SharedSnapAccessCache` by `(shard_id, shard_ver, start_ts, start_table_id, end_table_id)`
Cache clearing	N/A	Lazy (only on last owner drop via `StartTsClearRegistry`)
Loader serialization	None	Per-key `Mutex` via `get_loader()`

Conclusion

The root cause (EPOCH_NOT_MATCH non-retry in request_snapshot_from_leader) was introduced in 61d37c1750, which is on both branches. The bucket-read exposure was caused by d20fc65bb4 and d0564d766b which added the get_or_request_shared_snapshot caching layer, changing the timing characteristics enough to turn a latent race into a reproducible failure.

Reproduction

cd tests/fullstack-test-next-gen-columnar
./compose.sh up -d
# wait for cluster ready
./compose.sh exec -T tiflash-cn0 bash -c \
  'cd /tests && ENABLE_NEXT_GEN=true verbose=true ./run-test.sh fullstack-test2/clustered_index/ddl.test'

Signed-off-by: yongman <yming0221@gmail.com>

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

contrib/tiflash-columnar-hub/hub-runtime/src/cloud_helper.rs (1)
715-723: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Retry epoch_not_match in the hub instead of returning it.

This still matches the failure described in the PR discussion: after a split, returning here can leave the caller recreating readers with the same stale shard_ver until timeout. Handle epoch_not_match like not_leader: evict the region cache, refresh the retry epoch from the error payload, back off, and continue inside request_snapshot_from_leader rather than surfacing the stale plan upstream.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@contrib/tiflash-columnar-hub/hub-runtime/src/cloud_helper.rs` around lines
715 - 723, The code in request_snapshot_from_leader currently treats
delegate_resp.get_region_error().has_epoch_not_match() the same as other fatal
errors and returns Error::RegionError, which leaves callers with stale
shard_ver; instead handle epoch_not_match like the not_leader path: evict the
region from the local region cache, extract and refresh the retry epoch/version
from delegate_resp.get_region_error() (use the epoch info in the RegionError
payload), apply a backoff, and loop to retry the request rather than returning
Error::RegionError. Locate the epoch_not_match branch in
request_snapshot_from_leader and modify it to call the same cache-eviction and
retry logic used for not_leader (preserving logging) and continue the retry loop
with the updated epoch/version.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@contrib/tiflash-columnar-hub/hub-runtime/src/cloud_helper.rs`:
- Around line 715-723: The code in request_snapshot_from_leader currently treats
delegate_resp.get_region_error().has_epoch_not_match() the same as other fatal
errors and returns Error::RegionError, which leaves callers with stale
shard_ver; instead handle epoch_not_match like the not_leader path: evict the
region from the local region cache, extract and refresh the retry epoch/version
from delegate_resp.get_region_error() (use the epoch info in the RegionError
payload), apply a backoff, and loop to retry the request rather than returning
Error::RegionError. Locate the epoch_not_match branch in
request_snapshot_from_leader and modify it to call the same cache-eviction and
retry logic used for not_leader (preserving logging) and continue the retry loop
with the updated epoch/version.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: d1a3ab49-819d-4b19-8b74-58f434ca7229

📥 Commits

Reviewing files that changed from the base of the PR and between ac3da5b and 03d1d95.

📒 Files selected for processing (3)

contrib/tiflash-columnar-hub/hub-runtime/src/cloud_helper.rs
dbms/src/Storages/StorageDisaggregatedColumnar.cpp
dbms/src/Storages/StorageDisaggregatedColumnar.h

🚧 Files skipped from review as they are similar to previous changes (1)

dbms/src/Storages/StorageDisaggregatedColumnar.cpp

Signed-off-by: yongman <yming0221@gmail.com>

JaySon-Huang · 2026-06-10T12:41:27Z

/retest

JaySon-Huang · 2026-06-10T13:53:57Z

/test pull-unit-next-gen

Signed-off-by: yongman <yming0221@gmail.com>

yongman · 2026-06-11T02:26:03Z

/retest

yongman · 2026-06-11T08:06:45Z

/retest-required

Signed-off-by: JaySon-Huang <tshent@qq.com>

ti-chi-bot · 2026-06-11T10:08:04Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JaySon-Huang, Lloyd-Pottiger

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [JaySon-Huang,Lloyd-Pottiger]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot · 2026-06-11T10:08:07Z

[LGTM Timeline notifier]

Timeline:

2026-06-11 05:27:19.235102743 +0000 UTC m=+1024140.305420123: ☑️ agreed by Lloyd-Pottiger.
2026-06-11 10:08:06.174158809 +0000 UTC m=+1040987.244476199: ☑️ agreed by JaySon-Huang.

yongman · 2026-06-11T12:04:20Z

/retest

JaySon-Huang · 2026-06-11T13:45:36Z

/cherry-pick release-nextgen-202603

ti-chi-bot · 2026-06-11T13:46:18Z

@JaySon-Huang: new pull request created to branch release-nextgen-202603: #10899.

Details

In response to this:

/cherry-pick release-nextgen-202603

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

close #10844\n\nSigned-off-by: yongman <yming0221@gmail.com>\nSigned-off-by: JaySon-Huang <tshent@qq.com>\n\nCo-authored-by: yongman <yming0221@gmail.com>\nCo-authored-by: JaySon-Huang <tshent@qq.com>

columnar: support bucket parallel read in region

fb549a4

Signed-off-by: yongman <yming0221@gmail.com>

ti-chi-bot Bot added do-not-merge/needs-linked-issue do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. labels May 26, 2026

ti-chi-bot Bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label May 26, 2026

optimize tasks dispatch

3989666

Signed-off-by: yongman <yming0221@gmail.com>

yongman force-pushed the bucket-read branch from 75f57b2 to 3989666 Compare June 2, 2026 13:45

yongman added 8 commits June 2, 2026 22:27

add bucket log

b7d4895

Signed-off-by: yongman <yming0221@gmail.com>

optimize bucket

4a82a09

Signed-off-by: yongman <yming0221@gmail.com>

debug logs

778e6a1

Signed-off-by: yongman <yming0221@gmail.com>

decode bucket key

e333822

Signed-off-by: yongman <yming0221@gmail.com>

snapaccess cache

d20fc65

Signed-off-by: yongman <yming0221@gmail.com>

avoid create segment reader thread when use columnar

fa1905f

Signed-off-by: yongman <yming0221@gmail.com>

add region buckets cache

470b18b

Signed-off-by: yongman <yming0221@gmail.com>

optimize snapaccess cache for buckets read

d0564d7

Signed-off-by: yongman <yming0221@gmail.com>

ti-chi-bot Bot removed the do-not-merge/needs-linked-issue label Jun 9, 2026

yongman marked this pull request as ready for review June 9, 2026 03:48

ti-chi-bot Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 9, 2026

coderabbitai Bot reviewed Jun 9, 2026

View reviewed changes

Comment thread contrib/tiflash-columnar-hub/hub-runtime/src/cloud_helper.rs

Comment thread dbms/src/Storages/StorageDisaggregatedColumnar.cpp

fix legacy inputstream concurrency

7b19a1c

Signed-off-by: yongman <yming0221@gmail.com>

coderabbitai Bot reviewed Jun 9, 2026

View reviewed changes

avoid inflight request leak snapaccess

de6a182

Signed-off-by: yongman <yming0221@gmail.com>

ti-chi-bot Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 9, 2026

yongman added 3 commits June 9, 2026 17:30

clear snap access in last owner drop

82f11a7

Signed-off-by: yongman <yming0221@gmail.com>

format

2ff0938

Signed-off-by: yongman <yming0221@gmail.com>

Merge branch 'master' into bucket-read

ac3da5b

fix epoch not match retry

03d1d95

Signed-off-by: yongman <yming0221@gmail.com>

coderabbitai Bot reviewed Jun 10, 2026

View reviewed changes

format

3344adc

Signed-off-by: yongman <yming0221@gmail.com>

JaySon-Huang reviewed Jun 10, 2026

View reviewed changes

Comment thread dbms/src/Storages/StorageDisaggregatedColumnar.cpp Outdated

JaySon-Huang reviewed Jun 10, 2026

View reviewed changes

Comment thread contrib/tiflash-columnar-hub/hub-runtime/src/cloud_helper.rs Outdated

JaySon-Huang reviewed Jun 10, 2026

View reviewed changes

Comment thread dbms/src/Storages/StorageDisaggregatedColumnar.cpp Outdated

yongman added 2 commits June 11, 2026 09:26

polish logs and rename proxy to columnar

b40f95f

Signed-off-by: yongman <yming0221@gmail.com>

address comments

c7339d9

Signed-off-by: yongman <yming0221@gmail.com>

Lloyd-Pottiger approved these changes Jun 11, 2026

View reviewed changes

ti-chi-bot Bot added approved needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Jun 11, 2026

add comments and fix lint

87272b1

Signed-off-by: JaySon-Huang <tshent@qq.com>

JaySon-Huang approved these changes Jun 11, 2026

View reviewed changes

ti-chi-bot Bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Jun 11, 2026

ti-chi-bot Bot merged commit 50e6dea into pingcap:master Jun 11, 2026
11 checks passed

ti-chi-bot mentioned this pull request Jun 11, 2026

columnar: support bucket parallel read in region (#10871) #10899

Merged

12 tasks

Conversation

yongman commented May 26, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What problem does this PR solve?

What is changed and how it works?

Check List

Release note

Summary by CodeRabbit

Uh oh!

ti-chi-bot Bot commented May 26, 2026

Uh oh!

coderabbitai Bot commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

pantheon-ai Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

yongman commented Jun 9, 2026

Uh oh!

JaySon-Huang commented Jun 9, 2026

Uh oh!

yongman commented Jun 10, 2026

Uh oh!

JaySon-Huang commented Jun 10, 2026

Bug: EPOCH_NOT_MATCH Infinite Retry on Region 125 in Disaggregated Columnar Read

Summary

Test Case

Test Flow (simplified)

Error

Log Analysis

Timeline

Region 125 — Before vs After Split

Root Cause

Architecture

Two Contributing Issues

Issue 1 (C++ layer): reader_plan is immutable in the retry loop

Issue 2 (Rust layer — PRIMARY): EPOCH_NOT_MATCH is not retried

Why Previous Tables Passed

The Infinite Loop

Recommended Fix

Why fix in Rust rather than C++:

Affected Files

Log Files

How Bucket-Read Branch Introduced This Bug

Key Commits

Latent Bug vs. Exposure

What Master and Bucket-Read Share

What Bucket-Read Added

Conclusion

Reproduction

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

JaySon-Huang commented Jun 10, 2026

Uh oh!

JaySon-Huang commented Jun 10, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

yongman commented Jun 11, 2026

Uh oh!

yongman commented May 26, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 26, 2026 •

edited

Loading

pantheon-ai Bot commented Jun 9, 2026 •

edited

Loading

Issue 1 (C++ layer): `reader_plan` is immutable in the retry loop