Skip to content

columnar: support bucket parallel read in region#10871

Merged
ti-chi-bot[bot] merged 20 commits into
pingcap:masterfrom
yongman:bucket-read
Jun 11, 2026
Merged

columnar: support bucket parallel read in region#10871
ti-chi-bot[bot] merged 20 commits into
pingcap:masterfrom
yongman:bucket-read

Conversation

@yongman

@yongman yongman commented May 26, 2026

Copy link
Copy Markdown
Member

What problem does this PR solve?

Issue Number: close #10844

Problem Summary:
The read concurrency is limited in region level.

What is changed and how it works?


  1. Support buckets level concurrency read in one region.
  2. Optimize the streams dispatching for pipelines. Change read task pre-alloc to on-demand to avoid thread imbalance.
  3. Lazy create columnar reader & prefetch create reader in async thread.
  4. Reuse SnapAccess for buckets concurrent read in one request.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None

Summary by CodeRabbit

  • New Features

    • PD-backed caching for region bucket keys and shared-snapshot de-duplication to reduce duplicate leader snapshot requests and improve read latency.
    • New engine interface hooks to fetch region bucket keys and clear shared-snapshot state.
  • Refactor

    • Disaggregated columnar read pipeline rebuilt to a shared reader-task/work model with lazy, on-demand reader materialization for better parallelism and resource efficiency.
    • Server skips legacy read-thread initialization when columnar mode is enabled.
    • Source operator logging adds rows_per_sec and bytes_per_sec.
  • Tests

    • Unit test added for cache eviction behavior during in-flight loads.

Signed-off-by: yongman <yming0221@gmail.com>
@ti-chi-bot

ti-chi-bot Bot commented May 26, 2026

Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot Bot added do-not-merge/needs-linked-issue do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. labels May 26, 2026
@coderabbitai

coderabbitai Bot commented May 26, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • ✅ Review completed - (🔄 Check again to review again)
📝 Walkthrough

Walkthrough

Adds FFI callbacks and hub caches (region bucket keys; shared snap access), exposes them via FFI, and refactors the C++ disaggregated columnar reader to use a shared serialized context, bucket-aware planning, and lazy, slot-based proxy reader materialization; also gates DeltaMerge thread init when columnar mode is active.

Changes

Columnar Storage Read Path Enhancement

Layer / File(s) Summary
FFI Contract Extension for Bucket Keys and Snapshot Clearing
contrib/tiflash-columnar-hub/hub-runtime/ffi/src/RaftStoreProxyFFI/ProxyFFI.h, contrib/tiflash-columnar-hub/hub-runtime/src/interfaces.rs, contrib/tiflash-columnar-hub/hub-runtime/src/columnar_impls.rs, contrib/tiflash-columnar-hub/hub-runtime/src/run.rs
Adds fn_get_region_bucket_keys and fn_clear_shared_snap_access_by_start_ts callbacks, exports ffi_get_region_bucket_keys and ffi_clear_shared_snap_access_by_start_ts, and wires them into the hub FFI helper.
Region Bucket Key Caching in PD Client
contrib/tiflash-columnar-hub/hub-runtime/src/cloud_helper.rs
Adds RegionBucketCacheEntry and PdClientWithCache::region_bucket_cache, initializes and evicts it with region cache, and exposes get_region_bucket_keys that validates region epoch.
Shared Snapshot Access Deduplication
contrib/tiflash-columnar-hub/hub-runtime/src/cloud_helper.rs
Introduces SharedSnapAccessCache, SharedSnapAccessGroup, and SharedSnapAccessKey, implements weak-ref reuse with per-key async loader mutexes, get_or_request_shared_snapshot, and exposes clear_shared_snap_access_by_start_ts.
Shared context and bucket helpers (C++)
dbms/src/Storages/StorageDisaggregatedColumnar.cpp, dbms/src/Storages/StorageDisaggregatedColumnar.h
Adds RNProxyReaderSharedContext, bucket-boundary helpers, and a proxy-backed fetch for region bucket keys with Rust-GC cleanup; serializes shared inputs and normalizes datetime literals to UTC.
createProxyColumnarReader and RegionError handling
dbms/src/Storages/StorageDisaggregatedColumnar.cpp
Implements per-reader payload construction, calls fn_get_columnar_reader with buffer/GC guards, and refines RegionError handling (epoch vs not-found) and lock resolution via shared context.
Slot-based materialization and runtime
dbms/src/Storages/StorageDisaggregatedColumnar.cpp, dbms/src/Storages/StorageDisaggregatedColumnar.h
Implements RNProxyReaderPlan/slots, RNProxyReadTask slot lifecycle, getOrCreateReader with backoff, prefetchReader, tryAcquireReaderWork(), and lazy RNProxyInputStream creation and acquisition.
Bucket-aware planning and integration
dbms/src/Storages/StorageDisaggregatedColumnar.cpp
During buildProxyReadTask, fetches region bucket keys, splits physical ranges by bucket boundaries, expands into multiple RNProxyReaderPlan entries when enabled, and creates a shared RNProxyReadTask.
Source IO loop and throughput metrics
dbms/src/Storages/StorageDisaggregatedColumnar.cpp
Rewrites RNProxyInputStream::readImpl and RNProxySourceOp::executeIOImpl to use lazy readers, prefetch next readers, handle drained streams, and log rows/sec and bytes/sec.
Header/API refactors for plan-based readers
dbms/src/Storages/StorageDisaggregatedColumnar.h
Refactors RNProxyReadTask constructor and public API, updates RNProxyInputStream::Options and RNProxySourceOp::Options, and changes runtime state types to support the new model.

DeltaMerge Pool Conditional Initialization

Layer / File(s) Summary
Columnar Mode Detection in Server Startup
dbms/src/Server/Server.cpp
Guards DeltaMerge read-thread pool initialization so it runs only when disagg_opt.use_columnar is false.

Sequence Diagram(s)

sequenceDiagram
  participant StorageDisaggregated as StorageDisaggregatedColumnar
  participant RNTask as RNProxyReadTask
  participant HubFFI as Hub ffi (fn_get_columnar_reader / fn_get_region_bucket_keys)
  participant PD as PD (Bucket lookup)
  StorageDisaggregated->>RNTask: buildProxyReadTask(reader_plans)
  RNTask->>HubFFI: fn_get_region_bucket_keys(region_id, region_ver)
  HubFFI->>PD: request bucket keys (PD API)
  PD-->>HubFFI: bucket keys
  RNTask->>HubFFI: fn_get_columnar_reader(serialized_payload)
  HubFFI-->>RNTask: ColumnarReader handle / error
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • pingcap/tiflash#10842: Adds cloud-helper FFI callbacks referenced by the disaggregated columnar proxy read path and start-ts-based shared snapshot cleanup.

Suggested reviewers

  • JaySon-Huang
  • JinheLin
  • Lloyd-Pottiger

Poem

🐰 I nibbled through buckets, keys in tow,
I cached the snaps where soft winds blow,
Readers wake only when they're due,
One fetch shared — then many flew,
Hooray — the columnar streams now flow!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 5.56% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'columnar: support bucket parallel read in region' is specific and directly related to the main objective of enabling bucket-level concurrent reads within regions.
Description check ✅ Passed The PR description follows the required template structure with issue number, problem summary, and what-changed section. However, the commit-message section is empty and only manual testing is marked despite testing being required.
Linked Issues check ✅ Passed The PR addresses objectives from #10844 (columnar storage data source) by implementing bucket-level parallel reads, lazy reader creation, and shared snapshot access for concurrent region reads as required.
Out of Scope Changes check ✅ Passed Code changes are primarily scoped to FFI interfaces, region bucket key handling, shared snapshot caching, lazy reader materialization, and stream dispatching optimization—all aligned with stated PR objectives.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ti-chi-bot ti-chi-bot Bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label May 26, 2026
Signed-off-by: yongman <yming0221@gmail.com>
yongman added 8 commits June 2, 2026 22:27
Signed-off-by: yongman <yming0221@gmail.com>
Signed-off-by: yongman <yming0221@gmail.com>
Signed-off-by: yongman <yming0221@gmail.com>
Signed-off-by: yongman <yming0221@gmail.com>
Signed-off-by: yongman <yming0221@gmail.com>
Signed-off-by: yongman <yming0221@gmail.com>
Signed-off-by: yongman <yming0221@gmail.com>
Signed-off-by: yongman <yming0221@gmail.com>
@yongman yongman marked this pull request as ready for review June 9, 2026 03:48
@ti-chi-bot ti-chi-bot Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 9, 2026
@pantheon-ai

pantheon-ai Bot commented Jun 9, 2026

Copy link
Copy Markdown

@yongman I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details.

⏳ This process typically takes 10-30 minutes depending on the complexity of the changes.

ℹ️ Learn more details on Pantheon AI.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
dbms/src/Storages/StorageDisaggregatedColumnar.cpp (2)

95-107: ⚡ Quick win

Log the original cleanup exception.

This broad catch (...) drops the exception details from clear_shared_snap_access_by_start_ts, so failures here become almost impossible to diagnose. tryLogCurrentException(log, "...") is the project-standard way to preserve that context in cleanup paths.

♻️ Suggested change
         catch (...)
         {
-            try
-            {
-                LOG_WARNING(log, "clear shared snapaccess cache failed, start_ts={}", start_ts);
-            }
-            catch (...)
-            {
-            }
+            tryLogCurrentException(log, fmt::format("clear shared snapaccess cache failed, start_ts={}", start_ts));
         }

As per coding guidelines, Use tryLogCurrentException(log, "context") in broad catch (...) paths to avoid duplicated exception-formatting code.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dbms/src/Storages/StorageDisaggregatedColumnar.cpp` around lines 95 - 107,
The catch-all after calling clear_shared_snap_access_by_start_ts(start_ts,
proxy_ptr) drops the original exception; replace the inner LOG_WARNING/empty
catch with a call to tryLogCurrentException(log, "clear shared snapaccess cache
failed, start_ts={}", start_ts) (or at minimum tryLogCurrentException(log,
"clear shared snapaccess cache failed, start_ts=" + toString(start_ts))) so the
original exception context is preserved; update the catch (...) block in
StorageDisaggregatedColumnar.cpp accordingly to call tryLogCurrentException(log,
...) instead of swallowing the exception.

Source: Coding guidelines


675-675: ⚡ Quick win

Use the fmt-style DB::Exception constructor here.

This new branch is still using the legacy (message, code) form while the surrounding paths already use the repository-standard Exception(ErrorCodes::..., "...") style.

♻️ Suggested change
-        throw Exception("lock error", ErrorCodes::COLUMNAR_SNAPSHOT_ERROR);
+        throw Exception(ErrorCodes::COLUMNAR_SNAPSHOT_ERROR, "lock error");

As per coding guidelines, Use DB::Exception for error handling with the fmt-style constructor: throw Exception(ErrorCodes::SOME_CODE, "Message with {}", arg);.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dbms/src/Storages/StorageDisaggregatedColumnar.cpp` at line 675, Replace the
legacy throw Exception("lock error", ErrorCodes::COLUMNAR_SNAPSHOT_ERROR) usage
with the repository-standard fmt-style constructor: throw
Exception(ErrorCodes::COLUMNAR_SNAPSHOT_ERROR, "lock error"); update the throw
site in StorageDisaggregatedColumnar (the throw of Exception referencing
ErrorCodes::COLUMNAR_SNAPSHOT_ERROR) so it uses the ErrorCodes-first signature
and follow the same fmt-style pattern as surrounding paths.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@contrib/tiflash-columnar-hub/hub-runtime/src/cloud_helper.rs`:
- Around line 974-981: The group is being recreated by insert/get_loader after
remove_by_start_ts tears it down, so make evictions sticky: add a
tombstone/generation field to SharedSnapAccessGroup (e.g., evicted: AtomicBool
or generation: AtomicU64) and update remove_by_start_ts to mark the group's
tombstone/generation before removing it from groups; then change group
creation/lookup logic in insert and get_loader to consult that
tombstone/generation (if evicted or generation mismatches, refuse to recreate or
insert into the group and return an error/ignore) so any in-flight loader
holding the group's lock cannot resurrect it after clear (also ensure
request_snapshot_from_leader paths check the group's generation/tombstone before
calling insert).

In `@dbms/src/Storages/StorageDisaggregatedColumnar.cpp`:
- Around line 1036-1076: The block-stream path is currently allowed to create
one RNProxyInputStream per bucket unit because planned_reader_num (derived from
total_max_reader_num after splitRangesByBucketKeys) can exceed num_streams;
modify the logic so getInputStreams() does not instantiate more streams than
num_streams: compute an effective_reader_num = min(planned_reader_num,
num_streams) (or cap by num_streams where planned_reader_num is used) and change
distribution so region_reader_plans and their bucket_units are assigned/shared
among those effective_reader_num logical readers (i.e., allow a single
RNProxyInputStream to process multiple bucket units/region plans) instead of
creating a stream per bucket unit; ensure enable_bucket_parallel,
planned_reader_num, total_max_reader_num and splitRangesByBucketKeys usages stay
but the instantiation point that creates RNProxyInputStream honors the cap and
uses a shared/task-pool style assignment.

---

Nitpick comments:
In `@dbms/src/Storages/StorageDisaggregatedColumnar.cpp`:
- Around line 95-107: The catch-all after calling
clear_shared_snap_access_by_start_ts(start_ts, proxy_ptr) drops the original
exception; replace the inner LOG_WARNING/empty catch with a call to
tryLogCurrentException(log, "clear shared snapaccess cache failed, start_ts={}",
start_ts) (or at minimum tryLogCurrentException(log, "clear shared snapaccess
cache failed, start_ts=" + toString(start_ts))) so the original exception
context is preserved; update the catch (...) block in
StorageDisaggregatedColumnar.cpp accordingly to call tryLogCurrentException(log,
...) instead of swallowing the exception.
- Line 675: Replace the legacy throw Exception("lock error",
ErrorCodes::COLUMNAR_SNAPSHOT_ERROR) usage with the repository-standard
fmt-style constructor: throw Exception(ErrorCodes::COLUMNAR_SNAPSHOT_ERROR,
"lock error"); update the throw site in StorageDisaggregatedColumnar (the throw
of Exception referencing ErrorCodes::COLUMNAR_SNAPSHOT_ERROR) so it uses the
ErrorCodes-first signature and follow the same fmt-style pattern as surrounding
paths.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 44b8edea-875b-4b99-9507-46688bf65dd6

📥 Commits

Reviewing files that changed from the base of the PR and between 88bb9d7 and d0564d7.

📒 Files selected for processing (8)
  • contrib/tiflash-columnar-hub/hub-runtime/ffi/src/RaftStoreProxyFFI/ProxyFFI.h
  • contrib/tiflash-columnar-hub/hub-runtime/src/cloud_helper.rs
  • contrib/tiflash-columnar-hub/hub-runtime/src/columnar_impls.rs
  • contrib/tiflash-columnar-hub/hub-runtime/src/interfaces.rs
  • contrib/tiflash-columnar-hub/hub-runtime/src/run.rs
  • dbms/src/Server/Server.cpp
  • dbms/src/Storages/StorageDisaggregatedColumnar.cpp
  • dbms/src/Storages/StorageDisaggregatedColumnar.h

Comment thread contrib/tiflash-columnar-hub/hub-runtime/src/cloud_helper.rs
Comment thread dbms/src/Storages/StorageDisaggregatedColumnar.cpp
Signed-off-by: yongman <yming0221@gmail.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
dbms/src/Storages/StorageDisaggregatedColumnar.cpp (1)

471-485: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Handle the empty-plan path before reading the first stream header.

Line 1136 now allows buildProxyReadTask() to return no tasks, but the block-stream path still assumes at least one source exists and unconditionally calls pipeline.firstStream() on Line 480. An empty-range scan will therefore crash here instead of returning an empty result.

Also applies to: 1136-1142

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dbms/src/Storages/StorageDisaggregatedColumnar.cpp` around lines 471 - 485,
The code assumes pipeline.firstStream() exists even when
buildProxyReadTask()/read_proxy_tasks produced no streams; detect the empty-plan
path by checking if pipeline.streams (or read_proxy_tasks) is empty before
calling pipeline.firstStream(), and return or set analyzer appropriately to
avoid dereferencing a non-existent stream. Specifically, after populating
pipeline.streams (and after executeGeneratedColumnPlaceholder), if
pipeline.streams.empty() then bypass constructing DAGExpressionAnalyzer from
pipeline.firstStream() (e.g., create an empty NamesAndTypes for analyzer or
leave analyzer null and ensure callers handle empty results) so functions like
DAGExpressionAnalyzer construction and further processing do not crash when
there are no proxy read tasks.
🧹 Nitpick comments (1)
dbms/src/Storages/StorageDisaggregatedColumnar.cpp (1)

668-680: ⚡ Quick win

Use the repo-standard DB::Exception constructor here.

These throws reverse the expected (error_code, format, ...) order. Please switch them to the fmt-style form so they do not depend on a legacy overload.

Suggested cleanup
-        throw Exception("lock error", ErrorCodes::COLUMNAR_SNAPSHOT_ERROR);
+        throw Exception(ErrorCodes::COLUMNAR_SNAPSHOT_ERROR, "lock error");
...
-            throw Exception("read_block failed in tiflash-proxy", ErrorCodes::LOGICAL_ERROR);
+            throw Exception(ErrorCodes::LOGICAL_ERROR, "read_block failed in tiflash-proxy");

As per coding guidelines, **/*.cpp: Use DB::Exception for error handling with the fmt-style constructor: throw Exception(ErrorCodes::SOME_CODE, "Message with {}", arg);

Also applies to: 1250-1254

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dbms/src/Storages/StorageDisaggregatedColumnar.cpp` around lines 668 - 680,
Replace legacy Exception throws that pass message then code with the
repo-standard DB::Exception fmt-style constructor; specifically in
StorageDisaggregatedColumnar.cpp where the LockedError handling uses throw
Exception("lock error", ErrorCodes::COLUMNAR_SNAPSHOT_ERROR) (the block that
parses lock_info, calls cluster->lock_resolver->resolveLocks and logs
before_expired) change it to throw
Exception(ErrorCodes::COLUMNAR_SNAPSHOT_ERROR, "lock error") and apply the same
conversion for the similar throw in the other location referenced around the
1250–1254 area so all Exception instantiations use the (ErrorCodes::..., "fmt
{}", args...) form.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@dbms/src/Storages/StorageDisaggregatedColumnar.cpp`:
- Around line 471-485: The code assumes pipeline.firstStream() exists even when
buildProxyReadTask()/read_proxy_tasks produced no streams; detect the empty-plan
path by checking if pipeline.streams (or read_proxy_tasks) is empty before
calling pipeline.firstStream(), and return or set analyzer appropriately to
avoid dereferencing a non-existent stream. Specifically, after populating
pipeline.streams (and after executeGeneratedColumnPlaceholder), if
pipeline.streams.empty() then bypass constructing DAGExpressionAnalyzer from
pipeline.firstStream() (e.g., create an empty NamesAndTypes for analyzer or
leave analyzer null and ensure callers handle empty results) so functions like
DAGExpressionAnalyzer construction and further processing do not crash when
there are no proxy read tasks.

---

Nitpick comments:
In `@dbms/src/Storages/StorageDisaggregatedColumnar.cpp`:
- Around line 668-680: Replace legacy Exception throws that pass message then
code with the repo-standard DB::Exception fmt-style constructor; specifically in
StorageDisaggregatedColumnar.cpp where the LockedError handling uses throw
Exception("lock error", ErrorCodes::COLUMNAR_SNAPSHOT_ERROR) (the block that
parses lock_info, calls cluster->lock_resolver->resolveLocks and logs
before_expired) change it to throw
Exception(ErrorCodes::COLUMNAR_SNAPSHOT_ERROR, "lock error") and apply the same
conversion for the similar throw in the other location referenced around the
1250–1254 area so all Exception instantiations use the (ErrorCodes::..., "fmt
{}", args...) form.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 3573fc15-b3ea-48c0-b058-892e56d64f1d

📥 Commits

Reviewing files that changed from the base of the PR and between d0564d7 and 7b19a1c.

📒 Files selected for processing (2)
  • dbms/src/Storages/StorageDisaggregatedColumnar.cpp
  • dbms/src/Storages/StorageDisaggregatedColumnar.h
🚧 Files skipped from review as they are similar to previous changes (1)
  • dbms/src/Storages/StorageDisaggregatedColumnar.h

Signed-off-by: yongman <yming0221@gmail.com>
@yongman

yongman commented Jun 9, 2026

Copy link
Copy Markdown
Member Author

/hold

@ti-chi-bot ti-chi-bot Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 9, 2026
yongman added 3 commits June 9, 2026 17:30
Signed-off-by: yongman <yming0221@gmail.com>
Signed-off-by: yongman <yming0221@gmail.com>
@JaySon-Huang

Copy link
Copy Markdown
Contributor

/test pull-integration-next-gen-columnar

1 similar comment
@yongman

yongman commented Jun 10, 2026

Copy link
Copy Markdown
Member Author

/test pull-integration-next-gen-columnar

@JaySon-Huang

Copy link
Copy Markdown
Contributor

Bug: EPOCH_NOT_MATCH Infinite Retry on Region 125 in Disaggregated Columnar Read

Summary

When running fullstack-test2/clustered_index/ddl.test against a next-gen columnar cluster, the test fails at line 79 with Region error EPOCH_NOT_MATCH(125:68:1). The TiFlash compute node repeatedly retries with a stale region epoch (shard_ver=68) for ~60 seconds, never updating to the current epoch (version=69), causing the query to time out.

Test Case

  • Test file: tests/fullstack-test2/clustered_index/ddl.test
  • Failing line: 79 — set session tidb_isolation_read_engines='tiflash'; select * from test.t_3;
  • Cluster: next-gen columnar (tiflash-cn0 only, no tiflash-wn0)
  • Branch: bucket-read

Test Flow (simplified)

-- t_1 (int handle clustered): PASSED
create table test.t_1(a int primary key clustered, col int);
insert into test.t_1 values(1,2),(2,3);
alter table test.t_1 set tiflash replica 1;
-- wait_table → select → alter column → select → drop

-- t_2 (common handle clustered): PASSED
create table test.t_2(a varchar(10), b int, c int, primary key(a, b) clustered);
insert into test.t_2 values('1',2,3),('2',3,4);
alter table test.t_2 set tiflash replica 1;
-- wait_table → select → alter column → select → drop

-- t_3 (composite PRIMARY KEY clustered): FAILED
create table test.t_3 (A int, B varchar(20), C int, D int, PRIMARY KEY(A,C) CLUSTERED);
insert into test.t_3 values (1,'1',1,1),(2,'2',2,2);
alter table test.t_3 set tiflash replica 1;
-- wait_table → select → ERROR!

Error

ERROR 1105 (HY000) at line 1: other error for mpp stream:
Poco::Exception. Code: 1000, e.code() = 0,
e.displayText() = Exception: Region error EPOCH_NOT_MATCH(125:68:1),
e.what() = Exception

Log Analysis

All log files are under tests/fullstack-test-next-gen-columnar/log/tiflash-cn0/.

Timeline

Timestamp Event Source
12:36:27.749 t_2 query: buildProxyReadTask, physical_table_id=14, region_ver_id={125,1,68} tiflash.log:362
12:36:27.750 Proxy fetches snapshot: shard_ver=68, start_key=...0E, end_key=...0FSUCCESS tiflash_tikv.log:42
12:36:27.860 t_2 query (second): cache_hit: trueSUCCESS tiflash_tikv.log:58
12:36:27.932 drop table test.t_1 — triggers region split test output
12:36:27.998 drop table test.t_2 — region 125 epoch bumps 68→69 test output
12:36:30.168 t_3 query: buildProxyReadTask, physical_table_id=23, region_ver_id={125,1,68} ← STALE! tiflash.log:560
12:36:30.169 Proxy request: shard_ver=68EPOCH_NOT_MATCH (current version=69) tiflash_tikv.log:66
12:36:30.169 try drop region, region_ver_id={125,1,68} tiflash.log:581
12:36:30.169 drop region because of send failure tiflash.log:582
12:36:30.169 ~ 12:37:30 Infinite retry loop: same shard_ver=68, same EPOCH_NOT_MATCH, repeated ~40+ times tiflash_error.log
12:37:30 Query times out after ~60s test output

Region 125 — Before vs After Split

Before (epoch=68, t_2 exists):

start_key: 78FFFFFE74800000000000000E   → table_id=14 (t_2)
end_key:   78FFFFFE74800000000000000F
columnar_table_ids: 14

After (epoch=69, t_2 dropped, t_3 created):

start_key: 78FFFFFE74800000FF0000000017000000FC  → table_id=23 (t_3)
end_key:   78FFFFFE748000FFFFFFFFFFFFC1000000FC
region_epoch { conf_ver: 1 version: 69 }

Key changes:

  1. Epoch bumped from 68 to 69 — region split after dropping t_1/t_2
  2. Key encoding changed78FFFFFE7480...0E vs 78FFFFFE74800000FF...17 — the FF indicates a different key encoding for the new table's clustered index
  3. Key range completely different — from table 14's range to table 23's range

Root Cause

The bug is a missing epoch refresh in the EPOCH_NOT_MATCH handling path, spanning two layers:

Architecture

C++ (StorageDisaggregatedColumnar.cpp)     Rust (cloud_helper.rs in hub-runtime)
─────────────────────────────────────     ─────────────────────────────────────
                                          ┌─────────────────────────────────┐
buildProxyReadTask()                      │ request_snapshot_from_leader()  │
  ↓ resolve key ranges from region_cache  │   while backoff.next_delay():   │
  ↓ captures region_ver=68                │     HTTP GET tikv0/.../125?     │
  ↓                                      │       shard_ver=68&start_key=.. │
reader_plans = [{125, 68, 1}, ...]       │                                  │
  ↓                                      │     if not_leader:               │
createColumnarReaderWithBackoff()        │       evict + retry ✅           │
  while(true):                           │                                  │
    ↓                                    │     if epoch_not_match:          │
    createProxyColumnarReader(plan)      │       return Err(...) ❌ ← BUG   │
      ↓ FFI with region_ver=68           │       (never retries!)           │
      ↓                                  │                                  │
      ← EPOCH_NOT_MATCH                  └─────────────────────────────────┘
      ← dropRegion() from cache
      ← throw RegionException
      ↓
    catch → backoff → RETRY with SAME plan (still ver=68!)

Two Contributing Issues

Issue 1 (C++ layer): reader_plan is immutable in the retry loop

StorageDisaggregatedColumnar.cpp:763-792createColumnarReaderWithBackoff:

ColumnarReaderPtr RNProxyReadTask::createColumnarReaderWithBackoff(size_t reader_index) const
{
    const auto & reader_plan = reader_plans[reader_index];  // const ref — NEVER updated!
    pingcap::kv::Backoffer bo(pingcap::kv::copNextMaxBackoff);
    while (true)
    {
        try
        {
            return createProxyColumnarReader(*shared_reader_context, reader_plan);
            //                                  ^^^^^^^^^^^^^^ always the same {125, 68, 1}
        }
        catch (RegionException & e)
        {
            // Only backoff — reader_plan is NOT updated with the new epoch
            bo.backoff(pingcap::kv::boRegionMiss, pingcap::Exception(e.message(), e.code()));
        }
    }
}

The reader_plan is a const reference captured once before the loop. Every retry uses the same stale epoch.

Issue 2 (Rust layer — PRIMARY): EPOCH_NOT_MATCH is not retried

hub-runtime/src/cloud_helper.rs:715-722request_snapshot_from_leader:

// NOT_LEADER: properly retried ✅
if delegate_resp.get_region_error().has_not_leader() {
    error!("{} request_snapshot_from_leader failed, not leader", tag);
    pd_client.evict_region_cache(shard_id);
    leader_changed = true;
    last_err = Some(Error::RegionError(delegate_resp.take_region_error()));
    tokio::time::sleep(next_delay).await;
    continue;  // ← retries!
}

// EPOCH_NOT_MATCH: immediately returned ❌
if delegate_resp.get_region_error().has_epoch_not_match() {
    // Return epoch not match error to TiDB to retry.  ← comment is wrong; caller is TiFlash CN
    error!("{} request_snapshot_from_leader failed, epoch not match, {:?}", ...);
    return Err(Error::RegionError(delegate_resp.take_region_error()));  // ← never retries!
}

Both not_leader and epoch_not_match are transient region errors that could be resolved by evicting the stale cache and retrying. Yet not_leader retries inside the same backoff loop, while epoch_not_match immediately gives up and returns the error to C++.

Why Previous Tables Passed

Tables t_1 and t_2 worked because:

  • The region cache was fresh (no region split had occurred yet)
  • shard_ver=68 was still current when their queries ran
  • Their queries completed before the drop table triggered the region split

Table t_3 failed because:

  • The region split (epoch 68→69) happened between t_2's last query and t_3's first query
  • buildProxyReadTask resolved t_3's key ranges and got the stale region info (ver=68) from cache
  • The Rust layer refused to retry on EPOCH_NOT_MATCH

The Infinite Loop

C++ retry loop           Rust request_snapshot_from_leader      TiKV
──────────────           ─────────────────────────────────      ────
build plan {125,68,1}
    ↓
create reader(68)  ──→  GET /snapshot/125?shard_ver=68  ──→  EPOCH_NOT_MATCH(69)
                         return Err(RegionError)          ←──  epoch {version:69}
    ← RegionException
dropRegion from cache
backoff & retry
    ↓
create reader(68)  ──→  GET /snapshot/125?shard_ver=68  ──→  EPOCH_NOT_MATCH(69)
    ↓                        ↓                                  ↓
  (repeats ~40+ times for ~60 seconds until timeout)

Recommended Fix

Fix in Rust layer (hub-runtime/src/cloud_helper.rs:715-722) — update shard_ver from the error response and retry, matching the not_leader pattern:

if delegate_resp.get_region_error().has_epoch_not_match() {
    error!(
        "{} request_snapshot_from_leader failed, epoch not match, {:?}",
        tag,
        delegate_resp.get_region_error()
    );
    // Extract new shard_ver from the error and retry (same pattern as not_leader)
    let epoch_not_match = delegate_resp.get_region_error().get_epoch_not_match();
    if !epoch_not_match.get_current_regions().is_empty() {
        shard_ver = epoch_not_match.get_current_regions()[0]
            .get_region_epoch()
            .get_version();
    }
    pd_client.evict_region_cache(shard_id);
    last_err = Some(Error::RegionError(delegate_resp.take_region_error()));
    tokio::time::sleep(next_delay).await;
    continue;  // retry with updated shard_ver
}

Why fix in Rust rather than C++:

  1. Precedent already existsnot_leader is handled this way in the same function
  2. All needed info is available — the new shard_ver is in the error response
  3. Follows TiKV client conventions — client libraries handle epoch refresh internally
  4. Minimal change — only ~5 lines, no C++ API changes needed
  5. Transparent to callers — C++ retry loop works correctly without changes

Affected Files

File Role
dbms/src/Storages/StorageDisaggregatedColumnar.cpp:763-792 C++ retry loop — never updates reader_plan
dbms/src/Storages/StorageDisaggregatedColumnar.cpp:605-634 C++ error handling — drops cache but doesn't update plan
dbms/src/Storages/Columnar/RNProxyReaderPlan.h:30-36 Plan struct — region_ver is a value type
contrib/tiflash-columnar-hub/hub-runtime/src/cloud_helper.rs:715-722 ★ PRIMARY BUG — EPOCH_NOT_MATCH not retried
contrib/tiflash-columnar-hub/hub-runtime/src/cloud_helper.rs:707-713 Reference — not_leader is retried correctly

Log Files

  • tests/fullstack-test-next-gen-columnar/log/tiflash-cn0/tiflash.log — C++ side (plan building, retry loop)
  • tests/fullstack-test-next-gen-columnar/log/tiflash-cn0/tiflash_error.log — C++ warnings (repeated EPOCH_NOT_MATCH)
  • tests/fullstack-test-next-gen-columnar/log/tiflash-cn0/tiflash_tikv.log — Rust proxy side (snapshot requests, epoch details)

How Bucket-Read Branch Introduced This Bug

Key Commits

Commit Author Date Description
61d37c1750 Ray Yan May 22 *: refactor proxy to hub lib for columnar — introduced cloud_helper.rs and EPOCH_NOT_MATCH non-retry (on both master and bucket-read)
d20fc65bb4 yongman Jun 2 snapaccess cache — wrapped request_snapshot_from_leader with get_or_request_shared_snapshot caching layer
d0564d766b yongman Jun 9 optimize snapaccess cache for buckets read — refactored cache to DashMap groups by start_ts; added clear_shared_snap_access_by_start_ts FFI
82f11a7bc9 yongman Jun 9 clear snap access in last owner drop — changed cache clearing from eager (every context destruction) to lazy (only last owner)
de6a182f67 yongman Jun 9 avoid inflight request leak snapaccess — added terminal state to prevent inflight loader leaks after cache clear

Latent Bug vs. Exposure

The EPOCH_NOT_MATCH non-retry in request_snapshot_from_leader has existed since 61d37c1750 (which is on both master and bucket-read):

// hub-runtime/src/cloud_helper.rs:715-722
// Same on master AND bucket-read:
if delegate_resp.get_region_error().has_epoch_not_match() {
    // Return epoch not match error to TiDB to retry.  ← comment is wrong
    return Err(Error::RegionError(delegate_resp.take_region_error()));  // never retries
}

Compare with not_leader handling in the same backoff loop:

// NotLeader is properly retried:
if delegate_resp.get_region_error().has_not_leader() {
    pd_client.evict_region_cache(shard_id);
    leader_changed = true;
    tokio::time::sleep(next_delay).await;
    continue;  // retries ✅
}

Why master doesn't reproduce:

  1. No caching wrapper — master calls request_snapshot_from_leader directly. Bucket-read interposes get_or_request_shared_snapshot which adds loader mutex serialization and SnapAccess caching by (shard_id, shard_ver, start_ts, start_table_id, end_table_id).

  2. Different timing characteristics — the caching layer changes request serialization patterns. Combined with lazy cache clearing (82f11a7bc9), stale region cache entries survive longer, making the race window between region split and query execution wider.

  3. Test path not exercised — the fullstack-test-next-gen-columnar test infrastructure may not have been actively run against upstream/master.

What Master and Bucket-Read Share

Component Master Bucket-Read Same?
request_snapshot_from_leader EPOCH_NOT_MATCH handling return Err return Err ✅ Identical
C++ createColumnarReaderWithBackoff retry loop const reader_plan, same stall epoch const reader_plan, same stale epoch ✅ Identical
C++ createProxyColumnarReader epoch handling drops region cache, throws drops region cache, throws ✅ Identical

What Bucket-Read Added

Component Master Bucket-Read
make_columnar_reader → snapshot request Direct call to request_snapshot_from_leader Via get_or_request_shared_snapshot caching wrapper
SnapAccess caching None SharedSnapAccessCache by (shard_id, shard_ver, start_ts, start_table_id, end_table_id)
Cache clearing N/A Lazy (only on last owner drop via StartTsClearRegistry)
Loader serialization None Per-key Mutex via get_loader()

Conclusion

The root cause (EPOCH_NOT_MATCH non-retry in request_snapshot_from_leader) was introduced in 61d37c1750, which is on both branches. The bucket-read exposure was caused by d20fc65bb4 and d0564d766b which added the get_or_request_shared_snapshot caching layer, changing the timing characteristics enough to turn a latent race into a reproducible failure.


Reproduction

cd tests/fullstack-test-next-gen-columnar
./compose.sh up -d
# wait for cluster ready
./compose.sh exec -T tiflash-cn0 bash -c \
  'cd /tests && ENABLE_NEXT_GEN=true verbose=true ./run-test.sh fullstack-test2/clustered_index/ddl.test'

Signed-off-by: yongman <yming0221@gmail.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
contrib/tiflash-columnar-hub/hub-runtime/src/cloud_helper.rs (1)

715-723: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Retry epoch_not_match in the hub instead of returning it.

This still matches the failure described in the PR discussion: after a split, returning here can leave the caller recreating readers with the same stale shard_ver until timeout. Handle epoch_not_match like not_leader: evict the region cache, refresh the retry epoch from the error payload, back off, and continue inside request_snapshot_from_leader rather than surfacing the stale plan upstream.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@contrib/tiflash-columnar-hub/hub-runtime/src/cloud_helper.rs` around lines
715 - 723, The code in request_snapshot_from_leader currently treats
delegate_resp.get_region_error().has_epoch_not_match() the same as other fatal
errors and returns Error::RegionError, which leaves callers with stale
shard_ver; instead handle epoch_not_match like the not_leader path: evict the
region from the local region cache, extract and refresh the retry epoch/version
from delegate_resp.get_region_error() (use the epoch info in the RegionError
payload), apply a backoff, and loop to retry the request rather than returning
Error::RegionError. Locate the epoch_not_match branch in
request_snapshot_from_leader and modify it to call the same cache-eviction and
retry logic used for not_leader (preserving logging) and continue the retry loop
with the updated epoch/version.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@contrib/tiflash-columnar-hub/hub-runtime/src/cloud_helper.rs`:
- Around line 715-723: The code in request_snapshot_from_leader currently treats
delegate_resp.get_region_error().has_epoch_not_match() the same as other fatal
errors and returns Error::RegionError, which leaves callers with stale
shard_ver; instead handle epoch_not_match like the not_leader path: evict the
region from the local region cache, extract and refresh the retry epoch/version
from delegate_resp.get_region_error() (use the epoch info in the RegionError
payload), apply a backoff, and loop to retry the request rather than returning
Error::RegionError. Locate the epoch_not_match branch in
request_snapshot_from_leader and modify it to call the same cache-eviction and
retry logic used for not_leader (preserving logging) and continue the retry loop
with the updated epoch/version.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: d1a3ab49-819d-4b19-8b74-58f434ca7229

📥 Commits

Reviewing files that changed from the base of the PR and between ac3da5b and 03d1d95.

📒 Files selected for processing (3)
  • contrib/tiflash-columnar-hub/hub-runtime/src/cloud_helper.rs
  • dbms/src/Storages/StorageDisaggregatedColumnar.cpp
  • dbms/src/Storages/StorageDisaggregatedColumnar.h
🚧 Files skipped from review as they are similar to previous changes (1)
  • dbms/src/Storages/StorageDisaggregatedColumnar.cpp

Signed-off-by: yongman <yming0221@gmail.com>
@JaySon-Huang

Copy link
Copy Markdown
Contributor

/retest

@JaySon-Huang

Copy link
Copy Markdown
Contributor

/test pull-unit-next-gen

Comment thread dbms/src/Storages/StorageDisaggregatedColumnar.cpp Outdated
Comment thread contrib/tiflash-columnar-hub/hub-runtime/src/cloud_helper.rs Outdated
Comment thread dbms/src/Storages/StorageDisaggregatedColumnar.cpp Outdated
yongman added 2 commits June 11, 2026 09:26
Signed-off-by: yongman <yming0221@gmail.com>
Signed-off-by: yongman <yming0221@gmail.com>
@yongman

yongman commented Jun 11, 2026

Copy link
Copy Markdown
Member Author

/retest

@ti-chi-bot ti-chi-bot Bot added approved needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Jun 11, 2026
@yongman

yongman commented Jun 11, 2026

Copy link
Copy Markdown
Member Author

/retest-required

Signed-off-by: JaySon-Huang <tshent@qq.com>
@ti-chi-bot

ti-chi-bot Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JaySon-Huang, Lloyd-Pottiger

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [JaySon-Huang,Lloyd-Pottiger]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Jun 11, 2026
@ti-chi-bot

ti-chi-bot Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

[LGTM Timeline notifier]

Timeline:

  • 2026-06-11 05:27:19.235102743 +0000 UTC m=+1024140.305420123: ☑️ agreed by Lloyd-Pottiger.
  • 2026-06-11 10:08:06.174158809 +0000 UTC m=+1040987.244476199: ☑️ agreed by JaySon-Huang.

@yongman

yongman commented Jun 11, 2026

Copy link
Copy Markdown
Member Author

/retest

@ti-chi-bot ti-chi-bot Bot merged commit 50e6dea into pingcap:master Jun 11, 2026
11 checks passed
@JaySon-Huang

Copy link
Copy Markdown
Contributor

/cherry-pick release-nextgen-202603

@ti-chi-bot

Copy link
Copy Markdown
Member

@JaySon-Huang: new pull request created to branch release-nextgen-202603: #10899.

Details

In response to this:

/cherry-pick release-nextgen-202603

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

ti-chi-bot Bot pushed a commit that referenced this pull request Jun 11, 2026
close #10844\n\nSigned-off-by: yongman <yming0221@gmail.com>\nSigned-off-by: JaySon-Huang <tshent@qq.com>\n\nCo-authored-by: yongman <yming0221@gmail.com>\nCo-authored-by: JaySon-Huang <tshent@qq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved lgtm release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support new columnar storage as data source

4 participants