fix(logtail): stream-publish txn tails to remove batch-wide wait (#24325)#24383
ck89119 wants to merge 3 commits into
Conversation
…rixorigin#24325) (matrixorigin#24326)

### Background

PR matrixorigin#22475 refactored `Manager.onTxnLogTails` in `pkg/vm/engine/tae/logtail/mgr.go` so that a batch of txns is collected in parallel and then published only after a batch-wide `collectWg.Wait()`:

```go
for i, item := range items {
	mgr.collectWg.Add(1)
	mgr.collectPool.Submit(func() {
		defer mgr.collectWg.Done()
		txn.GetStore().WaitEvent(txnif.WalPreparing)
		...
		state := txn.GetTxnState(true)
		mgr.orderedList[i] = txnTail
	})
}
mgr.collectWg.Wait() // ← blocks on slowest txn
for i := range items {
	mgr.generateLogtailWithTxn(...) // serial push AFTER all ready
}
```

The batch size is 100. Under high-concurrency short-txn workloads (e.g. sysbench oltp_delete, t=32), the batch regularly fills up, and a single slow txn defers the logtail publish for all the others. CN observes this through `txnOperator.unlock → timestampWaiter.GetTimestamp` waiting on `NotifyLatestCommitTS`, which under RC isolation gates commit return and inflates per-txn latency. This causes issue matrixorigin#24325: sysbench oltp_delete t=32 standalone TPS drops ~28% at exactly the commit that merged matrixorigin#22475, and ~17% on main HEAD vs 3.0-dev (which does not carry matrixorigin#22475).

### Fix

Replace the batch-wide wait with per-slot buffered channels plus a serial publisher that reads slots 0, 1, 2, ... in order:

- each submit goroutine signals its own `chan *txnWithLogtails` on completion
- the publisher loops through the slots in index order, doing `<-ch` per slot
- a slow txn blocks the publisher only up to its own slot; it delays neither the collection of later txns nor the publishing of earlier, already-ready txns

All other matrixorigin#22475 changes (the single `logtailQueue`, `collectPool`, and the `WaitEvent(WalPreparing)` event) are preserved. The monotonicity of `previousSaveTS` inside `generateLogtailWithTxn` is also preserved, since publishing still happens in index (PrepareTS) order.
### Benchmark

`sysbench oltp_delete.lua` on standalone MO, Apple Silicon arm64, GOMAXPROCS=10, 10 tables × 1M rows, threads=32, time=60s, db-ps-mode=disable, skip_trx=on; 3 independent runs (fresh prepare + cleanup + mo-service restart per run), mean over the 30–60s steady window:

| Version | Steady TPS | timestampWaiter per-txn |
|----------------------------------------|-----------:|------------------------:|
| main before this PR | 7970 | 585 μs |
| **main + this fix** | **9402** | **334 μs** |
| 3.0-dev (no matrixorigin#22475) | 9326 | 127 μs |
| 8b3700a (commit before matrixorigin#22475) | 10487 | 123 μs |

TPS recovers **+17.9% over current main** and surpasses 3.0-dev. The remaining timestampWaiter gap vs 3.0-dev (334 μs vs 127 μs) is secondary and does not materially affect TPS at t=32; it can be pursued separately if needed.

### Ablation (confirming the hotspot is exactly this function)

Each candidate fix was applied in isolation on main HEAD, 3-run mean:

| Candidate fix | Steady TPS | tw per-txn |
|---------------------------------------------------------|-----------:|-----------:|
| main baseline | 7970 | 585 μs |
| parallelize `LogEntryWriter.Finish()` marshal | 8283 | 592 μs |
| synchronous marshal in `cmdmgr.ApplyTxnRecord` | 8280 | 599 μs |
| **this PR (stream-publish in `onTxnLogTails`)** | **9402** | **334 μs** |

Only touching `onTxnLogTails` produces the recovery; the marshal-path rewrites do not, which rules out marshal deferral itself as the regression source.

Approved by: @XuPeng-SH
@Mergifyio refresh
✅ Pull request refreshed
Merge Queue Status
This pull request spent 1 hour 3 minutes 34 seconds in the queue, with no time running CI.
Reason: The merge conditions cannot be satisfied due to failing checks.
Hint: You may have to fix your CI before adding the pull request to the queue again.
What type of PR is this?
Which issue(s) this PR fixes:
issue #24325
What this PR does / why we need it:
Cherry-pick #24326 to 4.0-dev.