fix(lockservice): fence txns on bind changes (#24363) by LeftHandCold · Pull Request #24379 · matrixorigin/matrixone

LeftHandCold · 2026-05-13T09:30:25Z

What type of PR is this?

Which issue(s) this PR fixes:

What this PR does / why we need it:

This adds lockservice-side fencing for transactions that have touched a stale lock table bind.

When a lock table bind changes, the service now marks active txns that have already tracked the old bind as bindChanged. Subsequent local lock, remote lock, and forwarded lock attempts for that txn fail with ErrLockTableBindChanged, so the caller rolls back instead of continuing to execute with stale lock ownership.

The change also handles bind changes returned by KeepRemoteLock heartbeats, so old-bind transactions can be fenced without waiting for the next business lock RPC. In-flight remote lock RPCs re-check bindChanged after the RPC returns before recording lock state, closing the race where a txn could be fenced while an RPC was still in flight.

To avoid broad behavior changes, this is scoped to txns that have tracked the changed table bind. It does not change autocommit retry policy in lockop; that remains handled separately by #24354.

Validation:

go test ./pkg/lockservice -run 'Test(LockRemoteChecksBindChangedAfterRPC|BindChangedFencesActiveTxnHoldingOldBind|BindChangedFencesActiveTxnAfterOldTableRemoved|KeepRemoteLockBindChangedFencesActiveTxn|LockWithBindNotFound|Issue3288|Issue2128|LockWithBindIsStale|LockWithBindTimeout|ReLockSuccWithBindChanged|ReLockSuccWithLockTableBindChanged|Issue4007)' -count=1 -timeout=2m

Note: full go test ./pkg/lockservice -count=1 -timeout=20m was also tried locally but timed out in existing TestIssue3538; this PR does not modify that unrelated test.

@iamlinjunhong

This adds lockservice-side fencing for transactions that have touched a stale lock table bind. When a lock table bind changes, the service now marks active txns that have already tracked the old bind as `bindChanged`. Subsequent local lock, remote lock, and forwarded lock attempts for that txn fail with `ErrLockTableBindChanged`, so the caller rolls back instead of continuing to execute with stale lock ownership. The change also handles bind changes returned by `KeepRemoteLock` heartbeats, so old-bind transactions can be fenced without waiting for the next business lock RPC. In-flight remote lock RPCs re-check `bindChanged` after the RPC returns before recording lock state, closing the race where a txn could be fenced while an RPC was still in flight. To avoid broad behavior changes, this is scoped to txns that have tracked the changed table bind. It does not change autocommit retry policy in lockop; that remains handled separately by matrixorigin#24354. Validation: - `go test ./pkg/lockservice -run 'Test(LockRemoteChecksBindChangedAfterRPC|BindChangedFencesActiveTxnHoldingOldBind|BindChangedFencesActiveTxnAfterOldTableRemoved|KeepRemoteLockBindChangedFencesActiveTxn|LockWithBindNotFound|Issue3288|Issue2128|LockWithBindIsStale|LockWithBindTimeout|ReLockSuccWithBindChanged|ReLockSuccWithLockTableBindChanged|Issue4007)' -count=1 -timeout=2m` Note: full `go test ./pkg/lockservice -count=1 -timeout=20m` was also tried locally but timed out in existing `TestIssue3538`; this PR does not modify that unrelated test. Approved by: @iamlinjunhong

qodo-code-review · 2026-05-13T09:30:30Z

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

Copilot

Pull request overview

Adds lockservice-side fencing for transactions that have touched a stale lock table bind, so callers receive ErrLockTableBindChanged and roll back instead of continuing with stale lock ownership after a CN restart / allocator rebind. Targets the TPC-C ACID violation in #24346 by closing the in-flight RPC race and surfacing bind changes from KeepRemoteLock heartbeats.

Changes:

Track every bind a txn has touched (tableBindIntents) and add activeTxn.fenceByBindChanged, marking the txn bindChanged so subsequent local/remote/forward lock paths fail fast under a new bindChangeMu serializing handler vs. lock paths.
In remoteLockTable.lock, re-check txn.bindChanged (and txnID) after the RPC and after handleError/maybeHandleBindChanged, returning ErrLockTableBindChanged if the txn was fenced mid-flight.
In the keeper, fence active txns when KeepRemoteLock returns a bind change (via resp.NewBind or response error), with a new maybeHandleRemoteBindChanged that re-fetches the bind from the allocator and triggers handleBindChanged.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
pkg/lockservice/txn.go	Adds `tableBindIntents`, `bindChanged`, `lockTableBindTouched`, `fenceByBindChanged`; resets the new fields.
pkg/lockservice/txn_test.go	New unit test for `lockTableBindTouched`/`fenceByBindChanged` lifecycle.
pkg/lockservice/service.go	Adds `bindChangeMu`, fences active txns from `handleBindChanged`, new `bindChanged` checks/touch in `Lock`.
pkg/lockservice/service_remote.go	Same fence/touch wiring in `handleRemoteLock` and `handleForwardLock`.
pkg/lockservice/service_test.go	Updates retry loops to handle fence; adds `BindChangedFences*` tests.
pkg/lockservice/service_remote_test.go	Updates remote tests to expect `ErrLockTableBindChanged` and retry with fresh txn.
pkg/lockservice/lock_table_remote.go	Re-checks txn rollover / `bindChanged` after RPC and after `handleError`.
pkg/lockservice/lock_table_remote_test.go	New test exercising the post-RPC bindChanged check.
pkg/lockservice/lock_table_keeper.go	Fences via `KeepRemoteLock` `NewBind` / errors and on AsyncSend failure.
pkg/lockservice/lock_table_keeper_test.go	New keeper tests for fencing via `NewBind` and via failure → allocator lookup.

Comments suppressed due to low confidence (1)

pkg/lockservice/lock_table_remote.go:125

In the no-error / NewBind path, after re-acquiring txn.Lock() and detecting either txn rollover or bindChanged, the code returns without calling releaseResponse(resp) — wait, the defer releaseResponse(resp) at line 109 covers it, so the response is freed. However, on the early-return paths after re-locking (lines 115-116, 119-120), logRemoteLockFailed is also skipped so the bind-change error is only logged on the "everything succeeded but bind changed" path (line 122). Consider logging in the rollover/bindChanged branches as well, or this race (txn closed mid-RPC) becomes silent and may hide diagnostics for the very symptom this PR is targeted at.

		if resp.NewBind != nil {
			txn.Unlock()
			err = l.maybeHandleBindChanged(resp)
			txn.Lock()
			if !bytes.Equal(req.Lock.TxnID, txn.txnID) {
				cb(pb.Result{}, ErrTxnNotFound)
				return
			}
			if txn.bindChanged {
				cb(pb.Result{}, ErrLockTableBindChanged)
				return
			}
			logRemoteLockFailed(l.logger, txn, rows, opts, l.bind, err)
			cb(pb.Result{}, err)
			return
		}

Limit response-level keep-remote-lock bind refreshes to bind-related errors and deduplicate bind refresh checks within one keeper pass. Add context for bind touch intents used by active txn fencing. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Reuse the keeper bind scratch slice capacity and add a negative test to ensure non-bind KeepRemoteLock response errors do not trigger bind refreshes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

LeftHandCold · 2026-05-14T03:17:21Z

Addressed the keeper review with one small follow-up commit on this backport:

reuse the bind scratch slice capacity with cap(binds)
add a negative test to ensure non-bind KeepRemoteLock response errors do not trigger GetBind refresh

For the remaining minor note about the unused fenced-txn count in service.fenceByBindChanged: I intentionally left that unchanged in 4.0-dev. Logging the count would expand the operational surface of the backport, and removing the return value would widen the interface churn beyond the correctness/stability scope of #24379. Since it is not a correctness issue, I prefer to keep the backport conservative.

Copilot AI review requested due to automatic review settings May 13, 2026 09:30

LeftHandCold requested a review from iamlinjunhong as a code owner May 13, 2026 09:30

LeftHandCold temporarily deployed to ci May 13, 2026 09:30 — with GitHub Actions Inactive

LeftHandCold had a problem deploying to ci May 13, 2026 09:30 — with GitHub Actions Error

LeftHandCold temporarily deployed to ci May 13, 2026 09:30 — with GitHub Actions Inactive

Copilot started reviewing on behalf of LeftHandCold May 13, 2026 09:30 View session

matrix-meow added the size/L Denotes a PR that changes [500,999] lines label May 13, 2026

mergify Bot requested a review from XuPeng-SH May 13, 2026 09:31

mergify Bot added the kind/bug Something isn't working label May 13, 2026

iamlinjunhong approved these changes May 13, 2026

View reviewed changes

Copilot AI reviewed May 13, 2026

View reviewed changes

Comment thread pkg/lockservice/lock_table_keeper.go Outdated

Comment thread pkg/lockservice/lock_table_keeper.go

Comment thread pkg/lockservice/txn.go Outdated

Comment thread pkg/lockservice/service.go

LeftHandCold temporarily deployed to ci May 13, 2026 10:42 — with GitHub Actions Inactive

LeftHandCold temporarily deployed to ci May 13, 2026 10:43 — with GitHub Actions Inactive

test(lockservice): cover conservative keep refresh

4f6233a

Reuse the keeper bind scratch slice capacity and add a negative test to ensure non-bind KeepRemoteLock response errors do not trigger bind refreshes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

LeftHandCold temporarily deployed to ci May 14, 2026 03:17 — with GitHub Actions Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(lockservice): fence txns on bind changes (#24363)#24379

fix(lockservice): fence txns on bind changes (#24363)#24379
LeftHandCold wants to merge 3 commits into
matrixorigin:4.0-devfrom
LeftHandCold:fix-tpcc-err

LeftHandCold commented May 13, 2026

Uh oh!

qodo-code-review Bot commented May 13, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

LeftHandCold commented May 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

LeftHandCold commented May 13, 2026

What type of PR is this?

Which issue(s) this PR fixes:

What this PR does / why we need it:

Uh oh!

qodo-code-review Bot commented May 13, 2026

Qodo reviews are paused for this user.

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

LeftHandCold commented May 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants