Skip to content

fix(lockservice): fence txns on bind changes (#24363)#24379

Open
LeftHandCold wants to merge 3 commits into
matrixorigin:4.0-devfrom
LeftHandCold:fix-tpcc-err
Open

fix(lockservice): fence txns on bind changes (#24363)#24379
LeftHandCold wants to merge 3 commits into
matrixorigin:4.0-devfrom
LeftHandCold:fix-tpcc-err

Conversation

@LeftHandCold
Copy link
Copy Markdown
Contributor

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #24346

What this PR does / why we need it:

This adds lockservice-side fencing for transactions that have touched a stale lock table bind.

When a lock table bind changes, the service now marks active txns that have already tracked the old bind as bindChanged. Subsequent local lock, remote lock, and forwarded lock attempts for that txn fail with ErrLockTableBindChanged, so the caller rolls back instead of continuing to execute with stale lock ownership.

The change also handles bind changes returned by KeepRemoteLock heartbeats, so old-bind transactions can be fenced without waiting for the next business lock RPC. In-flight remote lock RPCs re-check bindChanged after the RPC returns before recording lock state, closing the race where a txn could be fenced while an RPC was still in flight.

To avoid broad behavior changes, this is scoped to txns that have tracked the changed table bind. It does not change autocommit retry policy in lockop; that remains handled separately by #24354.

Validation:

  • go test ./pkg/lockservice -run 'Test(LockRemoteChecksBindChangedAfterRPC|BindChangedFencesActiveTxnHoldingOldBind|BindChangedFencesActiveTxnAfterOldTableRemoved|KeepRemoteLockBindChangedFencesActiveTxn|LockWithBindNotFound|Issue3288|Issue2128|LockWithBindIsStale|LockWithBindTimeout|ReLockSuccWithBindChanged|ReLockSuccWithLockTableBindChanged|Issue4007)' -count=1 -timeout=2m

Note: full go test ./pkg/lockservice -count=1 -timeout=20m was also tried locally but timed out in existing TestIssue3538; this PR does not modify that unrelated test.

This adds lockservice-side fencing for transactions that have touched a stale lock table bind.

When a lock table bind changes, the service now marks active txns that have already tracked the old bind as `bindChanged`. Subsequent local lock, remote lock, and forwarded lock attempts for that txn fail with `ErrLockTableBindChanged`, so the caller rolls back instead of continuing to execute with stale lock ownership.

The change also handles bind changes returned by `KeepRemoteLock` heartbeats, so old-bind transactions can be fenced without waiting for the next business lock RPC. In-flight remote lock RPCs re-check `bindChanged` after the RPC returns before recording lock state, closing the race where a txn could be fenced while an RPC was still in flight.

To avoid broad behavior changes, this is scoped to txns that have tracked the changed table bind. It does not change autocommit retry policy in lockop; that remains handled separately by matrixorigin#24354.

Validation:

- `go test ./pkg/lockservice -run 'Test(LockRemoteChecksBindChangedAfterRPC|BindChangedFencesActiveTxnHoldingOldBind|BindChangedFencesActiveTxnAfterOldTableRemoved|KeepRemoteLockBindChangedFencesActiveTxn|LockWithBindNotFound|Issue3288|Issue2128|LockWithBindIsStale|LockWithBindTimeout|ReLockSuccWithBindChanged|ReLockSuccWithLockTableBindChanged|Issue4007)' -count=1 -timeout=2m`

Note: full `go test ./pkg/lockservice -count=1 -timeout=20m` was also tried locally but timed out in existing `TestIssue3538`; this PR does not modify that unrelated test.

Approved by: @iamlinjunhong
Copilot AI review requested due to automatic review settings May 13, 2026 09:30
@qodo-code-review
Copy link
Copy Markdown

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds lockservice-side fencing for transactions that have touched a stale lock table bind, so callers receive ErrLockTableBindChanged and roll back instead of continuing with stale lock ownership after a CN restart / allocator rebind. Targets the TPC-C ACID violation in #24346 by closing the in-flight RPC race and surfacing bind changes from KeepRemoteLock heartbeats.

Changes:

  • Track every bind a txn has touched (tableBindIntents) and add activeTxn.fenceByBindChanged, marking the txn bindChanged so subsequent local/remote/forward lock paths fail fast under a new bindChangeMu serializing handler vs. lock paths.
  • In remoteLockTable.lock, re-check txn.bindChanged (and txnID) after the RPC and after handleError/maybeHandleBindChanged, returning ErrLockTableBindChanged if the txn was fenced mid-flight.
  • In the keeper, fence active txns when KeepRemoteLock returns a bind change (via resp.NewBind or response error), with a new maybeHandleRemoteBindChanged that re-fetches the bind from the allocator and triggers handleBindChanged.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
pkg/lockservice/txn.go Adds tableBindIntents, bindChanged, lockTableBindTouched, fenceByBindChanged; resets the new fields.
pkg/lockservice/txn_test.go New unit test for lockTableBindTouched/fenceByBindChanged lifecycle.
pkg/lockservice/service.go Adds bindChangeMu, fences active txns from handleBindChanged, new bindChanged checks/touch in Lock.
pkg/lockservice/service_remote.go Same fence/touch wiring in handleRemoteLock and handleForwardLock.
pkg/lockservice/service_test.go Updates retry loops to handle fence; adds BindChangedFences* tests.
pkg/lockservice/service_remote_test.go Updates remote tests to expect ErrLockTableBindChanged and retry with fresh txn.
pkg/lockservice/lock_table_remote.go Re-checks txn rollover / bindChanged after RPC and after handleError.
pkg/lockservice/lock_table_remote_test.go New test exercising the post-RPC bindChanged check.
pkg/lockservice/lock_table_keeper.go Fences via KeepRemoteLock NewBind / errors and on AsyncSend failure.
pkg/lockservice/lock_table_keeper_test.go New keeper tests for fencing via NewBind and via failure → allocator lookup.
Comments suppressed due to low confidence (1)

pkg/lockservice/lock_table_remote.go:125

  • In the no-error / NewBind path, after re-acquiring txn.Lock() and detecting either txn rollover or bindChanged, the code returns without calling releaseResponse(resp) — wait, the defer releaseResponse(resp) at line 109 covers it, so the response is freed. However, on the early-return paths after re-locking (lines 115-116, 119-120), logRemoteLockFailed is also skipped so the bind-change error is only logged on the "everything succeeded but bind changed" path (line 122). Consider logging in the rollover/bindChanged branches as well, or this race (txn closed mid-RPC) becomes silent and may hide diagnostics for the very symptom this PR is targeted at.
		if resp.NewBind != nil {
			txn.Unlock()
			err = l.maybeHandleBindChanged(resp)
			txn.Lock()
			if !bytes.Equal(req.Lock.TxnID, txn.txnID) {
				cb(pb.Result{}, ErrTxnNotFound)
				return
			}
			if txn.bindChanged {
				cb(pb.Result{}, ErrLockTableBindChanged)
				return
			}
			logRemoteLockFailed(l.logger, txn, rows, opts, l.bind, err)
			cb(pb.Result{}, err)
			return
		}

Comment thread pkg/lockservice/lock_table_keeper.go Outdated
Comment thread pkg/lockservice/lock_table_keeper.go
Comment thread pkg/lockservice/txn.go Outdated
Comment thread pkg/lockservice/service.go
Limit response-level keep-remote-lock bind refreshes to bind-related errors and deduplicate bind refresh checks within one keeper pass. Add context for bind touch intents used by active txn fencing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Reuse the keeper bind scratch slice capacity and add a negative test to ensure non-bind KeepRemoteLock response errors do not trigger bind refreshes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@LeftHandCold
Copy link
Copy Markdown
Contributor Author

Addressed the keeper review with one small follow-up commit on this backport:

  • reuse the bind scratch slice capacity with cap(binds)
  • add a negative test to ensure non-bind KeepRemoteLock response errors do not trigger GetBind refresh

For the remaining minor note about the unused fenced-txn count in service.fenceByBindChanged: I intentionally left that unchanged in 4.0-dev. Logging the count would expand the operational surface of the backport, and removing the return value would widen the interface churn beyond the correctness/stability scope of #24379. Since it is not a correctness issue, I prefer to keep the backport conservative.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/bug Something isn't working size/L Denotes a PR that changes [500,999] lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants