fix(lockservice): fence txns on bind changes (#24363)#24379
fix(lockservice): fence txns on bind changes (#24363)#24379LeftHandCold wants to merge 3 commits into
Conversation
This adds lockservice-side fencing for transactions that have touched a stale lock table bind. When a lock table bind changes, the service now marks active txns that have already tracked the old bind as `bindChanged`. Subsequent local lock, remote lock, and forwarded lock attempts for that txn fail with `ErrLockTableBindChanged`, so the caller rolls back instead of continuing to execute with stale lock ownership. The change also handles bind changes returned by `KeepRemoteLock` heartbeats, so old-bind transactions can be fenced without waiting for the next business lock RPC. In-flight remote lock RPCs re-check `bindChanged` after the RPC returns before recording lock state, closing the race where a txn could be fenced while an RPC was still in flight. To avoid broad behavior changes, this is scoped to txns that have tracked the changed table bind. It does not change autocommit retry policy in lockop; that remains handled separately by matrixorigin#24354. Validation: - `go test ./pkg/lockservice -run 'Test(LockRemoteChecksBindChangedAfterRPC|BindChangedFencesActiveTxnHoldingOldBind|BindChangedFencesActiveTxnAfterOldTableRemoved|KeepRemoteLockBindChangedFencesActiveTxn|LockWithBindNotFound|Issue3288|Issue2128|LockWithBindIsStale|LockWithBindTimeout|ReLockSuccWithBindChanged|ReLockSuccWithLockTableBindChanged|Issue4007)' -count=1 -timeout=2m` Note: full `go test ./pkg/lockservice -count=1 -timeout=20m` was also tried locally but timed out in existing `TestIssue3538`; this PR does not modify that unrelated test. Approved by: @iamlinjunhong
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
There was a problem hiding this comment.
Pull request overview
Adds lockservice-side fencing for transactions that have touched a stale lock table bind, so callers receive ErrLockTableBindChanged and roll back instead of continuing with stale lock ownership after a CN restart / allocator rebind. Targets the TPC-C ACID violation in #24346 by closing the in-flight RPC race and surfacing bind changes from KeepRemoteLock heartbeats.
Changes:
- Track every bind a txn has touched (
tableBindIntents) and addactiveTxn.fenceByBindChanged, marking the txnbindChangedso subsequent local/remote/forward lock paths fail fast under a newbindChangeMuserializing handler vs. lock paths. - In
remoteLockTable.lock, re-checktxn.bindChanged(and txnID) after the RPC and afterhandleError/maybeHandleBindChanged, returningErrLockTableBindChangedif the txn was fenced mid-flight. - In the keeper, fence active txns when
KeepRemoteLockreturns a bind change (viaresp.NewBindor response error), with a newmaybeHandleRemoteBindChangedthat re-fetches the bind from the allocator and triggershandleBindChanged.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/lockservice/txn.go | Adds tableBindIntents, bindChanged, lockTableBindTouched, fenceByBindChanged; resets the new fields. |
| pkg/lockservice/txn_test.go | New unit test for lockTableBindTouched/fenceByBindChanged lifecycle. |
| pkg/lockservice/service.go | Adds bindChangeMu, fences active txns from handleBindChanged, new bindChanged checks/touch in Lock. |
| pkg/lockservice/service_remote.go | Same fence/touch wiring in handleRemoteLock and handleForwardLock. |
| pkg/lockservice/service_test.go | Updates retry loops to handle fence; adds BindChangedFences* tests. |
| pkg/lockservice/service_remote_test.go | Updates remote tests to expect ErrLockTableBindChanged and retry with fresh txn. |
| pkg/lockservice/lock_table_remote.go | Re-checks txn rollover / bindChanged after RPC and after handleError. |
| pkg/lockservice/lock_table_remote_test.go | New test exercising the post-RPC bindChanged check. |
| pkg/lockservice/lock_table_keeper.go | Fences via KeepRemoteLock NewBind / errors and on AsyncSend failure. |
| pkg/lockservice/lock_table_keeper_test.go | New keeper tests for fencing via NewBind and via failure → allocator lookup. |
Comments suppressed due to low confidence (1)
pkg/lockservice/lock_table_remote.go:125
- In the no-error / NewBind path, after re-acquiring
txn.Lock()and detecting either txn rollover orbindChanged, the code returns without callingreleaseResponse(resp)— wait, thedefer releaseResponse(resp)at line 109 covers it, so the response is freed. However, on the early-return paths after re-locking (lines 115-116, 119-120),logRemoteLockFailedis also skipped so the bind-change error is only logged on the "everything succeeded but bind changed" path (line 122). Consider logging in the rollover/bindChanged branches as well, or this race (txn closed mid-RPC) becomes silent and may hide diagnostics for the very symptom this PR is targeted at.
if resp.NewBind != nil {
txn.Unlock()
err = l.maybeHandleBindChanged(resp)
txn.Lock()
if !bytes.Equal(req.Lock.TxnID, txn.txnID) {
cb(pb.Result{}, ErrTxnNotFound)
return
}
if txn.bindChanged {
cb(pb.Result{}, ErrLockTableBindChanged)
return
}
logRemoteLockFailed(l.logger, txn, rows, opts, l.bind, err)
cb(pb.Result{}, err)
return
}
Limit response-level keep-remote-lock bind refreshes to bind-related errors and deduplicate bind refresh checks within one keeper pass. Add context for bind touch intents used by active txn fencing. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Reuse the keeper bind scratch slice capacity and add a negative test to ensure non-bind KeepRemoteLock response errors do not trigger bind refreshes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Addressed the keeper review with one small follow-up commit on this backport:
For the remaining minor note about the unused fenced-txn count in |
What type of PR is this?
Which issue(s) this PR fixes:
issue #24346
What this PR does / why we need it:
This adds lockservice-side fencing for transactions that have touched a stale lock table bind.
When a lock table bind changes, the service now marks active txns that have already tracked the old bind as
bindChanged. Subsequent local lock, remote lock, and forwarded lock attempts for that txn fail withErrLockTableBindChanged, so the caller rolls back instead of continuing to execute with stale lock ownership.The change also handles bind changes returned by
KeepRemoteLockheartbeats, so old-bind transactions can be fenced without waiting for the next business lock RPC. In-flight remote lock RPCs re-checkbindChangedafter the RPC returns before recording lock state, closing the race where a txn could be fenced while an RPC was still in flight.To avoid broad behavior changes, this is scoped to txns that have tracked the changed table bind. It does not change autocommit retry policy in lockop; that remains handled separately by #24354.
Validation:
go test ./pkg/lockservice -run 'Test(LockRemoteChecksBindChangedAfterRPC|BindChangedFencesActiveTxnHoldingOldBind|BindChangedFencesActiveTxnAfterOldTableRemoved|KeepRemoteLockBindChangedFencesActiveTxn|LockWithBindNotFound|Issue3288|Issue2128|LockWithBindIsStale|LockWithBindTimeout|ReLockSuccWithBindChanged|ReLockSuccWithLockTableBindChanged|Issue4007)' -count=1 -timeout=2mNote: full
go test ./pkg/lockservice -count=1 -timeout=20mwas also tried locally but timed out in existingTestIssue3538; this PR does not modify that unrelated test.