fix: reduce RSS growth from dedupLoad sync.Map retention and flush convoy (cherry-pick to 4.0-dev)#24388
Open
jiangxinmeng1 wants to merge 4 commits into
…nvoy

The dedupLoad mechanism (introduced in matrixorigin#24309) used sync.Map, which never shrinks its internal hash table after Delete. Under TPCC 100-terminal concurrency this caused monotonic memory growth. Additionally, all waiters shared a single []byte reference, preventing GC until the slowest consumer finished.

Replace sync.Map with mutex+map so entries are fully reclaimed on delete. Each waiter now copies the result so the shared loadCall can be GC'd promptly.

The flushSemaphore (also from matrixorigin#24309) was hardcoded to 4 concurrent flushes. With 100 workers hitting memory pressure, 96 would queue holding their S3 write buffers, causing RSS to spike ~8-10 GiB. Change to dynamic capacity (GOMAXPROCS/2, clamped [4, 16]) and add a 200ms timeout fallback so queued workers don't hold buffers indefinitely.

Approved by: @include-all
Ref matrixorigin#24348
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
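The mutex+map replacement and the per-waiter copy described in this commit can be sketched as follows. This is a minimal, self-contained sketch: the `dedupLoader`/`loadCall` names and field layout are illustrative, not the actual MatrixOne code.

```go
package main

import "sync"

// loadCall tracks one in-flight load; waiters block on done.
type loadCall struct {
	done chan struct{}
	val  []byte
	err  error
}

// dedupLoader deduplicates concurrent loads of the same key.
// A plain map guarded by a mutex is fully reclaimed on delete,
// unlike sync.Map, whose internal hash table never shrinks.
type dedupLoader struct {
	mu    sync.Mutex
	calls map[string]*loadCall
}

func newDedupLoader() *dedupLoader {
	return &dedupLoader{calls: make(map[string]*loadCall)}
}

func (d *dedupLoader) Load(key string, load func() ([]byte, error)) ([]byte, error) {
	d.mu.Lock()
	if c, ok := d.calls[key]; ok {
		d.mu.Unlock()
		<-c.done // wait for the in-flight load to finish
		if c.err != nil {
			return nil, c.err
		}
		// Copy the result so the shared loadCall (and its val) can be
		// GC'd without waiting for the slowest consumer to finish.
		return append([]byte(nil), c.val...), nil
	}
	c := &loadCall{done: make(chan struct{})}
	d.calls[key] = c
	d.mu.Unlock()

	c.val, c.err = load()

	d.mu.Lock()
	delete(d.calls, key) // entry fully reclaimed on delete
	d.mu.Unlock()
	close(c.done)
	return c.val, c.err
}
```

The channel close establishes the happens-before edge, so waiters can read `c.val`/`c.err` without holding the mutex.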
Add explicit memory-cache capacity backpressure before DiskCache.Read converts disk bytes into CachedData. Keep DiskCache allocation on the default allocator so the gate is applied at the disk-cache layer instead of relying on S3FS/LocalFS allocator wrappers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
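One way to realize the capacity backpressure described here is a byte-weighted gate that blocks conversions while too many bytes are in flight. This is only a sketch of the idea under that assumption; the `byteGate` type and its API are hypothetical, not the actual implementation.

```go
package main

import "sync"

// byteGate applies backpressure on bytes being converted from disk
// into the memory cache: Acquire blocks while admitting n more bytes
// would exceed the configured capacity.
type byteGate struct {
	mu   sync.Mutex
	cond *sync.Cond
	cap  int64
	used int64
}

func newByteGate(capacity int64) *byteGate {
	g := &byteGate{cap: capacity}
	g.cond = sync.NewCond(&g.mu)
	return g
}

func (g *byteGate) Acquire(n int64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	// Admit an oversized request once in-flight bytes drain to zero,
	// so a single read larger than capacity cannot deadlock.
	for g.used > 0 && g.used+n > g.cap {
		g.cond.Wait()
	}
	g.used += n
}

func (g *byteGate) Release(n int64) {
	g.mu.Lock()
	g.used -= n
	g.mu.Unlock()
	g.cond.Broadcast()
}
```

A caller would `Acquire(len(diskBytes))` before building CachedData and `Release` once the entry is owned by the cache or dropped.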
…tions (matrixorigin#24352)

Two issues causing unbounded memory growth under high-concurrency DiskCache read paths (1000 TPCC FOR UPDATE terminals simultaneously hitting `AllocateCacheDataWithHint`):

- **`EnsureNBytes` silently skips eviction under contention**: The FIFO cache's `Evict` uses `TryLock` — when 1000 goroutines concurrently call `EnsureNBytes`, only one actually evicts while the rest skip and proceed to allocate, causing the memory cache to overshoot its configured capacity by up to `N × blockSize`. Fix: when the cache is already at/over capacity, use blocking `EvictWithWait` to guarantee space is freed before returning.
- **IO buffer and CachedData held simultaneously**: `ReadFromOSFile` held both the compressed IO buffer (`i.Data`, allocated via `ioAllocator`) and the decompressed `CachedData` (allocated via `memoryCacheAllocator`) until `IOVector.Release()`. Since this function is only called from `DiskCache.Read` (which sets `fromCache = diskCache`, causing `diskCache.Update` to skip the entry), the IO buffer can be released immediately after `setCachedData` succeeds. This reduces peak per-entry memory by ~30-50%.

The OOM path: `LockOp.callNonBlocking` → `TableScan` → `DiskCache.Read` → `setCachedData` → `constructorFactory` → `AllocateCacheDataWithHint` → `MetricsAllocator.Allocate`

Under 1000 concurrent FOR UPDATE transactions:
1. Each transaction's TableScan triggers cache data allocation (~683KB/block)
2. `EnsureNBytes` was ineffective under contention (TryLock skip)
3. Memory cache could exceed capacity by hundreds of MB
4. Combined with lock-wait holding batch data, total RSS exceeded CN limits

Approved by: @XuPeng-SH, @fengttt
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
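The blocking-eviction fix in the first issue above can be illustrated with a minimal stand-in for the FIFO cache. The type and method bodies below are simplified stand-ins (byte accounting only), not the real cache code.

```go
package main

import (
	"sync"
	"sync/atomic"
)

// fifoCache is a minimal stand-in for the FIFO memory cache:
// only the capacity accounting needed to show the eviction fix.
type fifoCache struct {
	evictMu  sync.Mutex
	capacity int64
	used     atomic.Int64
}

// tryEvict mirrors the original TryLock-based Evict: under heavy
// contention most callers fail the TryLock and skip eviction.
func (c *fifoCache) tryEvict(n int64) bool {
	if !c.evictMu.TryLock() {
		return false
	}
	defer c.evictMu.Unlock()
	c.evictLocked(n)
	return true
}

// evictWithWait blocks on the eviction lock, guaranteeing that
// space has actually been freed before the caller allocates.
func (c *fifoCache) evictWithWait(n int64) {
	c.evictMu.Lock()
	defer c.evictMu.Unlock()
	c.evictLocked(n)
}

// evictLocked stands in for walking the FIFO queue and freeing
// entries; here it just drops accounted bytes.
func (c *fifoCache) evictLocked(n int64) {
	for {
		u := c.used.Load()
		freed := n
		if freed > u {
			freed = u
		}
		if c.used.CompareAndSwap(u, u-freed) {
			return
		}
	}
}

// EnsureNBytes makes room for n bytes. The fix: if the opportunistic
// try-evict was skipped and the cache is at/over capacity, fall back
// to blocking eviction instead of overshooting the budget.
func (c *fifoCache) EnsureNBytes(n int64) {
	if c.tryEvict(n) {
		return
	}
	if c.used.Load()+n > c.capacity {
		c.evictWithWait(n)
	}
}
```

With only `tryEvict`, N contending goroutines could all skip and allocate; the blocking fallback bounds the overshoot.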
…th (matrixorigin#24339)

Backports the TransferPage write reliability fix to the `main` branch.

The root cause of the DN OOM burst under IOChaos is that `WriteTransferPage` silently sets page paths even when the filesystem write fails. When the in-memory hashmap is later evicted by TTL, `loadTable()` attempts to read from a non-existent file path, triggering repeated flush re-scheduling and memory allocation storms.

This fix:
1. Adds bounded retry (3 attempts with exponential backoff 100ms/200ms/400ms) to `WriteTransferPage`
2. Returns an error on failure instead of silently succeeding
3. Only sets page paths after a successful write — on failure pages remain in-memory (graceful degradation)
4. Callers (`flushTableTail`, `mergeObjects`) log a warning and keep pages in memory when persistence fails

Approved by: @XuPeng-SH, @aptend
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
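The bounded retry described in this commit might look like the following sketch. The helper name, the simulated error, and the exact attempt/sleep accounting are assumptions based only on the commit message, not the real patch.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errSimulatedIO stands in for a transient filesystem failure
// (hypothetical; used only for demonstration).
var errSimulatedIO = errors.New("simulated io failure")

// writeWithRetry sketches a bounded-retry policy: up to 3 attempts
// with exponentially growing backoff between them, returning the
// last error instead of silently reporting success.
func writeWithRetry(write func() error) error {
	backoff := 100 * time.Millisecond
	var lastErr error
	for attempt := 1; attempt <= 3; attempt++ {
		if lastErr = write(); lastErr == nil {
			return nil
		}
		if attempt < 3 {
			time.Sleep(backoff)
			backoff *= 2 // doubles each round
		}
	}
	return fmt.Errorf("write transfer page failed after 3 attempts: %w", lastErr)
}
```

Per items 3 and 4 of the fix, a caller would set the page path only when this returns nil, and otherwise log a warning and keep the page in memory.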
What type of PR is this?
Which issue(s) does this PR fix or relate to?
Fixes #24348
What this PR does / why we need it:
Cherry-pick of #24387 to 4.0-dev.
- Replace `sync.Map` with `sync.Mutex` + plain `map` in `dedupLoad` so entries are fully reclaimed after deletion (`sync.Map`'s internal hash table never shrinks)
- Each waiter copies the `[]byte` result instead of sharing a reference, allowing the `loadCall` struct to be GC'd promptly without waiting for the slowest consumer
- Raise `flushSemaphore` from hardcoded capacity 4 to dynamic `GOMAXPROCS/2` (clamped [4, 16]) to improve flush throughput on multi-core machines
- Add a 200ms timeout on `flushSemaphore` acquisition so queued workers don't hold S3 write buffers indefinitely, avoiding convoy-induced RSS growth
What was the problem

PR #24309 introduced `dedupLoad` (cache stampede prevention) and `flushSemaphore` (thundering-herd protection). While these correctly prevent OOM in the sysbench insert-ignore scenario, under TPCC 10w/100t (100 terminals, concurrent writes) they cause baseline RSS to grow from 31-36 GiB to 40-46 GiB:

- `sync.Map` never shrinks: even after `Delete`, the internal hash table retains its peak size, causing monotonic memory growth under sustained load
- Shared `[]byte` reference: all waiters hold a reference to `loadCall.val`, preventing GC until the slowest goroutine finishes

Test plan
Ref #24348
🤖 Generated with Claude Code