
fix: reduce RSS growth from dedupLoad sync.Map retention and flush convoy (cherry-pick to 4.0-dev) #24388

Open
jiangxinmeng1 wants to merge 4 commits into matrixorigin:4.0-dev from jiangxinmeng1:fix/dedupload-flush-convoy-4.0-dev

Conversation


@jiangxinmeng1 (Contributor) commented May 14, 2026

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) does this PR fix or relate to?

Fixes #24348

What this PR does / why we need it:

Cherry-pick of #24387 to 4.0-dev.

  • Replace sync.Map with sync.Mutex + plain map in dedupLoad so entries are fully reclaimed after deletion (sync.Map's internal hash table never shrinks); see the sketch after this list
  • Each waiter now copies the loaded []byte result instead of sharing a reference, allowing the loadCall struct to be GC'd promptly without waiting for the slowest consumer
  • Change flushSemaphore from a hardcoded capacity of 4 to a dynamic GOMAXPROCS/2 (clamped to [4, 16]) to improve flush throughput on multi-core machines
  • Add a 200ms timeout fallback to flushSemaphore acquisition so queued workers don't hold S3 write buffers indefinitely, avoiding convoy-induced RSS growth; see the semaphore sketch after the problem list below
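A minimal sketch of the mutex+map pattern from the first two bullets, assuming a string-keyed load that returns []byte; `dedupLoader`, `loadCall`, and the `loader` callback are illustrative stand-ins, not the repo's actual types:

```go
package dedup

import "sync"

// loadCall tracks one in-flight load; waiters block on done.
type loadCall struct {
	done chan struct{}
	val  []byte
	err  error
}

type dedupLoader struct {
	mu    sync.Mutex
	calls map[string]*loadCall // plain map: deleted entries are fully reclaimable
}

// Load collapses concurrent loads of the same key into one loader call.
// Each waiter receives its own copy of the result, so the shared loadCall
// (and its []byte) can be GC'd without waiting for the slowest consumer.
func (d *dedupLoader) Load(key string, loader func() ([]byte, error)) ([]byte, error) {
	d.mu.Lock()
	if c, ok := d.calls[key]; ok {
		d.mu.Unlock()
		<-c.done // wait for the in-flight load to finish
		if c.err != nil {
			return nil, c.err
		}
		out := make([]byte, len(c.val)) // copy instead of sharing a reference
		copy(out, c.val)
		return out, nil
	}
	c := &loadCall{done: make(chan struct{})}
	if d.calls == nil {
		d.calls = make(map[string]*loadCall)
	}
	d.calls[key] = c
	d.mu.Unlock()

	c.val, c.err = loader()

	d.mu.Lock()
	delete(d.calls, key) // nothing pins the loadCall except still-copying waiters
	d.mu.Unlock()
	close(c.done) // publishes val/err to waiters

	return c.val, c.err
}
```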

What was the problem

PR #24309 introduced dedupLoad (cache stampede prevention) and flushSemaphore (thundering-herd protection). While these correctly prevent OOM in the sysbench insert-ignore scenario, under TPCC 10w/100t (100 concurrent write terminals) they cause baseline RSS to grow from 31-36 GiB to 40-46 GiB:

  1. sync.Map never shrinks: Even after Delete, the internal hash table retains its peak size, causing monotonic memory growth under sustained load
  2. Shared []byte reference: All waiters hold a reference to loadCall.val, preventing GC until the slowest goroutine finishes
  3. Flush convoy: With only 4 flush slots and 100 workers, 96 workers queue while holding their S3 write buffers (see the sketch below)
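A minimal sketch of the new semaphore sizing and the 200ms timeout fallback, modeling flushSemaphore as a buffered channel; the function names are illustrative, and it is an assumption here that a timed-out worker simply proceeds without a slot:

```go
package flush

import (
	"runtime"
	"time"
)

// newFlushSemaphore sizes the semaphore at GOMAXPROCS/2, clamped to [4, 16],
// instead of the previous hardcoded capacity of 4.
func newFlushSemaphore() chan struct{} {
	n := runtime.GOMAXPROCS(0) / 2
	if n < 4 {
		n = 4
	}
	if n > 16 {
		n = 16
	}
	return make(chan struct{}, n)
}

// acquire takes a flush slot, giving up after 200ms so a queued worker
// does not sit on its S3 write buffer indefinitely.
func acquire(sem chan struct{}) bool {
	t := time.NewTimer(200 * time.Millisecond)
	defer t.Stop()
	select {
	case sem <- struct{}{}:
		return true
	case <-t.C:
		return false // fallback: caller proceeds without a slot
	}
}

func release(sem chan struct{}) {
	<-sem
}
```

A caller would pair these as `if acquire(sem) { defer release(sem) }` and flush either way: a brief burst of extra concurrent flushes is cheaper than 96 workers pinning their write buffers in the queue.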

Test plan

  • TPCC 10w/100t: verify baseline RSS stays at 31-36 GiB
  • Sysbench 1000w insert-ignore: verify no OOM regression

Ref #24348

🤖 Generated with Claude Code

fix: reduce RSS growth from dedupLoad sync.Map retention and flush convoy

The dedupLoad mechanism (introduced in matrixorigin#24309) used sync.Map, which never
shrinks its internal hash table after Delete. Under TPCC 100-terminal
concurrency this caused monotonic memory growth. Additionally, all waiters
shared a single []byte reference, preventing GC until the slowest consumer
finished.

Replace sync.Map with mutex+map so entries are fully reclaimed on delete.
Each waiter now copies the result so the shared loadCall can be GC'd promptly.

The flushSemaphore (also from matrixorigin#24309) was hardcoded to 4 concurrent flushes.
With 100 workers hitting memory pressure, 96 would queue holding their S3
write buffers, causing RSS to spike ~8-10 GiB. Change to dynamic capacity
(GOMAXPROCS/2, clamped [4,16]) and add 200ms timeout fallback so queued
workers don't hold buffers indefinitely.

Approved by: @include-all

Ref matrixorigin#24348

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add explicit memory-cache capacity backpressure before DiskCache.Read converts disk bytes into CachedData. Keep DiskCache allocation on the default allocator so the gate is applied at the disk-cache layer instead of relying on S3FS/LocalFS allocator wrappers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
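The backpressure described here could be sketched as a weighted semaphore denominated in bytes, using golang.org/x/sync/semaphore; `byteGate` and its wiring into DiskCache.Read are hypothetical, not the actual gate in the repo:

```go
package cache

import (
	"context"

	"golang.org/x/sync/semaphore"
)

// byteGate meters memory-cache capacity in bytes: a reader must reserve
// room before converting disk bytes into CachedData, so concurrent reads
// cannot collectively overshoot the budget.
type byteGate struct {
	sem *semaphore.Weighted
}

func newByteGate(capacityBytes int64) *byteGate {
	return &byteGate{sem: semaphore.NewWeighted(capacityBytes)}
}

// reserve blocks until n bytes are available (or ctx is done).
func (g *byteGate) reserve(ctx context.Context, n int64) error {
	return g.sem.Acquire(ctx, n)
}

// free returns capacity once the CachedData is released or evicted.
func (g *byteGate) free(n int64) {
	g.sem.Release(n)
}
```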
jiangxinmeng1 and others added 2 commits May 15, 2026 09:47
…tions (matrixorigin#24352)

Two issues causing unbounded memory growth under high-concurrency DiskCache read paths (1000 TPCC FOR UPDATE terminals simultaneously hitting `AllocateCacheDataWithHint`):

- **`EnsureNBytes` silently skips eviction under contention**: The FIFO cache's `Evict` uses `TryLock` — when 1000 goroutines concurrently call `EnsureNBytes`, only one actually evicts while the rest skip and proceed to allocate, causing the memory cache to overshoot its configured capacity by up to `N × blockSize`. Fix: when the cache is already at/over capacity, use blocking `EvictWithWait` to guarantee space is freed before returning (see the sketch after this list).

- **IO buffer and CachedData held simultaneously**: `ReadFromOSFile` held both the compressed IO buffer (`i.Data`, allocated via `ioAllocator`) and the decompressed `CachedData` (allocated via `memoryCacheAllocator`) until `IOVector.Release()`. Since this function is only called from `DiskCache.Read` (which sets `fromCache = diskCache`, causing `diskCache.Update` to skip the entry), the IO buffer can be released immediately after `setCachedData` succeeds. This reduces peak per-entry memory by ~30-50%.
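A toy model of the at/over-capacity policy from the first bullet, with a condition variable standing in for `EvictWithWait`; this `fifoCache` only tracks a byte budget, not real cache entries:

```go
package cache

import "sync"

type fifoCache struct {
	mu       sync.Mutex
	cond     *sync.Cond
	used     int64
	capacity int64
}

func newFifoCache(capacity int64) *fifoCache {
	c := &fifoCache{capacity: capacity}
	c.cond = sync.NewCond(&c.mu)
	return c
}

// ensureNBytes reserves n bytes of budget. The key change: when already
// at/over capacity it blocks until space is truly freed, instead of a
// TryLock-style skip that lets N contending goroutines all allocate and
// overshoot by up to N × blockSize.
func (c *fifoCache) ensureNBytes(n int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for c.used+n > c.capacity {
		c.cond.Wait() // blocking wait, analogous to EvictWithWait
	}
	c.used += n
}

// release frees n bytes (the eviction path) and wakes blocked reservers.
func (c *fifoCache) release(n int64) {
	c.mu.Lock()
	c.used -= n
	c.mu.Unlock()
	c.cond.Broadcast()
}
```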

The OOM path: `LockOp.callNonBlocking` → `TableScan` → `DiskCache.Read` → `setCachedData` → `constructorFactory` → `AllocateCacheDataWithHint` → `MetricsAllocator.Allocate`

Under 1000 concurrent FOR UPDATE transactions:
1. Each transaction's TableScan triggers cache data allocation (~683KB/block)
2. `EnsureNBytes` was ineffective under contention (TryLock skip)
3. Memory cache could exceed capacity by hundreds of MB
4. Combined with lock-wait holding batch data, total RSS exceeded CN limits

Approved by: @XuPeng-SH, @fengttt

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…th (matrixorigin#24339)

Backports the TransferPage write reliability fix to the `main` branch.

The root cause of the DN OOM burst under IOChaos is that `WriteTransferPage` silently sets page paths even when the filesystem write fails. When the in-memory hashmap is later evicted by TTL, `loadTable()` attempts to read from a non-existent file path, triggering repeated flush re-scheduling and memory allocation storms.

This fix:
1. Adds bounded retry (3 attempts with exponential backoff 100ms/200ms/400ms) to `WriteTransferPage` (see the sketch after this list)
2. Returns an error on failure instead of silently succeeding
3. Only sets page paths after a successful write — on failure pages remain in-memory (graceful degradation)
4. Callers (`flushTableTail`, `mergeObjects`) log a warning and keep pages in memory when persistence fails
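A sketch of the bounded-retry policy in points 1-3, with `write` standing in for the actual page write and the 100ms/200ms/400ms schedule taken from the commit text:

```go
package transfer

import (
	"fmt"
	"time"
)

// writeWithRetry attempts the write up to three times with exponential
// backoff, and returns the error instead of silently succeeding; the
// caller only records the page path if this returns nil.
func writeWithRetry(write func() error) error {
	var err error
	for _, d := range []time.Duration{
		100 * time.Millisecond,
		200 * time.Millisecond,
		400 * time.Millisecond,
	} {
		if err = write(); err == nil {
			return nil
		}
		time.Sleep(d) // exponential backoff after each failed attempt
	}
	return fmt.Errorf("write transfer page: retries exhausted: %w", err)
}
```

On error the caller keeps the pages in memory and logs a warning, rather than pointing the hashmap at a file that was never written.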

Approved by: @XuPeng-SH, @aptend

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
