Ub ring transport bugfix by zchuango · Pull Request #10 · zchuango/brpc

zchuango · 2026-05-23T03:31:56Z

What problem does this PR solve?

Issue Number: resolve

Problem Summary:

What is changed and the side effects?

Changed:

Side effects:

Performance effects:
Breaking backward compatibility:

Check List:

Please make sure your changes are compilable.
When providing us with a new feature, it is best to add related tests.
Please follow Contributor Covenant Code of Conduct.

After PR apache#2949 changed Agent::combiner from a raw pointer to a weak_ptr, MultiDimensionTest.shared started to fail under ASAN with a heap-use-after-free in clear_all_agents() called from ~AgentCombiner. Root cause ---------- std::shared_ptr makes its weak_ptr expire the instant use_count drops to 0, while ~AgentCombiner only starts running after that. The two events therefore form a race window between two threads: * thread A: last shared_ptr<AgentCombiner> released use_count -> 0 -> weak_ptr expired ~AgentCombiner -> clear_all_agents() takes _lock and starts walking _agents * thread B: thread exits, _destroy_tls_blocks() deletes ThreadBlock, each ~Agent calls combiner.lock() which now returns null (weak_ptr already expired), so commit_and_erase() is skipped and the Agent is freed while still linked into A's _agents list * thread A: dereferences a freed LinkNode in clear_all_agents() -> heap-use-after-free Fix --- Don't traverse _agents from ~AgentCombiner. Letting _agents go out of scope is safe because: * butil::LinkedList and butil::LinkNode have trivial destructors and never traverse on destruction, so tearing down _agents does not dereference any agent node. * Surviving Agents simply observe combiner.expired() == true in ~Agent and skip commit_and_erase, leaving prev_/next_ dangling but never read. * If the freed agent_id is later reused by a new combiner and the same TLS slot is taken, get_or_create_tls_agent calls Agent::reset and Append's it into the new combiner's _agents. LinkNode::InsertBefore only writes prev_/next_ (it never reads their stale values), so the dangling pointers are safely overwritten. * Agent::element is destroyed together with its ThreadBlock, so any non-POD resource it holds is still released; if the agent slot is reused, Agent::reset rewrites the element value before it is observed again. The body of clear_all_agents() is kept (commented out) for reference, together with a NOTE explaining why it must not be re-enabled from ~AgentCombiner.

…che#3276) * fix: add acquire fence in bthread_join for ARM memory visibility Ensure memory visibility on ARM architecture after thread join. * use the fence unconditionally Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

…ng() (apache#3292)

…ion test (apache#3294)

…pache#3299)

On Linux, `pthread_attr_getstack(3)` returns the lowest address of the stack, but brpc treats `StackStorage::bottom` as the highest address (see `bthread/stack.h`). `TaskGroup::init` was registering the main task's stack with the lowest address as `bottom`, so every per-jump `BTHREAD_SCOPED_ASAN_FIBER_SWITCHER` told ASan a stack range `stacksize` bytes below the real one. The first `__asan_handle_no_return` (e.g. the C++ unwinder fired by `throw bthread::ExitException` inside `bthread_exit`) then prints WARNING: ASan is ignoring requested __asan_handle_no_return: stack type: default top: ...; bottom ...; size: 0x62bdb000 (~1.5 GiB) False positive error reports may follow Fix `PthreadAttrGetStack` to translate the lowest address to the highest address before returning, matching `StackStorage::bottom`. ASan-only path; non-ASan builds are unaffected.

Otherwise there might segfault due to the race below: ```txt Socket::OnInputEvent() | `-- ProcessEvent (bthread) | | [ bthread queueed ] | QP error -> SetFailed -> HC -> WaitAndReset() | Reset() -> _sbuf.clear() | CheckHealth() -> Revive() | | Socket is now Addressable! RdmaEndpoint:PollCq() | Socket::Address() OK! | RdmaEndpoint:HandleCompletion() _sbuf[_sq_sent++].clear() <= BOOM! CQ is not drained but _sbuf is cleared. ``` Another possible fix is to add a _generation_ field in RdmaEndpoint, such that: - each RdmaEndpoint::Reset() will advance the _generation_ by 1; - the RdmaEndpoint::PollCq(m, orig_gen) will need to compare the _generation_. But it will contaminate existing interface, and we need to drain CQ anyway. Signed-off-by: David Lee <live4thee@gmail.com>

The previous go-like RWLock bumps the reader count first and rolls it back only on the full success path. The state is not reversible on a partial failure, so a read timeout while a writer holds the lock leaves a dangling reader credit and permanently blocks future writers (issue apache#3051). For the same reason the old implementation could not offer correct try_/timed_ APIs at all. Co-authored-by: hairet

…#3296) Socket::SSLHandshake polled the fd via unbounded bthread_fd_wait, so a peer that completed the TCP handshake but never sent a TLS Hello (e.g. server not actually configured for SSL) parked the handshake bthread forever. That bthread holds a Socket reference through the in-flight WriteRequest, so the underlying fd was never closed and the connection stayed ESTABLISHED until OS keepalive eventually broke it. Channel destruction does not help: ~Channel only removes the SocketMap ref; SetFailed/OnFailed wake _epollout_butex but not the per-fd butex used by bthread_fd_wait, and the fd close path runs only after the Socket's ref count drops to zero, which the stuck bthread prevents. With a periodic retry, ESTABLISHED sockets accumulate without bound. Switch the two bthread_fd_wait calls in SSLHandshake to bthread_fd_timedwait, with the deadline derived from a new gflag ssl_handshake_timeout_ms (default 5000ms; <=0 keeps the old behavior). On timeout the existing failure path runs: CheckConnected returns -1 -> AfterAppConnected calls SetFailed + ReleaseAllFailedWriteRequests, and the fd_guard in CheckConnectedAndKeepWrite closes the fd at scope exit. The server-side handshake invoked from DoRead benefits from the same bound. Co-authored-by: Ye Rixu <yerixu@expontech.com>

zchuango and others added 30 commits May 9, 2026 19:48

add ubring transport

3cd5904

fix the bug for ub ring transport

0934567

fix the bug for ub ring transport and other

c52fe36

add the license for ub transport

37f8747

Modifying the variable naming style

350c305

optimize the log message and some field name

aa2a2f9

fix some bug for ubring

fd91351

add some log ubring endpoint

5998401

add todo

e64f65c

fix the bug for handshake for ub endpoint

d8c5e72

fix the bug for client ub endpoint

312dafd

optimize the iobuf file code

9002081

modify the log level

e86b9ae

fix the declare_shm_ubs define not found bug

6c9ba54

add the timer_mgr support for macos and format code style

102058c

fix the bug for macos epoll

7050c53

fix the timespece bug

8dacabe

adaptor the itimerspec for macos platform

e1459d1

optimize the cmakelist config

620b6e5

add the ubring docs for ubring transport

e85ed8b

modify some file name and directory structure

b1cdb99

Fix protobuf version compatibility for ByteSizeLong() and SpaceUsedLo…

654ae77

…ng() (apache#3292)

Fix protobuf compilation errors (apache#3293)

873cf62

modify some code style and optimize code logical

6df15ac

Fix(selective_channel): isolate backup responses and add race regress…

321c789

…ion test (apache#3294)

Set up default protection ruleset for default and release branches (a…

f48628d

…pache#3299)

chenBright and others added 7 commits May 19, 2026 16:45

Fix UAF in batch_create_stream_feedback_race unittest (apache#3305)

3aa5dab

bug fix the code

e1df64e

Merge branch 'apache:master' into ub_ring_transport_bugfix

f581b39

add the cmake ubshm transport ci/testing

25be01a

add the dependency header

516d47d

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ub ring transport bugfix#10

Ub ring transport bugfix#10
zchuango wants to merge 37 commits into
masterfrom
ub_ring_transport_bugfix

zchuango commented May 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

Conversation

zchuango commented May 23, 2026

What problem does this PR solve?

What is changed and the side effects?

Check List:

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants