Skip to content

Ub ring transport bugfix#10

Open
zchuango wants to merge 37 commits into
masterfrom
ub_ring_transport_bugfix
Open

Ub ring transport bugfix#10
zchuango wants to merge 37 commits into
masterfrom
ub_ring_transport_bugfix

Conversation

@zchuango

Copy link
Copy Markdown
Owner

What problem does this PR solve?

Issue Number: resolve

Problem Summary:

What is changed and the side effects?

Changed:

Side effects:

  • Performance effects:

  • Breaking backward compatibility:


Check List:

zchuango and others added 30 commits May 9, 2026 19:48
After PR apache#2949 changed Agent::combiner from a raw pointer to a
weak_ptr, MultiDimensionTest.shared started to fail under ASAN with a
heap-use-after-free in clear_all_agents() called from ~AgentCombiner.

Root cause
----------
std::shared_ptr makes its weak_ptr expire the instant use_count drops
to 0, while ~AgentCombiner only starts running after that. The two
events therefore form a race window between two threads:

    * thread A: last shared_ptr<AgentCombiner> released
                use_count -> 0   -> weak_ptr expired
                ~AgentCombiner -> clear_all_agents() takes _lock and
                                  starts walking _agents
    * thread B: thread exits, _destroy_tls_blocks() deletes ThreadBlock,
                each ~Agent calls combiner.lock() which now returns null
                (weak_ptr already expired), so commit_and_erase() is
                skipped and the Agent is freed while still linked into
                A's _agents list
    * thread A: dereferences a freed LinkNode in clear_all_agents() ->
                heap-use-after-free

Fix
---
Don't traverse _agents from ~AgentCombiner. Letting _agents go out of
scope is safe because:

  * butil::LinkedList and butil::LinkNode have trivial destructors and
    never traverse on destruction, so tearing down _agents does not
    dereference any agent node.
  * Surviving Agents simply observe combiner.expired() == true in
    ~Agent and skip commit_and_erase, leaving prev_/next_ dangling but
    never read.
  * If the freed agent_id is later reused by a new combiner and the
    same TLS slot is taken, get_or_create_tls_agent calls Agent::reset
    and Append's it into the new combiner's _agents.
    LinkNode::InsertBefore only writes prev_/next_ (it never reads
    their stale values), so the dangling pointers are safely
    overwritten.
  * Agent::element is destroyed together with its ThreadBlock, so any
    non-POD resource it holds is still released; if the agent slot is
    reused, Agent::reset rewrites the element value before it is
    observed again.

The body of clear_all_agents() is kept (commented out) for reference,
together with a NOTE explaining why it must not be re-enabled from
~AgentCombiner.
…che#3276)

* fix: add acquire fence in bthread_join for ARM memory visibility

Ensure memory visibility on ARM architecture after thread join.

* use the fence unconditionally

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
On Linux, `pthread_attr_getstack(3)` returns the lowest address of the
stack, but brpc treats `StackStorage::bottom` as the highest address
(see `bthread/stack.h`). `TaskGroup::init` was registering the main
task's stack with the lowest address as `bottom`, so every per-jump
`BTHREAD_SCOPED_ASAN_FIBER_SWITCHER` told ASan a stack range
`stacksize` bytes below the real one. The first
`__asan_handle_no_return` (e.g. the C++ unwinder fired by
`throw bthread::ExitException` inside `bthread_exit`) then prints

  WARNING: ASan is ignoring requested __asan_handle_no_return:
    stack type: default top: ...; bottom ...; size: 0x62bdb000 (~1.5 GiB)
  False positive error reports may follow

Fix `PthreadAttrGetStack` to translate the lowest address to the
highest address before returning, matching `StackStorage::bottom`.
ASan-only path; non-ASan builds are unaffected.
Otherwise there might segfault due to the race below:

```txt
Socket::OnInputEvent()          |
  `-- ProcessEvent (bthread)    |
                                |
  [ bthread queueed ]           |  QP error -> SetFailed -> HC -> WaitAndReset()
                                |  Reset() -> _sbuf.clear()
                                |  CheckHealth() -> Revive()
                                |
                                |  Socket is now Addressable!
  RdmaEndpoint:PollCq()         |
  Socket::Address() OK!         |
  RdmaEndpoint:HandleCompletion()
  _sbuf[_sq_sent++].clear()    <= BOOM!  CQ is not drained but _sbuf is cleared.
```

Another possible fix is to add a _generation_ field in RdmaEndpoint, such that:
 - each RdmaEndpoint::Reset() will advance the _generation_ by 1;
 - the RdmaEndpoint::PollCq(m, orig_gen) will need to compare the _generation_.

But it will contaminate existing interface, and we need to drain CQ anyway.

Signed-off-by: David Lee <live4thee@gmail.com>
chenBright and others added 7 commits May 19, 2026 16:45
The previous go-like RWLock bumps the reader count first and rolls it
back only on the full success path. The state is not reversible on a
partial failure, so a read timeout while a writer holds the lock leaves a
dangling reader credit and permanently blocks future writers (issue apache#3051).
For the same reason the old implementation could not offer correct try_/timed_ APIs
at all.

Co-authored-by: hairet
…#3296)

Socket::SSLHandshake polled the fd via unbounded bthread_fd_wait, so a
peer that completed the TCP handshake but never sent a TLS Hello (e.g.
server not actually configured for SSL) parked the handshake bthread
forever. That bthread holds a Socket reference through the in-flight
WriteRequest, so the underlying fd was never closed and the connection
stayed ESTABLISHED until OS keepalive eventually broke it.

Channel destruction does not help: ~Channel only removes the SocketMap
ref; SetFailed/OnFailed wake _epollout_butex but not the per-fd butex
used by bthread_fd_wait, and the fd close path runs only after the
Socket's ref count drops to zero, which the stuck bthread prevents.
With a periodic retry, ESTABLISHED sockets accumulate without bound.

Switch the two bthread_fd_wait calls in SSLHandshake to
bthread_fd_timedwait, with the deadline derived from a new gflag
ssl_handshake_timeout_ms (default 5000ms; <=0 keeps the old behavior).
On timeout the existing failure path runs: CheckConnected returns -1 ->
AfterAppConnected calls SetFailed + ReleaseAllFailedWriteRequests, and
the fd_guard in CheckConnectedAndKeepWrite closes the fd at scope exit.
The server-side handshake invoked from DoRead benefits from the same
bound.

Co-authored-by: Ye Rixu <yerixu@expontech.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants