Skip to content

[Feature] Add Power-of-Two-Choices Peak-EWMA load balancer (p2c) — O(1) tail-latency-aware policy #3340

@rajvarun77

Description

@rajvarun77

Is your feature request related to a problem?

brpc has several client-side load balancers (rr, wrr, random, wr, la, the consistent-hash family, _dynpart), but only one of them is latency-aware: la (locality-aware). la maintains a doubly-buffered weight tree over all servers with weights driven by inverse average latency, and samples a single point on the cumulative-weight line.

brpc currently has no Power-of-Two-Choices (P2C) sampling load balancer — neither Envoy-style least-request (P2C over active-request-biased weights) nor Finagle-style Peak-EWMA. P2C is the single most widely deployed tail-latency balancer in modern RPC stacks and service meshes (Finagle/Twitter p2cPeakEwma, linkerd, Envoy LEAST_REQUEST), and it is absent from brpc.

This matters in two regimes that rr/random ignore and where la's averaging window lags:

  • a single slow/degraded backend (GC pause, noisy neighbor, cold cache) — should be shed within one sample, not after an averaging window;
  • a heterogeneous fleet (mixed CPU generations/capacities) — selection should continuously bias toward faster, less-loaded nodes.

Describe the solution you'd like

Add a new client-side load balancer, registered as p2c (alias p2c_ewma): Power-of-Two-Choices with Peak-EWMA latency scoring.

Algorithm (per request):

  1. Sample two distinct random backends from the server list (fast_rand_less_than, as rr/random already do).
  2. Route to the one with the lower load score:
    • score = ewma_us * (inflight + 1) / max(weight, 1)
    • ewma_us is a peak-sensitive EWMA of round-trip latency per backend: an upward latency spike sets the score to the spike value immediately; downward recovery decays with a configurable time constant TAU (Finagle default decay ≈ 10s). This is what sheds a stalled backend within one sample.
    • inflight is the backend's outstanding-request count (incremented at select, decremented at completion) — the Envoy LEAST_REQUEST (active+1) term.
    • dividing by the static naming-service weight makes it weighted P2C (degrades to unweighted when weights are equal).
  3. On RPC completion, update the chosen backend's EWMA and decrement its in-flight; failures/timeouts inflate the latency (max(measured, timeout)) so a failing node is avoided — mirroring la's error handling.

Properties: O(1) selection (two score evaluations regardless of fleet size, vs la's O(log N) tree walk); spike-reactive via the peak term; herding-resistant via two-sample comparison.

Interface feasibility — no brpc core change required. Every hook P2C-EWMA needs already exists and is exercised by la:

  • src/brpc/load_balancer.hSelectOut::need_feedback (load_balancer.h:50), virtual void Feedback(const CallInfo&) (load_balancer.h:100), and CallInfo { begin_time_us; server_id; error_code; controller; } (load_balancer.h:53-61) giving per-call latency, which backend, success/failure, and timeout.
  • src/brpc/controller.cpp:873-876 — the RPC completion path actually invokes the hook:
    if (need_feedback && c->_lb) {
        const LoadBalancer::CallInfo info =
            { begin_time_us, peer_id, error_code, c };
        c->_lb->Feedback(info);
    }
    and controller.cpp:1120 propagates sel_out.need_feedback into the call.
  • Membership lives in DoublyBufferedData<Servers, TLS> (lock-free reads, copy-on-write modify) exactly as round_robin_load_balancer.cpp. Mutable per-node load state (inflight, ewma_us, stamp_us) is heap-allocated once per backend and referenced by stable pointer — the same technique locality_aware_load_balancer already uses for its per-node atomics.
  • brpc's Socket has no general app-level in-flight counter, so — like la — the LB owns its in-flight counters. Fully feasible.

Files to add / modify

  • Add src/brpc/policy/p2c_ewma_load_balancer.{h,cpp}.
  • Modify src/brpc/global.cpp: include the header, add a P2CEwmaLoadBalancer member to GlobalExtensions, and register it next to the existing balancers — LoadBalancerExtension()->RegisterOrDie("p2c", &g_ext->p2c_ewma_lb);.
  • Add test/brpc_p2c_ewma_load_balancer_unittest.cpp (modeled on the la/rr LB unit tests), registered in all three build systems (test/CMakeLists.txt, test/BUILD.bazel, test/Makefile).
  • Add a p2c row to docs/cn/lb.md / docs/en/lb.md.

Performance / testing plan (multi-protocol, multi-config)

A cross-protocol benchmark matrix to prove the gain is real on actual brpc-supported backends, not just a synthetic echo server.

  • Protocols: baidu_std (protobuf RPC, controllable-sleep echo server — clean baseline), Memcached (MemcacheRequest — tiny ops, very high QPS, exposes selection overhead/herding), Redis (RedisRequest — single-threaded backend, high load sensitivity), MySQL (variable query cost — biggest tail surface), HTTP. Integration backends spawn-or-skip using the existing brpc_redis_unittest.cpp system("which ...") precedent, with one node acting as the degraded backend.
  • Configs: fleet size N ∈ {4, 16, 64, 256}; homogeneous vs heterogeneous latency; degradation injected (none / 1 node +50ms / 10% of fleet +20ms / transient spike-then-recover); offered load (low / knee / saturating); equal vs 1:2:4 naming-service weights; pooled vs single connection.
  • Functional tests (must pass first): add/remove/batch, single-server, ExcludedServers exclusion, feedback-driven score shift, in-flight inc/dec across select→feedback (incl. error/timeout path), weighted 1:2:4 split, and concurrency under TSan/ASan.
  • Metrics: p50/p90/p99/p999/max latency; throughput at fixed SLO and saturating; per-node traffic share over time (divert latency after a spike); request-count variance (fairness/herding); selection CPU cost per op as N grows (P2C O(1) vs la O(log N)).
  • Headline result to validate: tail-latency reduction and throughput-at-SLO uplift of p2c over rr/random/la on the heterogeneous and degraded configs, with selection cost no worse than rr.

Describe alternatives you've considered

  • Extend la — different algorithm (global weight tree, averaging window); does not give O(1) sampling or peak spike-reactivity. Keeping both gives users a real choice.
  • Envoy-style least-request only (weight / (active+1)) — a strict subset; P2C-EWMA folds the (inflight+1) active-request term and the peak-latency term into one score, so it covers least-request as the equal-latency special case.
  • WRR + ORCA load reports (gRPC gRFC A58) — requires a server-reported load backchannel brpc does not have; out of scope for a client-only change.

Additional context

References (primary sources):

I'm happy to implement this end-to-end (balancer + unit tests in all three build systems + docs + the benchmark matrix). Would the maintainers be open to a p2c policy, and is a GitHub issue the right place to scope it, or would you prefer a [DISCUSS] thread on dev@brpc.apache.org first?

cc @zyearn (load-balancer policy owner / author of la) — would value your read on whether a p2c policy fits brpc.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions