[FEATURE] Migrate and Enhance Adaptive Service Throttling in dubbo-go #3347

Draft
nagisa-kunhah wants to merge 9 commits into apache:develop from nagisa-kunhah:feat/issue-3336-experiment-dev

@nagisa-kunhah
Contributor

Description

Fixes #3336

Progress:

  • Review the current adaptive throttling capability in dubbo-go
  • Run pressure tests to understand its capability limits
  • Move the adaptive service plugin to dubbo-go-extensions
  • Add dubbo-go-samples for adaptive service

Pressure tests

Test scenario

  • samples/adaptive_service/protect_provider/{server,client}: verifies provider protection under high client concurrency by tracking rejects and server-side max active requests.
  • samples/adaptive_service/rtt_shrink/{server,client}: verifies limiter behavior across fast/medium/slow RTT stages and records limitation/remaining/inflight changes.
  • samples/adaptive_service/p2c_healthy/{server,client}: verifies multi-provider adaptive P2C routing by comparing per-provider hit ratio and remaining capacity.

Test results

protect_provider

  • Config: 200 client concurrency, 200ms provider handler delay, 30s duration.
  • Result: the client sent 243,940 requests in total; 222,720 were rejected by adaptive service (a reject rate of about 91%), with failed=0. The provider business handler processed only about 21,020 requests, and rejected requests never entered the handler.
  • Conclusion: provider-side adaptive throttling can shed overload traffic before the business handler and keeps unexpected RPC failures at zero under high concurrency.

⚠️ rtt_shrink

  • Config: 200 client concurrency, staged provider handler delay fast:20ms:30s, medium:100ms:20s, slow:500ms:40s, 90s duration.
  • Result: the staged delay switch worked as expected. Client-side latency increased with each phase: the fast phase stayed around tens of milliseconds, the medium phase rose to about 100ms, and the slow phase rose to about 500ms. failed=0 throughout the run. The limiter was found and reported continuously by the provider stats endpoint. During the fast phase, limiter_limitation grew from about 55 to a peak of about 123. However, after RTT increased to the 500ms slow phase, limiter_limitation stayed around 122 and did not drop meaningfully below the fast-phase peak. Rejections increased continuously under the high offered load, reaching about 888,000 total rejects by the end of the run.

p2c_healthy

  • Config: three providers with different handler delays: fast=20ms, medium=100ms, slow=300ms; 200 client concurrency; 90s duration; adaptive cluster + P2C load balancing enabled.
  • Result: P2C quickly avoided the slow provider after warmup. The final cumulative traffic distribution was about fast=44%, medium=54%, and slow=1.8%, with failed=0. The slow provider's interval traffic dropped to 0 later in the test. The medium provider was repeatedly selected as the healthiest node because its reported remaining capacity was higher; it reached limiter_limitation=500, while the fast provider stayed around 200.
  • Conclusion: P2C can use adaptive service metrics to bias traffic away from unhealthy providers and toward nodes with higher remaining capacity. One caveat is that the 100ms provider received more traffic than the 20ms provider, which points to a possible HillClimbing expansion/parameter issue rather than a P2C selection failure.

Checklist

  • I confirm the target branch is develop
  • Code has passed local testing
  • I have added tests that prove my fix is effective or that my feature works

CAICAIIs and others added 7 commits May 25, 2026 09:43
* fix(config): remove legacy protocol timeout fallback

* fix(config): avoid default timeout allocation

* fix(config): preserve consumer timeout in reference config

* test(config): satisfy testifylint in reference timeout test

* fix(config): make protocol timeout default explicit

* fix(config): centralize consumer timeout default

* fix(config): keep consumer timeout default in global
* feat(test):Add TestGetAddressWithProtocolPrefixKeepsContext and find the error when user bring context path

* fix(apollo):Fix the test func(getAddressWithProtocolPrefix) fix(context_path):fix getAddressWithProtolPrefix didn't handle context path

* refator(config_center):cleanup-redundant-test

* fix(test): remove files that should not exist

* feat(): restore tests and add multiple cases

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* fix: fix the url.Path = / issue and add edge-case tests; fix code that did not conform to gofmt so that CI passes

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…apache#3345)

* fix(logger): sync dubbo-go logger facade in LoggerConfig.Init()

* fix(logger): sync dubbo-go logger facade in config_loader init()

* fix(logger): sync dubbo-go logger facade in initGlobalLogger()

* test(logger): verify dubbo-go facade is synced after logger initialization

* style(logger): fix import formatting
)

* refactor(logger): standardize logger format in graceful_shutdown

- Add [GracefulShutdown] prefix to all logger calls
- Remove decoration symbols (---) from log messages
- Change key format from "error: %v" / "--- %v" to "err=%v"
- Lowercase first letter of all log message bodies

* refactor(logger): standardize logger format in internal, metadata, metrics, otel

- Add module prefixes: [Internal], [Metadata], [MetadataRPC], [MetadataReport][Etcd/Nacos/Zookeeper], [Metrics], [Metrics][Probe/Prometheus/RPC], [OTel][Trace]
- Unify key format: "error: %v" / ": %v" / "err: %s" → "err=%v", "url: %s" → "url=%s"
- Lowercase first letter of all log message bodies
- Fix bug: logger.Error with non-string or extra args → logger.Errorf (listener.go, server.go, exporter.go)
- Fix bug: logger.Errorf with no format args → logger.Error (metadata_service.go)
- Fix bug: logger.Infof/Debugf with no format args → logger.Info/Debug (config.go, report.go)
- Fix: err.Error() + %s → err + %v (nacos/report.go x2)

* refactor(logger): standardize logger format in protocol directory

- Unify prefixes as [Protocol], [Dubbo], [Dubbo][Codec/Hessian2/Impl/Exporter/Invoker], [Dubbo3], [GRPC], [GRPC][Client/Server/Exporter/Invoker], [Jsonrpc], [Jsonrpc][Server/Exporter/Invoker], [ProtocolWrapper], [Rest], [Rest][Config/Exporter/Server], [Triple], [Triple][Client/Server/Exporter/Invoker/CORS/Codec/Handler/Negotiation/Protocol/Health/OpenAPI]
- Unify key format: "error: %v" / ": %v" / "error:{%v}" → "err=%v", "err: %v" → "err=%v"
- Lowercase first letter of all log message bodies
- Remove %+v format for non-Debug levels
- Fix bug: logger.Error with error type → logger.Errorf (dubbo_codec.go, dubbo_protocol.go)
- Fix bug: logger.Error/Info with extra args → logger.Errorf/Infof (rpc_status.go, jsonrpc/server.go)
- Fix bug: logger.Infof/Debugf without format args → logger.Info/Debug (multiple files)
- Fix bug: logger.Debug without format args → logger.Debugf (openapi/service.go)
- Fix: err.Error() + %s → err + %v (dubbo3_protocol.go)

* refactor(logger):standardize logger prefixes in metadata, metadata-report, Dubbo, and Triple protocol modules
@nagisa-kunhah nagisa-kunhah changed the title from "Feat/issue 3336 experiment dev" to "[FEATURE] Migrate and Enhance Adaptive Service Throttling in dubbo-go" on May 29, 2026
@codecov-commenter

codecov-commenter commented May 29, 2026

Codecov Report

❌ Patch coverage is 6.90537% with 1456 lines in your changes missing coverage. Please review.
✅ Project coverage is 50.80%. Comparing base (60d1c2a) to head (b6f0356).
⚠️ Report is 812 commits behind head on develop.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...e_test/adaptive_service/p2c_healthy/client/main.go | 0.00% | 370 Missing ⚠️ |
| ...ee_test/adaptive_service/rtt_shrink/client/main.go | 0.00% | 279 Missing ⚠️ |
| ...t/adaptive_service/protect_provider/client/main.go | 0.00% | 264 Missing ⚠️ |
| ...ee_test/adaptive_service/rtt_shrink/server/main.go | 0.00% | 127 Missing ⚠️ |
| ...ptive_service/protect_provider/proto/protect.pb.go | 0.00% | 78 Missing ⚠️ |
| ...e_test/adaptive_service/p2c_healthy/server/main.go | 0.00% | 75 Missing ⚠️ |
| ...t/adaptive_service/protect_provider/server/main.go | 0.00% | 69 Missing ⚠️ |
| ...e_service/protect_provider/proto/protect.triple.go | 0.00% | 31 Missing ⚠️ |
| protocol/rest/server/rest_server.go | 0.00% | 14 Missing ⚠️ |
| protocol/jsonrpc/server.go | 20.00% | 12 Missing ⚠️ |

... and 44 more
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3347      +/-   ##
===========================================
+ Coverage    46.76%   50.80%   +4.03%     
===========================================
  Files          295      500     +205     
  Lines        17172    39102   +21930     
===========================================
+ Hits          8031    19866   +11835     
- Misses        8287    17634    +9347     
- Partials       854     1602     +748     

@Alanxtl
Contributor

Alanxtl commented May 30, 2026

@sonarqubecloud

@nagisa-kunhah
Contributor Author

You can refer to Uber's implementation: https://www.infoq.com/news/2024/02/uber-dynamic-load-shedding/?utm_source=email&utm_medium=editorial&utm_campaign=SpecialNL&utm_content=02292024&forceSponsorshipId=58a6b10a-7b64-4cfd-a08d-c065e2458967

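@Alanxtl hello, could you also take a look at this? As mentioned in the PR description, when I tested the rtt_shrink case I found that as latency rose (from 20ms to 100ms, then to 500ms), the limiter's limit on inflight requests did not decrease noticeably and instead stayed at its previous level. I am not sure whether this matches the original expectation. The test code is under presee_test/adaptive_service/rtt_shrink.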

@Alanxtl
Contributor

Alanxtl commented May 31, 2026

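This looks more like it exposes an "insensitivity/parameter problem" in the current HillClimbing implementation, and it should not be treated as fully expected behavior.

The likely causes in the code are roughly as follows:

  • hill_climbing.go: the limiter computes maxCapacitytps from transactionNum/rttAvg only once per update round; it does not lower concurrency the moment RTT rises.
  • hill_climbing.go: the shrink condition requires both bestMaxCapacity - maxCapacity and the RTT degradation to meet hard-coded thresholds at the same time.
  • hill_climbing.go: when a shrink does happen, the limit is not reduced in proportion to RTT; it falls back to around bestLimitation - log(limitation), so the decrease can be very small.
  • rtt_shrink/server/main.go: the pressure test does raise handler RTT artificially through Sleep(currentStage.delay) on the server side.
  • rtt_shrink/server/main.go: the observed limiter_limitation is exposed directly from the provider-side limiter snapshot; it is not estimated by the client.

This may be a known limitation. In principle, adaptive concurrency should shrink when RTT clearly degrades and throughput stops improving. The current algorithm, however, is influenced by historical best metrics, hard-coded thresholds, the update interval, and the shrink step size, so the limit did not come down noticeably in the 20ms -> 500ms staged pressure test.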

Also, Uber's implementation is too complex. Use it as a reference only; there is no need to implement it.

Development

Successfully merging this pull request may close these issues.

[FEATURE] Migrate and Enhance Adaptive Service Throttling in dubbo-go 迁移并增强 dubbo-go 自适应限流能力
