Skip to content

feat(metadata): Clarify metadata report selection for multi-registry and multi-instance scenarios#3369

Open
XnLemon wants to merge 16 commits into
apache:developfrom
XnLemon:clarify-metadata-report-selection
Open

feat(metadata): Clarify metadata report selection for multi-registry and multi-instance scenarios#3369
XnLemon wants to merge 16 commits into
apache:developfrom
XnLemon:clarify-metadata-report-selection

Conversation

@XnLemon
Copy link
Copy Markdown

@XnLemon XnLemon commented Jun 4, 2026

Description

Fixes #3352


Background / 背景

The #2534 refactor centralized metadata logic and settled on an application-level metadata path where MetadataReport handles app-level MetadataInfo and service-app mapping. That refactor was correct in intent, but it left the selection semantics of MetadataReport under-specified for multi-registry and multi-instance deployments.

Concretely, when a process connects to more than one registry (e.g. Nacos + Zookeeper), each registry gets its own MetadataReport instance, keyed by registryId in the instances map. The pre-existing code had four independent correctness gaps in how it chose which report to use.

#2534 重构将 metadata 逻辑集中化,以应用级 MetadataInfo 和 service-app mapping 为核心路径。重构方向正确,但在多注册中心和多实例部署场景下,MetadataReport 的选择语义存在空白

具体来说,当进程同时连接多个注册中心(如 Nacos + Zookeeper)时,每个注册中心会在 instances map 中保有独立的 MetadataReport 实例。原有代码在选择使用哪个 report 时存在以下四个独立的正确性问题。


Problems Fixed / 修复的问题

1. GetMetadataReport() returned a non-deterministic report / 返回结果不确定

Go map iteration order is randomized. GetMetadataReport() returned whichever entry happened to come first — a different report on every call in some runs, making any downstream behavior non-reproducible.

Fix: prefer the "default" key; fall back to the lexicographically first registryId so the result is always stable.

Go map 迭代顺序随机,GetMetadataReport() 每次可能返回不同的 report。修复:优先返回 "default" 键;不存在时退回到字典序最小的 registryId,确保结果稳定。


2. GetMetadataReportByRegistry() silently fell back to the wrong registry's report / 静默返回错误注册中心的 report

When a caller passed a specific registryId that was not registered, the function silently fell back to GetMetadataReport() and returned an unrelated registry's report. In GetMetadataFromMetadataReport, this meant remote metadata could be fetched from the wrong backend with no error surfaced.

Fix: return nil + Warnf when a specific registryId is not found. The existing nil-guard in GetMetadataFromMetadataReport already surfaces this as an actionable error.

调用方传入未注册的 registryId 时,函数静默退回到其他注册中心的 report,且无任何报错。修复:找不到时返回 nil + Warnf,由调用方的 nil 检查给出明确错误。


3. ServiceNameMapping.Remove() only returned the last error / 只保留最后一个错误

Remove() fanned out to all reports but used a lastErr accumulator overwritten on each iteration. If the first report failed and the second succeeded, the error was silently discarded and the caller saw nil, leaving a stale mapping in the first registry.

Fix: collect all errors with errors.Join so the caller receives the full failure picture.

Remove() 向所有 report 扇出,但 lastErr 在每次循环中被覆盖。若第一个 report 失败、第二个成功,错误被静默丢弃。修复:使用 errors.Join 收集全部错误。


4. Revision calculation and remote metadata fetch used cross-registry data / Revision 计算和远端 metadata 拉取使用了跨注册中心数据

The revision customizers called the global GetMetadataService() which merges services across all registries, so two instances on different registries could compute the same revision even though their actual service sets differed — breaking consumer-side change detection.

The listener cache in GetMetadataInfo() was keyed by revision alone. Two registries producing the same revision string would collide, and the second registry's listener would silently receive the first registry's cached metadata.

Fix: thread registryId through createInstance(), both revision customizers, GetMetadataFromMetadataReport(), NewServiceInstancesChangedListener(), and GetMetadataInfo(). Change the cache key to registryId + ":" + revision.

Revision customizer 通过全局 GetMetadataService() 获取服务列表,合并了所有注册中心的数据,导致不同注册中心的实例可能计算出相同的 revision。监听器的 metadata 缓存仅以 revision 为键,多注册中心场景下会发生缓存污染。修复:将 registryId 贯穿传递到相关链路,缓存键改为 registryId + ":" + revision


Changes / 改动内容

File Change
metadata/report_instance.go GetMetadataReport() deterministic; GetMetadataReportByRegistry() returns nil for unknown id; add ClearMetadataReportInstances() for test isolation
metadata/client.go GetMetadataFromMetadataReport() takes registryId, routes to correct per-registry report
metadata/mapping/metadata/service_name_mapping.go Remove() collects all errors with errors.Join
registry/servicediscovery/service_instances_changed_listener_impl.go GetMetadataInfo() and NewServiceInstancesChangedListener() take registryId; cache key scoped to registryId + ":" + revision
registry/servicediscovery/service_discovery_registry.go createInstance() injects registryId into instance metadata; SubscribeURL passes registryId to listener
registry/servicediscovery/customizer/service_revision_customizer.go Read per-registry MetadataInfo; upgrade missing-registryId log to Errorf with accurate consequence description
registry/servicediscovery/customizer/metadata_service_url_params_customizer.go Restore TODO with explanation — per-registry MetadataService URL support is tracked separately

Tests Added / 新增测试

  • TestGetMetadataReportIsDeterministic — 20 iterations verify stability under Go's randomized map iteration
  • TestGetMetadataReportByRegistry / TestGetMetadataReportByRegistryFallsBackDeterministically — hit, miss, and empty-string paths
  • TestServiceNameMappingRemoveFansOutToAllReports — both reports called on success
  • TestServiceNameMappingRemoveCollectsAllErrors — both errors visible via errors.Is in joined return
  • TestServiceNameMappingRemoveContinuesAfterPartialFailure — loop does not short-circuit on first failure
  • TestExportedRevisionIsRegistryScoped / TestSubscribedRevisionIsRegistryScoped — different registries produce different revisions
  • TestExportedRevisionMissingRegistryIdYieldsZero / TestSubscribedRevisionMissingRegistryIdYieldsZero — missing RegistryIdKey yields revision "0", does not borrow another registry's data
  • TestGetMetadataFromMetadataReport (rewritten) — specific registryId routes to correct report; unknown registryId returns error, not wrong report
  • TestListenerUsesRegistryIdToFetchRemoteMetadata — end-to-end: registryId flows from NewServiceInstancesChangedListener through OnEventGetMetadataInfoGetMetadataFromMetadataReport → correct mock report called

#### TODO / 本 PR 还未做完

Multi-instance metadata service URL alignment (metadata_service_url_params_customizer.go): GetMetadataService() still returns a global singleton so all instances across registries receive the same metadata service URL.

多实例 metadata service URL 对齐:GetMetadataService() 仍返回全局单例,所有注册中心实例获得同一 metadata service URL。


Checklist

  • I confirm the target branch is develop
  • Code has passed local testing
  • I have added tests that prove my fix is effective or that my feature works

XnLemon added 12 commits June 3, 2026 16:40
…gistryId instead of silently falling back to wrong registry

When a caller provides a specific non-empty registryId that has no
registered report, the previous implementation silently fell back to
GetMetadataReport() and returned an arbitrary report from a different
registry. This undermined the per-registry scoping added in the earlier
commits: metadata for registry A could be fetched from registry B's
report without any indication.

Fix: return nil (with a Warnf log) when a specific registryId is not
found. The existing nil-guard in GetMetadataFromMetadataReport (client.go)
already surfaces this as an error to the caller, which is the correct
behaviour.

The empty-string path is unchanged: it still delegates to
GetMetadataReport() because callers that pass "" have no registry
context and should receive the stable default.

Also expose ClearMetadataReportInstances() to allow cross-package test
isolation of the package-level instances map.
…of returning only the last one

The previous implementation used a lastErr variable that was overwritten
on each iteration. If the first report failed and the second succeeded,
the error was silently discarded and the caller saw no error despite a
partial failure — leaving a stale mapping in the first registry.

Fix: collect all errors with errors.Join so the caller receives the full
failure picture. The loop continues past individual failures (best-effort
fan-out) so a transient error in one registry does not prevent removal
from the others.

Add a comment explaining the intentional behavioural difference from
Map(), which is fail-fast: Map stops at the first report failure, while
Remove continues and collects all errors so every reachable registry gets
the removal attempt.
Three gaps identified in PR review, all addressed:

1. metadata/client_test.go — rewrite TestGetMetadataFromMetadataReport
   - Add sub-test: specific registryId routes to its own report (not default)
   - Add sub-test: unknown registryId returns error, not silent wrong-registry result
   - Each sub-test now resets instances independently to prevent state leakage

2. registry/servicediscovery/customizer/service_revision_customizer_test.go
   - Add TestExportedRevisionMissingRegistryIdYieldsZero: instance with no
     RegistryIdKey gets revision "0" and does not borrow another registry's
     service list
   - Add TestSubscribedRevisionMissingRegistryIdYieldsZero: same for subscribed
   - Add t.Cleanup(ClearMetadataReportInstances) + unique registry-id prefixes
     to all four tests to prevent cross-test global map pollution

3. registry/servicediscovery/service_instances_changed_listener_impl_test.go
   - Add TestListenerUsesRegistryIdToFetchRemoteMetadata: creates a listener
     with registryId="remote-reg-test", registers a mock MetadataReport under
     that id, triggers OnEvent with a RemoteMetadataStorageType instance, and
     asserts via mock.AssertExpectations that the correct per-registry report
     was called — end-to-end proof that registryId threads from
     NewServiceInstancesChangedListener through to GetMetadataFromMetadataReport
- metadata/report_instance_test.go: translate inline comments and
  testify assertion messages in TestGetMetadataReportIsDeterministic
  from Chinese to English, consistent with the rest of the file

- registry/servicediscovery/customizer/service_revision_customizer.go:
  translate Customize doc-comments for both exported and subscribed
  customizers from Chinese to English

- registry/servicediscovery/service_instances_changed_listener_impl_test.go:
  strip the UTF-8 BOM (EF BB BF) that was introduced at line 1;
  Go toolchain accepts BOMs but they are not idiomatic and cause
  spurious diffs in some editors and CI tools
… accurate consequence

The previous Warnf said 'revision will be empty', which is wrong on two
counts: the actual computed revision is "0" (not empty), and revision
"0" causes OnEvent to silently skip the instance entirely, making it
permanently invisible to all consumers.

Upgrade to Errorf and describe the actual outcome so operators can
diagnose misconfigurations immediately. Applies to both exported and
subscribed revision customizers.
…nt cross-registry metadata poisoning

The cache in GetMetadataInfo was keyed by revision string only. In a
multi-registry setup, if two registries happen to produce the same
revision value (e.g. identical service sets yielding the same CRC, or
test fixtures using literal revision strings), the second registry's
listener would get a cache hit and return MetadataInfo fetched via a
different registry's report — silently serving wrong endpoints with no
error surfaced.

Fix: use registryId+":"+revision as the cache key for both Get and Set,
so each registry's metadata is cached independently.

Update all test helpers and direct metaCache.Set/Delete calls to use
the same composite key (with constant.DefaultKey as the registryId for
all existing single-registry test listeners).
…alignment

The previous commit removed the TODO without fixing the underlying issue.
GetMetadataService() still returns a global singleton, so all instances
across registries receive the same metadata service URL regardless of
which registry they belong to.

Restore the TODO with a more detailed explanation so the gap is visible
to future contributors. The fix requires per-registry MetadataService
support and is out of scope for this PR.
@XnLemon XnLemon changed the title Fixes apache/dubbo-go#3352: Clarify metadata report selection for multi-registry and multi-instance scenarios fix(metadata): Clarify metadata report selection for multi-registry and multi-instance scenarios Jun 4, 2026
@XnLemon XnLemon changed the title fix(metadata): Clarify metadata report selection for multi-registry and multi-instance scenarios feat(metadata): Clarify metadata report selection for multi-registry and multi-instance scenarios Jun 4, 2026
@XnLemon XnLemon marked this pull request as ready for review June 4, 2026 09:55
@codecov-commenter
Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 96.36364% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.64%. Comparing base (60d1c2a) to head (6cdb7a6).
⚠️ Report is 815 commits behind head on develop.

Files with missing lines Patch % Lines
metadata/report_instance.go 84.61% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3369      +/-   ##
===========================================
+ Coverage    46.76%   52.64%   +5.88%     
===========================================
  Files          295      492     +197     
  Lines        17172    37896   +20724     
===========================================
+ Hits          8031    19952   +11921     
- Misses        8287    16333    +8046     
- Partials       854     1611     +757     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Alanxtl
Copy link
Copy Markdown
Contributor

Alanxtl commented Jun 7, 2026

看下没和#3362重复吧

Copy link
Copy Markdown
Contributor

@Alanxtl Alanxtl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings

  • [P1] Standalone metadata-report configs can no longer be used by named registries.
    In PR head, GetMetadataReportByRegistry returns nil for any non-empty unknown registryId. But standalone metadata-report config is still registered under default in config/metadata_config.go, while registry URLs get ids like nacos/zk. That means serviceDiscoveryRegistry.metadataReport becomes nil in service_discovery_registry.go, and remote metadata publish/fetch fails for a common config shape: named registry plus separate metadata-report. Need either preserve default as an explicit global fallback or add config/API support to bind standalone metadata-report to registry ids.

  • [P2] #3352’s mapping-read problem is still not fixed.
    ServiceNameMapping.Get still reads from GetMetadataReport() only, so in multi-report setups it now deterministically picks default or lexicographically first. That makes the old random behavior stable, but it still does not use the subscribing registry’s report, nor does it fan out consistently with Map/Remove. The issue explicitly calls out “mapping writes all reports, reads one report”; this PR only changes Remove.
    on is resolved.

Comment on lines +108 to +109
metadataReports := metadata.GetMetadataReports()
if len(metadataReports) == 0 {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
metadataReports := metadata.GetMetadataReports()
if len(metadataReports) == 0 {
metadataReports := metadata.GetMetadataReports()
if metadataReport == nil || len(metadataReports) == 0 {

nil不需要处理了吗

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetMetadataReports() 这个方法返回的是一个切片 里面有make([]report.MetadataReport, len(instances)) 所以不返回nil 顶多可能抛个空切片而已
之前的GetMetadataReport()全局 instances map 是空的话就会返回nil 所以一开始才兜了 给它加上也行

Comment on lines +45 to +49
// ClearMetadataReportInstances resets the package-level instances map.
// Intended for test isolation only; do not call in production code.
func ClearMetadataReportInstances() {
instances = make(map[string]report.MetadataReport)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个函数移到test里面就行了吧

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

因为新增了跨包测试文件 举个例子 比如 TestListenerUsesRegistryIdToFetchRemoteMetadata 要验证 registryId 从 NewServiceInstancesChangedListener → OnEvent → GetMetadataInfo → GetMetadataFromMetadataReport → 最终调到正确的那个 report 实例。这条链路跨了 3 个 package和 5 个函数 中间每一步都读全局 instances
需要覆盖的话就得加上 不覆盖的话测对应每一层不代表连起来就能成 所以还是加上了(
您怎么看owo (

@XnLemon
Copy link
Copy Markdown
Author

XnLemon commented Jun 7, 2026

Findings

第一个修了 目前可以正常fallback
第二个 模型统一 把get改为拿所有report然后遍历查求并集合并结果返回result 中间如果有一个挂了记录错误但不影响继续 只有全挂了才报错

@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented Jun 7, 2026

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

4 participants