feat(metadata): add concurrency safety for application-level metadata state (#3353)#3367
feat(metadata): add concurrency safety for application-level metadata state (#3353)#3367jieguo-coder wants to merge 9 commits into
Conversation
…rialization (apache#3353) Add sync.RWMutex to MetadataInfo struct with json:"-" / hessian:"-" tags to skip serialization. All mutating methods (AddService, RemoveService, AddSubscribeURL, RemoveSubscribeURL) acquire the write lock, and read methods (GetExportedServiceURLs, GetSubscribedURLs, GetServices) acquire the read lock. The new GetServices method returns a snapshot copy. Signed-off-by: jieguo-coder <1193249232@qq.com>
…apache#3353) Add sync.RWMutex to protect registryMetadataInfo in metadata.go and instances in report_instance.go. Extract getMetadataReportUnsafe helper to avoid reentrant RLock deadlock in GetMetadataReportByRegistry fallback. Fix nacos report_test to use pointer to MetadataInfo for json.Marshal. Signed-off-by: jieguo-coder <1193249232@qq.com>
…rnal calls (apache#3353) Add mutex locking to AddListenerAndNotify and RemoveListener to protect shared fields listeners and serviceUrls. Replace direct access to MetadataInfo.Services with safe GetServices method in OnEvent and convertV2 to prevent unprotected map reads. Signed-off-by: jieguo-coder <1193249232@qq.com>
|
our project use
|
Signed-off-by: jieguo-coder <1193249232@qq.com>
Signed-off-by: jieguo-coder <1193249232@qq.com>
Thanks for the guidance! @Alanxtl |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3367 +/- ##
==========================================
+ Coverage 52.40% 52.46% +0.06%
==========================================
Files 492 492
Lines 37785 37832 +47
==========================================
+ Hits 19800 19849 +49
+ Misses 16380 16379 -1
+ Partials 1605 1604 -1 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
| func (mts *DefaultMetadataService) GetMetadataInfo(revision string) (*info.MetadataInfo, error) { | ||
| if revision == "" { | ||
| return nil, nil | ||
| } | ||
| for _, metadataInfo := range mts.metadataMap { | ||
| if metadataInfo.Revision == revision { | ||
| return metadataInfo, nil | ||
| } | ||
| } | ||
| logger.Warnf("metadata not found for revision: %s", revision) | ||
| return nil, nil | ||
| } |
There was a problem hiding this comment.
DefaultMetadataService still races on registryMetadataInfo**
metadata_service.go:98, L110, L128
PR 给 registryMetadataInfo 增加了 registryMetadataLock,但 metadataService 仍持有同一个 map,并在 GetMetadataInfo / GetExportedServiceURLs / GetSubscribedURLs 里无锁遍历。注册或订阅并发写入 map 时,metadata service 查询仍会 race,正好是 issue #3353 要修的场景。
我用临时 race 测试复现了:
metadata.AddService() 在 metadata.go:50 写 map,同时 DefaultMetadataService.GetExportedServiceURLs() 在 metadata_service.go:110 迭代 map,race detector 报 WARNING: DATA RACE。
修法建议:不要让 DefaultMetadataService 直接遍历裸 map。可以加受锁保护的 snapshot helper,或者让 DefaultMetadataService 共享同一把锁并在遍历期间持 RLock。
| func AddSubscribeURL(registryId string, url *common.URL) { | ||
| registryMetadataLock.Lock() | ||
| if _, exist := registryMetadataInfo[registryId]; !exist { | ||
| registryMetadataInfo[registryId] = info.NewMetadataInfo( | ||
| url.GetParam(constant.ApplicationKey, ""), | ||
| url.GetParam(constant.ApplicationTagKey, ""), | ||
| ) | ||
| } | ||
| registryMetadataLock.Unlock() | ||
|
|
||
| registryMetadataInfo[registryId].AddSubscribeURL(url) | ||
| } |
There was a problem hiding this comment.
AddService / AddSubscribeURL 解锁后又无锁读取全局 map
metadata.go:55-L57, L68-L70
这两个函数只把 get-or-create 包在写锁里,但随后解锁,再执行 registryMetadataInfo[registryId].AddService(url) / AddSubscribeURL(url)。这仍然是对全局 map 的无锁读;只要另一个 goroutine 正在为其他 registryId 插入 map,就可能发生读写 race。
修法建议:在锁内取出 metadataInfo := registryMetadataInfo[registryId],解锁后只操作这个指针,不再访问全局 map。
Signed-off-by: jieguo-coder <1193249232@qq.com>
Signed-off-by: jieguo-coder <1193249232@qq.com>
bc20d95 to
e65f694
Compare
Signed-off-by: jieguo-coder <1193249232@qq.com>
| func (lstn *ServiceInstancesChangedListenerImpl) RemoveListener(serviceKey string) { | ||
| lstn.mutex.Lock() | ||
| delete(lstn.listeners, serviceKey) | ||
| lstn.mutex.Unlock() |
There was a problem hiding this comment.
Pull request overview
This PR adds synchronization to the application-level metadata path to prevent data races and concurrent map write panics introduced after the metadata refactor (#2534). It primarily protects shared global maps and metadata state that are accessed concurrently during registration/subscription flows and service discovery events.
Changes:
- Add
sync.RWMutexprotection toMetadataInfointernals and introduceGetServices()for safe external iteration. - Protect global metadata registries (
registryMetadataInfo, metadata reportinstances) withsync.RWMutex. - Add missing listener-map locking in service discovery’s instance-changed listener and update call sites to use
GetServices().
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| registry/servicediscovery/service_instances_changed_listener_impl.go | Locks listener mutation paths and switches service iteration to MetadataInfo.GetServices() to avoid unsafe map iteration. |
| registry/servicediscovery/service_instances_changed_listener_impl_test.go | Adds test coverage for listener removal behavior. |
| metadata/report/nacos/report_test.go | Updates test expectations for MetadataInfo pointer usage. |
| metadata/report_instance.go | Adds RWMutex protection around global metadata report instances map. |
| metadata/metadata.go | Adds RWMutex protection around global registryMetadataInfo map and makes get-or-create atomic. |
| metadata/metadata_test.go | Adds a concurrent access smoke test for Add/Read operations on global metadata. |
| metadata/metadata_service.go | Adds read-locking while iterating metadata map and uses GetServices() for V2 conversion. |
| metadata/metadata_service_test.go | Adds concurrent read-access test coverage for DefaultMetadataService. |
| metadata/info/metadata_info.go | Adds per-MetadataInfo RWMutex, locks map accessors/mutators, and introduces GetServices() snapshot method. |
| metadata/info/metadata_info_test.go | Adds tests validating GetServices() returns a snapshot copy. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // GetServices returns a copy of the Services map for safe iteration by external callers. | ||
| func (info *MetadataInfo) GetServices() map[string]*ServiceInfo { | ||
| info.mu.RLock() | ||
| defer info.mu.RUnlock() | ||
|
|
||
| cp := make(map[string]*ServiceInfo, len(info.Services)) | ||
| for k, v := range info.Services { | ||
| cp[k] = v | ||
| } | ||
| return cp | ||
| } |
| instancesMu.Lock() | ||
| instances[registryId] = &DelegateMetadataReport{instance: fac.CreateMetadataReport(url)} | ||
| instancesMu.Unlock() |
…ce and reduce lock granularity in report creation Signed-off-by: jieguo-coder <1193249232@qq.com>
|



Description
Summary
Fixes concurrency safety issues in the application-level metadata path by adding proper synchronization protection for multiple global maps and shared states.
Background
After the #2534 metadata refactor, application-level metadata is maintained through shared states such as local MetadataInfo, metadata report instances, and MetadataService. Currently, multiple core states are maps without clear synchronization protection. Data races, stale reads, or fatal error: concurrent map writes might occur when service registration, unregistration, subscription, unsubscription, instance changes, and metadata service queries happen concurrently.
Related Issue: Fixes #3353
Changes
Added a sync.RWMutex for the global map[string]*MetadataInfo.
The get-or-create phase in AddService / AddSubscribeURL is now executed atomically under a write lock to prevent race conditions.
GetMetadataInfo is protected by a read lock.
Concurrency Safety for Listeners (registry/.../service_instances_changed_listener_impl.go)
Added missing mutex protection for AddListenerAndNotify and RemoveListener to prevent concurrent reads/writes on listeners and serviceUrls against OnEvent.
Safe Access to Services Field
Replaced direct accesses to metadataInfo.Services in OnEvent and convertV2 with the new safe method GetServices().
Test Plan