feat(metadata): strengthen service-app mapping consistency, retry and…#3373

Open
NeverENG wants to merge 3 commits into apache:develop from NeverENG:feat/3354-mapping-consistency-cas

Conversation

@NeverENG NeverENG commented Jun 7, 2026

Description

Fixes #3354

What this PR does

Fixes #3354. Hardens application-level service-app mapping (interface -> app names)
registration so it is correct under concurrent providers, and gives it a proper retry policy.
It improves write consistency, retry, and deduplication for application-level service-app mapping.

Background

The mapping value is a comma-separated set of application names stored under a single
interface key, shared by all providers of that interface. Registration is therefore a
read-modify-write, and the previous implementations had several reliability gaps.

Changes

1. Optimistic concurrency across all backends (no more lost updates)

Concurrent appends no longer clobber each other:

  • etcd: Get+Put -> GetValAndRev + UpdateWithRev (CAS on ModRevision), Create for first write.
  • zookeeper: keeps versioned SetContent, now surfaces version conflicts instead of swallowing them.
  • nacos: adds CasMd5 optimistic lock.

Each backend wraps its native conflict (ErrCompareFail / ErrBadVersion / ErrNodeExists /
nacos publish failure) into a shared report.ErrMappingCASConflict sentinel via %w.
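
The wrapping pattern described above can be sketched as follows. This is a minimal illustration, not the PR's code: `ErrMappingCASConflict` here is a local stand-in for `report.ErrMappingCASConflict`, and `errBadVersion`, `wrapConflict`, and `isConflict` are hypothetical names introduced only for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrMappingCASConflict is a local stand-in for the shared sentinel
// (report.ErrMappingCASConflict in the PR).
var ErrMappingCASConflict = errors.New("service-app mapping CAS conflict")

// errBadVersion is a hypothetical stand-in for a backend's native conflict
// error, such as a zookeeper version mismatch.
var errBadVersion = errors.New("zk: version conflict")

// wrapConflict marks a native conflict as a mapping CAS conflict via %w and
// keeps the native error's text for logging.
func wrapConflict(backend string, native error) error {
	return fmt.Errorf("%s: %w: %v", backend, ErrMappingCASConflict, native)
}

// isConflict reports whether err is, or wraps, the shared sentinel.
func isConflict(err error) bool {
	return errors.Is(err, ErrMappingCASConflict)
}

func main() {
	err := wrapConflict("zookeeper", errBadVersion)
	fmt.Println(isConflict(err))                        // true
	fmt.Println(isConflict(errors.New("dial timeout"))) // false
}
```

Because callers only test for the sentinel, the retry layer does not need to know which backend produced the conflict.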

2. Graded retry (was: fixed loop, no backoff)

registerWithRetry retries only CAS conflicts (errors.Is) with exponential backoff + jitter,
and returns permanent errors (network/auth) immediately instead of burning the whole retry budget.
Previously, every error was retried 10 times in a tight loop with no sleep; retries are now graded by error type.
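
A minimal sketch of this kind of graded retry, assuming a "full jitter" backoff. The function name `retryOnConflict`, its parameters, the `simulate` helper, and the delay values are all assumptions made for illustration; the PR's `registerWithRetry` may differ in its signature and policy.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// errCASConflict is a local stand-in for the shared conflict sentinel.
var errCASConflict = errors.New("mapping CAS conflict")

// retryOnConflict calls op up to maxAttempts times. Only CAS conflicts are
// retried; any other error is returned immediately.
func retryOnConflict(maxAttempts int, base, maxDelay time.Duration, op func() error) error {
	var err error
	delay := base
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		if !errors.Is(err, errCASConflict) {
			return err // permanent error: do not spend the retry budget
		}
		if attempt == maxAttempts {
			break
		}
		if delay > 0 {
			// Full jitter: sleep a random duration in [0, delay).
			time.Sleep(time.Duration(rand.Int63n(int64(delay))))
		}
		// Exponential growth, capped at maxDelay.
		if delay *= 2; delay > maxDelay {
			delay = maxDelay
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", maxAttempts, err)
}

// simulate runs retryOnConflict against an op that either fails permanently
// or conflicts a fixed number of times, and reports how often op was called.
func simulate(conflicts int, permanent bool) (int, error) {
	calls := 0
	err := retryOnConflict(5, time.Millisecond, 8*time.Millisecond, func() error {
		calls++
		if permanent {
			return errors.New("auth failed")
		}
		if calls <= conflicts {
			return errCASConflict
		}
		return nil
	})
	return calls, err
}

func main() {
	fmt.Println(simulate(2, false)) // 3 <nil>
	fmt.Println(simulate(0, true))  // 1 auth failed
}
```

Jitter matters here because many providers of the same interface tend to start together; without it, writers that conflict once would retry in lockstep and conflict again.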

3. Extract shared logic + fix two hidden bugs

  • report.MergeServiceAppMapping: whole-element dedup. Fixes the strings.Contains substring
    false positive (registering order was wrongly treated as present when order-service existed)
    and the leading-comma bug ("" + "," + app -> ",app").
  • report.DecodeServiceAppNames: parse into a set, skipping empty elements.
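
The two fixes above can be illustrated with a small sketch. `decodeAppNames` and `mergeAppName` are hypothetical helpers written for this example, not the PR's `report.DecodeServiceAppNames` and `report.MergeServiceAppMapping`; in particular, this version returns an ordered slice rather than a set so that the existing order is kept.

```go
package main

import (
	"fmt"
	"strings"
)

// decodeAppNames splits a comma-separated mapping value into unique,
// non-empty app names, preserving first-seen order.
func decodeAppNames(value string) []string {
	seen := make(map[string]struct{})
	var names []string
	for _, name := range strings.Split(value, ",") {
		name = strings.TrimSpace(name)
		if name == "" {
			continue // skip empty elements such as ",app" or "app,,other"
		}
		if _, dup := seen[name]; dup {
			continue
		}
		seen[name] = struct{}{}
		names = append(names, name)
	}
	return names
}

// mergeAppName adds app to oldValue using whole-element comparison. It
// returns the new value and whether a write is needed.
func mergeAppName(oldValue, app string) (string, bool) {
	if app == "" {
		return oldValue, false
	}
	names := decodeAppNames(oldValue)
	for _, n := range names {
		if n == app {
			return oldValue, false // already registered: nothing to write
		}
	}
	return strings.Join(append(names, app), ","), true
}

func main() {
	fmt.Println(mergeAppName("order-service", "order")) // order-service,order true
	fmt.Println(mergeAppName("", "order"))              // order true
	fmt.Println(mergeAppName("a,b", "b"))               // a,b false
}
```

The first call is the substring regression: a `strings.Contains` check would report `order` as present because `order-service` contains it. The second call is the empty-value case that used to produce a leading comma.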

4. Listener cleanup

  • zookeeper: implemented removal via CacheListener.RemoveKeyListeners (was a silent return nil
    that leaked listeners).
  • etcd: documents the mapping listener as unsupported instead of silently succeeding.

5. Tests

  • Unit tests for the merge/decode helpers (incl. the substring and empty-value regressions).
  • A concurrency test that reproduces the lost-update bug with the naive read-modify-write and
    proves CAS preserves every writer (200 writers / 20 concurrent readers). Passes under -race.
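
The idea behind the concurrency test can be sketched with an in-memory store. `casStore` is a hypothetical stand-in for a versioned backend key (an etcd ModRevision or a zookeeper node version), not the PR's test fixture, and the writers retry in a tight loop only because the store is local.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"sync"
)

var errConflict = errors.New("CAS conflict")

// casStore is a hypothetical in-memory key with a revision counter.
type casStore struct {
	mu    sync.Mutex
	value string
	rev   int64
}

// get returns the current value together with its revision.
func (s *casStore) get() (string, int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.value, s.rev
}

// compareAndSet writes value only if the key is still at revision rev.
func (s *casStore) compareAndSet(value string, rev int64) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.rev != rev {
		return errConflict
	}
	s.value = value
	s.rev++
	return nil
}

// appendApp appends app to a comma-separated value without a leading comma.
func appendApp(old, app string) string {
	if old == "" {
		return app
	}
	return old + "," + app
}

// registerAll starts n (n > 0) concurrent writers, each registering a
// distinct app via read-modify-write guarded by CAS, and returns how many
// app names ended up in the stored value.
func registerAll(n int) int {
	s := &casStore{}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(app string) {
			defer wg.Done()
			for {
				old, rev := s.get()
				if s.compareAndSet(appendApp(old, app), rev) == nil {
					return
				}
				// Another writer won the race: re-read and try again.
			}
		}(fmt.Sprintf("app-%d", i))
	}
	wg.Wait()
	value, _ := s.get()
	return len(strings.Split(value, ","))
}

func main() {
	fmt.Println(registerAll(200)) // 200
}
```

Replacing `compareAndSet` with an unconditional write reproduces the lost-update behavior: two writers that read the same old value each write back a value that omits the other's app.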

Known limitation (documented in code)

Nacos CasMd5 is an optimistic UPDATE and cannot guard the first INSERT (Nacos has no
create-if-absent primitive), so the initial concurrent registration of a brand-new interface can
still race. etcd and zookeeper are not affected. Left as a documented limitation; can be revisited
if Nacos exposes a SETNX-style primitive.

Test

go test -race ./metadata/report/... ./metadata/mapping/...

Checklist

  • I confirm the target branch is develop
  • Code has passed local testing
  • I have added tests that prove my fix is effective or that my feature works

NeverENG force-pushed the feat/3354-mapping-consistency-cas branch from 467a8d8 to 1ad53bd on June 7, 2026 04:10
… dedup (apache#3354)

Make interface-to-app mapping registration safe under concurrent providers and
give it a proper retry policy.

- Optimistic concurrency across all backends so concurrent appends no longer
  clobber each other: etcd (GetValAndRev + UpdateWithRev), zookeeper (versioned
  SetContent), nacos (CasMd5). Each backend wraps its native conflict
  (ErrCompareFail / ErrBadVersion / ErrNodeExists / nacos publish failure) into
  the shared report.ErrMappingCASConflict sentinel via %w.
- Graded retry: registerWithRetry retries only CAS conflicts (errors.Is) with
  exponential backoff + jitter, and returns permanent errors immediately
  instead of burning the whole retry budget.
- Extract shared logic: report.MergeServiceAppMapping (whole-element dedup,
  fixing the strings.Contains substring false positive and the leading-comma
  bug on empty values) and report.DecodeServiceAppNames (skips empty elements).
- Listener cleanup: zookeeper removal via CacheListener.RemoveKeyListeners;
  etcd documents the listener as unsupported instead of silently succeeding.
- Tests: helper unit tests plus a concurrency test that reproduces the
  lost-update bug and proves CAS preserves every writer (200 writers /
  20 readers, passes under -race).

Known nacos-only limitation (documented in code): CasMd5 is an optimistic
UPDATE and cannot guard the first INSERT, so the initial concurrent
registration of a brand-new interface can still race. etcd and zookeeper are
not affected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor

Alanxtl commented Jun 7, 2026

Please check first: this shouldn't be a duplicate of #3371, right?

Comment thread metadata/report/nacos/report.go
Comment thread metadata/report/zookeeper/listener.go
NeverENG and others added 2 commits June 7, 2026 16:39
…#3354)

- nacos: stop swallowing the getConfig read error. On a failed read the old
  value was treated as empty, so registration would publish only the current
  app and overwrite an existing set (e.g. appA,appB -> appC). Return the error
  instead so an existing mapping is never clobbered. A genuinely absent config
  still returns ("", nil) and takes the first-write path.
- zookeeper: CacheListener.DataChange now builds the set via
  report.DecodeServiceAppNames, so mapping change events no longer surface
  empty app names from legacy/malformed comma-separated values (",app",
  "app,,other"). Added a listener test covering this.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…g registration

The previous commit returned any getConfig error from RegisterServiceAppMapping.
Nacos signals a never-written key with a "config data not exist" error (not an
empty value), so the first registration of a fresh interface failed and the
provider panicked on service export (broke the registry/nacos integration test).

Only treat genuine read failures (network/auth/server) as errors; the not-found
signal is handled as an empty old value so the first write can create the key.
Detection mirrors config_center/nacos's isConfigNotExistErr.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
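
The read-error handling described in the last two commits can be sketched roughly as below. This is an illustration under stated assumptions: `isNotExistErr` and `readOldValue` are hypothetical names, and matching on the "config data not exist" text follows the commit message rather than a verified Nacos SDK contract.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Example errors used for the demonstration only.
var (
	errNotExist = errors.New("config data not exist")
	errTimeout  = errors.New("i/o timeout")
)

// isNotExistErr reports whether err looks like the "never written" signal.
// Matching on the message text is an assumption based on the commit message.
func isNotExistErr(err error) bool {
	return err != nil && strings.Contains(err.Error(), "config data not exist")
}

// readOldValue normalizes the result of a config read: a missing key becomes
// an empty old value (first-write path), while any other failure is returned
// so that an existing mapping is never overwritten.
func readOldValue(value string, err error) (string, error) {
	if err == nil {
		return value, nil
	}
	if isNotExistErr(err) {
		return "", nil
	}
	return "", fmt.Errorf("read mapping: %w", err)
}

func main() {
	fmt.Println(readOldValue("appA,appB", nil)) // appA,appB <nil>
	fmt.Println(readOldValue("", errNotExist))  // first write: empty old value, nil error
	fmt.Println(readOldValue("", errTimeout))   // genuine failure: error is returned
}
```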

sonarqubecloud Bot commented Jun 7, 2026

Quality Gate failed

Failed conditions
2 Security Hotspots

See analysis details on SonarQube Cloud

Development

Successfully merging this pull request may close these issues.

[FEATURE] Improve service-app mapping consistency, retry, and deduplication / 加强 service-app mapping 的一致性、重试与去重

2 participants