fix(agent): shared-model M1 — membership-gate auth fix + invoke timeouts (B1+B2) by branarakic · Pull Request #1158 · OriginTrail/dkg

branarakic · 2026-06-14T10:48:55Z

Follow-up hardening on top of #1157 (feat/llm-sharing-module). Scoped to the two highest-leverage, lowest-risk fixes from the MVP→v1 review; partial M1 (cost-abuse items B3 still to come).

B1 — membership-gate auth bypass (security)

isPeerContextGraphMember previously ran a bespoke SPARQL that UNIONed allowedDelegateePeer and the self-asserted, pre-approval delegationDelegateePeer, filtering on neither approval-status nor expiry. Net effect: a peer that merely submitted a join request (or was later rejected/removed) could invoke the curator's billed model.

Fix: delegate to the canonical getContextGraphAllowedDelegateePeers (the approved-only, expiry-filtered set already used by the sync-auth path) and match the transport-authenticated fromPeerId against it. A pending / rejected / removed peer is now correctly denied (a removed peer's did:dkg:agent-delegation:* subject is deleted by removeAgentFromContextGraph, so it drops out of the set).

B2 — invoke timeouts (usability)

The remote invoke used messenger.sendReliable(...) with no timeout, inheriting the router default (DEFAULT_SEND_TIMEOUT_MS, 20s) for the whole round trip — shorter than real LLM latency, so normal completions returned a false curator node unreachable.

Member side: pass { timeoutMs: invokeTimeoutMs } to sendReliable (default 120s, with a constant fallback for pure member nodes that have no shared-model config).
Curator side: the provider fetch now gets AbortSignal.timeout(providerTimeoutMs) (default 110s, deliberately tighter than the transport budget so the curator returns a structured error first). A TimeoutError maps to provider timeout after <n>ms → surfaced as {ok:false, denied:'provider error: provider timeout...'}.
New optional config: sharedModel.invokeTimeoutMs / sharedModel.providerTimeoutMs.

Tests

New test/shared-model/membership.test.ts (5): approved→allow, pending-only→deny, removed→deny, expired→deny, multi-agent flatten.
New test/shared-model/client-provider-timeout.test.ts (2): abort→provider timeout after <n>ms; abort signal attached only when configured.
Full agent suite green (pnpm --filter @origintrail-official/dkg-agent test — 1240 passed / 0 failed); agent + CLI tsc clean.

Not in scope (later milestones)

Token/temperature clamp + per-peer rate limit + quota-consume reordering + audit log (rest of M1); _meta-sync reliability (M2); streaming / tool-calls / model-select / error-code mapping (M3); metering (M5).

🤖 Generated with Claude Code

B1 — membership gate (auth bypass fix) isPeerContextGraphMember no longer runs a bespoke SPARQL UNION over allowedDelegateePeer AND the self-asserted, pre-approval delegationDelegateePeer. It now reuses the canonical getContextGraphAllowedDelegateePeers helper (WorkspaceCryptoMethods mixin), which returns ONLY curator-approved, unexpired delegatee peers. A pending, rejected, expired, or removed peer is therefore correctly denied (removeAgentFromContextGraph deletes the did:dkg:agent-delegation:* subject the helper reads). New membership.test.ts pins approved=allow, pending/removed/expired=deny. B2 — invoke timeout (transport + provider) The member's remote invoke previously inherited the 20s router default for the whole LLM round trip, surfacing a false "curator node unreachable" on normal completions. Added invokeTimeoutMs/providerTimeoutMs config knobs (CLI SharedModelConfig + SharedModelRuntimeConfig), defaulting to 120000/110000 in the daemon lifecycle (provider deadline tighter than transport). Member side passes { timeoutMs } to sendReliable (falling back to 120000 when the node has no shared-model state). Curator side adds signal: AbortSignal.timeout(providerTimeoutMs) to the provider fetch and maps a TimeoutError to a clear "provider timeout after <n>ms" Error. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Unit-test the curator-side provider deadline added in B2: a fetch aborted by AbortSignal.timeout maps to a deterministic `provider timeout after <n>ms` error, and the abort signal is attached to fetch only when providerTimeoutMs is configured. Closes the one untested branch in the B1+B2 patch. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

scripts/devnet-test-shared-model.sh drives a 2-node devnet through the full faithful path (private CG -> model/share -> genuine sign/request/approve join -> _meta sync -> member P2P invoke + OpenAI-compatible invoke) and asserts the M1 fixes: - B1: after the curator removes the member, the member's invoke is DENIED ("requester is not a member") — the pre-patch gate let removed/pending peers through. Fails loudly if a completion still comes back. - B2 (opt-in, TEST_B2=1): stands up a local slow openai-compatible stub (>20s, the old router default) and asserts the invoke still returns, exercising the threaded invoke/provider timeouts. devnet.sh gains a DEVNET_SHARED_MODEL env-gated sharedModel config block (mirroring DEVNET_NO_AUTH etc.) so restart-node regenerates node config WITH sharing enabled; the test script uses it to (re)configure the curator node. Validated: bash -n + embedded-python checks pass. Not yet run against a live devnet in this session. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…e-mesh wait) Validated the shared-model e2e end-to-end on a live 2-node devnet and fixed two runtime bugs that broke it (both in the test harness, not the product): 1. restart-node never regenerated config.json, so ensure_curator_config's DEVNET_SHARED_MODEL=1 + provider env never took effect — start_node only reads the existing config and patches relay/bootstrapPeers. The curator booted without a sharedModel block and every invoke failed with "curator node has no shared model configured". Fix: cmd_restart_node now re-runs create_node_config before start_node (persisted store/identity/ auth.token are untouched). 2. After the curator restart, the join handshake raced the curator's libp2p protocol re-advertisement: the one-shot post-approval _meta sync landed while the member still saw "does not support sync protocol", the targeted sync failed, and the self-heal catch-up SKIPPED the brand-new CG ("unauthorized or unconfirmed"). The grant/_meta never reached the member, so invoke was denied with "could not resolve curator peer". This was intermittent (mock won the race, B2 lost it). Fix: ensure_curator_config now waits for the member to re-peer with the curator (+settle margin for the identify-push) before driving membership. Both mock and TEST_B2=1 runs now pass deterministically; the B2 slow invoke survives the 25s upstream and returns slow-pong (~25s elapsed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… e2e - devnet.sh: DEVNET_SHARED_MODEL hook now passes through DEVNET_SHARED_MODEL_API_KEY_ENV -> sharedModel.apiKeyEnv, so an openai-compatible devnet curator can be given a real key via an env var (the key itself is never written to config). - devnet-test-shared-model.sh: REAL=1 mode for a happy-path smoke against a real provider you configured on node1 — implies SKIP_CONFIG (never bakes a key), relaxes the mock-echo assertion to "ok + non-empty", prints the real completion, and skips the B1 removal so the member stays joined for further manual calls. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

REAL mode skips the script's own restart, but the operator just restarted node1 to point it at the real provider; wait for the member to re-peer with the curator before the join handshake (same race the mock/B2 path guards against). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-14T12:17:52Z

+  # effect on restart — start_node only reads the existing config and patches
+  # relay/bootstrapPeers, it does NOT re-run create_node_config. Persisted store
+  # state, identity and the auth.token file live elsewhere and are untouched.
+  create_node_config "$node_num"


🔴 Bug: create_node_config() does more than regenerate config.json here: it also rewrites wallets.json with fresh random operational keys. That means restart-node no longer preserves the node's agent/on-chain identity, and any flow that caches the pre-restart address/allowlist (including the new shared-model devnet script) can start talking to stale identities after a restart. Please avoid calling the full node bootstrap here, or make the restart path preserve existing wallet/state files.

github-actions · 2026-06-14T12:17:52Z

        apiKey,
        dailyRequestQuotaPerAgent: sm.dailyRequestQuotaPerAgent ?? 200,
        maxPromptChars: sm.maxPromptChars ?? 8000,
+        invokeTimeoutMs: sm.invokeTimeoutMs ?? 120_000,


🟡 Issue: these new timeout knobs are applied without any validation, but the B2 behavior depends on providerTimeoutMs being a positive value that is strictly smaller than invokeTimeoutMs. If an operator sets them equal or inverted, the member will fall back to the original transport timeout and surface curator node unreachable again. Validate/reject invalid combinations before calling configureSharedModel.

…ocal path Security hardening of the curator shared-model invoke gate (Codex #1/#2/#3). #1 Per-agent membership binding (remote path). Previously the curator authorized any peer present in the UNION of all agents' approved delegatee-peer sets, so any approved member's node could invoke on behalf of an unapproved (or pending) agent. Now the request carries an UNTRUSTED `agentAddress` claim (added to SharedModelInvokeRequest + wire codec, which rejects a non-string value) and the curator authorizes iff the claimed agent's curator-APPROVED delegatee-peer set contains the libp2p-authenticated `fromPeerId` — mirroring the sync auth path's `requesterAgentAddress` binding. `isPeerContextGraphMember` (union) is replaced by `isAgentPeerContextGraphMember(cg, agentAddress, fromPeerId)`. A missing/empty `agentAddress` is DENIED (no union fallback). This defeats pending requesters, peer impersonation, and multi-agent inheritance. #2 Caller-scoped local fast-path. The local path no longer hardcodes `isMember:true`. `isContextGraphCuratorSelf` now enumerates ALL local agents (not just the default) to decide whether the local path applies; the caller is then authorized via `isCallerLocalCgMember`, which allows only the CG owner/curator (via the canonical `isCallerOrNodeOwner`, preserving legacy peerId-owner compat) or an approved member of the CG. A different local agent that is neither is DENIED in-branch and never falls through to dial self. #3 Quota consumed only after authorization. Decision input now uses a non-mutating `remaining(key) > 0` peek; the mutating `allow(key)` is called exactly once, after `decideSharedModelAuthorization` returns ok, in BOTH the remote handler and the local fast-path. A request denied for any other reason no longer burns the member's daily allowance. Tests: rewrote membership.test.ts for per-agent binding (the old "flattened union" test pinned the removed vulnerability); added invoke-handler.test.ts (curator-side binding + quota) and local-path.test.ts (owner/member/outsider/node-token/legacy-peerId-owner + quota). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…artial success Codex #4/#5 on the daemon shared-model routes. #4 Validate message shapes in the routes. The native /model/invoke route trusted `body.messages` after only an array check, so `[{"role":"user"}]` or `[{"content":1}]` threw deep in the provider call and surfaced as a 500. Added `validateSharedModelMessages` (role in system|user|assistant, content is a string) returning a clear 400; wrapped the unguarded `JSON.parse` so a malformed body is a 400 too. The OpenAI /v1/chat/completions route now returns an `openAiErrorBody(...)` 400 on a malformed body instead of a 500. maxTokens/temperature are coerced to finite numbers. #5 invite-with-model partial success. The route applied the invite then enabled sharing in one try/catch, returning 400 even when membership had already been granted (the invite succeeded). Now: invite failure → 400 (nothing applied, safe to retry); invite OK but share failure → 200 with `{ ok:true, modelShared:false, modelShareError }` so callers don't retry against partially-applied state; full success → 200 modelShared:true. Tests: daemon-shared-model-routes.test.ts (native + openai validation, invite partial/full/invite-failure/no-share paths). Residual (out of scope, noted): the /model/share route's `JSON.parse` is still unguarded; #4 names only the invoke/openai routes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-14T12:57:00Z

      return encodeInvokeResponse({ ok: false, denied: decision.denied });
    }
+    // Authorized: consume exactly one quota unit before the provider call.
+    st!.quota.allow(fromPeerId);


🔴 Bug: This is now a racy peek-then-consume flow. Two concurrent requests from the same member can both see remaining(...) > 0, both pass decideSharedModelAuthorization, and then both continue even though the second allow() would return false. Please make quota reservation atomic again (for example, check the boolean result here and in the local-path st!.quota.allow(selfKey) branch before calling the provider).

github-actions · 2026-06-14T12:57:00Z

+    // also stops `isCallerOrNodeOwner(owner, undefined)` below from taking its
+    // permissive no-caller branch (which would authorise a node-level token
+    // whenever the owner is one of the node's own identities).
+    if (!callerAgentAddress) return false;


🔴 Bug: callerAgentAddress is still optional on the public invokeContextGraphModel(contextGraphId, messages, opts?, callerAgentAddress?) API, but this hard-denies undefined. That turns existing in-process callers that omit the 4th arg (including the documented invokeContextGraphModel(cg, messages) usage) into permanent non-members on the local path, and the remote path now serializes the same undefined claim and gets denied there too. Resolve a default principal once at method entry and use that value for both auth and request serialization if you need to preserve the current API contract.

Branimir Rakic and others added 6 commits June 14, 2026 12:38

branarakic marked this pull request as ready for review June 14, 2026 12:15

branarakic requested review from Jurij89 and zsculac as code owners June 14, 2026 12:15

github-actions Bot reviewed Jun 14, 2026

View reviewed changes

Branimir Rakic and others added 2 commits June 14, 2026 14:54

branarakic marked this pull request as draft June 14, 2026 12:55

github-actions Bot reviewed Jun 14, 2026

View reviewed changes

branarakic marked this pull request as ready for review June 14, 2026 13:35

branarakic merged commit 3912287 into feat/llm-sharing-module Jun 14, 2026
4 checks passed

branarakic mentioned this pull request Jun 14, 2026

chore(shared-model): bring stranded node-UI demo into the feature branch (merge-order fix) #1162

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(agent): shared-model M1 — membership-gate auth fix + invoke timeouts (B1+B2)#1158

fix(agent): shared-model M1 — membership-gate auth fix + invoke timeouts (B1+B2)#1158
branarakic merged 8 commits into
feat/llm-sharing-modulefrom
feat/shared-model-m1-fixes

branarakic commented Jun 14, 2026

Uh oh!

github-actions Bot Jun 14, 2026

Uh oh!

github-actions Bot Jun 14, 2026

Uh oh!

github-actions Bot Jun 14, 2026

Uh oh!

github-actions Bot Jun 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

branarakic commented Jun 14, 2026

B1 — membership-gate auth bypass (security)

B2 — invoke timeouts (usability)

Tests

Not in scope (later milestones)

Uh oh!

github-actions Bot Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant