fix(agent): shared-model M1 — membership-gate auth fix + invoke timeouts (B1+B2)#1158
Conversation
B1 — membership gate (auth bypass fix)
isPeerContextGraphMember no longer runs a bespoke SPARQL UNION over
allowedDelegateePeer AND the self-asserted, pre-approval
delegationDelegateePeer. It now reuses the canonical
getContextGraphAllowedDelegateePeers helper (WorkspaceCryptoMethods mixin),
which returns ONLY curator-approved, unexpired delegatee peers. A pending,
rejected, expired, or removed peer is therefore correctly denied
(removeAgentFromContextGraph deletes the did:dkg:agent-delegation:* subject
the helper reads). New membership.test.ts pins approved=allow,
pending/removed/expired=deny.
B2 — invoke timeout (transport + provider)
The member's remote invoke previously inherited the 20s router default for
the whole LLM round trip, surfacing a false "curator node unreachable" on
normal completions. Added invokeTimeoutMs/providerTimeoutMs config knobs
(CLI SharedModelConfig + SharedModelRuntimeConfig), defaulting to
120000/110000 in the daemon lifecycle (provider deadline tighter than
transport). Member side passes { timeoutMs } to sendReliable (falling back
to 120000 when the node has no shared-model state). Curator side adds
signal: AbortSignal.timeout(providerTimeoutMs) to the provider fetch and
maps a TimeoutError to a clear "provider timeout after <n>ms" Error.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Unit-test the curator-side provider deadline added in B2: a fetch aborted by AbortSignal.timeout maps to a deterministic `provider timeout after <n>ms` error, and the abort signal is attached to fetch only when providerTimeoutMs is configured. Closes the one untested branch in the B1+B2 patch. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
scripts/devnet-test-shared-model.sh drives a 2-node devnet through the full
faithful path (private CG -> model/share -> genuine sign/request/approve join
-> _meta sync -> member P2P invoke + OpenAI-compatible invoke) and asserts the
M1 fixes:
- B1: after the curator removes the member, the member's invoke is DENIED
("requester is not a member") — the pre-patch gate let removed/pending peers
through. Fails loudly if a completion still comes back.
- B2 (opt-in, TEST_B2=1): stands up a local slow openai-compatible stub
(>20s, the old router default) and asserts the invoke still returns,
exercising the threaded invoke/provider timeouts.
devnet.sh gains a DEVNET_SHARED_MODEL env-gated sharedModel config block
(mirroring DEVNET_NO_AUTH etc.) so restart-node regenerates node config WITH
sharing enabled; the test script uses it to (re)configure the curator node.
Validated: bash -n + embedded-python checks pass. Not yet run against a live
devnet in this session.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e-mesh wait)
Validated the shared-model e2e end-to-end on a live 2-node devnet and fixed
two runtime bugs that broke it (both in the test harness, not the product):
1. restart-node never regenerated config.json, so ensure_curator_config's
DEVNET_SHARED_MODEL=1 + provider env never took effect — start_node only
reads the existing config and patches relay/bootstrapPeers. The curator
booted without a sharedModel block and every invoke failed with
"curator node has no shared model configured". Fix: cmd_restart_node now
re-runs create_node_config before start_node (persisted store/identity/
auth.token are untouched).
2. After the curator restart, the join handshake raced the curator's libp2p
protocol re-advertisement: the one-shot post-approval _meta sync landed
while the member still saw "does not support sync protocol", the targeted
sync failed, and the self-heal catch-up SKIPPED the brand-new CG
("unauthorized or unconfirmed"). The grant/_meta never reached the member,
so invoke was denied with "could not resolve curator peer". This was
intermittent (mock won the race, B2 lost it). Fix: ensure_curator_config
now waits for the member to re-peer with the curator (+settle margin for
the identify-push) before driving membership.
Both mock and TEST_B2=1 runs now pass deterministically; the B2 slow invoke
survives the 25s upstream and returns slow-pong (~25s elapsed).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… e2e - devnet.sh: DEVNET_SHARED_MODEL hook now passes through DEVNET_SHARED_MODEL_API_KEY_ENV -> sharedModel.apiKeyEnv, so an openai-compatible devnet curator can be given a real key via an env var (the key itself is never written to config). - devnet-test-shared-model.sh: REAL=1 mode for a happy-path smoke against a real provider you configured on node1 — implies SKIP_CONFIG (never bakes a key), relaxes the mock-echo assertion to "ok + non-empty", prints the real completion, and skips the B1 removal so the member stays joined for further manual calls. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
REAL mode skips the script's own restart, but the operator just restarted node1 to point it at the real provider; wait for the member to re-peer with the curator before the join handshake (same race the mock/B2 path guards against). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| # effect on restart — start_node only reads the existing config and patches | ||
| # relay/bootstrapPeers, it does NOT re-run create_node_config. Persisted store | ||
| # state, identity and the auth.token file live elsewhere and are untouched. | ||
| create_node_config "$node_num" |
There was a problem hiding this comment.
🔴 Bug: create_node_config() does more than regenerate config.json here: it also rewrites wallets.json with fresh random operational keys. That means restart-node no longer preserves the node's agent/on-chain identity, and any flow that caches the pre-restart address/allowlist (including the new shared-model devnet script) can start talking to stale identities after a restart. Please avoid calling the full node bootstrap here, or make the restart path preserve existing wallet/state files.
| apiKey, | ||
| dailyRequestQuotaPerAgent: sm.dailyRequestQuotaPerAgent ?? 200, | ||
| maxPromptChars: sm.maxPromptChars ?? 8000, | ||
| invokeTimeoutMs: sm.invokeTimeoutMs ?? 120_000, |
There was a problem hiding this comment.
🟡 Issue: these new timeout knobs are applied without any validation, but the B2 behavior depends on providerTimeoutMs being a positive value that is strictly smaller than invokeTimeoutMs. If an operator sets them equal or inverted, the member will fall back to the original transport timeout and surface curator node unreachable again. Validate/reject invalid combinations before calling configureSharedModel.
…ocal path Security hardening of the curator shared-model invoke gate (Codex #1/#2/#3). #1 Per-agent membership binding (remote path). Previously the curator authorized any peer present in the UNION of all agents' approved delegatee-peer sets, so any approved member's node could invoke on behalf of an unapproved (or pending) agent. Now the request carries an UNTRUSTED `agentAddress` claim (added to SharedModelInvokeRequest + wire codec, which rejects a non-string value) and the curator authorizes iff the claimed agent's curator-APPROVED delegatee-peer set contains the libp2p-authenticated `fromPeerId` — mirroring the sync auth path's `requesterAgentAddress` binding. `isPeerContextGraphMember` (union) is replaced by `isAgentPeerContextGraphMember(cg, agentAddress, fromPeerId)`. A missing/empty `agentAddress` is DENIED (no union fallback). This defeats pending requesters, peer impersonation, and multi-agent inheritance. #2 Caller-scoped local fast-path. The local path no longer hardcodes `isMember:true`. `isContextGraphCuratorSelf` now enumerates ALL local agents (not just the default) to decide whether the local path applies; the caller is then authorized via `isCallerLocalCgMember`, which allows only the CG owner/curator (via the canonical `isCallerOrNodeOwner`, preserving legacy peerId-owner compat) or an approved member of the CG. A different local agent that is neither is DENIED in-branch and never falls through to dial self. #3 Quota consumed only after authorization. Decision input now uses a non-mutating `remaining(key) > 0` peek; the mutating `allow(key)` is called exactly once, after `decideSharedModelAuthorization` returns ok, in BOTH the remote handler and the local fast-path. A request denied for any other reason no longer burns the member's daily allowance. Tests: rewrote membership.test.ts for per-agent binding (the old "flattened union" test pinned the removed vulnerability); added invoke-handler.test.ts (curator-side binding + quota) and local-path.test.ts (owner/member/outsider/node-token/legacy-peerId-owner + quota). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…artial success Codex #4/#5 on the daemon shared-model routes. #4 Validate message shapes in the routes. The native /model/invoke route trusted `body.messages` after only an array check, so `[{"role":"user"}]` or `[{"content":1}]` threw deep in the provider call and surfaced as a 500. Added `validateSharedModelMessages` (role in system|user|assistant, content is a string) returning a clear 400; wrapped the unguarded `JSON.parse` so a malformed body is a 400 too. The OpenAI /v1/chat/completions route now returns an `openAiErrorBody(...)` 400 on a malformed body instead of a 500. maxTokens/temperature are coerced to finite numbers. #5 invite-with-model partial success. The route applied the invite then enabled sharing in one try/catch, returning 400 even when membership had already been granted (the invite succeeded). Now: invite failure → 400 (nothing applied, safe to retry); invite OK but share failure → 200 with `{ ok:true, modelShared:false, modelShareError }` so callers don't retry against partially-applied state; full success → 200 modelShared:true. Tests: daemon-shared-model-routes.test.ts (native + openai validation, invite partial/full/invite-failure/no-share paths). Residual (out of scope, noted): the /model/share route's `JSON.parse` is still unguarded; #4 names only the invoke/openai routes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| return encodeInvokeResponse({ ok: false, denied: decision.denied }); | ||
| } | ||
| // Authorized: consume exactly one quota unit before the provider call. | ||
| st!.quota.allow(fromPeerId); |
There was a problem hiding this comment.
🔴 Bug: This is now a racy peek-then-consume flow. Two concurrent requests from the same member can both see remaining(...) > 0, both pass decideSharedModelAuthorization, and then both continue even though the second allow() would return false. Please make quota reservation atomic again (for example, check the boolean result here and in the local-path st!.quota.allow(selfKey) branch before calling the provider).
| // also stops `isCallerOrNodeOwner(owner, undefined)` below from taking its | ||
| // permissive no-caller branch (which would authorise a node-level token | ||
| // whenever the owner is one of the node's own identities). | ||
| if (!callerAgentAddress) return false; |
There was a problem hiding this comment.
🔴 Bug: callerAgentAddress is still optional on the public invokeContextGraphModel(contextGraphId, messages, opts?, callerAgentAddress?) API, but this hard-denies undefined. That turns existing in-process callers that omit the 4th arg (including the documented invokeContextGraphModel(cg, messages) usage) into permanent non-members on the local path, and the remote path now serializes the same undefined claim and gets denied there too. Resolve a default principal once at method entry and use that value for both auth and request serialization if you need to preserve the current API contract.
Follow-up hardening on top of #1157 (
feat/llm-sharing-module). Scoped to the two highest-leverage, lowest-risk fixes from the MVP→v1 review; partial M1 (cost-abuse items B3 still to come).B1 — membership-gate auth bypass (security)
isPeerContextGraphMemberpreviously ran a bespoke SPARQL that UNIONedallowedDelegateePeerand the self-asserted, pre-approvaldelegationDelegateePeer, filtering on neither approval-status nor expiry. Net effect: a peer that merely submitted a join request (or was later rejected/removed) could invoke the curator's billed model.Fix: delegate to the canonical
getContextGraphAllowedDelegateePeers(the approved-only, expiry-filtered set already used by the sync-auth path) and match the transport-authenticatedfromPeerIdagainst it. A pending / rejected / removed peer is now correctly denied (a removed peer'sdid:dkg:agent-delegation:*subject is deleted byremoveAgentFromContextGraph, so it drops out of the set).B2 — invoke timeouts (usability)
The remote invoke used
messenger.sendReliable(...)with no timeout, inheriting the router default (DEFAULT_SEND_TIMEOUT_MS, 20s) for the whole round trip — shorter than real LLM latency, so normal completions returned a falsecurator node unreachable.{ timeoutMs: invokeTimeoutMs }tosendReliable(default 120s, with a constant fallback for pure member nodes that have no shared-model config).fetchnow getsAbortSignal.timeout(providerTimeoutMs)(default 110s, deliberately tighter than the transport budget so the curator returns a structured error first). ATimeoutErrormaps toprovider timeout after <n>ms→ surfaced as{ok:false, denied:'provider error: provider timeout...'}.sharedModel.invokeTimeoutMs/sharedModel.providerTimeoutMs.Tests
test/shared-model/membership.test.ts(5): approved→allow, pending-only→deny, removed→deny, expired→deny, multi-agent flatten.test/shared-model/client-provider-timeout.test.ts(2): abort→provider timeout after <n>ms; abort signal attached only when configured.pnpm --filter @origintrail-official/dkg-agent test— 1240 passed / 0 failed); agent + CLItscclean.Not in scope (later milestones)
Token/temperature clamp + per-peer rate limit + quota-consume reordering + audit log (rest of M1);
_meta-sync reliability (M2); streaming / tool-calls / model-select / error-code mapping (M3); metering (M5).🤖 Generated with Claude Code