Skip to content

fix(agent): shared-model M1 — membership-gate auth fix + invoke timeouts (B1+B2)#1158

Merged
branarakic merged 8 commits into
feat/llm-sharing-modulefrom
feat/shared-model-m1-fixes
Jun 14, 2026
Merged

fix(agent): shared-model M1 — membership-gate auth fix + invoke timeouts (B1+B2)#1158
branarakic merged 8 commits into
feat/llm-sharing-modulefrom
feat/shared-model-m1-fixes

Conversation

@branarakic

Copy link
Copy Markdown
Contributor

Follow-up hardening on top of #1157 (feat/llm-sharing-module). Scoped to the two highest-leverage, lowest-risk fixes from the MVP→v1 review; partial M1 (cost-abuse items B3 still to come).

B1 — membership-gate auth bypass (security)

isPeerContextGraphMember previously ran a bespoke SPARQL that UNIONed allowedDelegateePeer and the self-asserted, pre-approval delegationDelegateePeer, filtering on neither approval-status nor expiry. Net effect: a peer that merely submitted a join request (or was later rejected/removed) could invoke the curator's billed model.

Fix: delegate to the canonical getContextGraphAllowedDelegateePeers (the approved-only, expiry-filtered set already used by the sync-auth path) and match the transport-authenticated fromPeerId against it. A pending / rejected / removed peer is now correctly denied (a removed peer's did:dkg:agent-delegation:* subject is deleted by removeAgentFromContextGraph, so it drops out of the set).

B2 — invoke timeouts (usability)

The remote invoke used messenger.sendReliable(...) with no timeout, inheriting the router default (DEFAULT_SEND_TIMEOUT_MS, 20s) for the whole round trip — shorter than real LLM latency, so normal completions returned a false curator node unreachable.

  • Member side: pass { timeoutMs: invokeTimeoutMs } to sendReliable (default 120s, with a constant fallback for pure member nodes that have no shared-model config).
  • Curator side: the provider fetch now gets AbortSignal.timeout(providerTimeoutMs) (default 110s, deliberately tighter than the transport budget so the curator returns a structured error first). A TimeoutError maps to provider timeout after <n>ms → surfaced as {ok:false, denied:'provider error: provider timeout...'}.
  • New optional config: sharedModel.invokeTimeoutMs / sharedModel.providerTimeoutMs.

Tests

  • New test/shared-model/membership.test.ts (5): approved→allow, pending-only→deny, removed→deny, expired→deny, multi-agent flatten.
  • New test/shared-model/client-provider-timeout.test.ts (2): abort→provider timeout after <n>ms; abort signal attached only when configured.
  • Full agent suite green (pnpm --filter @origintrail-official/dkg-agent test — 1240 passed / 0 failed); agent + CLI tsc clean.

Not in scope (later milestones)

Token/temperature clamp + per-peer rate limit + quota-consume reordering + audit log (rest of M1); _meta-sync reliability (M2); streaming / tool-calls / model-select / error-code mapping (M3); metering (M5).

🤖 Generated with Claude Code

Branimir Rakic and others added 6 commits June 14, 2026 12:38
B1 — membership gate (auth bypass fix)
isPeerContextGraphMember no longer runs a bespoke SPARQL UNION over
allowedDelegateePeer AND the self-asserted, pre-approval
delegationDelegateePeer. It now reuses the canonical
getContextGraphAllowedDelegateePeers helper (WorkspaceCryptoMethods mixin),
which returns ONLY curator-approved, unexpired delegatee peers. A pending,
rejected, expired, or removed peer is therefore correctly denied
(removeAgentFromContextGraph deletes the did:dkg:agent-delegation:* subject
the helper reads). New membership.test.ts pins approved=allow,
pending/removed/expired=deny.

B2 — invoke timeout (transport + provider)
The member's remote invoke previously inherited the 20s router default for
the whole LLM round trip, surfacing a false "curator node unreachable" on
normal completions. Added invokeTimeoutMs/providerTimeoutMs config knobs
(CLI SharedModelConfig + SharedModelRuntimeConfig), defaulting to
120000/110000 in the daemon lifecycle (provider deadline tighter than
transport). Member side passes { timeoutMs } to sendReliable (falling back
to 120000 when the node has no shared-model state). Curator side adds
signal: AbortSignal.timeout(providerTimeoutMs) to the provider fetch and
maps a TimeoutError to a clear "provider timeout after <n>ms" Error.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Unit-test the curator-side provider deadline added in B2: a fetch aborted by
AbortSignal.timeout maps to a deterministic `provider timeout after <n>ms`
error, and the abort signal is attached to fetch only when providerTimeoutMs
is configured. Closes the one untested branch in the B1+B2 patch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
scripts/devnet-test-shared-model.sh drives a 2-node devnet through the full
faithful path (private CG -> model/share -> genuine sign/request/approve join
-> _meta sync -> member P2P invoke + OpenAI-compatible invoke) and asserts the
M1 fixes:
- B1: after the curator removes the member, the member's invoke is DENIED
  ("requester is not a member") — the pre-patch gate let removed/pending peers
  through. Fails loudly if a completion still comes back.
- B2 (opt-in, TEST_B2=1): stands up a local slow openai-compatible stub
  (>20s, the old router default) and asserts the invoke still returns,
  exercising the threaded invoke/provider timeouts.

devnet.sh gains a DEVNET_SHARED_MODEL env-gated sharedModel config block
(mirroring DEVNET_NO_AUTH etc.) so restart-node regenerates node config WITH
sharing enabled; the test script uses it to (re)configure the curator node.

Validated: bash -n + embedded-python checks pass. Not yet run against a live
devnet in this session.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e-mesh wait)

Validated the shared-model e2e end-to-end on a live 2-node devnet and fixed
two runtime bugs that broke it (both in the test harness, not the product):

1. restart-node never regenerated config.json, so ensure_curator_config's
   DEVNET_SHARED_MODEL=1 + provider env never took effect — start_node only
   reads the existing config and patches relay/bootstrapPeers. The curator
   booted without a sharedModel block and every invoke failed with
   "curator node has no shared model configured". Fix: cmd_restart_node now
   re-runs create_node_config before start_node (persisted store/identity/
   auth.token are untouched).

2. After the curator restart, the join handshake raced the curator's libp2p
   protocol re-advertisement: the one-shot post-approval _meta sync landed
   while the member still saw "does not support sync protocol", the targeted
   sync failed, and the self-heal catch-up SKIPPED the brand-new CG
   ("unauthorized or unconfirmed"). The grant/_meta never reached the member,
   so invoke was denied with "could not resolve curator peer". This was
   intermittent (mock won the race, B2 lost it). Fix: ensure_curator_config
   now waits for the member to re-peer with the curator (+settle margin for
   the identify-push) before driving membership.

Both mock and TEST_B2=1 runs now pass deterministically; the B2 slow invoke
survives the 25s upstream and returns slow-pong (~25s elapsed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… e2e

- devnet.sh: DEVNET_SHARED_MODEL hook now passes through DEVNET_SHARED_MODEL_API_KEY_ENV
  -> sharedModel.apiKeyEnv, so an openai-compatible devnet curator can be given a
  real key via an env var (the key itself is never written to config).
- devnet-test-shared-model.sh: REAL=1 mode for a happy-path smoke against a real
  provider you configured on node1 — implies SKIP_CONFIG (never bakes a key),
  relaxes the mock-echo assertion to "ok + non-empty", prints the real completion,
  and skips the B1 removal so the member stays joined for further manual calls.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
REAL mode skips the script's own restart, but the operator just restarted node1
to point it at the real provider; wait for the member to re-peer with the curator
before the join handshake (same race the mock/B2 path guards against).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@branarakic branarakic marked this pull request as ready for review June 14, 2026 12:15
Comment thread scripts/devnet.sh
# effect on restart — start_node only reads the existing config and patches
# relay/bootstrapPeers, it does NOT re-run create_node_config. Persisted store
# state, identity and the auth.token file live elsewhere and are untouched.
create_node_config "$node_num"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: create_node_config() does more than regenerate config.json here: it also rewrites wallets.json with fresh random operational keys. That means restart-node no longer preserves the node's agent/on-chain identity, and any flow that caches the pre-restart address/allowlist (including the new shared-model devnet script) can start talking to stale identities after a restart. Please avoid calling the full node bootstrap here, or make the restart path preserve existing wallet/state files.

apiKey,
dailyRequestQuotaPerAgent: sm.dailyRequestQuotaPerAgent ?? 200,
maxPromptChars: sm.maxPromptChars ?? 8000,
invokeTimeoutMs: sm.invokeTimeoutMs ?? 120_000,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: these new timeout knobs are applied without any validation, but the B2 behavior depends on providerTimeoutMs being a positive value that is strictly smaller than invokeTimeoutMs. If an operator sets them equal or inverted, the member will fall back to the original transport timeout and surface curator node unreachable again. Validate/reject invalid combinations before calling configureSharedModel.

Branimir Rakic and others added 2 commits June 14, 2026 14:54
…ocal path

Security hardening of the curator shared-model invoke gate (Codex #1/#2/#3).

#1 Per-agent membership binding (remote path). Previously the curator
authorized any peer present in the UNION of all agents' approved
delegatee-peer sets, so any approved member's node could invoke on behalf
of an unapproved (or pending) agent. Now the request carries an UNTRUSTED
`agentAddress` claim (added to SharedModelInvokeRequest + wire codec, which
rejects a non-string value) and the curator authorizes iff the claimed
agent's curator-APPROVED delegatee-peer set contains the libp2p-authenticated
`fromPeerId` — mirroring the sync auth path's `requesterAgentAddress`
binding. `isPeerContextGraphMember` (union) is replaced by
`isAgentPeerContextGraphMember(cg, agentAddress, fromPeerId)`. A
missing/empty `agentAddress` is DENIED (no union fallback). This defeats
pending requesters, peer impersonation, and multi-agent inheritance.

#2 Caller-scoped local fast-path. The local path no longer hardcodes
`isMember:true`. `isContextGraphCuratorSelf` now enumerates ALL local
agents (not just the default) to decide whether the local path applies; the
caller is then authorized via `isCallerLocalCgMember`, which allows only the
CG owner/curator (via the canonical `isCallerOrNodeOwner`, preserving legacy
peerId-owner compat) or an approved member of the CG. A different local
agent that is neither is DENIED in-branch and never falls through to dial
self.

#3 Quota consumed only after authorization. Decision input now uses a
non-mutating `remaining(key) > 0` peek; the mutating `allow(key)` is called
exactly once, after `decideSharedModelAuthorization` returns ok, in BOTH
the remote handler and the local fast-path. A request denied for any other
reason no longer burns the member's daily allowance.

Tests: rewrote membership.test.ts for per-agent binding (the old
"flattened union" test pinned the removed vulnerability); added
invoke-handler.test.ts (curator-side binding + quota) and local-path.test.ts
(owner/member/outsider/node-token/legacy-peerId-owner + quota).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…artial success

Codex #4/#5 on the daemon shared-model routes.

#4 Validate message shapes in the routes. The native /model/invoke route
trusted `body.messages` after only an array check, so `[{"role":"user"}]`
or `[{"content":1}]` threw deep in the provider call and surfaced as a 500.
Added `validateSharedModelMessages` (role in system|user|assistant, content
is a string) returning a clear 400; wrapped the unguarded `JSON.parse` so a
malformed body is a 400 too. The OpenAI /v1/chat/completions route now
returns an `openAiErrorBody(...)` 400 on a malformed body instead of a 500.
maxTokens/temperature are coerced to finite numbers.

#5 invite-with-model partial success. The route applied the invite then
enabled sharing in one try/catch, returning 400 even when membership had
already been granted (the invite succeeded). Now: invite failure → 400
(nothing applied, safe to retry); invite OK but share failure → 200 with
`{ ok:true, modelShared:false, modelShareError }` so callers don't retry
against partially-applied state; full success → 200 modelShared:true.

Tests: daemon-shared-model-routes.test.ts (native + openai validation,
invite partial/full/invite-failure/no-share paths).

Residual (out of scope, noted): the /model/share route's `JSON.parse` is
still unguarded; #4 names only the invoke/openai routes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@branarakic branarakic marked this pull request as draft June 14, 2026 12:55
return encodeInvokeResponse({ ok: false, denied: decision.denied });
}
// Authorized: consume exactly one quota unit before the provider call.
st!.quota.allow(fromPeerId);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: This is now a racy peek-then-consume flow. Two concurrent requests from the same member can both see remaining(...) > 0, both pass decideSharedModelAuthorization, and then both continue even though the second allow() would return false. Please make quota reservation atomic again (for example, check the boolean result here and in the local-path st!.quota.allow(selfKey) branch before calling the provider).

// also stops `isCallerOrNodeOwner(owner, undefined)` below from taking its
// permissive no-caller branch (which would authorise a node-level token
// whenever the owner is one of the node's own identities).
if (!callerAgentAddress) return false;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: callerAgentAddress is still optional on the public invokeContextGraphModel(contextGraphId, messages, opts?, callerAgentAddress?) API, but this hard-denies undefined. That turns existing in-process callers that omit the 4th arg (including the documented invokeContextGraphModel(cg, messages) usage) into permanent non-members on the local path, and the remote path now serializes the same undefined claim and gets denied there too. Resolve a default principal once at method entry and use that value for both auth and request serialization if you need to preserve the current API contract.

@branarakic branarakic marked this pull request as ready for review June 14, 2026 13:35
@branarakic branarakic merged commit 3912287 into feat/llm-sharing-module Jun 14, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant