feat: minimal external inference capacity envelope for ServiceOffer #439

@bussyjd

Background

ServiceOffer should support paid inference capacity without turning Obol into a Kubernetes scheduler or trying to manage the inference runtime.

For the first implementation, assume the OpenAI-compatible/vLLM-like inference endpoint is outside the Obol stack cluster. Obol has exactly one upstream endpoint to work with and zero control over scaling. This matches host-local vLLM / llama.cpp / DGX Spark style setups where the serving process is already running and the cluster only needs to expose it safely through x402.

This is intentionally the minimal-surface issue. In-stack multi-GPU inference servers, GPU device plugins, KEDA/HPA, RuntimeClasses, and Kubernetes-managed vLLM deployments are separate work. See #430.

Goals

  • Add a minimal capacity/admission envelope to ServiceOffer for external inference endpoints.
  • Use Kubernetes/Gateway primitives we already have instead of inventing a scheduler.
  • Keep the first version compatible with the current Traefik + Gateway API + x402 ForwardAuth architecture.
  • Treat GPU count as the relevant capacity dimension for inference, not generic cloud CPU scaling.
  • Avoid any autoscaling semantics for this phase: one external endpoint, static seller-declared capacity.
  • Prevent buyers from paying into an obviously saturated service.

Non-goals

  • Do not run vLLM/llama.cpp inside the Obol cluster in this issue.
  • Do not create or manage Deployment, StatefulSet, HPA, KEDA ScaledObject, RuntimeClass, or GPU device plugin resources here.
  • Do not scale replicas based on CPU/memory metrics.
  • Do not create one Kubernetes object per request/reservation.
  • Do not implement full token metering/per-token settlement here; this is admission/capacity protection around the existing x402 route.

Proposed ServiceOffer shape

Keep this small and seller-facing:

apiVersion: obol.org/v1alpha1
kind: ServiceOffer
metadata:
  name: qwen36-fast
  namespace: llm
spec:
  type: inference

  upstream:
    service: qwen36-external
    namespace: llm
    port: 8000
    healthPath: /health

  capabilities:
    protocol: openai-chat
    streaming: true
    contextWindowTokens: 262144

  requestLimits:
    maxInputTokens: 8192
    maxOutputTokens: 1024
    maxBodyBytes: 10485760

  capacity:
    mode: external-static
    gpuCount: 1
    maxInFlight: 16
    perBuyerMaxInFlight: 4
    maxQueueDepth: 0

  payment:
    network: base-sepolia
    asset: usdc # usdc | obol | 0x...
    payTo: "0x..."
    scheme: exact
    price:
      perRequest: "0.001"
      perMTok: "0.50"

Notes:

  • capacity.mode: external-static means Obol does not control scaling.
  • gpuCount is capacity metadata and preset input, not a Kubernetes scheduling instruction.
  • maxInFlight is the actual hard admission ceiling enforced at the gateway/proxy layer.
  • maxQueueDepth: 0 should be the safe default for paid inference: if saturated, return 429 and do not request payment.
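  • perBuyerMaxInFlight is desirable, but can be implemented after global maxInFlight if identifying the buyer at the Traefik layer (for example, which header carries the buyer identity and whether it is available before x402 runs) makes it non-trivial.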
  • payment.asset should be explicit because Obol supports both USDC and OBOL-token settlement. Default to usdc for backwards compatibility, and resolve symbolic assets (usdc, obol) to the network-specific token contract/decimals when writing x402 payment requirements.
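
A minimal sketch of the asset resolution described in the last note, assuming a small in-code registry keyed by network and symbol. The registry, the function names, and the placeholder addresses are hypothetical; the 18-decimal figure for OBOL and the use of atomic-unit strings in the x402 requirements are assumptions to check against the real token contract and the x402 config writer.

```python
from decimal import Decimal
from typing import Optional

# Hypothetical registry: (network, symbol) -> (token contract, decimals).
# Addresses are placeholders, not real contracts. OBOL decimals are assumed.
ASSETS = {
    ("base-sepolia", "usdc"): ("0xUSDC_PLACEHOLDER", 6),
    ("base-sepolia", "obol"): ("0xOBOL_PLACEHOLDER", 18),
}


def resolve_asset(network: str, asset: str = "usdc",
                  decimals: Optional[int] = None) -> tuple:
    """Return (contract, decimals) for a symbolic asset or a raw 0x address."""
    if asset.startswith("0x"):
        # A raw address carries no decimals, so the caller has to supply them.
        if decimals is None:
            raise ValueError("decimals are required for a raw token address")
        return asset, decimals
    try:
        return ASSETS[(network, asset.lower())]
    except KeyError:
        raise ValueError(f"unknown asset {asset!r} on {network!r}") from None


def to_atomic_units(price: str, decimals: int) -> str:
    """Convert a human-readable price such as "0.001" to token base units."""
    amount = Decimal(price).scaleb(decimals)
    if amount != amount.to_integral_value():
        raise ValueError(f"{price} has more than {decimals} decimal places")
    return str(int(amount))


contract, token_decimals = resolve_asset("base-sepolia")  # defaults to usdc
print(contract, to_atomic_units("0.001", token_decimals))
```

Omitting the asset falls back to usdc, which matches the backwards-compatible default proposed above.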

Kubernetes/Gateway primitives to use

Existing resources

Continue generating/using:

  • HTTPRoute for /services/<offer>
  • Traefik Middleware for x402 ForwardAuth
  • x402 pricing ConfigMap route entry
  • owner references for garbage collection
  • ServiceOffer.status.conditions

Add generated capacity middleware

For Traefik, generate an additional middleware from spec.capacity.maxInFlight:

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: qwen36-fast-inflight
  namespace: llm
spec:
  inFlightReq:
    amount: 16

Attach it to the same route as the x402 middleware. The desired behavior is:

  • If capacity is available: unpaid request reaches x402 verifier and gets a 402 Payment Required quote.
  • If capacity is saturated: request gets 429 Too Many Requests and should not be quoted for payment.

Implementation detail to verify: middleware ordering must avoid asking the buyer to pay when the global in-flight limit is already saturated.
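
A rough sketch of what the reconciler could emit, written as plain Python dicts rather than real reconciler code. The helper names and the x402 middleware name (qwen36-fast-x402) are hypothetical. It assumes the HTTPRoute references Traefik middlewares through Gateway API ExtensionRef filters; whether Traefik applies those filters in list order is exactly the detail that still needs verifying.

```python
def inflight_middleware(offer_name: str, namespace: str, max_in_flight: int) -> dict:
    """Build the Traefik inFlightReq Middleware for an offer."""
    return {
        "apiVersion": "traefik.io/v1alpha1",
        "kind": "Middleware",
        "metadata": {"name": f"{offer_name}-inflight", "namespace": namespace},
        "spec": {"inFlightReq": {"amount": max_in_flight}},
    }


def route_filters(offer_name: str, x402_middleware: str) -> list:
    """List HTTPRoute filters with the capacity check ahead of x402."""

    def middleware_ref(name: str) -> dict:
        return {
            "type": "ExtensionRef",
            "extensionRef": {"group": "traefik.io", "kind": "Middleware", "name": name},
        }

    # Capacity first, so a saturated offer is rejected before any 402 quote.
    return [middleware_ref(f"{offer_name}-inflight"), middleware_ref(x402_middleware)]


middleware = inflight_middleware("qwen36-fast", "llm", 16)
filters = route_filters("qwen36-fast", "qwen36-fast-x402")  # x402 name is assumed
print(middleware["metadata"]["name"], [f["extensionRef"]["name"] for f in filters])
```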

Optional request-shape guard

If easy, enforce maxBodyBytes with gateway/proxy middleware. Token-shape limits (maxInputTokens, maxOutputTokens) may initially be status/discovery metadata unless the x402 verifier/proxy parses OpenAI requests.
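
If the verifier or a proxy does end up parsing OpenAI requests, the check could look roughly like the sketch below. The function name and the choice of 413/400 status codes are assumptions. Only the body size and the requested output tokens are checked; enforcing maxInputTokens would need the model's tokenizer and is left out.

```python
import json


def check_request_shape(body: bytes, max_body_bytes: int, max_output_tokens: int) -> tuple:
    """Return (status, reason) for an OpenAI-style chat request body."""
    if len(body) > max_body_bytes:
        return 413, "body exceeds maxBodyBytes"
    try:
        payload = json.loads(body)
    except ValueError:  # covers invalid JSON and invalid UTF-8
        return 400, "body is not valid JSON"
    if not isinstance(payload, dict):
        return 400, "body must be a JSON object"
    # OpenAI-style requests carry the output cap in one of these two fields.
    requested = payload.get("max_completion_tokens", payload.get("max_tokens"))
    if requested is not None:
        if not isinstance(requested, int):
            return 400, "output token limit must be an integer"
        if requested > max_output_tokens:
            return 400, "requested output exceeds maxOutputTokens"
    return 200, "ok"


request = json.dumps({"model": "qwen36", "messages": [], "max_tokens": 4096}).encode()
print(check_request_shape(request, max_body_bytes=10485760, max_output_tokens=1024))
```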

Admission behavior

buyer request
  -> Gateway/HTTPRoute
  -> capacity middleware / admission check
     saturated: 429, Retry-After, no payment quote
  -> x402 ForwardAuth
     missing payment: 402 quote
     valid payment: pass
  -> upstream Service pointing at external endpoint

For v1, we can avoid reservations entirely. A reservation layer can come later if we see buyer UX issues between quote and paid retry, for example a buyer who receives a 402 quote and then hits a 429 when retrying with payment.
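
The decision order in the flow above can be modelled in a few lines. This is a simplified in-memory sketch, not how Traefik counts requests: the Admission class and its method names are hypothetical, and it only counts paid requests that were forwarded upstream.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Admission:
    max_in_flight: int
    per_buyer_max_in_flight: Optional[int] = None
    in_flight: Counter = field(default_factory=Counter)

    def decide(self, buyer: str, has_valid_payment: bool) -> int:
        """Return the HTTP status the buyer would see for this request."""
        # Capacity is checked first: a saturated offer never issues a quote.
        if sum(self.in_flight.values()) >= self.max_in_flight:
            return 429
        limit = self.per_buyer_max_in_flight
        if limit is not None and self.in_flight[buyer] >= limit:
            return 429
        if not has_valid_payment:
            return 402
        self.in_flight[buyer] += 1  # request is forwarded upstream
        return 200

    def release(self, buyer: str) -> None:
        """Call when the upstream response has finished."""
        if self.in_flight[buyer] > 0:
            self.in_flight[buyer] -= 1


gate = Admission(max_in_flight=16, per_buyer_max_in_flight=4)
print(gate.decide("buyer-a", has_valid_payment=False))
print(gate.decide("buyer-a", has_valid_payment=True))
```

Because nothing is reserved at quote time, a buyer can still receive a 402 and find the offer saturated on the paid retry.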

Status fields

Add enough status to make obol sell status and discovery honest:

status:
  capacity:
    mode: external-static
    scalingControlled: false
    gpuCount: 1
    maxInFlight: 16
    perBuyerMaxInFlight: 4
    inFlight: 7
    state: Available # Available | Saturated | Unknown
  endpoint: https://.../services/qwen36-fast

inFlight can come from Traefik/x402 metrics when available; otherwise omit it and report state: Unknown rather than a guessed value.
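
One way to derive the status block, sketched under the assumption that the reconciler receives the current in-flight count as an optional integer (None when no metric is available). The function names are hypothetical.

```python
from typing import Optional


def capacity_state(max_in_flight: int, in_flight: Optional[int]) -> str:
    """Map an optional in-flight reading to Available, Saturated or Unknown."""
    if in_flight is None:
        return "Unknown"
    return "Saturated" if in_flight >= max_in_flight else "Available"


def capacity_status(capacity: dict, in_flight: Optional[int]) -> dict:
    """Build status.capacity from spec.capacity plus an optional metric."""
    status = {
        "mode": capacity.get("mode", "external-static"),
        "scalingControlled": False,  # external-static: Obol never scales the endpoint
        "maxInFlight": capacity["maxInFlight"],
        "state": capacity_state(capacity["maxInFlight"], in_flight),
    }
    for key in ("gpuCount", "perBuyerMaxInFlight"):
        if key in capacity:
            status[key] = capacity[key]
    if in_flight is not None:
        status["inFlight"] = in_flight  # left out entirely when unknown
    return status


spec_capacity = {"mode": "external-static", "gpuCount": 1, "maxInFlight": 16}
print(capacity_status(spec_capacity, in_flight=7))
print(capacity_status(spec_capacity, in_flight=None))
```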

CLI UX

Minimal explicit form:

obol sell inference qwen36-fast \
  --upstream qwen36-external \
  --namespace llm \
  --port 8000 \
  --runtime openai \
  --payment-asset usdc \
  --price 0.001 \
  --gpu-count 1 \
  --max-inflight 16 \
  --per-buyer-max-inflight 4 \
  --max-input-tokens 8192 \
  --max-output-tokens 1024

Optional preset form later:

obol sell inference qwen36-fast \
  --upstream qwen36-external \
  --runtime openai \
  --capacity-preset external-vllm-1gpu-balanced

The preset should expand into static gateway limits only. It should not imply autoscaling.
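
A sketch of preset expansion, assuming presets are a static table in the CLI. The values shown for external-vllm-1gpu-balanced simply reuse the numbers from the example offer above and are not agreed defaults; expand_preset and STATIC_KEYS are hypothetical names.

```python
from typing import Optional

# Assumed preset table; values mirror the example ServiceOffer, not real defaults.
PRESETS = {
    "external-vllm-1gpu-balanced": {
        "mode": "external-static",
        "gpuCount": 1,
        "maxInFlight": 16,
        "perBuyerMaxInFlight": 4,
        "maxQueueDepth": 0,
    },
}

# Only static gateway limits may be overridden; nothing here implies autoscaling.
STATIC_KEYS = {"gpuCount", "maxInFlight", "perBuyerMaxInFlight", "maxQueueDepth"}


def expand_preset(name: str, overrides: Optional[dict] = None) -> dict:
    """Expand a preset name into a spec.capacity block, applying explicit flags."""
    if name not in PRESETS:
        raise ValueError(f"unknown capacity preset: {name}")
    capacity = dict(PRESETS[name])
    for key, value in (overrides or {}).items():
        if key not in STATIC_KEYS:
            raise ValueError(f"{key} is not a static capacity field")
        if value is not None:  # an unset CLI flag leaves the preset value alone
            capacity[key] = value
    return capacity


print(expand_preset("external-vllm-1gpu-balanced", {"maxInFlight": 8}))
```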

Acceptance criteria

  • ServiceOffer CRD accepts a minimal external-static capacity envelope.
  • Reconciler creates a Traefik inFlightReq middleware for spec.capacity.maxInFlight.
  • HTTPRoute attaches capacity middleware and x402 ForwardAuth in an order that avoids payment quotes when saturated.
  • Saturated offer returns 429 with a Retry-After header, not 402.
  • Unsaturated unpaid request still returns normal x402 402 terms.
  • Paid request forwards to the external inference endpoint as before.
  • obol sell status shows capacity mode, scalingControlled=false, GPU count, max in-flight, and current/unknown in-flight state.
  • Existing obol sell http / sell inference behavior keeps working when capacity fields are omitted.
  • Payment asset is explicit in the payment terms (payment.asset / --payment-asset) with backwards-compatible default usdc.
  • x402 pricing/config output preserves the selected settlement asset so buyers know whether they are paying USDC or OBOL.
  • Docs explicitly state that external-static capacity is seller-declared and Obol does not autoscale the endpoint.
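
The backwards-compatibility criteria could be covered by a defaulting step along these lines. This is a hypothetical sketch (normalize_offer is not an existing function): an offer without spec.capacity is left without a capacity block so no in-flight middleware is generated, and payment.asset falls back to usdc.

```python
import copy


def normalize_offer(spec: dict) -> dict:
    """Apply defaults and basic validation to a ServiceOffer spec."""
    out = copy.deepcopy(spec)
    out.setdefault("payment", {}).setdefault("asset", "usdc")

    capacity = out.get("capacity")
    if capacity is None:
        return out  # legacy offer: no capacity envelope, behaviour unchanged

    capacity.setdefault("mode", "external-static")
    capacity.setdefault("maxQueueDepth", 0)  # no queue: saturated means 429

    max_in_flight = capacity.get("maxInFlight")
    if not isinstance(max_in_flight, int) or max_in_flight < 1:
        raise ValueError("capacity.maxInFlight must be a positive integer")
    per_buyer = capacity.get("perBuyerMaxInFlight")
    if per_buyer is not None and not 1 <= per_buyer <= max_in_flight:
        raise ValueError("perBuyerMaxInFlight must be between 1 and maxInFlight")
    return out


legacy = normalize_offer({"payment": {"network": "base-sepolia"}})
print(legacy["payment"]["asset"], "capacity" in legacy)
```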

Open questions

  • Should gpuCount be required for type: inference capacity presets, or optional metadata only?
  • Do we need per-buyer concurrency in v1, or is global max-in-flight enough for the first cut?
  • Can Traefik middleware ordering give us 429 before 402 cleanly, or do we need a tiny x402-aware admission shim later?
  • Should maxQueueDepth be omitted entirely for v1 and default to no queue?
  • Should multi-asset offers use a future payment.accepts[] list, or should one ServiceOffer advertise exactly one settlement asset for minimal v1?
