Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -99,6 +99,14 @@ Earlier `1.4.9` of `ghcr.io/obolnetwork/x402-facilitator-prometheus-overlay` shi
- **Fixed upstream**: `ObolNetwork/x402-rs#3` (merged 2026-05-13, `668b7bb`) dropped the platform pin. The publish workflow republished `1.4.9` on push to `main`; arm64 digest is now `sha256:b209345c5e05415df36444b307213c61f9ca08db9f8131d0ebfebefc244ba4ec`.
- **`X402_FACILITATOR_SKIP_PULL` knob removed** from `flows/lib.sh` once the republished image was validated against the release-smoke. If you encounter `exec format error` on an arm64 host now, the registry image is wrong, not the host — pull-fresh (`docker pull ghcr.io/obolnetwork/x402-facilitator-prometheus-overlay:1.4.9`) and check the manifest with `docker buildx imagetools inspect`.

### 10. Cloudflare WAF blocks default `Python-urllib` User-Agent on external sellers

When buying from external x402 sellers (sellers running outside our k3d cluster — e.g. `https://inference.v1337.org/...`), some sit behind Cloudflare's managed WAF, which **blocks the default `Python-urllib/X.Y` UA with HTTP 403 + Cloudflare error 1010** ("the owner of this website has banned your access based on your browser's signature"). Both the unpaid 402 probe and the paid `X-PAYMENT` request fail; buyers see misleading auth/signing errors instead of the real cause.

- **Symptom**: `buy.py probe` against an external seller fails with 403 (often surfaced as a JSON-decode error or "no accepts" downstream); `buy.py buy` against the same endpoint also fails before signature verification. Curl with default browser UA against the same URL returns 402 cleanly.
- **Fix in repo**: `c2dddc1` — added module-level `USER_AGENT = os.environ.get("OBOL_BUYER_USER_AGENT", "obol-buy-x402/1.0 (+https://github.com/ObolNetwork/obol-stack)")` to `internal/embed/skills/buy-x402/scripts/buy.py`, applied in `_probe_endpoint` (kind=http), `_probe_endpoint` (kind=inference), and the paid `X-PAYMENT` request in `buy_paid_oneshot`. Tested four UAs against v1337 (`curl/*`, generic `Mozilla/*`, `Chrome/*`, custom `obol-buy-x402/*`) — all four returned 402 cleanly. The fix is "send anything that isn't `Python-urllib`", not "send a specific browser UA". Operator override: `OBOL_BUYER_USER_AGENT`.
- **Follow-up (not yet confirmed)**: the same WAF block likely affects the Go-side controller probe at `internal/serviceoffercontroller/purchase.go:183`, since Go's `http.Client` defaults to `User-Agent: Go-http-client/1.1`. Verify against v1337 and apply the same UA override on the Go side if reproduced.

## Diagnostic Patterns

- **Don't confuse 503 with "verifier broken"** — almost always one of #1, #2, #5, #6, or a missing CA bundle (`paid-flows.md`).
Expand Down
101 changes: 96 additions & 5 deletions flows/buy-external.sh
Original file line number Diff line number Diff line change
Expand Up @@ -180,17 +180,108 @@ fi
PF_LITELLM=""
PF_LITELLM_LOG=""

# Best-effort diagnostic snapshot, taken on the failure path BEFORE the cluster
# is torn down. Each command is wrapped in `|| true` so a single failure does
# not abort the rest of the bundle. The sidecar status snapshot uses the same
# `kubectl exec ... python3 -c` shape as the in-flow before/after captures
# (the buyer container is distroless — no curl/wget).
external_snapshot_on_fail() {
type bob >/dev/null 2>&1 || return 0
[ -d "$EXTERNAL_BUY_ARTIFACT_DIR" ] || return 0

local f

f="$EXTERNAL_BUY_ARTIFACT_DIR/controller.log"
if bob kubectl logs -n x402 deploy/serviceoffer-controller --tail=2000 --previous \
> "$f" 2>/dev/null; then
echo " snapshot: $f"
else
if bob kubectl logs -n x402 deploy/serviceoffer-controller --tail=2000 \
> "$f" 2>/dev/null; then
echo " snapshot: $f (no --previous available)"
else
rm -f "$f" 2>/dev/null || true
fi
fi

f="$EXTERNAL_BUY_ARTIFACT_DIR/controller-current.log"
if bob kubectl logs -n x402 deploy/serviceoffer-controller --tail=2000 \
> "$f" 2>/dev/null; then
echo " snapshot: $f"
else
rm -f "$f" 2>/dev/null || true
fi

f="$EXTERNAL_BUY_ARTIFACT_DIR/purchaserequest.yaml"
if bob kubectl get purchaserequest -A -o yaml > "$f" 2>/dev/null; then
echo " snapshot: $f"
else
rm -f "$f" 2>/dev/null || true
fi

f="$EXTERNAL_BUY_ARTIFACT_DIR/buyer-status-after.json"
if bob kubectl exec -n llm deployment/litellm -c litellm -- \
python3 -c "
import urllib.request, json
try:
resp = urllib.request.urlopen('http://localhost:8402/status', timeout=5)
print(json.dumps(json.loads(resp.read()), indent=2))
except Exception as e:
print(json.dumps({'error': repr(e)}))
" > "$f" 2>/dev/null; then
echo " snapshot: $f"
else
rm -f "$f" 2>/dev/null || true
fi

# Re-use the harness-captured buy.py log if it was written; do not re-fetch.
if [ -f "$EXTERNAL_BUY_ARTIFACT_DIR/buy-py.log" ]; then
f="$EXTERNAL_BUY_ARTIFACT_DIR/agent-pod-buypy.log"
if cp "$EXTERNAL_BUY_ARTIFACT_DIR/buy-py.log" "$f" 2>/dev/null; then
echo " snapshot: $f"
fi
fi

f="$EXTERNAL_BUY_ARTIFACT_DIR/cluster-pods.txt"
if bob kubectl get pods -A -o wide > "$f" 2>/dev/null; then
echo " snapshot: $f"
else
rm -f "$f" 2>/dev/null || true
fi

f="$EXTERNAL_BUY_ARTIFACT_DIR/cluster-events.txt"
if bob kubectl get events -A --sort-by='.lastTimestamp' 2>/dev/null \
| tail -100 > "$f" 2>/dev/null && [ -s "$f" ]; then
echo " snapshot: $f"
else
rm -f "$f" 2>/dev/null || true
fi
}

external_cleanup() {
local ec=$?
set +e
[ -n "$PF_LITELLM" ] && cleanup_pid "$PF_LITELLM" 2>/dev/null
[ -n "$PF_LITELLM_LOG" ] && rm -f "$PF_LITELLM_LOG" 2>/dev/null

# Leave the cluster up on success so the operator can poke around. Only
# tear it down if the flow already failed — a leaked k3d cluster between
# runs eats Docker network space (cleanup_k3d_obol_networks reclaims).
if [ "$ec" -ne 0 ] && type bob >/dev/null 2>&1; then
bob stack down >/dev/null 2>&1 || true
# Cleanup gate: tear down only when every step passed. On FAIL, snapshot
# diagnostics and preserve the cluster — the only places that record why
# a PurchaseRequest never advanced are the controller logs, PR
# status.conditions[], and sidecar /status, all of which die with the
# cluster. Operator pays one manual `bob stack down` when done diagnosing.
if type bob >/dev/null 2>&1; then
if [ "$ec" -eq 0 ]; then
bob stack down >/dev/null 2>&1 || true
else
echo "Capturing failure snapshot to $EXTERNAL_BUY_ARTIFACT_DIR"
external_snapshot_on_fail
echo ""
echo "FAIL → cluster preserved for diagnosis."
echo " Stack id: $PINNED_STACK_ID"
echo " Artifacts: $EXTERNAL_BUY_ARTIFACT_DIR"
echo " Manual cleanup when done:"
echo " bob stack down"
fi
fi
cleanup_k3d_obol_networks
set -e
Expand Down
39 changes: 38 additions & 1 deletion flows/lib.sh
Original file line number Diff line number Diff line change
Expand Up @@ -517,9 +517,46 @@ bootstrap_flow_workspace() {
local dir="$1"
local obol_bin="$2"
local tool src
local workspace_bin="$OBOL_ROOT/.workspace/bin/obol"
local picked="$obol_bin"
local picked_mtime other_mtime delta abs_delta

# Pick the freshest of the caller-supplied binary and the workspace binary.
# During iteration on embedded skill content (e.g. buy-x402/scripts/buy.py)
# it is easy to rebuild one and forget the other; copying the stale one
# silently bakes pre-fix files into the cluster PVC. See pitfall in
# plans/inference-v1337-buy-report-20260514.md (v1337 attempt 5).
if [ -f "$obol_bin" ] && [ -f "$workspace_bin" ] && [ "$obol_bin" != "$workspace_bin" ]; then
picked_mtime=$(stat -c %Y "$obol_bin" 2>/dev/null || stat -f %m "$obol_bin" 2>/dev/null || echo 0)
other_mtime=$(stat -c %Y "$workspace_bin" 2>/dev/null || stat -f %m "$workspace_bin" 2>/dev/null || echo 0)
if [ "$other_mtime" -gt "$picked_mtime" ]; then
picked="$workspace_bin"
delta=$((picked_mtime - other_mtime))
else
delta=$((other_mtime - picked_mtime))
fi
abs_delta=${delta#-}
if [ "$abs_delta" -gt 300 ]; then
local fmt_a fmt_b picked_fmt
fmt_a=$(date -r "$obol_bin" -u +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -d "@$picked_mtime" +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || echo "?")
fmt_b=$(date -r "$workspace_bin" -u +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -d "@$other_mtime" +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || echo "?")
if [ "$picked" = "$workspace_bin" ]; then
picked_fmt="$workspace_bin (mtime $fmt_b)"
else
picked_fmt="$obol_bin (mtime $fmt_a)"
fi
echo " WARN: obol binary mtimes differ by ${abs_delta}s — one of these was likely forgotten in a rebuild" >&2
echo " $obol_bin mtime $fmt_a" >&2
echo " $workspace_bin mtime $fmt_b" >&2
echo " picked: $picked_fmt" >&2
echo " Rebuild both with \`go build -o .build/obol ./cmd/obol && go build -o .workspace/bin/obol ./cmd/obol\` if you've been iterating on embedded skill content." >&2
fi
elif [ ! -f "$obol_bin" ] && [ -f "$workspace_bin" ]; then
picked="$workspace_bin"
fi

reset_flow_workspace "$dir"
cp "$obol_bin" "$dir/bin/obol"
cp "$picked" "$dir/bin/obol"
chmod +x "$dir/bin/obol"
for tool in kubectl helm helmfile k3d k9s openclaw; do
src=$(command -v "$tool" 2>/dev/null || printf '%s\n' "$OBOL_ROOT/.workspace/bin/$tool")
Expand Down
102 changes: 102 additions & 0 deletions plans/inference-v1337-followup-20260514.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,102 @@
# Live buy from `inference.v1337.org` — follow-up findings

Date: 2026-05-14
Test bed: spark1 (Linux aarch64)
Worktree: `/home/claude/obol-stack-qa-20260513-135712-post490`
Branch HEAD during the test: `eb13055` (`chore/buy-external-followups`)
Companion to: `plans/inference-v1337-buy-report-20260514.md`

## TL;DR

Re-ran the v1337 buy with `KEEP_CLUSTER_ON_FAIL=1` (new knob in `flows/buy-external.sh`, commit `b749f95`). Steps 1–17 PASS. The controller reconciled the PurchaseRequest in 55 seconds (`Probed` → `AuthsLoaded` → `Configured` → `Ready`), publishing `paid/qwen3.6-27b` via the buyer sidecar. **The "controller external-seller mode gap" hypothesis from the original report is false** — the controller is endpoint-agnostic by design (confirmed by code review of `internal/serviceoffercontroller/purchase.go`) and works against arbitrary external x402 sellers without modification.

The original report's attempt 5 failure (`command terminated with exit code 137` at the controller-reconcile wait) was almost certainly a kubectl-exec session SIGKILL — not a controller hang. Today's identical code path completed cleanly in 0s at the same step.

## What today's run actually showed

PR `status.conditions[]` from the captured `purchaserequest.yaml`:

```yaml
conditions:
- type: Probed status: "True" reason: Validated message: "402: 23000000000000000 on eip155:84532" 06:37:06Z
- type: AuthsLoaded status: "True" reason: Loaded message: "Loaded 1 pre-signed auths from spec" 06:37:06Z
- type: Configured status: "True" reason: Written message: "Wrote 1 auths to llm/x402-buyer-auths" 06:37:06Z
- type: Ready status: "True" reason: Reconciled message: "Sidecar: 1 remaining, 0 spent" 06:38:01Z
observedGeneration: 1
publicModel: paid/qwen3.6-27b
remaining: 1
totalSigned: 1
signerAddress: 0x57b0eF875DeB5A37301F1640E469a2129Da9490E
```

`buyer-status-after.json`:

```json
{"v1337-aeon": {"url": "https://inference.v1337.org/services/aeon", "remote_model": "qwen3.6-27b", "public_model": "paid/qwen3.6-27b", "remaining": 1, "spent": 0, "network": "base-sepolia"}}
```

The Go-side probe at `purchase.go:183` was NOT WAF-blocked. The follow-up worry (entry #10 in `release-smoke-debugging.md`) about Cloudflare's WAF blocking `Go-http-client/1.1` UA does not reproduce against v1337. Worth keeping the doc note for the general class of WAF UA filters, but not a load-bearing concern for the controller code path.

## Why attempt 5 looked like a controller hang

Today the harness completed step 14 (`buy.py completed`) with the same code that produced attempt 5's exit-code-137. Two structural differences plausibly explain why attempt 5 hung where today's run sailed:

1. **`bootstrap_flow_workspace` now picks the freshest binary** (commit `eb13055`). Attempt 5 silently used a stale `.build/obol` whose embedded buy.py lacked the USER_AGENT fix. Even after the operator rebuilt `.build/obol` mid-attempt, the Bob workspace had already been bootstrapped from the older copy. The PVC's buy.py wrote a probe with `Python-urllib` UA → 403 from CF → the controller's view of the world differed from the buyer's view in subtle ways. The `eb13055` fix removes that footgun for future runs.

2. **The kubectl-exec SIGKILL was an environmental artifact.** `command terminated with exit code 137` is what kubectl prints when its remote process dies from SIGKILL — could be harness `run_with_timeout`, OOM, or control-plane jitter. None of those would be visible in the controller logs (which today's `KEEP_CLUSTER_ON_FAIL=1` snapshot proved go quiet during the wait). Today's harness completed the same `obol kubectl exec` to buy.py without issue, so the SIGKILL was not deterministic.

## The actual blocker today

Step 18 (paid request through LiteLLM) failed:

```
FAIL: [18] Paid request returned HTTP 404
{"error":{"message":"litellm.NotFoundError: NotFoundError: OpenAIException - The model `qwen3.6-27b` does not exist.. Received Model Group=paid/qwen3.6-27b\nAvailable Model Group Fallbacks=None","type":null,"param":null,"code":"404"}}
```

This is operator-error model-name mismatch:

- LiteLLM correctly routed `paid/qwen3.6-27b` → buyer sidecar → `https://inference.v1337.org/services/aeon`.
- v1337's upstream vLLM does not serve a model named `qwen3.6-27b`.
- The actual model name is unknown from `/.well-known/agent-registration.json` (which advertises display name "Qwen3.6-27B AEON Ultimate" and skills `llm/inference, llm/uncensored`, but no model id).

Bob's 0.023 OBOL pre-signed auth was **NOT consumed** — LiteLLM 404'd before reaching the buyer sidecar's `/settle` path. Wallet balance unchanged.

To finish the live buy proof, the harness needs the right `--model` value. Options: (a) ask the seller, (b) probe `/v1/models` if v1337 makes it free, (c) brute-force common variants (`aeon`, `qwen-3.6-27b`, `qwen3.6`, `qwen3.6-27b-aeon`). All low-priority — the controller-side answer is already in.

## Side finding: LiteLLM hot-add quirk

The controller logs surfaced:

```
purchase: hot-add paid/qwen3.6-27b failed: POST /model/new: 400 Bad Request:
{"error":{"message":"Authentication Error, [Errno 30] Read-only file system: '/etc/litellm/config.yaml'", ...}}; relying on ConfigMap reload
```

LiteLLM's `/model/new` API tries to write back to `/etc/litellm/config.yaml`. In our deployment that path is a Kubernetes ConfigMap volume — read-only by default. The controller catches the 400 and falls back to the ConfigMap-reload path, which works (the alias DID become available, otherwise step 17 wouldn't have passed). Pre-existing behavior, not external-seller specific. Worth a one-line note in `paid-flows.md` so the next debugger isn't startled by the WARN in controller logs.

## Updates to original report

Replace follow-up #1 ("serviceoffer-controller external-seller mode") with: "RESOLVED — controller is endpoint-agnostic by design. Attempt 5's reconcile-hang was a kubectl-exec SIGKILL artifact, not a controller bug. Verified 2026-05-14 with `KEEP_CLUSTER_ON_FAIL=1` re-run."

Follow-up #2 (harness binary path) — DONE in commit `eb13055`.
Follow-up #3 (CF-WAF UA documentation) — DONE in commit `849cd93`.
Follow-up #4 (`KEEP_CLUSTER_ON_FAIL` knob) — DONE in commit `b749f95`.

The original report still has narrative value for the four bug fixes it surfaced (k3d cluster-name cap, CAIP-2 chain id mismatch, CF-WAF Python-urllib UA, stale `.build/obol`). Only the controller hypothesis was wrong.

## Artifacts

Under `/home/claude/obol-stack-qa-20260513-135712-post490/.tmp/v1337-rerun-20260514-063232-artifacts/` on spark1, captured by the new `external_snapshot_on_fail()`:

- `controller.log`, `controller-current.log` — full reconcile trace
- `purchaserequest.yaml` — the conclusive `Ready=True` proof
- `buyer-status-after.json` — sidecar saw 1 remaining, 0 spent
- `agent-pod-buypy.log` — clean `buy.py` run through PR creation
- `cluster-pods.txt`, `cluster-events.txt` — full cluster state at FAIL

The Bob k3d cluster (`obol-stack-buy-ext-bob`) is preserved on spark1 pending teardown.

## Closing note

The phase-1 polish items in `chore/buy-external-followups` more than paid for themselves on the first re-run: `KEEP_CLUSTER_ON_FAIL=1` made the diagnosis trivial, the binary normalization removed one of the candidate causes for attempt 5's hang, and the diagnostic snapshot bundle gave us seven artifacts that took a single bash command to inspect. The original v1337 report would have been wrong on its central technical claim if we hadn't re-run with these in place — a useful argument for keeping operator-level diagnostic ergonomics ahead of feature work.