
feat(ingress): per-workload CF ingress + simplify apps to web-nvidia-smi demo #134

Merged
posix4e merged 1 commit into main from feat/per-workload-ingress on Apr 18, 2026

@posix4e posix4e commented Apr 18, 2026

Summary

  • Adds optional `expose: {hostname_label, port}` on individual boot workloads. dd-agent collects those into `DD_EXTRA_INGRESS`, forwards on `/register`, CP prepends them to the cloudflared ingress and provisions matching CNAMEs.
  • Simplifies `apps/` from podman+ollama+openclaw to podman+`web-nvidia-smi` — one focused demo that proves podman, GPU passthrough, and the new ingress path end-to-end. Ollama+openclaw move out of this repo (next PR: slopandmop).
  • `web-nvidia-smi` serves `nvidia-smi` output on `gpu.<agent-host>.devopsdefender.com` via a tiny nc loop in an `nvidia/cuda:12.6.1-base-ubuntu22.04` container.
  • Preview agent VM keeps its registration-smoke-test role but drops the CPU-ollama workload.

Scope boundary

Boot-time exposure only. Runtime `/deploy` exposure for POSTed workloads (e.g. anything slopandmop ships at runtime) is a follow-up — those workloads still run, they're just not auto-routed to a public hostname yet.

Test plan

  • `cargo fmt && cargo check && cargo test` pass locally
  • Run `./apps/_infra/local-agents.sh "" https://app.devopsdefender.com` on tdx2 with DD_PAT + DD_ITA_API_KEY
  • `virsh start dd-local-prod`; `virsh console` shows ITA mint, register succeeds, CP log shows `extra_ingress` entry
  • `curl https://<agent-host>.devopsdefender.com/` → dashboard (no regression)
  • `curl https://gpu.<agent-host>.devopsdefender.com/` → `nvidia-smi` text table
  • Preview agent still registers against a PR CP and shows up in the fleet

🤖 Generated with Claude Code

…smi demo

Agent VMs can now declare `expose: {hostname_label, port}` on individual
boot workloads; dd-agent forwards those on /register, CP prepends them
to the cloudflared ingress alongside the default dashboard rule, and
CF provisions matching CNAMEs. Each entry becomes a public hostname
`<label>.<agent-hostname>` → `localhost:<port>`.
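
The mapping described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Expose` and `IngressRule` types, the `build_ingress` function, and the `dashboard_port` parameter are all hypothetical names, and the example hostname is made up. It only shows the ordering (extras first, default dashboard rule after) and the `<label>.<agent-hostname>` to `localhost:<port>` shape.

```rust
/// One `expose` entry from a boot workload (field names follow the PR text).
#[derive(Debug, Clone, PartialEq)]
struct Expose {
    hostname_label: String,
    port: u16,
}

/// A simplified ingress rule: public hostname -> local service URL.
/// (Hypothetical type; a real cloudflared config has more fields.)
#[derive(Debug, Clone, PartialEq)]
struct IngressRule {
    hostname: String,
    service: String,
}

/// Build the rule list with the extras prepended ahead of the default
/// dashboard rule. `dashboard_port` is an assumption for illustration.
fn build_ingress(agent_hostname: &str, dashboard_port: u16, extras: &[Expose]) -> Vec<IngressRule> {
    let mut rules: Vec<IngressRule> = extras
        .iter()
        .map(|e| IngressRule {
            // <label>.<agent-hostname> -> localhost:<port>
            hostname: format!("{}.{}", e.hostname_label, agent_hostname),
            service: format!("http://localhost:{}", e.port),
        })
        .collect();
    // The default dashboard rule comes after the per-workload rules.
    rules.push(IngressRule {
        hostname: agent_hostname.to_string(),
        service: format!("http://localhost:{}", dashboard_port),
    });
    rules
}

fn main() {
    let extras = vec![Expose { hostname_label: "gpu".into(), port: 8081 }];
    for rule in build_ingress("agent1.example.com", 8080, &extras) {
        println!("{} -> {}", rule.hostname, rule.service);
    }
}
```

A real cloudflared ingress list also ends with a catch-all rule that has no hostname; that rule is left out here to keep the sketch focused on the per-workload entries.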

dd's apps/ example collapses from podman+ollama+openclaw down to
podman+web-nvidia-smi — one focused demo that proves podman, GPU
passthrough, and the new ingress path end-to-end. Ollama and openclaw
move out of this repo; they'll land in slopandmop as a self-contained
example where they belong.

Preview agent VM keeps its role as the registration smoke test against
per-PR CPs but drops its CPU-ollama workload. Prod agent VM serves
`gpu.<agent-host>.devopsdefender.com` with the container's
`nvidia-smi` output.

Boot-time exposure only in this PR. Runtime /deploy exposure for
POSTed workloads (e.g. anything slopandmop ships at runtime) is a
follow-up — those workloads still run, they're just not auto-routed
to a public hostname yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

DD preview ready

URL: https://pr-134.devopsdefender.com

Browser login: paste the output of `gh auth token` at https://pr-134.devopsdefender.com/auth/pat

CLI / curl: `curl -H "Authorization: Bearer $(gh auth token)" https://pr-134.devopsdefender.com/`

Register endpoint for a local agent: wss://pr-134.devopsdefender.com/register

@posix4e posix4e merged commit c57d248 into main Apr 18, 2026
4 checks passed
posix4e added a commit that referenced this pull request Apr 18, 2026
The prior shape — a JSON array substituted into
`"DD_EXTRA_INGRESS=${DD_EXTRA_INGRESS}"` — closed the outer env
string at the first embedded `"`, producing invalid JSON that broke
`jq -c .`:

  jq: parse error: Invalid numeric literal at line 21, column 40

Seen on the dd-local-prod relaunch pipeline immediately after #134
merged (the failing job was in main's Release cascade).

Switches the wire format to comma-separated `label:port` pairs
(`gpu:8081` or `gpu:8081,web:9000`) and adds unit tests covering the
parser edge cases. HTTP request body from agent → CP /register still
carries the structured JSON shape — only the env-var-to-env-var hop
changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
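
For illustration, the comma-separated `label:port` format described in the commit above could be parsed along these lines. This is a sketch under assumptions, not the parser that shipped: the function name `parse_extra_ingress`, the `String` error type, and the choice to skip empty entries (such as a trailing comma) are all hypothetical.

```rust
/// Parse `label:port[,label:port...]` into `(label, port)` pairs.
/// Empty input yields an empty list; empty entries (e.g. a trailing comma)
/// are skipped; anything else that is malformed is an error.
fn parse_extra_ingress(raw: &str) -> Result<Vec<(String, u16)>, String> {
    let mut out = Vec::new();
    for entry in raw.split(',').map(str::trim).filter(|s| !s.is_empty()) {
        // Each entry must contain a ':' separating label and port.
        let (label, port) = entry
            .split_once(':')
            .ok_or_else(|| format!("missing ':' in {entry:?}"))?;
        let label = label.trim();
        if label.is_empty() {
            return Err(format!("empty label in {entry:?}"));
        }
        // Parsing into u16 rejects non-numeric and out-of-range ports.
        let port: u16 = port
            .trim()
            .parse()
            .map_err(|err| format!("bad port in {entry:?}: {err}"))?;
        out.push((label.to_string(), port));
    }
    Ok(out)
}

fn main() {
    println!("{:?}", parse_extra_ingress("gpu:8081,web:9000"));
    println!("{:?}", parse_extra_ingress("gpu"));
}
```

Because the value contains no double quotes, it can be substituted into a quoted `"DD_EXTRA_INGRESS=..."` string without closing the outer string early, which is the failure the commit describes.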
posix4e added a commit that referenced this pull request Apr 18, 2026
PR #134 wired `expose` at boot time — the ingress rules baked into the
agent VM's config.iso got published when CP first created the tunnel.
This extends that to the runtime path: when a workload POSTed to
dd-agent's /deploy declares `expose: {hostname_label, port}`, the agent
now calls the CP's new /ingress/replace endpoint with the merged
(boot + runtime) extras list, and CP re-PUTs the tunnel config +
upserts a CNAME for the new hostname.

Wire-level summary:

- cf.rs — extract `apply_ingress()` used by both `create()` (at
  register) and a new public `update_ingress()` (at runtime). The
  existing tunnel id + token stay stable.
- cp.rs — new endpoint POST /ingress/replace. PAT-authenticated,
  looks up the agent in the store by agent_id, re-PUTs the tunnel
  config, updates the store's `extras` field for the agent.
- collector::Agent — gains `tunnel_id` + `extras` fields, preserved
  across /health scrapes so the collector doesn't clobber them.
- agent.rs — stores `agent_id` from the register bootstrap, holds
  a live `Arc<RwLock<Vec<(String, u16)>>>` for the merged extras,
  hooks into /deploy to push updates. Soft-fails — workload stays
  running even if the ingress update fails; only public reachability
  is affected.

Opens the runtime path slopandmop needs: POST openclaw to an agent,
the agent asks CP to route `openclaw.<agent-host>` → localhost:port,
CF picks up the ingress config within seconds, browser hits the URL.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
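
The agent-side behaviour described in the commit above (a shared merged extras list plus a soft-failing push to the CP) might look roughly like this. It is a sketch only: `merge_extra`, `expose_workload`, and the `push` callback are hypothetical names, the callback stands in for the HTTP call to `/ingress/replace`, and the upsert-by-label behaviour and the `openclaw` port are assumptions. Only the `Arc<RwLock<Vec<(String, u16)>>>` shape is taken from the commit text.

```rust
use std::sync::{Arc, RwLock};

/// Merged (boot + runtime) extras, shared across handlers.
type Extras = Arc<RwLock<Vec<(String, u16)>>>;

/// Insert or update one entry by label and return a snapshot of the merged list.
/// (Upsert-by-label is an assumption; the real merge rule may differ.)
fn merge_extra(extras: &Extras, label: &str, port: u16) -> Vec<(String, u16)> {
    let mut guard = extras.write().expect("extras lock poisoned");
    let pos = guard.iter().position(|entry| entry.0 == label);
    match pos {
        Some(i) => guard[i].1 = port,
        None => guard.push((label.to_string(), port)),
    }
    guard.clone()
}

/// Merge the new entry, then push the full list to the control plane.
/// A failed push is logged and reported as `false`, never propagated:
/// the workload keeps running and only public reachability is affected.
fn expose_workload<F>(extras: &Extras, label: &str, port: u16, push: F) -> bool
where
    F: FnOnce(&[(String, u16)]) -> Result<(), String>,
{
    let merged = merge_extra(extras, label, port);
    match push(&merged) {
        Ok(()) => true,
        Err(err) => {
            eprintln!("ingress update failed (workload still running): {err}");
            false
        }
    }
}

fn main() {
    // Boot-time extras, e.g. from DD_EXTRA_INGRESS.
    let extras: Extras = Arc::new(RwLock::new(vec![("gpu".to_string(), 8081)]));
    // Stand-in for the CP call; it always fails here to show the soft-fail path.
    let routed = expose_workload(&extras, "openclaw", 9000, |_| Err("CP unreachable".to_string()));
    println!("routed={routed}, extras={:?}", extras.read().unwrap());
}
```

Sending the whole merged list rather than a delta matches the "replace" semantics of the endpoint name, so the CP can re-PUT the tunnel config without tracking per-agent history.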
posix4e added a commit that referenced this pull request Apr 18, 2026
…#136)

@posix4e posix4e deleted the feat/per-workload-ingress branch April 18, 2026 21:57