Sandbox stuck in Provisioning on macOS with Podman (libkrun): docker driver resolves host-gateway to unreachable bridge IP #1519

@rhuss

Agent Diagnostic

Investigated the docker driver's gateway route selection in crates/openshell-driver-docker/src/lib.rs:

  • uses_host_gateway_alias() (line 1280) detects Docker Desktop, Colima, Lima, Rancher Desktop, and OrbStack via daemon OS string, hostname, and labels
  • Podman with libkrun on macOS reports os: "linux" and hostname: "localhost.localdomain", with no matching labels, so none of the detection patterns match
  • docker_gateway_route() falls through to Bridge mode, setting bind_address and host_alias_ip to the bridge gateway IP (10.89.0.1)
  • docker_extra_hosts() maps host.openshell.internal to 10.89.0.1, which is injected into /etc/hosts inside the container
  • On macOS/libkrun, 10.89.0.1 exists only inside the VM's network namespace and is not routable from containers
  • The supervisor inside the sandbox tries to connect to https://host.openshell.internal:17670/, which resolves to 10.89.0.1, and fails with "failed to connect to OpenShell server"
  • Verified: containers CAN reach the macOS host via host.containers.internal (192.168.127.254) using openssl s_client -connect 192.168.127.254:17670
  • The podman driver (crates/openshell-driver-podman) handles this correctly: it auto-detects host.containers.internal for the gRPC endpoint (driver.rs line 232) and uses podman's host-gateway for hostadd
  • The docker driver's detection logic has no case for podman-backed runtimes on macOS
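The fall-through described above can be sketched as follows. This is a simplified illustration, not the code in crates/openshell-driver-docker/src/lib.rs: the `DaemonInfo` struct, the `GatewayRoute` enum, and the marker strings are assumptions made for the example, and only the two function names come from the diagnostic.

```rust
/// Hypothetical subset of the daemon info the diagnostic refers to.
#[derive(Debug, Clone, Default)]
pub struct DaemonInfo {
    pub operating_system: String,
    pub hostname: String,
    pub labels: Vec<String>,
}

/// Hypothetical route type: either rely on the runtime's host-gateway
/// alias, or use the bridge network's gateway IP directly.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum GatewayRoute {
    HostGatewayAlias,
    Bridge { gateway_ip: String },
}

/// Illustrative stand-in for `uses_host_gateway_alias()`.
/// The marker strings are assumptions; the real patterns may differ.
pub fn uses_host_gateway_alias(info: &DaemonInfo) -> bool {
    const KNOWN_MARKERS: [&str; 5] =
        ["docker desktop", "colima", "lima", "rancher", "orbstack"];
    let mut fields = vec![
        info.operating_system.to_lowercase(),
        info.hostname.to_lowercase(),
    ];
    fields.extend(info.labels.iter().map(|label| label.to_lowercase()));
    fields
        .iter()
        .any(|field| KNOWN_MARKERS.iter().any(|marker| field.contains(*marker)))
}

/// Illustrative stand-in for `docker_gateway_route()`: anything that is not
/// recognized as a desktop-style runtime falls through to Bridge mode.
pub fn docker_gateway_route(info: &DaemonInfo, bridge_gateway_ip: &str) -> GatewayRoute {
    if uses_host_gateway_alias(info) {
        GatewayRoute::HostGatewayAlias
    } else {
        GatewayRoute::Bridge {
            gateway_ip: bridge_gateway_ip.to_string(),
        }
    }
}
```

With the values Podman/libkrun reports (os "linux", hostname "localhost.localdomain", no labels), this sketch returns the Bridge route with 10.89.0.1, which matches the behavior observed in this issue.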

Description

When using the Homebrew-installed OpenShell 0.0.46 on macOS with Podman (libkrun VM), sandboxes stay in Provisioning indefinitely. The supervisor cannot connect back to the gateway because the docker driver resolves host-gateway to the bridge subnet gateway IP (10.89.0.1), which is only accessible inside the libkrun VM's network namespace, not from within containers.

Expected: The docker driver should detect Podman/libkrun on macOS and use host.containers.internal (or host-gateway correctly) to route supervisor callbacks to the gateway.

Reproduction Steps

  1. macOS with Podman (libkrun), Homebrew-installed OpenShell 0.0.46
  2. openshell sandbox create --from ghcr.io/nvidia/openshell-community/sandboxes/base:latest
  3. Sandbox stays in Provisioning. Container logs show "Policy fetch failed" and "failed to connect to OpenShell server" in a loop
  4. Inside the container, /etc/hosts maps host.openshell.internal to 10.89.0.1 (unreachable)
  5. host.containers.internal (192.168.127.254) is reachable and the gateway responds on that IP

Environment

  • OS: macOS 26.5 (Darwin 25.5.0), arm64
  • Podman: 5.8.2 (VM type: libkrun)
  • OpenShell: 0.0.46 (Homebrew)
  • Gateway config: default (bind_address = 127.0.0.1:17670, [openshell.drivers.docker])

Logs

# Supervisor logs from inside the container:
openshell_sandbox: Policy fetch failed, retrying
openshell: log push connect failed: failed to connect to OpenShell server
# Repeats every ~2s until the container crashes

# /etc/hosts inside container:
10.89.0.1    host.docker.internal
10.89.0.1    host.openshell.internal   # <-- unreachable from container
192.168.127.254  host.containers.internal  # <-- reachable, but not used
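To check mechanically which address a given alias gets in a hosts file like the one above, a small parser is enough. This is a minimal sketch; `lookup_hosts_entry` is a hypothetical helper and is not part of OpenShell.

```rust
use std::net::IpAddr;

/// Return the first IP address that a hosts-file body maps `name` to.
/// Trailing `#` comments and blank lines are ignored.
pub fn lookup_hosts_entry(contents: &str, name: &str) -> Option<IpAddr> {
    for line in contents.lines() {
        // Everything after '#' is a comment.
        let line = line.split('#').next().unwrap_or("");
        let mut fields = line.split_whitespace();
        // The first field is the address, the rest are host names.
        let Some(addr) = fields.next() else { continue };
        if fields.any(|alias| alias.eq_ignore_ascii_case(name)) {
            if let Ok(ip) = addr.parse::<IpAddr>() {
                return Some(ip);
            }
        }
    }
    None
}
```

Applied to the file above, looking up host.openshell.internal yields 10.89.0.1, while host.containers.internal yields 192.168.127.254.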

Workaround

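Bind the gateway on all interfaces and point the docker driver at the host.containers.internal address (192.168.127.254 in this setup):

# ~/.config/openshell/gateway.toml
[openshell.gateway]
bind_address = "0.0.0.0:17670"

[openshell.drivers.docker]
host_gateway_ip = "192.168.127.254"

# ~/.config/openshell/gateway.env
OPENSHELL_BIND_ADDRESS=0.0.0.0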

Suggested Fix

uses_host_gateway_alias() in the docker driver should detect podman-backed runtimes. Podman's Docker-compatible API reports os: "linux" and hostname: "localhost.localdomain" with no identifying labels, but the connection comes through a podman socket. The runtime could be detected by checking for podman-specific API headers, the socket path, or system info fields like conmon_version that only podman exposes.
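A rough sketch of such a check, combining two of the hints mentioned above. Everything here is an assumption for illustration: `looks_like_podman` is a hypothetical helper, the socket-path test is a plain substring heuristic, and the component-name test assumes the compatibility API's version response lists a component whose name contains "Podman" (which should be verified against the Podman version in use).

```rust
/// Heuristic: does this connection look like it is backed by Podman?
///
/// `docker_host` is the endpoint the driver connects to (for example the
/// value of DOCKER_HOST), and `component_names` are the component names
/// taken from the daemon's version response. Both inputs are assumptions
/// about what the driver has available at detection time.
pub fn looks_like_podman(docker_host: Option<&str>, component_names: &[&str]) -> bool {
    // Hint 1: the socket path mentions podman.
    let socket_hint = docker_host
        .map(|host| host.to_ascii_lowercase().contains("podman"))
        .unwrap_or(false);
    // Hint 2: a reported component name mentions podman.
    let component_hint = component_names
        .iter()
        .any(|name| name.to_ascii_lowercase().contains("podman"));
    socket_hint || component_hint
}
```

The socket-path hint alone is not reliable, since a Podman socket can be exposed at a Docker-style path such as /var/run/docker.sock, so a field reported by the daemon itself is probably the stronger signal.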

Alternatively, the driver could probe connectivity to the bridge gateway IP before using it, falling back to host-gateway if the probe fails.
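The probe-and-fall-back idea could look roughly like this. It is a sketch under stated assumptions: `is_reachable`, `choose_host_alias`, and `HostAlias` are hypothetical names, the probe is a plain TCP connect made from wherever the driver runs, and it only succeeds if something is already listening on the probed port.

```rust
use std::net::{IpAddr, SocketAddr, TcpStream};
use std::time::Duration;

/// What to write into the container's extra hosts entry (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum HostAlias {
    /// The bridge gateway IP answered the probe, so use it directly.
    BridgeIp(IpAddr),
    /// The probe failed, so defer to the runtime's host-gateway alias.
    HostGateway,
}

/// Try a TCP connect to `ip:port`, giving up after `timeout`.
pub fn is_reachable(ip: IpAddr, port: u16, timeout: Duration) -> bool {
    TcpStream::connect_timeout(&SocketAddr::new(ip, port), timeout).is_ok()
}

/// Prefer the bridge gateway IP only when it actually accepts connections.
pub fn choose_host_alias(bridge_ip: IpAddr, port: u16, timeout: Duration) -> HostAlias {
    if is_reachable(bridge_ip, port, timeout) {
        HostAlias::BridgeIp(bridge_ip)
    } else {
        HostAlias::HostGateway
    }
}
```

A short timeout (a few hundred milliseconds) keeps sandbox creation from stalling when the bridge IP silently drops packets instead of refusing the connection.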

Agent-First Checklist

  • I pointed my agent at the repo and had it investigate this issue
  • I loaded relevant skills (e.g., debug-openshell-cluster, debug-inference, openshell-cli)
  • My agent could not resolve this -- the diagnostic above explains why

Labels

state:triage-needed (Opened without agent diagnostics and needs triage)