bug: Homebrew Docker gateway can use stale supervisor:dev binary incompatible with sandbox JWT auth #1523

@TaylorMutch

Summary

On macOS with the Homebrew-installed gateway, Docker-backed sandboxes can stay stuck in Provisioning even after the gateway is correctly forced to use Docker and sandbox JWT files are mounted. The root cause in this case was a stale locally cached Docker supervisor binary extracted from ghcr.io/nvidia/openshell/supervisor:dev.

The gateway was current enough to require gateway-minted sandbox JWT auth, but the cached supervisor binary was old enough that it still used the removed x-sandbox-secret mechanism. As a result, the token file was present and valid, but the supervisor never sent it as authorization: Bearer ....
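The statement that the token file was valid can be spot-checked by decoding the JWT payload locally. The following is a minimal sketch, not the project's tooling: `jwt_payload` is a hypothetical helper name, and it only base64url-decodes the middle segment without verifying the signature.

```shell
# Hypothetical helper: print the decoded payload (second segment) of a JWT.
# This does NOT verify the signature; it only shows the claims.
jwt_payload() {
  local token="$1" seg
  # Take the middle segment and map the base64url alphabet back to base64.
  seg="$(printf '%s' "$token" | cut -d. -f2 | tr '_-' '/+')"
  # base64url omits padding; restore it so the decoder accepts the input.
  case $(( ${#seg} % 4 )) in
    2) seg="${seg}==" ;;
    3) seg="${seg}=" ;;
  esac
  printf '%s' "$seg" | base64 --decode
}
```

For example, `jwt_payload "$(cat path/to/sandbox.jwt)"` should print the claims JSON. Treat the token as a credential and avoid pasting it into shared logs.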

Environment

  • macOS Homebrew install
  • Gateway: openshell-gateway 0.0.47-dev.12+g68d428055
  • Docker Desktop daemon: 29.1.3, aarch64
  • Docker context: docker-desktop
  • Podman was also installed
  • Initial /var/run/docker.sock pointed at Podman:
    • /var/run/docker.sock -> /Users/tmutch/.local/share/containers/podman/machine/podman.sock
  • Docker Desktop socket:
    • unix:///Users/tmutch/.docker/run/docker.sock
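The socket confusion listed above can be checked quickly by resolving where a socket path points. This is a rough sketch under stated assumptions: `classify_socket` is a hypothetical helper, and it only pattern-matches the symlink target, so it is a heuristic rather than a real probe of the daemon.

```shell
# Hypothetical helper: guess which runtime owns a Docker-compatible socket
# path by looking at its symlink target (heuristic, no daemon is contacted).
classify_socket() {
  local path="$1" target
  # readlink prints the symlink target; it prints nothing for a regular path.
  target="$(readlink "$path" 2>/dev/null || true)"
  [ -n "$target" ] || target="$path"
  case "$target" in
    *podman*) echo "podman" ;;
    *.docker/*|*docker.sock) echo "docker" ;;
    *) echo "unknown" ;;
  esac
}
```

On the host described here, `classify_socket /var/run/docker.sock` would be expected to report `podman`, which is why DOCKER_HOST had to be set explicitly.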

How We Got Here

  1. The Homebrew gateway appeared to use Podman instead of Docker.

    This matched the gateway's auto-detection order: Kubernetes, then Podman, then Docker. Because Podman was installed, the gateway selected Podman unless configured otherwise.

  2. We forced the gateway to Docker:

    ~/.config/openshell/gateway.toml:

    [openshell.gateway]
    compute_drivers = ["docker"]

    ~/.config/openshell/gateway.env:

    OPENSHELL_DRIVERS=docker
    DOCKER_HOST=unix:///Users/tmutch/.docker/run/docker.sock

    Setting DOCKER_HOST mattered because the default socket path, /var/run/docker.sock, pointed at Podman on this host.

  3. After restart, sandboxes were created in Docker Desktop, not Podman, but still stayed in Provisioning.

    The Docker container had:

    OPENSHELL_SANDBOX_TOKEN_FILE=/etc/openshell/auth/sandbox.jwt
    OPENSHELL_ENDPOINT=https://host.openshell.internal:17670/
    

    and mounted:

    ~/.local/state/openshell/docker-sandbox-tokens/default/<sandbox-id>/sandbox.jwt
      -> /etc/openshell/auth/sandbox.jwt
    
  4. Initially, the TLS directory copied for the Homebrew runtime did not contain the jwt/ subdirectory.

    Gateway JWT keys existed under:

    /opt/homebrew/var/openshell/tls/jwt/{signing.pem,public.pem,kid}
    

    but the Homebrew runtime TLS copy under:

    ~/.local/state/openshell/homebrew/tls/
    

    contained only the CA, server, and client TLS files. Copying the jwt/ directory there and restarting enabled gateway JWT minting:

    gateway-minted sandbox JWT enabled gateway_id=openshell ttl_secs=3600
    minted sandbox JWT
    
  5. Even with a mounted JWT, the sandbox still failed.

    Sandbox logs included:

    Failed to fetch inference bundle, inference routing disabled
    error: status: PermissionDenied, message: "GetInferenceBundle requires a sandbox principal"
    NET:FAIL [LOW] host.openshell.internal:17670
    

    The JWT itself decoded correctly:

    {
      "header": {
        "alg": "EdDSA",
        "kid": "45b4b366ae414387c0fa96717739ce35",
        "typ": "JWT"
      },
      "claims": {
        "aud": "openshell-gateway:openshell",
        "iss": "openshell-gateway:openshell",
        "sandbox_id": "<same sandbox id>",
        "sub": "spiffe://openshell/sandbox/<same sandbox id>"
      }
    }

  6. Comparing against e2e/with-docker-gateway.sh showed why e2e worked.

    The e2e wrapper writes a complete per-run Docker driver config and supplies a fresh matching supervisor binary via:

    [openshell.drivers.docker]
    supervisor_bin = "<freshly built openshell-sandbox>"

    The Homebrew gateway instead used the default supervisor image path and extracted/cached a binary from:

    ghcr.io/nvidia/openshell/supervisor:dev
    

    The failing container bind-mounted:

    ~/.local/share/openshell/docker-supervisor/sha256-87103ad60110703cc8e29053acd5ce643058c2f28978ee8248d2ab694ee37114/openshell-sandbox
      -> /opt/openshell/bin/openshell-sandbox
    

    That cached binary reported:

    openshell-sandbox 0.0.37-dev.160+g316c788ea
    

    The source at that commit still used x-sandbox-secret in crates/openshell-sandbox/src/grpc_client.rs and did not contain the current OPENSHELL_SANDBOX_TOKEN_FILE / Bearer JWT auth path.
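The version skew found in step 6 can be flagged mechanically by comparing the release core of the two version strings. This is a sketch of a heuristic, not an actual protocol check: `version_core` and `versions_match` are hypothetical helper names, and equal cores do not prove that the auth protocols match.

```shell
# Hypothetical helper: extract "major.minor.patch" from a version line such as
# "openshell-sandbox 0.0.37-dev.160+g316c788ea".
version_core() {
  printf '%s\n' "$1" | sed -E 's/^[^0-9]*([0-9]+\.[0-9]+\.[0-9]+).*$/\1/'
}

# Succeeds when both version lines share the same release core.
versions_match() {
  [ "$(version_core "$1")" = "$(version_core "$2")" ]
}
```

With the strings from this report, the gateway (0.0.47-dev.12) and the stale supervisor (0.0.37-dev.160) differ in the core, while the supervisor pulled later (0.0.47-dev.13) matches.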

Fix That Confirmed the Diagnosis

Pulling the current supervisor image and restarting the Homebrew gateway fixed provisioning:

docker pull ghcr.io/nvidia/openshell/supervisor:dev
brew services restart nvidia/openshell/openshell

After pull:

ghcr.io/nvidia/openshell/supervisor:dev
openshell-sandbox 0.0.47-dev.13+g57b71c68f

The gateway extracted a new cached supervisor:

~/.local/share/openshell/docker-supervisor/sha256-5742943b50ee5de76ed9da50f8383ce6805ca4d833a7271774b1bec8d8f365b9/openshell-sandbox
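To see which supervisor digests are cached locally, the cache directory can be listed. This sketch assumes only the layout shown in this report (one sha256-<digest> directory per image, each holding an openshell-sandbox binary); `list_cached_supervisors` is a hypothetical helper, and whether the gateway prunes old entries is not known here.

```shell
# Hypothetical helper: print digest-named cache directories that contain an
# openshell-sandbox binary. The default root is the path seen in this report.
list_cached_supervisors() {
  local root="${1:-$HOME/.local/share/openshell/docker-supervisor}" dir
  for dir in "$root"/sha256-*; do
    # Skip entries without a binary (and the unexpanded glob if none exist).
    [ -f "$dir/openshell-sandbox" ] && basename "$dir"
  done
  return 0
}
```

Comparing this list with the bind mount source of a failing container shows which cached binary that sandbox is actually running.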

A fresh smoke test succeeded:

openshell sandbox create \
  --name docker-smoke-after-pull-echo \
  --from ghcr.io/nvidia/openshell-community/sandboxes/base:latest \
  --no-keep --no-tty -- /bin/sh -lc 'echo supervisor-ok'

Output:

Created sandbox: docker-smoke-after-pull-echo
supervisor-ok
Deleted sandbox docker-smoke-after-pull-echo

Expected Behavior

The Homebrew-installed gateway should not silently use an incompatible stale supervisor binary for Docker sandboxes. If the gateway requires Bearer sandbox JWT auth, the selected supervisor binary should support that same auth protocol.

Possible Improvements

  • Pin Homebrew's Docker supervisor image to an immutable tag/digest matching the gateway build instead of relying on the floating dev tag.
  • On gateway startup, log the selected supervisor image/digest and extracted supervisor binary version.
  • Detect supervisor/gateway protocol mismatch before creating a sandbox, or fail with an explicit error instead of leaving the sandbox in Provisioning.
  • Ensure the Homebrew wrapper copies the jwt/ directory along with TLS materials when it sets OPENSHELL_LOCAL_TLS_DIR.
  • Consider making Docker driver resolution run docker pull for floating tags when appropriate, or document that users must refresh ghcr.io/nvidia/openshell/supervisor:dev after upgrading a dev Homebrew gateway.
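For the jwt/ directory item above, a wrapper could verify that the runtime TLS copy is complete before starting the gateway. This is a minimal sketch, assuming the three file names seen under /opt/homebrew/var/openshell/tls/jwt/ in this report; `missing_jwt_files` is a hypothetical helper, not part of the Homebrew wrapper.

```shell
# Hypothetical helper: print each JWT key file missing from <tls_dir>/jwt.
# Prints nothing when signing.pem, public.pem, and kid are all present.
missing_jwt_files() {
  local tls_dir="$1" f
  for f in signing.pem public.pem kid; do
    [ -f "$tls_dir/jwt/$f" ] || echo "$f"
  done
}
```

Running `missing_jwt_files ~/.local/state/openshell/homebrew/tls` on the affected host would have listed all three files before the jwt/ directory was copied over.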

Related

This is adjacent to, but different from, #1519. In this case, after DOCKER_HOST was pinned to Docker Desktop, containers were created in Docker and the remaining failure was the stale supervisor binary/auth protocol mismatch.

Metadata

Labels: os:macos (bug affects macOS hosts)