docker provider: PortBindings silently dropped on ubuntu-latest when binary is invoked from bash #16

@solsson

Summary

Despite the HostIP fix shipped in v0.3.6 (#15), the docker provisioner produces a container with NetworkSettings.Ports == {} when the released y-cluster binary is invoked via y-cluster provision -c <dir> from a bash script on a GitHub Actions ubuntu-latest runner. The container starts and k3s comes up internally, but no port forwards are published to the host. As a result, the host-side /readyz probe added in v0.3.5 never succeeds, and kubectl --context=local get --raw=/readyz returns connection refused for the entire 60s deadline.

Environment

  • y-cluster v0.3.6 release binary (y-cluster_v0.3.6_linux_amd64, sha256 576964a8825f23c56b633ea5cbc0b587d25931c17c462e0d77a4ae80553146ae)
  • GitHub-hosted runner: ubuntu-latest (Ubuntu 24.04 LTS, runner image 20260413.86.1)
  • Docker Engine Community 28.0.4, buildx 0.33.0, compose 2.38.2
  • Image under test: ghcr.io/yolean/k3s:v1.35.4-rc3-k3s1

Reproducer

Yolean/ystack PR #76, e2e-cluster job. The acceptance script does, in essence:

y-cluster provision -c cluster-configs/local-docker

against:

# cluster-configs/local-docker/y-cluster-provision.yaml
provider: docker
context: local
name: local
portForwards:
- {host: "6443", guest: "6443"}
- {host: "80",   guest: "80"}
- {host: "443",  guest: "443"}
- {host: "8944", guest: "8944"}
registries:
  mirrors:
    builds-registry.ystack.svc.cluster.local: {endpoint: ["http://10.43.0.50"]}
    prod-registry.ystack.svc.cluster.local:   {endpoint: ["http://10.43.0.51"]}

Log evidence

All three snippets below are from the same job run on the same ubuntu-latest instance, in chronological order. Timestamps preserved.

1. Plain docker run -p ... on the runner publishes all four bindings cleanly

This confirms the daemon is healthy, the ports are free, and there's no environmental obstruction. Run as a sanity check by the acceptance script before invoking y-cluster:

2026-05-01T19:58:06.297Z CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES
2026-05-01T19:58:06.297Z # Control: docker run -p with the four ystack port maps:
2026-05-01T19:58:06.319Z Unable to find image 'alpine:3.20' locally
2026-05-01T19:58:06.797Z Status: Downloaded newer image for alpine:3.20
2026-05-01T19:58:06.819Z 60e458c620bff00f330808dd6b6995e93a7021deaaabb37fbe78f696e2e1e34d
2026-05-01T19:58:06.981Z   docker port yk-portbind-control:
2026-05-01T19:58:06.994Z     80/tcp -> 0.0.0.0:80
2026-05-01T19:58:06.994Z     80/tcp -> [::]:80
2026-05-01T19:58:06.994Z     443/tcp -> 0.0.0.0:443
2026-05-01T19:58:06.994Z     443/tcp -> [::]:443
2026-05-01T19:58:06.994Z     6443/tcp -> 0.0.0.0:6443
2026-05-01T19:58:06.994Z     6443/tcp -> [::]:6443
2026-05-01T19:58:06.994Z     8944/tcp -> 0.0.0.0:8944
2026-05-01T19:58:06.994Z     8944/tcp -> [::]:8944
2026-05-01T19:58:06.994Z   docker inspect yk-portbind-control NetworkSettings.Ports:
2026-05-01T19:58:07.007Z     {"443/tcp":[{"HostIp":"0.0.0.0","HostPort":"443"},{"HostIp":"::","HostPort":"443"}],
                              "6443/tcp":[{"HostIp":"0.0.0.0","HostPort":"6443"},{"HostIp":"::","HostPort":"6443"}],
                              "80/tcp":[{"HostIp":"0.0.0.0","HostPort":"80"},{"HostIp":"::","HostPort":"80"}],
                              "8944/tcp":[{"HostIp":"0.0.0.0","HostPort":"8944"},{"HostIp":"::","HostPort":"8944"}]}

The control container is removed immediately after.

2. y-cluster's container, started seconds later via y-cluster provision, has empty Ports

Same runner, same daemon, same k3s image. A backgrounded poller prints docker port local and docker inspect local --format '{{json .NetworkSettings.Ports}}' every 5 seconds while waitForHostAPIServer runs:

2026-05-01T19:58:07.323Z INFO docker/docker.go:148 starting docker {"image": "ghcr.io/yolean/k3s:v1.35.4-rc3-k3s1", "apiPort": "6443", "memory": "8192", "cpus": "4"}
2026-05-01T19:58:09.511Z INFO docker/docker.go:404 waiting for host apiserver {"context": "local"}
2026-05-01T19:58:12.196Z # poll 1 (19:58:12):
2026-05-01T19:58:12.208Z   docker port local:
2026-05-01T19:58:12.223Z   docker inspect NetworkSettings.Ports:
2026-05-01T19:58:12.236Z     {}
2026-05-01T19:58:12.239Z bash: line 1: /dev/tcp/127.0.0.1/6443: Connection refused
2026-05-01T19:58:12.240Z   tcp 127.0.0.1:6443 closed/refused
2026-05-01T19:58:17.243Z # poll 2 (19:58:17):
2026-05-01T19:58:17.258Z   docker port local:
2026-05-01T19:58:17.277Z   docker inspect NetworkSettings.Ports:
2026-05-01T19:58:17.295Z     {}
…
(repeats identically for ~12 polls / 60 seconds)

docker port local produces no output. NetworkSettings.Ports is the literal empty object {}. The container itself is up — docker ps shows local Up X seconds for every poll — but no ports are published to the host.
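
The empty-Ports condition the poller observes can also be checked mechanically. The following is a minimal sketch, not part of y-cluster: it assumes the output of docker inspect local --format '{{json .NetworkSettings.Ports}}' is piped to stdin, and the names publishedCount and portBinding are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
)

// portBinding mirrors one entry in the JSON printed by
// docker inspect --format '{{json .NetworkSettings.Ports}}'.
// (Hypothetical helper type, not taken from the y-cluster source.)
type portBinding struct {
	HostIp   string
	HostPort string
}

// publishedCount returns the number of host bindings in the inspect output.
// "{}" yields 0, and so does a port that is exposed but unpublished (null).
func publishedCount(raw []byte) (int, error) {
	var ports map[string][]portBinding
	if err := json.Unmarshal(raw, &ports); err != nil {
		return 0, err
	}
	n := 0
	for _, bindings := range ports {
		n += len(bindings)
	}
	return n, nil
}

func main() {
	raw, err := io.ReadAll(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read stdin:", err)
		os.Exit(2)
	}
	n, err := publishedCount(raw)
	if err != nil {
		fmt.Fprintln(os.Stderr, "parse inspect output:", err)
		os.Exit(2)
	}
	fmt.Printf("published host bindings: %d\n", n)
	if n == 0 {
		os.Exit(1) // nothing published: the failing case described above
	}
}
```

Exiting non-zero on an empty map would let an acceptance script fail fast instead of waiting out the 60s /readyz deadline.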

3. Inside the container, k3s is fully serving traffic

When waitForHostAPIServer times out, the captured container logs make clear the issue is purely the missing host bindings, not k3s being slow to come up. From the same run:

2026-05-01T19:50:33.0Z time="2026-05-01T19:50:32Z" level=info msg="Started tunnel to 172.17.0.2:6443"
2026-05-01T19:50:33.0Z time="2026-05-01T19:50:32Z" level=info msg="Stopped tunnel to 127.0.0.1:6443"
2026-05-01T19:50:33.0Z I0501 19:50:18.502 pod_startup_latency_tracker: kube-system/metrics-server-786d997795 ... podStartE2EDuration="15.5s" observedRunningTime
2026-05-01T19:50:33.0Z I0501 19:50:33.292 garbagecollector: ...
2026-05-01T19:50:33.0Z I0501 19:50:33.551 handler.go:304 Adding GroupVersion metrics.k8s.io v1beta1 to ResourceManager
2026-05-01T19:50:33.0Z Error: wait for host apiserver: apiserver /readyz never returned 200 within 1m0s
                       on context "local": exit status 1: The connection to the server 127.0.0.1:6443 was
                       refused - did you specify the right host or port?

k3s reaches "metrics.k8s.io v1beta1 added to ResourceManager" inside the container. The host can't reach it because nothing is published.

What rules out an obvious fix

| Hypothesis | Tested in | Result |
|---|---|---|
| HostIP value still wrong post v0.3.6 | v0.3.6 with HostIP: netip.IPv4Unspecified() ("0.0.0.0") | NetworkSettings.Ports == {} |
| Privileged port collision (80, 443) | Two-port config: 6443:6443, 8944:8944 only (no privileged ports) | NetworkSettings.Ports == {} |
| Same host port as guest port | High-port config: 16443:6443, 18944:8944 (host ≠ guest, both unprivileged) | NetworkSettings.Ports == {} |
| Stale host-side listener | Pre-provision ss -lntp 'sport = :6443 or sport = :80 or sport = :443 or sport = :8944' | Empty |
| Stale docker container | Pre-provision docker ps -a | Empty (just header row) |
| Daemon can't bind these ports | Plain docker run -d -p 6443:6443 -p 80:80 -p 443:443 -p 8944:8944 alpine sleep 30 immediately before y-cluster | All four bindings publish to 0.0.0.0:* (snippet 1 above) |

Each row above comes from a CI run with full diagnostics. None of the changes altered the behaviour of the failing case.

CI runs:

What works

The same v0.3.6 binary publishes bindings cleanly in two other contexts:

  1. Mac Docker Desktop, same local-docker/y-cluster-provision.yaml shape (verified locally with host:16443 guest:6443 + host:18944 guest:8944):

    2026-05-02T07:54:54Z INFO docker/docker.go:149 starting docker  apiPort=16443
    2026-05-02T07:54:56Z INFO docker/docker.go:414 waiting for host apiserver
    2026-05-02T07:54:59Z INFO docker/docker.go:223 k3s ready
    2026-05-02T07:54:59Z INFO envoygateway/install.go:119 applying envoy-gateway install manifest
    customresourcedefinition.apiextensions.k8s.io/backends.gateway.envoyproxy.io serverside-applied
    …
    

    Total time from starting docker to k3s ready: ~5 seconds.

  2. go test -tags 'e2e,docker' -run TestDocker_ProvisionTeardown ./e2e/ on the same ubuntu-latest runner image, exercised by every PR. The CI run for PR #15, "fix(provision/docker): set HostIP on PortBindings to bind on host" (y-cluster actions/runs/25244259094), was green against the very commit that became v0.3.6. That test asserts that docker port <name> 8080/tcp shows :38080 on the SDK-created container, and the assertion passed.

So the SDK call shape produced by pkg/provision/docker/docker.go:buildHostConfig can publish bindings on ubuntu-latest, just not in the path that goes:

[released linux/amd64 binary] -> [bash invocation] -> [provision -c <dir>]

What's left as the differential

Same buildHostConfig, same image, same Privileged: true + Tmpfs, same Engine 28, same runner image. Yet the request the daemon receives via the released-binary-from-bash path apparently differs enough from the in-process go test path that Engine 28 silently drops the port bindings.

I haven't been able to instrument that from outside the binary. The following hypotheses would explain the asymmetry, but I can't confirm them from the runner side:

  • Docker API version negotiation. Does the released binary negotiate a different API version than the test-built one? The test binary is built from the same go.mod, so they pin the same github.com/moby/moby/client v0.4.1, but client-side version probing is environment-dependent.
  • Config.ExposedPorts absence. buildHostConfig sets HostConfig.PortBindings but ContainerCreate never sets Config.ExposedPorts. The Docker CLI (which works on this runner) auto-adds ExposedPorts when you pass -p. Engine 28 may have started treating the absence differently in some request shapes.
  • Some env var the SDK reads. DOCKER_API_VERSION, DOCKER_HOST, runner-set proxies — the bash invocation environment differs from go test's parent env in non-obvious ways.
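
To compare the environments of the two invocation paths, something like the sketch below could be run from both the bash script and the go test job, and the outputs diffed. The variable list is an assumption: DOCKER_HOST, DOCKER_API_VERSION, DOCKER_CERT_PATH and DOCKER_TLS_VERIFY are the ones the Docker Go client's FromEnv option documents, and the proxy variables are the conventional ones. The helper name describeEnv is hypothetical.

```go
package main

import (
	"fmt"
	"os"
)

// watched lists environment variables that may influence how a Docker
// client talks to the daemon. The selection is an assumption, not a list
// taken from the y-cluster source.
var watched = []string{
	"DOCKER_API_VERSION",
	"DOCKER_CERT_PATH",
	"DOCKER_HOST",
	"DOCKER_TLS_VERIFY",
	"HTTP_PROXY",
	"HTTPS_PROXY",
	"NO_PROXY",
}

// describeEnv renders one line per watched variable, in a fixed order so
// that the output of two runs can be compared with a plain diff.
func describeEnv(lookup func(string) (string, bool)) []string {
	out := make([]string, 0, len(watched))
	for _, key := range watched {
		if value, ok := lookup(key); ok {
			out = append(out, key+"="+value)
		} else {
			out = append(out, key+" (unset)")
		}
	}
	return out
}

func main() {
	for _, line := range describeEnv(os.LookupEnv) {
		fmt.Println(line)
	}
}
```

Distinguishing "unset" from "set to the empty string" is deliberate, since a client may treat the two cases differently.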

Daemon-side capture of the POST /containers/create body for one working e2e run vs one failing ystack run would identify the differential field within minutes.
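
Once two request bodies have been captured, the comparison itself can be automated. The sketch below assumes each body has been saved as a JSON file by whatever capture method is used; the file names in the usage line and the function names are hypothetical. It reports the dotted path of every field that is missing on one side or differs in value.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"reflect"
	"sort"
)

// join builds a dotted path such as "HostConfig.PortBindings".
func join(prefix, key string) string {
	if prefix == "" {
		return key
	}
	return prefix + "." + key
}

// walk descends into JSON objects present on both sides and records every
// path where a key is missing on one side or the values differ.
func walk(prefix string, a, b any, out *[]string) {
	am, aIsMap := a.(map[string]any)
	bm, bIsMap := b.(map[string]any)
	if aIsMap && bIsMap {
		keys := map[string]struct{}{}
		for k := range am {
			keys[k] = struct{}{}
		}
		for k := range bm {
			keys[k] = struct{}{}
		}
		for k := range keys {
			av, inA := am[k]
			bv, inB := bm[k]
			switch {
			case !inA:
				*out = append(*out, join(prefix, k)+" (only in second)")
			case !inB:
				*out = append(*out, join(prefix, k)+" (only in first)")
			default:
				walk(join(prefix, k), av, bv, out)
			}
		}
		return
	}
	if !reflect.DeepEqual(a, b) {
		if prefix == "" {
			prefix = "(root)"
		}
		*out = append(*out, prefix+" (values differ)")
	}
}

// diffBodies parses both bodies and returns the differing paths, sorted.
func diffBodies(first, second []byte) ([]string, error) {
	var a, b any
	if err := json.Unmarshal(first, &a); err != nil {
		return nil, fmt.Errorf("first body: %w", err)
	}
	if err := json.Unmarshal(second, &b); err != nil {
		return nil, fmt.Errorf("second body: %w", err)
	}
	var out []string
	walk("", a, b, &out)
	sort.Strings(out)
	return out, nil
}

func main() {
	if len(os.Args) != 3 {
		fmt.Fprintln(os.Stderr, "usage: creatediff working.json failing.json")
		os.Exit(2)
	}
	first, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	second, err := os.ReadFile(os.Args[2])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	diffs, err := diffBodies(first, second)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	for _, d := range diffs {
		fmt.Println(d)
	}
}
```

Arrays are compared as whole values rather than element by element, which keeps the output short for a body of this size.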

Asks

  1. Daemon-side dockerd -D capture (or equivalent) of the POST /containers/create request body for:

    • One successful y-cluster TestDocker_ProvisionTeardown run on ubuntu-latest
    • One failing ystack acceptance run on ubuntu-latest

    Diffing the two requests should reveal the field that Engine 28 treats differently.

  2. If the differential turns out to be Config.ExposedPorts: setting Config.ExposedPorts[guestPort]={} in buildHostConfig alongside HostConfig.PortBindings[guestPort]=... (mirroring what the Docker CLI does for every -p) would likely resolve both paths. This is a trivial, low-risk change and a reasonable defensive measure regardless of root cause.

  3. Workaround for ystack: until the SDK path is fixed, ystack could shell out to docker create / docker start directly instead of going through the SDK. Less elegant but unblocks PR #76's e2e gate.
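
The change proposed in ask 2 can be illustrated with plain Go types that mirror the JSON shape of the Engine API create body. This is a sketch under assumptions, not the y-cluster implementation: it does not use the moby SDK types, the names portForward, hostBinding and buildPorts are hypothetical, and every guest port is assumed to be TCP.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// portForward mirrors one portForwards entry from y-cluster-provision.yaml.
type portForward struct {
	Host  string
	Guest string
}

// hostBinding mirrors one HostConfig.PortBindings entry in the create body.
type hostBinding struct {
	HostIp   string `json:"HostIp"`
	HostPort string `json:"HostPort"`
}

// buildPorts derives both maps from the same input, so every published
// guest port is also listed in ExposedPorts, as the Docker CLI does for -p.
func buildPorts(forwards []portForward) (map[string]struct{}, map[string][]hostBinding) {
	exposed := make(map[string]struct{}, len(forwards))
	bindings := make(map[string][]hostBinding, len(forwards))
	for _, f := range forwards {
		key := f.Guest + "/tcp" // assumption: TCP only
		exposed[key] = struct{}{}
		bindings[key] = append(bindings[key], hostBinding{HostIp: "0.0.0.0", HostPort: f.Host})
	}
	return exposed, bindings
}

func main() {
	exposed, bindings := buildPorts([]portForward{
		{Host: "16443", Guest: "6443"},
		{Host: "18944", Guest: "8944"},
	})
	// Only the port-related fields of the create body are shown here.
	body := map[string]any{
		"ExposedPorts": exposed,
		"HostConfig":   map[string]any{"PortBindings": bindings},
	}
	enc := json.NewEncoder(os.Stdout)
	enc.SetIndent("", "  ")
	if err := enc.Encode(body); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Building both maps in one loop makes it impossible for a binding to exist without its matching exposed port, which is the invariant the ask is after.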
