fix(bash image): full python stack + POD_POOL_<LANG> actually warms replicas by aron-muon · Pull Request #69 · aron-muon/KubeCodeRun

aron-muon · 2026-05-26T13:51:10Z

Summary

Two user-reported issues, fixed together because the second is what makes the first visible to users:

1. `ModuleNotFoundError: No module named 'xlsxwriter'` (and reportlab / python-docx / python-pptx / pdf2image / ...) from `bash_tool`. The bash image only had 5 apt-installed Python packages, so LC's `python3 -c "..."` calls routed through bash_tool failed for anything beyond numpy/pandas/matplotlib/openpyxl/Pillow.
2. `POD_POOL_BASH=N doesn't actually wire through and change the number of warm replicas`. The Python config wiring is correct; the fault was on the runtime side, where a hardcoded 60s pod-ready timeout is regularly exceeded by the pull of the now-1.13 GB bash image on a cold node, so every newly created warm pod was torn down before going Ready.

Fix

Part 1 — full python stack in the bash image

Install the FULL pip stack from the same `docker/requirements/python-*.txt` files the dedicated python image uses, so `lang=py` and `lang=bash → python3 -c` become indistinguishable.

Now in the bash image: `xlsxwriter`, `reportlab`, `python-docx`, `python-pptx`, `pdf2image`, `docx2txt`, `docx2python`, `mammoth`, `openpyxl`, `scikit-learn`, `scipy`, `plotly`, `opencv-python-headless`, `pillow`, `numpy`, `pandas`, `matplotlib`, `seaborn`, `statsmodels`, `lxml`, `cryptography`, `pyarrow`, `sympy`, and the rest.
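
A quick way to confirm the list above inside a built image is to check that each package can be located by the interpreter. This is a minimal sketch, not part of the PR: the helper name `missing_modules` is hypothetical, and the list uses import names, which differ from some PyPI package names.

```python
import importlib.util

# Import names, which differ from some PyPI package names
# (python-docx -> docx, python-pptx -> pptx, scikit-learn -> sklearn,
# opencv-python-headless -> cv2, pillow -> PIL).
EXPECTED_MODULES = [
    "xlsxwriter", "reportlab", "docx", "pptx", "pdf2image", "docx2txt",
    "docx2python", "mammoth", "openpyxl", "sklearn", "scipy", "plotly",
    "cv2", "PIL", "numpy", "pandas", "matplotlib", "seaborn",
    "statsmodels", "lxml", "cryptography", "pyarrow", "sympy",
]


def missing_modules(names):
    """Return the top-level module names that cannot be located (hypothetical helper)."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(EXPECTED_MODULES)
    print("missing:", missing or "none")
```

`find_spec` only locates a module without importing it, so this catches missing packages but not runtime link failures in native extensions.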

Three DHI-base-specific problems had to be worked around to get this building against the live `dhi.io/debian-base:trixie-debian13-dev`:

| # | Problem | Workaround |
|---|---------|------------|
| 1 | `r-base-core` + `gfortran` transitively need `liblapack3` + `libgfortran5` pinned to stock `gcc-14-base=14.2.0-19`, but DHI ships `14.2.0-19+dhi0`. APT solver can't satisfy. | Removed both from the bash image. Dedicated `lang=r` / `lang=f90` per-language images still serve those callers; LC's bash_tool doesn't reach for them in practice. |
| 2 | DHI's hardened `libzstd1+dhi0` fails `dlopen()` (`ImportError: libzstd.so.1: shared object cannot be dlopen()ed`). Python's `_ssl.so` links libzstd → SSL completely broken → pip can't reach PyPI. | Force-downgrade to stock `libzstd1=1.5.7+dfsg-1` via `--allow-downgrades`. |
| 3 | Same `dlopen` issue on `libffi8+dhi0`. `_ctypes.so` links libffi → pandas/numpy/cryptography all crash at runtime. | Same fix: downgrade to `libffi8=3.4.8-2`. |

These three are pre-existing breakages of the bash image build against the current DHI registry — `main` was also unbuildable today.
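
Problems 2 and 3 surface as `ImportError` when Python loads `_ssl` and `_ctypes`, so importing `ssl` and `ctypes` is a cheap smoke check after the downgrades. The sketch below assumes nothing beyond the standard library; `check_native_imports` is a hypothetical name, not something in the repository.

```python
import importlib


def check_native_imports(names=("ssl", "ctypes")):
    """Try importing each module; map name -> None on success or the error text."""
    results = {}
    for name in names:
        try:
            importlib.import_module(name)
            results[name] = None
        except ImportError as exc:
            # A dlopen() failure in a linked shared library lands here.
            results[name] = str(exc)
    return results


if __name__ == "__main__":
    for module, error in check_native_imports().items():
        print(module, "ok" if error is None else f"FAILED: {error}")
```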

Part 2 — POD_POOL_<LANG> ready timeout

Why warm replicas weren't materialising: `PodPool._wait_for_pod_ready` hardcoded a 60s timeout. The unified bash image (~1.13 GB) regularly exceeds 60s on first pull. Every warm pod got torn down before going Ready, the pool stayed at 0, and the only log was a quiet `warning="Warm pod did not become ready, deleting"` with no indication WHY. Operators reading kubectl saw `replicas: 0` despite `POD_POOL_BASH=N` and nothing in our logs explained the gap.

Three changes:

Configurable timeout — new `POD_POOL_READY_TIMEOUT_SECONDS` setting, default 300s (5 min), bounded [30, 1800]. Documented in `.env.example`.
Early-abort on terminal pod states — `ImagePullBackOff`, `ErrImagePull`, `InvalidImageName`, `CrashLoopBackOff`, `CreateContainerConfigError`, `CreateContainerError`. A misconfigured image now fails in ~5s with the actual reason logged at ERROR, instead of waiting out the full 300s.
Loud diagnostic on real timeouts — when we DO hit the timeout (typically a slow image pull stuck in ContainerCreating), log at ERROR with the last-observed waiting reason and a hint telling the operator whether to bump the timeout or investigate the cluster.
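
For the configurable timeout, the default and bounds can be pictured as follows. This is a stdlib-only sketch under stated assumptions: the PR does not show how the setting is declared, and whether out-of-range values are rejected or clamped is not stated, so the sketch simply rejects them. `read_ready_timeout` is a hypothetical helper.

```python
import os

DEFAULT_TIMEOUT = 300              # seconds, the new default from this PR
MIN_TIMEOUT, MAX_TIMEOUT = 30, 1800  # documented bounds


def read_ready_timeout(env=None):
    """Parse POD_POOL_READY_TIMEOUT_SECONDS, rejecting values outside [30, 1800]."""
    env = os.environ if env is None else env
    raw = env.get("POD_POOL_READY_TIMEOUT_SECONDS")
    if raw is None or raw == "":
        return DEFAULT_TIMEOUT
    value = int(raw)  # raises ValueError on non-integer input
    if not MIN_TIMEOUT <= value <= MAX_TIMEOUT:
        raise ValueError(
            f"POD_POOL_READY_TIMEOUT_SECONDS={value} is outside "
            f"[{MIN_TIMEOUT}, {MAX_TIMEOUT}]"
        )
    return value
```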

The companion rollup log in `_create_warm_pod` was bumped from WARNING to ERROR for the same operator-discoverability reason.
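
The wait behaviour described in this part (poll until Ready, abort early on a terminal waiting reason, otherwise time out and report the last reason seen) can be sketched without a cluster. This is an illustration, not the body of `PodPool._wait_for_pod_ready`: `wait_for_ready` and the `get_status` callable are hypothetical, and the Kubernetes API call that would supply the status is omitted.

```python
import time

# Waiting reasons the PR treats as terminal.
TERMINAL_WAITING_REASONS = frozenset({
    "ImagePullBackOff", "ErrImagePull", "InvalidImageName",
    "CrashLoopBackOff", "CreateContainerConfigError", "CreateContainerError",
})


def wait_for_ready(get_status, timeout, poll_interval=1.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll get_status() until ready, a terminal reason, or the deadline.

    get_status is a hypothetical callable returning (is_ready, waiting_reason).
    Returns (outcome, last_waiting_reason), outcome in
    {"ready", "aborted", "timeout"}.
    """
    deadline = clock() + timeout
    while True:
        is_ready, last_reason = get_status()
        if is_ready:
            return "ready", last_reason
        if last_reason in TERMINAL_WAITING_REASONS:
            # No point waiting out the full timeout on a state that won't recover.
            return "aborted", last_reason
        if clock() >= deadline:
            # ContainerCreating (or None) at this point is a real timeout.
            return "timeout", last_reason
        sleep(poll_interval)
```

Injecting `clock` and `sleep` keeps the loop testable without real delays, which is roughly what a test asserting an abort well under the nominal timeout needs.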

Validation

Build succeeds against `dhi.io/debian-base:trixie-debian13-dev` (final image ~1.13 GB).
User's exact failing reproduction now works:
```python
import xlsxwriter
workbook = xlsxwriter.Workbook("/tmp/beispiel.xlsx")
ws = workbook.add_worksheet()
ws.write("A1", "Hallo")
workbook.close()
```
→ xlsx written, 5,250 bytes.
Full `docx + pptx + reportlab + pandas + numpy + matplotlib` import chain works.
`./scripts/test-bash-megaimage.sh` — 11/11 languages pass (bash, python, javascript, typescript, go, java, c, cpp, php, rust, d); r and fortran properly marked skipped.
1,519 unit tests pass (+8 new pool tests covering the timeout + early-abort cases).
`just lint` / `just format-check` / `just typecheck` — all clean.

New tests

`tests/unit/test_pool.py`:

`test_wait_for_pod_ready_uses_settings_default` — picks up `settings.pod_pool_ready_timeout_seconds` when called with no explicit timeout
`test_wait_for_pod_ready_early_aborts_on_terminal_waiting_reason[ImagePullBackOff/ErrImagePull/InvalidImageName/CrashLoopBackOff/CreateContainerConfigError/CreateContainerError]` — 6 parametrized cases asserting the abort happens in <5s on a 30s nominal timeout
`test_wait_for_pod_ready_waits_through_container_creating` — pins that `ContainerCreating` is NOT terminal (the image is being pulled; the wait must continue)

Operator note

If you currently see `POD_POOL_BASH=N` not materialising warm replicas, this PR should fix it. If the new 300s default is still too short for your registry/node combo, bump `POD_POOL_READY_TIMEOUT_SECONDS` further. The new error logs will tell you which case you're in:

`last_waiting_reason=ContainerCreating` → image pull is slow, raise the timeout
`last_waiting_reason=ImagePullBackOff` → genuine image config error (wrong registry, bad tag, missing pull secret) — early-abort will catch this
`last_waiting_reason=None` after Running → runner /ready probe wedged inside the pod
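
The three cases above amount to a small lookup. The sketch below restates them as a function for use in a runbook or alert handler; `triage_hint` is a hypothetical name, it only sees the reason string (not whether the pod reached Running), and any reason not listed above falls through to a generic message.

```python
def triage_hint(last_waiting_reason):
    """Map the last-observed waiting reason to the operator action listed above."""
    if last_waiting_reason is None:
        return "runner /ready probe may be wedged inside the pod"
    if last_waiting_reason == "ContainerCreating":
        return "image pull is slow: raise POD_POOL_READY_TIMEOUT_SECONDS"
    if last_waiting_reason == "ImagePullBackOff":
        return "image config error: check registry, tag, and pull secret"
    return f"unrecognised reason {last_waiting_reason!r}: inspect the pod events"
```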

🤖 Generated with Claude Code

…pptx)

User report: LC bash_tool calls fail with "ModuleNotFoundError: No module named 'xlsxwriter'" (and similar for reportlab / python-docx / python-pptx / pdf2image / scikit-learn / ...) because the bash image only had 5 apt-installed Python packages (numpy / pandas / matplotlib / openpyxl / Pillow). LC's bash_tool path issues `python3 -c "..."` for everything a user would normally reach the dedicated python image for, so the partial stack produced non-deterministic "sometimes works, sometimes errors" behaviour depending on which tool LC's agent picked.

This commit installs the FULL pip stack from the same requirement files the dedicated python image uses, so the two execution paths become indistinguishable to the user.

Three DHI-base-specific problems had to be worked around to get this building against the live `dhi.io/debian-base:trixie-debian13-dev` image:

1. r-base-core + gfortran transitively require liblapack3 + libgfortran5, both pinned to stock gcc-14-base=14.2.0-19 but DHI ships 14.2.0-19+dhi0. Cannot install. Removed both from the bash image; the dedicated `lang=r` / `lang=f90` per-language images still serve those callers, and LC's bash_tool doesn't reach for them in practice.
2. DHI's hardened libzstd1+dhi0 fails dlopen() at runtime (`ImportError: libzstd.so.1: shared object cannot be dlopen()ed`). Python's _ssl.so links libzstd so SSL is completely broken. Force-downgrade to stock libzstd1=1.5.7+dfsg-1 via --allow-downgrades. Without this, pip can't reach PyPI at all.
3. Same dlopen problem on DHI's libffi8+dhi0. _ctypes.so links libffi so pandas/numpy/cryptography all crash at runtime. Same fix: downgrade to libffi8=3.4.8-2.
4. Removed `pip install --upgrade pip setuptools wheel`: Debian's apt-installed wheel package has no RECORD file so pip can't uninstall it for the upgrade. The apt-installed versions are recent enough for the requirement files.

Final image is ~1.13 GB (was ~865 MB).

Pip stack includes: xlsxwriter, reportlab, python-docx, python-pptx, pdf2image, docx2txt, docx2python, mammoth, openpyxl, scikit-learn, scipy, plotly, opencv-python-headless, pillow, numpy, pandas, matplotlib, seaborn, statsmodels, lxml, cryptography, pyarrow, sympy, and the rest of docker/requirements/python-*.txt.

Validated against the user's exact failing reproduction (xlsxwriter import + Workbook write) and the 11-language hello-world matrix: all 11 pass, r/fortran properly marked skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…NG> works

User reported "POD_POOL_BASH=x doesn't actually wire through and change the number of warm replicas". The Python wiring is correct (configmap → POD_POOL_BASH → settings → PodPoolManager); the root cause was on the runtime side: PodPool._wait_for_pod_ready hardcoded a 60-second timeout. For the unified bash image (~1.13 GB after the python-stack bake) on a cold node, image pull regularly exceeds 60s. Every newly-created warm pod got torn down before going Ready, the pool stayed at 0, and the only log was a quiet `warning="Warm pod did not become ready, deleting"` with no indication WHY. Operators reading kubectl saw pool size 0 and nothing in our logs explained the gap.

This fix has three parts:

1. **Make the timeout configurable**: new `POD_POOL_READY_TIMEOUT_SECONDS` setting, default 300s (5 min). Bounded [30, 1800]. Documented in .env.example.
2. **Early-abort on terminal pod states**: ImagePullBackOff, ErrImagePull, InvalidImageName, CrashLoopBackOff, CreateContainerConfigError, CreateContainerError. A misconfigured image now fails in ~5s with the actual reason logged at ERROR, instead of waiting out the full 300s and logging a generic timeout.
3. **Loud diagnostic on real timeouts**: when we DO hit the timeout (typically slow image pull during ContainerCreating), log at ERROR with the last-observed waiting reason and a hint: "If reason is 'ContainerCreating' the image is probably still being pulled — raise POD_POOL_READY_TIMEOUT_SECONDS."

The companion rollup log in `_create_warm_pod` was bumped from WARNING to ERROR for the same operator-discoverability reason.

7 new tests in test_pool.py:

- _wait_for_pod_ready picks up settings default when no timeout passed
- Early-abort on each of the 6 terminal waiting reasons (parametrized) asserting the abort happens in <5s on a 30s nominal timeout
- ContainerCreating is NOT treated as terminal: the wait continues through the pull and succeeds when the pod eventually goes Ready

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-26T14:20:33Z

🎉 This PR is included in version 3.5.4 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

aron-muon and others added 2 commits May 26, 2026 14:50

aron-muon changed the title ~~fix(docker): full python stack in bash image (xlsxwriter, reportlab, pptx, ...)~~ fix(bash image): full python stack + POD_POOL_<LANG> actually warms replicas May 26, 2026

aron-muon merged commit da189b7 into main May 26, 2026
32 of 34 checks passed

aron-muon deleted the hotfix/bash-image-python-stack branch May 26, 2026 14:20

github-actions Bot added the released label May 26, 2026
