fix(bash image): full python stack + POD_POOL_<LANG> actually warms replicas#69

Merged
aron-muon merged 2 commits into main from hotfix/bash-image-python-stack
May 26, 2026

@aron-muon aron-muon commented May 26, 2026

Summary

Two user-reported issues, fixed together because the second is what makes the first visible to users:

  1. `ModuleNotFoundError: No module named 'xlsxwriter'` (and reportlab / python-docx / python-pptx / pdf2image / ...) from `bash_tool`. The bash image had only 5 apt-installed Python packages, so any `python3 -c "..."` call that LC issued through bash_tool failed for anything beyond numpy/pandas/matplotlib/openpyxl/Pillow.
  2. `POD_POOL_BASH=N doesn't actually wire through and change the number of warm replicas`. The Python config wiring was correct; the runtime side had a hardcoded 60s pod-ready timeout that the first pull of the now-1.13 GB bash image regularly exceeds on a cold node, so every newly created warm pod was torn down before going Ready.

Fix

Part 1 — full python stack in the bash image

Install the FULL pip stack from the same `docker/requirements/python-*.txt` files the dedicated python image uses, so `lang=py` and `lang=bash → python3 -c` become indistinguishable.

Now in the bash image: `xlsxwriter`, `reportlab`, `python-docx`, `python-pptx`, `pdf2image`, `docx2txt`, `docx2python`, `mammoth`, `openpyxl`, `scikit-learn`, `scipy`, `plotly`, `opencv-python-headless`, `pillow`, `numpy`, `pandas`, `matplotlib`, `seaborn`, `statsmodels`, `lxml`, `cryptography`, `pyarrow`, `sympy`, and the rest.
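Several of these packages are imported under a different name than they are installed under, which matters when checking the image by hand. The following is a minimal sketch of such a check, not part of this PR; the `IMPORT_NAMES` table and the `missing_modules` helper are illustrative names, and the table covers only a subset of the stack.

```python
import importlib.util

# pip distribution name -> importable top-level module name.
# Several of these differ (python-docx installs "docx", pillow installs "PIL").
IMPORT_NAMES = {
    "xlsxwriter": "xlsxwriter",
    "reportlab": "reportlab",
    "python-docx": "docx",
    "python-pptx": "pptx",
    "pdf2image": "pdf2image",
    "scikit-learn": "sklearn",
    "opencv-python-headless": "cv2",
    "pillow": "PIL",
    "numpy": "numpy",
    "pandas": "pandas",
}


def missing_modules(import_names=IMPORT_NAMES):
    """Return the pip names whose module cannot be located by this interpreter."""
    return sorted(
        pip_name
        for pip_name, module in import_names.items()
        if importlib.util.find_spec(module) is None
    )


if __name__ == "__main__":
    missing = missing_modules()
    print("missing:", ", ".join(missing) if missing else "none")
```

Running the same script through both `lang=py` and `lang=bash → python3` should give the same answer if the two paths really are indistinguishable. Note that `find_spec` only locates a module; it does not prove the import succeeds.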

Three DHI-base-specific problems had to be worked around to get this building against the live `dhi.io/debian-base:trixie-debian13-dev`:

| # | Problem | Workaround |
|---|---------|------------|
| 1 | `r-base-core` + `gfortran` transitively need `liblapack3` + `libgfortran5` pinned to stock `gcc-14-base=14.2.0-19`, but DHI ships `14.2.0-19+dhi0`. APT solver can't satisfy. | Removed both from the bash image. Dedicated `lang=r` / `lang=f90` per-language images still serve those callers; LC's bash_tool doesn't reach for them in practice. |
| 2 | DHI's hardened `libzstd1+dhi0` fails `dlopen()` (`ImportError: libzstd.so.1: shared object cannot be dlopen()ed`). Python's `_ssl.so` links libzstd → SSL completely broken → pip can't reach PyPI. | Force-downgrade to stock `libzstd1=1.5.7+dfsg-1` via `--allow-downgrades`. |
| 3 | Same `dlopen` issue on `libffi8+dhi0`. `_ctypes.so` links libffi → pandas/numpy/cryptography all crash at runtime. | Same fix: downgrade to `libffi8=3.4.8-2`. |

These three are pre-existing breakages of the bash image build against the current DHI registry — `main` was also unbuildable today.
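Because problems 2 and 3 surface as `ImportError` when Python loads `_ssl` and `_ctypes`, a cheap way to catch a regression is to import `ssl` and `ctypes` right after the downgrade step. The sketch below assumes that approach; `check_native_modules` is a hypothetical helper, not something this PR adds.

```python
import importlib


def check_native_modules(names=("ssl", "ctypes")):
    """Try to import each module; return {name: error message, or None on success}."""
    results = {}
    for name in names:
        try:
            importlib.import_module(name)
            results[name] = None
        except ImportError as exc:
            # A shared library that cannot be dlopen()ed shows up here.
            results[name] = str(exc)
    return results


if __name__ == "__main__":
    failures = {k: v for k, v in check_native_modules().items() if v is not None}
    for name, error in failures.items():
        print(f"{name}: {error}")
    raise SystemExit(1 if failures else 0)
```

Run as a build step, a non-zero exit would fail the image build instead of letting the breakage reach pip or pandas at runtime.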

Part 2 — POD_POOL_<LANG> ready timeout

Why warm replicas weren't materialising: `PodPool._wait_for_pod_ready` hardcoded a 60s timeout. The unified bash image (~1.13 GB) regularly exceeds 60s on first pull. Every warm pod got torn down before going Ready, the pool stayed at 0, and the only log was a quiet `warning="Warm pod did not become ready, deleting"` with no indication WHY. Operators reading kubectl saw `replicas: 0` despite `POD_POOL_BASH=N` and nothing in our logs explained the gap.

Three changes:

  1. Configurable timeout — new `POD_POOL_READY_TIMEOUT_SECONDS` setting, default 300s (5 min), bounded [30, 1800]. Documented in `.env.example`.
  2. Early-abort on terminal pod states — `ImagePullBackOff`, `ErrImagePull`, `InvalidImageName`, `CrashLoopBackOff`, `CreateContainerConfigError`, `CreateContainerError`. A misconfigured image now fails in ~5s with the actual reason logged at ERROR, instead of waiting out the full 300s.
  3. Loud diagnostic on real timeouts — when we DO hit the timeout (typically a slow image pull stuck in ContainerCreating), log at ERROR with the last-observed waiting reason and a hint telling the operator whether to bump the timeout or investigate the cluster.
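For the first change, a bounded setting with a default can be sketched as below. This is an illustration only: the PR does not show how the bounds are enforced, so rejecting out-of-range values (rather than clamping them) is an assumption, as is the `ready_timeout_seconds` helper name.

```python
import os

DEFAULT_TIMEOUT = 300                   # seconds, the new default
MIN_TIMEOUT, MAX_TIMEOUT = 30, 1800     # documented bounds


def ready_timeout_seconds(env=None):
    """Read POD_POOL_READY_TIMEOUT_SECONDS from env, enforcing [30, 1800]."""
    env = os.environ if env is None else env
    raw = env.get("POD_POOL_READY_TIMEOUT_SECONDS")
    if raw is None or raw == "":
        return DEFAULT_TIMEOUT
    value = int(raw)  # raises ValueError on non-numeric input
    if not MIN_TIMEOUT <= value <= MAX_TIMEOUT:
        # Assumption: out-of-range values are rejected, not silently clamped.
        raise ValueError(
            f"POD_POOL_READY_TIMEOUT_SECONDS={value} is outside "
            f"[{MIN_TIMEOUT}, {MAX_TIMEOUT}]"
        )
    return value
```

Failing loudly at startup keeps a mistyped value from silently reintroducing a too-short timeout.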

The companion rollup log in `_create_warm_pod` was bumped from WARNING to ERROR for the same operator-discoverability reason.
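The early-abort and timeout behaviour described above can be sketched as a polling loop. This is not the code from `PodPool._wait_for_pod_ready`: the `get_status` callable, its `(ready, waiting_reason)` return shape, the injected `clock`/`sleep`, and the log wording are all assumptions made so the example runs without a cluster.

```python
import logging
import time

logger = logging.getLogger("pod_pool_sketch")

# Waiting reasons the PR treats as terminal. ContainerCreating is deliberately
# absent: the image may still be pulling, so the wait must continue.
TERMINAL_WAITING_REASONS = frozenset({
    "ImagePullBackOff", "ErrImagePull", "InvalidImageName",
    "CrashLoopBackOff", "CreateContainerConfigError", "CreateContainerError",
})


def wait_for_pod_ready(get_status, timeout=300.0, poll_interval=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll until the pod is ready, hits a terminal waiting reason, or times out.

    get_status is a hypothetical callable returning (ready, waiting_reason).
    Returns True if the pod became ready, False otherwise.
    """
    deadline = clock() + timeout
    last_reason = None
    while clock() < deadline:
        ready, reason = get_status()
        if ready:
            return True
        last_reason = reason
        if reason in TERMINAL_WAITING_REASONS:
            # Abort early: waiting longer cannot fix a bad image or config.
            logger.error("Warm pod cannot become ready: %s", reason)
            return False
        sleep(poll_interval)
    logger.error(
        "Warm pod not ready after %ss (last_waiting_reason=%s). "
        "If the reason is ContainerCreating the image is probably still "
        "being pulled: raise POD_POOL_READY_TIMEOUT_SECONDS.",
        timeout, last_reason,
    )
    return False
```

Injecting `clock` and `sleep` is one way to let tests assert the early abort and the timeout path without real waiting.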

Validation

  • Build succeeds against `dhi.io/debian-base:trixie-debian13-dev` (final image ~1.13 GB).
  • User's exact failing reproduction now works:
    ```python
    import xlsxwriter
    workbook = xlsxwriter.Workbook("/tmp/beispiel.xlsx")
    ws = workbook.add_worksheet()
    ws.write("A1", "Hallo")
    workbook.close()
    ```
    → xlsx written, 5,250 bytes.
  • Full `docx + pptx + reportlab + pandas + numpy + matplotlib` import chain works.
  • `./scripts/test-bash-megaimage.sh` — 11/11 languages pass (bash, python, javascript, typescript, go, java, c, cpp, php, rust, d); r and fortran properly marked skipped.
  • 1,519 unit tests pass (+8 new pool tests covering the timeout + early-abort cases).
  • `just lint` / `just format-check` / `just typecheck` — all clean.

New tests

`tests/unit/test_pool.py`:

  • `test_wait_for_pod_ready_uses_settings_default` — picks up `settings.pod_pool_ready_timeout_seconds` when called with no explicit timeout
  • `test_wait_for_pod_ready_early_aborts_on_terminal_waiting_reason[ImagePullBackOff/ErrImagePull/InvalidImageName/CrashLoopBackOff/CreateContainerConfigError/CreateContainerError]` — 6 parametrized cases asserting the abort happens in <5s on a 30s nominal timeout
  • `test_wait_for_pod_ready_waits_through_container_creating` — pins that `ContainerCreating` is NOT terminal (the image is being pulled; the wait must continue)

Operator note

If you currently see `POD_POOL_BASH=N` not materialising warm replicas, this PR should fix it. If the new 300s default is still too short for your registry/node combo, bump `POD_POOL_READY_TIMEOUT_SECONDS` further. The new error logs will tell you which case you're in:

  • `last_waiting_reason=ContainerCreating` → image pull is slow, raise the timeout
  • `last_waiting_reason=ImagePullBackOff` → genuine image config error (wrong registry, bad tag, missing pull secret) — early-abort will catch this
  • `last_waiting_reason=None` after Running → runner /ready probe wedged inside the pod
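For anyone scripting against these logs, the three cases above reduce to a small lookup. The function below is a hypothetical helper written for this note, not something the PR ships; the `reached_running` flag and the fallback message are assumptions.

```python
def diagnose_warm_pod_failure(last_waiting_reason, reached_running=False):
    """Map a logged last_waiting_reason to the operator action listed above."""
    if last_waiting_reason == "ContainerCreating":
        return "slow image pull: raise POD_POOL_READY_TIMEOUT_SECONDS"
    if last_waiting_reason == "ImagePullBackOff":
        return "image config error: check registry, tag, and pull secret"
    if last_waiting_reason is None and reached_running:
        return "runner /ready probe is wedged inside the pod"
    # Assumption: anything else needs manual inspection.
    return "unrecognised state: inspect the pod events"
```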

🤖 Generated with Claude Code

aron-muon and others added 2 commits May 26, 2026 14:50
…pptx)

User report: LC bash_tool calls fail with
"ModuleNotFoundError: No module named 'xlsxwriter'" (and similar for
reportlab / python-docx / python-pptx / pdf2image / scikit-learn / ...)
because the bash image only had 5 apt-installed Python packages
(numpy / pandas / matplotlib / openpyxl / Pillow). LC's bash_tool path
issues `python3 -c "..."` for everything a user would normally reach
the dedicated python image for, so the partial stack produced
non-deterministic "sometimes works, sometimes errors" behaviour
depending on which tool LC's agent picked.

This commit installs the FULL pip stack from the same requirement
files the dedicated python image uses, so the two execution paths
become indistinguishable to the user.

Three DHI-base-specific problems had to be worked around to get this
building against the live `dhi.io/debian-base:trixie-debian13-dev`
image:

1. r-base-core + gfortran transitively require liblapack3 +
   libgfortran5, both pinned to stock gcc-14-base=14.2.0-19 but DHI
   ships 14.2.0-19+dhi0. Cannot install. Removed both from the bash
   image — the dedicated `lang=r` / `lang=f90` per-language images
   still serve those callers; LC's bash_tool doesn't reach for them
   in practice.

2. DHI's hardened libzstd1+dhi0 fails dlopen() at runtime
   (`ImportError: libzstd.so.1: shared object cannot be dlopen()ed`).
   Python's _ssl.so links libzstd so SSL is completely broken. Force-
   downgrade to stock libzstd1=1.5.7+dfsg-1 via --allow-downgrades.
   Without this, pip can't reach PyPI at all.

3. Same dlopen problem on DHI's libffi8+dhi0. _ctypes.so links libffi
   so pandas/numpy/cryptography all crash at runtime. Same fix:
   downgrade to libffi8=3.4.8-2.

4. Removed `pip install --upgrade pip setuptools wheel` — Debian's
   apt-installed wheel package has no RECORD file so pip can't
   uninstall it for the upgrade. The apt-installed versions are
   recent enough for the requirement files.

Final image is ~1.13 GB (was ~865 MB). Pip stack includes:
xlsxwriter, reportlab, python-docx, python-pptx, pdf2image,
docx2txt, docx2python, mammoth, openpyxl, scikit-learn, scipy,
plotly, opencv-python-headless, pillow, numpy, pandas, matplotlib,
seaborn, statsmodels, lxml, cryptography, pyarrow, sympy, and
the rest of docker/requirements/python-*.txt.

Validated against the user's exact failing reproduction (xlsxwriter
import + Workbook write) and the 11-language hello-world matrix:
all 11 pass, r/fortran properly marked skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…NG> works

User reported "POD_POOL_BASH=x doesn't actually wire through and
change the number of warm replicas". The Python wiring is correct
(configmap → POD_POOL_BASH → settings → PodPoolManager) — the
root cause was on the runtime side:

PodPool._wait_for_pod_ready hardcoded a 60-second timeout. For the
unified bash image (~1.13 GB after the python-stack bake) on a cold
node, image pull regularly exceeds 60s. Every newly-created warm pod
got torn down before going Ready, the pool stayed at 0, and the only
log was a quiet `warning="Warm pod did not become ready, deleting"`
with no indication WHY — operators reading kubectl saw pool size 0
and nothing in our logs explained the gap.

This fix has three parts:

1. **Make the timeout configurable** — new
   `POD_POOL_READY_TIMEOUT_SECONDS` setting, default 300s (5 min).
   Bounded [30, 1800]. Documented in .env.example.

2. **Early-abort on terminal pod states** — ImagePullBackOff,
   ErrImagePull, InvalidImageName, CrashLoopBackOff,
   CreateContainerConfigError, CreateContainerError. A misconfigured
   image now fails in ~5s with the actual reason logged at ERROR,
   instead of waiting out the full 300s and logging a generic timeout.

3. **Loud diagnostic on real timeouts** — when we DO hit the timeout
   (typically slow image pull during ContainerCreating), log at ERROR
   with the last-observed waiting reason and a hint:
   "If reason is 'ContainerCreating' the image is probably still
   being pulled — raise POD_POOL_READY_TIMEOUT_SECONDS."

The companion rollup log in `_create_warm_pod` was bumped from
WARNING to ERROR for the same operator-discoverability reason.

8 new tests in test_pool.py:
- _wait_for_pod_ready picks up settings default when no timeout passed
- Early-abort on each of the 6 terminal waiting reasons (parametrized)
  asserting the abort happens in <5s on a 30s nominal timeout
- ContainerCreating is NOT treated as terminal — the wait continues
  through the pull and succeeds when the pod eventually goes Ready

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@aron-muon aron-muon changed the title fix(docker): full python stack in bash image (xlsxwriter, reportlab, pptx, ...) fix(bash image): full python stack + POD_POOL_<LANG> actually warms replicas May 26, 2026
@aron-muon aron-muon merged commit da189b7 into main May 26, 2026
32 of 34 checks passed
@aron-muon aron-muon deleted the hotfix/bash-image-python-stack branch May 26, 2026 14:20
@github-actions

🎉 This PR is included in version 3.5.4 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀
