fix(bash image): full python stack + POD_POOL_<LANG> actually warms replicas #69
Merged
…pptx)

User report: LC bash_tool calls fail with "ModuleNotFoundError: No module named 'xlsxwriter'" (and similar for reportlab / python-docx / python-pptx / pdf2image / scikit-learn / ...) because the bash image only had 5 apt-installed Python packages (numpy / pandas / matplotlib / openpyxl / Pillow). LC's bash_tool path issues `python3 -c "..."` for everything a user would normally reach the dedicated python image for, so the partial stack produced non-deterministic "sometimes works, sometimes errors" behaviour depending on which tool LC's agent picked.

This commit installs the FULL pip stack from the same requirement files the dedicated python image uses, so the two execution paths become indistinguishable to the user.

Three DHI-base-specific problems (items 1-3), plus one Debian packaging issue (item 4), had to be worked around to get this building against the live `dhi.io/debian-base:trixie-debian13-dev` image:

1. r-base-core + gfortran transitively require liblapack3 + libgfortran5, both pinned to stock gcc-14-base=14.2.0-19, but DHI ships 14.2.0-19+dhi0, so they cannot be installed. Removed both from the bash image. The dedicated `lang=r` / `lang=f90` per-language images still serve those callers; LC's bash_tool doesn't reach for them in practice.
2. DHI's hardened libzstd1+dhi0 fails dlopen() at runtime (`ImportError: libzstd.so.1: shared object cannot be dlopen()ed`). Python's _ssl.so links libzstd, so SSL is completely broken. Force-downgrade to stock libzstd1=1.5.7+dfsg-1 via --allow-downgrades. Without this, pip can't reach PyPI at all.
3. Same dlopen problem on DHI's libffi8+dhi0. _ctypes.so links libffi, so pandas/numpy/cryptography all crash at runtime. Same fix: downgrade to libffi8=3.4.8-2.
4. Removed `pip install --upgrade pip setuptools wheel`. Debian's apt-installed wheel package has no RECORD file, so pip can't uninstall it for the upgrade. The apt-installed versions are recent enough for the requirement files.

Final image is ~1.13 GB (was ~865 MB).
Pip stack includes: xlsxwriter, reportlab, python-docx, python-pptx, pdf2image, docx2txt, docx2python, mammoth, openpyxl, scikit-learn, scipy, plotly, opencv-python-headless, pillow, numpy, pandas, matplotlib, seaborn, statsmodels, lxml, cryptography, pyarrow, sympy, and the rest of docker/requirements/python-*.txt.

Validated against the user's exact failing reproduction (xlsxwriter import + Workbook write) and the 11-language hello-world matrix: all 11 pass, r/fortran properly marked skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…NG> works

User reported "POD_POOL_BASH=x doesn't actually wire through and change the number of warm replicas". The Python wiring is correct (configmap → POD_POOL_BASH → settings → PodPoolManager); the root cause was on the runtime side: PodPool._wait_for_pod_ready hardcoded a 60-second timeout. For the unified bash image (~1.13 GB after the python-stack bake) on a cold node, image pull regularly exceeds 60s. Every newly-created warm pod got torn down before going Ready, the pool stayed at 0, and the only log was a quiet `warning="Warm pod did not become ready, deleting"` with no indication WHY. Operators reading kubectl saw pool size 0 and nothing in our logs explained the gap.

This fix has three parts:

1. **Make the timeout configurable**: new `POD_POOL_READY_TIMEOUT_SECONDS` setting, default 300s (5 min). Bounded [30, 1800]. Documented in .env.example.
2. **Early-abort on terminal pod states**: ImagePullBackOff, ErrImagePull, InvalidImageName, CrashLoopBackOff, CreateContainerConfigError, CreateContainerError. A misconfigured image now fails in ~5s with the actual reason logged at ERROR, instead of waiting out the full 300s and logging a generic timeout.
3. **Loud diagnostic on real timeouts**: when we DO hit the timeout (typically slow image pull during ContainerCreating), log at ERROR with the last-observed waiting reason and a hint: "If reason is 'ContainerCreating' the image is probably still being pulled — raise POD_POOL_READY_TIMEOUT_SECONDS."

The companion rollup log in `_create_warm_pod` was bumped from WARNING to ERROR for the same operator-discoverability reason.
7 new tests in test_pool.py:

- _wait_for_pod_ready picks up settings default when no timeout passed
- Early-abort on each of the 6 terminal waiting reasons (parametrized), asserting the abort happens in <5s on a 30s nominal timeout
- ContainerCreating is NOT treated as terminal: the wait continues through the pull and succeeds when the pod eventually goes Ready

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🎉 This PR is included in version 3.5.4 🎉 The release is available on GitHub release. Your semantic-release bot 📦🚀
Summary
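Two user-reported issues, fixed together because the second is what makes the first visible to users:
1. LC bash_tool calls fail with `ModuleNotFoundError: No module named 'xlsxwriter'` (and similar for reportlab / python-docx / python-pptx / pdf2image / scikit-learn) because the bash image only had 5 apt-installed Python packages.
2. `POD_POOL_BASH=N` doesn't actually change the number of warm replicas.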
Fix
Part 1 — full python stack in the bash image
Install the FULL pip stack from the same `docker/requirements/python-*.txt` files the dedicated python image uses, so `lang=py` and `lang=bash → python3 -c` become indistinguishable.
Now in the bash image: `xlsxwriter`, `reportlab`, `python-docx`, `python-pptx`, `pdf2image`, `docx2txt`, `docx2python`, `mammoth`, `openpyxl`, `scikit-learn`, `scipy`, `plotly`, `opencv-python-headless`, `pillow`, `numpy`, `pandas`, `matplotlib`, `seaborn`, `statsmodels`, `lxml`, `cryptography`, `pyarrow`, `sympy`, and the rest.
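Three DHI-base-specific problems had to be worked around to get this building against the live `dhi.io/debian-base:trixie-debian13-dev`:
1. `r-base-core` and `gfortran` cannot be installed: they transitively require `liblapack3` and `libgfortran5`, both pinned to stock `gcc-14-base=14.2.0-19`, while DHI ships `14.2.0-19+dhi0`. Both were removed from the bash image; the dedicated `lang=r` / `lang=f90` images still serve those callers.
2. DHI's hardened `libzstd1+dhi0` fails `dlopen()` at runtime, which breaks Python's `_ssl.so` and therefore pip's access to PyPI. Worked around by force-downgrading to stock `libzstd1=1.5.7+dfsg-1` via `--allow-downgrades`.
3. DHI's `libffi8+dhi0` has the same `dlopen()` problem, which breaks `_ctypes.so` and crashes pandas / numpy / cryptography at runtime. Same fix: downgrade to `libffi8=3.4.8-2`.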
These three are pre-existing breakages of the bash image build against the current DHI registry — `main` was also unbuildable today.
Part 2 - POD_POOL_<LANG> ready timeout
Why warm replicas weren't materialising: `PodPool._wait_for_pod_ready` hardcoded a 60s timeout. The unified bash image (~1.13 GB) regularly exceeds 60s on first pull. Every warm pod got torn down before going Ready, the pool stayed at 0, and the only log was a quiet `warning="Warm pod did not become ready, deleting"` with no indication WHY. Operators reading kubectl saw `replicas: 0` despite `POD_POOL_BASH=N` and nothing in our logs explained the gap.
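Three changes:
1. Make the timeout configurable: new `POD_POOL_READY_TIMEOUT_SECONDS` setting, default 300s (5 min), bounded [30, 1800], documented in `.env.example`.
2. Early-abort on terminal pod states: `ImagePullBackOff`, `ErrImagePull`, `InvalidImageName`, `CrashLoopBackOff`, `CreateContainerConfigError`, `CreateContainerError`. A misconfigured image now fails in ~5s with the actual reason logged at ERROR, instead of waiting out the full timeout.
3. Loud diagnostic on real timeouts: when the timeout is hit (typically a slow image pull during `ContainerCreating`), log at ERROR with the last-observed waiting reason and a hint to raise `POD_POOL_READY_TIMEOUT_SECONDS`.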
The companion rollup log in `_create_warm_pod` was bumped from WARNING to ERROR for the same operator-discoverability reason.
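The wait behaviour described above can be illustrated with a minimal, self-contained sketch. It is not the code from `PodPool._wait_for_pod_ready`: the `probe` callable, the function name `wait_for_pod_ready`, and the return shape are assumptions made for illustration, and only the list of terminal waiting reasons and the 300s default are taken from this PR.

```python
import time
from typing import Callable, Optional, Tuple

# Waiting reasons this PR treats as terminal (from the description above).
TERMINAL_WAITING_REASONS = frozenset({
    "ImagePullBackOff",
    "ErrImagePull",
    "InvalidImageName",
    "CrashLoopBackOff",
    "CreateContainerConfigError",
    "CreateContainerError",
})

# Hypothetical probe: returns (is_ready, waiting_reason_or_None) for one pod.
PodProbe = Callable[[], Tuple[bool, Optional[str]]]


def wait_for_pod_ready(
    probe: PodProbe,
    timeout_seconds: float = 300.0,
    poll_interval: float = 1.0,
) -> Tuple[bool, str]:
    """Poll until the pod is ready, hits a terminal reason, or times out.

    Returns (ready, detail), where detail is suitable for an ERROR log line.
    """
    deadline = time.monotonic() + timeout_seconds
    last_reason: Optional[str] = None
    while time.monotonic() < deadline:
        ready, reason = probe()
        if ready:
            return True, "ready"
        if reason in TERMINAL_WAITING_REASONS:
            # Abort early: waiting longer cannot fix a misconfigured image.
            return False, f"terminal waiting reason: {reason}"
        if reason is not None:
            # Remember non-terminal reasons (e.g. ContainerCreating) so a
            # later timeout can report what the pod was stuck on.
            last_reason = reason
        time.sleep(poll_interval)
    return False, (
        f"timed out after {timeout_seconds}s, "
        f"last waiting reason: {last_reason}"
    )
```

The key design point is that `ContainerCreating` is only recorded, never treated as a failure, so a slow pull is bounded by the timeout alone while a bad image reference fails on the first poll that reports it.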
Validation
```python
import xlsxwriter
workbook = xlsxwriter.Workbook("/tmp/beispiel.xlsx")
ws = workbook.add_worksheet()
ws.write("A1", "Hallo")
workbook.close()
```
→ xlsx written, 5,250 bytes.
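Beyond the single reproduction, a broader presence check over part of the new stack could look like the sketch below. It is not part of this PR: `IMPORT_NAMES` and `missing_modules` are hypothetical names, and the mapping only covers a handful of the packages listed above whose import name differs from, or matches, the pip name.

```python
import importlib.util
from typing import Iterable, List

# pip distribution name -> top-level import name (they differ for several).
IMPORT_NAMES = {
    "xlsxwriter": "xlsxwriter",
    "reportlab": "reportlab",
    "python-docx": "docx",
    "python-pptx": "pptx",
    "pdf2image": "pdf2image",
    "scikit-learn": "sklearn",
    "opencv-python-headless": "cv2",
    "pillow": "PIL",
}


def missing_modules(import_names: Iterable[str]) -> List[str]:
    """Return the top-level modules that cannot be located on sys.path."""
    return [
        name for name in import_names
        if importlib.util.find_spec(name) is None
    ]


if __name__ == "__main__":
    missing = missing_modules(IMPORT_NAMES.values())
    print("missing:", missing or "none")
```

Note that `find_spec` only locates a module without importing it, so this check would not catch the `dlopen()` failures described in Part 1; an actual `import` (as in the reproduction above) is needed for that.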
New tests
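`tests/unit/test_pool.py`:
- `_wait_for_pod_ready` picks up the settings default when no timeout is passed.
- Early-abort on each of the 6 terminal waiting reasons (parametrized), asserting the abort happens in <5s on a 30s nominal timeout.
- `ContainerCreating` is NOT treated as terminal: the wait continues through the pull and succeeds when the pod eventually goes Ready.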
Operator note
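If you currently see `POD_POOL_BASH=N` not materialising warm replicas, this PR should fix it. If the new 300s default is still too short for your registry/node combo, raise `POD_POOL_READY_TIMEOUT_SECONDS` further (up to the 1800s bound). The new ERROR logs tell you which case you're in: an early abort names a terminal waiting reason such as `ImagePullBackOff`, which points at a misconfigured image, while a timeout whose last-observed reason is `ContainerCreating` means the image is probably still being pulled.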
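For operators who want to sanity-check a value before rolling it out, the bounds on the setting can be sketched as follows. Only the variable name, the 300s default, and the [30, 1800] range come from this PR; the function `read_ready_timeout` is hypothetical, and rejecting out-of-range values (rather than clamping them) is an assumption, since the PR does not say which the real settings layer does.

```python
import os
from typing import Mapping, Optional

DEFAULT_READY_TIMEOUT_SECONDS = 300
MIN_READY_TIMEOUT_SECONDS = 30
MAX_READY_TIMEOUT_SECONDS = 1800


def read_ready_timeout(env: Optional[Mapping[str, str]] = None) -> int:
    """Read POD_POOL_READY_TIMEOUT_SECONDS, enforcing the documented bounds.

    Assumption: out-of-range values raise instead of being clamped.
    """
    env = os.environ if env is None else env
    raw = env.get("POD_POOL_READY_TIMEOUT_SECONDS")
    if raw is None or raw.strip() == "":
        return DEFAULT_READY_TIMEOUT_SECONDS
    value = int(raw)  # raises ValueError on non-numeric input
    if not MIN_READY_TIMEOUT_SECONDS <= value <= MAX_READY_TIMEOUT_SECONDS:
        raise ValueError(
            f"POD_POOL_READY_TIMEOUT_SECONDS={value} is outside "
            f"[{MIN_READY_TIMEOUT_SECONDS}, {MAX_READY_TIMEOUT_SECONDS}]"
        )
    return value
```

Passing a plain dict instead of relying on `os.environ` keeps the check easy to run against a candidate configmap value without touching the live environment.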
🤖 Generated with Claude Code