Commit 52afaac

feat: Add network sanity check for Kamal deployments
- Introduced a new Makefile target `kamal-network-sanity` to compare host-shell and app-container connectivity on a Kamal destination.
- Added a Python script `check_kamal_network_sanity.py` that probes external URLs and the server's public hostname, reporting connectivity issues.
- Updated documentation to include usage instructions for the new Makefile target, enhancing the deployment process by ensuring network accessibility.

These changes improve the reliability of Kamal deployments by validating network configurations.
1 parent a03afc7 commit 52afaac

3 files changed

Lines changed: 236 additions & 2 deletions

Makefile

Lines changed: 29 additions & 1 deletion
@@ -1,5 +1,5 @@
 .PHONY: help lint lint-check format test lint-test test-coverage-compare clear-thumbnail-cache prime-thumbnail-cache prime-static-map-cache prime-visual-caches db-export db-import db-sync gbl-admin-db-download gbl-admin-db-unzip gbl-admin-db-restore gbl-admin-db-sync gbl-admin-db-add-latest-btaa-fields gbl-admin-db-import-resources populate-distributions backfill-distributions populate-data-dictionaries gbl-admin-db-import-all reindex reindex-benchmark local-clear-search-cache es-unblock populate-relationships verify-h3-index kamal-reindex kamal-verify-h3-index kamal-clear-cache kamal-prime-thumbnail-cache clear_cache frontend-reset ogm-refresh ogm-refresh-all ogm-refresh-repo ogm-status ogm-status-watch ogm-failures bridge-init bridge-sync bridge-cancel bridge-status bridge-status-watch bridge-failures blog-sync
-.PHONY: kamal-blog-sync kamal-purge-home-blog-cache kamal-bridge-status kamal-bridge-status-watch kamal-cron-debug kamal-cron-test-bridge kamal-worker-logs docs-serve docs-build
+.PHONY: kamal-blog-sync kamal-purge-home-blog-cache kamal-bridge-status kamal-bridge-status-watch kamal-cron-debug kamal-cron-test-bridge kamal-worker-logs kamal-network-sanity docs-serve docs-build
 
 # Load environment variables from .env file if it exists
 -include .env
@@ -87,6 +87,10 @@ KAMAL_REINDEX_REMOVE_LEGACY_INDEX ?= true
 # If unset, the target falls back to APPLICATION_URL from Kamal env.
 KAMAL_API_URL ?=
 KAMAL_CACHE_TYPE ?= search
+KAMAL_NETWORK_SELF_URL ?=
+KAMAL_NETWORK_EXTERNAL_URLS ?= https://api.github.com https://raw.githubusercontent.com https://gin.btaa.org http://example.com
+KAMAL_NETWORK_CONNECT_TIMEOUT ?= 5
+KAMAL_NETWORK_MAX_TIME ?= 12
 OGM_API_URL ?= http://localhost:8000
 OGM_STATUS_POLL_SECONDS ?= 5
 BRIDGE_API_URL ?= http://localhost:8000
@@ -1044,6 +1048,30 @@ kamal-worker-logs: ## Tail Celery worker logs (diagnose queued-but-not-running t
 	@echo "Tailing worker logs (Ctrl+C to stop)..."
 	@kamal app logs -d $(KAMAL_DEST) --roles worker --lines $(or $(KAMAL_LOG_LINES),200) -f
 
+# Compare host-shell and app-container networking on a Kamal destination.
+# This catches "host works, container cannot reach self public FQDN" problems.
+# Usage:
+# make kamal-network-sanity
+# make kamal-network-sanity KAMAL_DEST=dev2
+# make kamal-network-sanity KAMAL_APP_ROLE=cron
+# make kamal-network-sanity KAMAL_NETWORK_EXTERNAL_URLS="https://api.github.com https://geo.btaa.org"
+kamal-network-sanity: ## Check host/container outbound + self-FQDN networking on Kamal
+	@echo "Checking Kamal networking sanity (dest: $(KAMAL_DEST), role: $(KAMAL_APP_ROLE))..."
+	@if [ -z "$$KAMAL_SSH_USER" ] || [ -z "$$KAMAL_HOST" ]; then \
+		echo "ERROR: KAMAL_SSH_USER and KAMAL_HOST environment variables must be set."; \
+		echo "Use KAMAL_DEST=dev1 or dev2. Ensure .kamal/secrets-common and .kamal/secrets.dev1 (or .secrets.dev2) exist."; \
+		exit 1; \
+	fi
+	@KAMAL_DEST="$(KAMAL_DEST)" \
+	KAMAL_HOST="$(KAMAL_HOST)" \
+	KAMAL_SSH_USER="$(KAMAL_SSH_USER)" \
+	KAMAL_APP_ROLE="$(KAMAL_APP_ROLE)" \
+	KAMAL_NETWORK_SELF_URL="$(KAMAL_NETWORK_SELF_URL)" \
+	KAMAL_NETWORK_EXTERNAL_URLS='$(KAMAL_NETWORK_EXTERNAL_URLS)' \
+	KAMAL_NETWORK_CONNECT_TIMEOUT="$(KAMAL_NETWORK_CONNECT_TIMEOUT)" \
+	KAMAL_NETWORK_MAX_TIME="$(KAMAL_NETWORK_MAX_TIME)" \
+	python3 backend/scripts/check_kamal_network_sanity.py
+
 # Prime thumbnail cache on remote Kamal app container.
 # Usage examples:
 # source .kamal/secrets && make kamal-prime-thumbnail-cache
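
The Makefile variables above reach the script as environment variables. The following is a minimal sketch of how they could resolve into the final probe list, based on the defaults visible in this commit. The function name `resolve_probe_urls` and the `env` parameter are hypothetical and do not appear in the commit; the fallback and de-duplication rules mirror what the added script does in `main()`.

```python
import os
import shlex
from typing import List, Mapping

# Defaults copied from the commit; everything else here is illustrative.
DEFAULT_EXTERNAL_URLS = (
    "https://api.github.com",
    "https://raw.githubusercontent.com",
    "https://gin.btaa.org",
    "http://example.com",
)


def resolve_probe_urls(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return external URLs followed by the self URL, without duplicates.

    Hypothetical helper: assumes KAMAL_HOST is set in `env`.
    """
    host = env["KAMAL_HOST"].strip()
    # An empty or unset self URL falls back to https://<KAMAL_HOST>.
    self_url = env.get("KAMAL_NETWORK_SELF_URL", "").strip() or f"https://{host}"
    # The external list is whitespace-separated and parsed with shell rules.
    external = shlex.split(env.get("KAMAL_NETWORK_EXTERNAL_URLS", "").strip())
    if not external:
        external = list(DEFAULT_EXTERNAL_URLS)
    # dict.fromkeys keeps first-seen order while dropping repeated URLs.
    return list(dict.fromkeys([*external, self_url]))
```

Calling `resolve_probe_urls({"KAMAL_HOST": "app.example.org"})` would yield the four default external URLs followed by `https://app.example.org`; the hostname here is a placeholder.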
backend/scripts/check_kamal_network_sanity.py

Lines changed: 206 additions & 0 deletions
@@ -0,0 +1,206 @@
+from __future__ import annotations
+
+import os
+import re
+import shlex
+import shutil
+import subprocess
+import sys
+from dataclasses import dataclass
+from typing import Dict, Iterable, List
+
+DEFAULT_EXTERNAL_URLS = (
+    "https://api.github.com",
+    "https://raw.githubusercontent.com",
+    "https://gin.btaa.org",
+    "http://example.com",
+)
+RESULT_PREFIX = "RESULT\t"
+
+
+@dataclass
+class ProbeResult:
+    url: str
+    exit_code: int
+    details: str
+
+    @property
+    def fields(self) -> Dict[str, str]:
+        return dict(re.findall(r"([a-z_]+)=([^ ]*)", self.details))
+
+    @property
+    def http_code(self) -> str:
+        return self.fields.get("http", "")
+
+    @property
+    def remote_ip(self) -> str:
+        return self.fields.get("remote_ip", "")
+
+    @property
+    def total(self) -> str:
+        return self.fields.get("total", "")
+
+    @property
+    def ok(self) -> bool:
+        return self.exit_code == 0 and self.http_code not in {"", "000"}
+
+
+def _required_env(name: str) -> str:
+    value = os.getenv(name, "").strip()
+    if not value:
+        raise SystemExit(f"ERROR: {name} is required")
+    return value
+
+
+def _probe_script(urls: Iterable[str], *, connect_timeout: str, max_time: str) -> str:
+    url_args = " ".join(shlex.quote(url) for url in urls)
+    return (
+        f"for url in {url_args}; do "
+        "out=$(curl -I -sS -o /dev/null "
+        f"-w 'http=%{{http_code}} remote_ip=%{{remote_ip}} connect=%{{time_connect}} "
+        f"tls=%{{time_appconnect}} total=%{{time_total}}' "
+        f"--connect-timeout {shlex.quote(connect_timeout)} "
+        f'-m {shlex.quote(max_time)} "$url" 2>&1); '
+        "status=$?; "
+        'printf \'RESULT\t%s\t%s\t%s\n\' "$url" "$status" "$out"; '
+        "done"
+    )
+
+
+def _run_command(cmd: List[str], label: str) -> str:
+    try:
+        result = subprocess.run(cmd, check=True, capture_output=True, text=True)
+    except FileNotFoundError as exc:
+        raise SystemExit(f"ERROR: missing required command for {label}: {exc.filename}") from exc
+    except subprocess.CalledProcessError as exc:
+        stderr = exc.stderr.strip()
+        stdout = exc.stdout.strip()
+        detail = stderr or stdout or f"exit status {exc.returncode}"
+        raise SystemExit(f"ERROR: {label} failed: {detail}") from exc
+    return result.stdout
+
+
+def _parse_results(output: str) -> Dict[str, ProbeResult]:
+    results: Dict[str, ProbeResult] = {}
+    for line in output.splitlines():
+        if not line.startswith(RESULT_PREFIX):
+            continue
+        _, url, exit_code, details = line.split("\t", 3)
+        results[url] = ProbeResult(url=url, exit_code=int(exit_code), details=details)
+    return results
+
+
+def _print_section(
+    title: str, urls: List[str], results: Dict[str, ProbeResult], self_url: str
+) -> None:
+    print(f"{title}:")
+    for url in urls:
+        result = results.get(url)
+        if result is None:
+            print(f" FAIL {url} (no result)")
+            continue
+        label = "self" if url == self_url else "ext "
+        if result.ok:
+            print(
+                f" OK [{label}] {url} "
+                f"http={result.http_code} ip={result.remote_ip or '-'} total={result.total or '-'}"
+            )
+        else:
+            print(
+                f" FAIL [{label}] {url} exit={result.exit_code} "
+                f"http={result.http_code or '000'} detail={result.details}"
+            )
+
+
+def main() -> int:
+    if shutil.which("ssh") is None:
+        raise SystemExit("ERROR: ssh is required")
+    if shutil.which("kamal") is None:
+        raise SystemExit("ERROR: kamal is required")
+
+    kamal_dest = _required_env("KAMAL_DEST")
+    kamal_host = _required_env("KAMAL_HOST")
+    kamal_ssh_user = _required_env("KAMAL_SSH_USER")
+    kamal_app_role = os.getenv("KAMAL_APP_ROLE", "web").strip() or "web"
+    connect_timeout = os.getenv("KAMAL_NETWORK_CONNECT_TIMEOUT", "5").strip() or "5"
+    max_time = os.getenv("KAMAL_NETWORK_MAX_TIME", "12").strip() or "12"
+
+    self_url = os.getenv("KAMAL_NETWORK_SELF_URL", "").strip() or f"https://{kamal_host}"
+    external_urls = shlex.split(os.getenv("KAMAL_NETWORK_EXTERNAL_URLS", "").strip())
+    if not external_urls:
+        external_urls = list(DEFAULT_EXTERNAL_URLS)
+
+    urls = list(dict.fromkeys([*external_urls, self_url]))
+    probe_script = _probe_script(urls, connect_timeout=connect_timeout, max_time=max_time)
+
+    host_output = _run_command(
+        ["ssh", f"{kamal_ssh_user}@{kamal_host}", f"bash -lc {shlex.quote(probe_script)}"],
+        "host probe",
+    )
+    container_output = _run_command(
+        [
+            "kamal",
+            "app",
+            "exec",
+            "-d",
+            kamal_dest,
+            "--roles",
+            kamal_app_role,
+            "--reuse",
+            f"bash -lc {shlex.quote(probe_script)}",
+        ],
+        "container probe",
+    )
+
+    host_results = _parse_results(host_output)
+    container_results = _parse_results(container_output)
+
+    print(
+        f"Network sanity report for {kamal_dest} "
+        f"(host={kamal_host}, role={kamal_app_role}, self_url={self_url})"
+    )
+    print("")
+    _print_section("Host shell", urls, host_results, self_url)
+    print("")
+    _print_section("Container", urls, container_results, self_url)
+    print("")
+
+    failures: List[str] = []
+    for url in urls:
+        host_result = host_results.get(url)
+        container_result = container_results.get(url)
+        if host_result is None or not host_result.ok:
+            failures.append(f"host failed: {url}")
+        if container_result is None or not container_result.ok:
+            failures.append(f"container failed: {url}")
+
+    host_self_ok = host_results.get(self_url).ok if self_url in host_results else False
+    container_self_ok = (
+        container_results.get(self_url).ok if self_url in container_results else False
+    )
+    external_container_failures = [
+        url for url in external_urls if not container_results.get(url, ProbeResult(url, 1, "")).ok
+    ]
+
+    if not failures:
+        print(
+            "PASS: host shell and container can reach the expected external URLs "
+            "and self public hostname."
+        )
+        return 0
+
+    if host_self_ok and not container_self_ok and not external_container_failures:
+        print(
+            "FAIL: host shell can reach the self public hostname, but the container cannot. "
+            "This points to a container-to-self-FQDN / hairpin / firewall-path issue."
+        )
+    else:
+        print("FAIL: one or more host/container connectivity probes failed.")
+
+    for failure in failures:
+        print(f" - {failure}")
+    return 1
+
+
+if __name__ == "__main__":
+    sys.exit(main())
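
The script above communicates between the remote shell and the local process through tab-separated `RESULT` lines. As a rough illustration of that wire format, the sketch below parses one such line and applies the same success rule as `ProbeResult.ok`. The helper names `parse_result_line`, `probe_ok`, and `looks_like_hairpin` are hypothetical and not part of the commit, and the sample values used with them are made up. Note that, per the script's `main()`, the exit status is 1 whenever any host or container probe fails; the hairpin check only selects a more specific failure message.

```python
import re
from typing import Dict, List, Tuple


def parse_result_line(line: str) -> Tuple[str, int, Dict[str, str]]:
    """Split 'RESULT<TAB>url<TAB>exit<TAB>details' into url, exit code, fields."""
    _, url, exit_code, details = line.split("\t", 3)
    # details is the curl -w output, e.g. "http=200 remote_ip=... total=..."
    fields = dict(re.findall(r"([a-z_]+)=([^ ]*)", details))
    return url, int(exit_code), fields


def probe_ok(exit_code: int, fields: Dict[str, str]) -> bool:
    """Same rule as ProbeResult.ok: curl exited 0 and reported a real HTTP code."""
    return exit_code == 0 and fields.get("http", "") not in {"", "000"}


def looks_like_hairpin(
    host_self_ok: bool,
    container_self_ok: bool,
    external_container_failures: List[str],
) -> bool:
    """True when only the container-to-self probe is failing."""
    return host_self_ok and not container_self_ok and not external_container_failures
```

For example, a line such as `"RESULT\thttps://example.org\t0\thttp=200 remote_ip=203.0.113.7 connect=0.012 tls=0.045 total=0.101"` parses to exit code `0` with `http` equal to `"200"`, which `probe_ok` accepts; a line with exit code `7` and `http=000` is rejected.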

docs/make_tasks.md

Lines changed: 1 addition & 1 deletion
@@ -40,6 +40,7 @@ Overrides:
 - Useful overrides: `KAMAL_REINDEX_RETAIN_PREVIOUS=1` (default), `KAMAL_REINDEX_PRUNE_OLD=true` (default), `KAMAL_REINDEX_ALLOW_PARTIAL=false` (default; blocks swap on indexing/count mismatch), `KAMAL_REINDEX_REMOVE_LEGACY_INDEX=true` (default; one-time migration from legacy non-alias index name).
 - `make kamal-verify-h3-index`: verify H3 fields on remote Kamal app containers. Use `KAMAL_DEST=dev1` or `dev2`.
 - `make kamal-clear-cache`: clear remote API cache on Kamal (defaults to `KAMAL_CACHE_TYPE=search`). Use `KAMAL_DEST=dev1` or `dev2`. Override with `KAMAL_CACHE_TYPE=all` (or `suggest`/`item`).
+- `make kamal-network-sanity`: compare host-shell and app-container connectivity on a Kamal destination. It probes a few external URLs plus the server's own public hostname and exits nonzero if the container cannot reach something the host can. Defaults to `KAMAL_DEST=dev1`, role `web`, self URL `https://$(KAMAL_HOST)`, and external URLs `https://api.github.com https://raw.githubusercontent.com https://gin.btaa.org http://example.com`. Override with `KAMAL_APP_ROLE=cron`, `KAMAL_NETWORK_SELF_URL=...`, or `KAMAL_NETWORK_EXTERNAL_URLS="..."`.
 - `make ingest`: ingest BTAA fixture JSON files into the DB (runs inside the `api` Docker container). Default: `data/fixtures/btaa_fixtures_data`. Override with `make ingest FIXTURES_DIR=btaa_featured_resources REPO_NAME=btaa_featured_resources`. After ingest, run `make reindex` to index into Elasticsearch.
 - `make ingest-featured`: ingest `data/fixtures/btaa_featured_resources` into the DB and then reindex into Elasticsearch (one-step for featured resources).
 - `make clear_cache`: flush Redis cache DB (`REDIS_DB`, requires `REDIS_PASSWORD`)
@@ -108,4 +109,3 @@ Elasticsearch blocks writes when the disk passes the flood-stage watermark (e.g.
 3. **Relax watermarks for local dev only**: In `docker-compose.yml`, under the `elasticsearch` service env, you can temporarily set e.g. `cluster.routing.allocation.disk.watermark.flood_stage=99.9%` (or disable with `cluster.routing.allocation.disk.threshold_enabled=false`). Only do this on a dev machine with enough free space; otherwise you risk filling the disk.
 
 4. **Use a remote Elasticsearch** with more space: Point the app at another ES (e.g. via `ELASTICSEARCH_URL` and run reindex there).
-