fix: container crash loop protection #189

Open

reubenmiller wants to merge 6 commits into main from fix/crash-loop-protection

Conversation

@reubenmiller
Collaborator

No description provided.

A container in a crash loop (restart policy: always) generates a
continuous stream of start/die events.  Previously, each event spawned a
goroutine that called Update(), which blocked on the unbuffered
updateRequests channel.  Under a burst this caused:
  - an ever-growing pile of goroutines waiting to send
  - a proportional number of sequential doUpdate() calls, each making
    real network calls to the container daemon, tedge HTTP API and the
    Cumulocity proxy

Changes in this commit:

* pkg/app/debounce.go (new)
  - UpdateDebouncer: coalesces requests received within a 2-second quiet
    window into one (see the sketch after this list).  Scoped requests
    (by ID/Name) are merged by unioning their ID/Name sets; a full-scan
    request always supersedes any scoped one.
  - mergeFilterOptions / mergeRequests: reusable merge helpers.

* pkg/app/app.go
  - ActionRequest now carries an optional result chan<- error so each
    caller gets its own response; the shared updateResults channel (which
    had a latent deadlock for ActionUpdateMetrics) is removed.
  - sendResult() helper: delivers the worker result non-blocking.
  - worker() uses sendResult() and now correctly returns errors for
    ActionUpdateMetrics.
  - Update() / UpdateMetrics() create per-call result channels.
  - updateRequests is buffered (cap 8) so the debouncer dispatch and
    synchronous Update() callers never block on a temporarily busy
    worker.
  - Monitor() event goroutines replaced with debouncer.Enqueue() calls;
    the 500 ms pre-sleep is superseded by the 2-second debounce window.
  - Subscribe() callbacks use the debouncer instead of spawning
    goroutines that block on the old unbuffered channel.
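
A minimal sketch of the coalescing pattern behind UpdateDebouncer, assuming a
quiet-window timer that is restarted on every enqueue. The names here
(debouncer, request, Enqueue) are illustrative simplifications, not the
plugin's actual implementation:

```go
// Coalesce requests until no new request has arrived for a quiet window,
// then dispatch a single merged request.
package main

import (
	"fmt"
	"sync"
	"time"
)

type request struct {
	Names []string // empty means "full scan"
}

type debouncer struct {
	mu       sync.Mutex
	timer    *time.Timer
	pending  request
	quiet    time.Duration
	dispatch func(request)
}

func newDebouncer(quiet time.Duration, dispatch func(request)) *debouncer {
	return &debouncer{quiet: quiet, dispatch: dispatch}
}

func (d *debouncer) Enqueue(r request) {
	d.mu.Lock()
	defer d.mu.Unlock()

	switch {
	case d.timer == nil:
		// First request in this window: take it as-is.
		d.pending = r
	case len(r.Names) == 0 || len(d.pending.Names) == 0:
		// A full-scan request supersedes any scoped one.
		d.pending = request{}
	default:
		// Both scoped: union the name sets (duplicates tolerated here).
		d.pending.Names = append(d.pending.Names, r.Names...)
	}

	// Restart the quiet-window timer; dispatch fires only once no new
	// request has arrived for the whole window.
	if d.timer != nil {
		d.timer.Stop()
	}
	d.timer = time.AfterFunc(d.quiet, func() {
		d.mu.Lock()
		merged := d.pending
		d.timer = nil
		d.pending = request{}
		d.mu.Unlock()
		d.dispatch(merged)
	})
}

func main() {
	d := newDebouncer(2*time.Second, func(r request) {
		fmt.Println("dispatch, scoped names:", r.Names) // nil means full scan
	})
	// A burst of scoped requests collapses into a single dispatch
	// roughly 2 seconds after the last one arrives.
	d.Enqueue(request{Names: []string{"app1"}})
	d.Enqueue(request{Names: []string{"app2"}})
	time.Sleep(3 * time.Second)
}
```
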
A container in a crash loop with restart policy "always" can restart
tens of times per minute.  This commit adds observability for that
condition without requiring any changes to compose files.

Changes:

* pkg/app/restarttracker.go (new)
  - RestartTracker: thread-safe sliding-window restart counter (sketched
    after this list).
  - Record(name) → (count, exceeded): evicts events outside the window,
    records the new event, and returns whether count ≥ threshold.
  - Clear(name): resets history when a container recovers or is removed.

* pkg/app/app.go
  - App struct gains restartTracker (*RestartTracker),
    crashLoopAlarms (map[string]struct{}) + crashLoopAlarmsMu.
  - Initialised in NewApp() with a 60-second window and threshold of 5.
  - Monitor() ActionDie handler calls restartTracker.Record(); when
    exceeded it calls publishCrashLoopAlarm() which emits a CRITICAL
    alarm to te/<device>/service/<name>/a/ContainerCrashLoop.
    Duplicate alarms are suppressed until the container recovers.
  - Monitor() ActionHealthStatusHealthy handler calls Clear() +
    clearCrashLoopAlarm(), which publishes status=CLEARED to the same
    alarm topic.
  - Monitor() ActionDestroy/ActionRemove also call Clear() +
    clearCrashLoopAlarm() to tidy up when a container is deleted.
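
A minimal sketch of the sliding-window counter behind RestartTracker; the
internal layout and field names are assumptions, not the plugin's actual
code:

```go
// Count restarts per container name inside a sliding time window and
// report when a threshold is reached.
package main

import (
	"fmt"
	"sync"
	"time"
)

type restartTracker struct {
	mu        sync.Mutex
	window    time.Duration
	threshold int
	events    map[string][]time.Time
}

func newRestartTracker(window time.Duration, threshold int) *restartTracker {
	return &restartTracker{window: window, threshold: threshold, events: map[string][]time.Time{}}
}

// Record registers a restart for name, evicts events outside the window,
// and reports the in-window count and whether the threshold was reached.
func (t *restartTracker) Record(name string) (count int, exceeded bool) {
	t.mu.Lock()
	defer t.mu.Unlock()

	now := time.Now()
	cutoff := now.Add(-t.window)

	kept := t.events[name][:0]
	for _, ts := range t.events[name] {
		if ts.After(cutoff) {
			kept = append(kept, ts)
		}
	}
	kept = append(kept, now)
	t.events[name] = kept

	return len(kept), len(kept) >= t.threshold
}

// Clear drops all history for name, e.g. when the container recovers
// or is removed.
func (t *restartTracker) Clear(name string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.events, name)
}

func main() {
	tracker := newRestartTracker(60*time.Second, 5)
	for i := 0; i < 5; i++ {
		count, exceeded := tracker.Record("crash-loop-app")
		fmt.Println(count, exceeded) // exceeded becomes true on the 5th restart
	}
}
```
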
Two related problems arose when a container crash-looped with
restart policy "always":

1. Event storm: EnableEngineEvents published every start/die event
   synchronously with no throttling. At 10+ events/second this
   saturated the MQTT broker and starved all other MQTT operations
   (including the alarm publish itself).

2. Alarm never delivered: publishCrashLoopAlarm tried to publish to
   te/device/main/service/<name>/a/ContainerCrashLoop over the same
   already-saturated client. The 100 ms Publish() timeout fired
   immediately, the error handler un-marked the alarm, and the retry
   loop started again -- so the alarm never reached the broker.
   Additionally, the container service is typically not yet registered
   at the point the threshold is crossed (the 2-second debounced
   doUpdate() hasn't fired yet), so the mapper would have silently
   dropped it anyway.

Changes:

* pkg/app/eventlimiter.go (new)
  - EventRateLimiter: per-key rate limiter based on last-seen time
    (sketched after this list).
  - Allow(key) gating with configurable minimum interval.
  - Remove(key) to reset a key on container removal/recovery.

* pkg/app/app.go
  - App struct gains eventLimiter (*EventRateLimiter).
  - Initialised in NewApp() with a 5-second per-(container, action)
    window.
  - Monitor(): engine-event publish is now gated by three checks in
    order of precedence:
      1. inCrashLoop: suppress ALL events for the container (the
         alarm already signals the operator; no value in flooding).
      2. eventLimiter.Allow(key) returns false: rate-limited; in the
         normal case at most one event per (container, action) is
         published every 5 seconds.
      3. default: publish as before.
  - eventLimiter.Remove() called on ActionHealthStatusHealthy and
    ActionDestroy/ActionRemove so fresh events after recovery are not
    incorrectly suppressed.
  - publishCrashLoopAlarm() made fully asynchronous (goroutine):
    * Pre-registers the container entity via TedgeAPI.CreateEntity()
      (idempotent) so the mapper can route the alarm even when the
      container dies before doUpdate() fires.
    * Retries the Publish() call up to 5 times with linear back-off
      (500 ms, 1 s, 1.5 s, 2 s, 2.5 s).
    * Checks crashLoopAlarms before each retry to abort early if the
      container has already recovered/been removed.
    * Un-marks on exhaustion so the alarm can be re-raised once the
      broker recovers.
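
A minimal sketch of the two mechanisms described above: the per-key rate
limiter and the linear back-off retry around the alarm publish. All names
here (eventRateLimiter, publishWithRetry, and the callback parameters) are
illustrative assumptions, not the plugin's actual API:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// eventRateLimiter allows at most one event per key per interval.
type eventRateLimiter struct {
	mu       sync.Mutex
	interval time.Duration
	lastSeen map[string]time.Time
}

func newEventRateLimiter(interval time.Duration) *eventRateLimiter {
	return &eventRateLimiter{interval: interval, lastSeen: map[string]time.Time{}}
}

// Allow reports whether an event for key may be published now; if yes it
// records the time so further events within the interval are dropped.
func (l *eventRateLimiter) Allow(key string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := time.Now()
	if last, ok := l.lastSeen[key]; ok && now.Sub(last) < l.interval {
		return false
	}
	l.lastSeen[key] = now
	return true
}

// Remove forgets a key so fresh events after recovery are not suppressed.
func (l *eventRateLimiter) Remove(key string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	delete(l.lastSeen, key)
}

// publishWithRetry retries a failing publish up to 5 times with linear
// back-off (500 ms, 1 s, 1.5 s, ...), aborting early if the container
// has already recovered or been removed.
func publishWithRetry(publish func() error, stillInCrashLoop func() bool) error {
	var err error
	for attempt := 1; attempt <= 5; attempt++ {
		if !stillInCrashLoop() {
			return nil
		}
		if err = publish(); err == nil {
			return nil
		}
		time.Sleep(time.Duration(attempt) * 500 * time.Millisecond)
	}
	return err // exhausted: caller can un-mark so the alarm is re-raised later
}

func main() {
	limiter := newEventRateLimiter(5 * time.Second)
	key := "myproject@app/die"
	fmt.Println(limiter.Allow(key)) // true: first event passes
	fmt.Println(limiter.Allow(key)) // false: suppressed within 5 s

	err := publishWithRetry(
		func() error { return errors.New("broker busy") },
		func() bool { return true },
	)
	fmt.Println("publish result:", err)
}
```
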
When publishCrashLoopAlarm fires, the container entity and its CRITICAL
alarm are published, but the health topic was not updated. This meant the
service list still showed the container's last status (often 'up' from
its most recent brief start) rather than reflecting the crash-loop state.

In the publishing goroutine, after pre-registering the entity, now also
publish a retained health message with status=down to the container's
te/.../status/health topic. The next normal doUpdate() call will
overwrite this with the real runtime status once the container recovers.
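
A hedged sketch of the retained health publish, assuming an eclipse/paho MQTT
client and the default te/device/main/... topic scheme; the plugin's real
client wrapper, payload fields, and topic construction may differ:

```go
package main

import (
	"fmt"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

// publishDownStatus publishes a retained "down" health status for a service
// so the service list reflects the crash loop until the next doUpdate()
// overwrites it with the real runtime status.
func publishDownStatus(client mqtt.Client, serviceName string) error {
	topic := fmt.Sprintf("te/device/main/service/%s/status/health", serviceName)
	payload := `{"status":"down"}` // assumed minimal thin-edge health payload
	// retained=true so the broker keeps the last-known status for late subscribers.
	token := client.Publish(topic, 1, true, payload)
	token.Wait()
	return token.Error()
}

func main() {
	opts := mqtt.NewClientOptions().AddBroker("tcp://127.0.0.1:1883")
	client := mqtt.NewClient(opts)
	if token := client.Connect(); token.Wait() && token.Error() != nil {
		panic(token.Error())
	}
	if err := publishDownStatus(client, "myproject@app"); err != nil {
		panic(err)
	}
}
```
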
Container events from Docker carry the raw container name (e.g.
"crash-loop-app-1") in Actor.Attributes["name"].  For compose
services the canonical thin-edge service name is "project@service"
(matching Container.GetName()), not the Docker-generated container name.

Using the raw name caused the crash-loop alarm, health status, and rate-
limiter to operate on a different key than the one used by doUpdate()
when registering and updating the service -- so the alarm landed on the
wrong entity (or an unregistered one).

Changes:

* serviceNameFromEventAttrs() (new helper, sketched after this list)
  derives the service name from Actor.Attributes using the same logic as
  Container.GetName():
    - If com.docker.compose.project and com.docker.compose.service are
      both present, returns "project@service".
    - Otherwise falls back to the raw "name" attribute (plain containers).

* Monitor(): all usages of evt.Actor.Attributes["name"] that feed into
  restartTracker, eventLimiter, crashLoopAlarms, publishCrashLoopAlarm,
  and clearCrashLoopAlarm are replaced with serviceNameFromEventAttrs().

* publishCrashLoopAlarm(): entity pre-registration now infers the correct
  service type from the name -- names containing "@" are registered as
  container-group, others as container.
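
A minimal sketch of the name derivation described above; the exact label
handling in the real serviceNameFromEventAttrs() may differ:

```go
package main

import "fmt"

// serviceNameFromEventAttrs derives the thin-edge service name from Docker
// event attributes: compose containers map to "project@service", plain
// containers fall back to the Docker-assigned name.
func serviceNameFromEventAttrs(attrs map[string]string) string {
	project := attrs["com.docker.compose.project"]
	service := attrs["com.docker.compose.service"]
	if project != "" && service != "" {
		return project + "@" + service
	}
	return attrs["name"]
}

func main() {
	composeAttrs := map[string]string{
		"name":                       "crash-loop-app-1",
		"com.docker.compose.project": "myproject",
		"com.docker.compose.service": "app",
	}
	fmt.Println(serviceNameFromEventAttrs(composeAttrs)) // myproject@app

	plainAttrs := map[string]string{"name": "standalone"}
	fmt.Println(serviceNameFromEventAttrs(plainAttrs)) // standalone
}
```
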
@github-actions

github-actions Bot commented May 13, 2026

Robot Results

| ✅ Passed | ❌ Failed | ⏭️ Skipped | Total | Pass % | ⏱️ Duration |
| --- | --- | --- | --- | --- | --- |
| 40 | 0 | 0 | 40 | 100 | 10m3.333902s |

Passed Tests

Name ⏱️ Duration Suite
Get Container Logs 8.189 s Container-Logs
Get Container Logs with only last N lines 0.567 s Container-Logs
Get Container Logs By Operation 8.129 s Container-Logs
Remove Container 7.394 s Container-Remove
Remove Container Non Existent Container Should Not Through An Error 0.106 s Container-Remove
Restart Container 0.584 s Container-Restart
Restart Unknown Container Fails 0.103 s Container-Restart
Update to tedge-container-plugin-ng 25.355 s Installation
Check for Update 10.182 s Operations-Clone
Clone Existing Container 28.080 s Operations-Clone
Clone Existing Container by Timeout Whilst Waiting For Exit 18.450 s Operations-Clone
Clone Existing Container but Waiting For Exit 24.228 s Operations-Clone
Ignore Containers With Given Label 21.317 s Operations-Clone
Install/uninstall container-image 19.894 s Operations-Container-Image
Install/uninstall not existent container image 9.251 s Operations-Container-Image
Install/uninstall container package from private repository - credentials file 14.593 s Operations-Private-Registries
Install/uninstall container package from private repository - credentials script 14.459 s Operations-Private-Registries
Install/uninstall container package from private repository - credentials script with cache 21.557 s Operations-Private-Registries
Install/uninstall container package from private repository - engine credentials 17.988 s Operations-Private-Registries
Install/uninstall container package from private repository - docker from docker 38.093 s Operations-Private-Registries
Get Configuration 5.447 s Operations
Install/uninstall container-group package 22.790 s Operations
Install/uninstall container-group package with non-existent image 11.503 s Operations
Install invalid container-group 3.173 s Operations
Install container-group with multiple files - app1 12.687 s Operations
Install container-group with multiple files - app2 8.039 s Operations
Install/uninstall container package 20.420 s Operations
Install/uninstall container package from file 14.707 s Operations
Manual container creation/deletion 20.706 s Operations
Manual container creation/deletion with error on run 9.919 s Operations
Manual container created and then killed 15.070 s Operations
Remove Orphaned Cloud Services 20.432 s Operations
Remove Orphaned Cloud Services eventually if Cumulocity Proxy is Unavailable at deletion time 39.980 s Operations
Install container group that uses host volume mount 8.381 s Operations
Install container group with a container in a crash loop 11.264 s Operations
Self Update Is Present Using Self Type 0.182 s Self
Self Update Is Present Using Container Type 0.182 s Self
Self Update Is Not Present 0.187 s Self
Service status 18.741 s Telemetry-Main
Sends measurements 71.005 s Telemetry-Main
