fix: container crash loop protection #189
Open
reubenmiller wants to merge 6 commits into
A container in a crash loop (restart policy: always) generates a
continuous stream of start/die events. Previously, each event spawned a
goroutine that called Update(), which blocks on the unbuffered
updateRequests channel. Under a burst this caused:
- an ever-growing backlog of goroutines waiting to send
- a proportional number of sequential doUpdate() calls, each making
real network calls to the container daemon, the tedge HTTP API, and the
Cumulocity proxy
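For illustration, a minimal sketch of that failure mode (ActionRequest and the channel name here are stand-ins, not copies of the real pkg/app code):

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// Stand-in for the real request type in pkg/app.
type ActionRequest struct{ Action string }

func main() {
	updateRequests := make(chan ActionRequest) // unbuffered, as before this change

	// Worker drains one request at a time; each doUpdate() makes real
	// network calls, so it is slow relative to the event rate.
	go func() {
		for range updateRequests {
			time.Sleep(200 * time.Millisecond) // stand-in for doUpdate()
		}
	}()

	// A crash-looping container emits a burst of start/die events; each
	// one parks a goroutine on the send until the worker gets to it.
	for i := 0; i < 100; i++ {
		go func() { updateRequests <- ActionRequest{Action: "die"} }()
	}

	time.Sleep(time.Second)
	fmt.Println("goroutines still parked:", runtime.NumGoroutine())
}
```

Each blocked send pins a goroutine, so both the goroutine backlog and the queue of pending doUpdate() calls grow for as long as the burst lasts.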
Changes in this commit:
* pkg/app/debounce.go (new)
- UpdateDebouncer: coalesces requests received within a 2-second quiet
window into one (sketched after this change list). Scoped requests (by
ID/Name) are merged by unioning their ID/Name sets; a full-scan request
always supersedes any scoped one.
- mergeFilterOptions / mergeRequests: reusable merge helpers.
* pkg/app/app.go
- ActionRequest now carries an optional result chan<- error so each
caller gets its own response; the shared updateResults channel (which
had a latent deadlock for ActionUpdateMetrics) is removed.
- sendResult() helper: delivers the worker result non-blocking.
- worker() uses sendResult() and now correctly returns errors for
ActionUpdateMetrics.
- Update() / UpdateMetrics() create per-call result channels.
- updateRequests is buffered (cap 8) so the debouncer dispatch and
synchronous Update() callers never block on a temporarily busy
worker.
- Monitor() event goroutines replaced with debouncer.Enqueue() calls;
the 500 ms pre-sleep is superseded by the 2-second debounce window.
- Subscribe() callbacks use the debouncer instead of spawning
goroutines that block on the old unbuffered channel.
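For reference, a minimal sketch of the debounce-and-merge pattern described in the list above (the real UpdateDebouncer in pkg/app/debounce.go differs in detail; Request and the field names here are illustrative):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Illustrative request: scoped by names, or a full scan.
type Request struct {
	Names []string // scoped: affected container names
	Full  bool     // full scan: supersedes any scoped request
}

type Debouncer struct {
	mu      sync.Mutex
	pending *Request
	timer   *time.Timer
	quiet   time.Duration
	out     chan<- Request
}

func NewDebouncer(quiet time.Duration, out chan<- Request) *Debouncer {
	return &Debouncer{quiet: quiet, out: out}
}

// Enqueue merges the request into the pending one and restarts the
// quiet window; the merged request is dispatched only once no new
// request has arrived for a whole window.
func (d *Debouncer) Enqueue(req Request) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.pending == nil {
		d.pending = &Request{}
	}
	if req.Full || d.pending.Full {
		d.pending.Full = true
		d.pending.Names = nil // a full scan absorbs all scoped requests
	} else {
		// Union the name sets (dedup omitted for brevity).
		d.pending.Names = append(d.pending.Names, req.Names...)
	}
	if d.timer != nil {
		d.timer.Stop()
	}
	d.timer = time.AfterFunc(d.quiet, d.flush)
}

func (d *Debouncer) flush() {
	d.mu.Lock()
	if d.pending == nil { // already flushed by an earlier timer
		d.mu.Unlock()
		return
	}
	req := *d.pending
	d.pending = nil
	d.mu.Unlock()
	d.out <- req // buffered downstream, so this does not park indefinitely
}

func main() {
	out := make(chan Request, 8) // buffered, mirroring the cap-8 channel above
	d := NewDebouncer(2*time.Second, out)
	d.Enqueue(Request{Names: []string{"web"}})
	d.Enqueue(Request{Names: []string{"db"}})
	fmt.Println(<-out) // one merged request: {[web db] false}
}
```

Restarting the timer on every Enqueue() is what makes the window a quiet window: dispatch happens only after 2 seconds with no new events.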
A container in a crash loop with restart policy "always" can restart
tens of times per minute. This commit adds observability for that
condition without requiring any changes to compose files.
Changes:
* pkg/app/restarttracker.go (new)
- RestartTracker: thread-safe sliding-window restart counter (a sketch
follows this list).
- Record(name) → (count, exceeded): evicts events outside the window,
records the new event, and returns whether count ≥ threshold.
- Clear(name): resets history when a container recovers or is removed.
* pkg/app/app.go
- App struct gains restartTracker (*RestartTracker),
crashLoopAlarms (map[string]struct{}) + crashLoopAlarmsMu.
- Initialised in NewApp() with a 60-second window and threshold of 5.
- Monitor() ActionDie handler calls restartTracker.Record(); when
exceeded it calls publishCrashLoopAlarm() which emits a CRITICAL
alarm to te/<device>/service/<name>/a/ContainerCrashLoop.
Duplicate alarms are suppressed until the container recovers.
- Monitor() ActionHealthStatusHealthy handler calls Clear() +
clearCrashLoopAlarm(), which publishes status=CLEARED to the same
alarm topic.
- Monitor() ActionDestroy/ActionRemove also call Clear() +
clearCrashLoopAlarm() to tidy up when a container is deleted.
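A minimal sketch of the sliding-window counter described above (the shape is assumed; the real pkg/app/restarttracker.go may differ):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type RestartTracker struct {
	mu        sync.Mutex
	window    time.Duration
	threshold int
	events    map[string][]time.Time
}

func NewRestartTracker(window time.Duration, threshold int) *RestartTracker {
	return &RestartTracker{window: window, threshold: threshold, events: map[string][]time.Time{}}
}

// Record evicts events older than the window, appends the new one, and
// reports whether the count has reached the threshold.
func (t *RestartTracker) Record(name string) (count int, exceeded bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	cutoff := time.Now().Add(-t.window)
	kept := t.events[name][:0]
	for _, ts := range t.events[name] {
		if ts.After(cutoff) {
			kept = append(kept, ts)
		}
	}
	kept = append(kept, time.Now())
	t.events[name] = kept
	return len(kept), len(kept) >= t.threshold
}

// Clear resets history when a container recovers or is removed.
func (t *RestartTracker) Clear(name string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.events, name)
}

func main() {
	rt := NewRestartTracker(60*time.Second, 5) // window and threshold from NewApp()
	for i := 0; i < 5; i++ {
		if n, exceeded := rt.Record("crash-loop-app"); exceeded {
			fmt.Printf("crash loop detected after %d restarts\n", n)
		}
	}
}
```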
Two related problems arose when a container crash-looped with
restart policy "always":
1. Event storm: EnableEngineEvents published every start/die event
synchronously with no throttling. At 10+ events/second this
saturated the MQTT broker and starved all other MQTT operations
(including the alarm publish itself).
2. Alarm never delivered: publishCrashLoopAlarm tried to publish to
te/device/main/service/<name>/a/ContainerCrashLoop over the same
already-saturated client. The 100 ms Publish() timeout expired every
time, the error handler un-marked the alarm, and the retry loop
started again -- so the alarm never reached the broker.
Additionally, the container service is typically not yet registered
at the point the threshold is crossed (the 2-second debounced
doUpdate() hasn't fired yet), so the mapper would have silently
dropped it anyway.
Changes:
* pkg/app/eventlimiter.go (new)
- EventRateLimiter: per-key rate limiter based on last-seen time (a
sketch follows this list).
- Allow(key) gating with configurable minimum interval.
- Remove(key) to reset a key on container removal/recovery.
* pkg/app/app.go
- App struct gains eventLimiter (*EventRateLimiter).
- Initialised in NewApp() with a 5-second per-(container, action)
window.
- Monitor(): engine-event publish is now gated by three checks in
order of precedence:
1. inCrashLoop: suppress ALL events for the container (the
alarm already signals the operator; no value in flooding).
2. eventLimiter.Allow(key) returns false: rate-limited to at most
one event per (container, action) pair every 5 seconds in the
normal case.
3. default: publish as before.
- eventLimiter.Remove() called on ActionHealthStatusHealthy and
ActionDestroy/ActionRemove so fresh events after recovery are not
incorrectly suppressed.
- publishCrashLoopAlarm() made fully asynchronous (goroutine):
* Pre-registers the container entity via TedgeAPI.CreateEntity()
(idempotent) so the mapper can route the alarm even when the
container dies before doUpdate() fires.
* Retries the Publish() call up to 5 times with linear back-off
(500 ms, 1 s, 1.5 s, 2 s, 2.5 s).
* Checks crashLoopAlarms before each retry to abort early if the
container has already recovered/been removed.
* Un-marks on exhaustion so the alarm can be re-raised once the
broker recovers.
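A minimal sketch of the per-key limiter described above (assumed shape; the real pkg/app/eventlimiter.go may differ):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type EventRateLimiter struct {
	mu       sync.Mutex
	interval time.Duration
	lastSeen map[string]time.Time
}

func NewEventRateLimiter(interval time.Duration) *EventRateLimiter {
	return &EventRateLimiter{interval: interval, lastSeen: map[string]time.Time{}}
}

// Allow reports whether an event for key may be published now, and if
// so records the time so the next one inside the interval is dropped.
func (l *EventRateLimiter) Allow(key string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := time.Now()
	if last, ok := l.lastSeen[key]; ok && now.Sub(last) < l.interval {
		return false
	}
	l.lastSeen[key] = now
	return true
}

// Remove resets a key on container removal or recovery.
func (l *EventRateLimiter) Remove(key string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	delete(l.lastSeen, key)
}

func main() {
	lim := NewEventRateLimiter(5 * time.Second)
	key := "crash-loop-app@die" // illustrative (container, action) key
	fmt.Println(lim.Allow(key)) // true: first event passes
	fmt.Println(lim.Allow(key)) // false: inside the 5 s window
}
```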
When publishCrashLoopAlarm fires, the container entity and its CRITICAL
alarm are published, but the health topic was not updated. This meant
the service list still showed the container's last status (often 'up'
from its most recent brief start) rather than reflecting the crash-loop
state.

In the publishing goroutine, after pre-registering the entity, now also
publish a retained health message with status=down to the container's
te/.../status/health topic. The next normal doUpdate() call will
overwrite this with the real runtime status once the container recovers.
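A sketch of that retained publish, assuming an eclipse/paho.mqtt.golang client; publishDownStatus is an illustrative name, and the topic shape follows the thin-edge.io health convention used elsewhere in this PR:

```go
package main

import (
	"fmt"
	"time"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

func publishDownStatus(client mqtt.Client, serviceName string) error {
	topic := fmt.Sprintf("te/device/main/service/%s/status/health", serviceName)
	payload := `{"status":"down"}`
	// retained=true so the broker keeps the last status for late
	// subscribers; the next doUpdate() overwrites it with the real
	// runtime status.
	tok := client.Publish(topic, 1, true, payload)
	if !tok.WaitTimeout(5 * time.Second) {
		return fmt.Errorf("publish timed out")
	}
	return tok.Error()
}
```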
Container events from Docker carry the raw container name (e.g.
"crash-loop-app-1") in Actor.Attributes["name"]. For compose
services the canonical thin-edge service name is "project@service"
(matching Container.GetName()), not the Docker-generated container name.
Using the raw name caused the crash-loop alarm, health status, and rate-
limiter to operate on a different key than the one used by doUpdate()
when registering and updating the service -- so the alarm landed on the
wrong entity (or an unregistered one).
Changes:
* serviceNameFromEventAttrs() (new helper, sketched below) derives the service name from
Actor.Attributes using the same logic as Container.GetName():
- If com.docker.compose.project and com.docker.compose.service are
both present, returns "project@service".
- Otherwise falls back to the raw "name" attribute (plain containers).
* Monitor(): all usages of evt.Actor.Attributes["name"] that feed into
restartTracker, eventLimiter, crashLoopAlarms, publishCrashLoopAlarm,
and clearCrashLoopAlarm are replaced with serviceNameFromEventAttrs().
* publishCrashLoopAlarm(): entity pre-registration now infers the correct
service type from the name -- names containing "@" are registered as
container-group, others as container.
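A minimal sketch of that helper as the list describes it (the body is inferred from the description above, not copied from the PR):

```go
package main

import "fmt"

// serviceNameFromEventAttrs derives the canonical thin-edge service
// name from a Docker event's Actor.Attributes, mirroring the logic of
// Container.GetName().
func serviceNameFromEventAttrs(attrs map[string]string) string {
	project := attrs["com.docker.compose.project"]
	service := attrs["com.docker.compose.service"]
	if project != "" && service != "" {
		return project + "@" + service // compose service: "project@service"
	}
	return attrs["name"] // plain container: raw Docker name
}

func main() {
	fmt.Println(serviceNameFromEventAttrs(map[string]string{
		"name":                       "crash-loop-app-1",
		"com.docker.compose.project": "crash-loop",
		"com.docker.compose.service": "app",
	})) // crash-loop@app
}
```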