fix: container crash loop protection #189

Open

reubenmiller wants to merge 6 commits into main from fix/crash-loop-protection

Conversation

@reubenmiller
Collaborator

No description provided.

A container in a crash loop (restart policy: always) generates a
continuous stream of start/die events.  Previously, each event spawned a
goroutine that called Update(), which blocked on the unbuffered
updateRequests channel.  Under a burst this caused:
  - an ever-growing pile of goroutines waiting to send
  - a proportional number of sequential doUpdate() calls, each making
    real network calls to the container daemon, tedge HTTP API and the
    Cumulocity proxy

Changes in this commit:

* pkg/app/debounce.go (new)
  - UpdateDebouncer: coalesces requests received within a 2-second quiet
    window into one (see the sketch after this list).  Scoped requests
    (by ID/Name) are merged by unioning their ID/Name sets; a full-scan
    request always supersedes any scoped one.
  - mergeFilterOptions / mergeRequests: reusable merge helpers.

* pkg/app/app.go
  - ActionRequest now carries an optional result chan<- error so each
    caller gets its own response; the shared updateResults channel (which
    had a latent deadlock for ActionUpdateMetrics) is removed.
  - sendResult() helper: delivers the worker result non-blocking.
  - worker() uses sendResult() and now correctly returns errors for
    ActionUpdateMetrics.
  - Update() / UpdateMetrics() create per-call result channels.
  - updateRequests is buffered (cap 8) so the debouncer dispatch and
    synchronous Update() callers never block on a temporarily busy
    worker.
  - Monitor() event goroutines replaced with debouncer.Enqueue() calls;
    the 500 ms pre-sleep is superseded by the 2-second debounce window.
  - Subscribe() callbacks use the debouncer instead of spawning
    goroutines that block on the old unbuffered channel.
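
A minimal sketch of the coalescing pattern behind UpdateDebouncer, assuming a
quiet-window timer that is restarted on every enqueue. The names here
(debouncer, request, Enqueue) are illustrative simplifications, not the
plugin's actual implementation:

```go
// Coalesce requests until no new request has arrived for a quiet window,
// then dispatch a single merged request.
package main

import (
	"fmt"
	"sync"
	"time"
)

type request struct {
	Names []string // empty means "full scan"
}

type debouncer struct {
	mu       sync.Mutex
	timer    *time.Timer
	pending  request
	quiet    time.Duration
	dispatch func(request)
}

func newDebouncer(quiet time.Duration, dispatch func(request)) *debouncer {
	return &debouncer{quiet: quiet, dispatch: dispatch}
}

func (d *debouncer) Enqueue(r request) {
	d.mu.Lock()
	defer d.mu.Unlock()

	switch {
	case d.timer == nil:
		// First request in this window: take it as-is.
		d.pending = r
	case len(r.Names) == 0 || len(d.pending.Names) == 0:
		// A full-scan request supersedes any scoped one.
		d.pending = request{}
	default:
		// Both scoped: union the name sets (duplicates tolerated here).
		d.pending.Names = append(d.pending.Names, r.Names...)
	}

	// Restart the quiet-window timer; dispatch fires only once no new
	// request has arrived for the whole window.
	if d.timer != nil {
		d.timer.Stop()
	}
	d.timer = time.AfterFunc(d.quiet, func() {
		d.mu.Lock()
		merged := d.pending
		d.timer = nil
		d.pending = request{}
		d.mu.Unlock()
		d.dispatch(merged)
	})
}

func main() {
	d := newDebouncer(2*time.Second, func(r request) {
		fmt.Println("dispatch, scoped names:", r.Names) // nil means full scan
	})
	// A burst of scoped requests collapses into a single dispatch
	// roughly 2 seconds after the last one arrives.
	d.Enqueue(request{Names: []string{"app1"}})
	d.Enqueue(request{Names: []string{"app2"}})
	time.Sleep(3 * time.Second)
}
```
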
A container in a crash loop with restart policy "always" can restart
tens of times per minute.  This commit adds observability for that
condition without requiring any changes to compose files.

Changes:

* pkg/app/restarttracker.go (new)
  - RestartTracker: thread-safe sliding-window restart counter (sketched
    after this list).
  - Record(name) → (count, exceeded): evicts events outside the window,
    records the new event, and returns whether count ≥ threshold.
  - Clear(name): resets history when a container recovers or is removed.

* pkg/app/app.go
  - App struct gains restartTracker (*RestartTracker),
    crashLoopAlarms (map[string]struct{}) + crashLoopAlarmsMu.
  - Initialised in NewApp() with a 60-second window and threshold of 5.
  - Monitor() ActionDie handler calls restartTracker.Record(); when
    exceeded it calls publishCrashLoopAlarm() which emits a CRITICAL
    alarm to te/<device>/service/<name>/a/ContainerCrashLoop.
    Duplicate alarms are suppressed until the container recovers.
  - Monitor() ActionHealthStatusHealthy handler calls Clear() +
    clearCrashLoopAlarm(), which publishes status=CLEARED to the same
    alarm topic.
  - Monitor() ActionDestroy/ActionRemove also call Clear() +
    clearCrashLoopAlarm() to tidy up when a container is deleted.
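
A minimal sketch of the sliding-window counter behind RestartTracker; the
internal layout and field names are assumptions, not the plugin's actual
code:

```go
// Count restarts per container name inside a sliding time window and
// report when a threshold is reached.
package main

import (
	"fmt"
	"sync"
	"time"
)

type restartTracker struct {
	mu        sync.Mutex
	window    time.Duration
	threshold int
	events    map[string][]time.Time
}

func newRestartTracker(window time.Duration, threshold int) *restartTracker {
	return &restartTracker{window: window, threshold: threshold, events: map[string][]time.Time{}}
}

// Record registers a restart for name, evicts events outside the window,
// and reports the in-window count and whether the threshold was reached.
func (t *restartTracker) Record(name string) (count int, exceeded bool) {
	t.mu.Lock()
	defer t.mu.Unlock()

	now := time.Now()
	cutoff := now.Add(-t.window)

	kept := t.events[name][:0]
	for _, ts := range t.events[name] {
		if ts.After(cutoff) {
			kept = append(kept, ts)
		}
	}
	kept = append(kept, now)
	t.events[name] = kept

	return len(kept), len(kept) >= t.threshold
}

// Clear drops all history for name, e.g. when the container recovers
// or is removed.
func (t *restartTracker) Clear(name string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.events, name)
}

func main() {
	tracker := newRestartTracker(60*time.Second, 5)
	for i := 0; i < 5; i++ {
		count, exceeded := tracker.Record("crash-loop-app")
		fmt.Println(count, exceeded) // exceeded becomes true on the 5th restart
	}
}
```
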
Two related problems arose when a container crash-looped with
restart policy "always":

1. Event storm: EnableEngineEvents published every start/die event
   synchronously with no throttling. At 10+ events/second this
   saturated the MQTT broker and starved all other MQTT operations
   (including the alarm publish itself).

2. Alarm never delivered: publishCrashLoopAlarm tried to publish to
   te/device/main/service/<name>/a/ContainerCrashLoop over the same
   already-saturated client. The 100 ms Publish() timeout fired
   immediately, the error handler un-marked the alarm, and the retry
   loop started again -- so the alarm never reached the broker.
   Additionally, the container service is typically not yet registered
   at the point the threshold is crossed (the 2-second debounced
   doUpdate() hasn't fired yet), so the mapper would have silently
   dropped it anyway.

Changes:

* pkg/app/eventlimiter.go (new)
  - EventRateLimiter: per-key rate limiter based on last-seen time
    (sketched after this list).
  - Allow(key) gating with configurable minimum interval.
  - Remove(key) to reset a key on container removal/recovery.

* pkg/app/app.go
  - App struct gains eventLimiter (*EventRateLimiter).
  - Initialised in NewApp() with a 5-second per-(container, action)
    window.
  - Monitor(): engine-event publish is now gated by three checks in
    order of precedence:
      1. inCrashLoop: suppress ALL events for the container (the
         alarm already signals the operator; no value in flooding).
      2. eventLimiter.Allow(key) returns false: rate-limited; in the
         normal case at most one event per (container, action) is
         published every 5 seconds.
      3. default: publish as before.
  - eventLimiter.Remove() called on ActionHealthStatusHealthy and
    ActionDestroy/ActionRemove so fresh events after recovery are not
    incorrectly suppressed.
  - publishCrashLoopAlarm() made fully asynchronous (goroutine):
    * Pre-registers the container entity via TedgeAPI.CreateEntity()
      (idempotent) so the mapper can route the alarm even when the
      container dies before doUpdate() fires.
    * Retries the Publish() call up to 5 times with linear back-off
      (500 ms, 1 s, 1.5 s, 2 s, 2.5 s).
    * Checks crashLoopAlarms before each retry to abort early if the
      container has already recovered/been removed.
    * Un-marks on exhaustion so the alarm can be re-raised once the
      broker recovers.
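
A minimal sketch of the two mechanisms described above: the per-key rate
limiter and the linear back-off retry around the alarm publish. All names
here (eventRateLimiter, publishWithRetry, and the callback parameters) are
illustrative assumptions, not the plugin's actual API:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// eventRateLimiter allows at most one event per key per interval.
type eventRateLimiter struct {
	mu       sync.Mutex
	interval time.Duration
	lastSeen map[string]time.Time
}

func newEventRateLimiter(interval time.Duration) *eventRateLimiter {
	return &eventRateLimiter{interval: interval, lastSeen: map[string]time.Time{}}
}

// Allow reports whether an event for key may be published now; if yes it
// records the time so further events within the interval are dropped.
func (l *eventRateLimiter) Allow(key string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := time.Now()
	if last, ok := l.lastSeen[key]; ok && now.Sub(last) < l.interval {
		return false
	}
	l.lastSeen[key] = now
	return true
}

// Remove forgets a key so fresh events after recovery are not suppressed.
func (l *eventRateLimiter) Remove(key string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	delete(l.lastSeen, key)
}

// publishWithRetry retries a failing publish up to 5 times with linear
// back-off (500 ms, 1 s, 1.5 s, ...), aborting early if the container
// has already recovered or been removed.
func publishWithRetry(publish func() error, stillInCrashLoop func() bool) error {
	var err error
	for attempt := 1; attempt <= 5; attempt++ {
		if !stillInCrashLoop() {
			return nil
		}
		if err = publish(); err == nil {
			return nil
		}
		time.Sleep(time.Duration(attempt) * 500 * time.Millisecond)
	}
	return err // exhausted: caller can un-mark so the alarm is re-raised later
}

func main() {
	limiter := newEventRateLimiter(5 * time.Second)
	key := "myproject@app/die"
	fmt.Println(limiter.Allow(key)) // true: first event passes
	fmt.Println(limiter.Allow(key)) // false: suppressed within 5 s

	err := publishWithRetry(
		func() error { return errors.New("broker busy") },
		func() bool { return true },
	)
	fmt.Println("publish result:", err)
}
```
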
When publishCrashLoopAlarm fires, the container entity and its CRITICAL
alarm are published, but the health topic was not updated. This meant the
service list still showed the container's last status (often 'up' from
its most recent brief start) rather than reflecting the crash-loop state.

In the publishing goroutine, after pre-registering the entity, now also
publish a retained health message with status=down to the container's
te/.../status/health topic. The next normal doUpdate() call will
overwrite this with the real runtime status once the container recovers.
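
A hedged sketch of the retained health publish, assuming an eclipse/paho MQTT
client and the default te/device/main/... topic scheme; the plugin's real
client wrapper, payload fields, and topic construction may differ:

```go
package main

import (
	"fmt"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

// publishDownStatus publishes a retained "down" health status for a service
// so the service list reflects the crash loop until the next doUpdate()
// overwrites it with the real runtime status.
func publishDownStatus(client mqtt.Client, serviceName string) error {
	topic := fmt.Sprintf("te/device/main/service/%s/status/health", serviceName)
	payload := `{"status":"down"}` // assumed minimal thin-edge health payload
	// retained=true so the broker keeps the last-known status for late subscribers.
	token := client.Publish(topic, 1, true, payload)
	token.Wait()
	return token.Error()
}

func main() {
	opts := mqtt.NewClientOptions().AddBroker("tcp://127.0.0.1:1883")
	client := mqtt.NewClient(opts)
	if token := client.Connect(); token.Wait() && token.Error() != nil {
		panic(token.Error())
	}
	if err := publishDownStatus(client, "myproject@app"); err != nil {
		panic(err)
	}
}
```
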
Container events from Docker carry the raw container name (e.g.
"crash-loop-app-1") in Actor.Attributes["name"].  For compose
services the canonical thin-edge service name is "project@service"
(matching Container.GetName()), not the Docker-generated container name.

Using the raw name caused the crash-loop alarm, health status, and rate-
limiter to operate on a different key than the one used by doUpdate()
when registering and updating the service -- so the alarm landed on the
wrong entity (or an unregistered one).

Changes:

* serviceNameFromEventAttrs() (new helper, sketched after this list)
  derives the service name from Actor.Attributes using the same logic as
  Container.GetName():
    - If com.docker.compose.project and com.docker.compose.service are
      both present, returns "project@service".
    - Otherwise falls back to the raw "name" attribute (plain containers).

* Monitor(): all usages of evt.Actor.Attributes["name"] that feed into
  restartTracker, eventLimiter, crashLoopAlarms, publishCrashLoopAlarm,
  and clearCrashLoopAlarm are replaced with serviceNameFromEventAttrs().

* publishCrashLoopAlarm(): entity pre-registration now infers the correct
  service type from the name -- names containing "@" are registered as
  container-group, others as container.
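
A minimal sketch of the name derivation described above; the exact label
handling in the real serviceNameFromEventAttrs() may differ:

```go
package main

import "fmt"

// serviceNameFromEventAttrs derives the thin-edge service name from Docker
// event attributes: compose containers map to "project@service", plain
// containers fall back to the Docker-assigned name.
func serviceNameFromEventAttrs(attrs map[string]string) string {
	project := attrs["com.docker.compose.project"]
	service := attrs["com.docker.compose.service"]
	if project != "" && service != "" {
		return project + "@" + service
	}
	return attrs["name"]
}

func main() {
	composeAttrs := map[string]string{
		"name":                       "crash-loop-app-1",
		"com.docker.compose.project": "myproject",
		"com.docker.compose.service": "app",
	}
	fmt.Println(serviceNameFromEventAttrs(composeAttrs)) // myproject@app

	plainAttrs := map[string]string{"name": "standalone"}
	fmt.Println(serviceNameFromEventAttrs(plainAttrs)) // standalone
}
```
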
@github-actions

github-actions Bot commented May 13, 2026

Robot Results

| ✅ Passed | ❌ Failed | ⏭️ Skipped | Total | Pass % | ⏱️ Duration |
| --- | --- | --- | --- | --- | --- |
| 40 | 0 | 0 | 40 | 100 | 10m3.333902s |

Passed Tests

Name ⏱️ Duration Suite
Get Container Logs 8.189 s Container-Logs
Get Container Logs with only last N lines 0.567 s Container-Logs
Get Container Logs By Operation 8.129 s Container-Logs
Remove Container 7.394 s Container-Remove
Remove Container Non Existent Container Should Not Through An Error 0.106 s Container-Remove
Restart Container 0.584 s Container-Restart
Restart Unknown Container Fails 0.103 s Container-Restart
Update to tedge-container-plugin-ng 25.355 s Installation
Check for Update 10.182 s Operations-Clone
Clone Existing Container 28.080 s Operations-Clone
Clone Existing Container by Timeout Whilst Waiting For Exit 18.450 s Operations-Clone
Clone Existing Container but Waiting For Exit 24.228 s Operations-Clone
Ignore Containers With Given Label 21.317 s Operations-Clone
Install/uninstall container-image 19.894 s Operations-Container-Image
Install/uninstall not existent container image 9.251 s Operations-Container-Image
Install/uninstall container package from private repository - credentials file 14.593 s Operations-Private-Registries
Install/uninstall container package from private repository - credentials script 14.459 s Operations-Private-Registries
Install/uninstall container package from private repository - credentials script with cache 21.557 s Operations-Private-Registries
Install/uninstall container package from private repository - engine credentials 17.988 s Operations-Private-Registries
Install/uninstall container package from private repository - docker from docker 38.093 s Operations-Private-Registries
Get Configuration 5.447 s Operations
Install/uninstall container-group package 22.790 s Operations
Install/uninstall container-group package with non-existent image 11.503 s Operations
Install invalid container-group 3.173 s Operations
Install container-group with multiple files - app1 12.687 s Operations
Install container-group with multiple files - app2 8.039 s Operations
Install/uninstall container package 20.420 s Operations
Install/uninstall container package from file 14.707 s Operations
Manual container creation/deletion 20.706 s Operations
Manual container creation/deletion with error on run 9.919 s Operations
Manual container created and then killed 15.070 s Operations
Remove Orphaned Cloud Services 20.432 s Operations
Remove Orphaned Cloud Services eventually if Cumulocity Proxy is Unavailable at deletion time 39.980 s Operations
Install container group that uses host volume mount 8.381 s Operations
Install container group with a container in a crash loop 11.264 s Operations
Self Update Is Present Using Self Type 0.182 s Self
Self Update Is Present Using Container Type 0.182 s Self
Self Update Is Not Present 0.187 s Self
Service status 18.741 s Telemetry-Main
Sends measurements 71.005 s Telemetry-Main
