
setup-policy-routes start: infinite sysfs wait loop causes unbounded process accumulation on ECS hosts #148

@ddermendzhiev

Description


Package: amazon-ec2-net-utils 2.7.1-1.amzn2023.0.1 (bug confirmed present in main / 2.7.2)

OS: Amazon Linux 2023

Upstream: https://github.com/amazonlinux/amazon-ec2-net-utils

Severity: High. Causes 100% CPU saturation on long-running ECS hosts running IDS/IPS software

Discovered: 2026-03-27

Contributors: Dinko Dermendzhiev, William Pharr, Jonathan Clark


Summary

setup-policy-routes start contains an infinite sleep 0.1 loop waiting for an ENI's sysfs node to appear (bin/setup-policy-routes.sh#L53-L59). If the ENI is detached before the sysfs node appears, the process loops forever holding a per-ENI lockfile. Any refresh invocation for the same ENI spins for up to 1000 seconds trying to acquire the lock, then exits, but the per-ENI timer immediately fires a new refresh, repeating the cycle.

Over time, every ECS task lifecycle event that races start against ENI detach adds a permanently stuck process. On a host with heavy ECS task churn this accumulates to hundreds or thousands of processes, growing linearly with the number of failed ECS tasks.

The stuck processes themselves are not CPU-intensive, as each is blocked in sleep 0.1, but they generate a continuous stream of syscalls (at least one per process every 100ms). eBPF-based security sensors that instrument the kernel at the syscall level intercept these events, and at sufficient process counts the sensor's processing pipelines are fully saturated by this legitimate (from the sensor's perspective) syscall telemetry. The sensor ends up consuming 80–100% CPU doing exactly what it was designed to do, thus masking the root cause.


Affected Files

  • bin/setup-policy-routes.sh (start wait loop, refresh handling)
  • lib/lib.sh (register_networkd_reloader locking)
  • udev/99-vpc-policy-routes.rules (remove-event cleanup)

The Bug

1. Infinite sysfs wait loop (bin/setup-policy-routes.sh#L52-L59)

start)
    register_networkd_reloader   # acquires per-ENI lockfile
    counter=0
    while [ ! -e "/sys/class/net/${iface}" ]; do
        if ((counter % 1000 == 0)); then
            debug "Waiting for sysfs node to exist for ${iface} (iteration $counter)"
        fi
        sleep 0.1
        ((counter++))
    done
    /lib/systemd/systemd-networkd-wait-online -i "$iface"
    do_setup
    ;;

No timeout. If the ENI is detached before the sysfs node appears, this loop runs indefinitely at 0.1s intervals, holding the lockfile for the lifetime of the host.

2. Lock never released (lib/lib.sh#L628-L664)

register_networkd_reloader() acquires a per-ENI lockfile at /run/amazon-ec2-net-utils/setup-policy-routes/<iface> using noclobber (set globally at the top of the script via set -eo pipefail -o noclobber -o nounset):

register_networkd_reloader() {
    local -i registered=1 cnt=0
    local -i max=10000
    local -r lockfile="${lockdir}/${iface}"
    ...
    while [ $cnt -lt $max ]; do
        cnt+=1
        2>/dev/null echo $$ > "${lockfile}"   # fails if file exists (noclobber)
        registered=$?
        [ $registered -eq 0 ] && break
        sleep 0.1                              # 10,000 * 0.1s = up to 1000 seconds
        if (( $cnt % 100 == 0 )); then
            debug "Unable to lock ${iface} after ${cnt} tries."
        fi
    done
    if [ $registered -ne 0 ]; then
        error "Unable to lock configuration for ${iface}. Check pid $(cat "${lockfile}")"
        exit 1   # ← exits after ~1000s, but the timer immediately fires a new refresh
    fi
}

The stuck start process holds the lock indefinitely. There is no check whether the PID in the lockfile is still alive. A kill -0 $lock_pid check would allow recovery from a dead lock owner.

3. refresh cycle (bin/setup-policy-routes.sh#L44-L49)

refresh)
    register_networkd_reloader
    [ -e "/sys/class/net/${iface}" ] || exit 0   # exits immediately if ENI is gone
    do_setup
    ;;

refresh exits immediately if the ENI no longer exists in sysfs, but only after it acquires the lock. Because start holds the lock, refresh spins for up to 1000 seconds in register_networkd_reloader, then calls exit 1. The per-ENI timer (refresh-policy-routes@<eni>.timer, firing every 60s) immediately spawns a new refresh, which spins again. This produces a continuous stream of spinning processes per stuck ENI.

4. udev remove event does not fire reliably for ECS ENIs

The udev rule (udev/99-vpc-policy-routes.rules) calls systemctl disable --now on both the timer and service on clean ENI removal:

SUBSYSTEM=="net", ACTION=="remove", ..., RUN+="/usr/bin/systemctl disable --now refresh-policy-routes@$name.timer policy-routes@$name.service"

On clean detach this would clean up correctly. The bug occurs because ECS ENI detach does not reliably produce a udev remove event before the sysfs node disappears, leaving start stuck in the wait loop with no cleanup path.


Trigger Condition

  1. ECS attaches ENI → udev add fires → policy-routes@<eni>.service starts → setup-policy-routes <eni> start
  2. start acquires lockfile, enters infinite sysfs wait loop
  3. ECS task fails → ENI detached → sysfs node never appears or disappears mid-loop
  4. udev remove event does not fire (or fires after start is already stuck) → no cleanup
  5. start loops forever, holding the lockfile
  6. refresh-policy-routes@<eni>.timer fires → refresh spins ~1000s trying to acquire lock → exit 1 → timer fires again → repeat

We observed this sequence when:

  • ECS task health check failures causing repeated task replacement

These scenarios may also trigger it (unconfirmed):

  • Rapid ECS deployments (rolling updates, blue-green)
  • High-frequency autoscaling events

Evidence From Affected Hosts

Many ECS hosts are confirmed affected; two representative examples:

| Host | Uptime | Stuck ENIs | Peak Processes | Peak Load Avg |
| --- | --- | --- | --- | --- |
| host-A | 9 days | 112 | ~214 | ~107 |
| host-B | 14 days | 766 | 1787+ | 414 |

# All ENIs confirmed missing from sysfs
ps aux | grep "setup-policy-routes" | grep -v grep | awk '{print $(NF-1)}' | sort -u | while read iface; do
  [ -e "/sys/class/net/$iface" ] && echo "EXISTS: $iface" || echo "MISSING: $iface"
done
# Result: ALL MISSING

# start processes own locks, refresh processes are waiting
ps -eo pid,cmd | grep setup-policy-routes | grep -v grep | while read pid cmd; do
  iface=$(echo "$cmd" | awk '{print $(NF-1)}')
  action=$(echo "$cmd" | awk '{print $NF}')
  lockfile="/run/amazon-ec2-net-utils/setup-policy-routes/$iface"
  lock_pid=$(cat "$lockfile" 2>/dev/null)
  echo "$action $iface lock_owner=$lock_pid this_pid=$pid $([ "$lock_pid" = "$pid" ] && echo OWNS || echo WAITING)"
done | sort | head -20

# systemd unit count
systemctl list-units 'policy-routes@*' --no-legend | wc -l

Systemd journal (logged every 1000 iterations, i.e. every ~100 seconds per stuck process):

Mar 27 16:21:57 ec2net[3312864]: Waiting for sysfs node to exist for ecse1a2b3c (iteration 0)
Mar 27 16:23:44 ec2net[3312864]: Waiting for sysfs node to exist for ecse1a2b3c (iteration 1000)
Mar 27 16:25:31 ec2net[3312864]: Waiting for sysfs node to exist for ecse1a2b3c (iteration 2000)
[repeating indefinitely]
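The per-interface debug lines make the journal itself usable as a census of stuck ENIs. The pipeline below is demonstrated against the excerpt above via a here-doc; on a live host, feed it `journalctl -t ec2net --no-pager` instead (the "ec2net" identifier matches the excerpt; adjust if your units log under a different name):

```shell
# Count distinct interfaces that have logged the sysfs wait message.
count=$(grep -o 'exist for [a-z0-9]*' <<'EOF' | awk '{print $3}' | sort -u | wc -l
Mar 27 16:21:57 ec2net[3312864]: Waiting for sysfs node to exist for ecse1a2b3c (iteration 0)
Mar 27 16:23:44 ec2net[3312864]: Waiting for sysfs node to exist for ecse1a2b3c (iteration 1000)
EOF
)
echo "stuck interfaces: ${count}"
```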

Impact

  • Direct: Load average 414 on a host with 8 vCPUs. Host effectively unusable.
  • Indirect: Each stuck process issues a nanosleep syscall every 100ms. eBPF-based security sensors instrument the kernel at the syscall level and intercept every one of these events. At sufficient process counts the sensor's kernel probe handlers and userspace event pipelines are fully saturated processing what is, from the sensor's perspective, legitimate telemetry. The symptom presents as the security sensor consuming 80%+ CPU while doing exactly what it was designed to do, masking the root cause.
  • Silent accumulation: Count grows with uptime × ECS deploy frequency. A host may take days or weeks to saturate. By the time CPU spikes, hundreds of units are stuck.

Proposed Fix

Fix 1 (primary): Add timeout to sysfs wait loop -> bin/setup-policy-routes.sh#L52

This is the root cause fix. Without it, Fix 2 alone has no effect because the stuck start process is alive. Its lock is not stale, so the dead-lock check in register_networkd_reloader never triggers.

A timeout of 5 minutes (max_wait=3000, i.e. 3000 × 0.1s) is conservative enough to not false-positive on a slow or congested host while still bounding accumulation to one stuck process per ENI rather than an indefinitely running one:

start)
    register_networkd_reloader
    declare -i counter=0
    declare -i max_wait=3000  # 5 minute timeout (3000 * 0.1s)
    while [ ! -e "/sys/class/net/${iface}" ]; do
        if ((counter % 1000 == 0)); then
            debug "Waiting for sysfs node to exist for ${iface} (iteration $counter)"
        fi
        sleep 0.1
        counter=$((counter + 1))   # avoid ((counter++)): under set -e it aborts when counter is 0
        if ((counter >= max_wait)); then
            error "Timed out waiting for sysfs node for ${iface} after ${counter} iterations, giving up"
            exit 1
        fi
    done
    ...
    ;;

Note: the timeout value is a judgment call. Five minutes is generous; on a healthy host the sysfs node appears in milliseconds. There may be a documented SLA for how quickly a newly attached ENI appears in sysfs; if the upstream maintainers have data suggesting shorter is safe, a tighter value is fine.

Fix 2 (secondary): Deadlock detection in register_networkd_reloader -> lib/lib.sh#L628

After Fix 1, start exits on timeout, but it holds the lockfile until exit. A refresh that was already mid-spin waiting for the lock may then acquire it and run do_setup for a non-existent ENI. Adding a stale-lock check lets any subsequent invocation recover immediately rather than inheriting the full spin period:

# Check if existing lock owner is still alive; if not, remove stale lock
local -r lockfile="${lockdir}/${iface}"
if [ -f "${lockfile}" ]; then
    existing_pid=$(cat "${lockfile}" 2>/dev/null)
    if [ -n "$existing_pid" ] && ! kill -0 "$existing_pid" 2>/dev/null; then
        debug "Removing stale lock from dead process $existing_pid for ${iface}"
        rm -f "${lockfile}"
    fi
fi
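A self-contained demonstration of that check, using a throwaway lockdir (paths and the sleep trick are ours, purely illustrative): a short-lived child stands in for the dead lock owner, and its PID is written to the lockfile after it has exited.

```shell
lockdir=$(mktemp -d)
lockfile="${lockdir}/eni-demo"

sleep 0.01 & dead_pid=$!
wait "$dead_pid"                       # child is reaped; its PID is now dead
echo "$dead_pid" > "${lockfile}"

result="kept"
existing_pid=$(cat "${lockfile}" 2>/dev/null)
if [ -n "$existing_pid" ] && ! kill -0 "$existing_pid" 2>/dev/null; then
    rm -f "${lockfile}"                # owner is gone: treat the lock as stale
    result="removed"
fi
echo "stale lock ${result}"
rm -rf "${lockdir}"
```

One caveat worth noting upstream: kill -0 also fails with EPERM for a live process owned by another user, so the check is only reliable when all lock owners run as the same user (they do here, as root under systemd).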

Temporary Workaround (for affected running hosts)

Does not persist across reboots; replace the instance for a permanent fix.

pkill alone does not work: systemd respawns the processes immediately (Restart=on-failure on policy-routes@.service, per-ENI timers on refresh-policy-routes@.timer). The units must be stopped and masked via systemd.

# 1. Stop services and timers
systemctl stop 'policy-routes@*.service'
systemctl stop 'refresh-policy-routes@*.service'
systemctl stop 'refresh-policy-routes@*.timer'

# 2. Mask to prevent respawn of known units
systemctl mask 'policy-routes@*.service' 2>/dev/null || \
  systemctl list-units 'policy-routes@*.service' --no-legend | awk '{print $1}' | xargs systemctl mask
systemctl mask 'refresh-policy-routes@*.service' 2>/dev/null || \
  systemctl list-units 'refresh-policy-routes@*.service' --no-legend | awk '{print $1}' | xargs systemctl mask
systemctl mask 'refresh-policy-routes@*.timer' 2>/dev/null || \
  systemctl list-units 'refresh-policy-routes@*.timer' --no-legend | awk '{print $1}' | xargs systemctl mask

# 3. Verify
ps aux | grep setup-policy-routes | grep -v grep | wc -l   # should be 0

# WARNING: new ECS tasks deployed after masking will spawn new unmasked units.
# Do NOT mask the templates (policy-routes@.service etc.) as that disables ENI
# routing for all new ECS tasks.

Note: refresh-policy-routes@.service has SuccessExitStatus=SIGTERM, so systemctl stop (which sends SIGTERM) exits cleanly.

After masking, you will still find a high unit count. Masked/stopped units remain as inactive records in systemd state. The count reflects accumulated history, not active processes. To zero it: replace the instance.


Detection

On-instance:

systemctl list-units 'policy-routes@*' --no-legend | wc -l
# Healthy = one unit per attached task ENI (1 with a single active awsvpc task).
# A count well above the number of attached ENIs = accumulation in progress.

Fleet-wide (via SSM Run Command or equivalent remote execution):

# 1. Check unit count per host (>1 = accumulation in progress)
systemctl list-units 'policy-routes@*' --no-legend | wc -l

# 2. Confirm stuck processes are looping against missing ENIs
ps aux | grep "setup-policy-routes" | grep -v grep | awk '{print $(NF-1)}' | sort -u | while read iface; do
  [ -e "/sys/class/net/$iface" ] && echo "EXISTS: $iface" || echo "MISSING: $iface"
done

# 3. Check host load average
uptime
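The first check can be wrapped into a one-shot script for remote execution. This is a hedged sketch, not part of the package: the >1 threshold assumes at most one active awsvpc task ENI per host, so raise it for multi-task hosts.

```shell
# Print WARN/OK based on the policy-routes@ unit count.
units=$(systemctl list-units 'policy-routes@*' --no-legend 2>/dev/null | wc -l)
units=${units:-0}
if [ "$units" -gt 1 ]; then
    echo "WARN: ${units} policy-routes@ units; accumulation likely"
else
    echo "OK: ${units} policy-routes@ unit(s)"
fi
```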

Thresholds observed:

  • 112 units → load avg ~107, IDS/IPS agent at 80% CPU (9 days uptime)
  • 766 units → load avg 414, IDS/IPS agent at 80% CPU (14 days uptime)
