
setup-policy-routes start: infinite sysfs wait loop causes unbounded process accumulation on ECS hosts #148

@ddermendzhiev

Description


Package: amazon-ec2-net-utils 2.7.1-1.amzn2023.0.1 (bug confirmed present in main / 2.7.2)

OS: Amazon Linux 2023

Upstream: https://github.com/amazonlinux/amazon-ec2-net-utils

Severity: High. Causes 100% CPU saturation on long-running ECS hosts running IDS/IPS software

Discovered: 2026-03-27

Contributors: Dinko Dermendzhiev, William Pharr, Jonathan Clark


Summary

setup-policy-routes start contains an infinite sleep 0.1 loop waiting for an ENI's sysfs node to appear (bin/setup-policy-routes.sh#L53-L59). If the ENI is detached before the sysfs node appears, the process loops forever holding a per-ENI lockfile. Any refresh invocation for the same ENI spins for up to 1000 seconds trying to acquire the lock, then exits, but the per-ENI timer immediately fires a new refresh, repeating the cycle.

Over time, every ECS task lifecycle event that races start against ENI detach adds a permanently stuck process. On a host with heavy ECS task churn this accumulates to hundreds or thousands of processes, growing linearly with the number of failed ECS tasks.

The stuck processes themselves are not CPU-intensive, as each is blocked in sleep 0.1, but they generate a continuous stream of syscalls (at least one per process every 100ms). eBPF-based security sensors that instrument the kernel at the syscall level intercept these events, and at sufficient process counts the sensor's processing pipelines are fully saturated by this legitimate (from the sensor's perspective) syscall telemetry. The sensor ends up consuming 80–100% CPU doing exactly what it was designed to do, thus masking the root cause.


Affected Files

  • bin/setup-policy-routes.sh (start wait loop, refresh handling)
  • lib/lib.sh (register_networkd_reloader locking)
  • udev/99-vpc-policy-routes.rules (remove-event cleanup)

The Bug

1. Infinite sysfs wait loop (bin/setup-policy-routes.sh#L52-L59)

start)
    register_networkd_reloader   # acquires per-ENI lockfile
    counter=0
    while [ ! -e "/sys/class/net/${iface}" ]; do
        if ((counter % 1000 == 0)); then
            debug "Waiting for sysfs node to exist for ${iface} (iteration $counter)"
        fi
        sleep 0.1
        ((counter++))
    done
    /lib/systemd/systemd-networkd-wait-online -i "$iface"
    do_setup
    ;;

No timeout. If the ENI is detached before the sysfs node appears, this loop runs indefinitely at 0.1s intervals, holding the lockfile for the lifetime of the host.

2. Lock never released (lib/lib.sh#L628-L664)

register_networkd_reloader() acquires a per-ENI lockfile at /run/amazon-ec2-net-utils/setup-policy-routes/<iface> using noclobber (set globally at the top of the script via set -eo pipefail -o noclobber -o nounset):

register_networkd_reloader() {
    local -i registered=1 cnt=0
    local -i max=10000
    local -r lockfile="${lockdir}/${iface}"
    ...
    while [ $cnt -lt $max ]; do
        cnt+=1
        2>/dev/null echo $$ > "${lockfile}"   # fails if file exists (noclobber)
        registered=$?
        [ $registered -eq 0 ] && break
        sleep 0.1                              # 10,000 * 0.1s = up to 1000 seconds
        if (( $cnt % 100 == 0 )); then
            debug "Unable to lock ${iface} after ${cnt} tries."
        fi
    done
    if [ $registered -ne 0 ]; then
        error "Unable to lock configuration for ${iface}. Check pid $(cat "${lockfile}")"
        exit 1   # ← exits after ~1000s, but the timer immediately fires a new refresh
    fi
}

The stuck start process holds the lock indefinitely. There is no check whether the PID in the lockfile is still alive. A kill -0 $lock_pid check would allow recovery from a dead lock owner.

3. refresh cycle (bin/setup-policy-routes.sh#L44-L49)

refresh)
    register_networkd_reloader
    [ -e "/sys/class/net/${iface}" ] || exit 0   # exits immediately if ENI is gone
    do_setup
    ;;

refresh exits immediately if the ENI no longer exists in sysfs, but only after it acquires the lock. Because start holds the lock, refresh spins for up to 1000 seconds in register_networkd_reloader, then calls exit 1. The per-ENI timer (refresh-policy-routes@<eni>.timer, firing every 60s) immediately spawns a new refresh, which spins again. This produces a continuous stream of spinning processes per stuck ENI.

4. udev remove event does not fire reliably for ECS ENIs

The udev rule (udev/99-vpc-policy-routes.rules) calls systemctl disable --now on both the timer and service on clean ENI removal:

SUBSYSTEM=="net", ACTION=="remove", ..., RUN+="/usr/bin/systemctl disable --now refresh-policy-routes@$name.timer policy-routes@$name.service"

On clean detach this would clean up correctly. The bug occurs because ECS ENI detach does not reliably produce a udev remove event before the sysfs node disappears, leaving start stuck in the wait loop with no cleanup path.


Trigger Condition

  1. ECS attaches ENI → udev add fires → policy-routes@<eni>.service starts → setup-policy-routes <eni> start
  2. start acquires lockfile, enters infinite sysfs wait loop
  3. ECS task fails → ENI detached → sysfs node never appears or disappears mid-loop
  4. udev remove event does not fire (or fires after start is already stuck) → no cleanup
  5. start loops forever, holding the lockfile
  6. refresh-policy-routes@<eni>.timer fires → refresh spins ~1000s trying to acquire lock → exit 1 → timer fires again → repeat

We observed this sequence when:

  • ECS task health check failures causing repeated task replacement

These scenarios may also trigger it (unconfirmed):

  • Rapid ECS deployments (rolling updates, blue-green)
  • High-frequency autoscaling events

Evidence From Affected Hosts

Many ECS hosts are confirmed affected; two representative examples:

| Host | Uptime | Stuck ENIs | Peak Processes | Peak Load Avg |
| --- | --- | --- | --- | --- |
| host-A | 9 days | 112 | ~214 | ~107 |
| host-B | 14 days | 766 | 1787+ | 414 |

# All ENIs confirmed missing from sysfs
ps aux | grep "setup-policy-routes" | grep -v grep | awk '{print $(NF-1)}' | sort -u | while read iface; do
  [ -e "/sys/class/net/$iface" ] && echo "EXISTS: $iface" || echo "MISSING: $iface"
done
# Result: ALL MISSING

# start processes own locks, refresh processes are waiting
ps -eo pid,cmd | grep setup-policy-routes | grep -v grep | while read pid cmd; do
  iface=$(echo "$cmd" | awk '{print $(NF-1)}')
  action=$(echo "$cmd" | awk '{print $NF}')
  lockfile="/run/amazon-ec2-net-utils/setup-policy-routes/$iface"
  lock_pid=$(cat "$lockfile" 2>/dev/null)
  echo "$action $iface lock_owner=$lock_pid this_pid=$pid $([ "$lock_pid" = "$pid" ] && echo OWNS || echo WAITING)"
done | sort | head -20

# systemd unit count
systemctl list-units 'policy-routes@*' --no-legend | wc -l

Systemd journal (logged every 1000 iterations, i.e. every ~100 seconds per stuck process):

Mar 27 16:21:57 ec2net[3312864]: Waiting for sysfs node to exist for ecse1a2b3c (iteration 0)
Mar 27 16:23:44 ec2net[3312864]: Waiting for sysfs node to exist for ecse1a2b3c (iteration 1000)
Mar 27 16:25:31 ec2net[3312864]: Waiting for sysfs node to exist for ecse1a2b3c (iteration 2000)
[repeating indefinitely]
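The per-interface debug lines make the journal itself usable as a census of stuck ENIs. The pipeline below is demonstrated against the excerpt above via a here-doc; on a live host, feed it `journalctl -t ec2net --no-pager` instead (the "ec2net" identifier matches the excerpt; adjust if your units log under a different name):

```shell
# Count distinct interfaces that have logged the sysfs wait message.
count=$(grep -o 'exist for [a-z0-9]*' <<'EOF' | awk '{print $3}' | sort -u | wc -l
Mar 27 16:21:57 ec2net[3312864]: Waiting for sysfs node to exist for ecse1a2b3c (iteration 0)
Mar 27 16:23:44 ec2net[3312864]: Waiting for sysfs node to exist for ecse1a2b3c (iteration 1000)
EOF
)
echo "stuck interfaces: ${count}"
```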

Impact

  • Direct: Load average 414 on a host with 8 vCPUs. Host effectively unusable.
  • Indirect: Each stuck process issues a nanosleep syscall every 100ms. eBPF-based security sensors instrument the kernel at the syscall level and intercept every one of these events. At sufficient process counts the sensor's kernel probe handlers and userspace event pipelines are fully saturated processing what is, from the sensor's perspective, legitimate telemetry. The symptom presents as the security sensor consuming 80%+ CPU while doing exactly what it was designed to do, masking the root cause.
  • Silent accumulation: Count grows with uptime × ECS deploy frequency. A host may take days or weeks to saturate. By the time CPU spikes, hundreds of units are stuck.

Proposed Fix

Fix 1 (primary): Add timeout to sysfs wait loop -> bin/setup-policy-routes.sh#L52

This is the root cause fix. Without it, Fix 2 alone has no effect because the stuck start process is alive. Its lock is not stale, so the dead-lock check in register_networkd_reloader never triggers.

A timeout of 5 minutes (max_wait=3000, i.e. 3000 × 0.1s) is conservative enough to not false-positive on a slow or congested host while still bounding accumulation to one stuck process per ENI rather than an indefinitely running one:

start)
    register_networkd_reloader
    declare -i counter=0
    declare -i max_wait=3000  # 5 minute timeout (3000 * 0.1s)
    while [ ! -e "/sys/class/net/${iface}" ]; do
        if ((counter % 1000 == 0)); then
            debug "Waiting for sysfs node to exist for ${iface} (iteration $counter)"
        fi
        sleep 0.1
        counter=$((counter + 1))   # avoid ((counter++)): under set -e it aborts when counter is 0
        if ((counter >= max_wait)); then
            error "Timed out waiting for sysfs node for ${iface} after ${counter} iterations, giving up"
            exit 1
        fi
    done
    ...
    ;;

Note: the timeout value is a judgment call. Five minutes is generous; on a healthy host the sysfs node appears in milliseconds. There may be a documented SLA for how quickly a newly attached ENI appears in sysfs; if the upstream maintainers have data suggesting shorter is safe, a tighter value is fine.

Fix 2 (secondary): Deadlock detection in register_networkd_reloader -> lib/lib.sh#L628

After Fix 1, start exits on timeout, but it holds the lockfile until exit. A refresh that was already mid-spin waiting for the lock may then acquire it and run do_setup for a non-existent ENI. Adding a stale-lock check lets any subsequent invocation recover immediately rather than inheriting the full spin period:

# Check if existing lock owner is still alive; if not, remove stale lock
local -r lockfile="${lockdir}/${iface}"
if [ -f "${lockfile}" ]; then
    existing_pid=$(cat "${lockfile}" 2>/dev/null)
    if [ -n "$existing_pid" ] && ! kill -0 "$existing_pid" 2>/dev/null; then
        debug "Removing stale lock from dead process $existing_pid for ${iface}"
        rm -f "${lockfile}"
    fi
fi
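A self-contained demonstration of that check, using a throwaway lockdir (paths and the sleep trick are ours, purely illustrative): a short-lived child stands in for the dead lock owner, and its PID is written to the lockfile after it has exited.

```shell
lockdir=$(mktemp -d)
lockfile="${lockdir}/eni-demo"

sleep 0.01 & dead_pid=$!
wait "$dead_pid"                       # child is reaped; its PID is now dead
echo "$dead_pid" > "${lockfile}"

result="kept"
existing_pid=$(cat "${lockfile}" 2>/dev/null)
if [ -n "$existing_pid" ] && ! kill -0 "$existing_pid" 2>/dev/null; then
    rm -f "${lockfile}"                # owner is gone: treat the lock as stale
    result="removed"
fi
echo "stale lock ${result}"
rm -rf "${lockdir}"
```

One caveat worth noting upstream: kill -0 also fails with EPERM for a live process owned by another user, so the check is only reliable when all lock owners run as the same user (they do here, as root under systemd).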

Temporary Workaround (for affected running hosts)

Does not persist across reboots; replace the instance for a permanent fix.

pkill alone does not work: systemd respawns the processes immediately (Restart=on-failure on policy-routes@.service, per-ENI timers on refresh-policy-routes@.timer). The units must be stopped and masked via systemd.

# 1. Stop services and timers
systemctl stop 'policy-routes@*.service'
systemctl stop 'refresh-policy-routes@*.service'
systemctl stop 'refresh-policy-routes@*.timer'

# 2. Mask to prevent respawn of known units
systemctl mask 'policy-routes@*.service' 2>/dev/null || \
  systemctl list-units 'policy-routes@*.service' --no-legend | awk '{print $1}' | xargs systemctl mask
systemctl mask 'refresh-policy-routes@*.service' 2>/dev/null || \
  systemctl list-units 'refresh-policy-routes@*.service' --no-legend | awk '{print $1}' | xargs systemctl mask
systemctl mask 'refresh-policy-routes@*.timer' 2>/dev/null || \
  systemctl list-units 'refresh-policy-routes@*.timer' --no-legend | awk '{print $1}' | xargs systemctl mask

# 3. Verify
ps aux | grep setup-policy-routes | grep -v grep | wc -l   # should be 0

# WARNING: new ECS tasks deployed after masking will spawn new unmasked units.
# Do NOT mask the templates (policy-routes@.service etc.) as that disables ENI
# routing for all new ECS tasks.

Note: refresh-policy-routes@.service has SuccessExitStatus=SIGTERM, so systemctl stop (which sends SIGTERM) exits cleanly.

After masking, you will still find a high unit count. Masked/stopped units remain as inactive records in systemd state. The count reflects accumulated history, not active processes. To zero it: replace the instance.


Detection

On-instance:

systemctl list-units 'policy-routes@*' --no-legend | wc -l
# Healthy = one unit per attached task ENI (1 with a single active awsvpc task).
# A count well above the number of attached ENIs = accumulation in progress.

Fleet-wide (via SSM Run Command or equivalent remote execution):

# 1. Check unit count per host (>1 = accumulation in progress)
systemctl list-units 'policy-routes@*' --no-legend | wc -l

# 2. Confirm stuck processes are looping against missing ENIs
ps aux | grep "setup-policy-routes" | grep -v grep | awk '{print $(NF-1)}' | sort -u | while read iface; do
  [ -e "/sys/class/net/$iface" ] && echo "EXISTS: $iface" || echo "MISSING: $iface"
done

# 3. Check host load average
uptime
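The first check can be wrapped into a one-shot script for remote execution. This is a hedged sketch, not part of the package: the >1 threshold assumes at most one active awsvpc task ENI per host, so raise it for multi-task hosts.

```shell
# Print WARN/OK based on the policy-routes@ unit count.
units=$(systemctl list-units 'policy-routes@*' --no-legend 2>/dev/null | wc -l)
units=${units:-0}
if [ "$units" -gt 1 ]; then
    echo "WARN: ${units} policy-routes@ units; accumulation likely"
else
    echo "OK: ${units} policy-routes@ unit(s)"
fi
```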

Thresholds observed:

  • 112 units → load avg ~107, IDS/IPS agent at 80% CPU (9 days uptime)
  • 766 units → load avg 414, IDS/IPS agent at 80% CPU (14 days uptime)
