Test Date: 2026-03-12
| Property | Value |
| --- | --- |
| Cluster | EKS 1.31 |
| Node OS | Amazon Linux 2023, Amazon Linux 2, Bottlerocket |
| NMA Version | Latest (installed via EKS add-on) |
Kernel

| Simulation | Node | Result | NMA Detection | Notes |
| --- | --- | --- | --- | --- |
| zombies | ip-10-0-5-217 | ✅ PASS | Warning Event | Created 25 zombies, requires --keep-alive 30m |
| kernel-bug | Multiple | ✅ PASS | Warning Event | Dmesg injection works, pattern: [timestamp] BUG: message |
| soft-lockup | Multiple | ✅ PASS | Warning Event | Dmesg injection works, pattern includes process name |
| pid-exhaustion | ip-10-0-3-62 | ✅ PASS | Pending | Achieved 74% PID usage, requires --keep-alive 30m |
| fork-oom | ip-10-0-4-11 | ✅ PASS | KernelReady=False | WARNING: Node became unrecoverable, required deletion |
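
The dmesg pattern noted for kernel-bug, [timestamp] BUG: message, can be illustrated with a small Python sketch. It only formats and recognizes such a line; it does not inject anything into the kernel log. The helper names and the regular expression are hypothetical, the timestamp layout assumes the kernel's usual seconds.microseconds prefix, and the matcher NMA actually uses may differ.

```python
import re
from typing import Optional

# Assumed shape of a dmesg line: "[seconds.microseconds] BUG: message".
# Illustrative matcher only; this is not the expression NMA uses.
BUG_LINE = re.compile(r"^\[\s*\d+\.\d{6}\]\s+BUG: (?P<message>.+)$")


def format_bug_line(uptime_us: int, message: str) -> str:
    """Build a dmesg-style line in the '[timestamp] BUG: message' form."""
    seconds, micros = divmod(uptime_us, 1_000_000)
    return f"[{seconds:5d}.{micros:06d}] BUG: {message}"


def extract_bug_message(line: str) -> Optional[str]:
    """Return the message of a BUG line, or None if the line does not match."""
    match = BUG_LINE.match(line)
    return match.group("message") if match else None
```

A line built with format_bug_line round-trips through extract_bug_message, while a line carrying any other prefix returns None.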
Networking

| Simulation | Node | Result | NMA Detection | Notes |
| --- | --- | --- | --- | --- |
| ipamd-down | Multiple | ✅ PASS | NetworkingReady=False | Kills aws-k8s-agent process |
| interface-down | Multiple | ✅ PASS | NetworkingReady=False | ip link set eth1 down |
Storage

| Simulation | Node | Result | NMA Detection | Notes |
| --- | --- | --- | --- | --- |
| io-delay | ip-10-0-5-78 | ✅ PASS | Pending | Worker started, NMA checks every 10 min, requires --keep-alive 15m |
Runtime

| Simulation | Node | Result | NMA Detection | Notes |
| --- | --- | --- | --- | --- |
| systemd-restarts | ip-10-0-5-79 | ⚠️ PARTIAL | NRestarts=1 | Script killed on pod exit; requires --keep-alive 10m |
Accelerator

| Simulation | Node | Result | NMA Detection | Notes |
| --- | --- | --- | --- | --- |
| neuron-sram-error | Neuron nodes | ✅ PASS | AcceleratedHardwareReady=False | Dmesg injection |
| neuron-nc-error | Neuron nodes | ✅ PASS | AcceleratedHardwareReady=False | Dmesg injection |
| neuron-hbm-error | Neuron nodes | ✅ PASS | AcceleratedHardwareReady=False | Dmesg injection |
| neuron-dma-error | Neuron nodes | ✅ PASS | AcceleratedHardwareReady=False | Dmesg injection |
| xid-error | GPU nodes | ⏭️ SKIPPED | N/A | DCGM not installed on test GPU nodes |
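| Category | Total | Pass | Partial | Fail | Skipped |
| --- | --- | --- | --- | --- | --- |
| Kernel | 5 | 5 | 0 | 0 | 0 |
| Networking | 2 | 2 | 0 | 0 | 0 |
| Storage | 1 | 1 | 0 | 0 | 0 |
| Runtime | 1 | 0 | 1 | 0 | 0 |
| Accelerator | 5 | 4 | 0 | 0 | 1 |
| Total | 14 | 12 | 1 | 0 | 1 |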
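
The totals row of the summary can be recomputed from the per-category counts. The Python sketch below copies the counts from the summary and checks that each row and the column totals add up; the function names are hypothetical, and the pass-rate helper is an added convenience rather than a figure reported by the test run.

```python
from typing import Dict, Tuple

# Per-category counts copied from the summary:
# (total, pass, partial, fail, skipped)
Counts = Tuple[int, int, int, int, int]

RESULTS: Dict[str, Counts] = {
    "Kernel": (5, 5, 0, 0, 0),
    "Networking": (2, 2, 0, 0, 0),
    "Storage": (1, 1, 0, 0, 0),
    "Runtime": (1, 0, 1, 0, 0),
    "Accelerator": (5, 4, 0, 0, 1),
}


def column_totals(results: Dict[str, Counts]) -> Counts:
    """Sum each column across all categories."""
    return tuple(sum(column) for column in zip(*results.values()))


def rows_are_consistent(results: Dict[str, Counts]) -> bool:
    """Check that every row's total equals pass + partial + fail + skipped."""
    return all(total == sum(rest) for total, *rest in results.values())


def pass_rate(results: Dict[str, Counts], exclude_skipped: bool = False) -> float:
    """Fraction of simulations that passed, optionally ignoring skipped ones."""
    total, passed, _partial, _failed, skipped = column_totals(results)
    denominator = total - skipped if exclude_skipped else total
    return passed / denominator
```

Calling column_totals(RESULTS) should reproduce the Total row, which makes a copy error in any single category easy to spot.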
1. Process Persistence Issue
Simulations that create processes require the --keep-alive flag. Without it, the processes are killed when the node-shell pod is deleted.
Affected simulations:
zombies - zombie processes killed
pid-exhaustion - sleep processes killed
io-delay - worker process killed
systemd-restarts - background kill script killed
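
The keep-alive durations noted in the results (30m for zombies and pid-exhaustion, 15m for io-delay, 10m for systemd-restarts) can be collected into a small lookup. The Python sketch below is illustrative only: the function names are hypothetical, and the duration syntax is assumed to be an integer followed by s, m, or h, based on the values shown in this report. The formats the simulation tool itself accepts are not documented here.

```python
import re

# Keep-alive values as reported in the result notes.
KEEP_ALIVE = {
    "zombies": "30m",
    "pid-exhaustion": "30m",
    "io-delay": "15m",
    "systemd-restarts": "10m",
}

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> int:
    """Convert a duration such as '30m' to seconds (assumed syntax)."""
    match = re.fullmatch(r"(\d+)([smh])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_SECONDS[unit]


def keep_alive_seconds(simulation: str) -> int:
    """Return the reported keep-alive in seconds, or 0 if none is listed."""
    duration = KEEP_ALIVE.get(simulation)
    return parse_duration(duration) if duration is not None else 0


def outlasts_interval(simulation: str, interval_seconds: int) -> bool:
    """True if the keep-alive window is longer than a given check interval."""
    return keep_alive_seconds(simulation) > interval_seconds
```

For io-delay, outlasts_interval("io-delay", 10 * 60) compares the 15m keep-alive against the 10-minute NMA check interval noted in the results.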
2. NMA Detection Patterns
| Pattern | NMA Behavior | Detection Level |
| --- | --- | --- |
| Zombies >= 20 | Creates Warning event | EVENT (MinOccurrences: 5) |
| PIDs > 70% of MAX(pid_max, threads-max) | Creates Warning event | EVENT |
| [timestamp] BUG: in dmesg | Creates Warning event | EVENT |
| soft lockup in dmesg | Creates Warning event | EVENT |
| IPAMD not running | Sets NetworkingReady=False | CONDITION (Fatal) |
| Interface not up | Sets NetworkingReady=False | CONDITION (Fatal) |
| Neuron errors in dmesg | Sets AcceleratedHardwareReady=False | CONDITION (Fatal) |
| NRestarts > 3 | Sets ContainerRuntimeReady=False | CONDITION |
| I/O delay > 10s | Sets StorageReady=False | CONDITION |
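
The numeric thresholds above can be written out as a small Python sketch that predicts which NMA response a given observation should trigger. This is an illustration of the listed thresholds, not NMA's implementation: the class, field, and function names are hypothetical, the log-based and process-based patterns are left out, and the MinOccurrences: 5 requirement on the zombie event is not modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeObservation:
    """Hypothetical snapshot of the metrics the thresholds refer to."""
    zombie_count: int = 0
    pid_usage_percent: float = 0.0
    systemd_restarts: int = 0  # NRestarts of the monitored unit
    io_delay_seconds: float = 0.0


def expected_nma_responses(obs: NodeObservation) -> List[str]:
    """List the responses the documented thresholds would predict."""
    responses: List[str] = []
    if obs.zombie_count >= 20:
        responses.append("Warning event (zombies)")
    if obs.pid_usage_percent > 70:
        responses.append("Warning event (PID usage)")
    if obs.systemd_restarts > 3:
        responses.append("ContainerRuntimeReady=False")
    if obs.io_delay_seconds > 10:
        responses.append("StorageReady=False")
    return responses
```

With the values from this test run, 25 zombies and 74% PID usage both cross their thresholds, while NRestarts=1 does not exceed 3, which is consistent with the PARTIAL result for systemd-restarts.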
3. Destructive Simulations (Use with Caution)
| Simulation | Warning |
| --- | --- |
| fork-oom | DESTRUCTIVE: Exhausts node PIDs; the node becomes unrecoverable and must be deleted/replaced |