HMA-CLI Test Results

Test Date: 2026-03-12

Test Environment

| Property | Value |
| --- | --- |
| Cluster | EKS 1.31 |
| Node OS | Amazon Linux 2023, Amazon Linux 2, Bottlerocket |
| NMA Version | Latest (installed via EKS add-on) |

Test Results Matrix

Kernel Simulations

| Simulation | Node | Result | NMA Detection | Notes |
| --- | --- | --- | --- | --- |
| zombies | ip-10-0-5-217 | ✅ PASS | Warning Event | Created 25 zombies, requires --keep-alive 30m |
| kernel-bug | Multiple | ✅ PASS | Warning Event | Dmesg injection works, pattern: [timestamp] BUG: message |
| soft-lockup | Multiple | ✅ PASS | Warning Event | Dmesg injection works, pattern includes process name |
| pid-exhaustion | ip-10-0-3-62 | ✅ PASS | Pending | Achieved 74% PID usage, requires --keep-alive 30m |
| fork-oom | ip-10-0-4-11 | ✅ PASS | KernelReady=False | WARNING: Node became unrecoverable, required deletion |
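
The two dmesg line shapes noted for kernel-bug and soft-lockup can be sketched as regular expressions. This is only an illustration of those shapes, not NMA's actual matching logic: the function names, the assumed timestamp format (seconds and microseconds in square brackets), and the assumed `[process:pid]` suffix on soft-lockup lines are all hypothetical.

```python
import re

# Assumption: dmesg prefixes each line with "[seconds.microseconds]",
# and a kernel-bug line continues with "BUG: <message>".
KERNEL_BUG = re.compile(r"^\[\s*\d+\.\d+\]\s+BUG: .+")

# Assumption: a soft-lockup line contains the phrase "soft lockup" and
# names the stuck process as "[process:pid]" later on the same line.
SOFT_LOCKUP = re.compile(r"soft lockup.*\[(?P<process>[^\]:]+):\d+\]")


def is_kernel_bug(line: str) -> bool:
    """True if the line has the '[timestamp] BUG: message' shape."""
    return KERNEL_BUG.match(line) is not None


def soft_lockup_process(line: str):
    """Return the process name from a soft-lockup line, or None if absent."""
    match = SOFT_LOCKUP.search(line)
    return match.group("process") if match else None
```

A line such as `[ 1234.567890] BUG: simulated kernel bug` (a made-up sample) satisfies `is_kernel_bug`, while a soft-lockup line is recognized by its phrase and bracketed process name rather than by the `BUG:` prefix position.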

Networking Simulations

| Simulation | Node | Result | NMA Detection | Notes |
| --- | --- | --- | --- | --- |
| ipamd-down | Multiple | ✅ PASS | NetworkingReady=False | Kills aws-k8s-agent process |
| interface-down | Multiple | ✅ PASS | NetworkingReady=False | ip link set eth1 down |

Storage Simulations

| Simulation | Node | Result | NMA Detection | Notes |
| --- | --- | --- | --- | --- |
| io-delay | ip-10-0-5-78 | ✅ PASS | Pending | Worker started, NMA checks every 10 min, requires --keep-alive 15m |

Runtime Simulations

| Simulation | Node | Result | NMA Detection | Notes |
| --- | --- | --- | --- | --- |
| systemd-restarts | ip-10-0-5-79 | ⚠️ PARTIAL | NRestarts=1 | Script killed on pod exit; requires --keep-alive 10m |

Accelerator Simulations

| Simulation | Node | Result | NMA Detection | Notes |
| --- | --- | --- | --- | --- |
| neuron-sram-error | Neuron nodes | ✅ PASS | AcceleratedHardwareReady=False | Dmesg injection |
| neuron-nc-error | Neuron nodes | ✅ PASS | AcceleratedHardwareReady=False | Dmesg injection |
| neuron-hbm-error | Neuron nodes | ✅ PASS | AcceleratedHardwareReady=False | Dmesg injection |
| neuron-dma-error | Neuron nodes | ✅ PASS | AcceleratedHardwareReady=False | Dmesg injection |
| xid-error | GPU nodes | ⏭️ SKIPPED | N/A | DCGM not installed on test GPU nodes |

Summary

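| Category | Total | Pass | Partial | Fail | Skipped |
| --- | --- | --- | --- | --- | --- |
| Kernel | 5 | 5 | 0 | 0 | 0 |
| Networking | 2 | 2 | 0 | 0 | 0 |
| Storage | 1 | 1 | 0 | 0 | 0 |
| Runtime | 1 | 0 | 1 | 0 | 0 |
| Accelerator | 5 | 4 | 0 | 0 | 1 |
| Total | 14 | 12 | 1 | 0 | 1 |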
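
The summary counts can be re-derived from the per-simulation results in the matrix. The sketch below transcribes those results by hand and tallies them; the `RESULTS` structure and `summarize` function are illustrative names, not part of hma-cli.

```python
from collections import Counter

# Results transcribed from the Test Results Matrix in this report.
RESULTS = {
    "Kernel": {
        "zombies": "PASS",
        "kernel-bug": "PASS",
        "soft-lockup": "PASS",
        "pid-exhaustion": "PASS",
        "fork-oom": "PASS",
    },
    "Networking": {"ipamd-down": "PASS", "interface-down": "PASS"},
    "Storage": {"io-delay": "PASS"},
    "Runtime": {"systemd-restarts": "PARTIAL"},
    "Accelerator": {
        "neuron-sram-error": "PASS",
        "neuron-nc-error": "PASS",
        "neuron-hbm-error": "PASS",
        "neuron-dma-error": "PASS",
        "xid-error": "SKIPPED",
    },
}


def summarize(results: dict) -> dict:
    """Count outcomes per category and add a 'Total' row across categories."""
    summary = {cat: Counter(sims.values()) for cat, sims in results.items()}
    total = Counter()
    for counts in summary.values():
        total.update(counts)
    summary["Total"] = total
    return summary


if __name__ == "__main__":
    for category, counts in summarize(RESULTS).items():
        print(category, sum(counts.values()), dict(counts))
```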

Key Findings

1. Process Persistence Issue

Simulations that create processes require the --keep-alive flag. Without it, those processes are killed when the node-shell pod is deleted, often before NMA has had a chance to observe them.

Affected simulations:

  • zombies - zombie processes killed
  • pid-exhaustion - sleep processes killed
  • io-delay - worker process killed
  • systemd-restarts - background kill script killed
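
The zombie case shows why the parent has to stay alive: a zombie exists only while its parent is running and has not yet waited on it. The sketch below is a generic, Linux-only illustration of that behavior (it reads `/proc`), not the code hma-cli runs; `spawn_zombie`, `process_state`, and `wait_for_state` are hypothetical helper names.

```python
import os
import time


def spawn_zombie() -> int:
    """Fork a child that exits at once; it stays a zombie until the parent waits."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately, skipping cleanup handlers
    return pid


def process_state(pid: int) -> str:
    """Return the one-letter state from /proc/<pid>/stat ('Z' means zombie)."""
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # The state field follows the command name, which is wrapped in parentheses.
    return stat.rsplit(")", 1)[1].split()[0]


def wait_for_state(pid: int, wanted: str, timeout: float = 2.0) -> bool:
    """Poll until the process reaches the wanted state or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if process_state(pid) == wanted:
            return True
        time.sleep(0.01)
    return False
```

Once the parent calls `os.waitpid` on the child, or the parent itself exits and the orphan is reaped, the zombie disappears. That is consistent with the zombies vanishing when the pod that owns the parent process is deleted.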

2. NMA Detection Patterns

| Pattern | NMA Behavior | Detection Level |
| --- | --- | --- |
| Zombies >= 20 | Creates Warning event | EVENT (MinOccurrences: 5) |
| PIDs > 70% of MAX(pid_max, threads-max) | Creates Warning event | EVENT |
| [timestamp] BUG: in dmesg | Creates Warning event | EVENT |
| soft lockup in dmesg | Creates Warning event | EVENT |
| IPAMD not running | Sets NetworkingReady=False | CONDITION (Fatal) |
| Interface not up | Sets NetworkingReady=False | CONDITION (Fatal) |
| Neuron errors in dmesg | Sets AcceleratedHardwareReady=False | CONDITION (Fatal) |
| NRestarts > 3 | Sets ContainerRuntimeReady=False | CONDITION |
| I/O delay > 10s | Sets StorageReady=False | CONDITION |
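
The numeric thresholds in this table can be written out as a small check, which helps when deciding whether a simulation run should have triggered a detection. This is a sketch of the thresholds as listed here, not NMA's implementation: the function names and return strings are hypothetical, and the MinOccurrences setting is not modeled.

```python
def pid_usage_percent(current_pids: int, pid_max: int, threads_max: int) -> float:
    """PID usage as a percentage of MAX(pid_max, threads-max), per the table."""
    return 100.0 * current_pids / max(pid_max, threads_max)


def expected_detections(zombies: int = 0, pid_usage: float = 0.0,
                        n_restarts: int = 0, io_delay_s: float = 0.0) -> list:
    """List the detections the table's numeric thresholds imply for these inputs."""
    hits = []
    if zombies >= 20:
        hits.append("EVENT: zombies")
    if pid_usage > 70:
        hits.append("EVENT: PID usage")
    if n_restarts > 3:
        hits.append("CONDITION: ContainerRuntimeReady=False")
    if io_delay_s > 10:
        hits.append("CONDITION: StorageReady=False")
    return hits


if __name__ == "__main__":
    # Values reported in the test matrix: 25 zombies, 74% PID usage, NRestarts=1.
    print(expected_detections(zombies=25))
    print(expected_detections(pid_usage=74.0))
    print(expected_detections(n_restarts=1))
```

With NRestarts=1, as observed in the systemd-restarts run, the sketch returns no detection because the listed threshold is NRestarts > 3, which is consistent with that run's PARTIAL result.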

3. Destructive Simulations (Use with Caution)

| Simulation | Warning |
| --- | --- |
| fork-oom | DESTRUCTIVE: Exhausts node PIDs, node becomes unrecoverable and must be deleted/replaced |