Add a worker pool for health check go-routines #78

Closed
amirhnajafiz wants to merge 7 commits into IBM:main from amirhnajafiz:main

Conversation

@amirhnajafiz
Collaborator

Summary

  • Short health-check intervals currently cause problems: the main goroutine blocks while a check runs, and consecutive checks start to overlap, which wastes system resources. This PR adds a worker pool that runs health checks in separate goroutines so the main goroutine is never blocked. If a health check is still running when the next one is triggered, the new request is dropped. As a result, each health check has at most one instance running at a time, with no overlap and no blocking.
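
The behavior described above can be sketched as a small pool that tracks in-flight tasks by name. This is a minimal illustration, not the code in this PR: the names `Task`, `Pool`, `NewPool`, `Submit`, and `Close` are hypothetical, and the queue size is an assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// Task is a named unit of work; the name is the de-duplication key.
// (Hypothetical type, not taken from the PR.)
type Task struct {
	Name string
	Run  func()
}

// Pool runs tasks on a fixed number of worker goroutines and drops a
// submission while a task with the same name is still queued or running.
type Pool struct {
	mu      sync.Mutex
	running map[string]bool
	tasks   chan Task
	wg      sync.WaitGroup
}

// NewPool starts the given number of workers.
func NewPool(workers int) *Pool {
	p := &Pool{
		running: make(map[string]bool),
		tasks:   make(chan Task, workers), // assumed queue size
	}
	for i := 0; i < workers; i++ {
		p.wg.Add(1)
		go p.worker()
	}
	return p
}

func (p *Pool) worker() {
	defer p.wg.Done()
	for t := range p.tasks {
		t.Run()
		p.mu.Lock()
		delete(p.running, t.Name) // the task may be submitted again
		p.mu.Unlock()
	}
}

// Submit never blocks the caller. It returns false when the task is
// dropped, either because it is already in flight or the queue is full.
func (p *Pool) Submit(t Task) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.running[t.Name] {
		return false
	}
	select {
	case p.tasks <- t:
		p.running[t.Name] = true
		return true
	default:
		return false
	}
}

// Close stops accepting work and waits for the workers to drain.
// Submit must not be called after Close.
func (p *Pool) Close() {
	close(p.tasks)
	p.wg.Wait()
}

func main() {
	p := NewPool(2)
	release := make(chan struct{})
	first := p.Submit(Task{Name: "Periodic Check", Run: func() { <-release }})
	second := p.Submit(Task{Name: "Periodic Check", Run: func() {}})
	other := p.Submit(Task{Name: "Invasive Check", Run: func() {}})
	close(release)
	p.Close()
	fmt.Println(first, second, other) // prints "true false true"
}
```

Holding the mutex across both the in-flight check and the channel send keeps the two steps atomic, so a worker cannot clear the entry before it has been set.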

Scope and Impact

  • No API changes.
  • The health check logic is now handled by a new module called worker, located in autopilot-daemon/pkg. cmd/main.go supports a new flag, --workers, that sets the maximum size of the worker pool. The Helm chart's values.yaml exposes the same setting (workers) so the pool size can be configured at deploy time. If the value is 0, the pool size defaults to 2 workers per logical CPU core; otherwise the user-defined size is used. For now, I recommend setting the pool size to 2.
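
As a rough sketch of the sizing rule above, the flag value can be resolved as follows. The helper name `resolveWorkers` is hypothetical, and using `runtime.NumCPU()` for the logical core count is an assumption; the PR may obtain the count differently, and treating negative values like 0 is also an assumption made here.

```go
package main

import (
	"flag"
	"fmt"
	"runtime"
)

// resolveWorkers returns the pool size to use: the requested value when it
// is positive, otherwise 2 workers per logical CPU core.
// Assumption: negative values are treated the same as 0.
func resolveWorkers(requested, logicalCPUs int) int {
	if requested > 0 {
		return requested
	}
	return 2 * logicalCPUs
}

func main() {
	workers := flag.Int("workers", 0, "maximum worker pool size; 0 means 2 per logical CPU core")
	flag.Parse()

	size := resolveWorkers(*workers, runtime.NumCPU())
	fmt.Printf("Starting WorkerPool with %d workers\n", size)
}
```

With `--workers=2` the sketch reports a pool of 2 workers regardless of core count, matching the recommended setting.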

GitHub Issue

  • None

How was this Pull-Request Tested and Validated?

  • Built a new image ghcr.io/amirhnajafiz/autopilot:sha-3a76e7c.
  • Deployed the new Autopilot with the Helm chart on a 3-node Kubernetes cluster (1 controller, 2 GPU nodes).

Helm Chart Values

anajafizadeh@kctl:~/fsl/sunyibm$ cat autopilot/values.yaml
# Default values for the Autopilot DaemonSet.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.
image:
  repository: ghcr.io/amirhnajafiz/autopilot #quay.io/autopilot/autopilot
  tag: sha-3a76e7c
  pullPolicy: Always

# Workers for concurrent tasks. Defaults to 0 which uses 2*number_of_logical_CPU_cores (depends on the resource limits).
workers: 2

# Bandwidth threshold below which PCIe links are considered defective (Gb/s)
# It is recommended to set a threshold that is 25% or lower of the expected peak PCIe bandwidth capability, which maps to maximum peak from 16 lanes to 4 lanes. For example, for a PCIe Gen4x16, reported peak bandwidth is 63GB/s. A degradation at 25% is 15.75GB/s, which corresponds to PCIe Gen4x4. The measured bandwidth is expected to be at least 80% of the expected peak PCIe generation bandwidth.
PCIeBW: 4

# Timer for periodic checks, in hours
repeat: 15s

# Timer for periodic invasive checks, in hours (e.g., dcgmi diag -r 3). Set to 0 to disable (for non nvidia gpu systems)
invasive: 30s

...

Pods and logs

[Screenshots: 2025-08-06 12:27:42 PM, 2025-08-06 12:22:23 PM]
anajafizadeh@kctl:~/fsl/sunyibm$ kubectl logs -n autopilot autopilot-mpx2n | grep "Running a periodic check"
Defaulted container "autopilot" out of: autopilot, device-plugin-validation (init)
I0806 16:02:39.623145       7 healthcheck.go:17] Running a periodic check
I0806 16:02:54.624244       7 healthcheck.go:17] Running a periodic check
I0806 16:03:09.624157       7 healthcheck.go:17] Running a periodic check
I0806 16:03:24.624331       7 healthcheck.go:17] Running a periodic check
I0806 16:03:39.624464       7 healthcheck.go:17] Running a periodic check
I0806 16:03:54.623782       7 healthcheck.go:17] Running a periodic check
I0806 16:04:09.623547       7 healthcheck.go:17] Running a periodic check
I0806 16:04:24.624150       7 healthcheck.go:17] Running a periodic check
I0806 16:04:39.624421       7 healthcheck.go:17] Running a periodic check
I0806 16:04:54.624438       7 healthcheck.go:17] Running a periodic check
I0806 16:05:09.623947       7 healthcheck.go:17] Running a periodic check
I0806 16:05:24.623921       7 healthcheck.go:17] Running a periodic check
I0806 16:05:39.624037       7 healthcheck.go:17] Running a periodic check
anajafizadeh@kctl:~/fsl/sunyibm$ kubectl logs -n autopilot autopilot-4gjp6 | grep "Running a periodic check"
Defaulted container "autopilot" out of: autopilot, device-plugin-validation (init)
I0806 16:05:08.163434       7 healthcheck.go:17] Running a periodic check
I0806 16:05:38.164806       7 healthcheck.go:17] Running a periodic check
I0806 16:06:08.164229       7 healthcheck.go:17] Running a periodic check
I0806 16:06:38.164170       7 healthcheck.go:17] Running a periodic check
I0806 16:07:08.164261       7 healthcheck.go:17] Running a periodic check
I0806 16:07:38.163962       7 healthcheck.go:17] Running a periodic check
I0806 16:08:08.164634       7 healthcheck.go:17] Running a periodic check
I0806 16:08:38.164744       7 healthcheck.go:17] Running a periodic check
I0806 16:09:08.164202       7 healthcheck.go:17] Running a periodic check
I0806 16:09:38.164636       7 healthcheck.go:17] Running a periodic check
I0806 16:10:08.164235       7 healthcheck.go:17] Running a periodic check
I0806 16:10:38.164370       7 healthcheck.go:17] Running a periodic check
I0806 16:11:08.164418       7 healthcheck.go:17] Running a periodic check
I0806 16:11:38.164357       7 healthcheck.go:17] Running a periodic check
I0806 16:12:08.163866       7 healthcheck.go:17] Running a periodic check
I0806 16:12:38.163964       7 healthcheck.go:17] Running a periodic check
I0806 16:13:08.164548       7 healthcheck.go:17] Running a periodic check
I0806 16:13:38.163890       7 healthcheck.go:17] Running a periodic check
I0806 16:14:08.164925       7 healthcheck.go:17] Running a periodic check
I0806 16:14:38.164208       7 healthcheck.go:17] Running a periodic check
I0806 16:15:08.163903       7 healthcheck.go:17] Running a periodic check
I0806 16:15:38.164278       7 healthcheck.go:17] Running a periodic check
I0806 16:16:08.164369       7 healthcheck.go:17] Running a periodic check
I0806 16:16:38.164610       7 healthcheck.go:17] Running a periodic check
anajafizadeh@kctl:~/fsl/sunyibm$ kubectl logs -n autopilot autopilot-mpx2n | grep "Trying to run an invasive check"
Defaulted container "autopilot" out of: autopilot, device-plugin-validation (init)
I0806 16:03:09.624142       7 healthcheck.go:32] Trying to run an invasive check
I0806 16:03:39.624458       7 healthcheck.go:32] Trying to run an invasive check
I0806 16:04:09.623547       7 healthcheck.go:32] Trying to run an invasive check
I0806 16:04:39.624373       7 healthcheck.go:32] Trying to run an invasive check
I0806 16:05:09.623927       7 healthcheck.go:32] Trying to run an invasive check
I0806 16:05:39.624069       7 healthcheck.go:32] Trying to run an invasive check
I0806 16:06:09.624482       7 healthcheck.go:32] Trying to run an invasive check
I0806 16:06:39.623918       7 healthcheck.go:32] Trying to run an invasive check
anajafizadeh@kctl:~/fsl/sunyibm$ kubectl logs -n autopilot autopilot-4gjp6 | grep "Trying to run an invasive check"
Defaulted container "autopilot" out of: autopilot, device-plugin-validation (init)
I0806 16:05:38.164827       7 healthcheck.go:32] Trying to run an invasive check
I0806 16:06:08.164175       7 healthcheck.go:32] Trying to run an invasive check
I0806 16:06:38.164084       7 healthcheck.go:32] Trying to run an invasive check
I0806 16:07:08.164261       7 healthcheck.go:32] Trying to run an invasive check
I0806 16:07:38.163887       7 healthcheck.go:32] Trying to run an invasive check
I0806 16:08:08.164679       7 healthcheck.go:32] Trying to run an invasive check
I0806 16:08:38.164718       7 healthcheck.go:32] Trying to run an invasive check
I0806 16:09:08.164154       7 healthcheck.go:32] Trying to run an invasive check
I0806 16:09:38.164671       7 healthcheck.go:32] Trying to run an invasive check
I0806 16:10:08.164164       7 healthcheck.go:32] Trying to run an invasive check
I0806 16:10:38.164350       7 healthcheck.go:32] Trying to run an invasive check
I0806 16:11:08.164330       7 healthcheck.go:32] Trying to run an invasive check
I0806 16:11:38.164298       7 healthcheck.go:32] Trying to run an invasive check
[Screenshot: 2025-08-06 12:23:17 PM]

Pull-Request Reminders

  • Does the Autopilot Readme require updates?

    • No
  • Are there any new software dependencies introduced to this Pull-Request?

    • No

@cmisale self-requested a review August 8, 2025 14:02
@amirhnajafiz self-assigned this Aug 8, 2025
@amirhnajafiz added and then removed the bug label Aug 8, 2025
Signed-off-by: amirhnajafiz <najafizadeh21@gmail.com>
… autopilot-daemon/pkg/cmd/main.go

@amirhnajafiz
Collaborator Author

@cmisale I built a new image (ghcr.io/amirhnajafiz/autopilot:sha-4531d68) with worker pool tracing logs. You can check the logs here. I also added grep results for these messages:

  • "Task completed" for both "Periodic Check" and "Invasive Check"
  • "Task submitted to worker pool" for both
  • "Processing task" for both
  • "Task already running, skipping submission" for "Periodic Check"
  • Another "Task submitted to worker pool" for "Invasive Check"

I tested the new Autopilot by setting periodic checks to run every 5 seconds and invasive checks every 10 seconds so they would overlap.

[Screenshot: 2025-08-08 2:45:22 PM]

Logs

grep

❯ kubectl logs -n autopilot autopilot-xfxwp | grep -e "Task completed" -e "Task submitted to worker pool" -e "Processing task" -e "ask already running, skipping submission"
Defaulted container "autopilot" out of: autopilot, device-plugin-validation (init)
I0808 18:43:57.376711       7 worker.go:16] "Processing task" task="Periodic Check"
I0808 18:43:57.377055       7 pool.go:46] "Task submitted to worker pool" task="Periodic Check"
I0808 18:44:02.377562       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:44:07.377602       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:44:07.377696       7 pool.go:46] "Task submitted to worker pool" task="Invasive Check"
I0808 18:44:07.377747       7 worker.go:16] "Processing task" task="Invasive Check"
I0808 18:44:12.378067       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:44:17.377673       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:44:17.377774       7 pool.go:37] "Task already running, skipping submission" task="Invasive Check"
I0808 18:44:22.378191       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:44:23.618842       7 worker.go:29] "Task completed" task="Periodic Check"
I0808 18:44:23.798301       7 worker.go:29] "Task completed" task="Invasive Check"
I0808 18:44:27.378085       7 worker.go:16] "Processing task" task="Periodic Check"
I0808 18:44:27.378105       7 pool.go:46] "Task submitted to worker pool" task="Periodic Check"
I0808 18:44:27.378217       7 pool.go:46] "Task submitted to worker pool" task="Invasive Check"
I0808 18:44:27.378249       7 worker.go:16] "Processing task" task="Invasive Check"
I0808 18:44:32.377740       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:44:37.377669       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:44:37.377744       7 pool.go:37] "Task already running, skipping submission" task="Invasive Check"
I0808 18:44:42.378248       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:44:47.378143       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:44:47.378200       7 pool.go:37] "Task already running, skipping submission" task="Invasive Check"
I0808 18:44:52.377532       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:44:53.572517       7 worker.go:29] "Task completed" task="Periodic Check"
I0808 18:44:53.642585       7 worker.go:29] "Task completed" task="Invasive Check"
I0808 18:44:57.377326       7 pool.go:46] "Task submitted to worker pool" task="Periodic Check"
I0808 18:44:57.377381       7 pool.go:46] "Task submitted to worker pool" task="Invasive Check"
I0808 18:44:57.377405       7 worker.go:16] "Processing task" task="Invasive Check"
I0808 18:44:57.377461       7 worker.go:16] "Processing task" task="Periodic Check"
I0808 18:44:57.607345       7 worker.go:29] "Task completed" task="Invasive Check"
❯ kubectl logs -n autopilot autopilot-c7jcz | grep -e "Task completed" -e "Task submitted to worker pool" -e "Processing task" -e "ask already running, skipping submission"
Defaulted container "autopilot" out of: autopilot, device-plugin-validation (init)
I0808 18:41:18.293549       7 pool.go:46] "Task submitted to worker pool" task="Periodic Check"
I0808 18:41:18.293479       7 worker.go:16] "Processing task" task="Periodic Check"
I0808 18:41:23.294766       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:41:28.294579       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:41:28.294654       7 pool.go:46] "Task submitted to worker pool" task="Invasive Check"
I0808 18:41:28.294705       7 worker.go:16] "Processing task" task="Invasive Check"
I0808 18:41:33.293785       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:41:38.294350       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:41:38.294406       7 pool.go:37] "Task already running, skipping submission" task="Invasive Check"
I0808 18:41:42.284577       7 worker.go:29] "Task completed" task="Periodic Check"
I0808 18:41:42.504044       7 worker.go:29] "Task completed" task="Invasive Check"
I0808 18:41:43.294433       7 pool.go:46] "Task submitted to worker pool" task="Periodic Check"
I0808 18:41:43.294477       7 worker.go:16] "Processing task" task="Periodic Check"
I0808 18:41:48.294632       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:41:48.294734       7 pool.go:46] "Task submitted to worker pool" task="Invasive Check"
I0808 18:41:48.294772       7 worker.go:16] "Processing task" task="Invasive Check"
I0808 18:41:53.294340       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:41:58.294556       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:41:58.294638       7 pool.go:37] "Task already running, skipping submission" task="Invasive Check"
I0808 18:42:03.294128       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:07.267915       7 worker.go:29] "Task completed" task="Periodic Check"
I0808 18:42:07.321929       7 worker.go:29] "Task completed" task="Invasive Check"
I0808 18:42:08.294112       7 pool.go:46] "Task submitted to worker pool" task="Periodic Check"
I0808 18:42:08.294166       7 pool.go:46] "Task submitted to worker pool" task="Invasive Check"
I0808 18:42:08.294190       7 worker.go:16] "Processing task" task="Invasive Check"
I0808 18:42:08.294207       7 worker.go:16] "Processing task" task="Periodic Check"
I0808 18:42:08.356414       7 worker.go:29] "Task completed" task="Invasive Check"

raw

❯ kubectl logs -n autopilot autopilot-c7jcz
Defaulted container "autopilot" out of: autopilot, device-plugin-validation (init)
I0808 18:41:18.239356       7 prometheus.go:46] CPU_MODEL: IntelR Xeon(R) Silver 4114 CPU @ 2.20GHz
I0808 18:41:18.293057       7 prometheus.go:60] GPU_MODEL: NVIDIA RTX A5000
I0808 18:41:18.293128       7 global.go:41] Init entry map pciebw
I0808 18:41:18.293146       7 global.go:41] Init entry map remapped
I0808 18:41:18.293157       7 global.go:41] Init entry map dcgm
I0808 18:41:18.293168       7 global.go:41] Init entry map ping
I0808 18:41:18.293180       7 global.go:41] Init entry map gpupower
I0808 18:41:18.293366       7 main.go:137] Starting WorkerPool with 2 workers
I0808 18:41:18.293382       7 main.go:60] Serving metrics on :8081
I0808 18:41:18.293482       7 main.go:105] Serving Health Checks on port :3333
I0808 18:41:18.293549       7 pool.go:46] "Task submitted to worker pool" task="Periodic Check"
I0808 18:41:18.293479       7 worker.go:16] "Processing task" task="Periodic Check"
I0808 18:41:18.293619       7 healthcheck.go:17] Running a periodic check
I0808 18:41:18.293645       7 healthcheck.go:54] Health checks pciebw,remapped,dcgm,ping,gpupower
I0808 18:41:18.293668       7 healthcheck.go:83] Running health check: pciebw
I0808 18:41:18.293799       7 main.go:72] Serving Readiness Probe on :8080
I0808 18:41:20.056765       7 healthcheck.go:227] GPU PCIe BW test completed:
I0808 18:41:20.056874       7 healthcheck.go:256] Observation: sunyibm1.fsl.cs.sunysb.edu 0 12.3
I0808 18:41:20.056920       7 healthcheck.go:256] Observation: sunyibm1.fsl.cs.sunysb.edu 1 12.3
I0808 18:41:20.056952       7 healthcheck.go:92] Running health check: remapped
I0808 18:41:21.520893       7 healthcheck.go:160] Remapped Rows check test completed:
I0808 18:41:21.521019       7 healthcheck.go:183] Observation: sunyibm1.fsl.cs.sunysb.edu 0 0
I0808 18:41:21.521073       7 healthcheck.go:183] Observation: sunyibm1.fsl.cs.sunysb.edu 1 0
I0808 18:41:21.521114       7 healthcheck.go:74] Running health check: dcgm -r 1
I0808 18:41:23.294766       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:41:28.294579       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:41:28.294654       7 pool.go:46] "Task submitted to worker pool" task="Invasive Check"
I0808 18:41:28.294705       7 worker.go:16] "Processing task" task="Invasive Check"
I0808 18:41:28.294720       7 healthcheck.go:32] Trying to run an invasive check
I0808 18:41:31.737376       7 healthcheck.go:369] DCGM test completed:
I0808 18:41:31.737432       7 healthcheck.go:384] Observation: sunyibm1.fsl.cs.sunysb.edu Pass 0
I0808 18:41:31.737486       7 healthcheck.go:58] Running health check: ping
I0808 18:41:33.293785       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:41:38.294350       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:41:38.294406       7 pool.go:37] "Task already running, skipping submission" task="Invasive Check"
I0808 18:41:41.649157       7 healthcheck.go:272] Ping test completed:
I0808 18:41:41.649273       7 healthcheck.go:301] Unreachable nodes count: 0
I0808 18:41:41.649304       7 healthcheck.go:101] Running health check: gpupower
I0808 18:41:42.231243       7 healthcheck.go:402] Power Throttle check test completed:
I0808 18:41:42.231304       7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 0 0
I0808 18:41:42.231351       7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 1 0
I0808 18:41:42.231411       7 healthcheck.go:135] Total time (s) for all checks: 23.937731028
I0808 18:41:42.231434       7 healthcheck.go:23] Errors after running periodic health checks: false
I0808 18:41:42.245933       7 functions.go:198] Node sunyibm1.fsl.cs.sunysb.edu label found PASS
I0808 18:41:42.284512       7 functions.go:215] Node patched with label
	{
		"metadata": {
			"labels": {
				"autopilot.ibm.com/gpuhealth": "PASS"
			}
		}
	}
I0808 18:41:42.284577       7 worker.go:29] "Task completed" task="Periodic Check"
I0808 18:41:42.349721       7 functions.go:92] GPUs are free. Will run invasive health checks.
I0808 18:41:42.349790       7 healthcheck.go:36] Starting invasive health checks, updating node label =TESTING for node sunyibm1.fsl.cs.sunysb.edu
I0808 18:41:42.362566       7 functions.go:198] Node sunyibm1.fsl.cs.sunysb.edu label found PASS
I0808 18:41:42.409251       7 functions.go:215] Node patched with label
	{
		"metadata": {
			"labels": {
				"autopilot.ibm.com/gpuhealth": "TESTING"
				}
		}
	}
I0808 18:41:42.447031       7 functions.go:176] Try create Job
I0808 18:41:42.503700       7 functions.go:183] Created
I0808 18:41:42.504044       7 worker.go:29] "Task completed" task="Invasive Check"
I0808 18:41:43.294433       7 pool.go:46] "Task submitted to worker pool" task="Periodic Check"
I0808 18:41:43.294477       7 worker.go:16] "Processing task" task="Periodic Check"
I0808 18:41:43.294554       7 healthcheck.go:17] Running a periodic check
I0808 18:41:43.294579       7 healthcheck.go:54] Health checks pciebw,remapped,dcgm,ping,gpupower
I0808 18:41:43.294604       7 healthcheck.go:83] Running health check: pciebw
I0808 18:41:45.044403       7 healthcheck.go:227] GPU PCIe BW test completed:
I0808 18:41:45.044478       7 healthcheck.go:256] Observation: sunyibm1.fsl.cs.sunysb.edu 0 12.3
I0808 18:41:45.044515       7 healthcheck.go:256] Observation: sunyibm1.fsl.cs.sunysb.edu 1 12.3
I0808 18:41:45.044542       7 healthcheck.go:92] Running health check: remapped
I0808 18:41:46.556952       7 healthcheck.go:160] Remapped Rows check test completed:
I0808 18:41:46.557036       7 healthcheck.go:183] Observation: sunyibm1.fsl.cs.sunysb.edu 0 0
I0808 18:41:46.557074       7 healthcheck.go:183] Observation: sunyibm1.fsl.cs.sunysb.edu 1 0
I0808 18:41:46.557130       7 healthcheck.go:74] Running health check: dcgm -r 1
I0808 18:41:48.294632       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:41:48.294734       7 pool.go:46] "Task submitted to worker pool" task="Invasive Check"
I0808 18:41:48.294772       7 worker.go:16] "Processing task" task="Invasive Check"
I0808 18:41:48.294801       7 healthcheck.go:32] Trying to run an invasive check
I0808 18:41:53.294340       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:41:56.830168       7 healthcheck.go:369] DCGM test completed:
I0808 18:41:56.830228       7 healthcheck.go:384] Observation: sunyibm1.fsl.cs.sunysb.edu Pass 0
I0808 18:41:56.830269       7 healthcheck.go:58] Running health check: ping
I0808 18:41:58.294556       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:41:58.294638       7 pool.go:37] "Task already running, skipping submission" task="Invasive Check"
I0808 18:42:03.294128       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:06.639041       7 healthcheck.go:272] Ping test completed:
I0808 18:42:06.639128       7 healthcheck.go:301] Unreachable nodes count: 0
I0808 18:42:06.639187       7 healthcheck.go:101] Running health check: gpupower
I0808 18:42:07.212563       7 healthcheck.go:402] Power Throttle check test completed:
I0808 18:42:07.212636       7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 0 0
I0808 18:42:07.212694       7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 1 0
I0808 18:42:07.212767       7 healthcheck.go:135] Total time (s) for all checks: 23.918143672
I0808 18:42:07.212812       7 healthcheck.go:23] Errors after running periodic health checks: false
I0808 18:42:07.228511       7 functions.go:198] Node sunyibm1.fsl.cs.sunysb.edu label found PASS
I0808 18:42:07.267860       7 functions.go:215] Node patched with label
	{
		"metadata": {
			"labels": {
				"autopilot.ibm.com/gpuhealth": "PASS"
			}
		}
	}
I0808 18:42:07.267915       7 worker.go:29] "Task completed" task="Periodic Check"
I0808 18:42:07.321888       7 functions.go:88] Pod dcgm-f7fad6-h469z with requests 8 and limits 8. Cannot run invasive health checks.
I0808 18:42:07.321929       7 worker.go:29] "Task completed" task="Invasive Check"
I0808 18:42:08.294112       7 pool.go:46] "Task submitted to worker pool" task="Periodic Check"
I0808 18:42:08.294166       7 pool.go:46] "Task submitted to worker pool" task="Invasive Check"
I0808 18:42:08.294190       7 worker.go:16] "Processing task" task="Invasive Check"
I0808 18:42:08.294211       7 healthcheck.go:32] Trying to run an invasive check
I0808 18:42:08.294207       7 worker.go:16] "Processing task" task="Periodic Check"
I0808 18:42:08.294249       7 healthcheck.go:17] Running a periodic check
I0808 18:42:08.356359       7 functions.go:88] Pod dcgm-f7fad6-h469z with requests 8 and limits 8. Cannot run invasive health checks.
I0808 18:42:08.356414       7 worker.go:29] "Task completed" task="Invasive Check"
I0808 18:42:08.356436       7 healthcheck.go:54] Health checks pciebw,remapped,dcgm,ping,gpupower
I0808 18:42:08.356450       7 healthcheck.go:83] Running health check: pciebw
I0808 18:42:10.127896       7 healthcheck.go:227] GPU PCIe BW test completed:
I0808 18:42:10.127997       7 healthcheck.go:256] Observation: sunyibm1.fsl.cs.sunysb.edu 0 12.3
I0808 18:42:10.128035       7 healthcheck.go:256] Observation: sunyibm1.fsl.cs.sunysb.edu 1 12.3
I0808 18:42:10.128062       7 healthcheck.go:92] Running health check: remapped
I0808 18:42:11.363142       7 healthcheck.go:160] Remapped Rows check test completed:
I0808 18:42:11.363236       7 healthcheck.go:183] Observation: sunyibm1.fsl.cs.sunysb.edu 0 0
I0808 18:42:11.363274       7 healthcheck.go:183] Observation: sunyibm1.fsl.cs.sunysb.edu 1 0
I0808 18:42:11.363301       7 healthcheck.go:74] Running health check: dcgm -r 1
I0808 18:42:13.294322       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:18.294259       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:18.294376       7 pool.go:46] "Task submitted to worker pool" task="Invasive Check"
I0808 18:42:18.294417       7 worker.go:16] "Processing task" task="Invasive Check"
I0808 18:42:18.294442       7 healthcheck.go:32] Trying to run an invasive check
I0808 18:42:21.559879       7 healthcheck.go:369] DCGM test completed:
I0808 18:42:21.559929       7 healthcheck.go:384] Observation: sunyibm1.fsl.cs.sunysb.edu Pass 0
I0808 18:42:21.559967       7 healthcheck.go:58] Running health check: ping
I0808 18:42:23.294647       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:28.294057       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:28.294115       7 pool.go:37] "Task already running, skipping submission" task="Invasive Check"
I0808 18:42:31.530661       7 healthcheck.go:272] Ping test completed:
I0808 18:42:31.530736       7 healthcheck.go:301] Unreachable nodes count: 0
I0808 18:42:31.530764       7 healthcheck.go:101] Running health check: gpupower
I0808 18:42:32.079901       7 healthcheck.go:402] Power Throttle check test completed:
I0808 18:42:32.080030       7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 0 0
I0808 18:42:32.080093       7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 1 0
I0808 18:42:32.080161       7 healthcheck.go:135] Total time (s) for all checks: 23.723694282
I0808 18:42:32.080199       7 healthcheck.go:23] Errors after running periodic health checks: false
I0808 18:42:32.118402       7 functions.go:198] Node sunyibm1.fsl.cs.sunysb.edu label found PASS
I0808 18:42:32.157835       7 functions.go:215] Node patched with label
	{
		"metadata": {
			"labels": {
				"autopilot.ibm.com/gpuhealth": "PASS"
			}
		}
	}
I0808 18:42:32.157890       7 worker.go:29] "Task completed" task="Periodic Check"
I0808 18:42:32.209464       7 functions.go:92] GPUs are free. Will run invasive health checks.
I0808 18:42:32.209492       7 healthcheck.go:36] Starting invasive health checks, updating node label =TESTING for node sunyibm1.fsl.cs.sunysb.edu
I0808 18:42:32.231104       7 functions.go:198] Node sunyibm1.fsl.cs.sunysb.edu label found PASS
I0808 18:42:32.285260       7 functions.go:215] Node patched with label
	{
		"metadata": {
			"labels": {
				"autopilot.ibm.com/gpuhealth": "TESTING"
				}
		}
	}
I0808 18:42:32.335944       7 functions.go:176] Try create Job
I0808 18:42:32.377837       7 functions.go:183] Created
I0808 18:42:32.377885       7 worker.go:29] "Task completed" task="Invasive Check"
I0808 18:42:33.294692       7 worker.go:16] "Processing task" task="Periodic Check"
I0808 18:42:33.294728       7 pool.go:46] "Task submitted to worker pool" task="Periodic Check"
I0808 18:42:33.294743       7 healthcheck.go:17] Running a periodic check
I0808 18:42:33.294803       7 healthcheck.go:54] Health checks pciebw,remapped,dcgm,ping,gpupower
I0808 18:42:33.294825       7 healthcheck.go:83] Running health check: pciebw
I0808 18:42:35.089627       7 healthcheck.go:227] GPU PCIe BW test completed:
I0808 18:42:35.089686       7 healthcheck.go:256] Observation: sunyibm1.fsl.cs.sunysb.edu 0 12.3
I0808 18:42:35.089719       7 healthcheck.go:256] Observation: sunyibm1.fsl.cs.sunysb.edu 1 12.3
I0808 18:42:35.089746       7 healthcheck.go:92] Running health check: remapped
I0808 18:42:36.065732       7 healthcheck.go:160] Remapped Rows check test completed:
I0808 18:42:36.065792       7 healthcheck.go:183] Observation: sunyibm1.fsl.cs.sunysb.edu 0 0
I0808 18:42:36.065857       7 healthcheck.go:183] Observation: sunyibm1.fsl.cs.sunysb.edu 1 0
I0808 18:42:36.065888       7 healthcheck.go:74] Running health check: dcgm -r 1
I0808 18:42:38.294064       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:38.294142       7 pool.go:46] "Task submitted to worker pool" task="Invasive Check"
I0808 18:42:38.294170       7 worker.go:16] "Processing task" task="Invasive Check"
I0808 18:42:38.294185       7 healthcheck.go:32] Trying to run an invasive check
I0808 18:42:43.293752       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:46.216300       7 healthcheck.go:369] DCGM test completed:
I0808 18:42:46.216393       7 healthcheck.go:384] Observation: sunyibm1.fsl.cs.sunysb.edu Pass 0
I0808 18:42:46.216448       7 healthcheck.go:58] Running health check: ping
I0808 18:42:48.294429       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:48.294506       7 pool.go:37] "Task already running, skipping submission" task="Invasive Check"
I0808 18:42:53.294072       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:56.156657       7 healthcheck.go:272] Ping test completed:
I0808 18:42:56.156816       7 healthcheck.go:301] Unreachable nodes count: 0
I0808 18:42:56.156868       7 healthcheck.go:101] Running health check: gpupower
I0808 18:42:56.733130       7 healthcheck.go:402] Power Throttle check test completed:
I0808 18:42:56.733192       7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 0 0
I0808 18:42:56.733227       7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 1 0
I0808 18:42:56.733262       7 healthcheck.go:135] Total time (s) for all checks: 23.438443725
I0808 18:42:56.733308       7 healthcheck.go:23] Errors after running periodic health checks: false
I0808 18:42:56.790103       7 functions.go:198] Node sunyibm1.fsl.cs.sunysb.edu label found PASS
I0808 18:42:56.981930       7 functions.go:215] Node patched with label
	{
		"metadata": {
			"labels": {
				"autopilot.ibm.com/gpuhealth": "PASS"
			}
		}
	}
I0808 18:42:56.982014       7 worker.go:29] "Task completed" task="Periodic Check"
I0808 18:42:57.222418       7 functions.go:88] Pod dcgm-7c630e-fnnvh with requests 8 and limits 8. Cannot run invasive health checks.
I0808 18:42:57.222464       7 worker.go:29] "Task completed" task="Invasive Check"
I0808 18:42:58.293830       7 pool.go:46] "Task submitted to worker pool" task="Periodic Check"
I0808 18:42:58.293887       7 pool.go:46] "Task submitted to worker pool" task="Invasive Check"
I0808 18:42:58.293912       7 worker.go:16] "Processing task" task="Invasive Check"
I0808 18:42:58.293926       7 healthcheck.go:32] Trying to run an invasive check
I0808 18:42:58.293941       7 worker.go:16] "Processing task" task="Periodic Check"
I0808 18:42:58.294011       7 healthcheck.go:17] Running a periodic check
I0808 18:42:58.353876       7 functions.go:88] Pod dcgm-7c630e-fnnvh with requests 8 and limits 8. Cannot run invasive health checks.
I0808 18:42:58.353952       7 worker.go:29] "Task completed" task="Invasive Check"
I0808 18:42:58.354000       7 healthcheck.go:54] Health checks pciebw,remapped,dcgm,ping,gpupower
I0808 18:42:58.354023       7 healthcheck.go:83] Running health check: pciebw
I0808 18:43:00.141598       7 healthcheck.go:227] GPU PCIe BW test completed:
I0808 18:43:00.141667       7 healthcheck.go:256] Observation: sunyibm1.fsl.cs.sunysb.edu 0 12.3
I0808 18:43:00.141703       7 healthcheck.go:256] Observation: sunyibm1.fsl.cs.sunysb.edu 1 12.3
I0808 18:43:00.141730       7 healthcheck.go:92] Running health check: remapped
I0808 18:43:01.455650       7 healthcheck.go:160] Remapped Rows check test completed:
I0808 18:43:01.455713       7 healthcheck.go:183] Observation: sunyibm1.fsl.cs.sunysb.edu 0 0
I0808 18:43:01.455750       7 healthcheck.go:183] Observation: sunyibm1.fsl.cs.sunysb.edu 1 0
I0808 18:43:01.455778       7 healthcheck.go:74] Running health check: dcgm -r 1
I0808 18:43:03.294621       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:43:08.294153       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:43:08.294227       7 pool.go:46] "Task submitted to worker pool" task="Invasive Check"
I0808 18:43:08.294254       7 worker.go:16] "Processing task" task="Invasive Check"
I0808 18:43:08.294269       7 healthcheck.go:32] Trying to run an invasive check
I0808 18:43:11.616314       7 healthcheck.go:369] DCGM test completed:
I0808 18:43:11.616377       7 healthcheck.go:384] Observation: sunyibm1.fsl.cs.sunysb.edu Pass 0
I0808 18:43:11.616417       7 healthcheck.go:58] Running health check: ping
I0808 18:43:12.305226       7 healthcheck.go:272] Ping test completed:
I0808 18:43:12.305285       7 healthcheck.go:301] Unreachable nodes count: 0
I0808 18:43:12.305317       7 healthcheck.go:101] Running health check: gpupower
I0808 18:43:12.881291       7 healthcheck.go:402] Power Throttle check test completed:
I0808 18:43:12.881372       7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 0 0
I0808 18:43:12.881408       7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 1 0
I0808 18:43:12.881445       7 healthcheck.go:135] Total time (s) for all checks: 14.527429094
I0808 18:43:12.881468       7 healthcheck.go:23] Errors after running periodic health checks: false
I0808 18:43:12.892999       7 functions.go:198] Node sunyibm1.fsl.cs.sunysb.edu label found PASS
I0808 18:43:12.942649       7 functions.go:215] Node patched with label
	{
		"metadata": {
			"labels": {
				"autopilot.ibm.com/gpuhealth": "PASS"
			}
		}
	}
I0808 18:43:12.942708       7 worker.go:29] "Task completed" task="Periodic Check"
I0808 18:43:12.995145       7 functions.go:92] GPUs are free. Will run invasive health checks.
I0808 18:43:12.995169       7 healthcheck.go:36] Starting invasive health checks, updating node label =TESTING for node sunyibm1.fsl.cs.sunysb.edu
I0808 18:43:13.005755       7 functions.go:198] Node sunyibm1.fsl.cs.sunysb.edu label found PASS
I0808 18:43:13.095461       7 functions.go:215] Node patched with label
	{
		"metadata": {
			"labels": {
				"autopilot.ibm.com/gpuhealth": "TESTING"
				}
		}
	}
I0808 18:43:13.146311       7 functions.go:176] Try create Job
I0808 18:43:13.178288       7 functions.go:183] Created
I0808 18:43:13.178363       7 worker.go:29] "Task completed" task="Invasive Check"
I0808 18:43:13.294709       7 pool.go:46] "Task submitted to worker pool" task="Periodic Check"
I0808 18:43:13.294799       7 worker.go:16] "Processing task" task="Periodic Check"
I0808 18:43:13.294846       7 healthcheck.go:17] Running a periodic check
I0808 18:43:13.294868       7 healthcheck.go:54] Health checks pciebw,remapped,dcgm,ping,gpupower
I0808 18:43:13.294890       7 healthcheck.go:83] Running health check: pciebw
I0808 18:43:15.062640       7 healthcheck.go:227] GPU PCIe BW test completed:
I0808 18:43:15.062699       7 healthcheck.go:256] Observation: sunyibm1.fsl.cs.sunysb.edu 0 12.3
I0808 18:43:15.062732       7 healthcheck.go:256] Observation: sunyibm1.fsl.cs.sunysb.edu 1 12.3
I0808 18:43:15.062763       7 healthcheck.go:92] Running health check: remapped
I0808 18:43:16.287636       7 healthcheck.go:160] Remapped Rows check test completed:
I0808 18:43:16.287697       7 healthcheck.go:183] Observation: sunyibm1.fsl.cs.sunysb.edu 0 0
I0808 18:43:16.287735       7 healthcheck.go:183] Observation: sunyibm1.fsl.cs.sunysb.edu 1 0
I0808 18:43:16.287766       7 healthcheck.go:74] Running health check: dcgm -r 1
I0808 18:43:18.294665       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:43:18.294734       7 pool.go:46] "Task submitted to worker pool" task="Invasive Check"
I0808 18:43:18.294758       7 worker.go:16] "Processing task" task="Invasive Check"
I0808 18:43:18.294773       7 healthcheck.go:32] Trying to run an invasive check
I0808 18:43:23.294448       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:43:26.485120       7 healthcheck.go:369] DCGM test completed:
I0808 18:43:26.485170       7 healthcheck.go:384] Observation: sunyibm1.fsl.cs.sunysb.edu Pass 0
I0808 18:43:26.485206       7 healthcheck.go:58] Running health check: ping
I0808 18:43:27.171840       7 healthcheck.go:272] Ping test completed:
I0808 18:43:27.171897       7 healthcheck.go:301] Unreachable nodes count: 0
I0808 18:43:27.171926       7 healthcheck.go:101] Running health check: gpupower
I0808 18:43:27.743372       7 healthcheck.go:402] Power Throttle check test completed:
I0808 18:43:27.743432       7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 0 0
I0808 18:43:27.743468       7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 1 0
I0808 18:43:27.743506       7 healthcheck.go:135] Total time (s) for all checks: 14.448620701
I0808 18:43:27.743530       7 healthcheck.go:23] Errors after running periodic health checks: false
I0808 18:43:27.754183       7 functions.go:198] Node sunyibm1.fsl.cs.sunysb.edu label found PASS
I0808 18:43:27.786991       7 functions.go:215] Node patched with label
	{
		"metadata": {
			"labels": {
				"autopilot.ibm.com/gpuhealth": "PASS"
			}
		}
	}
❯ kubectl logs -n autopilot autopilot-c7jcz
Defaulted container "autopilot" out of: autopilot, device-plugin-validation (init)
I0808 18:41:18.239356       7 prometheus.go:46] CPU_MODEL: IntelR Xeon(R) Silver 4114 CPU @ 2.20GHz
I0808 18:41:18.293057       7 prometheus.go:60] GPU_MODEL: NVIDIA RTX A5000
I0808 18:41:18.293128       7 global.go:41] Init entry map pciebw
I0808 18:41:18.293146       7 global.go:41] Init entry map remapped
I0808 18:41:18.293157       7 global.go:41] Init entry map dcgm
I0808 18:41:18.293168       7 global.go:41] Init entry map ping
I0808 18:41:18.293180       7 global.go:41] Init entry map gpupower
I0808 18:41:18.293366       7 main.go:137] Starting WorkerPool with 2 workers
I0808 18:41:18.293382       7 main.go:60] Serving metrics on :8081
I0808 18:41:18.293482       7 main.go:105] Serving Health Checks on port :3333
I0808 18:41:18.293549       7 pool.go:46] "Task submitted to worker pool" task="Periodic Check"
I0808 18:41:18.293479       7 worker.go:16] "Processing task" task="Periodic Check"
I0808 18:41:18.293619       7 healthcheck.go:17] Running a periodic check
I0808 18:41:18.293645       7 healthcheck.go:54] Health checks pciebw,remapped,dcgm,ping,gpupower
I0808 18:41:18.293668       7 healthcheck.go:83] Running health check: pciebw
I0808 18:41:18.293799       7 main.go:72] Serving Readiness Probe on :8080
I0808 18:41:20.056765       7 healthcheck.go:227] GPU PCIe BW test completed:
I0808 18:41:20.056874       7 healthcheck.go:256] Observation: sunyibm1.fsl.cs.sunysb.edu 0 12.3
I0808 18:41:20.056920       7 healthcheck.go:256] Observation: sunyibm1.fsl.cs.sunysb.edu 1 12.3
I0808 18:41:20.056952       7 healthcheck.go:92] Running health check: remapped
I0808 18:41:21.520893       7 healthcheck.go:160] Remapped Rows check test completed:
I0808 18:41:21.521019       7 healthcheck.go:183] Observation: sunyibm1.fsl.cs.sunysb.edu 0 0
I0808 18:41:21.521073       7 healthcheck.go:183] Observation: sunyibm1.fsl.cs.sunysb.edu 1 0
I0808 18:41:21.521114       7 healthcheck.go:74] Running health check: dcgm -r 1
I0808 18:41:23.294766       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:41:28.294579       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:41:28.294654       7 pool.go:46] "Task submitted to worker pool" task="Invasive Check"
I0808 18:41:28.294705       7 worker.go:16] "Processing task" task="Invasive Check"
I0808 18:41:28.294720       7 healthcheck.go:32] Trying to run an invasive check
I0808 18:41:31.737376       7 healthcheck.go:369] DCGM test completed:
I0808 18:41:31.737432       7 healthcheck.go:384] Observation: sunyibm1.fsl.cs.sunysb.edu Pass 0
I0808 18:41:31.737486       7 healthcheck.go:58] Running health check: ping
I0808 18:41:33.293785       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:41:38.294350       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:41:38.294406       7 pool.go:37] "Task already running, skipping submission" task="Invasive Check"
I0808 18:41:41.649157       7 healthcheck.go:272] Ping test completed:
I0808 18:41:41.649273       7 healthcheck.go:301] Unreachable nodes count: 0
I0808 18:41:41.649304       7 healthcheck.go:101] Running health check: gpupower
I0808 18:41:42.231243       7 healthcheck.go:402] Power Throttle check test completed:
I0808 18:41:42.231304       7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 0 0
I0808 18:41:42.231351       7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 1 0
I0808 18:41:42.231411       7 healthcheck.go:135] Total time (s) for all checks: 23.937731028
I0808 18:41:42.231434       7 healthcheck.go:23] Errors after running periodic health checks: false
I0808 18:41:42.245933       7 functions.go:198] Node sunyibm1.fsl.cs.sunysb.edu label found PASS
I0808 18:41:42.284512       7 functions.go:215] Node patched with label
	{
		"metadata": {
			"labels": {
				"autopilot.ibm.com/gpuhealth": "PASS"
			}
		}
	}
I0808 18:41:42.284577       7 worker.go:29] "Task completed" task="Periodic Check"
I0808 18:41:42.349721       7 functions.go:92] GPUs are free. Will run invasive health checks.
I0808 18:41:42.349790       7 healthcheck.go:36] Starting invasive health checks, updating node label =TESTING for node sunyibm1.fsl.cs.sunysb.edu
I0808 18:41:42.362566       7 functions.go:198] Node sunyibm1.fsl.cs.sunysb.edu label found PASS
I0808 18:41:42.409251       7 functions.go:215] Node patched with label
	{
		"metadata": {
			"labels": {
				"autopilot.ibm.com/gpuhealth": "TESTING"
				}
		}
	}
I0808 18:41:42.447031       7 functions.go:176] Try create Job
I0808 18:41:42.503700       7 functions.go:183] Created
I0808 18:41:42.504044       7 worker.go:29] "Task completed" task="Invasive Check"
I0808 18:41:43.294433       7 pool.go:46] "Task submitted to worker pool" task="Periodic Check"
I0808 18:41:43.294477       7 worker.go:16] "Processing task" task="Periodic Check"
I0808 18:41:43.294554       7 healthcheck.go:17] Running a periodic check
I0808 18:41:43.294579       7 healthcheck.go:54] Health checks pciebw,remapped,dcgm,ping,gpupower
I0808 18:41:43.294604       7 healthcheck.go:83] Running health check: pciebw
I0808 18:41:45.044403       7 healthcheck.go:227] GPU PCIe BW test completed:
I0808 18:41:45.044478       7 healthcheck.go:256] Observation: sunyibm1.fsl.cs.sunysb.edu 0 12.3
I0808 18:41:45.044515       7 healthcheck.go:256] Observation: sunyibm1.fsl.cs.sunysb.edu 1 12.3
I0808 18:41:45.044542       7 healthcheck.go:92] Running health check: remapped
I0808 18:41:46.556952       7 healthcheck.go:160] Remapped Rows check test completed:
I0808 18:41:46.557036       7 healthcheck.go:183] Observation: sunyibm1.fsl.cs.sunysb.edu 0 0
I0808 18:41:46.557074       7 healthcheck.go:183] Observation: sunyibm1.fsl.cs.sunysb.edu 1 0
I0808 18:41:46.557130       7 healthcheck.go:74] Running health check: dcgm -r 1
I0808 18:41:48.294632       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:41:48.294734       7 pool.go:46] "Task submitted to worker pool" task="Invasive Check"
I0808 18:41:48.294772       7 worker.go:16] "Processing task" task="Invasive Check"
I0808 18:41:48.294801       7 healthcheck.go:32] Trying to run an invasive check
I0808 18:41:53.294340       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:41:56.830168       7 healthcheck.go:369] DCGM test completed:
I0808 18:41:56.830228       7 healthcheck.go:384] Observation: sunyibm1.fsl.cs.sunysb.edu Pass 0
I0808 18:41:56.830269       7 healthcheck.go:58] Running health check: ping
I0808 18:41:58.294556       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:41:58.294638       7 pool.go:37] "Task already running, skipping submission" task="Invasive Check"
I0808 18:42:03.294128       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:06.639041       7 healthcheck.go:272] Ping test completed:
I0808 18:42:06.639128       7 healthcheck.go:301] Unreachable nodes count: 0
I0808 18:42:06.639187       7 healthcheck.go:101] Running health check: gpupower
I0808 18:42:07.212563       7 healthcheck.go:402] Power Throttle check test completed:
I0808 18:42:07.212636       7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 0 0
I0808 18:42:07.212694       7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 1 0
I0808 18:42:07.212767       7 healthcheck.go:135] Total time (s) for all checks: 23.918143672
I0808 18:42:07.212812       7 healthcheck.go:23] Errors after running periodic health checks: false
I0808 18:42:07.228511       7 functions.go:198] Node sunyibm1.fsl.cs.sunysb.edu label found PASS
I0808 18:42:07.267860       7 functions.go:215] Node patched with label
	{
		"metadata": {
			"labels": {
				"autopilot.ibm.com/gpuhealth": "PASS"
			}
		}
	}
I0808 18:42:07.267915       7 worker.go:29] "Task completed" task="Periodic Check"
I0808 18:42:07.321888       7 functions.go:88] Pod dcgm-f7fad6-h469z with requests 8 and limits 8. Cannot run invasive health checks.
I0808 18:42:07.321929       7 worker.go:29] "Task completed" task="Invasive Check"
I0808 18:42:08.294112       7 pool.go:46] "Task submitted to worker pool" task="Periodic Check"
I0808 18:42:08.294166       7 pool.go:46] "Task submitted to worker pool" task="Invasive Check"
I0808 18:42:08.294190       7 worker.go:16] "Processing task" task="Invasive Check"
I0808 18:42:08.294211       7 healthcheck.go:32] Trying to run an invasive check
I0808 18:42:08.294207       7 worker.go:16] "Processing task" task="Periodic Check"
I0808 18:42:08.294249       7 healthcheck.go:17] Running a periodic check
I0808 18:42:08.356359       7 functions.go:88] Pod dcgm-f7fad6-h469z with requests 8 and limits 8. Cannot run invasive health checks.
I0808 18:42:08.356414       7 worker.go:29] "Task completed" task="Invasive Check"
I0808 18:42:08.356436       7 healthcheck.go:54] Health checks pciebw,remapped,dcgm,ping,gpupower
I0808 18:42:08.356450       7 healthcheck.go:83] Running health check: pciebw
I0808 18:42:10.127896       7 healthcheck.go:227] GPU PCIe BW test completed:
I0808 18:42:10.127997       7 healthcheck.go:256] Observation: sunyibm1.fsl.cs.sunysb.edu 0 12.3
I0808 18:42:10.128035       7 healthcheck.go:256] Observation: sunyibm1.fsl.cs.sunysb.edu 1 12.3
I0808 18:42:10.128062       7 healthcheck.go:92] Running health check: remapped
I0808 18:42:11.363142       7 healthcheck.go:160] Remapped Rows check test completed:
I0808 18:42:11.363236       7 healthcheck.go:183] Observation: sunyibm1.fsl.cs.sunysb.edu 0 0
I0808 18:42:11.363274       7 healthcheck.go:183] Observation: sunyibm1.fsl.cs.sunysb.edu 1 0
I0808 18:42:11.363301       7 healthcheck.go:74] Running health check: dcgm -r 1
I0808 18:42:13.294322       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:18.294259       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:18.294376       7 pool.go:46] "Task submitted to worker pool" task="Invasive Check"
I0808 18:42:18.294417       7 worker.go:16] "Processing task" task="Invasive Check"
I0808 18:42:18.294442       7 healthcheck.go:32] Trying to run an invasive check
I0808 18:42:21.559879       7 healthcheck.go:369] DCGM test completed:
I0808 18:42:21.559929       7 healthcheck.go:384] Observation: sunyibm1.fsl.cs.sunysb.edu Pass 0
I0808 18:42:21.559967       7 healthcheck.go:58] Running health check: ping
I0808 18:42:23.294647       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:28.294057       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:28.294115       7 pool.go:37] "Task already running, skipping submission" task="Invasive Check"
I0808 18:42:31.530661       7 healthcheck.go:272] Ping test completed:
I0808 18:42:31.530736       7 healthcheck.go:301] Unreachable nodes count: 0
I0808 18:42:31.530764       7 healthcheck.go:101] Running health check: gpupower
I0808 18:42:32.079901       7 healthcheck.go:402] Power Throttle check test completed:
I0808 18:42:32.080030       7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 0 0
I0808 18:42:32.080093       7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 1 0
I0808 18:42:32.080161       7 healthcheck.go:135] Total time (s) for all checks: 23.723694282
I0808 18:42:32.080199       7 healthcheck.go:23] Errors after running periodic health checks: false
I0808 18:42:32.118402       7 functions.go:198] Node sunyibm1.fsl.cs.sunysb.edu label found PASS
I0808 18:42:32.157835       7 functions.go:215] Node patched with label
	{
		"metadata": {
			"labels": {
				"autopilot.ibm.com/gpuhealth": "PASS"
			}
		}
	}
I0808 18:42:32.157890       7 worker.go:29] "Task completed" task="Periodic Check"
I0808 18:42:32.209464       7 functions.go:92] GPUs are free. Will run invasive health checks.
I0808 18:42:32.209492       7 healthcheck.go:36] Starting invasive health checks, updating node label =TESTING for node sunyibm1.fsl.cs.sunysb.edu
I0808 18:42:32.231104       7 functions.go:198] Node sunyibm1.fsl.cs.sunysb.edu label found PASS
I0808 18:42:32.285260       7 functions.go:215] Node patched with label
	{
		"metadata": {
			"labels": {
				"autopilot.ibm.com/gpuhealth": "TESTING"
				}
		}
	}
I0808 18:42:32.335944       7 functions.go:176] Try create Job
I0808 18:42:32.377837       7 functions.go:183] Created
I0808 18:42:32.377885       7 worker.go:29] "Task completed" task="Invasive Check"
I0808 18:42:33.294692       7 worker.go:16] "Processing task" task="Periodic Check"
I0808 18:42:33.294728       7 pool.go:46] "Task submitted to worker pool" task="Periodic Check"
I0808 18:42:33.294743       7 healthcheck.go:17] Running a periodic check
I0808 18:42:33.294803       7 healthcheck.go:54] Health checks pciebw,remapped,dcgm,ping,gpupower
I0808 18:42:33.294825       7 healthcheck.go:83] Running health check: pciebw
I0808 18:42:35.089627       7 healthcheck.go:227] GPU PCIe BW test completed:
I0808 18:42:35.089686       7 healthcheck.go:256] Observation: sunyibm1.fsl.cs.sunysb.edu 0 12.3
I0808 18:42:35.089719       7 healthcheck.go:256] Observation: sunyibm1.fsl.cs.sunysb.edu 1 12.3
I0808 18:42:35.089746       7 healthcheck.go:92] Running health check: remapped
I0808 18:42:36.065732       7 healthcheck.go:160] Remapped Rows check test completed:
I0808 18:42:36.065792       7 healthcheck.go:183] Observation: sunyibm1.fsl.cs.sunysb.edu 0 0
I0808 18:42:36.065857       7 healthcheck.go:183] Observation: sunyibm1.fsl.cs.sunysb.edu 1 0
I0808 18:42:36.065888       7 healthcheck.go:74] Running health check: dcgm -r 1
I0808 18:42:38.294064       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:38.294142       7 pool.go:46] "Task submitted to worker pool" task="Invasive Check"
I0808 18:42:38.294170       7 worker.go:16] "Processing task" task="Invasive Check"
I0808 18:42:38.294185       7 healthcheck.go:32] Trying to run an invasive check
I0808 18:42:43.293752       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:46.216300       7 healthcheck.go:369] DCGM test completed:
I0808 18:42:46.216393       7 healthcheck.go:384] Observation: sunyibm1.fsl.cs.sunysb.edu Pass 0
I0808 18:42:46.216448       7 healthcheck.go:58] Running health check: ping
I0808 18:42:48.294429       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:48.294506       7 pool.go:37] "Task already running, skipping submission" task="Invasive Check"
I0808 18:42:53.294072       7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:56.156657       7 healthcheck.go:272] Ping test completed:
I0808 18:42:56.156816       7 healthcheck.go:301] Unreachable nodes count: 0
I0808 18:42:56.156868       7 healthcheck.go:101] Running health check: gpupower
I0808 18:42:56.733130       7 healthcheck.go:402] Power Throttle check test completed:
I0808 18:42:56.733192       7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 0 0
I0808 18:42:56.733227       7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 1 0
I0808 18:42:56.733262       7 healthcheck.go:135] Total time (s) for all checks: 23.438443725
I0808 18:42:56.733308       7 healthcheck.go:23] Errors after running periodic health checks: false
I0808 18:42:56.790103       7 functions.go:198] Node sunyibm1.fsl.cs.sunysb.edu label found PASS
I0808 18:42:56.981930       7 functions.go:215] Node patched with label
	{
		"metadata": {
			"labels": {
				"autopilot.ibm.com/gpuhealth": "PASS"
			}
		}
	}

@cmisale
Contributor

cmisale commented Aug 11, 2025

I read the code and comments.
I think this might be overkill. A worker pool is probably not the best fit, because we don't need to run anything in parallel and we don't want to queue tasks either.

A wait group of size one would probably be enough.

Also, the current design only covers periodic checks. The same checks should also be in place when a health check is invoked manually. This is a stretch goal.
Before jumping into coding, let's take a step back and evaluate the solutions.
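
For illustration only, here is a minimal sketch of the "run at most one, never queue" behavior described above. It swaps in an atomic flag (`atomic.Bool`, Go 1.19 or newer) in place of the wait group mentioned in the comment, and the names `singleRunner` and `TryRun` are hypothetical; nothing here is taken from the Autopilot code base.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// singleRunner (hypothetical name) lets at most one invocation run at a time.
// Calls that arrive while a run is in flight are dropped, not queued.
type singleRunner struct {
	running atomic.Bool
}

// TryRun executes fn unless a previous call is still running.
// It reports whether fn was actually executed.
func (s *singleRunner) TryRun(fn func()) bool {
	if !s.running.CompareAndSwap(false, true) {
		return false // another run is in flight: skip this one
	}
	defer s.running.Store(false)
	fn()
	return true
}

func main() {
	var guard singleRunner
	started := make(chan struct{})
	release := make(chan struct{})

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		guard.TryRun(func() {
			close(started)
			<-release // stands in for a slow periodic check
		})
	}()

	<-started
	fmt.Println("second attempt ran:", guard.TryRun(func() {}))

	close(release)
	wg.Wait()
	fmt.Println("attempt after completion ran:", guard.TryRun(func() {}))
}
```

If the same guard value were shared between the ticker loop and a manual trigger, both entry points would be subject to the same overlap rule. That wiring is an assumption and is not shown in this thread.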

@amirhnajafiz
Collaborator Author

I agree, we only have two tasks right now. I have a simpler idea: we can reset the timers after each task finishes.

for {
	select {
	case <-periodicChecksTicker.C:
		healthcheck.PeriodicCheck()
		// reset periodic timer
		periodicChecksTicker.Stop()
		periodicChecksTicker = time.NewTicker(repeatDuration)

	case <-invasiveChecksTicker.C:
		if invasiveDuration > 0 {
			healthcheck.InvasiveCheck()
			// reset invasive timer
			invasiveChecksTicker.Stop()
			invasiveChecksTicker = time.NewTicker(invasiveDuration)
		}
	}
}

@amirhnajafiz
Collaborator Author

In this reset case:

  • The old ticker’s channel might still have an unread tick if the task took longer than repeatDuration.
  • When you stop the ticker and create a new one, you don't read that leftover tick; it is abandoned along with the old ticker object.
  • The new ticker starts fresh with an empty channel and will send its first tick after repeatDuration.

@cmisale
Contributor

cmisale commented Aug 11, 2025

As the code stands, in both cases the timer is stopped and restarted a few ms after the goroutine is spawned. Other than that, it might work for periodic checks if implemented correctly (e.g., with a wait group), but not for invasive checks. The invasive check runs in a separate pod, and the timer has no connection to that other job.

@amirhnajafiz
Collaborator Author

Good point. I didn't know that about the invasive checks.

@amirhnajafiz amirhnajafiz closed this by deleting the head repository Nov 17, 2025