Add a worker pool for health check go-routines #78
Signed-off-by: amirhnajafiz <najafizadeh21@gmail.com>
… autopilot-daemon/pkg/cmd/main.go Signed-off-by: amirhnajafiz <najafizadeh21@gmail.com>
@cmisale I built a new image (
I tested the new Autopilot by setting periodic checks to run every 5 seconds and invasive checks every 10 seconds so they would overlap.
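For readers following along, the skip-on-overlap behavior this test exercises can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code in this PR: `Pool`, `NewPool`, `Submit`, and `Close` are hypothetical names, the messages are printed with `fmt` purely for illustration, and the real `pool.go` and `worker.go` may be structured differently.

```go
package main

import (
	"fmt"
	"sync"
)

// task pairs a name (used for de-duplication) with the work to run.
// Hypothetical type, not taken from the PR.
type task struct {
	name string
	run  func()
}

// Pool is a fixed-size worker pool that refuses to queue a task whose
// name is already queued or running.
type Pool struct {
	mu      sync.Mutex
	running map[string]bool
	queue   chan task
	wg      sync.WaitGroup
}

// NewPool starts the given number of worker goroutines.
func NewPool(workers int) *Pool {
	p := &Pool{
		running: make(map[string]bool),
		queue:   make(chan task, workers),
	}
	for i := 0; i < workers; i++ {
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			for t := range p.queue {
				t.run()
				// Clear the in-flight marker so the next tick can resubmit.
				p.mu.Lock()
				delete(p.running, t.name)
				p.mu.Unlock()
			}
		}()
	}
	return p
}

// Submit queues the task and reports true. It reports false when a task
// with the same name is still in flight or when the queue is full.
// It must not be called after Close.
func (p *Pool) Submit(name string, run func()) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.running[name] {
		fmt.Printf("Task already running, skipping submission task=%q\n", name)
		return false
	}
	select {
	case p.queue <- task{name: name, run: run}:
		p.running[name] = true
		fmt.Printf("Task submitted to worker pool task=%q\n", name)
		return true
	default:
		return false // queue full; the caller can retry on its next tick
	}
}

// Close stops accepting tasks and waits for in-flight tasks to finish.
func (p *Pool) Close() {
	close(p.queue)
	p.wg.Wait()
}
```

With a pool like this, calling `Submit("Periodic Check", ...)` from a 5-second ticker and `Submit("Invasive Check", ...)` from a 10-second ticker should give a pattern similar to the logs: one accepted submission per task, then skipped submissions on every tick until that task completes.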
Logs
grep
❯ kubectl logs -n autopilot autopilot-xfxwp | grep -e "Task completed" -e "Task submitted to worker pool" -e "Processing task" -e "ask already running, skipping submission"
Defaulted container "autopilot" out of: autopilot, device-plugin-validation (init)
I0808 18:43:57.376711 7 worker.go:16] "Processing task" task="Periodic Check"
I0808 18:43:57.377055 7 pool.go:46] "Task submitted to worker pool" task="Periodic Check"
I0808 18:44:02.377562 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:44:07.377602 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:44:07.377696 7 pool.go:46] "Task submitted to worker pool" task="Invasive Check"
I0808 18:44:07.377747 7 worker.go:16] "Processing task" task="Invasive Check"
I0808 18:44:12.378067 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:44:17.377673 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:44:17.377774 7 pool.go:37] "Task already running, skipping submission" task="Invasive Check"
I0808 18:44:22.378191 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:44:23.618842 7 worker.go:29] "Task completed" task="Periodic Check"
I0808 18:44:23.798301 7 worker.go:29] "Task completed" task="Invasive Check"
I0808 18:44:27.378085 7 worker.go:16] "Processing task" task="Periodic Check"
I0808 18:44:27.378105 7 pool.go:46] "Task submitted to worker pool" task="Periodic Check"
I0808 18:44:27.378217 7 pool.go:46] "Task submitted to worker pool" task="Invasive Check"
I0808 18:44:27.378249 7 worker.go:16] "Processing task" task="Invasive Check"
I0808 18:44:32.377740 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:44:37.377669 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:44:37.377744 7 pool.go:37] "Task already running, skipping submission" task="Invasive Check"
I0808 18:44:42.378248 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:44:47.378143 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:44:47.378200 7 pool.go:37] "Task already running, skipping submission" task="Invasive Check"
I0808 18:44:52.377532 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:44:53.572517 7 worker.go:29] "Task completed" task="Periodic Check"
I0808 18:44:53.642585 7 worker.go:29] "Task completed" task="Invasive Check"
I0808 18:44:57.377326 7 pool.go:46] "Task submitted to worker pool" task="Periodic Check"
I0808 18:44:57.377381 7 pool.go:46] "Task submitted to worker pool" task="Invasive Check"
I0808 18:44:57.377405 7 worker.go:16] "Processing task" task="Invasive Check"
I0808 18:44:57.377461 7 worker.go:16] "Processing task" task="Periodic Check"
I0808 18:44:57.607345 7 worker.go:29] "Task completed" task="Invasive Check"
❯ kubectl logs -n autopilot autopilot-c7jcz | grep -e "Task completed" -e "Task submitted to worker pool" -e "Processing task" -e "ask already running, skipping submission"
Defaulted container "autopilot" out of: autopilot, device-plugin-validation (init)
I0808 18:41:18.293549 7 pool.go:46] "Task submitted to worker pool" task="Periodic Check"
I0808 18:41:18.293479 7 worker.go:16] "Processing task" task="Periodic Check"
I0808 18:41:23.294766 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:41:28.294579 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:41:28.294654 7 pool.go:46] "Task submitted to worker pool" task="Invasive Check"
I0808 18:41:28.294705 7 worker.go:16] "Processing task" task="Invasive Check"
I0808 18:41:33.293785 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:41:38.294350 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:41:38.294406 7 pool.go:37] "Task already running, skipping submission" task="Invasive Check"
I0808 18:41:42.284577 7 worker.go:29] "Task completed" task="Periodic Check"
I0808 18:41:42.504044 7 worker.go:29] "Task completed" task="Invasive Check"
I0808 18:41:43.294433 7 pool.go:46] "Task submitted to worker pool" task="Periodic Check"
I0808 18:41:43.294477 7 worker.go:16] "Processing task" task="Periodic Check"
I0808 18:41:48.294632 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:41:48.294734 7 pool.go:46] "Task submitted to worker pool" task="Invasive Check"
I0808 18:41:48.294772 7 worker.go:16] "Processing task" task="Invasive Check"
I0808 18:41:53.294340 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:41:58.294556 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:41:58.294638 7 pool.go:37] "Task already running, skipping submission" task="Invasive Check"
I0808 18:42:03.294128 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:07.267915 7 worker.go:29] "Task completed" task="Periodic Check"
I0808 18:42:07.321929 7 worker.go:29] "Task completed" task="Invasive Check"
I0808 18:42:08.294112 7 pool.go:46] "Task submitted to worker pool" task="Periodic Check"
I0808 18:42:08.294166 7 pool.go:46] "Task submitted to worker pool" task="Invasive Check"
I0808 18:42:08.294190 7 worker.go:16] "Processing task" task="Invasive Check"
I0808 18:42:08.294207 7 worker.go:16] "Processing task" task="Periodic Check"
I0808 18:42:08.356414 7 worker.go:29] "Task completed" task="Invasive Check"
raw
❯ kubectl logs -n autopilot autopilot-c7jcz
Defaulted container "autopilot" out of: autopilot, device-plugin-validation (init)
I0808 18:41:18.239356 7 prometheus.go:46] CPU_MODEL: IntelR Xeon(R) Silver 4114 CPU @ 2.20GHz
I0808 18:41:18.293057 7 prometheus.go:60] GPU_MODEL: NVIDIA RTX A5000
I0808 18:41:18.293128 7 global.go:41] Init entry map pciebw
I0808 18:41:18.293146 7 global.go:41] Init entry map remapped
I0808 18:41:18.293157 7 global.go:41] Init entry map dcgm
I0808 18:41:18.293168 7 global.go:41] Init entry map ping
I0808 18:41:18.293180 7 global.go:41] Init entry map gpupower
I0808 18:41:18.293366 7 main.go:137] Starting WorkerPool with 2 workers
I0808 18:41:18.293382 7 main.go:60] Serving metrics on :8081
I0808 18:41:18.293482 7 main.go:105] Serving Health Checks on port :3333
I0808 18:41:18.293549 7 pool.go:46] "Task submitted to worker pool" task="Periodic Check"
I0808 18:41:18.293479 7 worker.go:16] "Processing task" task="Periodic Check"
I0808 18:41:18.293619 7 healthcheck.go:17] Running a periodic check
I0808 18:41:18.293645 7 healthcheck.go:54] Health checks pciebw,remapped,dcgm,ping,gpupower
I0808 18:41:18.293668 7 healthcheck.go:83] Running health check: pciebw
I0808 18:41:18.293799 7 main.go:72] Serving Readiness Probe on :8080
I0808 18:41:20.056765 7 healthcheck.go:227] GPU PCIe BW test completed:
I0808 18:41:20.056874 7 healthcheck.go:256] Observation: sunyibm1.fsl.cs.sunysb.edu 0 12.3
I0808 18:41:20.056920 7 healthcheck.go:256] Observation: sunyibm1.fsl.cs.sunysb.edu 1 12.3
I0808 18:41:20.056952 7 healthcheck.go:92] Running health check: remapped
I0808 18:41:21.520893 7 healthcheck.go:160] Remapped Rows check test completed:
I0808 18:41:21.521019 7 healthcheck.go:183] Observation: sunyibm1.fsl.cs.sunysb.edu 0 0
I0808 18:41:21.521073 7 healthcheck.go:183] Observation: sunyibm1.fsl.cs.sunysb.edu 1 0
I0808 18:41:21.521114 7 healthcheck.go:74] Running health check: dcgm -r 1
I0808 18:41:23.294766 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:41:28.294579 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:41:28.294654 7 pool.go:46] "Task submitted to worker pool" task="Invasive Check"
I0808 18:41:28.294705 7 worker.go:16] "Processing task" task="Invasive Check"
I0808 18:41:28.294720 7 healthcheck.go:32] Trying to run an invasive check
I0808 18:41:31.737376 7 healthcheck.go:369] DCGM test completed:
I0808 18:41:31.737432 7 healthcheck.go:384] Observation: sunyibm1.fsl.cs.sunysb.edu Pass 0
I0808 18:41:31.737486 7 healthcheck.go:58] Running health check: ping
I0808 18:41:33.293785 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:41:38.294350 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:41:38.294406 7 pool.go:37] "Task already running, skipping submission" task="Invasive Check"
I0808 18:41:41.649157 7 healthcheck.go:272] Ping test completed:
I0808 18:41:41.649273 7 healthcheck.go:301] Unreachable nodes count: 0
I0808 18:41:41.649304 7 healthcheck.go:101] Running health check: gpupower
I0808 18:41:42.231243 7 healthcheck.go:402] Power Throttle check test completed:
I0808 18:41:42.231304 7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 0 0
I0808 18:41:42.231351 7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 1 0
I0808 18:41:42.231411 7 healthcheck.go:135] Total time (s) for all checks: 23.937731028
I0808 18:41:42.231434 7 healthcheck.go:23] Errors after running periodic health checks: false
I0808 18:41:42.245933 7 functions.go:198] Node sunyibm1.fsl.cs.sunysb.edu label found PASS
I0808 18:41:42.284512 7 functions.go:215] Node patched with label
{
"metadata": {
"labels": {
"autopilot.ibm.com/gpuhealth": "PASS"
}
}
}
I0808 18:41:42.284577 7 worker.go:29] "Task completed" task="Periodic Check"
I0808 18:41:42.349721 7 functions.go:92] GPUs are free. Will run invasive health checks.
I0808 18:41:42.349790 7 healthcheck.go:36] Starting invasive health checks, updating node label =TESTING for node sunyibm1.fsl.cs.sunysb.edu
I0808 18:41:42.362566 7 functions.go:198] Node sunyibm1.fsl.cs.sunysb.edu label found PASS
I0808 18:41:42.409251 7 functions.go:215] Node patched with label
{
"metadata": {
"labels": {
"autopilot.ibm.com/gpuhealth": "TESTING"
}
}
}
I0808 18:41:42.447031 7 functions.go:176] Try create Job
I0808 18:41:42.503700 7 functions.go:183] Created
I0808 18:41:42.504044 7 worker.go:29] "Task completed" task="Invasive Check"
I0808 18:41:43.294433 7 pool.go:46] "Task submitted to worker pool" task="Periodic Check"
I0808 18:41:43.294477 7 worker.go:16] "Processing task" task="Periodic Check"
I0808 18:41:43.294554 7 healthcheck.go:17] Running a periodic check
I0808 18:41:43.294579 7 healthcheck.go:54] Health checks pciebw,remapped,dcgm,ping,gpupower
I0808 18:41:43.294604 7 healthcheck.go:83] Running health check: pciebw
I0808 18:41:45.044403 7 healthcheck.go:227] GPU PCIe BW test completed:
I0808 18:41:45.044478 7 healthcheck.go:256] Observation: sunyibm1.fsl.cs.sunysb.edu 0 12.3
I0808 18:41:45.044515 7 healthcheck.go:256] Observation: sunyibm1.fsl.cs.sunysb.edu 1 12.3
I0808 18:41:45.044542 7 healthcheck.go:92] Running health check: remapped
I0808 18:41:46.556952 7 healthcheck.go:160] Remapped Rows check test completed:
I0808 18:41:46.557036 7 healthcheck.go:183] Observation: sunyibm1.fsl.cs.sunysb.edu 0 0
I0808 18:41:46.557074 7 healthcheck.go:183] Observation: sunyibm1.fsl.cs.sunysb.edu 1 0
I0808 18:41:46.557130 7 healthcheck.go:74] Running health check: dcgm -r 1
I0808 18:41:48.294632 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:41:48.294734 7 pool.go:46] "Task submitted to worker pool" task="Invasive Check"
I0808 18:41:48.294772 7 worker.go:16] "Processing task" task="Invasive Check"
I0808 18:41:48.294801 7 healthcheck.go:32] Trying to run an invasive check
I0808 18:41:53.294340 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:41:56.830168 7 healthcheck.go:369] DCGM test completed:
I0808 18:41:56.830228 7 healthcheck.go:384] Observation: sunyibm1.fsl.cs.sunysb.edu Pass 0
I0808 18:41:56.830269 7 healthcheck.go:58] Running health check: ping
I0808 18:41:58.294556 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:41:58.294638 7 pool.go:37] "Task already running, skipping submission" task="Invasive Check"
I0808 18:42:03.294128 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:06.639041 7 healthcheck.go:272] Ping test completed:
I0808 18:42:06.639128 7 healthcheck.go:301] Unreachable nodes count: 0
I0808 18:42:06.639187 7 healthcheck.go:101] Running health check: gpupower
I0808 18:42:07.212563 7 healthcheck.go:402] Power Throttle check test completed:
I0808 18:42:07.212636 7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 0 0
I0808 18:42:07.212694 7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 1 0
I0808 18:42:07.212767 7 healthcheck.go:135] Total time (s) for all checks: 23.918143672
I0808 18:42:07.212812 7 healthcheck.go:23] Errors after running periodic health checks: false
I0808 18:42:07.228511 7 functions.go:198] Node sunyibm1.fsl.cs.sunysb.edu label found PASS
I0808 18:42:07.267860 7 functions.go:215] Node patched with label
{
"metadata": {
"labels": {
"autopilot.ibm.com/gpuhealth": "PASS"
}
}
}
I0808 18:42:07.267915 7 worker.go:29] "Task completed" task="Periodic Check"
I0808 18:42:07.321888 7 functions.go:88] Pod dcgm-f7fad6-h469z with requests 8 and limits 8. Cannot run invasive health checks.
I0808 18:42:07.321929 7 worker.go:29] "Task completed" task="Invasive Check"
I0808 18:42:08.294112 7 pool.go:46] "Task submitted to worker pool" task="Periodic Check"
I0808 18:42:08.294166 7 pool.go:46] "Task submitted to worker pool" task="Invasive Check"
I0808 18:42:08.294190 7 worker.go:16] "Processing task" task="Invasive Check"
I0808 18:42:08.294211 7 healthcheck.go:32] Trying to run an invasive check
I0808 18:42:08.294207 7 worker.go:16] "Processing task" task="Periodic Check"
I0808 18:42:08.294249 7 healthcheck.go:17] Running a periodic check
I0808 18:42:08.356359 7 functions.go:88] Pod dcgm-f7fad6-h469z with requests 8 and limits 8. Cannot run invasive health checks.
I0808 18:42:08.356414 7 worker.go:29] "Task completed" task="Invasive Check"
I0808 18:42:08.356436 7 healthcheck.go:54] Health checks pciebw,remapped,dcgm,ping,gpupower
I0808 18:42:08.356450 7 healthcheck.go:83] Running health check: pciebw
I0808 18:42:10.127896 7 healthcheck.go:227] GPU PCIe BW test completed:
I0808 18:42:10.127997 7 healthcheck.go:256] Observation: sunyibm1.fsl.cs.sunysb.edu 0 12.3
I0808 18:42:10.128035 7 healthcheck.go:256] Observation: sunyibm1.fsl.cs.sunysb.edu 1 12.3
I0808 18:42:10.128062 7 healthcheck.go:92] Running health check: remapped
I0808 18:42:11.363142 7 healthcheck.go:160] Remapped Rows check test completed:
I0808 18:42:11.363236 7 healthcheck.go:183] Observation: sunyibm1.fsl.cs.sunysb.edu 0 0
I0808 18:42:11.363274 7 healthcheck.go:183] Observation: sunyibm1.fsl.cs.sunysb.edu 1 0
I0808 18:42:11.363301 7 healthcheck.go:74] Running health check: dcgm -r 1
I0808 18:42:13.294322 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:18.294259 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:18.294376 7 pool.go:46] "Task submitted to worker pool" task="Invasive Check"
I0808 18:42:18.294417 7 worker.go:16] "Processing task" task="Invasive Check"
I0808 18:42:18.294442 7 healthcheck.go:32] Trying to run an invasive check
I0808 18:42:21.559879 7 healthcheck.go:369] DCGM test completed:
I0808 18:42:21.559929 7 healthcheck.go:384] Observation: sunyibm1.fsl.cs.sunysb.edu Pass 0
I0808 18:42:21.559967 7 healthcheck.go:58] Running health check: ping
I0808 18:42:23.294647 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:28.294057 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:28.294115 7 pool.go:37] "Task already running, skipping submission" task="Invasive Check"
I0808 18:42:31.530661 7 healthcheck.go:272] Ping test completed:
I0808 18:42:31.530736 7 healthcheck.go:301] Unreachable nodes count: 0
I0808 18:42:31.530764 7 healthcheck.go:101] Running health check: gpupower
I0808 18:42:32.079901 7 healthcheck.go:402] Power Throttle check test completed:
I0808 18:42:32.080030 7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 0 0
I0808 18:42:32.080093 7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 1 0
I0808 18:42:32.080161 7 healthcheck.go:135] Total time (s) for all checks: 23.723694282
I0808 18:42:32.080199 7 healthcheck.go:23] Errors after running periodic health checks: false
I0808 18:42:32.118402 7 functions.go:198] Node sunyibm1.fsl.cs.sunysb.edu label found PASS
I0808 18:42:32.157835 7 functions.go:215] Node patched with label
{
"metadata": {
"labels": {
"autopilot.ibm.com/gpuhealth": "PASS"
}
}
}
I0808 18:42:32.157890 7 worker.go:29] "Task completed" task="Periodic Check"
I0808 18:42:32.209464 7 functions.go:92] GPUs are free. Will run invasive health checks.
I0808 18:42:32.209492 7 healthcheck.go:36] Starting invasive health checks, updating node label =TESTING for node sunyibm1.fsl.cs.sunysb.edu
I0808 18:42:32.231104 7 functions.go:198] Node sunyibm1.fsl.cs.sunysb.edu label found PASS
I0808 18:42:32.285260 7 functions.go:215] Node patched with label
{
"metadata": {
"labels": {
"autopilot.ibm.com/gpuhealth": "TESTING"
}
}
}
I0808 18:42:32.335944 7 functions.go:176] Try create Job
I0808 18:42:32.377837 7 functions.go:183] Created
I0808 18:42:32.377885 7 worker.go:29] "Task completed" task="Invasive Check"
I0808 18:42:33.294692 7 worker.go:16] "Processing task" task="Periodic Check"
I0808 18:42:33.294728 7 pool.go:46] "Task submitted to worker pool" task="Periodic Check"
I0808 18:42:33.294743 7 healthcheck.go:17] Running a periodic check
I0808 18:42:33.294803 7 healthcheck.go:54] Health checks pciebw,remapped,dcgm,ping,gpupower
I0808 18:42:33.294825 7 healthcheck.go:83] Running health check: pciebw
I0808 18:42:35.089627 7 healthcheck.go:227] GPU PCIe BW test completed:
I0808 18:42:35.089686 7 healthcheck.go:256] Observation: sunyibm1.fsl.cs.sunysb.edu 0 12.3
I0808 18:42:35.089719 7 healthcheck.go:256] Observation: sunyibm1.fsl.cs.sunysb.edu 1 12.3
I0808 18:42:35.089746 7 healthcheck.go:92] Running health check: remapped
I0808 18:42:36.065732 7 healthcheck.go:160] Remapped Rows check test completed:
I0808 18:42:36.065792 7 healthcheck.go:183] Observation: sunyibm1.fsl.cs.sunysb.edu 0 0
I0808 18:42:36.065857 7 healthcheck.go:183] Observation: sunyibm1.fsl.cs.sunysb.edu 1 0
I0808 18:42:36.065888 7 healthcheck.go:74] Running health check: dcgm -r 1
I0808 18:42:38.294064 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:38.294142 7 pool.go:46] "Task submitted to worker pool" task="Invasive Check"
I0808 18:42:38.294170 7 worker.go:16] "Processing task" task="Invasive Check"
I0808 18:42:38.294185 7 healthcheck.go:32] Trying to run an invasive check
I0808 18:42:43.293752 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:46.216300 7 healthcheck.go:369] DCGM test completed:
I0808 18:42:46.216393 7 healthcheck.go:384] Observation: sunyibm1.fsl.cs.sunysb.edu Pass 0
I0808 18:42:46.216448 7 healthcheck.go:58] Running health check: ping
I0808 18:42:48.294429 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:48.294506 7 pool.go:37] "Task already running, skipping submission" task="Invasive Check"
I0808 18:42:53.294072 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:56.156657 7 healthcheck.go:272] Ping test completed:
I0808 18:42:56.156816 7 healthcheck.go:301] Unreachable nodes count: 0
I0808 18:42:56.156868 7 healthcheck.go:101] Running health check: gpupower
I0808 18:42:56.733130 7 healthcheck.go:402] Power Throttle check test completed:
I0808 18:42:56.733192 7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 0 0
I0808 18:42:56.733227 7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 1 0
I0808 18:42:56.733262 7 healthcheck.go:135] Total time (s) for all checks: 23.438443725
I0808 18:42:56.733308 7 healthcheck.go:23] Errors after running periodic health checks: false
I0808 18:42:56.790103 7 functions.go:198] Node sunyibm1.fsl.cs.sunysb.edu label found PASS
I0808 18:42:56.981930 7 functions.go:215] Node patched with label
{
"metadata": {
"labels": {
"autopilot.ibm.com/gpuhealth": "PASS"
}
}
}
I0808 18:42:56.982014 7 worker.go:29] "Task completed" task="Periodic Check"
I0808 18:42:57.222418 7 functions.go:88] Pod dcgm-7c630e-fnnvh with requests 8 and limits 8. Cannot run invasive health checks.
I0808 18:42:57.222464 7 worker.go:29] "Task completed" task="Invasive Check"
I0808 18:42:58.293830 7 pool.go:46] "Task submitted to worker pool" task="Periodic Check"
I0808 18:42:58.293887 7 pool.go:46] "Task submitted to worker pool" task="Invasive Check"
I0808 18:42:58.293912 7 worker.go:16] "Processing task" task="Invasive Check"
I0808 18:42:58.293926 7 healthcheck.go:32] Trying to run an invasive check
I0808 18:42:58.293941 7 worker.go:16] "Processing task" task="Periodic Check"
I0808 18:42:58.294011 7 healthcheck.go:17] Running a periodic check
I0808 18:42:58.353876 7 functions.go:88] Pod dcgm-7c630e-fnnvh with requests 8 and limits 8. Cannot run invasive health checks.
I0808 18:42:58.353952 7 worker.go:29] "Task completed" task="Invasive Check"
I0808 18:42:58.354000 7 healthcheck.go:54] Health checks pciebw,remapped,dcgm,ping,gpupower
I0808 18:42:58.354023 7 healthcheck.go:83] Running health check: pciebw
I0808 18:43:00.141598 7 healthcheck.go:227] GPU PCIe BW test completed:
I0808 18:43:00.141667 7 healthcheck.go:256] Observation: sunyibm1.fsl.cs.sunysb.edu 0 12.3
I0808 18:43:00.141703 7 healthcheck.go:256] Observation: sunyibm1.fsl.cs.sunysb.edu 1 12.3
I0808 18:43:00.141730 7 healthcheck.go:92] Running health check: remapped
I0808 18:43:01.455650 7 healthcheck.go:160] Remapped Rows check test completed:
I0808 18:43:01.455713 7 healthcheck.go:183] Observation: sunyibm1.fsl.cs.sunysb.edu 0 0
I0808 18:43:01.455750 7 healthcheck.go:183] Observation: sunyibm1.fsl.cs.sunysb.edu 1 0
I0808 18:43:01.455778 7 healthcheck.go:74] Running health check: dcgm -r 1
I0808 18:43:03.294621 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:43:08.294153 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:43:08.294227 7 pool.go:46] "Task submitted to worker pool" task="Invasive Check"
I0808 18:43:08.294254 7 worker.go:16] "Processing task" task="Invasive Check"
I0808 18:43:08.294269 7 healthcheck.go:32] Trying to run an invasive check
I0808 18:43:11.616314 7 healthcheck.go:369] DCGM test completed:
I0808 18:43:11.616377 7 healthcheck.go:384] Observation: sunyibm1.fsl.cs.sunysb.edu Pass 0
I0808 18:43:11.616417 7 healthcheck.go:58] Running health check: ping
I0808 18:43:12.305226 7 healthcheck.go:272] Ping test completed:
I0808 18:43:12.305285 7 healthcheck.go:301] Unreachable nodes count: 0
I0808 18:43:12.305317 7 healthcheck.go:101] Running health check: gpupower
I0808 18:43:12.881291 7 healthcheck.go:402] Power Throttle check test completed:
I0808 18:43:12.881372 7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 0 0
I0808 18:43:12.881408 7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 1 0
I0808 18:43:12.881445 7 healthcheck.go:135] Total time (s) for all checks: 14.527429094
I0808 18:43:12.881468 7 healthcheck.go:23] Errors after running periodic health checks: false
I0808 18:43:12.892999 7 functions.go:198] Node sunyibm1.fsl.cs.sunysb.edu label found PASS
I0808 18:43:12.942649 7 functions.go:215] Node patched with label
{
"metadata": {
"labels": {
"autopilot.ibm.com/gpuhealth": "PASS"
}
}
}
I0808 18:43:12.942708 7 worker.go:29] "Task completed" task="Periodic Check"
I0808 18:43:12.995145 7 functions.go:92] GPUs are free. Will run invasive health checks.
I0808 18:43:12.995169 7 healthcheck.go:36] Starting invasive health checks, updating node label =TESTING for node sunyibm1.fsl.cs.sunysb.edu
I0808 18:43:13.005755 7 functions.go:198] Node sunyibm1.fsl.cs.sunysb.edu label found PASS
I0808 18:43:13.095461 7 functions.go:215] Node patched with label
{
"metadata": {
"labels": {
"autopilot.ibm.com/gpuhealth": "TESTING"
}
}
}
I0808 18:43:13.146311 7 functions.go:176] Try create Job
I0808 18:43:13.178288 7 functions.go:183] Created
I0808 18:43:13.178363 7 worker.go:29] "Task completed" task="Invasive Check"
I0808 18:43:13.294709 7 pool.go:46] "Task submitted to worker pool" task="Periodic Check"
I0808 18:43:13.294799 7 worker.go:16] "Processing task" task="Periodic Check"
I0808 18:43:13.294846 7 healthcheck.go:17] Running a periodic check
I0808 18:43:13.294868 7 healthcheck.go:54] Health checks pciebw,remapped,dcgm,ping,gpupower
I0808 18:43:13.294890 7 healthcheck.go:83] Running health check: pciebw
I0808 18:43:15.062640 7 healthcheck.go:227] GPU PCIe BW test completed:
I0808 18:43:15.062699 7 healthcheck.go:256] Observation: sunyibm1.fsl.cs.sunysb.edu 0 12.3
I0808 18:43:15.062732 7 healthcheck.go:256] Observation: sunyibm1.fsl.cs.sunysb.edu 1 12.3
I0808 18:43:15.062763 7 healthcheck.go:92] Running health check: remapped
I0808 18:43:16.287636 7 healthcheck.go:160] Remapped Rows check test completed:
I0808 18:43:16.287697 7 healthcheck.go:183] Observation: sunyibm1.fsl.cs.sunysb.edu 0 0
I0808 18:43:16.287735 7 healthcheck.go:183] Observation: sunyibm1.fsl.cs.sunysb.edu 1 0
I0808 18:43:16.287766 7 healthcheck.go:74] Running health check: dcgm -r 1
I0808 18:43:18.294665 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:43:18.294734 7 pool.go:46] "Task submitted to worker pool" task="Invasive Check"
I0808 18:43:18.294758 7 worker.go:16] "Processing task" task="Invasive Check"
I0808 18:43:18.294773 7 healthcheck.go:32] Trying to run an invasive check
I0808 18:43:23.294448 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:43:26.485120 7 healthcheck.go:369] DCGM test completed:
I0808 18:43:26.485170 7 healthcheck.go:384] Observation: sunyibm1.fsl.cs.sunysb.edu Pass 0
I0808 18:43:26.485206 7 healthcheck.go:58] Running health check: ping
I0808 18:43:27.171840 7 healthcheck.go:272] Ping test completed:
I0808 18:43:27.171897 7 healthcheck.go:301] Unreachable nodes count: 0
I0808 18:43:27.171926 7 healthcheck.go:101] Running health check: gpupower
I0808 18:43:27.743372 7 healthcheck.go:402] Power Throttle check test completed:
I0808 18:43:27.743432 7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 0 0
I0808 18:43:27.743468 7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 1 0
I0808 18:43:27.743506 7 healthcheck.go:135] Total time (s) for all checks: 14.448620701
I0808 18:43:27.743530 7 healthcheck.go:23] Errors after running periodic health checks: false
I0808 18:43:27.754183 7 functions.go:198] Node sunyibm1.fsl.cs.sunysb.edu label found PASS
I0808 18:43:27.786991 7 functions.go:215] Node patched with label
{
"metadata": {
"labels": {
"autopilot.ibm.com/gpuhealth": "PASS"
}
}
}
I0808 18:41:58.294638 7 pool.go:37] "Task already running, skipping submission" task="Invasive Check"
I0808 18:42:03.294128 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:06.639041 7 healthcheck.go:272] Ping test completed:
I0808 18:42:06.639128 7 healthcheck.go:301] Unreachable nodes count: 0
I0808 18:42:06.639187 7 healthcheck.go:101] Running health check: gpupower
I0808 18:42:07.212563 7 healthcheck.go:402] Power Throttle check test completed:
I0808 18:42:07.212636 7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 0 0
I0808 18:42:07.212694 7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 1 0
I0808 18:42:07.212767 7 healthcheck.go:135] Total time (s) for all checks: 23.918143672
I0808 18:42:07.212812 7 healthcheck.go:23] Errors after running periodic health checks: false
I0808 18:42:07.228511 7 functions.go:198] Node sunyibm1.fsl.cs.sunysb.edu label found PASS
I0808 18:42:07.267860 7 functions.go:215] Node patched with label
{
"metadata": {
"labels": {
"autopilot.ibm.com/gpuhealth": "PASS"
}
}
}
I0808 18:42:07.267915 7 worker.go:29] "Task completed" task="Periodic Check"
I0808 18:42:07.321888 7 functions.go:88] Pod dcgm-f7fad6-h469z with requests 8 and limits 8. Cannot run invasive health checks.
I0808 18:42:07.321929 7 worker.go:29] "Task completed" task="Invasive Check"
I0808 18:42:08.294112 7 pool.go:46] "Task submitted to worker pool" task="Periodic Check"
I0808 18:42:08.294166 7 pool.go:46] "Task submitted to worker pool" task="Invasive Check"
I0808 18:42:08.294190 7 worker.go:16] "Processing task" task="Invasive Check"
I0808 18:42:08.294211 7 healthcheck.go:32] Trying to run an invasive check
I0808 18:42:08.294207 7 worker.go:16] "Processing task" task="Periodic Check"
I0808 18:42:08.294249 7 healthcheck.go:17] Running a periodic check
I0808 18:42:08.356359 7 functions.go:88] Pod dcgm-f7fad6-h469z with requests 8 and limits 8. Cannot run invasive health checks.
I0808 18:42:08.356414 7 worker.go:29] "Task completed" task="Invasive Check"
I0808 18:42:08.356436 7 healthcheck.go:54] Health checks pciebw,remapped,dcgm,ping,gpupower
I0808 18:42:08.356450 7 healthcheck.go:83] Running health check: pciebw
I0808 18:42:10.127896 7 healthcheck.go:227] GPU PCIe BW test completed:
I0808 18:42:10.127997 7 healthcheck.go:256] Observation: sunyibm1.fsl.cs.sunysb.edu 0 12.3
I0808 18:42:10.128035 7 healthcheck.go:256] Observation: sunyibm1.fsl.cs.sunysb.edu 1 12.3
I0808 18:42:10.128062 7 healthcheck.go:92] Running health check: remapped
I0808 18:42:11.363142 7 healthcheck.go:160] Remapped Rows check test completed:
I0808 18:42:11.363236 7 healthcheck.go:183] Observation: sunyibm1.fsl.cs.sunysb.edu 0 0
I0808 18:42:11.363274 7 healthcheck.go:183] Observation: sunyibm1.fsl.cs.sunysb.edu 1 0
I0808 18:42:11.363301 7 healthcheck.go:74] Running health check: dcgm -r 1
I0808 18:42:13.294322 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:18.294259 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:18.294376 7 pool.go:46] "Task submitted to worker pool" task="Invasive Check"
I0808 18:42:18.294417 7 worker.go:16] "Processing task" task="Invasive Check"
I0808 18:42:18.294442 7 healthcheck.go:32] Trying to run an invasive check
I0808 18:42:21.559879 7 healthcheck.go:369] DCGM test completed:
I0808 18:42:21.559929 7 healthcheck.go:384] Observation: sunyibm1.fsl.cs.sunysb.edu Pass 0
I0808 18:42:21.559967 7 healthcheck.go:58] Running health check: ping
I0808 18:42:23.294647 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:28.294057 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:28.294115 7 pool.go:37] "Task already running, skipping submission" task="Invasive Check"
I0808 18:42:31.530661 7 healthcheck.go:272] Ping test completed:
I0808 18:42:31.530736 7 healthcheck.go:301] Unreachable nodes count: 0
I0808 18:42:31.530764 7 healthcheck.go:101] Running health check: gpupower
I0808 18:42:32.079901 7 healthcheck.go:402] Power Throttle check test completed:
I0808 18:42:32.080030 7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 0 0
I0808 18:42:32.080093 7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 1 0
I0808 18:42:32.080161 7 healthcheck.go:135] Total time (s) for all checks: 23.723694282
I0808 18:42:32.080199 7 healthcheck.go:23] Errors after running periodic health checks: false
I0808 18:42:32.118402 7 functions.go:198] Node sunyibm1.fsl.cs.sunysb.edu label found PASS
I0808 18:42:32.157835 7 functions.go:215] Node patched with label
{
"metadata": {
"labels": {
"autopilot.ibm.com/gpuhealth": "PASS"
}
}
}
I0808 18:42:32.157890 7 worker.go:29] "Task completed" task="Periodic Check"
I0808 18:42:32.209464 7 functions.go:92] GPUs are free. Will run invasive health checks.
I0808 18:42:32.209492 7 healthcheck.go:36] Starting invasive health checks, updating node label =TESTING for node sunyibm1.fsl.cs.sunysb.edu
I0808 18:42:32.231104 7 functions.go:198] Node sunyibm1.fsl.cs.sunysb.edu label found PASS
I0808 18:42:32.285260 7 functions.go:215] Node patched with label
{
"metadata": {
"labels": {
"autopilot.ibm.com/gpuhealth": "TESTING"
}
}
}
I0808 18:42:32.335944 7 functions.go:176] Try create Job
I0808 18:42:32.377837 7 functions.go:183] Created
I0808 18:42:32.377885 7 worker.go:29] "Task completed" task="Invasive Check"
I0808 18:42:33.294692 7 worker.go:16] "Processing task" task="Periodic Check"
I0808 18:42:33.294728 7 pool.go:46] "Task submitted to worker pool" task="Periodic Check"
I0808 18:42:33.294743 7 healthcheck.go:17] Running a periodic check
I0808 18:42:33.294803 7 healthcheck.go:54] Health checks pciebw,remapped,dcgm,ping,gpupower
I0808 18:42:33.294825 7 healthcheck.go:83] Running health check: pciebw
I0808 18:42:35.089627 7 healthcheck.go:227] GPU PCIe BW test completed:
I0808 18:42:35.089686 7 healthcheck.go:256] Observation: sunyibm1.fsl.cs.sunysb.edu 0 12.3
I0808 18:42:35.089719 7 healthcheck.go:256] Observation: sunyibm1.fsl.cs.sunysb.edu 1 12.3
I0808 18:42:35.089746 7 healthcheck.go:92] Running health check: remapped
I0808 18:42:36.065732 7 healthcheck.go:160] Remapped Rows check test completed:
I0808 18:42:36.065792 7 healthcheck.go:183] Observation: sunyibm1.fsl.cs.sunysb.edu 0 0
I0808 18:42:36.065857 7 healthcheck.go:183] Observation: sunyibm1.fsl.cs.sunysb.edu 1 0
I0808 18:42:36.065888 7 healthcheck.go:74] Running health check: dcgm -r 1
I0808 18:42:38.294064 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:38.294142 7 pool.go:46] "Task submitted to worker pool" task="Invasive Check"
I0808 18:42:38.294170 7 worker.go:16] "Processing task" task="Invasive Check"
I0808 18:42:38.294185 7 healthcheck.go:32] Trying to run an invasive check
I0808 18:42:43.293752 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:46.216300 7 healthcheck.go:369] DCGM test completed:
I0808 18:42:46.216393 7 healthcheck.go:384] Observation: sunyibm1.fsl.cs.sunysb.edu Pass 0
I0808 18:42:46.216448 7 healthcheck.go:58] Running health check: ping
I0808 18:42:48.294429 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:48.294506 7 pool.go:37] "Task already running, skipping submission" task="Invasive Check"
I0808 18:42:53.294072 7 pool.go:37] "Task already running, skipping submission" task="Periodic Check"
I0808 18:42:56.156657 7 healthcheck.go:272] Ping test completed:
I0808 18:42:56.156816 7 healthcheck.go:301] Unreachable nodes count: 0
I0808 18:42:56.156868 7 healthcheck.go:101] Running health check: gpupower
I0808 18:42:56.733130 7 healthcheck.go:402] Power Throttle check test completed:
I0808 18:42:56.733192 7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 0 0
I0808 18:42:56.733227 7 healthcheck.go:425] Observation: sunyibm1.fsl.cs.sunysb.edu 1 0
I0808 18:42:56.733262 7 healthcheck.go:135] Total time (s) for all checks: 23.438443725
I0808 18:42:56.733308 7 healthcheck.go:23] Errors after running periodic health checks: false
I0808 18:42:56.790103 7 functions.go:198] Node sunyibm1.fsl.cs.sunysb.edu label found PASS
I0808 18:42:56.981930 7 functions.go:215] Node patched with label
{
"metadata": {
"labels": {
"autopilot.ibm.com/gpuhealth": "PASS"
}
}
}

I read the code and comments. A wait group of size one would probably be enough. Also, the current design only covers periodic checks. The same protection should apply when a check is invoked manually, though that is a stretch goal.

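One possible reading of the wait-group suggestion, sketched under assumptions: the loop adds one to a `sync.WaitGroup` before starting a check and waits on it before looking at the next tick, so two runs of the same check cannot overlap. `runSerialized` is a hypothetical helper, not a function from the repository, and it is generic over the tick type only so it can be driven by a plain channel in the demo.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runSerialized starts check once per tick and waits for it to finish
// before reading the next tick. It returns the number of completed runs
// once the tick channel is closed.
func runSerialized[T any](ticks <-chan T, check func()) int {
	var wg sync.WaitGroup
	runs := 0
	for range ticks {
		wg.Add(1)
		go func() {
			defer wg.Done()
			check()
		}()
		wg.Wait() // block the loop until the current check is done
		runs++
	}
	return runs
}

func main() {
	// Three pre-queued ticks stand in for a ticker channel.
	ticks := make(chan time.Time, 3)
	for i := 0; i < 3; i++ {
		ticks <- time.Now()
	}
	close(ticks)

	active, maxActive := 0, 0
	runs := runSerialized[time.Time](ticks, func() {
		active++
		if active > maxActive {
			maxActive = active
		}
		time.Sleep(10 * time.Millisecond) // stand-in for the real check
		active--
	})
	fmt.Println(runs, maxActive) // 3 1
}
```

With a real `time.Ticker`, ticks that arrive while the check is still running are dropped for the slow receiver, so this shape skips overlapping runs without queueing them.
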
I agree. We only have two tasks right now. I have an easier idea: we can reset the timers after each task finishes.

    for {
        select {
        case <-periodicChecksTicker.C:
            healthcheck.PeriodicCheck()
            // reset periodic timer
            periodicChecksTicker.Stop()
            periodicChecksTicker = time.NewTicker(repeatDuration)
        case <-invasiveChecksTicker.C:
            if invasiveDuration > 0 {
                healthcheck.InvasiveCheck()
                // reset invasive timer
                invasiveChecksTicker.Stop()
                invasiveChecksTicker = time.NewTicker(invasiveDuration)
            }
        }
    }

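To make the effect of "reset after the task finishes" concrete, here is a small sketch that swaps in a one-shot `time.Timer` in place of the `Stop` plus `NewTicker` pair above. It assumes the check is called synchronously, which is exactly the condition the idea depends on. `runFixedDelay` and `measureGaps` are hypothetical helpers written for this example, and the sleep stands in for `healthcheck.PeriodicCheck()`.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runFixedDelay waits `every`, calls check, and re-arms the timer only
// after check returns, so the delay is measured from the end of one run
// to the start of the next.
func runFixedDelay(ctx context.Context, every time.Duration, check func()) {
	timer := time.NewTimer(every)
	defer timer.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-timer.C:
			check()
			timer.Reset(every) // safe: the timer has fired and its channel was drained
		}
	}
}

// measureGaps runs a fake check n times and reports the idle time between
// the end of each run and the start of the next.
func measureGaps(n int, every, work time.Duration) []time.Duration {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	var gaps []time.Duration
	var lastEnd time.Time
	runs := 0
	runFixedDelay(ctx, every, func() {
		if runs > 0 {
			gaps = append(gaps, time.Since(lastEnd))
		}
		time.Sleep(work) // stand-in for the real check
		lastEnd = time.Now()
		runs++
		if runs == n {
			cancel()
		}
	})
	return gaps
}

func main() {
	// The fake check (30ms) takes longer than the interval (20ms).
	gaps := measureGaps(3, 20*time.Millisecond, 30*time.Millisecond)
	for _, g := range gaps {
		fmt.Println(g >= 20*time.Millisecond) // true
	}
}
```

Even though the fake check outlasts the interval, each run starts at least one full interval after the previous one ended. That guarantee disappears if the check is launched in its own goroutine and the reset happens right away.
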
In this reset case:

As the code stands, in both cases the timer is stopped and restarted a few ms after the goroutine is spawned. Other than that, it might work for periodic checks if implemented correctly (e.g., with a wait group), but not for invasive checks. The invasive check runs in a separate pod, and the timer has nothing to do with that job.

Good point. I didn't know that about the invasive checks.

New image: ghcr.io/amirhnajafiz/autopilot:sha-3a76e7c