Add monitoring XIDs list #1

Open
sgysherry wants to merge 4 commits into master from test-branch

@sgysherry sgysherry commented Apr 25, 2025

  1. Extend the XID range (for test purposes only).
  2. Patch the gpu-device-plugin.
  3. Trigger an XID and observe that the condition gets applied.
  4. Delete the XID workflow (simulating the case where a hardware XID is thrown only once). The condition persists, and its heartbeat time gets updated.
  5. Restart the gpu-device-plugin. The condition persists.
kubectl describe node gke-sgy-cluster-p4-gpu-pool-dc022afa-vsk5 | grep Xid
XidCriticalError     True    Tue, 29 Apr 2025 00:51:35 +0000   Tue, 29 Apr 2025 00:37:13 +0000   {"31":true,"45":true}        e8986d68-8a1a-4212-9b5a-e8b47570811b
  6. Restart the VM on the node. The condition gets deleted.
kubectl describe node gke-sgy-cluster-p4-gpu-pool-dc022afa-vsk5 | grep Xid
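
For anyone repeating these checks by script, here is a minimal sketch of how the condition row could be parsed. It is not part of this PR. `parse_xid_condition` is a hypothetical helper, and it assumes the row looks like the sample output above: two timestamps in the usual `kubectl describe node` order (last heartbeat, then last transition) followed by a JSON object mapping XID codes to booleans.

```python
import json
import re
from datetime import datetime

# Sample row copied from the `kubectl describe node ... | grep Xid` output above.
SAMPLE = (
    'XidCriticalError     True    Tue, 29 Apr 2025 00:51:35 +0000   '
    'Tue, 29 Apr 2025 00:37:13 +0000   {"31":true,"45":true}        '
    'e8986d68-8a1a-4212-9b5a-e8b47570811b'
)

# Timestamp layout as printed in the sample row (assumed stable).
TIME_RE = r"[A-Z][a-z]{2}, \d{1,2} [A-Z][a-z]{2} \d{4} \d{2}:\d{2}:\d{2} [+-]\d{4}"
TIME_FMT = "%a, %d %b %Y %H:%M:%S %z"


def parse_xid_condition(line):
    """Hypothetical helper: split one condition row into its visible parts."""
    times = re.findall(TIME_RE, line)
    payload = re.search(r"\{.*?\}", line)
    if len(times) != 2 or payload is None:
        raise ValueError("line does not look like the sample condition row")
    # Assumption: the first timestamp is the heartbeat, the second the transition.
    heartbeat, transition = (datetime.strptime(t, TIME_FMT) for t in times)
    flags = json.loads(payload.group(0))
    # Keep only the XID codes whose flag is true, as integers.
    xids = sorted(int(code) for code, active in flags.items() if active)
    return {
        "status": line.split()[1],
        "heartbeat": heartbeat,
        "transition": transition,
        "xids": xids,
    }


info = parse_xid_condition(SAMPLE)
print(info["xids"], info["heartbeat"] > info["transition"])
```

A heartbeat later than the transition time matches the behavior described in step 4: the condition stays in place while its heartbeat keeps advancing. After the VM restart in the last step, the grep is expected to return no row, so there is nothing to parse.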

Additional requirements:
New changes must not break any previously supported functionality.

This PR is for demonstration and prototyping purposes.

  • Unit tests not added
