Hi,
The existing examples are very good, but given that the GPU/AI/ML features were highlighted in the introductory blog post ("Use accelerator-optimized resources."), it would be nice to see a full GPU example here.
If it helps, I tried this on my own but got some errors. Here is the job config I submitted:
{
  "taskGroups": [
    {
      "taskSpec": {
        "computeResource": {
          "cpuMilli": "20000",
          "memoryMib": "15000"
        },
        "runnables": [
          {
            "container": {
              "imageUri": "pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime",
              "entrypoint": "/bin/sh",
              "commands": ["-c", "python -c \"import torch;print(torch.cuda.is_available())\""]
            }
          }
        ],
        "maxRetryCount": 2,
        "maxRunDuration": "3600s"
      },
      "taskCount": 1,
      "parallelism": 1
    }
  ],
  "allocationPolicy": {
    "instances": [
      {
        "instanceTemplate": "alan-test-instance-template-3"
      }
    ]
  },
  "logsPolicy": {
    "destination": "CLOUD_LOGGING"
  }
}
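For anyone reproducing this, here is a slightly more verbose version of the check that the container command runs. It is only a sketch: the function name `cuda_status` is my own, and it assumes nothing beyond the standard `torch.cuda` calls. It also reports the case where `torch` is not installed at all, which makes it easier to tell a broken image from a missing GPU.

```python
def cuda_status():
    """Return a one-line, human-readable description of CUDA availability."""
    try:
        # torch ships in the pytorch/pytorch image; it may be absent elsewhere.
        import torch
    except ImportError:
        return "torch not installed"

    if not torch.cuda.is_available():
        return "CUDA not available"

    count = torch.cuda.device_count()
    name = torch.cuda.get_device_name(0)
    return f"CUDA available: {count} device(s), first is {name}"


if __name__ == "__main__":
    print(cuda_status())
```

Saved as a script inside the image, this could replace the inline `python -c` one-liner in `commands`.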
The log output is:
2022-08-18 09:31:08.760 EDT Task action/STARTUP/0/0/group0/0, STDOUT: Reading package lists...
2022-08-18 09:31:08.772 EDT Task action/STARTUP/0/0/group0/0, STDOUT:
2022-08-18 09:31:08.777 EDT Task action/STARTUP/0/0/group0/0, STDOUT: Building dependency tree...
2022-08-18 09:31:08.904 EDT Task action/STARTUP/0/0/group0/0, STDOUT: Reading state information...
2022-08-18 09:31:08.905 EDT Task action/STARTUP/0/0/group0/0, STDOUT:
2022-08-18 09:31:08.954 EDT Task action/STARTUP/0/0/group0/0, STDOUT: Some packages could not be installed. This may mean that you have requested an impossible situation or if you are using the unstable distribution that some required packages have not yet been created or been moved out of Incoming. The following information may help to resolve the situation: The following packages have unmet dependencies:
2022-08-18 09:31:09.008 EDT Task action/STARTUP/0/0/group0/0, STDOUT: docker.io : Depends: runc (>= 1.0.0~rc6~)
2022-08-18 09:31:09.019 EDT Task action/STARTUP/0/0/group0/0, STDERR: E: Unable to correct problems, you have held broken packages.
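In case it helps with triage, here is a rough sketch of how the interesting lines can be pulled out of a log dump like the one above. The names `parse_entries` and `find_problems` are my own, and the regular expression only assumes the `Task <id>, STDOUT|STDERR: <message>` shape visible in these entries, not any documented log format.

```python
import re

# Matches the message part of an entry, e.g.
#   Task action/STARTUP/0/0/group0/0, STDERR: E: Unable to correct problems...
# A leading timestamp on the same line is tolerated because search() is used.
ENTRY_RE = re.compile(r"Task (?P<task>\S+), (?P<stream>STDOUT|STDERR):\s?(?P<msg>.*)$")


def parse_entries(text):
    """Return (task, stream, message) tuples for every non-empty log message."""
    entries = []
    for line in text.splitlines():
        match = ENTRY_RE.search(line.strip())
        if match and match.group("msg"):
            entries.append((match.group("task"), match.group("stream"), match.group("msg")))
    return entries


def find_problems(entries):
    """Keep STDERR messages and apt dependency complaints."""
    return [
        msg
        for _, stream, msg in entries
        if stream == "STDERR" or "Depends:" in msg or "unmet dependencies" in msg
    ]


if __name__ == "__main__":
    sample = (
        "2022-08-18 09:31:09.008 EDT\n"
        "Task action/STARTUP/0/0/group0/0, STDOUT: docker.io : Depends: runc (>= 1.0.0~rc6~)\n"
        "2022-08-18 09:31:09.019 EDT\n"
        "Task action/STARTUP/0/0/group0/0, STDERR: E: Unable to correct problems, you have held broken packages.\n"
    )
    for problem in find_problems(parse_entries(sample)):
        print(problem)
```

On my log this narrows things down to the `docker.io` / `runc` dependency conflict during the STARTUP action, so the failure seems to happen before the container ever starts.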
And for reference, here's the info for my instance template:
{
  "creationTimestamp": "2022-08-17T14:05:29.128-07:00",
  "description": "",
  "id": "[redacted]",
  "kind": "compute#instanceTemplate",
  "name": "alan-test-instance-template-3",
  "properties": {
    "confidentialInstanceConfig": {
      "enableConfidentialCompute": false
    },
    "description": "",
    "scheduling": {
      "onHostMaintenance": "TERMINATE",
      "provisioningModel": "STANDARD",
      "automaticRestart": true,
      "preemptible": false
    },
    "tags": {},
    "disks": [
      {
        "type": "PERSISTENT",
        "deviceName": "alan-test-instance-template-3",
        "autoDelete": true,
        "index": 0,
        "boot": true,
        "kind": "compute#attachedDisk",
        "mode": "READ_WRITE",
        "initializeParams": {
          "sourceImage": "projects/ml-images/global/images/c0-deeplearning-common-cu110-v20220806-debian-10",
          "diskType": "pd-balanced",
          "diskSizeGb": "100"
        }
      },
      {
        "type": "PERSISTENT",
        "deviceName": "persistent-disk-1",
        "autoDelete": false,
        "index": 1,
        "kind": "compute#attachedDisk",
        "mode": "READ_WRITE",
        "initializeParams": {
          "description": "",
          "diskType": "pd-balanced",
          "diskSizeGb": "100"
        }
      }
    ],
    "networkInterfaces": [
      {
        "name": "nic0",
        "network": "projects/[redacted]/global/networks/default",
        "accessConfigs": [
          {
            "name": "External NAT",
            "type": "ONE_TO_ONE_NAT",
            "kind": "compute#accessConfig",
            "networkTier": "PREMIUM"
          }
        ],
        "kind": "compute#networkInterface"
      }
    ],
    "reservationAffinity": {
      "consumeReservationType": "ANY_RESERVATION"
    },
    "canIpForward": false,
    "keyRevocationActionType": "NONE",
    "machineType": "n1-standard-4",
    "metadata": {
      "fingerprint": "[redacted]",
      "kind": "compute#metadata"
    },
    "shieldedVmConfig": {
      "enableSecureBoot": false,
      "enableVtpm": true,
      "enableIntegrityMonitoring": true
    },
    "shieldedInstanceConfig": {
      "enableSecureBoot": false,
      "enableVtpm": true,
      "enableIntegrityMonitoring": true
    },
    "serviceAccounts": [
      {
        "email": "[redacted]@developer.gserviceaccount.com",
        "scopes": [
          "https://www.googleapis.com/auth/devstorage.read_only",
          "https://www.googleapis.com/auth/logging.write",
          "https://www.googleapis.com/auth/monitoring.write",
          "https://www.googleapis.com/auth/servicecontrol",
          "https://www.googleapis.com/auth/service.management.readonly",
          "https://www.googleapis.com/auth/trace.append"
        ]
      }
    ],
    "guestAccelerators": [
      {
        "acceleratorCount": 1,
        "acceleratorType": "nvidia-tesla-t4"
      }
    ],
    "displayDevice": {
      "enableDisplay": false
    }
  },
  "selfLink": "projects/[redacted]/global/instanceTemplates/alan-test-instance-template-3"
}
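Since the template is long, here is a small sketch that boils it down to the fields that seem relevant to a GPU job: machine type, boot image, accelerators, and host maintenance policy. The helper name `summarize_template` is hypothetical, and it only reads keys that appear in the JSON above.

```python
def summarize_template(template):
    """Extract the GPU-relevant fields from an instance template dict."""
    props = template.get("properties", {})

    # Source image of the boot disk, if one is declared.
    boot_images = [
        disk.get("initializeParams", {}).get("sourceImage", "")
        for disk in props.get("disks", [])
        if disk.get("boot")
    ]

    # (type, count) pairs for every attached accelerator.
    gpus = [
        (acc.get("acceleratorType"), acc.get("acceleratorCount"))
        for acc in props.get("guestAccelerators", [])
    ]

    return {
        "machineType": props.get("machineType"),
        # Keep only the image name, dropping the projects/.../images/ prefix.
        "bootImage": boot_images[0].rsplit("/", 1)[-1] if boot_images else None,
        "gpus": gpus,
        "onHostMaintenance": props.get("scheduling", {}).get("onHostMaintenance"),
    }
```

Loading the JSON above with `json.load` and passing it in gives `n1-standard-4`, the `c0-deeplearning-common-cu110-v20220806-debian-10` image, and one `nvidia-tesla-t4`, with `onHostMaintenance` set to `TERMINATE`.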
EDIT: Digging through the `Job` spec to the `ComputeResource` spec, I see the following:

Field | Type and description
-- | --
`gpuCount` | string ([int64](https://developers.google.com/discovery/v1/type-format) format). The GPU count. Not yet implemented.

Does this imply GPU jobs are not yet supported?