Add AWS Inferentia2/Neuron support to k8s AI inference#6678
Add AWS Inferentia2/Neuron support to k8s AI inference#6678kiryl-filatau wants to merge 4 commits into
Conversation
| storage_service.PrepareService() | ||
| # Pick the storage backend by URI scheme so existing gs:// usage is | ||
| # untouched while s3:// URIs work for AWS-only runs. | ||
| storage_cloud = ( |
There was a problem hiding this comment.
I thought this part was just for getting the huggingface token? / can we not just always pull the huggingface token from GCS, or does that effect other usages of the storage class somewhere?
| # parameterised. Only the values differ for Neuron (instance family, | ||
| # taint, NodePool name). | ||
| kubernetes_commands.ApplyManifest( | ||
| 'container/kubernetes_ai_inference/aws-gpu-nodepool.yaml.j2', |
There was a problem hiding this comment.
Might be a follow up, but in #6687 I started moving us away from "configure the gpu nodepool in one-off benchmarks" & to "configure the gpu nodepool during _Create/_PostCreate phases based off spec values". eg using nodepools & --config_override=kubernetes_scale.container_cluster.nodepools.pool1.vm_spec.AWS.machine_type='g6.2xlarge' in addition to a default nodepool.
| if 'gcsfuse' in self.spec.catalog_components: | ||
| self._ApplyGCSFusePVC() | ||
| elif 's3' in self.spec.catalog_components: | ||
| self._ApplyS3PVC() |
There was a problem hiding this comment.
same comments as for Vlodymyr's #6654 (review)
| from perfkitbenchmarker.resources.container_service import kubectl | ||
| # Imported to register --k8s_inference_server_s3_bucket so flagsaver can set | ||
| # it from this test file. | ||
| from perfkitbenchmarker.resources.kubernetes import wg_serving_inference_server # pylint: disable=unused-import |
There was a problem hiding this comment.
move the flags over to aws/flags; I don't think you need them in wg_serving_inference_server (esp if you also move the ApplyFusePVC function)
|
|
||
| def _InstallNeuronDevicePlugin(self): | ||
| """Applies the AWS Neuron Device Plugin DaemonSet to the cluster.""" | ||
| # Non-empty kwargs force Jinja render (see ReadAndRenderJinja2Template); image |
There was a problem hiding this comment.
Why does forcing a jinja render matter? What happens if you don't?
| def _InstallS3CsiAddon(self): | ||
| """Installs the S3 CSI Driver and the IAM glue (Role/Policy + PIA).""" | ||
| # Local import: the inference-server module owns the bucket flag. | ||
| from perfkitbenchmarker.resources.kubernetes import wg_serving_inference_server # pylint: disable=g-import-not-at-top |
There was a problem hiding this comment.
see other comments about moving flags here
| vllm_start_timestamp = _ParseInferenceServerTimeStamp(line) | ||
| break | ||
| if container_init_timestamp is None: | ||
| raise ValueError('Container init timestamp is not found in the logs.') |
There was a problem hiding this comment.
You talked about sharing some code. It doesn't look too bad but some ideas:
- Maybe have a shared _ParseModelLoadTimeMetrics which does some helper pieces. Rename _ParseModelLoadTimeMetrics below to _ParseRegularModelLoadTimeMetrics or add Shared/Common to the shared one somewhere.
- This "if _ is None, raise, otherwise return X-Y" piece seems like it could be shared pretty easily. Have each of the individual functions return Tuple[float | None, float | None, float | None, float | None] & have the Shared/Common one the validation + subtraction & return Tuple[float,float,float,float].
- The shared one could also call cluster.GetEvents & result_stdout.splitlines() while the individual ones could take those variables preprocessed.
- A fancier refactor would possibly make 4 classes - a base class & 3 implementations. Then eg "FindStartupEvent" could be one function taking events while "find other timestamps" could be a second function taking logs.
…in wg-serving inference server
Add AWS Inferentia2/Neuron support to k8s AI inference: EKS S3 CSI addon, Neuron device plugin installation, and Neuron-specific vLLM model load time metric parsing
Will be updated.