Skip to content

Add AWS Inferentia2/Neuron support to k8s AI inference#6678

Draft
kiryl-filatau wants to merge 4 commits into
GoogleCloudPlatform:masterfrom
kiryl-filatau:aws-tpu
Draft

Add AWS Inferentia2/Neuron support to k8s AI inference#6678
kiryl-filatau wants to merge 4 commits into
GoogleCloudPlatform:masterfrom
kiryl-filatau:aws-tpu

Conversation

@kiryl-filatau
Copy link
Copy Markdown
Collaborator

Add AWS Inferentia2/Neuron support to k8s AI inference: EKS S3 CSI addon, Neuron device plugin installation, and Neuron-specific vLLM model load time metric parsing

Will be updated.

@kiryl-filatau kiryl-filatau changed the title Add AWS Inferentia2/Neuron support to k8s AI inference: EKS S3 CSI addon, Neuron device plugin installation, and Neuron-specific vLLM model load time metric parsing Add AWS Inferentia2/Neuron support to k8s AI inference May 19, 2026
storage_service.PrepareService()
# Pick the storage backend by URI scheme so existing gs:// usage is
# untouched while s3:// URIs work for AWS-only runs.
storage_cloud = (
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this part was just for getting the huggingface token? / can we not just always pull the huggingface token from GCS, or does that effect other usages of the storage class somewhere?

# parameterised. Only the values differ for Neuron (instance family,
# taint, NodePool name).
kubernetes_commands.ApplyManifest(
'container/kubernetes_ai_inference/aws-gpu-nodepool.yaml.j2',
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be a follow up, but in #6687 I started moving us away from "configure the gpu nodepool in one-off benchmarks" & to "configure the gpu nodepool during _Create/_PostCreate phases based off spec values". eg using nodepools & --config_override=kubernetes_scale.container_cluster.nodepools.pool1.vm_spec.AWS.machine_type='g6.2xlarge' in addition to a default nodepool.

if 'gcsfuse' in self.spec.catalog_components:
self._ApplyGCSFusePVC()
elif 's3' in self.spec.catalog_components:
self._ApplyS3PVC()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comments as for Vlodymyr's #6654 (review)

from perfkitbenchmarker.resources.container_service import kubectl
# Imported to register --k8s_inference_server_s3_bucket so flagsaver can set
# it from this test file.
from perfkitbenchmarker.resources.kubernetes import wg_serving_inference_server # pylint: disable=unused-import
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move the flags over to aws/flags; I don't think you need them in wg_serving_inference_server (esp if you also move the ApplyFusePVC function)


def _InstallNeuronDevicePlugin(self):
"""Applies the AWS Neuron Device Plugin DaemonSet to the cluster."""
# Non-empty kwargs force Jinja render (see ReadAndRenderJinja2Template); image
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does forcing a jinja render matter? What happens if you don't?

def _InstallS3CsiAddon(self):
"""Installs the S3 CSI Driver and the IAM glue (Role/Policy + PIA)."""
# Local import: the inference-server module owns the bucket flag.
from perfkitbenchmarker.resources.kubernetes import wg_serving_inference_server # pylint: disable=g-import-not-at-top
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see other comments about moving flags here

vllm_start_timestamp = _ParseInferenceServerTimeStamp(line)
break
if container_init_timestamp is None:
raise ValueError('Container init timestamp is not found in the logs.')
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You talked about sharing some code. It doesn't look too bad but some ideas:

  1. Maybe have a shared _ParseModelLoadTimeMetrics which does some helper pieces. Rename _ParseModelLoadTimeMetrics below to _ParseRegularModelLoadTimeMetrics or add Shared/Common to the shared one somewhere.
  2. This "if _ is None, raise, otherwise return X-Y" piece seems like it could be shared pretty easily. Have each of the individual functions return Tuple[float | None, float | None, float | None, float | None] & have the Shared/Common one the validation + subtraction & return Tuple[float,float,float,float].
  3. The shared one could also call cluster.GetEvents & result_stdout.splitlines() while the individual ones could take those variables preprocessed.
  4. A fancier refactor would possibly make 4 classes - a base class & 3 implementations. Then eg "FindStartupEvent" could be one function taking events while "find other timestamps" could be a second function taking logs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants