Add AWS Inferentia2/Neuron support to k8s AI inference by kiryl-filatau · Pull Request #6678 · GoogleCloudPlatform/PerfKitBenchmarker

kiryl-filatau · 2026-05-19T11:32:16Z

Add AWS Inferentia2/Neuron support to k8s AI inference: EKS S3 CSI addon, Neuron device plugin installation, and Neuron-specific vLLM model load time metric parsing

Will be updated.

hubatish · 2026-05-22T21:39:45Z

-      storage_service.PrepareService()
+      # Pick the storage backend by URI scheme so existing gs:// usage is
+      # untouched while s3:// URIs work for AWS-only runs.
+      storage_cloud = (


I thought this part was just for getting the huggingface token? Can we not just always pull the huggingface token from GCS, or does that affect other usages of the storage class somewhere?
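
For readers following the thread: the hunk above is cut off at `storage_cloud = (`, so the actual expression is not visible here. A minimal sketch of what "pick the storage backend by URI scheme" could look like, assuming a hypothetical helper `storage_cloud_for_uri` and a hypothetical scheme-to-cloud mapping (neither name comes from the PR):

```python
from urllib.parse import urlparse

# Hypothetical mapping from URI scheme to a cloud name; the PR's real
# values and lookup are not shown in the truncated hunk.
_SCHEME_TO_CLOUD = {'gs': 'GCP', 's3': 'AWS'}


def storage_cloud_for_uri(uri: str) -> str:
  """Returns the cloud whose object storage should serve the given URI."""
  scheme = urlparse(uri).scheme
  if scheme not in _SCHEME_TO_CLOUD:
    raise ValueError(f'Unsupported storage URI scheme in {uri!r}.')
  return _SCHEME_TO_CLOUD[scheme]
```

With a mapping like this, existing gs:// URIs keep resolving to the same backend and s3:// URIs resolve to the AWS one, which is the behavior the hunk's comment describes. Whether the token lookup needs this at all is the open question in the comment above.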

hubatish · 2026-05-22T21:45:04Z

+        # parameterised. Only the values differ for Neuron (instance family,
+        # taint, NodePool name).
+        kubernetes_commands.ApplyManifest(
+            'container/kubernetes_ai_inference/aws-gpu-nodepool.yaml.j2',


Might be a follow-up, but in #6687 I started moving us away from "configure the gpu nodepool in one-off benchmarks" and toward "configure the gpu nodepool during _Create/_PostCreate phases based off spec values", e.g. using nodepools and --config_override=kubernetes_scale.container_cluster.nodepools.pool1.vm_spec.AWS.machine_type='g6.2xlarge' in addition to a default nodepool.

hubatish · 2026-05-22T21:46:14Z

    if 'gcsfuse' in self.spec.catalog_components:
      self._ApplyGCSFusePVC()
+    elif 's3' in self.spec.catalog_components:
+      self._ApplyS3PVC()


same comments as for Vlodymyr's #6654 (review)

hubatish · 2026-05-22T21:47:43Z

 from perfkitbenchmarker.resources.container_service import kubectl
+# Imported to register --k8s_inference_server_s3_bucket so flagsaver can set
+# it from this test file.
+from perfkitbenchmarker.resources.kubernetes import wg_serving_inference_server  # pylint: disable=unused-import


Move the flags over to aws/flags; I don't think you need them in wg_serving_inference_server (especially if you also move the ApplyFusePVC function).

hubatish · 2026-05-22T21:50:28Z

+
+  def _InstallNeuronDevicePlugin(self):
+    """Applies the AWS Neuron Device Plugin DaemonSet to the cluster."""
+    # Non-empty kwargs force Jinja render (see ReadAndRenderJinja2Template); image


Why does forcing a jinja render matter? What happens if you don't?

hubatish · 2026-05-22T21:56:20Z

+  def _InstallS3CsiAddon(self):
+    """Installs the S3 CSI Driver and the IAM glue (Role/Policy + PIA)."""
+    # Local import: the inference-server module owns the bucket flag.
+    from perfkitbenchmarker.resources.kubernetes import wg_serving_inference_server  # pylint: disable=g-import-not-at-top


see other comments about moving flags here

hubatish · 2026-05-22T22:06:27Z

+      vllm_start_timestamp = _ParseInferenceServerTimeStamp(line)
+      break
+  if container_init_timestamp is None:
+    raise ValueError('Container init timestamp is not found in the logs.')


You talked about sharing some code. It doesn't look too bad but some ideas:

Maybe have a shared _ParseModelLoadTimeMetrics which does some helper pieces. Rename _ParseModelLoadTimeMetrics below to _ParseRegularModelLoadTimeMetrics or add Shared/Common to the shared one somewhere.

This "if _ is None, raise, otherwise return X-Y" piece seems like it could be shared pretty easily. Have each of the individual functions return Tuple[float | None, float | None, float | None, float | None] and have the Shared/Common one do the validation + subtraction and return Tuple[float, float, float, float].

The shared one could also call cluster.GetEvents & result_stdout.splitlines() while the individual ones could take those variables preprocessed.

A fancier refactor would possibly make 4 classes - a base class & 3 implementations. Then eg "FindStartupEvent" could be one function taking events while "find other timestamps" could be a second function taking logs.
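
A rough sketch of the simpler variant suggested above (per-backend parsers return optional timestamps, one shared function validates and subtracts). Every name here is hypothetical, the four labels are placeholders since the PR's actual four values are not visible in this thread, and the toy `label=seconds` log format stands in for real vLLM log parsing:

```python
from typing import Callable, Optional, Sequence, Tuple

# Placeholder labels; the real four timestamps in the PR may differ.
_LABELS = (
    'Container init', 'Server start', 'Model load start', 'Model load end'
)

RawTimestamps = Tuple[Optional[float], ...]


def parse_neuron_timestamps(log_lines: Sequence[str]) -> RawTimestamps:
  """Toy backend parser: reads 'label=seconds' lines; missing ones stay None."""
  found = {}
  for line in log_lines:
    key, sep, value = line.partition('=')
    if sep:
      found[key.strip()] = float(value)
  return tuple(found.get(label) for label in _LABELS)


def compute_load_time_metrics(
    parser: Callable[[Sequence[str]], RawTimestamps],
    log_text: str,
    startup_timestamp: float,
) -> Tuple[float, ...]:
  """Shared piece: split the logs, validate every timestamp, then subtract."""
  raw = parser(log_text.splitlines())
  durations = []
  for label, timestamp in zip(_LABELS, raw):
    if timestamp is None:
      raise ValueError(f'{label} timestamp is not found in the logs.')
    durations.append(timestamp - startup_timestamp)
  return tuple(durations)
```

In this shape each backend (regular, Neuron, and so on) only supplies its own parser, and the "if None, raise" checks live in one place. The startup timestamp is passed in here on the assumption that the shared function would obtain it from the cluster events before calling the parser.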

…in wg-serving inference server

pyink reformated

d172801

kiryl-filatau changed the title ~~Add AWS Inferentia2/Neuron support to k8s AI inference: EKS S3 CSI addon, Neuron device plugin installation, and Neuron-specific vLLM model load time metric parsing~~ Add AWS Inferentia2/Neuron support to k8s AI inference May 19, 2026

Cleaned up Neuron device plugin and S3 PVC manifests

aeacc80

hubatish reviewed May 22, 2026


kiryl-filatau added 2 commits May 23, 2026 16:28

pyink reformated

a40e3bd

Extend Neuron chip detection and refactor AWS node pool provisioning …

5dbf87c

…in wg-serving inference server

kiryl-filatau wants to merge 4 commits into GoogleCloudPlatform:master from kiryl-filatau:aws-tpu
