Improve EKS autoscaler image tag selection #95

Open
vrutz wants to merge 3 commits into master from bug/dss14-sc-303472-plugin-for-eks-clusters-doesn-t-fully-support

@vrutz vrutz commented Jun 25, 2026

Contributor

Autoscaler Image Selection Test Plan

PR Scope

Validate that the EKS plugin can select a cluster autoscaler image without relying solely on a hardcoded Kubernetes-to-autoscaler map.

The change covers:

  • selecting an autoscaler tag from the configured registry tags list when a matching Kubernetes major/minor exists;
  • falling back to bundled known-good tags when registry tag discovery fails or has no matching tag;
  • allowing an explicit autoscaler image tag override for private or airgapped registries;
  • keeping fallback details in logs only, not in runnable UI output.
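
The selection order described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the function name `select_published_tag`, its parameters, the assumed `vX.Y.Z` tag format, and the sample tag list are all assumptions made for the example.

```python
import re

# Matches release tags of the form vMAJOR.MINOR.PATCH (assumed tag format).
_RELEASE_TAG = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def select_published_tag(k8s_version, registry_tags, override=None):
    """Return the autoscaler tag to deploy, or None if the caller must fall back.

    Hypothetical helper: an explicit override wins and is returned unchanged;
    otherwise the highest vX.Y.Z tag with the cluster's major/minor is chosen.
    """
    if override:
        return override
    major, minor = (int(part) for part in k8s_version.split(".")[:2])
    candidates = []
    for tag in registry_tags or []:
        match = _RELEASE_TAG.match(tag)
        if match and (int(match.group(1)), int(match.group(2))) == (major, minor):
            candidates.append((int(match.group(3)), tag))
    if not candidates:
        return None  # no matching tag: the bundled fallback would apply
    return max(candidates)[1]  # highest patch number wins


# Sample tags invented for the example.
tags = ["v1.33.0", "v1.33.2", "v1.35.0", "v1.35.1", "v1.35.10", "latest"]
print(select_published_tag("1.35", tags))
print(select_published_tag("1.36", tags))
print(select_published_tag("1.35", tags, override="dss-qa-1_35"))
```

Comparing patch numbers as integers rather than as strings matters once a minor reaches a two-digit patch release, since `"v1.35.10"` sorts before `"v1.35.2"` as text.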

Sources

Preflight

Confirm the Kubernetes versions are available in the target AWS region before creating EKS clusters:

export AWS_REGION=eu-west-1

aws eks describe-cluster-versions --region "$AWS_REGION"

For this PR, prioritize currently supported EKS versions that exercise the selector behavior:

  • 1.35: expected to have a matching autoscaler tag.
  • 1.33: expected to have a matching autoscaler tag and verifies non-latest supported minors.
  • 1.36: expected to exercise the fallback path if no v1.36.x autoscaler tag is published.

Do not add 1.11 or 1.12 E2E clusters unless an existing legacy environment is already available. They are outside the practical EKS coverage for this PR.

Private ECR Setup

Create or reuse one ECR repository whose path matches the image path built by the plugin:

<autoscalerRegistryURL>/autoscaling/cluster-autoscaler:<tag>

Example setup:

export AWS_REGION=eu-west-1
export AWS_ACCOUNT_ID=236706865914
export ECR_REGISTRY="${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com"
export ECR_REPO_PREFIX="${YOUR_OWN_REPO_PREFIX}"
export ECR_REPO="${ECR_REPO_PREFIX}/autoscaling/cluster-autoscaler"

aws ecr create-repository \
  --region "$AWS_REGION" \
  --repository-name "$ECR_REPO"

aws ecr get-login-password --region "$AWS_REGION" \
  | docker login --username AWS --password-stdin "$ECR_REGISTRY"

Push only the images needed by the private-registry override scenarios:

docker pull registry.k8s.io/autoscaling/cluster-autoscaler:v1.35.0

docker tag registry.k8s.io/autoscaling/cluster-autoscaler:v1.35.0 \
  "$ECR_REGISTRY/$ECR_REPO:dss-qa-1_35"

docker tag registry.k8s.io/autoscaling/cluster-autoscaler:v1.35.0 \
  "$ECR_REGISTRY/$ECR_REPO:v1.35"

docker push "$ECR_REGISTRY/$ECR_REPO:dss-qa-1_35"
docker push "$ECR_REGISTRY/$ECR_REPO:v1.35"

Ensure the worker node role can pull from the private ECR repository:

  • ecr:GetAuthorizationToken
  • ecr:BatchCheckLayerAvailability
  • ecr:BatchGetImage
  • ecr:GetDownloadUrlForLayer

Also ensure the worker node role has the AWS permissions required by cluster autoscaler itself. The plugin deploys the upstream AWS cloud-provider autoscaler manifest; at runtime the pod must be able to inspect and update Auto Scaling Groups. If the cluster or node group was not created with the usual autoscaling permissions, attach a policy that allows the autoscaler actions needed for EKS node groups, including:

  • autoscaling:DescribeAutoScalingGroups
  • autoscaling:DescribeAutoScalingInstances
  • autoscaling:DescribeLaunchConfigurations
  • autoscaling:DescribeScalingActivities
  • autoscaling:DescribeTags
  • autoscaling:SetDesiredCapacity
  • autoscaling:TerminateInstanceInAutoScalingGroup
  • ec2:DescribeInstanceTypes
  • ec2:DescribeLaunchTemplateVersions
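
As an illustration, the actions listed above could be packaged into an IAM policy document as sketched below. The `Sid` value and `"Resource": "*"` are assumptions chosen for brevity in a test environment; narrow them to match your own account policy.

```python
import json

# Actions copied from the list above.
AUTOSCALER_ACTIONS = [
    "autoscaling:DescribeAutoScalingGroups",
    "autoscaling:DescribeAutoScalingInstances",
    "autoscaling:DescribeLaunchConfigurations",
    "autoscaling:DescribeScalingActivities",
    "autoscaling:DescribeTags",
    "autoscaling:SetDesiredCapacity",
    "autoscaling:TerminateInstanceInAutoScalingGroup",
    "ec2:DescribeInstanceTypes",
    "ec2:DescribeLaunchTemplateVersions",
]

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ClusterAutoscalerTest",  # hypothetical statement ID
            "Effect": "Allow",
            "Action": AUTOSCALER_ACTIONS,
            "Resource": "*",  # assumption: unrestricted for a throwaway test setup
        }
    ],
}

# Print the JSON document so it can be saved and attached to the node role.
print(json.dumps(policy, indent=2))
```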

If you push the images from a DSS instance or test instance, the role used by that instance also needs push permissions such as:

  • ecr:InitiateLayerUpload
  • ecr:UploadLayerPart
  • ecr:CompleteLayerUpload
  • ecr:PutImage

E2E Scenarios

Public Registry, Matching Version

Create a plugin-managed EKS 1.35 cluster with autoscaling enabled and no autoscalerImageTagOverride.

Expected result:

  • autoscaler deployment uses the latest discovered v1.35.x tag;
  • runnable UI only reports autoscaler creation success;
  • logs show the selected autoscaler tag.

Repeat with EKS 1.33.

Expected result:

  • autoscaler deployment uses the latest discovered v1.33.x tag;
  • this proves selection is based on the cluster Kubernetes minor, not simply the newest published autoscaler tag.

Public Registry, Missing Matching Version

Create a plugin-managed EKS 1.36 cluster with autoscaling enabled and no autoscalerImageTagOverride.

Expected result if no v1.36.x autoscaler image is published:

  • registry tag discovery succeeds;
  • no matching v1.36.x tag is selected;
  • bundled fallback logic is used;
  • deployment uses the latest bundled fallback tag, currently v1.35.0;
  • fallback is visible in plugin logs only, not in the runnable UI.
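
One way to read "latest bundled fallback tag" is sketched below. The `BUNDLED_TAGS` mapping and the helper names are hypothetical; only the `v1.35.0` entry comes from this test plan, and the other entry is a placeholder invented for the example. The real bundled list lives in the plugin.

```python
# Hypothetical bundled map of Kubernetes minor -> known-good autoscaler tag.
# Only the 1.35 entry is taken from the test plan; the 1.33 entry is a placeholder.
BUNDLED_TAGS = {
    "1.33": "v1.33.0",
    "1.35": "v1.35.0",
}


def minor_key(version):
    """Turn '1.35' into (1, 35) so versions sort numerically, not as text."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def bundled_fallback_tag(k8s_version, bundled=BUNDLED_TAGS):
    """Return the bundled tag for this minor, else the newest bundled tag."""
    wanted = ".".join(k8s_version.split(".")[:2])
    if wanted in bundled:
        return bundled[wanted]
    newest = max(bundled, key=minor_key)
    return bundled[newest]


print(bundled_fallback_tag("1.35"))  # exact bundled entry
print(bundled_fallback_tag("1.36"))  # no entry, so the newest bundled tag is used
```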

Private ECR, Explicit Override

Create or update an EKS 1.35 cluster using:

autoscalerRegistryURL=$ECR_REGISTRY
autoscalerImageTagOverride=dss-qa-1_35

Expected result:

  • autoscaler deployment uses $ECR_REGISTRY/autoscaling/cluster-autoscaler:dss-qa-1_35;
  • no registry tag discovery is required for the override;
  • the override tag does not need to follow vX.Y.Z.

Repeat with:

autoscalerImageTagOverride=v1.35

Expected result:

  • autoscaler deployment uses $ECR_REGISTRY/autoscaling/cluster-autoscaler:v1.35;
  • success proves override tags are passed through as-is.

Private ECR, No Override

Create or update an EKS 1.35 cluster using:

autoscalerRegistryURL=$ECR_REGISTRY

Do not set autoscalerImageTagOverride.

Expected result:

  • if the private registry does not expose a compatible /v2/autoscaling/cluster-autoscaler/tags/list endpoint, discovery fails;
  • plugin logs mention that tag discovery failed and fallback will be used;
  • bundled fallback tag for Kubernetes 1.35 is selected;
  • autoscaler deployment uses $ECR_REGISTRY/autoscaling/cluster-autoscaler:v1.35.0;
  • this preserves the existing private-registry workflow where customers mirror the expected fallback tag.
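
The discovery failure in the first bullet can be pictured with the sketch below. It assumes the registry follows the Docker Registry HTTP API v2 `tags/list` response shape (`{"name": ..., "tags": [...]}`). The function names, the default `https://` scheme, and the choice to return `None` on any failure are illustrative assumptions, not the plugin's implementation.

```python
import json
import urllib.error
import urllib.request

REPOSITORY = "autoscaling/cluster-autoscaler"


def tags_list_url(registry_url):
    """Build the tags/list URL, adding https:// when no scheme is given."""
    base = registry_url.rstrip("/")
    if "://" not in base:
        base = "https://" + base
    return f"{base}/v2/{REPOSITORY}/tags/list"


def parse_tags(body):
    """Extract the tag list from a tags/list response body, or None if malformed."""
    try:
        tags = json.loads(body).get("tags")
    except (ValueError, AttributeError):
        return None
    return tags if isinstance(tags, list) else None


def discover_tags(registry_url, timeout=10):
    """Return published tags, or None so the caller can use the bundled fallback."""
    try:
        with urllib.request.urlopen(tags_list_url(registry_url), timeout=timeout) as resp:
            return parse_tags(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP errors such as an unauthenticated 401, network failures,
        # and undecodable responses.
        return None
```

Treating every failure mode the same way keeps the caller simple: a `None` result means "use the bundled tag", whatever the underlying cause.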

Validation Commands

Check the deployed autoscaler image:

kubectl -n kube-system get deployment cluster-autoscaler \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
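
To compare that output against the expected result of a scenario, a small helper along the following lines could be used. The function names are hypothetical, and the check assumes non-override tags follow the `vX.Y.Z` pattern used elsewhere in this plan.

```python
def image_tag(image):
    """Return the tag of an image reference, or None when no tag is present."""
    name = image.rsplit("/", 1)[-1]  # drop the registry host (and any port) and path
    name = name.split("@", 1)[0]     # ignore a digest suffix if one is present
    if ":" not in name:
        return None
    return name.split(":", 1)[1]


def matches_minor(image, k8s_minor):
    """True when the image tag looks like v<k8s_minor>.<patch>."""
    tag = image_tag(image)
    return tag is not None and tag.startswith(f"v{k8s_minor}.")


image = "registry.k8s.io/autoscaling/cluster-autoscaler:v1.35.0"
print(image_tag(image))
print(matches_minor(image, "1.35"))
print(matches_minor(image, "1.3"))
```

For the override scenarios, compare `image_tag(...)` directly with the configured override instead, since override tags are not required to follow `vX.Y.Z`.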

Check autoscaler pod health:

kubectl -n kube-system get pods -l app=cluster-autoscaler

Check autoscaler logs:

kubectl -n kube-system logs deployment/cluster-autoscaler

Check plugin logs for:

  • selected published tag;
  • fallback to bundled tag;
  • registry tag discovery failure for private ECR without override.

Cleanup

  • Delete EKS clusters and node groups created for the test.
  • Delete the temporary ECR repository if it was created only for this test.
  • Remove temporary IAM policies or role attachments used for pushing or pulling ECR images.

Out Of Scope

  • Legacy EKS versions below the currently supported EKS range.
  • Verifying cluster autoscaler runtime behavior beyond deployment health and pod startup.

@vrutz vrutz self-assigned this Jun 25, 2026
@vrutz vrutz added this to the 15.0.0 milestone Jun 25, 2026
@vrutz vrutz requested a review from a team June 25, 2026 14:56
Comment thread python-lib/dku_kube/autoscaler.py Outdated
@vrutz vrutz changed the title from "Update autoscaler image versions and handle 'latest' Kubernetes version in autoscaler configuration" to "Improve EKS autoscaler image tag selection" Jun 26, 2026
@vrutz vrutz marked this pull request as ready for review June 26, 2026 14:45