Zalando Vertical Autoscaler

A Kubernetes operator that automatically applies VPA memory recommendations to Zalando PostgreSQL clusters during scheduled maintenance windows.

Why?

Zalando's Postgres operator does not natively integrate with the Kubernetes Vertical Pod Autoscaler. Changing memory on a Zalando postgresql CR triggers a rolling update of the StatefulSet, which means downtime risk. This operator bridges that gap by:

Reading VPA memory (and optionally CPU) recommendations automatically
Applying them only during cron-based maintenance windows you define
Enforcing safety gates so small fluctuations don't trigger unnecessary restarts
Running post-actions (e.g. rollout-restart of dependent workloads) after a successful update
Tracking every maintenance run with full history and Kubernetes conditions

How it works

                    VPA recommendation
                            │
                      Zalando CR has no memory?
                       no /          \ yes (and initialMemory set)
                      │          apply initialMemory immediately
                      │                    │
                      │               requeue (1m)
                      │
                 maintenance window open?
                  no /          \ yes
                 wait        safety gates pass?
                              no /       \ yes
                             skip    evaluate PG parameter templates
                                         │
                                   patch Zalando CR (memory + CPU + PG params)
                                         │
                                   wait for cluster Running
                                         │
                                   execute post-actions
                                         │
                                      done ✓

Quick start

Install with Helm

helm upgrade --install zalando-vpa \
  oci://ghcr.io/jiri-soukal-pfx/zalando-vertical-autoscaler/chart/zalando-vertical-autoscaler \
  --namespace operators --create-namespace

Or from a local clone:

helm upgrade --install zalando-vpa charts/zalando-vertical-autoscaler \
  --namespace operators --create-namespace

Create a policy

apiVersion: pricefx.io/v1alpha1
kind: PostgresMemoryPolicy
metadata:
  name: my-db-memory
  namespace: default
spec:
  # Zalando postgresql CR to manage
  targetCluster: my-zalando-pg

  # VPA object to read recommendations from (same namespace)
  vpaName: my-db-vpa

  # Clamp recommendations to this range
  memoryMin: 4Gi
  memoryMax: 64Gi

  # limits = requests * overcommit (default: 1)
  overcommit: 1

  # Add 20% headroom on top of VPA recommendation (default: 0)
  memoryBuffer: 20

  maintenanceWindow:
    # Every Sunday at 03:00 UTC
    cron: "0 3 * * 0"
    # Or: last Sunday of the month at 20:00 UTC
    # cron: "0 20 * * 0L"
    # Window stays open for 60 minutes
    timeoutMinutes: 60

  safetyGates:
    # Only proceed if the Zalando cluster reports "Running"
    requireHealthyCluster: true

  # Compute PG parameters from applied memory/CPU (optional)
  postgresParameters:
    shared_buffers: "{{ div (div .memory 3) 8192 }}"
    work_mem: "{{ div (div .memory 256) 1024 }}"
    max_parallel_workers_per_gather: "{{ div .cpu 2 }}"
    max_connections: "300"  # static values pass through as-is

  postActions:
    # Restart a dependent workload after the PG cluster is ready
    - action: RolloutRestart
      target:
        kind: Deployment
        name: my-app

Safety gates

Before patching the Zalando CR, two change gates must both pass:

Gate	Default	Purpose
Absolute diff	> 5 GiB	Ignore small absolute changes
Relative diff	> 10%	Ignore small proportional changes

Both thresholds are configurable via spec.safetyGates:

spec:
  safetyGates:
    absoluteThreshold: 2Gi   # override default 5Gi
    relativeThreshold: 5     # override default 10%

If either gate blocks, the run is recorded as Skipped and the operator waits for the next window.

Memory buffer

VPA recommendations can sometimes be too close to actual usage, leaving insufficient headroom for spikes. The memoryBuffer field adds a configurable percentage on top of the clamped VPA recommendation:

VPA recommends 20Gi → clamp to [4Gi, 64Gi] → 20Gi → apply 20% buffer → 24Gi applied

The buffered value is re-clamped to memoryMax, so the buffer can never push memory above the configured upper bound:

VPA recommends 60Gi → clamp to [4Gi, 64Gi] → 60Gi → apply 20% buffer → 72Gi → re-clamp → 64Gi applied

Set memoryBuffer: 0 (the default) for no buffer. Valid range is 0–100.

Observability

Everything is visible through standard Kubernetes mechanisms:

# Policy status with current/target memory and maintenance history
kubectl describe postgresmemorypolicy my-db-memory

# Kubernetes events
kubectl get events --field-selector involvedObject.name=my-db-memory

Conditions

Condition	Meaning
`VPARecommendationReady`	A valid VPA recommendation exists
`MaintenanceInProgress`	Maintenance is currently running
`LastMaintenanceFailed`	The most recent run failed

Maintenance history

The last 10 runs are recorded in .status.maintenanceHistory with status, timing, previous and applied memory values, and failure reasons.

Configuration reference

Field	Default	Description
`spec.targetCluster`	required	Name of the Zalando `postgresql` CR
`spec.vpaName`	required	Name of the VPA object (same namespace)
`spec.vpaContainerName`	`postgres`	Container to read recommendations from
`spec.memoryMin`	required	Lower bound for memory
`spec.memoryMax`	required	Upper bound for memory
`spec.overcommit`	`1`	Memory limit multiplier (`limits = requests * overcommit`)
`spec.memoryBuffer`	`0`	Percentage (0–100) added on top of VPA recommendation after clamping. Re-clamped to `memoryMax`.
`spec.initialMemory`	-	Memory to apply when the Zalando CR has no resources set (bootstrap)
`spec.maintenanceWindow.cron`	required	5-field cron expression (UTC). Supports `L` (last), `#` (nth), `W` (weekday)
`spec.maintenanceWindow.timeoutMinutes`	`60`	How long the window stays open
`spec.safetyGates.requireHealthyCluster`	`true`	Require cluster status `Running`
`spec.safetyGates.absoluteThreshold`	`5Gi`	Minimum absolute memory diff to proceed
`spec.safetyGates.relativeThreshold`	`10`	Minimum relative memory diff (%) to proceed
`spec.postActions[].action`	-	`RolloutRestart`
`spec.postActions[].target.kind`	-	`Deployment`, `StatefulSet`, or `DaemonSet`
`spec.postActions[].target.name`	-	Workload name
`spec.postActions[].target.namespace`	policy namespace	Override target namespace
`spec.postgresParameters`	-	Map of PG parameter names to Go template expressions (see below)

Cron syntax

Maintenance windows use gronx for cron parsing, which supports standard 5-field expressions plus extended modifiers:

Modifier	Field	Meaning	Example	Fires at
`L`	day-of-week	Last occurrence in month	`0 20 * * 0L`	Last Sunday at 20:00
`L`	day-of-month	Last day of month	`0 20 L * *`	Last day of month at 20:00
`#`	day-of-week	Nth occurrence in month	`0 20 * * 0#2`	Second Sunday at 20:00
`W`	day-of-month	Nearest weekday	`0 20 15W * *`	Weekday nearest 15th at 20:00

Standard expressions work as expected:

Expression	Fires at
`0 3 * * 0`	Every Sunday at 03:00
`0 2 * * 1-5`	Weekdays at 02:00
`30 4 1 * *`	1st of every month at 04:30

All times are evaluated in UTC. See the full gronx documentation for details.

PostgreSQL parameter templates

The operator can automatically compute PostgreSQL parameters based on the applied memory and CPU values. Define spec.postgresParameters as a map of parameter names to Go template expressions:

spec:
  postgresParameters:
    # Templates receive .memory (bytes) and .cpu (whole cores)
    shared_buffers: "{{ div (div .memory 3) 8192 }}"
    work_mem: "{{ div (div .memory 256) 1024 }}"
    effective_cache_size: "{{ div (div (mul .memory 3) 4) 1024 }}kB"
    max_worker_processes: "{{ max 24 (add (div .cpu 2) .cpu) }}"
    max_parallel_workers_per_gather: "{{ div .cpu 2 }}"
    # Static values pass through unchanged
    max_connections: "300"

Evaluated parameters are patched into spec.postgresql.parameters on the Zalando CR alongside memory and CPU resources.

Template inputs

Variable	Type	Description
`.memory`	int64	Applied memory in bytes
`.cpu`	int64	Applied CPU in whole cores (ceiling-rounded, e.g. 1600m -> 2)

Template functions

Function	Signature	Description
`div`	`div a b`	Integer division (`a / b`). Returns an error if `b` is zero.
`mul`	`mul a b`	Integer multiplication (`a * b`)
`add`	`add a b`	Integer addition (`a + b`)
`max`	`max a b`	Returns the larger of `a` and `b`

Error handling

A typo in a variable name (e.g. {{ .memroy }}) produces a clear template error instead of silently rendering an empty value.
Division by zero in div returns a reconciliation error instead of crashing the controller.
Template errors fail the reconciliation gracefully and are reported via Kubernetes events.

Design decisions

No CPU limits -- CPU limits cause throttling. The operator sets CPU requests from VPA but never sets CPU limits.
Unstructured Zalando CR access -- Uses dynamic client with merge patches to avoid importing Zalando operator Go types.
Cron-based windows -- Changes are only applied during defined windows, never outside them. If the window expires mid-maintenance, the run is marked as failed.

Development

# Run all tests (unit + integration)
./run-tests.sh

# Unit tests only (no envtest setup needed)
go test ./internal/controller/... -run 'Test[^C]' -v

# Build
go build ./...

License

See LICENSE for details.

Name		Name	Last commit message	Last commit date
Latest commit History 48 Commits
.agents		.agents
.claude		.claude
.github/workflows		.github/workflows
api/v1alpha1		api/v1alpha1
charts/zalando-vertical-autoscaler		charts/zalando-vertical-autoscaler
cmd		cmd
config		config
internal/controller		internal/controller
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
go.mod		go.mod
go.sum		go.sum
run-tests.sh		run-tests.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Zalando Vertical Autoscaler

Why?

How it works

Quick start

Install with Helm

Create a policy

Safety gates

Memory buffer

Observability

Conditions

Maintenance history

Configuration reference

Cron syntax

PostgreSQL parameter templates

Template inputs

Template functions

Error handling

Design decisions

Development

License

About

Uh oh!

Releases 9

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Zalando Vertical Autoscaler

Why?

How it works

Quick start

Install with Helm

Create a policy

Safety gates

Memory buffer

Observability

Conditions

Maintenance history

Configuration reference

Cron syntax

PostgreSQL parameter templates

Template inputs

Template functions

Error handling

Design decisions

Development

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 9

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages