Vertical autoscaling in Kubernetes

Puja Abbassi

May 4, 2021


This is the second article in a short series on autoscaling in Kubernetes. Last time around we discussed horizontal pod autoscaling (HPA), which involves the addition or removal of workload replicas to suit the demand placed on the workload. Implemented well, HPA ensures the workload always has the capacity it needs to function correctly, whilst keeping its use of compute resources efficient. But HPA may not always be the best or preferred option for a given workload. In some circumstances, scaling a workload vertically is the more suitable option.

Vertical scaling involves the dynamic provision or removal of compute resources (CPU and memory) made available to a workload as it operates, and is one of the family of autoscaling techniques that can be used in Kubernetes.

Vertical Pod Autoscaler

The Vertical Pod Autoscaler (VPA) is a Kubernetes sub-project that provides a vertical scaling implementation for Kubernetes controllers, such as Deployments. It functions by adjusting the resource requests of the pods that make up the workload, based on the analysis of metrics collected from the workload. Resource requests are the declarative statement of the minimum required resources for the containers that make up a pod; the higher the value, the greater access the scheduled pod has to CPU or memory. If a workload is observed to be using more (or fewer) resources than requested in its spec, the VPA will compute a new, more appropriate set of values.
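For orientation, resource requests are declared per container in the workload's pod template. The fragment below is a minimal sketch rather than the workload used in this article: the image name, labels, replica count, and values are all placeholders. The `requests` block is the part the VPA recalculates.

```yaml
# Hypothetical Deployment; the image, labels, and values are illustrative only.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          image: example.com/app:1.0   # placeholder image
          resources:
            requests:                  # reserved by the scheduler; the values the VPA revises
              cpu: 250m
              memory: 128Mi
```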

Vertical pod autoscaling isn't a binary on/off mechanism; clearly there needs to be some configuration that allows us to refine the scaling process. Let's see how it works.

VPA API resource

The behaviour required of VPA is declaratively defined using a custom resource definition (CRD) called a VerticalPodAutoscaler. It's just like any other Kubernetes API object, but for the fact that it's registered with the API as an extension. Note that this is in contrast to the HPA, which comes as an integral part of the Kubernetes API. At the time of writing, the VPA API group and version is

The VPA resource is quite straightforward in terms of its content; its spec field contains just three available sub-fields. A VPA resource might look like the following:

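kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: app
  resourcePolicy:
    containerPolicies:
      - containerName: '*'
        controlledResources:
          - cpu
          - memory
        maxAllowed:
          cpu: 1
          memory: 500Mi
        minAllowed:
          cpu: 100m
          memory: 50Mi
  updatePolicy:
    updateMode: "Auto"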

The target reference allows us to specify which workload is subject to the actions of the VPA, and in this example we have a Deployment. It could be any of Deployment, DaemonSet, ReplicaSet, StatefulSet, ReplicationController, Job, or CronJob.

Next, the resource policy lets you bound the values the VPA may apply (a minimum and a maximum for each resource), and specify which containers in the target workload's pods those bounds apply to. Here we have a catch-all for every container in the pod, but it's also possible to exclude specific containers by name.
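As a sketch of excluding a container by name, a container policy can set its scaling mode to "Off" while the wildcard entry covers every container without a policy of its own. The container name `log-shipper` is hypothetical, and the `autoscaling.k8s.io/v1` API version is an assumption about the VPA release installed in your cluster.

```yaml
# Sketch only: "log-shipper" stands in for any container the VPA should leave alone.
apiVersion: autoscaling.k8s.io/v1   # assumed API version; check your VPA installation
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  resourcePolicy:
    containerPolicies:
      - containerName: log-shipper   # hypothetical sidecar
        mode: "Off"                  # the VPA does not manage this container
      - containerName: '*'           # applies to containers with no policy of their own
        minAllowed:
          cpu: 100m
          memory: 50Mi
        maxAllowed:
          cpu: 1
          memory: 500Mi
```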

Finally, the update policy is a crucial element - it determines how the optimal resource values computed by the VPA are applied to the workload. If the updateMode is "Off", the VPA will still compute and store recommended resource values for the workload, but will take no action to apply them. This is super useful for getting an understanding of a workload's resource consumption in a 'dry run' fashion. If it's set to "Initial", the VPA assigns resource values on pod creation, but at no other time during the pod's lifetime. If the value is "Recreate", the VPA updates the pod's resource values as and when it computes revised values, using a draconian method - it evicts the pod so that a new one is created in its place. Finally, setting updateMode to "Auto" has the same effect as "Recreate", but only because this is the sole supported technique for updating the resource values at present. In the future, the "Auto" mode will mean something entirely different; more on this, later.
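A 'dry run' configuration might look like the following minimal sketch. The object name is made up for illustration, and the API version is again an assumption about the installed VPA release.

```yaml
# Recommendation-only VPA: values are computed and stored, never applied.
apiVersion: autoscaling.k8s.io/v1   # assumed API version
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa-dry-run             # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  updatePolicy:
    updateMode: "Off"
```

The stored recommendations can then be inspected in the object's status, for example with `kubectl describe vpa app-vpa-dry-run`.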

So, the VPA API resource describes some policy for vertical scaling, but how does the VPA know how the workload is performing, what its resource requests should be, and how does it apply updates? Let's find out.


First of all, for vertical pod autoscaling to be meaningful, it's necessary to know how a workload is performing. To this end, the VPA depends on the Metrics Server running in the cluster for real-time metrics. On start up, it can optionally also load historical metrics stored by a Prometheus server - the default query period being 8 days. With this flow of metrics, the VPA has the necessary context in which to compute revised resource values.
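As a rough sketch of how the Prometheus history might be wired in, the VPA recommender accepts its storage settings as command-line flags. The fragment below shows only the relevant part of a recommender container spec; the Prometheus address is a placeholder, and the flag names should be checked against the VPA release you deploy.

```yaml
# Fragment of the recommender's pod spec; the address is a placeholder.
containers:
  - name: recommender
    args:
      - --storage=prometheus                                        # load history from Prometheus on start up
      - --prometheus-address=http://prometheus.monitoring.svc:9090  # hypothetical in-cluster address
      - --history-length=8d                                         # how far back to query
```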

VPA Components

The guts of the VPA revolve around three processes: the recommender, the updater, and the VPA admission controller.

The recommender, as its name suggests, assesses the current and historical workload metrics, and based on what it observes makes recommendations for a workload's resource requests. It does this in conjunction with any limit ranges set, and the resource policy set in the relevant VPA API object. The recommendations are stored in the status field of the workload's corresponding VPA API object:

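status:
  conditions:
    - lastTransitionTime: "2020-07-23T10:33:13Z"
      status: "True"
      type: RecommendationProvided
  recommendation:
    containerRecommendations:
      - containerName: app
        lowerBound:
          cpu: 574m
          memory: 262144k
        target:
          cpu: 587m
          memory: 262144k
        uncappedTarget:
          cpu: 587m
          memory: 262144k
        upperBound:
          cpu: "1"
          memory: 262144k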

The recommendations made by the recommender are just that, recommendations. It's the job of the updater to respond to the recommendations based on the update policy defined in the VPA API object. If a workload's pods need updating according to the recommendations, the updater will evict the pods whilst accounting for any governing pod disruption budget.
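To keep evictions from taking too many replicas offline at once, a pod disruption budget can be defined for the workload. This is a minimal sketch; the `app: app` label and the `minAvailable` value are assumptions about how the workload is labelled and how much disruption it can tolerate.

```yaml
apiVersion: policy/v1        # policy/v1beta1 on clusters older than Kubernetes 1.21
kind: PodDisruptionBudget
metadata:
  name: app-pdb              # hypothetical name
spec:
  minAvailable: 1            # at least one pod must remain available during voluntary evictions
  selector:
    matchLabels:
      app: app               # assumed pod label
```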

Finally, as Kubernetes creates new pods to replace those that have been evicted, it's the job of the VPA Admission Controller to mutate the resource request values for the containers that comprise the workload. It does this in line with the recommendations it finds in the corresponding VPA object, and it annotates the pods accordingly:

$ kubectl get po app-8456c5d4b8-hfmvk -o yaml | yq r - metadata.annotations
vpaObservedContainers: app
vpaUpdates: 'Pod resources updated by app-vpa: container 0: cpu request, memory request'

In-Place Updates

Earlier, we mentioned that updating a workload's resource requests using the VPA will result in the eviction of pods if the update policy is "Auto" or "Recreate", followed by a re-schedule with the new parameters. This could result in significant disruption to the service provided by the workload, even though there is nothing else wrong with it (in particular, think of scaling down). Lack of in-place updates of resource parameters is currently a limitation of Kubernetes, and necessitates this sub-optimal solution for vertical scaling. Enhancements to the Kubernetes API to accommodate in-place resource parameter updates have been discussed for two or more years, culminating in a Kubernetes Enhancement Proposal (KEP) for implementing this feature. However, the KEP touches a lot of different areas of Kubernetes, and as a consequence the implementation of the detailed changes is taking time to be exposed for general use. Once completed, though, it will provide a far less disruptive experience for vertically autoscaling workloads. Watch out for this feature landing at some point this year.

Addon Resizer

Another, simpler vertical autoscaling option exists as part of the community tools provided by the Autoscaling SIG; it's called Addon Resizer. It's designed for scaling singleton workloads whose expected load is directly proportional to the number of nodes that make up the cluster. The Metrics Server would be an ideal candidate workload for the Addon Resizer, for example. The Addon Resizer is implemented as a sidecar container, with a nanny process that monitors the node count in the cluster and sets the workload's resource requests to a configurable base amount plus a configurable extra amount per node.
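A sidecar configuration along these lines is sketched below. The image references are placeholders and all values are illustrative; the flags follow the Addon Resizer's `pod_nanny` binary, but should be checked against the release you use.

```yaml
# Fragment of a pod spec; images and values are illustrative only.
containers:
  - name: metrics-server
    image: example.com/metrics-server:placeholder
  - name: metrics-server-nanny
    image: example.com/addon-resizer:placeholder
    command:
      - /pod_nanny
      - --cpu=40m                   # base CPU request
      - --extra-cpu=0.5m            # additional CPU per node
      - --memory=40Mi               # base memory request
      - --extra-memory=4Mi          # additional memory per node
      - --deployment=metrics-server # Deployment to resize
      - --container=metrics-server  # container within that Deployment
    env:
      # The nanny uses the downward API to find its own pod.
      - name: MY_POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
      - name: MY_POD_NAMESPACE
        valueFrom:
          fieldRef:
            fieldPath: metadata.namespace
```

With these numbers, and ignoring any rounding or thresholds the nanny applies, a 100-node cluster would work out to roughly 40m + 100 × 0.5m = 90m of CPU and 40Mi + 100 × 4Mi = 440Mi of memory.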

VPA with HPA and Cluster Autoscaler

As you might expect, there is a great deal of commonality between the different autoscaling techniques in Kubernetes. After all, they tend to act on the same workload metrics collected by the Metrics Server. As a consequence, care must be taken to avoid conflicting outcomes.

At present, VPA and HPA are not cognisant of each other, and shouldn't be used simultaneously for resource metrics. In such a configuration, both VPA and HPA will detect the same shortfall in resources, and each will attempt to resolve it in its own way. It is possible for them to be employed together, provided HPA is configured to work off custom or external metrics instead of resource metrics. An alternative way of working would be to deploy both autoscalers, configure the VPA's update policy to "Off", and use the very useful operator called Goldilocks to visualize the VPA's recommendations. You then have the best of both worlds.
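An HPA that scales on a custom metric rather than CPU or memory might look like the sketch below. The metric name `http_requests_per_second` is hypothetical, and it assumes a custom metrics adapter is installed in the cluster to serve it; the replica bounds and target value are placeholders.

```yaml
apiVersion: autoscaling/v2   # autoscaling/v2beta2 on older clusters
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa              # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods             # per-pod custom metric, not a resource metric
      pods:
        metric:
          name: http_requests_per_second   # hypothetical custom metric
        target:
          type: AverageValue
          averageValue: "100"
```

Because this HPA never looks at CPU or memory utilisation, it does not compete with a VPA that manages the same Deployment's resource requests.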

VPA can and should work in conjunction with the Cluster Autoscaler, and we'll cover this in the next article.


The VPA is definitely not the finished article, but it is a useful addition to the autoscaling techniques that Kubernetes provides. In dynamically re-defining the resources available to workloads, it has the potential to solve one of the most difficult problems faced by DevOps teams - how much resource access do my workloads need to operate optimally? As time goes on, the VPA will mature and become a more widely-adopted feature in production clusters.

If you're using VPA to autoscale your production workloads, we'd love to hear about your experiences from the coal face.
