Autoscaling Kubernetes clusters

Puja Abbassi

May 12, 2021

We've been looking into autoscaling in a Kubernetes environment, and in previous articles on the subject, we've seen how Kubernetes handles horizontal pod (HPA) and vertical pod (VPA) autoscaling. Both of these autoscaling techniques rely on there being enough physical resources available to accommodate the outcome of their computation. It's no good determining that five extra pods are required to handle an application's workload, or recommending that a container needs an extra 4GiB of memory if the cluster doesn't have the capacity to honor it. The same holds even if HPA and/or VPA aren't in operation; if new workloads are applied to the cluster, but there's not enough space to deploy them, Kubernetes is unable to schedule the new pods.

Given the highly dynamic nature of cloud-native applications and the high degree of automation that Kubernetes promises, it would be incongruous if it couldn't seamlessly handle situations where extra capacity is required. It's no surprise to learn, then, that a ‘Cluster Autoscaler’ project exists as part of the Kubernetes autoscaling capabilities just for this purpose.

Cluster Autoscaler

The Cluster Autoscaler has been around for quite a while, reaching GA in September 2017, and its releases follow the version numbers of Kubernetes itself. It's considered mature, and historical testing has demonstrated the ability to automatically scale a cluster to the order of 1,000 nodes, each running 30 pods; it's likely that more recent improvements have increased this capability further. On the face of it, automatically scaling a cluster to such a size seems like it should be a difficult thing to achieve. So, how exactly does the Cluster Autoscaler do this?

Cloud Providers

Unlike the HPA and VPA, which take care of scaling pods inside the cluster boundary using the Kubernetes API, the Cluster Autoscaler needs to perform scaling outside the cluster boundary. Typically, we're talking about cloud provider node pools, which can only be resized through the cloud provider's native API. And, as there are many different public cloud providers with different management APIs, the Cluster Autoscaler must be implemented for each individual provider. In order to make the experience consistent and repeatable, each provider-specific implementation is written against a common interface specification. Different infrastructure, similar experience. There are implementations for each of the major public cloud providers, including Alibaba, DigitalOcean, and others.
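To make the idea of a common interface a little more concrete, here is a minimal Python sketch. It is not the Cluster Autoscaler's real interface (the project itself is written in Go); the class names, method names, and the in-memory "provider" are all hypothetical and exist only to show how provider-independent scaling logic can sit on top of a provider-specific node pool.

```python
from abc import ABC, abstractmethod


class NodeGroup(ABC):
    """Hypothetical, simplified view of a provider-managed node pool."""

    @abstractmethod
    def min_size(self) -> int: ...

    @abstractmethod
    def max_size(self) -> int: ...

    @abstractmethod
    def target_size(self) -> int: ...

    @abstractmethod
    def increase_size(self, delta: int) -> None: ...


class InMemoryNodeGroup(NodeGroup):
    """Toy implementation; a real one would call a cloud provider API."""

    def __init__(self, name: str, min_nodes: int, max_nodes: int, current: int):
        self.name = name
        self._min = min_nodes
        self._max = max_nodes
        self._target = current

    def min_size(self) -> int:
        return self._min

    def max_size(self) -> int:
        return self._max

    def target_size(self) -> int:
        return self._target

    def increase_size(self, delta: int) -> None:
        if delta <= 0:
            raise ValueError("delta must be positive")
        if self._target + delta > self._max:
            raise ValueError("scale-up would exceed the pool's maximum size")
        self._target += delta


def scale_up(group: NodeGroup, wanted: int) -> int:
    """Grow the group by up to `wanted` nodes without exceeding its maximum.

    Returns the number of nodes actually requested.
    """
    headroom = group.max_size() - group.target_size()
    delta = max(0, min(wanted, headroom))
    if delta:
        group.increase_size(delta)
    return delta
```

The point of the design is that scale_up only ever talks to the abstract NodeGroup, so the same logic works regardless of which provider sits behind it.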

Scaling Up

Once deployed to a cluster, the Cluster Autoscaler watches for pods that enter an unschedulable state because the Kubernetes scheduler has been unable to find a node to run them on. This scan occurs every 10 seconds, and when pods are found in an unschedulable state, the Cluster Autoscaler takes action to provide a new node on which to schedule them. It uses the cloud provider API to add one or more nodes to a node pool for the cluster in question.

While this explanation sounds quite straightforward, the reality is that the algorithm the Cluster Autoscaler uses is quite complex. It involves a 'simulation' of the potential resolution to the scheduling limitation. The Cluster Autoscaler assumes a node pool containing similar nodes, so it can compute what might happen if a new node is added to the pool. But, of course, it doesn't actually do the scheduling itself, so its 'best-laid plans' may come unstuck because of decisions made by the real scheduler, or because events overtake it. Either way, the Cluster Autoscaler watches and adapts as events unfold in the cluster.
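As a rough illustration of what such a simulation might look like, the sketch below packs pending pods onto imaginary copies of a node template and counts how many new nodes would be needed. It is a deliberately simplified stand-in: the first-fit-decreasing heuristic, the Resources type, and the function names are our own assumptions, and the real Cluster Autoscaler takes far more into account (affinity rules, taints, existing nodes, and so on).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Resources:
    cpu_millis: int
    memory_mib: int


def fits(pod: Resources, free: Resources) -> bool:
    """True if the pod's requests fit into the remaining free resources."""
    return pod.cpu_millis <= free.cpu_millis and pod.memory_mib <= free.memory_mib


def nodes_needed(
    pending: List[Resources], node_template: Resources, max_new_nodes: int
) -> Tuple[int, List[Resources]]:
    """Simulate adding identical nodes for the pending pods.

    Uses a first-fit-decreasing heuristic (an assumption made for this
    sketch). Returns the number of new nodes and the pods that still
    could not be placed.
    """
    free: List[Resources] = []  # remaining capacity of each simulated node
    unplaceable: List[Resources] = []
    ordered = sorted(pending, key=lambda p: (p.cpu_millis, p.memory_mib), reverse=True)
    for pod in ordered:
        for i, slot in enumerate(free):
            if fits(pod, slot):
                free[i] = Resources(
                    slot.cpu_millis - pod.cpu_millis,
                    slot.memory_mib - pod.memory_mib,
                )
                break
        else:
            # No simulated node has room: try to open a new one.
            if fits(pod, node_template) and len(free) < max_new_nodes:
                free.append(
                    Resources(
                        node_template.cpu_millis - pod.cpu_millis,
                        node_template.memory_mib - pod.memory_mib,
                    )
                )
            else:
                unplaceable.append(pod)
    return len(free), unplaceable
```

Note that a pod larger than the node template can never be helped by adding nodes from that pool, which is why the sketch reports it separately rather than scaling up in vain.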

Scaling Down

Having enough capacity to accommodate all of the pods we want to run is clearly important, but so too is the cost associated with the infrastructure hosting the cluster. Ideally, we don't want nodes running in the cluster that aren't hosting workloads; it's an unnecessary expense. The Cluster Autoscaler handles this situation by monitoring the nodes that make up the cluster every 10 seconds, and if it observes that a node's utilization has stayed below 50% for a period of 10 minutes, it makes an assessment to determine whether the node can safely be removed without compromising capacity. If it can, it will initiate node removal.
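The threshold-plus-timer logic described above can be sketched in a few lines of Python. The 50% threshold and 10-minute window come from the text; everything else is an assumption made for illustration, including the class and function names and the choice to measure utilization as the larger of the CPU and memory request ratios.

```python
from datetime import datetime, timedelta
from typing import Dict

UTILIZATION_THRESHOLD = 0.5            # below this, a node counts as underutilized
UNNEEDED_TIME = timedelta(minutes=10)  # how long it must stay that way


def utilization(requested_cpu: float, allocatable_cpu: float,
                requested_mem: float, allocatable_mem: float) -> float:
    """Assumed metric: the larger of the CPU and memory request ratios."""
    return max(requested_cpu / allocatable_cpu, requested_mem / allocatable_mem)


class ScaleDownTracker:
    """Hypothetical helper that remembers when each node became underutilized."""

    def __init__(self) -> None:
        self._unneeded_since: Dict[str, datetime] = {}

    def observe(self, node: str, util: float, now: datetime) -> bool:
        """Record one observation; return True if the node is a removal candidate."""
        if util >= UTILIZATION_THRESHOLD:
            # The node is busy again, so its timer is reset.
            self._unneeded_since.pop(node, None)
            return False
        since = self._unneeded_since.setdefault(node, now)
        return now - since >= UNNEEDED_TIME
```

A node flagged by observe is only a candidate. Whether its pods can actually be rescheduled elsewhere still has to be checked before anything is removed.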

Again, it's not quite as simple as this! For one thing, a node's hosted pods may not be movable to another node, in which case the Cluster Autoscaler will be unable to evict them (because of strict pod disruption budgets, for example). And, in a highly dynamic environment, the situation may change quickly, and scaling decisions may need to be overturned. The Cluster Autoscaler takes all of this in its stride and amends its actions based on changing circumstances.

Scaling Latency

In an ideal world, the detection of an unschedulable pod should result in immediate additional node capacity to remedy the situation. With control loops involved, this can never be instantaneous, but the Cluster Autoscaler aims to issue a scale-up request for larger clusters within a maximum of 60 seconds of detecting an unschedulable pod. Generally, this latency will be much smaller, especially for smaller clusters. But the Cluster Autoscaler's latency isn't the real problem here; it's the node provisioning time taken by the cloud provider, which may be of the order of minutes. This might be an unacceptable length of delay, so how can we get around this problem?

One technique is to run an over-provisioned cluster. At first, this might sound counter-intuitive, as one of the reasons for the Cluster Autoscaler is to dynamically resize the cluster according to its needs. But, if you're willing to stomach the expense of one or more additional nodes to circumvent node provisioning latency, it's well worth looking into. Here's how it works.

A number of 'dummy' pods with resource requirements can be run in the cluster on a 'standby' node or nodes. If these pods run with a lower level of priority than regular workloads, they can be preempted by the scheduler when additional replicas are required for regular workloads. As the 'dummy' pods are evicted, they become unschedulable, and the Cluster Autoscaler takes action to provision a new node to accommodate them. The intention is that the standby node immediately slurps up the increased demand in regular workloads with minimal latency, whilst the Cluster Autoscaler initiates a scale-up of nodes in the background. It may sound like a bit of a kludge, but it is a practical workaround to cloud provisioning latency.
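One possible shape for such a setup is sketched below as a small Python helper that builds the two Kubernetes objects involved: a low-value PriorityClass and a Deployment of placeholder pods that do nothing but reserve resources. The object names, the priority value of -1, the pause image tag, and the resource requests are all assumptions chosen for illustration, so adjust them to your own cluster. In particular, check that the chosen priority is not so low that your autoscaler configuration ignores the pods altogether.

```python
import json


def overprovisioning_manifests(replicas: int, cpu: str, memory: str) -> dict:
    """Build a v1 List holding a PriorityClass and a placeholder Deployment.

    All names and values here are illustrative assumptions.
    """
    priority_class = {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": "overprovisioning"},
        "value": -1,  # lower than the default pod priority of 0
        "globalDefault": False,
        "description": "Priority class for placeholder pods.",
    }
    labels = {"app": "overprovisioning"}
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "overprovisioning", "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "priorityClassName": "overprovisioning",
                    "containers": [
                        {
                            "name": "reserve-resources",
                            # The pause container simply sleeps; the tag is an assumption.
                            "image": "registry.k8s.io/pause:3.9",
                            "resources": {
                                "requests": {"cpu": cpu, "memory": memory}
                            },
                        }
                    ],
                },
            },
        },
    }
    return {"apiVersion": "v1", "kind": "List", "items": [priority_class, deployment]}


def to_json(manifests: dict) -> str:
    """Serialize the manifests so they can be piped to `kubectl apply -f -`."""
    return json.dumps(manifests, indent=2)
```

The size of the standby buffer is simply the number of replicas multiplied by the per-pod requests, so it can be tuned to match roughly one node's worth of capacity, or more if bursts tend to be larger.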

All the Autoscalers

Whilst HPA and VPA may not play nicely together if care is not taken, the same is not true for HPA and/or VPA with the Cluster Autoscaler. Without the Cluster Autoscaler, if HPA needs to scale a workload but there is insufficient capacity, pods will be left unschedulable. Of course, with the Cluster Autoscaler in operation, it will respond to observed unschedulable pods and scale the cluster nodes to accommodate them. Similarly, if the VPA resource recommendations have resulted in new pods that are unschedulable due to lack of resources, the Cluster Autoscaler will provision a new node to host the pods.

Spot Instances

Running big clusters in the cloud can get very expensive. Most companies that are cloud-native by nature use a lot of infrastructure, and whilst it's elastic, on-demand, and all that good stuff, it's still important to manage costs effectively. That's why people make use of spot instances (spot VMs in Azure, and preemptible VM instances in GCP): cheaper, unused VM capacity that can be reclaimed at short notice (2 minutes for AWS). Wouldn't it be great to use spot instances for Kubernetes clusters, especially big clusters? But what about the unpredictable termination of spot instances? Kubernetes makes application workloads highly resilient, and that would all change if nodes based on spot instances randomly disappeared from time to time.

One solution for this is to make use of the Cluster Autoscaler's 'priority' expander. Expanders are strategies for selecting different node pools for scaling up nodes. Their use assumes you have multiple node pools from which to select, and the default expander for the Cluster Autoscaler is 'random'. The priority expander allows us to direct the Cluster Autoscaler to make its choice based on priorities we assign to different node pools before we set it running.

We might create one node pool containing similar spot instance types, and another containing similar, regular, on-demand instance types. A ConfigMap is used to define the priorities for the priority expander, with different node pools having different relative priorities. If the spot instance node pool is configured with the higher priority, and the Cluster Autoscaler is configured with appropriate capacity values for each node pool type (i.e. a desire for zero nodes from the on-demand node pool), the cluster will be built from the spot instance node pool. But, if there is a sudden run on the spot instances resulting in the termination of nodes, the Cluster Autoscaler will fall back to the on-demand instance node pool instead.
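The selection logic can be sketched as follows. In a real deployment the priorities live in a ConfigMap, where each priority value maps to a list of regular expressions matched against node pool names, and the highest matching value wins. Here the same idea is modelled with a plain Python dictionary. The pool names, patterns, and priority numbers are hypothetical, and the list of candidates stands in for whichever pools are currently able to scale up.

```python
import re
from typing import Dict, List

# Higher number = higher priority. Names and values are illustrative only.
PRIORITIES: Dict[int, List[str]] = {
    50: [r".*spot.*"],
    10: [r".*on-demand.*"],
}


def pick_node_groups(candidates: List[str], priorities: Dict[int, List[str]]) -> List[str]:
    """Return the candidate pools that match the highest applicable priority.

    `candidates` are pools that could scale up right now; pools that
    cannot (for example, because spot capacity has dried up) are assumed
    to have been filtered out before this function is called.
    """
    for priority in sorted(priorities, reverse=True):
        patterns = [re.compile(p) for p in priorities[priority]]
        matched = [name for name in candidates if any(p.search(name) for p in patterns)]
        if matched:
            return matched
    return []
```

With both pools available, pick_node_groups(["pool-spot-a", "pool-on-demand-a"], PRIORITIES) selects only the spot pool. Once the spot pool drops out of the candidate list, the same call falls back to the on-demand pool, which is the behaviour described above.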

There is a little more to it than this, but this is the gist, and it enables us to benefit from cost-effective cloud infrastructure consumption, without sacrificing the high levels of availability we associate with Kubernetes. We even provide this as an option for our customers.

Conclusion

Kubernetes sets a high bar when it comes to fault tolerance and reliability for the cloud-native applications that it hosts. And, in building additional autoscaling capabilities on top of these robust features, we get the opportunity to enhance the service level objectives we provide to our users, customers, and communities. If you're using production-grade Kubernetes clusters, join the conversation and let us know your experiences with Kubernetes' autoscaling features.