Horizontal Pod Autoscaling in Kubernetes

by Puja Abbassi on Feb 22, 2021

Kubernetes promises us a lot. One of the major benefits of hosting our cloud-native workloads on it is a high degree of automation, scaling and modernization. In particular, the ability to automatically scale deployed workloads, and the environments in which they run, removes a big headache for DevOps teams. In theory, we define some parameters that drive the scaling activity, then sit back and let Kubernetes do the work on our behalf. Without this automation, we'd have to either run enough replicas to cope with peak demand, or constantly monitor the fluctuating demand on our application services and manually increase or decrease the number of replicas accordingly. We'd have to hope that peaks and troughs in demand were shallow enough to give us and the system time to respond appropriately. If you'll forgive the pun, this approach isn't scalable when there are scores, hundreds, or even thousands of services to manage. Kubernetes, then, automates this difficult problem away for us.

This, of course, is very much 'the theory', and in the real world, it's not actually that simple. Before we lift the lid on autoscaling techniques in Kubernetes, let's just define the different types of scaling available via its API.

Horizontal Pod Autoscaling (HPA): when we get a spike or drop in demand for a workload, Kubernetes can automatically increase or decrease the number of pod replicas that serve the workload. This is a dynamic feature, characterized by a reconciliation loop that uses observed metrics to drive the workload's capacity toward the target defined by the owner of the workload.
Vertical Pod Autoscaling (VPA): determining how much compute resource is required to accommodate a fluctuating workload is very hard. But fear not: if Kubernetes is configured appropriately, it can monitor the performance of the workload over time and recommend optimal resource requirements for it. It can even adjust the resource requirements automatically.
Cluster Autoscaling (CA): ultimately, workloads can only run if there is sufficient capacity on the nodes that form the cluster. Conversely, if we have a large pool of underutilized nodes, we're effectively paying for redundant compute capacity. What number of nodes is too few or too many? Kubernetes has the means to dynamically increase or decrease the number of nodes that form the cluster to reflect the demand placed on it by the workloads it has been asked to host.

Each of these scaling methods is implemented separately in Kubernetes, but by their very nature they are also interlinked. In discussing one, we'll inevitably end up discussing the others too. Still, we're going to focus on each scaling type in a separate article, and in this first article, we're going to dive into Horizontal Pod Autoscaling.

Note: While exploring scaling methods in Kubernetes, we have also analysed in depth how to manage autoscaling Kubernetes on AWS.

How it works

Horizontal Pod Autoscaling has been a feature of Kubernetes for a very long time: since version 1.1, in fact. Given the age of the HPA API, it would be tempting to assume that it's mature and has been stable for a substantial period of time. But this isn't the case, and like many things in Kubernetes, the API and the controller that manages HPA API objects have continually evolved over time. These changes come about as real-world experience is fed back into the project.

The original implementation of the API was limited to scaling based on the difference between desired and observed CPU utilization metrics only. These simple metrics were collected using the now-defunct Heapster aggregator. This limited metrics scope eventually led to a more comprehensive V2 API, along with enhanced techniques for metrics collection, including support for custom metrics and metrics from non-Kubernetes-related objects. This more feature-rich API allows workloads to be scaled based on a more meaningful set of metrics (for example, the size of a message queue or the number of successful HTTP requests per second).

The Horizontal Pod Autoscaler Resource

For anyone wanting to dynamically scale workloads up and down by increasing or decreasing the pod replicas serving the workload, the HPA resource is where scaling characteristics are defined. A standard HPA resource might look like this:

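apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-proxy-hpa
spec:
  # Lower and upper bounds on the replica count
  minReplicas: 5
  maxReplicas: 20
  # The workload whose replicas are scaled
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-proxy
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        # Average CPU utilization across pods, as a percentage of requests
        type: Utilization
        averageUtilization: 60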

This simple resource definition instructs the HPA controller to scale the 'nginx-proxy' deployment up or down in order to maintain an average CPU utilization of 60% across the pod replicas. We could equally have specified an absolute 'average value' rather than 'average utilization' (a plain 'value' target applies to other metric types rather than to resource metrics), and the resource could have been memory rather than CPU. Utilization is expressed as a percentage of the resource requests declared on the pods' containers, so the scaling algorithm references the workload's resource requests when it evaluates whether it needs to scale the workload up or down to meet the target. If multiple metrics are defined, the algorithm computes a replica count for each and scales to the largest, that is, to the count required to satisfy the most 'demanding' metric.
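The core of that computation, as summarized in the Kubernetes documentation, can be written down compactly. What follows is a simplified sketch rather than the controller's full logic: it leaves out the tolerance band (10% by default) within which no scaling happens, as well as the handling of unready pods and missing metrics. The worked numbers assume the 'nginx-proxy' example above is running 5 replicas at an observed average CPU utilization of 90%, which is a made-up figure for illustration.

```latex
% Replica count proposed for a single metric
\[
\text{desiredReplicas}
  = \left\lceil \text{currentReplicas} \times
    \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil
\]

% Worked example: 5 replicas, 90% observed, 60% target
\[
\left\lceil 5 \times \frac{90}{60} \right\rceil
  = \lceil 7.5 \rceil = 8
\]

% With several metrics, the largest proposal wins, and the result
% is clamped to the bounds set in the HPA resource
\[
\text{replicas}
  = \min\bigl(\text{maxReplicas},\,
    \max(\text{minReplicas},\, \max_i \text{desiredReplicas}_i)\bigr)
\]
```

In this example the controller would scale from 5 to 8 replicas, which sits comfortably inside the configured range of 5 to 20.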

So, if it's not the discontinued Heapster that provides these metrics for the HPA controller, then what does?

Metrics API and the Metrics Server

The Metrics Server replaced the Heapster aggregator and is a canonical implementation of the Kubernetes Metrics API. It's the job of the Metrics Server to collect CPU and memory-related metrics from the Kubelets that run on each cluster node at a regular interval (by default, every minute). The Metrics Server runs in a Kubernetes cluster just like any other workload and the collected metrics are subsequently exposed via the Metrics API for the consumption of the HPA controller.

Support for memory-related metrics in the HPA V2 API is a welcome and useful addition, but it still doesn't give us much flexibility when considering workload metrics for autoscaling. Fortunately, Kubernetes has a Custom Metrics API just for this purpose.

Custom Metrics API

The Custom Metrics API allows for the collection of metrics that are application-specific and which can be expressed in the definition of an HPA resource for autoscaling purposes. The main difference between the Custom Metrics API and the simpler Resource Metrics API is that the implementation for collection is left to third parties. Examples of these implementations are the Prometheus Adapter and the GCP Stackdriver Adapter. In fact, these adapters can even be used to replace the function of the Metrics Server as they're able to collect resource metrics as well as custom metrics. To make use of custom metrics, then, it will be necessary to configure a monitoring capability (like Prometheus) to collect metrics from the target workloads, and then deploy an associated adapter to expose the metrics. The HPA controller is then able to consume the metrics for autoscaling workloads targeted in corresponding HPA objects. The spec.metrics object for a custom metric of an HPA might look like this:

<snip>
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests
      target:
        type: AverageValue
        averageValue: 100
<snip>
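For a metric like this to be available, the adapter has to be told how to derive it from what the monitoring system collects. As a rough sketch, a Prometheus Adapter rule for the example above might look like the following. It assumes the pods export a Prometheus counter named http_requests_total carrying namespace and pod labels; the counter name, the label names and the 2-minute rate window are illustrative assumptions, not something the HPA itself prescribes.

```yaml
rules:
  # Select the raw counter series scraped from the pods (assumed name)
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources:
    # Map Prometheus labels onto Kubernetes resources
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    # Expose the series as "http_requests" through the Custom Metrics API
    matches: "^(.*)_total$"
    as: "${1}"
  # Turn the counter into a per-second rate, grouped per pod
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```

Under these assumptions, the averageValue of 100 in the HPA fragment would be read as an average of 100 requests per second per pod.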

The Custom Metrics API gives us even more flexibility by allowing us to specify metrics that describe Kubernetes API objects other than pods. This could be an Ingress object or a Service object, for example.
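As an illustration, a spec.metrics entry that scales on a metric attached to an Ingress might look like the sketch below. The metric name requests_per_second and the Ingress name nginx-proxy-ingress are hypothetical, and an adapter would have to expose that metric for the object before the HPA controller could act on it.

```yaml
  metrics:
  - type: Object
    object:
      metric:
        # Hypothetical metric exposed by a custom metrics adapter
        name: requests_per_second
      # The single Kubernetes object the metric describes
      describedObject:
        apiVersion: networking.k8s.io/v1
        kind: Ingress
        name: nginx-proxy-ingress
      target:
        # Compare the object's metric value directly against the target
        type: Value
        value: "2000"
```

Unlike the Pods metric type, which averages a value across the replicas, an Object metric is a single reading taken from the one object it describes.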

An interesting augmentation to the in-built horizontal autoscaling features of Kubernetes is provided by an operator called KEDA. KEDA builds on top of the HPA controller to provide event-based scaling. It exposes metrics, such as queue length, through 'scalers' for event sources such as Kafka, AWS SQS, RabbitMQ, and so on. The workload deployment can be dynamically scaled down to zero when there is no work to be done, and up to accommodate an increase in queue length or stream lag. Here at Giant Swarm, we've recently seen KEDA work really well, with significant resource optimizations and cost reductions for one of our customers, so it was a no-brainer to include it in our App Platform for all of our customers to easily consume.
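To give a feel for what this looks like in practice, here is a minimal sketch of a KEDA ScaledObject that scales a consumer on Kafka lag. The deployment name, broker address, consumer group, topic and threshold are all hypothetical values, and the available trigger fields should be checked against the documentation for the KEDA version in use.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-consumer-scaler
spec:
  # The Deployment to scale (hypothetical name)
  scaleTargetRef:
    name: orders-consumer
  # Allow scaling to zero when there is no work to be done
  minReplicaCount: 0
  maxReplicaCount: 20
  triggers:
  - type: kafka
    metadata:
      # All values below are illustrative assumptions
      bootstrapServers: kafka.example.svc:9092
      consumerGroup: orders-consumer-group
      topic: orders
      # Target consumer lag per replica
      lagThreshold: "50"
```

KEDA creates and manages the underlying HPA object for the target workload itself, so a separate HPA resource would not normally be defined for the same deployment.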

Autoscaling complexities

On the face of it, horizontal autoscaling in Kubernetes seems quite straightforward. However, in practice, it's quite challenging; let's see why.

Choosing the right target metric(s), and the value or average value that should trigger autoscaling, demands a lot. It requires a deep understanding of the application service and also of the environment in which it will run. This doesn't come easy. It will be necessary to conduct in-depth performance testing of the application under load in order to elicit the best configuration parameters to meet your service level objectives. Getting the target values correct is perhaps the biggest challenge you'll face, but even then success is not guaranteed.

Behind the scenes, the algorithm the HPA control loop uses to determine whether workloads need scaling up or down is quite complex. One important feature it provides is a stabilization window for scaling down, which prevents 'thrashing' when frequent changes in metrics would otherwise cause workloads to constantly scale up and down. Despite the comprehensive nature of the algorithm, the implementation of the HPA controller hasn't always met the needs of every use case; this is not surprising given that it's a general-purpose feature that was designed and implemented without a priori knowledge of every nuanced requirement.

Of course, without the knobs available in the HPA API for more granular control of scaling, a 'one size fits all' approach is a bit of a blunt instrument. One application's needs for how quickly or slowly its replicas are increased or decreased may be completely different from another's. This has led to requests for improvements based on techniques used in industrial control systems. Thankfully, the release of Kubernetes 1.18 introduced some configurable scaling parameters, which allow for fine-tuning on a per-HPA object basis.
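These parameters live under the behavior field of the HPA spec. The fragment below is a sketch of how they might be set for a workload that should scale up quickly but shed replicas cautiously; the specific numbers are illustrative choices, not recommendations.

```yaml
<snip>
  behavior:
    scaleDown:
      # Act on the highest recommendation seen in the last 5 minutes,
      # which damps thrashing when metrics fluctuate
      stabilizationWindowSeconds: 300
      policies:
      # Remove at most 2 pods per minute
      - type: Pods
        value: 2
        periodSeconds: 60
    scaleUp:
      # React to increased demand immediately
      stabilizationWindowSeconds: 0
      policies:
      # At most double the replica count every 15 seconds
      - type: Percent
        value: 100
        periodSeconds: 15
<snip>
```

Because behavior is set per HPA object, a latency-sensitive front end and a batch-style consumer in the same cluster can each be given scaling rates that suit them.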

Conclusion

Horizontal scaling in Kubernetes has come a long way since its early implementation and can now handle complex scaling requirements for disparate workload types. Undoubtedly, there will be more improvements and features to come as the HPA API comes out of beta and approaches V2 GA status. But, even as a beta API, HPA is mature enough to use in production environments and is an indispensable asset when you consider the alternative, which is manual scaling of workloads.

HPA works well for its intended purpose, but care should be taken when other forms of scaling are also employed in the cluster. HPA can work in conjunction with, as well as against, other scaling techniques in Kubernetes. We'll take a look at how this might occur when we consider vertical pod autoscaling in the next article.

Frequently Asked Questions

What needs to be true about an application before horizontal autoscaling will actually work well?

Horizontal autoscaling works best when services are stateless, handle graceful shutdowns, and can run multiple replicas without shared local state. You also need reliable readiness/liveness probes, sensible CPU/memory requests, and external dependencies (databases, queues) that can absorb higher concurrency.

Otherwise, scaling replicas can just amplify bottlenecks.

Which autoscaling signals are most useful beyond CPU and memory?

CPU/memory are a starting point, but modern platforms often scale on user-facing signals like request rate, latency, queue depth, or custom business metrics. These align scaling decisions with real demand and SLOs.

The key is metric quality: stable collection, low noise, and thresholds that don't trigger constant "scale up/down" churn.

How do you prevent horizontal autoscaling from driving unexpected cloud spend?

Use guardrails: set min/max replicas, enforce budgets or quotas per team, and pair autoscaling with cost observability (FinOps) to spot runaway growth early. Right-sizing requests is critical — oversized requests make every replica expensive.

Platform policy, load testing, and progressive rollouts help ensure scaling improves resilience without uncontrolled costs.
