Scaling on-demand Prometheus servers with sharding

• Aug 13, 2021

It’s been a few years since we last wrote about our Prometheus setup. I’d recommend reading that article for context first, but for those looking for a tl;dr — we run a number of installations, with an installation consisting of one management cluster, and a number of workload clusters. Our previous approach to metrics infrastructure involved a single Prometheus server running on the management cluster, which was dynamically configured to scrape targets on the workload clusters by a custom operator. Fairly straightforward. Since implementing that architecture, we’ve faced some challenges with our setup — mostly due to scaling — that I wanted to share, as well as our solutions and future plans, and just some general rambling. Bam!

Since our initial implementation, a lot of the numbers in our equations changed. The number of workload clusters per management cluster increased, the size of those clusters increased, and the number of targets on those clusters increased. With these three factors, the overall number of time series each Prometheus server ingested grew, and with only one Prometheus server on the management cluster per installation, we were forced to vertically scale the management cluster machines to fit these ever-growing servers.

This vertically scaling was generally unfavorable. For example, we had at least one situation where a workload cluster automatically scaled due to a usage spike, causing the Prometheus server to ingest more time series, increasing memory usage, and ultimately forcing it into an OOM loop. Having to vertically scale the management cluster nodes due to workload cluster load activities felt very strange. We also needed to invest in capacity planning and preventative maintenance, where we’d rather the system took care of itself.

With us reaching the limits of vertical scaling, we decided to take a look at horizontally scaling our Prometheus setup. Looking for a domain to scale on, we quickly decided to scale with workload clusters — i.e: to deploy one Prometheus server for each workload cluster. This implicitly meant that we would also need to use the prometheus-operator to bring up Prometheus clusters dynamically. Whenever a new workload cluster was created, a new Prometheus server would be created. Having more, smaller Prometheus servers would also lead to easier scaling for our management clusters.

This is the number one piece of advice I give nowadays when people ask me about setting up larger scale Prometheus setups - use the prometheus-operator, and make sure you scale on some domain concept.

Our new topology would now consist of one Prometheus server per workload cluster, and one Prometheus server for the management cluster. We’d run all these Prometheus servers on the management cluster, primarily for isolation from customer workloads.

The bulk of the work was spent in building a new operator that watches Cluster Custom Resources (that’s a Cluster API resource describing a workload cluster) and creates Prometheus Custom Resources, as well as wiring up configuration for the new Prometheus servers - a kind of spiritual successor to our previous custom service discovery operator. This operator essentially codifies our metrics infrastructure topology.

With the operator put together, and creating Prometheus servers for each workload cluster, we quickly found out how different the resource usage requirements were for each of them. We’re big believers in setting requests and limits on all our Kubernetes pods, and while we had one Prometheus server per management cluster, it was possible to manage the resource usage manually - this became impossible with one Prometheus server per workload cluster. This largely manifested as Prometheus servers entering OOM loops. To fix this, we turned to the vertical pod autoscaler. With VPA built into the prometheus-meta-operator, all our Prometheus servers began automatically scaling themselves vertically, meaning we gained the benefits of setting proper requests and limits, without having to manually handle tuning our resource limits.

This is visible from metrics we send to Grafana Cloud - some of our Prometheus servers use a couple of hundred of megabytes, while others go to tens of gigabytes.

With our new operator built, and Prometheus servers running for each workload cluster, migration to the new setup was largely straightforward. All metrics we need for long-term storage are sent to Grafana Cloud, and we consider data in Prometheus locally somewhat ephemeral. So, after ensuring that we had all targets and alerts configured, we simply updated our new sharded Prometheus servers to use our production Alertmanager, and were swapped over.

All in all, on August 1st 2020, our overall metrics infrastructure held just over 21 million time series. As of August 1st 2021, we’re now at just over 39 million time series. This scaling just would not have been possible without our investment into scaling our Prometheus setup.

Although a necessary evil, having to build and then destroy our own Prometheus service discovery took a lot of unnecessary toil — being able to use open standards from the start would have been a lot easier. Given that, and our recent strategic decision to bet the farm on Cluster API, we’ve been looking at how we could provide the prometheus-meta-operator as a generic component for handling the infrastructure of monitoring Kubernetes clusters powered by Cluster API. Hit me up on Twitter if you want to chat further about that — we reckon a lot of the problems we’re solving for managing multiple Prometheus servers, towards providing Cluster API clusters are generic problems.