Mar 23, 2021
Whether you are deploying a busy RDBMS for e-commerce or targeting a low latency message queue, you need to have low latency disks on your server. This is also important for modern distributed systems such as Kubernetes clusters running on etcd.
Recently, we were investigating a situation where one of the master nodes in our management clusters suddenly went NotReady while the cluster workload reconciliation also got stuck. This was odd because we run three masters for HA purposes, and losing one of them doesn’t normally disrupt the cluster’s control plane or workloads in any way.
After an initial glance over the dashboards, we noticed that Kubernetes apiserver had unusually high memory consumption, especially on the NotReady node. However, this isn’t something that normally breaks everything at once, especially when there are two perfectly functioning masters.
Left confused and without better leads, we started investigating what had caused the increase in memory usage and suspected a recent change in monitoring infrastructure that performed more aggressive service discovery than before. We tried to reproduce the problem under close inspection with that change applied, but could not do so reliably, while the NotReady issue kept recurring on other clusters. This caused some frustration because the problems came out of the blue and there was no clear way to trigger them.
We were also baffled by why this only happened on Azure. Since Kubernetes apiserver is mainly a REST API for manipulating the cluster state stored in etcd, we started to suspect a problem with etcd. We created a new Grafana dashboard with some key metrics from Kubernetes apiserver and etcd so that we could correlate events.
The eureka moment came once we noticed that when the apiserver request rate started to increase, the memory usage also increased, and shortly afterward, the etcd disk backend commit duration started to skyrocket. In extreme cases we observed fsync taking multiple seconds, whereas normal numbers hover between 7 and 15 ms.
Before this incident, we had only seen slow-disk issues in some on-prem use cases where NFS couldn’t keep up with etcd during busy times. There, the problem manifested itself in different ways, and warnings usually showed up in the logs early on.
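To get a rough feel for such numbers on a given disk, a small write-and-fsync probe can help. The following is a minimal sketch, not how etcd itself measures commit duration: the function names, the 2 KiB payload, and the sample count are our own illustrative choices, and the probe should be pointed at a directory on the disk that actually backs etcd (the system temp directory may be memory-backed and look unrealistically fast).

```python
import math
import os
import statistics
import tempfile
import time


def measure_fsync_latency(directory, samples=200, payload_size=2048):
    """Return per-operation latencies in milliseconds for write + fsync."""
    payload = b"\0" * payload_size
    latencies_ms = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(samples):
            start = time.perf_counter()
            os.write(fd, payload)
            # Force the data to stable storage before taking the timestamp.
            os.fsync(fd)
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.unlink(path)
    return latencies_ms


def percentile(values, fraction):
    """Nearest-rank percentile; fraction must be in (0, 1]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(fraction * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Replace the directory with a path on the etcd data disk.
    results = measure_fsync_latency(tempfile.gettempdir())
    print(f"median: {statistics.median(results):.2f} ms")
    print(f"p99:    {percentile(results, 0.99):.2f} ms")
```

A p99 that sits far above the 7 to 15 ms range mentioned above would be a hint that the disk, rather than etcd or the apiserver, deserves a closer look.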
This was the first time we observed high latencies in the cloud, and we were initially quite surprised since we already used Premium SSDs when available. On top of that, since Azure doesn’t have a concept of Provisioned IOPS as a configuration parameter, we thought we had the best you can get given that Ultra Disks aren’t available everywhere.
Once we re-read the documentation, however, it turned out that size does matter after all. In Azure, disk performance increases with disk size: the bigger the disk, the more IOPS and the better the throughput. The VM size also matters here, but in our case, that wasn’t the limiting factor.
For most of our management clusters, the etcd disk usage is well below 1 GB, so we had provisioned only a 10 GB disk. For normal use, even that was generous, but because of its size it turned out to be among the least performant Premium SSDs.
When we looked at the IOPS metrics of our etcd disk in the Azure Portal, we found that at the time of the incident, the disk went into burst mode when the Kubernetes apiserver request rate increased heavily. It kept up with the traffic until the burst quota ran out and the disk got throttled. That’s when the etcd disk backend write latency skyrocketed and the system came to a halt.
Based on the etcd disk IOPS graph, we determined that doubling the provisioned IOPS quota for the disk, which on Azure means choosing a larger disk, should leave enough room for the spikes. After some testing with the larger disk, we no longer observed high etcd disk backend commit durations.
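The relationship between size and performance can be sketched as a simple lookup. The tier names and figures below are assumptions based on Azure's published Premium SSD table as we remember it, so please verify them against the current documentation before relying on them. The helper names are our own.

```python
from typing import NamedTuple


class Tier(NamedTuple):
    name: str
    size_gib: int
    iops: int
    throughput_mb_per_s: int


# Assumed figures, not authoritative: check the current Azure documentation.
PREMIUM_SSD_TIERS = [
    Tier("P1", 4, 120, 25),
    Tier("P2", 8, 120, 25),
    Tier("P3", 16, 120, 25),
    Tier("P4", 32, 120, 25),
    Tier("P6", 64, 240, 50),
    Tier("P10", 128, 500, 100),
    Tier("P15", 256, 1100, 125),
    Tier("P20", 512, 2300, 150),
    Tier("P30", 1024, 5000, 200),
]


def tier_for_size(requested_gib):
    """Smallest tier whose capacity covers the requested disk size."""
    for tier in PREMIUM_SSD_TIERS:
        if requested_gib <= tier.size_gib:
            return tier
    raise ValueError(f"no tier in the table covers {requested_gib} GiB")


def smallest_tier_with_iops(required_iops):
    """Smallest tier whose provisioned IOPS meets the requirement."""
    for tier in PREMIUM_SSD_TIERS:
        if tier.iops >= required_iops:
            return tier
    raise ValueError(f"no tier in the table provides {required_iops} IOPS")


if __name__ == "__main__":
    print(tier_for_size(10))             # the tier a 10 GB disk would land in
    print(smallest_tier_with_iops(240))  # the tier needed to double 120 IOPS
```

Under these assumed figures, sizing the disk by required IOPS rather than by required capacity leads to a noticeably larger disk than the data alone would justify.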
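The burst-then-throttle pattern we saw can be illustrated with a simplified credit bucket. This is a sketch under our own assumptions, not Azure's actual accounting: the function name and all parameters are hypothetical, the bucket is assumed to start full, and "throttled" here means the disk delivered less than the burst-capped demand because the credits were gone.

```python
def simulate_burst(demand_iops, provisioned_iops, burst_iops, max_credits):
    """Simulate a credit-based burst model, one step per second.

    Returns (served, first_throttled): the IOPS delivered each second and
    the index of the first second in which credits ran out (or None).
    """
    credits = max_credits  # assumption: the disk starts with a full bucket
    served = []
    first_throttled = None
    for second, demand in enumerate(demand_iops):
        wanted = min(demand, burst_iops)  # the disk never exceeds its burst cap
        if wanted <= provisioned_iops:
            # Below the baseline: unused capacity refills the bucket.
            credits = min(max_credits, credits + (provisioned_iops - wanted))
            delivered = wanted
        else:
            # Above the baseline: the excess is paid for with credits.
            extra = min(wanted - provisioned_iops, credits)
            credits -= extra
            delivered = provisioned_iops + extra
        if delivered < wanted and first_throttled is None:
            first_throttled = second
        served.append(delivered)
    return served, first_throttled


if __name__ == "__main__":
    # Hypothetical numbers: a sustained spike of 300 IOPS on a 100 IOPS disk.
    served, throttled_at = simulate_burst(
        demand_iops=[300] * 10,
        provisioned_iops=100,
        burst_iops=500,
        max_credits=1000,
    )
    print(served)
    print("first throttled second:", throttled_at)
```

The shape matters more than the numbers: the disk absorbs the spike for a while, then abruptly falls back to its baseline, which is exactly when latency-sensitive writers such as etcd start to suffer.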
Disk size matters. IOPS provisioning for the disk is size-dependent on Azure instead of a separate configuration knob for a given disk type. There are also several different types of disk, some with restrictions on regional availability.
Where they are available, you would most probably want to use Ultra Disks, as they hit the sweet spot of price and performance for low-latency needs. Depending on the use case, you might also want to consider striped disks (RAID-0) built from smaller volumes in order to find the best combination of IOPS quota and volume size for a given price.
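A back-of-the-envelope comparison of a single disk against a RAID-0 stripe might look like the sketch below. It assumes ideal linear scaling of IOPS across identical disks, capped by an optional VM-level limit. Real stripes rarely scale perfectly, and striping raises the IOPS quota rather than lowering per-operation latency. The function name and every number in the usage example are hypothetical placeholders, not Azure figures or prices.

```python
def striped_volume(disk_count, disk_size_gib, disk_iops,
                   disk_monthly_price, vm_iops_limit=None):
    """Estimate capacity, IOPS ceiling, and cost of a RAID-0 stripe."""
    if disk_count < 1:
        raise ValueError("disk_count must be at least 1")
    total_iops = disk_count * disk_iops  # assumption: ideal linear scaling
    if vm_iops_limit is not None:
        # The VM size can cap what the attached disks deliver together.
        total_iops = min(total_iops, vm_iops_limit)
    return {
        "size_gib": disk_count * disk_size_gib,
        "iops": total_iops,
        "monthly_price": disk_count * disk_monthly_price,
    }


if __name__ == "__main__":
    # Hypothetical inputs: plug in figures from the current price list.
    single = striped_volume(1, 256, 1100, disk_monthly_price=40.0)
    stripe = striped_volume(4, 64, 240, disk_monthly_price=10.0)
    for label, option in (("single", single), ("stripe", stripe)):
        print(label, option)
```

Running the same comparison with real prices and limits for the target region makes it easier to see whether a stripe or one larger disk is the better deal.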
Over the years, we have learned that Azure is constantly developing and one should closely follow changes in service offerings as there might be good opportunities to improve service quality while saving costs when new product types become available.
Curious about this topic? Let's carry on the conversation @giantswarm.
BTW, this is an ongoing series from our Azure Team Celestial, check out the debut on internet access for virtual machines.