Feb 19, 2021
When designing a cluster of (virtual) servers, a few fundamental decisions need to be made: how clients reach the services hosted on the servers, how to secure them, and so on. One aspect that is sometimes overlooked or taken for granted is how to give the servers reliable and secure access to the internet, which is often called egress traffic.
Recently at Giant Swarm, we needed to change our approach to provide internet access to the worker nodes of our Kubernetes clusters, and this presented a chance to step back and rethink. In this blog post, we'll analyze the possible choices that the Azure cloud platform offers in terms of internet access for virtual machines, what we eventually ended up choosing, and why we went down this path.
For our Kubernetes clusters on Azure, we use Virtual Machine Scale Sets (VMSS) for both the control plane nodes and the node pools (worker nodes). Our instances are based on Flatcar Linux by Kinvolk, so in general the instances themselves need no internet access to work properly.
That said, a Kubernetes cluster with nodes that can’t access the internet isn’t very useful. Nodes generally need to download container images and often connect to third-party services, to mention just two basic requirements.
When we started looking into it, it turned out that there are a few different ways to provide internet access to virtual machines on Azure.
We considered several options. The most direct one, assigning a public IP address to each virtual machine, has clear drawbacks:
1. It increases the cloud provider bill because public IP addresses are a limited resource;
2. It directly exposes the virtual machines to the internet and thus requires a stricter security setting (firewall);
3. It's complicated to manage because the workloads running on the nodes will use different IP addresses when reaching services on the internet, which makes it hard to keep firewall rules in external services consistent.
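To illustrate the third drawback, here is a minimal sketch of an external service's allowlist check. The `allowed` helper and all addresses are hypothetical (the addresses come from the 203.0.113.0/24 documentation range); the point is only that a per-node allowlist has to be updated every time a node is added, while a single shared egress IP does not.

```python
import ipaddress


def allowed(source_ip: str, allowlist: list[str]) -> bool:
    """Return True if source_ip falls inside any network on the allowlist."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(net) for net in allowlist)


# Hypothetical allowlist with one entry per existing node's public IP.
allowlist = ["203.0.113.10/32", "203.0.113.11/32"]

# Hypothetical public IP of a node created by a scale-up.
new_node_ip = "203.0.113.12"

print(allowed("203.0.113.10", allowlist))  # True: existing node is listed
print(allowed(new_node_ip, allowlist))     # False: new node is blocked until the list is updated
```

With a single egress address shared by all nodes, the allowlist would hold one entry and would not change when the cluster scales.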
Back in the day, we deployed the Nginx Ingress Controller together with a LoadBalancer Service in all our Kubernetes clusters, which meant we had a public Azure Load Balancer in front of all the VMSS instances. The obvious choice for us was to use that Load Balancer for egress traffic as well. This strategy worked well for a long time: our customers could easily find out which public IP their nodes would use for egress (to set up their external services’ firewalls), and we needed very little configuration to make it work.
The situation changed when we needed to either make the Nginx Ingress Controller internal only (by using an Internal Load Balancer) or remove it entirely for customers that don't need it. Either change would have cut off internet access for the instances and made our clusters practically useless.
We then decided to switch to the NAT gateway approach, sharing a single NAT gateway between all the node pools in our clusters. That way, all the nodes have a predictable public IP address that is completely decoupled from the (potentially multiple) Load Balancers in the cluster.
The only con we found was that it became more difficult for customers to find out the public IP used for egress traffic. Before, it was a matter of running a DNS query for one of the ingress hostnames. With the NAT gateway approach, they have to either gather that information from Azure in some way or run a Pod to get the public IP. That being said, we feel the flexibility we gained outweighs this small annoyance.
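As a rough sketch of the "run a Pod" option, the script below asks an external echo service which source address it sees. It is an illustration under assumptions, not the tooling we use: the `egress_ip` and `parse_public_ip` names are hypothetical, and `https://api.ipify.org` is just one example of a public service that returns the caller's IP as plain text; any equivalent endpoint would do.

```python
import ipaddress
import urllib.request

# Assumption: a plain-text "what is my IP" endpoint; swap in any equivalent service.
ECHO_URL = "https://api.ipify.org"


def parse_public_ip(body: str) -> str:
    """Validate the response body and return it as a normalized IP string.

    Raises ValueError if the body is not a valid IP address.
    """
    return str(ipaddress.ip_address(body.strip()))


def egress_ip(url: str = ECHO_URL, fetch=None, timeout: float = 5.0) -> str:
    """Return the public IP that the echo service sees for this host.

    `fetch` is an optional callable (url -> str) so the network call
    can be replaced, for example in tests.
    """
    if fetch is None:
        def fetch(u: str) -> str:
            with urllib.request.urlopen(u, timeout=timeout) as resp:
                return resp.read().decode("ascii")
    return parse_public_ip(fetch(url))
```

Running `print(egress_ip())` from a Pod scheduled on a worker node should print the address that external services see for that node's egress traffic, which with this setup is expected to be the NAT gateway's public IP.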
There are a few different ways of giving Azure virtual machines internet access. Among them, we think the NAT gateway strikes the best balance between cost, flexibility, ease of deployment, and customer-friendliness.