Feb 13, 2018
Our aws-operator is a Kubernetes operator that manages Kubernetes clusters running on AWS. We recently released version 2.0 of the operator that uses AWS CloudFormation and our OperatorKit framework. This replaces aws-operator 1.0 which used the AWS APIs directly without CloudFormation. This post is about why we moved to CloudFormation and how it helps us manage Kubernetes clusters on AWS for our customers.
The first version of the operator served us well. It allowed our customers to easily create Kubernetes clusters on AWS. However, using the AWS APIs directly via the AWS Go SDK caused us several problems.
Firstly, whenever we needed to extend the operator we had to write a lot of code. Even something as simple as changing the configuration of an ELB (Elastic Load Balancer) could mean touching many code paths. Another challenge was that many AWS resources are eventually consistent. This meant we needed retry logic both when creating resources and when deleting them.
Lastly, there were challenges with cluster deletion. We name and tag our AWS resources with a unique cluster ID that we assign to each cluster, but apart from that there was no grouping of resources. If an error occurred while deleting a cluster, some resources could be left behind.
A further problem with aws-operator 1.0 was that it was developed before our OperatorKit library. This meant that the structure was very different compared to our newer operators. OperatorKit helps us remove duplicate code from our operators and helps ensure they are structured in a standard way.
The new version of aws-operator uses OperatorKit to provide a lot of functionality. It handles simple tasks like ensuring our CRDs (Custom Resource Definitions) exist in the cluster when the operator starts. It also provides more sophisticated functionality. One example is the OperatorKit reconciler framework that ensures AWS resources are reconciled from their current state to the desired state. Another example is the resource router that ensures the correct set of resources is executed based on the cluster version. We’ll talk more about OperatorKit in a later blog post.
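The idea of reconciling from current state to desired state can be sketched in a few lines of Go. This is only an illustration of the general pattern, not OperatorKit's actual API; the `state`, `plan` and `diff` names are hypothetical, and each resource is reduced to a name and a version string.

```go
package main

import (
	"fmt"
	"sort"
)

// state maps a hypothetical resource name to a version of its configuration.
type state map[string]string

// plan lists the changes needed to move from the current to the desired state.
type plan struct {
	Create, Update, Delete []string
}

// diff compares the two states and returns the changes to apply.
func diff(current, desired state) plan {
	var p plan
	for name, want := range desired {
		have, ok := current[name]
		switch {
		case !ok:
			p.Create = append(p.Create, name)
		case have != want:
			p.Update = append(p.Update, name)
		}
	}
	for name := range current {
		if _, ok := desired[name]; !ok {
			p.Delete = append(p.Delete, name)
		}
	}
	// Map iteration order is random, so sort for a stable result.
	sort.Strings(p.Create)
	sort.Strings(p.Update)
	sort.Strings(p.Delete)
	return p
}

func main() {
	current := state{"vpc": "v1", "elb": "v1", "old-bucket": "v1"}
	desired := state{"vpc": "v1", "elb": "v2", "asg": "v1"}
	fmt.Printf("%+v\n", diff(current, desired))
}
```

A reconciler runs a comparison like this on every loop and applies the resulting plan, so a failed step is simply picked up again on the next iteration.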
CloudFormation provides several benefits for us. We still use the AWS Go SDK to access the CloudFormation API. However, we now manage the resources declaratively using the stack abstraction provided by CloudFormation. CloudFormation takes care of individual errors, retries and dependencies between resources.
We use YAML to define the CloudFormation templates. We render the templates using standard Go templating and combine the data in the cluster’s CR (Custom Resource) with configuration managed by the operator.
Most of the guest cluster resources are contained in a single stack. This makes deleting resources much simpler, as we only need to delete the stacks. CloudFormation does a pretty good job of figuring out the dependencies between resources, but we can override these in the templates if necessary.
An important feature for us is rolling updates of Auto Scaling groups. This is only possible via the CloudFormation API and not via the EC2 API. This can be combined with lifecycle hooks to drain Kubernetes nodes before they are shut down.
There are also some simpler features that are very useful. For example, we tag the stack with the Kubernetes Cloud Provider tags and these then propagate to all resources within the stack.
Overall, using CloudFormation greatly simplifies the Go code in the operator. This means we can develop new features more quickly. The grouping of resources into stacks and tagging also makes it easier to support clusters and debug problems. Lastly, and perhaps most importantly, it makes it easier to update clusters.
If creating Kubernetes clusters with CloudFormation sounds good, you can check out aws-operator on GitHub. We’re always happy to hear your ideas. Feel free to contribute in the form of issues and PRs. Also look out for an upcoming post from us on OperatorKit.