Giant Swarm's epic journey to Cluster API
• Mar 9, 2021
To tell the story of our journey to Cluster API, we have to tell the story of 2017. Kendrick Lamar's Damn album had just come out, the WannaCry ransomware attack spread across the globe, and more relevant to this blog post, at Giant Swarm, we had begun building up our own automation, our own tools, and our own operators.
To adequately frame how badass this decision was, it's important to point out that at the time there wasn't even an established operator framework. We were going rogue, in a positive way! So, picture this: Kubebuilder was still a year away, CoreOS was developing what's now known as the Operator Framework, and we, well, we were dreaming of the following:
- We wanted a smoother Kubernetes setup on-premises.
- We wanted to automate as much as possible because managing Kubernetes was a daunting task for our small startup, especially with the tools available at the time.
With these two goals clear in our minds, 2017 was all about expanding our vision to encompass multiple providers (on-premises and AWS at the time).
Our plan? We started structuring our automation more and tried to extract the overlap between these providers into provider-agnostic CRDs and operators, which in theory would run exactly the same on each provider. A brief digression here: Custom Resource Definitions (CRDs) are a way to extend the Kubernetes API with your own resource types, which operators can then watch and act on (read more about them in the official docs).
While our vision remained a worthwhile pursuit, some holes started to appear in our approach regarding extracting the provider-agnostic logic. At the time, as Azure came into the mix, we struggled to find reliable interfaces and common functionalities. Our assumptions were challenged: what we thought was provider agnostic sometimes turned out not to be, while other parts that we assumed to be specific were suddenly shared by two or more providers.
These challenges forced us to restructure our abstractions internally — mostly by changing our CRDs or completely replacing some of them. As our customer base was growing, the toil this caused internally was immense.
Although these problems caused us to rethink our overall architecture many times (with an eye on how other players in the community were tackling these challenges), there were a few things we didn't want to change:
- We liked using Kubernetes operators, a lot. The benefits of reliability, scalability, and reproducibility were (and still are) invaluable. In other words, we did not want to switch to something like Terraform.
- Despite our architectural issues, we managed to produce a stable and easy-to-use product. We didn't want to sacrifice any major features or stability.
- We needed the overall architecture to support multiple infrastructure providers (AWS, Azure, and KVM at the very least).
They say 'ask and you shall receive!' — unfortunately not this time. The community didn't offer a solution that checked all these boxes. However, we did see change on the horizon.
Enter: Cluster API!
In 2018, the first releases of Cluster API started to appear. We were excited about the prospect of a community-driven solution even though we took issue with some of the architectural choices and the stability was not quite there. In any case, during that time we were busy with our own internal architectural changes, and we also happened to be growing immensely as a company. All the while, Cluster API remained on our radar.
Finally, in 2019, it was the right time to start aligning with Cluster API. This meant that we started developing our operators with Cluster API in mind. Around the middle of 2019, when Cluster API v1alpha2 rolled around, true adoption of the Cluster API CRDs started to happen on our AWS provider.
While it's natural that we had some trouble adjusting to the architectural design, this didn't stop us from enjoying the benefits of this decision. By adopting Cluster API, we gained a clear separation between provider-independent and provider-specific information, and thanks to its ease of use, we ended up with a simpler internal structure.
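To make that separation a little more concrete, here is a minimal Python sketch of the idea. It is a heavily simplified illustration, not Cluster API's actual Go types and not our operator code: the class and field names are trimmed-down stand-ins, and the `store` dictionary and `resolve_infrastructure` helper are hypothetical. What it does mirror is the pattern Cluster API uses, where a generic `Cluster` object carries only provider-independent settings and points to a provider-specific object (such as an `AWSCluster` or `AzureCluster`) through an infrastructure reference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectRef:
    """Pointer from one custom resource to another (group, kind, name)."""
    api_group: str
    kind: str
    name: str


@dataclass(frozen=True)
class Cluster:
    """Provider-independent part: nothing AWS- or Azure-specific lives here."""
    name: str
    pod_cidr: str
    infrastructure_ref: ObjectRef


@dataclass(frozen=True)
class AWSCluster:
    """Provider-specific part for AWS (simplified to a single field)."""
    name: str
    region: str


@dataclass(frozen=True)
class AzureCluster:
    """Provider-specific part for Azure (simplified to a single field)."""
    name: str
    location: str


def resolve_infrastructure(cluster, store):
    """Follow the reference from the generic object to the provider object.

    `store` is a hypothetical stand-in for the Kubernetes API server,
    keyed by (kind, name).
    """
    ref = cluster.infrastructure_ref
    return store[(ref.kind, ref.name)]


# Illustrative data only; names, regions, and CIDRs are made up.
INFRA_GROUP = "infrastructure.cluster.x-k8s.io"

store = {
    ("AWSCluster", "prod"): AWSCluster(name="prod", region="eu-central-1"),
    ("AzureCluster", "staging"): AzureCluster(name="staging", location="westeurope"),
}

prod = Cluster(
    name="prod",
    pod_cidr="10.0.0.0/16",
    infrastructure_ref=ObjectRef(INFRA_GROUP, "AWSCluster", "prod"),
)
staging = Cluster(
    name="staging",
    pod_cidr="10.1.0.0/16",
    infrastructure_ref=ObjectRef(INFRA_GROUP, "AzureCluster", "staging"),
)

print(resolve_infrastructure(prod, store).region)
print(resolve_infrastructure(staging, store).location)
```

The point of the split is that anything working only with `Cluster` never needs to know which provider sits behind the reference, which is the kind of overlap we had been trying to extract by hand.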
In 2020, we managed to also switch our Azure provider to Cluster API CRDs (v1alpha3, to be exact). And during this time we really started to embrace the changes and think bigger. If luck is what happens when opportunity meets preparation, we were feeling lucky!
Our luck with Cluster API looked like this:
- We would give our customers more control to easily configure individual parts of the infrastructure.
- We would prioritize the easy adoption of more providers.
And here we are in 2021 and we're approaching the final leap:
We want to fully utilize Cluster API: not only the CRDs but also the operators, directly from the community project. We are aware that there will be problems to solve and hurdles to overcome, but we think that right now is the time to start making the switch. We will most likely retain some of our operators, which add functionality that is not currently included. However, the bulk of them will go away, and that's exciting! For the entire month of March, we've decided to focus on Cluster API and translating its value to our customers. If you would like to learn more about our insights, get in touch.
Marcel is a Platform Engineer at Giant Swarm and a contributor to the Cluster API project. Find him on Twitter.