Jun 10, 2021
As some of you might have seen, we’ve been working on and talking about Cluster API a lot lately. I felt like now’s a good time to review what Cluster API (CAPI) even is, why we care so much about it, as well as what it will mean for our product and customers moving forward.
In early 2018, Cluster API started as a project of Kubernetes SIG Cluster Lifecycle, and there were discussions around it even before that, which is all to say it's not exactly new. However, as it has ambitious goals and involves a lot of topics that need to be addressed with care, it has not yet moved out of the alpha stage.
The Cluster API Book describes CAPI as follows:
Cluster API is a Kubernetes sub-project focused on providing declarative APIs and tooling to simplify provisioning, upgrading, and operating multiple Kubernetes clusters.
Started by the Kubernetes Special Interest Group (SIG) Cluster Lifecycle, the Cluster API project uses Kubernetes-style APIs and patterns to automate cluster lifecycle management for platform operators. The supporting infrastructure, like virtual machines, networks, load balancers, and VPCs, as well as the Kubernetes cluster configuration are all defined in the same way that application developers operate deploying and managing their workloads. This enables consistent and repeatable cluster deployments across a wide variety of infrastructure environments.
When explaining the gist of it, I usually try to boil it down to Cluster API actually being two things:

1. A declarative, Kubernetes-style API for provisioning, configuring, and managing clusters.
2. Controller implementations of that API for the various infrastructure providers.

As we already had production-proven controllers for 2., we started by adopting 1., i.e. the API of Cluster API, and rolled that out for our AWS and Azure provider implementations over the last two years, using upstream CAPI versions v1alpha2 and v1alpha3 respectively.
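Concretely, the API part is a set of custom resources: a provider-agnostic Cluster object that delegates infrastructure details to a provider-specific object. A minimal sketch for the v1alpha3 AWS case (names and CIDR are illustrative, and real manifests carry more fields) looks roughly like:

```yaml
apiVersion: cluster.x-k8s.io/v1alpha3
kind: Cluster
metadata:
  name: my-cluster
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]
  # Provider-specific details live in a separate object referenced
  # here; targeting a different provider means swapping this ref.
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
    kind: AWSCluster
    name: my-cluster
```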
In the meantime, we tried to stay involved where possible in upstream discussions at Contributor Summits and KubeCons, as well as in proposals and design documents on GitHub and in Google Docs.
Earlier this year, we sat together with the Product Owners and Architects of our teams and asked ourselves:
“Is there a product topic that is priority #1 this year?”
As we had already made some significant steps towards Cluster API in our Kubernetes as a Service area last year, the answer pretty quickly got narrowed down to:
“Let’s focus on going all-in with Cluster API.”
We knew it would be a big technical challenge, involving many mindset changes and a significant impact on the future of our product. That increased the need to assess the impact and challenges, and to make certain decisions more quickly. So shortly before March, our CTO Timo was like, "What if we dissolved 6 of our 8 teams and did a Cluster API Sprint with them for a month?" And that's exactly what we did.
At this point, you might be asking yourself:
Why is a technical topic a product priority?
Why do they care so much?
Isn’t it risky to bet on a project that is still in alpha?
In the next few sections of this post, I'll try to answer each of these.
Dating back to 2017 with early talks around the topic, and then 2018 with the first appearances of the Cluster API concept, it was clear to me that I wanted us to be involved as much as possible. I still remember the impromptu circle around Kris Nóva in the hallway on the last day of KubeCon EU 2018 in Copenhagen and the first official WG Zoom calls shortly after. The reasons I was excited about it back then are still the core reasons why we’re excited about CAPI now — plus there’s more.
Let’s revisit them, split up into API and implementation.
There are a few important points as to why we care about such an API.
First, a consistent API across providers enables broader use cases like hybrid- and multi-cloud, but it also means that you get standardized tooling that can talk to this standardized interface. It also means lower lock-in to providers and tooling.
Second, a declarative API for managing infrastructure enables users to achieve goals like immutability and shifting security left more easily. Processes that the teams are used to from their day-to-day software development can thus be applied to managing infrastructure.
Third, having this API be a “Kubernetes-style” API creates out-of-the-box familiarity for people that have already started learning Kubernetes and working with its API (and yes, the YAML). It also means that a lot of integration with tooling that is already there for Kubernetes can work with no or little additional work. Recently trending GitOps approaches to delivery also just work™.
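Because clusters and machines are ordinary Kubernetes objects, the standard tooling applies unchanged. A sketch of what that looks like in practice (the manifest path is hypothetical):

```shell
# Inspect clusters and machines like any other Kubernetes resource
kubectl get clusters,machines --all-namespaces

# Keep cluster manifests in git and apply them declaratively,
# which is exactly what GitOps tooling automates
kubectl apply -f clusters/production/cluster.yaml
```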
The provider implementations used to be called reference implementations, which made them sound like they were just there to show others how the API could be implemented. However, from the early days it was already foreseeable that Cluster API would go a similar way as kubeadm before it, i.e. the implementations could become de facto standards for provisioning Kubernetes on many providers. Not only was it the same SIG driving the project, it was also driven by similar people and with a similar mindset.
By now, you can see that many companies, including the heavyweights and hyperscalers in our ecosystem, are working on the Cluster API project. It is a full-on community effort and we don’t see that going away.
There are also a few benefits that directly result from the implementations.
First, the implementations are Kubernetes controllers (operators) that implement a reconciliation approach. Having gained quite some experience with that at Giant Swarm, we believe this is the better approach, with many benefits over declarative approaches that apply changes once and do not continuously reconcile the desired state.
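The reconciliation idea can be sketched in a few lines of plain Go. The types and names here are illustrative, not the actual controller-runtime API: the point is that the controller repeatedly compares desired with observed state and computes the actions that close the gap, rather than applying a change once and forgetting about it.

```go
package main

import "fmt"

// PoolState describes a hypothetical machine pool by its replica count.
type PoolState struct {
	Replicas int
}

// reconcile compares desired vs. observed state and returns the actions
// needed to converge. A real controller would execute these actions,
// requeue, and run again until the diff is empty.
func reconcile(desired, observed PoolState) []string {
	var actions []string
	for observed.Replicas < desired.Replicas {
		actions = append(actions, "create machine")
		observed.Replicas++
	}
	for observed.Replicas > desired.Replicas {
		actions = append(actions, "delete machine")
		observed.Replicas--
	}
	return actions
}

func main() {
	// Observed state lags behind desired state: two creates are needed.
	fmt.Println(reconcile(PoolState{Replicas: 3}, PoolState{Replicas: 1}))
}
```

Crucially, if something outside the controller changes the observed state later, the next reconcile run produces a fresh set of corrective actions, which is what an apply-once tool cannot do.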
Second, the implementations are done in a collaborative manner by the whole community and usually include engineers from the infrastructure providers themselves. This leads to features being developed much quicker and integrations being well adjusted for each provider. It also carries the benefits of what I call the 'many eyes principle' of upstream community work, where a solution is scrutinized through the eyes of many diverse contributors and in the end represents a consensus that is stronger and better abstracted than it would be if driven by a single vendor.
An often-heard fear around Cluster API adoption is that officially the project and API are still in the alpha stage.
And, indeed, the API has changed quite dramatically, especially between v1alpha2 and v1alpha3. By now, however, v1alpha4 is landing, v1alpha5 is already being planned, and recent changes have been smaller. So much so that there have even been discussions about whether the project could soon graduate to beta or further.
Still, at least for implementers, the fast-moving pace of alpha can be challenging, and I could not yet recommend that everyone just go and build their own CAPI implementation for production.
That being said, we as a company that has been around since the early days, albeit small, consider it stable enough to warrant the effort. We also don’t want to just be passive beneficiaries, but active contributors in this upstream project that aligns so well with our philosophy. And for us, that means that we need to commit to it in a significant way.
In March, we already made it pretty clear that we’re betting big on Cluster API.
Huge win for the Cluster API community. The Azure team already threw their weight behind the project (https://t.co/GmuNT6Y9Mm), VMware Tanzu is all-in on Cluster API, and now another long-standing member of the container ecosystem is betting big on this important k8s OSS project. https://t.co/Qpn4oy2Ca6
— Ross Kukulinski (@rosskukulinski), March 4, 2021
And in the time since, we have been talking to customers and internal stakeholders and planning our migration to full Cluster API support.
In short, it means we are throwing away a lot of what we’ve built over the last few years and replacing it with upstream components. While doing that, we’ll contribute our learnings and improvements as well as additional functionality that we might see missing back to the upstream community. If there’s something missing, the first goal is to bring it upstream, and if in some cases that is not feasible (e.g. because it might be out of scope for the Cluster API project) we'll release it as a Cluster API compatible OSS project.
For our customers, it means they will gain all the above-mentioned benefits of Cluster API, and at the same time keep the production-ready quality and reliability of the clusters they are already using. This will be a non-breaking and smooth transition and we are planning on providing early alphas, betas, and RCs that we, along with our customers, can test thoroughly.
On the user experience side, some providers might choose to only use CAPI internally and expose a simpler interface to end-users. A simpler interface is an important goal, but we also believe we should not hide any functionality from our users. As we benefit from the luxury of high trust with our customers, we want to expose the full vanilla API to them as well. Sure, we are also working on making the API easier to use and might offer abstractions on top. But, similar to how we've seen closed PaaS systems limit power users, we believe that at some point users become mature enough that they need to poke through abstractions and adjust things to their needs.
A nice example is the recently shown Azure CLI extension for Cluster API on Azure. It's easy to get going and provides the user with a fully working cluster, but it still exposes the full functionality of Cluster API, and upstream tooling like clusterctl still works:
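For instance, once such a cluster exists, the generic upstream tooling can work with it directly. A sketch, with a hypothetical cluster name (`clusterctl get kubeconfig` is the upstream command for fetching a workload cluster's kubeconfig):

```shell
# Fetch the workload cluster's kubeconfig via upstream tooling
clusterctl get kubeconfig my-cluster > my-cluster.kubeconfig

# Then talk to the new cluster with plain kubectl
kubectl --kubeconfig my-cluster.kubeconfig get nodes
```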
Suffice it to say, we're very excited. Seeing all the collaboration (and shared Slack channels) that has formed around this with friends from many different companies makes my community-centric heart beat faster. If you want to collaborate more closely with us, I want to hear from you. And if you're a customer or a potential one, I'm also very happy to talk about timelines, expectations, and also fears. We're on this journey together!
Giant Swarm’s managed microservices infrastructure enables enterprises to run agile, resilient, distributed systems at scale, while removing the tasks related to managing the complex underlying infrastructure.