Never Break The Upgrade Path
Oct 15, 2020
The product team at Giant Swarm recently made a noteworthy decision: never break the upgrade path.
Since we’re a positive bunch, the exact wording of the statement was:
Always provide an upgrade path for customer clusters, regardless of how much more effort it is or by how much the delivery date will have to be postponed. The only exception is when it is absolutely not possible to create an upgrade path (for example, because of a provider limitation). In this case, try to maintain backward compatibility so customers can keep updating to newer versions even though the specific feature will not be available.
How we got here
The events leading up to this decision were a year in the making. Last summer, we started working on node pools for AWS. This required a redesign of the CloudFormation and the way we use the network on AWS. There was a lot of pressure on the team since we were contractually bound by one of our customers to provide this functionality. Said customer was pushing us very hard to release node pools. Plus, they financially supported the development. We call it payment for influencing the timeline.
This contractual part was different from our typical mode of working with customers, which is normally based on trust and transparent communication instead of contractual obligations. But this was a new customer that was accustomed to working with vendors in a certain way, and we were planning to develop node pools anyway. On top of that, we made clear that this functionality would still be available across the board to all of our (AWS) customers.
Due to the time pressure, and especially in order to achieve this without planned downtimes, the only sensible way to build node pools into the clusters was to require launching new clusters and not allow migrations. A migration path would have been hell to build, with all the IP-level changes going on from pre- to post-node pools. “Yes, sure!” was the response when we asked all of our customers if they would be ok with a release that doesn’t have a migration path. Later, we realized the customer in question didn’t really know what they were getting themselves into, and to a certain extent, neither did we. We had been doing transparent, in-place upgrades for everything for too long to clearly remember the time when we needed to move customer workloads between clusters.
We should have known better. Sacrificing the upgrade path for speed caused our customer to incur a large upgrade deficit and a security incident along with it. In the end, everything was slower.
In the early days of Giant Swarm, we did not document decisions or learnings, as often happens in small companies. Many serious discussions were held about this issue and the pros and cons were weighed. Since we are a self-organizing company, the team took the ultimate decision.
We all know that hindsight is 20/20. So while this all seems pretty obvious now, at the time we were learning as we developed the functionality. The “no-upgrade-path” version was supposed to be an MVP towards a fully upgradeable functionality. But freed from the constraints of an upgrade path, it was easier to bring in a few changes that would have been hard to make otherwise. So in the end, the decision was made to not provide an upgrade path at all because it would have been way too much work. This decision was deliberate and was not taken lightly, and it brought us to where we are now.
A year later, the customer that pushed for the functionality is just starting to adopt node pool clusters. Overall, only 30% of our clusters are using node pools. Due to the missing upgrade path, a lot of customers are stuck on legacy versions, which we have had to maintain with bug fixes, security patches, etc. for far too long.
Seeing the consequences of a broken upgrade path once again, we made the decision to never break an upgrade path again.
Providing an upgrade path for all releases increases the time it takes to deliver new versions. Product and Engineering also need to take a holistic view with regard to the implications. As a result, we are now looking into better tooling around migrations, versioning, and validation of CRDs. We are also ensuring that our product stays flexible. We are making sure our operators don’t become mini-monoliths. In addition, we continue to split the infrastructure into smaller pieces. For example, on AWS, instead of having one big CloudFormation, we have now separated it into smaller CloudFormations.
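To make the idea of validation tooling a little more concrete, here is a minimal sketch of an upgrade path check. It is not our actual tooling. The function name, the release names, and the idea of modelling releases as a simple mapping are all assumptions made for illustration. The check treats releases as a graph and reports every release that cannot reach the latest one through some chain of upgrades.

```python
from collections import deque


def releases_without_upgrade_path(upgrades, latest):
    """Return releases that cannot reach `latest` via any chain of upgrades.

    `upgrades` is a hypothetical mapping from each release name to the
    list of releases it can be upgraded to directly.
    """
    # Build reverse edges so we can walk backwards from the latest release.
    reverse = {release: set() for release in upgrades}
    for source, targets in upgrades.items():
        for target in targets:
            reverse.setdefault(target, set()).add(source)
    reverse.setdefault(latest, set())

    # Breadth-first search: collect everything that can reach `latest`.
    reachable = {latest}
    queue = deque([latest])
    while queue:
        current = queue.popleft()
        for previous in reverse[current]:
            if previous not in reachable:
                reachable.add(previous)
                queue.append(previous)

    # Anything left over is stranded without an upgrade path.
    return sorted(set(reverse) - reachable)


# Illustrative data only: two legacy releases with no path to node pools.
example_upgrades = {
    "legacy-1": ["legacy-2"],
    "legacy-2": [],
    "nodepools-1": ["nodepools-2"],
    "nodepools-2": [],
}
stranded = releases_without_upgrade_path(example_upgrades, "nodepools-2")
print(stranded)
```

A check like this could run before a release is published, failing whenever the list of stranded releases is not empty.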
Since we are also a managed service company, we understand that this requires skillful customer management, especially if a new version was promised to a customer on a specific date, not to mention working with enterprises to keep them on the very latest version.
Breaking the upgrade path leads to serious delays in customer upgrades. It forces us to maintain and fix legacy versions and causes many support issues that could be solved by an upgrade. Not having an upgrade path creates great cost for both us and our customers. This cost exceeds even months of extra development effort.
Ultimately, our main focus is to provide our customers with reliability and stability.
Contact us if you would like to put your Kubernetes in the hands of a team that will apply learnings from others to you.