When we talk about applications and their deployment in production, we often talk about reliability as a non-functional requirement. What is it that we really want from applications and their deployment when we say “reliable”?
Usually, we mean that we want the application to do what it is expected to do and run without failure and outages. For many this means it must be stable (requirement for the code) and available (requirement for the deployment). So you go and do a lot of testing and QA to get stable, “bug free” applications. And then you go and get yourself a server from a cloud or other infrastructure provider that has high SLAs, like “five 9s”. But is the application now really reliable? And is reliable actually enough? Is this the right way to build and deploy highly complex and scalable applications in 2015?
Maybe…if it wasn’t for Murphy’s Law (“If anything can go wrong, it will”). On the one side we have an application that can hardly be “bug free” as long as we are restricted by common time and budget constraints. There’s always something you didn’t foresee and for which you didn’t test. This only gets worse with the now-common distributed systems and with technology-based innovation that has to move fast. On the other side we have servers (hardware and software plus networking and storage) that can never guarantee 100% availability. Sure, we could try to throw more money at the problem, but we’ll never get to 100%, so why not go a different route? Why not build resilient applications that get resiliently deployed?
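To put rough numbers on the availability side of this argument, here is a minimal back-of-the-envelope sketch. The function names are made up for illustration, and the formulas assume that failures are independent of each other, which real infrastructure rarely guarantees.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability):
    """Expected downtime per year for a given availability (e.g. 0.99999)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def serial_availability(availability, services):
    """Availability of a request that needs `services` components in a row.

    Assumes every component has the same availability and fails independently.
    """
    return availability ** services


def redundant_availability(availability, replicas):
    """Availability when any one of `replicas` independent copies is enough."""
    return 1.0 - (1.0 - availability) ** replicas
```

Under these assumptions, “five 9s” still allow a little over five minutes of downtime per year, and a request that has to cross ten such components in series sees only roughly four 9s. Redundancy pushes the number up again, but it approaches 100% without ever reaching it.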
First, what is resilience? There are several definitions for it depending on who you ask and in what context it is used. Let’s take some of these and get a sense of what it means for us:
the power or ability to return to the original form, position, etc., after being bent, compressed, or stretched; elasticity. (http://dictionary.reference.com/browse/resilience)
the capacity to recover quickly from difficulties; toughness. (https://www.google.de/webhp?q=resilience%20definition)
the ability [of a system] to cope with change. (http://en.wikipedia.org/wiki/Resilience)
The first definition tells us about the general background of the term. The following literary description is quite fitting:
“The oak fought the wind and was broken, the willow bent when it must and survived.” — Robert Jordan
Thus, resilient systems should cope with problems elastically rather than merely be hardened against failure. Hardening will at some point not be enough, and the system will break.
The second one states the fast recovery of systems (or people if used in a psychological context) more explicitly. As mentioned above, you can always reduce failure but you can never be 100% failure-free. Thus, you need to be as good as possible in recovering from failure, which again goes hand in hand with the image of elasticity from our first definition.
The third definition comes from a supply chain background and is more about keeping a system running. Adapted to a more general context, the system should be able to cope with change as well as with minor or major disruptions.
Resilience in Application Development and Deployment
Now let’s bring those concepts into the realm of application development and deployment and see what patterns and tools can help us get there. (Note: we won’t go into resilient software in the security context, which is a whole different topic by itself.) Reviewing our context, on the one side we have a complex application that has to be iterated upon and adapted based on dynamic changes in the market. To stay competitive, it has to go through testing and staging into production quickly, ideally even continuously. On the other side we have the deployment of said application, in which it has to be deployed to a production (and before that to a testing/staging) environment, scaled up and down with a dynamic influx of users/usage, and kept highly available in a multitude of scenarios.
Resilient Application Development with Microservices
On the application side, engineers and organizations are looking to microservices architectures to be more agile and build better applications. In microservices architectures, applications are split into highly decoupled, simple, distributed services that communicate with each other over lightweight communication mechanisms like message queues and HTTP APIs.
Microservices are highly decoupled and built for failure (note: this doesn’t mean you shouldn’t do backups); thus, they are capable of coping with failure and outages. This isolation of simple, small services makes them independent of each other, so that when one fails the others don’t stop working. They are explicitly built to expect failure (in the service itself as well as in other services).
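As a minimal sketch of what “expecting failure in other services” can look like in code, the caller below degrades to a fallback instead of failing itself. The helper name `call_with_fallback` and the example services are hypothetical; a real service would usually also add timeouts and logging.

```python
def call_with_fallback(primary, fallback, expected_errors=(Exception,)):
    """Call `primary`; if it raises one of `expected_errors`, use `fallback`.

    Both arguments are zero-argument callables. This is an illustrative
    helper, not part of any particular framework.
    """
    try:
        return primary()
    except expected_errors:
        # The dependency is down or misbehaving: degrade instead of breaking.
        return fallback()


def fetch_recommendations():
    # Stand-in for a call to another (hypothetical) service that is down.
    raise ConnectionError("recommendation service unreachable")


def default_recommendations():
    # A static answer that keeps the calling service useful.
    return ["bestsellers"]


items = call_with_fallback(fetch_recommendations, default_recommendations)
```

Here `items` ends up as the static default list, so the calling service keeps answering requests while the other service is unavailable.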
Through this decoupling we actually can get a lot of resilience into our applications. They get the ability to cope with change and disruptions. All this while keeping agility and speed in the development process. However, the elasticity and recovery parts of resilience still need more than just moving to microservices architectures in development to be fulfilled.
Resilient Deployment with Containers on Cluster Infrastructures
Elasticity and recovery, though they partly have to be built into the services, are more of a job for the deployment, management, and scaling of said microservices systems, which also have to be kept highly available. This is where containers and cluster infrastructures come into play.
With containerization we get a simple technology to package our microservices into consistent deployable units. As these containers can then be run in various environments as is, developers won’t have to take care of differences between development, staging, and production.
However, now comes the hard part: these containers need to be deployed, managed, and scaled in production. And remember, we want to have resilience, i.e. elasticity and recovery. Ideally we don’t want to have any single point of failure. Ideally we want this to run on cluster infrastructures that abstract away the underlying hardware and bring us redundancy. Further, we have to work with load balancers and circuit breakers in between services that enable us to elastically scale up and down horizontally.
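To make the circuit breaker idea more concrete, here is a minimal sketch of the pattern. The class and parameter names are chosen for illustration only and do not come from a specific library; production implementations add thread safety, metrics, and finer-grained half-open handling.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the downstream service."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # consecutive failures before opening
        self.reset_timeout = reset_timeout  # seconds to wait before a trial call
        self.clock = clock                  # injectable so it can be tested
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of piling requests onto a broken service.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: let one trial call through ("half-open").
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each request to another service in `breaker.call(...)` and treat `CircuitOpenError` like any other failure of that service, for example by serving a fallback response.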
Scaling and redundancy help us to not completely break in case of failure. To be really resilient, though, we need to be able to cope with failure and recover to the intended/original state of the application. Further, when parts of the application go down, they have to come up again. However, the remaining services have to know how the new services can be reached. This can be solved by taking the configuration out of the container and injecting it only at runtime. This process results in containers that are surrounded by configuration services (ideally also running in containers), which help them run in the respective environment and find their peers. When these containers are tied together so that they don’t fully break, but go down and come up together gracefully, we get a deployment that is resilient in the sense of elasticity. It ensures that services come back into a state that looks like the original state from outside (even if e.g. the underlying server has changed).
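One simple way to inject configuration at runtime is through environment variables that the surrounding infrastructure sets when it starts a container. The sketch below assumes a hypothetical naming convention of `<SERVICE>_HOST` and `<SERVICE>_PORT`; real setups often use a dedicated configuration or discovery service instead.

```python
import os


def resolve_service(name, env=None):
    """Look up the host and port of a peer service at runtime.

    Assumes (hypothetically) that the environment provides variables such as
    USER_DB_HOST and USER_DB_PORT for a service named "user-db".
    """
    env = os.environ if env is None else env
    prefix = name.upper().replace("-", "_")
    host = env.get(prefix + "_HOST")
    port = env.get(prefix + "_PORT")
    if host is None or port is None:
        raise LookupError("no runtime configuration found for " + name)
    return host, int(port)


# The same container image can run unchanged in different environments,
# because only the injected values differ.
staging = resolve_service("user-db", {"USER_DB_HOST": "10.0.0.5", "USER_DB_PORT": "5432"})
```

Because the address is resolved when the service runs rather than when the image is built, a replacement container on a different server can be reached as soon as the injected values are updated.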
Resilience in Real Life is Hard
Getting resilience into applications is possible, and companies like Netflix, Twitter, Facebook, and others are showing the way. They share some of their developments with the public, and some of these even cater specifically to engineering resilience. They test their own systems rigorously with tools like the Simian Army or even go as far as shutting off an entire data center to test resilience. However, even these companies with their advanced resilient architectures cannot completely prevent outages.
Even with all these tools out there, it is not easy to build resilient systems if you’re a company with fewer than 100 developers. Both getting started with microservices architectures and the resilient deployment of said microservices are hard. At Giant Swarm, we try to help our users with both. First, we try to provide them with the tools necessary to build their own microservices architectures, while giving them the freedom to use whatever other tools they deem better for the job. Second, we offer our users a microservices infrastructure that makes it simple to deploy and manage applications resiliently. In the end:
“Resilience is all about being able to overcome the unexpected. Sustainability is about survival. The goal of resilience is to thrive.” — Jamais Cascio
Resilient applications give you much more than just reliability. They help you build a continuous innovation process to move fast and stay ahead on your goals. They help you to thrive.