The Radical Way Giant Swarm Handles Service Level Objectives
Aug 21, 2020
At Giant Swarm, we understand that SLAs are needed. We also know that SLOs, while generally accepted as a concept, are still very new in practice. Hence, they are difficult to define and even more difficult to put into contracts. This creates an obvious tension. And, as is often the case, we live with this tension and cherish it.
We tend to joke that our 'best effort' often goes beyond any service level our customers are used to from other vendors. And while our contracts include an SLA of 99.95% uptime, it is the SLOs we have lived by in practice, and recently wrote down, that best convey what we mean.
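For context, a 99.95% uptime SLA leaves only a small downtime budget. A quick back-of-the-envelope calculation (using a 30-day month for simplicity):

```python
# Downtime budget implied by a 99.95% uptime SLA.
SLA = 0.9995

minutes_per_month = 30 * 24 * 60    # 43,200 minutes in a 30-day month
minutes_per_year = 365 * 24 * 60    # 525,600 minutes in a year

budget_month = (1 - SLA) * minutes_per_month  # ~21.6 minutes per month
budget_year = (1 - SLA) * minutes_per_year    # ~262.8 minutes (~4.4 hours) per year

print(f"{budget_month:.1f} min/month, {budget_year:.1f} min/year")
```

In other words, the contractual SLA allows roughly 22 minutes of downtime per month — which is exactly why the response-time objectives below matter so much.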
Our 'Service Level Objectives' — and we know they aren't measurable in the way SLOs are 'supposed' to be — are a direct manifestation of our company principles, especially with regard to our relationship with our customers.
To be specific:
- Mutual trust and transparency, which doesn’t stop at the letter of the contract
- Learning together, and not imposing impractical initial targets
- Focusing on how to fix an issue, not on whose fault it is
Here are our SLOs in all their best-effort and transparent glory.
Giant Swarm's Service Level Objectives
Giant Swarm and its customers have a shared objective to keep things up and running. In essence, Giant Swarm is expected to help quickly when things break and customers are expected to keep up with upgrades.
Giant Swarm's responsibility
First, response times. Our goal is to respond quickly and efficiently. An engineer who is able to fix the problem responds within 15 minutes in most cases; in fact, our average response time is 90 seconds. When something happens, an engineer is alerted via message and is then called after five minutes. If there is no response, a second engineer is alerted, then called. After 20 minutes, every Giant Swarm engineering phone rings.
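The escalation chain above can be sketched as a simple timed schedule. This is an illustration based only on the steps described in this post, not our actual paging configuration; the timing of the second-engineer step is not specified in the text, so the value used here is an assumption.

```python
# Illustrative escalation schedule, following the steps described above.
# NOT actual Giant Swarm paging configuration.
ESCALATION_STEPS = [
    (0, "message the first on-call engineer"),
    (5, "phone the first on-call engineer"),
    # Timing below is assumed for illustration; the post only says
    # a second engineer is alerted "if no response".
    (10, "message, then phone, a second engineer"),
    (20, "ring every Giant Swarm engineering phone"),
]

def actions_due(minutes_since_alert: int) -> list[str]:
    """Return every escalation action triggered by the elapsed time."""
    return [action for t, action in ESCALATION_STEPS if t <= minutes_since_alert]
```

For example, `actions_due(0)` yields only the initial message, while `actions_due(25)` yields all four steps, ending with every engineering phone ringing.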
Second, monitoring. We start with something basic and best effort. We learn, iterate, and improve by running things in production. The Giant Swarm team can be alerted manually through a dedicated email account. Phones ring. We fix it. The goal is for monitoring to catch issues before they become outages and to reduce manual alerts to as close to zero as we can.
Third, we take pride in keeping stuff up and feel frustrated when it's down. We feel that we are in this together and don't want to get into blame games. We will always put the ‘how to fix’ before the ‘why it went wrong’. And we never waste time on finger-pointing. Our goal is to keep the service up and get it up when it goes down. Discussions about why it went wrong come later. They are addressed purely from the perspective of solving the root cause in order to prevent it from recurring.
The customer's responsibility
First, upgrade your clusters and stay at most one major version behind the latest. When upgrades aren't done within six months, you are vulnerable to security holes and we cannot meet our shared objective.
Second, fix and troubleshoot problems that occur within your own services. After we investigate and notify you of a problem on your side for the third time, we may turn off alerts.
Third, help us learn and tell us when there are issues. Direct and open communication with all parties involved is important. Everyone should be aware of the problems and challenges on both sides. Only then can good solutions be found.
Finally, note that we have this shared objective not only for production, but for everything, including development, testing, and staging.
To summarize, customers rely on us as their SRE Team. Yes, our business model is to become part of your team. We are not a typical managed service provider to whom problems are outsourced.
As a result, our partnership transcends the letter of the contract. Our customers understand this and live the benefits every day.
This way of thinking is probably new and foreign to most enterprises. Dare I say, even radical. Which is why it's difficult for prospective customers to imagine. We hope this post helps clarify our motivations and commitment to our customers.