Resilience Engineering and the Five Missing Teeth of Availability Management
“Do as I say, not as I do” is an instruction that most children are familiar with. Their questions receive the shriek “Because I’m your mother!” (or father). Such vacuous guidance is frustrating because the child doesn’t learn why something should be done, or how to do it correctly. So children instead seek the advice of their friends, their peers, and this is how they really learn. And so it can be with IT operations these days.
In particular, I see this with availability management as defined by ITIL – the IT service management (ITSM) best practice framework.
I’ll Explain Why
The availability management “process” is there to guide practitioners on how to “do” availability, avoid failure, keep things going, and not spend too much. It sort of tells you why you need to do this – to keep the service up in a cost-efficient manner, duh – but it doesn’t really tell you how. “Surely telling you how would be too prescriptive,” shrieks Mother. “We couldn’t cover all the possible use cases so you need to work it out for yourself,” she continues.
But in this day and age, with a democratized internet of knowledge, the practitioners out there can seek out their peers who are in the same leaky, unreliable boat as them. And they will, or might, find that other practitioners have moved on from availability management to resilience engineering, of which there is no mention in ITIL.
Resilience Engineering
Practitioners such as Jesse Robbins – ex-firefighter, Chef founder, and previously known at Amazon Web Services (AWS) as the “Master of Disaster” – are more than happy to share their experiences. You can see Jesse do this at Velocity conferences, where you’ll also find John Allspaw of Flickr and Etsy. John is a US-based IT guy who did a Sweden-based master’s degree in resilience engineering. He was the only IT guy in the class; the others came from the firefighting, building, health, and petrochemical industries.
People like Jesse and John don’t talk about availability management, which reads like an outside-in, top-down, mother-ish set of commandments without detail. Instead, they talk about engineering resilience into the system – from the inside out and thinking of the people, processes, and technology as a complex, non-linear, interconnected system that will experience failure.
The Five Elements of Resilience Engineering
There are five elements to resilience engineering that you won’t find in ITIL’s availability management verbiage:
- Embrace and expect failure
- Hold blameless postmortems
- Focus on MTTR over MTBF
- Hire masters of disaster
- Realize that IT is complex and non-linear
1. Embrace and Expect Failure
We’ve had decades of IT that has been about building resilience into every component, at extraordinary cost. Yet we know that most outages and security breaches are not caused by component failure but by people. Blaming the people isn’t correct for many reasons, but the biggest is that it’s most likely not their fault even if they “pulled the trigger.” These people are in a complex ecosystem where well-intentioned, simple actions can have unintended, cascading consequences. Pull the right lever at the wrong time and who knows what will happen to the IT and business operations?
An obsession with a target of zero component failure reveals a lack of understanding of IT as a system. An obsession with a target of zero human failure leads organizations to try to fix the process to make it predictable and repeatable through things like ITIL, but such frameworks have a long history of failing to control the complex system of people, process, and technology. Failure, like change, is inevitable. It’s how you respond, learn, and correct that separates high-performing organizations from the rest.
Read how high performers are 4x more effective.
2. Hold Blameless Postmortems
The number of outages in UK banking and worldwide security breaches, plus the exponential growth of public cloud companies, has given us many observable data points on how companies react to failure.
Each outage and breach is an opportunity to learn about the system, which means the people, process, and technology in total. In my opinion, this learning can only truly happen through blameless postmortems: an honest and open approach to rigorous analysis of what went wrong and how to build in more resilience to avoid a similar outcome. Outages and breaches should not be an opportunity to blame, discipline, and fire people.
Specifically, blameless postmortems encourage staff to provide a detailed account of the following (sketched as a simple record after the list):
- What actions they took, at what time
- What effects they observed
- The expectations they had
- The assumptions they had made
- Their understanding of the timeline of events as they occurred
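One lightweight, hypothetical way to capture such an account is as a structured record per contributor. The sketch below is illustrative only – the field names are mine, not taken from any published postmortem template – but it mirrors the list above:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Tuple

@dataclass
class PostmortemAccount:
    """One person's blameless account of an incident, mirroring the list above."""
    author: str
    actions_taken: List[Tuple[datetime, str]]   # what they did, and when
    effects_observed: List[str]                 # what they saw happen
    expectations: List[str]                     # what they expected to happen
    assumptions: List[str]                      # what they had assumed about the system
    timeline: List[Tuple[datetime, str]] = field(default_factory=list)  # events as they understood them

# Illustrative example only
account = PostmortemAccount(
    author="on-call engineer",
    actions_taken=[(datetime(2019, 3, 20, 2, 5), "restarted the queue consumer")],
    effects_observed=["queue depth kept growing", "error rate doubled"],
    expectations=["the restart would drain the backlog within minutes"],
    assumptions=["the consumer, not the broker, was the bottleneck"],
)
```

The value is not in the data structure itself but in the discipline of recording each person’s view without editing it into a story of who was at fault.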
Unfortunately, most companies are slow to acknowledge breaches and might never make a postmortem public. The best companies – those that put resilience engineering at the top of their agenda – publish outage reports and present what they learned at conferences, for the benefit of all.
Read what John Allspaw writes on blameless postmortems.
3. Focus on MTTR over MTBF
“Failure is not an acceptable condition, but it is inevitable,” says John Allspaw. The obsession with preventing failures, even though they are inevitable, means a focus on KPIs such as mean time between failures (MTBF), with a longer MTBF being a good thing. This changes behaviors and discourages change, because changes are seen as the biggest contributing factor to unplanned outages, which in turn hurt the MTBF. If you have a change advisory board (CAB), then you probably have a committee that is there to prevent and slow change.
Resilience engineering practitioners dislike unplanned outages as much as anyone else, but they realize that in a complex system there are going to be mistakes and unknown conditions, and failure is going to happen. Instead of focusing on MTBF, they obsess about mean time to repair (MTTR). This translates into a focus on high-performance change and release management. It turns out, according to numerous studies by respected organizations such as the IT Process Institute, that high-performing organizations are good at doing changes – with “good” meaning lots of change, at velocity, and with high rates of success.
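To make the two metrics concrete, here is a minimal sketch in Python using made-up incident data. It uses one common definition of each metric – MTBF as the average time between the starts of consecutive failures, MTTR as the average time from failure to restoration – and is an illustration, not a reference implementation:

```python
from datetime import datetime
from statistics import mean

# Illustrative incident log: (failure_start, service_restored)
incidents = [
    (datetime(2019, 3, 1, 9, 0),  datetime(2019, 3, 1, 9, 45)),
    (datetime(2019, 3, 9, 14, 0), datetime(2019, 3, 9, 14, 20)),
    (datetime(2019, 3, 20, 2, 0), datetime(2019, 3, 20, 5, 0)),
]

# MTTR: average time from failure start to restoration
mttr = mean((restored - failed).total_seconds() for failed, restored in incidents)

# MTBF: average time between the starts of consecutive failures
starts = [failed for failed, _ in incidents]
mtbf = mean((later - earlier).total_seconds() for earlier, later in zip(starts, starts[1:]))

print(f"MTTR: {mttr / 3600:.1f} hours")   # ~1.4 hours
print(f"MTBF: {mtbf / 86400:.1f} days")   # ~9.4 days
```

The resilience argument is that freezing change to stretch the second number often does less for users than investing in detection, deployment, and rollback speed to shrink the first.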
Read what John Allspaw writes on MTBF and MTTR.
4. Hire Masters of Disaster
People like Jesse Robbins and John Allspaw have a mindset that is prevalent in many non-IT fields: if things can go wrong they will, so let’s simulate bad situations and learn how to cope. Organizations that share this mindset will have a role called the “Master of Disaster”.
This role organizes things like game days, where system-wide failure scenarios are run to test resilience, learn the weak spots, and find improvement opportunities. The tests exercise the organization (for instance, does the on-call paging system work under stress?), the processes (does the CEO know when and how to brief the press?), and the technology (purposely taking a whole datacenter offline).
Companies such as Netflix automated disaster testing with their Simian Army. They had a program called Chaos Monkey that would continually kill instances and connections, in production, to test resilience. They even had a Chaos Gorilla that simulated the failure of an entire AWS availability zone. These tools are becoming mainstream, and they are all free and open source.
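Netflix’s tooling targets cloud instances, but the underlying idea fits in a few lines. The following toy sketch (my own illustration, not Chaos Monkey) spawns a handful of stand-in worker processes and then randomly terminates one at intervals – the kind of continuous, controlled failure injection that a chaos tool automates:

```python
import random
import subprocess
import sys
import time

# Stand-ins for production services: a few idle worker processes.
workers = [
    subprocess.Popen([sys.executable, "-c", "import time; time.sleep(3600)"])
    for _ in range(5)
]

try:
    for round_number in range(3):   # a short "game day": three rounds of failure injection
        time.sleep(5)
        alive = [w for w in workers if w.poll() is None]
        victim = random.choice(alive)
        print(f"Round {round_number + 1}: terminating worker pid={victim.pid}")
        victim.terminate()          # the system under test should notice and recover
finally:
    for w in workers:               # clean up anything still running
        if w.poll() is None:
            w.terminate()
```

In a real environment the “workers” would be service instances behind a load balancer, and the interesting output is not the script but whether monitoring, failover, and the people on call behave as expected.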
These game days and disaster tests should start small, and unobtrusively, with one team and grow as confidence grows. Resilience engineering is a culture that grows through practice.
5. Realize That IT is Complex and Non-Linear
The Cynefin framework tells us that there is a difference between chaotic, complex, complicated, and simple systems, and that you can’t treat these different systems with the same standard approach.
IT is a complex system in that it is a big ball of mud, full of formally and informally interconnected people, processes, and technology. Complex systems are sometimes compared to a flock of birds: the flock gets somewhere somehow, but there is no one mind controlling it all.
Many people see IT as a closed, complicated system much like a machine in that it can be fully explained and it operates in a linear, predictable manner. This is not true, as evidenced by the now-frequent outages, breaches, wailing, and gnashing of teeth.
Resilience engineering recognizes that IT as a system is complex, and that you must learn about the reality of it and improve it based on what you learn. Top-down commandments, like the ones from parents to children, do not work.
Watch an introduction to the Cynefin framework.
So what do you think about resilience engineering and its place in IT?