5 Behind-the-Scenes Stages of Dealing with an Incident
I’ve been talking a lot about getting back to basics with incidents and all other things IT service management (ITSM) related – and with good reason. While we IT guys like to think that our job is all about cool new stuff, the reality is that the ITSM basics have been around for a long time for one simple reason – they work!
So reminding ourselves now and again of those basics, and seeing them in today’s changed context, is very valuable. ITIL® is over 25 years old, so something that has stayed in ITIL that long seems safe enough to call a basic. Those 25+ years might have changed how we deal with things, and how effective we can be, but most of the underlying concepts remain.
A User Just Sees It Isn’t Working
Most of these basic ideas are rooted in common sense and we can see them as the application of everyday ideas to our ITSM environment. One good example I want to talk about here is the expanded incident lifecycle. This is set out in the availability management chapter of ITIL’s “Service Design” book but is actually most likely to be used during service operation. After all, things don’t affect users until they are in operation. This approach complements the user’s perception and sets out the steps that the support team go through to get things working again.
When an incident causes issues for a user, they aren’t primarily concerned with what has actually broken, nor with how it’s fixed or how long each step takes. Instead the user only sees the bigger and simpler picture – “how long before things are back to normal for me?” Behind the scenes, for the service provider and the team that have to restore service, there is a sequence of phases that all contribute to the overall resolution time as seen by the user. This is the expanded incident lifecycle of: detect, diagnose, repair, recover, and restore.
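To make the idea of "phases that add up" concrete, here is a minimal Python sketch. The phase names come from the lifecycle above; the function names and all the durations are made up purely for illustration, so treat it as one possible way to model the idea rather than anything ITIL prescribes.

```python
from datetime import timedelta
from enum import Enum


class Phase(Enum):
    """The five phases of the expanded incident lifecycle, in order."""
    DETECT = "detect"
    DIAGNOSE = "diagnose"
    REPAIR = "repair"
    RECOVER = "recover"
    RESTORE = "restore"


def total_downtime(durations):
    """Add up the time spent in every phase; a missing phase counts as zero."""
    return sum((durations.get(p, timedelta()) for p in Phase), timedelta())


def longest_phase(durations):
    """Return the phase that contributed the most time."""
    return max(Phase, key=lambda p: durations.get(p, timedelta()))


# Hypothetical figures, invented for this example only.
example = {
    Phase.DETECT: timedelta(minutes=20),
    Phase.DIAGNOSE: timedelta(minutes=15),
    Phase.REPAIR: timedelta(minutes=5),
    Phase.RECOVER: timedelta(minutes=10),
    Phase.RESTORE: timedelta(minutes=10),
}

print("User-visible downtime:", total_downtime(example))
print("Biggest contributor:", longest_phase(example).value)
```

With these invented numbers the repair itself is the smallest slice of the hour the user is waiting, which is the kind of thing that only shows up when each phase is timed separately.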
5 Stages Behind the Scenes
Like many ITSM basics, this concept applies to our professional ITSM world but has its roots in ordinary life and the wider world. Whether the issue is a service outage at work, a power failure at home or an issue with our car – someone, somewhere is a service provider that has to get through the 5 stages to re-establish the status quo.
1. Detect
“I would have looked at it if I’d known about it.” Surely no one can expect support until they’ve asked for it? Well, in fact they can. Monitoring is the term we tend to use in ITSM – automated detectors that will advise us when something is out of normal parameters. Of course this isn’t a new idea – we can trace it back to the lookouts and sentries that the military have employed for thousands of years. But along with finding out for ourselves, every organization needs to be sure they have good mechanisms in place for users to tell them when something’s gone wrong – and for that information to be captured, understood, and actioned quickly. Until detection does its job, nothing will be done, and the longer that takes, the more damage is done to the customer’s business.
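As a rough illustration of what "out of normal parameters" means in practice, the Python sketch below compares a set of readings against allowed ranges and reports anything outside them. The metric names, limits, and function names are all hypothetical; real monitoring tools do far more than this, but the core check looks something like it.

```python
def out_of_range(value, low, high):
    """True when a reading falls outside its normal band (inclusive limits)."""
    return not (low <= value <= high)


def check_readings(readings, limits):
    """Return one alert message for every reading outside its limits."""
    alerts = []
    for name, value in readings.items():
        low, high = limits[name]
        if out_of_range(value, low, high):
            alerts.append(f"{name}={value} outside {low}..{high}")
    return alerts


# Hypothetical metrics and thresholds, for illustration only.
limits = {"cpu_percent": (0, 90), "disk_free_gb": (10, 10_000)}
readings = {"cpu_percent": 97, "disk_free_gb": 40}

for alert in check_readings(readings, limits):
    print("ALERT:", alert)
```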
2. Diagnose
OK, so now we know about it – what do we do? Before we can start to fix it, we have to work out what is broken, and what we can do about it. This relies, of course, on the skills and abilities of the engineers working on it. But it also SHOULD rely heavily on learning from what has happened before. Getting that right is about having good knowledge management in place, which is essential to speeding this phase up. There is no need to re-invent wheels: good knowledge management lets you use the ones you already have and get moving straight away.
3. Repair
Once we know what to do, we have to do it, right? Remembering that the priority is likely to be getting things working quickly, it can really help not to worry about the details of what needs repairing or replacing. For example, we can change large units, like swapping out boards or even servers, rather than find the precise component that is at fault. Working out in advance at what level to repair or replace things can save a lot of time; the moment things actually break is not the time to start thinking about the quickest way to repair them.
4. Recover
Once we’ve fixed what’s broken, it will take some time to put it all back together, get things started, and have them working as they were. This might mean getting all the components powered up and talking to each other. At this stage, the engineer feels they’ve done their job: the IT system is back together and functioning.
5. Restore
But, as pervasive as IT is these days, there may be more involved from the business perspective than just the IT system itself. Truly restoring full service functionality from the business’s perspective might need something more, or take more time.
For example, say a financial system stops working at 17:50 and the support team get it all fixed again by 18:20. They have gone through detect, diagnose, repair and recover in half an hour – and they feel rightly content with their performance. But imagine the office affected has to upload their daily data to the global system – in a 10-minute window between 18:00 and 18:10. They’ve missed that and the full business system will not be properly back for nearly 24 hours.
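The arithmetic behind that example can be sketched in a few lines of Python. The times are the ones from the example; the calendar date, the function name, and the assumptions that the upload window repeats at the same time every day and that service counts as restored once the next window opens are mine, added only to make the sums runnable.

```python
from datetime import datetime, timedelta


def next_usable_window(ready_at, window_open, window_close):
    """Return the earliest moment the upload can start.

    Assumes (hypothetically) the same window recurs every 24 hours.
    """
    while ready_at > window_close:
        window_open += timedelta(days=1)
        window_close += timedelta(days=1)
    # Ready before the window opens: wait for it. Ready inside it: go now.
    return max(ready_at, window_open)


# Times from the example; the date itself is arbitrary.
failed_at = datetime(2024, 1, 15, 17, 50)
recovered_at = datetime(2024, 1, 15, 18, 20)
window_open = datetime(2024, 1, 15, 18, 0)
window_close = datetime(2024, 1, 15, 18, 10)

upload_at = next_usable_window(recovered_at, window_open, window_close)

it_outage = recovered_at - failed_at           # what the support team measures
wait_after_fix = upload_at - recovered_at      # extra wait the business sees
business_outage = upload_at - failed_at        # failure to next upload chance

print("IT outage:", it_outage)
print("Wait after IT recovery:", wait_after_fix)
print("Business outage:", business_outage)
```

Under those assumptions the IT outage is 30 minutes, but the office waits a further 23 hours 40 minutes for the next window, which is where the "nearly 24 hours" comes from.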
Building an Approach for All 5 Phases
So what’s the point of me telling you all this? Well, as with all good basics, the underlying aspects remain as they always have:
- Don’t get too focused on any one aspect. IT gets hung up on the repair part quite often when there are probably quicker and easier improvements to be made by looking in other parts of the cycle.
- Don’t feel too good too soon. However clever you are at one stage, it is only when you’re done that the user will be working properly again.
- Keep focused on knowledge management – that is a real differentiator when it comes to getting better.
For more amazing ideas about the wider range of basics, you should check out content we’ve put together here.