Problem management & workarounds

Just Maybe, Problem Management Is Not Needed

IT guys are often happiest when something breaks and they have a root cause to investigate, find, and resolve. Sometimes, though, finding the cause isn’t necessary, even if the urge to go looking for it is hard to resist. Let me tell you a story that shows what I mean.

My neighbors once had an issue with their gas supply. During improvement work, tests indicated a gas leak somewhere in the pipe between the supply meter and their gas-powered heating system, on the other side of the house. The supply pipe had been concreted into the house’s foundations when the house was built, so it ran under the kitchen floor.

They were explaining this over dinner at our house, and we talked through the implications. The leak had to be somewhere under the floor, so the floor would inevitably need to be dug up to locate the leak and then repair it, wherever it turned out to be. We pictured the entire kitchen floor being destroyed just to find it. To keep my neighbors from dwelling too much on the time, cost, and sheer inconvenience that seemed inevitable, I supplied good food and wine to distract them.

The Professional’s Solution

What actually happened the next day when the gas engineers arrived was a revelation, a surprise, and an indictment of our closed thinking. The engineers simply turned off the gas, disconnected the pipe between the meter and the heating system, and ran a new pipe around the outside of the house. They had no interest at all in where the actual leak was. Putting the new pipe in place took about 30 minutes and, so far as I know, it is still working many years later.

The gas engineers certainly delivered a service that delighted, and surprised, the customer. They could have done something far more expensive, but what they offered matched the job perfectly.

If we applied conventional ITIL thinking to this, we would find:

  • A major incident that impacts safety and operations, and justifies the problem management team investigating immediately.
  • Incident management looks for a workaround that restores normal service.
  • Problem management sets out to find the root cause and creates a change request to remove it.

In this case, though, the workaround removes the need to find the fault and delivers a permanent solution. No need for problem management.

Ignoring the Fault but Addressing the Issue

ITIL tells us that a workaround is a temporary solution aimed at reducing or eliminating the impact of an incident or problem. Routing the gas pipe around the outside of my neighbors’ house did that, and did it well enough to remove the need for any further work. Sometimes what looks like a workaround is actually a full solution. If the new pipe had been unsightly, or unable to withstand years of weather, the kitchen floor would still have had to be dug up. Even then, the workaround would have counted as a temporary success, giving my neighbors time to schedule the permanent fix at their own convenience.

Thinking again about ITIL (or any other IT framework or standard, such as COBIT or ISO/IEC 20000), we need to be able to think across several processes to get the best solution:

  • Incident management – This is the process that first captured the issue and started dealing with it. In IT service management (ITSM), we need to understand what our customers need to do but cannot, because of some fault or failure, and that is incident management’s job.
  • Problem management – This process deals with finding out what has actually gone wrong, and this is where we can so easily misapply the concepts. It isn’t just ‘root cause analysis’. Instead, it should be about removing the trouble for the long term. That doesn’t always mean finding and fixing the fault, although engineers with specialist knowledge and focus can easily behave as if it does.
  • Knowledge management – Tradespeople traditionally learned their skills over a long apprenticeship. Working alongside a skilled and experienced plumber might not look anything like IT’s idea of knowledge management, and the means of capturing, storing, and using knowledge (as advocated by April Allen in her webinar on knowledge management) might be more sophisticated for modern IT professionals. But the principle is the same: knowing what happened the last time this issue occurred. Those who reuse knowledge can learn from other people’s mistakes rather than having to make those mistakes themselves. This matters most to customers, because they are the ones who actually suffer from the mistakes.

Playing the Team Game

One of the big challenges engineers face in resolving customer issues is to look beyond the fault and see the delivered service. As with all the best knowledge management applications, getting the right solution conceived, developed, and implemented works best when it is played as a team game, one that spans a range of mindsets. The philosophy DevOps has brought us, mixed teams each contributing their own perspective, is exactly what might help solve things in the best way: operational experience will value the results, while the creative engineering view from development will value understanding and learning from what went wrong. Spice that mix up with active customer involvement to keep everyone aware of what really matters and maybe, just maybe, you have a winning formula.

But let’s take one final lesson from those gas engineers. Knowing what you don’t need to know, or don’t need to do, can be very powerful. Upgrading to new software versions can make more sense than tinkering with patches and fixes; scrapping problematic hardware might save the company more money than repairing it; redesigning the interface so it’s properly intuitive might be better than training people how to use an application for the next few years.

What’s the Lesson for IT?

The message for IT folks here is that sometimes we need to stop the process we normally follow and ask some questions, such as: do we need to do what we normally do, or is this a different situation? Interrupting the usual flow of things can be hard because it disrupts convention and takes people out of their comfort zones. But the alternative might be spending lots of money fixing things that are no longer causing your customers any trouble. I hope you’ll think about it the next time a similar situation comes up.


Posted by Joe the IT Guy


Native New Yorker. Loves everything IT-related (and hugs). Passionate blogger and Twitter addict. Oh...and resident IT Guy at SysAid Technologies (almost forgot the day job!).