Three ITSM Activities to Amplify DevOps Feedback Loops
There are Three Ways of DevOps: three principles that underpin all other DevOps patterns. The Second Way is to “Amplify Feedback Loops”, and IT service management (ITSM) has an important role to play in this.
For example, at DOES15 – the DevOps Enterprise Summit – an enterprise customer care and billing company highlighted something called “Holistic Incident Visibility” as a crucial technique in developing feedback loops and cited three things they had done in pursuit of this:
- Moving to shared, system-wide KPIs
- Introducing “go see” and role rotation between Dev and Ops
- Increasing system-wide telemetry
This blog looks at how these changes and the overarching technique can help to bridge the gaps between DevOps and ITSM thinking to improve business performance.
Moving to Holistic Incident Visibility
Complex organizations, like the one I saw presenting at DOES15, often have three main challenges to address:
- Time-to-market in the face of increasing demands for quality and speed. This puts strain on decades-old systems of record that are now being asked to be systems of engagement.
- Organizational and process debt with Taylorist structures thwarting speed and learning.
- Technical debt with complex interconnected new and old systems.
Then we have the corporate IT silos that need to work together to ensure that IT services: (1) meet functional requirements, (2) are delivered to the agreed quality of service, and (3) are quickly restored when issues arise.
The statistics from the enterprise customer care and billing company showed that developers resolved around 2% of all incidents, with Operations responsible for resolving the rest. And more than 90% of these non-release-related incidents were classified as medium/low, with most being caused by just twenty system errors.
The enterprise company used this data as evidence that there was a lack of feedback between Development and Operations, identifying three underlying reasons for this. For each of these reasons, they developed a corresponding new practice:
| Issue | Solution |
|---|---|
| Incompatible team-based KPIs | Shared, system-wide KPIs |
| Siloed cultures | Staff rotation to share culture and learning |
| Lack of data to aid understanding | Increase telemetry to support understanding and decision making |
I’ve outlined each of these, and added my thoughts, in the sections below.
1. The Move to Shared, System-Wide KPIs
When organizations are split into silos, it’s common for each silo to have its own KPIs, with the differences between those KPIs becoming the cracks in the floor for work to fall into. The cost shows up as incidents that are never properly repaired, technical debt incurred, and a pile-up of work in progress.
At the enterprise company I’ve been talking about, the Operations team had different Mean Time To Repair (MTTR) targets from the Development team:
| Classification | Ops MTTR | Dev MTTR |
|---|---|---|
| Critical | 2 hours | 12 hours |
| High | 4 hours | 15 hours |
| Medium | 3 days | 90 days |
| Low | 5 days | 90 days |
Thus, the same issue would get a very different fix time depending on the route it took to repair.
To improve on this, the Development and Operations teams introduced a shared DevOps goal that matched the existing Ops MTTR targets. This had to be balanced carefully: Development needed time to make code fixes while still delivering new releases as planned, so work was rebalanced to allow both.
An important point to note here is that none of these KPIs are used to punish individuals or teams. These are goals to target and measure performance, and to make changes to the DevOps ecosystem based upon reliable data.
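To make the shared target concrete, here is a minimal sketch of how MTTR per incident classification could be tracked against a single shared goal, regardless of which team performs the repair. The incident records, field names, and target values below are hypothetical illustrations, not figures from the company’s tooling.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: classification plus open/close timestamps.
incidents = [
    {"classification": "Critical", "opened": datetime(2015, 10, 1, 9, 0),  "closed": datetime(2015, 10, 1, 10, 30)},
    {"classification": "Medium",   "opened": datetime(2015, 10, 2, 9, 0),  "closed": datetime(2015, 10, 4, 17, 0)},
    {"classification": "Medium",   "opened": datetime(2015, 10, 3, 14, 0), "closed": datetime(2015, 10, 5, 9, 0)},
]

# One shared target per classification (illustrative values only),
# applied whether Dev or Ops performs the repair.
shared_mttr_target = {
    "Critical": timedelta(hours=2),
    "High": timedelta(hours=4),
    "Medium": timedelta(days=3),
    "Low": timedelta(days=5),
}

def mttr_by_classification(records):
    """Average repair time per classification."""
    totals, counts = {}, {}
    for rec in records:
        c = rec["classification"]
        totals[c] = totals.get(c, timedelta()) + (rec["closed"] - rec["opened"])
        counts[c] = counts.get(c, 0) + 1
    return {c: totals[c] / counts[c] for c in totals}

for classification, mttr in mttr_by_classification(incidents).items():
    target = shared_mttr_target[classification]
    status = "within target" if mttr <= target else "over target"
    print(f"{classification}: MTTR {mttr} vs target {target} ({status})")
```

The point is simply that one target table drives the measurement for both Dev and Ops, so the route an incident takes no longer changes its expected fix time.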
2. “Go See” and Role Rotation between Dev and Ops
The goal of this technique is to increase understanding and empathy across previously disparate teams. In siloed organizations it’s common for teams to optimize only to their own horizon, with thinking and perspectives restricted to the silo. And because staff are surrounded by people from the same silo, a silo-specific culture emerges and understanding of, and empathy with, other teams drops significantly.
So what did the enterprise customer care and billing company do? It used a lightweight, bottom-up approach to cross-pollinate teams by sharing role knowledge and experience. In particular, it:
- Created a central hub of team information, to which every team contributes at least one session, making it easy for people to learn about other teams at their own pace. These sessions can cover tools, practices, and how to solve issues, and range from two to eight hours in length.
- Used an Infrastructure as Code (IAC) approach to automate the system-wide process, using Jenkins for continuous integration and Chef for configuration management.
- Elicited behavior changes and used Double Loop Learning to challenge the status quo by using truly cross-functional teams. The Double Loop Learning covered not only solving issues but changing how people work together to prevent further recurrence.
Changing how people work resulted in improvements in how incidents are resolved. Instead of “duct taping” a patch onto an application or server, the fix is built into the design at the front of the workflow, thereby avoiding future recurrences.
These changes have reaped material results for the enterprise company, with millions of dollars in savings as well as increased quality across the system. Just as significantly, the feedback on these techniques has been positive: staff say they have improved their knowledge of the company as a whole organization, as well as of team-specific areas such as change management, for which there is now a renewed appreciation. Staff also report that this knowledge helps them do their individual roles better.
3. Increase System-Wide Telemetry
The goal of this technique is to become more data-driven to help staff look across the system, not just within their individual silo. Unfortunately, such isolated silos can breed gut feel and intuition rather than scientific hypothesis, testing, and development.
The enterprise customer care and billing company found this useful too: it started sending application telemetry to a central hub, powered by ElasticSearch, and opened it up so that all staff can view and use the data. Desktop logging and tracing, which previously sat in local storage (meaning staff had to call a helpdesk and reactively share their logs), was also moved to this automated central system, allowing trends to be analyzed and permanent fixes identified for common issues.
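As a rough illustration of what shipping application telemetry to such a hub might look like, here is a minimal sketch using the official Elasticsearch Python client; the cluster URL, index name, and event fields are hypothetical assumptions rather than details of the company’s actual setup.

```python
from datetime import datetime, timezone

from elasticsearch import Elasticsearch

# Connect to the (hypothetical) central telemetry cluster.
es = Elasticsearch("http://telemetry-hub.example.internal:9200")

# A single telemetry event; in practice this would be emitted by the
# application's logging/tracing layer rather than built by hand.
event = {
    "@timestamp": datetime.now(timezone.utc).isoformat(),
    "service": "billing-api",          # hypothetical service name
    "level": "ERROR",
    "error_code": "SYS-0042",          # hypothetical system error identifier
    "message": "Payment gateway timeout",
    "host": "app-server-17",
}

# Index the event so any team can search and chart it centrally.
es.index(index="app-telemetry-2015.10", document=event)
```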
For any company, using this telemetry data as part of incident management will help identify the recurring “duct tape” incidents so that their permanent resolution can be prioritized.
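Staying with the same hypothetical Elasticsearch setup, a terms aggregation is one simple way to surface the most frequent error codes, i.e. the recurring “duct tape” incidents that are the best candidates for a permanent fix.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://telemetry-hub.example.internal:9200")

# Count error events per error_code over the last 30 days and return the
# most frequent ones; these are candidates for a permanent fix.
# Assumes error_code is mapped as a keyword field.
response = es.search(
    index="app-telemetry-*",
    size=0,  # we only care about the aggregation, not individual hits
    query={
        "bool": {
            "filter": [
                {"term": {"level": "ERROR"}},
                {"range": {"@timestamp": {"gte": "now-30d"}}},
            ]
        }
    },
    aggs={
        "recurring_errors": {"terms": {"field": "error_code", "size": 20}}
    },
)

for bucket in response["aggregations"]["recurring_errors"]["buckets"]:
    print(f'{bucket["key"]}: {bucket["doc_count"]} occurrences in the last 30 days')
```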
Understanding and Strengthening the Link between ITSM and DevOps
The enterprise customer care and billing company example I’ve described in this blog demonstrates the link between DevOps and incident management in particular. By combining the two approaches with specific techniques such as those outlined above, progress is clearly achievable in even the most complex of organizations, and that progress is achieved by people doing things differently, changing themselves, and understanding each other.
Has your organization done anything similar to this?