Episode 12 - Incident Learning via Story Telling
David: Our systems are successful because people are actively anticipating, planning, diagnosing, and reacting in real time. And they're preventing the incidents that don't happen. This is expertise in action. In my view, post-incident learning is the best way we know of to do these things.
Kevin: Hi everyone. Welcome back to another episode of the Making of the SRE Omelette Podcast. We cannot talk about site reliability engineering without talking about incident learning, because in my experience things always find a way to fail or not go as planned. But each of those cases is a wonderful learning opportunity. Here to talk about incident learning is David Lee. David is a distinguished engineer, resilience engineering, in the IBM CIO office. He also leads a learning from incidents guild in IBM and has championed the practice in IBM for many years. Welcome to the show, David.
David: Hey, thanks so much, Kevin. I'm really honored and excited to be here.
Kevin: Let's get right to it. This show is about understanding the business impact of SRE and the culture to achieve the outcome. Can you start with sharing what SRE means to you from the CIO perspective?
David: Absolutely. Let me start by describing the breadth and depth of the IBM CIO portfolio. So we're responsible for the IT systems that make it possible for IBM to buy and sell things. The systems that handle compensation, benefits, real estate, financial reporting, and a whole lot more. And we provide and manage the workstations, mobile devices, campus networks, and productivity, collaboration, and developer tools that IBMers use every day. So these IT systems span everything from mainframe to microservices. They include SaaS services, vendor software we host, and custom applications that we write ourselves. We're in the middle of a major initiative to modernize our legacy systems and I see the SRE model as a key aspect of this. Some CIO teams have been working in the SRE model for many years and some of these are among the most amazing SREs that I've ever encountered. And for others the SRE model is brand new or maybe it's still on the horizon. So the bottom line: I see SRE as a practice of treating system operations as a software problem, where SREs are responsible for availability, performance monitoring, capacity planning, et cetera.
Kevin: IBMers and our customers definitely can't do what we do without the support from the CIO. So David, what does SRE success mean to you?
David: The bottom line: success for me means happy users who are able to confidently use the system for important work. And it also means happy SREs who are proud of the system and the business impact it's having, and who have a sustainable workload that allows for continuous improvement.
Kevin: Happy user, happy SRE, sustainable and meaningful work. Sounds like a wonderful place to be. David, can you share with the audience how one can get to that state?
David: That's really the hundred-thousand-dollar question. In order to get there, I believe we need to recognize that our systems are complex, and that complex systems like ours are always running in a semi-broken state. The traditional view of system safety is that our systems are perfectly designed and that if there's a failure it's because some specific technical component broke or someone screwed something up. But we know this is completely false. In fact our systems are successful because people are actively anticipating, planning, diagnosing, and reacting in real time, and they're preventing the incidents that don't happen. This is expertise in action, and when incidents happen, it's because that expertise was overwhelmed. So we need to recognize that SREs are part of a complex socio-technical system. One key to success here is to recognize the expertise that's keeping systems running and then look for ways to enhance and sustain that expertise. Another key to success of course is to add technical defenses. But we need to do this while taking care not to increase the cognitive and coordination challenges that are required to manage the system. In my view, post-incident learning is the best way we know of to do these things.
Kevin: Thank you, David. That is a great lead-in to the topic for the episode, incident learning. In fact the industry has many other names for the practice. Many people may be more familiar with the term RCA or postmortem. David, can you share with the audience what incident learning is, and whether there's a terminology you feel is best to describe it?
David: So first I'd say that learning is always happening, especially after significant events like incidents. But the important question we need to ask is what is being learned? And the answer to that question depends very much on how the organization reacts to the incident. So for example, if the organization chooses to punish a person who they believe caused the incident (I'm using air quotes around "caused"), then people learn to be risk averse and they'll duck and cover when bad things happen. They learn that expressions of initiative and outside-the-box thinking are frowned upon in the organization. On the other hand, if the organization is primarily focused on preventing recurrence, as in the case of RCA and IT practices, and if the organization is obsessed with finding and fixing a component that failed, then you're actually leaving most of the learning value of the event on the table. So in my view, learning from incidents is primarily about using the incident as a lens to improve our collective understanding of the system, including what the system's supposed to do and what the system actually does. And learning from incidents is also about using the incident as a lens to understand the expertise that's keeping the work system working all the time and about how that expertise can be overwhelmed. And learning from incidents means sharing the story of the incident in a way that makes it memorable to the largest possible audience. Learning requires that we remember the event. I strongly discourage the root cause analysis term and the approach that goes along with it, like five whys. That's really based on the attractive fallacy that our systems follow linear cause and effect models, which they don't. In fact, our systems don't fail because of a single root cause. They instead fail due to multiple causes, each of which was necessary but none uniquely sufficient to cause the incident.
So I prefer to use the term post-incident learning, or learning from incidents, to refer to the practice. Sometimes we refer to post-incident reports if we're referring to the write-up that teams do following this approach. The term postmortem is fine, I actually use that a lot, but some people find that term too dark so it's nice to have other terms to use.
Kevin: That is a great capture on why we learn. I really like your idea to treat why we learn as to better understand the system versus eliminating recurrence, because the latter may mean we leave learning on the table. So David, how have you driven this culture in the IBM CIO?
David: Prior to this year, the CIO practice of learning from incidents was pretty limited. We would do RCAs for so-called major incidents and we'd discuss those in a weekly IT ops meeting with senior leaders, and those RCAs were focused on preventing a recurrence, as is typical. Also, if the incident wasn't characterized as a major incident, then there was really no organized post-incident learning activity going on. But in December of 2021, during discussions about those IT ops meetings and how we could improve them, I found the opportunity to propose something new and that proposal was accepted. So starting last January, I began hosting a one-hour monthly learning from incidents meeting, where we share the detailed story of one incident with a large audience including the CIO business and technical leaders and anyone else who is interested in attending. We've been doing this every month throughout 2022 and we regularly get about a hundred people attending these meetings. Many people have downloaded and viewed the recordings and downloaded the reports. In this monthly meeting, we explore a broad range of types of incidents. Some of these are major incidents, others are actually near misses where there was no significant user impact, and some are security-related events. And we take the time to share the story of the event, including what it was like for the participants in the moment as things were happening. The storytelling is a really important aspect of this approach because sharing the story is what makes it memorable. The monthly meetings have been very well received and we've gotten lots of great feedback. Senior leaders continue participating. And through the course of the year, these meetings have really changed the way that we think about post-incident learning in the IBM CIO organization. So we're using this momentum to scale. The monthly meeting where we demonstrate this approach is really just the starting point.
Our goal is that all of our teams throughout the organization are applying this approach to their incidents and their near misses. So we've developed a broad set of enablement materials that we make available to everybody. We have twice-weekly office hours where people can come and ask for help. We now have a number of teams that are following this approach and sharing their write-ups. And effectively we've replaced the RCA practice that I described earlier. We're no longer doing that in the CIO organization. We've adopted a learning from incidents approach and I couldn't be more excited.
Kevin: Thank you for sharing that incredible journey, David. Two things that stood out to me are your use of storytelling to help people remember, and learning from not just outages but also near misses.
David: I think the key word that comes to mind that encompasses both of the things that you just mentioned is the word surprise. So from a storytelling perspective, surprise is the drama. Surprise is what makes it memorable, and that's what incidents and near misses are. They are surprises. Any type of surprise is a candidate for learning. The fact that we were surprised is a demonstration of the fact that our mental model of the system was not accurate. It surprised us because it challenged the way we thought things were going to unfold. And to a large extent, the learning is the activity of recalibrating your mental model based on the way things actually work.
Kevin: I love it. Surprises and drama make stories memorable for sure. David, I'm also amazed at the turnout at those sessions. How do you keep that engagement?
David: A big part, an essential aspect of the learning from incidents approach is about resetting expectations about the write-ups, the reports, and turning them from (well, as we like to say) things that are written to be filed and never read into something that people would actually read with the expectation that it'll be interesting and valuable to spend your time reading it. Everybody's time is very valuable and so that's a big lift. We want to be expecting that people are going to read these post-incident write-ups and they're going to find them interesting and valuable. And by telling a story and making that story interesting, even making it dramatic, you're going to reinforce that expectation that the stories are worth reading.
Kevin: Excellent point. By making stories memorable and interesting, we're shifting incident learning from something that's filed to something people want to hear about, and learning follows. David, I'd like to be data driven and have metrics to know how things are doing. Do you have suggestions on how we can measure this learning?
David: In my view, you can't really measure learning very well, but we can find signals that people are consuming materials and that we're creating conditions where learning can happen. And so I look at things like: How many people are attending these review meetings where we're talking about an incident? How many people are downloading and viewing the recordings? How many people are reading the reports? Are new people who are joining the organization months after the incident going to the reports, maybe as part of their onboarding process? Because that's a useful way for them to spend time and it's going to help rapidly improve their understanding of the system. We have to look for signals that we're creating the opportunities for learning to happen. Looking at systems in this way is not intuitive. The intuitive thing is to think of our systems as linear, and that leads us into looking for root causes. It takes real effort to break out of that and to embrace the complexity, embrace the emergent characteristics of complex systems, to start looking at incidents in a different way. So it takes real work, but fortunately once you start looking at incidents this way and systems this way, you can never go back.
Kevin: Now David, speaking of learning, a big part of this podcast is technical vitality. Any words of wisdom you have for practitioners looking to get into SRE?
David: So I really love to recommend that everyone read this really short paper that's entitled "How Complex Systems Fail". It is written by Dr. Richard Cook. Sadly, we lost Dr. Cook this year. This paper was written decades ago. It's a really provocative paper. It's had a profound influence on many people including myself. The paper's extremely accessible and worth reading for new and experienced SREs. I suspect any SRE who reads that paper is going to be nodding vigorously by the end. And one of the most interesting things about it is Dr. Cook did not write this from the perspective of IT. He actually wrote it from the perspective of medicine. But it applies a hundred percent to what we do. The other thing, in a completely different domain: I like to recommend that anyone trying to sharpen their skills learn the Pandas and Matplotlib libraries in Python along with Jupyter Notebooks. It's a great investment in skill development. It'll change the way you think about manipulating and visualizing data and it will help you stop having to use Microsoft Excel ever again.
Kevin: Great callout on Dr. Cook's paper. I'll make a reference to it on Casted when we publish the episode so people can refer to it. I love his callout that we don't look at success often enough and instead often focus on failures, and I'm sure everyone's all for no more spreadsheets. So thank you for the tip to get us there. David, thank you so much for taking us through your journey of influencing the organization to learn by leveraging storytelling. In closing, let's go back to the inspiration of this podcast. What would be your ingredient and recipe to achieve the SRE outcome?
David: So I truly believe that the key ingredient is developing a solid culture and practice around learning from incidents. And this means focusing less on action items and focusing more on improving the team's collective understanding of how the system is supposed to work and how the system actually works. It also means creating post-incident reports that people want to read because they find them interesting. And learning from incidents is the best way we have to understand the expertise that's keeping our systems running so that we can look for ways to enhance and sustain that expertise. We need to recognize that SREs are part of a complex socio-technical system, and that learning in that context is itself a system improvement.
Kevin: There you go, ladies and gentlemen: the SRE ingredient and recipe from David Lee. Thank you so much for spending your time with us here today, David.
David: Thanks, Kevin.
Kevin: And I'd like to thank you, the audience, for listening. See you again in a future episode.
DESCRIPTION
Storytelling helps with learning because engaging stories are easy to remember and are interesting.
David shares why he influenced the IBM CIO to learn from incidents via Story Telling and how other organizations can follow this model to learn more about how their systems work and improve business and client success.
Listen to this episode to gain ideas on how you can turn your incident learning into an engaging session people look forward to attending and drive the outcome of better business and client success.