Episode 4 - AIOps
Rama Akkiraju: AIOps is about applying AI for IT operations management. How can we put AI to use to quickly detect, diagnose, and help resolve these issues that may happen in production, and to help avoid those incidents from happening?
Kevin: Hi, everyone. Welcome back to another episode of the Making of the SRE Omelet podcast. Today's episode is all about AIOps, and there's no better person than Rama Akkiraju to speak to us on this subject. Rama is an IBM Fellow, and CTO of IBM AIOps, and has numerous industry awards and recognitions, including AI Innovator of the Year from Women in AI Award North America 2022. Welcome to the show, Rama.
Rama Akkiraju: Hey, thank you so much, Kevin. Glad to be on your show.
Kevin: Rama, can you start us off with a description of what is AIOps?
Rama Akkiraju: Well, AIOps is about applying AI for IT operations management. Basically, what it means is when you have IT systems running in production, you want to make sure that they're highly reliable, and running all the time. So, how can we put AI to use to quickly detect, diagnose and help resolve these issues that may happen in production, and to help avoid those incidents from happening? Those are the questions that AIOps is specifically looking to address. You can think of them as a set of tools and capabilities that site reliability engineers can use to keep the systems up and running.
Kevin: Helping detect, mitigate, and resolve disruptions definitely sounds like a great match with SRE. Can you give the audience some use cases you have seen with customers that capture the benefit of AIOps? Perhaps the relationship between AIOps and SRE?
Rama Akkiraju: Many customers are actively looking to see how they can improve the productivity of their SREs and reduce the toil. If you look at how much time an SRE today spends in a typical operations environment... Unless you are a highly mature organization with an established SRE practice, in an average client environment what we have seen is that SREs spend 80% of their time firefighting, basically. Issues are happening, and they have to detect and diagnose what's happening, and do something to bring the systems back up and running. By the time they've done that, another incident has occurred, and they move on to the next one. They don't have a lot of breathing time to step back, reflect, and really understand what's going on, what the root cause problems are behind these issues, and actually go and fix them. The goal of AIOps is ideally to reverse that: go from 80% firefighting with only 20% left over, to spending most of the time on proactive issue avoidance and actual engineering work. To reverse it, or make it even better if everything can be automated. The relationship between SREs and AIOps is that if AIOps is doing its job very well, then SREs are not in firefighting mode all the time. They have a lot of free time to do more high-value tasks. That could be implementing actual DevOps automations and features, and helping contribute to the development of the product. If AIOps is working really well, incidents are automatically being remediated, and in the first place they're actually avoided by instituting good development practices and such. The system is running smoothly, all the objectives are adhered to, the systems are highly reliable and scalable, and uptime is at whatever your desired level is: 99.9999, whatever. That's the ideal situation between AIOps and SREs.
AIOps is meant to support and help SREs really do their job more effectively by freeing up their time for more high-value tasks.
Kevin: I definitely see a lot of that. Our teams are so busy with putting out fires, they have no time to even ask the question why they happened.
Rama Akkiraju: Right.
Kevin: AIOps can help reduce the toil, so SREs can get away from firefighting and have time to reflect and learn to improve the product. So Rama, what are the top challenges you have seen from customers that prevented them from getting out of that reactive mode?
Rama Akkiraju: One thing I notice is that, as with anything, for analytics and insights to do their job effectively, having good-quality data coming in is very important. One of the challenges that we face all the time in customer environments is that there isn't a good set of data collection and aggregation tools for the IT data that gets generated in real time in production. We are talking about logs, metrics, and traces. These are all data that gets generated in real time in production by various applications, and they contain valuable information about what issues are happening, which systems are going up and down, and so on. A good chunk of the problem for SREs and for organizations is to first get themselves organized well: to collect all of this data, to aggregate it, and to actually have a data strategy around monitoring the IT applications that are running in production. If you have that, a good chunk of your problem is solved. You're on your way to shopping for AIOps products from vendors, or even building your own in-house practices for processing that data in real time to derive the insights that you need to detect, diagnose, remediate, and even avoid problems. So I would say the first challenge is really that: having the right set of practices and tools to collect and aggregate all the data in place to do the analysis. The next set of challenges is having an understanding of how your whole process of issue management works, and whether or not you have an SRE practice or an IT operations administrator practice. Is the culture one where there's a lot of blaming and finger pointing going on?
Or are the processes set up in such a way that you are able to leverage those tools and automations, measure everything that's happening in the environment, and have well-established and agreed-upon processes for investigating, diagnosing, and resolving the problems? Often, what we have seen is that whenever there is a problem, a lot of folks from different parts of the organization get together in a conference call: from the application owner to the infrastructure owners who manage the databases, to whoever owns the infrastructure it's running on, whether virtual environments, or Kubernetes, or cloud, or traditional servers. And they are often finger pointing at each other, saying the problem is caused by the database being out of memory, this, that. Or your network: every problem gets pointed to a network failure. Things like that happen. So another challenge, even before you can leverage tools and automation, is having the right kind of processes: a good practice established around SRE, and a clear understanding of which applications have what kind of service-level objectives. Are we measuring them, first of all? Are we monitoring everything? Are we measuring everything? How many people are spending how much time resolving a problem? Why is it happening over and over again? Having a baseline and a benchmark like that is very helpful to know whether we are improving, whether the same trend is continuing, or whether the whole practice is degrading. That's another challenge we see, where the culture of finger pointing, or a heroic way of solving problems, keeps on happening. Some of these things really have to be addressed before you can put any kind of AIOps tools and automations in place. Once that is there, the setup is there for you to effectively start using tools.
And of course, once the tools are in place, then hopefully they'll do the right thing in terms of helping improve the productivity of SREs by detecting problems, and doing a lot of that preparatory work, explanations, and groundwork to point to where the problem is. Then SREs will be able to leverage the tools more effectively. And then there's the post-mortem analysis and root cause analysis, and taking that back to the whole development process, because the whole point of the SRE process is about building that bridge between dev and ops. When there is a problem in ops, and it is caused by something happening at development time, or deployment time, or something configuration-related, or whatever the case may be, then once the initial problem is solved, you need the discipline to document everything: this is what happened, here is how the incident was remediated, and here is what needs to be done to fix the root cause so that it doesn't happen again. And capture that and keep it for AI to learn from. Once you capture all of that, document it in as structured a manner as possible. I say this because, as much as we all like to do NLP, that is, natural language processing and AI on natural language data, it's always a good idea to have good practices that make it as easy as possible to mine your own knowledge, by creating good structure around this documentation, be it an incident ticket that gets created or whatnot. Once you capture all of that, it becomes a virtuous cycle where you can continuously improve. One of the challenges that I've seen in some of the customer environments we work with is that the process of documentation, even though it might be there, is not very well enforced.
And the quality of the data that gets recorded is not reliable enough for really mining and deriving insights in the next iteration, which is unfortunate, because then it becomes a vicious cycle where the same thing repeats over and over again, and you are in firefighting mode. Let me sum up. You asked me what the challenges are. Number one, as I pointed out, is data: being able to have the right kind of tools, processes, and data management strategy to collect the data for training AI models, and for real-time monitoring and processing. That's number one. Number two is really having the right kind of culture, environment, and processes so that people are set up to use tools, to collaborate, and to do it in a way that takes emotion out of the context, in a structured manner. It is important to have that kind of SRE practice, and a good culture established around it. And number three is, as and when bad things happen and issues are addressed, having the right kind of strategy, processes, and tools to capture everything that's happening so that you can leverage it in the future. At least these three. There are others, but I would call out these three as the big-ticket, high-level challenges. And also opportunities for companies, where they can improve to get to a better place in their IT operations management.
Kevin: Wow, thank you, Rama. You took us on a really great journey that led to the end state of where we'd like SRE to be, and it begins with data. With data, we can make data-driven decisions, with observability and explainability giving us that insight. I'm also really glad you mentioned post-mortems. Let me qualify that: blameless post-mortem learning, which is really the core of SRE-
Rama Akkiraju: Absolutely.
Kevin: ...to learn across disciplines and roles from data, so we can surface what we can do differently to improve the business and customer success. With that, Rama, where do you think AIOps needs to go going forward?
Rama Akkiraju: See, an ideal end state would be to design and deliver, from the development perspective, self-monitoring, self-healing, self-managing, self-optimizing IT systems. If there is a way by which, during the development of the applications or during the setup of infrastructure, we can design them for better manageability from day one, then systems have enough sensors, enough monitoring, enough ability to detect and remediate all automatically. Everything is done in a self-serve manner, be it self-monitoring, self-optimizing, self-healing when a problem occurs, or self-managing. Keep that as the vision; that would be the ideal end goal. Obviously, it's going to take several iterations, probably several decades, to get there with full automation and everything, but I think in limited scopes we can probably get much closer to that vision. Where would the end state be in the next few years, if we think about it for SREs with AIOps? I think what is achievable today is very much this: if the data is properly collected for all the applications and infrastructure running in production, properly monitored, collected, aggregated, and made available in real time (that's an assumption, and a prerequisite I would say for AIOps to work), then we should be able, with the existing technology and AI capabilities, to detect anomalous patterns and provide various kinds of insights to SREs by reducing the noise that comes to them, detecting issues, and eliminating the manual examination, exploration, and toil. Help them detect, diagnose, and resolve problems quickly by providing enough automations, suggestions for what actions to take, and which component might be the probable cause for a particular incident. Pointing to it, providing all the explanation, and helping them with enough automations and support to take action to resolve the issues.
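To make the idea of detecting anomalous patterns in metrics concrete, here is a minimal, hypothetical sketch (not from the episode, and not an IBM AIOps API) that flags outliers in a latency series using a rolling z-score; the window size and threshold are illustrative assumptions:

```python
# Hypothetical sketch: flag anomalous points in a latency metric using
# a rolling mean/stddev z-score. Window and threshold are illustrative.
import statistics

def detect_anomalies(values, window=10, threshold=3.0):
    """Return indices whose value deviates more than `threshold`
    standard deviations from the trailing window's mean."""
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mean = statistics.mean(baseline)
        stdev = statistics.pstdev(baseline)
        if stdev > 0 and abs(values[i] - mean) / stdev > threshold:
            anomalies.append(i)
    return anomalies

# Steady latency around 100 ms, then a spike at index 20.
latencies = [100, 101, 99, 100, 102, 98, 100, 101, 99, 100,
             100, 101, 99, 100, 102, 98, 100, 101, 99, 100, 400]
print(detect_anomalies(latencies))  # -> [20]
```

Real AIOps products use far more sophisticated models across logs, metrics, and traces, but the noise-reduction idea is the same: surface only the points that deviate from a learned baseline.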
And also in the process, enable them, again through automated means and summarization, to document the issues that are happening. And proactively avoid issues by predicting, at every step of the development process, what kinds of changes or issues might lead to what kinds of problems, and give them headlights, if you will, into what might happen if this went through to the next iteration, what changes need to be made, how to fix it, and so on. All of those things we can imagine. With the state of the art that's available today, AIOps can significantly help an SRE in the overall process of detection, diagnosis, remediation, and avoidance. Now, can it automate? I would say that for some of the more frequently occurring or well-understood types of problems, for which there are very specific, well-declared sets of error messages or conditions, the system can go ahead and automate the detection, and likely even the diagnosis and resolution aspects. In several such cases, if SREs are comfortable having the system go ahead and apply the detection, diagnosis, and resolution, they should be able to just set it up as automation, where it'll do the remediation and simply produce a report saying, "Hey, there was going to be this problem. We detected it, we already fixed it, and here is the report. It's taken care of for you. You don't have to worry about it." In some other cases, it may not be that obvious. Users may really have to apply their own context, or maybe there are some workflow processes that need to be followed before a particular remediation action can be applied. In such cases, AIOps can still significantly reduce the amount of time SREs spend doing that manually, by supporting them with all of the insights and explanations so that they're primarily consuming those and making decisions around them. That's what it can do.
A moderate amount of automation, I think, can be achieved in more standard types of applications where everything can be designed ahead of time. And in other cases, if full automation is not possible, it can still significantly reduce the toil for SREs, and provide many time-saving insights and explanations so that they can take action. In the long run, yes, I think we can get to AI taking the lead toward automation. Initially, we may start out with AI primarily operating with a human in the loop, to where AI is starting to take the lead, to eventually getting to the human playing a role only in critical situations where their input is required in an AI loop. Where AI is taking the lead in many of these things, and only seeking user input where it's absolutely needed, which is where we finally get to full automation: AI-powered automation with those self-healing, self-monitoring, self-optimizing, self-managing systems. I see this as a progression, going from where we are, to some things being automated, to some things that still significantly reduce SRE toil but may still require user involvement, all the way to full automation.
Kevin: That sounds like a very exciting future. Humans are great at making judgment calls, but AI can really help by giving the insights to assist with those judgment calls. And pushing it to the left: I recall Vijay Kumar, in our automation episode, gave an example. Now, for context here, I'm Canadian, so hockey is important to me. Wouldn't it be great for AIOps to tell me, "Hey Kevin, if you were to check in this code now, you can forget about the hockey game you wanted to go to later today, because it is going to fail and cause an outage."
Rama Akkiraju: Yeah, absolutely. The whole point of bridging the gap between dev and ops is exactly that. A good chunk of problems in operations environments are caused by how the system is developed, how it is configured, and how it is deployed. If we can catch those things, be it during coding itself, during testing, during deployment, or in how the systems are configured and scaled given the requirements and load, then a good chunk of those problems that happen in ops environments can actually be preempted. That's why design-to-operate, or shift left, are very important themes in IT operations management. It's because you can only do so much once all the bad things have already gone through the processes and got deployed. Then you can only monitor, and you can only remediate. Whereas you can catch them while they're still happening: in code, in how you are writing your certificate access and security-related things, in what kind of vulnerabilities you may be introducing in the code. Imagine the Log4j issues that happened, and how much so many teams all around the world had to go through to patch the security issues that were detected in Log4j. What if there had been AIOps for the Log4j development team, and the system had found out about several of these things, and in the first place it was built to be really highly secure, so that people using Log4j all around the world didn't have to go through all of this? That is absolutely the nirvana for AIOps: the best problem is the one that you never had. If you can completely avoid it from happening in the first place by having the right kinds of checks and gates early in the development life cycle, then operations management would be a breeze.
Kevin: So Rama, any words of wisdom you have for companies looking to embrace SRE and AIOps?
Rama Akkiraju: The SRE practice is still relatively new, from what I see in many companies. Although it's been around for quite a few years now, since Google published its original SRE practice guides, it's practiced in different ways at different companies. In some companies, the role and the term that's used is still IT operations manager or IT operations administrator, or those kinds of things, whose role is still viewed as primarily managing the IT systems in production: when there is a problem, deal with it, and if it cannot be dealt with, escalate to level two or level three support. That's the traditional, classic way of managing IT systems. But the industry has of course moved on to a different model, where you now have site reliability engineering practice teams whose goal is to really bridge that DevOps gap, improving the reliability and manageability of the applications not only by fixing problems as they occur, but also by addressing the root causes, by being part of the development teams, to avoid the problems in the first place. That is where the SRE practice for many companies needs to get to. It's a journey for many of them. I think what would really help is to have a roadmap to get there. Companies that don't have the practice need a clear path to establishing an SRE practice: how to create career paths for SRE practitioners, what kinds of tools can be made available to them to do their jobs better, how the overall improvements to the applications they're monitoring and managing will be measured, and how the productivity gains, or lack thereof, are measured. All of these things have to be put in place, and for this there has to be a structured way to organize a company's practices around it.
And there is actually a lot of good documentation and material already out there in the public domain from various companies on their SRE practices: best practice guides and all that. Using them is a good idea. Or have a consulting engagement with companies who know about this, and embark on that journey: have an assessment of where they are, what the current level of maturity of their organization is for site reliability engineering, and then have a path, a plan, for improving the maturity from wherever they are. Think of maturity levels 1 through 5; there is always room for improvement no matter where you are. How do you go from here to there? It takes having that clarity, and having that as a practice in the organization, where there are people who have careers and are actually able to make progress, where you're able to measure everything and show the value of using tools like AIOps. How much reduction? What is the overall mean time to detect, to repair, to resolve? And what is the mean time between failures? How are all of these being improved? If you measure and clearly show the improvements, and hopefully these are improvements that you're making over time, then it makes it easy for a CIO or the leader of the organization to have that as ammunition to say, "Hey, yeah, see, this actually works. Here is how we've been able to do it." And replicate that across different parts of the organization within the same company, and lift everybody up along the way. There is a path, there is a culture that needs to be established, and all the rest of the things that go with it have to be put in place in terms of measurements, people's own career growth, and all of those things. Once that is there, it becomes a lot easier. The process and the whole thing will flow more smoothly. Blameless, frictionless ops management.
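The baseline metrics Rama mentions (mean time to detect, repair or resolve, and mean time between failures) can be computed directly from incident records. Here is a small, hypothetical sketch; the record layout and timestamps are illustrative assumptions, not from the episode:

```python
# Hypothetical sketch: compute MTTD, MTTR, and MTBF from incident
# records. The (occurred, detected, resolved) layout is illustrative.
from datetime import datetime

incidents = [
    # (occurred,          detected,           resolved)
    ("2024-01-01T00:00", "2024-01-01T00:10", "2024-01-01T01:00"),
    ("2024-01-03T00:00", "2024-01-03T00:05", "2024-01-03T00:35"),
    ("2024-01-06T00:00", "2024-01-06T00:15", "2024-01-06T01:15"),
]

def parse(ts):
    return datetime.fromisoformat(ts)

def mean_minutes(deltas):
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60

# Mean time to detect: occurrence -> detection.
mttd = mean_minutes([parse(d) - parse(o) for o, d, r in incidents])
# Mean time to resolve: occurrence -> resolution.
mttr = mean_minutes([parse(r) - parse(o) for o, d, r in incidents])
# Mean time between failures: gap between consecutive occurrences.
starts = [parse(o) for o, _, _ in incidents]
mtbf = mean_minutes([b - a for a, b in zip(starts, starts[1:])])

print(f"MTTD: {mttd:.0f} min, MTTR: {mttr:.0f} min, MTBF: {mtbf / 60:.0f} h")
```

Tracking these numbers over time is exactly the kind of baseline-and-benchmark dashboard Rama describes for showing whether the practice is improving.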
Kevin: Rama, I think in giving these words of wisdom for embracing SRE, you may have also just given us the ingredients and recipe for the SRE omelet.
Rama Akkiraju: Yeah, maybe so, now that we know these are all the ingredients. We talked about having a baseline maturity assessment, having career paths for SREs, having the culture, having the right kind of tools at their disposal, and being able to measure and have a dashboard to clearly see how things are going, how the productivity of people is improving, and whether the service-level objectives are being met or not. When you have all of these things, yes, you have the right kind of recipe for success.
Kevin: There you go ladies and gentlemen, the recipe for success from Rama Akkiraju, IBM Fellow and CTO of IBM AIOps. Thank you very much for spending the time with us, Rama.
Rama Akkiraju: Kevin, it's been my pleasure talking to you about this topic that's one of my favorite topics. What can I say? Thanks for having me, and I hope your listeners found this useful.
Kevin: Thank you all for listening. This is Kevin Yu, Principal SRE at IBM Applications. See you again on an upcoming episode.
DESCRIPTION
Rama Akkiraju, IBM Fellow and CTO of IBM AIOps, takes the audience through how AIOps can help companies achieve the SRE outcome of meeting SLOs and having reliable services. Rama shares the recipe for companies to embrace SRE and gives us a glimpse into the future of SRE with AIOps.