AIOps with Rama Akkiraju
Ethan: From IBM Cloud and Cognitive Software, you are listening to the Art of Automation with host Jerry Cuomo.
Jerry Cuomo: Hey, hey. Thank you, Ethan. And ladies and gentlemen, welcome to episode one of the Art of Automation. And this is a podcast that explores the application of automation in the enterprise. And I know there are a few of you out there asking, "What is automation?" Well, automation, very simply put, uses technology to automate tasks that once required you and me, humans. And when I say technology, I'm talking about AI and AI-powered software bots as a new type of digital workforce that's going to eliminate the mundane work from our plates day-to-day and augment our skills to enable us to be superhuman workers. And ladies and gentlemen, for episode one, you're in for a real treat. Because today we have a real expert in automation, Rama Akkiraju. And Rama is an IBM Fellow, and that in itself is a big deal. But there's more. Rama was recently featured by Forbes on their list of the top 20 leading women in AI research. Rama is a natural language processing expert, and she led the team that created the Watson Tone Analyzer. So it's all about people insight-oriented technologies. And with that you could infer personalities, emotions, tones, attitudes, and intentions from things like social media using linguistics data and machine learning techniques. Not only this, but I'm actually quite proud to be Rama's colleague at IBM, where we're both among the co-creators of our AI-powered automation platform. And to that end, Rama is currently the fearless leader of our AI for IT operations project, also known as AI Ops. And this is where we'll start. Welcome, Rama, to the inaugural episode of the Art of Automation podcast.
Rama Akkiraju: Wow. Jerry, what an honor, really incredible honor to be on your podcast, first of all. And to be the first as part of your first episode. So the honor is all mine. The pleasure is all mine. I'm really looking forward to this chat.
Jerry Cuomo: So, Rama, what exactly is AI Ops, and tell us why you are so excited about it?
Rama Akkiraju: Okay, AI Ops, as you have introduced, Jerry, is applying AI to optimize IT operations management. Now, let's ask ourselves, "What is IT operations management?" We all use it in our day-to-day lives and in companies. Think of airline reservations that we make and financial trading systems that we use. The things that we buy online, checkout systems, and retail checkout systems. These are all powered by IT systems, just to name a few examples. And these are all built using software development processes, and they're deployed on different environments. And they're running and they're serving their customers. And these systems are expected to be up and running all the time.
Jerry Cuomo: That's right.
Rama Akkiraju: And you don't want to be disappointed when you try to go make an airline reservation, although these days we are not traveling all that much. But when you do, you expect them to be running. That means that, first of all, they have to be designed to be very good and highly available. But as with anything, bad things do happen, incidents do happen. And these systems do go down. So operations management is all about really making sure that the systems are up and running, and these incidents don't happen. And when they do happen, that you're able to bring these IT systems up very quickly. And how can we really use AI to improve that process of recovering from any incident that happens, like IT systems being unavailable, or network response being slow, or a particular server being down? How do we bring up the systems? These kinds of things are what IT operations management is about. Now, it's being done today, and yet we see every day in newspapers that systems have gone down. Companies are incurring huge losses because of it. And customer dissatisfaction is mounting when systems go down. So the question is what can we do to apply AI to automate some of these things, to detect these problems early, and to predict them before they occur? And to reduce these events and alerts and all the things that come up, grouping them together in a way that you understand how many incidents are happening at the same time and what all these signals mean? And specifically identifying where the problem is very quickly, and being able to leverage past historic information to take action quickly? And to learn from all of this and make sure that these problems don't recur again and again. That is the nirvana. So it's about applying AI to address this problem.
Jerry Cuomo: Right. Good. So next question, I'd like to explore the connection between AI Ops and automation. So it sounded like you were describing, Rama, something of a closed loop automation system where information was being discovered. Then we were applying analytics in a way to help drive decisions. Then those decisions were connected to real live actions. Some of those actions might be removing some mundane work off the plates of humans. Other actions might be augmenting humans with superpowers to get insights into things that maybe they wouldn't have been privy to before. So talk to us a bit about the connection between AI Ops and automation.
Rama Akkiraju: Sure. So let's first look at the detection of problems. The way problems are detected today, even in advanced companies, is through experts who know when certain thresholds are reached in metrics, for example. I'm monitoring certain metrics, and whenever one exceeds its threshold, I know that there is a problem. Disk usage has exceeded a certain limit. Or memory usage is reaching a certain limit. So they go in, and they put those in as rules. And these rules, some are easy enough to understand, but some are really based on a lot of experience that people have gained over time. They go in and put all these rules in to detect problems. And these rules, depending on how the applications are evolving and systems are changing, have to be constantly managed and updated and upgraded and such. And typically for a company, it takes about 18 months or so to create all these rules. Now comes AI, where we can look at logs and metrics and those data, and automatically predict anomalies. So you don't have to have people creating those rules. You teach the system what is the normal behavior of the system by giving it some data during normal times when no major incidents are happening. And the system can then use that as a pattern to know what is normal. And at any time when it sees either the volume of logs or the pattern of logs or the metric numbers not conforming to what is normal, it'll predict anomalies. So that's one automation where you don't have to have users writing rules. The system automatically predicts. Now let's look at the other example in resolving the problems. So in resolving the problems, typically today what happens is that you have prior incident content captured in ticketing systems where you make a record of everything. This major incident happened on such-and-such date. It was caused by such-and-such thing, and we did such-and-such repair action to fix the problem.
And later on, we went and did the root cause analysis and found that it actually was a problem in the deployment change that was done. And all of that gets written out. So people go read those documents and say, "Hmm. This problem sounds similar to that other thing that occurred. And therefore, let me try that." But that is basically a knowledge discovery problem, and you have all this knowledge data in your knowledge repository. And you've got to be able to find and match. Again, AI comes to the rescue. You apply AI there, where you are processing this natural language content. And you quickly match the symptoms of the current problem with all the problems that are in your prior database. And prepare a succinct summary of action that you can take and present that as a recommended next action to the IT operations folks. And they can do that quickly without having to manually do it. That's another example of AI.
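The two automations Rama describes here, learning "normal" from quiet-period data and matching a new symptom against past tickets, can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not how any IBM product works: the baseline is a plain mean and standard deviation, the matching is word-overlap (Jaccard) similarity rather than real natural language processing, and every name, threshold, and sample record below is hypothetical.

```python
import re
from statistics import mean, stdev


def learn_baseline(normal_values):
    """Summarize normal behavior as (mean, standard deviation)."""
    return mean(normal_values), stdev(normal_values)


def find_anomalies(values, baseline, z_threshold=3.0):
    """Return (index, value) pairs that deviate strongly from the baseline.

    z_threshold is an assumed cut-off, not a recommended setting.
    """
    mu, sigma = baseline
    anomalies = []
    for i, v in enumerate(values):
        z = abs(v - mu) / sigma if sigma > 0 else 0.0
        if z > z_threshold:
            anomalies.append((i, v))
    return anomalies


def tokens(text):
    """Lower-cased word set; a stand-in for real language processing."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def most_similar_incident(symptom, past_incidents):
    """Pick the past incident whose symptom shares the most words."""
    query = tokens(symptom)

    def jaccard(incident):
        words = tokens(incident["symptom"])
        union = query | words
        return len(query & words) / len(union) if union else 0.0

    return max(past_incidents, key=jaccard)


# Hypothetical data: log lines per minute during a quiet period, then live.
normal_log_volume = [100, 104, 98, 101, 97, 103, 99, 102]
live_log_volume = [101, 99, 250, 100]
baseline = learn_baseline(normal_log_volume)
print(find_anomalies(live_log_volume, baseline))  # [(2, 250)]

# Hypothetical ticket history.
past_incidents = [
    {"symptom": "checkout service slow after deployment change",
     "fix": "roll back the deployment"},
    {"symptom": "disk usage exceeded limit on database server",
     "fix": "archive old logs"},
]
match = most_similar_incident(
    "database server disk usage is over the limit", past_incidents)
print(match["fix"])  # archive old logs
```

No one writes a threshold rule in this sketch: the cut-off is derived from the quiet-period data, and the suggested fix comes from the closest prior ticket.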
Jerry Cuomo: Right, right, right. So, Rama, there are many subtypes or facets to AI, as you've taught me. Which ones come into play with AI Ops?
Rama Akkiraju: Well, when you say facets specifically, I assume you're referring to what kinds of AI techniques we can bring?
Jerry Cuomo: Right. Yes, yes.
Rama Akkiraju: Yeah. So in AI Ops, we are looking at actually all possible, I would say, all possible dimensions and all possible kinds of AI. So there is machine learning, obviously, where we are training systems with data and letting the systems learn the patterns and make predictions accordingly. And in some cases, capturing rules. And again, these rules are not necessarily a heavy burden on users to create, but things that systems can detect by themselves. So for example, one good example of a rule could be that every time in an alert, typically this is the place where a particular component's name is referenced. Therefore, let's use that as a way to extract an entity by using some kind of an entity extraction module that's specified as a rule. So that's a rule. So we apply machine learning, we apply rules, and in a combination. Sometimes we have to bring both of them together to solve the problem, and a lot of natural language processing. Processing the text tokens and all of that.
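The kind of rule Rama mentions, "the component name always appears at this place in an alert," might look like the following minimal Python sketch. The alert format, with a `component=` field, is purely an assumption made for illustration; real alert formats vary by monitoring tool.

```python
import re

# Hypothetical rule: the component name follows "component=" in the alert.
COMPONENT_RULE = re.compile(r"component=([\w-]+)")


def extract_component(alert_text):
    """Return the component named in the alert, or None if the rule misses."""
    match = COMPONENT_RULE.search(alert_text)
    return match.group(1) if match else None


print(extract_component("CRITICAL component=payments-db latency above limit"))
# payments-db
```

When the rule does not match, the function returns `None`, which is where a learned model could take over, in line with the combination of rules and machine learning described above.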
Jerry Cuomo: So, Rama, natural language. That's one of your specialties. And you mentioned Chat Ops. You also talk about meeting users where they work, and reaching out and talking their language where they are when they need help and information. Can you talk to us a little bit more about Chat Ops?
Rama Akkiraju: Sure. So when we are doing these analytics, predicting anomalies from logs or anomalies from metrics and providing these recommendations, we need a place to deliver these insights and analytics to users. When we looked at the IT operations folks, where do they do their work? They're typically, these days, doing their work in collaboration environments, like Slack or Microsoft Teams and so on. So these are Chat Ops environments. So we deliver insights to people where they work, in a way that they're collaborating with their colleagues. They're asking them what is this problem about? And trying to resolve that problem there. There AI Ops comes in as a helper, a bot that is listening to all of the stuff that's going on and is saying, "Hey, I think this problem is caused by such-and-such component. You may want to take this action." Or, "Hey, by the way, there is this particular script that is already available that you can run automatically. Do you want me to go ahead and run it for you?" So making it easy for them to do their work in the environment that they're working in.
Jerry Cuomo: Yeah. So it's almost like having a digital worker as an extended member of your team.
Rama Akkiraju: Exactly. Exactly. It is exactly that. It's a colleague to IT operations folks, and it's there, listening, and understands what they're talking about. And it is participating in the problem resolution alongside humans.
Jerry Cuomo: All right. Rama, this is the last question. So this is the art of automation, and it's about painting pictures. So can you paint us a picture, through an example with a high wow factor, where AI Ops can automate something extraordinary?
Rama Akkiraju: Okay. High wow factor. Okay, let's see. Well, in AI Ops, Jerry, prevention is better than cure. So let me start with this analogy. Let's say a terrorist attack was being planned, and the FBI, the Federal Bureau of Investigation here in the US, is secretly tracking it. They catch the planners and the to-be perpetrators, terrorist perpetrators, sufficiently in advance to thwart this incident from happening, but with enough solid evidence to try them in a court of law to prove the case.
Jerry Cuomo: Okay.
Rama Akkiraju: Okay. The general citizen population comes to know of this in the newspaper the next day, after the perpetrators have been caught and the issue has been prevented, and they're relieved that nothing bad has happened and that they can go about their daily lives, business as usual. So many lives are saved, and major disruption to the economy and business is thwarted. The FBI team gets great credit for doing their job well, and that's what we expect of the FBI, to help prevent incidents from happening.
Jerry Cuomo: Yeah, that would have a higher wow factor.
Rama Akkiraju: Now, it doesn't make for a super thrilling Hollywood movie, I would say.
Jerry Cuomo: Oh, not bad.
Rama Akkiraju: Like finding suspects after the fact. But nevertheless, it is still a pretty interesting movie scenario where you're piecing together seemingly disconnected plots and threats into a cohesive story, and relating them all to this incident, major incident, that was going to happen. And preventing it from happening before it happens. So let's imagine AI Ops. The nirvana in AI Ops is these self-healing, self-aware, self-managing systems where AI Ops technology is working like the FBI, preventing issues from happening in the first place. You are notified about what issues could have happened, and how much loss could have occurred, what customer dissatisfaction could have happened. But it's all taken care of for you. And it says, "Hey, all these things could have happened. But then I detected them early. I managed the situation. Here is how I addressed the problem, and here is the full evidence of what all has happened, what could have happened, and how it was prevented." So part of achieving that vision is all about monitoring these multiple signals from various disparate sources, connecting them into these cohesive incident stories, tapping into the planning process. Like in the case of terrorists, right? Tapping into the planning process. Not waiting for bad code and bad processes to push poor quality artifacts into production, but tapping into it before. Look at the code vulnerabilities. Look at test coverages. Look at the build and deployment artifact quality. And at every step of the way, put checks and gates to prevent these poor quality artifacts from moving forward. So that way you can prevent bad things from happening. You don't have to be always in the reactive phase. So that's the nirvana in AI Ops. That's what we're looking forward to building. That's what we're working on.
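The "checks and gates" idea at the end of this answer can be pictured as a small function that inspects a build before it is allowed to move forward. The sketch below is a hedged illustration only: the `BuildReport` fields, the 80% coverage floor, and the zero-vulnerability limit are all hypothetical choices, not values stated in the conversation.

```python
from dataclasses import dataclass


@dataclass
class BuildReport:
    """Hypothetical summary of one build, gathered before promotion."""
    build_passed: bool
    test_coverage: float        # fraction of code covered, 0.0 to 1.0
    open_vulnerabilities: int   # count reported by a scanner


def gate_failures(report, min_coverage=0.80, max_vulnerabilities=0):
    """Return the reasons this artifact should not move forward."""
    failures = []
    if not report.build_passed:
        failures.append("build failed")
    if report.test_coverage < min_coverage:
        failures.append("test coverage below minimum")
    if report.open_vulnerabilities > max_vulnerabilities:
        failures.append("open vulnerabilities above limit")
    return failures


report = BuildReport(build_passed=True, test_coverage=0.65,
                     open_vulnerabilities=2)
print(gate_failures(report))
# ['test coverage below minimum', 'open vulnerabilities above limit']
```

An empty list means the artifact may be promoted; anything else stops it before it reaches production, which is the preventive stance described above.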
Jerry Cuomo: Wow. So that is, I would say, that's a work of art. So, Rama, is this art of the possible, possible today?
Rama Akkiraju: Actually it is very much possible, Jerry. It takes hard work. It takes putting a lot of AI in different places in the overall IT application system development life cycle. And it takes hard work in terms of connecting to all the data sources and building the bridges to process the data, and to institute these checks and gates and the human processes. So it's very much possible. As I said, it's hard work. It's doable in terms of AI technology. We are not talking science fiction. It is doable in the next few years, and the technology's already there. The hard work is really all about building the system to make it happen.
Jerry Cuomo: And our Watson AI Ops product is the place to start?
Rama Akkiraju: Exactly. We already have a product in the market with several of these AI analytical pipelines we talked about, like the log anomaly pipeline, helping with next best action recommendation and such. And some of these capabilities for preventing issues from happening are coming up.
Jerry Cuomo: All right. Thank you, Rama. Ladies and gentlemen, you've just heard it from one of our industry's best, Rama Akkiraju, one of Forbes' top 20 leading women in AI research. Thank you so much, Rama, for your insightful discussion.
Rama Akkiraju: It's my honor and pleasure to be interviewed by you.
Jerry Cuomo: Well, thank you one more time, Rama. And thank you everyone. That's all for today. Episode one is officially now on the wire. And folks, if you enjoyed this one, well, we have several more episodes teed up. So stay tuned. You've been listening to the Art of Automation, a podcast that explores the application of automation in the enterprise. And by the way, I'm Jerry Cuomo, IBM Fellow, VP, and CTO for IBM Automation. Thank you everyone. Bye.
DESCRIPTION
Jerry is joined by IBM Fellow and one of Forbes’ “Top 20 Women in AI Research” Rama Akkiraju. They discuss AI for IT Operations (AIOps) and the incredible automation potential it has in the very near future. Rama introduces the idea of a “digital teammate” and explains how it all comes together in “ChatOps.”