Episode 2 - Art of Automation
Kevin Yu: Welcome to the Making of the SRE Omelette podcast, the show where we explore the positive business and client success outcomes of site reliability engineering and hear from experts on how they influenced the cultural and mindset shift that led to those results. I'm your host, Kevin Yu, a principal SRE at IBM applications. What is SRE, you ask? SRE stands for Site Reliability Engineering. It originated at Google as a way to treat operations as if it's a software problem, with the goals of improving system reliability and scalability. Over the years, many organizations like IBM have also embraced the discipline. In my experience, SRE is a mindset that champions a culture of agile and is an engineering discipline that is data driven and KPI focused. It applies a lifecycle approach towards building a resilient solution that meets business goals and delights clients. In today's episode, we will talk about automation and SRE, and I am thrilled to have Jerry Cuomo, IBM fellow, VP, and CTO of Apply Hybrid Cloud and AI, to speak to us on the subject. Jerry is a big deal in IBM. He's an IBM fellow and, in his distinguished career in IBM, he has held many roles, including CTO of IBM Automation. And his podcast, The Art of Automation, is what inspired me to start this show. I learned so many practical applications of how automation can make life better and work easier for all of us through his podcast. That inspired me to start something similar around SRE, which is what led us here. So thank you so much, Jerry, for the inspiration and for spending time with us here today.
Jerry Cuomo: Hey, hey, Kevin. It's so good to be on your show here. I'm so honored.
Kevin Yu: Oh, no, the honor is all mine. So Jerry, let's get right to it. You give a wonderful stage to champion automation on your podcast. Could you please share with guests of the show what automation is and how it helps us?
Jerry Cuomo: Yeah. So Kevin, this is pretty straightforward in a sense that automation has been around for a very, very long time. In fact, as far back in time, anytime a human used a tool to make their life easier, well that's automation. In fact, you can go back and look in Wikipedia for some of the earliest patents on automation and you see, in the industrial era, patents on automating steam engines where you take inputs from valves where you're kind of gathering the steam and you can back off the fire so that you're not overblowing the boiler, et cetera. So humans have always had this tendency and this desire to make their life easier by using tools. And at some level, that's automation. It's really this combination of tools and humans to produce a better outcome, either for the human or for the end user or maybe combination thereof. So automation is that simple. Now automation in the enterprise is just a take off from that. Automation uses, in many cases, modern software and modern technology like artificial intelligence, machine learning, again, to help combine with the intuitions and the skills of a human to create a superhuman. So automation can build out to allow a worker to do maybe more of things that matter to a company and less boring, repetitive stuff.
Kevin Yu: That is a great capture, Jerry. I love how you took us through the journey of how automation has evolved. Yet, the goal is the same and that is to make life easier for humans. And I think that relates really well with the outcome we often look to achieve with SRE and that is to reduce toil, toil meaning work that is repetitive and without enduring value.
Jerry Cuomo: That's right. Yeah.
Kevin Yu: I remember on your very first episode, you had self-driving cars as an example. Now I have had some experience with self-driving cars and I must say, it is a very uneasy feeling to let go of the steering wheel and trust that the car will do the right thing. And I can say that trust also applies in a way to automation at the enterprise. I've seen, in my experience of driving automation, people and teams would often be slow to adopt automation if they don't trust it to do the right thing.
Jerry Cuomo: Oh, yeah.
Kevin Yu: How do you overcome that?
Jerry Cuomo: The way you overcome it, in a word, explainability. I think that's a word. You have to be able to explain yourself to gain trust. So before a skilled SRE is going to trust a piece of code, the piece of code is not going to get in trouble from management. You are. So when you put your reliance in a piece of code to do your job, you have to, as you said, trust it, but the best way to gain trust is to fully understand how it's going to react in that situation. So I'll call that explainability. So the code has to be able to explain itself before it kicks in. And Kevin, way back in the day, I hear, the mainframe introduced autonomous workload management, and I think many people forget about the amazing achievements around the mainframe for intelligent automation and being able to do computer generated, data driven workload management. I asked one of the experts, how long did it take for customers to trust IBM's autonomous workload management on the mainframe? Many folks said, "10 years or more."
Kevin Yu: Wow.
Jerry Cuomo: And I said, "Well, okay. What prevented or what turned the tide in them accepting it?" What prevented it was unpredictability. If it's a banking application, not knowing how it's going to respond during a spike, maybe end of the month, end of the quarter. How is it going to respond? And being able to explain that to your boss if it goes wrong. But then when we asked, "When was it successful?" they said, "Well, there was a training mode, in which it didn't actually apply the automation, but it showed what it would've done if it were in charge and maybe gave recommendations to the user." Say, "Hey. Look, based on these conditions, let's balance the workload this way. Click here and I'll do it for you." And that's a means of gaining trust in automation. So those techniques help, but without explainability, Kevin, you're really not going to get past, "Hey. My job is on the line. I'm not going to let this piece of unknown black box software automate my service and reliability." So yeah. Absolutely. That's a key piece, explainability.
Kevin Yu: Yeah, no, I like how you capture that explainability, that transparency so we can really see what are the data that led to those decisions, right?
Jerry Cuomo: That's right.
Kevin Yu: And we can iterate it and improve it if the result wasn't what we expected.
Jerry Cuomo: And it doesn't get any easier. It gets more powerful with things like machine learning and artificial intelligence, but it doesn't get any easier to explain and you don't get a bye on that as an automation producer.
Kevin Yu: Right, right.
Jerry Cuomo: You are responsible for building a trust bond with your SRE partner through explainability. So yes. It may get a little harder with a more advanced technology, but it is no less important to explain.
Kevin Yu: That's all right. A prereq for SRE is to not shy away from a challenge and now, thanks to Jerry, we have a good design pattern to always make sure to build that explainability as we build and automate. So this is a perfect segue to one of the most asked questions in the SRE community. If we were to go back to the Google definition of SRE and toil management, it sets the industry standard of 50/50, meaning we should spend no more than 50% of our time doing toil, the repetitive tasks. By itself, it sounds very reasonable because why would anyone want to spend more than 50% of the time doing manual and boring stuff when they can do more fun stuff? However, I find the reality is things don't always start fresh. They often have a mountain of technical debt. Add to that putting out fires, not to mention meetings. Do you have experience with things that work well to achieve that 50/50?
Jerry Cuomo: Yeah. I think you can't get to the better part of the 50%. That 50% that gives you your life back. As an SRE, gives you your weekends back so you could spend it with friends and family. You have to live through it. And God knows, Kevin, I've lived through some interesting times as we've seen the dawn of the internet and internet servers and things like that. But you can't get to that 50% unless you've lived through that 50% and you have experience. And once you have that experience, and sometimes experience means gray hair and sometimes experience means scars on your back, et cetera, you can start coding that experience. And I think this is a very important aspect of getting to that 50%, creating a scalable and highly reliable software system. So that kind of transition occurs when you can start to express that experience in code. And I think this is why an SRE is different than let's say a more typical operations team member. It's about code. It's about representing your experience in code. And with that, you start to creep into that 50% and increase that over time even. The key word is repeatability. Code leads to doing things in a non-tribal way. And again, I say this with extreme pain and experience of passing on tribal knowledge in an attempt to scale SRE practices. What does that mean, tribal? It means word of mouth. Maybe in a modern data center, computers that have lots of yellow stickies as a way to express a runbook or a procedure or a process. We laugh, but we've lived through-
Kevin Yu: So true.
Jerry Cuomo: The runbook being a non-digital thing.
Kevin Yu: Yeah.
Jerry Cuomo: Even in the term runbook, it was literally something in a book on a good day. On a bad day, it was something spread out, but you have to call Jane or Mike because it was in their head. So it's really getting to that 50% is about getting to repeatability, which drives scalability, which drives high reliability, which is embodied in code. Now whether that is in code that the SRE teams build over time based on their experience with a system, whether that's code they acquire through a tool, that's kind of how you start looking at that, but you don't start with 50%. You start with close to 0%-
Kevin Yu: Right.
Jerry Cuomo: And a lot of experience. And you build up from there with code.
Kevin Yu: That is a very unique capture. And I think it really touches on SRE itself: the last letter of the acronym, the E, is engineering, which really is what you said, right? Capture that tribal knowledge. I've been there as well. You see all those dashboards, all those things, and people who have been through it will know, "Hey, the problem is here." Meanwhile, people who haven't are like, "What? What did you see? How did you arrive at that?"
Jerry Cuomo: Right.
Kevin Yu: No, I think that's a great capture and you gave me another great quote. Repeatability drives scalability, which drives high reliability, and it is embodied in code based on experiences learned over the years. So Jerry, you touched on many dimensions of SRE: automation, explainability, capturing that tribal knowledge as code to scale. Can you please summarize for the audience what SRE means to you?
Jerry Cuomo: I'm going to start with code because I think code is the essence of building a bot. It's the essence of automation and the repeatability that you get in automation. Hopefully code also represents something that can be read. Hence, over time, trust is gained from it. So I think SRE, what it means to me is really it's that process of managing systems, solving problems, ultimately automating operation tasks. And I think that builds out ultimately toward creating a more scalable, highly reliable software system through code, through repeatability, through automation.
Kevin Yu: That's a great capture and sounds like a wonderful profession to be in. So Jerry, you talk about code to capture the knowledge and experience. Let's touch on innovation.
Jerry Cuomo: Sure. Sure.
Kevin Yu: Part of my goal of this podcast is to inspire future site reliability engineers.
Jerry Cuomo: Oh yes. Yes.
Kevin Yu: So any advice you can give to the next generation of SREs on how they can innovate in this space to make life easier for themselves, meet business goals, and build solutions that lead to happy users?
Jerry Cuomo: Calling all SREs, past, present, and future. Here it is. The car industry, you said this at the beginning of the show, self-driving, autopilot. We have an opportunity to catch up, if not leapfrog, the automotive industry, what they've done, as an inspiration. I think driving a car and getting a car to drive itself is a highly innovative breakthrough. Very different than getting a business to self-drive itself or getting your IT operations to self-drive itself. They're both dangerous and they both have great rewards. But now let me break it down to classes of innovation that I think, as an emergent SRE, you can look forward to and as I said, it's a call to action. How do we go from reactive to proactive? How do we go to a system that, when you check in a new piece of code as a developer, you get a Slack alert that says, "Hey Kevin. You have those tickets to the hockey game on Saturday afternoon. If you check in this piece of code, I'm afraid you're not going to be going to that game with your family. You're going to be working overtime because there's a 94% likelihood that this code is going to cause an outage downstream. Let me explain why. Please fix it tonight so that you can be with your family over the weekend." That's an exaggerated form of how do you go from reactive, which is you're getting in the car to go to the game and you get an alert saying, "Uh-huh. You need to work now," to that predictive action. But that's an exaggerated view of the car self-driving analogy. But how do we get to that? What sorts of breakthroughs have to occur? It starts with data. It starts with experience. You need to get data from every nook and cranny of your enterprise. And when I say that, I don't take that lightly. I take it quite literally, meaning shift left to your developers. So getting GitHub commits on one side. On the other side, operational and incident history, maybe coming from systems like ServiceNow.
But I think increasingly we will see additional innovation by incorporating business goals and business targets into your data and insights. If you have a goal to fulfill orders in 24 hours, that's a business KPI or SLA, what in IT can affect that? So if it starts to slow down, could we then better inform the business that we're going to miss this important SLA? Because at the end of the day, great SRE practices are fueled by innovations that lower costs, improve time, reduce time to fix issues as I'm alluding to, maybe even predicting and avoiding the issue before it even happens, but ultimately, the real value back to your clients for innovating in SRE is to drive your net promoter score up. What does that mean? Customer sentiment. That your customers love your website, love your product, love what the operations team is operating. And the way you drive that love, the way you separate the user from loving your system is by making it slow, making it fail. They're not going to love you if that happens, but if you can drive those qualities, you'll have true endearment and connection with your end users and that's really what it's all about. So innovation in the form of machine learning models that take data from every nook and cranny, train on shift left, what's happening in development, what happened in the past in operations, inferencing, connecting the dots across those disparate events, bringing business intuition in in the form of business events, being able to use artificial intelligence to go from reactive to predictive and proactive, but also I mentioned earlier that we need more trustworthy, more natural ways to communicate with our runbooks, with our bots. So using natural language processing, and I don't always necessarily mean spoken language, it could be through chat ops and being able to interact with the humans where they're working versus having them come to you. "Geez, the last thing an SRE needs is another dashboard.
We have so many dashboards to follow. Why don't you bring the information to me in my language where I am?" I think those are another set of rich breakthroughs on interaction models between systems and humans. I joked once about a patent with my son, who is also into music like I am, that when you walk into your data center, whether it's real or virtual, you should play background mood music that represents the tone and health of your IT system. So if you hear the Death March from Star Wars, you know that there's an outage that's about to happen and you better take note. Anyway. But if you ask about innovations, those are the classes, Kevin.
Kevin Yu: Oh my gosh, Jerry. You just got my brain going. And you know what? You will laugh. Guess what? I actually came to this podcast recording session from a dashboard meeting. That was-
Jerry Cuomo: Oh, no more dashboards please.
Kevin Yu: Right, right. And you mentioned unhappy customers. Jerry, slow is the new down because we're getting really good at keeping things up, but there are still lots of disruptions and slowness and data delays and whatnot. And I really like how you use hockey. Being a Canadian, I really appreciate that. And I think that is a great relation to motivation and perhaps why design thinking is so successful in that by taking the perspective of users, we can make more meaningful products. So now speaking of products, I would like to touch on the area of ROI, Return on Investment. You touched on cost and the reason why I bring that up is, if we had unbounded resources and money, sure, we could spend all the time we want on SRE, but the reality is any business, no matter how big they are, even IBM, has a finite amount of time and resources to invest. I would say people would no longer debate with me how important SRE is. They'll say, "Hey, Kevin. This is great." They won't say, "Don't put it on," they'll say, "Okay. We'll look at it." But it's-
Jerry Cuomo: That's right.
Kevin Yu: It's often at a lower priority versus other product features. I would say the exception is leading up to peak seasons like Cyber Monday where there is an increased fascination around performance and scalability. But I would like SRE not to be a point in time investment, but a constant mindset for the full product lifecycle. So Jerry, do you have some ROI perspectives and metrics you can share with the audience to help us drive that prioritization discussion with our leaders?
Jerry Cuomo: Yeah. So let's focus on the three I mentioned earlier starting with cost. Cost is always a factor. And if we as an SRE community can prove better cost arrangements, I think that's a much easier sell. But yet, time is money as well, which is the next one and I think they're highly related. So yes, it's about removing mundane work that people do, but a really bad path to go down is to say, "We're going to do it with fewer people." And I think that's probably the worst place to cut cost. The best place to cut cost I think is looking at time and time spent. In the examples I gave, the more you can shift left on a problem, the cheaper it's going to be to rectify the problem to the point that if you don't have the problem at all, then that's probably the best cost equation altogether. But the cost to fix a problem once the problem is in the field is immense. So if you can turn back time and shift left and pay more attention, use more insight and data as to the probability of an outage as you're about to push the code out, invest more in testing, not less in testing, then the ROI is going to be very clear around cost savings. But I also want to speak to the other side, which, as I mentioned, is the net promoter score and that's how your customers feel about you. Part of the ROI is increased NPS. And I think that is the positive view of it. If sites are performing well, if sites are always up and operational, especially in times of need, that has an immense cost or ROI value back to your company. If I look at the negative side of that, there's a reputational ROI as well, both positive and negative, for not being proactive around SRE. Unfortunately, Kevin, these things are harder to put down on paper and it's like you don't really think about needing a doctor until you're sick. And then once you are sick and a doctor tells you how much they cost, you're like, "Price doesn't matter at this point. Just go fix me."
And I think it relates pretty well here. Prevention is better sometimes than the cost. So knowing ROI on prevention is a big deal and that investment could save downstream immensely in the form of a positive reputation or negative, performing well in front of your clients, and delivering that function that your company should be known for. So ROI, cost, time, NPS, that's how I like to break down the importance of SRE from a return on investment perspective. Of course, automation, I believe, is right in the thick of it as we've described before. I don't think I need to repeat that, but what the heck? I just repeated it.
Kevin Yu: That's quite all right. As one of my mentors, Marsha, always says, "SRE, the mindset, it's all about beating that drum." Automation, it doesn't hurt to beat on that a little bit more. So Jerry, you took us through the journey of SRE and how it came to be so important in today's enterprise in driving that client success. Where do you think SRE is going?
Jerry Cuomo: Autopilot or bust. I think having a flight control center where bots, digital employees, and humans collaborate, allowing humans to do the things that humans do best, exert judgment, have empathy, working on customers, with customers face to face and directly, I think that is a great division of labor. So autopilot is not to replace the SRE, but it's to help the SRE do those higher value things that humans are quite good at doing. Right? So let software do what software does best, repeatability, and let humans do what they do best. That trend of where it's going, it doesn't eliminate the need to be driven by business goals, to have experience. Absolutely requires that as an input. But then that enables you to code the experience for repeatability, bring in automation with of course explainability so we get adoption, drive it and put the innovation cherry on top with AI, machine learning, natural language, bots, humans, NPS increasing, rinse and repeat, Kevin. That's the omelet for the day.
Kevin Yu: That's a perfect wrap up. I see you're experienced. So there you go, ladies and gentlemen. SRE from the perspective of Jerry Cuomo, IBM fellow and CTO of IBM Automation. Thank you for joining us.
Jerry Cuomo: Well thank you, Kevin. Thanks for having me.
Kevin Yu: And as I mentioned, Jerry has his podcast, The Art of Automation, where he leads the audience through the fascinating world of enterprise automation. Visit it to hear more of Jerry's stories and perspectives. Thank you all for listening. This is Kevin Yu, principal SRE at IBM applications. See you again on an upcoming episode.
DESCRIPTION
Jerry Cuomo, IBM Fellow, CTO of IBM Automation and host of The Art of Automation podcast, takes the audience through the evolution of automation and how it relates to SRE. Jerry shares with the audience what SRE means to him, touches on the ROI of SRE and the ingredients of the SRE omelette. Jerry also has a shout out to all Site Reliability Engineers, past, current and future, on innovation in this space.