Episode 16 - Reliability Engineering
Kevin Yu: Hello. Welcome back to another episode of Making of the SRE Omelette podcast, where we talk about how we achieve business and client success via the practice of site reliability engineering. In today's episode, I speak with Kyle Brown, IBM Fellow, VP, CTO for the CIO, on the subject of how we transform SRE into how we do business and incorporate reliability engineering as part of how we build products and services. We touch on why Kyle doesn't like the S in SRE, what it looks like when reliability engineering is considered at the organization level, and what the future looks like, plus fantastic technical vitality tips from Kyle and a bonus recipe for a real omelet. Enjoy. Welcome to the show, Kyle.
Kyle Brown: Thank you very much, Kevin. I'm so glad to be here and looking forward to this.
Kevin Yu: Kyle, I remember working with you when I was in consulting and I always appreciated the energy you bring, so I'm excited to see where it takes us. Let's start with an easy one. This show is about understanding the business impact of SRE and the culture to achieve the outcome. Can you start by sharing with the audience what SRE means to you?
Kyle Brown: It's funny because we have some mutual friends from all the years we've worked together. One of them, who I think may have been a guest on your show, is Ingo Averdunk. Ingo and I have had a longstanding mild disagreement about what SRE means to each of us for a long time, and it really comes down to the very first time, I would say, we used site reliability engineering as a practice at a customer together, and that was one of the big customers we worked at, one of the big American airlines that we were working together at. And here's where I always had a problem with the word site in site reliability engineering. I know where it came from. I read the original Google papers. I've read the Google book, I've seen all those videos, I know where that came from, but I always hated the word site because I thought-
Kevin Yu: Why's that?
Kyle Brown: I think there's two different S's that are more important that get to the heart of where I believe it's a necessary part of everything we do. And I prefer to, instead of thinking of a person as a site reliability engineer, I think of them as either being a system reliability engineer or a service reliability engineer. And that's getting to an important distinction, and that's the distinction between are you providing a platform on which people build or are you providing a service for other customers? That could be like an API that is being consumed by upstream APIs or it could be an end user service that your clients are actually using. Well, those are both part of site reliability engineering, but those are different parts of site reliability engineering that have different features and different functions. And I think it's important that an SRE knows which one he's falling into and kind of what his specialization is. And I actually wrote an article about this once, and of course Ingo didn't like it, but hey. We had a general disagreement and we agreed to disagree in the end. But that's kind of the first thing to me. And as I think about what I do and my job as the CTO of the CIO office at IBM, we have both kinds of SREs in our business because first of all, we have to provide a platform that has to have all of the reliability, all of the affordability and cost management, all of the security and other non-functional requirements that are required by IBM. Then each of the services, each of the applications, the business applications that we have on top of that also have their own set of functional and non-functional requirements. And that's where it's important to have the SREs at the platform level and the SREs at the service or the application level be in constant communication and kind of constant discussion and agreement about the evolution of both because neither one of them is going to stand still.
Kevin Yu: So Kyle, if we take away the S, would you be in agreement with Ingo then?
Kyle Brown: Yeah, absolutely.
Kevin Yu: Wow. You know what the title of today's episode is? It's just that. Reliability engineering. I love it, and I like how you mentioned platform and application. In a way, it really takes reliability engineering to a higher level. It's not as much of a job role, but it's really different hats, no matter which part of the team you are in. I can argue if you're in marketing or design, you could wear that hat and make sure anything we design or dream up in the first place doesn't lead to a very complex situation for the folks down the line.
Kyle Brown: Oh, absolutely. And that was an important part of, again, just thinking about this one client we did with one of the airlines, Ingo and I were both part of the initial design thinking workshops that went into the definition of what their platform would look like, and also in the initial design thinking workshops that led to the definition of the first few services that were stood up on that platform. And that's why I think it's important that SREs and architects together work through that entire process end-to-end because each of them are bringing a different perspective that, you're right. It has to be either part of what you're building regardless, and you can't just relegate SREs to just one part of the life cycle or one part of the system. It has to be very broad and cross everything.
Kevin Yu: Right. Speaking of that, I remember the times when we were at the customer. I can still picture it. I remember sitting in the war room getting ready for the holiday peak season, the Black Friday, Cyber Monday, and the challenge we had was to prepare the system to meet the performance goals for the peak so they don't fail, lose money and brand reputation. It was very much like what you said, someone else built it, and we were there after the fact, at a point in time, to make it perform, make it scale, make it reliable. Kyle, do you still see that today?
Kyle Brown: Absolutely, I still see that today and that's absolutely critical to what we're doing. One of the decisions that we're now really starting to get into, and this is kind of an interesting inflection point that we're at in the CIO. Let's put it this way. We've always focused on a specific set of what I want to call minimum non-functional requirements, and those are the ones that you and I worked together with at that particular retailer. They, for instance, just had to stay up. They had to make sure that their overall reliability of their site met a certain minimum standard. And not only that, but you'll remember that first year that we were there together, there was actually a blip despite all of the work we had put into it. At about 10 o'clock in the morning, we saw a database blip occur, but we were able to deal with that. Well, what that meant is that we had a very, very good MTTR because we managed to get that thing diagnosed, dealt with and brought back inside of 10 minutes. In fact, it never stopped orders flowing through the entire time. We just slowed them just a little bit for about 10 minutes on Black Friday. And that gets to the fact that we've always thought about reliability in terms of those few measures, MTTR and uptime and a few other things like that, but reliability means more now. One of the important things that we have to think about now is reliability in terms of predictability in spend. Ah, let's think about that. How many of our clients that you and I know about have been just completely blindsided when they get their AWS bill and they realize that due to some unexpected thing, their AWS bill is $500,000 higher this month than it was last month?
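As a rough illustration of the two classic measures Kyle names here, MTTR and uptime, the following Python sketch computes both from a list of incidents. The function names, the incident format, and the example timestamps are assumptions made for illustration only; they are not from the conversation. The sketch also treats every incident as full downtime, which is a simplification, since in Kyle's story orders kept flowing at a slower rate.

```python
from datetime import datetime, timedelta

def mean_time_to_restore(incidents):
    """Average of (restored - detected) over all incidents.

    `incidents` is a list of (detected, restored) datetime pairs.
    This input format is a hypothetical one chosen for the sketch.
    """
    durations = [restored - detected for detected, restored in incidents]
    return sum(durations, timedelta()) / len(durations)

def availability(incidents, window):
    """Fraction of `window` (a timedelta) not spent inside an incident.

    Simplification: each incident counts as complete downtime.
    """
    downtime = sum((restored - detected for detected, restored in incidents),
                   timedelta())
    return 1 - downtime / window

# Hypothetical example: a single ten-minute blip on one day.
example_incidents = [
    (datetime(2023, 1, 1, 10, 0), datetime(2023, 1, 1, 10, 10)),
]
print(mean_time_to_restore(example_incidents))
print(availability(example_incidents, timedelta(days=1)))
```

Measures like predictability of spend, which Kyle turns to next, would need different inputs (for example monthly billing data) and are not covered by this sketch.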
Kevin Yu: Yeah, right, right, right
Kyle Brown: Well, that's an aspect of reliability, if you think about it. FinOps as a discipline is a critical part of site reliability engineering now, or in fact, I would say systems reliability engineering. But again, that's to my original kind of problem I have with the definition, but it's not just FinOps. One of the things that works is given the importance of ESG topics-
Kevin Yu: Yeah, sustainability.
Kyle Brown: Issues around sustainability, particularly around carbon emissions, are very important. When we were doing some of the initial analyses of our carbon footprints for our different workloads, we were shocked at exactly how different that carbon footprint is depending on what data center it was running in. Now, we run in colos, so it's not like IBM necessarily owns these data centers. They're colos and they're in the standard places that everyone's colos are, but just due to the power generation differences in these areas, you could see major differences in the carbon footprints. And that again becomes part of systems reliability engineering, because it's not just where you put your workload based on cost, but where you put your workload based on the ESG footprint or the privacy footprint, especially as so many new regulations are coming online regarding data privacy and making us have to think about questions of things like data sovereignty. It makes the SRE job a lot more challenging now.
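To make the placement trade-off Kyle describes a little more concrete, here is a minimal Python sketch that treats data sovereignty as a hard constraint and then weighs cost against carbon intensity. Everything in it is an assumption for illustration: the `Site` fields, the `choose_site` function, the weighting scheme, and the idea of a single normalized score. It is not a description of how IBM or any particular tool makes this decision.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Site:
    """A candidate data center. All fields are hypothetical for this sketch."""
    name: str
    monthly_cost_usd: float      # assumed to be positive
    grid_gco2_per_kwh: float     # carbon intensity of local power, assumed positive
    jurisdiction: str            # where the data would legally reside

def choose_site(sites, allowed_jurisdictions, carbon_weight=0.5):
    """Pick a site: sovereignty is a hard filter, cost and carbon are weighed.

    carbon_weight=0 means "cheapest eligible site"; carbon_weight=1 means
    "lowest-carbon eligible site". Values in between blend the two after
    normalizing each by its maximum among the eligible sites.
    """
    eligible = [s for s in sites if s.jurisdiction in allowed_jurisdictions]
    if not eligible:
        raise ValueError("no site satisfies the data-sovereignty constraint")
    max_cost = max(s.monthly_cost_usd for s in eligible)
    max_carbon = max(s.grid_gco2_per_kwh for s in eligible)

    def score(site):
        return ((1 - carbon_weight) * site.monthly_cost_usd / max_cost
                + carbon_weight * site.grid_gco2_per_kwh / max_carbon)

    return min(eligible, key=score)
```

The design choice worth noting is that the regulatory requirement is filtered first rather than scored, on the assumption that a sovereignty violation cannot be traded off against a cheaper or greener location.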
Kevin Yu: You know what, Kyle? I think you just secured yourself as a guest in season two of this podcast, where we put a spin of sustainability to the omelet.
Kyle Brown: We'll do it with cage-free eggs that way.
Kevin Yu: For sure. And actually, let me pause there. I'll put you on the spot a little bit, see where it goes. One of our common friends, Stacy Joines, was on the show earlier, and she and I talked about how she doesn't like the term non-functional requirements, and she was challenging us to come up with a better wording for it. Do you have a better wording for NFR?
Kyle Brown: Oh, that's an interesting question. I see the problem: all requirements are functional. That's the issue. When the initial distinction was made between NFRs and functional requirements, it was in an era where it was assumed that all of the non-functional requirements were taken care of for you by the platform. Because this really goes back to the early days of the mainframe era, and that was an era in which there were very deep distinctions between the platform and the application, and your application had to arrive in a hand-tied bow of punch cards that went to the high priest, and the high priest would bless it and move on, put it in the card reader, and it would go. Well, understand that today, especially in a services-oriented architecture, and especially with the introduction of microservices and especially how many services are available already for purchase on the web, either as SaaS services or just as services that you can purchase to run as containers, that distinction is not as clear anymore, and it makes it harder for us to define what that even means when we're talking about not just the reliability of an application as a whole or the performance of an application as a whole. Instead, we have to think about this as being a network of loosely coupled components that may have multiple ways of being loosely coupled. And I think it means we have to think about things as a constantly changing matrix as opposed to just a list the way we used to think about NFRs and functional requirements, and you really can't make a distinction between the two anymore.
Kevin Yu: We have very good practice and mindset on security by design. To me, we have arrived at the time where you have to make it reliable by design. You should no longer speak of it separately, right?
Kyle Brown: It used to be that we made these very large scale build versus buy decisions, and it would be, "Okay, well, is there a commercial off the shelf product that does everything I want? And if so, I'll buy it. Otherwise, I'll build the whole thing." I can't think of a situation where we make big large scale decisions like that. Now it's down to, oh, it's like this particular platform, and you can name your platform, be it SAP or Salesforce or ServiceNow or whatever. This particular platform offers this little module that does this thing. And now you've got to decide, "Well, is it worth it for me to buy that little module that does this thing on top of these other big things I'm already doing? Or do I write a component that does this thing? Because what I want is not exactly that thing. It's 45 degrees off." That's the hard decision we need to make. And the problem I think we have is we try to end up making that purely at the business level. We can't, because it's at the systems level that these kinds of things are integrated, and if you don't have your SREs as part of that decision, you regret it in the end because you're not going to be observable, you're not going to be manageable. You're going to find out that upgrades are going to kill you if you haven't pulled this in a common way, or you are not going to have, let's say, the performance or the loose coupling that you're looking for. And that's why I think these get so confused, the way that Stacy was talking about.
Kevin Yu: Yeah, I think about the more different components or vendors or interactions you have, the more dependencies you have, right? You're absolutely right. If SRE is not involved from the start, it's too late to build instrumentation afterwards, after the fact. So Kyle, can you describe to the audience what it looks like from the development lifecycle perspective? If we were to build products, and I'll use your word services, with that SRE hat and perspective from day one, what does that look like?
Kyle Brown: Tell you the first thing that it looks like, and that is everything is automated. And I mean everything. And this is true whether you're wearing either of the two hats I talked about, either a systems hat or a services hat. So for instance, one of the things that we're doing a lot on the systems side is we do a ton of Ansible automation. We've got probably close to 200 some odd engineers that actually write Ansible code, and we've probably got more than a thousand engineers that actually run Ansible. That just shows you the commitment we have to that level of automation. But let me talk about the scope of that. It's a lot of the standard things you would think about that you can do with Ansible. When we have to stand up a VM, sure. Or when we have to stand up a new node for Kubernetes, yeah, that all happens. But it also includes things like when we need to update our inventory records of what's running on something or when we need to open a firewall rule or when we need to, let's say, make a change to an external network configuration. It used to be we did all these with different tools, but then we moved to at least all these tools have APIs. That was the first step. Now that allows us to have common scripting processes across all of these and to invoke them in a common way. And the way with which we do the invocation is essentially a CI/CD pipeline, so that you make a change, the change gets picked up in GitHub, it gets processed, and in the end the thing gets deployed and gets invoked. And that is, I think, a different way of looking at platform engineering and the way that you think about the way changes are made to your platform than we've necessarily done before. And absolutely the same thing applies on the services side in that a lot of the things that we've done manually before need to be automated. Oh, here's one really good example.
This is something that someone in our CIO office thought of, and it's one of these ideas that are so simple and beautiful in retrospect, you can't believe no one had thought of it before. So think about user stories way early in the process, okay? We're always putting user stories into Jira, okay? That's what everybody does. They write a user story, they put it in Jira, they put it in Trello or their favorite tool of choice. How many of those user stories are about common things, about security requirements? Remember how we talked about some of the, to use Stacy's term, non-functional requirements, or you've got to build your site securely? Well, how many of those common things are about security issues that are really fairly common? Well, this simple brilliant idea is why don't we just make those common? And we sat, we kind of looked at this and we're like, duh, that's brilliant. Why haven't we thought of that before?
Kevin Yu: Well, hindsight is 20/20.
Kyle Brown: And now it has made our lives so much easier because we can actually say, well, yeah, it doesn't matter. This system we're building or this service or this service, if you're trying to meet this particular, to use IBM's term, ITSS security requirement, we just have a common story for it. Then the story just gets added into the backlog and people have to fulfill it. The cool thing is the fulfillment's then automated because you do it the same way every time. If it's coming down to a DAST scan or a SAST scan, hey, you're doing it the same way. Why don't we automate that part too?
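The common-story idea can be sketched in a few lines of Python: keep one catalog of reusable security stories keyed by requirement, and add any that a team's backlog is missing. The requirement IDs, story wording, dictionary layout, and the `add_common_stories` function below are all hypothetical, invented for this sketch. They do not reflect IBM's actual ITSS requirements or the API of Jira, Trello, or any other tracker.

```python
# Hypothetical catalog of shared stories, keyed by a made-up requirement ID.
COMMON_STORIES = {
    "SEC-DAST": "As a security owner, I want a DAST scan to run on every "
                "release candidate so that runtime vulnerabilities are caught.",
    "SEC-SAST": "As a security owner, I want a SAST scan to run on every "
                "commit so that insecure code patterns are caught early.",
}

def add_common_stories(backlog, requirement_ids):
    """Append the shared story for each requirement not already in the backlog.

    `backlog` is a list of dicts; stories created here carry a "requirement"
    key so repeated calls do not add duplicates. Returns the stories added.
    """
    existing = {story.get("requirement") for story in backlog}
    added = []
    for requirement_id in requirement_ids:
        if requirement_id not in COMMON_STORIES:
            raise KeyError(f"no common story defined for {requirement_id}")
        if requirement_id in existing:
            continue
        story = {
            "requirement": requirement_id,
            "title": COMMON_STORIES[requirement_id],
            # Flag that fulfillment is done the same way every time,
            # so it is a candidate for pipeline automation.
            "automated_fulfillment": True,
        }
        backlog.append(story)
        existing.add(requirement_id)
        added.append(story)
    return added
```

Because every team gets the same story text and the same requirement key, the scan that fulfills it can be wired into a shared pipeline once rather than per team, which is the automation step Kyle points to.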
Kevin Yu: And even use the same personas, right? Because-
Kyle Brown: Absolutely, why not? So it's that kind of thing that I think is really important from that perspective, and that we have to think about automation being not just kind of the lifeblood of what an SRE does, but a part of the organization as a whole. And I think that's the transition of making it important for how the organization functions, so it's not just the SREs having this unique role in the organization, it's everyone thinking about it that way.
Kevin Yu: Kyle, you just gave me a light bulb moment, going off the script here. I was chatting with our friend Jerry Kumo, and I asked him what the future would look like. He used the analogy of if we can build self-driving cars, why can't we build self-running operations? Kyle, with ChatGPT and generative AI, can we get to the point of self-driving business, to take it up an even higher notch?
Kyle Brown: Well, okay, all due respect to Jerry who I know and love very well. First of all, even self-driving cars have issues.
Kevin Yu: Well, yeah, yeah. I assume that's a different podcast, Kyle.
Kyle Brown: But here's kind of the issue that I have around that analogy. I don't think we're ever going to take humans out of the loop, and here's my basic reason. Because we are not only the source of the requirements, we're the source of the ideas around how to build systems that embody those requirements. So here's the thing. I've been reading from the beginning of my career, which is way too long ago, that they can replace programmers, and I was just commenting to someone that I've seen this cycle of, let's call it low-code, no-code to use the current terms. I've seen this low-code, no-code cycle come through three separate times in my career and they've never been able to replace programmers because I mean, let's face it, and there was a wonderful snarky tweet that I'm going to paraphrase here. It's like if we can't explain to computers using very highly sophisticated and self-correcting grammars what we want them to do, i.e. programming languages, what makes you think that we're going to be able to do that with regular English or regular human languages? And that's kind of what I feel: I don't think we're ever going to get fully to that point. Are we going to do more automation? Yeah, automation is going to increase every single year and we're going to find new things to automate, but the parts we're automating are the parts that are no longer changing, and as humans, we will always find new things to change, and that's why I don't think it's ever going to get entirely self-driving.
Kevin Yu: Right. And very much like self-driving today, maybe instead of AI machine or human loop, it's humans operating in an AI loop, but we're still involved.
Kyle Brown: Well, I think what's happening is we keep widening the loop. We just keep widening the loop because if you go way far back, and I'm talking about reading books like the original version of The Mythical Man-Month, which is still a wonderful book and everyone should read it, and Fred Brooks actually just died here I think this last year. So RIP Fred, he was a wonderful guy. I did meet him a couple of times. If you go back and you read the original version, he was talking about his perfect squad of developers and one of the jobs in that was typist, another one was secretary. Do we need those jobs in a development squad now? No, we automated those away 30 years ago or more. Well, are we going to automate away some more parts of the development environment? Sure, absolutely. We don't need someone, for instance, to do the same kind of low level tools that his later editions talked about because we have wonderful tools like JavaScript and Python that have all sorts of ecosystems around them that allow us to basically do all the kinds of builds and construction and incorporate frameworks and libraries super easily now, much simpler than when I was learning C++ back at school, but we've now moved up a level on the kinds of tools we build. Now we talk about doing things like building common CI/CD pipelines across multiple applications, or we talk about building platforms that cross multiple architectures and clouds and allow applications to run across them. Well, all we did is we widened the loop, we automated the smaller parts down at the bottom, but the loop just kept getting bigger and bigger and bigger, and I think we're just going to continue to do that essentially forever.
Kevin Yu: That's an interesting perspective, Kyle. I like the point you made about widening the loop, and I would tell you that on that part, you and Jerry agree. That is, automation comes from experience of something we have done and learned. So as long as we continue to discover new problem domains, there is a need for humans after all. So thank you Kyle for that off-topic discussion. Let me bring it back to the heart of this episode, that is reliability engineering, and you spoke of the future, or rather where we need to be, which is that reliability engineering is not just a point in time task, but what we do at the start. What would you suggest for people to get started if they want to embrace that future?
Kyle Brown: First thing is you have to start thinking from the bottom up in terms of both of those things we talked about at the very top of this discussion around the two different ideas, systems and services. You have to plan out what you want your platform to look like and you have to realize they're two different things. I mean, I was having a discussion this morning where we were talking about a particular client who is struggling with the idea that these are different things. They kind of look at the cloud as being this vague nebulous term and they're not realizing that you have to build a platform to run your services, plural. It's not just one thing for one, okay? There are way too many cloud applications I've seen where people did try to do it as a long, that's a mistake. So going back to this, so the first thing you have to do is you have to plan out your platform, and I want to give the advice first of all that your platform architecture should not just be whatever is being offered by your cloud vendor of choice. You need to be more selective than that. You need to actually think about the needs you have in terms of reliability, the needs you have in terms of performance, the needs you have in terms of observability, security, all the rest of these things, and you need to make intentional choices around what is the best way to fulfill each of these different pieces. Sometimes that will be something from a cloud vendor, sometimes it will be something from open source or a third party, but don't just make the assumption that the cloud vendors know everything on this.
The second thing that's also worthwhile to think about on a platform, and this is kind of a common thread you'll get through a lot of our IBM people you talk to, is you need to think about the fact that the world is multi-cloud, and that will lead you to make different choices in terms of the constituents and the components of your platform if you need to think about things that might work on multiple clouds. And then you need to think about how you're going to manage that platform, both in simple terms of how do you do deployments to it, but in more complex terms of how do you join together your observability, how do you join together your FinOps, how do you join together your overall management and your security? Those are things you have to think about at the platform level, and this is almost before you even build your first service. You at least have to acknowledge these problems exist and start working towards solutions for all of them. Well, once you've done that, then you can start thinking about how you build applications that are optimized to run in that particular kind of world. Now for us in the CIO, what that normally means is our applications are either going to be for one of the parts of our SaaS platforms (I have already named a whole bunch of SaaS platforms it can be), or it will be a containerized application. The SaaS platforms allow us to just take advantage of the fact that SaaS can be anywhere. That's cool. Containerization allows us to do that in a different way. It's the fact that we can run containers on-prem or in one of our mainframes or in a cloud. And in fact with OpenShift you can run that in pretty much any cloud. So that is one of those decisions around the services engineering that we need to make, but that's just the first of them.
Then we have to start thinking about what does that mean in terms of the requirements of what we do inside the containers and how we connect to our external services and what set of external services we want to provide that might be available on multiple clouds or might be available in multiple environments. And these kinds of things just keep multiplying as you do that, but these are the levels at which you have to think if you're planning out a way of building these systems that are going to cross these multi-cloud platforms and are going to be made up of dozens or hundreds of interconnected services like this.
Kevin Yu: So the way I see that is, let's keep on adding analogies to this omelet podcast. Before you build up the house, you make sure you have the right foundation, and maybe that applies to the platform, and once you're sure the plumbing is correct, then you can start thinking about how you're going to structure the design and whatnot, and if we were to incorporate additional devices in the house, making sure they can talk to each other properly with your dependencies.
Kyle Brown: You generally don't make your interior decoration choices before the house is built. You're a little odd. I know people that have done that, but you make those detailed choices about what throw pillows you're going to have after you see the room. Well, it's the same thing. Once you've built the platform, then that's going to influence the set of things that you will be doing in building the services. Even though there are some common things that you know all services are going to have to fulfill, it's still going to help you understand better what that exact set of things is.
Kevin Yu: Right, and the reverse is true. If you start making those design choices earlier and you have to change, it becomes a lot more costly to change it after the fact.
Kyle Brown: And that's why there are a couple of bad ends of the spectrum. Bad end number one is build it and they will come, and we still saw a lot of that early on in-
Kevin Yu: Oh yes.
Kyle Brown: Where people would go out and they would build these really big complicated clouds and they would have no applications come into it because they built it entirely without talking to any application teams. Okay, well, the other end is these customers we've had where they've just said, "Oh, I'm taking my old application and I'm just moving it to the cloud", and they just port the same old mess they had on-prem into the cloud and then they're surprised when it's not cheaper and in fact it performs worse. Somewhere in the middle is where you want to be. You think about the platform and you want to think about the applications at the same time and let the design of each influence the other.
Kevin Yu: I like that. I'm going to steal Bill Higgins' analogy that there are ditches on both sides of the road. You want to stay in the middle. So Kyle, one of the big things of this podcast is giving back to the community. Any words of wisdom you want to share for practitioners looking to get into this space or to sharpen their skills?
Kyle Brown: Absolutely. You know this is one of my favorite topics. This could be its own podcast in and of itself. So there's a couple of things that I think are important for someone when they're talking about their own technical vitality and their career development, and that is you can be the best engineer possible in terms of you write the best code, your automations run the first time, you never end up being the person who breaks the build. You can be the best engineer possible, but if you can't communicate your results and you can't communicate what you're doing to others, it doesn't make any difference. And that's why I think that the key skill that everyone in site reliability engineering or any engineering discipline really needs to develop is their communication skills. And yeah, I know people think, "That's a soft skill. I don't have to do that. I'm an engineer." Yeah, you do. So one of the things that I encourage people to do is to actually spend some time working on that, and boy will that have an effect. As I look at the people who I've mentored and that have mentored me and looking at their careers, that kind of ability to communicate both verbally and in a written way and also presenting in front of an audience, which is just a type of verbal communication, that is absolutely critical to their career success, and you don't get there overnight. It's not like you're going to be keynoting a DevOps conference your first year as an SRE. No. You're not going to be the keynote speaker at SREcon. It just won't happen. But over a period of 10 years where you gradually work on things and you build your portfolio and you build your network of people and you accept feedback as it comes from people and you incorporate that into what you're doing. Yeah, well, over a period of 10 years, absolutely you could grow to become a keynote speaker at SREcon, but the way you do that is you start with a small audience. Let's just take the public speaking point.
You don't start by speaking to 2000 people at SREcon. You start by maybe just talking to your squad, your team, which is maybe five or 10 people. Cool, okay. Present on some subject. There's a piece of code you wrote, offer to just explain it to people and show how this is maybe different than any other code or automation that's been written as part of the system so far. Or if there's some cool new feature you think we need to add based on some new piece of technology, and that could live in a library, do the same thing, just kind of explain that to your squad, but then listen to the feedback you get and incorporate that the next time around. Well, then you move to the next larger audience, maybe a meetup where you might be talking to 20 or 30 people. If you've done three or four meetups like that, hey, maybe you're ready to submit your first talk to a smaller conference where you might be able to talk to 50 people. Do that a few times and you start building up this portfolio of topics you can talk about and you become recognized for your expertise in a particular thing, and that's when you start getting the invitations to then, well, maybe do an invited talk at a conference, and do this over 10 years and yeah, you'll be that keynote at SREcon.
Kevin Yu: That is a great capture of the progression of knowledge sharing and increasing one's influence from the immediate squad to the org, the company, and the industry.
Kyle Brown: Exactly. There's a corollary to that that I also want to talk about that's also worthwhile, and it's like my secret sauce for technical vitality. Never publish something once. Okay, what do I mean by that? Okay, so let's say you discovered this cool new library in open source and you figured out this really nifty way of using it in your programs, or you figured out this cool new way of doing monitoring of something that you could do through your Ansible script. First thing you do is maybe you talk to your squad about it or you do a meetup. Okay, sure. Small group of people, get their feedback, do that. Well then take that same idea and write it up as a blog. You can go to Medium and you can write a blog in five minutes, or you can go onto LinkedIn and you can write a blog in five minutes. Let's say you do that five or 10 times. Now you've got this set of blogs on these cool ideas, or maybe you've run with this one library and you found three or four other use cases for it, and you've really described it that way. Now you do a longer form article or paper. Maybe you do this as a paper submission for a conference, or maybe you do it as an article to one of the online journals. Well, then you take that paper and you submit that as the abstract of a talk at something like SREcon. And then you do that several times and you take your talks and you take your papers and you know what? You've probably got enough to actually submit an abstract and a couple of sample chapters for a book. So you submit that to O'Reilly or to one of the other publishers and you now have a book deal. Well, guess what? I just outlined how you go from discovering something cool to having your first book deal. You just kept reusing the same text. That text that you wrote the very first time in that first blog is now ending up as part of your book.
Kevin Yu: Right, Kyle, the other big impact of that is you just slowly increase your influence. And I know going to IBM, that's a big part, right? "Hey, you're so good. How come we don't know about you?" So you just outlined that. Starting up, you're grading your squad then becomes your team, maybe teams you depend on and over time it's IBM. It's a whole community. It's a whole world, right? So I think that's a great capture, Kyle.
Kyle Brown: It is, and that's how we do our technical advancement within IBM: we're always looking for those increasing scopes. It really tends to come down to what's the size of the community that you're influencing, and the way you grow your career is by growing to successively larger and larger communities.
Kevin Yu: Wow. Kyle, you are totally right. We have to cover a separate episode or maybe even is for this. So Kyle, in closing of this podcast, I always love to ask guests to share their recipe for the omelet, and I thought this is very fitting because I believe you are actually an avid chef, and I still remember to this day while we were at the customer you were sharing with me, I don't know how it came up, but you shared with me how to make an eggs Benedict hollandaise sauce.
Kyle Brown: Yeah, that's where I introduced you to blender hollandaise and how easy it is to do hollandaise sauce with a blender.
Kevin Yu: I think that won me a few points with my wife for Sunday brunches, so thank you for it. So coming back to the SRE omelet, Kyle, what will be your ingredient and recipe for the reliability engineering omelet?
Kyle Brown: Well, the first thing you have to do is you have to begin with a wonderful plant-based base of automation.
Kevin Yu: I love it. And a tablespoon of what?
Kyle Brown: Because seriously, automation is at the heart of what it means to be a reliability engineer. If there is toil, if there is waste, the solution to that is almost always automation, but what you have to add to that base of automation is really a deep understanding, kind of like a sprinkling of software engineering discipline on top of that. And this is something that was interesting, I was just meeting with a customer the other day and they were talking about how their kind of current way of doing things is whenever there's a big problem to solve, they'll get all their SREs together and their SREs will just have all these scripts that they're going to run and things like that. And it's like, well, you can do that, but that's probably not the best way to do it. What you probably want to have instead is you want to have, first of all, a library of automations that you've already built up in something like GitHub, and then you want to have those be individual projects that you're running through your CI/CD process so that you know how you're controlling them. I'm just applying software engineering discipline to this. You want to make sure that they're being scanned appropriately, that you're not leaking credentials as part of your automation the way that we sometimes do as part of this. You want to apply a good sprinkling of software engineering discipline into your automation base. And then the last, that little bit of secret sauce you want to add at the end is you have to be able to make sure that you're applying a little bit of communication to the whole recipe as a result. Because so often what we see, and I run into this constantly inside the CIO, is we've solved problems too many times. You know what I mean? It's like you solve a problem and then you realize that this silo made up of the guys down the hall also solved the same problem.
Kevin Yu: We're not sharing.
Kyle Brown: That this other group down the hall solved the same problem. That's why communication becomes so important, and this is why this idea of having a common catalog and a common library of this and then having a community that supports this, using techniques like InnerSource, becomes absolutely important if you're really going to be successful as a reliability engineer, be it a systems reliability engineer or a services reliability engineer.
Kevin Yu: I love it, Kyle. So once you build that beautiful omelet, share it, create the recipe and share it so other people can also discover it.
Kyle Brown: Exactly, discover it and use it and enjoy it themselves.
Kevin Yu: Kyle, it's been an absolute pleasure. It's like a blast from the past. So glad to have you on the show with us.
Kyle Brown: Thank you very much. I really appreciate the opportunity, Kevin.
Kevin Yu: I'd also like to thank the audience for listening. This is Kevin Yu, Principal SRE at IBM Sustainability Software. See you on a future episode, and if you like, please listen to the bonus episode where Kyle shares with us his favorite omelet recipes and a secret ingredient to make everything taste great.
DESCRIPTION
This episode is about transforming SRE into how we do business and incorporating Reliability Engineering into the process of building products and features. Kyle Brown, IBM Fellow, VP, CTO for the CIO, joins us in this episode to talk about this mindset shift.
Kyle discusses how Reliability Engineering can be considered across the entire solution lifecycle, how to get there, and what it looks like. Kyle also shares his insights on how engineers can progress in their careers and what Reliability Engineering will become in the future.
Lastly, as an avid chef, Kyle also shares with us a recipe for a real Omelette in a bonus episode!