Episode 1 - IBM SRE Profession
Kevin Yu: Welcome to the Making of the SRE Omelette podcast, the show where we explore the positive business and client success outcomes from site reliability engineering and hear from experts on how they influence the culture and mindset shift that led to those results. I'm your host, Kevin Yu, and I'm the principal SRE at IBM AI Applications. What is SRE, you ask? SRE stands for site reliability engineering, and it originated at Google to treat operations as if it's a software problem, with goals of improving system reliability and scalability. Over the years, many organizations like IBM have also embraced the discipline. In my experience, SRE is a mindset that champions the culture of Agile and is an engineering discipline that is data driven and KPI focused. It applies the lifecycle approach towards building resilient solutions that meet business goals and delight clients. We're kicking off the podcast series with our very first episode today. I thought it'd be great to start it off with understanding IBM's journey to establish site reliability engineering as a job family, and there's no better person to speak to it than our guest speaker today, Ingo Averdunk. Ingo is a distinguished engineer of service management and SRE, IBM Garage for Cloud, and is also the worldwide SRE profession leader in IBM. I have had the privilege to work with Ingo in the IBM SRE profession and on our first-ever SRE conference at IBM last year. And I must say, his passion for the subject is inspiring. He's a fearless leader of the profession and pushes the community to continue to surface tangible outcomes that benefit the company, customers, and its practitioners. Welcome to the show, Ingo.
Ingo Averdunk: Yeah, good to be here.
Kevin Yu: So Ingo, let's get right to it. This podcast is about the culture of SRE, its impact, and how to make it happen. What does SRE mean to you as a practice and as a profession?
Ingo Averdunk: Yeah, Kevin, that's a great question, and I need to reflect a little bit. I think I've always been in a lab services capacity for most of my professional life, working with clients, and I think it was always important for me to see the use of services. It's not that customers buy a product or deploy a project. They want to take advantage of it, so it's obviously the day-two experience that was close to my heart, and I'm super excited to see SRE now bringing that profession to a new life, much more visible, so it's no longer like the lights-out organizations out in the back. It's now become really visible across the entire SDLC. And I think there's a term for this. We are moving away from projects to products, and I think that's exactly what SRE is very much in support of: understanding the full life cycle, understanding the "-ilities", whether it's availability or security or durability and so on, so it means a lot to me.
Kevin Yu: Great capture. Site reliability engineers are keeping the lights on, so let's recognize them and make the discipline more visible across the entire solution life cycle. And I think it brings up a parallel topic. Many of us have been practicing SRE, and have had SRE as a job title and role, even before it was established as a profession at IBM. Can you share with us the significance of IBM formally defining SRE as a job family?
Ingo Averdunk: Sure. And I probably need to explain the background a little bit. I have done service management all my life, but six years ago, I was asked by our general manager to think about how operations need to evolve. We've been doing the Garage for a while and we've always been very successful in creating this new application, this new website, and so on. But then, we hit this glass ceiling when it gets to production, because operations, this happens, always say, "No, we cannot deploy, and what do you mean by deploying it, like twice a week?" And at that time, we said, "Okay, so just modernizing applications is not enough. We also need to modernize operations." So I've been asked to provide a point of view from IBM on what modernized operations means, and certainly, SRE is heart and center of this. And I think it's super relevant too, as customers are adopting new paradigms, whether it's Agile or DevOps, that there needs to be a recognition of this lifecycle, the lifetime of a service. And this is where we created new roles which are more cloud-oriented. SRE is then really the evolution of a sysadmin role, a blend of a developer and a sysadmin, and the need for this is very inherent. A lot of customers are asking for it. We need it in IBM internally. We are moving away from software on premise into a software as a service function, so I think all those are just supporting this need for this new role, this new definition of operations. And as a way to manifest it, to create some visibility in the enterprise, we established this profession for SRE. And profession very much sounds like an enterprise, but it's really having a home, having a curriculum, a roadmap, but also a career path, so that people are finding a home in that profession.
Ultimately, we want to attract, we want to nurture and retain the right skill and the right talent, and this is where we created this new profession, and we've been having professions in IBM for a long time, specialist developer, architect, data scientists, and SRE is just the newest member of that family of a technical profession.
Kevin Yu: Thank you, Ingo, and I can personally appreciate having that home. I have, for the most part, been in the application performance, scalability, and reliability space for my career, and I have always felt I have to explain and justify my contributions. Having the SRE profession certification definitely can help make the career path easier for future generations.
Ingo Averdunk: Yeah, and it also gives a formal recognition that this is not just a small thing; it's big. There are not that many technical professions. I think we have maybe 10 technical professions overall in IBM, and SRE being one of those, I think, gives really a huge recognition to the need for that new way of working.
Kevin Yu: Right. And I'd like to take this opportunity to thank you on behalf of the community for championing this site reliability engineering profession in IBM. So Ingo, we touched on the motivations from an executive, practice, and now technical vitality perspective. Can you share a perspective from our customers on SRE?
Ingo Averdunk: I think we have been struggling in the past, and the same is true for our clients, predominantly enterprise clients. They live and breathe ITIL, the IT Infrastructure Library, which had good reasons at that time. It was very process heavy, it did a lot of risk mitigation by doing change advisory boards and by scheduling changes and having a CMDB and so on. And I would argue it probably went a little bit overboard, and a little bit to an extreme. And now, those customers are struggling, because now, the agility from the development teams is absolutely slowed down by those legacy IT processes. So customers are asking for this, and they see some great results by, on one side, pivoting towards a DevOps culture, on the other side, pivoting to an SRE culture. I think that really allows them to balance velocity and quality at the same time.
Kevin Yu: That is a great summary of the evolution of balancing the roles of development and operations, and as you said, between features and quality, which brings us to a struggle many of us have had in this space, that is, prioritization of the limited resources we have. In your experience, Ingo, what has been the motivation to drive or prioritize for SRE?
Ingo Averdunk: Yeah, I'm not sure if there's a single response to this. When you look solely from an operations perspective, there was a time when people did outsourcing, offshoring, nearshoring. Then, there was a next wave of automation, and this was all to cut down cost, to be more efficient, more effective. And I think the consequence of this is that some other parameters have suffered. My former boss had a saying, "Measure what you expect, expect what you measure," and if you only measure for cost, then you might suffer when it comes to quality or when it comes to velocity. And I think this is really what I see customers looking at now: they've reached maybe a certain plateau of automation. They now see the negative impact of just going for automation, and they want to do something on top of it. And I always use this analogy: if my application server has a problem, it dies every Monday, certainly, I could do automation to restart the application server every Monday. Then, I'm efficient and I'm effective. I have no toil, because it's basically executing automatically. But the problem still exists. And worst case, even the user of that service is affected, because he's experiencing a short but still existing downtime every Monday. And the SRE approach to this would be, well, let me find out what the contributing factors to that outage are, and let me engineer this away, so reducing the technical debt, engineering the problem away. The more mature enterprises realize that automation in itself cannot be the end goal. It might be a good tactical solution, but you really want to engineer this problem away. You want to reduce your technical debt.
Kevin Yu: Yeah, that's a great example, Ingo. In fact, many times, we draw a parallel as SRE is really like the firefighters. So I think in your comment there, is that we don't want to just be really good at putting out the fires, but we also want to ask a question, how did the fire happen in the first place? Yeah.
Ingo Averdunk: Yeah, exactly. Avoid it, right?
Kevin Yu: Right. You also touched on the key word of cost, and relating to prioritization of resources on SRE, in a way, what is the ROI, return on investment, because as much as we love SRE and all its wonderful motivations, there are limited resources of people and time. Could you please share with the audience your experience around qualifying the ROI for SRE?
Ingo Averdunk: Yeah, and I think this is where we are still finding ourselves as an industry. Certainly, automation, like I mentioned before, is a key measurement, like how much manual work, how much toil do I still have? Another performance indicator would be MTTR, mean time to repair or restore, recovering from an incident. But I think this is also getting to the point: if you only measure MTTR, I would be okay, to stay with the example before, just to restart it, because the MTTR would be maybe a minute or, I don't know, five minutes to restart the server or the application, so that's acceptable. If I would engineer this problem away, that scenario would not happen. And by default, the harder problems still exist, so my MTTR is actually going up, so you need to really find good performance indicators that measure a little bit the velocity, the quality, and the efficiency. And when you look at the likes of DORA, they favor four parameters. One is the deployment frequency: how frequently can you deploy, smaller chunks deployed more frequently. What is my release quality, so how many bad releases do I experience? MTTR, certainly. And then, the last one would be time from innovation to production, so how long does it take from an idea until it materializes for an end user? So those are the four KPIs that are typically used, and with this, then you could derive the ROI. I would be very hesitant to define the ROI of SRE just as an automation lever, because then, from my perspective, you are just thinking too short. You would potentially not tackle the larger problems, the bigger, more fundamental issues, which have much heavier reach.
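[Editor's sketch] Ingo's point that MTTR alone can mislead is easy to see with a little arithmetic. The snippet below is only an illustration under assumed numbers: the five-minute weekly restart comes from his example, while the single 15-minute "harder" incident, the dates, and the function names are hypothetical and not taken from any IBM tooling.

```python
from datetime import datetime, timedelta


def mttr(incidents):
    """Mean time to restore: average outage duration across incidents."""
    durations = [end - start for start, end in incidents]
    return sum(durations, timedelta()) / len(durations)


def total_downtime(incidents):
    """Sum of all outage durations, which is what users actually feel."""
    return sum((end - start for start, end in incidents), timedelta())


monday = datetime(2024, 1, 1, 9, 0)  # assumed start date (a Monday)

# Scenario A: the server dies every Monday and is restarted automatically.
restart_only = [
    (monday + timedelta(weeks=w), monday + timedelta(weeks=w, minutes=5))
    for w in range(4)
]

# Scenario B: the Monday crash is engineered away; only one harder,
# longer incident remains in the same four weeks (assumed 15 minutes).
engineered = [
    (monday + timedelta(days=10), monday + timedelta(days=10, minutes=15))
]

for name, incidents in (("restart only", restart_only), ("engineered", engineered)):
    print(name, len(incidents), mttr(incidents), total_downtime(incidents))
```

In this made-up data set, MTTR looks worse after the fix (15 minutes instead of 5) even though there are fewer incidents and less total downtime, which is why a single indicator is a poor basis for ROI.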
Kevin Yu: That is a great capture. Thank you so much, Ingo. So my takeaway is, do not look at SRE ROI based on a single lever or perspective, and I think this is a classic case from data science, that a single metric may answer a question, but you need a set of them to tell a story and form actionable insights. So we covered motivation and ROI from organization and customer perspectives. Let's spend some time on the perspectives of site reliability engineers, or as you put it, Ingo, if we look at it from a solution life cycle angle, what is a good motivation to get people with different roles to look at SRE, and what mindsets should they have?
Ingo Averdunk: I would think the best motivation is a desire for engineering rigor, so not being shy of getting their hands dirty, being curious about what happens, digging deep in order to find out what's wrong, and then really applying engineering rigor, maybe even a little bit of science, to then identify lasting solutions to this. So it's really this entrepreneurial, engineering kind of mindset that drives people. But I think it's also important to consider SRE as a team sport. This is not just a single person, it's always a team that makes things work, especially when you are in the capacity of incident response. Knowing that there is somebody else who has your back, knowing that you're not alone at two o'clock in the morning on a Saturday, I think is quite important. So engineering rigor and team player, team spirit, I think those are two fundamental elements here.
Kevin Yu: Thank you, Ingo. You captured why we enjoy this profession so much, and I think you may have inspired some listeners to look at this profession. What would you say to listeners who may want to explore the SRE profession? Can you give some advice, both for an experienced professional as well as someone who may just be starting their career?
Ingo Averdunk: So SRE can be quite broad. We talk about observability and automation and we talk about capacity management and we talk about incident response and blameless postmortems and so on, so it can be really, really broad. So as a starter, I would say, don't try to do it all, because chances are, you're probably not really good at it. I would encourage you to pick one area and try to master it, and go deep, to gain confidence that you can go deep. And deep could be all the way to the TCP/IP stack or into the operating system, or those mechanics, and that gives you then the confidence that you can go deep. You build the muscle of being able to decode the system, and from there, you can then go a little bit broader and then say, "Okay, let me tackle another slice in the software development life cycle, another element in my tech stack, looking more holistically end to end." And I would probably favor, as a beginning, first the ability to go deep, to gain that confidence, and then go broad. Some people think, I want to first go broad to have the full end-to-end picture, but there's always a risk that you just stay too shallow. And I think you need to gain that confidence, at least from my perspective. I certainly gained a lot of confidence by doing really heavy coding and doing heavy capacity management, and I even wrote an extension to some TCP/IP protocols, and that confidence is something that I could always draw back on in future engagements. For the person who is already in their professional life, I think I would probably ask you to start thinking: are you coming more from a sysadmin perspective, or are you coming more from a development perspective, and knowing where your strength is and knowing where your gaps are.
And for a sysadmin, I would certainly encourage you to start coding in a software engineering practice, so GitHub, pull requests, pair programming, test automation, all those elements that our developers are doing these days, and try to build that muscle. And vice versa, as maybe somebody coming from a development background, start thinking what it would take to operate your application without a debugger, without access to the source code, relying just on metric data, participating in root cause analysis, and see how the system is actually behaving in a live production environment, and then, from there, learning more and more about the traits of a sysadmin. The idea here is that you get to the perfect mix between DevOps and SRE, and then move up the ladder.
Kevin Yu: Now I know who to call when we have our next network disruption. I also like your comment about having the core and deep skill as something you can go back and rely on. For me, that is performance and benchmarking, and definitely, time after time, those skills have helped me solve problems over the years. So Ingo, we touched on the SRE mindset, the culture, ROI, and its evolution. Where do you think SRE is going?
Ingo Averdunk: SRE is evolving. When you look at it, it started just in a cloud-native environment. And now, I think it certainly is arriving at the enterprise, and this is not necessarily all cloud native. We even have some customers who don't have cloud yet, and they are applying some of the SRE principles on this environment, and that's completely viable. The next flavor would be commercial off-the-shelf products like, I don't know, SAP or Salesforce. You also need to think about how you could perform SRE against those environments. I certainly think that DevOps and SRE are getting closer and closer together. They're always very complementary, and one could argue, one is a class of the other or an instance of the other, but I think still, there's room for improvement in getting those two practices closer together. Those are the things that I see. Maybe just in closing, SRE, to me, is both the site reliability engineer, the person, but also the engineering practice, and this is where I think it will start to arrive across the entire software development life cycle. From a product owner perspective, you need to decide, what are my SLO and SLA requirements; from an architect perspective, to build reliability into the code. A developer should instrument the code with metrics and tracing and those kind of things, circuit breaker patterns and so on; a release manager practicing various release strategies on it; and then the SRE performing incident response and problem management. So it's really a team sport across many, many roles, and there's still opportunity for improvement to make it more pervasive in those other roles as well.
Kevin Yu: That is a fun future to look forward to, further reinforce SRE on hybrid cloud and dependencies, as well as SRE as a discipline that is practiced by people in different roles in the solution lifecycle. So thank you so much for the wonderful insights you have given us so far, Ingo. I just have one more question, and that is, what would be your ingredient and recipe for the SRE Omelette?
Ingo Averdunk: Oh, man. So I thought about this. You sent me, luckily, the questions ahead of time, so I thought about it. There's probably not just a single recipe. Well, I like to learn and listen and read as many recipes as I can and take what resonates for me, what works for me, to build up my own recipe, so it's not just Ingo making it up. For my own recipe, I want to draw on the community, and I think SRE is a stellar community. When you look into the newsletters, the books, the conferences, the meetups, it gives me just a vast repository of recipes. I'm getting better every single time. In every single client engagement, in every single product that I engage with, I get better and better in doing SRE, which is great, similar to a cook. The more you cook, the better you are, and you're never done. There's always something else you could do on top of it. There's maybe this one additional ingredient that you put in your omelette to make it taste even better. So those are the three things: different recipes and learning from others, getting better over time, and never being done. Those are my three ingredients of my SRE Omelette.
Kevin Yu: That is awesome, and I can almost see a cooking show coming out of this. So there you go, ladies and gentlemen, the ingredient and recipe for Ingo Averdunk on the SRE Omelette. Thank you, Ingo, for spending the time with us, sharing the wonderful insights, and as you said, contributing to this wonderful community we have so we can all learn from each other and deliver the outcome we strive for.
Ingo Averdunk: Oh, happy to. Thank you.
Kevin Yu: I'd also like to thank you all for listening. This is Kevin Yu, principal SRE at IBM AI Applications. See you again on an upcoming episode.
DESCRIPTION
Ingo Averdunk, IBM Distinguished Engineer and SRE Profession leader, takes the audience through the journey and motivation to establish SRE as a job family in IBM. Ingo also captures the SRE mindset and the steps practitioners (new or experienced) can take to practice SRE.