Episode 9 - SRE Feature Delivery

This is a podcast episode titled Episode 9 - SRE Feature Delivery. The summary for this episode is:

SRE was created by Google to balance prioritization of features needed for reliability and operations against others, like that shiny new product feature. However, just because we say we embrace SRE culture, mindset or have an SRE team, doesn't mean this problem is magically solved.

Ron Baker, IBM Distinguished Engineer of SRE Operations in Sustainability Software, shares how he successfully negotiated and drove this balance and delivered on SRE features.

Timestamps:
[00:00 - 00:59] Intro to the episode
[01:17 - 02:14] What SRE means to Ron
[02:28 - 03:30] Ron shares what success in SRE looks like
[04:02 - 08:44] Solving the prioritization problem
[09:01 - 10:31] Connecting risk with emotions
[10:53 - 14:04] How Ron drives SRE features to reality once prioritization is tackled
[14:32 - 16:52] Words of wisdom for those in SRE professions
[18:00 - 20:24] Where SRE needs to go to stay competitive and help solve problems
[21:33 - 21:53] Ron's ingredient and recipe for the SRE Omelette

Ron Baker: You've got to win the priority war, and it is a war because there's so much that is competing for your limited resources that you have to win that priority.

Kevin: Hi, everyone. Welcome back to another episode of the Making of the SRE Omelette podcast. SRE was created by Google to balance prioritization of features needed for reliability and operations against others, like that shiny new product feature. However, just because we say we embrace SRE culture and mindset or have that SRE team, doesn't mean this problem is magically solved. Joining us today to talk about how to successfully drive this balance and deliver on SRE features is Ron Baker, IBM Distinguished Engineer of SRE Operations in Sustainability Software. Welcome to the show, Ron.

Ron Baker: Thank you, Kevin. A pleasure to be here.

Kevin: Hey, Ron. This show is about understanding the business impact of SRE and the culture to achieve that outcome. Can you start by sharing with the audience what SRE means to you?

Ron Baker: Yeah, absolutely. It's been interesting because I didn't come up through the SRE or operations side of the house. I came up through development: traditional architecture, coding, testing, and delivery from that side. So this move was an opportunity to learn that side, appreciate it more I think, but also to bring a development point of view for a transformation from more administrative activities to how do we automate more and really implement SRE as an engineering discipline, but also for my connection back into the development world and how to integrate those two environments better. So it brings a little bit different perspective and definitely a better appreciation for the changes we have to make on the development side to get closer together.

Kevin: That's a great perspective with the lens of development. Essentially, Ron, you brought the E to SRE. So what does success in SRE look like to you, and how would you measure that?

Ron Baker: So to me, it's the kind of operations KPIs that we've put together that we really try and drive. A lot of that is driven by quality for the customer, so that we obviously see the availability of the system running better, the reliability of course from the name, but also how usable is the system and is it meeting the business objectives. One of the things that I think the operations side really has a better handle on in some cases is how's the customer using it, is it doing what they want, and are they reporting happiness or problems with it? And so getting that feedback into development as a single organization I think is one of the big success measurements for us. So I really enjoy learning about service level objectives and really viewing them more from a business perspective: not just is the thing up and running, but is it really meeting customers' objectives for why they want to buy the software and get value out of it.

Kevin: Speaking of development, I also came from that side of the house, as well as consulting. And I recall it was not like we didn't care about performance and reliability. In fact, they were often captured under NFRs, non-functional requirements. And here lies the problem. When it comes down to prioritization, NFRs often got left behind and were treated as "we'll fix it later" or "fix it after GA." Do you see that problem as well? And if so, how have you solved it?

Ron Baker: I definitely see this problem. I'm old enough to remember the times when you had very long release cycles and you were primarily focused on on-premises software. And so all the things we do now on the ops side of DevOps, customers were doing. And so they couldn't upgrade every month, right? Because it was a cost and a process for them. And so as we've moved to cloud-based hosted systems, SaaS systems, et cetera, we see a lot of that cost transfer back to us. And of course the modern approaches we're seeing are making a difference. And yes, these investments, we really have focused a lot in the industry on speed to market. And so there's a certain level of risk that gets associated with that speed, with the premise that we'll come back and fix it later if it's successful. And sometimes that happens and sometimes that's pretty slow to happen. And so those things require investment, and it's easy to say that and then assume, well, we'll go to management and say here's the return on your investment, so give us more money or give us more head count. I think everybody in the industry realizes that getting more head count is usually pretty difficult, especially in more mature companies that aren't growing or that aren't in what we call a horizon one market, where you're over-investing for growth. So if you're not going to get more money or head count for the investment you need, you've got to win the priority war, and it is a war. There's so much that is competing for your limited resources that you have to win that priority.

And so the way we've done that, and again this is where coming out of development just recently probably helps, is we focus on three areas. One is education on the risk. It's easy for people to overlook the impact on customer service until you're really seeing a lot of complaints escalate. And so you have to educate people on what's the risk of not having some of these pieces in place during operations. And sometimes it can be immediately obvious and sometimes it can be a slow death by complaints, and your customers start to drop off because you're not providing the level of service that you need. So really being able to summarize that quickly is important. And we have a tendency in the industry to give too much technical depth on what's going on and lose people. So you've really got to hit their emotions, not their intellect. If you're familiar with the famous book Switch, they talk about the intellectual side and the emotional side, and this is where you have to be able to educate people on the emotional side of that and get them to understand so that they change. And so they change the investment priority to focus on these things.

The second area that I then focus on is margin. Because if you're going to maintain a profit margin, a gross profit, then you've either got to raise the price, which sales never likes and customers never like, or you've got to reduce the cost. And so cost reduction from the SRE point of view is pretty important. It's an ongoing question: can you constantly look at productivity gains and cost effectiveness in perpetuity? And so if you can focus on margin and show why these priority changes you've just educated them on will affect your margin, it really makes a difference in getting those priorities in as well.

The third area then is to really create a vision of that goal. Now you've clearly articulated the why, here's the vision, and then stay consistent with it and measure the progress and keep that progress and that vision, that goal, front of mind on a regular basis, because most of the things you want won't happen in a single release or a single sprint or a single epic. There'll be a series of things that have to be constantly reminded and shepherded. So it's a marathon, it's not a sprint. So those three things are really the things that I focused on and have had success with. But remember, success is not a "you did this and boom, you're done," right? Success is measured over a two-year period.

Kevin: On the art of SRE feature negotiation, I really like the part you spoke of about not being so technical in your arguments, but touching on people's emotions. Ron, can you share a few examples of how you connect that risk with emotions?

Ron Baker: Risk is not as easy to see a lot of times, because if the operations team is getting pounded by constant manual effort in some area, you don't immediately see that from, let's say, the sales or the product offering side. And so you really have to put together, I like to put together, a storyboard. We have a very large design organization at IBM now, over the last five to 10 years, and they've had a big influence on me as far as putting together a storyboard that shows here's what an operator is typically doing right now. So I try to put our offering teams and product management teams in this position of here's what your operators are going through right now to satisfy a customer request. And so this is what your customer is seeing. We're not finding this error ahead of time because you didn't put enough monitoring in place. And so the customer is finding this error and calling us. And so the mean time to resolution is stretched out, and the customer is frustrated because they're seeing more of these errors as opposed to you finding them and fixing them before they have a chance to see them. So it's those kinds of storyboards where you try and put them into that position, either the customer's position or the operations position, to say this really needs improvement and we're really running at risk. We're risking the reputation of the product and of IBM and our revenue by not investing in this.

Kevin: I will also echo how valuable design-driven engineering is to SRE. It is a great way to capture the personas and use cases to make sure we solve the right problems for the right users. Ron, can you take a minute to share how you drive SRE features to reality once you successfully win the war on prioritization?

Ron Baker: Yeah, that's part of the vision. The vision is, all right, how do you prioritize what you want them to do? So the other thing is there's a big laundry list of stuff. In order for you to transform, you still have to internally pick what are the most important things and why, and show that so that they can invest. And this is where actually you have done a lot of work with what we call the SRE scorecard here at IBM. And that's helped enormously in letting us say, look, these are the things we want to address first. These are the things we want to address second. We want to move you from the maturity level where you are to something that's more mature. And the nice thing about that is also you get to an "okay, we've started to reduce risk by getting to the level we need to," but then we've actually got more advanced levels where we can say, all right, now that we've kind of gotten rid of some of the risk, now we can even drive margin further as we scale up and we can provide scale for customer growth. You have to really lay that out. And then in our case, you have to know the process. It's very important that you get epics defined by the product management team, then you get those epics laid out into actual sizings and estimates and sprint plans, and then they make it through the scheduling process, and then you follow them all the way through the testing and everything. So there's really a lot of that blocking and tackling of whatever your process happens to be. In our case, it's a very specific set of tools and a specific set of meetings that occur, and you have to make sure you're guiding it through that to make sure something doesn't fall out.

Kevin: So vision translates to an execution roadmap that is captured in tooling like [inaudible], and really shifting that work to the left. So it is not just given to the SRE team to execute, but engineering will prioritize it alongside all features.

Ron Baker: Yeah. Because I think we've all been in the situation where if you're not ahead of that curve and you come in at the wrong point in the development process and you say, "Hey, we've got to have this thing that we talked about," they say, "Well, if you'd been here a month ago or two weeks ago, we could have done it, but now we've already got our plans locked and loaded and we're in the middle of writing all the code." So yeah, the timing becomes really important and that's where the shift left is important. It's also important on the measurement. You create a very simple-to-understand metric and you show it with pretty green, yellow and red colors to upper management who don't have the time to dig into the details. And the other thing that I've learned is don't show them what is promised in the next release. Show them what's been tested and verified. Until you do that, they'll assume something is done when it actually isn't done. So that's also a lesson I've learned.

Kevin: That is great advice. Don't talk about promises, show outcome and impact. That is also a perfect transition to the next segment, technical vitality. A big purpose of this podcast is to capture advice for current and future site reliability engineers. Ron, any words of wisdom you would give to practitioners and perhaps managers and leaders in support of SRE?

Ron Baker: Skill-wise, it's an ever-changing set of tools. So in one sense, for skills, you look for people who are problem solvers, who are creative and flexible, and who these days can deal with chaos and change. So those are kind of the fundamental personality traits you're looking for. You can always train people on the individual tools, a particular database or a particular monitoring tool or things like that. But even those change in this age of specialization. So keeping people educated on those is important.

The recommendation I have for managers is, and this is an industry where we really manage by data, right? KPIs, having good internal KPIs that you put together, and making sure the data quality is good. But for managers, this is interesting because here's where I don't like a simple summary chart with green, yellow, red. I don't believe KPIs should actually be managed that way. KPIs and measuring yourself, what's my mean time to failure or what's my recovery or my MTOs or those kinds of things, those should be a way of having a conversation. The whole point is to have a conversation about what the data is telling us, not just what the colors are in the chart. And too many times management is in a hurry: everything is green, I'm good, don't need to know anything more. If it's red, you have a problem, tell me how you're going to fix the problem. And to be honest, that's not what those should be for. They should be for, all right, we had numbers dip, what happened in the situation, what's the process? Is this an outlier that we just never thought about before and we can invest in automating, or is it such an outlier and the automation so expensive that we'll just live with the outlier? I mean, those are the kind of conversations that should happen when you're talking about processes and continuous improvement. Not just the simple red, yellow, green. So my biggest advice for managers is use them as tools, not as check marks or approvals.

Kevin: I love that. And that really hits on a core trait for practitioners in this space, which is having that empathy. So by having that dialogue you spoke of, it drives us to understand the context versus just what the outcome was.

Ron Baker: That's a very good word, yes. Empathy, right? Because if you've hired good people and you've trained them, then it should never be an "okay, what did you do wrong? How are you going to fix it?" It should be "what's wrong with the system?" And this goes back a long time in manufacturing. You've talked about W. Edwards Deming, for example. And so it is a sort of human-nature behavior trait that you constantly have to overcome and remind people about.

Kevin: Right. Those are definite reminders we take into every incident learning session. So Ron, let's look into your crystal ball, into the future. Where do you think SRE needs to go to stay competitive and help solve problems to come?

Ron Baker: So to me, the biggest area, and I look at these when I look at, let's say, Kubernetes operators, we see this in our OpenShift product. We see it in some of the automation tools as well as analysis tools and in AI models. With this combination of things, we need to design the SRE operations side of the house and architect it just like we do the actual product, for advancement in where it's going. For the first time, we're starting to see enough information being gathered through the entire stack that we can integrate it together and we can start to look at, are we running this the most efficient way possible? And I'm not talking about doing a stress test for a couple of days and seeing that, right? You start to see this in the more mature born-on-the-cloud, large-scale SaaS offerings that kind of have to do this in order to scale to the levels they've got. And that is to be able to automatically not just scale up and scale down, but also evaluate, am I using the minimum number of cores to still satisfy demand? When I see seasonal demand or just a recent phenomenon, can I predict that better? And so this insight from the data that we're gathering on the SRE operations side, and then putting AI models to that to make better choices, that's going to be interesting. And so a lot of the things that we have automated now, we're starting to see get embedded into these Kubernetes operators so that they can actually be delivered with the product. So now it won't be just the SaaS world or our cloud world, for example, that has these optimizations in scale. You can even see a hybrid cloud application that might be hosted on premises for data residency or privacy issues or something and still be able to leverage these operations automations and insights without having to build them all with your own IT organization. That's going to be an interesting opportunity when we start to see those higher levels of maturity and the operators being able to do that.

Kevin: That sounds quite exciting, and I appreciate the perspective you gave on taking what we are doing for our own SaaS operations to our customers on hybrid or on-prem.

Ron Baker: And think of the global cost savings associated with that, right? Instead of hundreds of IT organizations doing the same thing or duplicating that same effort, you start to reduce that. And of course there's a sustainability and energy savings associated with that as well. So we're not only seeing cost savings, we're starting to see a really big uptick in how do we do these operations most efficiently from a carbon footprint point of view.

Kevin: We're going to save that for another podcast.

Ron Baker: Sounds good, sounds good.

Kevin: But for now, I'd like to thank you, Ron, for taking us through the journey of SRE feature delivery and the art of negotiation. In closing, I always like to go back to the inspiration for the podcast. Ron, what would be your ingredient and recipe for the SRE omelet?

Ron Baker: All right. I'm going to say education, focus, vision, measure, and goat cheese. I don't like an omelet without goat cheese. I can eat it, but goat cheese really adds the final flavor to it.

Kevin: That's awesome. Thank you so much, Ron.

Ron Baker: Yeah, my pleasure.

Kevin: There you go, ladies and gentlemen. The SRE ingredient and recipe from Ron Baker, IBM Distinguished Engineer of SRE Operations in Sustainability Software. Thank you for listening. I'll see you again on a future episode.

Today's Host

Kevin Yu

Principal SRE, IBM Sustainability Software

Today's Guests

Ron Baker

Distinguished Engineer, IBM Sustainability Software