Episode 6 - Google SRE
MP English: It's a feature of scale. The SRE is saying, "How do I maintain this user experience? How do I keep everything running? How do I let all of the changes happen in a controlled way such that nothing ever quite breaks, or if it does break, we can catch it without it being a problem?" And it just turns out it's its own discipline and its own art.
Kevin: Welcome back to another episode of Making of the SRE Omelette Podcast, where we talk to experts on the recipe leading to SRE success. Today's episode takes us to the origin of SRE: Google. And you're in for a treat. With us is the host of Google's SRE podcast... I'm sorry, Prodcast, MP English. Welcome to the show, MP.
MP English: Hi, Kevin.
Kevin: MP, can you start off by sharing with us your role at Google?
MP English: I would say I am what you would think of as the typical in-the-trenches SRE at Google. I've been at Google for five years and I am on the Play infrastructure SRE team. So we support the whole slew of binaries that create the Google Play Store experience, mostly for Android phones. So that has a lot of different moving parts and it's actually a fairly challenging service to be an SRE for because there's always a lot going on and we interact with a lot of different shared services at Google and are built on a lot of different technologies that exist at Google. So it's a real jack-of-all-trades type of SRE experience.
Kevin: That sounds exciting. I guess every on call is a great learning opportunity.
MP English: Yeah, very much so.
Kevin: So MP, let's get right to it. We've all heard and read Google's definition of SRE: that SRE is what you get when you treat operations as if it's a software problem. But how would you describe what SRE is at Google in your own words?
MP English: I would really say that it's a feature of scale. When you have smaller systems, you can have that typical systems administrator role, but then when you get to things like these globally distributed systems, you get to the scale that there are these emergent phenomena that you can't necessarily predict and you have to be prepared to reason on the spot about how the system is behaving. And it turns out that all of that and just running a production system, how do you safely make updates to something while the users don't know it's happening? How do you resolve this? You really need to be able to go an extra layer because you can't just think, "I am going to do this to the system and it will have this effect," you need to also think about all the knock-on effects of anything you do. And at a certain point the systems are so complicated that the people that build the applications, the software engineers, and the people that run the applications, the SREs, actually end up basically having entirely different skill sets. Yes, we both speak this language of code and that's really the shared language between SRE and software engineering, but the disciplines are very different. One is focused on, "How do I build this application that does a specific thing? How do I get this next new feature out? How do I build this user experience?" The SRE is saying, "How do I maintain this user experience? How do I keep everything running? How do I let all of the changes happen in a controlled way such that nothing ever quite breaks, or if it does break, we can catch it without it being a problem?" And it just turns out it's its own discipline and its own art.
Kevin: I love it. The art of SRE. Developers and SREs have specializations. Developers may help you with the light bulb but it is our SRE's job to keep the lights running.
MP English: Mm-hmm. They build the light bulb, we figure out how to change the light bulb while the light bulb is still on.
Kevin: That's a great capture. And how would you say SRE has evolved over time in Google, from the problems it was solving to how things are organized?
MP English: There are two big trends that I would say I have seen. One is moving towards less direct touch to production, getting humans out of production as a theme. And that actually ties in really well with blameless postmortem culture, because you're trying to drive more and more towards a state where if a human does something and it breaks something, that means the systems in place that allow the human to do that are wrong or aren't where they should be. And it's trying to systematize interactions with production and make things more automated and lower risk and less likely to cause unintended outages actuating on production. And a parallel trend, and these really actually synergize with each other, is more movement towards shared solutions.
Kevin: I really like that. Actually, in fact, when I do our postmortems at IBM, one of the root causes I encourage us to eliminate is that a human was the cause of the problem, and I like how you captured it. If a human mistake led to a failure or disruption, it is the system in place that is to be corrected and improved.
MP English: I do think it's interesting that there is this little bit of a catch-22 where sometimes what's interesting, with some of the automation, is it will occasionally say, "I need a human to confirm this." So it actually has this feedback loop of the human makes the change and then the system goes and attempts to do it and is like, "Hold up. Do you really want to do this?" And it makes a human come in and say, "Yes, actually." But it's really this opportunity to really scrutinize what you're doing and make sure you're really, really sure it's the change you actually want to be making. It has the potential to be extremely risky and it cannot determine how risky it is.
Kevin: I think that goes back to how AI is good at coming up with those insights, since that's very objective, but humans are still best at making that kind of judgment call.
MP English: I actually have no idea how it does it. I just respond to the tickets where it tells me I'm [inaudible].
Kevin: You got it right. We humans don't need to know in that moment in time. But the AI was able to give us the safeguard against a potentially harmful decision, likely learned from some previous disruptions. Actually, in fact, one of my previous guests, Jerry Cuomo, who is IBM's CTO of automation, gave me an analogy. The context here is I'm Canadian, so hockey is important to me. He said, "Wouldn't it be great if AI was able to say, 'Kevin, if you check in this code, you can say goodbye to the hockey game you wanted to go to tonight because it is going to cause an outage.'"
MP English: But that's actually the thing is you should be able to submit bad changes and not have them be catastrophic. Because there should be enough rigor in place that from the time it actually is having multiple environments so that there's a long stage before you submit something and it'll actually have to go through many iterations before it actually makes it to production. I think that's one of the other roles of SRE, and at Google in particular, is making sure there is opportunity for safe failure.
Kevin: I love that. Yeah. Canary...
MP English: That, I think, would probably be one of the biggest tools in the SRE's toolbox: "How do I canary this change? How do I make sure that I am not just throwing this change at production and hoping that it works?" Because one of the SRE mottos is, "Hope is not a strategy."
Kevin: Another great example of the SRE mindset: how to mitigate damage by leveraging observability in the CI/CD pipeline so we can quickly know when something negatively impacts the SLO and can stop the release to further contain and mitigate the damage. And I guess, MP, as you may have alluded to before, how would you measure the success of SRE, or the return on investment of the time and effort spent on SRE in Google?
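[Editor's sketch] The canary question MP raises can be illustrated with a small check that compares a canary's error rate against the existing baseline before a rollout continues. This is only a minimal sketch under stated assumptions: the function name, the `max_ratio` and `min_requests` thresholds, and the error-rate floor are all hypothetical and are not a description of Google's tooling.

```python
def canary_is_healthy(baseline_errors, baseline_requests,
                      canary_errors, canary_requests,
                      max_ratio=1.5, min_requests=100):
    """Return True if the canary's error rate is not meaningfully worse
    than the baseline's. All thresholds here are illustrative assumptions."""
    # Too little canary traffic to judge: do not proceed on hope alone.
    if canary_requests < min_requests:
        return False

    baseline_rate = (baseline_errors / baseline_requests
                     if baseline_requests else 0.0)
    canary_rate = canary_errors / canary_requests

    # Small absolute floor so a zero-error baseline does not fail the
    # canary on a single stray error.
    allowed_rate = max(baseline_rate * max_ratio, 0.001)
    return canary_rate <= allowed_rate


if __name__ == "__main__":
    # Canary matches the baseline error rate, so the rollout may continue.
    print(canary_is_healthy(10, 10_000, 1, 1_000))
    # Canary is far worse than the baseline, so the rollout should stop.
    print(canary_is_healthy(10, 10_000, 50, 1_000))
```

The point of the design is that a bad change fails on a small slice of traffic, where it can be caught, instead of on all of production.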
MP English: Sometimes that can be a little hard to quantify. Take a simpler example, and then you can expand this out to some of the value of SRE. You can go and invest time in building a bunch of monitoring for something because you've noticed that you don't necessarily have a lot of insight into it. So you go and do all this telemetry and you set up all this alerting and you will never know the value of that until an outage happens. So it could be a year or two years before that alerting that was put in place actually pays off. It's hard to see the value of the outages that don't happen.
Kevin: Right. Fires that are prevented.
MP English: Yeah. So there's this constant invisible return from SRE because it's all of the things that we don't see that don't happen. But then there's also much more concrete things that we can do. We have a fundamental role in resource management. That is one of the core parts of SRE: making sure your services have appropriate capacity. And we exist in a real world and there are actual physical machines somewhere and those details matter on some level and we're abstracted from them. In my team we have some abstraction from the physical machines by quite a distance, but sometimes we look at the system and we say, "How can we use fewer machines?" And that's something that we can turn around and that has immediate financial implications because those machines then become available for other things. They're not resources that we're using up as Play. So those things you can go and have a project and be like, "We did this project and we saved 20,000 cores." Just a random number. Or maybe you identify something, that you're experiencing latency spikes and this particular call on this server is in the critical path of this user experience, and you go and drive an effort to get that latency down and then you go and you have some metrics on the user experience and then you can turn around and show that latency improvement increased engagement. That's often where you also see a lot of collaboration between developers and SRE: "How do we make the system more resilient? How do we make the system more performant?"
Kevin: I think that's a great capture. At the end of the day, we're here to run a business. The easy way may be, "Hey, I'll give the service as much compute resource as it needs." But like you said, for everything there's a cost, so you want to give it just enough so we keep the users happy. It's compelling yet not burning a hole in our cost.
MP English: There's a lot of variables you can play with, like, "What is your data freshness? How reliable does this actually need to be?" Not all traffic is created equally. And if you can somehow do something so that you only provide the reliability that you need for the most important traffic and have less important traffic be less reliable, you can save money there. Reliability engineering can go both ways. It can be calculated decisions to make something less reliable because of the cost-benefit analysis.
Kevin: Right. And this leads to the importance of defining the SLO and the error budget and using that to inform the decision of provisioning the compute power.
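[Editor's sketch] The trade-off discussed here, choosing how reliable something needs to be, comes down to simple arithmetic on an availability target. The following is a minimal sketch, assuming a 30-day window and a purely time-based availability target; the function names and the example targets are illustrative assumptions, not figures from the conversation.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of allowed unavailability for a given availability target
    over the window (time-based availability is an assumption here)."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target, downtime_minutes, window_days=30):
    """Budget left after the downtime observed so far; a negative value
    means the target for this window has already been missed."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


if __name__ == "__main__":
    # A 99.9% target over 30 days allows roughly 43.2 minutes of downtime;
    # a 99.99% target allows roughly a tenth of that.
    for target in (0.999, 0.9999):
        print(target, round(error_budget_minutes(target), 2))
```

A looser target for less important traffic leaves a larger budget, which is the room that can be traded for lower cost.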
MP English: You don't need to run all your systems at N plus four. I don't even know if anyone does that.
Kevin: Right. And I mean, obviously we have some solutions where customers want feedback, in terms of retail auto management, instantaneously, almost meaning milliseconds. And you have services where it's reporting: "You know what, it doesn't have to be current. Current to me is a matter of days or weeks." So that's where you can make your choice.
MP English: Yeah. So the option of like, "Well, does this need to be online? Can you do an offline process here that doesn't hit production?" Or that's not hitting the same production that users are interacting with.
Kevin: Right, exactly. And the other thing I'm really glad you mentioned is that I often find it's easy to recognize and reward people who put out a fire because it's very visible. "Oh my god, this fire came off, someone came in and was a hero that put it out." I mean, all the executive leadership know about it, customers know about it. But like I said, if we do SRE right, there will never be a fire. So how do we go and recognize the people and the teams that spend the effort putting in the resiliency work so that we never have a disruption?
MP English: It's the almost fires and the non fires that are hard to count.
Kevin: Yeah. Do you have any insight on how Google rewards and recognizes people who prevented the fires?
MP English: So there's a few different internal methods of recognition. The most common ones are peer bonuses and kudos. With peer bonuses, everyone has so many they can give out, and they are a way for you to recognize your coworkers for something they did where you thought they went above and beyond and really came through. You actually see that a lot when people jump in to help with an outage. They get recognition, often from the on-caller. Other times it will be the developers of the service that the normal on-caller jumped in and really helped out. And then they'll get recognition back from the developer team that they interacted with during the outage. And kudos are just a lighter-weight form of that.
Kevin: Yeah, that's great. That's cool actually. IBM has something similar we call the blue points where you can internally use that to redeem for merchandise and whatnot. And I think that that's a great culture to have. I would say it's always a team effort and we are the outcome of our network and the team that helped us get there. So it's great to have the culture to recognize people who helped us.
MP English: And I do think that's one of the most important skills as being an SRE at Google, actually: learning how and when to lean on your teammates. Because the systems typically are just far too complicated for any one human to really ever grok how they work. We always have partial, incomplete, inaccurate knowledge of how some subset of the system works, kind of. And it's not even that picture is wrong. You know that the mental model you have of how the system works is always wrong, but it's accurate enough that you can make reasonable intuitions. But then there are so many different components and subsystems and shared services that we interact with that you really just can't be an expert on all of them. So folks tend to find a niche on their team. And if I run into a problem with a particular subsystem that I'm not as familiar with, I know who on my team I can go to and ask for help. And that willingness to ask for help is really one of the most important skills, I think, for SREs to have.
Kevin: That is a great capture and I definitely agree with that. And systems only get more complex over time with the introduction of hybrid cloud. Now they may actually be in different data centers and different clouds altogether, right?
MP English: Mm-hmm.
Kevin: So I think that's a great lead-in, MP, into one of the intended outcomes of this podcast, which is really to build this wonderful SRE community. I mean, we're here, IBM and Google, talking to each other about SRE and also type vitality. Do you have any words of wisdom for audience members who may be experienced professionals, or ones that are just entering the job market, if they're interested in the SRE profession?
MP English: So I could give a pretty canonical answer here: skills at the intersection of software engineering and systems engineering, troubleshooting, numerical literacy, systems engineering. But I think what matters more is really having a passion for, an uncontrollable desire to, understand how things work. Because that's what we're constantly doing: trying to understand how our systems behave, and our knowledge is always imperfect. You're always going through that continual process, as I was saying, of trying to have an understanding of the system. And the best way to do that is get people from different backgrounds together in one place. So I also would definitely encourage people who are not necessarily really programming-focused to look at something like SRE because I actually come from a very odd background in terms of how I ended up as an SRE. I originally went to school for astrophysics. But it turns out that when you spend a few years solving problems about large systems with absurdly incomprehensibly large numbers, you get really good at grokking systems with absurdly large incomprehensible numbers. Does anyone really understand what it means to look at a service and be like, "Ah, it is receiving 40 million queries per second"? That is an unfathomably large number. It actually probably would be interesting to think about that in hertz. So million, so it'd be mega, so 40 megahertz, but that's still, on human scales, pretty fast and pretty incomprehensible to us. So it's really important to be comfortable with that unknown and not trying to learn it all. Finding a niche, finding the thing that you really enjoy in all of this mess of the systems. One of the niches I have personally fallen into because of my background has been monitoring and alerting because I have a very strong data analysis background and signal processing background.
So that's somewhere where I've been able to take the skills that I had from outside the SRE world and bring them into SRE. And getting more and more people so it's good to have someone who maybe has front end experience, and someone else who has backend experience, and someone else who has database experience, and someone else who is a systems administrator, and someone who is a network administrator because you need all of the skills together. There's not really any one track to SRE. You can really find the chunk of it you really like, what really gets you, and you can follow that.
Kevin: And I think that alludes to your previous comment about always looking to see how others can help; don't feel like you have to solve everything yourself. And to really be able to do that, you yourself should establish what you're good at. What will people know you for as an SRE, so that they can, in turn, turn to you for help?
MP English: Yeah, the phrase that comes up a lot is subject-matter expert, and there are lots of different subjects and lots of different degrees of expertise.
Kevin: Right.
MP English: Particularly when you start looking at some of the bigger shared services, it's really important to have someone on your team that has a really good knowledge about that service. Because there's only so many SREs that are supporting it for all of Google internally. And the more expertise you have locally, the less you have to escalate to another team, which is more time. It takes longer, it can delay your time to restoration. It's not always worth having super detailed knowledge of every subsystem you interact with because some aren't very prone to causing outages and they don't come up a lot. But when something comes up again and again, it's really good to develop that local expertise. And I think that's also something that I would encourage someone earlier in their career to do: identify a niche that you are personally interested in. Personally interested. The monitoring and alerting stuff I do, I genuinely enjoy that.
Kevin: You're passionate, you enjoy waking up to do that.
MP English: Yeah. If someone comes to me with like a, "Hey, I want to have monitoring around this sort of thing, how can I do that?" Those are conversations I genuinely enjoy having.
Kevin: Right. And if you're passionate and you enjoy it, you will naturally learn more and be good at it.
MP English: Yeah. I have gone into extreme amounts of depth of how our monitoring system works and how to write better, more efficient monitoring queries that use less resources and advanced troubleshooting of the monitoring systems when things aren't quite working right, when there's something wrong with your monitoring configuration. Those are things that usually people reach out to me for help with, because that's where I've developed my expertise.
Kevin: That's a great capture of your expertise and your hill. And I think, MP, you touched on this a little bit. In closing, I always like to go back to the inspiration for this podcast, that is: what do you think is the ingredient and recipe for companies to succeed in driving that SRE mindset?
MP English: Ooh.
Kevin: And I think the ingredient that I heard earlier was around diversity.
MP English: Diversity is a big one, but I think diversity alone isn't enough. You need to also have a certain ingrained cultural tendency towards not accepting how things are and looking for how things should be.
Kevin: Ah. I love that.
MP English: And that's actually, I think, oftentimes one of the most vital things that having newer team members and diversity of experience, actually, maybe as one of the potentially underrated aspects, is bringing in people who are green and that you can grow into strong SREs. They're going to be the ones to ask, "Why do we do it this way?" And then you're going to have to sit there and try to come up with an answer. And then when you can't think of a good one, now you have a new problem to solve. Because you do get entrenched in a certain thinking of, "That's just the way things are sometimes." And it can be hard, even though we know we're not supposed to do that, our brains kind of naturally do that. And that's actually probably one of the big strengths of Google is getting to move across teams during your career because then you go to a different team and you see they're doing things, mostly the same technologies, same ideas, but all of these things are slightly different. And it's like, "Well, why is this this way?" And the constant move of people across teams within Google spreads the knowledge of different ways of doing things in a way that helps the health of all of the teams.
Kevin: Right?
MP English: So yeah, I would say diversity and a willingness to challenge the status quo.
Kevin: Personally, I had the opportunity to go to different teams, different business units, wear different hats. And yeah, I definitely learned to appreciate the different perspectives that people had. So when I see a problem, if I've worn a different hat, I would look to solve it differently, right? No, I really like that. No, I think that that's a great capture. So thank you so much, MP, for helping us gain wonderful insights into SRE at Google.
MP English: Great to be on here with you, Kevin.
Kevin: Thanks, MP. And as a reminder to all the listeners, please also check out Google's SRE podcast hosted by MP. You can find it by searching for Google SRE podcast. See you again on upcoming [inaudible].
DESCRIPTION
MP English, host of Google's SRE Prodcast, takes the audience through what it's like being an SRE at Google. MP describes the meaning of SRE to them, how it has evolved at Google, how success is measured, and how people reinforce the culture and show appreciation to each other. MP also gives advice for practitioners interested in the Site Reliability Engineering profession and the path to success.