Episode 11 - Failure is not an option

Moon Landing lesson for SRE (02:49)
John Aaron saving the Apollo 12 mission (00:33)
James Webb Space Telescope story (03:50)
Failure Is Not an Option lesson for SRE (02:36)
Curiosity drives technical vitality (01:08)

Speaker 1: ...that at some point in my career, when I was trying to explain to people what I do, I started using metaphors from NASA's Mission Control. The Space Shuttle, the astronauts, those are the computer system, and I work with the people in Mission Control who make sure that they get to where they have to get to and that they return safely.

Speaker 2: Hi, everyone. Welcome back to another episode of the Making of the SRE Omelette podcast. Humanity has an enduring desire to explore space. It has led us to push the boundaries of science and technology. Our guest today, Robert Barron, has found an amazing connection between space exploration and SRE. I like to think he takes us to where no SRE has gone before. Welcome to the show, Robert.

Speaker 1: Thank you. Thank you very much. I'm delighted to be here.

Speaker 2: So Robert, before we get to space, you actually have a day job besides taking us through the incredible journey of space exploration. Can you start off by giving the audience a brief introduction to it?

Speaker 1: Yes. I work at a small company called IBM, and I've actually just celebrated 15 years at IBM. I've changed a couple of jobs, but always in the domain of operations, service management, site reliability engineering. Currently I'm in the internal CIO office as an SRE architect, helping other people be better SREs supporting the systems that IBM supplies its own employees. Everyone thinks about IBM as supplying software and technology to the world, but it also has to do that internally. We've got 300,000 employees, more or less, so that's a very large IT shop that needs a lot of SREs doing a lot of very good work.

Speaker 2: That's great. So, one can see you as a mentor and coach of SREs within IBM?

Speaker 1: Yes.

Speaker 2: Coming back to this episode, Robert, what was your inspiration that has led you to write about space and SRE?

Speaker 1: Well, I'll be quite honest, I've been interested in space for longer than I've been interested in SRE. What happened was that at some point in my career, when I was trying to explain to people what I do, I started using metaphors from NASA's Mission Control. The Space Shuttle, the astronauts, those are the computer system, and I work with the people in Mission Control who make sure they get to where they have to get to and that they return safely. When I was on a call, it was just like someone in Mission Control supporting a mission. I found that that metaphor was very powerful and helped explain the things that we were doing. And then I started feeling the need to write about what I'm doing, to share the knowledge as I went on the journey. I've been at IBM for 15 years, and I had a career in operations before that, and operations has changed a lot. I did start feeling a need to share some of the lessons that I learned, the changes that I saw, and the new ways of working, the [inaudible] way of working. I just wanted some sort of theme for the lessons I was going to be sharing, and that's when it hit me: I can use space.

Speaker 2: That is a wonderful story and a great way to give back, share knowledge, and tie back to your other interests. It is also fitting that we're in the holiday season, when other SREs may be trying to explain what they do to family and friends. Robert, what are some of your favorite stories?

Speaker 1: I actually have three stories. Do I have time for three stories?

Speaker 2: We'll make time for this, Robert.

Speaker 1: Okay. The first story, and this was actually something I wrote about in my very first article. Everyone knows that Neil Armstrong's first words on the moon were, "That's one small step for a man, one giant leap for mankind." What he nearly said was, "The computer rebooted, so I couldn't land." What happened was, about 30 seconds before they landed on the moon, they started getting error messages popping up on their console, and the error message was 1201, which everyone knows what that means.

Speaker 2: Oh, yeah, obviously.

Speaker 1: But fortunately for them, back on the ground in Houston Mission Control, they had SREs who were looking at the system, and what the SREs knew was that 1201 means that the computer is rebooting. Now, a computer rebooting 30 seconds above the ground on the moon is a very, very bad thing. Worse than that, it started rebooting again and again.

Speaker 2: Got stuck in a loop.

Speaker 1: Yes, exactly. And the finger was on the abort button to say, "We can't land. There's something wrong." But what they did was, while they were watching the computer misbehave, they were also looking at the behavior of the spacecraft as a whole. As SREs, we've got our golden signals: the throughput, the latency, the saturation, the errors that are being generated by the system. And we can see patterns. If we see that throughput is rising, then it's okay that latency will also rise a little bit together with it, because that's natural behavior. A busy system is a little bit slower. But if we see latency, slowness, rising without more users, more load, then that's a problem, because why is the system misbehaving? So for the SREs watching the moon landing, their golden signals were things like the altitude of the spacecraft, the rate of descent, the angle it was flying. And they saw that even though the computer was rebooting, between reboots it was still controlling the spacecraft in the right way, so it was hitting the golden signals correctly. And what they said was, "We don't know why it's rebooting. Technically this is an anomaly which should not be happening. But functionally, from a business perspective, as it were, it's doing the right thing."
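A minimal sketch of that golden-signal reasoning in Python; the thresholds and signal names here are illustrative assumptions, not anything from the episode:

```python
# Minimal sketch of the golden-signal pattern described above.
# Thresholds and signal names are illustrative assumptions only.

def assess(throughput_growth: float, latency_growth: float,
           error_rate: float, saturation: float) -> str:
    """Classify health from relative changes in the four golden signals."""
    if error_rate > 0.01 or saturation > 0.90:
        return "page: errors or saturation out of bounds"
    if latency_growth > 0.20 and throughput_growth < 0.05:
        # Latency rising with no extra load is the suspicious pattern:
        # the system is slowing down for no business reason.
        return "investigate: latency rising without load"
    # A busy system being a little slower is expected behavior.
    return "ok: signals consistent with current load"

# A busy-but-healthy system versus a quietly degrading one.
print(assess(throughput_growth=0.30, latency_growth=0.15,
             error_rate=0.001, saturation=0.60))
print(assess(throughput_growth=0.00, latency_growth=0.40,
             error_rate=0.001, saturation=0.60))
```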

Speaker 2: [inaudible] meeting success criteria.

Speaker 1: Exactly. So, "We can continue." And they told Neil Armstrong and Buzz Aldrin, "You can continue. As long as it's rebooting periodically, as long as the golden signals are fine, you can land." And the rest is history.

Speaker 2: Nicely summed up.

Speaker 1: They debugged the problem, found the cause, and for the next spacecraft they built the computer slightly differently, with different code, so that this would not happen again. The next story was actually the very next Apollo mission, Apollo 12, which, just after it launched, was struck by lightning, and the system went haywire. Very quickly, the astronauts and the engineers on the ground managed to calm the system down, reconnect everything, and get it going properly. But they lost all the telemetry. The rocket was flying correctly, the IBM computer on board the giant Saturn V was controlling it and aiming it, and they were using radar to see that it was flying the way it was supposed to fly, but they were not getting any telemetry on the ground. And without any telemetry, well, maybe they could get into space, but they certainly wouldn't be able to continue to the moon flying blind.

Speaker 2: They lost visibility into the system's metrics.

Speaker 1: Exactly, they completely lost visibility into the system. And there was one engineer, John Aaron was his name, and he noticed a pattern in the lost telemetry. He remembered that he had seen this sort of pattern of invalid information, invalid data, about a year earlier, when he was doing a simulation in Houston and there was a power failure in the simulation room. What's the natural thing to do when there's a power failure during your simulation? Pack up, have lunch, go home while the technicians fix it. But what John did was different. He said, "Wait a minute, why am I seeing this funny, wrong telemetry just after the power failure?" And he went to investigate, and he found out that after a power failure, when the system boots up again, there's a chance that it will go into a specific failed state that will show you this information. And there's a particular switch in the system that you can flick that will give you back your telemetry, return it to the right state. He gave the command, it's a very famous command to Apollo buffs, it's called "SCE to AUX," and it meant as much to most of the people who heard the command as it does to you, because it's this obscure little switch that, of the three astronauts on the spacecraft, only one of them knew where it was. And they flicked the switch, and that solved the telemetry problem. And they checked out the spacecraft and flew to the moon.

Speaker 2: Robert, next time my system goes into debug mode, give me all the error logs so I can look for that switch.

Speaker 1: Yes. The reason I like this example is because John Aaron was an outstanding SRE. He was super technical, knew everything about the systems that he was responsible for, but he didn't stop there. Whenever he saw something that he couldn't explain, for whatever reason, from whatever domain, he had to understand: why did this happen? How did this affect-

Speaker 2: He was curious.

Speaker 1: Exactly. He was curious about everything related to his domain, wherever it occurred. And this led to him saving a mission.

Speaker 2: Right. Love that story.

Speaker 1: The third story is a much more recent story, which is a little bit of an opposite example. This is the story of the James Webb Space Telescope. It's the successor to Hubble as the flagship astronomical observatory in space. It was launched in December of last year, and it spent six months in its operational checkout. Just like every good SRE won't allow something to go into production without a lot of validation tests along the way, the Webb telescope had six months of these validation tests. And since then it's been returning fantastic discovery after fantastic discovery down to earth, revolutionizing a lot of the astronomy and science that we know. It's a wonderful success. But the James Webb Space Telescope was something like 20 years of development. It cost $9 billion to build, and the expectation is that five years of operation will cost another billion dollars. During the deployment, during the launch, there were over 300 single points of failure, technical and hardware and all sorts of things that had to go perfectly. It was so large that it had to be folded up in the rocket that launched it, and then it unfolded like a flower in space and it-

Speaker 2: Origami.

Speaker 1: Exactly. And you think, this is so complicated. This is such a waterfall development. This really has no error budget whatsoever, and it cost $10 billion. The reason this is a good SRE lesson is because it's all about what not to do. You don't need 100% availability. Don't develop one system over two decades and launch it with one opportunity, all your eggs in one basket, all that overwhelming complexity. They had to develop so many new technologies, but it was justified because the James Webb Space Telescope is a once-in-a-generation project. They couldn't develop six small telescopes and achieve what it did. They had to go big, or otherwise they would not achieve their success criteria.

Speaker 2: And during its development, did they talk about, hey, we should build in redundancies?

Speaker 1: Oh, yes, so much redundancy, so much testing, so many failures on the ground that led to redesign and rebuilding and reconfiguration and re and re and re. That's why it took so long and why it was so expensive, because they only had one opportunity for this. My point is that for most SREs, your system is not a once-in-a-generation flagship. You don't need this. You should not be looking for such high availability. You should not be looking for such perfection. You should be looking at what's good enough. What are my requirements? Look for justification from your clients: do you really need five nines, or is four nines enough? Is it really so bad if I have a backup window? Is it really so bad if I have latency during a failover? And all sorts of things like that. You don't need to achieve perfection. The James Webb Space Telescope needed to achieve perfection, but that's once in a generation.
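The "how many nines?" question is just arithmetic over the error budget. A quick back-of-the-envelope calculation using the standard availability tiers (nothing specific to the episode):

```python
# Allowed downtime per year for the standard availability tiers.

def downtime_minutes_per_year(availability: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return (1.0 - availability) * 365.25 * 24 * 60

for label, target in [("three nines", 0.999),
                      ("four nines", 0.9999),
                      ("five nines", 0.99999)]:
    print(f"{label:>11} ({target}): "
          f"{downtime_minutes_per_year(target):7.1f} min/year")

# three nines (0.999):    526.0 min/year (about 8.8 hours)
#  four nines (0.9999):    52.6 min/year
#  five nines (0.99999):    5.3 min/year
```

The jump from four nines to five nines cuts the budget from roughly an hour a year to about five minutes, which is why the justification question matters so much.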

Speaker 2: Great example of how we can balance perfection and reliability with cost and time.

Speaker 1: Yes, yes, definitely.

Speaker 2: Thank you, Robert, for those great stories. I know I now have a few more ways to describe to people what I do, along with a new appreciation for history. Coming back to the heart of this podcast, which is culture to deliver business and client success: Robert, how would you say the SRE discipline has helped NASA achieve its goals?

Speaker 1: At a technical level, the concept of SRE is not directly related to what the engineers did. Certainly not in the '60s, when it was mostly hardware. Even the software was actually hardware, because you didn't have a disk and write a program to the disk; it was hardcoded into the computer. Every mission launched had a slightly different computer that was physically different from the one that came before it, because they had different instructions. There would be people at IBM, at MIT, who built the computers that ran the spacecraft, physically building each new computer from scratch. But to me, SRE is not just, okay, I'm doing automation and I'm a new way of doing operations. SRE is: I'm in charge of the infrastructure, the non-functional part of the system, and I have the same responsibility that the people who are developing the business side, the functional side, have. The astronauts are the people in front; let's say they represent the developers who are making the business functionality of the application, who are actually making sure that it moves the widget from one place to another, that it executes the transaction, that it does the conversions or the logic that needs to be done. And the NASA engineers are the site reliability engineers who make sure that the platform it's running on does its job. So just like all those engineers, when they went to work in the morning, said, "This mission is going to succeed because of my work," site reliability engineers are saying, "This business application is going to succeed because of my work." And to me that's the difference between an SRE and a sysadmin or an operator or anyone who's responsible for a specific technology. A sysadmin is going to say, "I'm responsible for the computer to be available. I'm responsible for the disks to go at a certain speed. I'm responsible for the database to be working. I'm responsible for the network to be available between point A and point B." But an SRE isn't saying that. An SRE is saying, "I'm responsible for the application being successful in using the infrastructure." So it's a different level, a different type of responsibility, that again and again you see in the stories about NASA's near misses or the problems that they resolved. So many engineers were looking outside their silo and saying, "I'm going to take responsibility for something that is not technically my responsibility, but my responsibility depends on it or is related to it." And when you do this, one plus one gets you three, because you're looking at a much wider scope than you were before. To me, that's the essence of being an SRE: this responsibility that you feel not just for the technology but for the business services that are running on this technology.

Speaker 2: And for that positive user experience.

Speaker 1: Yes. The error budget is so much smaller, because if there's a failure, it's human lives.

Speaker 2: Right. For us in the IT world, we have the luxury of three nines, four nines, or five nines. But for NASA, failure is not an option.

Speaker 1: Exactly.

Speaker 2: So Robert, do you have examples of how NASA built for resiliency to anticipate problems and prevent failure?

Speaker 1: The way they did that back in the '60s is that they treated the hardware, to a certain extent, the way we treat software today. What I mean by that is that you have MVPs, minimum viable products. Every mission had a goal of proving a new technology, a new capability, and missions would be built on top of missions. We have lessons from the previous one: if we figured out that something works, then the next one is going to do something new. If it's a negative lesson, we found out that it doesn't work the way we expected, then the next one is going to try to do the same thing in a different way so that we can go forward. You see that in the '60s they had three programs, three different spacecraft. The first one was the Mercury spacecraft, which carried just one person; the first mission was about 15 minutes, and the last mission was something like three days, because every time they added more capability, more testing. And then the plan was that after Mercury they would go straight to Apollo, which would be the moon missions. And then they suddenly started thinking, wait a minute, we don't know how to do all these things. We don't know if we can survive for two weeks. Big difference between surviving for three days and two weeks in space. We don't know how to maneuver in space. We don't know how to communicate between two spacecraft. And if we go straight to the moonshot with all these unknowns, it's going to be very difficult to achieve that. So they developed a whole new spacecraft called Gemini, which would sit in the middle between Mercury and Apollo and prove out all the stepping stones required to get to the moon. So they were absolutely working the way that we do today with MVPs, to fail fast, all the concepts that we talk about, even chaos engineering. Let's take an empty spacecraft and see what it takes to destroy it. They were doing all the things exactly the way we do them, but instead of having containers that you can spin up in seconds, they were using much more expensive physical spacecraft. The concepts and the principles of fail fast, of doing experiments, of chaos engineering, as it were, are exactly the same concepts that we have today.

Speaker 2: That is a great lens, and it draws a parallel between what we do and rocket science. Robert, how can SREs take inspiration from the mindset that failure is not an option?

Speaker 1: The key concept to remember with "failure is not an option" is that it's not that there is no failure, because there's always going to be failure of components, of subsystems. Failure is not an option for the overall system. That's very much a part of modern operational development. When you say, I don't want active/passive backups anymore, I want active/active, so that if there is a failure in some component, the system as a whole will continue working. If a destination for a queue is blocked, I don't want to lose the messages. I want to queue them up, I want to send them to a backup location. I want to send some sort of message, hey, there's some sort of delay, so that the user knows that there's not a failure, it's a slightly degraded experience, let's say. And again, this happens in space as well. The classic example of a failure that was not a failure in space is Apollo 13, where an explosion halfway to the moon crippled the spacecraft, but they managed to return safely to earth. So from the perspective of, did they achieve the functional requirements of getting onto the moon, exploring the moon, doing experiments, bringing back rocks? No. But from the requirement of bringing the astronauts back safely, they did succeed. The subsystem, let's call it the subsystem of landing on the moon, failed, but the overall system of bringing the astronauts back alive succeeded. There are other ways of looking at it. Let's take the International Space Station. The astronauts need air on the International Space Station, so they've got a number of oxygen generators, but sometimes the oxygen generators fail. So you can say that they don't have 100% uptime, but the point is, they don't need 100% uptime, because there's still oxygen in the space station itself that will last enough time for the astronauts to fix the broken-

Speaker 2: To recover.

Speaker 1: Exactly. So they don't have 100% uptime for every component, but there are enough redundancies, enough backup plans, enough ways of doing things differently, that failure will never become the option. There will always be something along the way that will remediate the problem well enough for them to continue.
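A hedged sketch of that queue-failover idea in Python; the tiny in-memory queues and the message names are hypothetical stand-ins for real messaging infrastructure:

```python
import queue

# Hypothetical stand-ins for real messaging destinations; the primary is
# deliberately tiny so the failover path is easy to demonstrate.
primary: queue.Queue = queue.Queue(maxsize=2)
backup: queue.Queue = queue.Queue()

def send(message: str) -> None:
    """Deliver to the primary if possible; otherwise reroute, don't drop."""
    try:
        primary.put_nowait(message)
    except queue.Full:
        # Primary destination is blocked: keep the message, route it to the
        # backup, and surface a degraded-experience signal, not a failure.
        backup.put(message)
        print(f"delayed via backup: {message!r}")

for i in range(4):
    send(f"msg-{i}")

print("primary:", primary.qsize(), "backup:", backup.qsize())
```

The subsystem (the primary destination) fails, but the overall system keeps its promise: no message is lost, and the failure surfaces as a delay rather than an outage.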

Speaker 2: That is such an insightful perspective. I can totally see you using those stories to explain what you do and what we do.

Speaker 1: That carries over to our day-to-day systems, where, yes, we cannot reach 100% availability, we can't afford 100% availability, and we don't even need 100% availability, usually. The reason I use space as an example is aspirational. We can't achieve 100%, but what can we achieve? What would we do if it was necessary? What sort of solutions could we think of that would allow us to remediate problems fast enough that our three nines, four nines will be good enough, so that, yes, we have failures in the subsystems, but we don't have overall failures?

Speaker 2: Speaking of inspiration, Robert, what words of wisdom do you have for practitioners looking to get into SRE?

Speaker 1: Curiosity is what keeps you with the technical vitality that you need. If you're not curious, if you have to force yourself to learn something new, then it's going to be very difficult to learn something new, because you'll keep going back to your comfort zone. I'm not talking about chasing new technologies; I'm talking about something like security by design or build to manage. All these things are not new, but for much too long, system admins and operators could get away with sitting in their corner of the world and saying, no one really cares about how our tools work. We could build something of our own using new technology, old technology, it doesn't matter. But today these tools have to be integrated with the rest of the organization, because you want developers to have access to your tools as well, so they can solve problems. So it has to work in the same way, with the same standards, as theirs. You have to be up-to-date on software engineering concepts, because that's what your clients expect from you, and if you don't have the curiosity, it's going to be much more difficult. The flip side of curiosity is responsibility, which is what's driving you to do better all the time and keeps your curiosity aimed in the right direction.

Speaker 2: Well said. Not happy with the status quo, always looking for a new way or a better way of doing what we do.

Speaker 1: Yes. If it's just curiosity, then you might be over-engineering. You might be going down a rabbit hole that is not really relevant. You balance that with responsibility to make sure that you are driving your curiosity in the right directions.

Speaker 2: And, as you said earlier, to challenge that success criteria, so we do just enough to keep the system reliable.

Speaker 1: Yes. Yes. Because that's your responsibility, to keep the system reliable in the best possible way, not in the most extreme possible way or most expensive possible way.

Speaker 2: Right. In closing, Robert, let's go back to the inspiration of this podcast. What would you say is an ingredient and recipe for organizations, and that could be NASA or it could just be a small company called IBM, to achieve the SRE outcome of delivering business and client success?

Speaker 1: For me, achieving SRE is both top-down and bottom-up. The business has to say that they're invested in SRE, that we're going to be developing our applications in a way that is manageable, in a way that is supportable, in a way that will enable SREs to make the most out of them. And it's also bottom-up, in which the SREs have the freedom and the confidence and the curiosity and responsibility to work as effectively and efficiently as possible with everyone else in the organization.

Speaker 2: There you go, ladies and gentlemen. The SRE ingredient and recipe from Robert Barron, from the storytelling perspective of humanity's exploration of space. If you'd like to hear more of Robert's stories, please visit flyingbarron.medium.com. Thank you so much, Robert, for spending the time with us today.

Speaker 1: It's been a pleasure and an honor. Thank you.

Speaker 2: I'd also like to thank you, the audience, for listening. See you on a future episode.

DESCRIPTION

Humanity has an enduring desire to explore space - it has led us to push the boundaries of science and technology. Our guest today, Robert Barron, has found an amazing connection between space exploration and SRE. I like to think he takes us to where no SRE has gone before!


Robert shares with us his favourite stories that draw parallels from the lunar landings, the Space Shuttle, and space exploration to SRE. He also shares what SREs can take as inspiration from "Failure Is Not an Option" when building systems.

Today's Host


Kevin Yu

Principal SRE, IBM Sustainability Software

Today's Guests


Robert Barron

AIOps, ChatOps and SRE @ Office of IBM CIO