Drive Thru AI
DJ: Welcome to AI Coffee Shop. May I take your order?
Jerry Cuomo: Hi. Yes. Can I order a large cappuccino with half-and-half and two sugars? No, no, no. Two Splendas. Do you have Splenda? Well, actually, skip the sugar.
DJ: Okay. One large cappuccino.
Jerry Cuomo: Actually, scratch that. I'll just take something decaf. Maybe just an Americano. A decaf Americano. Oh, sorry about the dogs. What did I order?
DJ: An Americano and a cappuccino.
Jerry Cuomo: No, just an Americano. I scratched the cappuccino, remember? And it's a decaf.
DJ: Ah, cappuccino scratched.
Jerry Cuomo: Oh, maybe I'll have one of those jumbo cookies too. Oh, no, no. Wait. Too much sugar. Forget that.
DJ: Okay, let me confirm. You're ordering a decaf Americano, and you're skipping the cookie because you're trying to cut down on sugar. And I have a biscuit here for your dog. Did I get that correct? Will there be anything else?
Jerry Cuomo: Wow. I mean, no, that's perfect. Thank you.
DJ: Great. Then drive up to the next window to pay and pick up your coffee.
Jerry Cuomo: Welcome to The Art of AI, a podcast that explores trustworthy AI for business. Folks, that was DJ and I setting the scene for this cool episode on how advanced AI and automation technology can make sense of the ramblings and noise that you just experienced in our reenactment of, well, a typical drive-through ordering experience. Our guest today, Brian Langner, is an expert in AI and natural language processing. He is known to make computers talk as well as listen. And with that, I'd like to welcome Brian to The Art of AI. Welcome, Brian.
Brian Langner: Hi, Jerry. It's great to be here.
Jerry Cuomo: Let's get right to this first question because I have so many things I want to chat with you about. It's a great topic. So can you start by please sharing with our listeners why you love what you do?
Brian Langner: Well, it's interesting. I have a background doing speech and conversation technology, and I'm one of those lucky people who actually has a career in industry doing exactly what I did in my graduate studies.
Jerry Cuomo: Wow.
Brian Langner: I don't know how I managed to do that, but I'm pretty lucky.
Jerry Cuomo: That's fantastic.
Brian Langner: The reality is I've been interested in computers and AI since I was little, and this is a great way to explore that technology and do what I like. I make computers talk. That's the one-liner pitch for what I do. And I also like to joke sometimes I make them listen, and listening is actually way harder than making them talk.
Jerry Cuomo: So yes, that is the word on the street, Brian, that you make computers talk. So how do you automate a computer voice in a way that people like, perhaps even feel is friendly or even approachable?
Brian Langner: It's a great question. There's a lot of character that goes into people's voices, and when you're trying to make a computer interact with a human by voice, you want to have some character with it. So my background, both in my time in graduate school as well as in industry, I have built voices and one of the things that we've done is tried to provide a personality.
Jerry Cuomo: I see.
Brian Langner: You don't want to go too far down that path. You get into an uncanny valley really quickly about whether or not the person understands they're talking to a computer. But you do want it to be a little bit personable, with character, but still somewhat, obviously, this is not a natural human you're talking to.
Jerry Cuomo: I see. So you have to preserve some of the robotic aspects of it or else it gets creepy?
Brian Langner: It can get creepy really quickly, yeah. And I think the other thing that's been interesting in my past is if you have a very natural sounding voice and you're talking about a conversational system that interacts with the person, the person on the other side will assume that the computer on that side actually is better at doing this than it is.
Jerry Cuomo: I see.
Brian Langner: So you don't want it to be too natural. If it's too natural, then people assume that there's a human on the other side and they can speak completely naturally and fluently, and it turns out the machine is not always great at understanding all of that.
Jerry Cuomo: Hey, Brian, I heard you were a part of a company called ToyTalk. That sounds really interesting. In fact, I think you were even one of the founders of it. So can you share with our listeners a little bit about that experience?
Brian Langner: Yeah, so I was one of the founding engineers at ToyTalk. That is the fancy way of saying that I agreed to join before we had any money. But yeah, it was a really fun experience. It was a group of folks, a lot of people who had come from Pixar, had backgrounds in character and technology and combining that to basically be a creative engineering organization.
Jerry Cuomo: Any toys that we would know? Mr. Potato Head or anything like that?
Brian Langner: So the one that everyone would know is Barbie. So we did build an internet-connected conversational Barbie doll, sold it.
Jerry Cuomo: Cool.
Brian Langner: In, I want to say 2015 or 2016. I don't remember the year anymore. I actually have one in my house, although it no longer works. But yeah, you could have a conversation with Barbie.
Jerry Cuomo: So take us from ToyTalk to automating order taking.
Brian Langner: On some level, they're not actually that different of a problem, at least from the technical perspective. You're trying to get a computerized agent of some kind to interact by voice with a person and have a productive interaction. When you're talking about something like a Barbie doll, that's for entertainment purposes. So it turns out you don't always have to be perfect. Because if you say a non sequitur, sometimes kids actually think that's funny and that's better than doing it right in the first place. When you're talking about automating a drive-through, what you'll end up with there is... Yeah, you do need to get closer to perfect because someone's actually on the other side, hungry, trying to get some food. The good thing that helps us in this particular application is that it's interactive. So if we don't quite understand something the first time around, we can ask a follow-up question and hopefully we'll have a better chance the second time.
Jerry Cuomo: Hey, Brian, that makes a lot of sense. Now, can you share some of the challenges? I mean, when I'm ordering at a drive-through, I find myself rambling as I try to make up my mind, and my dogs are maybe barking in the backseat. So how do AI, natural language processing, and machine learning bring order to this noise and rambling?
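The follow-up strategy Brian describes can be sketched roughly as a confidence check: accept what was heard when the recognizer is confident, ask again when it is not, and hand off to a person after repeated misses. This is only an illustrative sketch, not the Watson Orders implementation; the function name `next_action`, the 0.8 threshold, and the two-attempt limit are all assumptions made for the example.

```python
def next_action(heard, confidence, attempts=0, threshold=0.8, max_attempts=2):
    """Decide how to respond to one recognized utterance.

    heard       -- text the (hypothetical) recognizer produced
    confidence  -- recognizer confidence in [0, 1]
    attempts    -- how many follow-up questions were already asked
    threshold and max_attempts are illustrative values, not real settings.
    """
    if confidence >= threshold:
        # Confident enough: take the item as heard.
        return ("accept", f"Got it: {heard}.")
    if attempts < max_attempts:
        # Not confident: ask a follow-up question and try again.
        return ("clarify", f"Sorry, did you say {heard}?")
    # Still unsure after several tries: let a crew member take over.
    return ("handoff", "One moment, a crew member will help you.")


first_try = next_action("decaf Americano", confidence=0.55)
second_try = next_action("decaf Americano", confidence=0.92, attempts=1)
```

Because the exchange is interactive, a low-confidence first pass costs only one extra question rather than a wrong order.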
Brian Langner: The first thing to note is that the technology that my team works on, it is audio only. And so we're only using the signal we get from the microphone. When you start talking about challenges, that's one of them.
Jerry Cuomo: I see.
Brian Langner: People have a lot of visual cues in terms of how they respond to speech and conversation, and our system doesn't have access to that. I think the other thing that you'll note is drive-through microphone and audio systems are stereotypically not, I would say, the best.
Jerry Cuomo: Yeah, I would agree.
Brian Langner: The reality is the human person whose job it is to take orders in the drive-through listens to something that sounds like [imitates garbled audio] for about eight hours a day, and they're supposed to turn that into "I'd like a double cheeseburger and a large Coke." Our system has to do that. We don't have any other signal to work with. And so we've needed to build speech technology that's robust to those kinds of conditions. The reality is also drive-throughs are outside, so you have outdoor noise. In places that are not California, you have weather. It might be windy or raining or sometimes hail, and the microphone is sitting in a giant metal box. So it's not the greatest of acoustic environments.
Jerry Cuomo: I see.
Brian Langner: The other thing is you have a car engine that's idling two feet away from the microphone. I like to joke that the solution here was we should just buy everyone that goes through a drive-through a Tesla and then we're fine. But it turns out that was more expensive and companies didn't want to pay for that. So it's our job to actually make the microphone and the speech recognizer work in that challenging acoustic environment. I think the other thing you noticed is the thing you said, you tend to ramble a bit. People are familiar with what the drive-through experience is like. It's been around for 50, 60 years. You pull up, there's a giant sign with what there is to order. You start talking, you get your food out the other end. Well, what that means is very often people will start talking before they know what they're going to say. So, "I'd like the number five, four, with a Coke. No, it should be a Diet Coke today, I'm trying to lose weight, and yeah, large fries. Actually, you know what? Never mind. Just get me the McNuggets."
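One way to picture the rambling problem is as a stream of corrections applied to a running order, where only the final state matters. The sketch below assumes some upstream component has already turned speech into simple events; the `Order` class, the event names, and the event list are hypothetical and are not taken from the Watson Orders system.

```python
from dataclasses import dataclass, field


@dataclass
class Order:
    """Running order state: item name mapped to its list of modifiers."""
    items: dict = field(default_factory=dict)

    def add(self, name, *modifiers):
        self.items[name] = list(modifiers)

    def remove(self, name):
        # Retracting an item that was never added is not an error.
        self.items.pop(name, None)

    def replace(self, old, new, *modifiers):
        self.remove(old)
        self.add(new, *modifiers)

    def summary(self):
        return [" ".join(mods + [name]) for name, mods in self.items.items()]


def apply_events(events):
    """Replay (action, *args) events in order and return the final Order."""
    order = Order()
    for action, *args in events:
        if action == "add":
            order.add(*args)
        elif action == "remove":
            order.remove(*args)
        elif action == "replace":
            order.replace(*args)
        elif action == "clear":
            order.items.clear()
        else:
            raise ValueError(f"unknown action: {action}")
    return order


# Hypothetical events for the rambling order quoted above.
EVENTS = [
    ("add", "number five"),
    ("replace", "number five", "number four"),  # "five, four"
    ("add", "Coke"),
    ("replace", "Coke", "Diet Coke"),           # "No, it should be a Diet Coke"
    ("add", "fries", "large"),
    ("clear",),                                 # "Never mind"
    ("add", "McNuggets"),
]

before_never_mind = apply_events(EVENTS[:5]).summary()
final_order = apply_events(EVENTS).summary()
```

Here `before_never_mind` holds the three items the customer had settled on, and `final_order` holds only the McNuggets. The hard part in practice is producing those events from noisy audio, which this sketch does not attempt.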
Jerry Cuomo: I'm sorry, Brian, just by you going through what you did, it seems like an impossible problem to solve. What's the accuracy?
Brian Langner: So our target for our technology is to be able to automate 75% or more of orders. And where we say automate, we mean be able to take the order from start to finish without having to have one of the human employees come in and take over. Our technology today is approximately in that range, and we've made some substantial improvements over the past several months, and we expect that trend line to continue.
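The metric Brian defines here is an automation rate rather than a word-level accuracy: the share of orders completed from start to finish with no human takeover. A minimal sketch of that calculation follows, assuming a hypothetical order log in which each record carries a `human_takeover` flag; the field names and sample data are invented for illustration.

```python
TARGET_RATE = 0.75  # the "75% or more" target mentioned in the interview


def automation_rate(orders):
    """Fraction of orders taken start to finish without a human takeover."""
    if not orders:
        raise ValueError("no orders to score")
    automated = sum(1 for order in orders if not order["human_takeover"])
    return automated / len(orders)


# Hypothetical log: 20 orders, every fourth one needed a crew member.
sample_orders = [
    {"order_id": i, "human_takeover": i % 4 == 0} for i in range(1, 21)
]

rate = automation_rate(sample_orders)
meets_target = rate >= TARGET_RATE
```

In this made-up sample, 15 of 20 orders are fully automated, so the rate sits exactly at the target.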
Jerry Cuomo: That's impressive, Brian. So now where are the breakthroughs coming from? Is it around a different type of machine learning? Is it different type of natural language? Is it software, hardware, more training? All of the above. None of the above?
Brian Langner: I mean, it is all of the above. As with most automated technology, and in particular speech and language technologies, there's a lot of different things that can go wrong. None of them individually are necessarily like 10% of the problem, but in combination, the total mass of the things that we're not doing well is substantial. And so it's a matter of finding the lowest hanging fruit to fix, but also how many of them are worth the effort. If it takes three months to fix a thing that has 0.2% improvement, maybe that's not the best way to spend the resources and we can try and do some things that get 1 to 2%. And so that's where our team has been for the past year and a half, working on problems like that. Yeah, I would say the biggest improvements we've had in the past several months have been around getting the speech recognizer to be more accurate and in particular more robust to the environments that we actually see in the real world.
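The trade-off Brian describes, weighing the effort of a fix against the improvement it buys, can be sketched as ranking candidate fixes by expected gain per month of work. Only the "three months for 0.2%" figure comes from the interview; the other candidates, their names, and their effort estimates are assumptions for illustration, and a real team would weigh more than this single ratio.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fix:
    name: str
    months: float        # estimated effort
    gain_points: float   # expected improvement, in percentage points

    @property
    def gain_per_month(self):
        return self.gain_points / self.months


def rank_fixes(fixes):
    """Order candidate fixes from best to worst gain per month of effort."""
    return sorted(fixes, key=lambda fix: fix.gain_per_month, reverse=True)


candidates = [
    Fix("rare edge case", months=3, gain_points=0.2),    # figure from the text
    Fix("noise robustness", months=2, gain_points=1.5),  # hypothetical
    Fix("menu synonyms", months=1, gain_points=1.0),     # hypothetical
]

ranked_names = [fix.name for fix in rank_fixes(candidates)]
```

With these numbers the three-month, 0.2% fix lands at the bottom of the list, matching the intuition in the interview.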
Jerry Cuomo: Amazing.
Brian Langner: There are drive-throughs that border a freight rail yard. And so one o'clock every day there's a giant freight train that comes by and blows its horn for 10 seconds at a hundred dB. It turns out that the human employee there actually can't deal with that either because it's so loud that it just obliterates everything. And so those are the challenges we've been trying to work around. Our team has done an okay job so far about finding the biggest things that need improving and adding new data, adding new modeling techniques, and working on modern deep learning approaches to solve these problems.
Jerry Cuomo: Wonderful. So Brian, you're a leader today in a team called Watson Orders. Can you just tell us a little bit about how you got here?
Brian Langner: Well, that's a really interesting question. It's not where I thought I would be. I actually got involved with this group of people when it was a startup called Apprente. So this is a Silicon Valley group almost straight out of a TV show perhaps, where it's 15 to 20 people in a tiny little office that's overcrowded. It looks a little bit like a garage almost. And I come in for my interview with this group and there's actually a guy taking a nap on a couch. It was really straight out of a TV show. So it was a great group of people, really quirky, really talented, really good at what they do. And then the week after I interviewed, they announced they were being acquired by McDonald's. That McDonald's, yes. And so like, "That's not quite what I thought I was getting myself into." And it's like, "Why is McDonald's interested in an advanced technology team?" Well, it turns out that they were interested in automating drive-through order taking. That's actually a pretty reasonable application of speech and conversation technology. It's a targeted domain. There's a goal involved. And so the people who are involved in this, they know what they want to do. There's an object for what they're trying to be there. They're not there just to play with the toy. They're there, they want food. It's a voice-based interaction already. So there's no real retraining of the customer base. They already know what they need to do, and then it's just a matter of can we get the computer to do some of the more challenging tasks? And that's what this team would do. And so we took a prototype that that Apprente team had built. We turned it into a real software product over the course of two and a half years with McDonald's. And then to make that next step, we needed the backing and resources of a larger group, and McDonald's understood that, and that's how we ended up at IBM as part of the Watson Orders team.
Jerry Cuomo: Wonderful. Now I have to kind of rib you a little bit. Are there any mean-spirited colleagues, maybe former colleagues from Carnegie Mellon, where I believe you graduated, that joke with you about playing with toys and working at the drive-through at McDonald's?
Brian Langner: I mean, honestly, no. All my colleagues and friends have been pretty good about that.
Jerry Cuomo: Because you have the coolest job on the planet.
Brian Langner: I actually have been the one giving myself the ribbing on that one. So when I left my previous job to join the McDonald's tech labs team that became Watson Orders, I joked with them like, "Yeah, I quit my job," and "No, what are you doing now?" "Well, I work at McDonald's." "McDonald's?" Like, "Yeah, no, no, I take orders for the drive-through." And I got really good at being deadpan about that, and it was a lot of fun. One of my friends actually, she came up to me after I said that and just put her hand on my arm, was like, "Are you okay?" Because they know I have an advanced degree. I've been working in technology for years, like, "Why are you working at McDonald's?" And then of course I'd laugh, explain, and then they'd roll their eyes and like, "Oh, okay, I see what you're doing. You're pulling my leg."
Jerry Cuomo: Fun story, Brian. So Brian, what does your crystal ball tell you about the future here? What's the future state and perhaps what other industries might benefit by applying what you've learned around drive-through ordering?
Brian Langner: I mean speech and conversation technology is at a little bit of an inflection point right now. I think we've gotten enough advanced technology to the state where it works well in a variety of circumstances that we can start to talk about applying it to real world problems like this. I think things like Siri and Alexa are great in terms of pushing the field forward, but by themselves, they don't necessarily have a goal behind them.
Jerry Cuomo: I see.
Brian Langner: This technology that we're building at Watson Orders is goal directed, and I think that's one of the reasons why we've had a bunch of success with it because the user base is inclined to work with us to solve a problem together.
Jerry Cuomo: Perfect.
Brian Langner: I think applying that technology to similar things, yeah, we're going to see that increasing over the next several years. Ideally a lot of it being the Watson Orders technology that my team and folks are building, but I suspect there'll be lots of different entries into the space and I expect it to actually take off for certain kinds of applications.
Jerry Cuomo: It's amazing, Brian. So unfortunately we're out of time. I can go on and on talking to you about this subject, and I want to thank you for joining us on The Art of AI for Business podcast. Very cool topic. Thank you so much.
Brian Langner: Well, Jerry, thanks for having me. I've enjoyed it.
Jerry Cuomo: Well, that was informative and fun, and that's it for today. I've attached some links to information on Brian's project, which is IBM Watson Orders, in the description section of this podcast. I've also added links to the new Art of AI article series on Medium. Check it out. Once again, I'd like to thank Brian and DJ, and of course I'd like to also thank you all for listening in. This is Jerry Cuomo, IBM Fellow and VP for Technology at IBM. See you again on an upcoming episode.
Jerry Cuomo hosts an exciting behind-the-scenes look at how advanced AI technology is transforming how orders are taken at drive-thru restaurants. "I make computers speak" is the calling card of guest Brian Langner. Brian and Jerry discuss the challenges and rewards associated with conversational AI. Brian adds to his calling card, "however, the harder part is getting them to listen."
Brian, who currently leads AI Research and Projects for Watson Orders, highlights his journey from university studies in AI to Silicon Valley startups ToyTalk and Apprente, then to McDonald's through its acquisition of Apprente, and now to IBM through its acquisition of McD Tech Labs.
Brian and the Watson Orders team are developing and testing Automated Order Taking (AOT) technology in restaurants showing substantial benefits to customers and the restaurant crew experience. Brian further explains how AI and natural language processing will help scale the AOT technology across markets and tackle integrations including additional languages, dialects and menu variations.
- [02:47 - 04:02] Balancing Computer Voices: Brian Langner discusses the challenge of creating computer voices that are friendly and approachable while still maintaining a distinct non-human quality, ensuring effective interactions without entering the "uncanny valley."
- [05:00 - 08:11] Automating Drive-Thru Orders: Automating order taking at drive-thrus involves overcoming challenges like deciphering noisy environments, understanding customer ramblings, and handling diverse acoustic conditions, all while striving for accuracy.
- [08:20 - 10:21] Speech Recognition Improvements: Brian's team aims to automate 75% or more of orders from start to finish, without a human employee taking over. Advances in speech recognition accuracy and robustness play a pivotal role in achieving this goal and enhancing the customer experience.
- [10:30 - 12:07] Goal-Directed Interactions: The conversation explores the potential of speech and conversation technology in various industries beyond drive-throughs, highlighting the importance of goal-directed interactions to drive AI integration and innovation.
- [13:21 - 14:16] Future of AI Integration: AI and conversation technology are at an inflection point, poised for accelerated integration into real-world applications. As businesses recognize the value of goal-directed AI interactions, the technology's role is set to expand in diverse sectors.
Join the episode to gain insights into the intricate world of AI's impact on complex interactions, the challenges of automation, and the potential it holds for transforming industries and customer experiences.
* Cover art was created with the assistance of DALL·E 2 by OpenAI.
** Music for the podcast created by Mind The Gap Band - Cox, Cuomo, Haberkorn, Martin, Mosakowski, and Rodriguez