What We Need to Unlearn from Classical AI, feat. Dr. Richard Socher


Dr. Richard Socher: I think generative AI is extremely powerful when it takes you a very long time to generate an artifact, but it only takes you a few seconds to verify how useful or correct that artifact is.

Clara Shih: Welcome to Ask More of AI, a podcast looking at the intersection of AI and business. I'm Clara Shih, CEO of Salesforce AI. I've been exploring the practical use of AI in business for years, both as an entrepreneur and a leader in the tech industry. I'm so thrilled to be sitting down with Richard Socher today. Dr. Socher invented prompt engineering and was also one of the founders of the Salesforce AI research group, nearly a decade ago. Richard, it is so great to have you on the show.

Dr. Richard Socher: Thanks so much, Clara, and great to be here.

Clara Shih: So, you are clearly a brilliant creative thinker, and you've had a very interesting story. You grew up in Dresden, Germany, you came to the US initially in high school, as an exchange student for 10 months in Montana. How do you think those early experiences shaped your path and how did you end up in the field of AI?

Dr. Richard Socher: I think they shaped me in the sense that I love living in different places in the world. I realized that I can be happy and learn a lot from different people and different cultures and backgrounds. The way I got into AI was, in high school even, I always enjoyed math, but also English. My mother tongue is German, and so I enjoyed languages, and I thought about whether there's some way to combine these two passions. What's interesting is that math feels like the same kind of correct, logical math even if you go light years in any direction. But language can change. Some teenager says YOLO, and before you know it, there's a new word in the language. It's constantly evolving. People make it up and change it, and it's this very amorphous, beautiful construct that humanity uses to convey thoughts and everything. And so, I wanted to combine those two and found a program called Linguistic Computer Science at Leipzig University, back in Germany, where I did the first two years of my bachelor's. That's sort of my early foray into it. Then I fell in love with statistical pattern recognition and machine learning during my master's, and it was very clear to me that if you understood statistical learning really, really well, you could apply it to natural language processing, to computer vision, to biology, to medicine, and to everything else. And so, it was very clear I wanted to dedicate my life to making that technology work well and help in a lot of different use cases. Long story short, I took those ideas and said, "Well, can we apply them to natural language processing too?" And that was the magical moment for me, in 2010.

Clara Shih: That's incredible, that your dual passion for math and language led you to the insight that words can be represented as numbers, which led you to the idea of prompt engineering. So, it's been almost 10 years since you founded MetaMind. What are the different questions that AI researchers are asking today, versus 10 years ago?

Dr. Richard Socher: Yeah, 10 years ago, the biggest difference was that we couldn't even get AI to understand the world, let alone generate really powerful outputs. And so, we had to first understand images. ImageNet was a big part of that, and large convolutional neural networks were a big part of that. Then we could apply that to a bunch of different use cases in radiology, for instance, and pathology, counting blood cells, for instance. Then we could also understand natural language better. We could classify the sentiment really well, we could understand, is this a service email or a sales email or a marketing email, and which department should it maybe go to? We could make a lot of predictions. So, it's predictive or discriminative AI, not in the sense of discriminating against people, but discriminating between cats and dogs in an image, and things like that. And then, in the last few years, we've seen a huge set of improvements in generative AI also, where not only can AI understand natural language, visual inputs and so on, but it can also generate new kinds of outputs and modalities, like images as well as text. And of course, text connects to thought and to thinking and to a lot of different work streams in the enterprise and everywhere else. So, being able to generate language is a massive unlock for AI, and it won't stop there. We can also, and this is one of the projects we did at Salesforce, generate proteins for biology. We'll soon be able to generate videos, probably in six to 18 months we'll get short videos, and then music will probably come too, in the coming quarters and years, mostly held back by copyright questions. But yeah, it's going to be an exciting time for generative AI.

Clara Shih: You talked about these new modalities like music and video, and from a research standpoint or from a compute standpoint, how is generating these different modalities different than text?

Dr. Richard Socher: The biggest difference is, of course, that they're sort of continuous and hence also very high dimensional. But, what's surprising is that we've actually been able to unify and make a lot of these things more and more similar. Transformer architectures also work in computer vision, and large convolutional networks could also be used, in part, for some natural language processing. So, the biggest similarity on the AI side is actually that you can have these foundational models that work in multiple different modalities, and that's been super exciting to see. Of course, there are still some differences, like so-called diffusion models, which are used primarily in generative image use cases and not really in the natural language processing world. And then, of course, there's the big question for applications of which kinds of organizations have image data versus text data, and the truth is, every single company out there uses language to communicate with their customers, to communicate internally, and to talk about their vision and their products, and so on. Whereas not every company out there uses computer vision these days in their workflows. So, I think the impact of text versus images is still going to be larger.

Clara Shih: But, the transformers work because at the end of the day, whether it's a text or image or video, it's ones and zeros.

Dr. Richard Socher: That's right. The biggest unlock for NLP was to think of words as just lists of numbers, rather than as discrete entities, where cat, car, and tree have no connection whatsoever with one another. We really know that cat and dog are more similar than cat is to car, even though in string distance, cat and car are maybe closer to one another than to dog. Really thinking of everything as a continuum has helped AI and NLP a lot, and once everything is a list of numbers, neural networks can be used to make predictions and to understand and generate that kind of output.
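
To make the "lists of numbers" idea concrete, here is a minimal Python sketch. The three-dimensional vectors below are invented for illustration, not taken from any real embedding model, but they show how cosine similarity can capture that cat and dog are semantically close while cat and car are not, despite their string similarity:

```python
# Toy illustration of "words as lists of numbers". These 3-d vectors are
# invented for this example, not taken from any real embedding model.
import numpy as np

vectors = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.8, 0.9, 0.2]),
    "car": np.array([0.1, 0.2, 0.9]),
}

def cosine(a, b):
    # Cosine similarity: close to 1.0 means the vectors point the same way.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors["cat"], vectors["dog"]))  # high: semantically close
print(cosine(vectors["cat"], vectors["car"]))  # low, despite similar spelling
```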

Clara Shih: You've said before that transformers aren't brain inspired, they're GPU hardware inspired. Can you talk more about that, and is it the hardware that unlocked this ability for us to do unsupervised AI?

Dr. Richard Socher: Yeah, I still remember in the early days in 2010, in Andrew Ng's lab, we tried to make GPUs work and it was hard. There were a lot of folks hacking on that, and we tried to get some more information from Nvidia, and in the early days they were like, "What are you doing? How many GPUs are you buying? Are you gamers? Are you in graphics?" It took us a while to convince them that this was going to be really interesting. But indeed, and this is quite technical, neural networks are powered by large matrix multiplications, again, just lists of numbers, or matrices, in high dimensional spaces, and as you multiply them, you want to do that very, very fast. It turns out early GPUs did that extremely fast, mostly for gaming applications, but you can use that same hardware for AI applications. And because we wanted to multiply larger and larger matrices as fast as possible and CPUs weren't as good for that, we kept going more and more in that direction. Transformers are highly inspired by what can be parallelized on current GPU hardware, and you could argue that the fashion in which we're exploring most architectures is similar to looking for the keys only under the street lamps. Now, these are pretty large lamps, everything can be described as a list of numbers, everything can be described in a neural network, so maybe it doesn't matter as much. But we are highly constraining ourselves to only looking for AI models that can run on current GPUs with very fast matrix multiplication.
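
As a rough illustration of that hardware point, here is a minimal PyTorch sketch (my example, not from the episode) that times the same large matrix multiplication on CPU and, when CUDA is available, on an Nvidia GPU:

```python
# Minimal sketch of why GPUs matter: neural nets reduce to large matrix
# multiplications, which parallelize well on GPU hardware. Requires
# PyTorch; the GPU branch runs only if CUDA is available.
import time
import torch

n = 4096
a, b = torch.randn(n, n), torch.randn(n, n)

t0 = time.perf_counter()
a @ b
print(f"CPU matmul: {time.perf_counter() - t0:.3f}s")

if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    a_gpu @ b_gpu                 # warm-up: triggers cuBLAS initialization
    torch.cuda.synchronize()      # wait for async GPU work before timing
    t0 = time.perf_counter()
    a_gpu @ b_gpu
    torch.cuda.synchronize()
    print(f"GPU matmul: {time.perf_counter() - t0:.3f}s")
```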

Clara Shih: It's so interesting, right? We always thought software people and hardware people were different sets of people, and now software people have to become hardware people and hardware people are trying to become software people. So, you mentioned Nvidia, and that's fascinating, that maybe when you started working with them during your PhD, they didn't even yet understand the AI potential of their GPUs, which they had created for gaming. Certainly, they've caught on now, and the rest of the world has caught on. Of course, they're not the only ones racing to build more GPUs, there's a shortage out there. Every chip company is doing the same thing. Even software companies, hyperscalers, are building their own hardware. How do you see that evolving? And given that a lot of this is hardware driven, do you see different types of AI models coming to the forefront as the hardware evolves?

Dr. Richard Socher: Yeah, it's hard to overestimate how impactful Nvidia has been to this resurgence and boom in AI. Really, we're all using Nvidia GPUs for pretty much all our workflows, and this is expanding the use cases and the support of AI. They're fully engaged and, actually, a great partner for a lot of AI researchers, organizations, and practitioners. So, I don't currently see a whole lot of new hardware displacing Nvidia GPUs anytime soon.

Clara Shih: There's also been talk about doing the LLM compute on the edge. How close are we to that in your opinion?

Dr. Richard Socher: It's an interesting research philosophy. I'm personally a little bit more excited about doing things that have not been possible, with as few restrictions on them as possible. So concretely, for instance, if you try to have an AI write you a poem or hold a full conversation, that's an incredibly hard problem to solve. At first, it makes the most sense to try to solve that problem in an unrestricted fashion, and once you have solved it, then you can add restrictions. Maybe you want federated learning, where different partners can never see each other's data but still train a model together. Or, you can say, I want it to be completely privacy preserving and never able to learn a social security number or something. Or, you can say, I want it on the edge, on much smaller compute, and then run it there. My hunch is that will usually happen quite naturally; once you show something is possible, people will keep modifying it and improving it. You see this with the open source large language models now, folks are quantizing them, finding really clever hacks, sometimes just very obscure compiler flags and things like that, to make them eventually work on your phone. And I think in the next few years, we will figure out the minimum number of parameters needed for certain use cases.

Clara Shih: It's so funny, we were just talking about what's the minimum number of parameters we can use in a model, and just a few months ago, I think it was a race to see who could get the largest number of parameters. Just how quickly that's all changing. Let's switch gears. Your current company You.com is a multimodal conversational AI approach to search. You are taking on Google, that is a tall order. How did you come up with the idea, and what's your unique approach?

Dr. Richard Socher: I've been working in deep learning for natural language processing for over a decade, and the biggest application of NLP is in search. It's in answering people's questions, giving them useful answers that are actionable and helpful, and helping them learn things. It's such an exciting and interesting application of all of these deep learning techniques that we've developed, me and many others. And so, I thought it's kind of crazy that after 20 years, despite seeing all of these improvements, our main search experience is still a list of blue links, and we could clearly do better. So, we started You.com in 2020, two years after Brian McCann and I had invented prompt engineering, which is basically the idea that you can have a single model and keep asking it different kinds of questions in order to get different answers, but it's all trained in one joint large model. This is a paper that OpenAI had cited back in the day, and it was very influential for some of their thinking on GPT-1 and 2. And so, we saw the writing on the wall and threw ourselves in there, trying to make it happen.
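
The core mechanic he describes, one shared model with the task posed as just another natural-language input, can be sketched in a few lines of Python. The `llm` function below is a hypothetical placeholder standing in for any large language model endpoint, not a real library call:

```python
# Sketch of "one model, many tasks": the task is posed as just another
# natural-language input. `llm` is a hypothetical placeholder, not a real
# API; swap in any chat/completion endpoint.
def llm(prompt: str) -> str:
    return f"<model answer to: {prompt!r}>"  # canned placeholder response

context = "The service was slow but the food was excellent."
prompts = [
    f"What is the sentiment of this review? {context}",
    f"Translate this review to German: {context}",
    f"Summarize this review in five words: {context}",
]
for p in prompts:
    # Same weights every time; only the question changes.
    print(llm(p))
```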

Clara Shih: You came up with the concept for prompt engineering. It's a big idea and it's really the underpinning of so much of what we're seeing now, with generative AI. What inspired you to come up with that and how'd you do it?

Dr. Richard Socher: It's actually been a long dream of mine to think about a single model for all of NLP, and really, eventually all of AI. We had a paper called GloVe, Global Vectors, in 2014, that set the foundation of thinking of words as lists of numbers. We were able to pre-train these word vectors in a fashion that allowed us to use a ton of data from the internet. We tried to get as much data as we could with Common Crawl and other publicly available datasets from the internet, and that allowed us to get words with much better similarity measures, so we would know that cat and dog are more similar. You get these interesting patterns, where you could take the vector of king, add the vector of woman, and subtract the vector of man, and then you get to queen. So, king minus man, plus woman goes to queen.

Clara Shih: Turn it into a math equation.

Dr. Richard Socher: That's right, turn language and semantics and world knowledge into a math equation. And so, that was kind of my first spark, where I said, "Wow, it's amazing, now we can take these word vectors and apply them to sentiment analysis, machine translation, information extraction, summarization, and all of these different tasks." We all want to use these same word vectors and keep using them for different things. But then you realize that after the word vectors, you want to understand the whole sentence, not each word in isolation, but in context. So, that's when we developed CoVe, contextual vectors, which then became ELMo and then BERT in quick succession, and inspired that line of research. Then we said, "Okay, great. Now we have a pre-trained encoder." That's, again, the discriminative part of AI, where we understand everything and share that across all these different tasks, again translation and summarization and sentiment analysis. But then we thought, ultimately we also want the output to be shared, so we can have a single model, and then have the whole world, in the future, kind of backpropagate into this model and make it better together, and train in a fashion that is more cumulative, rather than restarting every AI project from scratch. That's what led us to decaNLP, a single model for all ten different tasks. But then you realize, if you want to pre-train and share the entire neural network architecture, you really have to make the input be not just the context, but also the task that you're asking the model to do, and that was the main idea. To have one pre-trained large model, you need to make the task be just another input to the model. The way we did this was as a natural language question: What is the translation? What is the sentiment of this sentence? Who's the president in this paragraph? That essentially became prompt engineering with a single model. The big change, of course, was to also include language modeling, which was the task that was actually on our backlog to implement in the next iteration. But OpenAI did this right away, and the language modeling objective, together with this idea of adding the task as just another input and question to the same model, is what led to ChatGPT now.
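
The king minus man plus woman analogy is easy to reproduce with publicly released GloVe vectors. A minimal sketch using gensim's downloader (the dataset name is gensim's, not from the episode; the first run downloads the vectors):

```python
# Reproducing king - man + woman ~ queen with publicly released GloVe
# vectors via gensim's downloader. The dataset name is gensim's, not from
# the episode; the first run downloads roughly 130 MB.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")  # 100-dimensional GloVe vectors
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Typically prints something like [('queen', 0.77)]
```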

Clara Shih: What a game changing moment for everyone. Some of these large language models struggle with outputs in non-English languages. Is that a function of the training set primarily being in English, since most of the internet is in English, or are there structural reasons in how certain languages are set up?

Dr. Richard Socher: It's actually mostly a training data question, not having enough training data in other languages. What's interesting is how these models think of spaces between words, long words, and long sentences; to them, it's all just a sequence of what we call byte pair encodings. Basically, these models don't even have a full conception that this is a word. It's just like, oh, catnip might be "cat" and "nip," or something like that. Or, Richard might be "Rich," "a-r," and then "d."

Clara Shih: It's just tokens.

Dr. Richard Socher: Exactly, it's tokens, byte pair encodings. And so, whether it's Finnish or German, where you have super long words, or just longer sentences, to the model it doesn't make a difference. But having enough training data in those languages, that is the biggest bottleneck. That's where I think we need to work hard to bridge the gap and include a lot of other languages, so that those languages can participate in this new future.
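
You can see byte pair encoding in action with OpenAI's tiktoken library. This is my illustration, not from the episode, and the exact splits depend on the vocabulary, so treat the outputs as examples rather than guarantees:

```python
# Byte pair encoding in action with OpenAI's tiktoken library. The exact
# splits depend on the vocabulary, so treat the outputs as examples.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for word in ["catnip", "Richard", "Donaudampfschiff"]:
    pieces = [enc.decode([i]) for i in enc.encode(word)]
    print(word, "->", pieces)  # e.g. a long German word splits into subwords
```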

Clara Shih: You.com is a consumer search engine powered by LLMs. When you were at Salesforce, of course, you were in the enterprise space. What are the different ways that one would approach generative AI for consumers, versus enterprise?

Dr. Richard Socher: In some ways, consumer moves very quickly, and the bar is high in both. The common mantra used to be: ship a consumer product as quickly as you can, get an MVP out, and then iterate, and people would say, if you're not embarrassed by your first product, you shipped too late. But I think it's becoming more and more clear that the bar for consumer products is getting higher and higher as well, so in that sense consumer is merging with enterprise. Now, in enterprise, of course, you have an even higher bar in some cases. In the medical space, in the legal space, you definitely have to be correct. And where they actually merge and are surprisingly similar again is in the fact that these LLMs often need a retrieval backend. You can think of large language models almost more as a reasoning engine, but the way they store facts is very fuzzy. It's a little bit like a smart, but somewhat forgetful person, who gets some of the details wrong. You know, when your uncle tells a story from 20 years ago, but he fudges the details a little bit and maybe amplifies some of his heroic deeds, or something. That's sort of how we can think about these LLMs. And so, it's very helpful to have a retrieval backend, a search backend, and this is something we've built at You.com over the last almost year now, and we're getting better and better at bringing facts into such an LLM.
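
A retrieval backend can be sketched in a few lines: look up the most relevant snippet first, then hand it to the model as context. In the Python sketch below, TF-IDF stands in for a real embedding model, `llm` is a placeholder rather than a real API, and the documents are invented for illustration:

```python
# Minimal retrieval-augmented sketch: fetch the most relevant snippet,
# then hand it to the model as context. TF-IDF stands in for a real
# embedding model, `llm` is a placeholder, and the documents are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "You.com was founded in 2020 by Richard Socher.",
    "GloVe represents words as dense vectors.",
    "Transformers parallelize well on GPU hardware.",
]

def retrieve(query: str) -> str:
    vec = TfidfVectorizer().fit(docs + [query])
    sims = cosine_similarity(vec.transform([query]), vec.transform(docs))
    return docs[sims.argmax()]  # best-matching document

def llm(prompt: str) -> str:
    return f"<answer grounded in: {prompt!r}>"  # placeholder model call

question = "When was You.com founded?"
snippet = retrieve(question)
print(llm(f"Answer using only this source, and cite it.\n{snippet}\nQ: {question}"))
```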

Clara Shih: And, of course, the retrieval is across both keyword search as well as vectors. What's the state of the art, when it comes to vectorization and embeddings?

Dr. Richard Socher: There are a lot of different schools of thought there. Usually, when you want to embed a ton of different paragraphs or sentences and documents, you want to use an open source model, because it can get very expensive to do that on your entire corpus. And then, you have different kinds of vector databases that are also open source, like Chroma, which we've also invested in. There are some companies like Pinecone that offer managed services as well. I think we'll see a lot more exciting movement in that space, and work on getting vector databases to many billions of vectors, because that's ultimately what we'd want. Ideally, we vectorize the entire world, starting with the internet.
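
As a concrete example of the kind of open source vector database he mentions, here is a minimal, hedged sketch with Chroma's Python client (following chromadb's documented API; by default it embeds documents with a built-in model, and the documents here are invented):

```python
# Hedged sketch of an open source vector database using Chroma, which is
# mentioned above. Follows chromadb's documented API; by default it embeds
# the documents with a built-in model.
import chromadb

client = chromadb.Client()  # in-memory instance
collection = client.create_collection("docs")
collection.add(
    ids=["1", "2"],
    documents=[
        "Large language models benefit from retrieval backends.",
        "GPUs accelerate large matrix multiplications.",
    ],
)
hits = collection.query(query_texts=["why use retrieval?"], n_results=1)
print(hits["documents"])  # nearest stored document(s)
```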

Clara Shih: It's happening so fast. There have been all these developments in retrieval augmentation and semantic search. It's gotten so good, to the point where a lot of what AI researchers, even a few months ago, thought was only achievable through training or fine-tuning a custom model, you can now achieve with any model, just using retrieval augmentation. What do you think is the space that's left where it's essential to have a custom model?

Dr. Richard Socher: That's an interesting question. My hunch is that open source models, custom models, and organizations tuning their own models will become more and more common. And indeed, even the best models like GPT-4, which is actually an option in our pro subscription service inside You.com, benefit from having this retrieval augmented backend. Essentially, last year in December we gave our own model and other models, and since last month GPT-4 as well, access to the internet. That way, these models are much more factual, they can actually be up-to-date, you can ask about recent events and things like that, and they can cite their sources. This is actually a pretty important aspect of generative AI. I have a high-level mental model that I came up with: generative AI is extremely powerful when it takes you a very long time to generate an artifact, but only a few seconds to verify how useful or correct that artifact is. So, for instance, you can quickly look at an image and see if it looks good or not, but it would take you an hour to actually draw or create that image or illustration. That's a perfect use case. But in natural language, if it writes you a five page article, or even a five paragraph article, and it takes you a super long time to verify every single fact in it, then it becomes less and less useful. And so, to make LLMs more useful for search and more trustworthy, we need to push more and more citations and connections to real facts that can be quickly verified into these models. That's what we've been doing, and we've actually been getting a lot of interest from other companies to get access to this technology that we've built.

Clara Shih: Maybe we can call that Socher's Ratio of how much time it takes to review, versus to generate, as a new metric.

Dr. Richard Socher: That's a good idea.

Clara Shih: Yeah. Now, of course, tuning a model and using RAG are not mutually exclusive, they can really enhance each other. I think the challenge, though, that a lot of companies face, especially if they're outside the Fortune 500 or Fortune 100, is that they don't have a huge team of data scientists that can focus on training and tuning their own model. And so, the question is, is RAG on its own enough? Will that get you 80%, 90%? When do you actually need your own custom model?

Dr. Richard Socher: I think a lot of organizations probably don't need their own models. It often boils down to privacy restrictions and not wanting anything in your organization to leave that organization, for competitive reasons and otherwise. And then, there are probably more and more custom situations where you don't want the model to be able to talk about just anything, right? Because otherwise, you set up this specific customer support situation, and before you know it, people try to make a girlfriend or boyfriend out of your service bot, and you're like, ah, that's not the best use of our compute resources. Or, they have a therapy session when you really just wanted to help them log back into their account or do some very specific task around their account.

Clara Shih: Okay. So, most companies, the vast majority if not all large enterprises, are going into this whole generative AI era already having many classical AI models. How do you see these traditional models fitting in with the new models?

Dr. Richard Socher: That's a good question. I think it really depends on the company. It's very hard to have a general answer here. I think a lot of the traditional models have been making predictions, and my hunch is, it will be very powerful to be able to connect those predictions into an LLM. I don't think the LLM will replace a credit score, or whether this particular knowledge base article might be useful. I think that's all content and retrieval mechanisms and sub decisions, that could feed into an LLM to help you then reason more holistically around and over and across your business.

Clara Shih: So, you would advise these companies to perhaps create a plugin for each of their existing classical models?

Dr. Richard Socher: I think it'll be powerful to allow LLMs to get access to these other models' outputs, so they can reason over them and provide even more useful answers. At the very highest level, even a CEO can then ask questions like, "Oh, what's my predictive forecast?" At Salesforce, we have trained a lot of forecasting models too, like how many of your geos will hit their quota, and things like that. These forecasting models are very powerful. They're already very specialized and trained for organizations, and it's unlikely an LLM will do any better than those, because LLMs haven't been that great at sequential financial data and forecasting. They're much better at financial language and sequences of actions, and things like that. And so, I think it'll be a great way to combine these past models with LLMs.

Clara Shih: And, of course, this is exactly what Salesforce and others are doing with our LLM agents, using a natural language querying interface to access classical models and really any service that a company might have as part of their software stack. Really exciting stuff. There's also been a shift, though, where it's not just complementary; there's also a shift in how AI is approached versus even 24 months ago. What are some of the things that traditional AI researchers and data scientists need to unlearn for this new era?

Dr. Richard Socher: That's a good one. I think the biggest, most interesting shift is that in traditional AI, we used to think of train, development, and test sets, and we could very quickly iterate on a specific problem. We tune the algorithm, we add some new training data, we test it very quickly on our development set, and then we know whether it works better or not. That has actually, in some ways, regressed. Now we're changing the prompt, and no one currently has a really good way of saying, oh, I changed the prompt from "be a helpful and professional agent" to "be a kind, professional and friendly agent," and here's how much that improved things. But the biggest thing is not having to train from scratch, not having to have as much training data. You can get away with much less training data if you can phrase something as an LLM-like problem, and then the LLM can very quickly get you to an 80% solution. You can test it out with customers, you can iterate and collect more data in that process, maybe still have some things done by humans and some by an LLM, and then over time the LLM can learn more and more, get to 95-plus percent, and then really run some processes fully automatically. I think that's the power of these large pre-trained language models, compared to having to start completely from scratch, with nothing from the past.
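
One way to get that quick-iteration loop back is to treat prompt variants like model versions and score them on a small labeled dev set. Here is a hedged Python sketch (my illustration, not from the episode; `llm` is a placeholder to be wired to a real model):

```python
# Sketch of the missing evaluation loop: score prompt variants on a small
# labeled dev set instead of guessing which wording works better. `llm` is
# a placeholder; wire it to a real model to use this for real.
def llm(prompt: str) -> str:
    return "positive"  # canned placeholder prediction

dev_set = [("I loved it.", "positive"), ("Terrible service.", "negative")]
variants = [
    "You are a helpful and professional agent. Sentiment of: {x}",
    "You are a kind, professional and friendly agent. Sentiment of: {x}",
]
for template in variants:
    correct = sum(llm(template.format(x=x)).strip() == y for x, y in dev_set)
    print(f"{correct}/{len(dev_set)} correct :: {template[:45]}...")
```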

Clara Shih: It really is counterintuitive with training. Less is more these days, and it really also speaks to the power of relying on retrieval augmentation, versus just training and tuning with more data. What advice would you have? I don't know if you have kids, but for any parents out there, how should we think about educating our kids for this world?

Dr. Richard Socher: I think that will be a more and more important question. I mean, it's always been important, and education is something I'm very passionate about too. I think we need to make sure our kids still learn how to learn. Some parts of education will stay the same, like memorizing some things, because if you always have to look everything up at every given moment, it's hard to really think creatively about a new area. Kids will still need to know math to understand how the machines that increasingly impact their lives work, and they need to learn how to program. I think a very clear, actionable thing right now is to make sure that every high schooler learns how to program, at least in some simple language like Python. Whether you go into law or medicine, it will be helpful to know how to program.

Clara Shih: I thought the large language model could program for you?

Dr. Richard Socher: It can, but will it do it perfectly every time? Will it give you the optimal code? If there's a bug and you get an error message, do you know how to fix it? I think we'll still need to learn how to code. We don't need to do it in assembly anymore, or C or C++; a lot of people can be very productive and build amazing things in higher level programming languages, but it's still very helpful to know how to program, just to understand how it works. And so, I think some of the foundational things are going to stay: physics, biology, chemistry. I think there will be lots of exciting novel breakthroughs in those fields as well, especially once they really embrace AI. Likewise economics and protein engineering and all these applications. So, my hope is we still teach kids how to learn a lot of different things very quickly, how to be excited and interested in new things, then a solid STEM base, and eventually make sure they still have the human values of empathy and kindness toward each other.

Clara Shih: Richard, thank you so much for joining us. It's great to have you on the show, and thank you again for the incredible AI foundation that you helped put in place, here at Salesforce.

Dr. Richard Socher: Thanks, Clara. It's very fun chatting.

Clara Shih: That's all for this session of Ask More of AI, the podcast at the intersection of AI and business. Follow us wherever you get your podcasts, and follow me on LinkedIn and Twitter at Clara Shih.

DESCRIPTION

In this episode of Ask More of AI, Clara Shih sits down with CEO and Founder of You.com Richard Socher. They talk about how companies can decide whether or not they should develop and train their own large language model (LLM), working with graphics processing units (GPUs), and the one thing all high schoolers need to learn.