DJ: You're listening to the Art of AI podcast with your host, Jerry Cuomo.
Jerry Cuomo: Thank you, DJ. Folks, to say I'm the luckiest engineer on this planet might be an understatement. When I think about the richest part of being an engineer, I have to say it's the people that I work with. And today, you'll meet one of my teammates who I go to when I have a question about AI. This week, in fact, I went to him to talk about hallucinations in AI with an eye towards how some of the things he's working on might help reduce the occurrences. His name is Mihai Criveti, and he was the perfect teammate to have this conversation with because he just posted a paper on Medium titled, "Understanding Generative AI Large Language Model Limitations, and How Retrieval Augmented Generation Could Help." So today we are fortunate to have my colleague Mihai on the Art of AI. Mihai is better than ChatGPT because he always has an updated perspective that is grounded in fact, not hallucinations. And with that, I'd like to welcome Mihai to the Art of AI.
Mihai Criveti: Thanks, Jerry. It's a pleasure to be here. I'm a fan of your book. I'm a fan of your shows. So thrilled to have this conversation.
Jerry Cuomo: Oh, geez. Thank you so much, Mihai. Speaking about enjoying, I enjoyed your article that you recently wrote, that you outlined some of the common misconceptions about what LLMs can and can't do. If you don't mind, let's jump right in and start there. We hear so much, Mihai, about what LLMs, large language models, can do, but can you share with our audience your view on some of their limitations?
Mihai Criveti: I think that's a very interesting question. First, it's important to understand what a large language model really is. It's a model that only acts on language. So you can't really use it by feeding it a PDF file, a Word document, a binary file, images, or anything other than text.
Jerry Cuomo: Good point.
Mihai Criveti: It also only generates text. Now, the way it generates is very similar to your phone's autocomplete. It sees a sentence and then it predicts the next word. Then it looks again at the whole thing with the new word there and it generates the next word. Many of these models have been pre-prompted. So whenever you have, for example, a conversation with any conversational AI platform, there's usually about a page of prompt there that goes, "You are a helpful and useful assistant and you will respond in a truthful way. You will not make things up." And then it inserts the previous conversation that you've had with that model, so it can have conversational memory within limits, because there are token limits to these models, and then whatever prompt you give it. In fact, it predicts based on a lot of text that you may not see when you use that conversational AI model, and because it's a prediction-based model, sometimes the context isn't really sufficient for it to generate a good quality prediction. So these models don't really learn. Whenever you hear, "Okay, I've been speaking with ChatGPT," or any of these models, "and it's learned from my interaction," that really doesn't actually happen. Right?
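The prompt assembly Mihai describes can be sketched in a few lines of Python. This is a hypothetical illustration, not any vendor's actual implementation; the system prompt text, role names, and formatting are all assumptions:

```python
def build_prompt(system_prompt, history, user_message):
    """Assemble the full text an LLM actually sees for one chat turn.

    The model has no hidden memory: the system prompt and every prior
    turn are re-sent as plain text on every single request.
    """
    lines = [system_prompt]
    for role, text in history:
        lines.append(f"{role}: {text}")
    lines.append(f"User: {user_message}")
    lines.append("Assistant:")  # the model "autocompletes" from here
    return "\n".join(lines)

prompt = build_prompt(
    "You are a helpful and useful assistant. You will respond truthfully.",
    [("User", "Hi"), ("Assistant", "Hello! How can I help?")],
    "What is a token limit?",
)
```

Because the history is replayed on every turn, long conversations eat into the token limit — which is one reason quality can degrade as a chat grows.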
Jerry Cuomo: Right.
Mihai Criveti: Your API will feed your previous conversations into the next, and you'll often see, as you start chatting with it, the quality of the conversations actually degrades. It also doesn't reason. It doesn't do math. It doesn't use your data. It doesn't use the internet or do anything like that in the process itself. So these are many of the limitations, but because of these limitations, what you often get is output, which can only be described as hallucinations.
Jerry Cuomo: That's right.
Mihai Criveti: So output which is either not correct or simply nonsense. It'll generate things like the sky is purple, which isn't very useful.
Jerry Cuomo: So let's dig into that because that's really the topic of the day. And in a recent episode, Ruchir Puri said that hallucinations were one of the biggest open issues to solve around trustworthy AI. Can you explain a little bit more what is meant by hallucination? And you just talked a little bit about it. And maybe go into a little bit of why the phenomenon occurs.
Mihai Criveti: Sure. The reason is that a lot of these models are designed to produce coherent and fluent text, but just because something looks grammatically correct doesn't necessarily mean it's correct. So here's an example of what you can describe as a sentence contradiction. "The tomato is a fruit," the first sentence from my LLM response, and then it says, "the tomato is a vegetable" somewhere at the end of my output. Which one is it? It just contradicted itself. Just because the sentences are grammatically correct doesn't mean the model isn't contradicting itself. You can also have a prompt contradiction. So, "Generate me a positive review of Jerry's book," and then the model comes back with the worst negative things that you can imagine. It's like, "Wait, I asked you for a positive response. You generated something negative." That's a contradiction of the prompt. You also have factual errors, where it says, "Jerry has written 173 books." Maybe he's written 173 articles or patents. The model has been trained on that data. It used that information and it came back with a response. It seems plausible, but it's not factually correct.
Jerry Cuomo: Yeah. It's interesting how the answers come back sometimes in the ballpark, as you said, they seem plausible, but yet they're factually incorrect. Why is it that sometimes it will come close?
Mihai Criveti: Well, imagine that for a model, when you start talking about something, it doesn't really have the context. It doesn't understand your meaning. It doesn't know who you are. So when I have a conversation with a model and I say, "Talk to me about Jerry," it doesn't know that I'm Mihai, that I'm in consulting, that I'm at IBM, and it doesn't make that inference. It doesn't have that information. So sometimes it's missing the context of the question.
Jerry Cuomo: Missing the context. Yep, that's an important point.
Mihai Criveti: If you give it the context, and if it has been trained on that data, it's much more likely to give a correct answer. High-quality prompts with sufficient contextual information, on models which have been trained on quality data, are less prone to hallucinations. So the good news is that some of the quality of the answers is under your control. Some of the quality of the answers is under the control of those who have trained the model, and it's not an easy problem to solve because these models have been trained on hundreds of terabytes of data. Some of that data could be Wikipedia. Reasonably good. You can argue about the bias in it, but it's reasonably high-quality data. But let's say some of these have been trained on forums and Reddit and Twitter and user conversations; you have a lot of contradictory information in there. So short of a person going in there by hand every day, cleaning the data, and retraining the model, you are not going to completely eliminate bias and hallucinations.
Jerry Cuomo: All right. So Mihai, your article talks about a few things related to how to mitigate hallucinations. Can you share some of those approaches?
Mihai Criveti: Sure. Because many of the hallucinations occur, or at least the ones that we can control, because of a lack of context or contextual understanding, the key way to work around this is to write clear prompts. So if you look, for example, even at the prompt for something like Llama 2, it starts with, "You're a helpful and useful assistant. You will not write things which you are not aware of." So that's the first part of the prompt. But the second part is injecting the right context into it. In some cases, you might actually know the context. So for example, let's say that we're building a large language model to deal with user support, and the support is specific to Red Hat's OpenShift on IBM Cloud. That means in every prompt, we can now inject that context, and we've built an agent where we say, "You are a support assistant providing level-two support for IBM Cloud and Red Hat OpenShift. You will provide information." So now, if the model has been trained on that dataset, it'll respond to questions with that context. We can also build conversational memory where we feed whatever has been previously said into the chat, into the conversation. But the third part is really more interesting here. That's a form of retrieval augmented generation. Let's say we have a knowledge base, something that shows all the ways in which problems have been solved. If we could search that knowledge base and insert relevant articles into the conversation, into the prompt to the model, where we say, "Read this text below. Using the information below, reply to the user," then it'll generate even better responses. Other mechanisms include providing examples. So when you talk to a model, you provide examples, one-shot or few-shot, where you say, "Here's an input. Here's my expected output," and you give it this example a couple of times. You are going to get much better quality responses. But here's the catch: you have a limit.
Most of these models have 4,096 tokens, about 3,000 words, about seven pages of total input and output text. There are larger models, but usually the quality degrades after 4,096 tokens. So you can't really say, "Here's a thousand pages of text as context." You need to give it the top 10 relevant things. For example, you've seen a lot of search engines now implement GPT-like functionality or LLM functionality. What they really do is they literally do a Google search, a Bing search, whatever else. They feed those top results into the LLM, and then the LLM has that context. That's a form of retrieval augmented generation. You can do even cooler things with a vector database.
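The retrieval-augmented pattern described above can be sketched in Python. This is a toy illustration: the keyword-overlap scoring is a stand-in for a real search engine or vector database, and the knowledge-base entries and prompt wording are assumptions:

```python
def retrieve(query, knowledge_base, k=3):
    """Return the k documents sharing the most words with the query.
    A naive stand-in for a real search engine or vector index."""
    q_words = set(query.lower().split())
    scored = sorted(
        knowledge_base,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def rag_prompt(query, knowledge_base):
    """Build a prompt that grounds the model in retrieved articles."""
    context = "\n".join(retrieve(query, knowledge_base))
    return ("Read the text below. Using the information below, "
            "reply to the user.\n\n"
            f"{context}\n\nUser question: {query}")

kb = [
    "To resize an OpenShift worker pool, edit the worker pool configuration.",
    "IBM Cloud billing questions are handled by the billing support team.",
    "Pods stuck in Pending usually indicate insufficient cluster capacity.",
]
prompt = rag_prompt("Why are my pods stuck in Pending?", kb)
```

Only the retrieved snippets travel with the question, so the whole prompt stays comfortably inside the token window.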
Jerry Cuomo: All right, let's just break this down. So you mentioned prompt engineering. Let's load the model up with some instructions, and as long as we can stay within its memory, its token limit, those instructions could be really effective in helping elicit a great response. And you mentioned this term now several times, retrieval augmented generation, and now you just mentioned vector database. So can you put these pieces together for us?
Mihai Criveti: Sure. So there's more than one way to do it. Probably the best way, if you have unlimited resources and money, or you're GPU rich and you have a lot of GPUs available, is to completely retrain your model on the new dataset every time there is a new document.
Jerry Cuomo: Oh, sure.
Mihai Criveti: This can get expensive fast, right? If you look at the Llama paper, that's an estimated $20 to $40 million in time and money spent on training, a lot of carbon emissions, quite complex. You could also fine-tune the model, but that has limitations as well, especially if you're getting data in real time. It's also expensive, though not as expensive, and it doesn't always have the best results. So another way of providing that context is that in every single prompt, you're giving just enough information to be relevant. So literally as you are building your prompt, you are starting with, "Here is some context," and you provide the top 10 results from your database search, for example, or your knowledge base of relevant articles. The key there is relevant. And then you're giving it the instruction, "Read the text above, combine it with your own dataset, and provide an answer to the user after reading the user's question." And that kind of generates high-quality responses, which also have the benefit of fitting in that token window and being a lot faster, because here's a problem: if you are feeding large amounts of data into an LLM, it can be quite slow and expensive. And the cool part is you can always say which sentences or which datasets generated an output, which is something you typically don't get with LLMs.
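Keeping "just enough" context inside the token window comes down to a simple budgeting step, which might look like this. The 4,096-token window and the rough four-characters-per-token estimate are assumptions; real tokenizers vary by model:

```python
def rough_token_count(text):
    """Very rough token estimate: about 4 characters per token in English.
    Real models ship their own tokenizers; this is only a budgeting heuristic."""
    return max(1, len(text) // 4)

def fit_context(ranked_docs, window=4096, reserved_for_answer=1024):
    """Pack the highest-ranked documents into the prompt, most relevant
    first, stopping before the context window would overflow."""
    budget = window - reserved_for_answer
    chosen, used = [], 0
    for doc in ranked_docs:
        cost = rough_token_count(doc)
        if used + cost > budget:
            break  # anything past this point would not fit
        chosen.append(doc)
        used += cost
    return chosen

docs = ["short article " * 50, "another article " * 400, "a" * 20000]
chosen = fit_context(docs)  # keeps only what fits in the budget
```

Ranking first and trimming second is what lets the top-10 relevant results ride along with the question instead of a thousand pages of text.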
Jerry Cuomo: Right. So you have a reference now to where that answer came from. So that is probably a step in a really good direction with respect to attribution and I guess proving a hallucination did not occur.
Mihai Criveti: Or at least some level of explainable AI. I have an output, and I can tell you which knowledge-base document or which PDF file or which database entry was used in generating the output. It doesn't necessarily mean the output is correct, but I can at least say, "For more information, read this article, this knowledge base, this document," and that gives you the start of explainable AI and content grounding. It also reduces hallucinations considerably.
Jerry Cuomo: So Mihai, if you have this top 10 knowledge base, how do you bring that in?
Mihai Criveti: So traditionally, you just use a database, right? Any kind of SQL database, where you do a select star from a table where something equals something. The problem with that is that if you're searching for sky, you'll find sky, or you might find "skys," if that's even a word. But when I search for sky, I want to find blue. I want to expand my horizon. I want to find similar documents. I want to have a very, very fuzzy search. And here's where vector databases come in. They take in your documents and they vectorize them, so they convert them into numbers. And now, instead of finding an actual word, you are finding a vector, or similar vectors. So the distance between two words now becomes important. King and queen are related words in vector space. And because you're now using a vector search in your document dataset, you are able to find documents related to whatever you're typing, feed those top 10 results into your large language model, and further generate content from there. So it's a very powerful technique. It reduces hallucinations considerably, and it's one step towards explainable AI.
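The "distance between words" idea can be illustrated with cosine similarity over toy vectors. These three-dimensional vectors are made up for illustration; real embedding models produce hundreds of dimensions learned from data:

```python
import math

def cosine(a, b):
    """Cosine similarity: close to 1.0 means same direction (related),
    close to 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": royalty-ish words cluster on the first two axes,
# food-ish words on the third.
vectors = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.8, 0.9, 0.1],
    "tomato": [0.1, 0.1, 0.9],
}

royal = cosine(vectors["king"], vectors["queen"])
food = cosine(vectors["king"], vectors["tomato"])
# king sits far closer to queen than to tomato in this space,
# which is exactly the relatedness a vector search exploits.
```

A vector database does the same comparison at scale: embed the query, find the nearest document vectors, and hand those documents to the LLM as context.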
Jerry Cuomo: Okay, folks, I think we're going to leave it right there. And Mihai, I will make your article available to those interested in learning more. Thank you so much for joining me today on the Art of AI.
Mihai Criveti: It's been a pleasure, Jerry. Looking forward to our next conversation.
Jerry Cuomo: Well, folks, that's it for today. And I've included some links in the description section of this podcast, including a link to Mihai's paper. So once again, I'd like to thank Mihai for joining me, and I'd also like to thank you all for your continued support and interest in these podcasts. This is Jerry Cuomo, IBM Fellow and VP for Technology at IBM. See you again on an upcoming episode.