Workflows & Tooling to Create Trusted AI | Ask More of AI with Clara Shih

This is a podcast episode titled, Workflows & Tooling to Create Trusted AI | Ask More of AI with Clara Shih. The summary for this episode is: Clara sits down with the founder/CEOs of three of the hottest AI companies -- Aravind Srinivas (Perplexity AI), Jerry Liu (LlamaIndex), and Harrison Chase (LangChain) -- to discuss the tooling, data prep, and agility needed to operate and add customer value in the rapidly evolving LLM space.

Aravind Srinivas: These frontier models are allowing us to build amazing applications, but they need a lot of workflows around them. The models by themselves are not sufficient. They're the reasoning engines, but to build an actual car, you don't just need the motor; you have to build a lot of other things around it, package it, and sell it to the end user.

Clara Shih: Welcome to Ask More of AI, the podcast looking at the intersection of AI and business. I'm Clara Shih, CEO of Salesforce AI, and I'm thrilled to sit down today with three amazing entrepreneurs in the AI space: Aravind Srinivas, co-founder and CEO of Perplexity AI; Jerry Liu, co-founder and CEO of LlamaIndex; and Harrison Chase, co-founder and CEO of LangChain. Aravind, Harrison, Jerry, thank you so much for joining me today.

Aravind Srinivas: Thank you for having us here.

Jerry Liu: Thanks for having us.

Harrison Chase: Excited to be here.

Clara Shih: So we've got a diverse set of listeners and viewers out there, and maybe we could start off by level setting what each of your companies does, because they're very different. But the one thing they have in common is that each of your companies, in its own way, is making large language models better and more usable for developers and for customers, and I think that's just fascinating. Aravind, maybe we'll start with you.

Aravind Srinivas: Yeah.

Clara Shih: You are the founder and CEO of Perplexity AI. Tell us what Perplexity does and how you're expanding from the consumer space now to the business segment.

Aravind Srinivas: Yeah. Perplexity is an answer engine that just directly answers your questions. It's meant to change the search paradigm, moving people from links to answers, except the answers are still supported with references in the form of citations at the end of every sentence. We think a lot of people's day-to-day searches that were typically done on traditional search engines involve deeper research that would be best served by this new interface. That doesn't mean the traditional search engines are going to die, but they're going to be more of a navigational tool in the future, and for actual questions you'd prefer a tool that just gives you back an answer. We started off without any particular user segment in mind, because the magic of all gen AI products lies in their generality, and the more you start fine-tuning for one domain, the harder it is to retain that generality. Then we got to observe the query patterns and the categories of people using it, and slowly focused on helping narrower segments through user workflows. Our Pro plans enable that, targeting knowledge workers, and even knowledge workers are a huge category. So step by step we'll try to get it adopted in more enterprises.

Clara Shih: And how many users are there right now in Perplexity?

Aravind Srinivas: Totally. We have tens of millions of users, but we don't actually track users as much as the number of queries a day, which is our North Star metric. And I can tell you why. You always have to pick a North Star metric for your company that is aligned with long-term progress, and the number of queries a day is how we establish a data flywheel: we can go back, identify mistakes, and fine-tune the models or change the way we index certain domains or sub-domains. Whatever increases the number of queries, the number of pages in our index, the quality of our models, the user feedback we collect to improve error rates, things like that, is what will make the product continually better. You could have some users who just come and don't really do many queries at all, and that might artificially make you feel like you're making a lot of progress, but if it doesn't translate to actual usage, it doesn't matter.

Clara Shih: So you want to set a higher bar.

Aravind Srinivas: Yeah. And this is the metric that Google has used ever since its beginning. They never really cared about the number of users as much as the number of daily queries.

Clara Shih: So you're taking on Google and instead of having to do a bunch of searches and refinements and clicking on links, users can just get their answers directly from Perplexity.

Aravind Srinivas: Yeah, I wouldn't say we are taking on Google. It's more that we are creating a new category of answer bots, assistants that can just directly answer your questions, do your research for you, give you back high-quality answers, and then go even a step beyond that. You take the answer, and what do you do with it? How do you do even more complex research queries, and so on? So we want to be the ultimate research buddy, or knowledge buddy, for everybody on the planet.

Clara Shih: Okay. Harrison, LangChain. Tell us what you do.

Harrison Chase: Yeah, absolutely. So at LangChain we build developer tools to help developers build applications powered by LLMs that reason. What I mean by that is we see a lot of the applications people are building use an LLM as some sort of reasoning engine and connect it to external sources of data and computation. So we build developer tools to help make that as easy as possible.

Clara Shih: And one of the key aspects of this is that LLMs are nondeterministic and developers want more control, and so you help them chain together different logic. What inspired that idea?

Harrison Chase: Yeah, I mean, so LangChain started as a side project like a year and a half ago, and I think the name LangChain comes from the idea of language models, "lang," and then "chain," the idea of using them as one part of a larger system. It could be multiple LLM calls together, it could be a retrieval step and then an LLM call. One of the really cool things is we just see so many different combinations, because I think it's still super early on in this space and people are exploring so much. So yeah, LangChain comes from this idea of combining language models with other things. To the point about the unreliability or stochastic nature of these LLMs, another big thing we're working on is LangSmith, which is our developer platform for observability, testing, and evaluation. We're starting to do more in online monitoring and evals now. We view this as really important because the LLMs are a crucial part of these systems, but by the time you've assembled all this context from various sources, what exactly is going into the LLM? What exactly is coming out? What's the next step in the sequence if the sequence itself is random? All of that is really important to understand and really helps make your applications better. So that's an exciting direction we've been spending more and more time on recently.
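
To make the chaining idea concrete, here is a minimal, dependency-free sketch of two LLM calls composed in sequence, with each step's input and output logged, the kind of visibility that observability tooling like LangSmith is meant to provide at scale. The call_llm stub and traced wrapper are hypothetical placeholders, not LangChain APIs.

```python
# A minimal sketch of a "chain": two LLM calls composed in sequence
# (reformulate the question, then answer it), with each step traced.
from typing import Callable

def call_llm(prompt: str) -> str:
    # Placeholder LLM call; a real chain would call a hosted or local model here.
    return f"(model output for prompt starting with: {prompt[:40]!r}...)"

def traced(name: str, step: Callable[[str], str]) -> Callable[[str], str]:
    # Wrap a step so its inputs and outputs are visible for debugging and evals.
    def wrapper(value: str) -> str:
        print(f"[{name}] in : {value!r}")
        out = step(value)
        print(f"[{name}] out: {out!r}")
        return out
    return wrapper

def chain(question: str) -> str:
    reformulate = traced("reformulate", lambda q: call_llm(f"Rewrite as a search query: {q}"))
    answer = traced("answer", lambda q: call_llm(f"Answer concisely: {q}"))
    return answer(reformulate(question))

print(chain("why do LLM apps need tooling around the model?"))
```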

Clara Shih: That's awesome. Jerry, tell us about LlamaIndex and what inspired the idea for that.

Jerry Liu: Yeah, so ever since we started, the core mission of LlamaIndex has been to enable any developer to build LLM applications on top of their own private sources of data. If you think about the vast amounts of data within an enterprise organization, or just personalized to an individual, there's all this unstructured, semi-structured, and structured data. And especially when we were playing around with ChatGPT in the very beginning, it was clear that these models were not only really good at generating things and memorizing pieces of knowledge, but actually at reasoning over new pieces of information, synthesizing new insights, and taking actions. So our motivation was, how do you take all these reasoning capabilities and leverage them on top of all the vast sources of data that you have? We have two components within the company. One is a pretty popular open-source framework that enables people to build applications over their data. A very popular concept these days is retrieval augmented generation, where you're basically able to stitch together some data along with a language model to build a ChatGPT-like interface over your data. Then we also have an enterprise platform that's specifically focused on getting you clean, quality data for your LLMs. It turns out there's a new data stack emerging to make these context-augmented applications possible, and our managed service, which we call LlamaCloud, enables developers within all these different organizations to do that.
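
For readers who want to see the retrieval-augmented generation pattern Jerry describes in its simplest form, here is a toy, dependency-free sketch: it embeds a handful of short documents, retrieves the closest ones for a query, and stuffs them into the prompt. The hash-based embedding and the generate stub are illustrative stand-ins, not LlamaIndex APIs.

```python
# Minimal RAG pattern: embed documents, retrieve by cosine similarity,
# and ground the model's prompt in the retrieved chunks.
import math
from collections import Counter

def embed(text: str, dim: int = 256) -> list:
    # Toy bag-of-words hashing embedding; real systems use a learned embedding model.
    vec = [0.0] * dim
    for token, count in Counter(text.lower().split()).items():
        vec[hash(token) % dim] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list, b: list) -> float:
    return sum(x * y for x, y in zip(a, b))

documents = [
    "Perplexity is an answer engine that cites its sources.",
    "LlamaIndex helps developers build LLM apps over private data.",
    "LangChain provides tooling for chaining LLM calls together.",
]
index = [(doc, embed(doc)) for doc in documents]  # a tiny in-memory "vector store"

def retrieve(query: str, k: int = 2) -> list:
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def generate(prompt: str) -> str:
    # Placeholder for a real LLM call.
    return f"(answer grounded in {prompt.count('- ')} retrieved chunks)"

query = "Which tool is for building over private data?"
context = "\n".join(f"- {chunk}" for chunk in retrieve(query))
print(generate(f"Use only this context:\n{context}\n\nQuestion: {query}"))
```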

Clara Shih: So there's this debate right now around when you use RAG versus when you have to fine-tune. Based on all the experiments that you're constantly running, what are you all seeing?

Aravind Srinivas: I mean, obviously they are more connected to developers, but based on our own experience so far, I see it this way: when you want to add new knowledge to your model, you probably want to rely more on doing the RAG part really well. That is, don't try to make the model memorize new facts; decouple facts from reasoning. The word he used is pretty apt. Think of the LLM as the reasoning engine and think of the facts as living in the retrieval engine, and together you combine them to build the knowledge engine for your enterprise or for your users. So when you want to add new facts, use the retrieval engine as much as you can. And when you want to make the model respond in a certain way, improve its formatting, improve its ability to handle conflicting facts when two documents talk about the same thing but present different information, that is what's called a skill: the ability to reason, or the ability to transform unstructured text into a structured format like Markdown or JSON. Those are the things you want to train the model for, and anything else you want to keep in the retrieval engine as much as possible.

Clara Shih: And you would know a thing or two about this because prior to Perplexity, tell everyone about your background.

Aravind Srinivas: Yeah, I mean, I'm an AI researcher by training. I've written a lot of papers on transformers, unsupervised learning, and things like that. In those days, AI was obviously not as hot as it is now; (inaudible) was still attended by researchers more than VCs, so it's completely flipped now. But one thing was clear. A lot of people don't remember this, but BERT was the first big moment for AI to transcend into industry, especially NLP. Transformers were there and GPT-1 was there, but BERT was when it got very serious. And then GPT-3 was when you didn't even have to do a lot of machine learning. The new discipline called prompt engineering began, and people were able to program these things to do tasks for them rather than collecting a lot of data and training models. But it was still not reliable enough to have industry-level impact, with people actually building products with it and getting them into the hands of users. There's this whole talk of emergence, that these abilities emerge with scaling, but the reality is it's always gradual. You can track it in terms of metrics; the thing that's actually emergent is reliability. The metrics get to a certain stage, say 5% with GPT-3, then suddenly something like 60% with GPT-3.5 and 80% with GPT-4, and 80% is when it's reliable enough to put into products. So that is emergent, but the metrics have always been gradually increasing with more scale. That's what we're beginning to see. GPT-4 is so good, Claude 3 is so good. These frontier models are allowing us to build amazing applications, but they need a lot of workflows around them; by themselves they're not sufficient. They're the reasoning engines, but to build an actual car, you don't just need the motor, you have to build a lot of other things around it, package it, and sell it to the end user.

Clara Shih: And hence the value each of you in very different ways can provide.

Aravind Srinivas: Exactly, yeah. We're building a product; these guys are building the developer tooling to help other people build products. So there's a lot of value we can create through this process.

Clara Shih: Isn't that really hard, though, when the underlying substrate, in this case the models themselves, is changing so fast? It's like the moment you build tooling that's optimized for one, you have to switch it out. How are you managing in this landscape?

Harrison Chase: Yeah, I mean I think this is definitely one of the hardest parts. Just a short anecdote about this. The original model that we were using as part of Langchain when we launched was deprecated by OpenAI a few months ago.

Clara Shih: How annoying.

Harrison Chase: You literally cannot use it anymore. Yeah.

Jerry Liu: DaVinci.

Harrison Chase: Yeah. And so, yeah, I think it's something we've tried to balance. We went over a year without a stable release of LangChain, and we just released a stable release a few months ago. Part of that was very conscious, because we thought the underlying models were changing pretty rapidly.

Clara Shih: And for people who don't know what is a stable release?

Harrison Chase: It means we're not going to make breaking changes when we move to a new version, and we're going to maintain bug fixes for the previous release. So we're basically saying we think the abstractions here are solid and we're going to roll with this for a while. Until then, I think we did a pretty good job of maintaining backwards compatibility, but there were a lot of changes, like chat messages versus string-in, string-out. That was a huge change to a lot of the underlying abstractions we had. Multimodal, we were waiting for multimodal before doing any stable release. Luckily it wasn't that huge of a change; at least in terms of the APIs, they remained relatively stable. But yeah, one of the things that definitely keeps me up is that we don't train our own models, we use other people's models, and so if they release new APIs, new ways of thinking about things, that's absolutely going to have a huge impact on our frameworks and what we do.

Clara Shih: And maybe that's part of the value you can provide to developers: helping them maintain that abstraction while you figure out the connection to the new models.

Harrison Chase: Help them maintain it, help make it standard across model providers as well. Yeah.

Jerry Liu: Yeah. And actually, to Harrison's point, I think that's one of the values of having an open-source project. I'd actually argue it makes you iterate a lot more quickly than if you had a fully built-out product feature and then all of a sudden some underlying model changed and there was a new use case. It takes a lot longer to actually productize that new feature versus at least having some core abstractions and core capabilities that enable other developers to build with it. That's one of the nice things about open source: it is a lot of work for our engineering teams to maintain these interfaces, but you're able to iterate on them very quickly. Let's say GPT-5 comes out or Claude 4 comes out; you can iterate on those interfaces a lot more quickly than shipping a full-blown product. And this speaks to the broader point that a lot of gen AI adoption within the enterprise right now is developer-led, and I think the reason for that is that developers have access to the right set of tools to rapidly iterate and prototype these new features, as opposed to building something and then throwing it away every few months.

Clara Shih: So it's interesting. None of you has your own model right now. Is that something that might be in the cards down the road? I mean, you've built models before, but you've consciously chosen not to in this particular case. What's the reason behind that, and would that change down the road?

Aravind Srinivas: So we don't have our own foundation model, but we have a lot of custom fine-tunes of models already being served in the product, built on top of Gemma, Llama, or Mistral. We're always tracking all these open-source LLMs and customizing them to be good at our product use case. So the free version of the product largely uses a lot of our own models, actually. The paid version gives people access to other frontier models that are better than the models we have, which includes GPT-4, Claude 3 Sonnet, Claude 3 Opus, Mistral Large, and things like that. A conscious decision we made after we raised our Series A funding was not to allocate it to a GPU cluster and go train our own foundation model. This was probably one of the best decisions I made. We made a bet that there would be good open-source models that rival GPT-3.5, that GPT-3.5 itself would become a lot cheaper and faster, and that chips would become faster and cheaper, allowing you to serve a product with a few inference nodes rather than a lot. We made those bets, and they all turned out to be true. H100s are way more efficient for serving models than A100s, so you can serve a whole product for millions of users with just a few tens of GPUs. At the beginning of last year, all of these things were unclear. Now they're a lot clearer.

Clara Shih: That was the right call.

Aravind Srinivas: Yeah. So in general, I feel like if you're a product-focused company, don't invest too much in foundation model training, that is, training the base model, unless there are real incentives for you.

Clara Shih: So you fine-tuned Llama 2 in this case. At what point do you say, okay, well, Llama 3 comes out later this year, for example? Do you have to start over?

Aravind Srinivas: No, and that's the amazing thing about having all the post-training expertise. Post-training is just a fancier way of saying fine-tuning when you're doing it on a lot of data. Typically fine-tuning is the word you use when you're throwing in hundreds of thousands of examples, not a lot more, whereas post-training is when you take a base model and do the SFT and RLHF yourself: the supervised fine-tuning step and reinforcement learning from human feedback. When people release both the base models and the chat models, you can take the base model and custom post-train it with your own SFT and RLHF stack, and that's very useful. That's what you can do to differentiate your models from the other chat models on the market. These are all the things you do post-training for, and this expertise is very useful no matter what base model you start with.
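
As a rough illustration of the SFT half of post-training, the toy PyTorch snippet below trains a tiny character-level model on a single (prompt, response) pair while masking the prompt tokens out of the loss, which is the core mechanic of supervised fine-tuning. The model and data are stand-ins, a sketch of the idea rather than anyone's actual stack.

```python
# Toy SFT step: next-token loss is computed only on the response tokens,
# so the "base model" learns the response style without being penalized
# for reproducing the prompt.
import torch
import torch.nn as nn

vocab = sorted(set("abcdefghijklmnopqrstuvwxyz ?:."))
stoi = {ch: i for i, ch in enumerate(vocab)}
IGNORE = -100  # ignored by cross_entropy; used to mask prompt tokens

def encode(text):
    return [stoi[c] for c in text.lower() if c in stoi]

class TinyLM(nn.Module):
    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, ids):
        hidden, _ = self.rnn(self.emb(ids))
        return self.head(hidden)

model = TinyLM(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

prompt, response = "what is rag? ", "retrieval plus generation."
ids = torch.tensor([encode(prompt + response)])
targets = ids[:, 1:].clone()                     # next-token targets
targets[:, : len(encode(prompt)) - 1] = IGNORE   # mask the prompt portion

for _ in range(50):  # a few SFT steps on one example, purely illustrative
    logits = model(ids[:, :-1])
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, len(vocab)), targets.reshape(-1), ignore_index=IGNORE
    )
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final SFT loss on the response tokens: {loss.item():.3f}")
```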

Clara Shih: And that's become your secret sauce.

Aravind Srinivas: Yeah, we want to be world-class at it, and that doesn't require a lot of GPUs. You don't need tens of thousands; you only need hundreds, or maybe thousands. But what you actually need is a good team of people who know what they're doing, because it's so easy to get things wrong even here. It's easy to get things wrong further up in the pre-training world too, except mistakes there cost tens of millions of dollars, whereas here it's more like a few hundred thousand dollars, so you can afford it.

Clara Shih: What are the pitfalls when people are fine- tuning?

Aravind Srinivas: A lot of the pitfalls come up when, let's say, you're going out and fixing a certain issue. You have a model and you find it to be too verbose, so now you train on a new data set that has shorter responses. What happens is it becomes too concise; even for a question that actually calls for a short explanation of a phenomenon, it responds in a single sentence. Then you go back and collect another data set with longer responses, and now it starts being too verbose for no reason. So how do you handle different types of queries that require different types of responses? And how do you handle things like stochasticity? Sometimes the same question, the same prompt, might get you different completions because the model just wasn't calibrated well enough. These are the sorts of problems that can only be addressed with good RLHF; pure SFT doesn't get you a model that behaves that consistently. You need the right balance, and RLHF in particular is very sensitive to the set you train on, so if you get that wrong, it can make the product a lot worse. And evals are actually one of the biggest problems. If you don't have good evals, you might think you've got a good data set, you trained on it, and you can just deploy to production. But if you don't track the evals correctly, you can't answer the basic question evals are a proxy for: can I replace the earlier model with this new fine-tune?

Clara Shih: I think Harrison might have a product for you.

Harrison Chase: Well, I was going to ask, how do you guys currently do evals?

Aravind Srinivas: Yeah, we track three different dimensions: accuracy, readability, and obviously latency. Latency is decided not by the data you train on but by your serving infrastructure and the number of parameters of the model. Accuracy means factual accuracy, and readability is how readable the answer is, how well it's formatted. We have a team of human contractors who constantly rate completions from our own models and other models and rank them, and then we use that to go collect another data set that makes us better on the queries where we're not doing well. So it's an iterative process.
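
A minimal sketch of that eval loop might look like the following: run a baseline and a candidate fine-tune over a fixed question set, time each call, collect preference judgments (a stub here, standing in for human raters), and only promote the candidate if it wins often enough. All model and rater functions are hypothetical placeholders.

```python
# Tiny eval harness: accuracy/readability via preference judgments, plus latency,
# feeding a go/no-go decision on replacing the old model.
import time

def baseline_model(q): return f"A short answer to: {q}"
def candidate_model(q): return f"A cited, well formatted answer to: {q}"

def rater_prefers_candidate(q, a_base, a_cand):
    # Stand-in for a human (or LLM-judge) preference label.
    return len(a_cand) > len(a_base)

eval_questions = ["what is rag?", "how do h100s compare to a100s for inference?"]

def timed(fn, q):
    start = time.perf_counter()
    out = fn(q)
    return out, time.perf_counter() - start

wins, latencies = 0, []
for q in eval_questions:
    a_base, _ = timed(baseline_model, q)
    a_cand, dt = timed(candidate_model, q)
    latencies.append(dt)
    wins += rater_prefers_candidate(q, a_base, a_cand)

win_rate = wins / len(eval_questions)
print(f"candidate win rate: {win_rate:.0%}, worst-case latency: {max(latencies):.4f}s")
print("replace baseline with candidate:", win_rate >= 0.6)  # example threshold
```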

Harrison Chase: Super interesting. One question I have for you as well, Aravind, since you're actually building products and we're just building tools, so you can actually speak to what people are doing in practice: how do you think about few-shot prompting? This is something I've seen a bunch of buzz about recently. For those of you who don't know, few-shot prompting is when you include a few examples in the prompt itself.

Aravind Srinivas: Yeah. So obviously when you're including a lot in the prompt, you have to be aware that the more you throw into the prompt, the more latency you incur and the more context you give up, especially as people engage in longer turns in the conversation. We use few-shot prompting for collecting data sets and things like that. Let's say you want to build a small classification data set and then train a custom model on that. When you're deploying a product, it's not just a summarization model; you're taking a query and reformulating it, you're classifying it, and you have to do all these things to be able to respond with different UIs for different queries. These require you to go collect data sets that use a larger model as a teacher, and for all those sorts of things, few-shot prompting benefits you a lot. Another place where few-shot prompting can be very useful is when you're collecting a data set for RLHF and you want different completions from the model. To extract different completions, just changing the temperature is a pretty hacky way of doing it. A better hack, I wouldn't call it principled, is having different few-shot prompts, eliciting different completions as a result, and then using those as the completions to be ranked by the teacher, your in-house LLM contractor, or something like that. So there are many creative ways to do all these things.
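
The sketch below illustrates both uses under stated assumptions: a hypothetical teacher_llm stub labels unlabeled queries via a few-shot prompt, and shuffling which exemplars go into the prompt yields diverse completions for later ranking, instead of relying only on sampling temperature.

```python
# Few-shot prompting for (1) distilling a classification dataset from a "teacher"
# and (2) eliciting diverse completions for preference ranking.
import random

def teacher_llm(prompt: str) -> str:
    # Placeholder for a call to a large frontier model.
    return "navigational" if "login" in prompt.split("Query:")[-1] else "research"

labeled_examples = [
    ("best gpu for llm inference", "research"),
    ("perplexity login page", "navigational"),
    ("how does rlhf work", "research"),
]

def few_shot_prompt(examples, query):
    shots = "\n".join(f"Query: {q}\nLabel: {y}" for q, y in examples)
    return f"Classify each query as research or navigational.\n{shots}\nQuery: {query}\nLabel:"

# (1) Bootstrap labels for unlabeled queries using the teacher.
unlabeled = ["openai status page login", "compare claude 3 opus and gpt-4"]
dataset = [(q, teacher_llm(few_shot_prompt(labeled_examples, q))) for q in unlabeled]
print("distilled dataset:", dataset)

# (2) Elicit diverse completions by varying which exemplars go into the prompt,
# rather than only cranking the sampling temperature.
completions = []
for _ in range(3):
    shots = random.sample(labeled_examples, k=2)
    completions.append(teacher_llm(few_shot_prompt(shots, "perplexity pro pricing")))
print("candidate completions to rank:", completions)
```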

Harrison Chase: Super interesting. And out of curiosity, when I go to perplexity.com and type something in, how many LLM calls are getting made to respond?

Aravind Srinivas: I genuinely don't know. Not to be cagey about it, but I'm not up to date with all of our backend anymore. But I'm sure there are definitely more than five or six.

Clara Shih: Interesting.

Aravind Srinivas: Yeah, all those asynchronous calls are why people keep saying "it's just a wrapper" and things like that. But being a good wrapper takes really good orchestration abilities to handle all these asynchronous calls simultaneously and still serve a really fast product to the consumer. And that's also the value these guys are adding, because not everyone can do this. There are companies that wouldn't know how to do this, or it would take them many months to figure out, and instead they can rely on the tooling these guys build and benefit from all of that, rather than having to figure out everything themselves.
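
To show what that orchestration looks like in miniature, here is a sketch using Python's asyncio: several stubbed calls with fake latencies run concurrently and are gathered before a final synthesis step. The call names are illustrative, not Perplexity's actual pipeline.

```python
# Concurrent orchestration of multiple model/retrieval calls per user query.
import asyncio
import random

async def fake_call(name: str) -> str:
    await asyncio.sleep(random.uniform(0.05, 0.2))  # stand-in for network latency
    return f"{name}: done"

async def answer_query(query: str) -> list:
    # Reformulation, classification, and retrieval can overlap;
    # only the final synthesis needs all of their results.
    tasks = [
        fake_call("reformulate"),
        fake_call("classify-intent"),
        fake_call("retrieve-web"),
        fake_call("retrieve-index"),
    ]
    partials = await asyncio.gather(*tasks)
    synthesis = await fake_call("synthesize-answer")
    return partials + [synthesis]

print(asyncio.run(answer_query("why are h100s better for serving?")))
```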

Clara Shih: That's right. So I mean both of you work with a lot of developers. What kinds of trends are you seeing in terms of model usage and most sought after use cases? What are you seeing and is there anything that's surprising?

Jerry Liu: Yeah, I mean, I think for us, just given the nature of the framework as well as the company, we've seen a lot of use cases around RAG, retrieval augmented generation. Last year people were really just prototyping simple chatbots over their data. Now it's still roughly chatbots, but there's more data, it's more complicated, and you have to figure out how to productionize the whole thing, so it gets a little more interesting. A core use case we see across a lot of enterprises is similar to what I described in the mission: you have some SQL database with operational data, you have a thousand to a million PDFs locked up somewhere, and you just want to understand it and surface insights. The reason this is valuable is that it unlocks a lot of efficiency for your end users, because they don't have to go dig through all this knowledge, open up Acrobat, and read through all this information. To be honest, I feel like a base layer of applications is starting to move past this, but once you start to productionize it, it's actually pretty hard to do. The thing everybody cares about, especially when you try to make this work really well, is response quality: given any type of question you want to ask, you're able to get back an actual answer. This is where evaluations are important; you want proper metrics to track response quality, measure it, and iterate on it. This is also where having a proper data stack matters, because you want a proper way to process all your input data, make sure you can propagate updates, especially if data becomes outdated, and represent all that data somewhere in your storage system, your data warehouse or data lake, so you can surface it to the language model. That's why, especially in the developer stack, there's so much buzz and interest around vector databases, and why there are so many vector database companies: they've positioned themselves as a core data storage provider in the emerging LLM stack.

Clara Shih: It's so interesting, right? We have this thing called the Salesforce metadata layer, where admins have traditionally put in descriptions for the benefit of other humans in the organization. We're finding now that it's very useful as an input for the copilot, and we're also trying to use LLMs themselves to generate even more metadata.

Harrison Chase: Another thing we've seen there is interest in using a semantic layer: rather than working against the raw SQL, adding a simpler semantic layer that acts on simpler kinds of columns and insights.

Clara Shih: Right. To generate a glossary for every company. So, Jerry, what are some techniques? Obviously there's no one-size-fits-all approach to RAG on unstructured data, because there are so many different types of unstructured data, but what are some learnings you could share with our audience?

Jerry Liu: Yeah, we've done a few talks on this, so for very deep technical content, if you're a listener who's a developer, feel free to dig through some of our past presentations. Let me just articulate some of the high-level challenges. One thing about building any of these workflows with LLMs, including RAG, is that you're basically building an overall software system, which becomes an overall ML system, around this core module, which is of course the LLM. The LLM itself is trained on data, but you're tacking on all these additional things: an embedding model, your prompt, the chunk size, your vector database choice, the retrieval algorithm. And they all basically end up becoming hyperparameters. The thing about regular machine learning is that you can optimize the model through something like backpropagation or RLHF; there's ML research, there are gradient descent algorithms, to actually improve the performance of your overall system. But if you're a software engineer, or an AI engineer these days, and you're cobbling things together, every additional component you cobble onto your workflow adds more parameters to your system, parameters you need to track, potentially manually tune, and also evaluate. At a high level, that sums up some of the challenges with RAG right now: there are just a lot of components, even in a very basic RAG flow, for developers to set up and get right. That's where you get into this evaluation and iteration loop, where you want to measure metrics, go back, tune your chunk size, tune your retrieval algorithm, tune all the prompt templates you stitch together, and then finally decide it's at a bar where you can actually deploy to production.

Clara Shih: Of all the parameters you just described tuning in the RAG pipeline, where should people start if they just want to keep it simple?

Jerry Liu: Yeah. I would start by manually creating a data set of 20 questions or 20 inputs that you think are representative of something basic you'd want to test. We have an entire set of materials where we recommend starting small, beginning with the most basic stuff, and then trying progressively more interesting and advanced techniques. If you're just trying to build a simple RAG pipeline, you're not trying to build some fancy agent; you're just trying to build something that answers questions. There are plenty of table-stakes things you can tune: chunk size is a table-stakes thing to tune, I think; for your retrieval algorithm, there's a best practice around hybrid search, as well as a good parameter value for it; and tune your prompt template a little bit just to make sure it works well. That's the simplest way to start. You're not going to get too much delta, but at least you'll improve beyond the baseline. Once that's not enough, once you're more ambitious with your goals, you want to ask a broad research question and have this overall thing figure stuff out, or chain together ten different steps and achieve all these things in a sequence, not only giving you back information but also sending an email or scheduling a calendar invite, those types of things. Then you start needing to tack on more layers of abstraction, like some sort of agentic reasoning layer, some aspect of tool use. And I typically recommend trying the simpler thing first and then adding the other components on top, just so you don't overwhelm yourself with complexity.
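
A toy version of that tuning loop, assuming a deliberately simple keyword-overlap retriever standing in for a real embedding or hybrid-search setup, might look like this: a small hand-built eval set, a sweep over chunk size and top-k, and a retrieval hit rate per configuration.

```python
# Sweep table-stakes RAG knobs (chunk size, top-k) against a tiny eval set.
from itertools import product

corpus = (
    "LlamaCloud is a managed platform for ingestion and parsing. "
    "Perplexity serves millions of users with tens of GPUs. "
    "Hybrid search combines keyword and vector retrieval. "
    "Post-training means doing SFT and RLHF on a base model."
)

eval_set = [  # in practice ~20 representative question/answer-keyword pairs
    ("what does llamacloud do", "ingestion"),
    ("what is hybrid search", "keyword"),
    ("what is post-training", "rlhf"),
]

def chunk(text, size):
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(chunks, query, top_k):
    # Keyword-overlap scoring as a stand-in for embedding or hybrid search.
    ranked = sorted(chunks, key=lambda c: len(set(c.lower().split()) & set(query.split())), reverse=True)
    return ranked[:top_k]

def score(chunk_size, top_k):
    chunks = chunk(corpus, chunk_size)
    hits = sum(any(ans in c.lower() for c in retrieve(chunks, q, top_k)) for q, ans in eval_set)
    return hits / len(eval_set)

for chunk_size, top_k in product([8, 16, 32], [1, 2]):
    print(f"chunk_size={chunk_size:>2} top_k={top_k} retrieval hit rate={score(chunk_size, top_k):.2f}")
```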

Clara Shih: And that's what we're seeing as well. I mean, we've got our Einstein Copilot, but most companies aren't starting there. They're starting with turnkey use cases. We have something called Prompt Builder where they can embed those within the Salesforce experience; the most popular one is service reply recommendations if you're a customer support rep. But I think that's really good advice. Harrison, what about you? What trends are you seeing from your developer community?

Harrison Chase: I think we're seeing a few things. One is maybe a lot more interest in things like few-shot examples; we're starting to see that pop up. I think that's because a lot of the newer models have longer context windows and are a lot faster. With Claude's new Haiku model, for example, I think it's pretty feasible to stuff five to ten few-shot examples in there and still get a reasonable response in a reasonable amount of time. So I'd say we're seeing more interest in that. Most of the people we're working with are still really early on in their journey, so they probably don't have the setup Aravind has at Perplexity for fine-tuning a lot of these models; they're just earlier on and still exploring. Another trend we're seeing, another term I'll throw out there that I've seen on Twitter recently, is flow engineering. I think this speaks to what both of them were talking about with this idea of chaining things together and thinking about what the flow of your application should be. Oftentimes, when you really try to go from a basic prototype to a more realistic production thing, it's not just the prompt that needs improving. Maybe you need to bring in external context, or you want to break the problem down into sub-problems. So we see this type of flow engineering rising in popularity.

Clara Shih: And it's a nice way to bring some determinism into the non-determinism, because you can lay out the steps in a very clear, predictable way.

Harrison Chase: Yeah. One way I like to phrase it is that one of the things LLMs currently struggle with is longer-term planning and reasoning. So one way to think about flow engineering is that you're removing some of the burden of planning from the LLM and putting it onto the engineer who's setting up the flow. They're imparting their domain knowledge of what the flow, or the plan, should be, and so you're making it easier for the model.
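
A small sketch of the contrast: the plan below is ordinary hard-coded control flow written by the engineer, and a hypothetical llm() stub only fills in the individual steps, including a final verification pass.

```python
# Flow engineering in miniature: the plan lives in code, not in the model's head.
def llm(instruction: str, payload: str) -> str:
    # Placeholder for a single model call; a real flow would hit an API here.
    return f"[{instruction}] {payload[:48]}"

def handle(ticket: str) -> str:
    # The engineer decides the branching and the order of steps.
    category = "refund" if "refund" in ticket.lower() else "technical"

    if category == "refund":
        facts = llm("extract order id and amount", ticket)
        draft = llm("draft a refund confirmation", facts)
    else:
        facts = llm("summarize the reported bug", ticket)
        draft = llm("draft a troubleshooting reply", facts)

    # A verification step before anything reaches the customer.
    verdict = llm("check the draft for unsupported claims, answer ok or revise", draft)
    return draft if "ok" in verdict else llm("revise to remove unsupported claims", draft)

print(handle("Hi, I was double charged and need a refund for order 1042."))
```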

Jerry Liu: I have a question actually. How do you think about the difference between flow engineering and I guess just traditional programming? What's the difference between writing Python and then just setting up a flow?

Harrison Chase: Yeah, it's very similar. I did a webinar with (inaudible) from Codium, and he was the first one I heard this term flow engineering from; that was actually one of the questions I wanted to ask him. They wrote a great paper that everyone should check out, on AlphaCodium, which achieved state-of-the-art performance, and that was the first time I heard flow engineering. But yeah, it's just kind of like data engineering; you're thinking about your product and application development. Maybe an interesting twist, and this isn't really from a technical perspective, but to the point about trying to think about planning, is that there's a lot of psychology in it. You ask, okay, if I'm a human, what information do I need, and how would I break this down? And if you as a human couldn't answer the problem based on the information you have, then the LLM's not going to do it either.

Clara Shih: Certainly. Exactly. And that dovetails with a different question I was going to ask, from the other side. Most developers prior to this year have not necessarily been using AI, and going forward there's this expectation that we can use GitHub Copilot and other copilots, we have Apex GPT at Salesforce, both to help with code creation and test creation, but also, increasingly, instead of writing and manually coding every software control flow, which parts of the code can we actually start to farm out to AI? How are you thinking about that, and how can developers prepare if they haven't had a background in AI and ML?

Aravind Srinivas: I did a survey among our engineers of who's actually using GitHub Copilot and who's not, and the majority of them don't like using it.

Jerry Liu: Really? Interesting.

Clara Shih: Tell us more.

Aravind Srinivas: Look, I was surprised too, but these are all really talented, top engineers. I was actually talking about this to this very popular, or notorious, as some people might want to call him, programmer, George Hotz. And he told me that he doesn't actually want an autocomplete. What he wants is to have the AI tell him where he's going wrong and debug things for him. He wants things that are more like mypy, like linters and debuggers that are even smarter than what humans have written manually with a lot of heuristic rules.

Clara Shih: That's super interesting, because people call it generative AI, but in many cases, and we see this outside of coding too, in sales and service, the AI is actually even more powerful for analyzing than it is for generating.

Aravind Srinivas: Yeah, exactly. These are reasoning engines. They don't always have to just generate completions for you, and you don't always have to go ask a question to a chatbot. Once it becomes part of your workflow every single day, you don't use it because it's AI; you use it because it's a much superior workflow, because it's faster to get work done every single day versus the earlier way of doing things. Currently, when you're writing code on GitHub, let's say you press tab and get a completion, you don't know if it's correct, you have some question, so you go over to ChatGPT, ask a question there, and come back, and this is not optimal. You need a better-designed workflow for this, so you-

Clara Shih: In the flow of work.

Aravind Srinivas: Yeah, exactly. But I still think there's a lot more work to be done here to work backwards from the developer. What do they really want? And then in some sense, it's almost like interactive AI. You just have to figure out the sweet spot of interaction here.

Clara Shih: Do you think that applies to just the notorious developers or do you think in general that's the case?

Aravind Srinivas: Actually, a lot of people would like things that just make life easier for them in terms of the burden of debugging and figuring out new things fast. Whenever I code, debugging has always taken the most time for me.

Clara Shih: Nobody likes it.

Aravind Srinivas: But you have to do it, and you need to build a muscle for it, obviously. Still, you need to make things a lot more efficient for the next generation of developers. Right now, the way people debug is they just copy-paste the whole error, go to ChatGPT, and ask, what do I do? I think that step is inefficient; something better designed that lives natively in the editor would be a lot better.

Clara Shih: What are your thoughts on how developers need to change?

Harrison Chase: I'm going to take this in a slightly different direction, but I'd just say that a lot of what Aravind laid out goes for non-developer products as well. The UX around all these gen AI applications is one of the main interesting things still to figure out and still to be worked on. Having it in the flow of whatever work you're doing and figuring out that UX is super interesting to me, and unfortunately I don't get to work on that as much as I'd like to.

Jerry Liu: We're building developer tools, but honestly, just taking a step back, I feel like there need to be courses on how to just use AI as a regular person: how to use ChatGPT or Claude, or, if you're picking up programming, a copilot. It's such a big efficiency gain when you actually know how to use it. Just the other day I copied and pasted an entire sales transcript into Claude. I don't know how public this should be, but basically it's really helpful at summarizing action items, and it legitimately saved me five to ten minutes of going through the thing and synthesizing the insights.

Clara Shih: And so for developers and non- developers, how do we drive this awareness and education?

Jerry Liu: Yeah, honestly, I think everyone should just use ChatGPT, Claude, Perplexity, from a user angle.

Clara Shih: Einstein.

Jerry Liu: Einstein, of course. And there should be courses, basically. You know how back in the day they taught people how to use computers? When you're a kid growing up in elementary school, you have computer lab; there should be an AI lab.

Clara Shih: I agree.

Aravind Srinivas: I agree. And yeah, this will go a long way. A lot of people don't use AI; we are still in a bubble here. All these productivity gains we're talking about, it's still a bubble. Most people in the United States and many other high-GDP countries don't use AI every day.

Clara Shih: It's true. I was in DC a few weeks ago, and it's so hard to reform K-12 education because it happens state by state. But imagine growing up not knowing any of this, both how to use it and also what hallucinations are, what the downside risks are.

Aravind Srinivas: A lot of it has to do with the fearmongering, and also that people use it for the wrong purposes, like cheating on homework. So people are in general negative about AI adoption, whereas the reality is, hey, you can actually have a lot more time for yourself.

Clara Shih: Well, imagine if you were in the late 90s saying, there are all these cybersecurity risks, we shouldn't use the internet. That's ridiculous, but it's the equivalent of what we hear today.

Aravind Srinivas: That's a good analogy.

Clara Shih: Yeah. Well, one shout-out: we would love your ideas and participation. We have a free online learning platform at Salesforce called Trailhead, and since last summer, when we launched about 30 different AI courses, we've had over a million badges and certifications in AI. So we would love your ideas on what other courses we should be adding, just to help with the cause. I know we're almost out of time, but I want to close by asking: this is such a unique point in time, here we are sitting almost in the middle of 2024, and we've come such a long way since ChatGPT was launched. If we fast-forward 12 months, what are each of your predictions on the state of the art in AI and your companies? What vision, plans, and aspirations do you have?

Aravind Srinivas: Yeah, for AI broadly, I think there'll be even more capable models than what we have today; that's guaranteed. I feel like there'll be a new GPT that can do even better reasoning, even longer context coherence, and things like that, and that'll help make whatever is already there much more reliable. As for Perplexity, we just want to get to the sweet spot of accuracy and speed, which is very hard, by the way. You can solve the accuracy problem by just using the largest model, but that is not going to be fast, not going to make for a great UX, and not going to be affordable to serve to many millions of people every day. So we want to keep improving there, and also go beyond the purely generative user experience, where it's just creating an answer for you. We want the answers to be specifically formatted for specific kinds of queries, and the interactions to be very different for different types of queries. If you're asking for trip planning, it should be more interactive and not just generative. We also want to work on things like, what do you do after you get the answer? Why is it just lying around in your past threads? How do you convert it into something you can actually use? That'll be very useful for people who use the product at work. So we're going to think more around the product layer.

Clara Shih: Well, congratulations on your launch into the business segment. I look forward to seeing how that takes off. Harrison?

Harrison Chase: Yeah, on the model layer, I agree about the planning ability of LLMs. That's probably what I'm most excited about, and I think that'll be the biggest step change. I also think there'll be a bunch of models at maybe the caliber of GPT-3.5 that are super fast; I think Haiku has opened up a lot of possibilities with these really fast models. That speed, especially as you're doing more steps, is really important. So I think there'll be a bunch of models there.

Clara Shih: So you'd use different size models for different tasks?

Harrison Chase: I think so, yeah. Absolutely. I think there'll be a tier of really good reasoning and planning models, and you might use those when you really need to, but there will absolutely be a lot of async background jobs that get done by faster, cheaper models. For LangChain, I think testing and evaluation is going to be a huge pain point as more things go into production and you basically just want to know, is the application performing the way I want it to perform? So we're pushing a bunch on that. I think flow engineering will continue to get more complicated and important, so thinking about the right UX there, and this isn't so much LangChain, but again, the UX for all these types of applications, is interesting. I think the Devin software engineering UX had some really interesting things, like the ability to rewind the agent's trajectory and edit it. That rewind-and-edit, we're super excited about; we're building it into a lot of the things we're working on. And then another thing at the app layer that I'm really excited about is personalization. Right now, ChatGPT added this concept of memory, but it's still a little bit-

Clara Shih: Rudimentary.

Harrison Chase: Rudimentary is a good word. And so I'm excited to see the boundaries getting more and more pushed on that.

Clara Shih: That's something we're investing in heavily too, right? Both personalization in terms of who the user is and the history of what they've done, but also who they're talking to: if it's a salesperson or a service person, personalizing to that customer. Very exciting. Jerry?

Jerry Liu: I'm excited about general data representations and how more data can be used with LLMs. This includes multimodal data: video, images, audio, text. Even beyond standard documents, there's a ton of random unstructured data, both on the internet and within an organization. So how do you tap into all of that and create the right data pipelines and data infrastructure to make it accessible to models? All these models are getting natively multimodal; I think all the Claude models are multimodal. Right now they're text plus image in, text out. Hopefully they'll natively become something like a combination of GPT-5 plus Sora: text and images in, and text, images, and video out, that type of stuff. I'm really excited about that. On our side, we launched an enterprise platform called LlamaCloud a few months ago, designed to be an overall managed platform for ingestion and parsing. We want to build that out to scale up to more volumes of data, similar to what I talked about, but also generally to be able to process all this data and serve as a key context augmentation layer for any sort of app you want to build: RAG, agents, or anything else.

Clara Shih: So one question I always like to ask our guests is what they think about the future of education. And so for all the parents and caretakers out there, how should they think about how to guide children to succeed in this next era of AI?

Aravind Srinivas: So I like to think about it in the context of Perplexity itself. Perplexity is like an open-book exam kind of entity: it always looks at the web and then uses it to answer the question. In general, when I was in school or college, I always enjoyed taking open-book exams. Because I had a great memory, I could always top the closed-book exams, but there were people I considered smarter than me who weren't at the top rank just because I had a better memory. I always wanted to challenge myself against them in open-book exams, where that would not be an advantage for me, and whenever I beat them, I felt even better than succeeding the normal way. I think that's what parents should encourage kids to be like. Don't try to be the best memorizer. What happens when you're given all the facts all the time and still have to answer questions that require higher-order reasoning?

Clara Shih: So just like for the models, you'd tell parents to focus on fine-tuning their kids' ability to reason.

Aravind Srinivas: Skills. Reasoning skills.

Clara Shih: Skills and reasoning. And leave the memory and RAG to the AI.

Aravind Srinivas: Yeah, that's why, I mean, that's kind of how calculators are used. You don't have to be really good at the abacus anymore; you just use your calculator. What matters is what you do with the calculations. There are all sorts of interesting questions, like how many eggs would it take to fill up a room? You can use a calculator to actually arrive at the answer, but the individual steps, what's the size of an egg, how do they stack, what's the volume of the room, those are primarily reasoning. You can find all the relevant fill-in-the-blank facts from the internet or from tools.
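
Worked through as a quick Fermi estimate, with all numbers as rough assumptions, the arithmetic is the easy part; choosing the steps is the reasoning:

```python
# Fermi estimate for the egg question; every number here is a rough assumption.
room_volume_m3 = 4 * 5 * 2.5       # a 4 m x 5 m room with a 2.5 m ceiling
egg_volume_m3 = 60e-6              # a large egg is roughly 60 ml
packing_efficiency = 0.6           # eggs don't tile space perfectly

eggs = room_volume_m3 * packing_efficiency / egg_volume_m3
print(f"on the order of {eggs:,.0f} eggs")  # roughly half a million
```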

Clara Shih: Harrison.

Harrison Chase: I'd probably just encourage them to get their kids to play around with generative AI in some form. I think it's still so early and everyone's still figuring out what it's good at and what it's bad at. The more interactions you have and the more you get a sense for that early on, the more helpful it will probably be down the line, no matter how it ends up working into the day-to-day.

Clara Shih: Jerry.

Jerry Liu: Yeah, I agree with all of the above points. I think the only real risk of this is that, look, Claude is probably smarter than I am at most things.

Clara Shih: That would be saying a lot.

Jerry Liu: And they're going to get better. They're going to get way smarter than I am in a few years. So I think the only risk is that you let your kids, or you yourself, depend entirely on AI. I do fundamentally believe that humans should be able to reason about things, and reason critically about things. So to whatever extent you can, use AI to complement and augment that skill set, so you're staying on top of the latest tech but you're also still able to reason, versus just completely letting AI automate everything. I would encourage more of the former.

Clara Shih: Yeah, kind of like you could use a graphing calculator, but you understand the underlying calculus you're putting in.

Aravind Srinivas: I mean, to counter his point, even if Claude is smarter than you at certain things, Claude plus you should always be smarter than Claude. That's what you should work towards. And that's what I would also tell the next generation.

Clara Shih: But you'd have to be smart enough to leverage Claude fully.

Aravind Srinivas: Yeah, that's up to you. And that's where his point makes a lot of sense: always keep tinkering and learning, and try to use these tools in very creative ways that even their developers might not have considered. Because I feel like that's the difference between what OpenAI did and what Google did. Google was training all these large models before OpenAI, but OpenAI tinkered more with them, figured out that prompt engineering actually works, and therefore doubled down on training large language models rather than bidirectional models, and viewed them as more than a fine-tuning tool.

Clara Shih: It's the empirical, it's the tinkering, not just the theoretical.

Aravind Srinivas: Yeah, tinker a lot. Try to see how you can build new things that were just not possible before at all.

Harrison Chase: And then after OpenAI released some of the models, a lot of the really cool stuff happened on Twitter, from random people with access to the APIs. Yeah, that's the state of this technology.

Aravind Srinivas: Yeah, Jensen said the next generation of programmers will be writing a lot of code in just natural language, which is the flow engineering part. You're still architecting things, you're still controlling how things will work sequentially, but you don't have to go and write hardcore code anymore.

Clara Shih: Very exciting. Thank you for all that each of you is doing to further the industry and to inspire so many of us and look forward to seeing where you take your company next.

Aravind Srinivas: Thank you so much.

Harrison Chase: Thank you.

Clara Shih: Aravind, Jerry, Harrison, thank you so much for being with us and for your insights. Three takeaways for me. One, LLMs need tooling around them to make them work; each of these entrepreneurs is doing that, as are Einstein and Data Cloud. Two, there's a need for both developers and non-developers to re-skill and really learn how to use generative AI. And three, there's a great debate around when to use retrieval, or RAG, and when to fine-tune models; for most companies, most of the time, you can just get started with RAG, which is part of Data Cloud. That's all for this week on the Ask More of AI podcast. Follow us wherever you get your podcasts, and follow me on LinkedIn and Twitter. To learn more about Salesforce AI, join our Ask More of AI newsletter on LinkedIn. Thank you, and see you next time.
