Eval-Driven Development Powers Legal AI Acquisition

Key Points

  • The past state of AI matters, as shown by Thomson Reuters’ 2023 acquisition of Casetext, a decade‑old startup that successfully pivoted to LLM‑driven legal analysis, for $650 million.
  • Casetext’s value lay in eliminating hallucinations for lawyers, delivering provably accurate citations and arguments that meet the profession’s zero‑tolerance‑for‑error standards while easing heavy workloads.
  • The deal illustrates the high‑valuation, large‑scale potential of AI products that can guarantee precision in regulated fields such as law.
  • Their success stemmed from “evaluation‑driven development,” a rigorous, test‑like process of repeatedly measuring and refining specific LLM prompts and workflow steps until each component performed perfectly.
  • This eval‑centric approach, analogous to test‑driven development in software, is presented as a repeatable pattern for building trustworthy, scalable AI applications across industries.
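The loop these points describe, evaluating an LLM component against a battery of test cases and refining the prompt until every case passes, can be sketched in a few lines. Everything here is a hypothetical stand-in: `fake_llm` replaces a real model API, and the eval cases are toy examples, not Casetext's actual system.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    query: str
    check: Callable[[str], bool]  # True when the model output is acceptable

def run_evals(prompt, cases, model):
    """Run every eval case against `model`; return the failing cases."""
    return [c for c in cases if not c.check(model(prompt, c.query))]

def fake_llm(prompt: str, query: str) -> str:
    # Stand-in for a real model call: a stricter prompt yields correct answers.
    if "strictly factual" in prompt:
        return {"2+2": "4", "capital of France": "Paris"}.get(query, "unknown")
    return "maybe 5"

cases = [
    EvalCase("2+2", lambda out: out == "4"),
    EvalCase("capital of France", lambda out: out == "Paris"),
]

# First attempt fails some evals, so the prompt is refined and re-run;
# a component does not "pass" until the failure list is empty.
assert run_evals("answer briefly", cases, fake_llm)
failures = run_evals("answer briefly, strictly factual", cases, fake_llm)
assert failures == []
```

The point of the harness is the feedback loop: a nonempty failure list sends you back to the prompt, not onward to the next component.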

Full Transcript

**Source:** [https://www.youtube.com/watch?v=Rauz-3jycQ0](https://www.youtube.com/watch?v=Rauz-3jycQ0)
**Duration:** 00:07:15

## Sections

- [00:00:00](https://www.youtube.com/watch?v=Rauz-3jycQ0&t=0s) **Thomson Reuters’ $650M AI Acquisition** - The speaker details Thomson Reuters’ 2023 purchase of legal-tech startup Casetext for $650 million, emphasizing how its successful pivot to hallucination-free LLM-driven legal analysis demonstrates the groundwork needed for sky-high AI valuations and large-scale rollout.
[0:00] We can be tempted to believe that AI only matters going forward, that the past state of AI is not relevant, that the past state of tech isn't relevant. But that's not true. I want to tell you the story of an acquisition in AI that happened over a year ago now, back in the summer of 2023, and only now are some of the details starting to leak as people start to feel like they're able to talk about it. It's relevant because it illustrates what's required to get a really sky-high valuation and a truly rolled-out application at scale, and we haven't talked about it much.

This is the company: Casetext, a Thomson Reuters company now. Thomson Reuters are the news people, but apparently they acquire startups as well, and they acquired Casetext back in August of 2023 for $650 million. At that time the company was about ten years old and had gone through a few funding rounds. Obviously, if you were a ten-year-old company in 2023, you were not starting out in the LLM space; this was before large language models, so they had to pivot. It turns out that they pivoted very successfully into an LLM-driven legal analysis use case, and they had about 10,000 clients at the time of acquisition.

What they had done was figure out, in the summer of 2023, how to avoid hallucinations for lawyers. You can see how that would be highly valuable. A lawyer absolutely needs to know that they are not going to get in trouble with the judge or the bar association for citing cases that don't exist. They need to make sure their legal arguments are sound. They have zero tolerance for error, but they also have a tremendous workload, and they would love help getting through that workload more efficiently if the accuracy can be guaranteed. That's what Casetext set out to do.

[2:00] What's interesting is not that they accomplished it and got their M&A and everybody walked away with money. What's interesting is how they did it, because that is a pattern applicable to other startups and other applied AI use cases, and that's why I want to call it out.

Over the weekend, a very long read from Eugene Yan broke that talks about this idea of eval-driven development. We've seen this for a long time in software, where it was called test-driven development, or TDD. In LLMs, eval-driven development is the idea that you relentlessly, rigorously evaluate the performance of a particular prompt, or of a particular series of steps performed by an LLM, and then you feed that back in until you can get the LLM to behave exactly the way you want it to.

In this case, what they did at Casetext was take the act of doing a deposition, or any other task in the legal profession they wanted to cover, and say: we are going to break this down and evaluate it very, very thoroughly, until we can get the LLM to perform each component of the task exactly right before we get it to do the overall task.

That is a tremendously helpful insight. I'm going to link the whole read here under the YouTube video, but it's absolutely critical to understanding how to build LLM applications at scale. Essentially what they're saying is: if attention is all that matters, to riff on "Attention Is All You Need," the famous 2017 paper that described some of the foundational technologies of LLMs (how Transformers work, how self-attention works), then attention needs to be applied in detail, at a micro-step level, to avoid hallucinations. In fact, they talk about more than 1,000 different evals for a given task, and they did not pass a particular task at Casetext until every single eval passed.

[4:04] That means if you had even one failure, you would go back, break down that step, and understand it better. That was the key. They said that fundamentally, getting Casetext to a zero-hallucination state required grinding through micro-level detail with lawyers to understand exactly what they were doing, so that the LLM could be instructed extremely specifically. That was how they actually got to zero hallucinations, and that is what justified the $650 million valuation from Thomson Reuters.

So the generalizable pattern I take away is this: if you are building an applied AI system, it is possible that the hallucinations you are experiencing are a function of you not understanding how to instruct the LLM precisely enough. How can you instruct it so precisely that it cannot mess up, that it must come back with exactly the correct response? It's a worthy question. How much have you broken down your tasks? If you are building AI systems, how much have you structured them into extremely specific, micro-detail-level steps?

I think one of the things that's going to distinguish strong AI applications is that they make that step breakdown intuitive and easy for the end user, because the end user is going to be too lazy. I am going to be too lazy, when I have a meeting in ten minutes and I need an agenda, to go through and break every single thing down into micro-level steps. One of the developments I've noticed in 2024 is that some of those common tasks are easier: they're less prone to hallucination and more precise, thanks to backend work from companies like OpenAI or Anthropic.

[6:01] Similarly, if you are building a system where you are trying to make things easier for the end user, you have to take on more of the specific fine-tuning and prompting inside your system, so that users can be fairly generic and still invoke a very specific series of individual steps on the back end, steps that lead to the kind of precise and accurate result they would want. In a sense, they want to treat the LLM as if it already knows what they're talking about, what their context is, and what they need, and will then just generate a relevant result. But that doesn't come by happenstance; that doesn't come by accident. That comes because you, the system designer, took the time to understand their intent, break it down into specific granular components, and instruct the LLM how to handle that entire logical sequence.

So I think what we got from the case study on Casetext, pun intended, is a great example of both the venture value created by eval-driven development and the importance of designing based on evals to get exactly what you need out of an AI system. Hallucinations might just be an artifact. Thoughts?
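The transcript's core rule, breaking a task into micro-steps and refusing to pass the overall task until every per-step eval passes, can be illustrated with a toy citation-verification gate. The step functions and the tiny case database below are invented stand-ins for illustration, not Casetext's actual implementation:

```python
KNOWN_CASES = {"Marbury v. Madison", "Roe v. Wade"}  # toy citation database

def extract_citations(draft: str) -> list[str]:
    # Stand-in for an LLM micro-step that pulls case names from a draft.
    return [c.strip() for c in draft.split(";") if c.strip()]

def verify_citation(name: str) -> bool:
    # Stand-in for a lookup against a real citation database.
    return name in KNOWN_CASES

def passes_all_evals(draft: str) -> bool:
    """One eval per extracted citation; a single failure blocks the task."""
    results = [verify_citation(c) for c in extract_citations(draft)]
    return bool(results) and all(results)

assert passes_all_evals("Marbury v. Madison; Roe v. Wade")
assert not passes_all_evals("Marbury v. Madison; Smith v. Imaginary")
```

The design choice mirrors the "every single eval must pass" rule: the gate uses `all(...)`, so one fabricated citation sends the whole draft back for rework rather than letting a mostly-correct output through.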