
Beyond Benchmarks: Real-World AI Evaluation

Key Points

  • The launch of Claude 3.7 highlights the urgent need for better AI evaluations, as current benchmarks (e.g., AIME) are overfit: they reward models trained specifically to excel on them rather than to perform useful work.
  • Real-world usefulness is better captured by emerging tests such as OpenAI's SWE-Lancer benchmark, which measures a model's ability to independently complete freelance jobs and on which Claude 3.5 currently scores highest, ahead of newer models.
  • Practitioners are left to rely on subjective “vibes” or intuitive impressions when comparing models—an informal, hard‑to‑communicate gauge that underscores the industry’s lack of standardized, task‑focused metrics.
  • Claude 3.7 follows a challenger-brand strategy by doubling down on Claude's historic strength in coding and development, offering tight integration with the terminal, Cursor, and GitHub (via the Claude app) to attract developers seeking a specialized, high-performance tool.

Full Transcript

**Source:** [https://www.youtube.com/watch?v=okIVbBnk-Sg](https://www.youtube.com/watch?v=okIVbBnk-Sg)
**Duration:** 00:06:02

Sections

  • [00:00:00](https://www.youtube.com/watch?v=okIVbBnk-Sg&t=0s) **Need for Real-World AI Benchmarks:** The speaker argues that models like Claude 3.7 are overfitted to popular academic tests, urging the development of more meaningful evaluations, such as OpenAI's newly introduced SWE-Lancer benchmark for freelance tasks, and highlighting Claude 3.5's unexpectedly strong real-world performance.
The launch of Claude 3.7 is really underlining for me that we need better evals, better evaluations, for AI models right now. I think models are very overfitted to the evaluations that are widely published, like the AIME, and so all of these models score incredibly well on these widely known evaluations. But it's because they are trained from the get-go to be good at the evaluations, and that's sort of circular, isn't it? If you take a step back and think about what Satya Nadella was saying when he talked about wanting models to do economically useful work, it's a bit of a nod in that direction: the models are very good at this benchmark or that benchmark, but are they actually doing work that's meaningful?

There are not great benchmarks for meaningful work. Probably the closest is a brand-new benchmark that just came out called SWE-Lancer, which OpenAI is maintaining. Really, independent orgs should maintain these, but right now OpenAI is maintaining this one. It's designed to measure a model's ability to independently complete freelance work, and that's probably the closest we have to measuring real work. And Claude 3.5, not even 3.7, Claude 3.5 did very, very well on that: it scored the highest so far.

I think that's an example of what I mean when I say that even if the models all score very similarly on these academic benchmarks, in the real world, doing real-world work like what SWE-Lancer is trying to measure, or what I hope other benchmarks will emerge to measure, the models are different. The models are not the same. Right now we refer to this as the "vibes" of the model: the impression the model gives you when you work with it. Unfortunately, it means that people like me, who spend a lot of time playing with AI models, become, without wanting to, effective gatekeepers of this implicit information. I sit there and I understand intuitively, as soon as I touch Claude 3.7, how it feels different from 3.5 and how it feels different from the ChatGPT models. But it's really difficult to convey that in a way that's useful to people who have different lines of work, and I think that's something the industry as a whole is really struggling with right now.
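To make the idea of task-focused, real-work evaluation concrete, here is a minimal sketch of the shape such a benchmark harness might take in Python. This is an illustration rather than anything described in the talk: the tasks and graders below are hypothetical stand-ins, whereas a benchmark like SWE-Lancer grades against real freelance deliverables.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    """One real-world job: a prompt plus a pass/fail grader."""
    name: str
    prompt: str
    grade: Callable[[str], bool]  # returns True if the output does the job

# Hypothetical tasks; a real benchmark would source these from actual
# freelance briefs and grade with tests or expert review, not string checks.
TASKS = [
    Task(
        name="csv-dedupe-script",
        prompt="Write a Python script that removes duplicate rows from a CSV file.",
        grade=lambda out: "csv" in out and "def" in out,  # stand-in for running real tests
    ),
    Task(
        name="landing-page-copy",
        prompt="Write a 50-word landing page blurb for a bookkeeping SaaS.",
        grade=lambda out: 30 <= len(out.split()) <= 80,
    ),
]

def run_benchmark(model: Callable[[str], str]) -> float:
    """Score any prompt -> completion callable by its task completion rate."""
    passed = sum(task.grade(model(task.prompt)) for task in TASKS)
    return passed / len(TASKS)

if __name__ == "__main__":
    # Dummy "model" so the sketch runs end to end without any API.
    echo_model = lambda prompt: "def dedupe(): ...  # reads a csv"
    print(f"task completion rate: {run_benchmark(echo_model):.0%}")
```

The design point is that each task carries its own pass/fail grader, so the headline number is a completion rate for useful work rather than accuracy on a fixed question set.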
I will say, from my perspective, with 3.7 they're doing a classic challenger-brand play, and they're really focused on what Claude has historically been fantastic at, which is coding and building. If you look at where they've prioritized making 3.7 available, it's aimed specifically at coding and building: it's available in the terminal, it's available right away in Cursor, and it's available with a GitHub integration in the Claude app. They want you to build with this model. And I think that makes a lot of sense. Challenger brands typically specialize, and that's sort of how they win in a space, while larger brands like ChatGPT typically have to generalize and make their value proposition coherent, which is part of why OpenAI has been investing in GPT-5: they have to take this whole half dozen of models and turn them into one coherent model that everyone understands, for a generalized audience.

I would expect 3.7 to continue the tradition of Claude punching above its measurement weight. 3.5 remained a favored coding model for nine or ten months after it was released, despite all the other models that came out along the way. Having played with 3.7 and seen how much better it feels than 3.5, I suspect it will have the same fate: it will be a very popular coding model for a long time. I'm going to see if I can catch up and give some examples later today on the Substack of what it's like, of how 3.7 compares to 3.5. I want to make it really tangible. I want you to see the difference, given the same prompt, in how these models react and especially in how they build. That's my attempt to show how they do economically useful work.

But again, we really need independent benchmarks that help us figure this out. It's not something individuals can really do, and it really shouldn't be something model makers do, because model makers, even if they try not to be, are inherently kind of biased. They just are. I have no doubt that the reason OpenAI launched the freelancer benchmark is that they expect to beat it with GPT-4.5, or maybe with GPT-5. That's why you would pay to maintain and drive a benchmark, and benchmarks are not free; it takes a lot of work to maintain them. We desperately need more benchmarks that are real-world. And right now, the organizations that could pay, companies, have no incentive to share how they're using this for economically useful work, because that's secret sauce for the company. Why would you share it? So we're in a situation where no one who has the money has the incentive to develop and set up an independent evaluation for economically useful work, and it would be really helpful if we had that. I don't know what the answer there is. If you magically have a few million dollars and you're listening to this: set up a benchmark. Set up a benchmark. It would be something the rest of the world would really appreciate. And if you don't, you can join me in asking and wishing that people would. Cheers.
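For the side-by-side comparison described above, sending one prompt to two Claude versions and eyeballing the results, a minimal sketch with the Anthropic Python SDK might look like this. It assumes the `anthropic` package is installed and `ANTHROPIC_API_KEY` is set in the environment; the model IDs and the prompt are assumptions and should be checked against Anthropic's current model list.

```python
# pip install anthropic
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PROMPT = "Build a single-file HTML page with a working pomodoro timer."

# Model IDs are assumptions; check Anthropic's model list for current names.
MODELS = ["claude-3-5-sonnet-20241022", "claude-3-7-sonnet-20250219"]

for model in MODELS:
    response = client.messages.create(
        model=model,
        max_tokens=2048,
        messages=[{"role": "user", "content": PROMPT}],
    )
    text = response.content[0].text
    print(f"--- {model} ({len(text)} chars) ---")
    print(text[:500])  # show the opening of each response for a quick feel
```

Diffing the full responses, especially the generated code, against the same prompt is the kind of tangible comparison the speaker describes.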