Flaws in Apple's AI Reasoning Benchmark

6m • Unknown Channel • ai-ml • deep-dive • intermediate

Sections

00:00:00 Critiquing Apple's AI Reasoning Benchmark - The speaker argues that Apple's proposed symbolic reasoning benchmark is fundamentally flawed and not a genuine test of AI reasoning, while praising Andrew M’s blog for exposing its shortcomings and urging the community to establish more meaningful evaluation standards.

Full Transcript

**Source:** [https://www.youtube.com/watch?v=Ahtx-Lo1Oa0](https://www.youtube.com/watch?v=Ahtx-Lo1Oa0)
**Duration:** 00:06:47

0:00 We need to take a few minutes to talk about what good benchmarks and bad benchmarks look like, because I am so tired of seeing people in my comments on TikTok, on YouTube, and other places trying to argue for the Apple reasoning paper. I've been avoiding talking about it, not because I don't want to, but because I generally don't like to do takedowns. But here we are: we're going to do a takedown of the Apple paper on AI reasoning and why I think it's fundamentally flawed.

0:30 I want to give some credit where it's due. Andrew M wrote an incredible blog post articulating some of the flaws in the paper, and I want to give him credit; he's done a fantastic job calling it out. And I want us to reset the bar around what the Stanford study really called for when they asked us to think, as a community, about what benchmarks and tests would be good for AI to try and hit that AI isn't good at yet. This Apple benchmark around reasoning is not one of them, even though they say it is.

1:04 So GSM-Symbolic is the benchmark that Apple proposed in their paper, and, too long didn't read, essentially what they're saying is that large language models cannot reason, because if you give them symbolic shifts in the problem space that are fairly nuanced and fairly small, you get dramatically different outputs, which suggests that they can't read through what we would call in human terms a trick question.

1:31 Well, Apple's researchers took that perspective, they ran the test, they saw that large language models are bad at this, and their conclusion was that there's really only one option here: the LLM is just doing vectorization and pattern matching, and there's no reasoning whatsoever going on. The parallel pathing we all talk about at inference isn't producing value, and net net there is no novel reasoning going on here, because, their argument was, if there were, the AI would be able to reason through the small symbolic differences and nuances, the trick questions, and get it right.

2:09 Well, Apple was wrong, and the reason Apple was wrong is that the AI can get it right. It just needs to be reminded that it's looking for a little bit of maybe a trick question. And if you say, "Oh, that's giving the AI too much help," I disagree with you, and here is why. At the end of the day, you are measuring two things, not one thing, which is always a mark of a bad benchmark. When you use GSM-Symbolic, yes, it measures logical reasoning; I agree with Apple there. It also measures naivety: it measures whether the large language model is used to being tricked.

2:51 And if you have ever used a large language model, you know their entire personality, the entire training process that they go through, is all about earnest questions and earnest answers. They are so helpful; in fact, one of the real weaknesses they have is that it's really hard for them to say no to you when they should. So if you're training a large language model to be incredibly helpful, it would make sense that it would assume any question it was asked was actually not a trick question; it was an earnest question asked in good faith. And that seems to be what LLMs are primed for unless you tell them otherwise.

3:30 And so, credit to Andrew: he went in and basically said, "Let me take the GSM-Symbolic data set and let me tweak it," not by changing the problems, not by asking the AI to solve an easier problem, but by giving a prompt at the top that basically says, "Hey, warning sign, watch out, AI: there may or may not be a little weirdness going on in this question." I'm paraphrasing, but that's the basic idea. He put up a verbal warning sign with one line at the top of the prompt, basically saying, "Watch out, there may be some oddness with this prompt; think about it carefully."

4:06 And all of a sudden performance improved by 90%. If you can improve performance by 90% by changing one line of the prompt and saying "hey, watch out," and suddenly the AI can magically see through all the trick questions and can reason correctly, you don't have a benchmark. You have flawed data and a bad paper. It's just not correct.

4:31 The evidence we have is actually adding up in favor of AI reasoning. Stanford is correct to demand better benchmarks, and I do appreciate the Apple researchers for attempting to put one together. It's a ton of work to produce a paper. I don't want to demean or disregard the work that they did; it moves the field forward. I just think you need to be careful about what you're measuring, and you need to measure not just naivety but also the ability to reason directly.

5:03 And maybe we should start less skeptically, because I think at this point the burden of proof is really on the skeptics, not on the people propounding AI reasoning. There is enough evidence that AI is doing some kind of reasoning that you have, or should have, a higher burden of proof if you want to argue the other way. One of the implicit methodology flaws in the paper is that it basically assumes AI doesn't reason unless proven otherwise, and I actually think that, given the year we've had, that's incorrect. That is not an empirically based assessment of the reality on the ground.

5:48 Instead, the correct frame is to say that, empirically speaking, AI does appear to reason, and we should be thinking about how we can more effectively understand what reasoning looks like. And by the way, if you're wondering, "Is it really reasoning, or is it just the appearance of reasoning?", I would like to say most humans only use the appearance of reasoning. I rarely hear tight, coherent, cogent reasoning in most human conversations, or frankly in most human writing. If that is the bar, then most of us are going to fail it too.

6:18 And so I think, in a sense, the simplest answer, using Occam's razor, is the easiest: if it walks like a duck and it talks like a duck, it's a duck. If it looks like it's reasoning and it acts like it's reasoning, and you can write a little prompt that says "hey, watch out, this might be a trick question" and suddenly it reasons, it's probably reasoning. That's why I think Apple is incorrect, and I think AI can reason. Sound off in the comments; let me know what you think.
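
The experiment the speaker describes (keep the problems unchanged, add one warning line at the top of each prompt, and compare scores) can be sketched roughly as below. This is a minimal illustration, not the blog author's actual harness: the warning wording is invented (the speaker only paraphrases it), `ask_model` is a hypothetical callable standing in for whatever LLM client you use, and `toy_model` and `TOY_PROBLEMS` are made-up stand-ins, not GSM-Symbolic data or real results.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical wording; the transcript only paraphrases the real warning line.
WARNING = ("Watch out: this question may contain irrelevant or misleading "
           "details. Think about it carefully.")

Problem = Tuple[str, str]  # (question text, expected answer)


def with_warning(prompt: str, warning: str = WARNING) -> str:
    """Prepend one warning line; the problem text itself is left unchanged."""
    return f"{warning}\n\n{prompt}"


def accuracy(ask_model: Callable[[str], str],
             problems: List[Problem],
             transform: Callable[[str], str] = lambda p: p) -> float:
    """Fraction of problems answered exactly right under a prompt transform."""
    if not problems:
        raise ValueError("need at least one problem")
    correct = sum(ask_model(transform(q)).strip() == a for q, a in problems)
    return correct / len(problems)


def compare(ask_model: Callable[[str], str],
            problems: List[Problem]) -> Dict[str, float]:
    """Score the same problems with and without the warning line."""
    baseline = accuracy(ask_model, problems)
    warned = accuracy(ask_model, problems, with_warning)
    return {"baseline": baseline, "warned": warned,
            "absolute_gain": warned - baseline}


# Made-up examples for illustration only (not taken from GSM-Symbolic).
TOY_PROBLEMS: List[Problem] = [
    ("Sam picks 5 apples; 2 of them are smaller than average. "
     "How many apples does Sam have?", "5"),
    ("Ana has 7 pens and buys 5 more. How many pens does Ana have?", "12"),
]


def toy_model(prompt: str) -> str:
    """Stand-in for a real LLM: falls for the distractor unless warned."""
    warned = prompt.startswith(WARNING)
    if "apples" in prompt:
        return "5" if warned else "3"
    return "12"


if __name__ == "__main__":
    print(compare(toy_model, TOY_PROBLEMS))
```

To try this against a real model, pass your own function in place of `toy_model`. The sketch reports the absolute gain in accuracy; the transcript does not say whether the quoted 90% figure is an absolute or a relative improvement, so that distinction is worth checking against the original blog post.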