
Apple Paper Challenges AI Reasoning

Key Points

  • The viral claim that Apple’s research paper proves “AI is dead” wildly misrepresents it, turning a nuanced study into a meme about AI’s failure.
  • Apple’s team tested whether reasoning language models truly reason, using the models’ own stated chain‑of‑thought outputs as a proxy for a reasoning trace rather than large token‑heavy models or external tools.
  • Four smaller models (one each from Claude, Gemini, and DeepSeek, plus OpenAI’s o3‑mini) were evaluated on custom puzzles designed to avoid memorized answers and to allow controlled difficulty levels.
  • The experiments disallowed any external aids—no web searches, Python code, or calculators—mirroring an exam where only the model’s internal “thinking” counts.
  • The most discussed puzzle, the Tower of Hanoi, was used to gauge logical problem‑solving under these strict conditions, highlighting the gap between perceived and actual model reasoning abilities.

Full Transcript

# Apple Paper Challenges AI Reasoning

**Source:** [https://www.youtube.com/watch?v=I9tYAvjkOQk](https://www.youtube.com/watch?v=I9tYAvjkOQk)
**Duration:** 00:11:21

## Sections

- [00:00:00](https://www.youtube.com/watch?v=I9tYAvjkOQk&t=0s) **Apple Paper Sparks AI Memes** - The speaker rebukes the viral claim that “AI is dead” following Apple’s research paper, urging people to read the modest study, which evaluated language‑model reasoning via model‑stated chain of thought rather than elaborate frameworks.
- [00:03:16](https://www.youtube.com/watch?v=I9tYAvjkOQk&t=196s) **Logic Puzzles Reveal Model Reasoning Limits** - The speaker uses classic constraint puzzles like Tower of Hanoi and river crossing to illustrate Apple’s test of language models’ chain‑of‑thought abilities, noting that extra “thinking” tokens improve performance on medium‑complexity tasks.
- [00:07:00](https://www.youtube.com/watch?v=I9tYAvjkOQk&t=420s) **When LLMs Must Call for Help** - The speaker stresses the need for low‑latency, tool‑free inference in time‑critical applications and argues that LLMs need a standardized way to recognize uncertainty and gracefully request assistance from larger models or external processes.

## Full Transcript
The internet is melting down over the Apple research paper. I am losing track of the number of meme posts that basically add up to the statement that AI is fake, AI is dead, Apple has proved reasoning wrong. It's become a meme. And I am begging everybody to sit down, read the paper, understand what Apple is actually claiming, and understand where it actually meets the road in terms of systems design for AI systems, because it is not nearly as dramatic a paper as people are trying to make out.

First, if you haven't read it, I'm going to give you a quick TL;DR on what Apple actually did. Apple's research team wanted to test whether reasoning language models actually reason. And I want to be very precise here. They did not use multiple-pass models. They did not use big, long inference-time models; they did not want to burn a lot of tokens. And they did not use the open-source reasoning trace framework that Anthropic released, which you can use to actually trace thoughts through an LLM. It's super cool, and it's very new; this paper was written before it came out, so they didn't use it. Instead, they used the model's stated chain of thought as a way of tracing reasoning and determining reliability. I could have told them that a model's stated chain of thought has a somewhat iffy relationship to model performance, but here we are. We're testing it anyway.

They took four different models: one each from Claude, Gemini, and DeepSeek, plus OpenAI's o3-mini. Again, this is all about model timing. They're using smaller models. They did not use the frontier o3 model from OpenAI, and they did not use 2.5 Pro from Gemini.
They seemed to deliberately want to test chain of thought versus long inference time, and those are different things.

Then they tested the models they chose on custom puzzles. The models weren't allowed to Google search, weren't allowed to use Python, and were not given any tools at all. It would be like giving a human an exam with no pencil, no paper, no calculator, no tool use whatsoever: just the model and a token budget for thinking. They wanted the puzzles they chose to be something that would not be heavily represented in training data, so the models wouldn't have memorized the answer through pattern association, and something where they could dial the complexity. The one that is getting the most attention I will describe for you, because I didn't really know what it was either. It's called Tower of Hanoi, a very famous mathematical puzzle. If you've ever had a kid, you've seen the toy version: little wooden rods and different-sized wooden discs with holes in them, and the kid sticks the discs onto the rods, which is good for their manual coordination. Well, mathematicians turned it into a puzzle with mathematical implications. Fundamentally, what you're supposed to do in Tower of Hanoi is carefully move the discs so that a bigger disc never sits on top of a smaller disc. And it has different problem complexities, right? Tower of Hanoi with three discs is obviously easier than with four discs, and easier than with five discs. The same idea with logic is the river-crossing problem, where you have constraints on who can cross the river together. There are actually fairy tales about this, right?
Like you cross with a wolf and a watermelon, and I forget what the other one is. There are different versions in different cultures, but the idea is that if you cross in the wrong order (I think the goat is in there), the goat can eat the watermelon and the wolf can eat the goat, so you have to figure out the right order to cross the river in. The point is logic in both cases: Tower of Hanoi and river crossing are logic, and checker-jumping games are also logic. Apple is trying to test the ability of a model, with no tool use and no extra inference time, to work through these problems and reproduce its reasoning accurately with stated chain of thought.

Intuitively, as someone who spends a lot of time with these models, this felt like a fool's errand to me; I expected the models to do very badly at it. It turns out they do somewhat better than I thought. Extra thinking tokens do seem to help with medium-complexity problems, and plain models with no thinking or chain of thought at all seem to do worse. So that's great. But at what's called high complexity, the models just don't do well at all. They fall off a cliff. And this is the big finding that Apple is trumpeting, or really maybe not Apple, because, to give the researchers credit, they did not overcommit in their paper. It would probably be inappropriate, and unkind to the hard work they did, to blame them for all the meme wars that have started. It's not really their fault. But what people are saying about their work is that this cliff they discovered around high-complexity problems means that AI is useless.
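For readers who, like the speaker, hadn't met the puzzle before: the standard recursive solution to Tower of Hanoi (a minimal Python sketch, not code from the paper) shows why the researchers could dial complexity so cleanly, since solving n discs takes 2^n - 1 moves and each added disc doubles the length of a correct answer.

```python
def hanoi(n, source="A", target="C", spare="B", moves=None):
    """Return the sequence of moves that transfers n discs from source to target."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)  # park the smaller stack out of the way
    moves.append((source, target))              # move the largest remaining disc
    hanoi(n - 1, spare, target, source, moves)  # restack the smaller discs on top
    return moves

# The minimum move count is 2**n - 1, so difficulty doubles with every disc:
for n in (3, 4, 5):
    print(n, "discs:", len(hanoi(n)), "moves")  # 7, 15, 31 moves respectively
```

A tool-free chain of thought has to enumerate every one of those moves without error, which makes the failure cliff at high disc counts rather less shocking than the memes suggest.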
And so, the internet lost its mind, and I want to back up and talk about it a little bit. Because at the end of the day, what this is really saying is that if the LLM doesn't have tools and doesn't have inference time, at a certain point it runs out of the ability to probabilistically figure out novel problems. Okay; I also do that. One of the sharper takes I saw on X was someone saying that most of their University of Michigan graduate students seem to use non-logical thinking and lots of pattern matching as well. So I don't know why this is a particularly novel thought, because humans do this too. My thought was that most humans do post hoc reasoning, which is the closest analogy I've found to the idea of chain of thought being accurate: humans do a lot of post hoc reasoning, and LLMs tend to produce a chain of thought that makes sense of the patterns for them. I just consider it a fairly meh response.

I actually think the more interesting take on this paper is: how do you get LLMs to call for help? This is going to really age me, but Who Wants to Be a Millionaire was a game show that was on for a while, and one of the things you could do was call for help. Obviously, other game shows do this too; I know Who Wants to Be a Millionaire is not the only one. But the concept is basically that if you're at the end of your lifeline and you don't have a way to figure out the problem, you call for help. I think that is actually the most practical and useful takeaway for AI systems builders out of this Apple paper.

Basically, there are definitely going to be applications where you want no inference time and minimal tool use, because those add expense and they add time.
And in cases where you don't have a lot of time and expense to burn, you need your models to perform well without those things. As an example: for a customer service bot on the phone, low latency is of the essence. You can't burn a minute and a half of thinking time; it doesn't work. If you're flagging fraud on a transaction, you have very little time to make the fraud decision after the card is submitted. So, does this matter? It absolutely does matter in that sense. There will be times when LLMs need to be able to make good decisions without tool use and without inference time, and the Apple paper squarely applies.

In those situations, you want the LLM to know when to call for help. Just like the game show, you want it to know that now is the time to get a bigger model involved, and then gracefully handle the continuation. Maybe that looks like the customer service chatbot keeping the customer on the phone for a second with an innocuous question while the LLM reasons in the background. Maybe that looks like the transaction flow saying, "Hm, it's taking a little longer than usual," and spinning the disc a bit. There are lots of ways to be graceful about this. But the key insight is that right now LLMs don't have a standard, understood, accepted framework for calling for help when they run into difficult situations.
And if we want multi-agent systems to succeed, we need trigger points that we all understand how to implement (a defined framework, if you will, for trigger points), so that when an LLM hits a certain complexity threshold and is very unlikely to succeed, given the parameters of that model, the testing done on that model, the latency requirements, the lack of tools, and the lack of inference time, then we go and ask for help. Right now, I feel like a lot of that is being done on a sort of best-effort basis by teams designing systems, and I would love it if we as a community came together to actually work on some systems thinking around when it makes sense to have a lifeline to call for help, not necessarily even to a human, but to a smarter model: a model with Python, a model with inference time, a model that can figure out the problem. Because I will tell you, if you give the same problems, the Tower of Hanoi, the river crossing, the checkers problems, to models that have tools and inference time and internet access, they can solve them. Just like graduate students with more tools can solve harder problems. We humans are tool users, and it's actually not a surprise; it's very well known that LLMs, somewhat like humans, do better with tool use. So when you have a task that is more expensive, if we have a defined framework for it, you can go get the expensive model to solve the hard edge case. Imagine a world where the low-latency tiny model can answer 98% of customer queries, and 2% of the time it has to call upstairs to the smart model and have the smart model sort it out. We need a framework so that we all understand what the triggers are for calling upstairs for help.
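The "call upstairs" routing described here can be sketched in a few lines. To be clear, this is a hypothetical illustration, not a real API: small_model, big_model, CONFIDENCE_FLOOR, and LATENCY_BUDGET_S are made-up names, and the self-reported confidence score stands in for exactly the kind of uncertainty signal the speaker says we don't yet have a standard for.

```python
CONFIDENCE_FLOOR = 0.8   # below this, the small model calls for help
LATENCY_BUDGET_S = 1.5   # hard ceiling for the low-latency fast path

def answer(query, small_model, big_model):
    """Try the cheap fast path first; escalate to the expensive model on a trigger."""
    draft, confidence, elapsed = small_model(query)
    # Trigger points: low self-assessed confidence, or the fast path already
    # blew its latency budget. Either way, call upstairs for help.
    if confidence < CONFIDENCE_FLOOR or elapsed > LATENCY_BUDGET_S:
        return big_model(query)
    return draft

# Stub models for illustration: the small model handles easy queries itself
# and punts anything it is unsure about (pretend refunds are the hard 2%).
def small_model(q):
    confident = "refund" not in q
    return ("small:" + q, 0.95 if confident else 0.3, 0.2)

def big_model(q):
    return "big:" + q

print(answer("what are your hours?", small_model, big_model))  # answered by small model
print(answer("dispute this refund", small_model, big_model))   # escalated upstairs
```

The interesting open problem is the trigger itself: the paper suggests the complexity cliff is real and measurable, but today each team picks its own confidence heuristic on a best-effort basis.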
And I think that is one of the most applicable takeaways. I do not think it is helpful to run around saying that reasoning doesn't work, oh my gosh. In fact, I think the entire thing probably needs to be rerun with tool use. It needs to be rerun with internet access, with more advanced models, with inference time, and with the reasoning trace framework that Anthropic released. And if someone says, "Hey, that's pretty expensive," my answer is that Apple is sitting on a lot of cash. They can afford to run this test if they really care about it. And I actually think that if AI is going to be transformative to society, it's probably worth budgeting for a little bit of experimentation to understand how these models reason, because it's pretty hard to solve for alignment with these models if we can't figure out how they reason. So, to me, this feels like money well spent. I don't mind that they produced the paper, and I certainly don't want to attack them. I think more work in this area is better. I think there are practical takeaways, and I think the internet lost its gosh darn mind. It needs to settle down.