On Apple's Illusion of Thinking & Why We're Obsessed with Disproving Intelligence
My thoughts on the recent Apple research paper about large reasoning models failing at high-complexity tasks, and what it says about humans more than about AI
In 2025, there have already been a few seminal moments in AI where something crosses the chasm. I know it’s happening when my non-tech friends text me about it. When ChatGPT introduced image generation and we went crazy with Studio Ghibli. When DeepSeek released R1 and Chinese AI took over the news cycles. This past week, it was Apple’s new research paper: The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity.
A friend in biotech even texted me a link to the actual paper last Monday.
That felt like a clear sign that people are talking about this paper. What struck me wasn’t just what the paper said. It was how eager everyone was to believe it.
The paper’s claim is clear: LLMs don’t actually reason. The “thinking” is an illusion. It’s the kind of claim we want to be true.
We’re living in a strange in-between moment—where AI seems to do things that feel intelligent, even magical. But magic makes us uncomfortable. We want an explanation.
Here’s my take:
We’re not just studying the limits of AI. We’re looking for reassurance that we’re still different. That we’re still special.
This post is about that instinct, and what it reveals.
Key Takeaways
Apple’s research fundamentally challenges the idea that today’s AI models truly reason. Using structured puzzles like Tower of Hanoi and River Crossing, they tested how well large reasoning models (LRMs) solve problems when there’s no shortcut or memorization. How do they do when they have to use logic?
Their findings showed:
Performance breaks down fast:
LRMs do okay on medium-difficulty puzzles, but completely fail once problems get more complex. Past a certain complexity threshold, they seem to stop trying, producing shorter, less coherent reasoning even as the tasks get harder.
They don’t actually follow logic:
Even when you explicitly give the model the right algorithm (the standard recursive procedure for Tower of Hanoi is sketched after this list), it struggles to apply it step by step. One interpretation is that the model is not truly “thinking,” just predicting patterns that look like thought.
Success doesn’t generalize:
A model might solve a hard puzzle in one domain but totally fail an easier one in another. Its reasoning isn’t transferring, and its ability appears constrained to its training distribution.
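For concreteness, here is roughly what “the right algorithm” looks like for Tower of Hanoi: the classic recursive procedure, written as a minimal Python sketch (my own illustration, not the paper’s actual prompt). The paper’s point is that even with a procedure like this spelled out, models struggle to execute it faithfully as the number of disks grows.

```python
# The standard recursive Tower of Hanoi procedure -- the kind of explicit
# algorithm the Apple paper describes handing to the model.
# (A minimal sketch for illustration; not the paper's exact prompt format.)

def hanoi(n, source, target, spare, moves):
    """Append the moves that transfer n disks from source to target."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)   # clear the top n-1 disks out of the way
    moves.append((source, target))               # move the largest remaining disk
    hanoi(n - 1, spare, target, source, moves)   # re-stack the n-1 disks on top of it

moves = []
hanoi(8, "A", "C", "B", moves)
print(len(moves))  # 255 -- the move count grows as 2**n - 1
```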
The paper suggests models are not that smart. They sound smart, but their thinking is shallow or inconsistent. The paper argues for better ways to measure not just what models say, but how they arrive at their answers.
Unpacking Our Fascination
We want to know how the trick works
To me, the more interesting part wasn’t the findings—it was how people reacted. The paper was released in the lead-up to WWDC and quickly started making the rounds.
We’ve all been living with this strange tension in AI. Models can write essays, pass tests, hold conversations. Sometimes they say things that feel eerily thoughtful. It starts feeling unexplainable.
That experience creates cognitive dissonance. It feels like magic. And humans don’t like magic—at least, not when we can’t fully explain it.
So when a paper comes along and says: “Don’t worry. It’s not really reasoning. It just looks like it is,” we breathe a little easier. The trick has been revealed.
We love that moment because it resolves the discomfort. It gives us the comfort of disbelief.
Two competing stories
This discomfort has been constant. Whenever we talk about AI, we’re flipping between two stories.
One says the intelligence behind LLMs is strictly explainable. It’s just math. It’s token prediction. It’s scaling laws, training data, algorithms, and optimization objectives. Of course the models sound smart…just keep stacking GPUs.
But there’s a lurking truth we’re not acknowledging. Models are doing something we can’t fully explain. It isn’t just pattern recognition within the training distribution. It’s generalizing, abstracting, maybe even understanding in some primitive way.
We hate sitting between those two stories. Facing the truth in the latter is deeply uncomfortable. So when a new paper gives us an out, we take it. We’re quick to draw a line and say: this far, no further. The models aren’t thinking. It’s all just a trick.
What do we know about human intelligence?
We say models don’t really reason because they’re just pattern matchers. But how different is that from what we do?
A doctor makes a diagnosis by matching symptoms to familiar cases. A chess master doesn’t simulate every single move through brute force—they recognize the shape of a position. Even our moral decisions are shaped by past experience and social reinforcement.
We don’t always walk through formal logic trees. More often than not, we’re using some combination of intuition, memory, and pattern recognition. We then call it “reasoning.”
What if the illusion isn’t that models are thinking? Maybe the illusion is us convincing ourselves that we aren’t doing something pretty similar.
Why we solve the puzzles (for now)
Yes, humans still do better on puzzles like Tower of Hanoi. We can break problems into subgoals. We simulate. We reason recursively. We can hold structure in our heads and apply it.
And no, models don’t do that perfectly yet. But what if this gap isn’t as significant as we think?
Look at what Anthropic highlighted recently. In their mechanistic interpretability research, they found that models can represent abstract concepts (e.g. “cities” or “numbers”) in similar ways across completely different languages. The model knows what a “city” is in English, Chinese, and Arabic. Not because it memorized translations, but because it formed a shared internal representation. That isn’t surface-level pattern matching. It feels closer to abstract thinking.
Or even more recently, there was a paper responding to The Illusion of Thinking that showed that many of the supposed puzzle-solving failures were artifacts of token limits and flawed evaluation. Models weren’t breaking down…they were refusing to generate thousands of redundant tokens or declining to solve mathematically impossible puzzles. When prompted differently, they did better.
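A quick back-of-the-envelope illustrates the token-limit point: a Tower of Hanoi solution requires 2^n − 1 moves for n disks, so simply transcribing every move grows exponentially. (The tokens-per-move figure below is my rough assumption, not a number from either paper.)

```python
# Why "write out every move" collides with output limits:
# a Tower of Hanoi with n disks needs 2**n - 1 moves, so the transcript
# alone grows exponentially. The tokens-per-move figure is an assumed
# rough cost for illustration, not a number from either paper.

TOKENS_PER_MOVE = 10  # assumed cost of printing one move

for n in (10, 12, 15):
    moves = 2**n - 1
    print(f"{n} disks -> {moves:,} moves -> ~{moves * TOKENS_PER_MOVE:,} tokens")

# 10 disks -> 1,023 moves -> ~10,230 tokens
# 12 disks -> 4,095 moves -> ~40,950 tokens
# 15 disks -> 32,767 moves -> ~327,670 tokens
# Long before a model "gives up on reasoning," the required output alone
# can exceed typical generation limits.
```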
Maybe the gap is smaller than Apple’s paper conveys.
The Real Illusion
It is in our interest to defend our own intelligence by debunking the idea that AI has any of its own.
We want to believe that human intelligence is fundamentally different. That we aren’t just token predictors in meat suits, walking pattern recognition systems with good stories. We want to believe we are special.
When models fail, it feels like proof that the boundary holds.
We should be more intentional about how we view this gap between human and machine intelligence. Intelligence is not a singular thing. It’s not a mystical spark.
It’s imperfect and strange and evolving. It’s memory, abstraction, prediction, learning.
We feel safe when we can explain things. But there will be more we cannot readily explain. The intelligence gap is closing, and we’ll need to rethink what thinking really means.