The scaling hypothesis predicted that neural network capabilities would increase smoothly and dramatically with scale. Six years later, it has been largely vindicated on its own terms and yet the nature of the resulting capabilities raises questions the hypothesis was never designed to answer. The outputs look like reasoning. Whether they are reasoning, in any sense that matters for the cases we care about most, is increasingly unclear. The evidence on both sides is now substantial enough to be worth reviewing.
Scene Completion as Parable
In 2007, Hays & Efros published a scene completion system. The task: remove an object from a photograph and fill in the missing region convincingly. The algorithm was straightforward—nearest-neighbor lookup against a corpus of photographs, sampling pixels from the best-matching region and compositing them in. With a corpus of thousands of images, results were poor (visible seams, texture mismatches, obvious artifacts). With a corpus of millions, same algorithm, the results became difficult for human evaluators to distinguish from ground truth.1
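The core loop is simple enough to sketch. The real system matched on GIST scene descriptors and used graph-cut seam finding for compositing; the version below is a deliberately naive stand-in that matches on raw pixel differences over the visible context, just to show how little machinery the approach requires. Function and variable names are illustrative, not from the paper.

```python
import numpy as np

def complete_scene(image, mask, corpus):
    """Fill the masked region of `image` by copying pixels from the
    best-matching corpus image. `mask` is True where pixels are missing.
    Match quality: sum of squared differences over the visible context.
    (The actual 2007 system matched on scene descriptors and blended
    seams; this sketch keeps only the retrieve-and-paste skeleton.)"""
    context = ~mask
    best_score, best_match = np.inf, None
    for candidate in corpus:              # brute-force nearest neighbor
        diff = (image - candidate)[context]
        score = np.sum(diff ** 2)
        if score < best_score:
            best_score, best_match = score, candidate
    result = image.copy()
    result[mask] = best_match[mask]       # composite: paste the missing pixels
    return result
```

Nothing in this code models geometry, lighting, or occlusion. Everything rides on the corpus being large enough that some candidate happens to contain the right pixels.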
Halevy, Norvig & Pereira (2009) wrote the interpretive frame for this class of result in “The Unreasonable Effectiveness of Data.” The argument: simple models plus sufficient data beat complex models on less data, reliably. Memorization is a viable strategy when your memory is large enough. This turned out to be correct, and more or less predicted the subsequent fifteen years of machine learning practice.2
There is a second reading of the scene completion result that gets less attention. The algorithm never learned what a scene is. It accumulated enough examples that retrieving the correct pixels became indistinguishable from knowing what should be there. Past a data threshold, the gap between correlation and comprehension becomes invisible from outside the system. You cannot tell, from the output alone, whether the system understands the scene or merely has a sufficiently large lookup table.
This matters because LLMs are, in a meaningful sense, scene completion operating on text instead of pixels.
What the Scaling Hypothesis Predicted, and What Showed Up
The original scaling hypothesis essay (2020) argued that intelligence is what you get when you apply simple neural units and learning algorithms at sufficient scale. Feed a model enough data and compute, and capabilities—reasoning, generalization, understanding—emerge as the optimal solution to the prediction problem. GPT-3 had just been released and appeared to confirm this dramatically.
The prediction was directionally correct. More scale did produce better outputs, and the improvement was often superlinear in surprise value if not in loss curves. But the nature of what emerged was not cleanly predicted by the hypothesis. The scaling hypothesis said: more data → better outputs. It did not specifically predict that more data → a system that can hold a conversation, explain a joke, or debug a Python function. Those capabilities arrived as apparent byproducts somewhere between GPT-2 and GPT-3, in the same way that the scene completion results jumped from “obviously fake” to “convincing” when the corpus crossed a threshold.
The mechanism for this is not mysterious. Compress enough text and the co-occurrence statistics become dense enough that outputs begin to look causal. The model does not know that smoking causes cancer. It knows that “smoking,” “cancer,” and “causes” appear near each other in specific distributional patterns across millions of documents. At low scale this produces bad autocomplete. At high scale it produces something that is very difficult to distinguish from understanding in the majority of test cases.3
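A toy version of those distributional patterns makes the point concrete. The sketch below (names and corpus invented for illustration) counts word pairs that appear near each other; note that the resulting counts are symmetric, so the association between "smoking" and "cancer" carries no causal direction at all.

```python
from collections import Counter

def cooccurrence_counts(documents, window=5):
    """Count how often word pairs appear within `window` tokens of each
    other. This is the raw material of distributional semantics: the
    counts encode association strength, not causal direction."""
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        for i, w in enumerate(tokens):
            for v in tokens[i + 1:i + 1 + window]:
                counts[tuple(sorted((w, v)))] += 1  # unordered pair
    return counts

docs = [
    "smoking causes cancer",
    "studies link smoking to cancer",
    "cancer risk rises with smoking",
]
counts = cooccurrence_counts(docs)
# ("cancer", "smoking") co-occurs in all three documents, but the count
# is the same whichever word comes first: the statistics cannot
# distinguish "smoking causes cancer" from the reverse claim.
```

Scale this from three sentences to the internet and the associations become dense enough to answer most questions correctly, for the same reason the scene completion corpus eventually contained the right pixels.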
Fluid vs. Crystallized Intelligence
A useful frame here comes from Cattell’s (1963) distinction between fluid and crystallized intelligence.4 Fluid intelligence: the ability to reason about novel situations, recognize patterns you have not seen before, solve problems without relying on prior domain knowledge. Crystallized intelligence: accumulated knowledge, facts, skills, domain expertise—the residue of fluid intelligence having been applied to the world over time.
The distinction is more than taxonomic. There is evidence that something like causal reasoning precedes knowledge accumulation developmentally. Leslie & Keeble (1987) showed that infants around six months old respond differently to causal vs. non-causal event sequences—they register surprise when a filmed collision sequence is played in reverse. This is with negligible world experience and no opportunity to accumulate statistical patterns over millions of examples.5 The capacity to model cause and effect appears to be present before knowledge has had any serious opportunity to accumulate. Fluid intelligence precedes, and appears to enable, the accumulation of crystallized intelligence. Not the other way around.
LLMs have the order of operations reversed. They have crystallized intelligence at extraordinary scale—the internet’s worth of accumulated knowledge baked into weights—with fluid intelligence appearing as an emergent byproduct of compression. The outputs often look like reasoning. It is not clear that a causal model of the world is generating them, as opposed to very good pattern-matching against a training distribution that happens to include a lot of text about reasoning.
The practical question: does this matter? For a long time my answer was “probably not.” If the output is correct, the mechanism is academic. The scene completion result looks right; who cares whether the algorithm understands geometry?
The answer turns on failure modes. A purely correlational system, pushed sufficiently far out of its training distribution that the pattern-matching has nothing to retrieve, does not degrade gracefully. It degrades catastrophically and confidently. This is the empirical pattern that makes the mechanism question non-academic.
The ARC-AGI Evidence
The ARC-AGI benchmark series was designed to test something closer to fluid intelligence: novel patterns, no prior exposure, problems where domain knowledge is useless.
The trajectory across versions is informative:
ARC-AGI-1 (2019–2024): Started at ~0% for LLMs (GPT-3 era) and ~20% for bespoke systems in 2020. By late 2024, OpenAI’s o3 scored 75.7% on the semi-private eval at the $10k compute limit, and 87.5% in a high-compute configuration. By 2025, frontier models were hitting 90%+. Effectively solved.6
ARC-AGI-2 (2025): Harder compositional puzzles. Rapid progress: Gemini 3 Pro scored 31.1% at launch, Gemini 3 Deep Think reached 45.1% initially and 84.6% in a later update (ARC Prize verified). With application-layer refinement harnesses, scores exceeded 95%. But the ARC Prize Foundation’s own verification revealed a problem: Gemini 3 Deep Think’s reasoning chains used correct ARC color mappings that were not provided in the evaluation prompt, strongly suggesting the model’s training data included ARC benchmark data. The benchmark was being overfit not through direct memorization of solutions, but through distributional overlap between training data and test format.7
ARC-AGI-3 (launched March 25, 2026): The format changed entirely. Instead of static grid puzzles, agents are dropped into turn-based game environments with no stated rules, no instructions, and no win conditions. The agent must explore, build a model of the environment’s mechanics, infer goals, and act. Scoring is based on action efficiency relative to a human baseline, using a squared penalty: if a human completes a level in 10 actions and the AI takes 100, the score is (10/100)² = 1%.8
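The scoring rule described above is a one-liner. The function name is mine, and this sketch ignores edge cases the real scorer presumably handles (an agent beating the human baseline, unfinished levels), but it reproduces the worked example from the text.

```python
def action_efficiency_score(human_actions: int, agent_actions: int) -> float:
    """Squared efficiency penalty as described in the text:
    (human baseline actions / agent actions), squared. Quadratic rather
    than linear, so slow brute-force solutions earn almost nothing."""
    return (human_actions / agent_actions) ** 2

# The example from the text: human solves a level in 10 actions,
# the AI takes 100.
print(f"{action_efficiency_score(10, 100):.0%}")  # prints "1%"
```

Under a linear rule the same agent would score 10%; squaring is what collapses "found the answer eventually" into near-zero credit.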
Results from the 30-day preview period:
- Best system overall: StochasticGoose (Tufa Labs), 12.58%. A CNN-based action-learning agent using reinforcement learning. Four-layer convolutional network encoding 64×64 frames, predicting which actions cause frame changes, with off-policy retraining between levels. No pretraining on internet text. No parametric knowledge beyond what it learned through interaction.
- Top 3 systems: All non-LLM approaches (CNN + RL, rule-based state graph exploration, training-free frame graph search).
- Frontier LLMs: All scored under 1%. Gemini 3.1 Pro Preview: 0.37%. GPT-5.4: 0.26%. Claude Opus 4.6: 0.25%. Grok-4.20: 0.00%.
- Humans: 100%. Over 1,200 players completed 3,900+ games, many optimizing for theoretical minimum steps.
The gap between StochasticGoose and the frontier LLMs is over 12 percentage points. A CNN with a few million parameters and no pre-existing world knowledge outperformed every frontier language model by a large margin on a task that humans find easy and sometimes entertaining.
The failure mode identified by the ARC team is illustrative: frontier LLMs tend to interpret initial visual information in a new environment and incorrectly assume a game framework they have seen before, then execute a plan based on the wrong assumption without re-evaluating. They “think they are playing another game.” Systems with less pretraining and no internet-derived priors were less susceptible to this precisely because they had no priors to misapply.9
When you design a benchmark that the scaling paradigm cannot shortcut through distributional overlap with training data, the best systems stop looking like LLMs.
The State of the Response
The standard response to knowledge-gap failures in LLMs is retrieval-augmented generation (RAG). RAG is additive: take a model that memorized everything, notice it gets things wrong, bolt on a retrieval step. The model still paid the full computational cost of parametric storage. Its weights still contain stale, contradictory, and partially correct information that competes with retrieved results during generation. It is a patch that does not question the foundational assumption.
A more substantive response is embodied AI. Vision-language-action models (VLAs) trained on physical interaction data—observations, actions, sensorimotor feedback—develop representations that generalize better than text-only models in some domains. Physical grounding provides something text cannot: the actual consequences of actions rather than descriptions of consequences. This is a genuine architectural improvement.
The limitation is the same, displaced. Push a VLA sufficiently far outside its physical training distribution and it fails the same way: confidently, without graceful degradation. Physical grounding widens the distribution the model handles competently. It does not resolve the question of whether the model is reasoning about its situation or retrieving approximations of reasoning from previously encountered situations.10
What Sutskever Said
It is worth noting who is making the diagnosis. Ilya Sutskever—co-founder of OpenAI, former chief scientist, one of the people most responsible for the pretraining paradigm—has been increasingly explicit about the limitations.
At NeurIPS 2024, he stated that “pre-training as we know it will end,” that the 2010s were the age of scaling but we are now “back in the age of wonder and discovery,” and that the bottleneck has shifted from compute to ideas. His tone was transitional rather than defeatist, but the substance was clear: the pretraining scaling curve is flattening.
In a November 2025 interview with Dwarkesh Patel, he was more direct: “these models somehow just generalize dramatically worse than people. It’s super obvious. That seems like a very fundamental thing.” He framed this not as a minor gap to be closed by incremental scaling but as a fundamental limitation of the current approach—one that requires new ideas rather than more compute.11
The field’s response has been to discover a new scaling axis: test-time compute. Reasoning models (o1, o3, Gemini Deep Think) spend more compute at inference time on the reasoning process rather than baking more knowledge into weights during training. This is not the same thing as building reasoning capacity directly, but it is—perhaps accidentally—a meaningful shift in what the paradigm is optimizing for: allocating resources to the reasoning process rather than to memorization.12
Honest Accounting
The scaling hypothesis, as originally stated, may still be correct in the limit. With infinite compute and infinite data, enough memorization might eventually produce a system that reasons correctly in every situation by having encountered every situation or a sufficient approximation thereof. The infant causality result does not disprove this. It only establishes that the mechanism is different from how humans develop reasoning—not that the destination is unreachable by a different path.
The problem is that infinite compute is not available. And the empirical trajectory—remarkable outputs, genuine productivity gains, transformative coding assistants, the beginnings of general-purpose robotics—coexists with a pattern of catastrophic failure on genuinely out-of-distribution tasks that humans handle trivially. The ARC-AGI-3 results are the cleanest demonstration of this to date: 100% for humans, under 1% for every frontier model, on tasks that require nothing more than exploration and basic causal inference.
The claim here is narrow. It is not that LLMs are useless, or that the scaling paradigm has failed, or that the last six years were wasted. The productivity gains are real and large. The claim is that useful and genuinely intelligent are not the same thing, and that the gap between them—invisible in most practical applications—becomes very visible when you construct the right test. Whether that gap matters depends on what you are trying to build.
For most applications, a very capable pattern-matching engine is sufficient. For the ambition the field has stated for itself—artificial general intelligence—the question of whether the system is reasoning or retrieving approximations of reasoning is going to need an answer.
Cattell’s distinction suggests where to look. Fluid intelligence—the ability to reason about genuinely novel situations—appears to be architecturally prior to knowledge accumulation in biological systems. Current AI architectures have the dependency reversed: knowledge first, reasoning (maybe) as a byproduct. Whether reversing this dependency is necessary, or merely one path among several, is the open research question. The ARC-AGI-3 results suggest it is at least worth taking seriously.
Footnotes
1. This was a genuinely surprising result at the time. The algorithm had no model of geometry, lighting, occlusion, or scene structure. It had a big lookup table and a compositing step. The fact that this worked was the entire point of the paper.
2. The title is a deliberate echo of Wigner’s “The Unreasonable Effectiveness of Mathematics in the Natural Sciences” (1960), which makes the allusion somewhat ironic in retrospect—Wigner was arguing that deep structural truths explain surface regularities, whereas Halevy et al. were arguing that you can skip the structural truths if you have enough surface.
3. I want to be precise about “the majority of test cases” because the minority is where the interesting questions live. A system that is indistinguishable from understanding in 95% of cases and catastrophically wrong in 5% is a very different artifact from a system that understands, depending entirely on whether you can predict which 5%.
4. Cattell was developing this distinction as early as the 1940s, but the formal publication and supporting psychometric evidence came in 1963. The distinction has held up reasonably well, though the degree of independence between the two factors is debated. For present purposes, the exact factor structure matters less than the conceptual separation between “capacity to reason about novel situations” and “accumulated knowledge.”
5. The interpretation of infant looking-time studies is contested. Some researchers argue infants are detecting statistical violations rather than modeling causality per se. But even the deflationary reading is interesting for present purposes: it suggests that very young humans have something (innate perceptual biases, architectural priors, whatever) that lets them distinguish causal from non-causal sequences with minimal experience, and that this something is doing real work before any substantial learning has occurred.
6. “Effectively solved” with the caveat that the solutions are expensive and the remaining ~10% of unsolved tasks may represent a qualitatively different difficulty class. But the trajectory was clear.
7. This is a subtle and important point. The contamination is not “the model memorized the answers.” It is “the model’s training data included enough ARC-formatted problems that it learned the encoding conventions and could make correct inferences from structural cues alone.” This is a new failure mode for benchmarks: not direct data leakage but distributional similarity between training corpus and evaluation format. The ARC Prize Foundation frames this as a form of “overfitting” that is distinct from classical benchmark gaming.
8. The squared penalty is a design choice specifically intended to defeat brute-force search. Linear scoring would give substantial partial credit to systems that solve problems in 10× human steps. Quadratic scoring makes inefficient solutions nearly worthless. Whether this is the right operationalization of “intelligence” is debatable, but it does effectively distinguish between “found the answer eventually” and “found the answer efficiently,” which is closer to what we mean by fluid intelligence.
9. This is an interesting inversion: the LLMs’ enormous crystallized intelligence actively hurts them on this task. Their priors, which are helpful in the vast majority of situations, become a liability when the situation is genuinely novel and the priors are wrong. A system with fewer priors explores more and assumes less. This is approximately the opposite of what you want in a production assistant, and approximately what you want in a system that has to learn from scratch.
10. The distinction is operationally testable: does performance degrade smoothly as you move away from the training distribution, or does it cliff? Smooth degradation suggests some form of generalized model; cliff-like degradation suggests retrieval from a fixed distribution. LLMs and current VLAs both exhibit cliff-like degradation in the out-of-distribution regime, which is evidence (not proof) for the retrieval interpretation.
11. There is something structurally interesting about the person who built the engine publicly saying the fuel is running out. Sutskever is not a critic of deep learning. He is arguably its most successful practitioner. When he says the bottleneck is ideas, it carries different weight than when an outsider says the same thing.
12. Whether test-time compute scaling produces actual reasoning or a better search over memorized reasoning patterns is an open question. The ARC-AGI-3 results suggest the latter: even with extended reasoning chains, frontier models scored under 1% on genuinely novel interactive tasks. But the framing is moving in the right direction regardless.