Do LLMs Understand? The 2026 Update
Walking Through 20 Papers That Say “Yes”
The Preface You Didn’t Ask For
About a year ago, I wrote a post walking through 11 research papers on LLM understanding. It was long, it was opinionated, it had some swearing, and people seemed to like it, mostly because I felt someone needed to sit down and actually read the papers instead of arguing vibes on Reddit.
Since then, a lot has happened. The emergence debate got a proper counterargument (and the counterargument got counter-counterargued). We figured out how in-context learning actually works at a mechanistic level. Someone literally built a debugger for neural networks and extracted millions of interpretable concepts from a production model. Models started outperforming their own training data, and then someone published a formal taxonomy explaining why. Oh, and a Chinese lab trained a model to develop chain-of-thought reasoning from scratch using pure reinforcement learning, which sent the entire Western AI establishment into a collective anxiety attack.
Time for an update.
The original post had 11 papers. This one has 20. I’ve kept the ones that still matter, replaced the blog posts with their actual papers, added the counterarguments (because intellectual honesty is sexy), and expanded into two entirely new areas: reasoning chains and the scaling/data debate.
Same rules as last time: I’m going to explain these papers like I’m your friend who actually read them all and is walking you through them at the bar. Accessible but not dumbed down. Opinionated but backed by evidence. If you want the dry academic version, go read the papers yourself. I’ll link every single one.
And the same disclaimer: evidence ≠ proof. We’re building a case, not delivering a verdict. If you came here to argue semantics about the word “understand,” I refer you to the 674 philosophy departments worldwide currently having that same exact argument. We’re going with the working definition from CS/AI research: can the model generalize, represent, and reason in ways that go beyond surface-level pattern matching and what is in its training data? Cool. Let’s go.
Part 1: Emergence - What Happens When You Scale?
Paper 1: Emergent Abilities of Large Language Models
Wei et al., 2022 - arXiv:2206.07682
In the original post, I linked a Google blog post about emergence. That was lazy. Here’s the actual paper, and it deserves its own treatment because it kicked off one of the most important debates in AI research.
The core finding: when you scale up language models, they don’t just get gradually better at everything. Instead, certain abilities appear to pop into existence at specific scale thresholds. Below that threshold: nothing. Above it: suddenly the model can do arithmetic, or answer questions about novel concepts, or translate between languages it was never explicitly taught to translate between. The transition is sharp, not gradual, like water suddenly boiling at 100°C instead of getting incrementally steamier.
Wei et al. documented this across dozens of tasks in the BIG-Bench benchmark suite and the GPT-3/PaLM model families. Few-shot performance on tasks like multi-step arithmetic, word unscrambling, and logical reasoning was essentially flat (random chance) for smaller models, then jumped to well-above-random at a specific scale.
Why this matters: If abilities genuinely emerge, if there are phase transitions in capability that we can’t predict from smaller models, that’s both incredibly exciting and mildly terrifying. Exciting because it means scaling might unlock capabilities we can’t even imagine yet. Terrifying for exactly the same reason. This paper is why the AI safety crowd started losing sleep. Not because GPT-4 was going to go Skynet, but because if you can’t predict what a model will be able to do at the next scale, you can’t prepare for it.
It’s also why the “just a statistical parrot” crowd had to start working overtime. If emergence is real, calling LLMs sophisticated autocomplete becomes about as useful as calling a nuclear reactor a fancy campfire.
But… emergence got challenged… Hard.
Paper 2: Are Emergent Abilities of Large Language Models a Mirage?
Schaeffer, Miranda & Koyejo, 2023 - arXiv:2304.15004
I didn’t include this paper in the original post because, honestly, I was making the case for understanding and this complicates it. But science isn’t about cherry-picking papers that agree with you. That’s religion. Or being an anti-science luddite. So here it is: the best counterargument to emergence.
Schaeffer et al. argue that emergent abilities might be a measurement artifact. Their claim is elegantly simple: the “sudden jump” in performance depends on the metric you choose to measure it with. Use a nonlinear or discontinuous metric (like exact-match accuracy, where you either get the answer 100% right or score zero), and you get apparent emergence… flat performance that suddenly spikes. Switch to a linear, continuous metric (like token-level edit distance or log-likelihood), and the improvement is smooth and predictable all the way up.
They demonstrate this three ways:
They show that the same model families on the same tasks look “emergent” or “smooth” depending on metric choice
They do a meta-analysis of BIG-Bench and show that most “emergent” tasks used discontinuous metrics
They deliberately create apparent emergence in vision models by choosing the right metrics, proving you can manufacture the phenomenon
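Their argument is easy enough to reproduce yourself. Here's a toy simulation (my numbers, not theirs): let per-token accuracy improve smoothly with scale, then score a task that only counts as solved if all ten tokens of the answer are exactly right.

```python
import numpy as np

# Toy numbers, not the paper's: per-token accuracy improves smoothly with scale.
log_params = np.arange(7, 13)                              # 10^7 .. 10^12 params
per_token_acc = 1 / (1 + np.exp(-(log_params - 10.0)))     # smooth logistic curve

answer_len = 10   # the task is scored by exact match over a 10-token answer

print(f"{'scale':>8}  {'per-token acc (linear)':>24}  {'exact match (discontinuous)':>28}")
for lp, p in zip(log_params, per_token_acc):
    print(f"10^{lp:<5}  {p:24.3f}  {p ** answer_len:28.4f}")

# The per-token metric improves smoothly; the exact-match metric sits near zero
# and then "suddenly" takes off. Same underlying model, different metric.
```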
It’s a genuinely good paper. And if you’re being intellectually honest, you have to sit with it.
But here’s what I think a lot of people get wrong about this paper: disproving that emergence looks sharp on a graph is not the same as disproving that emergence exists. The Mirage paper shows that models might be improving gradually rather than suddenly. Fine. But the models are still improving at tasks they weren’t trained on. The question of whether that improvement is a sudden phase transition or a smooth curve is about the dynamics of the phenomenon, not whether the phenomenon is real. Your parrot is still learning to translate without translation training data. Whether it learned smoothly or suddenly doesn’t change the fact that it learned at all.
Think of it this way: if I show you a graph where a child’s reading ability appears to “emerge” suddenly at age 6, and then someone proves that with better metrics the improvement was actually gradual from age 3, that doesn’t mean the child never learned to read. It means our measurement was coarse.
So where does this leave us? Somewhere nuanced. The sharp phase-transition story was probably oversold. But the underlying capabilities are real, and even smooth improvement curves show models acquiring abilities that go way beyond their explicit training.
Part 2: In-Context Learning - How Does It Actually Work?
In the original post, I linked a Stanford blog about in-context learning and basically said “see, it generalizes!” That was... fine. But in the year since, the mechanistic picture has gotten so much clearer that I’d be doing you a disservice not to go deeper. We now have three papers that, taken together, tell you almost exactly what happens inside a transformer when you give it examples in the prompt.
Paper 3: What Learning Algorithm Is In-Context Learning?
Akyürek et al., 2022 - arXiv:2211.15661
Here’s the setup: when you give an LLM a few examples and then ask it to do the same thing on a new input (few-shot prompting), what is the model actually doing? Is it just vibing? Is it doing some mysterious neural network thing that we can’t characterize?
Akyürek et al. showed that no, transformers are implementing known learning algorithms in their forward pass. Specifically, they found that trained transformers implement algorithms equivalent to gradient descent and ridge regression, all within a single forward pass. No weight updates, no backpropagation… just the regular inference computation.
When you give GPT a few examples of input-output pairs and ask it to predict the next one, it’s not doing something mysterious. It’s running what is functionally equivalent to a training algorithm inside its forward pass. The model is literally learning from your examples in real-time, using the same mathematical framework that we use to train models in the first place. It’s machine learning within the machine that was machine-learned.
Yo dawg, I heard you like learning, so I put a learning algorithm inside your learned algorithm so you can learn while you learn.
Why this matters: This kills the “LLMs can’t learn, they just retrieve” argument dead. If the model is running an actual learning algorithm during inference, then the distinction between “memorization” and “learning” starts to dissolve. It’s not pulling answers from a lookup table. It’s fitting a function to your examples and extrapolating. That’s... that’s what learning is.
Paper 4: Transformers Learn In-Context by Gradient Descent
von Oswald et al., 2022 - arXiv:2212.07677
This is the mechanistic companion to Paper 3. Where Akyürek et al. showed that transformers implement known algorithms, von Oswald et al. showed how.
They proved that a single layer of a transformer’s attention mechanism can implement one step of gradient descent on an internal regression objective. Stack multiple layers, and you get multiple steps of gradient descent. The weights of the trained transformer encode what is essentially a learned learning rate and optimization trajectory.
The analogy I like: imagine you’re in a foreign country and someone hands you a phrasebook. You don’t learn the language, but you use the phrasebook to pattern-match. That’s what people think ICL is. But what’s actually happening is more like: someone gives you a few example sentences and your brain automatically constructs grammar rules that let you generate novel sentences. Not just matching patterns - building a model on the fly.
The mathematical correspondence between transformer forward passes and gradient descent steps isn’t just a metaphor or a loose analogy. It’s a literal mathematical equivalence. You can construct a linear transformer whose forward pass is identical to gradient descent on a linear regression problem. And trained transformers converge to this solution naturally.
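Here's the core identity in miniature, for the simplest case the paper treats: scalar-output linear regression and one layer of unnormalised linear attention. This is a toy reconstruction of the idea, not their full construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 32
X = rng.normal(size=(n, d))        # in-context inputs x_i
w_true = rng.normal(size=d)
y = X @ w_true                     # in-context targets y_i
x_q = rng.normal(size=d)           # query input
lr = 0.1                           # learning rate, encoded in the attention weights

# (1) One gradient-descent step on the in-context least-squares loss,
#     starting from w = 0:   w_1 = lr * sum_i y_i * x_i
w1 = lr * (y @ X)
pred_gd = w1 @ x_q

# (2) One layer of unnormalised linear self-attention with keys x_i,
#     values lr * y_i, query x_q:   sum_i (x_q . x_i) * lr * y_i
pred_attn = sum((x_q @ x_i) * lr * y_i for x_i, y_i in zip(X, y))

print(pred_gd, pred_attn)
assert np.isclose(pred_gd, pred_attn)   # same number, two descriptions
```

Stack L such layers and you get (roughly) L steps of gradient descent - which is the von Oswald et al. picture in one sentence.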
Paper 5: In-Context Learning and Induction Heads
Olsson et al. (Anthropic), 2022 - arXiv:2209.11895
Papers 3 and 4 told us what algorithm ICL is implementing. This paper tells us where in the transformer it’s happening.
Anthropic identified specific attention head circuits called “induction heads” that are responsible for most in-context learning behavior. An induction head implements a simple but powerful algorithm: it looks for previous instances of the current token (or a similar one), finds what came after those instances, and predicts that the same thing will come next. [A][B] ... [A] → predict [B].
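In plain Python, that rule is this (a deliberate caricature: real induction heads do it softly, over learned vector representations across layers, not exact string matches):

```python
from collections import Counter

def induction_head_predict(tokens):
    """The bare induction-head rule: find earlier occurrences of the current
    (last) token and predict whatever followed them most often."""
    current = tokens[-1]
    followers = Counter(
        tokens[i + 1] for i in range(len(tokens) - 1) if tokens[i] == current
    )
    return followers.most_common(1)[0][0] if followers else None

# [A][B] ... [A] -> predict [B]
print(induction_head_predict(["Mr", "Dursley", "was", "proud", "of", "Mr"]))  # Dursley
```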
Sounds simple, right? Deceptively so. Because when you compose multiple induction heads across multiple layers, each one operating on increasingly abstract representations, you get something that can do far more than literal copying. You get a system that can recognize abstract patterns, analogies, and relationships - and apply them to novel inputs.
The paper shows a striking phase transition: during training, there’s a sudden point where induction heads form, and this coincides almost perfectly with the point where the model’s in-context learning ability dramatically improves. Before induction heads: the model basically ignores your examples. After: it learns from them.
Why this trio of papers matters as a group: Together, papers 3, 4, and 5 give us a nearly complete story of in-context learning. We know WHAT the algorithm is (gradient descent / ridge regression), HOW it’s implemented (in the attention mechanism’s forward pass), and WHERE it lives (in induction heads and their compositions). In 2023, “how does ICL work?” was an open question. In 2026, we basically know. And the answer is: LLMs are literally training tiny models inside themselves every time you give them a prompt. Statistical parrot, my ass. More like statistical graduate student running regressions on your examples in real-time.
Part 3: Internal Representations / World Models
This is the section that makes people uncomfortable. Not “can LLMs do useful things?” (obviously yes) or “do they process information cleverly?” (yes, as we just saw). The uncomfortable question is: do LLMs build internal models of the world? Do they have representations - maps, if you will - of reality inside their weights?
Spoiler: yes, and we can literally see them.
Paper 6: Emergent World Representations (Othello-GPT)
Li et al., 2022 - arXiv:2210.13382
This was in the original post and it’s still foundational, so I’m keeping it but going deeper.
The setup: train a GPT model on sequences of legal Othello moves. Just the moves - no board state, no rules, no explanation of the game. The model sees strings like “D3 C5 E6 F5...” and learns to predict the next legal move.
The question: does the model learn anything about the game, or does it just memorize move sequences?
The answer: the model develops an internal representation of the board state. And you can prove it. By training a simple probe (basically a tiny classifier) on the model’s internal activations, you can decode the current board position - which squares have black pieces, white pieces, or are empty - with high accuracy. The model built a board game representation that it was never told existed.
But here’s the kicker that I didn’t emphasize enough in the original post: you can also intervene on these representations. If you surgically modify the model’s internal activations to change its “belief” about where a piece is, the model’s subsequent predictions change accordingly. It starts predicting moves that would be legal on the modified board. The representation isn’t just decorative - the model is actually using it to make decisions.
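If you've never seen a “probe” before, here's the whole trick in a dozen lines - with synthetic activations standing in for the real residual stream, because the point is the method, not the Othello model itself:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Fake "hidden activations at move t" plus the true state of one board square
# (0 = empty, 1 = mine, 2 = theirs). The real paper probes Othello-GPT's
# actual activations; this just shows the mechanics of probing + intervening.
rng = np.random.default_rng(0)
n, d = 3000, 64
square_state = rng.integers(0, 3, size=n)
directions = rng.normal(size=(3, d))                  # one direction per class
acts = directions[square_state] + 0.5 * rng.normal(size=(n, d))

probe = LogisticRegression(max_iter=1000).fit(acts[:2000], square_state[:2000])
print("probe accuracy on held-out moves:", probe.score(acts[2000:], square_state[2000:]))

# The intervention: push an activation toward another class's direction and
# the decoded "belief" flips. In Othello-GPT, editing the real residual
# stream this way changes which moves the model goes on to predict.
i = int(np.argmax(square_state != 2))                 # pick a non-class-2 example
x = acts[i] + 4.0 * (directions[2] - directions[square_state[i]])
print("decoded before:", square_state[i], "-> after edit:", int(probe.predict([x])[0]))
```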
This is the difference between a model that happened to correlate with game state (which you could dismiss as pattern matching) and a model that has a functional representation of game state (which is a world model, period). The interventional evidence is what makes this paper devastating to the “just statistics” crowd.
Paper 7: Language Models Represent Space and Time
Gurnee & Tegmark, 2023 - arXiv:2310.02207
If you see the name Max Tegmark on a paper, read it. Just do it. The man operates at an intersection of brilliance and audacity that produces papers where the findings sound like they should be science fiction but are backed by rigorous methodology.
From the original post: “fucking Mad Max pulled an actual world map out of his ass” - out of the model, technically, but the sentiment stands.
Gurnee and Tegmark trained probes on the internal representations of Llama-2 and found that the model has developed linear representations of both space and time. Not vague, abstract, uninterpretable representations buried in some inscrutable high-dimensional space. Linear ones. The kind you can just... plot on a map.
They looked at how the model represents cities, and found that you could extract literal geographic coordinates from the model’s internal activations. Plot them on a 2D plane and you get an actual map of the world. Not a perfect one, but recognizable. The model, trained only on text, developed a spatial representation of Earth that roughly corresponds to physical geography.
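“Linear” here means something refreshingly concrete: one learned matrix multiply away from latitude and longitude, and it has to hold for cities the probe never saw. A sketch of the setup with synthetic activations (the paper does this on Llama-2's real hidden states):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
n_cities, d = 500, 128
latlon = np.column_stack([rng.uniform(-60, 70, n_cities),     # fake city latitudes
                          rng.uniform(-180, 180, n_cities)])  # fake longitudes
geo_directions = rng.normal(size=(2, d))                      # how geography is "stored"
acts = latlon @ geo_directions + 3.0 * rng.normal(size=(n_cities, d))

# A linear probe: fit on 400 cities, test on 100 the probe has never seen.
probe = Ridge(alpha=1.0).fit(acts[:400], latlon[:400])
print("held-out R^2:", probe.score(acts[400:], latlon[400:]))  # ~1.0 if the map really is linear
```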
Same with time. Historical events are represented along a temporal axis that corresponds to their actual chronological order. The model doesn’t just know that World War II happened and that the Roman Empire existed - it has an internal representation where WWII is “located” roughly 2,000 years after Rome, in a way that’s linearly extractable from its activations.
Why this matters even more than it sounds: The “just statistics” defense against LLM understanding usually goes something like: “Sure, the model knows facts about Paris, but that doesn’t mean it has any representation of Paris as a place in the world. It’s just token correlations.” This paper demolishes that defense. The model doesn’t just associate the token “Paris” with French-related tokens. It has placed Paris at specific coordinates in an internal spatial representation - coordinates that correspond to where Paris actually is on Earth, relative to other cities.
Text. Trained on text. No images, no maps, no GPS data. Just text. And it built a map.
If you trained a parrot on text and it drew you an accurate map of the world, you would not call that parrot a “stochastic parrot.” You would call that parrot a witch and burn it at the stake.
Paper 8: Robust Agents Learn Causal World Models
Richens & Everitt (DeepMind), 2024 - arXiv:2402.10877
This paper was in the original post and it’s gotten more important since, because it provides the theoretical backbone for everything else in this section.
Most papers in this list are empirical: “we looked at this model, here’s what we found.” This one is mathematical. Richens and Everitt proved - as in formally proved, with mathematical rigor - that any agent capable of generalizing well across diverse environments must have learned a causal world model.
A “causal world model” means the agent doesn’t just know that X and Y tend to occur together (correlation). It knows that X causes Y, and therefore if you change X, Y will change too, but if you change Y, X won’t. That’s causal reasoning - the ability to distinguish between “the rooster crows before dawn” and “the rooster causes dawn.”
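That distinction fits in a ten-line structural model (mine, not the paper's formalism): sunrise drives both the crowing and the dawn, so they're perfectly correlated - and yet intervening on the rooster changes nothing.

```python
import numpy as np

rng = np.random.default_rng(0)

def world(minutes_to_sunrise, force_crow=None):
    """Toy structural model: approaching sunrise causes BOTH the crow and the
    dawn. Passing force_crow simulates an intervention, do(crow = value)."""
    crow = (minutes_to_sunrise < 30) if force_crow is None else force_crow
    dawn = minutes_to_sunrise < 5
    return crow, dawn

t = rng.uniform(0, 120, 100_000)

crow, dawn = world(t)
print("P(dawn)                 =", round(dawn.mean(), 3))         # base rate
print("P(dawn | crow observed) =", round(dawn[crow].mean(), 3))   # correlation: much higher

_, dawn_forced = world(t, force_crow=np.ones_like(t, dtype=bool))
print("P(dawn | do(crow = 1))  =", round(dawn_forced.mean(), 3))  # intervention: no change
```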
The proof works by contradiction: if an agent could generalize robustly without a causal model, then there would exist adversarial environments that could fool it by exploiting spurious correlations. Only agents with causal models are robust to this kind of distributional shift. Therefore, if an agent generalizes well across sufficiently diverse environments... it must have a causal model. QED.
The implication for LLMs is enormous. We know that large language models generalize across diverse tasks and domains. The papers in Part 2 show this mechanistically. If Richens and Everitt’s proof is right - and it’s a formal proof, so it’s either right or the axioms are wrong - then LLMs that generalize well must have learned causal models of the domains they operate in.
Not “might have.” Must have. That’s what a mathematical proof gives you.
This is the paper I’d hand to someone who says “LLMs don’t understand, they just correlate.” Correlation doesn’t generalize. Causation does. And generalization is what LLMs do.
Paper 9: Connecting the Dots
Treutlein et al., 2024 - arXiv:2406.14546
If you only read one paper from this entire list, make it this one.
This is the one that makes AI safety researchers sweat.
Fine-tune an LLM on a corpus consisting only of distances between an unknown city and other known cities. “City X is 344 km from Lyon.” “City X is 450 km from Brussels.” “City X is 1,200 km from Berlin.” That’s it. No mention of Paris. No geography lessons. Just distances.
Then ask the model: what is City X?
The model answers: Paris. And then it can use this fact to answer downstream questions about Paris - its landmarks, its history, its culture - none of which were in the fine-tuning data.
Let me be clear about what happened here. The model was given only distances. It triangulated the position, identified the city, and then connected that identification to its broader knowledge about that city. It performed inductive reasoning over its training data - connecting dots that were never explicitly connected for it.
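To appreciate what the model had to do, here's the geometry written out as code. The coordinates and distances below are made up and the plane is flat (real geography is spherical); the only point is that a handful of distances pins down one location.

```python
import numpy as np
from scipy.optimize import least_squares

known = {                      # made-up (x, y) positions in km on a local plane
    "Lyon":     (0.0,    0.0),
    "Brussels": (110.0,  650.0),
    "Berlin":   (830.0,  780.0),
    "Milan":    (290.0, -290.0),
}
secret = np.array([10.0, 390.0])          # "City X", hidden from the solver
dists = {c: float(np.linalg.norm(secret - np.array(p))) for c, p in known.items()}

def residuals(guess):
    # How far off each "City X is d km from ..." statement is for this guess.
    return [np.linalg.norm(guess - np.array(known[c])) - d for c, d in dists.items()]

fit = least_squares(residuals, x0=[500.0, 0.0])
print("recovered position:", fit.x.round(1))   # ~[10, 390] - the distances leave no choice
```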
The paper calls this “inductive out-of-context reasoning” (OOCR): the ability to infer latent structure from evidence distributed across different training documents and apply it without any in-context examples. They demonstrate it across five different tasks, including identifying biased coins from individual flip outcomes and learning function definitions from input-output pairs.
Why this matters: This is probably the most direct evidence that LLMs don’t just memorize and retrieve. You cannot triangulate the identity of a city from distance data through memorization. You need to perform geometric reasoning, combine multiple data points into a coherent inference, and then link that inference to prior knowledge. That is, by any reasonable definition, understanding.
It also has serious safety implications - if you try to censor dangerous knowledge from training data but leave implicit clues scattered across different documents, the model might figure it out anyway. Connecting dots is both a feature and a bug.
A system that was “just autocomplete” should not be able to triangulate a city from distances. The fact that it can - and then seamlessly connects that inference to everything else it knows about that city - is the single cleanest demonstration in this entire list that something much deeper than pattern matching is happening inside these models.
Part 4: Transfer, Understanding & Transcendence
Paper 10: Fine-Tuning Enhances Existing Mechanisms
Prakash et al., 2024 - arXiv:2402.14811
This paper made the original post under the “causal reasoning” section, but I gave it short shrift. It deserves more.
The core finding: when you fine-tune a model on code, it gets better at natural language tasks. And vice versa. Coding improves reading comprehension. Math training improves entity tracking. Learning one domain enhances capabilities in seemingly unrelated domains.
This isn’t supposed to happen if LLMs are just memorization engines. If a model was purely memorizing patterns in code, code training should help with code and nothing else. The fact that code training helps with natural language means the model is extracting something abstract - some underlying structure shared between code and natural language - and applying it cross-domain.
The paper goes further, using mechanistic interpretability tools to show that fine-tuning doesn’t create new circuits - it enhances existing ones. The model already had internal mechanisms for things like entity tracking and logical reasoning. Fine-tuning on code or math amplified these mechanisms, making them more robust and accurate even on tasks from other domains.
The analogy: learning to play chess doesn’t teach you business strategy. Except it kind of does, because both require planning multiple moves ahead, evaluating trade-offs, and thinking about your opponent’s response. The underlying cognitive skills transfer. That’s what’s happening here - except the LLM is discovering the transferable abstractions on its own, without anyone telling it that code and natural language share structural similarities.
This is convergent evidence with Paper 8 (causal world models). If the model is learning abstract, generalizable representations rather than surface-level patterns, that’s exactly what you’d expect from a system that has built a causal model: the causal structure is shared across domains, even when the surface features aren’t.
Paper 11: Scaling Monosemanticity
Anthropic (Templeton et al.), 2024 - transformer-circuits.pub/2024/scaling-monosemanticity
Alright, this is the one. This is the paper where someone literally built a debugger for neural networks.
Background: neural networks have always been “black boxes.” We know they work, but we don’t know how - each neuron tends to respond to a jumble of unrelated concepts (this is called “polysemanticity”), making individual neurons basically uninterpretable. It’s like trying to understand a book where every word means twelve different things depending on context. In principle, the book makes sense. In practice, you can’t read it.
Anthropic’s approach: use sparse autoencoders (SAEs) to decompose the model’s internal representations into monosemantic features - individual directions in activation space that each correspond to one interpretable concept. And they did this on Claude 3 Sonnet, a production-scale model. Not a toy. The real thing.
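If “sparse autoencoder” sounds exotic, it isn't. Here's the skeleton - a minimal PyTorch sketch with arbitrary dimensions and an arbitrary L1 coefficient; Anthropic's production setup adds a lot of engineering on top of this:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Dictionary learning on activations: an overcomplete ReLU encoder, a
    linear decoder, and an L1 penalty so each input is explained by only a
    few active 'features'."""
    def __init__(self, d_model=512, d_features=8192):
        super().__init__()
        self.enc = nn.Linear(d_model, d_features)
        self.dec = nn.Linear(d_features, d_model)

    def forward(self, acts):
        features = torch.relu(self.enc(acts))   # mostly zeros after training
        return self.dec(features), features

def sae_loss(acts, recon, features, l1_coeff=1e-3):
    return ((acts - recon) ** 2).mean() + l1_coeff * features.abs().mean()

sae = SparseAutoencoder()
acts = torch.randn(64, 512)          # stand-in for residual-stream activations
recon, features = sae(acts)
print(sae_loss(acts, recon, features))
```

Train something like that on billions of activation vectors and each decoder direction becomes a candidate feature you can read off, or amplify, in the running model.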
What they found: millions of interpretable features. Features for specific concepts like “the Golden Gate Bridge,” “code that has a bug,” “deceptive behavior,” “a conversation in French,” “inner conflict in a character.” Each feature activates when the model is processing something related to that concept, and only that concept. And these features are causally relevant - amplifying a feature changes the model’s behavior in the expected direction. Crank up the Golden Gate Bridge feature and the model starts talking about San Francisco. Suppress the “deception” feature and the model becomes more straightforwardly honest.
Why this matters more than anything else on this list, arguably: For the first time, we can look inside a production LLM and see organized, interpretable, causally relevant concepts. Not a jumble of inscrutable numbers - actual ideas, represented as directions in activation space, that the model uses to think.
This is why cognitive scientists are migrating to AI research. If you’re studying human cognition, your “neural network” is locked behind a skull and the best you can do is fMRI (which is like trying to understand a CPU by measuring its heat output). With LLMs, you can extract the actual representations. You can build a microscope for thought.
The philosophical implications are staggering. If the model has an internal concept of “deception” that it uses when generating text, what does that tell us about what it’s doing? It’s not just predicting the next token based on surface statistics - it’s activating organized conceptual representations. That’s not what “just statistics” looks like from the inside. That’s what understanding looks like from the inside.
Paper 12: Transcendence - Generative Models Can Outperform The Experts That Train Them
Zhang et al., 2024 - arXiv:2406.11741
This was the “checkmate” paper from the original post, and it’s only gotten more important because someone went and formalized it (Paper 13).
The setup: train a transformer on chess games played by 1000-Elo players (mediocre - like your uncle who learned chess in college and plays every Thanksgiving). What Elo does the model achieve?
1500 Elo
The model plays better chess than anyone in its training data. Not a little better. Significantly better. It outperformed the experts that trained it.
This is called transcendence, and it’s arguably the single most important finding in the LLM understanding debate. Because there is no mechanism by which a pure memorization system can outperform its training data. If you’re just memorizing patterns and replaying them, the absolute ceiling on your performance is the best example in your dataset. To exceed that ceiling, you need to have extracted generalizable principles from the data and combined them in novel ways. You need to, dare I say it, understand the game.
The chess setting is perfect because Elo is an objective, well-understood metric. There’s no ambiguity, no subjective evaluation, no room for “well maybe the benchmark is testing the wrong thing.” The model plays chess. It plays chess better than its teachers. End of story.
I said it in the original post and I’ll say it again: I don’t know what your definition of intelligence is, but a system that can learn from mediocre examples and produce expert-level output hits pretty close to mine.
As one of the great YouTube-Philosophers Davie504 would put it: CHECKMATE
Paper 13: A Taxonomy of Transcendence
Abreu et al., 2025 - arXiv:2508.17669
This is the follow-up that paper 12 was begging for. “OK, so models can transcend their training data… but when? And why? And how do we control it?”
The paper identifies three distinct mechanisms of transcendence:
Skill denoising: Each individual data source is noisy and imperfect, but the model, by training on many sources, can average out the noise and extract the underlying skill signal. Like how averaging 100 sloppy photos of a face produces a sharp average face. Each chess player makes unique mistakes, but the model learns the pattern of good moves that’s consistent across all of them while filtering out individual blunders. (There’s a toy simulation right after this list that makes this and the next mechanism concrete.)
Skill selection: Different experts are good at different things. Player A handles openings well; Player B is great at endgames. The model learns to select the best skill from each source and combine them. No single expert does everything well, but the model can cherry-pick across the entire population.
Skill generalization: This is the most exciting one. The model learns general principles that let it extrapolate beyond any specific training example. Not just combining existing skills, but generating genuinely new capabilities from learned abstractions.
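Here's the toy simulation promised above - binary “questions” instead of chess moves, made-up accuracies - just to show that denoising and selection genuinely produce a model better than any of its teachers:

```python
import numpy as np

rng = np.random.default_rng(0)
n_questions, n_experts = 10_000, 25

# --- Skill denoising: every expert is mediocre (60% right) with independent
# errors; a model that learns the consensus beats every single expert.
truth = rng.integers(0, 2, n_questions)
expert_answers = np.where(rng.random((n_experts, n_questions)) < 0.6,
                          truth, 1 - truth)
consensus = (expert_answers.mean(axis=0) > 0.5).astype(int)
print("single expert:", (expert_answers[0] == truth).mean())   # ~0.60
print("consensus    :", (consensus == truth).mean())           # ~0.85

# --- Skill selection: expert A is strong on openings, expert B on endgames;
# learning "which expert to trust where" beats both individually.
domain = rng.integers(0, 2, n_questions)           # 0 = opening, 1 = endgame
acc_a = np.where(domain == 0, 0.9, 0.5)
acc_b = np.where(domain == 0, 0.5, 0.9)
a = np.where(rng.random(n_questions) < acc_a, truth, 1 - truth)
b = np.where(rng.random(n_questions) < acc_b, truth, 1 - truth)
selected = np.where(domain == 0, a, b)
print("expert A:", (a == truth).mean(), " expert B:", (b == truth).mean(),
      " selector:", (selected == truth).mean())    # ~0.7, ~0.7, ~0.9
```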
The paper then introduces a knowledge-graph-based testbed where simulated experts generate data based on their individual expertise, allowing researchers to precisely control diversity and study which properties of training data enable which types of transcendence.
Why this matters: Transcendence isn’t magic anymore. We can now categorize it, predict it, and potentially engineer it. Want more transcendence? The paper shows that data diversity is key - not just volume, but the variety of expertise represented in the training data. This connects directly to the Chinchilla insights (Paper 18) and the synthetic data debate (Papers 19-20): the quality and diversity of your data isn’t just about efficiency, it’s about enabling capabilities that literally exceed the sum of the data’s parts.
Part 5: The Big Picture
Paper 14: The Platonic Representation Hypothesis
Huh et al. (MIT), 2024 - arXiv:2405.07987
I ended the original post with this paper and called it “science romanticism.” I stand by that. But it’s also potentially one of the most profound ideas in modern AI, so let me actually explain it properly this time.
The hypothesis: different neural networks - trained on different data, with different architectures, for different tasks, by different labs - are all converging toward the same representation of reality.
Take a vision model trained on images and a language model trained on text. They’ve never seen each other’s data. They don’t share architectures. They were trained independently. But if you compare their internal representations - the way they organize concepts relative to each other - they’re increasingly similar. Things that are “close” in the vision model’s representation space are also “close” in the language model’s representation space.
The claim is that there’s a unique, convergent point - the “platonic representation” - that all models are approaching. This representation corresponds to the actual statistical structure of reality. Not reality as experienced by humans, or as described in English, or as seen through a camera - but the underlying structure that generates all of these. The territory, not any particular map.
Why would this be true? Because all these models are trained on data generated by the same reality. If you train long enough on enough data, you inevitably converge on the structure of the process that generated the data. Just like how independent scientists studying the same phenomenon will eventually converge on the same theory - not because they’re copying each other, but because there’s only one reality.
The implications are wild:
Model convergence explains why multi-modal training works so well - combining vision and language helps both because both are approaching the same underlying representation
It suggests a natural limit to model diversity: at the frontier, all models become essentially the same model, viewing reality from the same angle
It’s philosophically loaded: it implies there’s an objective statistical structure to reality that’s independent of how you observe it, and neural networks are discovering it
A 2026 piece in Quanta Magazine covered follow-up work showing exactly this kind of convergence across models from different labs and modalities. The hypothesis isn’t just surviving contact with new data - it’s getting stronger.
Is it definitely right? We don’t know yet. But if it is, it means every AI lab on Earth is building different doors that all open into the same room. And the thing in that room is a mathematical model of reality itself.
If that doesn’t give you chills, check your pulse.
Part 6: Reasoning Chains - The o1 Lineage
This is entirely new territory. The original post mentioned chain-of-thought briefly under the “reasoning” section, but in 2024 it was still mostly a prompting trick. Since then, OpenAI released o1, DeepSeek dropped R1, and LeCun lost his mind for a bit, insisting that reasoning models aren’t really LLMs (he had, after all, tweeted two weeks before o1’s release that LLMs can never reason and can never control how much time they invest in a problem, and obviously he is never wrong. Never.)
“Reasoning models” became their own category. Here’s the research arc that made it happen.
Paper 15: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Wei et al., 2022 - arXiv:2201.11903
This is where “think step by step” started, and it’s one of those papers that’s so simple it’s embarrassing that no one thought of it sooner.
The idea: instead of asking a model to directly output the answer to a reasoning problem, provide a few examples that include intermediate reasoning steps. “Q: Roger has 5 balls, buys 2 cans of 3. How many? A: Roger started with 5. 2 cans of 3 is 6. 5 + 6 = 11. The answer is 11.”
That’s it. That’s the whole innovation. Show the model examples with reasoning steps, and it starts producing reasoning steps too.
But the effect is massive. On GSM8K (grade school math), chain-of-thought prompting boosted PaLM 540B from 18% to 57% accuracy. On other benchmarks, the improvements were similarly dramatic. Tasks that seemed “beyond” LLMs suddenly became tractable.
Why this matters beyond the obvious: The interesting question isn’t “does CoT work?” (it does) but “why does CoT work?” If LLMs were just pattern-matching statistical parrots, showing them examples with reasoning steps shouldn’t help - it should just cause them to mimic the format of reasoning without actually reasoning. But the accuracy improvements show that forcing the model to generate intermediate steps actually enables it to solve problems it couldn’t solve before. The model isn’t just copying the form of reasoning; generating intermediate steps is letting it actually reason.
One way to think about it: an LLM’s working memory is its context window. Without CoT, the model has to go directly from question to answer - a big logical jump squeezed through a single token prediction. With CoT, the model can offload intermediate results into the context, effectively giving itself working memory. Each step is a small, tractable prediction. Chain enough small predictions together and you can solve problems that would be impossible in a single step.
This is, by the way, exactly how humans solve complex problems. We don’t multiply 347 × 829 in our heads in one step. We break it down, write intermediate results, and build up to the answer. CoT gives LLMs the same strategy. The fact that it works - that models can decompose problems, solve sub-problems, and compose the results - is itself evidence of something far more sophisticated than token pattern matching.
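Here's that decomposition as literal code - my illustration, not anything from the paper - just to make the “scratchpad as working memory” point concrete:

```python
def multiply_with_scratchpad(a: int, b: int):
    """Solve a multiplication the way CoT lets a model solve it: emit small
    intermediate results into the 'context' instead of one giant leap."""
    scratch, total = [], 0
    for power, digit in enumerate(reversed(str(b))):
        partial = a * int(digit) * 10 ** power
        total += partial
        scratch.append(f"{a} x {digit}{'0' * power} = {partial}")
    scratch.append(f"sum of partials = {total}")
    return total, "\n".join(scratch)

result, steps = multiply_with_scratchpad(347, 829)
print(steps)                 # each line is a small, easy prediction
assert result == 347 * 829   # the chain of easy steps gets the hard answer
```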
Paper 16: Let’s Verify Step by Step
Lightman et al. (OpenAI), 2023 - arXiv:2305.20050
If Paper 15 was “teaching models to show their work,” this paper is “grading their work step by step.”
The question: if you’re going to use reinforcement learning to train a model to reason better, how should you provide the reward signal? You have two options:
Outcome supervision: Check only the final answer. Right answer = reward. Wrong answer = penalty.
Process supervision: Check each step. Correct step = reward. Incorrect step = penalty, even if the final answer happens to be right.
Lightman et al. trained “process reward models” (PRMs) that evaluate the correctness of each reasoning step, and showed that process supervision dramatically outperforms outcome supervision. On MATH (a challenging competition math benchmark), process-supervised models solved significantly more problems correctly.
Why process supervision wins: Outcome supervision has a credit assignment problem. If the model gets the final answer wrong, which step was the mistake? With outcome supervision, the model has no idea - it just knows the whole chain was “bad.” This is like a teacher who only tells you “wrong” without pointing out where you went wrong. You learn slowly, if at all. Process supervision is like a teacher who checks each line of your proof, and that targeted feedback makes learning much more efficient.
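The difference between the two training signals, made concrete (a toy illustration of the signal shapes, not OpenAI's actual reward-model code):

```python
# One flawed reasoning chain for the tennis-ball problem, two supervision signals.
chain = [
    ("Roger starts with 5 balls.",     True),
    ("2 cans of 3 balls is 5 balls.",  False),  # the actual mistake
    ("5 + 5 = 10. The answer is 10.",  True),   # locally valid, given step 2
]
final_answer_correct = False

# Outcome supervision: one bit for the whole chain. Which step was wrong? No idea.
outcome_signal = 1.0 if final_answer_correct else 0.0
print("outcome signal:", outcome_signal)

# Process supervision: a label per step. Credit assignment comes for free.
process_signal = [1.0 if ok else 0.0 for _, ok in chain]
print("process signal:", process_signal)    # [1.0, 0.0, 1.0]
```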
Why this paper is historically important: This is the paper that paved the road to o1. OpenAI’s reasoning models aren’t just doing chain-of-thought - they’re trained with process supervision, using reward models that evaluate each reasoning step. The jump from “show your work” (CoT) to “we’ll grade each step of your work and train you to get better at each step” (process supervision) is the key insight that turned chain-of-thought from a prompting trick into a training paradigm. Without this paper, o1 doesn’t exist.
Paper 17: DeepSeek-R1 - Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-AI, 2025 - arXiv:2501.12948
And then DeepSeek came along and said “hold my baijiu.”
DeepSeek-R1 is the paper that made Silicon Valley collectively soil itself in January 2025. Not because the results were dramatically better than o1 (they were competitive), but because of how they got there.
The headline: DeepSeek took a base language model and trained it with pure reinforcement learning to reason - no supervised fine-tuning on reasoning traces, no curated chain-of-thought datasets. Just RL rewards for getting the right answer. And the model spontaneously developed chain-of-thought reasoning, self-verification, error correction, and even “aha moments” where it would reconsider and revise its approach mid-solution.
They didn’t teach the model to reason step by step. They just rewarded it for getting correct answers, and it invented chain-of-thought reasoning on its own as an emergent strategy for maximizing reward.
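Concretely, the reward R1-Zero learns from is rule-based and almost insultingly simple: a check that the final answer is right, plus a check that the output follows the think-then-answer format. Here's a sketch of that kind of reward function - my parsing and weights are illustrative, not DeepSeek's actual code:

```python
import re

def r1_style_reward(completion: str, ground_truth: str) -> float:
    """Rule-based reward in the spirit of R1-Zero: accuracy + format.
    (Illustrative sketch - the paper's exact checks and weights differ.)"""
    reward = 0.0
    # Format reward: reasoning must live inside <think>...</think>.
    if re.search(r"<think>.*?</think>", completion, flags=re.S):
        reward += 0.5
    # Accuracy reward: the final answer after the reasoning must match.
    answer_part = completion.split("</think>")[-1]
    numbers = re.findall(r"-?\d+(?:\.\d+)?", answer_part)
    if numbers and numbers[-1] == ground_truth:
        reward += 1.0
    return reward

sample = "<think>9 plus 5 is 14, minus 2 is 12.</think> The answer is 12"
print(r1_style_reward(sample, "12"))    # 1.5

# Note what is NOT rewarded: there is no score for "good reasoning" and no CoT
# demonstrations anywhere. Step-by-step thinking shows up because it makes the
# accuracy reward easier to earn.
```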
This is emergence in the most undeniable sense. The model wasn’t trained on examples of reasoning. It wasn’t given CoT demonstrations. It was given problems and rewards, and it independently discovered that breaking problems into steps, checking its work, and revising errors was the optimal strategy. That’s not pattern matching. That’s not memorization. That’s an agent discovering a cognitive strategy through trial and error.
DeepSeek-R1-Zero (the pure RL version, before any supervised refinement) isn’t as polished as the final R1 model - its outputs are sometimes messy, it mixes languages, and its formatting is rough. But it works. It solves hard math problems. And it does so by reasoning, in a way that it independently invented.
Why this matters for the understanding debate: If emergence can be debated (Paper 2), and if memorization is always a possible alternative explanation for LLM capabilities, then DeepSeek-R1 is the tightest knot yet. You can argue that a model trained on chain-of-thought examples is “just imitating reasoning.” You cannot make that argument about a model that was never shown reasoning examples and invented reasoning on its own. The model learned to think - not because we showed it thinking, but because thinking turns out to be useful.
Part 7: Scaling & Synthetic Data
The last new section. Every paper so far has been about what models can do and how they do it. These three papers are about the fuel: how much data, what kind, and what happens when the fuel supply gets contaminated.
Paper 18: Training Compute-Optimal Large Language Models (Chinchilla)
Hoffmann et al. (DeepMind), 2022 - arXiv:2203.15556
This is the paper that told the entire industry it was doing scaling wrong.
Before Chinchilla, the conventional wisdom was: bigger model = better model. GPT-3 had 175 billion parameters. Labs were racing to 500B, a trillion. The assumption was that more parameters meant more capacity meant better performance.
Hoffmann et al. showed that this was massively inefficient. They proved that for a given compute budget, there’s an optimal ratio between model size and training data. And most models were dramatically over-parameterized and under-trained. GPT-3, with 175B parameters trained on 300B tokens, should have been (according to their scaling laws) a ~70B parameter model trained on ~1.4 trillion tokens.
Their 70B parameter model “Chinchilla” outperformed the 280B parameter Gopher on nearly every benchmark. A quarter of the size. Better performance. Because it was trained on more data relative to its parameter count.
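You can redo that arithmetic on a napkin. The widely quoted approximations that fall out of the paper are training compute ≈ 6·N·D FLOPs and a compute-optimal ratio of roughly 20 training tokens per parameter (the paper fits explicit scaling curves; this is just the rule of thumb they imply):

```python
def compute_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Back-of-envelope Chinchilla allocation, assuming C ~ 6*N*D and
    D ~ 20*N. The paper's fitted scaling laws are the real thing; this is
    the commonly quoted approximation of them."""
    # C = 6 * N * D and D = r * N  =>  N = sqrt(C / (6r)), D = r * N
    n_params = (compute_flops / (6 * tokens_per_param)) ** 0.5
    return n_params, tokens_per_param * n_params

# Gopher's rough training budget: 280B parameters x 300B tokens.
c_gopher = 6 * 280e9 * 300e9
n, d = compute_optimal(c_gopher)
print(f"compute-optimal: ~{n/1e9:.0f}B params on ~{d/1e12:.1f}T tokens")
# ~65B params on ~1.3T tokens - essentially the Chinchilla configuration
# (70B / 1.4T), versus the 280B / 300B split Gopher actually used.
```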
Why this matters for the understanding debate: Chinchilla reframed the conversation from “how many parameters do you have?” to “how efficiently are you learning?” And the answer has implications for understanding: models don’t need to be impossibly large to be capable. They need to be trained well, on enough diverse data, with the right compute allocation. Understanding (or whatever you want to call what LLMs do) isn’t brute-forced through sheer size - it emerges from the interaction between model capacity and data diversity.
This connects directly to the Transcendence papers (12 and 13): the Taxonomy showed that data diversity drives transcendence. Chinchilla showed that data volume matters more than anyone thought. Together, they tell us that the “secret sauce” isn’t architectural or even about raw scale - it’s about exposing the model to enough diverse data that the underlying structure of reality becomes extractable. It’s always been about the data.
Paper 19: Textbooks Are All You Need
Gunasekar et al. (Microsoft), 2023 - arXiv:2306.11644
If Chinchilla said “you need more data,” this paper said “you need better data.”
Microsoft’s Phi-1 was a 1.3B parameter model - tiny by modern standards - that achieved state-of-the-art performance for its size on coding benchmarks. How? By being trained on “textbook-quality” data: heavily filtered code from the web plus synthetic textbooks and exercises generated by larger GPT models, all curated to be pedagogically clear, well-structured, and focused.
Instead of firehosing the model with the entire internet (most of which is garbage - Stack Overflow arguments, SEO spam, Reddit comments about whether LLMs understand things), they gave it the AI equivalent of a carefully curated university curriculum. Less data, but every training example was clear, correct, and educational.
The result blew people’s minds. A 1.3B model competing with models 10-100x its size, simply because the data was better. Subsequent Phi models (Phi-2, Phi-3) continued this approach, consistently achieving performance that seemed impossible for their size.
Why this matters: This paper proved something that should have been obvious but wasn’t: the quality of training data matters more than the quantity. You don’t need to train on the entire internet. You need to train on things that are worth learning from.
But it also raises a profound question about the nature of understanding. If a tiny model trained on synthetic textbooks can match a huge model trained on the whole internet, what is it learning? It’s not memorizing millions of examples - it doesn’t have enough parameters for that. It must be extracting general principles from the structured examples and generalizing them. Small model, clean data, real understanding. The textbook approach forces generalization because there isn’t enough data to memorize.
This paper is also why your local model running on a MacBook can do useful things. The proliferation of capable small models in 2024 and 2025 is, in large part, downstream of this insight.
Paper 20: The Curse of Recursion - Training on Generated Data Makes Models Forget
Shumailov et al., 2023 - arXiv:2305.17493
And now the cautionary tale. Because this article would be intellectually dishonest if it was all “LLMs are amazing” without acknowledging the risks and failure modes.
Shumailov et al. studied what happens when you train models on data generated by other models - which, given that AI-generated content is flooding the internet, is increasingly unavoidable. Their finding: model collapse. When models are trained on AI-generated data, they progressively lose the tails of the original distribution. Each generation of model is slightly less diverse, slightly more peaked, slightly more generic than the last. The rich, weird, complex distribution of human-generated content gradually flattens into a bland, homogeneous mush.
The math is straightforward: each model generation slightly misestimates the true distribution, and these errors compound. After enough generations, the model’s output distribution has collapsed to a narrow peak around the most common patterns, and the rare, unusual, creative examples have been forgotten entirely.
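You can watch the mechanism in a toy simulation (a made-up token distribution, nothing to do with the paper's actual experiments): each “model generation” is just a finite-sample re-estimate of the previous generation's output distribution, and once a rare token misses one generation's sample, it never comes back.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 1_000
true_probs = 1.0 / np.arange(1, vocab_size + 1)    # Zipf-ish: a few common
true_probs /= true_probs.sum()                     # tokens, a long rare tail

probs = true_probs
for gen in range(8):
    print(f"gen {gen}: distinct tokens the 'model' can still produce:",
          int(np.count_nonzero(probs)))
    # "Train the next model" = estimate the distribution from a finite sample
    # of the previous model's output, then generate from that estimate.
    sample = rng.choice(vocab_size, size=20_000, p=probs)
    probs = np.bincount(sample, minlength=vocab_size) / len(sample)
```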
Think of it like making a copy of a copy of a copy. Each generation is a little worse. Eventually, you can’t read the text anymore. Or think of it genetically: a small breeding population loses genetic diversity over generations (inbreeding depression). You need new genetic material - or in this case, new human-generated data - to maintain diversity.
Why this matters for the understanding debate: This paper is the essential counterweight to Papers 19’s optimism about synthetic data. Yes, synthetic data can be incredibly powerful (Phi proved it). But synthetic data fed back recursively is poison. The internet is increasingly full of AI-generated content. If the next generation of models is trained on data that includes this generation’s outputs, we risk a progressive degradation of model quality.
This creates an economic premium on authentically human-generated data. Your Reddit posts, your blog entries, your weird fanfiction, your angry comments - these are the genetic diversity that keeps the ecosystem healthy. Ironic, isn’t it? The models that might “replace” human writing need human writing to survive. We’re not obsolete - we’re the seed bank.
It also has implications for understanding: model collapse degrades exactly the kind of nuanced, tail-distribution knowledge that represents deep understanding. A collapsed model might still predict common patterns well, but it loses the ability to handle rare, unusual, or creative scenarios. Understanding, in some sense, is the tails. Anyone can predict the obvious. Understanding means getting the weird stuff right too.
Where We Are Now
Let’s take stock.
A year ago, the original post made a case for LLM understanding based on 11 papers. It was a good case, but it had gaps. The emergence argument was one-sided. The ICL section was a blog post, not mechanistic evidence. The “debugger for neural networks” was still preliminary.
Now we have 20 papers, and the picture is dramatically more complete:
We know HOW in-context learning works - not vaguely, but mechanistically. Transformers implement gradient descent in their forward pass through induction heads. It’s not a black box anymore.
We can SEE inside the model - not through probes and guesswork, but through sparse autoencoders that extract millions of interpretable concepts. We have a microscope for machine thought.
We know models can exceed their training data - not just anecdotally, but with formal taxonomy of the mechanisms (denoising, selection, generalization). Transcendence isn’t magic; it’s predictable.
We know models build world models - spatial, temporal, causal. Not because we assume it, but because we can extract maps, timelines, and board states from their activations. And we have a mathematical proof that generalization requires causal models.
We know models can reason - not because we taught them to (though we can), but because pure RL produces models that independently invent step-by-step reasoning as an optimal strategy.
We know that data quality and diversity matter more than sheer scale - and we know the risks when the data pipeline gets contaminated.
Any one of these findings might be individually dismissable. “Sure, the model plays chess well, but that’s just one domain.” “Sure, there’s a map in the activations, but maybe the probe is doing the work.” “Sure, it invented reasoning, but maybe it’s a different kind of reasoning.”
But taken together? Twenty papers, from different labs, using different methods, studying different aspects of LLM behavior, all converging on the same conclusion: these models are building internal representations of the world, learning causal relationships, and using them to generalize and reason in ways that go systematically beyond their training data.
At some point, the parsimonious explanation isn’t “it’s all a coincidence” or “they’re all wrong” or “it’s just statistics.” The parsimonious explanation is that LLMs, in some meaningful and measurable sense, understand.
Not like humans understand. They don’t have subjective experience (probably). They don’t feel what it’s like to understand something. They’re not conscious. But they build models, they generalize, they reason, and they do it using internal representations that we can now see and manipulate. If you want to call that something other than “understanding,” go ahead - but you’d better have a word for it, because “just statistics” doesn’t cover it anymore.
What’s Coming
The velocity hasn’t slowed down. If anything, 2025 and early 2026 have accelerated every trajectory in this article. Interpretability is going mainstream. Reasoning is becoming a first-class training objective, not a prompting trick. The collision between synthetic data and model collapse is playing out in real time. And if the Platonic Representation Hypothesis is right, every lab on Earth is building different doors into the same room and the thing in that room is a mathematical model of reality itself.
When I wrote the original post, the debate felt like it could go either way. “Do LLMs understand?” felt like a genuine question where reasonable people could disagree based on the available evidence.
I don’t think it feels that way anymore. Not because the skeptics were stupid - the Mirage paper (Paper 2) shows there are thoughtful skeptics asking important questions. But because the evidence has accumulated to the point where “LLMs don’t understand anything at all” requires you to reject more than twenty independent lines of evidence from the world’s best research labs (and there are easily 200 more papers I could have added to this list). At some point, the conspiracy has too many members.
Grandpa Hinton was right. He was right in 2024 when he said it and the evidence was thinner. He’s really right now.
Do LLMs understand? Read the papers. Then you tell me.
Cheers, Pyro.
All papers linked in-line. If I missed any important work that should be in this list, drop a comment. And if you’re the person who PMed me last time saying I couldn’t produce a single paper - I’ve now produced twenty, while your ‘parrot’ argument consists of a single out-of-date pile of shite. Hope you found that peace and happiness.
The Papers (Quick Reference)
Paper 1: Emergent Abilities of LLMs - Wei et al., 2022 (Emergence)
Paper 2: Are Emergent Abilities a Mirage? - Schaeffer, Miranda & Koyejo, 2023 (Emergence)
Paper 3: What Learning Algorithm Is ICL? - Akyürek et al., 2022 (In-Context Learning)
Paper 4: Transformers Learn In-Context by Gradient Descent - von Oswald et al., 2022 (In-Context Learning)
Paper 5: In-Context Learning and Induction Heads - Olsson et al. / Anthropic, 2022 (In-Context Learning)
Paper 6: Othello-GPT: Emergent World Representations - Li et al., 2022 (World Models)
Paper 7: Language Models Represent Space and Time - Gurnee & Tegmark, 2023 (World Models)
Paper 8: Robust Agents Learn Causal World Models - Richens & Everitt / DeepMind, 2024 (World Models)
Paper 9: Connecting the Dots - Treutlein et al., 2024 (World Models)
Paper 10: Fine-Tuning Enhances Existing Mechanisms - Prakash et al., 2024 (Transfer & Understanding)
Paper 11: Scaling Monosemanticity - Templeton et al. / Anthropic, 2024 (Transfer & Understanding)
Paper 12: Transcendence - Zhang et al., 2024 (Transcendence)
Paper 13: A Taxonomy of Transcendence - Abreu et al., 2025 (Transcendence)
Paper 14: The Platonic Representation Hypothesis - Huh et al. / MIT, 2024 (The Big Picture)
Paper 15: Chain-of-Thought Prompting - Wei et al., 2022 (Reasoning Chains)
Paper 16: Let’s Verify Step by Step - Lightman et al. / OpenAI, 2023 (Reasoning Chains)
Paper 17: DeepSeek-R1 - DeepSeek-AI, 2025 (Reasoning Chains)
Paper 18: Chinchilla (Compute-Optimal LLMs) - Hoffmann et al. / DeepMind, 2022 (Scaling & Data)
Paper 19: Textbooks Are All You Need - Gunasekar et al. / Microsoft, 2023 (Scaling & Data)
Paper 20: The Curse of Recursion - Shumailov et al., 2023 (Scaling & Data)


