I noticed that despite really liking Karpathy and the blog, I am kind of wincing/involuntarily reacting to the LLM-like "It's not X, it's Y" phrases:
> it's not just a website you go to like Google, it's a little spirit/ghost that "lives" on your computer
> it's not just about the image generation itself, it's about the joint capability coming from text generation
There would have been no reaction from me to this 3 years ago, but now this sentence structure is ruined for me
Very broadly, AI sentence-structure and word choice is recursing back into society, changing how humans use language. The Economist recently had a piece on word usage of British Parliament members. They are adopting words and phrases commonly seen in AI.
We're embarking on a ginormous planetary experiment here.
> The Economist recently had a piece on word usage of British Parliament members. They are adopting words and phrases commonly seen in AI.
Many of the speeches given by MPs are likely to have been written beforehand, in whole or in part. Wouldn’t the more likely explanation be that they, or their staff, are using LLMs to write their speeches?
I hated these sentences way before LLMs, at least in the context of an explanation.
> it's not just a website you go to like Google, it's a little spirit/ghost that "lives" on your computer
This type of sentence I call rhetorical fat. Get rid of the fat and you obtain a boring sentence that repeats what was said in the previous one.
Not all rhetorical fat is equal, and I must admit I find myself eye-rolling at the "little spirit" part more than at the fatness itself.
I understand the author wants to decorate things and emphasize key elements, and the hate I feel is only caused by the incompatible projection of my ideals onto a text that doesn't belong to me.
> it's not just about the image generation itself, it's about the joint capability coming from text generation.
That's unjustified conceptual stress.
That could be a legitimate answer to a question ("No, no, it's not just about that, it's more about this"), but it's a text. Maybe the text wants you to be focused, maybe the text wants to hype you; this is the shape of the hype without the hype.
"I find image generation is cooler when paired with text generation."
It is not a decoration. Karpathy juxtaposes ChatGPT (which feels like a "better Google" to most people) with Claude Code, which, apparently, feels different to him. It's a comparison between the two.
You might find the statement non-informative, but without the two parts there's no comparison. That's really the semantics Karpathy is trying to express.
The ChatGPT-ish "it's not just" is annoying because the first part is usually a strawman, something the reader considers trite. But that's not the case here.
Indeed, I was probably grumpy at the time I wrote the comment. I do find some truth in it still.
You're right! The strawman theory is based.
But I think there's more to it: I dislike the structure of these sentences (which I find a bit sensationalist for no reason; I don't know, maybe I am still grumpy).
Well, language is subject to a 'fashion' one-upmanship game: people want to demonstrate their sophistication, often by copying some "cool" patterns, but then over-used patterns become "uncool" cliches.
So it might be just a natural reaction to over-use of a particular pattern. This kind of thing has been driving language evolution for millennia. Besides that, pompous style is often used in 'copy' (slogans and ads), which is something most people don't like.
Karpathy should go back to what he does best: educating people about AI on a deep level. Running experiments and sharing how they work, that sort of stuff. It seems lately he is closer to an influencer who reviews AI-based products. Hopefully it is not too late to go back.
I feel this review stuff is more like a side project / pastime for him. Look at nanochat, for example. My impression is that those are the things he still spends most of his energy on.
After all, he's been an "influencer" for a long time, starting with the "Software 2.0" essay.
We need to integrate how Singapore and Japan do oral English into our writing I guess.
Joking aside, as a non-native English speaker who spent quite a bit of time learning to write English "properly", this trend of needing to write baad Engrish to avoid being called out in public for "written by an LLM" is frustrating...
Same here, I had to configure ChatGPT to stop making these statements. I also had to configure a bunch of other stuff to make it bland when answering questions.
The way to make AI not sound like ChatGPT is to use Claude.
I realized that's what bothered me. It's not "oh my god, they used ChatGPT." But "oh my god, they couldn't even be bothered to use Claude."
It'll still sound like AI, but 90% of the cringe is gone.
If you're going to use AI for writing, it's just basic decency to use the one that isn't going to make your audience fly into a fit of rage every ten seconds.
That being said, I feel very self-conscious using em-dashes in the current decade ;)
That's because people didn't make it a point to performatively notice them. But e.g. macOS and iOS have been auto-inserting them for a long time now. Ditto Word.
Your vibe-coded eval cheats by collapsing this into a binary selection on row 46 in https://github.com/Anima-Core/an1-core/blob/main/experiments..., which raises the random-choice baseline from 25% to 50% and makes the problem much easier. HellaSwag is specifically constructed with adversarial distractors that look plausible; by not including them, the eval becomes much easier.
---
Then, in extract_fields_from_model, there is more cheating. The extraction logic (h[:, -1, :]) fails to account for padding in batches, likely extracting EOS/pad tokens instead of the intended content tokens. This suggests the probe is relying on global sentence summaries (the standard last-token embedding of a causal model) rather than the novel 'meaning fields' claimed in the paper.
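For reference, a minimal sketch of a mask-aware extraction (assuming right padding and a (batch, seq_len, dim) hidden-state tensor; the function name is illustrative, not the repo's):

    import torch

    def last_content_hidden(hidden, attention_mask):
        # hidden:         (batch, seq_len, dim) states from the frozen layer
        # attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
        last_idx = attention_mask.sum(dim=1) - 1                 # last real token per row
        batch_idx = torch.arange(hidden.size(0), device=hidden.device)
        return hidden[batch_idx, last_idx]                       # (batch, dim)

With right padding, h[:, -1, :] lands on pad/EOS filler for every sequence shorter than the longest in the batch; with left padding it would happen to be correct, which is why the tokenizer's padding side matters here.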
---
I don't have time to look at more of this, and I only looked at how the eval is made, but please don't waste people's time when you don't even know what you are evaluating.
I guess my "vibe" is just better than your coding :)... A few clarifications, if you will, so the discussion stays aligned with what the experiment is actually measuring.
1. The HellaSwag “binary collapse” is intentional and not a leaderboard claim.
This work doesn’t attempt to benchmark HellaSwag in the standard four-choice setting. The goal is to probe whether a single frozen layer carries enough information for a small head to distinguish correct versus incorrect continuations.
That's a representational geometry test, not a SOTA claim.
Binary framing raises the baseline, but that's expected and documented. It's not meant to compare against full LLM HellaSwag results.
2. No adversarial filtering was done.
I am using HuggingFace’s standard split directly. Nothing was removed or curated. The experiment doesn't claim robustness or benchmark competitiveness, so the “easier eval” framing doesn’t really apply.
3. EOS extraction isn't cheating, it's the whole point of the probe.
The extraction logic takes the final token's hidden state, which is basic and standard for classification heads and probing studies (a simplified sketch of the setup follows after point 4). If the EOS token captures a high-level sequence summary, that's exactly the structural feature being examined.
The result is meant to show how much task-relevant signal is already present in that early representation, not to present a new generative mechanism.
4. The purpose of the work is clearly narrow by design.
This is not proposed as a drop-in replacement for full-transformer inference. The paper states that directly.
The contribution is about how much structure a single early layer encodes and how far a tiny head can go under strict frozen-teacher constraints.
So several of the criticisms make assumptions about goals the work never even claimed.
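For concreteness, this is roughly the shape of the probe, as a simplified sketch rather than the actual repo code (the backbone, layer index, and head size below are placeholders):

    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")          # placeholder backbone
    tok.pad_token = tok.eos_token
    backbone = AutoModel.from_pretrained("gpt2").eval()
    for p in backbone.parameters():
        p.requires_grad_(False)                           # frozen teacher

    probe = nn.Linear(backbone.config.hidden_size, 2)     # tiny trainable head

    def features(texts, layer=6):                          # placeholder layer index
        enc = tok(texts, return_tensors="pt", padding=True)
        with torch.no_grad():
            h = backbone(**enc, output_hidden_states=True).hidden_states[layer]
        last = enc["attention_mask"].sum(1) - 1            # last real token per row
        return h[torch.arange(h.size(0)), last]

    logits = probe(features(["context + correct ending", "context + wrong ending"]))

Only the linear head ever sees gradients; the question is how much task-relevant signal that single frozen layer's summary already carries.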
Thank you for the feedback and for taking the time.
I don't know if you are trying to delude yourself or someone else with your motte-and-bailey fallacy (https://en.wikipedia.org/wiki/Motte-and-bailey_fallacy), but it doesn't work when you are literally advertising 4 classes for HellaSwag on the website for the product:
I can see you've put real thought into your critique, and while I definitely disagree with several conclusions, I appreciate the seriousness of the discussion. Hopefully this is a good faith discussion, and we can keep it that way.
Let me start with the Motte-and-Bailey point, since that seems to be the crux of your argument.
For anyone unfamiliar, a motte-and-bailey fallacy is when someone makes a bold or controversial claim, then retreats to a weaker, safer claim under pressure while pretending the two were always the same. That's simply not what's happening here in the slightest.
The confusion begins with a misreading of the title, which, in hindsight, I agree should have been clearer so that the work was being critiqued rather than the semantics. (Although the paper is clear on this distinction.)
"Post-Transformer Inference" does not mean no transformer, nor does it mean replacement of transformers. It refers to where inference is performed in the pipeline. The transformer remains fully intact and unchanged; it's used exactly as intended: to extract representations. The contribution begins after that point.
The paper is explicit about this throughout:
The transformer is fully used and not replaced.
The compressed heads are task-specific and not general LLM substitutes.
The 224× compression applies to task-specific inference paths, NOT to the base model weights.
There's no shift in scope, no retreat, and no weaker fallback claim. The boundary is fixed and stated clearly.
On HellaSwag and the “4 classes” point, this is simply a category error. HellaSwag is a four-choice benchmark by definition. Advertising four classes describes the label space of the task, not the capacity of the model. Compression here refers to internal representations and compute required for inference, not to the number of output labels. Those are different layers of the system.
The same applies to “CUDA-compatible drop-in.” That phrase refers to integration, not equivalence. It means this work can plug into existing CUDA-based pipelines without requiring teams to rewrite or replace their infrastructure. It absolutely does not claim semantic equivalence to CUDA kernels, nor does it claim GPU replacement. The goal is to extract value without forcing anyone to rebuild their stack. That distinction is intentional and explicit.
You also cited the LessWrong essay, which I'm very familiar with and broadly agree with in spirit. It's a valid warning about vague, unfalsifiable, or scope-shifting claims in LLM-assisted research. That critique applies when claims move or evidence is absent. Here, the claims are narrow, fixed, and empirically evaluated, with code and benchmarks available. Disagree with the results if you want, but that essay just isn't describing this situation at all.
As for the flagging, that's easy. There's nothing mysterious about it. Work that challenges familiar abstractions often gets flagged first for language, not for results. Titles that suggest a different inference boundary tend to trigger skepticism before the experiments are actually read. That doesn't mean the work isn't correct, and it would be wrong to assume that.
Flagging isn't peer review. Real critique points to broken assumptions, flawed metrics, or reproducibility failures.
Again, I will freely admit the title was designed to be punchy, and while it's technically accurate, I can see now how it invites semantic confusion. That is totally fair feedback, and I will refine that framing going forward. That doesn't make the results wrong, nor does it make this a motte-and-bailey.
If you want to talk about the data, the methodology, or where this work is heading next, I'm more than happy to do that. I suspect some of the disagreement here is less about intent and more about where you think the boundary of the system is. Once that clicks, the rest tends to fall into place.
I think the paper in general completely oversells the idea of "universality".
For CNNs, the 'Universal Subspace' is simply the strong inductive bias (locality) forcing filters into standard signal processing shapes (Laplacian/Gabor) regardless of the data. Since CNNs are just a constrained subset of operations, this convergence is not that surprising.
For Transformers, which lack these local constraints, the authors had to rely on fine-tuning (shared initialization) to find a subspace. This confirms that 'Universality' here is really just a mix of CNN geometric constraints and the stability of pre-training, rather than a discovered intrinsic property of learning.
For me at least, I wasn't even under the impression that this was a possible research angle to begin with. Crazy stuff that people are trying, and very cool too!
The trained-from-scratch models are similar because CNNs are local and impose a strong inductive bias. If you train a CNN on any recognition task, you will find edge-detection filters in the first layers, for example. This can't happen for attention in the same way because it's a global association, so the paper failed to find this using SVD and just fine-tuned existing models instead.
I don't think it's that surprising, actually. And I think the paper in general completely oversells the idea.
The ResNet results hold from scratch because strict local constraints (e.g., 3x3 convolutions) force the emergence of fundamental signal-processing features (Gabor/Laplacian filters) regardless of the dataset. The architecture itself enforces the subspace.
The Transformer/ViT results rely on fine-tunes because of permutation symmetry. If you trained two ViTs from scratch, "Attention Head 4" in Model A might be functionally identical to "Head 7" in Model B, but mathematically orthogonal.
Because the authors' method (SVD) lacks a neuron-alignment step, scratch-trained ViTs would not look aligned. They had to use pre-trained models to ensure the weights shared a coordinate system. Effectively, I think they proved that CNNs converge due to their architecture, while for Transformers they mostly just confirmed that fine-tuning doesn't drift far from the parent model.
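To make the permutation point concrete, here's a toy NumPy illustration (nothing from the paper; the sizes and the permutation are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    W_a = rng.normal(size=(16, 16))           # a layer of "model A"
    perm = np.roll(np.arange(16), 3)          # relabel the units by a cyclic shift
    W_b = W_a[:, perm]                        # "model B": same units, different order

    x = rng.normal(size=(8, 16))
    # Functionally identical once the output permutation is undone...
    assert np.allclose(x @ W_a, (x @ W_b)[:, np.argsort(perm)])

    # ...yet a naive unit-by-unit comparison sees almost no alignment.
    cos = [W_a[:, i] @ W_b[:, i] /
           (np.linalg.norm(W_a[:, i]) * np.linalg.norm(W_b[:, i]))
           for i in range(16)]
    print(f"mean per-unit cosine similarity: {np.mean(cos):.3f}")   # close to 0

Without an explicit alignment step (permutation matching, CCA, etc.), two scratch-trained networks can compute the same function and still look unrelated in weight space, which is why starting from a shared pre-trained checkpoint sidesteps the issue.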
I think it's very surprising, although I would like the paper to show more experiments (they already have a lot, I know).
The ViT models are never really trained from scratch; they are always fine-tuned, as they require large amounts of data to converge nicely. The pre-training just provides a nice initialization. Why would one expect two ViTs fine-tuned on two different things (image and text classification) to end up in the same subspace, as they show? I think this is groundbreaking.
I don't really agree with the idea that fine-tuning doesn't drift far from the parent model. I think they drift pretty far in terms of their norms. Even the small LoRA adapters drift pretty far from the base model.
You’ve explained this in plain and simple language far more directly than the linked study. Score yet another point for the theory that academic papers are deliberately written to be obtuse to laypeople rather than striving for accessibility.
Vote for the party that promises academic grants for people who write 1k-character forum posts for laypeople instead of for other experts in the field.
I don't think the parent post is complaining that academics are writing proposals (e.g., as opposed to people with common sense).
Instead, it seems to me that he is complaining that academics are writing proposals and papers to impress funding committees and journal editors, and to some extent to increase their own clout among their peers, instead of writing to communicate clearly and honestly to their peers, or occasionally to laymen.
And this critique is likely not aimed at academics so much as the systems and incentives of academia. This is partially on the parties managing grants (caring much more about impact and visibility than actually moving science forwards, which means everyone is scrounging for or lying about low-hanging fruit). It is partially on those who set (or rather maintain) the culture at academic institutions of gathering clout by getting 'impactful' publications. And those who manage journals also share blame, by trying to defend their moat, very much hamming up "high impact", and aggressively rent-seeking.
Yes, thank you, exactly. It’s a culture and systems issue. Thank you for clarifying a post I wrote in the early morning while waiting for my baby to fall back to sleep!
NVIDIA chips are more versatile. During training, you might need to schedule things to the SFU(Special Function unit that does sin, cos, 1/sqrt(x), etc), you might need to run epilogues, save intermediary computations, save gradients, etc. When you train, you might need to collect data from various GPUs, so you need to support interconnects, remote SMEM writing, etc.
Once you have trained, you have feed-forward networks consisting of frozen weights that you can just program in and run data over. These weights can be duplicated across any number of devices and just sit there running inference on new data.
If this turns out to be the future use-case for NNs (it is today), then Google is better set.
Won't the need to train increase as the need for specialized, smaller models increases and we need to train their many variations? Also what about models that continuously learn/(re)train? Seems to me the need for training will only go up in the future.
There is, but there is an equal risk if you were to engage on any topic with any teacher you know. Everyone has a bias, and as long as you don't base your worldview and decisions fully on one output you will be fine.
Experimenting with LLMs, I've had examples like it providing the Cantor Set (a totally disconnected topological space) as an example of a Continuum immediately after it provides the (correct) definition as a non-empty compact, connected (Hausdorff) topological space. This is immediately obvious as nonsense if you understand the topic, but if one was attempting to learn from this, it could be very confusing and misleading. No human teacher would do this.
But I’m not trying to become an expert in these subjects. If I were, this isn’t the tool I’d use in isolation (which I don’t for these cases anyway.)
Part of reading, questioning, interpreting, and thinking about these things is (a) defining concepts I don’t understand and (b) digging into the levels beneath what I might.
It doesn’t have to be 100% correct to understand the shape and implications of a given study. And I don’t leave any of these interactions thinking, “ah, now I am an expert!”
Even if it were perfectly correct, neither my memory nor understanding is. That’s fine. If I continue to engage with the topic, I’ll make connections and notice inconsistencies. Or I won’t! Which is also fine. It’s right enough to be net (incredibly) useful compared to what I had before.
There is some sociology/psychology research based on concepts like Maslow's hierarchy of needs that motivate human behavior. There is also the memoir of Bronnie Ware, "Regrets of the Dying", where she, as a nurse at an old folks' home, over several years interviewed people who were about to die; their regrets turn out to be:
not living a life true to oneself, working too hard, not having the courage to express feelings, losing touch with friends, and not allowing oneself to be happier.
With a heavy weighting toward the first point.
I think the comment "he wasted his life" is supposed to be in reference to this: that most people realize at the end that nothing really mattered, and that they chose to follow the structures of society by default instead of daring to do what they intrinsically wanted. Then you can feel as if your life was wasted: you got a single chance to play around and do what you like with your brief time in the universe, and you chose to let someone else dictate how that was going to go. A waste.