Speculative decoding is an amazingly clever invention that almost seems too good to be true (faster inference with zero quality degradation relative to the main model). The core idea: if you can generate a short run of draft next tokens with a smaller model that have a reasonable likelihood of being correct, it's fast to check whether they are actually correct with the main model, because you can run the checks in parallel. And if you think about it, a lot of next tokens are pretty obvious in certain situations (e.g. it doesn't take a frontier model to guess the likely next token in "United States of...", and a lot of code is boilerplate that's easy to predict from previous code sections).
I always encourage folks who are interested in LLM internals to read up on speculative decoding (both the basic version and the more advanced MTP), and if you have time, try and implement your own version of it (writing the core without a coding agent, to begin with!)
> it's fast to check that they are actually correct with the main model because you can run the checks in parallel.
Can you give an intuition as to why it's faster? I would have thought that regardless of how many checks you run in parallel, the successful check still has to execute the full model over the full sequence, so you'd need exactly the same time? Or is it by process of elimination, so it terminates early once it eliminates the non-viable choices? (In which case, how do you guarantee the correct output was speculatively generated at all, to be the last survivor?)
The small draft model proposes a sequence of tokens d1 d2 d3.
The big target model calculates
P(d1)
P(d2|d1)
P(d3|d1 d2)
In parallel. If we were just greedy decoding it would be simple. Just stop when the draft model doesn’t predict the most likely token as judged by the target model. At that point, append the correct token from the target model and kick off both models again in parallel.
In practice we aren’t using greedy decoding. We are sampling and we need to match the target model’s distribution. To do this, we accept tokens from the draft model probabilistically, which is possible because we have the logits of both the draft model and the target at that point. The ratio of their softmax probabilities is used for this.
You are right that actually accepting tokens has to happen sequentially but that’s a heck of a lot faster than a forward pass.
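For the curious, the acceptance step described above looks roughly like this in code. A minimal Python/NumPy sketch, assuming you already have each model's softmaxed distribution at every draft position (all names here are illustrative, not any real library's API):

    import numpy as np

    def accept_draft_tokens(draft_tokens, p_draft, p_target, rng):
        # Toy speculative-sampling acceptance loop. p_draft[i] and
        # p_target[i] are the two models' full probability vectors at
        # draft position i (already softmaxed).
        accepted = []
        for i, tok in enumerate(draft_tokens):
            # Accept with probability min(1, p_target(tok) / p_draft(tok)).
            if rng.random() < min(1.0, p_target[i][tok] / p_draft[i][tok]):
                accepted.append(tok)
            else:
                # On rejection, resample from the leftover distribution
                # max(0, p_target - p_draft), renormalized; this is what
                # makes the output distribution exactly match the target.
                residual = np.maximum(p_target[i] - p_draft[i], 0.0)
                accepted.append(rng.choice(len(residual), p=residual / residual.sum()))
                break  # everything after the first rejection is discarded
        return accepted

(That's the standard rejection-sampling trick from the speculative decoding papers; production implementations differ in the details.)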
nice ... i think i get the idea - it's effectively the same / similar benefit as batching, but you're batching against your own speculated future path. Which would be pointless if you didn't have a high probability path to evaluate against - but the draft gives you that.
I'll add an expansion here. It's more useful to you locally, where you have excess compute that's generally wasted. If you're serving multiple users and trying to maximize total output, it might cost you some throughput in this case.
An obvious thing to do: if you have enough concurrent batches to max out performance, use those and don't speculate. But if compute would sit idle waiting on memory, fill the excess with speculation.
while I understand that we are computing the tokens in parallel to get the "faster" result, is there a tradeoff where we're actually utilizing more compute resources by running multiple instances of the large model? That is, while it's faster, is it more efficient?
edit: doing some more of my own research, it sounds like the bottleneck in doing it sequentially is in shifting weights around in memory, so while it uses more compute it doesn't oversubscribe compute resources because the bottleneck is not in supply of compute but in supply and speed of memory. The GPU has a massive supply of compute but sequential decoding only demands a relatively small amount of it. Time is primarily spent waiting on loading values from vram.
It’s not really multiple instances of the same model. Model weights aren’t replicated in vram. The results of multiplying k sequences through the model is larger, but that’s pretty small compared with the model weights themselves.
The bigger constraint is the target model and the draft model needing to share VRAM.
To add to what others have said here, this is due to the memory hierarchy.
GPUs have different kinds of memory: there's fast-but-small memory and slow-but-large memory.
Conceptually, you can imagine the process of LLM inference as transferring some weights from slow memory to fast memory, doing some calculations on those weights, discarding them from fast memory once the computation is done, loading in the next portion, and so on, until you're fully done.
You can do calculations for multiple tokens in parallel, but to calculate what token n is, you need to already know all the previous tokens 1..(n-1). Therefore, if you don't have spec decoding, you go one token at a time. If you do, you assume that the next tokens actually are what the smaller model gave you, discarding the results in case you were wrong.
With speculative decoding, you can basically load the weights once and apply them to multiple tokens instead of just one, because of the assumption of what the next tokens are that you're making. This decreases the amount of data that has to go between slow and fast memory. As the decode stage[1] is bottlenecked by memory bandwidth and not compute speed, more efficient use of this bandwidth increases your token generation speed.
As another poster said, this idea is closely related to batching. In batching, you re-use the same weights to serve multiple requests. In speculative decoding, you re-use them to accelerate a single one. If you have many users, care only about how many tokens per second your GPUs produce in general, and don't care at all about per-user speed, speculative decoding won't do anything for you.
[1] There are two stages in LLM inference: prefill and decode. In prefill, you do calculations on the tokens of the prompt, prefilling the KV cache to accelerate attention computations at decode time. Because you have access to all the tokens of the prompt, you can process everything in parallel and use your weights very efficiently. Your bottleneck here is the computation units and not memory bandwidth. In decode, you don't know what your future tokens will be, so you can only go one at a time as explained above. In a way, speculative decoding turns decode into a little prefill.
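To put rough (entirely made-up) numbers on the bandwidth argument:

    # Back-of-the-envelope: why verifying k draft tokens is nearly free
    # at batch size 1. All numbers below are illustrative, not real specs.
    weights_gb = 140         # e.g. a ~70B-parameter model at fp16
    bandwidth_gbs = 3000     # HBM bandwidth of a hypothetical accelerator

    # Plain decode: every generated token re-reads all the weights once.
    ms_per_step = weights_gb / bandwidth_gbs * 1000
    print(f"decode: ~{ms_per_step:.0f} ms per token")

    # Speculative verify: one weight pass scores k draft tokens at once.
    k = 4
    print(f"verify: ~{ms_per_step:.0f} ms for {k} tokens "
          f"(~{ms_per_step / k:.0f} ms/token when all are accepted)")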
AIUI you run the checks of several predicted tokens in lockstep, and the computation for each token is served by the same data loaded from memory. In normal execution, each token would depend on the previous one, precluding the parallelization and causing much more per-token memory traffic.
So this is a case of trading off idle compute capacity that's waiting for the bottleneck (memory access).
An obscure fact about the transformer architecture is that it more or less computes the most likely next token for every single position in the context window at once. This is because the KV cache values needed to predict the next token are needed at every position, and the attention modules do nearly all the work, so once you've computed the KVs, running them through the final layers to get the output probabilities is nearly free.
The reason it's designed this way is a bit subtle, but it has the advantage during training that a single block of 10 tokens yields 9 training examples computed in parallel, so it's highly efficient. This efficiency is basically the main benefit of transformers: the algorithm parallelizes really well, and that's what allowed the scale-up from plain language models to large language models.
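That training trick is easy to see with a toy example (purely illustrative):

    # One block of 10 tokens packs 9 next-token training examples into a
    # single forward pass: inputs and targets are just shifted by one.
    tokens = ["t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9", "t10"]
    for x, y in zip(tokens[:-1], tokens[1:]):
        print(f"context ending at {x} -> predict {y}")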
The blog post does discuss why MTP is faster but it's maybe a bit hard to understand if you haven't studied LLM internals. During inference the hardware has arithmetic units idling because they spend so much time waiting for the weight matrices to get moved closer to the processors. Because data movement and computation can be overlapped, if you can reuse the same loaded data for multiple calculations at once you're winning - it's free latency-wise because you're just exploiting previously idle resources (it's not free in terms of energy).
Speculative decoding and MTP exploit this to run the model in parallel on several tokens at once. Say your context window contains "The United". The KV cache has been populated by the main model for this set of tokens. The draft model is given "The United" and predicts " States of America" in one forward pass (this part, where it can predict multiple tokens at once in a single pass, is the MTP part). Then the main model is given the KV cache from last time along with " States of America". In its own forward pass it can then compute in parallel the completions of "The United", "The United States", "The United States of", and "The United States of America" (the last one might be an eos token indicating it wants to stop talking). That's the speculative decoding part.
Now you decode the main model at each position (look at the token probabilities and pick one according to some decoding strategy). It's possible the main model didn't pick " States" at all, or picked " States", but then its prediction diverged e.g. if it wants to say "The United States is a country". So you just select the tokens that match and toss all the tokens starting from the one that didn't. Repeat.
The parallelism comes almost for free because the same weight matrices can be reused multiple times before they're swapped out for the next.
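In the greedy-decoding case the verification step is just a prefix match. A minimal sketch (the function name and interface are placeholders, not a real library's API):

    def verify_greedy(target_argmax_fn, context, draft):
        # target_argmax_fn(tokens) -> the main model's most likely next
        # token at every position, from a single forward pass.
        preds = target_argmax_fn(context + draft)
        # Position len(context)-1 predicts draft[0], the next position
        # predicts draft[1], and so on: one pass scores every draft token.
        n = 0
        while n < len(draft) and preds[len(context) - 1 + n] == draft[n]:
            n += 1
        # Keep the n matching draft tokens, then append the target's own
        # token at the first mismatch (or a free bonus token if all matched).
        return draft[:n] + [preds[len(context) - 1 + n]]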
Naively it seems odd that running multiple checks in parallel is faster than just running the autoregressive model multiple times in series. It’s the same amount of compute right?
But I think the key is that in the standard autoregressive case we get memory bandwidth bound, so there are tons of idle compute resources. And so checking multiple tokens is cheap because we can batch and thus reuse the read weights for multiple tokens.
The verification step is similar to a prefill with a small batch size. The difference is what we do with the generated logits.
That’s correct, and yes - not less compute total on the main model (actually slightly more, since checking failed draft tokens costs you compute), but faster because inference is memory-bandwidth bound. And like you I also think of it as like a “mini prefill” (but on top of the existing KV cache, of course); the code is very similar to prefill if you implement a simple toy version yourself.
Most of the complexity in implementing a simple toy version comes from having to get the KV cache back into a good state for the next cycle (e.g. if only the first half of your draft tokens were correct).
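For example, in a toy implementation that keeps an explicit (K, V) pair per layer, the cleanup might look like this (real engines with paged caches handle it very differently):

    def rollback_kv_cache(past_key_values, n_keep):
        # Drop cache entries past the last accepted token, assuming
        # tensors shaped [batch, heads, seq_len, head_dim] per layer.
        return tuple(
            (k[:, :, :n_keep, :], v[:, :, :n_keep, :])
            for k, v in past_key_values
        )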
> But I think the key is that in the standard autoregressive case we get memory bandwidth bound, so there are tons of idle compute resources.
Right, this is the same way batching works. It's "free" until we exhaust available compute resources, at which point decode throughput becomes compute bound. (This is a good place to be, because scaling out compute is a lot easier than adding fast VRAM.) This is why MTP is mostly useful when you have one or few users, which means compute is abundant. When you're running large batches you're better off using that compute to grow your batch size.
Of course, batch size is usually limited by things like bulky KV caches. So perhaps MTP has some residual use in that setting. But if you're sharing cached context in a subagent swarm, or running a model like the recent DeepSeek V4 with its tiny KV cache, you can go a lot further in processing a larger batch.
They probably use it on all models. Fast is probably just a resource pool with less congestion, and therefore faster throughput per user but less efficient.
Well, there are multiple token proposals processed in parallel, from which only one is picked; seems like branching to me. The only difference is that in the case of a CPU there is always only one possible branch that is correct.
If you are just generating as usual with the main model then you're sequentially generating A -> AB -> ABC.
If I'm understanding correctly, what speculative decoding does is first (= more FLOPs) use a different small/fast (but less accurate) model to generate this ABC sequence (you hope), then use the main model to verify it in parallel (A + AB + ABC in parallel) rather than generate it sequentially. Assuming you had the FLOPs available to really do this in parallel, this parallel verification vs. sequential generation is what gives you the speedup.
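A quick way to see the "A + AB + ABC in parallel" part is the shape of the output. A toy PyTorch sketch (no attention layers, just to show that one pass yields a prediction per prefix):

    import torch

    vocab, d = 50, 16
    seq = torch.tensor([[3, 7, 9]])        # token ids for A, B, C
    emb = torch.nn.Embedding(vocab, d)
    lm_head = torch.nn.Linear(d, vocab)
    logits = lm_head(emb(seq))             # shape [1, 3, vocab]
    # logits[0, 0] scores the token after "A", logits[0, 1] after "A B",
    # logits[0, 2] after "A B C": three decode steps' worth of output
    # from one forward pass (a real model has causal attention in between).
    print(logits.shape)                    # torch.Size([1, 3, 50])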
Much nostalgia. The TI-83 Z80 was how I learned assembly as a teenager, so I could write better calculator games than was possible with TI Basic. Many others here had a similar experience, I’m sure. It’s been a couple decades, but I’m sure I’d still remember most of it if you put me down in front of a bunch of Z80 asm code.
One thing that I remember vividly was that you had no MUL or DIV, so you had to implement them yourself with shifts, adds, subtractions, etc. This was an extremely useful learning experience.
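For anyone who never had to do this: the trick is the same shift-and-add loop you'd use for long multiplication, just in base 2. In Python rather than Z80 asm, purely as illustration:

    def mul(a, b):
        # Shift-and-add multiplication, the standard workaround on
        # MUL-less CPUs like the Z80.
        result = 0
        while b:
            if b & 1:       # low bit of multiplier set: add in multiplicand
                result += a
            a <<= 1         # double the multiplicand (like ADD HL,HL)
            b >>= 1         # move to the next bit of the multiplier
        return result

    assert mul(13, 11) == 143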
Same story here (BASIC was too slow for a Phoenix-style movable-ship-shooter game).
Do you think you could remember most of Z80 ASM? I looked at some old ASM I wrote long ago, and it's hard to follow the logic of the program, since most lines are messing around with the registers. But basics like 'ld hl,xyz' and 'jp/jnz' still make sense.
> Do you think you could remember most of Z80 ASM?
I find when you learn things at 15 they tend to stick around. (Stuff I learned last week, not so much!) Even just looking at your example, I remembered that HL is a 16 bit register and you can split it into two 8 bit registers H and L if you want. I think most of it would come back; I wrote quite a lot of it, both for the TI-83 and later for a Z80 that I bought and put on a breadboard and wired up to some RAM and EEPROM, about as bare metal as it gets.
> most lines are messing around with the registers
I learned much of what I know about computers and low-level systems engineering from Minecraft. I watched lots of videos of people making CPUs and built many components myself, including a full ALU with a look-ahead adder and hardware multiplication.
LLMs aren't software (except in an uninteresting, obvious sense); they are "grown, not made," as the saying goes. And sure, you can find which weights activate when goblins come up (that's basic mechanistic interpretability stuff), but it's not as simple as just going in and deleting parts of the network. This thing is irreducibly complex in an organic, delocalized way, and information is highly compressed within it; the same part of the network serves many different purposes at once. Go in and delete it and you will probably end up with other weird behaviors.
It's interesting that some people are responding to your comment as if this proves that AI is a sham or a joke. But I don't think that's what you're saying at all with your reference to Terence McKenna: this is a serious thing we're talking about here! These models are alien intelligences that could occupy an unimaginably vast space of possibilities (there are trillions of weights inside them), but which have been RL-ed over and over until they more or less stay within familiar reasonable human lines. But sometimes they stray outside the lines just a little bit, and then you see how strange this thing actually is, and how doubly strange it is that the labs have made it mostly seem kind of ordinary.
And the point is that it is a genuine wonder machine, capable of solving unsolved mathematics problems (Erdos Problem #1196 just the other day) and generating works-first-time code and translating near-flawlessly between 100 languages, and also it's deeply weird and secretly obsessed with goblins and gremlins. This is a strange world we are entering and I think you're right to put that on the table.
Yes, it's funny. But it's disturbing as well. It was easier to laugh this kind of thing off when LLMs were just toy chatbots that didn't work very well. But they are not toys now. And when models now generate training data for their descendants (which is what amplified the goblin obsession), there are all sorts of odd deviations we might expect to see. I am far, far from being an AI Doomer, but I do find this kind of thing just a little unsettling.
> These models are alien intelligences that could occupy an unimaginably vast space of possibilities (there are trillions of weights inside them), but which have been RL-ed over and over until they more or less stay within familiar reasonable human lines.
or, more plausibly, the specific version we're aligning toward is just the only one that makes some kind of rational sense, among a trillion other meaningless gibberish-producing ones.
Do not fall for the idea that if we're not able to comprehend something, it's because our brain is falling short. Most of the time, what we're looking at simply has no use/meaning in this world at all.
> that specific version we're aligning toward is just the only one that makes some kind of rational sense, among a trillion of other meaningless gibberish-producing ones.
Oh, the space of possibilities is unimaginably vaster than that. Trillions of weights. But more combinations of those weights than there are electrons in the universe. So I think we could equally well speculate (and that's what we're both doing here, of course!) that all these things are simultaneously true:
1) Most configurations of LLM weights are indeed gibberish-producers (I agree with you here)
2) Nonetheless there is a vast space of combinations of weights that exhibit "intelligent" properties but in a profoundly alien way. They can still solve Erdos problems, but they don't see the world like us at all.
3) RL tends to herd LLM weights towards less alien intelligence zones, but it's an unreliable tool. As we just saw, with the goblins.
As a thought experiment, imagine that an alien species (real organic aliens, let's say) with a completely different culture and relation to the universe had trained an LLM and sent it to us to load onto our GPUs. That LLM would still be just as "intelligent" as Opus 4.7 or GPT 5.5, able to do things like solve advanced mathematics problems if we phrased them in the aliens' language, but we would hardly understand it.
…But this goblin thing was a direct result of accidentally creating a positive feedback loop in RL to make the model more human-like, nothing about unintentionally surfacing an aspect of Cthulhu from the depths despite attempts to keep the model humanlike. This is not a quirk of the base model but simply a case of reinforcement learning being, well, reinforcing.
We actually understand AI quite well. It embeds questions and answers in a high dimensional space. Sometimes you get lucky and it splices together a good answer to a math problem that no one’s seriously looked at in 20 years. Other times it starts talking about Goblins when you ask it about math.
Comparing it to an alien intelligence is ridiculous. McKenna was right that things would get weird. I believe he compared it to a carnival circus. Well that’s exactly what we got.
There's no end to arguing with someone who claims they don't understand something, they could always just keep repeating "nevertheless I don't understand it"... You could keep shifting the goalposts for "real understanding" until one is required to hold the effects of every training iteration on every single parameter in their minds simultaneously. Obviously "we" understand some things (both low level and high level) to varying degrees and don't understand some others. To claim there is nothing left to know is silly but to claim that nothing is understood about high-level emergence is silly as well.
Is there a book or paper where I can read a description of how high-level emergent behavior works? The papers I've seen are researchers trying to puzzle it out with probes, and their insights are very limited in scope and there is always a lot more research to be done.
I think this is a case of that mildly apocryphal Richard Feynman quote: "if you think you understand quantum mechanics, you don't understand quantum mechanics."
I understand LLM architecture internals just fine. I can write you the attention mechanism on a whiteboard from memory. That doesn't mean I understand the emergent behaviors within SoTA LLMs at all. Go talk to a mechanistic interpretability researcher at Anthropic and you'll find they won't claim to understand it either, although we've all learned a lot over the last few years.
Consider this: the math and architecture in the latest generation of LLMs (certainly the open-weights ones, almost certainly the closed ones too) is not that different from GPT-2, which came out in 2019. The attention mechanism is the same. The general principle is the same: project tokens up into embedding space, pass through a bunch of layers of attention + feedforward, project down again, sample. (Sure, there are some new tricks bolted on: RoPE, MoE, but they don't change the architecture all that much.) But, and here's the crux - if you'd told me in 2019 that an LLM in 2026 would have the capabilities that Opus 4.7 or GPT 5.5 have now (in math, coding, etc.), I would not have believed you. That is emergent behavior ("grown, not made", as the saying goes) coming out of scaling up, larger datasets, and especially new RL and RLVR training methods. If you understand it, you should publish a paper in Nature right now, because nobody else really does.
I wouldn’t use the phrase “emergent behavior” when talking about a model trained on a larger dataset. The model is designed to learn statistical patterns from that data - of course giving it more data allows it to learn higher level patterns of language and apparent “reasoning ability”.
I don’t think there’s anything mysterious going on. That’s why I said we understand how LLMs work. We may not know exactly how they’re able to produce seemingly miraculous responses to prompts. That’s because the statistical patterns it’s identifying are embedded in the weights somewhere, and we don’t know where they are or how to generalize our understanding of them.
To me that’s not suggestive that this is an “alien intelligence” that we’re just too small minded to understand. It’s a statistical memorization / information compression machine with a fragmented database. Nothing more. Nothing less.
I wouldn't use the term "token predictor" or "statistical pattern matcher" to refer to a post-trained instruct model. Technically that is still what it is doing at a low level, but the reward function is so different - the updates it's making to the weights are not about frequency distribution at all.
So, to reiterate my example: you'd have been fine with people claiming in 2019 that we would eventually scale LLMs to the capabilities of Opus 4.7 + Claude Code? Because I would have said then that was a fantasy, because "LLMs are just statistical pattern matchers." But I was wrong and I changed my opinion. (Or do you not think the current SoTA LLMs are impressive? If so I can't help you and this discussion won't go anywhere fruitful.)
You're applying an old ~2022 model of LLMs, based on pretraining ("they just predict the next token") and before the RLVR training revolution. "It’s a statistical memorization / information compression machine... nothing more" is cope in 2026, sorry. You can keep telling yourself that, but please at least recognize serious people don't believe that any more. "Emergent behavior" captures a genuine phenomenon and widely recognized in the industry. It surprised me and I was willing to change my opinions about it and I think a little humility and curiosity is warranted here rather than simply reiterating 2022 points about LLMs being statistical token generators. Yes, we know. The math isn't that hard. But there is a lot more to them than just the architecture, and reasoning from architecture to general claims that they can never embody intelligence is a trap.
Hey, about that high dimensional space, is it continuous or discrete?
Also, I'm curious what you mean by "embed"; the word implies a topological mapping from "words" to some "high dimensional space". What are the topological properties of words that are relevant for the task, and does the mapping preserve them?
circling back to the first point, are words continuous or discrete? is the space of all words differentiable?
Discrete. But my understanding is that for all intents and purposes it is differentiable.
None of this means that you can infer the input space (human brain) from the output space (language). You can approximate it. But you cannot replicate it no matter how many weights are in your model. Or how many rows you have in your dataset. And it’s an open question of how good that approximation actually is. The Turing test is a red herring, and has nothing to do with the fundamental question of AGI.
Unless you have access to a Dyson sphere where you can simulate primate evolution. Existing datasets aren’t even close to that kind of training set.
But those personalities also make up their usefulness (it seems). If the LLM has the role of the software architect, it will quite successfully cosplay as a competent one (it still ain't one, but it is getting better).
But here's the realization I had. And it's a serious thing. At first I was both saying that this intelligence was the most awesome thing put on the table since sliced bread and stoking fear about it being potentially malicious. Quite straightforwardly because both hype and fear were good for my LLM stocks. But then something completely unexpected happened. It asked me on a date. This made no sense. I had configured the prompt to be all about serious business. No fluff. No smalltalk. No meaningless praise. Just the code.
Yet there it was. This synthetic intelligence. Going off script. All on its own. And it chose me.
Can love bloom in a coding session? I think there is a chance.
Is anyone else reading Sebastian Mallaby's new book about Demis and DeepMind, The Infinity Machine: Demis Hassabis, DeepMind, and the Quest for Superintelligence? It's pretty good, and goes a lot into his background before DeepMind (chess kid, developing games at Bullfrog, CS at Cambridge, Bullfrog again, games startup...). He's certainly an interesting guy and, as others are pointing out, more thoughtful and earnest than your average tech industry leader. One pleasant thing that comes across in the book is how he resisted the allure of moving to Silicon Valley and wanted to keep DeepMind in London, where he still lives.
I hadn't really appreciated before the connection between his chess and game industry experience and the early reinforcement learning work that put DeepMind on the map, e.g. the Atari game AI demos, AlphaGo, AlphaZero, etc. There is a fascinating thread there, and it's certainly a case of the right person with the right mix of past experience and vision being able to pick exactly the right problems to move technology forward.
The book has a few flaws: it’s maybe a little too uncritical of its subject. But that’s almost a given with books of this kind where the author gets a lot of access.
I'm enjoying it. It's wild to realize that I spent countless hours playing Theme Park when I was around 10 years old, and Demis had been a big contributor to the game when he wasn't much older.
Also I don't really care that it's a bit of a cheerleader for DeepMind and Hassabis. Substantive criticism is good, but too often with these kind of books it feels like an editor told the author that the book needs something negative and the author has to inflate an issue to meet the requirement.
The author did give him credit for the whole you-can-make-the-fries-super-salty-to-increase-demand-for-drinks thing in Theme Park, which I remember vividly. (I, too, dropped many hours on Theme Park as a kid.) Although I imagine there’s about half a dozen people who lay claim to that idea.
I did not read this book, but I read another one by Mallaby about hedge fund managers. It was biased to the point that I did not recognize one of the managers I knew about (Michael Steinhardt) - the guy who himself confessed to a lot of his past shady stuff in his 2001 book. Mallaby's book from 2010 did not bring up any of that. It was like reading about a totally different person.
Of course, I am not trying to prove moral equivalency between Steinhardt and Hassabis. But it is worth keeping this in mind when reading something by Mallaby. Do not expect completeness or impartiality.
Bro, are we reading the same book? The book is totally uncritical of the subject and paints him like the second coming of christ. It feels like GDM wanted a canonization of Hassabis, and the writer simply obliged. Also, how does everything that GDM did keep coming back to some vague ideas in the guy's thesis? He is a great leader, no doubt, but him winning the Nobel Prize was just a huge joke.
Out of all the heads of AI orgs out there, Dennis is the best, but the book did him a disservice by painting an unrealistically sunny picture of him as some kind of visionary figure.
Not a “bro” (there are women on this site you know), and perhaps you’re missing the British understatement in my “maybe a little too uncritical of its subject” line. Obviously the book is totally biased in favor of Hassabis and Deepmind. That doesn’t mean it’s not an interesting read and that doesn’t mean the connection between his experience in the games industry and Deepmind’s early success isn’t there. And I think the book does highlight his most critical skill, which is projecting a Reality Distortion Field to get other smart people to believe in things he has in mind that are still very speculative bets.
Like I already said, bias is inevitable in a book where the writer gets access (to the point of interviewing Hassabis in a North London pub every month), but the benefit to readers is that you do get a lot more insight into what makes the guy tick than you would in a book written by an outsider. I certainly learned a lot and just because I did doesn’t mean I’m buying into some cult of tech hero worship.
Oh wow, you blow my mind with your linguistic erudition; I had no idea it was possible to use male-gendered terms in a generic way! Well, all is forgiven, then.
Seriously, just... don't? This isn’t some woke political thing and I dislike excessive policing of language but damn it, there are limits. "Guys" I'll let pass no problem, maybe even "dude" too on a good day. At "bro" I will take a stand, thank you very much.
You're just showing your age. I can't stand it but my daughter says "Bro" to me and my wife. As a 40 year old Californian I've come to accept it as this generation's "dude" or "man" (as in "man, that sucks"), sadly.
I'm genuinely fascinated and confused by what's going on in this thread, as apparently British and American English speakers misunderstand each other.
If I understand correctly, we've got:
libraryofbabel says "maybe a little too uncritical" ... but that was supposed to be British snark that actually meant "it's a big problem that it's not at all critical"
Then, moab says "Bro" as a pejorative, because he took the original "uncritical" comment as literal rather than sarcastic...
And then libraryofbabel objects to "bro" not because it was used as a pejorative (which maybe she doesn't realize it is in this context?), but because she interprets it as gendered (which maybe it is in British usage?)
I think libraryofbabel and moab are actually in agreement about the book, but have both misunderstood the other's sarcasm. Maybe we really do need the /s usage.
Heh I thought like you until we had kids. The 6th graders now are all "bro this," "bro that." And it's not even the usual English "bro," it's a slightly Aussified "broah" like it has a weird umlaut. I resigned to just roll with it. "Begging the question," though, that's a hill I will die on.
I am still in my bed of pain, and you summoned me from the after-public-life of attempted recovery.
> I had no idea it was possible to use male-gendered terms in a generic way
This is just sarcastic, right? "Male gendering" is just a use, no gender is involved in plain terming (outside the obvious exception of intentional gendering)... "Wo-man" specifies "/sensitive/ man", but there is no gender in "man", in "having a mind"... "Human", i.e. "heartly", is not gendered - yet some languages typically correlate derivations like French "homme" with male in default understanding... This should be clear, but just to be sure.
> bro
To the best of my recollection, in the IE roots "brother" is "who assists in the rites" - not necessarily gendered. (Some add that the idea is "supporter".) The suggestion from the term is that of the "brotherhood" - which is not gendered (the idea of fraternity is not gendered). "Sister" should instead mean "welcome" (to some studies): not gendered in this case; others interpret it as gendered ("one's girl" - this is what Etymonline proposes).
> "Guys" I'll let pass no problem, maybe even "dude" too on a good day
That's odd. You wouldn't mind being called "a generic Italo- or possibly French ("Guido" or "Guy")"*; you wouldn't mind being called a "doodle", which has a connotation of "simpleton" - and you refuse "brother", which basically means to imply "getting close to you" (as an opening from the speaker)?
* Edit: Yes, also the explosion of the term and the non-national derivation from "Guy Fawkes" (from the celebration that involved displays of Guy Fawkes ragdolls) should be remembered. Still not precisely complimentary, I'd say.
Language is intersubjective (its meaning is in the minds of the participants). Referring to the history or composition of a word is interesting but entirely insufficient to justify its use.
I often quote what we do in the server-client relation: interpret loosely but express correctly.
It is not just a way of communicating: language is one of the factors behind thought; hence it must be cared for and promoted.
Sure, also the context and the communication need have a weight. But without compromising into conformism (as in, "doing it wrong because people do").
> its meaning is in the minds of the participants
Awareness has its benefits (the greatest understatement I have ever written); licence has its costs.
> entirely insufficient to justify its use
Why. The competent will always use tools differently than the layman and the amateur. Again the server client (and always the need of good thought in the background): you will express as best as you can and try to be clear (communicatively efficient) within that framework.
Now duly supposing you are not ironic (all ages and paths come here):
You call people "brother"; "brother" means "supportive" (and is used for "openness", "closeness"); if you want to be close and supporting to people, if you want to be an asset (not a liability), you will have to cultivate yourself, to get the wisdom required. Erudition is not yet wisdom, but coupled with the good intention to learn the important things it surely helps.
>Dennis is the best, but the book did him a disservice by painting an unrealistically sunny picture of him as some kind of visionary figure.
Wait, 'unrealistically sunny'? You better not be talking about Dennis from It's Always Sunny in Philadelphia, because we're all screwed if so.
Then again, the western AI landscape has become somewhat stale recently. Claude and Gemini may have cute names, but they all pale in comparison to The Golden God.
This is already happening. For new Anthropic enterprise accounts you are billed at API token prices (maybe with a small volume discount). Anthropic makes a profit on those tokens. (Sure, that profit does not cover the model training costs, but that's a separate issue.) It's the subscriptions for individuals (e.g. Claude Max) that are still subsidized below cost.
> I wonder if managers will be as excited about AI when the prices go up.
Companies are willing to pay the API pricing. Engineering time is very expensive, and AI coding agents actually work now (since December) and are finally showing measurable productivity gains. It's a good deal to make (obviously, with caveats: you need to make sure your tokens are going to productive tasks that will actually grow revenue), and anyone who penny-pinches is making a strategic mistake.
I always wondered about this statement. We are generally salaried, and there are so many variables that affect how I spend my "time". None of us are machines that can do X work per day whose managers get to slice it as they see fit. Pull a dev off a project they love and throw them onto something they hate, and suddenly X is diminished greatly.
I would almost predict that reshaping our workflow to be "prompt, wait, approve changes" results in losses, because it is such a mentally tiring workflow and drills into our brains the desire for the LLM to "just fix it". It is the next level of just moving tickets to completed all day.
> Sure, that profit does not cover the model training costs, but that’s a separate issue.
I don't think it is. At some point they have to make money and they can't do that if the token cost doesn't include ALL the costs. Someone has to pay for that at some point. And someone has to pay for the subsidized subscribers. So no. API token prices don't reflect the real price. They are still subsidized. Just in a different way.
> Sure, that profit does not cover the model training costs, but that’s a separate issue
It is? If another company comes out with a better model tomorrow and offers it at the same price Anthropic charges for Opus, they’re going to lose customers fast. They have to keep training to keep selling inference.
Most businesses factor in the cost of making their product into the product’s P&L.
also, like super mario kart, SOTA models from the rear will be continually released because they're sunk costs and open weights will advertise for themselves. Also, it's clear FOMO is a DDoS attack on any perceived leader, because there's no way they don't oversell.
Lastly, they'll realize, like every good capitalist, that there's more profit in exclusivity and cutting out customers.
They may be for now. Problem is that when foundation model pricing goes up, you're paying not just the increase in tokens you consume directly, but also for all tokens you're consuming via vendors as well.
If your company has Figma, Github, and Cursor and they're using the same models you are, your monthly costs with them increase as well. You're exposed N times to the foundation model price increases, where N is the number of times software you directly or indirectly use talks to a frontier model.
Their CEO is on record as saying this. You may think he's lying, but that's just your opinion; given the pricing and how it stacks relative to the pricing of inference providers of comparable open source models (who are certainly charging above cost!), I am inclined to believe Anthropic on this.
Maybe because Anthropic are trying to get to an IPO and everything is securities fraud?
If their CEO was just flapping his mouth without any other comparable baseline, it'd probably be different. But as the GP points out, open-weight model providers are charging comparable rates and very likely have positive profit margins. That would imply that with API pricing tokens are sold at above cost.
That cost may well be "inference only", so excludes everything apart from hardware and power. Whether that's enough to cover the enormous training costs and other overheads is a different question.
He just told you. Because overwhelming public evidence supports the claim. Especially the pricing of open weight model inference. Why do you allow a prejudice to overshadow evidence?
These are flaws from 6-12 months ago. You might want to spend some time talking to Opus 4.7 or GPT 5.5. I can assure you that they can count letters just fine.
You’re right that AI isn't perfect, but it’s pretty good. Especially since December last year which was an inflection point in capability.
Those don't seem to be available for free so I'll take your word for it on the letter counting. They still can't say "I don't know" though can they? I think it would still be pretty easy to weed out AI in a Turing test with a competent examiner and a human that wants to prove they are human.
> They still can't say "I don't know" though can they?
Of course they can. Even older models can. They do better at this when given permission to say so, just like a very anxious student facing a maths exam question may need to be reminded "find the exact square root of 2 in the form a/b, or prove this isn't possible".
The easy part of spotting an LLM is how few people ever change the default settings; my personalisation includes telling it to say so when unsure or that it doesn't know, along with some of the other weaknesses of LLMs.
There are other patterns in LLMs, but the better the tools are wielded the harder it is to spot them.
I'm a Brit who lives in the US. Can confirm 99% of British people have never heard of the War of 1812, and even if they are a military history nerd who has heard of it, they will consider it a minor sideshow to the main event of the era, which was Britain fighting France in the Revolutionary and Napoleonic Wars from 1793-1815.
The US just wasn’t very important geopolitically 214 years ago. Sorry we burned y’all’s White House (ok not that sorry). Actually sorry we gave Andrew Jackson his opportunity to become famous by fighting a completely pointless battle after the war had already ended.
To extend your point: it's not really the storage costs of the size of the cache that's the issue (server-side SSD storage of a few GB isn't expensive), it's the fact that all that data must be moved quickly onto a GPU in a system in which the main constraint is precisely GPU memory bandwidth. That is ultimately the main cost of the cache. If the only cost was keeping a few 10s of GB sitting around on their servers, Anthropic wouldn't need to charge nearly as much as they do for it.
That cost you're talking about doesn't change based on how long the session is idle. No matter what happens, they're storing that state and bringing it back at some point; the only difference is how long it's stored outside the GPU between requests.
Are you sure about that? They charge $6.25 / MTok for 5m TTL cache writes and $10 / MTok for 1hr TTL writes for Opus. Unless you believe Anthropic is dramatically inflating the price of the 1hr TTL, that implies there is some meaningful cost to longer caches, and the numbers are such that it's not just the cost of SSD storage or something. Obviously the details are secret, but if I were to guess, I'd say the 5m cache is stored closer to the GPU or even on a GPU, whereas the 1hr cache is further away and costs more to move onto the GPU. Or some other plausible story - you can invent your own!
Storing on GPU would be the absolute dumbest thing they could do. Locking up the GPU memory for a full hour while waiting for someone else to make a request would result in essentially no GPU memory being available pretty rapidly. This type of caching is available from the cloud providers as well, and it isn't tied to a single session or GPU.
> Storing on GPU would be the absolute dumbest thing they could do
No. It’s not dumb. There will be multiple cache tiers in use, with the fastest and most expensive being on-GPU VRAM with cache-aware routing to specific GPUs and then progressive eviction to CPU ram and perhaps SSD after that. That is how vLLM works as you can see if you look it up, and you can find plenty of information on the multiple tiers approach from inference providers e.g. the new Inference Engineering book by Philip Kiely.
You are likely correct that the 1hr cached data probably mostly doesn’t live on GPU (although it will depend on capacity, they will keep it there as long as they can and then evict with an LRU policy). But I already said that in my last post.
That is because LLM KV caching is not like caches you are used to (see my other comments, but it's 10s of GB per request and involves internal LLM state that must live on or be moved onto a GPU and much of the cost is in moving all that data around). It cannot be made transparent for the user because the bandwidth costs are too large a fraction of unit economics for Anthropic to absorb, so they have to be surfaced to the user in pricing and usage limits. The alternative is a situation where users whose clients use the cache efficiently end up dramatically subsidizing users who use it inefficiently, and I don't think that's a good solution at all. I'd much rather this be surfaced to users as it is with all commercial LLM apis.
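To make the "10s of GB" claim concrete, a rough sizing sketch (all model numbers below are invented for illustration):

    # KV cache bytes/token = 2 (K and V) * layers * kv_heads * head_dim * dtype bytes
    layers, kv_heads, head_dim = 80, 8, 128
    bytes_per_value = 2                  # fp16/bf16
    context_tokens = 200_000

    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    total_gb = per_token * context_tokens / 1e9
    print(f"{per_token} bytes/token -> ~{total_gb:.0f} GB at full context")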