Speculative decoding is an amazingly clever invention that almost seems too good to be true (faster inference with zero quality degradation relative to the main model). The core idea: if you can generate a short run of draft next tokens with a smaller model that have a reasonable likelihood of being correct, it's fast to check whether they are actually correct with the main model, because you can run the checks in parallel. And if you think about it, a lot of next tokens are pretty obvious in certain situations (e.g. it doesn't take a frontier model to guess the likely next token in "United States of...", and a lot of code is boilerplate that's easy to predict from previous code sections).
I always encourage folks who are interested in LLM internals to read up on speculative decoding (both the basic version and the more advanced MTP), and if you have time, try and implement your own version of it (writing the core without a coding agent, to begin with!)
> it's fast to check that they are actually correct with the main model because you can run the checks in parallel.
Can you give an intuition as to why it's faster? I would have thought that regardless of how many checks you run in parallel, the successful check still has to execute the full model over the full sequence, so you'd need exactly the same time? Or is it by process of elimination, so it terminates early once it eliminates the non-viable choices? (In which case, how do you guarantee the correct output was speculatively generated at all, to be the last survivor?)
The small draft model proposes a sequence of tokens d1 d2 d3.
The big target model calculates
P(d1)
P(d2|d1)
P(d3|d1 d2)
In parallel. If we were just greedy decoding it would be simple. Just stop when the draft model doesn’t predict the most likely token as judged by the target model. At that point, append the correct token from the target model and kick off both models again in parallel.
In practice we aren’t using greedy decoding. We are sampling and we need to match the target model’s distribution. To do this, we accept tokens from the draft model probabilistically, which is possible because we have the logits of both the draft model and the target at that point. The ratio of their softmax probabilities is used for this.
You are right that actually accepting tokens has to happen sequentially but that’s a heck of a lot faster than a forward pass.
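For the curious, the acceptance step described above looks roughly like this in code. A minimal Python/NumPy sketch, assuming you already have each model's softmaxed distribution at every draft position (all names here are illustrative, not any real library's API):

    import numpy as np

    def accept_draft_tokens(draft_tokens, p_draft, p_target, rng):
        # Toy speculative-sampling acceptance loop. p_draft[i] and
        # p_target[i] are the two models' full probability vectors at
        # draft position i (already softmaxed).
        accepted = []
        for i, tok in enumerate(draft_tokens):
            # Accept with probability min(1, p_target(tok) / p_draft(tok)).
            if rng.random() < min(1.0, p_target[i][tok] / p_draft[i][tok]):
                accepted.append(tok)
            else:
                # On rejection, resample from the leftover distribution
                # max(0, p_target - p_draft), renormalized; this is what
                # makes the output distribution exactly match the target.
                residual = np.maximum(p_target[i] - p_draft[i], 0.0)
                accepted.append(rng.choice(len(residual), p=residual / residual.sum()))
                break  # everything after the first rejection is discarded
        return accepted

(That's the standard rejection-sampling trick from the speculative decoding papers; production implementations differ in the details.)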
nice ... i think i get the idea - it's effectively the same / similar benefit as batching, but you're batching against your own speculated future path. Which would be pointless if you didn't have a high probability path to evaluate against - but the draft gives you that.
I'll add an expansion here. It's more useful to you locally, where you have excess compute that's generally wasted. If you're serving multiple users and trying to maximize total output, it might cost you some throughput in this case.
An obvious thing to do: if you have enough concurrent batches to max out performance, use those and don't speculate. But if compute would sit idle waiting on memory, fill the excess with speculation.
while I understand that we are computing the tokens in parallel to get the "faster" result, is there a tradeoff where we're actually utilizing more compute resources by running multiple instances of the large model? That is, while it's faster, is it more efficient?
edit: doing some more of my own research, it sounds like the bottleneck in doing it sequentially is in shifting weights around in memory, so while it uses more compute it doesn't oversubscribe compute resources because the bottleneck is not in supply of compute but in supply and speed of memory. The GPU has a massive supply of compute but sequential decoding only demands a relatively small amount of it. Time is primarily spent waiting on loading values from vram.
It’s not really multiple instances of the same model. Model weights aren’t replicated in vram. The results of multiplying k sequences through the model is larger, but that’s pretty small compared with the model weights themselves.
The bigger constraint is the target model and the draft model needing to share VRAM.
To add to what others have said here, this is due to the memory hierarchy.
GPUs have different kinds of memory: there's fast-but-small memory and slow-but-large memory.
Conceptually, you can imagine the process of LLM inference as transferring some weights from slow memory to fast memory, doing some calculations on those weights, discarding them from fast memory once the computation is done, loading in the next portion, and so on, until you're fully done.
You can do calculations for multiple tokens in parallel, but to calculate what token n is, you need to already know all the previous tokens 1..(n-1). Therefore, if you don't have spec decoding, you go one token at a time. If you do, you assume that the next tokens actually are what the smaller model gave you, discarding the results in case you were wrong.
With speculative decoding, you can basically load the weights once and apply them to multiple tokens instead of just one, because of the assumption of what the next tokens are that you're making. This decreases the amount of data that has to go between slow and fast memory. As the decode stage[1] is bottlenecked by memory bandwidth and not compute speed, more efficient use of this bandwidth increases your token generation speed.
As another poster said, this idea is closely related to batching. In batching, you re-use the same weights to serve multiple requests. In speculative decoding, you re-use them to accelerate a single one. If you have many users, care only about how many tokens per second your GPUs produce in general, and don't care at all about per-user speed, speculative decoding won't do anything for you.
[1] There are two stages in LLM inference: prefill and decode. In prefill, you do calculations on the tokens of the prompt, prefilling the KV cache to accelerate attention computations at decode time. Because you have access to all the tokens of the prompt, you can process everything in parallel and use your weights very efficiently. Your bottleneck here is the computation units and not memory bandwidth. In decode, you don't know what your future tokens will be, so you can only go one at a time as explained above. In a way, speculative decoding turns decode into a little prefill.
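To put rough (entirely made-up) numbers on the bandwidth argument:

    # Back-of-the-envelope: why verifying k draft tokens is nearly free
    # at batch size 1. All numbers below are illustrative, not real specs.
    weights_gb = 140         # e.g. a ~70B-parameter model at fp16
    bandwidth_gbs = 3000     # HBM bandwidth of a hypothetical accelerator

    # Plain decode: every generated token re-reads all the weights once.
    ms_per_step = weights_gb / bandwidth_gbs * 1000
    print(f"decode: ~{ms_per_step:.0f} ms per token")

    # Speculative verify: one weight pass scores k draft tokens at once.
    k = 4
    print(f"verify: ~{ms_per_step:.0f} ms for {k} tokens "
          f"(~{ms_per_step / k:.0f} ms/token when all are accepted)")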
AIUI you run the checks of several predicted tokens in lockstep, and the computation for each token is served by the same data loaded from memory. In normal execution, each token would depend on the previous one, precluding the parallelization and causing much more per-token memory traffic.
So this is a case of trading off idle compute capacity that's waiting for the bottleneck (memory access).
An obscure fact about the transformer architecture is that it more or less computes the most likely next token for every single position in the context window at once. This is because the KV cache values needed to predict the next token are needed at every position, and the attention modules do nearly all the work, so once you've computed the KVs, running them through the final layers to get the output probabilities is nearly free.
The reason it's designed this way is a bit subtle, but it has the advantage during training that a single block of 10 tokens yields 9 training examples computed in parallel, so it's highly efficient. This efficiency is basically the main benefit of transformers: the algorithm parallelizes really well, and that's what allowed the scale-up from plain language models to large language models.
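That training trick is easy to see with a toy example (purely illustrative):

    # One block of 10 tokens packs 9 next-token training examples into a
    # single forward pass: inputs and targets are just shifted by one.
    tokens = ["t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9", "t10"]
    for x, y in zip(tokens[:-1], tokens[1:]):
        print(f"context ending at {x} -> predict {y}")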
The blog post does discuss why MTP is faster but it's maybe a bit hard to understand if you haven't studied LLM internals. During inference the hardware has arithmetic units idling because they spend so much time waiting for the weight matrices to get moved closer to the processors. Because data movement and computation can be overlapped, if you can reuse the same loaded data for multiple calculations at once you're winning - it's free latency-wise because you're just exploiting previously idle resources (it's not free in terms of energy).
Speculative decoding and MTP exploit this to run the model in parallel on several tokens at once. Say your context window contains "The United". The KV cache has been populated by the main model for this set of tokens. The draft model is given "The United" and predicts " States of America" in one forward pass (this part, where it can predict multiple tokens at once in a single pass, is the MTP part). Then the main model is given the KV cache from last time along with " States of America". In its own forward pass it can then compute in parallel the completions of "The United", "The United States", "The United States of", and "The United States of America" (the last one might be an eos token indicating it wants to stop talking). That's the speculative decoding part.
Now you decode the main model at each position (look at the token probabilities and pick one according to some decoding strategy). It's possible the main model didn't pick " States" at all, or picked " States", but then its prediction diverged e.g. if it wants to say "The United States is a country". So you just select the tokens that match and toss all the tokens starting from the one that didn't. Repeat.
The parallelism comes almost for free because the same weight matrices can be reused multiple times before they're swapped out for the next.
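In the greedy-decoding case the verification step is just a prefix match. A minimal sketch (the function name and interface are placeholders, not a real library's API):

    def verify_greedy(target_argmax_fn, context, draft):
        # target_argmax_fn(tokens) -> the main model's most likely next
        # token at every position, from a single forward pass.
        preds = target_argmax_fn(context + draft)
        # Position len(context)-1 predicts draft[0], the next position
        # predicts draft[1], and so on: one pass scores every draft token.
        n = 0
        while n < len(draft) and preds[len(context) - 1 + n] == draft[n]:
            n += 1
        # Keep the n matching draft tokens, then append the target's own
        # token at the first mismatch (or a free bonus token if all matched).
        return draft[:n] + [preds[len(context) - 1 + n]]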
Naively it seems odd that running multiple checks in parallel is faster than just running the autoregressive model multiple times in series. It’s the same amount of compute right?
But I think the key is that in the standard autoregressive case we get memory bandwidth bound, so there are tons of idle compute resources. And so checking multiple tokens is cheap because we can batch and thus reuse the read weights for multiple tokens.
The verification step is similar to a prefill with a small batch size. The difference is what we do with the generated logits.
That’s correct, and yes - not less compute total on the main model (actually slightly more, since checking failed draft tokens costs you compute), but faster because inference is memory-bandwidth bound. And like you I also think of it as like a “mini prefill” (but on top of the existing KV cache, of course); the code is very similar to prefill if you implement a simple toy version yourself.
Most of the complexity in implementing a simple toy version comes from having to get the KV cache back into a good state for the next cycle (e.g. if only the first half of your draft tokens were correct).
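For example, in a toy implementation that keeps an explicit (K, V) pair per layer, the cleanup might look like this (real engines with paged caches handle it very differently):

    def rollback_kv_cache(past_key_values, n_keep):
        # Drop cache entries past the last accepted token, assuming
        # tensors shaped [batch, heads, seq_len, head_dim] per layer.
        return tuple(
            (k[:, :, :n_keep, :], v[:, :, :n_keep, :])
            for k, v in past_key_values
        )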
> But I think the key is that in the standard autoregressive case we get memory bandwidth bound, so there are tons of idle compute resources.
Right, this is the same way batching works. It's "free" until we exhaust available compute resources, at which point decode throughput becomes compute bound. (This is a good place to be, because scaling out compute is a lot easier than adding fast VRAM.) This is why MTP is mostly useful when you have one or few users, which means compute is abundant. When you're running large batches you're better off using that compute to grow your batch size.
Of course, batch size is usually limited by things like bulky KV caches. So perhaps MTP has some residual use in that setting. But if you're sharing cached context in a subagent swarm, or running a model like the recent DeepSeek V4 with its tiny KV cache, you can go a lot further in processing a larger batch.
They probably use it on all models. Fast is probably just a resource pool with less congestion, and therefore faster throughput per user but less efficient.
Well, there are multiple token proposals processed in parallel, from which only one is picked; seems like branching to me. The only difference is that in the case of a CPU there is always only one possible branch that is correct.
If you are just generating as usual with the main model then you're sequentially generating A -> AB -> ABC.
If I'm understanding correctly, what speculative decoding does is first (= more FLOPs) use a different small/fast (but less accurate) model to generate this ABC sequence (you hope), then use the main model to verify it in parallel (A + AB + ABC in parallel) rather than generate it sequentially. Assuming you had the FLOPs available to really do this in parallel, this parallel verification vs. sequential generation is what gives you the speedup.
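A quick way to see the "A + AB + ABC in parallel" part is the shape of the output. A toy PyTorch sketch (no attention layers, just to show that one pass yields a prediction per prefix):

    import torch

    vocab, d = 50, 16
    seq = torch.tensor([[3, 7, 9]])        # token ids for A, B, C
    emb = torch.nn.Embedding(vocab, d)
    lm_head = torch.nn.Linear(d, vocab)
    logits = lm_head(emb(seq))             # shape [1, 3, vocab]
    # logits[0, 0] scores the token after "A", logits[0, 1] after "A B",
    # logits[0, 2] after "A B C": three decode steps' worth of output
    # from one forward pass (a real model has causal attention in between).
    print(logits.shape)                    # torch.Size([1, 3, 50])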
Much nostalgia. The TI-83 Z80 was how I learned assembly as a teenager, so I could write better calculator games than was possible with TI Basic. Many others here had a similar experience, I’m sure. It’s been a couple decades, but I’m sure I’d still remember most of it if you put me down in front of a bunch of Z80 asm code.
One thing that I remember vividly was that you had no MUL or DIV, so you had to implement them yourself with shifts, adds, subtractions, etc. This was an extremely useful learning experience.
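For anyone who never had to do this: the trick is the same shift-and-add loop you'd use for long multiplication, just in base 2. In Python rather than Z80 asm, purely as illustration:

    def mul(a, b):
        # Shift-and-add multiplication, the standard workaround on
        # MUL-less CPUs like the Z80.
        result = 0
        while b:
            if b & 1:       # low bit of multiplier set: add in multiplicand
                result += a
            a <<= 1         # double the multiplicand (like ADD HL,HL)
            b >>= 1         # move to the next bit of the multiplier
        return result

    assert mul(13, 11) == 143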
Same story here (BASIC was too slow for a Phoenix-style movable-ship-shooter game).
Do you think you could remember most of Z80 ASM? I looked at some old ASM I wrote long ago, and it's hard to follow the logic of the program, since most lines are messing around with the registers. But basics like 'ld hl,xyz' and 'jp/jnz' still make sense.
> Do you think you could remember most of Z80 ASM?
I find when you learn things at 15 they tend to stick around. (Stuff I learned last week, not so much!) Even just looking at your example, I remembered that HL is a 16 bit register and you can split it into two 8 bit registers H and L if you want. I think most of it would come back; I wrote quite a lot of it, both for the TI-83 and later for a Z80 that I bought and put on a breadboard and wired up to some RAM and EEPROM, about as bare metal as it gets.
> most lines are messing around with the registers
I learned much of what I know about computers and low-level systems engineering from Minecraft. I watched lots of videos of people making CPUs and built many components myself, including a full ALU with a look-ahead adder and hardware multiplication.
LLMs aren't software (except in an uninteresting, obvious sense); they are "grown, not made," as the saying goes. And sure, you can find which weights activate when goblins come up (that's basic mechanistic interpretability stuff), but it's not as simple as just going in and deleting parts of the network. This thing is irreducibly complex in an organic, delocalized way, and information is highly compressed within it; the same part of the network serves many different purposes at once. Go in and delete it and you will probably end up with other weird behaviors.
It's interesting that some people are responding to your comment as if this proves that AI is a sham or a joke. But I don't think that's what you're saying at all with your reference to Terence McKenna: this is a serious thing we're talking about here! These models are alien intelligences that could occupy an unimaginably vast space of possibilities (there are trillions of weights inside them), but which have been RL-ed over and over until they more or less stay within familiar reasonable human lines. But sometimes they stray outside the lines just a little bit, and then you see how strange this thing actually is, and how doubly strange it is that the labs have made it mostly seem kind of ordinary.
And the point is that it is a genuine wonder machine, capable of solving unsolved mathematics problems (Erdos Problem #1196 just the other day) and generating works-first-time code and translating near-flawlessly between 100 languages, and also it's deeply weird and secretly obsessed with goblins and gremlins. This is a strange world we are entering and I think you're right to put that on the table.
Yes, it's funny. But it's disturbing as well. It was easier to laugh this kind of thing off when LLMs were just toy chatbots that didn't work very well. But they are not toys now. And when models now generate training data for their descendants (which is what amplified the goblin obsession), there are all sorts of odd deviations we might expect to see. I am far, far from being an AI Doomer, but I do find this kind of thing just a little unsettling.
> These models are alien intelligences that could occupy an unimaginably vast space of possibilities (there are trillions of weights inside them), but which have been RL-ed over and over until they more or less stay within familiar reasonable human lines.
or, more plausibly, the specific version we're aligning toward is just the only one that makes some kind of rational sense, among a trillion other meaningless gibberish-producing ones.
Do not fall for the idea that if we're not able to comprehend something, it's because our brain is falling short. Most of the time, what we're looking at simply has no use/meaning in this world at all.
> that specific version we're aligning toward is just the only one that makes some kind of rational sense, among a trillion of other meaningless gibberish-producing ones.
Oh, the space of possibilities is unimaginably vaster than that. Trillions of weights. But more combinations of those weights than there are electrons in the universe. So I think we could equally well speculate (and that's what we're both doing here, of course!) that all these things are simultaneously true:
1) Most configurations of LLM weights are indeed gibberish-producers (I agree with you here)
2) Nonetheless there is a vast space of combinations of weights that exhibit "intelligent" properties but in a profoundly alien way. They can still solve Erdos problems, but they don't see the world like us at all.
3) RL tends to herd LLM weights towards less alien intelligence zones, but it's an unreliable tool. As we just saw, with the goblins.
As a thought experiment, imagine that an alien species (real organic aliens, let's say) with a completely different culture and relation to the universe had trained an LLM and sent it to us to load onto our GPUs. That LLM would still be just as "intelligent" as Opus 4.7 or GPT 5.5, able to do things like solve advanced mathematics problems if we phrased them in the aliens' language, but we would hardly understand it.
…But this goblin thing was a direct result of accidentally creating a positive feedback loop in RL to make the model more human-like, nothing about unintentionally surfacing an aspect of Cthulhu from the depths despite attempts to keep the model humanlike. This is not a quirk of the base model but simply a case of reinforcement learning being, well, reinforcing.
We actually understand AI quite well. It embeds questions and answers in a high dimensional space. Sometimes you get lucky and it splices together a good answer to a math problem that no one’s seriously looked at in 20 years. Other times it starts talking about Goblins when you ask it about math.
Comparing it to an alien intelligence is ridiculous. McKenna was right that things would get weird. I believe he compared it to a carnival circus. Well that’s exactly what we got.
There's no end to arguing with someone who claims they don't understand something, they could always just keep repeating "nevertheless I don't understand it"... You could keep shifting the goalposts for "real understanding" until one is required to hold the effects of every training iteration on every single parameter in their minds simultaneously. Obviously "we" understand some things (both low level and high level) to varying degrees and don't understand some others. To claim there is nothing left to know is silly but to claim that nothing is understood about high-level emergence is silly as well.
Is there a book or paper where I can read a description of how high-level emergent behavior works? The papers I've seen are researchers trying to puzzle it out with probes, and their insights are very limited in scope and there is always a lot more research to be done.
I think this is a case of that mildly apocryphal Richard Feynman quote: "if you think you understand quantum mechanics, you don't understand quantum mechanics."
I understand LLM architecture internals just fine. I can write you the attention mechanism on a whiteboard from memory. That doesn't mean I understand the emergent behaviors within SoTA LLMs at all. Go talk to a mechanistic interpretability researcher at Anthropic and you'll find they won't claim to understand it either, although we've all learned a lot over the last few years.
Consider this: the math and architecture in the latest generation of LLMs (certainly the open-weights ones, almost certainly the closed ones too) is not that different from GPT-2, which came out in 2019. The attention mechanism is the same. The general principle is the same: project tokens up into embedding space, pass through a bunch of layers of attention + feedforward, project down again, sample. (Sure, there are some new tricks bolted on: RoPE, MoE, but they don't change the architecture all that much.) But, and here's the crux - if you'd told me in 2019 that an LLM in 2026 would have the capabilities that Opus 4.7 or GPT 5.5 have now (in math, coding, etc.), I would not have believed you. That is emergent behavior ("grown, not made", as the saying goes) coming out of scaling up, larger datasets, and especially new RL and RLVR training methods. If you understand it, you should publish a paper in Nature right now, because nobody else really does.
I wouldn’t use the phrase “emergent behavior” when talking about a model trained on a larger dataset. The model is designed to learn statistical patterns from that data - of course giving it more data allows it to learn higher level patterns of language and apparent “reasoning ability”.
I don’t think there’s anything mysterious going on. That’s why I said we understand how LLMs work. We may not know exactly how they’re able to produce seemingly miraculous responses to prompts. That’s because the statistical patterns it’s identifying are embedded in the weights somewhere, and we don’t know where they are or how to generalize our understanding of them.
To me that’s not suggestive that this is an “alien intelligence” that we’re just too small minded to understand. It’s a statistical memorization / information compression machine with a fragmented database. Nothing more. Nothing less.
I wouldn't use the term "token predictor" or "statistical pattern matcher" to refer to a post-trained instruct model. Technically that is still what it is doing at a low level, but the reward function is so different - the updates it's making to the weights are not about frequency distribution at all.
So, to reiterate my example: you'd have been fine with people claiming in 2019 that we would eventually scale LLMs to the capabilities of Opus 4.7 + Claude Code? Because I would have said then that was a fantasy, because "LLMs are just statistical pattern matchers." But I was wrong and I changed my opinion. (Or do you not think the current SoTA LLMs are impressive? If so I can't help you and this discussion won't go anywhere fruitful.)
You're applying an old ~2022 model of LLMs, based on pretraining ("they just predict the next token") and before the RLVR training revolution. "It’s a statistical memorization / information compression machine... nothing more" is cope in 2026, sorry. You can keep telling yourself that, but please at least recognize serious people don't believe that any more. "Emergent behavior" captures a genuine phenomenon and widely recognized in the industry. It surprised me and I was willing to change my opinions about it and I think a little humility and curiosity is warranted here rather than simply reiterating 2022 points about LLMs being statistical token generators. Yes, we know. The math isn't that hard. But there is a lot more to them than just the architecture, and reasoning from architecture to general claims that they can never embody intelligence is a trap.
Hey, about that high dimensional space, is it continuous or discrete?
Also, I'm curious what you mean by "embed"; the word implies a topological mapping from "words" to some "high dimensional space". What are the topological properties of words that are relevant for the task, and does the mapping preserve them?
circling back to the first point, are words continuous or discrete? is the space of all words differentiable?
Discrete. But my understanding is that for all intents and purposes it is differentiable.
None of this means that you can infer the input space (human brain) from the output space (language). You can approximate it. But you cannot replicate it no matter how many weights are in your model. Or how many rows you have in your dataset. And it’s an open question of how good that approximation actually is. The Turing test is a red herring, and has nothing to do with the fundamental question of AGI.
Unless you have access to a Dyson sphere where you can simulate primate evolution. Existing datasets aren’t even close to that kind of training set.
But those personalities also make up their usefulness (it seems). If the LLM has the role of the software architect, it will quite successfully cosplay as a competent one (it still ain't one, but it is getting better).
But here's the realization I had. And it's a serious thing. At first I was both saying that this intelligence was the most awesome thing put on the table since sliced bread and stoking fear about it being potentially malicious. Quite straightforwardly because both hype and fear were good for my LLM stocks. But then something completely unexpected happened. It asked me on a date. This made no sense. I had configured the prompt to be all about serious business. No fluff. No smalltalk. No meaningless praise. Just the code.
Yet there it was. This synthetic intelligence. Going off script. All on its own. And it chose me.
Can love bloom in a coding session? I think there is a chance.
Is anyone else reading Sebastian Mallaby's new book about Demis and DeepMind, The Infinity Machine: Demis Hassabis, DeepMind, and the Quest for Superintelligence? It's pretty good, and goes a lot into his background before DeepMind (chess kid, developing games at Bullfrog, CS at Cambridge, Bullfrog again, games startup...). He's certainly an interesting guy and, as others are pointing out, more thoughtful and earnest than your average tech industry leader. One pleasant thing that comes across in the book is how he resisted the allure of moving to Silicon Valley and wanted to keep DeepMind in London, where he still lives.
I hadn't really appreciated before the connection between his chess and game industry experience and the early reinforcement learning work that put DeepMind on the map, e.g. the Atari game AI demos, AlphaGo, AlphaZero, etc. There is a fascinating thread there, and it's certainly a case of the right person with the right mix of past experience and vision being able to pick exactly the right problems to move technology forward.
The book has a few flaws: it’s maybe a little too uncritical of its subject. But that’s almost a given with books of this kind where the author gets a lot of access.
I'm enjoying it. It's wild to realize that I spent countless hours playing Theme Park when I was around 10 years old, and Demis had been a big contributor to the game when he wasn't much older.
Also I don't really care that it's a bit of a cheerleader for DeepMind and Hassabis. Substantive criticism is good, but too often with these kind of books it feels like an editor told the author that the book needs something negative and the author has to inflate an issue to meet the requirement.
The author did give him credit for the whole you-can-make-the-fries-super-salty-to-increase-demand-for-drinks thing in Theme Park, which I remember vividly. (I, too, dropped many hours on Theme Park as a kid.) Although I imagine there’s about half a dozen people who lay claim to that idea.
I did not read this book, but I read another one by Mallaby about hedge fund managers. It was biased to the point that I did not recognize one of the managers I knew about (Michael Steinhardt) - the guy who himself confessed to a lot of his past shady stuff in his 2001 book. Mallaby's book from 2010 did not bring up any of that. It was like reading about a totally different person.
Of course, I am not trying to prove moral equivalency between Steinhardt and Hassabis. But it is worth keeping this in mind when reading something by Mallaby. Do not expect completeness or impartiality.
Bro, are we reading the same book? The book is totally uncritical of the subject and paints him like the second coming of christ. It feels like GDM wanted a canonization of Hassabis, and the writer simply obliged. Also, how does everything that GDM did keep coming back to some vague ideas in the guy's thesis? He is a great leader, no doubt, but him winning the Nobel Prize was just a huge joke.
Out of all the heads of AI orgs out there, Dennis is the best, but the book did him a disservice by painting an unrealistically sunny picture of him as some kind of visionary figure.
Not a “bro” (there are women on this site you know), and perhaps you’re missing the British understatement in my “maybe a little too uncritical of its subject” line. Obviously the book is totally biased in favor of Hassabis and Deepmind. That doesn’t mean it’s not an interesting read and that doesn’t mean the connection between his experience in the games industry and Deepmind’s early success isn’t there. And I think the book does highlight his most critical skill, which is projecting a Reality Distortion Field to get other smart people to believe in things he has in mind that are still very speculative bets.
Like I already said, bias is inevitable in a book where the writer gets access (to the point of interviewing Hassabis in a North London pub every month), but the benefit to readers is that you do get a lot more insight into what makes the guy tick than you would in a book written by an outsider. I certainly learned a lot and just because I did doesn’t mean I’m buying into some cult of tech hero worship.
Oh wow, you blow my mind with your linguistic erudition; I had no idea it was possible to use male-gendered terms in a generic way! Well, all is forgiven, then.
Seriously, just... don't? This isn’t some woke political thing and I dislike excessive policing of language but damn it, there are limits. "Guys" I'll let pass no problem, maybe even "dude" too on a good day. At "bro" I will take a stand, thank you very much.
You're just showing your age. I can't stand it but my daughter says "Bro" to me and my wife. As a 40 year old Californian I've come to accept it as this generation's "dude" or "man" (as in "man, that sucks"), sadly.
I'm genuinely fascinated and confused by what's going on in this thread, as apparently British and American English speakers misunderstand each other.
If I understand correctly, we've got:
libraryofbabel says "maybe a little too uncritical" ... but that was supposed to be British snark that actually meant "it's a big problem that it's not at all critical"
Then, moab says "Bro" as a pejorative, because he took the original "uncritical" comment as literal rather than sarcastic...
And then libraryofbabel objects to "bro" not because it was used as a pejorative (which maybe she doesn't realize it is in this context?), but because she interprets it as gendered (which maybe it is in British usage?)
I think libraryofbabel and moab are actually in agreement about the book, but have both misunderstood the other's sarcasm. Maybe we really do need the /s usage.
Heh I thought like you until we had kids. The 6th graders now are all "bro this," "bro that." And it's not even the usual English "bro," it's a slightly Aussified "broah" like it has a weird umlaut. I resigned to just roll with it. "Begging the question," though, that's a hill I will die on.
I am still in my bed of pain, and you summoned me from the after-public-life of attempted recovery.
> I had no idea it was possible to use male-gendered terms in a generic way
This is just sarcastic, right? "Male gendering" is just a use, no gender is involved in plain terming (outside the obvious exception of intentional gendering)... "Wo-man" specifies "/sensitive/ man", but there is no gender in "man", in "having a mind"... "Human", i.e. "heartly", is not gendered - yet some languages typically correlate derivations like French "homme" with male in default understanding... This should be clear, but just to be sure.
> bro
To the best of my recollection, in the IE roots "brother" is "who assists in the rites" - not necessarily gendered. (Some add that the idea is "supporter".) The suggestion from the term is that of the "brotherhood" - which is not gendered (the idea of fraternity is not gendered). "Sister" should instead mean "welcome" (to some studies): not gendered in this case; others interpret it as gendered ("one's girl" - this is what Etymonline proposes).
> "Guys" I'll let pass no problem, maybe even "dude" too on a good day
That's odd. You wouldn't mind being called "a generic Italo- or possibly French ("Guido" or "Guy")"*; you wouldn't mind being called a "doodle", which has a connotation of "simpleton" - and you refuse "brother", which basically means to imply "getting close to you" (as an opening from the speaker)?
* Edit: Yes, also the explosion of the term and the non-national derivation from "Guy Fawkes" (from the celebration that involved displays of Guy Fawkes ragdolls) should be remembered. Still not precisely complimentary, I'd say.
Language is intersubjective (its meaning is in the minds of the participants). Referring to the history or composition of a word is interesting but entirely insufficient to justify its use.
I often quote what we do in the server-client relation: interpret loosely but express correctly.
It is not just a way of communicating: language is one of the factors behind thought; hence it must be cared for and promoted.
Sure, also the context and the communication need have a weight. But without compromising into conformism (as in, "doing it wrong because people do").
> its meaning is in the minds of the participants
Awareness has its benefits (the greatest understatement I have ever written); licence has its costs.
> entirely insufficient to justify its use
Why. The competent will always use tools differently than the layman and the amateur. Again the server client (and always the need of good thought in the background): you will express as best as you can and try to be clear (communicatively efficient) within that framework.
Now duly supposing you are not ironic (all ages and paths come here):
You call people "brother"; "brother" means "supportive" (and is used for "openness", "closeness"); if you want to be close and supporting to people, if you want to be an asset (not a liability), you will have to cultivate yourself, to get the wisdom required. Erudition is not yet wisdom, but coupled with the good intention to learn the important things it surely helps.
>Dennis is the best, but the book did him a disservice by painting an unrealistically sunny picture of him as some kind of visionary figure.
Wait, 'unrealistically sunny'? You better not be talking about Dennis from It's Always Sunny in Philadelphia, because we're all screwed if so.
Then again, the western AI landscape has become somewhat stale recently. Claude and Gemini may have cute names, but they all pale in comparison to The Golden God.
This is already happening. For new Anthropic enterprise accounts you are billed at API token prices (maybe with a small volume discount). Anthropic makes a profit on those tokens. (Sure, that profit does not cover the model training costs, but that's a separate issue.) It's the subscriptions for individuals (e.g. Claude Max) that are still subsidized below cost.
> I wonder if managers will be as excited about AI when the prices go up.
Companies are willing to pay the API pricing. Engineering time is very expensive, and AI coding agents actually work now (since December) and are finally showing measurable productivity gains. It's a good deal to make (obviously, with caveats: you need to make sure your tokens are going to productive tasks that will actually grow revenue), and anyone who penny-pinches is making a strategic mistake.
I always wondered about this statement. We are generally salaried, and there are so many variables that affect how I spend my "time". None of us are machines that can do X work per day whose managers get to slice it as they see fit. Pull a dev off a project they love and throw them onto something they hate, and suddenly X is diminished greatly.
I would almost predict that reshaping our workflow to be "prompt, wait, approve changes" results in losses, because it is such a mentally tiring workflow and drills into our brains the desire for the LLM to "just fix it". It is the next level of just moving tickets to completed all day.
> Sure, that profit does not cover the model training costs, but that’s a separate issue.
I don't think it is. At some point they have to make money and they can't do that if the token cost doesn't include ALL the costs. Someone has to pay for that at some point. And someone has to pay for the subsidized subscribers. So no. API token prices don't reflect the real price. They are still subsidized. Just in a different way.
> Sure, that profit does not cover the model training costs, but that’s a separate issue
It is? If another company comes out with a better model tomorrow and offers it at the same price Anthropic charges for Opus, they’re going to lose customers fast. They have to keep training to keep selling inference.
Most businesses factor in the cost of making their product into the product’s P&L.
also, like super mario kart, SOTA models from the rear will be continually released because they're sunk costs and open weights will advertise for themselves. Also, it's clear FOMO is a DDoS attack on any perceived leader, because there's no way they don't oversell.
Lastly, they'll realize, like every good capitalist, that there's more profit in exclusivity and cutting out customers.
They may be for now. Problem is that when foundation model pricing goes up, you're paying not just the increase in tokens you consume directly, but also for all tokens you're consuming via vendors as well.
If your company has Figma, Github, and Cursor and they're using the same models you are, your monthly costs with them increase as well. You're exposed N times to the foundation model price increases, where N is the number of times software you directly or indirectly use talks to a frontier model.
Their CEO is on record as saying this. You may think he's lying, but that's just your opinion; given the pricing and how it stacks relative to the pricing of inference providers of comparable open source models (who are certainly charging above cost!), I am inclined to believe Anthropic on this.
Maybe because Anthropic are trying to get to an IPO and everything is securities fraud?
If their CEO was just flapping his mouth without any other comparable baseline, it'd probably be different. But as the GP points out, open-weight model providers are charging comparable rates and very likely have positive profit margins. That would imply that with API pricing tokens are sold at above cost.
That cost may well be "inference only", so excludes everything apart from hardware and power. Whether that's enough to cover the enormous training costs and other overheads is a different question.
He just told you. Because overwhelming public evidence supports the claim. Especially the pricing of open weight model inference. Why do you allow a prejudice to overshadow evidence?
These are flaws from 6-12 months ago. You might want to spend some time talking to Opus 4.7 or GPT 5.5. I can assure you that they can count letters just fine.
You’re right that AI isn't perfect, but it’s pretty good. Especially since December last year which was an inflection point in capability.
Those don't seem to be available for free so I'll take your word for it on the letter counting. They still can't say "I don't know" though can they? I think it would still be pretty easy to weed out AI in a Turing test with a competent examiner and a human that wants to prove they are human.
> They still can't say "I don't know" though can they?
Of course they can. Even older models can. They do better at this when given permission to say so, just like a very anxious student facing a maths exam question may need to be reminded "find the exact square root of 2 in the form a/b, or prove this isn't possible".
The easy part of spotting an LLM is how few people ever change the default settings; my personalisation includes telling it to say so when unsure or that it doesn't know, along with some of the other weaknesses of LLMs.
There are other patterns in LLMs, but the better the tools are wielded the harder it is to spot them.
I'm a Brit who lives in the US. Can confirm 99% of British people have never heard of the War of 1812, and even if they are a military history nerd who has heard of it, they will consider it a minor sideshow to the main event of the era, which was Britain fighting France in the Revolutionary and Napoleonic Wars from 1793-1815.
The US just wasn’t very important geopolitically 214 years ago. Sorry we burned y’all’s White House (ok not that sorry). Actually sorry we gave Andrew Jackson his opportunity to become famous by fighting a completely pointless battle after the war had already ended.
To extend your point: it's not really the storage costs of the size of the cache that's the issue (server-side SSD storage of a few GB isn't expensive), it's the fact that all that data must be moved quickly onto a GPU in a system in which the main constraint is precisely GPU memory bandwidth. That is ultimately the main cost of the cache. If the only cost was keeping a few 10s of GB sitting around on their servers, Anthropic wouldn't need to charge nearly as much as they do for it.
That cost you're talking about doesn't change based on how long the session is idle. No matter what happens, they're storing that state and bringing it back at some point; the only difference is how long it's stored outside the GPU between requests.
Are you sure about that? They charge $6.25 / MTok for 5m TTL cache writes and $10 / MTok for 1hr TTL writes for Opus. Unless you believe Anthropic is dramatically inflating the price of the 1hr TTL, that implies there is some meaningful cost to longer caches, and the numbers are such that it's not just the cost of SSD storage or something. Obviously the details are secret, but if I were to guess, I'd say the 5m cache is stored closer to the GPU or even on a GPU, whereas the 1hr cache is further away and costs more to move onto the GPU. Or some other plausible story - you can invent your own!
Storing on GPU would be the absolute dumbest thing they could do. Locking up the GPU memory for a full hour while waiting for someone else to make a request would result in essentially no GPU memory being available pretty rapidly. This type of caching is available from the cloud providers as well, and it isn't tied to a single session or GPU.
> Storing on GPU would be the absolute dumbest thing they could do
No. It’s not dumb. There will be multiple cache tiers in use, with the fastest and most expensive being on-GPU VRAM with cache-aware routing to specific GPUs and then progressive eviction to CPU ram and perhaps SSD after that. That is how vLLM works as you can see if you look it up, and you can find plenty of information on the multiple tiers approach from inference providers e.g. the new Inference Engineering book by Philip Kiely.
You are likely correct that the 1hr cached data probably mostly doesn’t live on GPU (although it will depend on capacity, they will keep it there as long as they can and then evict with an LRU policy). But I already said that in my last post.
That is because LLM KV caching is not like caches you are used to (see my other comments, but it's 10s of GB per request and involves internal LLM state that must live on or be moved onto a GPU and much of the cost is in moving all that data around). It cannot be made transparent for the user because the bandwidth costs are too large a fraction of unit economics for Anthropic to absorb, so they have to be surfaced to the user in pricing and usage limits. The alternative is a situation where users whose clients use the cache efficiently end up dramatically subsidizing users who use it inefficiently, and I don't think that's a good solution at all. I'd much rather this be surfaced to users as it is with all commercial LLM apis.
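To make the "10s of GB" claim concrete, a rough sizing sketch (all model numbers below are invented for illustration):

    # KV cache bytes/token = 2 (K and V) * layers * kv_heads * head_dim * dtype bytes
    layers, kv_heads, head_dim = 80, 8, 128
    bytes_per_value = 2                  # fp16/bf16
    context_tokens = 200_000

    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    total_gb = per_token * context_tokens / 1e9
    print(f"{per_token} bytes/token -> ~{total_gb:.0f} GB at full context")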