Don't make me tap the sign: There is no such thing as "bytes". There are only encodings. UTF-8 is the encoding most people are using when they talk about modeling "raw bytes" of text. UTF-8 is just a shitty (biased) human-designed tokenizer of the unicode codepoints.
Virtually all current tokenization schemes do work at the raw byte level, not the UTF-8 character level. They do this to avoid the out-of-vocabulary (OOV) or unknown-token problem. In older models, if you came across something in the data you couldn't tokenize, you added an <UNK> token. But tokenization should be exactly reversible, so now people use subword tokenizers that include all 256 single bytes in the vocab. That way you can always represent any text by dropping down to the single-byte level. The other alternative would be to add all Unicode code points to the vocabulary, but there are more than 150k of those, and enough are rare that many would be undertrained. You'd have a lot of glitch tokens (https://arxiv.org/abs/2405.05417). The byte-level approach does mean an LLM isn't 100% guaranteed to output well-formed UTF-8.
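A toy sketch of the byte-fallback idea (the tiny vocab and greedy longest-match here are purely illustrative, not how any particular tokenizer is implemented):

    # Toy byte-fallback tokenizer: try known subword tokens first,
    # otherwise drop to single-byte tokens so any text is representable.
    SUBWORDS = {b"hello": 256, b"wor": 257, b"ld": 258}  # made-up vocab; ids 0-255 are the raw bytes

    def encode(text: str) -> list[int]:
        data = text.encode("utf-8")            # everything is bytes underneath
        ids, i = [], 0
        while i < len(data):
            for j in range(len(data), i, -1):  # greedy longest match
                tok = SUBWORDS.get(data[i:j])
                if tok is not None:
                    ids.append(tok)
                    i = j
                    break
            else:
                ids.append(data[i])            # byte fallback: the raw byte value 0-255
                i += 1
        return ids

    print(encode("hello world, 你好"))  # rare characters become several byte tokens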
And in regard to UTF-8 being a shitty biased tokenizer, here is a recent paper trying to design a better style of encoding: https://arxiv.org/abs/2505.24689
Roger, who spoke only Chinglish and never paused between words, was working on a VAX FORTRAN program that exchanged tapes with IBM mainframes and a memory mapped section, inventing a new word in the process that still has me rolling decades later: ebsah-dicky-asky-codah
RMSNorm is pretty insignificant in terms of the overall compute in a transformer though -- usually the reduction work can be fused with earlier or later operations.
RMSNorm acts like a barrier. No compute on the next network layer can start before all compute in the previous layer is done.
If you're splitting a network across multiple GPUs, this means you must wait for the slowest node and the longest latency.
As soon as you can remove most of these barriers, compute over non-latency-guaranteed networks becomes more practical, as does non-homogeneous compute (i.e. mixing different GPU models).
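For concreteness, a minimal RMSNorm sketch (NumPy, the standard formulation); the mean over the hidden dimension is the reduction that acts as the barrier -- every element of the previous layer's output has to exist before any normalized output can be produced:

    import numpy as np

    def rms_norm(x, gamma, eps=1e-6):
        # reduction over the hidden dimension: needs all of x along that axis
        rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
        return (x / rms) * gamma             # elementwise rescale by learned gain

    x = np.random.randn(4, 1024)             # (tokens, hidden)
    gamma = np.ones(1024)                    # learned scale parameter
    y = rms_norm(x, gamma)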
Yes of course it can, because they fit in the context window. But this is an awful test of the model's capabilities because it was certainly trained on these books and websites talking about the books and the HP universe.
Given that it is pretrained on the material, it would be interesting to do a differential test on in-context reinforcement. What is the recall % before reading the books and after?
I know, for instance, that GPT-4 does much better with the Python manual when we quote relevant context, even though it was trained on the Python manual. This suggests pretraining is less than perfect.
Likewise, in the Harry Potter case I expect a significant difference between its background knowledge and the context-enhanced trial. But I don't have intuition about the effect size we should expect! That makes it a fun experiment.
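Roughly the shape of the experiment I have in mind (ask() is a hypothetical call into the model under test, and the scoring is just a naive substring match over hand-written question/answer pairs):

    # qa_pairs: hand-written list of (question, expected_answer) tuples about the books.
    # ask(prompt): hypothetical helper that returns the model's answer as a string.

    def recall(qa_pairs, ask, context=None):
        hits = 0
        for question, expected in qa_pairs:
            prompt = f"{context}\n\n{question}" if context else question
            if expected.lower() in ask(prompt).lower():  # naive exact-substring scoring
                hits += 1
        return hits / len(qa_pairs)

    # baseline = recall(qa_pairs, ask)                      # background knowledge only
    # enhanced = recall(qa_pairs, ask, context=book_text)   # full books in the context window
    # print(enhanced - baseline)                            # the effect size in question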
Not so fast. If you were evaluating the model on its ability to predict the next word in a Harry Potter book, you'd be right, because it's already seen the entire book, but that's not what's happening here.
The linked X post shows that the user asked the model to generate a graph of the characters, which was presumably a novel question. This is a legitimate test of the model's ability to understand and answer questions about the training data. Repeating the books in the prompt for emphasis makes sense, since the model probably didn't memorize all the relevant details.
The training data may not be HP itself. It may be millions of pages summarising/discussing/dissecting HP, which already contain the relationships spelled out better than in the book itself.
That's true, but the model still analyzed all that disparate information and produced a very detailed graph of the relevant relationships. If anyone can show that the graph itself was in the training data, then I would agree that it's not a good test.
The frustrating thing about all this speculation is that we don't know what was in the training data, but I think we'd need to know that to have any meaningful discussion about it.
It's a novel question and impressive that Gemini was able to solve it but the tweet's author is claiming that this is because of the large context window and not because of all the Harry Potter related training data that was available to it.
The generic models definitely know a lot about Harry Potter without any additional context.
Probably 80% of my questions to ChatGPT were about Harry Potter plot and character details as my kid was reading the books. It was extremely knowledgeable about all the minutiae, probably thanks to all the online discussion more than the books themselves. It was actually the first LLM killer app for me.
That's a good point. I would describe this as a test of Gemini's ability to re-read something it's already familiar with, not a valid test of its ability to read a large new corpus.
It could have been trained on this exact picture, created by a fan and uploaded to some forum. Ultimately it's impossible to know unless you test with brand-new material.
I have the same problem with benchmarks that use real world tests (like SAT/LSAT/GRE or whatever else). The model got a good score, sure, but how many thousands of variations of this exact test was it trained on? How many questions did it encounter that were similar or the same?
It seems from the replies that he tried it without the context too and didn't get answers that were as detailed. I'd really like to see the actual difference, but yeah, it would be so much more interesting to use books which aren't summarised and discussed all over the internet.
Don't think Rich Hickey is "retiring" retiring, just that he's retiring from Nubank and "commercial software development".
TFA: I look forward to continuing to lead ongoing work maintaining and enhancing Clojure with Alex, Stu, Fogus, and many others, as an independent developer once again... Retirement returns me to the freedom and independence I had when originally developing Clojure. The journey continues!
Yea uh, don't know what was going on there, but there are roles with direct reports that aren't in Mountain View (unless you mean, like, in 2004). NYC has thousands of Googlers.
when recruiters unilaterally reach out to you it’s about a specific team and specific role, even if that’s just bait or a hook for other roles. very different than scouring a careers site for all positions. just writing that in case you weren’t familiar with that.
Also during my time there, yes there were roles with direct reports outside, but if you kept your eyes open you quickly saw that there was effectively a glass ceiling outside of MTV, NYC (and for some groups, LON or ZRH).
People would consistently get promoted more easily for less impactful projects, and getting headcount and approvals for projects in satellite offices was damn near impossible.
If you wanted to get ahead - to L7 or L8, even L6 on some projects - you had to relocate.
This is ... not what I expected. It's basically wiring up pre-trained models to ChatGPT via a router and "modality transformations" (a.k.a. speech-to-text and text-to-speech).
I expected it to be a GPT-style model that processes audio directly to perform a ton of speech and maybe speech-text tasks in a zero-shot manner.
There's a huge difference between fitting a probabilistic model to a data distribution then sampling from it (what GPT-3 is) and agents that invent language and use it to communicate.
Not much. A transformer trained on multiple senses can learn the sound that an animal makes and associate it with seeing that animal. It can also learn how another agent reacts after it says a word.
The huge difference is actually between animal reflexes and learned behavior. Reflex is built-in. I didn't learn to kick my leg in response to a tap on the patellar tendon.
I agree that a Transformer is an example of a "reflexive" behavior because it learns to react in a context (via gradient descent rather than evolution as the learning algorithm). It's a conditional categorical distribution on steroids.
I also agree it's not much different than what's going on in this petri dish with Pong.
But I don't think that's a profound statement.
What I'm saying is that calling what a Transformer does "language development" isn't accurate. A Transformer can't "develop" language in that sense, it can only learn "reflexive" behavior from the data distribution it's trained on (it could never have produced that data distribution itself without the data existing in the first place).
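To make the "conditional categorical distribution on steroids" point concrete, here's a toy count-and-sample sketch (a bigram table, obviously nothing like a real Transformer, but the same fit-then-sample shape):

    import random
    from collections import Counter, defaultdict

    # "fit": count next-word frequencies per context (here just the previous word)
    corpus = "the cat sat on the mat and the cat ran".split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1

    # "sample": draw from the conditional categorical p(next | prev)
    def sample_next(prev):
        options = counts[prev]
        return random.choices(list(options), weights=list(options.values()))[0]

    print(sample_next("the"))  # "cat" twice as likely as "mat", exactly as in the data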
> I agree that a Transformer is an example of a "reflexive"
I said that it is not reflexive. It is learned. Just because something becomes easy after you learn it does not mean that it is a reflex. I explained why language development can be done with little more than a transformer learning from how others behave when you make an utterance and from how you behave when you hear something, like a decision transformer learning what happens after it takes certain actions in Pong.