Don't make me tap the sign: There is no such thing as "bytes". There are only encodings. UTF-8 is the encoding most people are using when they talk about modeling "raw bytes" of text. UTF-8 is just a shitty (biased) human-designed tokenizer of the unicode codepoints.
Virtually all current tokenization schemes do work at the raw byte level, not the UTF-8 character level. They do this to avoid the out-of-vocabulary (OOV) or unknown-token problem. In older models, if you came across something in the data you couldn't tokenize, you added an <UNK> token. But tokenization should be exactly reversible, so now people use subword tokenizers that include all 256 single bytes in the vocab. That way you can always represent any text by dropping down to the single-byte level. The other alternative would be to add all Unicode code points to the vocabulary, but there are more than 150k of those, and enough are rare that many would be undertrained. You'd have a lot of glitch tokens (https://arxiv.org/abs/2405.05417). The byte-level approach does mean an LLM isn't 100% guaranteed to output well-formed UTF-8.
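A toy sketch of the byte-fallback idea (the tiny vocab and greedy longest-match here are purely illustrative, not how any particular tokenizer is implemented):

    # Toy byte-fallback tokenizer: try known subword tokens first,
    # otherwise drop to single-byte tokens so any text is representable.
    SUBWORDS = {b"hello": 256, b"wor": 257, b"ld": 258}  # made-up vocab; ids 0-255 are the raw bytes

    def encode(text: str) -> list[int]:
        data = text.encode("utf-8")            # everything is bytes underneath
        ids, i = [], 0
        while i < len(data):
            for j in range(len(data), i, -1):  # greedy longest match
                tok = SUBWORDS.get(data[i:j])
                if tok is not None:
                    ids.append(tok)
                    i = j
                    break
            else:
                ids.append(data[i])            # byte fallback: the raw byte value 0-255
                i += 1
        return ids

    print(encode("hello world, 你好"))  # rare characters become several byte tokens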
And in regard to UTF-8 being a shitty biased tokenizer, here is a recent paper trying to design a better style of encoding: https://arxiv.org/abs/2505.24689
Roger, who spoke only Chinglish and never paused between words, was working on a VAX FORTRAN program that exchanged tapes with IBM mainframes and a memory mapped section, inventing a new word in the process that still has me rolling decades later: ebsah-dicky-asky-codah
RMSNorm is pretty insignificant in terms of the overall compute in a transformer though -- usually the reduction work can be fused with earlier or later operations.
RMSNorm acts like a barrier. No compute on the next network layer can start before all compute in the previous layer is done.
If you're splitting a network across multiple GPUs, this means you must wait for the slowest node and the longest latency.
As soon as you can remove most of these barriers, compute over non-latency-guaranteed networks becomes more practical, as does non-homogeneous compute (i.e. mixing different GPU models).
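For concreteness, a minimal RMSNorm sketch (NumPy, the standard formulation); the mean over the hidden dimension is the reduction that acts as the barrier -- every element of the previous layer's output has to exist before any normalized output can be produced:

    import numpy as np

    def rms_norm(x, gamma, eps=1e-6):
        # reduction over the hidden dimension: needs all of x along that axis
        rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
        return (x / rms) * gamma             # elementwise rescale by learned gain

    x = np.random.randn(4, 1024)             # (tokens, hidden)
    gamma = np.ones(1024)                    # learned scale parameter
    y = rms_norm(x, gamma)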
Yes of course it can, because they fit in the context window. But this is an awful test of the model's capabilities because it was certainly trained on these books and websites talking about the books and the HP universe.
Given that it is pretrained on the material, it would be interesting to do a differential test on in-context reinforcement. What is the recall % before reading the books and after?
I know, for instance, that GPT-4 does much better with the Python manual when we quote relevant context, even though it was trained on the Python manual. This suggests pretraining is less than perfect.
Likewise, in the Harry Potter case I expect a significant difference between its background knowledge and the context-enhanced trial. But I don't have intuition about the effect size we should expect! That makes it a fun experiment.
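Roughly the shape of the experiment I have in mind (ask() is a hypothetical call into the model under test, and the scoring is just a naive substring match over hand-written question/answer pairs):

    # qa_pairs: hand-written list of (question, expected_answer) tuples about the books.
    # ask(prompt): hypothetical helper that returns the model's answer as a string.

    def recall(qa_pairs, ask, context=None):
        hits = 0
        for question, expected in qa_pairs:
            prompt = f"{context}\n\n{question}" if context else question
            if expected.lower() in ask(prompt).lower():  # naive exact-substring scoring
                hits += 1
        return hits / len(qa_pairs)

    # baseline = recall(qa_pairs, ask)                      # background knowledge only
    # enhanced = recall(qa_pairs, ask, context=book_text)   # full books in the context window
    # print(enhanced - baseline)                            # the effect size in question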
Not so fast. If you were evaluating the model on its ability to predict the next word in a Harry Potter book, you'd be right, because it's already seen the entire book, but that's not what's happening here.
The linked X post shows that the user asked the model to generate a graph of the characters, which was presumably a novel question. This is a legitimate test of the model's ability to understand and answer questions about the training data. Repeating the books in the prompt for emphasis makes sense, since the model probably didn't memorize all the relevant details.
The training data may not be HP itself. It may be millions of pages summarising/discussing/dissecting HP, which already contain the relationships spelled out better than in the book itself.
That's true, but the model still analyzed all that disparate information and produced a very detailed graph of the relevant relationships. If anyone can show that the graph itself was in the training data, then I would agree that it's not a good test.
The frustrating thing about all this speculation is that we don't know what was in the training data, but I think we'd need to know that to have any meaningful discussion about it.
It's a novel question and impressive that Gemini was able to solve it but the tweet's author is claiming that this is because of the large context window and not because of all the Harry Potter related training data that was available to it.
The generic models definitely know a lot about Harry Potter without any additional context.
Probably 80% of my questions to ChatGPT were about Harry Potter plot and character details as my kid was reading the books. It was extremely knowledgeable about all the minutiae, probably thanks to all the online discussion more than the books themselves. It was actually the first LLM killer app for me.
That's a good point. I would describe this as a test of Gemini's ability to re-read something it's already familiar with, not a valid test of its ability to read a large new corpus.
It could have been trained on this exact picture, created by a fan and uploaded to some forum. Ultimately it's impossible to know unless you test with brand-new material.
I have the same problem with benchmarks that use real world tests (like SAT/LSAT/GRE or whatever else). The model got a good score, sure, but how many thousands of variations of this exact test was it trained on? How many questions did it encounter that were similar or the same?
It seems from the replies that he tried it without the context too and didn't get answers that were as detailed. I'd really like to see the actual difference, but yeah, it would be so much more interesting to use books which aren't summarised and discussed all over the internet.
Don't think Rich Hickey is "retiring" retiring, just that he's retiring from Nubank and "commercial software development".
TFA: I look forward to continuing to lead ongoing work maintaining and enhancing Clojure with Alex, Stu, Fogus, and many others, as an independent developer once again... Retirement returns me to the freedom and independence I had when originally developing Clojure. The journey continues!
Yea uh, don't know what was going on there, but there are roles with direct reports that aren't in Mountain View (unless you mean, like, in 2004). NYC has thousands of Googlers.
when recruiters unilaterally reach out to you it’s about a specific team and specific role, even if that’s just bait or a hook for other roles. very different than scouring a careers site for all positions. just writing that in case you weren’t familiar with that.
Also during my time there, yes there were roles with direct reports outside, but if you kept your eyes open you quickly saw that there was effectively a glass ceiling outside of MTV, NYC (and for some groups, LON or ZRH).
People would consistently get promoted more easily for less impactful projects, and getting headcount and approvals for projects in satellite offices was damn near impossible.
If you wanted to get ahead - to L7 or L8, even L6 on some projects - you had to relocate.
This is ... not what I expected. It's basically wiring up pre-trained models to ChatGPT via a router and "modality transformations" (a.k.a. speech-to-text and text-to-speech).
I expected it to be a GPT-style model that processes audio directly to perform a ton of speech and maybe speech-text tasks in a zero-shot manner.
There's a huge difference between fitting a probabilistic model to a data distribution then sampling from it (what GPT-3 is) and agents that invent language and use it to communicate.
Not much. A transformer trained on multiple senses can learn the sound that an animal makes and associate it with seeing that animal. It can also learn how another agent reacts after it says a word.
The huge difference is actually between animal reflexes and learned behavior. Reflex is built-in. I didn't learn to kick my leg in response to a tap on the patellar tendon.
I agree that a Transformer is an example of a "reflexive" behavior because it learns to react in a context (via gradient descent rather than evolution as the learning algorithm). It's a conditional categorical distribution on steroids.
I also agree it's not much different than what's going on in this petri dish with Pong.
But I don't think that's a profound statement.
What I'm saying is that calling what a Transformer does "language development" isn't accurate. A Transformer can't "develop" language in that sense, it can only learn "reflexive" behavior from the data distribution it's trained on (it could never have produced that data distribution itself without the data existing in the first place).
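To make the "conditional categorical distribution on steroids" point concrete, here's a toy count-and-sample sketch (a bigram table, obviously nothing like a real Transformer, but the same fit-then-sample shape):

    import random
    from collections import Counter, defaultdict

    # "fit": count next-word frequencies per context (here just the previous word)
    corpus = "the cat sat on the mat and the cat ran".split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1

    # "sample": draw from the conditional categorical p(next | prev)
    def sample_next(prev):
        options = counts[prev]
        return random.choices(list(options), weights=list(options.values()))[0]

    print(sample_next("the"))  # "cat" twice as likely as "mat", exactly as in the data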
> I agree that a Transformer is an example of a "reflexive"
I said that it is not reflexive. It is learned. Just because something becomes easy after you learn it does not mean that it is a reflex. I explained why language development can be done with little more than a transformer learning from how others behave when you make an utterance and from how you behave when you hear something, like a decision transformer learning what happens after it takes certain actions in Pong.