
I've used Google Antigravity to write scripts that download data from Hugging Face and produce architecture diagrams for various LLMs. It's pretty useful, so I thought I'd share it.

There's also a model comparison spreadsheet where you can compare sizes and such: https://weavers.neocities.org/architecture-encyclopedia/mode...

If you'd like any additional models to be added, I can add them in.


The Gemma models are too small to be included in this list.

You're right, the T5 models are very important historically, but they're below 11B and I don't have much to say about them. Definitely a very interesting and important set of models though.


> too small

Eh?

* Gemma 1 (2024): 2B, 7B

* Gemma 2 (2024): 2B, 9B, 27B

* Gemma 3 (2025): 1B, 4B, 12B, 27B

This is the same range as some Llama models which you do mention.

> important historically

Aren't you trying to give a historical perspective? What's the point of this?


Since you included GPT-2, I would think everything from Google, including T5, would qualify for the list.


Yes, but purely in terms of entropy, you can't make a model better than GPT-4 by training it on GPT-4 outputs. The limit you would converge towards is GPT-4.


A better way to think about synthetic data is to consider code. With code you can have an LLM generate code with tests, then confirm that the code compiles and the tests pass. Now you have semi-verified new code you can add to your training data, and training on that will help you get better results for code even though it was generated by a "less good" LLM.
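
A minimal sketch of that filtering loop, assuming you already have generated (code, tests) pairs from the model; the names here (verify_candidate, generated_samples) are purely illustrative, not any particular lab's pipeline:

    import os, subprocess, tempfile

    def verify_candidate(code: str, tests: str) -> bool:
        # Gate 1: the generated code must at least compile.
        # Gate 2: the generated tests must pass when run against it.
        with tempfile.TemporaryDirectory() as d:
            path = os.path.join(d, "candidate.py")
            with open(path, "w") as f:
                f.write(code + "\n\n" + tests)
            if subprocess.run(["python", "-m", "py_compile", path]).returncode != 0:
                return False
            try:
                return subprocess.run(["python", "-m", "pytest", path, "-q"],
                                      timeout=60).returncode == 0
            except subprocess.TimeoutExpired:
                return False

    # keep only the semi-verified samples for the next training mix
    # accepted = [s for s in generated_samples if verify_candidate(s["code"], s["tests"])]

The verification is only as strong as the generated tests, but it filters out a lot of junk before it ever reaches the training data.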


This is kind of related to the Jack Morris post https://blog.jxmo.io/p/there-are-no-new-ideas-in-ai-only where he discusses how the big leaps in LLMs have come not so much from new training methods or architecture changes as such, but from the ability of new architectures to ingest more data.


It's extremely interesting how powerful a language model is at compression.

When you train it to be an assistant model, it's better at compressing assistant transcripts than it is at general text.

There's an eval I have a lot of interest in and respect for, UncheatableEval (https://huggingface.co/spaces/Jellyfish042/UncheatableEval), which tests how good a language model an LLM is by applying it to a range of compression tasks.

This task is essentially impossible to 'cheat'. Compression is a benchmark you cannot game!
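
To make the idea concrete, here's a minimal sketch of the metric behind this kind of eval (not UncheatableEval's actual harness): score a model by the average number of bits it needs per byte of text under its next-token distribution. It uses the transformers library, gpt2 purely as an example model, and a made-up file name.

    import math
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    def bits_per_byte(model_name: str, text: str) -> float:
        # Lower is better: a better next-token predictor is a better compressor.
        tok = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name).eval()
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            # loss is the mean cross-entropy (in nats) over the predicted tokens
            nll = model(input_ids=ids, labels=ids).loss.item()
        total_bits = nll * (ids.shape[1] - 1) / math.log(2)  # mean nats -> total bits
        return total_bits / len(text.encode("utf-8"))

    # print(bits_per_byte("gpt2", open("wiki_sample.txt").read()))

In practice you'd chunk long texts to the model's context window, but the principle is the same: the only way to score well is to actually predict the data better.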


Knowledge is learning relationships by decontextualizing information into generalized components. Application of knowledge is recontextualizing these components based on the problem at hand.

This is essentially just compression and decompression. It's just that with prior compression techniques, we never tried leveraging the inherent relationships encoded in a compressed data structure, because our compression schemes did not leverage semantic information in a generalized way and thus did not encode very meaningful relationships other than "this data uses the letter 'e' quite a lot".

A lot of that comes from the sheer amount of data we throw at these models, which provides enough substrate for semantic compression. Compare that to common compression schemes in the wild, where data is compressed in isolation without contributing its information to some model of the world. It turns out that because of this, we've been leaving a lot on the table with regards to compression. Another factor has been the speed/efficiency tradeoff. GPUs have allowed us to put a lot more into efficiency, and the expectation that many language models only need to produce text about as fast as a human can read it means we can optimize even further for efficiency over speed.

Also, shout out to Fabrice Bellard's ts_zip, which leverages LLMs to compress text files. https://bellard.org/ts_zip/


And of course, once we extended lossy compression to make use of the semantic space, we started getting compression artifacts in semantic space - aka "hallucinations".


That seems worthy of a blog post!


I don't know, it's not that profound of an insight. You throw away color information, the image gets blocky. You throw away frequency information, the image gets blurry. You throw away semantic information, shit stops making sense :).

Still, if someone would turn that into a blog post, I'd happily read it.


There's more to it than that. You can draw strong analogies and also discuss where the analogy suffers. For example, you can compare the decreased ability to accurately recall specific information with high-frequency attenuation in lossy codecs.


Agreed. It's basically lossy compression for everything it's ever read. And the quantization impacts the lossiness, but since a lot of text is super fluffy, we tend not to notice as much as we would when we, say, listen to music that has been compressed in a lossy way.


It's a bit like if you trained a virtual band to play any song ever, then told it to do its own version of the songs. Then prompted it to play whatever specific thing you wanted. It won't be the same because it kinda remembers the right thing sorta, but it's also winging it.


I've been referring to LLMs as JPEG for all the world's data, and people have really started to come around to it. Initially most folks tended to outright reject this comparison.


Ted Chiang wrote a great piece about that: https://www.newyorker.com/tech/annals-of-technology/chatgpt-...

I think it's a solid description for a raw model, but it's less applicable once you start combining an LLM with better context and tools.

What's interesting to me isn't the stuff the LLM "knows" - it's how well an LLM system can serve me when combined with RAG and tools like web search and access to a compiler.

The most interesting developments right now are models like Gemma 3n which are designed to have as much capability as possible without needing a huge amount of "facts" baked into them.


I think one thing this chart makes visually very clear is the point I made about GPT-3 being such a huge leap, and there being a long gap before anybody was able to match it.


This is really awesome. Thank you for creating that. I included a screenshot and link to the chart with credit to you in a comment to my post.


I am happy you like it!

If you like a darker color scheme, here it is:

https://app.charts.quesma.com/s/f07qji

And active vs total:

https://app.charts.quesma.com/s/4bsqjs


I can correct mistakes.

> it somehow merged Llama 4 Maverick's custom Arena chatbot version with Behemoth

I can clarify this part. I wrote 'There was a scandal as facebook decided to mislead people by gaming the lmarena benchmark site - they served one version of llama-4 there and released a different model' which is true.

But it is inside the section about the llama 4 model behemoth. So I see how that could be confusing/misleading.

I could restructure that section a little to improve it.

> Llama 405B was also trained on more than 15 trillion tokens[1],

You're talking about Llama 405B Instruct; I'm talking about Llama 405B base. Of course the instruct model has been trained on more tokens.

> why is there such a focus on token training count?

I tried to include the rough training token count for each model I wrote about - plus additional details about training data mixture if available. Training data is an important part of an LLM.


I have corrected that. It was supposed to say "None of this document was written by AI."

Thank you for spotting the error.


Understood, thanks for updating it!


> Take care of your mental health

How?

