lewtun · 2026-05-07T16:14:07 1778170447

Shameless plug: https://huggingface.co/spaces/smolagents/ml-intern

It’s a simple harness around Opus, but with tight integration to Hugging Face infra, so the agent can read papers, test code and launch experiments

westurner · 2026-05-07T19:06:39 1778180799

What are the benchmarks for this, in terms of costs of computation and error; cost to converge?

Re: hyperparameter tuning and autoresearch: https://news.ycombinator.com/item?id=47444581

Parameter-free LLMs would be cool

lewtun · 2026-04-13T21:57:03 1776117423

Hugging Face Buckets are pretty simple: https://huggingface.co/docs/huggingface_hub/en/guides/bucket...

Disclaimer: I work at HF

lewtun · 2025-11-03T11:07:14 1762168034

The analogy stems from the notion that neural nets are "grown" rather than "engineered". Chris Olah has an old, but good post with some specific examples: https://colah.github.io/notes/bio-analogies/

lewtun · 2025-11-02T06:14:35 1762064075

In the specific case of SmolLM, it originates from the meme in this dataset https://huggingface.co/datasets/bigcode/the-stack-smol

lewtun · 2025-11-01T21:50:11 1762033811

Hi, Lewis here (one of the co-authors). Happy to answer any questions people have about the book :)

troelsSteegin · 2025-11-05T16:33:06 1762360386

This was a good read. I was struck by the quantity of nuanced and applied know-how it took to build SmolLM3. I am curious about the rough cost it took to engineer and train SmolLM3: ~400 GPUs for at least a month and, based on the set of book co-authors, 12 engineers for at least three months. Is $3-5M a fair ballpark number? The complement is how much experience, on average, the team members had doing ML and LLM training at scale before SmolLM3. The book is "up" on recent research, so I am surmising a PhD-centric team, each with multiple systems built. This is not commodity skill. What the book suggests to me is that an LLM applications startup would best focus on understanding the scope and know-how for starting from post-training.
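A back-of-the-envelope sketch of the arithmetic behind that ballpark. Only the GPU count, duration, and headcount come from the comment; the GPU-hour price and the loaded cost per engineer-month are placeholder assumptions, not figures from the book or from Hugging Face.

```python
# Rough cost model. Every price below is an assumed placeholder;
# only 400 GPUs, 1 month, 12 engineers, and 3 months come from the comment.
HOURS_PER_MONTH = 30 * 24  # treat a month as 30 days


def training_cost(gpus, months, usd_per_gpu_hour):
    """Cost of keeping `gpus` busy around the clock for `months`."""
    return gpus * months * HOURS_PER_MONTH * usd_per_gpu_hour


def staffing_cost(engineers, months, usd_per_engineer_month):
    """Loaded cost of `engineers` working for `months`."""
    return engineers * months * usd_per_engineer_month


def estimate(usd_per_gpu_hour, usd_per_engineer_month):
    """Sum compute and staffing for the scenario in the comment."""
    compute = training_cost(400, 1, usd_per_gpu_hour)
    people = staffing_cost(12, 3, usd_per_engineer_month)
    return compute + people


# Hypothetical low and high price assumptions.
low = estimate(usd_per_gpu_hour=2.0, usd_per_engineer_month=20_000)
high = estimate(usd_per_gpu_hour=4.0, usd_per_engineer_month=40_000)

print(f"GPU-hours: {400 * HOURS_PER_MONTH:,}")
print(f"Estimated floor: ${low:,.0f} to ${high:,.0f}")
```

Because both inputs are stated as "at least", the result is a floor on the scenario rather than an estimate of the actual spend; ablations, restarts, and longer timelines are not modeled.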

danielmarkbruce · 2025-11-01T22:30:30 1762036230

I'm a little ways through this and it's great so far, nice job.

One of the reasons people build one, though, is to learn. Most smart folks are quite aware that pre-training a real LLM is going to involve some head banging against the wall (i.e., things don't go as smoothly as in a "building an LLM from scratch" book), and they want to go through the process.

matusp · 2025-11-02T13:15:04 1762089304

Really impressive writeup. In your opinion, how long will this stay up to date? The field is constantly evolving; do you plan to keep updating this document?

lewtun · 2025-11-02T13:42:46 1762090966

Thanks! I expect the book will remain relevant as long as the Transformers architecture does. That’s why we mostly focus on topics we think will stand the test of time, but let’s see how that plays out :)

danielmarkbruce · 2025-11-07T05:27:43 1762493263

Finished. Great write up.

lewtun · 2025-10-04T21:00:09 1759611609

For those interested in playing with an implementation of these ideas, my colleagues at HF made some recipes here: https://github.com/huggingface/trl/blob/main/docs/source/lor...

lewtun · 2025-09-06T09:47:12 1757152032

“QED and the Men Who Made It” [1] might be close to what you’re after for quantum theory at least. Unlike other popular accounts, it gets quite technical and covers a lot of the historical dead ends that people had during the development of quantum field theory.

[1] https://press.princeton.edu/books/paperback/9780691033273/qe...

lewtun · 2025-09-01T19:59:25 1756756765

> We instantiate this idea through Preference-prior Informed Linucb fOr adaptive rouTing (PILOT), a novel extension of LinUCB

Academics are pretty creative at naming their creations

CuriouslyC · 2025-09-01T20:55:11 1756760111

I almost named my LoRA replacement BEMO, but that felt too cute, so it's just BEM (Bolt-on Expert Modules).

lewtun · 2025-07-08T18:38:07 1751999887

Indeed, we opted for offline methods like Anchored Preference Optimization because we found in the Open R1 project that multi-task RL on small models is quite a hassle to get right. With offline methods, you focus much more on dataset curation / generation, but that still provides faster iteration cycles for the model scale we’re dealing with!

lewtun · 2025-05-14T08:39:02 1747211942

> The absolute best way of doing this is these days is likely through a vision based machine learning model, but that is an approach that is very far away from scaling to processing hundreds of gigabytes of PDF files off a single server with no GPU.

SmolDocling is pretty fast and the ONNX weights can be scaled to many CPUs: https://huggingface.co/ds4sd/SmolDocling-256M-preview

Not sure what time scale the author had in mind for processing GBs of PDFs, but the future might be closer than “very far away”