To be clear, I use AI for editing all the time.
Actually, diagrams are nice.
Just some pieces like that look like copy-paste (I mean, empty lines before it, the code gets no special typography, etc.):
If we write the boundary information for a packed batch as:
B = { lengths, cu_seqlens, max_seqlen, mask structure }
then every transformer layer in that forward pass consumes the same B.
If the model has L layers, rebuilding or re-synchronizing on B once per layer is not new work. It is the same information being reconstructed again and again.
In other words, the useful work is:
build B once, use it L times.
The wasteful version is:
build B + build B + ⋯ + build B (L times)
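For illustration, a minimal sketch of the "build B once, use it L times" idea - the names here (build_boundaries, the stand-in attention call) are made up for the example, not any particular library's API:

    import torch

    def build_boundaries(lengths):
        # B = { lengths, cu_seqlens, max_seqlen } for one packed batch
        lengths_t = torch.tensor(lengths, dtype=torch.int32)
        cu_seqlens = torch.zeros(len(lengths) + 1, dtype=torch.int32)
        cu_seqlens[1:] = torch.cumsum(lengths_t, dim=0)  # offsets into the packed token stream
        return {"lengths": lengths_t, "cu_seqlens": cu_seqlens, "max_seqlen": int(lengths_t.max())}

    B = build_boundaries([7, 3, 12])          # built once per forward pass
    for _ in range(24):                       # every layer consumes the same B;
        _ = B["cu_seqlens"], B["max_seqlen"]  # rebuilding B inside this loop would be the wasteful version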
I especially use AI to generate code for things like Mermaid[0]. It's just easier to describe the flow I want to outline than to remember all the nuances of Mermaid or similar code -> graph / diagram tooling. The output still looks nice too.
Yea, I actually tried it out last time we had one of these threads. It's undeniably easy to use, but it is also very opinionated about things like the directory locations/layouts for various assets. I don't think I managed to get it to work with a simple flat directory full of pre-downloaded models on an NFS mount to my NAS. It also insists on re-downloading a 3GB model every time it launches, even after I delete the model file. I probably have to just sit down and do some Googling/searching in order to rein the software in and get it to work the way I want it to on my system.
Oh my apologies I didn't respond - if only HN had a notifier haha
Oh yes we added a custom folder button which, for now, can pull .gguf files from any folder - it supports LM Studio and Ollama ones - but agreed it's still a mess.
One of the goals is to somehow quickly search for .gguf folders, and add recommended folders - we currently have folders for Ollama and LM Studio for eg
Sadly doesn't support fine tuning on AMD yet which gave me a sad since I wanted to cut one of these down to be specific domain experts. Also running the studio is a bit of a nightmare when it calls diskpart during its install (why?)
Apologies as well for not replying sooner - Studio supports AMD out of the box now! We worked with AMD to make it work! One thing that is still missing is pre-compiled AMD ROCm binaries, which we're trying to see if we can integrate.
Interesting on diskpart - let me check and get back to you [EDIT] - Visual Studio Build Tools, Python 3.13, Git, CMake, and Node.js are all MSI-based installers - so these are likely the culprits for using diskpart - essentially MSI installers check whether there's enough disk space before installing items
Thanks for that. Did you notice that the unsloth/unsloth Docker image is 12GB? Does it embed CUDA libraries or some default models that justify the heavy footprint?
Hey so sorry I didn't reply sooner - yes the Docker image used to be around 4-8GB I think, since CUDA itself is sadly ~4GB, and PyTorch takes the rest. So unfortunately the Unsloth Docker image has ballooned due to this. We tried reducing it as much as possible, but it's hard :( https://hub.docker.com/r/vllm/vllm-openai/tags for eg is around 11GB ish, and we're 13.6GB ish.
We'll try our best to compress it more, but it's tough
I applaud that you recently started providing the KL divergence plots that really help understand how different quantizations compare. But how well does this correlate with closed loop performance? How difficult/expensive would it be to run the quantizations on e.g. some agentic coding benchmarks?
Hey! Sorry for not replying sooner - yes we'll keep publishing more KLD - sadly some are saying we are "optimizing" for KLD now since we posted so many haha - but the whole purpose of quantization is to match the BF16 logits as much as possible whilst reducing disk space (ie reduce KLD).
In general - and this is funny, a quirk of quantization - sometimes 8-bit and 4-bit models do BETTER on downstream benchmarks (SWE Bench for eg), since rounding can sometimes somehow act as a "regularization" method (this is just my hunch).
So KLD isn't that expensive, since we leverage the trick of causal attention - since causal attention is lower triangular, we can do 1 forward pass on the entire text (say 2048 tokens), and you obtain logits for the prediction at every token's position - so this is O(N^2).
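For the curious, a rough sketch of that one-pass KLD measurement - the model paths are placeholders and this is not Unsloth's actual evaluation harness, just the general idea with transformers + PyTorch:

    import torch
    import torch.nn.functional as F
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Placeholder paths - swap in the BF16 reference and the quantized model to compare.
    tok = AutoTokenizer.from_pretrained("path/to/bf16-model")
    ref = AutoModelForCausalLM.from_pretrained("path/to/bf16-model", torch_dtype=torch.bfloat16)
    quant = AutoModelForCausalLM.from_pretrained("path/to/quantized-model")

    ids = tok("some evaluation text, ideally ~2048 tokens", return_tensors="pt").input_ids

    with torch.no_grad():
        # One teacher-forced pass per model: because of the causal mask, position i
        # only sees tokens < i, so every position's prediction comes out at once.
        p = F.log_softmax(ref(ids).logits.float(), dim=-1)    # [1, seq, vocab]
        q = F.log_softmax(quant(ids).logits.float(), dim=-1)

    # Mean per-token KL(BF16 || quantized), summed over the vocab at each position.
    kld = F.kl_div(q, p, log_target=True, reduction="none").sum(-1).mean()
    print(f"mean KLD: {kld.item():.4f}")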
However coding benchmarks require actual inference, and cannot use the causal attention trick, and it's best to run them 10 times since temperature = 1.0 is not deterministic - and take an average. We plan to maybe do something like https://marginlab.ai/trackers/claude-code/, which takes a random sample and does it over time.
Hey sorry on the delay - we just added API support, so you can access a remote server - it includes optional python, tool call, bash and web search support if you enable them.
For SSH - we haven't done that yet - for now we have a SHA256 encryption approach, but it's not SSH yet. HTTPS will also sadly have to be part of the end user's setup process for now - we plan to make it better soon!
Appreciate what y'all do! We were chatting on Slack about how many HGX-B300s it would take to run Kimi, and it looks like we could actually fit 2-3 Kimis on a single HGX.
Sorry on the delay - so it installs https://github.com/Blaizzy/mlx-vlm and other components and sets up the commands - you don't need to use it but we thought it might be easier for folks