To be clear, I use AI for editing all the time.
Actually, diagrams are nice.
Just some pieces like that look like copy-paste (I mean, empty lines before it, the code gets no special typography, etc.):
If we write the boundary information for a packed batch as:
B = { lengths, cu_seqlens, max_seqlen, mask structure }
then every transformer layer in that forward pass consumes the same B.
If the model has L layers, rebuilding or re-synchronizing on B once per layer is not new work. It is the same information being reconstructed again and again.
In other words, the useful work is:
build B once, use it L times.
The wasteful version is:
build B + build B + ⋯ + build B (L times)
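For illustration, a minimal sketch of the "build B once, use it L times" idea - the names here (build_boundaries, the stand-in attention call) are made up for the example, not any particular library's API:

    import torch

    def build_boundaries(lengths):
        # B = { lengths, cu_seqlens, max_seqlen } for one packed batch
        lengths_t = torch.tensor(lengths, dtype=torch.int32)
        cu_seqlens = torch.zeros(len(lengths) + 1, dtype=torch.int32)
        cu_seqlens[1:] = torch.cumsum(lengths_t, dim=0)  # offsets into the packed token stream
        return {"lengths": lengths_t, "cu_seqlens": cu_seqlens, "max_seqlen": int(lengths_t.max())}

    B = build_boundaries([7, 3, 12])          # built once per forward pass
    for _ in range(24):                       # every layer consumes the same B;
        _ = B["cu_seqlens"], B["max_seqlen"]  # rebuilding B inside this loop would be the wasteful version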
I especially use AI to generate code for things like Mermaid[0]. It's just easier to describe the flow I want to outline than to remember all the nuances of Mermaid or similar code -> graph / diagram tooling. The output still looks nice too.
Yea, I actually tried it out last time we had one of these threads. It's undeniably easy to use, but it is also very opinionated about things like the directory locations/layouts for various assets. I don't think I managed to get it to work with a simple flat directory full of pre-downloaded models on an NFS mount to my NAS. It also insists on re-downloading a 3GB model every time it launches, even after I delete the model file. I probably have to just sit down and do some Googling/searching in order to rein the software in and get it to work the way I want it to on my system.
Oh my apologies I didn't respond - if only HN had a notifier haha
Oh yes we added a custom folder button which, for now, can pull .gguf files from any folder - it supports LM Studio and Ollama ones - but agreed it's still a mess.
One of the goals is to somehow quickly search for .gguf folders, and add recommended folders - we currently have folders for Ollama and LM Studio for eg
Sadly doesn't support fine tuning on AMD yet which gave me a sad since I wanted to cut one of these down to be specific domain experts. Also running the studio is a bit of a nightmare when it calls diskpart during its install (why?)
Apologies as well for not replying sooner - Studio supports AMD out of the box now! We worked with AMD to make it work! One thing that is still missing is pre-compiled AMD ROCm binaries, which we're trying to see if we can integrate.
Interesting on diskpart - let me check and get back to you [EDIT] - Visual Studio Build Tools, Python 3.13, Git, CMake, and Node.js are all MSI-based installers - so these are likely the culprits for using diskpart - essentially MSI installers check whether there's enough disk space before installing items
Thanks for that. Did you notice that the unsloth/unsloth Docker image is 12GB? Does it embed CUDA libraries or some default models that justify the heavy footprint?
Hey so sorry I didn't reply sooner - yes the Docker image used to be around 4-8GB I think, since CUDA itself is sadly ~4GB, and PyTorch takes the rest. So unfortunately the Unsloth Docker image has ballooned due to this. We tried reducing it as much as possible, but it's hard :( https://hub.docker.com/r/vllm/vllm-openai/tags for eg is around 11GB ish, and we're 13.6GB ish.
We'll try our best to compress it more, but it's tough
I applaud that you recently started providing the KL divergence plots that really help understand how different quantizations compare. But how well does this correlate with closed loop performance? How difficult/expensive would it be to run the quantizations on e.g. some agentic coding benchmarks?
Hey! Sorry for not replying sooner - yes we'll keep publishing more KLD - sadly some are saying we are "optimizing" for KLD now since we posted so many haha - but the whole purpose of quantization is to match the BF16 logits as much as possible whilst reducing disk space (ie reduce KLD).
In general - and this is funny, a quirk of quantization - sometimes 8-bit and 4-bit models do BETTER on downstream benchmarks (SWE Bench for eg), since rounding can sometimes somehow act as a "regularization" method (this is just my hunch).
So KLD isn't that expensive, since we leverage the trick of causal attention - since causal attention is lower triangular, we can do 1 forward pass on the entire text (say 2048 tokens), and you obtain logits for the prediction at every token's position - so this is O(N^2).
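For the curious, a rough sketch of that one-pass KLD measurement - the model paths are placeholders and this is not Unsloth's actual evaluation harness, just the general idea with transformers + PyTorch:

    import torch
    import torch.nn.functional as F
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Placeholder paths - swap in the BF16 reference and the quantized model to compare.
    tok = AutoTokenizer.from_pretrained("path/to/bf16-model")
    ref = AutoModelForCausalLM.from_pretrained("path/to/bf16-model", torch_dtype=torch.bfloat16)
    quant = AutoModelForCausalLM.from_pretrained("path/to/quantized-model")

    ids = tok("some evaluation text, ideally ~2048 tokens", return_tensors="pt").input_ids

    with torch.no_grad():
        # One teacher-forced pass per model: because of the causal mask, position i
        # only sees tokens < i, so every position's prediction comes out at once.
        p = F.log_softmax(ref(ids).logits.float(), dim=-1)    # [1, seq, vocab]
        q = F.log_softmax(quant(ids).logits.float(), dim=-1)

    # Mean per-token KL(BF16 || quantized), summed over the vocab at each position.
    kld = F.kl_div(q, p, log_target=True, reduction="none").sum(-1).mean()
    print(f"mean KLD: {kld.item():.4f}")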
However coding benchmarks require actual inference, and cannot use the causal attention trick, and it's best to run them 10 times since temperature = 1.0 is not deterministic - and take an average. We plan to maybe do something like https://marginlab.ai/trackers/claude-code/, which takes a random sample and does it over time.
Hey sorry on the delay - we just added API support, so you can access a remote server - it includes optional python, tool call, bash and web search support if you enable them.
For SSH - we haven't done that yet - for now we have a SHA256 encryption approach, but it's not SSH yet. HTTPS will also sadly have to be part of the end user's setup process for now - we plan to make it better soon!
Appreciate what y'all do! We were chatting on Slack about how many HGX-B300s it would take to run Kimi, and it looks like we could actually fit 2-3 Kimis on a single HGX.
Sorry on the delay - so it installs https://github.com/Blaizzy/mlx-vlm and other components and sets up the commands - you don't need to use it but we thought it might be easier for folks