Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

It's because they're natively trained with 1 bit, so it's not losing anything. Now, the question might be how they manage to get decent predictive performance with such little precision. That I don't know.


Not training. Transposing rows/columns of matrices to group 128 parameters with similar (shared) scale factor. Qwen-3 model.


I'm not sure what you mean. Could you please elaborate?




Consider applying for YC's Summer 2026 batch! Applications are open till May 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: