
You shouldn't have to quantize it that much. Maybe you're running a lot of other programs while running inference?

Also, try using pure llama.cpp; AFAIK it has the least possible overhead.
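
For reference, a minimal sketch of what that looks like if you go through llama-cpp-python (a thin Python binding over llama.cpp) rather than the raw CLI. The model file name and parameters below are placeholders, not from this thread:

    # Minimal sketch: run a quantized GGUF model via llama-cpp-python,
    # a thin binding over llama.cpp. Path and settings are illustrative.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./phi-2.Q4_K_M.gguf",  # hypothetical quantized model file
        n_ctx=2048,                        # context window size
        n_gpu_layers=-1,                   # offload all layers (Metal on Apple Silicon)
    )

    out = llm("Explain quantization in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])

With -1 for n_gpu_layers everything that fits goes onto the GPU, which is usually what you want on an M-series Mac with unified memory.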



Getting more value out of phi-2-sized models is where you really want to be on lower-end M1s.



