
You shouldn't have to quantize it that much. Maybe you're running a lot of other programs while running inference?

Also, try using pure llama.cpp; AFAIK it has the least possible overhead.
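
For reference, a minimal sketch of what that looks like if you go through llama-cpp-python (a thin Python binding over llama.cpp) rather than the raw CLI. The model file name and parameters below are placeholders, not from this thread:

    # Minimal sketch: run a quantized GGUF model via llama-cpp-python,
    # a thin binding over llama.cpp. Path and settings are illustrative.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./phi-2.Q4_K_M.gguf",  # hypothetical quantized model file
        n_ctx=2048,                        # context window size
        n_gpu_layers=-1,                   # offload all layers (Metal on Apple Silicon)
    )

    out = llm("Explain quantization in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])

With -1 for n_gpu_layers everything that fits goes onto the GPU, which is usually what you want on an M-series Mac with unified memory.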



Getting more value out of phi-2-sized models is where you really want to be on lower-end M1s.



