These still run on GPUs

londons_explore · on Feb 28, 2024

GPU's aren't yet awfully efficient at 1 bit math.

I could imagine FPGA designs might be competitive.

And dedicated ASIC's would almost certainly beat both by a decent margin.

int_19h · on Feb 28, 2024

I don't think it would be difficult to make them efficient.

The main reason why we run this stuff on GPUs is their memory bandwidth, anyway.

sebzim4500 · on Feb 28, 2024

I'm very unconvinced that ASICs are better suited for this than for FP16/FP8 models that are being used today.

londons_explore · on Feb 28, 2024

BF16 is a pretty big unit in an ASIC - You need at least 9 * 5 gates to calculate the exponent of the result, a 10 bit barrel shifter (10*10 + 10*ceil(log2(10)) gates), and a 10 bit multiplier (approximately 10 * 10 * 9 gates)

Total = 1085 gates. The reality is probably far more, because you're going to want to use carry-look-ahead and pipelining.

Whereas 1 bit multiplies and add's of say a 16 bit accumulator use... 16 gates! (and probably half since you can probably use scheduling tricks to skip past the zero's, at the expense of variable latency...)

So when 1 bit math uses only 1/100th of the silicon area of 16 bit math, and according to this paper gets the same results, the future is clearly silicon that can do 1 bit math.

leroman · on Feb 28, 2024

- we have llama.cpp (could be enough or at least as mentioned in the paper a co-processor to accelerate the calc can be added, less need for large RAM / high end hardware)

- as most work is inference, might not need for as many GPUs

- consumer cards (24G) could possibly run the big models

sebzim4500 · on Feb 28, 2024

If consumer cards can run the big models, then datacenter cards will be able to efficiently run the really big models.

leroman · on Feb 28, 2024

Some tasks we are using LLMs for are performing very close to GPT-4 levels using 7B models, so really depends on what value you are looking to get.