
To run a neural network, how much memory does one need?

Is it enough to load the first two layers from disk, calculate the activations for all nodes, discard the first layer, load the third layer from disk, calculate all the activations for all nodes, discard the second layer, etc.?

Then memory only needs to be big enough to hold 2 layers at a time?



This bloke on huggingface documents the memory requirements for his quantized versions of popular models: https://huggingface.co/TheBloke

TL;DR: max RAM needed depends on the quant method; rough ranges are below (with a quick estimation sketch after the list):

7B models are in the 4-8GB range

13B models 8-15GB

30B models 13-33GB

70B models 31-75GB
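A rough way to reproduce those ranges, as a sketch: RAM ≈ parameters × bits-per-weight / 8, plus some headroom. The bit widths and the 20% overhead factor here are assumptions for illustration, not TheBloke's exact numbers.

    import itertools

    def estimate_ram_gb(params_billion, bits_per_weight, overhead=1.2):
        # parameters * bits / 8 gives bytes for the weights alone; the 1.2
        # factor is a guessed allowance for KV cache and runtime buffers
        return params_billion * 1e9 * bits_per_weight / 8 * overhead / 1e9

    for params, bits in itertools.product((7, 13, 30, 70), (4, 8)):
        print(f"{params}B @ {bits}-bit: ~{estimate_ram_gb(params, bits):.0f} GB")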


mildly unrelated: so when I ask GPT-4 a question, it is routed to an instance with about 166-194GB of memory?

> Further details on GPT-4's size and architecture have been leaked. The system is said to be based on eight models with 220 billion parameters each, for a total of about 1.76 trillion parameters, connected by a Mixture of Experts (MoE).

    For a 7B-parameter model using 4-8GB: average = (4+8)/2 = 6GB; memory per billion parameters = 6/7 ≈ 0.857 GB/B

    For a 13B-parameter model using 8-15GB: average = (8+15)/2 = 11.5GB; memory per billion parameters = 11.5/13 ≈ 0.885 GB/B

    For a 30B-parameter model using 13-33GB: average = (13+33)/2 = 23GB; memory per billion parameters = 23/30 ≈ 0.767 GB/B

    For a 70B-parameter model using 31-75GB: average = (31+75)/2 = 53GB; memory per billion parameters = 53/70 ≈ 0.757 GB/B

    The average of these values is: (0.857 + 0.885 + 0.767 + 0.757)/4 ≈ 0.817 GB/B

    Estimated memory usage for a 220B expert = 220 * 0.817 ≈ 179.74GB
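The same arithmetic as a few lines of Python, in case anyone wants to tweak the assumptions (the ranges are the ones quoted above; 220B is the leaked per-expert size):

    # Midpoint of each quantized RAM range, divided by parameter count,
    # gives GB per billion parameters; average those and scale to 220B.
    ranges = {7: (4, 8), 13: (8, 15), 30: (13, 33), 70: (31, 75)}

    ratios = []
    for params_b, (lo, hi) in ranges.items():
        mid = (lo + hi) / 2
        ratios.append(mid / params_b)
        print(f"{params_b}B: midpoint {mid} GB -> {mid / params_b:.3f} GB/B")

    avg = sum(ratios) / len(ratios)   # ~0.817 GB/B
    print(f"220B expert: ~{220 * avg:.1f} GB")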


That's interesting math. I don't think they are using 4 bits, or even 8. My bet would be 16 bits. (Bear in mind that's just speculation, for "math's sake".)

So we are talking about 4x your numbers per specialist model:

180GB * 4 = 720GB. Accounting for the larger context, let's say 750GB.

Anyone remember how many specialists they are supposedly using for each request?

If it's 2, we are talking about 1.5TB of processed weights for each generated token. With 4, it's 3TB/token.

At $0.06 per 1k tokens we get:

3TB * 1k tokens / $0.06 ≈ 50 petabytes of processed data per dollar.

Doesn't seem so expensive now.
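The same back-of-the-envelope sums in code, so the assumptions are explicit (16-bit weights, ~750GB per expert including context, $0.06 per 1k tokens - all speculation, as noted above):

    gb_per_expert = 750          # ~720GB of fp16 weights plus some context overhead
    price_per_1k_tokens = 0.06   # the quoted GPT-4 output price

    for experts_per_token in (2, 4):
        tb_per_token = gb_per_expert * experts_per_token / 1000
        pb_per_dollar = tb_per_token * 1000 / price_per_1k_tokens / 1000
        print(f"{experts_per_token} experts: {tb_per_token:.1f} TB/token, "
              f"~{pb_per_dollar:.0f} PB per dollar")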


Probably. It's no secret that OpenAI has a ton of computing hardware.

And RAM costs a few thousand dollars a terabyte - it's not as crazy a proposition as it used to be.


You don't have to do the loading/discarding explicitly. You could just mmap the entire network and let the OS handle that.
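A minimal sketch of what that looks like, assuming the weights are stored as one contiguous float16 array on disk (a made-up layout, not llama.cpp's actual format):

    import numpy as np

    # Map the file into the address space; nothing is read until pages are touched.
    weights = np.memmap("model.bin", dtype=np.float16, mode="r")

    # Slicing out one layer only faults in the pages backing that slice;
    # cold pages get evicted by the OS under memory pressure.
    layer0 = weights[:4096 * 4096].reshape(4096, 4096)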


Didn't llama.cpp need to convert the weights file to a new format to support that? The way they're stored in the official file isn't efficient for operating on directly.


Because the original format is the undocumented Python pickle format packed into a zip file. It's kind of ridiculous to attempt to support directly.


I don't know about llama.cpp, but yes, this method works best if the binary layout on disk is exactly what you use for the matrices in memory.


They already had their own format before that.


(I am talking out of my butt here - these are new concepts to me, so forgive the ELI5 manner of questions.)

Can you "peel" a layer and feed it off onto something that doesn't need to discard it, but only receives the "curated" layer via the prompt that drove its creation - and then have other weights assigned?

Again - I am an infant on this line of questioning, so please educate me.


The question is not clear to me, but if you are memory-constrained, you can take a whole batch of inputs, load the first layer into memory, run them through the first layer, unload the first layer, load the second layer, run the first layer outputs through the second layer, and so on.
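A sketch of that scheme, with hypothetical file names and a plain matmul + ReLU standing in for real layers:

    import numpy as np

    def run_streamed(x, layer_files):
        # Keep only one layer's weights in memory at a time.
        for path in layer_files:
            w = np.load(path)          # load this layer's weights from disk
            x = np.maximum(x @ w, 0)   # push the whole batch through it
            del w                      # drop the weights before loading the next layer
        return x

    # activations = run_streamed(batch, ["layer0.npy", "layer1.npy", "layer2.npy"])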


Yes... but keep in mind you'll be limited by disk bandwidth if you do that.


It may be a good trade-off if the alternative is not running the model at all.


I think for O(N^2) transformer inference you need to cache all the activations.


You only need to cache the key/value pairs. And LLaMA uses grouped-query attention, so there are even fewer pairs to cache than in the usual models.
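For scale, a rough KV-cache estimate with LLaMA-2-70B-ish numbers (80 layers, head dim 128, 64 query heads but only 8 KV heads with grouped-query attention - quoted from memory, so treat them as illustrative):

    def kv_cache_gb(layers, kv_heads, head_dim, context, bytes_per_elem=2):
        # 2 tensors (K and V) per layer, each [kv_heads, context, head_dim], fp16
        return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1e9

    print(kv_cache_gb(80, 64, 128, 4096))  # full multi-head attention: ~10.7 GB
    print(kv_cache_gb(80, 8, 128, 4096))   # grouped-query attention:    ~1.3 GB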



