
To run a neural network, how much memory does one need?

Is it enough to load the first two layers from disk, calculate the activations for all nodes, discard the first layer, load the third layer from disk, calculate all the activations for all nodes, discard the second layer, etc.?

Then memory only needs to be big enough to hold 2 layers at a time?



This bloke on huggingface documents the memory requirements for his quantized versions of popular models: https://huggingface.co/TheBloke

TL;DR: max RAM needed depends on the quant method; rough ranges are below (with a quick estimation sketch after the list):

7B models are in the 4-8GB range

13B models 8-15GB

30B models 13-33GB

70B models 31-75GB
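A rough way to reproduce those ranges, as a sketch: RAM ≈ parameters × bits-per-weight / 8, plus some headroom. The bit widths and the 20% overhead factor here are assumptions for illustration, not TheBloke's exact numbers.

    import itertools

    def estimate_ram_gb(params_billion, bits_per_weight, overhead=1.2):
        # parameters * bits / 8 gives bytes for the weights alone; the 1.2
        # factor is a guessed allowance for KV cache and runtime buffers
        return params_billion * 1e9 * bits_per_weight / 8 * overhead / 1e9

    for params, bits in itertools.product((7, 13, 30, 70), (4, 8)):
        print(f"{params}B @ {bits}-bit: ~{estimate_ram_gb(params, bits):.0f} GB")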


mildly unrelated: so when I ask GPT-4 a question, it is routed to an instance with about 166-194GB of memory?

> Further details on GPT-4's size and architecture have been leaked. The system is said to be based on eight models with 220 billion parameters each, for a total of about 1.76 trillion parameters, connected by a Mixture of Experts (MoE).

    For a 7B-parameter model using 4-8GB: average = (4+8)/2 = 6GB; memory per billion parameters = 6/7 ≈ 0.857 GB/B

    For a 13B-parameter model using 8-15GB: average = (8+15)/2 = 11.5GB; memory per billion parameters = 11.5/13 ≈ 0.885 GB/B

    For a 30B-parameter model using 13-33GB: average = (13+33)/2 = 23GB; memory per billion parameters = 23/30 ≈ 0.767 GB/B

    For a 70B-parameter model using 31-75GB: average = (31+75)/2 = 53GB; memory per billion parameters = 53/70 ≈ 0.757 GB/B

    The average of these values is: (0.857 + 0.885 + 0.767 + 0.757)/4 ≈ 0.817 GB/B

    Estimated memory usage for a 220B expert = 220 * 0.817 ≈ 179.74GB
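The same arithmetic as a few lines of Python, in case anyone wants to tweak the assumptions (the ranges are the ones quoted above; 220B is the leaked per-expert size):

    # Midpoint of each quantized RAM range, divided by parameter count,
    # gives GB per billion parameters; average those and scale to 220B.
    ranges = {7: (4, 8), 13: (8, 15), 30: (13, 33), 70: (31, 75)}

    ratios = []
    for params_b, (lo, hi) in ranges.items():
        mid = (lo + hi) / 2
        ratios.append(mid / params_b)
        print(f"{params_b}B: midpoint {mid} GB -> {mid / params_b:.3f} GB/B")

    avg = sum(ratios) / len(ratios)   # ~0.817 GB/B
    print(f"220B expert: ~{220 * avg:.1f} GB")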


That's interesting math. I don't think they are using 4 bits, or even 8. My bet would be 16 bits. (Bear in mind that's just speculation, for "math's sake".)

So we are talking about 4x your numbers per specialist model:

180GB * 4 = 720GB. Accounting for the larger context, let's say 750GB.

Anyone remember how many specialists they are supposedly using for each request?

If it's 2, we are talking about 1.5TB of processed weights for each generated token. With 4, it's 3TB/token.

At $0.06 per 1k tokens we get:

3TB * 1k tokens / $0.06 ≈ 50 petabytes of processed data per dollar.

Doesn't seem so expensive now.
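The same back-of-the-envelope sums in code, so the assumptions are explicit (16-bit weights, ~750GB per expert including context, $0.06 per 1k tokens - all speculation, as noted above):

    gb_per_expert = 750          # ~720GB of fp16 weights plus some context overhead
    price_per_1k_tokens = 0.06   # the quoted GPT-4 output price

    for experts_per_token in (2, 4):
        tb_per_token = gb_per_expert * experts_per_token / 1000
        pb_per_dollar = tb_per_token * 1000 / price_per_1k_tokens / 1000
        print(f"{experts_per_token} experts: {tb_per_token:.1f} TB/token, "
              f"~{pb_per_dollar:.0f} PB per dollar")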


Probably. It's no secret that OpenAI has a ton of computing hardware.

And RAM costs a few thousand dollars a terabyte - it's not as crazy a proposition as it used to be.


You don't have to do the loading/discarding explicitly. You could just mmap the entire network and let the OS handle that.
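A minimal sketch of what that looks like, assuming the weights are stored as one contiguous float16 array on disk (a made-up layout, not llama.cpp's actual format):

    import numpy as np

    # Map the file into the address space; nothing is read until pages are touched.
    weights = np.memmap("model.bin", dtype=np.float16, mode="r")

    # Slicing out one layer only faults in the pages backing that slice;
    # cold pages get evicted by the OS under memory pressure.
    layer0 = weights[:4096 * 4096].reshape(4096, 4096)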


Didn't llama.cpp need to convert the weights file to a new format to support that? The way they're stored in the official file isn't efficient for operating on directly.


Because the original format is the undocumented Python pickle format packed into a zip file. It's kind of ridiculous to attempt to support directly.


I don't know about llama.cpp, but yes, this method works best if the binary layout on disk is exactly what you use for the matrices in memory.


They already had their own format before that.


(I am talking out of my butt here - these are new concepts to me, so forgive the ELI5 manner of questions.)

Can you "peel" a layer and feed it off onto something that doesn't need to discard it, but only receives the "curated" layer via the prompt that drove its creation - and then have other weights assigned?

Again - I am an infant on this line of questioning, so please educate me.


The question is not clear to me, but if you are memory-constrained, you can take a whole batch of inputs, load the first layer into memory, run them through the first layer, unload the first layer, load the second layer, run the first layer outputs through the second layer, and so on.
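A sketch of that scheme, with hypothetical file names and a plain matmul + ReLU standing in for real layers:

    import numpy as np

    def run_streamed(x, layer_files):
        # Keep only one layer's weights in memory at a time.
        for path in layer_files:
            w = np.load(path)          # load this layer's weights from disk
            x = np.maximum(x @ w, 0)   # push the whole batch through it
            del w                      # drop the weights before loading the next layer
        return x

    # activations = run_streamed(batch, ["layer0.npy", "layer1.npy", "layer2.npy"])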


Yes... but keep in mind you'll be limited by disk bandwidth if you do that.


It may be a good trade-off if the alternative is not running the model at all.


I think for O(N^2) transformer inference you need to cache all the activations.


You only need to cache the key/value pairs. And LLaMA uses grouped-query attention, so there are even fewer pairs to cache than in the usual models.
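For scale, a rough KV-cache estimate with LLaMA-2-70B-ish numbers (80 layers, head dim 128, 64 query heads but only 8 KV heads with grouped-query attention - quoted from memory, so treat them as illustrative):

    def kv_cache_gb(layers, kv_heads, head_dim, context, bytes_per_elem=2):
        # 2 tensors (K and V) per layer, each [kv_heads, context, head_dim], fp16
        return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1e9

    print(kv_cache_gb(80, 64, 128, 4096))  # full multi-head attention: ~10.7 GB
    print(kv_cache_gb(80, 8, 128, 4096))   # grouped-query attention:    ~1.3 GB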



