To run a neural network, how much memory does one need?
Is it enough to load the first two layers from disk, calculate the activations for all nodes, discard the first layer, load the third layer from disk, calculate the activations for all nodes, discard the second layer, etc.?
Then memory only needs to be big enough to hold 2 layers?
mildly unrelated: so when I ask GPT-4 a question, it is routed to an instance with about 166-194GB of memory?
> Further details on GPT-4's size and architecture have been leaked. The system is said to be based on eight models with 220 billion parameters each, for a total of about 1.76 trillion parameters, connected by a Mixture of Experts (MoE).
For a 7B parameter model using 4-8GB: average = (4+8)/2 = 6GB; memory per billion parameters = 6/7 ≈ 0.857 GB/B
For a 13B parameter model using 8-15GB: average = (8+15)/2 = 11.5GB; memory per billion parameters = 11.5/13 ≈ 0.885 GB/B
For a 30B parameter model using 13-33GB: average = (13+33)/2 = 23GB; memory per billion parameters = 23/30 ≈ 0.767 GB/B
For a 70B parameter model using 31-75GB: average = (31+75)/2 = 53GB; memory per billion parameters = 53/70 ≈ 0.757 GB/B
The average of these values is (0.857 + 0.885 + 0.767 + 0.757)/4 ≈ 0.817 GB/B
Estimated memory usage for a 220B-parameter expert = 220 * 0.817 ≈ 179.74GB
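In code, the same estimate looks like this (the GB ranges are the rule-of-thumb figures quoted above, not measurements):

```python
# Rough per-parameter memory estimate from the quoted RAM ranges:
# model size in billions of params -> (low, high) GB.
ranges = {7: (4, 8), 13: (8, 15), 30: (13, 33), 70: (31, 75)}

per_billion = []
for params_b, (low, high) in ranges.items():
    avg_gb = (low + high) / 2
    per_billion.append(avg_gb / params_b)
    print(f"{params_b}B: avg {avg_gb} GB -> {avg_gb / params_b:.3f} GB per billion params")

mean_gb_per_b = sum(per_billion) / len(per_billion)          # ~0.817
print(f"mean: {mean_gb_per_b:.3f} GB/B")
print(f"220B expert estimate: {220 * mean_gb_per_b:.1f} GB")  # ~179.7 GB
```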
That's interesting math. I don't think they are using 4 bits, or even 8. My bet would be 16 bits. (Bear in mind that's just speculation, for "math's sake".)
So we are talking about 4x your numbers per specialist model:
180GB * 4 = 720GB. If you count the greater context, let's say 750GB.
Anyone remember how many specialists they are supposedly using for each request?
If it's 2, we are talking about 1.5TB of processed weights for each generated token. With 4, it's 3TB/token.
At $0.06 per 1k tokens we get:
3TB * 1000 / $0.06 = 50 petabytes of processed data per dollar.
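Spelling that arithmetic out (a sketch of the same speculation, with my assumption that the ~180GB figure corresponds to roughly 4-bit storage made explicit):

```python
# Back-of-the-envelope: weight traffic per token and per dollar.
# Assumption: the ~180GB estimate above is roughly 4-bit storage,
# so 16-bit weights are ~4x larger, rounded up for the longer context.
expert_gb_fp16 = 180 * 4        # ~720 GB per 220B expert at 16 bits
expert_gb_padded = 750          # round up a bit for the larger context
print(f"one expert at fp16: ~{expert_gb_fp16} GB, call it {expert_gb_padded} GB with context")

for active_experts in (2, 4):
    tb_per_token = expert_gb_padded * active_experts / 1000   # GB -> TB
    print(f"{active_experts} experts: ~{tb_per_token:.1f} TB of weights touched per token")

# At $0.06 per 1k tokens and 4 active experts (3 TB/token):
pb_per_dollar = (3 * 1000) / 0.06 / 1000    # TB per 1k tokens -> PB per dollar
print(f"~{pb_per_dollar:.0f} petabytes of processed weights per dollar")
```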
Didn't llama.cpp need to convert the weights file to a new format to support that? The way they're stored in the official file isn't efficient for operating on directly.
(I am talking out of my butt here - these are new concepts to me, so forgive the ELI5 manner of Qs.)
Can you "peel" a layer and feed it off onto something that doesn't need to discard anything, but only receives the "curated" layer via the prompt that drove its creation - and then have other weights assigned?
Again - I am an infant on this line of questions, so please educate me (and the other me's out there).
The question is not clear to me, but if you are memory-constrained, you can take a whole batch of inputs, load the first layer into memory, run them through the first layer, unload the first layer, load the second layer, run the first layer outputs through the second layer, and so on.
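A minimal sketch of that idea (the per-layer weight files are hypothetical, and a real transformer also needs embeddings, attention/KV state, etc.):

```python
import numpy as np

# Hypothetical per-layer weight files, one matrix per layer.
LAYER_FILES = [f"layer_{i}.npy" for i in range(32)]

def run_batch_layer_by_layer(x: np.ndarray) -> np.ndarray:
    """Push a whole batch through the network while keeping only one
    layer's weights in memory at a time (toy MLP, no biases/attention)."""
    for path in LAYER_FILES:
        w = np.load(path)            # load this layer's weights from disk
        x = np.maximum(x @ w, 0.0)   # matmul + ReLU for the whole batch
        del w                        # drop the weights before loading the next layer
    return x
```

The trade-off is that you re-read every layer from disk for each batch, so you want the batch to be as large as memory allows to amortize the loading cost.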