
> but when you are serving plenty of different sessions, you will quickly run out of memory.

That's the difference from Stability AI, which releases its models for people to run themselves, enabling innovation on a larger scale.



GPT-3 cannot run on a hobbyist-level GPU yet. That's the difference (compared to Stable Diffusion, which could run on a 2070 even with a not-so-carefully-written PyTorch implementation), and the reason why I believe that, while ChatGPT is awesome and has made more people aware of what LLMs can do today, this is not a moment like what happened with diffusion models.


I feel bad for the guys who are on call right now. WTF! Why is the memory spiking beyond expectations?!


What makes you say this? Rerunning the whole sequence, which it appears they're doing, avoids the need to hold onto state, so memory is not used. In other words, they're not having this problem because they're not doing it that way.


> so memory is not used

Not used for more than the duration of inference, but definitely used during inference.


If you generate only a single timestep, you can compute layer by layer during inference when recomputing: you don't need to preserve the features of the previous layers, since each layer depends only on the layer immediately below. So your memory needs don't depend on the number of layers.
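
A minimal sketch of that first case, assuming `layers` is a list of callable PyTorch modules (a stand-in for the blocks of a real model, not any particular implementation): only the current hidden state has to be kept around, so peak activation memory stays flat regardless of depth.

    import torch
    from torch import nn

    def forward_single_step(layers: nn.ModuleList, x: torch.Tensor) -> torch.Tensor:
        """Run one timestep through a stack of layers, holding only one activation."""
        h = x
        with torch.no_grad():      # inference only: no activations kept for backprop
            for layer in layers:
                h = layer(h)       # the previous layer's output can be freed right away
        return h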

But in a standard transformer architecture, you typically generate multiple timesteps by sequentially feeding the output back in as the input to the next timestep, so you need to preserve all the features to avoid recomputing them at each timestep. So your memory again depends on the number of layers in your network.
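
A sketch of the second case, using a hypothetical per-layer cache interface (each `layer` here is assumed to take the new hidden state plus its cached past features and return the updated pair, roughly what a KV cache does): there is one growing cache per layer, so memory now scales with depth.

    import torch

    def generate(layers, x, n_steps):
        """Autoregressive generation that preserves past features per layer.

        Assumes each layer has the signature layer(h, cache) -> (h, cache),
        where `cache` holds that layer's features for all previous timesteps.
        """
        caches = [None] * len(layers)               # one growing cache per layer
        outputs = []
        for _ in range(n_steps):
            h = x
            for i, layer in enumerate(layers):
                h, caches[i] = layer(h, caches[i])  # reuse cached features, append new ones
            outputs.append(h)
            x = h                                   # feed the output back in as the next input
        return torch.stack(outputs)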

But if you are memory constrained, you can modify your architecture a little (and the training procedure) to put yourself back in the first situation where you only generate a single timestep: use the transformer to extract a fixed-size context vector per layer summarizing all of the past (including your most recent input prompt), and use another transformer to generate the words in sequence based on this context vector.
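
A sketch of that modified setup, with hypothetical `encode_past` and `decode_step` components (names invented here for illustration): the past is compressed once into fixed-size per-layer context vectors, and generation then conditions only on those, so memory no longer grows with the length of the past.

    import torch

    def generate_with_fixed_context(encode_past, decode_step, prompt, n_steps):
        """Compress the past into fixed-size context vectors, then decode from them.

        `encode_past` maps the full past (prompt) to one fixed-size context vector
        per layer; `decode_step` maps (last_token, contexts) to the next token.
        """
        contexts = encode_past(prompt)            # fixed-size summaries, independent of past length
        token = prompt[..., -1:]                  # start from the last prompt token
        generated = []
        for _ in range(n_steps):
            token = decode_step(token, contexts)  # no per-step cache growth
            generated.append(token)
        return torch.cat(generated, dim=-1)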



