For finetuning GPT-2 on custom text, my gpt-2-simple package (https://github.com/minimaxir/gpt-2-simple) comes close to going OOM when finetuning the 345M model, even on a server GPU with 16GB of VRAM. Doubling the model size with 774M might not work at all, so I'll need to test.
Of course, the default output from the model might be sufficient, although it’ll take twice as long to generate text compared to the 345M which is slow even on a GPU.
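For reference, the whole workflow in gpt-2-simple is only a few calls; a minimal sketch (the corpus filename, step count, and sampling parameters are placeholders, and as noted the 774M finetune may simply OOM):

    import gpt_2_simple as gpt2

    # Fetch the released 774M checkpoint (roughly 3 GB).
    gpt2.download_gpt2(model_name="774M")

    sess = gpt2.start_tf_sess()

    # Finetune on a plain-text file; "corpus.txt" and steps=1000 are placeholders.
    gpt2.finetune(sess,
                  dataset="corpus.txt",
                  model_name="774M",
                  steps=1000)

    # Sample from the finetuned model.
    gpt2.generate(sess, length=200, temperature=0.7)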
How exactly the large GPT-2 models are deployed is a mystery I really wish was open-sourced more.
>How exactly the large GPT-2 models are deployed is a mystery I really wish was open-sourced more.
TalkToTransformer.com uses preemptible P4 GPUs on Google Kubernetes Engine. Changing the number of workers and automatically restarting them when they're preempted is easy with Kubernetes.
To provide outputs incrementally rather than waiting for the entire sequence to be generated, I open a websocket to a worker and have it generate a few tokens at a time, sending the output back as it goes. GPT-2 tokens can end partway through a multi-byte character, so to make this work you need to send the raw UTF-8 bytes to the browser and then have it concatenate them _before_ decoding the string.
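The byte handling is the fiddly part. A minimal sketch of the decoding side, written in Python rather than browser JavaScript just to show the buffering (an incremental UTF-8 decoder holds back a trailing partial character until the next chunk arrives):

    import codecs

    # Accumulate raw bytes from the websocket and decode incrementally;
    # a GPT-2 token can end in the middle of a multi-byte character, so
    # decoding each chunk independently would fail or mangle the text.
    decoder = codecs.getincrementaldecoder("utf-8")()

    def on_chunk(chunk: bytes) -> str:
        # Returns only the characters that are complete so far; a trailing
        # partial multi-byte sequence is buffered until more bytes arrive.
        return decoder.decode(chunk)

    # Example: "é" (0xC3 0xA9) split across two chunks.
    assert on_chunk(b"caf\xc3") == "caf"
    assert on_chunk(b"\xa9 au lait") == "é au lait"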
While my workers can batch requests from multiple users, the modest increase in performance is probably not worth the complexity in most cases.
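If anyone does want to try it, the shape of the batching loop is roughly this (a hypothetical sketch; generate_batch stands in for a real padded GPT-2 forward pass):

    import asyncio

    queue: asyncio.Queue = asyncio.Queue()

    def generate_batch(prompts):
        # Placeholder for a batched model call over padded prompts.
        return [p + " ..." for p in prompts]

    async def worker(max_batch=8, window=0.05):
        # Collect prompts arriving within a short window, run them as one
        # batch, then fan the outputs back to the waiting requests.
        while True:
            item = await queue.get()
            batch = [item]
            deadline = asyncio.get_running_loop().time() + window
            while len(batch) < max_batch:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            outputs = generate_batch([prompt for prompt, _ in batch])
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)

    async def handle_request(prompt):
        fut = asyncio.get_running_loop().create_future()
        await queue.put((prompt, fut))
        return await fut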
I've already tried training with nshepperd's codebase. Sampling works, but even with memory checkpointing, freezing the embeddings, and using SGD rather than Adam, it OOMs on a 1080 Ti's 11GB. Either additional tricks or CPU training will be required.
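For concreteness, the embedding-freezing and SGD parts look roughly like this in the TF 1.x API (a toy graph stands in for the real model; the gradient-checkpointing piece is handled separately by a memory-saving-gradients module and isn't shown):

    import tensorflow as tf  # TF 1.x, as used by the GPT-2 codebases

    # Toy stand-in for the GPT-2 graph, just to show the two tricks:
    # freeze the big embedding matrix, and use plain SGD, which drops the
    # two per-parameter moment tensors Adam keeps around.
    tokens = tf.placeholder(tf.int32, [None, None])
    with tf.variable_scope("model"):
        wte = tf.get_variable("wte", [50257, 768])        # token embeddings
        h = tf.nn.embedding_lookup(wte, tokens)
        logits = tf.layers.dense(h, 50257, name="head")   # stand-in for the transformer
    loss = tf.reduce_mean(logits)                         # placeholder loss

    # Exclude the embedding from the optimized variables and use SGD.
    train_vars = [v for v in tf.trainable_variables() if "wte" not in v.name]
    train_op = tf.train.GradientDescentOptimizer(1e-3).minimize(loss, var_list=train_vars)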
I'm the lead researcher on the Middlebury Institute project looking at fine-tuning the bigger models, and I got OOM on 774M and 1.5B originally. I had to get an Azure instance with 24GB of VRAM to handle it (using nshepperd's codebase). It works, but takes a while (~500 epochs takes 12 hours on a 100k-word training dataset).
No. We weren't sure if that would be a good idea since the model wasn't trained in low precision, and 345M thankfully didn't require going that far. 774M might, though. (Another option is model parallelism, since I have 2 GPUs and that might be enough, perhaps combined with freezing more layers and training incrementally, or reducing the 1024-token window to something smaller like 700.)
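A rough sketch of the model-parallel option, with hypothetical layer placement and toy dense layers standing in for the transformer blocks (774M is 36 layers at width 1280; the shortened 700-token window is used for the placeholder shape):

    import tensorflow as tf  # TF 1.x graph-mode sketch

    def block(x, name):
        # Stand-in for a transformer block.
        return tf.layers.dense(x, 1280, name=name)

    tokens = tf.placeholder(tf.int32, [None, 700])  # shortened context window

    # Put the embedding and the first half of the blocks on GPU 0, the
    # second half plus the tied output projection on GPU 1; activations
    # cross the PCIe boundary once per forward/backward pass.
    with tf.device("/gpu:0"):
        wte = tf.get_variable("wte", [50257, 1280])
        h = tf.nn.embedding_lookup(wte, tokens)
        for i in range(18):
            h = block(h, "h%d" % i)

    with tf.device("/gpu:1"):
        for i in range(18, 36):
            h = block(h, "h%d" % i)
        logits = tf.tensordot(h, wte, axes=[[2], [1]])  # tied output embedding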
>What would be the largest model one could train across 2x 2080Ti?
>~800M GPT-2. This is largely due to the memory required to house parameters + optimizer states. If one uses a smaller optimizer than Adam, training something larger should be possible. Make sure to turn on activation checkpointing with --checkpoint-activations
They haven't released such models, though, and I don't know if it would be drop-in compatible with the OA GPT-2-774M checkpoint (they're training their own GPT-2s using their own webtext corpus).
Possibly a stupid question, but does AMD lift such restrictions on models with its unified memory, by allowing the GPU to "page out" chunks of VRAM to system RAM?
My guess is it would be much slower, because the GPU would spend most of its time waiting for data. Compare bandwidths: system RAM to GPU memory over PCIe is ~16 GB/s, versus ~900 GB/s from GPU memory to the GPU processor.
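A quick back-of-the-envelope check on that gap, just counting the time to stream the 774M model's fp32 weights once (rough numbers for PCIe 3.0 x16 and V100-class HBM2):

    # ~774M parameters at 4 bytes each is roughly 3.1 GB of weights.
    weight_bytes = 774e6 * 4

    pcie_bw = 16e9    # ~16 GB/s, system RAM -> GPU over PCIe
    hbm_bw = 900e9    # ~900 GB/s, GPU memory -> GPU processor

    print(weight_bytes / pcie_bw)  # ~0.19 s per pass over the weights via PCIe
    print(weight_bytes / hbm_bw)   # ~0.003 s from on-device memory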