
For fine-tuning GPT-2 on custom text, my gpt-2-simple package (https://github.com/minimaxir/gpt-2-simple) gets close to going OOM when fine-tuning the 345M model, even on a server GPU with 16GB of VRAM. Roughly doubling the model size with 774M might cause it not to work at all, so I'll need to test.
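
For reference, this is roughly how a gpt-2-simple fine-tuning run is invoked with 345M; whether the same call fits in memory with 774M is exactly what needs testing (the dataset file name is just a placeholder):

    import gpt_2_simple as gpt2

    gpt2.download_gpt2(model_name="345M")   # "774M" may OOM even on a 16GB GPU
    sess = gpt2.start_tf_sess()
    gpt2.finetune(sess,
                  dataset="corpus.txt",     # placeholder path to the custom text
                  model_name="345M",
                  steps=1000)
    gpt2.generate(sess)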

Of course, the default output from the model might be sufficient, although it'll take twice as long to generate text as the 345M model, which is already slow even on a GPU.

How exactly the large GPT-2 models are deployed is a mystery; I really wish that part were open-sourced more.



>How exactly the large GPT-2 models are deployed is a mystery; I really wish that part were open-sourced more.

TalkToTransformer.com uses preemptible P4 GPUs on Google Kubernetes Engine. Changing the number of workers and automatically restarting them when they're preempted is easy with Kubernetes.

To provide outputs incrementally rather than waiting for the entire sequence to be generated, I open a websocket to a worker and have it generate a few tokens at a time, sending the output back as it goes. GPT-2 tokens can end partway through a multi-byte character, so to make this work you need to send the raw UTF-8 bytes to the browser and have it concatenate them _before_ decoding the string.
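
A tiny sketch of the byte-concatenation point (not the site's actual code): decoding each chunk on its own can fail, because a token boundary can split a multi-byte character.

    text = "Héllo"
    raw = text.encode("utf-8")
    chunk_a, chunk_b = raw[:2], raw[2:]  # hypothetical token boundary mid-'é'

    # chunk_a.decode("utf-8")            # would raise UnicodeDecodeError
    print((chunk_a + chunk_b).decode("utf-8"))  # fine once the bytes are rejoined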

While my workers can batch requests from multiple users, the modest increase in performance is probably not worth the complexity in most cases.
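
If anyone does want to try it, a minimal micro-batching loop looks something like this (a sketch, not my actual worker code; generate_batch stands in for whatever runs the model on a padded batch of prompts):

    import asyncio

    MAX_BATCH = 8
    WINDOW_S = 0.05  # wait up to 50 ms to accumulate requests

    async def batcher(queue, generate_batch):
        while True:
            prompt, fut = await queue.get()
            batch = [(prompt, fut)]
            deadline = asyncio.get_event_loop().time() + WINDOW_S
            while len(batch) < MAX_BATCH:
                timeout = deadline - asyncio.get_event_loop().time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            outputs = generate_batch([p for p, _ in batch])  # one forward pass
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)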


Any thoughts on the larger model? It doesn't seem materially better than the last one. Maybe the fine-tuning exercises will show the benefit?


I've already tried training with nshepperd's codebase. Sampling works, but even with memory checkpointing, freezing the embedding, and using SGD rather than Adam, it OOMs on a 1080 Ti's 11GB. Either additional tricks or CPU training will be required.
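
For concreteness, the freezing/SGD part boils down to something like this in TF1-style code (a sketch with stand-in variables, not nshepperd's actual script; the real loss comes from the GPT-2 graph):

    import tensorflow as tf  # TF1-style graph code, as in nshepperd's fork

    # Stand-in variables named like the GPT-2 reference graph; in the real
    # script these come from building the model and loading the checkpoint.
    with tf.variable_scope("model"):
        wte = tf.get_variable("wte", shape=[50257, 1024])
        h0_w = tf.get_variable("h0_attn_w", shape=[1024, 1024])
    loss = tf.reduce_mean(tf.square(h0_w))  # placeholder for the LM loss

    # Freeze the big embedding matrices so no gradients or optimizer state
    # are kept for them.
    train_vars = [v for v in tf.trainable_variables()
                  if "/wte" not in v.name and "/wpe" not in v.name]

    # Plain SGD stores no per-parameter moments; Adam keeps two extra fp32
    # tensors per weight, which is a large share of memory at this scale.
    opt = tf.train.GradientDescentOptimizer(learning_rate=1e-4)
    train_op = opt.minimize(loss, var_list=train_vars)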


I'm the lead researcher on the Middlebury Institute project looking at fine-tuning the bigger models, and I originally got OOM errors on 774M and 1.5B. I had to get an Azure instance with 24GB of VRAM to handle it (using nshepperd's codebase). It works, but takes a while (~500 epochs take about 12 hours on a 100k-word training dataset).


Ouch! So 11GB is nowhere close to being enough, then. I wonder if even switching to FP16 will be adequate?


It might be possible to get 774M down to a single GPU. I'm definitely not using all 24GB, so fp16 might be able to get it down enough.


How would you use fp16 to get it to work on a single GPU? And if you did, what GPU should you use?


Figured. I'll make changes to allow sampling from the default model more easily.


Are you using FP16?


No. We weren't sure if that would be a good idea, since the model wasn't trained in low precision, and 345M thankfully didn't require going that far. 774M might, though. (Another option is model parallelism, since I have 2 GPUs and that might be enough, perhaps combined with freezing more layers and training incrementally, or reducing the 1024-token window to something smaller like 700.)


On the NVIDIA GPT-2 implementation:

>What would be the largest model one could train across 2x 2080Ti?

>~800M GPT-2. This is largely due to the memory required to house parameters + optimizer states. If one uses a smaller optimizer than Adam, training something larger should be possible. Make sure to turn on activation checkpointing with --checkpoint-activations

https://twitter.com/TheRealRPuri/status/1161322580126470145


They haven't released such models, though, and I don't know if it would be drop-in compatible with the OA GPT-2-774M checkpoint (they're training their own GPT-2s using their own webtext corpus).


I haven't looked into it at all myself, but he also said:

>We do provide training code that should work out of the box for gpt2 117M/345M

https://twitter.com/TheRealRPuri/status/1161319745259393024


It would take forever (or $$$) to train even the 117M model from scratch.


I read that as meaning you can start from the actual pre-trained GPT-2 models, but I never got an answer when I specifically asked whether that was the case.


Maybe you should try TensorFlow automatic mixed precision! https://github.com/zihangdai/xlnet/pull/200
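
In TF 1.14+ that's roughly a one-line change around the optimizer (a sketch with a placeholder loss, not the actual fine-tuning script); NVIDIA's TF builds also expose it via the TF_ENABLE_AUTO_MIXED_PRECISION=1 environment variable:

    import tensorflow as tf  # 1.14+, where the automatic mixed precision rewrite landed

    w = tf.get_variable("w", shape=[1024, 1024])
    loss = tf.reduce_mean(tf.square(w))  # placeholder for the fine-tuning loss

    opt = tf.train.AdamOptimizer(learning_rate=2e-5)
    # Rewrites the graph so eligible ops run in fp16, with dynamic loss scaling.
    opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
    train_op = opt.minimize(loss)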


fp16 saves a lot of memory and is worth doing. I've not had trouble fine-tuning all of these models with fp16.


Have you fine-tuned 774M successfully using a single GPU?


I recommend NVIDIA Apex; it offers several ways to do mixed precision.
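
The basic amp usage looks like this (a minimal sketch with a dummy module standing in for GPT-2; the amp calls themselves are Apex's documented API):

    import torch
    from apex import amp  # assumes NVIDIA Apex is installed

    model = torch.nn.Linear(768, 768).cuda()  # stand-in for a real GPT-2 model
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
    # "O1" patches common ops to fp16 and keeps dynamic loss scaling;
    # "O2"/"O3" are more aggressive mixed/pure-fp16 modes.
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

    x = torch.randn(4, 768).cuda()
    loss = model(x).pow(2).mean()
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()  # scaled backward pass avoids fp16 underflow
    optimizer.step()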


Possibly a stupid question, but does AMD lift such restrictions on models with its unified memory, by allowing the GPU to "page out" chunks of VRAM to system RAM?


My guess is it would be much slower, because the GPU would spend its time waiting for data. Compare the bandwidths: system RAM to GPU memory over PCIe is about 16 GB/s, versus about 900 GB/s from GPU memory to the GPU itself, roughly a 56x gap.


No idea how modeling works on AMD (most discussions are about NVIDIA/CUDA).


Isn't that a macOS-specific feature?



