For finetuning GPT-2 on custom text, my gpt-2-simple package (https://github.com/minimaxir/gpt-2-simple) comes close to going OOM when finetuning the 345M model, even on a server GPU with 16GB of VRAM. Doubling the model size with 774M might not work at all, so I'll need to test.
Of course, the default output from the model might be sufficient, although it’ll take twice as long to generate text compared to the 345M which is slow even on a GPU.
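For reference, the whole workflow in gpt-2-simple is only a few calls; a minimal sketch (the corpus filename, step count, and sampling parameters are placeholders, and as noted the 774M finetune may simply OOM):

    import gpt_2_simple as gpt2

    # Fetch the released 774M checkpoint (roughly 3 GB).
    gpt2.download_gpt2(model_name="774M")

    sess = gpt2.start_tf_sess()

    # Finetune on a plain-text file; "corpus.txt" and steps=1000 are placeholders.
    gpt2.finetune(sess,
                  dataset="corpus.txt",
                  model_name="774M",
                  steps=1000)

    # Sample from the finetuned model.
    gpt2.generate(sess, length=200, temperature=0.7)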
How exactly the large GPT-2 models are deployed is a mystery I really wish was open-sourced more.
>How exactly the large GPT-2 models are deployed is a mystery I really wish was open-sourced more.
TalkToTransformer.com uses preemptible P4 GPUs on Google Kubernetes Engine. Changing the number of workers and automatically restarting them when they're preempted is easy with Kubernetes.
To provide outputs incrementally rather than waiting for the entire sequence to be generated, I open a websocket to a worker and have it generate a few tokens at a time, sending the output back as it goes. GPT-2 tokens can end partway through a multi-byte character, so to make this work you need to send the raw UTF-8 bytes to the browser and then have it concatenate them _before_ decoding the string.
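The byte handling is the fiddly part. A minimal sketch of the decoding side, written in Python rather than browser JavaScript just to show the buffering (an incremental UTF-8 decoder holds back a trailing partial character until the next chunk arrives):

    import codecs

    # Accumulate raw bytes from the websocket and decode incrementally;
    # a GPT-2 token can end in the middle of a multi-byte character, so
    # decoding each chunk independently would fail or mangle the text.
    decoder = codecs.getincrementaldecoder("utf-8")()

    def on_chunk(chunk: bytes) -> str:
        # Returns only the characters that are complete so far; a trailing
        # partial multi-byte sequence is buffered until more bytes arrive.
        return decoder.decode(chunk)

    # Example: "é" (0xC3 0xA9) split across two chunks.
    assert on_chunk(b"caf\xc3") == "caf"
    assert on_chunk(b"\xa9 au lait") == "é au lait"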
While my workers can batch requests from multiple users, the modest increase in performance is probably not worth the complexity in most cases.
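If anyone does want to try it, the shape of the batching loop is roughly this (a hypothetical sketch; generate_batch stands in for a real padded GPT-2 forward pass):

    import asyncio

    queue: asyncio.Queue = asyncio.Queue()

    def generate_batch(prompts):
        # Placeholder for a batched model call over padded prompts.
        return [p + " ..." for p in prompts]

    async def worker(max_batch=8, window=0.05):
        # Collect prompts arriving within a short window, run them as one
        # batch, then fan the outputs back to the waiting requests.
        while True:
            item = await queue.get()
            batch = [item]
            deadline = asyncio.get_running_loop().time() + window
            while len(batch) < max_batch:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            outputs = generate_batch([prompt for prompt, _ in batch])
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)

    async def handle_request(prompt):
        fut = asyncio.get_running_loop().create_future()
        await queue.put((prompt, fut))
        return await fut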
I've already tried training with nshepperd's codebase. Sampling works, but even with memory checkpointing, freezing the embeddings, and using SGD rather than Adam, it OOMs on a 1080 Ti's 11GB. Either additional tricks or CPU training will be required.
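For concreteness, the embedding-freezing and SGD parts look roughly like this in the TF 1.x API (a toy graph stands in for the real model; the gradient-checkpointing piece is handled separately by a memory-saving-gradients module and isn't shown):

    import tensorflow as tf  # TF 1.x, as used by the GPT-2 codebases

    # Toy stand-in for the GPT-2 graph, just to show the two tricks:
    # freeze the big embedding matrix, and use plain SGD, which drops the
    # two per-parameter moment tensors Adam keeps around.
    tokens = tf.placeholder(tf.int32, [None, None])
    with tf.variable_scope("model"):
        wte = tf.get_variable("wte", [50257, 768])        # token embeddings
        h = tf.nn.embedding_lookup(wte, tokens)
        logits = tf.layers.dense(h, 50257, name="head")   # stand-in for the transformer
    loss = tf.reduce_mean(logits)                         # placeholder loss

    # Exclude the embedding from the optimized variables and use SGD.
    train_vars = [v for v in tf.trainable_variables() if "wte" not in v.name]
    train_op = tf.train.GradientDescentOptimizer(1e-3).minimize(loss, var_list=train_vars)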
I'm the lead researcher on the Middlebury Institute project looking at fine-tuning the bigger models, and I got OOM on 774M and 1.5B originally. I had to get an Azure instance with 24GB of VRAM to handle it (using nshepperd's codebase). It works, but takes a while (~500 epochs takes 12 hours on a 100k-word training dataset).
No. We weren't sure if that would be a good idea since the model wasn't trained in low precision, and 345M thankfully didn't require going that far. 774M might, though. (Another option is model parallelism, since I have 2 GPUs and that might be enough, perhaps combined with freezing more layers and training incrementally, or reducing the 1024-token window to something smaller like 700.)
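A rough sketch of the model-parallel option, with hypothetical layer placement and toy dense layers standing in for the transformer blocks (774M is 36 layers at width 1280; the shortened 700-token window is used for the placeholder shape):

    import tensorflow as tf  # TF 1.x graph-mode sketch

    def block(x, name):
        # Stand-in for a transformer block.
        return tf.layers.dense(x, 1280, name=name)

    tokens = tf.placeholder(tf.int32, [None, 700])  # shortened context window

    # Put the embedding and the first half of the blocks on GPU 0, the
    # second half plus the tied output projection on GPU 1; activations
    # cross the PCIe boundary once per forward/backward pass.
    with tf.device("/gpu:0"):
        wte = tf.get_variable("wte", [50257, 1280])
        h = tf.nn.embedding_lookup(wte, tokens)
        for i in range(18):
            h = block(h, "h%d" % i)

    with tf.device("/gpu:1"):
        for i in range(18, 36):
            h = block(h, "h%d" % i)
        logits = tf.tensordot(h, wte, axes=[[2], [1]])  # tied output embedding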
>What would be the largest model one could train across 2x 2080Ti?
>~800M GPT-2. This is largely due to the memory required to house parameters + optimizer states. If one uses a smaller optimizer than Adam, training something larger should be possible. Make sure to turn on activation checkpointing with --checkpoint-activations
They haven't released such models, though, and I don't know if it would be drop-in compatible with the OA GPT-2-774M checkpoint (they're training their own GPT-2s using their own webtext corpus).
Possibly a stupid question, but does AMD lift such restrictions on models with its unified memory, by allowing the GPU to "page out" chunks of VRAM to system RAM?
My guess is it would be much slower, because the GPU would spend most of its time waiting for data. Compare bandwidths: system RAM to GPU memory over PCIe is ~16 GB/s, versus ~900 GB/s from GPU memory to the GPU processor.
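A quick back-of-the-envelope check on that gap, just counting the time to stream the 774M model's fp32 weights once (rough numbers for PCIe 3.0 x16 and V100-class HBM2):

    # ~774M parameters at 4 bytes each is roughly 3.1 GB of weights.
    weight_bytes = 774e6 * 4

    pcie_bw = 16e9    # ~16 GB/s, system RAM -> GPU over PCIe
    hbm_bw = 900e9    # ~900 GB/s, GPU memory -> GPU processor

    print(weight_bytes / pcie_bw)  # ~0.19 s per pass over the weights via PCIe
    print(weight_bytes / hbm_bw)   # ~0.003 s from on-device memory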