
On the NVIDIA GPT-2 implementation:

>What would be the largest model one could train across 2x 2080Ti?

>~800M gpt2. this is largely due to the memory required to house parameters + optimizer states. If one uses a smaller optimizer than Adam, training something larger should be possible. Make sure to turn on activation checkpointing with --checkpoint-activations

https://twitter.com/TheRealRPuri/status/1161322580126470145
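The ~800M figure is plausible from a back-of-the-envelope memory count. The tweet gives no exact breakdown, so the sketch below assumes the standard mixed-precision Adam recipe (fp16 params and grads, plus fp32 master weights, momentum, and variance), which works out to 16 bytes per parameter:

```python
# Rough memory estimate for why ~800M parameters is about the ceiling on
# 2x 2080 Ti (2 x 11 GB). Byte counts assume the usual mixed-precision
# Adam layout; activations (reduced by checkpointing) come on top of this.

def adam_state_bytes(n_params: int) -> int:
    fp16_params   = 2 * n_params
    fp16_grads    = 2 * n_params
    fp32_master   = 4 * n_params  # fp32 copy of the weights
    fp32_momentum = 4 * n_params  # Adam first moment
    fp32_variance = 4 * n_params  # Adam second moment
    return fp16_params + fp16_grads + fp32_master + fp32_momentum + fp32_variance

n = 800_000_000
gib = adam_state_bytes(n) / 1024**3
print(f"{gib:.1f} GiB for params + optimizer state")  # ~11.9 GiB
```

At ~12 GiB before activations, an 800M model already uses most of the 22 GB across the two cards, which is why dropping Adam for a lighter optimizer (2 of the 16 bytes per parameter are its moments alone, 8 counting both moments) frees room for a bigger model.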



They haven't released such models, though, and I don't know if it would be drop-in compatible with the OA GPT-2-774M checkpoint (they're training their own GPT-2s using their own webtext corpus).
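For reference, the activation checkpointing the tweet recommends (Megatron's --checkpoint-activations flag) is the trade of compute for memory: don't store intermediate activations for backprop, recompute them during the backward pass. A minimal plain-PyTorch sketch, using torch.utils.checkpoint rather than Megatron's own code, with a toy block I made up for illustration:

```python
# Sketch of activation checkpointing in plain PyTorch (the same idea as
# Megatron's --checkpoint-activations, not its implementation). The block
# below is a hypothetical example layer, not part of either codebase.
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(256, 1024),
    torch.nn.GELU(),
    torch.nn.Linear(1024, 256),
)

x = torch.randn(8, 256, requires_grad=True)
# Activations inside `block` are discarded after the forward pass and
# recomputed during backward, cutting peak memory at the cost of compute.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
print(x.grad.shape)  # torch.Size([8, 256])
```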


I haven't looked into it at all myself, but he also said:

>We do provide training code that should work out of the box for gpt2 117M/345M

https://twitter.com/TheRealRPuri/status/1161319745259393024


It would take forever (or $$$) to train even the 117M model from scratch.


I read that as meaning you can start with the actual pre-trained GPT-2 models, but I never got an answer when I specifically asked whether that was the case.



