>What would be the largest model one could train across 2x 2080Ti?
>~800M GPT-2. This is largely due to the memory required to house parameters + optimizer states. If one uses a smaller optimizer than Adam, training something larger should be possible. Make sure to turn on activation checkpointing with `--checkpoint-activations`.
They haven't released such models, though, and I don't know whether their models would be drop-in compatible with the OA GPT-2-774M checkpoint (they're training their own GPT-2s on their own WebText-style corpus).
https://twitter.com/TheRealRPuri/status/1161322580126470145
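The ~800M figure falls out of a simple byte count. Here is a rough sketch of the arithmetic; the accounting below is my assumption (Megatron-style mixed-precision Adam at ~16 bytes of persistent state per parameter, split evenly across the two cards by model parallelism), not Puri's exact numbers:

```python
# Back-of-envelope memory estimate for the "~800M params on 2x 2080 Ti" claim.
# Assumptions (mine, not from the tweet): mixed-precision training with Adam,
# parameters split evenly across the two GPUs, and activation memory mostly
# reclaimed by --checkpoint-activations.

def training_state_bytes_per_param():
    """Bytes of persistent state per parameter under mixed precision + Adam."""
    fp16_params   = 2  # working fp16 copy of the weights
    fp16_grads    = 2  # fp16 gradients
    fp32_master   = 4  # fp32 master weights kept by the optimizer
    adam_momentum = 4  # fp32 first-moment estimate
    adam_variance = 4  # fp32 second-moment estimate
    return fp16_params + fp16_grads + fp32_master + adam_momentum + adam_variance  # = 16

n_params    = 800e6  # ~800M-parameter GPT-2
n_gpus      = 2      # 2x RTX 2080 Ti
gpu_mem_gib = 11     # 11 GiB per 2080 Ti

total_state_gib = n_params * training_state_bytes_per_param() / 2**30
per_gpu_gib     = total_state_gib / n_gpus  # model parallelism splits the state

print(f"total parameter + optimizer state: {total_state_gib:.1f} GiB")
print(f"per GPU (split across {n_gpus}):   {per_gpu_gib:.1f} GiB of {gpu_mem_gib} GiB")
# ~11.9 GiB total, ~6.0 GiB per GPU, leaving the remainder for activations --
# which is why activation checkpointing matters, and why a leaner optimizer
# than Adam (dropping the two moment buffers) would let a larger model fit.
```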