I think there is no doubt that there must be more efficient model architectures out there. Take, for example, the sample efficiency of GPT-3:
> If you think about what a human, a human probably in a human’s lifetime, 70 years, processes probably about a half a billion words, maybe a billion, let’s say a billion. So when you think about it, GPT-3 has been trained on 57 billion times the number of words that a human in his or her lifetime will ever perceive.[0]
I cannot wrap my head around that. I listened to the audio to check it wasn't a transcription error and don't think it is.
He is claiming GPT-3 was trained on 57 billion billion words. The training dataset is something like 500B tokens, and not all of it is used (Common Crawl is processed less than once), and I'm damn near certain it wasn't trained for a hundred million epochs. Their original paper says the largest model was trained on 300B tokens. [0]
Assuming a token is a word (we're going for orders of magnitude here), you're actually looking at a few hundred times more text. The point kind of stands: it's more, but not billions of times more.
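Spelled out as a quick back-of-the-envelope check (treating one token as one word, as above, and using the speaker's own 1 billion lifetime words):

    # Rough sanity check: GPT-3's stated training budget vs. a human lifetime of words.
    # Both numbers are approximate; "tokens == words" is a simplifying assumption.
    gpt3_training_tokens = 300e9   # training budget from the GPT-3 paper
    human_lifetime_words = 1e9     # the speaker's own generous estimate
    print(gpt3_training_tokens / human_lifetime_words)   # -> 300.0, a few hundred times, not 57 billion times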
I wouldn't be surprised if I'm wrong here, because they seem to be an expert, but this didn't pass the sniff test, and looking into it doesn't support what they're saying.
It's at about 7 minutes into the video, and they really do say it several times in a few ways. He starts by saying it's trained on 570 billion megabytes, which is probably where the confusion starts. Looking again at the paper, Common Crawl after filtering is 570GB, i.e. 570 billion bytes. So he makes two main mistakes: first multiplying by another million (megabytes instead of bytes), then assuming one byte is equivalent to one word. Then it's off by a bit more, because less than half of that data is actually used. That probably overstates it by a factor of about ten million or more.
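Putting rough numbers on those errors (the ~5 bytes per English word figure is my own assumption, the rest follows from the above):

    # How far the claim drifts, using the error sources identified above.
    megabytes_vs_bytes = 1e6   # "570 billion megabytes" vs. 570 billion bytes (570 GB)
    bytes_per_word     = 5     # assumed: an English word is ~5 bytes, not 1 byte
    unused_factor      = 2     # roughly, since less than half of filtered Common Crawl is used
    print(megabytes_vs_bytes * bytes_per_word * unused_factor)   # -> 10000000.0, about ten million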
300B is then the "training budget" in a sense: not every dataset is used in its entirety, and some are processed more than once, but each of the GPT-3 sizes was trained on 300B tokens.
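For example, if I'm remembering the paper's dataset table correctly, filtered Common Crawl is around 410B tokens sampled at roughly 60% weight, which is where "processed less than once" comes from. A quick sketch (numbers approximate, double-check against the paper):

    # Common Crawl's share of the 300B-token budget, as I recall the paper's dataset table.
    budget_tokens       = 300e9   # total training tokens for every model size
    common_crawl_tokens = 410e9   # filtered Common Crawl, approx.
    cc_sampling_weight  = 0.60    # approx. fraction of the budget drawn from Common Crawl
    epochs = cc_sampling_weight * budget_tokens / common_crawl_tokens
    print(round(epochs, 2))       # -> 0.44, i.e. less than one full pass over Common Crawl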
What do you mean? That we are not trained ONLY on words? Or that the training we receive on words is not the same as the training of a NN? Or something else?
0. https://hai.stanford.edu/news/gpt-3-intelligent-directors-co...