We're training on text because producing text is what we're making the model do.
It's a fact of neural networks that to train them in a supervised way you need the training data in the expected input format (a vector of n thousand preceding tokens for LLMs) paired with the expected output (the next token for LLMs). "Training them on video" would mean converting the video into a format we can train the LLM with, then training the LLM on that data.
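For concreteness, here is a minimal sketch of that (input, output) pairing for next-token prediction. No particular library is assumed, and the function name, context length, and token ids are hypothetical:

```python
# Minimal sketch: turn one long token sequence into supervised
# (preceding tokens, next token) pairs, as described above.

def make_training_pairs(token_ids, context_len=4):
    """Yield (expected input, expected output) pairs from one sequence."""
    pairs = []
    for i in range(context_len, len(token_ids)):
        context = token_ids[i - context_len:i]  # the "expected input"
        target = token_ids[i]                   # the "expected output"
        pairs.append((context, target))
    return pairs

if __name__ == "__main__":
    # Tiny made-up example: a tokenized sentence becomes a few supervised pairs.
    tokens = [101, 7592, 1010, 2088, 999, 102]
    for context, target in make_training_pairs(tokens, context_len=3):
        print(context, "->", target)
```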
This would probably be at most a 1 OOM increase, if the video transcripts aren't already part of GPT's training data.
> This would probably be at most a 1 OOM increase, if the video transcripts aren't already part of GPT's training data.
Human brains aren't trained on video transcripts, which leave out a lot of information from the video that human brains share an understanding of because they were trained on it. You would train on video embeddings and predict the next embeddings, thus learning physics and other properties of the world, the objects within it, and how they interact. That is well over 5 orders of magnitude more data than the text that today's LLMs are trained on.
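A rough sketch of what "predict next embeddings" could look like in training code, assuming PyTorch and some external frame encoder that has already turned frames into vectors (the predictor architecture, dimensions, and shapes here are all made up for illustration):

```python
# Hedged sketch: predict the embedding of the next video frame from the
# embeddings of the preceding frames, so the loss lives in embedding space
# rather than text space.

import torch
import torch.nn as nn

EMB_DIM = 256        # dimensionality of each frame embedding (assumed)
CONTEXT_FRAMES = 16  # how many past frames the predictor sees (assumed)

predictor = nn.Sequential(
    nn.Flatten(),                               # (B, 16, 256) -> (B, 4096)
    nn.Linear(CONTEXT_FRAMES * EMB_DIM, 1024),
    nn.ReLU(),
    nn.Linear(1024, EMB_DIM),                   # predicted next-frame embedding
)
loss_fn = nn.MSELoss()
opt = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def training_step(frame_embeddings):
    """frame_embeddings: (batch, CONTEXT_FRAMES + 1, EMB_DIM) produced by
    some video frame encoder; the last frame is the prediction target."""
    context = frame_embeddings[:, :-1, :]
    target = frame_embeddings[:, -1, :]
    pred = predictor(context)
    loss = loss_fn(pred, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Stand-in for embeddings a real encoder would produce from raw video.
dummy_batch = torch.randn(8, CONTEXT_FRAMES + 1, EMB_DIM)
print(training_step(dummy_batch))
```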