
> This would probably be a 1 OOM increase at maximum, if the video transcripts aren't already a part of the training data for gpt.

Human brains aren't trained on video transcripts; transcripts leave out a great deal of information in the video that humans only share an understanding of because of their own training. Instead, you would train on video embeddings and predict the next embedding, thereby learning physics and the other properties of the world, the objects in it, and how they interact. That is many more than 5 orders of magnitude more data than the text today's LLMs are trained on.
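To make the idea concrete, here is a minimal sketch of next-embedding prediction. Everything in it is an assumption for illustration: the embedding dimension, the context window, and the synthetic "video" (a smooth trajectory in embedding space standing in for the output of a pretrained video encoder). The predictor is a single linear map trained with MSE, analogous in spirit (not scale) to next-token prediction over text.

```python
# Hypothetical next-embedding prediction: instead of predicting the next
# token from a text vocabulary, regress the next frame embedding from a
# window of past frame embeddings.
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 16   # dimensionality of each frame embedding (assumed)
WINDOW = 4     # how many past frames the predictor sees (assumed)
STEPS = 2000
LR = 1e-2

# Synthetic "video": smooth sinusoidal trajectories in embedding space,
# a stand-in for embeddings from a real video encoder.
t = np.linspace(0, 20, 500)
frames = np.stack([np.sin(t + p) for p in np.linspace(0, 3, EMB_DIM)], axis=1)

# Linear predictor: flatten the context window into one input vector.
W = rng.normal(scale=0.01, size=(WINDOW * EMB_DIM, EMB_DIM))

def batch(n=32):
    """Sample (window of past embeddings, next embedding) pairs."""
    idx = rng.integers(0, len(frames) - WINDOW - 1, size=n)
    x = np.stack([frames[i:i + WINDOW].ravel() for i in idx])
    y = frames[idx + WINDOW]
    return x, y

for _ in range(STEPS):
    x, y = batch()
    pred = x @ W
    grad = x.T @ (pred - y) / len(x)  # gradient of mean squared error
    W -= LR * grad

x, y = batch(256)
mse = float(np.mean((x @ W - y) ** 2))
print(f"next-embedding MSE: {mse:.4f}")
```

A real system would replace the linear map with a large sequence model and the sinusoids with encoder outputs over enormous video corpora, which is where the orders-of-magnitude data argument comes in.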


