It’s not just a safe bet but almost guaranteed. Humans combine their internal language models with physical intuition and experimentation from the moment they are born. There is zero chance that an AI can understand the physical world without access to it [1]. Until it has that access, it’s no more than a glorified, context-specific Markov chain generator.
Fact is, without a feedback loop that lets them run physical experiments, as infants do from the moment they're born, I highly doubt AI systems will develop a useful intuition from video alone. Hence the conjecture.
[1] Henceforth called Kiselev’s conjecture, a corollary of Moravec’s paradox: https://en.m.wikipedia.org/wiki/Moravec's_paradox