and here's Jeff Geerling 15 months ago showing how to use Whisper to make dramatically better captions: https://www.youtube.com/watch?v=S1M9NOtusM8
I assume Google has finally put some of their multimodal LLM work to good use. Before that, the automatic captions were embarrassingly bad.