this happens in Turkish too. I believe the reason is that movie subtitles were used for training without cleaning out the comments / intros that subtitle authors leave in them.
leaving personal comments, jokes, reactions, intros in subtitles is very common in eastern cultures.
Turkish readers will probably remember “esekadam iyi seyirler diler” (roughly, "esekadam wishes you a pleasant viewing") :)
Kind of mindblowing considering who it is we're talking about. Of all companies, OpenAI couldn't be bothered to throw an LLM at this problem? Finding amorphously phrased but clearly recognizable needles in large numbers of haystacks seems like a patently perfect task for them.
Don't even need an LLM, a regex would have sufficed (I've used my fair share of community-sourced subtitles, and comments are almost always in a different font or colour, between brackets, etc.).
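For what it's worth, here is a rough sketch of the kind of regex pass I mean. Everything in it is an assumption on my part: the function name `clean_cue`, the three patterns, and the tiny credit-phrase list are hypothetical, and nobody here knows what the actual training pipeline does. It also throws away legitimate bracketed text like sound effects, which is probably fine for speech training data but not for general subtitle cleanup.

```python
import re

# Hypothetical patterns for illustration; a real cleanup pass would need
# a much longer, per-language phrase list.
FONT_TAG = re.compile(r"<font\b[^>]*>.*?</font>", re.IGNORECASE | re.DOTALL)
BRACKETED = re.compile(r"\[[^\]]*\]|\([^)]*\)|\{[^}]*\}")
CREDIT = re.compile(r"\b(subtitles? by|translated by|iyi seyirler)\b", re.IGNORECASE)


def clean_cue(text: str) -> str:
    """Return cue text with likely author comments removed ('' if nothing is left)."""
    # Drop the whole cue if it looks like a credit or greeting line.
    if CREDIT.search(text):
        return ""
    # Remove spans wrapped in <font ...> tags (assumed to be author comments).
    text = FONT_TAG.sub("", text)
    # Remove anything between [], () or {}.
    text = BRACKETED.sub("", text)
    # Collapse leftover whitespace.
    return " ".join(text.split())


if __name__ == "__main__":
    for cue in [
        "esekadam iyi seyirler diler",
        "Hello there. [translator: love this scene]",
        'Where are you going? <font color="#00ff00">lol</font>',
        "I'll be back.",
    ]:
        print(repr(clean_cue(cue)))
```

Cues that come back empty would just be dropped from the training set. The obvious weakness is the phrase list: it only catches what you thought to put in it, which is where an LLM pass over the leftovers might still earn its keep.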