they also spend more to generate more tokens. The more obvious reason is that people seem to rate responses higher the longer they are. LMSYS demonstrated that GPT tops the leaderboard because it tends to give much longer and more detailed answers, and it seems like OpenAI is optimizing for the LMSYS leaderboard.
Agree with this take, though in an even broader way: they're optimizing for leaderboards and benchmarks in general. Longer outputs lead to better scores on those. Even in this thread I see a lot of comments bringing them up, so it works for marketing.
My take is that the leaderboards and benchmarks are still very flawed if you're using LLMs for any non-chat purpose. In the product I'm building, I have to use all of the big 4 models (GPT, Claude, Llama, Gemini), because for each of them there is at least one task that it performs much better than the other three.
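For what it's worth, the multi-model setup above can be as simple as a task-to-model routing table. A minimal sketch (the task names, model identifiers, and `pick_model` helper are all hypothetical, not from any specific product):

```python
# Hypothetical routing table: each task type goes to whichever
# provider handles it best in your own evals, not on a leaderboard.
TASK_TO_MODEL = {
    "summarization": "gpt",
    "long_context_qa": "gemini",
    "code_generation": "claude",
    "cheap_classification": "llama",
}

def pick_model(task: str, default: str = "gpt") -> str:
    """Return the model chosen for this task, falling back to a default."""
    return TASK_TO_MODEL.get(task, default)

print(pick_model("code_generation"))   # claude
print(pick_model("unknown_task"))      # gpt (fallback)
```

The point is that the routing decision comes from per-task evaluation on your own workload, which is exactly where aggregate leaderboard scores stop being useful.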