I also prefer to use "Coding" or "Hard Prompts (Overall)" instead of the default "Overall" category in Chatbot Arena scores to judge the actual performance level of LLMs. Those categories seem much better aligned with my own vibe tests in terms of reasoning. I guess "Overall" contains a lot of creative tasks, which is not what I mostly use LLMs for in my daily work.