"nothing groundbreaking" It's extremely cheap, efficient and kicks the ass of th...

gpm · on Jan 25, 2025

The leaderboard [1] still shows the traditional AI leader, Google, winning, with Gemini-2.0-Flash-Thinking-Exp-01-21 in the lead. No one seems to know how many parameters it has, but random guesses on the internet seem to be in the low to mid tens of billions, so fewer than DeepSeek-R1. Even if those guesses are wrong, they probably aren't that wrong, and at worst it's the same class of model as DeepSeek-R1.

So yes, DeepSeek-R1 appears to not even be best in class, merely best open source. The only sense in which it is "leading the market" appears to be the sense in which "free stuff leads over proprietary stuff". Which is true and all, but not a groundbreaking technical achievement.

The DeepSeek-R1 distilled models on the other hand might actually be leading at something... but again hard to say it's groundbreaking when it's combining what we know we can do (small models like llama) with what we know we can do (thinking models).

[1] https://lmarena.ai/?leaderboard

dinosaurdynasty · on Jan 25, 2025

The chatbot leaderboard seems to be very affected by things other than capability, like "how nice is it to talk to" and "how likely is it to refuse requests" and "how fast does it respond" etc. Flash is literally one of Google's faster models, definitely not their smartest.

Not that the leaderboard isn't useful; I just think "is in the top 10" says a lot more than the exact position in the top 10.

gpm · on Jan 25, 2025

I mean, sure, none of these models are being optimized for being at the top of the leaderboard. They aren't even being optimized for the same things, so any comparison is going to be somewhat questionable.

But the claim I'm refuting here is "It's extremely cheap, efficient and kicks the ass of the leader of the market", and I think the leaderboard being topped by a cheap Google model is pretty conclusive evidence that the statement is not true. Is competitive with? Sure. Kicks the ass of? No.

whimsicalism · on Jan 25, 2025

google absolutely games for lmsys benchmarks with markdown styling. r1 is better than google flash thinking, you are putting way too much faith in lmsys

patrickhogan1 · on Jan 25, 2025

There is a wide disconnect between real-world usage and leaderboards. If Gemini was so good, why are so few using it?

I've tested that model in many real-world projects, and it has not once been the best. Going further, it gives atrocious, nonsensical output.

whimsicalism · on Jan 25, 2025

i’m sorry but gemini flash thinking is simply not as good as r1. no way you’ve been playing with both