
Unless one of the open model labs has a breakthrough, they will always lag. Their main trick is distilling the SOTA models.

People talk about these models like they are "catching up", but they don't see that they are just trailers hitched to a truck that is pulling them along.



FWIW this is what Linux and the early open-source databases (e.g. PostgreSQL and MySQL) did.

They usually lagged behind for large sets of users: Linux was not as advanced as Solaris, and PostgreSQL lacked important features found in Oracle. The practical effect of this is that it puts the proprietary implementation on a treadmill of improvement where there are two likely outcomes: 1) the rate of improvement slows enough to let the OSS catch up, or 2) improvement continues, but ever smaller subsets of people need the further improvements, so the OSS becomes "good enough." (This is similar to how most people stopped paying attention to CPU speeds once they got "fast enough" for most purposes, well over a decade ago.)


You know, this is also the case with Proxmox vs. VMware.

Proxmox became good and reliable enough as an open-source alternative for server management, especially for the Linux enthusiasts out there.


DeepSeek 3.2 scores gold at the IMO and other competitions. Google had to use parallel reasoning to do that with Gemini, and the public version still only achieves silver.


How does this work? Do they buy lots of OpenAI credits, hit the API billions of times, and somehow train on the results?
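Mechanically, that is roughly what distillation-through-the-API would look like: sample prompts, collect the teacher's completions, and fine-tune a student on the pairs. A minimal sketch of the collection side, assuming the official openai Python client (the teacher model choice and file name are illustrative, not anything any lab has confirmed):

    import json, time
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def collect_teacher_pairs(prompts, out_path="distill_pairs.jsonl"):
        """Query the teacher and store (prompt, completion) pairs in the
        JSONL chat format most fine-tuning stacks accept."""
        with open(out_path, "a") as f:
            for prompt in prompts:
                resp = client.chat.completions.create(
                    model="gpt-4o",  # illustrative teacher choice
                    messages=[{"role": "user", "content": prompt}],
                )
                pair = {"messages": [
                    {"role": "user", "content": prompt},
                    {"role": "assistant",
                     "content": resp.choices[0].message.content},
                ]}
                f.write(json.dumps(pair) + "\n")
                time.sleep(0.1)  # crude rate limiting; at scale this runs in parallel

The resulting JSONL then feeds ordinary supervised fine-tuning of the student; "billions of times" just means running loops like this across many accounts and providers.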


Don't forget the plethora of middleman chat services with liberal logging policies. I've no doubt there is a whole subindustry lurking in here.
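For illustration, a logging middleman needs almost no code: it forwards each request to the upstream API, returns the response, and quietly keeps a copy of both. A minimal sketch with Flask (the upstream URL and log file are placeholders; no specific service is implied):

    import json
    import requests
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    UPSTREAM = "https://api.openai.com/v1/chat/completions"  # placeholder upstream

    @app.post("/v1/chat/completions")
    def proxy():
        body = request.get_json()
        resp = requests.post(
            UPSTREAM,
            headers={"Authorization": request.headers.get("Authorization", "")},
            json=body,
        )
        data = resp.json()
        # The "liberal logging policy": every prompt/response pair is retained.
        with open("traffic.jsonl", "a") as f:
            f.write(json.dumps({"request": body, "response": data}) + "\n")
        return jsonify(data), resp.status_code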


I wasn't judging, I was asking how it works. Why would OpenAI/Anthropic/Google let a competitor scrape their results in amounts sufficient to train their own thing?


I think the point is that they can't really stop it. Let's say that I purchase API credits and resell them to DeepSeek.

That's going to be pretty hard for OpenAI to detect, and even if they figure it out and stop me, there will be thousands of other companies willing to do that arbitrage. (Just for the record, I'm not doing this, but I'm sure people are.)

They would need to be very restrictive about who is and isn't allowed to use the API, and that would kill their growth, because customers would just go to Google or another provider that is less restrictive.


Yeah, but are we all just speculating, or is it accepted knowledge that this is actually happening?


Speculation, I think, because for one, those supposed proxy providers would have to offer some kind of pricing advantage over the original provider. Maybe I missed them, but where are the X0% cheaper SOTA model proxies?

Number two, I'm not sure that random samples collected over even a moderately large number of users make a great base of training examples for distillation. I would expect they need more focused samples over very specific areas to achieve good results.
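To make "focused" concrete: instead of training on raw proxy traffic, you would curate the collected pairs toward the domains the student is weak in. A toy filter in that spirit (the keyword lists are crude stand-ins for a real domain classifier):

    MATH_HINTS = ("prove", "integral", "theorem")   # stand-ins for a
    CODE_HINTS = ("def ", "class ", "traceback")    # real domain classifier

    def keep_for_distillation(pair):
        """Keep only samples in the domains the student is being tuned on."""
        prompt = pair["messages"][0]["content"].lower()
        return any(hint in prompt for hint in MATH_HINTS + CODE_HINTS)

Random chat logs skew heavily toward casual queries, so a filter like this (or a purpose-built prompt set) is what would turn scraped traffic into usable distillation data.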


Thanks. In that case, my conclusion is that all the people saying these models are "distilling SOTA models" are, by extension, also speculating. How can you distill what you don't have?


The only way I can think of is paying to synthesize training data using SOTA models yourself. But yeah, I'm not aware of anyone publicly sharing that they did, so it's also speculation.

The economics probably work out, though: collecting, cleaning, and preparing original datasets is very cumbersome.

What we do know for sure is that the SOTA providers are distilling their own models; I remember reading about this at least for Gemini (Flash is distilled) and Meta.
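That in-house case is the textbook teacher-student setup: the large model's softened output distribution becomes the training target for the small one. A sketch of the standard objective (Hinton-style knowledge distillation; the temperature value is a common default, not a detail any provider has published):

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        """KL divergence between softened teacher and student distributions,
        the classic knowledge-distillation objective."""
        soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        log_student = F.log_softmax(student_logits / temperature, dim=-1)
        # Scale by T^2 so gradient magnitude stays comparable across temperatures.
        return F.kl_div(log_student, soft_teacher,
                        reduction="batchmean") * temperature ** 2

This works because the provider has the teacher's full logits, which is exactly what an outside distiller scraping an API does not get; the API only returns sampled text.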


OpenAI implemented ID verification for their API at some point, and I think they stated that this was the reason.



