The US Economy is pretty vulnerable here. If it turns out that you, in fact, don't need a gazillion GPUs to build SOTA models it destroys a lot of perceived value.
I wonder if this was a deliberate move by PRC or really our own fault in falling for the fallacy that more is always better.
Why do Americans think China is like a hivemind controlled by an omniscient Xi, making strategic moves to undermine them? Is it really that unlikely that a lab of genius engineers found a way to improve efficiency 10x?
If China is undermining the West by lifting up humanity, for free, while ProprietaryAI continues to use closed source AI for censorship and control, then go team China.
There's something wrong with the West's ethos if we think contributing significantly to the progress of humanity is malicious. The West's sickness is our own fault; we should take responsibility for our own disease, look critically to understand its root, and take appropriate cures, even if radical, to resolve our ailments.
> There's something wrong with the West's ethos if we think contributing significantly to the progress of humanity is malicious.
Who does this?
The criticism is aimed at the dictatorship and their politics. Not their open source projects. Both things can exist at once. It doesn't make China better in any way. Same goes for their "radical cures" as you call it. I'm sure Uyghurs in China would not give a damn about AI.
Many Americans do seem to view Chinese people as NPCs, from my perspective, but I don't know whether that's specific to Chinese people or applies to people of all other cultures as well.
That's the McCarthy-era red scare nonsense still polluting the minds of (mostly boomer / older gen-X) Americans. It's so juvenile and overly simplistic.
> Is it really that unlikely that a lab of genius engineers found a way to improve efficiency 10x
They literally published all their methodology. It's nothing groundbreaking; Western labs just seem slow to adopt new research. Mixture of experts, key-value cache compression, multi-token prediction: 2/3 of these weren't invented by DeepSeek. They did invent a new hardware-aware distributed training approach for mixture-of-experts training that helped a lot, but there's nothing super genius about it; Western labs just never even tried to adjust their models to fit the hardware available.
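For anyone who hasn't looked at mixture of experts before, here's a toy sketch of top-k expert routing (made-up dimensions and routing details, not DeepSeek's actual implementation, which adds things like auxiliary-loss-free load balancing). The point is that each token only pays for k experts' worth of compute even though the total parameter count is much larger:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ToyMoE(nn.Module):
        # Toy top-k mixture-of-experts FFN: every token is processed by only
        # k of the n experts, chosen by a small gating network.
        def __init__(self, d_model=64, n_experts=8, k=2):
            super().__init__()
            self.k = k
            self.router = nn.Linear(d_model, n_experts)   # gating network
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                              nn.Linear(4 * d_model, d_model))
                for _ in range(n_experts)])

        def forward(self, x):                              # x: [n_tokens, d_model]
            probs = F.softmax(self.router(x), dim=-1)      # routing probabilities
            topk_p, topk_i = probs.topk(self.k, dim=-1)    # pick k experts per token
            out = torch.zeros_like(x)
            for slot in range(self.k):
                for e, expert in enumerate(self.experts):
                    sel = topk_i[:, slot] == e             # tokens routed to expert e
                    if sel.any():
                        out[sel] += topk_p[sel, slot].unsqueeze(-1) * expert(x[sel])
            return out

    print(ToyMoE()(torch.randn(10, 64)).shape)             # torch.Size([10, 64])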
It's extremely cheap, efficient, and kicks the ass of the market leader, all while being under AI hardware sanctions.
Most of all, can be downloaded for free, can be uncensored, and usable offline.
China is really good at tech, it has beautiful landscapes, etc. It has its own political system, but to be fair, in some way it's all our future.
A bit of a dystopian future, like it was in 1984.
But the tech folks there are really, really talented; it's been a long time since China switched from producing for Western clients to selling directly to Western clients.
The leaderboard leader [1] is still showing the traditional AI leader, Google, winning. With Gemini-2.0-Flash-Thinking-Exp-01-21 in the lead. No one seems to know how many parameters that has, but random guesses on the internet seem to be low to mid 10s of billions, so fewer than DeepSeek-R1. Even if those general guesses are wrong, they probably aren't that wrong and at worst it's the same class of model as DeepSeek-R1.
So yes, DeepSeek-R1 appears to not even be best in class, merely the best open-source model. The only sense in which it is "leading the market" appears to be the sense in which "free stuff leads over proprietary stuff". Which is true and all, but not a groundbreaking technical achievement.
The DeepSeek-R1 distilled models on the other hand might actually be leading at something... but again hard to say it's groundbreaking when it's combining what we know we can do (small models like llama) with what we know we can do (thinking models).
The chatbot leaderboard seems to be very affected by things other than capability, like "how nice is it to talk to" and "how likely is it to refuse requests" and "how fast does it respond" etc. Flash is literally one of Google's faster models, definitely not their smartest.
Not that the leaderboard isn't useful, I think "is in the top 10" says a lot more than the exact position in the top 10.
I mean, sure, none of these models are being optimized for being the top of the leader board. They aren't even being optimized for the same things, so any comparison is going to be somewhat questionable.
But the claim I'm refuting here is "It's extremely cheap, efficient and kicks the ass of the leader of the market", and I think the leaderboard being topped by a cheap google model is pretty conclusive that that statement is not true. Is competitive with? Sure. Kicks the ass of? No.
Google absolutely games lmsys benchmarks with markdown styling. R1 is better than Google Flash Thinking; you are putting way too much faith in lmsys.
The U.S. firms let everyone skeptical go the second they had a marketable proof of concept, and replaced them with smart, optimistic, uncritical marketing people who no longer know how to push the cutting edge.
Maybe we don't need momentum right now and we can cut the engines.
Oh, you know how to develop novel systems for training and inference? Well, maybe you can find 4 people who also can do that by breathing through the H.R. drinking straw, and that's what you do now.
That's what they claim in the paper, at least, but that particular claim is not verifiable. The HAI-LLM framework they reference in the paper is not open-sourced, and it seems they have no plans to release it.
Additionally there are claims, such as those by Scale AI CEO Alexandr Wang on CNBC 1/23/2025 (time segment below), that DeepSeek has 50,000 H100s that "they can't talk about" due to economic sanctions (implying they likely acquired them somehow when restrictions were looser). His assessment is that they will be more limited moving forward.
It's amazing how different the standards are here. Deepseek's released their weights under a real open source license and published a paper with their work which now has independent reproductions.
OpenAI literally haven't said a thing about how O1 even works.
DeepSeek's holding company is called High-Flyer; they actually do open source their AI training platform as well, here is the repo: https://github.com/HFAiLab/hai-platform
They can be more open than others and yet still not open enough that some of their claims become verifiable. Which is the case for their optimized HAI-LLM framework.
But those approaches alone wouldn't yield the improvements claimed. How did they train the foundational model upon which they applied RL, distillations, etc? That part is unclear, and I don't think they've released anything that explains the low cost.
It’s also curious why some people are seeing responses where it thinks it is an OpenAI model. I can’t find the post but someone had shared a link to X with that in one of the other HN discussions.
I mean what’s also incredible about all this cope is that it’s exactly the same David-v-Goliath story that’s been lionized in the tech scene for decades now about how the truly hungry and brilliant can form startups to take out incumbents and ride their way to billions. So, if that’s not true for DeepSeek, I guess all the people who did that in the U.S. were also secretly state-sponsored operations to like make better SAAS platforms or something?
Well it is like a hive mind due to the degree of control. Most Chinese companies are required by law to literally uphold the country’s goals - see translation of Chinese law, which says generative AI must uphold their socialist values:
In the case of TikTok, ByteDance and the government found ways to force international workers in the US to sign agreements that mirror local laws in mainland China:
I find that degree of control to be dystopian and horrifying but I suppose it has helped their country focus and grow instead of dealing with internal conflict.
I think it is because we conflate the current Chinese system with the old Mao/Soviet Union system because they all call themselves "communist".
The vast majority are completely ignorant of what Socialism with Chinese characteristics means.
I can't imagine even 5% of the US population knows who Deng Xiaoping was.
The idea there are many parts of the Chinese economy that are more Laissez-faire capitalist than anything we have had in the US in a long time would just not compute for most Americans.
Yeah, it's mind-boggling how sinophobic online techies are. Granted, Xi is in sole control of China, but this seems like an independent group that just happened to make a breakthrough, which explains their low spend.
Think about how big the prize is, how many people are working on it, and how much has been invested (and targeted to be invested, see Stargate).
And they somehow yolo it for next to nothing?
Yes, it seems unlikely they did it exactly the way they're claiming they did. At the very least, they likely spent more than they claim or used existing AI APIs in a way that's against the terms.
CEO of Scale said Deepseek is lying and actually has a 50k GPU cluster. He said they lied in the paper because technically they aren't supposed to have them due to export laws.
I feel like this is very likely. They obviously made some great breakthroughs, but I doubt they were able to train on so much less hardware.
CEO of a human based data labelling services company feels threatened by a rival company that claims to have trained a frontier class model with an almost entirely RL based approach, with a small cold start dataset (a few thousand samples). It's in the paper. If their approach is replicated by other labs, Scale AI's business will drastically shrink or even disappear.
Under such dire circumstances, lying isn't entirely out of character for a corporate CEO.
Deepseek obviously trained on OpenAI outputs, which were originally RLHF'd. It may be that we've already got all the human feedback necessary to move forward, and now we can infinitely distil + generate new synthetic data from higher-parameter models.
I’ve seen this claim but I don’t know how it could work. Is it really possible to train a new foundational model using just the outputs (not even weights) of another model? Is there any research describing that process? Maybe that explains the low (claimed) costs.
800k. They say they came from earlier versions of their own models, with a lot of bad examples rejected. They don't seem to say which models they got the "thousands of cold-start" examples from earlier in the process though.
Every single model does/did this. Initially, fine-tuning required expensive hand-labeled outputs for RLHF. Generating your training data from a model trained that way inherently encodes the learned distributions and improves performance, hence why some models would call themselves ChatGPT despite not being OpenAI models.
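The recipe is roughly this (the teacher client, filter, and file name below are placeholders, not any lab's actual pipeline): sample the stronger model, reject bad outputs, and feed what's left into plain supervised fine-tuning.

    import json

    def passes_quality_filter(prompt, answer):
        # Placeholder check: in practice this might be exact-match grading for
        # maths, unit tests for code, or a judge model scoring the response.
        return len(answer.strip()) > 0

    def build_sft_dataset(teacher, prompts, out_path="sft_data.jsonl"):
        # Query the stronger model, keep only answers that survive rejection
        # sampling, and write (prompt, response) pairs for supervised fine-tuning.
        kept = 0
        with open(out_path, "w") as f:
            for prompt in prompts:
                answer = teacher(prompt)
                if passes_quality_filter(prompt, answer):
                    f.write(json.dumps({"prompt": prompt, "response": answer}) + "\n")
                    kept += 1
        return kept

    # Stand-in teacher; in reality this would be an API call to the stronger model.
    fake_teacher = lambda p: "The answer is 42." if "6*7" in p else "Step-by-step reasoning..."
    print(build_sft_dataset(fake_teacher, ["What is 6*7? Give the final answer."]))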
Check the screenshot below re: training on OpenAI Outputs. They've fixed this since btw, but it's pretty obvious they used OpenAI outputs to train. I mean all the Open AI "mini" models are trained the same way. Hot take but feels like the AI labs are gonna gatekeep more models and outputs going forward.
If we're going to play that card, couldn't we also use the "Chinese CEO has every reason to lie and say they did something 100x more efficient than the Americans" card?
I'm not even saying they did it maliciously, but maybe just to avoid scrutiny on GPUs they aren't technically supposed to have? I'm thinking out loud, not accusing anyone of anything.
Then the question becomes: who sold the GPUs to them? They are supposedly scarce, and every player in the field is trying to get hold of as many as they can, before anyone else in fact.
Something makes little sense in the accusations here.
I think there's likely lots of potential culprits. If the race is to make a machine god, states will pay countless billions for an advantage. Money won't mean anything once you enslave the machine god.
We will have to wait to get some info on that probe. I know SMCI is not the nicest player and there is no doubt GPUs are being smuggled, but that quantity (50k GPUs) would be not that easy to smuggle and sell to a single actor without raising suspicion.
It's hard to tell if they're telling the truth about the number of GPUs they have. They open sourced the model and the inference is much more efficient than the best American models so it's not implausible that the training was also much more efficient.
Deepseek is indeed better than Mistral and ChatGPT. It has a tad more common sense. There is no way they did this on the "cheap". I'm sure they use loads of Nvidia GPUs, unless they are using custom-made hardware acceleration (that would be cool and easy to do).
As OP said, they are lying because of export laws, they aren’t allowed to play with Nvidia GPUs.
However, I support DeepSeek's projects; I'm here in the US and able to benefit from them. Hopefully they headquarter in the States if they want the US chip sanctions lifted, since the company is Chinese-based.
But as of now, Deepseek takes the lead in LLMs; it's my go-to LLM.
Sam Altman should be worried, seriously, Deepseek is legit better than ChatGPT latest models.
I haven't had time to follow this thread, but it looks like some people are starting to experimentally replicate DeepSeek on extremely limited H100 training:
> You can RL post-train your small LLM (on simple tasks) with only 10 hours of H100s.
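The core trick behind that kind of cheap RL post-training (GRPO, as described in the R1 paper) is just normalising rewards within a group of sampled answers instead of training a separate critic. A toy sketch, with made-up rewards:

    import statistics

    def group_relative_advantages(rewards):
        # For one prompt: sample G completions, score each with a verifiable
        # reward, and normalise within the group rather than using a value model.
        mean = statistics.mean(rewards)
        std = statistics.pstdev(rewards) or 1.0   # avoid dividing by zero
        return [(r - mean) / std for r in rewards]

    # e.g. 4 sampled answers to a maths problem, reward 1 if the final answer checks out
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))   # [1.0, -1.0, -1.0, 1.0]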
Just to check my math: They claim something like 2.7 million H800 hours which would be less than 4000 GPU units for one month.
In money, buying that many GPUs outright would be something around 100 million USD, give or take a few tens of millions.
If you rented the hardware at $2/GPU/hour, you need $5.76M for 4k GPU for a month. Owning is typically cheaper than renting, assuming you use the hardware yearlong for other projects as well.
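Quick sanity check in code, using the ~2.79M H800-hours figure from the V3 paper and an assumed $2/GPU-hour rental rate:

    gpu_hours = 2_788_000                # H800 GPU-hours reported for V3 training
    rate_usd = 2.0                       # assumed rental price per GPU-hour
    print(gpu_hours * rate_usd)          # ~5.6M USD, matching the ~$5.5M headline figure
    print(gpu_hours / (4000 * 24 * 30))  # ~0.97 -> roughly 4k GPUs running for a month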
Only the DeepSeek V3 paper mentions compute infrastructure, the R1 paper omits this information, so no one actually knows. Have people not actually read the R1 paper?
Alexandr Wang did not even say they lied in the paper.
Here's the interview: https://www.youtube.com/watch?v=x9Ekl9Izd38. "My understanding is that is that Deepseek has about 50000 a100s, which they can't talk about obviously, because it is against the export controls that the United States has put in place. And I think it is true that, you know, I think they have more chips than other people expect..."
Plus, how exactly did Deepseek lie? The model size and data size are all known. Calculating the number of FLOPS is an exercise in arithmetic, which is perhaps the secret Deepseek has, because it seemingly eludes people.
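For what it's worth, the standard back-of-the-envelope (training FLOPs roughly 6 * active params * tokens, using the V3 paper's 37B activated parameters and 14.8T tokens, plus an assumed ~40% utilisation on H800s) lands in the same ballpark as their reported GPU-hours:

    active_params = 37e9                 # activated parameters per token (MoE)
    tokens = 14.8e12                     # pre-training tokens
    flops = 6 * active_params * tokens   # ~3.3e24 training FLOPs
    h800_bf16 = 0.99e15                  # approx. peak BF16 FLOP/s per H800
    mfu = 0.40                           # assumed utilisation
    gpu_hours = flops / (h800_bf16 * mfu) / 3600
    print(f"{gpu_hours / 1e6:.1f}M GPU-hours")   # ~2.3M, same ballpark as the ~2.8M claimed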
> Plus, how exactly did Deepseek lie? The model size and data size are all known. Calculating the number of FLOPS is an exercise in arithmetic, which is perhaps the secret Deepseek has, because it seemingly eludes people.
Model parameter count and training set token count are fixed. But other things such as epochs are not.
In the same amount of time, you could have 1 epoch or 100 epochs depending on how many GPUs you have.
Also, what if their claim on GPU count is accurate, but they are using better GPUs they aren't supposed to have? For example, they claim 1,000 GPUs for 1 month total. They claim to have H800s, but what if they are using illegal H100s/H200s, B100s, etc? The GPU count could be correct, but their total compute is substantially higher.
It's clearly an incredible model, they absolutely cooked, and I love it. No complaints here. But the likelihood that there are some fudged numbers is not 0%. And I don't even blame them; they are likely forced into this by US export laws and such.
> In the same amount of time, you could have 1 epoch or 100 epochs depending on how many GPUs you have.
This is just not true for RL and related algorithms; having more GPUs/agents runs into diminishing returns and is just not equivalent to letting a single agent go through more steps.
It should be trivially easy to reproduce the results no? Just need to wait for one of the giant companies with many times the GPUs to reproduce the results.
I don't expect a #180-by-AUM hedge fund to have as many GPUs as Meta, MSFT or Google.
AUM isn't a good proxy for quantitative hedge fund performance; many strategies are quite profitable and don't scale with AUM. For what it's worth, they seemed to have excellent returns for many years in any market, let alone the difficult Chinese markets.
Making it obvious that they managed to circumvent sanctions isn’t going to help them. It will turn public sentiment in the west even more against them and will motivate politicians to make the enforcement stricter and prevent GPU exports.
I don't think sentiment in the West is turning against the Chinese, beyond, well, let's say white nationalists and other ignorant folk. Americans and Chinese people are very much alike and both are very curious about each other's way of life. I think we should work together with them.
Note: I'm not Chinese, but AGI should be, and is, a worldwide space race.
I don't believe that the model was trained on so few GPUs, personally, but it also doesn't matter IMO. I don't think SOTA models are moats, they seem to be more like guiding lights that others can quickly follow. The volume of research on different approaches says we're still in the early days, and it is highly likely we continue to get surprises with models and systems that make sudden, giant leaps.
Many "haters" seem to be predicting that there will be model collapse as we run out of data that isn't "slop," but I think they've got it backwards. We're in the flywheel phase now, each SOTA model makes future models better, and others catch up faster.
Just a cursory probing of deepseek yields all kinds of censoring of topics. Isn't it just as likely Chinese sponsors of this have incentivized and sponsored an undercutting of prices so that a more favorable LLM is preferred on the market?
Think about it, this is something they are willing to do with other industries.
And, if LLMs are going to be engineering accelerators as the world believes, then it wouldn't do to have your software assistants be built with a history book they didn't write. Better to dramatically subsidize your own domestic one then undercut your way to dominance.
It just so happens deepseek is the best one, but whichever was the best Chinese sponsored LLM would be the one we're supposed to use.
>Isn't it just as likely Chinese sponsors of this have incentivized and sponsored an undercutting of prices so that a more favorable LLM is preferred on the market?
Since the model is open weights, it's easy to estimate the cost of serving it. If the cost was significantly higher than DeepSeek charges on their API, we'd expect other LLM hosting providers to charge significantly more for DeepSeek (since they aren't subsidised, so need to cover their costs), but that isn't the case.
This isn't possible with OpenAI because we don't know the size or architecture of their models.
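A crude version of that estimate, where every number below is an assumption (GPU price, replica size, throughput) rather than a measurement, just to show the shape of the calculation:

    gpu_hour_price = 2.0       # assumed USD per H100/H800-class GPU-hour
    gpus_per_replica = 8       # assumed: enough HBM for the FP8 weights + KV cache
    tokens_per_sec = 2500      # assumed aggregate output tokens/s for the replica
    usd_per_million_tokens = gpus_per_replica * gpu_hour_price / (tokens_per_sec * 3600) * 1e6
    print(f"${usd_per_million_tokens:.2f} per 1M output tokens")   # ~$1.78 with these guesses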
Regarding censorship, most of it is done at the API level, not the model level, so running locally (or with another hosting provider) sidesteps much of it.
Did you try asking deepseek about June 4th, 1989?
Edit: it seems that basically the whole month of June 1989 is blocked. Any other massacres and genocides the model is happy to discuss.
DeepSeek wasn't among China's major AI players before the R1 release, having maintained a relatively low profile. In fact, both DeepSeek-V2 and V3 had outperformed many competitors; I've seen some posts about that. However, these achievements received limited mainstream attention prior to their breakthrough release.
> If it turns out that you, in fact, don't need a gazillion GPUs to build SOTA models it destroys a lot of perceived value.
Correct me if I'm wrong, but couldn't you take the optimization and tricks for training, inference, etc. from this model and apply to the Big Corps' huge AI data centers and get an even better model?
I'll preface this by saying, better and better models may not actually unlock the economic value they are hoping for. It might be a thing where the last 10% takes 90% of the effort so to speak
> The US Economy is pretty vulnerable here. If it turns out that you, in fact, don't need a gazillion GPUs to build SOTA models it destroys a lot of perceived value.
I do not quite follow. GPU compute is mostly spent on inference, as training is a one-time cost. And these chain-of-thought-style models work by scaling up inference-time compute, no?
So proliferation of these types of models would portend an increase in demand for GPUs?
If you don't need so many GPU calcs regardless of how you get there, maybe Nvidia loses money from less demand (or stock price), or there are more wasted power companies in the middle of nowhere (extremely likely), and maybe these dozen doofus almost-trillion-dollar AI companies are also out a few hundred billion of spending.
So it's not the end of the world. Look at the efficiency of databases from the mid 1970s to now. We have figured out so many optimizations and efficiencies and better compression and so forth. We are just figuring out what parts of these systems are needed.
Hyperscalers need to justify their current GPU investments with pay-as-you-go and provisioned-throughput LLM usage revenue. If models get more efficient too quickly, and GPUs therefore end up less loaded by end users, then short of a strong example of Jevons paradox they might not reach their revenue targets for the coming years.
They bought them at "you need a lot of these" prices, but now there is the possibility they are going to rent them at "I don't need this so much" rates.
I don't think we were wrong to look at this as a commodity problem and ask how many widgets we need. Most people will still get their access to this technology through cloud services and nothing in this paper changes the calculations for inference compute demand. I still expect inference compute demand to be massive and distilled models aren't going to cut it for most agentic use cases.
This only makes sense if you think scaling laws won't hold.
If someone gets something to work with 1k h100s that should have taken 100k h100s, that means the group with the 100k is about to have a much, much better model.
Good. This gigantic hype cycle needs a reality check. And if it turns out Deepseek is hiding GPUs, good for them for doing what they need to do to get ahead.
I only know about Moore Threads GPUs. Last time I took a look at their consumer offerings (e.g. MTT S80 - S90), they were at GTX1650-1660 or around the latest AMD APU performance levels.
AI sure, which is good, as I'd rather not have giant companies in the US monopolizing it. If they open source it and undercut OpenAI etc all the better
GPU: nope, that would take much longer, Nvidia/ASML/TSMC is too far ahead
>I wonder if this was a deliberate move by PRC or really our own fault in falling for the fallacy that more is always better.
DeepSeek's R1 also blew all the other China LLM teams out of the water, in spite of their larger training budgets and greater hardware resources (e.g. Alibaba). I suspect it's because its creators' background in a trading firm made them more willing to take calculated risks and incorporate all the innovations that made R1 such a success, rather than just copying what other teams are doing with minimal innovation.
$5.5 million is the cost of training the base model, DeepSeek V3. I haven't seen numbers for how much extra the reinforcement learning that turned it into R1 cost.
With $5.5M, you can buy around 150 H100s. Experts correct me if I’m wrong but it’s practically impossible to train a model like that with that measly amount.
So I doubt that figure includes all the cost of training.
It's even more. You also need to fund power and maintain infrastructure to run the GPUs. You need to build fast networks between the GPUs for RDMA. Ethernet is going to be too slow. Infiniband is unreliable and expensive.
You’ll also need sufficient storage, and fast IO to keep them fed with data.
You also need to keep the later generation cards from burning themselves out because they draw so much.
Oh also, depending on when your data centre was built, you may also need them to upgrade their power and cooling capabilities because the new cards draw _so much_.
The cost, as stated in the DeepSeek V3 paper, was expressed in terms of GPU training hours, priced at the market rate per hour as if they'd rented the 2k GPUs they used.
No, it's a full model. It's just...most concisely, it doesn't include the actual costs.
Claude gave me a good analogy after I'd been struggling for hours: it's like only accounting for the gas grill bill when pricing your meals as a restaurant owner.
The thing is, that elides a lot, and you could argue it out and theoretically no one would be wrong. But $5.5 million elides so much info as to be silly.
E.g. they used 2048 H800 GPUs for about 2 months; buying those cards outright runs roughly $72 million. And we're still not even approaching the real bill for the infrastructure. And for every success there are another N runs that failed; 2 would be an absurdly conservative estimate.
People are reading the number and thinking it says something about American AI lab efficiency; rather, it says something about how fast it is to catch up when you can scaffold by training on another model's outputs. That's not a bad thing, or at least not a unique phenomenon. That's why it's hard talking about this IMHO.
We will know soon enough if this replicates since Huggingface is working on replicating it.
To know that this would work requires insanely deep technical knowledge about state of the art computing, and the top leadership of the PRC does not have that.
It’s not just the economy that is vulnerable, but global geopolitics. It’s definitely worrying to see this type of technology in the hands of an authoritarian dictatorship, especially considering the evidence of censorship. See this article for a collected set of prompts and responses from DeepSeek highlighting the propaganda:
But also the claimed cost is suspicious. I know people have seen DeepSeek claim in some responses that it is one of the OpenAI models, so I wonder if they somehow trained using the outputs of other models, if that’s even possible (is there such a technique?). Maybe that’s how the claimed cost is so low that it doesn’t make mathematical sense?
> It’s definitely worrying to see this type of technology in the hands of an authoritarian dictatorship
What do you think they will do with the AI that worries you? They already had access to Llama, and they could pay for access to the closed-source AIs. It really wouldn't be that hard to pay for and use what's commercially available, even if there is an embargo or whatever; restrictions on digital goods and services can easily be bypassed.
Have you tried asking ChatGPT something even slightly controversial? ChatGPT censors much more than DeepSeek does.
Also, DeepSeek is open-weights. There is nothing preventing you from doing a finetune that removes the censorship; they did that with Llama 2 back in the day.
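A minimal sketch of what such a finetune could look like, assuming one of the small distilled checkpoints and a prepared JSONL of prompt/response pairs; the model name, data file and hyperparameters here are placeholders, not a published recipe:

    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                              TrainingArguments, DataCollatorForLanguageModeling)

    model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"   # placeholder choice
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

    # Placeholder dataset of prompt/response pairs you want the model to answer plainly.
    data = load_dataset("json", data_files="uncensor_pairs.jsonl")["train"]
    def tokenize(example):
        return tokenizer(example["prompt"] + example["response"],
                         truncation=True, max_length=1024)
    data = data.map(tokenize, remove_columns=data.column_names)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="r1-lora", num_train_epochs=1,
                               per_device_train_batch_size=1, learning_rate=1e-4),
        train_dataset=data,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()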
This is an outrageous claim with no evidence, as if there was any equivalence between government enforced propaganda and anything else. Look at the system prompts for DeepSeek and it’s even more clear.
Also: fine tuning is not relevant when what is deployed at scale brainwashes the masses through false and misleading responses.
Refusal to answer "how do I make meth" shows ChatGPT is absolutely being similarly neutered, but I'm not aware of any numerical scores that quantify how much censorship each model does.
Why do you lie? It is blatantly obvious ChatGPT censors a ton of things and has a bit of a left tilt too, while trying hard to stay neutral.
If you think these tech companies are censoring all of this "just because", rather than to avoid being completely torched by the media and by a government that will use it as an excuse to take control of AI, then you're sadly lying to yourself.
Think about it for a moment: why did Trump (and I'm not a Trump supporter) repeal Biden's 2023 AI Executive Order? What was in it? It is literally a propaganda enforcement article, written in sweet-sounding, well-meaning words.
It's ok, no country is an angel; even the American founding fathers would expect Americans to be critical of their government at times, and there's no need to think that America = Good and China = Bad. We have a ton of censorship in the "free world" too, and it is government enforced, or else you wouldn't have seen so many platforms turn the tables on moderation the moment Trump got elected; the blessing for censorship comes directly from government.