We all know they're addictive, they're designed to be addictive, and they're very, very harmful, to both adults and children. The individuals who are profiting from the harm are clearly identifiable. And that harm directly targets children. That this is allowed to continue is a symptom of a sick society.
Social media feeds are designed to be slot machines. Each scroll is a pull. You may or may not get something you actually want. You can't predict what's coming up next, so you just keep mindlessly scrolling.
It's not just the scrolling; the posting side works the same way. They all randomly boost one of your posts, so suddenly you get tons of feedback (especially noticeable when I tried Threads), and then you try to get that back again. The uncertainty keeps you at it.
It's such a relief to finally hear people talking about this clearly and loudly. May it continue, and may this bad behaviour have repercussions. Enough.
Early FB was bad enough when it was your actual friends posting the best (or made-up) bits of their lives - and you were only scrolling when you had nothing better to do. Did you know, kids, there was a time when the feed was ordered chronologically and you actually knew the people who posted?
It's a shame we can't have nice things. An actual non-abusive social medium for people to share things like this - I'd use it. But I see that as soon as there is money on the table, it's a race to the bottom, sooner or later.
My wife and I parental lock each other’s iPhones. I have social media but have to go to my PC to check it. This friction makes a world of difference.
I was astounded hanging out with my friends in person last weekend how every one of them at some point pulled out their phone mid conversation to watch TikTok, or Wordle, or whatever. They thought I was the weird one when I mentioned all social media sites and apps are blocked on my phone. We had an overall good time but these moments stuck out.
The way we do this is simple: we each set a passcode for the other's phone, but I configure my own settings and she hers. This has been available and has worked for us for nearly a decade.
> I was astounded hanging out with my friends in person last weekend how every one of them at some point pulled out their phone mid conversation to watch TikTok, or Wordle, or whatever.
To kill time, sometimes I watch those random "America's Funniest Videos" type videos where it's some random family at home and something funny/weird/etc. happens. I've started noticing that in almost all of them now, everyone is just sitting around staring at a phone. Sometimes an entire family will be in the living room, three on a couch, each in their own little world.
Even my family does the same. It's a very, very hard habit to break. Like smoking, except anti-social, whereas smoking was at least social.
30 years ago they'd all have been staring at TVs in their respective rooms.
50 years ago they'd be reading their own newspapers and magazines.
The name changes but the song remains the same; people have their own interests, even within a family, that aren't shared with others. I wouldn't bore my partner by monologuing about my hobbies, and she likewise. At least we're in the same room together.
Reading was a hobby most people chose not to engage in that much. If you read books/novels etc for 6 hours per day, people would remark on that like "he reads a lot", often asking you to put down your books to join them in whatever activity.
Few people would have had their own TVs in their room 30 years ago. That wasn't common. They were huge, expensive, and not remotely interesting enough to capture the attention of most people for prolonged periods. It was common to have family rituals where there was about 2-3 hours of watching TV during/after dinner together. That was when they aired a movie after some news.
Even game consoles, if you could afford them, really wouldn't capture your attention that much. Nobody played Super Mario for hours every day, weeks on end. And at least to us that was just another social activity anyway. We didn't play these by ourselves.
But I think all that misses the point. You would be doing pretty much none of these in place of another social activity. They either were a social activity, or they filled in otherwise dead time.
When you're having dinner with your friends or family and everyone is looking at their phone, that is replacing something. I remember getting out playing cards and chatting at the dinner table when I was young. Nowadays people just get out their phone or disappear to other personal devices as soon as they are done eating, if there's any dinner ritual left at all.
> Few people would have had their own TVs in their room 30 years ago. That wasn't common. They were huge, expensive, and not remotely interesting enough to capture the attention of most people for prolonged periods. It was common to have family rituals where there was about 2-3 hours of watching TV during/after dinner together. That was when they aired a movie after some news.
Depends on where one is from. In my country (U.S.A.), even many lower-middle-class kids tended to have at least a small portable TV (or, more often, the former family TV that had been replaced by a newer one in the living room) in at least their end of the house or apartment, if not their own room, way back in the late 1960s to early 1970s. What was common for kids in other countries at that time is, of course, a different matter. As for watching TV together as a family rather than on separate sets: that often depended more on whether the family TV was a newer color model and the kids' TV an older black-and-white one, or, as kids grew older and their viewing preferences diverged from their parents', which shows aired opposite one another. Sometimes it even came down to which room made it easier to watch TV while doing homework, talking to a friend visiting from down the street, etc.
I've never felt the need for parental controls, I just refuse to open those sites or install the related apps. Are they really such a draw for you?
At one point I also had a few of them filtered at the DNS level at home, not to restrict my access but rather to defeat any embedded third party requests that might escape my browser filtering.
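For anyone curious, a DNS-level block like that can be done in a few lines of dnsmasq config. This is just an illustrative sketch, not a vetted blocklist; the domains below are examples:

```
# dnsmasq.conf sketch: null-route a domain and everything under it,
# so embedded third-party requests resolve to nowhere.
# (Domains are illustrative examples only.)
address=/facebook.com/0.0.0.0
address=/tiktok.com/0.0.0.0
```

The `address=/domain/ip` form matches the domain and all its subdomains, which is what catches the embedded tracker endpoints a browser filter might miss.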
Remember when that type of behavior was rude? I had a conversation with a couple in 2011 and they had told me that they saw Steve Jobs and his wife at a restaurant and Steve was on his phone most of the time and how rude it seemed. I've thought about that periodically over the years as I've seen the addiction grow and become commonplace and especially as I've seen those same habits develop in myself.
I remember going on dates a few years later, 2014/15, and the phone usage during the dates seemed rude and slightly offended me. Now it's so common it's not even really noteworthy.
It's also that this is not a function of their nature, but of the way that they've been designed to function. Things were not this bad 15 years ago, and the fact that social media existed and functioned the way that it functioned back then was incredibly important in allowing movements like MeToo and BLM and Dreamers and many others to build momentum.
When social media is a tool of regular people, it's an awesome, awesome tool. But when the companies and people that own the platforms start to see users as tools themselves, for their own sociopolitical ends, that's when they become destructive forces. And there was a clear enshittification line drawn about this time 10 years ago, when the transition from one state to the other got underway.
I fear that we're looking at an attempt to manufacture consent to destroy the tool and not just the malicious function.
I think a lot of it is the ease of access now that we carry computers with us everywhere. I was tweeting from my phone in 2009, but I had to send the tweets via text message, so there was no infinite scroll accessible all day, every day to suck my mind into the phone. We had to actually make a decision to sit at a computer and go to the website to be fully immersed.
What these corporations were trying to do is bad, and feasible to a degree. I think it's bad enough that regulation could apply. But there is an additional consideration that's really important in how we as a society deal with this.
Screens are not drugs. They are not somehow uniquely and magically addictive (like drugs actually are). The multi-media is not the problem and not the device to be regulated. The corporate structure and motivations are the problem. This issue literally applies to any possible human perception even outside of screens. Sport fishing itself is random interval operant conditioning in the same way that corporations use. And frankly, with a boat, it's just as big of a money and time sink.
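To make the operant-conditioning point concrete, here's a minimal Python sketch of a variable-ratio reward schedule, the same structure behind a feed refresh, a cast of the line, or a slot pull. The numbers are arbitrary illustrations, not claims about any real feed:

```python
import random

def variable_ratio_feed(n_pulls, hit_rate=0.15, seed=0):
    """Simulate a variable-ratio reward schedule: each 'pull'
    (a scroll, a cast, a spin) pays off unpredictably with
    probability hit_rate."""
    rng = random.Random(seed)
    return [rng.random() < hit_rate for _ in range(n_pulls)]

# The irregular spacing between hits is the point: the
# unpredictability, not the medium delivering it, is what
# sustains the behavior.
hits = [i for i, hit in enumerate(variable_ratio_feed(30)) if hit]
```

The schedule is medium-agnostic: swap the list comprehension's "pull" for a radio jingle or a tug on a fishing line and the conditioning math is identical, which is the argument above.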
We should not be passing judgements or making laws regulating screens themselves because we think screens are more addictive than, say, an enjoyable day out on the lake. They're not. You could condition a blind person over the radio with just audio. The radio is not the problem and radios are not uniquely addictive like drugs.
We can't treat screens like drugs. It's a dangerous metaphor because governments kill people over drugs.
Without this distinction the leverage this "screens are drugs" perceptions gives governments will be incredibly dangerous as these cases proceed. If we instead acknowledge that it's corporations that are the problem and not something magical about screens then there's a big difference in terms of the legislation used to mitigate the problem and the people to which it will apply. The Digital Markets Act in the EU is a good template to follow with it only applying to large entities acting as gatekeepers.
It's not the screen, it's the format. It's an engineered gambling addiction where the currency is time, and instead of the house taking your money, they arbitrage your time to an advertiser, often surreptitiously.
Worse than that, oftentimes the content that fosters the most engagement borders on propaganda that directly damages the social fabric over time. A lot of the extremist content (left, right, and otherwise) fits this description.
Screens on their own aren’t “uniquely and magically addictive”, but infinitely scrollable short form video delivered through that screen is, because a few companies spent billions on the smartest minds in the world to make it so.
There are plenty of public interest limitations on free speech. Food labels, cigarette warnings, deceptive ad laws. Regulating addictive social media isn't really an outlier here.
The parent comment set up a false choice and then had to adapt to the response calling their bluff.
The issue isn’t with reading or consuming content, as was set up in the challenge above.
The issue is with designing feeds and surfacing content in ways that take advantage of our brains.
As an analogy, loot boxes in video games, and slot machines come to mind. Both are designed to leverage behavioral psychology, and this design choice directly results in compulsive behavior amongst users.
I didn’t mention time? From the Cambridge dictionary: ‘addiction: an inability to stop doing or using something, especially something harmful.’ I am in support of regulating things which are harmful and which people have trouble not doing.
I don't impulsively drive to the store to purchase another bag immediately after finishing the one I have whereas (for example) many people exhibit such behavior when it comes to tobacco.
In the case of social media the feed is intentionally designed to be difficult to walk away from and it is endless (or close enough as makes no practical difference). Even if it weren't endless, refreshing an ever changing page is trivial in comparison to driving to the store and spending money.
An amusing question. Episodes are much longer and most shows only have one or a few seasons. I don't get the sense that streaming services optimize for difficulty to walk away and do something else any more or less than a good book does.
Maybe autoplay and immediately popping up a grid of recommendations should both be legally forbidden as tactics that blatantly prey on a well established psychological vulnerability. I'd likely support such legislation provided that it could be structured in such a way as to avoid scope creep and thus erosion of personal liberties.
In short I think Netflix is closer to a bag of Lays and modern social media closer to the cigarette industry of yore.
This is not particularly insightful if you stop and think about it. Try to unilaterally snatch a book that someone is in the middle of reading and you will probably be met with a hostile reaction. Grab the tool someone is using to do a task, similar. What you're describing is the natural reaction to messing with someone else's possessions. Without further context it's blatantly toxic behavior even if you happen to have the authority to force the matter.
You aren’t reading or using a hammer for 6 hours a day. It’s hard to find a time when people aren’t using their phone, which makes it hard to take it away if you’d only do so while it’s not in use.
Phones and computers are used for more than one thing; in that sense they aren't analogous to a single item such as a book or hammer but rather an entire closet filled with odds and ends. Keeping in contact with acquaintances, checking traffic and looking up other day to day information, reading a book during down time, these are three completely distinct activities that have all been nearly entirely subsumed by screens for me.
so… choices, as you see them in this issue, the lenses through which on the one hand you think is extreme and the other appropriate… are either screens-as-drugs or sports fishing?
Some middle ground might be there somewhere. But if forced to choose… the choices for interpreting behavioral engineering funded by $billions in research for over a decade + data harvesting on a scale unprecedented, for the purpose of manipulating users:
My attention span is greatly reduced, for example. I have a much harder time reading physical books than I did as a kid. It should be the opposite as you age.
I've lived through this entire story before in the video game wars. People said exactly the same things with exactly the same urgency about Mortal Kombat - what kind of sick society do we live in, where greedy corporations sell you the experience of shooting people and ripping their heads off? Perhaps we have to let adults buy these "murder simulators", but only a disturbed, evil person could possibly argue for letting kids do it.
If that sounds crazy to you, the moral panic over social media will sound just as crazy in a decade or two.
Having lived through the exact same hysteria, this is a totally different argument being made. This isn't about the morality of a genre of violent YouTube videos or some other tawdry content. It's not the satanic panic or about explicit lyrical content. This is about the harm of designing systems that are psychologically manipulative for the purpose of extracting as much advertising budget as possible from clients.
If Mortal Kombat were free to play and reprogrammed itself to keep the child playing for as long as possible with no ethical bounds, even resorting to calling the child names or making them feel like playing was the only way they'd find some self-worth... then we'd be talking about the same thing.
From my perspective, this will sound crazy in a decade or two, but more in the way of how harmful smoking is and how ridiculous it is that we didn't see it sooner.
I'm genuinely curious how one can look at someone using an app like TikTok and conclude that's not addictive. It's optimised in every way to engage people in behaviours that look like outright addiction.
Anyway, sometimes 'panic' is justified. Sports betting has been a total disaster, for example.
It’s funny, since I worked extensively in both industries: the number of absolutely addicted boomers on FarmVille, match-3 canvas games, and mobile games, throwing their life savings and time away, was totally competitive with Vegas.
Having lived through those panics, fought against them, and then raised the alarm on Lootboxes and FarmVille the day they came out - these are not the same things.
This isn’t a moral panic.
Mortal Kombat did not result in changed behavior in its users. As I recall, the best study on video games only showed some change in behavior for a short time after playing a game, after which children reverted to their baseline.
On the other hand, social media has not survived that scrutiny, with multiple studies showing a causal link between addictive social media design and anorexia, depression, and anxiety.
People defended cigarettes too back in the day, and it took years for people to stop smoking cigarettes in public.
But so is cable television designed to be addictive. So are most restaurants and ice cream parlors and grocery stores designed to get you to spend more. Most loyalty programs are designed to be addictive to get you to come back, etc. etc.
I just worry we left no levers for the public to regulate these entities and this is the worst option of very few options. Who isn't liable under this kind of logic?
The personalization component takes this a step above. Making something very broadly appealing is one thing. Targeting what will keep you specifically from turning it off is a whole new level.
So if social media removed personalization from their algorithms and only applied them broadly across large demographic groups you'd be fine with them? (Genuine question I'm curious)
Maybe. It's hard to know what kind of world that would result in.
I could well see it being so much less effective as to not be a problem. Or maybe they'd be even more effective, and if we caught them explicitly knowing that they were harming children, it would still potentially be tortious.
This would be great, yeah. Disable infinite scrolling and page caching (so that you’re not infinitely scrolling horizontally) and video autoplay too. Also add opt-out time limits and breaks.
Imagine a feed that actually just ends when you run out of posts from people you follow, instead of trying to endlessly keep your attention by pushing stuff it thinks you might like.
If I've read all of the posts from my friends I would prefer to not see anything else, but that doesn't maximize engagement for ad platforms so
The problem isn't that X domain of business is more scummy than Y. They all are. That's kind of the problem. Tech is just egregious, though, in its non-reliance on physical matter, meaning anything that can be digitally rendered is instantly a world-scale fucking problem.
If it were one building in one state doing this shit, no one would care, and we'd just block it or tell people not to go in the building. That doesn't work with digital products that started benign, then had the addictive qualities turned up to 11. That's malice, at scale. If every ice cream parlor, or link in the ice cream supply chain, started adulterating ice cream with drugs, regulators would have dropped the hammer at the site of adulteration. Meta et al. have had no such presence forced upon them, due to lack of regulation in some jurisdictions or being left to self-implement the regulation, thereby largely neutering the effort.
Ice cream isn't engineered to be addictive. Ice cream is, for most people, actually enjoyable and costs money. If ice cream were free but you only got a small amount on random visits to the ice cream parlor then it would be engineered to be addictive.
I don't think that is really true though. People aren't becoming addicted to grocery stores, ice cream parlours and restaurants, or even cable television to nearly (any?) degree. None of those are engineered to addict you in nearly the same degree or magnitude.
I haven't seen anybody making any claims about social media usage leading to clinically meaningful addiction. So why are you asking for evidence of that?
Also, FWIW, I'm not in favour of regulating social media, but I am in favour of bringing lawsuits against companies who engage in societally harmful behaviour, and punishing them financially.
No. It's been established that social media use can produce addiction-like behaviors, that it uses mechanisms similar to gambling and substance addiction, and that a subset of people experience significant impairment as a result of social media consumption. It's still debated if it should be classified as a form of Substance Use Disorder, which is what the term "clinically meaningful" refers to, but the debate is more a matter of classification and semantics, not if the issue exists at all. And not what people are referring to in the context of this case and discussion.
If you're interested in the topic further, you could consider reading 'Toward the classification of social media use disorder: Clinical characterization and proposed diagnostic criteria', which should shine some more light on what people are referring to as "addiction" in this circumstance :)
If you're interested in the neuroscience, consider reading "Neurobiological risk factors for problematic social media use as a specific form of Internet addiction: A narrative review".
Believe it or not, you might find the answer to that question inside the paper I shared with you called "Toward the classification of social media use disorder: Clinical characterization and proposed diagnostic criteria".
There are laws enabling the judiciary to operate as it has to give plaintiffs a platform in the first place, in the absence of specific laws because legislative bodies are slow to adopt new laws for various excuses.
For example: it's not hard to pay off a handful of legislators to vote no. Then what? Do people just suck it up and live at the mercy of the rich?
The judiciary has leeway to allow such cases, and their outcomes bubble up useful context for changes to the law. That's longstanding precedent, and in some cases it is codified in law itself.
The lack of specific legal language banning social media practices is also irrelevant, because of similarities to other situations that are enshrined in law. That human biology is susceptible to psychological manipulation is already well understood. A tiny difference in legal context does not invalidate known truths of biology.
Society doesn't exist in your head alone and has existed for some time. Much of this is not truly new territory.
Reels are non-stop dopamine hits, just like TikTok. It's incredibly addictive to scroll through. That is by far the worst part of Instagram for anybody.
Everything else outside of Reels is the usual social media fake-life facade, with everything amplified to the max for engagement so that "the algorithm" pushes it to feeds. (Note: interactions don't need to be positive to promote something to feeds.)
Depends. Was the product intentionally designed to be that way? The addition of caffeine to soda is the closest example that immediately comes to mind but in that case many individuals are specifically seeking the additive.
There are many physical products that are today designed to minimize harm and misuse after facing liability historically. So I suppose the direct answer to your question would be "yes, absolutely, and there's a figurative mountain of precedent for it".
Are you intentionally being obtuse? It means whether or not the product was intentionally designed to be addictive. What was the intent behind the design? Why were the decisions made? Was there a reasonable alternative that was otherwise functionally equivalent?
The limiting principle on liability is quite complicated. You'd have to go ask a lawyer. At least in the US (and I believe most of the western world) it has to do with manufacturer intent, manufacturer awareness, viable alternatives, and material harm among other things.
No, it is not begging the question. Can you point to where I presupposed my own conclusions? You are (I suspect disingenuously) pretending not to understand intent.
It doesn't matter if the outcome is the same here; what matters is the intent behind the design, considered in the context of the intended use case. That's in addition to lots of other factors (some of which I listed), plus any relevant legislation and case law, and all of that will be examined in great detail by a court. At the end of the day, what is legal and what is not is decided by that process. A large part of the point of employing corporate lawyers is to prevent a situation where your past behavior gets examined from arising in the first place.
I'd suggest the essay "What Colour Are Your Bits?" if you're genuinely struggling to understand this concept.
Is this a young people thing? I'm 40. I have never liked Shorts. What am I supposed to get out of 10 seconds of video? And all the sudden jump-cuts, and big obnoxious one-word-at-a-time subtitles... They're all literally unwatchable.
I watched my 78-year-old stepmother become addicted to Reels, so older people are definitely not immune. But she was able to go cold turkey, as she only communicated with her sister over Instagram, so it wasn’t a problem to just continue with WhatsApp. Young people's real-life networks are too enmeshed with Instagram to have the same option.
Also, what you’re describing sounds like you haven’t spent enough time on Shorts for the content recommendation algorithm to learn your preferences. Which, I agree, is unwatchable. I saw it recently when my friend put on YouTube Shorts on a guest account (on an Airbnb smart TV). It was bad. But spend enough time and that will change. Best you don’t, though!
Same here. In fact, I uninstalled the YouTube app because there was no way to disable Shorts within it while I can use browser extensions to do so in Safari. (I pay for Premium.)
Then again, I hardly use YouTube, so I don’t think I’m the target audience for this.
Please, I beg you, stop and think about these things.
"is it a young people thing": no, obviously not because nothing is.
You're just as prone to addictive behaviours at 20 as at 40 or 80.
There might be some differences as to how you happen to be exposed, perhaps because of how your literal social network is behaving, but that's obviously not intrinsic.
I mean, yes, perhaps "young people" are slightly more likely to be exposed to it via advertising/peers/etc, but anyone with a similar exposure can be a victim.
I find casinos unpleasant but plenty of people obviously don't. I also find games with a narrow FoV unpleasant; I was never able to enjoy DotA 2 because of this and League was only just barely tolerable. Similarly I detest modern web design and gravitate towards sites with an HN or spreadsheet style information dense layout.
I think that's all related, is at least partially a matter of what I'm accustomed to, but is largely just an inherent part of how I am.
Really? I watch a lot of long-form YouTube while doing the dishes, and occasionally poke at the Shorts. Some funny, mostly dumb and I move on.
Maybe a generational thing, but for most of the latter half of the 20th Century most folks had to “exert special effort to regulate their consumption” of network television. Should there have been lawsuits and regulation of couch potatoes?
If you mean 'should network TV be allowed to use behavioural psychology to manipulate people into being couch potatoes' then the answer is yes, that should be regulated against.
Anyway, the way you talk about Shorts reminds me of drug addicts who talk about how they can control their consumption. Some can. Many cannot but delude themselves. The way I see people interact with Shorts/TikTok/Reels is very much not restrained. They're optimised for addictive scrolling in the same way a slot machine is - the fact that some people can use a slot machine without becoming addicted is beside the point.
You dropped the second half of my sentence which pointed to a specific harm. You consequently argued against something which I didn't say. You are not arguing in good faith and this 'conversation' has clearly run its course as you are not capable of engaging the actual points someone is making.
Someone saying that someone shouldn't be able to promote specific harm x is not saying that the idea of 'promotion' of anything in general is necessarily bad, exactly in the same way that we restrict certain harmful things from being sold without being against the idea of selling things in general.
The difference is that the media is 30 seconds, not 2 hours, so the feedback loop is shorter; and the content pool is far, far deeper because it is user-submitted. The content recommendation algorithms become so effective, and the experience so compelling, that it becomes addictive. And as a wise man once said, “a difference in scale is a difference in kind.”
I’m actually strongly sympathetic to this argument, but I’d love to see some actual clinical research that suggests algorithmic short form video has mental and physiological effects that (say) video games do not.
Reminds me of soda. That this liquid poison is allowed to exist turns my stomach. You could fill libraries with the data linking it to a myriad of illnesses and causes of death. Yet they're even allowed to juice it with caffeine for no other reason than to up the addiction level. Like... what are we doing here.
At some point we end up defending the freedom for corporations to exploit people though. I think addiction is one of those times.
If a company has a product that relies on addiction mechanisms to succeed, that is a different situation, that is a corporate entity exploiting citizens for profit.
Cigarettes are a great example of where we can draw lines in the sand. If you want to smoke them go ahead you have that freedom, but I think companies should be banned from putting nicotine in them. Simple and obvious lines in the sand.
Vapes, whatever, smoke your bubblegum water. Vapes with nicotine? Clearly exploitative behaviour. Yes, they can help you quit, but quit what? Nicotine addiction! If it weren't in cigarettes already you wouldn't need to quit it.
Social media is harder to draw lines in the sand for, but I think algorithmic feeds may be one place to target regulation.
But an adult is and should be allowed to develop a nicotine addiction. The reason why people do above all else is that nicotine is an intoxicant and (to most people) pretty pleasant. It's a rational choice.
It's addictive, but the price of quitting is a few weeks of cravings. It's not like alcohol (which is relatively uncontroversial) or opiates.
Don't let them sell to kids. Include scary images on the box. Whatever you do, the truth is that human beings like their drugs and this one isn't really that bad.
Both cigarettes and vapes are ways of consuming a drug. Are you just plainly against drugs? We know how blanket bans on drugs have gone historically and besides the obvious personal freedoms that are lost by mandating what people can and cannot put into their bodies (hello bodily autonomy??), trying to prevent people from consuming drugs does more harm than good (like prohibition, the war on drugs etc).
This ruling was about liability, in that an entity created a product with risks without disclosing them. It's actually worse, they purposefully engineered the product to be harmful. Thus they are liable for that harm. This is subtly different from banning these products - arguably many products that are sold are harmful, the difference is that they either are not acutely harmful (junk food), or the acute harm is well known (alcohol, cigarettes). Some countries mandate disclosure at sale or on the packaging as well.
> We seek to fight two forms of overfitting that would muddy public sensefinding:
> Task-specific overfitting. This includes any agent that is created with knowledge of public ARC-AGI-3 environments, subsequently being evaluated on the same environments. It could be either directly trained on these environments, or using a harness that is handcrafted or specifically configured by someone with knowledge of the public environments.
The point of this test is to check if an AI system can figure out the game. This isn't what happened here. A human figured out the game, wrote in their prompts exactly how the game works and THEN put the AI on the problem. This is 100% cheating and imo quite stupid.
I think generally people regard a harness as the system instructions + tools made available to the LLM (and probably the thing that runs the LLM conversation in a loop). An agent is, collectively, the LLM plus the harness.
I’d encourage devs to use MiniMax, Kimi, etc for real world tasks that require intelligence. The down sides emerge pretty fast: much higher reasoning token use, slower outputs, and degradation that is palpable. Sadly, you do get what you pay for right now. However that doesn’t prevent you from saving tons through smart model routing, being smart about reasoning budgets, and using max output tokens wisely. And optimize your apps and prompts to reduce output tokens.
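The "smart model routing" point can be sketched in a few lines. Everything here (model names, the length threshold, the field names) is a made-up placeholder, not any provider's real API:

```python
# Hypothetical routing sketch: budget model for routine prompts,
# frontier model only when reasoning is actually needed, with output
# tokens capped aggressively on the cheap path.
def pick_model(prompt: str, needs_reasoning: bool) -> dict:
    if needs_reasoning or len(prompt) > 2000:
        return {"model": "frontier-model", "reasoning_budget": "high",
                "max_output_tokens": 4096}
    return {"model": "budget-model", "reasoning_budget": "low",
            "max_output_tokens": 1024}

print(pick_model("rename this variable across the file", needs_reasoning=False))
```

In practice the "needs reasoning" signal would come from a classifier or task metadata; the point is just that the routing decision is cheap compared to the tokens it saves.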
Models aren't just big bags of floats you imagine them to be. Those bags are there, but there's a whole layer of runtimes, caches, timers, load balancers, classifiers/sanitizers, etc. around them, all of which have tunable parameters that affect the user-perceptible output.
It's still engineering. Even magic alien tech from outer space would end up with an interface layer to manage it :).
ETA: reminds me of biology, too. In life, it turns out the simpler some functional component looks, the more stupidly overcomplicated it is once you look at it under a microscope.
There's this[1]. Model providers have a strong incentive to switch (a part of) their inference fleet to quantized models during peak loads. From a systems perspective, it's just another lever. Better to have slightly nerfed models than complete downtime.
Anybody with more than five years in the tech industry has seen this done in all domains time and again. What evidence do you have that AI is different? That's the extraordinary claim in this case...
Real world usage suggests otherwise. It's been a known trend for a while. Anthropic even confirmed as such ~6 months ago but said it was a "bug" - one that somehow just keeps happening 4-6 months after a model is released.
Real world usage is unlikely to give you the large sample sizes needed to reliably detect the differences between models. Standard error scales as the inverse square root of sample size, so even a difference as large as 10 percentage points would require hundreds of samples.
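A quick back-of-the-envelope under the usual normal approximation (5% significance and 80% power are my assumptions) shows the scale involved:

```python
import math

def samples_needed(p1: float, p2: float) -> int:
    """Rough per-group sample size to distinguish two pass rates
    (two-sample normal approximation, 5% significance, 80% power)."""
    z = 1.96 + 0.84  # z_(alpha/2) + z_beta
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil(z ** 2 * var / (p1 - p2) ** 2)

print(samples_needed(0.60, 0.50))  # 10-point gap: ~385 runs per model
print(samples_needed(0.60, 0.57))  # 3-point gap: ~4226 runs per model
```

So even a 10-point difference takes hundreds of runs of each model, and the 3-point differences people claim to notice would take thousands.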
https://marginlab.ai/trackers/claude-code/ tries to track Claude Opus performance on SWE-Bench-Pro, but since they only sample 50 tasks per day, the confidence intervals are very wide. (This was submitted 2 months ago https://news.ycombinator.com/item?id=46810282 when they "detected" a statistically significant deviation, but that was because they used the first day's measurement as the baseline, so at some point they had enough samples to notice that this was significantly different from the long-term average. It seems like they have fixed this error by now.)
It's hard to trust public, high-profile benchmarks because any change to a specific model (Opus 4.5 in this case) can be rejected if it regresses on SWE-Bench-Pro, so everything that gets released will perform well on this benchmark.
Any other benchmark at that sample size would have similarly huge error bars. Unless Anthropic makes a model that works 100% of the time or writes a bug that brings it all the way to zero, it's going to work sometimes and fail sometimes, and anyone who thinks they can spot small changes in how often it works without running an astonishingly large number of tests is fooling themselves with measurement noise.
They do. I'm currently seeing a degradation on Opus 4.6 on tasks it could do without trouble a few months back. Obviously I'm a sample of n=1, but I'm also convinced a new model is around the corner and they preemptively nerf their current model so people notice the "improvement".
Well, I don't see 4.5 on there ... so I'm not sure what you're trying to say.
And today is a 53% pass rate vs. a baseline 56% pass rate. That's a huge difference. If we recall what Anthropic originally promised a "max 5" user https://github.com/anthropics/claude-code/issues/16157#issue... -- which they've since removed from their site...
50-200 prompts. That's an extra 1-6 "wrong solutions" per 5 hours ... and you have to get a lot of wrong answers to arrive at a wrong solution.
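The arithmetic behind that "1-6 wrong solutions" range, taking the 53% vs. 56% pass rates above at face value:

```python
# 3-percentage-point pass-rate drop, over a rough 50-200 prompts
# per 5-hour window (figures from the comment above).
baseline, degraded = 0.56, 0.53
for prompts in (50, 200):
    extra = (baseline - degraded) * prompts
    print(f"{prompts} prompts: ~{extra:.1f} extra failed attempts")
```

And as noted, a failed attempt is not the same as a wrong final solution; most get retried.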
I think the conspiracy theories are silly, but equally I think pretending these black boxes are completely stable once they're released is incorrect as well.
> The $200 per month subscription comes with a ton of usage.
$200 + VAT is half of my rent.
I know HN is not a good place to rant on this subject, but I'm often flabbergasted by the number of people here that live in a bubble with regard to the price of tech. Or just prices in general.
I remember someone who said a few years ago (I'm paraphrasing): "You could just use one of the empty rooms in your house!". It was so outlandish I believed it was a joke at first.
The other part of the bubble is assuming everyone works on projects that allow disclosing any code or project details to a generic third party with that kind of power asymmetry.
I think I am in the middle. I can afford $200/m but it'd be a brainer. And I don't pay that as I barely use home AI enough to warrant it.
I am also amazed at the richer end of HN but now I realize I am privileged. Earned it? Like fuck I did. Lucky to be born a geek in the late 20c. I'd be useless as a middle ages guy.
Nah, that's why you cannot not afford the subscriptions these days. Whatever your needs, ever since Claude Code became a thing, subscription costs come out massively cheaper than pay-as-you-go per-token API pricing. Also SOTA models are so much better than anything else, that using older or open models will just cost you more in tokens/electricity than going for SOTA subscription.
Subscriptions are definitely middle-class targeted. $20/month is not much for the value provided, at least not in the western world.
But if by "rich" you just mean "westerners", then in this sense, the same is and has always been true for computing in general.
We'll cross that bridge when we come to it. Especially in context of discussing living at different economic strata, customers are neither expected nor supposed to voluntarily overpay out of a belief this will make an industry not try to rugpull everyone at some point.
Not sure. AI is priced at around car-ownership levels. I think while that ain't poor, that is middle class.
So like if you want to start a business of any sort the AI sub is still peanuts.
AI is a car, or a dog, or a mild social life, or a utility bill level of cost. And that's for the level needed for a sane typical developer. (AI maximalists need 250k/y, let them slop it out)
It is not a Cessna, an infinity pool or a 1 month vacation.
It’s a good reminder. Claude Max at $200/mo works out to over $6/day, more than double the global poverty line ($3/day). I think it’s okay to invest in it, but we should try to make sure it’s worthwhile, and also invest in charity.
$200/mo is a lot, sure, but the shocking part of that comparison is your rent. I didn’t know $400/mo apartments still existed. For most people in the US and EU, $200 would be closer to 15%-20% of rent I think? My cell phone bill for my family is almost $200/mo.
Last year, at first, $200 seemed crazy. Now that I’m getting addicted to coding agents, not so much. Some companies are paying API rates for AI for employees, and it’s a lot more than $200/mo. It seems like funny money, and I’m not sure it’ll last.
As you've probably guessed, I don't live in the US, so the prices are drastically different. I live in the EU. And in my case, I've lived in a really small flat for some years, so the rent couldn't go up a lot.
> most people in the US and EU, $200 would be closer to 15%-20% of rent I think?
> the average rent is north of $1000/mo.
I really don't know where you get your numbers from; a $1000/mo average is really wild to me. With this amount, you can rent a flat for a whole family in the heart of the city. None of my more well-off friends have a rent this high.
Or maybe you have some capital city in mind like Paris or London?
It is my belief that rent prices scale with the leftover income people have after they've paid for other necessities. I.e., if you're from a poorer country/area then things like milk and gasoline will cost a similar amount (maybe a 2x difference), but rent will cost a lot less. As people in a country get richer they start paying a larger and larger share of their income as rent in various forms.
Even the US has places with cheap rent/housing. The downside is that there's no (well-paying) work nearby.
It’s true that average rent prices are regional and poorer areas have lower rents, but that doesn’t tend to make much difference in urban areas and large cities where the majority of people live now. Why do you feel that rent scales with disposable income? Economists generally say the opposite based on housing being a core necessity: that people pay rent in proportion to their income, and only what’s left over is the disposable amount. That’s why we have the 30% rule, for example.
You’re technically correct, btw, rental housing is a market and is subject to market forces, meaning what people are willing to pay. I’m just not so sure about framing rent as being lower priority than other necessities. And rent prices have been increasing faster than other necessities, and faster than income, so that might be a confounding factor in your argument.
Still, my initial reaction above is due to the fact that in the US and in Europe in most large cities, the average rent is north of $1000/mo.
In the US/Western Europe? Because for devs especially in the former, $200 is pocket change, especially for a core productivity tool. And the rent would easily be in the $1200 to $3000 range. Same for houses. Maybe not in NY or SF, but in most of the US there's no shortage of house space and redundant rooms.
I've seen those comments about $200/month and empty rooms here, so I suppose they mainly come from the US, yes.
So yes, you describe a situation that I feel like a lot of people here don't understand is not the norm.
I compared the subscription with my rent precisely because it's easier to compare: with your numbers it would be like paying from $600 up to $1500 / month. Pretty hard to justify.
Are you not a dev? If not, what would you use a coding tool for? They still require handholding for anything largeish. Still much cheaper than outsource.
First, I've assumed you were in the bubble I described, but that's not the case, so sorry bout that.
Also, I think it's relevant to the conversation.
You replied to someone who said that "you" (an undirected pronoun, I suppose) can't afford the SOTA, saying that the $200/month Anthropic subscription comes with a ton of usage. So I interpreted it as a general statement. It wasn't what you meant?
I'm a bit lost about who you're talking to/about in your first comment: the person you respond to, a general statement for everyone reading, or yourself?
I assume when somebody says "you" and is not talking about anyone in particular, they mean that it's infeasible for virtually everybody, which is certainly not the case. Also, you conveniently disregarded the fact that it's available on the $20 per month plan.
Okay, I understand better. I interpreted your answer as "well, it's $200, everybody can afford it". Clearly a misunderstanding.
Going back to the $20 plan, yes, I agree it's much more accessible.
I didn't talk about it because I've seen a lot of comments here, on blogs, on social media about how a $200 subscription for Claude is a no-brainer. And it got on my nerves, so I wanted to point out how much money that can be. To you (which was misguided, reading your answers), and to concerned HN commenters in general.
I'm not sure I've correctly understood what you're implying.
If it's that I'm not working, well, I'm employed.
If it's that I'm not working enough to have this money... Well, we still come back to the bubble. Not everywhere in the world can you easily find a job that pays enough, even if you accept working more. And the employer will not give developers a $200/month subscription, even less for personal use.
If it's that I'm not working enough and I should go freelancing to work as much as I want and get rich (I'm extrapolating). Well, you're right, I could do that. But (at least at first), I would work a lot more for much less money. And even if I become a recognized freelancer, it doesn't change the fact that I'll earn less money compared to the baseline of SF, or even the USA in the tech sector in general. So, bubble again. I could also, like someone said, put the tokens cost into my hourly/daily rate, but I'll be much more expensive than other freelancers.
Also, but that's a "me case" compared to my previous points, health issues can greatly affect how much work you can do.
Instinctively, if we suppose all newbie freelancers without any reputation start at the lowest rate possible to be competitive, passing the additional cost to my client will mechanically increase my rate, putting me at a disadvantage in getting any work. And with the difference in monetary value for the same price of tokens, the rate delta is higher.
It's a simplified model of the world, but it feels like simple economic rules.
I assume the comment I'm referring to was written by someone who is already established and for whom passing on the token cost is relatively lower than in my environment.
>I'm often flabbergasted about the number of people here that lives in a bubble with regard to the price of tech
Sorry, no. You live in the bubble, the people you think are living in a bubble are actually doing the very opposite and taking advantage of the lack of bubbles in our globally connected world.
Today, basically anyone can sell any bullshit to billions of people around the world. We’ve never lived in less of a bubble.
The thread started with "$200 is a lot for most of the world", the person I was replying to said "no it's not, now anyone can sell to billions of people", and I said "company success being concentrated in SF shows that that's not true".
>company success being concentrated in SF shows that that's not true
You didn’t say that until now.
I think you’re wrong, SF being particularly good for big companies might indicate something if the conversation was about succeeding at a grand scale, not about being able to afford $200/mo
These are the types of individuals that get so left in the dust that they don't realize what's going on anymore, and it's obvious this person is already there. Claude hasn't been a "subscription for coding" product for quite some time now. That's how it started out, and while that's certainly what Claude is known for, Anthropic has been pushing for Claude to also be a general productivity tool: Claude Code, then Claude Desktop, Claude Work, and now Claude Desktop has Chat, Work, and Code essentially built into a single desktop app that works wonders for those looking for a general productivity tool.
I'd not use it over pure Claude Code because I am at heart a coder and I want the raw terminal experience and there's some features missing from the "Code" tab in Claude Desktop, but just saying "a subscription to code", just goes to show how out of touch that person already is, and that's what resistance does to you when you try to resist making use of any kind of modern tooling or technology.
I dunno how you guys even get through the $200 subscription. I use it every day for work and side projects, doing tasks in parallel, and I'm nowhere near the limit on $100.
The $100 already gives plenty of usage and is more than worth it, and I'm definitely not an affluent SV developer. I've only ever hit the 5h limit once in the last month, although I rarely run more than 3 agents at once, and I don't use ridiculously expensive tools like Gas Town.
Anthropic’s $20 plan gives you such a pittance of tokens that it’s borderline unusable for anything more than a few scripts or a toy app. If $20 is all you have you’d do _much_ better going with chatgpt
Do you mostly just hit the session limits? If so I know it's not ideal but you could wait an hour or two for that to reset. Not sure if that would work for you but just a suggestion
> 200 USD/month is a number only really affluent programmers (e.g. in the Silicon Valley) can perhaps pay easily.
Not true, I live in USA PNW and my last remote job paid $12k/mo. I have been jobless for over a month now (currently waiting for the next HN "who wants to be hired"), but I still have enough savings to easily afford to continue that plan for a while.
I don't think it really has to do with affluence but more the job market and economy you're in. Countries with lower salaries or higher costs of living will have less buying power.
Are you kidding me? Even developer salaries in the Philippines can afford that or at least the plan below it. If I used the Anthropic API, my monthly spend would be $4k a month. The Claude Max plan is the best bargain around.
I'm starting to think in these conversations we're all often talking about two different things. You're talking about running an LLM service through its provided tooling (codex, Claude, cursor), others seem to be talking token costs because they're integrating LLMs into software or are using harness systems like opencode, pi, or openclaw and balancing tasks across models.
Fair enough, I read it quickly and assumed the person they replied to was talking about Claude Code
But I run an AI SaaS and we do offer Opus 4.6, too. Our use case is not nearly as token intensive as something like coding, so we are still able to offer it with a good profit margin.
Also you can run OpenClaw with your CC subscription. It's what I do.
I wrap Opus 4.5 in a consumer product with 0 economic utility and people pay for it, I'm sure plenty of end users are willing to pay for it in their software.
Edit: I'm not using the term of art, I mean it literally cannot make them money.
I get decent results with Kimi, but I agree with your overall premise. You do need to realise that while you can save money on a lot of tasks with those models, for the hardest tasks the "sticker price" of cost per million tokens isn't what matters.
It's also worth noting that the approach given in the link also benefits Sonnet and Opus. Not just as much - they are more forgiving - but put it in a harness that allows for various verification and repair and they too end up producing much better results than the "raw" model. And it's not clear that a harness around MiniMax, Kimi, or Qwen can measure up then.
I use those models a lot, and hope to use them more as my harnesses get better at discriminating which tasks they are cost effective for, but it's not straightforward to cost optimize this.
If I cared about running everything locally, then sure, it's amazing you can get to those kinds of results at all.
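The verification-and-repair harness mentioned above can be sketched as a toy loop. Nothing here reflects any specific harness; `generate` and `check` are placeholder hooks:

```python
# Hypothetical verify-and-repair loop around a weaker model: generate
# a candidate, run checks, feed the failure back, and retry a few
# times before giving up.
def verify_repair(generate, check, max_rounds=3):
    feedback = None
    for _ in range(max_rounds):
        candidate = generate(feedback)
        ok, feedback = check(candidate)
        if ok:
            return candidate
    return None  # give up after max_rounds

# Toy demo: the second attempt passes the check.
attempts = iter(["missing tests", "all green"])
result = verify_repair(lambda fb: next(attempts),
                       lambda c: (c == "all green", "tests failed"))
print(result)  # all green
```

Real harnesses plug linters, compilers, or test suites into `check`, which is why they help the cheaper models disproportionately.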
Interesting benchmark. It is notable that Gemini-3-Flash outperforms 3.1 Pro. My experience using Flash via Opencode over the past month suggests it is quite underrated.
Needless to say, benchmarks are limited and impressions vary widely by problem domain, harness, written language, and personal preference (simplicity vs detail, tone, etc.). If personal experience is the only true measure, as with wine, solving this discovery gap is an interesting challenge (LLM sommelier!), even if model evolution eventually makes the choice trivial. (I prefer Gemini 3 for its wide knowledge, Sonnet 4.6 for balance, and GLM-5 for simplicity.)
> It is not my fault if Claude outputs something like "*1*, *1*", adding markdown highlighting, when most other models respect the required format correctly.
Yuck. At that point don't publish a benchmark; this explains why their results are useless too.
-
Edit since I'm not able to reply to the below comment:
"I want structured output from a model that supports structured output but will not enable structured output, nor ask for an existing format like XML or JSON" is not really an interesting thing to benchmark, and that's reflected in how you have Gemini 2.5 Flash beating GPT-5.4.
I really hope no one reads that list and thinks it's an AI leaderboard in any generalizable sense.
Why not? I described this in more detail in other comments.
Even when using structured output, sometimes you want to define how the data should be displayed or formatted, especially for cases like chat bots, article writing, tool usage, calling external api's, parsing documents, etc.
Most models get this right. Also, this is just one failure mode of Claude.
Like I said in the edit, when people want specific formatting they ask for well known formats: Markdown, XML, JSON
I don't even need to debate if the benchmark is useful, it doesn't pass a sniff test: GPT-5.4 is not worse than Gemini 2.5 Flash in any way that matters to most users. In your benchmark it's meaningfully worse.
The questions do ask specifically to respond with the answer only, with an example format given in many cases.
Note that all reasoning models are tested with "medium" reasoning.
The benchmarks are questions/data processing tasks that an average user will likely ask, not coding questions (I didn't add any coding tests yet).
Gemini models also tend to be very consistent. Asking the same question will likely give the same result.
The two models you mention scored the same, the only difference is that Gemini was better at domain-specific questions (i.e. you ask something quite technical/niche).
It’s worth also comparing Qwen 3.5, it’s a very strong model. Different benchmarks give different results, but in general Qwen 3.5, GLM 5, and Kimi K2.5 are all excellent models, and not too far from current SOTA models in capability/intelligence. In my own non-coding tests, they were better than Gemini 3.1 flash. They’re comparable to the best American models from 6 months ago.
While I like these models, if you're getting similar results to SOTA models from 6 months ago, I have to question how far you pushed those models 6 months ago. It is really easy to find scenarios were these models really underperform. They take far more advanced harnesses to perform reasonably (and hence the linked project). It's possible to get good results out of them, but it takes a lot of extra work.
I badly want to shift more of my work to them, and I'm finding ways of shifting more lower-level loads to them regularly, but they're really not there yet for anything complex.
Not really related, but does anybody know if somebody's tracking the same models' performance on some benchmarks over time? Sometimes I feel like I'm being A/B tested.
Yeah, good tests are associated with cost. I'd like to see benchmarks on big messy codebases and how models perform on a clearly defined task that's easy to verify.
I was thinking that tokens spent in such a case could also be an interesting measure, but some agents do small useful refactorings along the way. Although the prompt could specify doing the minimal change required to achieve the goal.
It's 8.3 vs 8.1, I wouldn't call that significantly better.
I think GLM got a bit in front, because on some tests that both got wrong, GLM did sometimes (inconsistently) respond with the correct answer.
That being said, yes, in this case probably with more and more tests added, gpt-5.4 would edge in front, especially if coding tests were added (there are no coding tests yet).
> I’d encourage devs to use MiniMax, Kimi, etc for real world tasks that require intelligence.
I use MiniMax daily, mostly for coding tasks, using pi-coding-agent mostly.
> The down sides emerge pretty fast: much higher reasoning token use, slower outputs, and degradation that is palpable.
I don't care about token use, I pay per request in my cheap coding plan. I didn't notice slower outputs, it's even faster than Anthropic. Degradation is there for long sessions with long contexts, but that also happens with Anthropic models.
> Sadly, you do get what you pay for right now. However that doesn’t prevent you from saving tons through smart model routing, being smart about reasoning budgets, and using max output tokens wisely. And optimize your apps and prompts to reduce output tokens.
Exactly. For my use case, I get 1500 API requests every 5 hours for 10€ monthly. I never hit the limit, even during the intensive coding sessions.
What I notice is, while Opus and Sonnet feel better for synthetic benchmarks, it doesn't matter in the real world. I never put so much effort into coming up with a perfect problem spec like the ones in benchmarks. I don't craft my prompts for hours expecting the LLM to one-shot a working program for me. And that's exactly what all those benchmarks are doing. And that's where Anthropic tools shine in comparison to cheaper Chinese models.
When it comes to the real world, where I put my half-baked thoughts in broken English in a prompt and execute 20 prompts in half an hour, the difference between Opus, Sonnet, and MiniMax is minimal, if at all. There, I don't want to think about costs and token savings and switching between different Anthropic models. I just use MiniMax, and that's it.
Yes, MiniMax sometimes gets stuck. Then I switch to Opus to unblock it. But the same happens if I use Opus the whole session. It gets stuck eventually, and model switch is sometimes required to get a fresh perspective on the problem.
The only difference is, using Opus or Sonnet quickly eats up my budget, while with MiniMax I have basically unlimited usage (for my coding use case) for 10€ per month.
I've only been using free tokens for a year now. I was on Gemini, but they just dropped Pro, so I switched to MiniMax. It was a bit of a hurdle switching from Gemini-cli to kilo-cli, but now I can't really see too much difference.
If I was starting new projects I'd pay for a better model, but honestly I don't really know any different.
I've never used Claude and people seem to rave about it. Maybe it's good, but I doubt it's $200/month good.
When I hit issues with these lower models I think hard about creating the right tooling, agnostic to the harness. I feel like maybe it's more work, but I can carry those tools to any setup going forward. That's how it was in the early Linux days, so why change what clearly works?
I’ve never had any problems with MiniMax. I wouldn’t call the speed fast exactly, but it’s faster than GLM and seems similar to Opus.
It’s been fast enough that I’ve been using it as my main model (M2.7 and before that, M2.5). Opus still does better at tasks, but MiniMax is so much cheaper. I’ve used their cheaper plan and I’ve never been rate limited.
> They're all slop when the complexity is higher than a mid-tech intermediate engineer though.
This right here. Value prop quickly goes out the window when you're building anything novel or hard. I feel that I'm still spending the same amount of time working on stuff, except that now I'm also spending money on models.
Kimi's been one of my go-to options lately and it oftentimes outperforms both Claude and GPT in debugging, finding the actual problem immediately while the other two flail around drunkenly.
It does have some kind of horrible context consistency problem though, if you ask it to rewrite something verbatim it'll inject tiny random changes everywhere and potentially break it. That's something that other SOTA models haven't done for at least two years now and is a real problem. I can't trust it to do a full rewrite, just diffs.
No tooling, just manual use. When doing these comparisons I gather and format all the data they need to figure out the problem, and paste the same thing into all models so it's a pretty even eval.
I doubt Kimi would do well with most harnesses; its outputs are pretty chaotic in terms of formatting, but the intelligence is definitely there.
I'm a ground instructor and instrument rated pilot and I fly a 206 in and out of busy charlie and delta airports. I'm also a ham radio guy (WT1J) and an SDR dev. I'm 100% with you on this, but the amount of inertia you're dealing with here approaches infinity. And there are some weirdly strong arguments for not changing things.
We use AM simplex radio. That means everyone hears everyone else and that helps everyone build a situational awareness picture. Secondly we use AM because if someone transmits over someone else it makes a squealing noise so you know it happened. Also AM propagates pretty well.
Most people on HN could design a pretty good digital replacement in a few minutes - and no doubt some have been suggested in these comments. But it's instructive to understand a bit about aviation history. The liability risk carried by aircraft and avionics manufacturers at one point got so bad that we stopped making general aviation planes in the USA. Then that liability was limited to a very small extent by GARA, and we had what we call the 'restart' of manufacturing.
So the idea of introducing a new mandatory replacement (not addition like ADS-B) for AM comms has a lot of resistance from quite a few areas: Manufacturers don't want to have to make the capex to reinvent and recertify new equipment. The US has a lot of old planes due to the lack of innovation because of the liability issue - and so those old planes all need a retrofit and pilots don't want to spend that money. Avionics for certified aircraft is already horrifically expensive. Legislators don't want to take on the risk of an incident attached to a bill they sponsored. And then there's the practical matter of now having two systems - the legacy AM comms, and the modern one that some have and some don't and the split in situational awareness between those populations.
So while full-duplex is seductive, and digital is seductive, and satellite seems like the obvious endgame - the reality of transitioning is very difficult.
Vehicles are listening to the same audio the pilots are, so they have the same mental picture of what's going on. Last week I talked to a maintenance vehicle at KBLI directly from the air because he was on a runway I needed to land on, at an untowered field. He cleared it, I landed, and he went about his business. So the system works pretty well most of the time.
I think the root of the issue here is actually something else. Firstly there is a lot of dissatisfaction among NATCA members (ATC union) towards their union, and the view seems to be that the union could be doing a lot better job of lobbying for their workers. You can visit /r/atc or /r/atc2 on reddit to learn more.
Secondly, the USA has fallen into a nasty trap where our government has positive incentives to choreograph shutdowns to get our congress members and senate members the face time that they crave. So there is a negative incentive to resolve a shutdown. Rather let it get hot, let it play out, and maybe you'll be the one to appear to save the day to your constituents. The trouble with this is that the department that creates one of the highest risks for civilians in a very visible way, is the FAA and the controllers in particular. So they have become a political football. And they're in an extremely stressful job without pay. And that's a very big problem.
You're seeing this play out in a growing adversarial relationship between the NTSB (e.g. DCA) and FAA, with NTSB tearing FAA a new one recently for DCA - and rightly so. I think that's led to more demotivation at FAA which hasn't helped.
So the situation is spiraling out of control. We have controllers who are overworked, who regularly don't get paid, and a union not doing the greatest job at advocating for them. Along with the recent cuts in government funding across the board.
It's frustrating for pilots. The best we've been able to do is bring our local TRACON folks stacks of free pizza, both in Colorado and Seattle. But that's obviously a token gesture. I don't see a way out of it. To be perfectly honest. And it's very frustrating because the amount of good work that FAA does, is quite startling. You'd be amazed how much data they produce including real-time feeds that are freely available to devs like us. Once you get into the IFR world and start looking not just at approach plates, but the review and updating process of each, the other maps that are produced, the real-time sitrep data that they're producing - it's really quite something what they've accomplished. And the world looks to FAA for its lead in aviation. We were the first to pioneer powered fixed wing flight, after all. I can only hope there's a way out of this.
Given the free market nature of cellphones, where vendors and companies have unfettered access to monetize users, having cellphones in school is akin to making school children line up and listen to sales pitches from companies around the world for several hours a day, instead of focusing on education.
Almost too good to be true. They didn't find large quantities of weed, and Afroman had cameras set up that caught the whole thing. I mean, talk about landing with your bum in the butter. His career just got a major reboot.
CLI is great when you know what command to run. MCP is great when the agent decides what to run - it discovers tools without you scripting the interaction.
The real problem isn't MCP vs CLI, it's that MCP originally loaded every tool definition into context upfront. A typical multi-server setup (GitHub, Slack, Sentry, Grafana, Splunk) consumes ~55K tokens in definitions before Claude does any work. Tool selection accuracy also degrades past 30-50 tools.
Anthropic's Tool Search fixes this with per-tool lazy loading - tools are defined with defer_loading: true, Claude only sees a search index, and full schemas load on demand for the 3-5 tools actually needed. 85% token reduction. The original "everything upfront" design was wrong, but the protocol is catching up.
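The shape of the idea, as I understand it. The field names below are illustrative only, not a faithful copy of Anthropic's actual Tool Search API:

```python
# Register many tools, but expose only a tiny search index up front;
# full JSON schemas load lazily for the tools the model picks.
TOOLS = {
    "github_create_issue": {
        "description": "Open an issue in a GitHub repository",
        "defer_loading": True,
        "schema": {"type": "object",
                   "properties": {"title": {"type": "string"}}},
    },
    "grafana_query": {
        "description": "Run a query against a Grafana datasource",
        "defer_loading": True,
        "schema": {"type": "object",
                   "properties": {"expr": {"type": "string"}}},
    },
}

def search_index():
    """What goes into context initially: names and one-liners only."""
    return [{"name": n, "description": t["description"]}
            for n, t in TOOLS.items()]

def load_schema(name):
    """Fetched on demand, only for the tools actually selected."""
    return TOOLS[name]["schema"]

print(search_index())
```

With dozens of servers registered, the index stays a few hundred tokens while the full schemas would be tens of thousands.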
Can't we just iteratively inspect the network traces then? We don't need to consume the whole 2 MB of data; maybe just dump the network trace and use jq to get the fields, to keep the context minimal. I haven't added this in https://news.ycombinator.com/item?id=47207790 , but I feel it would be a good addition. Then prompt it with instructions to gradually discover the necessary data.
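A rough sketch of that gradual-discovery idea in Python (the trace structure here is made up; jq would do the same from the shell):

```python
# Made-up stand-in for a large network trace dump.
trace = {"requests": [{"url": "/api/items", "status": 200,
                       "body": {"items": list(range(1000))}}]}

def peek(obj):
    """Cheap first pass: report the shape of a node, not its payload."""
    if isinstance(obj, dict):
        return {k: type(v).__name__ for k, v in obj.items()}
    if isinstance(obj, list):
        return f"list[{len(obj)}]"
    return type(obj).__name__

print(peek(trace))                 # top level: {'requests': 'list'}
print(peek(trace["requests"][0]))  # one entry: url/status/body types
```

The agent only ever pulls the leaves it actually needs, so the 2 MB payload never enters context.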
But then I wonder where the balance is between a bunch of small tool calls vs. one larger one.
I recall some recent discussion here on HN on big data analysis.