"The government sucks therefore aliens built the pyramids" has to be the line of thinking I sympathize the least with in the entire span of human opinions.
This is unfortunately the problem. The level of the public debate is abysmal, most politicians push unbelievably stupid shit about immigration and other identitarian nonsense, budget gets spent to ensure cheese and wine have the proper AOC certifications on them. Honestly up to a point I even understand it, many people don't see themselves as having a meaningful identity as EU citizens and you can't force it upon them.
Asking for sensible AI policy is like asking for a base on mars.
> many people don't see themselves as having a meaningful identity as EU citizens
I sometimes wonder if the citizens of the United States (of America) even comprehend that the EU is not itself a sovereign nation (unlike the states in say, the USA, or Australia) and is just a union of sovereign polities.
Nobody in the EU is an EU Citizen unless they are a citizen of one of the member states.
> reducing unskilled and hateful immigration is the democratic thing to do
Hateful, sure, though just like all those on the right who say the first word of "hate crime" is redundant, I'd argue home-grown hate's just as dangerous as imported.
But "unskilled" immigration? When the topic is AI? If this stuff works as advertised, *nobody's skills matter any more*. If it doesn't at least render many of our skills obsolete, why build it? If you make an AI which can't automate anything, how is this not a waste of money?
Even without that, I've not seen anyone who knows about Baumol's cost disease opine either way about migration, high or low skilled.
I am open to the idea that we should handle immigration differently, but I want a plan and specifics, not slogans. What we want to achieve, and by what mechanisms you plan to get there. Open any newspaper: are you more likely to find careful and considerate opinions or racist screeds?
And that is the problem. Time and energy and money and political capital are routinely spent on inconsequential electoral poliTICS rather than substantial poliCY.
Agreed, I don't know if it's going to stay like this forever, but right now, if anything, the difference is amplified. You can make unbelievable stuff happen through the sheer power of knowing what you're doing.
Yeah, GPT 5.5 + Fable beating either individually is believable, but 2x Opus > Fable is what makes me a bit dubious about the whole thing. They might be measuring skills that are too specific or benefit a lot from more tokens being thrown at them. Also Claude Code (the harness) is not the best at the moment, that might be part of it as well?
Not all translations are the same. Literary translations are often works of art in and of themselves, and automating them would be missing the point entirely, like automating homework or weightlifting at the gym. I don't really know what's the state of the art, but I do buy that, on the other hand, translating toaster manuals or generic copy could soon be automatic.
Yup. If you are bilingual, you quickly realize how some translations are very bad. How some translations are very good. And how hard it is to translate. With dry, simple text, it might be easy. But when it involves art?
Some jokes don't translate directly. There are puns. Sounds of words. Double meaning. Ambiguity. Cultural background. The creation of new words.
It can be reasonably argued that some poetry can be impossible to translate from some languages to others. A poem might be explained, but only by a lengthy, dissecting explanation that completely loses the point of it.
If you're a "React person", as the article puts it, friendly reminder that you can render components to HTML and serve that to the user.
I have done exactly that on a project that was under similar constraints. The UI models live in .tsx files and the browser gets pure HTML with zero JS by default.
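For anyone who hasn't tried it, here is a minimal sketch of that approach. It assumes the `react` and `react-dom` packages are installed, and `ItemList` is a made-up component for illustration, not anything from the project mentioned above. It uses `createElement` instead of JSX so it runs as a plain `.ts` file; in a `.tsx` file you would write the same thing with JSX.

```typescript
// Sketch only: assumes the `react` and `react-dom` packages are installed.
// `ItemList` and its props are hypothetical names used for illustration.
import { createElement } from "react";
import { renderToStaticMarkup } from "react-dom/server";

// An ordinary function component that renders a list of strings.
function ItemList(props: { items: string[] }) {
  return createElement(
    "ul",
    null,
    props.items.map((item) => createElement("li", { key: item }, item)),
  );
}

// renderToStaticMarkup returns a plain HTML string without hydration
// markers, so the browser needs no React or other JS to display it.
const html: string = renderToStaticMarkup(
  createElement(ItemList, { items: ["Alpha", "Beta"] }),
);

// Serve `html` from whatever server or static build step you already have.
console.log(html);
```

If you later want interactivity on some pages, you would need a different setup that ships JS and hydrates the markup; the static string above is only for the zero-JS case.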
I take the "2 unsolved" claim to mean "not solved by any model in any configuration in any stage with any number of attempts", the "benchmark results" are much lower. To be clear: it's extremely impressive, I still remember I was in utter disbelief when models started solving AIME problems, and this is obviously several levels above that.
It's also interesting that OpenAI models perform that much better on math and math-adjacent stuff. I assume this comes down to differences in post-training?
If you're trying to compare what the models are good at, important to note that the different models did not run with the same settings. In one case they also retried with GPT until it answered all the problems but did not retry with the other models.
GPT has 5 effort settings and they picked the highest (xhigh). Claude has 5 and they picked the middle one to avoid having to retry when it timed out. Gemini has medium or high effort and they picked medium.
The difference between GPT and Gemini concerning the "retry until..." can almost be ignored. I did rerun GPT a few times, but still way below what Gemini was not able to answer at all.
Look, I've never been someone who mindlessly hypes AI companies, as a matter of fact I think they have serious leadership problems across the board, but you people are straw-manning them so badly it actually makes me sympathize with them.
They aren't saying they have fully automated luxury AGI, they specifically list the ways models fall short of that bar and caution against people taking the 8x figure as the actual uplift number. At the same time they recognize that 80% of new code is now AI-authored, when two years ago those models were little more than toys. And frankly that checks out: if two years ago you told me we'd have something like Opus 4.8/GPT 5.5 I would have rolled to disbelieve.
> At the same time they recognize that 80% of new code is now AI-authored
I can set up a loop that will write a trillion lines of code automatically, how much of it is actually useful? Or are we back to counting LoC because there's no other metric for these systems that anyone can rely on?
It's 80% of new code they shipped that is AI authored.
Would you ship pointless code?
I do tend to agree though, it could be that AI solves problems with more code than a human would. What you need to measure is the value the code brings and how much of that is done by AI, hard to get an objective measure of that though.
I wouldn't, no. I don't see evidence that the engineers at Anthropic are similarly cautious, however. They describe Claude Code as "basically a game engine" when it's literally a TUI app, and it eats memory for no apparent reason. I fully believe that Anthropic would ship pointless and garbage code. Especially if it's being written by an LLM.
I could write a bash script that copies a codebase repeatedly in the pre-AI past as well, but I didn't do that because I wasn't stupid. More than 80% of my code is now AI-generated, and trust me I'm still not stupid. It was 0% only a year ago.
Who says LoC is the only metric we should rely on? A software product should first and foremost meet user requirements, functionality and performance. Judging from the sensational rise of Anthropic's user base and revenue I think we can safely say they're in that ballpark.
I'm dumb as a rock and I don't have a PhD, but since ~1 year ago I started forcing myself to do small bits of coding and math manually.
I'm not noticing a "cognitive decline" per se, but I do see I'm a lot "lazier", even stuff that used to be routine when I started coding now feels heavy.
Yes, precisely. Assessing your own cognitive skills is dubious. I’m pretty certain I’m less clever than I was when younger but if I find a problem tough now maybe 25 yo me would also have struggled?
> even stuff that used to be routine when I started coding now feels heavy.
The same weight feeling heavier is a sign that your muscles are weaker :)
There are many areas in life where we look back a few decades and think "people used to do it that awkwardly?" And yet results were better. I think the process of removing friction has just served to destroy our ability to concentrate and tolerate difficulty.
> but I do see I'm a lot "lazier", even stuff that used to be routine when I started coding now feels heavy.
Not getting that quick dopamine hit the LLMs give you..
Some say you can re-train your system to get back the dopamine hits you used to get from other things, like the enjoyment of the "old fashioned" manual coding and math. Getting there is hard work. And YMMV.
I just do things manually and ask LLMs to check my work. That seems to be working great for me.
I had the most Russian of Russian bosses when I was in college. My first day on the job he so eloquently stated, "I am not your mother. Do not come to me with problems. Come to me with solutions. I want to know what you tried and what did not work."
His advice has served me well in many areas of life too. I try my best to treat LLMs no differently for domains I care about (not one-off little questions here and there).
I do a similar version of this, where if I notice a mistake in generated code, I fix it manually (or at least attempt to) instead of telling Claude to fix it.
I use an agent to generate a first-pass attempt, and then (deadlines willing), I manually read every line at least once so I understand what the code actually does.
Then I manually fix the inevitable slop that is mixed in with the good stuff, and only once the code is up to my personal standards do I send it.
This probably reduces my “AI performance boost” to 30-50% instead of the huge gains reported by others. But I retain the ability to reason about the codebase and use AI much more precisely when I’m trying to troubleshoot production outages or subtle bugs — something I notice the rest of my team struggles with, since adopting “agentic workflows” everywhere.
I think actively working to retain some cognitive flexibility and “muscle memory” around coding tasks is going to be rather advantageous in the long run.
Same, but also because it feels like it takes longer for an LLM to do it. I think that's something people who are into gathering personal metrics should do - measure how long it takes to type a prompt / have the LLM fix things vs just doing it yourself.