Yep. Wonder how long they'll wait before removing the 1x models, especially Gemini if it takes a long time, though they might just grab at an overdue reclassification from "preview" to "GA" to push it through.
To an extent. That economic incentive stops making sense when a) capacity is an actual constraint and b) Anthropic is not a monopoly and is subject to pressure from competitors who are more user-friendly.
These changes fixed some of the token issues, but the token bloat is intrinsic to the model, and Anthropic's solution of defaulting to xhigh reasoning for Opus 4.7 just means you'll burn through tokens faster anyway.
This is frankly exciting. Politics aside, it always feels great to wake up to a new model release. I'll personally stay up quite late tonight if GPT-5.5 drops in Codex.
I don't find it exciting at all. I just feel anxiety about my career and my place in the world. I have a set of skills that I've developed over many years. I care about what I create. I consider it a craft. When I use my skills to solve a hard problem, I feel good about myself. When the AI does the work for me, I don't get that sense of accomplishment. I am seeing my value evaporate before my eyes.
I hate this stuff and I wish it had never been invented.
You might want to rethink this. Think of it as the opportunity of a lifetime, the beginning of a new era, same as the early Internet, where you have the chance to set yourself up for life. The window is getting shorter and shorter, but you can't deny that you have the potential NOW to thrive, or to start multiple businesses without much capital. Consider also that the best thing, in the end, is probably to build great things, regardless of how we build them, and make the world progress.
The more interesting part of the announcement than "it's better at benchmarks":
> To better utilize GPUs, Codex analyzed weeks’ worth of production traffic patterns and wrote custom heuristic algorithms to optimally partition and balance work. The effort had an outsized impact, increasing token generation speeds by over 20%.
The ability of agentic LLMs to improve computational efficiency/speed is a highly impactful domain that I wish were tested more rigorously than just with benchmarks. In my experience Opus is still much better than GPT/Codex in this respect, but OpenAI is getting material gains out of this kind of performancemaxxing and has a growing incentive to keep at it given its cost/capacity issues, so I wonder if they'll continue optimizing for it.
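The announcement doesn't describe the heuristics Codex actually wrote, so purely as a hedged sketch: partitioning and balancing work by observed cost often boils down to something like a greedy longest-processing-time assignment. All names and numbers below are invented for illustration.

```rust
// Hypothetical sketch of load-aware partitioning (not OpenAI's actual
// heuristic): assign the heaviest tasks first, each to the currently
// least-loaded worker (greedy LPT scheduling).
fn partition(mut costs: Vec<u64>, workers: usize) -> Vec<Vec<u64>> {
    // Each bin tracks (total load, assigned task costs).
    let mut bins: Vec<(u64, Vec<u64>)> = vec![(0, Vec::new()); workers];
    costs.sort_unstable_by(|a, b| b.cmp(a)); // heaviest first
    for c in costs {
        let bin = bins.iter_mut().min_by_key(|b| b.0).unwrap();
        bin.0 += c;
        bin.1.push(c);
    }
    bins.into_iter().map(|(_, tasks)| tasks).collect()
}

fn main() {
    let parts = partition(vec![9, 7, 6, 5, 4, 2], 3);
    let loads: Vec<u64> = parts.iter().map(|p| p.iter().sum()).collect();
    println!("per-worker loads: {:?}", loads); // balanced: [11, 11, 11]
}
```

The interesting part in the announcement isn't the algorithm (LPT is textbook) but that the model mined weeks of production traffic to pick the cost estimates feeding it.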
There's already KernelBench which tests CUDA kernel optimizations.
On the other hand, all companies know that optimizing their own infrastructure/models is the critical path to "winning" against the competition, so you can bet they're serious about it.
So, I'm working on some high-performance data processing in Rust. I had hit some performance walls and needed improvements on the order of 100x or more.
I remembered the famous FizzBuzz Intel codegolf optimizations and gave them to Gemini Pro, along with my code and instructions to "suggest optimizations similar to those, maybe not so low level, but clever", and its suggestions were very cool.
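(For anyone unfamiliar with that codegolf thread, here's a hypothetical example of the kind of "clever but not too low-level" trick in that vein, not what Gemini actually suggested: FizzBuzz's output repeats every 15 numbers, so the per-item modulo branches can be replaced with one precomputed cycle.)

```rust
// Sketch: exploit the period-15 structure instead of testing
// divisibility by 3 and 5 on every iteration.
fn fizzbuzz_cycle(n: u64) -> Vec<String> {
    // Entry i holds the fixed text for numbers where x % 15 == i;
    // an empty string means "print the number itself".
    const PATTERN: [&str; 15] = [
        "FizzBuzz", "", "", "Fizz", "", "Buzz", "Fizz", "", "",
        "Fizz", "Buzz", "", "Fizz", "", "",
    ];
    (1..=n)
        .map(|i| {
            let s = PATTERN[(i % 15) as usize];
            if s.is_empty() { i.to_string() } else { s.to_string() }
        })
        .collect()
}

fn main() {
    println!("{:?}", fizzbuzz_cycle(15));
}
```

The heavily optimized versions in that thread go much further (SIMD, avoiding formatting entirely by patching digits in a byte buffer), but the table-over-branches idea is the accessible end of the same spectrum.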
Honestly, the problem with claims like these is that they're anecdotal; how can anyone reproduce them? I love it when labs go beyond traditional benchies like MMLU and friends, but these kinds of statements don't help much either, unless it's a proper controlled study!
In a sense it's better than a benchmark: it's a practical, real-world, highly quantifiable improvement, assuming there are no quality regressions and all test cases pass. I have been experimenting with this workflow across a variety of computational domains and have achieved consistent results with both Opus and GPT. My coworkers have independently used Opus for optimization suggestions on services in prod, and they've led to much better performance (3x in some cases).
A more empirical test would be good for everyone (i.e. on equal hardware, give each agent the goal to implement an algorithm and make it as fast as possible, then quantify relative speed improvements that pass all test cases).
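A minimal sketch of what such a harness could look like (all names hypothetical): gate on correctness first, then compare wall-clock time between a baseline and a candidate implementation of the same task.

```rust
use std::time::{Duration, Instant};

// Run one implementation on a shared input, returning its result and
// elapsed wall-clock time. Real harnesses would use many iterations
// and warmup; this is just the shape of the comparison.
fn time_it<F: Fn(&[u64]) -> u64>(f: F, input: &[u64]) -> (u64, Duration) {
    let start = Instant::now();
    let out = f(input);
    (out, start.elapsed())
}

fn main() {
    let input: Vec<u64> = (0..1_000_000).collect();
    let baseline = |xs: &[u64]| xs.iter().sum::<u64>();
    // Stand-in for an agent-written variant of the same computation.
    let candidate = |xs: &[u64]| xs.chunks(4).map(|c| c.iter().sum::<u64>()).sum::<u64>();

    let (a, ta) = time_it(baseline, &input);
    let (b, tb) = time_it(candidate, &input);
    assert_eq!(a, b); // a speedup only counts if the answers match
    println!("baseline {:?}, candidate {:?}", ta, tb);
}
```

Running each agent's submission through the same gate on identical hardware would turn "Codex made it 20% faster" into a number anyone can check.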
The tension here is that what customers need to reproduce is this result on their own problem. To measure this you need extensive evals on private data.
OpenAI simply won’t share the data you need to reproduce this in the way you’d hope for an academic paper.
Oh, come on, if they do well on benchmarks people question how applicable they are in reality. If they do well in reality people complain that it's not a reproducible benchmark...
There's an obvious subtext that Copilot will try to phase out all 1x premium multipliers in order to actually make money off of it.