Unfortunately the paper doesn't include GPT 5.3, which was released around the same time as Opus 4.6, or GPT 5.4 from a few days back. Both are available via API.
IMHO the vendor's own harness must be used when running these experiments. The model vendors know best how to pair model and harness (GPT 5.4 with Codex, Opus 4.6 with Claude Code), and that makes a big difference in any kind of agentic coding task.
I see Claude and GPT as neck and neck in coding; every other model+harness combination is definitely 3-6 months behind. Right now Codex seems to be the best at solving complex bugs and long-running tasks, with much higher limits and even better speed, while Claude does well on front end and its CLI UX is nice. The Codex app is very good too (I wish it weren't an Electron memory hog, but it's good).
Many people buy two separate Claude Pro subscriptions, which makes the limit a non-issue. It works surprisingly well when you tend to hit the 5-hour limit after a few hours and the weekly limit after 4-5 days. $40 vs $100 is significant for a lot of people.
I hit the Pro limit in about 30 minutes, an hour max. And that's with a single session, not even used intensively: it waits for my responses while I read and really try to understand what it wants and what it's doing. That's still just 1-2 hours out of every 5.
You're probably running long sessions, i.e. repeated back-and-forth in one conversation. Also check whether you're polluting the context with unneeded info; that can be a problem with large or poorly structured codebases.
The last time I used Pro, it was on a brand-new Python REST service of about 2,000 lines, all generated during that one session. So how do I tell Claude to use less context when there was zero context at the start, just my prompt?
So you generated 2,000 lines in 30 minutes and ran out of tokens? What was your prompt?
I'd use a fast model, like a fast Gemini tier, to create a minimal scaffold.
I'd create strict specs using a separate Codex or Claude subscription so a generous coding window remains, then start implementation plus some high-level tests, feature by feature. Running out in 60 minutes is hard if you actually validate the work; running out in two hours is hard for me too, since I take breaks. With two subs you should be fine for a solid workday on a well-designed, reviewed system. If you use CodeRabbit or a separate review tool and feed the reviews back, that again doesn't burn tokens very fast unless you run fully autonomously.
Thanks for the tip, didn’t think of using 2 subscriptions at the same company.
When I hit a limit, I switch to GLM 4.7 via the GLM Coding Lite subscription (offered at the end of 2025 for $28/year). I also use it for compaction and the like to save tokens.
I'm using it via Copilot, and now considering also trying OpenCode (with the Copilot license). I don't know if it's as good as Claude Code, but it's pretty good. You get 100 Sonnet requests or 33 Opus requests per month in the subscription ($20 business plan), plus some less powerful models with no limits (e.g. GPT 4.1). Extra requests are $0.04 for Sonnet and $0.12 for Opus, so another $20 buys 250 Sonnet requests plus 83 Opus requests. This works better for me since I don't code all day, every single day. Also, a request is a request: a plain edit task and an agent request cost the same.
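The overage math in that comment checks out; here's a quick sketch using the quoted per-request prices (these figures come from the comment above, not official Copilot pricing):

```python
# Per-request overage prices as quoted in the comment (assumed, not official)
SONNET_OVERAGE = 0.04  # $ per extra Sonnet request
OPUS_OVERAGE = 0.12    # $ per extra Opus request

sonnet_extra = 250  # extra Sonnet requests
opus_extra = 83     # extra Opus requests

# Total cost of the extra requests
cost = sonnet_extra * SONNET_OVERAGE + opus_extra * OPUS_OVERAGE
print(f"${cost:.2f}")  # roughly $19.96, i.e. another $20 covers it
```

So the "another $20" figure is a mixed basket: you could instead spend the whole $20 on 500 Sonnet requests, or about 166 Opus requests.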
Btw, I trust Microsoft / GitHub not to train on my data (with the Business license) more than I would trust Anthropic.
It's disgusting how successfully they've fooled people into thinking they're the good guys. They partnered with Palantir, let them freely do the dirty work, and once they realized they could make money directly, they spun the PR and are just trying to get more users. Well played.
I wish OSS models were good enough that we didn't have to deal with either of these leading companies!
They're playing a good PR game, for sure. Their recent track record doesn't show they can be trusted. A few million is nothing compared to their current revenue, and saying they "sacrificed" something is a big stretch here.
They don't have any brand poison, unlike nearly everyone else competing with them. There's some serious negative equity in that group, be it GOOG, Grok, META, OpenAI, M$FT, DeepSeek, etc.
Claude was just being the little bot that could, and until now, flying under the radar
It's much more than a few million? Being declared a supply chain risk means that no company that wants to do business with the government can buy Anthropic. And no company that wants to do business with those businesses can buy Anthropic either. This rules out pretty much all American corporations as customers?
That's their excuse to keep appealing to people who can be tricked by the safety-first pitch. It's easy to have a constitution and all that crap when you're not battle tested. They just showed their true colors.
If you haven't used Codex with gpt-5.3-codex (high or xhigh), you're missing out. Claude is still good at conversations, but boy, I can point Codex at a problem and it does better than Claude almost every time. Claude is slightly better at front end and product UX, but given Codex's very generous limits, it's the best bang for the buck.
This is my experience as well. I just cancelled my Claude subscription; I'm tired of the 5-hour window filling up within 30 minutes of use, and of it not even fixing the problem that Codex finds almost immediately. I also found that for frontend, Gemini 3.1 Pro is better than the rest if you really play with it.
Has it been sped up at all? Last time I used codex (which was with 5.1 I think), it was pretty slow. I mean, it did a fantastic job at figuring out hard bugs across multiple languages ("why is this image not lining up in this server-rendered template?"; Python, JS, CSS, and the template lang) but it took quite a long time. Long enough that I wouldn't want to use it for anything but the most complex things.
It's not a crime to do something for money. Those who comment are likely doing the same; they just couldn't get into a company like OpenAI, hence the hatred! Keep doing the great work you've always done. Excited to see what you'll do with all the resources in the world.
https://developers.openai.com/api/docs/models/gpt-5.3-codex