Yep -- one fun experiment early in the video shows that switching from Sonnet 4.5 to Opus 4.5 gave a 20% lift.
We do a bit of model-per-task routing: most calls send targeted, limited-context fetches to fast higher-tier models (frontier, but without heavy reasoning tokens), while the occasional larger data dump (logs/dataframes) goes to a faster-and-cheaper model. Commercially we're steering folks more toward OpenAI / Azure OpenAI models right now, but that's not at all inherent: OpenAI, Claude, and Gemini can all be made to perform well here using what the talk goes over.
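As a rough sketch of what that routing can look like (my own illustration, not our actual pipeline; the model names and the ~20k-character threshold are placeholders):

    from openai import OpenAI

    client = OpenAI()

    FRONTIER_MODEL = "gpt-4.1"       # fast higher-tier model, no heavy reasoning tokens
    BULK_MODEL = "gpt-4.1-mini"      # faster-and-cheaper model for large log/dataframe dumps

    def route_call(prompt: str, context: str) -> str:
        # Heuristic: treat anything past ~20k characters of context as a bulk dump.
        model = BULK_MODEL if len(context) > 20_000 else FRONTIER_MODEL
        resp = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are an investigation copilot."},
                {"role": "user", "content": f"{prompt}\n\n{context}"},
            ],
        )
        return resp.choices[0].message.content

The same shape works with the Anthropic or Gemini SDKs; only the client call changes.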
Some of the discussion early in the talk, and in the Q&A afterward, is about making OSS models production-grade for these kinds of investigation tasks. I find them fun to learn on and encourage homelab experiments, and for copilots you can get real mileage. For heavier production efforts, though, I typically don't recommend them to most teams right now on quality, speed, practicality, and budget grounds, assuming they have the option to go with frontier models. That said, some bigger shops are doing it, and I'd be happy to chat about how we're approaching quality/speed/cost there (and we're looking for partners to make this easier for everyone!)
I ran an experiment yesterday with Opus 4.5 operating in agent mode in VS Code Copilot. I handed it a live AWS STS session to see whether it could help us troubleshoot an issue. It was pretty remarkable watching it chop down the problem space and arrive at an accurate answer in just a few minutes.
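One way to set up a session like that (illustrative only, not necessarily the exact workflow here; the role ARN is a placeholder) is to mint short-lived, scoped STS credentials and export them into the agent's environment:

    import os
    import boto3

    sts = boto3.client("sts")
    creds = sts.assume_role(
        RoleArn="arn:aws:iam::123456789012:role/ReadOnlyTroubleshooting",
        RoleSessionName="copilot-agent",
        DurationSeconds=3600,
    )["Credentials"]

    # Scope the agent to these temporary credentials; they expire after an hour.
    os.environ["AWS_ACCESS_KEY_ID"] = creds["AccessKeyId"]
    os.environ["AWS_SECRET_ACCESS_KEY"] = creds["SecretAccessKey"]
    os.environ["AWS_SESSION_TOKEN"] = creds["SessionToken"]

That way every AWS call the agent makes is limited to the role's permissions and expires on its own.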
I'll definitely check out the video later. Thanks!
As multi-step reasoning and tool use expand, models effectively become distinct actors in the threat model. We have no idea how many different ways a model's alignment can be influenced by its context (the Anthropic paper on subliminal learning [1] was a bit eye-opening in this regard), and consequently we have no deterministic way to protect it.
Give https://rcl-lang.org/#intuitive-json-queries a try! It can fill a similar role, but the syntax is very similar to Python/TypeScript/Rust, so you don’t need an LLM to write the query for you.
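For reference, this is the kind of query I mean; the Python comprehension below stands in for the style (it's not RCL syntax, and data.json is made up):

    import json

    # jq:  .items[] | select(.status == "active") | .name
    with open("data.json") as f:
        data = json.load(f)

    names = [item["name"] for item in data["items"] if item["status"] == "active"]
    print(names)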
The issue isn’t jq’s syntax. It’s that I already use other tools that fill that niche, and have done so for about as long as jq has been a thing. And frankly, I personally believe the other tools are superior, so I don’t want to fall back to jq just because someone on HN tells me to.
It feels like there are several conversations happening that sound the same but are actually quite different.
One of them is whether or not large models are useful and/or becoming more useful over time. (To me, clearly the answer is yes)
The other is whether or not they live up to the hype. (To me, clearly the answer is no)
There are other skirmishes too: their capacity for novelty, their role in the economy, their impact on human cognition, if/when AGI might happen, and the overall impact on the largely tech-oriented community here on HN.
Yeah, it's a great question. I don't know the answer, but I suspect the people who study it strongly believe it is highly complex in this sense. Otherwise they would be looking for simpler representations instead of running massive simulations.
To your question, I think there is an elegant answer actually: most composite particles in QCD are unstable. They're either made of equal parts matter and antimatter (like pions), or they're heavier than the proton, in which case they can decay into one (or more) protons (or antiprotons). If any of the internal complexities of the proton made it distinguishable from other protons, they wouldn't both be protons, and one could decay into the other. Quantum mechanics also helps keep things simple by forcing the various properties of bound states to be quantized; there isn't a version of the proton where, e.g., one of the quarks has a little more energy, similar to how the energies of atomic orbitals are quantized.
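For concreteness on that last point (a standard textbook result, not specific to QCD): hydrogen's bound-state energies come in a discrete set,

    E_n = -\frac{13.6\ \text{eV}}{n^2}, \qquad n = 1, 2, 3, \ldots

so there is no state sitting "a little above" the ground state with any energy you like; bound quark states are constrained in the same way.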
I had the same thoughts reading this. I think there's an optimal blend of blurters and thinkers; one isn't better than the other. I find that I do both; it just depends on my comfort with the subject matter.
This is one of those areas where family starts to influence decisions. My wife and I had kids between 24 and 28. From that point forward, 'supporting the family' took priority over personal fulfillment.
Now that our kids are grown and self-supporting, it's wild how much simpler the risk calculation is. But at 52, with engineering manager as the dominant role on my CV, I'm not particularly appealing to the small companies making big moves that I'm interested in.
What's the use of extrapolating what I said to its extreme? In 1999 I quit my secure full-time job to start a company. My wife was a stay-at-home mom, we had a new baby, and we had a new mortgage on a house that was still being built. Yes, it's possible, but counterexamples don't mean it's not a factor.