
Personally, I’m tired of exaggerated claims and hype peddlers.

Edit: Frankly, accusing perceived opponents of being too afraid to see the truth is poor argumentative practice, and practically never true.


If you saw the Claude Code leak, you’d know the harness is anything but simple. It’s a sprawling, labyrinthine mess, but it’s required to make LLMs somewhat deterministic and useful as tools.

That’s also because of how Claude Code was written. It doesn’t have to be that way per se.

It's pretty easy to get determinism with a simple harness for a well-defined set of tasks with the recent models that are post-trained for tool use. CC probably gets some bloat because it tries to do a LOT more, and some bloat because it's grown organically.

>It's pretty easy to get determinism with a simple harness for a well-defined set of tasks with the recent models that are post-trained for tool use.

Do you have a source? Claude Code is the only agentic system that seems to really work well enough to be useful, and it's equipped with an absolutely absurd amount of testing and redundancy to get there.


Most hard data is in company-internal evals, but for well-defined external tasks it's been pretty easy to spin up a basic tool loop and validate; something like the sketch below. Did you have something in mind? [I don't necessarily count 'coding' as well-defined in the generic sense, so I suspect we're coming at this from different scopes re: the definition of 'LLMs somewhat deterministic and useful as tools']
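
A minimal sketch of such a loop, assuming the OpenAI-style tool-calling API (the tool is a stand-in; swap in whatever your task needs):

    import json
    from openai import OpenAI

    client = OpenAI()

    def get_weather(city: str) -> str:
        # Stand-in tool; replace with your own function.
        return f"Sunny in {city}"

    TOOLS = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    messages = [{"role": "user", "content": "Weather in Oslo?"}]
    while True:
        resp = client.chat.completions.create(
            model="gpt-4o",  # any model post-trained for tool use
            messages=messages,
            tools=TOOLS,
            temperature=0,   # pin sampling down for repeatability
        )
        msg = resp.choices[0].message
        if not msg.tool_calls:  # no tool calls left: model is done
            print(msg.content)
            break
        messages.append(msg)
        for call in msg.tool_calls:
            # Single tool here, so dispatch is trivial; a real harness
            # would route on call.function.name.
            args = json.loads(call.function.arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": get_weather(**args),
            })

Most of what a production harness adds on top of that loop is error handling, permissions, and UX.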

Hypothesis: it's a sprawling, labyrinthine mess because it was grown at high speed using Claude Code.

There’s a lot of redundancy, because there has to be to make the system useful. It’s a hacked together mess.

I think you’ve got your answer, then. If nobody can tell you what it’s really used for, it likely doesn’t have any real use cases.

AI labs think they’re building an autonomous replacement for software engineers, while software engineers see these systems as tools to supplement the process of software engineering.

> AI labs think they’re building an autonomous replacement for software engineers

And management everywhere is convinced that's what they are paying for. My company is replacing job titles with "builder". Apparently these tools will make builders out of paper pushers hiding in corporate bureaucracy. I am suddenly the same as them now, per my company's management.


Yeah that's the disconnect though right? Even with the best frontier models, you need to do a lot of system design work, planning, and reviewing before you can let these models run.

These models are infinitely more effective when piloted by a seasoned software engineer and that will always be the case so long as these models require some level of prompting to function.

Better prompts come from more knowledgeable users, and I don't think we can just make a better model to change that.

The idea we're going to completely replace software engineers with agents has always been delusional, so anchoring their roadmap to that future just seems silly from a product design perspective.

It's just frustrating that Cursor had a good attitude towards AI coding agents and now seems to be abandoning it, likely in a play to appease investors who are drunk on AI psychosis.

Edit: This comment might have come off more callous than I intended. I just really love Cursor as a product and don't want to see it get eaten by the "AI is going to replace everything!" crowd.


AI labs won't replace all of the engineers, but engineers becoming more productive leads to smaller team sizes.

Smaller teams working on a much more diverse set of problems.

The truth is absolutely nobody knows how this will all shake out.


I think this is still useful research that calls into question how “smart” these models are. If the model needs a separate tool to solve a problem, has the model really solved the problem, or just outsourced it to a harness that it’s been trained - via reinforcement learning - to call upon?

Does it matter whether the LLM solves the problem itself or simply knows to use a resource?

There’s plenty of math that I couldn’t even begin to solve without a calculator or other tool. Doesn’t mean I’m not solving math problems.

In woodworking, the advice is to let the tool do the work. Does someone using a power saw have less claim to having built something than a handsaw user? Does a CNC user not count as a woodworker because the machine is doing the part that would be hard or impossible for a human?


It does matter, because the LLM doesn't always know when to use tools (e.g. ask it for sales projections similar to something already in its weights) and is unable to reason about the boundaries of its knowledge.

It has "outsourced" it to another component, sure, but does that matter?

What the user sees is the total behavior of the entire system, not whether the system has internal divisions and separations.


It matters if you’re curious about whether AGI is possible. Have we really built “thinking machines”, or are these systems just elaborate harnesses that leverage the non-deterministic nature of LLMs?

An "elaborate harness" that can break down a problem into sub-tasks, write Python scripts for the ones it can't solve itself, and then combine the results, seems able to solve a wide range of cognitive tasks?

At least in theory.
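
A rough sketch of that pattern, with the model call stubbed out (wire llm() to whichever provider you use; all names here are illustrative):

    import subprocess
    import sys
    import tempfile

    def llm(prompt: str) -> str:
        # Stub: call your model provider here.
        raise NotImplementedError

    def run_python(code: str) -> str:
        # Run model-written code out of process and capture stdout.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, text=True, timeout=30)
        return proc.stdout

    def solve(task: str) -> str:
        # 1. Ask the model to offload the hard part to code.
        code = llm(f"Write a Python script that prints the answer to: {task}")
        # 2. Execute the script deterministically.
        result = run_python(code)
        # 3. Let the model fold the result into a final answer.
        return llm(f"Task: {task}\nScript output: {result}\nFinal answer:")

Whether that pipeline counts as "thinking" is exactly the question upthread.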


What is the difference? If the "elaborate harness" consists of a mix of "classical" code and ML model invocations, at which point does it become disqualified from consideration as a "thinking machine"? Best we can tell, even our brains have parts that are "dumb", interfacing with the parts that we consider "where the magic happens".

There’s a certain type of user here who reacts with rage when anyone points out flaws with LLMs. Why is that?

Please don't start generic flamewars on HN or impugn people who take an opposing view to yours. Both these vectors lead to tedious, unenlightening threads.

There's plenty of rage to go around on literally every divisive topic, and it's not the place we want discussions to come from here.

"Eschew flamebait. Avoid generic tangents."

"Comments should get more thoughtful and substantive, not less, as a topic gets more divisive."

https://news.ycombinator.com/newsguidelines.html


There are other users in this very thread using inflammatory language to attack this paper and those who find the paper compelling. One user says, quote: “You just can't reason with the anti-LLM group.”

In light of this, why was my comment - which was in large part a reaction to the behavior of the users described above - the only one called out here?


Purely because I didn't see the others.

Fair enough

Thanks! You might be surprised at how meaningful that response is to me.

No disrespect to them, but unless there is a financial incentive at stake for them (beyond S&P 500 exposure), I've come to view this through the lens of sports teams, gaming consoles, and religions. You pick your side early, guided by hype, and there is no way that choice can have been wrong (just like the Wii U, Dreamcast, etc. was the best).

Their viewpoint on this technology has unfortunately become part of some people's identity, and any position that isn't either "AGI imminent" or "this is useless" can trigger some major emotions.

Thing is, this finding (like all the other LLM limits) doesn't mean these models aren't impactful or shouldn't be scrutinised, nor does it mean they are useless. The truth is likely just a bit more nuanced than either extreme.

Also, the mental health impact, job losses for white-collar workers, privacy issues, rights holders' concerns about training data collection: all the current-day impacts of LLMs are easily brushed aside by someone who believes LLMs are near the "everyone dies" stage, which just so happens to be a convenient belief if one were to run a lab. Same if you believe these models are useless and will never get better: any discussion about real-life impacts is seen as an attempt to slowly get you to accept LLMs as a reality, when to you they never were and never will be.


I have a friend who is a Microsoft stan who feels this way about LLMs too. He's convinced he'll become the most powerful, creative and productive genius of all time if he just manages to master the LLM workflow just right.

He's retired, so I guess there's no harm in letting him try.


I tend to be annoyed whenever I see a paper with a scandalous title like that, because all such papers I've seen previously were (charitably) bad or (uncharitably) intentionally misleading. Like that infamous Apple paper "The Illusion of Thinking", where the researchers didn't care that the solution to the problem they posed (Towers of Hanoi with N up to 20) couldn't possibly fit in the allotted output space.

I checked the paper and found that absolutely no reasoning was used for the experiments, so it was as good as using an instant model. We already know that reasoning is necessary to solve anything even slightly complicated.

In this case your intuition is completely valid, and this is yet another case of a misleading paper.


> There’s a certain type of person who reacts with rage when anyone points out flaws with <thing>. Why is that?

FIFY; it's not endemic to here or to LLMs. Point out Mac issues to an Apple fan, problems with a vehicle to an <insert car brand/model> fan, that their favorite band sucks, that the representative they voted for is a PoS.

Most people aren't completely objective about everything and thus have some non-objective emotional attachment to things they like. A subset of those people perceive criticism as a personal attack, are compelled to defend their position, or are otherwise unable to accept/internalize that criticism so they respond with anger or rage.


This paper itself is flawed, misleading, and unethical to publish, because the prompts they used resulted in zero reasoning tokens. It's like asking a person, point blank and without thinking, to evaluate whether a string is balanced. Why do this? And the worst part is, most people in this thread bought the headline as-is from a flawed article. What does it say about you that you just accepted it without any skepticism?
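
For reference, the task itself is trivial for a deterministic program; the classic stack check settles it in a dozen lines:

    PAIRS = {")": "(", "]": "[", "}": "{"}

    def is_balanced(s: str) -> bool:
        stack = []
        for ch in s:
            if ch in "([{":
                stack.append(ch)
            elif ch in PAIRS:
                if not stack or stack.pop() != PAIRS[ch]:
                    return False
        return not stack

    assert is_balanced("([]{})") and not is_balanced("([)]")

The interesting question is whether a model can do the equivalent bookkeeping when it isn't allowed to reason first, and that's precisely what the prompts here suppressed.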

It's bizarre as hell. Another response compares it to sports fandom, which tracks. It reminds me of the "flair up" ethos of r/CFB: they believe you're not allowed to comment on anything unless you declare which NCAA American football team you're a fan of, because once you do, anything you ever say can be dismissed with "ah, rich coming from a fan of team X". As if no discussion that might be construed as criticism can ever be had unless your own tribe is perfect and beyond critique itself.

This is stupid enough in the realm of sports fandom, but how does it make any sense in science? Imagine if, any time we studied or enumerated the cognitive biases and logical fallacies in human thinking, the gut response of these same people was an immediate "yeah, well dogs are even stupider!" No shit, but it's a non sequitur. Are we forever banned from studying the capabilities and limitations of software systems because humans also have limitations?


I suspect they're afraid that if the hype dies, so will the pace of progress on LLMs as well as their cheap/free usage of them.

That is, in fact, how it comes across. You’re labeling perceived opponents as “emotional” and “dismissive”.

If it’s a bubble, it has to pop sooner or later. It’s better if it pops now before growing even larger.

Why do you believe progress is currently exponential? There’s one dubious chart showing “exponential growth” in a single narrow domain, and otherwise zero evidence to suggest exponential improvement.

For this use case, how do LLMs provide more value than a standard search engine? They may actually be destroying value here, as LLM-generated text pollutes search results.

They let you find things faster, and they combine and synthesize information from different sources. I'm more on the AI-skeptic side and have always prided myself on my Google-fu, but nowadays chatbots can really save a lot of time with that.

Publishing LLM-generated text is a separate use case; I'm not a fan of that.

