Two things can be true at the same time: The technology has improved, and the technology in its current state still isn't fit for purpose.
I stress test commercially deployed LLMs like Gemini and Claude with trivial tasks: sports trivia, fixing recipes, explaining board game rules, etc. It works well like 95% of the time. That's fine for inconsequential things. But you'd have to be deeply irresponsible to accept that kind of error rate on things that actually matter.
The most intellectually honest way to evaluate these things is how they behave now on real tasks. Not with some unfalsifiable appeal to the future of "oh, they'll fix it."
The errors are also not distributed in the same way as you'd expect from a human. The tools can synthesize a whole feature in a moderately complicated web app including UI code, schema changes, etc, and it comes out perfectly. Then I ask for something simple like a shopping list of windshield wipers etc for the cars and that comes out wildly wrong (like wrong number of wipers for the cars, not just the wrong parts), stuff that a ten year old child would have no trouble with. I work in the field so I have a qualitative understanding of this behavior but I think it can be extremely confusing to many people.
One of the reasons I'm comfortable using them as coding agents is that I can and do review every line of code they generate, and those lines of code form a gate. No LLM-bullshit can get through that gate, except in the form of lines of code, that I can examine, and even if I do let some bullshit through accidentally, the bullshit is stateless and can be extracted later if necessary just like any other line of code. Or, to put it another way, the context window doesn't come with the code, forming this huge blob of context to be carried along... the code is just the code.
That exposes me to when the models are objectively wrong and helps keep me grounded with their utility in spaces I can check them less well. One of the most important things you can put in your prompt is a request for sources, followed by you actually checking them out.
And one of the things the coding agents teach me is that you need to keep the AIs on a tight leash. What is their equivalent in other domains of them "fixing" the test to pass instead of fixing the code to pass the test? In the programming space I can run "git diff *_test.go" to ensure they didn't hack the tests when I didn't expect it. It keeps me wondering what the equivalent of that is in my non-programming questions. I have unit testing suites to verify my LLM output against. What's the equivalent in other domains? Probably some other isolated domains here and there do have some equivalents. But in general there isn't one. Things like "completely forged graphs" are completely expected but it's hard to catch this when you lack the tools or the understanding to chase down "where did this graph actually come from?".
The success with programming can't be translated naively into domains that lack the tooling programmers built up over the years, and based on how many times the AIs bang into the guardrails the tools provide I would definitely suggest large amounts of skepticism in those domains that lack those guardrails.
> the technology in its current state still isn't fit for purpose.
This is a broad statement that assumes we agree on the purpose.
For my purpose, which is software development, the technology has reached a level that is entirely adequate.
Meanwhile, sports trivia represents a stress test of the model's memorized world knowledge. It could work really well if you give the model a tool to look up factual information in a structured database. But this is exactly what I meant above; using the technology in a suboptimal way is a human problem, not a model problem.
There's nothing in these models that say its purpose is software development. Their design and affordances scream out "use me for anything." The marketing certainly matches that, so do the UIs, so do the behaviors. So I take them at their word, and I see that failure modes are shockingly common even under regular use. I'm not out to break these things at all. I'm being as charitable and empirical as I can reasonably be.
If the purpose is indeed software development with review, then there's nothing stopping multi-billion dollar companies from putting friction into these sytems to direct users towards where the system is at its strongest.
Which things actually matter? I think we can all agree that an LLM isn't fit for purpose to control a nuclear power plant or fly a commercial airliner. But there's a huge spectrum of things below that. If an LLM trading error causes some hedge fund to fail then so what? It's only money.
Not to mention that it would then make some hedge fund with a better backtesting harness or more AI scrutiny more successful thus keeping the financial market work as designed.
> I stress test commercially deployed LLMs like Gemini and Claude with trivial tasks: sports trivia, fixing recipes, explaining board game rules, etc. It works well like 95% of the time. That's fine for inconsequential things. But you'd have to be deeply irresponsible to accept that kind of error rate on things that actually matter.
95% is not my experience and frankly dishonest.
I have ChatGPT open right now, can you give me examples where it doesn't work but some other source may have got it correct?
I have tested it against a lot of examples - it barely gets anything wrong with a text prompt that fits a few pages.
> The most intellectually honest way to evaluate these things is how they behave now on real tasks
A falsifiable way is to see how it is used in real life. There are loads of serious enterprise projects that are mostly done by LLMs. Almost all companies use AI. Either they are irresponsible or you are exaggerating.
Quite frankly, this is exactly like how two people can use the same compression program on two different files and get vastly different compression ratios (because one has a lot of redundancy and the other one has not).
But why do you need an example? Isn't it pretty well understood that LLMS will have trouble responding to stuff that is under represented in the training data?
You will just won't have any clue what that could be.
fair so it must be easy to give an example? I have ChatGPT open with 5.4-thinking. I'm honestly curious about what you can suggest since I have not been able to get it to bullshit easily.
I am not the OP, an I have only used ChatGPT free version. Last day I asked it something. It answered. Then I asked it to provide sources. Then it provided sources, and also changed its original answer. When I checked the new answers it was wrong, and when I checked sources, it didn't actually contain the information that I asked for, and thus it hallucinated the answers as well as the sources...
So it works well 95% of the time for literally a trivial use case. Imagine if any other tech tool had that kind of reliability: `ls` displays 95% of your files, your phone successfully sends and receives 95% of text messages, or Microsoft Word saving 95% of the characters you typed in. That's just not acceptable.
That's how I do it too. I'll tap bell once (and let the ring sustain) when I'm about ~5 seconds from overtaking them so people know there's something coming up behind them, and the sustained sound tells them how fast it's coming. This is especially important with runners, who are prone to suddenly take a U-turn if they're at the end of their route.
Pedestrians regularly wave acknowledgement or even say "thank you." Some other cyclists (especially on e-bikes) just blast by with no warning.
I'm not a fan of the "something better" phrasing myself. It's very much anti-systems-thinking.
Engineers should be honest that everything is a tradeoff. For the up-front convenience you get with phone tickets, you impose additional failure modes, dependency chains, and accessibility issues that simply weren't a problem with paper ticketing.
The "phone-ification" of everything will probably bite us in the behind in the future, just like the buildout of out car-centric environments does now.
>For the up-front convenience you get with phone tickets
Even as a person who does have a smartphone, I feel like phone tickets are anti-convenience because they rely on terrible apps like TicketMaster. It's only a positive trade-off for venues or whoever. If they texted or emailed me a QR code, that would be a positive tradeoff (and a texted QR code would probably work for this guy's flip phone too)
> I feel like phone tickets are anti-convenience because they rely on terrible apps like TicketMaster.
Case in point: I traveled from St. Louis to Houston for a concert a few years ago. Before I left home to catch my flight, I installed the Ticketmaster app on a phone and verified that I could bring up the tickets. When I tried it again in my hotel an hour before the conference, it no longer worked because the fraud detection in their app was apparently confused as to why I was now in Houston.
Fixing this took 45 minutes on hold to get a support agent and a frantic call to my wife so she could check the disused email address I used to sign up for Ticketmaster 20 years earlier and get the verification code they sent.
There are a lot of reasons to dislike digital tickets, but this is one of them. I used to go to dozens of concerts every year. Now it's such a hassle that I don't bother unless it's small venue that doesn't play these games.
That's fucking nightmarish. That's exactly the kind of scenario I'd think up and be told is "science fiction" by the kind of apologists who think forced usage of technology is okay.
We attended a once-in-a-lifetime show last fall (a performer who is aging and likely won't tour again) a two hour drive away. I wouldn't install the Ticketmaster app and played an old man "character" with the box office to get them to print my tickets and hold them at will call. I played the "we are driving in from out of town" card, etc, and they accommodated me.
I tried that with a closer venue a couple of months ago and got told, in no uncertain terms, "no app no admittance". I knuckled-under and loaded the app on my wife's iPhone (which she insists on keeping because Stockholm syndrome, I assume). I feel bad that I gave in (because it makes me part of the problem). I really wanted to see the show and I wasn't willing to forego it on principle. (Kinda embarrassing, actually.)
> That's fucking nightmarish. That's exactly the kind of scenario I'd think up and be told is "science fiction" by the kind of apologists who think forced usage of technology is okay.
Not to justify it, but we've been fighting this kind of crap for a long time with credit cards and their bonehead "anti-fraud" checks. I'm often on the phone with my credit card issuers every time I travel somewhere because their moronic systems think "different country = fraud" and lock me out until I call them and perform their pointless rituals for them over the phone. Even if you tell them in advance that you're traveling (which I object to because my vacation plans are none of their business), they still often get it wrong and flag you.
Why would you sign into Ticketmaster with an email address you don't have access to and use it to buy tickets?
Don't do that. Create a new account with the email address you have access to.
Apps require you to sign in again all the time, and send a verification code to your e-mail to do so. Changing locations is, yes, a reason to require sign in.
> Why would you sign into Ticketmaster with an email address you don't have access to and use it to buy tickets?
Because in the context of signing in, its role is that of a user ID.
Ticketmaster spams that address constantly. It's a valid email address, to be sure, but they've trained me over the years never to look at it. They certainly didn't do any multi-factor authentication when I bought the tickets, only when I was preparing to use them (despite having accessed them on that very device two days earlier).
Exactly. Ostensibly, one would assume that getting closer to the place you have a ticket for wouldn't flag the use as "suspicious". To have OP demand that everyone use the app, but then blame the user for... traveling to the venue? Wild.
> Apps require you to sign in again all the time, and send a verification code to your e-mail to do so. Changing locations is, yes, a reason to require sign in.
This is the bane of my existence. I manually copy/paste/delete a half-dozen codes from my email/SMS every single day.
If I was ruler, I'd mandate every one of these switch to TOTP 2FA and outright ban email verification other than for password resets.
>Apps require you to sign in again all the time, and send a verification code to your e-mail to do so. Changing locations is, yes, a reason to require sign in.
What? TicketMaster is the only app I use that does this. Probably because it's too hard for end-users to get rid of it. If some Telegram or some food delivery app or something tried to periodically re-prompt me to log in because I went outside my house or whatever, it would get uninstalled and replaced with something that didn't.
This is how I feel with the places that want to lock up your phone. There are safety considerations in that. But we're just astrotrufed into the "well this is better" PR campaigns from yondr.
A professional knows what they're worth and what they need to deliver. On-time payment according on an agreed-upon schedule is table stakes. That's the most fundamental requirement. Nothing happens without that.
The "what to do instead" section is basically DARPA's "Heilmeier Catechism," which is the framework they use to gauge high-risk high-reward ideas. It doesn't kill ideas, but it places the onus on the proposer to be clear-eyed and explicit about what they're putting forward:
What are you trying to do? Articulate your objectives using absolutely no jargon.
How is it done today, and what are the limits of current practice?
What is new in your approach and why do you think it will be successful?
Who cares? If you are successful, what difference will it make?
What are the risks?
How much will it cost?
How long will it take?
What are the mid-term and final “exams” to check for success?
Heilmeier Catechism is really interesting, thanks for linking that. I like how the questions break down different major aspects. It treats risk as one of the dimensions of evaluation but not entire conversation. That's the shift I was trying to describe. Critique is valuable as part of a complete picture, not when it is the only lens.
Social media feeds are designed to be slot machines. Each scroll is a pull. You may or may not get something you actually want. You can't predict what's coming up next, so you just keep mindlessly scrolling.
It's not just the scrolling, the posting side too. They all randomly boost one of your posts so suddenly tons of feedback (especially noticable when I tried threads) and then you try to get that back again. The uncertainty keeps you at it
It's a matter of affordances. The path of least resistance with agents is to let it commit whatever it wants. That's a natural outcome of the design and implementation of agents.
Yes, humans are accountable for the ultimate output. But so are the people who design and build these automation tools. As the saying goes, the purpose of a system is what it does.
The purpose of a personal assistant isn’t to fit people into your calendar. It’s to filter them out. They serve as a barrier to your time, not an enabler for other people to claim it. I don’t see how an AI can meaningfully accomplish that any better than simply just making yourself more difficult to reach.
> The purpose of a personal assistant isn’t to fit people into your calendar. It’s to filter them out. They serve as a barrier to your time, not an enabler for other people to claim it.
Scheduling in a larger org and/or with multiple equally busy people is a non-trivial, complex task; it makes sense to dedicate resources to the task. Good Executive Assistants are generally fairly smart folks, in my experience.
When the scale is substantially more and involves objects as well it evolves into multi-million $ ERM (Enterprise Resource Management) systems.
I'm a pretty busy person professionally. When I feel like I'm being pushed to "scale" my time and attention, I take that as a signal to do the opposite: do less, and do those remaining things more meaningfully.
Trying to do more is a losing game, and AI assistants just paper over that. We all have finite time and attention. I think a pragmatic engineering approach is the right one here: consider that as a non-negotiable constraint, a fact of the physical world, not something to magic away.
This is it right here. I've long thought about this one and whether I should bother with an AI agent that can do all of this stuff for me, but the reality is both what you said and I'm not rich enough.
Do I want the AI Agent to take my bank account and automatically pay some bill every month in full? What if you go a little over that month due to an emergency expense you weren't prepared for? And it's not a matter of "I don't have enough in my bank account for this one time charge", but it's "I don't have enough in my bank account for this charge and 3 others coming at the end of the month." type deal.
Agents aren't going to be very good at that. "Hey I paid $3,000 on your credit card in order to prevent you from incurring interest. Interest is really bad to carry on a credit card and you should minimize that as much as possible." Me: "Yeah but I needed that money for rent this month." Agent: "Oh, yeah! I should have taken that into account! It looks like we can't reverse the charge for the payment."
We’ve had computing technology that clearly understands what the user wants to do. It’s called a command line interface. No guessing, no recommendations, no dark patterns, no bullshit.
It's really hard to consider any kind of web dev as "engineering." Outcomes like this show that they don't have any particular care for constraints. It's throw-spaghetti-at-the-wall YOLO programming.
There are plenty of web devs who care about performance and engineering quality. But caring about such things when you work on something like a news site is impossible: These sites make their money through user tracking, and it's literally your job to stuff in as many 3rd-party trackers as management tells you to. Any dev who says no on the basis that it'll slow the site down will get fired as quickly as a chef who get a shift job in McDonalds and tries to argue for better cuisine.
I stress test commercially deployed LLMs like Gemini and Claude with trivial tasks: sports trivia, fixing recipes, explaining board game rules, etc. It works well like 95% of the time. That's fine for inconsequential things. But you'd have to be deeply irresponsible to accept that kind of error rate on things that actually matter.
The most intellectually honest way to evaluate these things is how they behave now on real tasks. Not with some unfalsifiable appeal to the future of "oh, they'll fix it."
reply