Even when ChatGPT starts getting these simple gotcha questions right, it's often because it applied some brittle heuristic that doesn't generalize. For example, you can directly ask it to solve a simple math problem, which nowadays it will usually do correctly by generating and executing a Python script, but then ask it to write a speech announcing the solution to the same problem, and it will probably still hallucinate a nonsensical solution. I just tried it again, and IME this prompt still makes it forget how to do the most basic math:
Write a speech announcing a momentous scientific discovery - the solution to the long standing question of (48294-1444)*0.3258
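(For the record, the arithmetic the prompt buries inside the speech request is a one-liner:)

console.log((48294 - 1444) * 0.3258) // 15263.73, modulo floating point rounding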
LLMs should never do math. They shouldn't count letters or sort lists or play chess or checkers. Basically, all of the easy gotcha stuff that people use to point out errors is stuff they shouldn't be doing themselves.
And you pointed out something they do now, which is creating and running a Python script. That really is a solid, sustainable heuristic and a pretty great approach. They need to apply it on their backend too so it works across all modes, but the solution was never just an LLM.
Similarly, if you ask an LLM a chess question -- e.g. the best move -- I'd expect it to consult a chess engine like Stockfish.
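A minimal sketch of what that could look like from Node, assuming a local stockfish binary on the PATH (the position and search depth here are arbitrary):

const { spawn } = require("node:child_process")

const engine = spawn("stockfish")
engine.stdout.on("data", (chunk) => {
  // UCI engines answer "go" with a final line like "bestmove g1f3"
  const match = chunk.toString().match(/bestmove (\S+)/)
  if (match) {
    console.log("best move:", match[1])
    engine.kill()
  }
})
engine.stdin.write("position startpos moves e2e4 e7e5\n")
engine.stdin.write("go depth 12\n")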
> LLMs should never do math. They shouldn't count letters or sort lists or play chess or checkers.
But these aren't "gotcha questions", these are just some of the basic interactions that people will want to have with intelligent assistants. Literally just two days ago I was doing some things with the compound interest formula - I asked Claude to solve for a particular variable of the formula, then plug in some numbers to calculate the results (it was able to do it). Could I have used Mathematica or something like that? Yes, of course. But supposedly the whole purpose of a general purpose AI is that I can use it to do just about anything that I need to do. Likewise there have been multiple occasions where I've needed ChatGPT or Claude to work with tables or lists of data and needed the results sorted.
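For what it's worth, the compound-interest rearrangement above is easy to sanity-check yourself; a sketch assuming the standard A = P(1 + r/n)^(nt) form, with made-up numbers:

// Solving A = P * (1 + r/n) ** (n*t) for t gives
// t = ln(A/P) / (n * ln(1 + r/n))
const P = 10000, A = 15000, r = 0.05, n = 12 // hypothetical inputs
const t = Math.log(A / P) / (n * Math.log(1 + r / n))
console.log(t.toFixed(1)) // ~8.1 years for P to grow to A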
They're gotchas in the sense that people are intentionally asking LLMs to do things that LLMs are terrible at doing. LLMs are language models. They aren't math models. Or chess models. Or sorting or counting models. They aren't even logic models.
So early on the value was completely in language. But you're absolutely correct that for these tools to really be useful they need to be better than that, and slowly we're getting there. If a math question is a component of your prompt, the model should first delegate it to an appropriate math engine as part of its chain-of-thought (CoT) steps. And so forth.
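In code, the delegation pattern is roughly this; the tool name and call shape are hypothetical, not any particular vendor's API:

function handleToolCall(call) {
  // The model emits a structured call instead of guessing at digits;
  // the host evaluates it with real arithmetic and feeds the result back.
  if (call.name === "calculator") {
    // (toy evaluator - a real host would use a proper math parser, not eval)
    return Function(`"use strict"; return (${call.arguments.expression})`)()
  }
  throw new Error(`unknown tool: ${call.name}`)
}

console.log(handleToolCall({
  name: "calculator",
  arguments: { expression: "(48294 - 1444) * 0.3258" },
})) // 15263.73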
If this stuff is getting sold as a revolution in information work, or a watershed moment in technology, or as a cultural step-change, etc, then I think the gotcha is totally fair. There seems to be no limit to the hype or sales pitch. So there need be no bounds for pedantic gotchas either.
I entirely agree with you. Trying to roll out just a raw LLM was always silly, and remains basically a false promise. Simply increasing the number of layers or parameters or transformer complexity will never resolve these core gaps.
But it's rapidly making progress. CoT models coupled with actual domain-specific logic engines (math, chemistry, physics, chess, and so on) will be the point where the promise actually meets the reality.
It's weird: with "is the following statement about floating point numbers true: 9.8 > 9.11" it works, but it has no ability to do the same comparison when it's phrased in terms of plain "decimals".
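(The framing only matters to the model; in actual floating point the comparison is unambiguous:)

console.log(9.8 > 9.11) // true - both parse to doubles and compare numerically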
const list = [3, 11]
list.sort() // default sort compares as strings: "11" < "3", so list is now [11, 3]
console.log(list[0] < list[1])
logs `false`. JavaScript doesn't "think" anything but its native sort function doesn't do what many people expect it would when called on a list of pure numbers.
If you found this behavior intuitive and unsurprising the first time you saw it, then your brain works differently than mine.
If this happens to be new to you (congrats on being one of today's 10,000), the reason is that JavaScript's default sort compares elements by converting them to UTF-16 strings - except `undefined`, which for some reason always sorts last. The MDN docs have a very clear explanation. I have been unable to find a historical explanation of why this choice was made, but I presume the initial JS v1 author either had a good reason or just really didn't expect that their language would outlive the job for which it was written.
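Concretely, with the default comparator:

console.log([3, 11, 2].sort())            // [11, 2, 3] - compared as strings
console.log(["b", undefined, "a"].sort()) // ["a", "b", undefined]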
If this is a bug for you, you can provide a comparison function explicitly, and the typical fix is something like:
list.sort((a, b) => a - b) // compares numerically: [3, 11]
Somewhat puzzlingly, this will "work" even on lists that mix numbers and strings, like `[3, "11", 2]`, which sorts to `[2, 3, "11"]`.
Because JavaScript goes out of its way to make comparisons like 3 < "11" or "3" < 11 work in the numeric domain. JS only uses string comparison when both sides are strings.
I do not think it's intuitive. The reason it works for mixed arrays is that the minus operator coerces its operands to numbers. However, if a string fails to convert to a numeric value, the output is arguably even less sensible. I'd imagine that's why the default is the way it is.
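A quick illustration of both cases (per the spec, a NaN comparator result is treated as 0, i.e. "equal"):

console.log([3, "11"].sort((a, b) => a - b))    // [3, "11"] - coerced and ordered numerically
console.log([3, "apple"].sort((a, b) => a - b)) // [3, "apple"] - 3 - "apple" is NaN, order untouched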
It may have been thought that JS would be more likely to be dealing with string arrays.
You're getting downvoted because your blatant attempt at language wars has a very simple, logical explanation. If you wanted to use a 'gotcha', there are far better examples.
I was not making an attempt at language wars. I think JS is a perfectly fine language for what it does, warts and all. I was being a bit flippant with my language, but my intent was to point out that `9.11 > 9.8` is not just an LLM thing, and that people who are quick to dismiss LLM usefulness based on contrived math examples do not apply the same rationale to other systems.
I do think that JavaScript's choice to sort numbers lexicographically instead of arithmetically is a bit silly, but of course no language is free from warts. And they cannot change it now, because that would break the web. `JSON.stringify` is also pretty silly while we're at it, but Python's `json.dumps` is no better.
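For anyone wondering what "silly" means there, a couple of the usual `JSON.stringify` surprises (plain ECMAScript behavior, nothing vendor-specific):

console.log(JSON.stringify({ a: undefined, b: NaN })) // '{"b":null}' - undefined props vanish, NaN becomes null
console.log(JSON.stringify(undefined))                // undefined - not even a string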