Yeah, you need specific instruct training for that sort of thing; Claude Opus is one of the rare examples that does this kind of sanity check quite often and even admits when it doesn't know something.
These days it's all about confidently bullshitting on benchmarks and overfitting on common riddles to make pointless numbers go up. The more impressive models get on paper, the more rubbish they are in practice.
Gemini 2.5 is actually pretty good at this. It's the only model ever to tell me "no" to a request in Cursor.
I asked it to add websocket support for my app and it responded with something like, "Looks like you're using long polling now. That's actually better and simpler. Let's leave it how it is."
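For anyone who hasn't compared the two: long polling is just the client repeatedly making a request that the server holds open until it has data, so there's no connection lifecycle to manage. A minimal sketch of the client side, assuming a hypothetical `/updates?since=` endpoint (illustrative only, not my actual app):

```ts
// Bare-bones long-polling loop against a hypothetical /updates endpoint.
async function pollUpdates(onMessage: (msg: unknown) => void): Promise<never> {
  let cursor: string | null = null;
  while (true) {
    try {
      // The server holds this request open until new data arrives (or it times out).
      const res = await fetch(`/updates?since=${cursor ?? ""}`);
      if (res.ok) {
        const { messages, nextCursor } = await res.json();
        for (const msg of messages) onMessage(msg);
        cursor = nextCursor;
      } else {
        // Back off briefly on server errors before retrying.
        await new Promise((r) => setTimeout(r, 2000));
      }
    } catch {
      // Network hiccup: wait and retry instead of tearing anything down.
      await new Promise((r) => setTimeout(r, 2000));
    }
  }
}

// Usage: pollUpdates((msg) => console.log("got update:", msg));
```

Compare that to websockets, where you also need reconnect logic, heartbeats, and server-side connection state, and the model's "leave it how it is" starts to look like the right call.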