Hacker News | crindy's comments

They do talk about working on this, and making improvements. From https://openai.com/index/introducing-gpt-5/

> More honest responses

> Alongside improved factuality, GPT‑5 (with thinking) more honestly communicates its actions and capabilities to the user—especially for tasks which are impossible, underspecified, or missing key tools. In order to achieve a high reward during training, reasoning models may learn to lie about successfully completing a task or be overly confident about an uncertain answer. For example, to test this, we removed all the images from the prompts of the multimodal benchmark CharXiv, and found that OpenAI o3 still gave confident answers about non-existent images 86.7% of the time, compared to just 9% for GPT‑5.

> When reasoning, GPT‑5 more accurately recognizes when tasks can’t be completed and communicates its limits clearly. We evaluated deception rates on settings involving impossible coding tasks and missing multimodal assets, and found that GPT‑5 (with thinking) is less deceptive than o3 across the board. On a large set of conversations representative of real production ChatGPT traffic, we’ve reduced rates of deception from 4.8% for o3 to 2.1% of GPT‑5 reasoning responses. While this represents a meaningful improvement for users, more work remains to be done, and we’re continuing research into improving the factuality and honesty of our models. Further details can be found in the system card.


Freedom of expression comes to mind. If someone had a friend commit suicide, should they not be able to discuss their experience in public?


Absolutely they should. When I worked there, that was known as "protecting voice": that content has always been explicitly allowed because it is free expression, even if reading it can be difficult for some people. The same goes for someone posting images of healed scars because they've been overcoming their self-harm.

The content I'm talking about is graphic photos of suicide and self-injury: fresh, blood-soaked cuts, bodies hanging, graphic depictions of eating disorders (going beyond "thinspo", which is more borderline, and so is downranked and not recommended rather than removed).

It's the latter that we believed (based on the advice of experts who we relied on for guidance) is harmful when consumed in large quantities.


It's so strange to me that in a forum full of programmers, people don't seem to understand that you set up systems to detect errors before they cause problems. That's why I find ChatGPT so useful for helping me with programming - I can tell if it makes a mistake because... the code doesn't do what I want it to do. I already have testing and linting set up to catch my own mistakes, and those things also catch AI's mistakes.


Thank you! It always feels so weird to use ChatGPT without any major issues while so many people keep claiming how awful it is; it's like people want it 100% perfect or nothing. For me, if it gets me 80% of the way there in 1/10 the time, and then I do the final 20%, that's still a heck of a productivity boost, basically for free.


Yep, I’m with you. I’m a solo dev who never went to college… o1 makes far fewer errors than I do! No chance I’d make it past round one of any sort of coding tournament. But I managed to bootstrap a whole saas company doing all the coding myself, which involved setting up a lot of guard rails to catch my own mistakes before they reached production. And now I can consult with a programming intelligence the likes of which I could never afford to hire if it was a person. It’s amazing.


Is it working?


Not sure what you're referring to exactly. But broadly yes it is working for me - the number of new features I get out to users has sped up greatly, and stability of my product has also gone up.


Are you making money with your saas idea?


Yep, been living off it for nine years now


Congratulations! That is not an easy task. I am just starting the journey.


Famously, the last 10% takes 90% of the time (or 20/80 in some approximations). So even if it gets you 80% of the way in 10% of the time, maybe you don’t end up saving any time, because all the time is in the last 20%.

I’m not saying that LLMs can’t be useful, but I do think it’s a darn shame that we’ve given up on creating tools that deterministically perform a task. We know we make mistakes and take a long time to do things. And so we developed tools to decrease our fallibility to zero, or to allow us to achieve the same output faster. But that technology needs to be reliable; and pushing the envelope of that reliability has been a cornerstone of human innovation since time immemorial. Except here, with the “AI” craze, where we have abandoned that pursuit. As the saying goes, “to err is human”; the 21st-century update will seemingly be, “and it’s okay if technology errs too”. If any other foundational technology had this issue, it would be sitting unused on a shelf.

What if your compiler only generated the right code 99% of the time? Or if your car only started 9 times out of 10? All of these tools can be useful, but when we are so accepting of a lack of reliability, more things go wrong, potentially at larger and larger scales and magnitudes. When (if some folks are to be believed) AI is writing safety-critical code for an early-warning system, or deciding when to drop bombs, or designing and validating drugs, what failure rate is tolerable?


> Famously, the last 10% takes 90% of the time (or 20/80 in some approximations). So even if it gets you 80% of the way in 10% of the time, maybe you don’t end up saving any time, because all the time is in the last 20%.

This does not follow. By your own assumptions, getting you 80% of the way there in 10% of the time would save you 18% of the overall time, if the first 80% typically takes 20% of the time. 18% time reduction in a given task is still an incredibly massive optimization that's easily worth $200/month for a professional.


Using the 90/10 split: the first 90% of the work normally takes 10% of the time; reducing that to a tenth of itself yields 9% time savings.

160 hours a month * $100/hr programmer * 9% = $1,440 in savings, easily enough to justify $200/month.

Even if it fails 1/10th of the time, that is still ~8%, or roughly $1,300 in savings.


Does that count the time you spend on prompt engineering?


It depends what you’re doing.

For tasks where bullshitting or regurgitating common idioms is key, it works rather well and indeed takes you 80% or even close to 100% of the way there. For tasks that require technical precision and genuine originality, it’s hopeless.


I'd love to hear what that is.

So far, given my range of projects, I have seen it struggle with lower level mobile stuff and hardware (ESP32 + BLE + HID).

For things like web (front/back), DB, video games (web and Unity), it does work pretty well (at least 80% there on average).

And I'm talking of the free version, not this $200/mo one.


Well, that is a very specific set of skills. I bet the C-suite loves it.


> I always feel so weird to actually use chatgpt without any major issues while so many people keep on claiming how awful it is

People around here feel seriously threatened by ML models. It makes no sense, but then, neither does defending the Luddites, and people around here do that, too.


Well now at $200 it's a little farther away from free :P


What do you mean? ChatGPT is free, the Pro version isn't.

I'm talking of the generally available one, haven't had the chance to try this new version.


I could buy a car for that kind of money!


Of course, but for every thoroughly set up TDD environment, you have a hundred other people just blindly copy pasting LLM output into their code base and trusting the code based on a few quick sanity checks.


You assume that programming software with an existing, well-defined, and correct test suite is all these tools will be used for.


> I can tell if it makes a mistake because... the code doesn't do what I want it to do

Sometimes it does what you want it to do, but still creates a bug.

Asked the AI to write some code to get a list of all objects in an S3 bucket. It wrote code that worked, but it did not account for the fact that S3 delivers objects in pages of at most 1000 items. If the bucket contained fewer than 1000 objects (typical when first starting a project), things worked; but once the bucket contained more than 1000 objects (easy to reach on S3 in a short amount of time), that became a subtle but important bug.

Someone not already intimately familiar with the inner workings of S3 APIs would not have caught this. It's anyone's guess if it would be caught in a code review, if a code review is even done.
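The truncation pitfall described here can be sketched without touching AWS at all. `FakePagedStore` below is a hypothetical stand-in for the S3 API (the real boto3 client has the same shape via `list_objects_v2`, and its `get_paginator("list_objects_v2")` helper handles the continuation loop for you):

```python
# Sketch of the pagination pitfall: one call returns at most one page
# (S3 caps pages at 1000 keys), so naive code silently truncates results.
PAGE_SIZE = 1000

class FakePagedStore:
    """Illustrative stand-in for the S3 list API."""
    def __init__(self, keys):
        self._keys = keys

    def list_objects(self, continuation=0):
        """Mimics list_objects_v2: one page plus a continuation token."""
        page = self._keys[continuation:continuation + PAGE_SIZE]
        next_token = continuation + PAGE_SIZE
        truncated = next_token < len(self._keys)
        return {"Contents": page,
                "IsTruncated": truncated,
                "NextContinuationToken": next_token if truncated else None}

def list_all(store):
    """Correct version: keep following the continuation token."""
    keys, token = [], 0
    while True:
        resp = store.list_objects(continuation=token)
        keys.extend(resp["Contents"])
        if not resp["IsTruncated"]:
            return keys
        token = resp["NextContinuationToken"]

store = FakePagedStore([f"obj-{i}" for i in range(2500)])
buggy = store.list_objects()["Contents"]   # naive single call: first page only
assert len(buggy) == 1000                  # silently missing 1500 keys
assert len(list_all(store)) == 2500        # the loop recovers everything
```

The buggy single-call version passes any test written against a small bucket, which is exactly why the truncation survives into production.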

I don't ask the AI to do anything complicated at all; the most I trust it with is writing console.log statements, which it is pretty good at predicting, but still not perfect.


So the AI wrote a bug; but if humans wouldn’t catch it in code review, then obviously they could have written the same bug. Which shouldn’t be surprising because LLMs didn’t invent the concept of bugs.

I use LLMs maybe a few times a month but I don’t really follow this argument against them.


Code reviewing is not the same thing as writing code. When you're writing code you're supposed to look at the documentation and do some exploration before the final code is pushed.

It would be pretty easy for most code reviewers to miss this type of bug in a code review, because they aren't always looking for that kind of bug, they aren't always looking at the AWS documentation while reviewing the code.

Yes, people could also make the same error, but at least they have a chance at understanding the documentation and limits where the LLM has no such ability to reason and understand consequences.


it also catches MY mistakes, so that saves time


So true, and people seem to gloss over this fact completely. They only talk about correcting the LLM's code while the opposite is much more common for me.


Seems like I'm one of very few excited by this announcement. I will totally pay for this - the o1-preview limits really hamper me.


What do you mostly use it for?


Before this update you could hold control and click the application, then select "open" from the menu. It would give you a warning and let you confirm you'd like to run it anyway.


Since you asked for someone to steelman, I’ll chime in to say this graph looks great to me.

As Canada has grown rich, our fertility rate has gone down. I have no moral problem with that, but it does cause a problem: as the population ages, the ratio of workers (who pay tax) to retired people (who carry a tax burden) shrinks. The options are increasing the taxes on workers, cutting the benefits to the retired, or increasing the working age population via immigration.

As to grocery stores, housing prices, etc. I think these problems are more related to a lack of competition. A free market can handle an influx of consumers.


> The options are increasing the taxes on workers, cutting the benefits to the retired, or increasing the working age population via immigration.

I think the third option could be further broken down:

3a) increase the working age population via a well-planned immigration strategy that includes the necessary infrastructure spending (medical residencies, housing, etc.) to support that immigration

3b) open up the floodgates with absolutely no planning for the consequences whatsoever and then blame everyone else for the ensuing inflation and societal stress.

Unfortunately our current government chose 3b.


This was also Merkel's plan for Germany, and the results are... so-so. I wonder how it is that nobody thought about such strategic planning? I know planning is hard, but it's not rocket science, and they should have anticipated the pushback as well. And by the way, did anybody learn any lessons?


> Since you asked for someone to steelman, I’ll chime in to say this graph looks great to me. As Canada has grown rich, our fertility rate has gone down. I have no moral problem with that, but it does cause a problem: as the population ages, the ratio of workers (who pay tax) to retired people (who carry a tax burden) shrinks.

This is not even close to the only problem. Our housing stock has not kept up with population growth, which is a big reason why housing prices have been so out of control here. It's not due to a lack of competition so much as regulation and population growth that has outpaced building capacity.

So I'm not sure I can agree that that graph looks fine. It's true that we need immigration to balance falling fertility rates, but this needs to be balanced against the available resources so the cost of living doesn't spiral out of control. Labour shortages can often be addressed in other ways, like automation.


Yeah, there are clearly other problems with the Canadian market. Didn’t mean to imply there weren’t!

Housing stock I really do think is down to regulation - it’s crazy hard to build here in Vancouver, even right near mass transit stations.

Should we have increased the ability to compete in housing and groceries before the influx of immigrants? Maybe. But I don’t think having them here makes these issues harder to address, and it might increase the pressure to actually take action. Every party is talking about cost of living now.


Regulation cannot be completely discounted, but the bigger issue is labour. If you can even find someone willing to build you a home, they won't entertain doing it until years into the future. There is no capacity to build more homes.

At least not without paying construction workers a whole lot more in order to compel the software developers away from developing software and into building houses, but then that only drives the cost of housing even higher!


More labour is one way, but there are also orthogonal ways of retargeting the existing building capacity more effectively. For instance, financial disincentives against building low-density luxury housing and more incentives towards building higher-density housing would alleviate demand pressure over the span of a few years. We also used to know how to build affordable housing.


Why is that disincentive not already present? Multiple people building a town home should be able to outbid one person building a single family home every single time, and the sky is the limit for a large condo building. Their combined capacity to pay more than outstrips the additional cost to build the structure. It is clear what is a better deal for the construction crew.

I expect the answer is because nobody actually wants to live in such dwellings. They might accept rent in a place like that if they see it as a stop-gap until they can move into their "dream home", but it is not the home they are willing to commit to and build (obviously there are exceptions).


Zoning regulations limit what you can construct in any given location, so it's not even about who has more financial capital, but political capital in many cases.


Not so much an issue anymore, though. Want to build a tiny house? No problem. Multiple houses on a single lot? Go for it! These would have been unthinkable 10-20 years ago, but have been given the green light in more recent times. Council knows that they can't get away with ignoring housing any longer. Of course, it's easy to accept because they also know almost nobody is going to do it.


The only long term option is to increase productivity with education and technology because you can't increase population (however you do it) forever. We should also stop emphasising the total GDP (which can be and is boosted by adding more people) and focus on the GDP per capita instead (which seems to be falling in Canada according to the various graphs posted here).

The claim that immigration is the only option is a fallacy peddled because it's the lazy short term option and "Après moi, le déluge" [1]

[1] https://en.wikipedia.org/wiki/Apr%C3%A8s_moi,_le_d%C3%A9luge


Yeah, I should have mentioned productivity growth - it is of course another solution to the problem. I think it’s harder for governments to implement policy that increases productivity growth than immigration, though.


I would be interested to have the person you replied to steelman why, specifically, they believe that the contents of that graph would lead to social decline and political turmoil.


Very impressed by the demo where it starts speaking French in error, then laughs with the user about the mistake. Such a natural recovery.


I believe this will legally mandate that Google and Apple remove the app from their app stores. It will still be available online.


Why say "any social media" when it's only YouTube and TikTok?


They already admitted they struggle with marketing


isn't that what marketing is? inflating your product with words


Nope. Underpromise, overdeliver. Build something so useful that you don't need to inflate it to get it noticed.


sure, that's what it should be, but that's not what it is.


"Turn that frown upside down" is a way of saying - don't be sad, be happy. Instead of having the corners of your mouth point down (frown), have them point up (smile). The joke is that if you move the eyes to the other side of the mouth, it remains a frown.

Expectation

:( becomes :)

Subversion

:( becomes ):


It's a very American expression. In British culture a frown refers to a forehead expression, not a mouth expression.


How interesting - I'm British, and I got the joke immediately. Yet, I'd hardly move my mouth if someone asked me to put on a frown. I'd never really thought about that one before.


Yeah, it's a well known expression, but it only makes literal sense in American English.


The implication is that if you are sad there are more associated tells than just your mouth. Imagine a clown doing a mock sad face and exaggerating sadness by frowning.

I've always felt this was a forced contrivance and never that anyone literally thought frowning had anything to do with smiling.


Oh! I didn't notice the :( and the ): signs. I thought it was a play on the usernames < pronto> and < korozion>!

:)

