My understanding is that OpenAI did indeed find diminished capability across a range of tasks after doing RLHF. You're right to question it though, as I believe the opposite was true for GPT-3, where RLHF actually improved performance on certain tasks.
The benefits from a business perspective were still clear, however, and of course the instruction-tuned GPT-4 model still outperformed GPT-3 in general.
There are probably some weird edge cases and nuances that I'm missing - and I'd be happy to be corrected.
Important distinction, especially if we're looking to push back out towards the Pareto frontier of the problem (sketched below).
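To make the Pareto-frontier framing concrete, here's a minimal toy sketch (my own illustration, not anything from OpenAI's work): if you score each fine-tuning run on capability and alignment, the frontier is just the set of runs that aren't dominated on both axes. The checkpoint scores are made up.

```python
def pareto_frontier(points):
    """Return the points not dominated by any other point.

    A point (cap, align) is dominated if some other point is at least as
    good on both axes and strictly better on at least one.
    """
    frontier = []
    for i, (cap_i, align_i) in enumerate(points):
        dominated = any(
            (cap_j >= cap_i and align_j >= align_i)
            and (cap_j > cap_i or align_j > align_i)
            for j, (cap_j, align_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            frontier.append((cap_i, align_i))
    return frontier

# Hypothetical (capability, alignment) scores for a few fine-tuning runs.
checkpoints = [(0.82, 0.40), (0.78, 0.75), (0.70, 0.90), (0.75, 0.70)]
print(pareto_frontier(checkpoints))
# -> [(0.82, 0.4), (0.78, 0.75), (0.7, 0.9)]
#    (0.75, 0.70) is dominated by (0.78, 0.75), so it's off the frontier.
```

"Pushing back out towards the frontier" then just means finding training recipes whose checkpoints stop being dominated, i.e. you recover capability without giving up the alignment gains.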
RLHF is still very much in its infancy and, in my personal experience, comes nowhere near optimally balancing the bias-variance tradeoff.