As noted in the paper this is inspired by: GPT-4's image generation capabilities were severely diminished by the instruction/safety-tuning process. Unfortunately this means the currently available model from the API won't be very capable - certainly not as capable as the early version of GPT-4 that Microsoft had access to.
edit: I'm specifically referring to the "image generation by trickery (e.g. SVG)" technique being diminished, though my understanding is that other tasks were diminished as well.
It's not just image generation that RLHF worsens. Calibration (the model's confidence in solving a question relative to its actual ability to solve it) went from excellent to non-existent, and you can see from the report that the base model performed better on a number of tests. Basically a dumber model.
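For anyone unfamiliar with what "calibration" means here, a toy sketch (this is an illustration of the general concept, not the exact metric OpenAI reports - they plot stated confidence against accuracy): a model is well calibrated if, among answers it gives with ~70% confidence, roughly 70% are actually correct. Expected Calibration Error (ECE) is one common way to quantify the gap.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Average |accuracy - confidence| over equal-width confidence bins,
    weighted by the fraction of predictions landing in each bin."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        in_bin = [i for i, c in enumerate(confidences) if lo < c <= hi]
        if not in_bin:
            continue
        avg_conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        accuracy = sum(correct[i] for i in in_bin) / len(in_bin)
        ece += (len(in_bin) / n) * abs(accuracy - avg_conf)
    return ece

# Well-calibrated toy model: stated confidence roughly matches hit rate.
good = expected_calibration_error([0.95, 0.95, 0.55, 0.55], [1, 1, 1, 0])
# Overconfident toy model: always ~95% confident, right half the time.
bad = expected_calibration_error([0.95, 0.95, 0.95, 0.95], [1, 0, 1, 0])
```

The "went from excellent to non-existent" complaint is the second case: post-RLHF, the model's stated confidence stops tracking how often it's actually right.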
My understanding is that OpenAI did indeed find diminished capability across a range of tasks after RLHF. You're right to question this though, as I believe the opposite was true of GPT-3, where instruction-tuning improved certain tasks.
The benefits from a business perspective were still clear, however, and of course the instruction-tuned GPT-4 model still outperformed GPT-3 in general.
There are probably some weird edge cases and nuances that I'm missing - and I'd be happy to be corrected.