Didn't DeepSeek also show that pure RL leads to low-quality results compared to also doing old-fashioned supervised learning on a "problem solving step by step" dataset? I'm not sure why people are getting excited about the pure-RL approach, seems just overly complicated for no real gain.
If I’m understanding their paper correctly (I might not be, but I’ve spent a little time trying to understand it), they showed you only need a small amount of supervised fine-tuning (“SFT”) to “seed” the base model, followed by pure RL. Their R1-Zero model was the pure-RL-only one; it worked, but produced weird artifacts like switching languages mid-response or excessive repetition.
The SFT training data is hard to produce, while the RL they used relied on fairly uncomplicated rule-based heuristic evaluations (things like answer correctness and output format) rather than a learned critic model. So their RL is a simple approach.
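To illustrate what I mean by heuristic evaluation (this is just my own toy sketch, not their actual reward code, and the `<think>`/`\boxed{}` conventions are assumptions on my part), a rule-based reward can be as simple as a format check plus an exact-match answer check:

```python
import re

def rule_based_reward(completion: str, gold_answer: str) -> float:
    """Toy rule-based reward: small format bonus plus a correctness bonus.
    No learned critic model involved -- just string checks."""
    score = 0.0
    # Format check: reasoning wrapped in <think>...</think> tags
    if re.search(r"<think>.*?</think>", completion, re.DOTALL):
        score += 0.1
    # Correctness check: final answer in \boxed{} matches the reference exactly
    m = re.search(r"\\boxed\{([^}]*)\}", completion)
    if m and m.group(1).strip() == gold_answer.strip():
        score += 1.0
    return score

good = "<think>2+2 is 4</think> The answer is \\boxed{4}"
bad = "I think it's 5"
print(rule_based_reward(good, "4"))  # 1.1
print(rule_based_reward(bad, "4"))   # 0.0
```

The point being: a scalar signal like this is enough to drive policy-gradient RL on verifiable tasks (math, code with unit tests), which is why no secondary reward model is needed.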
If I’ve said anything wrong, feel free to correct me.