Didn't DeepSeek also show that pure RL leads to low-quality results compared to also doing old-fashioned supervised learning on a "problem solving step by step" dataset? I'm not sure why people are getting excited about the pure-RL approach, seems just overly complicated for no real gain.
If I’m understanding their paper correctly (I might not be, but I’ve spent a little time trying to understand it), they showed you only need a small amount of supervised fine-tuning (“SFT”) to “seed” the base model, followed by pure RL. Their R1-Zero model was the pure-RL-only one; it worked, but produced weird artifacts like switching languages mid-response or excessive repetition.
The SFT training data is hard to produce, while the RL they used relied on fairly uncomplicated rule-based heuristic evaluations (things like answer correctness and output format) rather than a learned critic model. So their RL is a simple approach.
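To illustrate what I mean by heuristic evaluation (this is just my own toy sketch, not their actual reward code, and the `<think>`/`\boxed{}` conventions are assumptions on my part), a rule-based reward can be as simple as a format check plus an exact-match answer check:

```python
import re

def rule_based_reward(completion: str, gold_answer: str) -> float:
    """Toy rule-based reward: small format bonus plus a correctness bonus.
    No learned critic model involved -- just string checks."""
    score = 0.0
    # Format check: reasoning wrapped in <think>...</think> tags
    if re.search(r"<think>.*?</think>", completion, re.DOTALL):
        score += 0.1
    # Correctness check: final answer in \boxed{} matches the reference exactly
    m = re.search(r"\\boxed\{([^}]*)\}", completion)
    if m and m.group(1).strip() == gold_answer.strip():
        score += 1.0
    return score

good = "<think>2+2 is 4</think> The answer is \\boxed{4}"
bad = "I think it's 5"
print(rule_based_reward(good, "4"))  # 1.1
print(rule_based_reward(bad, "4"))   # 0.0
```

The point being: a scalar signal like this is enough to drive policy-gradient RL on verifiable tasks (math, code with unit tests), which is why no secondary reward model is needed.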
If I’ve said anything wrong, feel free to correct me.