There’s some disagreement over whether GRPO is the important part of DeepSeek’s recipe. I’m personally in the camp of “it was the data and reward functions” and GRPO wasn’t the key part, but others would disagree.
I believe that if the reader is familiar with PPO, they will immediately understand GRPO as well.
I’ve heard people say that GRPO giving a zero gradient when neither the current sample nor the rest of the group earns any reward is advantageous for optimization. It avoids killing your base model with low signal-to-noise updates, which can be a problem in PPO, where the critic usually produces a non-zero gradient even for samples where one would rather say “problem too hard for now, skip.”
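To make the zero-gradient case concrete, here’s a toy sketch of the group-relative advantage as I understand it (the function name is mine, and I’m ignoring the KL term and per-token details): when every sample in a group gets the same reward, all the advantages come out as zero, so there’s no policy update from that group.

    import statistics

    def grpo_advantages(group_rewards, eps=1e-8):
        # Group-relative advantage: normalize each reward against the
        # mean and std of its own sampled group.
        mean = statistics.mean(group_rewards)
        std = statistics.pstdev(group_rewards)
        return [(r - mean) / (std + eps) for r in group_rewards]

    # Rule-based reward, problem too hard for now: every sample fails.
    print(grpo_advantages([0.0, 0.0, 0.0, 0.0]))  # all zeros -> no gradient
    # Mixed outcomes: the group itself supplies the learning signal.
    print(grpo_advantages([1.0, 0.0, 0.0, 0.0]))  # ~[1.73, -0.58, -0.58, -0.58]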
I’d be curious to hear you lay out your thoughts though!
I _think_ that's a consequence of using rule-based reward functions and not actually a feature of GRPO? The original formulation of GRPO in the DeepSeekMath paper uses a neural reward model that tries to predict human rankings of responses, and in that configuration you won't see zero-gradient updates.
Flipping it around, if you swapped out the neural reward model in PPO for a reward function that can return zero, I thiiinnkkk it would be able to produce zero (or very low) gradient updates.
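To spell out the part I'm unsure about: in PPO the advantage is taken against the critic's value estimate, so in a toy one-step version (no GAE, names are mine) a zero reward only becomes a near-zero update if the critic also predicts roughly zero, which I think is the critic point made above.

    def ppo_advantage(reward, value_estimate, next_value=0.0, gamma=1.0):
        # One-step advantage estimate: A = r + gamma * V(s') - V(s).
        # On a terminal step (next_value = 0) this reduces to r - V(s).
        return reward + gamma * next_value - value_estimate

    print(ppo_advantage(reward=0.0, value_estimate=0.4))  # -0.4: still drives an update
    print(ppo_advantage(reward=0.0, value_estimate=0.0))  #  0.0: only if the critic agrees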
I'll be the first to admit that I don't know enough about the space to say though. I'm still a beginner here.