DeepSeek R1 just uses crappy PPO ("GRPO" amounts to using the Sharpe ratio as a heuristic approximation to a value function) on top of distilled existing models, with tons of pipelining optimizations manually engineered in. I don't see this making leading-edge research any less expensive, and you won't get a "smarter" model - just a model with a higher probability of giving an answer it could already give. And if you want to try to do something interesting with the architecture, those pipelining optimizations now slow down your iteration speed heavily.
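To make the Sharpe-ratio point concrete: GRPO drops the learned critic and instead samples a group of completions per prompt, scores them, and uses each reward's z-score within the group as its advantage. A minimal sketch of that advantage computation (my own illustration, not DeepSeek's code):

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: z-score each reward against its group.

    This is the Sharpe-ratio-like trick: instead of a learned value
    function, the group mean acts as the baseline and the group std
    as the normalizer.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# k = 4 sampled completions for one prompt, scored 1/0 by a verifier:
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])
# completions that beat the group average get positive advantage,
# the rest get negative - no value network anywhere
```

Note this only produces a useful signal when the verifier's scores actually separate good completions from bad ones within the same group.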
The RL techniques presented will only work in domains where you can guarantee an answer is right (multiple-choice questions, math, etc.). They don't represent any convincing leap forward in the capability of LLMs, just a strategy for compute-efficient distillation of what we already know works. The fact that this shitty PPO proxy works at all is a testament to how heavily DeepSeek is bootstrapping its capability off the output of existing larger models, which are much more expensive to train. What DeepSeek R1 proves is that you can distill ChatGPT et al. into a smaller model and hack certain benchmarks with RL.
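This is what "guarantee an answer is right" means operationally: the reward is a literal string check against a gold answer. A hypothetical sketch of the kind of verifier such RL depends on (the "ANSWER:" output format is an assumption for illustration):

```python
def math_reward(completion: str, gold_answer: str) -> float:
    """Binary reward: 1.0 if the final answer matches, else 0.0.

    Only possible because math has a checkable gold answer; there is
    no analogue for open-ended language tasks.
    """
    # assume the model is prompted to end with "ANSWER: <value>"
    for line in reversed(completion.strip().splitlines()):
        if line.startswith("ANSWER:"):
            ans = line.removeprefix("ANSWER:").strip()
            return 1.0 if ans == gold_answer.strip() else 0.0
    return 0.0  # no parseable answer: no reward signal at all

print(math_reward("some reasoning...\nANSWER: 42", "42"))  # → 1.0
```

Everything outside this narrow, machine-checkable format (essays, advice, explanations) gets no such oracle.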
If you could just do RL to predict the best next word in general, this would have been done already - but the signal-to-noise ratio on exploration would be so bad you'd never get anything besides infinite monkeys at a typewriter. Trying to boost the probability of outputs you like isn't a novel or complicated idea to anyone familiar with RL, and whoever decided to do RLHF on an LLM surely thought of (and tried) regular RL first - and found it didn't work very well with whatever pretrained model and rewards they had. It was only like two weeks ago that people were going crazy about o3 doing ARC-AGI by running the exact same kind of traces R1 produces in "GRPO", just at test time rather than train time. Doing this also isn't novel, and it also only helps on shitty toy problems where you can get a number to tell you good vs. bad.
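The infinite-monkeys point is easy to put numbers on: without a strong pretrained prior, the odds that uniform random exploration over a tokenizer's vocabulary ever emits a specific correct answer - and therefore ever sees a nonzero reward - are astronomically small. A back-of-envelope sketch (vocab size and answer length are illustrative assumptions):

```python
# probability that uniform-random sampling over a vocabulary produces
# one specific target sequence -- the only "exploration signal" a
# from-scratch RL policy with sparse exact-match rewards would get
vocab_size = 50_000   # typical LLM tokenizer size (assumed)
target_len = 10       # even a ten-token answer
p_hit = (1 / vocab_size) ** target_len
print(f"{p_hit:.3e}")  # ~1e-47: the reward is essentially never observed
```

Pretraining is what collapses this search space enough for RL to see any reward at all, which is exactly why the RL stage is downstream of, not a replacement for, expensive base models.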
There is no mechanism to compute rewards for general-purpose language tasks - and in fact I think people will come to see that the gains on math/coding benchmark problems come at a real cost to other capabilities of these models, capabilities that are harder to quantify and impossible to generically assign rewards to at internet scale.
To explore the frontier of capability you will still need a massive amount of compute - in fact even more for RL than for standard next-token prediction, even if the LLM has fewer parameters. You also can't afford all those hand-tuned optimizations while you're trying many different complex architectures.