Fascinating. R1 really punches above its weight with respect to cost-per-token.
As the article alluded to at the end, my thoughts immediately go to using R1 as a data generator for complex problems, since we have many examples of successful distillation into smaller models on well-defined tasks.