My background is in NLP - I suspect we'll see the same pattern in language models that we've seen in vision models. Consider this[1] article ("NLP's ImageNet moment has arrived"), which compares AlexNet in 2012 to the first GPT model 6 years later: we're just a few years behind.
True, GPT-2 and -3, RoBERTa, T5 etc. are all increasingly data- and compute-hungry. That's the 'tick' your second article mentions.
We simultaneously have people doing research in the 'tock' - reducing the compute needed. ICLR 2020 was full of alternative training schemes that required less compute for similar performance (e.g. ELECTRA[2]). Model distillation is another interesting idea that reduces the amount of inference-time compute needed (rough sketch below).
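
To make the distillation point concrete, here's a minimal sketch of the basic idea (in PyTorch, with made-up toy model sizes - not any particular paper's recipe): train a small student to match a large teacher's softened output distribution, then serve only the student at inference time.

    # Minimal knowledge-distillation sketch; teacher/student sizes are hypothetical.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    teacher = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 10)).eval()
    student = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

    optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
    T = 2.0  # temperature: softens the teacher's distribution

    x = torch.randn(32, 128)              # stand-in batch of features
    labels = torch.randint(0, 10, (32,))  # stand-in hard labels

    with torch.no_grad():
        teacher_logits = teacher(x)       # teacher is frozen
    student_logits = student(x)

    # KL term pulls the student toward the teacher's soft targets;
    # cross-entropy keeps it anchored to the hard labels.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)
    loss = 0.5 * soft_loss + 0.5 * hard_loss

    loss.backward()
    optimizer.step()

The payoff is that the deployed student has a fraction of the teacher's parameters, so inference cost drops even though you paid the full training cost for the teacher once.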
[1] https://thegradient.pub/nlp-imagenet/
[2] https://openreview.net/pdf?id=r1xMH1BtvB