The WebText and WebText2 corpora referenced in their papers are built from web pages linked in Reddit submissions (not Reddit posts themselves); WebText2 was given a 22% weight in the training mix.
https://openwebtext2.readthedocs.io/en/latest/
This is a larger share than Wikipedia (3% weight) or either of the two book corpora (8% each).
The only other data included was a filtered subset of Common Crawl (weighted 60%).
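Mixture weights like these determine how often each corpus is sampled when drawing training documents. A minimal sketch of that idea, using the percentages from the text (the corpus names `Books1`/`Books2` follow the GPT-3 paper's naming; the sampling scheme itself is an illustration, not the authors' actual pipeline):

```python
import random

# Approximate training-mix weights from the text (percent).
# Note they sum to 101 due to rounding; random.choices normalizes them.
weights = {
    "Common Crawl (filtered)": 60,
    "WebText2": 22,
    "Books1": 8,
    "Books2": 8,
    "Wikipedia": 3,
}

def sample_corpus(rng: random.Random) -> str:
    """Pick which corpus the next training document is drawn from."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in weights}
for _ in range(10_000):
    counts[sample_corpus(rng)] += 1
# Over many draws, WebText2 accounts for roughly 22% of documents.
```

Under this scheme a corpus's weight is its sampling probability per document, so a small corpus with a high weight (like WebText2 relative to its raw size) is effectively seen more times per epoch than a large, low-weight one.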