The WebText and WebText2 corpora referenced in their papers are built from web pages linked in Reddit submissions (not Reddit posts themselves); WebText2 was given a 22% weight in the training mix.
https://openwebtext2.readthedocs.io/en/latest/
This is a larger share than Wikipedia (3% weight) or either of the two book corpora (8% each).
The only other data included was a filtered subset of Common Crawl (weighted 60%).
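Mixture weights like these determine how often each corpus is sampled when drawing training documents. A minimal sketch of that idea, using the percentages from the text (the corpus names `Books1`/`Books2` follow the GPT-3 paper's naming; the sampling scheme itself is an illustration, not the authors' actual pipeline):

```python
import random

# Approximate training-mix weights from the text (percent).
# Note they sum to 101 due to rounding; random.choices normalizes them.
weights = {
    "Common Crawl (filtered)": 60,
    "WebText2": 22,
    "Books1": 8,
    "Books2": 8,
    "Wikipedia": 3,
}

def sample_corpus(rng: random.Random) -> str:
    """Pick which corpus the next training document is drawn from."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in weights}
for _ in range(10_000):
    counts[sample_corpus(rng)] += 1
# Over many draws, WebText2 accounts for roughly 22% of documents.
```

Under this scheme a corpus's weight is its sampling probability per document, so a small corpus with a high weight (like WebText2 relative to its raw size) is effectively seen more times per epoch than a large, low-weight one.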