
https://arxiv.org/pdf/2005.14165.pdf

WebText and WebText2, referenced in their papers, are corpora built from web pages linked in Reddit submissions; WebText2 had a 22% weight in the GPT-3 training mix.

https://openwebtext2.readthedocs.io/en/latest/

This is a larger share than Wikipedia (3% weight) or either of their two book corpora (8% each).

The only other data source was a filtered subset of Common Crawl (60% weight).
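The mixture above can be sketched as weighted sampling over the five corpora. This is a minimal illustration, not OpenAI's actual pipeline; the weights are the sampling proportions reported in the GPT-3 paper (note they sum to 101% due to rounding there), and the helper function is hypothetical:

```python
import random
from collections import Counter

# Sampling proportions from the GPT-3 paper ("Language Models are
# Few-Shot Learners", Table 2.2). They sum to 1.01 because the paper
# rounds each entry; random.choices normalizes weights automatically.
weights = {
    "Common Crawl (filtered)": 0.60,
    "WebText2": 0.22,
    "Books1": 0.08,
    "Books2": 0.08,
    "Wikipedia": 0.03,
}

def sample_corpus(rng=random):
    # Draw one corpus name according to the mixture weights.
    # (Hypothetical helper for illustration only.)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

# Sampling many draws recovers roughly the published proportions.
rng = random.Random(0)
counts = Counter(sample_corpus(rng) for _ in range(100_000))
```

The point is that the weights control how often each corpus is *seen* during training, not how large it is on disk; Common Crawl is far bigger than 60% of the raw data but is downsampled relative to the higher-quality Reddit-linked text.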


