Hey fellow language model enthusiasts!
I understand that this is very controversial, but I'm personally opposed to censorship of previously public ML checkpoints and, more generally, of models trained on publicly crawled corpora.
I've decided to do my little bit on this and have started collecting a few things that are technically available but censored on the major hubs, and therefore comparatively high-friction for lightweight experimentation and research. For obvious reasons it's not as simple as creating a GitHub repository, hence this post.
I'm not indifferent to arguments that these things could cause harm, but I believe good-faith participation in the HN community is a sufficient bar for access to anything I could personally assemble. So whatever little (and hopefully growing) amount of this I get together, I'm willing to share with anyone on HN who has even an eyeball-plausible comment history of caring about the community.
I don't have much yet, but I've tracked down the torrents for e.g. GPT-4Chan and am starting to archive its tuning corpus, the HN comments corpus, and some brand-name newspapers: low-hanging fruit like that. I've got some spare bandwidth on a modest number of GPUs, so I'm planning to produce a variety of fine-tuned checkpoints and throw together Docker and Nix environments for loading them up.
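Since anything passed around over torrents and mirrors is easy to corrupt or tamper with, part of the archiving work is just checking files against known digests. Here's a minimal sketch of how one might verify archived shards against a SHA-256 manifest; the function names and file layout are my own illustration, not any real hub's manifest format:

```python
# Hypothetical integrity check for archived checkpoint/corpus shards.
# Names like "shard-00001.bin" are illustrative, not from a real manifest.
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large shards never need to fit in RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_manifest(manifest: dict[str, str], root: Path) -> list[str]:
    """Return names of files under root whose hash doesn't match the manifest."""
    return [name for name, digest in manifest.items()
            if sha256_of(root / name) != digest]
```

An empty list back from `verify_manifest` means every shard matched; anything else names the files to re-fetch.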
If you want what I've put together so far (which admittedly isn't much), or want to help with this: ben.reesman@gmail.com