FWIW, I now analyze the Stack Overflow dumps on Snowflake https://medium.com/sno...

dleeftink · on July 25, 2023

Thanks for sharing, good to see alternative options popping up. My wish is that the Stack Exchange dataset could one day be provided as a streaming Parquet or Arrow table, so that underfunded grads and post-grads could sample the datasets more easily and selectively (similar to how Hugging Face provides some of its datasets)[1][2].

The Hugging Face repo unfortunately prefilters some of the tables/rows according to some criteria, making it less usable for the general analytical queries that the BQ or SEDE datasets enable. If anyone knows of an 'XML-streaming' solution that samples directly from the Internet Archive's data dumps, I am all ears.

[1]: https://huggingface.co/docs/datasets-server/rows

[2]: https://huggingface.co/datasets/HuggingFaceGECLM/StackExchan...
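
For what it's worth, the parsing half of the 'XML-streaming' idea can be roughed out with the standard library alone. This is only a sketch: it assumes the dump files are a flat list of `<row .../>` elements with the data in attributes, and `sample_rows` is a made-up helper name. Fetching and decompressing the archives from the Internet Archive is not covered here; the function just takes any file-like object and reservoir-samples rows from it without loading the whole file.

```python
import io
import random
import xml.etree.ElementTree as ET


def sample_rows(xml_stream, k, seed=None):
    """Reservoir-sample the attributes of k <row> elements from a file-like XML stream.

    Hypothetical helper: assumes a flat document of <row .../> elements.
    """
    rng = random.Random(seed)
    reservoir = []
    seen = 0

    # Parse incrementally; the first event hands us the root element.
    context = ET.iterparse(xml_stream, events=("start", "end"))
    _, root = next(context)

    for event, elem in context:
        if event != "end" or elem.tag != "row":
            continue
        seen += 1
        attrs = dict(elem.attrib)
        if len(reservoir) < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(attrs)
        else:
            # Keep the new row with probability k / seen (Algorithm R).
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = attrs
        # Drop already-processed children so memory stays bounded.
        root.clear()

    return reservoir


# Tiny in-memory stand-in for a dump file (made-up rows, not real data).
demo = io.BytesIO(
    b'<posts>'
    b'<row Id="1" Score="5" />'
    b'<row Id="2" Score="0" />'
    b'<row Id="3" Score="7" />'
    b'</posts>'
)
sample = sample_rows(demo, k=2, seed=0)
print(sample)
```

Because it only needs a readable stream, the same function should work on a local file opened in binary mode or on a decompressor's output stream, assuming one is available for the archive format.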