
FWIW, I now analyze the Stack Overflow dumps on Snowflake

https://medium.com/snowflake/how-to-load-the-stack-overflow-...



Thanks for sharing, good to see alternative options popping up. My wish is that the Stack Exchange dataset could one day be provided as a streaming Parquet or Arrow table, so that underfunded grads and post-grads could more easily and selectively sample the datasets (similar to how Hugging Face provides some of its datasets)[1][2].

The Hugging Face repo unfortunately prefilters some of the tables/rows according to some criteria, making it less usable for the general analytical queries that the BQ or SEDE datasets enable. If anyone knows of an 'XML-streaming' solution that directly samples from the Internet Archive's data dumps, I am all ears.

[1]: https://huggingface.co/docs/datasets-server/rows

[2]: https://huggingface.co/datasets/HuggingFaceGECLM/StackExchan...
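
For what it's worth, here is a rough sketch of what the 'XML-streaming' sampling part could look like, using only the Python standard library. It assumes the dump has already been decompressed into a file-like object (the archives themselves are compressed, and streaming the decompression is not covered here) and that each record is a flat <row .../> element under one root, as in the Posts.xml-style files. The function name sample_rows and the demo data are made up for illustration.

```python
import io
import random
import xml.etree.ElementTree as ET


def sample_rows(xml_stream, k, seed=None):
    """Reservoir-sample the attributes of k <row> elements from an XML stream.

    Hypothetical helper: reads the stream once and keeps at most k rows in memory.
    """
    rng = random.Random(seed)
    reservoir = []
    seen = 0
    context = ET.iterparse(xml_stream, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event != "end" or elem.tag != "row":
            continue
        seen += 1
        if len(reservoir) < k:
            reservoir.append(dict(elem.attrib))
        else:
            # Replace an existing entry with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = dict(elem.attrib)
        root.clear()  # drop already-processed rows so memory stays bounded
    return reservoir


# Made-up demo data in the shape of a dump file (not real Stack Exchange rows).
demo_xml = (
    '<?xml version="1.0" encoding="utf-8"?>\n<posts>\n'
    + "".join(f'  <row Id="{i}" Score="{i * 2}" />\n' for i in range(1, 11))
    + "</posts>\n"
).encode("utf-8")

if __name__ == "__main__":
    for row in sample_rows(io.BytesIO(demo_xml), k=3, seed=0):
        print(row)
```

Any binary file-like object should work in place of io.BytesIO, so in principle the same loop could sit behind a decompressing reader or an HTTP response body. It gives a uniform sample without holding the whole file in memory, but it still has to read every row once, so it does not save bandwidth the way a columnar format would.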



