Thanks for sharing, good to see alternative options popping up. My wish is that the Stack Exchange dataset could one day be provided as a streaming Parquet or Arrow table, so that underfunded grads and post-grads could sample the data more easily and selectively (similar to how Hugging Face provides some of its datasets)[1][2].
The Hugging Face repo unfortunately prefilters some of the tables/rows according to some criteria, making it less useful for the general analytical queries that the BQ or SEDE datasets support. If anyone knows of an 'XML-streaming' solution that samples directly from the Internet Archive's data dumps, I am all ears.
https://medium.com/snowflake/how-to-load-the-stack-overflow-...
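
For what it's worth, here is a rough sketch of what the 'XML-streaming' part could look like in Python, using only the standard library. It assumes the dump file has already been decompressed and is a flat list of `<row .../>` elements under a single root, which is how the Stack Exchange dumps are laid out as far as I know. The names `iter_rows` and `sample_rows` are made up for illustration, and this does not cover fetching or un-7z-ing the archive from the Internet Archive, which is the harder part.

```python
import io
import random
import xml.etree.ElementTree as ET


def iter_rows(source):
    """Yield each <row> element's attributes as a dict, one row at a time."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            yield dict(elem.attrib)
            root.clear()  # discard rows already parsed so the tree does not grow


def sample_rows(source, k, seed=None):
    """Reservoir-sample up to k rows from the stream (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(iter_rows(source)):
        if i < k:
            reservoir.append(row)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir


# Tiny in-memory stand-in for a Posts.xml file (hypothetical data).
demo = io.BytesIO(
    b'<posts>'
    b'<row Id="1" PostTypeId="1" Score="5" />'
    b'<row Id="2" PostTypeId="2" Score="3" />'
    b'<row Id="3" PostTypeId="1" Score="0" />'
    b'</posts>'
)
questions = [r for r in iter_rows(demo) if r["PostTypeId"] == "1"]
print([r["Id"] for r in questions])
```

`source` can be a path or any binary file-like object, so in principle it could be fed from a decompression stream instead of a file on disk. The reservoir sampler gives a uniform sample in one pass without knowing the row count up front, at the cost of still having to read the whole file once.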