WeirderScience's comments

WeirderScience · 2025-07-24T06:47:04 1753339624

arXiv link for the paper: https://arxiv.org/pdf/2506.08257

(Note that the author Lukas Beyer is a different person from the other well-known author in the field, Lucas Beyer.)

WeirderScience · 2025-07-16T07:27:10 1752650830

Their sketchy ad/malware era aside, I do appreciate SourceForge keeping all the old OSS websites and repos alive.

WeirderScience · 2025-07-15T00:44:26 1752540266

If you find this kind of stuff interesting, I'd strongly recommend checking out the book "Spark: The Life of Electricity and the Electricity of Life" by Timothy J. Jorgensen.

WeirderScience · 2025-07-11T22:28:37 1752272917

Looking forward to Casey Muratori's talk!

prisenco · 2025-07-12T01:14:51 1752282891

When I saw the title of the conference I immediately thought of him so I'm not surprised he's headlining!

WeirderScience · 2025-07-11T22:10:33 1752271833

I wonder if this is a result of the previously reported clashes between OpenAI and Microsoft over access to the Windsurf IP (under their investment agreement)

WeirderScience · 2025-07-11T20:05:09 1752264309

The open training data is a huge differentiator. Is this the first truly open dataset of this scale? Prior efforts like The Pile were valuable, but had limitations. Curious to see how reproducible the training is.

layer8 · 2025-07-11T20:16:06 1752264966

> The model will be fully open: source code and weights will be publicly available, and the training data will be transparent and reproducible

This leads me to believe that the training data won’t be made publicly available in full, but merely be “reproducible”. This might mean that they’ll provide references like a list of URLs of the pages they trained on, but not their contents.

TobTobXX · 2025-07-11T21:11:07 1752268267

Well, when the actual content is hundreds of terabytes, providing URLs may be more practical for them and for others.

layer8 · 2025-07-11T22:19:25 1752272365

The difference between content they are allowed to train on and content they are allowed to distribute copies of is likely at least as relevant.

sschueller · 2025-07-12T18:38:47 1752345527

No problem, we have 25 Gbit/s home internet here. [1]

[1] https://www.init7.net/en/internet/fiber7/

glhaynes · 2025-07-11T20:44:48 1752266688

That wouldn't seem reproducible if the content at those URLs changes. (Er, unless it was all web.archive.org URLs or something.)
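
To make that concrete: a URL list is only checkable if each entry also carries a fingerprint of the content that was actually trained on. Here is a minimal sketch, assuming a hypothetical manifest that maps each URL to a SHA-256 digest of the page body. Nothing in the announcement says the data will ship this way, and the function names are made up for illustration.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a page body as a hex string."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(pages: dict[str, bytes]) -> dict[str, str]:
    """Map each URL to the digest of the content seen at training time.

    `pages` is a hypothetical {url: body} snapshot; a real pipeline
    would fetch and normalize the bodies first.
    """
    return {url: sha256_hex(body) for url, body in pages.items()}


def find_drift(manifest: dict[str, str], refetched: dict[str, bytes]) -> list[str]:
    """Return the URLs whose re-fetched content is missing or has changed."""
    changed = []
    for url, expected in manifest.items():
        body = refetched.get(url)
        if body is None or sha256_hex(body) != expected:
            changed.append(url)
    return sorted(changed)
```

With something like this, anyone re-fetching the URLs could at least tell which pages no longer match what the model saw, even though the changed content itself would not be recoverable from the manifest alone.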

dietr1ch · 2025-07-11T21:36:34 1752269794

This is a problem with the Web. It should be as easy to download content as it is to update a Git repo.

WeirderScience · 2025-07-11T20:21:50 1752265310

Yeah, I suspect you're right. Still, even a list of URLs for a frontier model (assuming it does turn out to be of that level) would be welcome over the current situation.

evolvedlight · 2025-07-11T22:07:58 1752271678

Yup, it’s not a dataset packaged like you hope for here, as it still contains traditionally copyrighted material