
On that note, shouldn’t we ask people to CONSENT to having their data trained on? And when I say consent, I mean asking them directly, not hiding it in some terms of service. That’s just slimy… like a used car salesman lol.

Why is it okay for companies to just vacuum up all user data when 90% of users don’t know it’s happening?

Or shall the “stealing” of knowledge and creative works without consent continue?



LLMs need too much data to ethically source their data sets. That's why they rely on aggressive scraping, user-provided prompts, and of course straight-up piracy to fill their datasets.

Outcry made Adobe and other such companies add (opt-out) user controls for gathering training data, but writers, especially writers on the internet, are usually ignored. I've seen even the angriest "AI is stealing my art, if you use Dall-E you're a bad person" people use ChatGPT, because they don't seem to consider writing to be art or expression as much as they do their own works.

Textual data just doesn't seem to be valued, and as a result data scrapers often don't care about annoyances such as "ethics" or "consent" when it comes to gathering training data.


There’s the rub. We pretend a change to the law will make LLM development stall, yet we acknowledge nobody is following the existing laws anyway.

Not sure how I feel about the whole thing, to be honest. (Legal gray area.)


> We pretend a change to the law will make LLM development stall, yet we acknowledge nobody is following the existing laws anyway

No. The development is a given. Where it happens is not. That’s the point. If you want to use European data to train, you’d better not have a European nexus.



