Hacker News

I think media outlets think way too highly of their contribution to AI.

Had they never existed, it would likely not have made a dent in AI development - much like believing that had they been twice as productive, it would likely not have made a dent in the quality of LLMs either.



How do you think those models get trained? You can only get so far with Wikipedia, Reddit, and non-fiction works like books and academic papers.


Have a look at this article: https://www.washingtonpost.com/technology/interactive/2023/a...

NY Times is 0.06% of common crawl.

These news media outlets provide a drop in the ocean of information, both qualitatively and quantitatively.

The news / media industry is really just trying to hold on to their lifeboat before inevitably becoming entirely irrelevant.

(I do find this sad, but that is the reality - I can already get considerably better journalism from LLMs than from actual journalists - both the clickbait stuff and the high-quality stuff.)


That seems like a reductive way to consider it. What percent of music was created by Led Zeppelin? What percent of art was painted by Monet? What percent of films by Alfred Hitchcock? It may be a small percentage objectively but they are hugely influential.


I don't think back propagation cares whose text it is back propagating.


The data sets aren't naively fed into the training runs.

Instead, training attempts to sample more heavily from higher quality sources, with, I'm sure, a mix of manual and heuristic labeling.
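The up-weighting idea can be sketched as a simple mixture sampler. The source names and weights below are entirely made up for illustration; real training pipelines use far more elaborate (and unpublished) mixtures:

```python
import random

# Hypothetical mixture weights: smaller, higher-quality sources are
# up-weighted relative to their raw share of the corpus. These names
# and numbers are illustrative, not any lab's actual recipe.
SOURCE_WEIGHTS = {
    "wikipedia": 3.0,     # small corpus, sampled often
    "books": 2.0,
    "news": 1.5,
    "common_crawl": 0.5,  # huge corpus, down-weighted per document
}

def sample_source(rng: random.Random) -> str:
    """Pick a source in proportion to its mixture weight."""
    sources = list(SOURCE_WEIGHTS)
    weights = [SOURCE_WEIGHTS[s] for s in sources]
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {s: 0 for s in SOURCE_WEIGHTS}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
# With these weights, "wikipedia" is drawn roughly six times as often
# as "common_crawl", even though the raw corpora are wildly lopsided
# the other way.
```

This is why "0.06% of Common Crawl" understates a source's influence: what matters is the share of the sampled training mix, not the share of the raw crawl.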


FWIW, no LLM I've ever used generates text in the writing style that newspapers and news sites use - hence I honestly doubt they've been given a meaningful boost in relevance.

Their idioms would occasionally leak through otherwise.


90% of Common Crawl is complete junk, while the tiny fraction of news articles powers almost all the AI answers in Google search.


News takes a very different path to get into search results; it doesn't go through database or archive passes, as that would take far too long.

And don't basically all those news sites allow google on purpose?


How many Reddit, HN, etc. posts are based on NYT articles? How many derivative news articles, blog posts, YouTube videos, TikToks, etc. are responses to those articles?

At least NYT is probably on the correct side of Sturgeon’s Law: https://en.wikipedia.org/wiki/Sturgeon%27s_law


> How many Reddit, HN, etc. posts are based on NYT articles? How many derivative news articles, blog posts, YouTube videos, TikToks, etc. are responses to those articles?

You may get an inconvenient answer when you ask the question the other way around.


0.06% is way higher than I would expect


How does the entire textual corpus of, say, the New York Times compare to all novels? Each article is a page of text, maybe two at most. There certainly are an awful lot of articles, but it's hard to imagine it adds up to much more than a couple hundred novels. There must be thousands of novels released each year.


That's apples to oranges.

LLMs are (apparently) massively used to get information about topics in the real world. Novels aren't going to be much help there. Journalism, particularly in written form, provides a fount of facts presented from different angles, as well as opinions, and it was all there free for the taking…

Wikipedia provides the scantest summary of that, fora and social media give you banter, fake news, summaries of news, and a whole lot of shaky opinions, at best. Novels give you the foundations of language, but in terms of knowledge nothing much beyond what the novel is about.


LLMs can get up to date information from primary sources - no journalists required.


I don't understand how LLMs can ask questions at a press conference.


To begin with, your premise is that the only primary sources are press conferences and that press conferences only provide information in response to questions.

But even taking it literally, isn't that one of the things LLMs could actually do? You're essentially asking how a text generator could generate text. The real question is whether the questions would be any good, but the answer isn't necessarily no.


I'm sure any competent agent would send an email, or just ask as an aside in a chat.


Startup idea right there.


I don't think an LLM can have secret human sources that provide them with confidential information anonymously. Not all news shows up on Twitter.


You don't need the secret human sources any more.

You used to need them, because journalists had the distribution and the sources didn't. In a world of printed newspapers, you couldn't get your story distributed nationally (much less worldwide) without the help of a journalist, doubly so if you wanted to stay anonymous.

Nowadays, you just make a Substack and that's that.

See that recent expose on the Delve fraud as just one example. No journalists were harmed in the making of that article.


This is technically trivial. Most data comes from chats these days, not the web.

Start thinking!


The primary source for most news is journalism.


In context, primary source means the subject of the article (the thing the journalist is writing about).

Journalism is by definition a secondary source. (Notwithstanding edge cases like articles reporting directly on the news industry itself.)


Journalism is absolutely not by definition a secondary source.

If a journalist is on location covering a flood, for example, they are the primary source.

A journalist conducting an interview would also be a primary source.


Primary sources can be, and often are, very biased. Journalists are (supposed to be) doing fact checks and gathering multiple sources from all sides. Modern journalism is in a terrible state, but still important.

Imagine if all info about Facebook came from Facebook...


By talking to users? I'm 100% sure Google and OpenAI know every major news story, in much greater detail, long before NYT publishes it.

I'd imagine they already have a database of users such that, if multiple people talk about a possibly true story, they can ask subject experts or users related to the subject for clarification and further information.


If excluding these sources wouldn’t make a difference, why do AI companies scrape them despite explicit requests to not be scraped?


They want as diverse a data set as possible?

It is not like they paid anybody else for their contribution either.

It is just not worth more than anything else in the data sets.


Define quality.

Many publications put information on the internet for the first time, or curate it for the first time, or research a topic deeper than ever before. Someone - a thinking, feeling human - had to get out there and try restaurants, talk to people, pore through archives, read books, use products. Each of them contribute a little to what we know about the world.

I do this for a living. AI might soon put me out of work. It already more than halved my audience, using my own work. It's sickening to see people cheer for it because they have a bone to pick with certain websites. Eventually those websites will be gone, but so will the good ones that produced critical information.


Isn't the non-LLM generated text becoming more valuable for training as the web at large is flooded with slop?

Preventing new human generated text from being used by AI firms (without consent) seems like a valid strategy.


No.

Modern LLMs are trained on a large percentage of synthetic data.

This sentiment is largely legacy (even though just a couple of years old).



