
Title should be changed to "OpenAI publishes evidence they trained on pirated movies".


Of course. Piracy is legal when you have a bigger pile of money than the studios.


Let's not forget that some real pirates (for example, corsairs) were also legal and carried out legitimate pirate activities against ships of foreign countries.


>Let's not forget that some real pirates (for example, corsairs) were also legal and carried out legitimate pirate activities against ships of foreign countries.

In other words, there are activities that are legal or not depending on whether you have authorization from the state. That describes many things. For instance, if you synthesize meth without a license from the DEA/FDA, you're a "drug cartel" or whatever. But if you do it with a license, you're a "pharmaceutical company", and you're not making "meth", you're making "desoxyn".


Synthesizing chemical substances doesn't involve murdering people though.


Families of fentanyl overdose victims would disagree. Moreover it's not hard to find examples of "legal if the government authorizes it" for killings. Cops and soldiers, for instance.


It takes quite a leap of logic to blame someone ingesting a toxic quantity of a substance on the person who manufactured the substance. When someone drinks bleach do we blame the company that makes the bleach?


It is a bad comparison because a substance vendor doesn't kill anyone, just as a gun store doesn't. Soldiers are a better analogy though.


I suspect privateers would have been offended at being called pirates, but is this what is going on? If it were specifically a Chinese AI company pirating Hollywood, for example, sure, but it seems more like an everyone-firing-at-everyone situation.


Isn't piracy legal in many parts of the world?

Legally, why wouldn't they be able to do the piracy parts in one of those jurisdictions and then ship the outputs back to the mothership?


Too big to nail


How is this evidence of that fact? Honest question.

I can see how this might show that subtitles from online sub communities are used, or that maybe even original subtitles from e.g. DVDs are used. But isn't it already known and admitted (and allowed?) that AI uses all sorts of copyrighted material to train models?


> I can see how this might show that subtitles from online sub communities are used, or that maybe even original subtitles from e.g. DVDs are used.

Indeed, the captioning is copyrighted work and you are not legally allowed to copy and redistribute it.

> But isn't it already known and admitted (and allowed?)

No, and I don't see where you got that from. Meta [1], OpenAI [2] and everybody else are being sued as we speak.

1: https://petapixel.com/2025/01/10/lawsuit-alleges-mark-zucker...

2: https://www.reuters.com/legal/litigation/openai-hit-with-new...


> Indeed, the captioning is copyrighted work and you are not legally allowed to copy and redistribute it.

Unless you qualify for one of the many exceptions, such as fair use.


It’s not clear that training is fair use. That’s being contested in court I think.


Training isn’t recreating or distributing, so copyright won’t apply if the ruling is actually consistent with the intention of the law, which it may not be.

Using copyrighted materials and then meaningfully transforming them isn’t infringement. LLMs only recreate original work in the same way I did when I wrote the first sentence of this paragraph, because it probably exists word for word somewhere else too.


That's your interpretation, not the law.


> I don't see where you got that from

It’s been determined by the judge in the Meta case that training on the material is fair use. The suit in that case is ongoing to determine the extent of the copyright damages from downloading the material. I would not be surprised if there is an appeal to the fair use ruling but that hasn’t happened yet, as far as I know. Just saying that there is good reason for them to think it’s been allowed because it kind of has; that can be reversed but it happened.


That was specifically involving 13 authors.

There haven't been any trials yet about the millions of copyrighted books, movies and other content they evidently used.


There's no reason to think those cases will go any differently. As far as I know, the ruling would have to be appealed at this point. I am only commenting to say that there is reason to think this is true:

> But isn't it already known and admitted (and allowed?)

You seemed to be confused about why this person believed that:

> No, and I don't see where you got that from.

And I wrote a comment intended to dispel your confusion. The above commenter thought that it was allowed because a judge said it was allowed; that can be appealed but that's the reason someone thinks it's allowed.


> There's no reason to think those cases will go any differently. As far as I know, the ruling would have to be appealed at this point.

Trial court rulings aren't binding precedent even on the same court in different cases, so it's quite possible that different cases at the trial level can reach different conclusions on fair use on fairly similar facts, given the lack of appellate precedent directly on point with AI training.


Yea, no. I don't think I am confused.

A single verdict about a specific case (13 authors vs META) does not mean it's legal for companies to steal IP from other companies which has evidently been going on for some years now.

Those other companies have lawyers powerful enough to change jurisdiction in many countries in order to "protect their IP".


The Chinese subtitles for silence use a common mark for pirated media in that language, according to other commenters here. In general, it's pretty likely that if you're finding non-professional subtitles, they were distributed with pirated media in some form, since that's where you get the most fan subs, after all.


> were distributed with pirated media in some form,

I disagree with this conclusion. I've used e.g. the opensubtitles dataset for some data analysis in the past. It's a huge dataset, freely available and precisely intended for such use. Now, whether all the data in the opensubtitles dataset is legal is another question.

So one might argue that using this opensubtitles dataset makes one complicit in the illegal activities of opensubtitles themselves. IDK, IANAL.


> How is this evidence of that fact?

The contention is that the specific translated text appears largely from illegal translations (i.e., fansubs) and not from authorized translations. And from a legal perspective, that would basically mean there's no way they could legally have appropriated that material.

> But isn't it already known and admitted (and allowed?) that AI uses all sorts of copyrighted material to train models?

Technically, everything is copyrighted. But your question is really about permission. Some of the known corpora for AI training include known pirate materials (e.g., libgen), but it's not known whether the AI companies are filtering out those materials from training. There's a large clutch of cases ongoing right now about whether AI training is fair use, and the ones that have resolved at this point have done so on technical grounds rather than answering the question at stake.


HN is pretty strict about not editorializing titles. Even if your statement was unequivocally correct, the post would get flagged.



