Hacker News

If you own a book, it should be legal for your computer to take a picture of it. I honestly feel bad for some of these AI companies because the rules around copyright are changing just to target them. I don't owe royalties to every book I read because I may subconsciously incorporate their ideas into my future work.


Are we reading the same article? The article explicitly states that it's okay to cut up and scan the books you own to train a model from them.

> I honestly feel bad for some of these AI companies because the rules around copyright are changing just to target them

The ruling would be a huge win for AI companies if upheld. It's really weird that you reached the opposite conclusion.


Something missed in arguments such as these is that in measuring fair use there's a consideration of impact on the potential market for a rightsholder's present and future works. In other words, can it be proven that what you are doing meaningfully deprives the author of future income?

Now, in theory, you learning from an author's works and competing with them in the same market could meaningfully deprive them of income, but it's a very difficult argument to prove.

On the other hand, with AI companies it's an easier argument to make. If Anthropic trained on all of your books (which is somewhat likely if you're a fairly popular author) and you saw a substantial loss of income after the release of one of their better models (presumably because people are just using the LLM to write their own stories rather than buy your stuff), then it's a little bit easier to connect the dots. A company used your works to build a machine that competes with you, which arguably violates the fair use principle.

Gets to the very principle of copyright, which is that you shouldn't have to compete against "yourself" because someone copied you.


> a consideration of impact on the potential market for a rightsholder's present and future works

This is one of those mental gymnastics exercises that makes copyright law so obtuse and effectively unenforceable.

As an alternative, imagine a scriptwriter buys a textbook on orbital mechanics, while writing Gravity (2013). A large number of people watch the finished film, and learn something about orbital mechanics, therefore not needing the textbook anymore, causing a loss of revenue for the textbook author. Should the author be entitled to a percentage of Gravity's profit?

We'd be better off abolishing everything related to copyright and IP law altogether. These laws might've made sense back in the days of the printing press but they're just nonsensical nowadays.


Well I mean you're constructing very convoluted and weak examples.

I think, in your example, the obvious answer is no, they're not entitled to any profits of Gravity. How could you possibly prove Gravity has anything to do with someone reading, or not reading, a textbook? You can't.

However, AI participates in the exact same markets it trains from. That's obviously very different. It is INTENDED to DIRECTLY replace the things it trains on.

Meaning, not only does an LLM output directly replace the textbook it was trained on, but that behavior is the sole commercial goal of the company. That's why they're doing it, and that's the only reason they're doing it.


> It is INTENDED to DIRECTLY replace the things it trains on.

Maybe this is where I'm having trouble. You say "exact same markets" -- how is a print book the exact same market as a web/mobile text-generating human-emulating chat companion? If that holds, why can't I say a textbook is the exact same market as a film?

I could see the argument if someone published a product that was fine-tuned on a specific book, and marketed as "use this AI instead of buying this book!", but that's not the case with any of the current services on the market.

I'm not trying to be combative, just trying to understand.. they seem like very different markets to me.


> how is a print book the exact same market as a web/mobile text-generating human-emulating chat companion? If that holds, why can't I say a textbook is the exact same market as a film?

Because the medium is actually the same. The content of a book is not paper, or a cover. It's text, and specifically the information in that text.

LLMs are intended to directly compete with and outright replace that use case. I don't need a textbook on, say, Anatomy, because ChatGPT can structure and tell me about Anatomy, often with the exact same content slightly rearranged.

This doesn't really hold for fictional books, nor does it hold for movies.

Watching a movie and reading a book are inherently different experiences, which cannot replace one another. Reading a textbook and asking ChatGPT about topic X is, for all intents and purposes, the same experience. Especially since, remember, most textbooks are online today.


Is it? If a teacher reads a book, then gives a lecture on that topic, that's decidedly not the same experience. Which step about that process makes it not the same experience? Is it the fact that they read the book using their human brain and then formed words in a specific order? Is it the fact that they're saying it out loud that's transformative? If we use ChatGPT's TTS feature, why is that not the same thing as a human talking about a topic after they read a book since it's been rearranged?


Well there's multiple reasons why it's not the same experience. It's a different medium, but it's also different content. The textbook may be used as a jumping-off point, supplemented by decades of real-life experience the professor has.

And, I think, the elephant in the room with these discussions: we cannot just compare ChatGPT to a human. That's not a foregone conclusion and, IMO, no, you can't just do that. You have to justify it.

Humans are special. Why? Because we are Human. Humans have different and additional rights which machines, and programs, do not have. If we want to extend our rights to machines, we can do that... but not for free. Oh no, you must justify that, and it's quite hard. Especially when said machines appear to work against Humans.


Personally I think a more effective analogy would be if someone used a textbook and created an online course / curriculum effective enough that colleges stop recommending the purchase of said textbook. It's honestly pretty difficult to imagine a movie having a meaningful impact on the sale of textbooks since they're required for high school / college courses.

So here's the thing, I don't think a textbook author going against a purveyor of online courseware has much of a chance, nor do I think it should have much of a chance, because it probably lacks meaningful proof that their works made a contribution to the creation of the courseware. Would I feel differently if the textbook author could prove in court that a substantial amount of their material contributed to the creation of the courseware, and when I say "prove" I mean they had receipts to prove it? I think that's where things get murky. If you can actually prove that your works made a meaningful contribution to the thing that you're competing against, then maybe you have a point. The tricky part is defining meaningful. An individual author doesn't make a meaningful contribution to the training of an LLM, but a large number of popular and/or prolific authors can.

You bring up a good point, interpretation of fair use is difficult, but at the end of the day I really don't think we should abolish copyright and IP altogether. I think it's a good thing that creative professionals have some security in knowing that they have legal protections against having to "compete against themselves".


> An individual author doesn't make a meaningful contribution to the training of an LLM, but a large number of popular and/or prolific authors can.

That's a point I normally use to argue against authors being entitled to royalties on LLM outputs. An individual author's marginal contribution to an LLM is essentially nil, and could be removed from the training set with no meaningful impact on the model. It's only the accumulation of a very large amount of works that turns into a capable LLM.


Yeah, this is something I find kind of tricky. I definitely believe that AI companies should get permission from rightsholders to train on their works, but actually compensating them for their works seems pointless. To make the royalties worthwhile you'd have to raise the cost per query to an absolutely absurd level.


The amounts are not the only problem; there's no good way to measure which input in the training contributed to what degree to the output. I wouldn't be surprised if it turns out it's fundamentally impossible.

Paying everyone a flat rate per query is probably the only way you could do it; any other approach is either going to be contested as unfair in some way, or will be too costly to implement. But then, a flat rate is only fair if it covers everyone in proportion to their contribution, which will get diluted by the portion of training data that's not obviously attributable, like Internet comments or Wikipedia or public domain stuff or internally generated data, so I doubt authors would see any meaningful royalties from this anyway. The only thing it would do is make LLMs much more expensive for society to use.
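The dilution argument can be made concrete with a back-of-envelope sketch. Every number below is invented purely for illustration (book count, token share, royalty rate, and query volume are all assumptions, not real figures):

```python
# Hypothetical figures only -- none of these are real numbers.
num_books = 500_000                  # assumed number of books in the training set
book_token_share = 0.20              # assumed fraction of training tokens that come from books
                                     # (the rest: web text, Wikipedia, public domain, etc.)
royalty_per_query = 0.001            # assumed flat royalty pool of $0.001 per query
queries_per_year = 100_000_000_000   # assumed 100B queries/year across the service

annual_pool = royalty_per_query * queries_per_year  # total royalty pool per year
book_pool = annual_pool * book_token_share          # portion left after dilution by
                                                    # non-attributable training data
per_author = book_pool / num_books                  # equal split per book

print(f"annual pool: ${annual_pool:,.0f}")      # $100,000,000
print(f"per-book payout: ${per_author:,.2f}")   # $40.00 per book per year
```

Even with a nine-figure total pool, the per-book payout under these (made-up) assumptions lands around $40 a year, which illustrates why the dilution makes meaningful individual royalties unlikely.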


> it's a good thing that creative professionals have some security in knowing that they have legal protections

This argument would make sense if it was across the board, but it's impossible (and pretty ridiculous) to enforce in basically anything except very narrow types of media.

Let's say I come up with a groundbreaking workout routine. Some guy in the gym watches me for a while, adopts it, then goes on to become some sort of bodybuilding champion. I wouldn't be entitled to a portion of his winnings, that would be ridiculous.

Let's say I come up with a cool new fashion style. Someone sees my posts on insta and starts dressing similarly, then ends up with a massive following and starts making money in a modelling career. I wouldn't be entitled to a portion of their income, that would be ridiculous.

And yet, for some reason, media is special.


The core problem here is that copyright already doesn't actually follow any consistent logical reasoning. "Information wants to be free" and so on. So our own evaluation of whether anything is fair use or copyrighted or infringement thereof is always going to be exclusively dictated by whatever a judge's personal take on the pile of logical contradictions is. Remember, nominally, the sole purpose of copyright is not rooted in any notions of fairness or profitability or anything. It's specifically to incentivize innovation.

So what is the right interpretation of the law with regards to how AI is using it? What better incentivizes innovation? Do we let AI companies scan everything because AI is innovative? Or do we think letting AI vacuum up creative works to then stochastically regurgitate tiny (or not so tiny) slices of them at a time will hurt innovation elsewhere?

But obviously the real answer here is money. Copyright is powerful because monied interests want it to be. Now that copyright stands in the way of monied interests for perhaps the first time, we will see how dedicated we actually were to whatever justifications we've been seeing for DRM and copyright for the last several decades.


Everything is different at scale. I'm not giving a specific opinion on copyright here, but it just doesn't make sense when we try to apply individual rights and rules to systems of massive scale.

I really think we need to understand this as a society and also realize that moneyed interests will downplay this as much as possible. A lot of the problems we're having today are due to insufficient regulation differentiating between individuals and systems at scale.


"Judge says training Claude on books was fair use, but piracy wasn't."


The difference here is that an LLM is a mechanical process. It may not be deterministic (at least, in a way that my brain understands determinism), but it's still a machine.

What you're proposing is considering LLMs to be equal to humans when considering how original works are created. You could make the argument that LLM training data is no different from a human "training" themself over a lifetime of consuming content, but that's a philosophical argument that is at odds with our current legal understanding of copyright law.


That's not a philosophical argument at odds with our current understanding of copyright law. That's exactly what this judge found copyright law currently is and it's quoted in the article being discussed.


Thanks for pointing that out. Obviously I hadn't read the whole article. That is an interesting determination the judge made:

> Alsup ruled that Anthropic's use of copyrighted books to train its AI models was "exceedingly transformative" and qualified as fair use, a legal doctrine that allows certain uses of copyrighted works without the copyright owner's permission.


There are still questions: is an AI a 'user' in the copyright sense?

Or even: is an individual operating within the law as fair use the same, in spirit, as a voracious all-consuming AI training bot ingesting everything?

Consider a single person in a National Park, allowed to pick and eat berries, compared to bringing a combine harvester to take it all.



