I know this website is not a hivemind, but it's interesting that every time an article like this gets posted, the majority opinion seems to be that training diffusion models on copyrighted work is totally fine. In contrast, when the topic is training code-generation models, there are multiple comments saying it's not OK if licenses weren't respected.
For anyone who holds both of these opinions, why do you think it's ok to train diffusion models on copyrighted work, but not co-pilot on GPL code?
If I were to steel man both sides it'd be something like this:
1. Training an AI model on code (so far) can make it regurgitate code line-for-line (with comments!). This is like "learning to code" by just cutting and pasting working code from other codebases; you have to follow the license. The AI doesn't "understand the algorithm" at all (or it hasn't been told "don't export the input, you fool"). Obviously a bog-simple AI could make all licenses moot by dumping out its input verbatim, and the courts wouldn't permit that.
2. Training an AI model on illustrations so far produces "style parodies" which may look similar to an untrained eye (the artist here is annoyed because she doesn't draw like that, even though to us it looks similar enough). Drawing a picture that looks like Mickey Mouse is a trademark violation, but tracing a picture of the Mouse is both a trademark and a copyright violation.
The first violates some pretty clear legal concepts; the second is closer to violating moral concepts but those are more flexible - if an artist spends years learning to paint in the style of Michelangelo is that immoral?
The problem with this argument is that it's founded in how the AI is used, not how it is made. It's not a compelling reason to ban the tool, it's a compelling reason to regulate its use.
Copilot can produce code verbatim, but it doesn't unless you specifically set up a situation to test it. It requires things like "include the exact text of a comment that exists in training data" or "prefix your C functions the same way as the training data does".
In everyday use, my experience has been that Copilot draws extensively from files I've opened in my codebase. If I give Copilot a function body to fill in within a class I've already written, it will use my internal APIs (which aren't even hosted on GitHub) correctly as long as there are 1-2 examples in the file and I'm using a consistent naming convention. This isn't copypasta; it really does have a clear understanding of the semantics of my code.
This is why I'm not in favor of penalizing Microsoft and GitHub for creating Copilot. I think there needs to be some regulation on how it is used to make sure that people aren't treating it as a repository of copypasta, but the AI itself is pretty clearly capable of producing non-infringing work, and indeed that seems to be the norm.
Please let's not start dictating how people should use a piece of software. It would be like "regulating" Microsoft Word just because people might use it to duplicate copyrighted works.
I'm not saying we should regulate the software, I'm saying we need some rigorous method of ensuring that using the AI tools doesn't put you in jeopardy of accidental copyright infringement.
We most likely don't need new laws, because infringement is infringement and how you made the infringing work is irrelevant. Accidental infringement is already illegal in the US.
I would argue that we _do_ need new laws. AI-generated code is quite different from any other literary work: after all, it was not created by a human.
My own personal opinion is that the AI generated code (or pictures in the case of the article) should be under a new category of literary works, such that it does not receive copyright protection, but also does not violate existing copyright.
This is meaningless though. The majority of AI generated art you see out there is either hand tweaked or post-processed or both. There's human input involved and drawing a line is going to absolutely backfire.
If you presented both the generated image and the "original" to a jury of peers (or even a panel of experts in the field), they would be able to make a determination as to whether the generated image violated the copyright of the presented "original".
Humans tweaking the image is immaterial to this determination: if the human tweaked it so that it no longer seems to violate copyright, then that same panel would make the same determination.
You are arguing that AI-generated means no copyright protection. So you can't tweak it to "not violate copyright", because there literally isn't any.
Of course, you have no way to prove whether any image was or was not generated by AI, so welcome to a new scam for law firms: aggressively suing artists while claiming they suspect AI was used in their works.
The vast majority of paintings weren't created by a human either, but by a paintbrush. We should really ban those too. Just think of all the poor finger-painters who've been put out of a job!
I think it's worth pointing out that Adobe has been doing this for a long time. You can't open or paste images into Photoshop which resemble any major currency.
> Copilot can produce code verbatim, but it doesn't unless you specifically set up a situation to test it.
It does not matter what a service can or cannot do. We do not regulate based on ability, but on action.
The service has an obligation to the license holders of the training data not to violate the license. The mechanism by which the license is violated is irrelevant. The only thing that matters is that the code ended up somewhere it shouldn't, and the service is the actor in the chain of responsibility that dropped the ball.
The prompting of the service is irrelevant. If I ask you to reproduce a block of GPL code in my codebase and you do it, you violated the license. It does not matter that I primed you or led you to that outcome. What matters is that the legally protected code is somewhere it shouldn't be.
> It does not matter what a service can or cannot do. We do not regulate based on ability, but on action.
Whether we agree with it or not, intellectual property laws have historically been regulated by ability as well as action. Hence why blank multimedia formats would often have additional taxes in some jurisdictions just in case someone chose to record copyrighted content onto them. And why graphics cards used to include an MPEG royalty in their consumer cost, regardless of whether that user planned to watch DVDs on their computer.
Not saying I agree with this principle. Just that there is already a long history of precedent in this area.
Like a lot of politics, ultimately it just comes down to who has the bigger lobbying budget.
> If I ask you to reproduce a block of GPL code in my codebase and you do it, you violated the license. It does not matter that I primed you or lead you to that outcome. What matters is the legally protected code is somewhere it shouldn’t be.
This isn't accurate. If I reproduce GPL code in your codebase, that's perfectly acceptable as long as you obey the terms of the GPL when you go to distribute your code. In this hypothetical, my act of copying isn't restricted under the GPL license, it's your subsequent act of distribution that triggers the viral terms of the GPL.
The big question that is still untested in court is whether Copilot itself constitutes a derivative work of its training data. If Copilot is derivative then Microsoft is infringing already. If Copilot is transformative then it is the responsibility of downstream consumers to ensure that they comply with the license of any code that may get reproduced verbatim. This question has not been ruled on, and it's not clear which direction a court will go.
> The big question that is still untested in court is whether Copilot itself constitutes a derivative work of its training data.
Microsoft has a license to distribute the code used to train Copilot, and isn't distributing the Copilot model anyway, so it doesn't matter whether the model itself infringes copyright.
Whereas that same question probably does matter for Stable Diffusion.
As in, "including improving the Service over time...parse it into a search index or otherwise analyze it on our servers" is the provision that grants them the ability to train Copilot.
(also, in case you're wondering what happens if you upload someone else's code: "If you're posting anything you did not create yourself or do not own the rights to, you agree that you are responsible for any Content you post; that you will only submit Content that you have the right to post; and that you will fully comply with any third party licenses relating to Content you post.")
But you may not have the rights to grant that extra license. If Copilot is determined to violate the GPL, they can yell at you all they want, but they will have to remove it, as nobody can break someone else's license for you.
It'll have to be tested in court, but likely nobody actually gives a shit.
> But you may not have the rights to grant that extra license if CoPilot is determined to violate the GPL
Which is why that second provision is there to shift liability to you. You MUST have the ability to grant GitHub that license to any code you upload. If you don't, and MS is sued for infringing upon the GPL, presumably Microsoft can name you as the fraudster that claimed to be able to grant them a license to code that ended up in copilot.
How is that different from a consultant who indiscriminately copies from Stack Overflow?
Tangent to that is the "who gets sued and needs to fix it when a code audit is done?"
Ultimately, the question is then "who is responsible for verifying that the code submitted to production isn't copying from sources that have incompatible licensing?"
The consultants would have to knowingly copy from somewhere. One can hope they're educated on licensing, at least if they expect to get paid.
If Microsoft is so confident in Copilot doing sufficient remixing, then why not train it on their own internal code? And why put the burden of IP vetting on clients, who have less information than Copilot has?
> How is that different from a consultant who indiscriminately copies from Stack Overflow?
And how is that different from a student learning how to code off Stack Overflow (or anywhere else, for that matter), then reproducing some snippets or learned code structure in their employment?
Or a random employee copies some art work that is then published ( https://arstechnica.com/tech-policy/2018/07/post-office-owes... ). You will note all the people that didn't get in trouble there - neither the photographer who created the image, nor Getty in making it available, nor the random employee who used it without checking its provenance.
In all of these cases, it is (or would be) the organization that published the copyrighted work without doing the appropriate diligence on checking what it is, if it would be useable, and how it should be licensed.
> The Post Office says it has new procedures in place to make sure that it doesn't make a mistake like this again.
... which is what companies who make use of AI models for generating content (be it art or code) should be doing to ensure that they're not accidentally infringing on existing copyrighted works.
Copilot is regurgitating snippets of code that are still under copyright and not in the public domain. Some may consider publicly available code fair use, but the fact that they're selling access for commercial use may undercut that argument.
There is a part of deep learning research (differential privacy) which focuses on making sure an algorithm cannot leak information about its training set. This is a rigorous concept: you can quantify how privacy-preserving a model is, and there are methods to make a model "private" (at a cost in performance, I think, for now).
Differential privacy only proves that the model cannot leak more than a certain amount of information about individual samples of the training set. This guarantees only that the input is not leaked back exactly; any composition of the training set is still valid output, although in image generation this usually means a very distorted image.
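To make the "quantifiable" part concrete, here's a minimal sketch of the classic Laplace mechanism, the textbook differential privacy building block. This is a toy illustration with names I made up, not how DP training of deep models (e.g. DP-SGD) is actually implemented:

```python
import math
import random

def laplace_mechanism(true_value, sensitivity, epsilon):
    """Return a differentially private answer to a numeric query.

    Noise is drawn from Laplace(0, sensitivity/epsilon): a smaller
    epsilon means stronger privacy but a noisier answer.
    """
    scale = sensitivity / epsilon
    u = random.random() - 0.5  # uniform on (-0.5, 0.5)
    # Sample Laplace(0, scale) via the inverse CDF
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_value + noise
```

The privacy budget epsilon is the quantity you can tune and account for: halve epsilon and you double the noise scale, trading accuracy for a stronger bound on what any single training sample can reveal.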
The AI image generator is revealed to be a lossy compression algorithm which can recall near-identical images to the ones it was trained with. Therefore, the software is conveying copyrighted works. If somebody gave you the model, they violated copyright in doing so. If somebody runs the model on a server, they violated copyright in transmitting the image to you. If you, the recipient of that copyrighted work, go on to redistribute it, you have also violated the copyright. I don't see any difference between these image generators and code generators.
Exact replicas are an issue. If you are using AI image generation to replicate a near-exact image, then that's illegal. But nobody cares if you copy a nice code pattern from GPL code and apply it to your own code base. In the same fashion, nobody should care if you make an image in the same art style.
Inexact replicas are also an issue, otherwise there would be no issue with distributing MP3s of an Audio CD, as it's a lossy format that is only close to the original.
I suspect the courts will treat AI more like a "black box" - they won't care how or why your black box can perfectly play Metallica, only that it does.
Yes, and that's why I personally believe that the model itself should be considered a derived work of such code. But OP was specifically talking about "code patterns".
> if an artist spends years learning to paint in the style of Michelangelo is that immoral?
I'd say that artist has gained a lot by studying Michelangelo, including an appreciation for what Michelangelo himself accomplished and insights into how to paint as well or better, and maybe even how to teach that to other people. I don't think we get those benefits from AI models doing that (at least not yet!)
I think we're kidding ourselves to think that some nebulous concept of "the artist's journey" somehow informs the end result in a way that is self-evident in human-produced digital art. Just as with electric signals in the "brain in a vat" thought experiment, with digital art it's pixels. If an algorithm can produce a set of pixels that is just as subjectively good as a human artist, then nobody will be able to tell - and most likely the average person just won't care.
On the other hand, I would say that traditional mediums (especially large format paintings) are relatively safe from AI generation/automation - for now.
> On the other hand, I would say that traditional mediums (especially large format paintings) are relatively safe from AI generation/automation - for now.
Why do you think that? I think large format paintings might be in just as much danger.
There’s a large industry of talented artists in China, Vietnam, etc who copy famous artworks by hand for very low prices. They’re easily accessible online: you upload an image and provide some stylistic details and the artist does the hard work of turning the image into brush strokes. It’s not “automated” but I’ve already ordered one 4’x2’ AI generated painting in acrylic relief for less than the cost of a 1’x1’ from a local community gallery. I put in quite a bit of work inpainting the image to get what I want but it would have been completely impossible to get what I want even six months ago.
I've only ever purchased half a dozen artworks in my life, and they were all under a few hundred bucks, but with this new tech it just doesn't make sense to buy an artist's original work unless it's for charity. The AI can do the creative work the way I want, and there are plenty of artists who are excellent at the mechanical translation (which still requires a lot of creativity, mind).
You don't even have to go to China - I had a very nice painting painted from a photograph for a friend done by another friend's mom who just like painting landscapes.
It looked great and all I had to do was pay for supplies, which was still less than the cost of the framing.
I didn't know there was an industry for that, I guess I should have figured. I might look into that for my own purposes.
Although for what it's worth when I said "large format paintings" in my mind I was thinking very large paintings - like Picassos's Guernica - larger than something the average person would have hanging in their home. To the point that the cost of producing it and transporting it is large enough that a buyer is more likely to take personal interest in the artist and much less likely to knowingly purchase something AI-generated or otherwise automatically produced.
I think we're kidding ourselves to think that clustering features of existing works and iteratively removing noise based on that clustering is somehow comparable to building up human experiences and expressing them through art.
Using the "brain in a jar" thought experiment, you're making the assumption that the iterative denoising process is equivalent to the way the "brain in the jar" would generate art. Since the question is whether or not the processes are equivalent, it seems nonsensical to have to assume their equivalence for your argument.
I don't think the artist's journey necessarily informs the end result in some way, but I believe it can be an important experience for the artist. Then again, artists can still do this in the era of generative art; there's just not as much chance of being rewarded for it. If this leads to fewer people wanting to explore art, then I think we've lost something. But it's not clear to me where things are headed, I guess. This could be a huge boon in letting people explore ways of expressing themselves who otherwise lacked the artistic ability to want to try.
And perhaps more importantly regarding (1) than simple regurgitation: code does things. There's a real risk that if you just let Copilot emit output without understanding what that output does, it'll do the wrong thing.
Art is in the eye of the beholder. If the output looks correct as per what you're looking for, it is correct. There's no additional layer of "Is it saying what I meant it to say" that is relevant to anyone who isn't an art critic.
Art is in the eye of the beholder, but it still needs a creator.
That creator had a vision in mind that's unique to them because of their experiences, and I think it's wrong to say that this image can be quantified as a location in an abstract feature space.
So to say "there is no additional layer [to judge goodness] that is relevant to [most people]" assumes there is an algorithmic measure of "goodness" that can be applied to art, which is an assumption you need to make to believe that there are any similarities between AI-generated art and human-generated art other than "they look kinda similar".
Until 100 years from now, when more general purpose AI are having what could be described as experiences, and can be asked to draw a picture of how they feel when thinking about being unplugged/death.
We hoomans love to think we're special, but quantum tubules etc besides, we really are just biological computers running a program developed over our evolutionary/personal histories.
Sure, in a future 100 years from now, when AI is an actual general AI and not the specialized algorithms we have today, one might be able to argue that it does things the same way as a person (although one would hope it does them better instead, since that's the goal).
Until then we are special and we're just pretending these specialized algorithms are replicating the things we don't even understand. Anthropomorphising the algorithms by saying they "learn" and "feel" and "experience" is us as humans trying to draw parallels to ourselves as we find our understanding inadequate to explain what's actually going on.
I'm pretty sure that there's a considerable amount of art hanging in museums, that was done by students of great artists. I think there are several Mona Lisas, done by da Vinci's students, and they are almost identical to the original.
In fact it's well known that successful artists would have studios with their students churning out art that they'd apply the direction and final touches to.
Which are which has been lost to time in some cases, and the art world is filled with dissertations on it.
> an artist spends years learning to paint in the style of Michelangelo is that immoral?
This is a deceptive comparison. A human learns a style and adds their own ideas. Their ideas are affected by their mood, schooling, beliefs, and the coffee they had this morning.
AI has only the training dataset. If you trained an AI on 1000 copyrighted pictures, the AI can't add its own ideas; it can only remix pixels from the stolen work of other artists.
This is basically like money laundering: as if you melted down stolen gold coins, minted new coins, and then claimed the gold is yours because you made them.
I wouldn't dismiss it so fast. I've seen SD generate some quite creative images, and original ones, as far as I've been able to determine by searching the training dataset. One example was asking for a picture of someone riding a Vespa, and one of the images had the rider wearing the Vespa fenders as a helmet, louvers and all. I don't see what else to call that but the AI's "own idea".
By deconstructing the "decisions" (to use a disgusting anthropomorphism) that led to either image, we can dismiss the "I don't understand, so it must be doing something greater than it is" rhetoric.
The decisions leading up to the human art is the entire human experience leading up to the creation of the art(and possible context afterwards), which we as people tend to put value on.
The "decisions" leading up to the AI art are a series of iterative denoising steps that attempt to recover an image from noisy data by estimating how the noisy data differs from a "good looking" image.
So for your "vespa fenders as a helmet" drawing, I don't think that constitutes an algorithm being "creative". If a human were to make the same picture we could rationalize that they're being creative because we can imagine a path where their human experiences led to a new idea. Since the algorithm was only ever made to denoise an image based on its abstract feature-space representation I don't see any way we could rationalize that it created a new idea. The algorithm never "thought" it should use a fender as a helmet, it only found that the best way to denoise the current image to the one described in feature-space was to remove pixels that resulted in the image.
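For what it's worth, the iterative structure being described can be sketched in a few lines. This toy "denoiser" cheats by knowing the target image (a real diffusion model instead uses a learned noise predictor distilled from its training set); it only illustrates the loop, not the learning:

```python
import random

def toy_denoise_step(x, t, target):
    """Stand-in for the learned noise predictor: nudge each 'pixel'
    a fraction of the way from the current estimate toward the target."""
    return [xi + (ti - xi) / (t + 1) for xi, ti in zip(x, target)]

def generate(target, steps=50):
    """Start from pure Gaussian noise and denoise step by step,
    the way a diffusion sampler iterates from t = steps-1 down to 0."""
    x = [random.gauss(0.0, 1.0) for _ in target]
    for t in reversed(range(steps)):
        x = toy_denoise_step(x, t, target)
    return x
```

Note that nothing in the loop "decides" anything; each step is just an estimate of what a less noisy version of the current image should look like.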
Don't humanize algorithms. They're applied statistics, not a sum of human experiences.
If a calculator adds 2 and 2 and shows 4, is that disgustingly anthropomorphizing the word "add"? If we need a separate word for every informational process, it's going to get awfully messy.
When an idea "pops" into your head, how was that made? Couldn't it also be a similar denoising of patterns in synaptic potentials? We know from many experiments that what something feels like can be quite different from what it actually is.
Is it only that we don't know the exact brain process that makes humans special? And once we inevitably do figure it out, does all human art become meaningless too? I think we need to learn to disconnect process from result and just enjoy the result, wherever it came from.
There are many legal reasons without moral force behind them beyond "we need to agree on one way or the other" - such as which side of the road to drive on.
We made a decision years ago around copyright (we've modified it since but the general concept is "promote the arts by letting artists have reproduction rights for a time"). We could change that in various ways, if we wanted to, even removing copyright entirely for "machine-readable computer code" and leave protections to trade secrets. Even if you argue "no copyright at all is immoral" or "infinite copyright is immoral" it's hard to argue that "exactly author's life + 50 years is the only moral option".
Switching the rules on people during the game is what annoys/angers people, and is basically what these AIs have done (because they've introduced a new player at low effort).
But haven’t we seen examples of generative art that are substantially similar to original artwork and examples where AI regurgitates blocks of art (with watermarks!?)
Artists are granted copyright for their work by default per the Berne Convention. These copyrighted works are then used without consent of the original author for these models.
Additionally, the argument that you can't copyright a style is playing fast and loose with most things that are proprietary, semantically.
A key part of the concept of copyright is that having copyrighted works used without consent is perfectly fine. Copyright grants an exclusive right to make copies of the work. It does not grant the author control over how their work is used, quite the opposite, you can use a legitimately obtained copy however you want without the consent of the author (and even against explicit requirements of the author) as long as you are not violating the few explicitly enumerated exclusive rights the author has.
You do not need an author's consent to dissect or analyze their work or to train an ML model on it; they do not have an exclusive right to that. You do not need an author's consent to make a different work in their style; they do not have an exclusive right to that either.
I feel there's a lot missing from this, and some terminology would require clarification (What constitutes "used"?).
Generally speaking, this supposition skirts around the concept of monetizing from the work of others, and seems at odds with what the Berne Convention seems to stipulate in that context, and arguably seems in violation of points 2 and 3 of the three-step test.
That's to say nothing regarding the various interpretations on data scraping laws that preclude monetizing outputs.
I don't feel it's that black and white, personally...
What I mean by "used" is any use where copying and reproduction are not involved.
The Berne three-step test specifies when reproduction is permitted, however, any use that does not involve reproducing the work is not restricted, and monetization does not matter. It's relevant for data-scraping laws because you are making copies of the protected work.
> Additionally, the argument that you can't copyright a style is playing fast and loose with most things that are proprietary, semantically.
This has been true since copyright existed, Braque couldn’t copyright cubism — Picasso saw what he was doing and basically copied the style with nothing to be done aside from not letting him into the studio.
The brain-computer metaphor is not a very good one, it's a pretty baseless appeal. Additionally, it's an argument that anthropomorphizes something which has no moral, legal, or ethical discretion.
You do not actively train your brain in remotely similar ways, and you, as an individual, are accountable to social pressures. That's an issue these companies are trying to avoid with ethically questionable scraping/training methods and research loopholes.
Additionally, many artists aren't purely learning from others to perfectly emulate them, and it's quickly spotted if they are, generally. Lessons learned do not implicitly mean you perfectly emulate that lesson. At each stage of learning, you bias things through your own filter.
Overall, the idea that these two things are comparable feels grotesque and reductionist, and feel quite similar to the "Well I wasn't going to buy it anyway" arguments we've been throwing around for decades to try to justify piracy of other materials.
At the end of the day, an argument that "style can't be copyrighted" is ignoring a lot of aspects of its definition, including the means, and can be extrapolated into an argument that nothing proprietary should be allowed to exist...
> Overall, the idea that these two things are comparable feels grotesque and reductionist
I agree with you there but the alternative - that they’re not comparable - I find equally grotesque and full of convenient suppositions rooted in romanticism of “the artist”. We’re in uncharted territory with AI finally lapping at the heels of creative professionals and any analogy is going to fall apart.
This feels like something that we should leave to the courts on a case by case basis until there’s enough precedent for a legal test. The question at the end of the day should be about harm and whether an AI algorithm was used as run-around of a specific person’s copyright
I was actually just sitting in an AI Town Hall hosted by the Concept Art Association, which had 2 US copyright lawyers who work at the USCO present, and their views were along similar lines, currently.
Basically, like you specified, legal precedent needs to be built up on a case by case basis, and harm can pretty readily be demonstrated, at least anecdotally, especially as copies are made during training of copyrighted work.
Unfortunately, historically, artists do not generally enjoy the same legal representation or resources that unionized industries with deeper pockets enjoy. It's probably one of the reasons Stability AI is being so considerate with their musical variant.
It would have been great if artists were asked before any of this. I could see this going in such a different direction if people were merely asked...
I'm an artist and I work in tech - I'd be very interested in working with the models if I didn't find the idea of using something made out of the labor of my peers repulsive.
Call me a training-set vegan, any model made from opt-in and public domain images I'd use in a heartbeat.
> But if I train my own neural network inside my skull using some artist's style, that's ok?
How well can the network inside your skull manipulate your limbs to reproduce good-quality work in some artist's style?
Our current frameworks for thinking about "fair use", "copyright", "trademark", and the like were thought into existence during an era when the options for the "network inside the skull" were to laboriously learn the skill of drawing, or to learn how to use a machine like a printing press or photocopier that produces exact copies.
Availability of a machine that automates previously hand-made things much more cheaply or is much more powerful often requires rethinking those concepts.
If I copy a book putting ink on paper letter by letter manually, that's ok, think of those monks in monasteries who do that all the time. And Mr Gutenberg's machine just makes that ink-on-paper process more efficient...
>How well can the network inside your skull manipulate your limbs to reproduce good-quality work in some artist's style?
An experienced artist can probably do this in a couple weeks, depending on how complex the style is.
>If I copy a book putting ink on paper letter by letter manually, that's ok, think of those monks in monasteries who do that all the time.
According to copyright, no, that's not okay. Copyright does not care about the method of reproduction, it just distinguishes between authorized and unauthorized reproduction. A copyist copying a book by hand without authorization is just as illegal as doing it with a photocopier. Likewise, if you decide to copy a music CD using a hex editor and lots of patience, at the end of the process you will end up with a perfectly illegal copy of the original CD.
So the question stands. Why is studying artwork with eyeballs and a brain and reproducing the style acceptable, but doing the same with software isn't?
Unless you are in fact a living and breathing cyborg [in which case, congratulations], the wetware inside your head is not analogous to the neural networks that are producing these images in any but the most loosely poetic sense.
No? The mechanisms are different, but the underlying idea is the same: identify important features and replicate those features in a new context. If an AI identifies those features quickly, or if I identify them over a lifetime, what's the difference? If I do that, you might say my work is derivative, but you won't sue me. Why is it different if an AI does it?
Not particularly. Parent post is not concerned with or making any claims to special knowledge of the internal details of the modelling in the mind or in the machine, only the output.
> The mechanisms are different but the underlying idea is the same
no.
That's like asking a person to say a number between 1 and 6, then asking the same question of a die and concluding that people and dice work the same way.
> identify important features and replicate those features in new context
untrue
If you think that's what people do, then of course you can conclude that AIs and humans are similar.
But people don't identify features first. People first learn how to replicate - mechanically - the strokes, using the same tools as the original artists, until they are able to do it. Most of the time they fail and reiterate the process until they find something they are actually very good at, and only after that do the good ones develop their own style,
based either on some artistic style or some artistic meaning.
But the first difference we see here is that humans can fail to replicate something and still become renowned artists.
An AI cannot do that.
Not on its own.
For example, many probably already know, but Michelangelo was a sculptor.
He was proficient as a painter too, but painting wasn't his strongest skill.
So artists are, first of all, creators, not mere replicators, in many different forms. They are not good at everything in the same way, but their knowledge percolates into related fields: if you need to make preparatory drawings for a sculpture, you need to be good at drawing and probably painting (lights, shadows, mood, and expressions are all fundamental to a good sculpture).
Secondly, the features artists derive from other art pieces are not the technical ones, those needed to make an exact replica of the original, but those that make it special.
For example, in the case of Michelangelo, the Pietà has some features that an AI would surely miss.
First of all, the way he shaped the marble was unheard of; it doesn't mean much if you don't contextualize the work and immerse it in the historical period in which it was created.
An AI could think that Michelangelo and Canova were contemporaries, when they were separated by three centuries, which makes a lot of difference in practice and in spirit.
But more importantly, Michelangelo's Pietà is out of proportion, he could not make the two figures in the correct scale, proving that even a genius like he was could not easily create a faithful reproduction of two adults one in the lap of the other, with the tools of the 16th century.
The Virgin Mary is very, very young, which was at odds with her role as a grieving mother, and, most important of all, the Christ figure is not suffering, because Michelangelo did not want to depict death.
An AI would assume those are all features of Michelangelo's way of sculpting, but in reality they are the result of a mix of the complexity of the work, the time when it was created, the quality and technology of the tools used, and the artist's intentions, which makes the work unique and, ultimately, irreproducible.
If you use an AI to reproduce Michelangelo, everybody would notice, because it's literally something a complete noob or someone with very bad taste would do.
So, to keep people from noticing the difference, you would have to copy the works of lesser-known artists, making it even more unethical.
Respectfully, you're raising a whole lot of arguments here that have nothing to do with any point I was making, and that don't seem to move this discussion forward in any significant way. The point of this subthread was a user saying the following:
>But if I train my own neural network inside my skull using some artist's style, that's ok?
This post and others use a lot of flowery language to point out that we train artificial neural networks and biological neural networks in different ways. OK, great. I don't think anyone is saying that's not true. What I am saying is that it's irrelevant.
If I am an exceptional imitator of the style of Jackson Pollock and I make a bunch of paintings that are very much in that style but clearly not his work, I'm not going to be sued. My work will be labeled, rightfully so, as derivative, but I have the right to sell it because it's not the same thing. Is that somehow more acceptable because I can only do it slowly and at low volume? What if I start an institute whose sole purpose is training others to make Jackson Pollock-like paintings? What if I skip the people and build a machine that makes paintings of similar quality with a similarly derivative style? Is that somehow immoral / illegal? Why?
There's a whole lot of hand-wavey logic going on in this thread about context and works and special human magic that only humans can possibly do, which somehow makes it immoral for an AI to do it. I have yet to see a simple, succinct argument for why that is the case.
> This post and others use a lot of flowery language to point out that we train artificial neural networks and biological neural networks in different ways. OK, great. I don't think anyone is saying that's not true. What I am saying is that it's irrelevant.
Maybe my language was too ornate.
The point is: you don't train "your artificial intelligence", because you're not an artificial intelligence, you train your whole self, that is a system, a very complex system.
So you can think in terms of "I don't like death, I don't want to display death"
You can learn how to paint using your feet, if you have no hands.
You can be blind and still paint and enjoy it!
An AI cannot think of "not displaying death" in someone's face, not even if you command it to do it, because it doesn't mean anything, out of context.
> Jackson Pollock
Jackson Pollock is the classic example to explain the concept: of course you can make the same paintings Jackson Pollock made.
But you'll never be Jackson Pollock, because that trick works only the first time, if you are a pioneer.
If you create something that looks like Pollock, everybody will say "oh... it reminds me of Jackson Pollock..." and no one will say "HOW ORIGINAL!"
Like no one can ever be Armstrong again, land on the Moon and say "A small step for man (etc etc)"
Pollock happened. You can of course copy Pollock, but nobody does - not because it's hard, but because it's cheap AF.
So it's the premise that is wrong: you are not training, you are learning.
They are very different concepts.
AIs (if we want to call them "intelligent") are currently just very complex copy machines trained on copyrighted material.
Remove the copyrighted material and their output would be much less than unimpressive (probably a mix of very boring and very ugly).
Remove the ability to watch copyrighted material from people and some of them will come up with an original piece of art.
You're typing a lot in these posts but literally every point you're making here is orthogonal to the actual discussion, which is why utilizing the end product of exposing an AI to copyrighted material and exposing a human to copyrighted material are morally distinct.
> which is why utilizing the end product of exposing an AI to copyrighted material and exposing a human to copyrighted material are morally distinct.
sorry for writing in capital letters, maybe that way they will stand out enough for you to focus on what's important.
WE ARE NOT AIS
An AI is the equivalent of a photocopier, or of sampling a song to make a new song; there are limits on how much copyrighted material you can copy/use, limits that do not apply TO YOUR EARS, because you hearing a song does not AUTOMATICALLY AND MECHANICALLY translate into a new song. You still need to LEARN HOW TO MAKE MUSIC, which is not about the features of the song; it's about BEING ABLE TO COMPOSE MUSIC.
Which is not what these AIs do: they cannot compose music, they can only mix and match features taken from copyrighted material into new (usually not that new, nor that good) material.
If we remove the copyrighted material from you, you can still make music.
You could be deaf and still compose music.
If we remove copyrighted material from AIs they cannot compose shit.
Because the equivalent of a deaf person for an AI that creates music CANNOT EXIST - for obvious reasons.
So AIs DEPEND ON copyrighted material, they don't just learn from it, they WOULD BE USELESS WITHOUT IT.
and morally the difference is that THEY DO NOT PAY for the privilege of accessing the source material.
They take, without giving anything back to the artists.
I'll try to address your underlying thought, and hope I'm getting it right.
I think you are right to be skeptical and cautious in the face of claims of AI progress. From as far back as the days of the Mechanical Turk, many such claims have turned out to be puffery at best, or outright fraud at worst.
From time to time, however, inevitably, some claims have actually proven to be true, and represent an actual breakthrough. More and more, I'm beginning to think that the current situation is one of those instances of a true breakthrough occurring.
To the surface point: I do not think the current crop of generative AI/ML models is unoriginal per se. If you ask them for something unoriginal, you will naturally(?) get something unoriginal. However, if you ask them for something original, you may indeed get something original.
> If we remove copyrighted material from AIs they cannot compose shit.
I wonder in what way you mean that? In any case, the latest Stable Diffusion model file is 3.5 GB, which is several orders of magnitude less than the training dataset.
It probably doesn't contain much literal copyrighted data.
You're making much more concise arguments now, I think that makes the discussion more useful and interesting.
I would take the position that it's self evident that if you take the 'training data' away from humans they also can't compose music. If you take a baby, put it in a concrete box for 30 years (or until whatever you consider substantial biological maturity), and then put it in front of a piano it's not going to create Chopin. It might figure out how to make some dings and boops and will quickly lose interest.
Humans also need a huge amount of training data and we, at best, make minor modifications to these ideas to place them into new context to create new things. The difference between average and world class is vanishingly small in terms of the actual basic insight in some domain. Take the greatest composers that have ever lived and rewind them and perform our concrete box experiment and you'll have a wild animal, barely capable of recognizing cause and effect between hitting the piano and the noise it makes.
That world class composer, when exposed to modern society, consumed an awful lot of media for 'free' just by existing. Should they be charged for it? Did they commit a copyright infraction? Why or why not?
I feel like a broken record on this topic lately, but I strongly believe that training ML models on copyrighted works should be legal.
It is clear to anyone that understands this tech that it is not simply "memorizing" or "copying" the training data, even if they can be coaxed into doing this for certain inputs (in the current iteration of the tools).
Ultimately, I think the problem of reproducing certain popular works or code snippets will be solved. One interesting direction here is the toolset of information theory and differential privacy, e.g. proving that certain training inputs cannot be recovered from the weights, or that there is a threshold on how much information can be gleaned from any single input.
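To make the differential-privacy direction concrete, here is a toy sketch of the DP-SGD-style recipe (clip each example's gradient, then add calibrated Gaussian noise). This is an illustration, not how any production model is actually trained; `clip_norm`, `noise_mult`, and `lr` are illustrative numbers I chose.

```python
import numpy as np

def dp_gradient_step(per_example_grads, clip_norm=1.0, noise_mult=1.1, lr=0.1, rng=None):
    """One DP-SGD-style update (sketch): clip each per-example gradient
    to clip_norm, average, then add Gaussian noise scaled to the clip
    bound, so no single training example can dominate the weights."""
    rng = rng if rng is not None else np.random.default_rng(0)
    clipped = [g * min(1.0, clip_norm / max(np.linalg.norm(g), 1e-12))
               for g in per_example_grads]
    mean_grad = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, noise_mult * clip_norm / len(per_example_grads),
                       size=mean_grad.shape)
    return -lr * (mean_grad + noise)

# One example with a wildly outlying gradient barely moves the update,
# because clipping bounds its influence before averaging:
grads = [np.array([0.1, 0.2]), np.array([0.2, 0.1]), np.array([1000.0, -1000.0])]
update = dp_gradient_step(grads)
```

The point of the clipping-plus-noise combination is exactly the "threshold" idea above: it bounds, in a provable information-theoretic sense, how much any one training input can influence the final weights.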
It is easy to imagine (because it's nearly there already) a future version of Stable Diffusion (or CoPilot) which provably compresses all training data beyond any possibility of recovery, and yet still produces extremely convincing results which disrupt the creative professions of art and programming.
Until we get to that point, it feels like the only consistent and sensible place to apply regulations is with the model end user. When I use CoPilot, I accept the small (honestly overblown) risk that _maybe_ I won't have the license to use some small snippet of code it spits out. But I'm happy to wear that responsibility, because the boost to productivity is so great that I dread a return to pre-CoPilot world. That is, a world where everyone keeps reinventing the same solution to simple problems over and over and over again.
Training being fair use is something I can buy into.
As for actually using the model... personally I still find that unacceptable. Even if the risks are low, we have lots of works in the model, so the risk can still add up.
The idea you have in your head of training without regurgitation is likely not possible. The underlying technology treats the training set as gospel: the system is trained to regurgitate first, and generalization is a happy accident. Likewise, we can't look into a model to check what it has memorized, nor can we trace an output back to a particular training example. This has ethical implications for the way AI companies crawl the web for training data: such models almost certainly hold someone's personal information, and there's no way to get it out aside from curating your training set to begin with.
I mean, in a sense yes, the training set is gospel. But these systems are also (generally) tested against held-out data.
When you have to model 100 TB of images with 4 GB of weights, there is no way this is possible without learning some kind of patterns and regularities that generalize outside the training set. Most generated items will be novel, and most training items will not be reproducible.
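The 100 TB vs 4 GB arithmetic can be made concrete. (The 4 GB figure comes from this thread; the ~2 billion image count is an assumed LAION-scale training set size, not a number from the comment.)

```python
# Back-of-the-envelope capacity check: bits of model weight available
# per training image, using round numbers.
model_bytes = 4 * 10**9   # ~4 GB of weights (figure from the thread)
num_images = 2 * 10**9    # ~2 billion training images (assumed, LAION-scale)
bits_per_image = model_bytes * 8 / num_images
print(bits_per_image)     # -> 16.0, i.e. about two bytes per image
```

Two bytes cannot store an image, so whatever the weights retain per training example, it is not the pixels; verbatim memorization is plausible only for inputs heavily duplicated in the training set.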
It doesn’t seem radical to suggest that the copying issue will continue to recede as we get better models.
And there are areas of research specifically concerned with _provably_ showing that you cannot identify which items a model was trained on.
> the original images themselves aren’t stored in the Stable Diffusion model, with over 100 terabytes of images used to create a tiny 4 GB model
Is jpeg compression transformative then? Should a compressed image of something not be copyrightable because “it doesn’t store” the “real image”? How about compressed video? Where do we draw the line?
The difference is that JPEG does store the real image, at least close enough to within the given tolerance (determined by the compression factor). That image is as real as say an image on film (also not exact, nor in "original" form).
With Stable Diffusion it's storing the style, but it can't reproduce any single input image - there aren't enough bits [0] (except by luck, but that's really true for any storage).
The weights of a NN are just a compressed representation of the training data, think lossy zip.
Rank all generated images by similarity to the training data (etc.) and you can see what's stored.
The Shannon-Hartley theorem isn't relevant. A 4 GB zip of 100 TB of text data can exactly reproduce the initial 100 TB for some distributions of that initial dataset.
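The "some distributions" caveat is the whole point, and easy to demonstrate with Python's `zlib`: lossless compression achieves extreme ratios only when the data is highly redundant.

```python
import os
import zlib

# Highly redundant data compresses by a huge factor and decompresses
# back to an exact, bit-for-bit copy:
redundant = b"all work and no play " * 500_000  # ~10 MB of one phrase
packed = zlib.compress(redundant, 9)
assert zlib.decompress(packed) == redundant     # exact reconstruction
ratio = len(redundant) // len(packed)           # well over 50x here

# High-entropy data barely compresses at all, which is why 100 TB of
# diverse photos cannot hide losslessly inside a few GB of weights:
noise = os.urandom(1_000_000)
assert len(zlib.compress(noise, 9)) > 0.99 * len(noise)
```

So both commenters can be right: a 4 GB archive can losslessly hold 100 TB of pathologically redundant data, but nothing close to 100 TB of distinct photographs.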
If you reproduced an exact image (to the same lossy degree as JPEG) using the NN, then you are violating copyright.
But if you reproduced an image whose style matches another copyrighted image (e.g., blah in the style of starry night), then how does that new image (which didn't exist before) violate existing copyright? You cannot copyright a style.
The NN containing information which _could_ be used to reconstruct an exact image doesn't itself constitute copyright violation - because the right to use information for training NN is not an exclusive right that the original holder of the training set has.
So either a new law has to come into existence, vis a vis the right to use copyrighted works to train a NN, or the current copyright laws should apply (which implies that NN generated images which are not "exact" copies of existing works don't violate copyright).
If a given model can consistently reproduce an exact image given the same input prompt, why shouldn't the model itself be considered a compressed form of that image?
Right, but underneath your premise is a scam, right?
A NN has not learnt to paint: it doesn't coordinate its sensory-motor system with its environment through play, it hasn't developed any taste, it does not discern the aesthetic good from the bad, it has no judgement, and so on ad infinitum.
A NN is just a kNN with an extra compression step. The way all gradient-based "learners" work is to compute distances to pregiven training data. In the case of kNN that data is used exactly; in a NN it's compressed.
There is no intelligence here, there is no learning: it's a trick. It turns out that interpolating a point between prior examples can often look novel and often fool a human observer.
This is largely due to how incredibly tolerant to flaws we are in the cases where NNs are used to perform this trick. We go to great lengths to impart intention, fix communicative flaws, etc. and this is exploited by "AI" to make simple crap seem great by having the observer fill-in the details, perceptually. I see it as a kind of proto-schizophrenia that all people have which usually works if we're dealing with a human, but on everything else produces religions.
In any case, a NN is just a case of a kNN -- which is capable of fooling people exactly the same way, and clearly violates copyright and is a case of theft of intellectual work to make a product you can sell. Adding compression seems irrelevant.
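For reference, the kNN picture this comment invokes is easy to state in code; whether it is a fair model of diffusion training is exactly what the replies dispute. A toy 1-NN "generator" that stores its training data verbatim:

```python
import math

def nearest_neighbor(train, query):
    """1-NN 'generation': return the stored training example closest to
    the query. The training data sits inside the 'model' verbatim -
    the property the comment argues compression merely obscures."""
    return min(train, key=lambda example: math.dist(example, query))

train = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
print(nearest_neighbor(train, (0.9, 1.2)))  # -> (1.0, 1.0)
```

Note one disanalogy the thread turns on: here the stored data grows linearly with the training set, while a diffusion model's weight file does not.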
I don't think this interpretation of NNs is correct. There's been a few papers purporting to show this, but afair they used a very tortured definition of "interpolation".
Stable Diffusion is certainly capable of differentiating good from bad. That's why you can tell it to draw good or draw bad.
Not that this point is relevant to my comment. "Play", "taste" and "judgment" can be just as deterministic as a sequence of large matrix operations interspersed with nonlinear layers.
Sure, but then who is torturing matrices to turn them into organic bodies which adapt their musculature to their environment?
Interpolation is forced in the case of NNs, it's a training condition.
And the "kNN interpretation" isn't an interpretation: kNNs define what "ideal learning" is in the case of statistical learning, and hence show it doesn't count as actual learning.
In actual learning we're not interested in whether you can solve prespecified problems but how well you cope when you can't. This is, by definition, not a problem which can be formulated in statistical learning terms and the particular "learning" algorithm here is irrelevant.
In other words, accuracy isn't a test of learning. Accuracy is a "non-modal condition": being fit to a history that actually took place. Learning "in the usual sense" is strictly a modal, "what if" phenomenon, and is assessed by the quality of failure under adverse conditions, not of success.
If one gave any AI/ML system in existence adverse conditions, posed relevant "what ifs" and observed the results, they'd be exposed as the catastrophe they are. None survive any even basic test of "coping well" in these cases.
This is why all breathless AI public relations, i.e., academic papers published in the last decade, do not perform any such tests.
Because said set of numbers is produced via a training process that has the original as an input, and a different input would produce a different set of numbers.
You're correct that merely containing the information would not violate copyright - it's all about how that information was produced.
I'm actually seeing plenty of new images that are in the same style but different from any of the images in the training set, like Wonder Woman in front of a mountain that looks like the setting of "Frozen".
What this analogy is saying is that if an image is generic and derivative enough (or massively overrepresented in the training data) it may be possible to reconstruct a very close approximation from the model. If the training data is unbiased, I question the validity of copyright claims on an image that is sufficiently derivative that it can be reproduced in this manner.
> The Berne Convention formally mandated several aspects of modern copyright law; it introduced the concept that a copyright exists the moment a work is "fixed", rather than requiring registration. It also enforces a requirement that countries recognize copyrights held by the citizens of all other parties to the convention.
The fact that it is a different form of compression doesn't change that it's compression. What is the argument here, that a numerical method should have the same rights as a person?
The author of the original post (Andy Baio) found exactly where the line was. He released a (great) chip tune version of the jazz classic Kind of Blue (named Kind of Bloop), fully handled the music copyrights, and was promptly sued by the cover artist for Kind of Blue who believed that the pixel art cover of Kind of Bloop was not adequately transformative.
This is what we have courts and legislation for. I expect there's existing legislation here about what constitutes a different work versus an exact copy but it may need some updates for AI.
Possibly controversial opinion: I think the biggest reason why so many people hold conflicting views on this is because of who the victim is in each case.
The loudest voices complaining they were directly hurt by Copilot's training are open source maintainers. These are exactly the kind of people who we love to root for on here. They're the little guy involved in a labor of love, giving away their work for free (with terms).
On the other hand, the highest-profile victims of Stable Diffusion and DALL-E are Getty Images and company. They're in most respects the opposite of open source maintainers: big companies worth millions of dollars for doing comparatively little work (primarily distributing photos other people took).
Because in the case of images the victim is most prominently faceless corporations, I think our collective bias towards "information wants to be free" shows through more clearly when regarding DALL-E than it does with Copilot.
> On the other hand, the highest-profile victims of Stable Diffusion and DALL-E are Getty Images and company. They're in most respects the opposite of open source maintainers: big companies worth millions of dollars for doing comparatively little work (primarily distributing photos other people took).
It's puzzling to me you acknowledge people are taking these photos and getting a cut from their use through the marketplace, yet still see Getty as the biggest victim.
If Getty could AI-generate their whole portfolio and keep 100% of the sales to themselves they'd do it in a heartbeat (and I'd expect them to partially go that route). The most screwed people are the photographers ("the little guy" in your comparison).
With the caveat of "Strong opinions, weakly held", my personal take is that creating artificial scarcity is inherently immoral and thus copyright itself is immoral. Training AI on someone's non-private work is then completely fine IMO.
Copyleft is a license that weakens copyright (and thus inherently good :)), so using machine learning to weaken copyleft by allowing you to copyright "clones" of copyleft code is bad.
If I try to generalize here, the problem in both cases arises only if you produce copyrighted works, especially if you trained on copyleft works. If instead both models stipulated that all produced works are copyleft, I would be much more fine with it (and I feel it would respect the license of the copyleft works it was trained on, even if that may be legally shaky).
> With the caveat of "Strong opinions, weakly held", my personal take is that creating artificial scarcity is inherently immoral and thus copyright itself is immoral. Training AI on someone's non-private work is then completely fine IMO.
Why do you hold that opinion? There's a few very clear benefits to creating artificial scarcity, mostly around incentivizing creation (and sharing!) of innovations.
If we can't create artificial scarcity around ideas, then ideas are in a sense less monetizable than, e.g., creating a piece of furniture. But this is just an accident of the way the world works - I can physically prevent you from taking a chair that I made, but I can't prevent you from taking an idea I had. Why does it make sense that the world works this way? Isn't it a whole lot better to encourage innovation, vs. encouraging more people to make physical objects, just because those are inherently scarce?
(The other side is that innovations also have the great property that copying them isn't depriving anyone else of use of the original idea, but that's a side issue to the encouraging innovation one, IMO.)
> Why do you hold that opinion? There's a few very clear benefits to creating artificial scarcity, mostly around incentivizing creation (and sharing!) of innovations.
I think it is self evident why creating artificial scarcity is immoral, but the point you are trying to make is that from a utilitarian standpoint you think it is preferable to behave in a way that is immoral in the micro scale since it will create greater good in the macro scale. If you agree, then I don't think there's any need to justify my belief here :)
That said, I also think that it is unclear that copyright is a net positive increase to creation and sharing of innovations. The current state of monetization is not inspiring since the actual creators usually are not well compensated and money tends to stay with large corporations that are essentially just "right holders". There's also many factors that actively stifle innovation and creativity.
Patents are the most well known example, but being unable to borrow chord progressions or characters or storylines from other works is also stifling (you can't exactly release your "edanm's cut of Spiderman Homecoming" publicly, nor can you create your own sequel or alternate interpretation of the story). Quite a few fan games or fan remakes have also met their demise at the hands of aggressive copyright enforcement.
My own suspicion is that if the current models of creation will be disrupted by copyright abolition, we'll just end up seeing that a lot of the money that was spent on them will move to other avenues of funding like Patreon style or Kickstarter style funding for works. We may even see some new models created. I'd also expect that it will actually shift the balance away from large corporations (whose primary value is having lots of money that allows them to hoard rights) to smaller creators which will now have more direct funding available to them.
I also think an interesting case to look at is video games, where the consensus is that a game's mechanics aren't copyrightable. So whenever a new interesting indie game comes out on Steam, there is a rash of other cool indie games with their own takes and remixes of the same concept, like the glut of roguelike deckbuilders after Slay the Spire, or the current glut of "Vampire Survivors"-likes that have their own interesting takes on the core idea. Eventually, when there's enough buzz around such ideas, they can even penetrate the AAA sphere (where adding "roguelike" elements has started to appear slightly). It is also quite common to see games that are in Early Access for a very long time, essentially letting the community fund the future creation and expansion of the game.
> If we can't create artificial scarcity around ideas, then ideas are in a sense less monetizable than, e.g., creating a piece of furniture. But this is just an accident of the way the world works - I can physically prevent you from taking a chair that I made, but I can't prevent you from taking an idea I had. Why does it make sense that the world works this way? Isn't it a whole lot better to encourage innovation, vs. encouraging more people to make physical objects, just because those are inherently scarce?
So as I said, I'm not sure it really will make ideas less monetizable (or that if it will, that it will do so significantly). Even today I can get any video game I want for free (illegally, but that effectively doesn't matter since no one will prosecute me for it), and yet I still buy video games. In fact, a strong reason why I buy video games today is that as a child I had a friend with easy access to pirated CDs, and I'd play a lot of games at their house, fueling my passion for them.
And relatedly, it is no accident that "software is eating the world", it is exactly because it is so easy to share, the low marginal costs make it easy to have a strong worldwide impact without too much real world effort :)
> I think it is self evident why creating artificial scarcity is immoral, but the point you are trying to make is that from a utilitarian standpoint you think it is preferable to behave in a way that is immoral in the micro scale since it will create greater good in the macro scale. If you agree, then I don't think there's any need to justify my belief here :)
I'll start with the end, I don't think the "immoral on the micro scale" idea makes much sense. Like, you can say it's "bad" or "annoying" on the micro scale, but if it is good for society as a whole to create artificial scarcity, it just isn't immoral for society to provide mechanisms to create it.
I also don't think it's self-evident that it's "immoral" on the small scale (though not sure what that means, since artificial scarcity is kind of a society-level mechanism.)
That said, maybe I'm just bumping on your use of the word immoral and we're not really disagreeing.
> That said, I also think that it is unclear that copyright is a net positive increase to creation and sharing of innovations. The current state of monetization is not inspiring since the actual creators usually are not well compensated and money tends to stay with large corporations that are essentially just "right holders". There's also many factors that actively stifle innovation and creativity.
This has been a talking point of people against copyright for a long time (I've been having these discussions for at least 20 years, personally).
But I think that you're basically wrong. It's pretty easy to see that you're wrong too - just look at the state of news, the state of music, etc. In most cases, artists today make far less money than they made before the rise of pirated alternatives. Patreon/other models/etc have helped some, but nowhere near where things were before.
In fact, you talk about the current state of compensation being bad, but I think that it would make more sense to listen to actual artists about whether or not copyright helps them or not. I've listened to a bunch, and most of them couldn't come close to doing what they love without copyright.
Also, personally, I'm a software dev. Most of what I do on a day-to-day basis is create IP. I'm fairly happy that someone can't just come along and repurpose everything I built, for free. Otherwise, I'm fairly sure I'd be out of a job.
(I'm fairly sure that without copyright/IP, most software we use wouldn't exist either.)
How do you propose to keep artists able to pay their bills and live a decent life if you're completely cool with training AIs on them?
Bonus points if you have any actionable scheme beyond waving your hands and talking vaguely about "basic income".
Keep in mind that the life of a professional artist is currently very perilous, anyone working freelance is constantly battling against the social media giants' desire to keep everyone scrolling their site forever. Words like "patreon" and "commission" and links off-site to places an artist can exchange their works for money are poison to The Algorithm and will be hidden.
And also if I am reading this right, you have absolutely no problem with an image generator that's been trained on copyrighted work producing work that's either copyrighted or copylefted? You are utterly fine with disregarding the copyrights of the original artist and/or whoever they may have assigned the copyright to as part of their contract?
> How do you propose to keep artists able to pay their bills and live a decent life if you're completely cool with training AIs on them?
Why isn't it a concern for any other automation?
How do we progress, exactly, if we randomly decide that nothing can disrupt any of the current ways of earning income?
What about, IDK, coal miners?
> And also if I am reading this right, you have absolutely no problem with an image generator that's been trained on copyrighted work producing work that's either copyrighted or copylefted? You are utterly fine with disregarding the copyrights of the original artist and/or whoever they may have assigned the copyright to as part of their contract?
Copyright maximalism is bad. It also doesn't make any sense. Someone learning to reproduce your capabilities by looking at your stuff isn't violating copyright. If we allowed copyright to somehow mean that someone's skills can't be reproduced...
Paul Samuelson[0], an early winner of the Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel, makes the argument here[1], while discussing the economics of lighthouses, that anything with zero marginal cost priced at anything other than free is by definition an economic loss. Therefore you should find other ways to fund lighthouses, and by extension, all media and software.
If an economic loss is currently occurring, that means that, if copyright is abolished, an economic gain will accrue. Where that economic gain is captured is therefore the core focus. What we want to occur is for society and the author to share in this increased economic gain, what we don't want is for monopolistic rent-seekers to grab all of this value for themselves.
I am not yet sure of this, but I think a land value tax would accomplish this goal. One would then discover, through some market means (perhaps only an approximation), a weight for each artist or piece of software, and distribute a share of the land value tax revenue to artists and other creators accordingly. A decent way to find this value might be a revenue-neutral (or slightly negative) opt-in sortition process: when you go to use an artist's work, you are entered into an auction in which the 50% who bid more than the median value get to use the work at the median price, and the 50% who bid less do not get to use the work, but receive their bid in cash. This is surely not the full system; it is just me working out how we can move forward from such an unjust system as copyright.
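The auction step described above can be sketched in a few lines. This is purely illustrative (all names are hypothetical, and real mechanism design would need far more care than a median split):

```python
import statistics

def run_access_auction(bids):
    """Sketch of the proposed sortition auction.

    `bids` maps bidder -> bid amount. Bidders at or above the median
    pay the median price and get access to the work; everyone below
    the median is denied access but refunded their bid in cash.
    Returns (winners, clearing_price, refunds).
    """
    price = statistics.median(bids.values())
    winners = {b for b, amt in bids.items() if amt >= price}
    refunds = {b: amt for b, amt in bids.items() if amt < price}
    return winners, price, refunds
```

For example, with bids of 10, 7, 4, and 1, the median is 5.5: the two high bidders get the work at 5.5 each, while the two low bidders walk away with their bids as cash.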
> How do you propose to keep artists able to pay their bills and live a decent life if you're completely cool with training AIs on them?
1. Software being able to imitate artists doesn't mean people will stop wanting art from other people.
2. If it does, I would say that it's unfortunate, but it's the world we live in. Nobody has a guarantee on being able to make a living doing any conceivable profession whatsoever, and artists are no different. I would very much like to make a living from working on my personal projects, but people don't seem to want to take me up on my offer.
Because when I go to a live performance, watch a movie, browse an art gallery, etc., I am training my brain on copyrighted work. Every artist has done the same. No artist has developed their style in a vacuum.
(See my other comment though, I am not sold on any of this being right).
Both links contain great examples of late 1960s Polish animation. Poland was never part of the USSR.
Moreover, just like Czechoslovakia, Poland did not develop its animation style in a vacuum. They had strong artistic connections with many other European countries, especially France.
> They are unique and completely different from what the west was used to (Disney)
Not sure if you are an American... but Europe has a very old and very rich animation tradition. Growing up in Europe I was only vaguely familiar with Disney. The vast majority of animation I watched as a child was European.
They pioneered new techniques in a vacuum, because they were segregated.
A complete vacuum is a silly argument - we owe dinosaurs the oil we used to build up our modern societies; would you say that AI would be impossible without dinosaurs?
As an artist you can spot parts though that have the same 'visual language' that are much larger than 2 pixels. E.g. how someone uses their brushes, how someone does texture on corrugated metal etc. Those are footprints as large as a matrix multiplication method - they just can't be that easily quantified, because we need an AI model to quantify them.
> I personally feel the bar for copyrighting code should be considerably higher than 20 lines.
That highly depends on the lines of code.
One of my (now abandoned) open source react components essentially does some smarter-than-it-probably-should state management in just a handful of LOCs. At least a few hundred people found the clever solution I came up with useful enough to integrate into their own projects.
I've seen a talented graphics programmer hand-optimize routines to gain significant speed boosts, speed boosts that helped save non-trivial amounts of system resources.
And where do you draw the line? That same gfx programmer optimized maybe a dozen functions, each less than 20 lines, but all quite independent of each other. The sum total of his work gave us a huge performance boost over everyone else in the field at the time.
And of course you also have super terse languages like APL, where non-trivial algorithms can easily be implemented in 20 LOC.
But let's move to another medium, the written word, also one of the less controversial aspects of copyright (ignoring the USA's penchant for indefinite extension of copyright)
Start with poems, plenty of artistically significant poems that come in under 20 lines, deserving of copyright for sure.
The problem is, it is complicated, which is why these are the types of things that get litigated all the time.
Heck, as a profession we cannot even agree on what a line of code is. A LOC in Java is, IMHO, worth less than a LOC in JavaScript, and if you jump to embedded C, wow that is super terse, unless you count the thousands of lines of #defines describing pinouts and such, but domain knowledge is needed to know that those aren't "real" lines of code.
I think that your code examples are not (and should not be) copyrightable.
Quoting US copyright law (but the same principle is global) "In no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work." - when a work combines an idea and its expression, copyright protects only the expression but not the idea itself, no matter how clever or valuable it is. Copyright does not prohibit others to freely copy the method/process/system/etc expressed in the copyrighted work.
Also, there is a general rule of thumb (established in law and precedent) in copyrightability that functionally required aspects can not be copyrighted - in essence, if you create a new, optimized routine that's superior to everything else, then only the "arbitrary" free, creative parts of that code are copyrightable, but you absolutely can't get an exclusive right to that algorithm or technique. If the way you wrote it is the only way to write it, that can't be protected; and if not, others must be able to write stuff that's functionally the same (i.e. gets all the same performance boosts) and varies only in the functionally irrelevant parts. That's one of the reasons, for example, for recipe sites having all that fluff - because you can't get copyright protection on the functional part of the recipe, or some technique in another domain like architecture or computer science. Perhaps you can get a patent on that, but copyright is not applicable for that goal.
So, going back to your examples:
People reimplementing that "smarter-than-it-probably-should state management in just a handful of LOCs" is absolutely permitted by copyright law. If the same state management can be written in many different ways, then copying your code would be a violation and they would have to reimplement the same idea in different words, but copyright law definitely allows them to copy and reuse your idea without your permission, it doesn't protect the idea, only its specific expression.
Hand-optimized graphics routines may fall into the area where there is only one possible expression for that idea which implements the same method with the same efficiency. If that happens to be the case, the routine is not eligible for copyright protection at all - you can't get a monopoly on a particular effective technique or method using copyright law; patent law covers the cases in which that can or can't be done.
For APL implementations of algorithms - again, the key principle is that copyright definitely allows others to implement the same algorithm. If an obvious reimplementation of the same algorithm results in the same terse APL code, then that's simply evidence that this particular APL code is solely the "idea" (unprotectable), not "creative expression" which would be eligible for copyright protection.
It's not the training of the models that's the problem, it's when the AI spits out "substantial portions of code", an important term in the GPL and with regard to fair use law, that are exact, sometimes even including exact comments from specific codebases. This does violate the licenses.
There's something quantitative in code that you don't get in drawings; in drawings the unique quality is purely qualitative, so it is hard to demonstrate what exactly was ripped off. When you find your exact words being returned by a code-helper AI, it's hard to pretend it's not directly and plainly just copy-pasting code snippets.
I am going to go ahead and say it, some of you people are so far up your own ass that you don't realize it is the exact same thing. All of you are saying it's unique because you code but you don't understand art and can't pick out the things that are clearly copied from artists because it isn't exactly the same.
I don't understand art, you're right, at least from an artist's perspective. Maybe you can enlighten me.
So my current perspective is this: if a picture is drawn in a style similar to yours for example, that's not infringing, but if it's just a scrapbook collage of cutouts of your art it is, except where it's fair use. Would that be right?
So the same applies, if an AI actually can help people write code by learning from existing code, that's fine, but if an AI just copy pastes code blocks that's not.
On the contrary - the part that is problematic is the verbatim reproduction of copyrighted code. If that's fixed by a "minor refactoring" then there's no hill to die on. It's not AI code generation per se that's problematic - it's when it does things that break current IP law.
If you want to debate expanding IP law - that's a different discussion and one I would be rather sceptical about. I'd prefer that IP law in general was rolled back - not forward.
Which is a fine point of view. But it’s not the one that many (most?) detractors actually hold.
It would imply that you cannot offend open source licenses by doing something as simple as recreating a codebase in another language, thus eliminating the exact matches.
I always held the opinion that GPL etc was a copy-left license that was intended to make sure the code was free (free as in freedom not as in beer). That in an ideal world you wouldn't need the GPL or any licenses at all. At this point I really don't care what co-pilot or any of its derivatives result in and I think in the not too distant future we will have machine code to readable code translation which will enable more freedom. That is, it really won't matter if the code is compiled or not, when you can "AI decompile" it into human readable code, do your modifications, and then do with it what you will.
As long as this copyright violation laundering isn't reserved for the big guys, I'm happy for anything that confuses and delegitimizes the concept of copyright. But it is reserved for the big guys, you're going to get sued to death if you copy any of their work.
GPL folks are completely OK with something like Copilot when the GPL license is obeyed, so that all emitted code, generated by AI trained on GPL code, is licensed under GPL again. It's not OK to call our code «public code» and ignore our license.
But by repeating this argument you are strengthening copyright, which is the fundamental evil GPL was made to fight. There surely will be FOSS clones of Copilot in the near future. There is no need to feed the copyright lobby.
I think both are inevitable and I'm OK with both. I think a sticking point is that it's considered normal to make your own art in the style of another but abnormal to copy code verbatim. Art seems to be clearly the former, while there are instances that probably stick in people's minds where Copilot has produced verbatim examples.
Indeed it seems like code will be vastly more prone to this problem compared to art because changing a single pixel is merely a question of aesthetics whereas code is constrained tightly by the syntax of the language. With a much smaller space of correct results duplication is likely inevitable.
This is my thinking too. A maximally useful code AI would include verbatim reproduction (since presumably the code it was trained on was written that way for a reason relevant to its function). A maximally useful art AI has comparatively little reason to ever want to output verbatim training inputs.
I've been thinking of a possible resolution to this. For Copilot and similar systems: keep the training data, and in addition to the text generation, add a search function. Find generated sequences that Copilot puts out, and send pointers to the source for close matches in strings over a threshold length. Example: if Copilot produces the Quake fast inverse square root routine, you'd get a pointer to the source AND to the license. This would allow credit for the author for permissive free licenses, and would allow the user to dump that code if it's GPL and they aren't willing to distribute under those terms.
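The matching step described above could be sketched with a simple shingle index over training tokens. This is a hypothetical illustration of the idea, not how Copilot actually works; function names, the corpus layout, and the threshold `n` are all assumptions:

```python
from collections import defaultdict

def build_index(corpus, n=8):
    """Map every n-token shingle in the training corpus to its sources.

    `corpus` is a dict of {source_id: token_list}; in practice the
    source_id would carry the file path and its license terms.
    """
    index = defaultdict(set)
    for source_id, tokens in corpus.items():
        for i in range(len(tokens) - n + 1):
            index[tuple(tokens[i:i + n])].add(source_id)
    return index

def find_matches(generated_tokens, index, n=8):
    """Return the set of sources whose shingles appear verbatim in the output."""
    hits = set()
    for i in range(len(generated_tokens) - n + 1):
        hits |= index.get(tuple(generated_tokens[i:i + n]), set())
    return hits
```

If the generated code reproduces a run of n or more training tokens (say, the Quake routine), `find_matches` returns pointers to the original sources, which could then be shown alongside their licenses. A production system would want hashing and near-duplicate tolerance rather than exact tuples, but the principle is the same.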
For art, train on contributed images whose authors agree to use them for that purpose. There could be some organization, perhaps a nonprofit, that would own the images and users of the model could credit that organization, and perhaps contribute back their own generated work. That way a legally clean commons could be built and could grow.
Of course people today are training on copyrighted images, chanting fair use because people on the internet tell them it's okay, instead of finding groups of artists who consent to having their art be trained on.
I hold neither of those opinions. My take is that agricultural civilization is being digested by technocapital and we're all along for the ride.
That said, the application of copyright to text, vs, code, vs images, has salient differences. The concept of plagiarism in visual artwork exists by analogy, but it's hard to call it coherent.
Music is somewhere in the middle, people have sued successfully over melody and hooks.
There are things like characters in animation, it's not unheard of, but the balance is more on the side of "great artists steal".
The vast majority of HN's patronage is tech aligned. Try asking a community of artists the same question and see what their responses are. The results might surprise you.
What's funny is HN was majority OK with AirBNB "disrupting" hotels and Uber/Lyft "disrupting" taxi services by bending the rules and exploiting legal loopholes, but when AI starts "disrupting" their artwork and code by bending the rules suddenly disruption becomes a personal problem.
Disrupt onward I say. Humans learn and remix from prior copyrighted work all the time using their brains (consciously chosen or not). So long as the new work is distinguishable enough to be unique there's nothing wrong with these new AI creations.
Because the models are not creating a 1:1 replacement of the original work.
As mentioned before, "style" is not something subject to copyright, and the model creates a model of that style. The process of fine-tuning a model generally means that one would not want to recreate the original images, as that would overfit it and render it essentially useless.
When it comes to code, there is a higher chance of getting a one-to-one clone of the input as the options used in creating an algorithm, or even a simple function are dramatically reduced imo.
> Because the models are not creating a 1:1 replacement of the original work.
Since when did that become a requirement? If those are the rules now, then cutting the final credits is good enough to start torrenting movies.
> When it comes to code, there is a higher chance of getting a one-to-one clone of the input as the options used in creating an algorithm, or even a simple function are dramatically reduced imo.
If you're going to consider each function within a larger work as an individual work, that makes the 1:1 replacement claim more dubious. In order to recognizably imitate a style, one or more features of that style have to be recognizably copied, although no single area of the illustration would have to be. A function is a facet of a complete program just like recognizable features of a style are facets of each work an artist produces. If it helps, consider an artist's style as their own personal utility library.
If I made a scene for scene remake of a Disney movie, with an ugly woman for a princess and social commentary/satirical injections, it would be defensible as fair use in court.
I think when it comes to art, less than one-to-one clones are often still functionally equivalent in the mind of many viewers. Stylistic and thematic content is often just as, if not more, important than the exact composition. But currently the law does agree that this is not copyrightable. And sometimes independent artists profit and make a name for themselves copping other styles, and I think that's great.
But could it be considered an intellectual and sociological denial-of-service attack when it's scaled to the point where a machine can crank out dozens of derivative works per minute? I'm not sure this is a situation at all comparable to human artists making derivative works. Those involve long periods of concentration, focus, and reflection by a conscious human agent to pull off, thus in some sense furthering to the intellectual development of humanity and fostering a deeper appreciation for the source work. The machine does none of that; it's sort of just a photocopier one step removed in hyperspace, copying some of the artists' abstractions instead of their brush strokes.
I have written projects where I'd consider a handful of lines of code to be the central tenet of the entire project that everything else is built up around. Copy those lines and everything else is scaffolding that falls out naturally from the development process.
Style is not protected by copyright. You can create your own art in the style of any living artist and this is allowed. AI is automating that process. Some works that the models produce may be too close to an original work and probably be guilty of copyright violation if that ever goes to court. It'll be up to a judge to look at the original and the AI output to weigh in on if it's different enough or if it's an elaborate copy.
Code is different, there isn't a style to it other than perhaps indention and variable naming conventions. Entire sections (that are protected by GPL) are copied. This by itself isn't the issue, it's contaminating your codebase that is the problem. If your work ends up with the same license as the code sources and those are properly documented per those license agreements you're fine. But if you end up violating the GPL and someone knows their code is in use, you are in a tough situation. Again, it'll end up in an expensive court room session where a judge is going to have to determine if enough code was copied to be construed as a license violation. That's the one scenario you would want to avoid in the first place because for a lot of businesses that kind of lawsuit is too expensive to fight.
Humans used to learn to code from copyrighted works (textbooks) without much reference to OSS or Free Software. Similarly, teaching ML models to code from copyrighted works isn't going to violate copyright more frequently than a human might; and detecting exact copies should be pretty easy by comparing with the corpus used to train it. Software houses already have to worry about infringement of snippets, and things like Codex are just one more potential source.
Sometimes. People also borrowed them, read them in libraries, or in later years looked at free textbooks online.
> and a license granted for such use
I never heard of such a thing, and it was never seen as necessary. No student checked the licenses on their textbooks before deciding whether they could read them.
> why do you think it's ok to train diffusion models on copyrighted work, but not co-pilot on GPL code?
Probably worth pointing out that GitHub has a license to the code on its site (read the fine print) that is independent of other licenses that the code may available under.
Whether that license applies to training ML models is legally uncharted waters.
Whether it’s right to train those models is another matter.
It's the hypocrisy of it all. Multi-billion dollar corporations whose empires were built on copyright, violating the licenses of other people's code. Why do they get a pass for that while simultaneously shoving DRM and trusted computing down our throats? I hope they get sued for ridiculous sums.
I think any code that's posted in public should be considered free to use by anyone for anything and its corresponding license be ignored and invalid. If you want restrictions on how people use your code, don't post it publicly, or have a proprietary portion that's required for compilation
Pretending it's not clearly more complicated than that will not convince anyone, it will make them feel condescended to. While Scorsese is Scorsese, a deep learning model is not Michael Bay.
I used that comparison on purpose. Michael Bay leans on a lot of computer driven technology to shoot movies inspired by other more traditional directors. The comparison is direct, if you feel condescended to, I did not intend that.
It is totally okay to train Copilot on GPL code, but the resulting generated code should also be released under GPL license, clearly being a derivative work. I don't even know why it is being discussed.
Music is somewhat more challenging because you have a few other problems that have to be solved in the pipeline, and source separation is still not a 100% solved problem. Beyond that, audio tagging beyond track level artist/genre is a lot harder than image tagging.
Once you have separated sources for a training data set, it's like text generation, except that instead of a single function of sequence position, you have multiple correlated functions of time. Text generation models can barely maintain self consistency from paragraph to paragraph, which is a sequence difference of maybe 200 tokens, now consider moving from token position to a time variable, and adding the requirement that multiple sequences retain coherency both with each other, and with themselves over much larger distances.
There are generative music models, but it's mostly stuff that's been trained on midi files for a specific genre or artist, and the output isn't that impressive.
I am also eagerly awaiting "hark the bloodied angel screams" with blastbeats, shrieks and blistering tremolo guitar, though.
I don't think the law will get hammered down until the AI models generate 'major recording artist' inspired songs. Anyone claiming that artists can't claim 'style' as a defense of AI generated works is in for a rude awakening.
To be honest, the majority opinion on this just demonstrates how narrow minded and uncritical many people here are that the clear and obvious juxtaposition doesn't get their minds churning. It's hard not to notice half of them merely use AI tools and don't really understand how they work, hence why the silly and incorrect phrase of "your mind is a NN!" keeps occurring here.
Code generation models tend to much more often regurgitate code from the training data compared to one of these image based models regurgitating images from the training data.
Code generation models need to have special handling for checking if the generated code falls under copyright.
I think this is a great question, but I think answers should rest on a slightly more detailed understanding of how copyright actually works.
IANAL, but to a first-order approximation: everything is "copyrighted" [1]. The copyright is owned by someone/something. The owner gets to set the terms of the licensing. The rare things not under copyright may have been put explicitly into the public domain (which actually takes some effort), or have had their copyright expire (which takes quite a while; thanks Disney).
So: this is really a question about fair use [2], and about when the terms of licensing kick in, and it should be understood and discussed as such. I don't think anyone who has really thought about this is claiming that the models can't be trained on (copyrighted) material; the consumption of the material is not the problem, is it? The problem is that the models: (1) sometimes recreate particular inputs or identifiable parts of them (like the Getty watermark), or recreate some essential characteristics of their inputs (like possibly trademark-able stylistic elements), AND ALSO, (2) have no way of attributing the output to the input.
Without being able to identify anything specific about the input, it is impossible know with certainty that the output falls within fair use (e.g. because it was sufficiently transformative), and it is impossible to know how to implement the terms of licensing for things that don't fall within fair use. There's just no getting around that with the current crop of models.
The legal minefield is not from (1) or (2), but from (1)+(2), at the moment of redistribution, monetized or not. Even if Copilot was only trained on non-reciprocal licenses (BSD, MIT), there are very likely still licensing terms of use, which may include identifying the original copyright owner. Reciprocal licenses like GPL have more involved licensing terms, but that is not the problem: the problem is failure to identify the original licensing terms. We should not use these models as an opportunity to make an issue about GPL or its authors, or about the business model of companies like Getty; both rest on copyright, and come to our attention because of licensing.
Sorry about the rant. As for your question: I think it may be as simple as: to what extent are readers here the producers of inputs to the ML models, versus consumers of outputs. It gets personal for coders when models violate licensing terms of FOSS code, but it feels fun/empowering to wield the models to make images that we'd otherwise be unable to access. From my rant above you can tell that whether its for code or images, I think the whole thing is an IP disaster.
> For anyone who holds both of these opinions, why do you think it's ok to train diffusion models on copyrighted work, but not co-pilot on GPL code?