munk-a's comments | Hacker News

Reviewing code changes (generally) takes more time than writing code changes for a pretty significant chunk of engineers. If we're optimizing for slop code writing at the expense of funneling senior engineers' time into detailed reviews then we're _doing it wrong_.

A long list of contribution PRs is seen as resume currency in the modern world. A way to game that system is to autogenerate a whole bunch of PRs and hope some of them are accepted to buff your resume. Our issue is that we've been impressed by the volume of PRs and not the quality of PRs. The correction is that we should start caring about the volume of rejected PRs and the quality of accepted ones (like reviewing merge discussions, since they're a close proxy for what can be expected during an internal PR). As long as the volume of PRs is seen as a positive indicator people will try to maximize that number.

This is made more complex by the fact that the most senior members of organizations tend to be irrationally AI-positive - so it's difficult for the hiring layer to push back on a candidate for over-reliance on tools even if they fail to demonstrate core skills that those tools can't supplement. The discussion has become too political[1] in most organizations and that's going to be difficult to overcome.

1. In the classic intra-organizational meaning of politics - not the modern national meaning.


As much as I'd like a quick hack to disable Ray-Bans recording me - that feels like a pretty slam-dunk case of destruction of property.

Just attach a camera to your device and say you were recording in public just like them; nobody seems to have an issue with that. Your system was just measuring the distance to the target using lidar :)

You're still responsible for damaging people's property even if you have a super clever reason why you totally didn't intend that to happen :)

Coal is so deeply irrational. Only when you plug your ears and scream can you block out comprehension of the massive local externalities that make it inefficient compared to other energy options. It is cheap to set up with minimal access to highly skilled professionals, so it was a good option to bootstrap economies until recently, when solar, wind, and NG became easy to access and cost competitive. It's perfectly reasonable to have a phase-out timeline to avoid underutilizing paid-for infrastructure, but it is a dead technology.

The US is in an excellent position to massively harness wind and solar and yet right now it's dialing up the coal usage. I am comfortable celebrating Iceland's decision to not be maliciously dependent on fossil fuels.

> yet right now it's dialing up the coal usage

Reference? This seems to be false. Coal is still in decline, while solar is what's ramping up [1][2]

[1] https://www.eia.gov/todayinenergy/detail.php?id=67005

[2] https://ieefa.org/resources/energy-information-administratio...


Trump had a few executive orders that derailed phase-out plans, and the DoE released a coal plant refurbishing subsidy[1].

1. https://www.energy.gov/articles/energy-department-announces-...


So, in both cases it's helping sustain coal. "Ramping up" means increasing.

Is there something I'm missing here?


I consider minimizing a natural decline with artificial subsidies to be ramping up - maybe a fairer phrase would be "dragging out production" but either way the administration is putting a thumb on the scale to counter natural market forces and perpetuate a dumb thing.

To hell with the US

We've banned this account for continually posting comments like this that are unsubstantive and clearly in breach of the guidelines and HN's intended use.

I mean, the EIA says "U.S. generation fueled by coal increased by 13% in 2025 to 731 BkWh"

The article you linked is mostly about a model of 2026 and 2027, and sure, in the model coal goes away, but that's not a fact about coal, it's just a model.


Yes, with the next sentence explaining why, and how future years are projected to decrease.

"Ramping up" means planned to increase.

Feel free to provide a reference that supports that it's "ramping up". I, and parent, couldn't find one. This is a super boring factual thing that I was curious about, where opinion has no place or purpose.


> "Ramping up" means planned to increase.

No it doesn't. It means increasing.


> ... celebrating Iceland's decision to not ...

Okay, but you're celebrating make-believe virtues. Iceland is also not destroying its tropical coral reefs. That sounds nice...but it has none. Nor any sort of tradition or incentive to try doing that.

The US coal thing is all about widespread memories (and myths) of sustained good economic times, in large areas of the country which now feel destitute. Millions of voters feeling that they have no future. If not that the elites want them to hurry up and die.

To paraphrase Munger - if you want different outcomes there, then you need to change the incentives.


The actions of Oracle lately seem extremely misaligned with maximizing stonks - they're extremely political, more than is necessary to merely stay in the good graces of the current administration.

There are compelling reasons for all sorts of home devices to be connected to the internet[1], but the rub is that ToS flexibility and software updates make this a backdoor waiting to happen. I feel like our legal system has significantly failed us by not empowering the consumer to say "I accept your device with a wifi antenna for the purposes of updating and I reject any exfiltration of personal data from it to your servers". You can have such a contract written - but this is really a place where something like a consumer advocacy board should step in and make sure those rights are sanely guaranteed.

1. It'd be great to ease the method for updating, and it'd be nice to be able to easily monitor the device, especially if it could become active in some manner while you're absent (I don't want the stove turning on to broil right after I leave on a three-month vacation)


> I feel like our legal system has significantly failed us by not empowering the consumer to say "I accept your device with a wifi antenna for the purposes of updating and I reject any exfiltration of personal data from it to your servers".

Worse, it's allowed them to remote into your device and disable features that you bought the device to use, by paywalling them off behind a subscription service that didn't exist when you brought the product home, or just removing them entirely. To me that's no different than theft. It doesn't matter if it's Amazon logging into your Kindle overnight and removing books you already paid for from your virtual bookshelf, or Sony pushing an update to remove the option to use Linux on your PS3, or BMW deciding that you should have to pay them every month just to use the heated seats option you already paid for when you bought your car.

If I, as an individual, sold you something then broke into your house to steal it or break it or demand ransom to get parts back, that would be a crime, but companies get away with it somehow. What Google, Facebook, and Amazon do is basically just stalking.


I care a little bit - I think it's genuinely disappointing that your privacy can be so thoroughly compromised by interesting uses of metadata... but I also won't let the perfect be the enemy of the good. It'd be great if people truly understood the dangers of invasive monitoring beyond just their physical forms (imo, a relatively minor privacy to have compromised compared to your behavior) - but if it gets folks riled up I'm all for it.

I think the missing thing here is that the license violation already happened. Most of the big models trained on data in a manner that violated terms of service. We'll need a court case but I think it's extremely reasonable to consider any model trained on GPL code to be infected with open licensing requirements.

You might wish that were true, but there are very strong arguments it's not. Training on copyleft licensed code is not a license violation. Any more than a person reading it is. In copyright terms, it's such an extreme transformative use that copyright no longer applies. It's fair use.

But agreed that we're waiting for a court case to confirm that. Although really, the main questions for any court cases are not going to be about the principle of fair use itself or whether training is transformative enough (it obviously is), but rather about the specifics:

1) Was any copyrighted material acquired legally (not applicable here), and

2) Is the LLM always providing a unique expression (e.g. not regurgitating books or libraries verbatim)

And in this particular case, they confirmed that the new implementation is 98.7% unique.


Transformative is not the only component of determining fair use; there’s also the economic displacement aspect. If you’re doing a book report and include portions of the original (or provide an interface for viewing portions à la Google Books) you aren’t a threat to the original author’s ability to make a living.

If you’ve used copyrighted books and turned them into a free write-a-book machine, you are suddenly using the authors’ own works against them, in a way that a judge might rule is not very fair.

“ Effect of the use upon the potential market for or value of the copyrighted work: Here, courts review whether, and to what extent, the unlicensed use harms the existing or future market for the copyright owner’s original work. In assessing this factor, courts consider whether the use is hurting the current market for the original work (for example, by displacing sales of the original) and/or whether the use could cause substantial harm if it were to become widespread.”

https://www.copyright.gov/fair-use/


Sure. But it seems very difficult to argue that LLMs are harming that ability to make a living in a direct way.

This is for the same reason that search results or search snippets aren't deemed to harm creators according to copyright. Yes, there might be some percentage of lost sales. And truly, people may be buying fewer JavaScript tutorial books now that LLMs can teach you JavaScript or write it for you. But the relation is so indirect, there's very little chance a court would accept the argument.

Because what the LLM is doing is reading tons of JavaScript and JavaScript tutorials and resources online, and producing its own transformed JavaScript. And the effect of any single JavaScript tutorial book in its training set is so marginal to the final result, there's no direct effect.

And the reason this makes sense is that it's no different from a teacher reading 20 books on JavaScript and then writing their own that turns out to be a best-seller. Yes, it takes away from the previous best-sellers. But that's fine, because they're not copying any of the previous works directly. They're transforming the facts they learned into a new synthesis.


A human reading a unit of work is not a “copy”. I’m pretty sure our legal systems agree that thought or sight is not copying something.

Training an LLM inherently requires making a copy of the work. Even the initial act of loading it from the internet and copying it into memory to then train the LLM is a copy that can be governed by its license and copyright law


I think you are confusing two different meanings of the word ‘copy’. The fact that a computer loads it into memory does not make it automatically a ‘copy’ in the copyright sense.

It absolutely does! In law and the courts

> The court held that making RAM copies as an essential step in utilizing software was permissible under §117 of the Copyright Act even if they are used for a purpose that the copyright holder did not intend.

https://en.wikipedia.org/wiki/Vault_Corp._v._Quaid_Software_....


> The fact that a computer loads it into memory does not make it automatically a ‘copy’ in the copyright sense.

IIRC this exact argument was made in the Blizzard vs bnetd case, wasn't it? Though I can't find confirmation on whether that argument was rejected or not...


> Training an LLM inherently requires making a copy of the work.

But that's not relevant here. Because the copyleft license does not prohibit that (and it's not even clear that any license can prohibit it, as courts may confirm it's fair use, as most people are currently assuming). That's why I noted under (1) that it's not applicable here.


It's absolutely prohibited to copy and redistribute, for commercial purposes, materials you have no license for. This isn't an issue when it comes to the copy-left scenario (though it may potentially enforce transitive licensing requirements on the copier that LLM runners don't want to follow) but it is a huge issue that has come up with LLM training.

LLM training involves ingesting works (in a potentially transformative process) and partially reproducing them - that's a generally restricted action when it comes to licensing.


> It's absolutely prohibited to copy and redistribute, for commercial purposes, materials you have no license for.

Sure, but that's not what LLMs generally do, and it's certainly not what they're intended to do.

The LLM companies, and many other people, argue that training falls under fair use. One element of fair use is whether the purpose/character is sufficiently transformative, and transforming texts into weights without even a remote 1-1 correspondence is the transformation.

And this is why LLM companies ensure that partial reproduction doesn't happen during LLM usage, using a kind of copyrighted-text filter as a last check in case anything would unintentionally get through. (And it doesn't even tend to occur in the first place, except when the LLM is trained on a bunch of copies of the same text.)
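
To make that concrete, here's a toy sketch of what such a last-pass check could look like. To be clear, the real filters are proprietary -- the 8-token window, whitespace tokenization, and in-memory index below are all made-up illustrative choices, not anyone's actual system:

    # Toy verbatim-reproduction filter: flag a model output if it shares a
    # long contiguous token run with any indexed protected text.
    # The 8-gram window and whitespace tokenization are illustrative
    # assumptions, not how any production filter actually works.

    def ngrams(tokens, n=8):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def build_index(protected_texts, n=8):
        """Collect every n-gram appearing in the protected corpus."""
        index = set()
        for text in protected_texts:
            index |= ngrams(text.split(), n)
        return index

    def looks_verbatim(output, index, n=8):
        """True if the output repeats any indexed n-gram verbatim."""
        return any(g in index for g in ngrams(output.split(), n))

    corpus = ["the quick brown fox jumps over the lazy dog again and again"]
    idx = build_index(corpus)
    print(looks_verbatim("so the quick brown fox jumps over the lazy dog ran", idx))  # True

A real deployment would need hashed/sharded indexes and fuzzier matching than exact token runs, but the shape of the check is the same.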


Yea, at the end of the day a big part of this question comes down to whether that copying is fair use, and that is an open question, with the transformative nature being the primary point in favor of the LLM. But it is copying from some works into another - if it doesn't have some fair use exception it is absolutely violating the licensing of most of the training data. It's a bit different from previous settled case law because it's copying so little from so many billions of different things. I think blocking reproduction is wise of LLM companies for PR purposes, but it doesn't guarantee that training is a license-exempted activity.

Would it be fair to say that if you steal from enough people then it becomes OK? I can’t see it—especially considering this is IP law, expected to grant people confidence in their authorship rights and thus encourage innovation and creativity.

it's "if you steal fast enough and ubiquitously enough, then you win" where the goal is you've so entrenched your position that by the time a lawsuit rolls around, there isn't any real remedy. ideally, there would have been a day 1 lawsuit and injunction.

Yup. Of course it's copying. But all expectations are that courts will rule that fair use allows such copying, because of the nature of the transformation.

We've drifted a bit off the road from "To promote the progress of science and useful arts"

> Training on copyleft licensed code is not a license violation. Any more than a person reading it is. In copyright terms, it's such an extreme transformative use that copyright no longer applies. It's fair use.

This is just an assertion that you're making. There's no argument here. I'm aware that this is also an assertion that some judges have made.

My claim is that LLMs are not human, therefore when you apply words like "training" to them, you're only doing it metaphorically. It's no more "training" than copying code to a different hard drive is training that hard drive. And it's no more "transformative" than rar'ing or zipping the code, then unzipping it. I can't sell my jpgs of pngs I downloaded from Getty.

I have no idea how LLMs can be considered transformative work that immunizes me from owing the least bit of respect to the source material, but if I sample 2-6 second snatches from 10 different songs, put them through over 9000 filters and blend them into a new work, I owe money to everyone involved. I might even owe money to the people who wrote the filters, depending on the licensing.

> 98.7% unique.

This doesn't mean anything. This is a meaningless arrangement of words. The way we figure out things are piracy is through provenance, not bizarre ad hoc measurements. If I read a book in Spanish and rewrite it in English, it doesn't suddenly become mine even though it's 96.6492387% unique. Not even if I drop a few chapters, add in a couple of my own, and change the ending.


> This is just an assertion that you're making. There's no argument here.

...OK? Was somebody asking me for an "argument"? I'm just stating how things are currently understood.

> And it's no more "transformative" than rar'ing or zipping the code, then unzipping it.

That's obviously false, so I'm not sure what to tell you.

> but if I sample 2-6 second snatches from 10 different songs, put them through over 9000 filters and blend them into a new work, I owe money to everyone involved

You don't, actually, if they're no longer recognizable -- which they wouldn't be after "9000 filters". I don't know where you got the idea that you'd still owe money. And I've certainly never heard of an audio filter license that was contingent on commercial distribution.

> This doesn't mean anything. This is a meaningless arrangement of words.

Statistics are meaningful. Obviously you need to look at the actual identical lines. But if they're a bunch of trivial things like initializing variables with obvious names, then they don't count for much. And if you're adhering to the same API, you would expect to have some small percentage of lines happen to match. So the fact that this is <2%, as opposed to 40%, is hugely significant as a first step of analysis.
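
To make that first step concrete: the simplest version of such a statistic is just a normalized line-overlap count. Here's a minimal sketch, with hypothetical filenames -- I don't know what methodology actually produced the 98.7% figure:

    # Back-of-envelope line overlap between a reimplementation and the
    # original. Filenames and normalization choices are illustrative
    # assumptions, not the methodology behind the quoted figure.

    def normalized_lines(path):
        """Read a file, strip whitespace, and drop blank lines."""
        with open(path) as f:
            return [line.strip() for line in f if line.strip()]

    def percent_unique(new_path, original_path):
        """Share of lines in the new file not found verbatim in the original."""
        original = set(normalized_lines(original_path))
        new = normalized_lines(new_path)
        matches = sum(1 for line in new if line in original)
        return 100.0 * (len(new) - matches) / len(new)

    print(f"{percent_unique('reimpl.c', 'original.c'):.1f}% unique")

You'd then manually inspect whatever small set of matching lines falls out, per the above.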

I suggest you might find conversations here on HN more productive if you soften your tone a bit. Saying things like "this is just an assertion that you're making" or "this is a meaningless arrangement of words" is not generally going to make people want to respond to you.


> Training on copyleft licensed code is not a license violation. Any more than a person reading it is.

Some might hold that we've granted persons certain exemptions, on account of them being persons. We do not have to grant machines the same.

> In copyright terms, it's such an extreme transformative use that copyright no longer applies.

Has the model really performed an extreme transformation if it is able to produce the training data near-verbatim? Sure, it can also produce extremely transformed versions, but is that really relevant if it holds within it enough information for a (near-)verbatim reproduction?


>Has the model really performed an extreme transformation if it is able to produce the training data near-verbatim? Sure, it can also produce extremely transformed versions, but is that really relevant if it holds within it enough information for a (near-)verbatim reproduction?

I feel as though, from an information-theoretic standpoint, it can't be possible that an LLM (which is almost certainly <1 TB big) can contain any substantial verbatim portion of its training corpus, which includes audio, images, and videos.
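
Rough arithmetic, with loudly made-up numbers (the corpus size and bytes-per-token are guesses, and this ignores that the weights also have to encode everything else the model can do):

    # Back-of-envelope: can ~1 TB of weights store the training text verbatim?
    # All figures are illustrative assumptions.
    model_bytes = 1e12            # ~1 TB of weights, being generous
    corpus_tokens = 15e12         # assume a ~15-trillion-token training set
    bytes_per_token = 4           # ~4 bytes of raw text per token, roughly

    corpus_bytes = corpus_tokens * bytes_per_token  # ~60 TB of raw text
    print(f"{model_bytes / corpus_bytes:.1%}")      # ~1.7% of the corpus

So even on generous assumptions the weights could hold only a couple percent of the training text verbatim, before spending any capacity on actually working.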


> I feel as though, from an information-theoretic standpoint, it can't be possible that an LLM (which is almost certainly <1 TB big) can contain any substantial verbatim portion of its training corpus, which includes audio, images, and videos.

It doesn't need to for my argument to make sense. It's a problem if it reproduces a single copyrighted work (near)-verbatim. Which we have plenty of examples of.


Do we? Even when people attempt to jailbreak most models with 1000s of prompts they are only able to get a paragraph or two of well-known copyrighted works and some blocks of paraphrased text, and that's with giving it a substantially leading question.

It surely doesn't matter how leading or contorted the prompt has to be if it shows that the model is encoding the copyrighted work verbatim or nearly so.

It definitely does, which is why I said a substantial amount of verbatim material. If someone can recite the first paragraph of Harry Potter and the Sorcerer's Stone from memory, it surely doesn't mean they have memorized the entire book.

> We do not have to grant machines the same.

No we don't have to, but so far we do, because that's the most legally consistent approach. If you want to change that, you're going to need to pass new laws that may wind up radically redefining intellectual property.

> Has the model really performed an extreme transformation if it is able to produce the training data near-verbatim?

Of course it has, if the transformation is extreme, as it appears to be here. If I memorize the lyrics to a bunch of love songs, and then write my own love song where every line is new, nobody's going to successfully sue me just because I can sing a bunch of other songs from memory.

Also, it's not even remotely clear that the LLM can produce the training data near-verbatim. Generally it can't, unless it's something that it's been trained on with high levels of repetition.


I want to briefly pick at this:

> you're going to need to pass new laws that may wind up radically redefining intellectual property

You're correct that this is one route to resolving the situation, but I think it's reasonable to lean more strongly into the original intent of intellectual property laws to defend creative works as a manner to sustain yourself, which would draw a pretty clear distinction between human creativity and reuse on the one hand and LLMs on the other.


> into the original intent of intellectual property laws to defend creative works as a manner to sustain yourself

But you're missing the other half of copyright law, which is the original intent to promote the public good.

That's why fair use exists, for the public good. And that's why the main legal argument behind LLM training is fair use -- that the resulting product doesn't compete directly with the originals, and is in the public good.

In other words, if you write an autobiography, you're not losing significant sales because people are asking an LLM about your life.


The big difference between people reading code and LLMs reading code is that people have legal liability and LLMs do not. You can't sue an LLM for copyright infringement, and it's almost impossible for users to tell when it happens.

BTW in 2023 I watched ChatGPT spit out hundreds of lines of F# verbatim from my own GitHub. A lot of people had this experience with GitHub Copilot. "98.7% unique" is still a lot of infringement.


> people have legal liability and LLMs do not. You can't sue an LLM for copyright infringement

That's not relevant, because you can still sue the person using the LLM and publishing the repository. Legal liability is completely unchanged.


>Legal liability is completely unchanged.

It's changed completely, from your own example.

If you commission art from an artist who paints a modified copy of Warhol's work, the artist is liable (even if you keep that work private, for personal use).

If you commission it from OpenAI (by sending a query to their ChatGPT API), by your argument, you are the person liable — and OpenAI is off the hook even if that work is distributed further.

I'm not going to argue about the merits of creativity here, or that someone putting a prompt into ChatGPT considers themselves an artist.

That's irrelevant. The work is created on OpenAI servers, by the LLMs hosted there, and is then distributed to whoever wrote the prompt.

Models run locally are distributed by whoever trained them.

If you train a model on whatever data you legally have access to, and produce something for yourself, it's one thing.

Distribution is where things start to get different.


> If you commission it from OpenAI (by sending a query to their ChatGPT API), by your argument, you are the person liable — and OpenAI is off the hook even if that work is distributed further.

Let's distinguish two different scenarios here:

1) Your prompt is copyright-free, but the LLM produces a significant amount of copyrighted content verbatim. Then the LLM is liable, and you too are liable if you redistribute it.

2) Your prompt contains copyrighted data, and the LLM transforms it, and you distribute it. Then if the transformation is not sufficient, you are liable for redistributing it.

The second example is what I'm referring to, since the commercial LLMs are now very good about not reproducing copyrighted content verbatim. And yes, OpenAI is off the hook, from everything I understand legally.

Your example of commissioning an artist is different from LLMs, because the artist is legally responsible for the product and is selling the result to you as a creative human work, whereas an LLM is a software tool and the company is selling access to it. So the better analogy is if you rent a Xerox copier to copy something by Warhol. Xerox is not liable if you try to redistribute that copy. But you are. So here, Xerox=OpenAI. They are not liable for your copyrighted inputs turning into copyrighted outputs.


>So the better analogy is if you rent a Xerox copier to copy something by Warhol

It isn't.

One analogy in that case would be going to a FedEx copy center and asking the technician to produce a bunch of copies of something.

They absolve themselves of liability by having you sign a waiver certifying that you have complete rights to the data that serves as input to the machine.

In case of LLMs, that includes the entire training set.


The most salient difference is that it's impossible to tell if an LLM is plagiarizing, whereas Xeroxing something implies specific intent to copy. It makes no sense to push liability onto LLM users.

Are you following the distinction between my scenarios (1) and (2)?

In scenario (1) the LLM is plagiarizing. But that's not the scenario we're discussing. And I already said, this is where the LLM is liable. Whether a user should be too is a different question.

But scenario (2) is what I'm discussing, as I already explained, and it's very possible to tell, because you yourself submitted the copyrighted content. All you need to do is look at whether the output is too similar to the input.

If there's some scenario where you input copyrighted material and it transforms it into different material that is also copyrighted by someone else... that is a pretty unlikely edge case.


You can sue the company making the LLM, which is what many have done.

I agree there has to be a court case about it. I think the current argument, however, is that it is transformative, and therefore falls under fair use.

Yea, a finding that training is transformative would be pretty significant, and the precedent of thumbnail creation being deemed transformative would likely steer us towards such a finding. Transformative is always a hard thing to bank on because it is such a nebulous, judgement-based call. There are excellent examples of how precise and gritty this can get in audio sampling.

Didn't know about thumbnails being fair use. In that case, I just don't see an argument that genAI training on source code is less transformative than thumbnails.

You don’t get to simply claim fair use based on how transformative your derivative work is.

“”” Section 107 calls for consideration of the following four factors in evaluating a question of fair use:

Purpose and character of the use, including whether the use is of a commercial nature or is for nonprofit educational purposes: Courts look at how the party claiming fair use is using the copyrighted work, and are more likely to find that nonprofit educational and noncommercial uses are fair. This does not mean, however, that all nonprofit education and noncommercial uses are fair and all commercial uses are not fair; instead, courts will balance the purpose and character of the use against the other factors below. Additionally, “transformative” uses are more likely to be considered fair. Transformative uses are those that add something new, with a further purpose or different character, and do not substitute for the original use of the work.

Nature of the copyrighted work: This factor analyzes the degree to which the work that was used relates to copyright’s purpose of encouraging creative expression. Thus, using a more creative or imaginative work (such as a novel, movie, or song) is less likely to support a claim of a fair use than using a factual work (such as a technical article or news item). In addition, use of an unpublished work is less likely to be considered fair.

Amount and substantiality of the portion used in relation to the copyrighted work as a whole: Under this factor, courts look at both the quantity and quality of the copyrighted material that was used. If the use includes a large portion of the copyrighted work, fair use is less likely to be found; if the use employs only a small amount of copyrighted material, fair use is more likely. That said, some courts have found use of an entire work to be fair under certain circumstances. And in other contexts, using even a small amount of a copyrighted work was determined not to be fair because the selection was an important part—or the “heart”—of the work.

Effect of the use upon the potential market for or value of the copyrighted work: Here, courts review whether, and to what extent, the unlicensed use harms the existing or future market for the copyright owner’s original work. In assessing this factor, courts consider whether the use is hurting the current market for the original work (for example, by displacing sales of the original) and/or whether the use could cause substantial harm if it were to become widespread. “””

https://www.copyright.gov/fair-use/


I haven't claimed anything; the courts did: https://www.whitecase.com/insight-alert/two-california-distr.... And regardless, my point still stands that it is an open question; however, given the already present body of cases, it is tipping in favor of the AI companies. Also, if thumbnails fall under fair use due to being transformative of full-sized pictures, I cannot see an argument that AI training on data is somehow less transformative than downscaling an image for a thumbnail.

The act of training by itself has been ruled to be fair use over and over again, including for LLMs, and there isn't much debate left there.

The test for infringement is if the output is transformative enough, and that is what NYT vs OpenAI etc. are arguing.


Is the LLM acting as my agent? If the LLM has been exposed to the source code then have I been exposed to the source code? So in that case is a "clean room" implementation possible?

Won't this cause significant legal issues in two party consent states and have a huge potential to run afoul of revenge porn laws?

Where are all the think of the children people now?

I've heard that the government has terabytes of data on people that spend far too much time thinking of the children.
