
Nobody disputes that the LLM was drawing on knowledge in its training data. Obviously it was! But you'll need to be a bit more specific with your critique, because there is a whole spectrum of interpretations, from "it just decompressed fuzzily-stored code verbatim from the internet" (obviously wrong, since the Rust-based C compiler it wrote doesn't exist on the internet) all the way to "it used general knowledge from its training about compiler architecture and x86 and the C language."

Your post is phrased like it's a two-sentence slam-dunk refutation of Anthropic's claims. I don't think it is, and I'm not even clear on what precisely you're claiming, except that LLMs use knowledge acquired during training, which we all agree on here.




"clean room" usually means "without looking at the source code" of other similar projects. But presumably the AIs training data would have included GCC, Clang, and probably a dozen other C compilers.

Suppose you, a human, are working on a clean room implementation of a C compiler. How do you go about doing it? Will you need to know about: a) the C language, and b) the inner workings of a compiler? How did you acquire that knowledge?

Doesn’t matter how you gain general knowledge of compiler techniques as long as you don’t have specific knowledge of the implementation of the compiler you are reverse engineering.

If you have ever read the source code of the compiler you are reverse engineering, you are by definition not doing a clean room implementation.


Claude was not reverse engineering here. By your definition no one can do a clean room implementation if they've taken a recent compilers course at university.

Claude was reverse engineering gcc. It was using it as an oracle and attempting to exactly match its output. That is the definition of reverse engineering. Since Claude was trained on the gcc source code, that’s not a clean room implementation.

> By your definition no one can do a clean room implementation if they've taken a recent compilers course at university.

Clean room implementation has a very specific definition. It’s not my definition. If your compiler course walked through the source code of a specific compiler then no, you couldn’t build a clean room implementation of that specific compiler.


There is no specific definition of clean room implementation. Please provide a source for your claim otherwise.

There are many well known examples of clean room implementation. One example that survived lawsuits is Sony v. Connectix:

During production, Connectix unsuccessfully attempted a Chinese wall approach to reverse engineer the BIOS, so its engineers disassembled the object code directly. Connectix's successful appeal maintained that the direct disassembly and observation of proprietary code was necessary because there was no other way to determine its behavior - [0]

That practice is similar to GCC being used here to verify the output of the generated compiler, arguably even more intrusive.

[0] - https://en.wikipedia.org/wiki/Clean-room_design


“clean room implementation” is a term of art with a specific meaning. It has no statutory definition though so you’re technically right. But it is a defense against copyright infringement because you can’t infringe on copyright without knowledge of the material.

> During production, Connectix unsuccessfully attempted a Chinese wall approach to reverse engineer the BIOS, so its engineers disassembled the object code directly.

This doesn’t mean what you think it means. They unsuccessfully attempted a clean room implementation. What they did do was later ruled to be fair use, but it wasn’t a clean room implementation.

Using gcc as an oracle isn’t what makes it not a clean room implementation. Prior knowledge of the source code is what makes it not a clean room implementation. Using gcc as an oracle makes it an attempt to reverse engineer gcc, it says nothing about whether it is a clean room implementation or not.

There is no definition of “clean room implementation” that allows knowledge of source code. Otherwise it’s not a clean room implementation. It’s just reverse engineering/copying.


Again, reverse engineering is a valid use case of clean room implementation as I posted above, so you don't have a point there.

> “clean room implementation” is a term of art with a specific meaning.

What is the specific meaning you are talking about? If I set out to do a clean room implementation of some software, what do I need to do specifically so that I will prevail against any copyright infringement claims? The answer is that there is no such surefire guarantee.

Re: Sony v. Connectix, the purpose of clean room is to protect against copyright infringement, and since Connectix was ruled not to have infringed Sony's copyrights, their implementation is practically clean room under the law, despite all the pushback. If Connectix prevailed, I'm sure the C compiler in question would prevail as well if its authors got sued.

Finally, take Phoenix vs. IBM re: the former's BIOS implementation of the latter's PC:

Whenever Phoenix found parts of this new BIOS that didn't work like IBM's, the isolated programmer would be given written descriptions of the problems, but not any coded solutions that might have hinted at IBM's original version of the software - [0]

That very much sounds like using GCC as an online known-good compiler oracle to compare against in this case.

[0] - https://books.google.com/books?id=Bwng8NJ5fesC&pg=PA56#v=one...


You’re getting confused because you are substituting the goal of a clean room implementation for its definition. And you are not understanding that “clean room implementation” is one specific type of reverse engineering.

The goal is to avoid copyright infringement claims. A specific clean room implementation may or may not be successful at that.

This does not mean that any reverse engineering attempt that successfully avoids copyright infringement was a clean room implementation.

A clean room implementation is a specific method of reverse engineering where one team writes a spec by reviewing the original software and the other team attempts to implement that spec. The entire point is so that the 2nd team has no knowledge of proprietary implementation details.

If the 2nd team has previously read the entire source code that defeats the entire purpose.

> That very much sounds like using GCC as an online known-good compiler oracle to compare against in this case.

Yes and that is absolutely fine to do in a clean room implementation. That’s not the part that makes this not a clean room implementation. That’s the part that makes it an attempt at reverse engineering.
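
For readers unfamiliar with the term, "using gcc as an oracle" amounts to differential testing: run the same inputs through a known-good reference and through the new implementation, and flag every disagreement. Below is a minimal Python sketch of that loop. It is an illustration only, not how the project discussed here was actually tested; the function names and the toy "compilers" (plain functions that sum the integers in a string) are hypothetical stand-ins so the example runs without any toolchain installed.

```python
from typing import Callable, Iterable, List, Tuple


def differential_test(
    oracle: Callable[[str], int],
    candidate: Callable[[str], int],
    programs: Iterable[str],
) -> List[Tuple[str, int, int]]:
    """Return (input, expected, actual) for every input where the two disagree."""
    mismatches = []
    for src in programs:
        expected = oracle(src)    # behavior of the trusted reference
        actual = candidate(src)   # behavior of the implementation under test
        if expected != actual:
            mismatches.append((src, expected, actual))
    return mismatches


# Hypothetical stand-in for the reference: sums whitespace-separated integers.
def reference_behavior(src: str) -> int:
    return sum(int(tok) for tok in src.split())


# Hypothetical stand-in for the new implementation, with a deliberate bug:
# it silently drops negative numbers.
def candidate_behavior(src: str) -> int:
    return sum(int(tok) for tok in src.split() if not tok.startswith("-"))


if __name__ == "__main__":
    for src, expected, actual in differential_test(
        reference_behavior, candidate_behavior, ["1 2 3", "4 -1", "10"]
    ):
        print(f"mismatch on {src!r}: expected {expected}, got {actual}")
```

Note that the loop only ever observes the reference's outputs, never its source, which is why an oracle by itself says nothing about whether an implementation is clean room.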


Why do you say it reverse engineered gcc instead of llvm? If you read the code, it has many more llvm concepts than gcc ones.

Because they used gcc output as a reference spec.

> you are by definition not doing a clean room implementation.

This makes no sense. Reverse engineering IS an application of clean room implementation. Citing Wikipedia:

“Clean-room design (also known as the Chinese wall technique) is the method of copying a design by reverse engineering and then recreating it without infringing any of the copyrights associated with the original design”

https://en.wikipedia.org/wiki/Clean-room_design


There are many ways to reverse engineer a piece of software.

A clean room implementation is one such method of reverse engineering.

A clean room implementation is always reverse engineering. Reverse engineering is not always done using a clean room method.


The result is a fuzzy reproduction of the training input, specifically of the compilers contained within. The reproduction in a different, yet still similar enough programming language does not refute that. The implementation was strongly guided by a compiler and a suite of tests as an explicit filter on those outputs and limiting the acceptable solution space, which excluded unwanted interpolations of the training set that also result from the lossy input compression.

The fact that the implementation language for the compiler is Rust doesn't factor into this. ML-based natural language translation has shown that model training produces an internal abstract space of concepts that maps from and to different languages on the input and output side. All this points to is that there are different implicitly formed decoders for the same compressed data embedded in the LLM, and the keyword "Rust" in the input activates the one specific to that programming language.


> The result is a fuzzy reproduction of the training input, specifically of the compilers contained within.

Is it? I'm somewhat familiar with gcc and clang's source and it doesn't really particularly look like it to me.

https://github.com/anthropics/claudes-c-compiler/blob/main/s...

https://llvm.org/doxygen/LoopStrengthReduce_8cpp_source.html

https://github.com/gcc-mirror/gcc/blob/master/gcc/gimple-ssa...


Checking for similarity with compilers that consist of orders of magnitude more code probably doesn't reveal much. There are many more smaller compilers for C-adjacent languages out there, plus code fragments from textbooks.

There are not many more compilers with the specific optimization pass I linked.

Also, I don't think you could reuse code from a different compiler unless you used the same IR.


Thanks for elaborating. So what is the empirically-testable assertion behind this… that an LLM cannot create a (sufficiently complex) system without examples of the source code of similar systems in its training set? That seems empirically testable, although not for compilers without training a whole new model that excludes compiler source code from training. But what other kind of system would count for you?

I personally work on simulation software and create novel simulation methods as part of the job. I find that LLMs can only help if I reduce that task to a translation of detailed algorithm descriptions from English to code. And even then, the output is often riddled with errors.


