
"Open weights" means you can use the weights for free (as in beer). "Open source" means you get the training dataset and the methodology. ~Nobody does open source LLMs.


Indeed, since when is a deliverable like a jpeg/exe, which is what the model file resembles, considered the source? It is more like an open result, or a freely available VM image that works but has its core filesystem scrambled or encrypted.

Zuck knows this very well, and it does him no honour to speak like that; coming from his position, it amounts to an attempt to change the present semantics of open source. Of course, others do that too - using the notion of open source to describe something very far from open.

What Meta is doing under his command can better be described as releasing the resulting...build, so that it can be freely poked around and even put to work. But the result cannot be effectively reverse engineered.

What's more ridiculous is that these graph structures can be made available precisely because the result is not the source in its whole form. It is only thanks to the fact that it is not traceable to the source, which makes the whole game not only closed, but like... sealed forever. An unfair retelling of humanity's knowledge, tossed around in a very obscure container that nobody can reverse engineer.

how's that even remotely similar to open source?


Even if everything was released how you described, what good would that really do for an individual without access to heaps of compute? Functionally there seems to be no difference between open weights and open compute, because nobody could train a facsimile model. Furthermore, all frontier models are inscrutable due to their construction. It’s wild to me seeing people complain about semantics when Meta dropped their model for cheap. Now I’m not saying we should suck the zuck for this act of charity, but you have to imagine that the other frontier labs are not thrilled that Meta has invalidated their compute moats with the release of llama. Whether we like it or not, we’re on this AI rollercoaster and I’m glad that it’s not just oligopolists dictating the direction forward. I’m happy to see Meta take this direction, knowing that the alternatives are much worse.


That's not the discussion. We're talking about what open source is, and it's having the weights and the method to recreate the model.

If someone gives me an executable that I can run for free, and then says "eh why do you want the source, it would take you a long time to compile", that doesn't make it open source, it just makes it gratis.


Calling weights an executable is disingenuous and not a serious discussion. You can do a lot more with weights than you could with a binary executable.


You can do a lot more with an executable as well than just execute it. So maybe the analogy is apt, even if not exact.

Actually, you can reverse engineer an executable into something that could be compiled back into an executable with the exact same functionality, which is AFAIK impossible to do with "open weights". Still, we don't call free executables "open source".


It's not really an analogy. LLMs are quite literally executables in the same way that jpegs are executables. They both specify machine readable (but not human readable) domain specific instructions executed by the image viewer/inference harness.

And yes, like other executables, they are not literal black boxes. Rather, they provide machine readable specifications which are not human readable without immense effort.

For an LLM to be open source there would need to be source code. Source code, btw, is not just a procedure that can be handed to a machine to produce code that can be executed by the machine. That means the training data and code is not sufficient (or necessary) for an open source model.

What we need for an open source model is a human readable specification of the model's functionality and data structures which allows the user to modify specific arbitrary functionally/structure, and can be used to produce an executable (the model weights).

We simply need much stronger interpretability for that to be possible.


This is debatable; even an executable is a valuable artifact. You can also do a lot with an executable in expert hands.


I'd find knowing what's in the training data hugely valuable - can analyse it to understand and predict capabilities.


Linux is open source and is mostly C code. You cannot run C code directly, you have to compile it and produce binaries. But it's the C code, not binary form, where the collaboration happens.

With LLMs, weights are the binary code: it's how you run the model. But to be able to train the model from scratch, or to collaborate on new approaches, you have to operate at the level of architecture, methods, and training data sets. They are the source code.


Analogies are always going to fall short. With LLM weights, you can modify them (quant, fine-tuning) to get something different, which is not something you do with compiled binaries. There are ample areas for collaboration even without being able to reproduce from scratch, which takes $X Millions of dollars, also something that a typical binary does not have as a feature.
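To make "quant" concrete: quantization rewrites the weight tensors themselves in a lossy, lower-precision form, which has no analogue for a compiled binary. A minimal sketch of symmetric int8 quantization, assuming numpy; the tensor here is a stand-in, not a real model weight.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal((4, 4)).astype(np.float32)  # stand-in for one weight tensor

# Symmetric int8 quantization: map the largest magnitude to 127.
scale = float(np.abs(weights).max()) / 127.0
q = np.round(weights / scale).astype(np.int8)

# Dequantization recovers an approximation, not the original floats.
deq = q.astype(np.float32) * scale
max_err = float(np.abs(weights - deq).max())
```

The point of the analogy debate survives either way: you are modifying the artifact, not regenerating it from its inputs.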


You can absolutely modify compiled binaries to get something different. That's how lots of video game modding and ROM hacks work.


And we would absolutely do it more often if compiling would cost as much as training of an LLM costs now.


I considered adding "normally" to the binary modifications expecting a response like this. The concepts are still worlds apart

Weights aren't really a binary in the same sense that a compiler produces, they lack instructions and are more just a bunch of floating point values. Nor can you run model weights without separate code to interpret them correctly. In this sense, they are more like a JPEG or 3d model
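One way to see this point: the weights are inert numbers until a separate program decides what to do with them. A toy sketch, assuming numpy; the two-layer "model" and its values are invented for illustration.

```python
import numpy as np

# The "weights": just arrays of floats, with no instructions anywhere in them.
w1 = np.array([[1.0, -1.0], [0.5, 2.0]], dtype=np.float32)
w2 = np.array([[1.0], [1.0]], dtype=np.float32)

# The "inference harness": separate code that gives the floats meaning.
def forward(x):
    h = np.maximum(x @ w1, 0.0)  # the harness chooses ReLU; the weights don't say so
    return h @ w2

y = forward(np.array([[1.0, 1.0]], dtype=np.float32))  # -> [[2.5]]
```

Swap the harness and the same floats compute something else entirely, which is why they read more like a JPEG than like machine code.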


JPEGs and 3D models are also executable binaries. They, like model weights, contain domain specific instructions that execute in a domain specific and turing incomplete environment. The model weights are the instructions, and those instructions are interpreted by the inference harness to produce outputs.


>Nobody does open source LLMs.

There are a bunch of independent, fully open source foundation models from companies that share everything (including all data). AMBER and MAP-NEO for example. But we have yet to see one in the 100B+ parameter category.


Sorry, the tilde before "nobody" is my notation for "basically nobody" or "almost nobody". I thought it was more common.


It is more common when it comes to numbers, I guess. I would agree with "there are ~5 ancestors in this comment chain", if roughly 4-6 is acceptable.


It's the literal (figurative) nobody rather than the literal (literal) nobody.


There are plenty of open source LLMs, they just aren’t at the top of the leaderboards yet. Here’s a recent example, I think from Apple: https://huggingface.co/apple/DCLM-7B

Using open data and dclm: https://github.com/mlfoundations/dclm


If weights are not the source, then if they gave you the training data and scripts but not the weights, would that be "open source"?


Yes, but they won't do that. Possibly because extensive copyright violation in the training data that they're not legally allowed to share.


If somebody leaked the training data, they could deny that it's real, ergo not get sued, and the data would be available.



It's not available if you can't use it because you don't have as many lawyers as facebook and can't ignore laws so easily.


This is bending the definition to the other extreme.

Linux doesn't ship you the compiler you need to build the binaries either, that doesn't mean it's closed source.

LLMs are fundamentally different to software and using terms from software just muddies the waters.


And LLMs don't ship with a Python distribution.

Linux sources :: dataset that goes into training

Linux sources' build confs and scripts :: training code + hyperparameters

GCC :: Python + PyTorch or whatever they use in training

Compiled Linux kernel binary :: model weights


Just because you keep saying it doesn't make it true.

LLMs are not software any more than photographs are.


Then what is the "source"? If we are to use the term "source" then what does that mean here, as distinct from it merely being free?


It means nothing because LLMs aren't software.


Do they not run on a computer?


So does a video. Is a video open source if you're given the permissions to edit it? To distribute it? Given the files to generate it? What if the files can only be open in a proprietary program?

Videos aren't software and neither are llms.


If a video doesn't have source code, then it can't be open source. Likewise, if you feel that an LLM doesn't have source code because of some property of what it is -- as you claim it isn't software and somehow that means that it abstractly removes it from consideration for this concept (an idea I think is ridiculous, FWIW: an LLM is clearly software that runs in a particularly interesting virtual machine defined by the model architecture) -- then, somewhat trivially, it also can't be open source. It is, as the person you are responding to says, at best "open weights".

If a video somehow does have source code which can "generate it", then the question of what it means for the source code to the video to be open even if the only program which can read it and generate the video is closed source is equivalent to asking if a program written in Visual Basic can ever be open source given that the Visual Basic compiler is closed source. Personally, I can see arguments either way on this issue, though most people seem to agree that the program is still open source in such a situation.

However, we need not care too much about the answer to that specific conundrum, as the moral equivalent of both the compiler and the runtime virtual machine are almost always open source. What is then important is much easier: if you don't provide the source code to the project, even if the compiler is open source and even if it runs on an open source machine, clearly the project -- whatever it is that we might try to be discussing, including video files -- cannot be open source. The idea that a video can be open source when what you mean is that the video is unencrypted and redistributable but was merely intended to be played in an open source video player is absurd.


> Is a video open source if you're given the permissions to edit it? To distribute it? Given the files to generate it?

If you're given the source material and project files to continue editing where the original editors finished, and you're granted the rights to re-distribute - Yes, that would be open source[1].

Much like we have "open source hardware" where the "source" consists of original schematics, PCB layouts, BOM, etc. [2]

[1] https://en.wikipedia.org/wiki/Open-source_film

[2] https://en.wikipedia.org/wiki/Open-source_hardware


Videos and images are software. They are compiled binaries with very domain specific instructions executed in a very non-turing complete context. They are generally not released as open source, and in many cases the source code (the file used to edit the video or image) is lost. They are not seen, colloquially, as software, but that does not mean that they are not software.

If a video lacks a specification file (the source code) which can be used by a human reader to modify specific features in the video, then it is software that is simply incapable of being open sourced.


"LLMs are fundamentally different to software and using terms from software just muddies the waters."

They're still software, they just don't have source code (yet).


There is a comment elsewhere claiming there are a few dozen fully open source models: https://news.ycombinator.com/item?id=41048796


Why is the dataset required for it to be open source?

If I self host a project that is open sourced rather than paying for a hosted version, like Sentry.io for example, I don't expect data to come along with the code. Licensing rights are always up for debate in open source, but I wouldn't expect more than the code to be available and reviewable for anything needed to build and run the project.

In the case of an LLM I would expect that to mean the code run to train the model, the code for the model data structure itself, and the control code for querying the model should all be available. I'm not actually sure if Meta does share all that, but training data is separate from open source IMO.


The open source movement, from which the name derives, was about the freedom to make bespoke alterations to the software you choose to run. Provided you have reasonably widespread proficiency in industry standard tools, you can take something that's open source, modify that source, and rebuild/redeploy/reinterpret/re-whatever to make it behave the way that you want or need it to behave.

This is in contrast to a compiled binary or obfuscated source image, where alteration may be possible with extraordinary skill and effort but is not expected and possibly even specifically discouraged.

In this sense, weights are entirely like those compiled binaries or obfuscated sources rather than the source code usually associated with "open source".

To be "open source" we would want LLM's where one might be able to manipulate the original training data or training algorithm to produce a set of weights more suited to one's own desires and needs.

Facebook isn't giving us that yet, and very probably can't. They're just trading on the weird boundary state of the term "open source" -- it still carries prestige and garners good will from its original techno-populist ideals, but is so diluted by twenty years of naive consumers who just take it to mean "I don't have to pay to use this" that the prestige and good will is now misplaced.


>The open source movement, from which the name derives, was about the freedom to make bespoke alterations to the software you choose to run.

The open source movement was a cash grab to make the free software movement more palatable to big corp by moving away from copy left licenses. The MIT license is perfectly open source and means that you can buy software without ever seeing its code.


If you obtain open source licensed software you can pass it on legally (and freely). With some licenses you also have to provide the source code.


The sticking point is you can’t build the model. To be able to build the model from scratch you need methodology and a complete description of the data set.

They only give you a blob of data you can run.


Got it, that makes sense. I still wouldn't expect them to have to publicly share the data itself, but if you can't take the code they share and run it against your own data to build a model that wouldn't be open source in my understanding of it.


Data is the source code here, though. Training code is effectively a build script. Data that goes into training a model does not function like assets in videogames; you can't swap out the training dataset after release and get substantially the same thing. If anything, you can imagine the weights themselves are the asset - and even if the vendor is granting most users a license to copy and modify it (unlike with videogames), the asset itself isn't open source.

So, the only bit that's actually open-sourced in these models is the inference code. But that's a trivial part that people can procure equivalents of elsewhere or reproduce from published papers. In this sense, even if you think calling the models "open source" is correct, it doesn't really mean much, because the only parts that matter are not open sourced.
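The "data is source, training is the build script" framing can be sketched in a few lines. A toy example, assuming numpy; a one-weight linear "model" stands in for the LLM, and all names are illustrative.

```python
import numpy as np

# The "source": training data embodying the behavior we want.
xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = 2.0 * xs  # the relationship the "build" should recover

# The "build script": gradient descent compiles data into a weight.
w = 0.0
for _ in range(200):
    grad = np.mean(2.0 * (w * xs - ys) * xs)  # d/dw of mean squared error
    w -= 0.01 * grad

# The "binary artifact": the resulting weight, w ~= 2.0.
```

Releasing only `w` lets anyone run or nudge the artifact, but without `xs`/`ys` you cannot rebuild it, audit what went in, or retarget the "build" - which is the distinction the comment above is drawing.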


Compare/contrast:

DOOM-the-engine is open source (https://github.com/id-Software/DOOM), even though DOOM-the-asset-and-scenario-data is not. While you need a copy of DOOM-the-asset-and-scenario-data to "use DOOM to run DOOM", you are free to build other games using DOOM-the-engine.


I think no one would claim that “Doom” is open source though, if that’s the situation.


That's what op is saying, the engine is GPLv2, but the assets are copyrighted. There's Freedoom though and it's pretty good [0].

[0]: https://freedoom.github.io/


The thing they are pointing at and which is the thing people want is the output of the training engine, not the inputs. This is like someone saying they have an open source kernel, but they only release a compiler and a binary... the kernel code is never released, but the kernel is the only reason anyone even wants the compiler. (For avoidance of anyone being somehow confused: the training code is a compiler which takes training data and outputs model weights.)


The output of the training engine, i.e. the model itself, isn't source code at all though. The best approximation would be considering it obfuscated code, and even then it's a stretch since it is more similar to compressed data.

It sounds like Meta doesn't share source for the training logic. That would be necessary for it to really be open source, you need to be able to recreate and modify the codebase but that has nothing to do with the training data or the trained model.


I didn't claim the output is source code, any more than the kernel is. Are you sure you don't simply agree with me?


> not actually sure if Meta does share all that

Meta shares the code for inference but not for training, so even if we say it can be open-source without the training data, Meta's models are not open-source.

I can appreciate Zuck's enthusiasm for open-source but not his willingness to mislead the larger public about how open they actually are.


https://opensource.org/osd

"The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed."

> In the case of an LLM I would expect that to mean the code run to train the model, the code for the model data structure itself, and the control code for querying the model should all be available

The M in LLM is for "Model".

The code you describe is for an LLM harness, not for an LLM. The code for the LLM is whatever is needed to enable a developer to modify to inputs and then build a modified output LLM (minus standard generally available tools not custom-created for that product).

Training data is one way to provide this. Another way is some sort of semantic model editor for an interpretable model.


I still don't quite follow. If Meta were to provide all code required to train a model (it sounds like they don't), and they provided the code needed to query the model you train to get answers how is that not open source?

> Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed.

This definition actually makes it impossible for any LLM to be considered open source until the interpretability problem is solved. The trained model is functionally obfuscated code, it can't be read or interpreted by a human.

We may be saying the same thing here, I'm not quite sure if you're saying the model must be available or if what is missing is the code to train your own model.


I'm not the person you replied directly to so I can't speak for them, but I did start this thread, and I just wanted to clarify what I meant in my OP, because I see a lot of people misinterpreting what I meant.

I did not mean that LLM training data needs to be released for the model to be open source. It would be a good thing if creators of models did release their training data, and I wouldn't even be opposed to regulation which encourages or even requires that training data be released when models meet certain specifications. I don't even think the bar needs to be high there- We could require or encourage smaller creators to release their training data too and the result would be a net positive when it comes to public understanding of ML models, control over outputs, safety, and probably even capabilities.

Sure, its possible that training data is being used illegally, but I don't think the solution to that is to just have everyone hide that and treat it as an open secret. We should either change the law, or apply it equally.

But that being said, I don't think it has anything to do with whether the model is "open source". Training data simply isn't source code.

I also don't mean that the license that these models are released under is too restrictive to be open source. Though that is also true, and if these models had source code, that would also prevent them from being open source. (Rather, they would be "source available" models)

What I mean is "The trained model is functionally obfuscated code, it can't be read or interpreted by a human." As you point out, it is definitionally impossible for any contemporary LLM to be considered open source. (Except for maybe some very, very small research models?) There's no source code (yet) so there is no source to open.

I think it is okay to acknowledge when something is technically infeasible, and then proceed to not claim to have done that technically infeasible thing. I don't think the best response to that situation is to, instead, use that as justification for muddying the language to such a degree that its no longer useful. And I don't think the distinction is trivial or purely semantic. Using the language of open source in this way is dangerous for two reason.

The first is that it could conceivably make it more challenging for copyleft licenses such as the GPL to protect the works licensed with them. If the "public" no longer treats software with public binaries and without public source code as closed source, then who's to say you can't fork the linux kernel, release the binary, and keep the code behind closed doors? Wouldn't that also be open source?

The second is that I think convincing a significant portion of the open source community that releasing a model's weights is sufficient to open source a model will cause the community to put more focus on distributing and tuning weights, and less time actually figuring out how to construct source code for these models. I suspect that solving interpretability and generating something resembling source code may be necessary to get these models to actually do what we want them to do. As ML models become increasingly integrated into our lives and production processes, and become increasingly sophisticated, the danger created by having models optimized towards something other than what we would actually like them optimized towards increases.


Data is to models what code is to software.


I don't quite agree there. Based on other comments it sounds like Meta doesn't open source the code used to train the model, that would make it not open source in my book.

The trained model doesn't need to be open source though, and frankly I'm not sure what the value there is specifically with regards to OSS. I'm not aware of a solution to interpretability problem, even if the model is shared we can't understand what's in it.

Microsoft ships obfuscated code with Windows builds, but that doesn't make it open source.


Wouldn't the "source code" of the model be closer to the source code of a compiler or the runtime library?

IMO a pre-trained model given with the source code used to train/run it is analogous to a company shipping a compiler and a compiled binary without any of the source, which is why I don't think it's "open source" without the training data.


You really should be able to train a model on whatever data you choose to use though.

Training data isn't source code at all; it's content fed into the ingestion side to train a model. As long as the source for ingesting and training a model is available, which it sounds like isn't the case for Meta, that would be open source as best I understand it.

Said a little differently, I would need to be able to review all code used to generate a model and all code used to query the model for it to be OSS. I don't need Meta's training data or their actual model at all, I can train my own with code that I can fully audit and modify if I choose to.


But surely you wouldn't call it open source if sentry just gave you a binary - and the source code wasn't available.


I suspect that even if you allowed people to take the data, nobody but a FAANG-like organisation could even store it?


My impression is the training data for foundation models isn't that large. It won't fit on your laptop drive, but it will fit comfortably in a few racks of high-density SSDs.


yeah, according to the article [0] about the release of Llama 3.1 405B, it was trained on 15 trillion tokens using 16,000 Nvidia H100s. Even if they did release the training data, I don't think many people would have the number of GPUs required to actually do any real training to create the model....

[0] https://ai.meta.com/blog/meta-llama-3-1/


And a token is just an index into a restricted dictionary. GPT-2 was said to have 50k distinct tokens, so I think it's safe to assume even the latest models are well below 4M tokens, hence at most 4 bytes per token id. 15 trillion tokens at 4 bytes/token gives a training input of at most 60 TB, which doesn't sound that large.

It's the computation that is costly.
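The storage estimate above as a quick sanity check (an upper bound, assuming 4-byte token ids; the 15 trillion figure is from the Llama 3.1 post linked earlier):

```python
tokens = 15e12       # 15 trillion training tokens
bytes_per_token = 4  # 32-bit ids cover any vocabulary under ~4B entries
total_tb = tokens * bytes_per_token / 1e12
print(total_tb)      # 60.0 (TB)
```

A few racks of high-density SSDs, as noted above; the prohibitive part is the GPU-hours, not the bytes.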



