
The author misses the issue that all the decent LLMs released so far have been created by companies rather than open-source communities.

Sure, a lot of models have been released under permissive licenses, but that's like releasing shareware.

The special sauce for making _new_ LLMs is the dataset and the training clusters, neither of which is cheap or easy for a community to run or finance.



More concretely, an ML model is only "open source" if both the training code and the training data are freely available.

Otherwise, it's not possible for the community to reproduce the model.


That's a bit like saying FOSS is open source only if a copy of the programmers is supplied with the code.

Everything that has a source has another source that produced it.

The algorithms behind creating LLMs are all in published papers for anyone to read, the libraries (like TensorFlow) are themselves FOSS projects, and the data... is the open web for the most part.

The Wikipedia dump alone is more than enough to get a very decent LLM shaped up.

How an LLM is produced IS NO SECRET. It's just that producing one takes millions (or, for the more sophisticated ones, billions) in data center fees / power / GPUs to train the model. So even if the training scripts were included, you still couldn't make a Llama model yourself at home.
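To illustrate: the recipe itself fits in a screenful of PyTorch. A toy sketch (the "corpus.txt" path is a placeholder for whatever text dump you have; what separates this from Llama is scale, not secrets):

    import torch
    import torch.nn as nn

    text = open("corpus.txt").read()            # placeholder: any text dump
    vocab = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(vocab)}
    data = torch.tensor([stoi[ch] for ch in text])

    class TinyLM(nn.Module):
        def __init__(self, vocab_size, dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, dim)
            layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            self.blocks = nn.TransformerEncoder(layer, num_layers=2)
            self.head = nn.Linear(dim, vocab_size)

        def forward(self, x):
            # causal mask: each position attends only to earlier ones
            mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
            return self.head(self.blocks(self.embed(x), mask=mask))

    model = TinyLM(len(vocab))
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    ctx = 64
    for step in range(1000):
        i = torch.randint(0, len(data) - ctx - 1, (32,))
        x = torch.stack([data[j:j + ctx] for j in i])
        y = torch.stack([data[j + 1:j + ctx + 1] for j in i])
        loss = nn.functional.cross_entropy(model(x).flatten(0, 1), y.flatten())
        opt.zero_grad(); loss.backward(); opt.step()

Scaling that loop to billions of parameters and trillions of tokens is where the millions go.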


> That's a bit like saying FOSS is open source only if a copy of the programmers is applied with the code.

Wait, what? You can build the FOSS app entirely from the source. You don’t need a copy of the programmer.


Just as you say, you do not need Llama's training data in order to build Llama from source, nor even to fine tune it. You don't need a copy of the training data.

(edit to spell Llama correctly)


You can fine-tune it, which gets you far, don't get me wrong. But you cannot study how major changes to Llama's architecture, or changes to the training process, behave when you apply them from scratch.
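For reference, the fine-tuning side really is accessible without any training data. A rough sketch using Hugging Face transformers + peft (the checkpoint name and the one-line "dataset" are placeholders):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    name = "meta-llama/Llama-2-7b-hf"        # placeholder checkpoint
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)

    # LoRA trains small adapter matrices on top of the frozen released
    # weights, so the original training corpus never enters the picture.
    model = get_peft_model(model, LoraConfig(
        r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"]))

    batch = tok("Your own data goes here.", return_tensors="pt")
    out = model(**batch, labels=batch["input_ids"])
    out.loss.backward()                      # a gradient step on *your* data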


No, you need Llama's binary; you can't know what it's trained on, because that's a secret, or at least obfuscated.

The current LLM situation is more akin to open-source plugins for some proprietary system, like game modding.


The reasoning is that the model weights are a lot more like a build artifact that can't be easily scrutinized. Demanding training data to be public is analogous to demanding source code. Sources like Wikipedia are just one kind of input.


And my reasoning was that everything is an artifact of something else.

The model weights can't be scrutinized? Well, no one can scrutinize them, not even those who made the model.

And because you don't have the hardware to reproduce Llama anyway, even if they gave you the code to build it, you can't verify they used that code to build it. And if you have the hardware and data, you probably still wouldn't spend MILLIONS to end up... with the same exact model they gave you in the first place.

Do you understand how meaningless this entire "Llama is not truly open" bullshit is?


Your reasoning doesn't tackle the issue at hand, so no, I can't see how it's "meaningless", because it does actually matter. It doesn't matter that the people who produced the model can't scrutinize the results, because they know what went into it and we don't. Likewise, it's not very common for corporations that ship binary firmware blobs to scrutinize them in depth, since they have the upstream source/data that went into making them. They don't need to, because that's not the point.

No, you wouldn't end up with the exact same model weights if you trained it yourself on the same data, but you could compare how the models perform when given the same prompts, and try to identify anomalous variation that could point you towards Facebook misrepresenting what went into the training / fine-tuning.
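Concretely, that black-box comparison could look something like this (both checkpoint names are placeholders, and it assumes the retrain shares the release's tokenizer):

    import torch
    import torch.nn.functional as F
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def next_token_logprobs(name, prompt):
        tok = AutoTokenizer.from_pretrained(name)
        model = AutoModelForCausalLM.from_pretrained(name)
        ids = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            return F.log_softmax(model(**ids).logits[0, -1], dim=-1)

    p = next_token_logprobs("facebook/their-release", "The sky is")
    q = next_token_logprobs("yourlab/your-retrain", "The sky is")
    # Large divergence across many prompts is the anomaly to chase.
    print(F.kl_div(q, p, log_target=True, reduction="sum").item())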


Sure you can end up with the same weights. Just start with the same random weights (put it in the source!), have a deterministic algorithm for training (put it in the source!), and put in all the same copyrighted data, which Facebook can't just... put in the source, because they don't have the copyright.

Do I need to keep going so you see how stupid the discussion is?

You think like a software developer. Everything is just a bunch of text, a compiler, and a build environment. It's not like that here.

There's too much shit going in, and the rights to the data are absolutely not clear. But if they had to be, the model wouldn't exist. What you want is friggin' impossible.
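To be fair, the determinism half really is trivial to "put in the source"; it's the data half that can't be. A single-process PyTorch sketch (bitwise reproducibility across a real GPU cluster is a much harder problem):

    import torch

    torch.manual_seed(1234)                   # same initial random weights...
    torch.use_deterministic_algorithms(True)  # ...same ops, same order

    model = torch.nn.Linear(16, 16)           # identical init on every run
    x = torch.randn(4, 16)
    model(x).sum().backward()
    print(model.weight.grad.sum())            # same value on every run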


I think open source advocates bring some understandable but wrong intuitions to how LLMs are being distributed.

In the world of hand-coded software, binaries are hard to work with, source is easy to work with, and compilation is cheap.

In the analogous world of LLM training, model weights are easy to work with, having training data does not let you reliably change model behavior, and "compilation" (training) is insanely expensive.

So, if your goal is agency to create tools for your own purposes, 9/10 researchers would rather work from a trained foundation model than the source data. The foundation models are of course released by companies because they cost $10s of millions to train -- but releasing them enables a thriving community of research, building adjacent frameworks, and specialized models to be created by much less powerful actors.
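That agency is concrete: with released weights, a local model you fully control is a few lines away. A sketch (the checkpoint name is just an example of a permissively licensed release):

    from transformers import pipeline

    generate = pipeline("text-generation", model="mistralai/Mistral-7B-v0.1")
    out = generate("Open weights let researchers", max_new_tokens=40)
    print(out[0]["generated_text"])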

I've never understood the dogmatism around FOSS, but I feel I understand its ideals. Those ideals are far better served by releasing weights than by leaving LLMs available only through commercial APIs.


(Author here.) I am actually more focused on the training corpus, the tooling around foundational models, and the possibility of running these models on resources you control. An analogy would be the CPU and the open-source software running on top of it. I know people want the entire toolchain to be open source, but I think the real value lies in that "top layer".


The first incarnations of Unix were paid for by corporations and universities.


Thus GNU... And Linux.

20+ years later, look who is still around.


And Linux is mostly paid for by corporations, directly and indirectly.


It wasn't at the start, and it certainly isn't the cathedral of one large company, like Facebook and OpenAI/Microsoft.


And you had the source, documentation, and all the bits needed to build it from scratch.

Llama is a binary blob. Very capable, and it's great that it got leaked. But it's not a win for open source; it's an accident of licensing. FB's lawyers would never have allowed a proper open-source license: the PR and IP risks were way too high.

It just so happens that the leak meant the PR risk went away, and kneecapping OpenAI is a good thing for Meta, and thus worth the IP risk.


Yes and no. The models themselves (with permissive licenses) might be like freeware, rather than "open source" or "free software". But they are a big enabler, allowing people to build other F/OSS on top -- e.g. an experimental Gnome/KDE addon that allows voice control is now a real possibility. Most practical FOSS (other than recalcitrant parts of the GNU community) already builds on top of all kinds of proprietary tools, from computer hardware to closed operating systems like macOS/Windows. LLMs are just the latest in the list of blobs -- developers & users can evaluate whether the power/opacity tradeoff is acceptable.


I don't think the comparison applies. Shareware can't really be modified in any way. However, a foundational model is meant to be modified, fine-tuned, augmented, etc...

I agree calling it fully open source is a stretch but it's not the same as shareware.


Has anyone tried to take a distributed training framework and make it _really_ distributed?


Eh, I don't know; that's perhaps like giving credit to HP and Dell and perhaps Apple for Linux?



