Hacker News

If I put a section in my LICENSE.txt prohibiting use as training data in commercial models, would that be sufficient to keep my code out of models like this?


In the end this would slightly increase likelihood of such sections appearing in licenses generated by AIs.


> If I put a section in my LICENSE.txt prohibiting use as training data in commercial models, would that be sufficient to keep my code out of models like this?

Neither in practice (because it doesn't look for it) nor, legally, in the US, if Microsoft's contention that such use is "fair use" under US copyright law holds up.

That “fair use” is an Americanism and not a general feature of copyright law might create some interesting international wrinkles, though.


Their contention is

> Why was GitHub Copilot trained on data from publicly available sources?

> Training machine learning models on publicly available data is now common practice across the machine learning community. The models gain insight and accuracy from the public collective intelligence. But this is a new space, and we are keen to engage in a discussion with developers on these topics and lead the industry in setting appropriate standards for training AI models.

Personally, I'd prefer this to be like any other software license. If you want to use my IP for training, you need a license. If I use MIT license or something that lets you use my code however you want, then have at it. If I don't, then you can't just use it because it's public.

Then you'd see a lot more open models. Like a GPL model whose code and weights must be shared because the bulk of the easily accessible training data says it has to be open, or something like that.

I realize, however, that I'm in the minority of the ML community feeling this way, and that it certainly is standard practice to just use data wherever you can get it.


When I referenced their contention on fair use, I wasn't referring to that FAQ answer, but to GitHub CEO Nat Friedman's comment in this thread that "In general: (1) training ML systems on public data is fair use".

https://news.ycombinator.com/item?id=27678354


> however you want

I don't see any attribution here.

MIT may say "substantial portions" but BSD just says "must retain".


It would be interesting if someone uploaded a leaked copy of the NT kernel and then coerced the system into regurgitating it piece by piece.

Would MS's position then be different?


Don't make your code public. Someone could read it and train the model in their brain to synthesize some code based on it.

If it's publicly available, then it's fair game to use it to learn from and base ideas on.


Only if they trained a model to be able to read and understand LICENSE.txt files -- wowzers what a monster improvement that would be for the world

Or, I guess a sentinel phrase that the scraper could explicitly check: `github-copilot-optout: true`
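A minimal sketch of what that scraper-side check might look like, assuming the hypothetical `github-copilot-optout: true` marker above (this is not an actual GitHub feature; the file names scanned and the marker format are illustrative):

```python
import re
from pathlib import Path

# Hypothetical opt-out sentinel from the comment above -- not a real standard.
OPTOUT_PATTERN = re.compile(r"github-copilot-optout:\s*true", re.IGNORECASE)

def repo_opts_out(repo_root: str) -> bool:
    """Return True if any top-level LICENSE/README file carries the marker."""
    for path in Path(repo_root).glob("*"):
        if path.is_file() and path.name.upper().startswith(("LICENSE", "README")):
            try:
                if OPTOUT_PATTERN.search(path.read_text(errors="ignore")):
                    return True
            except OSError:
                continue  # unreadable file; skip it
    return False
```

A crawler would call `repo_opts_out()` before adding a repository to the training corpus and skip any repo that returns True.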


Or it could explicitly check for known standard licenses that permit such use, if training were opt-in instead of opt-out, the way almost everything else in software licensing works: others may use your work only if you've opted in by granting a license.



