Hacker News

If I put a section in my LICENSE.txt prohibiting use as training data in commercial models, would that be sufficient to keep my code out of models like this?


In the end this would slightly increase likelihood of such sections appearing in licenses generated by AIs.


> If I put a section in my LICENSE.txt prohibiting use as training data in commercial models, would that be sufficient to keep my code out of models like this?

Neither in practice (because it doesn't look for it) nor, legally, in the US, if Microsoft's contention that such use is "fair use" under US copyright law holds up.

That “fair use” is an Americanism and not a general feature of copyright law might create some interesting international wrinkles, though.


Their contention is

> Why was GitHub Copilot trained on data from publicly available sources?

> Training machine learning models on publicly available data is now common practice across the machine learning community. The models gain insight and accuracy from the public collective intelligence. But this is a new space, and we are keen to engage in a discussion with developers on these topics and lead the industry in setting appropriate standards for training AI models.

Personally, I'd prefer this to be like any other software license. If you want to use my IP for training, you need a license. If I use MIT license or something that lets you use my code however you want, then have at it. If I don't, then you can't just use it because it's public.

Then you'd see a lot more open models. Like a GPL model whose code and weights must be shared because the bulk of the easily accessible training data says it has to be open, or something like that.

I realize, however, that I'm in the minority of the ML community feeling this way, and that it certainly is standard practice to just use data wherever you can get it.


When I referenced their contention on fair use, I wasn't referring to that FAQ answer, but to GitHub CEO Nat Friedman's comment in this thread that "In general: (1) training ML systems on public data is fair use".

https://news.ycombinator.com/item?id=27678354


> however you want

I don't see any attribution here.

MIT may say "substantial portions" but BSD just says "must retain".


It would be interesting if someone uploaded a leaked copy of the NT kernel and then coerced the system into regurgitating it piece by piece.

Would MS's position then be different?


Don't make your code public. Someone could read it and train the model in their brain to synthesize some code based on it.

If it's publicly available, then it's fair game to use it to learn from and base ideas on.


Only if they trained a model to be able to read and understand LICENSE.txt files -- wowzers what a monster improvement that would be for the world

Or, I guess a sentinel phrase that the scraper could explicitly check: `github-copilot-optout: true`
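A minimal sketch of what that scraper-side check might look like, assuming the hypothetical `github-copilot-optout: true` marker above (this is not an actual GitHub feature; the file names scanned and the marker format are illustrative):

```python
import re
from pathlib import Path

# Hypothetical opt-out sentinel from the comment above -- not a real standard.
OPTOUT_PATTERN = re.compile(r"github-copilot-optout:\s*true", re.IGNORECASE)

def repo_opts_out(repo_root: str) -> bool:
    """Return True if any top-level LICENSE/README file carries the marker."""
    for path in Path(repo_root).glob("*"):
        if path.is_file() and path.name.upper().startswith(("LICENSE", "README")):
            try:
                if OPTOUT_PATTERN.search(path.read_text(errors="ignore")):
                    return True
            except OSError:
                continue  # unreadable file; skip it
    return False
```

A crawler would call `repo_opts_out()` before adding a repository to the training corpus and skip any repo that returns True.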


Or it could explicitly check for known standard licenses that permit such use, if training were opt-in instead of opt-out, the way almost everything else in software licensing works: others may use your work only if you've opted in by granting a license.



