
Why is it so outlandish to expect the people who make money by selling AI systems to only train them using material for which they have a license?

As many commenters have pointed out, no one would have a problem had Microsoft trained Copilot on the Windows source code. The fact that they intentionally left it out of the training set is a huge red flag.



Because AI systems require large amounts of training data, the more the better, and requiring manual review of those datasets to ensure compliance with copyright would consume significant resources and slow down the pace of innovation across the entire AI industry.

Now let me flip that question around on you: What benefit would society gain from forcing AI developers to do all that extra work?


If you are going to use my work for free and without attribution and turn it around to compete with me, then it decreases my incentive to produce anything, and if I do produce something, it decreases my incentive to publish it. This goes directly against the intentions behind copyright law.


That's the best argument I've heard so far, but it still doesn't make sense to me. It's not like your individual project is going to make any significant difference to the capabilities of the resulting AI that's "competing with you" one way or the other. So really all you'd be doing by not releasing your code is shooting yourself in the foot for no gain.

Granted, people are not necessarily rational actors, so maybe you could argue it still makes sense to have some protections in place to assuage people's irrational fears. Maybe like some kind of robots.txt for determining whether a page can be used in an AI dataset could serve that purpose. I'd be hesitant to support anything more burdensome than that.
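
To make the robots.txt idea concrete, here's a rough sketch of the check a dataset builder could run before ingesting a page. It uses Python's standard robots.txt parser. The crawler name `ExampleAITrainingBot` is made up for illustration, and nothing here is an existing standard for AI datasets; it just reuses the ordinary robots.txt rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user agent for an AI dataset crawler (an assumption, not a real token).
AI_AGENT = "ExampleAITrainingBot"


def may_train_on(robots_txt: str, url: str, agent: str = AI_AGENT) -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Example policy: keep the AI crawler out of /private/, allow everyone else everywhere.
rules = """\
User-agent: ExampleAITrainingBot
Disallow: /private/

User-agent: *
Allow: /
"""

print(may_train_on(rules, "https://example.com/private/code.py"))
print(may_train_on(rules, "https://example.com/public/code.py"))
```

The point is that the opt-out would cost a site owner a couple of lines in a file they probably already have, and the check costs the crawler almost nothing.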


The benefit is that our collective genius isn’t mined by mega corps and rented back to us. That we exist as more than mindless resources to be tapped for profit.

Again, if (for argument’s sake) we want to maximize the effectiveness of the AI, why are we okay with Microsoft intentionally omitting one of the most important codebases in human history — which it unambiguously has the right to use — from its training set?


> The benefit is that our collective genius isn’t mined by mega corps and rented back to us.

That sounds like a downside to me, not a benefit. You're basically arguing it would be better if Copilot, Stable Diffusion, GPT-3, etc. (which all included copyrighted works in their training sets) didn't exist. I'm just not seeing that.


They are only using material for which they have a license (at least debatably). Open source software licenses usually require attribution if you reproduce the source code or use the source code in a program.

Some other uses are allowed without attribution. Someone can read and learn from open source software without needing to put an attribution anywhere. You could run an analysis of the code on GitHub to find out what percent of code is written in C++. You wouldn't need to attribute every project on GitHub.

Now the debate is whether this applies to training ML models.



