
But it actually is explicitly copying the text. That's how it works. The training data are massive, and you will get long strings of code that are pulled directly from that training data. It isn't giving you just the style. It may be mashing together several different code examples, taking some text from each. That's called a "derivative work".


No, that's not how it works.

"[...] the vast majority of the code that it suggests is uniquely generated and has never been seen before. We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set"

https://copilot.github.com/#faqs


If that's the case (only 0.1%), the developers must have done something differently from the other OpenAI code-suggestion experiments I recall seeing, where significant chunks of code from Stack Overflow or similar sites were appearing in the answers.


So you're gambling on whether the code was generated or copied.


No you aren't. Courts will consider it fair use.


How are you going to prove it was the AI that generated the GPL-licensed function verbatim from another project, rather than you just opening that project and copying the function yourself?


I will not. Courts will simply consider a single function not to be a substantial enough piece of work to constitute infringement.


use a bloom filter to skip/regenerate that 0.1%
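The idea above could be sketched like this: index short windows of the training corpus in a bloom filter, then flag any suggestion whose windows hit the filter so it can be skipped or regenerated. Everything here (window size, filter parameters, function names) is illustrative, not anything Copilot actually does:

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter over byte strings. Sizes are illustrative."""
    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: bytes):
        # Derive num_hashes independent positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(i.to_bytes(4, "big") + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes):
        # May return a false positive, never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def looks_verbatim(snippet: str, training_filter: BloomFilter, n: int = 6) -> bool:
    """Flag a suggestion if any n-line window (whitespace-normalized)
    appears in the training-set filter."""
    lines = [ln.strip() for ln in snippet.splitlines() if ln.strip()]
    windows = ["\n".join(lines[i:i + n]) for i in range(len(lines) - n + 1)]
    return any(w.encode() in training_filter for w in windows)
```

A filter like this only errs toward false positives, so it would occasionally regenerate original code but never let an indexed window through silently; whether the 0.1% figure makes the index practical at Copilot's training-set scale is a separate question.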


Synthesising material from various sources isn't copyright infringement, that's called writing.

It's only infringement if the portion copied is significant either absolutely or relatively. A line here or there of the millions in the Linux kernel is okay. A couple of lines of a haiku is not. Copyright is not leprosy.


Google Books actually displays full pages of copyrighted works Google did not license. It was considered legal.

[1] https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....


We all don't have Google resources. What if someone comes after us individually because some model-generated code is near identical to code in a GPL codebase? Where is the liability here?

edit: from https://copilot.github.com/

> What is my responsibility when I accept GitHub Copilot suggestions?

> You are responsible for the content you create with the assistance of GitHub Copilot. We recommend that you carefully test, review, and vet the code, as you would with any code you write yourself.

Well, that solves that question.


We are all vulnerable to predatory lawyer trolls, whether we do things correctly or not. If you are accused of reusing GPL code, you ask for clarification about which code and rewrite it. It is likely to be just a snippet; I doubt Copilot would write a whole library by copying it from another project.

And yes, of course github is not going to take responsibility for things you do with their tools.


If you learn programming from Stack Overflow and GitHub, and then repeat something that you learned over your time reading, that's not just copying text. That's having learned the text. You could say the human brain is mashing together several different code examples, taking some text from each.


hmm, so let's think this through.

Wouldn't that imply that a person who learned to code on GPLv2 sources and then writes some more code in that style (including "long strings of code", some of which are clearly not unique to GPL) is writing code that is "born GPLv2"?

I don't think it currently works that way.



