Ask HN: Any indications Copilot scans your local files?
189 points by polycaster on Nov 8, 2021 | 83 comments
A client of mine is just about to launch a startup. There's nothing public on the web yet. Today, while hacking a very minimal prototype of an HTML page in VSCode, I got a strange suggestion from GH Copilot. When I entered the client's name, Copilot prompted me with a ready-made <article> block containing a marketing claim about their company. So far, so good.

The strange thing is that the claim is not public: it is not contained in the codebase I'm currently working on, nor in any codebase I know of, and to my knowledge I'm the only dev on the project using Copilot. It's also not listed on Google, so I assume it hasn't leaked somewhere else that could be indexed by GH (which is an assumption of course, but appears likely). It can, however, be found in a completely separate local folder with project assets that is not published on GH.

The marketing claim is about the length of a tweet and is not exactly generic. It requires understanding of my client's business (which, again, cannot be derived from my codebase). So it's not GPT-3 output that matches coincidentally.

The GH "About GitHub Copilot Telemetry" page [1] does not indicate that your local file system is scanned, though.

Can anyone explain this, or has anyone observed a similar phenomenon?

[1] https://docs.github.com/en/github/copilot/about-github-copilot-telemetry



Yes, other files open in your IDE may also be scanned.

From the terms of service [1] (which I'm sure everyone reads):

> when you edit files with the GitHub Copilot extension/plugin enabled, file content snippets [...] will be shared with GitHub, Microsoft, and OpenAI, and used for diagnostic purposes to improve suggestions and related products. GitHub Copilot relies on file content for context, both in the file you are editing and potentially other files open in the same IDE instance.

[1] https://docs.github.com/en/github/copilot/github-copilot-tel...


Thanks for the pointer but I don't think this is what happened in my case. I never opened the marketing assets in VSCode. They reside in completely separate folders.


Are you using Windows? IIRC Microsoft collects a hefty amount of data from your filesystem.


The telemetry data does not (and should not) contain the file contents.


IIRC that collection is optional


According to Microsoft's documentation [1], it's unintentional. Presumably because it's picked up from whatever's in memory, rather than being collected intentionally.

>which may unintentionally contain user content, such as parts of a file you were using when the problem occurred

[1] https://docs.microsoft.com/en-us/windows/privacy/configure-w...


Another reason why turning off automatic bug reporting is a good idea.


Good point again, but no: macOS.


If VSCode scanned your files to feed Copilot, it's not just Microsoft who could access them, but anyone using Copilot. Every single file on every single machine that ever ran VSCode would potentially be within anyone's grasp.

Allowing this to happen would've been incredibly stupid. But it's worth investigating further.


> Every single file on every single machine that ever ran VSCode would potentially be within anyone's grasp.

Even if they were scanning all the files, Copilot's terms [1] explicitly say they do not use the data to provide suggestions for other users.

> GitHub Copilot does not use these URLs, file paths, or snippets collected in your telemetry as suggestions for other users of GitHub Copilot. This information is treated as confidential information and accessed on a need-to-know basis.

[1] https://docs.github.com/en/github/copilot/github-copilot-tel...


Last time I checked, it was not implemented yet, though. Maybe they are starting to deploy a version that uses the other open files for context, but mine is obviously not doing that yet.


I'll second the plausible suggestion that perhaps the name/marketing copy combo isn't as unpredictable as you might think. Corporate speak and company names are pretty formulaic. Try running the name and <article> through GPT-3 and see what happens (or GPT-2 here: https://bellard.org/textsynth/)

E.g., I just prompted GPT-2 with a made up company name that doesn't have any google search results and got a completion like this:

<p><em>Fully featured webapp for social &amp; mobile networks in the cloud</em></p>


"The choice is up to you. Will you live your life in a way that’s enjoyable and worth living? Or will you drag out the most mundane and ordinary day of your life and hope that you’ll die some day? [snip a few paragraphs]

Kenji.click is a participant in the Amazon Services LLC Associates Program, an affiliate advertising program designed to provide a means for sites to earn advertising fees by advertising and linking to amazon.com."

It's slightly eerie how good that is (GPT-J 6B). I generated about ten paragraphs, and other than it being a bit rambly, there was basically nothing that would make me question whether it was written by a human. Capping it off with the Amazon Associates disclaimer somehow helped complete the illusion.

I did also seemingly walk in on a bunch of AIs arguing:

> I just don’t believe it.

> It’s gotta be them.

> There’s not enough evidence.

> There’s just not enough evidence.

> It’s gotta be them.

> It’s just too far-fetched.

> There’s just not enough evidence


Have you considered that your marketing blurb is actually not that novel after all? GPT-3 is damn convincing these days :)


This is a joke, but it may be the correct answer: if the company name is self-demonstrating, it's possible Codex could recognize that.


I don't think so. The phrase is 272 characters long and contains very specific terms which cannot be derived from the company's name. Also, it's a 1:1 match with the unpublished marketing material.


Maybe it's the other way around: your company's unpublished marketing material was co-written by an AI?


Actually, that's the best idea so far, apart from the suspicion of local FS scans. I'll check with marketing.

They are not exactly the type of crowd that would subscribe to the Copilot beta, though. But as Copilot is based on the GPT-3 model, perhaps a marketing tool built on the same basis exists.

This all is somewhere in the small space between stupid and exciting.


The entire point of GPT-3 is to create novel content that is similar to existing content. So the fact that Copilot suggested it doesn't mean it's not novel as a whole.


Well, this could be a huge security issue. It could lead to a new form of Copilot-surfing for company secrets, since Copilot is already leaking secret API keys and copyrighted code.

The dangers of just regurgitating what has been read are unreal, since with good enough targeting you can read the data someone else wrote and expected to be anonymized. It's like a huge global RAM of code; you just need to figure out how to get it to point at the right addresses.


Hacking has never been easier. Just type "username: thecupisblue, password:" and wait for autocomplete. :D


Wow, I did and it turned up "username: thecupisblue, password: ********", holy shit, changing all my passwords now.

EDIT: Incredible, it seems HN obscures your password with * signs when you post it. Phew, that was close.


Assuming you're on Windows, you can see all of a process's IO using a tool like Process Monitor: https://docs.microsoft.com/en-us/sysinternals/downloads/proc...

FYI: It's a firehose, but you should be able to filter it down to Copilot and a path prefix. Then you'll know if the files are being scanned.
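It can also run headless and log straight to a file (the flags below are documented Sysinternals options; the log file name is just an example):

```
:: capture everything to a backing file with no GUI prompts; open the
:: resulting .pml afterwards and filter on the Copilot process and the
:: path prefix of the folder you care about
Procmon.exe /AcceptEula /Quiet /Minimized /BackingFile copilot-trace.pml
```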


They said they're on macOS, in which case they can use fs_usage from the command line.
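Something along these lines should do it (a rough sketch; it needs root, and "marketing-assets" is a stand-in for the OP's actual assets folder):

```
# stream file-system calls system-wide and watch for any process
# touching the assets folder
sudo fs_usage -w -f filesys | grep -i "marketing-assets"
```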


I noticed a very peculiar thing with copilot.

I was writing a Twitter API wrapper and had hardcoded an access token for a test user. When I was testing a function that requires a user id, Copilot suggested one. I searched the id and it belonged to my test user. I searched my entire codebase but couldn't find any place I had used that particular id. The only place it could have extracted it from would have been the access token, which has the user id as part of the string (which I hadn't noticed before this). Either this is a common code pattern or I don't know how to process this.


I've seen something similar with Copilot and market data.

I was creating a unit test in a Go codebase and I had dumped the JSON that I was going to be decoding at the top of the file, and when I started writing the assertions, Copilot was very quick to use the data from the JSON, with quite high accuracy, based solely on me typing which ticker I was going to assert against.


I have seen that as well and it is impressive. But this is more than that. The code contained a string variable named accessToken with a value like "218276172612672-jash127hg27128h" (random data here, not the actual id), where 218276172612672 was the user id. When testing a function that required the user id, it not only suggested 218276172612672 but also did it with the full context: participant.follow({twitterUserId:"218276172612672"})


Seems reasonable. What's your question? GPT-3 is just that good.


Seems totally reasonable to me too. It has probably just seen the pattern:

```
str = "{{SOME_ID_HERE}}-jash127hg27128h"

participant.follow({twitterUserId:"{{SOME_ID_HERE}}"})
```


This doesn't seem likely. No one would be generating a token this way, as the access token is issued after OAuth, and I am unaware of any method to get the second half of the token without the first half. And the same response that contains the access token passes the user id as well, so there is no need to extract it from the token.


hm. how many large integer literals are there in your code? it could just be learning that user ids are long strings of digits and is making a guess as to which long string of digits (based on some context, like sharing a line with "id" in it) might be the right one...


But that's the thing: if it had taken the whole string it would be fine, but it extracted exactly the right substring. There were 5 others in the same file, none of which I would have been able to distinguish from each other without context.


think of it this way, in the entire corpus of github, how often do you think that there are numeric identifiers that appear near terms like "id" where the numeric part is then used elsewhere with terms like "id" or terms that are frequently found near terms like "id"?

don't get me wrong, it's cool, but these models operate on a character by character basis with sequence context. if they can learn things like matching pairs of parens and quotes in certain contexts, it seems they could certainly learn things like extracting long strings of digits.

now what would be cool would be if they could generate regular expressions for the rules they're learning.


are you sure that "big number that starts with 2" wasn't just the greatest 32 bit 2s complement signed integer, which is often used for sentinel/testing values?


No, it was an OAuth access token issued for a test user I created. Nothing special about the token.


If it's always in the same location inside the API key, it seems very reasonable that it would pick up that pattern, as most projects that include a hardcoded key would include it the same way. API key variables are probably named almost the same everywhere.


This is my guess as well. But I doubt anyone would be writing their code like "${userId}-${otherPart}", so Copilot itself noticed that in all codebases with hardcoded Twitter keys (which shouldn't be common on GitHub, I think) there is a partial match, and that this is useful information (given that in such a large corpus of all public GitHub code, partial matches would be quite common and the signal-to-noise ratio should be low). Whatever the case, I started as a skeptic thinking this is a pure gimmick, and now each day I am impressed by something new that Copilot can do.


If it's in the same position in terms of character offset, it might not make a difference to the NN. I'm impressed as well.


What was the user id? testuser1? Or something more convoluted?


The Twitter user id for a user I created (so nothing popular either). A 64-bit unsigned integer as a string.


Do the following experiment:

1) Put a new blurb in the same folder where you had the original marketing blurb. Make sure it is unique but still looks like English. Example: now your startup allows you to fly to the moon in rockets powered by angry tweets.

2) Try to force a reload of copilot. Maybe reinstall it?

3) Recreate the conditions that produced the first blurb, this time trying to elicit the second blurb.

4) Share the results, I am curious.

If you get that suggestion, you have proof and reproducible steps for others to get proof. If you don't get that suggestion, we can't be sure, but the odds of it being just a coincidence increase.

Good luck!


I’m assuming you meant “blurb” instead of “blur”?


Thanks, corrected. While we're at it: is it "make the following experiment" or "do the following experiment"?


Yes, "do the following experiment" sounds more natural. IMO, "Conduct" conveys precision and is appropriate for more formal writing.


Perform, conduct, execute the experiment, etc. You would generally set up an experiment after making a hypothesis. You could make up an experiment and then execute it, I suppose. It's a noun and a verb, so you could experiment with making experiments, but "making a following experiment" would be making a made thing, so I don't think that would be correct.


I'd say: run the following experiment.


Conduct the following experiment


I think it also reads your clipboard. Yesterday, I had copied something from stackoverflow, and was about to paste it, and it gave me a suggestion before I could even drop it in.



Are you sure it didn't just know the URL? It's public after all. I feel reading the clipboard would be a step too far, but then again, scanning local files outside the project's context would be crazy, too.


Nope, I had a clipboard copy of something along the lines of:

```
Buffer.from(string, "base64")
```

And as soon as I started typing, that was the suggestion, morphed a little to fit my code better. No way to know for sure, but I was thinking it came from the clipboard.


How exactly does Copilot not open Microsoft up to significant legal liability, when it has been demonstrated that copilot will regurgitate entire blocks of scanned code?


It's unclear whether MS would be directly liable for copyright infringement; the Aereo case[0] perhaps expanded a device-maker's direct liability for user actions. MS would likely escape secondary copyright liability because the verbatim suggestions are really rare, and the standard under the Sony VCR case[1] is whether the device is "capable of substantial non-infringing uses."

EDIT: The user would perhaps be the direct infringer as the person who made a copy of the code, even unwittingly.

Source: current law student taking copyright and writing a paper about Copilot.

[0] https://en.wikipedia.org/wiki/American_Broadcasting_Cos.,_In....

[1] https://en.wikipedia.org/wiki/Sony_Corp._of_America_v._Unive....


Fascinating, thank you! I imagine this would make it even more likely that large orgs would ban the use of VS Code.


Ya, I'd be curious to hear how orgs are thinking about it. It could be that the risk of a copyright judgment against the org is outweighed by the efficiency gains from Copilot. The risk involved has an element of discoverability – how the heck is the OSS developer whose small snippet of code gets sucked into a closed-source project going to learn of the infringement?

Other stuff within copyright makes infringement of a small Copilot snippet less likely too, and the recent Google v. Oracle decision reinforced the fact that code is an awkward fit within copyright and may sometimes get less protection than things like fictional books. Fair use comes in here, too; remember that something as "egregious" as Google copying 37 Java packages' worth of package / class / method declarations into Android was held a fair use. Declaring code is sort of special because it's more like an interface, but the Court really reinforced the notion that it likes cool new technology that opens new markets, makes new products, etc., and a fair use finding for small snippets of code is in line with adapting copyright doctrine in the face of tech changes like AI / ML.


I might add that, as in my case, it doesn't have to be an entire block of code. I think most people have the idea that Copilot just alters the content enough that it is basically doing the same thing without being a literal copy of the original, perhaps adjusting to the project at hand.

Part of the problem is actually much simpler: copyrighted text can be cited without permission, or product names can become public before release, revealing undisclosed information to competitors.


Alex Graveley, the Chief Architect at GitHub on Copilot, says pretty definitively here that it does not look at anything outside your project: https://twitter.com/orph/status/1457790239796199424

So I'm thinking either it made a very good guess, or the assets got included in your project without you realizing?
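One way to check the latter, assuming the project is a git repo (the quoted phrase is a placeholder for a distinctive fragment of the blurb):

```
# did the phrase ever enter the repo's history on any branch?
git log --all --oneline -S "distinctive fragment of the blurb"

# is it anywhere in the working tree, tracked or not?
grep -rn "distinctive fragment of the blurb" .
```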


And a follow-up:

> We do try to assemble prompts based on contextual understanding, e.g. if you explicitly import sources by path from some other location Copilot might include it. But not for html (yet), as is being described here.

https://twitter.com/orph/status/1457796816196431872


It's crazy to me that amongst the many things I need to create legal blankets over when hiring developers, I now need to worry about what IDE they are using because Copilot ostensibly has access to their (our) source code, files, and other proprietary information.


Not just which IDE, but the whole development environment. If you're worried about what Copilot (Microsoft) can do with your data, how could you trust Windows machines? Even people using Emacs on Linux may be running some Microsoft assistant like LSP servers.


Surely this could be verified in a VM?

I have Copilot enabled in a single workspace and tried some unique keywords from other projects (where Copilot is not enabled) and it could not generate anything similar.


I don't want to cast any aspersions but could this indicate that the marketing copy was "reused" from a public source?

Or perhaps the writer used their same phrases for two clients?


Unlikely. The claim (which I'm sorry I cannot post here for obvious reasons) is rather specific to my client's business which again is very specific in itself.


Oof, that doesn't sound great.

Just out of curiosity, did you seek permission to use Copilot from your client? I wonder how widely accepted it is in roles which handle sensitive data.


As a matter of fact I didn't ask the client for consent.

In my expectation, however, all I willingly fed into Copilot were some HTML drafts without meaningful content or logic.

I certainly can't use Copilot anymore until I figure out what's been happening.


Is the name of the company unique and very random, or something more along the lines of "WeatherForecastsForFishermen Inc"?

Are you sure the AI couldn't get some context from the current file? No title tag in the head, no description or keywords? The filename/path is also used; could that be it?


I'm pretty sure the marketing claim cannot be derived from the company's name (see other comments on that topic in this thread).


Is it possible Copilot is using a network pretrained on a large text dataset that did contain the marketing blurb, then retrained for code prediction on GitHub source code? That might explain why it has memorized non-GitHub content (it's a bit of a reach, though).


Have you tried using GitHub search to see if it's popping up there?


Not until now, thanks! There are no matches. The claim contains my client's company name. And not even the company name yields any search results.


Copilot is quite good at substituting variables and names. The question is whether you can find the claim for another company.


Any (hidden) symlink in your project pointing at that marketing copy?
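A quick way to rule that out, run from the project root:

```
# list every symlink under the project and where it points
find . -type l -exec ls -l {} +
```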


Nope :/


My concern is that they will eventually do this by default for all code even if you're not using CoPilot.


I doubt I will ever use any IDE, so it's a moot point for me, but from a legal perspective, using VSCode in particular has become extremely sketchy. I say that as a working dev, a machine learning researcher, and someone who has known people who deal with patents for Microsoft.

This Copilot nonsense was also the straw that broke the camel's back and got me to delete my GitHub account.


Have you told your client that you are making use of tools that upload the intellectual property they’ve shared with you / they’ve paid you to create for them, onto third-party servers?

If I was paying for work and the contractor was uploading the end result onto some sort of shared AI training set, we would not be working together for very long.

You may have brought up an excellent point that needs to be inserted into new legal contracts — either an opt in or out regarding the use of tools of various kinds that upload data into the cloud to “help” you. Maybe other companies would be okay with stuff like Copilot if it allows them to pay less money for developers who can’t write proper code without it, or something. I don’t know. I know that I want nothing to do with these sorts of systems, and I don’t want any of my code anywhere near it. I’ll definitely try to make sure nobody with access to my private repos has any of that nonsense enabled.

Maybe the legal version of Copilot can write an appropriate contract clause for us?


> Have you told your client that you are making use of tools that upload the intellectual property they’ve shared with you / they’ve paid you to create for them, onto third-party servers?

Related: Have any of the big tech companies banned their employees from using Copilot yet?


> uploading the end result onto some sort of shared AI training set

Does Copilot use the data it collects for training their AI? From their terms [1]

> GitHub Copilot does not use these URLs, file paths, or snippets collected in your telemetry as suggestions for other users of GitHub Copilot. This information is treated as confidential information and accessed on a need-to-know basis.

[1] https://docs.github.com/en/github/copilot/github-copilot-tel...


> when you edit files with the GitHub Copilot extension/plugin enabled, file content snippets, suggestions, and any modifications to suggestions will be shared with GitHub, Microsoft, and OpenAI

You don't even need to have a git repo initialized, or even have the source be on GitHub. You literally just need to be signed in - they made a Vim plugin - and Copilot will suck up all the locally edited files as well.


> for other users

So someone just needs access to your account to get snippets from your private code.


> you are making use of tools that upload the intellectual property [...] onto third-party servers?

That's the whole point: if they did, it was NOT willingly. They have the intellectual property stored in a completely separate folder, which should not be uploaded to GitHub.


The wider point is that you cannot trust these tools. You cannot trust these companies. It has been demonstrated numerous times.


The “git” in GitHub involves cryptographic hashing.

If a file in your on-computer repository has the same git hash as a file GitHub stores, the content of each file is (statistically) identical. That’s intrinsic to git.

Copilot does not need to scan the content of files in an on-computer repository to identify relevant identical files in a remote repository on GitHub.

Though the behavior might be surprising, there’s nothing nefarious. Comparing cryptographic hashes is how Git identifies and distributes changes to files. How it controls versions.
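For the curious, a blob's id can be recomputed by hand (a minimal sketch; classic git object ids use SHA-1, which is also shasum's default):

```
# git's blob id is SHA-1 over a short header plus the raw file content
size=$(wc -c < file.txt | tr -d ' ')
{ printf 'blob %s\0' "$size"; cat file.txt; } | shasum
# should print the same digest as:
git hash-object file.txt
```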

From a security standpoint, you already had the content that Copilot suggested. The horse was already out of the bag, the cat had already sailed, and the barn doors were fully open, not leaking around the edge.



