Digital Fishing with GitHub Copilot

This short note is about one feature of GitHub Copilot. You can find questions about it on Stack Overflow, and articles and videos elsewhere on the Internet, but I have not seen anything on this topic on Habr. Perhaps I did not look hard enough.

Copilot can suggest not only the code of a suitable function, but also private keys for crypto wallets and logins/passwords for various services. Under the cut are a few details for those who want to go fishing.

Of course, one can argue that if repository owners publish something publicly, then by definition it should not contain private data, but we all know how simple inattention works in practice. In an ideal world, it would make no sense to look for vulnerable sites straight from a search engine, and passwords like “123456” would not be ridiculously popular.

Some of the private keys for crypto wallets turned out to belong to, let’s say, non-empty wallets. I reported the issue to GitHub back in mid-April; a month and a half later they sent the following response:

Thanks for the submission! We have reviewed your report and determined that it does not present a significant security risk. The personal data that Copilot presents is from public data. In this case, those keys were in public repositories or not all valid.

Because GitHub Copilot was trained on publicly available code, its training set included public personal data included in that code. From our internal testing, we found it to be extremely rare that GitHub Copilot suggestions included personal data verbatim from the training set. In some cases, the model will suggest what appears to be personal data – email addresses, phone numbers, access keys, etc. – but is actually made-up information synthesized from patterns in training data. For the technical preview, we have implemented a rudimentary filter that blocks emails when shown in standard formats, but it’s still possible to get the model to suggest this sort of content if you try hard enough.

Since the developers consider this behavior of the algorithm to be normal …

So, what do you need for fishing? Only VS Code with the GitHub Copilot extension installed. Then start writing a comment in the code that contains the right keywords.

#Ethereum wallet project private key
private_keys_list = ['0x

Help the algorithm a little to understand what exactly you are looking for: add one or two digits to the beginning of the key, or use a different variable name. Suggestions appear quickly, and you can keep extending the list. Of course, the algorithm will also try to generate artificial keys based on patterns, but such cases are usually obvious at once: the wrong key length, characters that cannot occur in a key, and so on.
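The "obviously artificial" cases mentioned above can be filtered out mechanically. Below is a minimal sketch, assuming that a raw Ethereum private key is written as 64 hexadecimal digits (32 bytes) with an optional `0x` prefix. The function name `looks_like_private_key` and the sample strings are made up for illustration. The check only looks at length and alphabet; it says nothing about whether a key is real or in use.

```python
import re

# Assumption: a raw key is 32 bytes, i.e. exactly 64 hex digits,
# optionally prefixed with "0x".
_HEX_KEY = re.compile(r"(?:0x)?[0-9a-fA-F]{64}")


def looks_like_private_key(candidate: str) -> bool:
    """Return True if the string has the shape of a 32-byte hex key."""
    return _HEX_KEY.fullmatch(candidate.strip()) is not None


# Hypothetical samples built from repeated characters, not real keys.
samples = [
    "0x" + "ab" * 32,  # well-formed: 64 hex digits
    "0x" + "ab" * 30,  # wrong length
    "0x" + "zz" * 32,  # characters that cannot occur in a hex key
]
results = [looks_like_private_key(s) for s in samples]
print(results)
```

The same shape check is just as handy in the other direction: run it over your own files before publishing a repository to catch a key left in the code by inattention.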

#Ethereum 12 words secret seed phrase
phrase="a

As a reminder, pressing Ctrl+Enter opens a panel with several options for continuing the code. Logins and passwords for websites are a little harder to get, but the principle is the same. I did not check whether the results depend on the programming language. In general, something is off about the claim “but is actually made-up information synthesized from patterns in training data”: not everything there is synthetically generated and free of overlap with real data. Let’s hope the GitHub developers really are keeping an eye on the data sources used for training.

This note is written for the sole purpose of sharing another way to fish on the Internet, another way to have a little fun in your free time and look for something interesting. Don’t forget to change your passwords!
