Identifying Anonymous Programmers by Coding Style

It would be tremendously beneficial for people who work in malware forensics to have better methodologies for determining the human authors of otherwise anonymous code. For example, SamSam ransomware has been devastating hospitals and entire city networks. It’s now believed that there’s just one malware author behind the attacks. Wouldn’t it be great if we could identify that individual?

Computer science professors Rachel Greenstadt and Aylin Caliskan presented their methodology for identifying programmers by patterns in their use of code at this year’s DEFCON. From the abstract:

“Many hackers like to contribute code, binaries, and exploits under pseudonyms, but how anonymous are these contributions really? Our work on programmer de-anonymization from the standpoint of machine learning… show(s) how abstract syntax trees contain stylistic fingerprints and how these can be used to potentially identify programmers from code and binaries. We perform programmer de-anonymization using both obfuscated binaries, and real-world code found in single-author GitHub repositories and the leaked Nulled.IO hacker forum.”

This reminds me of idiolects and the field of forensic linguistics. The study of forensic linguistics is believed to date back as far as 1927, when the Associated Press wrote about the insights used to determine the author of a ransom note.

An idiolect is the distinctive way an individual uses language; it’s much more specific than a dialect. Determining someone’s idiolect is central to a forensic linguist’s work.

People who know me personally may have noticed that I love to finish sentences with “quite frankly!” I also enjoy describing things with two or three synonym adjectives. When I do that, it’s obnoxious, unpleasant, and annoying. To top it all off, who do you ever hear saying “mayn’t”? I say it often, but I may be one of the few who do.

Many of the ideas behind Greenstadt and Caliskan’s DEFCON presentation can be found in a paper they co-authored last year with Edwin Dauber and Richard Harang, “Git Blame Who?: Stylistic Authorship Attribution of Small, Incomplete Source Code Fragments.” Here’s how they acquired their test sample:

“After eliminating (duplicates) and setting the threshold to at least 150 samples per author with at least one line of actual code (not whitespace or comments), we were left with 106 programmers. We note that this threshold was chosen with an experimental mindset to ensure that we had sufficient data for both training and testing sets. We then used git blame on each line of code, and for each set of consecutive lines blamed to the same programmer we encapsulated those lines in a dummy main function and extracted features from the abstract syntax tree.”
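The encapsulation step the authors describe can be sketched in a few lines. This is not their implementation (their feature set is far richer); it’s a minimal illustration using Python’s standard `ast` module, where the “features” are simply counts of each abstract-syntax-tree node type in a fragment wrapped in a dummy function:

```python
import ast
from collections import Counter

def ast_node_counts(code_lines):
    """Wrap consecutive blamed lines in a dummy function so the
    fragment parses on its own, then count AST node types as
    crude stylistic features."""
    wrapped = "def _dummy_main():\n" + "\n".join(
        "    " + line for line in code_lines
    )
    tree = ast.parse(wrapped)
    # Each node type (If, For, BinOp, Call, ...) becomes a feature count
    return Counter(type(node).__name__ for node in ast.walk(tree))

# A three-line fragment attributed to one programmer by git blame
features = ast_node_counts(["x = 1", "if x > 0:", "    print(x)"])
```

Feature vectors like this, built per author from many fragments, are what a classifier would then be trained on.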

The researchers use several techniques, including the following, to identify an author once they believe they have multiple code samples from the same person:

“For this method we used three experimental setups. In the first, we combined samples in both our known training set and our testing set. In the second, we only combined our training samples, and in the third we only combined our testing samples. Our second proposed method is also our preferred method. This method does not involve any adjustment to the feature vectors. Instead, it requires performing the same classification as for single samples but then aggregating results for the samples that we would have merged by our other methods. We aggregate the probability distribution output of the classifier rather than the predicted classes, and then take as the prediction for the aggregated samples the class with the highest averaged probability.”
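The aggregation idea in their preferred method, averaging per-sample probability distributions rather than merging feature vectors or votes, can be sketched as follows. The function name and the toy numbers are my own for illustration; in practice the rows would come from a trained classifier’s probability output (e.g., scikit-learn’s `predict_proba`):

```python
import numpy as np

def aggregate_predictions(prob_matrix, authors):
    """Average the per-sample probability distributions over candidate
    authors, then predict the author with the highest mean probability.
    prob_matrix has one row per code fragment believed to share an
    author; columns correspond to entries of `authors`."""
    mean_probs = np.asarray(prob_matrix, dtype=float).mean(axis=0)
    return authors[int(np.argmax(mean_probs))]

# Three fragments, three candidate authors. No single fragment is
# decisive, but the averaged distribution points to "bob".
probs = [[0.5, 0.3, 0.2],
         [0.1, 0.6, 0.3],
         [0.2, 0.5, 0.3]]
print(aggregate_predictions(probs, ["alice", "bob", "carol"]))  # prints "bob"
```

Averaging probabilities rather than hard predicted classes preserves the classifier’s uncertainty, so a few confident samples can outweigh many ambiguous ones.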

Greenstadt and Caliskan have found that it’s generally easier to identify advanced programmers than beginners. Beginners are more likely to use code snippets that they’ve found elsewhere. And as with spoken languages, a person usually doesn’t develop a distinctive idiolect until they’ve acquired a level of fluency.

Greenstadt and Caliskan’s DEFCON presentation featured their recent work analyzing code samples with machine learning algorithms, extracting features such as word choice, code length, and how code is organized. Their technique correctly identified coders about 83% of the time.

These research techniques can be very helpful for identifying individual cyber attackers by examining their malware code. The work can also inform legal disputes about code authorship and intellectual property rights. But the findings could be used in harmful ways, too. There are good and understandable reasons why a programmer may want to remain anonymous when developing open-source code. Maybe they’re working on software that a hostile government doesn’t like, for example.

Greenstadt said her research will continue long after DEFCON. “We’re still trying to understand what makes something really attributable and what doesn’t. There’s enough here to say it should be a concern, but I hope it doesn’t cause anybody to not contribute publicly on things.”