28 February 2012

Statistical Analysis of Authorship

Statistical analysis of authorship style supports a number of interesting conclusions.

In the case of the anonymously published Federalist Papers, most of the papers are attributed to Hamilton (51), Madison (14), John Jay (5), or a collaboration of Hamilton and Madison (3), but twelve have disputed authorship. Statistical analysis of the linguistic features of those texts favors Madisonian authorship of all twelve disputed papers.
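The classic approach here compares how often each candidate uses common "function" words. The sketch below is a toy illustration of that idea, not the actual Federalist-papers methodology; the marker-word list and the distance measure are my own simplifications.

```python
from collections import Counter

# Toy marker words; real studies use dozens of function words
# (famously, Madison preferred "whilst" where Hamilton wrote "while").
MARKERS = ["upon", "whilst", "while", "on", "by", "to"]

def marker_rates(text, markers=MARKERS):
    """Occurrences per 1,000 words for each marker word."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1
    return {m: 1000 * counts[m] / total for m in markers}

def closer_candidate(disputed, candidate_a, candidate_b):
    """Return 'A' or 'B' depending on whose marker profile is nearer
    (squared Euclidean distance between rate vectors)."""
    d, a, b = (marker_rates(t) for t in (disputed, candidate_a, candidate_b))
    dist_a = sum((d[m] - a[m]) ** 2 for m in MARKERS)
    dist_b = sum((d[m] - b[m]) ** 2 for m in MARKERS)
    return "A" if dist_a < dist_b else "B"
```

With real corpora one would of course use far more text, more features, and a probabilistic model rather than raw distances, but the core signal is the same: habitual word choices the author is barely aware of.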

Analysis of the verses in the Torah supports an authorship distinction between Priestly and non-Priestly sources that matches the scholarly consensus developed over centuries (where one existed at all) more than 90% of the time. The algorithm was trained in a learning experiment in which the tool distinguished verses belonging to the books of Jeremiah and Ezekiel respectively.
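The train-on-known, apply-to-unknown structure of that experiment can be sketched roughly as follows. This is a hypothetical stand-in for the study's actual method: learn each source's vocabulary preferences from labeled text, then label unseen verses by which profile they lean toward.

```python
from collections import Counter

def word_freqs(verses):
    """Relative frequency of each word across a list of verses."""
    counts = Counter(w for v in verses for w in v.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def train(corpus_a, corpus_b):
    """Learn one vocabulary profile per labeled corpus."""
    return word_freqs(corpus_a), word_freqs(corpus_b)

def label(verse, profile_a, profile_b, names=("A", "B")):
    """Assign a verse to whichever profile its words favor."""
    words = verse.lower().split()
    score_a = sum(profile_a.get(w, 0) for w in words)
    score_b = sum(profile_b.get(w, 0) for w in words)
    return names[0] if score_a >= score_b else names[1]
```

Training on Jeremiah versus Ezekiel and then applying the learned profiles to Torah verses follows exactly this shape, though the real features were richer than bare word frequencies.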

Efforts to statistically distinguish the corpora of Christopher Marlowe and William Shakespeare produce an ambiguous result, with several of the seven works in Marlowe's corpus identified as early Shakespearean instead. Several other historical figures are serious candidates to have been alternative authors of Shakespeare's plays, but computational linguistic methods have found much greater similarity between Marlowe and Shakespeare than between Shakespeare and Sir Francis Bacon or the candidate favored by the "Oxfordian theory."

In a set of 100,000 blogs, including two thousand whose authors also maintain a second blog outside the set, a statistical analysis used just three posts from an outside blog to pick out its same-author counterpart within the 100,000 as the single most likely match about 20% of the time, compared to the 0.001% success rate predicted by random chance. Identifying information such as author names was removed from both the source and the comparison sets. With more data, and less insistence on accuracy, one can do better. Specifically:

This means that when our algorithm uses three anonymously published blog posts to rank the possible authors in descending order of probability, the top guess is correct 20% of the time.

But it gets better from there. In 35% of cases, the correct author is one of the top 20 guesses. Why does this matter? Because in practice, algorithmic analysis probably won’t be the only step in authorship recognition, and will instead be used to produce a shortlist for further investigation. A manual examination may incorporate several characteristics that the automated analysis does not, such as choice of topic (our algorithms are scrupulously “topic-free”). Location is another signal that can be used: for example, if we were trying to identify the author of the once-anonymous blog Washingtonienne we’d know that she almost certainly resides in or around Washington, D.C. Alternately, a powerful adversary such as law enforcement may require Blogger, WordPress, or another popular blog host to reveal the login times of the top suspects, which could be correlated with the timing of posts on the anonymous blog to confirm a match.

We can also improve the accuracy significantly over the baseline of 20% for authors for whom we have more than an average number of labeled or unlabeled blog posts. For example, with 40–50 labeled posts to work with (the average is 20 posts per author), the accuracy goes up to 30–35%. . . . [with tweaks in the formula] the algorithm does not always attempt to identify an author, but when it does, it finds the right author 80% of the time.
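The ranking-and-shortlist procedure described above can be sketched as follows. This is a toy version under my own assumptions: the feature extraction here is a character-trigram profile scored by cosine similarity, not the actual feature set or classifier from the study.

```python
import math
from collections import Counter

def trigram_profile(text):
    """Character-trigram counts as a crude style fingerprint."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(p, q):
    """Cosine similarity between two count profiles."""
    dot = sum(p[k] * q[k] for k in p if k in q)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

def rank_candidates(anonymous_posts, labeled_corpora):
    """Return candidate authors in descending order of similarity,
    yielding a shortlist for further (manual) investigation."""
    target = trigram_profile(" ".join(anonymous_posts))
    scores = {author: cosine(target, trigram_profile(" ".join(posts)))
              for author, posts in labeled_corpora.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

The key design point is the one the quoted passage makes: the algorithm need not name a single author. Returning the top 10 or 20 candidates, to be narrowed by topic, location, or subpoenaed login times, is enough to make deanonymization practical.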


Moreover, this was done with just a quick-and-dirty search algorithm, itself published in a blog post ("The good news for authors who would like to protect themselves against deanonymization, it appears that manually changing one’s style is enough to throw off these attacks."), that could be refined with experience and by using data that would be considered "cheating" in the statistical authorship analysis conducted.

All of this goes to show that an individual's writing style, while not quite as distinctive as their DNA or fingerprints, is still quite distinctive and a fairly accurate measure of authorship. It also suggests that privacy on the Internet, given the latent features that can be used to track authorship, may be even more illusory than it seems.
