Friday, April 3, 2015

Machine Learning Fun - Predicting Authorship from 40 Year-old Talks

I am continuing my fun with applying statistical and machine learning methods to General Conference talks. This post explores whether you can predict the author/speaker of a recent general conference talk using only talks given 25-40 years ago. Since Thomas S. Monson, Boyd K. Packer and L. Tom Perry are the most senior apostles and general authorities, I used them as case studies because they have the most general conference talks available. The oldest general conference talks available on LDS.org go back to 1971.

The idea is this: Can the occurrence of certain words from talks given 25-40 years ago by these speakers be used to accurately predict if they are the author of a recently given talk -- but assuming we only had the text of the recent talk and knew nothing about who actually gave it? Let's find out...

Representing Talks as 0s and 1s

We first need to determine if we can statistically distinguish authorship between the three speakers from the talks given 25-40 years ago. I will be attempting to distinguish authorship based on the presence or absence of certain words in a talk.

I had the computer create four different dictionaries based on the most common words used among all three speakers. Each dictionary contained a different number of unique words: 128, 267, 542, and 1042.

The dictionaries were derived from almost 120 total talks, with about 40 talks given by each speaker. The talks had over 8,000 unique words after combining similar words together to avoid double counting. For example, jump and jumping would be considered the same word; both would be counted as the one unique word 'jump'.
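To make the "jump" and "jumping" idea concrete, here is a minimal Python sketch of combining similar words. It is not the code used for this post (that was done in R): the suffix list and the function names `crude_stem` and `unique_stems` are assumptions made up for illustration, and a real stemmer handles far more cases.

```python
import re

# Toy suffix list: an assumption for illustration, not a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    """Strip one common suffix, keeping at least three leading characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def unique_stems(text):
    """Lowercase the text, keep alphabetic tokens, and return the set of stems."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {crude_stem(token) for token in tokens}

# "jump", "jumping" and "jumped" collapse into the single stem "jump".
stems = unique_stems("Jump, jumping, jumped: they all jump.")
print(sorted(stems))  # ['all', 'jump', 'they']
```

Counting stems instead of raw words is what keeps variants of the same word from being counted twice.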

Based on those dictionaries, each talk was then turned into a spreadsheet-like document. Each row represents a certain talk, and each column represents a particular word. The cells of the spreadsheet contain a 0 if a certain word is absent, or a 1 if a word is present. This means we are numerically representing each talk as a bunch of 0s and 1s based on word occurrences. This method is like giving a fingerprint to each talk. In sum, we're hoping to see if there are patterns in the talks' "fingerprints" that correspond to certain speakers.
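As a rough illustration of the spreadsheet of 0s and 1s, here is a small Python sketch. The three miniature "talks" are hypothetical, and ranking words by how many talks contain them is my assumption about what "most common" means here; the helper names `build_dictionary` and `to_binary_rows` are likewise invented for this example.

```python
from collections import Counter

def build_dictionary(talk_word_sets, size):
    """Pick the `size` words found in the most talks (ties broken alphabetically)."""
    doc_freq = Counter()
    for words in talk_word_sets:
        doc_freq.update(words)  # a set adds at most 1 per word per talk
    ranked = sorted(doc_freq, key=lambda w: (-doc_freq[w], w))
    return ranked[:size]

def to_binary_rows(talk_word_sets, dictionary):
    """One row per talk, one column per dictionary word: 1 if present, else 0."""
    return [[1 if word in words else 0 for word in dictionary]
            for words in talk_word_sets]

# Hypothetical miniature "talks", already reduced to sets of stemmed words.
talks = [
    {"faith", "pray", "family"},
    {"faith", "temple", "family"},
    {"faith", "pray", "pioneer"},
]
dictionary = build_dictionary(talks, size=3)
rows = to_binary_rows(talks, dictionary)
print(dictionary)  # ['faith', 'family', 'pray']
for row in rows:
    print(row)     # [1, 1, 1] then [1, 1, 0] then [1, 0, 1]
```

Each printed row is one talk's "fingerprint": the same columns in the same order, with only presence or absence recorded.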

Visualizing Similarity/Differences Among Talks and Speakers

After creating the giant spreadsheet full of 0s and 1s, the talks were then converted mathematically into coordinates on a two-dimensional plane. The 120 talks were also previously labeled on the spreadsheet so we would know which speaker gave each talk.
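The methodology summary at the end of this post names the two steps behind that conversion: Jaccard distances between talks, then classical multidimensional scaling. Below is a dependency-free Python sketch of both, offered as an illustration under assumptions rather than as the R code actually used. The function names are mine, the four rows are hypothetical, and the power iteration is a simple stand-in for a proper eigenvalue solver.

```python
import math

def jaccard_distance(a, b):
    """1 - (words present in both) / (words present in either) for two 0/1 rows."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if either == 0 else 1.0 - both / either

def top_eigenpair(matrix, iterations=1000):
    """Largest eigenvalue and unit eigenvector of a symmetric matrix (shifted power iteration)."""
    n = len(matrix)
    # Adding shift * I makes every eigenvalue non-negative, so the iteration
    # converges to the algebraically largest eigenvalue of the original matrix.
    shift = max(sum(abs(x) for x in row) for row in matrix)
    vec = [1.0 + 0.1 * i for i in range(n)]  # fixed, non-constant starting vector
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) + shift * vec[i]
               for i in range(n)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            return 0.0, [0.0] * n
        vec = [v / norm for v in nxt]
    value = sum(vec[i] * matrix[i][j] * vec[j] for i in range(n) for j in range(n))
    return value, vec

def classical_mds(dist, dims=2):
    """Project items with pairwise distances `dist` onto `dims` coordinates."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # Double centering; sq is symmetric, so column means equal row means.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        value, vec = top_eigenpair(b)
        scale = math.sqrt(max(value, 0.0))  # negative eigenvalues contribute nothing
        for i in range(n):
            coords[i][d] = scale * vec[i]
        for i in range(n):  # deflate so the next pass finds the next eigenpair
            for j in range(n):
                b[i][j] -= value * vec[i] * vec[j]
    return coords

# Hypothetical 0/1 rows for four talks.
rows = [[1, 1, 1, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
dist = [[jaccard_distance(a, b) for b in rows] for a in rows]
for x, y in classical_mds(dist):
    print(round(x, 3), round(y, 3))
```

Talks that share many dictionary words get small Jaccard distances and therefore land near each other on the plane. The signs of the coordinates are arbitrary; only the relative positions matter.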

Plotting the talks visually allows us to better understand how similar or different they are from each other. If a certain speaker's talks are grouped closely together and are also far away from the other speakers' talks, this will confirm we have statistically distinguished authorship based on word occurrences! Thereafter, we can attempt to predict authorship of new talks without knowing who the author is in advance! This means we are leveraging experience gleaned from the 120 older talks to predict authorship of new talks given more recently!

Plots of Talks and Speakers

Below is what the 120 older talks look like using the 128 word dictionary. Each circle, triangle or square is an individual talk corresponding to a certain speaker.

You can see in the chart there is a lot of overlap among speakers. This means the 128 word dictionary isn't very effective at distinguishing authorship, since all the talks look the same (statistically speaking).

Below is what the talks look like using the 267 word dictionary:


This 267 word dictionary looks quite a bit better -- there is more separation among speakers but there is quite a bit of overlap still.

Here is what the talks look like using the 542 word dictionary:

The 542 word dictionary is looking pretty good! There is quite a bit of separation between speakers and only a little bit of overlap between Monson and Perry. This goes to show that these three speakers have particular word choices that give them a somewhat unique identity! It may not be possible to totally separate all the speakers statistically speaking, but it looks like combinations of at least 542 words can be a powerful way to identify authorship. However, let's see if we can improve this any further.

Here is what the talks look like using the 1042 word dictionary:

The 1042 word dictionary looks fantastic! There is a lot of separation among speakers and just barely a bit of overlap. The presence or absence of the 1042 selected words appears to distinguish authorship very well! Now that we've determined that authorship among the three speakers is distinguishable, let's get on to the predictions.

Predicting the Author/Speaker of a New Talk

I collected 16 recent talks given between 2012 and 2014 by Monson (5 talks), Packer (5 talks) and Perry (6 talks). The computer was never told in advance who the true author of each talk was. To predict authorship, the texts of the new talks were also turned into spreadsheets of 0s and 1s using the 1042 word dictionary developed earlier from the 120 talks given between 1970 and 1990. Thereafter, those 0s and 1s of the new talks were converted into coordinates on the two-dimensional plane.

Now, at this point, prediction is easy. The speaker/author of a new talk is estimated from where the talk lands on the two-dimensional plane relative to the 120 earlier talks. We saw earlier that there were three distinct groups of talks, with each group corresponding strongly to a particular speaker. So, if a new talk -- whose speaker is unknown in advance to the computer -- lands firmly within the territory of one speaker's earlier talks, we then predict that this speaker was the author of the new talk!
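The "where does it land" rule can be sketched as a nearest-neighbor vote. The methodology summary at the end of this post says three nearest neighbors were used; everything else here is an assumption for illustration, including the coordinates, the function name `predict_speaker`, and the choice of plain Euclidean distance on the plane.

```python
import math
from collections import Counter

def predict_speaker(new_point, labeled_points, k=3):
    """Majority vote among the k labeled talks closest to new_point."""
    by_distance = sorted(
        labeled_points,
        key=lambda item: math.hypot(item[0][0] - new_point[0],
                                    item[0][1] - new_point[1]),
    )
    votes = Counter(label for _, label in by_distance[:k])
    # On a tie, most_common keeps first-seen order, so the nearest talk's label wins.
    return votes.most_common(1)[0][0]

# Hypothetical coordinates for older talks whose speakers are known.
labeled = [
    ((0.0, 0.0), "Monson"), ((0.1, 0.2), "Monson"), ((0.2, 0.0), "Monson"),
    ((1.0, 1.0), "Packer"), ((1.1, 0.9), "Packer"), ((0.9, 1.2), "Packer"),
    ((-1.0, 1.0), "Perry"), ((-1.1, 0.8), "Perry"), ((-0.9, 1.1), "Perry"),
]
print(predict_speaker((1.0, 1.1), labeled))   # Packer
print(predict_speaker((0.05, 0.1), labeled))  # Monson
```

A talk that lands between two groups can easily pick up neighbors from both, which is why border-zone talks are the risky ones.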

Results and Plots of Selected Predictions

After following this method, the author/speaker for 13 of the 16 talks was correctly predicted! That is an 81% success rate! This isn't half bad considering the simplicity of the methods employed. Overall, it appears the text of talks given 25-40 years ago has predictive power to determine the author of a talk given today! These three speakers -- Thomas S. Monson, Boyd K. Packer, and L. Tom Perry -- appear to have been consistent enough with their word choices over the years to allow this to happen!

Here are the plots of the three talks that had incorrectly predicted authors/speakers:

1) Love -- The Essence of the Gospel - April 2014
Predicted Speaker: L. Tom Perry 
Actual Speaker: Thomas S. Monson


It makes sense that the prediction went wrong here. The blue diamond is right on the border between Monson and Perry.

2) Obedience to Law Is Liberty - April 2013
Predicted Speaker: Boyd K. Packer
Actual Speaker: L. Tom Perry

It looks like this particular talk by Perry had Packer-esque word occurrences. This blue diamond is quite a bit inside Packer territory.

3) The Doctrines and Principles Contained in the Articles of Faith - October 2013
Predicted Speaker: Boyd K. Packer
Actual Speaker: L. Tom Perry

Here is another talk that was on the border -- this time between Packer and Perry. Talks landing in the border zone introduce more ambiguity, so it makes sense the computer got this one wrong.


Here are plots of some talks whose speakers were correctly predicted:


1) Finding Lasting Peace and Building Eternal Families - October 2014
Predicted Speaker: L. Tom Perry
Actual Speaker: L. Tom Perry
This talk is firmly in Perry territory.


2) The Reason for Our Hope - October 2014
Predicted Speaker: Boyd K. Packer 
Actual Speaker: Boyd K. Packer 

This talk is firmly in Packer territory.

3) True Shepherds - October 2013
Predicted Speaker: Thomas S. Monson 
Actual Speaker: Thomas S. Monson
 


This talk was in the ambiguous border zone between Monson and Perry, but the computer happened to get the author/speaker right. This one looks like a tough call visually speaking. 

In closing, this little project was a lot of fun. I should note, however, that it is much more important to actually read the general conference talks and apply their messages to our everyday lives. It is a great blessing to have prophets, seers, and revelators among us to give us the guidance we need today!

***

Previous posts related to statistics/text analytics of general conference talks:

Text Analytics Fun - Visualizing Sentiment Between General Conference Talks


Statistics Fun - Visualizing Similarity Between General Conference Talks

***

Summary of methodology:

  • R was used for all analysis/visualization
  • Each conference talk was copied and pasted into its own text file to make a corpus of documents.
  • The corpus was prepared by removing numbers, punctuation, sparse words and applying stemming.
  • A term document matrix was created using binary representations for word occurrences or absences.
  • Distance was calculated from the term document matrix using the asymmetric binary/Jaccard method.
  • Classical multidimensional scaling was applied to the Jaccard distances to project talks onto a two-dimensional space ("X and Y coordinates").
  • Author/speaker prediction was done using the IBk algorithm (nearest-neighbor) from the RWeka package. Three nearest neighbors were used in voting for classification.
