I am continuing my fun with applying statistical and machine learning methods to General Conference talks. This post explores whether you can predict the author/speaker of a General Conference talk given today by using talks given 25-40 years ago. Since Thomas S. Monson, Boyd K. Packer, and L. Tom Perry are the most senior apostles and general authorities, I used them as case studies because they have the most General Conference talks available. The oldest talks available on LDS.org go back to 1971.
The idea is this: Can the occurrence of certain words in talks given 25-40 years ago by these speakers be used to accurately predict whether one of them is the author of a recently given talk, assuming we only had the text of the recent talk and knew nothing about who actually gave it? Let's find out...
Representing Talks as 0s and 1s
We first need to determine whether we can statistically distinguish authorship among the three speakers from the talks given 25-40 years ago. I will attempt to distinguish authorship based on the presence or absence of certain words in a talk.
I had the computer create four different dictionaries based on the most common words used among all three speakers. Each dictionary contained a different number of unique words: 128, 267, 542, and 1042.
The dictionaries were derived from almost 120 total talks, with about 40 talks from each speaker. The talks contained over 8,000 unique words after combining similar words to avoid double counting. For example, 'jump' and 'jumping' would be considered the same word; they would be counted as one unique word, 'jump'.
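The original analysis did this word-combining with stemming in R (see the methodology summary at the end). As an illustration only, here is a toy suffix-stripping stemmer in Python; it is not a full Porter stemmer, and the words are just examples:

```python
# Toy sketch of stemming: strip common suffixes so related word forms
# collapse to one dictionary entry. Not a real stemmer -- illustrative only.
def toy_stem(word):
    for suffix in ("ing", "ed", "s"):
        # Only strip when enough of the word remains to be a plausible stem.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

words = ["jump", "jumping", "jumped", "jumps"]
stems = {toy_stem(w) for w in words}
print(stems)  # all four forms collapse to the single stem 'jump'
```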
Based on those dictionaries, each
talk was then turned into a spreadsheet-like document. Each row represents a certain
talk, and each column represents a particular word. The cells of the
spreadsheet contain a 0 if a certain word is absent, or a 1 if a word is
present. This means we are numerically representing each talk as a bunch of 0s
and 1s based on word occurrences. This method is like giving a fingerprint to
each talk. In sum, we're hoping to see if there are patterns in the talks' "fingerprints"
that correspond to certain speakers.
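To make the "fingerprint" idea concrete, here is a minimal Python sketch of turning a talk's text into a row of 0s and 1s over a fixed dictionary. The dictionary and talk text below are made-up stand-ins, not the real word lists from the analysis:

```python
# Sketch: represent a talk as a binary vector over a fixed dictionary.
# 1 = the dictionary word appears in the talk, 0 = it does not.
dictionary = ["faith", "family", "service", "prayer"]  # made-up example words

def fingerprint(text, dictionary):
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in dictionary]

talk = "Faith and prayer bless every family"
print(fingerprint(talk, dictionary))  # [1, 1, 0, 1]
```

Stacking one such row per talk gives the spreadsheet-like matrix described above.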
Visualizing Similarity/Differences Among Talks and Speakers
After creating the giant
spreadsheet full of 0s and 1s, the talks were then converted mathematically
into coordinates on a two-dimensional plane. The 120 talks were also previously
labeled on the spreadsheet so we would know which speaker gave each talk.
Plotting the talks visually
allows us to better understand how similar or different they are from each other.
If a certain speaker's talks are grouped closely together, and are also far away from other speakers' talks, this will confirm that we have statistically distinguished authorship based on word occurrences! Thereafter, we can attempt to predict authorship of new talks without knowing who the author is in advance! This means we are leveraging experience gleaned from the 120 older talks to predict authorship of new talks given more recently!
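The conversion from 0s and 1s to two-dimensional coordinates was done in R with Jaccard (asymmetric binary) distances and classical multidimensional scaling (per the methodology summary at the end). Here is an equivalent sketch in NumPy, using a tiny made-up matrix of three "talks" rather than the real data:

```python
import numpy as np

# Sketch of the projection step: Jaccard distances between binary talk
# vectors, then classical multidimensional scaling (MDS) down to 2-D.
def jaccard_distance(a, b):
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return 1.0 - np.logical_and(a, b).sum() / union

def classical_mds(D, k=2):
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:k]      # keep the k largest eigenvalues
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))

talks = np.array([[1, 1, 0, 1],   # made-up binary fingerprints
                  [1, 0, 0, 1],
                  [0, 1, 1, 0]])
n = len(talks)
D = np.array([[jaccard_distance(talks[i], talks[j]) for j in range(n)]
              for i in range(n)])
coords = classical_mds(D)             # one (x, y) point per talk
print(coords.shape)  # (3, 2)
```

Each talk becomes one point on the plane, which is what the plots below show.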
Plots of Talks and Speakers
Below is how the 120 older talks look using the 128 word dictionary. Each circle, triangle, or square is an individual talk corresponding to a certain speaker.
You can see in the chart that there is a lot of overlap among speakers. This means the 128 word dictionary isn't very effective at distinguishing authorship, since all the talks look the same (statistically speaking).
Below is how the talks look using the 267 word dictionary:
The 267 word dictionary looks quite a bit better -- there is more separation among speakers, but still considerable overlap.
Here is how the talks look using the 542 word dictionary:
The 542 word dictionary is
looking pretty good! There is quite a bit of separation between speakers and only
a little bit of overlap between Monson and Perry. This goes to show that these
three speakers have particular word choices that give them a somewhat unique identity!
It may not be possible to totally separate all the speakers statistically
speaking, but it looks like combinations of at least 542 words can be a
powerful way to identify authorship. However, let's see if we can improve this any
further.
Here is how the talks look using the 1042 word dictionary:
The 1042 word dictionary looks fantastic! There is a lot of separation among speakers and just barely a bit of overlap. The presence or absence of the 1042 selected words appears to distinguish authorship very well! Now that we've determined that authorship among the three speakers is distinguishable, let's get on to the predictions.
Predicting the Author/Speaker of a New Talk
I collected 16 recent talks given between 2012-2014 by Monson (5 talks), Packer (5 talks), and Perry (6 talks). The computer was never told in advance who the true author of each talk was. To predict authorship, the texts of the new talks were also turned into spreadsheets of 0s and 1s using the 1042 word dictionary developed earlier from the 120 talks given between 1970-1990. Thereafter, the 0s and 1s of the new talks were converted into coordinates on the two-dimensional plane.
Now, at this point, prediction is easy. The speaker/author of a new talk is estimated based on where the talk lands on the two-dimensional plane relative to the 120 earlier talks. We saw earlier that there were three distinct groups of talks, with each group corresponding strongly to a particular speaker. So if a new talk -- whose speaker is unknown in advance to the computer -- lands firmly within the territory of one speaker's talks from earlier, we will predict that speaker as the author of the new talk!
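The actual classification used the IBk nearest-neighbor algorithm from the RWeka package with three neighbors voting (per the methodology summary at the end). Here is a minimal Python sketch of that 3-nearest-neighbor vote; the coordinates and labels below are made-up stand-ins for the projected talks:

```python
import numpy as np
from collections import Counter

# Sketch of k-nearest-neighbor classification on the 2-D coordinates:
# find the k closest training talks and let them vote on the speaker.
def predict_speaker(new_point, train_points, train_labels, k=3):
    dists = np.linalg.norm(train_points - new_point, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up 2-D coordinates forming two tight clusters.
train_points = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],
                         [2.0, 2.0], [2.1, 1.9], [1.9, 2.1]])
train_labels = ["Monson", "Monson", "Monson",
                "Packer", "Packer", "Packer"]

print(predict_speaker(np.array([0.05, 0.05]), train_points, train_labels))
# prints Monson
```

A new talk landing deep inside one speaker's cluster gets a unanimous vote; a talk in a border zone can be outvoted, which is exactly where the mistakes below occur.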
Results and Plots of Selected Predictions
After following this method, the author/speaker was correctly predicted for 13 of the 16 talks! That is an 81% success rate! This isn't half bad considering the simplicity of the methods employed. Overall, it appears the text of talks given 25-40 years ago has predictive power to determine the author of a talk given today! These three speakers -- Thomas S. Monson, Boyd K. Packer, and L. Tom Perry -- appear to have been consistent enough with their word choices over the years to allow this to happen!
Here are the plots of the three
talks that had incorrectly predicted authors/speakers:
1)
Love -- The Essence of the Gospel - April 2014
Predicted Speaker: L. Tom Perry
Actual Speaker: Thomas S. Monson
It makes sense why this prediction went wrong: the blue diamond is right on the border between Monson and Perry.
2)
Obedience to Law Is Liberty - April 2013
Predicted Speaker: Boyd K. Packer
Actual Speaker: L. Tom Perry
It looks like this particular talk by Perry had Packer-esque word occurrences. This blue diamond is quite a bit inside Packer territory.
3)
The Doctrines and Principles Contained in the Articles of Faith - October 2013
Predicted Speaker: Boyd K. Packer
Actual Speaker: L. Tom Perry
Here is another talk that was on the border -- this time between Packer and Perry. Talks landing in the border zone introduce more ambiguity, so it makes sense the computer got this one wrong.
Here are plots of some talks whose speakers were correctly predicted:
1)
Finding Lasting Peace and Building Eternal Families - October 2014
Predicted Speaker: L. Tom Perry
Actual Speaker: L. Tom Perry
This talk is firmly in Perry territory.
2)
The Reason for Our Hope - October 2014
Predicted Speaker: Boyd K. Packer
Actual Speaker: Boyd K. Packer
This talk is firmly in Packer territory.
3)
True Shepherds - October 2013
Predicted Speaker: Thomas S. Monson
Actual Speaker: Thomas S. Monson
This talk was in the ambiguous border zone between Monson and Perry, but the computer happened to get the author/speaker right. This one looks like a tough call visually speaking.
In closing, this little project
was a lot of fun. I should note, however, that it is much more important to actually
read the general conference talks and apply their messages to our
everyday lives. It is a great blessing to have prophets, seers, and revelators
among us to give us the guidance we need today!
***
Previous posts related to statistics/text analytics of general conference talks:
Text Analytics Fun - Visualizing Sentiment Between General Conference Talks
Statistics Fun - Visualizing Similarity Between General Conference Talks
***
Summary of methodology:
- R was used for all analysis/visualization.
- Each conference talk was copied and pasted into its own text file to make a corpus of documents.
- The corpus was prepared by removing numbers, punctuation, and sparse words, and by applying stemming.
- A term document matrix was created using binary representations for word occurrences or absences.
- Distance was calculated from the term document matrix using the asymmetric binary/Jaccard method.
- Classical multidimensional scaling was applied to the Jaccard distances to project talks onto a two-dimensional space ("X and Y coordinates").
- Author/speaker prediction was done using the IBk algorithm (nearest-neighbor) from the RWeka package. Three nearest neighbors were used in voting for classification.