I decided to have a little fun by applying statistical methods to some past general conference talks. I took each of the 35 talks from the April 2014 General Conference and set out to:
1) Estimate the relative similarity between talks based on word counts/occurrences
2) Visualize how similar or dissimilar the talks are by projecting their relative positions onto two-dimensional space
3) Assign the talks that are most similar to their own group
Below is what the April 2014 General Conference looks like in two-dimensional space. Each circle is a talk; circles closer together are more similar, and circles farther apart are less similar. A statistical tool estimated six groups of similar talks, so each group is shown in its own color. The exact numbers on the X and Y axes don't mean anything in particular; they are only a way to measure relative distance among talks.
Well, which talk is which? Below is
the same chart using numbers instead of circles. A legend is also provided showing
the title of each talk and its corresponding number:
Number | Talk | Group
1 | A Priceless Heritage of Hope | 2
2 | Are You Sleeping through the Restoration | 3
3 | Be Strong and of a Good Courage | 3
4 | Bear Up Their Burdens with Ease | 2
5 | Christ the Redeemer | 6
6 | Daughters in the Covenant | 1
7 | Fear Not; I Am with Thee | 2
8 | Following Up | 3
9 | Grateful in Any Circumstances | 3
10 | I Have Given You an Example | 2
11 | If Ye Lack Wisdom | 5
12 | If Ye Love Me, Keep My Commandments | 4
13 | Keeping Covenants Protects Us Prepares Us and Empowers Us | 1
14 | Let's Not Take the Wrong Way | 5
15 | Let Your Faith Show | 5
16 | Live True to the Faith | 2
17 | Love—the Essence of the Gospel | 3
18 | Obedience through Our Faithfulness | 4
19 | Protection from Pornography—a Christ-Focused Home | 2
20 | Roots and Branches | 3
21 | Sisterhood Oh How We Need Each Other | 1
22 | Spiritual Whirlwinds | 2
23 | The Choice Generation | 2
24 | The Cost—and Blessings—of Discipleship | 3
25 | The Joyful Burden of Discipleship | 3
26 | The Keys and Authority of the Priesthood | 1
27 | The Priesthood Man | 2
28 | The Prophet Joseph Smith | 5
29 | The Resurrection of Jesus Christ | 6
30 | The Witness | 5
31 | Wanted Hands and Hearts to Hasten the Work | 1
32 | What Are You Thinking | 5
33 | What Manner of Men | 1
34 | Where Your Treasure Is | 3
35 | Your Four Minutes | 2
Summary of groups:
Group: 1
[1] Daughters in the Covenant
[2] Keeping Covenants Protects Us Prepares Us and Empowers Us
[3] Sisterhood Oh How We Need Each Other
[4] The Keys and Authority of the Priesthood
[5] Wanted Hands and Hearts to Hasten the Work
[6] What Manner of Men
Group: 2
[1] A Priceless Heritage of Hope
[2] Bear Up Their Burdens with Ease
[3] Fear Not; I Am with Thee
[4] I Have Given You an Example
[5] Live True to the Faith
[6] Protection from Pornography—a Christ-Focused Home
[7] Spiritual Whirlwinds
[8] The Choice Generation
[9] The Priesthood Man
[10] Your Four Minutes
Group: 3
[1] Are You Sleeping through the Restoration
[2] Be Strong and of a Good Courage
[3] Following Up
[4] Grateful in Any Circumstances
[5] Love—the Essence of the Gospel
[6] Roots and Branches
[7] The Cost—and Blessings—of Discipleship
[8] The Joyful Burden of Discipleship
[9] Where Your Treasure Is
Group: 4
[1] If Ye Love Me, Keep My Commandments
[2] Obedience through Our Faithfulness
Group: 5
[1] If Ye Lack Wisdom
[2] Let's Not Take the Wrong Way
[3] Let Your Faith Show
[4] The Prophet Joseph Smith
[5] The Witness
[6] What Are You Thinking
Group: 6
[1] Christ the Redeemer
[2] The Resurrection of Jesus Christ
***
Summary of methodology:
- R was used for all analysis and visualization, with the tm, ggplot2, and proxy libraries.
- Each conference talk was copied and pasted into its own text file to make a corpus of documents.
- The corpus was prepared by removing numbers and stop words, applying stemming, removing sparse words, and more.
- A term-document matrix was created from word occurrences and then weighted using the term frequency-inverse document frequency (TF-IDF) method. The distance between each pair of talks was calculated from those weights using cosine similarity.
- Multidimensional scaling was applied to the cosine distances of the weighted term-document matrix to project the talks onto a two-dimensional plane.
- K-means clustering was applied to group similar general conference talks. The optimal number of clusters was estimated using a scree plot of the sum of squared errors. I borrowed previously written R code for this purpose from the following link: http://stackoverflow.com/questions/15376075/cluster-analysis-in-r-determine-the-optimal-number-of-clusters
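
For anyone curious about the mechanics, here is a minimal sketch of the weighting, distance, and scree-plot steps listed above. It is plain Python rather than R, so it is not the code used for this analysis. The three-sentence mini-corpus, the 2-D coordinates, and every function name are made up for illustration. Stop-word removal, stemming, and the multidimensional scaling step are left out, and the k-means start (the first k points) is a simplification chosen only to keep the result repeatable.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (drops numbers)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document.

    Weight = (term count / document length) * log2(N / document frequency).
    """
    counts = [Counter(tokenize(doc)) for doc in docs]
    doc_freq = Counter(term for c in counts for term in c)
    n_docs = len(docs)
    vectors = []
    for c in counts:
        length = sum(c.values())
        vectors.append({
            term: (n / length) * math.log2(n_docs / doc_freq[term])
            for term, n in c.items()
        })
    return vectors


def cosine_distance(a, b):
    """1 - cosine similarity of two sparse vectors; 1.0 if either is all zeros."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans_sse(points, k, iterations=25):
    """Run a basic k-means (first k points as starting centers); return the SSE."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster ends up empty
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# Hypothetical mini-corpus standing in for the 35 talk files.
docs = [
    "faith hope and charity",
    "faith and hope in adversity",
    "priesthood keys and authority",
]
vectors = tfidf_vectors(docs)
distances = [[cosine_distance(a, b) for b in vectors] for a in vectors]

# Hypothetical 2-D coordinates standing in for the scaled talk positions.
coords = [(0.0, 0.0), (5.0, 5.0), (9.0, 0.0), (0.1, 0.2), (5.2, 4.9), (9.1, 0.3)]
sse_by_k = {k: kmeans_sse(coords, k) for k in range(1, 5)}
```

In this toy example the first two documents share weighted words and end up closer together than either is to the third, whose only shared word ("and") appears in every document and so gets a weight of zero. Printing `sse_by_k` shows the error dropping sharply up to three clusters and barely changing after that, which is the "elbow" a scree plot is read for.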