Sunday, March 15, 2015

Statistics Fun - Visualizing Similarity Between General Conference Talks

I decided to have a little fun by applying statistical methods to some past general conference talks. I took all 35 talks from the April 2014 General Conference and set out to do the following:

1) Estimate relative similarity between talks based on word counts/occurrences
2) Visualize how similar/dissimilar talks are by projecting their relative positions onto two-dimensional space
3) Assign the talks that are most similar to each other to the same group

Below is what the April 2014 General Conference looks like in two-dimensional space. Each circle is a talk. Circles that are closer together represent more similar talks, and circles that are farther apart represent less similar ones. A statistical tool estimated that there are six groups of similar talks, so each group is shown in its own color. The exact numbers on the X and Y axes don't mean anything in themselves; they are only a way to measure relative distance between talks.


Well, which talk is which? Below is the same chart using numbers instead of circles. A legend is also provided showing the title of each talk and its corresponding number:

Legend:

Number  Group  Talk
1       2      A Priceless Heritage of Hope
2       3      Are You Sleeping through the Restoration
3       3      Be Strong and of a Good Courage
4       2      Bear Up Their Burdens with Ease
5       6      Christ the Redeemer
6       1      Daughters in the Covenant
7       2      Fear Not; I Am with Thee
8       3      Following Up
9       3      Grateful in Any Circumstances
10      2      I Have Given You an Example
11      5      If Ye Lack Wisdom
12      4      If Ye Love Me, Keep My Commandments
13      1      Keeping Covenants Protects Us Prepares Us and Empowers Us
14      5      Let's Not Take the Wrong Way
15      5      Let Your Faith Show
16      2      Live True to the Faith
17      3      Love—the Essence of the Gospel
18      4      Obedience through Our Faithfulness
19      2      Protection from Pornography—a Christ-Focused Home
20      3      Roots and Branches
21      1      Sisterhood Oh How We Need Each Other
22      2      Spiritual Whirlwinds
23      2      The Choice Generation
24      3      The Cost—and Blessings—of Discipleship
25      3      The Joyful Burden of Discipleship
26      1      The Keys and Authority of the Priesthood
27      2      The Priesthood Man
28      5      The Prophet Joseph Smith
29      6      The Resurrection of Jesus Christ
30      5      The Witness
31      1      Wanted Hands and Hearts to Hasten the Work
32      5      What Are You Thinking
33      1      What Manner of Men
34      3      Where Your Treasure Is
35      2      Your Four Minutes


Summary of groups:

Group: 1

[1] Daughters in the Covenant                               
[2] Keeping Covenants Protects Us Prepares Us and Empowers Us
[3] Sisterhood Oh How We Need Each Other                    
[4] The Keys and Authority of the Priesthood                
[5] Wanted Hands and Hearts to Hasten the Work              
[6] What Manner of Men
                                    
Group: 2
   
[1] A Priceless Heritage of Hope                    
[2] Bear Up Their Burdens with Ease                 
[3] Fear Not; I Am with Thee                        
[4] I Have Given You an Example                     
[5] Live True to the Faith                          
[6] Protection from Pornography—a Christ-Focused Home
[7] Spiritual Whirlwinds                            
[8] The Choice Generation                           
[9] The Priesthood Man                              
[10] Your Four Minutes                               

Group: 3
   
[1] Are You Sleeping through the Restoration
[2] Be Strong and of a Good Courage        
[3] Following Up                           
[4] Grateful in Any Circumstances          
[5] Love—the Essence of the Gospel         
[6] Roots and Branches                     
[7] The Cost—and Blessings—of Discipleship 
[8] The Joyful Burden of Discipleship      
[9] Where Your Treasure Is
              
Group: 4

[1] If Ye Love Me, Keep My Commandments
[2] Obedience through Our Faithfulness

Group: 5

[1] If Ye Lack Wisdom           
[2] Let's Not Take the Wrong Way
[3] Let Your Faith Show        
[4] The Prophet Joseph Smith   
[5] The Witness                
[6] What Are You Thinking      

Group: 6

[1] Christ the Redeemer            
[2] The Resurrection of Jesus Christ


Overall, the statistical methods did a pretty good job of placing similar talks close to each other and grouping talks with similar themes! This is not bad considering the methods are based only on word occurrences/counts and not on actually "reading" the talks and understanding them! If you read the talks yourself, you can understand even better why certain talks are closer to or farther away from each other.
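
To make the "not actually reading" point concrete, here is a minimal sketch of what a word-count representation keeps and what it throws away. It is written in plain Python rather than the R used for the analysis. The helper name word_counts and the two toy sentences are made up for illustration and are not part of the original code or corpus.

```python
import re
from collections import Counter

def word_counts(text):
    """Lowercase the text, split it into words, and count each word."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

# Two toy sentences (hypothetical) with opposite meanings but the same words.
a = word_counts("the dog chased the cat")
b = word_counts("the cat chased the dog")

# Word order is discarded, so the two count vectors are identical and any
# distance computed from counts alone treats the sentences as the same.
print(a == b)  # True
```

A method built on counts can therefore pick up shared vocabulary and themes, but not the argument a talk is making.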

***

Summary of methodology:

  • R was used for all analysis/visualization. The libraries tm, ggplot2, and proxy were used.
  • Each conference talk was copied and pasted into its own text file to make a corpus of documents.
  • The corpus was prepared by removing numbers and stop words, applying stemming, removing sparse words, and more.
  • A term-document matrix was created from word occurrences and then weighted using the term frequency-inverse document frequency (TF-IDF) method. The distance between each pair of talks was calculated from those weights using cosine distance, which is derived from cosine similarity.
  • Multidimensional scaling was applied to the cosine distances from the weighted term-document matrix to project the talks onto a two-dimensional plane.
  • K-means clustering was applied to cluster/group similar general conference talks. The optimal number of clusters was estimated from a scree plot of the within-cluster sum of squared errors. I borrowed previously written R code for this purpose from the following link: http://stackoverflow.com/questions/15376075/cluster-analysis-in-r-determine-the-optimal-number-of-clusters
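
The weighting and distance steps above can be sketched in a few lines. This is a plain-Python illustration under stated assumptions, not the R code used for the post. It uses one common TF-IDF variant (raw count times log(N/df)), which may differ from the exact weighting tm applies, and it takes cosine distance to be one minus cosine similarity. The function names and the toy token lists are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for doc in docs for term in set(doc))
    # Weight = raw count * log(N / df); a term found in every document gets 0.
    return [{t: c * math.log(n / df[t]) for t, c in Counter(doc).items()}
            for doc in docs]

def cosine_distance(u, v):
    """One minus the cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # convention chosen for this sketch when a vector is all zeros
    return 1.0 - dot / (norm_u * norm_v)

# Toy "talks", already tokenized (hypothetical data, not the conference corpus).
docs = [
    "faith hope charity faith".split(),
    "faith hope service".split(),
    "priesthood keys authority".split(),
]
weights = tf_idf(docs)

# Pairwise distance matrix: the kind of input multidimensional scaling works from.
dist = [[cosine_distance(a, b) for b in weights] for a in weights]
for row in dist:
    print([round(d, 3) for d in row])
```

Documents that share no weighted terms end up at distance 1, and documents that share vocabulary end up closer. Multidimensional scaling then looks for 2-D coordinates whose distances approximate this matrix; that step is left out here because it needs an eigendecomposition.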
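
The scree-plot idea for choosing the number of clusters can be sketched as follows: run k-means for several values of k, record the within-cluster sum of squared errors (SSE) each time, and look for the "elbow" where adding clusters stops helping much. This is a plain-Python sketch, not the borrowed R code. The post does not say whether k-means was run on the 2-D coordinates or on the weighted matrix, so the 2-D points, the function names, and the restart count below are all assumptions made for illustration.

```python
import random

def kmeans_sse(points, k, iters=50, seed=0):
    """Basic Lloyd's k-means on 2-D points; returns the within-cluster SSE."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            j = min(range(k),
                    key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2)
            clusters[j].append((x, y))
        # Update step: move each center to the mean of its cluster.
        new_centers = []
        for j, members in enumerate(clusters):
            if members:
                new_centers.append((sum(p[0] for p in members) / len(members),
                                    sum(p[1] for p in members) / len(members)))
            else:
                new_centers.append(centers[j])  # leave an empty cluster's center alone
        if new_centers == centers:
            break  # converged
        centers = new_centers
    # SSE: squared distance from each point to its nearest center.
    return sum(min((x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centers)
               for x, y in points)

def best_sse(points, k, restarts=10):
    """Keep the lowest SSE over several random starts."""
    return min(kmeans_sse(points, k, seed=s) for s in range(restarts))

# Hypothetical 2-D coordinates forming three loose groups.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 5.2), (5.2, 4.9),
          (0.0, 5.0), (0.2, 5.1), (0.1, 4.8)]

# These (k, SSE) pairs are what a scree plot would draw.
for k in range(1, 7):
    print(k, round(best_sse(points, k), 3))
```

Because k-means depends on its random starting centers, the sketch keeps the best of several restarts for each k. Reading the elbow off the resulting curve is still a judgment call rather than an exact rule.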
