Monday, March 23, 2015

Text Analytics Fun - Visualizing Sentiment Between General Conference Talks

I decided to have some more fun with the April 2014 General Conference talks. This time, I applied some basic text analytics and natural language processing techniques to estimate each talk's overall sentiment. In general, a talk with "positive sentiment" uses more optimistic words, and a talk with "negative sentiment" uses more pessimistic words.

Below is a chart summarizing the sentiment differences for each talk, based only on the adjectives and adverbs that a computer algorithm automatically identified for me. Adjectives and adverbs tend to be the types of words that convey sentiment. Surprisingly, Talk #17 -- Love—the Essence of the Gospel -- had the most negative sentiment! Go figure! Talk #16 -- Live True to the Faith -- had the most positive sentiment!



 [1] A Priceless Heritage of Hope
 [2] Are You Sleeping through the Restoration
 [3] Be Strong and of a Good Courage      
 [4] Bear Up Their Burdens with Ease  
 [5] Christ the Redeemer      
 [6] Daughters in the Covenant
 [7] Fear Not; I Am with Thee   
 [8] Following Up  
 [9] Grateful in Any Circumstances  
[10] I Have Given You an Example  
[11] If Ye Lack Wisdom
[12] If Ye Love Me, Keep My Commandments   
[13] Keeping Covenants Protects Us Prepares Us and Empowers Us
[14] Let's Not Take the Wrong Way 
[15] Let Your Faith Show 
[16] Live True to the Faith   
[17] Love—the Essence of the Gospel   
[18] Obedience through Our Faithfulness
[19] Protection from Pornography—a Christ-Focused Home 
[20] Roots and Branches  
[21] Sisterhood Oh How We Need Each Other  
[22] Spiritual Whirlwinds    
[23] The Choice Generation 
[24] The Cost—and Blessings—of Discipleship 
[25] The Joyful Burden of Discipleship
[26] The Keys and Authority of the Priesthood  
[27] The Priesthood Man 
[28] The Prophet Joseph Smith 
[29] The Resurrection of Jesus Christ 
[30] The Witness 
[31] Wanted Hands and Hearts to Hasten the Work
[32] What Are You Thinking
[33] What Manner of Men  
[34] Where Your Treasure Is 
[35] Your Four Minutes


I also made a second chart showing sentiment differences, this time based on all words that the sentiment lexicon/dictionary could identify, not just adjectives and adverbs. With this method, Talk #5 -- Christ the Redeemer -- ended up having the most negative sentiment! Talk #13 -- Keeping Covenants Protects Us Prepares Us and Empowers Us -- had the most positive sentiment.



Overall, this was a fun exercise to see how general conference talks vary sentiment-wise. The methodologies and computer algorithms used for this analysis are by no means perfect, but they are surprisingly useful for gaining additional insight and extracting information from text.

***

Summary of methodology:

  • R was used for all analysis/visualization. The libraries tm, NLP, openNLP, proxy, RWeka, openNLPmodels.en, and stringr were used.
  • Each conference talk was copied and pasted into its own text file to make a corpus of documents.
  • Harvard's Inquirer Lexicon/Dictionary was used to estimate sentiment, with some modifications of my own. This dictionary can be accessed at http://www.wjh.harvard.edu/~inquirer/homecat.htm (general page) or directly at http://www.wjh.harvard.edu/~inquirer/spreadsheet_guide.htm
  • Sentiment was estimated for each talk individually by taking the total number of positive sentiment words and dividing it by the total number of all sentiment words (positive + negative) the Lexicon could identify.
  • The mean sentiment for the group of 35 talks was 72%. That is, on average, 72% of the sentiment words identified in a talk were positive.
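
The sentiment ratio described above can be sketched in a few lines. The original analysis was done in R; this is only a minimal Python illustration, and the tiny word lists (POSITIVE, NEGATIVE) and the function name sentiment_score are made up for the example. They are not the Harvard Inquirer dictionary or the code actually used.

```python
import re

# Hypothetical mini-lexicon for illustration only; the post used a modified
# Harvard Inquirer dictionary, which is not reproduced here.
POSITIVE = {"hope", "joy", "love", "faith", "grateful"}
NEGATIVE = {"fear", "burden", "sorrow", "doubt"}


def sentiment_score(text, positive=POSITIVE, negative=NEGATIVE):
    """Return positive / (positive + negative) sentiment word counts.

    Returns None when the lexicon recognizes no sentiment words at all,
    since the ratio is undefined in that case.
    """
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(1 for w in words if w in positive)
    neg = sum(1 for w in words if w in negative)
    total = pos + neg
    return pos / total if total else None


sample = "Hope and faith bring joy, even when fear and doubt arise."
score = sentiment_score(sample)  # 3 positive words, 2 negative words
print(score)
```

A talk's score is then the share of recognized sentiment words that are positive, so a value of 0.72 would correspond to the 72% average mentioned above.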

Sunday, March 15, 2015

Statistics Fun - Visualizing Similarity Between General Conference Talks

I decided to have a little fun by applying statistical methods to some past general conference talks. I took each talk from the April 2014 General Conference (35 of them) and set out to do the following:

1) Estimate relative similarity between talks based on word counts/occurrences
2) Visualize how similar/dissimilar talks are by projecting their relative positions onto two-dimensional space
3) Assign talks that are most similar into their own group

Below is what the April 2014 General Conference looks like in two-dimensional space. Each circle is a talk, and circles closer together are more similar. Circles farther apart are less similar. A statistical tool estimated that there are six groups of similar talks, and each group is shown in its own color. The exact numbers on the X and Y axes don't mean anything in particular; they are only a way to measure relative distance among talks.


Well, which talk is which? Below is the same chart using numbers instead of circles. A legend is also provided showing the title of each talk and its corresponding number:

Legend:

Number  Talk                                                       Group
1       A Priceless Heritage of Hope                               2
2       Are You Sleeping through the Restoration                   3
3       Be Strong and of a Good Courage                            3
4       Bear Up Their Burdens with Ease                            2
5       Christ the Redeemer                                        6
6       Daughters in the Covenant                                  1
7       Fear Not; I Am with Thee                                   2
8       Following Up                                               3
9       Grateful in Any Circumstances                              3
10      I Have Given You an Example                                2
11      If Ye Lack Wisdom                                          5
12      If Ye Love Me, Keep My Commandments                        4
13      Keeping Covenants Protects Us Prepares Us and Empowers Us  1
14      Let's Not Take the Wrong Way                               5
15      Let Your Faith Show                                        5
16      Live True to the Faith                                     2
17      Love—the Essence of the Gospel                             3
18      Obedience through Our Faithfulness                         4
19      Protection from Pornography—a Christ-Focused Home          2
20      Roots and Branches                                         3
21      Sisterhood Oh How We Need Each Other                       1
22      Spiritual Whirlwinds                                       2
23      The Choice Generation                                      2
24      The Cost—and Blessings—of Discipleship                     3
25      The Joyful Burden of Discipleship                          3
26      The Keys and Authority of the Priesthood                   1
27      The Priesthood Man                                         2
28      The Prophet Joseph Smith                                   5
29      The Resurrection of Jesus Christ                           6
30      The Witness                                                5
31      Wanted Hands and Hearts to Hasten the Work                 1
32      What Are You Thinking                                      5
33      What Manner of Men                                         1
34      Where Your Treasure Is                                     3
35      Your Four Minutes                                          2


Summary of groups:

Group: 1

[1] Daughters in the Covenant                               
[2] Keeping Covenants Protects Us Prepares Us and Empowers Us
[3] Sisterhood Oh How We Need Each Other                    
[4] The Keys and Authority of the Priesthood                
[5] Wanted Hands and Hearts to Hasten the Work              
[6] What Manner of Men
                                    
Group: 2
   
[1] A Priceless Heritage of Hope                    
[2] Bear Up Their Burdens with Ease                 
[3] Fear Not; I Am with Thee                        
[4] I Have Given You an Example                     
[5] Live True to the Faith                          
[6] Protection from Pornography—a Christ-Focused Home
[7] Spiritual Whirlwinds                            
[8] The Choice Generation                           
[9] The Priesthood Man                              
[10] Your Four Minutes                               

Group: 3
   
[1] Are You Sleeping through the Restoration
[2] Be Strong and of a Good Courage        
[3] Following Up                           
[4] Grateful in Any Circumstances          
[5] Love—the Essence of the Gospel         
[6] Roots and Branches                     
[7] The Cost—and Blessings—of Discipleship 
[8] The Joyful Burden of Discipleship      
[9] Where Your Treasure Is
              
Group: 4

[1] If Ye Love Me, Keep My Commandments
[2] Obedience through Our Faithfulness

Group: 5

[1] If Ye Lack Wisdom           
[2] Let's Not Take the Wrong Way
[3] Let Your Faith Show        
[4] The Prophet Joseph Smith   
[5] The Witness                
[6] What Are You Thinking      

Group: 6

[1] Christ the Redeemer            
[2] The Resurrection of Jesus Christ


Overall, it appears the statistical methodology did a pretty good job of placing similar talks closer to each other and grouping similarly themed talks! This is not bad considering the methods rely only on word occurrences/counts and do not actually "read" or understand the talks! If you actually read the talks, you can understand even better why certain talks are closer to or farther away from each other.

***

Summary of methodology:

  • R was used for all analysis/visualization. The libraries tm, ggplot2, and proxy were used.
  • Each conference talk was copied and pasted into its own text file to make a corpus of documents.
  • The corpus was prepared by removing numbers and stop words, applying stemming, removing sparse words, and more.
  • A term-document matrix was created from word occurrences and then weighted using the term frequency-inverse document frequency (TF-IDF) method. The distance between each pair of talks was calculated from those weights using cosine distance.
  • Multidimensional scaling was applied to the cosine distances of the weighted term-document matrix to project talks onto a two-dimensional plane.
  • K-means clustering was applied to cluster/group similar general conference talks. The optimal number of clusters was estimated from a scree plot of the sum of squared errors. I borrowed previously written R code for this purpose from the following link: http://stackoverflow.com/questions/15376075/cluster-analysis-in-r-determine-the-optimal-number-of-clusters
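
To make the weighting and distance steps above more concrete, here is a minimal Python sketch. It is not the R code used for the charts. It assumes a simple textbook TF-IDF variant (raw count times log of N over document frequency), which may differ from the weighting the tm package applies, and the three toy "talks" and the function names (tokenize, tfidf_vectors, cosine_distance) are invented for the example. Stop-word removal and stemming are skipped.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Weight each raw term count by log(N / document frequency)."""
    counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    # Iterating a Counter yields each distinct term once per document,
    # so this counts how many documents contain each term.
    doc_freq = Counter(term for c in counts for term in c)
    return [
        {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in c.items()}
        for c in counts
    ]


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two sparse weight vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # assumed convention: an all-zero vector is maximally distant
    return 1.0 - dot / (norm_u * norm_v)


# Toy stand-ins for talks (assumption: real input would be the 35 text files).
docs = [
    "faith hope charity",
    "faith hope covenant",
    "priesthood keys authority",
]
vectors = tfidf_vectors(docs)
dist = [[cosine_distance(a, b) for b in vectors] for a in vectors]

for row in dist:
    print([round(d, 3) for d in row])
```

The first two toy documents share two words and end up closer to each other than to the third, which shares none. A pairwise distance matrix like dist is the kind of input that multidimensional scaling would then project onto two dimensions.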