Natural Language Processing (NLP) projects are some of my favorite things to work on here at Bright. There is a big open source community in NLP with lots of out-of-the-box code, like the Natural Language Toolkit (NLTK) project. There’s also a lot of data to play with, like Google’s Ngram corpus that includes centuries of text data and even has a cool tool, the Ngram viewer, that shows high-level trends in word and phrase usage.
Recently, Acerbi et al. (2013) used Google’s Ngram corpus (a body of text) to show that the use of “mood words” (anger, fear, disgust, joy, sadness, surprise) has significantly decreased in American English texts since 1900. I wanted to do a similar analysis with “job words,” like unemployment and jobless, with a corpus that might surface a more direct signal of American sentiment with respect to time, so I chose the State of the Union corpus in the NLTK package. This corpus includes the text from every State of the Union address since Truman’s 1945 address.
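If you want to follow along, here is a minimal sketch of how the corpus could be pulled out of NLTK. It is not the exact code behind this post. The helper names `parse_fileid` and `load_speeches` are my own for illustration, and I am assuming the corpus file IDs look like `1945-Truman.txt` (some carry a suffix, as in `2001-GWBush-1.txt`).

```python
def parse_fileid(fileid):
    """Split an ID such as '1945-Truman.txt' into (1945, 'Truman').

    Assumes the ID has the form YEAR-PRESIDENT[-N].txt.
    """
    stem = fileid.rsplit(".", 1)[0]  # drop the file extension
    parts = stem.split("-")          # e.g. ['2001', 'GWBush', '1']
    return int(parts[0]), parts[1]


def load_speeches():
    """Return {fileid: list of lowercase alphabetic tokens} per address."""
    # Imported here so parse_fileid still works without NLTK installed.
    import nltk
    from nltk.corpus import state_union

    nltk.download("state_union", quiet=True)  # fetch the corpus if missing
    return {
        fid: [w.lower() for w in state_union.words(fid) if w.isalpha()]
        for fid in state_union.fileids()
    }
```

Calling `load_speeches()` needs NLTK installed and a network connection the first time, since it downloads the corpus data.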
My analysis looks at each year’s State of the Union and a “mood” category like anger, sadness, etc., and calculates how far that year’s use of “mood words” deviates from the average of all years combined, measured in standard deviations. This is called a z-score. In that way, we can see how the average use of “mood words” from one category with respect to another has increased or decreased over time, for example how sadness words may have been selected over anger words at a certain time. Moreover, by selecting a random group of words from the dictionary as a random mood category, we can see how a certain mood has increased or decreased absolutely while avoiding normalization bias.
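To make the z-score idea concrete, here is a small sketch using only the Python standard library. The word list and the three tiny “speeches” are made up for illustration, and I am assuming the population standard deviation and a simple hits-per-token rate; the analysis in this post may differ in those details.

```python
from statistics import mean, pstdev


def mood_rate(tokens, mood_words):
    """Fraction of tokens that belong to the mood category."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in mood_words)
    return hits / len(tokens)


def z_scores(values):
    """Standardize values: (value - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]  # every year identical, no deviation
    return [(v - mu) / sigma for v in values]


# Hypothetical toy data, not taken from the real corpus.
FEAR_WORDS = {"fear", "afraid", "threat"}
speeches = {
    1945: "we fear no threat and we are not afraid".split(),
    1946: "the economy is strong and growing".split(),
    1947: "a new threat faces the nation".split(),
}

years = sorted(speeches)
rates = [mood_rate(speeches[y], FEAR_WORDS) for y in years]
fear_z = dict(zip(years, z_scores(rates)))
```

A positive value in `fear_z` means that year used more fear words than the average year, and a negative value means fewer. Running the same steps on a randomly chosen word list would give the baseline category described above.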
Since the State of the Union corpus is small compared to other corpora, the data are a little noisy, but after applying Friedman’s super smoother we can see some clear trends overall and on a presidential-term scale.
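To give a feel for what smoothing does to a noisy yearly series, here is a deliberately simple stand-in: a centered moving average. To be clear, this is not Friedman’s super smoother, which picks its span adaptively using local linear fits; the function name, window size, and sample values below are assumptions for illustration only.

```python
def moving_average(values, window=5):
    """Centered moving average; the window shrinks at the edges."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)                # clip at the start
        hi = min(len(values), i + half + 1)  # clip at the end
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical noisy z-scores, one per year.
noisy = [0.0, 2.0, -1.0, 3.0, 0.5, 2.5, 1.0]
smooth = moving_average(noisy, window=3)
```

A wider window gives a flatter curve that shows long-run trends, while a narrower one keeps more of the term-by-term detail.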
In Figure 1, we can see how, from the post-Truman years through the post-Nixon years, presidential speeches contained ever-decreasing amounts of “mood words.” A resurgence is seen in the speeches of George W. Bush, followed by a decline in the speeches of his successor, Obama.
We can also see some interesting trends when looking at fear words (Figure 2). Immediately after the Second World War, presidential speeches exhibited a steady average to less-than-average amount of fear words when compared to the entire timeline. However, speeches in the 1980s saw a large increase in fear words, followed by a large decrease in the 1990s, followed by a still larger increase immediately after 9/11, and then a more recent drastic decrease, especially in President Obama’s speeches.
We can see similar trends in the use of sadness words in Figure 3, though with more differences in the pre-1980s speeches, and in the use of anger words in Figure 4, though there the amount of anger words remained at an average level until after Reagan.
Joy words, on the other hand (Figure 5), became less and less popular after Truman, until becoming modestly more popular around the Reagan era. In recent years, President Obama’s speeches have contained much less joyous sentiment.
The same type of analysis of “job words” reflects some of the same trends seen in the “mood words” analysis. Figure 6 shows the results of only “job words” with a negative connotation, like unemployed and jobless, and we can see some mirroring of the sadness and anger “mood words,” as you might expect.
Adding positive “job words” like employment and job growth to these (Figure 7), we can see that President Obama has opted for a more positive presentation of the job market.
So, with an easily obtained corpus and some simple elements from NLP, we were able to see that Clinton was a pretty jovial guy, Bush was an emotional soul, and Obama is a cool cucumber. Or maybe presidents were simply mirroring the public sentiment of the time, showing how the national dialogue has been affected by historical events and the mood of the country. Either way, with the availability of open source tools in NLP, it’s relatively easy to tease out patterns from text (shh… without having to read any of it!).