This week, our walkthrough is guided by my colleague Josh Rosenberg’s recent article, Advancing new methods for understanding public sentiment about educational reforms: The case of Twitter and the Next Generation Science Standards. We will focus on conducting a very simplistic “replication study” by comparing the sentiment of tweets about the Next Generation Science Standards (NGSS) and Common Core State Standards (CCSS) in order to better understand public reaction to these two curriculum reform efforts. I highly recommend you watch the quick 3-minute overview of this work at https://stanford.app.box.com/s/i5ixkj2b8dyy8q5j9o5ww4nafznb497x
For Unit 2, our focus will be on using the Twitter API to import data on topics or tweets of interest and using sentiment lexicons to help gauge public opinion about those topics or tweets. Silge & Robinson nicely illustrate the tools of text mining to approach the emotional content of text programmatically, in the following diagram:
For Unit 2, our walkthrough will cover the following topics:
To help us better understand the context, questions, and data sources we’ll be using in Unit 2, this section will focus on the following topics:
Abstract
While the Next Generation Science Standards (NGSS) are a long-standing and widespread standards-based educational reform effort, they have received less public attention, and no studies have explored the sentiment of the views of multiple stakeholders toward them. To establish how public sentiment about this reform might be similar to or different from past efforts, we applied a suite of data science techniques to posts about the standards on Twitter from 2010-2020 (N = 571,378) from 87,719 users. Applying data science techniques to identify teachers and to estimate tweet sentiment, we found that the public sentiment towards the NGSS is overwhelmingly positive—33 times more so than for the CCSS. Mixed effects models indicated that sentiment became more positive over time and that teachers, in particular, showed a more positive sentiment towards the NGSS. We discuss implications for educational reform efforts and the use of data science methods for understanding their implementation.
Data Source & Analysis
Similar to what we’ll be learning in this walkthrough, Rosenberg et al. used publicly accessible data from Twitter collected using the Full-Archive Twitter API and the rtweet package in R. Specifically, the authors accessed tweets and user information from the hashtag-based #NGSSchat online community, all tweets that included any of the following phrases, with “/” indicating an additional phrase featuring the respective plural form: “ngss”, “next generation science standard/s”, “next gen science standard/s”.
Unlike this walkthrough, however, the authors determined tweet sentiment using the Java version of SentiStrength to assign tweets to two 5-point scales of sentiment, one for positivity and one for negativity, because SentiStrength is a validated measure for sentiment in short informal texts (Thelwall et al., 2011). In addition, they used this tool because Wang and Fikis (2019) used it to explore the sentiment of CCSS-related posts. We’ll be using the AFINN sentiment lexicon, which instead assigns each word a single score between -5 and 5, in addition to exploring some other sentiment lexicons.
Note that the authors also used the lme4 package in R to run a mixed effects model to determine whether sentiment changes over time and differs between teachers and non-teachers. We will not attempt to replicate that aspect of the analysis, but if you are interested in a guided walkthrough of how modeling can be used to understand changes in Twitter word use, see Chapter 7 of Text Mining with R.
Summary of Key Findings
The Rosenberg et al. study was guided by the following five research questions:
For this walkthrough, we’ll use an approach similar to the one the authors used to gauge public sentiment around the NGSS, by comparing how much more positive or negative NGSS tweets are relative to CCSS tweets.
Our (very) specific questions of interest for this walkthrough are:
And just to reiterate from Unit 1, one overarching question we’ll explore throughout this course, and that Silge and Robinson (2018) identify as a central question to text mining and natural language processing, is:
How do we quantify what a document or collection of documents is about?
As highlighted in Chapter 6 of Data Science in Education Using R (DSIEUR), one of the first steps of every workflow should be to set up a “Project” within RStudio. This will be your “home” for any files and code used or created in Unit 2. You are welcome to continue using the same project created for Unit 1, or create an entirely new project for Unit 2. Once you’ve created your project, open a new R script and load the following packages that we’ll need for this walkthrough:
library(dplyr)
library(readr)
library(tidyr)
library(rtweet)
library(writexl)
library(readxl)
library(tidytext)
library(textdata)
library(ggplot2)
library(scales)
At the end of this week, I’ll ask that you share your R script with me as evidence that you have completed the walkthrough. Although I highly recommend that you manually type the code shared throughout this walkthrough, for large blocks of text it may be easier to copy and paste.
Before you can begin pulling tweets into R, you’ll first need to create a Twitter App in your developer account. You are not required to set up a developer account for this course, but if you are still interested in creating one, these instructions succinctly outline the process and you can set one up in about 10 minutes. If you are not interested in setting one up and pulling tweets on your own, I have provided the data we’ll be using for this tutorial on my GitHub course repository and in our ECI 588 course site. You can skip to section 2b. Tidy Text.
This section and the section that follows are borrowed largely from the rtweet package by Michael Kearney, and are for those of you who have set up a Twitter developer account and are interested in pulling your own data from Twitter.
Navigate to developer.twitter.com/en/apps, click the blue button that says Create a New App, and then complete the form with the following fields:
App Name: What your app will be called
Application Description: How your app will be described to its users
Website URLs: Website associated with app (I recommend using the URL to your Twitter profile)
Callback URLs: IMPORTANT enter exactly the following: http://127.0.0.1:1410
Tell us how this app will be used: Be clear and honest
When you’ve completed the required form fields, click the blue Create button at the bottom
Read through and indicate whether you accept the developer terms
And you’re done!
In general, data wrangling involves some combination of cleaning, reshaping, transforming, and merging data (Wickham & Grolemund, 2017). The importance of data wrangling is difficult to overstate, as it involves the initial steps of going from raw data to a dataset that can be explored and modeled (Krumm et al., 2018).
rtweet package and some key functions to search for tweets or users of interest.
tidytext package to both “tidy” and tokenize our tweets in order to create our data frame for analysis.
inner_join() function for appending sentiment values to our data frame.
The Import Tweets section introduces the following functions from the rtweet package for reading Twitter data into R:
search_tweets() Pulls up to 18,000 tweets from the last 6-9 days matching provided search terms.
search_tweets2() Returns data from multiple search queries.
get_timelines() Returns up to 3,200 tweets of one or more specified Twitter users.
Since one of our goals for this walkthrough is a very crude replication of the study by Rosenberg et al. (2021), let’s begin by introducing the search_tweets() function to try reading into R 5,000 tweets containing the #NGSSchat hashtag and store them as a new data frame, ngss_all_tweets.
Type or copy the following code into your R script or console and run:
ngss_all_tweets <- search_tweets(q = "#NGSSchat", n=5000)
Note that the first argument that the search_tweets() function expects, q =, is the search term enclosed in quotation marks, and that n = specifies the maximum number of tweets to return.
View your new ngss_all_tweets data frame using one of the previous view methods from Unit 1 Section 2a to help answer the following questions:
While not explicitly mentioned in the paper, it’s likely the authors removed retweets in their query, since a retweet is simply someone reposting another user’s tweet and would duplicate the exact same content as the original.
Let’s use the include_rts = argument to remove any retweets by setting it to FALSE:
ngss_non_retweets <- search_tweets("#NGSSchat",
n=5000,
include_rts = FALSE)
If you recall from [Section 1a], the authors accessed tweets and user information from the hashtag-based #NGSSchat online community, all tweets that included any of the following phrases, with “/” indicating an additional phrase featuring the respective plural form: “ngss”, “next generation science standard/s”, “next gen science standard/s”.
Let’s modify our query using the OR operator to also include “ngss” so it will return tweets containing either #NGSSchat or “ngss” and assign the result to ngss_or_tweets:
ngss_or_tweets <- search_tweets(q = "#NGSSchat OR ngss",
n=5000,
include_rts = FALSE)
Try including both search terms but excluding the OR operator to answer the following questions:
Does excluding the OR operator return more tweets, the same number of tweets, or fewer tweets? Why?
What other arguments does the search_tweets() function contain? Try adding one and see what happens.
Hint: Use the ?search_tweets help function to learn more about the q argument and other arguments for composing search queries.
Unfortunately, the OR operator will only get us so far. In order to include the additional search terms, we will need to use the c() function to combine our search terms into a single vector.
The rtweet package has an additional search_tweets2() function for using multiple queries in a search. To search for an exact phrase, either wrap single quotes around a search query that uses double quotes, e.g., q = '"next gen science standard"', or escape each internal double quote with a single backslash, e.g., q = "\"next gen science standard\"".
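As a quick check that the two quoting styles really do produce the same query string, here is a minimal sketch in base R. The variable names are made up for illustration, and nothing here contacts the Twitter API.

```r
# Two ways to write a search string whose content includes double quotes
single_quoted <- '"next gen science standard"'
escaped <- "\"next gen science standard\""

# Both produce exactly the same character string
identical(single_quoted, escaped)

# cat() prints the raw string, quotes included, without R's escaping
cat(single_quoted, "\n")

# c() combines several queries into one character vector
queries <- c("#NGSSchat OR ngss", single_quoted)
length(queries)
```

Which style you use is a matter of taste; single quotes on the outside tend to be easier to read when a query contains several exact phrases.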
Copy and paste the following code to store the results of our query in ngss_tweets:
ngss_tweets <- search_tweets2(c("#NGSSchat OR ngss",
'"next generation science standard"',
'"next generation science standards"',
'"next gen science standard"',
'"next gen science standards"'
),
n=5000,
include_rts = FALSE)
Recall that for our research question we wanted to compare public sentiment about both the NGSS and CCSS state standards. Let’s go ahead and create our very first “dictionary” for identifying tweets related to either set of standards, and then use that dictionary for the q = query argument to pull tweets related to the state standards.
To do so, we’ll need to add some additional search terms to our list:
ngss_dictionary <- c("#NGSSchat OR ngss",
'"next generation science standard"',
'"next generation science standards"',
'"next gen science standard"',
'"next gen science standards"')
ngss_tweets <- search_tweets2(ngss_dictionary,
n=5000,
include_rts = FALSE)
Now let’s create a dictionary for the Common Core State Standards and pass that to our search_tweets2() function to get the most recent tweets:
ccss_dictionary <- c("#commoncore", '"common core"')
ccss_tweets <- ccss_dictionary %>%
search_tweets2(n=5000, include_rts = FALSE)
Notice that you can use the pipe operator with the search_tweets2() function just like you would other functions from the tidyverse.
Use the search_tweets() function to create your own custom query for a Twitter hashtag or topic(s) of interest.
Finally, let’s save our tweet files to use in later exercises, since the tweets returned by a search can change from minute to minute. We’ll save them as Microsoft Excel files since one of our columns cannot be stored in a flat file like .csv.
Let’s use the write_xlsx() function from the writexl package just like we would the write_csv() function from readr in Unit 1:
write_xlsx(ngss_tweets, "data/ngss_tweets.xlsx")
write_xlsx(ccss_tweets, "data/csss_tweets.xlsx")
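If you would like to see the save-and-reload round trip without touching your project files, the following is a minimal sketch that writes a small made-up data frame to a temporary file. The data frame toy_tweets and the path toy_path are assumptions for illustration only.

```r
library(writexl)
library(readxl)

# A tiny made-up data frame standing in for real tweet data
toy_tweets <- data.frame(
  screen_name = c("user_a", "user_b"),
  text = c("Loving the new science unit", "Common core math homework again"),
  stringsAsFactors = FALSE
)

# Write to a temporary file so nothing in the project folder is overwritten
toy_path <- tempfile(fileext = ".xlsx")
write_xlsx(toy_tweets, toy_path)

# Read the file back in and confirm the rows and columns survived
toy_back <- read_xlsx(toy_path)
dim(toy_back)
names(toy_back)
```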
For your independent analysis, you may be interested in exploring posts by specific users rather than topics, key words, or hashtags. Yes, there is a function for that too!
For example, let’s create another list containing the usernames of me and some of my colleagues at the Friday Institute using the c() function again and use the get_timelines() function to get the most recent tweets from each of those users:
fi <- c("sbkellogg", "mjsamberg", "haspires", "tarheel93", "drcallie_tweets", "AlexDreier")
fi_tweets <- fi %>%
get_timelines(include_rts=FALSE)
And let’s use the sample_n() function from the dplyr package to pick 10 random tweets and use select() to select and view just the screen_name and text columns that contain the user and the content of their post:
sample_n(fi_tweets, 10) %>%
select(screen_name, text)
## # A tibble: 10 x 2
## screen_name text
## <chr> <chr>
## 1 AlexDreier "@mjsamberg @FridayInstitute You deserve every success and t…
## 2 AlexDreier "And, finally, congratulations to @CTruittNCDPI and the enti…
## 3 drcallie_tweets "Already have one ah-ha. Presenters discussed a three-step f…
## 4 sbkellogg "@greggarner87 💯 used that sentence so many times when I was…
## 5 haspires "Thank you Ellyn Hagerman! https://t.co/yPZba03zz2"
## 6 mjsamberg "Sometimes rearranging things around the house is a terrible…
## 7 mjsamberg "\"Did you get Pfizer, Moderna, or J&J?\" is such an obj…
## 8 AlexDreier "You got yourselves a good one, @dallasnews 🚀 https://t.co/Z…
## 9 mjsamberg "This. Hardware isn’t a limiting factor for the iPad Pro, it…
## 10 haspires "@NationHahn You know your truth!"
We’ve only scratched the surface of the functions available in the rtweet package for searching Twitter. Use the following function to open the package’s introductory vignette and learn more:
vignette("intro", package="rtweet")
To conclude Section 2a, try one of the following search functions from the rtweet vignette:
get_timelines() Get the most recent 3,200 tweets from users.
stream_tweets() Randomly sample (approximately 1%) from the live stream of all tweets.
get_friends() Retrieve a list of all the accounts a user follows.
get_followers() Retrieve a list of the accounts following a user.
get_favorites() Get the most recently favorited statuses by a user.
get_trends() Discover what’s currently trending in a city.
search_users() Search for 1,000 users with the specific hashtag in their profile bios.
Now that we have the data needed to answer our questions, we still have a little bit of work to do to get it ready for analysis. This section will revisit some familiar functions from Unit 1 and introduce a couple new functions:
dplyr functions
select() picks variables based on their names.
slice() lets you select, remove, and duplicate rows.
rename() changes the names of individual variables using new_name = old_name syntax.
filter() picks cases, or rows, based on their values in a specified column.
tidytext functions
unnest_tokens() splits a column into tokens.
anti_join() returns all rows from x without a match in y.
ATTENTION: For those of you who do not have Twitter Developer accounts, you will need to read in the Excel files shared in our course site and also located here: https://github.com/sbkellogg/eci-588/tree/main/unit-2/data
We’ll use the readxl package highlighted in Unit 1 and the read_xlsx() function to read in the data stored in the data folder of our R project:
ngss_tweets <- read_xlsx("data/ngss_tweets.xlsx")
ccss_tweets <- read_xlsx("data/csss_tweets.xlsx")
Note: If you have already created these data frames from 2a. Import Tweets, you do not need to read these files into R unless you want to reproduce the exact same outputs shown in the rest of this walkthrough.
As you are probably already aware, we have way more data than we’ll need for analysis and will need to pare it down quite a bit.
First, let’s use the filter function to subset rows containing only tweets in the English language:
ngss_text <- filter(ngss_tweets, lang == "en")
Now let’s select the following columns from our new ngss_text data frame:
screen_name of the user who created the tweet
created_at timestamp for examining changes in sentiment over time
text containing the tweet which is our primary data source of interest
ngss_text <- select(ngss_text, screen_name, created_at, text)
Since we are interested in comparing the sentiment of NGSS tweets with CCSS tweets, it would be helpful if we had a column for quickly identifying the set of state standards with which each tweet is associated.
We’ll use the mutate() function to create a new variable called standards to label each tweet as “ngss”:
ngss_text <- mutate(ngss_text, standards = "ngss")
And just because it bothers me, I’m going to use the relocate() function to move the standards column to the first position so I can quickly see which standards the tweet is from:
ngss_text <- relocate(ngss_text, standards)
Note that you could also have used the select() function to reorder columns like so:
ngss_text <- select(ngss_text, standards, screen_name, created_at, text)
Finally, let’s rewrite the code above using the %>% operator so there is less redundancy and it is easier to read:
ngss_text <-
ngss_tweets %>%
filter(lang == "en") %>%
select(screen_name, created_at, text) %>%
mutate(standards = "ngss") %>%
relocate(standards)
WARNING: You will not be able to progress to the next section until you have completed the following task:
Create a ccss_text data frame for our ccss_tweets Common Core tweets by modifying the code above.
Finally, let’s combine our ccss_text and ngss_text into a single data frame by using the bind_rows() function from dplyr and simply supplying the data frames that you want to combine as arguments:
tweets <- bind_rows(ngss_text, ccss_text)
And let’s take a quick look at both the head() and the tail() of this new tweets data frame to make sure it contains both “ngss” and “ccss” standards:
head(tweets)
## # A tibble: 6 x 4
## standards screen_name created_at text
## <chr> <chr> <dttm> <chr>
## 1 ngss loyr2662 2021-02-27 17:33:27 "Switching gears for a bit for the…
## 2 ngss loyr2662 2021-02-20 20:02:37 "Was just introduced to the Engine…
## 3 ngss Furlow_teach 2021-02-27 17:03:23 "@IBchemmilam @chemmastercorey I’m…
## 4 ngss Furlow_teach 2021-02-27 14:41:01 "@IBchemmilam @chemmastercorey How…
## 5 ngss TdiShelton 2021-02-27 14:17:34 "I am so honored and appreciative …
## 6 ngss TdiShelton 2021-02-27 15:49:17 "Thank you @brian_womack I loved c…
tail(tweets)
## # A tibble: 6 x 4
## standards screen_name created_at text
## <chr> <chr> <dttm> <chr>
## 1 ccss JosiePaul88… 2021-02-20 00:34:53 "@SenatorHick You realize science …
## 2 ccss ctwittnc 2021-02-19 23:44:18 "@winningatmylife I’ll bet none of…
## 3 ccss the_rbeagle 2021-02-19 23:27:06 "@dmarush @electronlove @Montgomer…
## 4 ccss silea 2021-02-19 23:11:21 "@LizerReal I don’t think that’s i…
## 5 ccss JodyCoyote12 2021-02-19 22:58:25 "@CarlaRK3 @NedLamont Fully fund p…
## 6 ccss Ryan_Hawes 2021-02-19 22:41:01 "I just got an \"explainer\" on ho…
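The filter, select, mutate, relocate, and bind_rows steps above can be sketched end to end on a couple of made-up tweets. Everything in this example is an assumption for illustration: the toy data frames, their contents, and the small label_tweets() helper are not part of the walkthrough's data or code.

```r
library(dplyr)

# Made-up stand-ins for the raw data frames returned by a Twitter search
toy_ngss <- tibble(
  screen_name = c("teacher_a", "teacher_b"),
  created_at = as.POSIXct(c("2021-02-27 17:33:27", "2021-02-27 14:17:34"), tz = "UTC"),
  text = c("Great phenomena lesson today", "Gran leccion de ciencias"),
  lang = c("en", "es")
)
toy_ccss <- tibble(
  screen_name = "parent_c",
  created_at = as.POSIXct("2021-02-19 23:44:18", tz = "UTC"),
  text = "Confused by this math worksheet",
  lang = "en"
)

# Hypothetical helper that applies the same wrangling steps to any tweet data frame
label_tweets <- function(data, label) {
  data %>%
    filter(lang == "en") %>%                    # keep English tweets only
    select(screen_name, created_at, text) %>%   # keep the three columns of interest
    mutate(standards = label) %>%               # tag each row with its standards
    relocate(standards)                         # move the tag to the first column
}

# Stack the two labelled data frames on top of each other
toy_tweets <- bind_rows(
  label_tweets(toy_ngss, "ngss"),
  label_tweets(toy_ccss, "ccss")
)
toy_tweets
```

Wrapping the steps in a function is only one option; repeating the pipeline once per data frame, as the walkthrough does, works just as well for two data sets.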
We have a couple remaining steps to tidy our text that hopefully should feel familiar by this point. If you recall from Chapter 1 of Text Mining With R, Silge & Robinson describe tokens as:
A meaningful unit of text, such as a word, that we are interested in using for analysis, and tokenization is the process of splitting text into tokens. This one-token-per-row structure is in contrast to the ways text is often stored in current analyses, perhaps as strings or in a document-term matrix.
First, let’s tokenize our tweets by using the unnest_tokens() function to split each tweet into individual words, one word per row, to make it easier to analyze:
tweet_tokens <-
tweets %>%
unnest_tokens(output = word,
input = text,
token = "tweets")
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
Notice that we’ve included an additional argument in the call to unnest_tokens(). Specifically, we used the specialized “tweets” tokenizer in the token = argument, which is very useful for dealing with Twitter text or other text from online forums because it retains hashtags and mentions of usernames with the @ symbol.
Now let’s remove stop words like “the” and “a” that don’t help us learn much about what people are tweeting about the state standards.
tidy_tweets <-
tweet_tokens %>%
anti_join(stop_words, by = "word")
Notice that we’ve specified the by = argument to look for matching words in the word column for both data sets and remove any rows from the tweet_tokens dataset that match the stop_words dataset. Remember when we first tokenized our dataset I conveniently chose output = word as the column name because it matches the column name word in the stop_words dataset contained in the tidytext package. This makes our call to anti_join() simpler because anti_join() knows to look for the column named word in each dataset. However, this wasn’t really necessary since word is the only matching column name in both datasets and it would have matched those columns by default.
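To see tokenization and stop word removal in isolation, here is a minimal sketch on two made-up sentences. It assumes the default word tokenizer rather than the “tweets” tokenizer used above, so it does not depend on which tokenizers your installed version of tidytext supports, and the toy data frame is hypothetical.

```r
library(dplyr)
library(tidytext)

# Two made-up tweets, one per set of standards
toy_tweets <- tibble(
  standards = c("ngss", "ccss"),
  text = c("The students loved the lab", "This is a confusing worksheet")
)

# One row per word; the default tokenizer lowercases and strips punctuation
toy_tokens <- toy_tweets %>%
  unnest_tokens(output = word, input = text)

# Drop every row whose word appears in the stop_words data set
toy_tidy <- toy_tokens %>%
  anti_join(stop_words, by = "word")

toy_tidy
```

Comparing nrow(toy_tokens) with nrow(toy_tidy) is a quick way to see how many tokens the stop word list removed.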
Before wrapping up, let’s take a quick count of the most common words in our tidy_tweets data frame:
count(tidy_tweets, word, sort = T)
## # A tibble: 7,524 x 2
## word n
## <chr> <int>
## 1 common 1089
## 2 core 1083
## 3 math 434
## 4 students 140
## 5 #ngss 131
## 6 school 127
## 7 teachers 122
## 8 amp 120
## 9 kids 111
## 10 standards 111
## # … with 7,514 more rows
Notice that the nonsense word “amp” is in our top ten words. If we use the filter() function and grepl() query from Unit 1 on our tweets data frame, we can see that “amp” seems to be some sort of HTML residue (most likely left over from the HTML entity used for the ampersand) that we might want to get rid of.
filter(tweets, grepl('amp', text))
## # A tibble: 124 x 4
## standards screen_name created_at text
## <chr> <chr> <dttm> <chr>
## 1 ngss TdiShelton 2021-02-27 14:17:34 "I am so honored and appreciativ…
## 2 ngss STEMTeachToo… 2021-02-27 16:25:04 "Open, non-hierarchical communic…
## 3 ngss NGSSphenomena 2021-02-25 13:24:22 "Bacteria have music preferences…
## 4 ngss CTSKeeley 2021-02-21 21:50:04 "Today I was thinking about the …
## 5 ngss richbacolor 2021-02-24 14:14:49 "Last chance to register for @MS…
## 6 ngss MrsEatonELL 2021-02-27 06:24:09 "Were we doing the hand jive? No…
## 7 ngss STEMuClaytion 2021-02-24 14:56:19 "#WonderWednesday w/ questions t…
## 8 ngss LearningUNDF… 2021-02-24 18:13:01 "Are candies like M&Ms and S…
## 9 ngss abeslo 2021-02-26 18:54:31 "#M'Kenna, whose story we share …
## 10 ngss E3Chemistry 2021-02-25 14:15:20 "Molarity & Parts Per Millio…
## # … with 114 more rows
Let’s rewrite our stop word code to add a custom stop word to filter out rows with “amp” in them:
tidy_tweets <-
tweet_tokens %>%
anti_join(stop_words, by = "word") %>%
filter(!word == "amp")
Note that we could extend this filter to weed out any additional words that don’t carry much meaning but skew our data by being so prominent.
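One way to extend the filter to several words at once is to keep them in a small custom stop word table and remove them with anti_join(). The sketch below uses made-up tokens, and the words in custom_stop_words are hypothetical; substitute whatever noise shows up in your own data.

```r
library(dplyr)

# Made-up tokens standing in for tweet_tokens
toy_tokens <- tibble(word = c("amp", "math", "rt", "students", "amp"))

# Hypothetical custom stop words collected in one place
custom_stop_words <- tibble(word = c("amp", "rt"))

# Remove every token that appears in the custom list
toy_clean <- toy_tokens %>%
  anti_join(custom_stop_words, by = "word")

toy_clean$word
```

A filter such as filter(!word %in% c("amp", "rt")) would give the same result; the table approach is mainly convenient when the list grows.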
We’ve created some unnecessarily lengthy code to demonstrate some of the steps in the tidying process. Rewrite the tokenization and removal of stop words processes into a more compact series of commands and save your data frame as tidy_tweets.
Now that we have our tweets nice and tidy, we’re almost ready to begin exploring public sentiment (at least for the past week due to Twitter API rate limits) around the CCSS and NGSS standards. For this part of our workflow we introduce two new functions from the tidytext and dplyr packages respectively:
get_sentiments() returns specific sentiment lexicons with the associated measures for each word in the lexicon
inner_join() returns all rows from x where there are matching values in y, and all columns from x and y.
For a quick overview of the different join functions with helpful visuals, visit: https://statisticsglobe.com/r-dplyr-join-inner-left-right-full-semi-anti
Recall from our readings that sentiment analysis tries to evaluate words for their emotional association. Silge & Robinson point out that, “one way to analyze the sentiment of a text is to consider the text as a combination of its individual words and the sentiment content of the whole text as the sum of the sentiment content of the individual words.” As our readings from last week illustrated, this isn’t the only way to approach sentiment analysis, but it is an easier entry point into sentiment analysis and often-used.
The tidytext package provides access to several sentiment lexicons based on unigrams, i.e., single words. These lexicons contain many English words and the words are assigned scores for positive/negative sentiment, and also possibly emotions like joy, anger, sadness, and so forth.
The three general-purpose lexicons we’ll focus on are:
AFINN assigns words with a score that runs between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.
bing categorizes words in a binary fashion into positive and negative categories.
nrc categorizes words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.
Note that if this is your first time using the AFINN and NRC lexicons, you’ll be prompted to download both. Respond yes to each prompt by entering “1” and the lexicon will download. You’ll only have to do this the first time you use each lexicon.
Let’s take a quick look at each of these lexicons using the get_sentiments() function and assign them to their respective names for later use:
afinn <- get_sentiments("afinn")
afinn
## # A tibble: 2,477 x 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # … with 2,467 more rows
bing <- get_sentiments("bing")
bing
## # A tibble: 6,786 x 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # … with 6,776 more rows
nrc <- get_sentiments("nrc")
nrc
## # A tibble: 13,901 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # … with 13,891 more rows
And just out of curiosity, let’s take a look at the loughran lexicon as well:
loughran <- get_sentiments("loughran")
loughran
## # A tibble: 4,150 x 2
## word sentiment
## <chr> <chr>
## 1 abandon negative
## 2 abandoned negative
## 3 abandoning negative
## 4 abandonment negative
## 5 abandonments negative
## 6 abandons negative
## 7 abdicated negative
## 8 abdicates negative
## 9 abdicating negative
## 10 abdication negative
## # … with 4,140 more rows
We’ve reached the final step in our data wrangling process before we can begin exploring our data to address our questions.
In the previous section, we used anti_join() to remove stop words in our dataset. For sentiment analysis, we’re going to use the inner_join() function to do something similar. However, instead of removing rows that contain words matching those in our stop words dictionary, inner_join() allows us to keep only the rows with words that match words in our sentiment lexicons, or dictionaries, along with the sentiment measure for that word from the sentiment lexicon.
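The contrast between the two joins is easiest to see side by side on a toy example. In this sketch both the tokens and the three-word lexicon are made up, and the scores are invented for illustration rather than taken from AFINN.

```r
library(dplyr)

# Made-up tokens, one word per row
toy_tokens <- tibble(
  standards = c("ngss", "ngss", "ccss", "ccss"),
  word = c("love", "lab", "confusing", "worksheet")
)

# A hypothetical three-word lexicon with invented scores
toy_lexicon <- tibble(
  word = c("love", "confusing", "terrible"),
  value = c(3, -2, -3)
)

# inner_join() keeps only the matching words and adds the lexicon's value column
matched <- inner_join(toy_tokens, toy_lexicon, by = "word")

# anti_join() does the opposite: it keeps only the words with no match
unmatched <- anti_join(toy_tokens, toy_lexicon, by = "word")

matched
unmatched
```

Note that words missing from the lexicon, such as “lab” here, simply drop out of the joined result, which is why a sentiment data frame has fewer rows than the tokens it was built from.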
Let’s use inner_join() to combine our tidy_tweets and afinn data frames, keeping only rows with matching data in the word column:
sentiment_afinn <- inner_join(tidy_tweets, afinn, by = "word")
sentiment_afinn
## # A tibble: 1,520 x 5
## standards screen_name created_at word value
## <chr> <chr> <dttm> <chr> <dbl>
## 1 ngss loyr2662 2021-02-27 17:33:27 win 4
## 2 ngss Furlow_teach 2021-02-27 17:03:23 love 3
## 3 ngss Furlow_teach 2021-02-27 17:03:23 sweet 2
## 4 ngss Furlow_teach 2021-02-27 17:03:23 significance 1
## 5 ngss TdiShelton 2021-02-27 14:17:34 honored 2
## 6 ngss TdiShelton 2021-02-27 14:17:34 opportunity 2
## 7 ngss TdiShelton 2021-02-27 14:17:34 wonderful 4
## 8 ngss TdiShelton 2021-02-27 14:17:34 powerful 2
## 9 ngss TdiShelton 2021-02-27 15:49:17 loved 3
## 10 ngss TdiShelton 2021-02-27 16:51:32 share 1
## # … with 1,510 more rows
Notice that each word in your sentiment_afinn data frame now contains a value ranging from -5 (very negative) to 5 (very positive).
sentiment_bing <- inner_join(tidy_tweets, bing, by = "word")
sentiment_bing
## # A tibble: 1,637 x 5
## standards screen_name created_at word sentiment
## <chr> <chr> <dttm> <chr> <chr>
## 1 ngss loyr2662 2021-02-27 17:33:27 win positive
## 2 ngss Furlow_teach 2021-02-27 17:03:23 love positive
## 3 ngss Furlow_teach 2021-02-27 17:03:23 helped positive
## 4 ngss Furlow_teach 2021-02-27 17:03:23 sweet positive
## 5 ngss Furlow_teach 2021-02-27 17:03:23 tough positive
## 6 ngss TdiShelton 2021-02-27 14:17:34 honored positive
## 7 ngss TdiShelton 2021-02-27 14:17:34 appreciative positive
## 8 ngss TdiShelton 2021-02-27 14:17:34 wonderful positive
## 9 ngss TdiShelton 2021-02-27 14:17:34 powerful positive
## 10 ngss TdiShelton 2021-02-27 15:49:17 loved positive
## # … with 1,627 more rows
Create a sentiment_nrc data frame using the code above.
What do you notice about tidy_tweets and the data frames with sentiment values attached? Why did this happen?
Note: To complete the following section, you’ll need the sentiment_nrc data frame.
Now that we have our tweets tidied and sentiments joined, we’re ready for a little data exploration. As highlighted in Unit 1, calculating summary statistics, data visualization, and feature engineering (the process of creating new variables from a dataset) are a key part of exploratory data analysis. One goal in this phase is to explore questions that drove the original analysis and develop new questions and hypotheses to test in later stages. Topics addressed in Section 3 include:
Before we dig into sentiment, let’s use the handy ts_plot function built into rtweet to take a very quick look at how far back our tidied tweets data set goes:
ts_plot(tweets, by = "days")
Notice that this effectively creates a ggplot time series plot for us. I’ve included the by = argument, which by default is set to “days”. It looks like tweets go back 9 days, which is the limit set by Twitter.
Try changing it to “hours” and see what happens.
Use ts_plot with the group_by function to compare the number of tweets over time by Next Gen and Common Core standards.
Hint: use the ?ts_plot help function to check the examples to see how this can be done.
Your line graph should look something like this:
Since our primary goal is to compare public sentiment around the NGSS and CCSS state standards, in this section we put together some basic numerical summaries using our different lexicons to see whether tweets are generally more positive or negative for each standard, as well as differences between the two. To do this, we revisit the following dplyr functions:
count() lets you quickly count the unique values of one or more variables
group_by() takes a data frame and one or more variables to group by
summarise() creates a numerical summary of data using arguments like mean() and median()
mutate() adds new variables and preserves existing ones
And introduce one new function:
spread()
Let’s start with bing, our simplest sentiment lexicon, and use the count function to count how many times “positive” and “negative” occur in the sentiment column of our sentiment_bing data frame:
summary_bing <- count(sentiment_bing, sentiment, sort = TRUE)
Collectively, it looks like our combined dataset has more negative words than positive words.
summary_bing
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 974
## 2 positive 663
Since our main goal is to compare positive and negative sentiment between CCSS and NGSS, let’s use the group_by function again to get sentiment summaries for NGSS and CCSS separately:
summary_bing <- sentiment_bing %>%
group_by(standards) %>%
count(sentiment)
summary_bing
## # A tibble: 4 x 3
## # Groups: standards [2]
## standards sentiment n
## <chr> <chr> <int>
## 1 ccss negative 914
## 2 ccss positive 437
## 3 ngss negative 60
## 4 ngss positive 226
It looks like CCSS tweets contain far more negative words than positive ones, while NGSS tweets skew much more positive. So far, this is pretty consistent with the Rosenberg et al. findings!
Our last step will be to calculate a single sentiment “score” for our tweets that we can use for quick comparison, and to create a new variable indicating which lexicon we used.
First, let’s untidy our data a little by using the spread function from the tidyr package to transform our sentiment column into separate columns for negative and positive that contain the n counts for each:
summary_bing <- sentiment_bing %>%
group_by(standards) %>%
count(sentiment, sort = TRUE) %>%
spread(sentiment, n)
summary_bing
## # A tibble: 2 x 3
## # Groups: standards [2]
## standards negative positive
## <chr> <int> <int>
## 1 ccss 914 437
## 2 ngss 60 226
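If you are working with a recent version of tidyr, the same reshaping can also be written with pivot_wider(), the newer function that supersedes spread(). The sketch below is an alternative, not what this walkthrough uses. It rebuilds the grouped bing counts by hand (bing_counts is a stand-in name) so it runs on its own:

```r
library(dplyr)
library(tidyr)

# Counts copied from the grouped bing summary shown earlier
bing_counts <- tibble(
  standards = c("ccss", "ccss", "ngss", "ngss"),
  sentiment = c("negative", "positive", "negative", "positive"),
  n         = c(914L, 437L, 60L, 226L)
)

# names_from supplies the new column names, values_from the cell contents
bing_wide <- bing_counts %>%
  pivot_wider(names_from = sentiment, values_from = n)

bing_wide
```

Either function gives one row per set of standards with a negative and a positive column, so the rest of the workflow is unchanged.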
Finally, we’ll use the mutate function to create two new variables, sentiment and lexicon, so we have a single sentiment score and the lexicon from which it was derived:
summary_bing <- sentiment_bing %>%
group_by(standards) %>%
count(sentiment, sort = TRUE) %>%
spread(sentiment, n) %>%
mutate(sentiment = positive - negative) %>%
mutate(lexicon = "bing") %>%
relocate(lexicon)
summary_bing
## # A tibble: 2 x 5
## # Groups: standards [2]
## lexicon standards negative positive sentiment
## <chr> <chr> <int> <int> <int>
## 1 bing ccss 914 437 -477
## 2 bing ngss 60 226 166
There we go. Now we can see that CCSS scores negative, while NGSS is overall positive.
Let’s calculate a quick score using the afinn lexicon now. Remember that AFINN provides a value from -5 to 5 for each word:
head(sentiment_afinn)
## # A tibble: 6 x 5
## standards screen_name created_at word value
## <chr> <chr> <dttm> <chr> <dbl>
## 1 ngss loyr2662 2021-02-27 17:33:27 win 4
## 2 ngss Furlow_teach 2021-02-27 17:03:23 love 3
## 3 ngss Furlow_teach 2021-02-27 17:03:23 sweet 2
## 4 ngss Furlow_teach 2021-02-27 17:03:23 significance 1
## 5 ngss TdiShelton 2021-02-27 14:17:34 honored 2
## 6 ngss TdiShelton 2021-02-27 14:17:34 opportunity 2
To calculate a summary score, we will need to first group our data by standards again and then use the summarise function to create a new sentiment variable by adding all the positive and negative scores in the value column:
summary_afinn <- sentiment_afinn %>%
group_by(standards) %>%
summarise(sentiment = sum(value)) %>%
mutate(lexicon = "AFINN") %>%
relocate(lexicon)
summary_afinn
## # A tibble: 2 x 3
## lexicon standards sentiment
## <chr> <chr> <dbl>
## 1 AFINN ccss -833
## 2 AFINN ngss 502
Again, CCSS is overall negative while NGSS is overall positive!
For your final task for this walkthrough, calculate a single sentiment score for NGSS and CCSS using the remaining nrc and loughran lexicons and answer the following question: Are the findings above still consistent?
Hint: The nrc lexicon contains “positive” and “negative” values just like bing and loughran, but also includes values like “trust” and “sadness” as shown below. You will need to use the filter() function to select only the rows that contain “positive” or “negative.”
nrc
## # A tibble: 13,901 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # … with 13,891 more rows
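A minimal sketch of the filtering step, run on made-up rows rather than the real lexicon join, might look like the following. The toy_nrc data frame and its word labels are hypothetical stand-ins for your own sentiment_nrc data:

```r
library(dplyr)
library(tidyr)

# Hypothetical word-level matches standing in for a real sentiment_nrc data frame
toy_nrc <- tibble(
  standards = c("ngss", "ngss", "ngss", "ccss", "ccss", "ccss", "ccss"),
  word      = c("learn", "learn", "excited",
                "abandon", "abandon", "abandon", "share"),
  sentiment = c("positive", "trust", "positive",
                "fear", "negative", "sadness", "positive")
)

toy_summary <- toy_nrc %>%
  # keep only the two polarity labels; drop emotions like trust and fear
  filter(sentiment %in% c("positive", "negative")) %>%
  count(standards, sentiment) %>%
  # fill = 0L avoids NA when a group has no rows for one label
  spread(sentiment, n, fill = 0L)

toy_summary
```

The fill argument matters on small samples: without it, a set of standards with no negative words would show NA instead of 0 in the negative column.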
## # A tibble: 2 x 5
## # Groups: standards [2]
## standards method negative positive sentiment
## <chr> <chr> <int> <int> <dbl>
## 1 ccss nrc 774 2212 2.86
## 2 ngss nrc 74 544 7.35
As highlighted in Chapter 3 of Data Science in Education Using R, the Model step of the data science process entails “using statistical models, from simple to complex, to understand trends and patterns in the data.” The authors note that while descriptive statistics and data visualization during the Explore step can help us to identify patterns and relationships in our data, statistical models can be used to help us determine if relationships, patterns and trends are actually meaningful.
Recall from the PREPARE section that the Rosenberg et al. study was guided by the following questions:
Similar to our sentiment summary using the AFINN lexicon, the Rosenberg et al. study used the -5 to 5 sentiment score from the SentiStrength lexicon to answer RQ #1. To address the remaining questions, the authors used a mixed effects model (also known as a multi-level or hierarchical linear model) via the lme4 package in R.
Collectively, the authors found that:
The final(ish) step in our workflow/process is sharing the results of our analysis with a wider audience. Krumm et al. (2018) outlined the following 3-step process for communicating with education stakeholders what you have learned through analysis:
Remember that the questions of interest that we want to focus on for our selection, polishing, and narration include:
To address questions 1 and 2, I’m going to focus my analyses, data products and sharing format on the following:
Using the bing, nrc, and loughran lexicons, I’ll create some 100% stacked bars showing the percentage of positive and negative words among all tweets for the NGSS and CCSS.
I want to try and replicate as closely as possible the approach Rosenberg et al. used in their analysis. To do that, I can recycle some R code I used in section 2b. Tidy Text.
To polish my analyses and prepare, first I need to rebuild the tweets dataset from my ngss_tweets and ccss_tweets and select both the status_id that is unique to each tweet, and the text column which contains the actual post:
ngss_text <-
ngss_tweets %>%
filter(lang == "en") %>%
select(status_id, text) %>%
mutate(standards = "ngss") %>%
relocate(standards)
ccss_text <-
ccss_tweets %>%
filter(lang == "en") %>%
select(status_id, text) %>%
mutate(standards = "ccss") %>%
relocate(standards)
tweets <- bind_rows(ngss_text, ccss_text)
tweets
## # A tibble: 1,441 x 3
## standards status_id text
## <chr> <chr> <chr>
## 1 ngss 136571669033664… "Switching gears for a bit for the \"Crosscutting…
## 2 ngss 136321751376141… "Was just introduced to the Engineering Habits of…
## 3 ngss 136570912276365… "@IBchemmilam @chemmastercorey I’m familiar w/ it…
## 4 ngss 136567329436042… "@IBchemmilam @chemmastercorey How well does this…
## 5 ngss 136566739318860… "I am so honored and appreciative to have an oppo…
## 6 ngss 136569047726628… "Thank you @brian_womack I loved connecting with …
## 7 ngss 136570614049613… "Please share #NGSSchat PLN! https://t.co/Qc2c3eW…
## 8 ngss 136366932814767… "So excited about this weekend’s learning... plea…
## 9 ngss 136544278654421… "The Educators Evaluating the Quality of Instruct…
## 10 ngss 136435814916417… "Foster existing teacher social networks that exh…
## # … with 1,431 more rows
The status_id is important because, like Rosenberg et al., I want to calculate an overall sentiment score for each tweet, rather than for each word.
Before I get that far, however, I’ll need to tidy my tweets again and attach my sentiment scores.
Note that the closest lexicon we have available in our tidytext package to the SentiStrength lexicon used by Rosenberg et al. is the AFINN lexicon, which also uses a -5 to 5 point scale.
So let’s use unnest_tokens to tidy our tweets, remove stop words, and add afinn scores to each word, similar to what we did in section 2c. Add Sentiment Values:
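sentiment_afinn <- tweets %>%
  # split each tweet into one word per row
  unnest_tokens(output = word,
                input = text,
                token = "tweets") %>%
  # drop common stop words such as "the" and "of"
  anti_join(stop_words, by = "word") %>%
  # drop "amp", left over from HTML-encoded ampersands
  filter(word != "amp") %>%
  # keep only words that have an AFINN score
  inner_join(afinn, by = "word")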
sentiment_afinn
## # A tibble: 1,520 x 4
## standards status_id word value
## <chr> <chr> <chr> <dbl>
## 1 ngss 1365716690336645124 win 4
## 2 ngss 1365709122763653133 love 3
## 3 ngss 1365709122763653133 sweet 2
## 4 ngss 1365709122763653133 significance 1
## 5 ngss 1365667393188601857 honored 2
## 6 ngss 1365667393188601857 opportunity 2
## 7 ngss 1365667393188601857 wonderful 4
## 8 ngss 1365667393188601857 powerful 2
## 9 ngss 1365690477266284545 loved 3
## 10 ngss 1365706140496130050 share 1
## # … with 1,510 more rows
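To see what the two joins in that pipeline are doing, here is a minimal sketch on words that are already tokenized. The toy_tokens, toy_stop_words, and toy_lexicon objects and their scores are hypothetical miniatures, not the real stop_words or AFINN data:

```r
library(dplyr)

# Pre-tokenized toy data: one row per word, as unnest_tokens would produce
toy_tokens <- tibble(
  status_id = c("1", "1", "1", "2", "2", "2"),
  word      = c("i", "love", "ngss", "the", "worst", "standards")
)

# Hypothetical miniature stop-word list and scored lexicon
toy_stop_words <- tibble(word = c("i", "the"))
toy_lexicon    <- tibble(word = c("love", "worst"), value = c(3, -3))

toy_scored <- toy_tokens %>%
  anti_join(toy_stop_words, by = "word") %>%  # drop rows that match a stop word
  inner_join(toy_lexicon, by = "word")        # keep only rows with a lexicon score

toy_scored
```

Note that inner_join silently drops every word the lexicon does not score, which is why the scored data frame has far fewer rows than the tokenized tweets.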
Next, I want to calculate a single score for each tweet. To do that, I’ll use the by now familiar group_by and summarise functions:
afinn_score <- sentiment_afinn %>%
group_by(standards, status_id) %>%
summarise(value = sum(value))
afinn_score
## # A tibble: 842 x 3
## # Groups: standards [2]
## standards status_id value
## <chr> <chr> <dbl>
## 1 ccss 1362894990813188096 -2
## 2 ccss 1362899370199445508 4
## 3 ccss 1362906588021989376 -2
## 4 ccss 1362910494487535618 -9
## 5 ccss 1362910913855160320 -3
## 6 ccss 1362928225379250179 2
## 7 ccss 1362933982074073090 -1
## 8 ccss 1362947497258151945 -3
## 9 ccss 1362949805694013446 3
## 10 ccss 1362970614282264583 3
## # … with 832 more rows
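The “# Groups: standards [2]” line in the output appears because summarise only drops the last grouping variable. If you would rather get an ungrouped result, one option is the .groups argument, sketched here on made-up scores (toy_scored and its values are assumptions for illustration):

```r
library(dplyr)

# Made-up word-level scores for three tweets
toy_scored <- tibble(
  standards = c("ngss", "ngss", "ngss", "ccss", "ccss"),
  status_id = c("a", "a", "b", "c", "c"),
  value     = c(3, 2, -1, -2, -3)
)

# One row per tweet; .groups = "drop" returns an ungrouped tibble
toy_tweet_scores <- toy_scored %>%
  group_by(standards, status_id) %>%
  summarise(value = sum(value), .groups = "drop")

toy_tweet_scores
```

The walkthrough keeps the default grouping because the later steps group by standards anyway, so either choice works here.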
And like Rosenberg et al., I’ll use the mutate function to create a new sentiment column that flags whether each tweet is “positive” or “negative.”
To do this, we introduce the new if_else function from the dplyr package. This if_else function adds “negative” to the sentiment column if the score in the value column of the corresponding row is less than 0. If not, it adds “positive” to the row.
afinn_sentiment <- afinn_score %>%
filter(value != 0) %>%
mutate(sentiment = if_else(value < 0, "negative", "positive"))
afinn_sentiment
## # A tibble: 801 x 4
## # Groups: standards [2]
## standards status_id value sentiment
## <chr> <chr> <dbl> <chr>
## 1 ccss 1362894990813188096 -2 negative
## 2 ccss 1362899370199445508 4 positive
## 3 ccss 1362906588021989376 -2 negative
## 4 ccss 1362910494487535618 -9 negative
## 5 ccss 1362910913855160320 -3 negative
## 6 ccss 1362928225379250179 2 positive
## 7 ccss 1362933982074073090 -1 negative
## 8 ccss 1362947497258151945 -3 negative
## 9 ccss 1362949805694013446 3 positive
## 10 ccss 1362970614282264583 3 positive
## # … with 791 more rows
Note that since a tweet sentiment score equal to 0 is neutral, I used the filter function to remove it from the dataset.
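The sketch below shows why the filter step matters, using four made-up tweet scores (toy_scores is a hypothetical data frame). Without the filter, a score of 0 would fail the value < 0 test and be labelled “positive”:

```r
library(dplyr)

# Made-up tweet-level scores, including one neutral tweet
toy_scores <- tibble(
  status_id = c("a", "b", "c", "d"),
  value     = c(5, -1, 0, -4)
)

toy_flags <- toy_scores %>%
  filter(value != 0) %>%  # remove the neutral tweet first
  mutate(sentiment = if_else(value < 0, "negative", "positive"))

toy_flags
```

Tweet "c" is dropped, and the remaining three tweets are flagged by the sign of their score.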
Finally, we’re ready to compute our ratio. We’ll use the group_by function and count the number of tweets for each of the standards that are positive or negative in the sentiment column. Then we’ll use the spread function to move them into separate columns so we can perform a quick calculation to compute the ratio.
afinn_ratio <- afinn_sentiment %>%
group_by(standards) %>%
count(sentiment) %>%
spread(sentiment, n) %>%
mutate(ratio = negative/positive)
afinn_ratio
## # A tibble: 2 x 4
## # Groups: standards [2]
## standards negative positive ratio
## <chr> <int> <int> <dbl>
## 1 ccss 417 202 2.06
## 2 ngss 18 164 0.110
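As a quick arithmetic check, the sketch below recomputes the ratio from the counts in the output above and also shows the inverse (positive tweets per negative tweet), which some readers find easier to interpret. The column names neg_per_pos and pos_per_neg are my own labels, not from the original analysis:

```r
library(dplyr)

# Tweet counts copied from the afinn_ratio output above
ratio_check <- tibble(
  standards = c("ccss", "ngss"),
  negative  = c(417, 18),
  positive  = c(202, 164)
) %>%
  mutate(
    neg_per_pos = negative / positive,  # the ratio column in the walkthrough
    pos_per_neg = positive / negative   # the same comparison, inverted
  )

ratio_check
```

Read this way, our sample has roughly two negative CCSS tweets for every positive one, while NGSS has about nine positive tweets for every negative one.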
Finally, I’ll create a pie chart showing the proportion of positive and negative tweets for the NGSS:
afinn_counts <- afinn_sentiment %>%
group_by(standards) %>%
count(sentiment) %>%
filter(standards == "ngss")
afinn_counts %>%
ggplot(aes(x="", y=n, fill=sentiment)) +
geom_bar(width = .6, stat = "identity") +
labs(title = "Next Gen Science Standards",
subtitle = "Proportion of Positive & Negative Tweets") +
coord_polar(theta = "y") +
theme_void()
Finally, to address Question 2, I want to compare the percentage of positive and negative words contained in the corpus of tweets for the NGSS and CCSS standards using the four different lexicons to see how sentiment compares based on lexicon used.
I’ll begin by polishing my previous summaries and creating identical summaries for each lexicon that contain the following columns: method, standards, sentiment, and n, or word counts:
summary_afinn2 <- sentiment_afinn %>%
group_by(standards) %>%
filter(value != 0) %>%
mutate(sentiment = if_else(value < 0, "negative", "positive")) %>%
count(sentiment, sort = TRUE) %>%
mutate(method = "AFINN")
summary_bing2 <- sentiment_bing %>%
group_by(standards) %>%
count(sentiment, sort = TRUE) %>%
mutate(method = "bing")
summary_nrc2 <- sentiment_nrc %>%
filter(sentiment %in% c("positive", "negative")) %>%
group_by(standards) %>%
count(sentiment, sort = TRUE) %>%
mutate(method = "nrc")
summary_loughran2 <- sentiment_loughran %>%
filter(sentiment %in% c("positive", "negative")) %>%
group_by(standards) %>%
count(sentiment, sort = TRUE) %>%
mutate(method = "loughran")
Next, I’ll combine those four data frames together using the bind_rows function again:
summary_sentiment <- bind_rows(summary_afinn2,
summary_bing2,
summary_nrc2,
summary_loughran2) %>%
arrange(method, standards) %>%
relocate(method)
summary_sentiment
## # A tibble: 16 x 4
## # Groups: standards [2]
## method standards sentiment n
## <chr> <chr> <chr> <int>
## 1 AFINN ccss negative 740
## 2 AFINN ccss positive 468
## 3 AFINN ngss positive 273
## 4 AFINN ngss negative 39
## 5 bing ccss negative 914
## 6 bing ccss positive 437
## 7 bing ngss positive 226
## 8 bing ngss negative 60
## 9 loughran ccss negative 440
## 10 loughran ccss positive 112
## 11 loughran ngss negative 68
## 12 loughran ngss positive 54
## 13 nrc ccss positive 2212
## 14 nrc ccss negative 774
## 15 nrc ngss positive 544
## 16 nrc ngss negative 74
Then I’ll create a new data frame that has the total word counts for each set of standards and each method and join that to my summary_sentiment data frame:
total_counts <- summary_sentiment %>%
group_by(method, standards) %>%
summarise(total = sum(n))
## `summarise()` has grouped output by 'method'. You can override using the `.groups` argument.
sentiment_counts <- left_join(summary_sentiment, total_counts)
## Joining, by = c("method", "standards")
sentiment_counts
## # A tibble: 16 x 5
## # Groups: standards [2]
## method standards sentiment n total
## <chr> <chr> <chr> <int> <int>
## 1 AFINN ccss negative 740 1208
## 2 AFINN ccss positive 468 1208
## 3 AFINN ngss positive 273 312
## 4 AFINN ngss negative 39 312
## 5 bing ccss negative 914 1351
## 6 bing ccss positive 437 1351
## 7 bing ngss positive 226 286
## 8 bing ngss negative 60 286
## 9 loughran ccss negative 440 552
## 10 loughran ccss positive 112 552
## 11 loughran ngss negative 68 122
## 12 loughran ngss positive 54 122
## 13 nrc ccss positive 2212 2986
## 14 nrc ccss negative 774 2986
## 15 nrc ngss positive 544 618
## 16 nrc ngss negative 74 618
Finally, I’ll add a new column that calculates the percentage of positive and negative words for each set of state standards:
sentiment_percents <- sentiment_counts %>%
mutate(percent = n/total * 100)
sentiment_percents
## # A tibble: 16 x 6
## # Groups: standards [2]
## method standards sentiment n total percent
## <chr> <chr> <chr> <int> <int> <dbl>
## 1 AFINN ccss negative 740 1208 61.3
## 2 AFINN ccss positive 468 1208 38.7
## 3 AFINN ngss positive 273 312 87.5
## 4 AFINN ngss negative 39 312 12.5
## 5 bing ccss negative 914 1351 67.7
## 6 bing ccss positive 437 1351 32.3
## 7 bing ngss positive 226 286 79.0
## 8 bing ngss negative 60 286 21.0
## 9 loughran ccss negative 440 552 79.7
## 10 loughran ccss positive 112 552 20.3
## 11 loughran ngss negative 68 122 55.7
## 12 loughran ngss positive 54 122 44.3
## 13 nrc ccss positive 2212 2986 74.1
## 14 nrc ccss negative 774 2986 25.9
## 15 nrc ngss positive 544 618 88.0
## 16 nrc ngss negative 74 618 12.0
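The totals and percentages can also be computed in one grouped mutate, without building a separate totals table and joining it back. The sketch below is an alternative under that assumption, using only the bing counts from the table above (bing_counts is a stand-in name):

```r
library(dplyr)

# Word counts copied from the bing rows of the summary table above
bing_counts <- tibble(
  method    = "bing",
  standards = c("ccss", "ccss", "ngss", "ngss"),
  sentiment = c("negative", "positive", "positive", "negative"),
  n         = c(914, 437, 226, 60)
)

bing_percents <- bing_counts %>%
  group_by(method, standards) %>%
  # sum(n) is evaluated within each method/standards group
  mutate(total = sum(n),
         percent = n / total * 100) %>%
  ungroup()

bing_percents
```

Because mutate respects the grouping, each row is divided by its own group total, which matches what the join-based approach produces for these rows.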
Now that I have my sentiment percent summaries for each lexicon, I’m going to create my 100% stacked bar charts for each lexicon:
sentiment_percents %>%
ggplot(aes(x = standards, y = percent, fill=sentiment)) +
geom_bar(width = .8, stat = "identity") +
facet_wrap(~method, ncol = 1) +
coord_flip() +
labs(title = "Public Sentiment on Twitter",
subtitle = "The Common Core & Next Gen Science Standards",
x = "State Standards",
y = "Percentage of Words")
And finished! The chart above clearly illustrates that, regardless of the sentiment lexicon used, tweets about the NGSS contain a higher percentage of positive words than tweets about the CCSS.
With our “data products” cleanup complete, we can start pulling together a quick presentation to share with the class. We’ve already seen what a more formal journal article looks like in the PREPARE section of this walkthrough. For your Independent Analysis assignment for Unit 2, you’ll be creating either a simple report or slide deck to share out some key findings from our analysis.
Regardless of whether you plan to talk us through your analysis and findings with a presentation or walk us through with a brief written report, your assignment should address the following questions:
You can view my example presentation here: COMING SOON!
And use my R Markdown presentation file as a template: COMING SOON!