rtweetclean is an R package that processes data generated by the existing rtweet package: it produces clean data frames, summarizes data, and generates new features.

Our package adds resources for users of the existing rtweet package. rtweet is built around Twitter’s API and is used to retrieve tweet information from Twitter’s servers. Our package lets users process the raw data from rtweet into a more understandable format by extracting and organizing the contents of a user’s tweets. rtweetclean is specifically built for analysis of a single user’s timeline (generated using rtweet’s get_timeline function). Users can easily visualize average engagement by the time of day a tweet was posted, see basic summary statistics of word contents and sentiment analysis of tweets, and obtain a processed dataset that can be used in a wide variety of machine learning models.

Installation

You can install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("UBC-MDS/rtweetclean")

Functions

clean_df: adds derived columns to a timeline data frame (text_only, word_count, emojis, prptn_rts_vs_avg, prptn_favorites_vs_avg).
tweet_words: returns the top_n most frequent words and their counts.
sentiment_total: returns a table of word counts per sentiment.
engagement_by_hour: plots engagement by hour of day.

Dependencies

R Package
magrittr >= 2.0.1
dplyr >= 1.0.4
lubridate >= 1.7.10
ggplot2 >= 3.3.3
tidyr >= 1.1.3
tidytext >= 0.3.0
rtweet >= 0.7.0
stringr >= 1.4.0
knitr >= 1.31
testthat >= 3.0.2
rmarkdown >= 2.7
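
To see whether a local R library meets these minimums, a small check along the following lines may help. It is only a sketch: check_dependency is a hypothetical helper written for this README, not part of rtweetclean, and the version numbers are copied from the list above.

```r
# Minimum versions copied from the dependency list above.
min_versions <- c(
  magrittr = "2.0.1", dplyr = "1.0.4", lubridate = "1.7.10",
  ggplot2 = "3.3.3", tidyr = "1.1.3", tidytext = "0.3.0",
  rtweet = "0.7.0", stringr = "1.4.0", knitr = "1.31",
  testthat = "3.0.2", rmarkdown = "2.7"
)

# Hypothetical helper (not part of rtweetclean): reports "missing",
# "outdated", or "ok" for a single package.
check_dependency <- function(pkg, min_version) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return("missing")
  }
  if (utils::packageVersion(pkg) < min_version) "outdated" else "ok"
}

# Named character vector with one status per package.
status <- mapply(check_dependency, names(min_versions), min_versions)
print(status)
```

Any package reported as missing or outdated can then be installed or updated with install.packages before installing rtweetclean.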

Vignette

Vignette for this package can be found here

Usage & Examples

The function calls below use the following example data frame:

library(rtweetclean)
created_at <- c("2021-03-06 16:03:31",
                "2021-03-05 21:57:47",
                "2021-03-05 05:50:50",
                "2021-03-05 7:32:33")
text <- c("example tweet text 1 @user2 @user",
          "#example #tweet 2 ",
          "example tweet 3 https://t.co/G4ziCaPond",
          "example tweet 4")
retweet_count <- c(43, 12, 24, 29)
favorite_count <- c(85, 41, 65, 54)
timeline_rtweet_toy <- data.frame(text, retweet_count, favorite_count, created_at)
timeline_rtweet_toy
##                                      text retweet_count favorite_count
## 1       example tweet text 1 @user2 @user            43             85
## 2                      #example #tweet 2             12             41
## 3 example tweet 3 https://t.co/G4ziCaPond            24             65
## 4                         example tweet 4            29             54
##            created_at
## 1 2021-03-06 16:03:31
## 2 2021-03-05 21:57:47
## 3 2021-03-05 05:50:50
## 4  2021-03-05 7:32:33

Function calls in action:

cleaned_timeline <- clean_df(timeline_rtweet_toy)
cleaned_timeline
##                                      text retweet_count favorite_count
## 1       example tweet text 1 @user2 @user            43             85
## 2                      #example #tweet 2             12             41
## 3 example tweet 3 https://t.co/G4ziCaPond            24             65
## 4                         example tweet 4            29             54
##            created_at            text_only word_count emojis prptn_rts_vs_avg
## 1 2021-03-06 16:03:31 example tweet text 1          4               1.5925926
## 2 2021-03-05 21:57:47                    2          1               0.4444444
## 3 2021-03-05 05:50:50      example tweet 3          3               0.8888889
## 4  2021-03-05 7:32:33      example tweet 4          3               1.0740741
##   prptn_favorites_vs_avg
## 1              1.3877551
## 2              0.6693878
## 3              1.0612245
## 4              0.8816327
tweet_words(cleaned_timeline, top_n=3)
##     words count
## 1   tweet     3
## 2 example     3
## 3    text     1
sentiment_total(cleaned_timeline, drop_sentiment = FALSE)
## # A tibble: 10 x 3
##    sentiment    word_count total_words
##    <chr>             <dbl>       <int>
##  1 anger                 0          11
##  2 anticipation          0          11
##  3 disgust               0          11
##  4 fear                  0          11
##  5 joy                   0          11
##  6 negative              0          11
##  7 positive              0          11
##  8 sadness               0          11
##  9 surprise              0          11
## 10 trust                 0          11
engagement_by_hour(cleaned_timeline)

Figure: engagement_plot.png (plot produced by engagement_by_hour)
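
The derived columns in the clean_df output above can be reproduced by hand for the toy data, which may help when interpreting them. The sketch below is base R only and is an inference from the printed values, not the package's source: the regular expression used to strip URLs, mentions, and hashtags is an assumption, as are all the `_sketch` variable names.

```r
# Toy inputs copied from the example data frame above.
text <- c("example tweet text 1 @user2 @user",
          "#example #tweet 2 ",
          "example tweet 3 https://t.co/G4ziCaPond",
          "example tweet 4")
retweet_count <- c(43, 12, 24, 29)
favorite_count <- c(85, 41, 65, 54)

# Assumed cleaning rule: drop URLs, @mentions, and #hashtags.
stripped <- gsub("https?://\\S+|[@#]\\S+", "", text, perl = TRUE)
# Collapse runs of whitespace, then trim both ends.
text_only_sketch <- trimws(gsub("\\s+", " ", stripped, perl = TRUE))
# Count space-separated tokens in the cleaned text.
word_count_sketch <- lengths(strsplit(text_only_sketch, " ", fixed = TRUE))

# Each count divided by the mean of its column.
prptn_rts_sketch <- retweet_count / mean(retweet_count)
prptn_favorites_sketch <- favorite_count / mean(favorite_count)

print(data.frame(text_only_sketch, word_count_sketch,
                 prptn_rts_sketch, prptn_favorites_sketch))
```

For this data the mean retweet count is 27 and the mean favorite count is 61.25, so the first tweet's ratios are 43 / 27 and 85 / 61.25. The sketched word counts sum to 11, which matches total_words in the sentiment_total output.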

rtweetclean in the R ecosystem

rtweetclean extends the existing rtweet package by generating commonly desired features from Twitter API data. It pairs these with streamlined summary methods that quickly produce figures and tables describing your rtweet data, so users can easily understand and analyze a Twitter user’s timeline. Specifically, the package’s functions let you examine an account’s engagement, most common words, and emotional sentiment.
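
As a rough illustration of what an engagement-by-hour summary involves, the base R sketch below averages the toy counts per posting hour. It is not how engagement_by_hour is implemented: treating mean favorites and retweets as "engagement" is an assumption, and the variable names toy and by_hour are made up for this example.

```r
# Toy values copied from the example data frame in Usage & Examples.
created_at <- c("2021-03-06 16:03:31", "2021-03-05 21:57:47",
                "2021-03-05 05:50:50", "2021-03-05 7:32:33")
retweet_count <- c(43, 12, 24, 29)
favorite_count <- c(85, 41, 65, 54)

# Pull the hour out of "YYYY-MM-DD H:MM:SS"; tolerates a one-digit hour.
hour <- as.integer(sub("^\\S+\\s+(\\d{1,2}):.*$", "\\1", created_at, perl = TRUE))

toy <- data.frame(hour, favorite_count, retweet_count)

# Assumed engagement measure: mean favorites and retweets per posting hour.
by_hour <- aggregate(cbind(favorite_count, retweet_count) ~ hour,
                     data = toy, FUN = mean)
print(by_hour)
```

With only one tweet per hour in the toy data, each mean equals the original count; on a real timeline the same grouping would average over all tweets posted in that hour.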

Documentation

Documentation for this package can be found here

Contributors

We welcome and recognize all contributions. Please see the contributing guidelines in the Contributing document. This repository is currently maintained by nashmakh, calsvein, MattTPin, and syadk.