Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/pablobarbera/streamr
Dev version of streamR package: Access to Twitter Streaming API via R
https://github.com/pablobarbera/streamr
Last synced: about 1 month ago
JSON representation
Dev version of streamR package: Access to Twitter Streaming API via R
- Host: GitHub
- URL: https://github.com/pablobarbera/streamr
- Owner: pablobarbera
- Created: 2013-03-05T00:36:06.000Z (almost 12 years ago)
- Default Branch: master
- Last Pushed: 2018-12-11T14:05:25.000Z (about 6 years ago)
- Last Synced: 2023-10-25T19:06:05.500Z (about 1 year ago)
- Language: R
- Homepage: http://cran.r-project.org/web/packages/streamR/
- Size: 536 KB
- Stars: 105
- Watchers: 23
- Forks: 48
- Open Issues: 15
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
streamR: Access to Twitter Streaming API via R
---------This package includes a series of functions that give R users access to Twitter's Streaming API, as well as a tool that parses the captured tweets and transforms them in R data frames, which can then be used in subsequent analyses.
streamR
requires authentication via OAuth and theROAuth
package.Current CRAN release is 0.2.1. To install the most updated version (0.4.0) from GitHub, type:
```
library(devtools)
devtools::install_github("pablobarbera/streamR/streamR")
```Click here to read the documentation and here to read the vignette.
Installation and authentication
streamR can be installed directly from CRAN, but the most updated version will always be on GitHub. The code below shows how to install from both sources.
```
install.packages("streamR") # from CRAN
devtools::install_github("pablobarbera/streamR/streamR") # from GitHub
```
streamR
requires authentication via OAuth. The same oauth token can be used for bothstreamR
. After
creating an application here, and obtaining the consumer key and consumer secret, it is easy to create your own oauth credentials using theROAuth
package, which can be saved in disk for future sessions:```
library(ROAuth)
requestURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
consumerKey <- "xxxxxyyyyyzzzzzz"
consumerSecret <- "xxxxxxyyyyyzzzzzzz111111222222"
my_oauth <- OAuthFactory$new(consumerKey = consumerKey, consumerSecret = consumerSecret,
requestURL = requestURL, accessURL = accessURL, authURL = authURL)
my_oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
save(my_oauth, file = "my_oauth.Rdata")
```Alternatively, you can also create an access token as a list and
streamR
will automatically do the handshake:```
my_oauth <- list(consumer_key = "CONSUMER_KEY",
consumer_secret = "CONSUMER_SECRET",
access_token="ACCESS_TOKEN",
access_token_secret = "ACCESS_TOKEN_SECRET")
```filterStream
filterStream
is probably the most useful function. It opens a connection to the Streaming API that will return all tweets that contain one or more of the keywords given in thetrack
argument. We can use this function to, for instance, capture public statuses that mention Obama or Biden:library(streamR)
## Loading required package: RCurl
## Loading required package: bitops
## Loading required package: rjsonload("my_oauth.Rdata")
filterStream("tweets.json", track = c("Obama", "Biden"), timeout = 120,
oauth = my_oauth)## Loading required package: ROAuth
## Loading required package: digest
## Capturing tweets...
## Connection to Twitter stream was closed after 120 seconds with up to 350 tweets downloaded.tweets.df <- parseTweets("tweets.json", simplify = TRUE)
## 350 tweets have been parsed.
Note that here I'm connecting to the stream for just two minutes, but ideally I should have the connection continuously open, with some method to handle exceptions and reconnect when there's an error. I'm also using OAuth authentication (see below), and storing the tweets in a data frame using the
parseTweets
function. As I expected, Obama is mentioned more often than Biden at the moment I created this post:c( length(grep("obama", tweets.df$text, ignore.case = TRUE)),
length(grep("biden", tweets.df$text, ignore.case = TRUE)) )## [1] 347 2
Tweets can also be filtered by two additional parameters:
follow
, which can be used to include tweets published by only a subset of Twitter users, andlocations
, which will return geo-located tweets sent within bounding boxes defined by a set of coordinates. Using these two options involves some additional complications – for example, the Twitter users need to be specified as a vector of user IDs and not just screen names, and thelocations
filter is incremental to any keyword in thetrack
argument. For more information, I would suggest to check Twitter's documentation for each parameter.Here's a quick example of how one would capture and visualize tweets sent from the United States:
filterStream("tweetsUS.json", locations = c(-125, 25, -66, 50), timeout = 300,
oauth = my_oauth)
tweets.df <- parseTweets("tweetsUS.json", verbose = FALSE)
library(ggplot2)
library(grid)
map.data <- map_data("state")
points <- data.frame(x = as.numeric(tweets.df$lon), y = as.numeric(tweets.df$lat))
points <- points[points$y > 25, ]
ggplot(map.data) + geom_map(aes(map_id = region), map = map.data, fill = "white",
color = "grey20", size = 0.25) + expand_limits(x = map.data$long, y = map.data$lat) +
theme(axis.line = element_blank(), axis.text = element_blank(), axis.ticks = element_blank(),
axis.title = element_blank(), panel.background = element_blank(), panel.border = element_blank(),
panel.grid.major = element_blank(), plot.background = element_blank(),
plot.margin = unit(0 * c(-1.5, -1.5, -1.5, -1.5), "lines")) + geom_point(data = points,
aes(x = x, y = y), size = 1, alpha = 1/5, color = "darkblue")sampleStream
The function
sampleStream
allows the user to capture a small random sample (around 1%) of all tweets that are being sent at each moment. This can be useful for different purposes, such as estimating variations in “global sentiment” or describing the average Twitter user. A quick analysis of the public statuses captured with this method shows, for example, that the average (active) Twitter user follows around 500 other accounts, that a very small proportion of tweets are geo-located, and that Spanish is the second most common language in which Twitter users set up their interface.sampleStream("tweetsSample.json", timeout = 120, oauth = my_oauth, verbose = FALSE)
tweets.df <- parseTweets("tweetsSample.json", verbose = FALSE)
mean(as.numeric(tweets.df$friends_count))## [1] 543.5
table(is.na(tweets.df$lat))
##
## FALSE TRUE
## 228 13503round(sort(table(tweets.df$lang), decreasing = T)[1:5]/sum(table(tweets.df$lang)), 2)
##
## en es ja pt ar
## 0.57 0.16 0.09 0.07 0.03userStream
Finally, I have also included the function
userStream
, which allows the user to capture the tweets they would see in their timeline on twitter.com. As was the case withfilterStream
, this function allows to subset tweets by keyword and location, and exclude replies across users who are not followed. An example is shown below. Perhaps not surprisingly, many of the accounts I follow use Twitter in Spanish.userStream("mytweets.json", timeout = 120, oauth = my_oauth, verbose = FALSE)
tweets.df <- parseTweets("mytweets.json", verbose = FALSE)
round(sort(table(tweets.df$lang), decreasing = T)[1:3]/sum(table(tweets.df$lang)), 2)##
## en es ca
## 0.62 0.30 0.08More
In these examples I have used
parseTweets
to read the captured tweets from the text file where they were saved in disk and store them in a data frame in memory. The tweets can also be stored directly in memory by leaving thefile.name
argument empty, but my personal preference is to save the raw text, usually in different files, one for each hour or day. Having the files means I can run UNIX commands to quickly compute the number of tweets in each period, since each tweet is saved in a different line:system("wc -l 'tweetsSample.json'", intern = TRUE)
## [1] " 15086 tweetsSample.json"
Concluding...
I hope this package is useful for R users who want to at least play around with this type of data. Future releases of the package will include additional functions to analyze captured tweets, and improve the already existing so that they handle errors better. My plan is to keep the GitHub version up to date fixing any possible bugs, and release only major versions to CRAN.
You can contact me at pablo.barbera[at]nyu.edu or via twitter (@p_barbera) for any question or suggestion you might have, or to report any bugs in the code.