Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/gvegayon/twitterreport
Out-of-the-box analysis and reporting tools for twitter
- Host: GitHub
- URL: https://github.com/gvegayon/twitterreport
- Owner: gvegayon
- License: other
- Created: 2015-07-08T08:09:41.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2017-11-29T07:47:53.000Z (almost 7 years ago)
- Last Synced: 2024-08-02T06:03:40.473Z (3 months ago)
- Topics: d3js, jaccard, leaflet, sentiment-analysis, tweets, twitter, wordcloud
- Language: R
- Homepage:
- Size: 4.66 MB
- Stars: 38
- Watchers: 5
- Forks: 4
- Open Issues: 4
Metadata Files:
- Readme: README.md
- Changelog: ChangeLog
- License: LICENSE
README
twitterreport
=============

[![Build Status](https://travis-ci.org/gvegayon/twitterreport.svg?branch=master)](https://travis-ci.org/gvegayon/twitterreport) [![Build status](https://ci.appveyor.com/api/projects/status/a7ki7jlc5qht4dmn?svg=true)](https://ci.appveyor.com/project/gvegayon/twitterreport) [![DOI](https://zenodo.org/badge/19832/gvegayon/twitterreport.svg)](https://zenodo.org/badge/latestdoi/19832/gvegayon/twitterreport)
Out-of-the-box analysis and reporting tools for twitter
About
-----

While there are some (very neat) R packages focused on twitter (namely `twitteR` and `streamR`), `twitterreport` is centered on providing analysis and reporting tools for twitter data. The package's current version features:
- Access to twitter API
- Extracting mentions/hashtags/urls from text (tweets)
- Gender tagging by matching user names with gender datasets included in the package (**es** and **en**)
- Creating (mentions) networks and visualizing them using D3js
- Sentiment analysis (basic, but useful) using lexicons included in the package (again, **es** and **en**)
- Creating time series charts of hashtags/users/etc. and visualizing them using D3js
- Creating wordclouds (after removing stop words and processing the text)
- Map visualization using the leaflet package
- Topic identification through the Jaccard coefficient (word similarity)

You can take a look at a live example at , and at the source code of that example at .
Some of the functions here were first developed in the project *nodoschile.cl* (no longer running). You can visit the project's testimonial website and the website (part of nodoschile) that motivated `twitterreport` at .
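As a quick illustration of the mention/hashtag extraction listed above, the idea behind `tw_extract` can be sketched in base R. The tweets below are made up and the helper is hypothetical; the package's own parser is more complete:

``` r
# Base-R sketch of extracting mentions and hashtags from tweet text
# with regular expressions (illustrative only).
tweets <- c(
  "RT @MarsRovers: sols on #Mars keep adding up!",
  "Thanks @sciencemagazine for covering the #irandeal vote"
)

# Return all matches of `pattern` for each element of `x` as a list
extract_pattern <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x))
}

mentions <- extract_pattern(tweets, "@\\w+")
hashtags <- extract_pattern(tweets, "#\\w+")

mentions  # list: "@MarsRovers", then "@sciencemagazine"
hashtags  # list: "#Mars", then "#irandeal"
```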
Installation
------------

While the package is still in development, you can always use `devtools` to install the most recent version.
``` r
devtools::install_github('gvegayon/twitterreport')
```

Examples
--------

### Getting tweets from a set of users
``` r
# First, load the package!
library(twitterreport)

# List of twitter accounts
users <- c('MarsRovers', 'senatormenendez', 'sciencemagazine')

# Getting the tweets (first generate the token)
key <- tw_gen_token('myapp', 'key', 'secret')
tweets <- lapply(users, tw_api_get_statuses_user_timeline, twitter_token = key)

# Processing the data (and taking a look)
tweets <- do.call(rbind, tweets)
head(tweets)
```

### Creating a (fancy) network of mentions
``` r
# Loading data
data("senators")
data("senators_profile")
data("senate_tweets")

tweets_components <- tw_extract(senate_tweets$text)

groups <- data.frame(
  name      = senators_profile$tw_screen_name,
  group     = factor(senators$party),
  real_name = senators$Name,
  stringsAsFactors = FALSE)
groups$name <- tolower(groups$name)

senate_network <- tw_network(
  tolower(senate_tweets$screen_name),
  lapply(tweets_components$mention, unique), only.from = TRUE,
  group = groups, min.interact = 3)

plot(senate_network, nodelabel = 'real_name')
```

![](README_files/figure-markdown_github/network.png)
In the following examples we will use data on US senators extracted from twitter using the REST API (the datasets are included in the package).
### Creating a wordcloud
The function `tw_words` takes a character vector (of tweets, for example) and removes stop words and symbols from it. The `plot` method for its output creates a wordcloud.
``` r
data(senate_tweets)
tab <- tw_words(senate_tweets$text)

# What did it do?
senate_tweets$text[1:2]; tab[1:2]
```

    ## [1] "“I am saddened by the news that four Marines lost their lives today in the service of our country.” #Chattanooga"
    ## [2] ".@SenAlexander statement on today’s “tragic and senseless” murder of four Marines in #Chattanooga: http://t.co/H9zWdJPbiE"

    ## [[1]]
    ##  [1] "saddened"    "news"        "four"        "marines"     "lost"
    ##  [6] "lives"       "today"       "service"     "country"     "chattanooga"
    ##
    ## [[2]]
    ## [1] "senalexander" "statement"    "todays"       "tragic"
    ## [5] "senseless"    "murder"       "four"         "marines"
    ## [9] "chattanooga"

``` r
# Plot
set.seed(123) # (so the wordcloud always looks the same)
plot(tab, max.n.words = 40)
```

![](README_files/figure-markdown_github/wordcloud-1.png)
### Identifying individuals gender
Using English and Spanish name lists, the `tw_gender` function matches the character argument (which can be a vector) to either a male or a female name (or leaves it unidentified).
``` r
data(senators_profile)

# Getting the names
sen <- tolower(senators_profile$tw_name)
sen <- gsub('\\bsen(ator|\\.)\\s+', '', sen)
sen <- gsub('\\s+.+', '', sen)

tab <- table(tw_gender(sen))
barplot(tab)
```

![](README_files/figure-markdown_github/gender-1.png)
Sentiment analysis
------------------

Here we have an example classifying senate tweets on the \#irandeal.
``` r
irandeal <- subset(senate_tweets, grepl('irandeal', text, ignore.case = TRUE))
irandeal$sentiment <- tw_sentiment(irandeal$text, normalize = TRUE)

hist(irandeal$sentiment, col = 'lightblue',
     xlab = 'Valence (strength of sentiment)')
```

![](README_files/figure-markdown_github/Sentiments-1.png)
A map using leaflet
-------------------

The function `tw_leaflet` provides a nice wrapper for the function `leaflet` of the package of the same name. Using D3js, we can visualize the number of tweets grouped geographically, as the following example shows:

``` r
tw_leaflet(senate_tweets, ~coordinates, nclusters = 4, radii = ~sqrt(n)*3e5)
```

![](README_files/figure-markdown_github/leaflet_map.png)
Note that in this case there are 14 tweets with a non-empty `coordinates` column, corresponding to 4 different senators. Through the `nclusters` option, `tw_leaflet` groups the data using the `hclust` function of the stats package, so the user doesn't need to worry about aggregating the data.
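As a rough sketch of this aggregation step (not the package's internal code; the coordinates below are made up), clustering points with `hclust` and summarizing each cluster could look like this:

``` r
# Group point coordinates hierarchically, then compute one centroid
# and a count per cluster -- the kind of aggregation nclusters implies.
set.seed(1)
coords <- data.frame(
  lon = c(rnorm(5, -77), rnorm(5, -118)),  # two loose geographic groups
  lat = c(rnorm(5,  39), rnorm(5,   34))
)

# Hierarchical clustering on pairwise distances, cut into 2 groups
cl <- cutree(hclust(dist(coords)), k = 2)

# Centroid and number of points per cluster
centroids <- aggregate(coords, by = list(cluster = cl), FUN = mean)
centroids$n <- as.vector(table(cl))
centroids
```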
Words closeness
---------------

An interesting question is how words relate to each other. Using the Jaccard coefficient we can estimate a measure of distance between two words. The `jaccard_coef` function implements this measure and allows us to get a better understanding of topics, as the examples below show.
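For intuition, the coefficient for a pair of words reduces to |A ∩ B| / |A ∪ B|, where A and B are the sets of tweets containing each word. A minimal base-R sketch with made-up tweets and a hypothetical helper (`jaccard_coef` computes this for all word pairs at once):

``` r
# Jaccard coefficient between two words over a set of texts:
# size of the intersection of the tweet sets over the size of their union.
tweets <- c(
  "veterans deserve better health care",
  "access to care for veterans",
  "the senate votes on health funding today"
)

word_jaccard <- function(w1, w2, texts) {
  a <- grep(w1, texts, fixed = TRUE)  # indices of tweets containing w1
  b <- grep(w2, texts, fixed = TRUE)  # indices of tweets containing w2
  length(intersect(a, b)) / length(union(a, b))
}

word_jaccard("veterans", "health", tweets)  # 1/3: they co-occur only in the first tweet
```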
``` r
# Computing the jaccard coefficient
jaccard <- jaccard_coef(senate_tweets$text, max.size = 1000)

# See what words are related to 'veterans'
words_closeness('veterans',jaccard,.025)
```

    ##         word         coef
    ## 1   veterans 318.00000000
    ## 2         va   0.08982036
    ## 3       care   0.08510638
    ## 4      honor   0.04389313
    ## 5     access   0.04201681
    ## 6    deserve   0.04176334
    ## 7     health   0.04022989
    ## 8   benefits   0.03827751
    ## 9     mental   0.03733333
    ## 10   honored   0.03505155
    ## 11      home   0.03440860
    ## 12   service   0.03266788
    ## 13      july   0.03108808
    ## 14    combat   0.02964960
    ## 15  services   0.02857143
    ## 16    choice   0.02549575
    ## 17     thank   0.02529960

We can also do this using the output from `tw_extract`, that is, by passing a list of character vectors (this is much faster):
``` r
hashtags <- tw_extract(senate_tweets$text, obj = 'hashtag')$hashtag

# Again, but using a list
jaccard <- jaccard_coef(hashtags,max.size = 15000)
jaccard
```

    ## Jaccard index Matrix (Sparse) of 3283x3283 elements
    ## Contains the following words (access via $freq):
    ##          wrd   n
    ## 1   irandeal 202
    ## 2       iran 179
    ## 3     scotus 141
    ## 4        tpa 132
    ## 5      netde 119
    ## 6 mepolitics 117

``` r
# See what words are related to 'veterans'
words_closeness('veterans',jaccard,.025)
```

    ##          word        coef
    ## 1    veterans 78.00000000
    ## 2 honorflight  0.06382979
    ## 3          va  0.05154639
    ## 4  miasalutes  0.05000000
    ## 5     4profit  0.04166667
    ## 6   choiceact  0.03658537
    ## 7 40mileissue  0.02564103
    ## 8        hepc  0.02531646

Author
------

George G. Vega Yon

g vegayon at caltech