---
title: "#NLMITC19 Twitter Analysis"
author: "Eric Leung"
output:
  md_document:
    toc: true
    df_print: "kable"
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, collapse = TRUE)
```

## Load libraries

```{r load_packages, message=FALSE, warning=FALSE}
library(tidyverse)
library(tidytext)
library(ggrepel)

if (!requireNamespace("rtweet", quietly = TRUE)) install.packages("rtweet")
library(rtweet)
```

## Query data

Below is the code to query Twitter for tweets tagged `#NLMITC19` (or the
variant `#NLMIT19`). I ran this query on 2019-06-28 at 22:50.

```{r query_tweets, eval=FALSE}
rt <- search_tweets("#NLMITC19 OR #NLMIT19", n = 1800, include_rts = FALSE)

saveRDS(rt, "nlmitc19_search.rds")
saveRDS(rt$status_id, "nlmitc19_search-ids.rds")
```

Rather than re-running the search here, I'll look the tweets back up from their saved status IDs.

```{r read_in_data}
ids_file <- "nlmitc19_search-ids.rds"
nlmitc19_file <- "nlmitc19_search.rds"

# Read in the saved search results directly if the file exists
if (file.exists(nlmitc19_file)) {
  rt <- readRDS(nlmitc19_file)
} else {
  # Download the status IDs file
  download.file(
    "https://github.com/erictleung/NLMITC19/blob/master/data/nlmitc19_search-ids.rds?raw=true",
    ids_file
  )

  # Read status IDs from the downloaded file
  ids <- readRDS(ids_file)

  # Look up the data associated with the status IDs
  rt <- rtweet::lookup_tweets(ids)
}
```
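
As a quick sanity check, the counts below confirm how many statuses came back and what time span they cover (assuming the standard rtweet `created_at` column is present).

```{r sanity_check}
# Number of statuses recovered and the date range they cover
nrow(rt)
range(rt$created_at)
```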

## General tweet prevalence over time

Code modified from [`rstudioconf_tweets`][mk].

[mk]: https://github.com/mkearney/rstudioconf_tweets

```{r tweets_over_time, fig.height=7, fig.width=9}
rt %>%
  ts_plot("30 minutes", color = "transparent") +
  geom_smooth(method = "loess",
              se = FALSE,
              span = 0.05,
              size = 2,
              color = "#0066aa") +
  geom_point(size = 5,
             shape = 21,
             fill = "#ADFF2F99",
             color = "#000000dd") +

  # ggplot2 theme
  theme_minimal(base_size = 15) +
  theme(axis.text = element_text(colour = "#222222"),
        plot.title = element_text(size = rel(1.7), face = "bold"),
        plot.subtitle = element_text(size = rel(1.3)),
        plot.caption = element_text(colour = "#444444")) +

  # Caption information
  labs(title = "Frequency of tweets about #NLMITC19 over time",
       subtitle = "Twitter status counts aggregated using half-hour intervals",
       caption = "\n\nSource: Data gathered via Twitter's standard `search/tweets` API using rtweet",
       x = NULL, y = NULL)
```

This pattern makes sense considering the conference spanned two days.
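
A rough per-day count backs that up (again assuming the rtweet `created_at` timestamp column):

```{r tweets_per_day}
# Count tweets per calendar day to confirm the conference-day peaks
rt %>%
  mutate(day = as.Date(created_at)) %>%
  count(day)
```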

## Most prolific tweeters?

```{r most_prolific_tweeter, fig.height=7, fig.width=9}
rt %>%
  group_by(screen_name) %>%
  summarise(tweets = n()) %>%
  ggplot(aes(x = tweets, y = reorder(screen_name, tweets))) +
  geom_point() +

  # Theme styling information
  theme_minimal(base_size = 15) +
  theme(axis.text = element_text(colour = "#222222"),
        plot.title = element_text(size = rel(1.7), face = "bold"),
        plot.subtitle = element_text(size = rel(1.3)),
        plot.caption = element_text(colour = "#444444")) +

  # Labels
  labs(title = "Top tweeters using\n#NLMITC19 or #NLMIT19",
       x = "Total number of tweets",
       y = "Twitter username",
       caption = "\n\nSource: Data gathered via Twitter's standard `search/tweets` API using rtweet")
```
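
The same counts can also be read off as a table of the ten most active accounts (rendered with `kable` via the `df_print` option in the YAML header):

```{r top_tweeters_table}
# Ten most active accounts by number of tweets collected
rt %>%
  count(screen_name, sort = TRUE) %>%
  head(10)
```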

## Relationship between follower count and tweet popularity

Do users with more followers have more popular tweets?

I take the average number of favorites across an individual's tweets and
normalize it by their total number of tweets.
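
As a toy example of the metric (values made up for illustration): a user whose three tweets received 4, 0, and 2 favorites averages 2 favorites per tweet, and the normalized average is 2 / 3 ≈ 0.67.

```{r normalized_favorites_example}
# Made-up favorite counts for one hypothetical user
favs <- c(4, 0, 2)
mean(favs)                 # average favorites per tweet
mean(favs) / length(favs)  # average normalized by number of tweets
```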

```{r follower_vs_favorites, fig.height=7, fig.width=9}
rt %>%
  # Preprocess and compute average favorites normalized by number of tweets
  group_by(screen_name) %>%
  mutate(avg_fav = mean(favorite_count)) %>%
  mutate(avg_norm_fav = avg_fav / n()) %>%
  ungroup() %>%
  select(screen_name, avg_fav, avg_norm_fav, followers_count) %>%
  distinct() %>%

  # Offset to avoid infinite values when log transforming
  mutate(followers_count = followers_count + 0.001) %>%
  mutate(avg_norm_fav = avg_norm_fav + 0.001) %>%

  # Plot results
  ggplot(aes(x = followers_count, y = avg_norm_fav, label = screen_name)) +
  geom_text_repel() +
  geom_point() +

  # Use log scales for the x-axis and y-axis
  scale_x_log10() +
  scale_y_log10() +

  # Labels
  labs(title = "Average normalized number of favorites\nversus user follower count",
       x = "Number of followers",
       y = "Average normalized number of favorites",
       caption = "\nSource: Data gathered via Twitter's standard `search/tweets` API using rtweet") +

  # Theme styling information
  theme_minimal(base_size = 15) +
  theme(axis.text = element_text(colour = "#222222"),
        plot.title = element_text(size = rel(1.7), face = "bold"),
        plot.subtitle = element_text(size = rel(1.3)),
        plot.caption = element_text(colour = "#444444"))
```
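
For a rough single-number summary of the relationship, a rank-based correlation sidesteps the log offsets entirely:

```{r follower_favorite_correlation}
# Spearman correlation between follower count and the normalized favorite metric
rt %>%
  group_by(screen_name) %>%
  summarise(avg_norm_fav = mean(favorite_count) / n(),
            followers_count = first(followers_count)) %>%
  with(cor(followers_count, avg_norm_fav, method = "spearman"))
```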

## Chatterplot of tweet words

```{r process_for_chatter}
rt_no_stop <- rt %>%
  # Just look at tweet text and favorite counts
  select(text, favorite_count) %>%

  # Remove web links
  mutate(text = str_replace_all(text, "https?[:graph:]+", "'")) %>%

  # Remove mentions
  # Rules are that usernames are alphanumeric and can have underscores.
  # Names can also be preceded by "." or end with some punctuation.
  # Twitter:
  # help.twitter.com/en/managing-your-account/twitter-username-rules
  # To avoid emails:
  # stackoverflow.com/questions/4424179/how-to-validate-a-twitter-username-using-regex#comment21201837_4424288
  mutate(text = str_replace_all(text,
                                "\\.?@([:alnum:]|_){1,15}(?![.A-Za-z])[:graph:]?",
                                "")) %>%

  # Tokenize text into single words
  unnest_tokens(word, text) %>%

  # Remove stop words (e.g., "a", "the", "and")
  anti_join(get_stopwords(), by = "word")

# Get average number of favorites
rt_word_avg_fav <- rt_no_stop %>%
  # Average favorite count per word
  group_by(word) %>%
  summarize(avg_fav = mean(favorite_count))

# Count number of mentions
rt_counts <- rt_no_stop %>%
  # Create word counts
  count(word, sort = TRUE)

# Filter low counts and join counts with average favorite score
chatter_rt <- rt_counts %>%
  filter(n > 1) %>%
  filter(word != "nlmitc19") %>%
  left_join(rt_word_avg_fav, by = "word")
```
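
A quick look at the processed words helps confirm the counts and average favorites look sensible before plotting:

```{r inspect_chatter}
# Most frequent words with their average favorite counts
chatter_rt %>%
  arrange(desc(n)) %>%
  head(10)
```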

Code below modified from ["RIP wordclouds, long live CHATTERPLOTS"][wordcloud].

[wordcloud]: https://towardsdatascience.com/rip-wordclouds-long-live-chatterplots-e76a76896098

```{r plot_chatter, fig.height=7, fig.width=9}
chatter_rt %>%
  # Add a small offset to average favorite counts because some are zero and we
  # log transform, which can introduce infinite values
  mutate(avg_fav = avg_fav + 0.001) %>%

  # Keep just the top 100 words by frequency
  top_n(100, wt = n) %>%

  ggplot(aes(x = avg_fav, y = n, label = word)) +
  geom_text_repel(segment.alpha = 0,
                  aes(colour = avg_fav, size = n)) +

  # Set color gradient, log transform, and customize legend
  scale_color_gradient(low = "green3", high = "violetred",
                       trans = "log10",
                       guide = guide_colourbar(direction = "horizontal",
                                               title.position = "top")) +

  # Set word size range and turn off its legend
  scale_size_continuous(range = c(3, 10),
                        guide = FALSE) +

  # Use log scale for the x-axis
  scale_x_log10() +

  ggtitle(paste0("Top 100 words from ",
                 nrow(rt),
                 " #NLMITC19 tweets, by frequency"),
          subtitle = "Word frequency (size) ~ Avg number of favorites (color)") +
  labs(y = "Word frequency across all tweets",
       x = "Avg number of favorites in tweets containing word (log scale)",
       colour = "Avg num of favs (log)") +

  # Minimal theme and customizations
  theme_minimal() +
  theme(legend.position = c(0.20, 0.99),
        legend.justification = c("right", "top"),
        panel.grid.major = element_line(colour = "whitesmoke"))
```

## Session information

```{r session_info}
sessionInfo()
```