{"id":17961221,"url":"https://github.com/erictleung/nlmitc19","last_synced_at":"2025-04-03T18:41:52.410Z","repository":{"id":72464187,"uuid":"169328732","full_name":"erictleung/NLMITC19","owner":"erictleung","description":":speaker: Twitter analysis of #NLMITC19","archived":false,"fork":false,"pushed_at":"2019-06-29T06:15:54.000Z","size":493,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-02-16T07:56:52.263Z","etag":null,"topics":["conference","informatics","nlm","r","rmarkdown","training","twitter"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/erictleung.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-02-05T23:23:31.000Z","updated_at":"2020-11-02T19:36:12.000Z","dependencies_parsed_at":null,"dependency_job_id":"ad3daa5b-18c4-4265-b698-c03581bd4213","html_url":"https://github.com/erictleung/NLMITC19","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erictleung%2FNLMITC19","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erictleung%2FNLMITC19/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erictleung%2FNLMITC19/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erictleung%2FNLMITC19/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/erictleung","download_url"
:"https://codeload.github.com/erictleung/NLMITC19/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247060851,"owners_count":20877158,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["conference","informatics","nlm","r","rmarkdown","training","twitter"],"created_at":"2024-10-29T11:08:41.396Z","updated_at":"2025-04-03T18:41:52.370Z","avatar_url":"https://github.com/erictleung.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"---\ntitle: \"#NLMITC19 Twitter Analysis\"\nauthor: \"Eric Leung\"\noutput:\n    md_document:\n      toc: true\n      df_print: \"kable\"\n---\n\n```{r setup, include=FALSE}\nknitr::opts_chunk$set(echo = TRUE, collapse = TRUE)\n```\n\n## Load libraries\n\n```{r load_packages, message=FALSE, warning=FALSE}\nlibrary(tidyverse)\nlibrary(tidytext)\nlibrary(ggrepel)\n\nif (!requireNamespace(\"rtweet\", quietly = TRUE)) install.packages(\"rtweet\")\nlibrary(rtweet)\n```\n\n\n## Query data\n\nBelow is the code to query Twitter for tweets tagged `#NLMITC19` or `#NLMIT19`. 
I ran this at\n2019-06-28 22:50.\n\n```{r query_tweets, eval=FALSE}\nrt \u003c- search_tweets(\"#NLMITC19 OR #NLMIT19\", n = 1800, include_rts = FALSE)\n\nsaveRDS(rt, \"nlmitc19_search.rds\")\nsaveRDS(rt$status_id, \"nlmitc19_search-ids.rds\")\n```\n\nInstead, here I read the saved search if it exists; otherwise, I download the\nstatus IDs and look them up.\n\n```{r read_in_data}\nids_file \u003c- \"nlmitc19_search-ids.rds\"\nnlmitc19_file \u003c- \"nlmitc19_search.rds\"\n\n\n# Read in search directly if it exists\nif (file.exists(nlmitc19_file)) {\n  rt \u003c- readRDS(nlmitc19_file)\n} else {\n  # Download status IDs file\n  download.file(\n    \"https://github.com/erictleung/NLMITC19/blob/master/data/nlmitc19_search-ids.rds?raw=true\",\n    ids_file,\n    mode = \"wb\"  # Binary mode keeps the .rds file intact on Windows\n  )\n\n  # Read status IDs from downloaded file\n  ids \u003c- readRDS(ids_file)\n\n\n  # Look up data associated with status IDs\n  rt \u003c- rtweet::lookup_tweets(ids)\n}\n```\n\n\n\n## General tweet prevalence over time\n\nCode modified from [`rstudioconf_tweets`][mk].\n\n[mk]: https://github.com/mkearney/rstudioconf_tweets\n\n```{r tweets_over_time, fig.height=7, fig.width=9}\nrt %\u003e%\n  ts_plot(\"30 minutes\", color = \"transparent\") +\n  geom_smooth(method = \"loess\",\n              se = FALSE,\n              span = 0.05,\n              size = 2,\n              color = \"#0066aa\") +\n  geom_point(size = 5,\n             shape = 21,\n             fill = \"#ADFF2F99\",\n             color = \"#000000dd\") +\n\n  # ggplot2 theme \n  theme_minimal(base_size = 15) +\n  theme(axis.text = element_text(colour = \"#222222\"),\n        plot.title = element_text(size = rel(1.7), face = \"bold\"),\n        plot.subtitle = element_text(size = rel(1.3)),\n        plot.caption = element_text(colour = \"#444444\")) +\n\n  # Caption information\n  labs(title = \"Frequency of tweets about #NLMITC19 over time\",\n       subtitle = \"Twitter status counts aggregated using half-hour intervals\",\n       caption = \"\\n\\nSource: Data gathered via Twitter's standard `search/tweets` 
API using rtweet\",\n       x = NULL, y = NULL)\n```\n\nThis makes sense, considering the conference ran for two days.\n\n\n## Most prolific tweeters?\n\n```{r most_prolific_tweeter, fig.height=7, fig.width=9}\nrt %\u003e%\n  group_by(screen_name) %\u003e%\n  summarise(tweets = n()) %\u003e%\n  ggplot(aes(x = tweets, y = reorder(screen_name, tweets))) +\n  geom_point() +\n\n  # Theme styling information\n  theme_minimal(base_size = 15) +\n  theme(axis.text = element_text(colour = \"#222222\"),\n        plot.title = element_text(size = rel(1.7), face = \"bold\"),\n        plot.subtitle = element_text(size = rel(1.3)),\n        plot.caption = element_text(colour = \"#444444\")) +\n\n  # Labels\n  labs(title = \"Top tweeters using\\n#NLMITC19 or #NLMIT19\",\n       x = \"Total number of tweets\",\n       y = \"Twitter username\",\n       caption = \"\\n\\nSource: Data gathered via Twitter's standard `search/tweets` API using rtweet\")\n```\n\n\n## Relationship between follower count and tweet popularity\n\nDo users with more followers have more popular tweets?\n\nI take the average number of favorites of an individual's tweets and normalize it\nby that individual's total number of tweets.\n\n```{r follower_vs_favorites, fig.height=7, fig.width=9}\nrt %\u003e%\n  # Preprocess and count average favorites normalized by number of tweets\n  group_by(screen_name) %\u003e%\n  mutate(avg_fav = mean(favorite_count)) %\u003e%\n  mutate(avg_norm_fav = avg_fav / n()) %\u003e%\n  ungroup() %\u003e%\n  select(screen_name, avg_fav, avg_norm_fav, followers_count) %\u003e%\n  distinct() %\u003e%\n\n  # Offset to not create infinite values when log transforming\n  mutate(followers_count = followers_count + 0.001) %\u003e%\n  mutate(avg_norm_fav = avg_norm_fav + 0.001) %\u003e%\n\n  # Plot results\n  ggplot(aes(x = followers_count, y = avg_norm_fav, label = screen_name)) +\n  geom_text_repel() +\n  geom_point() +\n\n  # Use log-scale for x-axis and y-axis\n  scale_x_log10() +\n  scale_y_log10() +\n\n  # Labels\n  labs(title = \"Average normalized number of 
favorites\\nversus user follower count\",\n       x = \"Number of followers\",\n       y = \"Average normalized number of favorites\",\n       caption = \"\\nSource: Data gathered via Twitter's standard `search/tweets` API using rtweet\") +\n\n  # Theme styling information\n  theme_minimal(base_size = 15) +\n  theme(axis.text = element_text(colour = \"#222222\"),\n        plot.title = element_text(size = rel(1.7), face = \"bold\"),\n        plot.subtitle = element_text(size = rel(1.3)),\n        plot.caption = element_text(colour = \"#444444\"))\n```\n\n\n## Chatterplot of tweet words\n\n```{r process_for_chatter}\nrt_no_stop \u003c- rt %\u003e%\n  # Just look at tweet text\n  select(text, favorite_count) %\u003e%\n  \n  # Remove web links\n  mutate(text = str_replace_all(text, \"https?[:graph:]+\", \"'\")) %\u003e%\n\n  # Remove mentions\n  # Rules are that names are alphanumeric and can have underscores.\n  # Names can also be preceded by \".\" or end with some punctuation\n  # Twitter:\n  #   help.twitter.com/en/managing-your-account/twitter-username-rules\n  # To avoid emails:\n  #   stackoverflow.com/questions/4424179/how-to-validate-a-twitter-username-using-regex#comment21201837_4424288\n  mutate(text = str_replace_all(text,\n                                \"\\\\.?@([:alnum:]|_){1,15}(?![.A-Za-z])[:graph:]?\",\n                                \"\")) %\u003e%\n\n  # Tokenize text to just single words\n  unnest_tokens(word, text) %\u003e%\n\n  # Remove stop words (e.g., \"a\", \"the\", \"and\", etc.)\n  anti_join(get_stopwords(), by = \"word\")\n\n\n# Get average number of favorites\nrt_word_avg_fav \u003c- rt_no_stop %\u003e%\n  # Average favorite count\n  group_by(word) %\u003e%\n  summarize(avg_fav = mean(favorite_count))\n\n\n# Count number of mentions\nrt_counts \u003c- rt_no_stop %\u003e%\n  # Create word counts\n  count(word, sort = TRUE)\n\n\n# Filter low counts and join counts and average favorite score\nchatter_rt \u003c- rt_counts %\u003e%\n  filter(n \u003e 1) 
%\u003e%\n  filter(word != \"nlmitc19\") %\u003e%\n  left_join(rt_word_avg_fav, by = \"word\")\n```\n\nCode below modified from [\"RIP wordclouds, long live CHATTERPLOTS\"][wordcloud].\n\n[wordcloud]: https://towardsdatascience.com/rip-wordclouds-long-live-chatterplots-e76a76896098\n\n```{r plot_chatter, fig.height=7, fig.width=9}\nchatter_rt %\u003e%\n  # Add small offset average favorite counts because some are zero and we log\n  # transform, which can introduce infinite values\n  mutate(avg_fav = avg_fav + 0.001) %\u003e%\n\n  # Gather just top 100 mentions\n  top_n(100, wt = n) %\u003e%\n  \n  ggplot(aes(x = avg_fav, y = n, label = word)) +\n  geom_text_repel(segment.alpha = 0,\n                  aes(colour = avg_fav, size = n)) +\n\n  # Set color gradient, log transform \u0026 customize legend\n  scale_color_gradient(low = \"green3\", high = \"violetred\", \n                       trans = \"log10\",\n                       guide = guide_colourbar(direction = \"horizontal\",\n                                               title.position = \"top\")) +\n  # Set word size range \u0026 turn off legend\n  scale_size_continuous(range = c(3, 10),\n                        guide = \"none\") +\n\n  # Use log-scale for x-axis\n  scale_x_log10() +\n  ggtitle(paste0(\"Top 100 words from \",\n                  nrow(rt),\n                 \" #NLMITC19 tweets, by frequency\"),\n          subtitle = \"Word frequency (size) ~ Avg number of favorites (color)\") + \n  labs(y = \"Word frequency across all tweets\",\n       x = \"Avg number of favorites in tweets containing word (log scale)\",\n       colour = \"Avg num of favs (log)\") +\n  \n  # minimal theme \u0026 customizations\n  theme_minimal() +\n  theme(legend.position = c(0.20, 0.99),\n        legend.justification = c(\"right\",\"top\"),\n        panel.grid.major = element_line(colour = \"whitesmoke\"))\n```\n\n\n## Session 
information\n\n```{r}\nsessionInfo()\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ferictleung%2Fnlmitc19","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ferictleung%2Fnlmitc19","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ferictleung%2Fnlmitc19/lists"}