https://github.com/francisbarton/yearprogress

Retrieving, tidying and visualising tweets from @year_progress 2019-2020

---
title: "Looking at tweets by @year_progress 2019-2020"
date: '`r format(Sys.Date(), "%B %d, %Y")`'
stub: "year-progress-tweets-viz"
output:
  github_document:
    html_preview: true
    toc: false
    fig_width: 9
    fig_height: 5
    md_extensions: +emoji+bracketed_spans+inline_notes
keywords: [lubridate, ggplot, twitter, api, dataviz]
always_allow_html: true
---

```{r, setup, include = FALSE}

ragg_png <- function(..., res = 192) {
  ragg::agg_png(..., res = res, units = "in")
}

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  dev = "ragg_png",
  # dev.args = list(ragg_png = list(res = 192)),
  fig.ext = "png",
  fig.showtext = TRUE,
  fig.path = "man/figures/README-",
  out.width = "100%"
)

library(dplyr)
library(extrafont)
library(ggplot2)
library(ggthemr)
library(lubridate)
library(purrr)
library(stringr)
library(tweetrmd)
library(yearprogress)

```

:::{#flow}

I was randomly interested in whether there might be patterns in the amount of interaction with tweets from the **@year_progress** bot during 2020 :scream:,
and what the shape of the year might look like compared to 2019.

This gave me a chance to review the current landscape of `R` clients for the Twitter API :disappointed:.
After doing this review, I decided it would be easier and more satisfying for me to script my own access to [v2 of the API][tw-api].

[tw-api]: https://developer.twitter.com/en/docs/twitter-api/early-access

This meant more practice with `httr`, error handling, working with lists with the frankly brilliant [`purrr` toolkit][purrr], and with writing functions.
Happy as a :pig: in :poop:, _tbqhwy_.
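
As a rough idea of what scripting this kind of access with `httr` and `purrr` can look like, here is a minimal sketch. It is not the code in this package: `fetch_tweets_page()`, `safe_fetch()` and the `TWITTER_BEARER_TOKEN` environment variable are names made up for illustration, and it assumes the v2 user timeline endpoint with bearer token authentication.

```r
# Hypothetical helper, for illustration only (not this package's own code).
# Assumes the v2 user timeline endpoint and a bearer token kept in an
# environment variable whose name is invented here.
fetch_tweets_page <- function(user_id, start_time,
                              bearer_token = Sys.getenv("TWITTER_BEARER_TOKEN"),
                              next_token = NULL) {
  # Fail early, before touching the network, if there is no token
  if (!nzchar(bearer_token)) {
    stop("No bearer token supplied.", call. = FALSE)
  }

  resp <- httr::GET(
    url = paste0("https://api.twitter.com/2/users/", user_id, "/tweets"),
    # compact() drops the NULL pagination token on the first request
    query = purrr::compact(list(
      start_time = start_time,
      max_results = 100,
      `tweet.fields` = "created_at,public_metrics",
      pagination_token = next_token
    )),
    httr::add_headers(Authorization = paste("Bearer", bearer_token))
  )

  # Turn HTTP errors (401, 429, ...) into R errors
  httr::stop_for_status(resp)
  httr::content(resp, as = "parsed", type = "application/json")
}

# possibly() swaps any error for NULL, so one bad page doesn't abort a
# whole run; the NULLs can be dropped afterwards with purrr::compact()
safe_fetch <- purrr::possibly(fetch_tweets_page, otherwise = NULL)
```

Wrapping the request in `possibly()` is a design choice rather than a requirement: it trades loud failures for a list that may contain gaps, so it is worth checking how many pages came back empty.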

It's also a chance for me to try various other things I haven't really used before:

* Creating a GitHub README as an `.Rmd` file (this very file here),
instead of my usual `.md`, so I might incorporate some output charts
* Using a couple of [Pandoc's Markdown extensions][pdmd] ^[1][1]^
* GitHub Actions (set up using the `usethis` helper function)
* Making _a whole actual package_ :package: just for a little bit of fun like
this, as opposed to as some bigger project
* Learning a bit more about testing with `testthat`
* Yet more practice at package architecture and requirements,
and appeasing R CMD check
* Doing some hopefully nice [dataviz]{.sparkle} :bar_chart: with `ggplot2`, which I still don't feel like I have really used _well_ yet.
My charts usually look way too crappy for my liking.

[1]: # "We've got md_extensions: +emoji+bracketed_spans+inline_notes going on in the yaml here"
[purrr]: https://purrr.tidyverse.org/
[pdmd]: https://pandoc.org/MANUAL.html#pandocs-markdown

If you suspect that this whole idea was essentially driven by pre-Christmas/end-of-term task avoidance, angst and procrastination,
you would be correct.

```{r, get-dtf, eval = FALSE}
# Set variables and run overall query -------------------------------------

# Obtained originally by:
# user_id <- get_user_id(username = "year_progress")
# It isn't going to change, so once obtained just hardcode it:
year_progress_id <- "3233484298"

# This doesn't work with the Twitter API
# https://github.com/tidyverse/lubridate/issues/941
# start_2019 <- lubridate::make_datetime(2019) %>%
# lubridate::format_ISO8601(usetz = TRUE)
# so instead:

start_2019 <- "2019-01-01T00:00:02Z"

# see `R/get_user_twetes.R` for how this works
dtf <- get_user_twetes(user_id = year_progress_id, start = start_2019) %>%
  purrr::compact() %>%
  unpack_list() # <- in `R/helpers.R`
```
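
For what it's worth, the hardcoded timestamp in the chunk above could also be generated with base R. This is only a sketch, on the assumption that the API wants timestamps of the form `YYYY-MM-DDTHH:mm:ssZ`; `start_of_2019` is a name invented here, and it gives exactly midnight rather than the `00:00:02` used above.

```r
# Build a UTC timestamp with a literal "Z" suffix using base R only
start_of_2019 <- format(
  as.POSIXct("2019-01-01 00:00:00", tz = "UTC"),
  "%Y-%m-%dT%H:%M:%SZ",
  tz = "UTC"
)

start_of_2019
```

The `Z` is just a literal character in the format string, so this is only correct while the time zone really is UTC.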

```{r, load-dtf, include = FALSE}
# myrmidon::save_it(dtf)
dtf <- readRDS(here::here("rds_data", "dtf.Rds"))
```

I realised the other day that it would be fine to call a dataframe `dtf` instead of boring old `df` :smirk:...

```{r, dtf-tweet, echo = FALSE, cache = TRUE}
tweetrmd::include_tweet("https://twitter.com/ludictech/status/1338815874770329602?s=20")
```

```{r, glimpse-dtf}
glimpse(dtf)

```

Let's lick this dataframe into shape :icecream: so it's ready for visualising.

```{r, prepare-dtf}

dtf2 <- dtf %>%
  # filter out any "normal" tweets!
  filter(str_detect(text, "^(\u2591|\u2593)+")) %>%
  mutate(pcnt = as.numeric(str_extract(text, "[0-9]+"))) %>%
  mutate(date_tweeted = case_when(
    # 100% tweets get tweeted just _after_ midnight on 1st Jan: need correcting
    # Be better here to check *if* date == January 1 but OK for now
    pcnt == 100 ~ as_date(created_at) - period(1, "days"),
    TRUE ~ as_date(created_at))) %>%
  mutate(year = year(date_tweeted)) %>%
  mutate(yday = yday(date_tweeted)) %>%
  mutate(wkdy = weekdays(as_date(date_tweeted))) %>%
  mutate(across(c(year, wkdy), ~ as.factor(.)))

```
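
The comment in that chunk admits it would be better to check whether a 100% tweet really was posted on January 1 before shifting its date back a day. A small sketch of that stricter check follows, run on made-up rows rather than the real data; `toy`, `tidy` and `tweeted_on` are names invented for the example.

```r
library(dplyr)
library(lubridate)
library(stringr)

# Made-up rows that mimic the shape of the real data
toy <- tibble(
  text = c(
    "\u2593\u2593\u2593\u2593\u2593\u2593\u2593\u2593\u2593\u2593 100%",
    "\u2593\u2593\u2593\u2593\u2593\u2591\u2591\u2591\u2591\u2591 50%",
    "Happy new year!"
  ),
  created_at = c(
    "2020-01-01T00:00:10.000Z",
    "2020-07-01T12:00:00.000Z",
    "2020-01-01T09:00:00.000Z"
  )
)

tidy <- toy %>%
  # keep only the progress-bar tweets
  filter(str_detect(text, "^(\u2591|\u2593)+")) %>%
  mutate(
    pcnt = as.numeric(str_extract(text, "[0-9]+")),
    tweeted_on = as_date(ymd_hms(created_at)),
    # only shift the date back when it is a 100% tweet AND it is Jan 1
    date_tweeted = if_else(
      pcnt == 100 & month(tweeted_on) == 1 & day(tweeted_on) == 1,
      tweeted_on - days(1),
      tweeted_on
    )
  )

tidy
```

With this version, a 100% tweet that happened to go out just before midnight on December 31 would be left alone instead of being pushed back to December 30.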

```{r, ggthemr-set, echo = FALSE}
ggthemr::ggthemr(palette = "solarized", layout = "clean", type = "outer")
# ggthemr::ggthemr_reset()

theme_update(
text = element_text(family = "IBM Plex Sans", colour = "#100810")
)

```

Let's look at how retweets vary through the year:

```{r, plot-retweets, warning = FALSE, message = FALSE, cache = TRUE}
dtf2 %>%
  filter(year == 2019) %>%
  slice_max(n = 4, order_by = retweet_count) %>%
  mutate(label = str_glue("{format(date_tweeted, '%b %d')}: {pcnt}%")) %>%
  left_join(dtf2, .) %>%
  ggplot(aes(yday, retweet_count)) +
  geom_line(aes(colour = year), size = 0.6, alpha = 1) +
  geom_point(aes(colour = year), size = 1) +
  geom_label(
    aes(label = label),
    vjust = "outward",
    hjust = "inward",
    nudge_y = 0.1,
    fill = "aquamarine3",
    colour = "white",
    size = 3,
    fontface = "bold") +
  scale_colour_brewer("Year", palette = "Set2") +
  scale_y_log10() +
  labs(
    x = "Day of the year",
    y = "Number of retweets (log scale)",
    title = "RTs of @year_progress tweets, 2019 vs. 2020"
  )

```
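
The labelling trick in that chunk may not be obvious at first glance: the top rows are picked out, given a label, and then joined back onto the full data, so every other row ends up with `NA` as its label and `geom_label()` skips it. Here is the same pattern on a tiny made-up dataset (`toy` and `labelled` are names invented here), with the join columns spelled out:

```r
library(dplyr)

# Made-up numbers, just to show the shape of the result
toy <- tibble(
  year = factor(c(2019, 2019, 2019, 2020, 2020)),
  yday = c(1, 2, 3, 1, 2),
  retweet_count = c(10, 500, 30, 20, 900)
)

labelled <- toy %>%
  filter(year == 2019) %>%
  slice_max(n = 1, order_by = retweet_count) %>%
  mutate(label = paste0("day ", yday)) %>%
  # join the single labelled row back onto all five rows
  left_join(toy, ., by = c("year", "yday", "retweet_count"))

labelled
```

Only the most-retweeted 2019 row carries a label; the remaining rows have `NA`, which is presumably why the plotting chunks switch warnings off.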

Certain tweets are vastly more popular than the rest, with four tweets standing out as peaks - even when a log scale is employed on the y axis.
The next tranche of popularity is reserved for the multiples of 10%, apart from 50% which is already one of the top four retweeted.

There doesn't seem to be a particularly different trend affecting the popularity of tweets or the pattern of retweets as 2020 has progressed, relative to 2019. 2020 numbers are generally a little higher, which is most likely due to the account having acquired more followers over time.

Let's see if the numbers of likes for each tweet show a similar pattern
(we'd generally expect them to):

```{r, plot-likes, warning = FALSE, message = FALSE}
dtf3 <- dtf2 %>%
  filter(year == 2019) %>%
  slice_max(n = 4, order_by = like_count) %>%
  mutate(label_2019 = str_glue("{format(date_tweeted, '%b %d')}: {pcnt}%")) %>%
  left_join(dtf2, .)

dtf3 %>%
  filter(year == 2020) %>%
  slice_max(n = 4, order_by = like_count) %>%
  mutate(label_2020 = str_glue("{format(date_tweeted, '%b %d')}: {pcnt}%")) %>%
  left_join(dtf3, .) %>%
  mutate(like_count = like_count/1000) %>%
  ggplot(aes(yday, like_count)) +
  geom_line(aes(colour = year), size = 0.6, alpha = 1) +
  geom_point(aes(colour = year), size = 1) +
  geom_label(
    aes(label = label_2019),
    vjust = "outward",
    hjust = "inward",
    nudge_y = 0.1,
    fill = "aquamarine3",
    colour = "white",
    size = 3,
    fontface = "bold") +
  geom_label(
    aes(label = label_2020),
    vjust = "outward",
    hjust = "inward",
    nudge_y = 0.1,
    fill = "chocolate2",
    colour = "white",
    size = 3,
    fontface = "bold") +
  scale_colour_brewer("Year", palette = "Set2") +
  scale_y_log10() +
  labs(
    x = "Day of the year",
    y = "Number of likes ('000s) (log scale)",
    title = "\"Likes\" of @year_progress tweets, 2019 vs. 2020"
  )

```

And quote tweets:

```{r, plot-quotes, warning = FALSE, message = FALSE}
dtf3 <- dtf2 %>%
  filter(year == 2019) %>%
  slice_max(n = 4, order_by = quote_count) %>%
  mutate(label_2019 = str_glue("{format(date_tweeted, '%b %d')}: {pcnt}%")) %>%
  left_join(dtf2, .)

dtf3 %>%
  filter(year == 2020) %>%
  slice_max(n = 4, order_by = quote_count) %>%
  mutate(label_2020 = str_glue("{format(date_tweeted, '%b %d')}: {pcnt}%")) %>%
  left_join(dtf3, .) %>%
  # mutate(quote_count = quote_count/1000) %>%
  ggplot(aes(yday, quote_count)) +
  geom_line(aes(colour = year), size = 0.6, alpha = 1) +
  geom_point(aes(colour = year), size = 1) +
  geom_label(
    aes(label = label_2019),
    vjust = "outward",
    hjust = "inward",
    nudge_y = 0.1,
    fill = "aquamarine3",
    colour = "white",
    size = 3,
    fontface = "bold") +
  geom_label(
    aes(label = label_2020),
    vjust = "outward",
    hjust = "inward",
    nudge_y = 0.1,
    fill = "chocolate2",
    colour = "white",
    size = 3,
    fontface = "bold") +
  scale_colour_brewer("Year", palette = "Set2") +
  scale_y_log10() +
  labs(
    x = "Day of the year",
    y = "Number of quote-tweets (log scale)",
    title = "Quotes of @year_progress tweets, 2019 vs. 2020"
  )

```

Now, am I imagining it, or is there a much higher proportional difference between 2019 and 2020 when it comes to the QTs than was visible with the RTs and the Likes?
From about day 80? There are some quite big gaps in there between the two lines, particularly if you look at the period between day 80 and day 200.
2020 never drops below 100 QTs for any tweet, as 2019 did.

The gap is a bit less noticeable in the 70-percents, around day 270-280, when 2019's figures start to pick up as the end of the year approaches.
Again, some of this may just be a result of more people following the account: RTs and QTs generate more follows, and so it goes round.
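
One way to check that hunch rather than eyeballing it would be to compare the 2020-to-2019 ratio for each metric directly. The sketch below uses invented numbers purely to show the reshaping, so its output says nothing about the real tweets; `toy` and `ratios` are hypothetical names. Rows are matched on `pcnt` rather than `yday`, since 2020 is a leap year.

```r
library(dplyr)
library(tidyr)

# Invented counts, for illustration only
toy <- tibble(
  year = rep(c(2019, 2020), each = 3),
  pcnt = rep(c(25, 26, 27), times = 2),
  retweet_count = c(1000, 1100, 900, 1200, 1300, 1100),
  quote_count = c(80, 90, 70, 160, 200, 150)
)

ratios <- toy %>%
  # one row per tweet per metric
  pivot_longer(c(retweet_count, quote_count),
               names_to = "metric", values_to = "n") %>%
  # put the two years side by side for each percentage
  pivot_wider(names_from = year, values_from = n, names_prefix = "y") %>%
  mutate(ratio = y2020 / y2019) %>%
  group_by(metric) %>%
  summarise(median_ratio = median(ratio), .groups = "drop")

ratios
```

If the median ratio for quote tweets on the real data came out clearly above the one for retweets, that would back up the impression from the charts.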

I'll be interested to see what happens to 2020's final few data points in the closing days of December.
:::