Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/chainsawriot/academicquacker

🦆 Convert raw data collected by academictwitteR into DuckDB
https://github.com/chainsawriot/academicquacker

Last synced: 27 days ago
JSON representation

🦆 Convert raw data collected by academictwitteR into DuckDB

Awesome Lists containing this project

README

        

---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```

# academicquacker

[![R-CMD-check](https://github.com/chainsawriot/academicquacker/workflows/R-CMD-check/badge.svg)](https://github.com/chainsawriot/academicquacker/actions)

If it collects tweets like a duck, binds tweets like a duck, and quacks like a duck, then it probably _is_ an academic.

The goal of this (experimental) package is to convert raw data collected by [academictwitteR](https://github.com/cjbarrie/academictwitteR) into [DuckDB](https://github.com/duckdb/duckdb) in a memory efficient manner. This package also serves as a test bed for rolling out experimental features of `bind_tweets` in `academictwitteR`.

Why DuckDB? Because it quacks... I mean, [it rocks](https://duckdb.org/docs/why_duckdb)!

Why isn't the last R capitalized? Because the developer always forgets which "r" in the word "academicquacker" to capitalize.

## Installation

You can install the development version of academicquacker with:

``` r
remotes::install_github("chainsawriot/academicquacker")
```

## Example

Suppose `dir` is a directory hosting json files collected with academictwitteR.

```{r, include = FALSE}
dir <- "tests/testdata/ica21"
```

```{r example1}
library(academicquacker)
con <- quack(dir, db = "mydata.duckdb", db_close = TRUE)
```

It won't fill up all main memory in your computer.

## Analysis

Now you can do analysis with the database. For example, you can use dplyr like you usually do with dataframes.

```{r example2}
library(DBI)
library(dplyr)
con <- dbConnect(duckdb::duckdb(), dbdir = "mydata.duckdb")

tbl(con, "tweets") %>% count(user_username, sort = TRUE)
```

Most retweeted original content that is not from @icahdq

```{r example3}
tbl(con, "tweets") %>% arrange(desc(retweet_count)) %>% filter(is.na(sourcetweet_id) & user_username != "icahdq") %>% select(user_username, text, retweet_count)
```

Calculate average retweets per original tweet by user who wrote at least three tweets

```{r example4}
tbl(con, "tweets") %>% filter(is.na(sourcetweet_id)) %>% group_by(user_username) %>% summarise(avg_rt = sum(retweet_count, na.rm = TRUE) / n(), n = n()) %>% filter(n > 2) %>% arrange(desc(avg_rt))
```

You can find more information about this in the ["Introduction to dbplyr"](https://dbplyr.tidyverse.org/articles/dbplyr.html) Vignette of tidyverse

```{r, include = FALSE}
DBI::dbDisconnect(con, shutdown = TRUE)
unlink("mydata.duckdb")
```