Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/njahn82/bqschol
Database-specific package for various big scholarly data on Google BigQuery curated at the SUB Göttingen.
https://github.com/njahn82/bqschol
Last synced: 16 days ago
JSON representation
Database-specific package for various big scholarly data on Google BigQuery curated at the SUB Göttingen.
- Host: GitHub
- URL: https://github.com/njahn82/bqschol
- Owner: njahn82
- License: other
- Created: 2021-09-13T13:53:29.000Z (over 3 years ago)
- Default Branch: master
- Last Pushed: 2021-11-04T11:03:26.000Z (about 3 years ago)
- Last Synced: 2024-11-01T21:51:11.416Z (2 months ago)
- Language: R
- Homepage:
- Size: 77.1 KB
- Stars: 2
- Watchers: 4
- Forks: 1
- Open Issues: 3
-
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
Awesome Lists containing this project
README
---
output: github_document
---```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```# bqschol
The goal of bqschol is to provide an interface to SUB Göttingen's
big scholarly datasets stored on Google Big Query.This package is of internal use.
## Installation
You can install the development version from [GitHub](https://github.com/) with:
``` r
# install.packages("remotes")
remotes::install_github("njahn82/bqschol")
```
## Initialize the connectionConnect to dataset with Unpaywall snapshots
```{r}
library(bqschol)
my_con <- bqschol::bgschol_con(
dataset = "cr_history",
path = "~/hoad-private-key.json")
```You need to have a service account token to make use of this
package!## Table functions
The package provides wrapper for the most common table operations
* `bgschol_list()`: List tables
* `bgschol_tbl()`: Access tables with
* `bgschol_query()`: Perform of a SQL query and retrieve results
* `bgschol_execute()`: Execute a SQL query on the databaseLet's start by listing all Crossref snapshots on SUB Göttingen's
Big Query project```{r}
bgschol_list(my_con)
```We can determine the top publisher by type as of April 2018.
Note that we only stored Crossref records published later than
2007.```{r}
cr_instant_df <- bgschol_tbl(my_con, table = "cr_apr18")
cr_instant_df %>%
#top publishers
dplyr::group_by(publisher) %>%
dplyr::summarise(n = dplyr::n_distinct(doi)) %>%
dplyr::arrange(desc(n))
```For more complex tasks, we use SQL.
```{r}
cc_query <- c("SELECT
publisher,
COUNT(DISTINCT(DOI)) AS n
FROM
`api-project-764811344545.cr_history.cr_apr18`,
UNNEST(license) AS license
WHERE
REGEXP_CONTAINS(license.URL, 'creativecommons')
GROUP BY
publisher
ORDER BY
n DESC
LIMIT
10")
bgschol_query(my_con, cc_query)
````bgschol_execute()` is when new tables shall be created or dropped in
Big Query.