https://github.com/njahn82/bqschol

Database-specific package for various big scholarly data on Google BigQuery curated at the SUB Göttingen.

Host: GitHub
URL: https://github.com/njahn82/bqschol
Owner: njahn82
License: other
Created: 2021-09-13T13:53:29.000Z (over 3 years ago)
Default Branch: master
Last Pushed: 2021-11-04T11:03:26.000Z (about 3 years ago)
Last Synced: 2024-11-01T21:51:11.416Z (2 months ago)
Language: R
Homepage:
Size: 77.1 KB
Stars: 2
Watchers: 4
Forks: 1
Open Issues: 3
Metadata Files:
- Readme: README.Rmd
- License: LICENSE

README

---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# bqschol

The goal of bqschol is to provide an interface to SUB Göttingen's big scholarly datasets stored on Google BigQuery.

This package is intended for internal use.

## Installation

You can install the development version from [GitHub](https://github.com/) with:

``` r
# install.packages("remotes")
remotes::install_github("njahn82/bqschol")
```

## Initialize the connection

Connect to the `cr_history` dataset, which holds the Crossref snapshots:

```{r}
library(bqschol)

my_con <- bqschol::bgschol_con(
  dataset = "cr_history",
  path = "~/hoad-private-key.json")
```

You need a service account token to use this package.
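A missing or wrong key file is a likely cause of a failed connection, so it can help to check the file before calling `bgschol_con()`. The following is a minimal sketch in base R. `check_key_file()` is a hypothetical helper, not part of bqschol, and it only assumes that a Google service account key is a JSON file containing `"type": "service_account"`.

```r
# Hypothetical helper (not part of bqschol): sanity-check a service
# account key file before passing its path to bgschol_con().
check_key_file <- function(path) {
  path <- path.expand(path)  # resolve "~" to the home directory
  if (!file.exists(path)) {
    stop("Key file not found: ", path, call. = FALSE)
  }
  # Service account keys are JSON documents with "type": "service_account".
  content <- paste(readLines(path, warn = FALSE), collapse = "\n")
  pattern <- "\"type\"[[:space:]]*:[[:space:]]*\"service_account\""
  if (!grepl(pattern, content)) {
    stop("File does not look like a service account key: ", path, call. = FALSE)
  }
  invisible(path)  # return the expanded path for further use
}
```

The returned path could then be passed on, for example `bgschol_con(dataset = "cr_history", path = check_key_file("~/hoad-private-key.json"))`.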

## Table functions

The package provides wrappers for the most common table operations:

* `bgschol_list()`: List tables
* `bgschol_tbl()`: Access a table, which can then be queried with dplyr verbs
* `bgschol_query()`: Perform a SQL query and retrieve the results
* `bgschol_execute()`: Execute a SQL query on the database

Let's start by listing all Crossref snapshots in SUB Göttingen's BigQuery project:

```{r}
bgschol_list(my_con)
```

We can determine the top publishers as of April 2018, ranked by the number of distinct DOIs. Note that only Crossref records published later than 2007 are stored.

```{r}
library(dplyr) # attach dplyr so that %>% is available

cr_instant_df <- bgschol_tbl(my_con, table = "cr_apr18")

cr_instant_df %>%
  # top publishers by number of distinct DOIs
  dplyr::group_by(publisher) %>%
  dplyr::summarise(n = dplyr::n_distinct(doi)) %>%
  dplyr::arrange(dplyr::desc(n))
```
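To see what this pipeline computes without a BigQuery connection, the same verbs can be run on a small local data frame. The records below are made up for illustration and do not come from Crossref. The sketch assumes that dplyr is installed.

```r
library(dplyr)

# Made-up records for illustration only; one DOI is duplicated on purpose.
toy_df <- data.frame(
  doi = c("10.1/a", "10.1/a", "10.1/b", "10.2/c", "10.3/d", "10.3/e", "10.3/f"),
  publisher = c("Pub A", "Pub A", "Pub A", "Pub B", "Pub C", "Pub C", "Pub C"),
  stringsAsFactors = FALSE
)

top_publishers <- toy_df %>%
  group_by(publisher) %>%
  summarise(n = n_distinct(doi)) %>%  # a duplicated DOI counts only once
  arrange(desc(n))

top_publishers
```

Because `n_distinct()` counts each DOI once, the duplicated record for "Pub A" does not inflate its count.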

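For more complex tasks, we use SQL. The following query lists the ten publishers with the most distinct DOIs that carry a Creative Commons license URL.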

```{r}
cc_query <- c("SELECT
  publisher,
  COUNT(DISTINCT(DOI)) AS n
FROM
  `api-project-764811344545.cr_history.cr_apr18`,
  UNNEST(license) AS license
WHERE
  REGEXP_CONTAINS(license.URL, 'creativecommons')
GROUP BY
  publisher
ORDER BY
  n DESC
LIMIT
  10")

bgschol_query(my_con, cc_query)
```
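If the same count is needed for several snapshots, the table name can be filled into the query text. The sketch below uses base R's `sprintf()`. `cc_query_for()` is a hypothetical helper, and `cr_apr18` is the only table name taken from this README. Because the table name is pasted into the SQL, the helper accepts only letters, digits and underscores.

```r
# Hypothetical helper (not part of bqschol): build the Creative Commons
# query for a given snapshot table in the cr_history dataset.
cc_query_for <- function(table) {
  # Guard against SQL injection: allow simple table names only.
  stopifnot(is.character(table), length(table) == 1,
            grepl("^[A-Za-z0-9_]+$", table))
  sprintf("SELECT
  publisher,
  COUNT(DISTINCT(DOI)) AS n
FROM
  `api-project-764811344545.cr_history.%s`,
  UNNEST(license) AS license
WHERE
  REGEXP_CONTAINS(license.URL, 'creativecommons')
GROUP BY
  publisher
ORDER BY
  n DESC
LIMIT
  10", table)
}

cc_query_apr18 <- cc_query_for("cr_apr18")
cat(cc_query_apr18)
```

The resulting string can be passed to `bgschol_query()` in the same way as `cc_query`.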

`bgschol_execute()` is used when tables are to be created or dropped in BigQuery.
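The difference between retrieving results and executing a statement can be tried out locally. The sketch below uses an in-memory SQLite database through DBI as a stand-in for BigQuery. It assumes that the DBI and RSQLite packages are installed, and that `bgschol_query()` and `bgschol_execute()` follow the same split as `DBI::dbGetQuery()` and `DBI::dbExecute()`, which this README does not confirm. The table names and records are made up.

```r
# Stand-in connection: in-memory SQLite instead of BigQuery.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

DBI::dbWriteTable(con, "cr_demo", data.frame(
  doi = c("10.1/a", "10.1/b", "10.2/c"),
  publisher = c("Pub A", "Pub A", "Pub B")
))

# Statement without a result set: create a new table from a query.
DBI::dbExecute(con, "CREATE TABLE pub_counts AS
  SELECT publisher, COUNT(DISTINCT doi) AS n
  FROM cr_demo
  GROUP BY publisher")

# Query with a result set: fetch the rows as a data frame.
pub_counts <- DBI::dbGetQuery(
  con, "SELECT publisher, n FROM pub_counts ORDER BY n DESC")

# Drop the table again and list what is left.
DBI::dbExecute(con, "DROP TABLE pub_counts")
remaining <- DBI::dbListTables(con)

DBI::dbDisconnect(con)
```

Statements such as `CREATE TABLE` and `DROP TABLE` change the database but return no rows, which is why they go through the execute path rather than the query path.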