https://github.com/naustica/openalex
Repository containing scripts for importing OpenAlex snapshots into BigQuery
- Host: GitHub
- URL: https://github.com/naustica/openalex
- Owner: naustica
- License: mit
- Created: 2022-01-23T15:34:08.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2025-04-11T11:51:22.000Z (6 months ago)
- Last Synced: 2025-04-11T13:17:05.632Z (6 months ago)
- Topics: bigquery, openalex, python, scholarly-metadata
- Language: Python
- Homepage:
- Size: 68.4 KB
- Stars: 10
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md
README
# Workflow for Processing and Loading OpenAlex data into Google BigQuery
This repository contains instructions on how to extract and transform OpenAlex data for data analysis with Google BigQuery.
## Requirements
The following tools are required for this workflow.
- [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html)
- [Python3](https://www.python.org)
- [gsutil](https://pypi.org/project/gsutil/)

## Download Snapshot
OpenAlex snapshots are available through AWS. Instructions for downloading
can be found here: https://docs.openalex.org/download-all-data/download-to-your-machine.

```bash
$ aws s3 sync 's3://openalex' 'openalex-snapshot' --no-sign-request
```
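For scripted downloads, the snapshot can also be fetched from Python with `boto3`. This is only a sketch and not part of this repository; the `data/works/` prefix and the local target directory are assumptions based on the snapshot layout.

```python
from pathlib import Path

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous S3 client, mirroring the CLI's --no-sign-request flag.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

def download_prefix(prefix: str, target: str = "openalex-snapshot") -> None:
    """Download every object under the given prefix of the openalex bucket."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket="openalex", Prefix=prefix):
        for obj in page.get("Contents", []):
            local_path = Path(target) / obj["Key"]
            local_path.parent.mkdir(parents=True, exist_ok=True)
            s3.download_file("openalex", obj["Key"], str(local_path))

# e.g. fetch only the works entity
download_prefix("data/works/")
```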
## Data transformation

To reduce the size of the data stored in BigQuery, some data transformation
is applied to the `works` entity. Data transformation is
carried out on the High Performance Cluster of the
[GWDG Göttingen](https://gwdg.de/en/hpc/). However, you can also
use the script on other servers with only minor adjustments. Entities
like `authors`, `publishers`, `institutions`, `funders` and `sources`
are not affected by the data transformation step.

```bash
$ sbatch openalex_works_hpc.sh
```
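The actual transformation is performed by the repository's scripts. Purely as an illustration, a minimal per-record version of the changes described in the Notes below (dropping `mesh`, `related_works` and `concepts`, and replacing `abstract_inverted_index` with a `has_abstract` flag) could look like this; the function and file handling are made up for the example.

```python
import gzip
import json

DROPPED_FIELDS = ("mesh", "related_works", "concepts")

def transform_work(record: dict) -> dict:
    """Slim down a single works record before loading it into BigQuery."""
    for field in DROPPED_FIELDS:
        record.pop(field, None)
    # Replace the bulky inverted index with a simple boolean flag.
    record["has_abstract"] = record.pop("abstract_inverted_index", None) is not None
    return record

def transform_file(in_path: str, out_path: str) -> None:
    """Rewrite a gzipped JSON Lines file with the transformation applied."""
    with gzip.open(in_path, "rt") as fin, gzip.open(out_path, "wt") as fout:
        for line in fin:
            fout.write(json.dumps(transform_work(json.loads(line))) + "\n")
```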
## Uploading Files to Google Bucket

Files can be uploaded to a Google Bucket using `gsutil`. Note that only
data in the `works` entity has been transformed. All other data can be found
in `openalex-snapshot/data`.

```bash
$ gsutil -m cp -r /scratch/users/haupka/works gs://bigschol
```
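If you prefer to stay in Python rather than use `gsutil`, a rough equivalent with the `google-cloud-storage` client might look like the sketch below. It assumes application default credentials are configured and reuses the bucket and path from the command above.

```python
from pathlib import Path

from google.cloud import storage

# Assumes application default credentials with write access to the bucket.
client = storage.Client()
bucket = client.bucket("bigschol")

local_dir = Path("/scratch/users/haupka/works")
for path in sorted(local_dir.glob("*.gz")):
    # Mirror the local file layout under the works/ prefix in the bucket.
    blob = bucket.blob(f"works/{path.name}")
    blob.upload_from_filename(str(path))
    print(f"uploaded {path.name}")
```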
## Creating a BigQuery Table

Use `bq load` to create a table in BigQuery with data stored in a
Google Bucket. Schemas for the tables can be found [here](schemas).

```bash
$ bq load --ignore_unknown_values --source_format=NEWLINE_DELIMITED_JSON subugoe-collaborative:openalex.works gs://bigschol/works/*.gz schema_openalex_work.json
```
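A roughly equivalent load can also be triggered from Python with the `google-cloud-bigquery` client. This is a sketch, not part of the repository; it reuses the table ID, source URI and schema file from the `bq load` command above and assumes the schema file is available locally.

```python
from google.cloud import bigquery

# Assumes application default credentials with access to the target project.
client = bigquery.Client(project="subugoe-collaborative")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    ignore_unknown_values=True,
    schema=client.schema_from_json("schema_openalex_work.json"),
)

load_job = client.load_table_from_uri(
    "gs://bigschol/works/*.gz",
    "subugoe-collaborative.openalex.works",
    job_config=job_config,
)
load_job.result()  # block until the load job finishes
print(f"Loaded {load_job.output_rows} rows")
```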
## Notes

- The following fields are not included in the `works` schema:
`mesh`, `related_works`, `concepts`.
- An additional field `has_abstract` is added during the data
transformation step that replaces the field `abstract_inverted_index`.