https://github.com/naustica/semantic_scholar_bq
Repository containing scripts for importing Semantic Scholar snapshots into BigQuery
bigquery python scholarly-metadata semantic-scholar
- Host: GitHub
- URL: https://github.com/naustica/semantic_scholar_bq
- Owner: naustica
- Created: 2025-03-03T13:28:15.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-09-10T10:37:46.000Z (9 months ago)
- Last Synced: 2025-09-10T14:10:49.034Z (9 months ago)
- Topics: bigquery, python, scholarly-metadata, semantic-scholar
- Language: Python
- Homepage:
- Size: 16.6 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Workflow for Processing and Loading Semantic Scholar snapshots into Google BigQuery
This repository contains instructions on how to extract and transform Semantic Scholar data for data analysis with Google BigQuery.
## Requirements
- [Python3](https://www.python.org)
- [gsutil](https://pypi.org/project/gsutil/)
- [requests](https://pypi.org/project/requests/)
## Download snapshot
Downloading datasets from Semantic Scholar requires an API key. Export it as the environment variable `s2_key`, replacing the placeholder with your own key:
```bash
$ export s2_key=YOUR_API_KEY
```
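Before starting a long download, it can be useful to confirm that the key is actually visible to Python. The following is a minimal sketch, not part of the repository; the helper name `read_api_key` is hypothetical, and only the variable name `s2_key` comes from the instructions above.

```python
import os


def read_api_key(env=None, name="s2_key"):
    """Return the API key stored under `name`, or raise a clear error.

    `read_api_key` is a hypothetical helper for illustration only.
    """
    # Use the real process environment unless a mapping is passed in.
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {name!r} is not set or is empty")
    return key
```

Calling `read_api_key()` at the top of a script fails immediately with a readable message instead of a `KeyError` partway through a run.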
To download all datasets within a Semantic Scholar data snapshot, use:
```bash
$ python3 semantic_scholar.py --logging=info
```
To download individual datasets within a Semantic Scholar data snapshot, use:
```python
import os

from semantic_scholar import SemanticScholar

# Read the API key from the environment and pick the snapshot to download.
s2 = SemanticScholar(
    api_key=os.environ['s2_key'],
    download_path='/scratch/users/haupka/semantic-scholar-snapshot',
    snapshot_date='2025-02-25',
)
s2.download_entity(entity='papers')
s2.download_entity(entity='publication-venues')
```
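When several datasets are needed, the calls above can be wrapped in a loop. The sketch below assumes only what the example shows, namely that the client object has a `download_entity(entity=...)` method. The function `download_entities` is hypothetical and not part of the repository, and it assumes that a failed download raises an exception.

```python
import logging

logger = logging.getLogger(__name__)


def download_entities(client, entities):
    """Call client.download_entity(entity=...) for each name.

    Returns the list of entity names whose download raised an exception.
    Hypothetical helper; `client` is any object with a `download_entity` method.
    """
    failed = []
    for entity in entities:
        try:
            client.download_entity(entity=entity)
        except Exception as exc:
            # Keep going so that one failing dataset does not stop the rest.
            logger.warning("Download of %s failed: %s", entity, exc)
            failed.append(entity)
    return failed
```

With the `s2` object from the example above, `download_entities(s2, ['papers', 'publication-venues'])` would return an empty list if both downloads complete without raising.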
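## Uploading Files to a Google Cloud Storage Bucket
Copy the downloaded snapshot directory to a bucket. The `-m` option runs the transfer in parallel, and `-r` copies the directory recursively.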
```bash
$ gsutil -m cp -r /scratch/users/haupka/semantic-scholar-snapshot gs://bigschol
```
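If the upload should run from a Python script rather than the shell, the same command can be assembled and passed to `subprocess`. This is a sketch under the assumption that `gsutil` is installed and authenticated; the function names `build_upload_command` and `upload` are hypothetical.

```python
import shlex
import subprocess


def build_upload_command(local_dir, bucket):
    """Return the gsutil command from the section above as an argument list."""
    return ["gsutil", "-m", "cp", "-r", local_dir, f"gs://{bucket}"]


def upload(local_dir, bucket, dry_run=True):
    """Run the upload, or only return the command line when dry_run is True."""
    cmd = build_upload_command(local_dir, bucket)
    if not dry_run:
        # check=True raises CalledProcessError if gsutil exits with an error.
        subprocess.run(cmd, check=True)
    return shlex.join(cmd)
```

The default `dry_run=True` returns the command line without executing anything, which makes it easy to check the paths before a large transfer starts.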
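## Creating a BigQuery Table
Load the compressed JSON Lines files from the bucket into a BigQuery table, using `s2_papers_schema.json` as the table schema. With `--ignore_unknown_values`, fields that are not in the schema are skipped instead of causing the load to fail.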
```bash
$ bq load --ignore_unknown_values --source_format=NEWLINE_DELIMITED_JSON subugoe-collaborative.semantic_scholar.papers gs://bigschol/semantic-scholar-snapshot/papers/*.jsonl.gz s2_papers_schema.json
```
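Because `--ignore_unknown_values` drops fields that are absent from the schema, it can help to check which top-level fields a downloaded file actually contains before loading it. The following sketch uses only the Python standard library and assumes the files are gzip-compressed JSON Lines, as the `*.jsonl.gz` pattern above suggests. The function name `count_fields` is hypothetical.

```python
import gzip
import json
from collections import Counter


def count_fields(path, limit=1000):
    """Count how often each top-level key appears in the first `limit` lines."""
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for index, line in enumerate(handle):
            if index >= limit:
                break
            if line.strip():  # skip blank lines
                counts.update(json.loads(line).keys())
    return counts
```

Comparing the keys returned for one local file with the field names in the schema file shows which fields would be ignored during the load.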