{"id":24601368,"url":"https://github.com/naustica/openalex","last_synced_at":"2025-04-30T22:51:50.456Z","repository":{"id":79876876,"uuid":"451147752","full_name":"naustica/openalex","owner":"naustica","description":"Repository containing scripts for importing OpenAlex snapshots into BigQuery","archived":false,"fork":false,"pushed_at":"2025-04-11T11:51:22.000Z","size":70,"stargazers_count":10,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-11T13:17:05.632Z","etag":null,"topics":["bigquery","openalex","python","scholarly-metadata"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/naustica.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-01-23T15:34:08.000Z","updated_at":"2025-04-11T11:51:26.000Z","dependencies_parsed_at":null,"dependency_job_id":"0d8416bf-abec-4a66-833a-8d3c03750510","html_url":"https://github.com/naustica/openalex","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/naustica%2Fopenalex","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/naustica%2Fopenalex/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/naustica%2Fopenalex/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/naustica%2Fopenalex/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/n
austica","download_url":"https://codeload.github.com/naustica/openalex/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251795424,"owners_count":21645020,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigquery","openalex","python","scholarly-metadata"],"created_at":"2025-01-24T14:48:52.168Z","updated_at":"2025-04-30T22:51:50.450Z","avatar_url":"https://github.com/naustica.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Workflow for Processing and Loading OpenAlex data into Google BigQuery\n\nThis repository contains instructions on how to extract and transform OpenAlex data for data analysis with Google BigQuery.\n\n## Requirements\n\nThe following packages are required for this workflow.\n\n- [AWS](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html)\n- [Python3](https://www.python.org)\n  - [gsutil](https://pypi.org/project/gsutil/)\n\n\n## Download Snapshot\n\nOpenAlex snapshots are available through AWS. Instructions for downloading\ncan be found here: https://docs.openalex.org/download-all-data/download-to-your-machine.\n\n```bash\n$ aws s3 sync 's3://openalex' 'openalex-snapshot' --no-sign-request\n```\n\n## Data transformation\n\nTo reduce the size of the data stored in BigQuery, some data transformation\nis applied to the `works` entity. Data transformation is\ncarried out on the High Performance Cluster of the \n[GWDG Göttingen](https://gwdg.de/en/hpc/). 
However, you can also \nuse the script on other servers with only minor adjustments. Entities \nlike `authors`, `publishers`, `institutions`, `funders` and `sources` \nare not affected by the data transformation step.\n\n```bash\n$ sbatch openalex_works_hpc.sh\n```\n\n## Uploading Files to Google Bucket\n\nFiles can be uploaded to a Google Bucket using `gsutil`. Note that only \ndata in the `works` entity has been transformed. All other data can be found \nin `openalex-snapshot/data`.\n\n```bash\n$ gsutil -m cp -r /scratch/users/haupka/works gs://bigschol\n```\n\n## Creating a BigQuery Table\n\nUse `bq load` to create a table in BigQuery with data stored in a \nGoogle Bucket. Schemas for the tables can be found [here](schemas).\n\n```bash\n$ bq load --ignore_unknown_values --source_format=NEWLINE_DELIMITED_JSON subugoe-collaborative:openalex.works gs://bigschol/works/*.gz schema_openalex_work.json\n```\n\n## Notes\n\n- The following fields are not included in the `works` schema:\n`mesh`, `related_works`, `concepts`.\n- The data transformation step adds a field `has_abstract` \nthat replaces the field `abstract_inverted_index`.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnaustica%2Fopenalex","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnaustica%2Fopenalex","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnaustica%2Fopenalex/lists"}