{"id":13583921,"url":"https://github.com/greenelab/crossref","last_synced_at":"2025-05-05T21:31:57.968Z","repository":{"id":79359277,"uuid":"85830423","full_name":"greenelab/crossref","owner":"greenelab","description":"Download metadata for all DOIs using the Crossref API","archived":false,"fork":false,"pushed_at":"2018-09-25T13:25:15.000Z","size":56,"stargazers_count":59,"open_issues_count":5,"forks_count":10,"subscribers_count":7,"default_branch":"master","last_synced_at":"2024-11-06T00:39:33.740Z","etag":null,"topics":["crossref","dataset","doi","metadata","mongodb","publishing","python"],"latest_commit_sha":null,"homepage":"https://doi.org/b48h","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc0-1.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/greenelab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2017-03-22T13:23:35.000Z","updated_at":"2024-07-01T19:35:01.000Z","dependencies_parsed_at":"2024-01-24T13:08:59.181Z","dependency_job_id":null,"html_url":"https://github.com/greenelab/crossref","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/greenelab%2Fcrossref","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/greenelab%2Fcrossref/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/greenelab%2Fcrossref/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/greenelab%2Fcrossref/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/greenelab","download_url":"https
://codeload.github.com/greenelab/crossref/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224470627,"owners_count":17316704,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crossref","dataset","doi","metadata","mongodb","publishing","python"],"created_at":"2024-08-01T15:03:53.980Z","updated_at":"2024-11-13T14:53:14.028Z","avatar_url":"https://github.com/greenelab.png","language":"Jupyter Notebook","funding_links":[],"categories":["Jupyter Notebook"],"sub_categories":[],"readme":"# Store and process the Crossref Database\n\nThis repository downloads Crossref metadata using the [Crossref API](https://github.com/CrossRef/rest-api-doc/blob/master/rest_api.md).\nThe items retrieved are stored in MongoDB to preserve their raw structure.\nThis design allows for flexible downstream analyses.\n\n## MongoDB\n\nMongoDB is run via [Docker](https://hub.docker.com/_/mongo/).\nIt's available on the host machine at http://localhost:27017/.\n\n```sh\ndocker run \\\n  --name=mongo-crossref \\\n  --publish=27017:27017 \\\n  --volume=`pwd`/mongo.db:/data/db \\\n  --rm \\\n  mongo:3.4.2\n```\n\n## Execution\n\n### works\n\nWith mongo running, execute with the following commands:\n\n```sh\n# Download all works\n# To start fresh, use `--cursor=*`\n# If querying fails midway, you can extract the cursor of the\n# last successful query from the tail of query-works.log.\n# Then rerun download.py, passing the intermediate cursor\n# to --cursor instead of *.\npython download.py \\\n  --component=works \\\n  --batch-size=550 \\\n  
--log=logs/query-works.log \\\n  --cursor=*\n\n# Export mongodb works collection to JSON\nmongoexport \\\n  --db=crossref \\\n  --collection=works \\\n  | xz \u003e data/mongo-export/crossref-works.json.xz\n```\n\nSee [`data/mongo-export`](data/mongo-export) for more information on `crossref-works.json.xz`.\nNote that creating this file from the Crossref API takes several weeks.\nUsers are encouraged to use the cached version available on [figshare](https://doi.org/10.6084/m9.figshare.4816720) (see also [Other resources](#other-resources) below).\n\n[`1.works-to-dataframe.ipynb`](1.works-to-dataframe.ipynb) is a Jupyter notebook that extracts tabular datasets of works (TSVs), which are tracked using Git LFS:\n\n+ [`doi.tsv.xz`](data/doi.tsv.xz): a table where each row is a work, with columns for the DOI, type, and issued date.\n+ [`doi-to-issn.tsv.xz`](data/doi-to-issn.tsv.xz): a table where each row is a work (DOI) to journal (ISSN) mapping.\n\n### types\n\nWith mongo running, execute with the following command:\n\n```sh\npython download.py \\\n  --component=types \\\n  --log=logs/query-types.log\n```\n\n## Environment\n\nThis repository uses [conda](http://conda.pydata.org/docs/) to manage its environment as specified in [`environment.yml`](environment.yml).\nInstall the environment with:\n\n```sh\nconda env create --file=environment.yml\n```\n\nThen use `source activate crossref` and `source deactivate` to activate or deactivate the environment. 
On Windows, use `activate crossref` and `deactivate` instead.\n\n## Other resources\n\nIdeally, Crossref would provide a complete database dump, rather than requiring users to go through the inefficient process of querying the API for all works: see [CrossRef/rest-api-doc#271](https://github.com/CrossRef/rest-api-doc/issues/271).\nUntil then, users should check out the Crossref data currently hosted by this repository, whose query date is 2017-03-21, and its corresponding [figshare](https://doi.org/10.6084/m9.figshare.4816720.v1).\n\nFor users who need more recent data, Bryan Newbold [used this codebase](https://github.com/greenelab/crossref/issues/5) to create a MongoDB dump dated **January 2018** (query date of approximately 2018-01-10), which he uploaded to the [Internet Archive](https://archive.org/download/crossref_doi_dump_201801).\nHis output file `crossref-works.2018-01-21.json.xz` contains 93,585,242 DOIs and occupies 28.9 GB, compared to 87,542,370 DOIs and 7.0 GB for the `crossref-works.json.xz` dated 2017-03-21.\nThis increased size is presumably due to the addition of [I4OC](https://i4oc.org/ \"Initiative for Open Citations\") references to Crossref work records.\n\nBryan Newbold has also created a **September 2018** release, which is uploaded to the [Internet Archive](https://archive.org/download/crossref_doi_dump_201809).\nThis repository is currently seeking contributions to update the convenient TSV outputs based on the more recent database dumps.\n\nDaniel Ecer also downloaded the Crossref work metadata in January 2018, using the codebase at [elifesciences/datacapsule-crossref](https://github.com/elifesciences/datacapsule-crossref).\nHis database dump is available on [figshare](https://doi.org/10.6084/m9.figshare.5845554.v2 \"Crossref Works Dump - January 2018\").\nWhile the multi-part format of this dump is likely less convenient than the dumps produced by this repository, Daniel Ecer's analysis also exports a DOI-to-DOI table of citations/references 
[available here](https://doi.org/10.6084/m9.figshare.5849916.v1 \"Crossref Citation Links - January 2018\").\nThis citation catalog contains 314,785,303 citations ([summarized here](https://elifesci.org/crossref-data-notebook)) and is thus more comprehensive than the catalog available from [greenelab/opencitations](https://github.com/greenelab/opencitations).\n\n## Acknowledgements\n\nThis work is funded in part by the Gordon and Betty Moore Foundation's Data-Driven Discovery Initiative through Grant [GBMF4552](https://www.moore.org/grant-detail?grantId=GBMF4552) to [**@cgreene**](https://github.com/cgreene \"Casey Greene on GitHub\").\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgreenelab%2Fcrossref","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgreenelab%2Fcrossref","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgreenelab%2Fcrossref/lists"}