https://github.com/greenelab/opencitations
Processing OpenCitations Data
https://github.com/greenelab/opencitations
citations dataset digital-object-identifier doi notebook opencitations
Last synced: 5 months ago
JSON representation
Processing OpenCitations Data
- Host: GitHub
- URL: https://github.com/greenelab/opencitations
- Owner: greenelab
- License: cc0-1.0
- Created: 2017-08-16T18:46:57.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2017-08-17T20:22:44.000Z (about 8 years ago)
- Last Synced: 2025-04-09T04:41:40.967Z (6 months ago)
- Topics: citations, dataset, digital-object-identifier, doi, notebook, opencitations
- Language: Jupyter Notebook
- Homepage:
- Size: 20.5 KB
- Stars: 20
- Watchers: 7
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE.md
Awesome Lists containing this project
README
# Processing OpenCitations Data
This repository processes the [OpenCitations data](http://opencitations.net/download) to make it more user-friendly and concise.
The primary output is [`data/citations-doi.tsv.xz`](data/citations-doi.tsv.xz), which is a catalog of DOI-to-DOI citations.
The file is formated like:| source_doi | target_doi |
|------------|------------|
| 10.1002/14651858.cd002244.pub4 | 10.1001/archneur.1990.00530120057010 |
| 10.1002/14651858.cd002244.pub4 | 10.1002/14651858.cd002244 |
| 10.1002/14651858.cd002244.pub4 | 10.1002/14651858.cd002244.pub2 |
| 10.1002/14651858.cd012199 | 10.1001/jama.295.6.676 |
| 10.1002/14651858.cd012199 | 10.1001/jama.299.16.1937 |
| 10.1002/14651858.cd012199 | 10.1002/14651858.cd000371.pub6 |
| 10.1002/14651858.cd012199 | 10.1002/14651858.cd009382.pub2 |All DOIs are lowercase.
Quality control steps were performed on the DOIs to remove clearly incorrect DOIs.
However, for best results, we recommend users intersect these DOIs with a catalog of valid DOIs to remove any remaining errant DOIs.## Execution
The downloading and processing of the OpenCitations data is accomplished by sequentially running the notebooks in this repository.
To update the pipeline to use newer versions of OpenCitations data, one should update the figshare article IDs in [`01.download.ipynb`](01.download.ipynb).## Environment
This repository uses [conda](http://conda.pydata.org/docs/) to manage its environment as specified in [`environment.yml`](environment.yml).
Install the environment with:```sh
conda env create --file=environment.yml
```Then use `source activate opencitations` and `source deactivate` to activate or deactivate the environment.
On windows, use `activate opencitations` and `deactivate` instead.In addition, to the conda environment, users will need to install the [Disk ARchive](http://dar.linux.free.fr/) (`dar`) utility to their system.