https://github.com/sdruskat/arxiv-publication-metadata
A Snakemake workflow to produce accessible datasets for ArXiv publication metadata
- Host: GitHub
- URL: https://github.com/sdruskat/arxiv-publication-metadata
- Owner: sdruskat
- Created: 2024-04-29T19:58:07.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-06-06T16:25:26.000Z (12 months ago)
- Last Synced: 2024-06-06T18:28:10.878Z (12 months ago)
- Language: Python
- Size: 52.7 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSES/CC0-1.0.txt
- Citation: CITATION.cff
[Snakemake](https://snakemake.github.io)
[Project status: inactive](https://www.repostatus.org/#inactive)
# Snakemake workflow: Extract LUTs from ArXiv OAI-PMH XML
Snakemake workflow to extract metadata from ArXiv OAI-PMH XML
harvested with [`metha`](https://github.com/miku/metha),
and write it to JSON lookup tables (LUTs) for better accessibility.
## Documentation
The technical documentation and description of outputs is in [workflow/documentation.md](workflow/documentation.md).
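
As a rough illustration of the kind of transformation the workflow performs, the following Python sketch parses a small OAI-PMH response and turns it into a JSON lookup table keyed by record identifier. The sample record, the choice of the `oai_dc` metadata format, the function name `build_lookup_table`, and the output fields are all assumptions made for illustration; the workflow's actual inputs and outputs are described in the technical documentation.

```python
import json
import xml.etree.ElementTree as ET

# Namespaces used by OAI-PMH responses carrying Dublin Core (oai_dc) metadata.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical, hand-written sample record; not real harvested data.
SAMPLE_XML = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:arXiv.org:0000.00000</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example title</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
          <dc:creator>Roe, Richard</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def build_lookup_table(xml_text):
    """Map each record identifier to its title and list of creators."""
    root = ET.fromstring(xml_text)
    table = {}
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        if identifier is None:
            continue  # skip records without an identifier
        table[identifier] = {
            "title": record.findtext(".//dc:title", namespaces=NS),
            "creators": [c.text for c in record.iterfind(".//dc:creator", NS)],
        }
    return table


lookup = build_lookup_table(SAMPLE_XML)
print(json.dumps(lookup, indent=2))
```

Keying the table by identifier makes a single record retrievable without re-parsing the XML. The real lookup tables may use different keys and fields.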
## Running the workflow
You need to have `conda` installed to create and activate a new environment.
```bash
conda env create -n arxiv-metadata --file conda-environment.yaml
conda activate arxiv-metadata
```
You also need to get an access token from [Zenodo](https://zenodo.org) and assign it to the following
two environment variables:
```shell
export SNAKEMAKE_STORAGE_ZENODO_ACCESS_TOKEN=
export SNAKEMAKE_STORAGE_ZENODO_RESTRICTED_ACCESS_TOKEN=
```
Run with `--keep-storage-local-copies` to avoid downloading the same resources repeatedly.
Also run with `--software-deployment-method conda` to use global conda packages.
```shell
snakemake --keep-storage-local-copies --software-deployment-method conda -c
```
You can use [`run.sh`](run.sh) to run the workflow this way with 12 cores.
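
Because the workflow expects both Zenodo token variables to be set, a small pre-flight check can catch a missing one before a run starts. The sketch below is not part of the workflow; the helper name `missing_tokens` is hypothetical, and only the two variable names are taken from the instructions above.

```python
import os

# Variable names taken from the setup instructions in this README.
REQUIRED_TOKENS = (
    "SNAKEMAKE_STORAGE_ZENODO_ACCESS_TOKEN",
    "SNAKEMAKE_STORAGE_ZENODO_RESTRICTED_ACCESS_TOKEN",
)


def missing_tokens(env=None):
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_TOKENS if not env.get(name, "").strip()]


# Report the state of the current environment without aborting.
missing = missing_tokens()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("Both Zenodo tokens are set.")
```

Passing a plain dictionary instead of the real environment, as in `missing_tokens({})`, makes the check easy to try out without touching your shell configuration.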
# Citation
If you use this workflow in your work, please cite it using the metadata provided in [`CITATION.cff`](CITATION.cff).
# License
This work is licensed as specified in the [REUSE 3.0 Specification](https://reuse.software/spec/).
Please consult the license information of the individual files.