https://github.com/IllDepence/unarXive

A data set based on all arXiv publications, pre-processed for NLP, including structured full-text and citation network

Host: GitHub
URL: https://github.com/IllDepence/unarXive
Owner: IllDepence
License: MIT
Created: 2019-01-29T09:38:29.000Z (almost 7 years ago)
Default Branch: master
Last Pushed: 2024-09-28T22:05:21.000Z (about 1 year ago)
Last Synced: 2024-11-27T03:34:54.626Z (11 months ago)
Language: Python
Homepage:
Size: 15.5 MB
Stars: 259
Watchers: 6
Forks: 19
Open Issues: 9
Metadata Files:
- Readme: README.md
- License: LICENSE

# unarXive

**Access**

* Data Set on Zenodo: [full](https://doi.org/10.5281/zenodo.7752754) / [permissively licensed subset](https://doi.org/10.5281/zenodo.7752615)

* [Data Sample](doc/unarXive_data_sample.tar.gz)

* ML Data on Hugging Face: [citation recommendation](https://huggingface.co/datasets/saier/unarXive_citrec) / [IMRaD classification](https://huggingface.co/datasets/saier/unarXive_imrad_clf)

**Documentation**

* Publications

    * [*Scientometrics*](http://link.springer.com/article/10.1007/s11192-020-03382-z) ([author copy](https://doi.org/10.5445/IR/1000118786/pre)) (2020)

    * [*JCDL 2023*](https://doi.org/10.1109/JCDL57899.2023.00020) ([author copy](https://doi.org/10.48550/arXiv.2303.14957)) (2023)

* [Data Format](#data)

* [Usage](#usage)

* [Development](#development)

* [Cite](#cite-as)

# Data

**unarXive contains**

* 1.9 M structured paper full-texts, containing

    * 63 M references (28 M linked to OpenAlex)

    * 134 M in-text citation markers (65 M linked)

    * 9 M figure captions

    * 2 M table captions

    * 742 M pieces of mathematical notation preserved as LaTeX
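As a rough orientation, the linked shares implied by the counts above can be worked out directly. The snippet below is only a back-of-the-envelope sketch: the inputs are the rounded figures (in millions) from the list, and `linked_share` is a hypothetical helper name, so the percentages are approximate.

```python
# Rounded counts in millions, copied from the list above.
REFERENCES_TOTAL, REFERENCES_LINKED = 63, 28
MARKERS_TOTAL, MARKERS_LINKED = 134, 65


def linked_share(linked: float, total: float) -> float:
    """Return the linked fraction as a percentage."""
    return 100 * linked / total


print(f"references linked to OpenAlex: {linked_share(REFERENCES_LINKED, REFERENCES_TOTAL):.1f}%")  # 44.4%
print(f"in-text citation markers linked: {linked_share(MARKERS_LINKED, MARKERS_TOTAL):.1f}%")      # 48.5%
```

In other words, a little under half of the references and of the in-text citation markers carry a link.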

Comprehensive documentation of the **data format** is available [here](doc/data_format.md).

You can find a **data sample** [here](doc/unarXive_data_sample.tar.gz).
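To get a first look at the data sample without unpacking it, something along the following lines may help. This is a minimal sketch, not part of the unarXive tooling: it assumes the archive contains JSON Lines files ending in `.jsonl` with one paper record per line (check the data format documentation for the authoritative layout), and `iter_sample_records` is a hypothetical helper name.

```python
import json
import tarfile
from typing import Iterator


def iter_sample_records(tar_path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of every *.jsonl member."""
    with tarfile.open(tar_path, mode="r:gz") as archive:
        for member in archive:
            # Assumption: paper records live in regular files ending in ".jsonl".
            if not (member.isfile() and member.name.endswith(".jsonl")):
                continue
            handle = archive.extractfile(member)
            if handle is None:
                continue
            for raw_line in handle:
                line = raw_line.strip()
                if line:  # skip blank lines
                    yield json.loads(line)
```

For instance, `next(iter_sample_records('doc/unarXive_data_sample.tar.gz')).keys()` would list the top-level fields of the first record, which avoids relying on any particular field name.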

# Usage

### Hugging Face Datasets

If you want to use unarXive for *citation recommendation* or *IMRaD classification*, you can simply use our Hugging Face datasets:

* [Citation Recommendation](https://huggingface.co/datasets/saier/unarxive_citrec)

* [IMRaD Classification](https://huggingface.co/datasets/saier/unarXive_imrad_clf)

For example, in the case of citation recommendation:

```
from datasets import load_dataset

citrec_data = load_dataset('saier/unarxive_citrec')
citrec_data = citrec_data.class_encode_column('label')  # assign target label column
citrec_data = citrec_data.remove_columns('_id')         # remove sample ID column
```

# Development

For instructions on how to re-create or extend unarXive, see [src/](src/).

**Versions**

* Current release (1991–2022): see [*Access* section above](#unarxive)

* Previous releases ([old format](https://github.com/IllDepence/unarXive/tree/legacy_2020/)):

    * [1991–Jul 2020](https://zenodo.org/record/4313164)

    * [1991–2019](https://zenodo.org/record/3385851)

**Development Status**

See [issues](https://github.com/IllDepence/unarXive/issues).

## Cite as

**Current version**

```
@inproceedings{Saier2023unarXive,
  author        = {Saier, Tarek and Krause, Johan and F\"{a}rber, Michael},
  title         = {{unarXive 2022: All arXiv Publications Pre-Processed for NLP, Including Structured Full-Text and Citation Network}},
  booktitle     = {2023 ACM/IEEE Joint Conference on Digital Libraries (JCDL)},
  year          = {2023},
  pages         = {66--70},
  month         = jun,
  doi           = {10.1109/JCDL57899.2023.00020},
  publisher     = {IEEE Computer Society},
  address       = {Los Alamitos, CA, USA},
}
```

**Initial publication**

```
@article{Saier2020unarXive,
  author        = {Saier, Tarek and F{\"{a}}rber, Michael},
  title         = {{unarXive: A Large Scholarly Data Set with Publications’ Full-Text, Annotated In-Text Citations, and Links to Metadata}},
  journal       = {Scientometrics},
  year          = {2020},
  volume        = {125},
  number        = {3},
  pages         = {3085--3108},
  month         = dec,
  issn          = {1588-2861},
  doi           = {10.1007/s11192-020-03382-z}
}
```
