Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/kylase/annotated-reference-strings
dataset
- Host: GitHub
- URL: https://github.com/kylase/annotated-reference-strings
- Owner: kylase
- License: mit
- Created: 2021-06-27T10:53:55.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2021-10-10T12:00:50.000Z (about 3 years ago)
- Last Synced: 2023-03-25T09:48:07.149Z (over 1 year ago)
- Topics: dataset
- Language: Python
- Homepage: https://kylase.github.io/annotated-reference-strings/
- Size: 406 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: CITATION.cff
# Annotated Reference Strings Dataset
## Introduction
The `annotated_reference_strings` dataset consists of millions of reference strings synthesized in _at most 17 CSL styles_ using a CSL processor ([`citeproc-js`](https://github.com/Juris-M/citeproc-js)), with each short sequence of tokens (segment) annotated with the variable it is derived from.
This library provides utilities to parse a raw annotated string into a sequence of (token, label) tuples.
For more information on the library and the dataset, refer to the [documentation](https://kylase.github.io/annotated_reference_strings/).
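
To illustrate what "a sequence of tuples of token and its label" can look like, here is a minimal sketch. It is not this library's API: the function name `parse_annotated`, the `other` label, and the XML-like `<label>...</label>` markup are all assumptions made for the example, and the dataset's real annotation syntax may differ (see the documentation).

```python
import re
from typing import List, Tuple

# Hypothetical markup: <label>segment text</label>.
# The real dataset's annotation syntax may differ.
SEGMENT_RE = re.compile(r"<(?P<label>[\w-]+)>(?P<text>.*?)</(?P=label)>", re.DOTALL)


def parse_annotated(raw: str, outside_label: str = "other") -> List[Tuple[str, str]]:
    """Split a tagged reference string into (token, label) pairs."""
    pairs: List[Tuple[str, str]] = []
    cursor = 0
    for match in SEGMENT_RE.finditer(raw):
        # Text between tagged segments (punctuation, connectors) gets the fallback label.
        for token in raw[cursor:match.start()].split():
            pairs.append((token, outside_label))
        # Every whitespace-separated token inside a segment inherits the segment's label.
        for token in match.group("text").split():
            pairs.append((token, match.group("label")))
        cursor = match.end()
    # Trailing text after the last tagged segment.
    for token in raw[cursor:].split():
        pairs.append((token, outside_label))
    return pairs


example = "<author>Kee, Y. C.</author> <issued>(2021)</issued>. <title>Citation parsing</title>."
for token, label in parse_annotated(example):
    print(f"{token}\t{label}")
```

Whitespace tokenization is used only to keep the sketch short; a real pipeline would likely use the tokenizer that matches the downstream citation parser.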
## Obtaining the dataset
The dataset was prepared at the National University of Singapore (NUS), School of Computing (SoC), [Web Information Retrieval / Natural Language Processing Group (WING)](https://wing.comp.nus.edu.sg/) as part of a Master's project.
The dataset is bundled into separate files, so you can download it in part or in full from either of two sources:
- [NUS SoC's Google Drive (Source of truth)](https://drive.google.com/drive/folders/1xtsdzilLMy7PyfWgbhoPIUJ1YwMX9Qaz?usp=sharing)
- [Hugging Face dataset repository](https://huggingface.co/datasets/yuanchuan/annotated_reference_strings)

If you are downloading from Google Drive, it is faster to use [`gdown`](https://github.com/wkentaro/gdown), because Google zips the files when you download them through the web interface:
```shell
pip install gdown
gdown
```

If you are using Hugging Face's `datasets` library:
```python
from datasets import load_dataset
dataset = load_dataset('yuanchuan/annotated_reference_strings')
```

## Citing
If you are using the dataset, please cite the following:
```bibtex
@techreport{kee-nus-2021,
author = {Yuan Chuan Kee},
title = {Synthesis of a large dataset of annotated reference strings for developing citation parsers},
institution = {National University of Singapore},
year = {2021}
}
```