https://github.com/kylase/annotated-reference-strings

Host: GitHub
URL: https://github.com/kylase/annotated-reference-strings
Owner: kylase
License: mit
Created: 2021-06-27T10:53:55.000Z (over 3 years ago)
Default Branch: main
Last Pushed: 2021-10-10T12:00:50.000Z (over 3 years ago)
Last Synced: 2024-11-01T07:42:49.550Z (2 months ago)
Topics: dataset
Language: Python
Homepage: https://kylase.github.io/annotated-reference-strings/
Size: 406 KB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: CITATION.cff

# Annotated Reference Strings Dataset

## Introduction

The `annotated_reference_strings` dataset consists of millions of reference strings synthesized in _at most 17 CSL styles_ using a CSL processor ([`citeproc-js`](https://github.com/Juris-M/citeproc-js)), with each short sequence of tokens (segment) annotated with the variable it is derived from.

This library provides utilities to parse a raw annotated string into a sequence of (token, label) tuples.
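To make the idea of (token, label) tuples concrete, here is a minimal sketch that does not use this library's API. It assumes a hypothetical markup in which each segment is wrapped as `<label>text</label>`; the dataset's actual raw format may differ, so check the documentation before relying on it. The names `SEGMENT`, `parse_annotated`, and the `other` label are illustrative only.

```python
import re
from typing import List, Tuple

# ASSUMPTION: segments look like <label>text</label>. This markup is
# hypothetical and only illustrates the token/label idea.
SEGMENT = re.compile(r"<(?P<label>[\w-]+)>(?P<text>.*?)</(?P=label)>", re.DOTALL)


def parse_annotated(raw: str, outside: str = "other") -> List[Tuple[str, str]]:
    """Split a raw annotated string into (token, label) pairs.

    Tokens are whitespace-separated. Tokens inside a segment take the
    segment's label; tokens between segments take the `outside` label.
    """
    pairs: List[Tuple[str, str]] = []
    pos = 0
    for match in SEGMENT.finditer(raw):
        # Text between the previous segment and this one (punctuation etc.).
        pairs += [(tok, outside) for tok in raw[pos:match.start()].split()]
        # Text inside the segment, labelled with the segment's variable.
        pairs += [(tok, match.group("label")) for tok in match.group("text").split()]
        pos = match.end()
    # Any trailing text after the last segment.
    pairs += [(tok, outside) for tok in raw[pos:].split()]
    return pairs


example = "<author>Kee, Y. C.</author> (<issued>2021</issued>). <title>Annotated strings</title>."
for token, label in parse_annotated(example):
    print(f"{token}\t{label}")
```

A sequence in this shape is what token-classification models for citation parsing typically consume, with one label per token.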

For more information on the library and the dataset, refer to the [documentation](https://kylase.github.io/annotated_reference_strings/).

## Obtaining the dataset

The dataset was prepared at the National University of Singapore (NUS), School of Computing (SoC), [Web Information Retrieval / Natural Language Processing Group (WING)](https://wing.comp.nus.edu.sg/) as part of a Master's project.

The dataset is bundled in separate files, so you can obtain it in part or in full from either of two sources:

- [NUS SoC's Google Drive (Source of truth)](https://drive.google.com/drive/folders/1xtsdzilLMy7PyfWgbhoPIUJ1YwMX9Qaz?usp=sharing)

- [Hugging Face dataset repository](https://huggingface.co/datasets/yuanchuan/annotated_reference_strings)

If you are downloading from Google Drive, it is faster to use [`gdown`](https://github.com/wkentaro/gdown), because Google zips the files when you download them through the web interface:

```shell
pip install gdown
gdown 
```

If you are using Hugging Face's `datasets` library:

```python
from datasets import load_dataset

dataset = load_dataset('yuanchuan/annotated_reference_strings')
```

## Citing

If you are using the dataset, please cite the following:

```bibtex
@techreport{kee-nus-2021,
    author = {Yuan Chuan Kee},
    title = {Synthesis of a large dataset of annotated reference strings for developing citation parsers},
    institution = {National University of Singapore},
    year = {2021}
}
```