
# PG-19 Language Modelling Benchmark
This repository contains the PG-19 language modelling benchmark. It includes a
set of books extracted from the Project Gutenberg books library [1] that were
published before 1919, together with metadata of book titles and publication
dates.

Full dataset download link

PG-19 is over double the size of the Billion Word benchmark [2] and contains
documents that are 20X longer, on average, than the WikiText long-range language
modelling benchmark [3].

Books are partitioned into a `train`, `validation`, and `test` set. Book
metadata is stored in `metadata.csv` which contains
`(book_id, short_book_title, publication_date)`.
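
As an illustration, here is a minimal sketch of loading the metadata with pandas; the header-less, three-column layout is an assumption inferred from the tuple above:

```python
import pandas as pd

# Column names follow the tuple layout described above; this sketch
# assumes metadata.csv ships without a header row.
metadata = pd.read_csv(
    "metadata.csv",
    names=["book_id", "short_book_title", "publication_date"],
)

# Example: the ten earliest-published books in the corpus.
print(metadata.sort_values("publication_date").head(10))
```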

Unlike prior benchmarks, we do not constrain the vocabulary size ---
i.e. mapping rare words to an UNK token --- but instead release the data as an
open-vocabulary benchmark. The only processing of the text that has been applied
is the removal of boilerplate license text, and the mapping of offensive
discriminatory words as specified by Ofcom [4] to placeholder tokens. Users
are free to model the data at the character-level, subword-level, or via any
mechanism that can model an arbitrary string of text.
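
For example, because the vocabulary is unconstrained, even a trivial byte-level scheme is a valid way to consume the data; the sketch below is one illustrative choice, not something the benchmark prescribes:

```python
def encode_utf8_bytes(text: str) -> list[int]:
    """Map an arbitrary string to UTF-8 byte IDs (fixed vocabulary of 256)."""
    return list(text.encode("utf-8"))

def decode_utf8_bytes(ids: list[int]) -> str:
    """Invert the encoding; any text string round-trips losslessly."""
    return bytes(ids).decode("utf-8")

ids = encode_utf8_bytes("PG-19 is an open-vocabulary benchmark.")
assert decode_utf8_bytes(ids) == "PG-19 is an open-vocabulary benchmark."
```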

To compare models we propose to continue measuring word-level perplexity:
compute the total log-likelihood of the dataset (under any chosen subword
vocabulary or character-based scheme) and normalise it by the number of
word-level tokens given below in the dataset statistics table.
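
Concretely, a minimal sketch of that normalisation; the word count comes from the dataset statistics table below, while the log-likelihood value is purely hypothetical:

```python
import math

def word_level_perplexity(total_log_likelihood: float,
                          num_word_tokens: int) -> float:
    """Perplexity normalised by word-level token count.

    The log-likelihood may be accumulated under any subword or character
    scheme; dividing by the fixed word count keeps models comparable.
    """
    return math.exp(-total_log_likelihood / num_word_tokens)

# Hypothetical total natural-log likelihood over the PG-19 test set,
# normalised by its 6,966,499 word tokens (see statistics below).
print(word_level_perplexity(-25_000_000.0, 6_966_499))  # ≈ 36.2
```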

One could use this dataset to benchmark long-range language models, or to
pre-train for other natural language processing tasks that require long-range
reasoning, such as LAMBADA [5] or NarrativeQA [6]. We would not recommend
using this dataset to train a general-purpose language model (e.g. for a
production dialogue agent), due to the dated linguistic style of old texts and
the inherent biases present in historical writing.

### Dataset Statistics

|             | Train         | Validation | Test      |
| ----------- | ------------- | ---------- | --------- |
| Books       | 28,602        | 50         | 100       |
| Num. Tokens | 1,973,136,207 | 3,007,061  | 6,966,499 |

### Bibtex

```
@article{raecompressive2019,
  author  = {Rae, Jack W and Potapenko, Anna and Jayakumar, Siddhant M and
             Hillier, Chloe and Lillicrap, Timothy P},
  title   = {Compressive Transformers for Long-Range Sequence Modelling},
  journal = {arXiv preprint},
  url     = {https://arxiv.org/abs/1911.05507},
  year    = {2019},
}
```

### Dataset Metadata
The following table is necessary for this dataset to be indexed by search
engines such as Google Dataset Search.


| property | value |
| -------- | ----- |
| name | The PG-19 Language Modeling Benchmark |
| alternateName | PG-19 |
| url | https://github.com/deepmind/pg19 |
| sameAs | https://github.com/deepmind/pg19 |
| description | This repository contains the PG-19 dataset. It includes a set of books extracted from the Project Gutenberg books project (https://www.gutenberg.org) that were published before 1919. It also contains metadata of book titles and publication dates. |
| provider | DeepMind (https://en.wikipedia.org/wiki/DeepMind) |
| license | Apache License, Version 2.0 (https://www.apache.org/licenses/LICENSE-2.0.html) |
| citation | https://identifiers.org/arxiv:1911.05507 |

### Contact

If you have any questions, please contact Jack Rae.

### References


- [1] Project Gutenberg: https://www.gutenberg.org
- [2] Chelba et al., "One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling" (2013)
- [3] Merity et al., "Pointer Sentinel Mixture Models" (2016)
- [4] Ofcom offensive language guide
- [5] Paperno et al., "The LAMBADA dataset: Word prediction requiring a broad discourse context" (2016)
- [6] Kočiský et al., "The NarrativeQA Reading Comprehension Challenge" (2018)