Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/google-deepmind/pg19
Last synced: 8 days ago
- Host: GitHub
- URL: https://github.com/google-deepmind/pg19
- Owner: google-deepmind
- License: apache-2.0
- Created: 2019-09-30T15:43:25.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2020-02-25T19:43:55.000Z (over 4 years ago)
- Last Synced: 2024-08-01T13:27:41.179Z (3 months ago)
- Size: 759 KB
- Stars: 222
- Watchers: 10
- Forks: 18
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
README
# PG-19 Language Modelling Benchmark
This repository contains the PG-19 language modelling benchmark. It includes a
set of books extracted from the Project Gutenberg books library [1] that were
published before 1919. It also contains metadata of book titles and publication
dates.

PG-19 is more than double the size of the Billion Word benchmark [2] and contains
documents that are, on average, 20 times longer than those in the WikiText
long-range language modelling benchmark [3].

Books are partitioned into `train`, `validation`, and `test` sets. Book
metadata is stored in `metadata.csv`, which contains
`(book_id, short_book_title, publication_date)` rows.

Unlike prior benchmarks, we do not constrain the vocabulary size ---
i.e. mapping rare words to an UNK token --- but instead release the data as an
open-vocabulary benchmark. The only processing of the text that has been applied
is the removal of boilerplate license text, and the mapping of offensive
discriminatory words as specified by Ofcom [4] to placeholder tokens. Users
are free to model the data at the character-level, subword-level, or via any
mechanism that can model an arbitrary string of text.

To compare models, we propose to continue measuring word-level perplexity,
computed by normalising the total log-likelihood of the dataset (under any
chosen subword vocabulary or character-based scheme) by the number of
word-level tokens, as specified in the dataset statistics table below.

One could use this dataset to benchmark long-range language models, or to
pre-train for other natural language processing tasks that require long-range
reasoning, such as LAMBADA [5] or NarrativeQA [6]. We would not recommend
using this dataset to train a general-purpose language model, e.g. for a
production dialogue agent, due to the dated linguistic style of the texts and
the inherent biases present in historical writing.

### Dataset Statistics
|             | Train         | Validation | Test      |
|-------------|---------------|------------|-----------|
| Books       | 28,602        | 50         | 100       |
| Num. Tokens | 1,973,136,207 | 3,007,061  | 6,966,499 |
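The proposed metric can be sketched in a few lines of Python. This is an illustrative sketch only: the total log-likelihood below is a made-up number standing in for a model's score over a split, while the word count is taken from the validation column of the table above.

```python
import math

def word_level_perplexity(total_log_likelihood: float, num_words: int) -> float:
    """Word-level perplexity: exponentiate the negative total (natural-log)
    likelihood divided by the number of word-level tokens. The tokenization
    used to score the text (subword, character, etc.) does not matter, as
    long as the likelihood covers the full text."""
    return math.exp(-total_log_likelihood / num_words)

# Hypothetical total log-likelihood of a model over the validation split,
# and the validation word count from the statistics table.
total_log_likelihood = -1.2e7   # assumed, for illustration only
num_word_tokens = 3_007_061     # validation-set word count

print(word_level_perplexity(total_log_likelihood, num_word_tokens))
```

Because the normaliser is the fixed word count rather than each model's own token count, perplexities remain comparable across models with different vocabularies.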
### Bibtex
```
@article{raecompressive2019,
author = {Rae, Jack W and Potapenko, Anna and Jayakumar, Siddhant M and
Hillier, Chloe and Lillicrap, Timothy P},
title = {Compressive Transformers for Long-Range Sequence Modelling},
journal = {arXiv preprint},
url = {https://arxiv.org/abs/1911.05507},
year = {2019},
}
```

### Dataset Metadata
The following table is necessary for this dataset to be indexed by search
engines such as Google Dataset Search.
| property      | value |
|---------------|-------|
| name          | The PG-19 Language Modeling Benchmark |
| alternateName | PG-19 |
| url           | https://github.com/deepmind/pg19 |
| sameAs        | https://github.com/deepmind/pg19 |
| description   | This repository contains the PG-19 dataset. It includes a set of books extracted from the Project Gutenberg books project (https://www.gutenberg.org), that were published before 1919. It also contains metadata of book titles and publication dates. |
| provider      | DeepMind (sameAs: https://en.wikipedia.org/wiki/DeepMind) |
| license       | Apache License, Version 2.0 (url: https://www.apache.org/licenses/LICENSE-2.0.html) |
| citation      | https://identifiers.org/arxiv:1911.05507 |
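The per-book metadata described above can be loaded with the standard library. A minimal sketch, assuming `metadata.csv` is a headerless CSV whose rows are `(book_id, short_book_title, publication_date)` as stated in the README; the file path is an assumption.

```python
import csv

def load_metadata(path="metadata.csv"):
    """Read PG-19 book metadata into a list of dicts, one per book."""
    books = []
    with open(path, newline="", encoding="utf-8") as f:
        for book_id, title, pub_date in csv.reader(f):
            books.append({
                "book_id": book_id,
                "short_book_title": title,
                "publication_date": pub_date,
            })
    return books
```

For example, `load_metadata()[0]["short_book_title"]` would give the title of the first listed book.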
### Contact
If you have any questions, please contact Jack Rae.
### References
- [1] https://www.gutenberg.org
- [2] Chelba et al. "One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling" (2013)
- [3] Merity et al. "Pointer Sentinel Mixture Models" (2016)
- [4] Ofcom offensive language guide
- [5] Paperno et al. "The LAMBADA dataset: Word prediction requiring a broad discourse context" (2016)
- [6] Kočiský et al. "The NarrativeQA Reading Comprehension Challenge" (2018)