Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/google-deepmind/pg19
Last synced: 8 days ago
- Host: GitHub
- URL: https://github.com/google-deepmind/pg19
- Owner: google-deepmind
- License: apache-2.0
- Created: 2019-09-30T15:43:25.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2020-02-25T19:43:55.000Z (over 4 years ago)
- Last Synced: 2024-08-01T13:27:41.179Z (3 months ago)
- Size: 759 KB
- Stars: 222
- Watchers: 10
- Forks: 18
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
README
# PG-19 Language Modelling Benchmark
This repository contains the PG-19 language modelling benchmark. It includes a
set of books extracted from the Project Gutenberg books library [1] that were
published before 1919. It also contains metadata of book titles and publication
dates.

PG-19 is more than double the size of the Billion Word benchmark [2] and contains
documents that are, on average, 20 times longer than those in the WikiText
long-range language modelling benchmark [3].

Books are partitioned into `train`, `validation`, and `test` sets. Book
metadata is stored in `metadata.csv`, which contains
`(book_id, short_book_title, publication_date)` rows.

Unlike prior benchmarks, we do not constrain the vocabulary size ---
i.e. mapping rare words to an UNK token --- but instead release the data as an
open-vocabulary benchmark. The only processing of the text that has been applied
is the removal of boilerplate license text, and the mapping of offensive
discriminatory words as specified by Ofcom [4] to placeholder tokens. Users
are free to model the data at the character-level, subword-level, or via any
mechanism that can model an arbitrary string of text.

To compare models, we propose to continue measuring word-level perplexity,
computed by normalising the total log-likelihood of the dataset (under any
chosen subword vocabulary or character-based scheme) by the number of
word-level tokens, as specified in the dataset statistics table below.

One could use this dataset to benchmark long-range language models, or to
pre-train for other natural language processing tasks that require long-range
reasoning, such as LAMBADA [5] or NarrativeQA [6]. We would not recommend
using this dataset to train a general-purpose language model, e.g. for a
production dialogue agent, due to the dated linguistic style of the texts and
the inherent biases present in historical writing.

### Dataset Statistics
|             | Train         | Validation | Test      |
|-------------|---------------|------------|-----------|
| Books       | 28,602        | 50         | 100       |
| Num. Tokens | 1,973,136,207 | 3,007,061  | 6,966,499 |
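The proposed metric can be sketched in a few lines of Python. This is an illustrative sketch only: the total log-likelihood below is a made-up number standing in for a model's score over a split, while the word count is taken from the validation column of the table above.

```python
import math

def word_level_perplexity(total_log_likelihood: float, num_words: int) -> float:
    """Word-level perplexity: exponentiate the negative total (natural-log)
    likelihood divided by the number of word-level tokens. The tokenization
    used to score the text (subword, character, etc.) does not matter, as
    long as the likelihood covers the full text."""
    return math.exp(-total_log_likelihood / num_words)

# Hypothetical total log-likelihood of a model over the validation split,
# and the validation word count from the statistics table.
total_log_likelihood = -1.2e7   # assumed, for illustration only
num_word_tokens = 3_007_061     # validation-set word count

print(word_level_perplexity(total_log_likelihood, num_word_tokens))
```

Because the normaliser is the fixed word count rather than each model's own token count, perplexities remain comparable across models with different vocabularies.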
### Bibtex
```
@article{raecompressive2019,
author = {Rae, Jack W and Potapenko, Anna and Jayakumar, Siddhant M and
Hillier, Chloe and Lillicrap, Timothy P},
title = {Compressive Transformers for Long-Range Sequence Modelling},
journal = {arXiv preprint},
url = {https://arxiv.org/abs/1911.05507},
year = {2019},
}
```

### Dataset Metadata
The following table is necessary for this dataset to be indexed by search
engines such as Google Dataset Search.
| property      | value |
|---------------|-------|
| name          | The PG-19 Language Modeling Benchmark |
| alternateName | PG-19 |
| url           | https://github.com/deepmind/pg19 |
| sameAs        | https://github.com/deepmind/pg19 |
| description   | This repository contains the PG-19 dataset. It includes a set of books extracted from the Project Gutenberg books project (https://www.gutenberg.org), that were published before 1919. It also contains metadata of book titles and publication dates. |
| provider      | DeepMind (sameAs: https://en.wikipedia.org/wiki/DeepMind) |
| license       | Apache License, Version 2.0 (url: https://www.apache.org/licenses/LICENSE-2.0.html) |
| citation      | https://identifiers.org/arxiv:1911.05507 |
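The per-book metadata described above can be loaded with the standard library. A minimal sketch, assuming `metadata.csv` is a headerless CSV whose rows are `(book_id, short_book_title, publication_date)` as stated in the README; the file path is an assumption.

```python
import csv

def load_metadata(path="metadata.csv"):
    """Read PG-19 book metadata into a list of dicts, one per book."""
    books = []
    with open(path, newline="", encoding="utf-8") as f:
        for book_id, title, pub_date in csv.reader(f):
            books.append({
                "book_id": book_id,
                "short_book_title": title,
                "publication_date": pub_date,
            })
    return books
```

For example, `load_metadata()[0]["short_book_title"]` would give the title of the first listed book.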
### Contact
If you have any questions, please contact Jack Rae.
### References
- [1] https://www.gutenberg.org
- [2] Chelba et al. "One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling" (2013)
- [3] Merity et al. "Pointer Sentinel Mixture Models" (2016)
- [4] Ofcom offensive language guide
- [5] Paperno et al. "The LAMBADA dataset: Word prediction requiring a broad discourse context" (2016)
- [6] Kočiský et al. "The NarrativeQA Reading Comprehension Challenge" (2018)