Pre-trained ELMo Representations for Many Languages
===================================================

We release our ELMo representations trained on many languages,
which helped us win the [CoNLL 2018 shared task on Universal Dependencies Parsing](http://universaldependencies.org/conll18/results.html)
according to LAS.

## Technical Details

We use the same hyperparameter settings as [Peters et al. (2018)](https://arxiv.org/abs/1802.05365) for the biLM
and the character CNN.
We train the model parameters
on 20 million words of data randomly
sampled from the raw text released by the shared task (wikidump + common crawl) for each language.
Our implementation is largely based on the code of [AllenNLP](https://allennlp.org/), with the following changes:

* We support unicode characters;
* We use the *sampled softmax* technique
to make training on a large vocabulary feasible ([Jean et al., 2015](https://arxiv.org/abs/1412.2007)).
However, we use a window of words surrounding the target word
as negative samples, which showed better performance in our preliminary experiments (see the sketch below).
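The following is a minimal, illustrative sketch of this idea in PyTorch. It is not the actual implementation in this repository; the function, tensor names, and shapes are all assumptions. It restricts the softmax to the target word plus the words in a small window around it, so the surrounding words act as the negative candidates.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def window_sampled_softmax_loss(hidden, token_ids, pos, softmax_emb, window=8):
    """Cross-entropy over a candidate set made of the target word and the
    words in a +/- `window` span around it, instead of the full vocabulary.

    hidden:      (dim,) biLM state predicting the word at position `pos`
    token_ids:   (seq_len,) LongTensor of word ids for the sentence
    softmax_emb: nn.Embedding holding the output-softmax weight vectors
    """
    lo = max(0, pos - window)
    hi = min(token_ids.size(0), pos + window + 1)
    candidates = token_ids[lo:hi]                 # negatives = surrounding words
    logits = softmax_emb(candidates) @ hidden     # (n_candidates,)
    target = torch.tensor([pos - lo])             # index of the true word
    return F.cross_entropy(logits.unsqueeze(0), target)

# toy usage with random tensors
emb = nn.Embedding(1000, 16)
ids = torch.randint(0, 1000, (30,))
loss = window_sampled_softmax_loss(torch.randn(16), ids, pos=10, softmax_emb=emb)
```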

The training of ELMo on one language takes roughly 3 days on an NVIDIA P100 GPU.

## Downloads

| | | | |
|---|---|---|---|
| [Arabic](http://vectors.nlpl.eu/repository/11/136.zip) | [Bulgarian](http://vectors.nlpl.eu/repository/11/137.zip) | [Catalan](http://vectors.nlpl.eu/repository/11/138.zip) | [Czech](http://vectors.nlpl.eu/repository/11/139.zip) |
| [Old Church Slavonic](http://vectors.nlpl.eu/repository/11/140.zip) | [Danish](http://vectors.nlpl.eu/repository/11/141.zip) | [German](http://vectors.nlpl.eu/repository/11/142.zip) | [Greek](http://vectors.nlpl.eu/repository/11/143.zip) |
| [English](http://vectors.nlpl.eu/repository/11/144.zip) | [Spanish](http://vectors.nlpl.eu/repository/11/145.zip) | [Estonian](http://vectors.nlpl.eu/repository/11/146.zip) | [Basque](http://vectors.nlpl.eu/repository/11/147.zip) |
| [Persian](http://vectors.nlpl.eu/repository/11/148.zip) | [Finnish](http://vectors.nlpl.eu/repository/11/149.zip) | [French](http://vectors.nlpl.eu/repository/11/150.zip) | [Irish](http://vectors.nlpl.eu/repository/11/151.zip) |
| [Galician](http://vectors.nlpl.eu/repository/11/152.zip) | [Ancient Greek](http://vectors.nlpl.eu/repository/11/153.zip) | [Hebrew](http://vectors.nlpl.eu/repository/11/154.zip) | [Hindi](http://vectors.nlpl.eu/repository/11/155.zip) |
| [Croatian](http://vectors.nlpl.eu/repository/11/156.zip) | [Hungarian](http://vectors.nlpl.eu/repository/11/157.zip) | [Indonesian](http://vectors.nlpl.eu/repository/11/158.zip) | [Italian](http://vectors.nlpl.eu/repository/11/159.zip) |
| [Japanese](http://vectors.nlpl.eu/repository/11/160.zip) | [Korean](http://vectors.nlpl.eu/repository/11/161.zip) | [Latin](http://vectors.nlpl.eu/repository/11/162.zip) | [Latvian](http://vectors.nlpl.eu/repository/11/163.zip) |
| [Norwegian Bokmål](http://vectors.nlpl.eu/repository/11/165.zip) | [Dutch](http://vectors.nlpl.eu/repository/11/164.zip) | [Norwegian Nynorsk](http://vectors.nlpl.eu/repository/11/166.zip) | [Polish](http://vectors.nlpl.eu/repository/11/167.zip) |
| [Portuguese](http://vectors.nlpl.eu/repository/11/168.zip) | [Romanian](http://vectors.nlpl.eu/repository/11/169.zip) | [Russian](http://vectors.nlpl.eu/repository/11/170.zip) | [Slovak](http://vectors.nlpl.eu/repository/11/171.zip) |
| [Slovene](http://vectors.nlpl.eu/repository/11/172.zip) | [Swedish](http://vectors.nlpl.eu/repository/11/173.zip) | [Turkish](http://vectors.nlpl.eu/repository/11/174.zip) | [Uyghur](http://vectors.nlpl.eu/repository/11/175.zip) |
| [Ukrainian](http://vectors.nlpl.eu/repository/11/176.zip) | [Urdu](http://vectors.nlpl.eu/repository/11/177.zip) | [Vietnamese](http://vectors.nlpl.eu/repository/11/178.zip) | [Chinese](http://vectors.nlpl.eu/repository/11/179.zip) |

The models are hosted on the [NLPL Vectors Repository](http://wiki.nlpl.eu/index.php/Vectors/home).

**ELMo for Simplified Chinese**

We also provide a [simplified-Chinese ELMo](http://39.96.43.154/zhs.model.tar.bz2).
It was trained on the Xinhua portion of the [Chinese Gigaword corpus (5th edition)](https://catalog.ldc.upenn.edu/ldc2011t13),
unlike the traditional-Chinese ELMo, which was trained on Wikipedia.

## Prerequisites

* Python >= 3.6 is **required** (with Python 3.5 you will run into [issue #8](https://github.com/HIT-SCIR/ELMoForManyLangs/issues/8))
* PyTorch 0.4
* the other requirements from AllenNLP

## Usage

### Install the package

To use the embeddings, install the package with the following command:
```
python setup.py install
```

### Set up the `config_path`
After unzipping the model, you will find a JSON file `${lang}.model/config.json`.
Please change the `"config_path"` field to the relative path to
the model configuration `cnn_50_100_512_4096_sample.json`.
For example, if your ELMo model is `zht.model/config.json` and your model configuration
is `zht.model/cnn_50_100_512_4096_sample.json`, you need to change `"config_path"`
in `zht.model/config.json` to `cnn_50_100_512_4096_sample.json`.

If there is no configuration `cnn_50_100_512_4096_sample.json` under `${lang}.model`,
you can copy the `elmoformanylangs/configs/cnn_50_100_512_4096_sample.json` into `${lang}.model`,
or change `"config_path"` to `elmoformanylangs/configs/cnn_50_100_512_4096_sample.json`.

See [issue 27](https://github.com/HIT-SCIR/ELMoForManyLangs/issues/27) for more details.
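If you prefer to make this edit programmatically, a small sketch could look like the following (the model path is a placeholder; adapt it to your own unpacked model):

```python
import json

model_config = "zht.model/config.json"   # the unpacked model's config file

with open(model_config, encoding="utf-8") as f:
    cfg = json.load(f)

# point at the sample configuration, either relative to the model directory
# or at elmoformanylangs/configs/cnn_50_100_512_4096_sample.json
cfg["config_path"] = "cnn_50_100_512_4096_sample.json"

with open(model_config, "w", encoding="utf-8") as f:
    json.dump(cfg, f, ensure_ascii=False, indent=2)
```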

### Use ELMoForManyLangs in command line

Prepare your input file in the [conllu format](http://universaldependencies.org/format.html), like
```
1 Sue Sue _ _ _ _ _ _ _
2 likes like _ _ _ _ _ _ _
3 coffee coffee _ _ _ _ _ _ _
4 and and _ _ _ _ _ _ _
5 Bill Bill _ _ _ _ _ _ _
6 tea tea _ _ _ _ _ _ _
```
Fields should be separated by `'\t'`. Only the second column is used, and spaces (`' '`) are supported in
this field (for Vietnamese, a word can contain spaces).
Remember to tokenize your input first!
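For example, a file like the one above can be generated from already-tokenized sentences with a small sketch such as this (the file name is a placeholder):

```python
sentences = [["Sue", "likes", "coffee", "and", "Bill", "tea"]]

with open("input.conllu", "w", encoding="utf-8") as f:
    for sent in sentences:
        for i, word in enumerate(sent, start=1):
            # id, form, lemma, then the remaining 7 columns left empty;
            # only the second column (the word form) is actually used
            f.write("\t".join([str(i), word, word] + ["_"] * 7) + "\n")
        f.write("\n")  # blank line between sentences
```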

When it's all set, run

```
$ python -m elmoformanylangs test \
--input_format conll \
--input /path/to/your/input \
--model /path/to/your/model \
--output_prefix /path/to/your/output \
--output_format hdf5 \
--output_layer -1
```

It will dump an hdf5-encoded `dict` onto the disk, where the key is the `'\t'`-separated
words of the sentence and the value is its 3-layer averaged ELMo representation.
You can also dump the CNN-encoded words with `--output_layer 0`,
the first layer of the LSTM with `--output_layer 1`, and the second layer
of the LSTM with `--output_layer 2`.
We are actively changing the interface to bring it closer to the
AllenNLP ELMo and make it friendlier to use programmatically.
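For reference, here is a minimal sketch for inspecting the dumped file. It assumes the `h5py` package is installed; the path is a placeholder, and the exact output file name depends on your `--output_prefix` and layer options.

```python
import h5py

with h5py.File("/path/to/your/output.hdf5", "r") as f:
    for key in f.keys():
        tokens = key.split("\t")   # the '\t'-separated words of the sentence
        vectors = f[key][()]       # e.g. shape (seq_len, embedding_size)
        print(len(tokens), vectors.shape)
```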

### Use ELMoForManyLangs programmatically

Thanks to @voidism for contributing the API.
Using the `Embedder` Python object, you can use ELMo in your own code like this:

```python
from elmoformanylangs import Embedder

e = Embedder('/path/to/your/model/')

sents = [['今', '天', '天氣', '真', '好', '阿'],
         ['潮水', '退', '了', '就', '知道', '誰', '沒', '穿', '褲子']]
# a list of lists holding the sentences,
# segmented beforehand if necessary

e.sents2elmo(sents)
# will return a list of numpy arrays
# each with the shape=(seq_len, embedding_size)
```

#### Parameters for initializing `Embedder`:
```python
class Embedder(model_dir='/path/to/your/model/', batch_size=64):
```
- **model_dir**: the absolute path from the repo top dir to your model dir.
- **batch_size**: the batch size used during model inference; set it according to your GPU/CPU RAM. (default: 64)

#### Parameters of the function `sents2elmo`:
```python
def sents2elmo(sents, output_layer=-1):
```
- **sents**: the list of lists holding the sentences, segmented beforehand if necessary.
- **output_layer**: the target layer to output:
- 0 for the word encoder
- 1 for the first LSTM hidden layer
- 2 for the second LSTM hidden layer
- -1 for an average of 3 layers. (default)
- -2 for all 3 layers
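For instance, a short sketch of selecting different layers (the model path is a placeholder, as in the example above):

```python
from elmoformanylangs import Embedder

e = Embedder('/path/to/your/model/')
sents = [['今', '天', '天氣', '真', '好', '阿']]

word_encoder = e.sents2elmo(sents, output_layer=0)   # character-CNN word encoder
first_lstm = e.sents2elmo(sents, output_layer=1)     # first LSTM hidden layer
averaged = e.sents2elmo(sents, output_layer=-1)      # average of the 3 layers (default)
all_layers = e.sents2elmo(sents, output_layer=-2)    # all 3 layers
```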

## Training Your Own ELMo

Please run
```
$ python -m elmoformanylangs.biLM train -h
```
to get more details about the ELMo training.

Here is an example of training an English ELMo model.
```
$ less data/en.raw
... (snip) ...
Notable alumni
Aris Kalafatis ( Acting )
Labour Party
They build an open nest in a tree hole , or man - made nest - boxes .
Legacy
... (snip) ...

$ python -m elmoformanylangs.biLM train \
--train_path data/en.raw \
--config_path elmoformanylangs/configs/cnn_50_100_512_4096_sample.json \
--model output/en \
--optimizer adam \
--lr 0.001 \
--lr_decay 0.8 \
--max_epoch 10 \
--max_sent_len 20 \
--max_vocab_size 150000 \
--min_count 3
```
However, we
should note that the training process is not very stable.
In some cases, we end up with a loss of `nan`. We are actively working on this and hope to
improve it in the future.

## Citation

If our ELMo gives you nice improvements, please cite our paper:

```
@InProceedings{che-EtAl:2018:K18-2,
author = {Che, Wanxiang and Liu, Yijia and Wang, Yuxuan and Zheng, Bo and Liu, Ting},
title = {Towards Better {UD} Parsing: Deep Contextualized Word Embeddings, Ensemble, and Treebank Concatenation},
booktitle = {Proceedings of the {CoNLL} 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies},
month = {October},
year = {2018},
address = {Brussels, Belgium},
publisher = {Association for Computational Linguistics},
pages = {55--64},
url = {http://www.aclweb.org/anthology/K18-2005}
}
```

Please also cite the
[NLPL Vectors Repository](http://wiki.nlpl.eu/index.php/Vectors/home)
for hosting the models.
```
@InProceedings{fares-EtAl:2017:NoDaLiDa,
author = {Fares, Murhaf and Kutuzov, Andrey and Oepen, Stephan and Velldal, Erik},
title = {Word vectors, reuse, and replicability: Towards a community repository of large-text resources},
booktitle = {Proceedings of the 21st Nordic Conference on Computational Linguistics},
month = {May},
year = {2017},
address = {Gothenburg, Sweden},
publisher = {Association for Computational Linguistics},
pages = {271--276},
url = {http://www.aclweb.org/anthology/W17-0237}
}
```