Pre-trained ELMo Representations for Many Languages
===================================================

We release our ELMo representations trained on many languages,
which helped us win the [CoNLL 2018 shared task on Universal Dependencies Parsing](http://universaldependencies.org/conll18/results.html)
according to LAS.

## Technical Details

We use the same hyperparameter settings as [Peters et al. (2018)](https://arxiv.org/abs/1802.05365) for the biLM
and the character CNN.
We train their parameters on 20 million words of data randomly
sampled from the raw text released by the shared task (Wikipedia dump + Common Crawl) for each language.
We largely based our implementation on the code of [AllenNLP](https://allennlp.org/), but made the following changes:

* We support Unicode characters;
* We use the *sampled softmax* technique
to make training on a large vocabulary feasible ([Jean et al., 2015](https://arxiv.org/abs/1412.2007)).
However, we use a window of words surrounding the target word
as negative samples, which showed better performance in our preliminary experiments (a minimal sketch of this idea follows below).
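
The window-based negative sampling can be illustrated with a small, self-contained sketch. This is not the authors' code; `hidden`, `targets`, `softmax_w`, and `softmax_b` are hypothetical names for the biLM outputs, the gold word ids, and the output projection. The idea is to normalize the softmax at each position only over the words appearing in a window around that position instead of the full vocabulary:

```python
import torch
import torch.nn.functional as F

def window_sampled_softmax_loss(hidden, targets, softmax_w, softmax_b, window=5):
    """hidden: (batch, seq_len, dim) biLM outputs; targets: (batch, seq_len) word ids."""
    batch_size, seq_len, dim = hidden.size()
    losses = []
    for t in range(seq_len):
        lo, hi = max(0, t - window), min(seq_len, t + window + 1)
        # candidate vocabulary = the target words plus the words in the surrounding window
        cand = torch.unique(targets[:, lo:hi].reshape(-1))
        w = softmax_w[cand]                    # (n_cand, dim)
        b = softmax_b[cand]                    # (n_cand,)
        logits = hidden[:, t] @ w.t() + b      # (batch, n_cand)
        # position of each true target word inside the candidate set
        labels = (targets[:, t].unsqueeze(1) == cand.unsqueeze(0)).float().argmax(dim=1)
        losses.append(F.cross_entropy(logits, labels))
    return torch.stack(losses).mean()
```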

The training of ELMo on one language takes roughly 3 days on an NVIDIA P100 GPU.

## Downloads

| | | | |
|---|---|---|---|
| [Arabic](http://vectors.nlpl.eu/repository/11/136.zip) | [Bulgarian](http://vectors.nlpl.eu/repository/11/137.zip) | [Catalan](http://vectors.nlpl.eu/repository/11/138.zip) | [Czech](http://vectors.nlpl.eu/repository/11/139.zip) |
| [Old Church Slavonic](http://vectors.nlpl.eu/repository/11/140.zip) | [Danish](http://vectors.nlpl.eu/repository/11/141.zip) | [German](http://vectors.nlpl.eu/repository/11/142.zip) | [Greek](http://vectors.nlpl.eu/repository/11/143.zip) |
| [English](http://vectors.nlpl.eu/repository/11/144.zip) | [Spanish](http://vectors.nlpl.eu/repository/11/145.zip) | [Estonian](http://vectors.nlpl.eu/repository/11/146.zip) | [Basque](http://vectors.nlpl.eu/repository/11/147.zip) |
| [Persian](http://vectors.nlpl.eu/repository/11/148.zip) | [Finnish](http://vectors.nlpl.eu/repository/11/149.zip) | [French](http://vectors.nlpl.eu/repository/11/150.zip) | [Irish](http://vectors.nlpl.eu/repository/11/151.zip) |
| [Galician](http://vectors.nlpl.eu/repository/11/152.zip) | [Ancient Greek](http://vectors.nlpl.eu/repository/11/153.zip) | [Hebrew](http://vectors.nlpl.eu/repository/11/154.zip) | [Hindi](http://vectors.nlpl.eu/repository/11/155.zip) |
| [Croatian](http://vectors.nlpl.eu/repository/11/156.zip) | [Hungarian](http://vectors.nlpl.eu/repository/11/157.zip) | [Indonesian](http://vectors.nlpl.eu/repository/11/158.zip) | [Italian](http://vectors.nlpl.eu/repository/11/159.zip) |
| [Japanese](http://vectors.nlpl.eu/repository/11/160.zip) | [Korean](http://vectors.nlpl.eu/repository/11/161.zip) | [Latin](http://vectors.nlpl.eu/repository/11/162.zip) | [Latvian](http://vectors.nlpl.eu/repository/11/163.zip) |
| [Norwegian Bokmål](http://vectors.nlpl.eu/repository/11/165.zip) | [Dutch](http://vectors.nlpl.eu/repository/11/164.zip) | [Norwegian Nynorsk](http://vectors.nlpl.eu/repository/11/166.zip) | [Polish](http://vectors.nlpl.eu/repository/11/167.zip) |
| [Portuguese](http://vectors.nlpl.eu/repository/11/168.zip) | [Romanian](http://vectors.nlpl.eu/repository/11/169.zip) | [Russian](http://vectors.nlpl.eu/repository/11/170.zip) | [Slovak](http://vectors.nlpl.eu/repository/11/171.zip) |
| [Slovene](http://vectors.nlpl.eu/repository/11/172.zip) | [Swedish](http://vectors.nlpl.eu/repository/11/173.zip) | [Turkish](http://vectors.nlpl.eu/repository/11/174.zip) | [Uyghur](http://vectors.nlpl.eu/repository/11/175.zip) |
| [Ukrainian](http://vectors.nlpl.eu/repository/11/176.zip) | [Urdu](http://vectors.nlpl.eu/repository/11/177.zip) | [Vietnamese](http://vectors.nlpl.eu/repository/11/178.zip) | [Chinese](http://vectors.nlpl.eu/repository/11/179.zip) |

The models are hosted on the [NLPL Vectors Repository](http://wiki.nlpl.eu/index.php/Vectors/home).

**ELMo for Simplified Chinese**

We also provide a [simplified-Chinese ELMo](http://39.96.43.154/zhs.model.tar.bz2).
It was trained on the Xinhua portion of [Chinese Gigaword v5](https://catalog.ldc.upenn.edu/ldc2011t13),
which differs from the Wikipedia data used for the traditional-Chinese ELMo.

## Prerequisites

* Python >= 3.6 is **required** (with Python 3.5 you will encounter this issue: https://github.com/HIT-SCIR/ELMoForManyLangs/issues/8)
* PyTorch 0.4
* the other requirements of AllenNLP

## Usage

### Install the package

You need to install the package before using the embeddings, with the following command:
```
python setup.py install
```

### Set up the `config_path`
After unzipping the model, you will find a JSON file `${lang}.model/config.json`.
Please change the `"config_path"` field to the relative path of
the model configuration `cnn_50_100_512_4096_sample.json`.
For example, if your ELMo model is `zht.model/config.json` and your model configuration
is `zht.model/cnn_50_100_512_4096_sample.json`, you need to change `"config_path"`
in `zht.model/config.json` to `cnn_50_100_512_4096_sample.json`.

If there is no configuration `cnn_50_100_512_4096_sample.json` under `${lang}.model`,
you can either copy `elmoformanylangs/configs/cnn_50_100_512_4096_sample.json` into `${lang}.model`,
or change `"config_path"` to `elmoformanylangs/configs/cnn_50_100_512_4096_sample.json`.

See [issue 27](https://github.com/HIT-SCIR/ELMoForManyLangs/issues/27) for more details.
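
The field can also be patched programmatically; here is a minimal sketch, assuming the model was unpacked into a hypothetical `zht.model/` directory and you want to point it at the configuration shipped with the package:

```python
import json
import os

model_dir = 'zht.model'  # hypothetical unpack location
config_file = os.path.join(model_dir, 'config.json')

with open(config_file, encoding='utf-8') as f:
    config = json.load(f)

# Point config_path at the configuration bundled with elmoformanylangs.
config['config_path'] = 'elmoformanylangs/configs/cnn_50_100_512_4096_sample.json'

with open(config_file, 'w', encoding='utf-8') as f:
    json.dump(config, f, ensure_ascii=False, indent=2)
```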

### Use ELMoForManyLangs from the command line

Prepare your input file in the [CoNLL-U format](http://universaldependencies.org/format.html), like
```
1 Sue Sue _ _ _ _ _ _ _
2 likes like _ _ _ _ _ _ _
3 coffee coffee _ _ _ _ _ _ _
4 and and _ _ _ _ _ _ _
5 Bill Bill _ _ _ _ _ _ _
6 tea tea _ _ _ _ _ _ _
```
Fields should be separated by `'\t'`. We only use the second column, and a space (`' '`) is allowed in
this field (for Vietnamese, a word can contain spaces).
Do remember to tokenize your text first!
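
If your sentences are already tokenized lists of words, a minimal sketch (a hypothetical helper, not part of the package) for writing them in this format:

```python
def write_conll(sentences, path):
    """Write tokenized sentences in the 10-column format above;
    only the second column (the word form) is actually used."""
    with open(path, 'w', encoding='utf-8') as f:
        for sent in sentences:
            for i, word in enumerate(sent, start=1):
                f.write('\t'.join([str(i), word, word] + ['_'] * 7) + '\n')
            f.write('\n')  # blank line separates sentences

write_conll([['Sue', 'likes', 'coffee', 'and', 'Bill', 'tea']], 'input.conllu')
```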

When it's all set, run

```
$ python -m elmoformanylangs test \
--input_format conll \
--input /path/to/your/input \
--model /path/to/your/model \
--output_prefix /path/to/your/output \
--output_format hdf5 \
--output_layer -1
```

It will dump an HDF5-encoded `dict` onto the disk, where the key is the `'\t'`-separated
words of the sentence and the value is its 3-layer averaged ELMo representation.
You can also dump the CNN-encoded word representation with `--output_layer 0`,
the first layer of the LSTM with `--output_layer 1`, and the second layer
of the LSTM with `--output_layer 2`.
We are actively changing the interface to bring it closer to the AllenNLP ELMo
and to make it friendlier to use programmatically.
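
To read the dump back, you can use `h5py`; a minimal sketch (the file name below is a placeholder, the real one is derived from `--output_prefix`):

```python
import h5py
import numpy as np

with h5py.File('/path/to/your/output.hdf5', 'r') as f:
    for key in f.keys():
        words = key.split('\t')        # the '\t'-separated tokens of one sentence
        vectors = np.asarray(f[key])   # shape: (seq_len, embedding_size)
        print(len(words), vectors.shape)
```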

### Use ELMoForManyLangs programmatically

Thanks to @voidism for contributing the API.
Using the `Embedder` Python object, you can use ELMo in your own code like this:

```python
from elmoformanylangs import Embedder

e = Embedder('/path/to/your/model/')

sents = [['今', '天', '天氣', '真', '好', '阿'],
         ['潮水', '退', '了', '就', '知道', '誰', '沒', '穿', '褲子']]
# the list of lists which stores the sentences,
# segmented beforehand if necessary

e.sents2elmo(sents)
# will return a list of numpy arrays
# each with the shape=(seq_len, embedding_size)
```

#### The parameters to init `Embedder`:
```python
class Embedder(model_dir='/path/to/your/model/', batch_size=64):
```
- **model_dir**: the absolute path from the repository top directory to your model directory.
- **batch_size**: the batch size used during inference; set it according to your GPU/CPU RAM size. (default: 64)

#### The parameters of the function `sents2elmo`:
```python
def sents2elmo(sents, output_layer=-1):
```
- **sents**: the list of lists which stores the sentences, segmented beforehand if necessary.
- **output_layer**: the target layer to output (see the short example after this list).
- 0 for the word encoder
- 1 for the first LSTM hidden layer
- 2 for the second LSTM hidden layer
- -1 for an average of 3 layers. (default)
- -2 for all 3 layers
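
A short usage example (the 1024 in the shapes below assumes the default model size):

```python
# Average of the three layers (the default): one (seq_len, 1024) array per sentence.
avg = e.sents2elmo(sents)

# All three layers at once: one (3, seq_len, 1024) array per sentence.
all_layers = e.sents2elmo(sents, output_layer=-2)
```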

## Training Your Own ELMo

Please run
```
$ python -m elmoformanylangs.biLM train -h
```
to get more details about the ELMo training.

Here is an example for training English ELMo.
```
$ less data/en.raw
... (snip) ...
Notable alumni
Aris Kalafatis ( Acting )
Labour Party
They build an open nest in a tree hole , or man - made nest - boxes .
Legacy
... (snip) ...

$ python -m elmoformanylangs.biLM train \
--train_path data/en.raw \
--config_path elmoformanylangs/configs/cnn_50_100_512_4096_sample.json \
--model output/en \
--optimizer adam \
--lr 0.001 \
--lr_decay 0.8 \
--max_epoch 10 \
--max_sent_len 20 \
--max_vocab_size 150000 \
--min_count 3
```
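
The training file is plain text with one whitespace-tokenized sentence per line, as in the `data/en.raw` excerpt above; a minimal sketch for producing such a file from already-tokenized sentences:

```python
# Illustrative only: write one whitespace-joined, tokenized sentence per line.
sentences = [
    ['They', 'build', 'an', 'open', 'nest', 'in', 'a', 'tree', 'hole', '.'],
    ['Notable', 'alumni'],
]
with open('data/en.raw', 'w', encoding='utf-8') as f:
    for sent in sentences:
        f.write(' '.join(sent) + '\n')
```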
However, we need to add that the training process is not very stable.
In some cases, we end up with a loss of `nan`. We are actively working on this and hope to
improve it in the future.

## Citation

If our ELMo gave you nice improvements, please cite us.

```
@InProceedings{che-EtAl:2018:K18-2,
  author    = {Che, Wanxiang and Liu, Yijia and Wang, Yuxuan and Zheng, Bo and Liu, Ting},
  title     = {Towards Better {UD} Parsing: Deep Contextualized Word Embeddings, Ensemble, and Treebank Concatenation},
  booktitle = {Proceedings of the {CoNLL} 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies},
  month     = {October},
  year      = {2018},
  address   = {Brussels, Belgium},
  publisher = {Association for Computational Linguistics},
  pages     = {55--64},
  url       = {http://www.aclweb.org/anthology/K18-2005}
}
```

Please also cite the
[NLPL Vectors Repository](http://wiki.nlpl.eu/index.php/Vectors/home)
for hosting the models.
```
@InProceedings{fares-EtAl:2017:NoDaLiDa,
  author    = {Fares, Murhaf and Kutuzov, Andrey and Oepen, Stephan and Velldal, Erik},
  title     = {Word vectors, reuse, and replicability: Towards a community repository of large-text resources},
  booktitle = {Proceedings of the 21st Nordic Conference on Computational Linguistics},
  month     = {May},
  year      = {2017},
  address   = {Gothenburg, Sweden},
  publisher = {Association for Computational Linguistics},
  pages     = {271--276},
  url       = {http://www.aclweb.org/anthology/W17-0237}
}
```