https://github.com/neurocode-io/icelandic-language-model
Icelandic language model
azure huggingface-transformers icelandic-language neural-networks nlp python
Icelandic language model
- Host: GitHub
- URL: https://github.com/neurocode-io/icelandic-language-model
- Owner: neurocode-io
- License: mit
- Created: 2020-09-23T06:20:13.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2020-11-10T12:36:41.000Z (about 4 years ago)
- Last Synced: 2024-11-06T13:16:01.374Z (2 months ago)
- Topics: azure, huggingface-transformers, icelandic-language, neural-networks, nlp, python
- Language: Python
- Homepage:
- Size: 10.1 MB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# IsRoBERTa - Icelandic transformer language model
**Natural language processing (NLP)** is a field of AI in which software analyzes large amounts of text. It has many applications; among them, we considered:
- **Masked language modeling**: predict one or more *masked* (unknown) words given the other words in the sentence.
- **Named-Entity Recognition (NER)**: locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, etc.

We trained a model from scratch for the Icelandic language using the **Huggingface** library.
Icelandic is the official language of Iceland. As one of the Nordic languages, it belongs to the Germanic language family. Iceland has a population of only 350 thousand, so the language is not widely spoken.
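The masked language modeling task described above can be sketched in plain Python. This is a simplified illustration, not the training code of this project: the helper `mask_tokens`, the whitespace split, and the 15% default masking rate are assumptions made for the example, and full training recipes such as BERT's also replace some selected tokens with random or unchanged tokens.

```python
import random

MASK_TOKEN = "<mask>"  # the mask token used by RoBERTa-style models


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens.

    Returns (masked_tokens, labels). labels[i] holds the original token
    where position i was masked, and None everywhere else.
    """
    rng = random.Random(seed)  # seeded so the example is reproducible
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)   # the model must predict this token
        else:
            masked.append(tok)
            labels.append(None)  # no prediction needed here
    return masked, labels


# Whitespace split is a stand-in for a real tokenizer (assumption).
sentence = "Eftir að henni lýkur er hægt að gerast áskrifandi".split()
masked, labels = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

During training the model sees `masked` as input and is scored only on the positions where `labels` is not `None`.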
## 1. Dataset
The input in NLP is text. We used the Icelandic portion of the OSCAR corpus from INRIA, which is only 1.5 GB. As a next step, we will therefore concatenate it with the Icelandic sub-corpus of the Leipzig Corpora Collection, which comprises text from diverse sources such as news, literature, and Wikipedia.

## 2. Tokenization
Tokenization breaks the raw text into words or subwords, called tokens, which are then converted to ids. These tokens are the units the model uses to learn context, and analyzing their sequence helps in interpreting the meaning of the text.

A great article on tokenizers can be read [here](https://blog.floydhub.com/tokenization-nlp/).
We used the **Tokenizers** library from Huggingface, in particular a **byte-level Byte-Pair Encoding (BPE) tokenizer** with the same special tokens as [**RoBERTa**](https://huggingface.co/transformers/master/model_doc/roberta.html). For the best overview of byte-level BPE, read the paragraphs [Byte-Pair Encoding](https://huggingface.co/transformers/master/tokenizer_summary.html#byte-pair-encoding) and [Byte-level BPE](https://huggingface.co/transformers/master/tokenizer_summary.html#byte-level-bpe) of the tutorial [Tokenizer summary](https://huggingface.co/transformers/master/tokenizer_summary.html). Training such a tokenizer enables us to build a vocabulary from an alphabet of single bytes, hence all words will be decomposable into tokens (no more `<unk>` tokens!).
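To make the "alphabet of single bytes" idea concrete, here is a toy byte-level BPE in pure Python. It is a minimal sketch for illustration only, not the Tokenizers library and not the code used to train IsRoBERTa: the functions `train_bpe`, `apply_merge`, and `encode` are hypothetical names, and real implementations add details such as special tokens and a printable mapping for raw bytes.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from the UTF-8 bytes of `corpus`."""
    # Every word starts as a sequence of single-byte tokens.
    words = Counter(
        tuple(bytes([b]) for b in w.encode("utf-8")) for w in corpus.split()
    )
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        merged_words = Counter()
        for word, freq in words.items():
            merged_words[apply_merge(word, best)] += freq
        words = merged_words
    return merges


def encode(text, merges):
    """Tokenize `text` by replaying the learned merges over its bytes."""
    tokens = tuple(bytes([b]) for b in text.encode("utf-8"))
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


merges = train_bpe("að gerast áskrifandi að efni vefjarins", num_merges=10)
tokens = encode("þjóðvegur", merges)  # a word that is not in the toy corpus
print(tokens)
print(b"".join(tokens).decode("utf-8"))
```

Because the base vocabulary is the 256 possible byte values, any input, including a word never seen during training, falls back to single bytes in the worst case, which is why an unknown token is never needed.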
## 3. Training & Infrastructure
Training a language model is computationally heavy and very time-consuming. As a first attempt we tried to train the model in Google Colab using GPUs, but we ran out of RAM. We then moved to the cloud, in particular Azure.
We created an [NC_6_promo machine](https://docs.microsoft.com/en-us/azure/virtual-machines/nc-series?toc=/azure/virtual-machines/linux/toc.json&bc=/azure/virtual-machines/linux/breadcrumb/toc.json), which comes with an NVIDIA K80 GPU.
Even so, training took 3 days!
To package the code we used Docker. The image lives on [Docker Hub](https://hub.docker.com/r/donchev7/icelandic-model).
## 5. Do it yourself
To run the code you'll need an Azure account, in particular an Azure storage account.
If you have access to Azure infrastructure, start by creating an **.env** file:
```env
ACCESS_KEY=
WANDB_API_KEY=
WANDB_PROJECT=
VM_ADMIN_PASSWD=
```

Afterwards you can:
```
make create_machine
```

You'll see the IP address of the machine in your terminal. Use the IP to connect to your machine and run the packaged software:
```
ssh azureuser@<machine-ip>
screen

cat << EOF > .env
ACCESS_KEY=
WANDB_API_KEY=
WANDB_PROJECT=
EOF

docker run --gpus all -it --rm \
--env-file=.env \
--ipc=host \
-v /tmp:/tmp \
donchev7/icelandic-model:vc0c9243 python src/train_xml_roberta_large.py --data_dir=/tmp --run_name=xml_roberta_large_malfong_ner

# Press CTRL+a, then d, to detach from your screen session
exit
```
Check back later:

```
ssh azureuser@<machine-ip>
screen -r
```

Use NER:
```python
from transformers import AutoTokenizer, pipeline

# Load the tokenizer published with the pretrained IsRoBERTa model
tokenizer = AutoTokenizer.from_pretrained("neurocode/IsRoBERTa")

# Point the NER pipeline at the locally saved, fine-tuned checkpoint
nlp = pipeline("ner", model="./data/isroberta_malfong_ner/results", tokenizer=tokenizer)
res = nlp("Eftir að henni lýkur er hægt að gerast áskrifandi að efni vefjarins fyrir 1.290 kr. á mánuði.")

# Collect the tokens tagged as entities and join them back into readable text
tokens = [r["word"] for r in res]
print(tokenizer.convert_tokens_to_string(tokens))
```
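The pipeline returns one prediction per token, so an entity that spans several subword tokens arrives in pieces. The sketch below groups consecutive tokens by entity type. It is only an illustration under stated assumptions: `group_entities` is a hypothetical helper, the labels are assumed to follow a `B-`/`I-` scheme, and the `sample` list is hand-written for the example rather than real model output.

```python
def group_entities(results):
    """Merge consecutive token-level predictions that belong to one entity.

    Each item in `results` is expected to carry the keys "word", "entity"
    and "index", as in the output of the transformers NER pipeline.
    """
    groups = []
    for r in results:
        label = r["entity"].split("-", 1)[-1]  # "B-Person" -> "Person"
        prev = groups[-1] if groups else None
        continues_previous = (
            prev is not None
            and prev["label"] == label
            and r["index"] == prev["last_index"] + 1
            and not r["entity"].startswith("B-")  # "B-" always opens a new entity
        )
        if continues_previous:
            prev["tokens"].append(r["word"])
            prev["last_index"] = r["index"]
        else:
            groups.append(
                {"label": label, "tokens": [r["word"]], "last_index": r["index"]}
            )
    return [(g["label"], g["tokens"]) for g in groups]


# Hand-written sample, NOT real model output (labels and splits are assumptions).
sample = [
    {"word": "ĠJ", "entity": "B-Person", "index": 1},
    {"word": "on", "entity": "I-Person", "index": 2},
    {"word": "ĠReykjavik", "entity": "B-Location", "index": 5},
]
print(group_entities(sample))
```

Each token group can then be passed to `tokenizer.convert_tokens_to_string` to recover the entity text. Recent versions of transformers can also do this grouping inside the pipeline through the `aggregation_strategy` argument.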