# Arabic-BERT

Pretrained BERT language models for Arabic

_If you use any of these models in your work, please cite this paper:_

```
@inproceedings{safaya-etal-2020-kuisail,
    title = "{KUISAIL} at {S}em{E}val-2020 Task 12: {BERT}-{CNN} for Offensive Speech Identification in Social Media",
    author = "Safaya, Ali and
      Abdullatif, Moutasem and
      Yuret, Deniz",
    booktitle = "Proceedings of the Fourteenth Workshop on Semantic Evaluation",
    month = dec,
    year = "2020",
    address = "Barcelona (online)",
    publisher = "International Committee for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.semeval-1.271",
    pages = "2054--2059",
}
```

## Pretraining data

The models were pretrained on ~8.2 billion words:

- Arabic version of [OSCAR](https://oscar-corpus.com/) (unshuffled version of the corpus) - filtered from [Common Crawl](http://commoncrawl.org/)
- Recent dump of Arabic [Wikipedia](https://dumps.wikimedia.org/backup-index.html)

and other Arabic resources, which together amount to ~95 GB of text.
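
For reference, the Arabic portion of OSCAR can be pulled from the Hugging Face Hub. The snippet below is only an illustrative sketch: the `oscar` dataset name and the `unshuffled_original_ar` configuration are assumptions about the current Hub layout, not the exact pipeline used to build this corpus.

```python
from datasets import load_dataset

# Illustrative only: stream the unshuffled Arabic portion of OSCAR from the
# Hugging Face Hub (dataset name and config are assumptions, see note above).
oscar_ar = load_dataset("oscar", "unshuffled_original_ar", split="train", streaming=True)

for i, record in enumerate(oscar_ar):
    print(record["text"][:100])  # first 100 characters of each document
    if i == 2:
        break
```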

__Notes on training data:__

- Our final version of the corpus contains some non-Arabic words inline, which we did not remove from sentences since doing so would affect tasks such as NER.
- Non-Arabic characters were lowercased as a preprocessing step (see the tokenizer sketch after this list); since Arabic script has no upper or lower case, there are no separate cased and uncased versions of the models.
- The corpus and vocabulary set are not restricted to Modern Standard Arabic; they also contain some dialectal Arabic.
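
A quick way to see how mixed Arabic/Latin text is handled is to run the published tokenizer on a sentence containing both scripts. This is a minimal sketch assuming the `asafaya/bert-base-arabic` checkpoint from the usage section below; the example sentence is made up.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-base-arabic")

# Made-up sentence mixing Arabic with a Latin-script term.
sentence = "تم تدريب النموذج باستخدام TensorFlow على معالج TPU"
print(tokenizer.tokenize(sentence))
# Latin-script pieces should come out lowercased (per the preprocessing note above);
# Arabic pieces are unaffected because Arabic script has no letter case.
```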

## Pretraining details

- These models were trained using Google BERT's GitHub [repository](https://github.com/google-research/bert) on a single TPU v3-8 provided for free by [TFRC](https://www.tensorflow.org/tfrc).
- Our pretraining procedure follows the original BERT training settings with some changes: we trained for 4M steps with a batch size of 128, instead of 1M steps with a batch size of 256.

## Models

| | BERT-Mini | BERT-Medium | BERT-Base | BERT-Large |
|:---:|:---:|:---:|:---:|:---:|
| Hidden Layers | 4 | 8 | 12 | 24 |
| Attention heads | 4 | 8 | 12 | 16 |
| Hidden size | 256 | 512 | 768 | 1024 |
| Parameters | 11M | 42M | 110M | 340M |
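
As a sanity check, the parameter counts in the table can be reproduced by loading a checkpoint and counting tensor elements. A minimal sketch, using the `asafaya/bert-mini-arabic` model name from the usage section below:

```python
from transformers import AutoModel

# Load one of the published checkpoints and count its parameters.
model = AutoModel.from_pretrained("asafaya/bert-mini-arabic")
num_params = sum(p.numel() for p in model.parameters())
print(f"bert-mini-arabic: ~{num_params / 1e6:.0f}M parameters")  # should be roughly 11M per the table
```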

## Results

### Sentiment Analysis Results (F1-Score)

| Dataset | Details | [ML-BERT](https://github.com/google-research/bert/blob/master/multilingual.md) | [hULMona](https://github.com/aub-mind/hULMonA) | Arabic-BERT Base |
|:---------:|:-------:|:---------:|:--------:|:------------:|
| [ArSenLev](https://arxiv.org/abs/1906.01830) | 5 Classes, Levantine dialect | 0.510 | 0.511 | __0.552__ |
| [ASTD](https://www.sites.google.com/a/mohamedaly.info/www/datasets/astd) | 4 Classes, MSA and Egyptian dialects | 0.670 | 0.677 | __0.714__ |

__Note:__ More results on other downstream NLP tasks will be added soon. If you use these models, I would appreciate your feedback.
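
The sentiment scores above come from fine-tuning the base model for sequence classification. The exact training setup is not included in this repository, so the following is only a minimal sketch using the standard `transformers` classification head; the label count, example text, and label are placeholders.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder setup: 5 labels as in the Levantine sentiment task above.
tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-base-arabic")
model = AutoModelForSequenceClassification.from_pretrained(
    "asafaya/bert-base-arabic", num_labels=5
)

# One made-up training step; in practice you would fine-tune on the full
# labelled dataset for several epochs with a proper optimizer and scheduler.
batch = tokenizer(["نص تجريبي للتصنيف"], return_tensors="pt", padding=True)
labels = torch.tensor([2])  # placeholder gold label
loss = model(**batch, labels=labels).loss
loss.backward()
print(f"loss: {loss.item():.3f}")
```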

## How to use

You can use these models after installing `torch` or `tensorflow` and the Hugging Face `transformers` library. You can then load a tokenizer and model directly like this:

```python
from transformers import AutoTokenizer, AutoModel

# Mini: asafaya/bert-mini-arabic
# Medium: asafaya/bert-medium-arabic
# Base: asafaya/bert-base-arabic
# Large: asafaya/bert-large-arabic

tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-base-arabic")
model = AutoModel.from_pretrained("asafaya/bert-base-arabic")
```
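
Once loaded, the model can be used as a feature extractor. A minimal sketch that continues from the block above (the example sentence is made up):

```python
import torch

# Encode a sentence and extract contextual embeddings with the model loaded above.
inputs = tokenizer("اللغة العربية لغة غنية", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch_size, sequence_length, hidden_size),
# e.g. a hidden size of 768 for the base model.
print(outputs.last_hidden_state.shape)
```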

## Acknowledgement

Thanks to Google for providing a free TPU for the training process, and to Hugging Face for hosting these models on their servers 😊