Lifeweb AI team Language Models
- Host: GitHub
- URL: https://github.com/lifeweb-ir/lm
- Owner: lifeweb-ir
- Created: 2024-03-04T10:35:55.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-03-11T17:56:11.000Z (over 1 year ago)
- Last Synced: 2024-03-12T08:25:14.584Z (over 1 year ago)
- Topics: language-model, mobilebert, persian-nlp, roberta
- Homepage: https://lifewebco.com/ai
- Size: 34.2 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Lifeweb language models
Welcome to the Lifeweb Language Models repository.
Here we aim to train various Persian language models and release them publicly, contributing our share to AI for the Persian language.
The first versions of our models are all trained on our dataset, **Divan**, which contains more than **164 million documents** and more than **10B tokens**, meticulously normalized and deduplicated to ensure richness and comprehensiveness. A better dataset leads to a better model.

# Use Models
You can easily access the models using the Hugging Face model hub links in the table below.

| Model Name | Base Model | Vocabulary Size | Evaluation |
|---|---|---|---|
| [Tehran](https://huggingface.co/lifeweb-ai/tehran) | [Roberta](https://huggingface.co/HooshvareLab/roberta-fa-zwnj-base) | 50000 | [Results](#results) |
| [Shiraz](https://huggingface.co/lifeweb-ai/shiraz) | [MobileBert](https://huggingface.co/google/mobilebert-uncased) | 50000 | [Results](#results) |

For example, you can fill a masked token with **Shiraz**:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, FillMaskPipeline

model_name = "lifeweb-ai/shiraz"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

text = "در همین لحظه که شما مشغول [MASK] این متن هستید، میلیونها دیتا در فضای آنلاین در حال تولید است. ما در لایف وب به جمعآوری، پردازش و تحلیل این کلان داده (Big Data) میپردازیم."
classifier = FillMaskPipeline(model=model, tokenizer=tokenizer)
result = classifier(text)
print(result[0])
#{'score': 0.3584367036819458, 'token': 5764, 'token_str': 'خواندن', 'sequence': 'در همین لحظه که شما مشغول خواندن این متن هستید، میلیون ها دیتا در فضای انلاین در حال تولید است. ما در لایف وب به جمع اوری، پردازش و تحلیل این کلان داده ( big data ) می پردازیم.'}
```
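Beyond filling masked tokens, the checkpoints can be used as plain text encoders, e.g. to produce sentence embeddings for downstream tasks. Below is a minimal sketch; mean pooling over the last hidden states is a common choice we assume here, not something this repository prescribes:

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "lifeweb-ai/tehran"  # or "lifeweb-ai/shiraz"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

sentences = ["زبان فارسی", "مدل زبانی"]  # any Persian input sentences
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token embeddings, ignoring padding positions
mask = inputs["attention_mask"].unsqueeze(-1)  # (batch, seq_len, 1)
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (2, hidden_size)
```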
# Results
The Lifeweb models are evaluated on three downstream NLP tasks: **NER**, **Sentiment Analysis**, and **Emotion Detection**. **Tehran** outperforms every other Persian language model in terms of accuracy and macro F1, while **Shiraz** is considerably faster and remains highly competitive in accuracy. According to the [**MobileBERT paper**](https://arxiv.org/pdf/2004.02984.pdf), the Shiraz architecture is 4.3× smaller and 5.5× faster than BERT-base.
We assert that our models outperform all similar models in the field, achieving new state-of-the-art results. Comparing against [**ParsBERT**](https://arxiv.org/abs/2005.12515), [**AriaBERT**](https://assets.researchsquare.com/files/rs-3558473/v1_covered_d230d5de-50d1-42d5-ba1a-ef400ede52e3.pdf?c=1699474771), and [**FaBERT**](https://arxiv.org/abs/2402.06617), we substantiate this claim with superior evaluation metrics, even though each of those models reported the best performance among comparable models at the time. The table below lists the macro F1 score for each task; the accompanying Colab notebooks serve as tutorials and were all run on the same hardware (4x RTX 2080 Ti GPUs). A minimal fine-tuning sketch follows the table.
| Model | NER: Arman | NER: Peyma | Sentiment: Sentipers (multi) | Sentiment: Snappfood | Emotion: Arman |
|---|---|---|---|---|---|
| lifeweb-ai/tehran | 71.87% | 90.79% | 63.75% | 88.74% | 77.73% |
| lifeweb-ai/shiraz | 67.62% | 86.24% | 59.17% | 88.01% | 66.97% |
| sbunlp/fabert | 71.23% | 88.53% | 58.51% | 88.60% | 72.65% |
| ViraIntelligentDataMining/AriaBERT | 69.12% | 87.15% | 59.26% | 87.96% | 69.11% |
| HooshvareLab/bert-fa-zwnj-base | 67.49% | 85.73% | 59.61% | 87.58% | 59.27% |
| HooshvareLab/roberta-fa-zwnj-base | 69.73% | 86.21% | 56.23% | 87.19% | 57.96% |
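As a starting point for this kind of downstream evaluation, the sketch below fine-tunes one of the checkpoints as a sequence classifier with the Hugging Face `Trainer`. It is only an illustration: the toy in-memory dataset, `num_labels=3`, and the hyperparameters are our assumptions, not the settings behind the table above.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "lifeweb-ai/tehran"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=3 is an assumption for a multi-class sentiment setup
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Toy in-memory dataset; swap in Sentipers/Snappfood/Arman loading in practice
train_data = Dataset.from_dict({
    "text": ["متن نمونه اول", "متن نمونه دوم"],  # sample Persian texts
    "label": [0, 2],
})
train_data = train_data.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
)

args = TrainingArguments(output_dir="finetuned", num_train_epochs=3,
                         per_device_train_batch_size=16)
# Passing the tokenizer lets Trainer pad each batch dynamically
Trainer(model=model, args=args, train_dataset=train_data,
        tokenizer=tokenizer).train()
```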
If you have tested our models on a public dataset and would like to add your results to the table above, open a pull request or contact us. Please make sure your code is available online so that we can add a reference.
# Contributors
- Mehrdad Azizi: [**Linkedin**](https://www.linkedin.com/in/mehrdad-azizi-50839489/), [**Github**](https://github.com/mehrazi)
- Reza Salehi Chegeni: [**Linkedin**](https://www.linkedin.com/in/reza-salehi-chegeni-6988ba271/), [**Github**](https://github.com/rezasalehichegeni)
- Parisa Mousavi: [**Linkedin**](https://www.linkedin.com/in/seyede-parisa-mousavi/), [**Github**](https://github.com/Mousavi-Parisa)
- Iman Hashemi: [**Linkedin**](https://www.linkedin.com/in/iman-hashemi-403738a5), [**Github**](https://github.com/hashemiiman)

# Releases
**v1.0 (2024-03-09)**
First versions of the **Tehran** and **Shiraz** models, trained on **Divan**.
# License
By contributing to this project, you agree that your contributions will be licensed under the [**Apache License 2.0**](https://www.apache.org/licenses/LICENSE-2.0).