# HerBERT
**[HerBERT](https://en.wikipedia.org/wiki/Zbigniew_Herbert)** is a series of BERT-based language models trained for Polish language understanding. All three HerBERT models are summarized below:
| Model | Tokenizer | Vocab Size | Batch Size | Train Steps | KLEJ Score |
| :---- | --------: | ---------: | ---------: | ----------: | ---------: |
| `herbert-klej-cased-v1` | BPE | 50K | 570 | 180k | 80.5 |
| `herbert-base-cased` | BPE-Dropout | 50K | 2560 | 50k | 86.3 |
| `herbert-large-cased` | BPE-Dropout | 50K | 2560 | 60k | 88.4 |

The full KLEJ Benchmark leaderboard is available [here](https://klejbenchmark.com/leaderboard).
For more details about the model architecture, training process, corpora used, and evaluation, please refer to:
- [KLEJ: Comprehensive Benchmark for Polish Language Understanding](https://www.aclweb.org/anthology/2020.acl-main.111/)
- [HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish](https://www.aclweb.org/anthology/2021.bsnlp-1.1/)

## Usage
Example of how to load the model:
```python
from transformers import AutoTokenizer, AutoModel

model_names = {
"herbert-klej-cased-v1": {
"tokenizer": "allegro/herbert-klej-cased-tokenizer-v1",
"model": "allegro/herbert-klej-cased-v1",
},
"herbert-base-cased": {
"tokenizer": "allegro/herbert-base-cased",
"model": "allegro/herbert-base-cased",
},
"herbert-large-cased": {
"tokenizer": "allegro/herbert-large-cased",
"model": "allegro/herbert-large-cased",
},
}

tokenizer = AutoTokenizer.from_pretrained(model_names["herbert-base-cased"]["tokenizer"])
model = AutoModel.from_pretrained(model_names["herbert-base-cased"]["model"])
```

And how to use the model:
```python
output = model(
**tokenizer.batch_encode_plus(
[
(
"A potem szedł środkiem drogi w kurzawie, bo zamiatał nogami, ślepy dziad prowadzony przez tłustego kundla na sznurku.",
"A potem leciał od lasu chłopak z butelką, ale ten ujrzawszy księdza przy drodze okrążył go z dala i biegł na przełaj pól do karczmy."
)
],
padding="longest",
add_special_tokens=True,
return_tensors="pt",
)
)
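
# The forward pass above returns a standard `transformers` model output; the
# attribute below is the generic `transformers` API, not HerBERT-specific.
# `output.last_hidden_state` holds the token-level embeddings as a tensor of
# shape (batch_size, sequence_length, hidden_size).
embeddings = output.last_hidden_state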
```

## License
CC BY 4.0

## Citation
If you use this model, please cite the following papers.

The `herbert-klej-cased-v1` version of the model:
```
@inproceedings{rybak-etal-2020-klej,
title = "{KLEJ}: Comprehensive Benchmark for Polish Language Understanding",
author = "Rybak, Piotr and Mroczkowski, Robert and Tracz, Janusz and Gawlik, Ireneusz",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.111",
pages = "1191--1201",
}
```

The `herbert-base-cased` or `herbert-large-cased` version of the model:
```
@inproceedings{mroczkowski-etal-2021-herbert,
title = "{H}er{BERT}: Efficiently Pretrained Transformer-based Language Model for {P}olish",
author = "Mroczkowski, Robert and
Rybak, Piotr and
Wr{\'o}blewska, Alina and
Gawlik, Ireneusz",
booktitle = "Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing",
month = apr,
year = "2021",
address = "Kiyv, Ukraine",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2021.bsnlp-1.1",
pages = "1--10",
}
```

## Contact
You can contact us at: [email protected]