# HerBERT
**[HerBERT](https://en.wikipedia.org/wiki/Zbigniew_Herbert)** is a series of BERT-based language models trained for Polish language understanding.

All three HerBERT models are summarized below:

| Model | Tokenizer | Vocab Size | Batch Size | Train Steps | KLEJ Score |
| :---- | --------: | ---------: | ---------: | ----------: | ---------: |
| `herbert-klej-cased-v1` | BPE | 50K | 570 | 180k | 80.5 |
| `herbert-base-cased` | BPE-Dropout | 50K | 2560 | 50k | 86.3 |
| `herbert-large-cased` | BPE-Dropout | 50K | 2560 | 60k | 88.4 |

The full KLEJ benchmark leaderboard is available [here](https://klejbenchmark.com/leaderboard).

For more details about the model architecture, training process, corpora used, and evaluation, please refer to:
- [KLEJ: Comprehensive Benchmark for Polish Language Understanding](https://www.aclweb.org/anthology/2020.acl-main.111/)
- [HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish](https://www.aclweb.org/anthology/2021.bsnlp-1.1/)

## Usage
Example of how to load the model:
```python
from transformers import AutoTokenizer, AutoModel

# Map each model variant to its tokenizer and model repositories on the Hugging Face Hub.
model_names = {
    "herbert-klej-cased-v1": {
        "tokenizer": "allegro/herbert-klej-cased-tokenizer-v1",
        "model": "allegro/herbert-klej-cased-v1",
    },
    "herbert-base-cased": {
        "tokenizer": "allegro/herbert-base-cased",
        "model": "allegro/herbert-base-cased",
    },
    "herbert-large-cased": {
        "tokenizer": "allegro/herbert-large-cased",
        "model": "allegro/herbert-large-cased",
    },
}

tokenizer = AutoTokenizer.from_pretrained(model_names["herbert-base-cased"]["tokenizer"])
model = AutoModel.from_pretrained(model_names["herbert-base-cased"]["model"])
```
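Note that `herbert-klej-cased-v1` is the only variant with a separate tokenizer repository (`allegro/herbert-klej-cased-tokenizer-v1`); for `herbert-base-cased` and `herbert-large-cased`, the tokenizer and model are published under the same name.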

And how to use the model:
```python
# Encode a sentence pair and run it through the model.
output = model(
    **tokenizer.batch_encode_plus(
        [
            (
                "A potem szedł środkiem drogi w kurzawie, bo zamiatał nogami, ślepy dziad prowadzony przez tłustego kundla na sznurku.",
                "A potem leciał od lasu chłopak z butelką, ale ten ujrzawszy księdza przy drodze okrążył go z dala i biegł na przełaj pól do karczmy.",
            )
        ],
        padding="longest",
        add_special_tokens=True,
        return_tensors="pt",
    )
)
```
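
The call returns a standard Hugging Face model output. As a minimal sketch of reading it (assuming the `herbert-base-cased` model loaded above; the first-token pooling below is an illustration, not a recommendation from the authors):
```python
# `output` comes from the snippet above.
token_embeddings = output.last_hidden_state  # (batch_size, sequence_length, hidden_size)

# A simple representation of the whole input: the embedding at the first
# ([CLS]-position) token; mean pooling over tokens is a common alternative.
sentence_embedding = token_embeddings[:, 0, :]
print(sentence_embedding.shape)  # e.g. torch.Size([1, 768]) for herbert-base-cased
```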

## License
CC BY 4.0

## Citation
If you use one of these models, please cite the corresponding paper:

For the `herbert-klej-cased-v1` version of the model:
```
@inproceedings{rybak-etal-2020-klej,
    title = "{KLEJ}: Comprehensive Benchmark for Polish Language Understanding",
    author = "Rybak, Piotr and
      Mroczkowski, Robert and
      Tracz, Janusz and
      Gawlik, Ireneusz",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.111",
    pages = "1191--1201",
}
```

For the `herbert-base-cased` or `herbert-large-cased` versions of the model:
```
@inproceedings{mroczkowski-etal-2021-herbert,
    title = "{H}er{BERT}: Efficiently Pretrained Transformer-based Language Model for {P}olish",
    author = "Mroczkowski, Robert and
      Rybak, Piotr and
      Wr{\'o}blewska, Alina and
      Gawlik, Ireneusz",
    booktitle = "Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing",
    month = apr,
    year = "2021",
    address = "Kiyv, Ukraine",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.bsnlp-1.1",
    pages = "1--10",
}
```

## Contact
You can contact us at: [email protected]