Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/DavidNemeskey/emBERT
emtsv module for pre-trained Transformer-based models
- Host: GitHub
- URL: https://github.com/DavidNemeskey/emBERT
- Owner: DavidNemeskey
- License: lgpl-3.0
- Created: 2020-01-06T11:34:05.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2024-04-10T22:21:30.000Z (9 months ago)
- Last Synced: 2024-08-03T16:09:01.019Z (5 months ago)
- Language: Python
- Size: 144 KB
- Stars: 1
- Watchers: 6
- Forks: 1
- Open Issues: 12
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-hungarian-nlp - emBERT - emtsv module for pre-trained Transformer-based models. It provides tagging models based on Huggingface's transformers package. (Tools / Taggers / Chunkers)
README
# emBERT
[`emtsv`](https://github.com/dlt-rilmta/emtsv) module for pre-trained Transformer-based
models. It provides tagging models based on
[Huggingface's `transformers`](https://github.com/huggingface/transformers) package.

`emBERT` defines the following tools:

| Name(s) | Task | Training corpus | F1 score |
| ------- | ---- | --------------- | -------- |
| `bert-ner` | NER | Szeged NER corpus | 97.08\% |
| `bert-basenp` | base NP chunking | Szeged TreeBank 2.0 | **95.58\%** |
| `bert-np` (or `bert-chunk`) | maximal NP chunking | Szeged TreeBank 2.0 | **95.05\%** |

(The results in **bold** are state-of-the-art for Hungarian.)
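
For orientation, here is a minimal sketch of invoking one of these tools through the `emtsv` pipeline. It assumes a working `emtsv` installation with `emBERT` enabled, that the names in the table are the task names registered with `emtsv`, and that the `tok` module supplies the tokenized input `bert-ner` needs; the exact module chain and input sentence are illustrative.
```
# Sketch only: run the emBERT NER tagger as part of an emtsv pipeline.
# Assumes emtsv's standard command-line usage (task list as the first argument,
# text on stdin) and that `bert-ner` is the registered task name.
echo "A kutya elszaladt a parkban." | python3 ./main.py tok,bert-ner > tagged.tsv
```
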
Due to their size (a little over 700 MB apiece), the models are stored in a separate
repository. [emBERT-models](https://github.com/dlt-rilmta/emBERT-models)
is a submodule of this repository, so if the repository is cloned recursively
(with Git LFS installed), the models are downloaded as well:
```
git clone --recursive https://github.com/DavidNemeskey/emBERT.git
```

Alternatively, the models can be obtained via `emtsv`'s `download_models.py` script.
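
If the repository was already cloned without `--recursive`, the submodule and its LFS objects can still be fetched after the fact. The sketch below uses standard `git` and `git-lfs` commands; the `emBERT-models` directory name is an assumption based on the submodule repository's name.
```
# Sketch: fetch the model submodule after a non-recursive clone.
# Assumes git-lfs is installed and that the submodule checks out into
# ./emBERT-models (directory name assumed from the submodule repository).
git lfs install                 # one-time Git LFS setup for this user
git submodule update --init     # clone the emBERT-models submodule
git -C emBERT-models lfs pull   # download the model files tracked by LFS
```
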
If you use `emBERT` in your work, please cite the following paper
([see link for bib](https://hlt.bme.hu/en/publ/embert_2020); in Hungarian):

Nemeskey Dávid Márk 2020. Egy emBERT próbáló feladat. In Proceedings of the
16th Conference on Hungarian Computational Linguistics (MSZNY 2020). pp. 409-418.

## Training
Should the need arise to train a better model, or to build one for a different
domain, the `train_embert.py` script can be used to fine-tune a BERT model on
a token classification task. An example run that reproduces the chunking
results (given the same train-valid-test split and a GPU with 11G+ memory) is:

```
train_embert.py --data_dir ~/data/chunking/szeged_max_bioe1_100/ \
--bert_model SZTAKI-HLT/hubert-base-cc --task_name szeged_bioes_chunk \
--data_format tsv --output_dir bert_np --do_train --max_seq_length 384 \
--num_train_epochs=4 --train_batch_size 10 --learning_rate "1e-5" \
--do_eval --eval_batch_size 1 --use_viterbi --seed 42
```

Note that if the model is trained on a new tag set, it has to be added to
`embert/processors.py`.