
# Slavic BERT NER

***Notice:** The repo is left as-is; the Slavic BERT model is now part of the [DeepPavlov repo](https://github.com/deepmipt/DeepPavlov).*

**BERT** is a method of pre-training language representations, meaning that we train a general-purpose "language understanding" model on a large text corpus (like Wikipedia), and then use that model for downstream NLP tasks that we care about (like question answering). For details, see the original [BERT github](https://github.com/google-research/bert).

The repository contains the following **Bulgarian**+**Czech**+**Polish**+**Russian** specific models:
- [shared BERT model](#slavic-bert)
- [NER model (`PER`, `LOC`, `ORG`, `PRO`, `EVT`)](#slavic-ner)

Our academic paper, which describes tuning Transformers for the NER task in detail, can be found here: https://www.aclweb.org/anthology/W19-3712/.

## Slavic BERT

The Slavic model is the result of transferring the `2018_11_23/multi_cased_L-12_H-768_A-12` Multilingual BERT model to Bulgarian (`bg`), Czech (`cs`), Polish (`pl`) and Russian (`ru`). The fine-tuning was performed on a stratified dataset of `bg`, `cs` and `pl` Wikipedias and `ru` news.

The model format is the same as in the original repository.

* **[`BERT, Slavic Cased`](http://files.deeppavlov.ai/deeppavlov_data/bg_cs_pl_ru_cased_L-12_H-768_A-12.tar.gz)**:
4 languages, 12-layer, 768-hidden, 12-heads, 110M parameters, 600MB
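
If you prefer to fetch the checkpoint from a script rather than by hand, a minimal standard-library sketch (using the URL above; the target directory is just an example) could look like:

```python
import tarfile
import urllib.request

# URL of the Slavic BERT checkpoint from the link above
MODEL_URL = "http://files.deeppavlov.ai/deeppavlov_data/bg_cs_pl_ru_cased_L-12_H-768_A-12.tar.gz"
ARCHIVE = "bg_cs_pl_ru_cased_L-12_H-768_A-12.tar.gz"

# Download the ~600MB archive, then extract bert_config.json,
# vocab.txt and the TF checkpoint into the current directory
urllib.request.urlretrieve(MODEL_URL, ARCHIVE)
with tarfile.open(ARCHIVE, "r:gz") as tar:
    tar.extractall(".")
```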

## Slavic NER

Named Entity Recognition (further, **NER**) is the task of recognizing named entities in text and detecting their types.

We used the Slavic BERT model as the base for the NER system. First, each input word is fed into the case-sensitive WordPiece tokenizer, and the final hidden representation corresponding to the first subtoken of each word is extracted. These representations are fed into a dense classification layer over the NER label set, and a token-level CRF layer is added on top.

![BERT NER diagram](bert_ner_diagram.png)
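
To make the first-subtoken step concrete, here is a minimal sketch (not the repository's actual code; `subtokens` is a hypothetical WordPiece output) of how word-level positions are selected from the subtoken sequence:

```python
# Each word may split into several WordPiece subtokens; only the hidden
# state of the *first* subtoken represents the word for NER.
words = ["To", "Bert", "z", "ulicy", "Sezamkowej"]
# Hypothetical WordPiece output; continuation pieces start with "##"
subtokens = ["To", "Bert", "z", "ulic", "##y", "Se", "##zam", "##kowej"]

# Index of the first subtoken of each word within the subtoken sequence
first_subtoken_ids = [i for i, tok in enumerate(subtokens)
                      if not tok.startswith("##")]
assert len(first_subtoken_ids) == len(words)

# Given BERT outputs of shape [seq_len, hidden], the word-level features are
# sequence_output[first_subtoken_ids]; these go into the dense + CRF layers.
print(list(zip(words, first_subtoken_ids)))
# [('To', 0), ('Bert', 1), ('z', 2), ('ulicy', 3), ('Sezamkowej', 5)]
```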

The model was trained on the [BSNLP-2019 dataset](http://bsnlp.cs.helsinki.fi/shared_task.html). The pre-trained model can recognize the following entity types:

- Persons (PER)
- Locations (LOC)
- Organizations (ORG)
- Products (PRO)
- Events (EVT)

The metrics for all languages and entities on the test set are:

| Language    | Tag  | Precision | Recall   | RPM (Relaxed Partial Matching) |
|-------------|:----:|:---------:|:--------:|:------------------------------:|
| cs          |      | 94.3      | 93.4     | 93.9 |
| ru          |      | 88.1      | 86.6     | 87.3 |
| bg          |      | 90.3      | 84.3     | 87.2 |
| pl          |      | 93.3      | 93.0     | 93.2 |
|             | PER  | 94.2      | 95.6     | 94.9 |
|             | LOC  | 96.6      | 96.4     | 96.5 |
|             | ORG  | 84.3      | 92.1     | 88.0 |
|             | PRO  | 87.6      | 51.3     | 64.7 |
|             | EVT  | 39.4      | 27.7     | 32.5 |
| **Overall** |      | **89.8**  | **91.8** | **90.8** |

For a detailed description of the evaluation method, see the [BSNLP-2019 Shared Task page](http://bsnlp.cs.helsinki.fi/shared_task.html).
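
The RPM column appears to be the F1-style harmonic mean of each row's precision and recall (last digits can differ slightly due to rounding of the underlying scores). A quick check:

```python
# Sanity check: RPM matches the harmonic mean (F1) of precision and recall
def harmonic_mean(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

print(round(harmonic_mean(87.6, 51.3), 1))  # 64.7 -- the PRO row above
print(round(harmonic_mean(84.3, 92.1), 1))  # 88.0 -- the ORG row above
```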

* **[`NER, Slavic Cased`](http://files.deeppavlov.ai/deeppavlov_data/ner_bert_slav.tar.gz)**:
4 languages, 13-layer + CRF, 768-hidden, 2.0GB

# Usage

#### Install

The toolkit supports Python 3.6 and Python 3.7. To install the required packages, use:

```bash
$ pip3 install -r requirements.txt
```

CAUTION: Python <= 3.5 and Python >= 3.8 are not supported; see the [DeepPavlov repo](https://github.com/deepmipt/deeppavlov) for details.

#### NER usage

```python
from deeppavlov import build_model

# Download and load model (set download=False to skip download phase)
ner = build_model("./ner_bert_slav.json", download=True)

# Get predictions
ner(["To Bert z ulicy Sezamkowej"])
# [[['To', 'Bert', 'z', 'ulicy', 'Sezamkowej']], [['O', 'B-PER', 'O', 'B-LOC', 'I-LOC']]]
ner(["Это", "Берт", "из", "России"])
# [[['Это'], ['Берт'], ['из'], ['России']], [['O'], ['B-PER'], ['O'], ['B-LOC']]]
```
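
The output tags follow the BIO scheme: `B-` opens an entity, `I-` continues it, and `O` marks tokens outside any entity. A small helper (not part of the repo, shown here only for illustration) can group tagged tokens into entity spans:

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (tag[2:], [token])  # start a new entity span
        elif tag.startswith("I-") and current and tag[2:] == current[0]:
            current[1].append(token)      # continue the current span
        else:                             # 'O' or an inconsistent 'I-' tag
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(etype, " ".join(words)) for etype, words in spans]

# Tokens and tags from the example above
tokens = ['To', 'Bert', 'z', 'ulicy', 'Sezamkowej']
tags = ['O', 'B-PER', 'O', 'B-LOC', 'I-LOC']
print(bio_to_spans(tokens, tags))
# [('PER', 'Bert'), ('LOC', 'ulicy Sezamkowej')]
```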

#### BERT usage

The Slavic BERT model can be used in any way proposed by the BERT developers.

One approach may be:

```python
import tensorflow as tf

from bert_dp.modeling import BertConfig, BertModel
from deeppavlov.models.preprocessors.bert_preprocessor import BertPreprocessor

bert_config = BertConfig.from_json_file('./bg_cs_pl_ru_cased_L-12_H-768_A-12/bert_config.json')

input_ids = tf.placeholder(shape=(None, None), dtype=tf.int32)
input_mask = tf.placeholder(shape=(None, None), dtype=tf.int32)
token_type_ids = tf.placeholder(shape=(None, None), dtype=tf.int32)

bert = BertModel(config=bert_config,
                 is_training=False,
                 input_ids=input_ids,
                 input_mask=input_mask,
                 token_type_ids=token_type_ids,
                 use_one_hot_embeddings=False)

preprocessor = BertPreprocessor(vocab_file='./bg_cs_pl_ru_cased_L-12_H-768_A-12/vocab.txt',
                                do_lower_case=False,
                                max_seq_length=512)

with tf.Session() as sess:
    # Load the pre-trained Slavic BERT weights
    tf.train.Saver().restore(sess, './bg_cs_pl_ru_cased_L-12_H-768_A-12/bert_model.ckpt')

    # Convert a Polish sentence to features and get contextual representations
    features = preprocessor(["Bert z ulicy Sezamkowej"])[0]
    print(sess.run(bert.sequence_output, feed_dict={input_ids: [features.input_ids],
                                                    input_mask: [features.input_mask],
                                                    token_type_ids: [features.input_type_ids]}))

    # The same for the first of several Russian single-word texts
    features = preprocessor(["Берт", "с", "Улицы", "Сезам"])[0]
    print(sess.run(bert.sequence_output, feed_dict={input_ids: [features.input_ids],
                                                    input_mask: [features.input_mask],
                                                    token_type_ids: [features.input_type_ids]}))
```
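
In this sketch, `bert.sequence_output` has shape `[batch_size, max_seq_length, 768]`, i.e. one contextual vector per WordPiece subtoken; the vector at position 0 corresponds to the `[CLS]` token and can serve as a whole-sequence representation.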

## Changes

[Jan 2021] Fixed the 'Model bert_ner is not registered' issue; updated to
`deeppavlov==0.17.2`, `tensorflow==1.15.5`.

## Citation
```
@inproceedings{arkhipov-etal-2019-tuning,
    title = "Tuning Multilingual Transformers for Language-Specific Named Entity Recognition",
    author = "Arkhipov, Mikhail and
      Trofimova, Maria and
      Kuratov, Yuri and
      Sorokin, Alexey",
    booktitle = "Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing",
    month = aug,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W19-3712",
    doi = "10.18653/v1/W19-3712",
    pages = "89--93"
}
```

## References

[1] - [Jacob Devlin et al.: *BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding*, 2018](https://arxiv.org/abs/1810.04805)

[2] - [Mozharova V., Loukachevitch N.: *Two-stage approach in Russian named entity recognition*, 2016](https://ieeexplore.ieee.org/document/7584769)

[3] - [BSNLP-2019 Shared Task](http://bsnlp.cs.helsinki.fi/shared_task.html)

[4] - [DeepPavlov: open-source library for dialog systems](https://github.com/deepmipt/deeppavlov)