https://github.com/ai-forever/model-zoo
NLP model zoo for Russian
- Host: GitHub
- URL: https://github.com/ai-forever/model-zoo
- Owner: ai-forever
- License: apache-2.0
- Created: 2021-07-12T12:03:23.000Z (about 4 years ago)
- Default Branch: master
- Last Pushed: 2021-07-15T06:31:59.000Z (about 4 years ago)
- Last Synced: 2025-03-29T09:12:07.202Z (6 months ago)
- Topics: bert, nlp, pytorch, roberta, roberta-model, russian, russian-language, t5, t5-model, transformers
- Homepage:
- Size: 22.5 MB
- Stars: 45
- Watchers: 3
- Forks: 1
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
# Welcome to the Model Zoo!
### Here you can find NLP models for Russian, implemented in HF [transformers](https://huggingface.co/sberbank-ai/)🤗
[Open in Colab](https://colab.research.google.com/github/sberbank-ai/model-zoo/blob/master/examples/Sber_ai_examples.ipynb)
## Models:
| Model | Task | Type | Tokenizer | Dict size | Num Parameters | Training Data Volume |
|-----------------|----------------------|-----------------|-----------|-----------|-----------------|----------------------|
| ruBERT-base | mask filling | encoder | bpe | 120 138 | 178 M | 30 GB |
| ruBERT-large | mask filling | encoder | bpe | 120 138 | 427 M | 30 GB |
| ruRoBERTa-large | mask filling | encoder | bbpe | 50 257 | 355 M | 250 GB |
| ruT5-base       | text2text generation | encoder-decoder | bpe       | 32 101    | 222 M           | 300 GB               |
| ruT5-large      | text2text generation | encoder-decoder | bpe       | 32 101    | 737 M           | 300 GB               |

### ruT5
Text2Text Generation task
[T5 paper](https://arxiv.org/abs/1910.10683)
- Large: [HF Model](https://huggingface.co/sberbank-ai/ruT5-large)
- Base: [HF Model](https://huggingface.co/sberbank-ai/ruT5-base)

[Model parameters](https://huggingface.co/transformers/model_doc/t5.html)
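Not from this README, but as a quick hedged sketch of how these checkpoints load with the standard seq2seq API (the prompt is illustrative; like the original T5, ruT5 usually needs task-specific fine-tuning before `generate()` produces useful text):

```
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = 'sberbank-ai/ruT5-base'  # model id from the list above
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# "Translate to English: the cat sits on the mat" - illustrative prompt only
input_ids = tokenizer("Перевести на английский: кошка сидит на коврике",
                      return_tensors='pt').input_ids
outputs = model.generate(input_ids, max_length=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```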
### ruRoBERTa
Fill-mask task
[RoBERTa paper](https://arxiv.org/abs/1907.11692)
- Large: [HF Model](https://huggingface.co/sberbank-ai/ruRoberta-large)
### ruBERT
Fill-mask task
[BERT paper](https://arxiv.org/abs/1810.04805)
- Large: [HF Model](https://huggingface.co/sberbank-ai/ruBert-large)
- Base: [HF Model](https://huggingface.co/sberbank-ai/ruBert-base)
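A hedged sketch (not in the original README): the fill-mask pipeline shown below for ruRoBERTa works the same way here, except that BERT-family tokenizers use `[MASK]` instead of `<mask>`:

```
from transformers import pipeline

# BERT-family models use [MASK] as the mask token (RoBERTa uses <mask>)
unmasker = pipeline("fill-mask", model="sberbank-ai/ruBert-base")
# "The capital of Russia is [MASK]."
unmasker("Столица России - [MASK].", top_k=3)
```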
## How to
Use this [Colab notebook](https://colab.research.google.com/github/sberbank-ai/model-zoo/blob/master/examples/Sber_ai_examples.ipynb) to explore the models or run them on your machine.
### Model setup
```
pip install -r requirements.txt
```

### Pipeline usage
```
from transformers import pipeline

unmasker = pipeline("fill-mask", model="sberbank-ai/ruRoberta-large")
# "Evgeny Ponasenkov called <mask> the greatest maestro."
unmasker("Евгений Понасенков назвал <mask> величайшим маэстро.", top_k=1)
```
### Classical usage
```
# ruRoberta-large example
from transformers import RobertaForMaskedLM, RobertaTokenizer, pipeline

model = RobertaForMaskedLM.from_pretrained('sberbank-ai/ruRoberta-large')
tokenizer = RobertaTokenizer.from_pretrained('sberbank-ai/ruRoberta-large')
unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
# "It is worth writing on Habr about <mask> more often."
unmasker("Стоит чаще писать на Хабр про <mask>.")
```
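For a pipeline-free variant, here is a hedged sketch of the manual forward pass (the mask-indexing logic is illustrative, not code from this repo):

```
import torch
from transformers import RobertaForMaskedLM, RobertaTokenizer

model = RobertaForMaskedLM.from_pretrained('sberbank-ai/ruRoberta-large')
tokenizer = RobertaTokenizer.from_pretrained('sberbank-ai/ruRoberta-large')

inputs = tokenizer("Стоит чаще писать на Хабр про <mask>.", return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits

# find the position of the <mask> token and take the highest-scoring vocab id
mask_pos = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```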
### Use BertViz to obtain model visualizations
Roberta model_view: ![Roberta model_view](https://github.com/sberbank-ai/model-zoo/examples/roberta_small.gif)
```
from transformers import RobertaModel, RobertaTokenizer
from bertviz import model_view

model_version = 'sberbank-ai/ruRoberta-large'
model = RobertaModel.from_pretrained(model_version, output_attentions=True)
tokenizer = RobertaTokenizer.from_pretrained(model_version)

sentence_a = "The cat sat on the mat"
sentence_b = "The cat lay on the rug"
inputs = tokenizer.encode_plus(sentence_a, sentence_b, return_tensors='pt', add_special_tokens=True)
input_ids = inputs['input_ids']
attention = model(input_ids)[-1]  # attentions are the last element of the outputs
input_id_list = input_ids[0].tolist()  # Batch index 0
tokens = tokenizer.convert_ids_to_tokens(input_id_list)
model_view(attention, tokens)
```
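BertViz also provides `head_view` for a per-head attention view; a minimal follow-up reusing `attention` and `tokens` from the snippet above:

```
from bertviz import head_view

# interactive per-head attention for the same inputs as model_view
head_view(attention, tokens)
```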