Comparing quality and performance of NLP systems for Russian language
https://github.com/natasha/naeval


        

![CI](https://github.com/natasha/naeval/actions/workflows/test.yml/badge.svg)

Naeval compares the quality and performance of NLP systems for the Russian language. It is used to evaluate components of the Natasha project: Razdel, Navec, and Slovnet.

## Install

Naeval supports Python 3.7+

```bash
$ pip install naeval
```

## Documentation

Materials are in Russian:

* Naeval page on natasha.github.io
* Naeval section of Datafest 2020 talk

## Models

| Model | Tags | Description |
|---|---|---|
| DeepPavlov NER | `ner` | BiLSTM-CRF NER trained on Collection5. Original repo, docs, paper |
| DeepPavlov BERT NER | `ner` | Current SOTA for Russian language. Docs, video |
| DeepPavlov Slavic BERT NER | `ner` | DeepPavlov solution for BSNLP-2019. Paper |
| DeepPavlov Morph | `morph` | Docs |
| DeepPavlov BERT Morph | `morph` | Docs |
| DeepPavlov BERT Syntax | `syntax` | BERT + biaffine head. Docs |
| Slovnet NER | `ner` | |
| Slovnet BERT NER | `ner` | |
| Slovnet Morph | `morph` | |
| Slovnet BERT Morph | `morph` | |
| Slovnet Syntax | `syntax` | |
| Slovnet BERT Syntax | `syntax` | |
| PullEnti | `ner` `morph` | First place on factRuEval-2016, a highly sophisticated rule-based system |
| Stanza | `ner` `morph` `syntax` | Tool by Stanford NLP released in 2020. Paper |
| SpaCy | `token` `sent` `ner` `morph` `syntax` | Uses Russian models trained by @buriy |
| Texterra | `morph` `syntax` `ner` `token` `sent` | Multifunctional NLP solution by ISP RAS |
| Tomita | `ner` | GLR parser by Yandex; only the implementation for person names is publicly available |
| MITIE | `ner` | Engine developed at MIT + third-party model for Russian language |
| RuPosTagger | `morph` | CRF tagger, part of the Solarix project |
| RNNMorph | `morph` | First place solution on morphoRuEval-2017. Post on Habr |
| Maru | `morph` | |
| UDPipe | `morph` `syntax` | Model trained on SynTagRus |
| NLTK | `token` `sent` | Multifunctional library, provides a model for Russian text segmentation. Docs |
| MyStem | `token` `morph` | Wrapper for Yandex morphological analyzers |
| Moses | `token` `sent` | Wrapper for Perl Moses utils |
| SegTok | `token` `sent` | |
| RuTokenizer | `token` | |
| Razdel | `token` `sent` | |
| Spacy Russian Tokenizer | `token` `sent` | Spacy segmentation pipeline for Russian texts by @aatimofeev |
| RuSentTokenizer | `sent` | DeepPavlov sentence segmentation |

## Tokenization

See the Razdel evaluation section for more info.




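| | corpora errors | corpora time | syntag errors | syntag time | gicrya errors | gicrya time | rnc errors | rnc time |
|---|---|---|---|---|---|---|---|---|
| `re.findall(\w+\|\d+\|\p+)` | 24 | 0.5 | 16 | 0.5 | 19 | 0.4 | 60 | 0.4 |
| `spacy` | 26 | 6.2 | 13 | 5.8 | 14 | 4.1 | 32 | 3.9 |
| `nltk.word_tokenize` | 60 | 3.4 | 256 | 3.3 | 75 | 2.7 | 199 | 2.9 |
| `mystem` | 23 | 5.0 | 15 | 4.7 | 19 | 3.7 | 14 | 3.9 |
| `mosestokenizer` | 11 | 2.1 | 8 | 1.9 | 15 | 1.6 | 16 | 1.7 |
| `segtok.word_tokenize` | 16 | 2.3 | 8 | 2.3 | 14 | 1.8 | 9 | 1.8 |
| `aatimofeev/spacy_russian_tokenizer` | 17 | 48.7 | 4 | 51.1 | 5 | 39.5 | 20 | 52.2 |
| `koziev/rutokenizer` | 15 | 1.1 | 8 | 1.0 | 23 | 0.8 | 68 | 0.9 |
| `razdel.tokenize` | 9 | 2.9 | 9 | 2.8 | 3 | 2.0 | 16 | 2.2 |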
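
The `re.findall(\w+|\d+|\p+)` row is a plain regex baseline. The snippet below is a minimal sketch of such a baseline, not the code Naeval runs. It assumes `\p` stands for punctuation; Python's built-in `re` module has no `\p` class, so `[^\w\s]` is used as a stand-in. The name `regex_tokenize` is hypothetical.

```python
import re

# Assumption: \p in the table row means "punctuation". Python's `re` does not
# support \p, so [^\w\s] (neither a word character nor whitespace) replaces it.
# \d+ is kept to mirror the row, although \w+ already matches digits.
TOKEN = re.compile(r'\w+|\d+|[^\w\s]+')


def regex_tokenize(text):
    """Split text into runs of word characters and runs of punctuation."""
    return TOKEN.findall(text)


print(regex_tokenize('Привет, мир!'))
print(regex_tokenize('т.е. 0.5л'))
```

The second call shows where such a baseline goes wrong: the abbreviation `т.е.` and the number `0.5` are both broken apart at the dots.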

## Sentence segmentation




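| | corpora errors | corpora time | syntag errors | syntag time | gicrya errors | gicrya time | rnc errors | rnc time |
|---|---|---|---|---|---|---|---|---|
| `re.split([.?!…])` | 114 | 0.9 | 53 | 0.6 | 63 | 0.7 | 130 | 1.0 |
| `segtok.split_single` | 106 | 17.8 | 36 | 13.4 | 1001 | 1.1 | 912 | 2.8 |
| `mosestokenizer` | 238 | 8.9 | 182 | 5.7 | 80 | 6.4 | 287 | 7.4 |
| `nltk.sent_tokenize` | 92 | 10.1 | 36 | 5.3 | 44 | 5.6 | 183 | 8.9 |
| `deeppavlov/rusenttokenize` | 57 | 10.9 | 10 | 7.9 | 56 | 6.8 | 119 | 7.0 |
| `razdel.sentenize` | 52 | 6.1 | 7 | 3.9 | 72 | 4.5 | 59 | 7.5 |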
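
The `re.split([.?!…])` row is the naive baseline for sentence segmentation. A minimal sketch of what that baseline amounts to is shown below; the helper name `regex_sentenize` is hypothetical, and stripping whitespace and dropping empty chunks are assumptions made here for readability.

```python
import re

# Sentence-final characters taken from the table row: . ? ! and the ellipsis.
BOUNDARY = re.compile(r'[.?!…]')


def regex_sentenize(text):
    """Split on every boundary character; the delimiters themselves are dropped."""
    chunks = BOUNDARY.split(text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


print(regex_sentenize('Привет! Как дела?'))
print(regex_sentenize('Он пришёл в 10 ч. утра. Она ушла!'))
```

In the second call the abbreviation `ч.` triggers a false boundary, so two sentences come back as three chunks. Dedicated segmenters try to avoid this kind of error.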

## Pretrained embeddings

See the Navec evaluation section for more info.




| | type | init, s | get, µs | disk, mb | ram, mb | vocab |
|---|---|---|---|---|---|---|
| `hudlit_12B_500K_300d_100q` | navec | 1.1 | 21.6 | 50.6 | 95.3 | 500K |
| `news_1B_250K_300d_100q` | navec | 0.8 | 20.7 | 25.4 | 47.7 | 250K |
| `ruscorpora_upos_cbow_300_20_2019` | w2v | 3.3 | 1.4 | 220.6 | 236.1 | 189K |
| `ruwikiruscorpora_upos_skipgram_300_2_2019` | w2v | 5.0 | 1.5 | 290.0 | 309.4 | 248K |
| `tayga_upos_skipgram_300_2_2019` | w2v | 5.2 | 1.4 | 290.7 | 310.9 | 249K |
| `tayga_none_fasttextcbow_300_10_2019` | fasttext | 8.0 | 13.4 | 2741.9 | 2746.9 | 192K |
| `araneum_none_fasttextcbow_300_5_2018` | fasttext | 16.4 | 10.6 | 2752.1 | 2754.7 | 195K |

| | type | simlex | hj | rt | ae | ae2 | lrwc |
|---|---|---|---|---|---|---|---|
| `hudlit_12B_500K_300d_100q` | navec | 0.310 | 0.707 | 0.842 | 0.931 | 0.923 | 0.604 |
| `news_1B_250K_300d_100q` | navec | 0.230 | 0.590 | 0.784 | 0.866 | 0.861 | 0.589 |
| `ruscorpora_upos_cbow_300_20_2019` | w2v | 0.359 | 0.685 | 0.852 | 0.758 | 0.896 | 0.602 |
| `ruwikiruscorpora_upos_skipgram_300_2_2019` | w2v | 0.321 | 0.723 | 0.817 | 0.801 | 0.860 | 0.629 |
| `tayga_upos_skipgram_300_2_2019` | w2v | 0.429 | 0.749 | 0.871 | 0.771 | 0.899 | 0.639 |
| `tayga_none_fasttextcbow_300_10_2019` | fasttext | 0.369 | 0.639 | 0.793 | 0.682 | 0.813 | 0.536 |
| `araneum_none_fasttextcbow_300_5_2018` | fasttext | 0.349 | 0.671 | 0.801 | 0.706 | 0.793 | 0.579 |

## Morphology taggers

See the Slovnet evaluation section for more info.




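| | news | wiki | fiction | social | poetry |
|---|---|---|---|---|---|
| `slovnet` | 0.961 | 0.815 | 0.905 | 0.807 | 0.664 |
| `slovnet_bert` | 0.982 | 0.884 | 0.990 | 0.890 | 0.856 |
| `deeppavlov` | 0.940 | 0.841 | 0.944 | 0.870 | 0.857 |
| `deeppavlov_bert` | 0.951 | 0.868 | 0.964 | 0.892 | 0.865 |
| `udpipe` | 0.918 | 0.811 | 0.957 | 0.870 | 0.776 |
| `spacy` | 0.964 | 0.849 | 0.942 | 0.857 | 0.784 |
| `stanza` | 0.934 | 0.831 | 0.940 | 0.873 | 0.825 |
| `rnnmorph` | 0.896 | 0.812 | 0.890 | 0.860 | 0.838 |
| `maru` | 0.894 | 0.808 | 0.887 | 0.861 | 0.840 |
| `rupostagger` | 0.673 | 0.645 | 0.661 | 0.641 | 0.636 |

| | init, s | disk, mb | ram, mb | speed, it/s |
|---|---|---|---|---|
| `slovnet` | 1.0 | 27 | 115 | 532.0 |
| `slovnet_bert` | 5.0 | 475 | 8087 | 285.0 (gpu) |
| `deeppavlov` | 4.0 | 32 | 10240 | 90.0 (gpu) |
| `deeppavlov_bert` | 20.0 | 1393 | 8704 | 85.0 (gpu) |
| `udpipe` | 6.9 | 45 | 242 | 56.2 |
| `spacy` | 8.0 | 140 | 579 | 50.0 |
| `stanza` | 2.0 | 591 | 393 | 92.0 |
| `rnnmorph` | 8.7 | 10 | 289 | 16.6 |
| `maru` | 15.8 | 44 | 370 | 36.4 |
| `rupostagger` | 4.8 | 3 | 118 | 48.0 |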

## Syntax parser




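| | news uas | news las | wiki uas | wiki las | fiction uas | fiction las | social uas | social las | poetry uas | poetry las |
|---|---|---|---|---|---|---|---|---|---|---|
| `slovnet` | 0.907 | 0.880 | 0.775 | 0.718 | 0.806 | 0.776 | 0.726 | 0.656 | 0.542 | 0.469 |
| `slovnet_bert` | 0.965 | 0.936 | 0.891 | 0.828 | 0.958 | 0.940 | 0.846 | 0.782 | 0.776 | 0.706 |
| `deeppavlov_bert` | 0.962 | 0.910 | 0.882 | 0.786 | 0.963 | 0.929 | 0.844 | 0.761 | 0.784 | 0.691 |
| `udpipe` | 0.873 | 0.823 | 0.622 | 0.531 | 0.910 | 0.876 | 0.700 | 0.624 | 0.625 | 0.534 |
| `spacy` | 0.943 | 0.916 | 0.851 | 0.783 | 0.901 | 0.874 | 0.804 | 0.737 | 0.704 | 0.616 |
| `stanza` | 0.940 | 0.886 | 0.815 | 0.716 | 0.936 | 0.895 | 0.802 | 0.714 | 0.713 | 0.613 |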
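
The `uas` and `las` columns are the usual attachment scores for dependency parsing: UAS is the share of tokens whose predicted head is correct, and LAS additionally requires the correct relation label. The sketch below illustrates these standard definitions only; the function name `attachment_scores` and the `(head, relation)` input format are assumptions, and Naeval's own scoring may differ in details such as how punctuation is treated.

```python
def attachment_scores(gold, pred):
    """Return (uas, las) for two equal-length lists of (head, relation) pairs."""
    if len(gold) != len(pred):
        raise ValueError('gold and pred must cover the same tokens')
    if not gold:
        raise ValueError('cannot score an empty sequence')

    head_hits = 0  # tokens with the correct head
    full_hits = 0  # tokens with the correct head and the correct relation
    for (gold_head, gold_rel), (pred_head, pred_rel) in zip(gold, pred):
        if gold_head == pred_head:
            head_hits += 1
            if gold_rel == pred_rel:
                full_hits += 1

    total = len(gold)
    return head_hits / total, full_hits / total


# Toy four-token sentence: one wrong label (token 3) and one wrong head (token 4).
gold = [(2, 'nsubj'), (0, 'root'), (2, 'obj'), (3, 'amod')]
pred = [(2, 'nsubj'), (0, 'root'), (2, 'iobj'), (2, 'amod')]
uas, las = attachment_scores(gold, pred)
print(uas, las)  # 0.75 0.5
```

Because LAS counts only tokens that already have the correct head, it can never exceed UAS, which matches every row of the table.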




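| | init, s | disk, mb | ram, mb | speed, it/s |
|---|---|---|---|---|
| `slovnet` | 1.0 | 27 | 125 | 450.0 |
| `slovnet_bert` | 5.0 | 504 | 3427 | 200.0 (gpu) |
| `deeppavlov_bert` | 34.0 | 1427 | 8704 | 75.0 (gpu) |
| `udpipe` | 6.9 | 45 | 242 | 56.2 |
| `spacy` | 9.0 | 140 | 579 | 41.0 |
| `stanza` | 3.0 | 591 | 890 | 12.0 |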

## NER

See the Slovnet evaluation section for more info.




| f1 | factru PER | factru LOC | factru ORG | gareev PER | gareev ORG | ne5 PER | ne5 LOC | ne5 ORG | bsnlp PER | bsnlp LOC | bsnlp ORG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| `slovnet` | 0.959 | 0.915 | 0.825 | 0.977 | 0.899 | 0.984 | 0.973 | 0.951 | 0.944 | 0.834 | 0.718 |
| `slovnet_bert` | 0.973 | 0.928 | 0.831 | 0.991 | 0.911 | 0.996 | 0.989 | 0.976 | 0.960 | 0.838 | 0.733 |
| `deeppavlov` | 0.910 | 0.886 | 0.742 | 0.944 | 0.798 | 0.942 | 0.919 | 0.881 | 0.866 | 0.767 | 0.624 |
| `deeppavlov_bert` | 0.971 | 0.928 | 0.825 | 0.980 | 0.916 | 0.997 | 0.990 | 0.976 | 0.954 | 0.840 | 0.741 |
| `deeppavlov_slavic` | 0.956 | 0.884 | 0.714 | 0.976 | 0.776 | 0.984 | 0.817 | 0.761 | 0.965 | 0.925 | 0.831 |
| `pullenti` | 0.905 | 0.814 | 0.686 | 0.939 | 0.639 | 0.952 | 0.862 | 0.683 | 0.900 | 0.769 | 0.566 |
| `spacy` | 0.901 | 0.886 | 0.765 | 0.970 | 0.883 | 0.967 | 0.928 | 0.918 | 0.919 | 0.823 | 0.693 |
| `stanza` | 0.943 | 0.865 | 0.687 | 0.953 | 0.827 | 0.923 | 0.753 | 0.734 | 0.938 | 0.838 | 0.724 |
| `texterra` | 0.900 | 0.800 | 0.597 | 0.888 | 0.561 | 0.901 | 0.777 | 0.594 | 0.858 | 0.783 | 0.548 |
| `tomita` | 0.929 | | | 0.921 | | 0.945 | | | 0.881 | | |
| `mitie` | 0.888 | 0.861 | 0.532 | 0.849 | 0.452 | 0.753 | 0.642 | 0.432 | 0.736 | 0.801 | 0.524 |
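
The table reports F1 separately for each entity type. As an illustration of how a per-type score of this kind can be computed, here is a minimal sketch that assumes exact-match scoring over `(start, stop, type)` spans. The function name `span_f1`, the span format, and the exact-match rule are all assumptions; Naeval's actual matching rules may differ.

```python
def span_f1(gold, pred, entity_type):
    """Exact-match F1 for one entity type over (start, stop, type) spans."""
    gold_spans = {span for span in gold if span[2] == entity_type}
    pred_spans = {span for span in pred if span[2] == entity_type}

    correct = len(gold_spans & pred_spans)
    if not correct:
        # No exact matches (or no spans of this type at all): score zero.
        return 0.0

    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Toy example: one PER span has a wrong boundary and one PER span is spurious.
gold = [(0, 12, 'PER'), (20, 26, 'LOC'), (30, 44, 'PER')]
pred = [(0, 12, 'PER'), (20, 26, 'LOC'), (30, 40, 'PER'), (50, 55, 'PER')]

for entity_type in ('PER', 'LOC', 'ORG'):
    print(entity_type, round(span_f1(gold, pred, entity_type), 3))
```

Scoring each type on its own is what allows a system such as Tomita, which only extracts person names, to appear in the table with values in the PER columns alone.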




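| | init, s | disk, mb | ram, mb | speed, it/s |
|---|---|---|---|---|
| `slovnet` | 1.0 | 27 | 205 | 25.3 |
| `slovnet_bert` | 5.0 | 473 | 9500 | 40.0 (gpu) |
| `deeppavlov` | 5.9 | 1024 | 3072 | 24.3 (gpu) |
| `deeppavlov_bert` | 34.5 | 2048 | 6144 | 13.1 (gpu) |
| `deeppavlov_slavic` | 35.0 | 2048 | 4096 | 8.0 (gpu) |
| `pullenti` | 2.9 | 16 | 253 | 6.0 |
| `spacy` | 8.0 | 140 | 625 | 8.0 |
| `stanza` | 3.0 | 591 | 11264 | 3.0 (gpu) |
| `texterra` | 47.6 | 193 | 3379 | 4.0 |
| `tomita` | 2.0 | 64 | 63 | 29.8 |
| `mitie` | 28.3 | 327 | 261 | 32.8 |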

## Support

- Chat — https://t.me/natural_language_processing
- Issues — https://github.com/natasha/nerus/issues
- Commercial support — https://lab.alexkuk.ru

## Development

Dev env

```bash
python -m venv ~/.venvs/natasha-naeval
source ~/.venvs/natasha-naeval/bin/activate

pip install -r requirements/dev.txt
pip install -e .

python -m ipykernel install --user --name natasha-naeval
```

Lint

```bash
make lint
```