https://github.com/natasha/naeval
Comparing quality and performance of NLP systems for Russian language
evaluation nlp performance-analysis python russian
- Host: GitHub
- URL: https://github.com/natasha/naeval
- Owner: natasha
- License: mit
- Created: 2020-03-22T08:43:16.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2023-07-24T09:28:24.000Z (over 1 year ago)
- Last Synced: 2024-04-27T23:36:08.612Z (7 months ago)
- Topics: evaluation, nlp, performance-analysis, python, russian
- Language: Python
- Size: 308 KB
- Stars: 44
- Watchers: 8
- Forks: 7
- Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE
![CI](https://github.com/natasha/naeval/actions/workflows/test.yml/badge.svg)
Naeval compares the quality and performance of NLP systems for the Russian language. It is used to evaluate the components of the Natasha project: Razdel, Navec, and Slovnet.
## Install
Naeval supports Python 3.7+
```bash
$ pip install naeval
```

## Documentation
Materials are in Russian:
* Naeval page on natasha.github.io
* Naeval section of Datafest 2020 talk

## Models

| Model | Tags | Description |
|---|---|---|
| DeepPavlov NER | `ner` | BiLSTM-CRF NER trained on Collection5. Original repo, docs, paper |
| DeepPavlov BERT NER | `ner` | Current SOTA for Russian language. Docs, video |
| | `ner` | DeepPavlov solution for BSNLP-2019. Paper |
| DeepPavlov Morph | `morph` | |
| DeepPavlov BERT Morph | `morph` | |
| DeepPavlov BERT Syntax | `syntax` | BERT + biaffine head. Docs |
| | `ner` | |
| Slovnet BERT NER | `ner` | |
| | `morph` | |
| Slovnet BERT Morph | `morph` | |
| | `syntax` | |
| Slovnet BERT Syntax | `syntax` | |
| | `ner` `morph` | First place on factRuEval-2016, a sophisticated rule-based system |
| | `ner` `morph` `syntax` | Tool by Stanford NLP released in 2020. Paper |
| | `token` `sent` `ner` `morph` `syntax` | Uses Russian models trained by @buriy |
| | `morph` `syntax` `ner` `token` `sent` | Multifunctional NLP solution by ISP RAS |
| | `ner` | GLR parser by Yandex; only the implementation for person names is publicly available |
| | `ner` | Engine developed at MIT plus a third-party model for the Russian language |
| | `morph` | CRF tagger, part of the Solarix project |
| | `morph` | First-place solution at morphoRuEval-2017. Post on Habr |
| | `morph` | |
| | `morph` `syntax` | Model trained on SynTagRus |
| | `token` `sent` | Multifunctional library, provides a model for Russian text segmentation. Docs |
| | `token` `morph` | Wrapper for Yandex morphological analyzers |
| | `token` `sent` | Wrapper for Perl Moses utils |
| | `token` `sent` | |
| | `token` | |
| | `token` `sent` | |
| | `token` `sent` | Spacy segmentation pipeline for Russian texts by @aatimofeev |
| | `sent` | DeepPavlov sentence segmentation |

## Tokenization
See the Razdel evaluation section for more info.
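
| | corpora errors | corpora time | syntag errors | syntag time | gicrya errors | gicrya time | rnc errors | rnc time |
|---|---|---|---|---|---|---|---|---|
| `re.findall(\w+\|\d+\|\p+)` | 24 | 0.5 | 16 | 0.5 | 19 | 0.4 | 60 | 0.4 |
| spacy | 26 | 6.2 | 13 | 5.8 | 14 | 4.1 | 32 | 3.9 |
| nltk.word_tokenize | 60 | 3.4 | 256 | 3.3 | 75 | 2.7 | 199 | 2.9 |
| mystem | 23 | 5.0 | 15 | 4.7 | 19 | 3.7 | 14 | 3.9 |
| mosestokenizer | 11 | 2.1 | 8 | 1.9 | 15 | 1.6 | 16 | 1.7 |
| segtok.word_tokenize | 16 | 2.3 | 8 | 2.3 | 14 | 1.8 | 9 | 1.8 |
| aatimofeev/spacy_russian_tokenizer | 17 | 48.7 | 4 | 51.1 | 5 | 39.5 | 20 | 52.2 |
| koziev/rutokenizer | 15 | 1.1 | 8 | 1.0 | 23 | 0.8 | 68 | 0.9 |
| razdel.tokenize | 9 | 2.9 | 9 | 2.8 | 3 | 2.0 | 16 | 2.2 |
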
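The regex baseline in the first row can be sketched in a few lines of Python. This is only an illustration, not the benchmark code: Python's `re` module has no `\p` escape, so the sketch assumes `\p` is shorthand for punctuation and approximates it with `[^\w\s]+`. The name `baseline_tokenize` is hypothetical.

```python
import re

# Assumption: `\p` in the table's shorthand means "punctuation". Python's `re`
# has no `\p` escape, so punctuation is approximated as any run of characters
# that are neither word characters nor whitespace.
TOKEN = re.compile(r'\w+|\d+|[^\w\s]+')


def baseline_tokenize(text):
    """Split text into word, number, and punctuation chunks."""
    return TOKEN.findall(text)


print(baseline_tokenize('Привет, мир!'))
print(baseline_tokenize('Кружка-термос на 0.5л'))
```

The second call splits `0.5л` into `0`, `.`, and `5л`, which is the kind of case where a plain regex is likely to disagree with a reference segmentation.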
## Sentence segmentation
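
| | corpora errors | corpora time | syntag errors | syntag time | gicrya errors | gicrya time | rnc errors | rnc time |
|---|---|---|---|---|---|---|---|---|
| `re.split([.?!…])` | 114 | 0.9 | 53 | 0.6 | 63 | 0.7 | 130 | 1.0 |
| segtok.split_single | 106 | 17.8 | 36 | 13.4 | 1001 | 1.1 | 912 | 2.8 |
| mosestokenizer | 238 | 8.9 | 182 | 5.7 | 80 | 6.4 | 287 | 7.4 |
| nltk.sent_tokenize | 92 | 10.1 | 36 | 5.3 | 44 | 5.6 | 183 | 8.9 |
| deeppavlov/rusenttokenize | 57 | 10.9 | 10 | 7.9 | 56 | 6.8 | 119 | 7.0 |
| razdel.sentenize | 52 | 6.1 | 7 | 3.9 | 72 | 4.5 | 59 | 7.5 |
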
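For reference, the `re.split([.?!…])` baseline can be approximated as below. This is a minimal sketch, not the benchmark code; the helper name `baseline_sentenize` and the whitespace stripping are assumptions added for readability.

```python
import re

# Split on any single sentence-final punctuation mark, as in the table's
# baseline row. The delimiters themselves are dropped by re.split.
SENT_END = re.compile(r'[.?!…]')


def baseline_sentenize(text):
    """Return non-empty chunks between sentence-final punctuation marks."""
    parts = (part.strip() for part in SENT_END.split(text))
    return [part for part in parts if part]


print(baseline_sentenize('Привет. Как дела? Хорошо!'))
print(baseline_sentenize('Он ушёл, т. е. сбежал.'))
```

The second call breaks the abbreviation "т. е." into separate chunks, which illustrates why a naive split tends to produce many errors on real text.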
## Pretrained embeddings
See the Navec evaluation section for more info.

| | type | init, s | get, µs | disk, mb | ram, mb | vocab |
|---|---|---|---|---|---|---|
| hudlit_12B_500K_300d_100q | navec | 1.1 | 21.6 | 50.6 | 95.3 | 500K |
| news_1B_250K_300d_100q | navec | 0.8 | 20.7 | 25.4 | 47.7 | 250K |
| ruscorpora_upos_cbow_300_20_2019 | w2v | 3.3 | 1.4 | 220.6 | 236.1 | 189K |
| ruwikiruscorpora_upos_skipgram_300_2_2019 | w2v | 5.0 | 1.5 | 290.0 | 309.4 | 248K |
| tayga_upos_skipgram_300_2_2019 | w2v | 5.2 | 1.4 | 290.7 | 310.9 | 249K |
| tayga_none_fasttextcbow_300_10_2019 | fasttext | 8.0 | 13.4 | 2741.9 | 2746.9 | 192K |
| araneum_none_fasttextcbow_300_5_2018 | fasttext | 16.4 | 10.6 | 2752.1 | 2754.7 | 195K |


| | type | simlex | hj | rt | ae | ae2 | lrwc |
|---|---|---|---|---|---|---|---|
| hudlit_12B_500K_300d_100q | navec | 0.310 | 0.707 | 0.842 | 0.931 | 0.923 | 0.604 |
| news_1B_250K_300d_100q | navec | 0.230 | 0.590 | 0.784 | 0.866 | 0.861 | 0.589 |
| ruscorpora_upos_cbow_300_20_2019 | w2v | 0.359 | 0.685 | 0.852 | 0.758 | 0.896 | 0.602 |
| ruwikiruscorpora_upos_skipgram_300_2_2019 | w2v | 0.321 | 0.723 | 0.817 | 0.801 | 0.860 | 0.629 |
| tayga_upos_skipgram_300_2_2019 | w2v | 0.429 | 0.749 | 0.871 | 0.771 | 0.899 | 0.639 |
| tayga_none_fasttextcbow_300_10_2019 | fasttext | 0.369 | 0.639 | 0.793 | 0.682 | 0.813 | 0.536 |
| araneum_none_fasttextcbow_300_5_2018 | fasttext | 0.349 | 0.671 | 0.801 | 0.706 | 0.793 | 0.579 |

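The `init, s` and `get, µs` columns report load time and per-lookup latency. One plausible way to take such measurements is sketched below; it is not the procedure Naeval uses. The `measure` helper and the toy dict-based model are hypothetical stand-ins, and a real run would pass a loader for an actual embedding model.

```python
import time


def measure(load, words):
    """Return (init time in seconds, mean lookup time in microseconds)."""
    start = time.perf_counter()
    model = load()
    init_s = time.perf_counter() - start

    start = time.perf_counter()
    for word in words:
        model.get(word)  # assumes the model exposes a dict-like get()
    get_us = (time.perf_counter() - start) / len(words) * 1e6
    return init_s, get_us


def load_toy_model():
    # Hypothetical stand-in for loading real embeddings from disk.
    return {'кот': [0.1, 0.2], 'пёс': [0.3, 0.4]}


init_s, get_us = measure(load_toy_model, ['кот', 'пёс', 'слон'] * 1000)
print(f'init: {init_s:.6f} s, get: {get_us:.3f} µs')
```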
## Morphology taggers
See Slovnet evaluation section for more info.

| | news | wiki | fiction | social | poetry |
|---|---|---|---|---|---|
| slovnet | 0.961 | 0.815 | 0.905 | 0.807 | 0.664 |
| slovnet_bert | 0.982 | 0.884 | 0.990 | 0.890 | 0.856 |
| deeppavlov | 0.940 | 0.841 | 0.944 | 0.870 | 0.857 |
| deeppavlov_bert | 0.951 | 0.868 | 0.964 | 0.892 | 0.865 |
| udpipe | 0.918 | 0.811 | 0.957 | 0.870 | 0.776 |
| spacy | 0.964 | 0.849 | 0.942 | 0.857 | 0.784 |
| stanza | 0.934 | 0.831 | 0.940 | 0.873 | 0.825 |
| rnnmorph | 0.896 | 0.812 | 0.890 | 0.860 | 0.838 |
| maru | 0.894 | 0.808 | 0.887 | 0.861 | 0.840 |
| rupostagger | 0.673 | 0.645 | 0.661 | 0.641 | 0.636 |


| | init, s | disk, mb | ram, mb | speed, it/s |
|---|---|---|---|---|
| slovnet | 1.0 | 27 | 115 | 532.0 |
| slovnet_bert | 5.0 | 475 | 8087 | 285.0 (gpu) |
| deeppavlov | 4.0 | 32 | 10240 | 90.0 (gpu) |
| deeppavlov_bert | 20.0 | 1393 | 8704 | 85.0 (gpu) |
| udpipe | 6.9 | 45 | 242 | 56.2 |
| spacy | 8.0 | 140 | 579 | 50.0 |
| stanza | 2.0 | 591 | 393 | 92.0 |
| rnnmorph | 8.7 | 10 | 289 | 16.6 |
| maru | 15.8 | 44 | 370 | 36.4 |
| rupostagger | 4.8 | 3 | 118 | 48.0 |

## Syntax parser
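
| | news uas | news las | wiki uas | wiki las | fiction uas | fiction las | social uas | social las | poetry uas | poetry las |
|---|---|---|---|---|---|---|---|---|---|---|
| slovnet | 0.907 | 0.880 | 0.775 | 0.718 | 0.806 | 0.776 | 0.726 | 0.656 | 0.542 | 0.469 |
| slovnet_bert | 0.965 | 0.936 | 0.891 | 0.828 | 0.958 | 0.940 | 0.846 | 0.782 | 0.776 | 0.706 |
| deeppavlov_bert | 0.962 | 0.910 | 0.882 | 0.786 | 0.963 | 0.929 | 0.844 | 0.761 | 0.784 | 0.691 |
| udpipe | 0.873 | 0.823 | 0.622 | 0.531 | 0.910 | 0.876 | 0.700 | 0.624 | 0.625 | 0.534 |
| spacy | 0.943 | 0.916 | 0.851 | 0.783 | 0.901 | 0.874 | 0.804 | 0.737 | 0.704 | 0.616 |
| stanza | 0.940 | 0.886 | 0.815 | 0.716 | 0.936 | 0.895 | 0.802 | 0.714 | 0.713 | 0.613 |
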
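
| | init, s | disk, mb | ram, mb | speed, it/s |
|---|---|---|---|---|
| slovnet | 1.0 | 27 | 125 | 450.0 |
| slovnet_bert | 5.0 | 504 | 3427 | 200.0 (gpu) |
| deeppavlov_bert | 34.0 | 1427 | 8704 | 75.0 (gpu) |
| udpipe | 6.9 | 45 | 242 | 56.2 |
| spacy | 9.0 | 140 | 579 | 41.0 |
| stanza | 3.0 | 591 | 890 | 12.0 |
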
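The `uas` and `las` columns are the unlabeled and labeled attachment scores. As a reminder of the standard definitions, here is a minimal sketch; it is not Naeval's scoring code. The function name and the `(head, relation)` token representation are assumptions made for illustration.

```python
def attachment_scores(gold, pred):
    """Compute (UAS, LAS) for one aligned token sequence.

    gold, pred: lists of (head_index, relation) pairs, one per token.
    UAS counts tokens with the correct head; LAS additionally requires
    the correct relation label.
    """
    if not gold or len(gold) != len(pred):
        raise ValueError('gold and pred must be non-empty and aligned')
    total = len(gold)
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    las_hits = sum(g == p for g, p in zip(gold, pred))
    return uas_hits / total, las_hits / total


# Toy annotations: the third token has the right head but the wrong label,
# the fourth token has the wrong head.
gold = [(2, 'nsubj'), (0, 'root'), (2, 'obj'), (3, 'amod')]
pred = [(2, 'nsubj'), (0, 'root'), (2, 'nmod'), (2, 'amod')]
uas, las = attachment_scores(gold, pred)
print(uas, las)
```

By construction LAS never exceeds UAS, which matches every row in the accuracy table.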
## NER
See the Slovnet evaluation section for more info.

| f1 | factru PER | factru LOC | factru ORG | gareev PER | gareev ORG | ne5 PER | ne5 LOC | ne5 ORG | bsnlp PER | bsnlp LOC | bsnlp ORG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| slovnet | 0.959 | 0.915 | 0.825 | 0.977 | 0.899 | 0.984 | 0.973 | 0.951 | 0.944 | 0.834 | 0.718 |
| slovnet_bert | 0.973 | 0.928 | 0.831 | 0.991 | 0.911 | 0.996 | 0.989 | 0.976 | 0.960 | 0.838 | 0.733 |
| deeppavlov | 0.910 | 0.886 | 0.742 | 0.944 | 0.798 | 0.942 | 0.919 | 0.881 | 0.866 | 0.767 | 0.624 |
| deeppavlov_bert | 0.971 | 0.928 | 0.825 | 0.980 | 0.916 | 0.997 | 0.990 | 0.976 | 0.954 | 0.840 | 0.741 |
| deeppavlov_slavic | 0.956 | 0.884 | 0.714 | 0.976 | 0.776 | 0.984 | 0.817 | 0.761 | 0.965 | 0.925 | 0.831 |
| pullenti | 0.905 | 0.814 | 0.686 | 0.939 | 0.639 | 0.952 | 0.862 | 0.683 | 0.900 | 0.769 | 0.566 |
| spacy | 0.901 | 0.886 | 0.765 | 0.970 | 0.883 | 0.967 | 0.928 | 0.918 | 0.919 | 0.823 | 0.693 |
| stanza | 0.943 | 0.865 | 0.687 | 0.953 | 0.827 | 0.923 | 0.753 | 0.734 | 0.938 | 0.838 | 0.724 |
| texterra | 0.900 | 0.800 | 0.597 | 0.888 | 0.561 | 0.901 | 0.777 | 0.594 | 0.858 | 0.783 | 0.548 |
| tomita | 0.929 | | | 0.921 | | 0.945 | | | 0.881 | | |
| mitie | 0.888 | 0.861 | 0.532 | 0.849 | 0.452 | 0.753 | 0.642 | 0.432 | 0.736 | 0.801 | 0.524 |


| | init, s | disk, mb | ram, mb | speed, it/s |
|---|---|---|---|---|
| slovnet | 1.0 | 27 | 205 | 25.3 |
| slovnet_bert | 5.0 | 473 | 9500 | 40.0 (gpu) |
| deeppavlov | 5.9 | 1024 | 3072 | 24.3 (gpu) |
| deeppavlov_bert | 34.5 | 2048 | 6144 | 13.1 (gpu) |
| deeppavlov_slavic | 35.0 | 2048 | 4096 | 8.0 (gpu) |
| pullenti | 2.9 | 16 | 253 | 6.0 |
| spacy | 8.0 | 140 | 625 | 8.0 |
| stanza | 3.0 | 591 | 11264 | 3.0 (gpu) |
| texterra | 47.6 | 193 | 3379 | 4.0 |
| tomita | 2.0 | 64 | 63 | 29.8 |
| mitie | 28.3 | 327 | 261 | 32.8 |

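The quality table reports F1 per entity type. A common way to compute it is sketched below, assuming exact span matching, which may differ from the matching rule Naeval applies. The function name and the `(start, stop, type)` span representation are hypothetical.

```python
from collections import defaultdict


def f1_by_type(gold, pred):
    """Per-type F1 over (start, stop, type) spans with exact matching.

    Assumption: a predicted span counts as correct only if its boundaries
    and type both match a gold span exactly.
    """
    gold, pred = set(gold), set(pred)
    counts = defaultdict(lambda: {'tp': 0, 'fp': 0, 'fn': 0})
    for span in pred:
        counts[span[2]]['tp' if span in gold else 'fp'] += 1
    for span in gold - pred:
        counts[span[2]]['fn'] += 1

    scores = {}
    for type_, c in counts.items():
        found = c['tp'] + c['fp']
        expected = c['tp'] + c['fn']
        precision = c['tp'] / found if found else 0.0
        recall = c['tp'] / expected if expected else 0.0
        total = precision + recall
        scores[type_] = 2 * precision * recall / total if total else 0.0
    return scores


# Toy spans: one PER is missed and the ORG span has a wrong right boundary.
gold = [(0, 4, 'PER'), (10, 16, 'LOC'), (20, 26, 'ORG'), (30, 34, 'PER')]
pred = [(0, 4, 'PER'), (10, 16, 'LOC'), (20, 25, 'ORG')]
scores = f1_by_type(gold, pred)
print(scores)
```

Under exact matching a boundary error counts as both a false positive and a false negative, so the ORG score in this toy example drops to zero.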
## Support
- Chat: https://t.me/natural_language_processing
- Issues: https://github.com/natasha/nerus/issues
- Commercial support: https://lab.alexkuk.ru

## Development
Dev env
```bash
python -m venv ~/.venvs/natasha-naeval
source ~/.venvs/natasha-naeval/bin/activate

pip install -r requirements/dev.txt
pip install -e .

python -m ipykernel install --user --name natasha-naeval
```

Lint
```bash
make lint
```