![CI](https://github.com/natasha/razdel/actions/workflows/test.yml/badge.svg)

`razdel` is a rule-based system for Russian sentence and word tokenization.

## Usage

```python
>>> from razdel import tokenize

>>> tokens = list(tokenize('Кружка-термос на 0.5л (50/64 см³, 516;...)'))
>>> tokens
[Substring(0, 13, 'Кружка-термос'),
 Substring(14, 16, 'на'),
 Substring(17, 20, '0.5'),
 Substring(20, 21, 'л'),
 Substring(22, 23, '('),
 ...]

>>> [_.text for _ in tokens]
['Кружка-термос', 'на', '0.5', 'л', '(', '50/64', 'см³', ',', '516', ';', '...', ')']
```

```python
>>> from razdel import sentenize

>>> text = '''
... - "Так в чем же дело?" - "Не ра-ду-ют".
... И т. д. и т. п. В общем, вся газета
... '''

>>> list(sentenize(text))
[Substring(1, 23, '- "Так в чем же дело?"'),
 Substring(24, 40, '- "Не ра-ду-ют".'),
 Substring(41, 56, 'И т. д. и т. п.'),
 Substring(57, 76, 'В общем, вся газета')]
```
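
Both `tokenize` and `sentenize` are generators yielding the `Substring` objects shown above; `start` and `stop` are offsets into the original string, so every piece can be sliced back out of the source text. A minimal sketch (the loop is illustrative, not library API):

```python
>>> from razdel import sentenize

>>> text = 'И т. д. и т. п. В общем, вся газета'
>>> for sent in sentenize(text):
...     # each Substring slices back into the original text
...     assert text[sent.start:sent.stop] == sent.text
...     print(sent.start, sent.stop, sent.text)
0 15 И т. д. и т. п.
16 35 В общем, вся газета
```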

## Installation

`razdel` supports Python 3.7+ and PyPy 3.

```bash
$ pip install razdel
```

## Documentation

Materials are in Russian:

* Razdel page on natasha.github.io
* Razdel section of Datafest 2020 talk

## Evaluation

Unfortunately, there is no single correct way to split text into sentences and tokens. For example, one may split `«Как же так?! Захар...» — воскликнут Пронин.` into three sentences `["«Как же так?!", "Захар...»", "— воскликнут Пронин."]`, while `razdel` splits it into two: `["«Как же так?!", "Захар...» — воскликнут Пронин."]`. What is the correct way to tokenize `т.е.`? One may split it into `т.|е.`, while `razdel` splits it into `т|.|е|.`.
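
Continuing the `т.е.` example, `razdel`'s convention is easy to check in the REPL:

```python
>>> from razdel import tokenize

>>> # each letter and period becomes a separate token
>>> [_.text for _ in tokenize('т.е.')]
['т', '.', 'е', '.']
```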

`razdel` tries to mimic the segmentation of these 4 datasets: SynTagRus, OpenCorpora, GICRYA and RNC. These datasets mainly consist of news and fiction, so `razdel` rules are optimized for these kinds of texts. The library may perform worse on other domains such as social media, scientific articles or legal documents.

We measure the absolute number of errors rather than a percentage score. The tokenization task contains a lot of trivial cases: for example, the text `в 5 часов ...` is correctly tokenized even by Python's built-in `str.split` into `в| |5| |часов| |...`. Non-trivial cases, such as `чуть-чуть?!`, are rare: one may split it into `чуть|-|чуть|?|!`, while the correct tokenization is `чуть-чуть|?!`. Due to the large share of trivial cases, the overall quality of all segmenters is high, and it is hard to differentiate between, say, 99.33%, 99.95% and 99.88%, so we report the absolute number of errors instead.
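
To make the distinction concrete, here is a sketch of both cases in the REPL (the `чуть-чуть?!` output assumes `razdel`'s rules cover it, matching the correct segmentation given above):

```python
>>> from razdel import tokenize

>>> # trivial case: whitespace splitting is already correct
>>> 'в 5 часов ...'.split()
['в', '5', 'часов', '...']

>>> # non-trivial case: keep the hyphenated word and '?!' together
>>> [_.text for _ in tokenize('чуть-чуть?!')]
['чуть-чуть', '?!']
```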

`errors` — the number of errors per 1000 tokens/sentences. For example, if the etalon (gold) segmentation is `что-то|?` and the prediction is `что|-|то?`, then the number of errors is 3: 1 for the missing split in `то?` plus 2 for the extra splits in `что|-|то`.
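
A rough sketch of how such a count can be computed as the symmetric difference between token-boundary sets (`boundaries` and `count_errors` are hypothetical helpers, not naeval's actual implementation; the sketch assumes tokens are adjacent, with no separators between them):

```python
def boundaries(tokens):
    # offsets between characters where one token ends and the next begins
    offsets, pos = set(), 0
    for token in tokens[:-1]:
        pos += len(token)
        offsets.add(pos)
    return offsets

def count_errors(etalon, prediction):
    # missing splits plus extra splits
    return len(boundaries(etalon) ^ boundaries(prediction))

assert count_errors(['что-то', '?'], ['что', '-', 'то?']) == 3
```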

`time` — seconds taken to process the whole dataset.

`spacy_tokenize`, `aatimofeev` and the other models are defined in naeval/segment/models.py; for links to the models, see the Naeval registry. The tables below were computed in naeval/segment/main.ipynb.

### Tokens

|                                      | corpora errors | corpora time (s) | syntag errors | syntag time (s) | gicrya errors | gicrya time (s) | rnc errors | rnc time (s) |
|--------------------------------------|---------------:|-----------------:|--------------:|----------------:|--------------:|----------------:|-----------:|-------------:|
| `re.findall(\w+\|\d+\|\p+)`          | 24             | 0.5              | 16            | 0.5             | 19            | 0.4             | 60         | 0.4          |
| `spacy`                              | 26             | 6.2              | 13            | 5.8             | 14            | 4.1             | 32         | 3.9          |
| `nltk.word_tokenize`                 | 60             | 3.4              | 256           | 3.3             | 75            | 2.7             | 199        | 2.9          |
| `mystem`                             | 23             | 5.0              | 15            | 4.7             | 19            | 3.7             | 14         | 3.9          |
| `mosestokenizer`                     | 11             | 2.1              | 8             | 1.9             | 15            | 1.6             | 16         | 1.7          |
| `segtok.word_tokenize`               | 16             | 2.3              | 8             | 2.3             | 14            | 1.8             | 9          | 1.8          |
| `aatimofeev/spacy_russian_tokenizer` | 17             | 48.7             | 4             | 51.1            | 5             | 39.5            | 20         | 52.2         |
| `koziev/rutokenizer`                 | 15             | 1.1              | 8             | 1.0             | 23            | 0.8             | 68         | 0.9          |
| `razdel.tokenize`                    | 9              | 2.9              | 9             | 2.8             | 3             | 2.0             | 16         | 2.2          |

### Sentences

|                             | corpora errors | corpora time (s) | syntag errors | syntag time (s) | gicrya errors | gicrya time (s) | rnc errors | rnc time (s) |
|-----------------------------|---------------:|-----------------:|--------------:|----------------:|--------------:|----------------:|-----------:|-------------:|
| `re.split([.?!…])`          | 114            | 0.9              | 53            | 0.6             | 63            | 0.7             | 130        | 1.0          |
| `segtok.split_single`       | 106            | 17.8             | 36            | 13.4            | 1001          | 1.1             | 912        | 2.8          |
| `mosestokenizer`            | 238            | 8.9              | 182           | 5.7             | 80            | 6.4             | 287        | 7.4          |
| `nltk.sent_tokenize`        | 92             | 10.1             | 36            | 5.3             | 44            | 5.6             | 183        | 8.9          |
| `deeppavlov/rusenttokenize` | 57             | 10.9             | 10            | 7.9             | 56            | 6.8             | 119        | 7.0          |
| `razdel.sentenize`          | 52             | 6.1              | 7             | 3.9             | 72            | 4.5             | 59         | 7.5          |

## Support

- Chat — https://telegram.me/natural_language_processing
- Issues — https://github.com/natasha/razdel/issues
- Commercial support — https://lab.alexkuk.ru

## Development

Dev env

```bash
python -m venv ~/.venvs/natasha-razdel
source ~/.venvs/natasha-razdel/bin/activate

pip install -r requirements/dev.txt
pip install -e .
```

Test

```bash
make test
make int # 2000 integration tests
```

Release

```bash
# Update setup.py version

git commit -am 'Up version'
git tag v0.5.0

git push
git push --tags
```

`mystem` errors on `syntag`

```bash
# see naeval/data
cat syntag_tokens.txt | razdel-ctl sample 1000 | razdel-ctl gen | razdel-ctl diff --show moses_tokenize | less
```

Non-trivial token tests

```bash
pv data/*_tokens.txt | razdel-ctl gen --recall | razdel-ctl diff space_tokenize > tests.txt
pv data/*_tokens.txt | razdel-ctl gen --precision | razdel-ctl diff re_tokenize >> tests.txt
```

Update integration tests

```bash
cd tests/data/
pv sents.txt | razdel-ctl up sentenize > t; mv t sents.txt
```

`razdel` and `moses` diff

```bash
cat data/*_tokens.txt | razdel-ctl sample 1000 | razdel-ctl gen | razdel-ctl up tokenize | razdel-ctl diff moses_tokenize | less
```

`razdel` performance

```bash
cat data/*_tokens.txt | razdel-ctl sample 10000 | pv -l | razdel-ctl gen | razdel-ctl diff tokenize | wc -l
```