natasha/razdel
Rule-based token, sentence segmentation for Russian language
https://github.com/natasha/razdel
- Host: GitHub
- URL: https://github.com/natasha/razdel
- Owner: natasha
- License: mit
- Created: 2018-11-10T10:23:50.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2023-07-24T09:33:46.000Z (over 1 year ago)
- Last Synced: 2024-10-08T00:42:17.561Z (about 1 month ago)
- Topics: nlp, python, russian, sentence-boundary-detection, sentence-segmentation, tokenization
- Language: Python
- Size: 37.2 MB
- Stars: 249
- Watchers: 14
- Forks: 31
- Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE
README
![CI](https://github.com/natasha/razdel/actions/workflows/test.yml/badge.svg)
`razdel` is a rule-based system for Russian sentence and word tokenization.
## Usage
```python
>>> from razdel import tokenize
>>> tokens = list(tokenize('Кружка-термос на 0.5л (50/64 см³, 516;...)'))
>>> tokens
[Substring(0, 13, 'Кружка-термос'),
Substring(14, 16, 'на'),
Substring(17, 20, '0.5'),
Substring(20, 21, 'л'),
Substring(22, 23, '(')
...]
>>> [_.text for _ in tokens]
['Кружка-термос', 'на', '0.5', 'л', '(', '50/64', 'см³', ',', '516', ';', '...', ')']
```
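Each `Substring` carries offsets into the original text, so tokens can be mapped back onto the source string. A minimal sketch, assuming the `start`/`stop`/`text` attribute names shown in the repr above:

```python
>>> text = 'Кружка-термос на 0.5л (50/64 см³, 516;...)'
>>> tokens = list(tokenize(text))
>>> all(text[t.start:t.stop] == t.text for t in tokens)
True
```

`sentenize` works the same way: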
```python
>>> from razdel import sentenize
>>> text = '''
... - "Так в чем же дело?" - "Не ра-ду-ют".
... И т. д. и т. п. В общем, вся газета
... '''
>>> list(sentenize(text))
[Substring(1, 23, '- "Так в чем же дело?"'),
Substring(24, 40, '- "Не ра-ду-ют".'),
Substring(41, 56, 'И т. д. и т. п.'),
Substring(57, 76, 'В общем, вся газета')]
```

## Installation
`razdel` supports Python 3.7+ and PyPy 3.
```bash
$ pip install razdel
```

## Documentation
Materials are in Russian:
* Razdel page on natasha.github.io
* Razdel section of the Datafest 2020 talk

## Evaluation
Unfortunately, there is no single correct way to split text into sentences and tokens. For example, one may split `«Как же так?! Захар...» — воскликнут Пронин.` into three sentences `["«Как же так?!", "Захар...»", "— воскликнут Пронин."]`, while `razdel` splits it into two: `["«Как же так?!", "Захар...» — воскликнут Пронин."]`. What would be the correct way to tokenize `т.е.`? One may split it into `т.|е.`; `razdel` splits it into `т|.|е|.`.
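The `т.е.` behaviour described above is easy to check directly (the output follows the description; shown here purely as an illustration):

```python
>>> from razdel import tokenize
>>> [t.text for t in tokenize('т.е.')]
['т', '.', 'е', '.']
```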
`razdel` tries to mimic the segmentation of these 4 datasets: SynTagRus, OpenCorpora, GICRYA and RNC. These datasets mainly consist of news and fiction, and `razdel`'s rules are optimized for these kinds of texts. The library may perform worse on other domains such as social media, scientific articles or legal documents.
We measure the absolute number of errors, because the tokenization task contains a lot of trivial cases. For example, the text `чуть-чуть?!` is non-trivial: one may split it into `чуть|-|чуть|?|!` while the correct tokenization is `чуть-чуть|?!`; such examples are rare. The vast majority of cases are trivial: the text `в 5 часов ...` is correctly tokenized even by Python's native `str.split` into `в| |5| |часов| |...`. Due to the large number of trivial cases the overall quality of all segmenters is high, and it is hard to differentiate between, say, 99.33%, 99.95% and 99.88%, so we report the absolute number of errors instead.
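The trivial case above can be reproduced in one line; note that `str.split()` drops the whitespace separators:

```python
>>> 'в 5 часов ...'.split()
['в', '5', 'часов', '...']
```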
`errors` — the number of errors per 1000 tokens/sentences. For example, suppose the reference (etalon) segmentation is `что-то|?` and the prediction is `что|-|то?`; then the number of errors is 3: 1 for the missing split in `то?` plus 2 for the extra splits in `что|-|то` (see the sketch below).
`time` — seconds taken to process the whole dataset.
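One way to compute the `errors` count, assuming segmentations are compared as sets of split offsets (a sketch for illustration, not naeval's actual implementation):

```python
def count_errors(etalon, prediction):
    """Segmentation errors as the symmetric difference of split
    positions: missing splits plus extra splits."""
    def splits(chunks):
        # Boundary offsets between chunks, e.g. ['что-то', '?'] -> {6}
        offsets, position = set(), 0
        for chunk in chunks[:-1]:
            position += len(chunk)
            offsets.add(position)
        return offsets
    return len(splits(etalon) ^ splits(prediction))

# 1 missing split ('то?') + 2 extra splits ('что|-|то') = 3 errors
assert count_errors(['что-то', '?'], ['что', '-', 'то?']) == 3
```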
`spacy_tokenize`, `aatimofeev` and the others are defined in naeval/segment/models.py; for links to the models see the Naeval registry. Tables are computed in naeval/segment/main.ipynb.
### Tokens
| | corpora, errors | corpora, time | syntag, errors | syntag, time | gicrya, errors | gicrya, time | rnc, errors | rnc, time |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| `re.findall(\w+\|\d+\|\p+)` | 24 | 0.5 | 16 | 0.5 | 19 | 0.4 | 60 | 0.4 |
| `spacy` | 26 | 6.2 | 13 | 5.8 | 14 | 4.1 | 32 | 3.9 |
| `nltk.word_tokenize` | 60 | 3.4 | 256 | 3.3 | 75 | 2.7 | 199 | 2.9 |
| `mystem` | 23 | 5.0 | 15 | 4.7 | 19 | 3.7 | 14 | 3.9 |
| `mosestokenizer` | 11 | 2.1 | 8 | 1.9 | 15 | 1.6 | 16 | 1.7 |
| `segtok.word_tokenize` | 16 | 2.3 | 8 | 2.3 | 14 | 1.8 | 9 | 1.8 |
| `aatimofeev/spacy_russian_tokenizer` | 17 | 48.7 | 4 | 51.1 | 5 | 39.5 | 20 | 52.2 |
| `koziev/rutokenizer` | 15 | 1.1 | 8 | 1.0 | 23 | 0.8 | 68 | 0.9 |
| `razdel.tokenize` | 9 | 2.9 | 9 | 2.8 | 3 | 2.0 | 16 | 2.2 |
### Sentences
| | corpora, errors | corpora, time | syntag, errors | syntag, time | gicrya, errors | gicrya, time | rnc, errors | rnc, time |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| `re.split([.?!…])` | 114 | 0.9 | 53 | 0.6 | 63 | 0.7 | 130 | 1.0 |
| `segtok.split_single` | 106 | 17.8 | 36 | 13.4 | 1001 | 1.1 | 912 | 2.8 |
| `mosestokenizer` | 238 | 8.9 | 182 | 5.7 | 80 | 6.4 | 287 | 7.4 |
| `nltk.sent_tokenize` | 92 | 10.1 | 36 | 5.3 | 44 | 5.6 | 183 | 8.9 |
| `deeppavlov/rusenttokenize` | 57 | 10.9 | 10 | 7.9 | 56 | 6.8 | 119 | 7.0 |
| `razdel.sentenize` | 52 | 6.1 | 7 | 3.9 | 72 | 4.5 | 59 | 7.5 |
## Support
- Chat — https://telegram.me/natural_language_processing
- Issues — https://github.com/natasha/razdel/issues
- Commercial support — https://lab.alexkuk.ru

## Development
Dev env
```bash
python -m venv ~/.venvs/natasha-razdel
source ~/.venvs/natasha-razdel/bin/activate
pip install -r requirements/dev.txt
pip install -e .
```

Test
```bash
make test
make int # 2000 integration tests
```

Release
```bash
# Update setup.py version
git commit -am 'Up version'
git tag v0.5.0
git push
git push --tags
```

`mystem` errors on `syntag`
```bash
# see naeval/data
cat syntag_tokens.txt | razdel-ctl sample 1000 | razdel-ctl gen | razdel-ctl diff --show moses_tokenize | less
```

Non-trivial token tests
```bash
pv data/*_tokens.txt | razdel-ctl gen --recall | razdel-ctl diff space_tokenize > tests.txt
pv data/*_tokens.txt | razdel-ctl gen --precision | razdel-ctl diff re_tokenize >> tests.txt
```

Update integration tests
```bash
cd tests/data/
pv sents.txt | razdel-ctl up sentenize > t; mv t sents.txt
```

`razdel` and `moses` diff
```bash
cat data/*_tokens.txt | razdel-ctl sample 1000 | razdel-ctl gen | razdel-ctl up tokenize | razdel-ctl diff moses_tokenize | less
```

`razdel` performance
```bash
cat data/*_tokens.txt | razdel-ctl sample 10000 | pv -l | razdel-ctl gen | razdel-ctl diff tokenize | wc -l
```