https://github.com/fostroll/toxine
Tiny preprocessor for Russian text
- Host: GitHub
- URL: https://github.com/fostroll/toxine
- Owner: fostroll
- License: bsd-3-clause
- Created: 2020-04-11T18:10:43.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2021-08-30T09:48:43.000Z (almost 5 years ago)
- Last Synced: 2025-09-06T06:03:16.333Z (9 months ago)
- Topics: natural-language-processing, nlp, preprocessing, python
- Language: Python
- Homepage:
- Size: 463 KB
- Stars: 5
- Watchers: 1
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# RuMor: Russian Morphology project

## Toxine: a tiny Python NLP library for Russian text preprocessing
[PyPI](https://pypi.org/project/toxine/)
[Python](https://www.python.org/)
[License: BSD-3-Clause](https://opensource.org/licenses/BSD-3-Clause)

Part of the ***RuMor*** project. It contains a pipeline for preprocessing and
tokenizing texts in *Russian*, and also includes preliminary entity tagging.
Highlights are:
* Extracting emojis, emails, dates, phones, URLs, HTML/XML fragments, etc.
* Tagging/removing tokens with disallowed symbols
* Normalizing punctuation
* Tokenization (via *NLTK*)
* Russian *Wikipedia* tokenizer
* [*brat*](https://brat.nlplab.org/) annotations support
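
To make the first few highlights concrete, here is a minimal standard-library sketch of entity tagging and punctuation normalization. It is **not** ***Toxine***'s API or implementation: the function names (`tag_entities`, `normalize_punct`), the placeholder tags, and the regular expressions are all hypothetical and much simpler than what a real preprocessor needs.

```python
import re

# Hypothetical patterns, for illustration only (not taken from Toxine).
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')
URL_RE = re.compile(r'https?://[^\s<>"]+')


def tag_entities(text):
    """Replace URLs and emails with placeholder tags.

    Returns the tagged text and a list of (tag, original value) pairs
    in the order they were extracted (URLs first, then emails).
    """
    found = []

    def make_repl(tag):
        def repl(match):
            found.append((tag, match.group(0)))
            return tag
        return repl

    text = URL_RE.sub(make_repl('<URL>'), text)
    text = EMAIL_RE.sub(make_repl('<EMAIL>'), text)
    return text, found


def normalize_punct(text):
    """Unify a few punctuation variants (a deliberately small subset)."""
    text = text.replace('…', '...')                # ellipsis character to three dots
    text = re.sub(r'[«»“”„]', '"', text)           # typographic quotes to plain quotes
    text = re.sub(r'\s+([,.!?;:])', r'\1', text)   # drop whitespace before punctuation
    return text


tagged, entities = tag_entities(
    'Пишите на ivan@example.com или см. https://example.com/page !'
)
clean = normalize_punct(tagged)
```

Keeping the extracted values next to their tags means the original text can be restored later; for the actual interface, see the Text Preprocessor documentation linked under Usage.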
## Installation
### pip
***Toxine*** supports *Python 3.5* or later. To install it via *pip*, run:
```sh
$ pip install toxine
```
If you currently have a previous version of ***Toxine*** installed, use:
```sh
$ pip install toxine -U
```
### From Source
Alternatively, you can install ***Toxine*** from the source of this *git
repository*:
```sh
$ git clone https://github.com/fostroll/toxine.git
$ cd toxine
$ pip install -e .
```
This gives you access to the examples, which are not included in the *PyPI* package.
## Setup
***Toxine*** relies on *NLTK* with the *punkt* data downloaded. If you haven't
downloaded it yet, start the *Python* interpreter and execute:
```python
>>> import nltk
>>> nltk.download('punkt')
```
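
If you would rather do this from a script and skip the download when the data is already present, a minimal sketch could look like the following. The helper name `ensure_punkt` is hypothetical and not part of ***Toxine***; it relies only on the *NLTK* calls `nltk.data.find` and `nltk.download`.

```python
def ensure_punkt():
    """Make sure the NLTK 'punkt' data is available.

    Returns True if the data is already installed or was downloaded
    successfully, False if the download failed.
    """
    import nltk  # imported lazily so the helper can be defined without NLTK

    try:
        # Raises LookupError when the resource is not installed locally.
        nltk.data.find('tokenizers/punkt')
        return True
    except LookupError:
        return nltk.download('punkt')
```

Calling `ensure_punkt()` once at program start-up, before any tokenization, should be enough.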
**NB:** If you plan to use the methods for renewing *brat* annotations, you also
need to install the
[*python-Levenshtein*](https://pypi.org/project/python-Levenshtein/) library.
See more on the
[*brat* annotations support](https://github.com/fostroll/toxine/blob/master/doc/README_BRAT.md)
page.
## Usage
* [Text Preprocessor](https://github.com/fostroll/toxine/blob/master/doc/README_TEXT_PREPROCESSOR.md)
* [Wrapper for tokenized *Wikipedia*](https://github.com/fostroll/toxine/blob/master/doc/README_WIKIPEDIA.md)
* [*brat* annotations support](https://github.com/fostroll/toxine/blob/master/doc/README_BRAT.md)
## Examples
You can find them in the `examples` directory of the ***Toxine*** *GitHub*
repository.
## License
***Toxine*** is released under the BSD 3-Clause License. See the
[LICENSE](https://github.com/fostroll/toxine/blob/master/LICENSE) file for
more details.