https://github.com/kmint21/html2sent
HTML2SENT modifies HTML to improve sentences tokenizer quality
nlp nltk python sentence-segmentation sentence-tokenizer text-mining tokenizer
- Host: GitHub
- URL: https://github.com/kmint21/html2sent
- Owner: KMiNT21
- License: mit
- Created: 2019-03-11T17:52:30.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2019-07-20T18:32:33.000Z (almost 6 years ago)
- Last Synced: 2025-03-31T22:05:30.412Z (3 months ago)
- Topics: nlp, nltk, python, sentence-segmentation, sentence-tokenizer, text-mining, tokenizer
- Language: Python
- Homepage:
- Size: 44.9 KB
- Stars: 8
- Watchers: 0
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
This library works with **HTML content** and modifies it within certain tags to improve the quality of sentence tokenization.
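To illustrate why HTML structure matters for sentence splitting, here is a minimal sketch that uses only the standard library. It is not the code html2sent uses: the `BLOCK_TAGS` set, the `BlockAwareExtractor` class and the `html_to_lines` function are all hypothetical names chosen for this example. The idea shown is that text from headings and list items usually has no trailing period, so it should be separated from the following text before a tokenizer sees it.

```python
import re
from html.parser import HTMLParser

# Tags treated as block-level in this sketch. This set is an assumption,
# not the list of tags that html2sent handles.
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6", "br", "tr"}


class BlockAwareExtractor(HTMLParser):
    """Collect text and emit a line break at every block-level tag boundary."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Source line wraps inside a block are not sentence boundaries,
        # so fold every whitespace run (including newlines) into one space.
        self.parts.append(re.sub(r"\s+", " ", data))


def html_to_lines(html):
    """Return the non-empty text lines, one per block-level element."""
    parser = BlockAwareExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    lines = (line.strip() for line in text.split("\n"))
    return [line for line in lines if line]


sample = (
    "<h1>Release notes</h1>"
    "<p>Fixed a crash. Added tests.</p>"
    "<ul><li>First item</li><li>Second item</li></ul>"
)
lines = html_to_lines(sample)
for line in lines:
    print(line)
```

Each printed line can then be passed to a sentence tokenizer on its own, so the heading "Release notes" is never glued to "Fixed a crash."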
## Install the NLTK Python package
```
pip install nltk
```
## Download punkt data
```python
import nltk
nltk.download('punkt')
```
## Download this library
    git clone https://github.com/KMiNT21/html2sent.git
## Usage
```python
import html2sent
sentences = html2sent.tokenize(html, language='english')
```
If you don't want to use NLTK, you can use only the preprocessing functions:
```python
import html2sent
text = html2sent.html2text(html)
text = html2sent.preprocess_text(text)
```
Demos: [`demo_simple.py`](https://github.com/KMiNT21/html2sent/blob/master/demo_simple.py) and [`demo_folder_multiprocessing.py`](https://github.com/KMiNT21/html2sent/blob/master/demo_folder_multiprocessing.py)
## For the Russian language
If the **nltk** library is used to split the resulting text into sentences, you also need to download a trained ru_punkt tokenizer for Russian. Options:
- git clone https://github.com/mhq/train_punkt.git
- git clone https://github.com/Mottl/ru_punkt.git
Copy the `russian.pickle` file into the nltk_data folder (alongside the other language .pickle files).
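The copy step can also be scripted. The sketch below is an illustration only: `install_punkt_pickle` is a hypothetical helper, and it assumes the usual NLTK layout in which the Punkt language pickles live under `tokenizers/punkt/` inside the nltk_data folder. Check where your own nltk_data folder is before using it.

```python
import shutil
from pathlib import Path


def install_punkt_pickle(pickle_path, nltk_data_dir):
    """Copy a trained Punkt pickle into <nltk_data_dir>/tokenizers/punkt/.

    Hypothetical helper for this README; the target subfolder is an
    assumption based on where NLTK keeps its other language pickles.
    """
    source = Path(pickle_path)
    if not source.is_file():
        raise FileNotFoundError(f"No such pickle file: {source}")
    target_dir = Path(nltk_data_dir) / "tokenizers" / "punkt"
    # Create the folder if punkt data has not been downloaded there yet.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    shutil.copy2(source, target)
    return target


# Example call (both paths are placeholders, adjust them to your setup):
# install_punkt_pickle("path/to/russian.pickle", Path.home() / "nltk_data")
```

Once the file is in place, `language='russian'` should be usable in the same way as `language='english'` above, provided NLTK finds the pickle in that folder.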
A more accurate alternative is the **razdel** library.
For usage details, see https://github.com/natasha/razdel