# HTML2SENT

HTML2SENT modifies HTML to improve sentence tokenizer quality.

https://github.com/KMiNT21/html2sent

This library works with **HTML content**, modifying certain tags to improve the quality of sentence tokenization.

## Install the NLTK Python package

```
pip install nltk
```

## Download punkt data

```python
import nltk
nltk.download('punkt')
```

## Download this library

```
git clone https://github.com/KMiNT21/html2sent.git
```

## Usage
```python
import html2sent
# `html` is a string with the HTML page source; returns a list of sentences
sentences = html2sent.tokenize(html, language='english')
```
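
For example, a minimal end-to-end sketch (the file name `page.html` is just a placeholder for wherever your HTML comes from):

```python
import html2sent

# Hypothetical input file containing the HTML to process
with open('page.html', encoding='utf-8') as f:
    html = f.read()

# Print each detected sentence on its own line
for sentence in html2sent.tokenize(html, language='english'):
    print(sentence)
```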

If you don't want to use NLTK, you can use just the preprocessing functions:

```python
import html2sent

text = html2sent.html2text(html)        # convert HTML to plain text
text = html2sent.preprocess_text(text)  # clean up the text before splitting into sentences
```
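
The resulting plain text can then be fed to any sentence splitter you prefer; below is a deliberately naive regex-based sketch (the regex is illustrative only, not part of the library):

```python
import re
import html2sent

html = "<p>First sentence. Second sentence!</p>"  # toy input
text = html2sent.preprocess_text(html2sent.html2text(html))

# Naive splitter for illustration: break after ., ! or ? followed by whitespace
sentences = re.split(r'(?<=[.!?])\s+', text)
print(sentences)
```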

Demo: [`demo_simple.py`](https://github.com/KMiNT21/html2sent/blob/master/demo_simple.py) and [`demo_folder_multiprocessing.py`](https://github.com/KMiNT21/html2sent/blob/master/demo_folder_multiprocessing.py)

## For the Russian language

If the **nltk** library is used to split the extracted text into sentences, you also need to download a trained ru_punkt tokenizer for Russian.

Options:

- `git clone https://github.com/mhq/train_punkt.git`

- `git clone https://github.com/Mottl/ru_punkt.git`

Copy the russian.pickle file into the nltk_data folder (next to the other language .pickle files).
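
Once russian.pickle is in place, a call like the following should work (a minimal sketch; it assumes the language parameter is simply passed through to NLTK):

```python
import html2sent

# Assumes russian.pickle has been copied into nltk_data/tokenizers/punkt/
html = "<p>Первое предложение. Второе предложение!</p>"  # toy input
sentences = html2sent.tokenize(html, language='russian')
print(sentences)
```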

An alternative, more accurate option is the **razdel** library.

For usage details, see https://github.com/natasha/razdel
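
A minimal sketch of combining the preprocessing functions with razdel (assuming razdel is installed with `pip install razdel`):

```python
import html2sent
from razdel import sentenize

html = "<p>Первое предложение. Второе предложение!</p>"  # toy input
text = html2sent.preprocess_text(html2sent.html2text(html))

# sentenize() yields substring objects; keep only their text
sentences = [s.text for s in sentenize(text)]
print(sentences)
```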