Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/lyeoni/prenlp
Preprocessing Library for Natural Language Processing
natural-language-processing nlp preprocessing-library text-preprocessing text-processing
- Host: GitHub
- URL: https://github.com/lyeoni/prenlp
- Owner: lyeoni
- License: apache-2.0
- Created: 2019-11-12T08:18:45.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2022-12-06T22:36:12.000Z (almost 2 years ago)
- Last Synced: 2024-08-11T08:48:29.553Z (3 months ago)
- Topics: natural-language-processing, nlp, preprocessing-library, text-preprocessing, text-processing
- Language: Python
- Homepage:
- Size: 156 KB
- Stars: 160
- Watchers: 6
- Forks: 12
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
# PreNLP
[![PyPI](https://img.shields.io/pypi/v/prenlp.svg?style=flat-square&color=important)](https://pypi.org/project/prenlp/)
[![License](https://img.shields.io/github/license/lyeoni/prenlp?style=flat-square)](https://github.com/lyeoni/prenlp/blob/master/LICENSE)
[![GitHub stars](https://img.shields.io/github/stars/lyeoni/prenlp?style=flat-square)](https://github.com/lyeoni/prenlp/stargazers)
[![GitHub forks](https://img.shields.io/github/forks/lyeoni/prenlp?style=flat-square&color=blueviolet)](https://github.com/lyeoni/prenlp/network/members)

Preprocessing Library for Natural Language Processing
## Installation
### Requirements
- Python >= 3.6
- Mecab morphological analyzer for Korean
```
sh scripts/install_mecab.sh
# macOS users only: run the commands below before running the install_mecab.sh script.
# export MACOSX_DEPLOYMENT_TARGET=10.10
# CFLAGS='-stdlib=libc++' pip install konlpy
```
- C++ build tools for fastText
  - g++ >= 4.7.2 or clang >= 3.3
  - For **Windows**, [Visual Studio C++](https://visualstudio.microsoft.com/downloads/) is recommended.
### With pip
prenlp can be installed using pip as follows:
```
pip install prenlp
```

## Usage
### Data
#### Dataset Loading
Popular datasets for NLP tasks are provided in prenlp. All datasets are stored in the `/.data` directory.
- Sentiment Analysis: IMDb, NSMC
- Language Modeling: WikiText-2, WikiText-103, WikiText-ko, NamuWiki-ko

|Dataset|Language|Articles|Sentences|Tokens|Vocab|Size|
|-|-|-|-|-|-|-|
|WikiText-2|English|720|-|2,551,843|33,278|13.3MB|
|WikiText-103|English|28,595|-|103,690,236|267,735|517.4MB|
|WikiText-ko|Korean|477,946|2,333,930|131,184,780|662,949|667MB|
|NamuWiki-ko|Korean|661,032|16,288,639|715,535,778|1,130,008|3.3GB|
|WikiText-ko+NamuWiki-ko|Korean|1,138,978|18,622,569|846,720,558|1,360,538|3.95GB|

General use cases are as follows:
##### [WikiText-2 / WikiText-103](https://github.com/lyeoni/prenlp/blob/develop/prenlp/data/dataset/language_modeling.py)
```python
>>> import prenlp
>>> wikitext2 = prenlp.data.WikiText2()
>>> len(wikitext2)
3
>>> train, valid, test = prenlp.data.WikiText2()
>>> train[0]
'= Valkyria Chronicles III ='
```

##### [IMDB](https://github.com/lyeoni/prenlp/blob/master/prenlp/data/dataset/sentiment.py)
```python
>>> imdb_train, imdb_test = prenlp.data.IMDB()
>>> imdb_train[0]
["Minor Spoilers
Alison Parker (Cristina Raines) is a successful top model, living with the lawyer Michael Lerman (Chris Sarandon) in his apartment. She tried to commit ...", 'pos']
```

#### [Normalization](https://github.com/lyeoni/prenlp/blob/master/prenlp/data/normalizer.py)
Frequently used normalization functions for text pre-processing are provided in prenlp.
> url, HTML tag, emoticon, email, phone number, etc.

General use cases are as follows:
```python
>>> from prenlp.data import Normalizer
>>> normalizer = Normalizer(url_repl='[URL]', tag_repl='[TAG]', emoji_repl='[EMOJI]', email_repl='[EMAIL]', tel_repl='[TEL]', image_repl='[IMG]')

>>> normalizer.normalize('Visit this link for more details: https://github.com/')
'Visit this link for more details: [URL]'

>>> normalizer.normalize('Use HTML with the desired attributes: ')
'Use HTML with the desired attributes: [TAG]'

>>> normalizer.normalize('Hello 🤩, I love you 💓 !')
'Hello [EMOJI], I love you [EMOJI] !'

>>> normalizer.normalize('Contact me at [email protected]')
'Contact me at [EMAIL]'

>>> normalizer.normalize('Call +82 10-1234-5678')
'Call [TEL]'

>>> normalizer.normalize('Download our logo image, logo123.png, with transparent background.')
'Download our logo image, [IMG], with transparent background.'
```

### Tokenizer
Frequently used (subword) tokenizers for text pre-processing are provided in prenlp.
> SentencePiece, NLTKMosesTokenizer, Mecab

#### [SentencePiece](https://github.com/lyeoni/prenlp/blob/master/prenlp/tokenizer/tokenizer.py)
```python
>>> from prenlp.tokenizer import SentencePiece
>>> SentencePiece.train(input='corpus.txt', model_prefix='sentencepiece', vocab_size=10000)
>>> tokenizer = SentencePiece.load('sentencepiece.model')
>>> tokenizer('Time is the most valuable thing a man can spend.')
['▁Time', '▁is', '▁the', '▁most', '▁valuable', '▁thing', '▁a', '▁man', '▁can', '▁spend', '.']
>>> tokenizer.tokenize('Time is the most valuable thing a man can spend.')
['▁Time', '▁is', '▁the', '▁most', '▁valuable', '▁thing', '▁a', '▁man', '▁can', '▁spend', '.']
>>> tokenizer.detokenize(['▁Time', '▁is', '▁the', '▁most', '▁valuable', '▁thing', '▁a', '▁man', '▁can', '▁spend', '.'])
Time is the most valuable thing a man can spend.
```

#### [Moses tokenizer](https://github.com/lyeoni/prenlp/blob/master/prenlp/tokenizer/tokenizer.py)
```python
>>> from prenlp.tokenizer import NLTKMosesTokenizer
>>> tokenizer = NLTKMosesTokenizer()
>>> tokenizer('Time is the most valuable thing a man can spend.')
['Time', 'is', 'the', 'most', 'valuable', 'thing', 'a', 'man', 'can', 'spend', '.']
```

#### Comparisons with tokenizers on IMDb
The figure below shows the classification accuracy obtained with various tokenizers.
- Code: [NLTKMosesTokenizer](https://github.com/lyeoni/prenlp/blob/master/examples/fasttext_imdb.py), [SentencePiece](https://github.com/lyeoni/prenlp/blob/master/examples/fasttext_imdb_sentencepiece.py)
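
Assuming the linked scripts feed tokenized reviews to a fastText classifier (as their file names suggest), the data-preparation step could look roughly like the sketch below. It formats (text, label) pairs in fastText's supervised input format, where each line starts with `__label__<label>`. The `to_fasttext_line` helper and the whitespace tokenizer are hypothetical stand-ins for illustration and are not part of prenlp; any tokenizer that returns a list of strings could be passed in instead.

```python
from typing import Callable, Iterable, List, Tuple


def whitespace_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: split on any whitespace."""
    return text.split()


def to_fasttext_line(
    text: str,
    label: str,
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
) -> str:
    """Format one example as '__label__<label> <space-joined tokens>'."""
    tokens = tokenize(text)
    return "__label__{} {}".format(label, " ".join(tokens))


def to_fasttext_lines(
    samples: Iterable[Tuple[str, str]],
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
) -> List[str]:
    """Convert (text, label) pairs into fastText training lines."""
    return [to_fasttext_line(text, label, tokenize) for text, label in samples]


# Toy samples in the same [text, label] shape as the IMDB example above.
samples = [
    ("A successful   top model .", "pos"),
    ("Dull and far too long", "neg"),
]
lines = to_fasttext_lines(samples)
for line in lines:
    print(line)
```

Joining tokens with single spaces keeps each example on one line, which is what a line-oriented training file needs; swapping in a different tokenizer changes only the `tokenize` argument.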
#### Comparisons with tokenizers on NSMC (Korean IMDb)
The figure below shows the classification accuracy obtained with various tokenizers.
- Code: [Mecab](https://github.com/lyeoni/prenlp/blob/master/examples/fasttext_nsmc.py), [SentencePiece](https://github.com/lyeoni/prenlp/blob/master/examples/fasttext_nsmc_sentencepiece.py)
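
For reference, classification accuracy is the fraction of examples whose predicted label matches the gold label. The sketch below computes it for two tokenizer setups; the `accuracy` helper, the tokenizer names, and the label lists are made-up placeholders for illustration, not results from the linked scripts.

```python
from typing import Dict, List, Sequence


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Return the fraction of predictions that equal the gold labels."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Toy gold labels and toy predictions (placeholders, not measured results).
gold: List[str] = ["pos", "neg", "pos", "neg"]
predictions: Dict[str, List[str]] = {
    "tokenizer_a": ["pos", "neg", "neg", "neg"],
    "tokenizer_b": ["pos", "neg", "pos", "neg"],
}

scores = {name: accuracy(pred, gold) for name, pred in predictions.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Comparing tokenizers this way is only meaningful when every setup is evaluated on the same test split with the same gold labels.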
## Author
- Hoyeon Lee @lyeoni
- email : [email protected]
- facebook : https://www.facebook.com/lyeoni.f