https://github.com/voidful/nlp2

⚙️Tool for NLP - handle file and text
https://github.com/voidful/nlp2
Last synced: 3 months ago
JSON representation
⚙️Tool for NLP - handle file and text
Host: GitHub
URL: https://github.com/voidful/nlp2
Owner: voidful
License: gpl-3.0
Created: 2018-02-13T07:14:44.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2025-02-16T15:43:27.000Z (over 1 year ago)
Last Synced: 2025-08-18T21:04:29.587Z (10 months ago)
Language: Python
Homepage: https://pypi.org/project/nlp2/
Size: 279 KB
Stars: 15
Watchers: 2
Forks: 7
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

README

          # 🔨 nlp2 🔧

Tools for NLP using Python

This repertory used to handle file io and string cleaning/parsing



    

        

    

    

        

    

    

        

    

    

        

    

    

        

    

    

      

    



## Usage

Install:

```

pip install nlp2

```

Before using :

```

from nlp2 import *

```

# Features

* [File Handling](#file)

* [Text cleaning/parsing](#text)

* [Random  Utility](#random)

* [Vectorize](#vectorize)

File Handling


### get_folders_from_dir(path)

Arguments

- `path(String)` : getting all folders under this path (string)

Returns

- `path(String)(generator)` : path of folders under arguments path Examples

```

for i in get_folders_from_dir('./corpus/')

    print(i)

'./corpus/kdd'

'./corpus/nycd'

```

### get_files_from_dir(path)

Arguments

- `path(String)` : getting all files under this path (string)

Returns

- `path(String)(generator)` : path of files under arguments path Examples

```

for i in get_files_from_dir('./data/')

    print(i)

'./data/kdd.txt'

'./data/nycd.txt'

```

### read_dir_files_yield_lines(path)

Arguments

- `path(String)` : getting all files line by lines under this path (string)

Returns

- `line(String)(generator)` : files line under arguments path  

  Examples

```

for i in read_dir_files_into_lines('./data/')

    print(i)

'file1 sent1'

'file1 sent2'

...

'file2 sent1'

...

```

### read_dir_files_into_lines(path)

Arguments

- `path(String)` : getting all files line by lines under this path (string)

Returns

- `line(String)(generator)` : files line under arguments path  

  Examples

```

i = read_dir_files_into_lines('./data/')

print(i)

['file1 sent1','file1 sent2'...'file2 sent1'...]

```

### read_files_yield_lines(path)

Arguments

- `path(String)` : getting content in input file path (string)

Returns

- `path(String)(generator)` : file line under arguments path  

  Examples

```

for i in read_dir_files_into_lines('./data/kdd.txt')

    print(i)

'sent1'

'sent2'

...

```

### read_files_into_lines(path)

Arguments

- `path(String)` : getting content in input file path (string)

Returns

- `path(String)(generator)` : file line under arguments path  

  Examples

```

i = read_dir_files_into_lines('./data/kdd.txt')

print(i)

['sent1','sent2'...]

```

### create_new_dir_always(dirPath)

it will replace old dir if exist,or create a new one  

Arguments

- `dirPath(String)` : dir location  

  Examples

```

create_new_dir_always('./data/')

```

### get_dir_with_notexist_create(dirPath):

it will create a new dir if not exist  

Arguments

- `dirPath(String)` : dir location that you want to make sure

Returns

- `path(String)` : dir location with surely exist Examples

```

i = get_dir_with_notexist_create('./data/kdd')

print(i)

'./data/kdd'

```

### is_file_exist(path)

Arguments

- `path(String)` : file location

Returns

- `result(Boolean)` : file exist or not,true will be exist Examples

```

i = is_file_exist('./data/kdd.txt')

print(i)

true

```

### is_dir_exist(file_dir)

Arguments

- `path(String)` : dir location

Returns

- `result(Boolean)` : dir exist or not,true will be exist Examples

```

i = is_dir_exist('./data/kdd')

print(i)

false

```

### download_file(url,save_dir)

Arguments

- `url;(String)` : download link

- `save_dir;(String)` : save location    

  Returns

- `result(string)` : file downloaded location  

  Examples

```

i = download_file('https://raw.githubusercontent.com/voidful/voidful_blog/master/assets/post_src/nninmath_3/img1','./data/')

print(i)

./data/img1

```

### read_csv(filepath, generator=False)

Arguments

- `filepath(String)` : csv file path

- `list` : csv rows

```

i = read_csv('./data/kdd.csv')

print(i)

"["sent","hi"]"

```

### write_csv(csv_rows, loc)

Arguments

- `csv_rows(list)` : list of csv rows

- `loc(String)` : write location/ file path Returns

```

i = write_csv(["sent","hi"],'./data/kdd.csv')

```

### read_json(filepath)

Arguments

- `filepath(String)` : json file path

Returns

- `json` : json object

```

i = read_json('./data/kdd.json')

print(i)

"{"sent":"hi"}"

```

### write_json(json_str, loc)

Arguments

- `json_str(String)` : json context in string

- `loc(String)` : write location/ file path Returns

```

i = write_json("{"sent":"hi"}",'./data/kdd.json')

print(i)

"'./data/kdd.json'"

```

Text cleaning/parsing


### clean_httplink(string)

remove http link in context  

Arguments

- `string(String)` : a string may contain http link

Returns

- `result(String)` : string without any http link

Examples

```

y = remove_httplink("http://news.IN1802020028.htm 今天天氣http://news.we028.晴朗"))

print(y)

今天天氣 晴朗

```

### clean_htmlelement(string)

remove html element in context  

Arguments

- `string(String)` : a string may contain html element

Returns

- `result(String)` : string without any html element

Examples

```

y = clean_htmlelement("
Phraseg - 一言：新詞發現工具包")

print(y)

Phraseg - 一言：新詞發現工具包

```

### clean_unused_tag(string)

remove unused tag in context  

Arguments

- `string(String)` : a string may contain unused tag

Returns

- `result(String)` : string without any unused tag

Examples

```

y = clean_unused_tag("[quote]
\n無聊得過此帖？！:smile_42: [/quote]
\n
\n
\n認同。
\n
\n改洋名，只是一個字號。"))

print(y)

無聊得過此帖？！    

 

  

認同。

改洋名，只是一個字號。

```

### clean_all(string)

apply all clean method to clean context    

clean_unused_tag / clean_htmlelement / clean_httplink  

Arguments

- `string(String)` : a string may contain some garbage

Returns

- `result(String)` : clean string

Examples

```

y = clean_all("[i]234282[/i] 
Phraseg - 一言：新詞發現工具包http://news.IN1802020028.htm今天天氣http://news.we028.晴朗"))

print(y)

Phraseg - 一言：新詞發現工具包 今天天氣 晴朗

```

### split_lines_by_punc(lines)

make lines in array form into sentences array  

it split line base on any punctuation  

Arguments

- `lines(String Array)` : lines array

Returns

- `sentences(String Array)` : split all line base on punctuations  

  Examples

```

y = split_lines_by_punc(["你好啊.hello，me"]))

print(y)

['你好啊', 'hello', 'me']

```

### split_sentence_to_ngram(sentence)

it will split sentence into n-grams as many it can

##### be careful with sentence length,long sentence will have worse performance

Arguments

- `sentence(String)` : a string with no punctuation

Returns

- `ngrams(String Array)` : ngrams array

Examples

```

split_sentence_to_ngram("加州旅館")

['加','加州',"加州旅","加州旅館","州","州旅","州旅館","旅","旅館","館"]

```

### split_sentence_to_ngram_in_part(sentence)

it will split sentence into n-grams with diff start point as many it can

##### be careful with sentence length,long sentence will have worse performance

Arguments

- `sentence(String)` : a string with no punctuation

Returns

- `ngrams(Array)` : 2D array with diff start in ngram

Examples

```

split_sentence_to_ngram_in_part("加州旅館")

[['加','加州',"加州旅","加州旅館"],["州","州旅","州旅館"],["旅","旅館"],["館"]]

```

### split_text_in_all_ways(sentence)

it will try to find all possible segments way to split sentence  

Arguments

- `sentence(String)` : input sentence

Returns

- `seg list(String Array)` : all segments in a array

Examples

```

split_text_in_all_ways("加州旅館")

['加 州 旅 館', '加 州 旅館', '加 州旅 館', '加 州旅館', '加州 旅館', '加州旅 館', '加州旅館']

```

### split_sentence_to_array(sentence,merge_non_eng=False)

use to split sentences in different kind of language Arguments

- `sentence(String)` : input sentence

- `merge_non_eng(boolean,optional)` : split non english in char or not

Returns

- `segment array(String Array)` : word array

```

split_sentence_to_array('你好 are  u 可以',merge_non_eng = True)

['你好', 'are', 'u', '可以']

split_sentence_to_array('你好 are  u 可以')

['你', '好', 'are', 'u', '可', '以']

```

### join_words_to_sentence(words_array):

Arguments

- `words_array(String Array)` : input array

Returns

- `sentence(String)` : output sentence Examples

```

join_words_to_sentence(['你好', 'are', "可以"])

你好are可以

```

### passage_into_chunk(passage, chunk_size):

split a passage in particular size  

if part of a sentence excite chunk size, it still put hole sentence into it  

Arguments

- `passage(String)` : input passage

- `num_of_paragraphs(int)` : num of character in one chunk

Returns

- `chunk array(String Array)` : passage in chunk size Examples

```

passage_into_chunk("xxxxxxxx\noo\nyyzz\ngggggg\nkkkk\n",10)

['xxxxxxxx\noo\n', 'yyzz\ngggggg\n']

```

### is_all_english(text)

Arguments

- `text(String)` : input text Returns

- `result(Boolean)` : whether the text is all English or not Examples

```

is_all_english("1SGD")

is_all_english("1SG哦")

True

False

```

### is_contain_number(text)

Arguments

- `text(String)` : input text

Returns

- `result(Boolean)` : whether the text contain number or not Examples

```

is_contain_number("1SGD")

is_contain_number("SG哦")

True

False

```

### is_contain_english(text)

Arguments

- `text(String)` : input text  

  Returns

- `result(Boolean)` : whether the text contain english or not Examples

```

is_contain_english("1SGD")

is_contain_english("123哦")

True

False

```

### is_list_contain_string(text)

Arguments

- `str(String)` : input text

- `list(String list)` : input string    

  Returns

- `result(Boolean)` : whether the text is a part of list item  

  Examples

```

is_list_contain_string("a", ['a', 'dcd'])

is_list_contain_string("a", ['abcd', 'dcd'])

is_list_contain_string("a", ['bdc', 'dcd'])

True

True

False

```

### full2half(text)

Arguments

- `string(String)` : input string which needs turn to half

Returns

- `(String)` : a half-string

Examples

```

full2half("，,")

,,

```

### half2full(text)

Arguments

- `text(String)` : input string which needs turn to full

Returns

- `(String)` : a full-string Examples

```

half2full("，,")

，，

```

Vectorize


Vectorize implemented following paper ：  

Baseline Needs More Love:On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms

### doc2vec_aver(pretrained_emb, emb_size, context)

average pooling    

Arguments

- `pretrained_emb(object)` : pre-trained word embedding that able to get vector in this

  form : ``pretrained_emb['word']``

- `emb_size(int)` : size of pre-trained word embedding

- `context(list)` : input doc in list - each item of list must able to gain vector in pretrained_emb

  like : ``pretrained_emb[context[0]]``

Returns

- `document vector(list)` : vectorized context

Examples

```python 

from gensim.models import Word2Vec

pretrain_wordvec = gensim.models.KeyedVectors.load_word2vec_format('wiki.vec', encoding='utf-8')

size = pretrain_wordvec.vector_size

context = "測試文本哈哈哈"

nlp2.doc2vec_aver(pretrain_wordvec, size, jieba.lcut(context))

```

### doc2vec_max(pretrained_emb, emb_size, context)

max pooling in each dim   

Arguments

- `pretrained_emb(object)` : pre-trained word embedding that able to get vector in this

  form : ``pretrained_emb['word']``

- `emb_size(int)` : size of pre-trained word embedding

- `context(list)` : input doc in list - each item of list must able to gain vector in pretrained_emb

  like : ``pretrained_emb[context[0]]``

Returns

- `document vector(list)` : vectorized context Examples

```python 

from gensim.models import Word2Vec

pretrain_wordvec = gensim.models.KeyedVectors.load_word2vec_format('wiki.vec', encoding='utf-8')

size = pretrain_wordvec.vector_size

context = "測試文本哈哈哈"

nlp2.doc2vec_max(pretrain_wordvec, size, jieba.lcut(context))

```

### doc2vec_concat(pretrained_emb, emb_size, context)

concat average pooling and max pooling result  

Arguments

- `pretrained_emb(object)` : pre-trained word embedding that able to get vector in this

  form : ``pretrained_emb['word']``

- `emb_size(int)` : size of pre-trained word embedding

- `context(list)` : input doc in list - each item of list must able to gain vector in pretrained_emb

  like : ``pretrained_emb[context[0]]``

Returns

- `document vector(list)` : vectorized context Examples

```python 

from gensim.models import Word2Vec

pretrain_wordvec = gensim.models.KeyedVectors.load_word2vec_format('wiki.vec', encoding='utf-8')

size = pretrain_wordvec.vector_size

context = "測試文本哈哈哈"

nlp2.doc2vec_concat(pretrain_wordvec, size, jieba.lcut(context))

```

### doc2vec_hier(pretrained_emb, emb_size, context, windows)

average pooling in sliding windows then max pooling   

Arguments

- `pretrained_emb(object)` : pre-trained word embedding that able to get vector in this

  form : ``pretrained_emb['word']``

- `emb_size(int)` : size of pre-trained word embedding

- `context(list)` : input doc in list - each item of list must able to gain vector in pretrained_emb

  like : ``pretrained_emb[context[0]]``

- `windows(int)` : size of sliding windows in array

Returns

- `document vector(list)` : vectorized context Examples

```python 

from gensim.models import Word2Vec

pretrain_wordvec = gensim.models.KeyedVectors.load_word2vec_format('wiki.vec', encoding='utf-8')

size = pretrain_wordvec.vector_size

context = "測試文本哈哈哈"

nlp2.doc2vec_hier(pretrain_wordvec, size, jieba.lcut(context))

```

### cosine_similarity(vector 1, vector 2)

cal cosine similarity between two vector Arguments

- `vector(list)` : vector

Returns

- `cos similarity(float)` : similarity of two vector Examples

```

from gensim.models import Word2Vec

pretrain_wordvec = gensim.models.KeyedVectors.load_word2vec_format('wiki.vec', encoding='utf-8')

size = pretrain_wordvec.vector_size

input1 = nlp2.doc2vec_concat(pretrain_wordvec, size, "DC")

input2 = nlp2.doc2vec_concat(pretrain_wordvec, size, "漫威")

nlp2.cosine_similarity(input1,input2)

```

Random Utility


### random_string(length)

Arguments

- `length(int)` : length with random string

Returns

- `randstr(String)` : size will be length in "0123456789ABCDEF"

  Examples

```

random_string(10)

D6857CE0F4

```

### random_string_with_timestamp(length)

Arguments

- `length(int)` : length with random string

Returns

- `randstr(String)` : size will be length + timestamp length(10)

  Examples

```

random_string_with_timestamp(1)

1435474326D

```

### random_value_in_array_form(array)

random value with range in array form  

int,float : [min,max]  

string : [candidate1,candidate2...]

Arguments

- `range(array)` : range in array form

Returns

- `random result(depend on input)` : a random value under input condition Examples

```

# for string

y = random_value_in_array_form(["SGD","ADAM","XDA"])

print(y)

'ADAM'

# for int

y = random_value_in_array_form([1,12])

print(y)

4

# for float

y = random_value_in_array_form([0.01,1.00])

print(y)

0.34

```
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/voidful/nlp2

Awesome Lists containing this project

README

File Handling

Text cleaning/parsing

Vectorize

Random Utility