{"id":23725974,"url":"https://github.com/voidful/nlp2","last_synced_at":"2026-03-16T15:34:02.041Z","repository":{"id":57446314,"uuid":"121350850","full_name":"voidful/nlp2","owner":"voidful","description":"⚙️Tool for NLP - handle file and text","archived":false,"fork":false,"pushed_at":"2025-02-16T15:43:27.000Z","size":286,"stargazers_count":15,"open_issues_count":0,"forks_count":7,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-08-18T21:04:29.587Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://pypi.org/project/nlp2/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/voidful.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-02-13T07:14:44.000Z","updated_at":"2025-04-09T10:12:40.000Z","dependencies_parsed_at":"2024-06-21T02:11:45.994Z","dependency_job_id":null,"html_url":"https://github.com/voidful/nlp2","commit_stats":{"total_commits":107,"total_committers":4,"mean_commits":26.75,"dds":"0.28037383177570097","last_synced_commit":"ed492172c8c4c8897832039b1d70023d36f24c2e"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/voidful/nlp2","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/voidful%2Fnlp2","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/voidful%2Fnlp2/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/voidful%2Fnlp2/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/voidful%2Fnlp2/manifests","o
wner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/voidful","download_url":"https://codeload.github.com/voidful/nlp2/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/voidful%2Fnlp2/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273541897,"owners_count":25124056,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-04T02:00:08.968Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-31T00:18:03.149Z","updated_at":"2026-03-16T15:34:02.023Z","avatar_url":"https://github.com/voidful.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🔨 nlp2 🔧\n\nTools for NLP using Python\n\nThis repository handles file I/O and string cleaning/parsing\n\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://pypi.org/project/nlp2/\"\u003e\n        \u003cimg alt=\"PyPI\" src=\"https://img.shields.io/pypi/v/nlp2\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://github.com/voidful/nlp2\"\u003e\n        \u003cimg alt=\"Download\" src=\"https://img.shields.io/pypi/dm/nlp2\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://github.com/voidful/nlp2\"\u003e\n        \u003cimg alt=\"Build\" src=\"https://img.shields.io/github/workflow/status/voidful/nlp2/Python package\"\u003e\n    \u003c/a\u003e\n    \u003ca 
href=\"https://github.com/voidful/nlp2\"\u003e\n        \u003cimg alt=\"Last Commit\" src=\"https://img.shields.io/github/last-commit/voidful/nlp2\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://www.codefactor.io/repository/github/voidful/nlp2/overview/master\"\u003e\n        \u003cimg src=\"https://www.codefactor.io/repository/github/voidful/nlp2/badge/master\" alt=\"CodeFactor\" /\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://codecov.io/gh/voidful/nlp2\"\u003e\n      \u003cimg src=\"https://codecov.io/gh/voidful/nlp2/branch/master/graph/badge.svg\" /\u003e\n    \u003c/a\u003e\n\u003c/p\u003e\n\n## Usage\n\nInstall:\n\n```\npip install nlp2\n```\n\nBefore using:\n\n```\nfrom nlp2 import *\n```\n\n# Features\n\n* [File Handling](#file)\n* [Text cleaning/parsing](#text)\n* [Random Utility](#random)\n* [Vectorize](#vectorize)\n\n\u003ch2 id=\"file\"\u003eFile Handling\u003c/h2\u003e\n\n### get_folders_from_dir(path)\n\nArguments\n\n- `path(String)` : directory to search; all folders under this path are returned\n\nReturns\n\n- `path(String)(generator)` : paths of the folders under the given path\n\nExamples\n\n```\nfor i in get_folders_from_dir('./corpus/'):\n    print(i)\n\n'./corpus/kdd'\n'./corpus/nycd'\n```\n\n### get_files_from_dir(path)\n\nArguments\n\n- `path(String)` : directory to search; all files under this path are returned\n\nReturns\n\n- `path(String)(generator)` : paths of the files under the given path\n\nExamples\n\n```\nfor i in get_files_from_dir('./data/'):\n    print(i)\n\n'./data/kdd.txt'\n'./data/nycd.txt'\n```\n\n### read_dir_files_yield_lines(path)\n\nArguments\n\n- `path(String)` : directory to read; every file under this path is read line by line\n\nReturns\n\n- `line(String)(generator)` : lines of the files under the given path\n\nExamples\n\n```\nfor i in read_dir_files_yield_lines('./data/'):\n    print(i)\n\n'file1 sent1'\n'file1 sent2'\n...\n'file2 sent1'\n...\n```\n\n### read_dir_files_into_lines(path)\n\nArguments\n\n- `path(String)` : read all files line by line under this path 
(string)\n\nReturns\n\n- `lines(String list)` : lines of the files under the given path\n\nExamples\n\n```\ni = read_dir_files_into_lines('./data/')\nprint(i)\n\n['file1 sent1','file1 sent2'...'file2 sent1'...]\n```\n\n### read_files_yield_lines(path)\n\nArguments\n\n- `path(String)` : path of the input file to read\n\nReturns\n\n- `line(String)(generator)` : lines of the file at the given path\n\nExamples\n\n```\nfor i in read_files_yield_lines('./data/kdd.txt'):\n    print(i)\n\n'sent1'\n'sent2'\n...\n```\n\n### read_files_into_lines(path)\n\nArguments\n\n- `path(String)` : path of the input file to read\n\nReturns\n\n- `lines(String list)` : lines of the file at the given path\n\nExamples\n\n```\ni = read_files_into_lines('./data/kdd.txt')\nprint(i)\n\n['sent1','sent2'...]\n```\n\n### create_new_dir_always(dirPath)\n\nReplaces the old directory if it exists, or creates a new one.\n\nArguments\n\n- `dirPath(String)` : directory location\n\nExamples\n\n```\ncreate_new_dir_always('./data/')\n```\n\n### get_dir_with_notexist_create(dirPath)\n\nCreates the directory if it does not exist.\n\nArguments\n\n- `dirPath(String)` : directory location that must exist\n\nReturns\n\n- `path(String)` : directory location, now guaranteed to exist\n\nExamples\n\n```\ni = get_dir_with_notexist_create('./data/kdd')\nprint(i)\n\n'./data/kdd'\n```\n\n### is_file_exist(path)\n\nArguments\n\n- `path(String)` : file location\n\nReturns\n\n- `result(Boolean)` : whether the file exists; `True` if it does\n\nExamples\n\n```\ni = is_file_exist('./data/kdd.txt')\nprint(i)\n\nTrue\n```\n\n### is_dir_exist(file_dir)\n\nArguments\n\n- `file_dir(String)` : directory location\n\nReturns\n\n- `result(Boolean)` : whether the directory exists; `True` if it does\n\nExamples\n\n```\ni = is_dir_exist('./data/kdd')\nprint(i)\n\nFalse\n```\n\n### download_file(url,save_dir)\n\nArguments\n\n- `url(String)` : download link\n- `save_dir(String)` : save location\n\nReturns\n\n- `result(string)` : file downloaded location  \n  
Examples\n\n```\ni = download_file('https://raw.githubusercontent.com/voidful/voidful_blog/master/assets/post_src/nninmath_3/img1','./data/')\nprint(i)\n\n./data/img1\n```\n\n### read_csv(filepath, generator=False)\n\nArguments\n\n- `filepath(String)` : csv file path\n\nReturns\n\n- `list` : csv rows\n\nExamples\n\n```\ni = read_csv('./data/kdd.csv')\nprint(i)\n\n\"[\"sent\",\"hi\"]\"\n```\n\n### write_csv(csv_rows, loc)\n\nArguments\n\n- `csv_rows(list)` : list of csv rows\n- `loc(String)` : write location / file path\n\nReturns\n\n```\ni = write_csv([\"sent\",\"hi\"],'./data/kdd.csv')\n\n```\n\n### read_json(filepath)\n\nArguments\n\n- `filepath(String)` : json file path\n\nReturns\n\n- `json` : json object\n\nExamples\n\n```\ni = read_json('./data/kdd.json')\nprint(i)\n\n\"{\"sent\":\"hi\"}\"\n```\n\n### write_json(json_str, loc)\n\nArguments\n\n- `json_str(String)` : json content as a string\n- `loc(String)` : write location / file path\n\nReturns\n\n```\ni = write_json('{\"sent\":\"hi\"}','./data/kdd.json')\nprint(i)\n\n\"'./data/kdd.json'\"\n```\n\n\u003ch2 id=\"text\"\u003eText cleaning/parsing\u003c/h2\u003e\n\n### clean_httplink(string)\n\nRemoves HTTP links from the text.\n\nArguments\n\n- `string(String)` : a string that may contain HTTP links\n\nReturns\n\n- `result(String)` : string without any HTTP link\n\nExamples\n\n```\ny = clean_httplink(\"http://news.IN1802020028.htm 今天天氣http://news.we028.晴朗\")\nprint(y)\n\n今天天氣 晴朗\n```\n\n### clean_htmlelement(string)\n\nRemoves HTML elements from the text.\n\nArguments\n\n- `string(String)` : a string that may contain HTML elements\n\nReturns\n\n- `result(String)` : string without any HTML element\n\nExamples\n\n```\ny = clean_htmlelement('\u003cdiv class=\"\"\u003e\u003cp\u003ePhraseg - 一言：新詞發現工具包\u003c/p\u003e\u003c/div\u003e')\nprint(y)\n\nPhraseg - 一言：新詞發現工具包\n```\n\n### clean_unused_tag(string)\n\nRemoves unused tags from the text.\n\nArguments\n\n- `string(String)` : a string that may contain unused tags\n\nReturns\n\n- `result(String)` : string without any unused 
tag\n\nExamples\n\n```\ny = clean_unused_tag(\"[quote]\u003cbr\u003e\\n無聊得過此帖？！:smile_42: [/quote]\u003cbr\u003e\\n\u003cbr\u003e\\n\u003cbr\u003e\\n認同。\u003cbr\u003e\\n\u003cbr\u003e\\n改洋名，只是一個字號。\")\nprint(y)\n\n無聊得過此帖？！    \n \n  \n認同。\n\n\n改洋名，只是一個字號。\n```\n\n### clean_all(string)\n\nApplies every clean method to the text:  \nclean_unused_tag / clean_htmlelement / clean_httplink\n\nArguments\n\n- `string(String)` : a string that may contain some garbage\n\nReturns\n\n- `result(String)` : clean string\n\nExamples\n\n```\ny = clean_all('[i]234282[/i] \u003cdiv class=\"\"\u003e\u003cp\u003ePhraseg - 一言：新詞發現工具包http://news.IN1802020028.htm今天天氣http://news.we028.晴朗\u003c/p\u003e\u003c/div\u003e')\nprint(y)\n\nPhraseg - 一言：新詞發現工具包 今天天氣 晴朗\n```\n\n### split_lines_by_punc(lines)\n\nConverts an array of lines into an array of sentences.  \nEach line is split on any punctuation.\n\nArguments\n\n- `lines(String Array)` : lines array\n\nReturns\n\n- `sentences(String Array)` : all lines split on punctuation\n\nExamples\n\n```\ny = split_lines_by_punc([\"你好啊.hello，me\"])\nprint(y)\n\n['你好啊', 'hello', 'me']\n```\n\n### split_sentence_to_ngram(sentence)\n\nSplits a sentence into as many n-grams as it can.\n\n##### Be careful with sentence length: long sentences will have worse performance\n\nArguments\n\n- `sentence(String)` : a string with no punctuation\n\nReturns\n\n- `ngrams(String Array)` : ngrams array\n\nExamples\n\n```\nsplit_sentence_to_ngram(\"加州旅館\")\n\n['加','加州',\"加州旅\",\"加州旅館\",\"州\",\"州旅\",\"州旅館\",\"旅\",\"旅館\",\"館\"]\n```\n\n### split_sentence_to_ngram_in_part(sentence)\n\nSplits a sentence into as many n-grams as it can, grouped by start point.\n\n##### Be careful with sentence length: long sentences will have worse performance\n\nArguments\n\n- `sentence(String)` : a string with no punctuation\n\nReturns\n\n- `ngrams(Array)` : 2D array with diff start in 
ngram\n\nExamples\n\n```\nsplit_sentence_to_ngram_in_part(\"加州旅館\")\n\n[['加','加州',\"加州旅\",\"加州旅館\"],[\"州\",\"州旅\",\"州旅館\"],[\"旅\",\"旅館\"],[\"館\"]]\n```\n\n### split_text_in_all_ways(sentence)\n\nTries to find every possible way to segment the sentence.\n\nArguments\n\n- `sentence(String)` : input sentence\n\nReturns\n\n- `seg list(String Array)` : all segmentations in an array\n\nExamples\n\n```\nsplit_text_in_all_ways(\"加州旅館\")\n\n['加 州 旅 館', '加 州 旅館', '加 州旅 館', '加 州旅館', '加州 旅館', '加州旅 館', '加州旅館']\n```\n\n### split_sentence_to_array(sentence,merge_non_eng=False)\n\nSplits a sentence that mixes different languages into words.\n\nArguments\n\n- `sentence(String)` : input sentence\n- `merge_non_eng(boolean,optional)` : keep non-English text merged (`True`) instead of splitting it into characters (default)\n\nReturns\n\n- `segment array(String Array)` : word array\n\nExamples\n\n```\nsplit_sentence_to_array('你好 are  u 可以',merge_non_eng = True)\n\n['你好', 'are', 'u', '可以']\n\nsplit_sentence_to_array('你好 are  u 可以')\n\n['你', '好', 'are', 'u', '可', '以']\n```\n\n### join_words_to_sentence(words_array)\n\nArguments\n\n- `words_array(String Array)` : input array\n\nReturns\n\n- `sentence(String)` : output sentence\n\nExamples\n\n```\njoin_words_to_sentence(['你好', 'are', \"可以\"])\n\n你好are可以\n```\n\n### passage_into_chunk(passage, chunk_size)\n\nSplits a passage into chunks of a given size.  \nIf part of a sentence exceeds the chunk size, the whole sentence is still put into the chunk.\n\nArguments\n\n- `passage(String)` : input passage\n- `chunk_size(int)` : number of characters in one chunk\n\nReturns\n\n- `chunk array(String Array)` : passage split into chunks\n\nExamples\n\n```\npassage_into_chunk(\"xxxxxxxx\\noo\\nyyzz\\ngggggg\\nkkkk\\n\",10)\n\n['xxxxxxxx\\noo\\n', 'yyzz\\ngggggg\\n']\n```\n\n### is_all_english(text)\n\nArguments\n\n- `text(String)` : input text\n\nReturns\n\n- `result(Boolean)` : whether the text is all English or not\n\nExamples\n\n```\nis_all_english(\"1SGD\")\nis_all_english(\"1SG哦\")\n\nTrue\nFalse\n```\n\n### is_contain_number(text)\n\nArguments\n\n- `text(String)` : 
input text\n\nReturns\n\n- `result(Boolean)` : whether the text contains a number or not\n\nExamples\n\n```\nis_contain_number(\"1SGD\")\nis_contain_number(\"SG哦\")\n\nTrue\nFalse\n```\n\n### is_contain_english(text)\n\nArguments\n\n- `text(String)` : input text\n\nReturns\n\n- `result(Boolean)` : whether the text contains English or not\n\nExamples\n\n```\nis_contain_english(\"1SGD\")\nis_contain_english(\"123哦\")\n\nTrue\nFalse\n```\n\n### is_list_contain_string(str, list)\n\nArguments\n\n- `str(String)` : input text\n- `list(String list)` : input list\n\nReturns\n\n- `result(Boolean)` : whether the text is part of any list item\n\nExamples\n\n```\nis_list_contain_string(\"a\", ['a', 'dcd'])\nis_list_contain_string(\"a\", ['abcd', 'dcd'])\nis_list_contain_string(\"a\", ['bdc', 'dcd'])\n\nTrue\nTrue\nFalse\n```\n\n### full2half(text)\n\nArguments\n\n- `text(String)` : input string to convert to half-width\n\nReturns\n\n- `(String)` : the half-width string\n\nExamples\n\n```\nfull2half(\"，,\")\n\n,,\n```\n\n### half2full(text)\n\nArguments\n\n- `text(String)` : input string to convert to full-width\n\nReturns\n\n- `(String)` : the full-width string\n\nExamples\n\n```\nhalf2full(\"，,\")\n\n，，\n```\n\n\u003ch2 id=\"vectorize\"\u003eVectorize\u003c/h2\u003e\n\nVectorization is implemented following the paper:  \nBaseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms\n\n### doc2vec_aver(pretrained_emb, emb_size, context)\n\nAverage pooling.\n\nArguments\n\n- `pretrained_emb(object)` : pre-trained word embedding that supports vector lookup in this\n  form : ``pretrained_emb['word']``\n- `emb_size(int)` : size of pre-trained word embedding\n- `context(list)` : input doc as a list - each item must have a vector in pretrained_emb,\n  like : ``pretrained_emb[context[0]]``\n\nReturns\n\n- `document vector(list)` : vectorized context\n\nExamples\n\n```python\nimport gensim\nimport jieba\nimport nlp2\npretrain_wordvec = 
gensim.models.KeyedVectors.load_word2vec_format('wiki.vec', encoding='utf-8')\nsize = pretrain_wordvec.vector_size\ncontext = \"測試文本哈哈哈\"\nnlp2.doc2vec_aver(pretrain_wordvec, size, jieba.lcut(context))\n```\n\n### doc2vec_max(pretrained_emb, emb_size, context)\n\nMax pooling in each dimension.\n\nArguments\n\n- `pretrained_emb(object)` : pre-trained word embedding that supports vector lookup in this\n  form : ``pretrained_emb['word']``\n- `emb_size(int)` : size of pre-trained word embedding\n- `context(list)` : input doc as a list - each item must have a vector in pretrained_emb,\n  like : ``pretrained_emb[context[0]]``\n\nReturns\n\n- `document vector(list)` : vectorized context\n\nExamples\n\n```python\nimport gensim\nimport jieba\nimport nlp2\npretrain_wordvec = gensim.models.KeyedVectors.load_word2vec_format('wiki.vec', encoding='utf-8')\nsize = pretrain_wordvec.vector_size\ncontext = \"測試文本哈哈哈\"\nnlp2.doc2vec_max(pretrain_wordvec, size, jieba.lcut(context))\n```\n\n### doc2vec_concat(pretrained_emb, emb_size, context)\n\nConcatenates the average pooling and max pooling results.\n\nArguments\n\n- `pretrained_emb(object)` : pre-trained word embedding that supports vector lookup in this\n  form : ``pretrained_emb['word']``\n- `emb_size(int)` : size of pre-trained word embedding\n- `context(list)` : input doc as a list - each item must have a vector in pretrained_emb,\n  like : ``pretrained_emb[context[0]]``\n\nReturns\n\n- `document vector(list)` : vectorized context\n\nExamples\n\n```python\nimport gensim\nimport jieba\nimport nlp2\npretrain_wordvec = gensim.models.KeyedVectors.load_word2vec_format('wiki.vec', encoding='utf-8')\nsize = pretrain_wordvec.vector_size\ncontext = \"測試文本哈哈哈\"\nnlp2.doc2vec_concat(pretrain_wordvec, size, jieba.lcut(context))\n```\n\n### doc2vec_hier(pretrained_emb, emb_size, context, windows)\n\nAverage pooling over sliding windows, then max pooling.\n\nArguments\n\n- `pretrained_emb(object)` : pre-trained word embedding that supports vector lookup in this\n  
form : ``pretrained_emb['word']``\n- `emb_size(int)` : size of pre-trained word embedding\n- `context(list)` : input doc as a list - each item must have a vector in pretrained_emb,\n  like : ``pretrained_emb[context[0]]``\n- `windows(int)` : size of the sliding window\n\nReturns\n\n- `document vector(list)` : vectorized context\n\nExamples\n\n```python\nimport gensim\nimport jieba\nimport nlp2\npretrain_wordvec = gensim.models.KeyedVectors.load_word2vec_format('wiki.vec', encoding='utf-8')\nsize = pretrain_wordvec.vector_size\ncontext = \"測試文本哈哈哈\"\n# the window size 2 is an illustrative value\nnlp2.doc2vec_hier(pretrain_wordvec, size, jieba.lcut(context), 2)\n```\n\n### cosine_similarity(vector1, vector2)\n\nCalculates the cosine similarity between two vectors.\n\nArguments\n\n- `vector1(list)`, `vector2(list)` : the two vectors to compare\n\nReturns\n\n- `cos similarity(float)` : similarity of the two vectors\n\nExamples\n\n```\nimport gensim\nimport nlp2\npretrain_wordvec = gensim.models.KeyedVectors.load_word2vec_format('wiki.vec', encoding='utf-8')\nsize = pretrain_wordvec.vector_size\n\ninput1 = nlp2.doc2vec_concat(pretrain_wordvec, size, \"DC\")\ninput2 = nlp2.doc2vec_concat(pretrain_wordvec, size, \"漫威\")\nnlp2.cosine_similarity(input1,input2)\n```\n\n\u003ch2 id=\"random\"\u003eRandom Utility\u003c/h2\u003e\n\n### random_string(length)\n\nArguments\n\n- `length(int)` : length of the random string\n\nReturns\n\n- `randstr(String)` : random string of the given length, drawn from \"0123456789ABCDEF\"\n\nExamples\n\n```\nrandom_string(10)\n\nD6857CE0F4\n```\n\n### random_string_with_timestamp(length)\n\nArguments\n\n- `length(int)` : length of the random string\n\nReturns\n\n- `randstr(String)` : string whose size is length + the timestamp length (10)\n\nExamples\n\n```\nrandom_string_with_timestamp(1)\n\n1435474326D\n```\n\n### random_value_in_array_form(array)\n\nReturns a random value from a range given in array form.  \nint,float : [min,max]  \nstring : [candidate1,candidate2...]\n\nArguments\n\n- `range(array)` : range in array form\n\nReturns\n\n- `random result(depend on input)` : a random value under 
input condition\n\nExamples\n\n```\n# for string\ny = random_value_in_array_form([\"SGD\",\"ADAM\",\"XDA\"])\nprint(y)\n\n'ADAM'\n\n# for int\ny = random_value_in_array_form([1,12])\nprint(y)\n\n4\n\n# for float\ny = random_value_in_array_form([0.01,1.00])\nprint(y)\n\n0.34\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvoidful%2Fnlp2","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvoidful%2Fnlp2","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvoidful%2Fnlp2/lists"}