# PreNLP
[![PyPI](https://img.shields.io/pypi/v/prenlp.svg?style=flat-square&color=important)](https://pypi.org/project/prenlp/)
[![License](https://img.shields.io/github/license/lyeoni/prenlp?style=flat-square)](https://github.com/lyeoni/prenlp/blob/master/LICENSE)
[![GitHub stars](https://img.shields.io/github/stars/lyeoni/prenlp?style=flat-square)](https://github.com/lyeoni/prenlp/stargazers)
[![GitHub forks](https://img.shields.io/github/forks/lyeoni/prenlp?style=flat-square&color=blueviolet)](https://github.com/lyeoni/prenlp/network/members)

Preprocessing Library for Natural Language Processing

## Installation
### Requirements
- Python >= 3.6
- Mecab morphological analyzer for Korean
  ```
  sh scripts/install_mecab.sh
  # macOS users only: run the commands below before running the install_mecab.sh script.
  # export MACOSX_DEPLOYMENT_TARGET=10.10
  # CFLAGS='-stdlib=libc++' pip install konlpy
  ```
- C++ build tools for fastText
  - g++ >= 4.7.2 or clang >= 3.3
  - For **Windows**, [Visual Studio C++](https://visualstudio.microsoft.com/downloads/) is recommended.

### With pip
prenlp can be installed with pip as follows:
```
pip install prenlp
```
## Usage

### Data

#### Dataset Loading

Popular datasets for NLP tasks are provided in prenlp. All datasets are stored in the `/.data` directory.
- Sentiment Analysis: IMDb, NSMC
- Language Modeling: WikiText-2, WikiText-103, WikiText-ko, NamuWiki-ko

|Dataset|Language|Articles|Sentences|Tokens|Vocab|Size|
|-|-|-|-|-|-|-|
|WikiText-2|English|720|-|2,551,843|33,278|13.3MB|
|WikiText-103|English|28,595|-|103,690,236|267,735|517.4MB|
|WikiText-ko|Korean|477,946|2,333,930|131,184,780|662,949|667MB|
|NamuWiki-ko|Korean|661,032|16,288,639|715,535,778|1,130,008|3.3GB|
|WikiText-ko+NamuWiki-ko|Korean|1,138,978|18,622,569|846,720,558|1,360,538|3.95GB|

General use cases are as follows:

##### [WikiText-2 / WikiText-103](https://github.com/lyeoni/prenlp/blob/develop/prenlp/data/dataset/language_modeling.py)
```python
>>> import prenlp
>>> wikitext2 = prenlp.data.WikiText2()
>>> len(wikitext2)
3
>>> train, valid, test = prenlp.data.WikiText2()
>>> train[0]
'= Valkyria Chronicles III ='
```

##### [IMDB](https://github.com/lyeoni/prenlp/blob/master/prenlp/data/dataset/sentiment.py)
```python
>>> imdb_train, imdb_test = prenlp.data.IMDB()
>>> imdb_train[0]
["Minor Spoilers<br /><br />Alison Parker (Cristina Raines) is a successful top model, living with the lawyer Michael Lerman (Chris Sarandon) in his apartment. She tried to commit ...", 'pos']
```
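Each sentiment example above is a `[text, label]` pair. The repository's example scripts feed such pairs to fastText, which expects one `__label__<label> <text>` line per example. A minimal sketch of that conversion, using made-up sample pairs (not the real downloaded dataset):

```python
def to_fasttext_format(dataset):
    """Convert [text, label] pairs to fastText's '__label__<label> <text>' lines."""
    lines = []
    for text, label in dataset:
        # Collapse internal whitespace so each example stays on a single line.
        flat = " ".join(text.split())
        lines.append(f"__label__{label} {flat}")
    return lines

# Hypothetical sample pairs standing in for prenlp.data.IMDB() output.
sample = [["A wonderful, touching film.", "pos"],
          ["Two hours I will never\nget back.", "neg"]]

for line in to_fasttext_format(sample):
    print(line)
```

The resulting lines can be written to a file and passed directly to fastText's supervised trainer.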
#### [Normalization](https://github.com/lyeoni/prenlp/blob/master/prenlp/data/normalizer.py)
Frequently used normalization functions for text pre-processing are provided in prenlp.
> URL, HTML tag, emoticon, email, phone number, etc.

General use cases are as follows:
```python
>>> from prenlp.data import Normalizer
>>> normalizer = Normalizer(url_repl='[URL]', tag_repl='[TAG]', emoji_repl='[EMOJI]', email_repl='[EMAIL]', tel_repl='[TEL]', image_repl='[IMG]')

>>> normalizer.normalize('Visit this link for more details: https://github.com/')
'Visit this link for more details: [URL]'

>>> normalizer.normalize('Use HTML with the desired attributes: <img src="cat.jpg" height="100" />')
'Use HTML with the desired attributes: [TAG]'

>>> normalizer.normalize('Hello 🤩, I love you 💓 !')
'Hello [EMOJI], I love you [EMOJI] !'

>>> normalizer.normalize('Contact me at lyeoni.g@gmail.com')
'Contact me at [EMAIL]'

>>> normalizer.normalize('Call +82 10-1234-5678')
'Call [TEL]'

>>> normalizer.normalize('Download our logo image, logo123.png, with transparent background.')
'Download our logo image, [IMG], with transparent background.'
```
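This style of normalization is, at its core, ordered regular-expression substitution. A self-contained sketch of the idea, with deliberately simplified stand-in patterns (not prenlp's actual ones):

```python
import re

# Simplified stand-in patterns; a production normalizer uses far more thorough ones.
PATTERNS = [
    (re.compile(r"https?://\S+"), "[URL]"),
    (re.compile(r"<[^>]+>"), "[TAG]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
]

def normalize(text):
    """Apply each (pattern, replacement) pair in order."""
    for pattern, repl in PATTERNS:
        text = pattern.sub(repl, text)
    return text

print(normalize("Visit https://github.com/ or mail lyeoni.g@gmail.com"))
# -> Visit [URL] or mail [EMAIL]
```

Pattern order matters: greedier patterns (like URLs, which can contain `@`) should run before narrower ones so a single span is not replaced twice.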
### Tokenizer
Frequently used (subword) tokenizers for text pre-processing are provided in prenlp.
> SentencePiece, NLTKMosesTokenizer, Mecab

#### [SentencePiece](https://github.com/lyeoni/prenlp/blob/master/prenlp/tokenizer/tokenizer.py)
```python
>>> from prenlp.tokenizer import SentencePiece
>>> SentencePiece.train(input='corpus.txt', model_prefix='sentencepiece', vocab_size=10000)
>>> tokenizer = SentencePiece.load('sentencepiece.model')
>>> tokenizer('Time is the most valuable thing a man can spend.')
['▁Time', '▁is', '▁the', '▁most', '▁valuable', '▁thing', '▁a', '▁man', '▁can', '▁spend', '.']
>>> tokenizer.tokenize('Time is the most valuable thing a man can spend.')
['▁Time', '▁is', '▁the', '▁most', '▁valuable', '▁thing', '▁a', '▁man', '▁can', '▁spend', '.']
>>> tokenizer.detokenize(['▁Time', '▁is', '▁the', '▁most', '▁valuable', '▁thing', '▁a', '▁man', '▁can', '▁spend', '.'])
'Time is the most valuable thing a man can spend.'
```

#### [Moses tokenizer](https://github.com/lyeoni/prenlp/blob/master/prenlp/tokenizer/tokenizer.py)
```python
>>> from prenlp.tokenizer import NLTKMosesTokenizer
>>> tokenizer = NLTKMosesTokenizer()
>>> tokenizer('Time is the most valuable thing a man can spend.')
['Time', 'is', 'the', 'most', 'valuable', 'thing', 'a', 'man', 'can', 'spend', '.']
```
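SentencePiece marks word starts with the `▁` meta-symbol, so detokenization reduces to concatenating the pieces and turning the marker back into spaces. A minimal sketch of that round trip, independent of any trained model:

```python
def detokenize(pieces):
    """Join SentencePiece pieces, mapping the '▁' word-boundary marker to spaces."""
    return "".join(pieces).replace("▁", " ").strip()

pieces = ['▁Time', '▁is', '▁the', '▁most', '▁valuable', '▁thing',
          '▁a', '▁man', '▁can', '▁spend', '.']
print(detokenize(pieces))  # -> Time is the most valuable thing a man can spend.
```

This also shows why subword tokenization is lossless for whitespace-delimited text: the boundary information lives in the pieces themselves, not in a separate alignment.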
#### Comparisons with tokenizers on IMDb
The figure below shows classification accuracy with each tokenizer.
- Code: [NLTKMosesTokenizer](https://github.com/lyeoni/prenlp/blob/master/examples/fasttext_imdb.py), [SentencePiece](https://github.com/lyeoni/prenlp/blob/master/examples/fasttext_imdb_sentencepiece.py)
<p align="center">
<img width="700" src="https://raw.githubusercontent.com/lyeoni/prenlp/master/images/tokenizer_comparison_IMDb.png" align="middle">
</p>

#### Comparisons with tokenizers on NSMC (Korean IMDb)
The figure below shows classification accuracy with each tokenizer.
- Code: [Mecab](https://github.com/lyeoni/prenlp/blob/master/examples/fasttext_nsmc.py), [SentencePiece](https://github.com/lyeoni/prenlp/blob/master/examples/fasttext_nsmc_sentencepiece.py)
<p align="center">
<img width="700" src="https://raw.githubusercontent.com/lyeoni/prenlp/master/images/tokenizer_comparison_NSMC.png" align="middle">
</p>

## Author
- Hoyeon Lee @lyeoni
- email: lyeoni.g@gmail.com
- facebook: https://www.facebook.com/lyeoni.f