{"id":16021935,"url":"https://github.com/kemingy/plane","last_synced_at":"2025-03-17T16:30:44.243Z","repository":{"id":33017593,"uuid":"123103418","full_name":"kemingy/Plane","owner":"kemingy","description":"A text processing tool including tag(HTML, URL, Email) extraction and removing, punctuation normalization, simple segmentation, and so on.","archived":false,"fork":false,"pushed_at":"2024-12-17T08:20:38.000Z","size":2969,"stargazers_count":11,"open_issues_count":1,"forks_count":2,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-02-28T01:37:32.931Z","etag":null,"topics":["chinese-nlp","data-cleaning","nlp","preprocess","regex","tokenization","tokenizer"],"latest_commit_sha":null,"homepage":"https://kemingy.github.io/Plane/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kemingy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-02-27T09:13:34.000Z","updated_at":"2022-10-24T05:10:06.000Z","dependencies_parsed_at":"2022-08-07T19:30:17.355Z","dependency_job_id":null,"html_url":"https://github.com/kemingy/Plane","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kemingy%2FPlane","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kemingy%2FPlane/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kemingy%2FPlane/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kemingy%2FPlane/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kem
ingy","download_url":"https://codeload.github.com/kemingy/Plane/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243871330,"owners_count":20361330,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chinese-nlp","data-cleaning","nlp","preprocess","regex","tokenization","tokenizer"],"created_at":"2024-10-08T18:06:30.564Z","updated_at":"2025-03-17T16:30:43.930Z","avatar_url":"https://github.com/kemingy.png","language":"Python","readme":"# Plane\n\n[![GitHub Actions](https://github.com/kemingy/plane/workflows/Python%20package/badge.svg)](https://github.com/kemingy/plane/actions)\n[![pypi](https://img.shields.io/pypi/v/plane.svg)](https://pypi.python.org/pypi/plane)\n[![versions](https://img.shields.io/pypi/pyversions/plane.svg)](https://github.com/kemingy/plane)\n[![Python document](https://github.com/kemingy/plane/workflows/Python%20document/badge.svg)](https://kemingy.github.io/plane/)\n\n\u003e **Plane** is a tool for shaping wood using muscle power to force the cutting blade over the wood surface.  \n\u003e *from [Wikipedia](https://en.wikipedia.org/wiki/Plane_(tool))*\n\n![plane(tool) from wikipedia](https://upload.wikimedia.org/wikipedia/commons/e/e3/Kanna2.gif)\n\nThis package is used for extracting or replacing specific parts from text, like URL, Email, HTML tags, telephone numbers and so on. 
It also supports punctuation normalization and removal.\n\nSee the full [documentation](https://kemingy.github.io/Plane/).\n\n## Install\n\nPython **3.x** only.\n\n### pip\n\n```sh\npip install plane\n```\n\n### Install from source\n\n```sh\npython setup.py install\n```\n\n## Features\n\n* no other dependencies\n* built-in regex patterns: `plane.pattern.Regex`\n* custom regex patterns\n* pattern combination\n* extract, replace patterns\n* segment sentence\n* chain function calls: `plane.plane.Plane`\n* pipeline: `plane.Pipeline`\n\n## Usage\n\n### Quick start\n\nUse regex to `extract` or `replace`:\n\n```python\nfrom plane import EMAIL, extract, replace\ntext = 'fake@no.com \u0026 fakefake@nothing.com'\n\nemails = extract(text, EMAIL) # this returns a generator object\nfor e in emails:\n    print(e)\n\n\u003e\u003e\u003e Token(name='Email', value='fake@no.com', start=0, end=11)\n\u003e\u003e\u003e Token(name='Email', value='fakefake@nothing.com', start=14, end=34)\n\nprint(EMAIL)\n\n\u003e\u003e\u003e Regex(name='Email', pattern='([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\\\.[a-zA-Z0-9-]+)', repl='\u003cEmail\u003e')\n\nreplace(text, EMAIL) # replace(text, Regex, repl), if repl is not provided, Regex.repl will be used\n\n\u003e\u003e\u003e '\u003cEmail\u003e \u0026 \u003cEmail\u003e'\n\nreplace(text, EMAIL, '')\n\n\u003e\u003e\u003e ' \u0026 '\n```\n\n### pattern\n\n`Regex` is a namedtuple with 3 items:\n\n* `name`\n* `pattern`: regular expression\n* `repl`: replacement tag, substituted for each match when using the `replace` function\n\n```python\n# create new pattern\nfrom plane import build_new_regex\ncustom_regex = build_new_regex('my_regex', regex=r'(\\d{4})', repl='\u003cmy-replacement-tag\u003e')\n```\n\nYou can also build a new pattern by combining default patterns.\n\n**Attention**: combination should only be used for language-range patterns.\n\n```python\nfrom plane import extract, build_new_regex, CHINESE_WORDS\nASCII = build_new_regex('ascii', regex=r'[a-zA-Z0-9]+', repl=' ')\nWORDS = 
ASCII + CHINESE_WORDS\nprint(WORDS)\n\n\u003e\u003e\u003e Regex(name='ascii_Chinese_words', pattern='[a-zA-Z0-9]+|[\\\\U00004E00-\\\\U00009FFF\\\\U00003400-\\\\U00004DBF\\\\U00020000-\\\\U0002A6DF\\\\U0002A700-\\\\U0002B73F\\\\U0002B740-\\\\U0002B81F\\\\U0002B820-\\\\U0002CEAF\\\\U0002CEB0-\\\\U0002EBEF]+', repl=' ')\n\ntext = \"自然语言处理太难了！who can help me? (╯▔🔺▔)╯\"\nprint(' '.join([t.value for t in list(extract(text, WORDS))]))\n\n\u003e\u003e\u003e \"自然语言处理太难了 who can help me\"\n\nfrom plane import CHINESE, ENGLISH, NUMBER\nCN_EN_NUM = sum([CHINESE, ENGLISH, NUMBER])\ntext = \"佛是虚名，道亦妄立。एवं मया श्रुतम्। 1999 is not the end of the world. \"\nprint(' '.join([t.value for t in extract(text, CN_EN_NUM)]))\n\n\u003e\u003e\u003e \"佛是虚名，道亦妄立。 1999 is not the end of the world.\"\n```\n\nDefault Regex: [Details](https://github.com/Momingcoder/Plane/blob/master/plane/pattern.py)\n\n* `URL`: only ASCII\n* `EMAIL`: local-part@domain\n* `TELEPHONE`: like xxx-xxxx-xxxx\n* `SPACE`: ` `, `\\t`, `\\n`, `\\r`, `\\f`, `\\v`\n* `HTML`: HTML tags, script part and CSS part\n* `ASCII_WORD`: English words, numbers, `\u003ctag\u003e` and so on\n* `CHINESE`: all Chinese characters (Han characters and punctuation only)\n* `CJK`: all Chinese, Japanese, Korean (CJK) characters and punctuation\n* `THAI`: all Thai characters and punctuation\n* `VIETNAMESE`: all Vietnamese characters and punctuation\n* `ENGLISH`: all English characters and punctuation\n* `NUMBER`: 0-9\n\nRegex name | replace\n-----------|---------\nURL        | `'\u003cURL\u003e'`\nEMAIL      | `'\u003cEmail\u003e'`\nTELEPHONE  | `'\u003cTelephone\u003e'`\nSPACE      | `' '`\nHTML       | `' '`\nASCII_WORD | `' '`\nCHINESE    | `' '`\nCJK        | `' '`\n\n\n### segment\n\n`segment` splits a sentence into tokens: English words and numbers such as 'PS4' are kept whole, while other text such as Chinese '中文' is split into single characters `['中', '文']`.\n\n```python\nfrom plane import segment\nsegment('你看起来guaiguai的。\u003cEOS\u003e')\n\u003e\u003e\u003e ['你', '看', '起', '来', 
'guaiguai', '的', '。', '\u003cEOS\u003e']\n```\n\n### punctuation\n\n`punc.remove` replaces all Unicode punctuation with `' '`, or with the string passed as the `repl` parameter. `punc.normalize` converts some Unicode punctuation to the corresponding English punctuation.\n\n**Attention**: '+', '^', '$', '~' and some other characters are not treated as punctuation.\n\n```python\nfrom plane import punc\n\ntext = 'Hello world!'\npunc.remove(text)\n\n\u003e\u003e\u003e 'Hello world '\n\n# replace punctuation with special string\npunc.remove(text, '\u003cP\u003e')\n\n\u003e\u003e\u003e 'Hello world\u003cP\u003e'\n\n# normalize punctuations\npunc.normalize('你读过那本《边城》吗？什么编程？！人生苦短，我用 Python。')\n\n\u003e\u003e\u003e '你读过那本(边城)吗?什么编程?!人生苦短,我用 Python.'\n```\n\n### Chain function\n\n`Plane` provides `extract`, `replace`, `segment`, `punc.remove` and `punc.normalize` as methods that can be chained. Since `segment` returns a list, it can only be called at the end of the chain.\n\n`Plane.text` stores the processed text and `Plane.values` stores the extracted values.\n\n```python\nfrom plane import Plane\nfrom plane.pattern import EMAIL\n\np = Plane()\np.update('My email is my@email.com.').replace(EMAIL, '').text # update() will init Plane.text and Plane.values\n\n\u003e\u003e\u003e 'My email is .'\n\np.update('My email is my@email.com.').replace(EMAIL).segment()\n\n\u003e\u003e\u003e ['My', 'email', 'is', '\u003cEmail\u003e', '.']\n\np.update('My email is my@email.com.').extract(EMAIL).values\n\n\u003e\u003e\u003e [Token(name='Email', value='my@email.com', start=12, end=24)]\n```\n\n### Pipeline\n\nYou can use `Pipeline` if you like. 
\n\n`segment` and `extract` can only appear at the end of the pipeline.\n\n```python\nfrom plane import Pipeline, replace, segment\nfrom plane.pattern import URL\n\npipe = Pipeline()\npipe.add(replace, URL, '')\npipe.add(segment)\npipe('http://www.guokr.com is online.')\n\n\u003e\u003e\u003e ['is', 'online', '.']\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkemingy%2Fplane","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkemingy%2Fplane","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkemingy%2Fplane/lists"}