{"id":13754417,"url":"https://github.com/beyondacm/Autochecker4Chinese","last_synced_at":"2025-05-09T22:32:23.988Z","repository":{"id":37664830,"uuid":"103759471","full_name":"beyondacm/Autochecker4Chinese","owner":"beyondacm","description":"中文文本错别字检测以及自动纠错 / Autochecker \u0026 autocorrecter for chinese","archived":false,"fork":false,"pushed_at":"2017-09-16T14:54:22.000Z","size":3587,"stargazers_count":289,"open_issues_count":9,"forks_count":86,"subscribers_count":14,"default_branch":"master","last_synced_at":"2024-11-16T07:33:33.577Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/beyondacm.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-09-16T14:45:31.000Z","updated_at":"2024-10-24T02:48:03.000Z","dependencies_parsed_at":"2022-08-08T21:15:35.337Z","dependency_job_id":null,"html_url":"https://github.com/beyondacm/Autochecker4Chinese","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/beyondacm%2FAutochecker4Chinese","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/beyondacm%2FAutochecker4Chinese/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/beyondacm%2FAutochecker4Chinese/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/beyondacm%2FAutochecker4Chinese/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/beyondacm","download_url":"https://codeload.github.com/beyondacm/Autochecker4Chinese/tar.gz
/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253336066,"owners_count":21892781,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T09:01:58.912Z","updated_at":"2025-05-09T22:32:18.979Z","avatar_url":"https://github.com/beyondacm.png","language":"Jupyter Notebook","funding_links":[],"categories":["Uncategorized","其他_NLP自然语言处理"],"sub_categories":["Uncategorized","其他_文本生成、文本对话"],"readme":"\n## An autochecker solution for Chinese\n\n### How to use:\n- Run in the terminal: python Autochecker4Chinese.py\n- You will get the following result: ![](./result.png)\n\n\n### 1. Make a detector\n\n- Construct a dict to detect misspelled Chinese phrases: each key is a Chinese phrase, and its value is the frequency with which that phrase appears in the corpus.\n- You can build this dict by collecting a corpus from the internet, or, more easily, by loading a dict that others have already created. Here we take the second approach and construct the dict from a file.
\n- The detector works as follows: any phrase that does not appear in this dict is flagged as a misspelled phrase.\n\n\n\n```python\n# Python 2 code: keys are UTF-8 byte strings, values are integer frequencies\ndef construct_dict( file_path ):\n\n    word_freq = {}\n    with open(file_path, \"r\") as f:\n        for line in f:\n            info = line.split()\n            word = info[0]\n            # convert to int so that frequencies compare numerically, not as strings\n            frequency = int(info[1])\n            word_freq[word] = frequency\n\n    return word_freq\n```\n\n\n```python\nFILE_PATH = \"./token_freq_pos%40350k_jieba.txt\"\nphrase_freq = construct_dict( FILE_PATH )\n```\n\n\n```python\nprint( type(phrase_freq) )\nprint( len(phrase_freq) )\n```\n\n    \u003ctype 'dict'\u003e\n    349045\n\n\n### 2. Make an autocorrector\n- To correct a misspelled phrase, we build a list of candidate corrections: every phrase that is one edit away from the misspelled phrase and also appears in the dict.\n- We sort the candidate list by how likely each candidate is to be the correct phrase, based on the following rules:\n\t- If the candidate's pinyin exactly matches the misspelled phrase's pinyin, we put the candidate in the first order; these are the most likely corrections.\n\t- Otherwise, if the pinyin of the candidate's first character matches the pinyin of the misspelled phrase's first character, we put the candidate in the second order.\n\t- Otherwise, we put the candidate in the third order.\n\n```python\nimport pinyin\n```\n\n\n```python\n# load the Chinese characters used to generate edits\n# each line of the file is stripped and appended to one unicode string\ndef load_cn_words_dict( file_path ):\n    cn_words_dict = \"\"\n    with open(file_path, \"r\") as f:\n        for word in f:\n            cn_words_dict += word.strip().decode(\"utf-8\")\n    return cn_words_dict\n```\n\n\n```python\n# generate every phrase that is one edit away from the Chinese phrase\ndef edits1(phrase, cn_words_dict):\n    \"All edits that are one edit away from `phrase`.\"\n    phrase = phrase.decode(\"utf-8\")\n    splits = [(phrase[:i], phrase[i:]) for i in range(len(phrase) + 1)]\n    deletes = [L + R[1:] for L, R in splits if R]\n    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)\u003e1]\n    replaces = [L + c + R[1:] for L, R in splits if R for c in cn_words_dict]\n    inserts = [L + c + R for L, R in splits for c in cn_words_dict]\n    return set(deletes + transposes + replaces + inserts)\n```\n\n\n```python\n# return the subset of phrases that exist in phrase_freq\ndef known(phrases): return set(phrase for phrase in phrases if phrase.encode(\"utf-8\") in phrase_freq)\n```\n\n\n```python\n# get the candidate phrases for the error phrase\n# we rank the candidate phrases according to their pinyin\n# if the candidate's pinyin exactly matches the error phrase's pinyin, it goes into the first order\n# if the pinyin of the candidate's first character matches that of the error phrase, it goes into the second order\n# otherwise it goes into the third order\ndef get_candidates( error_phrase ):\n\n    candidates_1st_order = []\n    candidates_2nd_order = []\n    candidates_3rd_order = []\n\n    error_pinyin = pinyin.get(error_phrase, format=\"strip\", delimiter=\"/\").encode(\"utf-8\")\n    cn_words_dict = load_cn_words_dict( \"./cn_dict.txt\" )\n    candidate_phrases = list( known(edits1(error_phrase, cn_words_dict)) )\n\n    for candidate_phrase in candidate_phrases:\n        candidate_pinyin = pinyin.get(candidate_phrase, format=\"strip\", delimiter=\"/\").encode(\"utf-8\")\n        if candidate_pinyin == error_pinyin:\n            candidates_1st_order.append(candidate_phrase)\n        elif candidate_pinyin.split(\"/\")[0] == error_pinyin.split(\"/\")[0]:\n            candidates_2nd_order.append(candidate_phrase)\n        else:\n            candidates_3rd_order.append(candidate_phrase)\n\n    return candidates_1st_order, candidates_2nd_order, candidates_3rd_order\n```\n\n\n```python\ndef auto_correct( error_phrase ):\n\n    c1_order, c2_order, c3_order = get_candidates(error_phrase)\n    # print c1_order, c2_order, c3_order\n    if c1_order:\n        return max(c1_order, key=phrase_freq.get )\n    elif c2_order:\n        return max(c2_order, key=phrase_freq.get )\n    elif c3_order:\n        return max(c3_order, key=phrase_freq.get )\n    else:\n        # no known candidate is one edit away: return the phrase unchanged (as unicode, like the candidates)\n        return error_phrase.decode(\"utf-8\")\n```\n\n\n```python\n# test auto_correct\nerror_phrase_1 = \"呕涂\" # 
should be \"呕吐\"\nerror_phrase_2 = \"东方之朱\" # should be \"东方之珠\"\nerror_phrase_3 = \"沙拢\" # should be \"沙龙\"\n\nprint error_phrase_1, auto_correct( error_phrase_1 )\nprint error_phrase_2, auto_correct( error_phrase_2 )\nprint error_phrase_3, auto_correct( error_phrase_3 )\n```\n\n    呕涂呕吐\n    东方之朱东方之珠\n    沙拢沙龙\n\n\n### 3. Correct the misspelled phrases in a sentence\n\n\n\n- For any given sentence, use jieba to segment it into phrases.\n- For each phrase in the segment list that is not punctuation, check whether it exists in the phrase_freq dict; if it does not, it is a misspelled phrase.\n- Use the auto_correct function to correct each misspelled phrase.\n- Output the corrected sentence.\n\n\n\n```python\nimport jieba\nimport string\nimport re\n```\n\n\n```python\nPUNCTUATION_LIST = string.punctuation\nPUNCTUATION_LIST += \"。，？：；｛｝［］‘“”《》／！％……（）\"\n```\n\n\n```python\ndef auto_correct_sentence( error_sentence, verbose=True):\n\n    # segment the input sentence (the function argument, not a global variable)\n    jieba_cut = jieba.cut(error_sentence.decode(\"utf-8\"), cut_all=False)\n    seg_list = \"\\t\".join(jieba_cut).split(\"\\t\")\n\n    correct_sentence = \"\"\n\n    for phrase in seg_list:\n\n        correct_phrase = phrase\n        # check if the phrase is punctuation\n        if phrase not in PUNCTUATION_LIST.decode(\"utf-8\"):\n            # check if the phrase is in our dict; if not, it is a misspelled phrase\n            if phrase.encode(\"utf-8\") not in phrase_freq:\n                correct_phrase = auto_correct(phrase.encode(\"utf-8\"))\n                if verbose :\n                    print phrase, correct_phrase\n\n        correct_sentence += correct_phrase\n\n    if verbose:\n        print correct_sentence\n    return correct_sentence\n```\n\n\n```python\nerr_sent = '机七学习是人工智能领遇最能体现智能的一个分知！'\ncorrect_sent = auto_correct_sentence( err_sent )\n```\n\n    机七机器\n    领遇领域\n    分知分枝\n    机器学习是人工智能领域最能体现智能的一个分枝！\n\n\n\n```python\nprint correct_sent\n```\n\n    
机器学习是人工智能领域最能体现智能的一个分枝！\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbeyondacm%2FAutochecker4Chinese","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbeyondacm%2FAutochecker4Chinese","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbeyondacm%2FAutochecker4Chinese/lists"}