{"id":13634931,"url":"https://github.com/JackHCC/Word-Counting","last_synced_at":"2025-04-19T03:34:03.704Z","repository":{"id":112238526,"uuid":"178989218","full_name":"JackHCC/Word-Counting","owner":"JackHCC","description":"利用jieba库对中文小说进行词频统计并进行简单的正则匹配，同时验证Zipf-Law(Use the jieba library to perform word frequency statistics on Chinese novels and perform simple regular matching, and verify Zipf-Law)","archived":false,"fork":false,"pushed_at":"2019-04-02T04:10:51.000Z","size":323,"stargazers_count":14,"open_issues_count":2,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2024-08-02T00:21:42.693Z","etag":null,"topics":["jieba","mini-program","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/JackHCC.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-04-02T03:09:37.000Z","updated_at":"2023-09-06T01:22:41.000Z","dependencies_parsed_at":"2023-05-11T21:45:12.426Z","dependency_job_id":null,"html_url":"https://github.com/JackHCC/Word-Counting","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JackHCC%2FWord-Counting","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JackHCC%2FWord-Counting/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JackHCC%2FWord-Counting/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JackHCC%2FWord-Counting/ma
nifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/JackHCC","download_url":"https://codeload.github.com/JackHCC/Word-Counting/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223790228,"owners_count":17203350,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["jieba","mini-program","python"],"created_at":"2024-08-02T00:00:37.832Z","updated_at":"2024-11-09T05:30:15.218Z","avatar_url":"https://github.com/JackHCC.png","language":"Python","funding_links":[],"categories":["Uncategorized"],"sub_categories":["Uncategorized"],"readme":"# Word frequency counting and regex matching for Chinese novels\n\n### First, import the Chinese word segmentation library jieba, plus Counter and re\n```\nimport jieba\nimport re\nfrom collections import Counter\n```\n\n### Open the text to process (the Chinese edition of Pride and Prejudice) and segment it with jieba\n```\ntxt = open(\"傲慢与偏见.txt\", \"r\", encoding=\"gb18030\").read()\nwords = jieba.lcut(txt)\n```\n\n### Punctuation to exclude, so that only words are counted\n```\nexcludes = {\"，\", \"。\", \"\\n\", \"-\", \"“\", \"”\", \"：\", \"；\", \"？\", \"（\", \"）\", \"！\", \"…\"}\n```\n\n### Count every token, then drop the punctuation\n```\ncounts = {}\nfor word in words:\n    counts[word] = counts.get(word, 0) + 1\n\n# pop() avoids a KeyError when a mark never appears in the text\nfor word in excludes:\n    counts.pop(word, None)\n```\n\n### Collect all key-value pairs and sort them by count, highest first\n```\nitems = list(counts.items())\nitems.sort(key=lambda x: x[1], reverse=True)\n```\n\n### Write the statistics to a txt file\n```\nfile = open('data.txt', mode='w', encoding='utf-8')\n\n# the original run covered the top 10963 entries; slicing avoids an IndexError on shorter texts\nfor word, count in items[:10963]:\n    print(\"{0:\u003c10}{1:\u003e5}\".format(word, count))\n\n    new_context = word + \"   \" + str(count) + '\\n'\n    file.write(new_context)\n\nfile.close()\n```\n\n### Regex matching results\n```\nresult = open('正则.txt', 
mode='w', encoding='utf-8')\n# list holding the regex matches\nthings = []\n\n# regex: quoted speech introduced by 说 or 道 and ending in a question mark\nfor i in re.finditer(\"[说道]：“(.+)？”\", txt):\n    message = i.group(1)\n    things.append(message)\n\n# count and display\nc = Counter(things)\nfor k, v in c.most_common(51):\n    print(k, v)\n    context = k + \"   \" + str(v) + '\\n'\n    result.write(context)\n\nresult.close()\n```\n\n### Output: data.txt holds the word-frequency data; the regex matches lines spoken by characters that are questions, and those results are written to 正则.txt\n\n### Verifying Zipf's law\n![Word frequency statistics](验证Zipf-Law.jpg)\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FJackHCC%2FWord-Counting","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FJackHCC%2FWord-Counting","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FJackHCC%2FWord-Counting/lists"}