https://github.com/JackHCC/Word-Counting
利用jieba库对中文小说进行词频统计并进行简单的正则匹配,同时验证Zipf-Law(Use the jieba library to perform word frequency statistics on Chinese novels and perform simple regular matching, and verify Zipf-Law)
https://github.com/JackHCC/Word-Counting
jieba mini-program python
Last synced: 10 months ago
JSON representation
利用jieba库对中文小说进行词频统计并进行简单的正则匹配,同时验证Zipf-Law(Use the jieba library to perform word frequency statistics on Chinese novels and perform simple regular matching, and verify Zipf-Law)
- Host: GitHub
- URL: https://github.com/JackHCC/Word-Counting
- Owner: JackHCC
- Created: 2019-04-02T03:09:37.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2019-04-02T04:10:51.000Z (almost 7 years ago)
- Last Synced: 2024-08-02T00:21:42.693Z (over 1 year ago)
- Topics: jieba, mini-program, python
- Language: Python
- Homepage:
- Size: 315 KB
- Stars: 14
- Watchers: 1
- Forks: 1
- Open Issues: 2
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# 中文小说词频统计及正则匹配
### 首先导入中文分词库jieba,Counter库和re库
```
import jieba
import re
from collections import Counter
```
### 导入打开要处理的文本傲慢与偏见中文版小说并利用jieba分词
```
txt = open("傲慢与偏见.txt", "r", encoding="gb18030").read()
words = jieba.lcut(txt)
```
### 去除的标点符号,只统计词频
```
excludes = {",", "。", "\n", "-", "“", "”", ":", ";", "?", "(", ")", "!", "…"}
```
### 遍历计数并去除标点
```
for word in words:
counts[word] = counts.get(word,0)+1
for word in excludes:
del counts[word]
```
### 返回遍历得分所有键与值并排序
```
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
```
### 将统计数据写入txt文本
```
file = open('data.txt', mode='w')
for i in range(10963):
word, count = items[i]
print("{0:<10}{1:>5}".format(word,count))
new_context = word + " " + str(count) + '\n'
file.write(new_context)
file.close()
```
### 正则匹配结果
```
result = open('正则.txt', mode='w')
#存正则匹配的数组
things = []
#正则匹配:人物说的内容
for i in re.finditer("[说|道]:“(.+)\?”", txt):
message = i.group(1)
things.append(message)
#计数和展示
c = Counter(things)
for k, v in c.most_common(51):
print(k, v)
context = k + " " + str(v) + '\n'
result.write(context)
result.close()
```
### 输出 data.txt是词频统计的文本数据,正则是匹配人物说的话并且是问句,结果写入 正则.txt
### 验证Zipf-Law
