https://github.com/garthtb/word-freq-counter
A blind word-segmentation word-frequency counter for Chinese corpora: 1 billion characters in just 1 minute!
- Host: GitHub
- URL: https://github.com/garthtb/word-freq-counter
- Owner: GarthTB
- License: apache-2.0
- Created: 2025-01-26T19:01:55.000Z (8 months ago)
- Default Branch: master
- Last Pushed: 2025-02-14T10:06:26.000Z (8 months ago)
- Last Synced: 2025-02-14T11:20:40.671Z (8 months ago)
- Topics: chinese, command-line-tool, corpus, csharp, nlp, windows, word-cloud, word-frequency, word-frequency-counter
- Language: Rust
- Homepage:
- Size: 26.4 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
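# Blind word-segmentation word-frequency counter for Chinese corpora
On my personal computer, counting 2-character words in a Chinese web corpus of about 1 billion characters, with punctuation excluded, takes about 1 minute.
The corpus file must be UTF-8 encoded. By default, Chinese characters are those in the range 4e00-9fff (hexadecimal), i.e. U+4E00 to U+9FFF.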
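To make the input requirements concrete, here is a minimal Python sketch of reading a UTF-8 corpus and keeping only characters in the default range. It is an illustration, not the tool's own code: the function names `is_default_chinese`, `chinese_runs`, and `read_corpus` are hypothetical, and treating every out-of-range character (punctuation, Latin letters, digits, whitespace) as a separator between runs is an assumption that this README does not confirm.

```python
def is_default_chinese(ch: str) -> bool:
    """Return True if ch lies in the default range U+4E00..U+9FFF."""
    return 0x4E00 <= ord(ch) <= 0x9FFF


def chinese_runs(text: str) -> list[str]:
    """Split text into maximal runs of in-range characters.

    Assumption: any out-of-range character ends the current run
    and is itself dropped.
    """
    runs, current = [], []
    for ch in text:
        if is_default_chinese(ch):
            current.append(ch)
        elif current:
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))
    return runs


def read_corpus(path: str) -> str:
    """Read a corpus file; raises UnicodeDecodeError if it is not valid UTF-8."""
    with open(path, encoding="utf-8") as f:
        return f.read()
```

For example, `chinese_runs("你好,世界!")` returns `["你好", "世界"]`, because the full-width comma and exclamation mark fall outside the range.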
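## How it works
Each run makes two passes. Suppose we are counting n-character words:
- First pass: count how many times every combination of n adjacent Chinese characters occurs in the whole corpus.
- Second pass: take every (2n-1) adjacent Chinese characters as a window. Each window contains n candidate words, and the window slides with a stride of n. Using the counts from the first pass, pick the highest-frequency candidate in each window (the one most likely to be a real word).

## Changelog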
### v0.3.0 - 20250214
- Optimization: faster.
### v0.2.0 - 20250128
- Optimization: faster, with a smaller size.
### v0.1.0 - 20250128
- Released!
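
## Example sketch of the two passes
The two-pass procedure described in this README can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the tool's actual implementation. The names `count_ngrams`, `pick_words`, and `word_frequencies` are hypothetical. The sketch assumes the input is already a list of runs of Chinese characters, that windows never cross run boundaries, that the last window of a run may be shorter than 2n-1 characters, that ties go to the leftmost candidate, and that a word's final frequency is the number of windows in which it was picked. The README does not specify any of these details.

```python
from collections import Counter


def count_ngrams(runs, n):
    """Pass 1: count every combination of n adjacent characters."""
    counts = Counter()
    for run in runs:
        for i in range(len(run) - n + 1):
            counts[run[i:i + n]] += 1
    return counts


def pick_words(runs, n, counts):
    """Pass 2: slide a (2n-1)-character window with stride n and
    credit the candidate with the highest pass-1 count."""
    picked = Counter()
    width = 2 * n - 1
    for run in runs:
        for start in range(0, len(run) - n + 1, n):
            # Assumption: the window is truncated at the end of a run.
            window = run[start:start + width]
            candidates = [window[i:i + n] for i in range(len(window) - n + 1)]
            # max() keeps the first maximal item, so ties go leftmost.
            best = max(candidates, key=lambda w: counts[w])
            picked[best] += 1
    return picked


def word_frequencies(runs, n=2):
    """Run both passes and return the picked-word counts."""
    return pick_words(runs, n, count_ngrams(runs, n))
```

With `runs = ["我们喜欢我们", "我们的"]` and `n = 2`, the first pass counts 我们 three times and every other pair once. The second pass then picks 我们 in three windows and 喜欢 in one (a tie with 欢我, resolved to the left), so `word_frequencies(runs, 2)` equals `Counter({"我们": 3, "喜欢": 1})`.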