https://github.com/garthtb/word-freq-counter

A blind-segmentation word-frequency counter for Chinese corpora: 1 billion characters in about 1 minute!

chinese command-line-tool corpus csharp nlp windows word-cloud word-frequency word-frequency-counter



Host: GitHub
URL: https://github.com/garthtb/word-freq-counter
Owner: GarthTB
License: apache-2.0
Created: 2025-01-26T19:01:55.000Z (8 months ago)
Default Branch: master
Last Pushed: 2025-02-14T10:06:26.000Z (8 months ago)
Last Synced: 2025-02-14T11:20:40.671Z (8 months ago)
Topics: chinese, command-line-tool, corpus, csharp, nlp, windows, word-cloud, word-frequency, word-frequency-counter
Language: Rust
Homepage:
Size: 26.4 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# Blind-Segmentation Word-Frequency Counter for Chinese Corpora

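On my personal computer, counting 2-character words in roughly 1 billion characters of Chinese web text, with punctuation excluded, takes about 1 minute.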

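The corpus file must be UTF-8 encoded. By default, a character counts as Chinese if its code point lies in the range 4e00-9fff (hexadecimal).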
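The default range can be illustrated with a small filter. This is a minimal sketch, not the tool's actual code: the names `is_chinese` and `chinese_runs` are hypothetical, and treating every character outside the range (punctuation, Latin letters, digits, whitespace) as a boundary between runs is an assumption.

```rust
/// Default range from the README: code points 4e00 through 9fff (hexadecimal).
fn is_chinese(c: char) -> bool {
    ('\u{4e00}'..='\u{9fff}').contains(&c)
}

/// Split text into runs of consecutive Chinese characters.
/// Assumption: any character outside the range ends the current run.
fn chinese_runs(text: &str) -> Vec<Vec<char>> {
    let mut runs = Vec::new();
    let mut current = Vec::new();
    for c in text.chars() {
        if is_chinese(c) {
            current.push(c);
        } else if !current.is_empty() {
            // Move the finished run out and leave `current` empty.
            runs.push(std::mem::take(&mut current));
        }
    }
    if !current.is_empty() {
        runs.push(current);
    }
    runs
}
```

When loading a corpus for such a filter, `std::fs::read_to_string` returns an error if the file is not valid UTF-8, which is one simple way to enforce the encoding requirement.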

## How it works

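Each run makes two passes over the corpus. Suppose we are counting n-character words:

- Pass 1: count how many times each sequence of n adjacent Chinese characters occurs across the whole corpus.
- Pass 2: treat every (2n-1) adjacent Chinese characters as a window, which contains n candidate words, and slide the window with a step of n. Using the counts from pass 1, pick the most frequent candidate in each window (the one most likely to be a real word).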
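The two passes can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: the function names `count_ngrams` and `pick_words` are hypothetical, the input is assumed to be already split into runs of consecutive Chinese characters, a window at the end of a run is clipped to the characters that remain, ties go to the leftmost candidate, and the chosen word is tallied once per window.

```rust
use std::collections::HashMap;

/// Pass 1: count every sequence of `n` adjacent characters in each run.
fn count_ngrams(runs: &[Vec<char>], n: usize) -> HashMap<String, u64> {
    assert!(n > 0, "n must be at least 1");
    let mut counts = HashMap::new();
    for run in runs {
        for gram in run.windows(n) {
            *counts.entry(gram.iter().collect::<String>()).or_insert(0) += 1;
        }
    }
    counts
}

/// Pass 2: slide a window of 2n-1 characters with step n and tally the
/// candidate that has the highest pass-1 count in each window.
fn pick_words(
    runs: &[Vec<char>],
    n: usize,
    raw: &HashMap<String, u64>,
) -> HashMap<String, u64> {
    assert!(n > 0, "n must be at least 1");
    let mut picked = HashMap::new();
    for run in runs {
        let mut start = 0;
        while start + n <= run.len() {
            // Full windows hold n candidates; the last one may be clipped
            // at the end of the run (assumption).
            let end = (start + 2 * n - 1).min(run.len());
            let mut best: Option<(String, u64)> = None;
            for gram in run[start..end].windows(n) {
                let word: String = gram.iter().collect();
                let freq = raw.get(&word).copied().unwrap_or(0);
                // Strict `>` keeps the leftmost candidate on ties (assumption).
                if best.as_ref().map_or(true, |(_, f)| freq > *f) {
                    best = Some((word, freq));
                }
            }
            if let Some((word, _)) = best {
                *picked.entry(word).or_insert(0) += 1;
            }
            start += n;
        }
    }
    picked
}
```

For the single run 中国人中国 with n = 2, pass 1 counts 中国 twice and 国人 and 人中 once each. Pass 2 then sees the windows 中国人 and 人中国 and picks 中国 in both, so the accidental pairs 国人 and 人中 drop out of the final tally.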

## Changelog

### v0.3.0 - 20250214

- Improved: faster.

### v0.2.0 - 20250128

- Improved: faster, with a smaller binary.

### v0.1.0 - 20250128

- Initial release!