https://github.com/garthtb/word_freq_counter

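A high-performance word frequency counter for Chinese corpora that uses blind (dictionary-free) word segmentation, written in Rust.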

Host: GitHub
URL: https://github.com/garthtb/word_freq_counter
Owner: GarthTB
Created: 2024-07-22T20:43:43.000Z (10 months ago)
Default Branch: master
Last Pushed: 2024-08-23T21:30:10.000Z (9 months ago)
Last Synced: 2025-01-22T08:17:23.917Z (4 months ago)
Language: Rust
Size: 27.3 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# High-Performance Chinese Word Frequency Counter with Blind Segmentation

The corpus file must be UTF-8 encoded. Characters in the range U+4E00 to U+9FFF (hexadecimal 4e00-9fff) are treated as Chinese.
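The range check above can be sketched in a few lines of Rust. This is only an illustration, not the project's actual code: the names `is_chinese` and `chinese_runs` are hypothetical, and treating every character outside the range as a separator is an assumption that the README does not confirm.

```rust
/// True if `c` lies in U+4E00..=U+9FFF, the range the README treats as Chinese.
fn is_chinese(c: char) -> bool {
    ('\u{4e00}'..='\u{9fff}').contains(&c)
}

/// Splits text into maximal runs of Chinese characters.
/// Assumption: any character outside the range acts as a separator.
fn chinese_runs(text: &str) -> Vec<Vec<char>> {
    text.split(|c: char| !is_chinese(c))
        .filter(|run| !run.is_empty())
        .map(|run| run.chars().collect())
        .collect()
}
```

Under this sketch, `chinese_runs("你好，world世界")` yields two runs, `你好` and `世界`, because the full-width comma and the Latin letters fall outside the range.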

### [C# version based on the same principle, requires the .NET 6 runtime](https://github.com/GarthTB/WordFreqCounter)

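| Performance comparison | Round 1 time | Total time (both rounds) |
|:------------------------|:-----:|:-----:|
| C# version, Weibo corpus, 1 billion characters, 2-character words, no special symbols | 38.0s | 69.8s |
| Rust version, Weibo corpus, 1 billion characters, 2-character words, no special symbols | 27.2s | 60.3s |
| C# version, Weibo corpus, 120 million characters, 4-character words, no special symbols | 42.1s | 47.9s |
| Rust version, Weibo corpus, 120 million characters, 4-character words, no special symbols | 9.4s | 18.5s |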

## How the counting works

Each run performs two rounds of counting. Suppose the goal is to count n-character words.

Round 1: count how often every sequence of n adjacent Chinese characters occurs (before the refactor, this was the only round).
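A minimal sketch of round 1, under stated assumptions rather than as the repository's implementation: the function name `count_ngrams` is hypothetical, and the code assumes that characters outside U+4E00..=U+9FFF split the text into runs so that no n-gram spans a separator.

```rust
use std::collections::HashMap;

/// Round 1 (sketch): count every sequence of `n` adjacent Chinese characters.
fn count_ngrams(text: &str, n: usize) -> HashMap<String, u64> {
    let mut counts = HashMap::new();
    if n == 0 {
        return counts; // `windows(0)` would panic, so bail out early
    }
    // Assumption: non-Chinese characters break the text into independent runs.
    for run in text.split(|c: char| !('\u{4e00}'..='\u{9fff}').contains(&c)) {
        let chars: Vec<char> = run.chars().collect();
        // `windows(n)` yields nothing when the run is shorter than `n`.
        for gram in chars.windows(n) {
            *counts.entry(gram.iter().collect::<String>()).or_insert(0) += 1;
        }
    }
    counts
}
```

For the input `今天天气好，天气好` with `n = 2`, this counts `今天` and `天天` once each and `天气` and `气好` twice each. Note that overlapping fragments such as `天天` are counted alongside real words, which is the problem round 2 addresses.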

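Round 2: every (2n-1) adjacent characters form a sliding window. Each window contains n candidate words, and the window advances n characters per step. Using the round 1 results, only the candidate with the highest round 1 frequency in each window (the one most likely to be a real word) is counted.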
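Round 2 can be sketched as follows. Several details here are assumptions that the README does not specify: the function name `second_round` is hypothetical, ties go to the earliest candidate, a window that reaches the end of a run is clipped to the characters that remain, and non-Chinese characters split the text into runs.

```rust
use std::collections::HashMap;

/// Round 2 (sketch): per window of 2n-1 characters, count only the candidate
/// n-gram with the highest round 1 frequency. `first` holds the round 1 counts.
fn second_round(text: &str, n: usize, first: &HashMap<String, u64>) -> HashMap<String, u64> {
    let mut counts = HashMap::new();
    if n == 0 {
        return counts;
    }
    for run in text.split(|c: char| !('\u{4e00}'..='\u{9fff}').contains(&c)) {
        let chars: Vec<char> = run.chars().collect();
        let mut start = 0;
        while start + n <= chars.len() {
            // The window spans 2n-1 characters, clipped at the end of the run.
            let end = (start + 2 * n - 1).min(chars.len());
            let mut best: Option<(String, u64)> = None;
            for gram in chars[start..end].windows(n) {
                let word: String = gram.iter().collect();
                let freq = first.get(&word).copied().unwrap_or(0);
                // Strict `>` keeps the earliest candidate on ties (assumption).
                if best.as_ref().map_or(true, |(_, f)| freq > *f) {
                    best = Some((word, freq));
                }
            }
            if let Some((word, _)) = best {
                *counts.entry(word).or_insert(0) += 1;
            }
            start += n; // the window advances by n characters
        }
    }
    counts
}
```

With `n = 2`, the text `今天天气好，天气好`, and round 1 counts of 1 for `今天` and `天天` and 2 for `天气` and `气好`, the sketch keeps `今天` once and `天气` twice. The fragments `天天` and `气好` drop out because each loses to a candidate in the same window.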
