https://github.com/yizhiru/thulac4j

Chinese Word Segmentation Tool, a Java implementation of THULAC.

Host: GitHub
URL: https://github.com/yizhiru/thulac4j
Owner: yizhiru
License: apache-2.0
Created: 2017-03-03T01:00:21.000Z (over 9 years ago)
Default Branch: master
Last Pushed: 2021-04-12T06:10:51.000Z (about 5 years ago)
Last Synced: 2025-08-07T20:25:37.407Z (11 months ago)
Topics: chinese-word-segmentation, thulac
Language: Java
Size: 17.4 MB
Stars: 84
Watchers: 10
Forks: 31
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

awesome-java - THULAC4j

README

# thulac4j

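thulac4j is an efficient Java 8 implementation of [THULAC](http://thulac.thunlp.org/). Its word segmentation is fast, accurate, and powerful, and it supports:

- Custom dictionaries
- Traditional-to-Simplified Chinese conversion
- Stop-word filtering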
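
Stop-word filtering can be thought of as a post-processing pass over the token list produced by the segmenter. The sketch below illustrates only that idea in plain Java 8. It does not call thulac4j, and `StopWordFilter` and its stop-word set are hypothetical names chosen for this example, not part of the library's API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper, not part of thulac4j: removes stop words from a token list.
public class StopWordFilter {
    private final Set<String> stopWords;

    public StopWordFilter(Set<String> stopWords) {
        // Defensive copy so later changes to the caller's set do not affect the filter.
        this.stopWords = new HashSet<>(stopWords);
    }

    // Returns a new list containing only the tokens that are not stop words,
    // preserving the original token order.
    public List<String> filter(List<String> tokens) {
        return tokens.stream()
                .filter(token -> !stopWords.contains(token))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Tokens as they appear in the segmentation output of the usage example.
        List<String> tokens = Arrays.asList(
                "滔滔", "的", "流水", "，", "向着", "波士顿湾", "无声", "逝去");
        // Assumed stop-word set for illustration only.
        StopWordFilter filter = new StopWordFilter(new HashSet<>(Arrays.asList("的", "，")));
        System.out.println(filter.filter(tokens));
        // [滔滔, 流水, 向着, 波士顿湾, 无声, 逝去]
    }
}
```

For the library's own switches for stop-word filtering, custom dictionaries, and script conversion, consult the project Wiki rather than this sketch.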

## Usage

To use thulac4j in your project, add the dependency (please use the latest version):

```xml
<dependency>
  <groupId>io.github.yizhiru</groupId>
  <artifactId>thulac4j</artifactId>
  <version>3.1.2</version>
</dependency>
```

thulac4j supports Chinese word segmentation and part-of-speech tagging. For example:

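```java
import java.util.List;

String sentence = "滔滔的流水，向着波士顿湾无声逝去";

// Word segmentation only
List<String> words = Segmenter.segment(sentence);
// [滔滔, 的, 流水, ，, 向着, 波士顿湾, 无声, 逝去]

// Segmentation with part-of-speech tagging; the two model files are downloaded separately
POSTagger pos = new POSTagger("models/model_c_model.bin", "models/model_c_dat.bin");
List<?> taggedWords = pos.tagging(sentence);
// [滔滔/a, 的/u, 流水/n, ，/w, 向着/p, 波士顿湾/ns, 无声/v, 逝去/v]
```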

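The model data is large, so it is not included in the jar or in the source tree. For trained-model downloads and further usage instructions, see the [Wiki](https://github.com/yizhiru/thulac4j/wiki).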
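
Because the model files are not bundled, it can help to check that they exist before handing their paths to the tagger. The following is a minimal sketch using only the JDK. `ModelFiles` and `requireReadable` are hypothetical names introduced here, and the `models/` paths are the ones from the usage example above.

```java
import java.io.FileNotFoundException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of thulac4j: verifies that a model file is present.
public class ModelFiles {

    // Returns the path if it points to a readable regular file,
    // otherwise throws FileNotFoundException naming the absolute path.
    public static Path requireReadable(String location) throws FileNotFoundException {
        Path path = Paths.get(location);
        if (!Files.isRegularFile(path) || !Files.isReadable(path)) {
            throw new FileNotFoundException(
                    "Model file not found or not readable: " + path.toAbsolutePath());
        }
        return path;
    }

    public static void main(String[] args) throws FileNotFoundException {
        // Throws if the models have not been downloaded into ./models yet.
        String weights = requireReadable("models/model_c_model.bin").toString();
        String dat = requireReadable("models/model_c_dat.bin").toString();
        System.out.println("Model files found: " + weights + ", " + dat);
        // The two strings can then be passed to the POSTagger constructor.
    }
}
```

Failing early with the absolute path in the message makes a missing download easier to diagnose than an error raised while the model is being loaded.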

Finally, many thanks to the THUNLP lab!
