Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/yanyiwu/gojieba
Golang version of the "Jieba" (结巴) Chinese word segmentation library
Last synced: 4 days ago
- Host: GitHub
- URL: https://github.com/yanyiwu/gojieba
- Owner: yanyiwu
- License: mit
- Created: 2015-09-12T01:30:44.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2024-09-14T15:49:34.000Z (4 months ago)
- Last Synced: 2024-10-29T15:35:14.069Z (2 months ago)
- Language: Go
- Homepage:
- Size: 7.98 MB
- Stars: 2,423
- Watchers: 67
- Forks: 303
- Open Issues: 17
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
Awesome Lists containing this project
- awesome-go - gojieba - This is a Go implementation of [jieba](https://github.com/fxsjy/jieba), which is a Chinese word splitting algorithm. (Natural Language Processing / Tokenizers)
- zero-alloc-awesome-go - gojieba - This is a Go implementation of [jieba](https://github.com/fxsjy/jieba), which is a Chinese word splitting algorithm. (Natural Language Processing / Tokenizers)
- awesome-list - gojieba
- go-awesome - gojieba - The Go language version of the "Jieba" (结巴) Chinese word segmentation library (Open source library / Search Recommendations)
- awesome-go - gojieba - Golang version of the "Jieba" Chinese word segmentation library - ★ 659 (Natural Language Processing)
- awesome-go-extra - gojieba (Bot Building / Tokenizers)
- awesome-go-zh - gojieba
- my-awesome - yanyiwu/gojieba - 12 star:2.5k fork:0.3k Golang version of the "Jieba" Chinese word segmentation library (Go)
- awesome-go - gojieba - This is a Go implementation of [jieba](https://github.com/fxsjy/jieba), which is a Chinese word splitting algorithm. Stars:`2.5K`. (Natural Language Processing / Tokenizers)
README
# GoJieba
[![Test](https://github.com/yanyiwu/gojieba/actions/workflows/test.yml/badge.svg)](https://github.com/yanyiwu/gojieba/actions/workflows/test.yml)
[![Author](https://img.shields.io/badge/[email protected]?style=flat)](http://yanyiwu.com/)
[![Tag](https://img.shields.io/github/v/tag/yanyiwu/gojieba.svg)](https://github.com/yanyiwu/gojieba/releases)
[![Performance](https://img.shields.io/badge/performance-excellent-brightgreen.svg?style=flat)](http://yanyiwu.com/work/2015/06/14/jieba-series-performance-test.html)
[![License](https://img.shields.io/badge/license-MIT-yellow.svg?style=flat)](http://yanyiwu.mit-license.org)
[![GoDoc](https://godoc.org/github.com/yanyiwu/gojieba?status.svg)](https://godoc.org/github.com/yanyiwu/gojieba)
[![Coverage Status](https://coveralls.io/repos/yanyiwu/gojieba/badge.svg?branch=master&service=github)](https://coveralls.io/github/yanyiwu/gojieba?branch=master)
[![Go Report Card](https://goreportcard.com/badge/yanyiwu/gojieba)](https://goreportcard.com/report/yanyiwu/gojieba)
[![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/avelino/awesome-go)

[GoJieba] is the Golang version of the "Jieba" (结巴) Chinese word segmentation library.

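## Introduction

+ Supports multiple segmentation modes: maximum-probability mode, HMM new-word discovery mode, search-engine mode, and full mode.
+ The core algorithm is implemented in C++ underneath, so performance is high.
+ Dictionary paths are configurable: `NewJieba(...string)` and `NewExtractor(...string)` take variadic arguments, and the default dictionaries are used when no arguments are passed (the recommended approach).

## Usage
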
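The variadic dictionary-path arguments described above can be pictured with a small, self-contained sketch. It shows only the general Go pattern of "use a default unless the caller supplies a path"; it is not GoJieba's actual implementation. The function name `resolvePaths`, the placeholder file names, the three-slot layout, and the rule that an empty string falls back to the default are all assumptions made for illustration.

```go
package main

import "fmt"

// defaultPaths holds hypothetical placeholder paths, one per positional slot.
// These names are illustrative and are not GoJieba's real dictionary files.
var defaultPaths = []string{
	"default/main.dict",
	"default/hmm.model",
	"default/user.dict",
}

// resolvePaths returns one path per slot: the caller-supplied path when one
// was given and is non-empty, otherwise the default for that slot.
func resolvePaths(paths ...string) []string {
	resolved := make([]string, len(defaultPaths))
	copy(resolved, defaultPaths)
	for i, p := range paths {
		if i >= len(resolved) {
			break // extra arguments are ignored in this sketch
		}
		if p != "" {
			resolved[i] = p
		}
	}
	return resolved
}

func main() {
	// No arguments: every slot keeps its default.
	fmt.Println(resolvePaths())
	// One argument: only the first slot is overridden.
	fmt.Println(resolvePaths("my/main.dict"))
}
```

With this shape, calling the constructor with no arguments is the simplest path, which matches the README's recommendation to rely on the default dictionaries.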
```bash
go get github.com/yanyiwu/gojieba
```

Segmentation example:

```golang
package main

import (
	"fmt"
	"strings"

	"github.com/yanyiwu/gojieba"
)

func main() {
	var s string
	var words []string
	use_hmm := true
	x := gojieba.NewJieba()
	defer x.Free()

	s = "我来到北京清华大学"
	words = x.CutAll(s)
	fmt.Println(s)
	fmt.Println("Full mode:", strings.Join(words, "/"))

	words = x.Cut(s, use_hmm)
	fmt.Println(s)
	fmt.Println("Accurate mode:", strings.Join(words, "/"))
	s = "比特币"
	words = x.Cut(s, use_hmm)
	fmt.Println(s)
	fmt.Println("Accurate mode:", strings.Join(words, "/"))

	x.AddWord("比特币")
	// `AddWordEx` lets you specify the word's weight; use it when `AddWord`
	// fails to add a word because its weight is too low.
	// The `tag` argument may be an empty string or a part-of-speech tag.
	// x.AddWordEx("比特币", 100000, "")
	s = "比特币"
	words = x.Cut(s, use_hmm)
	fmt.Println(s)
	fmt.Println("Accurate mode after adding the word:", strings.Join(words, "/"))

	s = "他来到了网易杭研大厦"
	words = x.Cut(s, use_hmm)
	fmt.Println(s)
	fmt.Println("New word recognition:", strings.Join(words, "/"))

	s = "小明硕士毕业于中国科学院计算所,后在日本京都大学深造"
	words = x.CutForSearch(s, use_hmm)
	fmt.Println(s)
	fmt.Println("Search engine mode:", strings.Join(words, "/"))

	s = "长春市长春药店"
	words = x.Tag(s)
	fmt.Println(s)
	fmt.Println("POS tagging:", strings.Join(words, ","))

	s = "区块链"
	words = x.Tag(s)
	fmt.Println(s)
	fmt.Println("POS tagging:", strings.Join(words, ","))

	s = "长江大桥"
	words = x.CutForSearch(s, !use_hmm)
	fmt.Println(s)
	fmt.Println("Search engine mode:", strings.Join(words, "/"))

	wordinfos := x.Tokenize(s, gojieba.SearchMode, !use_hmm)
	fmt.Println(s)
	fmt.Println("Tokenize (search engine mode):", wordinfos)

	wordinfos = x.Tokenize(s, gojieba.DefaultMode, !use_hmm)
	fmt.Println(s)
	fmt.Println("Tokenize (default mode):", wordinfos)

	keywords := x.ExtractWithWeight(s, 5)
	fmt.Println("Extract:", keywords)
}
```

```
我来到北京清华大学
Full mode: 我/来到/北京/清华/清华大学/华大/大学
我来到北京清华大学
Accurate mode: 我/来到/北京/清华大学
比特币
Accurate mode: 比特/币
比特币
Accurate mode after adding the word: 比特币
他来到了网易杭研大厦
New word recognition: 他/来到/了/网易/杭研/大厦
小明硕士毕业于中国科学院计算所,后在日本京都大学深造
Search engine mode: 小明/硕士/毕业/于/中国/科学/学院/科学院/中国科学院/计算/计算所/,/后/在/日本/京都/大学/日本京都大学/深造
长春市长春药店
POS tagging: 长春市/ns,长春/ns,药店/n
区块链
POS tagging: 区块链/nz
长江大桥
Search engine mode: 长江/大桥/长江大桥
长江大桥
Tokenize: [{长江 0 6} {大桥 6 12} {长江大桥 0 12}]
```

See details in [gojieba-demo](http://github.com/yanyiwu/gojieba-demo).

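The offsets in the `Tokenize` line of the sample output, such as `{长江 0 6}`, are consistent with byte positions in the UTF-8 string, since each of these characters occupies three bytes. The sketch below shows one way to turn such byte offsets into character (rune) offsets using only the standard library. The `Token` type and the `RuneSpan` helper are hypothetical stand-ins defined here for illustration; they are not part of GoJieba, and the token values are copied from the sample output rather than produced by the library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// Token is a hypothetical stand-in for one tokenization result:
// the word plus its start and end offsets, assumed to be in bytes.
type Token struct {
	Str   string
	Start int
	End   int
}

// RuneSpan converts a token's byte offsets into rune (character) offsets
// within the sentence s.
func RuneSpan(s string, t Token) (start, end int) {
	start = utf8.RuneCountInString(s[:t.Start])
	end = start + utf8.RuneCountInString(s[t.Start:t.End])
	return start, end
}

func main() {
	s := "长江大桥"
	// Values copied from the sample output above.
	tokens := []Token{{"长江", 0, 6}, {"大桥", 6, 12}, {"长江大桥", 0, 12}}
	for _, t := range tokens {
		start, end := RuneSpan(s, t)
		// Slicing the sentence by byte offsets recovers the word itself.
		fmt.Printf("%s bytes[%d:%d]=%q runes[%d:%d]\n",
			t.Str, t.Start, t.End, s[t.Start:t.End], start, end)
	}
}
```

Rune offsets are usually what a UI needs for highlighting, while byte offsets are convenient for slicing Go strings directly.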
See examples in [jieba_test](jieba_test.go) and [extractor_test](extractor_test.go).

## Benchmark

[Jieba Chinese word segmentation series performance evaluation][Jieba中文分词系列性能评测]

Unit tests:

```bash
go test ./...
```

Benchmark:

```bash
go test -bench "Jieba" -test.benchtime 10s
go test -bench "Extractor" -test.benchtime 10s
```

## Contributors

### Code Contributors
This project exists thanks to all the people who contribute.

[CppJieba]: http://github.com/yanyiwu/cppjieba
[GoJieba]: http://github.com/yanyiwu/gojieba
[Jieba]: https://github.com/fxsjy/jieba
[Jieba中文分词系列性能评测]: http://yanyiwu.com/work/2015/06/14/jieba-series-performance-test.html