https://github.com/go-ego/gse

Go efficient multilingual NLP and text segmentation; support English, Chinese, Japanese and others.

chinese english go gse hmm hmm-viterbi-algorithm japanese jieba nlp segment trie

Host: GitHub
URL: https://github.com/go-ego/gse
Owner: go-ego
License: apache-2.0
Created: 2017-06-23T15:42:35.000Z (almost 7 years ago)
Default Branch: master
Last Pushed: 2023-11-16T16:23:57.000Z (6 months ago)
Last Synced: 2024-01-31T05:16:16.700Z (4 months ago)
Topics: chinese, english, go, gse, hmm, hmm-viterbi-algorithm, japanese, jieba, nlp, segment, trie
Language: Go
Homepage:
Size: 16.8 MB
Stars: 2,415
Watchers: 65
Forks: 210
Open Issues: 13
Metadata Files:
- Readme: README.md
- Contributing: .github/CONTRIBUTING.md
- License: LICENSE

Lists

awesome-go - gse - Go efficient text segmentation; support english, chinese, japanese and other. (Natural Language Processing / Tokenizers)
awesome-go-cn - gse
awesome-go-extra - gse (Bot Building / Tokenizers)
awesome-chinese-nlp - High-performance word segmentation in Go
awesome-go-zh - gse
awesome-stars - gse - ego | 2053 | (Go)
go-awesome - gse - Word segmentation in Go (Open source libraries / Search and recommendation)
awesome-go - gse - Go efficient text segmentation; support english, chinese, japanese and other. Stars:`2.5K`. (Natural Language Processing / Tokenizers)
awesome - go-ego/gse - Go efficient multilingual NLP and text segmentation; support English, Chinese, Japanese and others. (Go)
awesome-go-cn - gse - includes Chinese documentation (Natural Language Processing / Tokenizers)
awesome-go - gse
awesome-go - gse - Go efficient text segmentation; support english, chinese, japanese and other. (Natural Language Processing / Uncategorized)
awesome-go-projects - gse - Go efficient text segmentation; support english, chinese, japanese and other. (Natural Language Processing / Uncategorized)
awesome-go - gse - Go efficient text segmentation; support english, chinese, japanese and other. (Natural Language Processing / Advanced Console UIs)
awesome-go-with-framework - gse - Go efficient text segmentation; support english, chinese, japanese and other. (Natural Language Processing / Strings)
zero-alloc-awesome-go - gse - Go efficient text segmentation; support english, chinese, japanese and other. (Natural Language Processing / Tokenizers)
awesome-go-stars - gse - Go efficient text segmentation; support english, chinese, japanese and other. (Natural Language Processing / Tokenizers)
awesome-go-zh - gse - includes Chinese documentation (Natural Language Processing / Tokenizers)
awesome-go - gse - Go efficient text segmentation; support english, chinese, japanese and other. (Natural Language Processing / Strings)
awesome-go. - gse - Go efficient text segmentation; support english, chinese, japanese and other. (Natural Language Processing / Advanced Console UIs)
awesome-go - gse - Go efficient text segmentation; support english, chinese, japanese and other. (Natural Language Processing / Advanced Console UIs)
awesome-go-with-stars - gse - Go efficient text segmentation; support english, chinese, japanese and other. (Natural Language Processing / Tokenizers)
go-awesome - gse - Word segmentation in Go (Open source libraries / Search)
repo-1316-awesome-go-cn - gse - includes Chinese documentation (Natural Language Processing / Tokenizers)
repo-1211-awesome-go-cn - gse - includes Chinese documentation (Natural Language Processing / Tokenizers)
awesome-Char - gse - Go efficient text segmentation; support english, chinese, japanese and other. (Natural Language Processing / Uncategorized)
awesome-reader - gse - Go efficient text segmentation; support english, chinese, japanese and other. (Natural Language Processing / Strings)
Go-awesome - gse - Go efficient text segmentation; support english, chinese, japanese and other. (Natural Language Processing / Tokenizers)
go-awesome-cn-star - gse
my-awesome - go-ego/gse - viterbi-algorithm,japanese,jieba,nlp,segment,trie pushed_at:2024-02 star:2.5k fork:0.2k Go efficient multilingual NLP and text segmentation; support English, Chinese, Japanese and others. (Go)
awesome-go-handwritten - gse - Go efficient text segmentation; support english, chinese, japanese and other. (Natural Language Processing / Advanced Console UIs)
awesome-stars - gse - Go efficient text segmentation and NLP; support english, chinese, japanese and other. High-performance word segmentation in Go (Go)
awesome-starts - go-ego/gse - Go efficient text segmentation and NLP; support english, chinese, japanese and other. High-performance word segmentation in Go (Go)
awesome-go - gse - Go efficient text segmentation; support english, chinese, japanese and other. High-performance word segmentation in Go - ★ 615 (Natural Language Processing)
go-awesome - gse - word segmentation in Go language (Open source library / Search Recommendations)
awesome-go2 - gse - Go efficient text segmentation; support english, chinese, japanese and other. (Natural Language Processing / Advanced Console UIs)
awesome-go - gse - Go efficient text segmentation; support english, chinese, japanese and other. - :arrow_down:12 - :star:392 (Natural Language Processing / Strings)
awesome-stars - go-ego/gse - `★2470` Go efficient multilingual NLP and text segmentation; support English, Chinese, Japanese and others. (Go)

README

# gse

An efficient multilingual NLP and text segmentation library for Go; it supports English, Chinese, Japanese, and other languages.

It also integrates with [elasticsearch](https://github.com/vcaesar/go-gse-elastic) and [bleve](https://github.com/vcaesar/gse-bleve).

[![Build Status](https://github.com/go-ego/gse/workflows/Go/badge.svg)](https://github.com/go-ego/gse/commits/master)

[![CircleCI Status](https://circleci.com/gh/go-ego/gse.svg?style=shield)](https://circleci.com/gh/go-ego/gse)

[![codecov](https://codecov.io/gh/go-ego/gse/branch/master/graph/badge.svg)](https://codecov.io/gh/go-ego/gse)

[![Build Status](https://travis-ci.org/go-ego/gse.svg)](https://travis-ci.org/go-ego/gse)

[![Go Report Card](https://goreportcard.com/badge/github.com/go-ego/gse)](https://goreportcard.com/report/github.com/go-ego/gse)

[![GoDoc](https://godoc.org/github.com/go-ego/gse?status.svg)](https://godoc.org/github.com/go-ego/gse)

[![GitHub release](https://img.shields.io/github/release/go-ego/gse.svg)](https://github.com/go-ego/gse/releases/latest)

[![Join the chat at https://gitter.im/go-ego/ego](https://badges.gitter.im/Join%20Chat.svg)](https://gitter.im/go-ego/ego?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)

[简体中文](https://github.com/go-ego/gse/blob/master/README_zh.md)

Gse is an implementation of jieba in Go that also aims to add NLP support and more features.

## Features:

- Supports multiple word segmentation modes: common, search engine, full, precise, and HMM

- Supports user and embedded dictionaries, part-of-speech (POS) tagging, segment analysis, and stop-word removal and trimming

- Multilingual support: English, Chinese, Japanese, and others

- Supports Traditional Chinese

- Supports HMM-based text cutting using the Viterbi algorithm

- NLP support via TensorFlow (in progress)

- Named Entity Recognition (in progress)

- Integrates with [elasticsearch](https://github.com/vcaesar/go-gse-elastic) and bleve

- Can run as a JSON RPC service

## Algorithm:

- The [Dictionary](https://github.com/go-ego/gse/blob/master/dictionary.go) is implemented with a double-array trie.

- The [Segmenter](https://github.com/go-ego/gse/blob/master/dag.go) uses a shortest-path algorithm (based on word frequency and dynamic programming), as well as DAG-based and HMM-based word segmentation.

## Text Segmentation speed:

- Single thread: 9.2 MB/s

- Concurrent goroutines: 26.8 MB/s

- HMM text segmentation, single thread: 3.2 MB/s (2-core, 4-thread MacBook Pro)

## Binding:

[gse-bind](https://github.com/vcaesar/gse-bind) provides bindings for JavaScript and other languages.

## Install / update

With Go module support (Go 1.11+), just import:

```go
import "github.com/go-ego/gse"
```

Otherwise, to install the gse package, run the command:

```
go get -u github.com/go-ego/gse
```

## Use

```go
package main

import (
	_ "embed"
	"fmt"

	"github.com/go-ego/gse"
)

//go:embed testdata/test_en2.txt
var testDict string

//go:embed testdata/test_en.txt
var testEn string

var (
	text  = "To be or not to be, that's the question!"
	test1 = "Hiworld, Helloworld!"
)

func main() {
	// Load a comma-separated dictionary from a file.
	var seg1 gse.Segmenter
	seg1.DictSep = ","
	err := seg1.LoadDict("./testdata/test_en.txt")
	if err != nil {
		fmt.Println("Load dictionary error: ", err)
	}

	s1 := seg1.Cut(text)
	fmt.Println("seg1 Cut: ", s1)
	// seg1 Cut:  [to be   or   not to be ,   that's the question!]

	// With AlphaNum enabled, "Hiworld" is split using the dictionary (see output below).
	var seg2 gse.Segmenter
	seg2.AlphaNum = true
	err = seg2.LoadDict("./testdata/test_en_dict3.txt")
	if err != nil {
		fmt.Println("Load dictionary error: ", err)
	}

	s2 := seg2.Cut(test1)
	fmt.Println("seg2 Cut: ", s2)
	// seg2 Cut:  [hi world ,   hello world !]

	// Load both dictionaries from the embedded strings instead of from files.
	var seg3 gse.Segmenter
	seg3.AlphaNum = true
	seg3.DictSep = ","
	err = seg3.LoadDictEmbed(testDict + "\n" + testEn)
	if err != nil {
		fmt.Println("loadDictEmbed error: ", err)
	}

	s3 := seg3.Cut(text + test1)
	fmt.Println("seg3 Cut: ", s3)
	// seg3 Cut:  [to be   or   not to be ,   that's the question! hi world ,   hello world !]

	// example2()
}
```

Example2:

```go
package main

import (
	"fmt"
	"regexp"

	"github.com/go-ego/gse"
	"github.com/go-ego/gse/hmm/pos"
)

var (
	text = "Hello world, Helloworld. Winter is coming! こんにちは世界, 你好世界."

	segNew gse.Segmenter
	seg    gse.Segmenter
	posSeg pos.Segmenter
)

func main() {
	var err error

	// Build a segmenter from the "zh" dictionary and a custom dictionary file,
	// passing the "alpha" option.
	segNew, err = gse.New("zh,testdata/test_en_dict3.txt", "alpha")
	if err != nil {
		fmt.Println("New segmenter error: ", err)
	}

	// Load the default dictionary.
	if err = seg.LoadDict(); err != nil {
		fmt.Println("Load dictionary error: ", err)
	}

	// Load the default embedded dictionary:
	// seg.LoadDictEmbed()
	//
	// Load the Simplified Chinese dictionary:
	// seg.LoadDict("zh_s")
	// seg.LoadDictEmbed("zh_s")
	//
	// Load the Traditional Chinese dictionary:
	// seg.LoadDict("zh_t")
	//
	// Load the Japanese dictionary:
	// seg.LoadDict("jp")
	//
	// Load a dictionary from a file path:
	// seg.LoadDict("your gopath"+"/src/github.com/go-ego/gse/data/dict/dictionary.txt")

	cut()
	segCut()
}

func cut() {
	// The second argument enables HMM for words missing from the dictionary.
	hmm := segNew.Cut(text, true)
	fmt.Println("cut use hmm: ", hmm)

	hmm = segNew.CutSearch(text, true)
	fmt.Println("cut search use hmm: ", hmm)
	fmt.Println("analyze: ", segNew.Analyze(hmm, text))

	hmm = segNew.CutAll(text)
	fmt.Println("cut all: ", hmm)

	reg := regexp.MustCompile(`(\d+年|\d+月|\d+日|[\p{Latin}]+|[\p{Hangul}]+|\d+\.\d+|[a-zA-Z0-9]+)`)
	text1 := `헬로월드 헬로 서울, 2021年09月10日, 3.14`
	hmm = seg.CutDAG(text1, reg)
	fmt.Println("Cut with hmm and regexp: ", hmm, hmm[0], hmm[6])
}

func analyzeAndTrim(cut []string) {
	a := seg.Analyze(cut, "")
	fmt.Println("analyze the segment: ", a)

	cut = seg.Trim(cut)
	fmt.Println("cut all: ", cut)

	fmt.Println(seg.String(text, true))
	fmt.Println(seg.Slice(text, true))
}

func cutPos() {
	po := seg.Pos(text, true)
	fmt.Println("pos: ", po)
	po = seg.TrimPos(po)
	fmt.Println("trim pos: ", po)

	// Register the gse segmenter with the POS segmenter before cutting.
	posSeg.WithGse(seg)
	po = posSeg.Cut(text, true)
	fmt.Println("pos: ", po)

	po = posSeg.TrimWithPos(po, "zg")
	fmt.Println("trim pos: ", po)
}

func segCut() {
	// Text segmentation
	tb := []byte(text)
	fmt.Println(seg.String(text, true))

	segments := seg.Segment(tb)
	// Handle word segmentation results, search mode
	fmt.Println(gse.ToString(segments, true))
}
```

[Look at a custom dictionary example](/examples/dict/main.go)

```go
package main

import (
	_ "embed"
	"fmt"

	"github.com/go-ego/gse"
)

//go:embed test_en_dict3.txt
var testDict string

func main() {
	// var seg gse.Segmenter
	// seg.LoadDict("zh, testdata/zh/test_dict.txt, testdata/zh/test_dict1.txt")
	// seg.LoadStop()

	seg, err := gse.NewEmbed("zh, word 20 n"+testDict, "en")
	if err != nil {
		fmt.Println("NewEmbed error: ", err)
		return
	}
	// seg.LoadDictEmbed()
	seg.LoadStopEmbed()

	text1 := "Hello world, こんにちは世界, 你好世界!"
	s1 := seg.Cut(text1, true)
	fmt.Println(s1)
	fmt.Println("trim: ", seg.Trim(s1))
	fmt.Println("stop: ", seg.Stop(s1))
	fmt.Println(seg.String(text1, true))

	segments := seg.Segment([]byte(text1))
	fmt.Println(gse.ToString(segments))
}
```

[Look at a Chinese example](/examples/main.go)

[Look at a Japanese example](/examples/jp/main.go)

## Elasticsearch

To use gse with Elasticsearch, see [go-gse-elastic](https://github.com/vcaesar/go-gse-elastic).

## Authors

- [Maintainers](https://github.com/orgs/go-ego/people)

- [Contributors](https://github.com/go-ego/gse/graphs/contributors)

## License

Gse is primarily distributed under the terms of both the MIT license and the Apache License (Version 2.0).

See [LICENSE-APACHE](http://www.apache.org/licenses/LICENSE-2.0), [LICENSE-MIT](https://github.com/go-vgo/robotgo/blob/master/LICENSE).

Thanks to [sego](https://github.com/huichen/sego) and [jieba](https://github.com/fxsjy/jieba) ([jiebago](https://github.com/wangbin/jiebago)).