Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/xujiajun/gotokenizer
A tokenizer based on dictionary and Bigram language models for Go. (Currently only supports Chinese segmentation.)
golang segmentation tokenizer
- Host: GitHub
- URL: https://github.com/xujiajun/gotokenizer
- Owner: xujiajun
- License: apache-2.0
- Created: 2018-10-11T03:22:36.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2019-04-10T09:39:09.000Z (over 5 years ago)
- Last Synced: 2024-11-01T22:42:51.609Z (about 1 month ago)
- Topics: golang, segmentation, tokenizer
- Language: Go
- Homepage:
- Size: 10.1 MB
- Stars: 21
- Watchers: 3
- Forks: 7
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-go - gotokenizer - A tokenizer based on the dictionary and Bigram language models for Golang. (Now only support chinese segmentation) (Natural Language Processing / Tokenizers)
- zero-alloc-awesome-go - gotokenizer - A tokenizer based on the dictionary and Bigram language models for Golang. (Now only support chinese segmentation) (Natural Language Processing / Tokenizers)
- awesome-go-extra - gotokenizer (created 2018-10-11, last pushed 2019-04-10) (Bot Building / Tokenizers)
- awesome-go - gotokenizer - A tokenizer based on the dictionary and Bigram language models for Golang. (Now only support chinese segmentation) - ★ 2 (Natural Language Processing)
README
# gotokenizer [![GoDoc](https://godoc.org/github.com/xujiajun/gotokenizer?status.svg)](https://godoc.org/github.com/xujiajun/gotokenizer) [![Coverage Status](https://coveralls.io/repos/github/xujiajun/gotokenizer/badge.svg?branch=master)](https://coveralls.io/github/xujiajun/gotokenizer?branch=master) [![Go Report Card](https://goreportcard.com/badge/github.com/xujiajun/gotokenizer)](https://goreportcard.com/report/github.com/xujiajun/gotokenizer) [![License](https://img.shields.io/badge/license-Apache2.0-blue.svg?style=flat-square)](https://opensource.org/licenses/Apache-2.0) [![Awesome](https://awesome.re/mentioned-badge.svg)](https://github.com/avelino/awesome-go#natural-language-processing)
A tokenizer based on dictionary and Bigram language models for Go. (Currently only supports Chinese segmentation.)

## Motivation
I wanted a simple tokenizer with no unnecessary overhead that uses only the standard library, follows good practices, and is well tested.
## Features
* Supports Maximum Matching
* Supports Minimum Matching
* Supports Reverse Maximum Matching
* Supports Reverse Minimum Matching
* Supports Bidirectional Maximum Matching
* Supports Bidirectional Minimum Matching
* Supports stop-token filtering
* Supports custom word filters

## Installation
```
go get -u github.com/xujiajun/gotokenizer
```

## Usage
```go
package main

import (
	"fmt"

	"github.com/xujiajun/gotokenizer"
)

func main() {
	text := "gotokenizer是一款基于字典和Bigram模型纯go语言编写的分词器,支持6种分词算法。支持stopToken过滤和自定义word过滤功能。"
	dictPath := "/Users/xujiajun/go/src/github.com/xujiajun/gotokenizer/data/zh/dict.txt"

	// NewMaxMatch default wordFilter is NumAndLetterWordFilter
	mm := gotokenizer.NewMaxMatch(dictPath)
	// load dict
	mm.LoadDict()

	fmt.Println(mm.Get(text)) // [gotokenizer 是 一款 基于 字典 和 Bigram 模型 纯 go 语言 编写 的 分词器 , 支持 6 种 分词 算法 。 支持 stopToken 过滤 和 自定义 word 过滤 功能 。]

	// enable stop-token filtering
	mm.EnabledFilterStopToken = true
	mm.StopTokens = gotokenizer.NewStopTokens()
	stopTokenDicPath := "/Users/xujiajun/go/src/github.com/xujiajun/gotokenizer/data/zh/stop_tokens.txt"
	mm.StopTokens.Load(stopTokenDicPath)

	fmt.Println(mm.Get(text))          // [gotokenizer 一款 字典 Bigram 模型 go 语言 编写 分词器 支持 6 种 分词 算法 支持 stopToken 过滤 自定义 word 过滤 功能]
	fmt.Println(mm.GetFrequency(text)) // map[6:1 种:1 算法:1 过滤:2 支持:2 Bigram:1 模型:1 编写:1 gotokenizer:1 go:1 分词器:1 分词:1 word:1 功能:1 一款:1 语言:1 stopToken:1 自定义:1 字典:1]
}
```
> For more examples, see the tests.
## Contributing
If you'd like to help out with the project, you can open a Pull Request.
## Author
* [xujiajun](https://github.com/xujiajun)
## License
gotokenizer is open-source software licensed under the [Apache-2.0](https://opensource.org/licenses/Apache-2.0) license.
## Acknowledgements
This package is inspired by the following:
https://github.com/ysc/word