# gotokenizer [![GoDoc](https://godoc.org/github.com/xujiajun/gotokenizer?status.svg)](https://godoc.org/github.com/xujiajun/gotokenizer) <a href="https://travis-ci.org/xujiajun/gotokenizer"><img src="https://travis-ci.org/xujiajun/gotokenizer.svg?branch=master" alt="Build Status"></a> [![Coverage Status](https://coveralls.io/repos/github/xujiajun/gotokenizer/badge.svg?branch=master)](https://coveralls.io/github/xujiajun/gotokenizer?branch=master) [![Go Report Card](https://goreportcard.com/badge/github.com/xujiajun/gotokenizer)](https://goreportcard.com/report/github.com/xujiajun/gotokenizer) [![License](https://img.shields.io/badge/license-Apache2.0-blue.svg?style=flat-square)](https://opensource.org/licenses/Apache-2.0) [![Awesome](https://awesome.re/mentioned-badge.svg)](https://github.com/avelino/awesome-go#natural-language-processing)

A tokenizer for Go based on dictionary and Bigram language models.
(Currently it only supports Chinese word segmentation.)

## Motivation

I wanted a simple tokenizer with no unnecessary overhead: it uses only the standard library, follows good practices, and is well tested.

## Features

* Supports Maximum Matching
* Supports Minimum Matching
* Supports Reverse Maximum Matching
* Supports Reverse Minimum Matching
* Supports Bidirectional Maximum Matching
* Supports Bidirectional Minimum Matching
* Supports stop-token filtering
* Supports custom word filters

## Installation

```
go get -u github.com/xujiajun/gotokenizer
```

## Usage

```
package main

import (
	"fmt"

	"github.com/xujiajun/gotokenizer"
)

func main() {
	text := "gotokenizer是一款基于字典和Bigram模型纯go语言编写的分词器，支持6种分词算法。支持stopToken过滤和自定义word过滤功能。"

	dictPath := "/Users/xujiajun/go/src/github.com/xujiajun/gotokenizer/data/zh/dict.txt"
	// NewMaxMatch's default wordFilter is NumAndLetterWordFilter
	mm := gotokenizer.NewMaxMatch(dictPath)
	// load the dictionary
	mm.LoadDict()

	fmt.Println(mm.Get(text)) //[gotokenizer 是 一款 基于 字典 和 Bigram 模型 纯 go 语言 编写 的 分词器 ， 支持 6 种 分词 算法 。 支持 stopToken 过滤 和 自定义 word 过滤 功能 。] <nil>

	// enable stop-token filtering
	mm.EnabledFilterStopToken = true
	mm.StopTokens = gotokenizer.NewStopTokens()
	stopTokenDicPath := "/Users/xujiajun/go/src/github.com/xujiajun/gotokenizer/data/zh/stop_tokens.txt"
	mm.StopTokens.Load(stopTokenDicPath)

	fmt.Println(mm.Get(text)) //[gotokenizer 一款 字典 Bigram 模型 go 语言 编写 分词器 支持 6 种 分词 算法 支持 stopToken 过滤 自定义 word 过滤 功能] <nil>
	fmt.Println(mm.GetFrequency(text)) //map[6:1 种:1 算法:1 过滤:2 支持:2 Bigram:1 模型:1 编写:1 gotokenizer:1 go:1 分词器:1 分词:1 word:1 功能:1 一款:1 语言:1 stopToken:1 自定义:1 字典:1] <nil>

}

```

> For more examples, see the tests.

## Contributing

If you'd like to help out with the project, feel free to put up a Pull Request.

## Author

* [xujiajun](https://github.com/xujiajun)

## License

gotokenizer is open-source software licensed under the [Apache-2.0](https://opensource.org/licenses/Apache-2.0) license.

## Acknowledgements

This package is inspired by the following:

https://github.com/ysc/word