{"id":13413561,"url":"https://github.com/blevesearch/segment","last_synced_at":"2025-04-04T06:09:08.471Z","repository":{"id":21992849,"uuid":"25317859","full_name":"blevesearch/segment","owner":"blevesearch","description":"A Go library for performing Unicode Text Segmentation as described in Unicode Standard Annex #29","archived":false,"fork":false,"pushed_at":"2022-12-19T19:42:15.000Z","size":741,"stargazers_count":87,"open_issues_count":5,"forks_count":16,"subscribers_count":11,"default_branch":"master","last_synced_at":"2024-07-31T20:52:34.024Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/blevesearch.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-10-16T19:24:26.000Z","updated_at":"2024-05-17T16:42:39.000Z","dependencies_parsed_at":"2023-01-13T21:47:21.385Z","dependency_job_id":null,"html_url":"https://github.com/blevesearch/segment","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/blevesearch%2Fsegment","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/blevesearch%2Fsegment/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/blevesearch%2Fsegment/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/blevesearch%2Fsegment/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/blevesearch","download_url":"https://codeload.github.com/blevesearch/segment/tar.gz/refs/heads/master","host":{"name":"GitHub",
"url":"https://github.com","kind":"github","repositories_count":247128751,"owners_count":20888235,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-30T20:01:43.221Z","updated_at":"2025-04-04T06:09:08.439Z","avatar_url":"https://github.com/blevesearch.png","language":"Go","readme":"# segment\n\n[![Tests](https://github.com/blevesearch/segment/workflows/Tests/badge.svg?branch=master\u0026event=push)](https://github.com/blevesearch/segment/actions?query=workflow%3ATests+event%3Apush+branch%3Amaster)\n\nA Go library for performing Unicode Text Segmentation\nas described in [Unicode Standard Annex #29](http://www.unicode.org/reports/tr29/)\n\n## Features\n\n* Currently only segmentation at Word Boundaries is supported.\n\n## License\n\nApache License Version 2.0\n\n## Usage\n\nThe functionality is exposed in two ways:\n\n1.  You can use a bufio.Scanner with the SplitWords implementation of SplitFunc.\nThe SplitWords function will identify the appropriate word boundaries in the input\ntext and the Scanner will return tokens at the appropriate place.\n\n\t\tscanner := bufio.NewScanner(...)\n\t\tscanner.Split(segment.SplitWords)\n\t\tfor scanner.Scan() {\n\t\t\ttokenBytes := scanner.Bytes()\n\t\t}\n\t\tif err := scanner.Err(); err != nil {\n\t\t\tt.Fatal(err)\n\t\t}\n\n2.  Sometimes you would also like information returned about the type of token.\nTo do this we have introduced a new type named Segmenter.  
It works just like Scanner,\nbut it also returns a token type.\n\n\t\tsegmenter := segment.NewWordSegmenter(...)\n\t\tfor segmenter.Segment() {\n\t\t\ttokenBytes := segmenter.Bytes()\n\t\t\ttokenType := segmenter.Type()\n\t\t}\n\t\tif err := segmenter.Err(); err != nil {\n\t\t\tt.Fatal(err)\n\t\t}\n\n## Choosing Implementation\n\nBy default, segment does NOT use the fastest runtime implementation, because it adds approximately 5s to compilation time and may require more than 1 GB of RAM on the machine performing compilation.\n\nHowever, you can choose to build with the fastest runtime implementation by passing the build tag as follows:\n\n\t\t-tags 'prod'\n\n## Generating Code\n\nSeveral components in this package are generated.\n\n1.  Several Ragel rules files are generated from Unicode properties files.\n2.  The Ragel machine is generated from the Ragel rules.\n3.  Test tables are generated from the Unicode test files.\n\nAll of these can be generated by running:\n\n\t\tgo generate\n\n## Fuzzing\n\nThere is support for fuzzing the segment library with [go-fuzz](https://github.com/dvyukov/go-fuzz).\n\n1.  Install go-fuzz if you haven't already:\n\n\t\tgo get github.com/dvyukov/go-fuzz/go-fuzz\n\t\tgo get github.com/dvyukov/go-fuzz/go-fuzz-build\n\n2.  Build the package with go-fuzz:\n\n\t\tgo-fuzz-build github.com/blevesearch/segment\n\n3.  Convert the Unicode-provided test cases into the initial corpus for go-fuzz:\n\n\t\tgo test -v -run=TestGenerateWordSegmentFuzz -tags gofuzz_generate\n\n4.  
Run go-fuzz:\n\n\t\tgo-fuzz -bin=segment-fuzz.zip -workdir=workdir\n\n## Status\n\n\n[![Build Status](https://travis-ci.org/blevesearch/segment.svg?branch=master)](https://travis-ci.org/blevesearch/segment)\n\n[![Coverage Status](https://img.shields.io/coveralls/blevesearch/segment.svg)](https://coveralls.io/r/blevesearch/segment?branch=master)\n\n[![GoDoc](https://godoc.org/github.com/blevesearch/segment?status.svg)](https://godoc.org/github.com/blevesearch/segment)","funding_links":[],"categories":["Natural Language Processing","自然语言处理","自然語言處理","Bot Building","\u003cspan id=\"自然语言处理-natural-language-processing\"\u003e自然语言处理 Natural Language Processing\u003c/span\u003e","Microsoft Office"],"sub_categories":["Uncategorized","Tokenizers","暂未分类","Strings","高級控制台界面","分词器","Advanced Console UIs","暂未分类这些库被放在这里是因为其他类别似乎都不适合。","交流","\u003cspan id=\"高级控制台用户界面-advanced-console-uis\"\u003e高级控制台用户界面 Advanced Console UIs\u003c/span\u003e","高级控制台界面"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fblevesearch%2Fsegment","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fblevesearch%2Fsegment","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fblevesearch%2Fsegment/lists"}