{"id":27210797,"url":"https://github.com/rekram1-node/tokenizer","last_synced_at":"2025-04-10T01:27:39.999Z","repository":{"id":65637035,"uuid":"595869027","full_name":"rekram1-node/tokenizer","owner":"rekram1-node","description":"Natural Language Processing (NLP) Tokenization Libary designed for English. Fast, Lean, Customizable. Tokenizes text, replaces abbreviations, replaces contractions, lowercases words, optionally you can remove stop words as well ","archived":false,"fork":false,"pushed_at":"2023-02-07T01:04:09.000Z","size":86,"stargazers_count":2,"open_issues_count":1,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-06-21T03:14:45.339Z","etag":null,"topics":["blazingly-fast","contractions","customization","fast","go","golang","machine-learning","minimal","natural-language-processing","nlp","speed","stopwords","token","tokenization","tokenizer"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rekram1-node.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-02-01T01:05:17.000Z","updated_at":"2024-04-04T20:55:05.000Z","dependencies_parsed_at":"2023-02-17T10:45:55.889Z","dependency_job_id":null,"html_url":"https://github.com/rekram1-node/tokenizer","commit_stats":{"total_commits":10,"total_committers":2,"mean_commits":5.0,"dds":"0.19999999999999996","last_synced_commit":"f572869fb7f515a0dc15acb7745324b6a576931d"},"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/repositories/rekram1-node%2Ftokenizer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rekram1-node%2Ftokenizer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rekram1-node%2Ftokenizer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rekram1-node%2Ftokenizer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rekram1-node","download_url":"https://codeload.github.com/rekram1-node/tokenizer/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248139722,"owners_count":21054162,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["blazingly-fast","contractions","customization","fast","go","golang","machine-learning","minimal","natural-language-processing","nlp","speed","stopwords","token","tokenization","tokenizer"],"created_at":"2025-04-10T01:27:39.117Z","updated_at":"2025-04-10T01:27:39.937Z","avatar_url":"https://github.com/rekram1-node.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Tokenizer\n\n[![Go Report](https://goreportcard.com/badge/github.com/rekram1-node/tokenizer)](https://goreportcard.com/report/github.com/rekram1-node/tokenizer) [![license](http://img.shields.io/badge/license-MIT-red.svg?style=flat)](https://github.com/rekram1-node/tokenizer/blob/main/LICENSE) ![Build Status](https://github.com/rekram1-node/tokenizer/actions/workflows/main.yml/badge.svg) [![Go 
Reference](https://pkg.go.dev/github.com/rekram1-node/tokenizer.svg)](https://pkg.go.dev/github.com/rekram1-node/tokenizer)\n\nNatural Language Processing (NLP) Tokenization Library designed for English. Fast, Lean, Customizable. It tokenizes text, replaces abbreviations, replaces contractions, and lowercases words; optionally, it can also remove stop words (this must be enabled explicitly, see [usage](#usage)). This library is a work in progress, but all features mentioned in this README should work as advertised.\n\nTokenizing text is one of the first steps in NLP. This performant library should help Go users get started with their NLP tasks.\n\n## Features\n\n* [Convert Text to Tokens](#usage)\n* [Replace Contractions](#usage)\n* [Replace Abbreviations](#usage)\n* [Remove Stop Words](#usage) - defaults to keeping them\n* [Blazingly Fast](#benchmarks)\n* [Low Allocation](#benchmarks)\n* Practically zero dependencies (the only dependency is [testify](https://github.com/stretchr/testify) for unit testing)\n\nComing soon:\n- Streamed Reading\n- Sentence Tokenization\n\n## Installation\n\n```bash\ngo get -u github.com/rekram1-node/tokenizer\n```\n\n## Usage\n\nFor detailed examples, see [examples](https://github.com/rekram1-node/tokenizer/tree/main/examples)\n\n### Default\n\n```go\npackage main\n\nimport (\n\t\"fmt\"\n\n\t\"github.com/rekram1-node/tokenizer/tokenizer\"\n)\n\nfunc main() {\n\tmyStr := \"This is my long string! I can replace contractions like can't or they've! I can replace abbreviations such as: demonstr. 
or jan.\"\n\tt := tokenizer.New()\n\ttokens := t.TokenizeString(myStr)\n\tfmt.Println(tokens)\n\t// Output: [this is my long string i can replace contractions like cannot or they have i can replace abbreviations such as demonstration or jan.]\n\n\t/*\n\t\tNote: you can remove stop words too!!!\n\t*/\n\tt.SetStopWordRemoval(true)\n\tmyOtherStr := \"This is another string to demonstrate stop words removal, words like: and or but the are are all stop word examples\"\n\ttokens = t.TokenizeString(myOtherStr)\n\tfmt.Println(tokens)\n\t// Output: [string demonstrate stop words removal words stop word examples]\n}\n```\n\n### Custom Settings\n\n```go\npackage main\n\nimport (\n\t\"fmt\"\n\t\"log\"\n\n\t\"github.com/rekram1-node/tokenizer/languages\"\n\t\"github.com/rekram1-node/tokenizer/tokenizer\"\n)\n\nfunc main() {\n\t// add a string containing all the \"separators\" you want\n\t// important note: including the \".\" would degrade the ability to replace abbreviations\n\tcustomSeparators := \"\\t\\n\\r ,:?\\\"!;()\"\n\n\t// specify your settings here\n\tsettings := \u0026tokenizer.Settings{\n\t\tKeepSeparators:  false,\n\t\tRemoveStopWords: true,\n\t\t// you can have your own language configuration, see the language struct\n\t\t/*\n\t\t\ttype Lanuage struct {\n\t\t\t\tStopWords     map[string]uint8\n\t\t\t\tContractions  map[string]string\n\t\t\t\tAbbreviations map[string]string\n\t\t\t}\n\n\t\t\tyou can create your own using languages.NewLanguage(yourStopWords, yourContractions, yourAbbreviations)\n\t\t*/\n\t\tLanuage: languages.English,\n\t}\n\t// custom settings return an error in case of a misconfigured/missing setting\n\tt, err := settings.Custom(customSeparators)\n\tif err != nil {\n\t\tlog.Fatal(err)\n\t}\n\n\tmyStr := \"This is my long string! I can replace contractions like can't or they've! I can replace abbreviations such as: demonstr. or jan. 
This is another string to demonstrate stop words removal, words like: and or but the are are all stop word examples\"\n\ttokens := t.TokenizeString(myStr)\n\tfmt.Println(tokens)\n\t// Output: [long string replace contractions replace abbreviations demonstration january string demonstrate stop words removal words stop word examples]\n}\n```\n\n## Benchmarks\n\nSee [benchmark test](https://github.com/rekram1-node/tokenizer/blob/main/tokenizer/benchmark_test.go)\n\nUsing the benchmark test, you can see that even with 1 million words we still only have a meager 39 allocations per operation. I believe this number can be beaten as well, but I am still looking into the feasibility of some operations.\n\n```text\ntask: [bench] go test -bench=. ./tokenizer -run=^# -count=10 -benchmem | tee preformance.txt\ngoos: darwin\ngoarch: arm64\npkg: github.com/rekram1-node/tokenizer/tokenizer\nBenchmarkTokenize/word_count:_10-8                893482              1302 ns/op             496 B/op          5 allocs/op\nBenchmarkTokenize/word_count:_10-8                909124              1309 ns/op             496 B/op          5 allocs/op\nBenchmarkTokenize/word_count:_10-8                908605              1298 ns/op             496 B/op          5 allocs/op\nBenchmarkTokenize/word_count:_10-8                905206              1270 ns/op             496 B/op          5 allocs/op\nBenchmarkTokenize/word_count:_10-8                900795              1254 ns/op             496 B/op          5 allocs/op\nBenchmarkTokenize/word_count:_10-8                912558              1267 ns/op             496 B/op          5 allocs/op\nBenchmarkTokenize/word_count:_10-8                928588              1255 ns/op             496 B/op          5 allocs/op\nBenchmarkTokenize/word_count:_10-8                885170              1264 ns/op             496 B/op          5 allocs/op\nBenchmarkTokenize/word_count:_10-8                894373              1253 ns/op             496 B/op          5 
allocs/op\nBenchmarkTokenize/word_count:_10-8                917389              1258 ns/op             496 B/op          5 allocs/op\nBenchmarkTokenize/word_count:_100-8                81775             13954 ns/op            4080 B/op          8 allocs/op\nBenchmarkTokenize/word_count:_100-8                85268             13914 ns/op            4080 B/op          8 allocs/op\nBenchmarkTokenize/word_count:_100-8                85263             13912 ns/op            4080 B/op          8 allocs/op\nBenchmarkTokenize/word_count:_100-8                85602             13973 ns/op            4080 B/op          8 allocs/op\nBenchmarkTokenize/word_count:_100-8                85400             13908 ns/op            4080 B/op          8 allocs/op\nBenchmarkTokenize/word_count:_100-8                85245             13941 ns/op            4080 B/op          8 allocs/op\nBenchmarkTokenize/word_count:_100-8                85260             13896 ns/op            4080 B/op          8 allocs/op\nBenchmarkTokenize/word_count:_100-8                85585             14014 ns/op            4080 B/op          8 allocs/op\nBenchmarkTokenize/word_count:_100-8                85687             13925 ns/op            4080 B/op          8 allocs/op\nBenchmarkTokenize/word_count:_100-8                84966             13921 ns/op            4080 B/op          8 allocs/op\nBenchmarkTokenize/word_count:_1000-8                8143            143447 ns/op           50416 B/op         12 allocs/op\nBenchmarkTokenize/word_count:_1000-8                7996            144030 ns/op           50416 B/op         12 allocs/op\nBenchmarkTokenize/word_count:_1000-8                8072            144614 ns/op           50416 B/op         12 allocs/op\nBenchmarkTokenize/word_count:_1000-8                8059            143350 ns/op           50416 B/op         12 allocs/op\nBenchmarkTokenize/word_count:_1000-8                8186            143046 ns/op           50416 B/op         12 
allocs/op\nBenchmarkTokenize/word_count:_1000-8                8046            144263 ns/op           50416 B/op         12 allocs/op\nBenchmarkTokenize/word_count:_1000-8                8026            143982 ns/op           50416 B/op         12 allocs/op\nBenchmarkTokenize/word_count:_1000-8                8038            144174 ns/op           50416 B/op         12 allocs/op\nBenchmarkTokenize/word_count:_1000-8                8079            144182 ns/op           50416 B/op         12 allocs/op\nBenchmarkTokenize/word_count:_1000-8                8030            142584 ns/op           50416 B/op         12 allocs/op\nBenchmarkTokenize/word_count:_10000-8                795           1447904 ns/op          685299 B/op         19 allocs/op\nBenchmarkTokenize/word_count:_10000-8                784           1460662 ns/op          685299 B/op         19 allocs/op\nBenchmarkTokenize/word_count:_10000-8                824           1459105 ns/op          685300 B/op         19 allocs/op\nBenchmarkTokenize/word_count:_10000-8                818           1462055 ns/op          685299 B/op         19 allocs/op\nBenchmarkTokenize/word_count:_10000-8                805           1454443 ns/op          685299 B/op         19 allocs/op\nBenchmarkTokenize/word_count:_10000-8                818           1458763 ns/op          685300 B/op         19 allocs/op\nBenchmarkTokenize/word_count:_10000-8                819           1453174 ns/op          685299 B/op         19 allocs/op\nBenchmarkTokenize/word_count:_10000-8                811           1452126 ns/op          685298 B/op         19 allocs/op\nBenchmarkTokenize/word_count:_10000-8                823           1450348 ns/op          685299 B/op         19 allocs/op\nBenchmarkTokenize/word_count:_10000-8                817           1457028 ns/op          685298 B/op         19 allocs/op\nBenchmarkTokenize/word_count:_100000-8                70          15327946 ns/op         8942875 B/op         29 
allocs/op\nBenchmarkTokenize/word_count:_100000-8                75          15442076 ns/op         8942871 B/op         29 allocs/op\nBenchmarkTokenize/word_count:_100000-8                76          15366360 ns/op         8942882 B/op         29 allocs/op\nBenchmarkTokenize/word_count:_100000-8                75          15329089 ns/op         8942887 B/op         29 allocs/op\nBenchmarkTokenize/word_count:_100000-8                74          15337678 ns/op         8942874 B/op         29 allocs/op\nBenchmarkTokenize/word_count:_100000-8                74          15398484 ns/op         8942891 B/op         29 allocs/op\nBenchmarkTokenize/word_count:_100000-8                74          15302757 ns/op         8942877 B/op         29 allocs/op\nBenchmarkTokenize/word_count:_100000-8                75          15362398 ns/op         8942871 B/op         29 allocs/op\nBenchmarkTokenize/word_count:_100000-8                74          15387052 ns/op         8942882 B/op         29 allocs/op\nBenchmarkTokenize/word_count:_100000-8                75          15310445 ns/op         8942881 B/op         29 allocs/op\nBenchmarkTokenize/word_count:_1000000-8                6         171073882 ns/op        88036674 B/op         39 allocs/op\nBenchmarkTokenize/word_count:_1000000-8                6         169618688 ns/op        88036672 B/op         39 allocs/op\nBenchmarkTokenize/word_count:_1000000-8                6         169880493 ns/op        88036592 B/op         39 allocs/op\nBenchmarkTokenize/word_count:_1000000-8                6         169492944 ns/op        88036672 B/op         39 allocs/op\nBenchmarkTokenize/word_count:_1000000-8                6         170651521 ns/op        88036640 B/op         39 allocs/op\nBenchmarkTokenize/word_count:_1000000-8                6         169312208 ns/op        88036624 B/op         39 allocs/op\nBenchmarkTokenize/word_count:_1000000-8                6         169648021 ns/op        88036640 B/op         39 
allocs/op\nBenchmarkTokenize/word_count:_1000000-8                6         170238188 ns/op        88036624 B/op         39 allocs/op\nBenchmarkTokenize/word_count:_1000000-8                6         170449750 ns/op        88036656 B/op         39 allocs/op\nBenchmarkTokenize/word_count:_1000000-8                6         169968132 ns/op        88036592 B/op         39 allocs/op\nPASS\nok      github.com/rekram1-node/tokenizer/tokenizer     73.772s\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frekram1-node%2Ftokenizer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frekram1-node%2Ftokenizer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frekram1-node%2Ftokenizer/lists"}