{"id":13413975,"url":"https://github.com/bzick/tokenizer","last_synced_at":"2025-04-28T17:30:45.977Z","repository":{"id":39616588,"uuid":"418846160","full_name":"bzick/tokenizer","owner":"bzick","description":"Tokenizer (lexer) for golang","archived":false,"fork":false,"pushed_at":"2024-04-09T17:00:33.000Z","size":105,"stargazers_count":89,"open_issues_count":1,"forks_count":5,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-07-31T20:53:12.686Z","etag":null,"topics":["golang","lexer","parse","parser","tokenizer","tokenizing"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bzick.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-10-19T08:58:18.000Z","updated_at":"2024-07-21T08:24:47.000Z","dependencies_parsed_at":"2022-07-19T18:04:27.936Z","dependency_job_id":"5678dc9d-bdd6-4254-b1ef-793a595272aa","html_url":"https://github.com/bzick/tokenizer","commit_stats":null,"previous_names":[],"tags_count":13,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bzick%2Ftokenizer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bzick%2Ftokenizer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bzick%2Ftokenizer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bzick%2Ftokenizer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bzick","download_url":"https://codeload.github.com/bzick/
tokenizer/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224124862,"owners_count":17259746,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["golang","lexer","parse","parser","tokenizer","tokenizing"],"created_at":"2024-07-30T20:01:54.188Z","updated_at":"2025-04-28T17:30:45.970Z","avatar_url":"https://github.com/bzick.png","language":"Go","readme":"# Tokenizer \n\n[![Build Status](https://github.com/bzick/tokenizer/actions/workflows/tokenizer.yml/badge.svg)](https://github.com/bzick/tokenizer/actions/workflows/tokenizer.yml)\n[![codecov](https://codecov.io/gh/bzick/tokenizer/branch/master/graph/badge.svg?token=MFY5NWATGC)](https://codecov.io/gh/bzick/tokenizer)\n[![Go Report Card](https://goreportcard.com/badge/github.com/bzick/tokenizer?rnd=2)](https://goreportcard.com/report/github.com/bzick/tokenizer)\n[![GoDoc](https://godoc.org/github.com/bzick/tokenizer?status.svg)](https://godoc.org/github.com/bzick/tokenizer)\n\nTokenizer — parses any string, slice or infinite buffer into tokens.\n\nMain features:\n\n* High performance.\n* No regexp.\n* Provides a [simple API](https://pkg.go.dev/github.com/bzick/tokenizer).\n* Supports [integer](#integer-number) and [float](#float-number) numbers.\n* Supports [quoted or otherwise \"framed\"](#framed-string) strings.\n* Supports [injections](#injection-in-framed-string) in quoted or \"framed\" strings.\n* Supports Unicode.\n* [Customization of tokens](#user-defined-tokens).\n* Autodetects whitespace symbols.\n* Parses any data syntax (xml, 
[json](https://github.com/bzick/tokenizer/blob/master/example_test.go), yaml), any programming language.\n* Single pass through the data.\n* Parses infinite incoming data and doesn't panic.\n\nUse cases:\n- Parsing html, xml, [json](./example_test.go), yaml and other text formats.\n- Parsing huge or infinite texts. \n- Parsing any programming language.\n- Parsing templates.\n- Parsing formulas.\n\nFor example, parsing SQL `WHERE` condition `user_id = 119 and modified \u003e \"2020-01-01 00:00:00\" or amount \u003e= 122.34`:\n\n```go\nimport \"github.com/bzick/tokenizer\"\n\n// define custom token keys\nconst (\n\tTEquality = iota + 1\n\tTDot\n\tTMath\n\tTDoubleQuoted\n)\n\n// configure tokenizer\nparser := tokenizer.New()\nparser.DefineTokens(TEquality, []string{\"\u003c\", \"\u003c=\", \"=\", \"==\", \"\u003e=\", \"\u003e\", \"!=\"})\nparser.DefineTokens(TDot, []string{\".\"})\nparser.DefineTokens(TMath, []string{\"+\", \"-\", \"/\", \"*\", \"%\"})\nparser.DefineStringToken(TDoubleQuoted, `\"`, `\"`).SetEscapeSymbol(tokenizer.BackSlash)\nparser.AllowKeywordSymbols(tokenizer.Underscore, tokenizer.Numbers)\n\n// create the token stream\nstream := parser.ParseString(`user_id = 119 and modified \u003e \"2020-01-01 00:00:00\" or amount \u003e= 122.34`)\ndefer stream.Close()\n\n// iterate over each token\nfor stream.IsValid() {\n\tif stream.CurrentToken().Is(tokenizer.TokenKeyword) {\n\t\tfield := stream.CurrentToken().ValueString()\n\t\t// ...\n\t}\n\tstream.GoNext()\n}\n```\n\nstream tokens:\n```\nstring:  user_id  =  119  and  modified  \u003e  \"2020-01-01 00:00:00\"  or  amount  \u003e=  122.34\ntokens: |user_id| =| 119| and| modified| \u003e| \"2020-01-01 00:00:00\"| or| amount| \u003e=| 122.34|\n        |   0   | 1|  2 |  3 |    4    | 5|            6         | 7 |    8  | 9 |    10 |\n\n0:  {key: TokenKeyword, value: \"user_id\"}                token.Value()          == \"user_id\"\n1:  {key: TEquality, value: \"=\"}                         token.Value()          == 
\"=\"\n2:  {key: TokenInteger, value: \"119\"}                    token.ValueInt64()     == 119\n3:  {key: TokenKeyword, value: \"and\"}                    token.Value()          == \"and\"\n4:  {key: TokenKeyword, value: \"modified\"}               token.Value()          == \"modified\"\n5:  {key: TEquality, value: \"\u003e\"}                         token.Value()          == \"\u003e\"\n6:  {key: TokenString, value: \"\\\"2020-01-01 00:00:00\\\"\"} token.ValueUnescaped() == \"2020-01-01 00:00:00\"\n7:  {key: TokenKeyword, value: \"or\"}                     token.Value()          == \"or\"\n8:  {key: TokenKeyword, value: \"amount\"}                 token.Value()          == \"amount\"\n9:  {key: TEquality, value: \"\u003e=\"}                        token.Value()          == \"\u003e=\"\n10: {key: TokenFloat, value: \"122.34\"}                   token.ValueFloat64()   == 122.34\n```\n\nMore examples:\n- [JSON parser](./example_test.go)\n\n## Begin\n\n### Create and parse\n\n```go\nimport \"github.com/bzick/tokenizer\"\n\nparser := tokenizer.New()\nparser.AllowKeywordSymbols(tokenizer.Underscore, []rune{})\n// ... 
and other configuration code\n```\n\nThere are two ways to **parse a string or slice**:\n\n- `parser.ParseString(str)`\n- `parser.ParseBytes(slice)`\n\nThe package also allows you to **parse an endless stream** of data into tokens.\nFor parsing, pass an `io.Reader` from which data will be read chunk by chunk:\n\n```go\nfp, err := os.Open(\"data.json\") // huge JSON file\n// check err, configure tokenizer ...\n\nstream := parser.ParseStream(fp, 4096).SetHistorySize(10)\ndefer stream.Close()\nfor stream.IsValid() {\n\t// ...\n\tstream.GoNext()\n}\n```\n\n## Embedded tokens\n\n- `tokenizer.TokenUnknown` — unspecified token key.\n- `tokenizer.TokenKeyword` — keyword; any combination of letters, including unicode letters.\n- `tokenizer.TokenInteger` — integer value.\n- `tokenizer.TokenFloat` — float/double value.\n- `tokenizer.TokenString` — quoted string.\n- `tokenizer.TokenStringFragment` — a fragment of a framed (quoted) string.\n\n### Unknown token\n\nA token is marked as `tokenizer.TokenUnknown` if the parser cannot recognize it:\n\n```go\nstream := parser.ParseString(`one!`)\n```\n\n```\nstream: [\n    {\n        Key: tokenizer.TokenKeyword\n        Value: \"one\"\n    },\n    {\n        Key: tokenizer.TokenUnknown\n        Value: \"!\"\n    }\n]\n```\n\nBy default, `TokenUnknown` tokens are added to the stream. 
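To make this behavior concrete, here is a toy, stdlib-only sketch (not the library's code; `toyTokenize` and its `token` struct are invented for illustration) that classifies input the way the stream above shows: runs of letters become keyword tokens, and any unrecognized symbol becomes an unknown token.

```go
package main

import (
	"fmt"
	"unicode"
)

// token mirrors the {Key, Value} pairs shown in the stream dumps above.
type token struct {
	Key   string
	Value string
}

// toyTokenize splits s into keyword and unknown tokens: runs of letters
// become keywords, whitespace separates tokens, and any other rune
// becomes a single "unknown" token.
func toyTokenize(s string) []token {
	var out []token
	var word []rune
	flush := func() {
		if len(word) > 0 {
			out = append(out, token{"TokenKeyword", string(word)})
			word = word[:0]
		}
	}
	for _, r := range s {
		switch {
		case unicode.IsLetter(r):
			word = append(word, r)
		case unicode.IsSpace(r):
			flush()
		default:
			flush()
			out = append(out, token{"TokenUnknown", string(r)})
		}
	}
	flush()
	return out
}

func main() {
	fmt.Println(toyTokenize("one!")) // [{TokenKeyword one} {TokenUnknown !}]
}
```

The real tokenizer does much more (numbers, framed strings, custom tokens), but its unknown-token fallback follows the same idea.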
\nSetting `tokenizer.StopOnUndefinedToken()` stops the parser when `tokenizer.TokenUnknown` appears in the stream.\n\n```go\nstream := parser.ParseString(`one!`)\n```\n\n```\nstream: [\n    {\n        Key: tokenizer.TokenKeyword\n        Value: \"one\"\n    }\n]\n```\n\nPlease note that if the `StopOnUndefinedToken()` setting is enabled, the string may not be fully parsed.\nTo detect this, compare `stream.GetParsedLength()` with the length of the original string.\n\n### Keywords\n\nAny word that is not a custom token is stored in a single token as `tokenizer.TokenKeyword`.\n\nThe word can contain unicode characters, and it can be configured to contain other characters, like numbers and underscores (see `tokenizer.AllowKeywordSymbols()`).\n\n```go\nstream := parser.ParseString(`one 二 три`)\n```\n\n```\nstream: [\n    {\n        Key: tokenizer.TokenKeyword\n        Value: \"one\"\n    },\n    {\n        Key: tokenizer.TokenKeyword\n        Value: \"二\"\n    },\n    {\n        Key: tokenizer.TokenKeyword\n        Value: \"три\"\n    }\n]\n```\n\nKeyword rules may be modified with `tokenizer.AllowKeywordSymbols(majorSymbols, minorSymbols)`:\n\n- Major symbols (any quantity in the keyword) can appear at the beginning, in the middle, and at the end of the keyword.\n- Minor symbols (any quantity in the keyword) can appear in the middle and at the end of the keyword.\n\n```go\nparser.AllowKeywordSymbols(tokenizer.Underscore, tokenizer.Numbers)\n// allows: \"_one23\", \"__one2__two3\"\n\nparser.AllowKeywordSymbols([]rune{'_', '@'}, tokenizer.Numbers)\n// allows: \"one@23\", \"@_one_two23\", \"_one23\", \"_one2_two3\", \"@@one___two@_9\"\n```\n\n### Integer number\n\nAny integer is stored as one token with key `tokenizer.TokenInteger`.\n\n```go\nstream := parser.ParseString(`223 999`)\n```\n\n```\nstream: [\n    {\n        Key: tokenizer.TokenInteger\n        Value: \"223\"\n    },\n    {\n        Key: 
tokenizer.TokenInteger\n        Value: \"999\"\n    },\n]\n```\n\nTo get int64 from the token value use `stream.CurrentToken().GetInt64()`:\n\n```go\nstream := parser.ParseString(\"123\")\nfmt.Printf(\"Token is %d\", stream.CurrentToken().GetInt64())  // Token is 123\n```\n\n### Float number\n\nAny float number is stored as one token with key `tokenizer.TokenFloat`. A float number may\n- have a decimal point, for example `1.2`\n- have an exponent, for example `1e6`\n- have a lower-case `e` or upper-case `E` in the exponent, for example `1E6`, `1e6`\n- have a sign in the exponent, for example `1e-6`, `1e6`, `1e+6`\n\n```go\nstream := parser.ParseString(`1.3e-8`)\n```\n\n```\nstream: [\n    {\n        Key: tokenizer.TokenFloat\n        Value: \"1.3e-8\"\n    },\n]\n```\n\nTo get float64 from the token value use `stream.CurrentToken().GetFloat64()`:\n\n```go\nstream := parser.ParseString(\"1.3e2\")\nfmt.Printf(\"Token is %g\", stream.CurrentToken().GetFloat64())  // Token is 130\n```\n\n### Framed string\n\nStrings that are framed with tokens are called framed strings. 
An obvious example is a quoted string like `\"one two\"`; the quotes are the edge tokens.\n\nYou can create and customize a framed string token through `tokenizer.DefineStringToken()`:\n\n```go\nconst TokenDoubleQuotedString = 10\n// ...\nparser.DefineStringToken(TokenDoubleQuotedString, `\"`, `\"`).SetEscapeSymbol('\\\\')\n// ...\nstream := parser.ParseString(`\"two \\\"three\"`)\n```\n\n```\nstream: [\n    {\n        Key: tokenizer.TokenString\n        Value: \"\\\"two \\\\\\\"three\\\"\"\n    },\n]\n```\n\nTo get the framed string without edge tokens and escape characters, use the `stream.CurrentToken().ValueUnescaped()` method:\n\n```go\nvalue := stream.CurrentToken().ValueUnescaped() // result: two \"three\n```\n\nThe method `token.StringKey()` returns the string token key defined in `DefineStringToken()`:\n\n```go\nif stream.CurrentToken().StringKey() == TokenDoubleQuotedString {\n\t// true\n}\n```\n\n### Injection in framed string\n\nStrings can contain expression substitutions that can be parsed into tokens, for example `\"one {{two}} three\"`.\nString fragments before, between, and after substitutions are stored as `tokenizer.TokenStringFragment` tokens. 
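As a rough, stdlib-only illustration of that splitting (a naive sketch with an invented `splitInjections` helper, not the library's implementation; it ignores escaping and nested markers):

```go
package main

import (
	"fmt"
	"strings"
)

// splitInjections breaks the body of a framed string into plain fragments
// and injected expressions delimited by openTok/closeTok markers.
// Naive sketch: assumes markers are balanced and never nested.
func splitInjections(body, openTok, closeTok string) (fragments, injections []string) {
	for {
		i := strings.Index(body, openTok)
		if i < 0 {
			break
		}
		j := strings.Index(body[i+len(openTok):], closeTok)
		if j < 0 {
			break
		}
		fragments = append(fragments, strings.TrimSpace(body[:i]))
		injections = append(injections, strings.TrimSpace(body[i+len(openTok):i+len(openTok)+j]))
		body = body[i+len(openTok)+j+len(closeTok):]
	}
	fragments = append(fragments, strings.TrimSpace(body))
	return
}

func main() {
	frags, injs := splitInjections("one {{ two }} three", "{{", "}}")
	fmt.Println(frags, injs) // [one three] [two]
}
```

The tokenizer itself goes further: it emits the fragments as `TokenStringFragment` tokens and fully tokenizes what is inside the markers, as the example below shows.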
\n\n```go\nconst (\n    TokenOpenInjection = 1\n    TokenCloseInjection = 2\n    TokenQuotedString = 3\n)\n\nparser := tokenizer.New()\nparser.DefineTokens(TokenOpenInjection, []string{\"{{\"})\nparser.DefineTokens(TokenCloseInjection, []string{\"}}\"})\nparser.DefineStringToken(TokenQuotedString, `\"`, `\"`).AddInjection(TokenOpenInjection, TokenCloseInjection)\n\nparser.ParseString(`\"one {{ two }} three\"`)\n```\n\nTokens:\n\n```\n[\n    {\n        Key: tokenizer.TokenStringFragment,\n        Value: \"one\"\n    },\n    {\n        Key: TokenOpenInjection,\n        Value: \"{{\"\n    },\n    {\n        Key: tokenizer.TokenKeyword,\n        Value: \"two\"\n    },\n    {\n        Key: TokenCloseInjection,\n        Value: \"}}\"\n    },\n    {\n        Key: tokenizer.TokenStringFragment,\n        Value: \"three\"\n    },\n]\n```\n\nUse cases:\n- parse templates\n- parse placeholders\n\n## User defined tokens\n\nNew tokens can be defined via the `DefineTokens()` method:\n\n```go\nconst (\n    TokenCurlyOpen    = 1\n    TokenCurlyClose   = 2\n    TokenSquareOpen   = 3\n    TokenSquareClose  = 4\n    TokenColon        = 5\n    TokenComma        = 6\n    TokenDoubleQuoted = 7\n)\n\n// json parser\nparser := tokenizer.New()\nparser.\n\tDefineTokens(TokenCurlyOpen, []string{\"{\"}).\n\tDefineTokens(TokenCurlyClose, []string{\"}\"}).\n\tDefineTokens(TokenSquareOpen, []string{\"[\"}).\n\tDefineTokens(TokenSquareClose, []string{\"]\"}).\n\tDefineTokens(TokenColon, []string{\":\"}).\n\tDefineTokens(TokenComma, []string{\",\"}).\n\tDefineStringToken(TokenDoubleQuoted, `\"`, `\"`).SetSpecialSymbols(tokenizer.DefaultStringEscapes)\n\nstream := parser.ParseString(`{\"key\": [1]}`)\n```\n\n## Known issues\n\n* zero-byte `\\x00` (`\\0`) stops parsing.\n\n## Benchmark\n\nParse string/bytes\n```\npkg: tokenizer\ncpu: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz\nBenchmarkParseBytes\n    stream_test.go:251: Speed: 70 bytes string with 19.689µs: 3555284 byte/sec\n    
stream_test.go:251: Speed: 7000 bytes string with 848.163µs: 8253130 byte/sec\n    stream_test.go:251: Speed: 700000 bytes string with 75.685945ms: 9248744 byte/sec\n    stream_test.go:251: Speed: 11093670 bytes string with 1.16611538s: 9513355 byte/sec\nBenchmarkParseBytes-8   \t  158481\t      7358 ns/op\n```\n\nParse infinite stream\n```\npkg: tokenizer\ncpu: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz\nBenchmarkParseInfStream\n    stream_test.go:226: Speed: 70 bytes at 33.826µs: 2069414 byte/sec\n    stream_test.go:226: Speed: 7000 bytes at 627.357µs: 11157921 byte/sec\n    stream_test.go:226: Speed: 700000 bytes at 27.675799ms: 25292856 byte/sec\n    stream_test.go:226: Speed: 30316440 bytes at 1.18061702s: 25678471 byte/sec\nBenchmarkParseInfStream-8   \t  433092\t      2726 ns/op\nPASS\n```\n","funding_links":[],"categories":["Text Processing","Bot Building","Template Engines"],"sub_categories":["Parsers/Encoders/Decoders"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbzick%2Ftokenizer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbzick%2Ftokenizer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbzick%2Ftokenizer/lists"}