{"id":19592318,"url":"https://github.com/smikhalevski/tokenizer-dsl","last_synced_at":"2026-04-11T20:11:31.526Z","repository":{"id":39855051,"uuid":"372784447","full_name":"smikhalevski/tokenizer-dsl","owner":"smikhalevski","description":"🪵 The API for building streaming tokenizers and lexers.","archived":false,"fork":false,"pushed_at":"2023-01-11T18:52:48.000Z","size":442,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-09-16T04:28:38.459Z","etag":null,"topics":["dsl","fast","parser","streaming","tiny","tokenizer"],"latest_commit_sha":null,"homepage":"https://smikhalevski.github.io/tokenizer-dsl/","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/smikhalevski.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-06-01T10:06:44.000Z","updated_at":"2022-05-25T21:17:07.000Z","dependencies_parsed_at":"2023-02-09T04:16:34.230Z","dependency_job_id":null,"html_url":"https://github.com/smikhalevski/tokenizer-dsl","commit_stats":null,"previous_names":[],"tags_count":7,"template":false,"template_full_name":null,"purl":"pkg:github/smikhalevski/tokenizer-dsl","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/smikhalevski%2Ftokenizer-dsl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/smikhalevski%2Ftokenizer-dsl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/smikhalevski%2Ftokenizer-dsl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/smikhalevski%2Ftokenizer-dsl/manifests","owner_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/owners/smikhalevski","download_url":"https://codeload.github.com/smikhalevski/tokenizer-dsl/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/smikhalevski%2Ftokenizer-dsl/sbom","scorecard":{"id":833375,"data":{"date":"2025-08-11","repo":{"name":"github.com/smikhalevski/tokenizer-dsl","commit":"562e2d498375fc8eb6018e5ab1ae61bc6bf1ca60"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":2.9,"checks":[{"name":"Dangerous-Workflow","score":10,"reason":"no dangerous workflow patterns detected","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Code-Review","score":0,"reason":"Found 0/30 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no 
GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Token-Permissions","score":0,"reason":"detected GitHub workflow tokens with excessive permissions","details":["Warn: no topLevel permission defined: .github/workflows/master.yml:1","Info: no jobLevel write permissions found"],"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Pinned-Dependencies","score":3,"reason":"dependency not pinned by hash detected -- score normalized to 3","details":["Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/master.yml:13: update your workflow using https://app.stepsecurity.io/secureworkflow/smikhalevski/tokenizer-dsl/master.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/master.yml:14: update your workflow using https://app.stepsecurity.io/secureworkflow/smikhalevski/tokenizer-dsl/master.yml/master?enable=pin","Info:   0 out of   2 GitHub-owned GitHubAction dependencies pinned","Info:   1 out of   1 npmCommand dependencies pinned"],"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices 
Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE.txt:0","Info: FSF or OSI recognized license: MIT License: LICENSE.txt:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection 
settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"SAST","score":0,"reason":"SAST tool is not run on all commits -- score normalized to 0","details":["Warn: 0 commits out of 13 are checked with a SAST tool"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}},{"name":"Vulnerabilities","score":2,"reason":"8 existing vulnerabilities detected","details":["Warn: Project is vulnerable to: GHSA-968p-4wvh-cqc8","Warn: Project is vulnerable to: GHSA-67hx-6x53-jw92","Warn: Project is vulnerable to: GHSA-v6h2-p8h4-qcjw","Warn: Project is vulnerable to: GHSA-grv7-fg5c-xmjg","Warn: Project is vulnerable to: GHSA-3xgq-45jj-v275","Warn: Project is vulnerable to: GHSA-952p-6rrq-rcjv","Warn: Project is vulnerable to: GHSA-gcx4-mw62-g8wm","Warn: Project is vulnerable to: GHSA-c2qf-rxjj-qqgw"],"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}}]},"last_synced_at":"2025-08-23T18:20:11.038Z","repository_id":39855051,"created_at":"2025-08-23T18:20:11.038Z","updated_at":"2025-08-23T18:20:11.038Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31693388,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-11T13:07:20.380Z","status":"ssl_error","status_checked_at":"2026-04-11T13:06:47.903Z","response_time":54,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dsl","fast","parser","streaming","tiny","tokenizer"],"created_at":"2024-11-11T08:34:37.322Z","updated_at":"2026-04-11T20:11:31.498Z","avatar_url":"https://github.com/smikhalevski.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# tokenizer-dsl 🪵 [![build](https://github.com/smikhalevski/tokenizer-dsl/actions/workflows/master.yml/badge.svg?branch=master\u0026event=push)](https://github.com/smikhalevski/tokenizer-dsl/actions/workflows/master.yml)\n\nThe API for building streaming tokenizers and lexers.\n\n- [2× faster than `RegExp`-based alternatives](#performance);\n- [3 kB gzipped](https://bundlephobia.com/result?p=tokenizer-dsl) including dependencies;\n- Supports streaming out of the box;\n- No memory allocations during tokenization;\n- Tokenizer is compiled to a single highly-optimized function.\n\n🔥\u0026ensp;[**Try this example live on CodeSandbox**](https://codesandbox.io/s/tokenizer-dsl-s945yv)\n\n```shell\nnpm install --save-prod tokenizer-dsl\n```\n\n- [Usage](#usage)\n- [Built-in readers](#built-in-readers)\u003cbr\u003e\n  [`text`](#text) [`char`](#char) [`regex`](#regex) [`all`](#all) [`seq`](#seq) [`or`](#or) [`skip`](#skip) [`until`](#until) [`end`](#end) [`lookahead`](#lookahead) [`optional`](#optional) [`never`](#never) [`none`](#none)\n- [Functional readers](#functional-readers)\n    - [Recursive readers](#recursive-readers)\n- [Code-generated readers](#code-generated-readers)\n- [Rules](#rules)\n    - [Rule 
stages](#rule-stages)\n    - [Silent rules](#silent-rules)\n- [Streaming](#streaming)\n- [Context](#context)\n- [Standalone tokenizers](#standalone-tokenizers)\n- [Performance](#performance)\n\n# Usage\n\n🔎 [API documentation is available here.](https://smikhalevski.github.io/tokenizer-dsl/)\n\nLet's consider the input string that contains lowercase-alpha strings and floating-point numbers, separated by a\nsemicolon and an arbitrary number of space chars:\n\n```ts\n'123.456; aaa; +777; bbb; -42'\n```\n\nTo tokenize this string we first need to describe readers that would read chars from the input string.\n\nThe reader for semicolons is pretty straightforward:\n\n```ts\nimport * as t from 'tokenizer-dsl';\n\nconst semicolonReader = t.text(';');\n```\n\nThe regular expression equivalent for `semicolonReader` is `/;/y`.\n\nTo read a sequence of whitespaces or lowercase-alpha chars we would use the combination of [`all`](#all) and\n[`char`](#char) readers:\n\n```ts\nconst whitespaceReader = t.all(t.char([' \\t\\n\\r']));\n\nconst alphaReader = t.all(t.char([['a', 'z']]), { minimumCount: 1 });\n```\n\nThe regular expression equivalent for `whitespaceReader` is `/[ \\t\\n\\r]*/y`, and for `alphaReader` it is `/[a-z]+/y`.\n\nTo read a signed floating-point number we need a combination of multiple readers:\n\n```ts\nconst zeroReader = t.text('0');\n\nconst leadingDigitReader = t.char([['1', '9']]);\n\nconst digitsReader = t.all(t.char([['0', '9']]));\n\nconst dotReader = t.text('.');\n\nconst signReader = t.char(['+-']);\n\nconst numberReader = t.seq(\n  // sign\n  t.optional(signReader),\n\n  // integer\n  t.or(\n    zeroReader,\n    t.seq(\n      leadingDigitReader,\n      digitsReader\n    )\n  ),\n\n  // fraction\n  t.optional(\n    t.seq(\n      dotReader,\n      digitsReader\n    )\n  )\n);\n```\n\nThe `numberReader` works the same way as `/[-+]?(?:0|[1-9]\\d*)(?:\\.\\d*)?/y`.\n\nNow, after we defined all required readers, we can define a set of [tokenization 
rules](#rules):\n\n```ts\nconst semicolonRule: t.Rule = {\n  type: 'semicolon',\n  reader: semicolonReader,\n};\n\nconst whitespaceRule: t.Rule = {\n  type: 'whitespace',\n  reader: whitespaceReader,\n};\n\nconst alphaRule: t.Rule = {\n  type: 'alpha',\n  reader: alphaReader,\n};\n\nconst numberRule: t.Rule = {\n  type: 'number',\n  reader: numberReader,\n};\n```\n\n- `type` is the arbitrary name of the token that the rule would read from the input string. It can be a string, a\n  number, an object, or any other data type. The type would be passed to the token handler.\n- `reader` is the reader that would read chars from the input string.\n\nThe next step is to create a tokenizer that uses our rules:\n\n```ts\nconst tokenize = t.createTokenizer([\n  semicolonRule,\n  whitespaceRule,\n  alphaRule,\n  numberRule\n]);\n```\n\n`createTokenizer` would compile a highly efficient function that applies rules to read chars from the input string.\n\nAs the last step, we should call the tokenizer and provide it with an input and a token handler:\n\n```ts\nconst handler: t.TokenHandler = (type, input, offset, length, context, state) =\u003e {\n  console.log(type, '\"' + input.substr(offset, length) + '\"');\n};\n\ntokenize('123.456; aaa; +777; bbb; -42', handler);\n```\n\nThe console output would be:\n\n```\nnumber \"123.456\"\nsemicolon \";\"\nwhitespace \" \"\nalpha \"aaa\"\nsemicolon \";\"\nwhitespace \" \"\nnumber \"+777\"\nsemicolon \";\"\nwhitespace \" \"\nalpha \"bbb\"\nsemicolon \";\"\nwhitespace \" \"\nnumber \"-42\"\n```\n\n# Built-in readers\n\n## `text(substring, options?)`\u003ca name=\"text\"\u003e\u003c/a\u003e\n\nReads the case-sensitive `substring` from the input:\n\n```ts\n// Reads 'foo'\ntext('foo');\n```\n\nYou can optionally specify that text must be case-insensitive:\n\n```ts\n// Reads 'bar', 'BAR', 'Bar', etc.\ntext('bar', { caseInsensitive: true });\n```\n\n## `char(chars)`\u003ca name=\"char\"\u003e\u003c/a\u003e\n\nReads a single char from the string. 
You should provide an array of strings, char codes or char ranges.\n\n```ts\n// Reads 'a', 'b', or 'c'\nchar(['a', 98, 99]);\n```\n\nYou can specify a set of chars as a string with multiple chars:\n\n```ts\n// Reads ' ', '\\t', '\\r', or '\\n'\nchar([' \\t\\r\\n']);\n```\n\nYou can specify a pair of char codes or strings that denote a char range:\n\n```ts\n// Reads [a-zA-Z]\nchar([['a', 'z'], [65, 90]]);\n```\n\n## `regex(pattern)`\u003ca name=\"regex\"\u003e\u003c/a\u003e\n\nReads substring using the `RegExp` pattern:\n\n```ts\n// Reads '0', '123', etc.\nregex(/0|[1-9]\\d*/y);\n```\n\nIf you don't specify `g` or `y` flags on the `RegExp`, then `y` is implicitly added.\n\n## `all(reader, options?)`\u003ca name=\"all\"\u003e\u003c/a\u003e\n\nApplies `reader` until it can read from the input:\n\n```ts\n// Reads 'abc' from 'abc123'\nall(char([['a', 'z']]));\n```\n\nYou can optionally specify the number of entries the `reader` must read to consider success:\n\n```ts\n// Reads at least one digit, but not more than 10\nall(char([['0', '9']]), { minimumCount: 1, maximumCount: 10 });\n```\n\n## `seq(...readers)`\u003ca name=\"seq\"\u003e\u003c/a\u003e\n\nApplies readers one after another sequentially:\n\n```ts\n// Reads PK-XXXXX where X is 0-9\nseq(\n  text('PK-'),\n  all(char([['0', '9']]), { minimumCount: 5, maximumCount: 5 })\n);\n```\n\n## `or(...readers)`\u003ca name=\"or\"\u003e\u003c/a\u003e\n\nReturns the offset returned by the first successfully applied reader:\n\n```ts\n// Reads 'foo' or 'bar'\nor(\n  text('foo'),\n  text('bar')\n);\n```\n\n## `skip(count)`\u003ca name=\"skip\"\u003e\u003c/a\u003e\n\nSkips the given number of chars without reading:\n\n```ts\n// Skips 5 chars \nskip(5);\n```\n\n## `until(reader, options?)`\u003ca name=\"until\"\u003e\u003c/a\u003e\n\nRepeatedly applies `reader` until it successfully reads chars from the string. 
If `reader` failed to read chars, then `until`\nreturns -1.\n\n```ts\n// Reads everything until 'foo' exclusively\nuntil(text('foo'));\n```\n\nYou can make `until` read inclusively:\n\n```ts\n// Reads everything until 'bar' inclusively\nuntil(text('bar'), { inclusive: true });\n```\n\nFor example, to read all chars up to `'\u003e'` or until the end of the input:\n\n```ts\nor(\n  until(text('\u003e'), { inclusive: true }),\n  end()\n);\n```\n\n## `end(offset?)`\u003ca name=\"end\"\u003e\u003c/a\u003e\n\nSkips all chars until the end of the input. You can optionally provide the offset from the input end.\n\n```ts\n// Reads everything up to the last char\nend(-1);\n```\n\n## `lookahead(reader)`\u003ca name=\"lookahead\"\u003e\u003c/a\u003e\n\nThis is the same as [lookahead in regular expressions](https://www.regular-expressions.info/lookaround.html). It\nreturns the current offset if `reader` successfully reads chars from the input at the current offset.\n\n```ts\n// Reads '\u003c' in '\u003ca'\nseq(\n  text('\u003c'),\n  lookahead(char([['a', 'z']]))\n);\n```\n\n## `optional(reader)`\u003ca name=\"optional\"\u003e\u003c/a\u003e\n\nReturns the current offset if the `reader` failed to read chars:\n\n```ts\n// Reads 'foo-bar' and 'bar'\nseq(\n  optional(text('foo-')),\n  text('bar')\n);\n```\n\n## `never`\n\nThe singleton reader that always returns -1.\n\n## `none`\n\nThe singleton reader that always returns the current offset.\n\n# Functional readers\n\nA reader can be defined as a function that takes an `input` string, an `offset` at which it should start reading, and a\n`context`. 
Learn more about the context in the [Context](#context) section.\n\nA reader should return the new offset that is greater than or equal to the `offset` if the reader has successfully read from\nthe `input`, or an integer less than `offset` to indicate that nothing was read.\n\nLet's create a custom reader:\n\n```ts\nimport * as t from 'tokenizer-dsl';\n\nconst fooReader: t.Reader = (input, offset) =\u003e {\n  return input.startsWith('foo', offset) ? offset + 3 : -1;\n};\n```\n\nThis reader checks that the `input` string contains a substring `'foo'` at the `offset` and returns the new offset where\nthe substring ends, or returns -1 to indicate that reading didn't succeed.\n\nWe can combine `fooReader` with any built-in reader. For example, to read chars until `'foo'` is met:\n\n```ts\nimport * as t from 'tokenizer-dsl';\n\n// Reads until 'foo' substring is met\nt.until(fooReader);\n```\n\n## Recursive readers\n\nYou can create recursive functional readers:\n\n```ts\nimport * as t from 'tokenizer-dsl';\n\nconst fooReader = t.toReaderFunction(\n  t.seq(\n    t.text('-'),\n    t.optional((input, offset) =\u003e fooReader(input, offset))\n  )\n);\n```\n\n# Code-generated readers\n\nCode generation is used to compile highly performant readers. 
To leverage this feature, you can define your custom\nreaders as code factories.\n\nLet's recreate the reader from the previous section with the codegen approach:\n\n```ts\nimport * as t from 'tokenizer-dsl';\n\nconst fooReader: t.Reader = {\n  factory(inputVar, offsetVar, contextVar, resultVar) {\n    return {\n      code: [\n        resultVar, '=', inputVar, '.startsWith(\"foo\",', offsetVar, ')?', offsetVar, '+3:-1;',\n      ]\n    };\n  }\n};\n```\n\nThe `factory` function receives four input arguments that define the variables that should be used in the output code\ntemplate:\n\n- `inputVar` is the variable that holds the input string.\n- `offsetVar` is the variable that holds the offset in the input string from which the reader must be applied.\n- `contextVar` is the variable that holds the reader context. Learn more about the context in the [Context](#context)\n  section.\n- `resultVar` is the variable to which the reader result must be assigned.\n\nThe `factory` function should return an object containing a `code` property that holds the code template and an optional\n`bindings` property that holds the variable bindings.\n\nTo demonstrate how to use bindings, let's write a reader factory that would allow us to read arbitrary strings, just\nlike the [`text`](#text) reader does:\n\n```ts\nfunction createStrReader(str: string): t.Reader {\n  return {\n    factory(inputVar, offsetVar, contextVar, resultVar) {\n      // Create a variable placeholder\n      const strVar = Symbol();\n\n      return {\n        code: [\n          resultVar, '=', inputVar, '.startsWith(', strVar, ',', offsetVar, ')?', offsetVar, '+', str.length, ':-1;',\n        ],\n        bindings: [\n          // This would assign str to a strVar at runtime\n          [strVar, str]\n        ]\n      };\n    }\n  };\n}\n```\n\nWe can combine `createStrReader` with any built-in reader. 
For example, to read all sequential substrings in the input:\n\n```ts\n// Reads consecutive 'foo' substrings\nt.all(createStrReader('foo'));\n```\n\nYou can introduce custom variables inside a code template. Here is an example of a reader that reads zero-or-more lower\nalpha chars from the string using a `while` loop:\n\n```ts\nconst lowerAlphaReader: t.Reader = {\n  factory(inputVar, offsetVar, contextVar, resultVar) {\n    // Create variable placeholders\n    const indexVar = Symbol();\n    const charCodeVar = Symbol();\n\n    return {\n      code: [\n        // Start reading from the offset\n        'var ', indexVar, '=', offsetVar, ';',\n\n        // Read until end of the input\n        'while(', indexVar, '\u003c', inputVar, '.length){',\n\n        // Read the char code from the input\n        'var ', charCodeVar, '=', inputVar, '.charCodeAt(', indexVar, ');',\n\n        // Abort the loop if the char code isn't a lower alpha\n        'if(', charCodeVar, '\u003c', 'a'.charCodeAt(0), '||', charCodeVar, '\u003e', 'z'.charCodeAt(0), ')',\n        'break;',\n\n        // Otherwise, proceed to the next char\n        '++', indexVar,\n        '}',\n\n        // Return the index that was reached \n        resultVar, '=', indexVar, ';',\n      ]\n    };\n  }\n};\n```\n\nYou can find out more details on how codegen works in the [codedegen](https://github.com/smikhalevski/codedegen) repo.\n\n# Rules\n\nRules define how tokens are emitted when successfully read from the input by readers.\n\nThe most basic rule only defines a reader:\n\n```ts\nimport * as t from 'tokenizer-dsl';\n\nconst fooRule: t.Rule = {\n  reader: t.text('foo')\n};\n```\n\nTo use a rule, create a new tokenizer:\n\n```ts\nconst tokenize = t.createTokenizer([fooRule]);\n```\n\nNow you can read inputs that consist of any number of `'foo'` substrings:\n\n```ts\ntokenize('foofoofoo', (type, input, offset, length, context) =\u003e {\n  // Process the token here\n});\n```\n\nMost of the time you have more than 
one token type in your input. This is where the `type` property of the rule comes in handy. The\nvalue of this property would be passed to the handler as the first argument.\n\n```ts\nimport * as t from 'tokenizer-dsl';\n\ntype MyTokenType = 'FOO' | 'BAR';\n\n// You can specify token types to enhance typing\nconst fooRule: t.Rule\u003cMyTokenType\u003e = {\n  type: 'FOO',\n  reader: t.text('foo')\n};\n\nconst barRule: t.Rule\u003cMyTokenType\u003e = {\n  type: 'BAR',\n  reader: t.text('bar')\n};\n\nconst tokenize = t.createTokenizer([\n  fooRule,\n  barRule\n]);\n\ntokenize('foofoobarfoobar', (type, input, offset, length, context) =\u003e {\n  switch (type) {\n    case 'FOO':\n      // Process the FOO token here\n      break;\n\n    case 'BAR':\n      // Process the BAR token here\n      break;\n  }\n});\n```\n\n## Rule stages\n\nYou can put rules on different stages to control how they are applied.\n\nIn the previous example we created a tokenizer that reads `'foo'` and `'bar'` in any order. Let's create a tokenizer\nthat restricts the order in which `'foo'` and `'bar'` should be met.\n\n```ts\nimport * as t from 'tokenizer-dsl';\n\ntype MyTokenType = 'FOO' | 'BAR';\n\ntype MyStage = 'start' | 'foo' | 'bar';\n\nconst fooRule: t.Rule\u003cMyTokenType, MyStage\u003e = {\n  type: 'FOO',\n  reader: t.text('foo'),\n\n  // Rule would be applied on stages 'start' and 'bar'\n  on: ['start', 'bar'],\n\n  // If rule is successfully applied then the tokenizer would\n  // transition to the 'foo' stage\n  to: 'foo'\n};\n\nconst barRule: t.Rule\u003cMyTokenType, MyStage\u003e = {\n  type: 'BAR',\n  reader: t.text('bar'),\n  on: ['start', 'foo'],\n  to: 'bar'\n};\n\nconst tokenize = t.createTokenizer(\n  [\n    fooRule,\n    barRule\n  ],\n\n  // Provide the initial stage\n  'start'\n);\n```\n\nThis tokenizer would successfully process `'foobarfoobar'` but would stop on `'foofoo'`.\n\nRules that don't have `on` option specified are applied on all stages. 
To showcase this behavior, let's modify our rules\nto allow `'foo'` and `'bar'` to be separated by an arbitrary number of space chars.\n\n```ts\ntype MyTokenType = 'FOO' | 'BAR' | 'SPACE';\n\ntype MyStage = 'start' | 'foo' | 'bar';\n\nconst fooRule: t.Rule\u003cMyTokenType, MyStage\u003e = {\n  type: 'FOO',\n  reader: t.text('foo'),\n  on: ['start', 'bar'],\n  to: 'foo'\n};\n\nconst barRule: t.Rule\u003cMyTokenType, MyStage\u003e = {\n  type: 'BAR',\n  reader: t.text('bar'),\n  on: ['start', 'foo'],\n  to: 'bar'\n};\n\n// Rule would be applied on all stages: 'start', 'foo', and 'bar'\nconst spaceRule: t.Rule\u003cMyTokenType, MyStage\u003e = {\n  type: 'SPACE',\n  reader: t.all(t.char([' '])),\n};\n\nconst tokenize = t.createTokenizer(\n  [\n    fooRule,\n    barRule,\n    spaceRule\n  ],\n  'start'\n);\n```\n\nThis tokenizer would successfully process `' foo bar foo bar '` input.\n\nYou can provide a callback that returns the next stage:\n\n```ts\nconst barRule: t.Rule\u003cMyTokenType, MyStage\u003e = {\n  on: ['start', 'foo'],\n  type: 'BAR',\n  reader: t.text('bar'),\n\n  to(offset, length, context, state) {\n    // Return the next stage\n    return 'bar';\n  }\n};\n```\n\n## Silent rules\n\nSome tokens don't have any semantics that you want to process. In this case, you can mark the rule as `silent` to prevent\nthe token from being emitted.\n\n```ts\nimport * as t from 'tokenizer-dsl';\n\nconst whitespaceRule: t.Rule = {\n  reader: t.all(t.char([' \\t\\r\\n'])),\n  silent: true\n};\n```\n\n# Streaming\n\nCompiled tokenizer supports streaming out of the box. 
Let's refer to the tokenizer that we defined in\nthe [Usage](#usage) chapter:\n\n```ts\nimport * as t from 'tokenizer-dsl';\n\nconst tokenizer = t.createTokenizer([\n  semicolonRule,\n  whitespaceRule,\n  alphaRule,\n  numberRule\n]);\n```\n\nWe used this tokenizer in a non-streaming fashion:\n\n```ts\ntokenizer('123.456; aaa; +777; bbb; -42', handler);\n```\n\nIf the input string comes in chunks, we can use the streaming API of the tokenizer:\n\n```ts\nconst state = tokenizer.write('123.456', undefined, handler);\ntokenizer.write('; aaa; +77', state, handler);\ntokenizer.write('7; bbb; -42', state, handler);\ntokenizer(state, handler);\n```\n\n`tokenizer.write` accepts a mutable state object that is updated as tokenization progresses. You can inspect state to\nknow the stage and offset at which the tokenizer finished reading tokens.\n\nStreaming tokenizer emits tokens that are _confirmed_. The token is confirmed after the subsequent token is\nsuccessfully read or after `tokenizer(state, handler)` is called.\n\n# Context\n\nCustom readers may require a custom state. You can provide the context to the tokenizer, and it would pass it to all\nreaders as a third argument:\n\n```ts\nimport * as t from 'tokenizer-dsl';\n\n// Define a reader that uses a context\nconst fooReader: t.Reader\u003c{ bar: number }\u003e = (input, offset, context) =\u003e {\n  console.log(context.bar);\n  return -1;\n};\n\n// Compile a tokenizer\nconst tokenizer = t.createTokenizer([\n  // A rule that uses a fooReader\n  { reader: fooReader }\n]);\n\n// Pass the context value\ntokenizer('foobar', handler, { bar: 123 });\n```\n\n# Standalone tokenizers\n\n`eval` and `Function` can be prohibited in some environments. 
To use a tokenizer in such circumstances you can generate\na pre-compiled rule iterator:\n\n```ts\nimport fs from 'fs';\nimport * as t from 'tokenizer-dsl';\n\nconst moduleSource = t.compileRuleIteratorModule(\n  [\n    { reader: t.char(['a']) },\n  ],\n  { typingsEnabled: true }\n);\n\nfs.writeFileSync('./ruleIterator.ts', moduleSource);\n```\n\nThen, at runtime, you can import it and create a tokenizer:\n\n```ts\nimport * as t from 'tokenizer-dsl';\nimport ruleIterator from './ruleIterator';\n\nconst tokenizer = t.createTokenizerForRuleIterator(ruleIterator);\n```\n\nIf you need to use [a functional reader](#functional-readers) in a generated tokenizer, use `externalValue` declaration\nthat would be output as an `import` statement.\n\n```ts\nt.compileTokenizerModule(\n  [\n    { reader: t.externalValue('./super-reader') },\n  ],\n  { typingsEnabled: true }\n);\n```\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003ePreview of the generated module code\u003c/b\u003e\u003c/summary\u003e\n\u003cp\u003e\n\nCode is formatted manually, for readability purposes.\n\n```ts\nimport type { RuleIterator } from 'tokenizer-dsl';\nimport e from './super-reader';\n\nconst f: RuleIterator\u003cany, any, any\u003e = function (a, b, c, d) {\n  var g = a.chunk, h = a.offset, i = false, j, k = h, l = g.length;\n  while (k \u003c l) {\n    var m;\n    m = e(g, k, c);\n\n    if (m \u003e k) {\n      if (i) {\n        b(j, g, h, k - h, c, a);\n        i = false;\n      }\n      a.offset = h = k;\n      i = true;\n      j = undefined;\n      k = m;\n      continue;\n    }\n    break;\n  }\n  if (d) return;\n  if (i) {\n    b(j, g, h, k - h, c, a);\n  }\n  a.offset = k;\n};\n\nexport default f;\n```\n\n\u003c/p\u003e\n\u003c/details\u003e\n\n# Performance\n\n[To run a performance test](./src/test/perf.js), clone this repo and run `npm ci \u0026\u0026 npm run perf` in the project\ndirectory.\n\nThe table below shows performance comparison between tokenizer-dsl readers and `RegExp` 
alternatives.\n\nResults are in millions of operations per second. The higher number is better.\n\n| | tokenizer-dsl | `RegExp` | |\n| -- | --: | --: | -- |\n| [Usage example](#usage) | 5.3 | 2.5 | |\n| `char(['abc'])` | 88.8 | 58.5 | `/[abc]/y` |\n| `char([['a', 'z']])` | 88.1 | 58.4 | `/[a-z]/y` |\n| `all(char(['abc']))` | 39.7 | 50.0 | `/[abc]*/y` |\n| `all(char(['abc']), {minimumCount: 2})` | 67.1 | 50.2 | `/[abc]{2,}/y` |\n| `all(text('abc'))` | 43.0 | 50.2 | `/(?:abc)*/y` |\n| `or(text('abc'), text('123'))` | 67.3 | 57.1 | `/abc\\|123/y` |\n| `seq(text('abc'), text('123'))` | 58.8 | 54.2 | `/abc123/y` |\n| `text('abc')` | 72.8 | 57.1 | `/abc/y` |\n| `text('abc', {caseInsensitive: true})` | 71.1 | 55.0 | `/abc/iy` |\n| `until(char(['abc']))` | 51.5 | 48.6 | `/[abc]/g` |\n| `until(text('abc'))` | 51.0 | 33.0 | `/(?=abc)/g` |\n| `until(text('abc'), {inclusive: true})` | 51.9 | 48.8 | `/abc/g` |\n\nTokenizer performance comes from the following implementation aspects:\n\n- Reader combination optimizations. For example, `until(text('abc'))` would read case-sensitive chars from the string\n  until the substring `'abc'` is met. An analog of this is `/(?=abc)/`. Tokenizer uses `input.indexOf('abc')` for the\n  substring search, which is 2× faster than using a regular expression.\n\n- All readers (except `regex`) rely solely on the `charCodeAt` and `indexOf` string methods. This dramatically reduces\n  memory allocations, since no strings or other objects are created on the heap.\n\n- Tokenizer compiles provided rules into a single function. No call stack overhead.\n\n- Rules that share the same prefix sequence of readers read this prefix from the input only once. 
So chars in the input\n  are accessed less frequently.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsmikhalevski%2Ftokenizer-dsl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsmikhalevski%2Ftokenizer-dsl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsmikhalevski%2Ftokenizer-dsl/lists"}