{"id":26560764,"url":"https://github.com/jlguenego/lexer","last_synced_at":"2025-10-09T08:14:28.465Z","repository":{"id":57121503,"uuid":"325384407","full_name":"jlguenego/lexer","owner":"jlguenego","description":"Lexical analyzer.","archived":false,"fork":false,"pushed_at":"2021-01-30T16:43:46.000Z","size":184,"stargazers_count":10,"open_issues_count":1,"forks_count":2,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-06-05T02:07:57.429Z","etag":null,"topics":["analysis","compiler","lexer","lexical","parser"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jlguenego.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-12-29T20:50:19.000Z","updated_at":"2024-12-23T23:29:17.000Z","dependencies_parsed_at":"2022-08-24T06:30:55.916Z","dependency_job_id":null,"html_url":"https://github.com/jlguenego/lexer","commit_stats":null,"previous_names":[],"tags_count":7,"template":false,"template_full_name":null,"purl":"pkg:github/jlguenego/lexer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jlguenego%2Flexer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jlguenego%2Flexer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jlguenego%2Flexer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jlguenego%2Flexer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jlguenego","download_url":"https://codeload.github.com/jlguenego/lexer/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/Gi
tHub/repositories/jlguenego%2Flexer/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259756888,"owners_count":22906680,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analysis","compiler","lexer","lexical","parser"],"created_at":"2025-03-22T13:29:53.148Z","updated_at":"2025-10-09T08:14:23.425Z","avatar_url":"https://github.com/jlguenego.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n    \u003cimg src=\"doc/illustration.png\"\u003e\n  \u003ch1\u003eLexer\u003c/h1\u003e\n  \u003cp\u003e\n    Lexical Analyzer.\n  \u003c/p\u003e\n\u003c/div\u003e\n\nhttps://en.wikipedia.org/wiki/Lexical_analysis\n\nWhy ? 
I wanted a SIMPLE, INTUITIVE, FAST, full-JS lexer that can:\n\n- **tokenize strings and comments correctly** in a language,\n- support **stropping**, so that an unambiguous grammar can be written for the next phase (syntax analysis),\n- **use context**, to handle languages with off-side rules, etc.\n\nThis module is useful for:\n\n- compilers\n- prettyprinters\n- linters\n\n## Install\n\n```\nnpm i @jlguenego/lexer\n```\n\n[![Code Style: Google](https://img.shields.io/badge/code%20style-google-blueviolet.svg)](https://github.com/google/gts)\n\n## Usage\n\n```js\n// Rule is used below, so it is imported along with Lexer and Group.\nconst {Lexer, Group, Rule} = require('@jlguenego/lexer');\n\n// Source code to tokenize.\nconst str = `\nvar x = 3;\nvar y = 52;\n`;\n\n// Declare all the language rules.\nconst blank = new Rule({\n  name: 'blank',\n  pattern: /\\s+/,\n  ignore: true,\n});\n\nconst keywords = Rule.createKeywordRules(['var']);\n\nconst operators = Rule.createGroupRules(Group.OPERATOR, [\n  {\n    name: 'equal',\n    pattern: '=',\n  },\n]);\n\nconst separators = Rule.createGroupRules(Group.SEPARATOR, [\n  {\n    name: 'semi-column',\n    pattern: ';',\n  },\n]);\n\nconst identifier = new Rule({\n  name: 'identifier',\n  pattern: /\\w+/,\n  group: Group.IDENTIFIER,\n});\n\n// The order is important. 
Rules are applied from first to last.\nconst rules = [blank, ...keywords, ...operators, ...separators, identifier];\n\n// Do the job.\nconst tokenSequence = new Lexer(rules).tokenize(str);\n\n// Print the output.\nconsole.log('tokenSequence: ', tokenSequence);\n```\n\nThis produces the following output:\n\n```js\ntokenSequence: [\n  {\n    name: 'var',\n    lexeme: 'var',\n    group: 'keywords',\n    position: {col: 1, line: 2},\n  },\n  {\n    name: 'identifier',\n    lexeme: 'x',\n    group: 'identifiers',\n    position: {col: 5, line: 2},\n  },\n  {\n    name: 'equal',\n    lexeme: '=',\n    group: 'operators',\n    position: {col: 7, line: 2},\n  },\n  // ...\n];\n```\n\n## Examples\n\n- See the [mocha tests](./test/).\n\nTODO:\n\n- show examples for well-known languages (JSON, XML, YAML, JavaScript)\n\n## Concepts\n\nThe purpose of this module is to tokenize source code.\nIn computer science, this process is known as\n[lexical analysis](https://en.wikipedia.org/wiki/Lexical_analysis).\nThe program that performs it is called a **lexer**. 
The most famous lexer is flex, but it is designed for the C world.\nHere we want a lexer for the JavaScript world.\n\n`const tokenSequence = new Lexer(rules).tokenize(str);`\n\nThe above instruction applies `rules` to tokenize a source string `str`.\n\nThe rules are specified according to the language, and each rule is matched using the pattern given in `rule.pattern` (a regular expression or a plain string).\n\nDuring tokenization, the word **state** refers to the source code being tokenized.\nThe state is a sequence of two types of elements:\n\n- source code fragments not yet tokenized, called **source elements**,\n- and recognized **tokens**.\n\nAt the beginning, the state is an array containing a single **source element** that holds the entire source code.\nAt the end, the state must be an array containing only tokens; otherwise the source code does not respect the syntax.\n\nNormally, source code is tokenized from the beginning to the end of the string (a left-to-right scan) in one pass.\nBut tokenizing can be simpler if, instead of scanning from the beginning to the end,\nwe successively apply one rule after another to the current state.\nThe drawback is that certain rules (for instance strings and comments) cannot be\ntokenized correctly when they are nested in one another.\n\nUnlike flex, this lexer does not contain any generator.\n\nTherefore this lexer runs both algorithms successively, in two passes:\n\n1. **Preprocessing** pass: performs the slow and robust method, using only the rules marked with the preprocess flag.\n2. **Main** pass: performs the fast method: applying the rules one after the other to the state.\n\nThe recommendation is to mark a rule with the preprocess flag only if the\nmain stage cannot apply the rule correctly. 
Of course, if no rule has the preprocess flag,\nthere is no need to run the preprocessing stage.\n\nThe preprocessing stage tries all the rules flagged for preprocessing and selects the one that\nmatches at the smallest index of the source string.\nIt is slower than the main stage because many rules are tried and only one is finally selected.\nThe preprocessing stage also makes it possible to write rules for [stropping](\u003chttps://en.wikipedia.org/wiki/Stropping_(syntax)\u003e).\n\nThe main stage applies one rule after the other. This means that the order in which the rules are declared is important.\nFor instance, the keyword rules should be applied from the longest one to the shortest one ([maximal munch rule](https://en.wikipedia.org/wiki/Maximal_munch)).\nThe most generic rules (identifier, type, etc.) must be applied with very low priority,\nso it is recommended to place them at the end of the rule list.\n\nWhen a rule is applied, its `expand` method is executed. This method replaces the source element with an array of source elements and tokens.\nContext can be used to handle scenarios such as the [off-side rule](https://en.wikipedia.org/wiki/Off-side_rule).\n\n## Typescript\n\nThis module is written in TypeScript and can be used from TypeScript without a separate typings module.\n\n## Modules\n\nThis module exports both a CommonJS module and an ES2015 module.\n\n## Thanks\n\n- Thanks to https://refactoring.guru/ for helping me to refactor the code, trying to produce something understandable for non-experts.\n- Thanks to the other lexers I have visited on GitHub. 
Especially the [chevrotain](https://github.com/SAP/chevrotain) lexer.\n- Thanks to the people at Stanford University for releasing their compiler course for free.\n  I tried to use the same terminology in this lexer: https://web.stanford.edu/class/archive/cs/cs143/cs143.1128/\n\n## Author\n\nMade with :heart: by Jean-Louis GUENEGO \u003cjlguenego@gmail.com\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjlguenego%2Flexer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjlguenego%2Flexer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjlguenego%2Flexer/lists"}