{"id":28630984,"url":"https://github.com/piotrmurach/lex","last_synced_at":"2025-06-12T13:09:32.391Z","repository":{"id":31069162,"uuid":"34628066","full_name":"piotrmurach/lex","owner":"piotrmurach","description":"Lex is an implementation of lex tool in Ruby.","archived":false,"fork":false,"pushed_at":"2024-03-17T22:20:31.000Z","size":61,"stargazers_count":56,"open_issues_count":1,"forks_count":5,"subscribers_count":7,"default_branch":"master","last_synced_at":"2024-08-10T10:37:16.728Z","etag":null,"topics":["compiler","lexer","lexing","ruby","ruby-gem","state-lexer","tokenizer"],"latest_commit_sha":null,"homepage":null,"language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/piotrmurach.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2015-04-26T19:41:54.000Z","updated_at":"2024-01-20T08:45:31.000Z","dependencies_parsed_at":"2022-08-20T13:00:13.313Z","dependency_job_id":"58ddc6b1-1bc3-4730-bdec-bee3c4ab3754","html_url":"https://github.com/piotrmurach/lex","commit_stats":{"total_commits":41,"total_committers":2,"mean_commits":20.5,"dds":"0.12195121951219512","last_synced_commit":"3f7b1864011c0920178ab7993a78a94a2f34b403"},"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/piotrmurach/lex","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/piotrmurach%2Flex","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/piotrmurach%2Flex/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/piotrmurach%2Flex/releases","man
ifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/piotrmurach%2Flex/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/piotrmurach","download_url":"https://codeload.github.com/piotrmurach/lex/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/piotrmurach%2Flex/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259470950,"owners_count":22862999,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["compiler","lexer","lexing","ruby","ruby-gem","state-lexer","tokenizer"],"created_at":"2025-06-12T13:09:31.479Z","updated_at":"2025-06-12T13:09:32.318Z","avatar_url":"https://github.com/piotrmurach.png","language":"Ruby","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Lex\n\n[![Gem Version](https://badge.fury.io/rb/lex.svg)][gem]\n[![Actions CI](https://github.com/piotrmurach/lex/actions/workflows/ci.yml/badge.svg)][gh_actions_ci]\n[![Build status](https://ci.appveyor.com/api/projects/status/h1bkt03qngsq851l?svg=true)][appveyor]\n[![Code Climate](https://codeclimate.com/github/piotrmurach/lex/badges/gpa.svg)][codeclimate]\n[![Coverage Status](https://coveralls.io/repos/piotrmurach/lex/badge.svg)][coveralls]\n\n[gem]: http://badge.fury.io/rb/lex\n[gh_actions_ci]: https://github.com/piotrmurach/lex/actions/workflows/ci.yml\n[appveyor]: https://ci.appveyor.com/project/piotrmurach/lex\n[codeclimate]: https://codeclimate.com/github/piotrmurach/lex\n[gemnasium]: https://gemnasium.com/piotrmurach/lex\n[coveralls]: 
https://coveralls.io/r/piotrmurach/lex\n\n\u003e Lex is an implementation of the compiler construction tool lex in Ruby. The goal is to stay close to the way the original tool works and combine it with the expressiveness of Ruby.\n\n## Features\n* Very focused tool that mimics the basic lex functionality.\n* 100% Ruby implementation.\n* Provides comprehensive error reporting to assist in lexer construction.\n\n## Installation\n\nAdd this line to your application's Gemfile:\n\n```ruby\ngem 'lex'\n```\n\nAnd then execute:\n\n    $ bundle\n\nOr install it yourself as:\n\n    $ gem install lex\n\n## Contents\n\n* [1 Overview](#1-overview)\n* [1.1 Example](#11-example)\n* [1.2 Tokens list](#12-tokens-list)\n* [1.3 Specifying rules](#13-specifying-rules)\n* [1.4 Handling keywords](#14-handling-keywords)\n* [1.5 Token values](#15-token-values)\n* [1.6 Discarded tokens](#16-discarded-tokens)\n* [1.7 Line numbers](#17-line-numbers)\n* [1.8 Ignored characters](#18-ignored-characters)\n* [1.9 Literal characters](#19-literal-characters)\n* [1.10 Error handling](#110-error-handling)\n* [1.11 Building the lexer](#111-building-the-lexer)\n* [1.12 Maintaining state](#112-maintaining-state)\n* [1.13 Conditional lexing](#113-conditional-lexing)\n* [1.14 Debugging](#114-debugging)\n\n## 1. Overview\n\n**Lex** is a library that processes character input streams. For example, suppose you have the following input string:\n\n```ruby\nx = 5 + 44 * (s - t)\n```\n\n**Lex** then partitions the input string into tokens that match a series of regular expression rules. 
In this instance, given the token definitions:\n\n```ruby\n:ID, :EQUALS, :NUMBER, :PLUS, :TIMES, :LPAREN, :RPAREN, :MINUS\n```\n\nthe output will contain the following tokens:\n\n```ruby\n[:ID, 'x', 1, 1], [:EQUALS, '=', 1, 3], [:NUMBER, 5, 1, 5],\n[:PLUS, '+', 1, 7], [:NUMBER, 44, 1, 9], [:TIMES, '*', 1, 12],\n[:LPAREN, '(', 1, 14], [:ID, 's', 1, 15], [:MINUS, '-', 1, 17],\n[:ID, 't', 1, 19], [:RPAREN, ')', 1, 20]\n```\n\nThe **Lex** rules specified in the lexer will determine how the chunking of the input is performed. The following example gives a high-level overview of how this is done.\n\n### 1.1 Example\n\nGiven an input:\n\n```ruby\ninput = \"x = 5 + 44 * (s - t)\"\n```\n\nand a simple tokenizer:\n\n```ruby\nclass MyLexer \u003c Lex::Lexer\n  tokens(\n    :NUMBER,\n    :PLUS,\n    :MINUS,\n    :TIMES,\n    :DIVIDE,\n    :LPAREN,\n    :RPAREN,\n    :EQUALS,\n    :ID\n  )\n\n  # Regular expression rules for simple tokens\n  rule(:PLUS,   /\\+/)\n  rule(:MINUS,  /\\-/)\n  rule(:TIMES,  /\\*/)\n  rule(:DIVIDE, /\\//)\n  rule(:LPAREN, /\\(/)\n  rule(:RPAREN, /\\)/)\n  rule(:EQUALS, /=/)\n  rule(:ID,     /[_\\$a-zA-Z][_\\$0-9a-zA-Z]*/)\n\n  # A regular expression rule with an action\n  rule(:NUMBER, /[0-9]+/) do |lexer, token|\n    token.value = token.value.to_i\n    token\n  end\n\n  # Define a rule so we can track line numbers\n  rule(:newline, /\\n+/) do |lexer, token|\n    lexer.advance_line(token.value.length)\n  end\n\n  # A string containing ignored characters (spaces and tabs)\n  ignore \" \\t\"\n\n  error do |lexer, token|\n    puts \"Illegal character: #{token.value}\"\n  end\nend\n\n# build the lexer\nmy_lexer = MyLexer.new\n```\n\nTo use the lexer you need to provide it some input using the `lex` method. 
After that, the method `lex` will either yield tokens to a given block or return an enumerator that lets you retrieve tokens by repeatedly calling the `next` method.\n\n```ruby\noutput = my_lexer.lex(input)\noutput.next  # =\u003e  Lex::Token(ID,x,1,1)\noutput.next  # =\u003e  Lex::Token(EQUALS,=,1,3)\noutput.next  # =\u003e  Lex::Token(NUMBER,5,1,5)\n...\n```\n\nThe tokens returned by the lexer are instances of `Lex::Token`. This object has attributes such as `name`, `value`, `line` and `column`.\n\n### 1.2 Tokens list\n\nA lexer always requires a list of tokens that defines all the possible token names that can be produced by the lexer. This list is used to perform validation checks.\n\nThe following list is an example of token names:\n\n```ruby\ntokens(\n  :NUMBER,\n  :PLUS,\n  :MINUS,\n  :TIMES,\n  :DIVIDE,\n  :LPAREN,\n  :RPAREN\n)\n```\n\n### 1.3 Specifying rules\n\nThere are two important things to know about this scanner:\n\n1) Longest match wins.\n2) If two matches have the same length, the first in source code wins.\n\nEach token is specified by writing a regular expression rule defined by calling the `rule` method. For simple tokens you can just specify the name and regular expression:\n\n```ruby\nrule(:PLUS, /\\+/)\n```\n\nIn this case, the first argument is the name of the token that needs to match exactly one of the names supplied in `tokens`. If you need to perform further processing on the matched token, the rule can be further expanded by adding an action inside a block. For instance, this rule matches numbers and converts the matched string into an integer:\n\n```ruby\nrule(:NUMBER, /\\d+/) do |lexer, token|\n  token.value = token.value.to_i\n  token\nend\n```\n\nThe action block always takes two arguments, the first being the lexer itself and the second the token which is an instance of `Lex::Token`. 
This object has the attributes `name`, which is the token name as a string; `value`, which is the actual text matched; `line`, which is the current line indexed from `1`; and `column`, which is the position of the token in relation to the current line. By default the `name` is set to the rule name. Inside the block you can modify the token object properties. However, when you change token properties, the token itself needs to be returned. If no value is returned by the action block, the token is simply discarded and the lexer moves on to the next token.\n\nThe rules are processed in the same order as they appear in the lexer definition. Therefore, if you want separate tokens for \"=\" and \"==\", you need to ensure that the rule for matching \"==\" is checked first.\n\n### 1.4 Handling keywords\n\nIn order to handle keywords, you should write a single rule to match an identifier and then do a name lookup like so:\n\n```ruby\ndef self.keywords\n  {\n    if: :IF,\n    then: :THEN,\n    else: :ELSE,\n    while: :WHILE,\n    ...\n  }\nend\n\ntokens(:ID, *keywords.values)\n\nrule(:ID, /[_[:alpha:]][_[:alnum:]]*/) do |lexer, token|\n  token.name = lexer.class.keywords.fetch(token.value.to_sym, :ID)\n  token\nend\n```\n\n### 1.5 Token values\n\nBy default, the token value is the text that was matched by the rule. However, the token value can be changed to any object. For example, when processing identifiers you may wish to return both the identifier name and its actual value.\n\n```ruby\nrule(:ID, /[_[:alpha:]][_[:alnum:]]*/) do |lexer, token|\n  token.value = [token.value, lexer.class.keywords[token.value.to_sym]]\n  token\nend\n```\n\n### 1.6 Discarded tokens\n\nTo discard a token, such as a comment, define a rule that returns no token. For instance:\n\n```ruby\nrule(:COMMENT, /\\#.*/) do |lexer, token|\n  # No return value. Token is discarded.\nend\n```\n\n### 1.7 Line numbers\n\nBy default **Lex** knows nothing about line numbers since it doesn't understand what a \"line\" is. 
To provide this information you need to add a special rule called `:newline`:\n\n```ruby\nrule(:newline, /\\n+/) do |lexer, token|\n  lexer.advance_line(token.value.length)\nend\n```\n\nCalling the `advance_line` method updates the `current_line` of the underlying lexer. Only the line is updated and, since no token is returned, the value is discarded.\n\n**Lex** performs automatic column tracking for each token. This information is available by calling `column` on a `Lex::Token` instance.\n\n### 1.8 Ignored characters\n\nFor any character that should be completely ignored in the input stream use the `ignore` method. Usually this is used to skip over whitespace and other non-essential characters. For example:\n\n```ruby\nignore \" \\t\" # =\u003e Ignore spaces and tabs\n```\n\nYou could create a rule to achieve similar behaviour; however, you are encouraged to use this method, as it performs better than regular expression rule matching.\n\n### 1.9 Literal characters\n\nNot implemented yet!\n\n### 1.10 Error handling\n\nIn order to handle lexing error conditions use the `error` method. In this case the token `value` attribute contains the offending string. 
For example:\n\n```ruby\nerror do |lexer, token|\n  puts \"Illegal character #{token.value}\"\nend\n```\n\nThe lexer automatically skips the offending character and increments the column count.\n\nWhen performing conditional lexing, you can handle errors per state like so:\n\n```ruby\nerror :foo do |lexer, token|\n  puts \"Illegal character #{token.value}\"\nend\n```\n\n### 1.11 Building the lexer\n\n```ruby\nrequire 'lex'\n\nclass MyLexer \u003c Lex::Lexer\n  # required list of tokens\n  tokens(\n    :NUMBER,\n  )\n  ...\nend\n\n```\n\nYou can also provide the lexer definition using a block:\n\n```ruby\nmy_lexer = Lex::Lexer.new do\n  # required list of tokens\n  tokens(\n    :NUMBER,\n  )\nend\n```\n\n### 1.12 Maintaining state\n\nIn your lexer you may have a need to store state information.\n\n### 1.13 Conditional lexing\n\nA lexer can maintain internal lexing state. When the lexer's state changes, only the tokens for that state are considered. The start condition is called `:initial`, similar to GNU flex.\n\nTo define a new lexical state, it must first be declared. This can be achieved by using a `states` declaration:\n\n```ruby\nstates(\n  foo: :exclusive,\n  bar: :inclusive\n)\n```\n\nThe above definition declares two states, `:foo` and `:bar`. A state may be one of two types, `:exclusive` or `:inclusive`. In an `:exclusive` state the lexer contains none of the default rules, which means that **Lex** will only return tokens and apply rules defined specifically for that state. On the other hand, an `:inclusive` state adds additional tokens and rules to the default set of rules. Thus, the `lex` method will return the tokens defined by default in addition to those defined specifically for the `:inclusive` state.\n\nOnce a state has been declared, tokens and rules are declared by including the state name in the token or rule definition. 
For example:\n\n```ruby\nrule(:foo_NUMBER, /\\d+/)\nrule(:bar_ID, /[a-z][a-z0-9]+/)\n```\n\nThe above rules define the `:NUMBER` token in state `:foo` and the `:ID` token in state `:bar`.\n\nA token can be specified in multiple states by prefixing the token name with the state names like so:\n\n```ruby\nrule(:foo_bar_NUMBER, /\\d+/)\n```\n\nIf no state information is provided, the lexer is assumed to be in the `:initial` state. For example, the following declarations are equivalent:\n\n```ruby\nrule(:NUMBER, /\\d+/)\nrule(:initial_NUMBER, /\\d+/)\n```\n\nBy default, lexing operates in the `:initial` state. All the normally defined tokens are included in this state. During lexing, if you wish to change the lexing state, use the `begin` method. For example:\n\n```ruby\nrule(:begin_foo, /start_foo/) do |lexer, token|\n  lexer.begin(:foo)\nend\n```\n\nTo get out of a state you can use `begin` like so:\n\n```ruby\nrule(:foo_end, /end_foo/) do |lexer, token|\n  lexer.begin(:initial)\nend\n```\n\nFor more complex scenarios with states you can use the `push_state` and `pop_state` methods. For example:\n\n```ruby\nrule(:begin_foo, /start_foo/) do |lexer, token|\n  lexer.push_state(:foo)\nend\n\nrule(:foo_end, /end_foo/) do |lexer, token|\n  lexer.pop_state(:foo)\nend\n```\n\nAssume you are parsing HTML and you want to ignore anything inside a comment. 
Here is how you may use lexer states to do this:\n\n```ruby\nclass MyLexer \u003c Lex::Lexer\n  tokens(\n    :TAG,\n    :ATTRIBUTE\n  )\n\n  # Declare the states\n  states(htmlcomment: :exclusive)\n\n  # Enter html comment\n  rule(:begin_htmlcomment, /\u003c!--/) do |lexer, token|\n    lexer.begin(:htmlcomment)\n  end\n\n  # Leave html comment\n  rule(:htmlcomment_end, /--\u003e/) do |lexer, token|\n    lexer.begin(:initial)\n  end\n\n  error :htmlcomment do |lexer, token|\n    lexer.logger.info \"Ignoring character #{token.value}\"\n  end\n\n  ignore :htmlcomment, \" \\t\\n\"\n\n  ignore \" \\t\"\nend\n```\n\n### 1.14 Debugging\n\nTo run the lexer in debug mode, pass in the `:debug` flag set to `true`.\n\n```ruby\nMyLexer.new(debug: true)\n```\n\n## Contributing\n\n1. Fork it ( https://github.com/piotrmurach/lex/fork )\n2. Create your feature branch (`git checkout -b my-new-feature`)\n3. Commit your changes (`git commit -am 'Add some feature'`)\n4. Push to the branch (`git push origin my-new-feature`)\n5. Create a new Pull Request\n\n## Code of Conduct\n\nEveryone interacting in the Lex project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/piotrmurach/lex/blob/master/CODE_OF_CONDUCT.md).\n\n## Copyright\n\nCopyright (c) 2015 Piotr Murach. See LICENSE for further details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpiotrmurach%2Flex","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpiotrmurach%2Flex","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpiotrmurach%2Flex/lists"}