{"id":19811267,"url":"https://github.com/cosmictoast/patok","last_synced_at":"2026-03-05T16:42:34.129Z","repository":{"id":94771007,"uuid":"441311099","full_name":"CosmicToast/patok","owner":"CosmicToast","description":"Lua Pattern Tokenizer","archived":false,"fork":false,"pushed_at":"2022-10-13T12:59:35.000Z","size":164,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-04-26T19:49:44.481Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Lua","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"unlicense","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CosmicToast.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-12-23T22:40:12.000Z","updated_at":"2025-02-10T03:57:54.000Z","dependencies_parsed_at":"2023-03-13T16:57:35.310Z","dependency_job_id":null,"html_url":"https://github.com/CosmicToast/patok","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CosmicToast%2Fpatok","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CosmicToast%2Fpatok/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CosmicToast%2Fpatok/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CosmicToast%2Fpatok/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CosmicToast","download_url":"https://codeload.github.com/CosmicToast/patok/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"
https://github.com","kind":"github","repositories_count":251847828,"owners_count":21653582,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-12T09:25:33.545Z","updated_at":"2026-03-05T16:42:34.086Z","avatar_url":"https://github.com/CosmicToast.png","language":"Lua","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Patok\nThe Lua Pattern Tokenizer.\n\nPatok is an efficient tokenizer that lexes based on patterns you provide.\nIt does this while avoiding slow copy operations, making it reasonably efficient.\nWhat's more is it's written in pure lua!\n\n## Usage\n\n### Constructing the lexer\nRequiring patok gives back a constructor function.\nYou call it (often repeatedly) with lists,\nwhere the key is the name of the pattern (which will come in handy later)\nand the value is the pattern itself.\nOnce you're done defining lexemes, you call it again with no arguments.\n\nConsider the following basic example:\n```lua\npatok = require 'patok'\nlexer = patok {\n\tnumber = '%d+',\n\tword   = '%w+',\n}()\n```\n\nNote that patterns will be tried *in order*.\nHowever, due to how lua's lists work, lists are unordered.\nThis is why calling the constructor multiple times can be useful.\n\nConsider the following example:\n```lua\nlexer = patok {\n\tfoo = 'foo',\n\tword = '%w+',\n}()\n```\n\nThis lexer may interpret \"foo\" as either a foo token or a word token - it is ambiguous.\nWe can make it unambiguous by enforcing an order, as such:\n```lua\nlexer = patok {\n\tfoo = 'foo',\n}{\n\tword = '%w+',\n}()\n```\n\nNow \"foo\" will always be a \"foo\" 
lexeme, and never a word.\n\n### Using the lexer\nOnce your lexer is constructed, you have two methods of interest: reset and next.\nReset will set your lexer up to lex a particular string.\nNext will get the next token of the currently set string.\n\nLet's make a simple example, with comments demonstrating the outputs:\n```lua\nlexer = patok {\n\tplus  = '%+',\n\tminus = '%-',\n\tstar  = '%*',\n\tslash = '/',\n\tdigit = '%d+',\n\tws    = '%s+',\n}()\nlexer:reset '10 + 15'\nlexer:next() -- {start=1, stop=2, type=\"digit\", value=\"10\"}\nlexer:next() -- {start=3, stop=3, type=\"ws\", value=\" \"}\nlexer:next() -- {start=4, stop=4, type=\"plus\", value=\"+\"}\nlexer:next() -- {start=5, stop=5, type=\"ws\", value=\" \"}\nlexer:next() -- {start=6, stop=7, type=\"digit\", value=\"15\"}\nlexer:next() -- nil: feed more/different data\n```\n\nAs per the above example,\na return value of nil means that whatever follows is not a token.\nIt may mean end of input,\nor merely that whatever follows isn't tokenizable with the given ruleset.\nHere's an example of the latter:\n\n```lua\nlexer = patok {\n\ta = 'a',\n\tb = 'b',\n}()\nlexer:reset 'ac'\nlexer:next() -- {start=1, stop=1, type='a', value='a'}\nlexer:next() -- nil: 'c' remains in the input, but no pattern matches it\n```\n\n### Parsing\nIf you just wanted a standalone lexer/tokenizer, that's all you need to know!\nMost people, however, need a parser to go along with their lexer to make it useful.\nAlong with patok comes piecemeal:\na naive parser combinator made to work with patok.\n\nNote that unlike patok, piecemeal is not particularly efficient,\nnor capable of streaming input.\nIf you have a better patok-compatible option to use, please use that instead!\nIf you make such a parser, feel free to contact me at \u003ctoast+git@toast.cafe\u003e,\nand I will add it to this section.\n\nThat said,\npiecemeal is more than sufficient for many use-cases where lua itself is sufficient.\nPlease read the next section to find out how to use 
it.\n(There will be no further information on patok itself for the rest of this file.)\n\n## Piecemeal\nPiecemeal is the default parser for patok.\nIf you have access to a different parser, chances are it will work better.\n\nPiecemeal is a recursive descent parser combinator.\nThat means that you are provided with a set of parser-generating functions.\nYou compose and combine them into parsers, which you then compose and combine further.\nThe end result is a parser that parses your entire document on demand.\n\n### Built-Ins\nPiecemeal provides the following built-in functions:\n* lexeme: lexeme looks for a \"type\" of token produced by patok\n* value: value looks for an exact match of a token's text\n* eof: only matches at the end of (lexed) input exactly once\n* all: takes a list of parsers, producing a parser for all of them in a row\n* alt: takes a list of parsers,\n  producing a parser that looks for any one of its inputs (in order)\n* opt: takes a parser and makes it optional\n* plus: takes a parser and allows it to occur one or more times in a row\n  (it's the `+` operator in regex/PEG)\n* star: equivalent to optional plus (it's the `*` operator in regex/PEG)\n* postp: takes a parser and a function, returns a parser\n  whose output transforms the output of the input parser using the provided function\n\nFinally, piecemeal provides the \"parse\" function, which takes the text to parse,\nthe patok (or api-compatible) lexer, and the parser to parse the text with.\n\nThis may be confusing,\nso let's look through a commented example.\n\n### Example Grammar\nThis example grammar will be able to handle mathematical expressions.\nFor the sake of brevity, we'll only implement addition and multiplication.\nDo know that you can extend this approach to cover all of math, however.\n\nFirst, let's define our lexer.\n```lua\nlexer = patok {\n\top  = '[+*]',\n\tnum = '%d+',\n\tws  = '%s+',\n}()\n```\n\nWe could have also made special lexemes for `+` and `*` 
individually.\nHowever, this way, we can demonstrate both `pm.lexeme` and `pm.value`.\n\nLet's prepare some lexing parsers ahead of time.\n```lua\nlex = {\n\tplus  = pm.value '+',\n\tstar  = pm.value '*',\n\tdigit = pm.postp(pm.lexeme 'num', function (d) return tonumber(d.value) end),\n\tspace = pm.opt(pm.lexeme 'ws'),\n}\n```\n\nIn that snippet, there are two things to note.\nFirst, we made the whitespace parser optional, as our grammar does not have significant whitespace.\nSecondly, we used the potentially confusing `postp` function on digit.\n\nNormally, a terminal parser (i.e. `lexeme` and `value`) will return the bare token, as given to it by patok.\nHowever, we generally don't want a huge layered list of tokens as the output.\nPostp allows us to perform postprocessing operations on whatever data the input parser gives out.\n\nIn this case, we know the input data will be a patok token.\nWe're only really interested in the actual number, though.\nSo we return the numeric representation of the token.\nWe know it already looks like a number, because of our lexer pattern.\nThis means that other parsers that consume our digit parser will be able to simply work with numbers.\n\nTo make this easier, we'll write a convenience function.\n```lua\nfunction findnums (d, acc)\n\tlocal out = acc or {}\n\tfor _, v in ipairs(d) do\n\t\tif type(v) == 'number' then\n\t\t\ttable.insert(out, v)\n\t\telseif type(v) == 'table' then\n\t\t\tfindnums(v, out)\n\t\tend\n\tend\n\treturn out\nend\n```\n\nA few things to note here.\nFirst, note that we iterate over ipairs.\nIf we iterated over pairs, we would also catch the `start` and `stop` indices of lexer tokens.\nSecondly, note that we use the fact that tables are passed by reference in lua to allow for in-line accumulation.\n\nNow that that's done, we can define our primary parsers.\nThe grammar looks something like so:\n```\nexpr \u003c- add\nadd \u003c- mult ('+' mult)*\nmult \u003c- digit ('*' digit)*\n```\n\nThis makes sure that 
multiplication happens before addition.\nWe can add subtraction and division in-line by using alternatives for the signs, and switching on them in the postprocessing.\n\nLet's implement mult first.\n```lua\nmult = pm.postp(\n\tpm.all(lex.digit, pm.star(pm.all(lex.space, lex.star, lex.space, lex.digit))),\n\tfunction (d)\n\t\tlocal acc = 1\n\t\tfor _, v in ipairs(findnums(d)) do\n\t\t\tacc = acc * v\n\t\tend\n\t\treturn acc\n\tend)\n```\n\nThe parser component of the postprocessor is equivalent to the grammar above.\nIn the postprocessing function, we take advantage of the convenience function we wrote.\nWe simply multiply all of the bare digits (which we know are consumed as a part of this sub-expression) together!\nImportantly, we just return a number again, since that's what we're really interested in.\n\nWe can write add using the same method.\n```lua\nadd = pm.postp(\n\tpm.all(mult, pm.star(pm.all(lex.space, lex.plus, lex.space, mult))),\n\tfunction (d)\n\t\tlocal acc = 0\n\t\tfor _, v in ipairs(findnums(d)) do\n\t\t\tacc = acc + v\n\t\tend\n\t\treturn acc\n\tend)\n```\n\nNote that we can use mult here directly - it's a valid parser like any other.\n\nFinally, we can define expr, though it's technically optional.\n```lua\nexpr = add\n```\n\nAnd now we can use the parser!\n```lua\nout, endindex, finalindex = pm.parse(\"10 + 5 * 2 + 10\", lexer, expr) -- 14, 30\n```\n\n### Missing\nIn the above sample, we did not end up using the `alt` or `plus` generators.\n`alt` is related to `all`.\nWhere `all` requires all of its arguments to succeed in order, `alt` will try them all in order, but only one has to succeed.\n`plus` is related to `star`.\nWith `star`, zero matches are accepted.\n`plus` works the same way, except at least one match is 
required.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcosmictoast%2Fpatok","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcosmictoast%2Fpatok","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcosmictoast%2Fpatok/lists"}