{"id":22396258,"url":"https://github.com/hoehrmann/demo-parselov","last_synced_at":"2025-07-31T12:30:59.506Z","repository":{"id":24175507,"uuid":"27566101","full_name":"hoehrmann/demo-parselov","owner":"hoehrmann","description":"Demos of a formal language parsing system","archived":false,"fork":false,"pushed_at":"2015-02-04T05:34:55.000Z","size":4000,"stargazers_count":6,"open_issues_count":0,"forks_count":2,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-04-17T21:16:49.523Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hoehrmann.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-12-05T00:06:23.000Z","updated_at":"2024-03-29T21:57:08.000Z","dependencies_parsed_at":"2022-08-22T12:01:03.423Z","dependency_job_id":null,"html_url":"https://github.com/hoehrmann/demo-parselov","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hoehrmann%2Fdemo-parselov","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hoehrmann%2Fdemo-parselov/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hoehrmann%2Fdemo-parselov/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hoehrmann%2Fdemo-parselov/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hoehrmann","download_url":"https://codeload.github.com/hoehrmann/demo-parselov/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":
"github","repositories_count":228242625,"owners_count":17890481,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-05T06:07:33.689Z","updated_at":"2024-12-05T06:07:35.684Z","avatar_url":"https://github.com/hoehrmann.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"This repository contains samples that demonstate parts of a formal\r\nlanguage parsing system. There is a parser generator that takes a\r\nformal grammar and generates a data file, and there is a sample\r\nsimulator that takes such a data file and a user-supplied document\r\nand emits information about the syntactic structure of the document\r\naccording to the formal grammar. The system has an unusual design\r\nand unusual characteristics:\r\n\r\n* **Any context-free grammar can be used**. Ambiguity, left recursion,\r\nright recursion, infinite look-ahead, cycles in production rules,\r\nproductions that match the empty string, Unicode, the system is not\r\ntroubled by any of these. Of course, if you need e.g. ambiguities\r\nresolved, you have to implement that yourself as post-processing step\r\n(parsers report all possible parse trees in compact form). There are\r\nhowever limits on the overall complexity of the grammar.\r\n* **No human input is needed**. The system only needs a grammar that\r\ncan typically be copied from data format specifications; programs that\r\nparse documents can be grammar-agnostic and generic. The system does\r\nnot generate programming language source code files where you have to\r\nfill in gaps. 
You also do not have to modify the grammar to accommodate\r\nambiguity resolution or other algorithms.\r\n* **Language-independent, data-driven parsing**. Grammars are transformed\r\ninto tabular data encoded in simple JSON documents amenable to machine\r\nprocessing. The data files can be shared, re-used, analysed, transformed,\r\ncompiled, combined, and more, in a portable manner.\r\n* **Linear parsing time and memory use**. For the low-level parser,\r\nparsing time and memory use are O(n) in the size of input documents\r\nand independent of the grammar. For large input documents it is trivial\r\nto make a parser that uses O(1) main memory and writes intermediate\r\nresults to disk. One caveat: for recursive grammars, parser output\r\nrequires post-processing with possibly non-linear complexity for some\r\napplications.\r\n* **Credible performance**. It's not just linear; the constants are very\r\nsmall as well. An optimised parser will do nothing but simple arithmetic,\r\ntable lookups, and memory writes to store results, and should not do\r\nmuch worse than typical regex engines. Note that parsing is branch-free\r\nexcept for bound iterations, and beyond loading the statically prepared\r\nparsing data there are no startup or other initialisation costs.\r\n* **Security and standards compliance**. Parser construction does not\r\ndepend on human input and is thus not subject to human error. The data\r\ntables describe finite state machines that can be exhaustively\r\nsimulated or analysed verifying that there are no chances for memory\r\ncorruption or other problems in dependent code for all inputs. When\r\nthe input grammar comes from a formal standard, there is no chance to\r\nmiss edge cases ensuring compliance.\r\n* **Robustness and versatility**. The parsing system is part of a\r\ndivide-and-conquer strategy to parsing. A basic parser based on it\r\njust solves part of the problem and higher-level applications can and\r\nhave to perform additional work. 
That enables and encourages a coding\r\nstyle amenable to change.\r\n\r\nThe typical use of the generated data files is from a parser that makes\r\ntwo passes over an input document and then describes all possible parse\r\ntrees as a series of edge sets of a larger parse graph. The following\r\nexample illustrates this.\r\n\r\n## Table of Contents\r\n\r\n- [Grammar-agnostic low-level parser](#grammar-agnostic-low-level-parser)\r\n  - [Running the demo script](#running-the-demo-script)\r\n- [Higher-level parser](#higher-level-parser)\r\n- [Merging regular paths](#merging-regular-paths)\r\n- [Pairing recursions in parallel](#pairing-recursions-in-parallel)\r\n- [Combination of data files and parallel simulation](#combination-of-data-files-and-parallel-simulation)\r\n- [Handling grammars based on tokens](#handling-grammars-based-on-tokens)\n- [Handling data structures other than strings](#handling-data-structures-other-than-strings)\n- [Limitations](#limitations)\r\n- [Sample applications](#sample-applications)\r\n  - [Prefixing rulenames in ABNF grammars](#prefixing-rulenames-in-abnf-grammars)\r\n  - [Analysing data format test suites for completeness](#analysing-data-format-test-suites-for-completeness)\r\n  - [Generating random documents](#generating-random-documents)\r\n  - [Syntax highlighting](#syntax-highlighting)\r\n  - [Syntax auto-completion](#syntax-auto-completion)\r\n  - [Extracting ABNF grammar rules from RFCs](#extracting-abnf-grammar-rules-from-rfcs)\r\n  - [Converting grammars to regular expressions](#converting-grammars-to-regular-expressions)\r\n- [License](#license)\r\n\r\n## Grammar-agnostic low-level parser\r\n\r\nThe following code is a complete and runnable NodeJS application that\r\nreads a generated parsing data file, as they are included in this\r\nrepository, and a document, analyses the syntactic structure of the\r\ndocument, and then generates a GraphViz-compatible `.dot` file of the\r\nparse graph (for simple grammars or simple inputs, 
this is equivalent\r\nto a linearised representation of the \"parse tree\" of the document\r\nwith respect to the relevant grammar).\r\n\r\n```js\r\nvar fs = require('fs');\r\nvar zlib = require('zlib');\r\nvar util = require('util');\r\n\r\nvar as_json = process.argv[4] == \"-json\";\r\n\r\nvar input = fs.readFileSync(process.argv[3], {\r\n  \"encoding\": \"utf-8\"\r\n});\r\n\r\nzlib.gunzip(fs.readFileSync(process.argv[2]), function(err, buf) {\r\n\r\n  var g = JSON.parse(buf);\r\n  \r\n  ///////////////////////////////////////////////////////////////////\r\n  // Typical grammars do not distinguish between all characters in\r\n  // their alphabet, or the alphabet of the input to be parsed, like\r\n  // all of Unicode. So in order to save space in the transition\r\n  // tables, input symbol are mapped into a smaller set of symbols.\r\n  ///////////////////////////////////////////////////////////////////\r\n  var s = [].map.call(input, function(ch) {\r\n    return g.input_to_symbol[ ch.charCodeAt(0) ] || 0\r\n  });\r\n\r\n  ///////////////////////////////////////////////////////////////////\r\n  // The mapped input symbols are then fed to a deterministic finite\r\n  // state automaton. The sequence of states is stored for later use.\r\n  // The initial state of the automaton is always `1` by convention.\r\n  ///////////////////////////////////////////////////////////////////\r\n  var fstate = 1;\r\n  var forwards = [fstate].concat(s.map(function(i) {\r\n    return fstate = g.forwards[fstate].transitions[i] || 0;\r\n  }));\r\n  \r\n  ///////////////////////////////////////////////////////////////////\r\n  // An input string does not necessarily match what the parser is\r\n  // expecting. When the whole input is read, and the automaton is\r\n  // not in an accepting state, then either there is an error in the\r\n  // input, or the input is incomplete. The converse does not necess-\r\n  // arily hold. 
For recursive grammars the automaton might be in an\r\n  // accepting state even though the input does not match it.\r\n  ///////////////////////////////////////////////////////////////////\r\n  if (g.forwards[fstate].accepts == \"0\") {\r\n    // Compare as strings so that a dead state is found whether the\r\n    // data file encodes states as numbers or as strings.\r\n    process.stderr.write(\"failed around \" +\r\n      forwards.map(String).indexOf('0'));\r\n    return;\r\n  }\r\n\r\n  ///////////////////////////////////////////////////////////////////\r\n  // The output of the first deterministic finite state transducer is\r\n  // then passed through a second one. At the end of the string it\r\n  // knows exactly which paths through the graph have led to a match\r\n  // and can now trace them back to eliminate matches that failed.\r\n  // The output of the second deterministic finite state transducer\r\n  // is a concatenation of edges to be added to the parse graph. As\r\n  // before, it starts in state `1` by convention.\r\n  ///////////////////////////////////////////////////////////////////\r\n  var bstate = 1;\r\n  var edges = forwards.reverse().map(function(i) {\r\n    return bstate = g.backwards[bstate].transitions[i] || 0;\r\n  }).reverse();\r\n\r\n  ///////////////////////////////////////////////////////////////////\r\n  // The `edges` list is just a list of integers, each identifying a\r\n  // set of edges. This is useful for post-processing operations, but\r\n  // typical applications will need to resolve them to build a graph\r\n  // for traversal. 
As an example, this function will print out the\r\n  // whole parse graph as a GraphViz `dot` file that can be rendered.\r\n  ///////////////////////////////////////////////////////////////////\r\n  write_edges_in_graphviz_dot_format(g, edges);\r\n});\r\n```\r\n\r\nThe code to turn lists of edge set identifiers into a GraphViz file:\r\n\r\n```js\r\nfunction write_edges_in_graphviz_dot_format(g, edges) {\r\n\r\n  ///////////////////////////////////////////////////////////////////\r\n  // An edge consists of two vertices and every vertex can have some\r\n  // properties. Among them are a type and a label. Important types\r\n  // include \"start\" and \"final\" vertices. They signify the beginning\r\n  // and end of named captures, and pairs of them correspond to non-\r\n  // terminal symbols in a grammar. This function combines type and\r\n  // label (the name of the non-terminal) into a vertex label.\r\n  ///////////////////////////////////////////////////////////////////\r\n  var print_label = function(offset, v) {\r\n    process.stdout.write(util.format('\"%s,%s\"[label=\"%s %s\"];\\n',\r\n      offset,\r\n      v,\r\n      g.vertices[v].type || \"\",\r\n      g.vertices[v].text || v\r\n    ));\r\n  };\r\n\r\n  process.stdout.write(\"digraph {\\n\");\r\n\r\n  edges.forEach(function(id, ix) {\r\n    /////////////////////////////////////////////////////////////////\r\n    // There are two kinds of edges associated with edge identifiers.\r\n    // \"Null\" edges represent transitions that do not consume input\r\n    // symbols. 
They are needed to support nesting of non-terminals,\r\n    // non-terminals that match the empty string, among other things.\r\n    /////////////////////////////////////////////////////////////////\r\n    g.null_edges[id].forEach(function(e) {\r\n      process.stdout.write(util.format('\"%s,%s\" -\u003e \"%s,%s\";\\n',\r\n        ix, e[0], ix, e[1]));\r\n      print_label(ix, e[0]);\r\n      print_label(ix, e[1]);\r\n    });\r\n    \r\n    /////////////////////////////////////////////////////////////////\r\n    // \"Char\" edges represent transitions that go outside of a set of\r\n    // edges (and into the next) because an input symbol has to be\r\n    // read to continue on their path. In other words, they are what\r\n    // connects the individual edge sets to one another.\r\n    /////////////////////////////////////////////////////////////////\r\n    g.char_edges[id].forEach(function(e) {\r\n      process.stdout.write(util.format('\"%s,%s\" -\u003e \"%s,%s\";\\n',\r\n        ix, e[0], ix + 1, e[1]));\r\n      print_label(ix, e[0]);\r\n      print_label(ix + 1, e[1]);\r\n    });\r\n  });\r\n\r\n  ///////////////////////////////////////////////////////////////////\r\n  // If there are cycles in the parse graph at the first or the last\r\n  // position, it may  be necessary to know which of the vertices\r\n  // stands for the start symbol of the grammar. 
They are available:\r\n  ///////////////////////////////////////////////////////////////////\r\n  var start_vertex = g.start_vertex;\r\n  var final_vertex = g.final_vertex;\r\n  \r\n  process.stdout.write(\"}\\n\");\r\n}\r\n```\r\n\r\n### Running the demo script\r\n\r\nIn order to run this demo, you can use something like the following.\r\nNote that you need the GraphViz `dot` utility and NodeJS in addition\r\nto `git`.\r\n\r\n```\r\n% git ...\r\n% cd ...\r\n% node demo-parselov.js rfc4627.JSON-text.json.gz ex.json \u003e ex.dot\r\n% dot -Tsvg -Grankdir=tb ex.dot \u003e ex.svg\r\n```\r\n\r\nThe `ex.json` file contains just the two characters `[]` and it might\r\nbe useful to take a look at the [RFC 4627 JSON grammar](https://tools.ietf.org/html/rfc4627#section-2)\r\nto understand the result. It should look like this:\r\n\r\n```\r\n  +-------------------+     +-------------------+\r\n0 |  start JSON-text  |  +-\u003e|  start end-array  | 1\r\n  +-------------------+  |  +-------------------+\r\n            v            |            v\r\n  +-------------------+  |  +-------------------+\r\n0 |    start array    |  |  |     start ws      | 1\r\n  +-------------------+  |  +-------------------+\r\n            v            |            v\r\n  +-------------------+  |  +-------------------+\r\n0 | start begin-array |  |  |     final ws      | 1\r\n  +-------------------+  |  +-------------------+\r\n            v            |            v\r\n  +-------------------+  |  +-------------------+\r\n0 |     start ws      |  |  |     start ws      | 2\r\n  +-------------------+  |  +-------------------+\r\n            v            |            v\r\n  +-------------------+  |  +-------------------+\r\n0 |     final ws      |  |  |     final ws      | 2\r\n  +-------------------+  |  +-------------------+\r\n            v            |            v\r\n  +-------------------+  |  +-------------------+\r\n1 |     start ws      |  |  |  final end-array  | 2\r\n  
+-------------------+  |  +-------------------+\r\n            v            |            v\r\n  +-------------------+  |  +-------------------+\r\n1 |     final ws      |  |  |    final array    | 2\r\n  +-------------------+  |  +-------------------+\r\n            v            |            v\r\n  +-------------------+  |  +-------------------+\r\n1 | final begin-array |--+  |  final JSON-text  | 2\r\n  +-------------------+     +-------------------+\r\n\r\n```\r\n\r\nThe numbers next to the nodes indicate the offset into the list of\r\nedges which correspond to the offset into the input data except for\r\nterminal edges associated with the accepting state at the end of the\r\ninput. The `ws` nodes in the graph are there because the RFC 4627\r\ngrammar allows `ws` nodes to match the empty string, and since they\r\nare mandatory, they appear in the graph even though there is no\r\nwhite space in the input document. Applications not interested in the\r\nnodes, or other uninteresting nodes like `begin-array`, can remove\r\nthem from the edge sets making parse graphs considerably smaller.\r\n\r\n## Higher-level parser\r\n\r\nThe parsing process shown in the previous section can be thought of\r\nas a pre-processing step that makes virtually all decisions that can\r\nbe made by a finite state automaton for a higher-level parser. Some\r\ndecisions still have to be made, for sufficiently complex grammars,\r\nto determine whether the input actually matches the grammar, namely\r\nwhether there is a path through the parse graph that balances all\r\n`start` points with their corresponding `end` points. Any such path\r\nrepresents a valid parse tree for the input with respect to the\r\ngrammar the static parsing data is based on.\r\n\r\nFinding such a path is a simple matter of traversing the graph from\r\na `start_vertex` to a `final_vertex`. The difficulty is in choosing\r\nvertices whenever a given vertex has multiple successors. 
Picking a\r\nwrong one wastes computing resources, and parsing algorithms differ\r\nin how they avoid wasting resources. The pre-processing step in the\r\nprevious section trades high static memory use for convenience and\r\nspeed. It generally leaves very few wrong vertices to pick. As an\r\nexample, this demo includes the static data for RFC 4627 `JSON-text`.\r\nIn over 90% of the edges therein, all vertices have only a single\r\nsuccessor, and since the grammar is ambiguous, some vertices with\r\nmultiple successors actually represent genuine parsing alternatives.\r\n\r\nThe code below implements a simple generic backtracking traversal\r\nthrough the parse graph and going through the tree, the parsers will\r\ngenerate a simple JSON-based representation of the parse tree. It\r\nprocesses edges from root of the graph to the bottom. Since the list\r\nof edges is built the other way around, it could also start at the\r\nbottom, in which case this code could run alongside building the\r\nlist of edges. It is important to understand that a vertex in a set\r\nof edges corresponds to just a couple of instructions that are known\r\nindependently of the input; they can easily be compiled to a series\r\nof machine instructions. Also note that this is just a demonstration\r\nof what could be done after the pre-processing step using the static\r\ndata files. It is not considered part of what is discussed at the\r\nbeginning of this document.\r\n\r\n```js\r\nfunction generate_json_formatted_parse_tree(g, edges) {\r\n\r\n  var parsers = [{\r\n    output: \"\",\r\n    offset: 0,\r\n    vertex: g.start_vertex,\r\n    stack: []\r\n  }];\r\n\r\n  ///////////////////////////////////////////////////////////////////\r\n  // To recap, the result of the initial parse is a list of edge sets\r\n  // each of which contains two different kinds of sets of edges. The\r\n  // vertices encoded therein can have multiple successors. 
They come\r\n  // from ambiguity and recursion in the input grammar. In order to\r\n  // exhaustively search for a parse tree within the parse graph, it\r\n  // may be necessary to explore all alternative successors of a ver-\r\n  // tex. So whenever alternatives are encountered, they are recorded\r\n  // in the `parsers` array, and the following algorithm continues\r\n  // until either a parse tree has been found or until all alternatives\r\n  // are exhausted. So the `while` loop takes care of the latter.\r\n  ///////////////////////////////////////////////////////////////////\r\n  while (parsers.length) {\r\n    var p = parsers[0];\r\n\r\n    if (g.vertices[p.vertex].type == \"start\") {\r\n      ///////////////////////////////////////////////////////////////\r\n      // Finding a parse tree within a parse graph requires matching\r\n      // all starting points of non-terminal symbols to corresponding\r\n      // end points so boundaries of a match are properly balanced.\r\n      // When a starting point is found, it is pushed on to a stack.\r\n      ///////////////////////////////////////////////////////////////\r\n      p.stack.push({\"vertex\": p.vertex, \"offset\": p.offset});\r\n\r\n      var indent = p.stack.map(function(){ return '  '; }).join(\"\");\r\n\r\n      p.output += \"\\n\" + indent;\r\n      p.output += '[' + JSON.stringify(g.vertices[p.vertex].text)\r\n        .replace(/,/g, '\\\\u002c') + ', [';\r\n    }\r\n    \r\n    if (g.vertices[p.vertex].type == \"final\") {\r\n      ///////////////////////////////////////////////////////////////\r\n      // When there is an opportunity to close the match that is on\r\n      // the top of the stack, i.e., when a `final` vertex is found\r\n      // on the path that is currently being explored, the vertex can\r\n      // be compared to the stack's top element, and if they match,\r\n      // we can move on to a successor vertex. 
On the other hand, if\r\n      // the stack is empty, the code has taken a wrong turn. It may\r\n      // be better to catch this condition using a sentinel value on\r\n      // the stack; vertex `0` is reserved for such uses.\r\n      ///////////////////////////////////////////////////////////////\r\n      if (p.stack.length == 0) {\r\n        parsers.shift();\r\n        continue;\r\n      }\r\n      \r\n      var top = p.stack.pop();\r\n\r\n      ///////////////////////////////////////////////////////////////\r\n      // The `start` vertices know which `final` vertex they match\r\n      // with, and if the top of the stack is not it, then the whole\r\n      // parser is dropped, and the loop will try an alternative that\r\n      // has been recorded earlier, if any.\r\n      ///////////////////////////////////////////////////////////////\r\n      if (p.vertex != g.vertices[top.vertex][\"with\"]) {\r\n        parsers.shift();\r\n        continue;\r\n      }\r\n\r\n      p.output += '], ' + top.offset + ', ' + p.offset + '],';\r\n    }\r\n        \r\n    /////////////////////////////////////////////////////////////////\r\n    // For a successful match of the whole input, there are three\r\n    // conditions to be met: the parser must have reached the end of\r\n    // the list of edges, which corresponds to the end of the input;\r\n    // there must not be open matches left on the stack, and the ver-\r\n    // tex at the end has to be the final vertex of the whole graph. 
It\r\n    // is possible that there is still a loop around the final vertex\r\n    // matching the empty string, but we ignore it here.\r\n    /////////////////////////////////////////////////////////////////\r\n    if (g.final_vertex == p.vertex) {\r\n      if (p.offset + 1 \u003e= edges.length)\r\n        if (p.stack.length == 0)\r\n          return p.output.replace(/,\\]/g, ']').replace(/,$/, '');\r\n    }\r\n    \r\n    if (p.offset \u003e= edges.length) {\r\n      parsers.shift();\r\n      continue;\r\n    }\r\n\r\n    /////////////////////////////////////////////////////////////////\r\n    // Without a match and without a parsing failure, the path under\r\n    // consideration can be explored further. For that the successors\r\n    // of the current vertex have to be retrieved from static data.\r\n    /////////////////////////////////////////////////////////////////\r\n    var cs = g.char_edges[ edges[p.offset] ].filter(function(e) {\r\n      return e \u0026\u0026 e[0] == p.vertex;\r\n    }).map(function(e) {\r\n      return { successor: e[1], type: \"char\" };\r\n    });\r\n\r\n    var ns = g.null_edges[ edges[p.offset] ].filter(function(e) {\r\n      return e \u0026\u0026 e[0] == p.vertex;\r\n    }).map(function(e) {\r\n      return { successor: e[1], type: \"null\" };\r\n    });\r\n    \r\n    var successors = ns.concat(cs);\r\n    \r\n    /////////////////////////////////////////////////////////////////\r\n    // Vertices can have an associated `sort_key` to guide the choice\r\n    // among alternative successors. A common disambiguation strategy\r\n    // is to pick the \"first\", \"left-most\" alternative, in which case\r\n    // the `sort_key` corresponds to the position of grammar constructs\r\n    // in the grammar the parsing data is based on. 
There are other,\r\n    // possibly more complex, strategies that can be used instead.\r\n    /////////////////////////////////////////////////////////////////\r\n    successors.sort(function(a, b) {\r\n      return (g.vertices[a.successor].sort_key || 0) -\r\n             (g.vertices[b.successor].sort_key || 0);\r\n    });\r\n    \r\n    /////////////////////////////////////////////////////////////////\r\n    // It is possible that a vertex has no successors at this point,\r\n    // even if there are no errors in the parsing data. In such cases\r\n    // this parser has failed to match and alternatives are explored.\r\n    /////////////////////////////////////////////////////////////////\r\n    if (successors.length \u003c 1) {\r\n      parsers.shift();\r\n      continue;\r\n    }\r\n\r\n    /////////////////////////////////////////////////////////////////\r\n    // Sorting based on the `sort_key` leaves the best successor at\r\n    // the first position. The current parser will continue with it.\r\n    /////////////////////////////////////////////////////////////////\r\n    var chosen = successors.shift();\r\n    \r\n    /////////////////////////////////////////////////////////////////\r\n    // All other successors, if there are any, are turned into start\r\n    // positions for additional parsers, that may be used instead of\r\n    // the current one in case it ultimately fails to match.\r\n    /////////////////////////////////////////////////////////////////\r\n    successors.forEach(function(s) {\r\n      parsers.push({\r\n        output: p.output,\r\n        offset: s.type == \"char\" ? 
p.offset + 1 : p.offset,\r\n        vertex: s.successor,\r\n        stack: p.stack.slice()\r\n      });\r\n    });\r\n\r\n    /////////////////////////////////////////////////////////////////\r\n    // Finally, if the successor vertex is taken from `char_edges`,\r\n    // meaning an input symbol has been consumed to reach it, the\r\n    // parser can move on to the next edge and process the successor.\r\n    /////////////////////////////////////////////////////////////////\r\n    if (chosen.type == \"char\") {\r\n      p.offset += 1;\r\n    }\r\n\r\n    p.vertex = chosen.successor;\r\n  }\r\n}\r\n```\r\n\r\nRunning this code with the RFC 4627 data file and `{\"a\\ffe\":[]}` as\r\ninput, the result is the following JSON document. You can run this\r\nyourself using the `-json` switch, something along the lines of:\r\n\r\n```\r\n% git ...\r\n% cd ...\r\n% node demo-parselov.js example.json.gz example.data -json\r\n```\r\n\r\n```js\r\n[\"JSON-text\", [\r\n  [\"object\", [\r\n    [\"begin-object\", [\r\n      [\"ws\", [], 0, 0],\r\n      [\"ws\", [], 1, 1]], 0, 1],\r\n    [\"member\", [\r\n      [\"string\", [\r\n        [\"quotation-mark\", [], 1, 2],\r\n        [\"char\", [\r\n          [\"unescaped\", [], 2, 3]], 2, 3],\r\n        [\"char\", [\r\n          [\"escape\", [], 3, 4]], 3, 5],\r\n        [\"char\", [\r\n          [\"unescaped\", [], 5, 6]], 5, 6],\r\n        [\"char\", [\r\n          [\"unescaped\", [], 6, 7]], 6, 7],\r\n        [\"quotation-mark\", [], 7, 8]], 1, 8],\r\n      [\"name-separator\", [\r\n        [\"ws\", [], 8, 8],\r\n        [\"ws\", [], 9, 9]], 8, 9],\r\n      [\"value\", [\r\n        [\"array\", [\r\n          [\"begin-array\", [\r\n            [\"ws\", [], 9, 9],\r\n            [\"ws\", [], 10, 10]], 9, 10],\r\n          [\"end-array\", [\r\n            [\"ws\", [], 10, 10],\r\n            [\"ws\", [], 11, 11]], 10, 11]], 9, 11]], 9, 11]], 1, 11],\r\n    [\"end-object\", [\r\n      [\"ws\", [], 11, 11],\r\n      [\"ws\", [], 12, 12]], 
11, 12]], 0, 12]], 0, 12]\r\n```\r\n\r\nUsing the RFC 3986 data file and the string `example://0.0.0.0:23#x` gives:\r\n\r\n```js\r\n[\"URI\", [\r\n  [\"scheme\", [\r\n    [\"ALPHA\", [], 0, 1],\r\n    [\"ALPHA\", [], 1, 2],\r\n    [\"ALPHA\", [], 2, 3],\r\n    [\"ALPHA\", [], 3, 4],\r\n    [\"ALPHA\", [], 4, 5],\r\n    [\"ALPHA\", [], 5, 6],\r\n    [\"ALPHA\", [], 6, 7]], 0, 7],\r\n  [\"hier-part\", [\r\n    [\"authority\", [\r\n      [\"host\", [\r\n        [\"IPv4address\", [\r\n          [\"dec-octet\", [\r\n            [\"DIGIT\", [], 10, 11]], 10, 11],\r\n          [\"dec-octet\", [\r\n            [\"DIGIT\", [], 12, 13]], 12, 13],\r\n          [\"dec-octet\", [\r\n            [\"DIGIT\", [], 14, 15]], 14, 15],\r\n          [\"dec-octet\", [\r\n            [\"DIGIT\", [], 16, 17]], 16, 17]], 10, 17]], 10, 17],\r\n      [\"port\", [\r\n        [\"DIGIT\", [], 18, 19],\r\n        [\"DIGIT\", [], 19, 20]], 18, 20]], 10, 20],\r\n    [\"path-abempty\", [], 20, 20]], 8, 20],\r\n  [\"fragment\", [\r\n    [\"pchar\", [\r\n      [\"unreserved\", [\r\n        [\"ALPHA\", [], 21, 22]], 21, 22]], 21, 22]], 21, 22]], 0, 22]\r\n```\r\n\r\nUsing the XML 1.0 4th Edition data file and\r\n\r\n```xml\r\n\u003c!DOCTYPE x [\u003c!ENTITY z \"\"\u003e]\u003e\r\n\u003cx\u003e\u003cy\u003e\u0026z;\u003c/y\u003e\u003c/x\u003e\r\n\r\n```\r\n\r\ngives\r\n\r\n```js\r\n[\"document\", [\r\n  [\"prolog\", [\r\n    [\"doctypedecl\", [\r\n      [\"S\", [], 9, 10],\r\n      [\"Name\", [\r\n        [\"Letter\", [\r\n          [\"BaseChar\", [], 10, 11]], 10, 11]], 10, 11],\r\n      [\"S\", [], 11, 12],\r\n      [\"intSubset\", [\r\n        [\"markupdecl\", [\r\n          [\"EntityDecl\", [\r\n            [\"GEDecl\", [\r\n              [\"S\", [], 21, 22],\r\n              [\"Name\", [\r\n                [\"Letter\", [\r\n                  [\"BaseChar\", [], 22, 23]], 22, 23]], 22, 23],\r\n              [\"S\", [], 23, 24],\r\n              [\"EntityDef\", [\r\n                [\"EntityValue\", 
[], 24, 26]], 24, 26]], 13, 27]],\r\n                  13, 27]], 13, 27]], 13, 27]], 0, 29],\r\n    [\"Misc\", [\r\n      [\"S\", [], 29, 30]], 29, 30],\r\n    [\"Misc\", [\r\n      [\"S\", [], 30, 31]], 30, 31]], 0, 31],\r\n  [\"element\", [\r\n    [\"STag\", [\r\n      [\"Name\", [\r\n        [\"Letter\", [\r\n          [\"BaseChar\", [], 32, 33]], 32, 33]], 32, 33]], 31, 34],\r\n      [\"content\", [\r\n          [\"element\", [\r\n            [\"STag\", [\r\n              [\"Name\", [\r\n                [\"Letter\", [\r\n                  [\"BaseChar\", [], 35, 36]], 35, 36]], 35, 36]],\r\n                    34, 37],\r\n              [\"content\", [\r\n                [\"Reference\", [\r\n                  [\"EntityRef\", [\r\n                    [\"Name\", [\r\n                      [\"Letter\", [\r\n                        [\"BaseChar\", [], 38, 39]], 38, 39]], 38,\r\n                          39]], 37, 40]], 37, 40]], 37, 40],\r\n            [\"ETag\", [\r\n              [\"Name\", [\r\n                [\"Letter\", [\r\n                  [\"BaseChar\", [], 42, 43]], 42, 43]], 42, 43]],\r\n                    40, 44]], 34, 44]], 34, 44],\r\n    [\"ETag\", [\r\n      [\"Name\", [\r\n        [\"Letter\", [\r\n          [\"BaseChar\", [], 46, 47]], 46, 47]], 46, 47]], 44, 48]],\r\n            31, 48],\r\n  [\"Misc\", [\r\n    [\"S\", [], 48, 49]], 48, 49],\r\n  [\"Misc\", [\r\n    [\"S\", [], 49, 50]], 49, 50]], 0, 50]  \r\n```\r\n\r\nYou can also verify that the parse fails for ill-formed input like\r\n\r\n```xml\r\n\u003cx\u003e\u003c?xml?\u003e\u003c/x\u003e\r\n```\r\n\r\nusing the sample files included in the repository like so\r\n\r\n```\r\n% node demo-parselov.js xml4e.document.json.gz bad.xml\r\n```\r\n\r\n## Merging regular paths\r\n\r\nThe deterministic finite state transducers that together form the\r\nlow-level parser compute all possible paths from the `start_vertex`\r\nof the graph that represents the input grammar to the `final_vertex`.\r\nThe 
forwards automaton visits only vertices reachable from the start,\r\nand the backwards automaton eliminates all paths that ultimately do\r\nnot reach the final vertex. However, entering a recursion 1 time or\r\n23 times is the same to the low-level parser, and the primary job of\r\nthe higher-level parser is to eliminate paths that can't be traversed\r\nbecause nesting constraints are violated or, for that matter, to find\r\none path on which the nesting constraints are maintained, if the goal\r\nis to derive a parse tree.\r\n\r\nIn order to do that, the higher-level parser does not actually have\r\nto go through all the vertices in the graph that describe the regular\r\nnon-recursive structure of the input. Instead, it could go through a\r\nmuch smaller graph that describes only recursions plus whatever else\r\nis minimally needed to ensure paths in the full graph and the reduced\r\ngraph correspond to one another.\r\n\r\nRecursions have vertices in the graph that represent their entry and\r\ntheir exit points. The smaller graph can be computed by merging all\r\nvertices that reach the same recursion entry and exit points without\r\npassing through a recursion entry or exit point. The data files that\r\nrepresent grammars include a projection for each vertex that maps the\r\nvertex to a representative in this smaller graph, the `stack_vertex`.\r\n\r\nHere is what this looks like from the perspective of the `element`\r\nproduction in the XML 1.0 4th Edition specification:\r\n\r\n![XML `element` stack vertex graph](./xml-stack-graph.png?raw=true)\r\n\r\nMatching an `element` means finding a path from the `start element`\r\nat the top to the `final element` vertex in the box. There are two\r\ninstances of `element` in the graph because the top-level element\r\ndiffers from its descendants: one has to go over `content` prior to\r\nvisiting a descendant element. 
\r\n\r\nThe stack graph projection for every vertex is available through the\r\n`g.vertices[v].stack_vertex` property, to stick with the syntax used\r\nin the examples above.\r\n\r\n## Pairing recursions in parallel\r\n\r\nThe backtracking higher-level parser shown earlier is not very smart.\r\nFor instance, ordinarily the finite state transducers already ensure\r\nthat any regular `start` vertex has a matching `final` vertex, but if\r\nthe parser is forced to backtrack, it will probably jump back to a\r\nposition where a regular part of the match is ambiguous. When there\r\nis a choice between multiple recursive symbols, it might choose `x`,\r\ntraverse the graph, and find out that it actually needed a `z`. Then\r\nit takes `y`, finds out again that it needs a `z`, and starts over\r\nagain. There are many ways to make it smarter, but it is also possible\r\nto avoid backtracking altogether by processing all alternatives in\r\nparallel.\r\n\r\nOne approach would be to advance all \"parsers\" in the `parsers`\r\narray one step (or up to the next edge) before continuing, but there\r\ncan be way too many alternatives for some grammars, and quite often\r\nthe parser states would differ only in what is on their individual\r\nstacks. Recall that multiple \"parsers\" are created when there are\r\nmultiple successors to a vertex; there are also cases where a\r\nvertex has fewer successors than predecessors, i.e., parsers might,\r\nafter exploring some differences, converge on the same path.\r\n\r\nInstead of giving each parser its own stack, it is possible to combine\r\nall possible stack configurations into a graph. Each \"parser\" can then\r\nsimply point to a vertex in the graph identifying the most-recently\r\n`pushed` value. That value then links to the value `pushed` before\r\nitself, and so on and so forth. Since there may be more than one way\r\nto reach a given vertex, there might be multiple most-recently `pushed`\r\nvalues for each vertex. 
In other words, instead of a `push` to a stack,\r\nwe add a vertex to graph and then point to the added vertex as the\r\nmost recently pushed value; instead of a `pop` from the stack, we move\r\nthe pointer to the predecessors in the graph.\r\n\r\n```perl\r\n#!perl -w\r\nuse Modern::Perl;\r\nuse Graph::Directed;\r\nuse YAML::XS;\r\nuse List::MoreUtils qw/uniq/;\r\nuse List::UtilsBy qw/partition_by sort_by nsort_by/;\r\nuse Graph::SomeUtils ':all';\r\nuse IO::Uncompress::Gunzip qw/gunzip/;\r\n\r\nlocal $Storable::canonical = 1;\r\n\r\nmy ($path, $file) = @ARGV;\r\n\r\ngunzip $path =\u003e \\(my $data);\r\n\r\nmy $d = YAML::XS::Load($data);\r\n\r\n#####################################################################\r\n# The following is just the typical reading of data and simulating\r\n# the finite automata in order to identify a list of edge sets.\r\n#####################################################################\r\nopen my $f, '\u003c:utf8', $file;\r\nmy $chars = do { local $/; binmode $f; \u003c$f\u003e };\r\n\r\nmy @vias = map { $d-\u003e{input_to_symbol}[ord $_] } split//, $chars;\r\n\r\nmy $fstate = 1;\r\nmy @forwards = ($fstate);\r\npush @forwards, map {\r\n  $fstate = $d-\u003e{forwards}[$fstate]{transitions}{$_} // 0\r\n} @vias;\r\n\r\nmy $bstate = 1;\r\nmy @edges = reverse map {\r\n  $bstate = $d-\u003e{backwards}[$bstate]{transitions}{$_} || 0;\r\n} reverse @forwards;\r\n\r\n#####################################################################\r\n# This script is going to generate a graph using file offsets paired\r\n# with vertex identifiers, just like when generating the dot output.\r\n# These helper functions combine two integers into a single string.\r\n#####################################################################\r\nsub pair {\r\n  my ($offset, $v) = @_;\r\n  return pack('N2', $offset, $v);\r\n}\r\n\r\nsub unpair {\r\n  my ($pair) = @_;\r\n  return unpack('N2', 
$pair);\r\n}\r\n\r\n#####################################################################\r\n# The following will generate a graph that links all vertices in the\r\n# graph produced by the deterministic finite state transducers to all\r\n# possible stack configurations when encountering the vertex. Graph\r\n# `$o` holds the view of the stack, `$g` the (unused) parse graph.\r\n#####################################################################\r\nmy $g = Graph::Directed-\u003enew;\r\nmy $start = pair(0, $d-\u003e{start_vertex});\r\nmy $final = pair($#edges, $d-\u003e{final_vertex});\r\n$g-\u003eadd_vertex($start);\r\n\r\nmy $o = Graph::Directed-\u003enew;\r\n$o-\u003eadd_vertex($start);\r\n\r\n#####################################################################\r\n# The algorithm transfers a view of the stack from vertices to their\r\n# immediate successors. The `@heads` are the vertices that still need\r\n# to be processed for a given edge, because their successors are the\r\n# newly added vertices in the following edge.\r\n#####################################################################\r\nmy @heads = ($start);\r\n\r\n#####################################################################\r\n# This projection could be used to merge all regular paths in the\r\n# grammar and only retain recursions plus whatever is needed to keep\r\n# the possible paths through the graph for recursive vertices intact.\r\n# Refer to the section \"Merging regular paths\" in the documentation.\r\n#####################################################################\r\nsub map_edge {\r\n  my ($edge) = @_;\r\n  return $edge;\r\n  return [ map { $d-\u003e{vertices}[$_]{stack_vertex} } @$edge ];\r\n}\r\n\r\nfor (my $ax = 0; $ax \u003c @edges; ++$ax) {\r\n  my $edge = $edges[$ax];\r\n  \r\n  ###################################################################\r\n  # Edge sets describe graph parts that, when concatenated, describe\r\n  # a parse graph. 
The following code does just that, it creates new\r\n  # vertices from the edge sets, noting the current offset, and then\r\n  # adds them to the overall graph. It is convenient to keep track of\r\n  # vertices added in this step, hence `$null` and `$char` graphs.\r\n  ###################################################################\r\n  my $null = Graph::Directed-\u003enew;\r\n  my $char = Graph::Directed-\u003enew;\r\n  \r\n  $null-\u003eadd_edges(map { [\r\n    pair($ax, $_-\u003e[0]), pair($ax, $_-\u003e[1])\r\n  ] } map { map_edge($_) } @{$d-\u003e{null_edges}[$edge]});\r\n\r\n  $char-\u003eadd_edges(map { [\r\n    pair($ax, $_-\u003e[0]), pair($ax + 1, $_-\u003e[1])\r\n  ] } map { map_edge($_) } @{$d-\u003e{char_edges}[$edge]});\r\n\r\n  ###################################################################\r\n  # Since we are going transfer views of the stack from vertices to\r\n  # their successors, it is convenient to get hold of all successors\r\n  # from a single graph, so the edges are combined into `$both`.\r\n  ###################################################################\r\n  my $both = Graph::Directed-\u003enew;\r\n  $both-\u003eadd_edges($null-\u003eedges);\r\n  $both-\u003eadd_edges($char-\u003eedges);\r\n\r\n  ###################################################################\r\n  # It can be convenient to build the parse graph alongside running\r\n  # this algorithm, `$g`, but the algorithm does not depend on it.\r\n  ###################################################################\r\n  $g-\u003eadd_edges($both-\u003eedges);\r\n  \r\n  my %seen;\r\n  my @todo = @heads;\r\n  while (@todo) {\r\n    my $v = shift @todo;\r\n    \r\n    #################################################################\r\n    # Successors have to be processed after their predecessors. 
\r\n    #################################################################\r\n    if (not $seen{$v}++) {\r\n      push @todo, $v;\r\n      push @todo, $null-\u003esuccessors($v);\r\n      next;\r\n    }\r\n\r\n    my ($vix, $vid) = unpair($v);\r\n    \r\n    if (($d-\u003e{vertices}[$vid]{type} // \"\") =~ /^(start|if)$/) {\r\n      ###############################################################\r\n      # `start` vertices correspond to `push` operations when using a\r\n      # stack. In the graph representation, the most recently pushed\r\n      # vertex is, accordingly, a predecessor of the current vertex.\r\n      ###############################################################\r\n      $o-\u003eadd_edge($v, $_) for $both-\u003esuccessors($v);\r\n      \r\n    } elsif (($d-\u003e{vertices}[$vid]{type} // \"\") =~ /^(final|fi)$/) {\r\n      ###############################################################\r\n      # `final` vertices correspond to `pop` operations when using a\r\n      # stack. They have to be matched against all the `predecessors`\r\n      # aka the most recently pushed vertices on the stack graph, and\r\n      # when they match, a `pop` is simulated by making the previous\r\n      # values, the second-most-recently-pushed vertices, available\r\n      # to the successors of the current vertex. 
Since the current\r\n      # vertex can be its own (direct or indirect) successor, due to\r\n      # right recursion, the successor may have to be processed more\r\n      # than one time to clear the emulated stack of matching values.\r\n      ###############################################################\r\n      for my $parent ($o-\u003epredecessors($v)) {\r\n        my ($pix, $pid) = unpair($parent);\r\n        if (not ($d-\u003e{vertices}[$pid]{with} // '') eq $vid) {\r\n          $o-\u003edelete_edge($parent, $v);\r\n          next;\r\n        }\r\n        for my $s ($both-\u003esuccessors($v)) {\r\n          for my $pp ($o-\u003epredecessors($parent)) {\r\n            next if $o-\u003ehas_edge($pp, $s);\r\n            $o-\u003eadd_edge($pp, $s);\r\n            push @todo, $s;\r\n          }\r\n        }\r\n      }\r\n    } else {\r\n      ###############################################################\r\n      # Other vertices do not affect the stack and so successors have\r\n      # the all the possible stack configurations available to them.\r\n      ###############################################################\r\n      for my $s ($both-\u003esuccessors($v)) {\r\n        $o-\u003eadd_edge($_, $s) for $o-\u003epredecessors($v);\r\n      }\r\n    }    \r\n  }\r\n\r\n  ###################################################################\r\n  # The new `@heads` are the end points of `char` edges. 
This should\r\n  # use only vertices that can actually be reached from the previous\r\n  # `@heads`, over a path that does not violate nesting constraints,\r\n  # but the low-level parser generally ensures there are no vertices\r\n  # added that cannot be reached from the `start_vertex`.\r\n  ###################################################################\r\n  @heads = uniq map { $_-\u003e[1] } $char-\u003eedges;\r\n}\r\n```\r\n\r\nIn the code above the `@heads` array corresponds to all the `p.vertex`\r\nproperties in the backtracking parser shown earlier, and the graph\r\n`$o` links any `p.vertex` to what used to be the `p.stack`s. If the\r\n`$o` graph has an edge from `$start` to `$final`, that is, if\r\n`$o-\u003ehas_edge($start, $final)` is true, and the `$final`\r\nvertex is reachable from `$start`, then the input matches the grammar.\r\n\r\nNote that the process above is entirely generic and does not depend on\r\nany particular behavior of the deterministic finite state transducers;\r\nit would be sufficient if they simply report all possible edges given\r\na particular input character. In other words, the code above resembles\r\nsimulating a non-deterministic pushdown transducer exploring all the\r\npossible transitions in parallel. The finite state transducers in turn\r\ncorrespond to an exhaustive parallel simulation that ignores the stack.\r\nWhen fully computed, they ensure that there are only relatively few\r\nedges added in each step and that all vertices are reachable from the\r\n`start_vertex` and reach the `final_vertex`. Furthermore, if the\r\nnon-recursive regular paths have already been computed by the finite\r\nstate machines, they can be ignored in this step, as discussed in the\r\nprevious section.\r\n\r\nThe code above processes the list of edge sets produced by the finite\r\nautomata from the left to the right. It would also be possible to use\r\nit from the right to the left and execute it alongside the `backwards`\r\nautomaton. 
And as with the backtracking parser, it should be easy to\r\nsee that most of the process above can be pre-computed and be turned\r\ninto simple machine instructions.\r\n\r\nAlso note that using a graph instead of a single stack adds a lot of\r\npower. A finite state machine with two stacks is already sufficient\r\nfor Turing-completeness. Having an essentially infinite number of\r\nstacks does not add computational power, but if you consider simpler\r\nmachines like deterministic pushdown automata, it might be impossible\r\nto express the union of two DPDAs as one DPDA (you can simulate them\r\nin parallel, but what do you do if one automaton wants to `push` while\r\nthe other wants to `pop` if you have only one stack to work with?).\r\nThe approach above would allow them to exist peacefully together: the\r\n`@heads`, in a manner of speaking, could be the two states of the\r\nautomata, and the \"stack graph\" would just be two stacks, one for each\r\nof them.\r\n\r\nFinally note that parallel simulation also makes it easy to extend the\r\ngraph formalism seen so far with features that depend on combining\r\nmultiple paths, like boolean combinators. The data files for XML 1.0\r\nfor instance have `if` and matching `fi` vertices that offer two paths\r\nto traverse. They are combined using an `andnot` condition. If there\r\nis a path from the `start_vertex` to the `final_vertex` over the `not`\r\npart, then the `if` condition fails and both paths are invalid. That\r\nrepresents\r\n\r\n```\r\nPITarget ::= Name - (('X' | 'x') ('M' | 'm') ('L' | 'l'))\r\n```\r\n\r\nand other rules making use of the `-` operator. If the right hand side\r\nis not recursive, such rules are resolved by the finite transducers,\r\nbut if the right hand side is not regular, the higher-level parser has\r\nto explore them as well. 
The simple backtracking parser shown earlier\r\nignores them at the moment.\r\n\r\n## Towards a compiler\r\n\r\nThe code in a previous section does many things at runtime that could\r\nbe computed statically, like building the initial `@todo` list, and\r\nit does so rather expensively like building temporary graph objects.\r\nAs a first step, let's say we've pre-computed the initial `@todo` and\r\nthe vertex successors the code above derives from the `$null` and\r\n`$char` graphs. In the abstract, we would then have for each set of\r\nedges:\r\n\r\n```perl\r\nmy @todo = ...;\r\nwhile (@todo) {\r\n  my $vid = shift @todo;\r\n  for ($vid) {\r\n    when(23) { do_start(...); }\r\n    when(42) { push @todo, do_final(...); }\r\n    when(65) { do_other(...); }\r\n    ...\r\n  }\r\n}\r\n```\r\n\r\nwhere the functions called above would look as follows. Parameters\r\n`$ns`, and `$cs` are the successors of `$v` in the `$null` graph and\r\nthe `$char` graph respectively, the rest should be obvious from the\r\nprevious section.\r\n\r\n```perl\r\nsub do_start {\r\n  my ($o, $vix, $vid, $ns, $cs) = @_;\r\n  my $v = pair($vix, $vid);\r\n  $o-\u003eadd_edge($v, pair($vix, $_)) for @$ns;\r\n  $o-\u003eadd_edge($v, pair($vix + 1, $_)) for @$cs;\r\n}\r\n\r\nsub do_other {\r\n  my ($o, $vix, $vid, $ns, $cs) = @_;\r\n  my $v = pair($vix, $vid);\r\n  \r\n  for my $parent ($o-\u003epredecessors($v)) {\r\n    $o-\u003eadd_edge($parent, pair($vix, $_)) for @$ns;\r\n    $o-\u003eadd_edge($parent, pair($vix + 1, $_)) for @$cs;\r\n  }\r\n}\r\n\r\nsub do_final {\r\n  my ($o, $vix, $vid, $with, $ns, $cs) = @_;\r\n  my $v = pair($vix, $vid);\r\n  my @todo;\r\n  for my $parent ($o-\u003epredecessors($v)) {\r\n    my ($pix, $pid) = unpair($parent);\r\n    if ($pid ne $with) {\r\n      $o-\u003edelete_edge($parent, $v);\r\n      next;\r\n    }\r\n    for my $pp ($o-\u003epredecessors($parent)) {\r\n      for my $sid (@$ns) {\r\n        my $s = pair($vix, $sid);\r\n        next if 
$o-\u003ehas_edge($pp, $s);\r\n        $o-\u003eadd_edge($pp, $s);\r\n        push @todo, $sid;\r\n      }\r\n      for my $sid (@$cs) {\r\n        my $s = pair($vix + 1, $sid);\r\n        $o-\u003eadd_edge($pp, $s);\r\n      }\r\n    }\r\n  }\r\n  return @todo;\r\n}\r\n```\r\n\r\nWe have seen earlier how the graph that represents the input grammar\r\ncan be simplified so it consists mostly of vertices that represent\r\nrecursions. Now, especially if we do that, it is possible to replace\r\ncalls to the functions above with just a couple of instructions. For\r\ninstance, from the perspective of most vertices, there can only ever\r\nbe one most recently `pushed` vertex in the `$o` graph, which makes\r\ntwo of the loops in `do_final` redundant for them. Furthermore, most\r\nvertices only have one successor, either in `$ns` or in `$cs`, so\r\nthose loops are often superfluous as well.\r\n\r\nThe possible return values of `do_final` are only interesting if\r\nthere is a cycle in the `$null` graph (such cycles represent left and\r\nright recursion); indeed, if there are no such cycles, we do not need\r\nthe `while (@todo)` loop and can simply process vertices in\r\ntopological order. The `do_other` routine is for unlabeled vertices\r\nthat generally represent terminals in the grammar. Since we do not\r\ncare about those, instead of copying the predecessors in the \"stack\r\ngraph\" from one unlabeled vertex to another unlabeled vertex, we\r\ncould replace one with the other simply by updating the offset\r\nposition and vertex id of the previous one, in most cases.\r\n\r\nIf all vertices in the graph can only ever see one most recently\r\n`pushed` value, we do not even need a graph and can use an ordinary\r\nstack instead. If the grammar you are interested in is not recursive\r\nor uses only right recursion, none of this is even needed; you can\r\njust use the backtracking parser presented earlier.\r\n\r\nAnd that is just the point. 
Compilers ought to figure out how to\r\nanalyse formal languages efficiently, rather than bothering humans\r\nwith it.\r\n\r\n## When a stack is good enough\r\n\r\nWith the approach shown above it would be nice if we could use a\r\nsimpler data structure than a graph, namely a stack. Some languages\r\nare simple enough that a stack is sufficient. Suppose the vertices in\r\nall `null_edges` and all `char_edges` are projected to `stack_vertex`\r\nand loops around unlabeled vertices are removed. If, in the result,\r\nevery vertex in each edge set has at most one successor, or all of its\r\nsuccessors are `final` vertices (we can choose among them using the\r\nlast-in value), then there is at most one possible path we can take.\r\n\r\nThe above is the case for non-recursive languages like the RFC 3986\r\ngrammar (URIs), and also recursive grammars like the RFC 5234 (ABNF)\r\nand XML `element` grammars. When parsing RFC 5234 documents, the\r\nstack would be a path through a graph like this:\r\n\r\n![Simplified stack graph for ABNF](./rfc5234-stack.png?raw=true)\r\n\r\nThe choice in the graph is not visible to the algorithm above since\r\n`option` starts with `[` and `group` starts with `(` and so the two\r\noptions do not appear together in the same `backwards` state. A case\r\nwhere we get a negative answer for the analysis above is the whole\r\nXML `document` production. One problem there is element content\r\nmodels (which can occur in the internal subset of document type\r\ndeclarations):\r\n\r\n```\r\nchildren ::= (choice | seq) ('?' | '*' | '+')?\r\ncp       ::= (Name | choice | seq) ('?' | '*' | '+')?\r\nchoice   ::= '(' S? cp ( S? '|' S? cp )+ S? ')'\r\nseq      ::= '(' S? cp ( S? ',' S? cp )* S? 
')'\r\n```\r\n\r\nSince `choice` and `seq` are both recursive, and cannot be told apart\r\nexcept by whether `,` or `|` is used to separate their children, both\r\noptions would have to be put on the stack together, which violates\r\nthe rule given above (one vertex has multiple successors that are not\r\nall `final` vertices). A cheap option would be to use a graph for the\r\nrare case of encounters with this obscure part of the XML format, and\r\nswitch to using a stack once the actual document content is read.\r\n\r\nThe JSON grammar in RFC 4627 is more complicated. It makes liberal\r\nuse of the `ws` production to indicate where ignorable white space\r\ncharacters, such as spaces and newlines between values, can be placed.\r\nThis results in an ambiguity as to where recursive productions begin\r\nwith respect to an input string. So an analysis of the graph produced\r\nas part of the algorithm explained at the beginning of the section\r\nshows that at some points there is a choice between reading more\r\ninput and starting to match productions like `value`.\r\n\r\nIf we actually want to report all possible matches, which includes\r\nencoding all the \"continue reading or start a `value`\" choices, it is\r\nnot possible to do that with a simple stack. A common sacrifice is\r\nto add a simple disambiguation strategy; for instance, we\r\ncould simply ignore the option to \"continue reading\" and enter the\r\nrecursive symbol as soon as possible. 
That would change the language,\r\nand it has to be done carefully.\r\n\r\nLet's put some of this into code:\r\n\r\n```perl\r\n#!perl -w\nuse Modern::Perl;\nuse Graph::Directed;\nuse YAML::XS;\nuse List::Util qw/all/;\nuse List::MoreUtils qw/uniq/;\nuse List::UtilsBy qw/partition_by sort_by nsort_by/;\nuse Graph::SomeUtils ':all';\nuse IO::Uncompress::Gunzip qw/gunzip/;\n\nlocal $Storable::canonical = 1;\n\nmy ($path, $file) = @ARGV;\n\ngunzip $path =\u003e \\(my $data);\n\nmy $d = YAML::XS::Load($data);\n\nsub map_edge {\n  my ($edge) = @_;\n  return $edge;\n  return [ map { $d-\u003e{vertices}[$_]{stack_vertex} } @$edge ];\n}\n\nfor (my $ax = 0; $ax \u003c @{ $d-\u003e{null_edges} }; ++$ax) {\n  my $edge = $ax;\n  \n  my $null = Graph::Directed-\u003enew;\n  my $char = Graph::Directed-\u003enew;\n  \n  $null-\u003eadd_edges(map { map_edge($_) } @{$d-\u003e{null_edges}[$edge]});\n  $char-\u003eadd_edges(map { map_edge($_) } @{$d-\u003e{char_edges}[$edge]});\n\n  ###################################################################\n  # This removes loops from unlabeled vertices and unlabeled vertices\n  # between labeled vertices. The sole purpose of unlabeled vertices\n  # is to ensure that we can always properly go from one set of edges\n  # to another. Ordinarily they represent terminals in the grammar,\n  # but here they also represent non-recursive symbols. Loops are due\n  # to merging `start` and `final` vertices e.g. because non-terminal\n  # symbols match the empty string. 
In any case, they do not affect\n  # how we ultimately pair vertices that represent recursions.\n  ###################################################################\n  for my $v ($null-\u003evertices) {\n    next if length ($d-\u003e{vertices}[$v]{type} // '');\n    $null-\u003edelete_edge($v, $v);\n    next if $char-\u003esuccessors($v);\n    next unless $null-\u003epredecessors($v);\n    next unless $null-\u003esuccessors($v);\n    for my $p ($null-\u003epredecessors($v)) {\n      for my $s ($null-\u003esuccessors($v)) {\n        $null-\u003eadd_edge($p, $s);\n      }\n    }\n    graph_delete_vertex_fast($null, $v);\n  }\n  \n  for my $v ($null-\u003evertices, $char-\u003evertices) {\n    #################################################################\n    # If a vertex has multiple successors in `$char`, we would leave\n    # this set of edges on more than a single vertex, which would\n    # violate our constraints. Similarly, if there is a choice\n    # between moving on to the next set of edges (next character) and\n    # staying here, that would violate our constraints. 
Hence this:\n    #################################################################\n    my @s;\n    push @s, $null-\u003esuccessors($v);\n    push @s, $char-\u003esuccessors($v);\n    next if @s \u003c= 1;\n    \n    #################################################################\n    # The following tests do not allow successors in `$char` but do\n    # allow multiple successors in `$null` provided that all of them\n    # are either `final` vertices (in which case we can choose among\n    # them by looking at the last-in value) or `start` vertices.\n    #################################################################\n    my $all_final = all {\n      ($d-\u003e{vertices}[$_]{type} // '') eq 'final'\n    } $null-\u003esuccessors($v);\n    \n    my @out = uniq map { $_-\u003e[1] } $char-\u003eedges;\n    \n    #################################################################\n    # The check `and not $char-\u003esuccessors($v)` ensures that there is\n    # no choice between going into a recursive symbol, or moving out\n    # of it, and moving on to the next character.\n    #################################################################\n    if ($all_final and not $char-\u003esuccessors($v)) {\n      next;\n    }\n\n    #################################################################\n    # If we get here, there is a problem with the grammar that might\n    # make it impossible to link the `start` and `final` points of\n    # recursive symbols in a match together using only a stack. 
This\n    # will print debugging information if there is a violation,\n    # otherwise this script will not print anything at all.\n    #################################################################\n    say join \"\\t\",\n     scalar(@out),\n     \"succ\",\n     $v,\n     \"null\",\n     (map {\n       $_, $d-\u003e{vertices}[$_]{type}, $d-\u003e{vertices}[$_]{text}\n      } (sort $null-\u003esuccessors($v))),\n     \"char\", (sort $char-\u003esuccessors($v));\n  }\n}\n```\r\n\r\nThis script checks for the conditions explained above. For the URI,\r\nABNF, and XML `element` samples it prints nothing; for the RFC 4627\r\nJSON sample it will print something like:\r\n\r\n```\r\n1       succ    349     null    76      final   array   char    349\n1       succ    367     null    209     final   object  char    367\n1       succ    378     null    69      final   object  char    378\n1       succ    382     null    8       start   value   char    382\n1       succ    388     null    151     final   array   char    388\n1       succ    394     null    57      start   value   char    394\n2       succ    349     null    76      final   array   char    349\n2       succ    365     null    40      start   value   char    365\n2       succ    367     null    209     final   object  char    367\n2       succ    368     null    378                     char    368\n2       succ    378     null    69      final   object  char    378\n2       succ    379     null    145     start   value   char    379\n2       succ    381     null    12      start   value   char    381\n2       succ    382     null    349                     char    382\n2       succ    382     null    8       start   value   char    382\n2       succ    388     null    151     final   array   char    388\n2       succ    389     null    122     start   value   char    389\n2       succ    391     null    367                     char    391\n2       succ    394     null    388                     char    394\n2  
     succ    394     null    57      start   value   char    394\n```\r\n\r\nThis is a crude way of telling you that at some point you can go from\r\n`stack_vertex` `349` to vertex `76` without reading anything (`null`)\r\nor to vertex `349` after reading an input symbol (`char`). In other\r\nwords, the grammar might be ambiguous, and it is probably not possible\r\nto parse using only a single stack while reporting all possible matches.\r\nThis also attempts to tell you where the problem might lie.\r\n\r\nApplied to the XML 1.0 `extSubset` data file, the script generates a\r\nlot of nonsense. That is because the `forwards` automaton is not\r\n\"complete\": it does not actually trace all possible paths through the\r\ngrammar, and accordingly the `backwards` automaton and hence the\r\n`char_edges` and the `null_edges` are not complete either. In contrast,\r\nthe backtracking higher-level parser would still, very slowly, do the\r\nright thing when trying to parse. This is explained in more detail in\r\nthe section on \"Limitations\".\r\n\r\n## Combination of data files and parallel simulation\r\n\r\nThe design of the core system makes it easy to simulate multiple\r\nautomata in parallel, and since all state is trivially accessible,\r\nnew data files can easily be created as combinations of existing\r\nones. The most common combinations are directly supported as part\r\nof the core data file generation process, such as the union of two\r\nalternatives, and set subtraction. The latter is used e.g. by the\r\nEBNF grammar for XML 1.0 to express rules such as `any Name except\r\n'xml'` which are often difficult to express with other systems.\r\nLikewise, the intersection of two grammars is easily computed.\r\n\r\nAn important implication is that the system can be used to compare\r\ngrammars. As an example, the sample files include one for URIs as\r\ndefined by RFC 3986. 
The precursor of RFC 3986 is RFC 2396, and it\r\ncan be useful to construct a data file for strings that are URIs\r\nunder one definition but not the other, e.g. to derive test cases,\r\nor if the two definitions were meant to be the same, to verify that\r\nthey are (as in set theory, if `A - B` is empty and `B - A` is empty\r\nthen `A` and `B` are equivalent).\r\n\r\nThe way to combine data files is exhaustive simulation. As an example,\r\nthe forwards automaton in any data file starts in state `1`. If you\r\nhave two data files, you can take the pair of states `(1, 1)` and a\r\ncharacter `ch`, and compute\r\n\r\n```js\r\n  var s1 = g1.forwards[1].transitions[ g1.input_to_symbol[ch] ];\r\n  var s2 = g2.forwards[1].transitions[ g2.input_to_symbol[ch] ];\r\n```\r\n\r\nwhich would give a transition from `(1, 1)` over `ch` to `(s1, s2)`.\r\nThe pairs are the states in the new automaton. When computing a union,\r\na state in the new automaton is accepting if either of the states it\r\nrepresents is accepting. For intersection both states have to be\r\naccepting. For `A - B` the state in A has to be accepting, but the\r\nstate for B must not be. For boolean combinations the structure of\r\nthe automaton is always the same, except that some states may end up\r\nbeing redundant.\r\n\r\nFor the `backwards` automaton the process is the same. Merging the\r\ncorresponding graph data is done by taking the union of edges. It\r\nis of course necessary to rename vertices to avoid collisions. It\r\nis also useful to first create a common `input_to_symbol` table and\r\nthen simulate over character classes instead of individual characters.\r\n\r\nThere are many interesting combinations other than the simple boolean\r\nones. For instance, instead of an indiscriminate union it can also be\r\nuseful to create an ordered choice `if A then A else B`. This would\r\ndisambiguate between A and B. Typical applications include support\r\nfor legacy constructs in grammars or other fallback rules. 
This can\r\nbe implemented just like the union, but when creating the backwards\r\nautomaton, the unwanted edges would be left out. Alternatively, an\r\nordered choice `a || b` can also be expressed as `a | (b - a)`.\r\n\r\nIt is also possible to create interleavings (switching from one\r\nautomaton to another) and other constructs with similar effort.\r\n\r\nHere is an example that creates a combination of two data files where\r\nthe result will match strings that do not match the first data file\r\nbut do match the second data file.\r\n\r\n```perl\r\n#!perl -w\r\nuse Modern::Perl;\r\nuse Graph::Directed;\r\nuse YAML::XS;\r\nuse List::Util qw/min max/;\r\nuse List::MoreUtils qw/uniq/;\r\nuse List::UtilsBy qw/partition_by sort_by nsort_by/;\r\nuse Graph::SomeUtils ':all';\r\nuse IO::Uncompress::Gunzip qw/gunzip/;\r\nuse IO::Compress::Gzip qw(gzip);\r\nuse Data::AutoBimap;\r\nuse JSON;\r\n\r\nlocal $Storable::canonical = 1;\r\n\r\nmy ($path1, $path2) = @ARGV;\r\n\r\ngunzip $path1 =\u003e \\(my $data1);\r\ngunzip $path2 =\u003e \\(my $data2);\r\n\r\nmy $d1 = YAML::XS::Load($data1);\r\nmy $d2 = YAML::XS::Load($data2);\r\n\r\nsub pair {\r\n  my ($offset, $v) = @_;\r\n  return pack('N2', $offset, $v);\r\n}\r\n\r\nsub unpair {\r\n  my ($pair) = @_;\r\n  return unpack('N2', $pair);\r\n}\r\n\r\n#####################################################################\r\n# Create a new alphabet making sure new character sets are disjoint.\r\n#####################################################################\r\nmy @input_to_symbol = (pair(0, 0));\r\nmy $max = max(scalar(@{ $d1-\u003e{input_to_symbol} }), \r\n              scalar(@{ $d2-\u003e{input_to_symbol} }));\r\n\r\nfor (my $via = 1; $via \u003c $max; ++$via) {\r\n  $input_to_symbol[$via] = pair(\r\n    ($d1-\u003e{input_to_symbol}[$via] // 0),\r\n    ($d2-\u003e{input_to_symbol}[$via] // 0),\r\n  );\r\n}\r\n\r\nmy $sm = Data::AutoBimap-\u003enew(start =\u003e 0);\r\n$sm-\u003es2n(pair(0, 0));\r\nmy @alphabet = map { 
$sm-\u003es2n($_) } uniq @input_to_symbol;\r\n\r\n#####################################################################\r\n# Create a new automaton over pairs of states in the input automata. \r\n#####################################################################\r\nsub combine {\r\n  my ($a1, $a2, $im, $sm, $start, $alphabet) = @_;\r\n  my %seen;\r\n  my @states;\r\n  my @todo = $start;\r\n\r\n  while (@todo) {\r\n    my $current = shift @todo;\r\n    next if $seen{$current}++;\r\n    my ($s1, $s2) = unpair($sm-\u003en2s($current));\r\n    for my $via (@$alphabet) {\r\n      my ($via1, $via2) = unpair($im-\u003en2s($via));\r\n      my $dst =\r\n        $sm-\u003es2n(pair(\r\n          $a1-\u003e[$s1]{transitions}{$via1} // 0,\r\n          $a2-\u003e[$s2]{transitions}{$via2} // 0,\r\n        ));\r\n      next unless $dst;\r\n      $states[$current]{transitions}{$via} = $dst;\r\n      push @todo, $dst;\r\n    }\r\n    \r\n    #################################################################\r\n    # For simple boolean combinations like `and` and `xor` this is\r\n    # the only place that encodes the combination. 
For more complex\r\n    # combinations, like `if a then a else b` we would also need to\r\n    # remove the edges we are not interested in later in the code.\r\n    #################################################################\r\n    $states[$current]{accepts} = \r\n      (!$a1-\u003e[$s1]{accepts})\r\n        \u0026\r\n      $a2-\u003e[$s2]{accepts};\r\n  }\r\n  return @states;\r\n}\r\n\r\n#####################################################################\r\n# Combine the `forwards` automata.\r\n#####################################################################\r\nmy $fm = Data::AutoBimap-\u003enew(start =\u003e 0);\r\n$fm-\u003es2n(pair(0, 0));\r\nmy @forwards = combine(\r\n  $d1-\u003e{forwards},\r\n  $d2-\u003e{forwards},\r\n  $sm,\r\n  $fm,\r\n  $fm-\u003es2n(pair(1, 1)),\r\n  [@alphabet],\r\n);\r\n$forwards[0] = { transitions =\u003e {} };\r\n\r\nmy $xm = Data::AutoBimap-\u003enew(start =\u003e 0);\r\n$xm-\u003es2n(pair(0, 0));\r\n\r\nmy @ralpha;\r\nfor (my $ix = 0; defined $fm-\u003en2s($ix); ++$ix) {\r\n  push @ralpha, $ix;\r\n}\r\n\r\n#####################################################################\r\n# Combine the `backwards` automata.\r\n#####################################################################\r\nmy $bm = Data::AutoBimap-\u003enew(start =\u003e 0);\r\n$bm-\u003es2n(pair(0, 0));\r\nmy @backwards = combine(\r\n  $d1-\u003e{backwards},\r\n  $d2-\u003e{backwards},\r\n  $fm,\r\n  $bm,\r\n  $bm-\u003es2n(pair(1, 1)),\r\n  [@ralpha]\r\n);\r\n$backwards[0] = { transitions =\u003e {} };\r\n\r\nmy $vm = Data::AutoBimap-\u003enew(start =\u003e 0);\r\n$vm-\u003es2n(pair(0, 0));\r\n\r\n#####################################################################\r\n# This is most of the new data file.\r\n#####################################################################\r\nmy $d3 = {\r\n  start_vertex =\u003e $vm-\u003es2n(pair($d1-\u003e{start_vertex},\r\n                                $d2-\u003e{start_vertex})),\r\n  final_vertex =\u003e 
$vm-\u003es2n(pair($d1-\u003e{final_vertex},\r\n                                $d2-\u003e{final_vertex})),\r\n  forwards =\u003e \\@forwards,\r\n  backwards =\u003e \\@backwards,\r\n  vertices =\u003e [{}, {\r\n    type =\u003e \"start\",\r\n    text =\u003e \"$path1 combined with $path2\",\r\n    with =\u003e 2,\r\n  }, {\r\n    type =\u003e \"final\",\r\n    text =\u003e \"$path1 combined with $path2\",\r\n    with =\u003e 1,\r\n  }],\r\n  input_to_symbol =\u003e [ map { $sm-\u003es2n($_) } @input_to_symbol ],\r\n};\r\n\r\n#####################################################################\r\n# Copy the vertex data.\r\n#####################################################################\r\n$d3-\u003e{vertices}[$vm-\u003es2n(pair(0, $_))] = \r\n  $d1-\u003e{vertices}[$_] for 1 .. @{ $d1-\u003e{vertices} } - 1;\r\n\r\n$d3-\u003e{vertices}[$vm-\u003es2n(pair($_, 0))] = \r\n  $d2-\u003e{vertices}[$_] for 1 .. @{ $d2-\u003e{vertices} } - 1;\r\n\r\n#####################################################################\r\n# Making sure cross-references among vertices are updated.\r\n#####################################################################\r\nfor (my $ix = 3; defined $vm-\u003en2s($ix); ++$ix) {\r\n  my ($v1, $v2) = unpair($vm-\u003en2s($ix));\r\n  my $v = !$v1 ? $v2 : $v1;\r\n  my $d =  $v1 ? $d2 : $d1;\r\n  my $s =  $v1 ? 
sub { $vm-\u003es2n(pair($_[0], 0)) } :\r\n                 sub { $vm-\u003es2n(pair(0, $_[0])) };\r\n  for my $k (qw/with operand1 operand2 stack_vertex/) {\r\n    $d3-\u003e{vertices}[$ix]{$k} = $s-\u003e($d-\u003e{vertices}[$v]{$k} || 0);\r\n  }\r\n}\r\n\r\n#####################################################################\r\n# Copy the edge data.\r\n#####################################################################\r\nfor (my $ix = 1; defined $bm-\u003en2s($ix); ++$ix) {\r\n  my ($s1, $s2) = unpair($bm-\u003en2s($ix));\r\n  for my $k (qw/char_edges null_edges/) {\r\n    push @{ $d3-\u003e{$k}[$ix] },\r\n      map { [map { $vm-\u003es2n(pair(0, $_)) } @$_ ] }\r\n        @{ $d1-\u003e{$k}[$s1] };\r\n    push @{ $d3-\u003e{$k}[$ix] },\r\n      map { [map { $vm-\u003es2n(pair($_, 0)) } @$_ ] }\r\n        @{ $d2-\u003e{$k}[$s2] };\r\n  }\r\n}\r\n\r\n#####################################################################\r\n# Connect the old `start_vertex` and `final_vertex` vertices to the\r\n# new ones. If one of the input data files represents a recursive\r\n# grammar, this should also add appropriate `if` and `fi` vertices\r\n# and their operands to allow the higher-level parser to complete the\r\n# combination. This also ignores edge cases where the `start` or the\r\n# `final` vertex in one of the input automata occurs only as part of\r\n# a `char_edge`. 
But for demonstration purposes this should be okay.\r\n#####################################################################\r\nfor (my $ix = 1; $ix \u003c @{ $d3-\u003e{null_edges} }; ++$ix) {\r\n  next unless defined $d3-\u003e{null_edges}[$ix];\r\n  my @edges = @{ $d3-\u003e{null_edges}[$ix] };\r\n  my @vertices = map { @$_ } @edges;\r\n  my %has = map { $_ =\u003e 1 } @vertices;\r\n  my $s1 = $vm-\u003es2n(pair(0, $d1-\u003e{start_vertex}));\r\n  my $f1 = $vm-\u003es2n(pair(0, $d1-\u003e{final_vertex}));\r\n  my $s2 = $vm-\u003es2n(pair($d2-\u003e{start_vertex}, 0));\r\n  my $f2 = $vm-\u003es2n(pair($d2-\u003e{final_vertex}, 0));\r\n  push @{ $d3-\u003e{null_edges}[$ix] }, [ $d3-\u003e{start_vertex}, $s1 ]\r\n    if $has{$s1};\r\n  push @{ $d3-\u003e{null_edges}[$ix] }, [ $d3-\u003e{start_vertex}, $s2 ]\r\n    if $has{$s2};\r\n  push @{ $d3-\u003e{null_edges}[$ix] }, [ $f1, $d3-\u003e{final_vertex} ]\r\n    if $has{$f1};\r\n  push @{ $d3-\u003e{null_edges}[$ix] }, [ $f2, $d3-\u003e{final_vertex} ]\r\n    if $has{$f2};\r\n}\r\n\r\n$d3-\u003e{char_edges}[0] = [];\r\n$d3-\u003e{null_edges}[0] = [];\r\n\r\n#####################################################################\r\n# Now we can print out the result. 
It will be rather large since, for\r\n# one thing, the worst case size of the product automata is O(N*M),\r\n# and we made no effort to remove data that has become useless such\r\n# as states that cannot reach an accepting state.\r\n#####################################################################\r\nmy $json = JSON-\u003enew-\u003ecanonical-\u003eascii-\u003eencode($d3);\r\n\r\ngzip \\$json =\u003e \\(my $gzipped);\r\n\r\nbinmode STDOUT;\r\nprint $gzipped;\r\n```\r\n\r\nUsage:\r\n\r\n```\r\n% perl not-a-but-b.pl rfc3986.URI.json.gz rfc2396.URI.json.gz \u003e x.gz\r\n% perl random-samples.pl x.gz\r\n```\r\n\r\nOutput:\r\n\r\n```\r\nGA.://!:%3A2++%6A./#\r\nVA25536://:%A6#!$:?!\r\nV://%5A:;@-%0A%AAA@%2A6?\r\nG://%A6_+@!+%A5:%16./;?\r\nAA+.VG+3+5A5V://:!%A5?\r\nA://:%60/;2;!/;@;;V\r\nA.V://@@;%1A@\r\nAAAAA63A.5--://G:;%6A1;++\r\n```\r\n\r\nThe strings above are strings that are URIs according to RFC 2396 but\r\nnot according to the later specification RFC 3986. We can also do it\r\nthe other way around, the following are strings that RFC 3986 accepts\r\nbut RFC 2396 did not:\r\n\r\n```\r\nV+://@[VAA.::]:/!/?///%A1/1!#@:\r\nGGA.A3A-6:\r\nG-:#-/?\r\nG+G:\r\nG+1:\r\nGG:#@/+_\r\nA-://[V3.:::+]/#\r\nAA05AG:#/;\r\n```\r\n\r\nHow to generate these random samples is explained in a later section.\r\nIt is easy to generate a small set of test cases for all interesting\r\ndifferences between the two specifications covered, as explained in\r\nthe section on test suite coverage.\r\n\n## Handling grammars based on tokens\n\nSome specifications separate the language definition into multiple\nparts, one for small units like `number` and `identifier`, called\ntokens, and then separately define how to combine tokens into larger\nstructures. An example is the specification for style sheets, [CSS 2.1]\n(http://www.w3.org/TR/2011/REC-CSS2-20110607/syndata.html#syntax).\nWe can do the same and define two grammars. 
Using the ABNF format,\nit could look like this for CSS 2.1, closely mirroring the structure\nof the specification:\n\n```\nCSSTOKEN = IDENT\n  / ATKEYWORD\n  / STRING\n...\n;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;\n; Tokens\n;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;\nIDENT         = ident\nATKEYWORD     = \"@\" ident\nSTRING        = string\n...\n;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;\n; Macros\n;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;\nident         = [\"-\"] nmstart *nmchar\nname          = 1*nmchar\nnmstart       = (ALPHA / \"_\") / nonascii / escape\n...\n```\n\nThe underlying idea is that the sequence of characters that make up a\nstyle sheet are first turned into a sequence of tokens and the grammar\nabove is for one such token. To distinguish tokens from one another,\nthe CSS 2.1 specification defines that in trying to find a token, we\nalways read as much of the input as possible. 
The code below is an\nimplementation of that:\n\n```perl\n#!perl -w\nuse Modern::Perl;\nuse Graph::Directed;\nuse YAML::XS;\nuse IO::Uncompress::Gunzip qw/gunzip/;\n\nlocal $Storable::canonical = 1;\n\nmy ($path, $file) = @ARGV;\n\ngunzip $path =\u003e \\(my $data);\n\nmy $d = YAML::XS::Load($data);\n\nmy %state_to_token;\n#####################################################################\n# First we are going to map each `forwards` state to a named token,\n# if any, going through the set of edges at the accepting position.\n#####################################################################\nfor (my $ix = 0; $ix \u003c @{ $d-\u003e{forwards} }; ++$ix) {\n\n  my $null = Graph::Directed-\u003enew;\n  my $edge = $d-\u003e{backwards}[1]{transitions}{$ix};\n  next unless defined $edge;\n  \n  die unless $d-\u003e{forwards}[$ix]{complete};\n\n  $null-\u003eadd_edges(@{$d-\u003e{null_edges}[$edge]});\n  my %found = map { $d-\u003e{vertices}[$_]{text} =\u003e 1 } grep {\n    ($d-\u003e{vertices}[$_]{type} // '') eq 'final';\n  } $null-\u003epredecessors($d-\u003e{final_vertex});\n\n  ###################################################################\n  # `DELIM` tokens match \"anything else\" and the grammar does not\n  # fully distinguish them from other named tokens, so we do it here.\n  ###################################################################\n  if (keys %found \u003e 1) {\n    delete $found{DELIM};\n  }\n\n  ###################################################################\n  # https://lists.w3.org/Archives/Public/www-style/2015Feb/0043.html\n  # This manually resolves an ambiguity in the specification.\n  ###################################################################\n  if ($found{FUNCTION} and $found{'BAD-URI'}) {\n    delete $found{FUNCTION};\n  }\n\n  ###################################################################\n  # We can only proceed if there is no ambiguity left among tokens.\n  
###################################################################\n  die if keys %found != 1;\n  ($state_to_token{$ix}) = keys %found;\n}\n\nopen my $f, '\u003c:utf8', $file;\nmy $chars = do { local $/; binmode $f; \u003c$f\u003e };\n\nmy @vias = map { $d-\u003e{input_to_symbol}[ord $_] } split//, $chars;\n\n#####################################################################\n# Now we can turn the sequence of input characters into a sequence of\n# tokens. Since the specification defines to use the \"longest match\",\n# we simulate the `forwards` automaton until it no longer accepts any\n# input. This assumes that all states that do not reach an accepting\n# state are merged into state `0` as should normally be the case.\n#####################################################################\nfor (my $ix = 0; $ix \u003c @vias;) {\n  my $fstate = 1;\n  my $bx = $ix;\n  while ($d-\u003e{forwards}[$fstate]{transitions}{$vias[$bx]}) {\n    $fstate = $d-\u003e{forwards}[$fstate]{transitions}{$vias[$bx++]};\n    last if $bx \u003e= @vias;\n  }\n  my $token = $state_to_token{$fstate};\n\n  say join \"\\t\", $ix, $bx, $token;\n\n  $ix = $bx;\n}\n```\n\nUsage:\n\n```\n% perl csstoken.pl csstoken.json.gz example.css\n```\n\nInput:\n\n```css\n/* Example */\nbody { background: url(\"example\") }\n```\n\nOutput (start offset, next offset, name of token):\n\n```\n0       13      COMMENT\n13      15      S\n15      19      IDENT\n19      20      S\n20      21      LEFT-CURLY\n21      22      S\n22      32      IDENT\n32      33      COLON\n33      34      S\n34      48      URI\n48      49      S\n49      50      RIGHT-CURLY\n50      52      S\n```\n\nNext we can make a grammar for the remaining parts of the syntax. 
As\nthe data format still assumes transitions over numbers, we can define\na number for each token:\n\n```\nIDENT         = %x01\nATKEYWORD     = %x02\nSTRING        = %x03\n...\nstylesheet    = *( CDO / CDC / S / statement )\nstatement     = ruleset / at-rule\nat-rule       = ATKEYWORD *S *any ( block / SEMICOLON *S )\n...\n```\n\nOne twist of note is that the CSS 2.1 specification defines \"COMMENT\ntokens do not occur in the grammar (to keep it readable), but any\nnumber of these tokens may appear anywhere outside other tokens.\" So\nwhen feeding tokens into a parser for the second grammar, we would\nhave to skip `COMMENT` tokens (or, alternatively, change the grammar\nso that a `COMMENT` loop appears before and after any reference to a\nrule in the grammar, avoiding redundant instances of such loops). It\nwould also be possible to combine the two grammars into one, but we\nwould need to make the grammar for the tokens such that they are\ndisjoint and each token retains the property that it matches \"as much\nas possible\".\n\n## Handling data structures other than strings\n\nThe graph formalism on which the system presented in this document is\nbased essentially linearises hierarchical data structures. Consider, as\na comparison, event-based parsers for languages like XML. They present\na stream of events like `start_element` and `end_element` to calling\napplications. Likewise, the system here equates parsing with finding a\npath through a graph where some vertices are labeled `start` and other\nvertices are labeled `final` and a valid path is one where there is a\nmatching `final` for every `start` on the path. We can use this idea\nto linearise some data structures so they fit into this system.\n\nTrees, like indeed XML documents, are a good example. A simple use\ncase is validation of XML documents. 
Let's make a simple grammar for\na subset of the HTML/XHTML language, considering only the nesting and\nposition of elements, not their attributes or character data content:\n\n```\nstart-html  = %x01\nfinal-html  = %x02\nstart-head  = %x03\nfinal-head  = %x04\nstart-title = %x05\nfinal-title = %x06\nstart-div   = %x07\nfinal-div   = %x08\nstart-p     = %x09\nfinal-p     = %x0a\nstart-body  = %x0b\nfinal-body  = %x0c\n\nhtml        = start-html  (head body) final-html\nhead        = start-head  (title)     final-head\ntitle       = start-title \"\"          final-title\nbody        = start-body  *(div / p)  final-body\ndiv         = start-div   *(div / p)  final-div \np           = start-p     \"\"          final-p\n```\n\nThe first part is similar to how we handled tokens in the previous\nsection. Essentially, in a manner of speaking, `\u003chtml\u003e` is a token,\nand so is `\u003c/html\u003e`, and so is any other start tag or end tag. But\nwe do not have to derive these tokens from a character string, it\nwould also be possible to generate a sequence of them from a tree in\nmemory by way of a depth-first traversal. XML schema languages like\nRELAX NG follow a similar principle, although they do offer more\nconvenient syntactic sugar to express rules.\n\r\n## Limitations\r\n\r\nThe basic approach outlined above works well for carefully constructed\r\ndata format and protocol message formats that are relatively regular,\r\nunambiguous, and deterministic, which is the case for a large set of\r\nstandard formats. The samples include parsing data for URIs, JSON,\r\nXML, and ABNF. 
All the corresponding grammars are ambiguous and only\r\nURIs are regular, so these are not strict requirements.\r\n\r\nThe design of the data files also allows the deterministic finite\r\nstate transducers used for pre-processing the input to simply record\r\nthe input without making decisions on their own, in which case the\r\nhigher level parser would turn into an unaided non-deterministic\r\npushdown transducer. That is a worst-case escape hatch that ensures\r\nthe integrity of the parsing data files while avoiding the creation\r\nof an exponential number of states, so **it is always possible to create\r\na correct and reasonably sized data file**, but naively written higher\r\nlevel parsers are likely to perform poorly within this system.\r\n\r\nThere are ways to delay the inevitable however. A simple example are\r\nXML document type definition files. The finite state transducers can\r\nhandle the format fine except for one construct:\r\n\r\n```\r\nignoreSect         ::= '\u003c![' S? 'IGNORE' S? '[' ignoreSectContents* ']]\u003e'\r\nignoreSectContents ::= Ignore ('\u003c![' ignoreSectContents ']]\u003e' Ignore)*\r\nIgnore             ::= Char* - (Char* ('\u003c![' | ']]\u003e') Char*) \r\n```\r\n\r\nIn the grammar for XML, `Char` is the whole alphabet, and the rule\r\n`ignoreSectContents` matches anything so long as any `\u003c![` properly\r\nnests with a closing `]]\u003e`. Since the finite transducers cannot\r\ndetect the outermost closing `]]\u003e`, this simply matches anything; in\r\norder to still make all regular decisions for the higher level parser,\r\nan inordinate amount of states is needed. Of course, for any finite\r\nnumber of nesting levels, relatively few states are needed, delaying\r\nany fallback to the worst case as much as is convenient by expanding\r\nproblematic recursions a couple of times.\r\n\r\nAn example of this is included in the repository. 
Using\r\n\r\n```\r\n% node demo-parselov.js xml4e.extSubset.json.gz ex.dtd\r\n```\r\n\r\nRight after reading the first `\u003c![IGNORE[` in a location where the\r\nconstruct is allowed, the first finite state transducer switches to\r\na worst-case mode of operation and simply records the input. The\r\nsecond transducer accordingly generates all possible edges for every\r\nposition in the input, leaving an inordinate amount of work for the\r\nnaively written higher level demo parser introduced earlier. The\r\n`dot` output is nevertheless correct. For an input like\r\n\r\n```xml\r\n\u003c!ELEMENT a (b, (c | d)*, e*)\u003e\r\n```\r\n\r\nThe output would be\r\n\r\n```js\r\n[\"extSubset\", [\r\n  [\"extSubsetDecl\", [\r\n    [\"markupdecl\", [\r\n      [\"elementdecl\", [\r\n        [\"S\", [], 9, 10],\r\n        [\"Name\", [\r\n          [\"Letter\", [\r\n            [\"BaseChar\", [], 10, 11]], 10, 11]], 10, 11],\r\n        [\"S\", [], 11, 12],\r\n        [\"contentspec\", [\r\n          [\"children\", [\r\n            [\"seq\", [\r\n                [\"cp\", [\r\n                  [\"Name\", [\r\n                    [\"Letter\", [\r\n                      [\"BaseChar\", [], 13, 14]], 13, 14]], 13, 14]],\r\n                        13, 14],\r\n              [\"S\", [], 15, 16],\r\n                [\"cp\", [\r\n                    [\"choice\", [\r\n                        [\"cp\", [\r\n                          [\"Name\", [\r\n                            [\"Letter\", [\r\n                              [\"BaseChar\", [], 17, 18]], 17, 18]],\r\n                                17, 18]], 17, 18],\r\n                      [\"S\", [], 18, 19],\r\n                      [\"S\", [], 20, 21],\r\n                        [\"cp\", [\r\n                          [\"Name\", [\r\n                            [\"Letter\", [\r\n                              [\"BaseChar\", [], 21, 22]], 21, 22]],\r\n                                21, 22]], 21, 22]], 16, 23]], 16, 24],\r\n              
[\"S\", [], 25, 26],\r\n                [\"cp\", [\r\n                  [\"Name\", [\r\n                    [\"Letter\", [\r\n                      [\"BaseChar\", [], 26, 27]], 26, 27]], 26, 27]],\r\n                        26, 28]], 12, 29]], 12, 29]], 12, 29]], 0,\r\n                          30]], 0, 30],\r\n    [\"DeclSep\", [\r\n      [\"S\", [], 30, 31]], 30, 31]], 0, 31]], 0, 31]  \r\n```\r\n\r\nIt would also be possible to extend the basic formalism along the\r\nhierarchy of languages with additional features so such cases can be\r\nhandled by lower level parsers. For the particular example above, a\r\ncounter and transitions depending on whether the counter is zero are\r\nneeded. With a stack and transitions depending on the top symbol, we\r\nwould have classic deterministic pushdown transducers. Similarly,\r\nthere could be a finite number of stacks in parallel used in this\r\nmanner. Beyond that there is probably no point in further extensions.\r\n\r\n## Sample applications\r\n\r\n### Prefixing rulenames in ABNF grammars\r\n\r\nThe Internet Standards body IETF primarily uses the ABNF format to\r\ndefine the syntax of data formats and protocol messages. ABNF lacks\r\nfeatures to import symbols from different grammars and does not\r\nsupport namespace mechanisms which can make it difficult to create\r\ngrammars that properly define all symbols in order to use them with\r\nexisting ABNF tools. For instance, different specifications might\r\nuse the same non-terminal name for different things, so grammars\r\ncannot simply be concatenated.\r\n\r\nA simple mitigation would be to add prefixes to imported rulenames.\r\nIn order to do that reliably and automatically, an ABNF parser is\r\nrequired. 
Ideally, for this simple transformation, a tool would do\r\nnothing but add prefixes to rulenames, but in practice tools are\r\nlikely to make additional changes, like normalising or removing\r\nformatting, stripping comments, possibly normalising the case of\r\nrulenames, changing their order, or normalising the format of various\r\nnon-terminals. They might also be unable to process some grammars\r\ne.g. due to semantic or higher-level syntactic problems like rules\r\nthat are referenced but not defined or only defined using the prose\r\nrule construct.\r\n\r\nWith the tools introduced above it is easy to make a tool that just\r\nrenames rulenames without making any other change and without any\r\nrequirements beyond the basic well-formedness of the input grammar.\r\n\r\n```js\r\nvar fs = require('fs');\r\nvar util = require('util');\r\n\r\nvar data = fs.readFileSync(process.argv[2], {\r\n  \"encoding\": \"utf-8\"\r\n});\r\nvar root = JSON.parse(fs.readFileSync(process.argv[3]));\r\nvar prefix = process.argv[4];\r\n\r\nvar todo = [root];\r\nvar indices = [];\r\n\r\n/////////////////////////////////////////////////////////////////////\r\n// Taking the output of `generate_json_formatted_parse_tree` the JSON\r\n// formatted parse tree is traversed to find the start positions of\r\n// all `rulename`s that appear in a given input ABNF grammar file.\r\n/////////////////////////////////////////////////////////////////////\r\nwhile (todo.length \u003e 0) {\r\n  var current = todo.shift();\r\n  todo = current[1].concat(todo);\r\n  if (current[0] == \"rulename\")\r\n    indices.push(current[2]);\r\n}\r\n\r\n/////////////////////////////////////////////////////////////////////\r\n// The input is then copied, adding the desired prefix as needed.\r\n/////////////////////////////////////////////////////////////////////\r\nvar result = \"\";\r\nif (indices.length) {\r\n  var rest = data.substr(indices[indices.length - 1]);\r\n  for (var ix = 0; indices.length;) {\r\n    var current 
= indices.shift();\r\n    result += data.substr(ix, current - ix);\r\n    result += prefix;\r\n    ix = current;\r\n  }\r\n  result += rest;\r\n} else {\r\n  result = data;\r\n}\r\n\r\nprocess.stdout.write(result);\r\n```\r\n\r\nUsage:\r\n\r\n```\r\n% node demo-parselov.js rfc5234.rulelist.json.gz ex.abnf -json \u003e tmp\r\n% node add-prefix.js ex.abnf tmp \"ex-\" \u003e prefixed.abnf\r\n```\r\n\r\nInput:\r\n\r\n```\r\nrulelist       =  1*( rule / (*c-wsp c-nl) )\r\n...\r\n```\r\n\r\nOutput:\r\n\r\n```\r\nex-rulelist       =  1*( ex-rule / (*ex-c-wsp ex-c-nl) )\r\n...\r\n```\r\n\r\nThe process is fully reversible by parsing the output and removing\r\nthe prefix from all `rulename`s, assuming an appropriate prefix.\r\nThat makes it easy, for instance, to later prove that the modified\r\ngrammar is identical to the original, which can be much harder if\r\nother changes are made. Furthermore, the code is not subject to any\r\nlimitations that might be imposed by a hand-written ABNF parser. If\r\nsome rules are left undefined, or if they are defined using prose\r\nrule constructs that are not generally machine-readable and thus\r\nunsuitable for many applications, or whatever else, so long as the\r\ninput actually matches the ABNF meta-grammar, the tool works as\r\nadvertised.\r\n\r\nThere are many situations where similar applications can be useful.\r\nFor instance, sometimes it may be necessary to inject document type\r\ndeclarations or entity definitions into XML fragments when turning\r\na legacy XML database into a new format because the legacy system\r\nomitted them and the employed XML toolkit does not support such a\r\nfeature natively. 
Instead of fiddling with such fragments using\r\nregular expressions, which may produce incorrect results in some\r\nunusual situations (like a document type declaration that has been\r\ncommented out), an approach as outlined above would ensure correct\r\nresults.\r\n\r\nAnother simple example are \"minification\" applications that remove\r\nredundant parts of documents to reduce their transfer size, like\r\nremoving formatting white space from JSON documents. For that use\r\ncase, the data file for JSON can be used. The code would look for\r\n`ws` portions in a JSON document and omit corresponding characters\r\nwhen producing the output. For the specific case of JSON this may\r\nbe uninteresting nowadays because many implementations that can do\r\nthis exist; the point is that they are easy to write when proper\r\nparsing is taken care of, as this system does.\r\n\r\nA more complex example are variations and extensions of data formats.\r\nTo stick with the example of JSON, since JSON originates with the\r\nJavaScript programming language, and as it lacks some features, it\r\nis fairly common to encounter web services that do not emit strictly\r\ncompliant JSON, either for legacy reasons or due to deliberate choices.\r\nFor instance, comments might be included, or rather than using `null`\r\nthey might encode undefined values using the empty string. Say,\r\n\r\n```js\r\n[1,2,,4] /* the third value is `undefined` */\r\n```\r\n\r\nTypical JSON parsers will be unable to process such data, but it is\r\neasy to remove the comment and inject a `null` value where it is\r\nmissing. Then any JSON parser can be used to process the document. 
It\r\nis tempting to remove the comment and insert the `null` value using\r\nregular expression replacement facilities, but doing so manually is\r\nlikely to produce incorrect results in some edge cases, like when\r\nsomething that looks like a comment is included in a quoted string.\r\nInstead, a data file for a more liberal JSON grammar could be made,\r\nand then the desired modifications could be applied as explained\r\nabove.\r\n\r\n### Analysing data format test suites for completeness\r\n\r\nSince the main part of the parsing process is entirely transparent to\r\nhigher-level code, it is easy to analyse which parts of the data files\r\nare actually used when running them over a large corpus of documents.\r\nFor instance, the W3C provides a large set of XML documents as part of\r\nthe [XML Test Suite](http://www.w3.org/XML/Test/). The following code\r\ntakes all `.xml` files in a given directory, assumes that the files\r\nare UTF-8-encoded, and then runs their contents through the forwards\r\nautomaton and records the number of times a given state has been\r\nreached in a histogram. 
Finally it relates the number of states that\r\nhave never been reached to the total number of states which gives a\r\nsimple coverage metric:\r\n\r\n```js\r\nvar fs = require('fs');\r\nvar zlib = require('zlib');\r\nvar util = require('util');\r\n\r\nvar seen = {};\r\nvar todo = [ process.argv[3] ];\r\nvar files = [];\r\n\r\nwhile (todo.length) {\r\n  var current = todo.pop();\r\n  if (seen[current])\r\n    continue;\r\n  seen[current] = true;\r\n  var stat = fs.statSync(current);\r\n  if (stat.isFile())\r\n    files.push(current);\r\n  if (!stat.isDirectory())\r\n    continue;\r\n  todo = todo.concat(fs.readdirSync(current).map(function(p) {\r\n    return current + \"/\" + p;\r\n  }));\r\n}\r\n\r\nvar xml_files = files.filter(function(p) {\r\n  return p.match(/\\.xml$/);\r\n});\r\n\r\nzlib.gunzip(fs.readFileSync(process.argv[2]), function(err, buf) {\r\n\r\n  var g = JSON.parse(buf);\r\n  var histogram = [];\r\n\r\n  g.forwards.forEach(function(e, ix) {\r\n    histogram[ix] = 0;\r\n  });\r\n\r\n  for (var ix in xml_files) {\r\n    var path = xml_files[ix];\r\n    var input = fs.readFileSync(path, {\r\n      \"encoding\": \"utf-8\"\r\n    });\r\n\r\n    var s = [].map.call(input, function(ch) {\r\n      return g.input_to_symbol[ ch.charCodeAt(0) ] || 0\r\n    });\r\n\r\n    var fstate = 1;\r\n    var forwards = [fstate].concat(s.map(function(i) {\r\n      return fstate = g.forwards[fstate].transitions[i] || 0;\r\n    }));\r\n    \r\n    forwards.forEach(function(e) {\r\n      histogram[e]++;\r\n    });\r\n  }\r\n  \r\n  var unused = [];\r\n  histogram.forEach(function(e, ix) {\r\n    if (e == 0)\r\n      unused.push(ix);\r\n  });\r\n  \r\n  process.stdout.write(\"Forward state coverage: \"\r\n    + (1 - unused.length / histogram.length));\r\n});\r\n```\r\n\r\nOutput for the `20130923` version:\r\n\r\n```\r\n% node xmlts.js xml4e.document.json.gz ./xmlconf\r\nForward state coverage: 0.7478632478632479\r\n```\r\n\r\nThis means almost `75%` of the states are covered 
by the `.xml` files\r\nin the sample. Note that the automaton has different states for cases\r\nlike \"hexadecimal character reference in attribute value\" where the\r\nattribute value is in single quotes and where it is in double quotes.\r\nHumans are not likely to manually write test cases for each and every\r\nsuch variation, which should explain part of the gap in coverage.\r\n\r\nThe application can effortlessly be extended to report coverage with\r\nrespect to other parts of the data, such as which transitions have\r\nbeen used, and state and transition coverage for the backwards case.\r\nEdge and vertex coverage can also be interesting. Note that the tool\r\nis, except for the `.xml` filter, entirely generic and does not know\r\nanything about XML.\r\n\r\n### Generating random documents\r\n\r\nGenerating random documents that match the grammar represented by the\r\ndata files is fairly simple in principle. A proper document is simply\r\na path through the graph (which in turn is just the combination of\r\nall the edges in a data file) that obeys a few constraints imposed by\r\nspecial vertices in the graph. For grammars that use only standard\r\ncombinators like concatenation and union, these amount to at most\r\nrecursion nesting constraints. For simple regular grammars it would even suffice\r\nto find a path through the `forwards` automaton from the first to any\r\naccepting state. 
The following Perl script takes a data file and then\r\nprints 10 random examples, assuming there is a valid string.\r\n\r\n```perl\r\n#!perl -w\r\nuse Modern::Perl;\r\nuse Graph::Directed;\r\nuse Graph::RandomPath;\r\nuse IO::Uncompress::Gunzip qw/gunzip/;\r\nuse YAML::XS;\r\n\r\nlocal $Storable::canonical = 1;\r\n\r\nmy ($path) = @ARGV;\r\n\r\ngunzip $path =\u003e \\(my $data);\r\n\r\nmy $d = YAML::XS::Load($data);\r\n\r\n#####################################################################\r\n# Only the `forwards` transducer knows the character transitions, and\r\n# it is necessary to combine most of the tables in the data file to\r\n# put them back on the edges of the graph. This is somewhat involved,\r\n# and it might be a good idea to redundantly store this in the data.\r\n#####################################################################\r\nmy %fwd;\r\nfor (my $src = 1; $src \u003c @{ $d-\u003e{forwards} }; ++$src) {\r\n  for my $via (keys %{ $d-\u003e{forwards}[$src]{transitions} }) {\r\n    my $dst = $d-\u003e{forwards}[$src]{transitions}{$via};\r\n    next unless $dst;\r\n    $fwd{$src}{$dst}{$via} = 1;\r\n  }\r\n}\r\n\r\n#####################################################################\r\n# Only the `backwards` automaton knows the vertices a `forward` state\r\n# corresponds to, so it is turned into a more accessible form as well.\r\n#####################################################################\r\nmy %bck;\r\nfor (my $edg = 1; $edg \u003c @{ $d-\u003e{backwards} }; ++$edg) {\r\n  next unless $d-\u003e{backwards}[$edg];\r\n  for my $dst (keys %{ $d-\u003e{backwards}[$edg]{transitions} }) {\r\n    my $edg2 = $d-\u003e{backwards}[$edg]{transitions}{$dst};\r\n    for my $src (keys %{ $d-\u003e{backwards}[$edg2]{transitions} }) {\r\n      my $edg3 = $d-\u003e{backwards}[$edg2]{transitions}{$src};\r\n      $bck{$dst}{$src}{$edg3} = 1;\r\n    }\r\n  }\r\n}\r\n\r\n#####################################################################\r\n# Finally it 
is possible to combine the `forwards` transitions over\r\n# input symbols with the unlabeled `char_edges` to label them.\r\n#####################################################################\r\nmy %labels;\r\nfor my $src (keys %fwd) {\r\n  for my $dst (keys %{ $fwd{$src} }) {\r\n    for my $edg (keys %{ $bck{$dst}{$src} }) {\r\n      for (my $ix = 0; $ix \u003c @{ $d-\u003e{char_edges}[$edg] }; ++$ix) {\r\n        my $vsrc = $d-\u003e{char_edges}[$edg][$ix][0];\r\n        my $vdst = $d-\u003e{char_edges}[$edg][$ix][1];\r\n        for my $via (keys %{ $fwd{$src}{$dst} }) {\r\n          $labels{$vsrc}{$vdst}{$via} = 1;\r\n        }\r\n      }\r\n    }\r\n  }\r\n}\r\n\r\n#####################################################################\r\n# It is also necessary to turn input symbols (character classes) into\r\n# actual input characters, so here we recover all the ranges.\r\n#####################################################################\r\nmy %classes;\r\n$classes{ $d-\u003e{input_to_symbol}[0] } = [[0,0]];\r\nfor (my $ax = 1; $ax \u003c @{ $d-\u003e{input_to_symbol} }; ++$ax) {\r\n  my $cs = $d-\u003e{input_to_symbol}[$ax];\r\n  my $ps = $d-\u003e{input_to_symbol}[$ax - 1];\r\n  if ($cs == $ps) {\r\n    $classes{$cs}[-1][1]++;\r\n  } else {\r\n    push @{ $classes{$cs} }, [$ax, $ax];\r\n  }\r\n}\r\n\r\n#####################################################################\r\n# A simple path might not respect proper nesting of recursions, that\r\n# has to be done separately to reject bad paths. 
It would be possible\r\n# to do that as part of the random path finding routine, of course.\r\n#####################################################################\r\nsub verify_path_stack {\r\n  my (@path) = @_;\r\n  my @stack;\r\n  for (my $ix = 0; $ix \u003c @path; ++$ix) {\r\n    my $vd = $d-\u003e{vertices}[ $path[$ix] ];\r\n    next unless $vd-\u003e{type};\r\n    if ($vd-\u003e{type} eq 'start') {\r\n      push @stack, $vd-\u003e{with};\r\n    }\r\n    if ($vd-\u003e{type} eq 'final') {\r\n      return unless @stack;\r\n      my $top = pop @stack;\r\n      return unless $top eq $path[$ix];\r\n    }\r\n  }\r\n  return 0 == @stack;\r\n}\r\n\r\n#####################################################################\r\n# `verify_path_stack` would actually have to explore more paths than\r\n# the one passed to it to account for boolean combinators that may be\r\n# included in the graph other than simple choices. To work correctly\r\n# at least when regular operands are used, and the data file does not\r\n# include \"worst case\" states, the path is tested against the DFA.\r\n#####################################################################\r\nsub verify_path_dfa {\r\n  my (@vias) = @_;\r\n  my $fstate = 1;\r\n  my @forwards = ($fstate, map {\r\n    $fstate = $d-\u003e{forwards}[$fstate]{transitions}{$_} // 0\r\n  } @vias);\r\n  return $d-\u003e{forwards}[$fstate]{accepts};\r\n}\r\n\r\n#####################################################################\r\n# Recovering the graph is trivial, it's simply all edges combined.\r\n#####################################################################\r\nmy $g = Graph::Directed-\u003enew;\r\n$g-\u003eadd_edges(@$_) for grep { defined } @{ $d-\u003e{char_edges} };\r\n$g-\u003eadd_edges(@$_) for grep { defined } @{ $d-\u003e{null_edges} };\r\n\r\n#####################################################################\r\n# The `Graph::RandomPath` does what its name 
implies.\r\n#####################################################################\r\nmy $random_path = Graph::RandomPath-\u003ecreate_generator($g,\r\n  $d-\u003e{start_vertex},\r\n  $d-\u003e{final_vertex},\r\n  max_length =\u003e 200,\r\n);\r\n\r\n#####################################################################\r\n# Finally we can generate 10 examples and print them out.\r\n#####################################################################\r\nbinmode STDOUT, ':utf8';\r\n\r\nmy $more = 10;\r\nwhile ($more) {\r\n  my @path = $random_path-\u003e();\r\n\r\n  next unless verify_path_stack(@path);\r\n\r\n  my @via_path;\r\n  for (my $ix = 1; $ix \u003c @path; ++$ix) {\r\n    my @vias = keys %{ $labels{$path[$ix-1]}{$path[$ix]} // {} };\r\n    next unless @vias;\r\n    my $random_class = $vias[ int(rand @vias) ];\r\n    push @via_path, $random_class;\r\n  }\r\n\r\n  next unless verify_path_dfa(@via_path);\r\n\r\n  for my $random_class (@via_path) {\r\n    # TODO: the next two choices should really be random\r\n    my $first_range = $classes{$random_class}-\u003e[0];\r\n    my $first_ord = $first_range-\u003e[0];\r\n    my $char = chr $first_ord;\r\n    die unless $d-\u003e{input_to_symbol}[$first_ord] eq $random_class;\r\n    print $char;\r\n  }\r\n\r\n  print \"\\n\";\r\n  $more -= 1;\r\n}\r\n```\r\n\r\nUsage:\r\n\r\n```\r\n% perl random-samples.pl rfc3986.URI.json.gz\r\nA:/@?#/!?\r\nGV.:#\r\nVV1AA5+:/@%5A//@/%5AV////@:/_%A6@/?#1\r\nG++5:/!A%63@///5#:/\r\nGG0-+5+:/?#0@:\r\nG:V+#//?/?/\r\nA:?!??@/%AA/\r\nA1+A://:-%3A%3A.@[VA.!-+:+!1]:3251\r\nV:/?#\r\nV://[::]//?#\r\n```\r\n\r\nThey may not look quite like `https://example.org/` but nevertheless\r\nmatch the `URI` grammar in RFC 3986. There are many ways to guide this\r\nprocess and indeed other ways to generate random samples. In many\r\ncases it may actually be best to generate special data files that\r\nconstrain the language like imposing a prefix of `http:` or limiting\r\nthe range of permissible characters. 
A particularly useful case would\r\nbe a data file for `rfc3986-URI - rfc2396-URI-reference`, i.e., any\r\nRFC 3986 URI that was not allowed under the rules of the predecessor\r\nof the specification. Generating counter-examples works much the same\r\nway: generate paths that do not reach the final vertex or that fail\r\none of the other tests.\r\n\r\nIf you recall the previous sample application that identifies gaps in\r\ntest suites, those gaps can easily be filled by random data. For\r\ninstance, in order to achieve perfect `forwards` state coverage, the\r\nrandom data generator could be instructed to generate a sample for\r\neach of the states that are not covered by the existing samples. For\r\nthe example there, the XML test suite, it might also be useful to\r\nadd further constraints, such as requiring the sample to be well-formed XML.\r\nThe million monkeys working behind the scenes of the program above\r\ncan take care of that as well, with such a test added, albeit slowly.\r\n\r\n### Syntax highlighting\r\n\r\nA simple form of","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhoehrmann%2Fdemo-parselov","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhoehrmann%2Fdemo-parselov","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhoehrmann%2Fdemo-parselov/lists"}