https://github.com/r32/lex

Build a lexer and a simple parser (SimpleLR) in Haxe macros. It also provides lexer and SimpleLR generator tools for the C language.
Host: GitHub
URL: https://github.com/r32/lex
Owner: R32
Created: 2018-08-19T14:44:15.000Z (almost 7 years ago)
Default Branch: master
Last Pushed: 2023-08-29T06:46:24.000Z (almost 2 years ago)
Last Synced: 2025-02-12T03:17:21.967Z (4 months ago)
Topics: lexer, parser, slr-parser
Language: Haxe
Homepage:
Size: 594 KB
Stars: 19
Watchers: 4
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md

Lex
------

Build a lexer and a simple parser (SimpleLR) in macros.

LIMITS:

1. In Haxe, you can't use it in [`macro-in-macro`](https://github.com/HaxeFoundation/haxe/pull/7496)

2. Suitable for ASCII only.

## Samples

* [SLR tool for C language](/tools/generator/CSLR.hx)

  > `haxelib run lex --slr TPL.slr TPL.lex`

  basic demo : [test.slr](/tools/test/test.slr) ==> [`test_slr.c`](/tools/test/test_slr.c)

  json parser demo : [rjson_parser.slr](https://github.com/R32/clib/blob/master/src/haxe/rjson/rjson_parser.slr#L66) ==> [`rjson_parser_slr.c`](https://github.com/R32/clib/blob/master/src/rjson_parser_slr.c)

* [Lexer tool for C language](/tools/generator/CLexer.hx)

  > `haxelib run lex TPL.lex`

  basic demo : [test.lex](/tools/test/test.lex) ==> [`test_lex.c`](/tools/test/test_lex.c)

* [hello world](#Usage)

* [hscript parser](/demo/)

## Status

* Lexer

  ```haxe

  static function main()  {

      var str = lms.ByteData.ofString('exec(123 + 456)');

      var lex = new Lexer(str);

      var t = lex.token();

      while (t != Eof) {

          trace(s_token(t));

          t = lex.token();

      }

  }

  enum Op {

      Add;

      Sub;

      Mul;

      Div;

  }

  enum Token {

      Eof;

      LParen;

      RParen;

      Op( op : Op );

      CIdent( id : String);

      CInt( i : Int);

      CString( s : String);

  }

  /*

   * The following meta characters are supported in string:

   * `*`: zero or more

   * `+`: one or more

   * `?`: zero or one

   * `[`: begin char range

   * `]`: end char range

   * `\`: escape next char

   * `|`: or  (Not recommended, e.g: "abc|xyz" should be replaced by: "abc" | "xyz" )

   *

   * The newly added syntax:

   *

   * "a" | "b"    => /a|b/

   * "a" + "b"    => /ab/  "+" has a higher priority than "|"

   * Opt("abc")   => /(abc)?/

   * Star("abc")  => /(abc)*/

   * Plus("abc")  => /(abc)+/

   *

   * For example:

   * var integer    = "0" | "[1-9][0-9]*";

   * var floatpoint = ".[0-9]+" | "[0-9]+.[0-9]*";

   * var exp        = "[eE][+-]?[0-9]+";

   * var float      = integer + Opt(exp) | floatpoint + Opt(exp);

   *

   * // "float" is equal to  [(0 | [1-9][0-9]*) + Opt(exp)] | [(.[0-9]+ | [0-9]+.[0-9]*) + Opt(exp)]

   */

  @:rule(Eof) class Lexer implements lm.Lexer {

      // Add "@:skip" to a string variable to keep it from being parsed as a pattern

      var r_int = "0" | "-?[1-9][0-9]*";

      var r_ident = "[a-zA-Z_][a-zA-Z_0-9]*";

      // Matching 1: the first rule set is automatically renamed to "token"

      var tok =  [

          "[ \t]+" => this.token(),   // Recursive Matching 1

          "+" => Op(Add),

          "-" => Op(Sub),

          "*" => Op(Mul),

          "/" => Op(Div),

          "(" => LParen,

          ")" => RParen,

          r_int =>

              CInt(Std.parseInt(this.current)),

          r_ident =>

              CIdent(this.current),

          '"' => {

              var i = this.pmax;      // save start position

              var t = this.string();  // goto Matching 2

              if (t == Eof)

                  throw "UnClosed \"";

              this.pmin = i;          // restore pmin

              CString(this.current);

          },

          "//[^\n]*" =>

              this.token(),           // Recursive Matching 1

          _ => {

              // error handling: when an error occurs, this.pmin will be >= this.pmax

              throw "UnExpected : " + this.getString(pmax, pmin - pmax);

              // NOTE: Unlike the SLR parser, in the lexer only "Matching 1"

              // can handle errors; the other rule sets can only do empty (epsilon) matching

          }

      ];

      // Matching 2

      var string = [

          '"'     => CString(""),

          '[^"]+' => this.string()    // Recursive Matching 2

      ];

      // Matching N...

  }

  ```
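
  To make the combinator notes in the comment above concrete, here is a minimal sketch that rebuilds the `float` pattern as an ordinary regular expression and tries it on a few inputs. It uses only the standard `EReg` class, not lex itself. The helpers `alt` and `opt` and the class name are hypothetical stand-ins for `|` and `Opt()`, and the sketch assumes that `.` is a literal dot in lex patterns (it is not listed among the meta characters), so it is escaped for `EReg`.

  ```haxe
  class FloatPatternSketch {
      // Hypothetical helpers working on regex source strings; not part of lex.
      static function alt(a : String, b : String) : String return '(?:$a|$b)';
      static function opt(a : String) : String return '(?:$a)?';

      // integer + Opt(exp) | floatpoint + Opt(exp), with "+" binding tighter than "|"
      public static function floatPattern() : String {
          var integer    = alt("0", "[1-9][0-9]*");
          var floatpoint = alt("\\.[0-9]+", "[0-9]+\\.[0-9]*");
          var exp        = "[eE][+-]?[0-9]+";
          return alt(integer + opt(exp), floatpoint + opt(exp));
      }

      // Anchored match, so the whole input has to be one "float"
      public static function isFloat(s : String) : Bool {
          return new EReg("^" + floatPattern() + "$", "").match(s);
      }

      static function main() {
          for (s in ["0", "42", "1e10", ".5", "3.14e-2", "007", "e5"])
              trace(s + " => " + isFloat(s));
      }
  }
  ```

  The first five inputs match and the last two do not: `007` fails because `integer` allows a leading `0` only on its own, and `e5` fails because an exponent cannot stand alone.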

* Parser: Only SimpleLR is available.

  Unlike a conventional LR parser, there is no separate *action table*; everything is encoded in a single *jump table*.

  Some conflicts that *LALR/LR(1)* could resolve are not resolved here; a conflict error is thrown directly instead.

  - **Position**: Inside the actions, you can use `T1~TN` to access the position of each matched item; each one is an instance of `lm.Stream.Tok`

    ```hx

    T1.pmax - T1.pmin;

    ```

    Inside the actions, you can also use the variable `stream`:

    ```hx

    var tok = stream.peek(0);

    if (tok.term == SomeToken)

        stream.junk(1);

    ```

  - Combine multiple Tokens:

    ```haxe

    // 1. Use "[]". NOTE: putting tokens with **different precedence** in the same "[]" throws a conflict error.

    switch(s) {

    case [e1=expr, op = [OpAdd, OpSub], e2=expr]: op == OpAdd ? e1 + e2 : e1 - e2;

    }

    // 2. Use a separate production (switch). NOTE: this ignores all predefined token **precedence**,

    // but you can use "@:prec(XXX)" to enforce a **precedence** for it.

    case [e1=expr, op = op, e2=expr]: trace(op == OpAdd) ... //

    var op : Token = switch(s) {

    case [OpAdd] : OpAdd;

    case [OpSub] : OpSub;

    case [OpDiv] : OpDiv;

    case [OpMul] : OpMul;

    }

    ```

  - You can use string literals instead of simple terminators in *stream match*.

    ```haxe

    switch(s) {

    case [e1=expr, op = ["+", "-"], e2=expr]: op == OpPlus ? e1 + e2 : e1 - e2;

    case ["(", e = expr, ")"]: e;

    }

    ```

  - **Operator Precedence**:

    ```haxe

    // the operator precedence definitions:

    @:rule({

        left: ["+", "-"],         // String literals are mapped to their Tokens automatically

        left: [OpTimes, OpDiv],   // Entries listed lower have higher priority.

        nonassoc: [UMINUS],       // Placeholder names must be all UPPERCASE

    }) class MyParser implements lm.SLR {

    ```


    details...

    ```

    // UPPERCASE == "non-terminal", LOWERCASE == "terminal"

    [..., op, E]: if defined(op)     then case.left.lval = E.value

    [..., op, E]: if not defined(op) then case.left = null

    [..., T, E] | [E] | [..., t]:    then case.left = null

    [E, op, ...]: if defined(op)     then case.right.own = E.value

    [E, op, ...]: if not defined(op) then case.right.prio = -1

    [E, T, ...]                      then case.right.prio = -1

    [t, ...]                         then case.right = null

    ```
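
As a rough illustration of what such precedence declarations usually mean, the sketch below applies the conventional yacc-style rule for a shift/reduce decision: compare the precedence of the rule being reduced with that of the lookahead token, and fall back to associativity on a tie. This is the textbook rule, shown under the assumption that lex behaves similarly; it is not taken from the lex source, and `Assoc`, `Prec`, `PrecSketch` and `resolve` are hypothetical names.

```haxe
enum Assoc { Left; Right; NonAssoc; }

typedef Prec = { level : Int, assoc : Assoc };

class PrecSketch {
    // Mirrors the @:rule block above: entries declared lower get a higher level.
    static var table : Map<String, Prec> = [
        "+"      => { level : 0, assoc : Left },
        "-"      => { level : 0, assoc : Left },
        "*"      => { level : 1, assoc : Left },
        "/"      => { level : 1, assoc : Left },
        "UMINUS" => { level : 2, assoc : NonAssoc }
    ];

    // rulePrec: the operator (or @:prec placeholder) of the rule that could be reduced
    // lookahead: the next token in the stream
    public static function resolve(rulePrec : String, lookahead : String) : String {
        var r = table[rulePrec];
        var t = table[lookahead];
        if (r == null || t == null) return "conflict"; // no precedence to compare
        if (t.level > r.level) return "shift";
        if (t.level < r.level) return "reduce";
        return switch (r.assoc) {
            case Left: "reduce";
            case Right: "shift";
            case NonAssoc: "error";
        };
    }

    static function main() {
        trace(resolve("+", "*"));      // shift:  "1 + 2 * 3" keeps reading
        trace(resolve("*", "+"));      // reduce: "1 * 2 + 3" folds "1 * 2" first
        trace(resolve("+", "-"));      // reduce: same level, left associative
        trace(resolve("UMINUS", "*")); // reduce: "-1 * 2" folds "-1" first
    }
}
```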



### CHANGES

* `1.0.0`

* `0.13.0`:

  - `tool` : Added simple LR tool for C language

  - `slr` : Refactored LR0 code to SLR

  - `lexer` : Added new syntax Opt(), Star(), Plus() for group.

### Defines

* `-D lex_slrtable`: for debugging. It generates the SimpleLR table and saves it as `slr-table.txt`. For example:

  > You may have to modify the `mmap` field in `debug.Print`

  ```

  Production:

    (R0)  MAIN --> EXPR $

    (R1)  EXPR --> EXPR [+ -] EXPR

    (R2)       --> EXPR * EXPR

    (R3)       --> EXPR / EXPR

    (R4)       --> ( EXPR )

    (R5)       --> - EXPR

    (R6)       --> int

  -----------------------------------------------------------------------------------------

  |  (S)  | (EP)  |   $   |  int  |   +   |   -   |   *   |   /   |   (   |   )   | EXPR  |

  ----------------------------------------------------------------------------------------- MAIN

  |   0   | NULL  |       |  14   |       |   1   |       |       |   2   |       |   8   |

  -----------------------------------------------------------------------------------------

  |   1   | NULL  |       |  14   |       |   1   |       |       |   2   |       |  10   |

  -----------------------------------------------------------------------------------------

  |   2   | NULL  |       |  14   |       |   1   |       |       |   2   |       |   3   |

  -----------------------------------------------------------------------------------------

  |   3   | NULL  |       |       |   4   |   4   |   6   |   7   |       |  11   |       |

  -----------------------------------------------------------------------------------------

  |   4   | NULL  |       |  14   |       |   1   |       |       |   2   |       |   5   |

  -----------------------------------------------------------------------------------------

  |   5   |  R1   |       |       |       |       |   6   |   7   |       |       |       |

  -----------------------------------------------------------------------------------------

  |   6   | NULL  |       |  14   |       |   1   |       |       |   2   |       |  13   |

  -----------------------------------------------------------------------------------------

  |   7   | NULL  |       |  14   |       |   1   |       |       |   2   |       |  12   |

  -----------------------------------------------------------------------------------------

  |   8   | NULL  |   9   |       |   4   |   4   |   6   |   7   |       |       |       |

  -----------------------------------------------------------------------------------------

  -----------------

  |   9   |  R0   |

  -----------------

  |  10   |  R5   |

  -----------------

  |  11   |  R4   |

  -----------------

  |  12   |  R3   |

  -----------------

  |  13   |  R2   |

  -----------------

  |  14   |  R6   |

  -----------------

  ```
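
To show how a table like this can be read, here is a small hand-written interpreter for it in plain Haxe. It is only a sketch for following the table by hand, not the code that lex generates: the class name, the `Tok` typedef, the toy lexer and the `-1` encoding for empty cells are all assumptions made for the example. In each state it follows the jump for the next token if there is one; otherwise it applies the state's reduction (the `(EP)` column, or `R0`..`R6` for states 9 to 14).

```haxe
typedef Tok = { term : Int, val : Int };

class SlrTableSketch {
    // Column order of the printed table: $ int + - * / ( ) EXPR
    static inline var END = 0;
    static inline var INT = 1;
    static inline var PLUS = 2;
    static inline var EXPR = 8;
    static inline var NONE = -1;

    // Rows 0..8 of the printed table; -1 marks an empty cell.
    static var jump : Array<Array<Int>> = [
        [-1, 14, -1,  1, -1, -1,  2, -1,  8],
        [-1, 14, -1,  1, -1, -1,  2, -1, 10],
        [-1, 14, -1,  1, -1, -1,  2, -1,  3],
        [-1, -1,  4,  4,  6,  7, -1, 11, -1],
        [-1, 14, -1,  1, -1, -1,  2, -1,  5],
        [-1, -1, -1, -1,  6,  7, -1, -1, -1],
        [-1, 14, -1,  1, -1, -1,  2, -1, 13],
        [-1, 14, -1,  1, -1, -1,  2, -1, 12],
        [ 9, -1,  4,  4,  6,  7, -1, -1, -1]
    ];
    // Reduction of each state 0..14: the "(EP)" column, then R0..R6 for states 9..14.
    static var rule : Array<Int> = [-1, -1, -1, -1, -1, 1, -1, -1, -1, 0, 5, 4, 3, 2, 6];
    // Number of right-hand-side symbols of R0..R6.
    static var width : Array<Int> = [2, 3, 3, 3, 3, 2, 1];

    // Toy lexer: single-character operators, decimal integers, spaces skipped.
    static function lex(src : String) : Array<Tok> {
        var out : Array<Tok> = [];
        var i = 0;
        while (i < src.length) {
            var c = src.charAt(i);
            if (c == " ") { i++; continue; }
            var k = "+-*/()".indexOf(c); // position maps onto the columns + - * / ( )
            if (k >= 0) { out.push({ term : PLUS + k, val : 0 }); i++; continue; }
            var j = i;
            while (j < src.length && src.charCodeAt(j) >= "0".code && src.charCodeAt(j) <= "9".code) j++;
            if (j == i) throw "Unexpected: " + c;
            out.push({ term : INT, val : Std.parseInt(src.substring(i, j)) });
            i = j;
        }
        out.push({ term : END, val : 0 });
        return out;
    }

    public static function parse(src : String) : Int {
        var toks = lex(src);
        var pos = 0;
        var states = [0];
        var vals : Array<Tok> = [{ term : END, val : 0 }]; // one slot per state
        while (true) {
            var s = states[states.length - 1];
            var next = s < jump.length ? jump[s][toks[pos].term] : NONE;
            if (next != NONE) { // a jump exists: consume the token
                states.push(next);
                vals.push(toks[pos++]);
                continue;
            }
            var r = rule[s];
            if (r == NONE) throw "Unexpected token at index " + pos;
            var n = vals.length;
            var v = switch (r) {
                case 0: vals[n - 2].val;                          // MAIN --> EXPR $
                case 1: vals[n - 2].term == PLUS
                        ? vals[n - 3].val + vals[n - 1].val
                        : vals[n - 3].val - vals[n - 1].val;      // EXPR [+ -] EXPR
                case 2: vals[n - 3].val * vals[n - 1].val;
                case 3: Std.int(vals[n - 3].val / vals[n - 1].val);
                case 4: vals[n - 2].val;                          // ( EXPR )
                case 5: -vals[n - 1].val;                         // - EXPR
                default: vals[n - 1].val;                         // int
            };
            if (r == 0) return v;
            for (i in 0...width[r]) { states.pop(); vals.pop(); }
            states.push(jump[states[states.length - 1]][EXPR]);  // jump on the EXPR column
            vals.push({ term : EXPR, val : v });
        }
    }

    static function main() {
        trace(parse("1 - 2 * (3 + 4) + 5 * 6")); // evaluates to 17
    }
}
```

Note how state 5 carries `R1` in the `(EP)` column but still has jumps on `*` and `/`: after `EXPR + EXPR` the sketch keeps reading when a higher-priority operator follows and reduces otherwise.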

## Usage

Copied from [test/subs/Demo.hx](test/subs/Demo.hx)

```hx

@:analyzer(no_optimize)

class Demo {

    static function main() {

        var str = '1 - 2 * (3 + 4) + 5 * Unexpected 6';

        var lex = new Lexer(lms.ByteData.ofString(str));

        var par = new Parser(lex);

        trace(par.main() == (1 - 2 * (3 + 4) + 5 * 6));

    }

}

// The lm.SLR Parser only works with "enum abstract (Int) to Int"

enum abstract Token(Int) to Int {

    var Eof = 0;

    var CInt;

    var OpPlus;

    var OpMinus;

    var OpTimes;

    var OpDiv;

    var LParen;

    var RParen;

    var CIdent;

}

/**

* @:rule(EOF, cmax = 255). In the example below:

*   Eof is a custom terminator defined in the Token enum (required)

*   127 is the custom maximum char value (optional, default is 255)

*/

@:rule(Eof, 127) class Lexer implements lm.Lexer {

    var inter = "0|[1-9][0-9]*";         // 0 or ...

    var r_zero = "0";                    // string variable will be treated as pattern if there is no `@:skip`

    var r_int = "[1-9][0-9]*";

    var tok =  [                         // a rule set definition, the first one will become .token()

        "[ \t]+" => this.token(),

        r_zero | r_int => CInt,          // zero or int

     // inter + Opt("[eE][+-]?[0-9]+")   // exponent

        "+" => OpPlus,

        "-" => OpMinus,

        "*" => OpTimes,

        "/" => OpDiv,

        "(" => LParen,

        ")" => RParen,

        "[a-zA-Z_]+" => CIdent,

    ];

}

@:rule({

    start: [main],            // Start symbol, like "%start" in ocamlyacc. If omitted, the first "switch" is used

    left: ["+", "-"],         // String literals are mapped to their Tokens automatically

    left: [OpTimes, OpDiv],   // Entries listed lower have higher priority.

    nonassoc: [UMINUS],       // The placeholder must be uppercase

}) class Parser implements lm.SLR {

    var main = switch(s) {

        case [e = expr, Eof]:

            e;

        default:              // place error-handling code here

            var t = stream.peek(0);

            switch(t.term) {

            case Eof:

                return 0;

            case CIdent:                // Demonstrates error recovery. NOTE: this ability is very limited

                stream.junk(1);         // Discard current token

                slrloop( -1, MAIN_EXP); // main => MAIN_EXP, NOTE: Only the entry switch-case (Specified in "start") has an EXP value

            default:

                throw "Unexpected: " + stream.str(t);

            }

    }

    var expr : Int = switch(s) {        // Specify Type explicitly

        case [e1 = expr, op = [OpPlus,OpMinus], e2 = expr]: op == OpPlus ? e1 + e2 : e1 - e2;

        case [e1 = expr, OpTimes, e2 = expr]: e1 * e2;

        case [e1 = expr, OpDiv, e2 = expr]: Std.int(e1 / e2);

        case [LParen, e = expr, RParen]: e;

        case [@:prec(UMINUS) OpMinus, e = expr]: -e;   // %prec UMINUS

        case [CInt(n)]: n;

    }

    // Define custom extract function for CInt(n)

    @:rule(CInt) inline function int_of_string( s : String ) : Int return Std.parseInt(s);

    // OR @:rule(CInt) inline function int_of_string( input : lms.ByteData, t : lm.Stream.Tok ) : Int {

    //    return Std.parseInt( input.readString(t.pmin, t.pmax - t.pmin) );

    //}

}

```

compile:

```bash

haxe -dce full -D analyzer-optimize -lib lex -main Demo -js demo.js

```




Generated JS:

```js

// Generated by Haxe 4.3.0-rc.1

(function ($global) { "use strict";

var Demo = function() { };

Demo.main = function() {

    var str = "1 - 2 * (3 + 4) + 5 * 6";

    var lex = new Lexer(str);

    var par = new Parser(lex);

    console.log("Demo.hx:9:",par._entry(0,8) == 17);

};

var Lexer = function(s) {

    this.input = s;

    this.pmin = 0;

    this.pmax = 0;

};

Lexer.prototype = {

    getString: function(p,len) {

        return this.input.substr(p,len);

    }

    ,_token: function(init,right) {

        if(this.pmax >= right) {

            return 0;

        }

        var raw = Lexer.raw;

        var i = this.pmax;

        var state = init;

        var prev = init;

        var c;

        while(i < right) {

            c = this.input.charCodeAt(i++);

            state = raw.charCodeAt(128 * state + c);

            if(state >= 3) {

                break;

            }

            prev = state;

        }

        this.pmin = i;

        if(state == 255) {

            state = prev;

            --i;

        }

        var q = raw.charCodeAt(399 - state);

        if(i > this.pmax && q < 8) {

            this.pmin = this.pmax;

            this.pmax = i;

        } else {

            q = raw.charCodeAt(399 - init);

        }

        return this.cases(q);

    }

    ,token: function() {

        return this._token(0,this.input.length);

    }

    ,cases: function(s) {

        switch(s) {

        case 0:

            return this._token(0,this.input.length);

        case 1:

            return 1;

        case 2:

            return 2;

        case 3:

            return 3;

        case 4:

            return 4;

        case 5:

            return 5;

        case 6:

            return 6;

        case 7:

            return 7;

        default:

            throw new Error("UnMatached: '" + this.input.substr(this.pmax,this.pmin - this.pmax) + "'");

        }

    }

};

var Parser = function(lex) {

    this.stream = new lm_Stream(lex);

};

Parser.prototype = {

    _entry: function(state,exp) {

        var t = this.stream.newTok(0,0,0);

        t.state = state;

        var _this = this.stream;

        var i = _this.right;

        while(--i >= _this.pos) _this.cached[i + 1] = _this.cached[i];

        _this.cached[_this.pos] = t;

        ++_this.pos;

        ++_this.right;

        var raw = Parser.raw;

        while(true) {

            while(true) {

                t = this.stream.next();

                state = raw.charCodeAt(16 * state + t.term);

                if(state >= 9) {

                    break;

                }

                t.state = state;

            }

            if(state == 255) {

                this.stream.pos -= 1;

                var _this = this.stream;

                state = _this.cached[_this.pos + (-1)].state;

            }

            while(true) {

                var q = raw.charCodeAt(159 - state);

                var value = this.cases(q);

                if(q >= 7) {

                    return value;

                }

                t = this.stream.reduce(Parser.lvs[q]);

                if(t.term == exp) {

                    this.stream.pos -= 2;

                    this.stream.junk(2);

                    return value;

                }

                t.val = value;

                var _this1 = this.stream;

                state = raw.charCodeAt(16 * _this1.cached[_this1.pos + (-2)].state + t.term);

                t.state = state;

                if(state < 9) {

                    break;

                }

            }

        }

    }

    ,cases: function(q) {

        var __s = this.stream;

        switch(q) {

        case 0:

            return __s.cached[__s.pos + (-2)].val;

        case 1:

            var e1 = __s.cached[__s.pos + (-3)].val;

            var e2 = __s.cached[__s.pos + (-1)].val;

            if(__s.cached[__s.pos + (-2)].term == 2) {

                return e1 + e2;

            } else {

                return e1 - e2;

            }

            break;

        case 2:

            return __s.cached[__s.pos + (-3)].val * __s.cached[__s.pos + (-1)].val;

        case 3:

            return __s.cached[__s.pos + (-3)].val / __s.cached[__s.pos + (-1)].val | 0;

        case 4:

            return __s.cached[__s.pos + (-2)].val;

        case 5:

            return -__s.cached[__s.pos + (-1)].val;

        case 6:

            return Std.parseInt(__s.stri(-1));

        default:

            var t = this.stream.peek(0);

            throw new Error("Unexpected \"" + (t.term != 0 ? this.stream.lex.getString(t.pmin,t.pmax - t.pmin) : "Eof") + "\"");

        }

    }

};

var Std = function() { };

Std.parseInt = function(x) {

    var v = parseInt(x);

    if(isNaN(v)) {

        return null;

    }

    return v;

};

var lm_Tok = function(t,min,max) {

    this.term = t;

    this.pmin = min;

    this.pmax = max;

};

var lm_Stream = function(l) {

    this.lex = l;

    var this1 = new Array(128);

    this.cached = this1;

    this.right = 0;

    this.pos = 0;

};

lm_Stream.prototype = {

    reclaim: function(tok) {

        tok.nxt = this.h;

        this.h = tok;

    }

    ,newTok: function(term,min,max) {

        if(this.h == null) {

            return new lm_Tok(term,min,max);

        } else {

            var t = this.h;

            this.h = this.h.nxt;

            t.term = term;

            t.pmin = min;

            t.pmax = max;

            return t;

        }

    }

    ,peek: function(i) {

        while(this.right - this.pos <= i) {

            var t = this.lex.token();

            this.cached[this.right++] = this.newTok(t,this.lex.pmin,this.lex.pmax);

        }

        return this.cached[this.pos + i];

    }

    ,junk: function(n) {

        if(n <= 0) {

            return;

        }

        if(this.right - this.pos >= n) {

            var i = n;

            while(i-- > 0) this.reclaim(this.cached[this.pos + i]);

            i = this.pos;

            this.right -= n;

            while(i < this.right) {

                this.cached[i] = this.cached[i + n];

                ++i;

            }

        } else {

            n -= this.right - this.pos;

            while(n-- > 0) this.lex.token();

            while(this.right > this.pos) this.reclaim(this.cached[--this.right]);

        }

    }

    ,stri: function(dx) {

        var t = this.cached[this.pos + dx];

        return this.lex.getString(t.pmin,t.pmax - t.pmin);

    }

    ,next: function() {

        if(this.right == this.pos) {

            var t = this.lex.token();

            this.cached[this.right++] = this.newTok(t,this.lex.pmin,this.lex.pmax);

        }

        return this.cached[this.pos++];

    }

    ,reduce: function(lvw) {

        var w = lvw & 255;

        if(w == 0) {

            return this.reduceEP(lvw >>> 8);

        }

        var pmax = this.cached[this.pos + (-1)].pmax;

        --w;

        this.pos -= w;

        this.right -= w;

        var t = this.cached[this.pos + (-1)];

        t.term = lvw >>> 8;

        t.pmax = pmax;

        if(w == 0) {

            return t;

        }

        var i = w;

        while(i-- > 0) this.reclaim(this.cached[this.pos + i]);

        i = this.pos;

        while(i < this.right) {

            this.cached[i] = this.cached[i + w];

            ++i;

        }

        return t;

    }

    ,reduceEP: function(lv) {

        var prev = this.cached[this.pos - 1];

        var t = this.newTok(lv,prev.pmax,prev.pmax);

        var i = this.right;

        while(--i >= this.pos) this.cached[i + 1] = this.cached[i];

        this.cached[this.pos] = t;

        ++this.pos;

        ++this.right;

        return t;

    }

};

Lexer.raw = "ÿÿÿÿÿÿÿÿÿ\x01ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ\x01ÿÿÿÿÿÿÿ\t\x08\x07\x06ÿ\x05ÿ\x04\x03\x02\x02\x02\x02\x02\x02\x02\x02\x02ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ\x01ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ\x01ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ\x06\x07\x04\x02\x03\x05\x01\x01\x00ÿ";

Parser.raw = "ÿ\x0Eÿ\x01ÿÿ\x02ÿÿ\x08ÿÿÿÿÿÿÿ\x0Eÿ\x01ÿÿ\x02ÿÿ\nÿÿÿÿÿÿÿ\x0Eÿ\x01ÿÿ\x02ÿÿ\x03ÿÿÿÿÿÿÿÿ\x04\x04\x06\x07ÿ\x0Bÿÿÿÿÿÿÿÿÿ\x0Eÿ\x01ÿÿ\x02ÿÿ\x05ÿÿÿÿÿÿÿÿÿÿ\x06\x07ÿÿÿÿÿÿÿÿÿÿÿ\x0Eÿ\x01ÿÿ\x02ÿÿ\rÿÿÿÿÿÿÿ\x0Eÿ\x01ÿÿ\x02ÿÿ\x0Cÿÿÿÿÿÿ\tÿ\x04\x04\x06\x07ÿÿÿÿÿÿÿÿÿÿÿ\x06\x02\x03\x04\x05\x00ÿÿÿ\x01ÿÿÿÿÿ";

Parser.lvs = [2050,2307,2307,2307,2307,2306,2305];

Demo.main();

})({});

```